HTML Truncating in Python
12-11-2019
Question
Is there a pure-Python tool to take some HTML and truncate it as close to a given length as possible, but make sure the resulting snippet is well-formed? For example, given this HTML:
<h1>This is a header</h1>
<p>This is a paragraph</p>
it would not produce:
<h1>This is a hea
but:
<h1>This is a header</h1>
or at least:
<h1>This is a hea</h1>
I can't find one that works, though I found one that relies on pullparser, which is both obsolete and dead.
Solution
I don't think you need a full-fledged parser. You only need to tokenize the input string into one of:
- text
- open tag
- close tag
- self-closing tag
- character entity
Once you have a stream of tokens like that, it's easy to use a stack to keep track of what tags need closing. I actually ran into this problem a while ago and wrote a small library to do this:
https://github.com/eentzel/htmltruncate.py
It's worked well for me, and handles most of the corner cases well, including arbitrarily nested markup, counting character entities as a single character, returning an error on malformed markup, etc.
It will produce:
<h1>This is a hea</h1>
on your example. This could perhaps be changed, but it's hard in the general case: what if you're trying to truncate to 10 characters, but the <h1> tag isn't closed for another, say, 300 characters?
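The tokenize-and-stack idea described in this answer can be sketched roughly as follows. This is not the code from htmltruncate.py; the names truncate_html_chars, TOKEN_RE and VOID_TAGS are made up for illustration, the regex is a simplification that does not handle comments, CDATA, or a ">" inside an attribute value, and the list of void tags is an assumption.

```python
import re

# Tags that never take a close tag (assumed, non-exhaustive list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "area", "base", "col", "param"}

# One token per match: a tag, a character entity, or a single text character.
TOKEN_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>|&#?\w+;|.", re.S)


def truncate_html_chars(html, limit):
    """Keep at most `limit` visible characters, then close any open tags."""
    out = []
    open_tags = []  # stack of tag names still waiting for a close tag
    count = 0       # visible characters emitted so far
    for m in TOKEN_RE.finditer(html):
        closing, name, self_closing = m.groups()
        if closing:
            # Close tag: it must match the innermost open tag.
            if not open_tags or open_tags[-1] != name.lower():
                raise ValueError("unbalanced close tag: %s" % m.group(0))
            open_tags.pop()
        elif count >= limit:
            # Budget spent: only close tags are accepted from here on.
            break
        elif name is None:
            # Text character or entity: counts as one visible character.
            count += 1
        elif not self_closing and name.lower() not in VOID_TAGS:
            open_tags.append(name.lower())
        out.append(m.group(0))
    # Close whatever is still open, innermost first.
    out.extend("</%s>" % tag for tag in reversed(open_tags))
    return "".join(out)


print(truncate_html_chars("<h1>This is a header</h1>\n<p>This is a paragraph</p>", 13))
# prints: <h1>This is a hea</h1>
```

Because tags add nothing to the count, the limit applies to visible text only, and an entity such as &amp; counts as one character.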
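Other tips
If you're using Django, you can simply use its text utilities. Note that truncate_html_words comes from older Django releases; newer ones expose the same behavior through django.utils.text.Truncator, as in Truncator(s).words(num, html=True).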
from django.utils import text, html

class class_name():
    def trim_string(self, stringf, limit, offset=0):
        # Plain slice; not HTML-aware
        return stringf[offset:limit]

    def trim_html_words(self, html, limit, offset=0):
        # Word-based truncation that closes open tags (offset is unused)
        return text.truncate_html_words(html, limit)

    def remove_html(self, htmls, tag, limit='all', offset=0):
        # Strips every tag (tag, limit and offset are unused)
        return html.strip_tags(htmls)
Anyway, here's the code of truncate_html_words from Django:
import re

def truncate_html_words(s, num):
    """
    Truncates html to a certain number of words (not counting tags and comments).
    Closes opened tags if they were correctly closed in the given html.
    """
    length = int(num)
    if length <= 0:
        return ''
    html4_singlets = ('br', 'col', 'link', 'base', 'img', 'param', 'area', 'hr', 'input')
    # Set up regular expressions
    re_words = re.compile(r'&.*?;|<.*?>|([A-Za-z0-9][\w-]*)')
    re_tag = re.compile(r'<(/)?([^ ]+?)(?: (/)| .*?)?>')
    # Count non-HTML words and keep note of open tags
    pos = 0
    ellipsis_pos = 0
    words = 0
    open_tags = []
    while words <= length:
        m = re_words.search(s, pos)
        if not m:
            # Checked through whole string
            break
        pos = m.end(0)
        if m.group(1):
            # It's an actual non-HTML word
            words += 1
            if words == length:
                ellipsis_pos = pos
            continue
        # Check for tag
        tag = re_tag.match(m.group(0))
        if not tag or ellipsis_pos:
            # Don't worry about non tags or tags after our truncate point
            continue
        closing_tag, tagname, self_closing = tag.groups()
        tagname = tagname.lower()  # Element names are always case-insensitive
        if self_closing or tagname in html4_singlets:
            pass
        elif closing_tag:
            # Check for match in open tags list
            try:
                i = open_tags.index(tagname)
            except ValueError:
                pass
            else:
                # SGML: An end tag closes, back to the matching start tag, all unclosed intervening start tags with omitted end tags
                open_tags = open_tags[i+1:]
        else:
            # Add it to the start of the open tags list
            open_tags.insert(0, tagname)
    if words <= length:
        # Don't try to close tags if we don't need to truncate
        return s
    out = s[:ellipsis_pos] + ' ...'
    # Close any tags still open
    for tag in open_tags:
        out += '</%s>' % tag
    # Return string
    return out
I found the answer by slacy very helpful and would upvote it if I had the reputation. There is one extra thing to note, however. In my environment I had html5lib installed as well as BeautifulSoup4. BeautifulSoup picked the html5lib parser, which wrapped my HTML snippet in html and body tags. That is not what I wanted.
>>> truncate_html("<p>sdfsdaf</p>", 4)
u'<html><head></head><body><p>s</p></body></html>'
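To resolve this, I told BeautifulSoup to use Python's built-in parser:
from bs4 import BeautifulSoup

def truncate_html(html, length):
    # "html.parser" keeps the fragment as-is instead of wrapping it in <html>/<body>
    return unicode(BeautifulSoup(html[:length], "html.parser"))  # Python 2; use str() on Python 3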
>>> truncate_html("<p>sdfsdaf</p>", 4)
u'<p>s</p>'
You can do this in one line with BeautifulSoup (assuming you want to truncate at a certain number of source characters, not at a number of content characters):
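from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

def truncate_html(html, length):
    # Slice the source, then let the parser close whatever the cut left open
    return unicode(BeautifulSoup(html[:length]))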
This will serve your requirement: an easy-to-use HTML parser and bad markup corrector.
My initial thought would be to use an XML parser (maybe Python's SAX parser) and count the text characters in each XML element. I would leave the tag characters out of the count to make it more consistent as well as simpler, but either should be possible.
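A minimal sketch of that SAX idea, using the standard library's xml.sax. The names _TextLimiter and truncate_xhtml are hypothetical, and the sketch assumes the input is well-formed XML with a single root element: real-world HTML, or HTML-only entities such as &nbsp;, would make the parser raise. Empty elements such as <br/> are written back as a start and end tag pair.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class _TextLimiter(xml.sax.ContentHandler):
    """Re-emits the markup while counting only text characters."""

    def __init__(self, limit):
        super().__init__()
        self.remaining = max(limit, 0)  # text characters we may still emit
        self.parts = []                 # output fragments
        self.skip_depth = 0             # nesting depth of elements opened after the cut

    def startElement(self, name, attrs):
        if self.remaining <= 0:
            self.skip_depth += 1        # element starts after the cut: drop it
            return
        attr_text = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in attrs.items())
        self.parts.append("<%s%s>" % (name, attr_text))

    def endElement(self, name):
        if self.skip_depth:
            self.skip_depth -= 1        # closes an element we dropped
            return
        self.parts.append("</%s>" % name)

    def characters(self, content):
        if self.remaining <= 0:
            return
        kept = content[: self.remaining]
        self.remaining -= len(kept)
        self.parts.append(escape(kept))


def truncate_xhtml(markup, limit):
    handler = _TextLimiter(limit)
    xml.sax.parseString(markup.encode("utf-8"), handler)
    return "".join(handler.parts)


print(truncate_xhtml("<div><h1>This is a header</h1><p>This is a paragraph</p></div>", 13))
# prints: <div><h1>This is a hea</h1></div>
```

Elements that begin after the limit is reached are dropped entirely, while elements that were already open still receive their end tags, so the output stays well-formed.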
I'd recommend completely parsing the HTML first and then truncating. A great HTML parser for Python is lxml. After parsing and truncating, you can serialize the tree back to HTML.
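The parse-then-truncate approach might look like the sketch below. To keep it dependency-free it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, so it assumes well-formed input with a single root element; with lxml you would parse with its HTML parser instead. The function names _trim and truncate_tree are hypothetical.

```python
import xml.etree.ElementTree as ET


def _trim(elem, remaining):
    """Trim `elem` in place; return the character budget left afterwards."""
    elem.text = (elem.text or "")[:remaining]
    remaining -= len(elem.text)
    for child in list(elem):
        if remaining <= 0:
            # Everything after the cut is dropped, including the child's tail.
            elem.remove(child)
            continue
        remaining = _trim(child, remaining)
        # Text that follows the child's end tag belongs to child.tail.
        child.tail = (child.tail or "")[:remaining]
        remaining -= len(child.tail)
    return remaining


def truncate_tree(markup, limit):
    root = ET.fromstring(markup)
    _trim(root, max(limit, 0))
    # method="html" writes <p></p> rather than <p /> for emptied elements.
    return ET.tostring(root, encoding="unicode", method="html")


print(truncate_tree("<div><h1>This is a header</h1><p>This is a paragraph</p></div>", 13))
# prints: <div><h1>This is a hea</h1></div>
```

Working on the parsed tree means the result is well-formed by construction: the serializer writes every end tag, and the truncation only shortens text and removes whole subtrees.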
Look at HTML Tidy to clean up, reformat, and reindent HTML.