
I'm trying to convert a chunk of HTML to plain text with BeautifulSoup. Here is an example:

<div>
    <p>
        Some text
        <span>more text</span>
        even more text
    </p>
    <ul>
        <li>list item</li>
        <li>yet another list item</li>
    </ul>
</div>
<p>Some other text</p>
<ul>
    <li>list item</li>
    <li>yet another list item</li>
</ul>

I tried doing something like:

import re
import BeautifulSoup

def parse_text(contents_string):
    newlines = re.compile(r'[\r\n]\s+')
    bs = BeautifulSoup.BeautifulSoup(contents_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    txt = bs.getText('\n')
    return newlines.sub('\n', txt)

...but that way my span element always ends up on a new line. This is of course a simple example. Is there a way to get the text of an HTML page the way it would be rendered in a browser (no CSS rules required, just the regular way div, span, li, etc. elements are rendered) in Python?

  • Can you show us what the expected output looks like? You want to strip all the indenting whitespace and newlines, right? Commented Dec 29, 2018 at 7:54

2 Answers


BeautifulSoup is a scraping library, so it's probably not the best choice for rendering HTML as text. If it's not essential to use BeautifulSoup, you should take a look at html2text. For example:

import html2text

html = open("foobar.html").read()
print(html2text.html2text(html))

This outputs:

Some text more text even more text

  * list item
  * yet another list item

Some other text

  * list item
  * yet another list item

5 Comments

Can I use html2text in conjunction with BeautifulSoup? For example, could I parse the chunk of HTML I'm interested in and then feed it to html2text using prettify()?
Yes, html2text can process HTML in chunks by calling HTML2Text.feed(chunk) on each successive chunk, and then calling HTML2Text.close() to get the text result (similar to HTMLParser.feed()); see the sketch after these comments.
This answer made me happy and sad at the same time. RIP Aaron Swartz.
Remember to check whether html2text complies with your licensing policy as it is distributed under GPLv3.
html2text converts the HTML to a Markdown string, so the library may not meet everyone's needs; some people (like me) don't want Markdown markup to appear in the result.
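
A minimal sketch of the combination described in the comments above, assuming the bs4 package, the same foobar.html file as in the answer, and that the <div> chunk is the part of the page you care about; the ignore_links and ignore_images options are just one way to cut down on the Markdown markup mentioned in the last comment:

from bs4 import BeautifulSoup
import html2text

# Parse the page and pick out the chunk of interest (a <div> here, as an assumption).
with open("foobar.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")
chunk = soup.find("div")

# Feed the prettified chunk to html2text, suppressing link and image markup.
h = html2text.HTML2Text()
h.ignore_links = True
h.ignore_images = True
print(h.handle(chunk.prettify()))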

I was encountering the same problem trying to get the rendered text out of HTML. Basically, it seems that BeautifulSoup is not the ideal package for this. @del gives the great html2text solution above.

On a different SO question, BeautifulSoup get_text does not strip all tags and JavaScript, @Helge mentioned using nltk. Unfortunately, nltk appears to have discontinued this method.

I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speeds depend heavily on the contents of the data...

Answer from @Helge (nltk):

import nltk

%timeit nltk.clean_html(html)

This returned 153 µs per loop. It worked really well at returning a string of the rendered HTML. This nltk function was even faster than html2text, though perhaps html2text is more robust.

Answer above from @del (html2text):

betterHTML = html.decode(errors='ignore')  # decode raw bytes, ignoring malformed characters
%timeit html2text.html2text(betterHTML)

This returned 3.09 ms per loop.
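
For readers not working in IPython, here is a rough equivalent of the %timeit call above using only the standard-library timeit module; the foobar.html file name and the 100-run count are assumptions, not part of the original answer:

import timeit

import html2text

html = open("foobar.html", encoding="utf-8", errors="ignore").read()

# Time 100 calls and report the average per call in milliseconds.
seconds = timeit.timeit(lambda: html2text.html2text(html), number=100)
print(f"{seconds / 100 * 1000:.2f} ms per call")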

3 Comments

nltk.clean_html gives NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function
Even if you happen to have an old version of nltk, don't use this function. It's fast because it processes html with regexes: github.com/nltk/nltk/blob/…
I added an answer on a related question which gives a way to strip JavaScript via BeautifulSoup: stackoverflow.com/a/47782943/2112722
