
I'm trying to convert a chunk of HTML to plain text with BeautifulSoup. Here is an example:

<div>
    <p>
        Some text
        <span>more text</span>
        even more text
    </p>
    <ul>
        <li>list item</li>
        <li>yet another list item</li>
    </ul>
</div>
<p>Some other text</p>
<ul>
    <li>list item</li>
    <li>yet another list item</li>
</ul>

I tried doing something like:

import re
import BeautifulSoup

def parse_text(contents_string):
    newlines = re.compile(r'[\r\n]\s+')
    bs = BeautifulSoup.BeautifulSoup(contents_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    txt = bs.getText('\n')
    return newlines.sub('\n', txt)

...but that way my span element always ends up on a new line. This is of course a simple example. Is there a way to get the text of an HTML page the way it would be rendered in a browser (no CSS rules required, just the regular way div, span, li, etc. elements are rendered) in Python?

  • Can you show us what the expected output looks like? You want to strip all the indenting whitespace and newlines, right? Commented Dec 29, 2018 at 7:54

2 Answers


BeautifulSoup is a scraping library, so it's probably not the best choice for rendering HTML as text. If it's not essential to use BeautifulSoup, you should take a look at html2text. For example:

import html2text

html = open("foobar.html").read()
print(html2text.html2text(html))

This outputs:

Some text more text even more text

  * list item
  * yet another list item

Some other text

  * list item
  * yet another list item

5 Comments

Can I use html2text in conjunction with BeautifulSoup? For example, could I parse the chunk of HTML I'm interested in and then feed it to html2text using prettify()?
Yes, html2text can process HTML in chunks by calling HTML2Text.feed(chunk) on each successive chunk, and then calling HTML2Text.close() to get the text result (similar to HTMLParser.feed()); see the sketch after these comments.
This answer made me happy and sad at the same time. RIP Aaron Swartz.
Remember to check whether html2text complies with your licensing policy as it is distributed under GPLv3.
html2text converts the HTML to a Markdown string, so the library may not meet everyone's needs; some people (like me) don't want Markdown markup to appear in the result.
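
A minimal sketch of the combination described in the comments above, assuming the bs4 package, the same foobar.html file as in the answer, and that the <div> chunk is the part of the page you care about; the ignore_links and ignore_images options are just one way to cut down on the Markdown markup mentioned in the last comment:

from bs4 import BeautifulSoup
import html2text

# Parse the page and pick out the chunk of interest (a <div> here, as an assumption).
with open("foobar.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")
chunk = soup.find("div")

# Feed the prettified chunk to html2text, suppressing link and image markup.
h = html2text.HTML2Text()
h.ignore_links = True
h.ignore_images = True
print(h.handle(chunk.prettify()))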

I was encountering the same problem trying to get the rendered text out of HTML. Basically, it seems that BeautifulSoup is not the ideal package for this. @del gives the great html2text solution above.

On a different SO question, BeautifulSoup get_text does not strip all tags and JavaScript, @Helge mentioned using nltk. Unfortunately, nltk appears to have discontinued this method.

I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speeds depend heavily on the contents of the data...

Answer from @Helge (nltk):

import nltk

%timeit nltk.clean_html(html)

This returned 153 µs per loop. It worked really well at returning a string of the rendered HTML. This nltk function was even faster than html2text, though perhaps html2text is more robust.

Answer above from @del (html2text):

betterHTML = html.decode(errors='ignore')  # decode raw bytes, ignoring malformed characters
%timeit html2text.html2text(betterHTML)

This returned 3.09 ms per loop.
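
For readers not working in IPython, here is a rough equivalent of the %timeit call above using only the standard-library timeit module; the foobar.html file name and the 100-run count are assumptions, not part of the original answer:

import timeit

import html2text

html = open("foobar.html", encoding="utf-8", errors="ignore").read()

# Time 100 calls and report the average per call in milliseconds.
seconds = timeit.timeit(lambda: html2text.html2text(html), number=100)
print(f"{seconds / 100 * 1000:.2f} ms per call")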

3 Comments

nltk.clean_html gives NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function
Even if you happen to have an old version of nltk, don't use this function. It's fast because it processes html with regexes: github.com/nltk/nltk/blob/…
I added an answer on a related question which gives a way to strip JavaScript via BeautifulSoup: stackoverflow.com/a/47782943/2112722
