Parsing HTML documents using lxml in Python

Question

I just downloaded lxml to parse broken HTML documents. I read through the lxml documentation but could not find out how to retrieve just the text of a given HTML document. I would be grateful if someone could help me with this.

lxml might be a bit low level for this; have you considered BeautifulSoup?
– Preet Kukreti, Commented Aug 22, 2012 at 13:07
I have tried BeautifulSoup, but it does not handle broken HTML as well as lxml does. Please let me know the syntax.
– Programmer, Commented Aug 22, 2012 at 13:07
By "retrieve the text in the document", do you mean retrieving the text inside a particular element?
– naiquevin, Commented Aug 22, 2012 at 13:09

Steve Mayne · Accepted Answer · 2012-08-23 08:24:38Z

It's very simple:

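from lxml import html

html_document = ...  # Get your document contents here, from a file or wherever

# Parse the (possibly broken) HTML into an element tree
tree = html.fromstring(html_document)
# Concatenate the text of the root element and all of its descendants
text_document = tree.text_content()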
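
As a self-contained illustration of the snippet above, here is a minimal sketch that parses a small, deliberately broken document. The `broken_html` string is a made-up sample (its `<p>` and `<b>` tags are never closed), not something taken from the question:

```python
from lxml import html

# Hypothetical sample input: the <p> and <b> tags are never closed.
broken_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello <b>world</body></html>"
)

# lxml tolerates the missing closing tags while building the tree.
tree = html.fromstring(broken_html)

# text_content() joins the text of the element and all its descendants,
# so the title text is included along with the body text.
text_document = tree.text_content()
print(text_document)
```

Note that calling `text_content()` on the root element picks up the text of the `<title>` element too, which is one reason to narrow the selection to a specific element.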

If you only want the content of specific elements (e.g. the body element), you can select them with XPath expressions:

body_tags = tree.xpath('//body')
if body_tags:
    # xpath() returns a list; take the first match
    body = body_tags[0]
    text_document = body.text_content()
else:
    # No <body> element was found
    text_document = ''
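
The same pattern works for any element, so it can be wrapped in a small helper. The sketch below is only an illustration: `text_of_first` is a hypothetical helper name and `sample` is made-up input, neither of which is part of lxml. It reads the title text and the body text separately:

```python
from lxml import html


def text_of_first(tree, xpath_expr):
    """Return the text of the first element matching xpath_expr, or ''."""
    matches = tree.xpath(xpath_expr)
    return matches[0].text_content() if matches else ""


# Hypothetical sample input with unclosed <p> tags.
sample = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>First<p>Second</body></html>"
)
tree = html.fromstring(sample)

title_text = text_of_first(tree, "//title")
body_text = text_of_first(tree, "//body")
missing_text = text_of_first(tree, "//table")  # no match, so ''

print(title_text)
print(body_text)
```

Returning an empty string when nothing matches mirrors the `else` branch above, so callers do not have to check for an empty result list themselves.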

edited Aug 23, 2012 at 8:24

answered Aug 22, 2012 at 13:12


1 Comment

Programmer Over a year ago

Is there a way to read only the text in the title and the text in the body using lxml?
