I just downloaded lxml to parse broken HTML documents. I read through the lxml documentation but could not find out how to retrieve just the text of a given HTML document. I would be grateful if someone could help me with this.

  • lxml might be a bit low level for this; have you considered BeautifulSoup? Commented Aug 22, 2012 at 13:07
  • I have tried BeautifulSoup, but it does not handle broken HTML as well as lxml does. Please let me know the syntax. Commented Aug 22, 2012 at 13:07
  • By "retrieve the text in the document", do you mean retrieving the text inside a particular element? Commented Aug 22, 2012 at 13:09

1 Answer

It's very simple:

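from lxml import html

html_document = ...  # Get your document contents here, e.g. from a file

# Parse the (possibly broken) HTML into an element tree
tree = html.fromstring(html_document)
# Concatenate the text of the root element and all of its descendants
text_document = tree.text_content()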

If you only want the content of specific elements (e.g. the body element), you can select them with XPath expressions:

# xpath() returns a list of matching elements (empty if there is no match)
body_tags = tree.xpath('//body')
if body_tags:
    body = body_tags[0]
    text_document = body.text_content()
else:
    text_document = ''
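
For anyone who wants to try this end to end, here is a minimal sketch that runs both approaches above on a deliberately malformed string. The `broken` sample and the variable names are made up for illustration; only `html.fromstring`, `xpath` and `text_content` come from lxml. It assumes lxml is installed.

```python
from lxml import html

# Hypothetical sample: an unclosed <b> and missing </p>, </body>, </html> tags
broken = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello <b>world</p><p>Second paragraph"
)

# lxml repairs the markup while parsing and returns the root <html> element
tree = html.fromstring(broken)

# Text of the whole document, which includes the <title> text
whole_text = tree.text_content()

# Text of the <body> element only, falling back to an empty string
body_tags = tree.xpath('//body')
body_text = body_tags[0].text_content() if body_tags else ''

print(whole_text)
print(body_text)
```

Keep in mind that `text_content()` concatenates all descendant text, so it also returns the contents of `<script>` and `<style>` elements if the page has any.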
1 Comment

Is there a way to read only the text in the title and the text in the body using lxml?
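
One possible way to do that, as a sketch rather than a definitive answer: look up the `<title>` element with `findtext` and the `<body>` element with an XPath query, then return both strings. The helper name `title_and_body_text` is hypothetical and not part of lxml.

```python
from lxml import html


def title_and_body_text(html_document):
    """Return (title text, body text) for an HTML string; '' when missing."""
    tree = html.fromstring(html_document)

    # findtext() returns None when no <title> element exists
    title = tree.findtext('.//title') or ''

    # xpath() returns a list, so guard against documents without a <body>
    body_tags = tree.xpath('//body')
    body = body_tags[0].text_content() if body_tags else ''

    return title.strip(), body.strip()


# Hypothetical sample document with unclosed <p> tags
sample = (
    "<html><head><title> My page </title></head>"
    "<body><p>First<p>Second</body></html>"
)
title, body = title_and_body_text(sample)
print(title)
print(body)
```

Keeping the two values separate lets you decide later how to join them, for example with a newline between the title and the body text.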
