I just downloaded lxml to parse broken HTML documents. I read through the lxml documentation but could not find out how, given an HTML document, to retrieve just the text in it. I would be grateful if someone could help me with this.
1 Answer
It's very simple:
from lxml import html
html_document = ...  # Get your document contents here, e.g. from a file
tree = html.fromstring(html_document)
text_document = tree.text_content()
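As a rough, self-contained illustration of the snippet above, here is a sketch that runs it on a small piece of broken markup. The sample string and variable names are made up for the example and are not from the question. Note that text_content() returns all text under the element it is called on, so when called on the root it also includes the title text (and the contents of any script or style elements).

```python
from lxml import html

# Hypothetical sample input: unclosed <p> and <b> tags, and no closing
# </body> or </html>. lxml repairs the markup while parsing.
broken = "<html><head><title>Demo</title></head><body><p>Hello <b>world<p>Second paragraph"

tree = html.fromstring(broken)

# text_content() concatenates every text node under the element, with no markup.
text = tree.text_content()
print(text)
```

If the title or script text is unwanted, it is probably easier to call text_content() on a narrower element than on the root.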
If you only want the content of specific blocks (e.g. the body block), you can select them with XPath expressions:
body_tags = tree.xpath('//body')
if body_tags:
    body = body_tags[0]
    text_document = body.text_content()
else:
    text_document = ''
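The same pattern can be wrapped in a small helper so it works for any block. This is only a sketch: block_text is a hypothetical name, not part of lxml, and the sample document is invented for the example.

```python
from lxml import html

def block_text(document, xpath_expr):
    """Return the text of the first element matching xpath_expr, or '' if none.

    block_text is a hypothetical helper for illustration, not an lxml function.
    """
    tree = html.fromstring(document)
    matches = tree.xpath(xpath_expr)
    return matches[0].text_content() if matches else ""

# Hypothetical sample document.
sample = "<html><head><title>Demo</title></head><body><p>Hello <b>world</b></p></body></html>"

body_text = block_text(sample, "//body")    # text inside <body> only
title_text = block_text(sample, "//title")  # text inside <title> only
missing = block_text(sample, "//table")     # no match, so an empty string
```

Parsing once and reusing the tree would be cheaper when several blocks are needed from the same document; the helper re-parses each time only to keep the example short.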
1 Comment
Programmer
Is there a way to read only the text in the title and the text in the body using lxml or BeautifulSoup?