
I'm looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects.

If I have a document of the form:

<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>

then it should give me a way to access the nested tags via the name or id of the HTML tag so that I can basically ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

If you've used Firefox's "Inspect Element" feature (view HTML), you know that it gives you all the tags in a nicely nested manner, like a tree.

I'd prefer a built-in module but that might be asking a little too much.
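
For reference, here is roughly what this task looks like with the built-in html.parser module. It is event-driven rather than tree-based, so you have to track nesting yourself; the class and variable names below are just an illustration:

from html.parser import HTMLParser

class ContainerText(HTMLParser):
    # Collects the text inside <div class='container'>, including nested divs.
    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside the container div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth and tag == 'div':
            self.depth += 1      # a div nested inside the container
        elif not self.depth and tag == 'div' and dict(attrs).get('class') == 'container':
            self.depth = 1       # entered the container

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

parser = ContainerText()
parser.feed("""<body><div class='container'>
    <div id='class'>Something here</div>
    <div>Something else</div>
</div></body>""")
print(parser.chunks)   # ['Something here', 'Something else']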


I went through a lot of questions on Stack Overflow and a few blogs on the internet, and most of them suggest BeautifulSoup, lxml, or HTMLParser, but few of them detail the functionality; they simply end as a debate over which one is faster/more efficient.

1 Comment

Like all the other answerers, I would recommend BeautifulSoup because it is really good at handling broken HTML files.

7 Answers


So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

try:
    from bs4 import BeautifulSoup            # BeautifulSoup 4
except ImportError:
    from BeautifulSoup import BeautifulSoup  # legacy BeautifulSoup 3

html = """<html><head>Heading</head><body attr1='val1'>
<div class='container'><div id='class'>Something here</div>
<div>Something else</div></div></body></html>"""

parsed_html = BeautifulSoup(html, 'html.parser')
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)

You don't need performance comparisons, I guess; just read how BeautifulSoup works and look at its official documentation.


Comments

What exactly is the parsed_html object?
parsed_html is a BeautifulSoup object. Think of it like a DOMElement or DOMDocument, except it has "tricky" properties: for example, "body" refers to the BeautifulSoup object (remember, it's basically a tree node) of the first (and in this case, only) body element of the root element (in our case, html). See the sketch after these comments.
General info: if performance is critical, better to use the lxml library instead (see the answer below). With cssselect it's quite useful as well, and performance is often 10- to 100-fold better than the other libraries available.
parsed_html = BeautifulSoup(html) doesn't work for me, parsed_html = BeautifulSoup(html, 'html.parser') does
@Nathan To be fair, major version update means major incompatible change, so it's likely that the code would break in one way or the other anyway. Better to break early than late.
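
A small sketch of that tree navigation (attribute access and find are standard bs4 API; the markup is trimmed from the question's document):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><div class='container'>"
                     "<div id='class'>Something here</div></div></body>", 'html.parser')
print(soup.body.div['class'])             # ['container'] - first div under body
print(soup.find('div', id='class').text)  # 'Something here' - lookup by id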

I guess what you're looking for is pyquery:

pyquery: a jquery-like library for python.

An example of what you want might look like this:

from pyquery import PyQuery

html = "..."  # your HTML code from above
pq = PyQuery(html)
tag = pq('div#id')  # or tag = pq('div.class')
print(tag.text())

And it uses the same selectors as Firefox's or Chrome's Inspect Element. For example, if the inspected element's selector is 'div#mw-head.noprint', then in pyquery you just need to pass that selector:

pq('div#mw-head.noprint')
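
Applied to the HTML from the question, that would be (the selector string is mine; .text() collapses the nested text):

from pyquery import PyQuery

doc = PyQuery("""<body><div class='container'>
    <div id='class'>Something here</div>
    <div>Something else</div>
</div></body>""")
print(doc('body div.container').text())   # 'Something here Something else'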

2 Comments

Quite useful for someone coming from a jQuery frontend!
Remark. This library uses lxml under the hood.

Here you can read more about different HTML parsers in Python and their performance. Even though the article is a bit dated, it still gives you a good overview.

Python HTML parser performance

I'd recommend BeautifulSoup even though it isn't built in, simply because it's so easy to work with for these kinds of tasks. For example:

# Python 3 (urllib2 from the original answer was merged into urllib.request)
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('http://www.google.com/')
soup = BeautifulSoup(page, 'html.parser')

x = soup.body.find('div', attrs={'class': 'container'}).text

4 Comments

I was looking for something that details features/functionality rather than performance/efficiency. EDIT: Sorry for the premature answer; that link is actually good. Thanks.
The first bullet list kind of summarizes the features and functions :)
If you use BeautifulSoup4 (latest version): from bs4 import BeautifulSoup
The parser perf article has moved (it's from 2008, though, so it might be out of date) to: ianbicking.org/blog/2008/03/python-html-parser-performance.html

Compared to the other parser libraries, lxml is extremely fast.

And with cssselect it’s quite easy to use for scraping HTML pages too:

from lxml.html import parse

doc = parse('http://www.google.com').getroot()
for link in doc.cssselect('a'):
    print('%s: %s' % (link.text_content(), link.get('href')))

lxml.html Documentation

4 Comments

HTTPS not supported
@Sergio: use requests (or urllib) to fetch the page, save the buffer to a file (stackoverflow.com/a/14114741/1518921), then load the saved file with parse: doc = parse('localfile.html').getroot(). Or parse the response directly; see the sketch after these comments.
I parsed huge HTML files for specific data. Doing it with BeautifulSoup took 1.7 seconds, but switching to lxml made it nearly 100 times faster! If you care about performance, lxml is the best option.
On the other hand, lxml carries a 12 MB C extension. Mostly insignificant, but it might matter depending on what you do (in rare cases).
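
A minimal sketch of the requests-based workaround mentioned above (fromstring parses bytes/str directly, so no temporary file is needed):

import requests
from lxml.html import fromstring

resp = requests.get('https://www.google.com/')  # requests handles HTTPS fine
doc = fromstring(resp.content)
for link in doc.cssselect('a'):
    print('%s: %s' % (link.text_content(), link.get('href')))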

I recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).

In my experience, Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser, but rather a very good string analyzer.

2 Comments

AIUI, Beautiful Soup can be made to work with most "backend" XML/HTML parsers; lxml seems to be one of the supported parsers: crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser (see the sketch after these comments)
@ffledgling Some functions of BeautifulSoup are quite sluggish, however.
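
For reference, a sketch of that pairing (BeautifulSoup 4 with lxml as the backend parser; assumes lxml is installed):

from bs4 import BeautifulSoup

html = "<div class='container'>Something here</div>"
soup = BeautifulSoup(html, 'lxml')  # lxml does the parsing, bs4 provides the API
print(soup.find('div', attrs={'class': 'container'}).text)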

I recommend using the jusText library:

https://github.com/miso-belica/jusText

Usage (Python 3):

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print(paragraph.text)



I would use EHP:

https://github.com/iogf/ehp

Here it is:

from ehp import *

doc = '''<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>
'''

html = Html()
dom = html.feed(doc)
for ind in dom.find('div', ('class', 'container')):
    print(ind.text())

Output:

Something here
Something else

1 Comment

Please explain: why would you use EHP over the popular BeautifulSoup or lxml?
