
I'm looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects.

If I have a document of the form:

<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>

then it should give me a way to access the nested tags via the name or id of the HTML tag so that I can basically ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

If you've used Firefox's "Inspect Element" feature (view HTML), you know that it gives you all the tags in a nicely nested manner, like a tree.

I'd prefer a built-in module but that might be asking a little too much.
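
For reference, here is roughly what this task looks like with the built-in html.parser module. It is event-driven rather than tree-based, so you have to track nesting yourself; the class and variable names below are just an illustration:

from html.parser import HTMLParser

class ContainerText(HTMLParser):
    # Collects the text inside <div class='container'>, including nested divs.
    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside the container div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth and tag == 'div':
            self.depth += 1      # a div nested inside the container
        elif not self.depth and tag == 'div' and dict(attrs).get('class') == 'container':
            self.depth = 1       # entered the container

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

parser = ContainerText()
parser.feed("""<body><div class='container'>
    <div id='class'>Something here</div>
    <div>Something else</div>
</div></body>""")
print(parser.chunks)   # ['Something here', 'Something else']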


I went through a lot of questions on Stack Overflow and a few blogs on the internet, and most of them suggest BeautifulSoup, lxml, or HTMLParser, but few of them detail the functionality; they simply end as a debate over which one is faster/more efficient.

1 Comment

Like all the other answerers, I would recommend BeautifulSoup because it is really good at handling broken HTML files.

7 Answers


So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

try:
    from bs4 import BeautifulSoup            # BeautifulSoup 4
except ImportError:
    from BeautifulSoup import BeautifulSoup  # legacy BeautifulSoup 3

html = """<html><head>Heading</head><body attr1='val1'>
<div class='container'><div id='class'>Something here</div>
<div>Something else</div></div></body></html>"""

parsed_html = BeautifulSoup(html, 'html.parser')
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)

You don't need performance comparisons, I guess; just read how BeautifulSoup works and look at its official documentation.


Comments

What exactly is the parsed_html object?
parsed_html is a BeautifulSoup object. Think of it like a DOMElement or DOMDocument, except it has "tricky" properties: for example, "body" refers to the BeautifulSoup object (remember, it's basically a tree node) of the first (and in this case, only) body element of the root element (in our case, html). See the sketch after these comments.
General info: if performance is critical, better to use the lxml library instead (see the answer below). With cssselect it's quite useful as well, and performance is often 10- to 100-fold better than the other libraries available.
parsed_html = BeautifulSoup(html) doesn't work for me, parsed_html = BeautifulSoup(html, 'html.parser') does
@Nathan To be fair, major version update means major incompatible change, so it's likely that the code would break in one way or the other anyway. Better to break early than late.
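
A small sketch of that tree navigation (attribute access and find are standard bs4 API; the markup is trimmed from the question's document):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><div class='container'>"
                     "<div id='class'>Something here</div></div></body>", 'html.parser')
print(soup.body.div['class'])             # ['container'] - first div under body
print(soup.find('div', id='class').text)  # 'Something here' - lookup by id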

I guess what you're looking for is pyquery:

pyquery: a jquery-like library for python.

An example of what you want might look like this:

from pyquery import PyQuery

html = "..."  # your HTML code from above
pq = PyQuery(html)
tag = pq('div#id')  # or tag = pq('div.class')
print(tag.text())

And it uses the same selectors as Firefox's or Chrome's Inspect Element. For example, if the inspected element's selector is 'div#mw-head.noprint', then in pyquery you just need to pass that selector:

pq('div#mw-head.noprint')
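
Applied to the HTML from the question, that would be (the selector string is mine; .text() collapses the nested text):

from pyquery import PyQuery

doc = PyQuery("""<body><div class='container'>
    <div id='class'>Something here</div>
    <div>Something else</div>
</div></body>""")
print(doc('body div.container').text())   # 'Something here Something else'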

2 Comments

Quite useful for someone coming from a jQuery frontend!
Remark. This library uses lxml under the hood.

Here you can read more about different HTML parsers in Python and their performance. Even though the article is a bit dated, it still gives you a good overview.

Python HTML parser performance

I'd recommend BeautifulSoup even though it isn't built in, simply because it's so easy to work with for these kinds of tasks. For example:

# Python 3 (urllib2 from the original answer was merged into urllib.request)
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('http://www.google.com/')
soup = BeautifulSoup(page, 'html.parser')

x = soup.body.find('div', attrs={'class': 'container'}).text

4 Comments

I was looking for something that details features/functionality rather than performance/efficiency. EDIT: Sorry for the premature answer; that link is actually good. Thanks.
The first bullet list kind of summarizes the features and functions :)
If you use BeautifulSoup4 (latest version): from bs4 import BeautifulSoup
The parser perf article has moved (it's from 2008, though, so it might be out of date) to: ianbicking.org/blog/2008/03/python-html-parser-performance.html

Compared to the other parser libraries, lxml is extremely fast.

And with cssselect it’s quite easy to use for scraping HTML pages too:

from lxml.html import parse

doc = parse('http://www.google.com').getroot()
for link in doc.cssselect('a'):
    print('%s: %s' % (link.text_content(), link.get('href')))

lxml.html Documentation

4 Comments

HTTPS not supported
@Sergio: use requests (or urllib) to fetch the page, save the buffer to a file (stackoverflow.com/a/14114741/1518921), then load the saved file with parse: doc = parse('localfile.html').getroot(). Or parse the response directly; see the sketch after these comments.
I parsed huge HTML files for specific data. Doing it with BeautifulSoup took 1.7 seconds, but switching to lxml made it nearly 100 times faster! If you care about performance, lxml is the best option.
On the other hand, lxml carries a 12 MB C extension. Mostly insignificant, but it might matter depending on what you do (in rare cases).
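
A minimal sketch of the requests-based workaround mentioned above (fromstring parses bytes/str directly, so no temporary file is needed):

import requests
from lxml.html import fromstring

resp = requests.get('https://www.google.com/')  # requests handles HTTPS fine
doc = fromstring(resp.content)
for link in doc.cssselect('a'):
    print('%s: %s' % (link.text_content(), link.get('href')))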

I recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).

In my experience, Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser, but rather a very good string analyzer.

2 Comments

AIUI, Beautiful Soup can be made to work with most "backend" XML/HTML parsers; lxml seems to be one of the supported parsers: crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser (see the sketch after these comments)
@ffledgling Some functions of BeautifulSoup are quite sluggish, however.
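
For reference, a sketch of that pairing (BeautifulSoup 4 with lxml as the backend parser; assumes lxml is installed):

from bs4 import BeautifulSoup

html = "<div class='container'>Something here</div>"
soup = BeautifulSoup(html, 'lxml')  # lxml does the parsing, bs4 provides the API
print(soup.find('div', attrs={'class': 'container'}).text)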

I recommend using the jusText library:

https://github.com/miso-belica/jusText

Usage (Python 3):

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print(paragraph.text)



I would use EHP:

https://github.com/iogf/ehp

Here it is:

from ehp import *

doc = '''<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>
'''

html = Html()
dom = html.feed(doc)
for ind in dom.find('div', ('class', 'container')):
    print(ind.text())

Output:

Something here
Something else

1 Comment

Please explain: why would you use EHP over the popular BeautifulSoup or lxml?
