Python and HTMLParser.handle_data() - How to get data from tags?

Question

I'm trying to parse a web page with the Python HTMLParser. I want to get the content of a tag, but I'm not sure how to do it. This is the code I have so far:

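import urllib.request
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered some data:", data)


url = "website"
page = urllib.request.urlopen(url).read()

# The strict argument was removed in Python 3.5, so it is not passed here.
parser = MyHTMLParser()
# read() returns bytes; decode it (UTF-8 is assumed here) instead of calling str().
parser.feed(page.decode("utf-8", errors="replace"))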

If I understand correctly, I can use the handle_data() function to get the data between tags. How do I specify which tags to get the data from? And how do I get the data?

I recommend using BeautifulSoup because it has a really friendly interface.
– jcollado, Commented Dec 13, 2011 at 10:07
Not just because of the friendly interface, though - it's much more forgiving of the kind of broken/incorrect HTML you'll see out on the wild wild web.
– babbageclunk, Commented Dec 13, 2011 at 10:17
I tried BeautifulSoup. The page I parsed made it choke. What do you do when even BeautifulSoup won't work? :)
– user1049697, Commented Dec 13, 2011 at 10:54
You can also sanitize your input with BeautifulSoup. Some more information in this question.
– jcollado, Commented Dec 13, 2011 at 12:02
What is the web page you are trying to parse, and what data are you trying to extract?
– ekhumoro, Commented Dec 13, 2011 at 18:19

nnov · Accepted Answer · 2023-02-01 12:21:46Z

1

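from html.parser import HTMLParser

class HTMLParse(HTMLParser):
    def __init__(self):
        super().__init__()
        self.recordh2 = False  # True while the parser is inside an <h2> element

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self.recordh2 = True

    def handle_endtag(self, tag):  # end tags carry no attributes
        if tag == 'h2':
            self.recordh2 = False

    def handle_data(self, data):
        if self.recordh2:
            print(data)  # do your work here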
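
One way to fill in the "do your work here" step is to accumulate the text instead of printing it. The following is a minimal sketch, assuming Python 3; the class name `TagTextCollector` and its `target`, `depth`, `current`, and `chunks` attributes are made up for illustration and are not part of `html.parser`.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, target):
        super().__init__()
        self.target = target   # tag name to watch, e.g. "h2"
        self.depth = 0         # greater than 0 while inside the target tag
        self.current = []      # text pieces of the element being read
        self.chunks = []       # finished text, one entry per element

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.chunks.append("".join(self.current).strip())
                self.current = []

    def handle_data(self, data):
        # Called for text between tags; it may fire several times per element,
        # for example when the element contains nested tags such as <em>.
        if self.depth > 0:
            self.current.append(data)


collector = TagTextCollector("h2")
collector.feed("<h1>Title</h1><h2>First <em>section</em></h2><p>body</p><h2>Second</h2>")
collector.close()
print(collector.chunks)  # ['First section', 'Second']
```

Passing the tag name to the constructor is just one possible design; it keeps the "which tag" decision out of the handler methods, so the same class can be reused for other tags.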

edited Feb 1, 2023 at 12:21

nnov

answered Jan 10, 2014 at 20:47

hwang

1 Comment

Ando Jurai Over a year ago

Is there no way to simply retrieve the data between the tags? I mean, everybody advises using BS or lxml, but I would like to try HTMLParser if possible, as my application is very simple (and as I would also like to learn to do simple manipulations from the command line)...

sjngm · Accepted Answer · 2012-01-12 08:03:29Z

0

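import urllib.request

html_code = urllib.request.urlopen("xxx")
html_code_list = html_code.readlines()
data = ""
for line in html_code_list:
    # readlines() yields bytes; decode (UTF-8 is assumed here) before comparing with str
    line = line.decode("utf-8", errors="replace").strip()

    # keep only the lines that begin with an <h2> tag
    if line.startswith("<h2"):
        data = data + line

hp = MyHTMLParser()
hp.feed(data)
hp.close()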

This way you can extract the data from h2 tags. Hope it helps.
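
Filtering by line prefix only catches headings that begin a line and fit on a single line. As a rough illustration with made-up sample markup, the sketch below compares that filter with feeding the whole document to the parser; `H2Parser`, `sample`, `by_lines`, and `whole` are hypothetical names, not anything from the answer.

```python
from html.parser import HTMLParser


class H2Parser(HTMLParser):
    """Hypothetical parser that gathers the text of <h2> elements."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2 and data.strip():
            self.found.append(data.strip())


sample = """<div><h2>Inline heading</h2></div>
<h2>
Multi-line heading
</h2>
<h2>Plain heading</h2>"""

# Line-prefix approach: keep only the lines that start with "<h2".
filtered = "".join(
    line.strip() for line in sample.splitlines() if line.strip().startswith("<h2")
)
by_lines = H2Parser()
by_lines.feed(filtered)
by_lines.close()

# Alternative: feed the complete document and let the parser find the tags.
whole = H2Parser()
whole.feed(sample)
whole.close()

print(by_lines.found)  # ['Plain heading']
print(whole.found)     # ['Inline heading', 'Multi-line heading', 'Plain heading']
```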

edited Jan 12, 2012 at 8:03

sjngm

answered Jan 12, 2012 at 4:45

Yanan

2 Comments

Dan Over a year ago

Bad! Don't parse HTML with that!

Yanan Over a year ago

What's the best way to parse HTML? I tried HTMLParser, but the parsing speed is really slow.

user393899 · Accepted Answer · 2012-01-12 05:01:01Z

0

I don't have time to format/clean this up, but this is how I usually do it:

import os
from html.parser import HTMLParser
from urllib.parse import urlparse

class HTMLParse(HTMLParser):
    def handle_starttag(self, tag, attr):
        if tag.lower() == "a":
            # attr is a list of (name, value) pairs; value may be None
            for item in attr:
                if item[0].lower() == "href" and item[1]:
                    path = urlparse(item[1]).path
                    ext = os.path.splitext(path)[1]
                    # report links that point at image files
                    if ext.lower() in (".jpeg", ".jpg", ".png",
                                       ".bmp"):
                        print("Found: " + item[1])
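
The same attribute check can collect the matches in a list rather than printing them, which makes the result easier to reuse. This is only a sketch for Python 3; `ImageLinkParser`, `IMAGE_EXTENSIONS`, and `links` are names invented here, and the sample markup is made up.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical constant: extensions treated as image files.
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png", ".bmp")


class ImageLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []  # href values that point at image files

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == "href" and value:
                # Look only at the path so a query string does not hide the extension.
                ext = os.path.splitext(urlparse(value).path)[1]
                if ext.lower() in IMAGE_EXTENSIONS:
                    self.links.append(value)


parser = ImageLinkParser()
parser.feed('<a href="/img/cat.JPG?size=2">cat</a> <a href="page.html">page</a> <a href>x</a>')
parser.close()
print(parser.links)  # ['/img/cat.JPG?size=2']
```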

answered Jan 12, 2012 at 5:01

user393899
