4

I'm trying to parse a web page with the Python HTMLParser. I want to get the content of a tag, but I'm not sure how to do it. This is the code I have so far:

import urllib.request
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered some data:", data)


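url = "website"
# read() returns bytes; decode them instead of calling str(), which would
# produce the "b'...'" repr (UTF-8 is an assumption about the page's encoding)
page = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

# the strict argument was removed from HTMLParser in Python 3.5
parser = MyHTMLParser()
parser.feed(page)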

If I understand correctly, I can use the handle_data() function to get the data between tags. How do I specify which tags to get the data from? And how do I get the data?

5 Comments
  • I recommend using BeautifulSoup because it has a really friendly interface. Commented Dec 13, 2011 at 10:07
  • Not just because of the friendly interface, though - it's much more forgiving of the kind of broken/incorrect HTML you'll see out on the wild wild web. Commented Dec 13, 2011 at 10:17
  • I tried BeautifulSoup. The page I parsed made it choke. What do you do when even BeautifulSoup won't work? :) Commented Dec 13, 2011 at 10:54
  • You can also sanitize your input with BeautifulSoup. Some more information in this question. Commented Dec 13, 2011 at 12:02
  • What is the web-page you are trying to parse, and what data are you trying to extract? Commented Dec 13, 2011 at 18:19

3 Answers

1
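from html.parser import HTMLParser

class HTMLParse(HTMLParser):
    recordh2 = False  # default, so handle_data works before the first <h2>

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self.recordh2 = True

    def handle_endtag(self, tag):  # handle_endtag receives only the tag name
        if tag == 'h2':
            self.recordh2 = False

    def handle_data(self, data):
        if self.recordh2:
            # do your work here
            print(data)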
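
If you would rather get the text back as a list than handle it inside the callback, the flag idea above can be extended roughly like this. It is only a sketch: TagTextCollector, target, pieces and results are names I made up for illustration (they are not part of html.parser), and the sample HTML is invented. A depth counter is used instead of a boolean so that a nested element of the same tag does not switch recording off too early.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag.

    Hypothetical helper for illustration; not part of html.parser.
    """

    def __init__(self, target):
        super().__init__()
        self.target = target.lower()  # the parser reports tag names in lower case
        self.depth = 0                # > 0 while inside a target element
        self.pieces = []              # text fragments of the current element
        self.results = []             # one string per completed element

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # the outermost target element just closed: store its text
                self.results.append("".join(self.pieces).strip())
                self.pieces = []

    def handle_data(self, data):
        # handle_data can fire several times per element, so accumulate
        if self.depth > 0:
            self.pieces.append(data)


# made-up sample document
sample = (
    "<html><body><h2>First <b>title</b></h2>"
    "<p>skip me</p><h2>Second</h2></body></html>"
)
collector = TagTextCollector("h2")
collector.feed(sample)
collector.close()
print(collector.results)
```

To answer "which tags": pass the tag name when you create the collector, e.g. TagTextCollector("p"). The text inside child tags such as the <b> above is included, because handle_data is called for it while the depth counter is still positive.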

1 Comment

Is there no way to merely retrieve the data between the tags? Everybody advises using BS or lxml, but I would like to try HTMLParser if possible, as my application is very simple (and I would also like to learn to do simple manipulations from the command line interface)...
0
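import urllib.request

html_code = urllib.request.urlopen("xxx")
html_code_list = html_code.readlines()
data = ""
for line in html_code_list:
    # readlines() yields bytes in Python 3, so decode before the string test
    line = line.decode("utf-8", errors="replace").strip()

    if line.startswith("<h2"):
        data = data + line

hp = MyHTMLParser()
hp.feed(data)
hp.close()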

This way you can extract the data from h2 tags, as long as each h2 element starts its line and fits on that one line. Hope it helps.

2 Comments

Bad! Don't parse HTML with that!
What's the best way to parse HTML? I tried HTMLParser, but its parsing speed is really slow.
0

I don't have time to format/clean this up, but this is how I usually do it:

        import os
        from html.parser import HTMLParser
        from urllib.parse import urlparse

        class HTMLParse(HTMLParser):
            def handle_starttag(self, tag, attr):
                if tag.lower() == "a":
                    for item in attr:
                        # each item is a (name, value) pair; value is None for a bare attribute
                        if item[0].lower() == "href" and item[1]:
                            path = urlparse(item[1]).path
                            ext = os.path.splitext(path)[1]
                            if ext.lower() in (".jpeg", ".jpg", ".png",
                                               ".bmp"):
                                print("Found: " + item[1])
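
If you want the links back instead of printed, the same attribute check can be wrapped up roughly as follows. Treat it as a sketch rather than a finished tool: ImageLinkCollector, IMAGE_EXTENSIONS, links and the sample markup are all names and data I invented for the example. It relies on handle_starttag receiving attrs as a list of (name, value) pairs, with the names already lower-cased by the parser.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

# extensions treated as images; taken from the snippet above
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png", ".bmp")


class ImageLinkCollector(HTMLParser):
    """Hypothetical helper: gather href values of <a> tags that point at images."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value:
                # look only at the path so query strings do not hide the extension
                ext = os.path.splitext(urlparse(value).path)[1]
                if ext.lower() in IMAGE_EXTENSIONS:
                    self.links.append(value)


# made-up sample markup
sample = (
    '<a href="http://example.com/pics/cat.JPG?size=big">cat</a>'
    '<a href="/docs/readme.txt">readme</a>'
    '<a name="top">no href</a>'
    '<a href="dog.png">dog</a>'
)
collector = ImageLinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

Here the first and last links are kept, while the text file and the anchor without an href are skipped.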
