Extracting tags from an HTML file using Python

Question

I want to extract a tag's contents from an HTML file in Python without using BeautifulSoup. For example, I want to get

class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine

from

<a class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine</a>

Any ideas?

Why don't you want to use BeautifulSoup? There's probably a good reason, but it makes the question more useful to others if you can include such information.
– John La Rooy, Commented Jul 1, 2013 at 1:48
That is not a tag, it's just a fragment of HTML. What exactly do you want to do?
– Burhan Khalid, Commented Jul 1, 2013 at 5:29

IT Ninja · Accepted Answer · 2013-07-01 01:31:31Z

For basic DOM parsing, you can use the XML parser in the standard library.

Here is an example of turning XML into HTML with it (from the docs):

import xml.dom.minidom

document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = xml.dom.minidom.parseString(document)

def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def handleSlideshow(slideshow):
    # Emit the whole document: title, table of contents, then each slide.
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # One paragraph per slide title.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)
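
Applied to the anchor from the question, the same minidom approach might look like the following. This is a minimal Python 3 sketch, assuming the fragment is well-formed XML (minidom raises an error on HTML that is not); the variable names are illustrative only.

```python
import xml.dom.minidom

# The fragment from the question; it happens to be well-formed XML.
fragment = ('<a class="el" href="atsc__root__raised__cosine.html" '
            'target="_self">atsc_root_raised_cosine</a>')

dom = xml.dom.minidom.parseString(fragment)
anchor = dom.getElementsByTagName("a")[0]

# The element's attributes as a plain dict of name -> value.
attrs = dict(anchor.attributes.items())

# The element's text lives in its child text nodes.
text = "".join(node.data for node in anchor.childNodes
               if node.nodeType == node.TEXT_NODE)

print(attrs)
print(text)
```

Here `attrs` holds the `class`, `href` and `target` values and `text` holds the link text, which together cover the pieces the question asks for.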

Kevin · Accepted Answer · 2013-07-01 05:00:31Z

Have a look at the XML API provided in Python. Its documentation explains how to access attributes and elements, and has some HTML examples too. You can also create parser objects.
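
For HTML that is not well-formed XML, one way to build such a parser object is the standard library's `html.parser` module. The following is a minimal Python 3 sketch, not a complete solution; the class name `AnchorCollector` and the sample markup are hypothetical, and nested `<a>` elements are not handled.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collects (attributes, text) for every <a> element fed to it."""

    def __init__(self):
        super().__init__()
        self.anchors = []      # list of (attrs dict, text) tuples
        self._current = None   # attrs of the <a> currently open, if any
        self._text = []        # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = dict(attrs)
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.anchors.append((self._current, "".join(self._text)))
            self._current = None

html = ('<p>See <a class="el" href="atsc__root__raised__cosine.html" '
        'target="_self">atsc_root_raised_cosine</a></p>')

collector = AnchorCollector()
collector.feed(html)
collector.close()

for attrs, text in collector.anchors:
    print(attrs, text)
```

Unlike minidom, this parser does not require the input to be valid XML, so it is usually the safer choice for real-world pages.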

answered Jul 1, 2013 at 4:25 by Saurabh7; edited Jul 1, 2013 at 5:00 by Kevin
