I want to extract a tag from an HTML file in Python without using BeautifulSoup. For example, I want to get

class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine 

from

<a class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine</a>

Any ideas?

  • Why don't you want to use BeautifulSoup? There's probably a good reason, but it makes the question more useful to others if you can include such information. Commented Jul 1, 2013 at 1:48
  • That is not a tag, it's just a fragment of HTML. What exactly do you want to do? Commented Jul 1, 2013 at 5:29

2 Answers

1

For basic DOM parsing, you can use the XML parser in the standard library (xml.dom.minidom). Note that it only accepts well-formed markup.

Here is an example of turning XML into HTML with it (from the docs):

import xml.dom.minidom

document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = xml.dom.minidom.parseString(document)

def getText(nodelist):
    # Join the data of every text node that sits directly in nodelist.
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

def handleSlideshow(slideshow):
    # Emit the whole document: title, table of contents, then the slides.
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    # Render the <point> elements of one slide as an unordered list.
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # One paragraph per slide title.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)
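
Applied to the snippet from the question, the same module might be used roughly like this. This is only a sketch: it assumes the fragment is well-formed XML (the anchor in the question happens to be), and the variable names snippet, link, attrs and text are made up for illustration.

```python
import xml.dom.minidom

# The anchor element from the question; minidom needs well-formed markup.
snippet = ('<a class="el" href="atsc__root__raised__cosine.html" '
           'target="_self">atsc_root_raised_cosine</a>')

dom = xml.dom.minidom.parseString(snippet)
link = dom.getElementsByTagName("a")[0]

# NamedNodeMap.items() yields (name, value) pairs for the element's attributes.
attrs = dict(link.attributes.items())

# Join the text nodes that are direct children of the <a> element.
text = "".join(node.data for node in link.childNodes
               if node.nodeType == node.TEXT_NODE)

print(attrs)
print(text)
```

A real HTML page is often not well-formed XML (unclosed tags, unquoted attributes), in which case parseString raises an exception, so this approach is probably only suitable for clean, XHTML-like input.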

1

Have a look at the XML API provided in Python. Its documentation explains how to access attributes and elements, and it has some HTML examples too. You can also generate parser objects.
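
If the input is ordinary HTML rather than well-formed XML, one option is the standard library's html.parser module, which is a different parser from the XML API mentioned above. The following is a rough sketch, not a definitive solution: the class name AnchorCollector and the helper extract_links are hypothetical names chosen for this example, and nested anchors are not handled.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the attributes and text of every <a> element fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []        # one dict per finished <a> element
        self._current = None   # the <a> currently being read, if any

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            self._current = {"attrs": dict(attrs), "text": ""}

    def handle_data(self, data):
        # Only keep text that appears inside an open <a>.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None

def extract_links(html):
    # Hypothetical convenience wrapper around the collector above.
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links

page = ('<p>See <a class="el" href="atsc__root__raised__cosine.html" '
        'target="_self">atsc_root_raised_cosine</a><br></p>')
for link in extract_links(page):
    print(link["attrs"], link["text"])
```

Unlike minidom, this parser tolerates sloppy markup such as the unclosed br tag in the sample string, which is why it may be the safer choice for pages found in the wild.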
