Parsing XML with Python and ElementTree

Question

I am doing a class project where I have to save a list of links to a text file.

I was given the XML and am trying to iterate through all the URLs, but I am having trouble.

I have tried using ElementTree but cannot iterate through the elements. I have read many other questions and tried their suggestions with no success. Please help.

The structure looks like this:

<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
<url>....

What does your code look like so far? In what way is it not working?
– larsks, Commented Nov 2, 2016 at 19:27
From the example, just want to make sure your XML is correct (all elements closed, doctype, etc.)?
– Eugene, Commented Nov 2, 2016 at 19:37

Laurent LAPORTE · Accepted Answer · 2016-11-02 19:41:13Z

4

I suggest using lxml to parse the XML file efficiently.

from lxml import etree

Your XML sample is not well-formed (the elements are not closed), so I fixed it like this:

content = """\
<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
</urlset>"""

To parse a file, you can use etree.parse(). But since this sample is a string, I use etree.XML():

tree = etree.XML(content)
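The difference between parsing a file and parsing a string can be sketched as follows. This is a minimal illustration that uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made so that it runs without extra packages; its parse() and fromstring() play the same roles). The file name sitemap.xml is hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

content = """\
<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
</urlset>"""

# From a string: fromstring() returns the root element directly.
root_from_string = ET.fromstring(content)

# From a file: parse() returns an ElementTree, and getroot() gives the root element.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sitemap.xml")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    root_from_file = ET.parse(path).getroot()

# Both routes produce the same root element.
print(root_from_string.tag == root_from_file.tag)
```

Either way you end up with the same element tree, so the rest of the code does not depend on where the XML came from.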

The natural way to search for elements in an XML tree is to use XPath. For instance, you can do this:

loc_list = tree.xpath("//url/loc")

But you'll get nothing:

for loc in loc_list:
    print(loc.text)
# (prints nothing: loc_list is an empty list)

The reason, and it is probably your problem, is that <urlset> uses a default namespace: "http://www.crawlingcourse.com/sitemap/1.3".
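One way to see the effect of the default namespace is to print the tag names the parser actually stores. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml (an assumption, chosen to avoid an extra dependency); both libraries store namespaced tags in the "{uri}name" form.

```python
import xml.etree.ElementTree as ET

content = """\
<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
</urlset>"""

root = ET.fromstring(content)

# Each tag is stored as "{namespace-uri}local-name", not as the bare name.
tags = [elem.tag for elem in root.iter()]
for tag in tags:
    print(tag)

# A search for the bare name "loc" therefore matches nothing.
unqualified = root.findall(".//loc")
print(len(unqualified))
```

Because no element is literally named "loc", any search that omits the namespace comes back empty.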

To make it work, you need to pass this namespace to the xpath() function. Let's give the namespace a prefix, "s":

NS = {'s': "http://www.crawlingcourse.com/sitemap/1.3"}

Then, use the s prefix in your XPath expression like this:

loc_list = tree.xpath("//s:url/s:loc", namespaces=NS)

for loc in loc_list:
    print(loc.text)
#     http://www.crawlingcourse.com/item-3911512

Because your XML is indented, the text of each <loc> contains leading and trailing whitespace, so you need to strip it:

for loc in loc_list:
    url = loc.text.strip()
    print(url)
# http://www.crawlingcourse.com/item-3911512
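Putting the steps together for the original goal (saving the links to a text file) might look like the sketch below. It uses the standard library's xml.etree.ElementTree with findall() in place of lxml's xpath(), which is an assumption made to keep it dependency-free. The helper names extract_urls and save_urls and the output file name links.txt are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Map a prefix of our choosing to the sitemap's default namespace.
NS = {"s": "http://www.crawlingcourse.com/sitemap/1.3"}


def extract_urls(xml_text):
    """Return the stripped text of every <loc> found under a <url> element."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall(".//s:url/s:loc", NS):
        # loc.text is None for an empty <loc/>, so guard before stripping.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


def save_urls(urls, path):
    """Write one URL per line to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")


content = """\
<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
</urlset>"""

urls = extract_urls(content)
out_path = os.path.join(tempfile.gettempdir(), "links.txt")  # hypothetical output file
save_urls(urls, out_path)
print(urls)
```

For a real sitemap file, you would read the XML with ET.parse(path).getroot() instead of ET.fromstring() and keep the rest unchanged.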


1 Comment

hahu Over a year ago

Thank you @Laurent for taking the time to explain. You solved my problem and taught me how it actually works. thank u

Eugene · Accepted Answer · 2016-11-02 19:52:00Z

Well, the issue really is the namespace.

Here's working code for the same XML without the namespace declaration:

from xml.etree.ElementTree import ElementTree, fromstring

# The sample has no xmlns attribute, so the plain tag name 'loc' matches.
xml_string = '<?xml version="1.0"?><urlset><url><loc>http://www.crawlingcourse.com/item-3911512</loc></url></urlset>'
tree = ElementTree(fromstring(xml_string))
print([elem.text for elem in tree.iter(tag='loc')])

Now, if you add the namespace declaration, <urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">, the tag names are going to be different: every tag is then qualified with the namespace URI. From http://www.w3schools.com/xml/xml_namespaces.asp:

XML Namespaces - The xmlns Attribute. When using prefixes in XML, a namespace for the prefix must be defined. The namespace can be defined by an xmlns attribute in the start tag of an element. The namespace declaration has the following syntax. xmlns:prefix="URI".

Threw me off too!
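For the namespaced version, a minimal sketch of the same iteration is shown below. It assumes the standard library's xml.etree.ElementTree and passes the fully qualified tag name, written as "{namespace-uri}loc", to iter(). The NAMESPACE constant is just a convenience name introduced here.

```python
import xml.etree.ElementTree as ET

NAMESPACE = "http://www.crawlingcourse.com/sitemap/1.3"

xml_string = (
    '<?xml version="1.0"?>'
    '<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">'
    "<url><loc>http://www.crawlingcourse.com/item-3911512</loc></url>"
    "</urlset>"
)

root = ET.fromstring(xml_string)

# With a default namespace, the tag to look for is "{uri}loc", not "loc".
qualified_tag = "{%s}loc" % NAMESPACE
locs = [elem.text for elem in root.iter(qualified_tag)]
print(locs)

# The bare name no longer matches anything.
bare = list(root.iter("loc"))
print(len(bare))
```

The only change from the namespace-free version is the tag string passed to iter().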
