
I am doing a class project where I have to save a list of links to a text file.

I was given the XML and am trying to iterate through all the URLs, but I am having trouble.

I have tried using ElementTree but cannot iterate through the elements. I read many other questions and tried their suggestions with no success. Please help.

The structure looks like this:

<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
<url>....
  • What does your code look like so far? In what way is it not working? Commented Nov 2, 2016 at 19:27
  • From the example, just want to make sure your XML is correct (all elements closed, doctype, etc)? Commented Nov 2, 2016 at 19:37

2 Answers


I suggest using lxml to parse the XML file efficiently.

from lxml import etree

Your XML sample is not well-formed, so I fixed it like this:

content = """\
<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
</urlset>"""

To parse a file, you can use etree.parse(). But since this sample is a string, I use etree.XML():

tree = etree.XML(content)
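For the file case, here is a minimal sketch using the standard library's xml.etree.ElementTree, which offers the same split (parse() for files, fromstring() for strings) and runs without installing anything. The file name sitemap.xml is a placeholder, and the file is written first only so the example is self-contained:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

content = """\
<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
</urlset>"""

# Write the sample to a temporary file so there is something to parse.
# "sitemap.xml" is a placeholder name, not something from the question.
path = os.path.join(tempfile.mkdtemp(), "sitemap.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write(content)

# parse() reads a file and returns an ElementTree; getroot() gives <urlset>.
root_from_file = ET.parse(path).getroot()

# fromstring() parses a string and returns the root element directly.
root_from_string = ET.fromstring(content)

print(root_from_file.tag == root_from_string.tag)
```

lxml's etree.parse() likewise accepts a file name or file object and returns a tree whose root is available through getroot().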

The natural way to search for elements in an XML tree is XPath. For instance, you can do this:

loc_list = tree.xpath("//url/loc")

But you'll get nothing:

for loc in loc_list:
    print(loc.text)
# (no output: loc_list is empty)

The reason, and it is probably your problem, is that <urlset> uses a default namespace: "http://www.crawlingcourse.com/sitemap/1.3".
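One way to see the effect of the default namespace is to print the tag names the parser actually stores. This sketch uses the standard library's xml.etree.ElementTree so it runs without extra packages; lxml reports element tags in the same {uri}name form:

```python
import xml.etree.ElementTree as ET

content = """\
<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
</urlset>"""

root = ET.fromstring(content)

# iter() walks the root and all of its descendants in document order.
tags = [elem.tag for elem in root.iter()]
for tag in tags:
    print(tag)
# {http://www.crawlingcourse.com/sitemap/1.3}urlset
# {http://www.crawlingcourse.com/sitemap/1.3}url
# {http://www.crawlingcourse.com/sitemap/1.3}loc
```

No element is named plain url or loc, so a search for those unqualified names matches nothing.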

To make it work, you need to pass this namespace to the xpath() function. Let's give the namespace a prefix, "s":

NS = {'s': "http://www.crawlingcourse.com/sitemap/1.3"}

Then, use the s prefix in your XPath expression like this:

loc_list = tree.xpath("//s:url/s:loc", namespaces=NS)

for loc in loc_list:
    print(loc.text)
#     http://www.crawlingcourse.com/item-3911512

Because your XML is indented, the text includes the surrounding whitespace and newlines, so you need to strip it:

for loc in loc_list:
    url = loc.text.strip()
    print(url)
# http://www.crawlingcourse.com/item-3911512
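Since the goal in the question is to save the links to a text file, the stripped URLs can then be written out one per line. This is a minimal sketch, not the only way to do it: it uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra packages, and the output file name links.txt is a placeholder:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

content = """\
<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">
  <url>
     <loc>
        http://www.crawlingcourse.com/item-3911512
     </loc>
  </url>
</urlset>"""

NS = {"s": "http://www.crawlingcourse.com/sitemap/1.3"}
root = ET.fromstring(content)

# Collect the text of every <loc>, stripped of the indentation whitespace.
urls = [loc.text.strip() for loc in root.findall("s:url/s:loc", NS) if loc.text]

# "links.txt" is a placeholder name; a temp directory keeps the example tidy.
out_path = os.path.join(tempfile.mkdtemp(), "links.txt")
with open(out_path, "w", encoding="utf-8") as f:
    for url in urls:
        f.write(url + "\n")

# Read the file back to check what was saved.
with open(out_path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```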

1 Comment

Thank you @Laurent for taking the time to explain. You solved my problem and taught me how it actually works.

Well, the issue really is the namespace.

Here's working code:

# cElementTree was removed in Python 3.9; ElementTree is the replacement.
from xml.etree.ElementTree import ElementTree, fromstring

xml_string = '<?xml version="1.0"?><urlset><url><loc>http://www.crawlingcourse.com/item-3911512</loc></url></urlset>'
tree = ElementTree(fromstring(xml_string))
# print is a function in Python 3.
print([elem.text for elem in tree.iter(tag='loc')])

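Now, if you add the namespace declaration, as in <urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">, the tag names are going to be different: every element is then in that namespace, so ElementTree reports loc as {http://www.crawlingcourse.com/sitemap/1.3}loc, and a search for plain 'loc' finds nothing. From http://www.w3schools.com/xml/xml_namespaces.asp: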

XML Namespaces - The xmlns Attribute. When using prefixes in XML, a namespace for the prefix must be defined. The namespace can be defined by an xmlns attribute in the start tag of an element. The namespace declaration has the following syntax. xmlns:prefix="URI".

Threw me off too!
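To keep the xmlns declaration and still find the <loc> elements with ElementTree, the tag can be written in its qualified {uri}name form, or a prefix map can be passed to findall(). This is a sketch under the assumption that the document looks like the sample in the question; the prefix s is an arbitrary choice:

```python
import xml.etree.ElementTree as ET

xml_string = (
    '<?xml version="1.0"?>'
    '<urlset xmlns="http://www.crawlingcourse.com/sitemap/1.3">'
    '<url><loc>http://www.crawlingcourse.com/item-3911512</loc></url>'
    '</urlset>'
)
URI = "http://www.crawlingcourse.com/sitemap/1.3"
root = ET.fromstring(xml_string)

# Option 1: spell out the qualified tag name, {uri}loc.
by_qualified_tag = [elem.text for elem in root.iter("{%s}loc" % URI)]

# Option 2: map a prefix to the URI and use that prefix in the path.
by_prefix = [elem.text for elem in root.findall(".//s:loc", {"s": URI})]

print(by_qualified_tag)
# ['http://www.crawlingcourse.com/item-3911512']
print(by_prefix)
# ['http://www.crawlingcourse.com/item-3911512']
```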
