0

I am trying to parse an xml file that is located in the same folder as my python script but when I run the script it does not print in the terminal as it's supposed to. I am using ElementTree here is my code:

import xml.etree.ElementTree

f = xml.etree.ElementTree.parse('atom.xml').getroot()
for atype in f.findall('link'):
   print(atype.get('href'))

this is what I want to get from the xml the href

<?xml version='1.0' ?>
 <feed xmlns="http://www.w3.org/2005/Atom">
 <title type="text">Gwern</title>
 <id>https://www.gwern.net/</id>
 <updated>2017-07-22T14:57:39Z</updated>
 <link href="https://www.gwern.net/atom.xml" rel="self" />
<author>
<name>gwern</name>
</author>
<author>
 <name>ujdRR</name>
</author>
 <generator uri="http://github.com/jgm/gitit"    version="HEAD">gitit</generator>
<entry>
<id>https://www.gwern.net/Mail%20delivery?   utm_source=RSS&amp;utm_medium=feed&amp;utm_campaign=1</id>
  <title type="text">Modified &quot;Mail delivery.page&quot;, Modified   &quot;Mistakes.page&quot;, Modified &quot;Nootropics.page&quot;, Modified &quot;Touhou.page&quot;, Modified &quot;Wikipedia resume.page&quot;,         &quot;Zeo.page&quot;, Modified &quot;hakyll.hs&quot;, Modified &quot;newsletter/2017/06.page&quot;, Modified &quot;the-long-stagnation.page&quot;, Modified &quot;wittgenstein-thesis.page&quot;</title>
<updated>2017-06-25T04:00:06Z</updated>
<author>
  <name>gwern</name>
</author>
<link href="https://www.gwern.net/Mail%20delivery?utm_source=RSS&amp;utm_medium=feed&amp;utm_campaign=1" rel="alternate" />
<summary type="text">record all minor pending edits</summary>

8
  • 1
    1. Are you sure it's an XML file and not HTML? 2. If findall doesn't find anything then nothing will be printed... Commented Jul 24, 2017 at 14:16
  • 1
    What happens when you run this code? Commented Jul 24, 2017 at 14:20
  • nothing happens it just returns an empty terminal Commented Jul 24, 2017 at 14:24
  • @DeepSpace . I have added the text I want to get from the xml file Commented Jul 24, 2017 at 14:30
  • @mots Both my points are still valid... Commented Jul 24, 2017 at 14:32

1 Answer 1

3

Question: ... what I want to get from the xml the href

Your XML has a Namespace: <feed xmlns="http://www.w3.org/2005/Atom">',
therefore you have to use a Namespace Parameter with findall.
Second, the XML has Two <link ...> Tags, One Inside a <entry> Tag.

findall(self, path, namespaces=None)
Finds all elements matching the ElementPath expression. Same as getroot().findall(path).
The optional namespaces argument accepts a prefix-to-namespace mapping that allows the usage of XPath prefixes in the path expression.

root = tree.getroot()
namespaces = {
'xmlns':"http://www.w3.org/2005/Atom"
}

# Get the First <link ...> Outside <entry>
link = root.findall('./xmlns:link', namespaces)[0]
print('link:{} {}'.format(link, link.get('href')))

# Find all <link ...> Inside <entry>
for link in root.findall('./xmlns:entry/xmlns:link', namespaces):
    print(link.get('href'))

Output:

link:<Element {http://www.w3.org/2005/Atom}link at 0xf6a6d8ac> https://www.gwern.net/atom.xml
https://www.gwern.net/Mail%20delivery?utm_source=RSS&utm_medium=feed&utm_campaign=1

Tested with Python: 3.4.2

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.