Parsing an XML file with Python

Question

I am trying to parse an XML file located in the same folder as my Python script, but when I run the script it does not print anything in the terminal. I am using ElementTree; here is my code:

import xml.etree.ElementTree

f = xml.etree.ElementTree.parse('atom.xml').getroot()
for atype in f.findall('link'):
   print(atype.get('href'))

This is the XML; what I want to get from it is the href:

<?xml version='1.0' ?>
 <feed xmlns="http://www.w3.org/2005/Atom">
 <title type="text">Gwern</title>
 <id>https://www.gwern.net/</id>
 <updated>2017-07-22T14:57:39Z</updated>
 <link href="https://www.gwern.net/atom.xml" rel="self" />
<author>
<name>gwern</name>
</author>
<author>
 <name>ujdRR</name>
</author>
 <generator uri="http://github.com/jgm/gitit"    version="HEAD">gitit</generator>
<entry>
<id>https://www.gwern.net/Mail%20delivery?   utm_source=RSS&amp;utm_medium=feed&amp;utm_campaign=1</id>
  <title type="text">Modified &quot;Mail delivery.page&quot;, Modified   &quot;Mistakes.page&quot;, Modified &quot;Nootropics.page&quot;, Modified &quot;Touhou.page&quot;, Modified &quot;Wikipedia resume.page&quot;,         &quot;Zeo.page&quot;, Modified &quot;hakyll.hs&quot;, Modified &quot;newsletter/2017/06.page&quot;, Modified &quot;the-long-stagnation.page&quot;, Modified &quot;wittgenstein-thesis.page&quot;</title>
<updated>2017-06-25T04:00:06Z</updated>
<author>
  <name>gwern</name>
</author>
<link href="https://www.gwern.net/Mail%20delivery?utm_source=RSS&amp;utm_medium=feed&amp;utm_campaign=1" rel="alternate" />
<summary type="text">record all minor pending edits</summary>

1. Are you sure it's an XML file and not HTML? 2. If findall doesn't find anything then nothing will be printed...
– DeepSpace, Commented Jul 24, 2017 at 14:16
@DeepSpace I have added the text I want to get from the XML file.
– user6003897, Commented Jul 24, 2017 at 14:30

stovfl · Accepted Answer · 2017-07-25 10:08:06Z

Question: ... what I want to get from the xml the href

Your XML declares a default namespace: <feed xmlns="http://www.w3.org/2005/Atom">,
so every element in it belongs to that namespace and you have to pass a namespace mapping to findall.
Second, the XML has two <link ...> tags, one of them inside an <entry> tag.
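To see why the code in the question prints nothing, it can help to look at the tag names ElementTree actually stores. The following is a minimal sketch, assuming a trimmed-down stand-in for the feed (the SAMPLE string is hypothetical, not the full atom.xml):

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the feed in the question.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="https://www.gwern.net/atom.xml" rel="self" />
  <entry>
    <link href="https://www.gwern.net/Mail%20delivery" rel="alternate" />
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# ElementTree stores the namespace URI as part of each tag name.
print(root.tag)                       # {http://www.w3.org/2005/Atom}feed
print([child.tag for child in root])  # both child tags carry the same {...} prefix

# A plain 'link' path therefore matches nothing, so the loop body never runs.
print(root.findall('link'))           # []
```

The empty list is the reason the original loop is silent: no error is raised, there is simply nothing to iterate over.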

findall(self, path, namespaces=None)
Finds all elements matching the ElementPath expression. Same as getroot().findall(path).
The optional namespaces argument accepts a prefix-to-namespace mapping that allows the usage of XPath prefixes in the path expression.

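import xml.etree.ElementTree as ET

# Parse the file from the question and take the root <feed> element
tree = ET.parse('atom.xml')
root = tree.getroot()

# Prefix-to-URI mapping; the prefix name ('xmlns' here) is arbitrary
namespaces = {
    'xmlns': "http://www.w3.org/2005/Atom"
}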

# Get the First <link ...> Outside <entry>
link = root.findall('./xmlns:link', namespaces)[0]
print('link:{} {}'.format(link, link.get('href')))

# Find all <link ...> Inside <entry>
for link in root.findall('./xmlns:entry/xmlns:link', namespaces):
    print(link.get('href'))

Output:

link:<Element {http://www.w3.org/2005/Atom}link at 0xf6a6d8ac> https://www.gwern.net/atom.xml
https://www.gwern.net/Mail%20delivery?utm_source=RSS&utm_medium=feed&utm_campaign=1

Tested with Python: 3.4.2
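As a possible variation on the above, the namespace URI can also be written directly into the tag name, and iter() can be used when the nesting level of the <link> elements does not matter. This is only a sketch under assumptions: SAMPLE is a hypothetical shortened copy of the feed, and the names ATOM and the 'atom' prefix are chosen here for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

# Hypothetical, shortened stand-in for the feed in the question.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="https://www.gwern.net/atom.xml" rel="self" />
  <entry>
    <link href="https://www.gwern.net/Mail%20delivery?utm_source=RSS&amp;utm_medium=feed&amp;utm_campaign=1" rel="alternate" />
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# 1. '{uri}tag' notation: no mapping needed, matches direct children only.
top_level = [link.get("href") for link in root.findall("{%s}link" % ATOM)]

# 2. A mapping with a more descriptive prefix than 'xmlns'.
ns = {"atom": ATOM}
entry_links = [link.get("href") for link in root.findall("atom:entry/atom:link", ns)]

# 3. iter() walks the whole tree, so it returns links at any depth.
all_links = [link.get("href") for link in root.iter("{%s}link" % ATOM)]

print(top_level)
print(entry_links)
print(all_links)
```

Note that the parser decodes &amp; in attribute values, so the href comes back with a plain &, as in the output shown above.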
