Parsing XML with namespace

Question

With this XML

<?xml version="1.0" encoding="UTF-8"?>
<Envelope>
    <subject>Reference rates</subject>
    <Sender>
        <name>European Central Bank</name>
    </Sender>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</Envelope>

I can get the inner Cube tags like this

from xml.etree.ElementTree import ElementTree

t = ElementTree()
t.parse('eurofxref-daily.xml')
day = t.find('Cube/Cube')
print 'Day:', day.attrib['time']
for currency in day:
    print currency.items()

Day: 2013-12-20
[('currency', 'USD'), ('rate', '1.3655')]
[('currency', 'JPY'), ('rate', '142.66')]

The problem is that the above XML is a cleaned version of the original file, which defines namespaces:

<?xml version="1.0" encoding="UTF-8"?>
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
    <gesmes:subject>Reference rates</gesmes:subject>
    <gesmes:Sender>
        <gesmes:name>European Central Bank</gesmes:name>
    </gesmes:Sender>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</gesmes:Envelope>

When I try to get the first Cube tag, I get None:

t = ElementTree()
t.parse('eurofxref-daily.xml')
print t.find('Cube')

None

The root tag includes the namespace

root = t.getroot()
print 'root.tag:', root.tag

root.tag: {http://www.gesmes.org/xml/2002-08-01}Envelope

So do its children:

for e in root.getchildren():
    print 'e.tag:', e.tag

e.tag: {http://www.gesmes.org/xml/2002-08-01}subject
e.tag: {http://www.gesmes.org/xml/2002-08-01}Sender
e.tag: {http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube
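These tags use the `{uri}local` form that ElementTree reports for namespaced elements. As a minimal sketch (Python 3 syntax; `split_tag` is a hypothetical helper for illustration, not part of ElementTree), the two parts can be separated like this:

```python
def split_tag(tag):
    """Split '{uri}local' into (uri, local).

    For a tag without a namespace, the uri part is None.
    """
    if tag.startswith("{"):
        # Drop the leading '{' and cut at the first '}'.
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return None, tag


# Tags as printed above for the root element and its last child.
envelope = split_tag("{http://www.gesmes.org/xml/2002-08-01}Envelope")
cube = split_tag("{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube")
plain = split_tag("Cube")

print(envelope)
print(cube)
print(plain)
```

This only inspects the tag string; it does not change how find() matches elements.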

I can get the Cube tags if I include the namespace in the tag

day = t.find('{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube/{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube')
print 'Day: ', day.attrib['time']

Day:  2013-12-20

But that is really ugly. Apart from cleaning the file before processing or doing string manipulation, is there an elegant way to handle it?

What do you mean by "defined namespaces that do not exist"? The URIs do not have to exist on the web.
– mmmmmm, Commented Dec 22, 2013 at 13:54
@Mark I mean that those URIs return a 404 Not Found. If that is not a problem, then the problem lies elsewhere.
– Clodoaldo Neto, Commented Dec 22, 2013 at 16:03
That is not a problem for namespaces, and including the namespace is the correct way to do it as per the tutorial for lxml (which is a superset of ElementTree). This way of using namespaces is much nicer than what I see in other XML APIs.
– mmmmmm, Commented Dec 22, 2013 at 17:17

Louis · Accepted Answer · 2013-12-22 18:11:55Z

There's a more elegant way than including the whole namespace URI in the text of the query. For a Python version that does not support the namespaces argument of ElementTree.find, lxml provides the missing functionality and is "mostly compatible" with xml.etree:

from lxml.etree import ElementTree

t = ElementTree()
t.parse('eurofxref-daily.xml')
namespaces = { "exr": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref" }
day = t.find('exr:Cube', namespaces)
print day

Using the namespaces object, you can set it once and for all and then just use prefixes in your queries.

Here is the output:

$ python test.py
<Element '{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube' at 0x7fe0f95e3290>

If you find prefixes inelegant, then you have to work on a file without namespaces. There may also be other tools out there that "cheat" and match on local-name() even when namespaces are in effect, but I don't use them.
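For what it's worth, one way to match on the local name alone without leaving the standard library is the `{*}` wildcard in xml.etree paths. The sketch below assumes Python 3.8 or later (where, to my knowledge, this wildcard was introduced) and inlines a trimmed copy of the question's XML rather than reading the file:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the namespaced document from the question.
XML = """<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
    <gesmes:subject>Reference rates</gesmes:subject>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</gesmes:Envelope>"""

root = ET.fromstring(XML)

# "{*}Cube" matches a Cube element in any namespace (or in none).
day = root.find("{*}Cube/{*}Cube")
currencies = [c.get("currency") for c in day.findall("{*}Cube")]

print("Day:", day.get("time"))
print(currencies)
```

Note that this ignores the namespace entirely, so it would also match a Cube element from an unrelated vocabulary if the document contained one.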

In Python 2.7, or Python 3.3 or higher, you can use the same code as above with xml.etree instead of lxml, because support for the namespaces argument was added in those versions.
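As a rough sketch of that standard-library variant, written in Python 3 syntax: the XML is inlined here instead of being read from eurofxref-daily.xml, and the prefixes in the mapping (`exr`, `gesmes`) are arbitrary local names that only need to map to the right URIs.

```python
import xml.etree.ElementTree as ET

# The namespaced document from the question (XML declaration omitted).
XML = """<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
    <gesmes:subject>Reference rates</gesmes:subject>
    <gesmes:Sender>
        <gesmes:name>European Central Bank</gesmes:name>
    </gesmes:Sender>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</gesmes:Envelope>"""

# Prefix-to-URI mapping, defined once and reused for every query.
NAMESPACES = {
    "gesmes": "http://www.gesmes.org/xml/2002-08-01",
    "exr": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref",
}

root = ET.fromstring(XML)

# Paths are relative to the root element (gesmes:Envelope).
day = root.find("exr:Cube/exr:Cube", NAMESPACES)
rates = {
    cube.get("currency"): cube.get("rate")
    for cube in day.findall("exr:Cube", NAMESPACES)
}
sender = root.find("gesmes:Sender/gesmes:name", NAMESPACES).text

print("Day:", day.get("time"))
for currency, rate in rates.items():
    print(currency, rate)
print("Sender:", sender)
```

The attributes (time, currency, rate) carry no prefix in the document, so they are read with plain names.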

edited Dec 22, 2013 at 18:11

answered Dec 22, 2013 at 13:03


6 Comments

Clodoaldo Neto Over a year ago

Which version of the XML did you use? Notice that the problem happens with the second, unmodified version, the one with namespaces. You got the root Cube, which has no attributes. Try find('Cube/Cube').

Clodoaldo Neto Over a year ago

find(match, namespaces) is new in Python 3.3

Louis Over a year ago

Actually, the namespaces argument is also available in Python 2.7, which is what I used for testing my answer before posting it. But you used the 2.6 tag in your question, so I've updated my answer to take this into account.

Clodoaldo Neto Over a year ago

It is not documented: docs.python.org/release/2.7/library/…

Clodoaldo Neto Over a year ago

lxml is not included in 2.6 but your answer is good enough so I'm accepting.
