
Given this XML:

<?xml version="1.0" encoding="UTF-8"?>
<Envelope>
    <subject>Reference rates</subject>
    <Sender>
        <name>European Central Bank</name>
    </Sender>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</Envelope>

I can get the inner Cube tags like this:

from xml.etree.ElementTree import ElementTree

t = ElementTree()
t.parse('eurofxref-daily.xml')
day = t.find('Cube/Cube')
print 'Day:', day.attrib['time']
for currency in day:
    print currency.items()

Day: 2013-12-20
[('currency', 'USD'), ('rate', '1.3655')]
[('currency', 'JPY'), ('rate', '142.66')]

The problem is that the XML above is a cleaned-up version of the original file, which declares namespaces:

<?xml version="1.0" encoding="UTF-8"?>
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01" xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
    <gesmes:subject>Reference rates</gesmes:subject>
    <gesmes:Sender>
        <gesmes:name>European Central Bank</gesmes:name>
    </gesmes:Sender>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</gesmes:Envelope>

When I try to get the first Cube tag, I get None:

t = ElementTree()
t.parse('eurofxref-daily.xml')
print t.find('Cube')

None

The root tag includes the namespace:

root = t.getroot()
print 'root.tag:', root.tag

root.tag: {http://www.gesmes.org/xml/2002-08-01}Envelope

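So do its children:

# Iterating over the element directly yields its children
# (getchildren() is deprecated in newer Python versions).
for e in root:
    print 'e.tag:', e.tag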

e.tag: {http://www.gesmes.org/xml/2002-08-01}subject
e.tag: {http://www.gesmes.org/xml/2002-08-01}Sender
e.tag: {http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube

I can get the Cube tags if I include the full namespace URI in each tag:

day = t.find('{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube/{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube')
print 'Day: ', day.attrib['time']

Day:  2013-12-20

But that is really ugly. Apart from cleaning the file before processing or doing string manipulation, is there an elegant way to handle it?

  • What do you mean by "defined namespaces that do not exist"? The URIs do not have to exist on the web. Commented Dec 22, 2013 at 13:54
  • @Mark I mean that those URIs return a 404 not found. If that is not a problem then the problem is another one. Commented Dec 22, 2013 at 16:03
  • That is not a problem for namespaces. Including the namespace is the correct way to do it, as per the tutorial for lxml (which is a superset of ElementTree), and this way of using namespaces is much nicer than what I see in other XML APIs. Commented Dec 22, 2013 at 17:17

1 Answer


There's a more elegant way than including the whole namespace URI in the text of the query. For a Python version that does not support the namespaces argument of ElementTree.find, lxml provides the missing functionality and is "mostly compatible" with xml.etree:

from lxml.etree import ElementTree

t = ElementTree()
t.parse('eurofxref-daily.xml')
namespaces = { "exr": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref" }
day = t.find('exr:Cube', namespaces)
print day

You define the namespaces dictionary once, and from then on you use just the prefixes in your queries.

Here is the output:

$ python test.py
<Element '{http://www.ecb.int/vocabulary/2002-08-01/eurofxref}Cube' at 0x7fe0f95e3290>

If you find prefixes inelegant, then you have to work on a file without namespaces. There may be other tools out there that "cheat" and match on local-name() even when namespaces are in effect, but I don't use them.
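As one illustration of matching on the local name only: as far as I know, xml.etree in Python 3.8 and later accepts a {*} wildcard in find paths, which matches a tag in any namespace (or in none). This is a minimal sketch, assuming Python 3.8+ and an inline, trimmed copy of the XML from the question. The variable names are just for illustration, and this will not work on 2.6 or 2.7.

```python
import xml.etree.ElementTree as ET

# Inline, trimmed copy of the namespaced file from the question
# (XML declaration omitted; illustrative only).
XML = """
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
                 xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
    <gesmes:subject>Reference rates</gesmes:subject>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</gesmes:Envelope>
"""

root = ET.fromstring(XML)

# "{*}Cube" matches an element named Cube regardless of its namespace
# (requires Python 3.8 or later).
day = root.find("{*}Cube/{*}Cube")
currencies = [c.get("currency") for c in day.findall("{*}Cube")]

print("Day:", day.get("time"))
print(currencies)
```

The trade-off is that the wildcard ignores the namespace entirely, so it would also match a Cube element from an unrelated vocabulary if one appeared at the same position.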

In Python 2.7, or in Python 3.3 and later, you can use the same code with xml.etree instead of lxml, because these versions support the namespaces argument.
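Here is a minimal sketch of the prefix approach with xml.etree only, written for Python 3. The XML is an inline, trimmed copy of the file from the question, and the names NAMESPACES and rates are just illustrative. The exr prefix is arbitrary; it only has to match the key used in the dictionary.

```python
import xml.etree.ElementTree as ET

# Inline, trimmed copy of the namespaced file from the question
# (XML declaration omitted; illustrative only).
XML = """
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
                 xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
    <gesmes:subject>Reference rates</gesmes:subject>
    <Cube>
        <Cube time='2013-12-20'>
            <Cube currency='USD' rate='1.3655'/>
            <Cube currency='JPY' rate='142.66'/>
        </Cube>
    </Cube>
</gesmes:Envelope>
"""

# Map a prefix of our own choosing to the namespace URI, once.
NAMESPACES = {"exr": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref"}

root = ET.fromstring(XML)

# The path is relative to the root element (Envelope).
day = root.find("exr:Cube/exr:Cube", NAMESPACES)

# Attributes are not namespaced here, so plain names work for them.
rates = {c.get("currency"): c.get("rate")
         for c in day.findall("exr:Cube", NAMESPACES)}

print("Day:", day.get("time"))
print(rates)
```

Note that the default namespace of the document (the one declared with plain xmlns=) still needs a prefix in the query, which is why every Cube step is written as exr:Cube.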


6 Comments

Which version of the XML did you use? Note that the problem occurs with the second, unmodified version, the one with namespaces. You got the root Cube, which has no attributes. Try find('Cube/Cube').
find(match, namespaces) is new in Python 3.3
Actually, the namespaces argument is also available in Python 2.7, which is what I used to test my answer before posting it. But you used the 2.6 tag in your question, so I've updated my answer to take that into account.
lxml is not included in 2.6, but your answer is good enough, so I'm accepting it.
