Parse XML into an array in Python

Question

I have XML like this:

<?xml version="1.0" ?>
<iq id="123" to="test" type="result">
    <query xmlns="jabber:iq:roster">
        <item jid="foo" subscription="both"/>
        <item jid="bar" subscription="both"/>
    </query>
</iq>

I would like to collect the jid attribute of each item into an array. I thought something like this would work:

import xml.etree.ElementTree as ET

myarr = []

xml = '<?xml version="1.0" ?><iq id="123" to="test" type="result"><query xmlns="jabber:iq:roster"><item jid="foo" subscription="both"/><item jid="bar" subscription="both"/></query></iq>'

root = ET.fromstring(xml)

for item in root.findall('query'):
    t = item.get('jid')
    myarr.append(t)
    print (t)

Community · Accepted Answer · 2017-05-23 12:04:04Z


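You need to handle namespaces: the query element declares xmlns="jabber:iq:roster", so it and its item children live in that namespace, and a plain 'query' path matches nothing. One option would be to paste the namespace into the XPath expression: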

for item in root.findall('.//{%(ns)s}query/{%(ns)s}item' % {'ns': 'jabber:iq:roster'}):
    t = item.attrib.get('jid')
    myarr.append(t)
    print (t)

Prints:

foo
bar
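
As another way to handle the namespace, findall also accepts a prefix-to-URI mapping as its second argument, which keeps the path shorter. A minimal sketch, assuming the same document as in the question; the prefix r is an arbitrary name chosen for this example, not something that appears in the XML:

```python
import xml.etree.ElementTree as ET

xml = ('<?xml version="1.0" ?><iq id="123" to="test" type="result">'
       '<query xmlns="jabber:iq:roster">'
       '<item jid="foo" subscription="both"/>'
       '<item jid="bar" subscription="both"/>'
       '</query></iq>')

root = ET.fromstring(xml)

# Map a local prefix (arbitrary, chosen for this sketch) to the namespace URI.
ns = {'r': 'jabber:iq:roster'}

# 'r:query/r:item' is expanded to the fully qualified names using the mapping.
myarr = [item.get('jid') for item in root.findall('r:query/r:item', ns)]
print(myarr)
```

The prefix only has to match between the mapping and the path; it does not need to match any prefix used in the document itself.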

2 Comments

Jonathan Eunice Over a year ago

While it's less general, the iterator could also be simplified to: root.findall('.//{jabber:iq:roster}item')

Tomas Bruckner Over a year ago

Thanks, I had no experience with namespaces in XML before. Now, it's quite clear

Jonathan Eunice · Accepted Answer · 2014-10-31 02:34:24Z

I endorse @alecxe's approach, which I will label "handle the namespaces." That is the most general and correct approach. Unfortunately, namespaces are often ugly and wordy, and they needlessly complicate XPath expressions.

For the many simple cases where namespaces are an artifact of the XML world's desire for über-precision and not truly necessary to identify the nodes in a document, a simpler "eliminate the namespaces" alternative allows more concise searches. The key routine is:

def strip_namespaces(tree):
    """
    Strip the namespaces from an ElementTree in order to make
    processing easier. Adapted from @nonagon's answer
    at http://stackoverflow.com/a/25920989/240490
    """
    for el in tree.iter():
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]  # strip namespaces
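        # Iterate over a copy, since el.attrib is modified inside the loop
        for k, v in list(el.attrib.items()):
            if '}' in k:
                newkey = k.split('}', 1)[1]
                el.attrib[newkey] = v
                del el.attrib[k]  # drop only the namespaced key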
    return tree

Then the program continues much as before, but without worrying about those pesky namespaces:

root = ET.fromstring(xml)
strip_namespaces(root)

for item in root.findall('.//item'):
    t = item.attrib.get('jid')
    myarr.append(t)
    print (t)

This approach is not suitable if you are trying to modify the ElementTree and re-emit XML, because the namespace information is discarded. But if you're just trying to deconstruct and grab data from the tree, it works well.
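
If the goal is only to ignore namespaces while searching, a variant that leaves the tree untouched may be worth considering. The sketch below assumes Python 3.8 or later, where ElementTree paths accept {*} as a wildcard meaning "any namespace"; it is an alternative to strip_namespaces, not part of the original answer:

```python
import xml.etree.ElementTree as ET

xml = ('<?xml version="1.0" ?><iq id="123" to="test" type="result">'
       '<query xmlns="jabber:iq:roster">'
       '<item jid="foo" subscription="both"/>'
       '<item jid="bar" subscription="both"/>'
       '</query></iq>')

root = ET.fromstring(xml)

# '{*}item' matches item elements in any namespace (Python 3.8+ only).
jids = [item.get('jid') for item in root.findall('.//{*}item')]
print(jids)
```

Because nothing in the tree is rewritten, the document can still be serialized with its namespaces intact afterwards.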
