Parsing nested and complex XML in Python

Question

I am trying to parse a fairly complex XML file and store its content in a DataFrame. I tried xml.etree.ElementTree and managed to retrieve some elements, but each one was somehow retrieved multiple times, as if there were more objects. I am trying to extract the following: category, created, last_updated, accession type, name type identifier, and name type synonym as a list.

<cellosaurus>
<cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12" entry_version="6">
  <accession-list>
    <accession type="primary">CVCL_B375</accession>
  </accession-list>
  <name-list>
    <name type="identifier">#490</name>
    <name type="synonym">490</name>
    <name type="synonym">Mab 7</name>
    <name type="synonym">Mab7</name>
  </name-list>
  <comment-list>
    <comment category="Monoclonal antibody target"> Cronartium ribicola antigens </comment>
    <comment category="Monoclonal antibody isotype"> IgM, kappa </comment>
  </comment-list>
  <species-list>
    <cv-term terminology="NCBI-Taxonomy" accession="10090">Mus musculus</cv-term>
  </species-list>
  <derived-from>
    <cv-term terminology="Cellosaurus" accession="CVCL_4032">P3X63Ag8.653</cv-term>
  </derived-from>
  <reference-list>
    <reference resource-internal-ref="Patent=US5616470"/>
  </reference-list>
  <xref-list>
    <xref database="CLO" category="Ontologies" accession="CLO_0001018">
      <url><![CDATA[https://www.ebi.ac.uk/ols/ontologies/clo/terms?iri=http://purl.obolibrary.org/obo/CLO_0001018]]></url>
    </xref>
    <xref database="ATCC" category="Cell line collections" accession="HB-12029">
      <url><![CDATA[https://www.atcc.org/Products/All/HB-12029.aspx]]></url>
    </xref>
    <xref database="Wikidata" category="Other" accession="Q54422073">
      <url><![CDATA[https://www.wikidata.org/wiki/Q54422073]]></url>
    </xref>
  </xref-list>
</cell-line>
</cellosaurus>

Nitul · Accepted Answer · 2020-10-10 19:44:27Z

3

Your question is a little unclear, since in some cases you want to parse tag attributes and in others tag values.

My understanding is as follows. You want the following values:

Value of the attribute category of the tag cell-line.
Value of the attribute created of the tag cell-line.
Value of the attribute last_updated of the tag cell-line.
Value of the attribute type of the tag accession.
The text of the name tag whose type attribute is identifier.
The text of each name tag whose type attribute is synonym.

These values may be extracted from the XML file using the module xml.etree.ElementTree. In particular, look at the findall and iter methods of the Element class.

Assuming that the XML is in a file called cellosaurus.xml, the following snippet should do the trick.

import xml.etree.ElementTree as et


def main():
    tree = et.parse('cellosaurus.xml')
    root = tree.getroot()

    results = []
    for element in root.findall('.//cell-line'):
        # Element.get returns None for a missing attribute instead of
        # raising KeyError, so entries lacking an attribute are still kept.
        key_values = {key: element.get(key)
                      for key in ['category', 'created', 'last_updated']}
        synonyms = []
        for child in element.iter():
            if child.tag == 'accession':
                key_values['accession type'] = child.get('type')
            elif child.tag == 'name' and child.get('type') == 'identifier':
                key_values['name type identifier'] = child.text
            elif child.tag == 'name' and child.get('type') == 'synonym':
                # An entry may have several synonyms; collect all of them.
                synonyms.append(child.text)
        results.append([
            key_values.get('category'),
            key_values.get('created'),
            key_values.get('last_updated'),
            key_values.get('accession type'),
            key_values.get('name type identifier'),
            synonyms or None,  # None when the entry has no synonyms
        ])

    print(results)


if __name__ == '__main__':
    main()

edited Oct 10, 2020 at 19:44

answered Oct 10, 2020 at 18:29


5 Comments

szuszfol Over a year ago

Thank you, and sorry if I was a bit unclear. I was looking for both tag attributes and tag values, and you understood me correctly. However, the solution does not work for many of the cell-line entries. The XML I am working with is huge and I only posted one example, so there are many entries of the same object. Any ideas why the solution is not working for multiple objects?

Nitul Over a year ago

If the XML file you are working with has the same structure as the example you provided, then this solution should work irrespective of the size of the file. For the cases where you are not getting the desired values, perhaps the structure of the XML is not the same. Please provide an example of such cases.

szuszfol Over a year ago

It is very difficult to paste such a big example. If you are still willing to help, please have a look here: ftp.expasy.org/databases/cellosaurus, where the XML file is posted.

Nitul Over a year ago

I have updated the answer given that the file that you are working with contains multiple 'cell-line' tags. Please note that certain 'cell-line' elements in the file do not have child tags of the form name with an attribute type and value synonym. In such cases the corresponding entry in the list will be None. The result is now a list of lists.

szuszfol Over a year ago

I checked with one property, 'sex', which is also in <cell-line>, and unfortunately it does not work. It complains when an object does not have this property. Example: <cell-line category="Transformed cell line" created="2012-10-22" last_updated="2019-05-24" entry_version="12" sex="Female" age="Age unspecified">. Also, there is often more than one synonym property. How do I list all of them? I tried creating a list and appending the child, but that did not work.
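
A minimal sketch of one way to handle both points raised in this comment, using only xml.etree.ElementTree. It assumes optional attributes such as sex should become None when absent and that every synonym should be kept. The two entries, their names, and the FAKE_ accessions are made up for illustration and are not taken from the real Cellosaurus file.

```python
import xml.etree.ElementTree as ET

# Two hypothetical entries (invented for this sketch): the first carries the
# optional 'sex' attribute, the second lacks 'category', 'sex' and synonyms.
SAMPLE = """
<cellosaurus>
  <cell-line category="Transformed cell line" created="2012-10-22" sex="Female">
    <accession-list><accession type="primary">FAKE_0001</accession></accession-list>
    <name-list>
      <name type="identifier">Example-1</name>
      <name type="synonym">Ex1</name>
      <name type="synonym">Ex 1</name>
    </name-list>
  </cell-line>
  <cell-line created="2012-06-06">
    <accession-list><accession type="primary">FAKE_0002</accession></accession-list>
    <name-list><name type="identifier">Example-2</name></name-list>
  </cell-line>
</cellosaurus>
"""


def extract_rows(root):
    """Return one dict per <cell-line>, tolerating missing attributes."""
    rows = []
    for cell_line in root.iter("cell-line"):
        accession = cell_line.find("./accession-list/accession")
        identifier = cell_line.find("./name-list/name[@type='identifier']")
        synonyms = cell_line.findall("./name-list/name[@type='synonym']")
        rows.append({
            # Element.get returns None instead of raising KeyError.
            "category": cell_line.get("category"),
            "sex": cell_line.get("sex"),
            "accession type": accession.get("type") if accession is not None else None,
            "name type identifier": identifier.text if identifier is not None else None,
            # findall returns every match, so all synonyms are kept.
            "name type synonym": [s.text for s in synonyms],
        })
    return rows


rows = extract_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

The key difference from indexing element.attrib[...] is that Element.get never raises for a missing attribute, and findall with an attribute predicate returns a list, so an entry without synonyms simply yields an empty list.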

Jack Fleeting · Accepted Answer · 2020-10-10 18:46:13Z

2

The simplest way to parse xml is, IMHO, using lxml.

from lxml import etree
data = """[your xml above]"""
doc = etree.XML(data)
for att in doc.xpath('//cell-line'):
    print(att.attrib['category'])
    print(att.attrib['last_updated'])
    print(att.xpath('.//accession/@type')[0])
    print(att.xpath('.//name[@type="identifier"]/text()')[0])
    print(att.xpath('.//name[@type="synonym"]/text()'))

Output:

Hybridoma
2020-03-12
primary
#490
['490', 'Mab 7', 'Mab7']

You can then assign the outputs to variables, append to list, etc.
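
As a rough sketch of that last step, the values can be collected into one dict per cell-line, which is a shape pandas.DataFrame accepts directly. To keep the example runnable without extra installs it uses the standard library's ElementTree, whose elements offer the same find, findall, and get methods as lxml elements. The column names and the to_row helper are illustrative choices, and pandas is treated as optional.

```python
import xml.etree.ElementTree as ET

try:
    import pandas as pd  # optional; only needed for the DataFrame step
except ImportError:
    pd = None

# Trimmed copy of the sample entry from the question.
SAMPLE = """<cellosaurus>
<cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12" entry_version="6">
  <accession-list>
    <accession type="primary">CVCL_B375</accession>
  </accession-list>
  <name-list>
    <name type="identifier">#490</name>
    <name type="synonym">490</name>
    <name type="synonym">Mab 7</name>
    <name type="synonym">Mab7</name>
  </name-list>
</cell-line>
</cellosaurus>"""


def to_row(cell_line):
    """Collect the five values printed by the loop above into one dict."""
    accession = cell_line.find(".//accession")
    identifier = cell_line.find(".//name[@type='identifier']")
    synonyms = cell_line.findall(".//name[@type='synonym']")
    return {
        "category": cell_line.get("category"),
        "last_updated": cell_line.get("last_updated"),
        "accession_type": accession.get("type") if accession is not None else None,
        "identifier": identifier.text if identifier is not None else None,
        "synonyms": [s.text for s in synonyms],
    }


rows = [to_row(cl) for cl in ET.fromstring(SAMPLE).iter("cell-line")]
print(rows)

# One row per cell-line; the 'synonyms' column holds Python lists.
frame = pd.DataFrame(rows) if pd is not None else None
if frame is not None:
    print(frame)
```

Keeping the synonyms as a list in a single column matches what the question asks for; joining them into one string is an alternative if the table has to be written to a flat format.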

answered Oct 10, 2020 at 18:46


4 Comments

szuszfol Over a year ago

Thanks! This is very neat. However, how do I handle situations where an object does not have a specific property? Let's say a category is missing?

Jack Fleeting Over a year ago

@szuszfol What do you mean by "category is missing"? That the attribute category isn't present in the <cell-line> node, or that its attribute value is empty (<cell-line category="">)?

szuszfol Over a year ago

For example, one object would have the following tag: <cell-line created="2012-06-06" last_updated="2020-03-12" entry_version="6">, but there are other objects that look like this: <cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12" entry_version="6">. So where 'category' is missing, an empty value or None would have to be specified.

Jack Fleeting Over a year ago

@szuszfol There are a couple of ways of handling it, but if you suspect that an attribute (like category) is missing, you need to wrap the relevant statement in a try/except block:

try:
    cat = att.attrib['category']
except KeyError:
    cat = "none"
print(cat)
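
One of the other ways is Element.get, which both lxml and xml.etree.ElementTree provide. The sketch below uses the standard library and three made-up cell-line variants to show the two cases distinguished earlier in this thread: an absent attribute and an empty one. The category_of helper and the "none" default are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Three hypothetical variants: attribute present, empty, and absent.
SAMPLE = """<cellosaurus>
  <cell-line category="Hybridoma" created="2012-06-06"/>
  <cell-line category="" created="2012-06-06"/>
  <cell-line created="2012-06-06"/>
</cellosaurus>"""


def category_of(cell_line, default="none"):
    # get() returns None when the attribute is absent and "" when it is
    # present but empty; `or default` maps both cases to the default.
    return cell_line.get("category") or default


categories = [category_of(cl) for cl in ET.fromstring(SAMPLE).iter("cell-line")]
print(categories)  # ['Hybridoma', 'none', 'none']
```

If an empty attribute should be kept distinct from a missing one, cell_line.get("category", default) does that: it only falls back to the default when the attribute is absent.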

the_train · Accepted Answer · 2020-10-11 04:42:03Z

Another method: I recently compared several XML parsing libraries and found simplified_scrapy easy to use, so I recommend it.

from simplified_scrapy import SimplifiedDoc, utils

xml = '''your xml above'''
# xml = utils.getFileContent('your file name.xml')

results = []
doc = SimplifiedDoc(xml)
for ele in doc.selects('cell-line'):
  key_values = {}
  for k in ele:
    if k not in ['tag','html']:
      key_values[k]=ele[k]
  key_values['name type identifier'] = ele.select('name@type="identifier">text()')
  key_values['name type synonym'] = ele.selects('name@type="synonym">text()')
  results.append(key_values)
print (results)

Result:

[{'category': 'Hybridoma', 'created': '2012-06-06', 'last_updated': '2020-03-12', 'entry_version': '6', 'name type identifier': '#490', 'name type synonym': ['490', 'Mab 7', 'Mab7']}]
