I am trying to parse a fairly complex XML file and store its content in a dataframe. I tried xml.etree.ElementTree and managed to retrieve some elements, but each one came back multiple times, as if there were more objects than there really are. I am trying to extract the following for each cell-line: category, created, last_updated, the accession type, the name of type identifier, and the names of type synonym as a list.

<cellosaurus>
<cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12" entry_version="6">
  <accession-list>
    <accession type="primary">CVCL_B375</accession>
  </accession-list>
  <name-list>
    <name type="identifier">#490</name>
    <name type="synonym">490</name>
    <name type="synonym">Mab 7</name>
    <name type="synonym">Mab7</name>
  </name-list>
  <comment-list>
    <comment category="Monoclonal antibody target"> Cronartium ribicola antigens </comment>
    <comment category="Monoclonal antibody isotype"> IgM, kappa </comment>
  </comment-list>
  <species-list>
    <cv-term terminology="NCBI-Taxonomy" accession="10090">Mus musculus</cv-term>
  </species-list>
  <derived-from>
    <cv-term terminology="Cellosaurus" accession="CVCL_4032">P3X63Ag8.653</cv-term>
  </derived-from>
  <reference-list>
    <reference resource-internal-ref="Patent=US5616470"/>
  </reference-list>
  <xref-list>
    <xref database="CLO" category="Ontologies" accession="CLO_0001018">
      <url><![CDATA[https://www.ebi.ac.uk/ols/ontologies/clo/terms?iri=http://purl.obolibrary.org/obo/CLO_0001018]]></url>
    </xref>
    <xref database="ATCC" category="Cell line collections" accession="HB-12029">
      <url><![CDATA[https://www.atcc.org/Products/All/HB-12029.aspx]]></url>
    </xref>
    <xref database="Wikidata" category="Other" accession="Q54422073">
      <url><![CDATA[https://www.wikidata.org/wiki/Q54422073]]></url>
    </xref>
  </xref-list>
</cell-line>
</cellosaurus>

3 Answers

Your question is a little unclear, because in some cases you want to parse tag attributes and in others you want the text inside a tag.

My understanding is as follows. You want the following values:

  1. Value of the attribute category of the tag cell-line.
  2. Value of the attribute created of the tag cell-line.
  3. Value of the attribute last_updated of the tag cell-line.
  4. Value of the attribute type of the tag accession.
  5. The text of the name tag whose type attribute is identifier.
  6. The text of each name tag whose type attribute is synonym.

These values may be extracted from the XML file using the module xml.etree.ElementTree. In particular, look at the findall and iter methods of the Element class.
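As a rough illustration of how those two methods differ, here is a minimal sketch on a trimmed copy of the sample from the question. The variable names are only for illustration, and it assumes the element names match the posted XML.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question.
SAMPLE = """<cellosaurus>
<cell-line category="Hybridoma" created="2012-06-06">
  <name-list>
    <name type="identifier">#490</name>
    <name type="synonym">490</name>
    <name type="synonym">Mab 7</name>
  </name-list>
</cell-line>
</cellosaurus>"""

root = ET.fromstring(SAMPLE)
cell_line = root.find("cell-line")

# Attributes live on the element itself; get() returns None if one is absent.
category = cell_line.get("category")

# findall() accepts a limited XPath subset, including [@attr='value'] filters.
synonyms = [n.text for n in cell_line.findall(".//name[@type='synonym']")]

# iter() walks the element and all of its descendants that have the given tag.
name_types = [n.get("type") for n in cell_line.iter("name")]

print(category, synonyms, name_types)
```

Tag text comes from the text property, while attribute values come from get() or the attrib mapping, which is the distinction drawn in the list above.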

Assuming that the XML is in a file called cellosaurus.xml, the following snippet should do the trick.

import xml.etree.ElementTree as et


def main():
    tree = et.parse('cellosaurus.xml')
    root = tree.getroot()

    results = []
    for element in root.findall('.//cell-line'):
        key_values = {}
        for key in ['category', 'created', 'last_updated']:
            # Element.get returns None when the attribute is absent,
            # whereas element.attrib[key] would raise a KeyError.
            key_values[key] = element.get(key)
        for child in element.iter():
            if child.tag == 'accession':
                key_values['accession type'] = child.get('type')
            elif child.tag == 'name' and child.get('type') == 'identifier':
                key_values['name type identifier'] = child.text
            elif child.tag == 'name' and child.get('type') == 'synonym':
                # Collect every synonym instead of overwriting the previous one.
                key_values.setdefault('name type synonym', []).append(child.text)
        results.append([
            # Using the get method of the dict object in case any particular
            # entry does not have all the required attributes.
            key_values.get('category'),
            key_values.get('created'),
            key_values.get('last_updated'),
            key_values.get('accession type'),
            key_values.get('name type identifier'),
            key_values.get('name type synonym'),
        ])

    print(results)


if __name__ == '__main__':
    main()
5 Comments

Thank you, and sorry if I was a bit unclear. I was looking for both tag attributes and tag values, and you understood me correctly. However, the solution does not work for many of the cell-lines. The XML I am working with is huge and I only posted one example, so there are many entries of the same object. Any ideas why the solution is not working for multiple objects?
If the XML file you are working with has the same structure as the example you have provided, then this solution should work, irrespective of the size of the file. For the cases where you are not getting the desired values, perhaps the structure of the XML is not the same. Please provide an example of such cases.
It is very difficult to paste such a big example. If you are still willing to help, please have a look here: ftp.expasy.org/databases/cellosaurus where the XML file is posted.
I have updated the answer given that the file that you are working with contains multiple 'cell-line' tags. Please note that certain 'cell-line' elements in the file do not have child tags of the form name with an attribute type and value synonym. In such cases the corresponding entry in the list will be None. The result is now a list of lists.
I checked with one property, 'sex', which is also an attribute of <cell-line>, and unfortunately it does not work. It complains when an object does not have this property. Example: <cell-line category="Transformed cell line" created="2012-10-22" last_updated="2019-05-24" entry_version="12" sex="Female" age="Age unspecified">. Also, there is often more than one synonym. How do I list all of them? I tried creating a list and appending the child, but that did not work.
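The two problems raised in the last comment (an optional attribute such as sex, and several synonyms per entry) could be handled roughly as follows. This is only a sketch. The second cell-line in SAMPLE borrows its attributes from the comment, but its accession and name are made up for illustration, and extract_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# First entry: trimmed from the question. Second entry: attributes from the
# comment above; its accession and name are invented placeholders.
SAMPLE = """<cellosaurus>
<cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12">
  <accession-list><accession type="primary">CVCL_B375</accession></accession-list>
  <name-list>
    <name type="identifier">#490</name>
    <name type="synonym">490</name>
    <name type="synonym">Mab 7</name>
  </name-list>
</cell-line>
<cell-line category="Transformed cell line" created="2012-10-22" last_updated="2019-05-24" sex="Female">
  <accession-list><accession type="primary">CVCL_X001</accession></accession-list>
  <name-list><name type="identifier">Example line</name></name-list>
</cell-line>
</cellosaurus>"""


def extract_rows(root):
    """Return one dict per cell-line; absent fields become None or []."""
    rows = []
    for cell_line in root.iter("cell-line"):
        accession = cell_line.find(".//accession")
        identifier = cell_line.find(".//name[@type='identifier']")
        rows.append({
            # get() returns None instead of raising when an attribute is absent.
            "category": cell_line.get("category"),
            "created": cell_line.get("created"),
            "last_updated": cell_line.get("last_updated"),
            "sex": cell_line.get("sex"),
            "accession type": accession.get("type") if accession is not None else None,
            "name type identifier": identifier.text if identifier is not None else None,
            # findall() returns every match, so all synonyms end up in one list.
            "name type synonym": [
                n.text for n in cell_line.findall(".//name[@type='synonym']")
            ],
        })
    return rows


rows = extract_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Because each row is a dict with the same keys, passing rows to pandas.DataFrame (assuming pandas is installed) should give one dataframe row per cell-line, with the synonyms kept as a list in a single column.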

The simplest way to parse xml is, IMHO, using lxml.

from lxml import etree
data = """[your xml above]"""
doc = etree.XML(data)
for att in doc.xpath('//cell-line'):
    print(att.attrib['category'])
    print(att.attrib['last_updated'])
    print(att.xpath('.//accession/@type')[0])
    print(att.xpath('.//name[@type="identifier"]/text()')[0])
    print(att.xpath('.//name[@type="synonym"]/text()'))

Output:

Hybridoma
2020-03-12
primary
#490
['490', 'Mab 7', 'Mab7']

You can then assign the outputs to variables, append to list, etc.

4 Comments

Thanks! This is very neat. However, how do I handle situations where an object does not have a specific property? Let's say category is missing?
@szuszfol What do you mean by "category is missing"? That the attribute category isn't present in the <cell-line> node, or that its attribute value is empty (<cell-line category="">)?
For example, one object would have the tag <cell-line created="2012-06-06" last_updated="2020-03-12" entry_version="6">, but there are other objects that look like this: <cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12" entry_version="6">. So where 'category' is missing, no value or None would have to be specified.
@szuszfol There are a couple of ways of handling it, but if you suspect that an attribute (like category) is missing, you need to wrap the relevant statement in a try/except block:
try:
    cat = att.attrib['category']
except KeyError:
    cat = "none"
print(cat)
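One of the other ways hinted at in that comment is to ask the element for the attribute with a default value. The sketch below uses the standard library's xml.etree.ElementTree so that it runs without extra packages. lxml elements expose the same get(key, default) method, so the loop should carry over unchanged. The two cell-line tags are the ones quoted in the comments, and the "none" default mirrors the try/except version.

```python
import xml.etree.ElementTree as ET

# The two <cell-line> variants quoted in the comments: one without category.
doc = ET.fromstring(
    '<cellosaurus>'
    '<cell-line created="2012-06-06" last_updated="2020-03-12" entry_version="6"/>'
    '<cell-line category="Hybridoma" created="2012-06-06" '
    'last_updated="2020-03-12" entry_version="6"/>'
    '</cellosaurus>'
)

# get() takes an optional default that is returned when the attribute is absent.
categories = [att.get("category", "none") for att in doc.iter("cell-line")]
print(categories)
```

Leaving out the second argument makes get() return None, which may be more convenient if the values are headed for a dataframe.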

Another method. I recently compared several XML parsing libraries and found simplified_scrapy easy to use, so I recommend it.

from simplified_scrapy import SimplifiedDoc, utils

xml = '''your xml above'''
# xml = utils.getFileContent('your file name.xml')

results = []
doc = SimplifiedDoc(xml)
for ele in doc.selects('cell-line'):
  key_values = {}
  for k in ele:
    if k not in ['tag','html']:
      key_values[k]=ele[k]
  key_values['name type identifier'] = ele.select('name@type="identifier">text()')
  key_values['name type synonym'] = ele.selects('name@type="synonym">text()')
  results.append(key_values)
print (results)

Result:

[{'category': 'Hybridoma', 'created': '2012-06-06', 'last_updated': '2020-03-12', 'entry_version': '6', 'name type identifier': '#490', 'name type synonym': ['490', 'Mab 7', 'Mab7']}]
