3

I am trying to read the XML file and convert it to pandas. However it returns empty data

This is the sample of xml structure:

<Instance ID="1">
<MetaInfo StudentID ="DTSU040" TaskID="LP03_PR09.bLK.sh"  DataSource="DeepTutorSummer2014"/>
<ProblemDescription>A car windshield collides with a mosquito, squashing it.</ProblemDescription>
<Question>How does this work tion?</Question>
<Answer>tthis is my best  </Answer>
<Annotation Label="correct(0)|correct_but_incomplete(1)|contradictory(0)|incorrect(0)">
<AdditionalAnnotation ContextRequired="0" ExtraInfoInAnswer="0"/>
<Comments Watch="1"> The student forgot to tell the opposite force. Opposite means opposite direction, which is important here. However, one can argue that the opposite is implied. See the reference answers.</Comments>
</Annotation>
<ReferenceAnswers>
1:  Since the windshield exerts a force on the mosquito, which we can call action, the mosquito exerts an equal and opposite force on the windshield, called the reaction.

</ReferenceAnswers>
</Instance>

I have tried this code, however it's not working on my side. It returns empty dataframe.

import pandas as pd 
import xml.etree.ElementTree as et 

xtree = et.parse("grade_data.xml")
xroot = xtree.getroot() 

df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", 'Question', 'Answer',
           'ContextRequired', 'ExtraInfoInAnswer', 'Comments', 'Watch', 'ReferenceAnswers']
rows = []


for node in xroot: 
    s_name = node.attrib.get("ID")
    s_student = node.find("StudentID") 
    s_task = node.find("TaskID") 
    s_source = node.find("DataSource") 
    s_desc = node.find("ProblemDescription") 
    s_question = node.find("Question") 
    s_ans = node.find("Answer") 
    s_label = node.find("Label") 
    s_contextrequired = node.find("ContextRequired") 
    s_extraInfoinAnswer = node.find("ExtraInfoInAnswer")
    s_comments = node.find("Comments") 
    s_watch = node.find("Watch") 
    s_referenceAnswers = node.find("ReferenceAnswers") 


    rows.append({"ID": s_name,"StudentID":s_student, "TaskID": s_task, 
                 "DataSource": s_source, "ProblemDescription": s_desc , 
                 "Question": s_question , "Answer": s_ans ,"Label": s_label,
                 "s_contextrequired": s_contextrequired , "ExtraInfoInAnswer": s_extraInfoinAnswer ,
                 "Comments": s_comments ,  "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers, 

                })

out_df = pd.DataFrame(rows, columns = df_cols)
1

2 Answers 2

2

The problem in your solution was that the "element data extraction" was not done properly. The xml you mentioned in the question is nested in several layers. And that is why we need to recursively read and extract the data. The following solution should give you what you need in this case. Although I would encourage you to look at this article and the python documentation for more clarity.

Method: 1

import numpy as np
import pandas as pd
#import os
import xml.etree.ElementTree as ET

def xml2df(xml_source, df_cols, source_is_file = False, show_progress=True): 
    """Parse the input XML source and store the result in a pandas 
    DataFrame with the given columns. 

    For xml_source = xml_file, Set: source_is_file = True
    For xml_source = xml_string, Set: source_is_file = False

    <element attribute_key1=attribute_value1, attribute_key2=attribute_value2>
        <child1>Child 1 Text</child1>
        <child2>Child 2 Text</child2>
        <child3>Child 3 Text</child3>
    </element>
    Note that for an xml structure as shown above, the attribute information of 
    element tag can be accessed by list(element). Any text associated with <element> tag can be accessed
    as element.text and the name of the tag itself can be accessed with
    element.tag.
    """
    if source_is_file:
        xtree = ET.parse(xml_source) # xml_source = xml_file
        xroot = xtree.getroot()
    else:
        xroot = ET.fromstring(xml_source) # xml_source = xml_string
    consolidator_dict = dict()
    default_instance_dict = {label: None for label in df_cols}

    def get_children_info(children, instance_dict):
        # We avoid using element.getchildren() as it is deprecated.
        # Instead use list(element) to get a list of attributes.
        for child in children:
            #print(child)
            #print(child.tag)
            #print(child.items())
            #print(child.getchildren()) # deprecated method
            #print(list(child))
            if len(list(child))>0:
                instance_dict = get_children_info(list(child), 
                                                  instance_dict)

            if len(list(child.keys()))>0:
                items = child.items()
                instance_dict.update({key: value for (key, value) in items})             

            #print(child.keys())
            instance_dict.update({child.tag: child.text})
        return instance_dict

    # Loop over all instances
    for instance in list(xroot):
        instance_dict = default_instance_dict.copy()           
        ikey, ivalue = instance.items()[0] # The first attribute is "ID"
        instance_dict.update({ikey: ivalue}) 
        if show_progress:
            print('{}: {}={}'.format(instance.tag, ikey, ivalue))
        # Loop inside every instance
        instance_dict = get_children_info(list(instance), 
                                          instance_dict)   

        #consolidator_dict.update({ivalue: instance_dict.copy()}) 
        consolidator_dict[ivalue] = instance_dict.copy()       
    df = pd.DataFrame(consolidator_dict).T 
    df = df[df_cols]

    return df

Run the following to generate the desired output.

xml_source = r'grade_data.xml'
df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", "Question", "Answer",
           "ContextRequired", "ExtraInfoInAnswer", "Comments", "Watch", 'ReferenceAnswers']

df = xml2df(xml_source, df_cols, source_is_file = True)
df

Method: 2

Given you have the xml_string, you could convert xml >> dict >> dataframe. run the following to get the desired output.

Note: You will need to install xmltodict to use Method-2. This method is inspired by the solution suggested by @martin-blech at How to convert XML to JSON in Python? [duplicate] . Kudos to @martin-blech for making it.

pip install -U xmltodict

Solution

def read_recursively(x, instance_dict):  
    #print(x)
    txt = ''
    for key in x.keys():
        k = key.replace("@","")
        if k in df_cols: 
            if isinstance(x.get(key), dict):
                instance_dict, txt = read_recursively(x.get(key), instance_dict)
            #else:                
            instance_dict.update({k: x.get(key)})
            #print('{}: {}'.format(k, x.get(key)))
        else:
            #print('else: {}: {}'.format(k, x.get(key)))
            # dig deeper if value is another dict
            if isinstance(x.get(key), dict):
                instance_dict, txt = read_recursively(x.get(key), instance_dict)                
            # add simple text associated with element
            if k=='#text':
                txt = x.get(key)
        # update text to corresponding parent element    
        if (k!='#text') and (txt!=''):
            instance_dict.update({k: txt})
    return (instance_dict, txt)

You will need the function read_recursively() given above. Now run the following.

import xmltodict, json

o = xmltodict.parse(xml_string) # INPUT: XML_STRING
#print(json.dumps(o)) # uncomment to see xml to json converted string

consolidated_dict = dict()
oi = o['Instances']['Instance']

for x in oi:
    instance_dict = dict()
    instance_dict, _ = read_recursively(x, instance_dict)
    consolidated_dict.update({x.get("@ID"): instance_dict.copy()})
df = pd.DataFrame(consolidated_dict).T
df = df[df_cols]
df
Sign up to request clarification or add additional context in comments.

2 Comments

That is impressive. Thank you
@mzjn Thank you for pointing two things I could improve in the solution. 1. I was not aware of the fact that element.getchildren() is deprecated. I will update it with list(element) in the solution. 2. About parsing: yes, you are correct that the xml-parser has already parsed the DOM and created the tree; which is why we can get the root and other elements. What I wanted to mean was "element data extraction" was not done properly. @HaniIhlayyle The xml you mentioned in the question is nested in several layers. And that is why we need to recursively read and extract the data.
1

Several issues:

  • Calling .find on the loop variable, node, expects a child node to exist: current_node.find('child_of_current_node'). However, since all the nodes are the children of root they do not maintain their own children, so no loop is required;
  • Not checking NoneType that can result from missing nodes with find() and prevents retrieving .tag or .text or other attributes;
  • Not retrieving node content with .text, otherwise the <Element... object is returned;

Consider this adjustment using the ternary condition expression a if condition else b to ensure variable has a value regardless:

rows = []

s_name = xroot.attrib.get("ID")
s_student = xroot.find("StudentID").text if xroot.find("StudentID") is not None else None
s_task = xroot.find("TaskID").text if xroot.find("TaskID") is not None else None      
s_source = xroot.find("DataSource").text if xroot.find("DataSource") is not None else None
s_desc = xroot.find("ProblemDescription").text if xroot.find("ProblemDescription") is not None else None
s_question = xroot.find("Question").text if xroot.find("Question") is not None else None    
s_ans = xroot.find("Answer").text if xroot.find("Answer") is not None else None
s_label = xroot.find("Label").text if xroot.find("Label") is not None else None
s_contextrequired = xroot.find("ContextRequired").text if xroot.find("ContextRequired") is not None else None
s_extraInfoinAnswer = xroot.find("ExtraInfoInAnswer").text if xroot.find("ExtraInfoInAnswer") is not None else None
s_comments = xroot.find("Comments").text if xroot.find("Comments") is not None else None
s_watch = xroot.find("Watch").text if xroot.find("Watch") is not None else None
s_referenceAnswers = xroot.find("ReferenceAnswers").text if xroot.find("ReferenceAnswers") is not None else None

rows.append({"ID": s_name,"StudentID":s_student, "TaskID": s_task, 
             "DataSource": s_source, "ProblemDescription": s_desc , 
             "Question": s_question , "Answer": s_ans ,"Label": s_label,
             "s_contextrequired": s_contextrequired , "ExtraInfoInAnswer": s_extraInfoinAnswer ,
             "Comments": s_comments ,  "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers     
            })

out_df = pd.DataFrame(rows, columns = df_cols)

Alternatively, run a more dynamic version assigning to an inner dictionary using the iterator variable:

rows = []
for node in xroot: 
    inner = {}
    inner[node.tag] = node.text

    rows.append(inner)

out_df = pd.DataFrame(rows, columns = df_cols)

Or list/dict comprehension:

rows = [{node.tag: node.text} for node in xroot]
out_df = pd.DataFrame(rows, columns = df_cols)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.