I am trying to import data from an XML file that contains breath-by-breath data from an exercise test. The XML structure is as follows (simplified to show the general structure):
<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<Worksheet ss:Name="MetasoftStudio">
<Table ss:ExpandedColumnCount="21" ss:ExpandedRowCount="458" x:FullColumns="1" x:FullRows="1" ss:StyleID="s62" ss:DefaultColumnWidth="53">
<Column ss:StyleID="s62" ss:AutoFitWidth="0" ss:Width="137"/>
<Column ss:StyleID="s62" ss:AutoFitWidth="0" ss:Width="97"/>
<Column ss:StyleID="s62" ss:AutoFitWidth="0" ss:Width="137"/>
<Row ss:AutoFitHeight="0" ss:Height="26">
<Cell ss:StyleID="Default"><Data ss:Type="String">t</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">Phase</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">Marker</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">V'O2</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">V'O2/kg</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">V'O2/HR</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">HR</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">WR</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">V'E/V'O2</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">V'E/V'CO2</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">RER</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">V'E</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">BF</Data></Cell>
</Row>
<Row ss:Height="15">
<Cell ss:StyleID="Default"><Data ss:Type="String">h:mm:ss</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">L/min</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">ml/min/kg</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">ml</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">/min</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">W</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">L/min</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">/min</Data></Cell>
</Row>
<Row ss:Height="15">
<Cell ss:StyleID="Default"><Data ss:Type="String">0:00:06</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String">Rest</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="String"></Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">0.27972413565454501</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">4.3706896196022598</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">4.5856415681072953</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">61</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">0</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">27.002532271037801</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">26.4113108545688</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">1.0223851598932201</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">10.155340000000001</Data></Cell>
<Cell ss:StyleID="Default"><Data ss:Type="Number">18.07</Data></Cell>
</Row>
</Table>
</Worksheet>
</Workbook>
I used lxml to parse the XML file and iterate over it, extracting the 'Data' element in each 'Cell' and appending it to a list, then appending that list to a parent list (giving me a nested list with one inner list per row), using this code:
from lxml import etree, objectify
import pandas as pd

with open('Python/cortex.xml') as infile:
    xml_file = infile.read()

root = objectify.fromstring(xml_file)
header = []
data = []

for row in root.Worksheet.Table.getchildren():
    temp_row = []
    if not row.tag == '{urn:schemas-microsoft-com:office:spreadsheet}Column':
        for cell in row.getchildren():
            temp_row.append(cell.Data)
        data.append(temp_row)

header = data.pop(0)  # remove the first 'row' and store in header list
del data[0]  # remove 2nd line of superfluous data
The first row gives the headers, hence I pop that into its own list, and row 2 contains the units for each variable, so I just get rid of that. All working well so far (or so it seemed)...
Now I need to get it into a pandas DataFrame to start working with it. If I run df = pd.DataFrame(data, columns=header) and then print(df), I get:
ValueError: Buffer has wrong number of dimensions (expected 1, got 32)
OK, I'm not sure what happened there. If I create the DataFrame without assigning the header and print that, I get:
0 1 2 3 \
0 [[[0:00:06]]] [[[Rest]]] [[[]]] [[[0.279724135654545]]]
1 [[[0:00:09]]] [[[Rest]]] [[[]]] [[[0.465136232899829]]]
2 [[[0:00:13]]] [[[Rest]]] [[[]]] [[[0.357975433456662]]]
3 [[[0:00:19]]] [[[Rest]]] [[[]]] [[[0.543332419057909]]]
4 [[[0:00:24]]] [[[Rest]]] [[[]]] [[[0.374604578743889]]]
That doesn't look right! Where did all these lists in lists in lists come from? If I iterate over the nested list data and print it, it prints perfectly, but once I try to convert it to a DataFrame something goes wrong.
Can anyone enlighten me as to what has happened and how I can get the data into a pandas DataFrame? If there is a better method than the one I've used, I am happy to give it a go.
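One possible direction, as a minimal sketch rather than a definitive fix: the nesting appears to come from appending cell.Data itself, which is an lxml objectify element and not a plain string or number, so pandas may try to unpack it as a sequence. Collecting plain values (for example cell.Data.text in lxml) should avoid that. The sketch below swaps in the standard library's xml.etree.ElementTree instead of lxml so that it is self-contained. SAMPLE, cell_value and read_rows are hypothetical names, and SAMPLE is a trimmed-down stand-in for the real file. It also assumes every Cell contains a Data element; a Cell without one would be skipped and shift the columns.

```python
import xml.etree.ElementTree as ET

# Namespace used by the SpreadsheetML elements and by the ss: attributes.
SS_URI = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS_URI}
TYPE_ATTR = "{" + SS_URI + "}Type"

# Hypothetical, trimmed-down sample standing in for the real export.
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Worksheet ss:Name="MetasoftStudio">
<Table>
<Column ss:Width="137"/>
<Row>
<Cell><Data ss:Type="String">t</Data></Cell>
<Cell><Data ss:Type="String">Phase</Data></Cell>
<Cell><Data ss:Type="String">HR</Data></Cell>
</Row>
<Row>
<Cell><Data ss:Type="String">h:mm:ss</Data></Cell>
<Cell><Data ss:Type="String"></Data></Cell>
<Cell><Data ss:Type="String">/min</Data></Cell>
</Row>
<Row>
<Cell><Data ss:Type="String">0:00:06</Data></Cell>
<Cell><Data ss:Type="String">Rest</Data></Cell>
<Cell><Data ss:Type="Number">61</Data></Cell>
</Row>
</Table>
</Worksheet>
</Workbook>"""


def cell_value(data):
    """Return a plain Python value (str or float) for one Data element."""
    text = data.text or ""
    if data.get(TYPE_ATTR) == "Number":
        return float(text)
    return text


def read_rows(xml_text):
    """Return one list of plain values per Row; Column elements are ignored."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.iterfind("ss:Worksheet/ss:Table/ss:Row", NS):
        rows.append([cell_value(d) for d in row.iterfind("ss:Cell/ss:Data", NS)])
    return rows


rows = read_rows(SAMPLE)
# First row holds the column names, second the units, the rest the data.
header, units, data = rows[0], rows[1], rows[2:]
print(header)  # ['t', 'Phase', 'HR']
print(data)    # [['0:00:06', 'Rest', 61.0]]
```

With data holding only plain strings and floats, pd.DataFrame(data, columns=header) should then build an ordinary two-dimensional DataFrame.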