I need to process an XML file of roughly 8 GB. The structure of the file is (simplified) similar to the below:
<TopLevelElement>
    <SomeElementList>
        <Element>zzz</Element>
        ... and so on for thousands of rows
    </SomeElementList>
    <Records>
        <RecordType1>
            <RecordItem id="aaaa">
                <SomeData>
                    <SomeMoreData NameType="xxx">
                        <NameComponent1>zzz</NameComponent1>
                        ...
                        <AnotherNameComponent>zzzz</AnotherNameComponent>
                    </SomeMoreData>
                </SomeData>
            </RecordItem>
            ... hundreds of thousands of items, some quite large
        </RecordType1>
        <RecordType2>
            <RecordItem id="cccc">
                ... hundreds of thousands of RecordType2 elements, slightly different from the RecordItems in RecordType1
            </RecordItem>
        </RecordType2>
    </Records>
</TopLevelElement>
I need to extract some of the sub-elements of the RecordType1 and RecordType2 elements. Conditions determine which record items need to be processed and which fields need to be extracted. No individual RecordItem exceeds 120 KB (some carry extensive text data, which I do not need).
Here is the code. The function get_all_records receives the following inputs: a) the path to the XML file; b) the record category ('RecordType1' or 'RecordType2'); c) which name types and d) which name components to pick.
from xml.etree import cElementTree as ET

def get_all_records(xml_file_path, record_category, name_types, name_components):
    context = ET.iterparse(xml_file_path, events=("start", "end"))
    context = iter(context)
    event, root = next(context)  # grab the root element from the first event
    all_records = []
    for event, elem in context:
        if event == 'end' and elem.tag == record_category and elem.attrib['action'] != 'del':
            # get_record is my helper that extracts the wanted fields
            record_contents = get_record(elem, name_types=name_types,
                                         name_components=name_components,
                                         record_id=elem.attrib['id'])
            if record_contents:
                all_records += record_contents
            root.clear()  # supposed to release the processed elements
    return all_records
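
For reference, this is how I call it (the file name, name types and name components below are placeholders, not my real ones):

records = get_all_records(
    'big_file.xml',      # path to the 8 GB file (placeholder name)
    'RecordType1',       # record category to extract
    name_types=['xxx'],  # NameType attribute values to keep (placeholder)
    name_components=['NameComponent1', 'AnotherNameComponent'],
)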
I have experimented with the number of records. The code processes 100k RecordItems (Type1 only; it simply takes too long to reach Type2) in approximately one minute. Attempting to process a larger number of records (I tried one million) eventually leads to a MemoryError inside ElementTree.py. So I am guessing no memory is released despite the root.clear() call.
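
I suppose I could confirm where the memory goes with something like this (a sketch using the standard tracemalloc module; it just walks the file without any processing and prints the traced memory as it goes, the reporting interval is arbitrary):

import tracemalloc
from xml.etree import cElementTree as ET

def count_events(xml_file_path):
    tracemalloc.start()
    count = 0
    for event, elem in ET.iterparse(xml_file_path, events=('start', 'end')):
        count += 1
        if count % 100000 == 0:
            current, peak = tracemalloc.get_traced_memory()
            print('after {} events: {:.1f} MB now, {:.1f} MB peak'.format(
                count, current / 1e6, peak / 1e6))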
An ideal solution would be one where the RecordItems are read one at a time, processed, and then discarded from memory, but I have no clue how to do that. From an XML point of view, the two extra layers of elements (TopLevelElement and Records) seem to complicate the task. I am new to XML and to the respective Python libraries, so a detailed explanation would be much appreciated!
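
To illustrate what I am after, here is a rough sketch of the shape I imagine, pieced together from iterparse examples I have found (untested guesswork; get_record is my helper from above, everything else is made up):

from xml.etree import cElementTree as ET

def stream_record_items(xml_file_path, record_category, name_types, name_components):
    context = ET.iterparse(xml_file_path, events=('start', 'end'))
    inside_category = False
    for event, elem in context:
        if event == 'start' and elem.tag == record_category:
            inside_category = True   # entering RecordType1 / RecordType2
        elif event == 'end':
            if elem.tag == record_category:
                inside_category = False
                elem.clear()         # drop the finished container
            elif inside_category and elem.tag == 'RecordItem':
                if elem.attrib.get('action') != 'del':
                    # hand back whatever get_record returns for one item,
                    # instead of accumulating everything in a list
                    yield get_record(elem, name_types=name_types,
                                     name_components=name_components,
                                     record_id=elem.attrib['id'])
                # discard this item's subtree before the next one is built;
                # the cleared element stays attached to its parent as an
                # empty shell until the container's end clears those too
                elem.clear()

# Usage: records would arrive one at a time instead of as one giant list.
for record in stream_record_items('big_file.xml', 'RecordType1', ['xxx'],
                                  ['NameComponent1', 'AnotherNameComponent']):
    pass  # do the per-record processing here

Would something like this work, or do the cleared empty elements still pile up?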
Update: a commenter suggested that, instead of building all_records containing all the matched records, I perform the processing right at the point where I currently have all_records += record_contents, since building that list is probably what is eating the memory. I tried that and still hit the same failure:

  File "ElementTree.py", line 1224, in iterator
    data = source.read(16 * 1024)
MemoryError

This time I was able to process about 960k records. In addition, the program froze at random intervals for 2 to 20 minutes each time. I also tried to process 'RecordType2' (which comes after RecordType1), and it was never reached (MemoryError again). Unless it is some bug in iterparse itself, there must be something wrong with how I iterate through the XML file.