I need to process an XML file of roughly 8 GB. The structure of the file is (simplified) similar to the below:
<TopLevelElement>
    <SomeElementList>
        <Element>zzz</Element>
        ... and so on for thousands of rows
    </SomeElementList>
    <Records>
        <RecordType1>
            <RecordItem id="aaaa">
                <SomeData>
                    <SomeMoreData NameType="xxx">
                        <NameComponent1>zzz</NameComponent1>
                        ...
                        <AnotherNameComponent>zzzz</AnotherNameComponent>
                    </SomeMoreData>
                </SomeData>
            </RecordItem>
            ... hundreds of thousands of items, some quite large
        </RecordType1>
        <RecordType2>
            <RecordItem id="cccc">
                ... hundreds of thousands of RecordType2 elements, slightly different from the RecordItems in RecordType1
            </RecordItem>
        </RecordType2>
    </Records>
</TopLevelElement>
I need to extract some of the sub-elements of the RecordType1 and RecordType2 elements. Conditions determine which record items need to be processed and which fields need to be extracted. No individual RecordItem exceeds 120 KB (some carry extensive text data, which I do not need).
Here is the code. The function get_all_records receives the following inputs: a) the path to the XML file; b) the record category ('RecordType1' or 'RecordType2'); c) which name types and d) which name components to pick.
from xml.etree import cElementTree as ET

def get_all_records(xml_file_path, record_category, name_types, name_components):
    context = ET.iterparse(xml_file_path, events=("start", "end"))
    context = iter(context)
    event, root = next(context)  # grab the root element from the first event
    all_records = []
    for event, elem in context:
        if event == 'end' and elem.tag == record_category and elem.attrib['action'] != 'del':
            # get_record is my helper that extracts the wanted fields
            record_contents = get_record(elem, name_types=name_types,
                                         name_components=name_components,
                                         record_id=elem.attrib['id'])
            if record_contents:
                all_records += record_contents
            root.clear()  # supposed to release the processed elements
    return all_records
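
For reference, this is how I call it (the file name, name types and name components below are placeholders, not my real ones):

records = get_all_records(
    'big_file.xml',      # path to the 8 GB file (placeholder name)
    'RecordType1',       # record category to extract
    name_types=['xxx'],  # NameType attribute values to keep (placeholder)
    name_components=['NameComponent1', 'AnotherNameComponent'],
)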
I have experimented with the number of records. The code processes 100k RecordItems (Type1 only; it simply takes too long to reach Type2) in approximately one minute. Attempting to process a larger number of records (I tried one million) eventually leads to a MemoryError inside ElementTree.py. So I am guessing no memory is released despite the root.clear() call.
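
I suppose I could confirm where the memory goes with something like this (a sketch using the standard tracemalloc module; it just walks the file without any processing and prints the traced memory as it goes, the reporting interval is arbitrary):

import tracemalloc
from xml.etree import cElementTree as ET

def count_events(xml_file_path):
    tracemalloc.start()
    count = 0
    for event, elem in ET.iterparse(xml_file_path, events=('start', 'end')):
        count += 1
        if count % 100000 == 0:
            current, peak = tracemalloc.get_traced_memory()
            print('after {} events: {:.1f} MB now, {:.1f} MB peak'.format(
                count, current / 1e6, peak / 1e6))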
An ideal solution would be one where the RecordItems are read one at a time, processed, and then discarded from memory, but I have no clue how to do that. From an XML point of view, the two extra layers of elements (TopLevelElement and Records) seem to complicate the task. I am new to XML and to the respective Python libraries, so a detailed explanation would be much appreciated!
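
To illustrate what I am after, here is a rough sketch of the shape I imagine, pieced together from iterparse examples I have found (untested guesswork; get_record is my helper from above, everything else is made up):

from xml.etree import cElementTree as ET

def stream_record_items(xml_file_path, record_category, name_types, name_components):
    context = ET.iterparse(xml_file_path, events=('start', 'end'))
    inside_category = False
    for event, elem in context:
        if event == 'start' and elem.tag == record_category:
            inside_category = True   # entering RecordType1 / RecordType2
        elif event == 'end':
            if elem.tag == record_category:
                inside_category = False
                elem.clear()         # drop the finished container
            elif inside_category and elem.tag == 'RecordItem':
                if elem.attrib.get('action') != 'del':
                    # hand back whatever get_record returns for one item,
                    # instead of accumulating everything in a list
                    yield get_record(elem, name_types=name_types,
                                     name_components=name_components,
                                     record_id=elem.attrib['id'])
                # discard this item's subtree before the next one is built;
                # the cleared element stays attached to its parent as an
                # empty shell until the container's end clears those too
                elem.clear()

# Usage: records would arrive one at a time instead of as one giant list.
for record in stream_record_items('big_file.xml', 'RecordType1', ['xxx'],
                                  ['NameComponent1', 'AnotherNameComponent']):
    pass  # do the per-record processing here

Would something like this work, or do the cleared empty elements still pile up?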
Update: a commenter suggested that, instead of building all_records containing all the matched records, I perform the processing right at the point where I currently have all_records += record_contents, since building that list is probably what is eating the memory. I tried that and still hit the same failure:

  File "ElementTree.py", line 1224, in iterator
    data = source.read(16 * 1024)
MemoryError

This time I was able to process about 960k records. In addition, the program froze at random intervals for 2 to 20 minutes each time. I also tried to process 'RecordType2' (which comes after RecordType1), and it was never reached (MemoryError again). Unless it is some bug in iterparse itself, there must be something wrong with how I iterate through the XML file.