
Should an import of 80GB of XML data into MySQL take more than 5 days to complete?

I'm currently importing an XML file that is roughly 80GB in size; the code I'm using is in this gist. While everything is working properly, the import has been running for almost 5 straight days and it's not even close to being done.

The average table size is roughly:

Data size: 4.5GB
Index size: 3.2GB
Avg. row length: 245 bytes
Number of rows: 20,000,000

Let me know if more info is needed!

Server Specs:

Note: this is a Linode VPS.

Intel Xeon L5520 processor, quad core, 2.27GHz, 4GB total RAM

XML Sample

https://gist.github.com/2510267

Thanks!


Update: After researching this matter further, this seems to be about average. I found this answer, which describes ways to improve the import rate.

  • Have you tried profiling your code to see where the time is being spent? Commented Apr 27, 2012 at 14:21
  • You might try altering the transaction log so it doesn't bog things down: stackoverflow.com/questions/996403/disable-transaction-log Commented Apr 27, 2012 at 14:23
  • Did you try your code on a small test to make sure it works? Commented Apr 27, 2012 at 14:24
  • No, I haven't, but once I get the import to run through the entire file without problems I will, since I'm in no rush to stop the current import. I'm more interested in knowing whether this is normal. Commented Apr 27, 2012 at 14:25
  • I tested this heavily on smaller imports of roughly 50MB, and those took less than 5 seconds. I also know it's working because I can go into MySQL and see the data continuously arriving, and I can watch the import in top. Commented Apr 27, 2012 at 14:30

3 Answers


One thing which will help a great deal is to commit less frequently, rather than once per row. I would suggest starting with one commit per several hundred rows and tuning from there.
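
For example, here is a minimal sketch of batched commits, assuming MySQLdb and an illustrative parse_rows generator and table schema (none of these names are from the original gist):

```python
import MySQLdb

BATCH_SIZE = 500  # tune this; start with a few hundred rows per commit

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="imports")
cursor = conn.cursor()

pending = 0
for row in parse_rows("data.xml"):  # hypothetical generator yielding one dict per record
    cursor.execute(
        "INSERT INTO items (id, name, value) VALUES (%s, %s, %s)",
        (row["id"], row["name"], row["value"]),
    )
    pending += 1
    if pending >= BATCH_SIZE:
        conn.commit()  # one commit per batch instead of one per row
        pending = 0

conn.commit()  # flush the final partial batch
cursor.close()
conn.close()
```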

Also, drop the existence check you're doing right now; it greatly increases the number of queries you need to run. Instead, use ON DUPLICATE KEY UPDATE (a MySQL extension, not standards-compliant) so that a duplicate INSERT automatically does the right thing.
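
Continuing the sketch above, the per-row statement then becomes a single round trip (table and column names are still illustrative):

```python
# One statement replaces the SELECT-then-INSERT/UPDATE pair: MySQL inserts
# the row, or updates it in place if the key already exists.
cursor.execute(
    """
    INSERT INTO items (id, name, value)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE name = VALUES(name), value = VALUES(value)
    """,
    (row["id"], row["name"], row["value"]),
)
```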

Finally, consider building your tool to convert the XML into a textual form suitable for the mysqlimport tool, and using that bulk loader instead. This cleanly separates the time needed for XML parsing from the time needed for database ingestion, and also speeds up the import itself by using a tool designed for the purpose (rather than INSERT or UPDATE commands, mysqlimport uses MySQL's specialized LOAD DATA INFILE statement).
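
A rough sketch of that split, again with illustrative names: write a tab-delimited file during the parsing pass, then hand it to mysqlimport, which derives the table name from the file name:

```python
import csv

# Pass 1: convert the parsed XML rows to a tab-delimited file named
# after the target table.
with open("items.txt", "w") as out:
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in parse_rows("data.xml"):  # hypothetical generator, as above
        writer.writerow([row["id"], row["name"], row["value"]])

# Pass 2: bulk-load it in one shot from the shell, e.g.:
#   mysqlimport --local --fields-terminated-by='\t' imports items.txt
```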


2 Comments

Ah, a bug, thanks for that! The existence check should be skipped based on the import type, but that logic was mistakenly pruned from the latest update to the tool. I will also look into the mysqlimport tool.
Marking this as accepted since it found a bug and provided a helpful hint to improve performance :)

This is (probably) unrelated to your speed problem, but I would suggest double-checking whether the behaviour of iterparse fits your logic. At the point the start event fires, the parser may or may not have loaded the text value of the node (depending on whether it happened to fit within the chunk of data parsed so far), so you can get some rather random behaviour.
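
A sketch of the safer pattern, using end events so each element is fully parsed before it is handled (the tag name and handler here are assumptions):

```python
from lxml import etree

# With events=("end",), the element's text and children are guaranteed to
# be complete when your code sees it; "start" makes no such promise.
for event, elem in etree.iterparse("data.xml", events=("end",), tag="record"):
    process(elem)  # hypothetical handler for one fully parsed record
    elem.clear()   # free the memory held by the processed subtree
    while elem.getprevious() is not None:
        del elem.getparent()[0]  # drop already-handled siblings
```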



I have three quick suggestions to make without seeing your code, having attempted something similar myself:

  1. Optimize your code for high performance; High-performance XML parsing in Python with lxml is a great article to look at.
  2. Look into PyPy.
  3. Rewrite your code to take advantage of multiple CPUs, which Python will not do natively (see the sketch after this list).
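
As a rough sketch of suggestion 3, one common pattern is a single parsing process feeding a pool of worker processes through a queue; every name here (connection details, table, batched_rows) is an assumption for illustration:

```python
import multiprocessing

import MySQLdb

def worker(queue):
    # Each worker opens its own MySQL connection (connections cannot be
    # shared across processes) and inserts the batches it pulls off the queue.
    conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="imports")
    cursor = conn.cursor()
    while True:
        batch = queue.get()
        if batch is None:  # sentinel: no more work
            break
        cursor.executemany(
            "INSERT INTO items (id, name, value) VALUES (%s, %s, %s)", batch)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    queue = multiprocessing.Queue(maxsize=8)  # bounded so the parser can't race too far ahead
    workers = [multiprocessing.Process(target=worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    for batch in batched_rows("data.xml", 500):  # hypothetical: yields lists of row tuples
        queue.put(batch)
    for _ in workers:
        queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
```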

Doing these things greatly improved the speed of a similar project I worked on. Perhaps if you had posted some code and example XML I could offer a more in-depth solution. (Edit: sorry, I missed the gist...)

1 Comment

The code is posted in a gist, linked in the second sentence of the question. I have already read the article from #1, and the system is running on a quad core that is already distributing the MySQL/Python load fairly evenly, so would rewriting to take advantage of multiple cores really help?
