
Should an import of 80GB of XML data into MySQL take more than 5 days to complete?

I'm currently importing an XML file that is roughly 80GB in size; the code I'm using is in this gist. While everything is working properly, the import has been running for almost 5 straight days and it's not even close to being done.

The average table size is roughly:

Data size: 4.5GB
Index size: 3.2GB
Avg. row length: 245 bytes
Number of rows: 20,000,000

Let me know if more info is needed!

Server Specs:

Note: this is a Linode VPS.

Intel Xeon L5520 processor, quad core, 2.27GHz, 4GB total RAM

XML Sample

https://gist.github.com/2510267

Thanks!


Update: After researching this matter further, this seems to be about average. I found this answer, which describes ways to improve the import rate.

  • Have you tried profiling your code to see where the time is being spent? Commented Apr 27, 2012 at 14:21
  • You might try altering the transaction log so it doesn't bog things down: stackoverflow.com/questions/996403/disable-transaction-log Commented Apr 27, 2012 at 14:23
  • Did you try your code on a small test to make sure it works? Commented Apr 27, 2012 at 14:24
  • No, I haven't, but once I get the import to run through the entire file without problems I will, since I'm in no rush to stop the current import. I'm more interested in knowing whether this is normal. Commented Apr 27, 2012 at 14:25
  • I tested this heavily on smaller imports of roughly 50MB, and those took less than 5 seconds. I also know it's working because I can go into MySQL and see the data continuously arriving, and I can watch the import in top. Commented Apr 27, 2012 at 14:30

3 Answers


One thing which will help a great deal is to commit less frequently, rather than once per row. I would suggest starting with one commit per several hundred rows and tuning from there.
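
For example, here is a minimal sketch of batched commits, assuming MySQLdb and an illustrative parse_rows generator and table schema (none of these names are from the original gist):

```python
import MySQLdb

BATCH_SIZE = 500  # tune this; start with a few hundred rows per commit

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="imports")
cursor = conn.cursor()

pending = 0
for row in parse_rows("data.xml"):  # hypothetical generator yielding one dict per record
    cursor.execute(
        "INSERT INTO items (id, name, value) VALUES (%s, %s, %s)",
        (row["id"], row["name"], row["value"]),
    )
    pending += 1
    if pending >= BATCH_SIZE:
        conn.commit()  # one commit per batch instead of one per row
        pending = 0

conn.commit()  # flush the final partial batch
cursor.close()
conn.close()
```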

Also, drop the existence check you're doing right now; it greatly increases the number of queries you need to run. Instead, use ON DUPLICATE KEY UPDATE (a MySQL extension, not standards-compliant) so that a duplicate INSERT automatically does the right thing.
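
Continuing the sketch above, the per-row statement then becomes a single round trip (table and column names are still illustrative):

```python
# One statement replaces the SELECT-then-INSERT/UPDATE pair: MySQL inserts
# the row, or updates it in place if the key already exists.
cursor.execute(
    """
    INSERT INTO items (id, name, value)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE name = VALUES(name), value = VALUES(value)
    """,
    (row["id"], row["name"], row["value"]),
)
```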

Finally, consider building your tool to convert the XML into a textual form suitable for the mysqlimport tool, and using that bulk loader instead. This cleanly separates the time needed for XML parsing from the time needed for database ingestion, and also speeds up the import itself by using a tool designed for the purpose (rather than INSERT or UPDATE commands, mysqlimport uses MySQL's specialized LOAD DATA INFILE statement).
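
A rough sketch of that split, again with illustrative names: write a tab-delimited file during the parsing pass, then hand it to mysqlimport, which derives the table name from the file name:

```python
import csv

# Pass 1: convert the parsed XML rows to a tab-delimited file named
# after the target table.
with open("items.txt", "w") as out:
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in parse_rows("data.xml"):  # hypothetical generator, as above
        writer.writerow([row["id"], row["name"], row["value"]])

# Pass 2: bulk-load it in one shot from the shell, e.g.:
#   mysqlimport --local --fields-terminated-by='\t' imports items.txt
```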


2 Comments

Ah, a bug, thanks for that! The existence check should be skipped based on the import type, but that logic was mistakenly pruned from the latest update to the tool. I will also look into the mysqlimport tool.
Marking this as accepted since it found a bug and provided a helpful hint to improve performance :)

This is (probably) unrelated to your speed problem, but I would suggest double-checking whether the behaviour of iterparse fits your logic. At the point the start event fires, the parser may or may not have loaded the text value of the node (depending on whether it happened to fit within the chunk of data parsed so far), so you can get some rather random behaviour.
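
A sketch of the safer pattern, using end events so each element is fully parsed before it is handled (the tag name and handler here are assumptions):

```python
from lxml import etree

# With events=("end",), the element's text and children are guaranteed to
# be complete when your code sees it; "start" makes no such promise.
for event, elem in etree.iterparse("data.xml", events=("end",), tag="record"):
    process(elem)  # hypothetical handler for one fully parsed record
    elem.clear()   # free the memory held by the processed subtree
    while elem.getprevious() is not None:
        del elem.getparent()[0]  # drop already-handled siblings
```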



I have three quick suggestions to make without seeing your code, having attempted something similar myself:

  1. Optimize your code for high performance; High-performance XML parsing in Python with lxml is a great article to look at.
  2. Look into PyPy.
  3. Rewrite your code to take advantage of multiple CPUs, which Python will not do natively (see the sketch after this list).
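
As a rough sketch of suggestion 3, one common pattern is a single parsing process feeding a pool of worker processes through a queue; every name here (connection details, table, batched_rows) is an assumption for illustration:

```python
import multiprocessing

import MySQLdb

def worker(queue):
    # Each worker opens its own MySQL connection (connections cannot be
    # shared across processes) and inserts the batches it pulls off the queue.
    conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="imports")
    cursor = conn.cursor()
    while True:
        batch = queue.get()
        if batch is None:  # sentinel: no more work
            break
        cursor.executemany(
            "INSERT INTO items (id, name, value) VALUES (%s, %s, %s)", batch)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    queue = multiprocessing.Queue(maxsize=8)  # bounded so the parser can't race too far ahead
    workers = [multiprocessing.Process(target=worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    for batch in batched_rows("data.xml", 500):  # hypothetical: yields lists of row tuples
        queue.put(batch)
    for _ in workers:
        queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
```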

Doing these things greatly improved the speed of a similar project I worked on. Perhaps if you had posted some code and example XML I could offer a more in-depth solution. (Edit: sorry, I missed the gist...)

1 Comment

The code is posted in a gist, linked in the second sentence of the question. I have already read the article from #1, and the system is running on a quad core that is already distributing the MySQL/Python load fairly evenly, so would rewriting to take advantage of multiple cores really help?
