
I am new to Java. I have a 2 GB XML file which I need to parse and store its data in a database.

Someone on Stack Overflow recommended Dom4j for large XML files. Parsing works fine, but the Document returned by Dom4j is very large, and iterating over it loads all of the DOM objects into memory (onto the heap).

This results in out-of-memory errors. Can somebody please explain how to avoid them? Is there some mechanism in Java for allocating and releasing heap memory on demand?

  • Is SAX or StAX an option for this? Do you need all the data in memory? Commented Jun 10, 2013 at 9:55
  • Use a StAX parser, or increase the heap size. Commented Jun 10, 2013 at 9:55
  • Quickest solution: run your Java app with more memory (try using 4 GB). More detailed solution: do not keep the whole XML in memory (since it won't fit); instead, process it in chunks. Commented Jun 10, 2013 at 9:55

2 Answers


You have two choices:

  1. reconfigure your JVM to allow a larger maximum heap (via -Xmx2g or similar). This option is obviously limited by your OS and the amount of free memory on your system.
  2. use a streaming API (such as SAX) that doesn't load the whole XML document into memory at once, but rather streams it through your process, allowing you to analyse it without holding the entire document in memory (see the sketch below)

The first option may help you immediately, and isn't specific to this question. The second option is the more scalable solution since it'll allow you to analyse documents of any size. Of course you need to worry about the memory consumption of the results of your analysis, but that's another matter entirely.
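To make the streaming approach concrete, here is a minimal sketch using StAX (javax.xml.stream), which ships with the JDK. The element name record and the handleRecord method are hypothetical placeholders for whatever your 2 GB file and persistence code actually contain, so treat this as an illustration of the pull-parsing pattern rather than a drop-in solution:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingImport {

    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (InputStream in = new FileInputStream("huge.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                // Pull one event at a time; only the current element is in memory
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    handleRecord(reader);
                }
            }
            reader.close();
        }
    }

    // Hypothetical handler: read the fields of one <record>, insert a row
    // (e.g. via JDBC), and let the data become garbage-collectable before
    // the next record is parsed.
    private static void handleRecord(XMLStreamReader reader) {
        // e.g. reader.getAttributeValue(null, "id") and a database insert here
    }
}
```

Because nothing outside the current element is retained, the memory footprint stays roughly constant regardless of the file size.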


1 Comment

Thanks Brian, increasing the heap size is of course known to me, and processing the XML in chunks is a good suggestion. But I need some generic solution for avoiding too much data being loaded onto the heap. A related problem came up with a large table too, with around 15000 records; there, too, some suggested using cursors. But these solutions seem contextual. Is there any generic solution or set of guidelines for avoiding out-of-memory errors? Also, Dom4j has a SAX parser.

If you need to parse big XML files (and enlarging the Java heap does not always work), you need a SAX parser, which lets you parse the XML as a stream instead of loading the whole DOM tree into memory.

You may also check SAXDOMIX

SAXDOMIX contains classes that can forward SAX events or DOM sub-trees to your application during the parsing of an XML document. The framework defines simple interfaces that allow the application to get DOM sub-trees in the middle of a SAX parsing. After handling, all DOM sub-trees become eligible for garbage collection. This solves the DOM scalability problem.
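Since the question uses Dom4j, it is worth noting that Dom4j offers a similar mixed SAX/DOM pattern through SAXReader.addHandler and ElementHandler: each matching sub-tree is handed to you as a small DOM, and detaching it afterwards lets it be garbage-collected. A minimal sketch, assuming the file contains repeated <record> elements under a <records> root (both names are placeholders) and a hypothetical storeInDatabase method:

```java
import java.io.File;
import org.dom4j.Element;
import org.dom4j.ElementHandler;
import org.dom4j.ElementPath;
import org.dom4j.io.SAXReader;

public class Dom4jChunkedImport {

    public static void main(String[] args) throws Exception {
        SAXReader reader = new SAXReader();

        // Invoked once per matching element; only that sub-tree is built as DOM
        reader.addHandler("/records/record", new ElementHandler() {
            public void onStart(ElementPath path) {
                // nothing to do when the element opens
            }

            public void onEnd(ElementPath path) {
                Element record = path.getCurrent();
                storeInDatabase(record);   // hypothetical persistence call
                record.detach();           // prune the sub-tree so it can be GC'd
            }
        });

        // The returned Document holds only the pruned skeleton, not the full tree
        reader.read(new File("huge.xml"));
    }

    private static void storeInDatabase(Element record) {
        // e.g. read child values with record.elementText("name") and insert via JDBC
    }
}
```

This keeps the convenience of DOM navigation within each record while avoiding a full in-memory tree, much like the SAXDOMIX approach described above.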

3 Comments

Thanks Juned, I am using Dom4j, and I think it also has a SAX parser. As one of the code snippets shows: SAXReader reader = new SAXReader();
The problem with DOM is that the entire XML tree needs to be loaded into memory. No matter how big a heap size you set, if your tree does not fit in it, you will end up with an out-of-memory error. SAX is better for parsing big XML, as you can read it in chunks. I like SAXDOMIX because it mixes SAX and DOM, letting you parse in chunks with ease. Try that.
DOM (as output) is being used intentionally, as many of the XML nodes are inter-dependent, and a purely SAX-based approach makes the processing really slow. Doesn't the SAX parser in Dom4j do the same job?
