16

I need an XML parser to parse a file that is approximately 1.8 GB.
So the parser should not load the whole file into memory.

Any suggestions?

  • 1.8 gb is a HUGE text file. Is it not possible to break that up into chunks at the file level? Commented Oct 19, 2010 at 15:01
  • @Owen - it depends on your domain. When interfacing with data dumps from other people's systems, this situation can happen very easily. Commented Oct 19, 2010 at 15:03
  • I did not think about that, but I guess we would again need such a parser to avoid corrupting the XML file? It would not be practical to do that kind of thing manually; any suggestion how to do it? Commented Oct 19, 2010 at 15:05
  • @Nick - I didn't consider that. Good point. Commented Oct 20, 2010 at 3:13
  • What do you want to do with it? Commented Jul 4, 2014 at 9:08

9 Answers

20

Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), which is included in the JDK (package javax.xml.stream).
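
For illustration, here is a minimal sketch of the cursor-style StAX API (XMLStreamReader); the file name "huge.xml" and element name "record" are placeholders, not anything from the question:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxCursorExample {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("huge.xml")) { // placeholder file name
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                // Pull the next event; only the current event is held in memory.
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) { // placeholder element
                    // works when the element contains only text, no child elements
                    System.out.println(reader.getElementText());
                }
            }
            reader.close();
        }
    }
}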

3 Comments

Although I agree that StAX is usually the best solution, there are situations in which SAX is better. If you have documents that contain large blocks of text content, then AFAIR the StAX API will read those blocks of text into memory entirely and handle them as a single event. SAX parsers will normally split the text up into smaller chunks and feed it to your handlers piecewise. A SAX parser is not guaranteed to take advantage of this opportunity, but in StAX the opportunity does not even exist. (Which I personally feel is a little awkward for a streaming API.)
Greetings, can someone please improve my understanding here? I had an interview question about this, and the keywords I answered were SAX and thread, but the interviewer wanted a third keyword. I answered executor thread pool... he said "yes, and?" The answer was priority queue; can someone explain how?
@wilfred-springer Coalescing is a property that can be set on the XMLInputFactory; the StAX API generally supports this in the same way as SAX. See for example the FasterXML input factory.
10

Use a SAX based parser that presents you with the contents of the document in a stream of events.
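
As a rough sketch (the file name and handler logic are placeholders), the JDK's built-in SAX parser can be driven like this:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxExample {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // The parser streams the file and fires callbacks; it never builds a full tree.
        parser.parse(new java.io.File("huge.xml"), new DefaultHandler() { // placeholder file name
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // react to each element as it is read
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                // may be called more than once for a single text node
            }
        });
    }
}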

4

The StAX API is easier to deal with than SAX. Here is a short tutorial.

3

Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
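
If you want a feel for it, this is roughly the usual VTD-XML flow (a sketch assuming the com.ximpleware library is on the classpath; the file name and XPath are placeholders). Note that standard VTD-XML builds an in-memory index of the whole document, so a file this large may need the extended edition (VTDGenHuge), which memory-maps the file:

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdExample {
    public static void main(String[] args) throws Exception {
        VTDGen gen = new VTDGen();
        if (gen.parseFile("huge.xml", true)) {     // true = namespace aware; placeholder file
            VTDNav nav = gen.getNav();
            AutoPilot ap = new AutoPilot(nav);
            ap.selectXPath("/root/record");        // placeholder XPath
            while (ap.evalXPath() != -1) {
                // nav is positioned on each matching element in turn
                System.out.println(nav.toString(nav.getCurrentIndex()));
            }
        }
    }
}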

1 Comment

How about the licensing, which is GPL?
3

As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly, store it someplace else (a database, another file, what have you).

You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.

If you're spooling to a DB, make sure you take some care to make your process restartable. A lot can go wrong in the middle of 1.8 GB.

3

Stream the file into a SAX parser and read it into memory in chunks.

SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on, and you have to pay attention to some things, such as when the characters() method is called, but the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. So you can keep track of the current XPath in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
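
A sketch of that idea (the element path "/catalog/item/title" is just a placeholder): a handler that tracks the current path and buffers text only for the elements it cares about.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    private StringBuilder text; // non-null only while inside an element we want

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.push(qName);
        if ("/catalog/item/title".equals(currentPath())) { // placeholder path
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length); // characters() may arrive in several chunks
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (text != null && "/catalog/item/title".equals(currentPath())) {
            System.out.println(text); // or hand the chunk off for processing
            text = null;
        }
        path.pop();
    }

    // Rebuilds the path from the document root down to the current element.
    private String currentPath() {
        StringBuilder sb = new StringBuilder();
        for (Iterator<String> it = path.descendingIterator(); it.hasNext(); ) {
            sb.append('/').append(it.next());
        }
        return sb.toString();
    }
}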

1

Use almost any SAX Parser to stream the file a bit at a time.

1

I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output to my output file, but which wasn't important for the algorithm).

Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4 GB and I had an 8 GB machine, so I figured maybe 3 GB of the file was just text, and java.lang.String would probably need 6 GB for that text, since it stores characters in UTF-16.

If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark-and-sweep garbage collection results in pages being accessed in random order and objects being moved from one object pool to another, which basically kills the machine.

So I decided to write all my strings out to a file on disk (the file system can obviously handle a sequential write of 3 GB just fine, and when reading it back the OS will use available memory as a file-system cache; there might still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.

StringsFile file = new StringsFile();
StringInFile str = file.newString("abc");        // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file

0

+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done), and it has (AFAIK) no limit on the size of the files it can process.
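
For completeness, a minimal sketch of that loop with the iterator-style XMLEventReader (the file and element names are placeholders):

import java.io.FileInputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

public class StaxLoopExample {
    public static void main(String[] args) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream("huge.xml")); // placeholder file
        // No callbacks: just pull events until the document ends.
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && "record".equals(event.asStartElement().getName().getLocalPart())) {
                // process the element here
            }
        }
        reader.close();
    }
}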
