16

I need an XML parser to parse a file that is approximately 1.8 GB.
So the parser should not load the whole file into memory.

Any suggestions?

  • 1.8 gb is a HUGE text file. Is it not possible to break that up into chunks at the file level? Commented Oct 19, 2010 at 15:01
  • @Owen - it depends on your domain. When interfacing with data dumps from other people's systems, this situation can happen very easily. Commented Oct 19, 2010 at 15:03
  • I did not think about that, but I guess we would again need such a parser to avoid corrupting the XML file? It would not be practical to do that kind of thing manually; any suggestion how to do it? Commented Oct 19, 2010 at 15:05
  • @Nick - I didn't consider that. Good point. Commented Oct 20, 2010 at 3:13
  • What do you want to do with it? Commented Jul 4, 2014 at 9:08

9 Answers

20

Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), which is included in the JDK (package javax.xml.stream).
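
For illustration, here is a minimal sketch of the cursor-style StAX API (XMLStreamReader); the file name "huge.xml" and element name "record" are placeholders, not anything from the question:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxCursorExample {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("huge.xml")) { // placeholder file name
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                // Pull the next event; only the current event is held in memory.
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) { // placeholder element
                    // works when the element contains only text, no child elements
                    System.out.println(reader.getElementText());
                }
            }
            reader.close();
        }
    }
}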

3 Comments

Although I agree that StAX is usually the best solution, there are situations in which SAX is better. If you have documents that contain large blocks of text content, then AFAIR the StAX API will read those blocks of text into memory entirely and handle them as a single event. SAX parsers will normally split the text up into smaller chunks and feed it to your handlers piecewise. A SAX parser is not guaranteed to take advantage of this opportunity, but in StAX the opportunity does not even exist. (Which I personally feel is a little awkward for a streaming API.)
Greetings, can someone please improve my understanding here? I had an interview question about this, and the keywords I answered were SAX and thread, but the interviewer wanted a third keyword. I answered executor thread pool... he said "yes, and?" The answer was priority queue; can someone explain how?
@wilfred-springer Coalescing is a property that can be set on the XMLInputFactory; the StAX API generally supports this in the same way as SAX. See for example the FasterXML input factory.
10

Use a SAX based parser that presents you with the contents of the document in a stream of events.
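
As a rough sketch (the file name and handler logic are placeholders), the JDK's built-in SAX parser can be driven like this:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxExample {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // The parser streams the file and fires callbacks; it never builds a full tree.
        parser.parse(new java.io.File("huge.xml"), new DefaultHandler() { // placeholder file name
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // react to each element as it is read
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                // may be called more than once for a single text node
            }
        });
    }
}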

4

The StAX API is easier to deal with than SAX. Here is a short tutorial.

3

Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
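
If you want a feel for it, this is roughly the usual VTD-XML flow (a sketch assuming the com.ximpleware library is on the classpath; the file name and XPath are placeholders). Note that standard VTD-XML builds an in-memory index of the whole document, so a file this large may need the extended edition (VTDGenHuge), which memory-maps the file:

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdExample {
    public static void main(String[] args) throws Exception {
        VTDGen gen = new VTDGen();
        if (gen.parseFile("huge.xml", true)) {     // true = namespace aware; placeholder file
            VTDNav nav = gen.getNav();
            AutoPilot ap = new AutoPilot(nav);
            ap.selectXPath("/root/record");        // placeholder XPath
            while (ap.evalXPath() != -1) {
                // nav is positioned on each matching element in turn
                System.out.println(nav.toString(nav.getCurrentIndex()));
            }
        }
    }
}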

1 Comment

How about the licensing, which is GPL?
3

As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly, store it someplace else (a database, another file, what have you).

You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.

If you're spooling to a DB, make sure you take some care to make your process restartable. A lot can go wrong in the middle of 1.8 GB.

3

Stream the file into a SAX parser and read it into memory in chunks.

SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on, and you have to pay attention to some things, such as when the characters() method is called, but the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. So you can keep track of the current XPath in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
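
A sketch of that idea (the element path "/catalog/item/title" is just a placeholder): a handler that tracks the current path and buffers text only for the elements it cares about.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    private StringBuilder text; // non-null only while inside an element we want

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.push(qName);
        if ("/catalog/item/title".equals(currentPath())) { // placeholder path
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length); // characters() may arrive in several chunks
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (text != null && "/catalog/item/title".equals(currentPath())) {
            System.out.println(text); // or hand the chunk off for processing
            text = null;
        }
        path.pop();
    }

    // Rebuilds the path from the document root down to the current element.
    private String currentPath() {
        StringBuilder sb = new StringBuilder();
        for (Iterator<String> it = path.descendingIterator(); it.hasNext(); ) {
            sb.append('/').append(it.next());
        }
        return sb.toString();
    }
}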

1

Use almost any SAX Parser to stream the file a bit at a time.

1

I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output to my output file, but which wasn't important for the algorithm).

Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4 GB and I had an 8 GB machine, so I figured maybe 3 GB of the file was just text, and java.lang.String would probably need 6 GB for that text, since it stores characters in UTF-16.

If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark-and-sweep garbage collection results in pages being accessed in random order and objects being moved from one object pool to another, which basically kills the machine.

So I decided to write all my strings out to a file on disk (the file system can obviously handle a sequential write of 3 GB just fine, and when reading it back the OS will use available memory as a file-system cache; there might still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.

StringsFile file = new StringsFile();
StringInFile str = file.newString("abc");        // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file

0

+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done), and it has (AFAIK) no limit on the size of the files it can process.
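
For completeness, a minimal sketch of that loop with the iterator-style XMLEventReader (the file and element names are placeholders):

import java.io.FileInputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

public class StaxLoopExample {
    public static void main(String[] args) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream("huge.xml")); // placeholder file
        // No callbacks: just pull events until the document ends.
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && "record".equals(event.asStartElement().getName().getLocalPart())) {
                // process the element here
            }
        }
        reader.close();
    }
}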
