
I have a large XML file (around 400MB) that I need to ensure is well-formed before I start processing it.

The first thing I tried was something similar to the code below, which is great because I can find out whether the XML is well-formed and which parts of it are 'bad':

libxml_use_internal_errors(true); // collect errors instead of emitting PHP warnings

$doc = simplexml_load_string($xmlstr);
if ($doc === false) {
    $errors = libxml_get_errors();

    foreach ($errors as $error) {
        echo display_xml_error($error);
    }

    libxml_clear_errors();
}
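
(display_xml_error() is not a PHP built-in but a small formatting helper; a minimal sketch of such a helper, assuming it only needs to print the fields of a LibXMLError object, could be:)

// Hypothetical helper: formats one LibXMLError for display.
function display_xml_error($error)
{
    $levels = array(
        LIBXML_ERR_WARNING => 'Warning',
        LIBXML_ERR_ERROR   => 'Error',
        LIBXML_ERR_FATAL   => 'Fatal',
    );
    return sprintf("%s %d: %s at line %d, column %d\n",
        $levels[$error->level], $error->code,
        trim($error->message), $error->line, $error->column);
}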

I also tried loading the file with DOMDocument:

$doc = new DOMDocument();
$doc->load($tempFileName, LIBXML_DTDLOAD|LIBXML_DTDVALID);

I tested this with a file of about 60MB, but anything much larger (~400MB) causes something new to me, the "OOM killer", to kick in and terminate the script after what always seems like 30 seconds.

I thought I might need to increase the memory available to the script, so I figured out the peak usage when processing the 60MB file, adjusted the limit accordingly for the larger one, and also turned the script time limit off just in case:

set_time_limit(0);
ini_set('memory_limit', '512M');

Unfortunately this didn't work: the OOM killer appears to be a Linux kernel feature that kicks in if memory pressure (is that even the right term?) stays consistently high.

It would be great if I could load the XML in chunks somehow, as I imagine this would reduce the memory footprint so that the OOM killer doesn't stick its fat nose in and kill my process.

Does anyone have experience validating a large XML file and capturing the locations where it is badly formed? A lot of posts I've read point to SAX and XMLReader, which might solve my problem.

UPDATE: @chiborg pretty much solved this issue for me. The only downside to this method is that I don't get to see all of the errors in the file, just the first one that failed, which I guess makes sense since the parser can't continue past the first fatal error.

When using SimpleXML, it was able to capture most of the issues in the file and show them to me at the end, which was nice.

  • The SimpleXML extension is just that: a simple tool for simple XML. It loads everything into memory and isn't designed for large files. You'll have to validate it with XMLReader whether you like it or not. Commented Dec 13, 2012 at 10:44
  • Do you have just one XML file, or do you regularly get XML files that are around this size? Commented Dec 13, 2012 at 10:51
  • I regularly get files, but the majority I process will be smaller, so I would love to implement a solution that covers my worst-case scenario and to know that going forward the OOM killer isn't going to be an issue. Commented Dec 13, 2012 at 10:55

2 Answers


Since the SimpleXML and DOM APIs will always load the document into memory, using a streaming parser like SAX or XMLReader is the better approach.

Adapting the code from the example page, it could look like this:

$xml_parser = xml_parser_create();
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

$errors = array();
while ($data = fread($fp, 4096)) {
    // Feed the file to the parser in 4KB chunks; the third argument flags EOF.
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        $errors[] = array(
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser));
        break; // the parser cannot recover from a fatal error, so stop here
    }
}
fclose($fp);
xml_parser_free($xml_parser);
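
Afterwards you can report what was collected, for example like this (a simple sketch, assuming you just want the message/line pairs printed):

if (empty($errors)) {
    echo "XML is well-formed\n";
} else {
    foreach ($errors as $error) {
        list($message, $line) = $error;
        echo "Error: $message at line $line\n";
    }
}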


For big files, the XMLReader class is the perfect fit.

But if you like the SimpleXML syntax: https://github.com/dkrnl/SimpleXMLReader/blob/master/library/SimpleXMLReader.php Usage example: http://github.com/dkrnl/SimpleXMLReader/blob/master/examples/example1.php
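
For reference, a minimal well-formedness check with XMLReader could look like the sketch below ($file is assumed to hold the path to the XML file; the error formatting is only illustrative):

libxml_use_internal_errors(true);

$reader = new XMLReader();
$reader->open($file); // streams from disk instead of loading the whole file

// read() pulls one node at a time, so memory stays flat even for huge files
while ($reader->read()) {
    // no-op: we only care whether parsing succeeds
}

foreach (libxml_get_errors() as $error) {
    printf("Error %d: %s at line %d\n",
        $error->code, trim($error->message), $error->line);
}

libxml_clear_errors();
$reader->close();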

