Pre-process an unformed XML (Java)

Question

The XML file I am working with is not well-formed and therefore invalid. It presents the following issues:

Multiple XML declarations (error message: The processing instruction target matching "[xX][mM][lL]" is not allowed.)
Absence of a single root element (error message: Extra content at the end of the document)

The file includes multiple records and this is an excerpt with two records:

<?xml version="1.0" encoding="utf-8"?>
<ElementAa xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="some-namespace">
  <ElementBa attributeB1="11111" attributeB2="someDate">
    <ElementCa attributeC1="someString" attributeC2="someOtherDate">
      <ElementDa attributeD1="12345" />
    </ElementCa>
    <ElementEa attributeE1="ABCD" />
  </ElementBa>
</ElementAa>
<?xml version="1.0" encoding="utf-8"?>
<ElementAb xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="some-namespace">
  <ElementBb attributeB1="22222" attributeB2="AgainDate">
    <ElementCb attributeC1="anotherString" attributeC2="yetAnotherDate">
      <ElementDb attributeD1="67891" />
    </ElementCb>
    <ElementEb attributeE1="EFGHI" />
  </ElementBb>
</ElementAb>

In order to be well-formed and valid, the above document should be turned into this (please correct me if I am wrong):

<?xml version="1.0" encoding="utf-8"?>
<ElementAa xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="some-namespace">
<ElementBa attributeB1="11111" attributeB2="someDate">
    <ElementCa attributeC1="someString" attributeC2="someOtherDate">
        <ElementDa attributeD1="12345"/>
    </ElementCa>
    <ElementEa attributeE1="ABCD"/>
</ElementBa>
<ElementBb attributeB1="22222" attributeB2="AgainDate">
    <ElementCb attributeC1="anotherString" attributeC2="yetAnotherDate">
        <ElementDb attributeD1="67891"/>
    </ElementCb>
    <ElementEb attributeE1="EFGHI"/>
</ElementBb>
</ElementAa>

Although I am aware that in the best of all possible worlds the data should be of high quality, unfortunately I will have to deal with a poor dataset, and I am trying to find a good approach to produce well-formed and valid XML. At the moment, I have written 2 utility methods that remove all XML declarations (using Pattern/Matcher for regex) and inject the single required one at the top of the file. I am about to do something similar to remove any extra root elements and keep only <ElementAa xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="some-namespace">

I do not think this approach is particularly ideal and I fear it will be very much problematic when dealing with large files, can you help? Any recommendation, suggestion, potential approach would be much appreciated! I am really looking for a good approach for the scenario described.

Thank you so much,

I.

EDIT 1: As mentioned, the XML content is inside a .txt file, and the 2 utility methods I wrote use the common BufferedReader to read its content. I am trying to do all the "data cleaning" before renaming the file with a .xml extension (I have another utility that does that) and feeding it into a JAXB parser.

EDIT 2: Unfortunately, I have no control over the XML generation, as I read the files directly from an FTP server. It would be good to have control over how multiple XML documents get concatenated into the resulting file, of which I have provided an excerpt, but that is not possible.

Recommendation: Change the code that created the file to create a valid XML file. Alternatively, change it to not concatenate multiple XML files into one, but leave them separate, either as individual files on the file system, or as individual entries in a zip file. The second option is especially good for keeping the files together and for downloading, since it will compress the XML too.
– Andreas, Commented Sep 5, 2016 at 21:39
@Andreas, I cannot do that, I am afraid. I have no control over how the "XML" files (the quotation marks are because they are not well-formed) are generated. I know it is rather annoying, but I am exploring any possible avenue to work around what you would rightly call bad data.
– panza, Commented Sep 5, 2016 at 21:42
Then I suggest you un-concatenate them by scanning for the <?xml that opens each XML declaration and splitting right before it, then parse each piece separately.
– Andreas, Commented Sep 5, 2016 at 22:31
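
The splitting suggested in the last comment could be sketched roughly as follows. This is a minimal illustration, not the commenter's code: the class name DeclarationSplitter and its split method are hypothetical, and it assumes every XML declaration begins at the start of a line (as in the excerpt) and that no comment or CDATA section contains a line starting with <?xml. It reads line by line and hands each document to a callback, so only one record is held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper: splits concatenated XML documents on their declarations.
final class DeclarationSplitter {

    /**
     * Reads the input line by line and passes each complete document to
     * onDocument. Returns the number of documents found.
     * Assumption: a line starting with "<?xml " always opens a new document.
     */
    static int split(Reader in, Consumer<String> onDocument) {
        StringBuilder current = new StringBuilder();
        int count = 0;
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // A new declaration closes the document collected so far.
                if (line.startsWith("<?xml ") && current.length() > 0) {
                    onDocument.accept(current.toString());
                    count++;
                    current.setLength(0);
                }
                // Ignore blank lines between documents.
                if (current.length() == 0 && line.trim().isEmpty()) {
                    continue;
                }
                current.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Emit the last document, if any.
        if (current.length() > 0) {
            onDocument.accept(current.toString());
            count++;
        }
        return count;
    }
}
```

Each string passed to the callback can then be given to an ordinary XML parser (or unmarshalled with JAXB) on its own, for example via DeclarationSplitter.split(new FileReader("records.txt"), doc -> handle(doc)), where records.txt and handle are placeholders.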

Michael Kay · Accepted Answer · 2016-09-06 08:29:32Z

Basically, your task is to write a parser for a grammar that has some similarities to the grammar for XML. Before you can write a parser for any grammar, you need to define what that grammar is: that is, specify what input your tool will accept, perhaps in terms of variations from the grammar of XML.

Of course, this will be expensive: the purpose of standardisation is to reduce costs so that everyone can use the same grammar and the same parsers, and if people use proprietary variations then life gets a lot more complicated for everyone.

So far, you're asking us to guess the grammar of your deviant XML by showing us a single example. Well, an example doesn't make a specification. More seriously, writing a parser for a language that hasn't been specified by continually extending it to handle more and more examples is not going to work: Sisyphus will finish his task before you do.

You should also bear in mind that the better you are at picking up other people's garbage, the more garbage they will throw at you.

Addendum

If in fact it is the case that your input file contains a sequence of well-formed XML documents concatenated into a single file, then the grammar of your input can actually be specified fairly easily. It's just one extra rule added to the XML specification:

file ::= document+

Perhaps with the modification that the XML declaration at the start of a document is mandatory.

So defining the grammar you want to accept may not be too difficult. But writing a parser that accurately accepts this grammar is still a challenge. The cleanest way to do it is probably to take an open-source XML parser and modify it.

There's no way of parsing this grammar with regular expressions: it is not a regular language (if you don't understand what this means, you shouldn't be writing parsers, but essentially it means that the definition of the grammar is recursive).

There are however some tricks you could use. Every document starts with <?xml, and the only places <?xml can occur are (a) at the start of a document, (b) in a comment, and (c) in a CDATA section. Comments and CDATA sections cannot be nested, so I think it's the case that every instance of your language will conform to the simpler grammar:

(`<?xml` (stuff | cdata | comment)* )*

where stuff is defined as anything that doesn't contain <?xml, <![CDATA[, or <!--, and cdata and comment are defined as in XML.

Parsing your document according to this simpler (non-recursive) grammar is sufficient to identify the document boundaries, and having done that you can then pass each document to a regular XML parser.
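
A minimal sketch of that boundary scan, under stated assumptions: the class name ConcatenatedXml and its methods are hypothetical, the whole input is held in a String (a large file would need a streaming variant), and anything before the first declaration is discarded. One detail goes beyond the grammar above: the sketch only treats <?xml as a declaration when whitespace follows it, so that a processing instruction such as <?xml-stylesheet ...?> is not mistaken for a document boundary.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper implementing (`<?xml` (stuff | cdata | comment)* )*
final class ConcatenatedXml {
    private static final String DECL = "<?xml";
    private static final String COMMENT_OPEN = "<!--";
    private static final String CDATA_OPEN = "<![CDATA[";

    /** Returns the offsets at which an XML declaration starts a new document. */
    static List<Integer> boundaries(String text) {
        List<Integer> starts = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith(COMMENT_OPEN, i)) {
                // Skip the whole comment; "<?xml" inside it is not a boundary.
                i = skipPast(text, "-->", i + COMMENT_OPEN.length());
            } else if (text.startsWith(CDATA_OPEN, i)) {
                // Skip the whole CDATA section for the same reason.
                i = skipPast(text, "]]>", i + CDATA_OPEN.length());
            } else if (text.startsWith(DECL, i) && whitespaceAt(text, i + DECL.length())) {
                starts.add(i);
                i += DECL.length();
            } else {
                i++;
            }
        }
        return starts;
    }

    /** Cuts the text into one string per document. */
    static List<String> split(String text) {
        List<Integer> starts = boundaries(text);
        List<String> docs = new ArrayList<>();
        for (int k = 0; k < starts.size(); k++) {
            int end = (k + 1 < starts.size()) ? starts.get(k + 1) : text.length();
            docs.add(text.substring(starts.get(k), end).trim());
        }
        return docs;
    }

    /** Passes each piece to a regular XML parser. */
    static List<Document> parseAll(String text) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<Document> result = new ArrayList<>();
        for (String doc : split(text)) {
            result.add(builder.parse(new InputSource(new StringReader(doc))));
        }
        return result;
    }

    private static int skipPast(String text, String terminator, int from) {
        int idx = text.indexOf(terminator, from);
        // An unterminated section swallows the rest of the input.
        return idx < 0 ? text.length() : idx + terminator.length();
    }

    private static boolean whitespaceAt(String text, int pos) {
        return pos < text.length() && Character.isWhitespace(text.charAt(pos));
    }
}
```

Because the scan only looks for three fixed markers and never tracks element nesting, it stays within the non-recursive grammar; well-formedness of each piece is still checked by the real parser in parseAll.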
