The XML file I am working with is not well-formed and therefore invalid. It presents the following issues:
Multiple XML declarations (error message: The processing instruction target matching "[xX][mM][lL]" is not allowed.)
No single root element (error message: Extra content at the end of the document)
The file includes multiple records and this is an excerpt with two records:
<?xml version="1.0" encoding="utf-8"?>
<ElementAa xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="some-namespace">
<ElementBa attributeB1="11111" attributeB2="someDate">
<ElementCa attributeC1="someString" attributeC2="someOtherDate">
<ElementDa attributeD1="12345" />
</ElementCa>
<ElementEa attributeE1="ABCD" />
</ElementBa>
</ElementAa>
<?xml version="1.0" encoding="utf-8"?>
<ElementAb xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="some-namespace">
<ElementBb attributeB1="22222" attributeB2="AgainDate">
<ElementCb attributeC1="anotherString" attributeC2="yetAnotherDate">
<ElementDb attributeD1="67891" />
</ElementCb>
<ElementEb attributeE1="EFGHI" />
</ElementBb>
</ElementAb>
In order to be well-formed and valid, the above document should be turned into this (please correct me if I am wrong):
<?xml version="1.0" encoding="utf-8"?>
<ElementAa xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="some-namespace">
<ElementBa attributeB1="11111" attributeB2="someDate">
<ElementCa attributeC1="someString" attributeC2="someOtherDate">
<ElementDa attributeD1="12345"/>
</ElementCa>
<ElementEa attributeE1="ABCD"/>
</ElementBa>
<ElementBb attributeB1="22222" attributeB2="AgainDate">
<ElementCb attributeC1="anotherString" attributeC2="yetAnotherDate">
<ElementDb attributeD1="67891"/>
</ElementCb>
<ElementEb attributeE1="EFGHI"/>
</ElementBb>
</ElementAa>
Although I am aware that, in the best of all possible worlds, the data would be of high quality, I unfortunately have to deal with a poor dataset, and I am trying to find a good approach to produce well-formed and valid XML. At the moment, I have written two utility methods that remove all XML declarations (using Pattern/Matcher for the regex) and inject the single required declaration at the top of the file. I am about to do something similar to remove any extra root elements and keep only <ElementAa xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="some-namespace">
I do not think this approach is ideal, and I fear it will be problematic when dealing with large files. Can you help? Any recommendation, suggestion, or potential approach would be much appreciated! I am really looking for a good approach for the scenario described.
Thank you so much,
I.
EDIT 1: The XML content is inside a .txt file, and the two utility methods I wrote use a plain BufferedReader to read its content. I am trying to do all the "data cleaning" before renaming the file with the .xml extension (I have another utility that does that) and feeding it into a JAXB parser.
EDIT 2: Unfortunately, I have no control over the XML generation, as I read the files directly from an FTP server. It would be good to have control over how multiple XML documents get concatenated into the resulting file from which I took the excerpt, but that is not possible.
One option is to scan for each "<?xml" XML declaration, split the text right before it, and then parse each piece separately.
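
The split-then-parse idea could be sketched in Java roughly as follows. This is only a minimal sketch under a few assumptions: the class and method names (ConcatenatedXmlSplitter, splitDocuments, parse, merge) are hypothetical, the whole file is assumed to fit in memory as one String, and the text "<?xml " is assumed never to occur inside a comment or CDATA section. The merge step keeps the first root element and appends the children of every later root to it, which matches the desired output shown in the question; whether that result is valid still depends on your schema.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; all names here are illustrative.
final class ConcatenatedXmlSplitter {

    // Splits the raw text right before every "<?xml " declaration.
    static List<String> splitDocuments(String raw) {
        List<String> pieces = new ArrayList<>();
        // Zero-width lookahead: the declaration stays at the start of its piece.
        for (String piece : raw.split("(?=<\\?xml\\s)")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                pieces.add(trimmed);
            }
        }
        return pieces;
    }

    // Parses one piece as a standalone, namespace-aware DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Piece is not well-formed XML", e);
        }
    }

    // Keeps the first root element and appends the children of every later root to it.
    static Document merge(List<String> pieces) {
        if (pieces.isEmpty()) {
            throw new IllegalArgumentException("No XML documents found");
        }
        Document merged = parse(pieces.get(0));
        Element root = merged.getDocumentElement();
        for (String piece : pieces.subList(1, pieces.size())) {
            Node child = parse(piece).getDocumentElement().getFirstChild();
            while (child != null) {
                // importNode copies the node into the merged document; the source stays intact.
                root.appendChild(merged.importNode(child, true));
                child = child.getNextSibling();
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String raw = "<?xml version=\"1.0\" encoding=\"utf-8\"?> <A xmlns=\"some-namespace\"><B/></A> "
                   + "<?xml version=\"1.0\" encoding=\"utf-8\"?> <A xmlns=\"some-namespace\"><B/></A>";
        List<String> pieces = splitDocuments(raw);
        Document merged = merge(pieces);
        System.out.println("pieces: " + pieces.size()
                + ", root: " + merged.getDocumentElement().getLocalName());
    }
}
```

Because every piece is parsed on its own, a broken record fails in isolation instead of invalidating the whole file. For very large files the same split point can be detected while streaming through a BufferedReader, writing each piece out as soon as the next declaration is seen, rather than holding everything in one String.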