2

I have a challenging situation. ECHA publishes the schema for their IUCLID product as zipped collections of xsd files. I want to be able to import data into SQL and use those xsd files to build the SQL tables.

The trouble is while they publish a structure showing an ERD style relationship layout - https://poisoncentres.echa.europa.eu/documents/1789887/5577602/pcn_format_data_model_en.pdf/d667afb6-a36b-4596-48dc-3b2de436d3de?t=1635233313478 - the xsd does not have any sign of those relationships.. (though I might simply be missing them)

The xsd are found in here - https://poisoncentres.echa.europa.eu/documents/1789887/10223884/PCN+Format_v4.0.zip/7d7641c0-facc-898c-bffd-45b080bfdb75?t=1635502393893 and the general page is here - https://poisoncentres.echa.europa.eu/poison-centres-notification-format

I think my option is to do it the long way by converting xsd into SQL (pref Mysql dialect) via maybe a json schema and then hand stitching things together with the hope that there are no huge differences when they release a new version - but this feels icky :)

Does anyone have any thoughts on a better method?

EDIT: So I also noticed this https://poisoncentres.echa.europa.eu/documents/1789887/6428404/echa_example_1_dossier.i6z/98873135-5373-d2df-0cc2-9177a403cad0?t=1590667188569 which is an example PCN document.. this at least gives relationships between chunks of xml, but I'm not sure these will always be available on new versions of the schema.. and boy its painful to parse..

4
  • Which language(s) are you able to use for reading the XSDs and emitting the equivalent SQL? Commented Nov 20, 2021 at 16:28
  • wide open - I'd prefer not to use java or .net as I'm not that good at it.. Commented Nov 21, 2021 at 4:03
  • I think you have a decision to make. Do you a) invest a lot of time to implement a technically rigorous solution that can handle anything that might occur in the XSD model (both now and in the future) or b) implement a solution that handles the current version of the XSDs, and tweak it in future if necessary Some libraries that load XSDs are focused only on the task of validating XML documents. But option a) might require a library that can load an XSD and allow general-purpose exploration of the XSD model. Such libraries do exist, but there are not many, and not in every language. Commented Nov 21, 2021 at 20:38
  • Yes, I'm feeling my solution is to document the steps taken to build the structure from the existing XSD/examples and also commit the existing XSDs so its easy to find the changes.. and hopefully along the way potentially build tooling to convert later :) - I will probably document my steps here too.. Commented Nov 22, 2021 at 0:41

1 Answer 1

0

So I'm going to steadily update this as I progress..

The first challenge is to bring all these XSDs into a single file so we can toss what is not being used and throw it into things like xsd2xml for generation of sample XML files..

How could I merge an XSD schema with imports and includes into a single file? Covers this - but much of what is mentioned is gone.

This - The 'minOccurs' attribute cannot be present looks tantalizing, but as usual all gone.

This looks the thing - https://sourceforge.net/projects/graphvisu/ but only wants HTTP - doesn't like file

Managed to get xsdwalker running - which shows up a problem.. it seems that the set of xsds doesn't really fit cleanly into a hierarchy.. which means we need to run with the sample xmls instead

Update:

Stupid me - the format uses href:xlink to reference other portions - but this isn't an XML cross reference, its just a href - so I am manually replacing the "name": { "xlink:type": "simple", "xlink:href": "52a44784-64a6-4836-b8d6-b84315fd958e_f53d48a9-17ef-48f0-8d0e-76d03007bdfe.i6d", "content": "Ethane-1,2-diol" },

with the expanded i6c:Document format - this is not pretty at all, and once I've coded something then I'll be cherry picking out what is needed... also as a complete facepalm and abuse of the XML format - almost everything in the XML i6c document is coded values - even though the idea of wasting all that bandwidth with readable variable names - is to have readable variables as well.. FFS

I compromised and inserted a "contents' key holding the referenced file contents - see Searching an XML structure but modifying a node higher in the hierarchy

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.