
I have a huge XML file that I cannot open unless I import it into a database, and I am using Postgres for this. I have a schema that goes with the data file. There are too many columns to define by hand, so I'd like to automate creating a table from this schema and then importing the data file from my local drive to populate that table. How do I do this? I have seen a lot of answers on SO but haven't been able to understand them properly. I also do not have superuser rights, so I will have to work around that.

Here's what the schema file looks like:

> <?xml version="1.0" encoding="UTF-8"?>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>     elementFormDefault="qualified"
>     targetNamespace="http://www.drugbank.ca" xmlns="http://www.drugbank.ca">
>     <xs:element name="drugbank" type="drugbank-type">
>         <xs:annotation>
>             <xs:documentation>This is the root element for the DrugBank database schema. DrugBank is a database on drug and
> drug-targets.</xs:documentation>
>         </xs:annotation>
>     </xs:element>
>     <xs:complexType name="drugbank-type">
>         <xs:annotation>
>             <xs:documentation>This is the root element type for the DrugBank database schema.</xs:documentation>
>         </xs:annotation>
>         <xs:sequence>
>             <xs:element name="drug" type="drug-type" maxOccurs="unbounded"/>
>         </xs:sequence>
>         <xs:attribute name="version" type="xs:string" use="required">
>             <xs:annotation>
>                 <xs:documentation>The DrugBank version for the exported XML file.</xs:documentation>
>             </xs:annotation>
>         </xs:attribute>
>         <xs:attribute name="exported-on" type="xs:date" use="required">
>             <xs:annotation>
>                 <xs:documentation>The date the XML file was exported.</xs:documentation>
>             </xs:annotation>
>         </xs:attribute>
>     </xs:complexType>
>     <xs:complexType name="drug-type">
>         <xs:sequence>
>             <xs:element maxOccurs="unbounded" minOccurs="1" name="drugbank-id"
>                 type="drugbank-drug-salt-id-type"> </xs:element>
>             <xs:element name="name" type="xs:string"/>
>             <xs:element name="description" type="xs:string"/>
>             <xs:element name="cas-number" type="xs:string"/>
>             <xs:element name="unii" type="xs:string"/>
>             <xs:element name="average-mass" type="xs:float" minOccurs="0"/>
>             <xs:element name="monoisotopic-mass" type="xs:float" minOccurs="0"/>
>             <xs:element name="state" type="state-type" minOccurs="0"/>
>             <xs:element name="groups" type="group-list-type"/>
>             <xs:element name="general-references" type="reference-list-type"/>
>             <xs:element name="synthesis-reference" type="xs:string"/>
>             <xs:element name="indication" type="xs:string"/>
>             <xs:element name="pharmacodynamics" type="xs:string"/>
>             <xs:element name="mechanism-of-action" type="xs:string"/>
>             <xs:element name="toxicity" type="xs:string"/>
>             <xs:element name="metabolism" type="xs:string"/>
>             <xs:element name="absorption" type="xs:string"/>
>             <xs:element name="half-life" type="xs:string"/>
>             <xs:element name="protein-binding" type="xs:string"/>
>             <xs:element name="route-of-elimination" type="xs:string"/>
>             <xs:element name="volume-of-distribution" type="xs:string"/>
>             <xs:element name="clearance" type="xs:string"/>
>             <xs:element name="classification" type="classification-type" minOccurs="0"/>
>             <xs:element name="salts" type="salt-list-type"/>
>             <xs:element name="synonyms" type="synonym-list-type"/>
>             <xs:element name="products" type="product-list-type"/>
>             <xs:element name="international-brands" type="international-brand-list-type"/>
>             <xs:element name="mixtures" type="mixture-list-type"/>
>             <xs:element name="packagers" type="packager-list-type"/>
>             <xs:element name="manufacturers" type="manufacturer-list-type"/>
>             <xs:element name="prices" type="price-list-type"/>
>             <xs:element name="categories" type="category-list-type"/>
>             <xs:element name="affected-organisms" type="affected-organism-list-type"/>
>             <xs:element name="dosages" type="dosage-list-type"/>
>             <xs:element name="atc-codes" type="atc-code-list-type"/>
>             <xs:element name="ahfs-codes" type="ahfs-code-list-type"/>
>             <xs:element name="pdb-entries" type="pdb-entry-list-type"/>
>             <xs:element name="fda-label" type="xs:anyURI" minOccurs="0"/>
>             <xs:element name="msds" type="xs:anyURI" minOccurs="0"/>
>             <xs:element name="patents" type="patent-list-type"/>
>             <xs:element name="food-interactions" type="food-interaction-list-type"/>
>             <xs:element name="drug-interactions" type="drug-interaction-list-type"/>
>             <xs:element minOccurs="0" name="sequences" type="sequence-list-type"/>
>             <xs:element minOccurs="0" name="calculated-properties" type="calculated-property-list-type"/>
>             <xs:element name="experimental-properties" type="experimental-property-list-type"/>
>             <xs:element name="external-identifiers" type="external-identifier-list-type"/>
>             <xs:element name="external-links" type="external-link-list-type"/>
>             <xs:element name="pathways" type="pathway-list-type"/>
>             <xs:element name="reactions" type="reaction-list-type"/>
>             <xs:element name="snp-effects" type="snp-effect-list-type"/>
>             <xs:element name="snp-adverse-drug-reactions" type="snp-adverse-drug-reaction-list-type"/>
>             <xs:element name="targets" type="target-list-type"/>
>             <xs:element name="enzymes" type="enzyme-list-type"/>
>             <xs:element name="carriers" type="carrier-list-type"/>
>             <xs:element name="transporters" type="transporter-list-type"/>
>         </xs:sequence>

This is only a part of it. It's a huge file. Any kind of help/guidance is much appreciated.

  • Hey there. How big is this XML file? Coincidentally, I am currently importing a 120GB XML file into my database, but I am using another approach based on splitting the XML file, importing the pieces into temporary tables and unnesting them into the target table. Not sure if that is what you want. Commented Apr 16, 2018 at 16:58
  • @JimJones Wow! In comparison my file is a meager 725MB. I guess, in my case, I could get away with splitting the file. I've imported CSV and text data into the db before, but with far fewer columns. If I'm able to figure out how to import this XML schema into a table, half the battle is won. Commented Apr 16, 2018 at 17:51

1 Answer


There are probably a thousand ways to import XML files into PostgreSQL, but here is one I find quite easy to implement and that I have already tested with large XML documents (120GB+).

Depending on the size of your XML file, consider splitting it. A terrific tool for this is xml_split. The following command splits file.xml into smaller files of at most 100MB each:

xml_split -n 5 -l 1 -s 100MB file.xml

Once the files are split into reasonably sized pieces, you can import them without the risk of running out of memory.
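Before importing, it can help to confirm that no piece is larger than the limit you asked for. The following is only a sketch: the directory, the two stand-in files and the `limit` value are assumptions for illustration, so point `dir` at wherever your split files actually are.

```shell
# Hypothetical directory and limit; adjust both to your setup.
dir=$(mktemp -d)
limit=$((100 * 1024 * 1024))   # 100 MB expressed in bytes

# Two tiny stand-in files so the loop has something to inspect.
printf '<t><foo/></t>' > "$dir/file-01.xml"
printf '<t><foo/><foo/></t>' > "$dir/file-02.xml"

# Count the files whose size in bytes exceeds the limit.
oversized=0
for f in "$dir"/*.xml; do
    size=$(wc -c < "$f" | tr -d ' ')
    if [ "$size" -gt "$limit" ]; then
        echo "too large: $f ($size bytes)"
        oversized=$((oversized + 1))
    fi
done
echo "files over the limit: $oversized"
rm -rf "$dir"
```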

Let's consider the following XML file structure ...

<?xml version="1.0"?>
<t>
    <foo>
        <id j="a">1</id>
        <val>bar1</val>
    </foo>
    <foo>
        <id j="b">8</id>
        <val>bar1</val>
    </foo>
    <foo>
        <id j="c">5</id>
        <val>bar1</val>
    </foo>
    <foo>
        <id j="b">2</id>
    </foo>
</t>

... and the following target table, where we will insert the XML records.

CREATE TABLE t (id TEXT, entry XML);

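The code below imports each XML file into an unlogged staging table and then unnests it by the node <foo> into the table t, using a CTE (aka WITH clause). COPY in text format treats every line of input as a separate row, so a multi-line XML file would be split into fragments that are not valid XML on their own. The command perl -pe 's/\n/\\n/g' replaces each newline with the two-character escape sequence \n, so that COPY reads the whole file as a single row and you do not get a Premature end of data exception. Because the data is streamed through psql with COPY ... FROM STDIN, the files are read on your local machine and no superuser rights are needed: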

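#!/bin/bash

# Staging table: each row holds one complete XML file.
psql testdb -c "CREATE UNLOGGED TABLE tmp (entry xml);"

# Loop over the split files; the glob assumes they end in .xml.
for f in /path/to/your/files/*.xml; do

    # Escape newlines so COPY reads the whole file as a single row.
    perl -pe 's/\n/\\n/g' "$f" | psql testdb -c "COPY tmp FROM STDIN;"

    # Insert one <foo> node per row into t, then empty the staging table.
    psql testdb -c "
    WITH j AS (
      SELECT UNNEST(XPATH('//t/foo',entry)) AS entry FROM tmp
    )
      INSERT INTO t 
      SELECT XPATH('//foo/id/text()',j.entry),j.entry FROM j;

      TRUNCATE TABLE tmp;"

done

psql testdb -c "DROP TABLE tmp;"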

And here is your data:

testdb=# SELECT * FROM t;
 id  |          entry           
-----+--------------------------
 {1} | <foo>                   +
     |         <id j="a">1</id>+
     |         <val>bar1</val> +
     |     </foo>
 {8} | <foo>                   +
     |         <id j="b">8</id>+
     |         <val>bar1</val> +
     |     </foo>
 {5} | <foo>                   +
     |         <id j="c">5</id>+
     |         <val>bar1</val> +
     |     </foo>
 {2} | <foo>                   +
     |         <id j="b">2</id>+
     |     </foo>
(4 rows)
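If you later want typed columns instead of raw XML, the simple elements of the schema can be turned into a column list with a few lines of shell. Treat the following as a rough sketch rather than a complete solution: the table name drug, the type mapping and the four-line schema excerpt are my assumptions, it only picks up elements written as name="..." type="xs:..." on one line, and it skips every element whose type is one of the nested list types (those would need their own tables or stay in an xml column).

```shell
# Hypothetical excerpt of the schema, only for illustration.
xsd=$(mktemp)
cat > "$xsd" <<'EOF'
<xs:element name="name" type="xs:string"/>
<xs:element name="average-mass" type="xs:float" minOccurs="0"/>
<xs:element name="groups" type="group-list-type"/>
<xs:element name="fda-label" type="xs:anyURI" minOccurs="0"/>
EOF

# Keep only elements with a built-in xs: type and print "name type" pairs,
# then map each pair to a column definition.
cols=$(sed -n 's/.*<xs:element name="\([^"]*\)" type="xs:\([A-Za-z]*\)".*/\1 \2/p' "$xsd" |
    while read -r name xstype; do
        # Assumed mapping from XSD types to PostgreSQL types.
        case "$xstype" in
            float) pgtype="REAL" ;;
            date)  pgtype="DATE" ;;
            *)     pgtype="TEXT" ;;
        esac
        # Hyphens are awkward in SQL identifiers, so use underscores.
        col=$(printf '%s' "$name" | sed 's/-/_/g')
        printf '  %s %s,\n' "$col" "$pgtype"
    done)

# Strip the trailing comma and wrap the columns in a CREATE TABLE statement.
ddl=$(printf 'CREATE TABLE drug (\n%s\n);' "${cols%,}")
printf '%s\n' "$ddl"
rm -f "$xsd"
```

The generated statement can be reviewed by hand and then passed to psql like the other commands above.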