
For an assignment, I need to parse a 2-million-line XML file and load the data into a MySQL database. Since we are using a Python environment with SQLite for the class, I am attempting to parse the file with Python. Keep in mind I am just learning Python, so everything is new!

I have made a few attempts, but I keep failing and getting frustrated. For efficiency, I am testing my code on just a small sample of the full XML, here:

<pub>
<ID>7</ID>
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title>
<year>2003</year>
<booktitle>AVBPA</booktitle>
<pages>895-902</pages>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>

First attempt

Here I successfully pulled the data out of each tag, except when there are multiple authors under the <authors> tag. I am trying to loop through each node in the authors tag, count them, build a temporary array of those authors, and then insert them into my database with SQL. I am getting "15" for the number of authors, but clearly there are only 4! How do I solve this?

from xml.dom import minidom

xmldoc= minidom.parse("test.xml")

pub = xmldoc.getElementsByTagName("pub")[0]
ID = pub.getElementsByTagName("ID")[0].firstChild.data
title = pub.getElementsByTagName("title")[0].firstChild.data
year = pub.getElementsByTagName("year")[0].firstChild.data
booktitle = pub.getElementsByTagName("booktitle")[0].firstChild.data
pages = pub.getElementsByTagName("pages")[0].firstChild.data
authors = pub.getElementsByTagName("authors")[0]
author = authors.getElementsByTagName("author")[0].firstChild.data
num_authors = len(author)
print("Number of authors: ", num_authors )

print(ID)
print(title)
print(year)
print(booktitle)
print(pages)
print(author)

1 Answer


Notice that you were getting the number of characters in the first author's name here: the code takes only the first author element (index 0), reads its text, and then takes the length of that string:

author = authors.getElementsByTagName("author")[0].firstChild.data
num_authors = len(author)
print("Number of authors: ", num_authors )

Just drop the [0] index so that you keep all the author elements; len() then counts the elements:

author = authors.getElementsByTagName("author")
num_authors = len(author)
print("Number of authors: ", num_authors )
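To see the difference side by side, here is a minimal self-contained sketch. It assumes the sample XML from the question is held in a string (the `SAMPLE` name is only for illustration) and uses `minidom.parseString` instead of reading test.xml:

```python
from xml.dom import minidom

# Illustrative stand-in for test.xml: the <authors> part of the question's sample.
SAMPLE = """<pub>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>"""

authors = minidom.parseString(SAMPLE).getElementsByTagName("authors")[0]

# With [0] and .firstChild.data we get a string, so len() counts characters.
first_name = authors.getElementsByTagName("author")[0].firstChild.data
first_name_length = len(first_name)

# Without the index we get a list of elements, so len() counts authors.
author_elements = authors.getElementsByTagName("author")
num_authors = len(author_elements)

print("Characters in first name:", first_name_length)  # 15
print("Number of authors:", num_authors)               # 4
```

The 15 from the question is simply the length of the string "J. K. Schneider".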

You can use a list comprehension to collect the author names, rather than the author elements, in a list:

author = [a.firstChild.data for a in authors.getElementsByTagName("author")]
print(author)
# [u'J. K. Schneider', u'C. E. Richardson', u'F. W. Kiefer', u'Venu Govindaraju']
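Since the question mentions inserting the authors into a database next, here is a rough sketch of how the list of names could feed SQLite. The table name `pub_author`, its columns, the helper `text_of`, and the in-memory database are all assumptions made for illustration; the real schema for the assignment may look different.

```python
import sqlite3
from xml.dom import minidom

# Trimmed copy of the sample from the question (illustrative only).
SAMPLE = """<pub>
<ID>7</ID>
<year>2003</year>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>"""


def text_of(parent, tag):
    """Return the text of the first <tag> element found under parent."""
    return parent.getElementsByTagName(tag)[0].firstChild.data


doc = minidom.parseString(SAMPLE)

# Hypothetical schema: one row per (publication, author) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pub_author (pub_id INTEGER, position INTEGER, name TEXT)"
)

for pub in doc.getElementsByTagName("pub"):
    pub_id = int(text_of(pub, "ID"))
    names = [a.firstChild.data for a in pub.getElementsByTagName("author")]
    # Parameterized query: the driver handles quoting of the names.
    conn.executemany(
        "INSERT INTO pub_author VALUES (?, ?, ?)",
        [(pub_id, i, name) for i, name in enumerate(names)],
    )
conn.commit()

rows = conn.execute(
    "SELECT name FROM pub_author WHERE pub_id = 7 ORDER BY position"
).fetchall()
print(rows)
```

Looping over every pub element means the same code should work when the file contains many publications, although for a file with millions of lines a streaming parser may be a better fit than loading the whole DOM.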

2 Comments

I knew I needed to access each variable in an array, but wasn't sure on the syntax. Thank you so much!
Hey @har07, I made progress, but some of my XML data is "bad" in a sense... I have entries with special characters like "í" in names, which appear as "&iacute;" in the XML file. How do I process these special language characters in Python? The error I am getting is "ExpatError: undefined entity:".
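One possible way around the undefined-entity error, offered as a sketch rather than a definitive fix: XML itself only predefines five named entities (amp, lt, gt, quot, apos), so an HTML name like iacute is unknown to the parser unless a DTD declares it. Assuming Python 3 and that the file can be read as text first, the HTML names can be rewritten as numeric character references before parsing. The function name `replace_html_entities` and the sample author name are made up for this example.

```python
import re
from html.entities import name2codepoint
from xml.dom import minidom

# The five entities XML predefines; these must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def replace_html_entities(text):
    """Rewrite HTML named entities (e.g. &iacute;) as numeric references."""
    def to_numeric(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own and unknown names alone
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", to_numeric, text)


# Made-up sample data containing an HTML entity and a normal XML one.
raw = "<pub><author>Mart&iacute;n</author><title>A &amp; B</title></pub>"

doc = minidom.parseString(replace_html_entities(raw))
name = doc.getElementsByTagName("author")[0].firstChild.data
title = doc.getElementsByTagName("title")[0].firstChild.data
print(name)   # Martín
print(title)  # A & B
```

Numeric references such as &#237; are valid in any XML document, so the parser accepts them without a DTD. For the real file, the text would be read with open(), passed through the function, and then handed to minidom.parseString.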
