For an assignment, I need to parse a 2-million-line XML file and load the data into a MySQL database. Since we are using a Python environment with SQLite for the class, I am attempting to use Python to parse the file. Keep in mind I am just learning Python, so everything is new!
I have made a few attempts, but I keep failing and getting frustrated. For efficiency, I am testing my code on just a small sample of the full XML:
<pub>
<ID>7</ID>
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title>
<year>2003</year>
<booktitle>AVBPA</booktitle>
<pages>895-902</pages>
<authors>
<author>J. K. Schneider</author>
<author>C. E. Richardson</author>
<author>F. W. Kiefer</author>
<author>Venu Govindaraju</author>
</authors>
</pub>
First attempt
Here I successfully pull the data out of every tag except when there are multiple authors under the <authors> tag. I am trying to loop through each <author> node inside <authors>, count them, collect them in a temporary list, and then insert them into my database with SQL. The code prints "15" for the number of authors, but there are clearly only 4! How do I solve this?
from xml.dom import minidom
xmldoc = minidom.parse("test.xml")
pub = xmldoc.getElementsByTagName("pub")[0]
ID = pub.getElementsByTagName("ID")[0].firstChild.data
title = pub.getElementsByTagName("title")[0].firstChild.data
year = pub.getElementsByTagName("year")[0].firstChild.data
booktitle = pub.getElementsByTagName("booktitle")[0].firstChild.data
pages = pub.getElementsByTagName("pages")[0].firstChild.data
authors = pub.getElementsByTagName("authors")[0]
author = authors.getElementsByTagName("author")[0].firstChild.data
num_authors = len(author)
print("Number of authors: ", num_authors )
print(ID)
print(title)
print(year)
print(booktitle)
print(pages)
print(author)
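A likely explanation for the 15: in the code above, `author` holds only the text of the first <author> element, "J. K. Schneider", which is 15 characters long, so `len(author)` counts characters rather than elements. Below is a minimal sketch of the behaviour I am after, keeping the whole list returned by `getElementsByTagName` and counting that instead. It assumes a trimmed inline copy of the sample record (the `SAMPLE` string) in place of test.xml so that it runs on its own, and the names `author_nodes` and `author_names` are only illustrative.

```python
from xml.dom import minidom

# Assumption: a trimmed inline copy of the sample record, used instead of
# test.xml so this sketch is self-contained.
SAMPLE = """<pub>
<ID>7</ID>
<authors>
<author>J. K. Schneider</author>
<author>C. E. Richardson</author>
<author>F. W. Kiefer</author>
<author>Venu Govindaraju</author>
</authors>
</pub>"""

xmldoc = minidom.parseString(SAMPLE)
pub = xmldoc.getElementsByTagName("pub")[0]
authors = pub.getElementsByTagName("authors")[0]

# Keep every <author> element instead of indexing [0] right away.
author_nodes = authors.getElementsByTagName("author")

# Pull the text out of each element to build the temporary list of names.
author_names = [node.firstChild.data for node in author_nodes]

# len() of the list counts elements, not the characters of one name.
num_authors = len(author_names)
print("Number of authors:", num_authors)
for name in author_names:
    print(name)
```

With the real file, `minidom.parseString(SAMPLE)` would go back to `minidom.parse("test.xml")`, and `author_names` is the list that would then be passed on to the SQL insert step.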