Handling multiple nodes when parsing XML with Python

Question

For an assignment, I need to parse a 2-million-line XML file and load the data into a MySQL database. Since we are using a Python environment with SQLite for the class, I am attempting to use Python to parse the file. Keep in mind I am just learning Python, so everything is new!

I have made a few attempts but keep failing and getting frustrated. For efficiency, I am testing my code on just a small piece of the full XML, here:

<pub>
<ID>7</ID>
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title>
<year>2003</year>
<booktitle>AVBPA</booktitle>
<pages>895-902</pages>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>

First attempt

Here I successfully pulled out all data from each tag, except when there are multiple authors under the <authors> tag. I am trying to loop through each node in the authors tag, count, then create a temporary array for those authors, then throw them into my database next with SQL. I am getting "15" for the number of authors, but clearly there are only 4! How do I solve this?

from xml.dom import minidom

xmldoc= minidom.parse("test.xml")

pub = xmldoc.getElementsByTagName("pub")[0]
ID = pub.getElementsByTagName("ID")[0].firstChild.data
title = pub.getElementsByTagName("title")[0].firstChild.data
year = pub.getElementsByTagName("year")[0].firstChild.data
booktitle = pub.getElementsByTagName("booktitle")[0].firstChild.data
pages = pub.getElementsByTagName("pages")[0].firstChild.data
authors = pub.getElementsByTagName("authors")[0]
author = authors.getElementsByTagName("author")[0].firstChild.data
num_authors = len(author)
print("Number of authors: ", num_authors )

print(ID)
print(title)
print(year)
print(booktitle)
print(pages)
print(author)

har07 · Accepted Answer · 2017-04-23 06:27:02Z


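Notice that you were getting the number of characters in the first author's name. The code limits the result to the first author (index 0), reads its text, and then takes the length of that string; "J. K. Schneider" is 15 characters long, which is where the 15 comes from: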

author = authors.getElementsByTagName("author")[0].firstChild.data
num_authors = len(author)
print("Number of authors: ", num_authors )

To get all the authors, don't limit the result to index 0. Keep the full list of author elements and take its length:

author = authors.getElementsByTagName("author")
num_authors = len(author)
print("Number of authors: ", num_authors )

You can use a list comprehension to get the author names, rather than the author elements, as a list:

author = [a.firstChild.data for a in authors.getElementsByTagName("author")]
print(author)
# [u'J. K. Schneider', u'C. E. Richardson', u'F. W. Kiefer', u'Venu Govindaraju']
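
As a rough sketch of how the same list comprehension might extend to every <pub> in the file and feed the database step described in the question: the table names (pub, author), the helper names (text_of, read_pubs), the <pubs> root element, and the in-memory SQLite database are all assumptions made for illustration, not part of the original post.

```python
import sqlite3
from xml.dom import minidom

# Trimmed copy of the sample record, wrapped in a hypothetical <pubs> root.
SAMPLE = """<pubs>
<pub>
<ID>7</ID>
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>
</pubs>"""


def text_of(parent, tag):
    """Return the text of the first <tag> element found under parent."""
    return parent.getElementsByTagName(tag)[0].firstChild.data


def read_pubs(xmldoc):
    """Yield (pub_id, title, [author names]) for every <pub> element."""
    for pub in xmldoc.getElementsByTagName("pub"):
        names = [a.firstChild.data for a in pub.getElementsByTagName("author")]
        yield int(text_of(pub, "ID")), text_of(pub, "title"), names


# Hypothetical schema: one row per publication, one row per author.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pub (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE author (pub_id INTEGER, position INTEGER, name TEXT)")

for pub_id, title, names in read_pubs(minidom.parseString(SAMPLE)):
    conn.execute("INSERT INTO pub VALUES (?, ?)", (pub_id, title))
    conn.executemany(
        "INSERT INTO author VALUES (?, ?, ?)",
        [(pub_id, i, name) for i, name in enumerate(names)],
    )
conn.commit()

author_rows = conn.execute(
    "SELECT name FROM author WHERE pub_id = 7 ORDER BY position"
).fetchall()
print(len(author_rows), author_rows)
```

Keeping authors in their own table avoids guessing a maximum number of author columns per publication. Note that minidom loads the whole document into memory, so for the full file a streaming parser may be a better fit.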


2 Comments

douglasrcjames_old Over a year ago

I knew I needed to access each variable in an array, but wasn't sure on the syntax. Thank you so much!

douglasrcjames_old Over a year ago

Hey @har07, so I made progress, but some of my XML data is "bad" in a sense... I have an entry with special characters like "í" in names, which appear as "&iacute;" in the XML file. How do I process these special language characters in Python? The error I am getting is "ExpatError: undefined entity:".
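
One possible way past the ExpatError in the comment above, assuming the file contains HTML named entities such as &iacute; (an assumption; the thread does not show the raw text): XML predefines only five entities (amp, lt, gt, quot, apos), so expat rejects the HTML ones unless a DTD declares them. The sketch below rewrites HTML entity names into numeric character references before parsing. The helper name replace_html_entities and the sample string are made up for illustration.

```python
import re
from html.entities import name2codepoint
from xml.dom import minidom

# Entities that XML itself defines; these must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def replace_html_entities(text):
    """Turn HTML named entities (e.g. &iacute;) into numeric references (&#237;)."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names as-is
        return "&#{};".format(name2codepoint[name])
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


# Hypothetical input containing two HTML entities.
raw = "<pub><authors><author>Jos&eacute; Mar&iacute;a</author></authors></pub>"
doc = minidom.parseString(replace_html_entities(raw))
name = doc.getElementsByTagName("author")[0].firstChild.data
print(name)
```

For a real file, the text would be read with the file's actual encoding, passed through the helper, and then handed to minidom.parseString.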
