Parsing large XML data using Python's ElementTree

Question

I'm currently learning how to parse XML data using ElementTree. I got an error that says: ParseError: not well-formed (invalid token): line 1, column 2.

My code is below, followed by a sample of the XML data.

import xml.etree.ElementTree as ET

tree = ET.fromstring("C:\pbc.xml")
root = tree.getroot()


for article in root.findall('article'):
    print ' '.join([t.text for t in pub.findall('title')])
    for author in article.findall('author'):
        print 'Author name: {}'.format(author.text)
    for journal in article.findall('journal'):  # all venue tags with id attribute
        print 'journal'

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
<journal>IBM Research Report, San Jose, California</journal>
<volume>RJ909</volume>
<month>August</month>
<year>1971</year>
<cdrom>ibmTR/rj909.pdf</cdrom>
<ee>db/labs/ibm/RJ909.html</ee>
</article>

<article mdate="2002-01-03" key="persons/Hall74">
<author>Patrick A. V. Hall</author>
<title>Common Subexpression Identification in General Algebraic Systems.</title>
<journal>Technical Rep. UKSC 0060, IBM United Kingdom Scientific Centre</journal>
<month>November</month>
<year>1974</year>
</article>

Martijn Pieters · Accepted Answer · 2013-05-18 14:05:14Z

1

You are using .fromstring() instead of .parse():

import xml.etree.ElementTree as ET

tree = ET.parse(r"C:\pbc.xml")  # raw string, so the backslash is never read as an escape
root = tree.getroot()

.fromstring() expects to be given the XML data in a bytestring, not a filename.
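
To make the difference concrete, here is a minimal sketch, written for Python 3 (an assumption; the code in the question is Python 2). It writes a tiny document to a temporary file and reads it back both ways. The sample document and the names sample_xml, path and path_parsed are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document, not the real dblp data.
sample_xml = b"<dblp><article><title>Example</title></article></dblp>"

# Write the sample to a temporary file so ET.parse() has a filename to open.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(sample_xml)

try:
    # ET.parse() takes a filename (or file object) and returns an ElementTree.
    root_from_file = ET.parse(path).getroot()

    # ET.fromstring() takes the XML text itself and returns the root Element.
    root_from_text = ET.fromstring(sample_xml)

    # Passing the filename to fromstring() makes the parser treat the path
    # itself as XML, which fails with a ParseError.
    try:
        ET.fromstring(path)
        path_parsed = True
    except ET.ParseError:
        path_parsed = False
finally:
    os.remove(path)

print(root_from_file.tag, root_from_text.tag, path_parsed)
```

Both roots come from the same document; only the call that receives the path as if it were XML fails.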

If the document is really large (many megabytes or more) then you should use the ET.iterparse() function instead and clear elements you have processed:

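for event, elem in ET.iterparse('C:\\pbc.xml'):
    # The standard-library iterparse() has no tag argument, so filter here.
    if elem.tag != 'article':
        continue
    for title in elem.findall('title'):
        print 'Title: {}'.format(title.text)
    for author in elem.findall('author'):
        print 'Author name: {}'.format(author.text)
    for journal in elem.findall('journal'):
        print 'Journal: {}'.format(journal.text)

    # Free the memory held by the article that was just processed.
    elem.clear()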
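
The streaming pattern can also be tried without a large file. The following is a minimal sketch for Python 3 (an assumption; the answer's code is Python 2) that feeds iterparse() an in-memory file object. The sample data and the names sample and titles are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for a large file on disk.
sample = io.BytesIO(
    b"<dblp>"
    b"<article><author>E. F. Codd</author><title>First</title></article>"
    b"<article><author>Patrick A. V. Hall</author><title>Second</title></article>"
    b"</dblp>"
)

titles = []
# By default iterparse() reports only 'end' events, so each element is
# yielded once it and all of its children have been read.
for event, elem in ET.iterparse(sample):
    if elem.tag != "article":
        continue
    titles.append(elem.findtext("title"))
    # Drop the children and text of the processed article to free memory.
    elem.clear()

print(titles)
```

Note that clear() empties each article but leaves the now-empty element attached to the root, so for extremely large inputs the root itself may still need pruning.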

edited May 18, 2013 at 14:05

answered May 18, 2013 at 13:56

8 Comments

user2274879 Over a year ago

Hi Pieters, I used iterparse, as well as the code you put forward; however, I got the following error: ParseError: no element found: line 21, column 0.

Martijn Pieters Over a year ago

@user2274879: Then there appears to be a problem with your input XML file. Use an XML validator to check for errors and fix them before trying to parse the file with Python.

user2274879 Over a year ago

Thanks a lot. I'll use an XML validator to check for errors; however, the main XML data is extremely large.

Martijn Pieters Over a year ago

@user2274879: Take the first 100 lines or so, making sure you get a complete XML document (make sure it has complete <article> elements and a closing </dblp> tag at the end).

user2274879 Over a year ago

<dblp> <article mdate="2002-01-03" key="persons/Tresch96"> <author>Markus Tresch</author> <title>Principles of Distributed Object Database Languages.</title> <journal>technical Report 248, ETH Zürich, Dept. of Computer Science</journal> <month>July</month> <year>1996</year> </article> </dblp>

unutbu · Accepted Answer · 2013-05-20 14:50:31Z

1

with open(r"C:\pbc.xml", 'rb') as f:
    root = ET.fromstring(f.read().strip())

Unlike ET.parse, ET.fromstring expects a string with XML content, not the name of a file.

Also in contrast to ET.parse, ET.fromstring returns a root Element, not a Tree. So you should omit

root = tree.getroot()

Also, the XML snippet you posted needs a closing </dblp> to be parsable. I assume your real data has that closing tag...
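
Both points can be checked with a minimal sketch for Python 3 (an assumption; the question uses Python 2). The two sample strings and the names complete, truncated, has_getroot and error_message are made up for illustration.

```python
import xml.etree.ElementTree as ET

complete = "<dblp><article><title>Example</title></article></dblp>"
truncated = "<dblp><article><title>Example</title></article>"  # no </dblp>

# fromstring() already returns the root Element, so there is no getroot().
root = ET.fromstring(complete)
has_getroot = hasattr(root, "getroot")

# The truncated document fails because the parser reaches the end of the
# input while <dblp> is still open.
try:
    ET.fromstring(truncated)
    error_message = None
except ET.ParseError as exc:
    error_message = str(exc)

print(root.tag, has_getroot, error_message)
```

The truncated input produces a "no element found" ParseError, the same kind of message reported in the comments.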

The iterparse provided by xml.etree.ElementTree does not have a tag argument, although lxml.etree.iterparse does have a tag argument.

Try:

import xml.etree.ElementTree as ET
import htmlentitydefs

filename = "test.xml"
# http://stackoverflow.com/a/10792473/190597 (lambacck)
parser = ET.XMLParser()
parser.entity.update((x, unichr(i)) for x, i in htmlentitydefs.name2codepoint.iteritems())
context = ET.iterparse(filename, events = ('end', ), parser=parser)
for event, elem in context:
    if elem.tag == 'article':
        for author in elem.findall('author'):
            print 'Author name: {}'.format(author.text)
        for journal in elem.findall('journal'):  # every <journal> child of this article
            print(journal.text)
        elem.clear()

Note: To use iterparse your XML must be well-formed. Among other things, the XML declaration (<?xml ...?>), if present, must be the very first thing in the file, so there cannot be empty lines at the beginning of the file.
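
The effect of a leading empty line can be reproduced with a minimal sketch for Python 3 (an assumption; the answer's code is Python 2). The content of raw is hypothetical and only mimics the first lines of the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content with an empty line before the XML declaration.
raw = b'\n<?xml version="1.0" encoding="ISO-8859-1"?>\n<dblp></dblp>\n'

# The declaration is not at the very start, so parsing is rejected.
try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False

# After stripping the surrounding whitespace the declaration comes first.
root = ET.fromstring(raw.strip())

print(parsed_raw, root.tag)
```

This is why reading the file and calling .strip() before fromstring() helps, while iterparse(), which reads the file itself, needs the file to be cleaned up first.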

edited May 20, 2013 at 14:50

answered May 18, 2013 at 13:38

21 Comments

user2274879 Over a year ago

Hi unutbu, I did exactly what you suggested and got the following error: ParseError: no element found: line 21, column 0.

unutbu Over a year ago

Remove all empty lines from the beginning of the file, or else add .strip() to f.read() (see above.)

Martijn Pieters Over a year ago

@user2274879: Your XML document is cut off; there should be more data beyond line 21, but if your XML document matches what you posted here exactly, then at least the </dblp> closing tag is missing.

Martijn Pieters Over a year ago

@unutbu: It's line 21 that is the problem. The XML in the OP has no more than 21 lines, and it is missing data beyond that.

unutbu Over a year ago

@MartijnPieters: The error you point out is correct, but not the immediate error the OP is experiencing. Notice that the error occurs on column 0, not column 10.

lichenbo · Accepted Answer · 2013-05-18 13:36:25Z

0

You had better not feed the meta-information of the XML file to the parser. The parser does well if the tags are properly closed, but the <?xml declaration may not be recognized by it. So omit the first two lines and try again. :-)

answered May 18, 2013 at 13:36

1 Comment

user2274879 Over a year ago

Hi Lichenbo, I removed the first two lines, and I still got the same error.
