I'm currently learning how to parse XML data using ElementTree. I got an error that says: ParseError: not well-formed (invalid token): line 1, column 2.

My code is below, followed by a sample of the XML data.

import xml.etree.ElementTree as ET

tree = ET.fromstring("C:\pbc.xml")
root = tree.getroot()


for article in root.findall('article'):
    print ' '.join([t.text for t in pub.findall('title')])
    for author in article.findall('author'):
        print 'Author name: {}'.format(author.text)
    for journal in article.findall('journal'):  # all venue tags with id attribute
        print 'journal'

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
<journal>IBM Research Report, San Jose, California</journal>
<volume>RJ909</volume>
<month>August</month>
<year>1971</year>
<cdrom>ibmTR/rj909.pdf</cdrom>
<ee>db/labs/ibm/RJ909.html</ee>
</article>

<article mdate="2002-01-03" key="persons/Hall74">
<author>Patrick A. V. Hall</author>
<title>Common Subexpression Identification in General Algebraic Systems.</title>
<journal>Technical Rep. UKSC 0060, IBM United Kingdom Scientific Centre</journal>
<month>November</month>
<year>1974</year>
</article>

3 Answers

You are using .fromstring() instead of .parse():

import xml.etree.ElementTree as ET

tree = ET.parse(r"C:\pbc.xml")
root = tree.getroot()

.fromstring() expects to be given the XML data in a bytestring, not a filename.
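
To make the difference concrete, here is a minimal sketch. It is written in Python 3 syntax (the snippets in this thread are Python 2), and the file name pbc_sample.xml and the one-article document are made up for illustration; they are not the asker's data.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document, written to a temporary file for the demo.
xml_data = b"<dblp><article><author>E. F. Codd</author></article></dblp>"
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "pbc_sample.xml")  # assumed file name
with open(path, "wb") as f:
    f.write(xml_data)

# parse() takes a filename (or file object) and returns an ElementTree.
root_from_file = ET.parse(path).getroot()

# fromstring() takes the XML text itself and returns the root Element directly.
root_from_data = ET.fromstring(xml_data)

# Passing the filename to fromstring() makes the parser treat the path
# itself as XML content, which fails.
try:
    ET.fromstring(path)
    filename_error = None
except ET.ParseError as exc:
    filename_error = exc

print(root_from_file.tag, root_from_data.tag)
print("fromstring(path) raised:", type(filename_error).__name__)
```

The exact wording of the ParseError depends on the path string, so the sketch only reports the exception type.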

If the document is really large (many megabytes or more) then you should use the ET.iterparse() function instead and clear elements you have processed:

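for event, article in ET.iterparse('C:\\pbc.xml'):
    # ElementTree's iterparse() has no tag filter, so skip other elements here
    if article.tag != 'article':
        continue
    for title in article.findall('title'):
        print 'Title: {}'.format(title.text)
    for author in article.findall('author'):
        print 'Author name: {}'.format(author.text)
    for journal in article.findall('journal'):
        print 'Journal: {}'.format(journal.text)

    # free the memory held by this article once it has been processed
    article.clear()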
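
If you want to try the streaming approach without the real file, the following self-contained sketch may help. It assumes Python 3 and uses a small in-memory document (an io.BytesIO stand-in with two records borrowed from the question) in place of the large file; the records list is only there to make the result easy to inspect.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-in for a large file on disk.
xml_data = b"""<dblp>
<article><author>E. F. Codd</author><title>Further Normalization of the Data Base Relational Model.</title></article>
<article><author>Patrick A. V. Hall</author><title>Common Subexpression Identification in General Algebraic Systems.</title></article>
</dblp>"""

records = []
# The default event is 'end', so each element arrives once it is complete.
for event, elem in ET.iterparse(io.BytesIO(xml_data)):
    if elem.tag != "article":
        continue  # skip <author>, <title>, <dblp>, ...
    title = elem.findtext("title")
    authors = [a.text for a in elem.findall("author")]
    records.append((title, authors))
    elem.clear()  # drop the article's children to keep memory use low

for title, authors in records:
    print("Title: {}".format(title))
    print("Authors: {}".format(", ".join(authors)))
```

The values are copied out of each element before clear() is called, because clear() discards the element's text and children.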

8 Comments

Hi Pieters, I used iterparse, as well as the code you put forward; however, I got the following error: ParseError: no element found: line 21, column 0.
@user2274879: Then there appears to be a problem with your input XML file. Use an XML validator to check for errors and fix them before trying to parse the file with Python.
Thanks a lot. I'll use an XML validator to check for errors; however, the main XML data is extremely large.
@user2274879: Take the first 100 lines or so, making sure you get a complete XML document (make sure it has complete <article> elements and a closing </dblp> tag at the end).
<dblp> <article mdate="2002-01-03" key="persons/Tresch96"> <author>Markus Tresch</author> <title>Principles of Distributed Object Database Languages.</title> <journal>technical Report 248, ETH Z&uuml;rich, Dept. of Computer Science</journal> <month>July</month> <year>1996</year> </article> </dblp>

with open(r"C:\pbc.xml", 'rb') as f:
    root = ET.fromstring(f.read().strip())

Unlike ET.parse, ET.fromstring expects a string with XML content, not the name of a file.

Also in contrast to ET.parse, ET.fromstring returns a root Element, not a Tree. So you should omit

root = tree.getroot()

Also, the XML snippet you posted needs a closing </dblp> to be parsable. I assume your real data has that closing tag...
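
A small sketch of both points, assuming Python 3 and a made-up one-article document: fromstring() hands back the root Element directly, and a document that lacks its closing </dblp> tag fails with the "no element found" error mentioned in the comments.

```python
import xml.etree.ElementTree as ET

complete = b"<dblp><article><author>Markus Tresch</author></article></dblp>"
truncated = complete[: -len(b"</dblp>")]  # same data without the closing tag

root = ET.fromstring(complete)  # already the root Element; no getroot() needed

try:
    ET.fromstring(truncated)
    message = None
except ET.ParseError as exc:
    message = str(exc)  # the parser ran out of input before the root closed

print(root.tag)
print(message)
```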


The iterparse provided by xml.etree.ElementTree does not have a tag argument, although lxml.etree.iterparse does have a tag argument.

Try:

import xml.etree.ElementTree as ET
import htmlentitydefs

filename = "test.xml"
# http://stackoverflow.com/a/10792473/190597 (lambacck)
parser = ET.XMLParser()
parser.entity.update((x, unichr(i)) for x, i in htmlentitydefs.name2codepoint.iteritems())
context = ET.iterparse(filename, events = ('end', ), parser=parser)
for event, elem in context:
    if elem.tag == 'article':
        for author in elem.findall('author'):
            print 'Author name: {}'.format(author.text)
        for journal in elem.findall('journal'):  # every <journal> child of this article
            print(journal.text)
        elem.clear()

Note: To use iterparse your XML must be well-formed, which means, among other things, that nothing (not even an empty line) may come before the XML declaration at the beginning of the file.
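
The effect of a leading empty line is easy to reproduce. This is a sketch assuming Python 3, with a hypothetical three-line document standing in for the real file; it uses fromstring() on bytes so that .strip() can be shown as the fix.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content with an empty line before the XML declaration.
raw = b'\n<?xml version="1.0" encoding="ISO-8859-1"?>\n<dblp><article/></dblp>\n'

try:
    ET.fromstring(raw)
    leading_blank_ok = True
except ET.ParseError:
    leading_blank_ok = False  # the declaration is not at the very start

root = ET.fromstring(raw.strip())  # strip() removes the leading empty line
print(leading_blank_ok, root.tag)
```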

21 Comments

Hi unutbu, I did exactly what you suggested and got the following error: ParseError: no element found: line 21, column 0.
Remove all empty lines from the beginning of the file, or else add .strip() to f.read() (see above.)
@user2274879: Your XML document is cut off; there should be more data beyond line 21, but if your XML document matches what you posted here exactly, then at least the </dblp> closing tag is missing.
@unutbu: It's line 21 that is the problem. The XML in the OP has no more than 21 lines, and it is missing data beyond that.
@MartijnPieters: The error you point out is correct, but not the immediate error the OP is experiencing. Notice that the error occurs on column 0, not column 10.

It may be better not to feed the XML file's meta-information to the parser. The parser does well as long as the tags are properly closed, so the <?xml declaration may not be recognized by it. Omit the first two lines and try again. :-)

1 Comment

Hi Lichenbo, I removed the first two lines, and I still got the same error.
