Python lxml error "namespace not defined."

Question

I am being driven crazy by some oddly formed XML and would be grateful for some pointers.

The documents are defined like this:

<sphinx:document id="18059090929806848187">
  <url>http://www.some-website.com</url>
  <page_number>104</page_number>
  <size>7865</size>
</sphinx:document>

Now, I need to read lots of these files (500m+, all gzip-compressed) and grab the text values from a few of the contained tags.

sample code:

from lxml import etree
import gzip

# file_list holds one .gz path per line
with open('file_list') as file_list:
    for file in file_list:
        in_xml = gzip.open(file.strip('\n'))
        xml2 = etree.iterparse(in_xml)
        for action, elem in xml2:
            if elem.tag == "page_number":
                print(elem.text + str(file))

The first elem.text value is printed, but only for the first file in the list, and it is quickly followed by this error:

lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20

Please excuse my ignorance, but XML really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other, more intelligent manner?

Thanks

I think lxml expects the namespace to be defined in the document; see e.g. Wikipedia. If you have access to where the data is generated, you can add the expected definition. Otherwise you could strip the namespace away if you don't need it.
– syntonym, Commented Mar 18, 2016 at 13:40
Thanks syntonym, how could I strip the namespace away? Interestingly (to me at least), if I change sphinx:document to sphinxdocument (which I don't really want to do for the sake of efficiency), it works fine. But I can't run a replace on the gzip.open(filename.gz) output, because xml=gzip.open('00000448069335828601.xml.gz') followed by xml.replace('sphinx:document','sphinxdocument') gives: AttributeError: 'GzipFile' object has no attribute 'replace'
– RJJ, Commented Mar 18, 2016 at 13:52
Is that your entire XML document, or is that a snippet from the middle of one?
– Robᵩ, Commented Mar 18, 2016 at 13:55
@Rob, it's a snippet: the first few lines and the last line. Each file is about 400 rows of XML, all in the same structure as the first few posted above. Thanks.
– RJJ, Commented Mar 18, 2016 at 14:01
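
On the question in the comments about stripping the prefix: a GzipFile is a file object, not a string, so the replace has to be run on the bytes returned by read(). Below is a minimal sketch, assuming each file holds a single document like the one in the question and is small enough to read into memory. SAMPLE and read_without_prefix are hypothetical names for illustration, and the standard library's ElementTree stands in for lxml here only so that the snippet has no third-party dependency.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the document shown in the question.
SAMPLE = b"""<sphinx:document id="18059090929806848187">
  <url>http://www.some-website.com</url>
  <page_number>104</page_number>
  <size>7865</size>
</sphinx:document>
"""


def read_without_prefix(fileobj):
    """Read all bytes, drop the undeclared prefix, and parse the result."""
    raw = fileobj.read()  # bytes; the GzipFile object itself has no .replace()
    cleaned = raw.replace(b"sphinx:document", b"sphinxdocument")
    return ET.fromstring(cleaned)


# An in-memory gzip stream stands in for a real .gz file on disk.
with gzip.open(io.BytesIO(gzip.compress(SAMPLE))) as in_xml:
    root = read_without_prefix(in_xml)

page_number = root.findtext("page_number")
print(root.tag, page_number)
```

This rewrites every file's bytes before parsing, which is the extra cost the comment above wanted to avoid, so it is probably only worth it if the prefix cannot be declared or ignored some other way.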

Robᵩ · Accepted Answer · 2016-03-18 14:19:04Z

3

Your input file is not well-formed XML. I assume that it is a snippet from a larger XML document.

Your choices are:

1. Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.
2. Parse the file in spite of its errors. To do that, use the recover keyword of lxml.etree.iterparse:
```
xml2 = etree.iterparse(in_xml, recover=True)
```
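
To illustrate parsing in spite of the errors, here is a small self-contained sketch. It assumes the snippet from the question is the whole file. SAMPLE, page_numbers, and the in-memory gzip buffer are made up for illustration and stand in for the real .gz files.

```python
import gzip
import io

from lxml import etree

# Hypothetical sample that mirrors the document shown in the question;
# the "sphinx" prefix is deliberately left undeclared.
SAMPLE = b"""<sphinx:document id="18059090929806848187">
  <url>http://www.some-website.com</url>
  <page_number>104</page_number>
  <size>7865</size>
</sphinx:document>
"""


def page_numbers(fileobj):
    """Collect the text of every page_number element in the stream."""
    found = []
    # recover=True asks the parser to carry on past errors instead of raising.
    for _event, elem in etree.iterparse(fileobj, recover=True):
        if elem.tag == "page_number":
            found.append(elem.text)
    return found


# An in-memory gzip stream stands in for a real .gz file on disk.
with gzip.open(io.BytesIO(gzip.compress(SAMPLE))) as in_xml:
    result = page_numbers(in_xml)

print(result)
```

Keep in mind that recover=True will also carry on past other kinds of malformed input without raising, so it may hide real damage in a file.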

edited Mar 18, 2016 at 14:19

answered Mar 18, 2016 at 14:13
