
I am being driven crazy by some oddly formed xml and would be grateful for some pointers:

The documents are defined like this:

<sphinx:document id="18059090929806848187">
  <url>http://www.some-website.com</url>
  <page_number>104</page_number>
  <size>7865</size>
</sphinx:document>

Now, I need to read lots of these files (500m+, all gzip-compressed) and grab the text values from a few of the contained tags.

Sample code:

from lxml import etree
import gzip

# file_list holds one gzipped XML path per line
with open('file_list') as file_list:
    for path in file_list:
        path = path.strip()
        with gzip.open(path) as in_xml:
            # iterparse yields ('end', element) pairs by default
            for action, elem in etree.iterparse(in_xml):
                if elem.tag == "page_number":
                    print(elem.text, path)

The first elem.text value is printed, but only for the first file in the list, and it is immediately followed by this error:

lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20

Please excuse my ignorance but xml really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other more intelligent manner?

Thanks

  • See this question: stackoverflow.com/questions/7018326/… Commented Mar 18, 2016 at 13:36
  • I think lxml expects the namespace to be defined in the document (see e.g. Wikipedia). If you have access to where the data is generated, you can add the expected definition. Otherwise you could strip the namespace away if you don't need it. Commented Mar 18, 2016 at 13:40
  • Thanks syntonym, how could I strip the namespace away? Interestingly (to me at least), if I change sphinx:document to sphinxdocument (which I don't really want to do, for the sake of efficiency), it works fine. But I can't run a replace on the gzip.open(filename.gz) output: after xml = gzip.open('00000448069335828601.xml.gz'), calling xml.replace('sphinx:document', 'sphinxdocument') raises AttributeError: 'GzipFile' object has no attribute 'replace'. Commented Mar 18, 2016 at 13:52
  • Is that your entire XML document, or is that a snippet from the middle of one? Commented Mar 18, 2016 at 13:55
  • @Rob, It's a snippet - the first few lines and the last line. Each file is about 400 rows of xml, all in the same structure as the first few posted above. thanks Commented Mar 18, 2016 at 14:01
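
On the AttributeError raised in the comments: gzip.open returns a file object, not a string, so replace has to be applied to the bytes returned by its read() method. The following is only a rough sketch of the "strip the namespace away" idea, assuming the files are small enough to read into memory in one go (about 400 lines each, per the comment above). The strip_prefix helper and the in-memory sample are made up for illustration and are not part of lxml or of the original code.

```python
import gzip
import io

from lxml import etree


def strip_prefix(raw, prefix=b"sphinx:"):
    """Hypothetical helper: drop an undeclared prefix from element tags.

    Turns <sphinx:document> into <document> and </sphinx:document>
    into </document>. It is a plain byte replacement, so it assumes the
    prefix never appears right after '<' inside text or attribute values.
    """
    return raw.replace(b"<" + prefix, b"<").replace(b"</" + prefix, b"</")


# Stand-in for one of the .xml.gz files, built in memory for the sketch.
sample = (
    b'<sphinx:document id="18059090929806848187">'
    b"<page_number>104</page_number>"
    b"</sphinx:document>"
)
compressed = gzip.compress(sample)

# gzip.open accepts a path or an existing file object; read() gives bytes.
with gzip.open(io.BytesIO(compressed)) as fh:
    cleaned = strip_prefix(fh.read())

root = etree.fromstring(cleaned)
pages = [el.text for el in root.iter("page_number")]
```

This reads each file fully before parsing, which gives up the streaming behaviour of iterparse in exchange for a document that parses without any recovery mode.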

1 Answer


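Your input file is not well-formed XML: the sphinx prefix is used without being bound to a namespace, which is exactly what the error message reports. I assume that it is a snippet from a larger XML document.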

Your choices are:

  • Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.

  • Parse the file in spite of its errors. To do that, pass the recover keyword argument to lxml.etree.iterparse:

    xml2 = etree.iterparse(in_xml, recover=True)
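
For a slightly fuller picture of the second option, here is a minimal sketch that runs iterparse with recover=True over an in-memory copy of the snippet from the question. The page_numbers function and the SAMPLE constant are illustrative names, not part of the original code, and the sketch assumes a single top-level element; how recovery behaves on other kinds of malformed input may differ.

```python
import io

from lxml import etree

# In-memory copy of the snippet from the question; the sphinx prefix
# is never declared, so a strict parser rejects it.
SAMPLE = b"""<sphinx:document id="18059090929806848187">
  <url>http://www.some-website.com</url>
  <page_number>104</page_number>
  <size>7865</size>
</sphinx:document>"""


def page_numbers(stream):
    """Yield the text of every <page_number> element in the stream.

    recover=True asks the parser to continue past recoverable errors
    such as the undeclared namespace prefix.
    """
    for _event, elem in etree.iterparse(stream, recover=True):
        if elem.tag == "page_number":
            yield elem.text
            # Free the element once the caller has consumed its text.
            elem.clear()


result = list(page_numbers(io.BytesIO(SAMPLE)))
```

In the original loop, the same generator could be fed the object returned by gzip.open(path) instead of the BytesIO stand-in.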
    