Writing XML to file corrupts files in python

Question

I'm trying to write the contents of an xml.dom.minidom object to a file. The simplest idea is to use the 'writexml' method:

import codecs
from xml.dom import minidom

def write_xml_native():
    # Building DOM from XML
    xmldoc = minidom.parse('semio2.xml')
    f = codecs.open('codified.xml', mode='w', encoding='utf-8')
    # Using native writexml() method to write
    xmldoc.writexml(f, encoding="utf-8")
    f.close()

The problem is that this corrupts the non-Latin text in the file. The other way is to get the XML as a string and write it to the file explicitly:

def write_xml():
    # Building DOM from XML
    xmldoc = minidom.parse('semio2.xml')
    # Opening file for writing UTF-8, which is XML's default encoding
    f = codecs.open('codified3.xml', mode='w', encoding='utf-8')
    # Writing XML in UTF-8 encoding, as recommended in the documentation
    f.write(xmldoc.toxml("utf-8"))
    f.close()

This results in the following error:

Traceback (most recent call last):
  File "D:\Projects\Semio\semioparser.py", line 45, in <module>
    write_xml()
  File "D:\Projects\Semio\semioparser.py", line 42, in write_xml
    f.write(xmldoc.toxml(encoding="utf-8"))
  File "C:\Python26\lib\codecs.py", line 686, in write
    return self.writer.write(data)
  File "C:\Python26\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 2064: ordinal not in range(128)

How do I write XML text to a file? What am I missing?

EDIT: The error is fixed by adding a decode call: f.write(xmldoc.toxml("utf-8").decode("utf-8")). But Russian characters are still corrupted.

The text is not corrupted when viewed in the interpreter, only when it is written to a file.

Just a thought: Are you sure you are not viewing the file incorrectly? Maybe the reader is expecting another encoding than utf-8 and it just looks borked.
– mahju, Commented Dec 29, 2010 at 12:28
@Nubsis That is exactly what was going on. The viewer was expecting ASCII encoding. I'll keep the thread though, because using .decode() was the problem too. Thanks!
– martinthenext, Commented Jan 9, 2011 at 19:08
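
A quick way to tell real corruption from a viewer that assumes the wrong encoding is to inspect the file's raw bytes instead of trusting how an editor renders them. The following is a minimal Python 3 sketch; the helper name `is_valid_utf8` and the throwaway demo files are made up for illustration and are not part of the original code.

```python
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the raw bytes of the file decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical demo files: the same text stored in two different encodings.
demo_dir = tempfile.mkdtemp()
good_path = os.path.join(demo_dir, "utf8.xml")
bad_path = os.path.join(demo_dir, "cp1251.xml")

with open(good_path, "wb") as f:
    f.write("<ru>Тест</ru>".encode("utf-8"))
with open(bad_path, "wb") as f:
    f.write("<ru>Тест</ru>".encode("cp1251"))

print(is_valid_utf8(good_path))
print(is_valid_utf8(bad_path))
```

If the check passes but the text still looks wrong in a viewer, the viewer's encoding setting is the more likely suspect than the writing code.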

barti_ddu · Accepted Answer · 2010-12-19 18:20:18Z

10

Hmm, though this should work:

import codecs
import xml.dom.minidom as minidom

xml = minidom.parse("test.xml")
with codecs.open("out.xml", "w", "utf-8") as out:
    xml.writexml(out)

you may alternatively try:

with codecs.open("test.xml", "r", "utf-8") as inp:
    xml = minidom.parseString(inp.read().encode("utf-8"))
with codecs.open("out.xml", "w", "utf-8") as out:
    xml.writexml(out)

Update: If you construct the XML from a unicode string object, you should encode it before passing it to the minidom parser, like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs
import xml.dom.minidom as minidom

xml = minidom.parseString(u"<ru>Тест</ru>".encode("utf-8"))
with codecs.open("out.xml", "w", "utf-8") as out:
    xml.writexml(out)
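
For readers on Python 3, the same idea can be sketched as follows. This is an assumption-laden variant, not the answerer's code: it uses the built-in `open` with an explicit `encoding` instead of `codecs.open`, and writes to a temporary directory. Note that the `encoding` argument of `writexml` only fills in the XML declaration; the bytes on disk are determined by the encoding of the file object.

```python
import os
import tempfile
import xml.dom.minidom as minidom

# Parse from bytes, as in the snippet above.
doc = minidom.parseString("<ru>Тест</ru>".encode("utf-8"))

# Hypothetical output location for the demo.
out_path = os.path.join(tempfile.mkdtemp(), "out.xml")

# The file object does the actual encoding; writexml() only records
# the encoding name in the <?xml ...?> declaration.
with open(out_path, "w", encoding="utf-8") as out:
    doc.writexml(out, encoding="utf-8")

# Read the result back as raw bytes to see what was really written.
with open(out_path, "rb") as f:
    written = f.read()
print(written)
```

Keeping the file encoding and the declared encoding identical matters, because a parser or viewer reading the file later will trust the declaration.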

edited Dec 19, 2010 at 18:20

answered Dec 19, 2010 at 18:09


7 Comments

martinthenext Over a year ago

Thanks for your answer. I've tested all of your code, and none of it works for me. Even the last snippet, which has nothing to do with opening an XML file, turns the Russian string into nonsense. This suggests the problem is in writing UTF-8 to files. Any more ideas?

barti_ddu Over a year ago

@martinthenext: I'm almost sure that you are getting valid UTF-8 (all three examples work fine for me, on both Windows and Linux, with Python 2.5, 2.6 and 2.7); otherwise your Python installation is broken. Here is a screenshot: img190.imageshack.us/img190/9072/minidom.png

martinthenext Over a year ago

Wait, output of the interpreter itself is just fine, no problems with that. It gets corrupted when being written to a file. How can I fix this?

barti_ddu Over a year ago

@martinthenext: note the second line from the bottom of the screenshot: it displays the generated file's content (recoded from UTF-8 to cp866, i.e. the console encoding). And what do you mean by "corrupted"? How do you check this?

martinthenext Over a year ago

Well, 'corrupted' means Russian characters are replaced by rubbish.


Tim Pietzcker · Accepted Answer · 2010-12-19 17:48:17Z

0

Try this:

with open("codified.xml", "w") as f:
    f.write(xmldoc.toxml("utf-8").decode("utf-8"))

This works for me (under Python 3, though).
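
On Python 3, `toxml()` called with an encoding returns `bytes`, so one possible alternative is to skip the `decode` step and write those bytes to a file opened in binary mode. That also avoids relying on the platform's default text encoding, which is what `open(..., "w")` uses when no `encoding` is given. The sketch below assumes Python 3 and uses a small inline document and a temporary output path in place of the asker's files.

```python
import os
import tempfile
import xml.dom.minidom as minidom

# Stand-in for the asker's document (hypothetical content).
xmldoc = minidom.parseString("<ru>Тест</ru>".encode("utf-8"))

# With an encoding argument, toxml() returns bytes on Python 3.
xml_bytes = xmldoc.toxml(encoding="utf-8")

# Hypothetical output location for the demo.
out_path = os.path.join(tempfile.mkdtemp(), "codified.xml")

# Binary mode: the bytes are written exactly as produced, with no
# second encoding step applied by the file object.
with open(out_path, "wb") as f:
    f.write(xml_bytes)

# Round trip: decode the stored bytes explicitly as UTF-8.
with open(out_path, "rb") as f:
    round_tripped = f.read().decode("utf-8")
print(round_tripped)
```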

answered Dec 19, 2010 at 17:48


3 Comments

Tim Pietzcker Over a year ago

What happens if you x = codecs.open("semio2.xml", encoding="utf-8") and xmldoc = minidom.parse(x)?

martinthenext Over a year ago

It says UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128). I can't understand why.

barti_ddu Over a year ago

@martinthenext: you are getting this error because you are feeding minidom unicode strings, while it accepts byte strings only. If you open the file in "utf-8" mode, you should encode its contents to "utf-8" before parsing.
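
The u'\ufeff' in the error message is a byte order mark (BOM) at the start of the input file. One way to combine the advice above with BOM handling, sketched here for Python 3, is to read the file with the `utf-8-sig` codec (which drops a leading BOM) and then encode the text back to UTF-8 bytes before handing it to minidom. The file name and content below are invented for the demo.

```python
import codecs
import os
import tempfile
import xml.dom.minidom as minidom

# Create a hypothetical input file that starts with a UTF-8 BOM.
in_path = os.path.join(tempfile.mkdtemp(), "with_bom.xml")
with open(in_path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "<ru>Тест</ru>".encode("utf-8"))

# "utf-8-sig" decodes UTF-8 and strips a leading BOM if one is present.
with open(in_path, "r", encoding="utf-8-sig") as inp:
    text = inp.read()

# Feed the parser bytes, as recommended in the comment above.
doc = minidom.parseString(text.encode("utf-8"))
parsed_text = doc.documentElement.firstChild.data
print(parsed_text)
```

Passing the file path straight to `minidom.parse` and letting the parser read the raw bytes itself is the simpler route when no preprocessing of the text is needed.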
