
I'm trying to write the contents of an xml.dom.minidom object to a file. The simplest approach is to use the 'writexml' method:

import codecs
from xml.dom import minidom

def write_xml_native():
    # Building DOM from XML
    xmldoc = minidom.parse('semio2.xml')
    f = codecs.open('codified.xml', mode='w', encoding='utf-8')
    # Using native writexml() method to write; 'encoding' sets the XML declaration
    xmldoc.writexml(f, encoding="utf-8")
    f.close()

The problem is that the non-Latin (Russian) text in the resulting file comes out corrupted. The other way is to get the XML as a string and write it to the file explicitly:

def write_xml():
    # Building DOM from XML
    xmldoc = minidom.parse('semio2.xml')
    # Opening file for writing UTF-8, which is XML's default encoding
    f = codecs.open('codified3.xml', mode='w', encoding='utf-8')
    # Writing XML in UTF-8 encoding, as recommended in the documentation
    f.write(xmldoc.toxml("utf-8"))
    f.close()

This results in the following error:

Traceback (most recent call last):
  File "D:\Projects\Semio\semioparser.py", line 45, in <module>
    write_xml()
  File "D:\Projects\Semio\semioparser.py", line 42, in write_xml
    f.write(xmldoc.toxml(encoding="utf-8"))
  File "C:\Python26\lib\codecs.py", line 686, in write
    return self.writer.write(data)
  File "C:\Python26\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 2064: ordinal not in range(128)

How do I write XML text to a file? What am I missing?

EDIT: The error is fixed by adding a decode call: f.write(xmldoc.toxml("utf-8").decode("utf-8")). But the Russian characters are still corrupted.

The text is not corrupted when viewed in the interpreter, only once it has been written to the file.

  • Just a thought: Are you sure you are not viewing the file incorrectly? Maybe the reader is expecting another encoding than utf-8 and it just looks borked. Commented Dec 29, 2010 at 12:28
  • @Nubsis That is exactly what was going on. The viewer was expecting ASCII encoding. I'll keep the thread though because using .decode() was the problem too. Thanks! Commented Jan 9, 2011 at 19:08
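
Since the viewer turned out to be the culprit, one way to rule out the file itself is to check the raw bytes on disk instead of trusting how an editor renders them. A minimal sketch, assuming Python 3 (the question uses Python 2.6); the helper name `is_valid_utf8`, the sample document, and the temporary output path are made up for illustration:

```python
import os
import tempfile
from xml.dom import minidom


def is_valid_utf8(path):
    """Return True if the raw bytes of the file decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample: a one-element document holding the Russian word "Тест".
sample_text = "\u0422\u0435\u0441\u0442"
doc = minidom.parseString(("<ru>" + sample_text + "</ru>").encode("utf-8"))

# In Python 3, toxml(encoding=...) returns bytes, so the file is opened in binary mode.
demo_path = os.path.join(tempfile.mkdtemp(), "out.xml")
with open(demo_path, "wb") as out:
    out.write(doc.toxml(encoding="utf-8"))

# Read the bytes back without any decoding by a viewer or editor.
with open(demo_path, "rb") as f:
    raw_bytes = f.read()

print(is_valid_utf8(demo_path))
print(sample_text.encode("utf-8") in raw_bytes)
```

If both checks pass, the bytes on disk are fine and the garbling comes from whatever program is displaying them with a different encoding.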

2 Answers


Hmm, though this should work:

import codecs
import xml.dom.minidom as minidom

xml = minidom.parse("test.xml")
with codecs.open("out.xml", "w", "utf-8") as out:
    xml.writexml(out)

you may alternatively try:

with codecs.open("test.xml", "r", "utf-8") as inp:
    xml = minidom.parseString(inp.read().encode("utf-8"))
with codecs.open("out.xml", "w", "utf-8") as out:
    xml.writexml(out)

Update: If you build the XML from a unicode string, you should encode it to UTF-8 before passing it to the minidom parser, like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs
import xml.dom.minidom as minidom

xml = minidom.parseString(u"<ru>Тест</ru>".encode("utf-8"))
with codecs.open("out.xml", "w", "utf-8") as out:
    xml.writexml(out)

7 Comments

Thanks for your answer. I've tested all of your code, and none of it works for me. Even the last piece, which has nothing to do with opening an XML file, turns the Russian string into nonsense. This means the problem is in writing UTF-8 to files. Any more ideas?
@martinthenext: I'm almost sure that you are getting valid UTF-8 (all 3 examples work fine for me, on both Windows and Linux, with Python 2.5, 2.6 and 2.7); otherwise your Python installation is broken. Here is a screenshot: img190.imageshack.us/img190/9072/minidom.png
Wait, the output in the interpreter itself is just fine, no problems with that. It gets corrupted when it is written to a file. How can I fix this?
@martinthenext: Note the second line from the bottom of the screenshot: it displays the generated file's content (recoded from UTF-8 to cp866, i.e. the console encoding). And what do you mean by "corrupted"? How do you check this?
Well, 'corrupted' means the Russian characters are replaced by rubbish.

Try this:

with open("codified.xml", "w") as f:
    f.write(xmldoc.toxml("utf-8").decode("utf-8"))

This works for me (under Python 3, though).
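
For readers on Python 3, here is a hedged sketch of the same write. `open()` without an `encoding` argument may fall back to a platform-dependent default, so this version passes `encoding="utf-8"` explicitly and lets `writexml` emit a matching XML declaration. The sample document and the temporary path are assumptions for illustration, not the asker's files:

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical sample document containing the Russian word "Тест".
sample_text = "\u0422\u0435\u0441\u0442"
doc = minidom.parseString(("<ru>" + sample_text + "</ru>").encode("utf-8"))

out_path = os.path.join(tempfile.mkdtemp(), "codified.xml")

# Text mode with an explicit encoding: writexml() writes str objects and the
# file object encodes them as UTF-8 on the way to disk.
with open(out_path, "w", encoding="utf-8") as f:
    doc.writexml(f, encoding="utf-8")

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    written = f.read()

print(written)
```

Here the `encoding` argument of `writexml` only controls the declaration at the top of the file; the actual byte encoding comes from how the file was opened, so the two should agree.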

3 Comments

What happens if you do x = codecs.open("semio2.xml", encoding="utf-8") and xmldoc = minidom.parse(x)?
It says UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128). I can't understand why.
@martinthenext: You are getting this error because you are feeding minidom unicode strings (while it accepts binary only). If you open the file in "utf-8" mode, you should encode its contents to "utf-8" before parsing.
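
The `u'\ufeff'` in that error is a byte order mark (BOM) at the start of the decoded text. A minimal sketch, assuming Python 3 and an in-memory sample rather than the asker's `semio2.xml`: decoding with the `utf-8-sig` codec drops the BOM, and re-encoding gives the parser the bytes described in the comment above.

```python
import codecs
from xml.dom import minidom

# Hypothetical input: UTF-8 bytes with a leading BOM, as some editors save them.
sample_text = "\u0422\u0435\u0441\u0442"
raw = codecs.BOM_UTF8 + ("<ru>" + sample_text + "</ru>").encode("utf-8")

# "utf-8-sig" decodes UTF-8 and strips a leading BOM if one is present.
text = raw.decode("utf-8-sig")

# Hand the parser encoded bytes rather than an already-decoded string.
doc = minidom.parseString(text.encode("utf-8"))
value = doc.documentElement.firstChild.data

print(value)
```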
