Why does str.encode('utf-8') produce UnicodeDecodeError in my Python script?

Question

When running the following code (which just prints out file names):

print filename

It throws the following error:

File "myscript.py", line 78, in __listfilenames
print filename
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)

So to fix this, I tried changing print filename to print filename.encode('utf-8') which didn't fix the problem.

The script only fails when trying to read a filename such as Coé.jpg.

Any ideas how I can modify filename so the script continues to work when it comes across a special character?

NB. I'm a python noob

Martijn Pieters · Accepted Answer · 2015-01-19 17:39:56Z


filename is already encoded. It is already a byte string and doesn't need encoding again.

But since you asked it to be encoded, Python first has to decode it for you, and it can only do that with the default ASCII encoding. That implicit decoding fails:

>>> 'Coé.jpg'
'Co\xc3\xa9.jpg'
>>> 'Coé.jpg'.decode('utf8')
u'Co\xe9.jpg'
>>> 'Coé.jpg'.decode('utf8').encode('utf8')
'Co\xc3\xa9.jpg'
>>> 'Coé.jpg'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
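To make that hidden step visible, here is a minimal sketch that spells out the implicit ASCII decode explicitly. The names raw_name and encode_like_python2 are hypothetical, chosen only for illustration, and the helper merely approximates what Python 2 does when you call .encode() on a byte string, assuming the default encoding is ASCII:

```python
# Hypothetical stand-in for the filename: the UTF-8 bytes of "Co\xe9.jpg".
raw_name = b'Co\xc3\xa9.jpg'


def encode_like_python2(byte_string, target='utf-8'):
    """Approximate Python 2's str.encode(): decode as ASCII first, then encode.

    Assumes the default encoding is ASCII, as it is on a stock Python 2.
    """
    return byte_string.decode('ascii').encode(target)


try:
    encode_like_python2(raw_name)
    error_message = None
except UnicodeDecodeError as exc:
    # The failure comes from the decode step, not from the encode step.
    error_message = str(exc)

print(error_message)
```

Pure-ASCII names pass through the same helper unchanged, which is why the script only breaks on names containing characters such as é.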

If you wanted encoded bytestrings, you don't have to do any encoding at all. Remove the .encode('utf8').

You probably need to read up on Python and Unicode. I recommend:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO

The rule of thumb is: decode as early as you can, encode as late as you can. That means when you receive data, decode it to Unicode objects; when you need to pass that information to something else, encode it only then. Many APIs can do the decoding and encoding as part of their job; print will encode to the codec used by the terminal, for example.
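As a rough sketch of that rule, assuming the filenames arrive as UTF-8 byte strings (the FS_ENCODING value and the to_text and write_names helpers are made up for this example, not part of any library):

```python
import io

# Assumption: the filenames are UTF-8 encoded bytes.
FS_ENCODING = 'utf-8'


def to_text(name, encoding=FS_ENCODING):
    """Decode early: turn a byte-string filename into Unicode text."""
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name  # already Unicode text, leave it alone


def write_names(names, stream, encoding=FS_ENCODING):
    """Encode late: convert to bytes only at the moment of writing."""
    for name in names:
        text = to_text(name)  # work with Unicode text internally
        stream.write((text + u'\n').encode(encoding))


raw_names = [b'Co\xc3\xa9.jpg', b'plain.jpg']
buffer = io.BytesIO()  # stands in for any byte-oriented output
write_names(raw_names, buffer)
```

Everything between to_text and the final write deals only with Unicode text, so there is no point at which Python has to guess an encoding for you.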

edited Jan 19, 2015 at 17:39

answered Jan 19, 2015 at 17:34


5 Comments

Nadine Over a year ago

But if I run the same script without the .encode() bit, my script then gives me this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)

Martijn Pieters Over a year ago

@Nadine: perhaps. You could well have other errors in your code where you are mixing unicode and byte strings.

Nadine Over a year ago

I've updated my question to better describe the problem, as the script works correctly and only fails on special chars. I will review your updated answer and the links now.

Martijn Pieters Over a year ago

@Nadine: are you saying that the full traceback for your error now points to print filename and gives you a UnicodeDecodeError? I am very skeptical that that is the case.

Nadine Over a year ago

You're right to be skeptical, the issue is a lot deeper than I thought. However, your links have massively helped me understand Python unicode and as such, I will mark this as accepted.
