Why does ENcoding a string result in a DEcoding error (UnicodeDecodeError)?

Question

I'm really confused. I tried to encode, but the error said it can't decode.

>>> "你好".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

I know how to avoid the error with "u" prefix on the string. I'm just wondering why the error is "can't decode" when encode was called. What is Python doing under the hood?

See also: unicode().decode('utf-8', 'ignore') raising UnicodeEncodeError, the other way around.

Note that this issue is fixed in 3.x: strings are simply Unicode and bytes objects are simply raw bytes, and never the twain shall meet. Strings don't do any implicit decoding when .encode is called, and bytes objects don't support .encode at all, as it makes no sense there (likewise, strings have no .decode). The weird behaviour was only present in 2.x as a compatibility hack in the first place, because of the way that Unicode was introduced into the language.
– Karl Knechtel, Commented Jan 4, 2023 at 6:46

Winston Ewert · Accepted Answer · 2012-03-10 05:34:51Z

171

"你好".encode('utf-8')

encode converts a unicode object to a string object. But here you have invoked it on a string object (because you don't have the u). So python has to convert the string to a unicode object first. So it does the equivalent of

"你好".decode().encode('utf-8')

But the decode fails because the string isn't valid ASCII, and ASCII is the codec the implicit decode() uses by default. That's why you get a complaint about not being able to decode.
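The two-step behaviour can be imitated on Python 3, where nothing happens implicitly. This is only a sketch: `py2_style_encode` is a hypothetical helper written for illustration, not something Python provides, and it assumes the default codec is 'ascii' as in the traceback above.

```python
# Python 3 sketch imitating what Python 2 did for "...".encode(...).
# py2_style_encode is a hypothetical helper, not a real Python API.

def py2_style_encode(byte_string, encoding):
    # Step 1 (implicit in Python 2): decode the bytes with the default codec.
    text = byte_string.decode("ascii")
    # Step 2 (what the caller actually asked for): encode the text.
    return text.encode(encoding)

# The bytes a Python 2 str literal would hold in a UTF-8 terminal or source file.
raw = "你好".encode("utf-8")

error = None
try:
    py2_style_encode(raw, "utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(type(exc).__name__, exc)
```

The exception comes from step 1, before any encoding is attempted, which is why the message talks about decoding.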

answered Mar 10, 2012 at 5:34

Winston Ewert

7 Comments

Jon Tirsen Over a year ago

So what is the solution? Especially if I don't have a string literal, I just have a string object.

Winston Ewert Over a year ago

@JonTirsen, you should not be encoding a string object. A string object is already encoded. If you need to change the encoding, you need to decode it into a unicode string and then encode it as the desired encoding.
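A minimal sketch of that decode-then-encode round trip, assuming you already know the source encoding. The helper name `transcode` and the choice of GBK as the target are illustrative assumptions only; the snippet is written so it should run on Python 2.7 and Python 3 alike.

```python
# -*- coding: utf-8 -*-
# transcode is a hypothetical helper: bytes in one encoding -> bytes in another.

def transcode(data, source_encoding, target_encoding):
    # Decode explicitly, naming the encoding the bytes are actually in.
    text = data.decode(source_encoding)
    # Then encode the text into the desired encoding.
    return text.encode(target_encoding)

utf8_bytes = b"\xe4\xbd\xa0\xe5\xa5\xbd"   # "你好" encoded as UTF-8
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")
round_trip = transcode(gbk_bytes, "gbk", "utf-8")
print(round_trip == utf8_bytes)
```

Because the source encoding is named explicitly, the implicit ASCII decode never gets a chance to run.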

deinonychusaur Over a year ago

So, to state it clearly from the above: you can do "你好".decode('utf-8').encode('utf-8')

deinonychusaur Over a year ago

@WinstonEwert I guess I was confused. The encoding business tends to leave me eternally confused. I guess my confusion came from my own problem of not knowing whether the input is a string or a unicode string, and what encoding it may have.

Winston Ewert Over a year ago

@deinonychusaur, yeah... I get that.

wim · Accepted Answer · 2025-06-29 13:08:35Z

56

+250

Always encode from Unicode to bytes. In this direction, you choose the encoding.

>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print _
你好

The other way is to decode from bytes to Unicode.
In this direction, you have to know what the encoding is.

>>> bytes = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print bytes
你好
>>> bytes.decode("utf-8")
u'\u4f60\u597d'
>>> print _
你好

This point can't be stressed enough. If you want to avoid playing Unicode "whack-a-mole", it's important to understand what's happening at the data level. Here it is explained another way:

A unicode object (type unicode) is decoded already; you never want to call decode on it.
A bytestring object (type str) is encoded already; you never want to call encode on it.
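One way to follow these two rules when you don't know which type you were handed is to normalise at the boundary. The sketch below uses Python 3 type names (str for text, bytes for raw data); `to_text` and `to_bytes` are hypothetical helpers, and defaulting to UTF-8 is an assumption you would replace with the real encoding of your data.

```python
# Python 3 sketch: check the type first, so text is never decoded
# and bytes are never encoded.

def to_text(value, encoding="utf-8"):
    # Only bytes need decoding; text passes through untouched.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def to_bytes(value, encoding="utf-8"):
    # Only text needs encoding; bytes pass through untouched.
    if isinstance(value, str):
        return value.encode(encoding)
    return value

print(to_text(b"\xe4\xbd\xa0\xe5\xa5\xbd") == "你好")
print(to_bytes("你好") == b"\xe4\xbd\xa0\xe5\xa5\xbd")
```

On Python 2 the same idea would test for `unicode` and `str` instead.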

When .encode is called on a bytestring, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a Unicode string, Python 2 implicitly tries to convert it to bytes (a str object).

These implicit conversions are why you can get UnicodeDecodeError when you've called encode. Encoding usually accepts an object of type unicode; when called on a str object, there's an implicit decoding into an object of type unicode before re-encoding. The implicit decoding chooses a default 'ascii' codec†, resulting in a decoding error from an encoding call.

In Python 3, the methods str.decode and bytes.encode were removed, as part of the changes to define separate, unambiguous types for text and raw "bytes" data.

† ...or whatever codec sys.getdefaultencoding() reports; usually this is 'ascii'.
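The Python 3 split is easy to confirm from an interpreter. A small check, assuming Python 3:

```python
# Assumes Python 3: str is text, bytes is raw data.
text = "你好"
data = text.encode("utf-8")            # text -> bytes is supported

print(hasattr(text, "decode"))         # does str have a decode method?
print(hasattr(data, "encode"))         # does bytes have an encode method?
print(data.decode("utf-8") == text)    # bytes -> text is supported
```

With only one valid direction per type, there is nothing left for the interpreter to convert implicitly.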

edited Jun 29 at 13:08

answered Mar 10, 2012 at 5:14

wim

4 Comments

thoslin Over a year ago

So do you mean that Python decodes the bytestring before encoding?

wim Over a year ago

@thoslin exactly, I added more details.

NoBugs Over a year ago

What is _, and why are your print statements missing parentheses?

wim Over a year ago

@NoBugs 1. in the REPL, _ refers to the previous value 2. because this is a python-2.x question.

Dadaso Zanzane · Accepted Answer · 2016-05-13 08:16:26Z

40

You can try this

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

Or you can add the following line at the top of your .py file:

# -*- coding: utf-8 -*-

edited May 13, 2016 at 8:16

answered Jan 4, 2016 at 13:00

Dadaso Zanzane

2 Comments

Alexey Over a year ago

This must be accepted answer!

Karl Knechtel Jun 12 at 6:55

Why? It's completely incorrect. It explains nothing, and does not solve the described problem, but instead gives two ideas for things to try, both of which are for different kinds of Unicode-related problem, and which don't correspond to each other either.

johnsyweb · Accepted Answer · 2012-03-10 05:20:13Z

7

If you're using Python < 3, you'll need to tell the interpreter that your string literal is Unicode by prefixing it with a u:

Python 2.7.2 (default, Jan 14 2012, 23:14:09) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "你好".encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'

Further reading: Unicode HOWTO.

edited Mar 10, 2012 at 5:20

answered Mar 10, 2012 at 5:14

johnsyweb

2 Comments

MxLDevs Over a year ago

If you're encoding a string, why does it throw a decode error?

shluvme Over a year ago

@MxLDevs because calling encode on a str first triggers an implicit decode with the default 'ascii' codec, and that implicit decode is the step that fails.

aschmid00 · Accepted Answer · 2014-06-04 18:46:07Z

4

You use u"你好".encode('utf8') to encode a Unicode string. But if you want to represent "你好", you should decode it. Just like:

"你好".decode("utf8")

You will get what you want. Maybe you should learn more about encode & decode.

edited Jun 4, 2014 at 18:46

aschmid00

answered Dec 19, 2013 at 3:37

Qingtian

Comments

kenorb · Accepted Answer · 2017-05-28 16:36:09Z

3

In case you're dealing with Unicode, sometimes instead of encode('utf-8'), you can also try to ignore the special characters, e.g.

"你好".encode('ascii','ignore')

or use something.decode('unicode_escape').encode('ascii','ignore'), as suggested here.

Not particularly useful in this example, but can work better in other scenarios when it's not possible to convert some special characters.

Alternatively, you can consider replacing particular characters using replace().

answered May 28, 2017 at 16:36

kenorb

1 Comment

Karl Knechtel Jun 14 at 23:59

This is wrong. The reported problem in Python 2.x is a decode error which happens in code that runs implicitly, before the encode takes effect. Specifying 'ignore' in the code has no effect on that implicit code. The underlying problem is that, in Python 2.x, "你好" does not contain those characters in the first place; Python 2.x's str type only contains bytes, and pretends to contain characters by attempting implicit conversions on the fly. The unicode_escape proposal here would return an empty string.
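Both points can be checked with a short Python 3 sketch, where the implicit Python 2 step has to be written out by hand. The strict raw.decode("ascii") call below is a stand-in for that implicit step, an assumption based on the traceback in the question.

```python
# Python 3 sketch; raw holds the bytes that "你好" is in a Python 2 str.
raw = "你好".encode("utf-8")

# Stand-in for Python 2's implicit step: a strict ASCII decode.
# The 'ignore' passed to encode() later never reaches this call.
try:
    raw.decode("ascii")
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True

# The unicode_escape route: each byte becomes a Latin-1 code point >= 0x80,
# and encode('ascii', 'ignore') then drops every one of them.
escaped = raw.decode("unicode_escape")
result = escaped.encode("ascii", "ignore")

print(implicit_decode_failed, repr(result))
```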

0range · Accepted Answer · 2018-09-27 22:51:27Z

1

If you are starting the python interpreter from a shell on Linux or similar systems (BSD, not sure about Mac), you should also check the default encoding for the shell.

Call locale charmap from the shell (not the python interpreter) and you should see

[user@host dir] $ locale charmap
UTF-8
[user@host dir] $

If this is not the case, and you see something else, e.g.

[user@host dir] $ locale charmap
ANSI_X3.4-1968
[user@host dir] $

Python will (at least in some cases such as in mine) inherit the shell's encoding and will not be able to print (some? all?) unicode characters. Python's own default encoding that you see and control via sys.getdefaultencoding() and sys.setdefaultencoding() is in this case ignored.

If you find that you have this problem, you can fix that by

[user@host dir] $ export LC_CTYPE="en_EN.UTF-8"
[user@host dir] $ locale charmap
UTF-8
[user@host dir] $

(Or alternatively choose whichever locale you want instead of en_EN.) You can also edit /etc/locale.conf (or whichever file governs the locale definition in your system) to correct this.

answered Sep 27, 2018 at 22:51

0range

1 Comment

Karl Knechtel Jun 14 at 23:55

This has nothing to do with the described problem, which can be reproduced in Python 2.x without a terminal or even a shell (for example, by writing to a file instead of using print). And it is the terminal which has to deal with the encoding, not the shell.
