Always encode from Unicode to bytes.
In this direction, you choose the encoding.
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print _
你好
The other way is to decode from bytes to Unicode.
In this direction, you have to know what the encoding is.
>>> byte_string = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print byte_string
你好
>>> byte_string.decode("utf-8")
u'\u4f60\u597d'
>>> print _
你好
This point can't be stressed enough. If you want to avoid playing Unicode "whack-a-mole", it's important to understand what's happening at the data level. Here it is explained another way:
- A unicode object (type unicode) is decoded already; you never want to call .decode on it.
- A bytestring object (type str) is encoded already; you never want to call .encode on it.
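The same discipline carries over directly to Python 3, where the type names are str for text and bytes for raw data. A minimal sketch of the two rules, one direction each:

```python
# Raw, already-encoded data: the only sensible operation is .decode.
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'   # UTF-8 bytes for 你好
text = raw.decode('utf-8')          # bytes -> text
assert type(text) is str

# Already-decoded text: the only sensible operation is .encode.
back = text.encode('utf-8')         # text -> bytes
assert back == raw
```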
When .encode is called on a bytestring, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a Unicode string, Python 2 implicitly tries to convert it to bytes (a str object).
These implicit conversions are why calling encode can raise a UnicodeDecodeError. The .encode method expects an object of type unicode; when it is called on a str object, Python 2 first implicitly decodes the str into a unicode object before re-encoding it. That implicit decode uses the default 'ascii' codec†, which is why an encoding call can fail with a decoding error.
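The hidden step can be reproduced explicitly in Python 3, where nothing is implicit: the decode Python 2 performed behind the scenes is just an ASCII decode of non-ASCII bytes, and that is what actually blows up. A sketch:

```python
# The step Python 2 performed implicitly before re-encoding a str,
# spelled out: decode the bytes with the default 'ascii' codec.
data = b'\xe4\xbd\xa0\xe5\xa5\xbd'  # UTF-8 bytes for 你好

try:
    data.decode('ascii')  # ASCII cannot represent bytes >= 0x80
except UnicodeDecodeError as err:
    print(err.encoding, err.reason)
```

This is the same UnicodeDecodeError Python 2 raised from an encode call.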
In Python 3, the methods str.decode and bytes.encode were removed, as part of the changes to define separate, unambiguous types for text and raw "bytes" data.
† ...or whatever coding sys.getdefaultencoding() mentions; usually this is 'ascii'
In Python 3, bytes objects are simply raw bytes, and never the twain shall meet. They don't do any implicit decoding when .encode is called; in fact, .encode is not supported on bytes at all, as it makes no sense. The weird behaviour was only present in 2.x as a compatibility hack in the first place, because of the way that Unicode was introduced into the language.
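The removal is easy to verify from a Python 3 interpreter: each type supports only the one conversion that makes sense for it.

```python
text = "你好"
raw = text.encode("utf-8")          # str -> bytes: the only direction str supports
assert raw.decode("utf-8") == text  # bytes -> str: the only direction bytes supports

# The nonsensical directions from Python 2 are simply gone:
assert not hasattr(raw, "encode")   # bytes has no .encode
assert not hasattr(text, "decode")  # str has no .decode
```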