Python throws UnicodeEncodeError although I am doing str.decode(). Why?

Question

Consider this function:

import htmlentitydefs

def escape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        try:
            c = c.decode('ascii')
        except UnicodeDecodeError:
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)

It should replace every non-ASCII character with the corresponding entity from htmlentitydefs. Unfortunately, Python throws

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

when the variable text contains the string whose repr() is u'Tam\xe1s Horv\xe1th'.

But I don't use str.encode(); I only use str.decode(). Am I missing something?

Community · Accepted Answer · 2017-05-23 12:02:48Z

It's a misleading error report that comes from the way Python handles the decoding/encoding process. You tried to decode an already decoded string a second time, and that confuses the Python function, which retaliates by confusing you in turn! ;-) As far as I know, the encoding/decoding is handled by the codecs module, and somewhere in there lies the origin of these misleading exception messages.

You may check for yourself: either

u'\x80'.encode('ascii')

or

u'\x80'.decode('ascii')

will throw a UnicodeEncodeError, whereas

u'\x80'.encode('utf8')

will not, but

u'\x80'.decode('utf8')

again will!

I guess you are confused by the meaning of encoding and decoding. To put it simply:

                     decode             encode    
ByteString (ascii)  --------> UNICODE  --------->  ByteString (utf8)
            codec                                              codec

But why does the decode method take a codec argument? Well, the underlying function cannot guess which codec the ByteString was encoded with, so it takes the codec as a hint. If none is provided, it implicitly uses sys.getdefaultencoding().

So when you use c.decode('ascii'), you a) have an (encoded) ByteString (that's why you use decode), b) want to get a unicode object (that's what you use decode for), and c) state that the codec in which the ByteString is encoded is ascii.
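The two directions in the diagram can be tried out with a minimal sketch like the one below. The sample values are my own, and the b"" and u"" prefixes are used so that the snippet should behave the same on Python 2.7 and Python 3.

```python
# Bytes that are known to be ASCII-encoded: decode turns them into text.
raw_ascii = b"Tamas"
text = raw_ascii.decode("ascii")

# Text containing a non-ASCII character: encode turns it into bytes.
# UTF-8 represents U+00E1 as the two bytes 0xC3 0xA1.
raw_utf8 = u"Tam\xe1s".encode("utf-8")
```

decode always goes from bytes to text and encode always goes from text to bytes; the codec names the byte representation on the byte side of the arrow.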

See also: https://stackoverflow.com/a/370199/1107807
http://docs.python.org/howto/unicode.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror

Daniel Roseman · Accepted Answer · 2011-12-21 14:33:54Z

5

You're passing a string that's already unicode. So, before Python can call decode on it, it has to actually encode it - and it does so by default using the ASCII encoding.
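The hidden step can be spelled out in a small sketch. The function name implicit_decode and its default_encoding parameter are hypothetical; they only imitate what Python 2 does when decode is called on a unicode object, and they are not part of any library.

```python
def implicit_decode(text, codec, default_encoding="ascii"):
    # Step 1: the hidden conversion from text to bytes. With the ASCII
    # default this is where UnicodeEncodeError is raised.
    raw = text.encode(default_encoding)
    # Step 2: the decode that was actually requested.
    return raw.decode(codec)


error_name = None
try:
    implicit_decode(u"\x80", "utf8")
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__

# Pure ASCII text survives both steps unchanged.
plain = implicit_decode(u"abc", "ascii")
```

The requested codec never comes into play for u"\x80", because the failure happens in step 1. That is why the error names the 'ascii' codec even when 'utf8' was asked for.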

Edit to add: It depends on what you want to do. If you simply want to convert a unicode string with non-ASCII characters into an HTML-encoded representation, you can do it in one call: text.encode('ascii', 'xmlcharrefreplace').
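As a quick illustration, here is that call applied to the string from the question. Note that the xmlcharrefreplace error handler produces numeric character references, not the named entities from htmlentitydefs. The snippet should behave the same on Python 2.7 and Python 3; the variable names are my own.

```python
text = u"Tam\xe1s Horv\xe1th"

# Characters that ASCII cannot represent are replaced by numeric
# character references; the result is a byte string.
escaped = text.encode("ascii", "xmlcharrefreplace")
```

Here escaped is the byte string b"Tam&#225;s Horv&#225;th", since 225 is the decimal code point of U+00E1.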

edited Dec 21, 2011 at 14:33

answered Dec 21, 2011 at 14:07

Daniel Roseman


1 Comment

Aufwind Over a year ago

Or is my approach of escaping the characters nonsense?

wberry · Accepted Answer · 2011-12-21 14:39:28Z

Python has two types of strings: character-strings (the unicode type) and byte-strings (the str type). The code you have pasted operates on byte-strings. You need a similar function to handle character-strings.

Maybe this:

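import htmlentitydefs

def uescape(text):
    # Expects a unicode object (a character-string), not a byte-string.
    print repr(text)
    escaped_chars = []
    for c in text:
        if (ord(c) < 32) or (ord(c) > 126):
            # Fall back to a numeric reference when no named entity exists.
            name = htmlentitydefs.codepoint2name.get(ord(c))
            c = u'&{};'.format(name) if name else u'&#{};'.format(ord(c))
        escaped_chars.append(c)
    return u''.join(escaped_chars)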

I do wonder whether either function is truly necessary for you. If it were me, I would choose UTF-8 as the character encoding for the result document, process the document in character-string form (without worrying about entities), and perform a content.encode('UTF-8') as the final step before delivering it to the client. Depending on the web framework of choice, you may even be able to deliver character-strings directly to the API and have it figure out how to set the encoding.
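A minimal sketch of that workflow might look as follows. The function render_greeting and the sample name are hypothetical and only stand in for whatever builds the document; the point is that everything stays a character-string until the single encode at the end.

```python
def render_greeting(name):
    # Work with character-strings only: no entities, no byte-strings.
    return u"<p>Hello, {}!</p>".format(name)


page = render_greeting(u"Tam\xe1s Horv\xe1th")

# Encode exactly once, as the final step before sending to the client.
body = page.encode("utf-8")
```

The response would then also need to declare UTF-8 as its character encoding, which is typically done in the Content-Type header or left to the web framework.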

Community · Accepted Answer · 2017-05-23 11:47:30Z

2

This answer always works for me when I have this problem:

def byteify(input):
    '''
    Removes unicode encodings from the given input string.
    '''
    if isinstance(input, dict):
        return {byteify(key):byteify(value) for key,value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

It is taken from the question "How to get string objects instead of Unicode ones from JSON in Python?"

edited May 23, 2017 at 11:47

CommunityBot


answered Nov 26, 2015 at 21:58

Blairg23


Heladio Cisneros Reyes · Accepted Answer · 2016-07-22 16:13:48Z

0

I found a solution on this site:

import sys
reload(sys)  # restores sys.setdefaultencoding, which is removed at startup
sys.setdefaultencoding("latin-1")

a = u'\xe1'
print str(a) # no exception

answered Jul 22, 2016 at 16:13

Heladio Cisneros Reyes


kev · Accepted Answer · 2011-12-21 14:17:49Z

-1

Decoding a unicode string makes no sense.

I think you can check ord(c) > 127 instead.
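A sketch of that check is shown below. It is written for Python 3, which is an assumption on my part: there the entity table lives in html.entities instead of htmlentitydefs, and every str is already text. The function name escape_non_ascii and the numeric fallback for characters without a named entity are also my own additions.

```python
from html.entities import codepoint2name


def escape_non_ascii(text):
    parts = []
    for ch in text:
        codepoint = ord(ch)
        if codepoint > 127:
            # Prefer the named entity; otherwise use a numeric reference.
            name = codepoint2name.get(codepoint)
            if name:
                ch = "&{};".format(name)
            else:
                ch = "&#{};".format(codepoint)
        parts.append(ch)
    return "".join(parts)


escaped_name = escape_non_ascii("Tam\xe1s Horv\xe1th")
```

With the input from the question, escaped_name is "Tam&aacute;s Horv&aacute;th". No decode call is involved, so the implicit ASCII encode never happens.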

answered Dec 21, 2011 at 14:17

kev
