4

Consider this function:

def escape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        try:
            c = c.decode('ascii')
        except UnicodeDecodeError:
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)

It should escape all non ascii characters by the corresponding htmlentitydefs. Unfortunately python throws

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

when the variable text contains the string whose repr() is u'Tam\xe1s Horv\xe1th'.

But, I don't use str.encode(). I only use str.decode(). Do I miss something?

6 Answers 6

11

It's a misleading error-report which comes from the way python handles the de/encoding process. You tried to decode an already decoded String a second time and that confuses the Python function which retaliates by confusing you in turn! ;-) The encoding/decoding process takes place as far as i know, by the codecs-module. And somewhere there lies the origin for this misleading Exception messages.

You may check for yourself: either

u'\x80'.encode('ascii')

or

u'\x80'.decode('ascii')

will throw a UnicodeEncodeError, where a

u'\x80'.encode('utf8')

will not, but

u'\x80'.decode('utf8')

again will!

I guess you are confused by the meaning of encoding and decoding. To put it simple:

                     decode             encode    
ByteString (ascii)  --------> UNICODE  --------->  ByteString (utf8)
            codec                                              codec

But why is there a codec-argument for the decode method? Well, the underlying function can not guess which codec the ByteString was encoded with, so as a hint it takes codec as an argument. If not provided it assumes you mean the sys.getdefaultencoding() to be implicitly used.

so when you use c.decode('ascii') you a) have a (encoded) ByteString (thats why you use decode) b) you want to get a unicode-representation-object (thats what you use decode for) and c) the codec in which the ByteString is encoded is ascii.

See also: https://stackoverflow.com/a/370199/1107807
http://docs.python.org/howto/unicode.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror

Sign up to request clarification or add additional context in comments.

Comments

5

You're passing a string that's already unicode. So, before Python can call decode on it, it has to actually encode it - and it does so by default using the ASCII encoding.

Edit to add It depends on what you want to do. If you simply want to convert a unicode string with non-ASCII characters into an HTML-encoded representation, you can do it in one call: text.encode('ascii', 'xmlcharrefreplace').

1 Comment

Or is my approach of escaping the characters nonsense?
2

Python has two types of strings: character-strings (the unicode type) and byte-strings (the str type). The code you have pasted operates on byte-strings. You need a similar function to handle character-strings.

Maybe this:

def uescape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        if (ord(c) < 32) or (ord(c) > 126):
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)

I do wonder whether either function is truly necessary for you. If it were me, I would choose UTF-8 as the character encoding for the result document, process the document in character-string form (without worrying about entities), and perform a content.encode('UTF-8') as the final step before delivering it to the client. Depending on the web framework of choice, you may even be able to deliver character-strings directly to the API and have it figure out how to set the encoding.

Comments

2

This answer always works for me when I have this problem:

def byteify(input):
    '''
    Removes unicode encodings from the given input string.
    '''
    if isinstance(input, dict):
        return {byteify(key):byteify(value) for key,value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

from How to get string objects instead of Unicode ones from JSON in Python?

Comments

0

I found solution in this-site

reload(sys)
sys.setdefaultencoding("latin-1")

a = u'\xe1'
print str(a) # no exception

Comments

-1

decode a str make no sense.

I think you can check ord(c)>127

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.