
I am working on a Python script (App Engine) that takes text input from the user and stores it in the database for later redistribution.

The incoming text may or may not already be URL-encoded, and I need it to end up encoded exactly once.

Example texts from clients:

  • This%20is%20a%20test
  • This is a test

My idea was to decode the text in Python and then encode it again, so that both samples become:

  • This%20is%20a%20test
  • This%20is%20a%20test

The code that I am using is as follows:

import urllib

#
# Encode as UTF-8
#
pl = pl.encode('UTF-8')

#
# Unquote the string, then requote it so it is encoded exactly once
#
pl = urllib.quote(urllib.unquote(pl))

Here pl is the value of the payload POST parameter.

The Issue

Sometimes the input contains non-ASCII characters (Chinese or Arabic, for example), and I then get the following error:

'ascii' codec can't encode character u'\xc3' in position 0: ordinal not in range(128)
    ..snip..
    return codecs.utf_8_decode(input, errors, True)
 UnicodeEncodeError: 'ascii' codec can't encode character u'\xc3' in position 0: ordinal not in range(128)

Does anyone know the best way to process the string given the above issue?

Thanks.

1 Answer

Replace

pl = pl.encode('UTF-8')

with

pl = pl.decode('UTF-8')

since you're trying to decode a byte-string into a string of characters.
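Since the question does not say whether pl arrives as a byte string or as unicode, one defensive option is to check the type before converting. The following is only a sketch under assumptions: normalize_payload is a hypothetical name, and the import fallback is there so the same code runs on Python 2 (urllib.quote and urllib.unquote, as in the question) and on Python 3 (urllib.parse).

```python
try:
    from urllib import quote, unquote  # Python 2, as used in the question
except ImportError:
    # Python 3: unquote_to_bytes keeps the data as bytes, like Python 2's unquote
    from urllib.parse import quote, unquote_to_bytes as unquote

text_type = type(u'')  # unicode on Python 2, str on Python 3


def normalize_payload(pl):
    """Return pl percent-encoded exactly once (hypothetical helper)."""
    # Convert character strings to UTF-8 bytes explicitly, so the implicit
    # ASCII codec is never involved; byte strings are left untouched.
    if isinstance(pl, text_type):
        pl = pl.encode('utf-8')
    # Undo any percent-encoding already present, then apply it once.
    return quote(unquote(pl))
```

With this sketch, both sample inputs from the question come out as This%20is%20a%20test. Note that it cannot tell an unencoded literal such as 100%25 from the encoded form of 100%; that ambiguity is inherent in accepting both kinds of input.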

A design flaw in Python 2 lets you call .encode on a byte string, which is already encoded: Python first decodes it implicitly as ASCII. That is why the call appears to work for ASCII-only strings and fails only when non-ASCII bytes are present.
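The implicit step can be made visible by spelling it out. This is a rough illustration rather than Python 2's actual implementation: py2_style_encode is a hypothetical helper that performs the hidden ASCII decode explicitly before encoding.

```python
def py2_style_encode(byte_string, encoding='utf-8'):
    # Hypothetical helper: mimics what Python 2 does when .encode() is
    # called on a byte string. The bytes are decoded with the default
    # ASCII codec first, and only then encoded as requested.
    return byte_string.decode('ascii').encode(encoding)


# ASCII-only bytes survive the hidden decode step unchanged.
ascii_result = py2_style_encode(b'This is a test')

# UTF-8 bytes of a non-ASCII character fail in the hidden decode step.
try:
    py2_style_encode(b'\xc3\xa9')
    failed = False
except UnicodeDecodeError:
    failed = True
```

In the mirror case, calling .decode on a unicode object makes Python 2 implicitly encode it as ASCII first, which raises a UnicodeEncodeError like the one quoted in the question.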
