Converting a string to Unicode in Python

Question

I am trying to convert a string of type str to Unicode in Python. I want it to work for any non-English string, for example Japanese, Chinese, or Spanish.

For example, japanese_var holds some Japanese characters [ドキュメントを翻訳します].

Printing it gives:

'\x83h\x83L\x83\x85\x83\x81\x83\x93\x83g\x82\xf0\x96|\x96\xf3\x82\xb5\x82\xdc\x82\xb7'

Checking its type:

type(japanese_var)
<type 'str'>

How can I convert it to type 'unicode'?

Should I use japanese_var.decode('mbcs')? What are the consequences of using this code, given that it will run on different OS platforms and under different foreign locales?

I am using Python 2.5.4.

The string is a parameter that I read from a file's properties; it can be any non-English text.

You need to know the encoding of the string. There isn't really a simple solution that will work for any string.
– interjay, Commented Dec 9, 2013 at 9:41
Where is this string coming from? (If it's a literal, stick a u directly in front of it, though you may need to be careful about source code encoding.)
– user2357112, Commented Dec 9, 2013 at 9:44
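
For the literal case mentioned in the comment above, a minimal sketch follows. It assumes the source file is saved as UTF-8 and says so on its first line; the variable names are only illustrative. The u prefix is what Python 2 needs, and Python 3.3+ accepts it as well, treating the literal as an ordinary Unicode string.

```python
# -*- coding: utf-8 -*-
# Assumption: this file is saved as UTF-8, matching the declaration above.

# The u prefix makes this a unicode object on Python 2; on Python 3.3+ the
# prefix is accepted and the literal is an ordinary (Unicode) str.
japanese_text = u"ドキュメントを翻訳します"

# Length is counted in characters, not bytes.
character_count = len(japanese_text)
```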

svk · Accepted Answer · 2013-12-09 10:03:39Z

4

You need to know the encoding of the input string. There is no reliable universal solution.

The encoding should be available from the source of the input string. For instance, if you're taking text from a web page, the encoding should be indicated as part of the HTTP Content-Type, either in an HTTP response header from the server or in a <meta> tag in the page source.
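
As a rough illustration of the HTTP case, the charset can be pulled out of a Content-Type header value with the standard library's email.message.Message class. This is only a sketch: the function name charset_from_content_type and the sample header value are made up for the example, and it is written for Python 3.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type header value, or default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or the fallback
    # when the header carries no charset parameter.
    return msg.get_content_charset(default)


# Hypothetical header value, as a server might send it.
charset = charset_from_content_type("text/html; charset=Shift_JIS")
```

When the header carries no charset, the function returns the default, so the caller still has to decide what to do rather than silently guessing.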

Once you know the encoding, use the decode method.

This string appears to be Shift-JIS:

>>> x = '\x83h\x83L\x83\x85\x83\x81\x83\x93\x83g\x82\xf0\x96|\x96\xf3\x82\xb5\x82\xdc\x82\xb7'
>>> print x.decode( "shift-jis" )
ドキュメントを翻訳します
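
To illustrate why the encoding has to be known rather than guessed, here is a small sketch using the bytes from the question. It is written for Python 3, where the Python 2 str above corresponds to bytes; the variable names are illustrative. The same bytes decode cleanly under Shift-JIS, decode without error but into garbage under Latin-1, and fail outright under UTF-8.

```python
# The byte string from the question, written as a Python 3 bytes literal.
raw = (b'\x83h\x83L\x83\x85\x83\x81\x83\x93\x83g'
       b'\x82\xf0\x96|\x96\xf3\x82\xb5\x82\xdc\x82\xb7')

# Correct codec: 24 bytes become 12 characters.
text = raw.decode("shift_jis")

# Wrong codec that accepts every byte: no error, but the result is mojibake.
mojibake = raw.decode("latin-1")

# Wrong codec that validates its input: decoding fails loudly.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```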

edited Dec 9, 2013 at 10:03

answered Dec 9, 2013 at 9:56

svk


user3073180 · Accepted Answer · 2014-09-17 22:54:10Z

0

Passing "mbcs" to decode worked for me for any locale.

Thanks guys for your help.
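
One point relevant to the cross-platform part of the question: Python provides the "mbcs" codec only on Windows, where it decodes with the system's ANSI code page. The same bytes can therefore decode differently on machines configured for different locales, and the codec lookup fails on other platforms. Below is a minimal sketch of guarding against that. The helper names, and the choice to require an explicit encoding when "mbcs" is missing, are illustrative assumptions rather than part of the answer above; the code is written for Python 3.

```python
import codecs


def mbcs_available():
    """Return True if this interpreter provides the Windows-only 'mbcs' codec."""
    try:
        codecs.lookup("mbcs")
        return True
    except LookupError:
        return False


def decode_property(raw, encoding=None):
    """Decode raw bytes with an explicit encoding, else with 'mbcs' if present.

    Hypothetical helper: raises ValueError instead of guessing when no
    encoding is given and 'mbcs' is unavailable.
    """
    if encoding is not None:
        return raw.decode(encoding)
    if mbcs_available():
        # Result depends on the machine's ANSI code page.
        return raw.decode("mbcs")
    raise ValueError("no encoding given and 'mbcs' is not available here")
```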

answered Sep 17, 2014 at 22:54

user3073180
