How can I decode a utf-8 byte array to a string in Python2?

Question

I have an array of bytes representing a UTF-8 encoded string. I want to decode these bytes back into the string in Python2. I am relying on Python2 for my overall program, so I cannot switch to Python3.

array = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]

-> Café Flora

Since not every character in the string I want is necessarily represented by exactly 1 byte in the array, I cannot use a solution like:

"".join(map(chr, array))

I tried to create a function that would step through the array, and whenever it encounters a number not in the range 0-127 (ASCII), create a new 16 bit int, shift the current bits over 8 to the left, and then add the following byte using a bitwise OR. Finally it would use unichr() to decode it.

def decode_attempt(byte_array):
    result = []
    for i in range(len(byte_array)):
        x = byte_array[i]
        if x < 0:
            b16 = x & 0xFFFF # 16 bit
            b16 = b16 << 8
            b16 = b16 | byte_array[i+1]
            result.append(unichr(b16))
        else:
            result.append(chr(x))
    return "".join(result)

However, this was unsuccessful.

The following article explains the issue very well, and includes a Node.js solution:

http://ixti.net/development/node.js/2011/10/26/get-utf-8-string-from-array-of-bytes-in-node-js.html

Can you not "pad" every number in the range of 0-128 with 00?
– OneCricketeer, Commented Aug 2, 2016 at 17:59
As you can see in my answer, your join with map and chr version almost works, except for the problem you have with negative numbers. My answer below is essentially the same, using the more readable generator expression and taking care of the negative numbers.
– jsbueno, Commented Aug 2, 2016 at 18:15

user2357112 · Accepted Answer · 2016-08-02 18:08:27Z

3

Use the little-used array module to convert your input to a bytestring and then decode it with the UTF-8 codec:

import array
decoded = array.array('b', your_input).tostring().decode('utf-8')
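As a hedged sketch, the same idea can be wrapped in a small function. The name `decode_signed_utf8` is made up for illustration, and the `tobytes()` lookup is only an assumption about running the snippet on interpreters where `tostring()` is no longer available (a comment on this answer notes the rename in Python 3):

```python
import array


def decode_signed_utf8(values):
    """Decode a list of signed bytes (-128..127) holding UTF-8 data."""
    # Type code 'b' means "signed char", so negative inputs are accepted.
    arr = array.array('b', values)
    # Python 2 spells the conversion tostring(); newer interpreters use tobytes().
    to_bytes = getattr(arr, 'tobytes', None) or arr.tostring
    return to_bytes().decode('utf-8')


your_input = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
decoded = decode_signed_utf8(your_input)
```

With the unsigned type code 'B' the negative values would be rejected, which is why the signed code 'b' is used here.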

edited Aug 2, 2016 at 18:08

answered Aug 2, 2016 at 17:59

user2357112



5 Comments

wim Over a year ago

ValueError: byte must be in range(0, 256)

Tim Petri Over a year ago

This gives me a ValueError: byte must be in range(0, 256)

user2357112 Over a year ago

@wim: Oh, huh. They have signed values in the input instead of unsigned. This won't work directly, then.

user2357112 Over a year ago

@wim: Decoding method switched to something that takes signed values.

Noumenon Over a year ago

In Python 3 tostring() was renamed to tobytes(), otherwise, this works great.

wim · Accepted Answer · 2016-08-02 18:05:25Z

2

You can use struct.pack for this:

>>> import struct
>>> a = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
>>> struct.pack("b"*len(a),*a)
'Caf\xc3\xa9 Flora'
>>> print struct.pack("b"*len(a),*a).decode('utf8')
Café Flora
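A possible variation, offered as a sketch rather than as part of the original answer: struct format strings accept a repeat count, so the format can be built as '11b' instead of repeating "b" once per value. The helper name `pack_signed` is hypothetical.

```python
import struct


def pack_signed(values):
    """Pack signed byte values (-128..127) into a byte string."""
    # '11b' is equivalent to 'b' repeated 11 times; 'b' is a signed char.
    fmt = '%db' % len(values)
    return struct.pack(fmt, *values)


a = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
raw = pack_signed(a)          # byte string, 11 bytes long
text = raw.decode('utf-8')    # text object, 10 characters long
```

The packed result is still a byte string; only the final decode('utf-8') turns the two bytes 0xC3 0xA9 into the single character é.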

edited Aug 2, 2016 at 18:05

wim


answered Aug 2, 2016 at 18:01

Joran Beasley



jsbueno · Accepted Answer · 2016-08-02 18:19:26Z

1

Keep in mind that a "string" in Python2 is not proper text, just a sequence of bytes in memory that happens to map to characters when you "print" it. If the encoding of the intended characters in the byte sequence matches the one used by the terminal, you will see properly formatted text.

If your terminal is not UTF-8, then even with the proper byte string in memory, just printing it would show you the wrong results. That is why the extra "decode" step is needed at the end of the expression.

text = b''.join(chr(i if i >= 0 else 256 + i) for i in array).decode('utf-8')

As your source encoded the numbers between 128 and 255 as negative numbers, we have the inline "if" operator to renormalize the value before calling "chr".
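To make the renormalization concrete, here is a small worked sketch on the two negative values from the question. The helper name `to_unsigned` is made up for illustration, and the comparison is `>=` so that 0 maps to 0:

```python
def to_unsigned(i):
    """Map a signed byte (-128..127) to its unsigned value (0..255)."""
    return i if i >= 0 else 256 + i


signed = [-61, -87]
unsigned = [to_unsigned(i) for i in signed]   # 256 - 61 and 256 - 87
as_hex = ['0x%02X' % u for u in unsigned]
```

The results are 195 and 169, that is 0xC3 and 0xA9, which is the two-byte UTF-8 encoding of é.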

Just to be clear: you say "Since every character in the string I want is not necessarily represented by exactly 1 byte in the array". With Python2.x byte strings, it is the terminal that takes care of that anyway. If you want to deal with proper text, then after joining your numbers into a (byte) string, use the "decode" method. This is the part that knows about UTF-8 multi-byte encoded characters and gives you back a (text) string object (a 'unicode' object in Python 2) that treats each character as an entity.
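The difference between the byte string and the decoded text shows up in their lengths. The following is only an illustrative sketch; it goes through bytearray (an assumption made so the snippet does not depend on what chr returns) instead of joining chr results:

```python
signed = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]

# bytearray accepts integers in 0..255, so renormalize negatives first.
raw = bytes(bytearray(i if i >= 0 else 256 + i for i in signed))
text = raw.decode('utf-8')

byte_count = len(raw)    # 11: the accented letter occupies two bytes
char_count = len(text)   # 10: after decoding it is a single character
```

Indexing tells the same story: text[3] is the whole character é, while raw[3] is only the first of its two bytes.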

edited Aug 2, 2016 at 18:19

answered Aug 2, 2016 at 17:59

jsbueno


2 Comments

Tavian Barnes Over a year ago

i if i > 0 else 256 + i can be written i & 0xFF too

jsbueno Over a year ago

It may work, although for semantic reasons, I would not recommend that. It abuses a side effect of the "&" operation just to make the number positive. Maybe the "most correct" way is @Joran's answer using struct, but I am uncertain about the performance of doing that. This way is the "thinking in strings" way. :-)
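For reference, a quick check (a sketch, not a recommendation of either style) that the mask and the conditional agree on every signed byte value. The name `renormalize` is hypothetical and uses `>=` so that 0 maps to 0:

```python
def renormalize(i):
    """Conditional form: shift negative values up by 256."""
    return i if i >= 0 else 256 + i


# Compare against the bit-mask form over the full signed byte range.
same = all((i & 0xFF) == renormalize(i) for i in range(-128, 128))
```

Which spelling reads better is a matter of taste, as the comments above discuss.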
