5

I have an array of bytes representing a UTF-8 encoded string. I want to decode these bytes back into the string in Python 2. I am relying on Python 2 for my overall program, so I cannot switch to Python 3.

array = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]

-> Café Flora

Since every character in the string I want is not necessarily represented by exactly 1 byte in the array, I cannot use a solution like:

"".join(map(chr, array))

I tried to create a function that would step through the array, and whenever it encounters a number not in the range 0-127 (ASCII), create a new 16 bit int, shift the current bits over 8 to the left, and then add the following byte using a bitwise OR. Finally it would use unichr() to decode it.

def decode(byte_array):
    result = []
    for i in range(len(byte_array)):
        x = byte_array[i]
        if x < 0:
            b16 = x & 0xFFFF # 16 bit
            b16 = b16 << 8
            b16 = b16 | byte_array[i+1]
            result.append(unichr(b16))
        else:
            result.append(chr(x))

    return "".join(result)

However, this was unsuccessful.

The following article explains the issue very well, and includes a nodeJS solution:

http://ixti.net/development/node.js/2011/10/26/get-utf-8-string-from-array-of-bytes-in-node-js.html

  • That is not how UTF-8 decoding works. Commented Aug 2, 2016 at 17:58
  • Can you not "pad" every number in the range of 0-128 with 00? Commented Aug 2, 2016 at 17:59
  • As you can see in my answer, your join with map and chr version almost works, except for the problem you have with negative numbers. My answer below is essentially the same, using the more readable generator expression and taking care of the negative numbers. Commented Aug 2, 2016 at 18:15
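
For reference, here is a minimal sketch of the bit arithmetic behind one two-byte UTF-8 sequence, using the -61, -87 pair from the question. The helper name decode_two_byte is hypothetical, and the sketch deliberately ignores other sequence lengths and overlong encodings, which a real codec handles:

```python
def decode_two_byte(lead, cont):
    """Return the code point of one two-byte UTF-8 sequence (hypothetical helper)."""
    lead &= 0xFF  # signed -61 becomes 0xC3
    cont &= 0xFF  # signed -87 becomes 0xA9
    # A two-byte lead looks like 110xxxxx, a continuation byte like 10xxxxxx.
    if (lead & 0xE0) != 0xC0 or (cont & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep 5 payload bits from the lead and 6 from the continuation byte.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


code_point = decode_two_byte(-61, -87)
print(hex(code_point))
```

The payload bits are shifted by 6, not 8, because each continuation byte carries only 6 bits of the code point.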

3 Answers

3

Use the little-used array module to convert your input to a bytestring and then decode it with the UTF-8 codec:

import array
decoded = array.array('b', your_input).tostring().decode('utf-8')
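
As a rough sketch, the same idea can be wrapped in a small helper that runs on both Python 2 and Python 3. The function name decode_signed_bytes is an assumption for illustration; the 'b' typecode is the array module's signed-char type, which is why the negative values are accepted:

```python
import array


def decode_signed_bytes(values):
    """Decode a list of signed byte values (-128..127) as UTF-8 (hypothetical helper)."""
    arr = array.array('b', values)
    # Python 3 offers tobytes(); Python 2 only has tostring().
    raw = arr.tobytes() if hasattr(arr, 'tobytes') else arr.tostring()
    return raw.decode('utf-8')


text = decode_signed_bytes([67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97])
```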

5 Comments

This gives me a ValueError: byte must be in range(0, 256)
@wim: Oh, huh. They have signed values in the input instead of unsigned. This won't work directly, then.
@wim: Decoding method switched to something that takes signed values.
In Python 3, tostring() was renamed to tobytes(); otherwise, this works great.
2

You can use struct.pack for this:

>>> import struct
>>> a = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]
>>> struct.pack("b"*len(a),*a)
'Caf\xc3\xa9 Flora'
>>> print struct.pack("b"*len(a),*a).decode('utf8')
Café Flora
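
A possible variation, sketched here as a reusable function: struct format strings accept a repeat count, so "11b" means the same as "b" repeated 11 times. The helper name pack_and_decode is hypothetical:

```python
import struct


def pack_and_decode(values):
    """Pack signed byte values into a byte string, then decode it as UTF-8."""
    # '%db' % n builds a format such as '11b': n signed chars.
    raw = struct.pack('%db' % len(values), *values)
    return raw.decode('utf-8')


text = pack_and_decode([67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97])
```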


1

Keep in mind that a "string" in Python 2 is not proper text, just a sequence of bytes in memory that happens to map to characters when you "print" it. If the encoding of the intended characters in the byte sequence matches the one used by the terminal, you will see properly formatted text.

If your terminal is not UTF-8, then even with the proper byte string in memory, just printing it would show you the wrong results. That is why the extra "decode" step is needed at the end of the expression.

text = b''.join(chr(i if i >= 0 else 256 + i) for i in array).decode('utf-8')

As your source encoded the numbers between 128 and 255 as negative numbers, we have the inline "if" operator to renormalize the value before calling "chr".
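
To make the renormalization concrete, here is a small sketch of the mapping on the question's data. It uses bytearray rather than chr so that it also runs on Python 3, which is a substitution and not the expression above; the variable names are illustrative. Note that zero has to stay on the non-negative side, since 256 + 0 is not a valid byte value:

```python
values = [67, 97, 102, -61, -87, 32, 70, 108, 111, 114, 97]

# Map signed values (-128..127) onto unsigned byte values (0..255).
unsigned = [v if v >= 0 else 256 + v for v in values]

# -61 becomes 195 (0xC3) and -87 becomes 169 (0xA9): the UTF-8 bytes of U+00E9.
text = bytearray(unsigned).decode('utf-8')
```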

Just to be clear: you say "Since every character in the string I want is not necessarily represented by exactly 1 byte in the array". With Python 2.x byte strings, it is the terminal that takes care of that anyway. If you want to deal with proper text, then after joining your numbers into a (byte) string, use the "decode" method. This is the part that knows about UTF-8 multi-byte encoded characters and gives you back a text string object (a 'unicode' object in Python 2) that treats each character as an entity.
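
A short illustration of that difference, assuming the byte string from the question: before decoding, the accented letter counts as two items; after decoding, it is a single character.

```python
raw = b'Caf\xc3\xa9 Flora'   # byte string: 11 bytes
text = raw.decode('utf-8')   # text string: 10 characters

byte_count = len(raw)
char_count = len(text)
fourth_char = text[3]        # the single character U+00E9
```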

2 Comments

i if i > 0 else 256 + i can be written i & 0xFF too
It may work, although for semantic reasons I would not recommend it. It abuses a side effect of the "&" operation just to make the number positive. Maybe the "most correct" way is @Joran's answer using struct, but I am uncertain about the performance of doing that. This way is the "thinking in strings" way. :-)
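
For anyone weighing the two spellings, a quick check (a sketch, not a benchmark) suggests they agree for every value a signed or unsigned byte can take:

```python
# Compare the conditional form with the bit mask over -128..255.
same = all(
    (v & 0xFF) == (v if v >= 0 else 256 + v)
    for v in range(-128, 256)
)
```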
