I have a byte array which I believe correctly stores a UTF-16 encoded surrogate pair for the Unicode character 𐎑 (U+10391).

Running that byte array through .NET's System.Text.Encoding.Unicode.GetString() returns unexpected results.

Actual results: ��

Expected results: 𐎑

Code example:

byte[] inputByteArray = new byte[4];
inputByteArray[0] = 0x91;
inputByteArray[1] = 0xDF;
inputByteArray[2] = 0x00;
inputByteArray[3] = 0xD8;

// System.Text.Encoding.Unicode accepts little-endian UTF-16.
// Least significant byte first within the byte array: LSB in [0], MSB in [3]
string str = System.Text.Encoding.Unicode.GetString(inputByteArray);

// This returns �� rather than the expected symbol: 𐎑
Console.WriteLine(str);

Detail on how I got to that particular byte array from the character 𐎑:

This character is within the Supplementary Multilingual Plane. Its Unicode code point is 0x10391. Encoded into a UTF-16 surrogate pair, this should be:

Subtract 0x10000 from the code point: val = 0x10391 - 0x10000 = 0x00391

High surrogate: 0xD800 + (0x00391 >> 10) = 0xD800 (top 10 bits)

Low surrogate: 0xDC00 + (0x00391 & 0b_0011_1111_1111) = 0xDF91 (bottom 10 bits)
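
The arithmetic above can be checked with a short sketch. The class name SurrogateMath and the helper ToSurrogatePair are hypothetical names introduced here for illustration; only char.ConvertFromUtf32 is a framework call, used as a cross-check.

```csharp
using System;

static class SurrogateMath
{
    // Hypothetical helper: splits a supplementary code point
    // (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
    public static (ushort High, ushort Low) ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int val = codePoint - 0x10000;                  // 20-bit value
        ushort high = (ushort)(0xD800 + (val >> 10));   // top 10 bits
        ushort low = (ushort)(0xDC00 + (val & 0x3FF));  // bottom 10 bits
        return (high, low);
    }

    static void Main()
    {
        var (high, low) = ToSurrogatePair(0x10391);
        Console.WriteLine($"High: 0x{high:X4}, Low: 0x{low:X4}"); // High: 0xD800, Low: 0xDF91

        // Cross-check against the framework's own conversion.
        string s = char.ConvertFromUtf32(0x10391);
        Console.WriteLine(s[0] == high && s[1] == low); // True
    }
}
```

This confirms the surrogate values themselves are right; it says nothing yet about the order in which the two code units must be laid out in the byte array.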

1 Answer

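Encoding.Unicode is little-endian on a per-UTF-16 code unit basis: only the two bytes within each 16-bit code unit are swapped, while the code units themselves stay in their normal order. You still need to put the high surrogate code unit before the low surrogate code unit. Here's sample code that works: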

using System;
using System.Text;

class Test
{
    static void Main()
    {
        byte[] data =
        {
            0x00, 0xD8, // High surrogate
            0x91, 0xDF  // Low surrogate
        };
        string text = Encoding.Unicode.GetString(data);
        Console.WriteLine(char.ConvertToUtf32(text, 0)); // 66449
    }
}
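
One way to double-check the byte order is to go the other direction and let the framework produce the bytes. The following is a minimal sketch; the class name ByteOrderCheck and the helper EncodeUtf16Le are hypothetical names chosen for illustration. It also decodes the array from the question to show what the swapped code units turn into.

```csharp
using System;
using System.Text;

static class ByteOrderCheck
{
    // Hypothetical helper: encodes one code point as little-endian
    // UTF-16 bytes (GetBytes does not emit a byte order mark).
    public static byte[] EncodeUtf16Le(int codePoint) =>
        Encoding.Unicode.GetBytes(char.ConvertFromUtf32(codePoint));

    static void Main()
    {
        byte[] correct = EncodeUtf16Le(0x10391);
        Console.WriteLine(BitConverter.ToString(correct)); // 00-D8-91-DF

        // The array from the question: low surrogate first, high surrogate second.
        byte[] swapped = { 0x91, 0xDF, 0x00, 0xD8 };
        string bad = Encoding.Unicode.GetString(swapped);

        // Each unpaired surrogate should be replaced by U+FFFD,
        // which matches the two replacement characters seen in the question.
        foreach (char c in bad)
            Console.WriteLine($"U+{(int)c:X4}");
    }
}
```

Encoding first and inspecting the bytes is often the quickest way to settle layout questions like this one, since it avoids hand-assembling the array.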