I have a function that returns a byte array that represents a JSON string I need to parse. Generally, I would use Encoding.Default.GetString(myByteArray) to convert it to a string, but the resulting string has some unrecognized characters in it: ?{"rects":[],"text":""} instead of {"rects":[],"text":""}.

I have tried using every other encoding scheme the Encoding class has (that I know of anyway): UTF8, UTF7, UTF32, Unicode, BigEndianUnicode, Latin1, and ASCII, but every single one resulted in a string with ?, ??, or ÿ_ at the beginning (or in the case of UTF32, the whole string was ?'s).

Strangely, using new StreamReader(new MemoryStream(myByteArray)).ReadToEnd() decoded the string perfectly, and is what I'm currently using in my code. I used StreamReader.CurrentEncoding to figure out what encoding it was using and printed it to the console (System.Text.UnicodeEncoding), then tried using new UnicodeEncoding().GetString(myByteArray), but still no luck.

How do I identify what encoding the byte arrays are using so I can decode it directly instead of wrapping it in streams?

// data is the example JSON string: {"rects":[],"text":""}
// In practice, the JSON strings are much longer.
var data = new byte[] { 255, 254, 123, 0, 34, 0, 114, 0, 101, 0, 99, 0, 116, 0, 115, 0, 34, 0, 58, 0, 91, 0, 93, 0, 44, 0, 34, 0, 116, 0, 101, 0, 120, 0, 116, 0, 34, 0, 58, 0, 34, 0, 34, 0, 125, 0 };

var ms = new MemoryStream(data);
var sr = new StreamReader(ms);

var text = sr.ReadToEnd();
Console.WriteLine(sr.CurrentEncoding);
Console.WriteLine(text);

var text2 = Encoding.Default.GetString(data);

Console.WriteLine(text2);

dynamic json = JsonConvert.DeserializeObject<dynamic>(text);

Console.WriteLine(json.text);
Console.WriteLine(json.rects);

Thanks!

  • new UnicodeEncoding() works for me (does not produce ? at the beginning) Commented Dec 1, 2022 at 19:59

2 Answers

Well, you have UTF-16 with a Byte Order Mark (BOM), which identifies the encoding. In your case the BOM is FF FE (bytes 255, 254), which means UTF-16 (LE):

var data = new byte[] { 
  255, 254, // <- BOM (UTF-16 (LE))
  123, 0, 34, 0, 114, 0, /* Payload */ };

So you can just skip the BOM and decode the rest:

string result = Encoding.Unicode.GetString(data.AsSpan(2));

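Note that file readers (like StreamReader) can detect the BOM, pick the matching decoder, and use it when reading the file. Encoding.GetString does no such detection, which is why it decodes the BOM bytes as if they were text.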
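If the byte arrays might not always be UTF-16 (LE), the same idea can be generalized by comparing the start of the array against each encoding's preamble. The following is only a sketch: the class name BomDecoder, the method name Decode, and the fallback parameter are made up for illustration, and the candidate list is an assumption about which encodings you expect to see.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of .NET): picks a decoder from the BOM, then decodes the rest.
public static class BomDecoder
{
    // UTF-32 LE (FF FE 00 00) must be tested before UTF-16 LE (FF FE),
    // because the shorter BOM is a prefix of the longer one.
    private static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(bigEndian: false, byteOrderMark: true),
        new UTF32Encoding(bigEndian: true, byteOrderMark: true),
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: true),
        Encoding.Unicode,          // UTF-16 LE, preamble FF FE
        Encoding.BigEndianUnicode, // UTF-16 BE, preamble FE FF
    };

    public static string Decode(byte[] data, Encoding fallback)
    {
        foreach (Encoding encoding in Candidates)
        {
            byte[] bom = encoding.GetPreamble();
            if (bom.Length > 0 && StartsWith(data, bom))
            {
                // Skip the BOM bytes and decode only the payload.
                return encoding.GetString(data, bom.Length, data.Length - bom.Length);
            }
        }

        // No BOM found: the caller has to decide which encoding to assume.
        return fallback.GetString(data);
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

With the array from the question, BomDecoder.Decode(data, Encoding.UTF8) would take the Encoding.Unicode branch. A BOM is only a hint, so data without one still needs a sensible fallback.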

1 Comment

That was it, thanks! I had tried to use Substring(2) to skip the first 2 ?'s after decoding the array and ran into other issues with deserializing later in the string, but removing the first two bytes in the array before decoding worked perfectly.

What are the first two bytes for? Look up what characters 255 and 254 are at https://ascii-tables.com/. Just remove them and it should work fine.
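As an alternative to removing the bytes up front, the BOM can also be dropped after decoding. This is a different technique from the one suggested above, shown only as a sketch: when the whole array is decoded as UTF-16 (LE), the bytes 255, 254 become the single character U+FEFF at the start of the string, which can then be trimmed. The names BomTrim and DecodeAndTrim are hypothetical, and the sketch assumes the data really is UTF-16 (LE).

```csharp
using System;
using System.Text;

// Hypothetical helper: decode everything, then remove the BOM character.
public static class BomTrim
{
    public static string DecodeAndTrim(byte[] data)
    {
        // GetString does not strip the BOM; FF FE decodes to the character U+FEFF.
        string decoded = Encoding.Unicode.GetString(data);

        // Remove the leading U+FEFF so JSON parsers see '{' first.
        return decoded.TrimStart('\uFEFF');
    }
}
```

This only helps when the right encoding is used for decoding. With a different encoding, the two BOM bytes and the zero bytes in the payload decode to other characters, so cutting characters off the resulting string does not repair it.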
