Correctly handle utf8 string received from json.net

Question

I'm using json.net to read data sent in json format from a server. The server encodes all string-type data it sends in json as utf-8.

Now to read the data in c# I do something like this: string s = json.Value<string>("data");

I assume the string s is now in utf-8 format, whereas the default encoding for strings in c# is utf-16 (unicode).

To convert the string to unicode, would this be correct?

byte[] bytes = Encoding.Unicode.GetBytes(s);
string unicode = Encoding.UTF8.GetString(bytes);

What I want (I think) is the raw bytes from s and then pass that to the utf-8 decoder to get unicode, but I'm not sure what exactly Encoding.Unicode.GetBytes gives me, or what I should use instead.

You can't double parse it. But what is wrong with your string in the first place, since all strings in .NET are UTF16?
– Patrick Hofman, Commented Apr 4, 2016 at 14:02
Well the string is received as utf-8, I assumed I had to do something, but if json.net automatically handles this then it's ok as you say, but I don't know if that's the case.
– DaedalusAlpha, Commented Apr 4, 2016 at 14:03
I think you need to swap it. Encoding.UTF8.GetBytes(s) and then Encoding.Unicode.GetString(bytes). This way you will convert the UTF8 to Unicode.
– Peter Keuter, Commented Apr 4, 2016 at 14:05
In your question you have a variable called json -- how does that get populated? Is there some kind of stream being read from a web response? If so, you want to pass Encoding.UTF8 to the stream reader.
– Brian Rogers, Commented Apr 4, 2016 at 14:34
You are right, I just discovered the data is read from the socket using Encoding.Default.GetString, which isn't exactly optimal. Using Encoding.UTF8 there directly should fix all problems with utf-8 encoded strings.
– DaedalusAlpha, Commented Apr 4, 2016 at 14:49

Patrick Hofman · Accepted Answer · 2016-04-04 14:07:41Z

1

There is no need to convert anything, since string objects in .NET are encoded in UTF-16.

If anything needs to change, change it at the point where the incoming data is decoded into the string that JSON.NET parses; you can't decode it a second time afterwards. By the time you hold a string, the incoming bytes have already been interpreted with a specific encoding, and you can't reliably go back from there without the original bytes.
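To make this concrete, here is a minimal sketch of what happens at the byte level. It is not taken from the asker's code: the class and member names are made up for illustration, the sample bytes are simply the UTF-8 encoding of "strö", and ISO-8859-1 stands in for whatever legacy code page Encoding.Default might resolve to on a given machine.

```csharp
using System.Text;

// Hypothetical demo class; none of these names come from the question.
public static class DecodeDemo
{
    // The five bytes a UTF-8 server puts on the wire for the text "strö".
    public static readonly byte[] Wire = { 0x73, 0x74, 0x72, 0xC3, 0xB6 };

    // Right: decode the raw bytes as UTF-8, exactly once.
    public static string DecodedAsUtf8()
    {
        return Encoding.UTF8.GetString(Wire);
    }

    // Wrong: decode the same bytes with a single-byte legacy encoding
    // (ISO-8859-1 here, assumed as a stand-in for Encoding.Default).
    // Each byte becomes its own character, so 0xC3 0xB6 turns into "Ã¶".
    public static string DecodedAsLatin1()
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(Wire);
    }

    // The round trip from the question: Encoding.Unicode.GetBytes returns
    // the UTF-16LE bytes of the string (two bytes per char for this text),
    // and reading those as UTF-8 does not recover the original text.
    public static string QuestionRoundTrip(string s)
    {
        return Encoding.UTF8.GetString(Encoding.Unicode.GetBytes(s));
    }
}
```

DecodedAsUtf8() returns "strö", while DecodedAsLatin1() returns the five-character "strÃ¶" form. QuestionRoundTrip("abc") returns "a\0b\0c\0", because every UTF-16 code unit for an ASCII character carries a zero high byte that the UTF-8 decoder reads as a NUL character.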


3 Comments

DaedalusAlpha Over a year ago

If the json data that is received looks like this: { "data" : "strÃ¶" } it definitely needs to be converted, because it will look exactly like that in the c# string as well.

Patrick Hofman Over a year ago

Are you sure all goes well on the other end?

DaedalusAlpha Over a year ago

You were correct; the string that was parsed by json was created from the raw data from the socket using Encoding.Default instead of Encoding.UTF8.
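For reference, a rough sketch of that fix, assuming the data arrives on some Stream (for a socket, typically a NetworkStream). The helper name ReadAllUtf8 is hypothetical and the asker's actual reading code is not shown in the thread. A StreamReader is used rather than calling GetString on each receive buffer because its decoder keeps state between reads, so a multi-byte character split across two reads is still decoded correctly.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper; the real socket-reading code is not in the thread.
public static class SocketText
{
    // Reads everything from the stream and decodes it as UTF-8.
    // Note: ReadToEnd only returns once the stream ends, so on a socket
    // this assumes the sender closes the connection after the message.
    public static string ReadAllUtf8(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The resulting string can then be handed to JSON.NET as before (for example via JObject.Parse), and json.Value<string>("data") should come back without any further conversion.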
