4 events
when what by license comment
Sep 7, 2019 at 15:55 comment added verdy_p When you use "GetBytes", of course you don't specify an encoding, but you assume a byte order to get the two bytes in a specific order for each code unit stored locally in the string. When you build a new string from bytes, you also need a converter, not necessarily UTF-8 to UTF-16: you could insert an extra 0 in the high byte, or pack two bytes (in MSB-first or LSB-first order) into the same 16-bit code unit. Strings are then a compact form for arrays of 16-bit integers. The relation with "characters" is another problem: in C# they are not an actual type of their own, as they are still represented as strings.
Sep 7, 2019 at 15:47 comment added verdy_p So in a C# string, you can safely store a code unit like 0xFFFF or 0xFFFE, even if they are non-characters in UTF-16, and you can store an isolated 0xD800 not followed by a code unit in 0xDC00..0xDFFF (i.e. unpaired surrogates, which are invalid in UTF-16). The same remark applies to strings in JavaScript/ECMAScript and Java.
Sep 7, 2019 at 15:42 comment added verdy_p Actually a string in C# is NOT restricted to just UTF-16. What is true is that it contains a vector of 16-bit code units, but these 16-bit code units are not restricted to valid UTF-16. But as they are 16-bit, you need an encoding (byte order) to convert them to 8-bit bytes. A string can then store non-Unicode data, including binary data (e.g. a bitmap image). It is interpreted as UTF-16 only by I/O and text formatters that make such an interpretation.
Jul 2, 2018 at 20:51 history answered Jason Goemaat CC BY-SA 4.0
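The points made in the comments above can be tried out with a small sketch. It is written in Java rather than C#, since the comments state that Java strings behave the same way in this respect. The class `CodeUnitDemo` and its methods `pack`, `unpack`, and `widen` are hypothetical names invented for this illustration; they are not from the answer or from any library. The sketch assumes only that a string is a sequence of 16-bit code units. It shows that an unpaired surrogate and the values 0xFFFE and 0xFFFF can be stored as-is, that turning code units into bytes requires choosing a byte order, and that a round trip through UTF-8 is not guaranteed to preserve such a string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class CodeUnitDemo {

    // Split each 16-bit code unit into two bytes, in the requested byte order.
    static byte[] pack(String s, boolean bigEndian) {
        byte[] out = new byte[s.length() * 2];
        for (int i = 0; i < s.length(); i++) {
            char unit = s.charAt(i);
            byte hi = (byte) (unit >>> 8);
            byte lo = (byte) unit;
            out[2 * i] = bigEndian ? hi : lo;
            out[2 * i + 1] = bigEndian ? lo : hi;
        }
        return out;
    }

    // Inverse of pack: combine two bytes into each 16-bit code unit.
    static String unpack(byte[] bytes, boolean bigEndian) {
        if (bytes.length % 2 != 0) {
            throw new IllegalArgumentException("odd byte count");
        }
        char[] units = new char[bytes.length / 2];
        for (int i = 0; i < units.length; i++) {
            int first = bytes[2 * i] & 0xFF;
            int second = bytes[2 * i + 1] & 0xFF;
            units[i] = (char) (bigEndian ? (first << 8) | second
                                         : (second << 8) | first);
        }
        return new String(units);
    }

    // Alternative converter: one byte per code unit, with a zero high byte.
    static String widen(byte[] bytes) {
        char[] units = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            units[i] = (char) (bytes[i] & 0xFF);
        }
        return new String(units);
    }

    public static void main(String[] args) {
        // 'A', an unpaired high surrogate, and two noncharacters.
        String s = new String(new char[] {'A', 0xD800, 0xFFFE, 0xFFFF});
        System.out.println(s.length()); // 4 code units, stored unchanged

        // Packing and unpacking with the same byte order is lossless.
        System.out.println(unpack(pack(s, true), true).equals(s));
        System.out.println(unpack(pack(s, false), false).equals(s));

        // A round trip through UTF-8 treats the data as text and is lossy here,
        // because the unpaired surrogate cannot be encoded.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(s));
    }
}
```

The byte order is an explicit parameter here because, as the comments point out, nothing in the string itself fixes it: it only matters at the moment the 16-bit units are split into bytes or rebuilt from them.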