Timeline for How do I get a consistent byte representation of strings in C# without manually specifying an encoding?

Current License: CC BY-SA 4.0

4 events

when	what		by	license	comment
Sep 7, 2019 at 15:55	comment	added	verdy_p		When you use "GetBytes", of course you don't specify an encoding, but you still assume a byte order to get the two bytes of each code unit stored in the string in a specific order. When you build a new string from bytes, you also need a converter, not necessarily UTF-8 to UTF-16: you could insert an extra 0 in the high byte, or pack two bytes (in MSB-first or LSB-first order) into the same 16-bit code unit. Strings are then a compact form for arrays of 16-bit integers. The relation with "characters" is another problem; in C# they're not actual types as they are still represented as strings
Sep 7, 2019 at 15:47	comment	added	verdy_p		So in a C# string, you can safely store a code unit like 0xFFFF or 0xFFFE, even though they are non-characters in UTF-16, and you can store an isolated 0xD800 not followed by a code unit in 0xDC00..0xDFFF (i.e. an unpaired surrogate, which is invalid in UTF-16). The same remark applies to strings in JavaScript/ECMAScript and Java.
Sep 7, 2019 at 15:42	comment	added	verdy_p		Actually a string in C# is NOT restricted to just UTF-16. What is true is that it contains a vector of 16-bit code units, but these 16-bit code units are not restricted to valid UTF-16. But as they are 16-bit, you need an encoding (byte order) to convert them to 8-bit bytes. A string can then store non-Unicode data, including binary data (e.g. a bitmap image). It is interpreted as UTF-16 only by I/O and text formatters that make that interpretation.
Jul 2, 2018 at 20:51	history	answered	Jason Goemaat	CC BY-SA 4.0
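
The comments above describe a C# string as a vector of 16-bit code units that needs a byte order, but not a text encoding, to become bytes. A minimal sketch of that idea follows. It is not the answer's own code: the class and method names (CodeUnitDemo, ToBytesLittleEndian, FromBytesLittleEndian) are hypothetical, and LSB-first order is an arbitrary choice made for the illustration.

```csharp
using System;

public static class CodeUnitDemo
{
    // Hypothetical helper: writes each 16-bit code unit as two bytes,
    // least significant byte first. No validation is done, so unpaired
    // surrogates and non-characters pass through unchanged.
    public static byte[] ToBytesLittleEndian(string s)
    {
        var bytes = new byte[s.Length * 2];
        for (int i = 0; i < s.Length; i++)
        {
            bytes[2 * i] = (byte)(s[i] & 0xFF);   // low byte
            bytes[2 * i + 1] = (byte)(s[i] >> 8); // high byte
        }
        return bytes;
    }

    // Hypothetical inverse: packs each pair of bytes back into one code unit.
    // A trailing odd byte, if any, is ignored in this sketch.
    public static string FromBytesLittleEndian(byte[] bytes)
    {
        var chars = new char[bytes.Length / 2];
        for (int i = 0; i < chars.Length; i++)
        {
            chars[i] = (char)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
        }
        return new string(chars);
    }

    public static void Main()
    {
        // 'A', an isolated high surrogate, and the non-character 0xFFFF.
        string s = new string(new[] { 'A', (char)0xD800, (char)0xFFFF });

        byte[] bytes = ToBytesLittleEndian(s);
        Console.WriteLine(BitConverter.ToString(bytes)); // 41-00-00-D8-FF-FF

        string roundTrip = FromBytesLittleEndian(bytes);
        Console.WriteLine(s == roundTrip); // True
    }
}
```

The round trip keeps the isolated 0xD800 because nothing here interprets the code units as UTF-16; only the byte order is assumed, and both sides have to agree on it.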