
From here

Essentially, string uses the UTF-16 character encoding form

But when saving via StreamWriter:

This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM),
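For concreteness, here is a minimal sketch of that difference, assuming the standard System.IO/System.Text APIs; the file names are just placeholders:

    using System;
    using System.IO;
    using System.Text;

    class StreamWriterDefaults
    {
        static void Main()
        {
            string text = "H\u00e9llo";   // "Héllo": five characters, one of them non-ASCII

            // Default constructor: UTF-8 without a byte-order mark.
            using (var w = new StreamWriter("utf8.txt"))
                w.Write(text);

            // Explicit UTF-16 LE (Encoding.Unicode), which also writes a 2-byte BOM.
            using (var w = new StreamWriter("utf16.txt", false, Encoding.Unicode))
                w.Write(text);

            Console.WriteLine(File.ReadAllBytes("utf8.txt").Length);   // 6  (é takes 2 bytes)
            Console.WriteLine(File.ReadAllBytes("utf16.txt").Length);  // 12 (2-byte BOM + 5 × 2)
        }
    }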

I've seen this sample (broken link removed):

[Image: a table comparing the byte sizes of the same sample strings encoded as UTF-8 and UTF-16.]

It looks like UTF-8 is smaller for some strings, while UTF-16 is smaller for others.

  • So why does .NET use UTF-16 as the default encoding for string but UTF-8 when saving files?

Thank you.

P.S. I've already read the famous article.

4 Comments

  • This post from Eric Lippert goes into more detail on why the decision was made. Commented Apr 25, 2014 at 12:40
  • @Lukazoid Great post, but note the comments, where Hans Passant disagrees with a convincing argument. Commented Jun 21, 2014 at 21:52
  • Working version of @Lukazoid's link: web.archive.org/web/20161121052650/http://blog.coverity.com/… Commented Nov 7, 2018 at 6:14
  • The short answer is that UTF-16 is not portable, while UTF-8 is super portable. Commented Mar 26, 2019 at 13:26

3 Answers


If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.

Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32!

Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.

The two disadvantages of UTF-16 are:

  • The number of code units per Unicode character is variable, because not all characters are in the BMP. Until emoji became popular, this didn't affect many apps in day-to-day use. These days, certainly for messaging apps and the like, developers using UTF-16 really need to know about surrogate pairs.
  • For plain ASCII (which a lot of text is, at least in the West), it takes twice the space of the equivalent UTF-8-encoded text. (Both of these points are illustrated in the sketch below.)
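Both bullets are easy to see from managed code. This is only a minimal sketch, assuming C# on a reasonably recent .NET runtime; the sample strings are arbitrary:

    using System;
    using System.Text;

    class Utf16Disadvantages
    {
        static void Main()
        {
            // 1. A character outside the BMP needs two UTF-16 code units (a surrogate pair).
            string emoji = "\U0001F600";                                     // 😀 GRINNING FACE
            Console.WriteLine(emoji.Length);                                 // 2 code units, not 1 "character"
            Console.WriteLine(char.IsHighSurrogate(emoji[0]));               // True
            Console.WriteLine(char.ConvertToUtf32(emoji, 0).ToString("X"));  // 1F600

            // 2. Plain ASCII takes twice the space in UTF-16.
            string ascii = "plain ASCII text";
            Console.WriteLine(Encoding.UTF8.GetByteCount(ascii));     // 16
            Console.WriteLine(Encoding.Unicode.GetByteCount(ascii));  // 32 (Encoding.Unicode is UTF-16 LE)
        }
    }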

(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)

Given the problems of surrogate pairs, I suspect if a language/platform were being designed from scratch with no interop requirements (but basing its text handling in Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)
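To make that last point concrete, here is a sketch of how "the nth character" shifts meaning depending on whether you count UTF-16 code units, Unicode code points, or grapheme clusters. It assumes .NET Core 3.0 or later (for string.EnumerateRunes), and the sample string is my own:

    using System;
    using System.Globalization;
    using System.Text;

    class NthCharacter
    {
        static void Main()
        {
            // "e" followed by a combining acute accent, then an emoji outside the BMP.
            string s = "e\u0301\U0001F600";   // renders as "é" plus 😀

            Console.WriteLine(s.Length);      // 4 UTF-16 code units

            int codePoints = 0;
            foreach (Rune r in s.EnumerateRunes())
                codePoints++;
            Console.WriteLine(codePoints);    // 3 Unicode code points

            Console.WriteLine(new StringInfo(s).LengthInTextElements);  // 2 grapheme clusters
        }
    }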


26 Comments

The point of UTF-8 is that if you need up to 6 bytes per character to truly represent all possibilities, then anything less than UTF-32 is a problem that needs special cases and extra code. So UTF-16 and UTF-8 are both imperfect. However, as UTF-8 is half the size, you might as well use that. You gain nothing by using UTF-16 over it (except increased file/string sizes). Of course, some people will use UTF-16 and ignorantly assume it handles all characters.
I've read it 14 times and I still don't understand this line: "the size per code unit being constant". AFAIK the size can be 2, 3, or 4 bytes (in UTF-16), so what is constant here?
@gbjbaanb: No, .NET uses UTF-16. So when anything outside the BMP is required, surrogate pairs are used. Each character is a UTF-16 code unit. (As far as I'm aware there's no such thing as UCS-16 either - I think you mean UCS-2.)
@RoyiNamir: No, the size of a UTF-16 code unit is always 2 bytes. A Unicode character takes either one code unit (for the Basic Multilingual Plane) or two code units (for characters U+10000 and above).
@FernandoPelliccioni: How do you define "variable-width encoding" precisely? Having just reread definitions, I agree I was confused about the precise meaning of "code unit" but both UTF-8 and UTF-16 are variable width in terms of "they can take a variable number of bytes to represent a single Unicode code point". For UTF-8 it's 1-4 bytes, for UTF-16 it's 2 or 4. Will check over the rest of my answer for precision now.

As with many "why was this chosen" questions, this was determined by history. Windows became a Unicode operating system at its core in 1993. Back then, Unicode still had a code space of only 65,536 codepoints, the 16-bit repertoire encoded as UCS-2. It wasn't until 1996 that Unicode acquired the supplementary planes, extending the code space to over a million codepoints, together with surrogate pairs to fit them into a 16-bit encoding, thus establishing the UTF-16 standard.

.NET strings are UTF-16 because that's an excellent fit with the operating system's encoding; no conversion is required.

The history of UTF-8 is murkier, but it definitely post-dates the Windows NT design: the encoding was first presented publicly in January 1993, and the current standard, RFC 3629, dates from November 2003. It took a while to gain a foothold; the Internet was instrumental.



UTF-8 is the default for text storage and transfer because it is a relatively compact form for most languages (although some languages are more compact in UTF-16 than in UTF-8). Almost every individual language has a dedicated legacy encoding that is more efficient still.
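As a rough illustration of that trade-off (only a sketch; the sample strings are my own, measured with the standard Encoding APIs):

    using System;
    using System.Text;

    class ByteCountComparison
    {
        static void Main()
        {
            string english  = "Hello, world";     // 12 ASCII characters
            string japanese = "こんにちは世界";      // 7 BMP characters

            Console.WriteLine(Encoding.UTF8.GetByteCount(english));     // 12
            Console.WriteLine(Encoding.Unicode.GetByteCount(english));  // 24

            Console.WriteLine(Encoding.UTF8.GetByteCount(japanese));    // 21 (3 bytes per character)
            Console.WriteLine(Encoding.Unicode.GetByteCount(japanese)); // 14 (2 bytes per character)
        }
    }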

UTF-16 is used for in-memory strings because it is faster per character to parse and maps directly onto the Unicode character-class and other lookup tables. All string functions in Windows use UTF-16 and have for years.
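One way to see the "no conversion needed" point is via P/Invoke: with CharSet.Unicode the interop marshaller can hand the string's UTF-16 contents to a native "W" API essentially as-is. A minimal, Windows-only sketch; lstrlenW is used purely as a convenient example of a wide-character API:

    using System;
    using System.Runtime.InteropServices;

    class Utf16Interop
    {
        // lstrlenW counts UTF-16 code units up to the terminating null,
        // which is exactly what string.Length reports on the managed side.
        [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
        static extern int lstrlenW(string s);

        static void Main()
        {
            string s = "h\u00e9llo \U0001F600";   // "héllo " plus an emoji (surrogate pair)

            Console.WriteLine(lstrlenW(s));   // 8
            Console.WriteLine(s.Length);      // 8
        }
    }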

