
I'm using btoa to encode a Uint8Array to a base64 string, and I hit a strange case. This works:

export function toBase64(data: Uint8Array): string {
    return btoa(String.fromCharCode(...data))
}

Whereas this does not (btoa will often complain about an unknown character):

export function toBase64(data: Uint8Array): string {
    return btoa(new TextDecoder('latin1').decode(data))
}

Question

What encoding should I use with TextDecoder to produce the same string as via fromCharCode?

Background

Piecing together various documentation, the following should be true:

  • btoa expects a string whose code units are all in the latin1 range (0x00–0xFF)
  • String.fromCharCode converts each integer to the corresponding UTF-16 code unit
  • for the first 256 code points, latin1 and UTF-16 coincide
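The first bullet can be checked directly. A minimal sketch (behavior per the HTML spec; browsers throw an InvalidCharacterError DOMException for code units above 0xFF):

```javascript
// btoa accepts only strings whose code units are all <= 0xFF.
console.log(btoa("\u00FF")); // "/w==" — 0xFF is still in range

// Anything above 0xFF is rejected:
try {
  btoa("\u20AC"); // the Euro sign, U+20AC
} catch (e) {
  console.log(e.name); // "InvalidCharacterError"
}
```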

Test

Doing some experiments it is clear the two approaches yield different strings. With this setup:

const array = Array.from({ length: 256 }, (_, i) => i);
const d = new Uint8Array(array);

Running:

String.fromCharCode(...d)

will yield

\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

Whereas running:

(new TextDecoder('latin1')).decode(d)

will yield

\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F€\x81‚ƒ„…†‡ˆ‰Š‹Œ\x8DŽ\x8F\x90‘’“”•–—˜™š›œ\x9DžŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

They differ substantially in the range 0x7F–0x9F (copied below for clarity):

\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F

\x7F€\x81‚ƒ„…†‡ˆ‰Š‹Œ\x8DŽ\x8F\x90‘’“”•–—˜™š›œ\x9DžŸ
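That 0x80–0x9F range is the tell: per the WHATWG Encoding standard, the labels "latin1" and "iso-8859-1" do not select a pure ISO-8859-1 decoder but resolve to windows-1252, which remaps most of the C1 control range (0x80–0x9F) to printable characters such as € and ‚. A quick sketch:

```javascript
// The "latin1" label resolves to the windows-1252 decoder
// per the WHATWG Encoding standard:
console.log(new TextDecoder("latin1").encoding);     // "windows-1252"
console.log(new TextDecoder("iso-8859-1").encoding); // "windows-1252"

// So byte 0x80 decodes to U+20AC (€), not the C1 control U+0080:
const ch = new TextDecoder("latin1").decode(Uint8Array.from([0x80]));
console.log(ch.charCodeAt(0)); // 8364, i.e. 0x20AC
```

That single remapped character is already enough to make btoa throw, which explains the "unknown character" complaint in the question.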
  • Wrong idea. Why fromCharCode? Anyway, String.fromCharCode produces a string from UTF-16 units. In contrast to that, String.fromCodePoint produces a string out of the sequence of Unicode code points. You need to understand the concept of code point. It is the abstraction not based on a particular encoding, it is the bijection between the character set and the mathematical set of integer numbers, abstracted from their computer representation. So, String.fromCodePoint is the way to produce Unicode text in a way, agnostic to the particular encoding used in the system. Commented Jan 14 at 0:40
  • ...while btoa and atob are totally unrelated not only to Unicode but to text data in general. It is used to Base64-encode and to decode arbitrary binary data. Base64-encoded string is ASCII. Not Latin1, not any UTF, not anything else. Commented Jan 14 at 0:44
  • Are you in a browser? Because the TextDecoder variant doesn't even work for me in a browser, as the Euro sign and co. coming out from some further conversion are rather high codepoints and btoa refuses them. console.log(new TextDecoder("ascii").decode(Uint8Array.from([128])).charCodeAt()); and console.log(btoa(new TextDecoder("ascii").decode(Uint8Array.from([128]))));. And if you're not in a browser, you can likely use Buffer.toString("base64") of Node.js. Commented Jan 16 at 15:55
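For what it's worth, the Node.js route mentioned in the last comment sidesteps the string round-trip entirely, since Buffer works on the bytes directly. A minimal sketch:

```javascript
// In Node.js, Buffer base64-encodes the raw bytes with no
// intermediate string, so no latin1/windows-1252 pitfalls apply:
const data = Uint8Array.from([0x00, 0x80, 0xFF]);
console.log(Buffer.from(data).toString("base64")); // "AID/"
```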

1 Answer


String.fromCharCode takes in UTF-16 code units, so you'd have to use a UTF-16 decoder to get the same result. However, you also need to use a Uint16Array to represent the data:

const array = Array.from({ length: 256 }, (_, i) => i);
const d = new Uint16Array(array);
const fromString = String.fromCharCode(...d);
const decoded = (new TextDecoder("UTF-16le")).decode(d);
console.log(fromString);
console.log(decoded);
console.log(fromString === decoded);

Note that on big-endian machines you might have to use "UTF-16be" instead, or generate the buffer through a DataView, though I couldn't test it myself and I'm not sure how many such machines crawl the modern web.


9 Comments

Will this work on big-endian machines?
@Bergi good question, I suppose we should change for "UTF-16be" there, or force the ArrayBuffer to LE through a DataView, but BE machines are quite rare and I can't test myself.
Yeah, me neither; but worth mentioning anyway. Probably fix via const isBigEndian = new Uint16Array(new Uint8Array([0, 1]).buffer)[0] == 1, then new TextDecoder(isBigEndian ? "UTF-16be" : "UTF-16le")
So is my understanding there is no available encoding that fully overlaps with the first 256 UTF16 codes?
@Bergi FWIW news.ycombinator.com/item?id=16190209 For web-facing code, we can forget about BE.
