I'm using `btoa` to encode a `Uint8Array` to a base64 string, and I hit a strange case. This works:

```ts
export function toBase64(data: Uint8Array): string {
  return btoa(String.fromCharCode(...data))
}
```
Whereas this does not (`btoa` often complains about a character outside the Latin1 range):

```ts
export function toBase64(data: Uint8Array): string {
  return btoa(new TextDecoder('latin1').decode(data))
}
```
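As an aside, the spread call in the working version can exceed the engine's argument-count limit for large inputs. A chunked sketch of the same idea (the helper name `toBase64Chunked` and the chunk size are my own choices, not from the original):

```ts
// Sketch: same approach as toBase64 above, but chunked so that very large
// arrays do not exceed the engine's argument-count limit for spread calls.
export function toBase64Chunked(data: Uint8Array): string {
  let binary = '';
  const CHUNK = 0x8000; // 32 KiB worth of bytes per fromCharCode call
  for (let i = 0; i < data.length; i += CHUNK) {
    binary += String.fromCharCode(...data.subarray(i, i + CHUNK));
  }
  return btoa(binary);
}
```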
## Question
What encoding should I use with `TextDecoder` to produce the same string as via `fromCharCode`?
## Background
Piecing together various pieces of documentation, the following should be true:
- `btoa` expects a `latin1`-encoded string
- `String.fromCharCode` will convert individual integers to the respective `utf16` character
- for the first 256 characters, `latin1` and `utf16` overlap
## Test
Doing some experiments, it is clear that the two approaches yield different strings. With this setup:

```ts
const array = Array.from({ length: 256 }, (_, i) => i);
const d = new Uint8Array(array);
```
Running:

```ts
String.fromCharCode(...d)
```

will yield

```
\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
```
Whereas running:

```ts
new TextDecoder('latin1').decode(d)
```

will yield

```
\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F€\x81‚ƒ„…†‡ˆ‰Š‹Œ\x8DŽ\x8F\x90‘’“”•–—˜™š›œ\x9DžŸ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
```
They differ substantially in the range `0x80`–`0x9F` (copied below for clarity):

```
\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F
\x7F€\x81‚ƒ„…†‡ˆ‰Š‹Œ\x8DŽ\x8F\x90‘’“”•–—˜™š›œ\x9DžŸ
```
## Answer

`String.fromCharCode` produces a string from UTF-16 code units. In contrast, `String.fromCodePoint` produces a string from a sequence of Unicode code points. You need to understand the concept of a code point: it is an abstraction not based on any particular encoding, a bijection between the character set and a set of integers, abstracted from their computer representation. So `String.fromCodePoint` is the way to produce Unicode text in a manner that is agnostic to the particular encoding used in the system.

`btoa` and `atob` are totally unrelated not only to Unicode but to text data in general. They Base64-encode and decode arbitrary binary data. A Base64-encoded string is ASCII. Not Latin1, not any UTF, not anything else.

The `TextDecoder` variant doesn't even work for me in a browser, as the Euro sign and company coming out of the conversion are rather high code points, and `btoa` refuses them. Try:

```ts
console.log(new TextDecoder("ascii").decode(Uint8Array.from([128])).charCodeAt(0));
console.log(btoa(new TextDecoder("ascii").decode(Uint8Array.from([128]))));
```

And if you're not in a browser, you can likely use `Buffer.toString("base64")` of Node.js.