
I'm using btoa to encode a Uint8Array to a base64 string, and I hit a strange case. This works:

export function toBase64(data: Uint8Array): string {
    return btoa(String.fromCharCode(...data))
}

Whereas this does not (btoa will often complain about an unknown character):

export function toBase64(data: Uint8Array): string {
    return btoa(new TextDecoder('latin1').decode(data))
}

Question

What encoding should I use with TextDecoder to produce the same string as via fromCharCode?

Background

Piecing together various documentation, the following should be true:

  • btoa expects a string whose code units are all in the latin1 range (0x00–0xFF)
  • String.fromCharCode converts each integer to the corresponding UTF-16 code unit
  • for the first 256 code points, latin1 and UTF-16 coincide
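The first bullet can be checked directly. A minimal sketch (behavior per the HTML spec; browsers throw an InvalidCharacterError DOMException for code units above 0xFF):

```javascript
// btoa accepts only strings whose code units are all <= 0xFF.
console.log(btoa("\u00FF")); // "/w==" — 0xFF is still in range

// Anything above 0xFF is rejected:
try {
  btoa("\u20AC"); // the Euro sign, U+20AC
} catch (e) {
  console.log(e.name); // "InvalidCharacterError"
}
```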

Test

Doing some experiments it is clear the two approaches yield different strings. With this setup:

const array = Array.from({ length: 256 }, (_, i) => i);
const d = new Uint8Array(array);

Running:

String.fromCharCode(...d)

will yield

\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

Whereas running:

(new TextDecoder('latin1')).decode(d)

will yield

\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F€\x81‚ƒ„…†‡ˆ‰Š‹Œ\x8DŽ\x8F\x90‘’“”•–—˜™š›œ\x9DžŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

They differ substantially in the range 0x7F–0x9F (copied below for clarity):

\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F

\x7F€\x81‚ƒ„…†‡ˆ‰Š‹Œ\x8DŽ\x8F\x90‘’“”•–—˜™š›œ\x9DžŸ
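That 0x80–0x9F range is the tell: per the WHATWG Encoding standard, the labels "latin1" and "iso-8859-1" do not select a pure ISO-8859-1 decoder but resolve to windows-1252, which remaps most of the C1 control range (0x80–0x9F) to printable characters such as € and ‚. A quick sketch:

```javascript
// The "latin1" label resolves to the windows-1252 decoder
// per the WHATWG Encoding standard:
console.log(new TextDecoder("latin1").encoding);     // "windows-1252"
console.log(new TextDecoder("iso-8859-1").encoding); // "windows-1252"

// So byte 0x80 decodes to U+20AC (€), not the C1 control U+0080:
const ch = new TextDecoder("latin1").decode(Uint8Array.from([0x80]));
console.log(ch.charCodeAt(0)); // 8364, i.e. 0x20AC
```

That single remapped character is already enough to make btoa throw, which explains the "unknown character" complaint in the question.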
  • Wrong idea. Why fromCharCode? Anyway, String.fromCharCode produces a string from UTF-16 units. In contrast to that, String.fromCodePoint produces a string out of the sequence of Unicode code points. You need to understand the concept of code point. It is the abstraction not based on a particular encoding, it is the bijection between the character set and the mathematical set of integer numbers, abstracted from their computer representation. So, String.fromCodePoint is the way to produce Unicode text in a way, agnostic to the particular encoding used in the system. Commented Jan 14 at 0:40
  • ...while btoa and atob are totally unrelated not only to Unicode but to text data in general. It is used to Base64-encode and to decode arbitrary binary data. Base64-encoded string is ASCII. Not Latin1, not any UTF, not anything else. Commented Jan 14 at 0:44
  • Are you in a browser? Because the TextDecoder variant doesn't even work for me in a browser, as the Euro sign and co. coming out from some further conversion are rather high codepoints and btoa refuses them. console.log(new TextDecoder("ascii").decode(Uint8Array.from([128])).charCodeAt()); and console.log(btoa(new TextDecoder("ascii").decode(Uint8Array.from([128]))));. And if you're not in a browser, you can likely use Buffer.toString("base64") of Node.js. Commented Jan 16 at 15:55
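For what it's worth, the Node.js route mentioned in the last comment sidesteps the string round-trip entirely, since Buffer works on the bytes directly. A minimal sketch:

```javascript
// In Node.js, Buffer base64-encodes the raw bytes with no
// intermediate string, so no latin1/windows-1252 pitfalls apply:
const data = Uint8Array.from([0x00, 0x80, 0xFF]);
console.log(Buffer.from(data).toString("base64")); // "AID/"
```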

1 Answer


String.fromCharCode takes in UTF-16 code units, so you'd have to use a UTF-16 decoder to get the same result. However, you also need to use a Uint16Array to represent the data:

const array = Array.from({ length: 256 }, (_, i) => i);
const d = new Uint16Array(array);
const fromString = String.fromCharCode(...d);
const decoded = (new TextDecoder("UTF-16le")).decode(d);
console.log(fromString);
console.log(decoded);
console.log(fromString === decoded);

Note that on big-endian machines you might have to use "UTF-16be" instead, or generate the buffer through a DataView, though I couldn't test it myself and I'm not sure how many such machines crawl the modern web.


9 Comments

Will this work on big-endian machines?
@Bergi good question, I suppose we should change for "UTF-16be" there, or force the ArrayBuffer to LE through a DataView, but BE machines are quite rare and I can't test myself.
Yeah, me neither; but worth mentioning anyway. Probably fix via const isBigEndian = new Uint16Array(new Uint8Array([0, 1]).buffer)[0] == 1, then new TextDecoder(isBigEndian ? "UTF-16be" : "UTF-16le")
So is my understanding there is no available encoding that fully overlaps with the first 256 UTF16 codes?
@Bergi FWIW news.ycombinator.com/item?id=16190209 For web-facing code, we can forget about BE.
