Unicode, JavaScript and Base64


I recently needed to work with binary data in the browser, specifically RSA keys, which I needed to encode as Base64.

let key=new Uint8Array(256)
// I'll just use some random data here
crypto.getRandomValues(key)
console.log(key) // Uint8Array(256) [38, 104, 42, 189....

Now that we have the binary data in a typed array, I can just call btoa to convert it into Base64, right?

Well, not really. btoa only accepts strings, so we need to do the conversion first.

let str = new TextDecoder().decode(key) // decodes the bytes as UTF-8 into a JavaScript string
btoa(str) // Uncaught DOMException:
// Failed to execute 'btoa' on 'Window':
// The string to be encoded contains characters outside of the Latin1 range.

For some reason, it didn’t work. After some digging, I found this to be a bit more complex than I had anticipated (at least in JavaScript).
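A small sketch of what goes wrong here: TextDecoder interprets the bytes as UTF-8, so a byte sequence that is not valid UTF-8 is replaced with U+FFFD, and a valid multi-byte sequence collapses into a single character with a value above 255. Either way the string ends up containing code units that btoa() rejects. The byte values below are hand-picked for illustration and are not taken from the key above.

```javascript
const decoder = new TextDecoder(); // defaults to UTF-8

// 0xFF never appears in valid UTF-8, so it decodes to U+FFFD (the replacement character).
const invalid = decoder.decode(new Uint8Array([0xff]));
console.log(invalid.charCodeAt(0).toString(16)); // fffd

// 0xE4 0xB8 0xAD is the UTF-8 encoding of U+4E2D, so three bytes become one character.
const multiByte = decoder.decode(new Uint8Array([0xe4, 0xb8, 0xad]));
console.log(multiByte.length, multiByte.charCodeAt(0).toString(16)); // 1 4e2d
```

Random bytes will almost always hit one of these two cases, which would explain the exception.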

Unicode

Unicode is a standard for consistently encoding text written in different writing systems. Each character is assigned a number called a code point, ranging from 0 (U+0000) to 0x10FFFF (U+10FFFF). That gives 1,114,112 code points in total, of which 1,112,064 are valid for encoding characters (the remaining 2,048 are reserved as surrogates). Of course, most of them are currently unassigned.
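To make the numbers concrete, here is a short sketch that looks up a few code points and redoes the arithmetic. The sample characters are arbitrary picks for illustration.

```javascript
// codePointAt() returns the Unicode code point of the character at an index.
console.log("A".codePointAt(0).toString(16));         // 41    (U+0041)
console.log("\u4E2D".codePointAt(0).toString(16));    // 4e2d  (U+4E2D)
console.log("\u{1F600}".codePointAt(0).toString(16)); // 1f600 (U+1F600)

// The code space runs from 0 to 0x10FFFF inclusive.
const totalCodePoints = 0x10ffff + 1;
// U+D800..U+DFFF are reserved as surrogates and cannot encode characters on their own.
const surrogates = 0xdfff - 0xd800 + 1;
console.log(totalCodePoints, surrogates, totalCodePoints - surrogates); // 1114112 2048 1112064
```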

UTF-8

UTF-8 is a variable-width character encoding used to represent Unicode. UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte code units. It’s the de facto text encoding standard for the modern world.
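The variable width is easy to observe with TextEncoder, which always produces UTF-8. In this sketch utf8Hex is a hypothetical helper written for this example, and the four sample characters were chosen to need one, two, three and four bytes.

```javascript
const encoder = new TextEncoder(); // always produces UTF-8

// Hypothetical helper for illustration: the UTF-8 bytes of a string, as hex.
function utf8Hex(text) {
  return Array.from(encoder.encode(text), (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(utf8Hex("A"));         // 41
console.log(utf8Hex("\u00E9"));    // c3 a9
console.log(utf8Hex("\u4E2D"));    // e4 b8 ad
console.log(utf8Hex("\u{1F600}")); // f0 9f 98 80
```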

UTF-16

Like UTF-8, UTF-16 is a variable-width encoding. The main difference is that each code point is represented with one or two code units, and each code unit is 2 bytes (16 bits). A single code unit can hold 65,536 different values, which covers the most commonly used characters. Code points above U+FFFF are encoded as a pair of code units (a surrogate pair), 32 bits in total, so every code point can be represented.
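As a rough illustration of the two-unit case, the sketch below splits a code point above U+FFFF into its two code units and compares the result with what a JavaScript string reports. toSurrogatePair is a hypothetical helper, and it assumes its input is in the range U+10000 to U+10FFFF.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // 20 bits remain
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16)); // d83d de00

// A JavaScript string exposes the same two code units.
const emoji = String.fromCodePoint(0x1f600);
console.log(emoji.length);                     // 2
console.log(emoji.charCodeAt(0).toString(16)); // d83d
console.log(emoji.charCodeAt(1).toString(16)); // de00
```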

JavaScript

DOMString

In JavaScript, strings are sequences of 16-bit code units, interpreted as UTF-16. Web APIs refer to this string type as DOMString.

btoa

window.btoa() is a native function provided by browser runtimes. Despite taking a string, btoa() expects binary data as input: every character in the string stands for one byte. This means each code unit in the input string must have a value from 0 to 255 (the Latin1 range); anything larger throws an exception.
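A quick sketch of that rule in action. The inputs are arbitrary examples: the first two stay within 0 to 255, and the third is a single code unit whose value is too large for a byte.

```javascript
// Every character here is below 256, so btoa() treats each one as a byte.
console.log(btoa("hello"));  // aGVsbG8=
console.log(btoa("\u00FF")); // /w==

// U+4E2D is a single code unit, but its value (0x4E2D) does not fit in a byte.
let rejected = false;
try {
  btoa("\u4E2D");
} catch (err) {
  rejected = true;
  console.log(err.name); // InvalidCharacterError
}
```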

Base64

Base64 is a binary-to-text encoding scheme that represents binary data in an ASCII string format by translating the data into a radix-64 representation. The translation/index table is as follows:

Value Encoding  Value Encoding  Value Encoding  Value Encoding
         0 A            17 R            34 i            51 z
         1 B            18 S            35 j            52 0
         2 C            19 T            36 k            53 1
         3 D            20 U            37 l            54 2
         4 E            21 V            38 m            55 3
         5 F            22 W            39 n            56 4
         6 G            23 X            40 o            57 5
         7 H            24 Y            41 p            58 6
         8 I            25 Z            42 q            59 7
         9 J            26 a            43 r            60 8
        10 K            27 b            44 s            61 9
        11 L            28 c            45 t            62 +
        12 M            29 d            46 u            63 /
        13 N            30 e            47 v
        14 O            31 f            48 w         (pad) =
        15 P            32 g            49 x
        16 Q            33 h            50 y

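Every 24 bits (3 bytes) of input are split into four 6-bit groups, and each group (a value from 0 to 63) is encoded as one character from the table. This repeats until the input runs out. If fewer than 24 bits remain at the end, the last partial group is padded with zero bits on the right, and each 6-bit group that is missing entirely is written as one “=” padding character. So 8 leftover bits produce two characters followed by “==”, and 16 leftover bits produce three characters followed by “=”.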
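The scheme is small enough to sketch directly. The function below is a minimal illustration written for this article, not the code of any particular library; encodeBase64Sketch and ALPHABET are names made up for the example.

```javascript
// The standard Base64 alphabet: index = 6-bit value, character = encoding.
const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: encode an array of bytes (Uint8Array or number[]) as Base64.
function encodeBase64Sketch(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const remaining = Math.min(3, bytes.length - i);
    // Pack up to three bytes into a 24-bit group, zero-filling the missing ones.
    const group = (bytes[i] << 16) | ((bytes[i + 1] ?? 0) << 8) | (bytes[i + 2] ?? 0);
    out += ALPHABET[(group >> 18) & 63];
    out += ALPHABET[(group >> 12) & 63];
    // A 6-bit group with no input bits at all becomes "=".
    out += remaining > 1 ? ALPHABET[(group >> 6) & 63] : "=";
    out += remaining > 2 ? ALPHABET[group & 63] : "=";
  }
  return out;
}

console.log(encodeBase64Sketch([77, 97, 110])); // TWFu  ("Man", no padding)
console.log(encodeBase64Sketch([77, 97]));      // TWE=  (16 bits left over)
console.log(encodeBase64Sketch([77]));          // TQ==  (8 bits left over)
```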

The padding is pretty pointless if you are not concatenating two Base64 strings, since the decoder can work out the number of leftover bytes from the length of the string.

Back to the Question

Based on what we’ve just learned, we need to turn the Uint8Array into a binary string, where each character holds one byte.

function charsToBinaryStr(chars) {
  // Build a "binary string": one character per byte, each with a code unit of 0-255.
  let result = '';
  for (let i = 0; i < chars.length; i++) {
    result += String.fromCharCode(chars[i]);
  }
  return result;
}

let b64=btoa(charsToBinaryStr(key));
console.log(b64); // "0fJxfbjVMk3IAck/VazUns8LWFkbqvIVXG2JviHHrIs3+

That’s a success. Of course, when decoding we need to do the opposite, since atob() also returns the annoying binary string. The same issue comes up when we want to encode text that contains characters outside the Latin1 range. We need to call encode() on a TextEncoder to convert the string into a Uint8Array of UTF-8 bytes, loop over it to build the binary string, then call btoa(). Actually, since the Base64 encoding scheme is so simple, could we just implement it in JavaScript ourselves?
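Before moving on, here is roughly what the decoding direction and the text round trip described above could look like with only the native functions. All four function names are hypothetical helpers for this sketch; bytesToBinaryStr repeats the loop from charsToBinaryStr so that the block is self-contained.

```javascript
// Same loop as charsToBinaryStr: one character per byte.
function bytesToBinaryStr(bytes) {
  let result = "";
  for (let i = 0; i < bytes.length; i++) result += String.fromCharCode(bytes[i]);
  return result;
}

// Hypothetical helper: Base64 string -> Uint8Array, undoing the encoding steps.
function base64ToBytes(b64) {
  const binary = atob(b64); // binary string, one character per byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Text -> UTF-8 bytes -> binary string -> Base64, and back again.
function encodeText(text) {
  return btoa(bytesToBinaryStr(new TextEncoder().encode(text)));
}
function decodeText(b64) {
  return new TextDecoder().decode(base64ToBytes(b64));
}

const encoded = encodeText("\u4E2D");
console.log(encoded);                          // 5Lit
console.log(decodeText(encoded) === "\u4E2D"); // true
```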

I’m too lazy to write the code myself, so I went and found rfc4648.js, a really lightweight implementation of RFC 4648, a.k.a. the Base64 RFC.

And here are the results of some simple benchmarks.

bytes  rfc4648.js           btoa (with charsToBinaryStr)
512    0.12313999996483325  0.08142999999821186
1K     0.13263000006899237  0.20672999991625549
4K     0.6471599999852479   0.7708199999794364

The JavaScript implementation is a tad slower than the native version, but only when the input size is small. For bigger inputs, the string concatenation in charsToBinaryStr is really slow, and it tanks the speed of the native approach. Because of this, I would recommend just using the library instead of fiddling with btoa and atob yourself. It also comes with handy URL-safe Base64 support without any performance penalty. If you want that with the native implementation, you’ll have to run String.prototype.replace() on the encoded string; more on that in the appendix.
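For anyone who wants to repeat the comparison, a timing loop along these lines should do. This is only a sketch under assumptions: averageMs is a hypothetical helper, the run count and the filler data are arbitrary, and it is not the harness that produced the numbers above. Only the btoa side is timed here; a library call could be passed to averageMs in the same way.

```javascript
function charsToBinaryStr(chars) {
  let result = "";
  for (let i = 0; i < chars.length; i++) result += String.fromCharCode(chars[i]);
  return result;
}

// Hypothetical helper: average time in milliseconds of one call to fn.
function averageMs(fn, runs = 1000) {
  const start = performance.now();
  for (let i = 0; i < runs; i++) fn();
  return (performance.now() - start) / runs;
}

for (const size of [512, 1024, 4096]) {
  // Deterministic filler bytes; random data would work equally well.
  const data = new Uint8Array(size).map((_, i) => (i * 31) % 256);
  const ms = averageMs(() => btoa(charsToBinaryStr(data)));
  console.log(size, ms); // timings vary by machine and runtime
}
```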

And that’s the end. I’ve only scratched the surface of these topics, but I still hope you’ve learned something new about them. Thanks for reading.

Appendix

URL-Safe Base64

URL-safe Base64 uses a separate conversion table, shown as follows:

Value Encoding  Value Encoding  Value Encoding  Value Encoding
    0 A            17 R            34 i            51 z
    1 B            18 S            35 j            52 0
    2 C            19 T            36 k            53 1
    3 D            20 U            37 l            54 2
    4 E            21 V            38 m            55 3
    5 F            22 W            39 n            56 4
    6 G            23 X            40 o            57 5
    7 H            24 Y            41 p            58 6
    8 I            25 Z            42 q            59 7
    9 J            26 a            43 r            60 8
   10 K            27 b            44 s            61 9
   11 L            28 c            45 t            62 - (minus)
   12 M            29 d            46 u            63 _
   13 N            30 e            47 v           (underline)
   14 O            31 f            48 w
   15 P            32 g            49 x
   16 Q            33 h            50 y         (pad) =

I’ll save you some time. Only the characters for values 62 and 63 differ from the normal version: “+” becomes “-” and “/” becomes “_”. You can simply run a string replace to convert between the two.
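A minimal sketch of that string replace, in both directions. toUrlSafe and fromUrlSafe are hypothetical helper names, and the padding character is left untouched here.

```javascript
// Standard alphabet -> URL-safe alphabet: only "+" and "/" change.
function toUrlSafe(b64) {
  return b64.replace(/\+/g, "-").replace(/\//g, "_");
}

// URL-safe alphabet -> standard alphabet, e.g. before handing the string to atob().
function fromUrlSafe(b64url) {
  return b64url.replace(/-/g, "+").replace(/_/g, "/");
}

const standard = btoa("\u00FB\u00FF"); // two bytes chosen so that values 62 and 63 both appear
console.log(standard);                 // +/8=
console.log(toUrlSafe(standard));      // -_8=
console.log(fromUrlSafe(toUrlSafe(standard)) === standard); // true
```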

Why the confusing names?

I remember seeing these two functions, atob() and btoa(), for the first time. The first question that came to mind was: why are they named like that? This is JavaScript after all, and these abbreviations sound like something out of C. It turns out they actually come from the Unix world. The names carried over from Unix into the old Netscape codebase in 1995 and have stayed the same after all these years. Web standards are indeed fascinating, aren’t they?