From 8ac24b4dee520ce0dbaf7686ea010a69370050f7 Mon Sep 17 00:00:00 2001 From: Kevin Gibbons Date: Wed, 25 Oct 2023 13:37:12 -0700 Subject: [PATCH 1/8] update specification --- spec.html | 162 +++++++++++++++--------------------------------------- 1 file changed, 43 insertions(+), 119 deletions(-) diff --git a/spec.html b/spec.html index 4cd2943..c28cbe7 100644 --- a/spec.html +++ b/spec.html @@ -22,53 +22,14 @@

Uint8Array.prototype.toBase64 ( [ _options_ ] )

1. Let _opts_ be ? GetOptionsObject(_options_). 1. Let _alphabet_ be ? Get(_opts_, *"alphabet"*). 1. If _alphabet_ is *undefined*, set _alphabet_ to *"base64"*. - 1. Set _alphabet_ to ? ToString(_alphabet_). + 1. If _alphabet_ is not a String, throw a *TypeError* exception. 1. If _alphabet_ is neither *"base64"* nor *"base64url"*, throw a *TypeError* exception. 1. If _alphabet_ is *"base64"*, then - 1. Let _outAscii_ be the sequence of code points which results from encoding _toEncode_ according to the base64 encoding specified in section 4 of RFC 4648. + 1. Let _outAscii_ be the sequence of code points which results from encoding _toEncode_ according to the base64 encoding specified in section 4 of RFC 4648. Padding is included. 1. Else, 1. Assert: _alphabet_ is *"base64url"*. - 1. Let _outAscii_ be the sequence of code points which results from encoding _toEncode_ according to the base64url encoding specified in section 5 of RFC 4648. + 1. Let _outAscii_ be the sequence of code points which results from encoding _toEncode_ according to the base64url encoding specified in section 5 of RFC 4648. Padding is included. 1. Return CodePointsToString(_outAscii_). - 1. NOTE: CodePointsToString is used only because RFC 4648 does not produce a sequence of code units. Implementations may be able to produce an ECMAScript string directly. - - - - -

Uint8Array.prototype.toPartialBase64 ( [ _options_ ] )

- - 1. Let _O_ be the *this* value. - 1. Let _toEncode_ be ? GetUint8ArrayBytes(_O_). - 1. Let _opts_ be ? GetOptionsObject(_options_). - 1. Let _alphabet_ be ? Get(_opts_, *"alphabet"*). - 1. If _alphabet_ is *undefined*, set _alphabet_ to *"base64"*. - 1. Set _alphabet_ to ? ToString(_alphabet_). - 1. If _alphabet_ is neither *"base64"* nor *"base64url"*, throw a *TypeError* exception. - 1. Let _more_ be ToBoolean(? Get(_opts_, *"more"*)). - 1. Let _extra_ be ? Get(_opts_, *"extra"*). - 1. If _extra_ is neither *undefined* nor *null*, then - 1. TODO: consider handling array-of-bytes. - 1. Let _extraBytes_ be ? GetUint8ArrayBytes(_extra_). - 1. Set _toEncode_ to the list-concatenation of _extraBytes_ and _toEncode_. - 1. If _more_ is *true*, then - 1. Let _toEncodeLength_ be the length of _toEncode_. - 1. Let _overflowLength_ be _toEncodeLength_ modulo 3. - 1. Let _overflowBytes_ be a List whose elements are the last _overflowLength_ elements of _toEncode_. - 1. TODO: convert _overflowBytes_ to a new Uint8Array here. - 1. Remove the last _overflowLength_ elements of _toEncode_. - 1. Else, - 1. Let _overflowBytes_ be *null*. - 1. If _alphabet_ is *"base64"*, then - 1. Let _outAscii_ be the sequence of code points which results from encoding _toEncode_ according to the base64 encoding specified in section 4 of RFC 4648. - 1. Else, - 1. Assert: _alphabet_ is *"base64url"*. - 1. Let _outAscii_ be the sequence of code points which results from encoding _toEncode_ according to the base64url encoding specified in section 5 of RFC 4648. - 1. Let _result_ be CodePointsToString(_outAscii_). - 1. NOTE: CodePointsToString is used only because RFC 4648 does not produce a sequence of code units. Implementations may be able to produce an ECMAScript string directly. - 1. Let _obj_ be OrdinaryObjectCreate(%Object.prototype%). - 1. Perform ! CreateDataPropertyOrThrow(_obj_, *"result"*, _result_). - 1. Perform ! CreateDataPropertyOrThrow(_obj_, *"extra"*, _overflowBytes_). - 1. 
Return _obj_.
@@ -87,80 +48,59 @@

Uint8Array.prototype.toHex ( )

-

Uint8Array.fromBase64 ( _value_ [ , _options_ ] )

+

Uint8Array.fromBase64 ( _string_ [ , _options_ ] )

+

The standard base64 alphabet is a List whose elements are the code points corresponding to every letter and number in the Unicode Basic Latin block along with *"+"* and *"/"*; that is, it is StringToCodePoints(*"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"*).

- 1. Let _string_ be ? GetStringForBinaryEncoding(_value_). + 1. If _string_ is not a String, throw a *TypeError* exception. 1. Let _opts_ be ? GetOptionsObject(_options_). 1. Let _alphabet_ be ? Get(_opts_, *"alphabet"*). 1. If _alphabet_ is *undefined*, set _alphabet_ to *"base64"*. - 1. Set _alphabet_ to ? ToString(_alphabet_). + 1. If _alphabet_ is not a String, throw a *TypeError* exception. 1. If _alphabet_ is neither *"base64"* nor *"base64url"*, throw a *TypeError* exception. - 1. TODO: normalize whitespace / padding here. - 1. TODO: figure out what the right defaults for whitespace/padding are. - 1. Let _characters_ be StringToCodePoints(_string_). - 1. NOTE: StringToCodePoints is used only because RFC 4648 does not produce a sequence of code units. Implementations may be able to base64-decode _string_ directly. - 1. If _alphabet_ is *"base64"*, then - 1. If _characters_ cannot result from applying the base64 encoding specified in section 4 of RFC 4648 to some sequence of bytes, throw a *SyntaxError* exception. - 1. Let _bytes_ be the unique sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _characters_. + 1. Let _strict_ be ToBoolean(? Get(_opts_, *"strict"*)). + 1. NOTE: The order of validation and decoding in the algorithm below is not observable. Implementations are encouraged to perform them in whatever order is most efficient, possibly interleaving validation and stripping whitespace with decoding, as long as the behaviour is observably equivalent. + 1. Let _input_ be StringToCodePoints(_string_). + 1. If _alphabet_ is *"base64url"*, then + 1. If _input_ contains U+002B (PLUS SIGN) or U+002F (SOLIDUS), throw a *SyntaxError* exception. + 1. Replace all occurrences of U+002D (HYPHEN-MINUS) in _input_ with U+002B (PLUS SIGN). + 1. Replace all occurrences of U+005F (LOW LINE) in _input_ with U+002F (SOLIDUS). + 1. 
NOTE: When _strict_ is *false*, the algorithm below is equivalent to the forgiving-base64 decode operation in HTML. + 1. If _strict_ is *false*, then + 1. Remove all occurrences of U+0009 (TAB), U+000A (LF), U+000C (FF), U+000D (CR), and U+0020 (SPACE) from _input_. + 1. Let _inputLength_ be the length of _input_. + 1. If _inputLength_ modulo 4 is 0, then + 1. If _input_ is not empty and the last element of _input_ is U+003D (EQUALS SIGN), then + 1. Remove the last element of _input_. + 1. Set _inputLength_ to _inputLength_ - 1. + 1. If _input_ is not empty and the last element of _input_ is U+003D (EQUALS SIGN), then + 1. Remove the last element of _input_. + 1. Set _inputLength_ to _inputLength_ - 1. 1. Else, - 1. Assert: _alphabet_ is *"base64url*". - 1. If _characters_ cannot result from applying the base64url encoding specified in section 5 of RFC 4648 to some sequence of bytes, throw a *SyntaxError* exception. - 1. Let _bytes_ be the unique sequence of bytes such that applying the base64url encoding specified in section 5 of RFC 4648 to that sequence would produce _characters_. - 1. Let _resultLength_ be the number of bytes in _bytes_. - 1. Let _result_ be ? AllocateTypedArray(*"Uint8Array"*, %Uint8Array%, %Uint8Array.prototype%, _resultLength_). + 1. If _strict_ is *true*, throw a *SyntaxError* exception. + 1. If _input_ contains any elements which are not also elements of the standard base64 alphabet, throw a *SyntaxError* exception. + 1. Let _lastChunkSize_ be _inputLength_ modulo 4. + 1. If _lastChunkSize_ is 1, then + 1. Throw a *SyntaxError* exception. + 1. Else if _lastChunkSize_ is 2 or _lastChunkSize_ is 3, then + 1. Append U+0041 (LATIN CAPITAL LETTER A) to _input_ a total of (4 - _lastChunkSize_) times. + 1. Let _bytes_ be the unique sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _input_. + 1. Let _byteLength_ be the length of _bytes_. + 1. 
If _lastChunkSize_ is 2, then + 1. If _strict_ is *true* and _bytes_[_byteLength_ - 2] is not 0, throw a *SyntaxError* exception. + 1. Remove the final 2 elements of _bytes_. + 1. Else if _lastChunkSize_ is 3, then + 1. If _strict_ is *true* and _bytes_[_byteLength_ - 1] is not 0, throw a *SyntaxError* exception. + 1. Remove the final element of _bytes_. + 1. Let _result_ be ? AllocateTypedArray(*"Uint8Array"*, %Uint8Array%, %Uint8Array.prototype%, _byteLength_). 1. Set the value at each index of _result_.[[ViewedArrayBuffer]].[[ArrayBufferData]] to the value at the corresponding index of _bytes_. 1. Return _result_.
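The handling of an unpadded final chunk in the steps above can be sketched in JavaScript as follows. This is only an illustration under stated assumptions: the helper name `decodeLastChunk` is hypothetical and not part of the proposal, it covers only the standard alphabet, and it assumes the chunk has already been validated and has 2 or 3 characters.

```javascript
const alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Hypothetical helper (not from the proposal): decodes a final chunk of
// 2 or 3 already-validated base64 characters, as in the spec steps above.
function decodeLastChunk(chunk, strict) {
  // Pad with 'A' (value 0) so the chunk decodes as a full 4-character group.
  let padded = chunk + 'A'.repeat(4 - chunk.length);
  let bits = 0;
  for (let c of padded) {
    bits = (bits << 6) | alphabet.indexOf(c);
  }
  let bytes = [(bits >> 16) & 255, (bits >> 8) & 255, bits & 255];

  // A 2-character chunk carries 1 real byte, a 3-character chunk carries 2.
  // The byte right after the real ones holds any leftover overflow bits.
  let realBytes = chunk.length - 1;
  if (strict && bytes[realBytes] !== 0) {
    throw new SyntaxError('extra bits');
  }
  return bytes.slice(0, realBytes);
}

console.log(decodeLastChunk('Zg', true)); // one byte, 0x66: overflow bits are zero
console.log(decodeLastChunk('Zh', false)); // one byte, 0x66: overflow bits ignored
try {
  decodeLastChunk('Zh', true);
} catch (e) {
  console.log(e instanceof SyntaxError); // true: non-zero overflow bits rejected
}
```

For `'Zh'`, the padded group `'ZhAA'` decodes to the bytes 0x66, 0x10, 0x00; the non-zero second byte is what the strict check rejects.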
- -

Uint8Array.fromPartialBase64 ( _value_ [ , _options_ ] )

- - 1. Let _string_ be ? GetStringForBinaryEncoding(_value_). - 1. Let _opts_ be ? GetOptionsObject(_options_). - 1. Let _alphabet_ be ? Get(_opts_, *"alphabet"*). - 1. If _alphabet_ is *undefined*, set _alphabet_ to *"base64"*. - 1. Set _alphabet_ to ? ToString(_alphabet_). - 1. If _alphabet_ is neither *"base64"* nor *"base64url"*, throw a *TypeError* exception. - 1. Let _more_ be ToBoolean(? Get(_opts_, *"more"*)). - 1. Let _extra_ be ? Get(_opts_, *"extra"*). - 1. If _extra_ is neither *undefined* nor *null*, then - 1. Let _extraString_ be ? GetStringForBinaryEncoding(_extra_). - 1. Set _string_ to the list-concatenation of _extraString_ and _string_. - 1. If _more_ is *true*, then - 1. TODO: think about handling of padding on _string_ / _extra_ in this case. This currently assumes no padding on either. - 1. Let _stringLength_ be the length of _string_. - 1. Let _overflowLength_ be _stringLength_ modulo 4. - 1. Let _newLength_ be _stringLength_ - _overflowLength_. - 1. Let _overflow_ be the substring of _string_ from _newLength_. - 1. Set _string_ to the substring of _string_ from 0 to _newLength_. - 1. Else, - 1. Let _overflow_ be *null*. - 1. TODO: normalize whitespace / strip padding here. - 1. TODO: figure out what the right defaults for whitespace/padding are. - 1. Let _characters_ be StringToCodePoints(_string_). - 1. NOTE: StringToCodePoints is used only because RFC 4648 does not produce a sequence of code units. Implementations may be able to base64-decode _string_ directly. - 1. If _alphabet_ is *"base64"*, then - 1. If _characters_ cannot result from applying the base64 encoding specified in section 4 of RFC 4648 to some sequence of bytes, throw a *SyntaxError* exception. - 1. Let _bytes_ be the unique sequence of bytes such that applying the base64 encoding specified in section 4 of RFC 4648 to that sequence would produce _characters_. - 1. Else, - 1. Assert: _alphabet_ is *"base64url*". - 1. 
If _characters_ cannot result from applying the base64url encoding specified in section 5 of RFC 4648 to some sequence of bytes, throw a *SyntaxError* exception. - 1. Let _bytes_ be the unique sequence of bytes such that applying the base64url encoding specified in section 5 of RFC 4648 to that sequence would produce _characters_. - 1. Let _resultLength_ be the number of bytes in _bytes_. - 1. Let _result_ be ? AllocateTypedArray(*"Uint8Array"*, %Uint8Array%, %Uint8Array.prototype%, _resultLength_). - 1. Set the value at each index of _result_.[[ViewedArrayBuffer]].[[ArrayBufferData]] to the value at the corresponding index of _bytes_. - 1. Let _obj_ be OrdinaryObjectCreate(%Object.prototype%). - 1. Perform ! CreateDataPropertyOrThrow(_obj_, *"result"*, _result_). - 1. Perform ! CreateDataPropertyOrThrow(_obj_, *"extra"*, _overflow_). - 1. Return _obj_. - -
- -

Uint8Array.fromHex ( _value_ )

+

Uint8Array.fromHex ( _string_ )

- 1. Let _string_ be ? GetStringForBinaryEncoding(_value_). + 1. If _string_ is not a String, throw a *TypeError* exception. 1. TODO: consider stripping whitespace here. 1. Let _stringLen_ be the length of _string_. 1. If _stringLen_ modulo 2 is not 0, throw a *SyntaxError* exception. @@ -197,22 +137,6 @@

- -

- GetStringForBinaryEncoding ( - _arg_: an ECMAScript language value, - ): either a normal completion containing a String or a throw completion -

-
- - 1. If _arg_ is an Object, let _string_ be ? ToPrimitive(_arg_, ~string~); else let _string_ be _arg_. - 1. NOTE: Because `[` is not a valid base64 or hex character, the Strings returned by %Object.prototype.toString% will produce a SyntaxError during encoding. Implementations are encouraged to provide an informative error message in that situations. - 1. if _string_ is not a String, throw a TypeError exception. - 1. NOTE: The above step is included to prevent errors such as accidentally passing `null` to `fromBase64` and receiving a Uint8Array containing the bytes « 0x9e, 0xe9, 0x65 ». - 1. Return _string_. - -
- From a3eb39bd35e3401cacb7d0325f3f60ebb633c6e4 Mon Sep 17 00:00:00 2001 From: Kevin Gibbons Date: Wed, 25 Oct 2023 13:38:54 -0700 Subject: [PATCH 2/8] update ecmarkup --- package-lock.json | 17 +++++++++-------- package.json | 6 +++--- 2 files changed, 12 insertions(+), 11 deletions(-) diff --git a/package-lock.json b/package-lock.json index 0350942..1875fd5 100644 --- a/package-lock.json +++ b/package-lock.json @@ -4,9 +4,10 @@ "requires": true, "packages": { "": { + "name": "proposal-arraybuffer-base64", "dependencies": { - "@tc39/ecma262-biblio": "2.1.2553", - "ecmarkup": "^17.0.0", + "@tc39/ecma262-biblio": "2.1.2653", + "ecmarkup": "^18.0.0", "jsdom": "^21.1.1", "prismjs": "^1.29.0" } @@ -163,9 +164,9 @@ } }, "node_modules/@tc39/ecma262-biblio": { - "version": "2.1.2553", - "resolved": "https://registry.npmjs.org/@tc39/ecma262-biblio/-/ecma262-biblio-2.1.2553.tgz", - "integrity": "sha512-c2h05szLmHNnNO+7gE7mzgK3qDoZOEQrPKIIIZtY8gRpe5F2qTNIfj5tqxVYV7WUTC+/ZsR9/kojJ823hL2hvg==" + "version": "2.1.2653", + "resolved": "https://registry.npmjs.org/@tc39/ecma262-biblio/-/ecma262-biblio-2.1.2653.tgz", + "integrity": "sha512-/CIVRwkV3fTaYNxFSEvbDsTPFBNfeJOjQACZXs11+NKkbHhFh9pvr8j6NbJ/fekJLgAu6x2QXvGdA/kEGR/08g==" }, "node_modules/@tootallnate/once": { "version": "2.0.0", @@ -508,9 +509,9 @@ } }, "node_modules/ecmarkup": { - "version": "17.0.0", - "resolved": "https://registry.npmjs.org/ecmarkup/-/ecmarkup-17.0.0.tgz", - "integrity": "sha512-eQr9Vn9IPIH3rrbYEGPqfAwDJ9pg1zrOSZXc8HQwVMQ9d5tb+BsoPeKw5W1SinL09yZalcbLyqnX7rC393VRdA==", + "version": "18.0.0", + "resolved": "https://registry.npmjs.org/ecmarkup/-/ecmarkup-18.0.0.tgz", + "integrity": "sha512-VSItKQ+39dv1FeR1YbGGlJ/rx17wsPSkS7morrOCwLGHh+7ehy89hao+rQ0/ptiBAN3nbytXzwUBUTC3XNmxaA==", "dependencies": { "chalk": "^4.1.2", "command-line-args": "^5.2.0", diff --git a/package.json b/package.json index 7bae9fd..cb3be1d 100644 --- a/package.json +++ b/package.json @@ -3,14 +3,14 @@ "name": "proposal-arraybuffer-base64", 
"scripts": { "build-playground": "mkdir -p dist && cp playground/* dist && node scripts/static-highlight.js playground/index-raw.html > dist/index.html && rm dist/index-raw.html", - "build-spec": "mkdir -p dist/spec && ecmarkup --lint-spec --strict --load-biblio @tc39/ecma262-biblio --verbose --js-out dist/spec/ecmarkup.js --css-out dist/spec/ecmarkup.css spec.html dist/spec/index.html", + "build-spec": "mkdir -p dist/spec && ecmarkup --lint-spec --strict --load-biblio @tc39/ecma262-biblio --verbose spec.html --assets-dir dist/spec dist/spec/index.html", "build": "npm run build-playground && npm run build-spec", "format": "emu-format --write spec.html", "check-format": "emu-format --check spec.html" }, "dependencies": { - "@tc39/ecma262-biblio": "2.1.2553", - "ecmarkup": "^17.0.0", + "@tc39/ecma262-biblio": "2.1.2653", + "ecmarkup": "^18.0.0", "jsdom": "^21.1.1", "prismjs": "^1.29.0" } From 950cd7590fd324fb87fcd7290d670303c67d491f Mon Sep 17 00:00:00 2001 From: Kevin Gibbons Date: Wed, 25 Oct 2023 18:57:29 -0700 Subject: [PATCH 3/8] update polyfill & playground, add tests --- playground/index-raw.html | 112 +++------------- playground/polyfill-core.mjs | 222 ++++++++++++++------------------ playground/polyfill-install.mjs | 46 +------ test-polyfill.mjs | 94 ++++++++++++++ 4 files changed, 216 insertions(+), 258 deletions(-) create mode 100644 test-polyfill.mjs diff --git a/playground/index-raw.html b/playground/index-raw.html index 0d33c2b..8f7fd6f 100644 --- a/playground/index-raw.html +++ b/playground/index-raw.html @@ -48,7 +48,7 @@

Proposed Support for Base64 in JavaScript

Introduction

-

This page documents an early-stage proposal for native base64 and hex encoding and decoding for binary data in JavaScript, and includes a non-production, slightly inaccurate polyfill you can experiment with in the browser's console. Some details of the polyfill, particularly around coercion and order of observable effects, are not identical to the proposed spec text.

+

This page documents a stage-2 proposal for native base64 and hex encoding and decoding for binary data in JavaScript, and includes a non-production polyfill you can experiment with in the browser's console.

The proposal would provide methods for encoding and decoding Uint8Arrays as base64 and hex strings.

Feedback on the proposal's repository is appreciated.

@@ -84,8 +84,7 @@

Basic usage

Options

The base64 methods take an optional options bag which allows specifying the alphabet as either "base64" (the default) or "base64url" (the URL-safe variant).

-

In the future this may allow specifying arbitrary alphabets.

-

In later versions of this proposal the options bag may also allow additional options, such as specifying whether to generate / enforce padding characters and how to handle whitespace.

+

When decoding, the options bag also allows specifying strict: false (the default) or strict: true. When using strict: false, whitespace is legal and padding is optional. When using strict: true, whitespace is forbidden and standard padding (including any overflow bits in the last character being 0) is enforced - i.e., only canonical encodings are allowed.

The hex methods do not have any options.


@@ -94,103 +93,30 @@ 

Options

// '+/+/' console.log(array.toBase64({ alphabet: 'base64url' })); // '-_-_' -
-

Streaming

-

Two additional methods, toPartialBase64 and fromPartialBase64, allow encoding and decoding chunks of base64. This requires managing state, which is handled by returning a { result, extra } pair. The options bag for these methods takes two additional arguments, one which specifies whether more data is expected and one which specifies any extra values returned by a previous call.

-

These methods are intended for lower-level use and are less convenient to use.

-

Streaming versions of the hex APIs are not included since they are straightforward to do manually.

+console.log(Uint8Array.fromBase64('SGVsbG8g\nV29ybG R')); +// works, despite whitespace, missing padding, and non-zero overflow bits -

Streaming an ArrayBuffer into chunks of base64 strings:

-

-let buffer = (new Float64Array([0.1, 0.2, 0.3, 0.4])).buffer;
-let chunkSize = 6;
-let resultChunks = [];
-
-let result, extra;
-for (let offset = 0; offset < buffer.byteLength; offset += chunkSize) {
-  let length = Math.min(chunkSize, buffer.byteLength - offset);
-  let view = new Uint8Array(buffer, offset, length);
-  ({ result, extra } = view.toPartialBase64({ more: true, extra }));
-  resultChunks.push(result);
+try {
+  Uint8Array.fromBase64('SGVsbG8g\nV29ybG Q=', { strict: true });
+} catch {
+  console.log('with strict: true, whitespace is rejected');
 }
-({ result } = extra.toPartialBase64({ more: false }));
-resultChunks.push(result);
-console.log(resultChunks);
-// ['mpmZmZmZ', 'uT+amZmZ', 'mZnJPzMz', 'MzMzM9M/', 'mpmZmZmZ', '', '2T8=']
-
- -

Streaming base64 strings into Uint8Arrays:

-

-let chunks = ['mpmZmZmZuT+am', 'ZmZmZnJPzMz', 'MzMz', 'M9M/mpmZmZmZ', '2T8='];
-// individual chunks are not necessarily correctly-padded base64 strings
-
-let output = new Uint8Array(new ArrayBuffer(0, { maxByteLength: 1024 }));
-let result, extra;
-for (let c of chunks) {
-  ({ result, extra } = Uint8Array.fromPartialBase64(c, { more: true, extra }));
-  let offset = output.length;
-  let newLength = offset + result.length;
-  output.buffer.resize(newLength);
-  output.set(result, offset);
+try {
+  Uint8Array.fromBase64('SGVsbG8gV29ybGQ', { strict: true });
+} catch {
+  console.log('with strict: true, padding is required');
 }
-// if padding was optional,
-// you'd need to do a final `fromPartialBase64` call here with `more: false`
-
-console.log(new Float64Array(output.buffer));
-// Float64Array([0.1, 0.2, 0.3, 0.4])
-
- -

Note that the above snippet makes use of the Growable ArrayBuffers proposal for illustration, which not all browsers support as of this writing.

- -

A more involved example, creating a TransformStream which encodes contiguous Uint8Arrays:

-

-class BufferToStringTransformStream extends TransformStream {
-  #extra = null;
-  constructor(alphabet) {
-    super({
-      transform: (chunk, controller) => {
-        let { result, extra } = chunk.toPartialBase64({
-          alphabet,
-          extra: this.#extra,
-          more: true,
-        });
-        this.#extra = extra;
-        controller.enqueue(result);
-      },
-      flush: (controller) => {
-        if (this.#extra == null) return; // stream was empty
-        let { result } = this.#extra.toPartialBase64({ alphabet });
-        controller.enqueue(result);
-      },
-    });
-  }
+try {
+  Uint8Array.fromBase64('SGVsbG8gV29ybGR=', { strict: true });
+} catch {
+  console.log('with strict: true, non-zero overflow bits are rejected');
 }
-
-// use:
-let source = new ReadableStream({
-  start(controller) {
-    controller.enqueue(new Uint8Array([1, 2]));
-    controller.enqueue(new Uint8Array([3, 4]));
-    controller.close();
-  },
-});
-
-let chunks = [];
-let sink = new WritableStream({
-  write(chunk) {
-    chunks.push(chunk);
-  },
-  close() {
-    console.log(chunks.join('')); // 'AQIDBA=='
-  },
-});
-
-source
-  .pipeThrough(new BufferToStringTransformStream())
-  .pipeTo(sink);
 
+

Streaming

+

There is no support for streaming. However, it can be implemented in userland.
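As a rough idea of what a userland streaming encoder could look like, here is a minimal sketch. The class name `Base64ChunkEncoder` and the helper `encodeBase64` are hypothetical, and the `stream: true` option simply imitates the TextDecoder convention. The sketch assumes either a runtime with the proposed `toBase64` method or Node's `Buffer` as a fallback.

```javascript
// Hypothetical helper: use the proposed method when the runtime has it,
// otherwise fall back to Node's Buffer (an assumption made for this sketch).
function encodeBase64(bytes) {
  if (typeof bytes.toBase64 === 'function') {
    return bytes.toBase64();
  }
  return Buffer.from(bytes.buffer, bytes.byteOffset, bytes.byteLength).toString('base64');
}

// Hypothetical class: holds back up to 2 bytes between calls so that
// padding only ever appears at the very end of the output.
class Base64ChunkEncoder {
  #leftover = new Uint8Array(0);

  encode(chunk = new Uint8Array(0), { stream = false } = {}) {
    // Prepend the bytes held back from the previous call.
    let bytes = new Uint8Array(this.#leftover.length + chunk.length);
    bytes.set(this.#leftover);
    bytes.set(chunk, this.#leftover.length);

    if (!stream) {
      // Final call: encode everything, padding included.
      this.#leftover = new Uint8Array(0);
      return encodeBase64(bytes);
    }

    // Only whole 3-byte groups can be encoded without padding.
    let usable = bytes.length - (bytes.length % 3);
    this.#leftover = bytes.slice(usable);
    return encodeBase64(bytes.subarray(0, usable));
  }
}

let encoder = new Base64ChunkEncoder();
let chunks = [
  encoder.encode(Uint8Array.of(1, 2), { stream: true }),
  encoder.encode(Uint8Array.of(3, 4), { stream: true }),
  encoder.encode(), // flush the held-back byte
];
console.log(chunks.join('')); // 'AQIDBA=='
```

Copying each chunk keeps the sketch short; a more careful version would avoid the copy when nothing is held back.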

+

Thanks for reading! If you got this far, you should try out the proposal in your browser's developer tools on this page, and submit feedback on GitHub.

diff --git a/playground/polyfill-core.mjs b/playground/polyfill-core.mjs index 5756f0f..8c2c004 100644 --- a/playground/polyfill-core.mjs +++ b/playground/polyfill-core.mjs @@ -20,110 +20,106 @@ function assert(condition, message) { } } -function alphabetFromIdentifier(alphabet) { - if (alphabet === 'base64') { - return base64Characters; - } else if (alphabet === 'base64url') { - return base64UrlCharacters; - } else { - throw new TypeError('expected alphabet to be either "base64" or "base64url"'); +function getOptions(options) { + if (typeof options === 'undefined') { + return Object.create(null); + } + if (options && typeof options === 'object') { + return options; } + throw new TypeError('options is not object'); } -export function uint8ArrayToBase64(arr, alphabetIdentifier = 'base64', more = false, origExtra = null) { +export function uint8ArrayToBase64(arr, options) { checkUint8Array(arr); - let alphabet = alphabetFromIdentifier(alphabetIdentifier); - more = !!more; - if (origExtra != null) { - checkUint8Array(origExtra); - // a more efficient algorithm would avoid copying - // but writing that out is unclear / a pain - // the difference is not observable - let copy = new Uint8Array(arr.length + origExtra.length); - copy.set(origExtra); - copy.set(arr, origExtra.length); - arr = copy; + let opts = getOptions(options); + let alphabet = opts.alphabet; + if (typeof alphabet === 'undefined') { + alphabet = 'base64'; + } + if (alphabet !== 'base64' && alphabet !== 'base64url') { + throw new TypeError('expected alphabet to be either "base64" or "base64url"'); } + + let lookup = alphabet === 'base64' ? 
base64Characters : base64UrlCharacters; let result = ''; let i = 0; for (; i + 2 < arr.length; i += 3) { let triplet = (arr[i] << 16) + (arr[i + 1] << 8) + arr[i + 2]; result += - alphabet[(triplet >> 18) & 63] + - alphabet[(triplet >> 12) & 63] + - alphabet[(triplet >> 6) & 63] + - alphabet[triplet & 63]; - } - if (more) { - let extra = arr.slice(i); // TODO should this be a view, or a copy? - return { result, extra }; - } else { - if (i + 2 === arr.length) { - let triplet = (arr[i] << 16) + (arr[i + 1] << 8); - result += - alphabet[(triplet >> 18) & 63] + - alphabet[(triplet >> 12) & 63] + - alphabet[(triplet >> 6) & 63] + - '='; - } else if (i + 1 === arr.length) { - let triplet = arr[i] << 16; - result += - alphabet[(triplet >> 18) & 63] + - alphabet[(triplet >> 12) & 63] + - '=='; - } - return { result, extra: null }; + lookup[(triplet >> 18) & 63] + + lookup[(triplet >> 12) & 63] + + lookup[(triplet >> 6) & 63] + + lookup[triplet & 63]; } + if (i + 2 === arr.length) { + let triplet = (arr[i] << 16) + (arr[i + 1] << 8); + result += + lookup[(triplet >> 18) & 63] + + lookup[(triplet >> 12) & 63] + + lookup[(triplet >> 6) & 63] + + '='; + } else if (i + 1 === arr.length) { + let triplet = arr[i] << 16; + result += + lookup[(triplet >> 18) & 63] + + lookup[(triplet >> 12) & 63] + + '=='; + } + return result; } -export function base64ToUint8Array(str, alphabetIdentifier = 'base64', more = false, origExtra = null) { - if (typeof str !== 'string') { - throw new TypeError('expected str to be a string'); +export function base64ToUint8Array(string, options) { + if (typeof string !== 'string') { + throw new TypeError('expected input to be a string'); } - let alphabet = alphabetFromIdentifier(alphabetIdentifier); - more = !!more; - if (origExtra != null) { - if (typeof origExtra !== 'string') { - throw new TypeError('expected extra to be a string'); - } - str = origExtra + str; - } - let map = new Map(alphabet.split('').map((c, i) => [c, i])); - - let extra; - if (more) 
{ - let padding = str.length % 4; - if (padding === 0) { - extra = ''; - } else { - extra = str.slice(-padding); - str = str.slice(0, -padding) - } - } else { - // todo opt-in optional padding - if (str.length % 4 !== 0) { - throw new Error('not correctly padded'); + let opts = getOptions(options); + let alphabet = opts.alphabet; + if (typeof alphabet === 'undefined') { + alphabet = 'base64'; + } + if (alphabet !== 'base64' && alphabet !== 'base64url') { + throw new TypeError('expected alphabet to be either "base64" or "base64url"'); + } + let strict = !!opts.strict; + let input = string; + + if (!strict) { + input = input.replaceAll(/[\u0009\u000A\u000C\u000D\u0020]/g, ''); + } + if (input.length % 4 === 0) { + if (input.length > 0 && input.at(-1) === '=') { + input = input.slice(0, -1); + if (input.length > 0 && input.at(-1) === '=') { + input = input.slice(0, -1); + } } - extra = null; + } else if (strict) { + throw new SyntaxError('not correctly padded'); } - assert(str.length % 4 === 0, 'str.length % 4 === 0'); - if (str.endsWith('==')) { - str = str.slice(0, -2); - } else if (str.endsWith('=')) { - str = str.slice(0, -1); + + let map = new Map((alphabet === 'base64' ? 
base64Characters : base64UrlCharacters).split('').map((c, i) => [c, i])); + if ([...input].some(c => !map.has(c))) { + let bad = [...input].filter(c => !map.has(c)); + throw new SyntaxError(`contains illegal character(s) ${JSON.stringify(bad)}`); + } + + let lastChunkSize = input.length % 4; + if (lastChunkSize === 1) { + throw new SyntaxError('bad length'); + } else if (lastChunkSize === 2 || lastChunkSize === 3) { + input += 'A'.repeat(4 - lastChunkSize); } + assert(input.length % 4 === 0); let result = []; let i = 0; - for (; i + 3 < str.length; i += 4) { - let c1 = str[i]; - let c2 = str[i + 1]; - let c3 = str[i + 2]; - let c4 = str[i + 3]; - if ([c1, c2, c3, c4].some(c => !map.has(c))) { - throw new Error('bad character'); - } + for (; i < input.length; i += 4) { + let c1 = input[i]; + let c2 = input[i + 1]; + let c3 = input[i + 2]; + let c4 = input[i + 3]; let triplet = (map.get(c1) << 18) + (map.get(c2) << 12) + @@ -136,42 +132,20 @@ export function base64ToUint8Array(str, alphabetIdentifier = 'base64', more = fa triplet & 255 ); } - // TODO if we want to be _really_ pedantic, following the RFC, we should enforce the extra 2-4 bits are 0 - if (i + 2 === str.length) { - // the `==` case - let c1 = str[i]; - let c2 = str[i + 1]; - if ([c1, c2].some(c => !map.has(c))) { - throw new Error('bad character'); + + if (lastChunkSize === 2) { + if (strict && result.at(-2) !== 0) { + throw new SyntaxError('extra bits'); } - let triplet = - (map.get(c1) << 18) + - (map.get(c2) << 12); - result.push((triplet >> 16) & 255); - } else if (i + 3 === str.length) { - // the `=` case - let c1 = str[i]; - let c2 = str[i + 1]; - let c3 = str[i + 2]; - if ([c1, c2, c3].some(c => !map.has(c))) { - throw new Error('bad character'); + result.splice(-2, 2); + } else if (lastChunkSize === 3) { + if (strict && result.at(-1) !== 0) { + throw new SyntaxError('extra bits'); } - let triplet = - (map.get(c1) << 18) + - (map.get(c2) << 12) + - (map.get(c3) << 6); - result.push( - (triplet >> 
16) & 255, - (triplet >> 8) & 255, - ); - } else { - assert(i === str.length); + result.pop(); } - return { - result: new Uint8Array(result), - extra, - }; + return new Uint8Array(result); } export function uint8ArrayToHex(arr) { @@ -183,19 +157,19 @@ export function uint8ArrayToHex(arr) { return out; } -export function hexToUint8Array(str) { - if (typeof str !== 'string') { - throw new TypeError('expected str to be a string'); +export function hexToUint8Array(string) { + if (typeof string !== 'string') { + throw new TypeError('expected string to be a string'); } - if (str.length % 2 !== 0) { - throw new SyntaxError('str should be an even number of characters'); + if (string.length % 2 !== 0) { + throw new SyntaxError('string should be an even number of characters'); } - if (/[^0-9a-zA-Z]/.test(str)) { - throw new SyntaxError('str should only contain hex characters'); + if (/[^0-9a-fA-F]/.test(string)) { + throw new SyntaxError('string should only contain hex characters'); } - let out = new Uint8Array(str.length / 2); + let out = new Uint8Array(string.length / 2); for (let i = 0; i < out.length; ++i) { - out[i] = parseInt(str.slice(i * 2, i * 2 + 2), 16); + out[i] = parseInt(string.slice(i * 2, i * 2 + 2), 16); } return out; } diff --git a/playground/polyfill-install.mjs b/playground/polyfill-install.mjs index 703f632..84edb8e 100644 --- a/playground/polyfill-install.mjs +++ b/playground/polyfill-install.mjs @@ -1,53 +1,17 @@ -import { checkUint8Array, uint8ArrayToBase64, base64ToUint8Array, uint8ArrayToHex, hexToUint8Array } from './polyfill-core.mjs'; +import { uint8ArrayToBase64, base64ToUint8Array, uint8ArrayToHex, hexToUint8Array } from './polyfill-core.mjs'; -Uint8Array.prototype.toBase64 = function (opts) { - checkUint8Array(this); - let alphabet; - if (opts && typeof opts === 'object') { - 0, { alphabet } = opts; - } - return uint8ArrayToBase64(this, alphabet).result; +Uint8Array.prototype.toBase64 = function (options) { + return uint8ArrayToBase64(this, 
options); }; -Uint8Array.prototype.toPartialBase64 = function (opts) { - checkUint8Array(this); - let alphabet, more, extra; - if (opts && typeof opts === 'object') { - 0, { alphabet, more, extra } = opts; - } - return uint8ArrayToBase64(this, alphabet, more, extra); -}; - -Uint8Array.fromBase64 = function (string, opts) { - if (typeof string !== 'string') { - throw new Error('expected argument to be a string'); - } - let alphabet; - if (opts && typeof opts === 'object') { - 0, { alphabet } = opts; - } - return base64ToUint8Array(string, alphabet).result; -}; - -Uint8Array.fromPartialBase64 = function (string, opts) { - if (typeof string !== 'string') { - throw new Error('expected argument to be a string'); - } - let alphabet, more, extra; - if (opts && typeof opts === 'object') { - 0, { alphabet, more, extra } = opts; - } - return base64ToUint8Array(string, alphabet, more, extra); +Uint8Array.fromBase64 = function (string, options) { + return base64ToUint8Array(string, options); }; Uint8Array.prototype.toHex = function () { - checkUint8Array(this); return uint8ArrayToHex(this); }; Uint8Array.fromHex = function (string) { - if (typeof string !== 'string') { - throw new Error('expected argument to be a string'); - } return hexToUint8Array(string); }; diff --git a/test-polyfill.mjs b/test-polyfill.mjs new file mode 100644 index 0000000..0eef29d --- /dev/null +++ b/test-polyfill.mjs @@ -0,0 +1,94 @@ +import test from 'node:test'; +import assert from 'node:assert'; + +import { + uint8ArrayToBase64, + base64ToUint8Array, + uint8ArrayToHex, + hexToUint8Array, +} from './playground/polyfill-core.mjs'; + +let stringToBytes = str => new TextEncoder().encode(str); + +// https://datatracker.ietf.org/doc/html/rfc4648#section-10 +let standardBase64Vectors = [ + ['', ''], + ['f', 'Zg=='], + ['fo', 'Zm8='], + ['foo', 'Zm9v'], + ['foob', 'Zm9vYg=='], + ['fooba', 'Zm9vYmE='], + ['foobar', 'Zm9vYmFy'], +]; +test('standard vectors', async t => { + for (let [string, result] of 
standardBase64Vectors) { + await t.test(JSON.stringify(string), () => { + assert.strictEqual(uint8ArrayToBase64(stringToBytes(string)), result); + + assert.deepStrictEqual(base64ToUint8Array(result), stringToBytes(string)); + assert.deepStrictEqual(base64ToUint8Array(result, { strict: true }), stringToBytes(string)); + }); + } +}); + +let malformedPadding = ['=', 'Zg=', 'Z===', 'Zm8==', 'Zm9v=']; +test('malformed padding', async t => { + for (let string of malformedPadding) { + await t.test(JSON.stringify(string), () => { + assert.throws(() => base64ToUint8Array(string), SyntaxError); + assert.throws(() => base64ToUint8Array(string, { strict: true }), SyntaxError); + }); + } +}); + +let illegal = [ + 'Zm.9v', + 'Zm9v^', + 'Zg==&', + 'Z−==', // U+2212 'Minus Sign' + 'Z＋==', // U+FF0B 'Fullwidth Plus Sign' + 'Zg\u00A0==', // nbsp + 'Zg\u2009==', // thin space + 'Zg\u2028==', // line separator +]; +test('illegal characters', async t => { + for (let string of illegal) { + await t.test(JSON.stringify(string), () => { + assert.throws(() => base64ToUint8Array(string), SyntaxError); + assert.throws(() => base64ToUint8Array(string, { strict: true }), SyntaxError); + }); + } +}); + +let onlyNonStrict = [ + ['Zg', Uint8Array.of(0x66)], + ['Zh', Uint8Array.of(0x66)], + ['Zh==', Uint8Array.of(0x66)], + ['Zm8', Uint8Array.of(0x66, 0x6f)], + ['Zm9', Uint8Array.of(0x66, 0x6f)], + ['Zm9=', Uint8Array.of(0x66, 0x6f)], +]; +test('only valid in non-strict', async t => { + for (let [encoded, decoded] of onlyNonStrict) { + await t.test(JSON.stringify(encoded), () => { + assert.deepStrictEqual(base64ToUint8Array(encoded), decoded); + assert.throws(() => base64ToUint8Array(encoded, { strict: true }), SyntaxError); + }); + } +}); + +test('alphabet-specific strings', async t => { + let standardOnly = 'x+/y'; + await t.test(JSON.stringify(standardOnly), () => { + assert.deepStrictEqual(base64ToUint8Array(standardOnly), Uint8Array.of(0xc7, 0xef, 0xf2)); + 
assert.deepStrictEqual(base64ToUint8Array(standardOnly, { alphabet: 'base64' }), Uint8Array.of(0xc7, 0xef, 0xf2)); + assert.throws(() => base64ToUint8Array(standardOnly, { alphabet: 'base64url' }), SyntaxError); + }); + + let urlOnly = 'x-_y'; + await t.test(JSON.stringify(urlOnly), () => { + assert.deepStrictEqual(base64ToUint8Array(urlOnly, { alphabet: 'base64url' }), Uint8Array.of(0xc7, 0xef, 0xf2)); + assert.throws(() => base64ToUint8Array(urlOnly), SyntaxError); + assert.throws(() => base64ToUint8Array(urlOnly, { alphabet: 'base64' }), SyntaxError); + }); +}); From b169fea953415ae3326f7d38a9cbb5c93cc45823 Mon Sep 17 00:00:00 2001 From: Kevin Gibbons Date: Wed, 25 Oct 2023 23:02:58 -0700 Subject: [PATCH 4/8] add streaming implementation --- stream.mjs | 121 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 121 insertions(+) create mode 100644 stream.mjs diff --git a/stream.mjs b/stream.mjs new file mode 100644 index 0000000..c146ff5 --- /dev/null +++ b/stream.mjs @@ -0,0 +1,121 @@ +import './playground/polyfill-install.mjs'; + +let whitespace = new Set(['\u0009', '\u000A', '\u000C', '\u000D', '\u0020']); + +// This mirrors the somewhat awkward TextDecoder API. +// Better designs are of course possible. +class Base64Decoder { + #options; + #extra; + constructor(options) { + this.#options = options; + this.#extra = ''; + } + + decode(chunk = '', options = {}) { + let stream = options.stream ?? 
false; + chunk = this.#extra + chunk; + this.#extra = ''; + + if (!stream) { + return Uint8Array.fromBase64(chunk, this.#options); + } + + let realCharacterCount = 0; + let hasWhitespace = false; + for (let i = 0; i < chunk.length; ++i) { + if (whitespace.has(chunk[i])) { + hasWhitespace = true; + } else { + ++realCharacterCount; + } + } + + // requires 1 additional pass over `chunk`, plus one additional copy of `chunk` + let extraCharacterCount = realCharacterCount % 4; + if (extraCharacterCount !== 0) { + if (!hasWhitespace) { + this.#extra = chunk.slice(-extraCharacterCount); + chunk = chunk.slice(0, -extraCharacterCount); + } else { + // need to do a bit more work to figure out where to slice + let collected = 0; + let i = chunk.length - 1; + while (true) { + if (!whitespace.has(chunk[i])) { + ++collected; + if (collected === extraCharacterCount) { + break; + } + } + --i; + } + this.#extra = chunk.slice(i); + chunk = chunk.slice(0, i); + } + } + + return Uint8Array.fromBase64(chunk, this.#options); + } +} + + +class Base64Encoder { + #options; + #extra; + #extraLength; + constructor(options) { + this.#options = options; + this.#extra = new Uint8Array(3); + this.#extraLength = 0; + } + + // partly derived from https://github.com/lucacasonato/base64_streams/blob/main/src/iterator/encoder.ts + encode(chunk = Uint8Array.of(), options = {}) { + let stream = options.stream ?? 
false; + + if (this.#extraLength > 0) { + let bytesNeeded = 3 - this.#extraLength; + let bytesAvailable = Math.min(bytesNeeded, chunk.length); + this.#extra.set(chunk.subarray(0, bytesAvailable), this.#extraLength); + chunk = chunk.subarray(bytesAvailable); + this.#extraLength += bytesAvailable; + } + + if (!stream) { + // assert: this.#extraLength === 0 || this.#extraLength === 3 || chunk.length === 0 + let prefix = this.#extra.subarray(0, this.#extraLength).toBase64(); + this.#extraLength = 0; + return prefix + chunk.toBase64(); + } + + let extraReturn = ''; + + if (this.#extraLength === 3) { + extraReturn = this.#extra.toBase64(); + this.#extraLength = 0; + } + + let remainder = chunk.length % 3; + if (remainder > 0) { + this.#extra.set(chunk.subarray(chunk.length - remainder)); + this.#extraLength = remainder; + chunk = chunk.subarray(0, chunk.length - remainder); + } + + return extraReturn + chunk.toBase64(); + } +} + +let decoder = new Base64Decoder(); + +console.log(decoder.decode('SG Vsb', { stream: true })); +console.log(decoder.decode('G8gV29ybGR', { stream: true })); +console.log(decoder.decode()); + + +let encoder = new Base64Encoder(); + +console.log(encoder.encode(Uint8Array.of(72, 101, 108, 108, 111), { stream: true })); +console.log(encoder.encode(Uint8Array.of(32, 87, 111, 114, 108, 100), { stream: true })); +console.log(encoder.encode()); From 0e4fc1ec02373d8e5a651d2036833a70f01a2cbc Mon Sep 17 00:00:00 2001 From: Kevin Gibbons Date: Wed, 25 Oct 2023 23:21:33 -0700 Subject: [PATCH 5/8] update readme and FAQ --- README.md | 62 +++++++++++++++++++++++++++++++-------- base64.md | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 138 insertions(+), 12 deletions(-) create mode 100644 base64.md diff --git a/README.md b/README.md index 1b27747..b7e2ef7 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ It is currently at Stage 2 of [the TC39 process](https://tc39.es/process-documen Try it out on [the 
playground](https://tc39.github.io/proposal-arraybuffer-base64/). -Initial spec text is available [here](https://tc39.github.io/proposal-arraybuffer-base64/spec/). +Spec text is available [here](https://tc39.github.io/proposal-arraybuffer-base64/spec/). ## Basic API @@ -32,27 +32,65 @@ This would add `Uint8Array.prototype.toBase64`/`Uint8Array.prototype.toHex` and ## Options -An options bag argument for the base64 methods could allow specifying additional details such as the alphabet (to include at least `base64` and `base64url`), whether to generate / enforce padding, and how to handle whitespace. +An options bag argument for the base64 methods allows specifying the alphabet as either `base64` or `base64url`. -## Streaming API +When decoding, the options bag also allows specifying `strict: false` (the default) or `strict: true`. When using `strict: false`, whitespace is legal and padding is optional. When using `strict: true`, whitespace is forbidden and standard padding (including any overflow bits in the last character being 0) is enforced - i.e., only [canonical](https://datatracker.ietf.org/doc/html/rfc4648#section-3.5) encodings are allowed. -Additional `toPartialBase64` and `fromPartialBase64` methods would allow working with chunks of base64, at the cost of more complexity. See [the playground](https://tc39.github.io/proposal-arraybuffer-base64/) linked above for examples. +## Streaming -Streaming versions of the hex APIs are not included since they are straightforward to do manually. +There is no support for streaming. However, it is [relatively straightforward to do efficiently in userland](./stream.mjs) on top of this API, with support for all the same options as the underlying functions. -See [issue #13](https://github.com/tc39/proposal-arraybuffer-base64/issues/13) for discussion. +## FAQ -## Questions +### What variation exists among base64 implementations in standards, in other languages, and in existing JavaScript libraries? 
-### Should these be asynchronous? +I have a [whole page on that](./base64.md), with tables and footnotes and everything. There is relatively little room for variation, but languages and libraries manage to explore almost all of the room there is. -In practice most base64'd data I encounter is on the order of hundreds of bytes (e.g. SSH keys), which can be encoded and decoded extremely quickly. It would be a shame to require Promises to deal with such data, I think, especially given that the alternatives people currently use all appear to be synchronous. +To summarize, base64 encoders can vary in the following ways: + +- Standard or URL-safe alphabet +- Whether `=` is included in output +- Whether to add linebreaks after a certain number of characters + +and decoders can vary in the following ways: + +- Standard or URL-safe alphabet +- Whether `=` is required in input, and how to handle malformed padding (e.g. extra `=`) +- Whether to fail on non-zero padding bits +- Whether lines must be of a limited length +- How non-base64-alphabet characters are handled (sometimes with special handling for only a subset, like whitespace) + +### What alphabets are supported? + +For base64, you can specify either base64 or base64url for both the encoder and the decoder. + +For hex, both lowercase and uppercase characters (including mixed within the same string) will decode successfully. Output is always lowercase. + +### How is `=` padding handled? + +Padding is always generated. The base64 decoder does not require it to be present unless `strict: true` is specified; however, if it is present, it must be well-formed (i.e., once stripped of whitespace the length of the string must be a multiple of 4, and there can be 1 or 2 padding `=` characters). -Possibly we should have asynchronous versions for working with large data. That is not currently included. For the moment you can use the streaming API to chunk the work. +### How are the extra padding bits handled? 
+ +If the length of your input data isn't exactly a multiple of 3 bytes, then encoding it will use either 2 or 3 base64 characters to encode the final 1 or 2 bytes. Since each base64 character is 6 bits, this means you'll be using either 12 or 18 bits to represent 8 or 16 bits, which means you have an extra 4 or 2 bits which don't encode anything. + +Per [the RFC](https://datatracker.ietf.org/doc/html/rfc4648#section-3.5), decoders MAY reject input strings where the padding bits are non-zero. Here, non-zero padding bits are silently ignored when `strict: false` (the default), and are an error when `strict: true`. + +### How is whitespace handled? + +The encoders do not output whitespace. The hex decoder does not allow it as input. The base64 decoder allows [ASCII whitespace](https://infra.spec.whatwg.org/#ascii-whitespace) anywhere in the string as long as `strict: true` is not specified. + +### How are other characters handled? + +The presence of any other characters causes an exception. + +### Why are these synchronous? + +In practice most base64'd data I encounter is on the order of hundreds of bytes (e.g. SSH keys), which can be encoded and decoded extremely quickly. It would be a shame to require Promises to deal with such data, I think, especially given that the alternatives people currently use all appear to be synchronous. -### What other encodings should be included, if any? +### Why just these encodings? -I think base64 and hex are the only encodings which make sense, and those are currently included. +While other string encodings exist, none are nearly as commonly used as these two. See issues [#7](https://github.com/tc39/proposal-arraybuffer-base64/issues/7), [#8](https://github.com/tc39/proposal-arraybuffer-base64/issues/8), and [#11](https://github.com/tc39/proposal-arraybuffer-base64/issues/11). 
diff --git a/base64.md b/base64.md new file mode 100644 index 0000000..a31cca4 --- /dev/null +++ b/base64.md @@ -0,0 +1,88 @@ +# Notes on Base64 as it exists + +Towards an implementation in JavaScript. + +## The RFCs + +There are two RFCs which are still generally relevant in modern times: [4648](https://datatracker.ietf.org/doc/html/rfc4648), which defines only the base64 and base64url encodings, and [2045](https://datatracker.ietf.org/doc/html/rfc2045#section-6.8), which defines [MIME](https://en.wikipedia.org/wiki/MIME) and includes a base64 encoding. + +RFC 4648 is "the base64 RFC". It obsoletes RFC [3548](https://datatracker.ietf.org/doc/html/rfc3548). + +- It defines both the standard (`+/`) and url-safe (`-_`) alphabets. +- "Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise." Certain malformed padding MAY be ignored. +- "Decoders MAY chose to reject an encoding if the pad bits have not been set to zero" +- "Implementations MUST reject the encoded data if it contains characters outside the base alphabet when interpreting base-encoded data, unless the specification referring to this document explicitly states otherwise." + +RFC 2045 is not usually relevant, but it's worth summarizing its behavior anyway: + +- Only the standard (`+/`) alphabet is supported. +- It defines only an encoding. The encoding is specified to include `=`. No direction is given for decoders which encounter data which is not padded with `=`, or which has non-zero padding bits. In practice, decoders seem to ignore both. +- "Any characters outside of the base64 alphabet are to be ignored in base64-encoded data." +- MIME requires lines of length at most 76 characters, separated by CRLF. 
+ +RFCs [1421](https://datatracker.ietf.org/doc/html/rfc1421) and [7468](https://datatracker.ietf.org/doc/html/rfc7468), which define "Privacy-Enhanced Mail" and related things (think `-----BEGIN PRIVATE KEY-----`), are basically identical to the above except that they mandate lines of exactly 64 characters (the last line may be shorter). + +RFC [4880](https://datatracker.ietf.org/doc/html/rfc4880#section-6) defines OpenPGP messages and is just the RFC 2045 format plus a checksum. In practice, only whitespace is ignored, not all non-base64 characters. + +No other variations are contemplated in any other RFC or implementation that I'm aware of. That is, we have the following ways that base64 encoders can vary: + +- Standard or URL-safe alphabet +- Whether `=` is included in output +- Whether to add linebreaks after a certain number of characters + +and the following ways that base64 decoders can vary: + +- Standard or URL-safe alphabet +- Whether `=` is required in input, and how to handle malformed padding (e.g. extra `=`) +- Whether to fail on non-zero padding bits +- Whether lines must be of a limited length +- How non-base64-alphabet characters are handled (sometimes with special handling for only a subset, like whitespace) + +## Programming languages + +Note that neither C++ nor Rust has built-in base64 support. In C++ the Boost library is quite common in large projects and parts sometimes get pulled into the standard library, and in Rust the [base64 crate](https://docs.rs/base64/latest/base64/) is the clear choice of everyone, so I'm mentioning those as well. + +"✅ / ⚙️" means the default is yes but it's configurable. A bare "⚙️" means it's configurable and there is no default. 
+ +| | supports urlsafe | `=`s in output | whitespace in output | can omit `=`s in input | can have non-zero padding bits | can have arbitrary characters in input | can have whitespace in input | +| ------------------- | ---------------- | -------------- | -------------------- | ---------------------- | ------------------------------ | -------------------------------------- | ---------------------------- | +| C++ (Boost) | ❌ | ❌ | ❌ | ?[^cpp] | ? | ❌ | ❌ | +| Ruby | ✅ | ✅ / ⚙️[^ruby] | ✅ / ⚙️[^ruby2] | ✅ / ⚙️ | ✅ / ⚙️ | ❌ | ✅ / ⚙️ | +| Python | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ / ⚙️ | ✅ / ⚙️ | +| Rust (base64 crate) | ✅ | ⚙️ | ❌ | ⚙️ | ⚙️ | ❌ | ❌ | +| Java | ✅ | ✅ / ⚙️ | ❌ / ⚙️[^java] | ✅ | ✅ | ❌ | ❌ / ⚙️ | +| Go | ✅ | ✅ | ❌ | ✅ / ⚙️ | ✅ / ⚙️ | ❌ | ✅[^go] | +| C# | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ | +| PHP | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ / ⚙️ | ✅ / ⚙️ | +| Swift | ❌ | ✅ | ❌ / ⚙️ | ❌ | ✅ | ❌ / ⚙️ | ❌ / ⚙️ | + +[^cpp]: Boost adds extra null bytes to the output when padding is present, and treats non-zero padding bits as meaningful (i.e. it produces more output when they are present) +[^ruby]: Ruby only allows configuring padding with the urlsafe alphabet +[^ruby2]: Ruby adds linebreaks every 60 characters +[^java]: Java allows MIME-format output, with `\r\n` sequences after every 76 characters of output +[^go]: Go only allows linebreaks specifically + +## JS libraries + +Only including libraries with at least a million downloads per week and at least 100 distinct dependents. 
+ +| | supports urlsafe | `=`s in output | whitespace in output | can omit `=`s in input | can have non-zero padding bits | can have arbitrary characters in input | can have whitespace in input | +| --------------------------- | ----------------- | -------------- | -------------------- | ---------------------- | ------------------------------ | -------------------------------------- | ---------------------------- | +| `atob`/`btoa` | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | +| Node's Buffer | ✅[^node] | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | +| base64-js (38m/wk) | ✅ (for decoding) | ✅ | ❌ | ❌ | ✅ | ❌[^base64-js] | ❌ | +| @smithy/util-base64 (8m/wk) | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | +| crypto-js (6m/wk) | ✅ | ✅ | ❌ | ✅ | ✅ | ❌[^crypto-js] | ❌ | +| js-base64 (5m/wk) | ✅ | ✅ / ⚙️ | ❌ | ✅ | ✅ | ✅ | ✅ | +| base64-arraybuffer (4m/wk) | ❌ | ✅ | ❌ | ✅ | ✅ | ❌[^base64-arraybuffer] | ❌ | +| base64url (2m/wk) | ✅ | ❌ / ⚙️ | ❌ | ✅ | ✅ | ✅ | ✅ | +| base-64 (2m/wk) | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | + +[^node]: Node allows mixing alphabets within the same string in input +[^base64-js]: Illegal characters are interpreted as `A` +[^crypto-js]: Illegal characters are interpreted as `A` +[^base64-arraybuffer]: Illegal characters are interpreted as `A` + +## "Whitespace" + +In all of the above, "whitespace" means only _ASCII_ whitespace. I don't think anyone has special handling for Unicode but non-ASCII whitespace. From 7d1b5cc69a7338ed4fef0000ba49d0867961ee66 Mon Sep 17 00:00:00 2001 From: Kevin Gibbons Date: Sat, 28 Oct 2023 17:53:46 -0700 Subject: [PATCH 6/8] address comments --- playground/index-raw.html | 2 +- spec.html | 12 ++++++++---- test-polyfill.mjs | 2 +- 3 files changed, 10 insertions(+), 6 deletions(-) diff --git a/playground/index-raw.html b/playground/index-raw.html index 8f7fd6f..7817f90 100644 --- a/playground/index-raw.html +++ b/playground/index-raw.html @@ -115,7 +115,7 @@

Options

Streaming

-

There is no support for streaming. However, it can be implemented in userland.

+

There is no support for streaming. However, it can be implemented in userland.