diff --git a/README.md b/README.md index 1b27747..b7e2ef7 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ It is currently at Stage 2 of [the TC39 process](https://tc39.es/process-documen Try it out on [the playground](https://tc39.github.io/proposal-arraybuffer-base64/). -Initial spec text is available [here](https://tc39.github.io/proposal-arraybuffer-base64/spec/). +Spec text is available [here](https://tc39.github.io/proposal-arraybuffer-base64/spec/). ## Basic API @@ -32,27 +32,65 @@ This would add `Uint8Array.prototype.toBase64`/`Uint8Array.prototype.toHex` and ## Options -An options bag argument for the base64 methods could allow specifying additional details such as the alphabet (to include at least `base64` and `base64url`), whether to generate / enforce padding, and how to handle whitespace. +An options bag argument for the base64 methods allows specifying the alphabet as either `base64` or `base64url`. -## Streaming API +When decoding, the options bag also allows specifying `strict: false` (the default) or `strict: true`. When using `strict: false`, whitespace is legal and padding is optional. When using `strict: true`, whitespace is forbidden and standard padding (including any overflow bits in the last character being 0) is enforced - i.e., only [canonical](https://datatracker.ietf.org/doc/html/rfc4648#section-3.5) encodings are allowed. -Additional `toPartialBase64` and `fromPartialBase64` methods would allow working with chunks of base64, at the cost of more complexity. See [the playground](https://tc39.github.io/proposal-arraybuffer-base64/) linked above for examples. +## Streaming -Streaming versions of the hex APIs are not included since they are straightforward to do manually. +There is no support for streaming. However, it is [relatively straightforward to do efficiently in userland](./stream.mjs) on top of this API, with support for all the same options as the underlying functions. 
-See [issue #13](https://github.com/tc39/proposal-arraybuffer-base64/issues/13) for discussion. +## FAQ -## Questions +### What variation exists among base64 implementations in standards, in other languages, and in existing JavaScript libraries? -### Should these be asynchronous? +I have a [whole page on that](./base64.md), with tables and footnotes and everything. There is relatively little room for variation, but languages and libraries manage to explore almost all of the room there is. -In practice most base64'd data I encounter is on the order of hundreds of bytes (e.g. SSH keys), which can be encoded and decoded extremely quickly. It would be a shame to require Promises to deal with such data, I think, especially given that the alternatives people currently use all appear to be synchronous. +To summarize, base64 encoders can vary in the following ways: + +- Standard or URL-safe alphabet +- Whether `=` is included in output +- Whether to add linebreaks after a certain number of characters + +and decoders can vary in the following ways: + +- Standard or URL-safe alphabet +- Whether `=` is required in input, and how to handle malformed padding (e.g. extra `=`) +- Whether to fail on non-zero padding bits +- Whether lines must be of a limited length +- How non-base64-alphabet characters are handled (sometimes with special handling for only a subset, like whitespace) + +### What alphabets are supported? + +For base64, you can specify either base64 or base64url for both the encoder and the decoder. + +For hex, both lowercase and uppercase characters (including mixed within the same string) will decode successfully. Output is always lowercase. + +### How is `=` padding handled? + +Padding is always generated. 
The base64 decoder does not require it to be present unless `strict: true` is specified; however, if it is present, it must be well-formed (i.e., once stripped of whitespace the length of the string must be a multiple of 4, and there can be 1 or 2 padding `=` characters). -Possibly we should have asynchronous versions for working with large data. That is not currently included. For the moment you can use the streaming API to chunk the work. +### How are the extra padding bits handled? + +If the length of your input data isn't exactly a multiple of 3 bytes, then encoding it will use either 2 or 3 base64 characters to encode the final 1 or 2 bytes. Since each base64 character is 6 bits, this means you'll be using either 12 or 18 bits to represent 8 or 16 bits, which means you have an extra 4 or 2 bits which don't encode anything. + +Per [the RFC](https://datatracker.ietf.org/doc/html/rfc4648#section-3.5), decoders MAY reject input strings where the padding bits are non-zero. Here, non-zero padding bits are silently ignored when `strict: false` (the default), and are an error when `strict: true`. + +### How is whitespace handled? + +The encoders do not output whitespace. The hex decoder does not allow it as input. The base64 decoder allows [ASCII whitespace](https://infra.spec.whatwg.org/#ascii-whitespace) anywhere in the string as long as `strict: true` is not specified. + +### How are other characters handled? + +The presence of any other characters causes an exception. + +### Why are these synchronous? + +In practice most base64'd data I encounter is on the order of hundreds of bytes (e.g. SSH keys), which can be encoded and decoded extremely quickly. It would be a shame to require Promises to deal with such data, I think, especially given that the alternatives people currently use all appear to be synchronous. -### What other encodings should be included, if any? +### Why just these encodings? 
-I think base64 and hex are the only encodings which make sense, and those are currently included. +While other string encodings exist, none are nearly as commonly used as these two. See issues [#7](https://github.com/tc39/proposal-arraybuffer-base64/issues/7), [#8](https://github.com/tc39/proposal-arraybuffer-base64/issues/8), and [#11](https://github.com/tc39/proposal-arraybuffer-base64/issues/11). diff --git a/base64.md b/base64.md new file mode 100644 index 0000000..a31cca4 --- /dev/null +++ b/base64.md @@ -0,0 +1,88 @@ +# Notes on Base64 as it exists + +Towards an implementation in JavaScript. + +## The RFCs + +There are two RFCs which are still generally relevant in modern times: [4648](https://datatracker.ietf.org/doc/html/rfc4648), which defines only the base64 and base64url encodings, and [2045](https://datatracker.ietf.org/doc/html/rfc2045#section-6.8), which defines [MIME](https://en.wikipedia.org/wiki/MIME) and includes a base64 encoding. + +RFC 4648 is "the base64 RFC". It obsoletes RFC [3548](https://datatracker.ietf.org/doc/html/rfc3548). + +- It defines both the standard (`+/`) and url-safe (`-_`) alphabets. +- "Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise." Certain malformed padding MAY be ignored. +- "Decoders MAY chose to reject an encoding if the pad bits have not been set to zero" +- "Implementations MUST reject the encoded data if it contains characters outside the base alphabet when interpreting base-encoded data, unless the specification referring to this document explicitly states otherwise." + +RFC 2045 is not usually relevant, but it's worth summarizing its behavior anyway: + +- Only the standard (`+/`) alphabet is supported. +- It defines only an encoding. The encoding is specified to include `=`. No direction is given for decoders which encounter data which is not padded with `=`, or which has non-zero padding bits. 
In practice, decoders seem to ignore both. +- "Any characters outside of the base64 alphabet are to be ignored in base64-encoded data." +- MIME requires lines of length at most 76 characters, separated by CRLF. + +RFCs [1421](https://datatracker.ietf.org/doc/html/rfc1421) and [7468](https://datatracker.ietf.org/doc/html/rfc7468), which define "Privacy-Enhanced Mail" and related things (think `-----BEGIN PRIVATE KEY-----`), are basically identical to the above except that they mandate lines of exactly 64 characters, except that the last line may be shorter. + +RFC [4880](https://datatracker.ietf.org/doc/html/rfc4880#section-6) defines OpenPGP messages and is just the RFC 2045 format plus a checksum. In practice, only whitespace is ignored, not all non-base64 characters. + +No other variations are contemplated in any other RFC or implementation that I'm aware of. That is, we have the following ways that base64 encoders can vary: + +- Standard or URL-safe alphabet +- Whether `=` is included in output +- Whether to add linebreaks after a certain number of characters + +and the following ways that base64 decoders can vary: + +- Standard or URL-safe alphabet +- Whether `=` is required in input, and how to handle malformed padding (e.g. extra `=`) +- Whether to fail on non-zero padding bits +- Whether lines must be of a limited length +- How non-base64-alphabet characters are handled (sometimes with special handling for only a subset, like whitespace) + +## Programming languages + +Note that neither C++ nor Rust have built-in base64 support. In C++ the Boost library is quite common in large projects and parts sometimes get pulled in to the standard library, and in Rust the [base64 crate](https://docs.rs/base64/latest/base64/) is the clear choice of everyone, so I'm mentioning those as well. + +"✅ / ⚙️" means the default is yes but it's configurable. A bare "⚙️" means it's configurable and there is no default. 
+ +| | supports urlsafe | `=`s in output | whitespace in output | can omit `=`s in input | can have non-zero padding bits | can have arbitrary characters in input | can have whitespace in input | +| ------------------- | ---------------- | -------------- | -------------------- | ---------------------- | ------------------------------ | -------------------------------------- | ---------------------------- | +| C++ (Boost) | ❌ | ❌ | ❌ | ?[^cpp] | ? | ❌ | ❌ | +| Ruby | ✅ | ✅ / ⚙️[^ruby] | ✅ / ⚙️[^ruby2] | ✅ / ⚙️ | ✅ / ⚙️ | ❌ | ✅ / ⚙️ | +| Python | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ / ⚙️ | ✅ / ⚙️ | +| Rust (base64 crate) | ✅ | ⚙️ | ❌ | ⚙️ | ⚙️ | ❌ | ❌ | +| Java | ✅ | ✅ / ⚙️ | ❌ / ⚙️[^java] | ✅ | ✅ | ❌ | ❌ / ⚙️ | +| Go | ✅ | ✅ | ❌ | ✅ / ⚙️ | ✅ / ⚙️ | ❌ | ✅[^go] | +| C# | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ | +| PHP | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ / ⚙️ | ✅ / ⚙️ | +| Swift | ❌ | ✅ | ❌ / ⚙️ | ❌ | ✅ | ❌ / ⚙️ | ❌ / ⚙️ | + +[^cpp]: Boost adds extra null bytes to the output when padding is present, and treats non-zero padding bits as meaningful (i.e. it produces more output when they are present) +[^ruby]: Ruby only allows configuring padding with the urlsafe alphabet +[^ruby2]: Ruby adds linebreaks every 60 characters +[^java]: Java allows MIME-format output, with `\r\n` sequences after every 76 characters of output +[^go]: Go only allows linebreaks specifically + +## JS libraries + +Only including libraries with at least a million downloads per week and at least 100 distinct dependents. 
+ +| | supports urlsafe | `=`s in output | whitespace in output | can omit `=`s in input | can have non-zero padding bits | can have arbitrary characters in input | can have whitespace in input | +| --------------------------- | ----------------- | -------------- | -------------------- | ---------------------- | ------------------------------ | -------------------------------------- | ---------------------------- | +| `atob`/`btoa` | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | +| Node's Buffer | ✅[^node] | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | +| base64-js (38m/wk) | ✅ (for decoding) | ✅ | ❌ | ❌ | ✅ | ❌[^base64-js] | ❌ | +| @smithy/util-base64 (8m/wk) | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | +| crypto-js (6m/wk) | ✅ | ✅ | ❌ | ✅ | ✅ | ❌[^crypto-js] | ❌ | +| js-base64 (5m/wk) | ✅ | ✅ / ⚙️ | ❌ | ✅ | ✅ | ✅ | ✅ | +| base64-arraybuffer (4m/wk) | ❌ | ✅ | ❌ | ✅ | ✅ | ❌[^base64-arraybuffer] | ❌ | +| base64url (2m/wk) | ✅ | ❌ / ⚙️ | ❌ | ✅ | ✅ | ✅ | ✅ | +| base-64 (2m/wk) | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | + +[^node]: Node allows mixing alphabets within the same string in input +[^base64-js]: Illegal characters are interpreted as `A` +[^crypto-js]: Illegal characters are interpreted as `A` +[^base64-arraybuffer]: Illegal characters are interpreted as `A` + +## "Whitespace" + +In all of the above, "whitespace" means only _ASCII_ whitespace. I don't think anyone has special handling for Unicode but non-ASCII whitespace. 
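The "non-zero padding bits" column in the tables above can be made concrete with a small check. The following is only a sketch under stated assumptions: `hasNonZeroPaddingBits` is a hypothetical helper, not part of the proposal or of any library listed above, and it handles only the standard alphabet with no whitespace. It looks at the final partial chunk and tests whether the bits that encode no data are set.

```javascript
// Hypothetical helper for illustration; assumes the standard alphabet and no whitespace.
const STANDARD = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

function hasNonZeroPaddingBits(encoded) {
  // Drop up to two trailing `=` so padded and unpadded input are treated alike.
  const body = encoded.replace(/={1,2}$/, '');
  const remainder = body.length % 4;
  if (remainder === 0) return false; // no partial chunk, so no padding bits
  if (remainder === 1) throw new SyntaxError('bad length');
  const last = STANDARD.indexOf(body[body.length - 1]);
  if (last === -1) throw new SyntaxError('illegal character');
  // 2 characters carry 12 bits for 8 bits of data: the low 4 bits are padding.
  // 3 characters carry 18 bits for 16 bits of data: the low 2 bits are padding.
  const mask = remainder === 2 ? 0b1111 : 0b11;
  return (last & mask) !== 0;
}

console.log(hasNonZeroPaddingBits('SGVsbG8gV29ybGQ=')); // false
console.log(hasNonZeroPaddingBits('SGVsbG8gV29ybGR=')); // true
```

A decoder that rejects such input would throw where this helper returns `true`; a lenient one would discard the extra bits.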
diff --git a/package-lock.json b/package-lock.json index 0350942..1875fd5 100644 --- a/package-lock.json +++ b/package-lock.json @@ -4,9 +4,10 @@ "requires": true, "packages": { "": { + "name": "proposal-arraybuffer-base64", "dependencies": { - "@tc39/ecma262-biblio": "2.1.2553", - "ecmarkup": "^17.0.0", + "@tc39/ecma262-biblio": "2.1.2653", + "ecmarkup": "^18.0.0", "jsdom": "^21.1.1", "prismjs": "^1.29.0" } @@ -163,9 +164,9 @@ } }, "node_modules/@tc39/ecma262-biblio": { - "version": "2.1.2553", - "resolved": "https://registry.npmjs.org/@tc39/ecma262-biblio/-/ecma262-biblio-2.1.2553.tgz", - "integrity": "sha512-c2h05szLmHNnNO+7gE7mzgK3qDoZOEQrPKIIIZtY8gRpe5F2qTNIfj5tqxVYV7WUTC+/ZsR9/kojJ823hL2hvg==" + "version": "2.1.2653", + "resolved": "https://registry.npmjs.org/@tc39/ecma262-biblio/-/ecma262-biblio-2.1.2653.tgz", + "integrity": "sha512-/CIVRwkV3fTaYNxFSEvbDsTPFBNfeJOjQACZXs11+NKkbHhFh9pvr8j6NbJ/fekJLgAu6x2QXvGdA/kEGR/08g==" }, "node_modules/@tootallnate/once": { "version": "2.0.0", @@ -508,9 +509,9 @@ } }, "node_modules/ecmarkup": { - "version": "17.0.0", - "resolved": "https://registry.npmjs.org/ecmarkup/-/ecmarkup-17.0.0.tgz", - "integrity": "sha512-eQr9Vn9IPIH3rrbYEGPqfAwDJ9pg1zrOSZXc8HQwVMQ9d5tb+BsoPeKw5W1SinL09yZalcbLyqnX7rC393VRdA==", + "version": "18.0.0", + "resolved": "https://registry.npmjs.org/ecmarkup/-/ecmarkup-18.0.0.tgz", + "integrity": "sha512-VSItKQ+39dv1FeR1YbGGlJ/rx17wsPSkS7morrOCwLGHh+7ehy89hao+rQ0/ptiBAN3nbytXzwUBUTC3XNmxaA==", "dependencies": { "chalk": "^4.1.2", "command-line-args": "^5.2.0", diff --git a/package.json b/package.json index 7bae9fd..cb3be1d 100644 --- a/package.json +++ b/package.json @@ -3,14 +3,14 @@ "name": "proposal-arraybuffer-base64", "scripts": { "build-playground": "mkdir -p dist && cp playground/* dist && node scripts/static-highlight.js playground/index-raw.html > dist/index.html && rm dist/index-raw.html", - "build-spec": "mkdir -p dist/spec && ecmarkup --lint-spec --strict --load-biblio @tc39/ecma262-biblio 
--verbose --js-out dist/spec/ecmarkup.js --css-out dist/spec/ecmarkup.css spec.html dist/spec/index.html", + "build-spec": "mkdir -p dist/spec && ecmarkup --lint-spec --strict --load-biblio @tc39/ecma262-biblio --verbose spec.html --assets-dir dist/spec dist/spec/index.html", "build": "npm run build-playground && npm run build-spec", "format": "emu-format --write spec.html", "check-format": "emu-format --check spec.html" }, "dependencies": { - "@tc39/ecma262-biblio": "2.1.2553", - "ecmarkup": "^17.0.0", + "@tc39/ecma262-biblio": "2.1.2653", + "ecmarkup": "^18.0.0", "jsdom": "^21.1.1", "prismjs": "^1.29.0" } diff --git a/playground/index-raw.html b/playground/index-raw.html index 0d33c2b..7817f90 100644 --- a/playground/index-raw.html +++ b/playground/index-raw.html @@ -48,7 +48,7 @@
This page documents an early-stage proposal for native base64 and hex encoding and decoding for binary data in JavaScript, and includes a non-production, slightly inaccurate polyfill you can experiment with in the browser's console. Some details of the polyfill, particularly around coercion and order of observable effects, are not identical to the proposed spec text.
+This page documents a stage-2 proposal for native base64 and hex encoding and decoding for binary data in JavaScript, and includes a non-production polyfill you can experiment with in the browser's console.
The proposal would provide methods for encoding and decoding Uint8Arrays as base64 and hex strings.
Feedback on the proposal's repository is appreciated.
@@ -84,8 +84,7 @@The base64 methods take an optional options bag which allows specifying the alphabet as either "base64" (the default) or "base64url" (the URL-safe variant).
-In the future this may allow specifying arbitrary alphabets.
-In later versions of this proposal the options bag may also allow additional options, such as specifying whether to generate / enforce padding characters and how to handle whitespace.
+When decoding, the options bag also allows specifying strict: false (the default) or strict: true. When using strict: false, whitespace is legal and padding is optional. When using strict: true, whitespace is forbidden and standard padding (including any overflow bits in the last character being 0) is enforced - i.e., only canonical encodings are allowed.
The hex methods do not have any options.
@@ -94,103 +93,30 @@ Options
// '+/+/'
console.log(array.toBase64({ alphabet: 'base64url' }));
// '-_-_'
-
-Two additional methods, toPartialBase64 and fromPartialBase64, allow encoding and decoding chunks of base64. This requires managing state, which is handled by returning a { result, extra } pair. The options bag for these methods takes two additional arguments, one which specifies whether more data is expected and one which specifies any extra values returned by a previous call.
These methods are intended for lower-level use and are less convenient to use.
-Streaming versions of the hex APIs are not included since they are straightforward to do manually.
+console.log(Uint8Array.fromBase64('SGVsbG8g\nV29ybG R')); +// works, despite whitespace, missing padding, and non-zero overflow bits -Streaming an ArrayBuffer into chunks of base64 strings:
-
-let buffer = (new Float64Array([0.1, 0.2, 0.3, 0.4])).buffer;
-let chunkSize = 6;
-let resultChunks = [];
-
-let result, extra;
-for (let offset = 0; offset < buffer.byteLength; offset += chunkSize) {
- let length = Math.min(chunkSize, buffer.byteLength - offset);
- let view = new Uint8Array(buffer, offset, length);
- ({ result, extra } = view.toPartialBase64({ more: true, extra }));
- resultChunks.push(result);
+try {
+ Uint8Array.fromBase64('SGVsbG8g\nV29ybG Q=', { strict: true });
+} catch {
+ console.log('with strict: true, whitespace is rejected');
}
-({ result } = extra.toPartialBase64({ more: false }));
-resultChunks.push(result);
-console.log(resultChunks);
-// ['mpmZmZmZ', 'uT+amZmZ', 'mZnJPzMz', 'MzMzM9M/', 'mpmZmZmZ', '', '2T8=']
-
-
-Streaming base64 strings into Uint8Arrays:
-
-let chunks = ['mpmZmZmZuT+am', 'ZmZmZnJPzMz', 'MzMz', 'M9M/mpmZmZmZ', '2T8='];
-// individual chunks are not necessarily correctly-padded base64 strings
-
-let output = new Uint8Array(new ArrayBuffer(0, { maxByteLength: 1024 }));
-let result, extra;
-for (let c of chunks) {
- ({ result, extra } = Uint8Array.fromPartialBase64(c, { more: true, extra }));
- let offset = output.length;
- let newLength = offset + result.length;
- output.buffer.resize(newLength);
- output.set(result, offset);
+try {
+ Uint8Array.fromBase64('SGVsbG8gV29ybGQ', { strict: true });
+} catch {
+ console.log('with strict: true, padding is required');
}
-// if padding was optional,
-// you'd need to do a final `fromPartialBase64` call here with `more: false`
-
-console.log(new Float64Array(output.buffer));
-// Float64Array([0.1, 0.2, 0.3, 0.4])
-
-
-Note that the above snippet makes use of the Growable ArrayBuffers proposal for illustration, which not all browsers support as of this writing.
- -A more involved example, creating a TransformStream which encodes contiguous Uint8Arrays:
-
-class BufferToStringTransformStream extends TransformStream {
- #extra = null;
- constructor(alphabet) {
- super({
- transform: (chunk, controller) => {
- let { result, extra } = chunk.toPartialBase64({
- alphabet,
- extra: this.#extra,
- more: true,
- });
- this.#extra = extra;
- controller.enqueue(result);
- },
- flush: (controller) => {
- if (this.#extra == null) return; // stream was empty
- let { result } = this.#extra.toPartialBase64({ alphabet });
- controller.enqueue(result);
- },
- });
- }
+try {
+ Uint8Array.fromBase64('SGVsbG8gV29ybGR=', { strict: true });
+} catch {
+ console.log('with strict: true, non-zero overflow bits are rejected');
}
-
-// use:
-let source = new ReadableStream({
- start(controller) {
- controller.enqueue(new Uint8Array([1, 2]));
- controller.enqueue(new Uint8Array([3, 4]));
- controller.close();
- },
-});
-
-let chunks = [];
-let sink = new WritableStream({
- write(chunk) {
- chunks.push(chunk);
- },
- close() {
- console.log(chunks.join('')); // 'AQIDBA=='
- },
-});
-
-source
- .pipeThrough(new BufferToStringTransformStream())
- .pipeTo(sink);
+There is no support for streaming. However, it can be implemented in userland.
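One possible shape for a userland streaming encoder is sketched below. It is not the repository's `stream.mjs`; `encodeChunks` and its `encode` parameter are hypothetical names introduced here for illustration. The idea is to carry leftover bytes between chunks so that every call to the encoder except the last receives a multiple of 3 bytes, which means padding can only appear at the very end. The usage lines assume Node's `Buffer` as a stand-in encoder; with the proposal available, `bytes => bytes.toBase64()` would fill the same role.

```javascript
// Hypothetical helper: `encode` is any function mapping a Uint8Array to padded base64.
function* encodeChunks(chunks, encode) {
  let leftover = new Uint8Array(0);
  for (const chunk of chunks) {
    // Prepend the bytes that did not fit into a full 3-byte group last time.
    const joined = new Uint8Array(leftover.length + chunk.length);
    joined.set(leftover);
    joined.set(chunk, leftover.length);
    // Encode only whole 3-byte groups so no `=` appears mid-stream.
    const usable = joined.length - (joined.length % 3);
    if (usable > 0) yield encode(joined.subarray(0, usable));
    leftover = joined.slice(usable);
  }
  // The final 1 or 2 bytes (if any) are the only ones that produce padding.
  if (leftover.length > 0) yield encode(leftover);
}

// Usage, assuming Node's Buffer as the encoder:
const encode = (bytes) => Buffer.from(bytes).toString('base64');
const parts = [...encodeChunks([new Uint8Array([1, 2]), new Uint8Array([3, 4])], encode)];
console.log(parts.join('')); // 'AQIDBA=='
```

Copying into `joined` on every chunk keeps the sketch short; an implementation that cares about throughput would avoid the copy.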
+ diff --git a/playground/polyfill-core.mjs b/playground/polyfill-core.mjs index 5756f0f..8c2c004 100644 --- a/playground/polyfill-core.mjs +++ b/playground/polyfill-core.mjs @@ -20,110 +20,106 @@ function assert(condition, message) { } } -function alphabetFromIdentifier(alphabet) { - if (alphabet === 'base64') { - return base64Characters; - } else if (alphabet === 'base64url') { - return base64UrlCharacters; - } else { - throw new TypeError('expected alphabet to be either "base64" or "base64url"'); +function getOptions(options) { + if (typeof options === 'undefined') { + return Object.create(null); + } + if (options && typeof options === 'object') { + return options; } + throw new TypeError('options is not object'); } -export function uint8ArrayToBase64(arr, alphabetIdentifier = 'base64', more = false, origExtra = null) { +export function uint8ArrayToBase64(arr, options) { checkUint8Array(arr); - let alphabet = alphabetFromIdentifier(alphabetIdentifier); - more = !!more; - if (origExtra != null) { - checkUint8Array(origExtra); - // a more efficient algorithm would avoid copying - // but writing that out is unclear / a pain - // the difference is not observable - let copy = new Uint8Array(arr.length + origExtra.length); - copy.set(origExtra); - copy.set(arr, origExtra.length); - arr = copy; + let opts = getOptions(options); + let alphabet = opts.alphabet; + if (typeof alphabet === 'undefined') { + alphabet = 'base64'; + } + if (alphabet !== 'base64' && alphabet !== 'base64url') { + throw new TypeError('expected alphabet to be either "base64" or "base64url"'); } + + let lookup = alphabet === 'base64' ? 
base64Characters : base64UrlCharacters; let result = ''; let i = 0; for (; i + 2 < arr.length; i += 3) { let triplet = (arr[i] << 16) + (arr[i + 1] << 8) + arr[i + 2]; result += - alphabet[(triplet >> 18) & 63] + - alphabet[(triplet >> 12) & 63] + - alphabet[(triplet >> 6) & 63] + - alphabet[triplet & 63]; - } - if (more) { - let extra = arr.slice(i); // TODO should this be a view, or a copy? - return { result, extra }; - } else { - if (i + 2 === arr.length) { - let triplet = (arr[i] << 16) + (arr[i + 1] << 8); - result += - alphabet[(triplet >> 18) & 63] + - alphabet[(triplet >> 12) & 63] + - alphabet[(triplet >> 6) & 63] + - '='; - } else if (i + 1 === arr.length) { - let triplet = arr[i] << 16; - result += - alphabet[(triplet >> 18) & 63] + - alphabet[(triplet >> 12) & 63] + - '=='; - } - return { result, extra: null }; + lookup[(triplet >> 18) & 63] + + lookup[(triplet >> 12) & 63] + + lookup[(triplet >> 6) & 63] + + lookup[triplet & 63]; } + if (i + 2 === arr.length) { + let triplet = (arr[i] << 16) + (arr[i + 1] << 8); + result += + lookup[(triplet >> 18) & 63] + + lookup[(triplet >> 12) & 63] + + lookup[(triplet >> 6) & 63] + + '='; + } else if (i + 1 === arr.length) { + let triplet = arr[i] << 16; + result += + lookup[(triplet >> 18) & 63] + + lookup[(triplet >> 12) & 63] + + '=='; + } + return result; } -export function base64ToUint8Array(str, alphabetIdentifier = 'base64', more = false, origExtra = null) { - if (typeof str !== 'string') { - throw new TypeError('expected str to be a string'); +export function base64ToUint8Array(string, options) { + if (typeof string !== 'string') { + throw new TypeError('expected input to be a string'); } - let alphabet = alphabetFromIdentifier(alphabetIdentifier); - more = !!more; - if (origExtra != null) { - if (typeof origExtra !== 'string') { - throw new TypeError('expected extra to be a string'); - } - str = origExtra + str; - } - let map = new Map(alphabet.split('').map((c, i) => [c, i])); - - let extra; - if (more) 
{ - let padding = str.length % 4; - if (padding === 0) { - extra = ''; - } else { - extra = str.slice(-padding); - str = str.slice(0, -padding) - } - } else { - // todo opt-in optional padding - if (str.length % 4 !== 0) { - throw new Error('not correctly padded'); + let opts = getOptions(options); + let alphabet = opts.alphabet; + if (typeof alphabet === 'undefined') { + alphabet = 'base64'; + } + if (alphabet !== 'base64' && alphabet !== 'base64url') { + throw new TypeError('expected alphabet to be either "base64" or "base64url"'); + } + let strict = !!opts.strict; + let input = string; + + if (!strict) { + input = input.replaceAll(/[\u0009\u000A\u000C\u000D\u0020]/g, ''); + } + if (input.length % 4 === 0) { + if (input.length > 0 && input.at(-1) === '=') { + input = input.slice(0, -1); + if (input.length > 0 && input.at(-1) === '=') { + input = input.slice(0, -1); + } } - extra = null; + } else if (strict) { + throw new SyntaxError('not correctly padded'); } - assert(str.length % 4 === 0, 'str.length % 4 === 0'); - if (str.endsWith('==')) { - str = str.slice(0, -2); - } else if (str.endsWith('=')) { - str = str.slice(0, -1); + + let map = new Map((alphabet === 'base64' ? 
base64Characters : base64UrlCharacters).split('').map((c, i) => [c, i])); + if ([...input].some(c => !map.has(c))) { + let bad = [...input].filter(c => !map.has(c)); + throw new SyntaxError(`contains illegal character(s) ${JSON.stringify(bad)}`); + } + + let lastChunkSize = input.length % 4; + if (lastChunkSize === 1) { + throw new SyntaxError('bad length'); + } else if (lastChunkSize === 2 || lastChunkSize === 3) { + input += 'A'.repeat(4 - lastChunkSize); } + assert(input.length % 4 === 0); let result = []; let i = 0; - for (; i + 3 < str.length; i += 4) { - let c1 = str[i]; - let c2 = str[i + 1]; - let c3 = str[i + 2]; - let c4 = str[i + 3]; - if ([c1, c2, c3, c4].some(c => !map.has(c))) { - throw new Error('bad character'); - } + for (; i < input.length; i += 4) { + let c1 = input[i]; + let c2 = input[i + 1]; + let c3 = input[i + 2]; + let c4 = input[i + 3]; let triplet = (map.get(c1) << 18) + (map.get(c2) << 12) + @@ -136,42 +132,20 @@ export function base64ToUint8Array(str, alphabetIdentifier = 'base64', more = fa triplet & 255 ); } - // TODO if we want to be _really_ pedantic, following the RFC, we should enforce the extra 2-4 bits are 0 - if (i + 2 === str.length) { - // the `==` case - let c1 = str[i]; - let c2 = str[i + 1]; - if ([c1, c2].some(c => !map.has(c))) { - throw new Error('bad character'); + + if (lastChunkSize === 2) { + if (strict && result.at(-2) !== 0) { + throw new SyntaxError('extra bits'); } - let triplet = - (map.get(c1) << 18) + - (map.get(c2) << 12); - result.push((triplet >> 16) & 255); - } else if (i + 3 === str.length) { - // the `=` case - let c1 = str[i]; - let c2 = str[i + 1]; - let c3 = str[i + 2]; - if ([c1, c2, c3].some(c => !map.has(c))) { - throw new Error('bad character'); + result.splice(-2, 2); + } else if (lastChunkSize === 3) { + if (strict && result.at(-1) !== 0) { + throw new SyntaxError('extra bits'); } - let triplet = - (map.get(c1) << 18) + - (map.get(c2) << 12) + - (map.get(c3) << 6); - result.push( - (triplet >> 
16) & 255, - (triplet >> 8) & 255, - ); - } else { - assert(i === str.length); + result.pop(); } - return { - result: new Uint8Array(result), - extra, - }; + return new Uint8Array(result); } export function uint8ArrayToHex(arr) { @@ -183,19 +157,19 @@ export function uint8ArrayToHex(arr) { return out; } -export function hexToUint8Array(str) { - if (typeof str !== 'string') { - throw new TypeError('expected str to be a string'); +export function hexToUint8Array(string) { + if (typeof string !== 'string') { + throw new TypeError('expected string to be a string'); } - if (str.length % 2 !== 0) { - throw new SyntaxError('str should be an even number of characters'); + if (string.length % 2 !== 0) { + throw new SyntaxError('string should be an even number of characters'); } - if (/[^0-9a-zA-Z]/.test(str)) { - throw new SyntaxError('str should only contain hex characters'); + if (/[^0-9a-fA-F]/.test(string)) { + throw new SyntaxError('string should only contain hex characters'); } - let out = new Uint8Array(str.length / 2); + let out = new Uint8Array(string.length / 2); for (let i = 0; i < out.length; ++i) { - out[i] = parseInt(str.slice(i * 2, i * 2 + 2), 16); + out[i] = parseInt(string.slice(i * 2, i * 2 + 2), 16); } return out; } diff --git a/playground/polyfill-install.mjs b/playground/polyfill-install.mjs index 703f632..84edb8e 100644 --- a/playground/polyfill-install.mjs +++ b/playground/polyfill-install.mjs @@ -1,53 +1,17 @@ -import { checkUint8Array, uint8ArrayToBase64, base64ToUint8Array, uint8ArrayToHex, hexToUint8Array } from './polyfill-core.mjs'; +import { uint8ArrayToBase64, base64ToUint8Array, uint8ArrayToHex, hexToUint8Array } from './polyfill-core.mjs'; -Uint8Array.prototype.toBase64 = function (opts) { - checkUint8Array(this); - let alphabet; - if (opts && typeof opts === 'object') { - 0, { alphabet } = opts; - } - return uint8ArrayToBase64(this, alphabet).result; +Uint8Array.prototype.toBase64 = function (options) { + return uint8ArrayToBase64(this, 
options); }; -Uint8Array.prototype.toPartialBase64 = function (opts) { - checkUint8Array(this); - let alphabet, more, extra; - if (opts && typeof opts === 'object') { - 0, { alphabet, more, extra } = opts; - } - return uint8ArrayToBase64(this, alphabet, more, extra); -}; - -Uint8Array.fromBase64 = function (string, opts) { - if (typeof string !== 'string') { - throw new Error('expected argument to be a string'); - } - let alphabet; - if (opts && typeof opts === 'object') { - 0, { alphabet } = opts; - } - return base64ToUint8Array(string, alphabet).result; -}; - -Uint8Array.fromPartialBase64 = function (string, opts) { - if (typeof string !== 'string') { - throw new Error('expected argument to be a string'); - } - let alphabet, more, extra; - if (opts && typeof opts === 'object') { - 0, { alphabet, more, extra } = opts; - } - return base64ToUint8Array(string, alphabet, more, extra); +Uint8Array.fromBase64 = function (string, options) { + return base64ToUint8Array(string, options); }; Uint8Array.prototype.toHex = function () { - checkUint8Array(this); return uint8ArrayToHex(this); }; Uint8Array.fromHex = function (string) { - if (typeof string !== 'string') { - throw new Error('expected argument to be a string'); - } return hexToUint8Array(string); }; diff --git a/spec.html b/spec.html index 4cd2943..50a8104 100644 --- a/spec.html +++ b/spec.html @@ -22,53 +22,14 @@The standard base64 alphabet is a List whose elements are the code points corresponding to every letter and number in the Unicode Basic Latin block along with *"+"* and *"/"*; that is, it is StringToCodePoints(*"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/").