RFC: Allow full Unicode range

leebyron · leebyron · commit 48d263d863eb · 2021-04-13T02:45:01.000-07:00
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow code points above 0xFFFF, now up to 0x10FFFF.
2. Allow surrogate pairs within StringValue. Illegal pairs are handled with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
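
For reviewers, here is a minimal Python sketch of how a lexer might implement changes 2 and 3. It is an illustration under stated assumptions, not the reference implementation. The names `GraphQLSyntaxError`, `read_escaped_unicode`, and `combine_surrogate_pairs` are hypothetical. The sketch also assumes that a High Surrogate at the very end of a string is a parse error, a case the proposed text does not spell out.

```python
from typing import List, Tuple

HEX_DIGITS = set("0123456789abcdefABCDEF")


class GraphQLSyntaxError(Exception):
    """Hypothetical error type standing in for the spec's 'parse error'."""


def read_escaped_unicode(body: str, pos: int) -> Tuple[int, int]:
    """Read an EscapedUnicode starting at body[pos], just after the backslash-u prefix.

    Returns (code_point, next_pos).
    """
    if pos < len(body) and body[pos] == "{":
        # Braced form: `{` HexDigit+ `}` but only if <= 0x10FFFF.
        end = body.find("}", pos)
        digits = body[pos + 1:end] if end != -1 else ""
        if not digits or any(c not in HEX_DIGITS for c in digits):
            raise GraphQLSyntaxError("invalid braced Unicode escape")
        value = int(digits, 16)
        if value > 0x10FFFF:
            raise GraphQLSyntaxError("escape exceeds U+10FFFF")
        return value, end + 1
    # Fixed-width form: exactly four HexDigit.
    digits = body[pos:pos + 4]
    if len(digits) != 4 or any(c not in HEX_DIGITS for c in digits):
        raise GraphQLSyntaxError("invalid Unicode escape")
    return int(digits, 16), pos + 4


def combine_surrogate_pairs(points: List[int]) -> List[int]:
    """Replace each High/Low Surrogate pair with the code point it encodes."""
    result: List[int] = []
    i = 0
    while i < len(points):
        code_point = points[i]
        if 0xD800 <= code_point <= 0xDBFF:  # High Surrogate
            # Assumption: a missing follower is treated like a bad follower.
            low = points[i + 1] if i + 1 < len(points) else None
            if low is None or not (0xDC00 <= low <= 0xDFFF):
                raise GraphQLSyntaxError(
                    "a High Surrogate must be followed by a Low Surrogate")
            result.append(
                (code_point - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000)
            i += 2
        elif 0xDC00 <= code_point <= 0xDFFF:  # unpaired Low Surrogate
            raise GraphQLSyntaxError(
                "a Low Surrogate must follow a High Surrogate")
        else:
            result.append(code_point)
            i += 1
    return result
```

With this sketch, `"\uD83C\uDF7A"` and `"\u{1F37A}"` decode to the same single code point, U+1F37A, which is the equivalence the proposal relies on.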
diff --git a/spec/Appendix B -- Grammar Summary.md b/spec/Appendix B -- Grammar Summary.md
@@ -97,13 +97,20 @@ StringValue ::
   - `"""` BlockStringCharacter* `"""`
 
 StringCharacter ::
-  - SourceCharacter but not `"` or \ or LineTerminator
-  - \u EscapedUnicode
-  - \ EscapedCharacter
+  - SourceCharacter but not `"` or `\` or LineTerminator
+  - `\u` EscapedUnicode
+  - `\` EscapedCharacter
 
-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+  - HexDigit HexDigit HexDigit HexDigit
+  - `{` HexDigit+ `}` "but only if <= 0x10FFFF"
 
-EscapedCharacter :: one of `"` \ `/` b f n r t
+HexDigit :: one of
+  - `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+  - `A` `B` `C` `D` `E` `F`
+  - `a` `b` `c` `d` `e` `f`
+
+EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`
 
 BlockStringCharacter ::
   - SourceCharacter but not `"""` or `\"""`
diff --git a/spec/Section 2 -- Language.md b/spec/Section 2 -- Language.md
@@ -50,23 +50,24 @@ SourceCharacter ::
   - "U+0009"
   - "U+000A"
   - "U+000D"
-  - "U+0020–U+FFFF"
+  - "U+0020–U+10FFFF"
 
 GraphQL documents are expressed as a sequence of
-[Unicode](https://unicode.org/standard/standard.html) characters. However, with
+[Unicode](https://unicode.org/standard/standard.html) code points (informally
+referred to as *"characters"* through most of this specification). However, with
 few exceptions, most of GraphQL is expressed only in the original non-control
 ASCII range so as to be as widely compatible with as many existing tools,
 languages, and serialization formats as possible and avoid display issues in
 text editors and source control.
 
+Note: Non-ASCII Unicode code points may freely appear within {StringValue} and
+{Comment} tokens.
+
 
 ### Unicode
 
 UnicodeBOM :: "Byte Order Mark (U+FEFF)"
 
-Non-ASCII Unicode characters may freely appear within {StringValue} and
-{Comment} portions of GraphQL.
-
 The "Byte Order Mark" is a special Unicode character which
 may appear at the beginning of a file containing Unicode which programs may use
 to determine the fact that the text stream is Unicode, what endianness the text
@@ -804,13 +805,20 @@ StringValue ::
   - `"""` BlockStringCharacter* `"""`
 
 StringCharacter ::
-  - SourceCharacter but not `"` or \ or LineTerminator
-  - \u EscapedUnicode
-  - \ EscapedCharacter
+  - SourceCharacter but not `"` or `\` or LineTerminator
+  - `\u` EscapedUnicode
+  - `\` EscapedCharacter
+
+EscapedUnicode ::
+  - HexDigit HexDigit HexDigit HexDigit
+  - `{` HexDigit+ `}` "but only if <= 0x10FFFF"
 
-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+HexDigit :: one of
+  - `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+  - `A` `B` `C` `D` `E` `F`
+  - `a` `b` `c` `d` `e` `f`
 
-EscapedCharacter :: one of `"` \ `/` b f n r t
+EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`
 
 BlockStringCharacter ::
   - SourceCharacter but not `"""` or `\"""`
@@ -825,9 +833,9 @@ be interpreted as the beginning of a block string. As an example, the source
 {`""""""`} can only be interpreted as a single empty block string and not three
 empty strings.
 
-Non-ASCII Unicode characters are allowed within single-quoted strings. 
-Since {SourceCharacter} must not contain some ASCII control characters, escape 
-sequences must be used to represent these characters. The {`\`}, {`"`} 
+Non-ASCII Unicode characters are allowed within single-quoted strings.
+Since {SourceCharacter} must not contain some ASCII control characters, escape
+sequences must be used to represent these characters. The {`\`}, {`"`}
 characters also must be escaped. All other escape sequences are optional.
 
 **Block Strings**
@@ -892,32 +900,49 @@ StringValue :: `""`
 
 StringValue :: `"` StringCharacter+ `"`
 
-  * Return the Unicode character sequence of all {StringCharacter}
-    Unicode character values.
-
-StringCharacter :: SourceCharacter but not `"` or \ or LineTerminator
-
-  * Return the character value of {SourceCharacter}.
-
-StringCharacter :: \u EscapedUnicode
-
-  * Return the character whose code unit value in the Unicode Basic Multilingual
-    Plane is the 16-bit hexadecimal value {EscapedUnicode}.
-
-StringCharacter :: \ EscapedCharacter
-
-  * Return the character value of {EscapedCharacter} according to the table below.
-
-| Escaped Character | Code Unit Value | Character Name               |
-| ----------------- | --------------- | ---------------------------- |
-| `"`               | U+0022          | double quote                 |
-| `\`               | U+005C          | reverse solidus (back slash) |
-| `/`               | U+002F          | solidus (forward slash)      |
-| `b`               | U+0008          | backspace                    |
-| `f`               | U+000C          | form feed                    |
-| `n`               | U+000A          | line feed (new line)         |
-| `r`               | U+000D          | carriage return              |
-| `t`               | U+0009          | horizontal tab               |
+  * Let {string} be the sequence of all {StringCharacter} code points.
+  * For each {codePoint} at {index} in {string}:
+    * If {codePoint} is >= 0xD800 and <= 0xDBFF (a [*High Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
+      * Let {lowCodePoint} be the code point at {index} + {1} in {string}.
+      * If {lowCodePoint} is not >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
+        * Raise a parse error (a *High Surrogate* must be followed by a *Low Surrogate*).
+      * Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowCodePoint} - 0xDC00) + 0x10000.
+      * Within {string}, replace {codePoint} and {lowCodePoint} with {decodedPoint}.
+    * If {codePoint} is >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
+      * Raise a parse error (a *Low Surrogate* must follow a *High Surrogate*).
+  * Return {string}.
+
+Note: {StringValue} should avoid encoding code points as surrogate pairs.
+While services must interpret them accordingly, a braced escape (for example
+`"\u{1F4A9}"`) is a clearer way to encode code points outside of the
+[Basic Multilingual Plane](https://unicodebook.readthedocs.io/unicode.html#bmp).
+
+StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator
+
+  * Return the code point {SourceCharacter}.
+
+StringCharacter :: `\u` EscapedUnicode
+
+  * Let {value} be the 21-bit hexadecimal value represented by the sequence of
+    {HexDigit} within {EscapedUnicode}.
+  * Assert {value} <= 0x10FFFF.
+  * Return the code point {value}.
+
+StringCharacter :: `\` EscapedCharacter
+
+  * Return the code point represented by {EscapedCharacter} according to the
+    table below.
+
+| Escaped Character | Code Point | Character Name               |
+| ----------------- | ---------- | ---------------------------- |
+| `"`               | U+0022     | double quote                 |
+| `\`               | U+005C     | reverse solidus (back slash) |
+| `/`               | U+002F     | solidus (forward slash)      |
+| `b`               | U+0008     | backspace                    |
+| `f`               | U+000C     | form feed                    |
+| `n`               | U+000A     | line feed (new line)         |
+| `r`               | U+000D     | carriage return              |
+| `t`               | U+0009     | horizontal tab               |
 
 StringValue :: `"""` BlockStringCharacter* `"""`