Commit 27e6676

RFC: Allow full unicode range
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow points above 0xFFFF, now to 0x10FFFF.
2. Allow surrogate pairs within StringValue. This handles illegal pairs with a parse error.
3. Introduce new syntax for full range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
1 parent d4777b4 commit 27e6676

2 files changed, +34 -6 lines changed

spec/Appendix B -- Grammar Summary.md

Lines changed: 8 additions & 1 deletion
@@ -101,7 +101,14 @@ StringCharacter ::
 - `\u` EscapedUnicode
 - `\` EscapedCharacter

-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+- HexDigit HexDigit HexDigit HexDigit
+- `{` HexDigit+ `}` "but only if <= 0x10FFFF"
+
+HexDigit :: one of
+- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+- `A` `B` `C` `D` `E` `F`
+- `a` `b` `c` `d` `e` `f`

 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`

spec/Section 2 -- Language.md

Lines changed: 26 additions & 5 deletions
@@ -50,7 +50,7 @@ SourceCharacter ::
 - "U+0009"
 - "U+000A"
 - "U+000D"
-- "U+0020–U+FFFF"
+- "U+0020–U+10FFFF"

 GraphQL documents are expressed as a sequence of
 [Unicode](https://unicode.org/standard/standard.html) code points (informally
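
The effect of widening the last SourceCharacter range can be sketched as a pair of range checks. This is only an illustration, assuming code points are handled as plain integers; the function names `is_source_character` and `was_source_character` are hypothetical and do not come from the spec or from any GraphQL library.

```python
def was_source_character(code_point: int) -> bool:
    """SourceCharacter before this commit: the last range ended at U+FFFF."""
    return code_point in (0x0009, 0x000A, 0x000D) or 0x0020 <= code_point <= 0xFFFF


def is_source_character(code_point: int) -> bool:
    """SourceCharacter after this commit: the last range ends at U+10FFFF."""
    return code_point in (0x0009, 0x000A, 0x000D) or 0x0020 <= code_point <= 0x10FFFF


# U+1F37A (the code point used in the commit message) is only accepted
# by the widened range.
beer = 0x1F37A
print(was_source_character(beer), is_source_character(beer))
```

Control characters other than tab, line feed, and carriage return stay excluded in both versions; only the upper bound moves.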
@@ -809,7 +809,14 @@ StringCharacter ::
 - `\u` EscapedUnicode
 - `\` EscapedCharacter

-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+- HexDigit HexDigit HexDigit HexDigit
+- `{` HexDigit+ `}` "but only if <= 0x10FFFF"
+
+HexDigit :: one of
+- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+- `A` `B` `C` `D` `E` `F`
+- `a` `b` `c` `d` `e` `f`

 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`
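
One way a lexer might read the two EscapedUnicode alternatives without a regex is sketched below. It assumes the lexer has already consumed the `\u` prefix and works on a Python string; the function name `read_escaped_unicode`, its return shape, and the use of `ValueError` for syntax errors are all illustrative choices, not something this commit prescribes.

```python
from typing import Tuple

# The HexDigit production, spelled out as a set of characters.
HEX_DIGITS = frozenset("0123456789ABCDEFabcdef")


def read_escaped_unicode(body: str, pos: int) -> Tuple[int, int]:
    """Read EscapedUnicode at body[pos], just after the backslash-u prefix.

    Returns (value, position after the escape). Raises ValueError when the
    text matches neither grammar alternative or the value is out of range.
    """
    if pos < len(body) and body[pos] == "{":
        # Variable-width alternative: `{` HexDigit+ `}`
        end = pos + 1
        while end < len(body) and body[end] in HEX_DIGITS:
            end += 1
        digits = body[pos + 1:end]
        if not digits or end >= len(body) or body[end] != "}":
            raise ValueError("malformed braced unicode escape")
        value = int(digits, 16)
        # The grammar's side condition: "but only if <= 0x10FFFF".
        if value > 0x10FFFF:
            raise ValueError("unicode escape exceeds 0x10FFFF")
        return value, end + 1
    # Fixed-width alternative: exactly four HexDigit.
    digits = body[pos:pos + 4]
    if len(digits) != 4 or any(ch not in HEX_DIGITS for ch in digits):
        raise ValueError("malformed fixed-width unicode escape")
    return int(digits, 16), pos + 4


value, next_pos = read_escaped_unicode("{1F37A}", 0)
print(hex(value), next_pos)
```

The braced form needs the explicit range check because `HexDigit+` places no limit on the number of digits, while four fixed digits can never exceed 0xFFFF.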

@@ -893,16 +900,30 @@ StringValue :: `""`

 StringValue :: `"` StringCharacter+ `"`

-* Return the sequence of all {StringCharacter} code points.
+* Let {string} be the sequence of all {StringCharacter} code points.
+* For each {codePoint} at {index} in {string}:
+  * If {codePoint} is >= 0xD800 and <= 0xDBFF (a [*High Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
+    * Let {lowPoint} be the code point at {index} + {1} in {string}.
+    * Assert {lowPoint} is >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
+    * Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowPoint} - 0xDC00) + 0x10000.
+    * Within {string}, replace {codePoint} and {lowPoint} with {decodedPoint}.
+  * Otherwise, assert {codePoint} is not >= 0xDC00 and <= 0xDFFF (a [*Low Surrogate*](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
+* Return {string}.
+
+Note: {StringValue} should avoid encoding code points as surrogate pairs.
+While services must interpret them accordingly, a braced escape (for example
+`"\u{1F4A9}"`) is a clearer way to encode code points outside of the
+[Basic Multilingual Plane](https://unicodebook.readthedocs.io/unicode.html#bmp).

 StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator

 * Return the code point {SourceCharacter}.

 StringCharacter :: `\u` EscapedUnicode

-* Let {value} be the 16-bit hexadecimal value represented by the sequence of
-  hexadecimal digits within {EscapedUnicode}.
+* Let {value} be the 21-bit hexadecimal value represented by the sequence of
+  {HexDigit} within {EscapedUnicode}.
+* Assert {value} <= 0x10FFFF.
 * Return the code point {value}.

 StringCharacter :: `\` EscapedCharacter
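
The new StringValue steps can be sketched roughly as follows. This is an illustrative reading, not the reference implementation: the function name `combine_surrogates` is hypothetical, the input is assumed to be the list of {StringCharacter} code points as integers, the spec's "Assert" steps are modelled as `ValueError` (the commit message says illegal pairs produce a parse error), and the sketch builds a new list rather than replacing entries in place.

```python
from typing import List


def combine_surrogates(code_points: List[int]) -> List[int]:
    """Combine High/Low Surrogate pairs into single code points."""
    result: List[int] = []
    index = 0
    while index < len(code_points):
        code_point = code_points[index]
        if 0xD800 <= code_point <= 0xDBFF:  # High Surrogate
            if index + 1 >= len(code_points):
                raise ValueError("high surrogate at end of string")
            low_point = code_points[index + 1]
            if not 0xDC00 <= low_point <= 0xDFFF:
                raise ValueError("high surrogate not followed by low surrogate")
            # Same arithmetic as the {decodedPoint} step in the spec text.
            decoded_point = (
                (code_point - 0xD800) * 0x400 + (low_point - 0xDC00) + 0x10000
            )
            result.append(decoded_point)
            index += 2  # both halves of the pair are consumed
        elif 0xDC00 <= code_point <= 0xDFFF:
            # A Low Surrogate that was not consumed as part of a pair.
            raise ValueError("low surrogate without preceding high surrogate")
        else:
            result.append(code_point)
            index += 1
    return result


# "\uD83D\uDCA9" and "\u{1F4A9}" should yield the same single code point:
# (0xD83D - 0xD800) * 0x400 + (0xDCA9 - 0xDC00) + 0x10000 == 0x1F4A9
print([hex(cp) for cp in combine_surrogates([0xD83D, 0xDCA9])])
```

Worked through by hand, 0xD83D - 0xD800 is 0x3D, times 0x400 gives 0xF400, plus 0xA9 gives 0xF4A9, plus 0x10000 gives 0x1F4A9, which matches the braced escape used in the note.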
