Commit 5acbaf2

[lex.charset] Fix various issues with the description of UCNs.
Clarify that \U sequences not beginning 00 are ill-formed. Clarify handling of code points naming reserved or noncharacter code points. Remove unnecessary circumlocution through "short identifiers" by directly talking about code points. Use code point values directly rather than using C++ 0x notation. [lex.string] Fix description of what UCNs mean, and convert it to a note.
1 parent c45d32d commit 5acbaf2

1 file changed

+23
-20
lines changed

source/lex.tex

Lines changed: 23 additions & 20 deletions
@@ -207,18 +207,17 @@
 \terminal{\textbackslash U} hex-quad hex-quad
 \end{bnf}
 
-The character designated by the \grammarterm{universal-character-name} \tcode{\textbackslash
-U00NNNNNN} is that character
-that has \tcode{U+NNNNNN} as a code point short identifier;
-the character designated by the \grammarterm{universal-character-name}
-\tcode{\textbackslash uNNNN} is that character
-that has \tcode{U+NNNN} as a code point short identifier.
-If a \grammarterm{universal-character-name} does not correspond to
-a code point in ISO/IEC 10646 or
-if a \grammarterm{universal-character-name} corresponds to
-a surrogate code point,
-the program is ill-formed. Additionally, if
-a \grammarterm{universal-character-name} outside
+A \grammarterm{universal-character-name}
+designates the character in ISO/IEC 10646 (if any)
+whose code point is the hexadecimal number represented by
+the sequence of \grammarterm{hexadecimal-digit}s
+in the \grammarterm{universal-character-name}.
+The program is ill-formed if that number is not a code point
+or if it is a surrogate code point.
+Noncharacter code points and reserved code points
+are considered to designate separate characters distinct from
+any ISO/IEC 10646 character.
+If a \grammarterm{universal-character-name} outside
 the \grammarterm{c-char-sequence}, \grammarterm{s-char-sequence}, or
 \grammarterm{r-char-sequence} of
 a character or
@@ -228,10 +227,10 @@
 \grammarterm{r-char-sequence}\iref{lex.string} does not form a
 \grammarterm{universal-character-name}.}
 \begin{note}
-ISO/IEC 10646 code points are within the range 0x0-0x10FFFF (inclusive).
-A surrogate code point is a value in the range 0xD800-0xDFFF (inclusive).
+ISO/IEC 10646 code points are integers in the range $[0, \mathrm{10FFFF}]$ (hexadecimal).
+A surrogate code point is a value in the range $[\mathrm{D800}, \mathrm{DFFF}]$ (hexadecimal).
 A control character is a character whose code point is
-in either of the ranges 0x0-0x1F or 0x7F-0x9F (both inclusive).
+in either of the ranges $[0, \mathrm{1F}]$ or $[\mathrm{7F}, \mathrm{9F}]$ (hexadecimal).
 \end{note}
 
 \pnum
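
Reviewer sketch (not part of the commit): the reworded rule in the hunks above can be read as a three-way classification of the hexadecimal value carried by a \grammarterm{universal-character-name}. The names `UcnStatus` and `classify_ucn` are hypothetical and chosen only for illustration, and the code assumes a C++17 compiler. Noncharacter and reserved code points are deliberately not rejected, since the new wording treats them as designating distinct characters.

```cpp
#include <cstdint>

// Hypothetical helper, not taken from the standard or from this commit.
enum class UcnStatus { WellFormed, NotACodePoint, Surrogate };

// Classify the number spelled by the hexadecimal digits of a UCN.
constexpr UcnStatus classify_ucn(std::uint32_t value) {
    // Code points are the integers in [0, 10FFFF] (hexadecimal).
    if (value > 0x10FFFF) return UcnStatus::NotACodePoint;
    // Surrogate code points are the integers in [D800, DFFF] (hexadecimal).
    if (value >= 0xD800 && value <= 0xDFFF) return UcnStatus::Surrogate;
    // Everything else is accepted, including noncharacters such as U+FFFE.
    return UcnStatus::WellFormed;
}

static_assert(classify_ucn(0x00E9) == UcnStatus::WellFormed);
static_assert(classify_ucn(0xFFFE) == UcnStatus::WellFormed);
static_assert(classify_ucn(0xD800) == UcnStatus::Surrogate);
static_assert(classify_ucn(0x110000) == UcnStatus::NotACodePoint);
```

Under this reading, a `\U` sequence whose first two digits are not `00` always falls in the `NotACodePoint` case, because its value exceeds 10FFFF (hexadecimal).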
@@ -1144,7 +1143,7 @@
 provided that the code point value
 can be encoded as a single UTF-8 code unit.
 \begin{note}
-That is, provided the code point value is in the range 0x0-0x7F (inclusive).
+That is, provided the code point value is in the range $[0, \mathrm{7F}]$ (hexadecimal).
 \end{note}
 If the value is not representable with a single UTF-8 code unit,
 the program is ill-formed.
@@ -1163,7 +1162,7 @@
 provided that the code point value is
 representable with a single 16-bit code unit.
 \begin{note}
-That is, provided the code point value is in the range 0x0-0xFFFF (inclusive).
+That is, provided the code point value is in the range $[0, \mathrm{FFFF}]$ (hexadecimal).
 \end{note}
 If the value is not representable
 with a single 16-bit code unit, the program is ill-formed.
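
Reviewer sketch (not part of the commit): the two notes on single-code-unit character literals reduce to simple range checks on the code point value. The function names below are hypothetical, and the checks only mirror the ranges quoted in the notes; they do not repeat the separate surrogate restriction from [lex.charset].

```cpp
#include <cstdint>

// Hypothetical helpers mirroring the ranges stated in the two notes.

// A UTF-8 character literal needs a value in [0, 7F] (hexadecimal).
constexpr bool fits_single_utf8_unit(std::uint32_t code_point) {
    return code_point <= 0x7F;
}

// A UTF-16 character literal needs a value in [0, FFFF] (hexadecimal).
constexpr bool fits_single_utf16_unit(std::uint32_t code_point) {
    return code_point <= 0xFFFF;
}

static_assert(fits_single_utf8_unit(0x41));      // 'A' fits in one UTF-8 code unit
static_assert(!fits_single_utf8_unit(0xE9));     // U+00E9 needs two UTF-8 code units
static_assert(fits_single_utf16_unit(0xE9));     // but fits in one 16-bit code unit
static_assert(!fits_single_utf16_unit(0x1F600)); // needs a surrogate pair in UTF-16
```

When a check like these fails for the value of a character literal, the corresponding paragraph makes the program ill-formed.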
@@ -1685,9 +1684,13 @@
 character requiring a surrogate pair, plus one for the terminating
 \tcode{u'\textbackslash 0'}. \begin{note} The size of a \tcode{char16_t}
 string literal is the number of code units, not the number of
-characters. \end{note} Within \tcode{char32_t} and \tcode{char16_t}
-string literals, any \grammarterm{universal-character-name}{s} shall be within the range
-\tcode{0x0} to \tcode{0x10FFFF}. The size of a narrow string literal is
+characters. \end{note}
+\begin{note}
+Any \grammarterm{universal-character-name}{s} are required to
+correspond to a code point in the range
+$[0, \mathrm{D800})$ or $[\mathrm{E000}, \mathrm{10FFFF}]$ (hexadecimal)\iref{lex.charset}.
+\end{note}
+The size of a narrow string literal is
 the total number of escape sequences and other characters, plus at least
 one for the multibyte encoding of each \grammarterm{universal-character-name}, plus
 one for the terminating \tcode{'\textbackslash 0'}.
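
Reviewer sketch (not part of the commit): the surrounding paragraph counts the size of a \tcode{char16_t} string literal in code units, with one extra unit for each character that requires a surrogate pair. The helper `utf16_units` is a hypothetical name used only to illustrate that count; the `static_assert`s assume a C++17 compiler.

```cpp
#include <cstdint>

// Hypothetical helper: number of 16-bit code units needed for one code point.
// Code points above FFFF (hexadecimal) are encoded as a surrogate pair.
constexpr int utf16_units(std::uint32_t code_point) {
    return code_point >= 0x10000 ? 2 : 1;
}

static_assert(utf16_units(0x0061) == 1);
static_assert(utf16_units(0x1F600) == 2);

// 'a' (1 unit) + U+1F600 (2 units) + terminating u'\0' (1 unit) = 4 code units.
static_assert(sizeof(u"a\U0001F600") / sizeof(char16_t) == 4);

// In a char32_t string literal every character is a single code unit:
// 'a' + U+1F600 + terminating U'\0' = 3 code units.
static_assert(sizeof(U"a\U0001F600") / sizeof(char32_t) == 3);
```

The literal contains two characters in both cases; only the number of code units differs, which is the point of the existing note.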
