From c45d32d660d98f968cb81925169535b76c8fccac Mon Sep 17 00:00:00 2001
From: Dawn Perchik
Date: Thu, 21 Feb 2019 19:45:34 -1000
Subject: [PATCH] P1139R2 Address wording issues related to ISO 10646

[lex] Turn notes into separate sentences.
---
 source/lex.tex          | 49 +++++++++++++++++++++++++++--------------
 source/preprocessor.tex |  2 +-
 2 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/source/lex.tex b/source/lex.tex
index 3dec13ec8e..b623fef534 100644
--- a/source/lex.tex
+++ b/source/lex.tex
@@ -208,21 +208,31 @@
 \end{bnf}

 The character designated by the \grammarterm{universal-character-name} \tcode{\textbackslash
-UNNNNNNNN} is that character whose character short name in ISO/IEC 10646 is
-\tcode{NNNNNNNN}; the character designated by the \grammarterm{universal-character-name}
-\tcode{\textbackslash uNNNN} is that character whose character short name in
-ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
-\grammarterm{universal-character-name} corresponds to a surrogate code point (in the
-range 0xD800--0xDFFF, inclusive), the program is ill-formed. Additionally, if
-the hexadecimal value for a \grammarterm{universal-character-name} outside
+U00NNNNNN} is that character
+that has \tcode{U+NNNNNN} as a code point short identifier;
+the character designated by the \grammarterm{universal-character-name}
+\tcode{\textbackslash uNNNN} is that character
+that has \tcode{U+NNNN} as a code point short identifier.
+If a \grammarterm{universal-character-name} does not correspond to
+a code point in ISO/IEC 10646 or
+if a \grammarterm{universal-character-name} corresponds to
+a surrogate code point,
+the program is ill-formed. Additionally, if
+a \grammarterm{universal-character-name} outside
 the \grammarterm{c-char-sequence}, \grammarterm{s-char-sequence}, or
 \grammarterm{r-char-sequence} of a character or
-string literal corresponds to a control character (in either of the
-ranges 0x00--0x1F or 0x7F--0x9F, both inclusive) or to a character in the basic
+string literal corresponds to a control character or
+to a character in the basic
 source character set, the program is ill-formed.\footnote{A sequence of characters resembling
 a \grammarterm{universal-character-name} in an
 \grammarterm{r-char-sequence}\iref{lex.string} does not form a
 \grammarterm{universal-character-name}.}
+\begin{note}
+ISO/IEC 10646 code points are within the range 0x0-0x10FFFF (inclusive).
+A surrogate code point is a value in the range 0xD800-0xDFFF (inclusive).
+A control character is a character whose code point is
+in either of the ranges 0x0-0x1F or 0x7F-0x9F (both inclusive).
+\end{note}

 \pnum
 The \defnx{basic execution character set}{character set!basic execution} and the
@@ -1132,8 +1142,10 @@
 The value of a UTF-8 character literal
 is equal to its ISO/IEC 10646 code point value,
 provided that the code point value
-is representable with a single UTF-8 code unit
-(that is, provided it is in the C0 Controls and Basic Latin Unicode block).
+can be encoded as a single UTF-8 code unit.
+\begin{note}
+That is, provided the code point value is in the range 0x0-0x7F (inclusive).
+\end{note}
 If the value is not representable with a single UTF-8 code unit,
 the program is ill-formed.
 A UTF-8 character literal containing multiple \grammarterm{c-char}{s} is ill-formed.
@@ -1146,11 +1158,14 @@
 \indextext{prefix!\idxcode{u}}%
 is a character literal of type \tcode{char16_t},
 known as a \defn{UTF-16 character literal}.
-The value
-of a UTF-16 character literal containing a single \grammarterm{c-char} is
-equal to its ISO/IEC 10646 code point value, provided that the code point value is
-representable with a single 16-bit code unit (that is, provided it is in the
-basic multi-lingual plane). If the value is not representable
+The value of a UTF-16 character literal
+is equal to its ISO/IEC 10646 code point value,
+provided that the code point value is
+representable with a single 16-bit code unit.
+\begin{note}
+That is, provided the code point value is in the range 0x0-0xFFFF (inclusive).
+\end{note}
+If the value is not representable
 with a single 16-bit code unit, the program is ill-formed.
 A UTF-16 character literal containing multiple \grammarterm{c-char}{s} is ill-formed.

@@ -1562,6 +1577,8 @@
 A single
 \grammarterm{c-char} may produce more than one \tcode{char16_t} character in the
 form of surrogate pairs.
+A surrogate pair is a representation for a single code point
+as a sequence of two 16-bit code units.
 \end{note}

 \pnum
diff --git a/source/preprocessor.tex b/source/preprocessor.tex
index e43db6a9d7..1bc22aa914 100644
--- a/source/preprocessor.tex
+++ b/source/preprocessor.tex
@@ -1549,7 +1549,7 @@
 An integer literal of the form \tcode{yyyymmL} (for example,
 \tcode{199712L}).
 If this symbol is defined, then every character in the Unicode required set, when
-stored in an object of type \tcode{wchar_t}, has the same value as the short identifier
+stored in an object of type \tcode{wchar_t}, has the same value as the code point
 of that character. The \defn{Unicode required set} consists of all the characters
 that are defined by ISO/IEC 10646, along with all amendments and technical
 corrigenda as of the specified year and month.