P1139R2 Address wording issues related to ISO 10646

Dawn Perchik · Dawn Perchik · commit fbc63564f9ae · 2019-03-05T11:33:44.000-08:00
[lex] Turn notes into separate sentences.
diff --git a/source/lex.tex b/source/lex.tex
@@ -208,21 +208,31 @@
 \end{bnf}
 
 The character designated by the \grammarterm{universal-character-name} \tcode{\textbackslash
-UNNNNNNNN} is that character whose character short name in ISO/IEC 10646 is
-\tcode{NNNNNNNN}; the character designated by the \grammarterm{universal-character-name}
-\tcode{\textbackslash uNNNN} is that character whose character short name in
-ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
-\grammarterm{universal-character-name} corresponds to a surrogate code point (in the
-range 0xD800--0xDFFF, inclusive), the program is ill-formed. Additionally, if
-the hexadecimal value for a \grammarterm{universal-character-name} outside
+U00NNNNNN} is that character
+that has \tcode{U+NNNNNN} as a code point short identifier;
+the character designated by the \grammarterm{universal-character-name}
+\tcode{\textbackslash uNNNN} is that character
+that has \tcode{U+NNNN} as a code point short identifier.
+If a \grammarterm{universal-character-name} does not correspond to
+a code point in ISO/IEC 10646 or
+if a \grammarterm{universal-character-name} corresponds to
+a surrogate code point,
+the program is ill-formed. Additionally, if
+a \grammarterm{universal-character-name} outside
 the \grammarterm{c-char-sequence}, \grammarterm{s-char-sequence}, or
 \grammarterm{r-char-sequence} of
 a character or
-string literal corresponds to a control character (in either of the
-ranges 0x00--0x1F or 0x7F--0x9F, both inclusive) or to a character in the basic
+string literal corresponds to a control character or
+to a character in the basic
 source character set, the program is ill-formed.\footnote{A sequence of characters resembling a \grammarterm{universal-character-name} in an
 \grammarterm{r-char-sequence}\iref{lex.string} does not form a
 \grammarterm{universal-character-name}.}
+\begin{note}
+ISO/IEC 10646 code points are within the range 0x0-0x10FFFF (inclusive).
+A surrogate code point is a value in the range 0xD800-0xDFFF (inclusive).
+A control character is a character whose code point is
+in either of the ranges 0x0-0x1F or 0x7F-0x9F (both inclusive).
+\end{note}
 
 \pnum
 The \defnx{basic execution character set}{character set!basic execution} and the
@@ -1132,8 +1142,10 @@
 The value of a UTF-8 character literal
 is equal to its ISO/IEC 10646 code point value,
 provided that the code point value
-is representable with a single UTF-8 code unit
-(that is, provided it is in the C0 Controls and Basic Latin Unicode block).
+can be encoded as a single UTF-8 code unit.
+\begin{note}
+That is, provided the code point value is in the range 0x0-0x7F (inclusive).
+\end{note}
 If the value is not representable with a single UTF-8 code unit,
 the program is ill-formed.
 A UTF-8 character literal containing multiple \grammarterm{c-char}{s} is ill-formed.
@@ -1148,8 +1160,11 @@
 is a character literal of type \tcode{char16_t}. The value
 of a \tcode{char16_t} character literal containing a single \grammarterm{c-char} is
 equal to its ISO/IEC 10646 code point value, provided that the code point value is
-representable with a single 16-bit code unit (that is, provided it is in the
-basic multi-lingual plane). If the value is not representable
+representable with a single 16-bit code unit.
+\begin{note}
+That is, provided the code point value is in the range 0x0-0xFFFF (inclusive).
+\end{note}
+If the value is not representable
 with a single 16-bit code unit, the program is ill-formed. A \tcode{char16_t} character literal
 containing multiple \grammarterm{c-char}{s} is ill-formed.
 
@@ -1554,6 +1569,10 @@
 is initialized with the given characters. A single \grammarterm{c-char} may
 produce more than one \tcode{char16_t} character in the form of
 surrogate pairs.
+\begin{note}
+A surrogate pair is a representation for a single code point
+as a sequence of two 16-bit code units.
+\end{note}
 
 \pnum
 \indextext{literal!string!\idxcode{char32_t}}%
diff --git a/source/preprocessor.tex b/source/preprocessor.tex
@@ -1549,7 +1549,7 @@
 An integer literal of the form \tcode{yyyymmL} (for example,
 \tcode{199712L}).
 If this symbol is defined, then every character in the Unicode required set, when
-stored in an object of type \tcode{wchar_t}, has the same value as the short identifier
+stored in an object of type \tcode{wchar_t}, has the same value as the code point
 of that character. The \defn{Unicode required set} consists of all
 the characters that are defined by ISO/IEC 10646, along with
 all amendments and technical corrigenda as of the specified year and month.
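
The ranges stated in the notes added to lex.tex above can be restated as a few predicates. The following is a minimal sketch, not part of the commit or of the standard's wording: every helper name (`is_code_point`, `is_surrogate`, `is_control`, `fits_one_utf8_unit`, `fits_one_utf16_unit`, `to_surrogate_pair`, `SurrogatePair`) is hypothetical, and C++14 or later is assumed for the `constexpr` bodies. The surrogate pair arithmetic is taken from the UTF-16 encoding form; the diff itself only says that a surrogate pair is two 16-bit code units.

```cpp
#include <cstdint>

// Hypothetical helpers that mirror the ranges given in the added notes.

// ISO/IEC 10646 code points lie in 0x0-0x10FFFF (inclusive).
constexpr bool is_code_point(std::uint32_t v) { return v <= 0x10FFFF; }

// A surrogate code point lies in 0xD800-0xDFFF (inclusive).
constexpr bool is_surrogate(std::uint32_t v) {
    return v >= 0xD800 && v <= 0xDFFF;
}

// A control character lies in 0x0-0x1F or 0x7F-0x9F (both inclusive).
constexpr bool is_control(std::uint32_t v) {
    return v <= 0x1F || (v >= 0x7F && v <= 0x9F);
}

// A UTF-8 character literal needs a code point in 0x0-0x7F (inclusive).
constexpr bool fits_one_utf8_unit(std::uint32_t v) { return v <= 0x7F; }

// A char16_t character literal needs a code point in 0x0-0xFFFF (inclusive).
constexpr bool fits_one_utf16_unit(std::uint32_t v) { return v <= 0xFFFF; }

struct SurrogatePair {
    char16_t high;
    char16_t low;
};

// UTF-16 encoding of a code point above 0xFFFF as two 16-bit code units.
// Precondition (not checked here): 0x10000 <= v <= 0x10FFFF.
constexpr SurrogatePair to_surrogate_pair(std::uint32_t v) {
    const std::uint32_t offset = v - 0x10000;  // 20-bit value
    return {static_cast<char16_t>(0xD800 + (offset >> 10)),     // top 10 bits
            static_cast<char16_t>(0xDC00 + (offset & 0x3FF))};  // low 10 bits
}

// Compile-time spot checks of the boundaries named in the notes.
static_assert(is_code_point(0x10FFFF) && !is_code_point(0x110000), "");
static_assert(is_surrogate(0xD800) && !is_surrogate(0xE000), "");
static_assert(is_control(0x9F) && !is_control(0xA0), "");
static_assert(fits_one_utf8_unit(0x7F) && !fits_one_utf8_unit(0x80), "");
static_assert(fits_one_utf16_unit(0xFFFF) && !fits_one_utf16_unit(0x10000), "");
static_assert(to_surrogate_pair(0x1F600).high == 0xD83D, "");
static_assert(to_surrogate_pair(0x1F600).low == 0xDE00, "");
```

Under these assumptions, a \grammarterm{universal-character-name} with value `v` would be rejected when `!is_code_point(v) || is_surrogate(v)`, and a single \grammarterm{c-char} in a `char16_t` string literal would produce two code units exactly when `v` is a code point and `!fits_one_utf16_unit(v)`.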