P1139R2 Address wording issues related to ISO 10646 #2741

Merged
merged 1 commit into from
Mar 11, 2019
49 changes: 33 additions & 16 deletions source/lex.tex
@@ -208,21 +208,31 @@
 \end{bnf}

 The character designated by the \grammarterm{universal-character-name} \tcode{\textbackslash
-UNNNNNNNN} is that character whose character short name in ISO/IEC 10646 is
-\tcode{NNNNNNNN}; the character designated by the \grammarterm{universal-character-name}
-\tcode{\textbackslash uNNNN} is that character whose character short name in
-ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
-\grammarterm{universal-character-name} corresponds to a surrogate code point (in the
-range 0xD800--0xDFFF, inclusive), the program is ill-formed. Additionally, if
-the hexadecimal value for a \grammarterm{universal-character-name} outside
+U00NNNNNN} is that character
+that has \tcode{U+NNNNNN} as a code point short identifier;
+the character designated by the \grammarterm{universal-character-name}
+\tcode{\textbackslash uNNNN} is that character
+that has \tcode{U+NNNN} as a code point short identifier.
+If a \grammarterm{universal-character-name} does not correspond to
+a code point in ISO/IEC 10646 or
+if a \grammarterm{universal-character-name} corresponds to
+a surrogate code point,
+the program is ill-formed. Additionally, if
+a \grammarterm{universal-character-name} outside
 the \grammarterm{c-char-sequence}, \grammarterm{s-char-sequence}, or
 \grammarterm{r-char-sequence} of
 a character or
-string literal corresponds to a control character (in either of the
-ranges 0x00--0x1F or 0x7F--0x9F, both inclusive) or to a character in the basic
+string literal corresponds to a control character or
+to a character in the basic
 source character set, the program is ill-formed.\footnote{A sequence of characters resembling a \grammarterm{universal-character-name} in an
 \grammarterm{r-char-sequence}\iref{lex.string} does not form a
 \grammarterm{universal-character-name}.}
+\begin{note}
+ISO/IEC 10646 code points are within the range 0x0-0x10FFFF (inclusive).
+A surrogate code point is a value in the range 0xD800-0xDFFF (inclusive).
+A control character is a character whose code point is
+in either of the ranges 0x0-0x1F or 0x7F-0x9F (both inclusive).
+\end{note}

 \pnum
 The \defnx{basic execution character set}{character set!basic execution} and the
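
For reviewers who want to sanity-check the ranges in the new note, the rules in this hunk can be sketched as a few predicates. This is only an illustration, not part of the wording: every function name is hypothetical, and the basic source character set test assumes the 96-character set described in [lex.charset] (space, four control characters, and the 91 graphical characters, i.e. printable ASCII minus `$`, `@` and `` ` ``).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers; the numeric ranges are the ones given in the added note.
constexpr bool is_code_point(std::uint32_t v) { return v <= 0x10FFFF; }
constexpr bool is_surrogate(std::uint32_t v) { return v >= 0xD800 && v <= 0xDFFF; }
constexpr bool is_control(std::uint32_t v) {
    return v <= 0x1F || (v >= 0x7F && v <= 0x9F);
}

// Assumption: basic source character set = HT, LF, VT, FF, space, and
// printable ASCII except '$' (0x24), '@' (0x40) and '`' (0x60).
constexpr bool is_basic_source(std::uint32_t v) {
    return (v >= 0x09 && v <= 0x0C) ||
           (v >= 0x20 && v <= 0x7E && v != 0x24 && v != 0x40 && v != 0x60);
}

// First rule: the value must be a code point and must not be a surrogate.
constexpr bool ucn_value_ok(std::uint32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

// Second rule: outside character and string literals, control characters
// and basic source characters are rejected as well.
constexpr bool ucn_allowed_outside_literal(std::uint32_t v) {
    return ucn_value_ok(v) && !is_control(v) && !is_basic_source(v);
}

// Runtime spot checks of the classification above.
inline void check_ucn_examples() {
    assert(ucn_value_ok(0x10FFFF));               // last code point
    assert(!ucn_value_ok(0x110000));              // beyond ISO/IEC 10646
    assert(!ucn_value_ok(0xD800));                // surrogate
    assert(!ucn_allowed_outside_literal(0x41));   // 'A' is a basic source character
    assert(!ucn_allowed_outside_literal(0x85));   // control character
    assert(ucn_allowed_outside_literal(0xE9));    // U+00E9
}
```

The two-level split mirrors the wording: the first rule applies to every universal-character-name, while the second only applies outside literals.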
@@ -1132,8 +1142,10 @@
 The value of a UTF-8 character literal
 is equal to its ISO/IEC 10646 code point value,
 provided that the code point value
-is representable with a single UTF-8 code unit
-(that is, provided it is in the C0 Controls and Basic Latin Unicode block).
+can be encoded as a single UTF-8 code unit.
+\begin{note}
+That is, provided the code point value is in the range 0x0-0x7F (inclusive).
+\end{note}
 If the value is not representable with a single UTF-8 code unit,
 the program is ill-formed.
 A UTF-8 character literal containing multiple \grammarterm{c-char}{s} is ill-formed.
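
To make the 0x0-0x7F bound in the new note concrete, here is a rough sketch that counts how many UTF-8 code units a code point needs. The function names are hypothetical, and the length thresholds are the standard UTF-8 ones rather than anything stated in this PR. The `u8'A'` check is guarded because UTF-8 character literals only exist from C++17 on.

```cpp
#include <cassert>

// Hypothetical helper: number of UTF-8 code units for a code point.
// Precondition (not checked): cp <= 0x10FFFF and cp is not a surrogate.
constexpr int utf8_code_units(char32_t cp) {
    return cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
}

// The condition in the note: the code point fits in one UTF-8 code unit.
constexpr bool fits_single_utf8_code_unit(char32_t cp) {
    return utf8_code_units(cp) == 1;
}

#if __cplusplus >= 201703L
// A UTF-8 character literal has its code point as its value.
static_assert(u8'A' == 0x41, "value equals the code point");
#endif

// Runtime spot checks around the boundary.
inline void check_utf8_examples() {
    assert(fits_single_utf8_code_unit(0x7F));    // last single-unit code point
    assert(!fits_single_utf8_code_unit(0x80));   // needs two code units
    assert(utf8_code_units(0x20AC) == 3);        // U+20AC
}
```

Under this reading, something like `u8'\u00e9'` would be ill-formed, since U+00E9 needs two code units.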
@@ -1146,11 +1158,14 @@
 \indextext{prefix!\idxcode{u}}%
 is a character literal of type \tcode{char16_t},
 known as a \defn{UTF-16 character literal}.
-The value
-of a UTF-16 character literal containing a single \grammarterm{c-char} is
-equal to its ISO/IEC 10646 code point value, provided that the code point value is
-representable with a single 16-bit code unit (that is, provided it is in the
-basic multi-lingual plane). If the value is not representable
+The value of a UTF-16 character literal
+is equal to its ISO/IEC 10646 code point value,
+provided that the code point value is
+representable with a single 16-bit code unit.
+\begin{note}
+That is, provided the code point value is in the range 0x0-0xFFFF (inclusive).
+\end{note}
+If the value is not representable
 with a single 16-bit code unit, the program is ill-formed.
 A UTF-16 character literal
 containing multiple \grammarterm{c-char}{s} is ill-formed.
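
The same kind of check for the UTF-16 note, again only as a sketch with a hypothetical helper name. It mirrors the range in the note literally; surrogate values, which also lie in 0x0-0xFFFF, are already excluded by the universal-character-name rules in [lex.charset].

```cpp
#include <cassert>

// The condition in the note: the code point fits in one 16-bit code unit.
constexpr bool fits_single_utf16_code_unit(char32_t cp) {
    return cp <= 0xFFFF;
}

// A UTF-16 character literal has type char16_t and its code point as value.
static_assert(sizeof(u'a') == sizeof(char16_t), "type is char16_t");
static_assert(u'\u20AC' == 0x20AC, "value equals the code point");

// Runtime spot checks around the boundary.
inline void check_utf16_examples() {
    assert(fits_single_utf16_code_unit(0xFFFF));     // last single-unit code point
    assert(!fits_single_utf16_code_unit(0x10000));   // needs a surrogate pair
}
```

A literal such as `u'\U0001F600'` falls on the ill-formed side of this rule, because U+1F600 is above 0xFFFF.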
@@ -1562,6 +1577,8 @@
 A single \grammarterm{c-char} may
 produce more than one \tcode{char16_t} character in the form of
 surrogate pairs.
+A surrogate pair is a representation for a single code point
+as a sequence of two 16-bit code units.
 \end{note}

 \pnum
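
The new sentence defines a surrogate pair without saying how the two code units are derived. A minimal sketch of the usual UTF-16 computation follows; `SurrogatePair` and `to_surrogate_pair` are hypothetical names, and the arithmetic is the standard UTF-16 encoding form rather than anything introduced by this PR.

```cpp
#include <cassert>

// Hypothetical type holding the two 16-bit code units of a surrogate pair.
struct SurrogatePair {
    char16_t high;
    char16_t low;
};

// Precondition (not checked): 0x10000 <= cp <= 0x10FFFF.
// The 20-bit offset from 0x10000 is split into two 10-bit halves.
constexpr SurrogatePair to_surrogate_pair(char32_t cp) {
    return SurrogatePair{
        static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)),
        static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))};
}

// One c-char, two char16_t code units, plus the terminating null.
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3, "");
static_assert(u"\U0001F600"[0] == 0xD83D, "high surrogate");
static_assert(u"\U0001F600"[1] == 0xDE00, "low surrogate");

// Runtime spot check: the helper agrees with the compiler's encoding.
inline void check_surrogate_examples() {
    assert(to_surrogate_pair(0x1F600).high == 0xD83D);
    assert(to_surrogate_pair(0x1F600).low == 0xDE00);
}
```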
2 changes: 1 addition & 1 deletion source/preprocessor.tex
@@ -1549,7 +1549,7 @@
 An integer literal of the form \tcode{yyyymmL} (for example,
 \tcode{199712L}).
 If this symbol is defined, then every character in the Unicode required set, when
-stored in an object of type \tcode{wchar_t}, has the same value as the short identifier
+stored in an object of type \tcode{wchar_t}, has the same value as the code point
 of that character. The \defn{Unicode required set} consists of all
 the characters that are defined by ISO/IEC 10646, along with
 all amendments and technical corrigenda as of the specified year and month.
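
The surrounding description in the working draft belongs to the `__STDC_ISO_10646__` macro; the macro name itself is not visible in this hunk, so treat that as an assumption. Under that assumption, the guarantee can be sketched as below. The helper name `wchar_equals_code_point` and the constant `kHasIso10646Guarantee` are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: does a wchar_t value equal the given code point?
constexpr bool wchar_equals_code_point(wchar_t wc, char32_t cp) {
    return static_cast<char32_t>(wc) == cp;
}

#ifdef __STDC_ISO_10646__
// When the macro is defined, wide characters in the Unicode required set
// are stored as their code point values.
static_assert(wchar_equals_code_point(L'\u00E9', 0xE9), "");
static_assert(wchar_equals_code_point(L'\u20AC', 0x20AC), "");
constexpr bool kHasIso10646Guarantee = true;
#else
// Without the macro the implementation makes no such promise.
constexpr bool kHasIso10646Guarantee = false;
#endif

// Runtime spot check of the helper itself, independent of the macro.
inline void check_wchar_examples() {
    assert(wchar_equals_code_point(static_cast<wchar_t>(0x20AC), 0x20AC));
    assert(!wchar_equals_code_point(static_cast<wchar_t>(0x41), 0x42));
}
```

The wording change from "short identifier" to "code point" matters here: a short identifier is a textual label such as `U+20AC`, whereas the value stored in the `wchar_t` is the number 0x20AC.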