Commit 43c44f3

committed
P2736R2 Referencing The Unicode Standard
1 parent 9ce105b commit 43c44f3

7 files changed

+56
-199
lines changed

source/back.tex

Lines changed: 0 additions & 18 deletions
Original file line number | Diff line number | Diff line change
@@ -18,24 +18,6 @@ \chapter{Bibliography}
1818
Programming languages, their environments, and system software interfaces ---
1919
Floating-point extensions for C --- Part 3: Interchange and extended types}
2020
% Other international standards.
21-
\item
22-
%%% Format for the following entry is based on that specified at
23-
%%% http://www.iec.ch/standardsdev/resources/draftingpublications/directives/principles/referencing.htm
24-
The Unicode Consortium. Unicode Standard Annex, \UAX{29},
25-
\doccite{Unicode Text Segmentation} [online].
26-
Edited by Mark Davis. Revision 35; issued for Unicode 12.0.0. 2019-02-15 [viewed 2020-02-23].
27-
Available from: \url{http://www.unicode.org/reports/tr29/tr29-35.html}
28-
\item
29-
The Unicode Consortium. Unicode Standard Annex, \UAX{31},
30-
\doccite{Unicode Identifier and Pattern Syntax} [online].
31-
Edited by Mark Davis. Revision 33; issued for Unicode 13.0.0.
32-
2020-02-13 [viewed 2021-06-08].
33-
Available from: \url{https://www.unicode.org/reports/tr31/tr31-33.html}
34-
\item
35-
The Unicode Standard Version 14.0,
36-
\doccite{Core Specification}.
37-
Unicode Consortium, ISBN 978-1-936213-29-0, copyright \copyright 2021 Unicode, Inc.
38-
Available from: \url{https://www.unicode.org/versions/Unicode14.0.0/UnicodeStandard-14.0.pdf}
3921
\item
4022
IANA Time Zone Database.
4123
Available from: \url{https://www.iana.org/time-zones}

source/future.tex

Lines changed: 6 additions & 11 deletions
@@ -2036,6 +2036,10 @@
20362036
If \tcode{(Mode \& little_endian)}, the facet shall generate a
20372037
multibyte sequence in little-endian order,
20382038
as opposed to the default big-endian order.
2039+
\item
2040+
UCS-2 is the same encoding as UTF-16,
2041+
except that it encodes scalar values in the range
2042+
\ucode{0000}--\ucode{ffff} (Basic Multilingual Plane) only.
20392043
\end{itemize}
20402044

20412045
\pnum
@@ -2046,8 +2050,7 @@
20462050
\begin{itemize}
20472051
\item
20482052
The facet shall convert between UTF-8 multibyte sequences
2049-
and UCS-2 or UTF-32 (depending on the size of \tcode{Elem})
2050-
within the program.
2053+
and UCS-2 or UTF-32 (depending on the size of \tcode{Elem}).
20512054
\item
20522055
Endianness shall not affect how multibyte sequences are read or written.
20532056
\item
@@ -2062,8 +2065,7 @@
20622065
\begin{itemize}
20632066
\item
20642067
The facet shall convert between UTF-16 multibyte sequences
2065-
and UCS-2 or UTF-32 (depending on the size of \tcode{Elem})
2066-
within the program.
2068+
and UCS-2 or UTF-32 (depending on the size of \tcode{Elem}).
20672069
\item
20682070
Multibyte sequences shall be read or written
20692071
according to the \tcode{Mode} flag, as set out above.
@@ -2086,13 +2088,6 @@
20862088
The multibyte sequences may be written as either a text or a binary file.
20872089
\end{itemize}
20882090

2089-
\pnum
2090-
The encoding forms UTF-8, UTF-16, and UTF-32 are specified in ISO/IEC 10646.
2091-
The encoding form UCS-2 is specified in ISO/IEC 10646:2003.
2092-
\begin{footnote}
2093-
Cancelled and replaced by ISO/IEC 10646:2017.
2094-
\end{footnote}
2095-
20962091
\rSec1[depr.conversions]{Deprecated convenience conversion interfaces}
20972092

20982093
\rSec2[depr.conversions.general]{General}
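
The added item describes UCS-2 as UTF-16 restricted to the Basic Multilingual Plane. A minimal C++ sketch of that relationship follows. The helper names `encode_utf16` and `encode_ucs2` are hypothetical, chosen only for illustration; they are not part of the standard library or of this commit, and the deprecated facets are not required to be implemented this way.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: true if cp is a Unicode scalar value,
// i.e. a code point in [0, 10FFFF] that is not a surrogate.
constexpr bool is_scalar(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical helper: UTF-16 encodes BMP scalar values as one code unit
// and all other scalar values as a surrogate pair.
// Returns an empty vector if cp is not a scalar value.
inline std::vector<char16_t> encode_utf16(std::uint32_t cp) {
    if (!is_scalar(cp)) return {};
    if (cp <= 0xFFFF) return {static_cast<char16_t>(cp)};
    cp -= 0x10000;
    return {static_cast<char16_t>(0xD800 + (cp >> 10)),
            static_cast<char16_t>(0xDC00 + (cp & 0x3FF))};
}

// Hypothetical helper: UCS-2 yields the same code unit as UTF-16 for
// scalar values in 0000..FFFF and cannot encode anything beyond that range.
inline std::optional<char16_t> encode_ucs2(std::uint32_t cp) {
    if (!is_scalar(cp) || cp > 0xFFFF) return std::nullopt;
    return static_cast<char16_t>(cp);
}
```

For a BMP scalar value such as U+20AC both functions produce the same single code unit; for U+1F514 only `encode_utf16` produces a result.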

source/intro.tex

Lines changed: 2 additions & 22 deletions
@@ -51,14 +51,6 @@
5151
Operating System Interface (POSIX), Technical Corrigendum 1}
5252
\item ISO/IEC/IEEE 9945:2009/Cor 2:2017, \doccite{Information Technology --- Portable
5353
Operating System Interface (POSIX), Technical Corrigendum 2}
54-
\item ISO/IEC 10646, \doccite{Information technology ---
55-
Universal Coded Character Set (UCS)}
56-
\item ISO/IEC 10646:2003,
57-
\begin{footnote}
58-
Cancelled and replaced by ISO/IEC 10646:2017.
59-
\end{footnote}
60-
\doccite{Information technology ---
61-
Universal Multiple-Octet Coded Character Set (UCS)}
6254
\item ISO/IEC/IEEE 60559:2020, \doccite{Information technology ---
6355
Microprocessor Systems --- Floating-Point arithmetic}
6456
\item ISO 80000-2:2009, \doccite{Quantities and units ---
@@ -75,14 +67,8 @@
7567
Language Specification},
7668
Standard Ecma-262, third edition, 1999.
7769
\item
78-
The Unicode Consortium.
79-
Unicode Standard Annex, \UAX{44}, \doccite{Unicode Character Database}.
80-
Edited by Ken Whistler and Lauren\c{t}iu Iancu.
81-
Available from: \url{http://www.unicode.org/reports/tr44/}
82-
\item
83-
The Unicode Consortium.
84-
The Unicode Standard, \doccite{Derived Core Properties}.
85-
Available from: \url{https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt}
70+
The Unicode Consortium. \doccite{The Unicode Standard}.
71+
Available from: \url{https://www.unicode.org/versions/latest/}
8672
\end{itemize}
8773

8874
\pnum
@@ -104,12 +90,6 @@
10490
hereinafter called \defn{ECMA-262}.
10591
\indextext{references!normative|)}
10692

107-
\pnum
108-
\begin{note}
109-
References to ISO/IEC 10646:2003 are used only
110-
to support deprecated features\iref{depr.locale.stdcvt}.
111-
\end{note}
112-
11393
\rSec0[intro.defs]{Terms and definitions}
11494

11595
\pnum

source/iostreams.tex

Lines changed: 2 additions & 2 deletions
@@ -6850,7 +6850,7 @@
68506850
if invoking the native Unicode API requires transcoding,
68516851
implementations should substitute invalid code units
68526852
with \unicode{fffd}{replacement character} per
6853-
The Unicode Standard Version 14.0 - Core Specification, Chapter 3.9.
6853+
the Unicode Standard, Chapter 3.9 \ucode{fffd} Substitution in Conversion.
68546854
\end{itemdescr}
68556855

68566856
\rSec3[ostream.unformatted]{Unformatted output functions}
@@ -7786,7 +7786,7 @@
77867786
If invoking the native Unicode API requires transcoding,
77877787
implementations should substitute invalid code units
77887788
with \unicode{fffd}{replacement character} per
7789-
The Unicode Standard Version 14.0 - Core Specification, Chapter 3.9.
7789+
the Unicode Standard, Chapter 3.9 \ucode{fffd} Substitution in Conversion.
77907790
\end{itemdescr}
77917791

77927792
\indexlibraryglobal{vprint_nonunicode}%
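
The recommended practice in both hunks is to substitute U+FFFD REPLACEMENT CHARACTER for invalid code units when transcoding. A minimal C++ sketch of the idea for UTF-16 input follows. The function name `utf16_to_utf32_lossy` is hypothetical, and the sketch assumes that each unpaired surrogate is replaced by one U+FFFD; it does not claim to be how any implementation of these output functions actually transcodes.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: converts UTF-16 to UTF-32, replacing every
// unpaired surrogate code unit with U+FFFD.
inline std::u32string utf16_to_utf32_lossy(std::u16string_view in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t u = in[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        const bool low = u >= 0xDC00 && u <= 0xDFFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Well-formed surrogate pair: combine into one scalar value.
            const char32_t lo = in[i + 1];
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
            ++i;
        } else if (high || low) {
            // Unpaired surrogate: invalid code unit, substitute U+FFFD.
            out.push_back(0xFFFD);
        } else {
            out.push_back(u);
        }
    }
    return out;
}
```

Well-formed input passes through unchanged, so the substitution only affects code units that could not be decoded.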

source/lex.tex

Lines changed: 31 additions & 130 deletions
@@ -80,8 +80,10 @@
8080
\end{note}
8181
If an input file is determined to be a UTF-8 file,
8282
then it shall be a well-formed UTF-8 code unit sequence and
83-
it is decoded to produce a sequence of UCS scalar values
84-
that constitutes the sequence of elements of the translation character set.
83+
it is decoded to produce a sequence of Unicode scalar values.
84+
A sequence of translation character set elements is then formed
85+
by mapping each Unicode scalar value
86+
to the corresponding translation character set element.
8587
In the resulting sequence,
8688
each pair of characters in the input sequence consisting of
8789
\unicode{000d}{carriage return} followed by \unicode{000a}{line feed},
@@ -242,18 +244,17 @@
242244
The \defnadj{translation}{character set} consists of the following elements:
243245
\begin{itemize}
244246
\item
245-
each character named by ISO/IEC 10646,
246-
as identified by its unique UCS scalar value, and
247+
each abstract character assigned a code point in the Unicode codespace, and
247248
\item
248-
a distinct character for each UCS scalar value
249-
where no named character is assigned.
249+
a distinct character for each Unicode scalar value
250+
not assigned to an abstract character.
250251
\end{itemize}
251252
\begin{note}
252-
ISO/IEC 10646 code points are integers
253+
Unicode code points are integers
253254
in the range $[0, \mathrm{10FFFF}]$ (hexadecimal).
254255
A surrogate code point is a value
255256
in the range $[\mathrm{D800}, \mathrm{DFFF}]$ (hexadecimal).
256-
A UCS scalar value is any code point that is not a surrogate code point.
257+
A Unicode scalar value is any code point that is not a surrogate code point.
257258
\end{note}
258259

259260
\pnum
@@ -353,126 +354,27 @@
353354
\tcode{\textbackslash U} \grammarterm{hex-quad} \grammarterm{hex-quad}, or
354355
\tcode{\textbackslash u\{\grammarterm{simple-hexadecimal-digit-sequence}\}}
355356
designates the character in the translation character set
356-
whose UCS scalar value is the hexadecimal number represented by
357+
whose Unicode scalar value is the hexadecimal number represented by
357358
the sequence of \grammarterm{hexadecimal-digit}s
358359
in the \grammarterm{universal-character-name}.
359-
The program is ill-formed if that number is not a UCS scalar value.
360+
The program is ill-formed if that number is not a Unicode scalar value.
360361

361362
\pnum
362363
A \grammarterm{universal-character-name}
363364
that is a \grammarterm{named-universal-character}
364-
designates the character named by its \grammarterm{n-char-sequence}.
365-
A character is so named if the \grammarterm{n-char-sequence} is equal to
366-
\begin{itemize}
367-
\item
368-
the associated character name or associated character name alias
369-
specified in ISO/IEC 10646 subclause ``Code charts and lists of character names''
370-
or
371-
\item
372-
the control code alias given in \tref{lex.charset.ucn}.
365+
designates the corresponding character
366+
in the Unicode Standard (chapter 4.8 Name)
367+
if the \grammarterm{n-char-sequence} is equal
368+
to its character name or
369+
to one of its character name aliases of
370+
type ``control'', ``correction'', or ``alternate'';
371+
otherwise, the program is ill-formed.
373372
\begin{note}
374-
The aliases in \tref{lex.charset.ucn} are provided for control characters
375-
which otherwise have no associated character name or character name alias.
376-
These names are derived from
373+
These aliases are listed in
377374
the Unicode Character Database's \tcode{NameAliases.txt}.
378-
For historical reasons, control characters are formally unnamed.
379-
\end{note}
380-
\end{itemize}
381-
\begin{note}
382-
None of the associated character names,
383-
associated character name aliases, or
384-
control code aliases
385-
have leading or trailing spaces.
375+
None of these names or aliases have leading or trailing spaces.
386376
\end{note}
387377

388-
\begin{multicolfloattable}{Control code aliases}{lex.charset.ucn}{ll}
389-
\unicode{0000}{null} \\
390-
\unicode{0001}{start of heading} \\
391-
\unicode{0002}{start of text} \\
392-
\unicode{0003}{end of text} \\
393-
\unicode{0004}{end of transmission} \\
394-
\unicode{0005}{enquiry} \\
395-
\unicode{0006}{acknowledge} \\
396-
\unicode{0007}{alert} \\
397-
\unicode{0008}{backspace} \\
398-
\unicode{0009}{character tabulation} \\
399-
\unicode{0009}{horizontal tabulation} \\
400-
\unicode{000a}{line feed} \\
401-
\unicode{000a}{new line} \\
402-
\unicode{000a}{end of line} \\
403-
\unicode{000b}{line tabulation} \\
404-
\unicode{000b}{vertical tabulation} \\
405-
\unicode{000c}{form feed} \\
406-
\unicode{000d}{carriage return} \\
407-
\unicode{000e}{shift out} \\
408-
\unicode{000e}{locking-shift one} \\
409-
\unicode{000f}{shift in} \\
410-
\unicode{000f}{locking-shift zero} \\
411-
\unicode{0010}{data link escape} \\
412-
\unicode{0011}{device control one} \\
413-
\unicode{0012}{device control two} \\
414-
\unicode{0013}{device control three} \\
415-
\unicode{0014}{device control four} \\
416-
\unicode{0015}{negative acknowledge} \\
417-
\unicode{0016}{synchronous idle} \\
418-
\unicode{0017}{end of transmission block} \\
419-
\unicode{0018}{cancel} \\
420-
\unicode{0019}{end of medium} \\
421-
\unicode{001a}{substitute} \\
422-
\unicode{001b}{escape} \\
423-
\unicode{001c}{information separator four} \\
424-
\unicode{001c}{file separator} \\
425-
\unicode{001d}{information separator three} \\
426-
\unicode{001d}{group separator} \\
427-
\unicode{001e}{information separator two} \\
428-
\unicode{001e}{record separator} \\
429-
\unicode{001f}{information separator one} \\
430-
\unicode{001f}{unit separator} \\
431-
\columnbreak
432-
\unicode{007f}{delete} \\
433-
\unicode{0082}{break permitted here} \\
434-
\unicode{0083}{no break here} \\
435-
\unicode{0084}{index} \\
436-
\unicode{0085}{next line} \\
437-
\unicode{0086}{start of selected area} \\
438-
\unicode{0087}{end of selected area} \\
439-
\unicode{0088}{character tabulation set} \\
440-
\unicode{0088}{horizontal tabulation set} \\
441-
\unicode{0089}{character tabulation with justification} \\
442-
\unicode{0089}{horizontal tabulation with justification} \\
443-
\unicode{008a}{line tabulation set} \\
444-
\unicode{008a}{vertical tabulation set} \\
445-
\unicode{008b}{partial line forward} \\
446-
\unicode{008b}{partial line down} \\
447-
\unicode{008c}{partial line backward} \\
448-
\unicode{008c}{partial line up} \\
449-
\unicode{008d}{reverse line feed} \\
450-
\unicode{008d}{reverse index} \\
451-
\unicode{008e}{single shift two} \\
452-
\unicode{008e}{single-shift-2} \\
453-
\unicode{008f}{single shift three} \\
454-
\unicode{008f}{single-shift-3} \\
455-
\unicode{0090}{device control string} \\
456-
\unicode{0091}{private use one} \\
457-
\unicode{0091}{private use-1} \\
458-
\unicode{0092}{private use two} \\
459-
\unicode{0092}{private use-2} \\
460-
\unicode{0093}{set transmit state} \\
461-
\unicode{0094}{cancel character} \\
462-
\unicode{0095}{message waiting} \\
463-
\unicode{0096}{start of guarded area} \\
464-
\unicode{0096}{start of protected area} \\
465-
\unicode{0097}{end of guarded area} \\
466-
\unicode{0097}{end of protected area} \\
467-
\unicode{0098}{start of string} \\
468-
\unicode{009a}{single character introducer} \\
469-
\unicode{009b}{control sequence introducer} \\
470-
\unicode{009c}{string terminator} \\
471-
\unicode{009d}{operating system command} \\
472-
\unicode{009e}{privacy message} \\
473-
\unicode{009f}{application program command} \\
474-
\end{multicolfloattable}
475-
476378
\pnum
477379
If a \grammarterm{universal-character-name} outside
478380
the \grammarterm{c-char-sequence}, \grammarterm{s-char-sequence}, or
@@ -491,10 +393,6 @@
491393
The \defnadj{basic literal}{character set} consists of
492394
all characters of the basic character set,
493395
plus the control characters specified in \tref{lex.charset.literal}.
494-
\begin{note}
495-
The alias \uname{bell} for \ucode{0007} shown in ISO 10646
496-
is ambiguous with \unicode{1f514}{bell}.
497-
\end{note}
498396

499397
\begin{floattable}{Additional control characters in the basic literal character set}{lex.charset.literal}{ll}
500398
\topline
@@ -544,9 +442,10 @@
544442
\indextext{UTF-16}%
545443
\indextext{UTF-32}%
546444
For a UTF-8, UTF-16, or UTF-32 literal,
547-
the UCS scalar value
445+
the Unicode scalar value
548446
corresponding to each character of the translation character set
549-
is encoded as specified in ISO/IEC 10646 for the respective UCS encoding form.
447+
is encoded as specified in the Unicode Standard
448+
for the respective Unicode encoding form.
550449
\indextext{character set|)}
551450

552451
\rSec1[lex.pptoken]{Preprocessing tokens}
@@ -887,14 +786,14 @@
887786
\begin{bnf}
888787
\nontermdef{identifier-start}\br
889788
nondigit\br
890-
\textnormal{an element of the translation character set of class XID_Start}
789+
\textnormal{an element of the translation character set with the Unicode property XID_Start}
891790
\end{bnf}
892791

893792
\begin{bnf}
894793
\nontermdef{identifier-continue}\br
895794
digit\br
896795
nondigit\br
897-
\textnormal{an element of the translation character set of class XID_Continue}
796+
\textnormal{an element of the translation character set with the Unicode property XID_Continue}
898797
\end{bnf}
899798

900799
\begin{bnf}
@@ -913,8 +812,9 @@
913812
\pnum
914813
\indextext{name!length of}%
915814
\indextext{name}%
916-
The character classes XID_Start and XID_Continue
917-
are Derived Core Properties as described by \UAX{44}.
815+
\begin{note}
816+
The character properties XID_Start and XID_Continue are Derived Core Properties
817+
as described by \UAX{44} of the Unicode Standard.
918818
\begin{footnote}
919819
On systems in which linkers cannot accept extended
920820
characters, an encoding of the \grammarterm{universal-character-name} can be used in
@@ -925,9 +825,10 @@
925825
place a translation limit on significant characters for external
926826
identifiers.
927827
\end{footnote}
828+
\end{note}
928829
The program is ill-formed
929830
if an \grammarterm{identifier} does not conform to
930-
Normalization Form C as specified in ISO/IEC 10646.
831+
Normalization Form C as specified in the Unicode Standard.
931832
\begin{note}
932833
Identifiers are case-sensitive.
933834
\end{note}
@@ -2099,7 +2000,7 @@
20992000
\impldef{code unit sequence for non-representable \grammarterm{string-literal}}
21002001
code unit sequence is encoded.
21012002
\begin{note}
2102-
No character lacks representation in any of the UCS encoding forms.
2003+
No character lacks representation in any Unicode encoding form.
21032004
\end{note}
21042005
When encoding a stateful character encoding,
21052006
implementations should encode the first such sequence
