From 4221ff97eb53f5372ced79a9cfde32e82eaa8220 Mon Sep 17 00:00:00 2001
From: Kevin Gibbons
Date: Mon, 8 Apr 2024 20:49:38 -0700
Subject: [PATCH 1/5] use identity escapes where possible

---
 spec.emu | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/spec.emu b/spec.emu
index 83fab36..d47ec75 100644
--- a/spec.emu
+++ b/spec.emu
@@ -52,12 +52,14 @@ contributors:
 description
-It returns a string representing a |Pattern| for matching _c_. If _c_ is white space or an ASCII punctuator, the returned value is an escape sequence (corresponding with |HexEscapeSequence| if possible, or otherwise with |RegExpUnicodeEscapeSequence|). Otherwise, the returned value is a string representation of _c_ itself.
+It returns a string representing a |Pattern| for matching _c_. If _c_ is white space or an ASCII punctuator, the returned value is an escape sequence. Otherwise, the returned value is a string representation of _c_ itself.
-  1. Let _punctuators_ be the string-concatenation of *"(){}[]|,.?\*+-^$=<>/#&!%:;@~'`"*, the code unit 0x0022 (QUOTATION MARK), and the code unit 0x005C (REVERSE SOLIDUS).
-  1. Let _toEscape_ be StringToCodePoints(_punctuators_).
+  1. If _c_ is matched by |SyntaxCharacter| or _c_ is U+002F (SOLIDUS), then
+    1. Return the string-concatenation of 0x005C (REVERSE SOLIDUS) and UTF16EncodeCodePoint(_c_).
+  1. Let _otherPunctuators_ be the string-concatenation of *",-=<>#&!%:;@~'`"* and the code unit 0x0022 (QUOTATION MARK).
+  1. Let _toEscape_ be StringToCodePoints(_otherPunctuators_).
   1. If _toEscape_ contains _c_ or _c_ is matched by |WhiteSpace|, then
     1. If _c_ ≤ 0xFF, then
       1. Let _hex_ be Number::toString(𝔽(_c_), 16).

From 07fc1bdaa7423f063dac132637c35af244bb44f8 Mon Sep 17 00:00:00 2001
From: Kevin Gibbons
Date: Mon, 8 Apr 2024 20:51:00 -0700
Subject: [PATCH 2/5] also escape line terminators

---
 spec.emu | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/spec.emu b/spec.emu
index d47ec75..20fa4f9 100644
--- a/spec.emu
+++ b/spec.emu
@@ -60,7 +60,7 @@ contributors:
     1. Return the string-concatenation of 0x005C (REVERSE SOLIDUS) and UTF16EncodeCodePoint(_c_).
   1. Let _otherPunctuators_ be the string-concatenation of *",-=<>#&!%:;@~'`"* and the code unit 0x0022 (QUOTATION MARK).
   1. Let _toEscape_ be StringToCodePoints(_otherPunctuators_).
-  1. If _toEscape_ contains _c_ or _c_ is matched by |WhiteSpace|, then
+  1. If _toEscape_ contains _c_ or _c_ is matched by |WhiteSpace| or |LineTerminator|, then
     1. If _c_ ≤ 0xFF, then
       1. Let _hex_ be Number::toString(𝔽(_c_), 16).
       1. Return the string-concatenation of the code unit 0x005C (REVERSE SOLIDUS), *"x"*, and StringPad(_hex_, 2, *"0"*, ~start~).

From b51ecb4441123ceb018f658b4996f524a656a59b Mon Sep 17 00:00:00 2001
From: Kevin Gibbons
Date: Mon, 8 Apr 2024 20:52:04 -0700
Subject: [PATCH 3/5] use control escapes where possible

---
 spec.emu | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/spec.emu b/spec.emu
index 20fa4f9..c56e0c0 100644
--- a/spec.emu
+++ b/spec.emu
@@ -58,6 +58,8 @@ contributors:
   1. If _c_ is matched by |SyntaxCharacter| or _c_ is U+002F (SOLIDUS), then
     1. Return the string-concatenation of 0x005C (REVERSE SOLIDUS) and UTF16EncodeCodePoint(_c_).
+  1. Else if _c_ is the code point listed in some cell of the “Code Point” column of , then
+    1. Return the string-concatenation of 0x005C (REVERSE SOLIDUS) and the string in the “ControlEscape” column of the row whose “Code Point” column contains _c_.
   1. Let _otherPunctuators_ be the string-concatenation of *",-=<>#&!%:;@~'`"* and the code unit 0x0022 (QUOTATION MARK).
   1. Let _toEscape_ be StringToCodePoints(_otherPunctuators_).
   1. If _toEscape_ contains _c_ or _c_ is matched by |WhiteSpace| or |LineTerminator|, then

From eef281a74d0b6d6cc65814bafa21cdbd6fbd8c0a Mon Sep 17 00:00:00 2001
From: Kevin Gibbons
Date: Mon, 8 Apr 2024 21:04:17 -0700
Subject: [PATCH 4/5] also escape lone surrogates

---
 spec.emu | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/spec.emu b/spec.emu
index c56e0c0..22a4587 100644
--- a/spec.emu
+++ b/spec.emu
@@ -62,7 +62,7 @@ contributors:
     1. Return the string-concatenation of 0x005C (REVERSE SOLIDUS) and the string in the “ControlEscape” column of the row whose “Code Point” column contains _c_.
   1. Let _otherPunctuators_ be the string-concatenation of *",-=<>#&!%:;@~'`"* and the code unit 0x0022 (QUOTATION MARK).
   1. Let _toEscape_ be StringToCodePoints(_otherPunctuators_).
-  1. If _toEscape_ contains _c_ or _c_ is matched by |WhiteSpace| or |LineTerminator|, then
+  1. If _toEscape_ contains _c_, _c_ is matched by |WhiteSpace| or |LineTerminator|, or _c_ has the same numeric value as a leading surrogate or trailing surrogate, then
     1. If _c_ ≤ 0xFF, then
       1. Let _hex_ be Number::toString(𝔽(_c_), 16).
       1. Return the string-concatenation of the code unit 0x005C (REVERSE SOLIDUS), *"x"*, and StringPad(_hex_, 2, *"0"*, ~start~).

From 27eee05522e932bc1718af608f2a8ee70eab5420 Mon Sep 17 00:00:00 2001
From: Kevin Gibbons
Date: Mon, 8 Apr 2024 21:04:51 -0700
Subject: [PATCH 5/5] also escape leading ascii letters

---
 spec.emu | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/spec.emu b/spec.emu
index 22a4587..5221cf3 100644
--- a/spec.emu
+++ b/spec.emu
@@ -31,9 +31,11 @@ contributors:
   1. Let _escaped_ be the empty String.
   1. Let _cpList_ be StringToCodePoints(_S_).
   1. For each code point _c_ in _cpList_, do
-    1. If _escaped_ is the empty String and _c_ is matched by |DecimalDigit|, then
-      1. NOTE: Escaping a leading digit ensures that output corresponds with pattern text which may be used after a `\0` character escape or a |DecimalEscape| such as `\1` and still match _S_ rather than be interpreted as an extension of the preceding escape sequence.
-      1. Set _escaped_ to the string-concatenation of _escaped_, the code unit 0x005C (REVERSE SOLIDUS), *"x3"*, and the code unit whose numeric value is the numeric value of _c_.
+    1. If _escaped_ is the empty String, and _c_ is matched by |DecimalDigit| or |AsciiLetter|, then
+      1. NOTE: Escaping a leading digit ensures that output corresponds with pattern text which may be used after a `\0` character escape or a |DecimalEscape| such as `\1` and still match _S_ rather than be interpreted as an extension of the preceding escape sequence. Escaping a leading ASCII letter does the same for the context after `\c`.
+      1. Let _hex_ be Number::toString(𝔽(_c_), 16).
+      1. Assert: The length of _hex_ is 2.
+      1. Set _escaped_ to the string-concatenation of the code unit 0x005C (REVERSE SOLIDUS), *"x"*, and _hex_.
     1. Else,
       1. Set _escaped_ to the string-concatenation of _escaped_ and EncodeForRegExpEscape(_c_).
   1. Return _escaped_.
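
Reviewer note: to sanity-check how the five patches combine, the resulting steps can be sketched in TypeScript roughly as follows. This is an illustration under stated assumptions, not the spec text or any engine's implementation. The names `regExpEscapeSketch`, `encodeForRegExpEscape`, `utf16`, `hex`, and `isWhiteSpaceOrLineTerminator` are hypothetical. The ControlEscape table contents (`t`, `n`, `v`, `f`, `r`) are assumed, since the table reference is elided in PATCH 3/5. The `\uXXXX` branch for code points above 0xFF is also assumed, based on the |RegExpUnicodeEscapeSequence| wording in the description removed by PATCH 1/5; the hunks shown do not include that step.

```typescript
// Hypothetical helper names; a sketch of the algorithm as it reads after PATCH 5/5.
const SYNTAX_CHARACTERS = "^$\\.*+?()[]{}|"; // |SyntaxCharacter|
const OTHER_PUNCTUATORS = ",-=<>#&!%:;@~'`\""; // _otherPunctuators_ from PATCH 1/5

// Assumed contents of the ControlEscape table (its reference is elided in PATCH 3/5).
const CONTROL_ESCAPES: { [cp: number]: string } = {
  0x09: "t", 0x0a: "n", 0x0b: "v", 0x0c: "f", 0x0d: "r",
};

// |WhiteSpace| and |LineTerminator| code points, with the Zs category listed by hand.
function isWhiteSpaceOrLineTerminator(cp: number): boolean {
  if (cp >= 0x2000 && cp <= 0x200a) return true;
  return [0x09, 0x0b, 0x0c, 0x20, 0xa0, 0x1680, 0x202f, 0x205f, 0x3000, 0xfeff,
          0x0a, 0x0d, 0x2028, 0x2029].indexOf(cp) !== -1;
}

// Lowercase hex, left-padded with "0" to the given width.
function hex(cp: number, width: number): string {
  let h = cp.toString(16);
  while (h.length < width) h = "0" + h;
  return h;
}

// Stand-in for UTF16EncodeCodePoint.
function utf16(cp: number): string {
  if (cp <= 0xffff) return String.fromCharCode(cp);
  const off = cp - 0x10000;
  return String.fromCharCode(0xd800 + (off >> 10), 0xdc00 + (off & 0x3ff));
}

function encodeForRegExpEscape(cp: number): string {
  const ch = utf16(cp);
  // PATCH 1/5: identity escape for syntax characters and "/".
  if (cp === 0x2f || (ch.length === 1 && SYNTAX_CHARACTERS.indexOf(ch) !== -1)) {
    return "\\" + ch;
  }
  // PATCH 3/5: control escape where one exists.
  if (CONTROL_ESCAPES[cp] !== undefined) return "\\" + CONTROL_ESCAPES[cp];
  // PATCH 2/5 and 4/5: other punctuators, white space, line terminators, lone surrogates.
  const loneSurrogate = cp >= 0xd800 && cp <= 0xdfff;
  const otherPunctuator = ch.length === 1 && OTHER_PUNCTUATORS.indexOf(ch) !== -1;
  if (otherPunctuator || isWhiteSpaceOrLineTerminator(cp) || loneSurrogate) {
    // The \u branch is an assumption; see the note above the code.
    return cp <= 0xff ? "\\x" + hex(cp, 2) : "\\u" + hex(cp, 4);
  }
  return ch;
}

function regExpEscapeSketch(s: string): string {
  let escaped = "";
  let i = 0;
  while (i < s.length) {
    // Stand-in for StringToCodePoints: join surrogate pairs, keep lone surrogates.
    let cp = s.charCodeAt(i);
    let len = 1;
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
        len = 2;
      }
    }
    i += len;
    const isDigit = cp >= 0x30 && cp <= 0x39;
    const isLetter = (cp >= 0x41 && cp <= 0x5a) || (cp >= 0x61 && cp <= 0x7a);
    if (escaped === "" && (isDigit || isLetter)) {
      // PATCH 5/5: a leading digit or ASCII letter becomes a two-digit hex escape.
      escaped = "\\x" + hex(cp, 2);
    } else {
      escaped += encodeForRegExpEscape(cp);
    }
  }
  return escaped;
}
```

Under these assumptions, `regExpEscapeSketch("a.b")` yields the pattern text `\x61\.b`: the leading letter goes through the hex path from PATCH 5/5, and the dot gets the identity escape from PATCH 1/5. The order of the checks matters because tab, line feed, and carriage return also count as white space or line terminators; testing the control escape table first is what gives them `\t`, `\n`, and `\r` rather than `\x09`, `\x0a`, and `\x0d`.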