-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8360459: UNICODE_CASE and character class with non-ASCII range does not match ASCII char #26285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ot match ASCII char
|
👋 Welcome back sherman! A progress list of the required criteria for merging this PR into |
|
@xuemingshen-oracle This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 244 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
|
@xuemingshen-oracle The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
src/java.base/share/classes/jdk/internal/util/regex/CaseFolding.java.template
Outdated
Show resolved
Hide resolved
src/java.base/share/classes/jdk/internal/util/regex/CaseFolding.java.template
Outdated
Show resolved
Hide resolved
make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java
Outdated
Show resolved
Hide resolved
make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java
Outdated
Show resolved
Hide resolved
naotoj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks for adding case folding support which is long overdue 🙂
Since this is adding a new support for casefolding for character class ranges, I think CSR and a release note should be considered.
make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java
Outdated
Show resolved
Hide resolved
Thanks for the review. Arguably, the change I made years ago to support Level 1 + RL2.1/2 already implies that character class ranges should conform to RL1.5 — just like other constructs (back-ref, slice, single and property) So it might be reasonable to categorize this as "just" a pure bug fix. That said, it is a behavioral change, and I’m happy to go through the CSR and release note process if strongly preferred. 🙂 My initial thought was to defer the CSR until we fully switch to a case-folding-mapping–based implementation (replacing the current toUpperCase/toLowerCase logic), at which point we could also update the javadoc to explicitly document the behavior of each construct, as RL1.5 recommends/suggests. But if we prefer to align all of that now with this fix, I’m fine doing it together. |
naotoj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me.
As to the CSR, it seems ok without it if this is a pure bug fix.
naotoj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updates look good to me.
|
Thanks for the reviews! |
|
@xuemingshen-oracle This pull request has not yet been marked as ready for integration. |
|
Thanks for the reviews! |
|
Going to push as commit 401af27.
Your commit was automatically rebased without conflicts. |
|
@xuemingshen-oracle Pushed as commit 401af27. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
Regex class should conform to Level 1 of Unicode Technical Standard #18: Unicode Regular Expressions, plus RL2.1 Canonical Equivalents and RL2.2 Extended Grapheme Clusters.
This PR primarily addresses conformance with RL1.5: Simple Loose Matches, which requires that simple case folding be applied to literals and (optionally) to character classes. When applied to character classes, each class is expected to be closed under simple case folding. See the standard for a detailed explanation of what it means for a class to be “closed.”
RL1.5 states:
To meet this requirement, an implementation that supports case-sensitive matching should
In the Pattern implementation, 5 types of constructs may be affected by case sensitivity:
Note: Single characters and families may appear independently or within a character class.
For case-insensitive (loose) matching, the implementation already applies Character.toUpperCase() and Character.toLowerCase() to both the pattern and the input string for back-refs, slices, and single characters. This effectively makes these constructs closed under case folding.
This has been verified in the newly added test case test/jdk/java/util/regex/CaseFoldingTest.java.
For example:
Pattern.compile("(?ui)\u017f").matcher("S").matches(). => true
Pattern.compile("(?ui)[\u017f]").matcher("S").matches() => true
The character properties (families) are not "closed" and should remain unchanged. This is acceptable per RL1.5, if the behavior is clearly specified (TBD: update javadoc to reflect this).
Current Non-Conformance: Character Class Ranges, as reported in the original bug report.
Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches() => false
vs
Pattern.compile("(?ui)[S-S]").matcher("\u017f").matches(). => true
vs Perl. (Perl also claims to support the Unicode's loose match with it it's "i" modifier)
perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/ ? "true\n" : "false\n"'. => false
perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/i ? "true\n" : "false\n"'. => true
The root issue is that the range construct is not implemented to be closed under simple case folding. Applying toUpperCase() and toLowerCase() to a range like [\u0170-\u0180] does not produce a meaningful or valid range for case-folding comparisons. For example [\u0170-\u0180] => [\u0053-\u243] with uppercase conversion.
The following characters (from the CaseFolding.txt) currently fail the case-insensitive-match when used in a character class range construct.
What This PR Does
This PR adds support for ensuring that character class ranges are closed under simple case folding when the (?ui) (Unicode case-insensitive) flag is used, bringing Pattern into better conformance with UTS #18 Level 1 (RL1.5).
Notes
(1) The PR also tries to fix a special corner case for U+00df
see: https://codepoints.net/U+00DF vs https://codepoints.net/U+1E9E?lang=en for more context.
Pattern.compile("(?ui)\u00df").matcher("\u1e9e").matches() => false
Pattern.compile("(?ui)\u1e9e").matcher("\u00df").matches() => false
vs
perl -C -e 'print "\x{1e9e}" =~ /\x{df}/ ? "true\n" : "false\n"' => false
perl -C -e 'print "\x{df}" =~ /\x{1e9e}/ ? "true\n" : "false\n"' => false
perl -C -e 'print "\x{1e9e}" =~ /\x{df}/i ? "true\n" : "false\n"' => true
perl -C -e 'print "\x{df}" =~ /\x{1e9e}/i ? "true\n" : "false\n"' => true
The Java Character class still CORRECTLY returns u+00df for its upper case, as suggested by the Unicode. So our toUpperCase() != toLowerCase() in single() implementation fails to pick SingleU for case-insensitive matching as expected.
Integer.toHexString(Character.toUpperCase('\u00df')) => 0xdf
(2) Known limitations: 3 'S'-like characters still fail
There are 3 characters whose case folding mappings (per CaseFolding.txt) are not captured by our current logic, which relies only on Java's toUpperCase()/toLowerCase() conversions. These characters cannot be matched across constructs like back-ref, slice, single, or range using the current API. We will leave them unchanged for now, pending a possible migration to a pure case folding based matching implementation.
1FD3; S; 0390; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
1FE3; S; 03B0; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
FB05; S; FB06; # LATIN SMALL LIGATURE LONG S T
Refs:
https://bugs.openjdk.org/browse/JDK-6486934
https://bugs.openjdk.org/browse/CCC-6486934
https://cr.openjdk.org/~sherman/6486934_6233084_6504326_6436458/
We are fixing an almost 20-year old bug :-)
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26285/head:pull/26285$ git checkout pull/26285Update a local copy of the PR:
$ git checkout pull/26285$ git pull https://git.openjdk.org/jdk.git pull/26285/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 26285View PR using the GUI difftool:
$ git pr show -t 26285Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26285.diff
Using Webrev
Link to Webrev Comment