@xuemingshen-oracle xuemingshen-oracle commented Jul 14, 2025

Regex class should conform to Level 1 of Unicode Technical Standard #18: Unicode Regular Expressions, plus RL2.1 Canonical Equivalents and RL2.2 Extended Grapheme Clusters.

This PR primarily addresses conformance with RL1.5: Simple Loose Matches, which requires that simple case folding be applied to literals and (optionally) to character classes. When applied to character classes, each class is expected to be closed under simple case folding. See the standard for a detailed explanation of what it means for a class to be “closed.”

RL1.5 states:

To meet this requirement, an implementation that supports case-insensitive matching should

1. Provide at least the simple, default Unicode case-insensitive matching, and
2. Specify which character properties or constructs are closed under the matching.

In the Pattern implementation, 5 types of constructs may be affected by case sensitivity:

1. back-refs
2. string slices (sequences)
3. single characters
4. character families (Unicode Properties ...), and
5. character class ranges

Note: Single characters and families may appear independently or within a character class.

For case-insensitive (loose) matching, the implementation already applies Character.toUpperCase() and Character.toLowerCase() to both the pattern and the input string for back-refs, slices, and single characters. This effectively makes these constructs closed under case folding.

This has been verified in the newly added test case test/jdk/java/util/regex/CaseFoldingTest.java.

For example:

Pattern.compile("(?ui)\u017f").matcher("S").matches() => true
Pattern.compile("(?ui)[\u017f]").matcher("S").matches() => true
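The upper/lower comparison described above can be sketched roughly as follows. This is an illustration only, not the code in java.util.regex.Pattern; the names `LooseCompare` and `looseEquals` are made up for this example.

```java
// Illustrative sketch only; not the actual java.util.regex.Pattern code.
final class LooseCompare {
    // Hypothetical helper: two code points match loosely if they are equal,
    // share an uppercase form, or share the lowercase of that uppercase form.
    static boolean looseEquals(int a, int b) {
        if (a == b) {
            return true;
        }
        int ua = Character.toUpperCase(a);
        int ub = Character.toUpperCase(b);
        return ua == ub || Character.toLowerCase(ua) == Character.toLowerCase(ub);
    }

    public static void main(String[] args) {
        // U+017F (LATIN SMALL LETTER LONG S) uppercases to 'S'.
        System.out.println(looseEquals(0x017F, 'S')); // true
        System.out.println(looseEquals('S', 'T'));    // false
    }
}
```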

The character properties (families) are not "closed" and should remain unchanged. This is acceptable per RL1.5, if the behavior is clearly specified (TBD: update javadoc to reflect this).

Current Non-Conformance: Character Class Ranges, as reported in the original bug report.

Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches() => false
vs
Pattern.compile("(?ui)[S-S]").matcher("\u017f").matches() => true
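To see how a particular JDK build treats each construct, a small probe like the one below may help. It deliberately makes no claim about what a given release prints for the range case, since that depends on whether the build includes this fix. `RangeProbe` and `probe` are names invented for this sketch.

```java
import java.util.regex.Pattern;

// Probe sketch: reports how the running JDK treats each construct.
final class RangeProbe {
    // Hypothetical helper: full-match `input` against `regex`.
    static boolean probe(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // U+017F is LATIN SMALL LETTER LONG S.
        System.out.println("literal vs S    : " + probe("(?ui)\u017f", "S"));
        System.out.println("class vs S      : " + probe("(?ui)[\u017f]", "S"));
        System.out.println("range vs S      : " + probe("(?ui)[\u017f-\u017f]", "S"));
        System.out.println("[S-S] vs long s : " + probe("(?ui)[S-S]", "\u017f"));
    }
}
```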

vs. Perl (Perl also claims to support Unicode loose matching with its "i" modifier):

perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/ ? "true\n" : "false\n"' => false
perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/i ? "true\n" : "false\n"' => true

The root issue is that the range construct is not implemented to be closed under simple case folding. Applying toUpperCase() and toLowerCase() to the endpoints of a range does not produce a meaningful or valid range for case-folding comparisons. For example, uppercasing the endpoints of [\u017f-\u0180] yields [\u0053-\u0243].
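Mapping only the endpoints makes the problem concrete. The sketch below takes U+017F and U+0180 as endpoints and uppercases each one independently; `RangeEndpoints` and `upperEndpoints` are hypothetical names used only here.

```java
// Sketch: what happens when only the endpoints of a range are case-mapped.
final class RangeEndpoints {
    // Hypothetical helper: uppercase both endpoints of [lo-hi] independently.
    static int[] upperEndpoints(int lo, int hi) {
        return new int[] { Character.toUpperCase(lo), Character.toUpperCase(hi) };
    }

    public static void main(String[] args) {
        int[] r = upperEndpoints(0x017F, 0x0180);
        // Prints [U+0053-U+0243]
        System.out.printf("[U+%04X-U+%04X]%n", r[0], r[1]);
    }
}
```

The two-code-point range turns into one spanning several hundred code points that has little to do with the case variants of the original members.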

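The following characters (from CaseFolding.txt) currently fail case-insensitive matching when used in a character class range construct. Each row shows the CaseFolding.txt status, the code point, and its folded code point; each code point is followed by its Java lowercase and uppercase mappings.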

[T] 0049 [lo: 0069, up: 0049] 0131 [lo: 0131, up: 0049]
[C] 00B5 [lo: 00b5, up: 039c] 03BC [lo: 03bc, up: 039c]
[T] 0130 [lo: 0069, up: 0130] 0069 [lo: 0069, up: 0049]
[C] 017F [lo: 017f, up: 0053] 0073 [lo: 0073, up: 0053]
[C] 01C5 [lo: 01c6, up: 01c4] 01C6 [lo: 01c6, up: 01c4]
[C] 01C8 [lo: 01c9, up: 01c7] 01C9 [lo: 01c9, up: 01c7]
[C] 01CB [lo: 01cc, up: 01ca] 01CC [lo: 01cc, up: 01ca]
[C] 01F2 [lo: 01f3, up: 01f1] 01F3 [lo: 01f3, up: 01f1]
[C] 0345 [lo: 0345, up: 0399] 03B9 [lo: 03b9, up: 0399]
[C] 03C2 [lo: 03c2, up: 03a3] 03C3 [lo: 03c3, up: 03a3]
[C] 03D0 [lo: 03d0, up: 0392] 03B2 [lo: 03b2, up: 0392]
[C] 03D1 [lo: 03d1, up: 0398] 03B8 [lo: 03b8, up: 0398]
[C] 03D5 [lo: 03d5, up: 03a6] 03C6 [lo: 03c6, up: 03a6]
[C] 03D6 [lo: 03d6, up: 03a0] 03C0 [lo: 03c0, up: 03a0]
[C] 03F0 [lo: 03f0, up: 039a] 03BA [lo: 03ba, up: 039a]
[C] 03F1 [lo: 03f1, up: 03a1] 03C1 [lo: 03c1, up: 03a1]
[C] 03F4 [lo: 03b8, up: 03f4] 03B8 [lo: 03b8, up: 0398]
[C] 03F5 [lo: 03f5, up: 0395] 03B5 [lo: 03b5, up: 0395]
[C] 1C80 [lo: 1c80, up: 0412] 0432 [lo: 0432, up: 0412]
[C] 1C81 [lo: 1c81, up: 0414] 0434 [lo: 0434, up: 0414]
[C] 1C82 [lo: 1c82, up: 041e] 043E [lo: 043e, up: 041e]
[C] 1C83 [lo: 1c83, up: 0421] 0441 [lo: 0441, up: 0421]
[C] 1C84 [lo: 1c84, up: 0422] 0442 [lo: 0442, up: 0422]
[C] 1C85 [lo: 1c85, up: 0422] 0442 [lo: 0442, up: 0422]
[C] 1C86 [lo: 1c86, up: 042a] 044A [lo: 044a, up: 042a]
[C] 1C87 [lo: 1c87, up: 0462] 0463 [lo: 0463, up: 0462]
[C] 1C88 [lo: 1c88, up: a64a] A64B [lo: a64b, up: a64a]
[C] 1E9B [lo: 1e9b, up: 1e60] 1E61 [lo: 1e61, up: 1e60]
[S] 1E9E [lo: 00df, up: 1e9e] 00DF [lo: 00df, up: 00df]
[C] 1FBE [lo: 1fbe, up: 0399] 03B9 [lo: 03b9, up: 0399]
[S] 1FD3 [lo: 1fd3, up: 1fd3] 0390 [lo: 0390, up: 0390]
[S] 1FE3 [lo: 1fe3, up: 1fe3] 03B0 [lo: 03b0, up: 03b0]
[C] 2126 [lo: 03c9, up: 2126] 03C9 [lo: 03c9, up: 03a9]
[C] 212A [lo: 006b, up: 212a] 006B [lo: 006b, up: 004b]
[C] 212B [lo: 00e5, up: 212b] 00E5 [lo: 00e5, up: 00c5]
[S] FB05 [lo: fb05, up: fb05] FB06 [lo: fb06, up: fb06]
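The bracketed columns appear to be the results of Character.toLowerCase(int) and Character.toUpperCase(int) for each code point. Under that assumption, a row could be regenerated with a sketch like this one; the status letter and the pairing come from CaseFolding.txt and are hard-coded here, and `CaseRow` and `describe` are invented names.

```java
import java.util.Locale;

// Sketch: rebuild one row of the listing from Java's simple case mappings.
final class CaseRow {
    // Hypothetical helper: format a code point with its lower/upper mappings.
    static String describe(int cp) {
        return String.format(Locale.ROOT, "%04X [lo: %04x, up: %04x]",
                cp, Character.toLowerCase(cp), Character.toUpperCase(cp));
    }

    public static void main(String[] args) {
        // Prints: [C] 017F [lo: 017f, up: 0053] 0073 [lo: 0073, up: 0053]
        System.out.println("[C] " + describe(0x017F) + " " + describe(0x0073));
    }
}
```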

What This PR Does
This PR adds support for ensuring that character class ranges are closed under simple case folding when the (?ui) (Unicode case-insensitive) flag is used, bringing Pattern into better conformance with UTS #18 Level 1 (RL1.5).
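As a rough picture of what "closed under simple case folding" means for a range, the naive sketch below tests an input code point against every member of the range using an upper/lower comparison. It is not the approach taken in this PR and is far too slow for real use; `LooseRange`, `looseEquals` and `inRangeLoose` are hypothetical names.

```java
// Naive sketch of a range that behaves as if closed under case mapping.
// Not the PR's implementation; it scans the whole range on every call.
final class LooseRange {
    // Hypothetical helper: equal, same uppercase, or same lowercase-of-uppercase.
    static boolean looseEquals(int a, int b) {
        if (a == b) {
            return true;
        }
        int ua = Character.toUpperCase(a);
        int ub = Character.toUpperCase(b);
        return ua == ub || Character.toLowerCase(ua) == Character.toLowerCase(ub);
    }

    // True if some member of [lo-hi] loosely equals c.
    static boolean inRangeLoose(int c, int lo, int hi) {
        for (int cp = lo; cp <= hi; cp++) {
            if (looseEquals(cp, c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(inRangeLoose('S', 0x017F, 0x017F)); // true
        System.out.println(inRangeLoose(0x212A, 'a', 'z'));    // true (KELVIN SIGN vs k)
        System.out.println(inRangeLoose('S', 'a', 'r'));       // false
    }
}
```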

Notes

(1) The PR also tries to fix a special corner case for U+00DF.
See https://codepoints.net/U+00DF vs https://codepoints.net/U+1E9E?lang=en for more context.

Pattern.compile("(?ui)\u00df").matcher("\u1e9e").matches() => false
Pattern.compile("(?ui)\u1e9e").matcher("\u00df").matches() => false

vs

perl -C -e 'print "\x{1e9e}" =~ /\x{df}/ ? "true\n" : "false\n"' => false
perl -C -e 'print "\x{df}" =~ /\x{1e9e}/ ? "true\n" : "false\n"' => false
perl -C -e 'print "\x{1e9e}" =~ /\x{df}/i ? "true\n" : "false\n"' => true
perl -C -e 'print "\x{df}" =~ /\x{1e9e}/i ? "true\n" : "false\n"' => true

The Java Character class still correctly returns U+00DF as the uppercase of U+00DF, as Unicode specifies for the simple mapping. As a result, the toUpperCase() != toLowerCase() check in the single() implementation is false for this character, so SingleU is not picked and case-insensitive matching does not work as expected.

Integer.toHexString(Character.toUpperCase('\u00df')) => df
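The mappings involved can be checked directly. The sketch below only prints what the Character and String APIs return for the two sharp-s characters; `SharpS` and `upperEqualsLower` are names invented for this example.

```java
import java.util.Locale;

// Sketch: the simple and full case mappings around U+00DF and U+1E9E.
final class SharpS {
    // Hypothetical helper: true when the simple upper and lower mappings agree.
    static boolean upperEqualsLower(int cp) {
        return Character.toUpperCase(cp) == Character.toLowerCase(cp);
    }

    public static void main(String[] args) {
        int small = 0x00DF;   // LATIN SMALL LETTER SHARP S
        int capital = 0x1E9E; // LATIN CAPITAL LETTER SHARP S
        System.out.println(Integer.toHexString(Character.toUpperCase(small)));   // df
        System.out.println(Integer.toHexString(Character.toLowerCase(capital))); // df
        System.out.println(upperEqualsLower(small));                             // true
        // The full (String-level) mapping differs from the simple one.
        System.out.println("\u00df".toUpperCase(Locale.ROOT));                   // SS
    }
}
```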

(2) Known limitation: 3 characters with status S mappings still fail

There are 3 characters whose case folding mappings (per CaseFolding.txt) are not captured by our current logic, which relies only on Java's toUpperCase()/toLowerCase() conversions. These characters cannot be matched across constructs like back-ref, slice, single, or range using the current API. We will leave them unchanged for now, pending a possible migration to a pure case folding based matching implementation.

1FD3; S; 0390; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
1FE3; S; 03B0; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
FB05; S; FB06; # LATIN SMALL LIGATURE LONG S T
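One way to picture a folding-table-based comparison is a lookup that consults explicit CaseFolding.txt entries before falling back to the existing upper/lower round trip. The sketch below is purely hypothetical and is not what the JDK does or plans to do: it hard-codes only the three mappings listed above, the fallback merely approximates simple folding, and `SimpleFoldSketch`, `fold` and `foldEquals` are invented names.

```java
import java.util.Map;

// Hypothetical sketch of a table-assisted simple fold; not JDK code.
final class SimpleFoldSketch {
    // The three status-S mappings listed above, copied from CaseFolding.txt.
    private static final Map<Integer, Integer> EXTRA = Map.of(
            0x1FD3, 0x0390,
            0x1FE3, 0x03B0,
            0xFB05, 0xFB06);

    // Table entry if present, otherwise lowercase-of-uppercase as an approximation.
    static int fold(int cp) {
        Integer mapped = EXTRA.get(cp);
        return mapped != null ? mapped : Character.toLowerCase(Character.toUpperCase(cp));
    }

    static boolean foldEquals(int a, int b) {
        return fold(a) == fold(b);
    }

    public static void main(String[] args) {
        System.out.println(foldEquals(0x1FD3, 0x0390)); // true
        System.out.println(foldEquals(0xFB05, 0xFB06)); // true
        System.out.println(foldEquals('a', 'b'));       // false
    }
}
```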

Refs:
https://bugs.openjdk.org/browse/JDK-6486934
https://bugs.openjdk.org/browse/CCC-6486934
https://cr.openjdk.org/~sherman/6486934_6233084_6504326_6436458/

We are fixing an almost 20-year-old bug :-)


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8360459: UNICODE_CASE and character class with non-ASCII range does not match ASCII char (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26285/head:pull/26285
$ git checkout pull/26285

Update a local copy of the PR:
$ git checkout pull/26285
$ git pull https://git.openjdk.org/jdk.git pull/26285/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26285

View PR using the GUI difftool:
$ git pr show -t 26285

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26285.diff

Using Webrev

Link to Webrev Comment


bridgekeeper bot commented Jul 14, 2025

👋 Welcome back sherman! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented Jul 14, 2025

@xuemingshen-oracle This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8360459: UNICODE_CASE and character class with non-ASCII range does not match ASCII char

Reviewed-by: naoto

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 244 new commits pushed to the master branch.

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 14, 2025

openjdk bot commented Jul 14, 2025

@xuemingshen-oracle The following labels will be automatically applied to this pull request:

  • build
  • core-libs
  • i18n

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.


mlbridge bot commented Jul 14, 2025

Webrevs

Member

@naotoj naotoj left a comment

Looks good. Thanks for adding case folding support which is long overdue 🙂
Since this is adding a new support for casefolding for character class ranges, I think CSR and a release note should be considered.

@xuemingshen-oracle
Author

> Looks good. Thanks for adding case folding support which is long overdue 🙂 Since this is adding a new support for casefolding for character class ranges, I think CSR and a release note should be considered.

Thanks for the review. Arguably, the change I made years ago to support Level 1 + RL2.1/2 already implies that character class ranges should conform to RL1.5, just like other constructs (back-ref, slice, single and property). So it might be reasonable to categorize this as "just" a pure bug fix.

That said, it is a behavioral change, and I’m happy to go through the CSR and release note process if strongly preferred. 🙂

My initial thought was to defer the CSR until we fully switch to a case-folding-mapping–based implementation (replacing the current toUpperCase/toLowerCase logic), at which point we could also update the javadoc to explicitly document the behavior of each construct, as RL1.5 recommends/suggests.

But if we prefer to align all of that now with this fix, I’m fine doing it together.

Member

@naotoj naotoj left a comment

Changes look good to me.
As to the CSR, it seems ok without it if this is a pure bug fix.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 14, 2025
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Jul 15, 2025
Member

@naotoj naotoj left a comment

Updates look good to me.

@xuemingshen-oracle
Author

Thanks for the reviews!
/integrate


openjdk bot commented Jul 15, 2025

@xuemingshen-oracle This pull request has not yet been marked as ready for integration.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 15, 2025
@xuemingshen-oracle
Author

Thanks for the reviews!
/integrate


openjdk bot commented Jul 15, 2025

Going to push as commit 401af27.
Since your change was applied there have been 244 commits pushed to the master branch.

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 15, 2025
@openjdk openjdk bot closed this Jul 15, 2025
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Jul 15, 2025
@openjdk openjdk bot removed the rfr Pull request is ready for review label Jul 15, 2025

openjdk bot commented Jul 15, 2025

@xuemingshen-oracle Pushed as commit 401af27.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
