8360459: UNICODE_CASE and character class with non-ASCII range does not match ASCII char #26285

xuemingshen-oracle · 2025-07-14T04:53:13Z

Regex class should conform to Level 1 of Unicode Technical Standard #18: Unicode Regular Expressions, plus RL2.1 Canonical Equivalents and RL2.2 Extended Grapheme Clusters.

This PR primarily addresses conformance with RL1.5: Simple Loose Matches, which requires that simple case folding be applied to literals and (optionally) to character classes. When applied to character classes, each class is expected to be closed under simple case folding. See the standard for a detailed explanation of what it means for a class to be “closed.”

RL1.5 states:

To meet this requirement, an implementation that supports case-insensitive matching should:

1. Provide at least the simple, default Unicode case-insensitive matching, and
2. Specify which character properties or constructs are closed under the matching.

In the Pattern implementation, 5 types of constructs may be affected by case sensitivity:

1. back-refs
2. string slices (sequences)
3. single characters
4. character families (Unicode properties ...)
5. character class ranges

Note: Single characters and families may appear independently or within a character class.

For case-insensitive (loose) matching, the implementation already applies Character.toUpperCase() and Character.toLowerCase() to both the pattern and the input string for back-refs, slices, and single characters. This effectively makes these constructs closed under case folding.

This has been verified in the newly added test case test/jdk/java/util/regex/CaseFoldingTest.java.

For example:

Pattern.compile("(?ui)\u017f").matcher("S").matches() => true
Pattern.compile("(?ui)[\u017f]").matcher("S").matches() => true
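The upper/lower round trip described above can be sketched roughly as follows. This is a minimal illustration only, not the actual Pattern code; the class and method names (LooseCharSketch, looseEquals) are hypothetical.

```java
final class LooseCharSketch {
    // Hypothetical helper: true if two code points are equal after
    // applying Java's simple uppercase and lowercase mappings.
    static boolean looseEquals(int a, int b) {
        if (a == b) {
            return true;
        }
        int ua = Character.toUpperCase(a);
        int ub = Character.toUpperCase(b);
        // Compare the uppercase forms first, then the lowercase of those
        // forms, so pairs such as U+212A (KELVIN SIGN) and 'k' still meet.
        return ua == ub || Character.toLowerCase(ua) == Character.toLowerCase(ub);
    }

    public static void main(String[] args) {
        System.out.println(looseEquals(0x017F, 'S')); // true
        System.out.println(looseEquals(0x212A, 'k')); // true
        System.out.println(looseEquals('a', 'b'));    // false
    }
}
```

Because both sides are mapped before comparison, it does not matter whether the special character appears in the pattern or in the input.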

The character properties (families) are not "closed" and should remain unchanged. This is acceptable per RL1.5, if the behavior is clearly specified (TBD: update javadoc to reflect this).

Current Non-Conformance: Character Class Ranges, as reported in the original bug report.

Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches() => false
vs
Pattern.compile("(?ui)[S-S]").matcher("\u017f").matches() => true

vs Perl (Perl also claims to support Unicode loose matching with its "i" modifier):

perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/ ? "true\n" : "false\n"' => false
perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/i ? "true\n" : "false\n"' => true
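The Java side of this comparison can be reproduced with a small standalone program along these lines. The class name RangeFoldingRepro and its matches helper are hypothetical; the result of the first check depends on whether the running JDK includes this change, so no output is stated for it.

```java
import java.util.regex.Pattern;

final class RangeFoldingRepro {
    // Hypothetical convenience wrapper around Pattern/Matcher.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Range containing only U+017F, matched against ASCII 'S'.
        System.out.println("[U+017F-U+017F] vs S: "
                + matches("(?ui)[\u017f-\u017f]", "S"));
        // Range containing only 'S', matched against U+017F.
        System.out.println("[S-S] vs U+017F: "
                + matches("(?ui)[S-S]", "\u017f"));
    }
}
```

If the two lines print different values, the range construct is not closed under simple case folding on that JDK.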

The root issue is that the range construct is not implemented to be closed under simple case folding. Applying toUpperCase() and toLowerCase() to the endpoints of a range like [\u0170-\u0180] does not produce a meaningful or valid range for case-folding comparisons. For example, with uppercase conversion the characters in [\u0170-\u0180] map to code points scattered between \u0053 and \u0243, which do not form a contiguous range.
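To see why mapping a range does not yield another range, one can compute the uppercase image of every code point in it. The following is a rough sketch; RangeImageSketch, upperImage and isContiguous are hypothetical names used only for illustration.

```java
import java.util.TreeSet;

final class RangeImageSketch {
    // Collects the simple uppercase mapping of every code point in [lo, hi].
    static TreeSet<Integer> upperImage(int lo, int hi) {
        TreeSet<Integer> image = new TreeSet<>();
        for (int cp = lo; cp <= hi; cp++) {
            image.add(Character.toUpperCase(cp));
        }
        return image;
    }

    // True if the set covers every value between its smallest and largest member.
    static boolean isContiguous(TreeSet<Integer> set) {
        return set.isEmpty() || set.last() - set.first() + 1 == set.size();
    }

    public static void main(String[] args) {
        TreeSet<Integer> image = upperImage(0x0170, 0x0180);
        System.out.println("min: " + Integer.toHexString(image.first()));
        System.out.println("max: " + Integer.toHexString(image.last()));
        System.out.println("contiguous: " + isContiguous(image));
    }
}
```

The image has large gaps, so replacing the endpoints with their case mappings would either miss characters or admit many unrelated ones.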

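The following characters (from CaseFolding.txt) currently fail case-insensitive matching when used in a character class range construct. Each row shows the folding status, the code point with its Java lowercase/uppercase mappings, and its case-folding target with the same mappings.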

[T] 0049 [lo: 0069, up: 0049] 0131 [lo: 0131, up: 0049]
[C] 00B5 [lo: 00b5, up: 039c] 03BC [lo: 03bc, up: 039c]
[T] 0130 [lo: 0069, up: 0130] 0069 [lo: 0069, up: 0049]
[C] 017F [lo: 017f, up: 0053] 0073 [lo: 0073, up: 0053]
[C] 01C5 [lo: 01c6, up: 01c4] 01C6 [lo: 01c6, up: 01c4]
[C] 01C8 [lo: 01c9, up: 01c7] 01C9 [lo: 01c9, up: 01c7]
[C] 01CB [lo: 01cc, up: 01ca] 01CC [lo: 01cc, up: 01ca]
[C] 01F2 [lo: 01f3, up: 01f1] 01F3 [lo: 01f3, up: 01f1]
[C] 0345 [lo: 0345, up: 0399] 03B9 [lo: 03b9, up: 0399]
[C] 03C2 [lo: 03c2, up: 03a3] 03C3 [lo: 03c3, up: 03a3]
[C] 03D0 [lo: 03d0, up: 0392] 03B2 [lo: 03b2, up: 0392]
[C] 03D1 [lo: 03d1, up: 0398] 03B8 [lo: 03b8, up: 0398]
[C] 03D5 [lo: 03d5, up: 03a6] 03C6 [lo: 03c6, up: 03a6]
[C] 03D6 [lo: 03d6, up: 03a0] 03C0 [lo: 03c0, up: 03a0]
[C] 03F0 [lo: 03f0, up: 039a] 03BA [lo: 03ba, up: 039a]
[C] 03F1 [lo: 03f1, up: 03a1] 03C1 [lo: 03c1, up: 03a1]
[C] 03F4 [lo: 03b8, up: 03f4] 03B8 [lo: 03b8, up: 0398]
[C] 03F5 [lo: 03f5, up: 0395] 03B5 [lo: 03b5, up: 0395]
[C] 1C80 [lo: 1c80, up: 0412] 0432 [lo: 0432, up: 0412]
[C] 1C81 [lo: 1c81, up: 0414] 0434 [lo: 0434, up: 0414]
[C] 1C82 [lo: 1c82, up: 041e] 043E [lo: 043e, up: 041e]
[C] 1C83 [lo: 1c83, up: 0421] 0441 [lo: 0441, up: 0421]
[C] 1C84 [lo: 1c84, up: 0422] 0442 [lo: 0442, up: 0422]
[C] 1C85 [lo: 1c85, up: 0422] 0442 [lo: 0442, up: 0422]
[C] 1C86 [lo: 1c86, up: 042a] 044A [lo: 044a, up: 042a]
[C] 1C87 [lo: 1c87, up: 0462] 0463 [lo: 0463, up: 0462]
[C] 1C88 [lo: 1c88, up: a64a] A64B [lo: a64b, up: a64a]
[C] 1E9B [lo: 1e9b, up: 1e60] 1E61 [lo: 1e61, up: 1e60]
[S] 1E9E [lo: 00df, up: 1e9e] 00DF [lo: 00df, up: 00df]
[C] 1FBE [lo: 1fbe, up: 0399] 03B9 [lo: 03b9, up: 0399]
[S] 1FD3 [lo: 1fd3, up: 1fd3] 0390 [lo: 0390, up: 0390]
[S] 1FE3 [lo: 1fe3, up: 1fe3] 03B0 [lo: 03b0, up: 03b0]
[C] 2126 [lo: 03c9, up: 2126] 03C9 [lo: 03c9, up: 03a9]
[C] 212A [lo: 006b, up: 212a] 006B [lo: 006b, up: 004b]
[C] 212B [lo: 00e5, up: 212b] 00E5 [lo: 00e5, up: 00c5]
[S] FB05 [lo: fb05, up: fb05] FB06 [lo: fb06, up: fb06]
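One way to picture a range that is closed under case mapping is to test every case variant of the input character against the range, rather than only its upper and lower forms. The sketch below derives variants at runtime from Character mappings; it is an assumption-laden illustration, not the approach taken in this PR, and all names (ClosedRangeSketch, foldKey, VARIANTS, inRangeLoose) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ClosedRangeSketch {
    // Approximates a case-folding key using only Character mappings.
    static int foldKey(int cp) {
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    // For each key, the code points other than the key that share it.
    static final Map<Integer, List<Integer>> VARIANTS = new HashMap<>();
    static {
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            int key = foldKey(cp);
            if (key != cp) {
                VARIANTS.computeIfAbsent(key, k -> new ArrayList<>()).add(cp);
            }
        }
    }

    static boolean inRange(int lo, int cp, int hi) {
        return lo <= cp && cp <= hi;
    }

    // True if ch, its key, or any code point sharing that key lies in [lo, hi].
    static boolean inRangeLoose(int lo, int hi, int ch) {
        int key = foldKey(ch);
        if (inRange(lo, ch, hi) || inRange(lo, key, hi)) {
            return true;
        }
        for (int variant : VARIANTS.getOrDefault(key, List.of())) {
            if (inRange(lo, variant, hi)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(inRangeLoose(0x017F, 0x017F, 'S')); // true
        System.out.println(inRangeLoose('S', 'S', 0x017F));    // true
        System.out.println(inRangeLoose(0x017F, 0x017F, 'T')); // false
    }
}
```

Since this sketch relies only on toUpperCase()/toLowerCase(), it cannot link pairs whose folding is not reachable through those mappings.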

What This PR Does
This PR adds support for ensuring that character class ranges are closed under simple case folding when the (?ui) (Unicode case-insensitive) flag is used, bringing Pattern into better conformance with UTS #18 Level 1 (RL1.5).

Notes

(1) The PR also tries to fix a special corner case for U+00DF.
See https://codepoints.net/U+00DF vs https://codepoints.net/U+1E9E?lang=en for more context.

Pattern.compile("(?ui)\u00df").matcher("\u1e9e").matches() => false
Pattern.compile("(?ui)\u1e9e").matcher("\u00df").matches() => false

vs

perl -C -e 'print "\x{1e9e}" =~ /\x{df}/ ? "true\n" : "false\n"' => false
perl -C -e 'print "\x{df}" =~ /\x{1e9e}/ ? "true\n" : "false\n"' => false
perl -C -e 'print "\x{1e9e}" =~ /\x{df}/i ? "true\n" : "false\n"' => true
perl -C -e 'print "\x{df}" =~ /\x{1e9e}/i ? "true\n" : "false\n"' => true

The Java Character class still CORRECTLY returns U+00DF as the uppercase of U+00DF, as specified by Unicode. As a result, the toUpperCase() != toLowerCase() check is false for U+00DF, so the single() implementation does not pick SingleU for case-insensitive matching as it should.

Integer.toHexString(Character.toUpperCase('\u00df')) => df
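The effect of that check on U+00DF can be illustrated in isolation. This is a simplified sketch of the idea, not the actual single() code; SharpSSketch and looksCased are hypothetical names.

```java
final class SharpSSketch {
    // Naive test for "has case variants": the simple uppercase and
    // lowercase mappings of the code point differ.
    static boolean looksCased(int cp) {
        return Character.toUpperCase(cp) != Character.toLowerCase(cp);
    }

    public static void main(String[] args) {
        // U+00DF maps to itself in both directions, so it looks caseless.
        System.out.println(looksCased(0x00DF)); // false
        // U+1E9E lowercases to U+00DF, so it is detected as cased.
        System.out.println(looksCased(0x1E9E)); // true
        System.out.println(Integer.toHexString(Character.toLowerCase(0x1E9E))); // df
    }
}
```

A character that looks caseless under this test would be compared exactly, which is why a pattern containing only U+00DF does not match U+1E9E.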

(2) Known limitations: 3 'S'-like characters still fail

There are 3 characters whose case folding mappings (per CaseFolding.txt) are not captured by our current logic, which relies only on Java's toUpperCase()/toLowerCase() conversions. These characters cannot be matched across constructs like back-ref, slice, single, or range using the current API. We will leave them unchanged for now, pending a possible migration to a pure case folding based matching implementation.

1FD3; S; 0390; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
1FE3; S; 03B0; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
FB05; S; FB06; # LATIN SMALL LIGATURE LONG S T
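A quick way to confirm that these pairs are out of reach of the Character mappings is to parse the entries and test each pair directly. The sketch below assumes the "code; status; mapping; # name" layout shown in the lines above; FoldingGapSketch, parse and linkedByCaseMapping are hypothetical names.

```java
final class FoldingGapSketch {
    // Parses "code; status; mapping; # name" into {code, mapping}.
    static int[] parse(String line) {
        String[] fields = line.split(";");
        return new int[] {
            Integer.parseInt(fields[0].trim(), 16),
            Integer.parseInt(fields[2].trim(), 16)
        };
    }

    // True if the two code points meet under Java's simple case mappings.
    static boolean linkedByCaseMapping(int a, int b) {
        int ua = Character.toUpperCase(a);
        int ub = Character.toUpperCase(b);
        return ua == ub
                || Character.toLowerCase(a) == Character.toLowerCase(b)
                || Character.toLowerCase(ua) == Character.toLowerCase(ub);
    }

    public static void main(String[] args) {
        String[] lines = {
            "1FD3; S; 0390; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA",
            "1FE3; S; 03B0; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA",
            "FB05; S; FB06; # LATIN SMALL LIGATURE LONG S T"
        };
        for (String line : lines) {
            int[] pair = parse(line);
            System.out.println(Integer.toHexString(pair[0]) + " <-> "
                    + Integer.toHexString(pair[1]) + " linked: "
                    + linkedByCaseMapping(pair[0], pair[1]));
        }
    }
}
```

On a JDK whose mappings agree with the table earlier in this description, all three pairs should be reported as not linked.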

Refs:
https://bugs.openjdk.org/browse/JDK-6486934
https://bugs.openjdk.org/browse/CCC-6486934
https://cr.openjdk.org/~sherman/6486934_6233084_6504326_6436458/

We are fixing an almost 20-year-old bug :-)

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8360459: UNICODE_CASE and character class with non-ASCII range does not match ASCII char (Bug - P4)

Reviewers

Naoto Sato (@naotoj - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26285/head:pull/26285
$ git checkout pull/26285

Update a local copy of the PR:
$ git checkout pull/26285
$ git pull https://git.openjdk.org/jdk.git pull/26285/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26285

View PR using the GUI difftool:
$ git pr show -t 26285

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26285.diff

Using Webrev

Link to Webrev Comment


bridgekeeper · 2025-07-14T04:53:51Z

👋 Welcome back sherman! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-07-14T04:54:08Z

@xuemingshen-oracle This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8360459: UNICODE_CASE and character class with non-ASCII range does not match ASCII char

Reviewed-by: naoto

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 244 new commits pushed to the master branch:

38af17d: 8356807: Change log_info(cds) to MetaspaceShared::report_loading_error()
820263e: 8360701: Add bailout when the register allocator interference graph grows unreasonably large
b65fdf5: 8358768: [vectorapi] Make VectorOperators.SUADD an Associative
... and 241 more: https://git.openjdk.org/jdk/compare/ba0c12231b0f5b680951e75765b5d292f31a2cbc...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk · 2025-07-14T04:54:44Z

@xuemingshen-oracle The following labels will be automatically applied to this pull request:

build
core-libs
i18n

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-07-14T04:58:43Z

Webrevs

src/java.base/share/classes/jdk/internal/util/regex/CaseFolding.java.template

make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java

naotoj

Looks good. Thanks for adding case folding support which is long overdue 🙂
Since this is adding a new support for casefolding for character class ranges, I think CSR and a release note should be considered.

test/jdk/java/util/regex/CaseFoldingTest.java

make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java

test/jdk/java/util/regex/CaseFoldingTest.java

test/jdk/lib/testlibrary/java/lang/UCDFiles.java

xuemingshen-oracle · 2025-07-14T20:38:13Z

Looks good. Thanks for adding case folding support which is long overdue 🙂 Since this is adding a new support for casefolding for character class ranges, I think CSR and a release note should be considered.

Thanks for the review. Arguably, the change I made years ago to support Level 1 + RL2.1/2 already implies that character class ranges should conform to RL1.5, just like other constructs (back-ref, slice, single and property). So it might be reasonable to categorize this as "just" a pure bug fix.

That said, it is a behavioral change, and I’m happy to go through the CSR and release note process if strongly preferred. 🙂

My initial thought was to defer the CSR until we fully switch to a case-folding-mapping–based implementation (replacing the current toUpperCase/toLowerCase logic), at which point we could also update the javadoc to explicitly document the behavior of each construct, as RL1.5 recommends/suggests.

But if we prefer to align all of that now with this fix, I’m fine doing it together.

naotoj

Changes look good to me.
As to the CSR, it seems ok without it if this is a pure bug fix.

naotoj

Updates look good to me.

test/jdk/java/util/regex/CaseFoldingTest.java

xuemingshen-oracle · 2025-07-15T16:53:57Z

Thanks for the reviews!
/integrate

openjdk · 2025-07-15T16:54:52Z

@xuemingshen-oracle This pull request has not yet been marked as ready for integration.

xuemingshen-oracle · 2025-07-15T17:55:54Z

Thanks for the reviews!
/integrate

openjdk · 2025-07-15T17:57:20Z

Going to push as commit 401af27.
Since your change was applied there have been 244 commits pushed to the master branch:

38af17d: 8356807: Change log_info(cds) to MetaspaceShared::report_loading_error()
820263e: 8360701: Add bailout when the register allocator interference graph grows unreasonably large
b65fdf5: 8358768: [vectorapi] Make VectorOperators.SUADD an Associative
... and 241 more: https://git.openjdk.org/jdk/compare/ba0c12231b0f5b680951e75765b5d292f31a2cbc...master

Your commit was automatically rebased without conflicts.

openjdk · 2025-07-15T17:57:29Z

@xuemingshen-oracle Pushed as commit 401af27.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

8360459: UNICODE_CASE and character class with non-ASCII range does n…

640d7a6

…ot match ASCII char

openjdk bot added the rfr Pull request is ready for review label Jul 14, 2025

openjdk bot added the build, core-libs and i18n labels Jul 14, 2025

liach reviewed Jul 14, 2025


update to address the review comments

735bd72

naotoj reviewed Jul 14, 2025


update to address the review comments

e18d266

naotoj approved these changes Jul 14, 2025


openjdk bot added the ready Pull request is ready to be integrated label Jul 14, 2025

update and add more test cases, and fix a test failure

c2afc42

openjdk bot removed the ready Pull request is ready to be integrated label Jul 15, 2025

improve the lookup logic and test case for +00df

b85f581

naotoj reviewed Jul 15, 2025

test/jdk/java/util/regex/CaseFoldingTest.java (outdated, resolved)

update to fix the typo

a090888

naotoj approved these changes Jul 15, 2025


openjdk bot added the ready Pull request is ready to be integrated label Jul 15, 2025

openjdk bot added the integrated Pull request has been integrated label Jul 15, 2025

openjdk bot closed this Jul 15, 2025

openjdk bot removed the ready Pull request is ready to be integrated label Jul 15, 2025

openjdk bot removed the rfr Pull request is ready for review label Jul 15, 2025
