8361842: Move input validation checks to Java for java.lang.StringCoding intrinsics #25998

Closed
wants to merge 36 commits into from


Contributor

@vy vy commented Jun 26, 2025

Validate input in java.lang.StringCoding intrinsic Java wrappers, improve their documentation, enhance the checks in the associated IR or assembly code, and adapt them to cause a VM crash on invalid input.

Implementation notes

The goal of the associated umbrella issue JDK-8156534 is to, for java.lang.String* classes,

  1. Move @IntrinsicCandidate-annotated public methods [1] (in Java code) to private ones, and wrap them with a public "front door" method
  2. Since we moved the @IntrinsicCandidate annotation to a new method, intrinsic mappings – i.e., associated do_intrinsic() calls in vmIntrinsics.hpp – need to be updated too
  3. Add necessary input validation (range, null, etc.) checks to the newly created public front door method
  4. Place all input validation checks in the intrinsic code (add if missing!) behind a VerifyIntrinsicChecks VM flag

The following preliminary work needs to be carried out as well:

  1. Add a new VerifyIntrinsicChecks VM flag
  2. Update generate_string_range_check to produce a HaltNode. That is, crash the VM if VerifyIntrinsicChecks is set and a Java wrapper fails to spot an invalid input.

[1] @IntrinsicCandidate-annotated constructors are not subject to this change, since they are a special case.
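
The "front door" arrangement from the goals above can be sketched roughly as follows. This is only an illustration under stated assumptions: `FrontDoorSketch` and `countPositives0` are hypothetical names, the plain-Java loop is a stand-in for the intrinsic body, and the public `java.util.Objects.checkFromIndexSize` is used in place of the JDK-internal `Preconditions` call (so it throws `IndexOutOfBoundsException` rather than the AIOOBE shown later in this description).

```java
import java.util.Objects;

// Hypothetical illustration; not the actual java.lang.StringCoding source.
final class FrontDoorSketch {

    // Public "front door": validates the arguments, then delegates.
    static int countPositives(byte[] ba, int off, int len) {
        Objects.requireNonNull(ba, "ba");
        Objects.checkFromIndexSize(off, len, ba.length);
        return countPositives0(ba, off, len);
    }

    // In the JDK, the private method would carry @IntrinsicCandidate
    // (an internal annotation, omitted here so the sketch compiles anywhere).
    // It assumes its arguments have already been validated by the caller.
    private static int countPositives0(byte[] ba, int off, int len) {
        int limit = off + len;
        for (int i = off; i < limit; i++) {
            if (ba[i] < 0) {
                return i - off; // number of non-negative bytes before the first negative one
            }
        }
        return len;
    }

    public static void main(String[] args) {
        System.out.println(countPositives(new byte[]{1, 2, -3}, 0, 3));
        try {
            countPositives(new byte[]{1, 2, 3}, 2, 5); // range exceeds the array
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the split is that the validation lives in ordinary, reviewable Java, while the private method is free to assume valid input.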

Functional and performance tests

  • tier1 (which includes test/hotspot/jtreg/compiler/intrinsics/string) passes on several platforms. Further tiers will be executed after integrating reviewer feedback.

  • Performance impact is still actively monitored using test/micro/org/openjdk/bench/java/lang/String{En,De}code.java, among other tests. If you have suggestions on benchmarks, please share in the comments.

Verification of the VM crash

I've tested the VM crash scenario as follows:

  1. Created the following test program:
public class StrIntri {
    public static void main(String[] args) {
        Exception lastException = null;
        for (int i = 0; i < 1_000_000; i++) {
            try {
                jdk.internal.access.SharedSecrets.getJavaLangAccess().countPositives(new byte[]{1,2,3}, 2, 5);
            } catch (Exception exception) {
                lastException = exception;
            }
        }
        if (lastException != null) {
            lastException.printStackTrace();
        } else {
            System.out.println("completed");
        }
    }
}
  2. Compiled the JDK and ran the test:
$ bash jib.sh configure -p linux-x64-slowdebug
$ CONF=linux-x64-slowdebug make jdk
$ ./build/linux-x64-slowdebug/jdk/bin/java -XX:+VerifyIntrinsicChecks --add-exports java.base/jdk.internal.access=ALL-UNNAMED StrIntri.java
java.lang.ArrayIndexOutOfBoundsException: Range [2, 2 + 5) out of bounds for length 3

Received AIOOBE as expected.

  3. Removed all checks in StringCoding.java, and re-compiled the JDK
  4. Set the countPositives(...) arguments in the program to (null, 1, 1), ran it, and observed the VM crash with unexpected null in intrinsic.
  5. Set the countPositives(...) arguments in the program to (new byte[]{1,2,3}, 2, 5), ran it, and observed the VM crash with unexpected guard failure in intrinsic.

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8361842: Move input validation checks to Java for java.lang.StringCoding intrinsics (Sub-task - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25998/head:pull/25998
$ git checkout pull/25998

Update a local copy of the PR:
$ git checkout pull/25998
$ git pull https://git.openjdk.org/jdk.git pull/25998/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 25998

View PR using the GUI difftool:
$ git pr show -t 25998

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25998.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jun 26, 2025

👋 Welcome back vyazici! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jun 26, 2025

@vy This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8361842: Move input validation checks to Java for java.lang.StringCoding intrinsics

Reviewed-by: rriggs, liach, dfenacci, thartmann, redestad, jrose

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 263 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk

openjdk bot commented Jun 26, 2025

@vy The following labels will be automatically applied to this pull request:

  • core-libs
  • graal
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@vy vy changed the title from "8156534: Check if range checks can be moved into Java wrapper for intrinsics" to "8361842: Validate input in both Java and C++ for java.lang.StringCoding intrinsics" Jul 10, 2025
* </p>
*
* @param sa the source byte array containing characters encoded in UTF-16
* @param sp the index of the <em>byte (not character!)</em> from the source array to start reading from
Contributor Author

Note the byte (not character!) emphasis here and below.

Contributor

I think this is incorrect.
This is the index of a character (two bytes).
As it is used in encodeISOArray0(), it is incremented by 1 and passed to StringUTF16.getChar(), where it is multiplied by 2 to obtain the real byte[] index.
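
To make the char-index reading concrete, here is a minimal sketch of the indexing convention described in this comment. `Utf16IndexSketch` is a hypothetical stand-in, not the JDK's StringUTF16 source, and it assumes big-endian byte order purely for simplicity.

```java
// Hypothetical stand-in for the indexing convention described above.
final class Utf16IndexSketch {

    // Reads the char at *char* index `index` from a UTF-16 byte array.
    // Big-endian byte order is an assumption made for this sketch.
    static char getChar(byte[] val, int index) {
        int byteIndex = index << 1; // char index -> byte index
        return (char) (((val[byteIndex] & 0xff) << 8) | (val[byteIndex + 1] & 0xff));
    }

    public static void main(String[] args) {
        byte[] sa = {0, 'a', 0, 'b', 0, 'c'}; // "abc": 3 chars stored in 6 bytes
        System.out.println(getChar(sa, 1));   // char index 1 reads bytes 2 and 3
    }
}
```

Under this convention a caller passes char indices, and only the accessor deals in byte offsets.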

* {@linkplain Preconditions#checkFromIndexSize(int, int, int, BiFunction) out of bounds}
*/
static int encodeISOArray(byte[] sa, int sp, byte[] da, int dp, int len) {
checkFromIndexSize(sp, len << 1, requireNonNull(sa, "sa").length, AIOOBE_FORMATTER);
Contributor Author

sa contains 2-byte chars, and sp points to an index of this inflated array. Though, len denotes the codepoint count, hence the len << 1 while checking sp and len bounds.

Contributor

The reference to sa.length is likely wrong as well: it is the source length in bytes, whereas the index check should use the source length in chars.
It might be worth trying to find or create a test for the accidental misinterpretation of length in bytes vs. chars.
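
A small sketch of the kind of test suggested here, assuming the reviewer's reading that sp is a char index. `LengthUnitsSketch` and its method names are hypothetical, and the public `Objects.checkFromIndexSize` stands in for the internal `Preconditions` call. It compares a check that mixes a char index with byte counts against one that uses chars throughout.

```java
import java.util.Objects;

// Hypothetical illustration of the bytes-vs-chars concern raised above.
final class LengthUnitsSketch {

    // Mixed units: sp is a char index, but size and length are in bytes.
    static boolean passesMixedUnitCheck(byte[] sa, int sp, int len) {
        try {
            Objects.checkFromIndexSize(sp, len << 1, sa.length);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    // Consistent units: index, size, and length are all in chars.
    static boolean passesCharUnitCheck(byte[] sa, int sp, int len) {
        try {
            Objects.checkFromIndexSize(sp, len, sa.length >> 1);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] sa = new byte[8]; // room for 4 UTF-16 chars
        int sp = 3, len = 2;     // would read chars 3 and 4; char 4 does not exist
        System.out.println("mixed-unit check passes: " + passesMixedUnitCheck(sa, sp, len));
        System.out.println("char-unit check passes:  " + passesCharUnitCheck(sa, sp, len));
    }
}
```

With an 8-byte array, sp = 3 and len = 2, the mixed-unit check accepts the request (3 + 4 ≤ 8) even though it would read past the fourth char, while the char-unit check rejects it (3 + 2 > 4).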

@vy vy marked this pull request as ready for review July 10, 2025 12:55
@openjdk openjdk bot added the rfr label (Pull request is ready for review) Jul 10, 2025
@mlbridge

mlbridge bot commented Jul 10, 2025

@rose00
Contributor

rose00 commented Jul 10, 2025

I disagree with a small part of the statement of goals:

Always validate all input at the intrinsic (but preferably behind a VM flag)

As formulated above, this is a violation of DRY and if embraced the wrong way will lead to code that is harder to review and prove bug-free. Performing 100% accurate range/null/validation checks is deeply impractical for an assembly-based or IR-based intrinsic. It’s too hard to verify by code review, and coverage testing is suspect.

We must frankly put all the weight of verification on Java code, including Java bytecode intrinsic behaviors. Java code is high-level and can be read mostly as a declarative spec, if clearly written (as straight-line code, then the intrinsic call). Also, such simple Java code shapes (and their underlying bytecodes) are tested many orders of magnitude more than any given intrinsic.

I see two bits of evidence that you agree with me on this: 1. The intrinsic-local validation (IR or assembly) is allowed to Halt instead of throw, and 2. the intrinsic-local validation is optional, turned on only by a stress test mode. This tells me that the extra optional testing is also not required to be 100%.

Thus, I think the above goal would be better stated this way:

Validate input in the IR or assembly code of the intrinsic in an ad hoc manner to catch bugs in the Java validation.

Note: IR- or assembly-based validation code should not obscure the code or add large maintenance costs; it should sit under a VM diagnostic flag (or debug flag) and cause a VM halt instead of a Java throw.

I think I'm agreeing with you on the material points. It is important to summarize our intentions accurately at the top, for those readers that are reading only the top as a summary.

@vy vy changed the title from "8361842: Validate input in both Java and C++ for java.lang.StringCoding intrinsics" to "8361842: Move input validation checks to Java for String-related intrinsics" Jul 11, 2025
@vy
Contributor Author

vy commented Jul 11, 2025

@rose00, thanks so much for the feedback. I agree with your remarks and take your point that "Always validate all input at the intrinsic" violates DRY and is an impractical goal.

I incorporated your suggestions as follows:

  1. Renamed the ticket to Move input validation checks to Java for String-related intrinsics (to better reflect the goal)
  2. Replaced Always validate all input at the intrinsic... with your suggestion

Contributor

@dafedafe dafedafe left a comment

Thanks a lot for looking into this Volkan!
I left a couple of minor comments.
I also noticed that you haven't yet added the benchmark results to the description: do you want to run them again after the reviews?

@vy
Contributor Author

vy commented Jul 15, 2025

I left a couple of minor comments. I also noticed that you haven't yet added the benchmark results to the description: do you want to run them again after the reviews?

@dafedafe, thanks so much for the review! I've implemented the changes you requested, and shared some benchmark figures in the associated ticket. I am still actively working on evaluating the performance impact.

vy added 3 commits August 11, 2025 15:28
Those who are touching to these methods should well be
aware of the details elaborated in the `@apiNote`, no
need to put it on a display.
Member

@cl4es cl4es left a comment

I've done some testing on linux-amd64 and verified that on microbenchmarks that exercise for example StringCoding.hasNegatives (a front door of one of the intrinsics this PR changes) the generated assembly is identical under ideal conditions. Spurious regressions seen in some setups could be inlining related: moving from a simple range check emitted by the intrinsic to a call to Preconditions.checkFromIndexSize may push us over some inlining threshold in some cases. I'll try to get my hands on a linux-aarch64 machine to do some diagnostic runs on.

An idea for future investigation could be to make Preconditions.checkFromIndexSize an intrinsic similar to Preconditions.checkIndex - to help the compiler do the right thing with more ease and perhaps slightly faster.

@openjdk openjdk bot added the ready label (Pull request is ready to be integrated) Aug 13, 2025
@rose00
Contributor

rose00 commented Aug 14, 2025

… to help the compiler do the right thing with more ease and perhaps slightly faster

If the Java code is good enough, then the precondition method can simply be marked @ForceInline.

@rose00
Contributor

rose00 commented Aug 14, 2025

Some more parts to the precondition intrinsic story, FTR:

Apart from inlining, one of the goals of the intrinsic preconditions is to allow them to more reliably interact with elimination of range checks. The JVM knows its own intrinsic checks in the aaload bytecode (and its brothers), and it also knows its own intrinsic precondition method. Both kinds of checks are fodder for RCE (range check elimination) and specifically the iteration range splitting performed when a loop is factored into pre/main/post loops. The main part is statically proven to traverse an index range which is incapable of failing any of the checks (within the loop body). This RCE in turn unlocks power moves like vectorization.

If the precondition check in question works OK for these use cases, it can be marked force-inline, but if there is also evidence that it would unlock more loop optimizations, then it should be made an intrinsic (or else built on top of another intrinsic, with force-inline).
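
For readers less familiar with the two shapes being contrasted, here is a rough sketch. `RangeCheckSketch` and its methods are hypothetical, and the public `java.util.Objects` methods are used as stand-ins for the internal precondition calls. Whether a given JIT actually eliminates the per-iteration checks depends on the compiler and its version; nothing here demonstrates that.

```java
import java.util.Objects;

// Hypothetical illustration of two range-check shapes; not JDK source.
final class RangeCheckSketch {

    // Shape 1: one bulk precondition up front, then a plain counted loop.
    // The array access still carries an implicit per-iteration range check,
    // which is a candidate for range check elimination.
    static long sumSlice(int[] a, int off, int len) {
        Objects.checkFromIndexSize(off, len, a.length);
        long sum = 0;
        for (int i = off; i < off + len; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Shape 2: an explicit precondition call inside the loop body.
    // This is the situation in which a compiler-known check matters most.
    static long sumChecked(int[] a, int off, int len) {
        long sum = 0;
        for (int i = off; i < off + len; i++) {
            sum += a[Objects.checkIndex(i, a.length)];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(sumSlice(data, 1, 2));
        System.out.println(sumChecked(data, 1, 2));
    }
}
```

The first shape corresponds to a check that precedes a bulk operation; the second is the in-loop case where interaction with loop optimizations becomes relevant.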

@rose00
Contributor

rose00 commented Aug 14, 2025

Another comment on the precondition in question: It does not appear to be inside a loop, but rather a precursor to a bulk operation (which searches the sign bits of a byte array slice). It's hard to imagine the JIT doing a better job with that as an intrinsic, since it probably won't be RCE-ed within an enclosing hot loop. So, yes, force-inline it.

Volkan found a pre-existing RFE about that precondition check, and I added a lengthy comment to it, FTR:

https://bugs.openjdk.org/browse/JDK-8361837#comment-14809088

Contributor

@rose00 rose00 left a comment

This is lovely work. I've left a few suggestions which you may wish to take action on.

@openjdk openjdk bot removed the ready label (Pull request is ready to be integrated) Aug 15, 2025
@vy
Contributor Author

vy commented Aug 18, 2025

JDK-8361842, addressed by this PR, is the first step in a series of similar improvements under the JDK-8156534 umbrella issue. I wanted to get this one in perfect shape to serve as a guide for the subsequent PRs, hence the meticulous effort. Thanks so much to everyone who helped with reviewing this work. 🙇 😍

I've verified that tier1,tier2,tier3,tier4,tier5,hs-comp-stress,hs-precheckin-comp pass for 2ba4ba6 on several platforms.

/integrate

@openjdk

openjdk bot commented Aug 18, 2025

@vy This pull request has not yet been marked as ready for integration.

Member

@liach liach left a comment

Looks good in principle; didn't check in the details for compiler code, which I don't necessarily understand.

@openjdk openjdk bot added the ready label (Pull request is ready to be integrated) Aug 18, 2025
Contributor

@RogerRiggs RogerRiggs left a comment

looks good.

@vy
Contributor Author

vy commented Aug 19, 2025

/integrate

@openjdk

openjdk bot commented Aug 19, 2025

Going to push as commit 655dc51.
Since your change was applied there have been 268 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated label (Pull request has been integrated) Aug 19, 2025
@openjdk openjdk bot closed this Aug 19, 2025
@openjdk openjdk bot removed the ready (Pull request is ready to be integrated) and rfr (Pull request is ready for review) labels Aug 19, 2025
@openjdk

openjdk bot commented Aug 19, 2025

@vy Pushed as commit 655dc51.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@vy vy deleted the strIntrinCheck branch August 19, 2025 05:07