-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8365991: AArch64: Ignore BlockZeroingLowLimit when UseBlockZeroing is false #26917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
… false Signed-off-by: Patrick Zhang <[email protected]>
|
👋 Welcome back qpzhang! A progress list of the required criteria for merging this PR into |
|
❗ This change is not yet ready to be integrated. |
|
@cnqpzhang The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
|
@cnqpzhang If you look back at the history of this code you will see that you are undoing a change that was made deliberately by @theRealAph. Your patch may improve the specific test case you have provided but at the cost of a significant and unacceptable increase in code cache use for all cases. The comment at the head of the code you have edited makes this point explicitly. The reasoning behind that comment is available in the JIRA history and associated review comments. The relevant issue is https://bugs.openjdk.org/browse/JDK-8179444 and the corresponding review thread starts with https://mail.openjdk.org/pipermail/hotspot-dev/2017-April/026742.html and continues with https://mail.openjdk.org/pipermail/hotspot-dev/2017-May/026766.html I don't recommend integrating this change. |
|
Hi @adinn, thanks for your review. I have read two related JBS:
Particularly to two My PR undoes some of the first patch (1ce2a362524), as described by #2 and #3 in the PR summary, but it is not all. Please see below, 1ce2a362524 removed the 1ce2a362524#diff-fe18bdf6585d1a0d4d510f382a568c4428334d4ad941581ecc10ec60ccafca4aL4972-L4974 6c68ce2d396#diff-0f4150a9c607ccd590bf256daa800c0276144682a92bc6bdced5e8bc1bb81f3aR4680-R4684 This looks a bit confusing when we have Without the In contrast, with the So, it appears that BlockZeroingLowLimit currently serves two purposes: as the lower limit for block zeroing, and as the threshold determining whether to call a stub or perform STP unrolling inline. Should we fix this, leave it as it is, or just add comments to explain it better? |
Signed-off-by: Patrick Zhang <[email protected]>
|
It's difficult for anyone to predict all the possibilities of To begin with, please add this short patch, then see if any of this PR provides an advantage. |
|
Mailing list message from Andrew Haley on hotspot-dev: On 29/08/2025 12:10, Patrick Zhang wrote:
That's not going to tell you anything. The zeroing code is expanded many -- |
Signed-off-by: Patrick Zhang <[email protected]>
Thanks for advice. Updated accordingly (commit 3 vs 2: 22e72f4) to keep the shape of the generated code as unchanged as possible. My test case with |
OK, but please edit the claims at the top of this PR to respect the new reality. In particular, please state the test cases which are improved. |
Signed-off-by: Patrick Zhang <[email protected]>
Updated. |
|
I can't see any statistically-significant improvement. Please tell us your test results and your test conditions. |
The impact can be divided into two parts, at execution time and at code generation time respectively.
Let us focus on tests that can generate diffs, for example, I run below on Ampere Altra (Neoverse-N1), Fedora 40, Kernel 6.1. Results (Base) Patched (minor variations or slight improvements, as expected) Note that we do not include C2 tests and size > 256 as the generated code are same, no noticeable performance change.
Firstly, we test clearing 4 words (32 bytes) with low limit 8 bytes (1 words), the patch will correct the low limit to 256 bytes (32 words). Run it 20 times to see the ratios of patch vs base (lower is better): Test results, zero_words wall time (ns): Secondly, we test clearing larger memory, 16 words (128 bytes) with low limit 64 bytes (8 words). Remember to update New test results, zero_words wall time (ns): In summary, the code changes bring a slight improvement to execution time, though some of these differences may be within normal variation, and a clear reduction in wall time for the Thanks for taking the time to read this long write-up in details. |
|
@cnqpzhang This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a |
Signed-off-by: Patrick Zhang <[email protected]>
Signed-off-by: Patrick Zhang <[email protected]>
Signed-off-by: Patrick Zhang <[email protected]>
|
Added test/hotspot/gtest/aarch64/test_MacroAssembler_zero_words.cpp to measure the impact of different low limits and cleared word counts on the wall time of Run the test and compare the wall times. We can see that $ make run-test TEST="gtest:MacroAssemblerZeroWordsTest"
See below for the detailed run log, including the generated code sequences under various conditions: |
Hi @theRealAph , Do you have any further comments on the updates? Aside from the code changes to |
|
@cnqpzhang I don't understand why you think these tests indicate anything useful for real use cases. Do you have an actual user whose needs justify adopting this change? Let's consider what your patch and associated test achieve. Initially you tried to remove the limit on unrolling that was imposed to avoid excessive cache consumption. When it was explained why this was inappropriate you reduced the patch so that it now adjusts the threshold at which unrolling is replaced by a call to the stub. Your two test runs appear to demonstrate a performance improvement between old and new but the difference is more apparent than real. In the specific configurations you have selected your change to the unrolling threshold targets two very specific points of disparity. The new cases fully unroll while the old cases rely on a callout. Not surpisingly. this gives very different performance when you run it in a loop many times. But we already know that callouts are more expensive than inline code. The important thing to note is that this transition between unrolling vs callout happens in both old and new code, just at different size points. If you ran with other config settings and sizes you could find many cases where both versions fully unroll or both rely on a callout. So your test does not truly reflect what is going on here and your fix is really doing little more than rescaling the dials so they can go up to 11. You have provided no good evidence as to why we need to adjust the scale by which we compute the threshold between unrolling or callout. Furthermore, since this rescaling allows more unrolling to occur than in the old version you still need to justify why that is worth doing. |
|
@adinn Thank you for the good summary of the proposed code changes. You omitted a key condition in the context: I would like to reiterate that I have no objection to the functions when the The initial patch aimed to decouple these uses, but @adinn raised concerns regarding the size of code caches and potential performance side effects. I profiled a SPECjbb2015 PRESET run and presented the minor impact, while @theRealAph commented that this approach might not fully capture all the impacts in detail. Based on the subsequent advice, a compromise was proposed: we could start by resetting With regards to your last question, I would not try to justify why |
It does not. When |
zero_words does not check In contrast, the inner stub function does so. address generate_zero_blocks() { |
It doesn't need to because That is to say, if a user sets |
Yes, this is the current state we have, with the patch, and it also represents the compromise I can accept regarding Is there anything else we need to do for this PR? |
|
Hi @theRealAph and @adinn, please let me know if you have any additional comments on this PR, or advice to improve it. Thank you. |

Issue:
In AArch64 port,
UseBlockZeroingis by default set to true andBlockZeroingLowLimitis initialized to 256. IfDC ZVAis supported,BlockZeroingLowLimitis later updated to4 * VM_Version::zva_length(). WhenUseBlockZeroingis set to false, all related conditional checks should ignoreBlockZeroingLowLimit. However, the functionMacroAssembler::zero_words(Register base, uint64_t cnt)still evaluates the lower limit and bases its code generation logic on it, which seems to be an incomplete conditional check.This PR:
BlockZeroingLowLimitto4 * VM_Version::zva_length()or 256 with a warning message if it was manually configured from the default whileUseBlockZeroingis disabled.MacroAssembler::zero_words(Register base, uint64_t cnt)andMacroAssembler::zero_words(Register ptr, Register cnt)to explain why we do not checkUseBlockZeroingin the outer part of these functions. Instead, the decision is delegated to the stub functionzero_blocks, which encapsulates the DC ZVA instructions and serves as the inner implementation ofzero_words. This approach helps better control the increase in code cache size during array or object instance initialization.test/micro/org/openjdk/bench/vm/gc/RawAllocationRate.javato better cover scenarios involving smaller arrays and objects..Tests:
vm.compiler.ClearMemory, andvm.gc.RawAllocationRate(includingarrayTestandinstanceTest) showed no obvious regression. Negative tests withjdk/bin/java -jar images/test/micro/benchmarks.jar RawAllocationRate.arrayTest_C1 -bm thrpt -gc false -wi 0 -w 30 -i 1 -r 30 -t 1 -f 1 -tu s -jvmArgs "-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8" -p size=32demonstrated good wall times onzero_words_reg_immcalls, as expected.Progress
Issue
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26917/head:pull/26917$ git checkout pull/26917Update a local copy of the PR:
$ git checkout pull/26917$ git pull https://git.openjdk.org/jdk.git pull/26917/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 26917View PR using the GUI difftool:
$ git pr show -t 26917Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26917.diff
Using Webrev
Link to Webrev Comment