
Conversation

@cnqpzhang (Contributor) commented Aug 24, 2025

Issue:
In the AArch64 port, UseBlockZeroing defaults to true and BlockZeroingLowLimit is initialized to 256; if DC ZVA is supported, BlockZeroingLowLimit is later updated to 4 * VM_Version::zva_length(). When UseBlockZeroing is set to false, all related conditional checks should ignore BlockZeroingLowLimit. However, MacroAssembler::zero_words(Register base, uint64_t cnt) still evaluates the lower limit and bases its code-generation decision on it, which looks like an incomplete conditional check.

This PR:

  1. Reset BlockZeroingLowLimit to 4 * VM_Version::zva_length() (or 256) and issue a warning if it was manually changed from the default while UseBlockZeroing is disabled.
  2. Added necessary comments in MacroAssembler::zero_words(Register base, uint64_t cnt) and MacroAssembler::zero_words(Register ptr, Register cnt) to explain why we do not check UseBlockZeroing in the outer part of these functions. Instead, the decision is delegated to the stub function zero_blocks, which encapsulates the DC ZVA instructions and serves as the inner implementation of zero_words. This approach helps better control the increase in code cache size during array or object instance initialization.
  3. Added more test sizes to test/micro/org/openjdk/bench/vm/gc/RawAllocationRate.java to better cover scenarios involving smaller arrays and objects.

Tests:

  1. Performance tests on the bundled JMH vm.compiler.ClearMemory, and vm.gc.RawAllocationRate (including arrayTest and instanceTest) showed no obvious regression. Negative tests with jdk/bin/java -jar images/test/micro/benchmarks.jar RawAllocationRate.arrayTest_C1 -bm thrpt -gc false -wi 0 -w 30 -i 1 -r 30 -t 1 -f 1 -tu s -jvmArgs "-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8" -p size=32 demonstrated good wall times on zero_words_reg_imm calls, as expected.
  2. jtreg tier1 tests on Ampere Altra, AmpereOne, Graviton 2 and 3, and tier2 on Altra found no new issues. GHA sanity checks passed.

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8365991: AArch64: Ignore BlockZeroingLowLimit when UseBlockZeroing is false (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26917/head:pull/26917
$ git checkout pull/26917

Update a local copy of the PR:
$ git checkout pull/26917
$ git pull https://git.openjdk.org/jdk.git pull/26917/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26917

View PR using the GUI difftool:
$ git pr show -t 26917

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26917.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper bot commented Aug 24, 2025

👋 Welcome back qpzhang! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk bot commented Aug 24, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk bot commented Aug 24, 2025

@cnqpzhang The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk bot added the hotspot and rfr (Pull request is ready for review) labels Aug 24, 2025
@mlbridge bot commented Aug 24, 2025

@adinn (Contributor) commented Aug 26, 2025

@cnqpzhang If you look back at the history of this code you will see that you are undoing a change that was made deliberately by @theRealAph. Your patch may improve the specific test case you have provided but at the cost of a significant and unacceptable increase in code cache use for all cases.

The comment at the head of the code you have edited makes this point explicitly. The reasoning behind that comment is available in the JIRA history and associated review comments. The relevant issue is

https://bugs.openjdk.org/browse/JDK-8179444

and the corresponding review thread starts with

https://mail.openjdk.org/pipermail/hotspot-dev/2017-April/026742.html

and continues with

https://mail.openjdk.org/pipermail/hotspot-dev/2017-May/026766.html

I don't recommend integrating this change.

@cnqpzhang (Contributor, Author) commented:

Hi @adinn, thanks for your review.

I have read two related JBS:

  1. JDK-8179444, Put zero_words on a diet (May 2017), 1ce2a362524
  2. JDK-8270947, C1: use zero_words to initialize all objects (Jul 2021), 6c68ce2d396

Regarding the two zero_words variants, reg_reg and reg_imm: the first patch (1ce2a362524) had MacroAssembler::zero_words(Register ptr, Register cnt) call the stub function generate_zero_blocks() and moved the if (UseBlockZeroing) condition into it, producing a shorter instruction sequence for ClearArray. The second one made MacroAssembler::zero_words(Register base, uint64_t cnt) route to the stub as well.

My PR undoes part of the first patch (1ce2a362524), as described by #2 and #3 in the PR summary, but not all of it. As shown below, 1ce2a362524 removed the BlockZeroingLowLimit check when dropping the call to block_zero. Then 6c68ce2d396 had zero_words(Register base, uint64_t cnt) call zero_words(Register ptr, Register cnt) and then the stub function, which should have added back the UseBlockZeroing check but omitted it (intentionally?).

1ce2a362524#diff-fe18bdf6585d1a0d4d510f382a568c4428334d4ad941581ecc10ec60ccafca4aL4972-L4974

  } else if (UseBlockZeroing && cnt >= (u_int64_t)(BlockZeroingLowLimit >> LogBytesPerWord)) {
    mov(tmp, cnt);
    block_zero(base, tmp, true);

6c68ce2d396#diff-0f4150a9c607ccd590bf256daa800c0276144682a92bc6bdced5e8bc1bb81f3aR4680-R4684

void MacroAssembler::zero_words(Register base, uint64_t cnt)
{
  guarantee(zero_words_block_size < BlockZeroingLowLimit,
            "increase BlockZeroingLowLimit");
  if (cnt <= (uint64_t)BlockZeroingLowLimit / BytesPerWord) {

This looks a bit confusing: with -XX:-UseBlockZeroing, BlockZeroingLowLimit still takes effect. For example, with -XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=16 and an object instance size of 32,

Without the UseBlockZeroing check (base), we have:

 ;; zero_words {
  0x0000400013b02e40:   subs  x8, x11, #0x8
  0x0000400013b02e44:   b.cc  0x0000400013b02e4c  // b.lo, b.ul, b.last
  0x0000400013b02e48:   bl  0x0000400013b02f10          ;   {runtime_call Stub::Stub Generator zero_blocks_stub}
  0x0000400013b02e4c:   tbz  w11, #2, 0x0000400013b02e58
  0x0000400013b02e50:   stp  xzr, xzr, [x10], #16
  0x0000400013b02e54:   stp  xzr, xzr, [x10], #16
  0x0000400013b02e58:   tbz  w11, #1, 0x0000400013b02e60
  0x0000400013b02e5c:   stp  xzr, xzr, [x10], #16
  0x0000400013b02e60:   tbz  w11, #0, 0x0000400013b02e68
  0x0000400013b02e64:   str  xzr, [x10]
 ;; } zero_words

In contrast, with the UseBlockZeroing check (patched), we will see:

 ;; zero_words (count = 2) {
  0x000040003415e874:   stp  xzr, xzr, [x10]
 ;; } zero_words

So, it appears that BlockZeroingLowLimit currently serves two purposes: as the lower limit for block zeroing, and as the threshold determining whether to call a stub or perform STP unrolling inline. Should we fix this, leave it as it is, or just add comments to explain it better?

@cnqpzhang (Contributor, Author) commented:

Regarding the impact on code caches, I measured JMH vm.gc.RawAllocationRate.arrayTest and a SPECjbb2015 PRESET run. The first is not suitable for comparison because the array-init code only takes a small portion of the overall space; with -XX:+TieredCompilation, the sum of the three segmented caches showed a <<1% diff. SPECjbb2015, on the other hand, is a complex enough application to demonstrate the impact on code caches, so I plotted a chart for a 20-minute run, baseline vs. patched.

[Chart: code cache usage during a 20-minute SPECjbb2015 run, baseline vs. patched]

Eyeballing the chart, the profiled and non-profiled nmethod segments show slightly larger used sizes (patched vs. baseline), a tiny part of the totals of ~6 MB (profiled nmethods) and ~12 MB (non-profiled nmethods). Furthermore, these diffs are far smaller than the total reserved size, whether 32 MB (C1 only), 48 MB (with C2), or 240 MB (configured ergonomically by the JVM). I manually set -XX:InitialCodeCacheSize=32M -XX:ReservedCodeCacheSize=64M to keep the range managed.

Therefore, I have a question about the practical impact on the code cache in this context. Specifically, is code cache growth still a significant concern relative to the benefits gained from reduced call counts and the modest performance improvements in code generation and execution for the generated array and object initialization code?

That said, I fully understand the potential risks and concerns with modifying the existing logic, and I am prepared to roll back the changes related to the C2 part.

@theRealAph (Contributor) commented:

It's difficult for anyone to predict all the possibilities of -XX command-line arguments that users might try, despite them not making any sense.

To begin with, please add this short patch, then see if any of this PR provides an advantage.


diff --git a/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp b/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
index 9321dd0542e..14a584c5106 100644
--- a/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
@@ -446,6 +446,11 @@ void VM_Version::initialize() {
     FLAG_SET_DEFAULT(UseBlockZeroing, false);
   }
 
+  if (!UseBlockZeroing && !FLAG_IS_DEFAULT(BlockZeroingLowLimit)) {
+    warning("BlockZeroingLowLimit has been ignored because UseBlockZeroing is disabled");
+    FLAG_SET_DEFAULT(BlockZeroingLowLimit, 4 * VM_Version::zva_length());
+  }
+
   if (VM_Version::supports_sve2()) {
     if (FLAG_IS_DEFAULT(UseSVE)) {
       FLAG_SET_DEFAULT(UseSVE, 2);

@mlbridge bot commented Sep 1, 2025

Mailing list message from Andrew Haley on hotspot-dev:

On 29/08/2025 12:10, Patrick Zhang wrote:

Regarding the impact to code caches, I measured JMH

That's not going to tell you anything. The zeroing code is expanded many
times during a compilation, and code cache size is limited. Every time
we needlessly expand intrinsics inline we kick user's code out.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

@cnqpzhang (Contributor, Author) commented:

To begin with, please add this short patch, then see if any of this PR provides an advantage.


diff --git a/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp b/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
index 9321dd0542e..14a584c5106 100644
--- a/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
@@ -446,6 +446,11 @@ void VM_Version::initialize() {
     FLAG_SET_DEFAULT(UseBlockZeroing, false);
   }
 
+  if (!UseBlockZeroing && !FLAG_IS_DEFAULT(BlockZeroingLowLimit)) {
+    warning("BlockZeroingLowLimit has been ignored because UseBlockZeroing is disabled");
+    FLAG_SET_DEFAULT(BlockZeroingLowLimit, 4 * VM_Version::zva_length());
+  }
+
   if (VM_Version::supports_sve2()) {
     if (FLAG_IS_DEFAULT(UseSVE)) {
       FLAG_SET_DEFAULT(UseSVE, 2);

Thanks for the advice. Updated accordingly (commit 3 vs 2: 22e72f4) to keep the shape of the generated code as unchanged as possible. My test case with -XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8, size=32 also works as expected. I added some comments to clarify the purpose of the if-condition inside the zero_words function and avoid future confusion. Please help review, thanks.

@theRealAph (Contributor) commented:

Please help review, thanks.

OK, but please edit the claims at the top of this PR to respect the new reality. In particular, please state the test cases which are improved.

@cnqpzhang (Contributor, Author) commented:

OK, but please edit the claims at the top of this PR to respect the new reality. In particular, please state the test cases which are improved.

Updated.
I also made a new change (f23abb9) to set BlockZeroingLowLimit to 256 when is_zva_enabled() returns false; otherwise we would get 0 from _zva_length.

@theRealAph (Contributor) commented:

I can't see any statistically-significant improvement. Please tell us your test results and your test conditions.

@cnqpzhang (Contributor, Author) commented:

I can't see any statistically-significant improvement. Please tell us your test results and your test conditions.

The impact can be divided into two parts, at execution time and at code generation time respectively.

  1. Execution time measured by JMH RawAllocationRate test cases
    As mentioned in the initial PR summary, we do not expect significant improvement in the execution of zero_words from this PR, neither in the original version (C1 and C2) nor in the current revision (C1 only). The instruction sequences generated by the baseline and patched versions show only minor differences under certain test conditions, and the reduction in cmp and branch instructions is insufficient to yield a significant performance benefit.

Let us focus on tests that do generate diffs. For example, I ran the following on Ampere Altra (Neoverse-N1), Fedora 40, kernel 6.1:

JVM_ARGS="-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8"
JMH_ARGS="-p size=32 -p size=48 -p size=64 -p size=80 -p size=96 -p size=128 -p size=256"
jdk/bin/java -jar images/test/micro/benchmarks.jar RawAllocationRate.instanceTest_C1 -bm thrpt -gc false -wi 2 -w 60 -i 1 -r 30 -t 1 -f 1 -tu s -jvmArgs "${JVM_ARGS}" ${JMH_ARGS} -rf csv -rff results.csv

Results (Base)

"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit","Param: size"
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7013.365157,NaN,"ops/s",32
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9160.068513,NaN,"ops/s",48
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,10216.516550,NaN,"ops/s",64
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9512.467605,NaN,"ops/s",80
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7555.693378,NaN,"ops/s",96
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9033.057061,NaN,"ops/s",128
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,5559.689404,NaN,"ops/s",256

Patched (minor variations or slight improvements, as expected)

"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit","Param: size"
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7071.799147,NaN,"ops/s",32
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9250.847903,NaN,"ops/s",48
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,10240.947817,NaN,"ops/s",64
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9757.645075,NaN,"ops/s",80
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,7531.211049,NaN,"ops/s",96
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,9045.657067,NaN,"ops/s",128
"org.openjdk.bench.vm.gc.RawAllocationRate.instanceTest_C1","thrpt",1,1,5560.328088,NaN,"ops/s",256

Note that we do not include C2 tests or size > 256, as the generated code is the same and there is no noticeable performance change.

  2. Code-gen time measured by a gtest, test_MacroAssembler_zero_words.cpp
    I created jdk/test/hotspot/gtest/aarch64/test_MacroAssembler_zero_words.cpp to measure the wall time of zero_words calls; however, I have not included it in this PR because it still contains some hardcoded variables.
#include "asm/assembler.hpp"
#include "asm/assembler.inline.hpp"
#include "asm/macroAssembler.hpp"
#include "unittest.hpp"
#include <chrono>

#if defined(AARCH64) && !defined(ZERO)

TEST_VM(AssemblerAArch64, zero_words_wall_time) {
    BufferBlob* b = BufferBlob::create("aarch64Test", 200000);
    CodeBuffer code(b);
    MacroAssembler _masm(&code);

    const size_t call_count = 1000;
    const size_t word_count = 4; // 32B / 8B-per-word = 4
    // const size_t word_count = 16; // 128B / 8B-per-word = 16
    uint64_t* buffer = new uint64_t[word_count]();  // value-initialized to zero
    Register base = r10;
    uint64_t cnt = word_count;

    // Set up the base register to point to the buffer
    _masm.mov(base, (uintptr_t)buffer);

    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < call_count; ++i) {
        _masm.zero_words(base, cnt);
    }
    auto end = std::chrono::steady_clock::now();

    auto wall_time_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    printf("zero_words wall time (ns): %lld\n", (long long)(wall_time_ns / call_count));

    // Note: the emitted code is only generated here, never executed, so this
    // merely checks that the value-initialized buffer is still zero.
    for (size_t i = 0; i < word_count; ++i) {
        ASSERT_EQ(buffer[i], 0u);
    }

    delete[] buffer;
}

#endif  // AARCH64 && !ZERO

First, we test clearing 4 words (32 bytes) with a low limit of 8 bytes (1 word); the patch corrects the low limit to 256 bytes (32 words). Run it 20 times and compare the ratio of patch vs. base (lower is better):

for ((i=0;i<20;i++));do
make test-only TEST="gtest:AssemblerAArch64.zero_words_wall_time" TEST_OPTS="JAVA_OPTIONS=-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=8" 2>/dev/null | grep "wall time"
done

Test results, zero_words wall time (ns):

Base 	Patch	Patch vs Base
346	    45	    0.13
393	    45	    0.11
398	    46	    0.12
390	    30	    0.08
322	    29	    0.09
398	    27	    0.07
392	    51	    0.13
392	    44	    0.11
361	    53	    0.15
390	    44	    0.11
299	    28	    0.09
303	    29	    0.10
419	    52	    0.12
390	    44	    0.11
403	    29	    0.07
387	    44	    0.11
387	    53	    0.14
307	    29	    0.09
298	    45	    0.15
387	    45	    0.12

Second, we test clearing larger memory: 16 words (128 bytes) with a low limit of 64 bytes (8 words). Remember to update test_MacroAssembler_zero_words.cpp with const size_t word_count = 16; and use the command line below:

for ((i=0;i<20;i++));do
make test-only TEST="gtest:AssemblerAArch64.zero_words_wall_time" TEST_OPTS="JAVA_OPTIONS=-XX:-UseBlockZeroing -XX:BlockZeroingLowLimit=64" | grep "wall time"
done

New test results, zero_words wall time (ns):

Base 	Patch	Patch vs Base
370     204	    0.55
310     205	    0.66
369     209	    0.57
381     208	    0.55
384     172	    0.45
365     209	    0.57
364     205	    0.56
378     204	    0.54
388     208	    0.54
375     200	    0.53
369     201	    0.54
289     204	    0.71
377     204	    0.54
380     201	    0.53
379     201	    0.53
379     199	    0.53
388     207	    0.53
375     204	    0.54
402     201	    0.50
373     202	    0.54

In summary, the code changes bring a slight improvement to execution time, though some of the differences may be within normal variation, and a clear reduction in codegen wall time for zero_words_reg_imm calls under the specific conditions where UseBlockZeroing is false and the word count exceeds BlockZeroingLowLimit / BytesPerWord. I understand that some of the observed differences are not statistically significant, and certain improved code-gen wall-time ratios may be of limited concern. However, the primary purpose of this PR is to address the logical issue: a configured BlockZeroingLowLimit should not have its confusing effect when UseBlockZeroing is false, unlike its behavior when true.

Thanks for taking the time to read this long write-up in detail.

@bridgekeeper bot commented Oct 5, 2025

@cnqpzhang This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@cnqpzhang (Contributor, Author) commented:

Added test/hotspot/gtest/aarch64/test_MacroAssembler_zero_words.cpp to measure the impact of different low limits and cleared word counts on the wall time of MacroAssembler::zero_words and compare the resulting differences.

Run the test and compare the wall times. We can see that fixing the low limit from a lower value to the default 256 improves codegen efficiency, by 11x on clear_4_words (289 vs. 25) and by 1.6x on clear_16_words (170 vs. 107).

$ make run-test TEST="gtest:MacroAssemblerZeroWordsTest"

Clear 4 words with lower limit 8, zero_words wall time (ns): 289
Clear 4 words with lower limit 256, zero_words wall time (ns): 25
Clear 16 words with lower limit 64, zero_words wall time (ns): 170
Clear 16 words with lower limit 256, zero_words wall time (ns): 107

See below for the detailed run log, including the generated code sequences under various conditions:

Test selection 'gtest:MacroAssemblerZeroWordsTest', will run:
* gtest:MacroAssemblerZeroWordsTest/server

Running test 'gtest:MacroAssemblerZeroWordsTest/server'
Note: Google Test filter = MacroAssemblerZeroWordsTest*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from MacroAssemblerZeroWordsTest
[ RUN      ] MacroAssemblerZeroWordsTest.UseBZ_clear_32B_with_lowlimit_8B_vm
--------------------------------------------------------------------------------
udf     #0
  0x0000400011c56a40:   mov     x10, #0x6fb0                    // #28592
  0x0000400011c56a44:   movk    x10, #0xab05, lsl #16
  0x0000400011c56a48:   movk    x10, #0xaaaa, lsl #32
  0x0000400011c56a4c:   orr     x11, xzr, #0x4
  0x0000400011c56a50:   subs    x8, x11, #0x8
  0x0000400011c56a54:   b.cc    0x0000400011c56a5c  // b.lo, b.ul, b.last
  0x0000400011c56a58:   bl      Stub::Stub Generator zero_blocks_stub
  0x0000400011c56a5c:   tbz     w11, #2, 0x0000400011c56a68
  0x0000400011c56a60:   stp     xzr, xzr, [x10], #16
  0x0000400011c56a64:   stp     xzr, xzr, [x10], #16
  0x0000400011c56a68:   tbz     w11, #1, 0x0000400011c56a70
  0x0000400011c56a6c:   stp     xzr, xzr, [x10], #16
  0x0000400011c56a70:   tbz     w11, #0, 0x0000400011c56a78
  0x0000400011c56a74:   str     xzr, [x10]
--------------------------------------------------------------------------------

Clear 4 words with lower limit 8, zero_words wall time (ns): 289
[       OK ] MacroAssemblerZeroWordsTest.UseBZ_clear_32B_with_lowlimit_8B_vm (2 ms)
[ RUN      ] MacroAssemblerZeroWordsTest.UseBZ_clear_32B_with_lowlimit_256B_vm
--------------------------------------------------------------------------------
udf     #0
  0x0000400011c57400:   mov     x10, #0x6fb0                    // #28592
  0x0000400011c57404:   movk    x10, #0xab05, lsl #16
  0x0000400011c57408:   movk    x10, #0xaaaa, lsl #32
  0x0000400011c5740c:   stp     xzr, xzr, [x10]
  0x0000400011c57410:   stp     xzr, xzr, [x10, #16]
--------------------------------------------------------------------------------

Clear 4 words with lower limit 256, zero_words wall time (ns): 25
[       OK ] MacroAssemblerZeroWordsTest.UseBZ_clear_32B_with_lowlimit_256B_vm (0 ms)
[ RUN      ] MacroAssemblerZeroWordsTest.UseBZ_clear_128B_with_lowlimit_64B_vm
--------------------------------------------------------------------------------
udf     #0
  0x0000400011c57400:   mov     x10, #0x6fe0                    // #28640
  0x0000400011c57404:   movk    x10, #0xab05, lsl #16
  0x0000400011c57408:   movk    x10, #0xaaaa, lsl #32
  0x0000400011c5740c:   orr     x11, xzr, #0x10
  0x0000400011c57410:   subs    x8, x11, #0x8
  0x0000400011c57414:   b.cc    0x0000400011c5741c  // b.lo, b.ul, b.last
  0x0000400011c57418:   bl      Stub::Stub Generator zero_blocks_stub
  0x0000400011c5741c:   tbz     w11, #2, 0x0000400011c57428
  0x0000400011c57420:   stp     xzr, xzr, [x10], #16
  0x0000400011c57424:   stp     xzr, xzr, [x10], #16
  0x0000400011c57428:   tbz     w11, #1, 0x0000400011c57430
  0x0000400011c5742c:   stp     xzr, xzr, [x10], #16
  0x0000400011c57430:   tbz     w11, #0, 0x0000400011c57438
  0x0000400011c57434:   str     xzr, [x10]
--------------------------------------------------------------------------------

Clear 16 words with lower limit 64, zero_words wall time (ns): 170
[       OK ] MacroAssemblerZeroWordsTest.UseBZ_clear_128B_with_lowlimit_64B_vm (0 ms)
[ RUN      ] MacroAssemblerZeroWordsTest.UseBZ_clear_128B_with_lowlimit_256B_vm
--------------------------------------------------------------------------------
udf     #0
  0x0000400011c57400:   mov     x10, #0x6fe0                    // #28640
  0x0000400011c57404:   movk    x10, #0xab05, lsl #16
  0x0000400011c57408:   movk    x10, #0xaaaa, lsl #32
  0x0000400011c5740c:   stp     xzr, xzr, [x10]
  0x0000400011c57410:   stp     xzr, xzr, [x10, #16]
  0x0000400011c57414:   stp     xzr, xzr, [x10, #32]
  0x0000400011c57418:   stp     xzr, xzr, [x10, #48]
  0x0000400011c5741c:   stp     xzr, xzr, [x10, #64]
  0x0000400011c57420:   stp     xzr, xzr, [x10, #80]
  0x0000400011c57424:   stp     xzr, xzr, [x10, #96]
  0x0000400011c57428:   stp     xzr, xzr, [x10, #112]
  0x0000400011c5742c:   add     x10, x10, #0x80
--------------------------------------------------------------------------------

Clear 16 words with lower limit 256, zero_words wall time (ns): 107
[       OK ] MacroAssemblerZeroWordsTest.UseBZ_clear_128B_with_lowlimit_256B_vm (0 ms)
[----------] 4 tests from MacroAssemblerZeroWordsTest (110 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (110 ms total)
[  PASSED  ] 4 tests.
Finished running test 'gtest:MacroAssemblerZeroWordsTest/server'
Test report is stored in build-pr/test-results/gtest_MacroAssemblerZeroWordsTest_server

==============================
Test summary
==============================
   TEST                                              TOTAL  PASS  FAIL ERROR  SKIP   
   gtest:MacroAssemblerZeroWordsTest/server              4     4     0     0     0   
==============================
TEST SUCCESS

@cnqpzhang (Contributor, Author) commented:

I can't see any statistically-significant improvement. Please tell us your test results and your test conditions.

Hi @theRealAph, do you have any further comments on the updates? Aside from the code changes to BlockZeroingLowLimit, the refinements to code-gen, added comments, and tests improve code clarity and reduce potential technical debt, offering long-term value beyond an immediate execution-time gain. I would appreciate your approval of this PR. Thank you.

@adinn (Contributor) commented Oct 27, 2025

@cnqpzhang I don't understand why you think these tests indicate anything useful for real use cases. Do you have an actual user whose needs justify adopting this change?

Let's consider what your patch and associated test achieve. Initially you tried to remove the limit on unrolling that was imposed to avoid excessive cache consumption. When it was explained why this was inappropriate, you reduced the patch so that it now adjusts the threshold at which unrolling is replaced by a call to the stub. Your two test runs appear to demonstrate a performance improvement between old and new, but the difference is more apparent than real. In the specific configurations you have selected, your change to the unrolling threshold targets two very specific points of disparity: the new cases fully unroll while the old cases rely on a callout. Not surprisingly, this gives very different performance when you run it in a loop many times. But we already know that callouts are more expensive than inline code.

The important thing to note is that this transition between unrolling vs callout happens in both old and new code, just at different size points. If you ran with other config settings and sizes you could find many cases where both versions fully unroll or both rely on a callout. So your test does not truly reflect what is going on here and your fix is really doing little more than rescaling the dials so they can go up to 11. You have provided no good evidence as to why we need to adjust the scale by which we compute the threshold between unrolling or callout. Furthermore, since this rescaling allows more unrolling to occur than in the old version you still need to justify why that is worth doing.

@cnqpzhang (Contributor, Author) commented:

@adinn Thank you for the good summary of the proposed code changes. You omitted a key condition in the context: -XX:-UseBlockZeroing

I would like to reiterate that I have no objection to the functions when the -XX:+UseBlockZeroing option is set, everything can keep as is. My point is that BlockZeroingLowLimit serves literally/specifically as a switch to control whether DC ZVA instructions are generated for clearing instances under a specified bytes size limitation, rather than for deciding between unrolling and callout. Therefore, it should NOT affect the code-gen results any longer when -XX:-UseBlockZeroing is set, should it?

The initial patch aimed to decouple these uses, but @adinn raised concerns about code cache size and potential performance side effects. I profiled a SPECjbb2015 PRESET run and presented the minor impact, while @theRealAph commented that this approach might not fully capture all the impacts in detail. Based on the subsequent advice, a compromise was proposed: start by resetting BlockZeroingLowLimit to its default value when -XX:-UseBlockZeroing is configured. At this point, I faced two additional challenges: first, how to quantify the statistical improvement; and second, whether I am attempting to demonstrate the patch's benefits based on assumptions about the -XX options users might provide (I am not trying to predict, but the new code has begun to behave this way). So, this is gradually going far beyond my initial purpose, and I think you might prefer to continue using BlockZeroingLowLimit in such a dual-use manner, not only for DC ZVA but also for choosing between unrolling and callout. Perhaps the two flags should be renamed to UseDCZVA and BlockZeroingUnrollLimit respectively.

  product(bool, UseBlockZeroing, true,                                  \
          "Use DC ZVA for block zeroing")                               \
  product(intx, BlockZeroingLowLimit, 256,                              \
          "Minimum size in bytes when block zeroing will be used")      \
          range(wordSize, max_jint)                                     \

As for your last question, I will not try to justify why rescaling is worth doing, because it was not my intention. The original motivation is to improve code clarity around the low limit and make the logic clearly expressed, with less ambiguity.

@theRealAph (Contributor) commented:

I would like to reiterate that I have no objection to the functions when the -XX:+UseBlockZeroing option is set, everything can keep as is. My point is that BlockZeroingLowLimit serves literally/specifically as a switch to control whether DC ZVA instructions are generated for clearing instances under a specified bytes size limitation, rather than for deciding between unrolling and callout. Therefore, it should NOT affect the code-gen results any longer when -XX:-UseBlockZeroing is set, should it?

It does not. When -XX:-UseBlockZeroing is set, BlockZeroingLowLimit is ignored.

@cnqpzhang (Contributor, Author) commented:

I would like to reiterate that I have no objection to the functions when the -XX:+UseBlockZeroing option is set, everything can keep as is. My point is that BlockZeroingLowLimit serves literally/specifically as a switch to control whether DC ZVA instructions are generated for clearing instances under a specified bytes size limitation, rather than for deciding between unrolling and callout. Therefore, it should NOT affect the code-gen results any longer when -XX:-UseBlockZeroing is set, should it?

It does not. When -XX:-UseBlockZeroing is set, BlockZeroingLowLimit is ignored.

zero_words does not check UseBlockZeroing, it directly compares cnt and BlockZeroingLowLimit / BytesPerWord.

https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L6198C1-L6204C16

address MacroAssembler::zero_words(Register base, uint64_t cnt)
{
  assert(wordSize <= BlockZeroingLowLimit,
            "increase BlockZeroingLowLimit");
  address result = nullptr;
  if (cnt <= (uint64_t)BlockZeroingLowLimit / BytesPerWord) {
#ifndef PRODUCT

In contrast, the inner stub function does check it.

https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L669

address generate_zero_blocks() {
  Label done;
  Label base_aligned;

  Register base = r10, cnt = r11;

  __ align(CodeEntryAlignment);
  StubId stub_id = StubId::stubgen_zero_blocks_id;
  StubCodeMark mark(this, stub_id);
  address start = __ pc();

  if (UseBlockZeroing) {

@theRealAph (Contributor) commented:

I would like to reiterate that I have no objection to the functions when the -XX:+UseBlockZeroing option is set, everything can keep as is. My point is that BlockZeroingLowLimit serves literally/specifically as a switch to control whether DC ZVA instructions are generated for clearing instances under a specified bytes size limitation, rather than for deciding between unrolling and callout. Therefore, it should NOT affect the code-gen results any longer when -XX:-UseBlockZeroing is set, should it?

It does not. When -XX:-UseBlockZeroing is set, BlockZeroingLowLimit is ignored.

zero_words does not check UseBlockZeroing, it directly compares cnt and BlockZeroingLowLimit / BytesPerWord.

It doesn't need to because

  if (!UseBlockZeroing && !FLAG_IS_DEFAULT(BlockZeroingLowLimit)) {
    warning("BlockZeroingLowLimit has been ignored because UseBlockZeroing is disabled");
    FLAG_SET_DEFAULT(BlockZeroingLowLimit, is_zva_enabled() ? (4 * VM_Version::zva_length()) : 256);
  }

That is to say, if a user sets BlockZeroingLowLimit and -XX:-UseBlockZeroing, then the user's BlockZeroingLowLimit is, rightly, ignored.

@cnqpzhang (Contributor, Author) commented:

That is to say, if a user sets BlockZeroingLowLimit and -XX:-UseBlockZeroing, then the user's BlockZeroingLowLimit is, rightly, ignored.

Yes, this is the current state with the patch, and it also represents the compromise I can accept: zero_words uses BlockZeroingLowLimit to decide between unroll and callout without checking UseBlockZeroing. I added comments to keep others from similar confusion or misunderstanding about this code snippet.

Is there anything else we need to do for this PR?

@cnqpzhang (Contributor, Author) commented:

Hi @theRealAph and @adinn, please let me know if you have any additional comments on this PR, or advice to improve it. Thank you.


Labels

  • hotspot
  • rfr (Pull request is ready for review)