Refactor barrier implementations #332

wks · 2025-09-22T02:13:06Z

We recently added another write barrier, the SATBBarrier. Being another barrier based on the per-object unlog bit, it shares much code with the ObjectBarrier. We refactor the write barriers to improve code sharing and clean up unused code.

We introduce abstract base classes MMTkUnlogBitBarrierSetRuntime, MMTkUnlogBitBarrierSetAssembler, MMTkUnlogBitBarrierSetC1 and MMTkUnlogBitBarrierSetC2, which both ObjectBarrier and SATBBarrier derive from. Those base classes include common parts related to unlog bit handling, including the fast paths for checking if the unlog bit is set, and also the generic implementation for both the pre and the post barriers if applicable.

Runtime:

Extracted a method for testing if the unlog bit is set for a given object.

Assembler:

Extracted a method for emitting the unlog bit checking fast path.
Extracted a method for generating both the pre and the post write barriers.
Simplified the fast path to use fewer instructions and fewer scratch registers.
Worked around an issue where the tmp1 and tmp2 temporary registers passed in from BarrierSetAssembler::store_at may overlap with dst.base() and dst_index(), respectively.
- See https://bugs.openjdk.org/browse/JDK-8301371

C1:

Changed "runtime stubs" to be one per runtime function, not one per pre/post/refload barrier. Those "runtime stubs" are now assembled by MMTkBarrierSetAssembler instead of by MMTkObjectBarrierSetAssembler or MMTkSATBBarrierSetAssembler. The intention is that all barriers can call those functions.
Extracted a method for emitting the unlog bit checking fast path.
Extracted a method for emitting both the pre and the post write barriers and their slow-path code stub.
We no longer pass the slot and new_val to the runtime function in the slow path. Object-logging barriers only needs the obj argument for scanning its fields.
- For this reason, we moved the C1 CodeStub for the pre/post barrier slow paths down from mmtkBarrierSetC1.hpp to mmtkUnlogBitBarrier.hpp (and unified the pre/post stub implementation) because it no longer needs the slot or new_val args, and is now specific to unlog bit barriers. If we introduce field-logging barriers, we will implement a different stub.
The C1 SATBBarrier no longer needs code patching. The code patching was intended for barriers that load from the field in the fast path using LIR instructions, such as G1's SATB barrier. Our object-logging barrier scans the fields in runtime functions (mmtk_object_reference_write_pre), not using LIR instructions.

C2:

Extracted a method for emitting the unlog bit checking fast path.
Extracted a method for emitting both the pre and the post write barriers.
We no longer pass the slot and new_val to the runtime function in the slow path.

Misc:

Removed the #define private public hack as we no longer need it.
Removed unused macros, dead code and irrelevant comments.
Changed the macro SOFT_REFERENCE_LOAD_BARRIER to a run-time flag mmtk_enable_reference_load_barrier settable by environment variable.

They are not specific to concrete barriers, although each barrier only uses some but not all of them.

And no define private as public.

and move the slow-path stub to UnlogBarrierSet, too. And combine the pre and post barrier stub.

This is the basic rule for using pre-compiled headers.

SATB barrier does not have arraycopy epilogue.

Maybe start an issue instead.

wks · 2025-09-25T06:18:04Z

I labelled this as ready for review.

There is one thing intentionally not done in this PR:

To be consistent with write barriers, MMTkBarrierSet{Assembler,C1,C2} should also override load_at and provide virtual methods object_reference_read_{pre_post} for subclasses to override. However, we currently don't have read barriers in MMTk core except SATBBarrier which only applies read barriers for weak references. We can't work out a nice interface for read barriers unless we have one in MMTk core, as it may be more complicated than write barriers. For example,

"loaded value barriers" is a post barrier, but it should have access to the loaded value, and can override the value read by the raw memory read. In other words, it needs to overwrite the result register (Assembler and C1), and/or may need to return a different register/node that holds the final result (C1 and C2)
"indirection barriers" may override the slot address to read from, and apply not only to reference fields, but value fields as well.

Because SATBBarrier is currently the only barrier (policy) that has a barrier (mechanism) for read accesses, we just let it override load_at directly. If we introduce another barrier, such as the field-logging SATB barrier or RC barrier, we may hoist some code to a shared file or superclasses.

qinsoon

Can you run benchmarks with GenImmix to make sure the changes do not affect performance?

qinsoon · 2025-09-25T06:47:12Z

openjdk/barriers/mmtkSATBBarrier.cpp


-  __ branch_destination(slow->continuation());
+void MMTkSATBBarrierSetC1::object_reference_write_pre(LIRAccess& access, LIR_Opr src, LIR_Opr slot, LIR_Opr new_val) const {
+  object_reference_write_pre_or_post(access, src, true);


I find this a bit confusing when you don't have comments for /* pre = */. Maybe just have two functions. object_reference_write_pre and object_reference_write_post, they call an internal function that deals both cases.

I should have comments on those call sites, too. I'll add them.

I can't have object_reference_write_pre and object_reference_write_post in MMTkUnlogBitBarrierSetC1. They are virtual methods in MMTkBarrierSetC1, and having them in MMTkUnlogBitBarrierSetC1 will override those virtual methods. Instead, each concrete barrier type (MMTkObjectBarrierSetC1 and MMTkSATBBarrierSetC1) should override one of them, and not both. The overriding method, like this MMTkSATBBarrierSetC1::object_reference_write_pre, should call a method in MMTkUnlogBitBarrierSetC1 that has a different name, which is object_reference_write_pre_or_post in this case.

The code following the comment is just for C1, not assembler that emits stubs for C1.

wks · 2025-09-26T01:49:02Z

Can you run benchmarks with GenImmix to make sure the changes do not affect performance?

Here are the results. vole.moma, lusearch from DaCapo Chopin, 40 invocations, 5 iterations, with three heap sizes: 1.9x, 2.4x and 3.0x w.r.t. SemiSpace on vole.moma. build1 is master and build2 is this PR.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-25-Thu-081657&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&hfac^mmtk_gc&build;build1|40&Histogram%20(with%20CI)^build^mmtk_gc&

Most data show that there is no difference. But some cases, such as (build1, 1.8x, Immix) and (build2, 2.4x, StickyImmix) exhibited large slowdown together with large noise. They are caused by abnormal slowdowns in mutator time (time.other) in one or two invocations. This can be seen by plotting the time of each invocation separately.

Since it affects both the master branch and this PR, I suspect it is caused by either

an existing bug in the master branch that caused abnormal slowdown in mutator, or
some anomalies on vole.moma that interrupted the testing processes at unpredictable times.

I'll run the experiment on a different moma machine to see if can be reproduced.

wks · 2025-09-26T03:33:59Z

The CPU frequency of vole.mama, the machine I ran tests on, was not locked to maximum. This may be the cause of the abnormal slowdown. I am running the same tests on another machine where CPU frequencies are set to maximum.

qinsoon · 2025-09-26T03:37:18Z

Can you run benchmarks with GenImmix to make sure the changes do not affect performance?

Here are the results. vole.moma, lusearch from DaCapo Chopin, 40 invocations, 5 iterations, with three heap sizes: 1.9x, 2.4x and 3.0x w.r.t. SemiSpace on vole.moma. build1 is master and build2 is this PR.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-25-Thu-081657&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&hfac^mmtk_gc&build;build1|40&Histogram%20(with%20CI)^build^mmtk_gc&

Can you run all the benchmarks over the weekend?

wks · 2025-09-26T16:36:34Z

I reran the lusearch benchmark with the same binaries and same configuration on mole.moma (same machine configuration, but we have checked the CPU frequencies are almost always max, although some CPUs occasionally show 50% of the max CPU frequency in /proc/cpuinfo).

The results are similar. Overall, there is no significant evidence of performance improvement or regression.

And there are still random outliers in mutator time as shown below, in both build1 and build2. The two yellow spikes below are from Immix in build1 (master), and the three green spikes below are from ConcurrentImmix in build2 (this PR).

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2025-09-26-Fri-054710&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&XY^build-mmtk_gc^invocation&time.other

And there is no spike in STW time, as shown below. This is strange because in lusearch, STW time is 25% to 200% of the mutator time, depending on the plan and the heap size. If the slowdown is caused by CPU frequency, it should increase the STW time in the same iterations by the same ratio, too. But there is no spike in STW time.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2025-09-26-Fri-054710&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&XY^build-mmtk_gc^invocation&time.stw

wks · 2025-09-28T03:43:31Z

Results for all benchmarks are here. DaCapo Chopin MR2, vole.moma, 2.4x min heap size w.r.t semispace on vole.moma, 40 invocations, 5 iterations, using ConcurrentImmix, Immix and GenImmix.

time.stw is a bit noisy, but overall there is no clear sign of regression or improvement.

time.other shows no obvious changes, too. There is one noisy benchmark sunflow which shows 15% speed-up in mutator time in GenImmix. Plotting all invocations (as shown below), we don't see any spikes in the results, but GenImmix seems to be constantly faster on build2 (this PR) than build1 (master).

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4&benchmark^1^sunflow|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|41&XY^build-mmtk_gc^invocation&

From the un-normalized plots shown below, we see build2-GenImmix is even faster than Immix in mutator time.

If we assume C2-compiled code is dominating the execution time, then the only change I made to the C2 barrier is removing the slot and new_val arguments in the slow-path call. But if write barrier slow paths dominate the execution time of sunflow so much that two registers may affect the performance, then build1-GenImmix should be significantly slower than Immix.

perf didn't show any significant bottleneck in build1-GenImmix.

tianleq · 2025-09-28T23:51:52Z

Results for all benchmarks are here. DaCapo Chopin MR2, vole.moma, 2.4x min heap size w.r.t semispace on vole.moma, 40 invocations, 5 iterations, using ConcurrentImmix, Immix and GenImmix.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^hfac^mmtk_gc&build;build1|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&Histogram%20(with%20CI)^build-mmtk_gc^benchmark&

time.stw is a bit noisy, but overall there is no clear sign of regression or improvement.

time.other shows no obvious changes, too. There is one noisy benchmark sunflow which shows 15% speed-up in mutator time in GenImmix. Plotting all invocations (as shown below), we don't see any spikes in the results, but GenImmix seems to be constantly faster on build2 (this PR) than build1 (master).

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4&benchmark^1^sunflow|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|41&XY^build-mmtk_gc^invocation&

From the un-normalized plots shown below, we see build2-GenImmix is even faster than Immix in mutator time.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&Histogram%20(with%20CI)^build-mmtk_gc^benchmark&

If we assume C2-compiled code is dominating the execution time, then the only change I made to the C2 barrier is removing the slot and new_val arguments in the slow-path call. But if write barrier slow paths dominate the execution time of sunflow so much that two registers may affect the performance, then build1-GenImmix should be significantly slower than Immix.

perf didn't show any significant bottleneck in build1-GenImmix.

If slot gets removed, does this still work with filed barriers? I noticed that code patching also gets removed, I presume it is because slot is removed and thus nothing needs to be patched.

wks · 2025-09-29T02:26:40Z

If slot gets removed, does this still work with filed barriers?

No. It won't work with field barriers. That's why the methods are implemented in MMTkUnlogBitBarrierSet{Assembler,C1,C2} instead of MMTkBarrierSet{Assembler,C1,C2}. To use field barriers, we need to reimplement a new write barrier fast path anyway. If the fast path fails, it will need to call the slow path with the slot argument. We will do that in something like MMTkFieldUnlogBarrierSet{Assembler,C1,C2}.

The code blobs for calling runtime functions, i.e. MMTkBarrierSetC1::_object_reference_write_{pre,post,slow}_c1_runtime_code_blob still properly forward the src, slot and new_val arguments to the runtime functions as long as those arguments are provided, and should still work for field barriers.

I noticed that code patching also gets removed, I presume it is because slot is removed and thus nothing needs to be patched.

That's right. If we don't need the slot address, we don't need to code patching. We will need code patching again to support field barriers.

qinsoon · 2025-09-29T03:36:11Z

There are 3 tests failing in concurrent Immix.

tianleq · 2025-10-02T01:28:08Z

A side note, in C1, code patching is done on instruction instead of LIR. When code needs patching, the value of slot is incorrect. To support field barrier, we need to be conservative in this case, keeping track of all slots of the src

wks · 2025-10-10T14:04:25Z

This PR may have introduced bug in the SATB barrier. So far we have observed it crashing in avrora, h2o, lusearch, and xalan in the CI. It is hard to reproduce, but I managed to reproduce it locally in h2o. It always crashes after FinalMark, and it crashes at random locations, including both native code and Java code. It looks like some objects are prematurely reclaimed, possibly related to weak references.

wks · 2025-10-16T10:58:58Z

I fixed one obvious bug which is a typo that removes the barrier when the new field value is null. 940c691

But the SIGSEGV bug remains. It is not related to the weak ref load barrier because it can still be reproduced after disabling reference processing and the ref-load barrier. When I disable the C1 write barrier fast path, the bug can't be reproduced. I think it is the culprit. It is the least favorite part of the refactoring because it tries to work around several limitations of the C1 IR. I'll look into that later.

tianleq · 2025-10-17T00:20:03Z

I fixed one obvious bug which is a typo that removes the barrier when the new field value is null. 940c691

But the SIGSEGV bug remains. It is not related to the weak ref load barrier because it can still be reproduced after disabling reference processing and the ref-load barrier. When I disable the C1 write barrier fast path, the bug can't be reproduced. I think it is the culprit. It is the least favorite part of the refactoring because it tries to work around several limitations of the C1 IR. I'll look into that later.

I ran into the same issue when I was implementing the prototype.

I also ran into some C1 issues and it might be related. So when code needs patching, the value of the slot is invalid. The comment in the following is incorrect, after doing those the scratch register still contains an invalid slot. The only reason of doing this is to make the c1 LIR assembler happy, otherwise, it might trigger an assertion failure saying something like "expecting a register but found an address"

if (stub->patch_code != lir_patch_none) {
    // Patch
    assert(stub->scratch->is_single_cpu(), "must be");
    assert(stub->scratch->is_register(), "Precondition.");
    ce->mem2reg(stub->slot, stub->scratch, T_OBJECT, stub->patch_code, stub->info, false /*wide*/, false /*unaligned*/);
    // Now stub->scratch contains the pre_val instead of the slot address
    // So the following is to load the slot address into scrach register
    // Resolve address 
    auto masm = ce->masm();
    LIR_Address* addr = stub->slot->as_address_ptr();
    Address from_addr = ce->as_Address(addr);
    __ lea(stub->scratch->as_register(), from_addr);
    // Store parameter
    ce->store_parameter(stub->scratch->as_pointer_register(), 1);
  }

wks · 2025-10-17T02:15:46Z

I can't reproduce the bug with -XX:-UseCompressedOops. Maybe it is related to compressed pointer.

wks · 2025-10-17T10:13:46Z

After updating mmtk-core to revision mmtk/mmtk-core@1ffa5b3 (PR mmtk/mmtk-core#1400), I can't reproduce the crash. It was still reproducible before that commit. I don't understand it. OpenJDK doesn't use alloc_with_options at all.

wks · 2025-10-20T08:07:51Z

I managed to reproduce the crash on the master branch. It seems to be more reproducible if we disable C2.

(update: I created an issue for this: #334, and it contains steps for reproducing it on the master branch)

So it is an existing bug. It seems to be more reproducible on the CI server after this PR is merged. But locally I can reproduce the bug on master as often as with this PR.

wks added 30 commits September 19, 2025 23:08

WIP: MMTkUnlogBitBarrierSetRuntime

a03671e

Use one fewer scratch register

6cd0fdd

No. We can't. We can't use either tmp1 or tmp2.

153a459

Extract assembler fast path

b826a07

Apply unlog bit runtime barrier fast path to SATB

43f1ac5

Apply unlog bit assembler barrier fast path to SATB

3f3c193

Extract common part of pre and post from assembler

9be6ae1

Define __ separately for each function impl

efc92a6

Extract c1 pre/post write barrier runtime stub

5d49e81

Hoist C1 runtime stubs up to BarrierSetAssembler

0b0915f

They are not specific to concrete barriers, although each barrier only uses some but not all of them.

Adjust visibility

19df558

Object barrier C1 no use slot or new_val

a6cb14d

MMTkC1PreBarrierStub: only src

344f546

No needs_patching

236b578

No touching new_val

4c3847e

No touching slot operand in SATB C1 barrier

7d72ed3

And no define private as public.

C1 pre/post barrier method only access arg

822c58e

Extract C1 unlog bit check fast paths

cd81d20

Extract C1 pre/post barrier to UnlogBarrierSet

53d68a0

and move the slow-path stub to UnlogBarrierSet, too. And combine the pre and post barrier stub.

Comments

fbdba71

Revert Cargo.toml to online mmtk-core repo

24b82ab

Remove commented-out hack

5969878

Update comments to link to bug tracker

684bfa3

Comments and spacing

2443252

Ensure all cpp files include precompiled.hpp

4c2eddb

This is the basic rule for using pre-compiled headers.

One runtime code blob per function

e0ae0bc

Why do we push registers when we can save_live_registers_no_oop_map?

4abf58a

Name

0ebac36

Rewrite some assembly in SATB barrier

8bc5eab

Remove irrelevant comment.

de68de8

SATB barrier does not have arraycopy epilogue.

wks added 4 commits September 24, 2025 17:42

Bump mmtk-core revision

1c6990b

Remove new-idea comment

1770104

Maybe start an issue instead.

Don't disrupt the code order too much for no reason

3ee9112

Cite the source of reference_load_barrier_for_unknown_load

85c61b3

wks changed the title ~~WIP: Refactor barrier implementation~~ Refactor barrier implementations Sep 25, 2025

wks marked this pull request as ready for review September 25, 2025 06:09

wks requested review from qinsoon and tianleq September 25, 2025 06:18

qinsoon reviewed Sep 25, 2025

View reviewed changes

wks added 2 commits September 25, 2025 15:13

Add missing /* pre = */ comment

a31a3ba

Fix comment

63bffdd

The code following the comment is just for C1, not assembler that emits stubs for C1.

Shouldn't remove SATB barrier if val is null

940c691

Bump mmtk-core revision

9f5efe0

wks mentioned this pull request Oct 20, 2025

Segmentation fault with ConcurrentImmix and h2o #334

Open

Refactor barrier implementations #332

Are you sure you want to change the base?

Refactor barrier implementations #332

Uh oh!

Conversation

wks commented Sep 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wks commented Sep 25, 2025

Uh oh!

qinsoon left a comment

Choose a reason for hiding this comment

Uh oh!

qinsoon Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

wks Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

wks commented Sep 26, 2025

Uh oh!

wks commented Sep 26, 2025

Uh oh!

qinsoon commented Sep 26, 2025

Uh oh!

wks commented Sep 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wks commented Sep 28, 2025

Uh oh!

tianleq commented Sep 28, 2025

Uh oh!

wks commented Sep 29, 2025

Uh oh!

qinsoon commented Sep 29, 2025

Uh oh!

tianleq commented Oct 2, 2025

Uh oh!

wks commented Oct 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wks commented Oct 16, 2025

Uh oh!

tianleq commented Oct 17, 2025

Uh oh!

wks commented Oct 17, 2025

Uh oh!

wks commented Oct 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wks commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

wks commented Sep 22, 2025 •

edited

Loading

wks commented Sep 26, 2025 •

edited

Loading

wks commented Oct 10, 2025 •

edited

Loading

wks commented Oct 17, 2025 •

edited

Loading

wks commented Oct 20, 2025 •

edited

Loading