Conversation

@wks
Contributor

@wks wks commented Sep 22, 2025

We recently added another write barrier, the SATBBarrier. Since it is also based on the per-object unlog bit, it shares much code with the ObjectBarrier. This PR refactors the write barriers to improve code sharing and cleans up unused code.

We introduce abstract base classes MMTkUnlogBitBarrierSetRuntime, MMTkUnlogBitBarrierSetAssembler, MMTkUnlogBitBarrierSetC1 and MMTkUnlogBitBarrierSetC2, from which both ObjectBarrier and SATBBarrier derive. These base classes contain the code common to unlog-bit handling: the fast paths that check whether the unlog bit is set, and the generic implementations of the pre and post barriers where applicable.
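
At the assembler level, the hierarchy looks roughly like the sketch below (a sketch only; the method names and signatures are assumptions, not the actual declarations, and the C1/C2 classes mirror the same shape):

    class MMTkBarrierSetAssembler : public BarrierSetAssembler {
    protected:
      // Hooks emitted around the raw store; concrete barriers override one or both.
      virtual void object_reference_write_pre(MacroAssembler* masm, DecoratorSet decorators,
                                              Address dst, Register val, Register tmp1, Register tmp2) {}
      virtual void object_reference_write_post(MacroAssembler* masm, DecoratorSet decorators,
                                               Address dst, Register val, Register tmp1, Register tmp2) {}
    };

    // Shared code for barriers keyed on the per-object unlog bit.
    class MMTkUnlogBitBarrierSetAssembler : public MMTkBarrierSetAssembler {
    protected:
      // Emits the fast path: skip the slow-path call when the unlog bit of obj is already cleared.
      void emit_unlog_bit_fast_path(MacroAssembler* masm, Register obj, Register tmp, Label& done);
    };

    class MMTkObjectBarrierSetAssembler : public MMTkUnlogBitBarrierSetAssembler { /* overrides the post barrier */ };
    class MMTkSATBBarrierSetAssembler   : public MMTkUnlogBitBarrierSetAssembler { /* overrides the pre barrier  */ };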

Runtime:

  • Extracted a method for testing if the unlog bit is set for a given object.
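
For illustration, such a check boils down to the following, assuming the usual MMTk side-metadata layout for the unlog bit (one bit per 8-byte granule, so one metadata byte covers 64 bytes of heap); the helper name and the base-address constant are assumptions:

    #include <cstdint>

    // Illustrative only: the base address, granularity and helper name are assumptions.
    static const uintptr_t UNLOG_BIT_BASE_ADDRESS = 0x0000060000000000ULL;  // placeholder

    static bool is_unlog_bit_set(void* object) {
      uintptr_t addr  = reinterpret_cast<uintptr_t>(object);
      uint8_t   byte  = *reinterpret_cast<uint8_t*>(UNLOG_BIT_BASE_ADDRESS + (addr >> 6));
      uintptr_t shift = (addr >> 3) & 7;  // bit index within the metadata byte
      return ((byte >> shift) & 1) == 1;
    }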

Assembler:

  • Extracted a method for emitting the unlog bit checking fast path.
  • Extracted a method for generating both the pre and the post write barriers.
  • Simplified the fast path to use fewer instructions and fewer scratch registers.
  • Worked around an issue where the tmp1 and tmp2 temporary registers passed in from BarrierSetAssembler::store_at may overlap with dst.base() and dst.index(), respectively.
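
For context, the pre and post write barriers are emitted around the raw store in store_at. A rough sketch against the JDK 11 x86 signature (whether the refactored code routes through store_at exactly like this is an assumption):

    void MMTkBarrierSetAssembler::store_at(MacroAssembler* masm, DecoratorSet decorators, BasicType type,
                                           Address dst, Register val, Register tmp1, Register tmp2) {
      if (type == T_OBJECT || type == T_ARRAY) {
        object_reference_write_pre(masm, decorators, dst, val, tmp1, tmp2);   // no-op unless a subclass overrides it
      }
      BarrierSetAssembler::store_at(masm, decorators, type, dst, val, tmp1, tmp2);  // the raw store
      if (type == T_OBJECT || type == T_ARRAY) {
        object_reference_write_post(masm, decorators, dst, val, tmp1, tmp2);  // no-op unless a subclass overrides it
      }
    }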

C1:

  • Changed "runtime stubs" to be one per runtime function, not one per pre/post/refload barrier. Those "runtime stubs" are now assembled by MMTkBarrierSetAssembler instead of by MMTkObjectBarrierSetAssembler or MMTkSATBBarrierSetAssembler. The intention is that all barriers can call those functions.
  • Extracted a method for emitting the unlog bit checking fast path.
  • Extracted a method for emitting both the pre and the post write barriers and their slow-path code stub.
  • We no longer pass the slot and new_val to the runtime function in the slow path. Object-logging barriers only need the obj argument, since they scan its fields.
    • For this reason, we moved the C1 CodeStub for the pre/post barrier slow paths from mmtkBarrierSetC1.hpp down to mmtkUnlogBitBarrier.hpp (and unified the pre and post stub implementations), because it no longer needs the slot or new_val arguments and is now specific to unlog-bit barriers. If we introduce field-logging barriers, we will implement a different stub (see the sketch after this list).
  • The C1 SATBBarrier no longer needs code patching. Code patching was intended for barriers that load from the field in the fast path using LIR instructions, such as G1's SATB barrier. Our object-logging barrier scans the fields in a runtime function (mmtk_object_reference_write_pre) rather than with LIR instructions.
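
A minimal sketch of what such a unified, obj-only slow-path stub can look like (the class name, fields and constructor here are assumptions, not the actual code):

    class MMTkUnlogBitBarrierCodeStub : public CodeStub {
      friend class MMTkUnlogBitBarrierSetC1;
      LIR_Opr _obj;         // the only operand the slow path still needs
      address _runtime_fn;  // pre or post runtime entry, chosen by the concrete barrier
    public:
      MMTkUnlogBitBarrierCodeStub(LIR_Opr obj, address runtime_fn)
        : _obj(obj), _runtime_fn(runtime_fn) {}
      virtual void emit_code(LIR_Assembler* ce);  // stores _obj as a parameter and calls _runtime_fn
      virtual void visit(LIR_OpVisitState* visitor) {
        visitor->do_slow_case();
        visitor->do_input(_obj);
      }
    #ifndef PRODUCT
      virtual void print_name(outputStream* out) const { out->print("MMTkUnlogBitBarrierCodeStub"); }
    #endif
    };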

C2:

  • Extracted a method for emitting the unlog bit checking fast path.
  • Extracted a method for emitting both the pre and the post write barriers.
  • We no longer pass the slot and new_val to the runtime function in the slow path.
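
For illustration, an obj-only slow-path call from C2 looks roughly like the fragment below (kit and src are assumed to be in scope; the entry-point name and the call type are assumptions):

    // Call type: one oop argument (the source object), no return value.
    const Type** fields = TypeTuple::fields(1);
    fields[TypeFunc::Parms + 0] = TypeOopPtr::BOTTOM;
    const TypeTuple* domain = TypeTuple::make(TypeFunc::Parms + 1, fields);
    const TypeTuple* range  = TypeTuple::make(TypeFunc::Parms + 0, TypeTuple::fields(0));
    const TypeFunc* slow_type = TypeFunc::make(domain, range);

    kit->make_runtime_call(GraphKit::RC_LEAF,
                           slow_type,
                           CAST_FROM_FN_PTR(address, mmtk_object_reference_write_slow),
                           "mmtk_object_reference_write_slow",
                           TypeRawPtr::BOTTOM,
                           src);  // only src; slot and new_val are no longer passed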

Misc:

  • Removed the #define private public hack as we no longer need it.
  • Removed unused macros, dead code and irrelevant comments.
  • Changed the macro SOFT_REFERENCE_LOAD_BARRIER to a run-time flag mmtk_enable_reference_load_barrier settable by environment variable.
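
A minimal sketch of how such a flag can be read from the environment (the exact flag and variable names, and the accepted values, are assumptions):

    #include <cstdlib>
    #include <cstring>

    // Illustrative only; the actual flag and environment variable names may differ.
    bool mmtk_enable_reference_load_barrier = false;

    static void mmtk_init_reference_load_barrier_flag() {
      const char* v = ::getenv("MMTK_ENABLE_REFERENCE_LOAD_BARRIER");
      mmtk_enable_reference_load_barrier = (v != NULL) && (strcmp(v, "true") == 0 || strcmp(v, "1") == 0);
    }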

wks added 30 commits September 19, 2025 23:08
They are not specific to concrete barriers, although each barrier only
uses some but not all of them.
And no define private as public.
and move the slow-path stub to UnlogBarrierSet, too.  And combine the
pre and post barrier stub.
This is the basic rule for using pre-compiled headers.
SATB barrier does not have arraycopy epilogue.
@wks wks changed the title WIP: Refactor barrier implementation Refactor barrier implementations Sep 25, 2025
@wks wks marked this pull request as ready for review September 25, 2025 06:09
@wks
Contributor Author

wks commented Sep 25, 2025

I labelled this as ready for review.

There is one thing intentionally not done in this PR:

To be consistent with the write barriers, MMTkBarrierSet{Assembler,C1,C2} should also override load_at and provide virtual methods object_reference_read_{pre,post} for subclasses to override. However, we currently don't have read barriers in MMTk core except for SATBBarrier, which only applies read barriers to weak references. We can't work out a nice interface for read barriers until we have one in MMTk core, as it may be more complicated than for write barriers. For example,

  • "Loaded value barriers" are post barriers, but they need access to the loaded value and can override the value produced by the raw memory read. In other words, they need to overwrite the result register (Assembler and C1), and/or may need to return a different register/node that holds the final result (C1 and C2).
  • "Indirection barriers" may override the slot address to read from, and they apply not only to reference fields but to value fields as well.

Because SATBBarrier is currently the only barrier (policy) that has a barrier (mechanism) for read accesses, we just let it override load_at directly. If we introduce another barrier, such as a field-logging SATB barrier or an RC barrier, we may hoist some code into a shared file or into superclasses.
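
Purely for illustration, one possible shape for such hooks at the assembler level, not implemented in this PR (all names and signatures are hypothetical):

    // Hypothetical read-barrier hooks. A loaded-value barrier needs to be able to replace the
    // loaded value, so the post hook would also take the destination register.
    virtual void object_reference_read_pre(MacroAssembler* masm, DecoratorSet decorators,
                                           Address src, Register tmp1, Register tmp2) {}
    virtual void object_reference_read_post(MacroAssembler* masm, DecoratorSet decorators,
                                            Address src, Register dst, Register tmp1, Register tmp2) {}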

@wks wks requested review from qinsoon and tianleq September 25, 2025 06:18
Member

@qinsoon qinsoon left a comment


Can you run benchmarks with GenImmix to make sure the changes do not affect performance?


__ branch_destination(slow->continuation());
void MMTkSATBBarrierSetC1::object_reference_write_pre(LIRAccess& access, LIR_Opr src, LIR_Opr slot, LIR_Opr new_val) const {
object_reference_write_pre_or_post(access, src, true);
Member


I find this a bit confusing when you don't have comments like /* pre = */ at the call site. Maybe just have two functions, object_reference_write_pre and object_reference_write_post, which call an internal function that handles both cases.

Contributor Author


I should have comments on those call sites, too. I'll add them.

I can't have object_reference_write_pre and object_reference_write_post in MMTkUnlogBitBarrierSetC1. They are virtual methods in MMTkBarrierSetC1, and defining them in MMTkUnlogBitBarrierSetC1 would override those virtual methods. Instead, each concrete barrier type (MMTkObjectBarrierSetC1 and MMTkSATBBarrierSetC1) should override one of them, not both. The overriding method, like MMTkSATBBarrierSetC1::object_reference_write_pre here, should call a method in MMTkUnlogBitBarrierSetC1 with a different name, which is object_reference_write_pre_or_post in this case.
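
A condensed sketch of that pattern (the signatures follow the snippet above, otherwise abbreviated; only the shape matters):

    class MMTkUnlogBitBarrierSetC1 : public MMTkBarrierSetC1 {
    protected:
      // Shared helper; deliberately not named object_reference_write_pre/post,
      // so it does not override the virtual methods in MMTkBarrierSetC1.
      void object_reference_write_pre_or_post(LIRAccess& access, LIR_Opr src, bool pre) const;
    };

    class MMTkSATBBarrierSetC1 : public MMTkUnlogBitBarrierSetC1 {
    protected:
      virtual void object_reference_write_pre(LIRAccess& access, LIR_Opr src, LIR_Opr slot, LIR_Opr new_val) const {
        object_reference_write_pre_or_post(access, src, /* pre = */ true);
      }
    };

    class MMTkObjectBarrierSetC1 : public MMTkUnlogBitBarrierSetC1 {
    protected:
      virtual void object_reference_write_post(LIRAccess& access, LIR_Opr src, LIR_Opr slot, LIR_Opr new_val) const {
        object_reference_write_pre_or_post(access, src, /* pre = */ false);
      }
    };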

wks added 2 commits September 25, 2025 15:13
The code following the comment is just for C1, not assembler that emits
stubs for C1.
@wks
Contributor Author

wks commented Sep 26, 2025

Can you run benchmarks with GenImmix to make sure the changes do not affect performance?

Here are the results. vole.moma, lusearch from DaCapo Chopin, 40 invocations, 5 iterations, with three heap sizes: 1.9x, 2.4x and 3.0x w.r.t. SemiSpace on vole.moma. build1 is master and build2 is this PR.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-25-Thu-081657&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&hfac^mmtk_gc&build;build1|40&Histogram%20(with%20CI)^build^mmtk_gc&

Most data show that there is no difference. But some cases, such as (build1, 1.8x, Immix) and (build2, 2.4x, StickyImmix), exhibited a large slowdown together with large noise. They are caused by abnormal slowdowns in mutator time (time.other) in one or two invocations, which can be seen by plotting the time of each invocation separately.

Since it affects both the master branch and this PR, I suspect it is caused by either

  • an existing bug in the master branch that caused abnormal slowdown in mutator, or
  • some anomalies on vole.moma that interrupted the testing processes at unpredictable times.

I'll run the experiment on a different moma machine to see if it can be reproduced.

@wks
Contributor Author

wks commented Sep 26, 2025

The CPU frequency of vole.moma, the machine I ran the tests on, was not locked to the maximum. This may be the cause of the abnormal slowdown. I am running the same tests on another machine where the CPU frequencies are set to the maximum.

@qinsoon
Member

qinsoon commented Sep 26, 2025

Can you run all the benchmarks over the weekend?

@wks
Contributor Author

wks commented Sep 26, 2025

I reran the lusearch benchmark with the same binaries and the same configuration on mole.moma (same machine configuration; we have checked that the CPU frequencies are almost always at the maximum, although some CPUs occasionally show 50% of the maximum frequency in /proc/cpuinfo).

The results are similar. Overall, there is no significant evidence of performance improvement or regression.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2025-09-26-Fri-054710&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^hfac^mmtk_gc&build;build1|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&Histogram%20(with%20CI)^build-mmtk_gc^hfac&

And there are still random outliers in mutator time as shown below, in both build1 and build2. The two yellow spikes below are from Immix in build1 (master), and the three green spikes below are from ConcurrentImmix in build2 (this PR).

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2025-09-26-Fri-054710&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&XY^build-mmtk_gc^invocation&time.other

And there is no spike in STW time, as shown below. This is strange because in lusearch, STW time is 25% to 200% of the mutator time, depending on the plan and the heap size. If the slowdown were caused by CPU frequency, it should have increased the STW time in the same iterations by the same ratio, too. But there is no spike in STW time.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2025-09-26-Fri-054710&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&XY^build-mmtk_gc^invocation&time.stw

@wks
Contributor Author

wks commented Sep 28, 2025

Results for all benchmarks are here: DaCapo Chopin MR2 on vole.moma, 2.4x min heap size w.r.t. SemiSpace on vole.moma, 40 invocations, 5 iterations, using ConcurrentImmix, Immix and GenImmix.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^hfac^mmtk_gc&build;build1|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&Histogram%20(with%20CI)^build-mmtk_gc^benchmark&

time.stw is a bit noisy, but overall there is no clear sign of regression or improvement.

time.other shows no obvious changes either. There is one noisy benchmark, sunflow, which shows a 15% speed-up in mutator time with GenImmix. Plotting all invocations (as shown below), we don't see any spikes in the results, but GenImmix seems to be consistently faster on build2 (this PR) than on build1 (master).

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4&benchmark^1^sunflow|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|41&XY^build-mmtk_gc^invocation&

From the un-normalized plots shown below, we see build2-GenImmix is even faster than Immix in mutator time.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2025-09-26-Fri-092914&benchmark^build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|60&build&mmtk_gc|70&build-mmtk_gc^build-gc|40&Histogram%20(with%20CI)^build-mmtk_gc^benchmark&

If we assume C2-compiled code dominates the execution time, then the only change I made to the C2 barrier is removing the slot and new_val arguments from the slow-path call. But if write barrier slow paths dominated the execution time of sunflow so much that two fewer registers could affect performance, then build1-GenImmix should be significantly slower than Immix.

perf didn't show any significant bottleneck in build1-GenImmix.

@tianleq
Collaborator

tianleq commented Sep 28, 2025

If slot gets removed, does this still work with field barriers? I noticed that code patching also gets removed; I presume it is because slot is removed and thus nothing needs to be patched.

@wks
Contributor Author

wks commented Sep 29, 2025

If slot gets removed, does this still work with field barriers?

No. It won't work with field barriers. That's why these methods are implemented in MMTkUnlogBitBarrierSet{Assembler,C1,C2} instead of MMTkBarrierSet{Assembler,C1,C2}. To use field barriers, we need to implement a new write barrier fast path anyway. If that fast path fails, it will need to call the slow path with the slot argument. We will do that in something like MMTkFieldUnlogBarrierSet{Assembler,C1,C2}.

The code blobs for calling runtime functions, i.e. MMTkBarrierSetC1::_object_reference_write_{pre,post,slow}_c1_runtime_code_blob, still properly forward the src, slot and new_val arguments to the runtime functions as long as those arguments are provided, so they should still work for field barriers.
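
For reference, the runtime functions behind those blobs take roughly this shape (a sketch only; the exact declarations in the binding may differ):

    // Illustrative signatures only; the real declarations may differ.
    extern "C" {
      void mmtk_object_reference_write_pre (void* mutator, void* src, void* slot, void* target);
      void mmtk_object_reference_write_post(void* mutator, void* src, void* slot, void* target);
      void mmtk_object_reference_write_slow(void* mutator, void* src, void* slot, void* target);
    }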

I noticed that code patching also gets removed; I presume it is because slot is removed and thus nothing needs to be patched.

That's right. If we don't need the slot address, we don't need code patching. We will need code patching again to support field barriers.

@qinsoon
Member

qinsoon commented Sep 29, 2025

There are 3 tests failing in ConcurrentImmix.

@tianleq
Collaborator

tianleq commented Oct 2, 2025

A side note: in C1, code patching is done on instructions instead of LIR. When code needs patching, the value of slot is incorrect. To support field barriers, we need to be conservative in this case, keeping track of all the slots of src.

@wks
Contributor Author

wks commented Oct 10, 2025

This PR may have introduced a bug in the SATB barrier. So far we have observed it crashing in avrora, h2o, lusearch and xalan in the CI. It is hard to reproduce, but I managed to reproduce it locally in h2o. It always crashes after FinalMark, and it crashes at random locations, including both native code and Java code. It looks like some objects are prematurely reclaimed, possibly related to weak references.

@wks
Contributor Author

wks commented Oct 16, 2025

I fixed one obvious bug, a typo that skipped the barrier when the new field value is null. 940c691

But the SIGSEGV bug remains. It is not related to the weak-ref load barrier, because it can still be reproduced after disabling reference processing and the ref-load barrier. When I disable the C1 write barrier fast path, the bug can't be reproduced, so I think that fast path is the culprit. It is my least favorite part of the refactoring because it tries to work around several limitations of the C1 IR. I'll look into it later.

@tianleq
Collaborator

tianleq commented Oct 17, 2025

I ran into the same issue when I was implementing the prototype.

I also ran into some C1 issues, and it might be related. When code needs patching, the value of the slot is invalid. The comment in the following snippet is incorrect: even after doing that, the scratch register still contains an invalid slot. The only reason for doing this is to make the C1 LIR assembler happy; otherwise it might trigger an assertion failure saying something like "expecting a register but found an address":

if (stub->patch_code != lir_patch_none) {
  // Patch
  assert(stub->scratch->is_single_cpu(), "must be");
  assert(stub->scratch->is_register(), "Precondition.");
  ce->mem2reg(stub->slot, stub->scratch, T_OBJECT, stub->patch_code, stub->info, false /*wide*/, false /*unaligned*/);
  // Now stub->scratch contains the pre_val instead of the slot address,
  // so the following loads the slot address into the scratch register.
  // Resolve address
  auto masm = ce->masm();
  LIR_Address* addr = stub->slot->as_address_ptr();
  Address from_addr = ce->as_Address(addr);
  __ lea(stub->scratch->as_register(), from_addr);
  // Store parameter
  ce->store_parameter(stub->scratch->as_pointer_register(), 1);
}

@wks
Contributor Author

wks commented Oct 17, 2025

I can't reproduce the bug with -XX:-UseCompressedOops. Maybe it is related to compressed pointers.

@wks
Contributor Author

wks commented Oct 17, 2025

After updating mmtk-core to revision mmtk/mmtk-core@1ffa5b3 (PR mmtk/mmtk-core#1400), I can't reproduce the crash. It was still reproducible before that commit. I don't understand why, since OpenJDK doesn't use alloc_with_options at all.

@wks
Contributor Author

wks commented Oct 20, 2025

I managed to reproduce the crash on the master branch. It seems to be more reproducible if we disable C2.

(update: I created an issue for this: #334, and it contains steps for reproducing it on the master branch)

So it is an existing bug. It seems to be more reproducible on the CI server after this PR is merged. But locally I can reproduce the bug on master as often as with this PR.
