-
Notifications
You must be signed in to change notification settings - Fork 38
Refactor barrier implementations #332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
They are not specific to concrete barriers, although each barrier only uses some but not all of them.
And no define private as public.
and move the slow-path stub to UnlogBarrierSet, too. And combine the pre and post barrier stub.
This is the basic rule for using pre-compiled headers.
SATB barrier does not have arraycopy epilogue.
|
I labelled this as ready for review. There is one thing intentionally not done in this PR: To be consistent with write barriers,
Because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you run benchmarks with GenImmix to make sure the changes do not affect performance?
openjdk/barriers/mmtkSATBBarrier.cpp
Outdated
|
|
||
| __ branch_destination(slow->continuation()); | ||
| void MMTkSATBBarrierSetC1::object_reference_write_pre(LIRAccess& access, LIR_Opr src, LIR_Opr slot, LIR_Opr new_val) const { | ||
| object_reference_write_pre_or_post(access, src, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this a bit confusing when you don't have comments for /* pre = */. Maybe just have two functions. object_reference_write_pre and object_reference_write_post, they call an internal function that deals both cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have comments on those call sites, too. I'll add them.
I can't have object_reference_write_pre and object_reference_write_post in MMTkUnlogBitBarrierSetC1. They are virtual methods in MMTkBarrierSetC1, and having them in MMTkUnlogBitBarrierSetC1 will override those virtual methods. Instead, each concrete barrier type (MMTkObjectBarrierSetC1 and MMTkSATBBarrierSetC1) should override one of them, and not both. The overriding method, like this MMTkSATBBarrierSetC1::object_reference_write_pre, should call a method in MMTkUnlogBitBarrierSetC1 that has a different name, which is object_reference_write_pre_or_post in this case.
The code following the comment is just for C1, not assembler that emits stubs for C1.
Here are the results. vole.moma, lusearch from DaCapo Chopin, 40 invocations, 5 iterations, with three heap sizes: 1.9x, 2.4x and 3.0x w.r.t. SemiSpace on vole.moma. build1 is master and build2 is this PR. Most data show that there is no difference. But some cases, such as
Since it affects both the master branch and this PR, I suspect it is caused by either
I'll run the experiment on a different moma machine to see if can be reproduced. |
|
The CPU frequency of vole.mama, the machine I ran tests on, was not locked to maximum. This may be the cause of the abnormal slowdown. I am running the same tests on another machine where CPU frequencies are set to maximum. |
Can you run all the benchmarks over the weekend? |
|
I reran the lusearch benchmark with the same binaries and same configuration on mole.moma (same machine configuration, but we have checked the CPU frequencies are almost always max, although some CPUs occasionally show 50% of the max CPU frequency in The results are similar. Overall, there is no significant evidence of performance improvement or regression. And there are still random outliers in mutator time as shown below, in both build1 and build2. The two yellow spikes below are from Immix in build1 (master), and the three green spikes below are from ConcurrentImmix in build2 (this PR). And there is no spike in STW time, as shown below. This is strange because in lusearch, STW time is 25% to 200% of the mutator time, depending on the plan and the heap size. If the slowdown is caused by CPU frequency, it should increase the STW time in the same iterations by the same ratio, too. But there is no spike in STW time. |
|
Results for all benchmarks are here. DaCapo Chopin MR2, vole.moma, 2.4x min heap size w.r.t semispace on vole.moma, 40 invocations, 5 iterations, using ConcurrentImmix, Immix and GenImmix. time.stw is a bit noisy, but overall there is no clear sign of regression or improvement. time.other shows no obvious changes, too. There is one noisy benchmark From the un-normalized plots shown below, we see build2-GenImmix is even faster than Immix in mutator time. If we assume C2-compiled code is dominating the execution time, then the only change I made to the C2 barrier is removing the
|
If |
No. It won't work with field barriers. That's why the methods are implemented in The code blobs for calling runtime functions, i.e.
That's right. If we don't need the slot address, we don't need to code patching. We will need code patching again to support field barriers. |
|
There are 3 tests failing in concurrent Immix. |
|
A side note, in C1, code patching is done on instruction instead of LIR. When code needs patching, the value of |
|
This PR may have introduced bug in the SATB barrier. So far we have observed it crashing in avrora, h2o, lusearch, and xalan in the CI. It is hard to reproduce, but I managed to reproduce it locally in h2o. It always crashes after FinalMark, and it crashes at random locations, including both native code and Java code. It looks like some objects are prematurely reclaimed, possibly related to weak references. |
|
I fixed one obvious bug which is a typo that removes the barrier when the new field value is null. 940c691 But the SIGSEGV bug remains. It is not related to the weak ref load barrier because it can still be reproduced after disabling reference processing and the ref-load barrier. When I disable the C1 write barrier fast path, the bug can't be reproduced. I think it is the culprit. It is the least favorite part of the refactoring because it tries to work around several limitations of the C1 IR. I'll look into that later. |
I ran into the same issue when I was implementing the prototype. I also ran into some C1 issues and it might be related. So when code needs patching, the value of the slot is invalid. The comment in the following is incorrect, after doing those the scratch register still contains an invalid slot. The only reason of doing this is to make the c1 LIR assembler happy, otherwise, it might trigger an assertion failure saying something like "expecting a register but found an address" |
|
I can't reproduce the bug with |
|
After updating mmtk-core to revision mmtk/mmtk-core@1ffa5b3 (PR mmtk/mmtk-core#1400), I can't reproduce the crash. It was still reproducible before that commit. I don't understand it. OpenJDK doesn't use |
|
I managed to reproduce the crash on the master branch. It seems to be more reproducible if we disable C2. (update: I created an issue for this: #334, and it contains steps for reproducing it on the master branch) So it is an existing bug. It seems to be more reproducible on the CI server after this PR is merged. But locally I can reproduce the bug on master as often as with this PR. |
We recently added another write barrier, the
SATBBarrier. Being another barrier based on the per-object unlog bit, it shares much code with theObjectBarrier. We refactor the write barriers to improve code sharing and clean up unused code.We introduce abstract base classes
MMTkUnlogBitBarrierSetRuntime,MMTkUnlogBitBarrierSetAssembler,MMTkUnlogBitBarrierSetC1andMMTkUnlogBitBarrierSetC2, which bothObjectBarrierandSATBBarrierderive from. Those base classes include common parts related to unlog bit handling, including the fast paths for checking if the unlog bit is set, and also the generic implementation for both the pre and the post barriers if applicable.Runtime:
Assembler:
tmp1andtmp2temporary registers passed in fromBarrierSetAssembler::store_atmay overlap withdst.base()anddst_index(), respectively.C1:
MMTkBarrierSetAssemblerinstead of byMMTkObjectBarrierSetAssemblerorMMTkSATBBarrierSetAssembler. The intention is that all barriers can call those functions.slotandnew_valto the runtime function in the slow path. Object-logging barriers only needs theobjargument for scanning its fields.mmtkBarrierSetC1.hpptommtkUnlogBitBarrier.hpp(and unified the pre/post stub implementation) because it no longer needs theslotornew_valargs, and is now specific to unlog bit barriers. If we introduce field-logging barriers, we will implement a different stub.mmtk_object_reference_write_pre), not using LIR instructions.C2:
slotandnew_valto the runtime function in the slow path.Misc:
#define private publichack as we no longer need it.SOFT_REFERENCE_LOAD_BARRIERto a run-time flagmmtk_enable_reference_load_barriersettable by environment variable.