"Mark-delay" performance improvement to major GC #13580
Conversation
*Force-pushed from 31fc961 to 72500e9.*
Thanks! I’ll have a look tomorrow.
*Force-pushed from 72500e9 to 9af1835.*
*Force-pushed from b341dd4 to 4a6c7af.*
I've addressed @kayceesrk's review comments and rebased.
Thanks to @kayceesrk and @stedolan, I think this can come out of draft now.
The code looks ok now. MSVC 32-bit has a failing test. I'll wait until the test is fixed before I approve the PR.
*Force-pushed from 4a6c7af to 7d91cc4.*
Any insights on this so far?
*Force-pushed from 7d91cc4 to 295543e.*
The Win32 problem is due to an accounting error, observed across platforms, which causes the … (note: this accounting problem is specific to this PR, and not observed on trunk).
*Force-pushed from 7a17a33 to af5fb77.*
I have further diagnosed the accounting problem, and put in a dirty hack to test my hypothesis. The hack, which should not be merged, artificially consumes the rest of the slice budget at the point at which sweeping is first completed; the fact that this makes the problem go away demonstrates that the accounting issue is the cause.
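For readers following along from the OCaml side: the slice budget being discussed lives in the C runtime, but `Gc.major_slice` is the stdlib-level way to request a slice of a given size. A minimal sketch for orientation only; it is not part of the hack, which patches the runtime's internal accounting:

```ocaml
(* Request major-GC slices explicitly.  [Gc.major_slice n] does a minor
   collection plus a major slice sized to free roughly [n] words;
   [n = 0] lets the runtime compute the budget automatically. *)
let () =
  ignore (Gc.major_slice 0);       (* automatic budget *)
  ignore (Gc.major_slice 10_000)   (* ask for ~10k words of work *)
```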
*Force-pushed from 6eb0f2d to be8c5b7.*
The CI failure is on a fiddly test. Here's the modified code that exercises the darkening of the custom_tag objects during a …

```ocaml
(* TEST *)

let w = Weak.create 1

let major_obj () =
  let n = Sys.opaque_identity 42 in
  let v = Sys.opaque_identity (Int64.of_int n) in
  Gc.full_major ();
  v (* [v] is unmarked *)

let () =
  Weak.set w 0 (Some (major_obj ()));
  (* [v] is unmarked *)
  let x = Option.get (Weak.get_copy w 0) in
  (* [v] is marked because [Weak.get_copy] marks the object without making a
     copy. *)
  Gc.major ();
  Printf.printf "value: %Ld\n%!" x;
  let junk = List.init 1_000_000 Fun.id in
  Gc.minor ();
  ignore (Sys.opaque_identity junk);
  (* check that the memory representing x did not get reused in junk *)
  Printf.printf "value: %Ld\n%!" x
```
The …
I've also kicked off a CI test run here: https://github.com/ocaml-multicore/multicoretests/tree/target-mark-delay-13580. I'll send a PR to bump the workflow's targeted …
Hm. I can see the debug runtime workflow is failing with an assertion error: … which is this assertion: …

Edit: The other (non-debug-runtime) workflows completed successfully.
@jmid how do I get the reproduction case for the failure? I’ll be able to have a look.
I just confirmed locally that the following will trigger the assertion error almost immediately: … with …

I'll try to carve out a standalone reproducer.
Here is a standalone repro triggering the problem:

```ocaml
let test () =
  let t = Weak.create 256 in
  Weak.fill t 0 1 (Some 1L);
  let d = Domain.spawn (fun () ->
    Weak.get_copy t 0 |> ignore
  ) in
  Domain.join d

let _ =
  for _i = 1 to 100 do
    Printf.printf "#%!";
    test ()
  done
```

This takes only a few iterations to consistently trigger it on my Linux box (the bigger the weak array, the sooner it seems to trigger): …
The fix turned out to be a simple one. I've made a PR here: NickBarnes#4.
The code looks good to me.
```
(sync, handler, data, leader_setup, enter_spin_callback, enter_spin_data)

Try to run an STW section in which all domains will run
`handler(domain, data, N, p)` (where `N` and `p` define the
```
What are `N` and `p`?
The number of domains participating in the STW, and an array of their domain_state pointers. These variables are commonly called `participant_count` and `participating` in the handlers. I find those variable names a bit cumbersome and visually obtrusive; I have contemplated a refactor in which we have an STW data structure incorporating these as fields, conventionally bound to a short variable name like `stw`, but this PR is not the place for it.
Yes, agreed. Another PR. Ok with the comment for now.
```
Bearing all that in mind, the spec of this function, which drives
all STW sections, is as follows:

`caml_try_run_on_all_domains_with_spin_work
```
In a different PR, we should reorganise this large comment to first describe what the function does (what follows) and then have the details on how it interacts with domain creation and termination.
@NickBarnes gentle ping to consider NickBarnes#4.
*Force-pushed from 3802cb2 to c2e3589.*
*New commit: "…unter at the start of any slice when it falls very far behind alloc_counter." (Co-authored-by: Stephen Dolan <[email protected]>)*
*Force-pushed from c2e3589 to d372edb.*
Thanks for that @kayceesrk; I have merged it after a rebase. I think I've now addressed the various review comments on this, and it appears to be passing tests.
I've been looking at the macos-arm64 failure. The test failure is on https://github.com/ocaml/ocaml/blob/trunk/testsuite/tests/parallel/major_gc_wait_backup.ml#L41. The test expects new major cycles to have completed after doing some long-lived allocating work. I initially assumed that this PR collects the floating garbage quickly and hence we don't get to see a cycle completed. But that's not the case here.

**Bug?**

Based on the GC logs, I suspect there is a bug which prevents the GC from advancing from … My conjecture is that the transition from … is under the guard which checks that the mode is not opportunistic. In the GC trace, I see that the two domains are stuck at …

Also, this bug (if it is one) is possible on trunk, where the phase change code is guarded under a similar check. The current PR triggers it.

**Test**

I've made a small change to the test to make it fail more frequently:

```diff
diff --git a/test.ml b/major_gc_wait_backup.ml
index f039604441..0b1ff9f49b 100644
--- a/test.ml
+++ b/major_gc_wait_backup.ml
@@ -37,8 +37,6 @@ let _ =
   ) in
   Gc.full_major ();
   let n = major_collections () in
-  for i = 1 to 5 do
-    ignore (make 24);
-  done;
+  ignore (make 24);
   assert ((major_collections ()) > n);
   print_endline "sleep OK"
```

**Good run**

A successful run uses around 800M.

```
% /usr/bin/time -l ./a.out
wait OK
sleep OK
        2.46 real         2.36 user         0.11 sys
           811499520  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               49674  page reclaims
                  73  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                2709  involuntary context switches
         41202382335  instructions retired
          8064320362  cycles elapsed
           811125184  peak memory footprint
```

The tail of the GC message trace filtered to major slice messages looks like this: …

**Bad run**

An unsuccessful run ends up using a lot more memory (1.88 GB in this case):

```
% /usr/bin/time -l ./a.out
wait OK
Fatal error: exception File "test.ml", line 43, characters 2-8: Assertion failed
        1.57 real         1.34 user         0.20 sys
          2021195776  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              123559  page reclaims
                  20  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                3788  involuntary context switches
         25039794935  instructions retired
          4791806351  cycles elapsed
          2021413632  peak memory footprint
```

The tail of the bad run trace looks like this: …
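For readers who don't want to open the test file: a stripped-down sketch of the shape of `major_gc_wait_backup.ml` as exercised by the diff above. `make` is a hypothetical stand-in for the test's allocator (its real definition is not quoted in this thread); `major_collections` matches the helper used in the diff.

```ocaml
(* Hypothetical reduction of the failing test: after a full major GC and
   some long-lived allocation, at least one new major cycle should have
   completed.  [make] is an illustrative allocator, not the real one. *)
let major_collections () = (Gc.stat ()).Gc.major_collections

let make n = Array.init (1 lsl n) (fun i -> Some i)

let () =
  Gc.full_major ();
  let n = major_collections () in
  ignore (Sys.opaque_identity (make 24));
  assert (major_collections () > n);
  print_endline "sleep OK"
```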
I've narrowed the problem down to https://github.com/NickBarnes/ocaml/blob/d372edbb85b26090914185bd2491b9118e73b740/runtime/major_gc.c#L1987-L1990:

```c
/* Finalisers */
if (caml_gc_phase == Phase_mark_final &&
    get_major_slice_work(mode) > 0 &&
    caml_final_update_first(domain_state)) {
  /* This domain has updated finalise first values */
  (void)caml_atomic_counter_decr(&num_domains_to_final_update_first);
```

In the test, there is an idle domain, which does not allocate. The GC log has …

The call to … There is no finaliser work, but unless the counter is decremented, the GC will not progress to the next phase. The change was made in #11903. CC @stedolan. We will have a similar issue with …
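To make the failure mode concrete, here is a hedged OCaml-level sketch of the configuration described: one domain sits idle without allocating (so, per the analysis above, it never takes a major slice and never decrements `num_domains_to_final_update_first`), while another domain drives the GC. Whether this actually stalls a phase change depends on the runtime build under test; it illustrates the setup, and is not a guaranteed reproducer.

```ocaml
(* One idle, non-allocating domain plus one allocating domain that
   drives major slices.  [Domain.cpu_relax] keeps the idle loop from
   allocating. *)
let stop = Atomic.make false

let () =
  let idle = Domain.spawn (fun () ->
    while not (Atomic.get stop) do Domain.cpu_relax () done)
  in
  for _ = 1 to 50 do
    ignore (Sys.opaque_identity (List.init 100_000 Fun.id));
    ignore (Gc.major_slice 0)
  done;
  Atomic.set stop true;
  Domain.join idle
```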
Separately, we should consider making …
@NickBarnes wonder if you've had some cycles to look at this? Would be useful to get this PR across the line.
This is the upstreaming of oxcaml/oxcaml#2348, oxcaml/oxcaml#2358 (minor), and oxcaml/oxcaml#3029 by @stedolan. It introduces a new sweep-only phase at the start of each major GC cycle. This reduces the "latent garbage delay" - the time between a block becoming unreachable and it becoming available for allocation - by approximately half a major GC cycle.

Because marking, including root marking, doesn't take place until part-way through the GC cycle (when we move from sweep-only to mark-and-sweep), the allocation colour is not always `MARKED` but changes from `UNMARKED` to `MARKED` at that point. Effectively we switch from a grey mutator allocating white to a black mutator allocating black.

This PR is in draft because I've just done a fairly mechanical (although manual) patch application; I'm publishing it so that @stedolan and perhaps @kayceesrk can take a look. It passes the whole testsuite on my machine, including the new test (`parallel/churn.ml`) written by @stedolan for the `flambda-backend` mark-delay work.
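As a rough illustration of the counters involved (not of mark-delay itself, whose effect on latent garbage is internal to the runtime and not directly visible through this API), one can watch major-cycle completions from OCaml:

```ocaml
(* Count how many major cycles complete while churning the major heap.
   Mark-delay changes when, within a cycle, swept memory becomes
   reusable; this sketch only shows the cycle counter advancing. *)
let () =
  let before = (Gc.stat ()).Gc.major_collections in
  for _ = 1 to 100 do
    ignore (Sys.opaque_identity (List.init 100_000 Fun.id))
  done;
  let after = (Gc.stat ()).Gc.major_collections in
  Printf.printf "major cycles during churn: %d\n" (after - before)
```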