
"Mark-delay" performance improvement to major GC #13580


Open
wants to merge 5 commits into base: trunk

Conversation

NickBarnes
Contributor

@NickBarnes NickBarnes commented Oct 29, 2024

This is the upstreaming of oxcaml/oxcaml#2348 oxcaml/oxcaml#2358 (minor) and oxcaml/oxcaml#3029 by @stedolan. It introduces a new sweep-only phase at the start of each major GC cycle. This reduces the "latent garbage delay" - the time between a block becoming unreachable and it becoming available for allocation - by approximately half a major GC cycle.

Because marking, including root marking, doesn't take place until part-way through the GC cycle (when we move from sweep-only to mark-and-sweep), the allocation colour is not always MARKED but changes from UNMARKED to MARKED at that point. Effectively we switch from a grey mutator allocating white to a black mutator allocating black.

This PR is in draft because I've just done a fairly mechanical (although manual) patch application; I'm publishing it so that @stedolan and perhaps @kayceesrk can take a look. It passes the whole testsuite on my machine, including the new test (parallel/churn.ml) written by @stedolan for the flambda-backend mark-delay work.

@NickBarnes NickBarnes force-pushed the nick-markdelay branch 2 times, most recently from 31fc961 to 72500e9 on October 29, 2024 at 16:56
@kayceesrk
Contributor

kayceesrk commented Oct 29, 2024

Thanks! I’ll have a look tomorrow.

@NickBarnes NickBarnes added the run-multicoretests label (Makes the CI run the multicore testsuite) on Oct 29, 2024
@NickBarnes NickBarnes force-pushed the nick-markdelay branch 3 times, most recently from b341dd4 to 4a6c7af on October 31, 2024 at 09:58
@NickBarnes
Contributor Author

I've addressed @kayceesrk's review comments and rebased.

@NickBarnes NickBarnes marked this pull request as ready for review October 31, 2024 10:27
@NickBarnes
Contributor Author

Thanks to @kayceesrk and @stedolan I think this can come out of draft now.

@kayceesrk
Contributor

The code looks ok now.

MSVC 32-bit has a failing test. I'll wait until the test is fixed before I approve the PR.

@kayceesrk
Contributor

> MSVC 32-bit has a failing test. I'll wait until the test is fixed before I approve the PR.

any insights on this so far?

@NickBarnes NickBarnes marked this pull request as draft November 8, 2024 14:38
@NickBarnes
Contributor Author

The Win32 problem is due to an accounting error, observed across platforms, which causes the work_counter to trail further and further behind the alloc_counter. Eventually, on 32-bit platforms, the difference is enough to trigger an underflow, giving negative work budgets. At that point, progress on the current cycle stalls but allocation continues, causing the heap to balloon. On 32-bit Windows, this leads to out-of-memory at (just under) a 2 GiB heap. On 32-bit Linux machines with enough memory, the heap grows past 2 GiB, leading the alloc_counter to wrap around once more, pushing work budgets back into positive territory, so the GC is able to progress again, collecting successfully until the problem repeats.
On 64-bit platforms the same accounting problem is observed, but it doesn't cause the same symptoms.
Taking this PR back into draft mode until I've got a fix.

@NickBarnes
Contributor Author

(note: this accounting problem is specific to this PR, and not observed on trunk).

@NickBarnes
Contributor Author

NickBarnes commented Nov 9, 2024

I have further diagnosed the accounting problem, and put in a dirty hack to prove my hypothesis.
On the trunk, we maintain alloc_counter as a global estimate of the GC work necessary to keep up with allocation, and work_counter as a global record of the total amount of GC work done. Both of these counters are intnat sized, and on 32-bit platforms they wrap around. For each slice, in each domain, we work until the work_counter exceeds the slice_target - the value that the alloc_counter had at the start of the slice - or (alternatively) we run out of work to do.
In this way, these two counters remain approximately in sync.
However, on this branch, we delay marking until after sweeping has finished on at least one domain, and we always start marking in a fresh slice (after a minor GC). So on the last sweeping slice (on the domain which reaches the end of sweeping first), approximately half of the work expected in the slice is not used, and work_counter slips behind alloc_counter by this amount. At the end of the slice, instead of work_counter slightly exceeding slice_target, or not quite reaching it (if we got to the end of the collection), it falls behind by (on average) half of a slice's worth of work. Depending on the workload, there may be too little marking to catch up on the next slice (testsuite/tests/regression/pr5757/pr5757.ml is one such case).
So, as the process continues, work_counter gradually falls further and further behind alloc_counter. This may lead to larger slice budgets and thus longer (but fewer) GC pauses, but should be otherwise harmless. However (as described in a previous comment), on 32-bit platforms eventually the shortfall is enough to trigger an underflow, giving negative work budgets. At that point, progress on the current cycle stalls but allocation continues, causing the heap to balloon. On 32-bit Windows we eventually hit the 2 GiB process memory limit, causing us to run out of memory with a (just under) 2 GiB heap. On 32-bit Linux machines with enough memory, we are able to continue past that point (as there is a 3 GiB process memory limit), advancing alloc_counter until it overtakes work_counter, pushing work budgets back into positive territory. Then the GC is able to progress again, successfully collecting, until the problem repeats.
On 64-bit platforms the same accounting problem is observed, but it doesn't cause the same symptoms. Instead the slice targets grow larger and larger, which may exacerbate the problem as all the sweeping is easily completed in a single slice.

My hack, which should not be merged, demonstrates that addressing this accounting issue fixes the problem by artificially consuming the rest of the slice budget at the point at which sweeping is first completed.
Paging @stedolan and @damiendoligez for thoughts.
I also have a sketch of a refactor of major_collection_slice to make this sort of thing easier to reason about. I hope to make a PR for that in the next week or two.

@NickBarnes NickBarnes marked this pull request as ready for review June 23, 2025 15:53
@kayceesrk
Contributor

kayceesrk commented Jun 24, 2025

The CI failure is on a fiddly test tests/weak-ephe-final/ephe_custom.ml. This test was added in #13859. The test depends on the specific colour of the custom object at Weak.set. This PR changes the allocation colour and hence the test no longer works as expected.

Here's the modified code that exercises the darkening of the custom_tag objects during a Weak.get_copy, capturing the intent of the original test.

(* TEST *)

let w = Weak.create 1

let major_obj () =
  let n = Sys.opaque_identity 42 in
  let v = Sys.opaque_identity (Int64.of_int n) in
  Gc.full_major ();
  v (* [v] is unmarked *)

let () =
  Weak.set w 0 (Some (major_obj ()));
  (* [v] is unmarked *)
  let x = Option.get (Weak.get_copy w 0) in
  (* [v] is marked because [Weak.get_copy] marks the object without making a
     copy. *)
  Gc.major ();
  Printf.printf "value: %Ld\n%!" x;
  let junk = List.init 1_000_000 Fun.id in
  Gc.minor ();
  ignore (Sys.opaque_identity junk);
  (* check that the memory representing x did not get reused in junk *)
  Printf.printf "value: %Ld\n%!" x

@jmid
Contributor

jmid commented Jun 24, 2025

The multicoretests failure is due to #14081 and has since been fixed on the main repo.
It should be sufficient to change the following line to ref: 'main' to get it going again:

I've also kicked off a CI test run here: https://github.com/ocaml-multicore/multicoretests/tree/target-mark-delay-13580

I'll send a PR to bump the workflow's targeted multicoretests version when we release the next one (we've pinned a particular tag so that test-suite updates between runs don't add needless version confusion 🤷)

@jmid
Contributor

jmid commented Jun 24, 2025

Hm. I can see the debug runtime workflow is failing with an assertion error:
https://github.com/ocaml-multicore/multicoretests/actions/runs/15850475068/job/44682432570#step:16:1418

random seed: 117530064
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin Weak stress test with Domain
[00] file runtime/major_gc.c; line 1522 ### Assertion failed: caml_marking_started()
File "src/weak/dune", line 52, characters 7-16:
52 |  (name lin_tests)
            ^^^^^^^^^
(cd _build/default/src/weak && ./lin_tests.exe --verbose)
Command got signal ILL.

which is this assertion:
https://github.com/NickBarnes/ocaml/blob/3802cb2077ebaf8a67a28ef26fa244d3205f38ae/runtime/major_gc.c#L1522

Edit: The other (non-debug-runtime) workflows completed successfully

@kayceesrk
Contributor

@jmid how do I get the reproduction case for the failure? I’ll be able to have a look.

@jmid
Contributor

jmid commented Jun 25, 2025

I just confirmed locally that the following will trigger the assertion error almost immediately:

git clone -b nick-markdelay [email protected]:NickBarnes/ocaml.git ocaml-06-25-2025-PR13580
cd ocaml-06-25-2025-PR13580
opam switch create . --empty
opam pin add -k path --inplace-build ocaml-variants.5.5.0+mark-delay .
opam install qcheck-core
git clone [email protected]:ocaml-multicore/multicoretests.git
cd multicoretests
OCAMLRUNPARAM='s=4096,V=1' dune exec src/weak/lin_tests.exe --profile=debug-runtime

with

Done: 75% (3/4, 1 left) (jobs: 0)### OCaml runtime: debug mode ###
### set OCAMLRUNPARAM=v=0 to silence this message
random seed: 418233538
[01] file runtime/major_gc.c; line 1522 ### Assertion failed: caml_marking_started()
Command got signal ILL.

I'll try to carve out a standalone reproducer.

@jmid
Contributor

jmid commented Jun 25, 2025

Here is a standalone repro triggering the problem:

let test () =
  let t = Weak.create 256 in
  Weak.fill t 0 1 (Some 1L);
  let d = Domain.spawn (fun () ->
      Weak.get_copy t 0 |> ignore
    ) in
  Domain.join d

let _ =
  for _i = 1 to 100 do
    Printf.printf "#%!";
    test ()
  done

This takes only a few iterations to consistently trigger it on my Linux box (the bigger the weak array, the sooner it seems to trigger):

$ ocamlopt -runtime-variant=d repro.ml && OCAMLRUNPARAM='s=4096,V=1,b' ./a.out
### OCaml runtime: debug mode ###
### set OCAMLRUNPARAM=v=0 to silence this message
###[01] file runtime/major_gc.c; line 1522 ### Assertion failed: caml_marking_started()
Illegal instruction (core dumped)

@kayceesrk
Contributor

The fix turned out to be a simple one. I've made a PR here: NickBarnes#4.

Contributor

@kayceesrk kayceesrk left a comment


The code looks good to me.

> `(sync, handler, data, leader_setup, enter_spin_callback, enter_spin_data)`
>
> Try to run an STW section in which all domains will run
> `handler(domain, data, N, p)` (where `N` and `p` define the
Contributor


what are N and p?

Contributor Author


The number of domains participating in the STW, and an array of their domain_state pointers (these are commonly called participant_count and participating in the handlers). I find those variable names a bit cumbersome and visually obtrusive; I have contemplated a refactor introducing an STW data structure with these as fields, conventionally bound to a short variable name like stw, but this PR is not the place for it.

Contributor


Yes, agreed. Another PR. Ok with the comment for now.

> Bearing all that in mind, the spec of this function, which drives
> all STW sections, is as follows:
>
> `caml_try_run_on_all_domains_with_spin_work
Contributor


In a different PR, we should reorganise this large comment to first describe what the function does (what follows) and then have the details on how it interacts with domain creation and termination.

@kayceesrk
Contributor

@NickBarnes gentle ping to consider NickBarnes#4.

@NickBarnes
Contributor Author

Thanks for that @kayceesrk; I have merged it after a rebase. I think I've now addressed the various review comments, and it appears to be passing tests.

@kayceesrk
Contributor

kayceesrk commented Jul 2, 2025

I've been looking at the macos-arm64 failure. The test failure is on https://github.com/ocaml/ocaml/blob/trunk/testsuite/tests/parallel/major_gc_wait_backup.ml#L41. The test expects new major cycles to have completed after doing some long-lived allocating work. I initially assumed that this PR collects the floating garbage quickly, and hence we don't get to see a cycle complete. But that's not the case here.

Bug?

Based on the GC logs, I suspect there is a bug which prevents the GC from advancing from Phase_mark_final. The program ends up allocating more and more.

My conjecture is that the transition from Phase_mark_final to Phase_sweep_ephe only happens when the slice mode is not opportunistic. The phase change code:

https://github.com/NickBarnes/ocaml/blob/d372edbb85b26090914185bd2491b9118e73b740/runtime/major_gc.c#L2094

is under the guard which checks that the mode is not opportunistic

https://github.com/NickBarnes/ocaml/blob/d372edbb85b26090914185bd2491b9118e73b740/runtime/major_gc.c#L1986

In the GC trace, I see that the two domains are stuck at Phase_mark_final. Is it possible that the program only triggers opportunistic major slices, preventing the GC phase change, while the program keeps allocating more and more memory? Edit: I noticed that the slices are not opportunistic but interruptible. I'm looking into this.

Also, this bug (if it is one) is possible on trunk, where the phase change code is guarded under a similar check. The current PR triggers it.

Test

I've made a small change to the test to make it fail more frequently:

diff --git a/test.ml b/major_gc_wait_backup.ml
index f039604441..0b1ff9f49b 100644
--- a/test.ml
+++ b/major_gc_wait_backup.ml
@@ -37,8 +37,6 @@ let _ =
   ) in
   Gc.full_major ();
   let n = major_collections () in
-  for i = 1 to 5 do
-    ignore (make 24);
-  done;
+  ignore (make 24);
   assert ((major_collections ()) > n);
   print_endline "sleep OK"

Good run

A successful run uses around 800M.

 % /usr/bin/time -l ./a.out                
wait OK
sleep OK
        2.46 real         2.36 user         0.11 sys
           811499520  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               49674  page reclaims
                  73  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                2709  involuntary context switches
         41202382335  instructions retired
          8064320362  cycles elapsed
           811125184  peak memory footprint

The tail of the GC message trace filtered to major slice messages looks like this:

<snip>
[00] Major slice [i.M]: 1043460 sweep,  0 mark (0 blocks)
[00] Major slice [o.U]: 4092 sweep,  0 mark (0 blocks)
[01] Major slice [i.M]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.M]: 1043460 sweep,  0 mark (0 blocks)
[00] Major slice [o.U]: 4092 sweep,  0 mark (0 blocks)
[00] Major slice [o.U]: 4092 sweep,  0 mark (0 blocks)
[01] Major slice [i.M]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.M]: 1043460 sweep,  0 mark (0 blocks)
[00] Major slice [o.U]: 4092 sweep,  0 mark (0 blocks)
[01] Major slice [i.M]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.M]: 1043460 sweep,  0 mark (0 blocks)
[00] Major slice [o.U]: 4092 sweep,  0 mark (0 blocks)
[01] Major slice [i.M]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.M]: 1043460 sweep,  0 mark (0 blocks)
[00] Major slice [o.U]: 4092 sweep,  0 mark (0 blocks)
[01] Major slice [i.M]: 0 sweep,  0 mark (0 blocks)

Bad run

An unsuccessful run ends up using a lot more memory (1.88 GB in this case):

% /usr/bin/time -l ./a.out
wait OK
Fatal error: exception File "test.ml", line 43, characters 2-8: Assertion failed
        1.57 real         1.34 user         0.20 sys
          2021195776  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              123559  page reclaims
                  20  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                3788  involuntary context switches
         25039794935  instructions retired
          4791806351  cycles elapsed
          2021413632  peak memory footprint

The tail of the bad run trace looks like this:

<snip>
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)                                                                    
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)   
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)    
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)     
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)    
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)   
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)                                                                    
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)

@kayceesrk
Contributor

kayceesrk commented Jul 3, 2025

I've narrowed the problem down to https://github.com/NickBarnes/ocaml/blob/d372edbb85b26090914185bd2491b9118e73b740/runtime/major_gc.c#L1987-L1990

    /* Finalisers */
    if (caml_gc_phase == Phase_mark_final &&
        get_major_slice_work(mode) > 0 &&
        caml_final_update_first(domain_state)) {
      /* This domain has updated finalise first values */
      (void)caml_atomic_counter_decr(&num_domains_to_final_update_first);

In the test, there is an idle domain, which does not allocate. The GC log has

[01] Updated major work: [F]  0 heap_words,  0 allocated,  0 allocated (direct),  
  0 allocated (suspended), 0 allocated (resumed),  0 alloc_work,  0 dependent_work,  
  0 extra_work,  278023983 work counter [behind],  278023983 alloc counter,  
  278023983 slice target,  0 slice budget

The call to get_major_slice_work(mode) always returns 0. This prevents the idle domain ([01]) from calling caml_final_update_first and decrementing the num_domains_to_final_update_first counter. The other domain keeps allocating.

There is no finaliser work, but unless the counter is decremented, the GC will not progress to the next phase.

The change was made in #11903. CC @stedolan. We will have a similar issue with num_domains_to_final_update_last.

@kayceesrk
Contributor

kayceesrk commented Jul 4, 2025

caml_final_update_first and caml_final_update_last are not incremental. So checking whether get_major_slice_work(mode) > 0 is too strict. For a quick fix, I suggest removing the condition. I have tested the fix, and the test no longer fails. If this is acceptable, I can make a PR to trunk.

Separately, we should consider making caml_final_update_{first,last} incremental.

@kayceesrk
Contributor

@NickBarnes wonder if you've had some cycles to look at this? Would be useful to get this PR across the line.

Labels
gc · Performance (PR or issues affecting runtime performance of the compiled programs) · run-multicoretests (Makes the CI run the multicore testsuite)
6 participants