
"Mark-delay" performance improvement to major GC #13580


Open
wants to merge 5 commits into base: trunk

Conversation

NickBarnes
Contributor

@NickBarnes NickBarnes commented Oct 29, 2024

This is the upstreaming of oxcaml/oxcaml#2348 oxcaml/oxcaml#2358 (minor) and oxcaml/oxcaml#3029 by @stedolan. It introduces a new sweep-only phase at the start of each major GC cycle. This reduces the "latent garbage delay" - the time between a block becoming unreachable and it becoming available for allocation - by approximately half a major GC cycle.

Because marking, including root marking, doesn't take place until part-way through the GC cycle (when we move from sweep-only to mark-and-sweep), the allocation colour is not always MARKED but changes from UNMARKED to MARKED at that point. Effectively we switch from a grey mutator allocating white to a black mutator allocating black.

This PR is in draft because I've just done a fairly mechanical (although manual) patch application; I'm publishing it so that @stedolan and perhaps @kayceesrk can take a look. It passes the whole testsuite on my machine, including the new test (parallel/churn.ml) written by @stedolan for the flambda-backend mark-delay work.

@NickBarnes NickBarnes force-pushed the nick-markdelay branch 2 times, most recently from 31fc961 to 72500e9 on October 29, 2024 at 16:56
@kayceesrk
Contributor

kayceesrk commented Oct 29, 2024

Thanks! I’ll have a look tomorrow.

@NickBarnes NickBarnes added the run-multicoretests label (Makes the CI run the multicore testsuite) on Oct 29, 2024
@NickBarnes NickBarnes force-pushed the nick-markdelay branch 3 times, most recently from b341dd4 to 4a6c7af on October 31, 2024 at 09:58
@NickBarnes
Contributor Author

I've addressed @kayceesrk's review comments and rebased.

@NickBarnes NickBarnes marked this pull request as ready for review October 31, 2024 10:27
@NickBarnes
Contributor Author

Thanks to @kayceesrk and @stedolan I think this can come out of draft now.

@kayceesrk
Contributor

The code looks ok now.

MSVC 32-bit has a failing test. I'll wait until the test is fixed before I approve the PR.

@kayceesrk
Contributor

> MSVC 32-bit has a failing test. I'll wait until the test is fixed before I approve the PR.

any insights on this so far?

@NickBarnes NickBarnes marked this pull request as draft November 8, 2024 14:38
@NickBarnes
Contributor Author

The Win32 problem is due to an accounting error, observed across platforms, which causes the work_counter to trail further and further behind the alloc_counter. Eventually, on 32-bit platforms, the difference is enough to trigger an underflow, giving negative work budgets. At that point, progress on the current cycle stalls but allocation continues, causing the heap to balloon. On 32-bit Windows, this leads to out-of-memory at (just under) a 2 GiB heap. On 32-bit Linux machines with enough memory, the heap grows past 2 GiB, leading the alloc_counter to wrap around once more, pushing work budgets back into positive territory, so the GC is able to progress again, collecting successfully until the problem repeats.
On 64-bit platforms the same accounting problem is observed, but it doesn't cause the same symptoms.
Taking this PR back into draft mode until I've got a fix.

@NickBarnes
Contributor Author

(note: this accounting problem is specific to this PR, and not observed on trunk).

@NickBarnes
Contributor Author

NickBarnes commented Nov 9, 2024

I have further diagnosed the accounting problem, and put in a dirty hack to prove my hypothesis.
On the trunk, we maintain alloc_counter as a global estimate of the GC work necessary to keep up with allocation, and work_counter as a global record of the total amount of GC work done. Both of these counters are intnat sized, and on 32-bit platforms they wrap around. For each slice, in each domain, we work until the work_counter exceeds the slice_target - the value that the alloc_counter had at the start of the slice - or (alternatively) we run out of work to do.
In this way, these two counters remain approximately in sync.
However, on this branch, we delay marking until after sweeping has finished on at least one domain, and we always start marking in a fresh slice (after a minor GC). So on the last sweeping slice (on the domain which reaches the end of sweeping first), approximately half of the work expected in the slice is not used, and work_counter slips behind alloc_counter by this amount. At the end of the slice, instead of work_counter slightly exceeding slice_target, or not quite reaching it (if we got to the end of the collection), it falls behind by (on average) half of a slice's worth of work. Depending on the workload, there may be too little marking to catch up on the next slice (testsuite/tests/regression/pr5757/pr5757.ml is one such case).
So, as the process continues, work_counter gradually falls further and further behind alloc_counter. This may lead to larger slice budgets and thus longer (but fewer) GC pauses, but should be otherwise harmless. However (as described in a previous comment), on 32-bit platforms eventually the shortfall is enough to trigger an underflow, giving negative work budgets. At that point, progress on the current cycle stalls but allocation continues, causing the heap to balloon. On 32-bit Windows we eventually hit the 2 GiB process memory limit, causing us to run out of memory with a (just under) 2 GiB heap. On 32-bit Linux machines with enough memory, we are able to continue past that point (as there is a 3 GiB process memory limit), advancing alloc_counter until it overtakes work_counter, pushing work budgets back into positive territory. Then the GC is able to progress again, successfully collecting, until the problem repeats.
On 64-bit platforms the same accounting problem is observed, but it doesn't cause the same symptoms. Instead the slice targets grow larger and larger, which may exacerbate the problem as all the sweeping is easily completed in a single slice.

My hack, which should not be merged, demonstrates that addressing this accounting issue fixes the problem by artificially consuming the rest of the slice budget at the point at which sweeping is first completed.
Paging @stedolan and @damiendoligez for thoughts.
I also have a sketch of a refactor of major_collection_slice to make this sort of thing easier to reason about. I hope to make a PR for that in the next week or two.

@NickBarnes NickBarnes marked this pull request as ready for review June 23, 2025 15:53
@kayceesrk
Contributor

kayceesrk commented Jun 24, 2025

The CI failure is on a fiddly test tests/weak-ephe-final/ephe_custom.ml. This test was added in #13859. The test depends on the specific colour of the custom object at Weak.set. This PR changes the allocation colour and hence the test no longer works as expected.

Here's the modified code that exercises the darkening of the custom_tag objects during a Weak.get_copy, capturing the intent of the original test.

(* TEST *)

let w = Weak.create 1

let major_obj () =
  let n = Sys.opaque_identity 42 in
  let v = Sys.opaque_identity (Int64.of_int n) in
  Gc.full_major ();
  v (* [v] is unmarked *)

let () =
  Weak.set w 0 (Some (major_obj ()));
  (* [v] is unmarked *)
  let x = Option.get (Weak.get_copy w 0) in
  (* [v] is marked because [Weak.get_copy] marks the object without making a
     copy. *)
  Gc.major ();
  Printf.printf "value: %Ld\n%!" x;
  let junk = List.init 1_000_000 Fun.id in
  Gc.minor ();
  ignore (Sys.opaque_identity junk);
  (* check that the memory representing x did not get reused in junk *)
  Printf.printf "value: %Ld\n%!" x

@jmid
Contributor

jmid commented Jun 24, 2025

The multicoretests failure is due to #14081 and has since been fixed on the main repo.
It should be sufficient to change the following line to ref: 'main' to get it going again:

I've also kicked off a CI test run here: https://github.com/ocaml-multicore/multicoretests/tree/target-mark-delay-13580

I'll send a PR to bump the workflow's targeted multicoretests version when we release the next one (we've pinned a particular tag so that test-suite updates between runs don't add needless version confusion 🤷)

@jmid
Contributor

jmid commented Jun 24, 2025

Hm. I can see the debug runtime workflow is failing with an assertion error:
https://github.com/ocaml-multicore/multicoretests/actions/runs/15850475068/job/44682432570#step:16:1418

random seed: 117530064
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin Weak stress test with Domain
[00] file runtime/major_gc.c; line 1522 ### Assertion failed: caml_marking_started()
File "src/weak/dune", line 52, characters 7-16:
52 |  (name lin_tests)
            ^^^^^^^^^
(cd _build/default/src/weak && ./lin_tests.exe --verbose)
Command got signal ILL.

which is this assertion:
https://github.com/NickBarnes/ocaml/blob/3802cb2077ebaf8a67a28ef26fa244d3205f38ae/runtime/major_gc.c#L1522

Edit: The other (non-debug-runtime) workflows completed successfully

@kayceesrk
Contributor

@jmid how do I get the reproduction case for the failure? I’ll be able to have a look.

@jmid
Contributor

jmid commented Jun 25, 2025

I just confirmed locally that the following will trigger the assertion error almost immediately:

git clone -b nick-markdelay [email protected]:NickBarnes/ocaml.git ocaml-06-25-2025-PR13580
cd ocaml-06-25-2025-PR13580
opam switch create . --empty
opam pin add -k path --inplace-build ocaml-variants.5.5.0+mark-delay .
opam install qcheck-core
git clone [email protected]:ocaml-multicore/multicoretests.git
cd multicoretests
OCAMLRUNPARAM='s=4096,V=1' dune exec src/weak/lin_tests.exe --profile=debug-runtime

with

Done: 75% (3/4, 1 left) (jobs: 0)### OCaml runtime: debug mode ###
### set OCAMLRUNPARAM=v=0 to silence this message
random seed: 418233538
[01] file runtime/major_gc.c; line 1522 ### Assertion failed: caml_marking_started()
Command got signal ILL.

I'll try to carve out a standalone reproducer.

@jmid
Contributor

jmid commented Jun 25, 2025

Here is a standalone repro triggering the problem:

let test () =
  let t = Weak.create 256 in
  Weak.fill t 0 1 (Some 1L);
  let d = Domain.spawn (fun () ->
      Weak.get_copy t 0 |> ignore
    ) in
  Domain.join d

let _ =
  for _i = 1 to 100 do
    Printf.printf "#%!";
    test ()
  done

This takes only a few iterations to consistently trigger it on my Linux box (the bigger the weak array, the sooner it seems to trigger):

$ ocamlopt -runtime-variant=d repro.ml && OCAMLRUNPARAM='s=4096,V=1,b' ./a.out
### OCaml runtime: debug mode ###
### set OCAMLRUNPARAM=v=0 to silence this message
###[01] file runtime/major_gc.c; line 1522 ### Assertion failed: caml_marking_started()
Illegal instruction (core dumped)

@kayceesrk
Contributor

The fix turned out to be a simple one. I've made a PR here: NickBarnes#4.

Contributor

@kayceesrk kayceesrk left a comment


The code looks good to me.

> `(sync, handler, data, leader_setup, enter_spin_callback, enter_spin_data)`
>
> Try to run an STW section in which all domains will run
> `handler(domain, data, N, p)` (where `N` and `p` define the
Contributor


what are N and p?

Contributor Author


The number of domains participating in the STW, and an array of their domain_state pointers (these are commonly called participant_count and participating in the handlers). I find those variable names a bit cumbersome and visually obtrusive; I have contemplated a refactor introducing an STW data structure with these as fields, conventionally bound to a short variable name like stw, but this PR is not the place for it.

Contributor


Yes, agreed. Another PR. Ok with the comment for now.

> Bearing all that in mind, the spec of this function, which drives
> all STW sections, is as follows:
>
> `caml_try_run_on_all_domains_with_spin_work
Contributor


In a different PR, we should reorganise this large comment to first describe what the function does (what follows) and then have the details on how it interacts with domain creation and termination.

@kayceesrk
Contributor

@NickBarnes gentle ping to consider NickBarnes#4.

@NickBarnes
Contributor Author

Thanks for that @kayceesrk; I have merged it after a rebase. I think I've now addressed the various review comments, and it appears to be passing tests.

@kayceesrk
Contributor

kayceesrk commented Jul 2, 2025

I've been looking at the macos-arm64 failure. The test failure is on https://github.com/ocaml/ocaml/blob/trunk/testsuite/tests/parallel/major_gc_wait_backup.ml#L41. The test expects new major cycles to have completed after doing some long-lived allocating work. I initially assumed that this PR collects the floating garbage quickly, and hence we don't get to see a cycle complete. But that's not the case here.

Bug?

Based on the GC logs, I suspect there is a bug which prevents the GC from advancing from Phase_mark_final. The program ends up allocating more and more.

My conjecture is that the transition from Phase_mark_final to Phase_sweep_ephe only happens when the slice mode is not opportunistic. The phase change code:

https://github.com/NickBarnes/ocaml/blob/d372edbb85b26090914185bd2491b9118e73b740/runtime/major_gc.c#L2094

is under the guard which checks that the mode is not opportunistic

https://github.com/NickBarnes/ocaml/blob/d372edbb85b26090914185bd2491b9118e73b740/runtime/major_gc.c#L1986

In the GC trace, I see that the two domains are stuck at Phase_mark_final. Is it possible that the program only triggers opportunistic major slices, preventing the GC phase change, while the program keeps allocating more and more memory? Edit: I noticed that the slices are not opportunistic but interruptible. I'm looking into this.

Also, this bug (if it is one) is possible on trunk, where the phase change code is guarded under a similar check. The current PR triggers it.

Test

I've made a small change to the test to make it fail more frequently:

diff --git a/test.ml b/major_gc_wait_backup.ml
index f039604441..0b1ff9f49b 100644
--- a/test.ml
+++ b/major_gc_wait_backup.ml
@@ -37,8 +37,6 @@ let _ =
   ) in
   Gc.full_major ();
   let n = major_collections () in
-  for i = 1 to 5 do
-    ignore (make 24);
-  done;
+  ignore (make 24);
   assert ((major_collections ()) > n);
   print_endline "sleep OK"

Good run

A successful run uses around 800M.

 % /usr/bin/time -l ./a.out                
wait OK
sleep OK
        2.46 real         2.36 user         0.11 sys
           811499520  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               49674  page reclaims
                  73  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                2709  involuntary context switches
         41202382335  instructions retired
          8064320362  cycles elapsed
           811125184  peak memory footprint

The tail of the GC message trace filtered to major slice messages looks like this:

<snip>
[00] Major slice [i.M]: 1043460 sweep,  0 mark (0 blocks)
[00] Major slice [o.U]: 4092 sweep,  0 mark (0 blocks)
[01] Major slice [i.M]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.M]: 1043460 sweep,  0 mark (0 blocks)
[00] Major slice [o.U]: 4092 sweep,  0 mark (0 blocks)
[00] Major slice [o.U]: 4092 sweep,  0 mark (0 blocks)
[01] Major slice [i.M]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.M]: 1043460 sweep,  0 mark (0 blocks)
[00] Major slice [o.U]: 4092 sweep,  0 mark (0 blocks)
[01] Major slice [i.M]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.M]: 1043460 sweep,  0 mark (0 blocks)
[00] Major slice [o.U]: 4092 sweep,  0 mark (0 blocks)
[01] Major slice [i.M]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.M]: 1043460 sweep,  0 mark (0 blocks)
[00] Major slice [o.U]: 4092 sweep,  0 mark (0 blocks)
[01] Major slice [i.M]: 0 sweep,  0 mark (0 blocks)

Bad run

An unsuccessful run ends up using a lot more memory (1.88 GB in this case):

% /usr/bin/time -l ./a.out
wait OK
Fatal error: exception File "test.ml", line 43, characters 2-8: Assertion failed
        1.57 real         1.34 user         0.20 sys
          2021195776  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              123559  page reclaims
                  20  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                3788  involuntary context switches
         25039794935  instructions retired
          4791806351  cycles elapsed
          2021413632  peak memory footprint

The tail of the bad run trace looks like this:

<snip>
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)                                                                    
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)   
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)    
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)     
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)    
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)   
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)                                                                    
[00] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)
[01] Major slice [i.F]: 0 sweep,  0 mark (0 blocks)

@kayceesrk
Contributor

kayceesrk commented Jul 3, 2025

I've narrowed the problem down to https://github.com/NickBarnes/ocaml/blob/d372edbb85b26090914185bd2491b9118e73b740/runtime/major_gc.c#L1987-L1990

    /* Finalisers */
    if (caml_gc_phase == Phase_mark_final &&
        get_major_slice_work(mode) > 0 &&
        caml_final_update_first(domain_state)) {
      /* This domain has updated finalise first values */
      (void)caml_atomic_counter_decr(&num_domains_to_final_update_first);

In the test, there is an idle domain, which does not allocate. The GC log has

[01] Updated major work: [F]  0 heap_words,  0 allocated,  0 allocated (direct),  
  0 allocated (suspended), 0 allocated (resumed),  0 alloc_work,  0 dependent_work,  
  0 extra_work,  278023983 work counter [behind],  278023983 alloc counter,  
  278023983 slice target,  0 slice budget

The call to get_major_slice_work(mode) always returns 0. This prevents the idle domain ([01]) from calling caml_final_update_first and decrementing the num_domains_to_final_update_first counter. The other domain keeps allocating.

There is no finaliser work, but unless the counter is decremented, the GC will not progress to the next phase.

The change was made in #11903. CC @stedolan. We will have a similar issue with num_domains_to_final_update_last.

@kayceesrk
Contributor

kayceesrk commented Jul 4, 2025

caml_final_update_first and caml_final_update_last are not incremental. So checking whether get_major_slice_work(mode) > 0 is too strict. For a quick fix, I suggest removing the condition. I have tested the fix, and the test no longer fails. If this is acceptable, I can make a PR to trunk.

Separately, we should consider making caml_final_update_{first,last} incremental.

@kayceesrk
Contributor

@NickBarnes wonder if you've had some cycles to look at this? Would be useful to get this PR across the line.

Labels
gc · Performance (PR or issues affecting runtime performance of the compiled programs) · run-multicoretests (Makes the CI run the multicore testsuite)
6 participants