Work stealing #507
Conversation
Related issue: #185
wks left a comment:
The `enum-map` crate provides static methods to look up `Enum` items without concrete instances. You may consider bumping the dependency version of `enum-map`, too, if that's convenient.
`src/scheduler/scheduler.rs` (outdated):
```rust
let first_stw_stage = work_buckets.iter().nth(1).map(|(id, _)| id).unwrap();
let mut open_stages: Vec<WorkBucketStage> = vec![first_stw_stage];
// The rest will open after the previous stage is done.
let stages = work_buckets
    .iter()
    .map(|(stage, _)| stage)
    .collect::<Vec<_>>();
```
You may use static methods from the `Enum<T>` type in enum-map 0.6.2:

```rust
let first_stw_stage: WorkBucketStage = Enum::<WorkBucketStage>::from_usize(1);
let possible_values: usize = <WorkBucketStage as Enum<WorkBucketStage>>::POSSIBLE_VALUES;
let stages = (0..possible_values)
    .map(|i| {
        let stage: WorkBucketStage = Enum::<WorkBucketStage>::from_usize(i);
        stage
    })
    .collect::<Vec<_>>();
```

It is a bit awkward because the enum-map version that mmtk-core currently depends on is way too old (0.6.2 vs the latest 2.2.0).

If we update Cargo.toml and bump the version to `enum-map = "=2.1.0"`, we will be able to do this much more elegantly:

```rust
let first_stw_stage = WorkBucketStage::from_usize(1);
let stages = (0..WorkBucketStage::LENGTH).map(WorkBucketStage::from_usize);
```

If it's convenient for you, you could update the dependency for us in this PR, too. Note that 2.2.0 requires rustc 1.60.0, which our moma machines currently don't have. As a workaround, `enum-map = "=2.1.0"` will lock the version to exactly 2.1.0. We should let our administrator update our installations when appropriate.
@wks @qinsoon I think we should leave the code as is for now. I won't be physically in Canberra, so I will hold off on image updates for the CI machines until I get back, unless it's urgent (i.e. a security vulnerability).

I think this also raises the meta-question of what our MSRV policy is. Currently, our MSRV is 1.59.0, which is not that old. Any change to the MSRV could potentially affect other projects that depend on us. Of course, we can decide to update the MSRV aggressively, but that's something we should at least discuss as a group.
Depending on way-too-old stuff is kinda bad, in my opinion. If we can bump to v2.1.0 without even disrupting the MSRV, I think we should do that. (At some point we should do an audit that updates our dependencies.)
I switched to `enum-map = "=2.1.0"` and removed the code that fetches the first stage by using `.iter()`.
`src/scheduler/scheduler.rs` (outdated):
```rust
let first_stw_stage = work_buckets.iter().nth(1).map(|(id, _)| id).unwrap();
let mut open_stages: Vec<WorkBucketStage> = vec![first_stw_stage];
// The rest will open after the previous stage is done.
let stages = work_buckets
```
I suggest creating a constant of `Vec<WorkBucketStage>` to explicitly list all the buckets in a specified order. Items in an enum have no inherent order (there is no `Ord`/`PartialOrd` for this enum), and `iter()` on `EnumMap` does not specify the iteration order either (https://docs.rs/enum-map/latest/enum_map/struct.Iter.html). This implementation basically assumes that:

1. Rust implements the enum as a `u8`,
2. we define the variants of the enum in the correct order, and
3. `enum_map` iterates the map in increasing order of the key.

None of these assumptions is documented, and any of them could change.
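A minimal sketch of that suggestion, assuming a stage enum roughly like the one in this PR (only `Prepare` and `Closure` appear elsewhere in this thread; the other variant names below are made up for illustration):

```rust
/// Hypothetical stage enum for illustration only; the real enum in
/// mmtk-core has more variants.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum WorkBucketStage {
    Unconstrained,
    Prepare,
    Closure,
    Release,
}

/// The single source of truth for the order in which STW buckets open,
/// instead of relying on EnumMap's iteration order.
const STW_BUCKET_ORDER: &[WorkBucketStage] = &[
    WorkBucketStage::Prepare,
    WorkBucketStage::Closure,
    WorkBucketStage::Release,
];

fn main() {
    // The first STW stage is now looked up from the explicit list.
    let first_stw_stage = STW_BUCKET_ORDER[0];
    assert_eq!(first_stw_stage, WorkBucketStage::Prepare);
}
```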
There are a few uses of `work_buckets.iter()`, `work_buckets.values().nth()`, etc., with assumptions about their order. I suggest updating them as well.
The `enum-map` crate provides static methods to map between `usize` and the enum type. See my comment below.

In the documentation of the derive macro `Enum`, the example code even asserts that the elements are mapped to 0, 1, 2, ... in order, although it never explicitly states so: https://docs.rs/enum-map/latest/enum_map/derive.Enum.html#enums-without-payload

To be safe, we may list the stages ourselves for now, but I'll open an issue and ask the enum-map developers to make that guarantee explicit in the docs.
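For reference, a small self-contained example of the mapping in question, assuming the `enum-map` 2.x API (where the `Enum` trait provides `from_usize`, `into_usize`, and `LENGTH`); the 0, 1, 2, ... ordering is the derive's current behavior, not a documented guarantee:

```rust
use enum_map::Enum;

// A hypothetical payload-free enum, standing in for WorkBucketStage.
#[derive(Debug, PartialEq, Enum)]
enum Stage {
    A,
    B,
    C,
}

fn main() {
    // Observed behavior of the derive: variants map to 0, 1, 2, ...
    // in declaration order. This is what the PR code relies on.
    assert_eq!(Stage::A.into_usize(), 0);
    assert_eq!(Stage::C.into_usize(), 2);
    assert_eq!(Stage::from_usize(1), Stage::B);
    assert_eq!(Stage::LENGTH, 3);
}
```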
Got it. Thanks. However, the mapping between the enum and `usize` is just part of the issue. The problem also includes the order in which `EnumMap` iterates its keys/values. The code in the PR assumes the first element in `work_buckets.iter()` and `work_buckets.values()` is the first STW phase. Generally, a map is an unordered collection, and the order in which a map is iterated is implementation-specific. It just happens to work with enum-map's current implementation.
I switched to `enum-map = "=2.1.0"` and removed the ordering assumption.
```rust
/// Poll a ready-to-execute work packet in the following order:
///
/// 1. Any packet that should be processed only by this worker.
```
This order is different from what we have in master. But it is fine. #600 will be outdated after we merge this PR. This is just a note for myself, no change is needed.
```rust
#[inline(always)]
pub fn add_prioritized(&self, work: Box<dyn GCWork<VM>>) {
    self.prioritized_queue.as_ref().unwrap().push(work);
    self.notify_one_worker();
```
Notifying other workers should not be the responsibility of an individual bucket. It should be the responsibility of the `GCWorkScheduler`, because

- `GCWorkScheduler` owns the `WorkerGroup`, and
- all call sites of `WorkBucket::add` always get `scheduler.work_buckets[stage]` before calling `add`, usually in this form:

```rust
self.mmtk.scheduler.work_buckets[WorkBucketStage::Closure]
    .add(ProcessModBuf::<E>::new(modbuf, self.meta));
```

This indicates that it is the `GCWorkScheduler`'s responsibility to (1) add a work packet to a concrete bucket, and (2) notify other workers.

I have two suggestions. Either should work, but I prefer the first one.

1. Make `scheduler.work_buckets` private.
   - Introduce a `Scheduler::add_work_packet(&self, bucket, packet)` method (or copy from `memory_manager::add_work_packet`).
   - Force everyone to add work packets using `Scheduler::add_work_packet`.
   - Notify other workers in `Scheduler::add_work_packet`.
2. Replace `WorkBucket::group` with a callback function (or trait).
   - It should be called when one or more work packets are added.
   - In `GCWorkScheduler::new` (i.e. when those buckets are created), register callbacks on those buckets to call methods on the worker group (the worker group is created just before the buckets).

And of course we can leave it as is for now, and refactor later, given that the first way may affect too many files.
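A rough sketch of what option 1 could look like; the types and signatures below are simplified stand-ins, not the actual mmtk-core API:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Stand-ins for Box<dyn GCWork<VM>>, the stage enum, a bucket, and the
// worker group. All names here are hypothetical.
struct WorkPacket;

enum WorkBucketStage {
    Prepare,
    Closure,
}

#[derive(Default)]
struct WorkBucket {
    queue: Mutex<VecDeque<WorkPacket>>,
}

#[derive(Default)]
struct WorkerGroup;

impl WorkerGroup {
    fn notify_one_worker(&self) {
        // In the real scheduler this would wake a parked GC worker.
    }
}

#[derive(Default)]
struct GCWorkScheduler {
    // Private: callers can no longer push into a bucket directly.
    work_buckets: [WorkBucket; 2],
    worker_group: WorkerGroup,
}

impl GCWorkScheduler {
    /// The single entry point for adding work: enqueue, then notify.
    fn add_work_packet(&self, stage: WorkBucketStage, packet: WorkPacket) {
        self.work_buckets[stage as usize]
            .queue
            .lock()
            .unwrap()
            .push_back(packet);
        self.worker_group.notify_one_worker();
    }
}

fn main() {
    let scheduler = GCWorkScheduler::default();
    scheduler.add_work_packet(WorkBucketStage::Closure, WorkPacket);
}
```

The point of the sketch is only the shape: the scheduler owns both the buckets and the worker group, so the same call both enqueues the packet and notifies workers.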
I thought about this a little bit more. It looks like option 1 above is a necessary step before implementing the design proposed in #546. It creates a single bottleneck for adding work packets, and the design in #546 can utilize it to either redirect each packet to the main queue or temporarily store it in a bucket.

I will implement option 1 as a separate PR (this will likely introduce many changes in mmtk-core and the bindings), and then have a try at #546.
qinsoon left a comment:
LGTM
The failure in the JikesRVM tests is a bit strange.

```
+++ ./dist/RFastAdaptiveSemiSpace_x86_64_m32-linux/rvm -X:gc:no_reference_types=false -Xms75M -Xmx75M -jar /home/runner/work/mmtk-core/mmtk-core/mmtk-jikesrvm/.github/scripts/../../repos/jikesrvm/dacapo/dacapo-2006-10-MR2.jar bloat
===== DaCapo bloat starting =====
Error: Process completed with exit code 1.
```

There is no error message. I cannot reproduce it locally, and I haven't seen this kind of error before. I will just retry the run and see what happens.
wks left a comment:
LGTM
The JikesRVM tests keep failing consistently (3/3 runs failed). The JikesRVM binding and mmtk-core are both v0.12, which passed the tests in mmtk/mmtk-jikesrvm#110. So it is possibly either a bug in this PR, or a bug revealed by this PR. I will investigate this, as it may relate to weak reference processing (it always failed in the weak-ref tests). @wenyuzhao If you can think of anything that may cause the issue, please also let me know.
I tried my small test case on OpenJDK, and it either passes or crashes with out-of-memory. The SoftReference is never erroneously cleared. The following command makes the test pass on my machine, but it crashes with out-of-memory if I shrink the heap:

```
env MMTK_NO_REFERENCE_TYPES=false MMTK_THREADS=1 RUST_BACKTRACE=1 MMTK_PLAN=SemiSpace /home/wks/projects/mmtk-github/openjdk/build/linux-x86_64-normal-server-release/jdk/bin/java -XX:+UseThirdPartyHeap -server -XX:MetaspaceSize=100M -Xm{s,x}22M Main
```

I guess the bug is still related to clearing SoftReference, but in the VM-specific part.
It is also reproducible if I replace …
Here is an even simpler test case. This time you can use any heap size; it doesn't matter. Update: now it also checks the reference in every GC round. Just like before, it is impossible to reproduce it on OpenJDK.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;

public class Simple {
    Object hard;
    WeakReference<Object> soft;

    public void populate() {
        Object obj = new Object();
        hard = obj;
        soft = new WeakReference(obj);
    }

    public void check(int round) {
        Object obj = soft.get();
        if (obj == null) {
            throw new RuntimeException("obj is null! round=" + round);
        }
    }

    public static void main(String[] args) {
        int gc_rounds = Integer.parseInt(args[0]);
        Simple main = new Simple();
        System.out.println("Populating...");
        main.populate();
        System.out.println("Doing lots of GC...");
        for (int i = 0; i < gc_rounds; i++) {
            System.gc();
            main.check(i);
        }
        System.out.println("Checking...");
        main.check(gc_rounds);
        Object obj2 = main.hard;
        System.out.println(obj2);
        System.out.println("Done.");
    }
}
```
Does this test pass with mmtk-core master?
@qinsoon Yes. It always passes with mmtk-core master, no matter the number of threads. By the way, with the …
There is a chance that a …
A more detailed log of a problematic GC (timestamps and type parameters are omitted): …

It looks like the buckets for reference processing are opened too early. I guess what happened is that some work packets are asynchronous: those mmtk-core work packets finish very quickly, but there are VM threads working in the background that may deliver more work packets afterwards. However, the GC workers in the core think all buckets are empty, and then decide that more buckets should be opened. The weak reference processing work packets are added by the …
@wks Nice debugging! In ScheduleCollection, we do not schedule anything for the Closure bucket, but rather rely on the binding to generate the closure packets. Actually, we do not schedule anything between scanning stacks (or stopping mutators, I can't remember) and weak-ref processing. That means there could be a bug where, at some point, the scheduler sees there is no work in the Closure bucket and all conditions for opening a new bucket are satisfied, so it activates the weak-ref processing bucket. This also explains why CI may pass occasionally after Wenyu fixed an issue about designated work (before the fix, the scheduler might open new buckets even if some workers still had designated work). The fix probably eliminated some cases, so it reduces the chance we see the bug in CI. @wenyuzhao
@wks Can you please advise which machine you were using and what your command line args are? It's weird: I tried the test case with both SemiSpace and MarkSweep multiple times, and I did not see any crashes. I'm using …

My command: …
I found the cause. One difference between the JikesRVM binding and the OpenJDK binding is that in OpenJDK any GC worker can execute the `StopMutators` work packet, while in JikesRVM it is executed by the coordinator. The relevant part of `StopMutators::do_work` is:

```rust
// StopMutators::do_work
trace!("stop_all_mutators start");
mmtk.plan.base().prepare_for_stack_scanning();
<E::VM as VMBinding>::VMCollection::stop_all_mutators::<E>(worker.tls);
trace!("stop_all_mutators end");
mmtk.scheduler.notify_mutators_paused(mmtk); // THIS LINE!!!!!!!!!!!!
if <E::VM as VMBinding>::VMScanning::SCAN_MUTATORS_IN_SAFEPOINT {
    ...
    for mutator in <E::VM as VMBinding>::VMActivePlan::mutators() {
        mmtk.scheduler.work_buckets[WorkBucketStage::Prepare]
            .add(ScanStackRoot::<E>(mutator)); // ADD WORK PACKETS
    }
}
mmtk.scheduler.work_buckets[WorkBucketStage::Prepare].add(ScanVMSpecificRoots::<E>::new()); // ADD WORK PACKETS
```

Note that it notifies the GC workers before adding more work packets. This is not a problem for OpenJDK, because the GC worker executing this packet is still active while the packets are being added. However, on JikesRVM the coordinator executes this packet, and the race becomes easy to hit if we insert a delay right after the notification:

```diff
diff --git a/src/scheduler/gc_work.rs b/src/scheduler/gc_work.rs
index 615458f0..a4a2636b 100644
--- a/src/scheduler/gc_work.rs
+++ b/src/scheduler/gc_work.rs
@@ -10,6 +10,7 @@ use std::marker::PhantomData;
 use std::mem;
 use std::ops::{Deref, DerefMut};
 use std::sync::atomic::Ordering;
+use std::time::Duration;

 pub struct ScheduleCollection;

@@ -186,6 +187,9 @@ impl<E: ProcessEdgesWork> GCWork<E::VM> for StopMutators<E> {
         <E::VM as VMBinding>::VMCollection::stop_all_mutators::<E>(worker.tls);
         trace!("stop_all_mutators end");
         mmtk.scheduler.notify_mutators_paused(mmtk);
+        std::thread::sleep(Duration::from_millis(10));
         if <E::VM as VMBinding>::VMScanning::SCAN_MUTATORS_IN_SAFEPOINT {
             // Prepare mutators if necessary
             // FIXME: This test is probably redundant. JikesRVM requires to call `prepare_mutator` once after mutators are paused
```

Then the GC workers wake up and find that all open buckets are drained, and they open all subsequent buckets. Because of the delay after the notification, this happens before the root-scanning packets above are added.

So the current "all GC workers have stopped" condition is insufficient: it does not take the running GC coordinator thread into account.
I am using my own laptop in an LXC container. It should be reproducible on any machine. My command line is: …

The number of threads matters. It is more likely to reproduce it with 2 threads instead of 1. Also note that I updated the test script in the afternoon, so that you can specify how many GC rounds to run.
The issue is not that the controller processes the StopMutators packet. That was not a problem before, because the coordinator was the one that opened new buckets, so it could not open new buckets until the coordinator's StopMutators job was finished. Now it is the last yielding GC worker that opens new buckets (to reduce inter-thread messages). I changed the open condition of the buckets so that no new buckets can open unless the coordinator has finished the coordinator's work packets. Hopefully this solves the problem.
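Roughly, the described condition amounts to something like the following sketch; the field and method names here are made up for illustration, not the actual implementation:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Hypothetical scheduler state, only to illustrate the open condition.
struct SchedulerState {
    /// GC workers that are currently parked (yielded).
    parked_workers: AtomicUsize,
    num_workers: usize,
    /// Packets forwarded to the coordinator and not yet finished.
    pending_coordinator_packets: AtomicUsize,
    /// All currently open buckets are drained.
    open_buckets_drained: AtomicBool,
}

impl SchedulerState {
    /// The last yielding worker may only open new buckets when no GC thread,
    /// including the coordinator, can still add packets to the open buckets.
    fn can_open_next_buckets(&self) -> bool {
        self.parked_workers.load(Ordering::SeqCst) == self.num_workers
            && self.pending_coordinator_packets.load(Ordering::SeqCst) == 0
            && self.open_buckets_drained.load(Ordering::SeqCst)
    }
}

fn main() {
    let state = SchedulerState {
        parked_workers: AtomicUsize::new(8),
        num_workers: 8,
        // e.g. StopMutators is still running on the coordinator.
        pending_coordinator_packets: AtomicUsize::new(1),
        open_buckets_drained: AtomicBool::new(true),
    };
    // The coordinator is still busy, so the next buckets must not open yet.
    assert!(!state.can_open_next_buckets());
}
```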
@wks Can you please confirm that the change fixes the bug on your side? I'm able to reproduce the bug now, and it looks like it is fixed.
wks left a comment:
I see some problems with the memory orders used in various atomic operations. They may, in theory, cause problems. If unsure, we can always use SeqCst.
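As a generic illustration of the concern (not code from this PR): a flag published with `Relaxed` does not order the writes before it, so a thread that observes the flag may still read stale data; `Release`/`Acquire`, or `SeqCst` when unsure, establishes the required happens-before.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let data = Arc::new(AtomicUsize::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let (d, r) = (Arc::clone(&data), Arc::clone(&ready));
    let producer = thread::spawn(move || {
        d.store(42, Ordering::Relaxed); // the payload itself
        // Publish the flag. SeqCst (or Release) makes the payload store
        // visible to a thread that later loads the flag with SeqCst/Acquire.
        // A Relaxed store here would not give that guarantee.
        r.store(true, Ordering::SeqCst);
    });

    // Consume the flag with SeqCst (Acquire would also work).
    while !ready.load(Ordering::SeqCst) {
        std::hint::spin_loop();
    }
    assert_eq!(data.load(Ordering::Relaxed), 42);
    producer.join().unwrap();
}
```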
@wenyuzhao I confirm that the bug is no longer reproducible on my local machine. However, I also see some memory ordering issues. See my previous comments.
Some dependencies have new versions. It is a good chance to bump the dependency versions, too, after we bump the Rust toolchain versions. One notable dependency is `enum-map`. Its latest version is 2.4.2, but we locked its version to 2.1.0 because it required a newer Rust toolchain. Now we can depend on its latest version, instead. We also changed the `rust-version` property in `Cargo.toml` to `1.61.0` because `enum-map-derive` depends on that, and `1.61.0` is not new compared to the `1.66.1` we use. See: #507 (comment) This closes #693
This PR adds work-stealing support to the work-packet system.
Performance: http://squirrel.anu.edu.au/plotty-public/wenyuz/v8/p/j2sUrH
TODO: