
Conversation

@wenyuzhao
Member

@wenyuzhao wenyuzhao commented Dec 10, 2021

This PR adds work-stealing support to the work-packet system.

Performance: http://squirrel.anu.edu.au/plotty-public/wenyuz/v8/p/j2sUrH

[Screenshot: performance comparison plot, 23 May 2022]

TODO:

  • Fix h2
  • Fix pmd

@wenyuzhao wenyuzhao marked this pull request as ready for review May 23, 2022 05:06
@wenyuzhao wenyuzhao requested review from qinsoon and wks May 23, 2022 05:06
@qinsoon
Member

qinsoon commented May 23, 2022

Related issue: #185

Collaborator

@wks wks left a comment

The enum-map crate provides static methods to look up Enum items without concrete instances. You may consider bumping the dependency version of enum-map, too, if that's convenient.

Comment on lines 74 to 80
let first_stw_stage = work_buckets.iter().nth(1).map(|(id, _)| id).unwrap();
let mut open_stages: Vec<WorkBucketStage> = vec![first_stw_stage];
// The rest will open after the previous stage is done.
let stages = work_buckets
    .iter()
    .map(|(stage, _)| stage)
    .collect::<Vec<_>>();
Collaborator

You may use static methods from the Enum<T> trait in enum-map 0.6.2:

let first_stw_stage: WorkBucketStage = Enum::<WorkBucketStage>::from_usize(1);

let possible_values: usize = <WorkBucketStage as Enum::<WorkBucketStage>>::POSSIBLE_VALUES;
let stages = (0..possible_values)
    .map(|i| {
        let stage: WorkBucketStage = Enum::<WorkBucketStage>::from_usize(i);
        stage
    })
    .collect::<Vec<_>>();

It is a bit awkward because the enum-map version that mmtk-core currently depends on is very old (0.6.2, while the latest is 2.2.0).

If we update Cargo.toml and bump the version to enum-map = "=2.1.0", we will be able to write these much more elegantly:

let first_stw_stage = WorkBucketStage::from_usize(1);
let stages = (0..WorkBucketStage::LENGTH).map(WorkBucketStage::from_usize);

If it's convenient for you, you can update the dependency for us in this PR, too. Note that 2.2.0 requires rustc 1.60.0, which our moma machines don't currently have. As a workaround, enum-map = "=2.1.0" will lock the version to exactly 2.1.0. We should ask our administrator to update our installations when appropriate.
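For illustration, here is a small self-contained sketch of the 2.x-style API with a toy enum (assuming enum-map 2.1 as a dependency; Stage here is a stand-in, not the real WorkBucketStage):

use enum_map::Enum;

// Stand-in enum for illustration only.
#[derive(Enum, Copy, Clone, Debug, PartialEq, Eq)]
enum Stage {
    Unconstrained,
    Prepare,
    Closure,
    Release,
}

fn main() {
    // With the current derive implementation, variants map to 0..LENGTH in
    // declaration order (not an explicitly documented guarantee; see the
    // discussion below).
    let first_stw_stage = Stage::from_usize(1);
    assert_eq!(first_stw_stage, Stage::Prepare);

    let stages: Vec<Stage> = (0..Stage::LENGTH).map(Stage::from_usize).collect();
    println!("{:?}", stages);
}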

Member

@wks @qinsoon I think we should leave the code as is for now. I won't be physically in Canberra, so I will hold off on image updates of the CI machines until I get back, unless it's urgent (i.e. a security vulnerability).

I think this also raises a meta-question about our MSRV policy. Currently, our MSRV is 1.59.0, which is not that old. Any change to the MSRV could affect other projects that depend on us. Of course, we could decide to update the MSRV aggressively, but that's something we should at least discuss as a group.

Collaborator

Depending on very old versions is not great, in my opinion. If we can bump to v2.1.0 without affecting the MSRV, I think we should do that. (At some point we should do an audit that updates our dependencies.)

Member Author

I switched to enum-map = "=2.1.0" and removed the code that fetches the first stage by using .iter().

let first_stw_stage = work_buckets.iter().nth(1).map(|(id, _)| id).unwrap();
let mut open_stages: Vec<WorkBucketStage> = vec![first_stw_stage];
// The rest will open after the previous stage is done.
let stages = work_buckets
Member

I suggest creating a constant Vec<WorkBucketStage> that explicitly lists all the buckets in a specified order. Items in an enum have no explicit order (there is no Ord/PartialOrd for the enum), and iter() on EnumMap does not specify the iteration order either (https://docs.rs/enum-map/latest/enum_map/struct.Iter.html). This implementation basically assumes that:

  1. Rust implements the enum as u8,
  2. we define the variants of the enum in the correct order, and
  3. enum_map iterates the map in increasing order of the key.

None of these assumptions are documented, and any of them could change.
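A minimal sketch of this suggestion, using a stand-in enum (the variant names below are illustrative and not the full set in mmtk-core):

// Illustrative stand-in for WorkBucketStage; variant names are hypothetical.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
enum Stage {
    Unconstrained,
    Prepare,
    Closure,
    Release,
    Final,
}

// The canonical opening order, spelled out explicitly so that it does not
// depend on enum discriminants or on EnumMap's (unspecified) iteration order.
const STAGE_ORDER: &[Stage] = &[
    Stage::Unconstrained,
    Stage::Prepare,
    Stage::Closure,
    Stage::Release,
    Stage::Final,
];

fn main() {
    // Index 0 is the unconstrained bucket; index 1 is the first STW stage.
    let first_stw_stage = STAGE_ORDER[1];
    assert_eq!(first_stw_stage, Stage::Prepare);
}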

Member

There are a few uses of work_buckets.iter(), work_buckets.values().nth(), etc., with assumptions about their order. I suggest updating them as well.

Collaborator

The enum-map crate provides static methods to map between usize and the enum type. See my comment below.

In the documentation of the derive macro Enum, the example code even asserts that the elements are mapped to 0, 1, 2, ... in order, although the documentation never explicitly states so. https://docs.rs/enum-map/latest/enum_map/derive.Enum.html#enums-without-payload

To be safe, we may list the stages ourselves for now, but I'll open an issue and ask the enum-map developers to make that guarantee explicit in the docs.

Collaborator

FYI: https://gitlab.com/KonradBorowski/enum-map/-/issues/25

Member

Got it. Thanks. However, the mapping between the enum and usize is only part of the issue. The problem also includes the order in which EnumMap iterates its keys/values. The code in this PR assumes that the first element in work_buckets.iter() and work_buckets.values() is the first STW stage. Generally, a map is an unordered collection, and the order in which a map is iterated is implementation-specific. It just happens to work with enum-map's current implementation.

Member Author

I switched to enum-map = "=2.1.0" and removed the ordering assumption.


/// Poll a ready-to-execute work packet in the following order:
///
/// 1. Any packet that should be processed only by this worker.
Member

This order is different from what we have in master, but that is fine. #600 will be outdated after we merge this PR. This is just a note for myself; no change is needed.
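For context, this is roughly the shape of a generic work-stealing poll loop, in the style of the crossbeam-deque documentation. It is illustrative only and not the code in this PR (which additionally handles per-worker designated packets and bucket stages):

use crossbeam_deque::{Injector, Stealer, Worker};
use std::iter;

// Poll order: 1. our own local queue, 2. the global injector, 3. steal from
// other workers' queues. Keep retrying while any steal operation reports Retry.
fn find_task<T>(
    local: &Worker<T>,
    global: &Injector<T>,
    stealers: &[Stealer<T>],
) -> Option<T> {
    local.pop().or_else(|| {
        iter::repeat_with(|| {
            global
                .steal_batch_and_pop(local)
                .or_else(|| stealers.iter().map(|s| s.steal()).collect())
        })
        .find(|s| !s.is_retry())
        .and_then(|s| s.success())
    })
}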

#[inline(always)]
pub fn add_prioritized(&self, work: Box<dyn GCWork<VM>>) {
    self.prioritized_queue.as_ref().unwrap().push(work);
    self.notify_one_worker();
Collaborator

@wks wks May 23, 2022

Notifying other workers should not be the responsibility of an individual bucket. It should be the responsibility of the GCWorkScheduler because

  1. GCWorkScheduler owns the WorkerGroup, and
  2. All call sites of WorkBucket::add always get scheduler.work_buckets[stage] before calling add, usually in this form:
self.mmtk.scheduler.work_buckets[WorkBucketStage::Closure]
    .add(ProcessModBuf::<E>::new(modbuf, self.meta));

This indicates that it is the GCWorkScheduler's responsibility to (1) add a work packet to a concrete bucket, and (2) notify other workers.

I have two suggestions. Either should work, but I prefer the first one.

  1. Make scheduler.work_buckets private;
    • Introduce a Scheduler::add_work_packet(&self, bucket, packet) method (or copy from memory_manager::add_work_packet)
    • Force everyone to add work packets using Scheduler::add_work_packet
    • Notify other workers in Scheduler::add_work_packet
  2. Replace WorkBucket::group with a call-back function (or trait)
    • It should be called when one or more work packets are added.
    • In GCWorkScheduler::new (i.e. when those buckets are created), register callbacks on those buckets to call methods on the worker group (you just created the worker group before creating buckets).

And of course we can leave it as is for now and refactor later, given that the first approach may touch many files.
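For what it's worth, here is a minimal, self-contained sketch of suggestion 1 using toy stand-in types (the type and method names below are hypothetical, not mmtk-core's actual API): buckets only store packets, and the scheduler is the single place that both enqueues and notifies.

use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

type Packet = Box<dyn FnOnce() + Send>;

struct WorkBucket {
    queue: Mutex<VecDeque<Packet>>,
}

impl WorkBucket {
    // Adding to a bucket no longer wakes anyone up.
    fn add_no_notify(&self, p: Packet) {
        self.queue.lock().unwrap().push_back(p);
    }
}

struct WorkerGroup {
    lock: Mutex<()>,
    parked: Condvar,
}

impl WorkerGroup {
    fn notify_one_worker(&self) {
        let _guard = self.lock.lock().unwrap();
        self.parked.notify_one();
    }
}

struct Scheduler {
    buckets: Vec<WorkBucket>, // private: callers must go through add_work_packet
    workers: WorkerGroup,
}

impl Scheduler {
    // The single entry point for adding work: (1) enqueue, (2) notify.
    fn add_work_packet(&self, stage: usize, packet: Packet) {
        self.buckets[stage].add_no_notify(packet);
        self.workers.notify_one_worker();
    }
}

fn main() {
    let scheduler = Scheduler {
        buckets: vec![WorkBucket { queue: Mutex::new(VecDeque::new()) }],
        workers: WorkerGroup { lock: Mutex::new(()), parked: Condvar::new() },
    };
    scheduler.add_work_packet(0, Box::new(|| println!("work packet executed")));
}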

Member Author

I thought about this a little bit more. It looks like option 1 above is a necessary step before implementing the design proposed in #546. It creates a single entry point for adding work packets, and the design in #546 can use it to either redirect each packet to the main queue or temporarily store it in a bucket.

I will implement option 1 as a separate PR (it will likely introduce many changes in mmtk-core and the bindings), and then have a go at #546.

@wenyuzhao wenyuzhao requested review from qinsoon and wks May 25, 2022 00:05
Member

@qinsoon qinsoon left a comment

LGTM

@qinsoon qinsoon added the PR-testing Run binding tests for the pull request (deprecated: use PR-extended-testing instead) label May 25, 2022
@qinsoon
Member

qinsoon commented May 25, 2022

The failure in the JikesRVM tests is a bit strange.

+++ ./dist/RFastAdaptiveSemiSpace_x86_64_m32-linux/rvm -X:gc:no_reference_types=false -Xms75M -Xmx75M -jar /home/runner/work/mmtk-core/mmtk-core/mmtk-jikesrvm/.github/scripts/../../repos/jikesrvm/dacapo/dacapo-2006-10-MR2.jar bloat
===== DaCapo bloat starting =====
Error: Process completed with exit code 1.

There is no error message. I cannot reproduce it locally, and I haven't seen this kind of error before. I will just retry the run and see what happens.

Collaborator

@wks wks left a comment

LGTM

@qinsoon
Member

qinsoon commented May 26, 2022

The JikesRVM tests keep failing consistently (3/3 runs failed). The JikesRVM binding and mmtk-core are both v0.12, which passed the tests in mmtk/mmtk-jikesrvm#110. So it is possibly either a bug in this PR, or a bug revealed by this PR. I will investigate this, as it may be related to weak reference processing (it always fails in the weak-ref tests).

@wenyuzhao If you can think of anything that may cause the issue, please let me know.

@wks
Collaborator

wks commented Jun 3, 2022

I tried my small test case on OpenJDK, and it either passes or crashes with out-of-memory. The SoftReference is never erroneously cleared.

The following command makes the test pass on my machine, but shrinking the heap to -Xm{s,x}20M will make it OOM.

env MMTK_NO_REFERENCE_TYPES=false MMTK_THREADS=1 RUST_BACKTRACE=1 MMTK_PLAN=SemiSpace /home/wks/projects/mmtk-github/openjdk/build/linux-x86_64-normal-server-release/jdk/bin/java -XX:+UseThirdPartyHeap -server -XX:MetaspaceSize=100M -Xm{s,x}22M Main

I guess the bug is still related to clearing SoftReference, but in the VM-specific part.

@wks
Collaborator

wks commented Jun 3, 2022

It is also reproducible if I replace SoftReference with WeakReference.

@wks
Collaborator

wks commented Jun 3, 2022

Here is an even simpler test case. This time you can use any heap size. It doesn't matter.

Update: it now checks the WeakReference after each System.gc(). It also takes a command-line argument for the number of rounds, so we can specify a large number (such as 1000) and it will crash as soon as the WeakReference is erroneously cleared.

Just as before, it cannot be reproduced on OpenJDK.

import java.lang.ref.WeakReference;
import java.util.ArrayList;

public class Simple {
    Object hard;
    WeakReference<Object> soft;

    public void populate() {
        Object obj = new Object();
        hard = obj;
        soft = new WeakReference(obj);
    }

    public void check(int round) {
        Object obj = soft.get();
        if (obj == null) {
            throw new RuntimeException("obj is null! round=" + round);
        }
    }

    public static void main(String[] args) {
        int gc_rounds = Integer.parseInt(args[0]);
        Simple main = new Simple();

        System.out.println("Populating...");
        main.populate();

        System.out.println("Doing lots of GC...");
        for (int i = 0; i < gc_rounds; i++) {
            System.gc();
            main.check(i);
        }

        System.out.println("Checking...");
        main.check(gc_rounds);

        Object obj2 = main.hard;
        System.out.println(obj2);

        System.out.println("Done.");
    }

}

@qinsoon
Member

qinsoon commented Jun 3, 2022

Here is an even simpler test case. This time you can use any heap size. It doesn't matter.

Just as before, it cannot be reproduced on OpenJDK.

import java.lang.ref.WeakReference;
import java.util.ArrayList;

public class Simple {
    Object hard;
    WeakReference<Object> soft;

    public void populate() {
        Object obj = new Object();
        hard = obj;
        soft = new WeakReference(obj);
    }

    public void check() {
        Object obj = soft.get();
        if (obj == null) {
            throw new RuntimeException("obj is null!");
        }
        Object obj2 = hard;
        System.out.println(obj2);
    }

    public static void main(String[] args) {
        Simple main = new Simple();

        System.out.println("Populating...");
        main.populate();

        System.out.println("Doing lots of GC...");
        for (int i = 0; i < 100; i++) {
            System.gc();
        }

        System.out.println("Checking...");
        main.check();

        System.out.println("Done.");
    }

}

Does this test pass with master but fail with this PR?

@wks
Collaborator

wks commented Jun 3, 2022

@qinsoon Yes. It always passes with mmtk-core master, no matter the number of threads.

By the way, with the work-stealing branch it sometimes cannot finish all 100 GC rounds; it may hang or crash.

@wks
Collaborator

wks commented Jun 3, 2022

There is a chance that a WeakRefProcessing work packet is executed before any ProcessEdgesWork is executed (with MMTK_THREADS=1). This doesn't look right. Whenever this happens, my test case raises an exception because the WeakReference has been cleared.

[2022-06-03T11:58:49Z INFO  mmtk::plan::global] User triggering collection
[2022-06-03T11:58:49Z WARN  mmtk::util::reference_processor] WeakRefProcessing
[2022-06-03T11:58:49Z WARN  mmtk::util::reference_processor] WeakRefProcessing End
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork: 81289
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork End: 81289
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork: 81291
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork End: 81291
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork: 81292
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork End: 81292
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork: 81293
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork End: 81293
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork: 81299
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork End: 81299
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork: 81300
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork End: 81300
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork: 81302
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork End: 81302
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork: 81303
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork End: 81303
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork: 81306
[2022-06-03T11:58:49Z ERROR mmtk::scheduler::gc_work] ProcessEdgesWork: 81294

@wks
Collaborator

wks commented Jun 3, 2022

A more detailed log of a problematic GC (timestamps and type parameters are omitted):

BEGIN mmtk::scheduler::gc_work::ScheduleCollection
END mmtk::scheduler::gc_work::ScheduleCollection
BEGIN mmtk::scheduler::gc_work::StopMutators
END mmtk::scheduler::gc_work::StopMutators
BEGIN mmtk::scheduler::gc_work::StopMutators
BEGIN mmtk::scheduler::gc_work::Prepare
END mmtk::scheduler::gc_work::Prepare
BEGIN mmtk::scheduler::gc_work::PrepareCollector
END mmtk::scheduler::gc_work::PrepareCollector
BEGIN mmtk::scheduler::gc_work::PrepareMutator
END mmtk::scheduler::gc_work::PrepareMutator
BEGIN mmtk::scheduler::gc_work::PrepareMutator
END mmtk::scheduler::gc_work::PrepareMutator
BEGIN mmtk::scheduler::gc_work::PrepareMutator
END mmtk::scheduler::gc_work::PrepareMutator
BEGIN mmtk::scheduler::gc_work::PrepareMutator
END mmtk::scheduler::gc_work::PrepareMutator
BEGIN mmtk::scheduler::gc_work::PrepareMutator
END mmtk::scheduler::gc_work::PrepareMutator
BEGIN mmtk::scheduler::gc_work::PrepareMutator
END mmtk::scheduler::gc_work::PrepareMutator
BEGIN mmtk::scheduler::gc_work::PrepareMutator
END mmtk::scheduler::gc_work::PrepareMutator
BEGIN mmtk::scheduler::gc_work::PrepareMutator
END mmtk::scheduler::gc_work::PrepareMutator
BEGIN mmtk::scheduler::gc_work::PrepareMutator
END mmtk::scheduler::gc_work::PrepareMutator
BEGIN mmtk::scheduler::gc_work::PrepareCollector
END mmtk::scheduler::gc_work::PrepareCollector
BEGIN mmtk::util::reference_processor::SoftRefProcessing
END mmtk::util::reference_processor::SoftRefProcessing
BEGIN mmtk::util::reference_processor::WeakRefProcessing
END mmtk::util::reference_processor::WeakRefProcessing
BEGIN mmtk::scheduler::gc_work::VMProcessWeakRefs
END mmtk::scheduler::gc_work::VMProcessWeakRefs
BEGIN mmtk::util::finalizable_processor::Finalization
END mmtk::util::finalizable_processor::Finalization
BEGIN mmtk::scheduler::gc_work::PlanProcessEdges
END mmtk::scheduler::gc_work::PlanProcessEdges
BEGIN mmtk::scheduler::gc_work::PlanProcessEdges
END mmtk::scheduler::gc_work::PlanProcessEdges
...
BEGIN mmtk::scheduler::gc_work::PlanProcessEdges
END mmtk::scheduler::gc_work::PlanProcessEdges
BEGIN mmtk::scheduler::gc_work::PlanProcessEdges
END mmtk::scheduler::gc_work::PlanProcessEdges
BEGIN mmtk::scheduler::gc_work::ScanStackRoot
END mmtk::scheduler::gc_work::ScanStackRoot
BEGIN mmtk::scheduler::gc_work::ScanStackRoot
END mmtk::scheduler::gc_work::PlanProcessEdges
BEGIN mmtk::scheduler::gc_work::PlanProcessEdges
END mmtk::scheduler::gc_work::ScanStackRoot
BEGIN mmtk::scheduler::gc_work::ScanStackRoot
END mmtk::scheduler::gc_work::ScanStackRoot
BEGIN mmtk::scheduler::gc_work::ScanStackRoot
END mmtk::scheduler::gc_work::ScanStackRoot
BEGIN mmtk::scheduler::gc_work::ScanStackRoot
END mmtk::scheduler::gc_work::ScanStackRoot
BEGIN mmtk::scheduler::gc_work::ScanStackRoot
END mmtk::scheduler::gc_work::ScanStackRoot
BEGIN mmtk::scheduler::gc_work::ScanStackRoot
END mmtk::scheduler::gc_work::ScanStackRoot
BEGIN mmtk::scheduler::gc_work::ScanStackRoot
END mmtk::scheduler::gc_work::ScanStackRoot
BEGIN mmtk::scheduler::gc_work::ScanStackRoot
END mmtk::scheduler::gc_work::ScanStackRoot
BEGIN mmtk::scheduler::gc_work::ScanVMSpecificRoots
END mmtk::scheduler::gc_work::ScanVMSpecificRoots
BEGIN mmtk_jikesrvm::scan_statics::ScanStaticRoots
END mmtk_jikesrvm::scan_statics::ScanStaticRoots
END mmtk::scheduler::gc_work::PlanProcessEdges
BEGIN mmtk::scheduler::gc_work::PlanProcessEdges
BEGIN mmtk_jikesrvm::scanning::ScanGlobalRoots
END mmtk_jikesrvm::scanning::ScanGlobalRoots
BEGIN mmtk_jikesrvm::scan_statics::ScanStaticRoots
END mmtk_jikesrvm::scan_statics::ScanStaticRoots
BEGIN mmtk_jikesrvm::scanning::ScanGlobalRoots
END mmtk_jikesrvm::scanning::ScanGlobalRoots
BEGIN mmtk::scheduler::gc_work::PlanProcessEdges
END mmtk::scheduler::gc_work::PlanProcessEdges
BEGIN mmtk::scheduler::gc_work::PlanProcessEdges
END mmtk::scheduler::gc_work::PlanProcessEdges
...

It looks like the buckets for reference processing are opened too early.

I guess what happened is that some work packets are asynchronous. Those mmtk-core work packets finish very quickly, but there are VM threads still working in the background that may deliver more work packets afterwards. The GC workers in mmtk-core, however, think all buckets are empty and decide that more buckets should be opened. The weak-reference-processing work packets are added by the ScheduleCollection work packet, which is executed by the GC coordinator thread. Therefore WeakRefProcessing becomes executable even before the stacks are scanned.

@qinsoon
Member

qinsoon commented Jun 3, 2022

@wks Nice debugging! In ScheduleCollection, we do not schedule anything for the Closure bucket, but rather rely on the binding to generate the closure packets. Actually, we do not schedule anything between scanning stacks (or stopping mutators, I can't remember) and weak-ref processing. That means there could be a bug where, at some point, the scheduler sees no work in the closure bucket and all conditions for opening a new bucket are satisfied, so it activates the weak-ref processing bucket. This also explains why CI may pass occasionally after Wenyu fixed an issue about designated work (before the fix, the scheduler might open new buckets even if some workers had designated work). The fix probably eliminated some cases, which reduces the chance we see the bug in CI. @wenyuzhao

@wenyuzhao
Member Author

@wks Can you please advise which machine you were using and what your command-line args were? It's weird: I tried the test case with both SemiSpace and MarkSweep multiple times, and I did not see any crashes. I'm using fox.moma.

My command: dist/RFastAdaptiveMarkSweep_x86_64_m32-linux/rvm -X:gc:no_reference_types=false -Xms20M -Xmx20M Simple

@wks
Collaborator

wks commented Jun 3, 2022

I found the cause.

One difference between the JikesRVM binding and the OpenJDK binding is that in OpenJDK, any GC worker can execute the StopMutators work packet, while in JikesRVM only the GC coordinator thread can execute StopMutators. The background is that OpenJDK requires that the thread that stops all threads must be the same as the thread that starts the mutators. (See #450)

The StopMutators work packet calls stop_all_mutators, and then immediately notifies workers to continue. See:

// StopMutators::do_work
        trace!("stop_all_mutators start");
        mmtk.plan.base().prepare_for_stack_scanning();
        <E::VM as VMBinding>::VMCollection::stop_all_mutators::<E>(worker.tls);
        trace!("stop_all_mutators end");
        mmtk.scheduler.notify_mutators_paused(mmtk);    // THIS LINE!!!!!!!!!!!!
        if <E::VM as VMBinding>::VMScanning::SCAN_MUTATORS_IN_SAFEPOINT {
...
                for mutator in <E::VM as VMBinding>::VMActivePlan::mutators() {
                    mmtk.scheduler.work_buckets[WorkBucketStage::Prepare]
                        .add(ScanStackRoot::<E>(mutator));  // ADD WORK PACKETS
                }
        }
        mmtk.scheduler.work_buckets[WorkBucketStage::Prepare].add(ScanVMSpecificRoots::<E>::new());  // ADD WORK PACKETS

Note that it notifies the GC workers before adding more work packets. This is not a problem for OpenJDK, because StopMutators as a whole is executed on a GC worker: if any GC worker is still running, the condition "all GC workers have stopped" is not satisfied.

However, on JikesRVM, StopMutators is executed on the GC coordinator thread, which is not counted as a GC worker. If there is a slight delay after the notify_mutators_paused invocation, as in the following patch:

diff --git a/src/scheduler/gc_work.rs b/src/scheduler/gc_work.rs
index 615458f0..a4a2636b 100644
--- a/src/scheduler/gc_work.rs
+++ b/src/scheduler/gc_work.rs
@@ -10,6 +10,7 @@ use std::marker::PhantomData;
 use std::mem;
 use std::ops::{Deref, DerefMut};
 use std::sync::atomic::Ordering;
+use std::time::Duration;
 
 pub struct ScheduleCollection;
 
@@ -186,6 +187,9 @@ impl<E: ProcessEdgesWork> GCWork<E::VM> for StopMutators<E> {
         <E::VM as VMBinding>::VMCollection::stop_all_mutators::<E>(worker.tls);
         trace!("stop_all_mutators end");
         mmtk.scheduler.notify_mutators_paused(mmtk);
+        std::thread::sleep(Duration::from_millis(10));
         if <E::VM as VMBinding>::VMScanning::SCAN_MUTATORS_IN_SAFEPOINT {
             // Prepare mutators if necessary
             // FIXME: This test is probably redundant. JikesRVM requires to call `prepare_mutator` once after mutators are paused

Then the GC workers will wake up and find that all open buckets are drained, so they will open the subsequent buckets. Because the delay after notify_mutators_paused can be arbitrarily long depending on operating-system scheduling, arbitrarily many work packets may have been executed before the StopMutators work packet adds the ScanStackRoot and ScanVMSpecificRoots work packets, which generate all the subsequent ProcessEdgesWork work packets.

So the current "all GC workers have stopped" condition is insufficient: it does not take the running GC coordinator thread into account.

@wks
Collaborator

wks commented Jun 3, 2022

@wks Can you please advise which machine you were using and what your command-line args were? It's weird: I tried the test case with both SemiSpace and MarkSweep multiple times, and I did not see any crashes. I'm using fox.moma.

My command: dist/RFastAdaptiveMarkSweep_x86_64_m32-linux/rvm -X:gc:no_reference_types=false -Xms20M -Xmx20M Simple

I am using my own laptop in an LXC container. It should be reproducible on any machine.

My command line is:

MMTK_THREADS=2  ~/projects/mmtk-github/jikesrvm/dist/RFastAdaptiveSemiSpace_x86_64_m32-linux/rvm -X:gc:no_reference_types=false -Xm{s,x}600M Simple 1000

The number of threads matters. It is more likely to reproduce with 2 threads than with 1.

Also note that I updated the test script in the afternoon, so you can specify on the command line how many rounds of System.gc() it runs, and it will check the WeakReference after each GC and fail fast.

@wenyuzhao
Member Author

The issue is not simply that the controller processes the StopMutators packet.

That was not a problem before, because the coordinator was the one that opened new buckets, so no new buckets could open until the coordinator's StopMutators work was finished. Now it is the last yielding GC worker that opens new buckets (to reduce inter-thread messages).

I changed the bucket-opening condition so that no new buckets can open until the coordinator has finished its coordinator work packets. Hopefully this solves the problem.
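Roughly, the strengthened condition looks like the following sketch (hypothetical names; the real check also accounts for designated work, as discussed above):

// A new bucket may only open when all GC workers are parked, every open
// bucket is drained, and the coordinator has no outstanding coordinator work.
fn can_open_next_bucket(
    parked_workers: usize,
    total_workers: usize,
    open_buckets_drained: bool,
    pending_coordinator_packets: usize,
) -> bool {
    parked_workers == total_workers
        && open_buckets_drained
        && pending_coordinator_packets == 0
}

fn main() {
    // While the coordinator is still running StopMutators, the last clause
    // fails, so the reference-processing buckets cannot open early.
    assert!(!can_open_next_bucket(8, 8, true, 1));
    assert!(can_open_next_bucket(8, 8, true, 0));
}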

@wenyuzhao
Member Author

@wks Can you please confirm that the change fixes the bug on your side? I am now able to reproduce the bug, and it looks like it is fixed.

Collaborator

@wks wks left a comment

I see some problems with the memory orderings used in various atomic operations. They may, in theory, cause issues. If unsure, we can always use SeqCst.
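For illustration, a minimal sketch (not the PR's code) of the kind of counter where SeqCst is the conservative choice:

use std::sync::atomic::{AtomicUsize, Ordering};

// A parked-worker counter. With Ordering::Relaxed the update itself is still
// atomic, but it is not ordered with respect to surrounding loads and stores,
// so the "last parked worker" may act on a stale view of other shared state.
// SeqCst avoids having to reason about that.
static PARKED: AtomicUsize = AtomicUsize::new(0);

/// Returns true if the calling worker is the last one to park.
fn park_worker(total_workers: usize) -> bool {
    PARKED.fetch_add(1, Ordering::SeqCst) + 1 == total_workers
}

fn unpark_worker() {
    PARKED.fetch_sub(1, Ordering::SeqCst);
}

fn main() {
    assert!(!park_worker(2));
    assert!(park_worker(2));
    unpark_worker();
    unpark_worker();
}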

@wks
Collaborator

wks commented Jun 4, 2022

@wenyuzhao I confirm that the bug is no longer reproducible on my local machine. However, I also see some memory-ordering issues. See my previous comments.

@wenyuzhao wenyuzhao merged commit 43a6755 into master Jun 5, 2022
@wenyuzhao wenyuzhao deleted the work-stealing branch June 5, 2022 01:32
@wks wks mentioned this pull request Jun 15, 2022
@wks wks mentioned this pull request Nov 8, 2022
@wks wks mentioned this pull request Feb 10, 2023
qinsoon pushed a commit that referenced this pull request Feb 21, 2023
Some dependencies have new versions. It is a good chance to bump the
dependency versions, too, after we bump the Rust toolchain versions.

One notable dependency is `enum-map`. Its latest version is 2.4.2, but
we locked its version to 2.1.0 because it required a newer Rust
toolchain. Now we can depend on its latest version, instead. We also
changed the `rust-version` property in `Cargo.toml` to `1.61.0` because
`enum-map-derive` depends on that, and `1.61.0` is not new compared to
the `1.66.1` we use. See:
#507 (comment)

This closes #693
wenyuzhao pushed a commit to wenyuzhao/mmtk-core that referenced this pull request Mar 20, 2023
Some dependencies have new versions. It is a good chance to bump the
dependency versions, too, after we bump the Rust toolchain versions.

One notable dependency is `enum-map`. Its latest version is 2.4.2, but
we locked its version to 2.1.0 because it required a newer Rust
toolchain. Now we can depend on its latest version, instead. We also
changed the `rust-version` property in `Cargo.toml` to `1.61.0` because
`enum-map-derive` depends on that, and `1.61.0` is not new compared to
the `1.66.1` we use. See:
mmtk#507 (comment)

This closes mmtk#693