Generalising weak reference processing and finalisation

*WARNING: This issue contains wild and crazy ideas.*

mmtk-core already has Java-style weak reference processors and finaliser processors.  In https://github.com/mmtk/mmtk-core/issues/544, we discussed whether we should keep the Java semantics.  But as we start to support other languages and VMs, it is clear that we need to go beyond what is available in Java.

*Update: After discussions, it is clear that this idea is not crazy.  Besides the reasons provided below, another reason for supporting ref processing in bindings is that it allows an apples-to-apples comparison between MMTk and the VM's own GC, because both will use the same reference processor.*

Task list:

-   [X] Introduce a language-neutral API for processing references in VM bindings. (https://github.com/mmtk/mmtk-core/pull/700)
-   [ ] Migrate bindings away from the reference/finalizer processors in mmtk-core.
    -   [X] Ruby
    -   [ ] OpenJDK
    -   [ ] JikesRVM
    -   [ ] Julia
-   [ ] Deprecate and remove the reference/finalizer processors in mmtk-core.

# Other languages

## Java (Yes. Java.)

In addition to `java.lang.ref.XxxxReference` and things implemented with them (such as `WeakHashMap` which is implemented with `WeakReference`), Java also has JNI weak handles which weakly refer to an object, but are not Java objects.  The current weak ref processing mechanism cannot handle those weak handles.

## Ruby

**ObjectSpace::WeakMap** and **WeakRef**: In Ruby, the most basic programmer-visible weak data structure is the `ObjectSpace::WeakMap` type.  It is a weak-key, weak-value hash map.  If either the key or the value is dead, the key-value pair is removed from the map.  It is used to implement the `WeakRef` type in the stdlib, storing the `WeakRef` as the key and the referred object as the value.  If either the `WeakRef` or the referred object dies, the association between them is removed.  Under the hood, `ObjectSpace::WeakMap` is implemented by adding finalisers on both the key and the value.

**Global internal data structures**: Some internal data structures in Ruby have weak reference semantics.  Those data structures hold per-object data for live objects, but can be cleaned up when the object dies.

-   *ID*: Each Ruby object may have an ID, obtained by `obj.object_id`.  The ID is guaranteed to be unique while the object is alive.  Under the hood, the Ruby runtime maintains a global bidirectional ID-to-object and object-to-ID map.  When an object is moved, the `gc_move` function updates the bi-directional map; when an object dies, the finaliser `obj_free` removes that object from the bidirectional map.
-   *gen_ivtbl*: Objects other than `T_OBJECT` have their instance variables held in an external table, and a global map `generic_iv_tbl_` maps each object to its "gen_ivtbl".  When an object dies, its associated "gen_ivtbl" is freed.

The "cleaned when object dies" semantics satisfies the definition of "weak reference".  Actually, weak references are intended to be used to implement canonicalising mappings, as described in [Java's documentation](https://docs.oracle.com/javase/8/docs/api/java/lang/ref/WeakReference.html).

## V8 and Ephemeron

V8 supports ephemerons.  Simply put, an ephemeron is a pair:

```rust
struct Ephemeron {
    key: WeakReference,
    value: MaybeWeakReference,
}
```

If the object referred to by the `key` is alive, the `value` field behaves like a strong reference; otherwise the `value` field behaves like a weak reference.

Ephemeron behaves like `java.util.WeakHashMap` entries.  If the key dies, the key-value pair is automatically removed from the `WeakHashMap`.  Under the hood, OpenJDK implements it by using `WeakReference`s to point to the key.  When the key dies, the `WeakReference` is enqueued, and the `WeakHashMap` "expunges stale entries" from time to time.  It is not as good as Ephemeron, though, because with native Ephemeron support, the GC can clear the value field directly.

# Why is the current mechanism in MMTk core not enough?

## Different data structures

Different languages/VMs have different weak data structures.

Some of them are not heap objects.  For example, JNI weak handles are not heap objects, but MMTk core's ReferenceProcessor assumes weak references are heap objects.

Some of them can hold multiple key-value pairs in one complex data structure.  For example, in Ruby, the weak tables are hash tables implemented in C.  They cannot simply be updated the way the GC updates fields when an object moves.  If the hash table uses the object address as the key, and the object is moved, then the table entry needs to be re-hashed because the key has changed.
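To make the re-hashing point concrete, here is a minimal sketch of rebuilding an address-keyed table after a moving GC.  `Address`, `rehash_after_gc` and the `forward` callback are hypothetical names for illustration; they are not part of mmtk-core or the Ruby runtime.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an object address.
type Address = usize;

/// Rebuild an address-keyed table after a moving GC.
///
/// `forward` is an assumed callback that returns the new address of a live
/// object, or `None` if the object is dead.
fn rehash_after_gc<V>(
    table: HashMap<Address, V>,
    forward: impl Fn(Address) -> Option<Address>,
) -> HashMap<Address, V> {
    let mut rebuilt = HashMap::with_capacity(table.len());
    for (old_addr, value) in table {
        if let Some(new_addr) = forward(old_addr) {
            // Inserting under the new address re-hashes the entry.
            rebuilt.insert(new_addr, value);
        }
        // Entries whose key object died are simply dropped.
    }
    rebuilt
}
```

Overwriting the key in place would leave the entry in the bucket chosen by the old hash, which is why the entry has to be removed and re-inserted.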

## Different semantics

Ephemeron's unusual semantics ("when the key dies, the value becomes weak") is not covered by any of Java's existing reference types.

Although both Java's `WeakHashMap` and Ruby's `ObjectSpace::WeakMap` emulate ephemeron-like behaviour using weak references or finalisers, this is not as efficient as supporting ephemerons directly in the GC, because the weak maps still briefly keep the value "alive", and the "expunge stale entries" operation has to be executed at a later time.

# Proposed interface

~~*Note: this may be crazy*~~ Maybe not that crazy.  Wenyu is already doing something like this [in the lxr branch of the mmtk-openjdk binding](https://github.com/wenyuzhao/mmtk-openjdk/blob/lxr/mmtk/src/reference_glue.rs#L243-L289)

MMTk core provides a reference processing stage, `RefClosure` (replacing our current `XxxRefClosure` phases), during which two functions can be called:
-   `is_alive(ObjectAddress) -> bool`: Return whether an object is alive.
    -   Update: `is_reachable` would be a better name.
-   `trace_object(ObjectAddress) -> ObjectAddress`: Keep the object alive, trace that object, and return its new address (if moved).

And the VMBinding provides one function to be executed by GC worker threads during the new `RefClosure` phase:
-   `Collection::do_ref_processing()`: Do whatever the VM needs to process weak refs.  MMTk core may call this multiple times if the VM keeps additional objects alive via `trace_object`.

MMTk doesn't care about what the VM does during `do_ref_processing()`.
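The shape of this interface can be sketched with a toy, non-moving heap.  Everything below is hypothetical and only for illustration: the trait names, `ToyContext`, `FinalizerBinding` and `run_ref_closure` are not mmtk-core APIs, and the toy `trace_object` only marks the object itself instead of computing a real transitive closure.

```rust
use std::collections::HashSet;

type ObjectAddress = usize;

/// What MMTk core would expose to the binding during `RefClosure` (assumed shape).
trait RefClosureContext {
    fn is_reachable(&self, object: ObjectAddress) -> bool;
    fn trace_object(&mut self, object: ObjectAddress) -> ObjectAddress;
}

/// The single hook the VM binding would implement (assumed shape).
trait Collection {
    fn do_ref_processing(&mut self, ctx: &mut dyn RefClosureContext);
}

/// Toy non-moving context: a mark set plus a counter of newly kept objects.
struct ToyContext {
    reachable: HashSet<ObjectAddress>,
    newly_traced: usize,
}

impl RefClosureContext for ToyContext {
    fn is_reachable(&self, object: ObjectAddress) -> bool {
        self.reachable.contains(&object)
    }
    fn trace_object(&mut self, object: ObjectAddress) -> ObjectAddress {
        if self.reachable.insert(object) {
            self.newly_traced += 1;
        }
        object // Nothing moves in this toy heap.
    }
}

/// Toy binding that only handles finalisation candidates.
struct FinalizerBinding {
    candidates: Vec<ObjectAddress>,
    finalizable: Vec<ObjectAddress>,
}

impl Collection for FinalizerBinding {
    fn do_ref_processing(&mut self, ctx: &mut dyn RefClosureContext) {
        for obj in std::mem::take(&mut self.candidates) {
            if ctx.is_reachable(obj) {
                // Still in use: remains a candidate for a later GC.
                self.candidates.push(ctx.trace_object(obj));
            } else {
                // Dead: keep it alive so that its finaliser can run.
                self.finalizable.push(ctx.trace_object(obj));
            }
        }
    }
}

/// Core-side driver: call the hook again while it keeps new objects alive.
fn run_ref_closure(vm: &mut dyn Collection, ctx: &mut ToyContext) -> usize {
    let mut rounds = 0;
    loop {
        ctx.newly_traced = 0;
        vm.do_ref_processing(&mut *ctx);
        rounds += 1;
        if ctx.newly_traced == 0 {
            return rounds;
        }
    }
}
```

In this sketch the core decides whether to call the hook again by observing whether `trace_object` kept anything new alive, which is one possible reading of "MMTk core may call this multiple times".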

## How to implement Java-style references

The VM binding maintains its own lists of "candidate" and "finalizable" objects.  During `do_ref_processing`, the VM binding inspects each candidate.

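```rust
fn do_ref_processing() {
    // Soft, weak and phantom reference objects that are themselves reachable.
    for obj in openjdk::soft_weak_phantom {
        if mmtk::is_alive(obj) {
            let dst = obj.ref_field;
            if mmtk::is_alive(dst) {
                // The referent is reachable: pick up its new address if it moved.
                obj.ref_field = mmtk::trace_object(dst);
            } else if openjdk::is_soft_reference(obj) && !mmtk::is_emergency_collection() {
                // Retain softly reachable referents unless memory is tight.
                obj.ref_field = mmtk::trace_object(dst);
            } else {
                // The referent is dead: clear the reference and enqueue it if needed.
                obj.ref_field = Address::NULL;

                if openjdk::has_queue(obj) {
                    openjdk::enqueue(obj);
                }
            }
        }
    }
    // Unreachable finalisation candidates become finalisable.
    for obj in openjdk::finalize_candidates {
        if !mmtk::is_alive(obj) {
            // Keep the object alive until its finaliser has run.
            openjdk::finalizable_objects.push(mmtk::trace_object(obj));
        }
    }
}
```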

## How to implement Ephemeron

```rust
fn do_ref_processing() {
    for obj in v8::ephemerons {
        if mmtk::is_alive(obj.key) {
            // The key is reachable, so the value is kept alive as if strongly referenced.
            obj.value = mmtk::trace_object(obj.value);
        }
    }
}
```
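Because tracing one ephemeron's value can make another ephemeron's key reachable, the loop above has to be repeated until nothing new is kept alive, and dead entries can only be cleared after that.  The following self-contained sketch shows that fixpoint on a toy heap.  `ToyHeap`, `Ephemeron` and `process_ephemerons` are hypothetical names, and the repetition is done in a local loop here, whereas in the proposal MMTk core would drive it by calling `do_ref_processing` again.

```rust
use std::collections::{HashMap, HashSet};

type Address = usize;

struct Ephemeron {
    key: Address,
    value: Option<Address>,
}

/// Toy non-moving heap: strong edges plus a mark set.
struct ToyHeap {
    edges: HashMap<Address, Vec<Address>>,
    marked: HashSet<Address>,
}

impl ToyHeap {
    fn is_reachable(&self, obj: Address) -> bool {
        self.marked.contains(&obj)
    }

    /// Mark `obj` and everything strongly reachable from it.
    /// Returns true if anything new was marked.
    fn trace_object(&mut self, obj: Address) -> bool {
        let mut stack = vec![obj];
        let mut grew = false;
        while let Some(o) = stack.pop() {
            if self.marked.insert(o) {
                grew = true;
                if let Some(children) = self.edges.get(&o) {
                    stack.extend(children.iter().copied());
                }
            }
        }
        grew
    }
}

fn process_ephemerons(heap: &mut ToyHeap, ephemerons: &mut [Ephemeron]) {
    // Repeat until a whole round keeps nothing new alive.
    loop {
        let mut grew = false;
        for e in ephemerons.iter() {
            if let Some(value) = e.value {
                if heap.is_reachable(e.key) {
                    grew |= heap.trace_object(value);
                }
            }
        }
        if !grew {
            break;
        }
    }
    // Only now is it safe to clear the values of ephemerons with dead keys.
    for e in ephemerons.iter_mut() {
        if !heap.is_reachable(e.key) {
            e.value = None;
        }
    }
}
```

Clearing a value in an earlier round would be wrong, because its key might still become reachable through another ephemeron processed later.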


## How to implement global maps in Ruby

```rust
fn do_ref_processing() {
    for entry in ruby::obj_id_map {
        if !mmtk::is_alive(entry.obj) {
            ruby::obj_id_map.remove_entry(entry);
        }
    }
    for entry in ruby::gen_ivtbl_map {
        if !mmtk::is_alive(entry.obj) {
            ruby::gen_ivtbl_map.remove_entry(entry);
        }
    }
}
```
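As a more concrete version of the object ID case, the sketch below keeps a bidirectional map consistent across a GC that may both kill and move objects.  `ObjIdTable` and the `forward` callback (new address for a live object, `None` for a dead one) are assumptions for illustration and do not mirror the actual Ruby data structures.

```rust
use std::collections::HashMap;

type Address = usize;
type ObjectId = u64;

/// Toy bidirectional object/ID table.
#[derive(Default)]
struct ObjIdTable {
    obj_to_id: HashMap<Address, ObjectId>,
    id_to_obj: HashMap<ObjectId, Address>,
}

impl ObjIdTable {
    fn insert(&mut self, obj: Address, id: ObjectId) {
        self.obj_to_id.insert(obj, id);
        self.id_to_obj.insert(id, obj);
    }

    /// Drop entries for dead objects and update entries for moved ones.
    fn process(&mut self, forward: impl Fn(Address) -> Option<Address>) {
        // ID -> object: the key (the ID) never changes, so the value can be
        // updated in place while `retain` filters out dead objects.
        self.id_to_obj.retain(|_, obj| match forward(*obj) {
            Some(new_addr) => {
                *obj = new_addr;
                true
            }
            None => false,
        });
        // Object -> ID: the key is an address, so rebuild (re-hash) this
        // direction from the surviving entries.
        self.obj_to_id = self
            .id_to_obj
            .iter()
            .map(|(&id, &obj)| (obj, id))
            .collect();
    }
}
```

Using `retain` also avoids removing entries from a map while iterating over it.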

## Problems

-   Q: Can this be parallelised?
    -   A: MMTk can provide a callback so that `do_ref_processing` can create sub-tasks, while mmtk-core creates multiple work packets under the hood.

-   Q: How to support multiple strength levels (soft, weak, finalizer, phantom, ...)?
    -   A: MMTk core can call `do_ref_processing` multiple times, passing an integer parameter that indicates how many times MMTk has computed the transitive closure.  It is up to the VM binding to interpret the integer: for example, handle soft references when n = 1, weak references when n = 2, and so on.
    -   *Update: The new "sentinel" mechanism (introduced in https://github.com/mmtk/mmtk-core/pull/700) allows the GC to expand the transitive closure multiple times, and calls `process_weak_refs` each time a transitive closure computation finishes.  The VM binding can implement a state machine to handle a different strength each time.*
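The state machine mentioned in the update could look roughly like the sketch below.  `Phase` and `RefStateMachine` are hypothetical, and the convention that `process_weak_refs` returns `true` to ask for another call after the next transitive closure is an assumption here; check the PR for the actual signature.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Soft,
    Weak,
    Final,
    Phantom,
    Done,
}

/// Toy binding-side state machine: one reference strength per call.
struct RefStateMachine {
    phase: Phase,
    handled: Vec<Phase>,
}

impl RefStateMachine {
    fn new() -> Self {
        Self { phase: Phase::Soft, handled: Vec::new() }
    }

    /// Handle the current strength, then advance.
    /// Returns true if it wants to be called again (assumed convention).
    fn process_weak_refs(&mut self) -> bool {
        let current = self.phase;
        if current == Phase::Done {
            return false;
        }
        // A real binding would process all references of strength `current`
        // here, calling `trace_object` on whatever it decides to retain.
        self.handled.push(current);
        self.phase = match current {
            Phase::Soft => Phase::Weak,
            Phase::Weak => Phase::Final,
            Phase::Final => Phase::Phantom,
            Phase::Phantom | Phase::Done => Phase::Done,
        };
        self.phase != Phase::Done
    }

    /// Reset at the start of the next GC.
    fn reset(&mut self) {
        self.phase = Phase::Soft;
        self.handled.clear();
    }
}
```

The order soft, weak, final, phantom follows the Java strength levels; another VM would define its own phases.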

-   Q: This looks very unsafe.  The VM can basically do anything here.
    -   A: It is just a matter of whether MMTk core or VM can do it better.

# Update

Wenyu is already doing something similar in the lxr branch of mmtk-openjdk.  https://github.com/wenyuzhao/mmtk-openjdk/blob/lxr/mmtk/src/reference_glue.rs#L243-L289

However, I think work packets (`GCWork` and the buckets) are an implementation detail of mmtk-core and shouldn't be exposed to the VM binding (I am still open to objections for now).  In my proposed API, `trace_object` can be provided as a callback closure that encapsulates the logic related to work packets, and the VM binding only specifies which objects need to be kept alive.

*Update: In https://github.com/mmtk/mmtk-core/pull/700, we encapsulated `trace_object` behind the `ObjectTracer` trait (which already exists to support object-enqueuing tracing), and the new `ObjectTracerContext` trait encapsulates the creation and flushing of `ProcessEdgesWork`.*
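The encapsulation argued for above can be illustrated with a toy model in which the core hands the binding a `trace_object` closure and hides the batching behind it.  `PacketBuffer`, `with_tracer` and `keep_alive` are made up for this sketch; they stand in for the idea only and are not the mmtk-core signatures.

```rust
type Address = usize;

/// Core-side toy stand-in for work packets: objects are batched and "flushed".
struct PacketBuffer {
    pending: Vec<Address>,
    capacity: usize,
    flushed: Vec<Vec<Address>>,
}

impl PacketBuffer {
    fn push(&mut self, obj: Address) {
        self.pending.push(obj);
        if self.pending.len() >= self.capacity {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if !self.pending.is_empty() {
            self.flushed.push(std::mem::take(&mut self.pending));
        }
    }
}

/// Core-side: give the binding a `trace_object` closure, then flush what is left.
fn with_tracer<R>(
    buffer: &mut PacketBuffer,
    f: impl FnOnce(&mut dyn FnMut(Address) -> Address) -> R,
) -> R {
    let result = {
        // The closure hides all batching details from the binding.
        let mut trace_object = |obj: Address| {
            buffer.push(obj);
            obj // Non-moving toy: the address is unchanged.
        };
        f(&mut trace_object)
    };
    buffer.flush();
    result
}

/// Binding-side: only says which objects must be kept alive.
fn keep_alive(
    objects: &[Address],
    trace_object: &mut dyn FnMut(Address) -> Address,
) -> Vec<Address> {
    let mut updated = Vec::with_capacity(objects.len());
    for &obj in objects {
        updated.push(trace_object(obj));
    }
    updated
}
```

The binding code never sees `PacketBuffer`, so the core remains free to change how traced objects are turned into work.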


