[GR-48683] Add JFR old object sample event #6743

galderz · 2023-06-05T16:45:26Z

Implements JFR jdk.OldObjectSample event for #5145, without path to GC root functionality. I have a prototype for the path to GC root functionality, but I’ve decided to split this work into a separate PR to make reviewing more manageable.

This PR includes:

Event emitted for sampled objects and arrays that are possible leaks.
They are sampled when they’re allocated via the slow path. The slow path happens when an object does not fit in the TLAB and a new one is required. Arrays bigger than the large array threshold are also allocated via the slow path regardless.
Following the same logic found in HotSpot, the sampler is implemented as a custom priority queue ordered by span, which is based on allocation size. Objects with the smallest span are at the top of the queue.
The sample queue is initialized with a fixed-size array at runtime. The default size is 256 elements, but this can be configured at runtime using -XX:FlightRecorderOptions=old-object-queue-size=<size> option.
If the queue is full, and the sampled object is bigger than the top, the top is removed and the sampled object is inserted.
If the queue is full and the sampled object is smaller than the top, the code tries to scavenge the queue which means purging objects that have been garbage collected.
To scavenge, the underlying array backing the queue is iterated over, removing those sampled objects that have been garbage collected.
To check if an object has been garbage collected or not, sampled objects are wrapped around WeakReferences.
Operations on the sampler queue are protected by a new non-reentrant SpinLock. This new lock implementation works similarly to a VMMutex but extends it to offer a tryLock() function. This function is useful for sampling because if another object is already being sampled and the lock cannot be acquired, the sampling is simply skipped. When emitting old object sample events, the sampler waits to acquire the lock.
Old object sample events emitted with custom descriptions for Thread, ThreadGroup and Class. Ellipsis descriptions are also supported.

Caveats:

Calling WeakReference.get() from uninterruptible code fails. Adding @AnnotateOriginal @Uninterruptible(reason = "Called from uninterruptible code.", mayBeInlined = true) to Target_java_lang_ref_Reference.get() does not seem to have any impact. Looking at AnnotationSubstitutionProcessor, classes that have @Substitute don’t seem to get processed for methods with @AnnotateOriginal. Worked around by adding calleeMustBe = false to the Uninterruptible method within which WeakReference.get() is called. This caveat has been addressed, see latest commits.
Old object queue size can only be configured at runtime via the startup recorder options flag. I couldn’t find a way to unit test this since all unit tests run as a single binary run, so the queue size is fixed for all the tests.
Old object sample event descriptions for classes that have a size field not implemented since they’re not feasible in a closed world environment. In HotSpot this is achieved by using reflection to lookup a size field but that cannot be done in SubstrateVM.
Old object sample event description tests only added for thread group. Threads not tested because creating many instances of those triggers other intermediary objects to be sampled so couldn’t easily get to the Thread instances. Classes not tested since it requires dynamically creating brand new Class instances at runtime, something not feasible in closed-world SubstrateVM.

Once this PR is integrated I'll work on adding path to GC root functionality.

galderz · 2023-06-05T16:45:59Z

Can @roberttoyonaga @christianhaeubl and @adinn review this?

galderz · 2023-06-05T16:57:48Z

Fixing some eclipse format issues...

galderz · 2023-06-06T07:56:58Z

Added a few commits to address warning errors. Also made the full sample queue test more reliable.

galderz · 2023-06-06T11:20:43Z

I think I've fixed all style and format errors. However, gate testing is not working. It fails with:

Tag "test" not part of any task.
Run the following command to see all available tasks and their tags:
  mx -v gate --dry-run
Traceback (most recent call last):
  File "/home/runner/work/graal/graal/mx/mx_gate.py", line 475, in gate
    mx.abort(f'Tag "{tag}" not part of any task.\n'
  File "/home/runner/work/graal/graal/mx/mx.py", line 4273, in abort
    raise SystemExit(error_code)
SystemExit: 1

@christianhaeubl Any ideas? @zakkak said it could be related to #6738.

zakkak · 2023-06-06T11:31:47Z

@zakkak said it could be related to #6738.

FTR I have created #6746 to resolve the issue.

roberttoyonaga

Thanks Galder! It would be really awesome to add this support for JFR leak profiling. I've left a few comments/questions here.

roberttoyonaga · 2023-06-07T16:59:36Z

substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/locks/SpinLock.java

+    public void lock() {
+        @SuppressWarnings("deprecation")
+        final long threadId = Thread.currentThread().getId();
+        while (!lock.compareAndSet(NOT_HELD, threadId)) {


I think we might need a PauseNode.pause(); here. Would it be helpful to use JavaSpinLockUtils instead of this custom spinlock(It might require a bit of re-working though)? It already has tryLock methods.

Good catch. I had completely missed that class. I've just tried it and does the job.

roberttoyonaga · 2023-06-07T18:31:33Z

...atevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jfr/oldobject/JfrOldObjectSampler.java

+        int numDead = 0;
+        for (int i = 0; i < samples.getCapacity(); i++) {
+            final OldObject sample = samples.getSample(i);
+            if (sample.reference != null && sample.reference.get() == null) {


In this method the current thread holds the spinlock (which can lock without transition). sample.reference.get() steps into interruptible code and may result in the current thread waiting for a safepoint while holding the spinlock. If another thread calls JfrOldObjectSampler.emit(long , boolean) at this time, it will deadlock because it is blocked waiting for the spinlock, but cannot allow safepointing since it's executing uninterruptible code.

Your comment in the PR description says it's not possible to make WeakReference.get() uninterruptible using @AnnotateOriginal. @christianhaeubl do you know if there's a way to work around this?

I've found a solution. I've hacked AnnotationSubstitutionProcessor.handleAnnotatedMethodInSubstitutionClass to allow for @KeepOriginal and AnnotateOriginal to be used together, and then made Reference.get like this:

@KeepOriginal @AnnotateOriginal @Uninterruptible(reason = "Called from uninterruptible code.", mayBeInlined = true) native T get();

Then it all works :)

Nice! I think that should solve the problem. Were only superficial changes to AnnotationSubstitutionProcessor.handleAnnotatedMethodInSubstitutionClass needed?

Yeah, small changes indeed :)

I've not pushed all the commits yet because I was trying to get clarification about #6743 (comment), but that might take some days, so I'll probably push updates shortly.

roberttoyonaga · 2023-06-08T15:41:54Z

substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jfr/JfrChunkWriter.java

         * changing epoch.
         */
-        this.flushCheckpointRepos = new JfrRepository[]{stackTraceRepo, methodRepo, typeRepo, symbolRepo};
+        this.flushCheckpointRepos = new JfrRepository[]{stackTraceRepo, methodRepo, typeRepo, symbolRepo, oldObjectRepo};


Maybe the oldObjectRepo should come earlier in the write order. It will reference type and symbol data. Otherwise there could be races (ie. if entries are added to oldObjectRepo after symbolRepo is written to disk but before oldObjectRepo is written to disk, then there could be dangling references in the chunk)

Makes sense, I've reordered repos.

roberttoyonaga · 2023-06-08T17:38:46Z

substratevm/src/com.oracle.svm.test/src/com/oracle/svm/test/jfr/oldobject/JfrOldObjectTest.java

+    public void triggerUpdateOfLastKnownHeapUsage() {
+        // Trigger GC before tests to get a reading of last known heap usage for the first executed
+        // test.
+        System.gc();


Why do we have to invoke gc twice here?

Yup, once is enough :)

roberttoyonaga · 2023-06-08T18:11:02Z

substratevm/src/com.oracle.svm.test/src/com/oracle/svm/test/jfr/oldobject/JfrOldObjectTest.java

+        return startRecording(events, getDefaultConfiguration(), settings);
+    }
+
+    static void blackhole(Object obj) {


Why do we need this method?

Good question. The leak field is assigned to a static variable but it's not used, so at the beginning when I started working on this I was worried the compiler might optimize this away and remove all the code. The blackhole method is just a way to make sure the field is used in a trivial way, but one that cannot be optimized away because it can (extremely unlikely) end up causing a side effect (printing an empty space character).

I've just removed it and calls to it and it all works fine, so I'll go ahead and remove it.

roberttoyonaga · 2023-06-08T18:28:33Z

substratevm/src/com.oracle.svm.test/src/com/oracle/svm/test/jfr/oldobject/TestArrayLeak.java

+            Object[] right = new Object[3];
+            node[2] = right;
+            node = right;
+        }


Is this adding objects to this array try to make the 100 element array allocation get sampled along the allocation slow path? If so, is there a chance that it doesn't actually get sampled?

What if you directly called JfrOldObjectSampleEvents.sample(result, size.rawValue(), Integer.MIN_VALUE); from this testing code?

Is this adding objects to this array try to make the 100 element array allocation get sampled along the allocation slow path? If so, is there a chance that it doesn't actually get sampled?

All I'm doing is generating enough arrays so that some of them get sampled as old objects and I can verify that sampled arrays make it through as events and they have the right data populated. Having the right data here means that the sampled event is for Object[] has length of 100. If no arrays were sampled the test would fail because it guarantees that the list of events generated for the Object[] is not empty.

What if you directly called JfrOldObjectSampleEvents.sample(result, size.rawValue(), Integer.MIN_VALUE); from this testing code?

Hmmm, not sure what that would do that this test is not already covering?

Ok I see. I was just concerned that there isn't a guarantee the length 100 array gets sampled (although it's probably likely because we try 1M times). Although unlikely, what if the the length 3 array accidentally gets sampled because we happen to run out of space in the current TLAB, would the test fail because it's only expecting the length 100 array?

I don't think that will happen because the allocated size of an array of 100 elements is bigger than that of 3 elements, so you should see the priority queue logic kicking in, by pushing out arrays of 3 elements out and sampling those that have 100 elements.

roberttoyonaga · 2023-06-08T19:25:51Z

...evm/src/com.oracle.svm.core/src/com/oracle/svm/core/jfr/oldobject/OldObjectEventEmitter.java

+                    final long heapUsedAtLastGC = sample.heapUsedAtLastGC;
+                    final int arrayLength = sample.arrayLength;
+                    OldObjectSampleEvent.emit(timestamp, objectId, allocationTime, threadId, stackTraceId, heapUsedAtLastGC, arrayLength);
+                }


The flushing thread can write serialized data to the current chunk on disk while this method is executing. If the event data gets flushed without the constant pool data in the OldObjectRepository, then there will be missing reference data. I think this scenario is possible because flushes are ignored by the OldObjectRepository. I think fixing this might be as simple as allowing flushes in the OldObjectRepository similar to the other repos. Most of the time flushes will be ignored anyway because OldObjectRepository will only have serialized data once a recording is actually stopped.

Makes sense, I've changed it to allow flushes.

roberttoyonaga · 2023-06-08T21:11:11Z

...atevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jfr/oldobject/JfrOldObjectSampler.java

+        final OldObject prev = samples.getSample(prevIndex);
+        if (prev.reference != null) {
+            queue.remove(prev);
+            prev.span += sample.span;


Why are we adding span here to the previous? Is this an optimization?

roberttoyonaga · 2023-06-09T15:31:31Z

substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/locks/SpinLock.java

+     * obtained. This lock is non-reentrant so if the lock is already held by the current thread,
+     * the method throws a {@link RuntimeException}.
+     */
+    @Uninterruptible(reason = "Called from uninterruptible code.", mayBeInlined = true)


I think maybe this shouldn't be allowed to be in-lined since we are effectively locking without transition here (if it gets in-lined into interruptible code it also becomes interruptible). You might also need callermustbe = true ie @Uninterruptible(reason = "The whole critical section must be uninterruptible.", callerMustBe = true)

This class will be removed and instead use JavaSpinLockUtils.

galderz · 2023-06-26T15:33:47Z

Pushed a few updates that should address all of @roberttoyonaga's feedback:

Use JavaSpinLockUtils instead of rolling your own.
Make calls to WeakReference.get() uninterruptible. To achieve this, since Reference class is substituted,
annotation processor had to be changed to enable @KeepOriginal and @AnnotateOriginal to be used together
in classes that are marked with @Substitute.
Just called System.gc() call once to record a first heap usage.
Remove blackhole methods which are not necessary.
Put old object repo before type and symbol repo.
Support flush of old object repository.
Link samples with a addition-time-based singly linked list, in order to keep same logic as in HotSpot when scavenging. When removing entries through scavenging, their priority is passed on to an sample allocated just after it. This is done to keep samples distributed evenly.
Migrate sampler to profiler and separate sampler and emitter. By doing that, the profiler can be more easily unit tested, while still encapsulating the sampler, emitter, queue and list implementations.
Added a unit test for the old object profiler.
Just like in HotSpot, add total allocated tracking logic.
Made clear which of the tests exercise JFR recording vs the other unit tests.

galderz · 2023-06-27T12:46:25Z

Formatting, style...etc should be done now, but CI shows all the recording tests I've added failing when tested with labsjdk-ce-21-jvmci-23.1-b09. The problem arises because object field of the recorded event seems to be null. E.g.

jdk.OldObjectSample {
  startTime = 13:40:06.910 (2023-06-27)
  allocationTime = 13:40:06.910 (2023-06-27)
  objectSize = 33.8 kB
  objectAge = 19.2 ms
  lastKnownHeapUsage = 216 bytes
  object = N/A
...

I'm looking into this.

galderz · 2023-06-27T13:15:02Z

Hmmmm, seems like in JDK 21 (or some other version post 17) the event has an extra field objectSize:

            E: field
             A: name allocationTime
             A: class 219
                E: annotation
                 A: class 18707
                E: annotation
                 A: class 18699
                 A: value TICKS
                E: annotation
                 A: class 7632
                 A: value Allocation Time
            E: field
             A: name objectSize
             A: class 219
                E: annotation
                 A: class 18707
                E: annotation
                 A: class 18720
                 A: value BYTES
                E: annotation
                 A: class 7632
                 A: value Object Size
            E: field
             A: name objectAge
             A: class 219
                E: annotation
                 A: class 18707
                E: annotation
                 A: class 18693
                 A: value TICKS
                E: annotation
                 A: class 7632
                 A: value Object Age

Compared with JDK17:

            E: field
             A: name allocationTime
             A: class 205
                E: annotation
                 A: class 18875
                E: annotation
                 A: class 18866
                 A: value TICKS
                E: annotation
                 A: class 6859
                 A: value Allocation Time
            E: field
             A: name objectAge
             A: class 205
                E: annotation
                 A: class 18875
                E: annotation
                 A: class 18860
                 A: value TICKS
                E: annotation
                 A: class 6859
                 A: value Object Age

Observed parsing the JFR file with @jiekang's excellent jfr-reader tool.

I'm unsure how we should handle this. I'll see if I can handle it using the JDK earlier/later boolean suppliers.

galderz · 2023-06-27T13:49:08Z

FYI objectSize field was added in JDK18.

galderz · 2023-06-28T04:17:36Z

Fixed tests and pushed commits to support JDK18+. I've ended up removing object description tests because in later JDKs generating ThreadGroup instances unfortunately creates weak reference arrays that hide ThreadGroup instance samples (e.g. see here). This changes have been introduced via the virtual thread support. I had similar issues with other types created while creating Thread instances. As well as being pretty heavyweight objects, creating these classes generate a lot of other garbage which dominates samples. I will defer testing of object descriptions, via unit tests, to the path to GC root work, because that involves considerable changes on the emit side, so I can use that opportunity to revisit the code more thoroughly and find an easy way to validate these descriptions.

roberttoyonaga

Thanks for the updates @galderz! I think you've addressed all my earlier comments. I left a few more pretty minor notes.

roberttoyonaga · 2023-06-28T17:06:44Z

...evm/src/com.oracle.svm.test/src/com/oracle/svm/test/jfr/oldobject/TestOldObjectProfiler.java

+            profiler.sample(new WeakReference<>(String.valueOf(i)), i * 100, -1);
+        }
+
+        profiler.emit(0, Long.MAX_VALUE);


I'm not really sure OldObjectSampler.scavenge() gets called here. I think you need to sample once more after the queue is full in order to induce a scavenge operation. I think this test still passes because OldObjectEventEmitter.emitUnchained(long) checks if the each sample is alive before emitting.

Hmmmm, I think I know what you mean, but let's refresh how scavenging has been ported to see if this makes sense:

Technically the way scavenge works in HotSpot is like this:

void ObjectSampler::add(HeapWord* obj, size_t allocated, traceid thread_id, bool virtual_thread, const JfrBlobHandle& bh, JavaThread* thread) { ... if (Atomic::load(&_dead_samples)) { // There's a small race where a GC scan might reset this to true, potentially // causing a back-to-back scavenge. Atomic::store(&_dead_samples, false); scavenge(); }

It only gets called when sampled, but it calls it whenever a sample has been GC'd. There's no such notification mechanism in SVM, so instead I went for the option of scavenging when the queue is full.

Now, the issue I think you're implying above is that if the queue is not full, no scavenging happens and maybe it's good idea to scavenge after emitting to see if anything is dead and can be removed. This sounds like a good idea to me. It seems better than an alternative option of scavenging in each object sample add call. Have I got this right?

and maybe it's good idea to scavenge after emitting to see if anything is dead and can be removed.

You may not even need to run a full scavenge() operation. When emitting in OldObjectEventEmitter.emitUnchained(long) you already check if each sample is alive. If it is dead, you can directly remove it with OldObjectSampler.remove(OldObject). Admittedly, this is only really useful if you have multiple simultaneous recordings running since we only emit the OldObjectSample event once per recording anyway (so usually there will be no point to doing the extra cleanup work during event emission).

Have I got this right?

hmm my original comment was actually concerning the test itself. I think that your current implementation of scavenging/emitting works perfectly fine. I think testScavengeMiddle, testScavengeYoungest, testScavengeOldest and testScavengeAll don't actually exercise the code in OldObjectSampler.scavenge() and instead rely on OldObjectEventEmitter.emitUnchained(long) to remove the dead samples.

You may not even need to run a full scavenge() operation. When emitting in OldObjectEventEmitter.emitUnchained(long) you already check if each sample is alive. If it is dead, you can directly remove it with OldObjectSampler.remove(OldObject). Admittedly, this is only really useful if you have multiple simultaneous recordings running since we only emit the OldObjectSample event once per recording anyway (so usually there will be no point to doing the extra cleanup work during event emission).

Ok, I think I'll leave the logic as is for now.

hmm my original comment was actually concerning the test itself. I think that your current implementation of scavenging/emitting works perfectly fine. I think testScavengeMiddle, testScavengeYoungest, testScavengeOldest and testScavengeAll don't actually exercise the code in OldObjectSampler.scavenge() and instead rely on OldObjectEventEmitter.emitUnchained(long) to remove the dead samples.

That's right. I'll rework this to make sure the scavenging gets tested.

roberttoyonaga · 2023-06-28T18:02:47Z

substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jfr/oldobject/OldObjectSampler.java

+            // the weight of a sample that is removed should be redistributed.
+            // Here it gets redistributed to the sample that came just after in time.
+            queue.remove(prev);
+            prev.span += sample.span;


This is the same as how it's done in Hotspot. However, do you know why the span is added to the younger neighbor rather than the older neighbor? To me, it makes more sense to add additional weight to the older neighbor since the goal is to keep track of long-lived objects. I'm not saying the behavior should be changed, I'm just bringing it up for potential discussion because I'm curious.

Another interesting observation which also needs some further explanation :)

Because the span is added it to the younger neighbour, that is why I implemented it as a single-linked list. In HotSpot, this is implemented as double-linked list so it could efficiently distribute the span to the older sample. I went for the simpler, more space efficient, option, taking into account that the default sample size is relatively small, so when removing things I can just loop around and fix things up.

Now, about how to distribute the span. Here's what Erik said recently when I asked him about the linked list of samples:

The weight of the object, the allocation span, in the time-based list, should be divided to its neighbour(s).

Then @tstuefe pointed out:

span is redistributed to its (in time) predecessor sample. Following your explanation, should the span not be evenly divided between predecessor and successor?

The first part is not entirely correct since prev is really the sample that was sampled after the current object, so indeed the younger one. But the point is that Erik suggest it should be divided. So, 50% to younger and 50% to older. Unfortunately Erik didn't reply to this, so I think it would make sense to follow up on hotspot-jfr-dev and see if HotSpot should change accordingly. Feel free to kick off a discussion there since I'll be away for a bit.

Thanks for the detailed response! Ok that all makes sense to me. I agree that a singly-linked is the best option to achieve the same behavior as in hotspot (only redistribute to the younger neighbor).

That's a good idea. I'll follow up on your "hotspot-jfr-dev" email thread asking "why doesn't the Hotspot implementation actually distribute the span evenly to both neighbors? And if it must distribute unevenly, why not choose the older neighbor?".

What was the final outcome of this discussion?

What if all sampled objects are dead at some point? Then, we don't redistribute the span and queue.total() will in fall to 0. This will result in a large difference between totalAllocated and queue.total(), so new samples can easily evict old samples for a long period of time.

I asked on the hotspot mailing list here: https://mail.openjdk.org/pipermail/hotspot-jfr-dev/2023-June/005164.html

Unfortunately, there was no response.

roberttoyonaga · 2023-06-28T19:46:13Z

.../com.oracle.svm.core/src/com/oracle/svm/core/heap/Target_java_lang_ref_PhantomReference.java

    }
+
+    @AnnotateOriginal
+    @Uninterruptible(reason = "Called from uninterruptible code.", mayBeInlined = true)


I understand why you need it for Reference, but why do you need to add the uninterruptible annotation to this class? I have the same question about SoftReference.

Good question. I don't know the exact details on the limitations around @Uninterruptible, but unless you do this native image fails, e.g.:

com.oracle.svm.core.util.VMError$HostedError: Found 1 violations of @Uninterruptible usage: - method java.lang.ref.Reference.get():Object is annotated but java.lang.ref.PhantomReference.get():Object is not

I suspect that if classes override a method that is @Uninterruptible, e.g. PhantomReference, then they also need to mark it @Uninterruptible.

galderz · 2023-07-13T18:48:37Z

I've added a few more commits to address the latest feedback:

Renamed the old scavenge unit test names to better represent what they test, e.g. testEmitSkippingYoungest.
Modified testSampleAfterEmit to use a custom mutable boolean object to control when to phase objects from being alive to being dead. It makes the test easier to understand rather than relying on specific time tic values.
Added a scavenge specific unit test that verifies that scavenging works as expected. It uses the mutable boolean object to control objects being dead when it matters for scavenging to kick in.
Re-added thread id retrieval behind effects in order to easily run the test from the IDE.

roberttoyonaga

Thanks for the new changes. It looks good to me.

galderz · 2023-07-14T08:43:54Z

The espresso and sulong failures seem unrelated.

galderz · 2023-07-14T13:50:32Z

@christianhaeubl Any thoughts on this PR?

galderz · 2023-08-07T09:11:23Z

FYI I've just updated the PR to fix the conflicts.

galderz · 2023-08-28T07:55:05Z

Updated the code to fix conflicts. As a result of changes in master, I moved the code to using ReferenceInternals.getReferent to get the value out of the weak reference. The method I had used before to add uninterruptible to Reference.get methods didn't work any more.

By the way, let me know when you're happy with the code and I can squash all these commits. This would greatly simplify potential backports down the line.

christianhaeubl

Thanks for the PR, I reviewed most of it (except for test cases) and added questions and comments.

christianhaeubl · 2023-09-15T08:10:32Z

...nal.vm.compiler/src/org/graalvm/compiler/hotspot/replacements/HotSpotAllocationSnippets.java


    @Override
-    protected final Object callNewInstanceStub(Word hub) {
+    protected final Object callNewInstanceStub(Word hub, UnsignedWord sizeSp) {


Why sizeSp instead of size?

christianhaeubl · 2023-09-15T08:11:52Z

substratevm/src/com.oracle.svm.core.genscavenge/src/com/oracle/svm/core/genscavenge/GCImpl.java

    private void collectOperation(CollectionVMOperationData data) {
        assert VMOperation.isGCInProgress();
        assert getCollectionEpoch().equal(data.getRequestingEpoch());
+        final HeapAccounting heapAccounting = HeapImpl.getAccounting();


We tend to avoid the final keyword for local variables because it primarily adds verbosity without much benefit.

christianhaeubl · 2023-09-15T08:15:15Z

substratevm/src/com.oracle.svm.core.genscavenge/src/com/oracle/svm/core/genscavenge/GCImpl.java

+        Heap.getHeap().updateUsedAtGC();
        JfrGCHeapSummaryEvent.emit(JfrGCWhen.AFTER_GC);

+        UnsignedWord sizeAfter = heapAccounting.getEdenUsedBytes().add(heapAccounting.getSurvivorUsedBytes()).add(heapAccounting.getOldUsedBytes());


can be simplified to heapAccounting.getUsedBytes()

Ah yes. Initially I didn't spot that survivor bytes are counted youngUsedBytes, but I see now that after in after collection eden and survivor are added together and they update youngUsedBytes.

christianhaeubl · 2023-09-15T08:26:54Z

substratevm/src/com.oracle.svm.core.genscavenge/src/com/oracle/svm/core/genscavenge/GCImpl.java

        GenScavengeMemoryPoolMXBeans.notifyAfterCollection();

        printGCAfter(cause);
+        Heap.getHeap().updateUsedAtGC();


Doesn't this store the same value as heapAccounting.getUsedBytes()? If so, then we should probably add a new field to HeapAccounting (e.g., usedAfterLastGC) that is updated in HeapAccounting.notifyAfterCollection().

The problem with this approach is in HeapAccounting, which is not accessible outside its module. I need to get the heap used after last GC from the svm.core module. This is why I added the methods directly to Heap. Any ideas around this? I doubt you want to open up HeapAccounting to access externally? getUsedAtLastGC, or a similarly named method, could remain in Heap and then define updateUsedAtGC in HeapImpl.

christianhaeubl · 2023-09-15T09:35:21Z

...racle.svm.core.genscavenge/src/com/oracle/svm/core/genscavenge/JfrOldObjectSampleEvents.java

+
+import java.lang.ref.WeakReference;
+
+public class JfrOldObjectSampleEvents {


The class name confused me a bit. I don't think that this class should have Events in its name because it is only used for the sampling (and not for emitting the event).

allocatedSize should be of type UnsignedWord (if a conversion from UnsignedWord to long is necessary, we try to do it as late as possible).

The class name confused me a bit. I don't think that this class should have Events in its name because it is only used for the sampling (and not for emitting the event).

Makes sense. I can rename it to JfrOldObjectSampler.

allocatedSize should be of type UnsignedWord (if a conversion from UnsignedWord to long is necessary, we try to do it as late as possible).

Hmmmm, this clashes a bit with my attempts to test the profiler logic in a way that it can run in JVM mode, see #6743 (comment). Is it possible to run code that uses UnsignedWord in JVM mode? By doing an early switch to long I am then able to keep the OldObjectProfiler logic runnable in both JVM and native modes.

christianhaeubl · 2023-09-15T15:49:32Z

substratevm/src/com.oracle.svm.test/src/com/oracle/svm/test/jfr/AbstractJfrTest.java

+        /*
+         * Register proxies for event data assertion
+         */
+        RuntimeProxyCreation.register(Unsigned.class);


Please add some more information to the comment why this is needed.

christianhaeubl · 2023-09-15T16:09:01Z

....oracle.svm.hosted/src/com/oracle/svm/hosted/substitute/AnnotationSubstitutionProcessor.java

            register(methodSubstitutions, annotated, original, substitution);
        } else if (keepOriginalAnnotation != null) {
-            register(methodSubstitutions, annotated, original, original);
+            AnnotateOriginal annotateOriginalAnnotation = lookupAnnotation(annotatedMethod, AnnotateOriginal.class);


Do you still need this change?

Indeed not needed.

christianhaeubl · 2023-09-15T16:16:57Z

substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jfr/oldobject/OldObjectSampler.java

+        if (queue.isFull()) {
+            if (queue.peek().span > span) {
+                // Sample will not fit, try to scavenge
+                // HotSpot code scavenges when it gets notified that a sample was GC'd,


Theoretically, you could use a reference queue to get notified but I think that is too tricky to do from uninterruptible code. How does the notification work in HotSpot?

That notification is done via the oop storage that we discussed. We can implement it once #7503 is in place.

christianhaeubl · 2023-09-15T16:22:08Z

...tevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jfr/oldobject/JfrOldObjectProfiler.java

+
+    // Moving locking to OldObjectProfiler class cannot easily be done.
+    // For starters Unsafe needs to be accessible there,
+    // but there might be other issues.


For multi-line comments, we are moving more and more towards /* */-style comments to avoid formatting issues.

christianhaeubl · 2023-09-15T16:25:43Z

substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jfr/oldobject/OldObjectSampler.java

+final class OldObjectSampler {
+    private final OldObjectArray samples;
+    private final OldObjectPriorityQueue queue;
+    private final OldObjectList list;


Please add some class-level JavaDoc why the 3 different views on the data are necessary (array, list, priority queue).

galderz · 2023-10-09T13:54:24Z

I pushed commits too early, I still have some comments to handle. I'll post an update when they're ready for another review.

galderz · 2023-10-16T06:54:21Z

@christianhaeubl I've addressed all comments. Can you have another look? As said above, let me know when you're happy before integrating it so that I can squash all commits. Doing this makes it easier for me to backport it to another branch if necessary.

galderz · 2023-10-30T16:52:19Z

I've fixed the conflicts due to graal compiler package refactoring and I used the opportunity to squash the commits.

galderz · 2023-11-01T08:49:37Z

Fixed another conflict and pushed.

* Sample objects and arrays that are possible leaks. * Sampling done in allocation slow path. * It uses a custom priority queue ordered by span. * The sampler uses a fixed-sized array at runtime to keep the samples in memory. By default this is 256 elements but it's configurable via `-XX:FlightRecorderOptions`. * Samples are scavenged when the queue is full. * Checking if objects are garbage collected is done using WeakReferences. * Added JFR repository for serializing and emitting old object information. * Added plain object and array tests. * Added tests that the sample queue to be full. * Added old object description tests.

christianhaeubl

@galderz : I am currently cleaning up this PR so that we can integrated into master. As part of that, I came across a few code parts that were unclear, so I added a few questions.

christianhaeubl · 2024-01-23T15:17:52Z

...evm/src/com.oracle.svm.core/src/com/oracle/svm/core/jfr/oldobject/OldObjectEventEmitter.java

+        OldObject current = list.head();
+        while (current != null) {
+            if (current.reference != null) {
+                final Object obj = effects.getWeakReferent(current.reference);


Two questions:

After emitting the data, do we need to clear the data structures?

Currently, we emit the data in the order from the oldest to the youngest allocation. Is it necessary to emit the data in this order?

Two questions:

After emitting the data, do we need to clear the data structures?

Which data structure are you talking about exactly? The sample queue/list has to remain in place, the elements in it only get cleared when they're GC'd. So, they keep being emitted until then.

Currently, we emit the data in the order from the oldest to the youngest allocation. Is it necessary to emit the data in this order?

I've been wondering about this and I explicitly asked about this in the jfr-dev list (see discussion).

Specifically I asked:

I'm wondering now if the fact that the list view is ordered by insertion time is relevant here. Is it maybe trying to give more importance (by increasing its span) to surviving objects that are near in time to those that have been garbage collected.

Erik replied:

The intent of the code is to remove the sample from both the time-based list and the priority queue as the object is no longer alive.

So the time aspect of the list order is acknowledged.

Now, when it comes to emitting the events, the HotSpot code uses the list order to write the events. See here. object_sampler->first() calls into _list->first(). I've tried to replicate the same logic in svm.

christianhaeubl · 2024-01-23T15:26:05Z

substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jfr/oldobject/OldObjectSampler.java

+    @Uninterruptible(reason = "Access protected by lock.")
+    public boolean sample(WeakReference<Object> ref, long allocatedSize, int arrayLength) {
+        totalAllocated += allocatedSize;
+        long span = totalAllocated - queue.total();


I don't fully understand this part: here, totalAllocated is increased no matter if the sample actually fits into the queue. So, totalAllocated will increase faster than OldObjectPriorityQueue.total. This means that span will increase over time, which makes it easier and easier for new samples to evict old samples.

This is also what Thomas Stuefe was concerned about. See mailing list message: https://mail.openjdk.org/pipermail/hotspot-jfr-dev/2023-June/005069.html

As far as I can tell, this question wasn't given a clear answer.

christianhaeubl · 2024-01-23T15:31:16Z

substratevm/src/com.oracle.svm.core/src/com/oracle/svm/core/jfr/oldobject/OldObjectSampler.java

+            // the weight of a sample that is removed should be redistributed.
+            // Here it gets redistributed to the sample that came just after in time.
+            queue.remove(prev);
+            prev.span += sample.span;


What was the final outcome of this discussion?

What if all sampled objects are dead at some point? Then, we don't redistribute the span and queue.total() will in fall to 0. This will result in a large difference between totalAllocated and queue.total(), so new samples can easily evict old samples for a long period of time.

galderz · 2024-01-24T09:47:41Z

What was the final outcome of this discussion?
What if all sampled objects are dead at some point? Then, we don't redistribute the span and queue.total() will in fall to 0. This will result in a large difference between totalAllocated and queue.total(), so new samples can easily evict old samples for a long period of time.

There has not been further replies to this discussion. I've replied in the thread to see if Erik can provide an answer.

In any case, if it's really a bug, then we wait for the hotspot fix and then backport it here. That way the two impls are relatively close in their logic.

galderz · 2024-01-24T09:48:22Z

@christianhaeubl I see there's a conflict in HeapImpl. Shall I address it or are you taking care of it?

christianhaeubl · 2024-01-24T09:51:22Z

@galderz: I will handle the conflict.

christianhaeubl · 2024-01-25T08:33:40Z

...vm/src/com.oracle.svm.test/src/com/oracle/svm/test/jfr/oldobject/TestRecordingArrayLeak.java

+        }
+        // Trigger a GC so that last sweep gets updated,
+        // and the objects above are considered older than last GC sweep time.
+        System.gc();


One, probably last question: what is the intention of this line? I don't think that your code accesses the last GC sweep time anywhere.

Good catch. I think this is leftover from previous versions of the code when I was tracking the last GC sweep time, but that is only relevant when path to GC root is on, and you set a cut off time for the samples. I'll remove it and make sure everything works as expected.

Indeed not necessary. I've pushed a commit to remove System.gc() calls related to last sweep time.

christianhaeubl · 2024-01-25T12:32:09Z

@galderz , @roberttoyonaga : for integration into master, I opened #8251 which adds a larger number of cleanups and fixes smaller bugs. Feel free to review this PR.

galderz · 2024-02-21T10:08:20Z

@christianhaeubl Now that #8251 is integrated, shall we close this?

christianhaeubl · 2024-02-21T10:10:43Z

Yes, good point.

oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jun 5, 2023

galderz changed the title ~~Add JFR old object event~~ Add JFR old object sample event Jun 5, 2023

zakkak requested review from adinn and roberttoyonaga June 6, 2023 07:35

zakkak added the redhat-interest label Jun 6, 2023

roberttoyonaga reviewed Jun 9, 2023

View reviewed changes

roberttoyonaga reviewed Jun 28, 2023

View reviewed changes

roberttoyonaga added feature native-image labels Jun 30, 2023

roberttoyonaga approved these changes Jul 13, 2023

View reviewed changes

galderz force-pushed the topic.0605.old-object-sample-event.mk4 branch from 11344c0 to 57a0051 Compare August 7, 2023 09:10

galderz force-pushed the topic.0605.old-object-sample-event.mk4 branch from 57a0051 to 48df685 Compare August 28, 2023 07:52

galderz force-pushed the topic.0605.old-object-sample-event.mk4 branch from 302e66f to c903091 Compare September 11, 2023 04:49

christianhaeubl reviewed Sep 15, 2023

View reviewed changes

galderz force-pushed the topic.0605.old-object-sample-event.mk4 branch from c903091 to 18a34e7 Compare September 25, 2023 04:08

fniephaus assigned galderz Oct 25, 2023

galderz force-pushed the topic.0605.old-object-sample-event.mk4 branch 2 times, most recently from 7b4763a to fb6b4ce Compare October 30, 2023 11:56

galderz force-pushed the topic.0605.old-object-sample-event.mk4 branch from 8c20fe2 to 95c4e3e Compare November 1, 2023 08:49

galderz force-pushed the topic.0605.old-object-sample-event.mk4 branch from 95c4e3e to d8a5916 Compare November 17, 2023 08:53

christianhaeubl reviewed Jan 23, 2024

View reviewed changes

christianhaeubl mentioned this pull request Jan 25, 2024

[GR-51629] Add support for JFR -XX:FlightRecorderOptions #7585

Merged

christianhaeubl reviewed Jan 25, 2024

View reviewed changes

christianhaeubl mentioned this pull request Jan 25, 2024

[GR-48683] Add JFR event OldObjectSample. #8251

Merged

Remove unnecessary System.gc() calls to update last sweep time

e4b362e

christianhaeubl closed this Feb 21, 2024

fniephaus added this to the GraalVM for JDK 23 (September 17, 2024) milestone May 2, 2024

fniephaus changed the title ~~Add JFR old object sample event~~ [GR-48683] Add JFR old object sample event May 2, 2024


		import java.lang.ref.WeakReference;

		public class JfrOldObjectSampleEvents {

[GR-48683] Add JFR old object sample event #6743

[GR-48683] Add JFR old object sample event #6743

Uh oh!

Conversation

galderz commented Jun 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

galderz commented Jun 5, 2023

Uh oh!

galderz commented Jun 5, 2023

Uh oh!

galderz commented Jun 6, 2023

Uh oh!

galderz commented Jun 6, 2023

Uh oh!

zakkak commented Jun 6, 2023

Uh oh!

roberttoyonaga left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

roberttoyonaga Jun 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

galderz commented Jun 26, 2023

Uh oh!

galderz commented Jun 27, 2023

Uh oh!

galderz commented Jun 27, 2023

Uh oh!

galderz commented Jun 27, 2023

Uh oh!

galderz commented Jun 28, 2023

Uh oh!

roberttoyonaga left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

galderz commented Jun 5, 2023 •

edited

Loading

roberttoyonaga Jun 12, 2023 •

edited

Loading

roberttoyonaga Jun 29, 2023 •

edited

Loading

roberttoyonaga Jun 28, 2023 •

edited

Loading