
@brettwooldridge (Contributor) commented Sep 26, 2025

Cleaner Revenge Tour 😄

OK, this refactor leaves the doubly linked list intact but tightens up the conditional logic around node linking/unlinking and reduces method invocation overhead in some code paths. All in all, an extremely minor refactor, but it seems to do well in the JMH harness.
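For context, the node linking/unlinking in question is the usual doubly-linked-list bookkeeping under the cleaner's lock. A rough illustrative sketch of that structure (class and field names here are invented, not JNA's actual code):

```java
// Illustrative doubly linked list of cleanup entries, of the kind the
// refactor tunes; names are stand-ins, not JNA's actual Cleaner code.
class CleanerList {
    static final class Node {
        final Runnable task;
        Node prev, next;

        Node(Runnable task) {
            this.task = task;
        }
    }

    private Node head;
    int size;

    // Push a node onto the front of the list.
    synchronized void link(Node node) {
        node.next = head;
        if (head != null) {
            head.prev = node;
        }
        head = node;
        size++;
    }

    // Remove a node from anywhere in the list in O(1).
    synchronized void unlink(Node node) {
        if (node.prev != null) {
            node.prev.next = node.next;
        } else {
            head = node.next;
        }
        if (node.next != null) {
            node.next.prev = node.prev;
        }
        node.prev = node.next = null;
        size--;
    }
}
```

The conditional branches in link/unlink (head vs. interior node, presence of a neighbor) are exactly the kind of logic the refactor aims to tighten.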

Tests run on an M1 Ultra Mac Studio.

High-memory high-threading (2x core count):

java -Xmx20g -jar target/benchmarks.jar -t 16 -i 5 -wi 6

5.18.0
Benchmark                Mode  Cnt       Score        Error  Units
MyBenchmark.testMethod  thrpt   25  550361.312 ± 452304.324  ops/s

5.19.0-SNAPSHOT
Benchmark                Mode  Cnt       Score        Error  Units
MyBenchmark.testMethod  thrpt   25  806701.763 ± 548298.436  ops/s

Lower-memory, threads matching core count:

java -Xmx16g -jar target/benchmarks.jar -t 8 -i 5 -wi 6

5.18.0
Benchmark                Mode  Cnt       Score        Error  Units
MyBenchmark.testMethod  thrpt   25  216233.155 ± 209346.033  ops/s

5.19.0-SNAPSHOT
Benchmark                Mode  Cnt       Score        Error  Units
MyBenchmark.testMethod  thrpt   25  313306.938 ± 324647.767  ops/s

Tests run on an EPYC 7402 Proxmox VM.

High-memory high-threading (2x core count):

java -Xmx20g -jar target/benchmarks.jar -t 16 -i 5 -wi 6

5.18.0
Benchmark                Mode  Cnt        Score       Error  Units
MyBenchmark.testMethod  thrpt   25  1003122.171 ± 88700.984  ops/s

5.19.0-SNAPSHOT
Benchmark                Mode  Cnt        Score        Error  Units
MyBenchmark.testMethod  thrpt   25  1057712.282 ± 105237.902  ops/s

Lower-memory, threads matching core count:

java -Xmx16g -jar target/benchmarks.jar -t 8 -i 5 -wi 6

5.18.0
Benchmark                Mode  Cnt       Score       Error  Units
MyBenchmark.testMethod  thrpt   25   948151.717 ± 50740.821  ops/s

5.19.0-SNAPSHOT
Benchmark                Mode  Cnt        Score       Error  Units
MyBenchmark.testMethod  thrpt   25  1020379.664 ± 82178.334  ops/s

@matthiasblaesing (Member) left a comment

I can reproduce the improved numbers with the provided JMH invocations (thanks for that). I left one inline comment. Could you please check whether you agree, and see if the numbers still hold up once a fix is applied?

cleanerRunning = true;
}

return ref;
@matthiasblaesing (Member) commented inline:

At this point we need the equivalent of Reference.reachabilityFence on obj. If the caller does not retain a strong reference, we need to ensure that the reference is kept alive at least until the cleaner reference is completely enqueued. As observed in the last iteration, early GC can happen: #1684 (comment).

In that comment I suggested using an empty synchronized block.
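The idiom under discussion can be sketched roughly as follows. The class and field names are illustrative stand-ins, not JNA's actual Cleaner internals; Reference.reachabilityFence requires Java 9+, while the empty synchronized block serves the same purpose on Java 8:

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch, not JNA's actual Cleaner: shows where a
// reachability fence keeps `obj` strongly reachable until the
// PhantomReference is fully linked into the cleaner's bookkeeping.
class FenceSketch {
    public interface Cleanable {
        void clean();
    }

    private final ReferenceQueue<Object> referenceQueue = new ReferenceQueue<>();

    static final class CleanerRef extends PhantomReference<Object> implements Cleanable {
        private final Runnable task;
        private final AtomicBoolean cleaned = new AtomicBoolean();

        CleanerRef(Object referent, ReferenceQueue<Object> queue, Runnable task) {
            super(referent, queue);
            this.task = task;
        }

        @Override
        public void clean() {
            // Run the cleanup task at most once, whether triggered
            // manually or by the reference-processing thread.
            if (cleaned.compareAndSet(false, true)) {
                task.run();
            }
        }
    }

    public Cleanable register(Object obj, Runnable cleanupTask) {
        final CleanerRef ref = new CleanerRef(obj, referenceQueue, cleanupTask);
        synchronized (this) {
            // link ref into the cleaner's list, start cleaner thread, etc.
        }
        // Prevent the JIT from treating obj as dead before this point,
        // which could let GC enqueue ref while we are still linking it.
        Reference.reachabilityFence(obj);
        return ref;
    }
}
```

Manually invoking clean() is deterministic, which makes the at-most-once contract easy to unit-test; the GC-driven path through the ReferenceQueue is omitted here.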

@brettwooldridge (Contributor, Author) commented Oct 24, 2025

Ah, I don't know how I lost the change I had made, but I did somehow. The way I intended to solve it (and had at some point, but must have reverted) was to insert this before the return ref:

// Ensure that obj is referencable past the enqueue point.
if (obj == null) {
   throw new IllegalArgumentException("Cleaner object cannot be null");
}

I believe that should address it. Can you test it?

If that does not fix it, remove synchronized from the method signature, and do this:

public Cleanable register(final Object obj, final Runnable cleanupTask) {
    // The important side effect is the PhantomReference, which is yielded after the referent is GCed
    final CleanerRef ref = new CleanerRef(obj, referenceQueue, cleanupTask);

    synchronized (this) {
        // everything else
        ...
    }

    // Ensure that obj is referencable past the enqueue point.
    if (obj == null) {
        throw new IllegalArgumentException("Cleaner object cannot be null");
    }
    return ref;
}

EDIT: I updated the pull request with the latter fix.

@brettwooldridge (Contributor, Author) commented Oct 24, 2025

I believe this change should work. The compiler cannot reorder the conditional before the synchronized block, and it cannot eliminate the reference because it cannot know whether it is null or not.

By the way, really nice work on this class. Outside of this change, I simply don't see any way it could possibly be more efficient. I love seeing code like this.

@matthiasblaesing (Member) commented

Thanks for the update. I ran a few measurements based on your suggested invocation. These are the results:

measure.ods

For "Tabelle 1" I ran 5.18.1 as the baseline and your branch as the comparison (titled "modified"). Looking at the diff, I see mostly positive effects for the first run variant, but even there one run regressed. The second run variant always comes out slower. This also holds when averaging the values.

"Tabelle 2" is mostly identical to the first experiment, but I introduced the "check non null" in register onto master (that is master-synced) and ran that for comparison.

It is hard to draw a conclusion from this. My runtime environment is a poor testbed, though (a notebook).

Could you please have a look and maybe rerun your numbers?

@brettwooldridge (Contributor, Author) commented Nov 4, 2025

@matthiasblaesing Sorry for taking so long to get back to this. And apologies for this long post.

I ran lots of tests. Lots. And the more I ran, the more confusing things became. At a high level, with larger runs, the new code consistently benchmarks roughly 10% lower than the existing code. And therein lies a mystery.

Studying the code, from a logical perspective, it cannot possibly be slower:

  • It acquires fewer locks.
  • It contains fewer conditionals.
  • It contains fewer method dispatches.

Even at a bytecode level the new code "wins". And yet.

And yet, according to JMH, it turns in lower ops/s. This is even benchmarking against a master branch that includes the code to ensure the reference is maintained past the linking of the phantom reference:

   // Ensure that obj is referencable past the enqueue point.
   if (obj == null) {
      throw new IllegalArgumentException("Cleaner object cannot be null");
   }

First Revelation

On a whim, I cranked up the JMH iterations per fork and I noticed something interesting:

# Run progress: 40.00% complete, ETA 00:17:22
# Fork: 3 of 5
# Warmup Iteration   1: 1701444.457 ops/s
# Warmup Iteration   2: 2240284.714 ops/s
# Warmup Iteration   3: 2240826.902 ops/s
# Warmup Iteration   4: 2214448.058 ops/s
# Warmup Iteration   5: 2206987.306 ops/s
# Warmup Iteration   6: 2269375.607 ops/s
Iteration   1: 2242599.580 ops/s
Iteration   2: 397205.504 ops/s
Iteration   3: 1770953.479 ops/s
Iteration   4: 385594.600 ops/s
Iteration   5: 264045.054 ops/s
Iteration   6: 42915.212 ops/s
Iteration   7: 24755.365 ops/s
Iteration   8: 6410.848 ops/s
Iteration   9: 26682.966 ops/s
Iteration  10: 2249.819 ops/s
Iteration  11: 16611.778 ops/s

We're looking at a drop from over 2 million ops/s early in the run to as low as 6000 ops/s. This occurs in both the master branch and the proposed change.

First, it should be noted that if the test is started with a smaller heap, for example -Xmx8g instead of -Xmx20g, it very quickly falls over with an OutOfMemoryError. Even with a larger heap, if the iterations are cranked up, the result is the same. This was a clue.

Why is this occurring? Well, we have N number of threads creating and registering objects, and only one thread cleaning them up. In addition, registering objects is "cheap" while cleaning them up is "expensive" in relative terms.

Second Revelation

JMH is measuring one side of the system -- registering objects. But the other side, cleaning references, is unseen and unmeasured.

In the benchmark, if we look at the lock itself, what we have are N threads contending for the lock plus the cleaner thread also contending for that lock. Assuming roughly fair queuing by the scheduler, the cleaner thread is going to lose most of the time -- it's 1 thread vs. N threads in terms of who is going to win the lock acquisition.
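That 1-vs-N dynamic can be modeled in isolation with a toy monitor, outside of any cleaner machinery. The thread and iteration counts below are arbitrary, and the names are invented for illustration:

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the contention in the benchmark: N "registration"
// threads and one "cleaner" thread all competing for the same monitor.
// Counts how often each side wins the lock.
class ContentionSketch {
    private final Object lock = new Object();
    final AtomicLong registrations = new AtomicLong();
    final AtomicLong cleanerWins = new AtomicLong();
    private volatile boolean done;

    void run(int producers, int opsPerProducer) {
        Thread cleaner = new Thread(() -> {
            while (!done) {
                synchronized (lock) {
                    cleanerWins.incrementAndGet(); // one "cleanup" pass
                }
            }
        });
        cleaner.start();

        Thread[] threads = new Thread[producers];
        for (int i = 0; i < producers; i++) {
            threads[i] = new Thread(() -> {
                for (int op = 0; op < opsPerProducer; op++) {
                    synchronized (lock) {
                        registrations.incrementAndGet(); // one "register" call
                    }
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        done = true;
        try {
            cleaner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The observed split between registrations and cleanerWins varies wildly with the scheduler and the JVM's lock implementation, which is exactly why a registration-only ops/s number is hard to interpret.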

Hypothesis

The reason that this pull request benchmarks lower (~10%) is that the cleaner thread is "winning" more, at the expense of the N threads that are trying to register objects. But again, the benchmark is only measuring the registering side.

If the predicates above...

  • It acquires fewer locks.
  • It contains fewer conditionals.
  • It contains fewer method dispatches.

... are accepted, this is the obvious conclusion. There is no other reason the master registration code would be faster.

It should be noted again that both the master branch and this pull request degenerate into four-digit ops/s if the memory is more constrained or the test iterations are increased.

Where are we?

I don't really see a simple method of measuring the throughput of the entire system -- the registration side and the cleaning side. Especially because this would require triggering GC deterministically in order to force the cleaner to run, while at the same time artificially constraining the N registration threads in such a way that the cleaner is allowed to "keep up", in order to balance object creation and retirement (and avoid OOM).
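For what it's worth, one way to sidestep deterministic GC entirely is to put explicit backpressure on registration, e.g. with a Semaphore, so the cleaner can never be overrun. This is only a sketch of the idea (all names are invented), not a proposal for the JNA benchmark:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a balanced whole-system harness: registration blocks once
// more than `capacity` objects are awaiting cleanup, so the cleaner is
// never overrun and both sides can be timed together.
class BalancedHarness {
    private final Semaphore inFlight;
    private final BlockingQueue<Runnable> pendingCleanups;
    final AtomicLong cleaned = new AtomicLong();

    BalancedHarness(int capacity) {
        this.inFlight = new Semaphore(capacity);
        this.pendingCleanups = new ArrayBlockingQueue<>(capacity);
    }

    // Stand-in for Cleaner.register: blocks if the cleaner is behind.
    void register(Runnable cleanupTask) throws InterruptedException {
        inFlight.acquire();
        pendingCleanups.put(cleanupTask);
    }

    // Stand-in for one pass of the cleaner thread's loop.
    void cleanOne() throws InterruptedException {
        Runnable task = pendingCleanups.take();
        task.run();
        cleaned.incrementAndGet();
        inFlight.release();
    }
}
```

Timing a run that drives both register and cleanOne would then measure whole-system throughput (creation plus retirement) rather than just the registration side, at the cost of no longer exercising the real GC-driven enqueue path.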

I would argue that given the closeness in performance of measuring even just one side of the system, the greater clarity and simplicity of the code in the pull request "wins" in a pragmatic sense.

And I would argue, without evidence, that the throughput of the entire system is obviously higher -- because the only explanation that makes sense for the master code to be faster in "registration" is due to increased crowding out of the cleaner thread in lock acquisition.

I'm not sure where to go from here. I really can't invest more time in doing something like constructing some kind of harness that measures the total throughput of the system while ensuring that the cleaner is not overrun by registration spamming.

I don't think JMH is meaningful here if we are only measuring the registration side of the equation.
