fixes #1616 tighten up synchronization and conditional logic #1688
base: master
Conversation
I can reproduce the improved numbers with the provided JMH invocations (thanks for that). I left one inline comment. Could you please check whether you agree, and see if the numbers still hold up once a fix is applied?
```java
        cleanerRunning = true;
    }
    // ...
    return ref;
```
At this point we need the equivalent of Reference.reachabilityFence on obj. If the caller does not retain a strong reference, we need to ensure that the reference is kept alive at least until the reference cleaner is completely enqueued. As observed in the last iteration, early GC can happen: #1684 (comment).
In the comment I suggested using an empty synchronized block.
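For illustration, a minimal standalone sketch of that idiom (class and field names here are assumed, not JNA's actual code):

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;

final class ReachabilityDemo {
    static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();

    static PhantomReference<Object> register(Object obj) {
        // The PhantomReference is enqueued on QUEUE once obj becomes
        // unreachable; until then the cleaner bookkeeping must complete.
        PhantomReference<Object> ref = new PhantomReference<>(obj, QUEUE);
        // ... bookkeeping that must finish before obj may be collected ...
        // Empty synchronized block: the monitor operations on obj cannot be
        // elided or moved earlier, so obj stays strongly reachable up to
        // this point even if the caller dropped its own reference.
        synchronized (obj) {
        }
        return ref;
    }
}
```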
Ah, I don't know how I lost the change I had made, but I did somehow. The way I intended to solve it (and had at some point, but must have reverted) was to insert this before the `return ref`:

```java
// Ensure that obj is referencable past the enqueue point.
if (obj == null) {
    throw new IllegalArgumentException("Cleaner object cannot be null");
}
```

I believe that should address it. Can you test it?
If that does not fix it, remove `synchronized` from the method signature and do this:

```java
public Cleanable register(final Object obj, final Runnable cleanupTask) {
    // The important side effect is the PhantomReference that is yielded
    // after the referent is GCed.
    final CleanerRef ref = new CleanerRef(obj, referenceQueue, cleanupTask);
    synchronized (this) {
        // everything else
        ...
    }
    // Ensure that obj is referencable past the enqueue point.
    if (obj == null) {
        throw new IllegalArgumentException("Cleaner object cannot be null");
    }
    return ref;
}
```

EDIT: I updated the pull request with this later fix.
I believe this change should work. The compiler cannot reorder the conditional before the synchronized block, and it cannot eliminate the reference because it cannot know whether it is null or not.
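As an aside: on Java 9+ the JDK exposes this guarantee directly via Reference.reachabilityFence. A minimal sketch, with a hypothetical method name, in case the project can ever rely on it:

```java
import java.lang.ref.Reference;

final class FenceSketch {
    // Hypothetical example method; the try/finally shape is the documented
    // usage pattern for reachabilityFence.
    static void useWhileRegistered(Object obj) {
        try {
            // ... work that must complete before obj's cleanup may run ...
        } finally {
            // Keeps obj strongly reachable until this point, regardless of
            // compiler optimizations.
            Reference.reachabilityFence(obj);
        }
    }
}
```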
By the way, really nice work on this class. Outside of this change, I simply don't see any way it could possibly be more efficient. I love seeing code like this.
Thanks for the update. I ran a few measurements based on your suggested invocation, and these are the results:

For "Tabelle 1" I ran 5.18.1 as the baseline and your branch as the comparison (titled "modified"). Looking at the diff I see mostly positive effects for the first run variant, but even there one run regressed. The second run variant always comes out slower, and this also holds when averaging the values.

"Tabelle 2" is mostly identical to the first experiment, but I introduced the "check non null" in

It is hard to draw a conclusion from this. My runtime environment is a bad example though (a notebook). Could you please have a look and maybe rerun your numbers?
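(The exact invocation isn't reproduced above; for illustration, a registration benchmark of the general shape under discussion might look like the sketch below. The annotation values and the use of com.sun.jna.internal.Cleaner are assumptions, not the actual parameters from these runs.)

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import com.sun.jna.internal.Cleaner;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
@Fork(value = 2, jvmArgsAppend = {"-Xmx512m"})
@Threads(8)
public class CleanerRegisterBenchmark {
    @Benchmark
    public Object register() {
        // Register a fresh object with a no-op cleanup; returning the
        // Cleanable prevents dead-code elimination of the registration.
        return Cleaner.getCleaner().register(new Object(), () -> { });
    }
}
```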
Force-pushed from 606925b to 5fac28f.
@matthiasblaesing Sorry for taking so long to get back to this, and apologies for the long post. I ran lots of tests. Lots. And the more I ran, the more confusing things became. At a high level, with larger runs, the new code consistently benchmarks roughly 10% lower than the existing code. And therein lies a mystery. Studying the code from a logical perspective, it cannot possibly be slower: even at the bytecode level the new code "wins". And yet, according to JMH, it turns in lower ops/s. This is even when benchmarking against a master branch that includes the code to ensure the reference is maintained past the linking of the phantom reference:

```java
// Ensure that obj is referencable past the enqueue point.
if (obj == null) {
    throw new IllegalArgumentException("Cleaner object cannot be null");
}
```

**First Revelation**

On a whim, I cranked up the JMH iterations per fork and noticed something interesting: we're looking at a drop from over 2 million ops/s early in the run to as low as 6,000 ops/s. This occurs in both the master branch and the proposed change. It should also be noted that the same degradation occurs if the test is started with a smaller heap, for example.

Why is this occurring? Well, we have N threads creating and registering objects, and only one thread cleaning them up. In addition, registering objects is "cheap" while cleaning them up is "expensive" in relative terms.

**Second Revelation**

JMH is measuring one side of the system -- registering objects. The other side, cleaning references, is unseen and unmeasured. In the benchmark, if we look at the lock itself, what we have are N threads contending for the lock plus the cleaner thread also contending for it. Assuming roughly fair queuing by the scheduler, the cleaner thread is going to lose most of the time -- it's 1 thread vs. N threads in terms of who wins the lock acquisition.

**Hypothesis**

The reason this pull request benchmarks lower (~10%) is that the cleaner thread is "winning" more, at the expense of the N threads trying to register objects. But again, the benchmark only measures the registering side. If the predicates above are accepted, this is an obvious conclusion; there is no other reason the master registration code would be faster. It should be noted again that both the master branch and this pull request degenerate into four-digit ops/s if memory is more constrained or the test iterations are increased.

**Where are we?**

I don't really see a simple method of measuring the throughput of the entire system -- the registration side and the cleaning side together. Especially because this would require triggering GC deterministically in order to force the cleaner to run, while at the same time artificially constraining the N registration threads so that the cleaner is allowed to "keep up", in order to balance object creation and retirement (and avoid OOM).

I would argue that, given the closeness in performance when measuring even just one side of the system, the greater clarity and simplicity of the code in this pull request "wins" in a pragmatic sense. And I would argue, without evidence, that the throughput of the entire system is higher -- because the only explanation that makes sense for the master code being faster at "registration" is increased crowding out of the cleaner thread in lock acquisition.

I'm not sure where to go from here. I really can't invest more time in constructing some kind of harness that measures the total throughput of the system while ensuring that the cleaner is not overrun by registration spamming, and I don't think JMH is meaningful here if we only measure the registration side of the equation.
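For what it's worth, the producer/consumer imbalance described above can be seen outside JMH with a toy harness like the following sketch (all names are assumed; this is not the benchmark used here). Run with a constrained heap and the backlog grows while throughput collapses, mirroring the degradation above:

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class RegisterVsCleanDemo {
    static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
    // The references themselves must stay strongly reachable, as in a
    // real cleaner's linked list.
    static final Set<Reference<Object>> LIVE = ConcurrentHashMap.newKeySet();
    static final AtomicLong REGISTERED = new AtomicLong();
    static final AtomicLong CLEANED = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        // N producer threads "registering" vs. one thread draining:
        // run with a small heap (e.g. -Xmx256m) to watch the backlog grow.
        int producers = Runtime.getRuntime().availableProcessors();
        for (int i = 0; i < producers; i++) {
            Thread p = new Thread(() -> {
                while (true) {
                    LIVE.add(new PhantomReference<>(new Object(), QUEUE));
                    REGISTERED.incrementAndGet();
                }
            });
            p.setDaemon(true);
            p.start();
        }
        Thread cleaner = new Thread(() -> {
            while (true) {
                try {
                    LIVE.remove(QUEUE.remove());
                    CLEANED.incrementAndGet();
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        cleaner.setDaemon(true);
        cleaner.start();
        for (int s = 0; s < 10; s++) {
            Thread.sleep(1000);
            System.out.printf("registered=%,d cleaned=%,d backlog=%,d%n",
                    REGISTERED.get(), CLEANED.get(), LIVE.size());
        }
    }
}
```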
Cleaner Revenge Tour 😄
Ok, this refactor leaves the doubly linked list intact but tightens up the conditional logic around node linking/unlinking and reduces the method invocation overhead in some code paths (see the sketch below). All in all an extremely minor refactor, but it seemingly does well in the JMH harness.
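For illustration, the general shape of that bookkeeping looks something like the sketch below (names are illustrative, not the actual JNA code): linking is a head insertion behind the lock, and unlinking touches only the node's own neighbor pointers.

```java
// A minimal sketch of doubly-linked-list node linking/unlinking of the
// kind described above; this is not the actual JNA implementation.
final class CleanerList {
    static final class Node {
        Node previous;
        Node next;
        final Runnable cleanupTask;
        Node(Runnable cleanupTask) { this.cleanupTask = cleanupTask; }
    }

    private Node first;

    // Link a new node at the head: one conditional, at most three writes.
    synchronized void link(Node node) {
        if (first != null) {
            first.previous = node;
            node.next = first;
        }
        first = node;
    }

    // Unlink in O(1) using the node's own neighbor pointers.
    synchronized void unlink(Node node) {
        if (node.previous != null) {
            node.previous.next = node.next;
        } else {
            first = node.next;
        }
        if (node.next != null) {
            node.next.previous = node.previous;
        }
        node.previous = null;
        node.next = null;
    }
}
```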
Tests run on an M1 Ultra Mac Studio.
High-memory high-threading (2x core count):
Lower-memory, threads matching core count:
Tests run on an Epyc 7402 Proxmox VM.
High-memory high-threading (2x core count):
Lower-memory, threads matching core count: