[SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars as transient #21039
LGTM
      | }
    """.stripMargin)

    bufVars = null // explicitly null this field out to allow the referent to be GC'd sooner
I am curious what happens when bufVars is accessed in doConsumeWithoutKeys, which is defined below this line.
The workflow of whole-stage codegen ensures that doConsumeWithoutKeys can only be on the call stack when doProduceWithoutKeys is also on the call stack; the liveness of the former is strictly a subset of the latter.
That's because for a plan tree that looks like:
+- A
   +- B
      +- C
The whole-stage codegen system (mostly) works like:
A.produce
|------> B.produce
| |------> C.produce
| | |------> B.consume
| | | |------> A.consume
| | | | |
| | | |<-------o
| | |<--------o
| |<--------o
|<--------o
o
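The nesting in the diagram above can be reproduced with a toy model. This is only a minimal sketch: `ToyOp` and `ProduceConsumeDemo` are hypothetical names invented for illustration and are not Spark's actual codegen API. Each operator logs when its `produce` and `consume` calls are entered and exited, which shows that `B.consume` starts and finishes while `B.produce` is still on the call stack.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for a physical operator (hypothetical, not Spark's CodegenSupport).
class ToyOp(val name: String, child: Option[ToyOp], log: ArrayBuffer[String]) {
  var parent: Option[ToyOp] = None
  // Register this operator as the parent of its child, if it has one.
  child.foreach(c => c.parent = Some(this))

  def produce(): Unit = {
    log += s"$name.produce enter"
    child match {
      case Some(c) => c.produce()                 // delegate to the child
      case None    => parent.foreach(_.consume()) // leaf: push a row to the parent
    }
    log += s"$name.produce exit"
  }

  def consume(): Unit = {
    log += s"$name.consume enter"
    parent.foreach(_.consume())                   // keep pushing the row upward
    log += s"$name.consume exit"
  }
}

object ProduceConsumeDemo {
  // Builds the A -> B -> C plan from the diagram and records the call order.
  def run(): Seq[String] = {
    val log = ArrayBuffer.empty[String]
    val c = new ToyOp("C", None, log)
    val b = new ToyOp("B", Some(c), log)
    val a = new ToyOp("A", Some(b), log)
    a.produce()
    log.toSeq
  }
}
```

Calling `ProduceConsumeDemo.run()` returns the log entries in call order; every `consume` entry and exit falls between the matching operator's `produce` entry and exit, which is the property the explanation relies on.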
Ah, got it. Thank you for the clarification.
Test build #89185 has finished for PR 21039 at commit
LGTM, merging to master!
Reverted the PR due to the test failure: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4694/testReport/junit/org.apache.spark.sql/TPCDSQuerySuite/q61/
Please try to resolve it and submit it again.
Thanks for reverting it for me. The test failure was definitely related to the explicit nulling from this PR, but I can't see how that's possible yet.
First of all, in the build that first introduced my change, build 4693, this particular test was passing. The build that failed was the one immediately after that.
Second, the stack trace seen from the failure indicates that
Stack trace:
The relevant line in spark/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala, line 274 in 75a1830:
It's reading bufVars.
The relevant line in spark/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala, line 237 in 75a1830:
It's calling child.produce(), and that's before the nulling at line 241.
Unless there are multiple threads messing with the state (which is something that this part of whole-stage codegen doesn't expect to begin with), I can't see how this could happen. I'm waiting for the next build (4695, which still has this change) to finish to see if the test still fails.
Shall we just not do the nulling out? It wouldn't help the GC much.
I just checked the same test in Build 4695, which still has this change, and the test passed: re:
The whole-stage codegen logic in most physical operators assumes that codegen happens on a single thread. As such, it might use instance fields on the operator to pass state between the

    val plan = df.queryExecution.executedPlan
    // then on Thread1
    plan.execute // triggers whole-stage codegen
    // at around the same time, on Thread2
    plan.execute // also triggers whole-stage codegen

Now we're going to be performing whole-stage codegen on 2 different threads, at around the same time, on the exact same plan (if there are
If this theory holds, nulling out the state introduces a "leak" of the "cross-talk", so it's now possible to see an NPE if the timing is just right. But it's fairly scary already even without the nulling...
Anyway, I'll resend this PR with only the
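To make the suspected interleaving concrete, here is a minimal sketch of the theory described above. Everything in it is an assumption for illustration: `ToyAgg`, its `doProduce`/`doConsume` methods, and the `beforeConsume` hook are hypothetical and do not mirror HashAggregateExec's real code. The latches simply force one unlucky ordering, in which the second thread nulls the shared field between the first thread's write and read.

```scala
import java.util.concurrent.CountDownLatch
import scala.util.Try

// Hypothetical operator that passes state between produce and consume
// through an instance field, then nulls the field out afterwards.
class ToyAgg {
  @volatile private var bufVars: Seq[String] = _

  def doProduce(beforeConsume: () => Unit): Int = {
    bufVars = Seq("sum", "count")
    beforeConsume()            // demo-only hook used to force the interleaving
    val n = doConsume()
    bufVars = null             // the explicit nulling under discussion
    n
  }

  private def doConsume(): Int = bufVars.length // NPE if bufVars was nulled
}

object CrossTalkDemo {
  // Returns the outcome of each thread's codegen pass on the shared operator.
  def run(): (Try[Int], Try[Int]) = {
    val plan = new ToyAgg
    val t1InProduce = new CountDownLatch(1)
    val t2Done = new CountDownLatch(1)
    val results = new Array[Try[Int]](2)

    val t1 = new Thread(() => {
      results(0) = Try(plan.doProduce(() => {
        t1InProduce.countDown() // thread 1 has written bufVars
        t2Done.await()          // pause until thread 2 has nulled it
      }))
    })
    val t2 = new Thread(() => {
      t1InProduce.await()
      results(1) = Try(plan.doProduce(() => ()))
      t2Done.countDown()
    })

    t1.start(); t2.start()
    t1.join(); t2.join()
    (results(0), results(1))
  }
}
```

In this forced ordering the first pass fails with a NullPointerException while the second succeeds. Without the nulling, the first pass would instead silently read state written by the other thread, which is the "cross-talk" that exists even before this PR.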
What changes were proposed in this pull request?
Mark HashAggregateExec.bufVars as transient to keep it from being serialized.
Also manually null out this field at the end of doProduceWithoutKeys() to shorten its lifecycle, because it'll no longer be used after that.
How was this patch tested?
Existing tests.
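For readers unfamiliar with the annotation, the effect of marking a field transient can be sketched with plain Java serialization. This is an illustrative example under stated assumptions: `ToyOperator` and `TransientDemo` are hypothetical names, and the round trip below uses `java.io` streams directly rather than whatever serializer Spark is configured with.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for an operator with codegen-only state.
class ToyOperator(val name: String) extends Serializable {
  // Skipped by Java serialization; comes back as null after deserialization.
  @transient var bufVars: Seq[String] = _
}

object TransientDemo {
  // Serializes an object to bytes and reads it back.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def run(): ToyOperator = {
    val op = new ToyOperator("agg")
    op.bufVars = Seq("sum", "count") // state only needed during codegen
    roundTrip(op)
  }
}
```

After `TransientDemo.run()`, the copy keeps its `name` but its `bufVars` field is null, so the codegen-only state never travels with the serialized operator.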