Skip to content

Conversation

@rednaxelafx
Copy link
Contributor

What changes were proposed in this pull request?

Mark HashAggregateExec.bufVars as transient to avoid it from being serialized.
Also manually null out this field at the end of doProduceWithoutKeys() to shorten its lifecycle, because it'll no longer be used after that.

How was this patch tested?

Existing tests.

@hvanhovell
Copy link
Contributor

LGTM

| }
""".stripMargin)

bufVars = null // explicitly null this field out to allow the referent to be GC'd sooner
Copy link
Member

@kiszk kiszk Apr 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious what happens when bufVars is accessed in doConsumeWithoutKeys where exists below this line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow of whole-stage codegen ensures that doConsumeWithoutKeys can only be on the call stack when doProduceWithoutKeys is also on the call stack; the liveness of the former is strictly a subset of the latter.
That's because for a plan tree that looks like:

+- A
  +- B
    +- C

The whole-stage codegen system (mostly) works like:

A.produce
 |------> B.produce
 |         |------> C.produce
 |         |         |------> B.consume
 |         |         |         |------> A.consume
 |         |         |         |        |
 |         |         |         |<-------o
 |         |         |<--------o
 |         |<--------o
 |<--------o
 o

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it. Thank you for the clarification.

@SparkQA
Copy link

SparkQA commented Apr 11, 2018

Test build #89185 has finished for PR 21039 at commit 0b5a075.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in 271c891 Apr 11, 2018
@gatorsmile
Copy link
Member

Reverted the PR due to the test failure: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4694/testReport/junit/org.apache.spark.sql/TPCDSQuerySuite/q61/

Please try to resolve it and submit it again

@rednaxelafx
Copy link
Contributor Author

rednaxelafx commented Apr 12, 2018

Thanks for reverting it for me. The test failure was definitely related to the explicit nulling from this PR, but I can't see how that's possible yet.

First of all, in the build that first introduced my change, build 4693, this particular test was passing:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4693/testReport/junit/org.apache.spark.sql/TPCDSQuerySuite/q61/

The build that failed was the one immediately after that.

Second, the stack trace seen from the failure indicates that doProduceWithoutKeys is indeed on the stack, and it was before the line that I null'd out bufVars, so I can't see how bufVars can be null in doConsumeWithoutKeys.

Stack trace:

java.lang.NullPointerException
      at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithoutKeys(HashAggregateExec.scala:274)
      at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsume(HashAggregateExec.scala:171)
      at org.apache.spark.sql.execution.CodegenSupport$class.constructDoConsumeFunction(WholeStageCodegenExec.scala:209)
      at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:180)
      at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:35)
      at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:65)
      at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:182)
...
      at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
      at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:35)
      at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithoutKeys(HashAggregateExec.scala:237)
      at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:163)

The relevant line in doConsumeWithoutKeys is:


It's reading bufVars.

The relevant line in doProduceWithoutKeys is:

| ${child.asInstanceOf[CodegenSupport].produce(ctx, this)}

It's calling child.produce(), and that's before the nulling at line 241.

Unless there's multiple threads messing around the state (which is something that this part of whole-stage codegen doesn't expect to begin with), I can't see how bufVars would be null here...

I'm waiting for the next build (4695, which still has this change) to finish and see if the test still fails.

@cloud-fan
Copy link
Contributor

shall we just don't do the nulling out? It wouldn't help the GC a lot.

@rednaxelafx
Copy link
Contributor Author

I just checked the same test in Build 4695, which still has this change, and the test passed:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/4695/testReport/junit/org.apache.spark.sql/TPCDSQuerySuite/q61/
Something weird must have happened when running this test in Build 4694...

re:

shall we just don't do the nulling out? It wouldn't help the GC a lot.
@cloud-fan I can certainly do that -- just mark the transient, removing the nulling code.
But what I'm concerned about is that the test failure was exposing some other problem that we might not have anticipated before.

The whole-stage codegen logic in most physical operators assume that codegen happens on a single thread. As such, it might use instance fields on the operator to pass state between the doProduce and doConsume methods, as is the case with bufVars in HashAggregateExec.
Now, imagine a scenario where I take a DataFrame df:

val plan = df.queryExecution.executedPlan

// then on Thread1
plan.execute // triggers whole-stage codegen

// at around the same time, on Thread2
plan.execute // also triggers whole-stage codegen

Now we're going to be performing whole-stage codegen on 2 different threads, at around the same time, on the exact same plan (if there are HashAggregateExecs involved, they're gonna be the same instances).
In this case, the two threads might accidentally "cross-talk" because of the way whole-stage codegen uses instance fields to pass state between doProduce and doConsume. It's basically "pure luck" that both threads finish the codegen correctly (despite one thread might be accidentally using the state from (overwritten by) another thread).

If this theory holds, nulling out the state introduces a "leak" of the "cross-talk", so it's now possible to see an NPE if the timing is just right. But it's fairly scary already even without the nulling...

Anyway, I'll resend this PR with only the @transient part of the change, since that won't cause any regressions for sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants