[SPARK-19309][SQL] disable common subexpression elimination for conditional expressions #16659

cloud-fan · 2017-01-20T12:42:02Z

What changes were proposed in this pull request?

As I pointed out in #15807 (comment), the current subexpression elimination framework has a problem: it always evaluates all common subexpressions at the beginning, even if they are inside conditional expressions and may never be accessed.

Ideally we should implement it like a Scala lazy val, so that a common subexpression is evaluated only when it is accessed at least once. #15837 tries this approach, but it seems too complicated and may introduce a performance regression.

This PR simply stops common subexpression elimination for conditional expressions, with some cleanup.
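
For readers less familiar with the trade-off described above, here is a minimal, Spark-free sketch of eager versus lazy evaluation of a shared subexpression. The names (`EagerVsLazyDemo`, `expensive`, `evalCount`) are hypothetical and only illustrate the idea; this is not the code Spark generates.

```scala
object EagerVsLazyDemo {
  // Counts how many times the shared subexpression is actually evaluated.
  var evalCount = 0

  // Stand-in for a costly common subexpression (hypothetical).
  def expensive(x: Int): Int = {
    evalCount += 1
    x * x
  }

  // Eager style: the shared value is computed up front,
  // even when the branch that uses it is never taken.
  def eager(x: Int, cond: Boolean): Int = {
    val common = expensive(x)
    if (cond) common + common else 0
  }

  // Lazy style: computed at most once, and only on first access.
  def lazily(x: Int, cond: Boolean): Int = {
    lazy val common = expensive(x)
    if (cond) common + common else 0
  }
}
```

With `cond = false`, `eager` still pays for `expensive`, while `lazily` never calls it. The lazy form is what the description calls the ideal behaviour; the PR takes the simpler route of not sharing subexpressions that sit inside conditional branches.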

How was this patch tested?

regression test

cloud-fan · 2017-01-20T12:43:07Z

cc @viirya @kiszk @hvanhovell

viirya · 2017-01-20T13:51:37Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala

      outputExternalType,
      bufferDeserializer :: Nil)

+    val serializeExprs = outputSerializer.map(_.transform {


nit: outputSerializeExprs

it's always used, no need to make it lazy val.

viirya · 2017-01-20T14:05:45Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ReferenceToExpressions.scala

- * @param result The expression that contains [[BoundReference]] and produces the final output.
- * @param children The expressions that used as input values for [[BoundReference]].
- */
-case class ReferenceToExpressions(result: Expression, children: Seq[Expression])


viirya · 2017-01-20T14:06:53Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

-      // e.g. `CaseWhen`, we should support them.
+    //   2. conditional expressions: common subexpressions will always be evaluated at the
+    //                               beginning, so we should not recurse into condition expressions,
+    //                               whole children may not get evaluated.


Maybe rephrase it? "whole children may not get evaluated" is not easy to understand.

viirya · 2017-01-20T14:09:26Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

   */
-  def addExprTree(
-      root: Expression,
-      ignoreLeaf: Boolean = true,


From the code change, I don't see any place other than tests using ignoreLeaf = false. Curious why we have it.

viirya · 2017-01-20T14:10:41Z

This looks good to me. Just a few comments.

SparkQA · 2017-01-20T14:23:18Z

Test build #71720 has finished for PR 16659 at commit 45608b1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-01-20T15:02:55Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala

+
  override lazy val initialValues: Seq[Expression] = {
    val zero = Literal.fromObject(aggregator.zero, bufferExternalType)
-    bufferSerializer.map(ReferenceToExpressions(_, zero :: Nil))


Why is bufferSerializer now replaced with bufferDeserializer?

sorry, typo...

viirya · 2017-01-20T15:03:21Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala

      bufferDeserializer :: inputDeserializer.get :: Nil)
-
-    bufferSerializer.map(ReferenceToExpressions(_, reduced :: Nil))
+    deserializeToBuffer(reduced)


viirya · 2017-01-20T15:03:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala

      leftBuffer :: rightBuffer :: Nil)
-
-    bufferSerializer.map(ReferenceToExpressions(_, merged :: Nil))
+    deserializeToBuffer(merged)


SparkQA · 2017-01-21T04:27:23Z

Test build #71752 has finished for PR 16659 at commit cda9723.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-01-21T04:52:53Z

The child expression in Sum is wrapped in Coalesce, which makes the org.apache.spark.sql.SQLQuerySuite "Common subexpression elimination" test fail.

viirya · 2017-01-21T05:59:26Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

this's cool.

I just found that not all the children of AtLeastNNonNulls get accessed during evaluation either. Do we need to add it here too?
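
To make the point about AtLeastNNonNulls concrete, here is a simplified sketch of short-circuit evaluation, assuming children are modelled as thunks. `AtLeastNSketch` and its signature are hypothetical, and the sketch ignores details of the real expression such as NaN handling.

```scala
object AtLeastNSketch {
  // Each child is a thunk so we can observe which ones are actually evaluated.
  // Returns (result, number of children evaluated).
  def atLeastNNonNulls(n: Int, children: Seq[() => Any]): (Boolean, Int) = {
    var nonNulls = 0
    var evaluated = 0
    val it = children.iterator
    // Stop as soon as n non-null values have been seen.
    while (nonNulls < n && it.hasNext) {
      evaluated += 1
      if (it.next()() != null) nonNulls += 1
    }
    (nonNulls >= n, evaluated)
  }
}
```

Once the threshold is reached, the remaining children are never touched, so a subexpression that only appears in a later child may not be evaluated at all.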

viirya · 2017-01-21T05:59:38Z

LGTM

cloud-fan · 2017-01-21T07:15:04Z

I reran the DatasetBenchmark, there is no performance regression.

SparkQA · 2017-01-21T07:50:17Z

Test build #71759 has finished for PR 16659 at commit 9d50048.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-21T11:04:51Z

Test build #71763 has finished for PR 16659 at commit e7d928c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-01-21T16:41:02Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

      // when it is code generated. This decision should be a cost based one.
      //
      // The cost of doing subexpression elimination is:
      //   1. Extra function call, although this is probably *good* as the JIT can decide to


Nit: we removed 2. and 3.. We do not need 1., right?

but we do have an extra function call to evaluate common subexpression at the beginning.

:) I meant just remove the "1." numbering, not the whole sentence.

oh i see :)

maybe we should still keep it, to make the indent consistent between the "cost" part and the "benefit" part. It also makes it more obvious that we only have one "cost".

I am fine to keep it.

gatorsmile · 2017-01-21T17:38:16Z

...src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala

-    equivalence.addExprTree(price, false)
-    equivalence.addExprTree(discount, false)
-    // quantity, price, discount and (price * (1 - discount))
-    assert(equivalence.getAllEquivalentExprs.count(_.size > 1) == 4)


To other reviewers: the new addExprTree always ignores the leaf nodes. Thus, these test cases are not needed.

gatorsmile · 2017-01-21T17:47:04Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+    //          expression. We should only recurse into the predicate expression.
+    //   3. CaseWhen: like `If`, the children of `CaseWhen` only get accessed in a certain
+    //                condition. We should only recurse into the first condition expression as it
+    //                will always get accessed.


CaseWhen could be very deep.

CASE WHEN expr1 THEN expr2 [WHEN expr3 THEN expr4]* [ELSE expr5] END
When expr1 = true, returns expr2; when expr3 = true, return expr4; else return expr5.

Compared with the previous impl, will we miss some expression elimination chances?

nvm, CaseWhen implements CodegenFallback. Thus, the previous impl skips it.

gatorsmile · 2017-01-21T18:08:02Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+    def childrenToRecurse: Seq[Expression] = expr match {
+      case _: CodegenFallback => Nil
+      case i: If => i.predicate :: Nil
+      case c: CaseWhenBase => c.children.head :: Nil


This case is not reachable, could we leave a comment above this?

gatorsmile · 2017-01-21T18:18:33Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala

+    //                condition. We should only recurse into the first condition expression as it
+    //                will always get accessed.
+    //   4. Coalesce: it's also a conditional expression, we should only recurse into the first
+    //                children, because others may not get accessed.


Although Coalesce might miss some expression elimination chances, I think it is very rare when users use the same expressions in Coalesce.

Could you update the comments?

Coalesce may be just a small part of the whole expression tree, and the children of Coalesce may be the same as other expressions elsewhere in the expression tree.
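
The recursion rule discussed in this thread can be sketched with a toy expression tree. This is a simplified model under stated assumptions, not Spark's `EquivalentExpressions`: the types `Expr`, `Attr`, `Add`, `If`, and `Coalesce` below are stand-ins defined here, `countOccurrences` is a hypothetical helper, and the `CodegenFallback` and `CaseWhen` cases are left out.

```scala
object ToyEquivalence {
  sealed trait Expr { def children: Seq[Expr] }
  case class Attr(name: String) extends Expr { val children: Seq[Expr] = Nil }
  case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }
  case class If(predicate: Expr, trueValue: Expr, falseValue: Expr) extends Expr {
    def children: Seq[Expr] = Seq(predicate, trueValue, falseValue)
  }
  case class Coalesce(children: Seq[Expr]) extends Expr

  // Only descend into children that are always evaluated.
  def childrenToRecurse(expr: Expr): Seq[Expr] = expr match {
    case i: If       => i.predicate :: Nil   // branches may be skipped
    case c: Coalesce => c.children.take(1)   // later children may be skipped
    case other       => other.children
  }

  // Count how often each non-leaf expression is reached through
  // always-evaluated children; a count above 1 marks a common subexpression.
  def countOccurrences(root: Expr): Map[Expr, Int] = {
    val counts = scala.collection.mutable.Map.empty[Expr, Int]
    def visit(e: Expr): Unit = {
      if (e.children.nonEmpty) {             // leaves are ignored
        counts(e) = counts.getOrElse(e, 0) + 1
        childrenToRecurse(e).foreach(visit)
      }
    }
    visit(root)
    counts.toMap
  }
}
```

In this model, `a + b` in `(a + b) + if (p) (a + b) else c` is counted once, because the second occurrence sits in a branch that may not run. In `(a + b) + coalesce(a + b, c)` it is counted twice, because the first child of Coalesce is always evaluated and matches an expression elsewhere in the tree.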

gatorsmile · 2017-01-21T18:20:35Z

LGTM except a few comments.

gatorsmile · 2017-01-23T04:49:51Z

LGTM pending test.

SparkQA · 2017-01-23T05:19:48Z

Test build #71818 has finished for PR 16659 at commit 0753ee6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-01-23T05:32:01Z

thanks, merging to master!

…tional expressions

## What changes were proposed in this pull request?

As I pointed out in apache#15807 (comment), the current subexpression elimination framework has a problem: it always evaluates all common subexpressions at the beginning, even if they are inside conditional expressions and may never be accessed.

Ideally we should implement it like a Scala lazy val, so that a common subexpression is evaluated only when it is accessed at least once. apache#15837 tries this approach, but it seems too complicated and may introduce a performance regression.

This PR simply stops common subexpression elimination for conditional expressions, with some cleanup.

## How was this patch tested?

regression test

Author: Wenchen Fan

Closes apache#16659 from cloud-fan/codegen.

disable common subexpression elimination for conditional expressions

45608b1

viirya reviewed Jan 20, 2017

address comments

cda9723

viirya reviewed Jan 21, 2017

improve

e7d928c

cloud-fan force-pushed the codegen branch from 9d50048 to e7d928c on January 21, 2017 08:38

gatorsmile reviewed Jan 21, 2017

address comments

0753ee6

asfgit closed this in de6ad3d Jan 23, 2017

viirya mentioned this pull request Jan 24, 2017

[SPARK-18395][SQL] Evaluate common subexpression like lazy variable with a function approach #15837

Closed
