@cloud-fan
Contributor

What changes were proposed in this pull request?

As I pointed out in #15807 (comment), the current subexpression elimination framework has a problem: it always evaluates all common subexpressions at the beginning, even if they are inside conditional expressions and may never be accessed.

Ideally we should implement it like a Scala lazy val, so that a common subexpression is evaluated only when it is accessed at least once. #15837 tries this approach, but it seems too complicated and may introduce a performance regression.

This PR simply stops common subexpression elimination for conditional expressions, with some cleanup.
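
To make the problem concrete, here is a minimal sketch in plain Scala rather than Spark's generated code. The names `EagerVsLazyCse`, `expensive`, `eager` and `onDemand` are hypothetical and exist only for this illustration. It contrasts computing a shared subexpression up front with the lazy val style described above.

```scala
object EagerVsLazyCse {
  // Counts how many times the shared subexpression is actually evaluated.
  var evaluations = 0

  // Stand-in for a costly common subexpression (hypothetical).
  def expensive(x: Int): Int = {
    evaluations += 1
    x * x
  }

  // Eager style: the shared value is computed at the beginning,
  // even when the branch that uses it is never taken.
  def eager(flag: Boolean, x: Int): Int = {
    val common = expensive(x)
    if (flag) common + common else 0
  }

  // Lazy style: the shared value is computed at most once,
  // and only if some branch accesses it.
  def onDemand(flag: Boolean, x: Int): Int = {
    lazy val common = expensive(x)
    if (flag) common + common else 0
  }
}
```

With `flag = false`, `eager` still pays for one evaluation while `onDemand` pays for none. This PR does not take the lazy route; it avoids the wasted work by not treating expressions inside conditional branches as common subexpressions at all.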

How was this patch tested?

regression test

@cloud-fan
Contributor Author

cc @viirya @kiszk @hvanhovell

outputExternalType,
bufferDeserializer :: Nil)

val serializeExprs = outputSerializer.map(_.transform {
Member

nit: outputSerializeExprs

Member

lazy val?

Contributor Author

it's always used, no need to make it lazy val.

* @param result The expression that contains [[BoundReference]] and produces the final output.
* @param children The expressions that used as input values for [[BoundReference]].
*/
case class ReferenceToExpressions(result: Expression, children: Seq[Expression])
Member

nice.

// e.g. `CaseWhen`, we should support them.
// 2. conditional expressions: common subexpressions will always be evaluated at the
// beginning, so we should not recurse into condition expressions,
// whole children may not get evaluated.
Member

Maybe rephrase it? "whole children may not get evaluated" is not easy to understand.

*/
def addExprTree(
root: Expression,
ignoreLeaf: Boolean = true,
Member

From the code change, I don't see any place other than tests using ignoreLeaf = false. Curious why we have it.

@viirya
Member

viirya commented Jan 20, 2017

This looks good to me. Just a few comments.

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71720 has finished for PR 16659 at commit 45608b1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


override lazy val initialValues: Seq[Expression] = {
val zero = Literal.fromObject(aggregator.zero, bufferExternalType)
bufferSerializer.map(ReferenceToExpressions(_, zero :: Nil))
Member

Why is bufferSerializer now replaced with bufferDeserializer?

Contributor Author

sorry, typo...

bufferDeserializer :: inputDeserializer.get :: Nil)

bufferSerializer.map(ReferenceToExpressions(_, reduced :: Nil))
deserializeToBuffer(reduced)
Member

ditto

leftBuffer :: rightBuffer :: Nil)

bufferSerializer.map(ReferenceToExpressions(_, merged :: Nil))
deserializeToBuffer(merged)
Member

ditto

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71752 has finished for PR 16659 at commit cda9723.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Jan 21, 2017

The child expression in Sum is wrapped in Coalesce, which makes the test org.apache.spark.sql.SQLQuerySuite.Common subexpression elimination fail.

Member

This is cool.

Member

I just found that not all the children of AtLeastNNonNulls get accessed during evaluation either. Do we need to add it here too?

@viirya
Member

viirya commented Jan 21, 2017

LGTM

@cloud-fan
Contributor Author

I reran the DatasetBenchmark, there is no performance regression.

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71759 has finished for PR 16659 at commit 9d50048.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71763 has finished for PR 16659 at commit e7d928c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// when it is code generated. This decision should be a cost based one.
//
// The cost of doing subexpression elimination is:
// 1. Extra function call, although this is probably *good* as the JIT can decide to
Member

Nit: we removed 2. and 3.. We do not need 1., right?

Contributor Author

@cloud-fan cloud-fan Jan 21, 2017

but we do have an extra function call to evaluate common subexpression at the beginning.

Member

: ) Just removed 1.. Not the whole sentence

Contributor Author

oh i see :)

Contributor Author

maybe we should still keep it, to make the indent consistent between the "cost" part and the "benefit" part. It also makes it more obvious that we only have one "cost".

Member

I am fine to keep it.

equivalence.addExprTree(price, false)
equivalence.addExprTree(discount, false)
// quantity, price, discount and (price * (1 - discount))
assert(equivalence.getAllEquivalentExprs.count(_.size > 1) == 4)
Member

To other reviewers: the new addExprTree always ignores the leaf nodes. Thus, these test cases are not needed.

// expression. We should only recurse into the predicate expression.
// 3. CaseWhen: like `If`, the children of `CaseWhen` only get accessed in a certain
// condition. We should only recurse into the first condition expression as it
// will always get accessed.
Member

CaseWhen could be very deep.

CASE WHEN expr1 THEN expr2 [WHEN expr3 THEN expr4]* [ELSE expr5] END
When expr1 = true, returns expr2; when expr3 = true, returns expr4; else returns expr5.

Compared with the previous impl, will we miss some expression elimination chances?

Member

nvm, CaseWhen implements CodegenFallback. Thus, the previous impl skips it.

def childrenToRecurse: Seq[Expression] = expr match {
case _: CodegenFallback => Nil
case i: If => i.predicate :: Nil
case c: CaseWhenBase => c.children.head :: Nil
Member

This case is not reachable, could we leave a comment above this?
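
For readers without the Spark source at hand, the recursion rule under review can be sketched on a toy expression tree. This is only an illustration under stated assumptions: `Expr`, `Col`, `Add`, `Gt` and the simplified `If` and `CaseWhen` below are hypothetical stand-ins rather than Catalyst classes, and the `CodegenFallback` case is left out.

```scala
object RecurseSketch {
  // Toy expression tree (hypothetical, not Spark's Expression hierarchy).
  sealed trait Expr { def children: Seq[Expr] }
  case class Col(name: String) extends Expr { val children: Seq[Expr] = Nil }
  case class Add(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }
  case class Gt(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }
  case class If(predicate: Expr, trueValue: Expr, falseValue: Expr) extends Expr {
    val children: Seq[Expr] = Seq(predicate, trueValue, falseValue)
  }
  case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Option[Expr]) extends Expr {
    val children: Seq[Expr] =
      branches.flatMap { case (cond, value) => Seq(cond, value) } ++ elseValue
  }

  // Only descend into children that are evaluated on every execution.
  def childrenToRecurse(expr: Expr): Seq[Expr] = expr match {
    case i: If => i.predicate :: Nil
    case c: CaseWhen => c.children.headOption.toSeq // first condition only
    case other => other.children
  }

  // Non-leaf expressions reachable through always-evaluated paths.
  def alwaysEvaluated(expr: Expr): Seq[Expr] = {
    val self = if (expr.children.isEmpty) Nil else Seq(expr)
    self ++ childrenToRecurse(expr).flatMap(alwaysEvaluated)
  }
}
```

In this sketch, for `If(Gt(x, y), a + b, a + b)` only the `If` itself and its predicate are visited, so the two `a + b` branches are never registered as a common subexpression.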

// condition. We should only recurse into the first condition expression as it
// will always get accessed.
// 4. Coalesce: it's also a conditional expression, we should only recurse into the first
// children, because others may not get accessed.
Member

Although Coalesce might miss some expression elimination chances, I think it is very rare when users use the same expressions in Coalesce.

Could you update the comments?

Contributor Author

Coalesce may be just a small part of the whole expression tree, and the children of Coalesce may be the same as other expressions elsewhere in the expression tree.
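
The point about Coalesce can be illustrated with a small toy model. This is a sketch under assumptions, not the `EquivalentExpressions` implementation: the types `Expr`, `Col`, `Add`, `Coalesce` and the helpers `reachable` and `commonSubexpressions` are hypothetical names introduced for this example.

```scala
object CoalesceCseSketch {
  // Toy expression tree (hypothetical, not Spark's Expression hierarchy).
  sealed trait Expr { def children: Seq[Expr] }
  case class Col(name: String) extends Expr { val children: Seq[Expr] = Nil }
  case class Add(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }
  case class Coalesce(children: Seq[Expr]) extends Expr

  // Non-leaf expressions on paths that are evaluated unconditionally:
  // for Coalesce, only the first child is guaranteed to be accessed.
  def reachable(expr: Expr): Seq[Expr] = {
    val next = expr match {
      case c: Coalesce => c.children.take(1)
      case other => other.children
    }
    (if (expr.children.isEmpty) Nil else Seq(expr)) ++ next.flatMap(reachable)
  }

  // Expressions seen more than once are candidates for elimination.
  def commonSubexpressions(root: Expr): Set[Expr] =
    reachable(root)
      .groupBy(identity)
      .collect { case (e, occurrences) if occurrences.size > 1 => e }
      .toSet
}
```

In this model, `(a + b) + coalesce(a + b, c + d)` still shares `a + b`, because the first child of Coalesce is always evaluated. In `(c + d) + coalesce(a + b, c + d)` nothing is shared, because `c + d` inside Coalesce may never be accessed.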

@gatorsmile
Member

LGTM except a few comments.

@gatorsmile
Member

LGTM pending test.

@SparkQA

SparkQA commented Jan 23, 2017

Test build #71818 has finished for PR 16659 at commit 0753ee6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks, merging to master!

@asfgit asfgit closed this in de6ad3d Jan 23, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…tional expressions

Author: Wenchen Fan <[email protected]>

Closes apache#16659 from cloud-fan/codegen.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017