[SPARK-13919] [SQL] [WIP] Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject #11745

gatorsmile · 2016-03-16T00:23:41Z

What changes were proposed in this pull request?

This PR is a follow-up of #11682.

Now, ColumnPruning and PushPredicateThroughProject reverse each other's effect. Although it will not cause the max iteration now, some queries are not optimized to the best.

For example, in the following query,

    val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
    val originalQuery =
      input.select('a, 'b, 'c, 'd,
        WindowExpression(
          AggregateExpression(Count('b), Complete, isDistinct = false),
          WindowSpecDefinition( 'a :: Nil,
            SortOrder('b, Ascending) :: Nil,
            UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c)

After multiple iteration of two rules of ColumnPruning and PushPredicateThroughProject, the optimized plan we generated is like:

Project [a#0,c#0]                                                                                                                                                                    
+- Filter (window#0L > cast(1 as bigint))                                                                                                                                            
   +- Project [a#0,c#0,window#0L]                                                                                                                                                    
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]   
         +- LocalRelation [a#0,b#0,c#0,d#0]

However, the expected optimized plan should be like:

Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- Project [a#0,b#0,c#0]
            +- LocalRelation [a#0,b#0,c#0,d#0]

The solution of this PR is to split the rule ColumnPruning into two parts: PushProjectThroughFilter and ColumnPruning. The new rule PushProjectThroughPredicate runs before starting ColumnPruning. This PR also moves the rule PushPredicateThroughProject before the rules SetOperationPushDown and PushPredicateThroughJoin to ensure all the predicates can be pushed before PushProjectThroughFilter reverses the effect of the rule PushPredicateThroughProject.

How was this patch tested?

The existing test cases already expose the problem, but we need to add more regression tests to ensure the future code changes will not break it.

TODO: add more test cases.
Will submit another PR for stopping pushing Project through the other operators in ColumnPruning if it contains non-deterministic expressions.

…OverColumnPruning

…ushProjectThroughFilter.

SparkQA · 2016-03-16T02:01:31Z

Test build #53248 has finished for PR 11745 at commit c21748a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-03-17T07:03:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

- *   - Project <- Join
- *   - LeftSemiJoin
+ * Note: This rule could reverse the effects of PushPredicateThroughProject.
+ * This rule should be run before ColumnPruning for ensuring that Project can be


I'm a little against to depending on rules order too much, sometimes we have to as other solutions are way too complex, but for this issue, can we try to find a more general solution?

Yeah, I have the same concern. This PR is just to resolve the conflicts based on the current infrastructure.

In my opinion, in each batch, we need a few rule sets. The order of rule sets do not matter. In each rule set, the order of rules matters. However, this is a fundamental design change. @marmbrus @rxin might have a better idea in this.

SparkQA · 2016-03-17T08:15:08Z

Test build #53400 has finished for PR 11745 at commit b853e01.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-03-18T18:53:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+ * This rule should be run before ColumnPruning for ensuring that Project can be
+ * pushed as low as possible.
+ */
+object PushProjectThroughFilter extends Rule[LogicalPlan] {


We does not actual PUSH project through filter, we create new Project before to prune some columns.

As I said in another PR, we remove the those Project before filter.

@davies The naming of this rule is not right, but I still think this PR fixes the fundamental issue of the conflicts between ColumnPruning and PushPredicateThroughProject. If we do not take the ideas of this PR, I can find a test case to show the minor fix in ColumnPruning does not cover all the cases.

davies · 2016-03-19T04:24:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+    // Because ColumnPruning is called after PushPredicateThroughProject, the predicate push down
+    // is reversed. This batch is to ensure Filter is pushed below Project, if possible.
+    Batch("Push Predicate Through Project", Once,
+      PushPredicateThroughProject) ::


Put this role in a separate batch is not correct, some other filter push down rules depend on this one.

I did not remove it from the original batch. Just added the extra batch here.

oh, I missed that, sorry.

gatorsmile added 2 commits March 15, 2016 12:06

Merge remote-tracking branch 'upstream/master' into predicatePushDown…

c6221a4

…OverColumnPruning

address the conflicts of two rules: PushPredicateThroughProject and P…

c21748a

…ushProjectThroughFilter.

gatorsmile mentioned this pull request Mar 17, 2016

[SPARK-13840] [SQL] Split Optimizer Rule ColumnPruning to ColumnPruning and EliminateOperator #11682

Closed

move SetOperationPushDown below PushPredicateThroughProject

b853e01

cloud-fan reviewed Mar 17, 2016
View reviewed changes

davies reviewed Mar 18, 2016
View reviewed changes

davies mentioned this pull request Mar 18, 2016

[SPARK-13919] [SQL] fix column pruning through filter #11828

Closed

davies reviewed Mar 19, 2016
View reviewed changes

gatorsmile mentioned this pull request Mar 22, 2016

[SPARK-13891] [SQL] [TEST] Issue Exceptions when Hitting the Max Iteration Limit in Optimizer and Analyzer #11714

Closed

asfgit closed this in 6603d9f Mar 25, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-13919] [SQL] [WIP] Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject #11745

[SPARK-13919] [SQL] [WIP] Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject #11745

Uh oh!

gatorsmile commented Mar 16, 2016

Uh oh!

SparkQA commented Mar 16, 2016

Uh oh!

cloud-fan Mar 17, 2016

Uh oh!

gatorsmile Mar 17, 2016

Uh oh!

SparkQA commented Mar 17, 2016

Uh oh!

davies Mar 18, 2016

Uh oh!

gatorsmile Mar 19, 2016

Uh oh!

davies Mar 19, 2016

Uh oh!

gatorsmile Mar 19, 2016

Uh oh!

davies Mar 19, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[SPARK-13919] [SQL] [WIP] Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject #11745

[SPARK-13919] [SQL] [WIP] Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject #11745

Uh oh!

Conversation

gatorsmile commented Mar 16, 2016

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

SparkQA commented Mar 16, 2016

Uh oh!

cloud-fan Mar 17, 2016

Choose a reason for hiding this comment

Uh oh!

gatorsmile Mar 17, 2016

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Mar 17, 2016

Uh oh!

davies Mar 18, 2016

Choose a reason for hiding this comment

Uh oh!

gatorsmile Mar 19, 2016

Choose a reason for hiding this comment

Uh oh!

davies Mar 19, 2016

Choose a reason for hiding this comment

Uh oh!

gatorsmile Mar 19, 2016

Choose a reason for hiding this comment

Uh oh!

davies Mar 19, 2016

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants