Skip to content

Conversation

@gatorsmile
Copy link
Member

What changes were proposed in this pull request?

This PR is a follow-up of #11682.

Now, ColumnPruning and PushPredicateThroughProject reverse each other's effect. Although it will not cause the max iteration now, some queries are not optimized to the best.

For example, in the following query,

    val input = LocalRelation('a.int, 'b.string, 'c.double, 'd.int)
    val originalQuery =
      input.select('a, 'b, 'c, 'd,
        WindowExpression(
          AggregateExpression(Count('b), Complete, isDistinct = false),
          WindowSpecDefinition( 'a :: Nil,
            SortOrder('b, Ascending) :: Nil,
            UnspecifiedFrame)).as('window)).where('window > 1).select('a, 'c)

After multiple iteration of two rules of ColumnPruning and PushPredicateThroughProject, the optimized plan we generated is like:

Project [a#0,c#0]                                                                                                                                                                    
+- Filter (window#0L > cast(1 as bigint))                                                                                                                                            
   +- Project [a#0,c#0,window#0L]                                                                                                                                                    
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]   
         +- LocalRelation [a#0,b#0,c#0,d#0]                                                                                                                                          

However, the expected optimized plan should be like:

Project [a#0,c#0]
+- Filter (window#0L > cast(1 as bigint))
   +- Project [a#0,c#0,window#0L]
      +- Window [(count(b#0),mode=Complete,isDistinct=false) windowspecdefinition(a#0, b#0 ASC, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS window#0L], [a#0], [b#0 ASC]
         +- Project [a#0,b#0,c#0]
            +- LocalRelation [a#0,b#0,c#0,d#0]                                                                                                                                          

The solution of this PR is to split the rule ColumnPruning into two parts: PushProjectThroughFilter and ColumnPruning. The new rule PushProjectThroughPredicate runs before starting ColumnPruning. This PR also moves the rule PushPredicateThroughProject before the rules SetOperationPushDown and PushPredicateThroughJoin to ensure all the predicates can be pushed before PushProjectThroughFilter reverses the effect of the rule PushPredicateThroughProject.

How was this patch tested?

The existing test cases already expose the problem, but we need to add more regression tests to ensure the future code changes will not break it.

TODO: add more test cases.
Will submit another PR for stopping pushing Project through the other operators in ColumnPruning if it contains non-deterministic expressions.

@SparkQA
Copy link

SparkQA commented Mar 16, 2016

Test build #53248 has finished for PR 11745 at commit c21748a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* - Project <- Join
* - LeftSemiJoin
* Note: This rule could reverse the effects of PushPredicateThroughProject.
* This rule should be run before ColumnPruning for ensuring that Project can be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little against to depending on rules order too much, sometimes we have to as other solutions are way too complex, but for this issue, can we try to find a more general solution?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I have the same concern. This PR is just to resolve the conflicts based on the current infrastructure.

In my opinion, in each batch, we need a few rule sets. The order of rule sets do not matter. In each rule set, the order of rules matters. However, this is a fundamental design change. @marmbrus @rxin might have a better idea in this.

@SparkQA
Copy link

SparkQA commented Mar 17, 2016

Test build #53400 has finished for PR 11745 at commit b853e01.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* This rule should be run before ColumnPruning for ensuring that Project can be
* pushed as low as possible.
*/
object PushProjectThroughFilter extends Rule[LogicalPlan] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We does not actual PUSH project through filter, we create new Project before to prune some columns.

As I said in another PR, we remove the those Project before filter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies The naming of this rule is not right, but I still think this PR fixes the fundamental issue of the conflicts between ColumnPruning and PushPredicateThroughProject. If we do not take the ideas of this PR, I can find a test case to show the minor fix in ColumnPruning does not cover all the cases.

// Because ColumnPruning is called after PushPredicateThroughProject, the predicate push down
// is reversed. This batch is to ensure Filter is pushed below Project, if possible.
Batch("Push Predicate Through Project", Once,
PushPredicateThroughProject) ::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this role in a separate batch is not correct, some other filter push down rules depend on this one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not remove it from the original batch. Just added the extra batch here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I missed that, sorry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants