[SPARK-13919] [SQL] [WIP] Resolving the Conflicts of ColumnPruning and PushPredicateThroughProject #11745
…OverColumnPruning
…ushProjectThroughFilter.
Test build #53248 has finished for PR 11745 at commit
 * - Project <- Join
 * - LeftSemiJoin
 * Note: This rule could reverse the effects of PushPredicateThroughProject.
 * This rule should be run before ColumnPruning for ensuring that Project can be
I'm a little against depending on rule order too much. Sometimes we have to, because the other solutions are way too complex, but for this issue, can we try to find a more general solution?
Yeah, I have the same concern. This PR just resolves the conflicts based on the current infrastructure.
In my opinion, each batch needs a few rule sets. The order of the rule sets does not matter; within each rule set, the order of the rules matters. However, this is a fundamental design change. @marmbrus @rxin might have a better idea about this.
Test build #53400 has finished for PR 11745 at commit
 * This rule should be run before ColumnPruning for ensuring that Project can be
 * pushed as low as possible.
 */
object PushProjectThroughFilter extends Rule[LogicalPlan] {
We do not actually PUSH the Project through the Filter; we create a new Project before the Filter to prune some columns.
As I said in another PR, we remove those Projects before the Filter.
@davies The naming of this rule is not right, but I still think this PR fixes the fundamental issue of the conflicts between ColumnPruning and PushPredicateThroughProject. If we do not take the ideas of this PR, I can find a test case to show the minor fix in ColumnPruning does not cover all the cases.
// Because ColumnPruning is called after PushPredicateThroughProject, the predicate push down
// is reversed. This batch is to ensure Filter is pushed below Project, if possible.
Batch("Push Predicate Through Project", Once,
  PushPredicateThroughProject) ::
Putting this rule in a separate batch is not correct; some other filter push-down rules depend on this one.
I did not remove it from the original batch. Just added the extra batch here.
oh, I missed that, sorry.
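The effect of the extra batch discussed in this thread can be sketched with a toy executor. This is only an illustration under stated assumptions, not Spark's RuleExecutor: the names Batch, Once and FixedPoint mirror the diff, but the plan is just a top-down list of operator names, and liftFilterUp is a hypothetical stand-in for whatever rule undoes the push.

```scala
// Toy model only: NOT Spark's RuleExecutor. A plan is a top-down list of operator names.
object ToyBatches {
  type Plan = Vector[String]
  type Rule = Plan => Plan

  sealed trait Strategy { def maxIterations: Int }
  case object Once extends Strategy { val maxIterations: Int = 1 }
  final case class FixedPoint(maxIterations: Int) extends Strategy
  final case class Batch(name: String, strategy: Strategy, rules: Rule*)

  // Run each batch in order; within a batch, apply its rules in sequence until
  // the plan stops changing or the iteration limit is reached.
  def execute(batches: Seq[Batch], plan: Plan): Plan =
    batches.foldLeft(plan) { (current, batch) =>
      var result = current
      var iteration = 0
      var changed = true
      while (changed && iteration < batch.strategy.maxIterations) {
        val next = batch.rules.foldLeft(result)((p, rule) => rule(p))
        changed = next != result
        result = next
        iteration += 1
      }
      result
    }

  // Swap the first adjacent pair (above, below) found in the plan, if any.
  def swapFirst(above: String, below: String): Rule = plan => {
    val i = plan.indexOfSlice(Seq(above, below))
    if (i < 0) plan else plan.patch(i, Seq(below, above), 2)
  }

  // Stand-in for PushPredicateThroughProject.
  val pushFilterDown: Rule = swapFirst("Filter", "Project")
  // Hypothetical rule that reverses the push (assumption, for illustration only).
  val liftFilterUp: Rule = swapFirst("Project", "Filter")

  val start: Plan = Vector("Filter", "Project", "Relation")
  val mainBatch: Batch = Batch("Operator Optimizations", FixedPoint(100), pushFilterDown, liftFilterUp)
  val extraBatch: Batch = Batch("Push Predicate Through Project", Once, pushFilterDown)

  val withoutExtra: Plan = execute(Seq(mainBatch), start)
  val withExtra: Plan = execute(Seq(mainBatch, extraBatch), start)
}
```

In this toy, the main batch reaches a fixed point with the Filter still above the Project because the reversing rule runs last; the extra Once batch then leaves the Filter below the Project. The original batch still contains the push rule, so other rules in it can keep relying on it.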
What changes were proposed in this pull request?
This PR is a follow-up of #11682.
Now, ColumnPruning and PushPredicateThroughProject reverse each other's effect. Although this no longer causes the optimizer to hit the max iteration limit, some queries are not fully optimized.
For example, in the following query,
After multiple iterations of the two rules ColumnPruning and PushPredicateThroughProject, the optimized plan we generated is like:
However, the expected optimized plan should be like:
The solution of this PR is to split the rule ColumnPruning into two parts: PushProjectThroughFilter and ColumnPruning. The new rule PushProjectThroughFilter runs before ColumnPruning starts. This PR also moves the rule PushPredicateThroughProject before the rules SetOperationPushDown and PushPredicateThroughJoin to ensure all the predicates can be pushed before PushProjectThroughFilter reverses the effect of the rule PushPredicateThroughProject.
How was this patch tested?
The existing test cases already expose the problem, but we need to add more regression tests to ensure the future code changes will not break it.
TODO: add more test cases.
Will submit another PR to stop pushing Project through the other operators in ColumnPruning if it contains non-deterministic expressions.
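To make the conflict between the two rules concrete, here is a minimal sketch on a toy plan model. It does not use Catalyst: Plan, Relation, Project, Filter and the three rules below are simplified, hypothetical stand-ins (columns are plain strings, a Project only selects columns), and collapseProject stands in for collapsing adjacent Projects.

```scala
// Toy logical plan, NOT Spark's Catalyst classes. Columns are plain strings.
sealed trait Plan { def output: Set[String] }
final case class Relation(output: Set[String]) extends Plan
final case class Project(columns: Set[String], child: Plan) extends Plan {
  def output: Set[String] = columns
}
final case class Filter(references: Set[String], child: Plan) extends Plan {
  def output: Set[String] = child.output
}

object ToyOptimizer {
  type Rule = PartialFunction[Plan, Plan]

  // Apply `rule` bottom-up at every node where it is defined.
  def transformUp(plan: Plan)(rule: Rule): Plan = {
    val rewritten: Plan = plan match {
      case Project(cols, child) => Project(cols, transformUp(child)(rule))
      case Filter(refs, child)  => Filter(refs, transformUp(child)(rule))
      case leaf: Relation       => leaf
    }
    rule.applyOrElse(rewritten, (p: Plan) => p)
  }

  // Stand-in for PushPredicateThroughProject: move a Filter below a Project
  // when the Project keeps every column the predicate references.
  val pushFilterBelowProject: Rule = {
    case Filter(refs, Project(cols, grandChild)) if refs.subsetOf(cols) =>
      Project(cols, Filter(refs, grandChild))
  }

  // Stand-in for the Filter case of ColumnPruning: insert a Project under the
  // Filter when the child produces columns that nothing above it needs.
  val pruneBelowFilter: Rule = {
    case Project(cols, Filter(refs, child)) if (child.output -- (cols ++ refs)).nonEmpty =>
      Project(cols, Filter(refs, Project(cols ++ refs, child)))
  }

  // Stand-in for collapsing two adjacent Projects into the outer one.
  val collapseProject: Rule = {
    case Project(cols, Project(_, child)) => Project(cols, child)
  }

  val start: Plan = Project(Set("a"), Filter(Set("b"), Relation(Set("a", "b", "c"))))
  val pruned: Plan = transformUp(start)(pruneBelowFilter)
  val pushed: Plan = transformUp(pruned)(pushFilterBelowProject)
  val collapsed: Plan = transformUp(pushed)(collapseProject)
}
```

In this toy, pruning inserts a Project under the Filter, the push rule then moves the Filter back below that new Project, and collapsing the two Projects returns exactly the starting plan. The shape of the final plan therefore depends on which rule happens to run last, which is the kind of conflict this PR describes.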