-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17357][SQL] Fix current predicate pushdown #14912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #64766 has finished for PR 14912 at commit
|
| test("push down filters that are combined") { | ||
| // The following predicate ('a === 2 || 'a === 3) && ('c > 10 || 'a === 2) | ||
| // will be simplified as ('a == 2) || ('c > 10 && 'a == 3). | ||
| // ('a === 2 || 'a === 3) can be pushed down. But the simplified one can't. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what happens if I just have the predicate
(a = 2) || (c > 10 && a = 3)
Will anything will be pushed down ? Have you considered instead modifying the boolean simplification logic.
Another approach that will catch these cases is as follows:
1.a Convert filters to conjunctive normal form
1.b combine filters
1.c Push filters
1.a, b and c will be run in a batch until fixed point.
Follow this batch by BooleanSimplification -- this can find and extract common factors for efficiency.
Overall, cnf may maximize the potential for filter push down
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, as I mentioned in the description, this is currently a simplest to prevent the predicates which can be pushed down becoming not pushed down.
Your case is not pushed down at the beginning. This patch currently doesn't help it.
Because the optimization rules are independent, boolean simplification logic is just a general rule to simplify predicates, and doesn't be aware of the pushdown logic. Basically boolean simplification now looks good and it makes sense to do (a > 10 || a < 100) && (a > 10 || b == 5) => (a > 10) || (a < 100 && b == 5), however, it causes the pushdown issue.
Your another approach makes sense to me. I have thought about this, just don't know if it is necessary to come out it for this corner case, because it needs more code changes.
If it is acceptable, I will implement it. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering how the Optimizer works, we can't extract CombineFilters and PushDownPredicates as a new batch, as we should also respect the interaction between them and other rules. I do an alternative approach to convert predicates of filters to cnf during combining filters, and then perform additional predicate pushdown immediately. So the following BooleanSimplification will not affect the predicate pushdown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you that we should respect the interaction between CombineFilters, PushDownPredicates and other rules. I do think it's important that cnf conversion run before any of the push-down / reordering rules. And the simplification rules should run afterwards.
My concern with rolling this into CombineFilters is that it doesn't get triggered unless there are adjoining Filter nodes. In the example you have:
val originalQuery = testRelation
.select('a, 'b, ('c + 1) as 'cc)
.groupBy('a)('a, count('cc) as 'c)
.where('c > 10)
.where(('a === 2) || ('c > 10 && 'a === 3))
I think that (a == 2 || a==3) should get pushed down even if you don't have ".where (c > 10)",
but I'm not sure that it will be since toCNF is in CombineFilters. Could you confirm ?
My suggestion is that toCNF warrants a separate rule -- for example when you're doing joins, and you have
select * from A inner join C on (A.a1 = C.c1) where A.a2 = 2 || (C.c2 = 10 && A.a2 = 3),
you want (A.a2 = 2 || A.a2 = 3) pushed down into A
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. It is only triggered when adjoining Filters are there. So in above example, the predicate (a == 2 || a==3) will not be pushed down when there is no .where(c > 10).
|
Test build #64929 has finished for PR 14912 at commit
|
|
@srinathshankar I've addressed your comments. Please take a look. Thanks. |
|
@viirya Could you please wait for the CNF predicate normalization rule? @liancheng @yjshen did a few related work before. See #10444 and #8200. Let us also collect the inputs from @ioana-delaney @nsyca . They did a lot of related work in the past 10+ years. We need a good design about CNF normalization, which can benefit the other optimizer rules. |
|
hmm, looks like there are previous works regarding CNF but none of them are really merged. @gatorsmile Thanks for the context. |
|
The CNF exponential expansion issue is an important concern in previous works. Actually you can find that this patch doesn't produce a real CNF for predicate. I use |
|
@srinathshankar @gatorsmile I think CNF is another issue other then the issue this PR was proposed to solve at the first. I would like to solve the original adjoining Filter pushdown problem here. And leave CNF issue (it is not trivial and I don't expect it will be solved soon) for later PRs. What do you think? Thanks. |
|
Could you define the conditions in which the predicates are unable to be pushed down? Then, we can easily justify the significance. |
|
@gatorsmile I've described it in the pr description. Simply said, now a Filter will be stopped to pushdown once it encounters another Filter. This patch does is to perform |
|
also cc @cloud-fan |
|
I am thinking whether it makes more sense to maintain multiple semantically equivalent predicate sets for each |
|
To maintain the predicate sets may increase much complexity as I can think. I don't know how big the set could be. But once you change one of the predicates, you need to construct all equivalent predicates in the set too. I think we can maintain CNF and simplification predicates. CNF should be enough to push down predicates and simplification predicate can be used in Filter execution. |
|
Thanks, @gatorsmile, for mentioning me. I will try my best to comment on this thread. Disclaimer: I have not looked at the existing code manipulating predicates/expressions in Spark. Nor have I the code in this PR. I am writing my comment here based solely on the comments I read in this PR. One of the goals of predicate transformation, in general, is to aid the predicate pushdown. If a new form of a predicate, or a derived form of a superset of a predicate is to be generated, it should be because there is a potential the new form or the derived form can be pushed down further the plan. Another goal of the transformation is because the new form has a potential to be simplified further. Taking the example of The most benefit in the topic of predicate transformation is the equality transitivity property as equality predicates are commonly used in SQL queries. I remember there were a few JIRAs opened, but deferred, to solve this problem. There are some capability in the current version to propagate the equality transitivity but the behaviour is not consistent. Predicate transformation like extracting common subterms. An example is the predicate Introducing superset, redundant predicates like the last example above will complicate the computation of filter ratios of the predicates on a given stream when we introduce the Cost-based Optimization, which I assume depends on a good estimate of filter ratios on a given stream. This is because we cannot make assumption on the independent filtering affects among a set of predicates. Here the filter ratio of the newly generated superset predicate should be ignored in the filtering estimate. Another goal of predicate transformation is to derive contradiction and/or tautology. This is achieved by building the inequality relationships among the same column of a set of predicate. A simple example is |
|
@nsyca Thanks for your detailed comment. I would like to leave the decision of predicate transformation to later PRs, as this PR is not motivated by this. I think the goal to simplify a predicate such as As I said in before comment, my first opinion is not to complicate the predicate handling too much. We can keep a form of predicate which benefits predicate pushdown most, I guess the form should be CNF. We can also keep the simplified form of predicate which is better for execution in |
|
ping @srinathshankar @cloud-fan @hvanhovell Can you help review this change? Some context here: Some predicates are unable to push down because:
|
|
Test build #65298 has finished for PR 14912 at commit
|
|
@viirya, I agree that we need a separate set of PRs to address the general problem. On your comment: "I think the goal to simplify a predicate such as (a > 10 || b > 2) && (a > 10 || c == 3) to (a > 10) || (b > 2 && c == 3), is to eliminate redundant filtering expressions running in Filter in execution time." |
|
ping @cloud-fan @hvanhovell @srinathshankar again, would you please take a look this? Thanks. |
|
ping @cloud-fan @hvanhovell Can you review this if you have time? Thanks! |
|
ping @cloud-fan @hvanhovell @srinathshankar again, please take look if you have time. Thanks! |
|
ping @cloud-fan @hvanhovell @srinathshankar Can you take a look? |
|
@viirya TBH this seems hacky to me and I'd rather not merge this. I think we should just focus on having proper CNF in the optimizer. I am sorry to disappoint you. |
|
@hvanhovell OK. Let's see if we can have a proper CNF soon. Thank you. |
What changes were proposed in this pull request?
Currently some predicates, which should be pushdown, will not be correctly pushdown in
Optimizer.Predicates simplified to the form that can't be push down before they are pushed down
In
Optimizer,Filteroperator will go through the rulesPushDownPredicate,CombineFiltersandBooleanSimplification.Under this rule order, it is possibly that some predicates that should be able to push down, can't be pushed down through operators.
Because
Filterwill not pushdown through anotherFilternode and will wait for combination in the ruleCombineFilterslater. After theFilters are combined,BooleanSimplificationwill simplify conditions in the combinedFilterand make some predicates, which are able to push down, become unable to push down.This is this change wants to fix.
Predicates are in the form unable to push down at the beginning
We may need to come out an approach to maintain multiple forms of predicates which at least can benefit pushdown and expression simplification. This will leave to later PRs. Need more discussion.
Change in this patch
After
Filters are combined inCombineFilters, this change triggersPushDownPredicatesimmediately to push down the combined predicates. So the combined predicates will not be simplified byBooleanSimplificationbefore pushing down.How was this patch tested?
Jenkins tests.