[SPARK-17357][SQL] Fix current predicate pushdown #14912

viirya · 2016-09-01T06:51:46Z

What changes were proposed in this pull request?

Currently, some predicates that should be pushed down are not pushed down correctly by the Optimizer.

Predicates are simplified into a form that can't be pushed down before they are pushed down

In the Optimizer, a Filter operator goes through the rules PushDownPredicate, CombineFilters and BooleanSimplification.

Under this rule order, some predicates that should be pushable can't be pushed down through operators.

This is because a Filter is not pushed down through another Filter node; it waits to be combined by the rule CombineFilters later. After the Filters are combined, BooleanSimplification simplifies the conditions in the combined Filter, and some predicates that could have been pushed down become impossible to push down.

This is what this change wants to fix.
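To make the rule-order problem concrete, here is a minimal, self-contained sketch. It is a toy model, not Catalyst code: the `Pred`, `And` and `Or` types and the `pushable` helper are hypothetical, and it assumes a conjunct can move below an aggregate only when it references nothing but grouping columns. It uses the predicate from the test added in this PR.

```scala
// Toy model only: these types and helpers are hypothetical, not Catalyst APIs.
object PushdownOrderSketch {
  sealed trait Expr
  final case class Pred(text: String, refs: Set[String]) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Columns referenced anywhere in the expression.
  def references(e: Expr): Set[String] = e match {
    case Pred(_, refs) => refs
    case And(l, r)     => references(l) ++ references(r)
    case Or(l, r)      => references(l) ++ references(r)
  }

  // Split on top-level ANDs only.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // Assumption: only conjuncts over grouping columns can be pushed down.
  def pushable(condition: Expr, groupingCols: Set[String]): Seq[Expr] =
    splitConjuncts(condition).filter(c => references(c).subsetOf(groupingCols))

  val a2: Expr  = Pred("a = 2", Set("a"))
  val a3: Expr  = Pred("a = 3", Set("a"))
  val c10: Expr = Pred("c > 10", Set("c"))

  // (a = 2 || a = 3) && (c > 10 || a = 2): the combined, unsimplified form.
  val combined: Expr = And(Or(a2, a3), Or(c10, a2))
  // (a = 2) || (c > 10 && a = 3): the same condition after simplification.
  val simplified: Expr = Or(a2, And(c10, a3))

  def main(args: Array[String]): Unit = {
    println(pushable(combined, Set("a")))   // one conjunct references only a
    println(pushable(simplified, Set("a"))) // nothing left to push
  }
}
```

In this model the combined form still exposes `a = 2 || a = 3` as a separate conjunct, while the simplified form is a single disjunction that mentions both `a` and `c`.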

Predicates are in a form that can't be pushed down from the beginning

We may need to come up with an approach to maintain multiple forms of predicates, which could at least benefit pushdown and expression simplification. This is left to later PRs and needs more discussion.

Change in this patch

After Filters are combined in CombineFilters, this change triggers PushDownPredicate immediately to push down the combined predicates, so the combined predicates are not simplified by BooleanSimplification before they are pushed down.

How was this patch tested?

Jenkins tests.

SparkQA · 2016-09-01T08:54:44Z

Test build #64766 has finished for PR 14912 at commit 9e1c315.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sameeragarwal · 2016-09-02T00:30:06Z

cc @srinathshankar

srinathshankar · 2016-09-02T17:35:45Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala

+  test("push down filters that are combined") {
+    // The following predicate ('a === 2 || 'a === 3) && ('c > 10 || 'a === 2)
+    // will be simplified as ('a == 2) || ('c > 10 && 'a == 3).
+    // ('a === 2 || 'a === 3) can be pushed down. But the simplified one can't.


So what happens if I just have the predicate
(a = 2) || (c > 10 && a = 3)
Will anything be pushed down? Have you considered modifying the boolean simplification logic instead?
Another approach that will catch these cases is as follows:
1.a Convert filters to conjunctive normal form
1.b Combine filters
1.c Push filters
1.a, b and c will be run in a batch until fixed point.
Follow this batch with BooleanSimplification, which can find and extract common factors for efficiency.
Overall, CNF may maximize the potential for filter pushdown.
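As a rough illustration of step 1.a, the following self-contained sketch converts a predicate to conjunctive normal form by distributing OR over AND. The `Pred`, `And`, `Or` types and `toCnf` are hypothetical names, not Catalyst APIs; the sketch assumes the predicate contains no NOT nodes, and full distribution like this can grow exponentially on large predicates.

```scala
// Toy model only: hypothetical types, no NOT handling, no size limit.
object CnfSketch {
  sealed trait Expr
  final case class Pred(text: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Recursively rewrite so that no AND appears underneath an OR.
  def toCnf(e: Expr): Expr = e match {
    case And(l, r) => And(toCnf(l), toCnf(r))
    case Or(l, r)  => distribute(toCnf(l), toCnf(r))
    case p         => p
  }

  // (x && y) || z  =>  (x || z) && (y || z), and symmetrically on the right.
  private def distribute(l: Expr, r: Expr): Expr = (l, r) match {
    case (And(x, y), z) => And(distribute(x, z), distribute(y, z))
    case (x, And(y, z)) => And(distribute(x, y), distribute(x, z))
    case _              => Or(l, r)
  }

  val a2: Expr  = Pred("a = 2")
  val a3: Expr  = Pred("a = 3")
  val c10: Expr = Pred("c > 10")

  def main(args: Array[String]): Unit = {
    // (a = 2) || (c > 10 && a = 3)
    println(toCnf(Or(a2, And(c10, a3))))
  }
}
```

For the predicate in question this yields `(a = 2 || c > 10) && (a = 2 || a = 3)`, whose second conjunct references only `a` and is therefore a candidate for pushdown.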

viirya · 2016-09-03

Yeah, as I mentioned in the description, this is currently the simplest way to prevent predicates that can be pushed down from becoming impossible to push down.

Your case can't be pushed down from the beginning. This patch currently doesn't help with it.

Because the optimization rules are independent, the boolean simplification logic is just a general rule to simplify predicates and is not aware of the pushdown logic. Basically, boolean simplification looks good as it is now, and it makes sense to do (a > 10 || a < 100) && (a > 10 || b == 5) => (a > 10) || (a < 100 && b == 5); however, it causes the pushdown issue.
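The simplification described here can be sketched as common-factor extraction over two disjunctions. This is a simplified, hypothetical stand-in for what BooleanSimplification does, not its actual implementation; `factor`, `splitDisjuncts` and the expression types are names made up for the example.

```scala
// Toy model only: hypothetical types and helpers, not the Catalyst rule.
object FactorSketch {
  sealed trait Expr
  final case class Pred(text: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Split on top-level ORs only.
  def splitDisjuncts(e: Expr): Seq[Expr] = e match {
    case Or(l, r) => splitDisjuncts(l) ++ splitDisjuncts(r)
    case other    => Seq(other)
  }

  private def orAll(es: Seq[Expr]): Expr =
    es.reduceLeft[Expr]((x, y) => Or(x, y))

  // (a || b) && (a || c)  =>  a || (b && c)
  def factor(l: Expr, r: Expr): Expr = {
    val lhs = splitDisjuncts(l)
    val rhs = splitDisjuncts(r)
    val common = lhs.filter(x => rhs.contains(x))
    if (common.isEmpty) And(l, r)
    else {
      val ldiff = lhs.filterNot(x => common.contains(x))
      val rdiff = rhs.filterNot(x => common.contains(x))
      // (a || b) && a  =>  a
      if (ldiff.isEmpty || rdiff.isEmpty) orAll(common)
      else orAll(common :+ And(orAll(ldiff), orAll(rdiff)))
    }
  }

  val aGt10: Expr  = Pred("a > 10")
  val aLt100: Expr = Pred("a < 100")
  val bEq5: Expr   = Pred("b == 5")

  def main(args: Array[String]): Unit =
    println(factor(Or(aGt10, aLt100), Or(aGt10, bEq5)))
}
```

The result has fewer sub-expressions to evaluate, but it is a single disjunction over both `a` and `b`, so it no longer splits into conjuncts that can be pushed independently.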

Your other approach makes sense to me. I have thought about it; I just don't know whether it is necessary to implement it for this corner case, because it needs more code changes.

If it is acceptable, I will implement it. Thank you.

viirya · 2016-09-05

Considering how the Optimizer works, we can't extract CombineFilters and PushDownPredicate into a new batch, as we should also respect the interaction between them and other rules. I took an alternative approach: convert the predicates of filters to CNF while combining filters, and then perform an additional predicate pushdown immediately. That way the following BooleanSimplification will not affect the predicate pushdown.

srinathshankar · 2016-09-07

I agree with you that we should respect the interaction between CombineFilters, PushDownPredicate and other rules. I do think it's important that CNF conversion runs before any of the push-down / reordering rules, and the simplification rules should run afterwards.
My concern with rolling this into CombineFilters is that it doesn't get triggered unless there are adjoining Filter nodes. In the example you have:
val originalQuery = testRelation
  .select('a, 'b, ('c + 1) as 'cc)
  .groupBy('a)('a, count('cc) as 'c)
  .where('c > 10)
  .where(('a === 2) || ('c > 10 && 'a === 3))

I think that (a == 2 || a == 3) should get pushed down even if you don't have .where('c > 10),
but I'm not sure that it will be, since toCNF is in CombineFilters. Could you confirm?
My suggestion is that toCNF warrants a separate rule. For example, when you're doing joins and you have
select * from A inner join C on (A.a1 = C.c1) where A.a2 = 2 || (C.c2 = 10 && A.a2 = 3),
you want (A.a2 = 2 || A.a2 = 3) pushed down into A.

viirya · 2016-09-07

You are right. It is only triggered when there are adjoining Filters. So in the above example, the predicate (a == 2 || a == 3) will not be pushed down when there is no .where('c > 10).

SparkQA · 2016-09-05T08:44:16Z

Test build #64929 has finished for PR 14912 at commit 8f6f91d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-09-05T08:49:38Z

@srinathshankar I've addressed your comments. Please take a look. Thanks.

gatorsmile · 2016-09-07T03:12:57Z

@viirya Could you please wait for the CNF predicate normalization rule? @liancheng and @yjshen did some related work before. See #10444 and #8200.

Let us also collect input from @ioana-delaney and @nsyca. They have done a lot of related work over the past 10+ years. We need a good design for CNF normalization, which can benefit the other optimizer rules.

viirya · 2016-09-07T03:22:17Z

Hmm, it looks like there is previous work on CNF, but none of it was actually merged. @gatorsmile Thanks for the context.

viirya · 2016-09-07T03:48:11Z

The CNF exponential expansion issue is an important concern in the previous work. Actually, you can see that this patch doesn't produce a real CNF for a predicate. I use splitDisjunctivePredicates to obtain the disjunctive predicates and convert them to conjunctive form. The conversion here is not recursive, which I think should prevent exponential explosion. Of course it is a compromise and can't benefit all predicates. But I doubt that complex predicates needing a complete CNF conversion are used often.
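One possible reading of this non-recursive conversion is sketched below. It is not taken from the patch: `toConjunctsOnce` and the expression types are hypothetical, and the local `splitDisjuncts` and `splitConjuncts` helpers only mimic the top-level splitting that the Catalyst helpers of similar name perform. Each top-level disjunct is split into its top-level conjuncts once, and the conjuncts are then combined pairwise across disjuncts; nested ORs inside those conjuncts are left untouched.

```scala
// Toy model only: hypothetical types and helpers, not the code in this PR.
object OneLevelSketch {
  sealed trait Expr
  final case class Pred(text: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  def splitDisjuncts(e: Expr): Seq[Expr] = e match {
    case Or(l, r) => splitDisjuncts(l) ++ splitDisjuncts(r)
    case other    => Seq(other)
  }

  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // One level of distribution: pick one conjunct from every disjunct
  // and OR the picks together. Conjuncts are never expanded further.
  def toConjunctsOnce(e: Expr): Seq[Expr] = {
    val choices: Seq[Seq[Expr]] = splitDisjuncts(e).map(splitConjuncts)
    val combos = choices.foldLeft(Seq(Seq.empty[Expr])) { (acc, cs) =>
      for (prefix <- acc; c <- cs) yield prefix :+ c
    }
    combos.map(_.reduceLeft[Expr]((x, y) => Or(x, y)))
  }

  val aGt10: Expr = Pred("a > 10")
  val bGt2: Expr  = Pred("b > 2")
  val cEq3: Expr  = Pred("c == 3")

  def main(args: Array[String]): Unit =
    // (a > 10) || (b > 2 && c == 3)
    toConjunctsOnce(Or(aGt10, And(bGt2, cEq3))).foreach(println)
}
```

In this sketch the number of resulting conjuncts is the product of the conjunct counts of the disjuncts, so skipping recursion bounds the depth of the rewrite rather than its width.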

viirya · 2016-09-09T04:02:17Z

@srinathshankar @gatorsmile I think CNF is a separate issue, other than the one this PR was proposed to solve in the first place. I would like to solve the original adjoining-Filter pushdown problem here, and leave the CNF issue (it is not trivial and I don't expect it to be solved soon) for later PRs.

What do you think? Thanks.

gatorsmile · 2016-09-09T05:38:07Z

Could you define the conditions under which predicates are unable to be pushed down? Then we can easily justify the significance.

viirya · 2016-09-09T06:57:18Z

@gatorsmile I've described it in the PR description.

Simply put, a Filter currently stops being pushed down once it encounters another Filter. The BooleanSimplification rule then simplifies the predicate into a form that can't be pushed down in the next round of optimization. For example, (a > 10 || b > 2) && (a > 10 || c == 3) will be simplified to (a > 10) || (b > 2 && c == 3).

What this patch does is perform PushDownPredicate once the adjoining Filters are merged, so the predicates that were not pushed down can be pushed down again.

viirya · 2016-09-09T08:42:47Z

also cc @cloud-fan

gatorsmile · 2016-09-10T18:58:53Z

I am wondering whether it makes more sense to maintain multiple semantically equivalent predicate sets for each Filter. In your example, we have both (a > 10 || b > 2) && (a > 10 || c == 3) and (a > 10) || (b > 2 && c == 3). If we also consider predicate transitivity inference and predicate simplification at the same time, we could have multiple semantically equivalent predicate sets. Then we have more chances to push down the predicates.

viirya · 2016-09-12T03:37:48Z

Maintaining the predicate sets may add a lot of complexity, as far as I can tell. I don't know how big the set could be. But once you change one of the predicates, you need to reconstruct all the equivalent predicates in the set too. I think we can maintain the CNF and the simplified predicates. CNF should be enough to push down predicates, and the simplified predicate can be used in Filter execution.

nsyca · 2016-09-12T15:17:36Z

Thanks, @gatorsmile, for mentioning me. I will try my best to comment on this thread. Disclaimer: I have not looked at the existing code manipulating predicates/expressions in Spark, nor have I looked at the code in this PR. I am writing my comment here based solely on the comments I read in this PR.

One of the goals of predicate transformation, in general, is to aid predicate pushdown. If a new form of a predicate, or a derived superset of a predicate, is to be generated, it should be because the new or derived form can potentially be pushed further down the plan.

Another goal of the transformation is to produce a new form that has the potential to be simplified further.

Taking the example of (a > 10 || b > 2) && (a > 10 || c == 3), I don't see any benefit in transforming it to (a > 10) || (b > 2 && c == 3), as that forms a disjunctive predicate. Only if b == c holds by the transitivity rule might we want to do that, in order to simplify further to (a > 10) || (c == 3), because with b == c, c > 2 && c == 3 can be reduced to c == 3.

The greatest benefit in predicate transformation comes from the equality transitivity property, as equality predicates are commonly used in SQL queries. I remember a few JIRAs were opened, but deferred, to solve this problem. There is some capability in the current version to propagate equality transitivity, but the behaviour is not consistent.

Another predicate transformation is extracting common subterms. For example, the predicate (a=1 && b=2) || (a=1 && c=4), where a is a column from a different stream than columns b and c, should be transformed to a=1 && (b=2 || c=4). A more complex case is the predicate (a=1 && b=2) || (a=3 && c=4), which should have a new predicate (a=1 || a=3) added as a superset predicate to filter the stream of a early, down to just the two values needed.
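A superset predicate of this kind can be derived mechanically from a disjunction of conjunctions such as (a=1 && b=2) || (a=3 && c=4). The sketch below is a hypothetical illustration, not Spark code: `deriveFor` and the expression types are made-up names. For each disjunct it keeps only the conjuncts over the target columns; since every disjunct implies its kept part, the OR of the kept parts is implied by the whole predicate. If any disjunct has nothing to keep, no useful predicate can be derived.

```scala
// Toy model only: hypothetical types and helper names.
object SupersetSketch {
  sealed trait Expr
  final case class Pred(text: String, refs: Set[String]) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  def references(e: Expr): Set[String] = e match {
    case Pred(_, refs) => refs
    case And(l, r)     => references(l) ++ references(r)
    case Or(l, r)      => references(l) ++ references(r)
  }

  def splitDisjuncts(e: Expr): Seq[Expr] = e match {
    case Or(l, r) => splitDisjuncts(l) ++ splitDisjuncts(r)
    case other    => Seq(other)
  }

  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other     => Seq(other)
  }

  // Returns a predicate over `cols` that every row passing `e` also passes.
  def deriveFor(e: Expr, cols: Set[String]): Option[Expr] = {
    val kept = splitDisjuncts(e).map { d =>
      splitConjuncts(d).filter(c => references(c).subsetOf(cols))
    }
    if (kept.exists(_.isEmpty)) None
    else Some(
      kept.map(_.reduceLeft[Expr]((x, y) => And(x, y)))
          .reduceLeft[Expr]((x, y) => Or(x, y)))
  }

  val a1: Expr = Pred("a = 1", Set("a"))
  val a3: Expr = Pred("a = 3", Set("a"))
  val b2: Expr = Pred("b = 2", Set("b"))
  val c4: Expr = Pred("c = 4", Set("c"))

  def main(args: Array[String]): Unit =
    println(deriveFor(Or(And(a1, b2), And(a3, c4)), Set("a")))
}
```

The derived predicate is redundant with the original one, so the original filter still has to be evaluated above; the derived one only reduces the rows flowing out of the stream of `a`.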

Introducing redundant superset predicates like the last example above will complicate computing the filter ratios of the predicates on a given stream when we introduce Cost-based Optimization, which I assume depends on a good estimate of filter ratios on a given stream. This is because we cannot assume that the filtering effects of a set of predicates are independent. Here the filter ratio of the newly generated superset predicate should be ignored in the filtering estimate.

Another goal of predicate transformation is to derive contradictions and/or tautologies. This is achieved by building the inequality relationships on the same column across a set of predicates. A simple example is a > 1 && a < 1, which should be evaluated to false at compile time, eliminating the scan of the stream completely. The stream is treated as producing an empty set. Depending on the context, the stream may be substituted by a NULL row when it is a subquery in an existential (EXISTS) or a universal (ALL) subquery, or a singleton NULL value when it is a scalar subquery.
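A minimal sketch of such a contradiction check follows. It is a hypothetical illustration, not an existing Spark rule: it assumes integer columns, handles only strict `>` and `<` comparisons against literals inside one conjunction, and ignores NULL semantics. `GreaterThan`, `LessThan` and `isContradiction` are names made up for the example.

```scala
// Toy model only: integer columns, strict bounds, no NULL handling.
object ContradictionSketch {
  sealed trait Bound { def col: String }
  final case class GreaterThan(col: String, value: Int) extends Bound
  final case class LessThan(col: String, value: Int) extends Bound

  // A conjunction of bounds is unsatisfiable when, for some column, the
  // tightest lower bound leaves no integer below the tightest upper bound.
  def isContradiction(conjuncts: Seq[Bound]): Boolean =
    conjuncts.groupBy(_.col).values.exists { bounds =>
      val lowers = bounds.collect { case GreaterThan(_, v) => v }
      val uppers = bounds.collect { case LessThan(_, v) => v }
      lowers.nonEmpty && uppers.nonEmpty &&
        lowers.max.toLong + 1 >= uppers.min.toLong
    }

  def main(args: Array[String]): Unit = {
    // a > 1 && a < 1
    println(isContradiction(Seq(GreaterThan("a", 1), LessThan("a", 1))))
    // a > 1 && a < 3
    println(isContradiction(Seq(GreaterThan("a", 1), LessThan("a", 3))))
  }
}
```

When the check succeeds, an optimizer could replace the filtered stream with an empty relation instead of scanning it.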

viirya · 2016-09-13T03:41:04Z

@nsyca Thanks for your detailed comment. I would like to leave the decision on predicate transformation to later PRs, as this PR is not motivated by it.

I think the goal of simplifying a predicate such as (a > 10 || b > 2) && (a > 10 || c == 3) to (a > 10) || (b > 2 && c == 3) is to eliminate redundant filtering expressions that would otherwise run in the Filter at execution time.

As I said in my earlier comment, my first preference is not to complicate predicate handling too much. We can keep the form of the predicate that benefits predicate pushdown most; I guess that form should be CNF. We can also keep the simplified form of the predicate, which is better for execution in the Filter.

viirya · 2016-09-13T03:52:56Z

ping @srinathshankar @cloud-fan @hvanhovell Can you help review this change?

Some context here:

Some predicates can't be pushed down because:

1. Predicates are simplified into a form that can't be pushed down

A Filter can't be pushed down through another Filter, so the predicates in the Filter are simplified into a form that can't be pushed down in the next optimization round. This is what this change wants to fix. This change triggers PushDownPredicate in CombineFilters, so combined predicates can be pushed down before the BooleanSimplification rule runs.

2. Predicates are in a form that can't be pushed down from the beginning

We may need to come up with an approach to maintain multiple forms of predicates, which could at least benefit pushdown and expression simplification. This is left to later PRs and needs more discussion.

SparkQA · 2016-09-13T06:06:18Z

Test build #65298 has finished for PR 14912 at commit f69473f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nsyca · 2016-09-13T16:11:39Z

@viirya, I agree that we need a separate set of PRs to address the general problem.

On your comment: "I think the goal to simplify a predicate such as (a > 10 || b > 2) && (a > 10 || c == 3) to (a > 10) || (b > 2 && c == 3), is to eliminate redundant filtering expressions running in Filter in execution time."
My two cents: If that is the case, deferring the simplification to the point right before execution would be an option to consider.

viirya · 2016-09-16T02:20:52Z

ping @cloud-fan @hvanhovell @srinathshankar again, would you please take a look at this? Thanks.

viirya · 2016-09-22T02:15:18Z

ping @cloud-fan @hvanhovell Can you review this if you have time? Thanks!

viirya · 2016-09-26T04:59:59Z

ping @cloud-fan @hvanhovell @srinathshankar again, please take a look if you have time. Thanks!

viirya · 2016-10-05T12:47:00Z

ping @cloud-fan @hvanhovell @srinathshankar Can you take a look?

hvanhovell · 2016-10-05T21:09:52Z

@viirya TBH this seems hacky to me and I'd rather not merge this. I think we should just focus on having proper CNF in the optimizer. I am sorry to disappoint you.

viirya · 2016-10-06T01:51:45Z

@hvanhovell OK. Let's see if we can have a proper CNF soon. Thank you.

Simplified predicates should be pushdown.

9e1c315

viirya mentioned this pull request Sep 1, 2016

[SPARK-16849][SQL] Improve subquery execution by deduplicating the subqueries with the same results #14452

Closed

srinathshankar reviewed Sep 2, 2016

Address comment. Consider more general case.

8f6f91d

viirya changed the title ~~[SPARK-17357][SQL] Simplified predicates should be pushed down through operators~~ [SPARK-17357][SQL] Fix current predicate pushdown Sep 5, 2016

Focus on the first problem of predicate pushdown.

f69473f

viirya closed this Oct 6, 2016

viirya mentioned this pull request Oct 20, 2016

[SPARK-17357][SPARK-6624][SQL] Convert filter predicate to CNF in Optimizer for pushdown #15558

Closed

viirya deleted the simplified-predicate-pushdown branch December 27, 2023 18:19
