Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Sep 1, 2016

What changes were proposed in this pull request?

Currently some predicates, which should be pushdown, will not be correctly pushdown in Optimizer.

Predicates simplified to the form that can't be push down before they are pushed down

In Optimizer, Filter operator will go through the rules PushDownPredicate, CombineFilters and BooleanSimplification.

Under this rule order, it is possibly that some predicates that should be able to push down, can't be pushed down through operators.

Because Filter will not pushdown through another Filter node and will wait for combination in the rule CombineFilters later. After the Filters are combined, BooleanSimplification will simplify conditions in the combined Filter and make some predicates, which are able to push down, become unable to push down.

This is this change wants to fix.

Predicates are in the form unable to push down at the beginning

We may need to come out an approach to maintain multiple forms of predicates which at least can benefit pushdown and expression simplification. This will leave to later PRs. Need more discussion.

Change in this patch

After Filters are combined in CombineFilters, this change triggers PushDownPredicates immediately to push down the combined predicates. So the combined predicates will not be simplified by BooleanSimplification before pushing down.

How was this patch tested?

Jenkins tests.

@SparkQA
Copy link

SparkQA commented Sep 1, 2016

Test build #64766 has finished for PR 14912 at commit 9e1c315.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sameeragarwal
Copy link
Member

cc @srinathshankar

test("push down filters that are combined") {
// The following predicate ('a === 2 || 'a === 3) && ('c > 10 || 'a === 2)
// will be simplified as ('a == 2) || ('c > 10 && 'a == 3).
// ('a === 2 || 'a === 3) can be pushed down. But the simplified one can't.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what happens if I just have the predicate
(a = 2) || (c > 10 && a = 3)
Will anything will be pushed down ? Have you considered instead modifying the boolean simplification logic.
Another approach that will catch these cases is as follows:
1.a Convert filters to conjunctive normal form
1.b combine filters
1.c Push filters
1.a, b and c will be run in a batch until fixed point.
Follow this batch by BooleanSimplification -- this can find and extract common factors for efficiency.
Overall, cnf may maximize the potential for filter push down

Copy link
Member Author

@viirya viirya Sep 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, as I mentioned in the description, this is currently a simplest to prevent the predicates which can be pushed down becoming not pushed down.

Your case is not pushed down at the beginning. This patch currently doesn't help it.

Because the optimization rules are independent, boolean simplification logic is just a general rule to simplify predicates, and doesn't be aware of the pushdown logic. Basically boolean simplification now looks good and it makes sense to do (a > 10 || a < 100) && (a > 10 || b == 5) => (a > 10) || (a < 100 && b == 5), however, it causes the pushdown issue.

Your another approach makes sense to me. I have thought about this, just don't know if it is necessary to come out it for this corner case, because it needs more code changes.

If it is acceptable, I will implement it. Thank you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering how the Optimizer works, we can't extract CombineFilters and PushDownPredicates as a new batch, as we should also respect the interaction between them and other rules. I do an alternative approach to convert predicates of filters to cnf during combining filters, and then perform additional predicate pushdown immediately. So the following BooleanSimplification will not affect the predicate pushdown.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you that we should respect the interaction between CombineFilters, PushDownPredicates and other rules. I do think it's important that cnf conversion run before any of the push-down / reordering rules. And the simplification rules should run afterwards.
My concern with rolling this into CombineFilters is that it doesn't get triggered unless there are adjoining Filter nodes. In the example you have:
val originalQuery = testRelation
.select('a, 'b, ('c + 1) as 'cc)
.groupBy('a)('a, count('cc) as 'c)
.where('c > 10)
.where(('a === 2) || ('c > 10 && 'a === 3))

I think that (a == 2 || a==3) should get pushed down even if you don't have ".where (c > 10)",
but I'm not sure that it will be since toCNF is in CombineFilters. Could you confirm ?
My suggestion is that toCNF warrants a separate rule -- for example when you're doing joins, and you have
select * from A inner join C on (A.a1 = C.c1) where A.a2 = 2 || (C.c2 = 10 && A.a2 = 3),
you want (A.a2 = 2 || A.a2 = 3) pushed down into A

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. It is only triggered when adjoining Filters are there. So in above example, the predicate (a == 2 || a==3) will not be pushed down when there is no .where(c > 10).

@viirya viirya changed the title [SPARK-17357][SQL] Simplified predicates should be pushed down through operators [SPARK-17357][SQL] Fix current predicate pushdown Sep 5, 2016
@SparkQA
Copy link

SparkQA commented Sep 5, 2016

Test build #64929 has finished for PR 14912 at commit 8f6f91d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Sep 5, 2016

@srinathshankar I've addressed your comments. Please take a look. Thanks.

@gatorsmile
Copy link
Member

@viirya Could you please wait for the CNF predicate normalization rule? @liancheng @yjshen did a few related work before. See #10444 and #8200.

Let us also collect the inputs from @ioana-delaney @nsyca . They did a lot of related work in the past 10+ years. We need a good design about CNF normalization, which can benefit the other optimizer rules.

@viirya
Copy link
Member Author

viirya commented Sep 7, 2016

hmm, looks like there are previous works regarding CNF but none of them are really merged. @gatorsmile Thanks for the context.

@viirya
Copy link
Member Author

viirya commented Sep 7, 2016

The CNF exponential expansion issue is an important concern in previous works. Actually you can find that this patch doesn't produce a real CNF for predicate. I use splitDisjunctivePredicates to obtain disjunctive predicates and convert them to conjunctive form. The conversion here is not recursive. I think this should prevent exponential explosion. Of course it is a compromise and can't benefit for all predicates. But I would suspect how often a complex predicate that needs complete conversion of CNF is used.

@viirya
Copy link
Member Author

viirya commented Sep 9, 2016

@srinathshankar @gatorsmile I think CNF is another issue other then the issue this PR was proposed to solve at the first. I would like to solve the original adjoining Filter pushdown problem here. And leave CNF issue (it is not trivial and I don't expect it will be solved soon) for later PRs.

What do you think? Thanks.

@gatorsmile
Copy link
Member

Could you define the conditions in which the predicates are unable to be pushed down? Then, we can easily justify the significance.

@viirya
Copy link
Member Author

viirya commented Sep 9, 2016

@gatorsmile I've described it in the pr description.

Simply said, now a Filter will be stopped to pushdown once it encounters another Filter. BooleanSimplification rule will simplify the predicate to a form that can't be pushed down in next round of optimization. For example, (a > 10 || b > 2) && (a > 10 || c == 3) will be simplified as (a > 10) || (b > 2 && c == 3).

This patch does is to perform PushDownPredicate once the adjoining Filters are merged. So the predicates which are not pushed down can be pushed down again.

@viirya
Copy link
Member Author

viirya commented Sep 9, 2016

also cc @cloud-fan

@gatorsmile
Copy link
Member

I am thinking whether it makes more sense to maintain multiple semantically equivalent predicate sets for each Filter. In your example, we have both (a > 10 || b > 2) && (a > 10 || c == 3) and (a > 10) || (b > 2 && c == 3). If we also considering the predicate transitivity inferences and predicate simplication at the same time, we could have multiple semantically equivalent predicate sets. Then, we have more chances to push down the predicates.

@viirya
Copy link
Member Author

viirya commented Sep 12, 2016

To maintain the predicate sets may increase much complexity as I can think. I don't know how big the set could be. But once you change one of the predicates, you need to construct all equivalent predicates in the set too. I think we can maintain CNF and simplification predicates. CNF should be enough to push down predicates and simplification predicate can be used in Filter execution.

@nsyca
Copy link
Contributor

nsyca commented Sep 12, 2016

Thanks, @gatorsmile, for mentioning me. I will try my best to comment on this thread. Disclaimer: I have not looked at the existing code manipulating predicates/expressions in Spark. Nor have I the code in this PR. I am writing my comment here based solely on the comments I read in this PR.

One of the goals of predicate transformation, in general, is to aid the predicate pushdown. If a new form of a predicate, or a derived form of a superset of a predicate is to be generated, it should be because there is a potential the new form or the derived form can be pushed down further the plan.

Another goal of the transformation is because the new form has a potential to be simplified further.

Taking the example of (a > 10 || b > 2) && (a > 10 || c == 3), I don't see any benefit of transforming to (a > 10) || (b > 2 && c == 3) as it will form a disjunctive predicate. If only b == c by transitivity rule then we may want to do that in order to simplify further to (a > 10 || c == 3 (because b == c and c > 2 && c == 3 can be reduced to c == 3.

The most benefit in the topic of predicate transformation is the equality transitivity property as equality predicates are commonly used in SQL queries. I remember there were a few JIRAs opened, but deferred, to solve this problem. There are some capability in the current version to propagate the equality transitivity but the behaviour is not consistent.

Predicate transformation like extracting common subterms. An example is the predicate (a=1 || b=2) && (a=1 || c=4) and a is a column from a different stream of columns b and c should be transformed to a=1 && (b=2 || c=4). A more complex case is the predicate (a=1 || b=2) && (a=3 || c=4) should have a new predicate (a=1 || a=3) added as a superset predicate to early filter the stream of a to just the two values needed.

Introducing superset, redundant predicates like the last example above will complicate the computation of filter ratios of the predicates on a given stream when we introduce the Cost-based Optimization, which I assume depends on a good estimate of filter ratios on a given stream. This is because we cannot make assumption on the independent filtering affects among a set of predicates. Here the filter ratio of the newly generated superset predicate should be ignored in the filtering estimate.

Another goal of predicate transformation is to derive contradiction and/or tautology. This is achieved by building the inequality relationships among the same column of a set of predicate. A simple example is a>1 && a < 1 should be evaluated to false at the compile time and eliminate the scan of the stream completely. The stream is treated like producing an empty set. Depending on the context, the stream may be substituted by a NULL row when it is a subquery in an existential (EXISTS) or a universal (ALL) subquery, or a singleton NULL value when it is a scalar subquery.

@viirya
Copy link
Member Author

viirya commented Sep 13, 2016

@nsyca Thanks for your detailed comment. I would like to leave the decision of predicate transformation to later PRs, as this PR is not motivated by this.

I think the goal to simplify a predicate such as (a > 10 || b > 2) && (a > 10 || c == 3) to (a > 10) || (b > 2 && c == 3), is to eliminate redundant filtering expressions running in Filter in execution time.

As I said in before comment, my first opinion is not to complicate the predicate handling too much. We can keep a form of predicate which benefits predicate pushdown most, I guess the form should be CNF. We can also keep the simplified form of predicate which is better for execution in Filter.

@viirya
Copy link
Member Author

viirya commented Sep 13, 2016

ping @srinathshankar @cloud-fan @hvanhovell Can you help review this change?

Some context here:

Some predicates are unable to push down because:

  1. Predicates are simplified to the form which is not able to push down

    Filter can't push down through Filter. So the predicates in Filter will be simplified to the form unable to push down in next optimizing round. This is this change wants to fix. This change triggers PushdownPredicate in CombineFilters. So combined predicates can be pushed down before BooleanSimplification rule.

  2. Predicates are in the form unable to push down at the beginning

    We may need to come out an approach to maintain multiple forms of predicates which at least can benefit pushdown and expression simplification. This will leave to later PRs. Need more discussion.

@SparkQA
Copy link

SparkQA commented Sep 13, 2016

Test build #65298 has finished for PR 14912 at commit f69473f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nsyca
Copy link
Contributor

nsyca commented Sep 13, 2016

@viirya, I agree that we need a separate set of PRs to address the general problem.

On your comment: "I think the goal to simplify a predicate such as (a > 10 || b > 2) && (a > 10 || c == 3) to (a > 10) || (b > 2 && c == 3), is to eliminate redundant filtering expressions running in Filter in execution time."
My two cents: If that is the case, deferring the simplification to the point just right before the execution time would be an option to consider.

@viirya
Copy link
Member Author

viirya commented Sep 16, 2016

ping @cloud-fan @hvanhovell @srinathshankar again, would you please take a look this? Thanks.

@viirya
Copy link
Member Author

viirya commented Sep 22, 2016

ping @cloud-fan @hvanhovell Can you review this if you have time? Thanks!

@viirya
Copy link
Member Author

viirya commented Sep 26, 2016

ping @cloud-fan @hvanhovell @srinathshankar again, please take look if you have time. Thanks!

@viirya
Copy link
Member Author

viirya commented Oct 5, 2016

ping @cloud-fan @hvanhovell @srinathshankar Can you take a look?

@hvanhovell
Copy link
Contributor

@viirya TBH this seems hacky to me and I'd rather not merge this. I think we should just focus on having proper CNF in the optimizer. I am sorry to disappoint you.

@viirya
Copy link
Member Author

viirya commented Oct 6, 2016

@hvanhovell OK. Let's see if we can have a proper CNF soon. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants