[SPARK-14172][SQL] Hive table partition predicate not passed down correctly #13893
Conversation
Likely there's a bug. I think you want to use "substitutedFilters" instead of "deterministicFilters" here. I think you can also add a test with substitution for it.
@AlekseiS You are right! I have fixed this, thank you!
@cloud-fan could you please have a look at this PR?
I'm not sure if it's safe to push it down. For non-deterministic expressions, the order (or number) of input rows matters. If we push down the deterministic part of a filter condition, the input rows to the remaining filter condition will change and may produce a wrong answer.
cc @liancheng @yhuai
Good question. I think technically we can't push down any predicates that are placed after a non-deterministic predicate; otherwise the number of input rows may change and lead to wrong query results.
Yes, you are right. I thought the deterministic part could always be pushed down safely, but it cannot; in fact, the order of each part should also be considered. For example, `rand() < 0.01 AND partition_col = 'some_value'` should not be pushed down, but `partition_col = 'some_value' AND rand() < 0.01` still could be.
Thank you for your kind reply!
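The ordering point above can be sketched with a minimal, hypothetical model in plain Scala (not Spark code): a "nondeterministic" predicate is modeled as a stateful function whose outcome depends on how many rows it has already seen, just as `rand()` consumes its random stream per row. All names here are illustrative.

```scala
// Models a nondeterministic predicate: its result depends on its call count.
def everyOther(): Int => Boolean = {
  var calls = -1
  _ => { calls += 1; calls % 2 == 0 } // keeps the 1st, 3rd, 5th ... row it sees
}

val rows = Seq(1, 2, 1, 2, 1, 2)

// `nondet(r) AND r == 1`: the nondeterministic part runs first, on every row.
val nd1 = everyOther()
val original = rows.filter(r => nd1(r) && r == 1)

// Pushing `r == 1` below the nondeterministic predicate changes which rows
// (and how many) the stateful predicate sees, so the result changes.
val nd2 = everyOther()
val pushedDown = rows.filter(_ == 1).filter(nd2)

// `r == 1 AND nondet(r)`: the deterministic part already runs first, so
// pushing it down is safe -- the nondeterministic predicate sees the same
// rows either way.
val nd3 = everyOther()
val safeOrder = rows.filter(r => r == 1 && nd3(r))
```

Running this, `original` and `pushedDown` differ, while `safeOrder` matches `pushedDown`, mirroring the two orderings discussed above.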
@liancheng I think partition predicates are a bit different. If you explicitly specify a partition predicate, like `date = '2016-06-27'`, do you really expect other partitions to be scanned, regardless of whether you use a non-deterministic function? Most likely not, so if a partition filter is specified and it is deterministic, it is expected to always be used.
@AlekseiS I think we should always consider the risk and take care of it even if it's not a common case. We can't assume what users expect, and should prepare for the worst case.
@cloud-fan I pushed a commit that applies predicate pushdown only to the deterministic parts placed before any non-deterministic predicates. Should it be safe to do this optimization?
No, the predicate order doesn't matter. Our optimizer can reorder predicates to run them more efficiently.
Predicates should not be reordered if a condition contains non-deterministic parts. For example, `rand() < 0.1 AND a = 1` should not be reordered to `a = 1 AND rand() < 0.1`, as the number of calls to rand() would change and thus output different rows. @cloud-fan @liancheng
It's a good point; it looks like we can also improve the PushDownPredicate rule.
If PushDownPredicate should be improved, I would like to send a PR in one or two days. Should I open a separate JIRA to track that issue? @cloud-fan
@jiangxb1987 Please feel free to create a new JIRA ticket and PR for this, thanks!
With PR #14012, the order between deterministic and non-deterministic predicates will no longer be changed arbitrarily, so I think we can apply this improvement, which pushes down the predicates placed before the non-deterministic parts of partition conditions, so that we can do partition pruning even when a condition contains non-deterministic parts. @liancheng @cloud-fan
retest this please
Test build #62596 has finished for PR 13893 at commit
ping @cloud-fan
```scala
// Deterministic parts of filter condition placed before non-deterministic predicates could
// be pushed down safely.
val (pushDown, rest) =
```
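The split hinted at in the snippet above can be sketched with `span`: only the deterministic prefix of the conjuncts, up to the first non-deterministic one, may be pushed down. The `Pred` class and names here are illustrative stand-ins, not Catalyst's real API.

```scala
// Illustrative stand-in for a predicate expression with a determinism flag.
case class Pred(sql: String, deterministic: Boolean)

// Take the deterministic prefix; everything from the first nondeterministic
// conjunct onward stays where it is, preserving evaluation order.
def splitPushable(conjuncts: Seq[Pred]): (Seq[Pred], Seq[Pred]) =
  conjuncts.span(_.deterministic)

val conjuncts = Seq(
  Pred("partition_col = 'some_value'", true),
  Pred("rand() < 0.1", false),
  Pred("a = 1", true))

val (pushDown, rest) = splitPushable(conjuncts)
// Only the partition predicate is pushed; `a = 1` stays behind the
// nondeterministic conjunct because pushing it would reorder evaluation.
```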
After thinking about it more, I think it's not safe to do so. collectProjectsAndFilters should return all deterministic projects and filters above a scan node, and the returned filter conditions are not only used for filter pushdown but are also treated as the whole set of filters above this scan node. So the rest of the conditions here won't get executed.
cc @liancheng to confirm this.
@cloud-fan Thanks for your comment! But I did search the codebase and found the returned filters are only used for predicate pushdown or partition pruning; in both cases it should be safe for us to drop the rest of the condition. Thank you!
Can you write a test for this? The logic in DataSourceStrategy shows that, when we get a scan node with the projects and filters above it, we rebuild the project and filter (with project lists and filter conditions merged) and wrap the scan node with them. So any filter condition that isn't returned by collectProjectsAndFilters won't get executed.
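A toy illustration (assumed names, not DataSourceStrategy's real code) of why conditions omitted from the returned filters are silently lost: the planner rebuilds the Filter solely from what was returned, so anything left out simply never runs.

```scala
// A tiny stand-in for a table row and its filter conditions.
case class Row(part: String, a: Int)

val data = Seq(Row("x", 1), Row("x", 2), Row("y", 1))

// The full condition on the query: part = 'x' AND a > 1.
val allConditions: Seq[Row => Boolean] =
  Seq(_.part == "x", _.a > 1)

// The planner rebuilds the filter only from the conditions it was handed.
def rebuild(filters: Seq[Row => Boolean]): Seq[Row] =
  data.filter(r => filters.forall(_(r)))

val correct = rebuild(allConditions)         // both conditions execute
val dropped = rebuild(allConditions.take(1)) // the omitted filter never runs
// `dropped` wrongly keeps Row("x", 1): the a > 1 condition was lost.
```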
I also think that silently dropping non-deterministic filters can be dangerous. Maybe we should just return all operators beneath the top-most non-deterministic filter as the bottom operator.
For example, say we have a plan tree like this:

```
Project a, b
  Filter a > 1
    Filter b < 3
      Filter RAND(42) > 0.5
        Filter c < 2
          TableScan t
```
We should return the following result:

```scala
(
  // Project list
  Seq(a, b),
  // Deterministic filters
  Seq(b < 3, a > 1),
  // The top-most nondeterministic filter with all operators beneath
  Filter(RAND(42) > 0.5,
    Filter(c < 2,
      TableScan(t)))
)
```
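The collection described above can be modeled with a toy plan ADT (not Catalyst's real classes): walk deterministic filters top-down, stop at the first non-deterministic one, and return everything beneath it intact as the bottom plan.

```scala
// Toy plan ADT; names and shapes are illustrative only.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Filt(cond: String, deterministic: Boolean, child: Plan) extends Plan

// Collect deterministic filters until the first nondeterministic one;
// the subtree below that point is returned unchanged.
def collectFilters(p: Plan, acc: List[String] = Nil): (List[String], Plan) =
  p match {
    case Filt(c, true, child) => collectFilters(child, c :: acc)
    case other                => (acc.reverse, other)
  }

val plan =
  Filt("a > 1", true,
    Filt("b < 3", true,
      Filt("RAND(42) > 0.5", false,
        Filt("c < 2", true,
          Scan("t")))))

val (filters, bottom) = collectFilters(plan)
// filters: the deterministic filters above the RAND filter
// bottom: the RAND filter with everything beneath it, kept intact
```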
Thank you @cloud-fan for pointing that out; I realized my previous thoughts were wrong. I fully agree with @liancheng's improvement idea. I will update the related code as well as new test cases tomorrow.
After an offline discussion with @liancheng, we think it would be better to have a wrapper node for scans (table scan or file scan), and this wrapper node can also hold the project list and filter conditions. Then in the optimizer we can improve the ColumnPruning and FilterPushdown rules to push down into this wrapper node. After this we don't need PhysicalOperation anymore and the planner can match on the wrapper node directly.
@cloud-fan Do you mean something like adding the following to basicLogicalOperators:

```scala
case class Scanner(
    projectionList: Seq[NamedExpression],
    filters: Seq[Expression],
    child: LogicalPlan)
  extends UnaryNode
```

And passing that to the planner instead of applying PhysicalOperation?
I'm willing to take this work. Thanks!
yup, thanks!
@cloud-fan Now I can insert a Scanner operator over a CatalogRelation in the Optimizer, but I noticed a relation may also be something like `l: LogicalRelation(relation: CatalogRelation, _, _)`. In this case we can't analyze the class LogicalRelation, because it's in the spark-sql package while the Optimizer is in spark-catalyst, so we are not able to determine whether a Scanner should be added. I think we don't want to add Scanner over every BaseRelation.
@cloud-fan I've sent a PR to add
ping @jiangxb1987 @cloud-fan
@jiangxb1987 do we still have this bug?
Yes, this still exists. Let me find some time to resolve this.
@heary-cao tried to resolve the same issue in #18969. ping @jiangxb1987
Ping, @jiangxb1987.
What changes were proposed in this pull request?
Currently the partition predicate is not passed down correctly when a condition contains non-deterministic parts. This PR changes the logic in collectProjectsAndFilters() to add the deterministic parts into filters, so that the partition predicate can be passed down correctly.
How was this patch tested?
A new test was added in PruningSuite.