-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-26736][SQL] if filter condition And has non-determined sub function it does not do partition prunning
#24118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…t do partition prunning
|
Could you describe more (e.g., an example query to reproduce the issue you described) in the PR description? |
|
@maropu The description was updated. Please review it, thanks~ |
| (Some(substitutedFields), filters, other, collectAliases(substitutedFields)) | ||
|
|
||
| case Filter(condition, child) if condition.deterministic => | ||
| case Filter(condition, child) if condition.deterministic || condition.isInstanceOf[And] => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this pattern should not return non-determinisitc exprs, but this current change does so, right?
Can we modify code to extract deterministic exprs in line 67-69?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. I will do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has been updated. Please review it, thanks~~
|
ok to test |
|
Test build #103613 has finished for PR 24118 at commit
|
|
Could you make the PR title/description more precise before reviews? I think this pr does not target the rand() function only.... |
And has non-determined function it does not do partition prunning
|
OK, The title has been updated~ |
And has non-determined function it does not do partition prunningAnd having non-determined sub function it does not do partition prunning
And having non-determined sub function it does not do partition prunningAnd has non-determined sub function it does not do partition prunning
| val (fields, filters, other, aliases) = collectProjectsAndFilters(child) | ||
| val substitutedCondition = substitute(aliases)(condition) | ||
| (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases) | ||
| case filter: Filter if filter.condition.deterministic || filter.condition.isInstanceOf[And] => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PhysicalOperation is used in many places, so have you checked this change has no side-effect for the other behivours?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have checked this change and I think that it has no side-effect for the other behaviors.
|
It seems you forget to update the title in the jira? |
|
It has been updated. |
| Some(condition) | ||
| } else { | ||
| val andCondition = condition.asInstanceOf[And] | ||
| if (andCondition.left.deterministic) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we intend to handle only at the top level ? What happens when the non-deterministic predicate is nested like :
where ((c1 = 10 and rand() < 1) and (c2 = 20))
in this case, in my understanding of the code, we will ignore (c1 = 10) for pruning purposes ? cc @maropu
Edited: Another example :
where c1 = 10 and c2 = 20 and c3 = 30 and rand() < 1 and c4 = 40
in this case, we would only consider c4 = 40 for pruning, no ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, we need more general solution for that..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dilipbiswal @maropu It has been updated. Please review it, thanks~~
|
Test build #103830 has finished for PR 24118 at commit
|
| (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases) | ||
| } else { | ||
| (None, Nil, filter, Map.empty) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add tests somewhere to check if this addition could extract deterministic conditions you expect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the test case named Partition pruning - with filter containing non-determined condition in sub And-expr in PruningSuite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, but it seems they are end-to-end tests, so I think we need more fine-grained tests for collectProjectsAndFilters.
| val substitutedCondition = substitute(aliases)(condition) | ||
| (fields, filters ++ splitConjunctivePredicates(substitutedCondition), other, aliases) | ||
| case filter @ Filter(condition, child) | ||
| if condition.deterministic || condition.isInstanceOf[And] => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the isInstanceOf check required any more ?
| } | ||
| } | ||
|
|
||
| private def getDeterminedExpression(expr: Expression): Option[Expression] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getDeterministicExpression or extractDeterministicExpression ? What do you think @maropu ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, I like the name including Deterministic
| } | ||
| } | ||
|
|
||
| private def getDeterminedExpression(expr: Expression): Option[Expression] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also some comments about this function at the top with small snippets of input and output.
|
We should also update comments here to reflect the changes ? |
|
I have updated some codes and add test cases. Please review it, thanks~ |
|
Test build #103976 has finished for PR 24118 at commit
|
| * col = 1 and rand() < 1 | ||
| * (col1 = 1 and rand() < 1) and col2 = 1 | ||
| * col1 = 1 or rand() < 1 | ||
| * (col1 = 1 and rand() < 1) or (col2 = 1 and rand() < 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we don't need to handle this case (col1 = 1 and rand() < 1) or (col2 = 1 and rand() < 1) in this pr because DNF forms should be handled in another normalization logic (e.g., SPARK-6624). So, I think its ok to handle CFN forms only here. In fact, I think we should keep the same semantics with PushDownPredicate. cc: @gatorsmile @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to be consistent with PushDownPredicate
|
Test build #104131 has finished for PR 24118 at commit
|
|
retest this please |
|
Test build #104154 has finished for PR 24118 at commit
|
|
@maropu Is there any progress about this PR? |
|
Can one of the admins verify this patch? |
|
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
|
Is there any problems about the pr? We also encounter this problem, and we find that hive can hanle this well. |
|
this should have been fixed by a58d91b |
|
It seems this pr intended to target hive tables? I tried the example query in the current master; |
|
Seems we need to update |
|
Ah, right. I'll make a pr for that later. |
…ions in Hive tables
### What changes were proposed in this pull request?
This PR intends to improve partition pruning for nondeterministic expressions in Hive tables:
Before this PR:
```
scala> sql("""create table test(id int) partitioned by (dt string)""")
scala> sql("""select * from test where dt='20190101' and rand() < 0.5""").explain()
== Physical Plan ==
*(1) Filter ((isnotnull(dt#19) AND (dt#19 = 20190101)) AND (rand(6515336563966543616) < 0.5))
+- Scan hive default.test [id#18, dt#19], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#18], [dt#19], Statistics(sizeInBytes=8.0 EiB)
```
After this PR:
```
== Physical Plan ==
*(1) Filter (rand(-9163956883277176328) < 0.5)
+- Scan hive default.test [id#0, dt#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#0], [dt#1], Statistics(sizeInBytes=8.0 EiB), [isnotnull(dt#1), (dt#1 = 20190101)]
```
This PR is the rework of #24118.
### Why are the changes needed?
For better performance.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit tests added.
Closes #27219 from maropu/SPARK-26736.
Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
What changes were proposed in this pull request?
If filter condition
Andincludes non-determined sub expression partition prunning will not work. This patch will take out the determined sub expression inAndto make the partition prunning work.Example:
A partitioned table definition:
create table test(id int) partitioned by (dt string);The following sql does not do partition prunning:
select * from test where dt='20190101' and rand() < 0.5;This PR will fix this problem.
How was this patch tested?
It will be tested in
PruningSuiteby adding test casePartition pruning - with filter containing non-determined conditionPlease review http://spark.apache.org/contributing.html before opening a pull request.