[SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2 #33680
@huaxingao I think it would be better to file a separate JIRA, although it's just a change in the explain output. cc @c21 FYI
Hmm, is it possible to add a test?
@viirya Thanks for taking a look. I didn't add a new test because we already have a partition pruning test with both partition filters and data filters here: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala#L734
"|PushedFilters: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]",
"orc" ->
"|PushedFilters: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]",
"|PushedFilters: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]",
Oh, I see. #30652 also only updated this.
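For readers following the thread, the expected-output strings in the diff context above look like regular expressions matched against the explain text. The sketch below only illustrates that idea. The two sample explain lines are hypothetical (written to fit the old and new patterns, not copied from a real run), and the object and method names are invented for this example.

```scala
object PushedFiltersCheck {
  // Pattern taken from the updated ORC expectation in the diff (leading "|" margin dropped).
  val expected = "PushedFilters: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]".r

  // Hypothetical explain line after the change: only data filters are pushed.
  val afterChange = "PushedFilters: [IsNotNull(value), GreaterThan(value,2)]"

  // Hypothetical explain line before the change: filters on the partition column `id` were pushed too.
  val beforeChange =
    "PushedFilters: [IsNotNull(id), IsNotNull(value), GreaterThan(id,1), GreaterThan(value,2)]"

  // True when the line contains text matching the expected pattern.
  def matches(line: String): Boolean = expected.findFirstIn(line).isDefined
}
```

Under these assumptions, `PushedFiltersCheck.matches` accepts the post-change line and rejects the pre-change one, which is why the existing test needed only its expected pattern updated.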
c21
left a comment
LGTM as well, thanks @huaxingao.
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @huaxingao , @HyukjinKwon , @viirya , @c21 .
Merged to master/3.2.
### What changes were proposed in this pull request?
Do not push down partition filters to `ORCScan` for DSv2.

### Why are the changes needed?
Seems to me that partition filters are only used for partition pruning and shouldn't be pushed down to `ORCScan`. We don't push down partition filters to ORCScan in DSv1:
```
== Physical Plan ==
*(1) Filter (isnotnull(value#19) AND NOT (value#19 = a))
+- *(1) ColumnarToRow
   +- FileScan orc [value#19,p1#20,p2#21] Batched: true, DataFilters: [isnotnull(value#19), NOT (value#19 = a)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/pt/_5f4sxy56x70dv9zpz032f0m0000gn/T/spark-c1..., PartitionFilters: [isnotnull(p1#20), isnotnull(p2#21), (p1#20 = 1), (p2#21 = 2)], PushedFilters: [IsNotNull(value), Not(EqualTo(value,a))], ReadSchema: struct<value:string>
```
Also, we don't push down partition filters for Parquet in DSv2. #30652

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites

Closes #33680 from huaxingao/orc_filter.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b04330c)
Signed-off-by: Dongjoon Hyun <[email protected]>
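To make the distinction in the plan above concrete, here is a minimal, self-contained sketch of separating partition filters from data filters. It is a simplified model, not Spark's implementation: `SimpleFilter`, `FilterSplitDemo`, and `split` are hypothetical names, and a filter is reduced to a label plus the set of columns it references. The labels mirror the `DataFilters` and `PartitionFilters` entries in the DSv1 plan.

```scala
// Simplified stand-in for a predicate: a label plus the columns it references.
// Illustrative only; this is not Spark's Expression or Filter class.
final case class SimpleFilter(label: String, references: Set[String])

object FilterSplitDemo {
  // Returns (partitionFilters, dataFilters). In this model a filter is a
  // partition filter only if it references at least one column and every
  // referenced column is a partition column; everything else is a data filter.
  def split(
      filters: Seq[SimpleFilter],
      partitionColumns: Set[String]): (Seq[SimpleFilter], Seq[SimpleFilter]) =
    filters.partition(f => f.references.nonEmpty && f.references.subsetOf(partitionColumns))

  // Filters corresponding to the plan above, with p1 and p2 as partition columns.
  val filters: Seq[SimpleFilter] = Seq(
    SimpleFilter("isnotnull(value)", Set("value")),
    SimpleFilter("NOT (value = a)", Set("value")),
    SimpleFilter("isnotnull(p1)", Set("p1")),
    SimpleFilter("isnotnull(p2)", Set("p2")),
    SimpleFilter("p1 = 1", Set("p1")),
    SimpleFilter("p2 = 2", Set("p2")))

  val (partitionFilters, dataFilters) = split(filters, Set("p1", "p2"))
}
```

In this model only `dataFilters` would be candidates for pushdown to the scan, while `partitionFilters` would be used to prune partition directories, which matches the intent described in the PR.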
BTW, @huaxingao, do we need this in old branches like branch-3.1?
Thanks, @dongjoon-hyun! I will backport to 3.1. Thank you all, @c21 @viirya @HyukjinKwon.
What changes were proposed in this pull request?
Do not push down partition filters to ORCScan for DSv2.
Why are the changes needed?
Seems to me that partition filters are only used for partition pruning and shouldn't be pushed down to ORCScan. We don't push down partition filters to ORCScan in DSv1.
Also, we don't push down partition filters for Parquet in DSv2. #30652
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing test suites