[SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2 #33680

huaxingao · 2021-08-08T23:01:39Z

What changes were proposed in this pull request?

Do not push down partition filters to ORCScan for DSv2.

Why are the changes needed?

It seems to me that a partition filter is only used for partition pruning and shouldn't be pushed down to ORCScan. We don't push down partition filters to ORCScan in DSv1:

== Physical Plan ==
*(1) Filter (isnotnull(value#19) AND NOT (value#19 = a))
+- *(1) ColumnarToRow
   +- FileScan orc [value#19,p1#20,p2#21] Batched: true, DataFilters: [isnotnull(value#19), NOT (value#19 = a)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/pt/_5f4sxy56x70dv9zpz032f0m0000gn/T/spark-c1..., PartitionFilters: [isnotnull(p1#20), isnotnull(p2#21), (p1#20 = 1), (p2#21 = 2)], PushedFilters: [IsNotNull(value), Not(EqualTo(value,a))], ReadSchema: struct<value:string>

Also, we don't push down partition filters for Parquet in DSv2; see #30652.
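
As a rough illustration of the idea (not the code changed in this PR), the sketch below separates filters that only reference partition columns from the ones worth handing to the ORC reader. The `Filter` case classes, `split`, and `PartitionFilterSplitSketch` are hypothetical stand-ins that do not depend on Spark; the column names `value`, `p1`, and `p2` mirror the plan above.

```scala
object PartitionFilterSplitSketch {
  // Simplified stand-ins for data source filters (assumption: these are NOT
  // Spark's own filter classes, just enough structure to show the split).
  sealed trait Filter { def references: Set[String] }
  final case class IsNotNull(attribute: String) extends Filter {
    def references: Set[String] = Set(attribute)
  }
  final case class EqualTo(attribute: String, value: Any) extends Filter {
    def references: Set[String] = Set(attribute)
  }
  final case class Not(child: Filter) extends Filter {
    def references: Set[String] = child.references
  }

  // Returns (partitionFilters, pushedFilters). A filter counts as a partition
  // filter when every column it references is a partition column; such filters
  // are used only for partition pruning and are not handed to the reader.
  def split(
      filters: Seq[Filter],
      partitionColumns: Set[String]): (Seq[Filter], Seq[Filter]) =
    filters.partition { f =>
      f.references.nonEmpty && f.references.subsetOf(partitionColumns)
    }

  // Filters corresponding to the plan above.
  val filters: Seq[Filter] = Seq(
    IsNotNull("p1"),
    IsNotNull("p2"),
    EqualTo("p1", 1),
    EqualTo("p2", 2),
    IsNotNull("value"),
    Not(EqualTo("value", "a")))

  val (partitionFilters, pushedFilters) = split(filters, Set("p1", "p2"))
}
```

With this split, only the two filters on `value` end up in `pushedFilters`, which matches the `PushedFilters` entry in the DSv1 plan.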

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites

SparkQA · 2021-08-08T23:12:24Z

Test build #142200 has finished for PR 33680 at commit 2878965.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-08-08T23:48:50Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46712/

SparkQA · 2021-08-09T00:27:50Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46712/

huaxingao · 2021-08-09T00:48:56Z

cc @dongjoon-hyun @viirya

HyukjinKwon · 2021-08-09T01:07:31Z

@huaxingao I think it would be better to file a separate JIRA, although it's just a change in the explain output. cc @c21 FYI

viirya · 2021-08-09T01:31:39Z

Hmm, is it possible to add a test?

huaxingao · 2021-08-09T01:50:56Z

> is it possible to add a test?

@viirya Thanks for taking a look. I didn't add a new test because we already have a partition pruning test with both partition filters and data filters here: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala#L734
For the pushed-down filters displayed in explain, I modified the expected result in ExplainSuite. Any suggestions for new tests to add?

viirya · 2021-08-09T03:34:19Z

sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala

            "|PushedFilters: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]",
          "orc" ->
-            "|PushedFilters: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]",
+            "|PushedFilters: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]",


Oh, I see. #30652 also only updated this.
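
For readers following the diff, here is a minimal sketch of what the updated expectation checks. It reuses the two regex strings from the diff without the leading `|` (assumed here to be a `stripMargin` marker in the test). The two sample explain lines, the `matches` helper, and `ExplainPatternSketch` are hypothetical and are not output captured from Spark.

```scala
object ExplainPatternSketch {
  // Old and new expectations for ORC, copied from the diff (leading "|" dropped).
  val oldOrcPattern: String =
    "PushedFilters: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]"
  val newOrcPattern: String =
    "PushedFilters: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]"

  // Hypothetical explain lines: one that still lists filters on the partition
  // column `id`, and one that lists only the data filters on `value`.
  val lineWithPartitionFilters: String =
    "PushedFilters: [IsNotNull(id), IsNotNull(value), GreaterThan(id,1), GreaterThan(value,2)]"
  val lineDataFiltersOnly: String =
    "PushedFilters: [IsNotNull(value), GreaterThan(value,2)]"

  // True when the regex occurs anywhere in the line.
  def matches(pattern: String, line: String): Boolean =
    pattern.r.findFirstIn(line).isDefined
}
```

Under these assumptions, `newOrcPattern` accepts only the line without the `id` filters, so the expectation would fail if partition filters showed up in `PushedFilters` again.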

c21

LGTM as well, thanks @huaxingao.

dongjoon-hyun

+1, LGTM. Thank you, @huaxingao , @HyukjinKwon , @viirya , @c21 .
Merged to master/3.2.

### What changes were proposed in this pull request?
not push down partition filter to `ORCScan` for DSv2

### Why are the changes needed?
Seems to me that partition filter is only used for partition pruning and shouldn't be pushed down to `ORCScan`. We don't push down partition filter to ORCScan in DSv1
```
== Physical Plan ==
*(1) Filter (isnotnull(value#19) AND NOT (value#19 = a))
+- *(1) ColumnarToRow
   +- FileScan orc [value#19,p1#20,p2#21] Batched: true, DataFilters: [isnotnull(value#19), NOT (value#19 = a)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/pt/_5f4sxy56x70dv9zpz032f0m0000gn/T/spark-c1..., PartitionFilters: [isnotnull(p1#20), isnotnull(p2#21), (p1#20 = 1), (p2#21 = 2)], PushedFilters: [IsNotNull(value), Not(EqualTo(value,a))], ReadSchema: struct<value:string>
```
Also, we don't push down partition filter for parquet in DSv2.
#30652

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites

Closes #33680 from huaxingao/orc_filter.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b04330c)
Signed-off-by: Dongjoon Hyun <[email protected]>

dongjoon-hyun · 2021-08-09T17:48:11Z

BTW, @huaxingao . Do we need this in old branches like branch-3.1?

huaxingao · 2021-08-09T18:17:55Z

Thanks! @dongjoon-hyun I will backport this to 3.1.

Thank you all @c21 @viirya @HyukjinKwon

[MINOR][SQL] not push down partition filter for ORCScan for DSv2

2878965

github-actions bot added the SQL label Aug 8, 2021

huaxingao changed the title ~~[MINOR][SQL] Not push down partition filter to ORCScan for DSv2~~ [SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2 Aug 9, 2021

viirya reviewed Aug 9, 2021

viirya approved these changes Aug 9, 2021

c21 approved these changes Aug 9, 2021

dongjoon-hyun approved these changes Aug 9, 2021

dongjoon-hyun closed this in b04330c Aug 9, 2021

huaxingao deleted the orc_filter branch August 9, 2021 18:17

huaxingao mentioned this pull request Aug 12, 2021

[SPARK-36351][SQL] Refactor filter push down in file source v2 #33650

Closed
