Skip to content

Conversation

@huaxingao
Copy link
Contributor

What changes were proposed in this pull request?

not push down partition filter to ORCScan for DSv2

Why are the changes needed?

Seems to me that partition filter is only used for partition pruning and shouldn't be pushed down to ORCScan. We don't push down partition filter to ORCScan in DSv1

== Physical Plan ==
*(1) Filter (isnotnull(value#19) AND NOT (value#19 = a))
+- *(1) ColumnarToRow
   +- FileScan orc [value#19,p1#20,p2#21] Batched: true, DataFilters: [isnotnull(value#19), NOT (value#19 = a)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/pt/_5f4sxy56x70dv9zpz032f0m0000gn/T/spark-c1..., PartitionFilters: [isnotnull(p1#20), isnotnull(p2#21), (p1#20 = 1), (p2#21 = 2)], PushedFilters: [IsNotNull(value), Not(EqualTo(value,a))], ReadSchema: struct<value:string>

Also, we don't push down partition filter for parquet in DSv2.
#30652

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites

@github-actions github-actions bot added the SQL label Aug 8, 2021
@SparkQA
Copy link

SparkQA commented Aug 8, 2021

Test build #142200 has finished for PR 33680 at commit 2878965.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46712/

@SparkQA
Copy link

SparkQA commented Aug 9, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46712/

@huaxingao
Copy link
Contributor Author

cc @dongjoon-hyun @viirya

@HyukjinKwon
Copy link
Member

@huaxingao I think it would be better to file a separate JIRA although it's just the change in explain. cc @c21 FYI

@huaxingao huaxingao changed the title [MINOR][SQL] Not push down partition filter to ORCScan for DSv2 [SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2 Aug 9, 2021
@viirya
Copy link
Member

viirya commented Aug 9, 2021

Hmm, is it possible to add a test?

@huaxingao
Copy link
Contributor Author

is it possible to add a test?

@viirya Thanks for taking a look. The reason that I didn't add a new test is because we have partition pruning test with both partition filters and data filters here https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala#L734
For pushed down filters display in explain, i modified the expected result in ExplainSuite. Any suggestions for the new tests to add?

"|PushedFilters: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]",
"orc" ->
"|PushedFilters: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]",
"|PushedFilters: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. #30652 also only updated this.

Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well, thanks @huaxingao.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @huaxingao , @HyukjinKwon , @viirya , @c21 .
Merged to master/3.2.

dongjoon-hyun pushed a commit that referenced this pull request Aug 9, 2021
### What changes were proposed in this pull request?
not push down partition filter to `ORCScan` for DSv2

### Why are the changes needed?
Seems to me that partition filter is only used for partition pruning and shouldn't be pushed down to `ORCScan`. We don't push down partition filter to ORCScan in DSv1
```
== Physical Plan ==
*(1) Filter (isnotnull(value#19) AND NOT (value#19 = a))
+- *(1) ColumnarToRow
   +- FileScan orc [value#19,p1#20,p2#21] Batched: true, DataFilters: [isnotnull(value#19), NOT (value#19 = a)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/pt/_5f4sxy56x70dv9zpz032f0m0000gn/T/spark-c1..., PartitionFilters: [isnotnull(p1#20), isnotnull(p2#21), (p1#20 = 1), (p2#21 = 2)], PushedFilters: [IsNotNull(value), Not(EqualTo(value,a))], ReadSchema: struct<value:string>
```
Also, we don't push down partition filter for parquet in DSv2.
#30652

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites

Closes #33680 from huaxingao/orc_filter.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b04330c)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

BTW, @huaxingao . Do we need this in old branches like branch-3.1?

@huaxingao
Copy link
Contributor Author

Thanks! @dongjoon-hyun I will back port to 3.1

Thank you all @c21 @viirya @HyukjinKwon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants