[SPARK-36647][SQL][TESTS] Push down Aggregate (Min/Max/Count) for Parquet if filter is on partition col #34248

huaxingao · 2021-10-12T00:52:25Z

What changes were proposed in this pull request?

I just realized that the changes in #33650 already removed the restriction against pushing down Min/Max/Count when there is a partition filter. This PR only adds a test to make sure that Min/Max/Count for Parquet are pushed down if the filter is on a partition column.
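To illustrate why a partition filter is compatible with this push down, here is a minimal sketch in plain Scala. It is not the Spark implementation: `FileStats`, `PartitionFilterPushDownSketch`, and `aggregate` are hypothetical names, and the model assumes each file exposes a min, a max, and a row count for column `id`, as Parquet footer statistics typically do.

```scala
// Hypothetical model: one entry per Parquet file, carrying its partition value
// and the statistics a footer would provide for column `id`.
final case class FileStats(partition: Int, minId: Long, maxId: Long, rowCount: Long)

object PartitionFilterPushDownSketch {
  // A partition filter removes whole files, so the statistics of the files
  // that remain still describe exactly the rows the query has to see.
  // Returns (min, max, count); None models "no file matched" (a simplification:
  // in SQL, count would be 0 and min/max would be NULL in that case).
  def aggregate(files: Seq[FileStats], p: Int): Option[(Long, Long, Long)] = {
    val kept = files.filter(_.partition == p)
    if (kept.isEmpty) None
    else Some((kept.map(_.minId).min, kept.map(_.maxId).max, kept.map(_.rowCount).sum))
  }
}
```

A filter on a data column would not work this way, because it can drop individual rows inside a file and leave the file-level statistics inaccurate for the remaining rows.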

Why are the changes needed?

To complete the work on Aggregate (Min/Max/Count) push down for Parquet.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test.

SparkQA · 2021-10-12T02:02:37Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48587/

SparkQA · 2021-10-12T02:47:40Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48587/

SparkQA · 2021-10-12T05:55:41Z

Test build #144110 has finished for PR 34248 at commit cd22629.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

c21

LGTM, thanks @huaxingao.

c21 · 2021-10-12T21:07:39Z

...scala/org/apache/spark/sql/execution/datasources/parquet/ParquetAggregatePushDownSuite.scala

+        val enableVectorizedReader = Seq("false", "true")
+        for (testVectorizedReader <- enableVectorizedReader) {


nit: this could be more idiomatic Scala, but not a big deal:

Seq("false", "true").foreach { enableVectorizedReader => withSQLConf(...) { ... } }

@c21 Thanks for reviewing! I fixed this.
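For reference, the suggested shape can be sketched without a Spark dependency. This is only an illustration under stated assumptions: `withConf` is a hypothetical stand-in for the `withSQLConf` test helper, backed by a plain mutable map, and the key `vectorizedReaderEnabled` is made up for the example.

```scala
import scala.collection.mutable

object ForeachConfSketch {
  // Hypothetical in-memory configuration standing in for the SQL conf.
  private val conf = mutable.Map.empty[String, String]

  // Hypothetical stand-in for withSQLConf: sets the given keys while `body`
  // runs, then restores whatever was there before.
  def withConf[T](pairs: (String, String)*)(body: => T): T = {
    val saved = pairs.map { case (k, _) => k -> conf.get(k) }
    pairs.foreach { case (k, v) => conf(k) = v }
    try body
    finally saved.foreach {
      case (k, Some(old)) => conf(k) = old
      case (k, None) => conf.remove(k)
    }
  }

  // Iterates over the two reader settings with foreach instead of a for loop
  // over a separately named Seq; records the value seen inside each run.
  def run(): Seq[String] = {
    val seen = mutable.ArrayBuffer.empty[String]
    Seq("false", "true").foreach { enableVectorizedReader =>
      withConf("vectorizedReaderEnabled" -> enableVectorizedReader) {
        seen += conf("vectorizedReaderEnabled")
      }
    }
    seen.toSeq
  }
}
```

Calling `ForeachConfSketch.run()` executes the body once per setting, which is the behavior the test needs for both the vectorized and non-vectorized reader.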

huaxingao · 2021-10-13T02:41:45Z

cc @viirya Could you please take a look when you have time? Thanks!

viirya · 2021-10-13T03:10:03Z

...scala/org/apache/spark/sql/execution/datasources/parquet/ParquetAggregatePushDownSuite.scala

+        Seq("false", "true").foreach { enableVectorizedReader =>
+          withSQLConf(SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED.key -> "true",
+            vectorizedReaderEnabledKey -> enableVectorizedReader) {
+            val max = sql("SELECT max(id) FROM tmp WHERE p = 0")


Can you add the other two supported aggregate functions? And how about the case of group by on a partition column?

Added.
Group by on a partition column is a little more complicated and needs some code changes. Currently, the returned row contains only the aggregate values. For group by on a partition column, we will need to pass down the partition column value and prepend it to the aggregation row. I will open a separate PR for that work.
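The row shape described above can be sketched as follows. This is a hypothetical illustration in plain Scala, not code from this PR or the follow-up: `Row` is modeled as a simple sequence of values, and `prependPartitionValue` is an invented helper name.

```scala
object GroupByPartitionSketch {
  // Hypothetical row model: a row is just a sequence of values.
  type Row = Seq[Any]

  // `aggRow` holds only the aggregate values computed for one partition
  // (for example max, min, count). For a group by on the partition column,
  // the grouping key has to be placed in front of those values.
  def prependPartitionValue(partitionValue: Any, aggRow: Row): Row =
    partitionValue +: aggRow
}
```

For a partition `p = 0` with aggregates `(9, 1, 9)`, the helper would produce a row whose first field is the partition value followed by the three aggregates.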

SparkQA · 2021-10-13T04:04:42Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48655/

SparkQA · 2021-10-13T04:51:16Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48655/

SparkQA · 2021-10-13T05:10:35Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48659/

SparkQA · 2021-10-13T05:15:58Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48658/

viirya · 2021-10-13T05:55:22Z

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScanBuilder.scala

+      // However, if the filter or group by is on partition column,
+      // max/min/count can still be pushed down


So group by on a partition column is not supported yet, which means this comment is not correct.

SparkQA · 2021-10-13T05:55:52Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48659/

SparkQA · 2021-10-13T06:04:17Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48658/

SparkQA · 2021-10-13T08:20:55Z

Test build #144177 has finished for PR 34248 at commit 6cc6f7b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-10-13T09:19:24Z

Test build #144180 has finished for PR 34248 at commit 1c96138.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-10-18T20:21:48Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48854/

SparkQA · 2021-10-18T21:01:51Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48854/

SparkQA · 2021-10-19T00:37:26Z

Test build #144380 has finished for PR 34248 at commit 1293ae0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-10-26T17:10:21Z

retest this please

viirya · 2021-10-26T17:11:04Z

I'll merge this after CI, since last CI was a few days ago.

SparkQA · 2021-10-26T17:13:20Z

Test build #144626 has started for PR 34248 at commit 1293ae0.

SparkQA · 2021-10-26T17:53:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49096/

SparkQA · 2021-10-26T18:36:09Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49096/

viirya · 2021-10-27T00:02:18Z

retest this please

SparkQA · 2021-10-27T00:54:44Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49103/

SparkQA · 2021-10-27T01:02:01Z

Test build #144633 has finished for PR 34248 at commit 1293ae0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2021-10-27T01:44:56Z

retest this please

SparkQA · 2021-10-27T01:53:12Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49103/

SparkQA · 2021-10-27T02:56:59Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49106/

SparkQA · 2021-10-27T03:55:03Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49106/

SparkQA · 2021-10-27T06:46:54Z

Test build #144636 has finished for PR 34248 at commit 1293ae0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2021-10-27T07:13:44Z

Thanks! Merging to master.

huaxingao · 2021-10-27T07:20:06Z

Thanks @c21 @viirya

[SPARK-36647][SQL] Push down Aggregate (Min/Max/Count) for Parquet if filter is on partition col

cd22629

github-actions bot added the SQL label Oct 12, 2021

HyukjinKwon changed the title ~~[SPARK-36647][SQL] Push down Aggregate (Min/Max/Count) for Parquet if filter is on partition col~~ [SPARK-36647][SQL][TESTS] Push down Aggregate (Min/Max/Count) for Parquet if filter is on partition col Oct 12, 2021

c21 approved these changes Oct 12, 2021

address comments

6cc6f7b

viirya reviewed Oct 13, 2021

huaxingao added 2 commits October 12, 2021 21:16

address comments

d7ecec4

fix

1c96138

viirya reviewed Oct 13, 2021

c21 mentioned this pull request Oct 16, 2021

[SPARK-34960][SQL] Aggregate push down for ORC #34298

Closed

fix comment

1293ae0

viirya approved these changes Oct 26, 2021

viirya closed this in 4aec9d7 Oct 27, 2021

huaxingao deleted the partitionFilter branch October 27, 2021 07:20
