
Conversation

@beliefer (Contributor) commented Mar 12, 2022

What changes were proposed in this pull request?

Currently, Spark DS V2 aggregate push-down doesn't support Project with alias.

Refer to this check in the aggregate push-down rule:

if filters.isEmpty && project.forall(_.isInstanceOf[AttributeReference]) =>

This PR makes it work with aliases.
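
For illustration, the relaxed check could also accept aliases of attribute references, roughly like this (a sketch with a hypothetical helper name, not the exact patch):

import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference, NamedExpression}

// Sketch: allow the project list to contain aliases of attribute
// references, not only bare attribute references.
def supportsPushDown(project: Seq[NamedExpression]): Boolean =
  project.forall {
    case _: AttributeReference           => true
    case Alias(_: AttributeReference, _) => true
    case _                               => false
  }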

The first example:
the original plan is shown below:

Aggregate [DEPT#0], [DEPT#0, sum(mySalary#8) AS total#14]
+- Project [DEPT#0, SALARY#2 AS mySalary#8]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession@77978658,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions@5f8da82)
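
For reference, a query of roughly this shape produces the plan above (the h2 catalog and test.employee table names are assumptions read off the plan output):

import org.apache.spark.sql.functions.sum

val df = spark.sql("SELECT DEPT, SALARY AS mySalary FROM h2.test.employee")
  .groupBy("DEPT")
  .agg(sum("mySalary").as("total"))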

If we can completely push down the aggregate, the plan will be:

Project [DEPT#0, SUM(SALARY)#18 AS sum(SALARY#2)#13 AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee

If we can partially push down the aggregate, the plan will be:

Aggregate [DEPT#0], [DEPT#0, sum(cast(SUM(SALARY)#18 as decimal(20,2))) AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee

The second example:
the original plan is shown below:

Aggregate [myDept#33], [myDept#33, sum(mySalary#34) AS total#40]
+- Project [DEPT#25 AS myDept#33, SALARY#27 AS mySalary#34]
   +- ScanBuilderHolder [DEPT#25, NAME#26, SALARY#27, BONUS#28], RelationV2[DEPT#25, NAME#26, SALARY#27, BONUS#28] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession@25c4f621,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions@345d641e)
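
Similarly, a query of roughly this shape produces the second plan, with both the grouping column and the aggregated column aliased (names again read off the plan output):

import org.apache.spark.sql.functions.sum

val df2 = spark.sql("SELECT DEPT AS myDept, SALARY AS mySalary FROM h2.test.employee")
  .groupBy("myDept")
  .agg(sum("mySalary").as("total"))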

If we can completely push down the aggregate, the plan will be:

Project [DEPT#25 AS myDept#33, SUM(SALARY)#44 AS sum(SALARY#27)#39 AS total#40]
+- RelationV2[DEPT#25, SUM(SALARY)#44] test.employee

If we can partially push down the aggregate, the plan will be:

Aggregate [DEPT#25], [DEPT#25 AS myDept#33, sum(cast(SUM(SALARY)#56 as decimal(20,2))) AS total#52]
+- RelationV2[DEPT#25, SUM(SALARY)#56] test.employee

Why are the changes needed?

Aliases in the project list are common in real queries; supporting them makes aggregate push-down apply to many more plans.

Does this PR introduce any user-facing change?

Yes. Users will see that DS V2 aggregate push-down supports Project with alias.

How was this patch tested?

New tests.

github-actions bot added the SQL label Mar 12, 2022
@codecov-commenter commented Mar 12, 2022

Codecov Report

Merging #35823 (c91c2e9) into master (c483e29) will increase coverage by 0.00%.
The diff coverage is 90.62%.

❗ Current head c91c2e9 differs from pull request most recent head 787e0dd. Consider uploading reports for the commit 787e0dd to get more accurate results

@@           Coverage Diff           @@
##           master   #35823   +/-   ##
=======================================
  Coverage   91.19%   91.19%           
=======================================
  Files         297      297           
  Lines       64696    64724   +28     
  Branches     9919     9921    +2     
=======================================
+ Hits        58999    59025   +26     
- Misses       4330     4332    +2     
  Partials     1367     1367           
Flag Coverage Δ
unittests 91.17% <90.62%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
python/pyspark/sql/tests/test_udf.py 95.56% <ø> (ø)
python/pyspark/pandas/indexes/category.py 93.33% <66.66%> (-0.92%) ⬇️
python/pyspark/pandas/indexes/datetimes.py 95.23% <66.66%> (-0.52%) ⬇️
python/pyspark/pandas/indexes/timedelta.py 75.40% <66.66%> (-0.46%) ⬇️
python/pyspark/pandas/base.py 94.14% <100.00%> (+0.03%) ⬆️
python/pyspark/pandas/frame.py 97.06% <100.00%> (+<0.01%) ⬆️
python/pyspark/pandas/tests/test_dataframe.py 97.34% <100.00%> (+0.01%) ⬆️
python/pyspark/pandas/tests/test_series.py 96.24% <100.00%> (+<0.01%) ⬆️
...n/pyspark/mllib/tests/test_streaming_algorithms.py 76.34% <0.00%> (-0.36%) ⬇️
python/pyspark/streaming/tests/test_context.py 98.42% <0.00%> (ø)
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor:

nit: unnecessary change?

@huaxingao (Contributor):

I actually have an alias over aggregate test in FileSource too. Could you please change that one as well?
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceAggregatePushDownSuite.scala#L187

@beliefer (Contributor Author):

> I actually have an alias over aggregate test in FileSource too. Could you please change that one as well?
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceAggregatePushDownSuite.scala#L187

Thank you for the reminder.

Contributor:

Hi. Is it better to specify SPARK-38533?

Contributor Author:

It doesn't matter.

Contributor:

ditto

Contributor:

ditto

Contributor:

ditto

Contributor:

ditto

Contributor:

ditto

Contributor:

These two lambda expressions behave the same; we could factor them out into a shared function.

@beliefer (Contributor Author):

ping @huaxingao cc @cloud-fan

Contributor:

Please follow the predicate pushdown optimizer rule and leverage AliasHelper to do it.
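
For context, Catalyst's AliasHelper trait provides getAliasMap and replaceAlias, which the predicate pushdown rule uses to rewrite expressions through a Project. A minimal sketch of that pattern (the rule name is illustrative, not Spark's actual rule):

import org.apache.spark.sql.catalyst.expressions.AliasHelper
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative only: a rule that sees through Project aliases before
// pushing an expression down, mirroring what predicate pushdown does.
object PushThroughAliasSketch extends Rule[LogicalPlan] with AliasHelper {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(cond, project @ Project(projectList, child))
        if projectList.forall(_.deterministic) =>
      // Map each alias back to the expression it names, then rewrite the
      // condition so it refers only to the child's output attributes.
      val aliasMap = getAliasMap(project)
      project.copy(child = Filter(replaceAlias(cond, aliasMap), child))
  }
}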

Contributor Author:

Thank you for the reminder.

Contributor:

If we follow it, then this check should be project.forall(_.deterministic).

@cloud-fan (Contributor) commented Mar 15, 2022:

why is this line needed?

Contributor Author:

An attribute in groupingExpressions may be an alias. Before pushing the SQL down to JDBC, I want to replace the aliased attribute with the original attribute, not the Alias.

Contributor:

groupingExpressions may not be NamedExpression

@cloud-fan (Contributor) commented Mar 15, 2022:

We need to clearly describe the final plan. This is more complicated now as the project may contain arbitrary expressions.

For example

Aggregate(sum(a + b) + max(a - c) + x, group by x,
  Project(x, x + 1 as a, x * 2 as b , x + y as c,
    Table(x, y, z)
  )
)

what the final plan looks like if the aggregate can be pushed, or can be partial pushed, or can't be pushed.
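
In SQL terms, that plan shape corresponds roughly to the following query (the table name t is illustrative):

spark.sql("""
  SELECT SUM(a + b) + MAX(a - c) + x
  FROM (SELECT x, x + 1 AS a, x * 2 AS b, x + y AS c FROM t)
  GROUP BY x
""")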

@beliefer (Contributor Author) commented Mar 16, 2022:

Aggregate [myDept#0], [((cast(sum(CheckOverflow((promote_precision(cast(mySalary#1 as decimal(23,2))) + promote_precision(cast(yourSalary#2 as decimal(23,2)))), DecimalType(23,2))) as double) + max((cast(mySalary#1 as double) - bonus#6))) + cast(myDept#0 as double)) AS ((sum((mySalary + yourSalary)) + max((mySalary - bonus))) + myDept)#9]
+- Project [dept#3 AS myDept#0, CheckOverflow((promote_precision(cast(salary#5 as decimal(21,2))) + 1.00), DecimalType(21,2)) AS mySalary#1, CheckOverflow((promote_precision(salary#5) * 2.00), DecimalType(22,2)) AS yourSalary#2, bonus#6]
   +- ScanBuilderHolder [DEPT#3, NAME#4, SALARY#5, BONUS#6], RelationV2[DEPT#3, NAME#4, SALARY#5, BONUS#6] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession@463a1f47,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions@47224d5d)

cast, CheckOverflow and promote_precision are not supported in aggregate push-down.
I updated the PR description and added a plan.

Contributor:

why do we rename these?

Contributor Author:

It's not confirmed yet.

Contributor:

Can we make the code more explicit? We need to clearly show the steps (see the sketch after this list):

  1. collapse aggregate and project
  2. remove the aliases from aggregate functions and group-by expressions (this logic should live here rather than in AliasHelper, as it is not common logic)
  3. push down the aggregate
  4. add back aliases for the group-by expressions only.
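
A minimal sketch of steps 1, 2 and 4 using Catalyst's AliasHelper, assuming the rule has already matched an Aggregate over a Project; the actual data source push-down in step 3 is elided:

import org.apache.spark.sql.catalyst.expressions.AliasHelper
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Project}

// Illustrative sketch, not the final implementation.
object CollapseAggregateSketch extends AliasHelper {
  def rewrite(agg: Aggregate, project: Project): Aggregate = {
    val aliasMap = getAliasMap(project)
    // Steps 1-2: collapse Aggregate over Project by inlining the aliases, so
    // grouping and aggregate expressions reference the relation's own columns.
    val groupExprs = agg.groupingExpressions.map(replaceAlias(_, aliasMap))
    val resultExprs = agg.aggregateExpressions.map(replaceAliasButKeepName(_, aliasMap))
    // Step 3 (elided): offer groupExprs/resultExprs to the data source.
    // Step 4 (elided): after push-down, re-alias only the group-by output
    // columns so the plan keeps its original output names.
    Aggregate(groupExprs, resultExprs, project.child)
  }
}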

Contributor Author:

Thank you for the good idea.

@beliefer (Contributor Author):

#35932 replaces this PR.

beliefer closed this Mar 22, 2022