[SPARK-36677][SQL] NestedColumnAliasing should not push down aggregate functions into projections #33921

vicennial · 2021-09-06T18:08:16Z

What changes were proposed in this pull request?

This PR filters out ExtractValue expressions that contain any aggregate function in the NestedColumnAliasing rule, so that aggregations are no longer pushed down into projections.
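The filtering idea can be sketched with a toy expression tree. This is only an illustration under stated assumptions, not the Catalyst code from this PR: `Attribute`, `Max`, `GetField`, `contains_aggregate` and `aliasable_extracts` are hypothetical stand-ins for Spark's attribute references, aggregate functions, ExtractValue expressions and the rule's candidate collection.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Attribute:
    """Hypothetical stand-in for a column reference such as `a`."""
    name: str


@dataclass(frozen=True)
class Max:
    """Hypothetical stand-in for an aggregate function such as MAX(a)."""
    child: "Expr"


@dataclass(frozen=True)
class GetField:
    """Hypothetical stand-in for an ExtractValue expression such as `a.c`."""
    child: "Expr"
    field: str


Expr = Union[Attribute, Max, GetField]


def contains_aggregate(expr: Expr) -> bool:
    """Return True if any node in the expression tree is an aggregate."""
    if isinstance(expr, Max):
        return True
    if isinstance(expr, GetField):
        return contains_aggregate(expr.child)
    return False


def aliasable_extracts(exprs: List[Expr]) -> List[Expr]:
    """Keep only field extractions that are safe to alias in a child projection."""
    return [
        e for e in exprs
        if isinstance(e, GetField) and not contains_aggregate(e)
    ]


plain = GetField(GetField(Attribute("a"), "c"), "e")          # a.c.e
over_agg = GetField(GetField(Max(Attribute("a")), "c"), "e")  # max(a).c.e

# Only a.c.e remains a candidate; max(a).c.e is filtered out.
kept = aliasable_extracts([plain, over_agg])
```

In this sketch, `max(a).c.e` is excluded because the extraction sits on top of an aggregate, which mirrors the case from the query in this description.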

Why are the changes needed?

To handle a corner/missed case in NestedColumnAliasing that can cause users to encounter a runtime exception.

Consider the following schema:

root
 |-- a: struct (nullable = true)
 |    |-- c: struct (nullable = true)
 |    |    |-- e: string (nullable = true)
 |    |-- d: integer (nullable = true)
 |-- b: string (nullable = true)

and the query:
SELECT MAX(a).c.e FROM (SELECT a, b FROM test_aggregates) GROUP BY b

Executing the query before this PR will result in the error:

java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[0, struct<c:struct<e:string>,d:int>, true])
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGenerateCodeForExpressionError(QueryExecutionErrors.scala:83)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:312)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:311)
  at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:99)
...

The optimised plan before this PR is:

'Aggregate [b#1], [_extract_e#5 AS max(a).c.e#3]
+- 'Project [max(a#0).c.e AS _extract_e#5, b#1]
   +- Relation default.test_aggregates[a#0,b#1] parquet
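Why this plan fails can be illustrated with a small row-at-a-time evaluator. This is a toy model, not Spark's code generation path: the tuple encoding, `eval_row` and the `Unevaluable` exception class are assumptions made for the example. The point it shows is that a projection computes each output from a single input row, while an aggregate needs the whole group.

```python
class Unevaluable(Exception):
    """Raised when an expression cannot be computed from a single row."""


def eval_row(expr, row):
    """Evaluate a toy expression against one row, as a projection would."""
    kind = expr[0]
    if kind == "attr":
        return row[expr[1]]
    if kind == "field":
        # Evaluate the child struct, then extract the named field from it.
        return eval_row(expr[1], row)[expr[2]]
    if kind == "max":
        # An aggregate needs every row in the group, not just this one.
        raise Unevaluable(f"cannot evaluate aggregate over {expr[1]} on a single row")
    raise ValueError(f"unknown expression kind: {kind}")


# One row matching the schema above: a struct column `a` and a string column `b`.
row = {"a": {"c": {"e": "x"}, "d": 1}, "b": "k"}

plain = ("field", ("field", ("attr", "a"), "c"), "e")            # a.c.e
pushed = ("field", ("field", ("max", ("attr", "a")), "c"), "e")  # max(a).c.e

value = eval_row(plain, row)
try:
    eval_row(pushed, row)
    failed = False
except Unevaluable:
    failed = True
```

The plain extraction evaluates per row, while the extraction over the aggregate cannot. That is the same shape of failure as the `Project` node containing `max(a#0).c.e` in the plan above.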

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new unit test in NestedColumnAliasingSuite. The test consists of the repro mentioned earlier.
The produced optimized plan is checked for equivalency with a plan of the form:

 Aggregate [b#452], [max(a#451).c.e AS max('a)[c][e]#456]
+- LocalRelation <empty>, [a#451, b#452]

dongjoon-hyun · 2021-09-06T20:25:02Z

cc @viirya and @sunchao

HyukjinKwon · 2021-09-07T01:18:06Z

ok to test

HyukjinKwon · 2021-09-07T01:18:12Z

cc @karenfeng too FYI

SparkQA · 2021-09-07T02:18:08Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47534/

SparkQA · 2021-09-07T02:26:53Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47534/

SparkQA · 2021-09-07T06:04:36Z

Test build #143032 has finished for PR 33921 at commit b7dbdc8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell

LGTM

viirya · 2021-09-08T01:15:25Z

Thanks for your contribution! Merging to master/3.2.

…e functions into projections

Closes #33921 from vicennial/spark-36677.

Authored-by: Venkata Sai Akhil Gudesa <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 2ed6e7b)
Signed-off-by: Liang-Chi Hsieh <[email protected]>

dongjoon-hyun · 2021-09-08T04:52:22Z

+1, LGTM. Thank you, @vicennial , @HyukjinKwon , @hvanhovell , @viirya .

github-actions bot added the SQL label Sep 6, 2021

vicennial added 2 commits September 6, 2021 20:16

init

870c750

retrigger checks

6ee2151

vicennial force-pushed the spark-36677 branch from bb73aca to 6ee2151 Compare September 6, 2021 18:16

retrigger checks

b7dbdc8

hvanhovell approved these changes Sep 7, 2021


viirya approved these changes Sep 8, 2021


viirya closed this in 2ed6e7b Sep 8, 2021
