Skip to content

Conversation

@vicennial
Copy link
Contributor

What changes were proposed in this pull request?

This PR filters out ExtractValuess that contains any aggregation function in the NestedColumnAliasing rule to prevent cases where aggregations are pushed down into projections.

Why are the changes needed?

To handle a corner/missed case in NestedColumnAliasing that can cause users to encounter a runtime exception.

Consider the following schema:

root
 |-- a: struct (nullable = true)
 |    |-- c: struct (nullable = true)
 |    |    |-- e: string (nullable = true)
 |    |-- d: integer (nullable = true)
 |-- b: string (nullable = true)

and the query:
SELECT MAX(a).c.e FROM (SELECT a, b FROM test_aggregates) GROUP BY b

Executing the query before this PR will result in the error:

java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[0, struct<c:struct<e:string>,d:int>, true])
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGenerateCodeForExpressionError(QueryExecutionErrors.scala:83)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:312)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:311)
  at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:99)
...

The optimised plan before this PR is:

'Aggregate [b#1], [_extract_e#5 AS max(a).c.e#3]
+- 'Project [max(a#0).c.e AS _extract_e#5, b#1]
   +- Relation default.test_aggregates[a#0,b#1] parquet

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new unit test in NestedColumnAliasingSuite. The test consists of the repro mentioned earlier.
The produced optimized plan is checked for equivalency with a plan of the form:

 Aggregate [b#452], [max(a#451).c.e AS max('a)[c][e]#456]
+- LocalRelation <empty>, [a#451, b#452]

@github-actions github-actions bot added the SQL label Sep 6, 2021
@dongjoon-hyun
Copy link
Member

cc @viirya and @sunchao

@HyukjinKwon
Copy link
Member

ok to test

@HyukjinKwon
Copy link
Member

cc @karenfeng too FYI

@SparkQA
Copy link

SparkQA commented Sep 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47534/

@SparkQA
Copy link

SparkQA commented Sep 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47534/

@SparkQA
Copy link

SparkQA commented Sep 7, 2021

Test build #143032 has finished for PR 33921 at commit b7dbdc8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@viirya
Copy link
Member

viirya commented Sep 8, 2021

Thanks for your contribution! Merging to master/3.2.

@viirya viirya closed this in 2ed6e7b Sep 8, 2021
viirya pushed a commit that referenced this pull request Sep 8, 2021
…e functions into projections

### What changes were proposed in this pull request?

This PR filters out `ExtractValues`s that contains any aggregation function in the `NestedColumnAliasing` rule to prevent cases where aggregations are pushed down into projections.

### Why are the changes needed?

To handle a corner/missed case in `NestedColumnAliasing` that can cause users to encounter a runtime exception.

Consider the following schema:
```
root
 |-- a: struct (nullable = true)
 |    |-- c: struct (nullable = true)
 |    |    |-- e: string (nullable = true)
 |    |-- d: integer (nullable = true)
 |-- b: string (nullable = true)
```
and the query:
`SELECT MAX(a).c.e FROM (SELECT a, b FROM test_aggregates) GROUP BY b`

Executing the query before this PR will result in the error:
```
java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[0, struct<c:struct<e:string>,d:int>, true])
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGenerateCodeForExpressionError(QueryExecutionErrors.scala:83)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:312)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:311)
  at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:99)
...
```
The optimised plan before this PR is:

```
'Aggregate [b#1], [_extract_e#5 AS max(a).c.e#3]
+- 'Project [max(a#0).c.e AS _extract_e#5, b#1]
   +- Relation default.test_aggregates[a#0,b#1] parquet
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test in `NestedColumnAliasingSuite`. The test consists of the repro mentioned earlier.
The produced optimized plan is checked for equivalency with a plan of the form:
```
 Aggregate [b#452], [max(a#451).c.e AS max('a)[c][e]#456]
+- LocalRelation <empty>, [a#451, b#452]
```

Closes #33921 from vicennial/spark-36677.

Authored-by: Venkata Sai Akhil Gudesa <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 2ed6e7b)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
@dongjoon-hyun
Copy link
Member

+1, LGTM. Thank you, @vicennial , @HyukjinKwon , @hvanhovell , @viirya .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants