-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-36677][SQL] NestedColumnAliasing should not push down aggregate functions into projections #33921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
bb73aca to
6ee2151
Compare
|
ok to test |
|
cc @karenfeng too FYI |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #143032 has finished for PR 33921 at commit
|
hvanhovell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Thanks for your contribution! Merging to master/3.2. |
…e functions into projections
### What changes were proposed in this pull request?
This PR filters out `ExtractValues`s that contains any aggregation function in the `NestedColumnAliasing` rule to prevent cases where aggregations are pushed down into projections.
### Why are the changes needed?
To handle a corner/missed case in `NestedColumnAliasing` that can cause users to encounter a runtime exception.
Consider the following schema:
```
root
|-- a: struct (nullable = true)
| |-- c: struct (nullable = true)
| | |-- e: string (nullable = true)
| |-- d: integer (nullable = true)
|-- b: string (nullable = true)
```
and the query:
`SELECT MAX(a).c.e FROM (SELECT a, b FROM test_aggregates) GROUP BY b`
Executing the query before this PR will result in the error:
```
java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[0, struct<c:struct<e:string>,d:int>, true])
at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGenerateCodeForExpressionError(QueryExecutionErrors.scala:83)
at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:312)
at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:311)
at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:99)
...
```
The optimised plan before this PR is:
```
'Aggregate [b#1], [_extract_e#5 AS max(a).c.e#3]
+- 'Project [max(a#0).c.e AS _extract_e#5, b#1]
+- Relation default.test_aggregates[a#0,b#1] parquet
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
A new unit test in `NestedColumnAliasingSuite`. The test consists of the repro mentioned earlier.
The produced optimized plan is checked for equivalency with a plan of the form:
```
Aggregate [b#452], [max(a#451).c.e AS max('a)[c][e]#456]
+- LocalRelation <empty>, [a#451, b#452]
```
Closes #33921 from vicennial/spark-36677.
Authored-by: Venkata Sai Akhil Gudesa <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 2ed6e7b)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
|
+1, LGTM. Thank you, @vicennial , @HyukjinKwon , @hvanhovell , @viirya . |
What changes were proposed in this pull request?
This PR filters out
ExtractValuess that contains any aggregation function in theNestedColumnAliasingrule to prevent cases where aggregations are pushed down into projections.Why are the changes needed?
To handle a corner/missed case in
NestedColumnAliasingthat can cause users to encounter a runtime exception.Consider the following schema:
and the query:
SELECT MAX(a).c.e FROM (SELECT a, b FROM test_aggregates) GROUP BY bExecuting the query before this PR will result in the error:
The optimised plan before this PR is:
Does this PR introduce any user-facing change?
No
How was this patch tested?
A new unit test in
NestedColumnAliasingSuite. The test consists of the repro mentioned earlier.The produced optimized plan is checked for equivalency with a plan of the form: