[SPARK-38185][SQL] Fix data incorrect if aggregate function is empty #35490
Conversation
cloud-fan
left a comment
good catch!
cc @sigmod as well
```diff
 private[sql] def groupOnly: Boolean = {
-  aggregateExpressions.map {
+  // aggregateExpressions can be empty through Dataset.agg
+  aggregateExpressions.nonEmpty && aggregateExpressions.map {
```
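The root cause is vacuous truth: checking a condition over an empty Scala sequence succeeds trivially, so an empty aggregate list was misclassified as "group only". A minimal, Spark-free sketch of the two checks (the `AggLike` type and method names here are illustrative stand-ins, not the real Catalyst API):

```scala
// Stand-in for an aggregate output expression; `isGroupRef` marks an
// expression that merely references a grouping column. Illustrative only.
case class AggLike(isGroupRef: Boolean)

// Old check: `forall` over an empty sequence is vacuously true,
// so an empty aggregate list passed as "group only".
def oldGroupOnly(aggExprs: Seq[AggLike]): Boolean =
  aggExprs.forall(_.isGroupRef)

// Fixed check: an empty aggregate list is a global aggregate, never group only.
def fixedGroupOnly(aggExprs: Seq[AggLike]): Boolean =
  aggExprs.nonEmpty && aggExprs.forall(_.isGroupRef)

println(oldGroupOnly(Seq.empty))   // true  -- the misclassification
println(fixedGroupOnly(Seq.empty)) // false
```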
Is it more explicit to check `groupingExpressions.nonEmpty`? Logically, an aggregate operator without output columns is still group-only if it has grouping columns.
makes sense
thanks, merging to master/3.2!
Add `aggregateExpressions.nonEmpty` check in the `groupOnly` function.
The group-only condition should check whether the aggregate expressions are empty.
In the DataFrame API, it is allowed to build an empty aggregation.
So the following query should return 1 rather than 0, because it is a global aggregate:
```scala
val emptyAgg = Map.empty[String, String]
spark.range(2).where("id > 2").agg(emptyAgg).limit(1).count
```
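The semantics behind the expected result: a global aggregate always emits exactly one row, even over empty input, while a group-by aggregate emits one row per group, hence zero rows over empty input. A plain-Scala sketch of just the row counts (names are illustrative; this models counts only, not Spark execution):

```scala
// A global aggregate emits exactly one row, even over no input rows.
def globalAggRowCount[A](rows: Seq[A]): Int = {
  val count = rows.size   // e.g. count(*), still computed over empty input
  Seq(count).size         // the single output row
}

// A group-by aggregate emits one row per group; empty input has no groups.
def groupByRowCount[A, K](rows: Seq[A])(key: A => K): Int =
  rows.groupBy(key).size

println(globalAggRowCount(Seq.empty[Long]))         // 1
println(groupByRowCount(Seq.empty[Long])(identity)) // 0
```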
yes, bug fix
Add test
Closes #35490 from ulysses-you/SPARK-38185.
Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 25a4c5f)
Signed-off-by: Wenchen Fan <[email protected]>
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @ulysses-you and all.
The code looks good but
thank you @dongjoon-hyun. The rule that rewrites a group-only aggregate with limit 1 into a project landed in branch-3.3 (see SPARK-36183), and it is the key trigger of this bug. But the group-only condition has existed since branch-3.2 (see SPARK-34808), so it would be good to also fix this in branch-3.2.