[SPARK-20725][SQL] partial aggregate should behave correctly for sameResult #17964
cc @hvanhovell |
|
Test build #76871 has finished for PR 17964 at commit
|
Ok, so this works because of the way we plan aggregates and I am totally fine with this.
I am slightly worried about non-complete aggregate expressions that cannot be resolved and wreak havoc further down the line because sameResult falsely evaluated to true. Can we special-case non-complete aggregate expressions?
From an architectural point of view it might be better to add this as a normalize function to Expression.
|
Test build #76891 has started for PR 17964 at commit |
|
LGTM |
|
|
test("SPARK-20725: partial aggregate should behave correctly for sameResult") {
  val df1 = spark.range(10).agg(sum($"id"))
  val df2 = spark.range(10).agg(sum($"id"))
val df1 = spark.range(10).agg(sumDistinct($"id"))
val df2 = spark.range(10).agg(sumDistinct($"id"))
Will these not match?
Good catch! The reason is that HashAggregateExec.requiredChildDistributionExpressions is an Option[Seq[Expression]], which is not treated as expressions of HashAggregateExec, and thus is not touched by QueryPlan.mapExpressions.
I have fixed it in QueryPlan.
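To make the failure mode above concrete, here is a minimal sketch in plain Scala. It does not use Spark's real classes: `Expr`, `Attr`, `Sum`, `ToyAggregate`, and both `map*` helpers are hypothetical stand-ins invented for illustration. It only shows how a field-by-field traversal that handles `Seq[_]` but not `Option[_]` can leave expression ids untouched inside an `Option[Seq[...]]` field.

```scala
// Toy stand-ins for illustration only; these are NOT Spark's classes.
sealed trait Expr
case class Attr(name: String, exprId: Long) extends Expr
case class Sum(child: Expr) extends Expr

// A toy plan node: one field is Option[Seq[Expr]], the other is Seq[Expr].
case class ToyAggregate(
    requiredDistribution: Option[Seq[Expr]],
    groupingExprs: Seq[Expr])

object ToyPlan {
  // Rewrites every attribute id to 0 so that ids no longer affect equality.
  def normalize(e: Expr): Expr = e match {
    case Attr(n, _) => Attr(n, 0L)
    case Sum(c)     => Sum(normalize(c))
  }

  // Naive traversal: handles Expr and Seq[_] fields, but skips Option[_].
  def mapNaive(node: ToyAggregate): Seq[Any] =
    node.productIterator.map {
      case e: Expr   => normalize(e)
      case s: Seq[_] => s.map { case e: Expr => normalize(e); case other => other }
      case other     => other // an Option[Seq[Expr]] falls through unchanged
    }.toSeq

  // Traversal that also unwraps Some(...) before looking for expressions.
  def mapWithOption(node: ToyAggregate): Seq[Any] = {
    def go(arg: Any): Any = arg match {
      case e: Expr   => normalize(e)
      case Some(v)   => Some(go(v))
      case s: Seq[_] => s.map(go)
      case other     => other
    }
    node.productIterator.map(go).toSeq
  }
}
```

With two nodes that differ only in expression ids, the naive traversal still reports a difference because the ids inside the `Option` field survive, while the `Option`-aware traversal treats them as equal.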
|
retest this please |
|
Test build #76893 has finished for PR 17964 at commit
|
|
Test build #76901 has finished for PR 17964 at commit
|
|
LGTM - merging to master/2.2. |
…Result
## What changes were proposed in this pull request?
For aggregate function with `PartialMerge` or `Final` mode, the input is aggregate buffers instead of the actual children expressions. So the actual children expressions won't affect the result, we should normalize the expr id for them.
## How was this patch tested?
a new regression test
Author: Wenchen Fan <[email protected]>
Closes #17964 from cloud-fan/tmp.
(cherry picked from commit 1283c3d)
Signed-off-by: Herman van Hovell <[email protected]>
|
@cloud-fan can you backport this to 2.1? |
What changes were proposed in this pull request?
For aggregate functions in PartialMerge or Final mode, the input is aggregate buffers instead of the actual children expressions. The actual children expressions therefore won't affect the result, so we should normalize the expr id for them.
How was this patch tested?
a new regression test
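The idea in the description can be sketched with a small, self-contained model. This is an illustrative assumption, not the code from this patch: `AggMode`, `ToyAttr`, `ToyAggExpr`, and `ToyNormalize` are hypothetical names, and the mode objects only mirror the names used in the description. The sketch erases the children's expression ids when the mode is `PartialMerge` or `Final`, and leaves other modes alone.

```scala
// Toy model for illustration only; these are NOT Spark's classes.
sealed trait AggMode
case object Partial extends AggMode
case object PartialMerge extends AggMode
case object Final extends AggMode
case object Complete extends AggMode

case class ToyAttr(name: String, exprId: Long)
case class ToyAggExpr(func: String, children: Seq[ToyAttr], mode: AggMode)

object ToyNormalize {
  // In PartialMerge/Final mode the input is the aggregate buffer, so the
  // children's ids cannot affect the result and are erased before comparing.
  def normalize(agg: ToyAggExpr): ToyAggExpr = agg.mode match {
    case PartialMerge | Final =>
      agg.copy(children = agg.children.map(_.copy(exprId = 0L)))
    case _ =>
      agg // Partial/Complete read the children directly, so ids are kept
  }

  // Hypothetical comparison helper: equal after normalization.
  def sameResult(a: ToyAggExpr, b: ToyAggExpr): Boolean =
    normalize(a) == normalize(b)
}
```

Keeping the ids for modes that read the children directly is one way to special-case the non-complete modes, in the spirit of the concern raised earlier in the review, so that only buffer-consuming modes compare equal regardless of child ids.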