[SPARK-20725][SQL] partial aggregate should behave correctly for sameResult #17964

cloud-fan · 2017-05-12T14:31:25Z

What changes were proposed in this pull request?

For aggregate functions in PartialMerge or Final mode, the input is the aggregate buffers rather than the actual children expressions. The children expressions therefore do not affect the result, so we should normalize their expr IDs.
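To make the idea concrete, here is a minimal sketch on a toy model. It is not the code in this patch and does not use Spark's classes: `Attr`, `AggExpr`, `NormalizeSketch`, and the constant id `0L` are all hypothetical names chosen for illustration. It only shows why resetting the children's ids in PartialMerge or Final mode lets two otherwise identical aggregates compare equal.

```scala
// Toy model only: these types are illustrative stand-ins, not Spark's classes.
sealed trait Mode
case object Partial extends Mode
case object PartialMerge extends Mode
case object Final extends Mode
case object Complete extends Mode

// An attribute reference identified by a name and a unique expression id.
final case class Attr(name: String, exprId: Long)

// An aggregate call such as sum(id), tagged with its evaluation mode.
final case class AggExpr(func: String, children: Seq[Attr], mode: Mode)

object NormalizeSketch {
  // In PartialMerge/Final mode the function reads aggregate buffers, so the
  // ids of its children cannot influence the result; reset them to a constant.
  def normalize(e: AggExpr): AggExpr = e.mode match {
    case PartialMerge | Final =>
      e.copy(children = e.children.map(_.copy(exprId = 0L)))
    case _ => e
  }

  // Two aggregates produce the same result if they are equal after normalization.
  def sameResult(a: AggExpr, b: AggExpr): Boolean =
    normalize(a) == normalize(b)
}
```

In this sketch, two `Final`-mode `sum` calls whose children differ only in `exprId` compare equal, while the same pair in `Complete` mode does not, because their children are left untouched.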

How was this patch tested?

a new regression test

cloud-fan · 2017-05-12T14:31:50Z

cc @hvanhovell

SparkQA · 2017-05-12T15:41:59Z

Test build #76871 has finished for PR 17964 at commit a2ebfda.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-05-12T19:21:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala

Ok, so this works because of the way we plan aggregates and I am totally fine with this.

I am slightly worried about non-complete aggregate expressions that cannot be resolved and wreak havoc further down the line because sameResult falsely evaluated to true. Can we special-case non-complete aggregate expressions?

From an architectural point of view it might be better to add this as a normalize function to Expression.

SparkQA · 2017-05-13T06:02:36Z

Test build #76891 has started for PR 17964 at commit 557298e.

hvanhovell · 2017-05-13T06:17:31Z

LGTM

gatorsmile · 2017-05-13T07:22:56Z

sql/core/src/test/scala/org/apache/spark/sql/execution/SameResultSuite.scala

+
+  test("SPARK-20725: partial aggregate should behave correctly for sameResult") {
+    val df1 = spark.range(10).agg(sum($"id"))
+    val df2 = spark.range(10).agg(sum($"id"))


val df1 = spark.range(10).agg(sumDistinct($"id"))
val df2 = spark.range(10).agg(sumDistinct($"id"))

They will not match?

Good catch! The reason is that HashAggregateExec.requiredChildDistributionExpressions is an Option[Seq[Expression]], which is not treated as one of HashAggregateExec's expressions and is therefore not touched by QueryPlan.mapExpressions.

I have fixed it in QueryPlan
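As a rough illustration of the problem (not the actual change to QueryPlan), the sketch below walks a node's constructor arguments and also descends into `Option` and `Seq` wrappers. `Expr`, `AggNode`, `MapExpressionsSketch`, `mapArg`, and `mapFields` are hypothetical names on a toy model; without the `Some(inner)` case, an expression stored as `Option[Seq[Expr]]` would be skipped.

```scala
// Hypothetical stand-ins; not Spark's Expression or QueryPlan.
final case class Expr(name: String, exprId: Long)

// A node whose grouping keys are wrapped in an Option, loosely modeled on the
// field described in the comment above.
final case class AggNode(requiredKeys: Option[Seq[Expr]], output: Seq[Expr])

object MapExpressionsSketch {
  // Applies f to an expression found directly, inside a Seq, or inside an
  // Option (including an Option of a Seq). Anything else is left unchanged.
  def mapArg(arg: Any, f: Expr => Expr): Any = arg match {
    case e: Expr     => f(e)
    case Some(inner) => Some(mapArg(inner, f))
    case seq: Seq[_] => seq.map(mapArg(_, f))
    case other       => other
  }

  // Rewrites every constructor argument of a case-class node and returns the
  // transformed arguments in declaration order.
  def mapFields(node: Product, f: Expr => Expr): List[Any] =
    node.productIterator.map(mapArg(_, f)).toList
}
```

With this traversal, a function that resets `exprId` reaches the keys nested inside `requiredKeys` as well as the plain `output` sequence.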

gatorsmile · 2017-05-13T07:25:36Z

retest this please

SparkQA · 2017-05-13T09:44:48Z

Test build #76893 has finished for PR 17964 at commit 557298e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-05-13T14:53:36Z

Test build #76901 has finished for PR 17964 at commit 49da955.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-05-13T19:08:29Z

LGTM - merging to master/2.2.

…Result

## What changes were proposed in this pull request?

For aggregate function with `PartialMerge` or `Final` mode, the input is aggregate buffers instead of the actual children expressions. So the actual children expressions won't affect the result, we should normalize the expr id for them.

## How was this patch tested?

a new regression test

Author: Wenchen Fan <[email protected]>

Closes #17964 from cloud-fan/tmp.

(cherry picked from commit 1283c3d)
Signed-off-by: Herman van Hovell <[email protected]>

hvanhovell · 2017-05-13T19:09:53Z

@cloud-fan can you backport this to 2.1?

hvanhovell reviewed May 12, 2017

partial aggregate should behave correctly for sameResult

557298e

cloud-fan force-pushed the tmp branch from a2ebfda to 557298e Compare May 13, 2017 06:02

gatorsmile reviewed May 13, 2017

fix another bug

49da955

asfgit closed this in 1283c3d May 13, 2017

cloud-fan mentioned this pull request May 14, 2017

[SPARK-20725][SQL][BRANCH-2.1] partial aggregate should behave correctly for sameResult #17975

Closed
