[SPARK-17616][SQL] Support a single distinct aggregate combined with a non-partial aggregate #15187

hvanhovell · 2016-09-21T22:23:20Z

What changes were proposed in this pull request?

We currently cannot execute an aggregate that contains a single distinct aggregate function and one or more aggregate functions that cannot be planned partially, for example:

select   grp, 
         collect_list(col1),
         count(distinct col2)
from     tbl_a
group by 1

This is a regression from Spark 1.6. It is caused by the single distinct aggregation code path assuming that all aggregates can be planned in two phases (that is, that they are partially aggregatable). This PR works around the issue by triggering the RewriteDistinctAggregates rule in such cases (this is similar to the approach taken in 1.6).
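To illustrate the shape of the rewrite (an Expand followed by two Aggregates), here is a minimal plain-Scala sketch that models it on in-memory collections for the example query. It is not Spark's implementation: `Row`, `Expanded`, `DistinctRewriteSketch` and its methods are hypothetical names made up for this illustration, and the `gid` numbering (0 for the regular aggregate, 1 for the distinct group) is an assumption of the sketch.

```scala
// Hypothetical row type standing in for tbl_a(grp, col1, col2).
final case class Row(grp: String, col1: Int, col2: String)

// One expanded row. gid 0 feeds the regular aggregate (collect_list),
// gid 1 feeds the distinct aggregate (count(distinct col2)).
final case class Expanded(grp: String, gid: Int, col1: Option[Int], col2: Option[String])

object DistinctRewriteSketch {
  // (grp, gid, col2, collected col1 values) produced by the first aggregate.
  type Partial = (String, Int, Option[String], List[Int])

  // Expand: emit one row per aggregate group, blanking the columns
  // that belong to the other group.
  def expand(rows: Seq[Row]): Seq[Expanded] =
    rows.flatMap { r =>
      Seq(
        Expanded(r.grp, 0, Some(r.col1), None),
        Expanded(r.grp, 1, None, Some(r.col2))
      )
    }

  // First aggregate: group by (grp, gid, col2). This de-duplicates col2
  // for gid 1 and evaluates collect_list(col1) completely for gid 0,
  // so no partial aggregation of collect_list is needed.
  def firstAggregate(rows: Seq[Expanded]): Seq[Partial] =
    rows.groupBy(e => (e.grp, e.gid, e.col2)).toSeq.map {
      case ((grp, gid, col2), es) => (grp, gid, col2, es.flatMap(_.col1).toList)
    }

  // Second aggregate: group by grp only. Take the collected list from the
  // gid 0 row and count the (already distinct) col2 values from gid 1 rows.
  def secondAggregate(partials: Seq[Partial]): Map[String, (List[Int], Long)] =
    partials.groupBy(_._1).map { case (grp, ps) =>
      val collected = ps.collectFirst { case (_, 0, _, xs) => xs }.getOrElse(Nil)
      val distinctCount = ps.count { case (_, gid, col2, _) => gid == 1 && col2.isDefined }.toLong
      grp -> (collected, distinctCount)
    }

  def run(rows: Seq[Row]): Map[String, (List[Int], Long)] =
    secondAggregate(firstAggregate(expand(rows)))
}
```

Calling `DistinctRewriteSketch.run` on a few rows returns, per `grp`, the collected `col1` values and the number of distinct `col2` values. The point of the sketch is that the non-partial aggregate is computed in a single step inside the first aggregate, which is why routing the query through the rewrite avoids the two-phase assumption described above.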

How was this patch tested?

Created RewriteDistinctAggregatesSuite, which checks that aggregates with distinct aggregate functions are rewritten into two Aggregates and an Expand. Added a regression test to DataFrameAggregateSuite.

hvanhovell · 2016-09-21T22:23:55Z

cc @JoshRosen @rxin

SparkQA · 2016-09-22T00:59:10Z

Test build #65737 has finished for PR 15187 at commit 4a9ffaa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srinathshankar · 2016-09-22T18:45:16Z

.../src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregatesSuite.scala

+    checkRewrite(RewriteDistinctAggregates(input))
+  }
+
+  test("multiple distinct groups without non-distinct aggregates") {


Do you mean non-partial aggregates here?

I actually mean that the test only contains distinct aggregates.

srinathshankar · 2016-09-22T19:01:11Z

.../src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregatesSuite.scala

+      .analyze
+    val rewrite = RewriteDistinctAggregates(input)
+    comparePlans(input, rewrite)
+  }


Could you also add single distinct group with aggregates that have partial

srinathshankar · 2016-09-22T19:01:56Z

.../src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregatesSuite.scala

+    val input = testRelation
+      .groupBy('a)(countDistinct('b, 'c), countDistinct('d), sum('e))
+      .analyze
+    checkRewrite(RewriteDistinctAggregates(input))


Could you also add a test with partials, and one without partials, here? (As part of the same test(""))

srinathshankar

LGTM

SparkQA · 2016-09-22T21:28:20Z

Test build #65784 has finished for PR 15187 at commit bda0ba0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-09-22T21:28:43Z

Merging to master/2.0. Thanks for the review.

…a non-partial aggregate

We currently cannot execute an aggregate that contains a single distinct aggregate function and one or more aggregate functions that cannot be planned partially, for example:

```sql
select   grp,
         collect_list(col1),
         count(distinct col2)
from     tbl_a
group by 1
```

This is a regression from Spark 1.6. It is caused by the single distinct aggregation code path assuming that all aggregates can be planned in two phases (that is, that they are partially aggregatable). This PR works around the issue by triggering the `RewriteDistinctAggregates` rule in such cases (this is similar to the approach taken in 1.6).

Created `RewriteDistinctAggregatesSuite`, which checks that aggregates with distinct aggregate functions are rewritten into two `Aggregates` and an `Expand`. Added a regression test to `DataFrameAggregateSuite`.

Author: Herman van Hovell

Closes #15187 from hvanhovell/SPARK-17616.

(cherry picked from commit 0d63487)
Signed-off-by: Herman van Hovell

Add case to RewriteDistinctAggregates to rewrite a single distinct aggregate combined with a non-partial aggregate.

4a9ffaa

srinathshankar suggested changes Sep 22, 2016

Improve tests

bda0ba0

srinathshankar approved these changes Sep 22, 2016

asfgit closed this in 0d63487 Sep 22, 2016
