Skip to content

Conversation

@hvanhovell
Copy link
Contributor

What changes were proposed in this pull request?

We currently cannot execute an aggregate that contains a single distinct aggregate function and an one or more non-partially plannable aggregate functions, for example:

select   grp, 
         collect_list(col1),
         count(distinct col2)
from     tbl_a
group by 1

This is a regression from Spark 1.6. This is caused by the fact that the single distinct aggregation code path assumes that all aggregates can be planned in two phases (is partially aggregatable). This PR works around this issue by triggering the RewriteDistinctAggregates in such cases (this is similar to the approach taken in 1.6).

How was this patch tested?

Created RewriteDistinctAggregatesSuite which checks if the aggregates with distinct aggregate functions get rewritten into two Aggregates and an Expand. Added a regression test to DataFrameAggregateSuite.

…gregate combined with a non-partial aggregate.
@hvanhovell
Copy link
Contributor Author

cc @JoshRosen @rxin

@SparkQA
Copy link

SparkQA commented Sep 22, 2016

Test build #65737 has finished for PR 15187 at commit 4a9ffaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkRewrite(RewriteDistinctAggregates(input))
}

test("multiple distinct groups without non-distinct aggregates") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean non-partial aggregates here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually mean that the test only contains distinct aggregates.

.analyze
val rewrite = RewriteDistinctAggregates(input)
comparePlans(input, rewrite)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add single distinct group with aggregates that have partial

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

val input = testRelation
.groupBy('a)(countDistinct('b, 'c), countDistinct('d), sum('e))
.analyze
checkRewrite(RewriteDistinctAggregates(input))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a test with partials, and one without partials here ? (part of the same test(""))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@srinathshankar srinathshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented Sep 22, 2016

Test build #65784 has finished for PR 15187 at commit bda0ba0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor Author

Merging to master/2.0. Thanks for the review.

@asfgit asfgit closed this in 0d63487 Sep 22, 2016
asfgit pushed a commit that referenced this pull request Sep 22, 2016
…a non-partial aggregate

We currently cannot execute an aggregate that contains a single distinct aggregate function and an one or more non-partially plannable aggregate functions, for example:
```sql
select   grp,
         collect_list(col1),
         count(distinct col2)
from     tbl_a
group by 1
```
This is a regression from Spark 1.6. This is caused by the fact that the single distinct aggregation code path assumes that all aggregates can be planned in two phases (is partially aggregatable). This PR works around this issue by triggering the `RewriteDistinctAggregates` in such cases (this is similar to the approach taken in 1.6).

Created `RewriteDistinctAggregatesSuite` which checks if the aggregates with distinct aggregate functions get rewritten into two `Aggregates` and an `Expand`. Added a regression test to `DataFrameAggregateSuite`.

Author: Herman van Hovell <[email protected]>

Closes #15187 from hvanhovell/SPARK-17616.

(cherry picked from commit 0d63487)
Signed-off-by: Herman van Hovell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants