[SPARK-9241][SQL] Supporting multiple DISTINCT columns (2) - Rewriting Rule #9406

hvanhovell · 2015-11-02T09:47:06Z

This is the second PR for SPARK-9241. It adds support for multiple distinct columns to the new aggregation code path.

This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the JIRA ticket (SPARK-9241) for some information on this. The advantages over the competing first PR (#9280) are:

- This can use the faster TungstenAggregate code path.
- It is impossible to OOM due to an OpenHashSet allocating too much memory. However, this will multiply the number of input rows by the number of distinct clauses (plus one), and it puts a lot more memory pressure on the aggregation code path itself.
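To make the Expand-Aggregate-Aggregate shape concrete, here is a minimal plain-Python sketch of the idea for a query like `SELECT key, COUNT(DISTINCT a), COUNT(DISTINCT b), SUM(c) FROM t GROUP BY key`. It is an illustration only, not the Catalyst rule from this PR: the function names, the `gid` column, and the sample rows are all assumptions made for the example, and NULL handling is simplified (a `SUM` over no values yields 0 here rather than NULL).

```python
from collections import defaultdict

def expand(rows):
    """Expand: emit one copy of each input row per distinct column, plus one
    copy for the regular (non-distinct) aggregates. `gid` tags the copy."""
    for key, a, b, c in rows:
        yield (key, 0, None, None, c)   # gid 0: feeds the regular aggregate SUM(c)
        yield (key, 1, a, None, None)   # gid 1: feeds COUNT(DISTINCT a)
        yield (key, 2, None, b, None)   # gid 2: feeds COUNT(DISTINCT b)

def first_aggregate(expanded):
    """First Aggregate: group by (key, gid, a, b). Grouping de-duplicates the
    distinct values; the regular aggregate is partially summed for gid 0."""
    partial = defaultdict(int)
    for key, gid, a, b, c in expanded:
        partial[(key, gid, a, b)] += c if c is not None else 0
    return partial

def second_aggregate(partial):
    """Second Aggregate: group by key only. Each surviving gid 1 / gid 2 row
    is one distinct non-null value; gid 0 rows carry the partial sums."""
    result = defaultdict(lambda: [0, 0, 0])
    for (key, gid, a, b), sum_c in partial.items():
        acc = result[key]
        if gid == 1 and a is not None:
            acc[0] += 1          # COUNT(DISTINCT a)
        elif gid == 2 and b is not None:
            acc[1] += 1          # COUNT(DISTINCT b)
        elif gid == 0:
            acc[2] += sum_c      # SUM(c)
    return {k: tuple(v) for k, v in result.items()}

def multi_distinct(rows):
    return second_aggregate(first_aggregate(expand(rows)))

# Hypothetical sample data: (key, a, b, c)
rows = [("x", 1, 10, 5), ("x", 1, 20, 7), ("x", 2, 20, 1), ("y", 3, None, 4)]
expanded_count = len(list(expand(rows)))
result = multi_distinct(rows)
```

With two distinct clauses, the four sample rows become twelve after the expand step, which is the "number of distinct clauses (plus one)" row multiplication mentioned above. No per-group hash set of distinct values is kept; de-duplication falls out of the first group-by.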

The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed.

cc @yhuai - Could you also tell me where to add tests for this?

marmbrus · 2015-11-03T10:49:49Z

ok to test

SparkQA · 2015-11-03T11:22:35Z

Test build #44915 has finished for PR 9406 at commit 6139f47.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
* case class Expand(

SparkQA · 2015-11-04T18:01:27Z

Test build #45019 has finished for PR 9406 at commit 1e705fe.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
* case class Expand(


SparkQA · 2015-11-06T09:18:02Z

Test build #45207 has finished for PR 9406 at commit 9be5b9d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
* final class ShuffleSortDataFormat extends SortDataFormat<PackedRecordPointer, LongArray>
* final class UnsafeSortDataFormat extends SortDataFormat<RecordPointerAndKeyPrefix, LongArray>
* case class Expand(

hvanhovell · 2015-11-06T10:02:22Z

Hmmmm... this is a bit of a strange error.

hvanhovell · 2015-11-06T10:02:31Z

Jenkins retest this please

hvanhovell · 2015-11-06T10:07:05Z

Jenkins is not retesting... @marmbrus could you add me to the whitelist?

SparkQA · 2015-11-06T12:38:11Z

Test build #45219 has finished for PR 9406 at commit d3bdb2b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
* case class Expand(

marmbrus · 2015-11-06T23:23:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Utils.scala

It would be really helpful if there was an example of what this rewrite looks like here.

hvanhovell (reply, Nov 7, 2015): I'll add an example in the follow-up PR.

yhuai · 2015-11-06T23:35:11Z

@hvanhovell I have started to use this PR as the foundation of removing our old aggregation code path.

marmbrus · 2015-11-06T23:35:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Utils.scala

Comment on what each tuple element is, or maybe even use a case class?

hvanhovell (reply, Nov 7, 2015): I'll add documentation in a follow-up PR.

marmbrus · 2015-11-07T00:02:25Z

Okay, I looked over this pretty quickly and it looks awesome. We need some tests, and we are super close to me cutting a preview release. That said, I'd really like to include this in 1.6. Here is my proposal:

- Let's merge this as is.
- Yin will start ripping out the old aggregation path (with or without #9409, "[SPARK-11451][SQL] Support single distinct count on multiple columns").
- Comments and TODOs can be addressed in a follow-up.

…g Rule

The second PR for SPARK-9241, this adds support for multiple distinct columns to the new aggregation code path.

This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-9241) for some information on this. The advantages over the competing [first PR](#9280) are:

- This can use the faster TungstenAggregate code path.
- It is impossible to OOM due to an `OpenHashSet` allocating too much memory. However, this will multiply the number of input rows by the number of distinct clauses (plus one), and puts a lot more memory pressure on the aggregation code path itself.

The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed.

cc yhuai - Could you also tell me where to add tests for this?

Author: Herman van Hovell
Closes #9406 from hvanhovell/SPARK-9241-rewriter.
(cherry picked from commit 6d0ead3)
Signed-off-by: Michael Armbrust

This PR is a follow up for PR #9406. It adds more documentation to the rewriting rule, removes a redundant if expression in the non-distinct aggregation path and adds a multiple distinct test to the AggregationQuerySuite.

cc yhuai marmbrus

Author: Herman van Hovell
Closes #9541 from hvanhovell/SPARK-9241-followup.
(cherry picked from commit ef36284)
Signed-off-by: Yin Huai


hvanhovell force-pushed the SPARK-9241-rewriter branch from 6139f47 to 1e705fe Compare November 4, 2015 15:58

rxin mentioned this pull request Nov 5, 2015

[SPARK-9241] [SQL] [WIP] Supporting multiple DISTINCT columns #9280

Closed

hvanhovell added 7 commits November 6, 2015 08:48

rebase

c9d0c1d

Fix a few small bugs.

733fced

Improve readability

d85462d

Fix issue with variable reuse between regular and distinct aggregate operators.

7b5369c

Fix Group By Clause equality

d626c20

Fixing count default values (1)

ece657b

Fixing count default values (2).

9be5b9d

hvanhovell force-pushed the SPARK-9241-rewriter branch from 1e705fe to 9be5b9d Compare November 6, 2015 08:17

Improve docs. Triggering build :P...

d3bdb2b

marmbrus reviewed Nov 6, 2015

asfgit closed this in 6d0ead3 Nov 7, 2015

hvanhovell mentioned this pull request Nov 7, 2015

[SPARK-9241][SQL] Supporting multiple DISTINCT columns - follow-up #9541

Closed
