Skip to content

Conversation

@hvanhovell
Copy link
Contributor

The second PR for SPARK-9241, this adds support for multiple distinct columns to the new aggregation code path.

This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the JIRA ticket for some information on this. The advantages over the - competing - first PR are:

  • This can use the faster TungstenAggregate code path.
  • It is impossible to OOM due to an OpenHashSet allocating to much memory. However, this will multiply the number of input rows by the number of distinct clauses (plus one), and puts a lot more memory pressure on the aggregation code path itself.

The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed.

cc @yhuai - Could you also tell me where to add tests for this?

@marmbrus
Copy link
Contributor

marmbrus commented Nov 3, 2015

ok to test

@SparkQA
Copy link

SparkQA commented Nov 3, 2015

Test build #44915 has finished for PR 9406 at commit 6139f47.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Expand(\n

@SparkQA
Copy link

SparkQA commented Nov 4, 2015

Test build #45019 has finished for PR 9406 at commit 1e705fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Expand(\n

@SparkQA
Copy link

SparkQA commented Nov 6, 2015

Test build #45207 has finished for PR 9406 at commit 9be5b9d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * final class ShuffleSortDataFormat extends SortDataFormat<PackedRecordPointer, LongArray>\n * final class UnsafeSortDataFormat extends SortDataFormat<RecordPointerAndKeyPrefix, LongArray>\n * case class Expand(\n

@hvanhovell
Copy link
Contributor Author

Hmmmm... this is a bit of a strange error.

@hvanhovell
Copy link
Contributor Author

Jenkins retest this please

@hvanhovell
Copy link
Contributor Author

Jenkins is not retesting... @marmbrus could you add me to the whitelist?

@SparkQA
Copy link

SparkQA commented Nov 6, 2015

Test build #45219 has finished for PR 9406 at commit d3bdb2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * case class Expand(\n

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be really helpful if there was an example of what this rewrite looks like here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add an example in the follow-up PR.

@yhuai
Copy link
Contributor

yhuai commented Nov 6, 2015

@hvanhovell I have started to use this PR as the foundation of removing our old aggregation code path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on what each tuple element is, or maybe even use a case class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add documentation in a follow-up PR.

@marmbrus
Copy link
Contributor

marmbrus commented Nov 7, 2015

Okay, I looked over this pretty quickly and it looks awesome. We need some tests and we are super close to me cutting a preview release. That said, I'd really like to include this in 1.6. Here is my proposal:

asfgit pushed a commit that referenced this pull request Nov 7, 2015
…g Rule

The second PR for SPARK-9241, this adds support for multiple distinct columns to the new aggregation code path.

This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-9241) for some information on this. The advantages over the - competing - [first PR](#9280) are:
- This can use the faster TungstenAggregate code path.
- It is impossible to OOM due to an ```OpenHashSet``` allocating to much memory. However, this will multiply the number of input rows by the number of distinct clauses (plus one), and puts a lot more memory pressure on the aggregation code path itself.

The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed.

cc yhuai - Could you also tell me where to add tests for this?

Author: Herman van Hovell <[email protected]>

Closes #9406 from hvanhovell/SPARK-9241-rewriter.

(cherry picked from commit 6d0ead3)
Signed-off-by: Michael Armbrust <[email protected]>
@asfgit asfgit closed this in 6d0ead3 Nov 7, 2015
asfgit pushed a commit that referenced this pull request Nov 7, 2015
This PR is a follow up for PR #9406. It adds more documentation to the rewriting rule, removes a redundant if expression in the non-distinct aggregation path and adds a multiple distinct test to the AggregationQuerySuite.

cc yhuai marmbrus

Author: Herman van Hovell <[email protected]>

Closes #9541 from hvanhovell/SPARK-9241-followup.

(cherry picked from commit ef36284)
Signed-off-by: Yin Huai <[email protected]>
asfgit pushed a commit that referenced this pull request Nov 7, 2015
This PR is a follow up for PR #9406. It adds more documentation to the rewriting rule, removes a redundant if expression in the non-distinct aggregation path and adds a multiple distinct test to the AggregationQuerySuite.

cc yhuai marmbrus

Author: Herman van Hovell <[email protected]>

Closes #9541 from hvanhovell/SPARK-9241-followup.
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015
This PR is a follow up for PR apache/spark#9406. It adds more documentation to the rewriting rule, removes a redundant if expression in the non-distinct aggregation path and adds a multiple distinct test to the AggregationQuerySuite.

cc yhuai marmbrus

Author: Herman van Hovell <[email protected]>

Closes #9541 from hvanhovell/SPARK-9241-followup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants