[SPARK-9241][SQL] Supporting multiple DISTINCT columns (2) - Rewriting Rule #9406
ok to test
Test build #44915 has finished for PR 9406 at commit
6139f47 to 1e705fe
Test build #45019 has finished for PR 9406 at commit
1e705fe to 9be5b9d
Test build #45207 has finished for PR 9406 at commit
Hmmmm... this is a bit of a strange error.
Jenkins retest this please
Jenkins is not retesting... @marmbrus could you add me to the whitelist?
Test build #45219 has finished for PR 9406 at commit
It would be really helpful if there was an example of what this rewrite looks like here.
I'll add an example in the follow-up PR.
@hvanhovell I have started to use this PR as the foundation of removing our old aggregation code path.
Comment on what each tuple element is, or maybe even use a case class?
I'll add documentation in a follow-up PR.
Okay, I looked over this pretty quickly and it looks awesome. We need some tests and we are super close to me cutting a preview release. That said, I'd really like to include this in 1.6. Here is my proposal:
…g Rule

The second PR for SPARK-9241, this adds support for multiple distinct columns to the new aggregation code path.

This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-9241) for some information on this. The advantages over the competing [first PR](#9280) are:
- This can use the faster TungstenAggregate code path.
- It is impossible to OOM due to an `OpenHashSet` allocating too much memory.

However, this will multiply the number of input rows by the number of distinct clauses (plus one), and puts a lot more memory pressure on the aggregation code path itself.

The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed.

cc yhuai - Could you also tell me where to add tests for this?

Author: Herman van Hovell

Closes #9406 from hvanhovell/SPARK-9241-rewriter.

(cherry picked from commit 6d0ead3)
Signed-off-by: Michael Armbrust
This PR is a follow up for PR #9406. It adds more documentation to the rewriting rule, removes a redundant if expression in the non-distinct aggregation path and adds a multiple distinct test to the AggregationQuerySuite.

cc yhuai marmbrus

Author: Herman van Hovell

Closes #9541 from hvanhovell/SPARK-9241-followup.

(cherry picked from commit ef36284)
Signed-off-by: Yin Huai
The second PR for SPARK-9241, this adds support for multiple distinct columns to the new aggregation code path.
This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the JIRA ticket for some information on this. The advantages over the competing first PR are:
- This can use the faster TungstenAggregate code path.
- It is impossible to OOM due to an OpenHashSet allocating too much memory.

However, this will multiply the number of input rows by the number of distinct clauses (plus one), and puts a lot more memory pressure on the aggregation code path itself.

The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed.
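For illustration, here is a minimal sketch of what an Expand-Aggregate-Aggregate rewrite could look like for a query such as `SELECT key, COUNT(DISTINCT cat1), COUNT(DISTINCT cat2), SUM(value) FROM data GROUP BY key`. It is plain Python over in-memory tuples, not Spark code: the column names, the `gid` group tag, and the three helper functions are assumptions made for this example and are not taken from the rule's implementation. It also shows where the row multiplication comes from: every input row is emitted once per distinct clause, plus once for the regular aggregates.

```python
from collections import defaultdict

# Hypothetical input rows: (key, cat1, cat2, value).
rows = [
    ("a", "x", "p", 1),
    ("a", "x", "q", 2),
    ("a", "y", "q", 3),
    ("b", "x", "p", 4),
]


def expand(rows):
    """Emit each row once per distinct clause, plus once for the regular aggregates.

    gid 0 carries `value` for SUM, gid 1 carries cat1, gid 2 carries cat2.
    Columns a group does not need are nulled out (None).
    """
    for key, cat1, cat2, value in rows:
        yield (key, None, None, 0, value)
        yield (key, cat1, None, 1, None)
        yield (key, None, cat2, 2, None)


def first_aggregate(expanded):
    """Group by (key, cat1, cat2, gid).

    Grouping on the distinct columns de-duplicates them; the gid 0 group
    accumulates a partial SUM(value).
    """
    partial = defaultdict(int)
    for key, cat1, cat2, gid, value in expanded:
        partial[(key, cat1, cat2, gid)] += value if value is not None else 0
    return partial


def second_aggregate(partial):
    """Group by key only.

    Each surviving gid 1 / gid 2 row is one distinct non-null value, so a
    plain count per gid yields COUNT(DISTINCT ...); gid 0 carries the sum.
    """
    result = defaultdict(lambda: {"count_cat1": 0, "count_cat2": 0, "sum_value": 0})
    for (key, cat1, cat2, gid), total in partial.items():
        out = result[key]
        if gid == 1 and cat1 is not None:
            out["count_cat1"] += 1
        elif gid == 2 and cat2 is not None:
            out["count_cat2"] += 1
        elif gid == 0:
            out["sum_value"] += total
    return dict(result)


expanded = list(expand(rows))
result = second_aggregate(first_aggregate(expanded))
print(len(rows), "input rows expanded to", len(expanded))
print(result)
```

With two distinct clauses, the four input rows become twelve expanded rows (two clauses plus one), which is the extra pressure on the aggregation path mentioned above. In exchange, no per-group hash set of distinct values has to be held in memory.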
cc @yhuai - Could you also tell me where to add tests for this?