[SPARK-16094][SQL] Support HashAggregateExec for non-partial aggregates #13802
Conversation
|
Test build #60919 has finished for PR 13802 at commit
|
Seq(Row(Seq(1, 2, 3), Seq(Map(3 -> 0), Map(3 -> 0), Map(4 -> 1))))
)
// TODO: We need to implement `UnsafeMapData#hashCode` and `UnsafeMapData#equals` for getting
// a set of input data.
Do we need to implement them?
|
Test build #60928 has finished for PR 13802 at commit
|
|
Test build #60933 has finished for PR 13802 at commit
|
|
@maropu this won't work for other I do think there is merit in the general idea, but I think we should be focusing on creating a growable bytes-to-bytes map and creating byte-backed mutable |
|
@hvanhovell oh, I see. Okay, I'll check whether we can implement mutable
|
|
@maropu all aggregates that current set
|
|
@hvanhovell As for |
|
As for |
|
I think we should rename and document
|
|
Thanks for your explanation! |
|
Is it okay to make a new pr to fix these? |
|
What do you want to fix? WindowAggregateFunctions? |
|
No, I'd just like to fix the incorrect comments. Or could you? |
|
Could you have a go? Would be great! |
|
okay |
|
@maropu could you close this one? It is not that relevant anymore. Thanks for working on it though! |
|
yea, thanks! |
What changes were proposed in this pull request?
Spark currently cannot use `HashAggregateExec` for non-partial aggregates because `Collect` (`CollectSet`/`CollectList`) uses a single shared buffer internally. Since `SortAggregateExec` is expensive in some cases, we'd be better off fixing this.
This PR changes plans from
into
How was this patch tested?
Checked in `DataFrameSuite` that non-partial aggregates (`collect_set` and `collect_list`) work well with `HashAggregateExec`.
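As a rough illustration of the shared-buffer limitation mentioned in the PR description, the sketch below uses plain Scala collections only. It is not Spark's actual `Collect` implementation: `SharedBufferCollect`, `SharedBufferDemo`, and all method names are hypothetical and exist only for this example. It shows why a single shared buffer gives correct results when rows arrive grouped by key (sort-based aggregation) and loses values when keys interleave, which is why hash-based aggregation needs one buffer per key.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an aggregate that keeps ONE buffer for all groups.
// This is an assumption made for illustration, not Spark's Collect code.
final class SharedBufferCollect {
  private val buffer = mutable.ArrayBuffer.empty[Int]
  def reset(): Unit = buffer.clear()
  def update(v: Int): Unit = buffer += v
  def result(): List[Int] = buffer.toList
}

object SharedBufferDemo {
  // Streams rows through the single shared buffer, emitting a result and
  // resetting the buffer every time the key changes.
  private def runWithSharedBuffer(rows: Seq[(String, Int)]): Map[String, List[Int]] = {
    val agg = new SharedBufferCollect
    val out = mutable.LinkedHashMap.empty[String, List[Int]]
    var current: Option[String] = None
    for ((k, v) <- rows) {
      if (!current.contains(k)) {
        current.foreach(prev => out(prev) = agg.result())
        agg.reset()
        current = Some(k)
      }
      agg.update(v)
    }
    current.foreach(prev => out(prev) = agg.result())
    out.toMap
  }

  // Sort-based aggregation: sorting first guarantees each key is contiguous,
  // so the shared buffer is safe to reuse.
  def sortAggregate(rows: Seq[(String, Int)]): Map[String, List[Int]] =
    runWithSharedBuffer(rows.sortBy(_._1))

  // Same shared buffer, but rows are NOT grouped: a later run of a key
  // overwrites the values collected in its earlier run.
  def sharedBufferUnsorted(rows: Seq[(String, Int)]): Map[String, List[Int]] =
    runWithSharedBuffer(rows)

  // Hash-based aggregation: one buffer per key, so input order does not matter.
  def hashAggregate(rows: Seq[(String, Int)]): Map[String, List[Int]] = {
    val buffers = mutable.HashMap.empty[String, mutable.ArrayBuffer[Int]]
    for ((k, v) <- rows) buffers.getOrElseUpdate(k, mutable.ArrayBuffer.empty[Int]) += v
    buffers.map { case (k, b) => k -> b.toList }.toMap
  }
}
```

With interleaved input such as `Seq(("a", 1), ("b", 2), ("a", 3))`, `sortAggregate` and `hashAggregate` agree, while `sharedBufferUnsorted` drops the first value collected for `"a"`.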