
Conversation

@maropu
Member

@maropu maropu commented Jun 21, 2016

What changes were proposed in this pull request?

Spark currently cannot use HashAggregateExec for non-partial aggregates because Collect (CollectSet/CollectList) uses a single buffer shared across groups internally. Since SortAggregateExec is expensive in some cases, we had better fix this.

This PR changes the plan from

SortAggregate(key=[key#3077], functions=[collect_set(value#3078, 0, 0)], output=[key#3077,collect_set(value)#3088])
+- *Sort [key#3077 ASC], false, 0
   +- Exchange hashpartitioning(key#3077, 5)
      +- Scan ExistingRDD[key#3077,value#3078]

into

HashAggregate(keys=[key#3077], functions=[collect_set(value#3078, 0, 0)], output=[key#3077, collect_set(value)#3088])
+- Exchange hashpartitioning(key#3077, 5)
   +- Scan ExistingRDD[key#3077,value#3078]
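To illustrate why the hash-based plan was previously unavailable, here is a minimal, hypothetical Java sketch (these are not Spark's actual Scala classes): it contrasts an aggregate whose state lives in one buffer shared across all groups with one whose state lives in a per-group buffer. Hash aggregation feeds rows from many groups interleaved, so shared state mixes groups together:

```java
import java.util.*;

public class SharedBufferDemo {
    // Broken under hash aggregation: a single buffer inside the aggregate object,
    // analogous to the old Collect implementation. The group key is ignored, so
    // interleaved rows from different groups land in the same buffer.
    public static List<Integer> sharedCollect(int[][] rows) {
        List<Integer> buffer = new ArrayList<>();      // one shared buffer
        for (int[] r : rows) buffer.add(r[1]);         // r[0] (the group key) is ignored
        return buffer;
    }

    // Works under hash aggregation: state lives in a buffer looked up per group.
    public static Map<Integer, List<Integer>> perGroupCollect(int[][] rows) {
        Map<Integer, List<Integer>> buffers = new HashMap<>();
        for (int[] r : rows)
            buffers.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        return buffers;
    }

    public static void main(String[] args) {
        int[][] rows = {{0, 1}, {1, 10}, {0, 2}, {1, 20}};  // interleaved groups
        System.out.println(sharedCollect(rows));    // [1, 10, 2, 20] -- groups mixed
        System.out.println(perGroupCollect(rows));  // {0=[1, 2], 1=[10, 20]}
    }
}
```

Sort-based aggregation avoids the problem because it groups rows first: each group's rows arrive consecutively, so a single buffer can be flushed and reset between groups.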

How was this patch tested?

Checked that the non-partial aggregates (collect_set and collect_list) work with HashAggregateExec in DataFrameSuite.

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60919 has finished for PR 13802 at commit 517d7ea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Seq(Row(Seq(1, 2, 3), Seq(Map(3 -> 0), Map(3 -> 0), Map(4 -> 1))))
)
// TODO: We need to implement `UnsafeMapData#hashCode` and `UnsafeMapData#equals` for getting
// a set of input data.
Member Author


Do we need to implement them?

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60928 has finished for PR 13802 at commit 0506453.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60933 has finished for PR 13802 at commit 88ba697.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

@maropu this won't work for other Hive UDAFs, since these also maintain internal state and currently require per-group processing. It also has a greater potential for out-of-memory errors than the sort-based approach.

I do think there is merit in the general idea, but I think we should focus on creating a growable bytes-to-bytes map and byte-backed mutable ArrayData and MapData implementations.
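As a rough sketch of the "growable, byte-backed" idea (the class name and layout here are hypothetical, not Spark's UnsafeArrayData): an array of ints whose elements live in a flat byte[] that doubles when full, so it stays mutable while remaining copyable and spillable as raw bytes:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical byte-backed growable int array. Spark's real ArrayData/MapData
// implementations have a richer layout (null bits, variable-length elements);
// this only illustrates the mutable, byte-backed, growable mechanism.
public class ByteBackedIntArray {
    private byte[] data = new byte[16];   // starts with room for 4 ints
    private int numElements = 0;

    public void append(int value) {
        if ((numElements + 1) * 4 > data.length)
            data = Arrays.copyOf(data, data.length * 2);   // grow by doubling
        ByteBuffer.wrap(data).putInt(numElements * 4, value);
        numElements++;
    }

    public int get(int i) {
        return ByteBuffer.wrap(data).getInt(i * 4);
    }

    public int size() {
        return numElements;
    }
}
```

Because all element data sits in one contiguous byte[], such a structure could back an off-heap or spillable map without per-element object overhead.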

@maropu
Member Author

maropu commented Jun 22, 2016

@hvanhovell Oh, I see. Okay, I'll check whether we can implement mutable ArrayData and MapData.
By the way, I have some questions:

  1. Is there any reason to use SortAggregateExec for all the non-partial aggregates? It seems okay to use HashAggregateExec for non-partial ones other than collect_xxx and Hive UDAFs.
  2. Why do we have no hashCode and equals in UnsafeMapData? ArrayBasedMapData already overrides these methods.

@hvanhovell
Contributor

@maropu all aggregates that currently set supportsPartial = false cannot be partially aggregated and require that the entire group is processed in one step. So the name is a bit misleading; I suppose we could rename it.

UnsafeMapData is typically part of an UnsafeRow, which already implements equals() and hashCode() without requiring its elements to implement these methods (it uses the backing byte array). I suppose we can add these methods.
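The byte-array-based equality described above can be sketched as follows (a hypothetical simplification in Java; UnsafeRow itself compares memory regions via Platform operations and hashes with Murmur3): equals() and hashCode() operate on the backing bytes of a region, so the contained elements never need their own implementations:

```java
import java.util.Arrays;

// Hypothetical sketch of UnsafeRow-style equality: a value is just a region of a
// backing byte array, and equality/hashing read those bytes directly rather than
// deserializing and comparing element objects.
public class ByteRegion {
    public final byte[] base;
    public final int offset;
    public final int length;

    public ByteRegion(byte[] base, int offset, int length) {
        this.base = base;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ByteRegion)) return false;
        ByteRegion other = (ByteRegion) o;
        return Arrays.equals(
            Arrays.copyOfRange(base, offset, offset + length),
            Arrays.copyOfRange(other.base, other.offset, other.offset + other.length));
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(Arrays.copyOfRange(base, offset, offset + length));
    }
}
```

This is why deduplicating UnsafeMapData values (e.g. for collect_set) only needs the backing bytes to be in a canonical layout, not per-element equals()/hashCode().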

@maropu
Member Author

maropu commented Jun 22, 2016

@hvanhovell As for UnsafeMapData, could you check #13847?

@maropu
Member Author

maropu commented Jun 22, 2016

As for supportsPartial, I understand that collect and Hive UDAFs have this limitation, but what about AggregateWindowFunction? It seems functions like RowNumber and Rank work well even with HashAggregateExec and don't have the limitation. BTW, we at least need to fix the comment for supportsPartial: "Currently Hive UDAF is the only one that doesn't support partial aggregation." is now incorrect; https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L178

@hvanhovell
Contributor

I think we should rename and document supportsPartial to reflect what it actually does.

Rank and RowNumber are window functions. They both rely on ordered evaluation, and they should never be evaluated in any *AggregateExec operator; we have a rule in the analyzer to prevent this.

@maropu
Member Author

maropu commented Jun 22, 2016

Thanks for your explanation!

@maropu
Member Author

maropu commented Jun 22, 2016

Is it okay to make a new pr to fix these?

@hvanhovell
Contributor

What do you want to fix? WindowAggregateFunctions?

@maropu
Member Author

maropu commented Jun 22, 2016

No, I'd just like to fix the incorrect comments. Or could you?

@hvanhovell
Contributor

Could you have a go? Would be great!

@maropu
Member Author

maropu commented Jun 22, 2016

okay

@hvanhovell
Contributor

@maropu could you close this one? It is not that relevant anymore. Thanks for working on it though!

@maropu
Member Author

maropu commented Aug 31, 2016

yea, thanks!

@maropu maropu closed this Aug 31, 2016
@maropu maropu deleted the SPARK-16094 branch July 5, 2017 11:47