[WIP] [SPARK-4273] [SQL] Providing ExternalSet to avoid OOM when count(distinct) #3137

YanTangZhai · 2014-11-06T13:40:54Z

Some tasks may run out of memory (OOM) during count(distinct) when they have to process many records. CombineSetsAndCountFunction puts all records into an OpenHashSet, so a task that fetches many records may occupy a large amount of memory.
I think a data structure, ExternalSet, similar to ExternalAppendOnlyMap, could be provided to store OpenHashSet data on disk when its capacity exceeds some threshold.
For example, OpenHashSet1 (ohs1) contains [d, b, c, a]. It is sorted by hashCode and spilled to file1, so file1 contains [a, b, c, d]. The procedure is as follows:
ohs1 [d, b, c, a] => [a, b, c, d] => file1
ohs2 [e, f, g, a] => [a, e, f, g] => file2
ohs3 [e, h, i, g] => [e, g, h, i] => file3
ohs4 [j, h, a] => [a, h, j] => sortedSet
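
The spill step above can be sketched roughly as follows. This is only an illustration, not the code in this patch: `SpillingSet`, `insert`, and `sortedRuns` are hypothetical names, a plain `mutable.HashSet` stands in for OpenHashSet, and each "file" is modeled as an in-memory `Vector` so the sketch stays self-contained.

```scala
import scala.collection.mutable

// Hypothetical sketch: SpillingSet and its members are illustrative names,
// not the ExternalSet class added by this patch.
final class SpillingSet[K](threshold: Int) {
  // In-memory set currently being filled (stand-in for OpenHashSet).
  private val current = mutable.HashSet.empty[K]
  // Each spilled run stands in for one on-disk file, ordered by hashCode.
  private val spilled = mutable.ArrayBuffer.empty[Vector[K]]

  def insert(key: K): Unit = {
    current += key
    if (current.size >= threshold) spill()
  }

  // Sort the in-memory keys by hashCode, record them as a run, and reset.
  private def spill(): Unit = {
    spilled += current.toVector.sortBy(_.hashCode)
    current.clear()
  }

  /** Spilled runs followed by the sorted in-memory remainder. */
  def sortedRuns: Vector[Vector[K]] =
    spilled.toVector :+ current.toVector.sortBy(_.hashCode)
}
```

With a threshold of 4 and the keys from the example inserted in order (d, b, c, a, e, f, g, a, e, h, i, g, j, h, a), `sortedRuns` gives the three spilled runs followed by the in-memory remainder [a, h, j]. A real implementation would serialize each run to a file instead of keeping it in memory.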
On output, all keys with the same hashCode are put into one OpenHashSet, and then the iterator of that OpenHashSet is consumed. The procedure is as follows:
file1 -> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
file1 -> b -> ohsB; ohsB -> b;
file1 -> c -> ohsC; ohsC -> c;
file1 -> d -> ohsD; ohsD -> d;
file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE -> e;
...
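
The output step could look something like the sketch below, assuming every run is already sorted by hashCode. `HashSortedMerge.distinct` is a hypothetical helper, not the API in this patch; it uses a priority queue over the runs and a `LinkedHashSet` as a stand-in for the per-hashCode OpenHashSet.

```scala
import scala.collection.mutable

// Hypothetical helper, not the patch's API: merges runs that are each
// already sorted by hashCode and yields every distinct key once.
object HashSortedMerge {
  def distinct[K](runs: Seq[Seq[K]]): Iterator[K] = {
    // Min-heap keyed on the hashCode of each run's next key.
    val heap = mutable.PriorityQueue.empty[BufferedIterator[K]](
      Ordering.by[BufferedIterator[K], Int](_.head.hashCode).reverse)
    heap ++= runs.map(_.iterator.buffered).filter(_.hasNext)

    new Iterator[K] {
      private var pending: Iterator[K] = Iterator.empty

      def hasNext: Boolean = pending.hasNext || heap.nonEmpty

      def next(): K = {
        if (!pending.hasNext) pending = nextGroup()
        pending.next()
      }

      // Collect all keys sharing the smallest hashCode into one set.
      // Equal keys always have equal hashCodes, so duplicates meet here.
      private def nextGroup(): Iterator[K] = {
        val hash = heap.head.head.hashCode
        val group = mutable.LinkedHashSet.empty[K]
        while (heap.nonEmpty && heap.head.head.hashCode == hash) {
          val it = heap.dequeue()
          while (it.hasNext && it.head.hashCode == hash) group += it.next()
          // Re-insert the run only after it has been advanced.
          if (it.hasNext) heap.enqueue(it)
        }
        group.iterator
      }
    }
  }
}
```

For the four runs in the example ([a, b, c, d], [a, e, f, g], [e, g, h, i], [a, h, j]), the iterator yields a through j once each, so the distinct count is 10. Only the keys of one hashCode group are held in memory at a time, which is the point of the approach.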
I think using ExternalSet could avoid OOM during count(distinct). Comments are welcome.


SparkQA · 2014-11-06T13:44:47Z

Test build #23003 has started for PR 3137 at commit eecb499.

This patch merges cleanly.

SparkQA · 2014-11-06T15:33:26Z

Test build #23003 has finished for PR 3137 at commit eecb499.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ExternalSet[K](
- class KeyArraySortDataFormat[T : ClassTag] extends SortDataFormat[T, Array[T]]
- case class CombineSetsAndCountExternal(inputSet: Expression) extends AggregateExpression
- case class CombineSetsAndCountExternalFunction(

AmplabJenkins · 2014-11-06T15:33:29Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23003/

scwf · 2014-11-06T17:48:38Z

Do you have a test for this?

chenghao-intel · 2014-11-08T14:59:58Z

A better approach is probably to provide sort-merge aggregation in general, not just for distinct count. I am working on a POC and hope we can discuss it soon.

marmbrus · 2014-12-17T19:25:13Z

Thanks for working on this! Looks like there is still some discussion on the correct approach here. To keep the PR queue small, I propose we close this issue and revisit once there is a full design.

YanTangZhai · 2014-12-18T07:57:09Z

@marmbrus Thanks. I'm also trying another approach to optimize this operation. I want to discuss it with you later.

YanTangZhai and others added 7 commits August 6, 2014 21:07

Merge pull request #1 from apache/master

cdef539

update

Merge pull request #3 from apache/master

cbcba66

Update

Merge pull request #6 from apache/master

8a00106

Update

Merge pull request #7 from apache/master

03b62b0

Update

Merge pull request #8 from apache/master

76d4027

update

Merge pull request #9 from apache/master

d26d982

Update

A method to avoid OOM when count(distinct) by providing ExternalSet

eecb499

asfgit closed this in ca12608 Dec 17, 2014

6 participants
