@YanTangZhai
Contributor

Some tasks may OOM when running count(distinct) if they need to process many records. CombineSetsAndCountFunction puts all records into an OpenHashSet, so if it fetches many records, the set may occupy a large amount of memory.
I think a data structure ExternalSet, similar to ExternalAppendOnlyMap, could be provided to spill OpenHashSet data to disk when its capacity exceeds some threshold.
For example, OpenHashSet1 (ohs1) has [d, b, c, a]. It is spilled to file1 sorted by hashCode, so file1 contains [a, b, c, d]. The procedure is as follows:
ohs1 [d, b, c, a] => [a, b, c, d] => file1
ohs2 [e, f, g, a] => [a, e, f, g] => file2
ohs3 [e, h, i, g] => [e, g, h, i] => file3
ohs4 [j, h, a] => [a, h, j] => sortedSet
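
The spill step above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this patch: `SpillSketch`, `spill`, and `readBack` are hypothetical names, keys are plain strings, and a standard Scala `Set` stands in for OpenHashSet.

```scala
import java.io.{DataInputStream, DataOutputStream, File, FileInputStream, FileOutputStream}

// Hypothetical helper for illustration only; not the ExternalSet from this PR.
object SpillSketch {
  // Sort the in-memory keys by hashCode and write them to a temporary file.
  def spill(keys: Set[String]): File = {
    val file = File.createTempFile("external-set-sketch", ".spill")
    file.deleteOnExit()
    val out = new DataOutputStream(new FileOutputStream(file))
    try {
      val sorted = keys.toSeq.sortBy(_.hashCode)
      out.writeInt(sorted.size)      // record count first
      sorted.foreach(out.writeUTF)   // then the keys in hashCode order
    } finally {
      out.close()
    }
    file
  }

  // Read the keys back in the order they were written.
  def readBack(file: File): Vector[String] = {
    val in = new DataInputStream(new FileInputStream(file))
    try {
      val n = in.readInt()
      Vector.fill(n)(in.readUTF())
    } finally {
      in.close()
    }
  }
}
```

For single-character strings the hashCode is the character code, so spilling [d, b, c, a] this way yields [a, b, c, d], matching the example.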
On output, all keys with the same hashCode are put into an OpenHashSet, and the iterator of that OpenHashSet is then read. The procedure is as follows:
file1 -> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
file1 -> b -> ohsB; ohsB -> b;
file1 -> c -> ohsC; ohsC -> c;
file1 -> d -> ohsD; ohsD -> d;
file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE -> e;
...
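
The output step can be sketched as a k-way merge over sources that are each already sorted by hashCode. This is a minimal sketch under assumptions, not the patch's implementation: `MergeSketch` and `mergeDistinct` are hypothetical names, each spilled file or in-memory sorted set is modelled as a plain `Iterator`, and a `LinkedHashSet` stands in for the per-hashCode OpenHashSet.

```scala
import scala.collection.mutable

// Hypothetical helper for illustration only; not the ExternalSet from this PR.
object MergeSketch {
  // Each source must already be sorted by hashCode.
  def mergeDistinct[K](sources: Seq[Iterator[K]]): Iterator[K] = {
    // Min-heap of sources, ordered by the hashCode of each source's next key.
    val heap = mutable.PriorityQueue.empty[BufferedIterator[K]](
      Ordering.by[BufferedIterator[K], Int](_.head.hashCode).reverse)
    sources.map(_.buffered).filter(_.hasNext).foreach(it => heap.enqueue(it))

    new Iterator[K] {
      private var group: Iterator[K] = Iterator.empty

      def hasNext: Boolean = group.hasNext || heap.nonEmpty

      def next(): K = {
        if (!group.hasNext) group = nextGroup()
        group.next()
      }

      // Collect every key sharing the smallest pending hashCode into one set,
      // so duplicates across sources collapse while colliding keys are kept.
      private def nextGroup(): Iterator[K] = {
        val minHash = heap.head.head.hashCode
        val sameHash = mutable.LinkedHashSet.empty[K]
        while (heap.nonEmpty && heap.head.head.hashCode == minHash) {
          val it = heap.dequeue()
          while (it.hasNext && it.head.hashCode == minHash) sameHash += it.next()
          if (it.hasNext) heap.enqueue(it) // re-insert under its new head
        }
        sameHash.iterator
      }
    }
  }
}
```

Only the keys for one hashCode are held in memory at a time, which is the point of the design: memory use is bounded by the largest group of colliding keys rather than by the total number of distinct keys.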
I think using ExternalSet could avoid OOM in count(distinct). Comments are welcome.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #23003 has started for PR 3137 at commit eecb499.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #23003 has finished for PR 3137 at commit eecb499.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExternalSet[K](
    • class KeyArraySortDataFormat[T : ClassTag] extends SortDataFormat[T, Array[T]]
    • case class CombineSetsAndCountExternal(inputSet: Expression) extends AggregateExpression
    • case class CombineSetsAndCountExternalFunction(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23003/

@scwf
Contributor

scwf commented Nov 6, 2014

Do you have a test for this?

@chenghao-intel
Contributor

A better approach is probably to provide sort-merge aggregation in general, not just for DistinctCount. I am working on a POC and hope we can discuss it soon.

@marmbrus
Contributor

Thanks for working on this! Looks like there is still some discussion on the correct approach here. To keep the PR queue small, I propose we close this issue and revisit once there is a full design.

@asfgit asfgit closed this in ca12608 Dec 17, 2014
@YanTangZhai
Contributor Author

@marmbrus Thanks. I'm also trying another approach to optimize this operation. I want to discuss it with you later.
