Implement ApproximateCountDistinct for SparkSql #737

larvaboy · 2014-05-12T10:33:34Z

Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partition, and 2) merging the HyperLogLog results from different partitions.

A simple serializer and test cases are added as well.
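
For readers unfamiliar with the two-phase shape described above, here is a minimal sketch. It is not the code in this patch and it does not use stream-lib: `ToyHyperLogLog`, `TwoPhaseCount`, and `approxCountDistinct` are hypothetical names, the hash is Scala's `MurmurHash3.stringHash` on `toString`, and partitions are modeled as plain `Seq`s rather than RDD partitions.

```scala
import scala.util.hashing.MurmurHash3

// Toy HyperLogLog with 2^p registers. Illustrative only: this is NOT
// stream-lib's implementation and it skips the large-range correction.
final class ToyHyperLogLog(val p: Int) {
  require(p >= 7 && p <= 16, "p must be in [7, 16]")
  private val m: Int = 1 << p
  private val registers = new Array[Byte](m)

  // The top p hash bits pick a register; the remaining bits contribute
  // the position of their first 1-bit (the "rank").
  def offer(value: Any): Unit = {
    val h = MurmurHash3.stringHash(value.toString)
    val idx = h >>> (32 - p)
    val rest = h << p
    val rank = if (rest == 0) 32 - p + 1 else Integer.numberOfLeadingZeros(rest) + 1
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  // Merging two sketches is a register-wise max.
  def merge(other: ToyHyperLogLog): ToyHyperLogLog = {
    require(other.p == p, "sketches must use the same precision")
    val out = new ToyHyperLogLog(p)
    var i = 0
    while (i < m) {
      out.registers(i) = math.max(registers(i), other.registers(i)).toByte
      i += 1
    }
    out
  }

  // Raw HyperLogLog estimate with the small-range (linear counting) correction.
  def cardinality: Long = {
    val alpha = 0.7213 / (1.0 + 1.079 / m)
    val sum = registers.map(r => math.pow(2.0, -r.toDouble)).sum
    val raw = alpha * m * m / sum
    val zeros = registers.count(_ == 0)
    val est = if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
    math.round(est)
  }
}

object TwoPhaseCount {
  def approxCountDistinct(partitions: Seq[Seq[Any]], p: Int = 12): Long = {
    // Phase 1: build one sketch per partition.
    val partials = partitions.map { part =>
      val hll = new ToyHyperLogLog(p)
      part.foreach(hll.offer)
      hll
    }
    // Phase 2: merge the partial sketches and read off the estimate.
    partials.foldLeft(new ToyHyperLogLog(p))(_ merge _).cardinality
  }
}
```

Because the merge is a register-wise max, the result does not depend on how rows are split across partitions or on the order in which partial sketches are combined, which is what makes the partial/final split safe.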

AmplabJenkins · 2014-05-12T10:37:58Z

Can one of the admins verify this patch?

pwendell · 2014-05-12T17:30:45Z

This patch duplicates some logic that already exists elsewhere in Spark. Would you mind updating it to use this class?
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SerializableHyperLogLog.scala

marmbrus · 2014-05-12T18:28:52Z

@pwendell, I don't think that will work: Spark SQL does its own serialization for shuffles, sometimes using Kryo, and I don't think that SerializableHyperLogLog works with Kryo.

marmbrus · 2014-05-12T18:34:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala

I'm normally all for the Option pattern, but in this case you are probably incurring more object allocations than we want in the critical path of query execution. I'd just use an if here.

This has been changed into a null check.
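
To illustrate the trade-off discussed in this thread, here is a small sketch of the two styles. The names `NullHandling`, `addWithOption`, and `addWithNullCheck` are hypothetical and are not taken from aggregates.scala; a mutable set stands in for whatever state the aggregate actually updates.

```scala
import scala.collection.mutable

object NullHandling {
  // Option-based update: wraps every non-null input in a Some before using it,
  // which means one extra short-lived object per row.
  def addWithOption(seen: mutable.Set[Any], value: Any): Unit =
    Option(value).foreach(v => seen += v)

  // Plain null check: same observable behavior, no wrapper object per row.
  def addWithNullCheck(seen: mutable.Set[Any], value: Any): Unit =
    if (value != null) seen += value
}
```

Both helpers skip nulls and produce the same set; the difference is only the per-row allocation, which matters when the update runs once for every input row of a query.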

rxin · 2014-05-12T18:38:58Z

Bypassing SerializableHyperLogLog has a few benefits:

Less memory usage because we don't need the wrapper.
Works with Spark SQL's internal serializer.
stream-lib will actually make HyperLogLog serializable in its next release, so SerializableHyperLogLog will be gone.

marmbrus · 2014-05-12T18:50:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala

Having a default here is reasonable, but we should probably expose this to the user as well. Maybe two versions in the parser?

Please refer to the most recent version where we have another parser allowing users to pass in the standard deviation.

The first version has the benefit of hiding the implementation details from the user. The standard deviation is not an intuitive parameter for an end user, especially given its side effect on memory usage.

Please let me know your thoughts on the new version.
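
As a rough illustration of why the standard deviation affects memory: the textbook HyperLogLog error bound is about 1.04 / sqrt(m) for m registers, so a tighter target needs quadratically more registers. The sketch below uses that textbook bound only; `HllSizing`, `log2mFor`, and `registersFor` are hypothetical names, and stream-lib's own mapping from standard deviation to register count may use different constants or rounding.

```scala
object HllSizing {
  // Smallest log2(m) whose theoretical error 1.04 / sqrt(m) is at most rsd.
  // Assumption: textbook HyperLogLog bound, not stream-lib's exact formula.
  def log2mFor(rsd: Double): Int = {
    require(rsd > 0.0 && rsd < 1.0, "rsd must be in (0, 1)")
    val m = math.pow(1.04 / rsd, 2)
    math.ceil(math.log(m) / math.log(2)).toInt
  }

  // Number of registers implied by the target relative standard deviation.
  def registersFor(rsd: Double): Int = 1 << log2mFor(rsd)
}
```

Under this bound, a 5% target needs 512 registers and a 1% target needs 16384, so a fivefold tighter error costs 32 times the registers here (25 times from the bound itself, then rounded up to a power of two).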

pwendell · 2014-05-12T20:53:45Z

@marmbrus @rxin ah okay guys - sorry for my wrong comment :)

larvaboy · 2014-05-13T09:30:21Z

All the review issues should have been fixed in the most recent version of the code. Please let me know if I missed anything.

Thanks a lot for the quick feedback.

We use stream-lib's HyperLogLog to approximately count the number of distinct elements in each partition, and merge the HyperLogLogs to compute the final result. If the expressions cannot be successfully broken apart, we fall back to the exact CountDistinct.

marmbrus · 2014-05-13T18:30:55Z

LGTM. Thanks for doing this!

larvaboy · 2014-05-13T18:49:35Z

Thanks, Michael.

I just rearranged my change sets a bit to put them together. Let me know if anything else is needed to merge this upstream.

rxin · 2014-05-14T04:25:57Z

Thanks. I merged this.

Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions.

A simple serializer and test cases are added as well.

Author: larvaboy <[email protected]>

Closes #737 from larvaboy/master and squashes the following commits:

bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct.
9ba8360 [larvaboy] Fix alignment and null handling issues.
95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct.
f57917d [larvaboy] Add the parser for the approximate count.
a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions.
7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog.
1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class.
653542b [larvaboy] Fix a couple of minor typos.

(cherry picked from commit c33b8dc)
Signed-off-by: Reynold Xin <[email protected]>

marmbrus reviewed May 12, 2014
larvaboy added 4 commits May 13, 2014 11:30

Fix a couple of minor typos.

653542b

Fix a minor typo in the toString method of the Count case class.

1d9aacf

Add SparkSql serializer for HyperLogLog.

7ad273a

larvaboy added 4 commits May 13, 2014 11:30

Add the parser for the approximate count.

f57917d

Add a test case for count distinct and approximate count distinct.

95b4067

Fix alignment and null handling issues.

9ba8360

Add support of user-provided standard deviation to ApproxCountDistinct.

bd8ef3f

asfgit closed this in c33b8dc May 14, 2014

turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025

[HADP-57165] Adaptive maxFilePerTask threshold (apache#737)

04d7191
