Skip to content

Conversation

@larvaboy
Copy link
Contributor

Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions.

A simple serializer and test cases are added as well.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@pwendell
Copy link
Contributor

This patch duplicates some logic that already exists elsewhere in Spark - would you mind updating it to use this class?:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SerializableHyperLogLog.scala

@marmbrus
Copy link
Contributor

@pwendell, I don't think that will work as Spark SQL does its own serialization for shuffles sometimes using Kryo and I don't think that SerializableHyperLogLog works with Kryo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm normally all for the Option pattern, but in this case you are probably incurring more object allocations that we want to in the critical path of query execution. I'd just use an if here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been changed into a null check.

@rxin
Copy link
Contributor

rxin commented May 12, 2014

Bypassing SerializableHyperLogLog has a few benefits:

  1. Less memory usage because we don't need the wrapper.
  2. Works with Spark SQL's internal serializer.
  3. stream-lib will actually make HyperLogLog serializable next release - so SerializableHyperLogLog will be gone ....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a default here is reasonable, but we should probably expose this to the user as well. Maybe two versions in the parser?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to the most recent version where we have another parser allowing users to pass in the standard deviation.

The first version has the benefit of hiding the implementation details from the user. The standard deviation is not an intuitive parameter for an end user, especially given its side effect to the memory usage.

Please let me know your thoughts on the new version.

@pwendell
Copy link
Contributor

@marmbrus @rxin ah okay guys - sorry for my wrong comment :)

@larvaboy
Copy link
Contributor Author

All the review issues should have been fixed in the most recent version of the code. Please let me know if I missed anything.

Thanks a lot for the quick feedback.

larvaboy added 4 commits May 13, 2014 11:30
We use stream-lib's HyperLogLog to approximately count the number of
distinct elements in each partition, and merge the HyperLogLogs to
compute the final result.

If the expressions can not be successfully broken apart, we fall back to
the exact CountDistinct.
@marmbrus
Copy link
Contributor

LGTM. Thanks for doing this!

@larvaboy
Copy link
Contributor Author

Thanks, Michael.

I just re-arranged my change sets a bit to put them together. Let me know if there's anything else needed to merge this to the upstream.

@rxin
Copy link
Contributor

rxin commented May 14, 2014

Thanks. I merged this.

@asfgit asfgit closed this in c33b8dc May 14, 2014
asfgit pushed a commit that referenced this pull request May 14, 2014
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions.

A simple serializer and test cases are added as well.

Author: larvaboy <[email protected]>

Closes #737 from larvaboy/master and squashes the following commits:

bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct.
9ba8360 [larvaboy] Fix alignment and null handling issues.
95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct.
f57917d [larvaboy] Add the parser for the approximate count.
a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions.
7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog.
1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class.
653542b [larvaboy] Fix a couple of minor typos.

(cherry picked from commit c33b8dc)
Signed-off-by: Reynold Xin <[email protected]>
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions.

A simple serializer and test cases are added as well.

Author: larvaboy <[email protected]>

Closes apache#737 from larvaboy/master and squashes the following commits:

bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct.
9ba8360 [larvaboy] Fix alignment and null handling issues.
95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct.
f57917d [larvaboy] Add the parser for the approximate count.
a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions.
7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog.
1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class.
653542b [larvaboy] Fix a couple of minor typos.
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants