-
Notifications
You must be signed in to change notification settings - Fork 28.9k
Implement ApproximateCountDistinct for SparkSql #737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
|
This patch duplicates some logic that already exists elsewhere in Spark - would you mind updating it to use this class?: |
|
@pwendell, I don't think that will work as Spark SQL does its own serialization for shuffles sometimes using Kryo and I don't think that SerializableHyperLogLog works with Kryo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm normally all for the Option pattern, but in this case you are probably incurring more object allocations that we want to in the critical path of query execution. I'd just use an if here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been changed into a null check.
|
Bypassing SerializableHyperLogLog has a few benefits:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a default here is reasonable, but we should probably expose this to the user as well. Maybe two versions in the parser?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please refer to the most recent version where we have another parser allowing users to pass in the standard deviation.
The first version has the benefit of hiding the implementation details from the user. The standard deviation is not an intuitive parameter for an end user, especially given its side effect to the memory usage.
Please let me know your thoughts on the new version.
|
All the review issues should have been fixed in the most recent version of the code. Please let me know if I missed anything. Thanks a lot for the quick feedback. |
We use stream-lib's HyperLogLog to approximately count the number of distinct elements in each partition, and merge the HyperLogLogs to compute the final result. If the expressions can not be successfully broken apart, we fall back to the exact CountDistinct.
|
LGTM. Thanks for doing this! |
|
Thanks, Michael. I just re-arranged my change sets a bit to put them together. Let me know if there's anything else needed to merge this to the upstream. |
|
Thanks. I merged this. |
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions. A simple serializer and test cases are added as well. Author: larvaboy <[email protected]> Closes #737 from larvaboy/master and squashes the following commits: bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct. 9ba8360 [larvaboy] Fix alignment and null handling issues. 95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct. f57917d [larvaboy] Add the parser for the approximate count. a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions. 7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog. 1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class. 653542b [larvaboy] Fix a couple of minor typos. (cherry picked from commit c33b8dc) Signed-off-by: Reynold Xin <[email protected]>
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions. A simple serializer and test cases are added as well. Author: larvaboy <[email protected]> Closes apache#737 from larvaboy/master and squashes the following commits: bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct. 9ba8360 [larvaboy] Fix alignment and null handling issues. 95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct. f57917d [larvaboy] Add the parser for the approximate count. a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions. 7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog. 1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class. 653542b [larvaboy] Fix a couple of minor typos.
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions.
A simple serializer and test cases are added as well.