[SPARK-12933][SQL] Initial implementation of Count-Min sketch #10851
we should add some comment acknowledging stream-lib
do we also need to update the test runner to add this module? cc @JoshRosen
ClassTag is used here for creating arrays. I found using Seq can slow down test execution quite a bit.
@rxin Already added sketch module to
4201605 to 486414d
Test build #49811 has finished for PR 10851 at commit
2bf907a to 7ea22a9
kind of black magic...
Somehow there is no timing information for the test cases in this new module. Can you take a look at that? You might need to change the sbt build file.
Oh, I forgot: you also need to update
Test build #49820 has finished for PR 10851 at commit
Test build #49824 has finished for PR 10851 at commit
an interface?
I'd just start with "A Count-Min sketch is a probabilistic data structure ..."
i.e. your second paragraph.
And then explain the type of data types supported.
declare that this could throw some exception?
Test build #49826 has finished for PR 10851 at commit
Test build #49827 has finished for PR 10851 at commit
Test build #49848 has finished for PR 10851 at commit
a6e7479 to e06ff13
Test build #49882 has finished for PR 10851 at commit
Had to break it down since the original pattern match somehow introduced an implicit tuple containing more than 22 fields after adding the spark-sketch module.
Test build #49923 has finished for PR 10851 at commit
Test build #49925 has finished for PR 10851 at commit
we should add some comment here explaining this is just a duplicate and is put here to minimize dependency.
Added comments for the duplicated Platform class and Murmur3_x86_32 class.
@liancheng can you make sure the generated javadocs look ok?
I've checked the Javadoc, it looks good.
I looked at this quickly (i.e. didn't do a detail review), but changes lgtm.
Test build #2445 has started for PR 10851 at commit
1608ec9 to 65853ad
Test build #49929 has finished for PR 10851 at commit
Going to merge this. Thanks. Would be great if @cloud-fan can take another look at the implementation.
the CMSMergeException is a protected static class, can user catch this exception?
Thanks, actually I've fixed this issue in #10893.
This PR adds an initial implementation of count min sketch, contained in a new module spark-sketch under common/sketch. The implementation is based on the CountMinSketch class in stream-lib.

As required by the design doc, spark-sketch should have no external dependency. Two classes, Murmur3_x86_32 and Platform, are copied to spark-sketch from spark-unsafe for hashing facilities. They'll also be used in the upcoming bloom filter implementation.

The following features will be added in future follow-up PRs:
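For readers unfamiliar with the data structure this PR introduces, a minimal dependency-free Count-Min sketch might look roughly like the Java below. This is an illustrative sketch only: the class name TinyCountMinSketch, its method names, and the integer-mixing hash are assumptions made for this example, not the CountMinSketch API or the Murmur3_x86_32 hashing added by the PR.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration class; NOT the CountMinSketch class from this PR.
final class TinyCountMinSketch {
    private final int depth;      // number of hash rows
    private final int width;      // counters per row
    private final long[][] table; // depth x width counter matrix
    private final int[] seeds;    // one hash seed per row

    TinyCountMinSketch(int depth, int width, long seed) {
        if (depth <= 0 || width <= 0) {
            throw new IllegalArgumentException("depth and width must be positive");
        }
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random random = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seeds[i] = random.nextInt();
        }
    }

    // Stand-in hash (an assumption for this example): mixes hashCode()
    // with the row seed, then maps the result to a column index.
    private int bucket(Object item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return Math.floorMod(h, width);
    }

    // Adds `count` occurrences of `item`: one counter per row is incremented.
    void add(Object item, long count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        for (int i = 0; i < depth; i++) {
            table[i][bucket(item, i)] += count;
        }
    }

    // The estimate is the minimum over rows; collisions can only inflate it.
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, table[i][bucket(item, i)]);
        }
        return min;
    }

    // Merging is cell-wise addition and requires identical shape and seeds.
    void merge(TinyCountMinSketch other) {
        if (depth != other.depth || width != other.width
                || !Arrays.equals(seeds, other.seeds)) {
            throw new IllegalArgumentException("incompatible sketches");
        }
        for (int i = 0; i < depth; i++) {
            for (int j = 0; j < width; j++) {
                table[i][j] += other.table[i][j];
            }
        }
    }
}
```

Because counts are non-negative, an estimate never falls below the true count; with several distinct items it may overshoot when items collide in every row. Two sketches can be merged only when they share depth, width, and seeds, since otherwise their counters do not refer to the same hash buckets.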