-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-12936][SQL] Initial bloom filter implementation #10883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
58e7c59 to
cbc53f2
Compare
|
cc @rxin @liancheng |
cbc53f2 to
bbf3822
Compare
|
Test build #49939 has finished for PR 10883 at commit
|
bbf3822 to
2db0171
Compare
|
Test build #49940 has finished for PR 10883 at commit
|
2db0171 to
1617226
Compare
|
Test build #49941 has finished for PR 10883 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should be consistent in naming with the count-min sketch one, i.e. rename this BloomFilterImpl.
69ac65d to
920f292
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string part is same with count-min, but long part is different, is it worth to abstract it? cc @rxin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the long part different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because they use different strategy to produce n hash values for a long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add some inline comment explaining why they are different
|
Test build #50002 has finished for PR 10883 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK to use allItems.take(numInsertion).foreach(filter.put) for simplicity if this code path doesn't slow down test execution too much.
|
Test build #50005 has finished for PR 10883 at commit
|
|
Test build #50012 has finished for PR 10883 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the 0.001 here stand for epsilon? Should we wrap 0.03 - filter1.expectedFpp() with Math.abs()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need to special case "Byte" here when we already have numItems (it's 200 for Byte type)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to make 0.01 a private static constant, e.g. EPSILON.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we should swap 0.03 and filter1.expectedFpp() here.
|
One high level comment about testing code: I'm a little bit confused by those magic numbers there. Would be nice to name them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should have a public static constant DEFAULT_EXPECTED_FPP for this 0.03. I was pretty confused when finding the magic number 0.03 in the testing code.
|
Tip: I renamed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to add another test case for incompatible merge (like this one).
|
Test build #50031 has finished for PR 10883 at commit
|
|
Test build #50034 has finished for PR 10883 at commit
|
|
LGTM |
|
As discussed offline, it might be better to have put(String) and put(Long) in addition to put(Object). I'm going to merge this pull request. Please address the remaining comments in a new pull request. |
This PR adds an initial implementation of bloom filter in the newly added sketch module. The implementation is based on the
BloomFilterclass in guava.Some difference from the design doc:
bitSizeinstead ofsizeInBytesto user.expectedInsertionsparameter when create bloom filter.