[SPARK-19527][Core] Approximate Size of Intersection of Bloom Filters #16864
… Also function to create union (non-mutation) of two Bloom filters.
…ance instead of static functions
package org.apache.spark.util.sketch;

public class IncompatibleUnionException extends Exception {
We need some javadoc here.
 * Callers must ensure the bloom filters are appropriately sized to avoid saturating them.
 *
 * @throws IncompatibleUnionException if either are null, different classes, or different size or number of hash functions
 */
how about just calling this union?
/**
 * Swamidass & Baldi (2007) approximation for number of items in a Bloom filter
 */
shouldn't this return a long rather than a double?
I was debating this due to possible rounding errors.
yea but that would only be off by 1. I wouldn't worry about that since it is approximate anyway.
It might be easier to keep it as double because the estimate could be out of bound if the bits are full.
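For reference, a minimal sketch of the estimator being discussed, assuming the usual Swamidass & Baldi form n = -(m / k) * ln(1 - X / m), with m bits, k hash functions and X set bits. The class, method and parameter names here are illustrative and are not taken from the PR's code.

```java
final class BloomCardinalitySketch {
    /**
     * Estimates the number of distinct items inserted into a Bloom filter.
     *
     * @param bitSize          m, total number of bits in the filter
     * @param numHashFunctions k, number of hash functions
     * @param bitsSet          X, number of bits currently set
     */
    static double approxItems(long bitSize, int numHashFunctions, long bitsSet) {
        if (bitSize <= 0 || numHashFunctions <= 0 || bitsSet < 0 || bitsSet > bitSize) {
            throw new IllegalArgumentException("invalid Bloom filter parameters");
        }
        double m = bitSize;
        double fractionSet = bitsSet / m;
        // log1p(-x) computes ln(1 - x) accurately for small x.
        // If every bit is set, x == 1.0, log1p(-1.0) is -Infinity,
        // and the estimate becomes +Infinity.
        return -(m / numHashFunctions) * Math.log1p(-fractionSet);
    }
}
```

With a saturated filter this returns Double.POSITIVE_INFINITY, which a double can represent directly. A long return type would have to pick some value for that case; a plain (long) cast of positive infinity yields Long.MAX_VALUE.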
cc @mengxr / @tjhunter / @jkbradley is this good to have?
I meant just union, but createUnion ...
 */
public abstract long bitSize();

/**
Please describe the method first and its properties (approximation error). Then put the reference in @seealso with a permanent link to the paper: https://dx.doi.org/10.1021%2Fci600526a
 */
public abstract BloomFilterImpl union(BloomFilter other) throws IncompatibleUnionException;

/**
- Same here. Document the method first and then mention the reference.
- How is it different from intersecting two Bloom filters and then estimating the number of items? Union might lead to larger approximation error. Okay, I got why. Please also document it.
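The thread does not spell out the reason, but one likely explanation is that ANDing the two bit arrays does not give the Bloom filter of the true intersection: a bit can be set in both filters by different items, so an estimate taken from the ANDed bits is biased upward. A rough sketch of the inclusion-exclusion alternative, n(A) + n(B) - n(A union B), using java.util.BitSet as a stand-in for the filter's bit array (all names here are hypothetical, not the PR's):

```java
import java.util.BitSet;

final class BloomIntersectionSketch {
    /** Swamidass & Baldi style estimate from the number of set bits. */
    static double approxItems(int bitSize, int numHashFunctions, BitSet bits) {
        double m = bitSize;
        return -(m / numHashFunctions) * Math.log1p(-(bits.cardinality() / m));
    }

    /**
     * Estimates |A intersect B| as n(A) + n(B) - n(A union B).
     * Both bit sets are assumed to come from filters with the same
     * bitSize and numHashFunctions; neither input is mutated.
     */
    static double approxItemsInIntersection(
            int bitSize, int numHashFunctions, BitSet a, BitSet b) {
        BitSet union = (BitSet) a.clone(); // copy so that a is left untouched
        union.or(b);
        return approxItems(bitSize, numHashFunctions, a)
            + approxItems(bitSize, numHashFunctions, b)
            - approxItems(bitSize, numHashFunctions, union);
    }
}
```

Because this is a difference of estimates, the result can come out slightly negative for disjoint filters, and it is not finite once the union is saturated, so a caller may want to clamp or check it.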
@Override
public double approxItems() {
  double m = bitSize();
- Math is deprecated. Use math.
- Please add a test when bits is full. This should return Double.Infinity.
Math is deprecated. Use math.
Assume you were thinking of Scala?
…roxItemsInIntersection. Also added reference to paper
…ntersection of A & B
Bcpoole
left a comment
Believe requested changes were all handled
ping @mengxr Should we move forward with this PR?
retest this please
Test looks good
@Bcpoole Thanks for this PR. But I want to ask: where in Spark can this extension be applied? For example, can this algorithm be used in join cost estimation or somewhere else? If there is no apparent use for now, I will lower the priority of reviewing this, because there are many PRs accumulated waiting for review.
Should we close this PR since it goes stale? WDYT @WeichenXu123 ?
@jiangxb1987 yes I agree to close it.
What changes were proposed in this pull request?
Added functions to get the Swamidass & Baldi (2007) approximation for the number of items in a Bloom filter and in the intersection of two filters. Added an exception type IncompatibleUnionException mimicking IncompatibleMergeException. As needed for the intersection approximation, there is a function that creates the union of two Bloom filters (no mutations).
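As a rough illustration of the non-mutating union described above, the following sketch checks compatibility and then ORs copies of the bit arrays. SimpleFilter is a hypothetical stand-in for the real implementation class, and the nested exception only mirrors the name of the PR's IncompatibleUnionException; neither is the PR's actual code.

```java
import java.util.BitSet;

final class BloomUnionSketch {
    /** Illustrative stand-in for the PR's IncompatibleUnionException. */
    static final class IncompatibleUnionException extends Exception {
        IncompatibleUnionException(String message) {
            super(message);
        }
    }

    /** Hypothetical minimal filter holding only the state that union needs. */
    static final class SimpleFilter {
        final int bitSize;
        final int numHashFunctions;
        final BitSet bits;

        SimpleFilter(int bitSize, int numHashFunctions, BitSet bits) {
            this.bitSize = bitSize;
            this.numHashFunctions = numHashFunctions;
            this.bits = (BitSet) bits.clone(); // defensive copy
        }

        /** Returns a new filter; neither this nor other is modified. */
        SimpleFilter union(SimpleFilter other) throws IncompatibleUnionException {
            if (other == null) {
                throw new IncompatibleUnionException("cannot union with a null filter");
            }
            if (bitSize != other.bitSize) {
                throw new IncompatibleUnionException("filters have different bit sizes");
            }
            if (numHashFunctions != other.numHashFunctions) {
                throw new IncompatibleUnionException(
                    "filters have different numbers of hash functions");
            }
            BitSet merged = (BitSet) bits.clone();
            merged.or(other.bits);
            return new SimpleFilter(bitSize, numHashFunctions, merged);
        }
    }
}
```

Returning a fresh object keeps both inputs usable for the separate per-filter estimates that the intersection approximation also needs.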
How was this patch tested?
Manual Tests