Skip to content

Conversation

@Bcpoole
Copy link

@Bcpoole Bcpoole commented Feb 9, 2017

What changes were proposed in this pull request?

Added functions to get the Swamidass & Baldi (2007) approximation for number of items in a Bloom filter and the intersections of two filters. Added an exception type IncompatibleUnionException mimicing IncompatibleMergeException. As needed for the intersection approximation, there is a function that create the union of two Bloom filters (no mutations).

How was this patch tested?

Manual Tests


package org.apache.spark.util.sketch;

public class IncompatibleUnionException extends Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need some javadoc ere.

* Callers must ensure the bloom filters are appropriately sized to avoid saturating them.
*
* @throws IncompatibleUnionException if either are null, different classes, or different size or number of hash functions
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about just calling this union?


/**
* Swamidass & Baldi (2007) approximation for number of items in a Bloom filter
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this return a long rather than a double?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was debating this due to possible rounding errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea but that would only be off by 1. I wouldn't worry about that since it is approximate anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be easier to keep it as double because the estimate could be out of bound if the bits are full.

@rxin
Copy link
Contributor

rxin commented Feb 9, 2017

cc @mengxr / @tjhunter / @jkbradley is this good to have?

@rxin
Copy link
Contributor

rxin commented Feb 9, 2017

I meant just union, but createUnion ...

*/
public abstract long bitSize();

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe the method first and its properties (approximation error). Then put the reference in @seealso with a permanent link to the paper: https://dx.doi.org/10.1021%2Fci600526a


/**
* Swamidass & Baldi (2007) approximation for number of items in a Bloom filter
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be easier to keep it as double because the estimate could be out of bound if the bits are full.

*/
public abstract BloomFilterImpl union(BloomFilter other) throws IncompatibleUnionException;

/**
Copy link
Contributor

@mengxr mengxr Feb 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Same here. Document the method first and then mention the reference.
  • How is it different from intersecting two bloom filters and then estimate the number of items? Union might lead to larger approximation error. Okay, I got why. Please also document it.


@Override
public double approxItems() {
double m = bitSize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Math is deprecated. Use math.
  • Please add a test when bits is full. This should return Double.Infinity.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Math is deprecated. Use math.

Assume you were thinking of Scala?

Copy link
Author

@Bcpoole Bcpoole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Believe requested changes were all handled

@jiangxb1987
Copy link
Contributor

ping @mengxr Should we move forward with this PR?

@jiangxb1987
Copy link
Contributor

retest this please

@Bcpoole
Copy link
Author

Bcpoole commented Jun 23, 2017

Test looks good

@WeichenXu123
Copy link
Contributor

@Bcpoole Thanks for this PR. But I want to ask which place in spark can this extension apply to ? e.g. can this algo used in join cost estimating or somewhere else ? But if there is no apparent uses for now, I will decrease priority of reviewing this because there're many PRs accumulated waiting review.

@jiangxb1987
Copy link
Contributor

Should we close this PR since it goes stale? WDYT @WeichenXu123 ?

@WeichenXu123
Copy link
Contributor

@jiangxb1987 yes I agree to close it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants