[SPARK-5992][ML] Locality Sensitive Hashing #15148
Conversation
…r-Model class hierarchy to make RandomProjection works.
…er model parameters
…discussed in the Design Doc.
… on random projection.
…on random projection.
Fix the title please?
 * @return The distance between hash vectors x and y in double
 */
protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
  (x.asBreeze - y.asBreeze).toArray.map(math.abs).min
This seems to include redundant operations.
For DenseVector, we can directly use its values: Array[Double].
For SparseVector, we can use Breeze's subtraction op then get the data from the result.
I am wondering: what is the API to calculate the difference between two Spark Vectors?
For a pair of DenseVector, you can directly use its values member and do something like:
x.values.zip(y.values).map(x => math.abs(x._1 - x._2)).min
For a pair of SparseVector, you may not need to convert (x.asBreeze - y.asBreeze) back to an Array, because the result should be sparse too. We can map directly on the Breeze vector, i.e., (x.asBreeze - y.asBreeze).map(math.abs).min.
Thanks! Since it's generated by hashing, I am assuming it's a pair of dense vectors.
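For context, a minimal sketch of the dense-pair variant the comment converges on (assuming, as noted above, that hash vectors are always dense; this is an illustration, not necessarily the merged code):

import org.apache.spark.ml.linalg.Vector

// Hash vectors come from the model's own hash functions, so a dense pair is assumed.
def hashDistance(x: Vector, y: Vector): Double =
  x.toDense.values.zip(y.toDense.values).map(p => math.abs(p._1 - p._2)).min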
@Yunni Please use a proper title such as "[SPARK-5992][ML] ...".
 */
def approxNearestNeighbors(dataset: Dataset[_], key: KeyType, k: Int = 1,
    distCol: String = "distance"): Dataset[_] = {
  if (k < 1) {
Usually we use assert for this, and a more informative error message might be: "The number of nearest neighbors cannot be less than 1."
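For illustration, the check could be a one-liner (using require, which Spark generally prefers for argument validation, or assert as suggested; the message follows the reviewer's wording):

require(k >= 1, "The number of nearest neighbors cannot be less than 1")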
Done.
val nearestHashValue = nearestHashDataset.collect()(0)(0).asInstanceOf[Double]

// Filter the dataset where the hash value equals to u
val modelSubset = modelDataset.filter(hashDistUDF(col($(outputCol))) === nearestHashValue)
You call hashDistUDF twice on the dataset. Besides, you might get fewer than k nearest neighbors with the current approach. We could do it like:
val hashDistCol = "_hash_dist"
modelDataset.withColumn(hashDistCol, hashDistUDF(col($(outputCol))))
.sort(hashDistCol)
.drop(hashDistCol)
.limit(k)
.withColumn(distCol, keyDistUDF(col($(inputCol))))
Actually this does not work, because the number of elements with the same "hashDistCol" value can be much larger than k. In that case, we would be randomly selecting k elements among those with the same "hashDistCol" value.
To resolve the issue you mentioned, I am changing nearestHashValue to hashThreshold, which is the maximum "hashDistCol" for the top k elements.
Yeah, I think we can replace the limit above with a filter that chooses the elements falling within this range.
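A hedged sketch of the threshold-based variant being discussed, reusing identifiers from the snippet above (hashDistUDF, modelDataset, outputCol, k); the column name "_hash_dist" and the exact structure are illustrative, not the merged code:

import org.apache.spark.sql.functions.{col, max}

val hashDistCol = "_hash_dist"
val withHashDist = modelDataset.withColumn(hashDistCol, hashDistUDF(col($(outputCol))))
// hashThreshold is the maximum hash distance among the k closest rows ...
val hashThreshold = withHashDist.sort(hashDistCol).limit(k)
  .agg(max(hashDistCol)).head().getDouble(0)
// ... so filtering by it keeps every row tied at the boundary instead of dropping some arbitrarily.
val modelSubset = withHashDist.filter(col(hashDistCol) <= hashThreshold)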
val explodeCols = Seq("lsh#entry", "lsh#hashValue")
val explodedA = processDataset(datasetA, explodeCols)

// If this is a self join, we need to recreate the inputCol of datasetB to avoid ambiguity.
Do we need this? I think we already do a dedup operation in the Analyzer for self-joins.
Got it. You want to access inputCol from both left and right sides.
Once #14719 is merged, I think we can skip this redundant operation.
Added a TODO.
)

// Filter the joined datasets where the distance is smaller than the threshold.
joinedDatasetWithDist.distinct().filter(col(distCol) < threshold)
I think doing distinct after the filter should be better, since the filter will remove most of the records.
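That is, something along these lines (a sketch of the reordering only):

// Filter first so that distinct() only has to deduplicate the rows that survive the threshold.
joinedDatasetWithDist.filter(col(distCol) < threshold).distinct()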
Very good point. Done.
class RandomProjectionModel(
    override val uid: String,
    val randUnitVectors: Array[breeze.linalg.Vector[Double]])
Can we use Spark vectors? We have a BLAS library (BLAS.dot) for Spark vectors, so you wouldn't need to convert to Breeze and back to a Spark vector below.
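A rough sketch of what the Spark-vector version could look like. BLAS here is the package-private org.apache.spark.ml.linalg.BLAS (accessible from within the ml package), and randUnitVectors: Array[Vector] plus a bucketLength param are assumptions taken from the surrounding discussion, not confirmed details of the patch:

import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors}

// One hash value per random unit vector: project the key, then bucket by the assumed bucket length.
val hashFunction: Vector => Vector = key => {
  val hashValues = randUnitVectors.map { unitVec =>
    math.floor(BLAS.dot(key, unitVec) / $(bucketLength))
  }
  Vectors.dense(hashValues)
}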
Done.
}

override protected[this] def keyDistance(x: Vector, y: Vector): Double = {
  euclideanDistance(x.asBreeze, y.asBreeze)
Vectors.sqdist is specialized for Spark vectors. We can use it and take its square root.
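In other words, a one-line sketch of the suggestion:

// Euclidean distance via the Spark vector API, avoiding the Breeze round trip.
override protected[this] def keyDistance(x: Vector, y: Vector): Double =
  math.sqrt(Vectors.sqdist(x, y))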
Done.
private[this] var inputDim = -1

private[this] lazy val randUnitVectors: Array[breeze.linalg.Vector[Double]] = {
As mentioned above, we can use Spark vectors to avoid the Breeze conversion.
Done.
// Compute precision and recall
val correctCount = expected.join(actual, model.getInputCol).count().toDouble
(correctCount / expected.count(), correctCount / actual.count())
I think the precision and recall values should be swapped. correctCount / expected.count() should be recall. correctCount / actual.count() should be precision.
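A sketch with the two ratios swapped as suggested, assuming the tuple is meant to be (precision, recall):

// correctCount is the size of the intersection of the expected and actual neighbor sets.
val correctCount = expected.join(actual, model.getInputCol).count().toDouble
// precision = correct / |actual|, recall = correct / |expected|
(correctCount / actual.count(), correctCount / expected.count())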
Done.
This looks pretty solid. cc @dbtsai @jkbradley
@Yunni Could you provide the specific reference paper this patch is based on? Also, it might be nice to put the reference in the code somewhere, e.g. the scaladoc for LSH/Random Projections. Thanks!
Thanks very much for reviewing, @viirya. I made some changes based on your comments. PTAL.
…c for LSH along with reference papers
Hi @sethah, I have updated the reference in the PR and scaladoc for LSH.
@Yunni Thanks for working on this.
A few high-level comments/questions:
@karlhigley Would you mind taking a look at the patch, or providing your input on the comments?
…ge to be under feature
Hi @sethah,
Awesome! Thanks Joseph and thanks everyone else for reviewing this! 👍
I apologize for coming late to this, but I am taking a look at some of the documentation now. I summarized this in a comment way up towards the top. If this method is some well-accepted hybrid of the two, fine, but I think the references would leave users quite confused. I think it's nice to have certainty about the practical effectiveness of this method, since it has already been deployed in industry, so my main concern is really just documentation. Right now, we're linking to sources which describe distinctly different algorithms than what we have implemented. Thoughts? For convenience, some references:
@sethah: I think you're right that there's a discrepancy here, and I'm embarrassed that I didn't see it when I first reviewed the PR. On a reread of the source and your comment above, it looks like the LSH models in this PR use a single hash function to compute a single hash table, which doesn't match my understanding of OR-amplification. For OR-amplification, multiple hash functions would be applied to compute multiple hash tables, and points placed in the same bucket in any hash table would be considered candidate neighbors.

From the comments, it looks like the discrepancy might be due to some confusion between the number of hash functions applied and the dimensionality of the hash functions. This is a subtle point that I was confused about too, and it took me quite a while to work it out because different authors use the term "hash function" to refer to different things at different levels of abstraction. In one sense (at a lower level), a random projection is made up of many component hash functions, but in another sense (at a higher level) a random projection represents a single hash function for the purposes of OR-amplification.

Given that the PR has already been merged, I concur that the best way forward is to adjust the comments and documentation. That probably involves changing the references to OR-amplification to simply refer to the dimensionality of the hash function.

On the other issue you mentioned regarding mismatches between what's implemented and the linked documents, I think some of that confusion also stems from inconsistent terminology in the source material. LSH based on p-stable distributions (for Euclidean distance) does involve random projections, although the authors don't directly say so in the paper. There's a somewhat similar LSH method for cosine distance that's sometimes referred to as "sign random projection" (though the authors of the paper don't use that term either). Sign random projection is what the "Random Projection" section of the Wikipedia page is referring to; what's implemented here looks like LSH based on p-stable distributions.

Maybe one way to clarify would be to name the models after the distance measures they're intended to approximate, and provide explanations of the methods they use in the comments?
@karlhigley Thanks for your detailed response. From the amplification section on Wikipedia, it is pretty clear to me that this implementation is not doing OR/AND amplification. For now we can clarify some of this a bit better in the documentation, and perhaps in the future we can extend this implementation to use optional AND/OR amplification. I can work on a PR for it this week, unless there are any objections. @jkbradley @Yunni @MLnick ?
@sethah I think you are right. OR-amplification is only applied inside NN search and similarity join. Sorry to have missed this. I will clarify this in the user guide, and I am happy for you to send the PR to fix the documentation. @jkbradley @MLnick
@Since("2.1.0")
override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
  // Since it's generated by hashing, it will be a pair of dense vectors.
  x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
Does this make sense for MinHash? For the RandomProjection class I understand that the absolute difference between hash values is a measure of similarity, but for MinHash I don't think it is. It is true that dissimilar items have a lower likelihood of hash collisions, but it should not be true that they have a low likelihood of hashing to buckets near each other. We use this hashDistance to ensure that we get enough near-neighbor candidates, but I don't see how this hashDistance corresponds to similarity in the case where there are no zero-distance elements.
Makes sense. hashDistance for MinHash should just be binary. I will make another PR to fix this.
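A hedged sketch of what a binary hash distance for MinHash could look like (illustrative only; the follow-up PR mentioned above may do this differently):

override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
  // 0.0 if any of the min-hash values collide, 1.0 otherwise.
  val anyCollision = x.toDense.values.zip(y.toDense.values).exists(p => p._1 == p._2)
  if (anyCollision) 0.0 else 1.0
}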
Ok, I'm looking more closely at this algorithm versus the literature. I agree that there is a lot of inconsistent terminology, which is probably leading to some of the confusion here. Most or all of the LSH algorithms in the literature describe a process which applies a composition of AND and OR amplification. @karlhigley This is what the spark-neighbors package does as well, correct?

AND amplification is applied by generating hash functions … In this patch we only apply OR amplification by generating a single … I will look into testing this out more concretely.
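For reference, the standard amplification formulas from the LSH literature, as a small sketch (not code from this patch): if a single hash function collides for a given pair with probability p, then

// AND amplification over d hash functions: all d must collide.
def andAmplified(p: Double, d: Int): Double = math.pow(p, d)
// OR amplification over L hash functions/tables: at least one must collide.
def orAmplified(p: Double, L: Int): Double = 1.0 - math.pow(1.0 - p, L)
// Combined AND-then-OR amplification: 1 - (1 - p^d)^L.
def andOrAmplified(p: Double, d: Int, L: Int): Double = orAmplified(andAmplified(p, d), L)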
@sethah Yes, that's why …
It sounds like discussions are converging, but I want to confirm a few things and make a few additions.

Amplification: Is this agreed? …

Adding combined AND and OR amplification in the future sounds good to me. My main question right now is whether we need to adjust the API before the 2.1 release. I don't see a need to, but please comment if you see an issue with the current API.

Terminology: For LSH, "dimensionality" = "number of hash functions" and is relevant only for amplification. Do you agree? I have yet to see a hash function used for LSH which does not have a discrete set.

Random Projection: I agree this should be renamed to something like "PStableHashing." My apologies for not doing enough background research to disambiguate.

MinHash: I think this is implemented correctly, according to the reference given in the linked Wikipedia article.

hashDistance: Rethinking this, I am unsure about what function we should use. Currently, hashDistance is only used by approxNearestNeighbors. Since approxNearestNeighbors sorts by hashDistance, using a soft measure might be better than what we currently have: …

@Yunni What is the best resource you have for single vs. multiple probing? I'm wondering now if these are uncommon terms and should be renamed.
So I'll try to summarize the AND/OR amplification and how I think it fits into the current API right now. LSH relies on a single hashing function … Then we convert the original probabilities to … The current implementation is equivalent to the … I like the idea of changing …
That is true if you're talking about comparing hash values. But for approx similarity and nearest neighbors, this is doing d = 1 and L = outputDim (i.e., OR amplification). (Did you swap accidentally?) Definitely need to clarify this in the docs.

I'm not too worried about making … I'm more worried about the schema for transform(). Do you think we should go ahead and output a Matrix so we can support AND and OR in the future?
I was using L to refer to the number of compound hash functions, but you're right that in my explanation L was the "OR" parameter and d was the "AND" parameter.

Thinking more about it, this is a tough question. What is the intended use of the output column generated by transform? As an alternative set of features with decreased dimensionality? When/if we use AND/OR amplification, we could go a couple of different routes. Let's say for d = 3 and L = 3, we could first apply our hashing scheme to the input to obtain: … Then we generate …
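The concrete table from the original comment did not survive extraction; purely as an illustration of the d = 3, L = 3 idea, the nine hash values could be grouped into three compound keys of three values each, one per hash table (all names here are hypothetical):

import org.apache.spark.ml.linalg.{Vector, Vectors}

// hashFunctions: Array[Vector => Double] of length d * L = 9 (hypothetical).
val allHashes: Array[Double] = hashFunctions.map(h => h(input))
// Three compound keys of length three; points sharing any one compound key become candidate neighbors.
val compoundKeys: Array[Vector] = allHashes.grouped(3).map(arr => Vectors.dense(arr)).toArray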
@sethah: Your description of the combination of AND and OR amplification from the literature matches my understanding, and the combination of the two is what I was aiming for in spark-neighbors. I also concur with your assessment of the potential performance impacts of OR-amplification without first applying AND-amplification, in terms of both precision/recall and runtime.
@jkbradley: "Multi-probe" seems like a standard term, and I think this is the original paper that coined it.
I confess that I'm a little confused about what you mean by the above. There are several relevant dimensionalities: the dimensionality of the input points (…), …

After wrestling with inconsistent terminology for a while, what I settled on for spark-neighbors was to refer to … Using those terms, the dimensionality of the … Does that make any more (or less) sense?
@jkbradley I agree with most of your comments above. And I would like to suggest the following:
@sethah @karlhigley Now I see your LSH function for Euclidean distance is the AND-amplification of what I have implemented.
I agree it's mainly for dimensionality reduction, though these LSH functions are not ideal for that. (E.g., most people doing dimensionality reduction would probably want to use random projections without bucketing.) I agree with your description of different dimensionalities and agree we may just have to pick some terminology out of many choices. I'm fairly ambivalent about what terminology we choose, though it would be great for it to match whatever references we cite. (And maybe we do need another reference cited for describing OR vs AND amplification and "dimensions.")
I tend to agree that the terminology used here is a little confusing, and doesn't seem to match up with the "general" terminology (I use that term loosely, however).

Terminology: In my dealings with LSH, I too have tended to come across the version that @sethah mentions (and that @karlhigley's package, and others such as https://github.com/marufaytekin/lsh-spark, implement). That is, each input vector is hashed into … I agree what's effectively implemented here is …

Transform semantics: In terms of … I'll give a concrete example for the …

Proposal: My recommendation is: …

One issue I have is that currently we would output a … I believe we should support OR/AND in future. If so, then to me many things need to change - … Finally, my understanding was that results from some performance testing would be posted. I don't believe we've seen this yet.
Oh, and for naming - I'm ok with the current ones actually. However, we could think about changing to … We could name according to the estimated metric, such as …
@MLnick I agree with most of your comments. A few responses:
This is very common in academic research and literature, but it may not be in industry. I'm fine with not considering it for now.
You mentioned people using LSH outside of Spark for serving. In order to do that, we will need to expose randUnitVectors and randCoefficients so that users can compute hash values for query points. That said, I'm fine with making those private for now and preventing this use case for 1 release while we stabilize the API.
What about outputting a Matrix instead of an Array of Vectors? That will make it easy to change in the future, without us having weird Vectors of length 1.
You can see some results linked from the JIRA.
Ok, makes sense - for the …

For the public vals - sorry if I wasn't clear. I meant we should probably not expose them until the API is fully baked. But yes, I see that they are useful to expose once we're happy with the API. I just don't love the idea of changing things later (and throwing errors and whatnot) if we can avoid it - I think we saw similar issues with e.g. NaiveBayes.

Matrix can work - I don't think … I'll check the JIRA - sorry I missed the links.
If we were to use a matrix for the output, then when we do … This is probably possible, but might be a bit awkward?
Good points: Array of Vectors sounds good to me. There has been a lot of discussion. I'm going to try to summarize things in a follow-up JIRA, which I'll link here shortly. LSH turned out to be a much messier area than I expected; thanks a lot to everyone for all of the post-hoc reviews and discussions!
Thanks for the discussion, everyone! I will take a look at the JIRA.
What changes were proposed in this pull request?
Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the design doc.
Detailed changes are as follows:
(1) Implement abstract LSH, LSHModel classes as Estimator-Model
(2) Implement approxNearestNeighbors and approxSimilarityJoin in the abstract LSHModel
(3) Implement Random Projection as LSH subclass for Euclidean distance, Min Hash for Jaccard Distance
(4) Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin
Things that will be implemented in a follow-up PR:
- Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
- PySpark integration for the Scala classes and methods
How was this patch tested?
Unit test is implemented for all the implemented classes and algorithms. A scalability test on Uber's dataset was performed internally.
Tested the methods on WEX dataset from AWS, with the steps and results here.
References
Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. "Similarity search in high dimensions via hashing." VLDB 7 Sep. 1999: 518-529.
Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).