[SPARK-5992][ML] Locality Sensitive Hashing #15148
Conversation
…r-Model class hierarchy to make RandomProjection works.
…er model parameters
…discussed in the Design Doc.
… on random projection.
…on random projection.
Fix the title please?
 * @return The distance between hash vectors x and y in double
 */
protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
  (x.asBreeze - y.asBreeze).toArray.map(math.abs).min
This seems to include redundant operations.
For DenseVector, we can directly use its values: Array[Double].
For SparseVector, we can use Breeze's subtraction op then get the data from the result.
I am wondering: what is the API to calculate the difference between two Spark Vectors?
For a pair of DenseVector, you can directly use its values member and do something like:
x.values.zip(y.values).map(x => math.abs(x._1 - x._2)).min
For a pair of SparseVector, you may not need to convert (x.asBreeze - y.asBreeze) back to an Array, because the result should be sparse too. We can map directly on the Breeze vector, i.e., (x.asBreeze - y.asBreeze).map(math.abs).min.
Thanks! Since it's generated by hashing, I am assuming it's a pair of dense vectors.
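For context, a minimal sketch of the dense-pair variant the comment converges on (assuming, as noted above, that hash vectors are always dense; this is an illustration, not necessarily the merged code):

import org.apache.spark.ml.linalg.Vector

// Hash vectors come from the model's own hash functions, so a dense pair is assumed.
def hashDistance(x: Vector, y: Vector): Double =
  x.toDense.values.zip(y.toDense.values).map(p => math.abs(p._1 - p._2)).min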
@Yunni Please use a proper title such as "[SPARK-5992][ML] ...".
 */
def approxNearestNeighbors(dataset: Dataset[_], key: KeyType, k: Int = 1,
    distCol: String = "distance"): Dataset[_] = {
  if (k < 1) {
Usually we use assert for this, and a more informative error message might be: "The number of nearest neighbors cannot be less than 1."
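For illustration, the check could be a one-liner (using require, which Spark generally prefers for argument validation, or assert as suggested; the message follows the reviewer's wording):

require(k >= 1, "The number of nearest neighbors cannot be less than 1")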
Done.
val nearestHashValue = nearestHashDataset.collect()(0)(0).asInstanceOf[Double]

// Filter the dataset where the hash value equals to u
val modelSubset = modelDataset.filter(hashDistUDF(col($(outputCol))) === nearestHashValue)
You call hashDistUDF twice on the dataset. Besides, you might get fewer than k nearest neighbors with the current approach. We could do it like:
val hashDistCol = "_hash_dist"
modelDataset.withColumn(hashDistCol, hashDistUDF(col($(outputCol))))
.sort(hashDistCol)
.drop(hashDistCol)
.limit(k)
.withColumn(distCol, keyDistUDF(col($(inputCol))))
Actually this does not work, because the number of elements with the same "hashDistCol" value can be much larger than k. In that case, we would be randomly selecting k elements among those with the same "hashDistCol" value.
To resolve the issue you mentioned, I am changing nearestHashValue to hashThreshold, which is the maximum "hashDistCol" for the top k elements.
Yeah, I think we can replace the limit above with a filter that chooses the elements falling within this range.
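A hedged sketch of the threshold-based variant being discussed, reusing identifiers from the snippet above (hashDistUDF, modelDataset, outputCol, k); the column name "_hash_dist" and the exact structure are illustrative, not the merged code:

import org.apache.spark.sql.functions.{col, max}

val hashDistCol = "_hash_dist"
val withHashDist = modelDataset.withColumn(hashDistCol, hashDistUDF(col($(outputCol))))
// hashThreshold is the maximum hash distance among the k closest rows ...
val hashThreshold = withHashDist.sort(hashDistCol).limit(k)
  .agg(max(hashDistCol)).head().getDouble(0)
// ... so filtering by it keeps every row tied at the boundary instead of dropping some arbitrarily.
val modelSubset = withHashDist.filter(col(hashDistCol) <= hashThreshold)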
val explodeCols = Seq("lsh#entry", "lsh#hashValue")
val explodedA = processDataset(datasetA, explodeCols)

// If this is a self join, we need to recreate the inputCol of datasetB to avoid ambiguity.
Do we need this? I think we already do a dedup operation in the Analyzer for self-joins.
Got it. You want to access inputCol from both left and right sides.
Once #14719 is merged, I think we can skip this redundant operation.
Added a TODO.
)

// Filter the joined datasets where the distance is smaller than the threshold.
joinedDatasetWithDist.distinct().filter(col(distCol) < threshold)
I think doing distinct after the filter should be better, since the filter will remove most of the records.
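That is, something along these lines (a sketch of the reordering only):

// Filter first so that distinct() only has to deduplicate the rows that survive the threshold.
joinedDatasetWithDist.filter(col(distCol) < threshold).distinct()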
Very good point. Done.
class RandomProjectionModel(
    override val uid: String,
    val randUnitVectors: Array[breeze.linalg.Vector[Double]])
Can we use Spark vectors? We have a BLAS library (BLAS.dot) for Spark vectors, so you wouldn't need to convert to Breeze and back to a Spark vector below.
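A rough sketch of what the Spark-vector version could look like. BLAS here is the package-private org.apache.spark.ml.linalg.BLAS (accessible from within the ml package), and randUnitVectors: Array[Vector] plus a bucketLength param are assumptions taken from the surrounding discussion, not confirmed details of the patch:

import org.apache.spark.ml.linalg.{BLAS, Vector, Vectors}

// One hash value per random unit vector: project the key, then bucket by the assumed bucket length.
val hashFunction: Vector => Vector = key => {
  val hashValues = randUnitVectors.map { unitVec =>
    math.floor(BLAS.dot(key, unitVec) / $(bucketLength))
  }
  Vectors.dense(hashValues)
}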
Done.
}

override protected[this] def keyDistance(x: Vector, y: Vector): Double = {
  euclideanDistance(x.asBreeze, y.asBreeze)
Vectors.sqdist is specialized for Spark vectors. We can use it and take its square root.
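In other words, a one-line sketch of the suggestion:

// Euclidean distance via the Spark vector API, avoiding the Breeze round trip.
override protected[this] def keyDistance(x: Vector, y: Vector): Double =
  math.sqrt(Vectors.sqdist(x, y))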
Done.
private[this] var inputDim = -1

private[this] lazy val randUnitVectors: Array[breeze.linalg.Vector[Double]] = {
As mentioned above, we can use Spark vectors to avoid the Breeze conversion.
Done.
// Compute precision and recall
val correctCount = expected.join(actual, model.getInputCol).count().toDouble
(correctCount / expected.count(), correctCount / actual.count())
I think the precision and recall values should be swapped. correctCount / expected.count() should be recall. correctCount / actual.count() should be precision.
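A sketch with the two ratios swapped as suggested, assuming the tuple is meant to be (precision, recall):

// correctCount is the size of the intersection of the expected and actual neighbor sets.
val correctCount = expected.join(actual, model.getInputCol).count().toDouble
// precision = correct / |actual|, recall = correct / |expected|
(correctCount / actual.count(), correctCount / expected.count())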
Done.
This looks pretty solid. cc @dbtsai @jkbradley
@Yunni Could you provide the specific reference paper this patch is based on? Also, it might be nice to put the reference in the code somewhere, e.g. the scaladoc for LSH/Random Projections. Thanks!
Thanks very much for reviewing, @viirya. I made some changes based on your comments. PTAL.
…c for LSH along with reference papers
Hi @sethah, I have updated the reference in the PR and scaladoc for LSH.
@Yunni Thanks for working on this.
A few high-level comments/questions:
@karlhigley Would you mind taking a look at the patch, or providing your input on the comments?
…ge to be under feature
Hi @sethah,
Awesome! Thanks Joseph and thanks everyone else for reviewing this! 👍
I apologize for coming late to this, but I am taking a look at some of the documentation now. I summarized this in a comment way up towards the top. If this method is some well-accepted hybrid of the two, fine, but I think the references would leave users quite confused. I think it's nice to have certainty about the practical effectiveness of this method, since it has already been deployed in industry, so my main concern is really just documentation. Right now, we're linking to sources which describe distinctly different algorithms than what we have implemented. Thoughts? For convenience, some references:
@sethah: I think you're right that there's a discrepancy here, and I'm embarrassed that I didn't see it when I first reviewed the PR. On a reread of the source and your comment above, it looks like the LSH models in this PR use a single hash function to compute a single hash table, which doesn't match my understanding of OR-amplification. For OR-amplification, multiple hash functions would be applied to compute multiple hash tables, and points placed in the same bucket in any hash table would be considered candidate neighbors.

From the comments, it looks like the discrepancy might be due to some confusion between the number of hash functions applied and the dimensionality of the hash functions. This is a subtle point that I was confused about too, and it took me quite a while to work it out because different authors use the term "hash function" to refer to different things at different levels of abstraction. In one sense (at a lower level), a random projection is made up of many component hash functions, but in another sense (at a higher level) a random projection represents a single hash function for the purposes of OR-amplification.

Given that the PR has already been merged, I concur that the best way forward is to adjust the comments and documentation. That probably involves changing the references to OR-amplification to simply refer to the dimensionality of the hash function.

On the other issue you mentioned regarding mismatches between what's implemented and the linked documents, I think some of that confusion also stems from inconsistent terminology in the source material. LSH based on p-stable distributions (for Euclidean distance) does involve random projections, although the authors don't directly say so in the paper. There's a somewhat similar LSH method for cosine distance that's sometimes referred to as "sign random projection" (though the authors of the paper don't use that term either). Sign random projection is what the "Random Projection" section of the Wikipedia page is referring to; what's implemented here looks like LSH based on p-stable distributions.

Maybe one way to clarify would be to name the models after the distance measures they're intended to approximate, and provide explanations of the methods they use in the comments?
@karlhigley Thanks for your detailed response. From the amplification section on Wikipedia, it is pretty clear to me that this implementation is not doing OR/AND amplification. For now we can clarify some of this a bit better in the documentation, and perhaps in the future we can extend this implementation to use optional AND/OR amplification. I can work on a PR for it this week, unless there are any objections. @jkbradley @Yunni @MLnick ?
@sethah I think you are right. OR-amplification is only applied inside NN search and similarity join. Sorry to have missed this. I will clarify this in the user guide, and I am happy for you to send the PR to fix the documentation. @jkbradley @MLnick
@Since("2.1.0")
override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
  // Since it's generated by hashing, it will be a pair of dense vectors.
  x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
Does this make sense for MinHash? For the RandomProjection class I understand that the absolute difference between hash values is a measure of similarity, but for MinHash I don't think it is. It is true that dissimilar items have a lower likelihood of hash collisions, but it should not be true that they have a low likelihood of hashing to buckets near each other. We use this hashDistance to ensure that we get enough near-neighbor candidates, but I don't see how this hashDistance corresponds to similarity in the case where there are no zero-distance elements.
Makes sense. hashDistance for MinHash should just be binary. I will make another PR to fix this.
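A hedged sketch of what a binary hash distance for MinHash could look like (illustrative only; the follow-up PR mentioned above may do this differently):

override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
  // 0.0 if any of the min-hash values collide, 1.0 otherwise.
  val anyCollision = x.toDense.values.zip(y.toDense.values).exists(p => p._1 == p._2)
  if (anyCollision) 0.0 else 1.0
}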
Ok, I'm looking more closely at this algorithm versus the literature. I agree that there is a lot of inconsistent terminology, which is probably leading to some of the confusion here. Most or all of the LSH algorithms in the literature describe a process which applies a composition of AND and OR amplification. @karlhigley This is what the spark-neighbors package does as well, correct?

AND amplification is applied by generating hash functions … In this patch we only apply OR amplification by generating a single … I will look into testing this out more concretely.
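For reference, the standard amplification formulas from the LSH literature, as a small sketch (not code from this patch): if a single hash function collides for a given pair with probability p, then

// AND amplification over d hash functions: all d must collide.
def andAmplified(p: Double, d: Int): Double = math.pow(p, d)
// OR amplification over L hash functions/tables: at least one must collide.
def orAmplified(p: Double, L: Int): Double = 1.0 - math.pow(1.0 - p, L)
// Combined AND-then-OR amplification: 1 - (1 - p^d)^L.
def andOrAmplified(p: Double, d: Int, L: Int): Double = orAmplified(andAmplified(p, d), L)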
@sethah Yes, that's why …
It sounds like discussions are converging, but I want to confirm a few things and make a few additions.

Amplification: Is this agreed? …

Adding combined AND and OR amplification in the future sounds good to me. My main question right now is whether we need to adjust the API before the 2.1 release. I don't see a need to, but please comment if you see an issue with the current API.

Terminology: For LSH, "dimensionality" = "number of hash functions" and is relevant only for amplification. Do you agree? I have yet to see a hash function used for LSH which does not have a discrete set.

Random Projection: I agree this should be renamed to something like "PStableHashing." My apologies for not doing enough background research to disambiguate.

MinHash: I think this is implemented correctly, according to the reference given in the linked Wikipedia article.

hashDistance: Rethinking this, I am unsure about what function we should use. Currently, hashDistance is only used by approxNearestNeighbors. Since approxNearestNeighbors sorts by hashDistance, using a soft measure might be better than what we currently have: …

@Yunni What is the best resource you have for single vs. multiple probing? I'm wondering now if these are uncommon terms and should be renamed.
So I'll try to summarize the AND/OR amplification and how I think it fits into the current API right now. LSH relies on a single hashing function … Then we convert the original probabilities to … The current implementation is equivalent to the … I like the idea of changing …
That is true if you're talking about comparing hash values. But for approx similarity and nearest neighbors, this is doing d = 1 and L = outputDim (i.e., OR amplification). (Did you swap accidentally?) Definitely need to clarify this in the docs.

I'm not too worried about making … I'm more worried about the schema for transform(). Do you think we should go ahead and output a Matrix so we can support AND and OR in the future?
I was using L to refer to the number of compound hash functions, but you're right that in my explanation L was the "OR" parameter and d was the "AND" parameter.

Thinking more about it, this is a tough question. What is the intended use of the output column generated by transform? As an alternative set of features with decreased dimensionality? When/if we use AND/OR amplification, we could go a couple of different routes. Let's say for d = 3 and L = 3, we could first apply our hashing scheme to the input to obtain: … Then we generate …
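The concrete table from the original comment did not survive extraction; purely as an illustration of the d = 3, L = 3 idea, the nine hash values could be grouped into three compound keys of three values each, one per hash table (all names here are hypothetical):

import org.apache.spark.ml.linalg.{Vector, Vectors}

// hashFunctions: Array[Vector => Double] of length d * L = 9 (hypothetical).
val allHashes: Array[Double] = hashFunctions.map(h => h(input))
// Three compound keys of length three; points sharing any one compound key become candidate neighbors.
val compoundKeys: Array[Vector] = allHashes.grouped(3).map(arr => Vectors.dense(arr)).toArray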
@sethah: Your description of the combination of AND and OR amplification from the literature matches my understanding, and the combination of the two is what I was aiming for in spark-neighbors. I also concur with your assessment of the potential performance impacts of OR-amplification without first applying AND-amplification, in terms of both precision/recall and runtime.
@jkbradley: "Multi-probe" seems like a standard term, and I think this is the original paper that coined it.
I confess that I'm a little confused about what you mean by the above. There are several relevant dimensionalities: the dimensionality of the input points (…), …

After wrestling with inconsistent terminology for a while, what I settled on for spark-neighbors was to refer to … Using those terms, the dimensionality of the … Does that make any more (or less) sense?
@jkbradley I agree with most of your comments above. And I would like to suggest the following:
@sethah @karlhigley Now I see your LSH function for Euclidean distance is the AND-amplification of what I have implemented.
I agree it's mainly for dimensionality reduction, though these LSH functions are not ideal for that. (E.g., most people doing dimensionality reduction would probably want to use random projections without bucketing.) I agree with your description of different dimensionalities and agree we may just have to pick some terminology out of many choices. I'm fairly ambivalent about what terminology we choose, though it would be great for it to match whatever references we cite. (And maybe we do need another reference cited for describing OR vs AND amplification and "dimensions.")
I tend to agree that the terminology used here is a little confusing, and doesn't seem to match up with the "general" terminology (I use that term loosely, however).

Terminology: In my dealings with LSH, I too have tended to come across the version that @sethah mentions (and that @karlhigley's package, and others such as https://github.com/marufaytekin/lsh-spark, implement). That is, each input vector is hashed into … I agree what's effectively implemented here is …

Transform semantics: In terms of … I'll give a concrete example for the …

Proposal: My recommendation is: …

One issue I have is that currently we would output a … I believe we should support OR/AND in future. If so, then to me many things need to change - … Finally, my understanding was that results from some performance testing would be posted. I don't believe we've seen this yet.
Oh, and for naming - I'm ok with the current ones actually. However, we could think about changing to … We could name according to the estimated metric, such as …
@MLnick I agree with most of your comments. A few responses:
This is very common in academic research and literature, but it may not be in industry. I'm fine with not considering it for now.
You mentioned people using LSH outside of Spark for serving. In order to do that, we will need to expose randUnitVectors and randCoefficients so that users can compute hash values for query points. That said, I'm fine with making those private for now and preventing this use case for 1 release while we stabilize the API.
What about outputting a Matrix instead of an Array of Vectors? That will make it easy to change in the future, without us having weird Vectors of length 1.
You can see some results linked from the JIRA.
Ok, makes sense - for the …

For the public vals - sorry if I wasn't clear. I meant we should probably not expose them until the API is fully baked. But yes, I see that they are useful to expose once we're happy with the API. I just don't love the idea of changing things later (and throwing errors and whatnot) if we can avoid it - I think we saw similar issues with e.g. NaiveBayes.

Matrix can work - I don't think … I'll check the JIRA - sorry I missed the links.
If we were to use a matrix for the output, then when we do … This is probably possible, but might be a bit awkward?
Good points: Array of Vectors sounds good to me. There has been a lot of discussion. I'm going to try to summarize things in a follow-up JIRA, which I'll link here shortly. LSH turned out to be a much messier area than I expected; thanks a lot to everyone for all of the post-hoc reviews and discussions!
Thanks for the discussion, everyone! I will take a look at the JIRA.
What changes were proposed in this pull request?
Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the design doc.
Detailed changes are as follows:
(1) Implement abstract LSH, LSHModel classes as Estimator-Model
(2) Implement approxNearestNeighbors and approxSimilarityJoin in the abstract LSHModel
(3) Implement Random Projection as LSH subclass for Euclidean distance, Min Hash for Jaccard Distance
(4) Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin
Things that will be implemented in a follow-up PR:
- Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
- PySpark integration for the Scala classes and methods
How was this patch tested?
Unit test is implemented for all the implemented classes and algorithms. A scalability test on Uber's dataset was performed internally.
Tested the methods on WEX dataset from AWS, with the steps and results here.
References
Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. "Similarity search in high dimensions via hashing." VLDB 7 Sep. 1999: 518-529.
Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).