Commit 85d22c3

Locality Sensitive Hashing (LSH) Python API.
1 parent dc4c600 commit 85d22c3

4 files changed

+325
-16
lines changed

mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala

Lines changed: 13 additions & 13 deletions
@@ -39,9 +39,9 @@ private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
    * higher the dimension is, the lower the false negative rate.
    * @group param
    */
-  final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" +
-    "increasing dimensionality lowers the false negative rate, and decreasing dimensionality" +
-    " improves the running performance", ParamValidators.gt(0))
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "The output dimension, where" +
+    " increasing dimensionality lowers the false negative rate, and decreasing dimensionality" +
+    " improves the running performance.", ParamValidators.gt(0))

   /** @group getParam */
   final def getOutputDim: Int = $(outputDim)
@@ -109,11 +109,11 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
    * - Single Probing: Fast, return at most k elements (Probing only one buckets)
    * - Multiple Probing: Slow, return exact k elements (Probing multiple buckets close to the key)
    *
-   * @param dataset the dataset to search for nearest neighbors of the key
-   * @param key Feature vector representing the item to search for
-   * @param numNearestNeighbors The maximum number of nearest neighbors
-   * @param singleProbing True for using Single Probing; false for multiple probing
-   * @param distCol Output column for storing the distance between each result row and the key
+   * @param dataset The dataset to search for nearest neighbors of the key.
+   * @param key Feature vector representing the item to search for.
+   * @param numNearestNeighbors The maximum number of nearest neighbors.
+   * @param singleProbing True for using Single Probing; false for multiple probing.
+   * @param distCol Output column for storing the distance between each result row and the key.
    * @return A dataset containing at most k items closest to the key. A distCol is added to show
    * the distance between each row and the key.
    */
@@ -215,12 +215,12 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
    * [[outputCol]] exists, it will use the [[outputCol]]. This allows caching of the transformed
    * data when necessary.
    *
-   * @param datasetA One of the datasets to join
-   * @param datasetB Another dataset to join
-   * @param threshold The threshold for the distance of row pairs
-   * @param distCol Output column for storing the distance between each result row and the key
+   * @param datasetA One of the datasets to join.
+   * @param datasetB Another dataset to join.
+   * @param threshold The threshold for the distance of row pairs.
+   * @param distCol Output column for storing the distance between each result row and the key.
    * @return A joined dataset containing pairs of rows. The original rows are in columns
-   * "datasetA" and "datasetB", and a distCol is added to show the distance of each pair
+   * "datasetA" and "datasetB", and a distCol is added to show the distance of each pair.
    */
   def approxSimilarityJoin(
       datasetA: Dataset[_],
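
The `outputDim` and `approxSimilarityJoin` doc comments in this file describe two ideas: more hash functions lower the false negative rate, and the join keeps row pairs by a distance threshold. The plain-Python sketch below illustrates both on one-dimensional points. It is not Spark code: the function name `approx_similarity_join`, the two bucket hash functions, the strict `<` comparison, and the brute-force pair loop are all assumptions made for illustration.

```python
import math
from itertools import product


def approx_similarity_join(rows_a, rows_b, hash_fns, distance, threshold):
    """Return (a, b, dist) for pairs that collide on any hash and are close enough.

    Hypothetical helper: a brute-force illustration, not how Spark executes the join.
    """
    results = []
    for a, b in product(rows_a, rows_b):
        # OR-amplification: a single matching hash value makes the pair a candidate.
        if any(h(a) == h(b) for h in hash_fns):
            d = distance(a, b)
            # Strict comparison is an assumption; the doc only says "threshold".
            if d < threshold:
                results.append((a, b, d))
    return results


# Two assumed bucket hashes over 1-D points; the second shifts the bucket borders.
hash_fns = [
    lambda x: math.floor(x / 2.0),
    lambda x: math.floor((x + 1.0) / 2.0),
]
rows_a = [0.5, 3.9]
rows_b = [1.0, 4.1, 9.0]


def abs_distance(a, b):
    return abs(a - b)


one_hash = approx_similarity_join(rows_a, rows_b, hash_fns[:1], abs_distance, 1.0)
two_hashes = approx_similarity_join(rows_a, rows_b, hash_fns, abs_distance, 1.0)
```

With only the first hash function, the pair (3.9, 4.1) falls on opposite sides of a bucket border and is missed; adding the second function recovers it, which is the false-negative effect the `outputDim` description refers to.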

mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala

Lines changed: 2 additions & 2 deletions
@@ -33,8 +33,8 @@ import org.apache.spark.sql.types.StructType
  *
  * Model produced by [[MinHash]], where multiple hash functions are stored. Each hash function is
  * a perfect hash function:
- * `h_i(x) = (x * k_i mod prime) mod numEntries`
- * where `k_i` is the i-th coefficient, and both `x` and `k_i` are from `Z_prime^*`
+ * `h_i(x) = (x * k_i \mod prime) \mod numEntries`
+ * where `k_i` is the i-th coefficient, and both `x` and `k_i` are from `Z_{prime^*}`
  *
  * Reference:
  * [[https://en.wikipedia.org/wiki/Perfect_hash_function Wikipedia on Perfect Hash Function]]
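
The hash family in this doc comment, `h_i(x) = (x * k_i mod prime) mod numEntries`, can be tried out in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the Scala model: the modulus `PRIME`, the value of `num_entries`, the helper names, and the step of keeping the minimum hash per function (the usual MinHash signature) are not taken from the diff.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, a known prime; assumed modulus for this sketch


def make_hash(k_i, num_entries, prime=PRIME):
    """Build h_i(x) = ((x * k_i) mod prime) mod num_entries for one coefficient k_i."""
    def h(x):
        return ((x * k_i) % prime) % num_entries
    return h


def min_hash_signature(elements, hash_fns):
    """For each hash function, keep the smallest hash value over the set's elements."""
    return [min(h(x) for x in elements) for h in hash_fns]


rng = random.Random(0)
# Coefficients are drawn from 1..PRIME-1, matching "k_i from Z_prime^*" in the comment.
coefficients = [rng.randrange(1, PRIME) for _ in range(3)]
hash_fns = [make_hash(k, num_entries=1000) for k in coefficients]
signature = min_hash_signature({3, 17, 42}, hash_fns)
```

Each entry of `signature` comes from a different coefficient `k_i`, so the signature has one value per stored hash function.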

mllib/src/main/scala/org/apache/spark/ml/feature/RandomProjection.scala

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ private[ml] trait RandomProjectionParams extends Params {
  *
  * Model produced by [[RandomProjection]], where multiple random vectors are stored. The vectors
  * are normalized to be unit vectors and each vector is used in a hash function:
- * `h_i(x) = floor(r_i.dot(x) / bucketLength)`
+ * `h_i(x) = floor(r_i * x / bucketLength)`
  * where `r_i` is the i-th random unit vector. The number of buckets will be `(max L2 norm of input
  * vectors) / bucketLength`.
  *
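
The formula `h_i(x) = floor(r_i * x / bucketLength)` reads `r_i * x` as a dot product, as the removed line `r_i.dot(x)` spelled out. A minimal plain-Python sketch of that bucket computation follows. The helper names `unit_vector` and `rp_hash`, and the choice of Gaussian components for the random direction, are assumptions for illustration and not the Scala implementation.

```python
import math
import random


def unit_vector(dim, rng):
    """Draw a random direction and normalize it to unit L2 norm (Gaussian draw is assumed)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rp_hash(r_i, x, bucket_length):
    """Compute floor((r_i . x) / bucket_length), the bucket index along direction r_i."""
    dot = sum(a * b for a, b in zip(r_i, x))
    return math.floor(dot / bucket_length)


rng = random.Random(0)
r = unit_vector(3, rng)
bucket = rp_hash(r, [1.0, 2.0, 3.0], bucket_length=2.0)
```

Because `r_i` has unit length, the projection of `x` never exceeds its L2 norm in magnitude, which is why the comment ties the bucket count to the maximum L2 norm divided by `bucketLength`.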
