[Spark-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing #16715
Conversation
@yanboliang @jkbradley Please take a look. Thanks!

Test build #72042 has finished for PR 16715 at commit
Test build #72044 has finished for PR 16715 at commit
Test build #72046 has finished for PR 16715 at commit
Test build #72050 has finished for PR 16715 at commit
Test build #72055 has finished for PR 16715 at commit

@yanboliang Would you have time to take a look? Thanks!

@jkbradley @Yunni I'll have a look next week. Thanks.

Thanks very much, @yanboliang ~~

@yanboliang, just a friendly reminder: please don't forget to review the PR when you have time. Thanks!

@Yunni I'm traveling at Spark Summit East these days, and will review after the summit. Thanks for your patience.
sethah
left a comment
Leaving comments on the code, will check docs and examples too.
python/pyspark/ml/feature.py (Outdated)

| class LSHParams(Params):
|     """
|     Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.

space after "Hashing"

Done.
python/pyspark/ml/feature.py (Outdated)

| class LSHModel():
|     """
|     Mixin for Locality Sensitive Hashing(LSH) models.

space here too

Done.
python/pyspark/ml/feature.py (Outdated)

|     """
|     @since("2.2.0")
|     def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, singleProbing=True,

Can we just leave single probing out, since it has no effect and we aren't including it in the doc?

Done.
python/pyspark/ml/feature.py (Outdated)

|     @since("2.2.0")
|     def approxSimilarityJoin(self, datasetA, datasetB, threshold, distCol="distCol"):
|         """
|         Join two dataset to approximately find all pairs of rows whose distance are smaller than

"two datasets"

Done.
| .. note:: Experimental
| LSH class for Jaccard distance.
| The input can be dense or sparse vectors, but it is more efficient if it is sparse.

This notation is not correct. For Python it should be Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)]).

Fixed
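As a plain-Python sketch of what that corrected notation denotes (no Spark needed; `sparse_to_dense` is a hypothetical helper, not a PySpark API): a size-10 sparse vector with 1.0 at indices 2, 3, and 5 expands to a dense vector like this.

```python
# Hypothetical helper (not part of PySpark) illustrating what
# Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)]) denotes:
# a size-10 vector holding 1.0 at indices 2, 3, and 5.
def sparse_to_dense(size, index_value_pairs):
    """Expand (index, value) pairs into a dense list of floats."""
    dense = [0.0] * size
    for i, v in index_value_pairs:
        dense[i] = v
    return dense

print(sparse_to_dense(10, [(2, 1.0), (3, 1.0), (5, 1.0)]))
# [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```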
python/pyspark/ml/feature.py (Outdated)

| ... (Vectors.dense([1.0, -1.0]),),
| ... (Vectors.dense([1.0, 1.0]),)]
| >>> df = spark.createDataFrame(data, ["keys"])
| >>> rp = BucketedRandomProjectionLSH(inputCol="keys", outputCol="values",

Can we call them "features" and "hashes"? I'm open to other names, but "keys" and "values" is unclear to me.
EDIT: I see this is the Scala example convention. I still think "features" and "hashes" is better, but either way is acceptable.

I agree with you. I have changed the terms to "features" and "hashes".
python/pyspark/ml/feature.py (Outdated)

| >>> df2 = spark.createDataFrame(data2, ["keys"])
| >>> model.approxNearestNeighbors(df2, Vectors.dense([1.0, 2.0]), 1).collect()
| [Row(keys=DenseVector([2.0, 2.0]), values=[DenseVector([1.0])], distCol=1.0)]
| >>> model.approxSimilarityJoin(df, df2, 3.0).select("distCol").head()[0]

Since doctests are also partly used to demonstrate the usage of the algorithm, I don't think this line is particularly useful. It is quite hard to interpret. I think it might be nicer to add an "id" column to the dataframes and then do a "show" here to see the joined dataframes, as in the Scala example. Then again, you end up with:

    +--------------------+--------------------+----------------+
    |            datasetA|            datasetB|         distCol|
    +--------------------+--------------------+----------------+
    |[[1.0,1.0],Wrappe...|[[3.0,2.0],Wrappe...|2.23606797749979|
    +--------------------+--------------------+----------------+

which is also confusing! Thoughts on which option is better?

I think showing the ids would be more interpretable, as users are able to see the feature vectors of the ids from the examples.
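For intuition about what the joined output represents, the operation can be stated as an exact brute-force join (a plain-Python sketch; `exact_similarity_join` is a hypothetical helper, not the PySpark API — the LSH version only compares rows that share a hash bucket, trading exactness for speed):

```python
import math

def exact_similarity_join(dataset_a, dataset_b, threshold):
    """dataset_* are lists of (id, vector) pairs; return (idA, idB, dist)
    for every pair whose Euclidean distance is below the threshold."""
    out = []
    for id_a, va in dataset_a:
        for id_b, vb in dataset_b:
            d = math.dist(va, vb)  # Euclidean distance (Python 3.8+)
            if d < threshold:
                out.append((id_a, id_b, d))
    return out

df = [(0, [1.0, 1.0]), (1, [1.0, -1.0])]
df2 = [(4, [3.0, 2.0]), (5, [2.0, 2.0])]
print(exact_similarity_join(df, df2, 3.0))
```

With ids in the output, each row of the join is directly traceable back to the input vectors, which is the point the comment above is making.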
python/pyspark/ml/feature.py (Outdated)

|     """
|     @property
|     @since("2.2.0")

This should not be exposed, since it's private in Scala. Also, Array[(Int, Int)] does not serialize to Python.

Removed
| transform the data; if the :py:attr:`outputCol` exists, it will use that. This allows
| caching of the transformed data when necessary.
| :param dataset: The dataset to search for nearest neighbors of the key.

Add the note that's in the Scala doc?

Done.
| return self.getOrDefault(self.threshold)
|
| class LSHParams(Params):

The classes in this file are alphabetized for the most part. Let's keep the convention here.

It's not alphabetized here because the declaration order matters for the PySpark shell.
| model = brp.fit(dfA)
|
| # Feature Transformation
| model.transform(dfA).show()

Other examples typically output some print statements along with the output, explaining what you're seeing. As it is, this example just spits out a bunch of dataframes with no explanations. I'd like us to add that here, and for the Scala examples really.

Done for the Scala/Java/Python examples.
| """
| An example demonstrating BucketedRandomProjectionLSH.
| Run with:
|   bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py

The file names usually end with "_example". Have we not done that here because of how long the name already is? I slightly prefer to stick with the convention.

That was a mistake. Sorry about it!
| model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < datasetB.id").show()
|
| # Approximate nearest neighbor search
| model.approxNearestNeighbors(dfA, key, 2).show()

These two output empty dataframes.

Increased the number of hash tables.
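Why raising the number of hash tables fixes the empty output can be seen in a toy plain-Python sketch (illustrative only, not Spark's implementation; `make_tables` and `candidates` are hypothetical helpers): a point is a candidate match if it collides with the query in any table, so adding tables never shrinks the candidate set.

```python
import random

random.seed(42)

def make_tables(num_tables, dim, bucket_len):
    """One random-projection hash per table: h(x) = floor((w . x) / bucket_len)."""
    tables = []
    for _ in range(num_tables):
        w = [random.gauss(0.0, 1.0) for _ in range(dim)]
        tables.append(lambda x, w=w: int(sum(a * b for a, b in zip(w, x)) // bucket_len))
    return tables

def candidates(query, points, tables):
    """Indices of points sharing a bucket with the query in at least one table."""
    return {i for i, p in enumerate(points)
            if any(h(p) == h(query) for h in tables)}

points = [[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [3.0, 2.0]]
query = [1.0, 0.9]
tables = make_tables(10, dim=2, bucket_len=2.0)

few = candidates(query, points, tables[:1])   # with a single hash table
many = candidates(query, points, tables)      # with all ten tables
assert few.issubset(many)                     # more tables never drop candidates
print(len(few), len(many))
```

This OR-amplification across tables is why approxNearestNeighbors and approxSimilarityJoin return more rows as numHashTables grows.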
| model = mh.fit(dfA)
|
| # Feature Transformation
| model.transform(dfA).show()

same comment about print statements here

Done.
First pass, thanks @Yunni and @yanboliang!

Test build #72664 has finished for PR 16715 at commit
Test build #72668 has finished for PR 16715 at commit

Could you please add the tag "[PYTHON]" to the PR title?
sethah
left a comment
This is nearly ready. With the suggested changes, all of the following are good:
- Doctests pass
- Docs build, and examples can be copy/pasted into the REPL for Python and Scala
- Examples run and have coherent output
- Python API docs look good
python/pyspark/ml/feature.py (Outdated)

| :param numNearestNeighbors: The maximum number of nearest neighbors.
| :param distCol: Output column for storing the distance between each result row and the key.
|                 Use "distCol" as default value if it's not specified.
| :return: A dataset containing at most k items closest to the key. A distCol is added

"A distCol" -> "A column 'distCol'"

Done.
python/pyspark/ml/feature.py (Outdated)

| :param distCol: Output column for storing the distance between each result row and the key.
|                 Use "distCol" as default value if it's not specified.
| :return: A joined dataset containing pairs of rows. The original rows are in columns
|          "datasetA" and "datasetB", and a distCol is added to show the distance of

indentation; "a distCol" -> "a column distCol"

Done.
| * The input is dense or sparse vectors, each of which represents a point in the Euclidean
| * distance space. The output will be vectors of configurable dimension. Hash values in the
| * same dimension are calculated by the same hash function.
| * distance space. The output will be vectors of configurable dimension. Hash values in the same

Can we revert this?

Reverted
python/pyspark/ml/feature.py (Outdated)

| return self._call_java("approxNearestNeighbors", dataset, key, numNearestNeighbors,
|                        distCol)
|
| @since("2.2.0")

I think we've decided not to put since tags in parent classes, since they'll be wrong for future derived classes.

Removed in 4 places.
| // neighbor search.
| // We could avoid computing hashes by passing in the already-transformed dataset, e.g.
| // `model.approxNearestNeighbors(transformedA, key, 2)`
| // It may return less than 2 rows because of lack of elements in the hash buckets.

Change it to "It may return less than 2 rows when not enough approximate near-neighbor candidates are found."?

Done.
| import org.apache.spark.sql.types.StructField;
| import org.apache.spark.sql.types.StructType;
|
| import static org.apache.spark.sql.functions.*;

Just import col here and in the MinHash example.

Done.
docs/ml-features.md (Outdated)

| Refer to the [BucketedRandomProjectionLSH Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH)
| for more details on the API.
|
| {% include_example python/ml/bucketed_random_projection_lsh.py %}

These are not correct, and the docs don't build because of it. In the future, can you check that the docs build when you make changes?

    cd docs; SKIP_API=1 jekyll serve --watch

More detailed instructions here. You can also build the Python docs with:

    cd python/docs; make html

Yup, should be bucketed_random_projection_lsh_example.py (and similarly for the minhash include_example below).

Sorry, I forgot to retest after renaming the Python examples. Thanks for the information.
| """
| An example demonstrating BucketedRandomProjectionLSH.
| Run with:
|   bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh_example.py

Can we add the appropriate note to the Scala and Java examples as well?

Added in 4 places.
MLnick
left a comment
Left a few minor comments.
python/pyspark/ml/feature.py (Outdated)

| :param datasetA: One of the datasets to join.
| :param datasetB: Another dataset to join.
| :param threshold: The threshold for the distance of row pairs.
| :param distCol: Output column for storing the distance between each result row and the key.

Shouldn't this be "distance between each pair of rows", rather than "between each result row and the key"?

Fixed.
python/pyspark/ml/feature.py (Outdated)

| :param distCol: Output column for storing the distance between each result row and the key.
|                 Use "distCol" as default value if it's not specified.
| :return: A joined dataset containing pairs of rows. The original rows are in columns
|          "datasetA" and "datasetB", and a distCol is added to show the distance of

nit: "distance between each pair" rather than "distance of"

Done.
| model.approxSimilarityJoin(dfA, dfB, 1.5)
|   .select(col("datasetA.id").alias("idA"),
|           col("datasetB.id").alias("idB"),
|           col("distCol").alias("EuclideanDistance")).show()

We can just pass distCol="EuclideanDistance" here, and for approxNearestNeighbors. We can do this throughout the examples (and obviously for MinHash change it to "JaccardDistance" accordingly).

Done in 6 places.
docs/ml-features.md (Outdated)

| Refer to the [BucketedRandomProjectionLSH Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH)
| for more details on the API.
|
| {% include_example python/ml/bucketed_random_projection_lsh.py %}

Yup, should be bucketed_random_projection_lsh_example.py (and similarly for the minhash include_example below).
sethah
left a comment
Just a few minor things. Thanks for all the work on this!
| model.approxSimilarityJoin(dfA, dfB, 0.6)
|   .select(col("datasetA.id").alias("idA"),
|           col("datasetB.id").alias("idB"),
|           col("distCol").alias("JaccardDistance")).show()

Pass distCol as a method parameter instead of using an alias.

Done.
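For readers unfamiliar with the "JaccardDistance" column MinHashLSH produces, the estimator behind it can be sketched in plain Python (illustrative only, not Spark's implementation; the helpers are hypothetical): the probability that two sets share a min-hash under a random permutation equals their Jaccard similarity, so distance = 1 - similarity can be estimated from hash collisions.

```python
import random

def jaccard_distance(a, b):
    """Exact Jaccard distance between two sets of active indices."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def minhash_signature(items, perms):
    """Minimum rank of the set's items under each random permutation."""
    return [min(perm[i] for i in items) for perm in perms]

random.seed(0)
universe = list(range(6))
perms = []
for _ in range(200):  # 200 permutations ~ 200 hash functions
    order = universe[:]
    random.shuffle(order)
    perms.append({v: rank for rank, v in enumerate(order)})

a, b = {0, 1, 2}, {1, 2, 3}                 # Jaccard similarity = 2/4 = 0.5
sig_a = minhash_signature(a, perms)
sig_b = minhash_signature(b, perms)
est_sim = sum(x == y for x, y in zip(sig_a, sig_b)) / len(perms)

print(jaccard_distance(a, b))               # 0.5 exactly
print(est_sim)                              # typically close to 0.5
```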
| * @param distCol Output column for storing the distance between each pair of rows.
| * @return A joined dataset containing pairs of rows. The original rows are in columns
| *         "datasetA" and "datasetB", and a distCol is added to show the distance of each pair.
| *         "datasetA" and "datasetB", and a distCol is added to show the distance between each

a column "distCol"

Done.
| // $example on$
| import org.apache.spark.ml.feature.MinHashLSH
| import org.apache.spark.ml.linalg.Vectors
| import org.apache.spark.sql.functions._

just import col here and above

Done.
| >>> mh2.getOutputCol() == mh.getOutputCol()
| True
| >>> modelPath = temp_path + "/mh-model"
| >>> model.save(modelPath)

Let's add an equality check here and for BRP. For example, for IDFModel we have:

    loadedModel.transform(df).head().idf == model.transform(df).head().idf

Added.
Test build #72856 has finished for PR 16715 at commit
Test build #72881 has finished for PR 16715 at commit
sethah
left a comment
One minor thing, then LGTM. Thanks @Yunni!
| package org.apache.spark.examples.ml;
|
| import org.apache.spark.ml.linalg.Vector;

Move it below, under $example on$.
Test build #72923 has started for PR 16715 at commit

@sethah Really appreciate your detailed code review and comments. :)

Test build #72956 has finished for PR 16715 at commit
yanboliang
left a comment
LGTM, merged into master. Thanks for all.
| /**
|  * An example demonstrating BucketedRandomProjectionLSH.
|  * Run with:
|  *   bin/run-example org.apache.spark.examples.ml.JavaBucketedRandomProjectionLSHExample

Actually, we can simplify it to bin/run-example ml.JavaBucketedRandomProjectionLSHExample, but it's OK to leave it as is.
BTW, in the future I'd prefer to separate the examples and the Python API. I'm not sure we ever fully decided on a normal protocol for this, but it certainly would make the review easier :)

+1 @sethah
Sure. Will do.

Hey, super excited about this feature! I was actually thinking of writing this myself until I saw this. What version of Spark is this slated to hit the Python API in? Thanks :)

Hi @e-m-m, I think the Python API will be included in Spark 2.2.

Wow, thanks for the quick answer, @Yunni! Sounds great. I'll definitely be using it.
jkbradley
left a comment
Thanks everyone for the PR & reviews! @Yunni Would you mind sending a "[MINOR]" follow-up PR to fix my late comment + the one from @yanboliang above?
| def __init__(self, inputCol=None, outputCol=None, seed=None, numHashTables=1,
|              bucketLength=None):
|     """
|     __init__(self, inputCol=None, outputCol=None, seed=None, numHashTables=1,

Missing "\" at end of line

Sure. Will do.

Actually, I found this issue when reviewing this PR, but the generated Python API doc is correct, so I ignored it. @jkbradley Could you let me know the effect of "\" at the end of the line? Thanks.

Oh, I thought it was necessary for proper doc generation, but maybe it's not.
What changes were proposed in this pull request?

This pull request includes the Python API and examples for LSH. The API changes were based on @yanboliang's PR #15768, resolving conflicts and API changes against the Scala API. The examples are consistent with the Scala examples for MinHashLSH and BucketedRandomProjectionLSH.

How was this patch tested?

The API and examples are tested using spark-submit:

    bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py
    bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py

User guide changes are generated and manually inspected:

    SKIP_API=1 jekyll build