Conversation

@Yunni (Contributor) commented Jan 26, 2017

What changes were proposed in this pull request?

This pull request adds the Python API and examples for LSH. The API changes are based on @yanboliang 's PR #15768, with conflicts resolved and updates applied to match changes in the Scala API. The examples are consistent with the Scala examples for MinHashLSH and BucketedRandomProjectionLSH.

How was this patch tested?

API and examples are tested using spark-submit:
bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py
bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py

User guide changes are generated and manually inspected:
SKIP_API=1 jekyll build

@Yunni (Contributor, Author) commented Jan 26, 2017

@yanboliang @jkbradley Please take a look. Thanks!

@SparkQA commented Jan 26, 2017

Test build #72042 has finished for PR 16715 at commit 65dab3e.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 26, 2017

Test build #72044 has finished for PR 16715 at commit 3d3bcf0.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 26, 2017

Test build #72046 has finished for PR 16715 at commit 69dccde.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 26, 2017

Test build #72050 has finished for PR 16715 at commit e7542d0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 26, 2017

Test build #72055 has finished for PR 16715 at commit 5cfc9c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member) commented:
@yanboliang Would you have time to take a look? Thanks!

@yanboliang (Contributor) commented:
@jkbradley @Yunni I'll take a look next week. Thanks.

@Yunni (Contributor, Author) commented Jan 28, 2017

Thanks very much, @yanboliang ~~

@Yunni (Contributor, Author) commented Feb 6, 2017

@yanboliang, just a friendly reminder: please don't forget to review the PR when you have time. Thanks!

@yanboliang (Contributor) commented:
@Yunni I'm traveling for Spark Summit East these days and will review after the summit. Thanks for your patience.

@sethah (Contributor) left a comment:

Leaving comments on the code, will check docs and examples too.


class LSHParams(Params):
    """
    Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.

sethah (Contributor): space after "Hashing"

@Yunni (Author): Done.


class LSHModel():
    """
    Mixin for Locality Sensitive Hashing(LSH) models.

sethah (Contributor): space here too

@Yunni (Author): Done.

"""

@since("2.2.0")
def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, singleProbing=True,

sethah (Contributor): Can we just leave single probing out, since it has no effect and we aren't including it in the doc?

@Yunni (Author): Done.
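
For reference, a minimal sketch (not the authoritative diff) of what the method looks like with singleProbing dropped, using only names visible in the surrounding review context:

@since("2.2.0")
def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, distCol="distCol"):
    """
    Given a large dataset and an item, approximately find at most k items
    which have the closest distance to the item.
    """
    # Delegates to the JVM implementation, as elsewhere in this file.
    return self._call_java("approxNearestNeighbors", dataset, key,
                           numNearestNeighbors, distCol)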

    @since("2.2.0")
    def approxSimilarityJoin(self, datasetA, datasetB, threshold, distCol="distCol"):
        """
        Join two dataset to approximately find all pairs of rows whose distance are smaller than

sethah (Contributor): "two datasets"

@Yunni (Author): Done.

.. note:: Experimental
LSH class for Jaccard distance.
The input can be dense or sparse vectors, but it is more efficient if it is sparse.

sethah (Contributor): This notation is not correct. For Python it should be Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)]).

@Yunni (Author): Fixed.
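
A minimal sketch of the corrected notation; the vector size and indices are illustrative:

from pyspark.ml.linalg import Vectors

# A 10-dimensional sparse vector with value 1.0 at indices 2, 3, and 5
v = Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)])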

... (Vectors.dense([1.0, -1.0]),),
... (Vectors.dense([1.0, 1.0]),)]
>>> df = spark.createDataFrame(data, ["keys"])
>>> rp = BucketedRandomProjectionLSH(inputCol="keys", outputCol="values",

sethah (Contributor): Can we call them "features" and "hashes"? I'm open to other names, but "keys" and "values" is unclear to me.

EDIT: I see this is the Scala example convention. I still think "features" and "hashes" is better, but either way is acceptable.

@Yunni (Author): I agree with you. I have changed the terms to "features" and "hashes".
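
A minimal sketch of the renamed columns (the bucketLength and seed values are illustrative, and an active SparkSession named spark is assumed):

from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([1.0, -1.0]),)]
df = spark.createDataFrame(data, ["features"])
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=1.0, seed=12345)
model = brp.fit(df)
model.transform(df).show()  # adds a "hashes" column holding the hash-value vectors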

>>> df2 = spark.createDataFrame(data2, ["keys"])
>>> model.approxNearestNeighbors(df2, Vectors.dense([1.0, 2.0]), 1).collect()
[Row(keys=DenseVector([2.0, 2.0]), values=[DenseVector([1.0])], distCol=1.0)]
>>> model.approxSimilarityJoin(df, df2, 3.0).select("distCol").head()[0]

sethah (Contributor): Since doctests are also partly used to demonstrate the usage of the algorithm, I don't think this line is particularly useful; it is quite hard to interpret. It might be nicer to add an "id" column to the dataframes and then do a "show" here to see the joined dataframes, as in the Scala example. Then again, you end up with:

+--------------------+--------------------+----------------+
|            datasetA|            datasetB|         distCol|
+--------------------+--------------------+----------------+
|[[1.0,1.0],Wrappe...|[[3.0,2.0],Wrappe...|2.23606797749979|
+--------------------+--------------------+----------------+

Which is also confusing! Thoughts on which option is better?

@Yunni (Author): I think showing the ids would be more interpretable, as users are able to see the feature vectors for the ids in the examples.
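
A sketch of the id-based variant discussed above, assuming dfA and dfB each carry an "id" column alongside the feature vectors:

from pyspark.sql.functions import col

# Select the ids of each joined pair plus the distance, instead of the raw structs
model.approxSimilarityJoin(dfA, dfB, 3.0) \
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("distCol")).show()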

"""

@property
@since("2.2.0")

sethah (Contributor): This should not be exposed, since it's private in Scala. Also, Array[(Int, Int)] does not serialize to Python.

@Yunni (Author): Removed.

transform the data; if the :py:attr:`outputCol` exists, it will use that. This allows
caching of the transformed data when necessary.
:param dataset: The dataset to search for nearest neighbors of the key.

sethah (Contributor): add the note that's in the Scala doc?

@Yunni (Author): Done.

return self.getOrDefault(self.threshold)


class LSHParams(Params):

sethah (Contributor): The classes in this file are alphabetized for the most part. Let's keep the convention here.

@Yunni (Author): It's not alphabetized here because the declaration order matters for the PySpark shell.

model = brp.fit(dfA)

# Feature Transformation
model.transform(dfA).show()

sethah (Contributor): Other examples typically output some print statements along with the output, explaining what you're seeing. As it is, this example just spits out a bunch of dataframes with no explanation. I'd like us to add that here, and for the Scala examples really.

@Yunni (Author): Done for the Scala/Java/Python examples.
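
For illustration, a sketch of the kind of explanatory output added (the exact wording in the merged examples may differ):

print("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()

print("Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:")
model.approxSimilarityJoin(dfA, dfB, 1.5).show()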

"""
An example demonstrating BucketedRandomProjectionLSH.
Run with:
bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py

sethah (Contributor): The file names usually end with "_example". Have we not done that here because of how long the name already is? I slightly prefer to stick with the convention.

@Yunni (Author): That was a mistake. Sorry about it!

model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < datasetB.id").show()

# Approximate nearest neighbor search
model.approxNearestNeighbors(dfA, key, 2).show()

sethah (Contributor): These two output empty dataframes.

@Yunni (Author): Increased the number of hash tables.
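
More hash tables raise the chance that near neighbors share at least one hash bucket, so the approximate join and search return non-empty results. A sketch with assumed parameter values:

# numHashTables > 1 trades extra computation for better recall
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)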

model = mh.fit(dfA)

# Feature Transformation
model.transform(dfA).show()

sethah (Contributor): same comment about print statements here

@Yunni (Author): Done.

@sethah (Contributor) commented Feb 7, 2017

First pass, thanks @Yunni and @yanboliang!

@SparkQA commented Feb 10, 2017

Test build #72664 has finished for PR 16715 at commit b1da01e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 10, 2017

Test build #72668 has finished for PR 16715 at commit 8f1d708.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member) commented:
Could you please add tag "[PYTHON]" to the PR title?
Also, please remove "Please review http://spark.apache.org/contributing.html before opening a pull request." from the PR description since that will become part of the commit message.
Thanks!

@Yunni Yunni changed the title [Spark-18080][ML] Python API & Examples for Locality Sensitive Hashing [Spark-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing Feb 11, 2017
@sethah (Contributor) left a comment:

This is nearly ready. With the suggested changes, all of the following are good:

  • Doc tests pass
  • Docs build, and examples can be copy/pasted into the REPL for Python and Scala
  • Examples run and have coherent output
  • Python API docs look good

:param numNearestNeighbors: The maximum number of nearest neighbors.
:param distCol: Output column for storing the distance between each result row and the key.
Use "distCol" as default value if it's not specified.
:return: A dataset containing at most k items closest to the key. A distCol is added

sethah (Contributor): "A distCol" -> "A column 'distCol'"

@Yunni (Author): Done.

:param distCol: Output column for storing the distance between each result row and the key.
Use "distCol" as default value if it's not specified.
:return: A joined dataset containing pairs of rows. The original rows are in columns
"datasetA" and "datasetB", and a distCol is added to show the distance of

sethah (Contributor): indentation; "a distCol" -> "a column distCol"

@Yunni (Author): Done.

* The input is dense or sparse vectors, each of which represents a point in the Euclidean
* distance space. The output will be vectors of configurable dimension. Hash values in the
* same dimension are calculated by the same hash function.
* distance space. The output will be vectors of configurable dimension. Hash values in the same

sethah (Contributor): Can we revert this?

@Yunni (Author): Reverted.

return self._call_java("approxNearestNeighbors", dataset, key, numNearestNeighbors,
                       distCol)

@since("2.2.0")

sethah (Contributor): I think we've decided not to put since tags in parent classes, since they'll be wrong for future derived classes.

@Yunni (Author): Removed in 4 places.

// neighbor search.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxNearestNeighbors(transformedA, key, 2)`
// It may return less than 2 rows because of lack of elements in the hash buckets.

sethah (Contributor): change it to "It may return less than 2 rows when not enough approximate near-neighbor candidates are found."?

@Yunni (Author): Done.

import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.*;

sethah (Contributor): just import col here and in minhash

@Yunni (Author): Done.

Refer to the [BucketedRandomProjectionLSH Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH)
for more details on the API.

{% include_example python/ml/bucketed_random_projection_lsh.py %}

sethah (Contributor): These are not correct, and the docs don't build because of it. In the future, can you check that the docs build when you make changes?

cd docs; SKIP_API=1 jekyll serve --watch

More detailed instructions are here. You can also build the Python docs with: cd python/docs; make html


MLnick (Contributor): Yup, should be bucketed_random_projection_lsh_example.py (and similarly for the minhash include_example below).

@Yunni (Author): Sorry, I forgot to retest after renaming the Python examples. Thanks for the information.

"""
An example demonstrating BucketedRandomProjectionLSH.
Run with:
bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh_example.py

sethah (Contributor): Can we add the appropriate note for this to the Scala and Java examples as well?

@Yunni (Author): Added in 4 places.

@MLnick (Contributor) left a comment:

Left a few minor comments.

:param datasetA: One of the datasets to join.
:param datasetB: Another dataset to join.
:param threshold: The threshold for the distance of row pairs.
:param distCol: Output column for storing the distance between each result row and the key.

MLnick (Contributor): Shouldn't this be "distance between each pair of rows" rather than "between each result row and the key"?

@Yunni (Author): Fixed.

:param distCol: Output column for storing the distance between each result row and the key.
Use "distCol" as default value if it's not specified.
:return: A joined dataset containing pairs of rows. The original rows are in columns
"datasetA" and "datasetB", and a distCol is added to show the distance of

MLnick (Contributor): nit: "distance between each pair" rather than "distance of"

@Yunni (Author): Done.

model.approxSimilarityJoin(dfA, dfB, 1.5)
.select(col("datasetA.id").alias("idA"),
col("datasetB.id").alias("idB"),
col("distCol").alias("EuclideanDistance")).show()

MLnick (Contributor): We can just pass distCol = "EuclideanDistance" here, and for approxNearestNeighbors. We can do this throughout the examples (and obviously for min hash change it to Jaccard accordingly).

@Yunni (Author): Done in 6 places.
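
A sketch of passing distCol directly instead of aliasing afterwards, using the column names from the surrounding example:

# Name the distance column at the API call, then select it directly
model.approxSimilarityJoin(dfA, dfB, 1.5, distCol="EuclideanDistance") \
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("EuclideanDistance")).show()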

@sethah (Contributor) left a comment:

Just a few minor things. Thanks for all the work on this!

model.approxSimilarityJoin(dfA, dfB, 0.6)
.select(col("datasetA.id").alias("idA"),
col("datasetB.id").alias("idB"),
col("distCol").alias("JaccardDistance")).show()

sethah (Contributor): pass distCol as a method parameter instead of an alias

@Yunni (Author): Done.

* @param distCol Output column for storing the distance between each pair of rows.
* @return A joined dataset containing pairs of rows. The original rows are in columns
* "datasetA" and "datasetB", and a distCol is added to show the distance of each pair.
* "datasetA" and "datasetB", and a distCol is added to show the distance between each

sethah (Contributor): a column "distCol"

@Yunni (Author): Done.

// $example on$
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions._

sethah (Contributor): just import col here and above

@Yunni (Author): Done.

>>> mh2.getOutputCol() == mh.getOutputCol()
True
>>> modelPath = temp_path + "/mh-model"
>>> model.save(modelPath)

sethah (Contributor): Let's add an equality check here and for BRP. For example, for IDFModel we have:

loadedModel.transform(df).head().idf == model.transform(df).head().idf

@Yunni (Author): Added.
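
A sketch of the analogous doctest check for MinHashLSH; the "hashes" output column name follows the rename above, and the exact assertion in the merged code may differ:

>>> model2 = MinHashLSHModel.load(modelPath)
>>> model.transform(df).head().hashes == model2.transform(df).head().hashes
True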

@SparkQA commented Feb 14, 2017

Test build #72856 has finished for PR 16715 at commit c64d50b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 14, 2017

Test build #72881 has finished for PR 16715 at commit 5d55752.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor) left a comment:

One minor thing, then LGTM. Thanks @Yunni!


package org.apache.spark.examples.ml;

import org.apache.spark.ml.linalg.Vector;

sethah (Contributor): move it below, under "example on"

@SparkQA commented Feb 15, 2017

Test build #72923 has started for PR 16715 at commit d849c3a.

@Yunni (Contributor, Author) commented Feb 15, 2017

@sethah Really appreciate your detailed code review and comments. :)
@MLnick @yanboliang Thank you for the help as well. Please let me know if you have any other comments.

@SparkQA commented Feb 15, 2017

Test build #72956 has finished for PR 16715 at commit 36fd9bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class LinearSVCWrapperWriter(instance: LinearSVCWrapper) extends MLWriter
  • class LinearSVCWrapperReader extends MLReader[LinearSVCWrapper]
  • class NoSuchDatabaseException(val db: String) extends AnalysisException(s\"Database '$db' not found\")
  • class ResolveBroadcastHints(conf: CatalystConf) extends Rule[LogicalPlan]
  • case class JsonToStruct(
  • case class StructToJson(
  • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode
  • case class InnerOuterEstimation(conf: CatalystConf, join: Join) extends Logging
  • case class LeftSemiAntiEstimation(conf: CatalystConf, join: Join)
  • case class NumericRange(min: JDecimal, max: JDecimal) extends Range
  • class FileStreamOptions(parameters: CaseInsensitiveMap[String]) extends Logging

@yanboliang (Contributor) left a comment:

LGTM, merged into master. Thanks, all.

/**
* An example demonstrating BucketedRandomProjectionLSH.
* Run with:
* bin/run-example org.apache.spark.examples.ml.JavaBucketedRandomProjectionLSHExample

yanboliang (Contributor): Actually, we can simplify it to bin/run-example ml.JavaBucketedRandomProjectionLSHExample, but it's OK to leave it as it is.

@asfgit asfgit closed this in 08c1972 Feb 16, 2017
@sethah (Contributor) commented Feb 16, 2017

BTW, in the future I'd prefer to separate the examples and the Python API. I'm not sure if we ever fully decided on a normal protocol for this, but it certainly would make the review easier :)

@yanboliang (Contributor) commented:
+1 @sethah

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 16, 2017
…e Hashing

## What changes were proposed in this pull request?
This pull request includes the Python API and examples for LSH. The API changes are based on yanboliang 's PR apache#15768, with conflicts resolved and updates applied to match changes in the Scala API. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH.

## How was this patch tested?
API and examples are tested using spark-submit:
`bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
`bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`

User guide changes are generated and manually inspected:
`SKIP_API=1 jekyll build`

Author: Yun Ni <[email protected]>
Author: Yanbo Liang <[email protected]>
Author: Yunni <[email protected]>

Closes apache#16715 from Yunni/spark-18080.
@Yunni (Contributor, Author) commented Feb 16, 2017

Sure. Will do.

@e-m-m commented Feb 21, 2017

Hey, super excited about this feature! I was actually thinking of writing this myself until I saw this. Which version of Spark is this slated to hit the Python API in? Thanks :)

@Yunni (Contributor, Author) commented Feb 21, 2017

Hi @e-m-m, I think the Python API will be included in Spark 2.2.

@e-m-m commented Feb 21, 2017

Wow, thanks for the quick answer, @Yunni! Sounds great. I'll definitely be using it.

@jkbradley (Member) left a comment:

Thanks everyone for the PR & reviews! @Yunni Would you mind sending a "[MINOR]" follow-up PR to fix my late comment and the one from @yanboliang above?

def __init__(self, inputCol=None, outputCol=None, seed=None, numHashTables=1,
             bucketLength=None):
    """
    __init__(self, inputCol=None, outputCol=None, seed=None, numHashTables=1,

@jkbradley (Member) commented Feb 28, 2017: Missing "\" at end of line.

@Yunni (Author): Sure. Will do.

yanboliang (Contributor): Actually, I found this issue when reviewing this PR, but the generated Python API doc is correct, so I ignored it. @jkbradley Could you let me know the effect of "\" at the end of the line? Thanks.

@jkbradley (Member): Oh, I thought it was necessary for proper doc generation, but maybe it's not.
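
For context, a sketch of the docstring with the trailing "\" added, following the convention used for other multi-line signatures in PySpark docstrings (whether it affects doc generation is exactly the open question above):

def __init__(self, inputCol=None, outputCol=None, seed=None, numHashTables=1,
             bucketLength=None):
    """
    __init__(self, inputCol=None, outputCol=None, seed=None, numHashTables=1, \
             bucketLength=None)
    """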
