[SPARK-20076][ML][PySpark] Add Python interface for ml.stats.Correlation #17494
Conversation
cc @jkbradley

cc @thunterdb

cc @holdenk

Test build #75430 has finished for PR 17494 at commit
Force-pushed 9b81ce9 to 47f0257
Force-pushed 47f0257 to e129e06
Test build #75432 has finished for PR 17494 at commit
MLnick left a comment
Minor comment, LGTM otherwise.
... [Vectors.dense([9, 0, 0, 1])]]
>>> dataset = spark.createDataFrame(dataset, ["features"])
>>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
>>> print(str(pearsonCorr).replace('nan', 'NaN'))
Any reason for this replacement?
The test is mainly adapted from MLlib's old Correlation. I can't think of a reason for the replacement other than a nicer representation of the NaN values.
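For context, a quick plain-Python sketch (no Spark needed) of the point under discussion: Python renders a float NaN in lowercase, so the replacement just normalizes the printed matrix to the `NaN` spelling used in the doctest's expected output.

```python
# Plain-Python sketch: why the doctest replaces 'nan' with 'NaN'.
# str()/repr() of a float NaN is lowercase, but the expected doctest
# output in the docstring spells it 'NaN', so the string is normalized.
s = str(float("nan"))
print(s)                        # prints: nan
print(s.replace("nan", "NaN"))  # prints: NaN
```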
* val Row(coeff: Matrix) = Correlation.corr(data, "value").head
* // coeff now contains the Pearson correlation matrix.
* }}}
*
While we're here - below it says "cache the input RDD", but that should be "the input Dataset".
OK. Fixed it.
Also, since we're here: there is a reference to the input RDD up above in the docstring as well.
oh, right. fixed. :-)
Test build #75434 has finished for PR 17494 at commit

ping @MLnick
python/pyspark/ml/stat.py
Outdated
[ 0.05564149, 1. , NaN, 0.91359586],
[ NaN, NaN, 1. , NaN],
[ 0.40047142, 0.91359586, NaN, 1. ]])
>>> spearmanCorr = Correlation.corr(dataset, 'features', method="spearman").collect()[0][0]
Super minor nit - but let's use single quotes everywhere here rather than a mix of single & double - e.g. "spearman" here -> 'spearman', and likewise above.
python/pyspark/ml/stat.py
Outdated
>>> from pyspark.ml.stat import Correlation
>>> dataset = [[Vectors.dense([1, 0, 0, -2])],
... [Vectors.dense([4, 5, 0, 3])],
... [Vectors.dense([6, 7, 0, 8])],
another minor nit - there seems to be an extra space here
.. note:: Experimental
Compute the correlation matrix for the input dataset of Vectors using the specified method.
Methods currently supported: `pearson` (default), `spearman`.
So the Scala documentation had a warning suggesting caching when using Spearman - would it make sense to copy that warning over as well?
Sounds good. Fixed.
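As background for that warning: Spearman is Pearson applied to per-column ranks, which is what makes it costly in Spark (one sorted RDD[Double] per column). A minimal local NumPy sketch of the equivalence — illustrative only, not Spark's distributed implementation:

```python
import numpy as np

# Minimal sketch of what Spearman computes: rank each column, then take
# the Pearson correlation of the ranks. Spark does this per column in a
# distributed fashion, which is why caching the input is recommended.
X = np.array([[1.0, 0.0, -2.0],
              [4.0, 5.0,  3.0],
              [6.0, 7.0,  8.0],
              [9.0, 2.0,  1.0]])            # no ties, so simple ranking suffices
ranks = X.argsort(axis=0).argsort(axis=0) + 1.0  # 1-based ranks per column
spearman = np.corrcoef(ranks, rowvar=False)      # Pearson on the ranks
print(np.round(spearman, 2))
```

Columns 1 and 2 have identical rank orderings here, so their Spearman correlation is exactly 1.0 even though their raw values differ.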
Force-pushed ec6c003 to 8936880
Force-pushed 8936880 to fbcc1fe
Test build #75495 has finished for PR 17494 at commit

Test build #75496 has finished for PR 17494 at commit

@jkbradley @MLnick @holdenk If there are no more questions about this change, maybe we can get it into 2.2 in time?
python/pyspark/ml/stat.py
Outdated
Compute the correlation matrix for the input dataset of Vectors using the specified method.
Methods currently supported: `pearson` (default), `spearman`.
@note For Spearman, a rank correlation, we need to create an RDD[Double] for each column
I don't think @note will work for PyDoc?
Replaced it.
python/pyspark/ml/stat.py
Outdated
Compute the correlation matrix for the input dataset of Vectors using the specified method.
Methods currently supported: `pearson` (default), `spearman`.
Notice: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
I think we should use a .. note::?
Ok. Not quite familiar with PyDoc...
Compute the correlation matrix for the input dataset of Vectors using the specified method.
Methods currently supported: `pearson` (default), `spearman`.
.. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
Sorry, I picked up that the doc gen will fail here - there needs to be 2 spaces before the start of each subsequent line, like this:

.. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
  and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
  which is fairly costly. Cache the input Dataset before calling corr with `method = 'spearman'`
  to avoid recomputing the common lineage.
ah. ok. fixed. see if this time it's ok.
Test build #75568 has finished for PR 17494 at commit

LGTM pending Jenkins confirming.

Test build #75564 has finished for PR 17494 at commit

Thanks @MLnick

Test build #75569 has finished for PR 17494 at commit
python/pyspark/ml/stat.py
Outdated
>>> dataset = spark.createDataFrame(dataset, ['features'])
>>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
>>> print(str(pearsonCorr).replace('nan', 'NaN'))
DenseMatrix([[ 1. , 0.05564149, NaN, 0.40047142],
So maybe I'm being overly cautious - but doctests with floats have bitten me in the past - would it be good to use the `...` syntax here, or is this going to be ok? (Just asking.)
Fair point - it may lead to flaky tests at some point, I guess.
Although we already have many tests in PySpark with floats like this, it's a fair point, I agree.
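For reference, a small standalone sketch of the doctest `...` (ELLIPSIS) option being discussed — it lets the expected output elide the unstable trailing digits of a float:

```python
import doctest

# Standalone sketch of the doctest ELLIPSIS option: '...' in the expected
# output absorbs the unstable trailing digits of the float result.
snippet = """
>>> 0.1 + 0.2  # doctest: +ELLIPSIS
0.3000...
"""
parser = doctest.DocTestParser()
test = parser.get_doctest(snippet, {}, "float-ellipsis", None, 0)
runner = doctest.DocTestRunner()
runner.run(test)
print(runner.failures)   # prints: 0
```

Without the `+ELLIPSIS` directive, the expected line would have to match `0.30000000000000004` exactly, which is what makes float doctests fragile.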
Force-pushed 1a7b0f9 to 601d9eb
Test build #75574 has finished for PR 17494 at commit

LGTM if others are ok too

Thanks @jkbradley

LGTM as well

Thanks @holdenk

Merged to master. Thanks!
What changes were proposed in this pull request?
DataFrame-based support for correlation statistics was added in #17108. This patch adds the Python interface for it.
How was this patch tested?
Python unit test.
Please review http://spark.apache.org/contributing.html before opening a pull request.