[SPARK-20076][ML][PySpark] Add Python interface for ml.stats.Correlation #17494
Conversation
cc @jkbradley

cc @thunterdb

cc @holdenk

Test build #75430 has finished for PR 17494 at commit
Force-pushed 9b81ce9 to 47f0257
Force-pushed 47f0257 to e129e06
Test build #75432 has finished for PR 17494 at commit
MLnick left a comment
Minor comment, LGTM otherwise.
... [Vectors.dense([9, 0, 0, 1])]]
>>> dataset = spark.createDataFrame(dataset, ["features"])
>>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
>>> print(str(pearsonCorr).replace('nan', 'NaN'))
Any reason for this replacement?
The test is mainly adapted from MLlib's old Correlation. I can't think of a reason for the replacement other than a nicer representation of the NaN values.
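For context, a quick plain-Python sketch (no Spark needed) of the point under discussion: Python renders a float NaN in lowercase, so the replacement just normalizes the printed matrix to the `NaN` spelling used in the doctest's expected output.

```python
# Plain-Python sketch: why the doctest replaces 'nan' with 'NaN'.
# str()/repr() of a float NaN is lowercase, but the expected doctest
# output in the docstring spells it 'NaN', so the string is normalized.
s = str(float("nan"))
print(s)                        # prints: nan
print(s.replace("nan", "NaN"))  # prints: NaN
```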
* val Row(coeff: Matrix) = Correlation.corr(data, "value").head
* // coeff now contains the Pearson correlation matrix.
* }}}
*
While we're here - below it says "cache the input RDD", but that should be "the input Dataset".
OK. Fixed it.
Also, since we're here: there is a reference to the input RDD up above in the docstring as well.
oh, right. fixed. :-)
Test build #75434 has finished for PR 17494 at commit

ping @MLnick
python/pyspark/ml/stat.py
Outdated
[ 0.05564149, 1. , NaN, 0.91359586],
[ NaN, NaN, 1. , NaN],
[ 0.40047142, 0.91359586, NaN, 1. ]])
>>> spearmanCorr = Correlation.corr(dataset, 'features', method="spearman").collect()[0][0]
Super minor nit - but let's use single quotes everywhere here rather than a mix of single & double - e.g. "spearman" here -> 'spearman', and likewise above.
python/pyspark/ml/stat.py
Outdated
>>> from pyspark.ml.stat import Correlation
>>> dataset = [[Vectors.dense([1, 0, 0, -2])],
... [Vectors.dense([4, 5, 0, 3])],
... [Vectors.dense([6, 7, 0, 8])],
another minor nit - there seems to be an extra space here
.. note:: Experimental
Compute the correlation matrix for the input dataset of Vectors using the specified method.
Methods currently supported: `pearson` (default), `spearman`.
So the Scala documentation had a warning suggesting caching when using Spearman - would it make sense to copy that warning over as well?
Sounds good. Fixed.
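As background for that warning: Spearman is Pearson applied to per-column ranks, which is what makes it costly in Spark (one sorted RDD[Double] per column). A minimal local NumPy sketch of the equivalence — illustrative only, not Spark's distributed implementation:

```python
import numpy as np

# Minimal sketch of what Spearman computes: rank each column, then take
# the Pearson correlation of the ranks. Spark does this per column in a
# distributed fashion, which is why caching the input is recommended.
X = np.array([[1.0, 0.0, -2.0],
              [4.0, 5.0,  3.0],
              [6.0, 7.0,  8.0],
              [9.0, 2.0,  1.0]])            # no ties, so simple ranking suffices
ranks = X.argsort(axis=0).argsort(axis=0) + 1.0  # 1-based ranks per column
spearman = np.corrcoef(ranks, rowvar=False)      # Pearson on the ranks
print(np.round(spearman, 2))
```

Columns 1 and 2 have identical rank orderings here, so their Spearman correlation is exactly 1.0 even though their raw values differ.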
Force-pushed ec6c003 to 8936880
Force-pushed 8936880 to fbcc1fe
Test build #75495 has finished for PR 17494 at commit

Test build #75496 has finished for PR 17494 at commit

@jkbradley @MLnick @holdenk If there are no more questions about this change, maybe we can get it into 2.2 in time?
python/pyspark/ml/stat.py
Outdated
Compute the correlation matrix for the input dataset of Vectors using the specified method.
Methods currently supported: `pearson` (default), `spearman`.
@note For Spearman, a rank correlation, we need to create an RDD[Double] for each column
I don't think @note will work for PyDoc?
Replaced it.
python/pyspark/ml/stat.py
Outdated
Compute the correlation matrix for the input dataset of Vectors using the specified method.
Methods currently supported: `pearson` (default), `spearman`.
Notice: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
I think we should use a .. note::?
Ok. Not quite familiar with PyDoc...
Compute the correlation matrix for the input dataset of Vectors using the specified method.
Methods currently supported: `pearson` (default), `spearman`.
.. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
Sorry, I picked up that the doc gen will fail here - there needs to be 2 spaces before the start of each subsequent line, like this:

.. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
  and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
  which is fairly costly. Cache the input Dataset before calling corr with `method = 'spearman'`
  to avoid recomputing the common lineage.
ah. ok. fixed. see if this time it's ok.
Test build #75568 has finished for PR 17494 at commit

LGTM pending Jenkins confirming.

Test build #75564 has finished for PR 17494 at commit

Thanks @MLnick

Test build #75569 has finished for PR 17494 at commit
python/pyspark/ml/stat.py
Outdated
>>> dataset = spark.createDataFrame(dataset, ['features'])
>>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
>>> print(str(pearsonCorr).replace('nan', 'NaN'))
DenseMatrix([[ 1. , 0.05564149, NaN, 0.40047142],
So maybe I'm being overly cautious - but doctests with floats have bitten me in the past - would it be good to use the `...` syntax here, or is this going to be ok? (Just asking.)
Fair point - it may lead to flaky tests at some point, I guess.
Although we already have many tests in PySpark with floats like this, it's a fair point, I agree.
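For reference, a small standalone sketch of the doctest `...` (ELLIPSIS) option being discussed — it lets the expected output elide the unstable trailing digits of a float:

```python
import doctest

# Standalone sketch of the doctest ELLIPSIS option: '...' in the expected
# output absorbs the unstable trailing digits of the float result.
snippet = """
>>> 0.1 + 0.2  # doctest: +ELLIPSIS
0.3000...
"""
parser = doctest.DocTestParser()
test = parser.get_doctest(snippet, {}, "float-ellipsis", None, 0)
runner = doctest.DocTestRunner()
runner.run(test)
print(runner.failures)   # prints: 0
```

Without the `+ELLIPSIS` directive, the expected line would have to match `0.30000000000000004` exactly, which is what makes float doctests fragile.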
Force-pushed 1a7b0f9 to 601d9eb
Test build #75574 has finished for PR 17494 at commit

LGTM if others are ok too

Thanks @jkbradley

LGTM as well

Thanks @holdenk

Merged to master. Thanks!
What changes were proposed in this pull request?
DataFrame-based support for correlation statistics was added in #17108. This patch adds the Python interface for it.
How was this patch tested?
Python unit test.
Please review http://spark.apache.org/contributing.html before opening a pull request.