@viirya
Member

@viirya viirya commented Mar 31, 2017

What changes were proposed in this pull request?

DataFrame-based support for correlation statistics was added in #17108. This patch adds the Python interface for it.

How was this patch tested?

Python unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@viirya
Member Author

viirya commented Mar 31, 2017

cc @jkbradley

@viirya
Member Author

viirya commented Mar 31, 2017

cc @thunterdb

@viirya
Member Author

viirya commented Mar 31, 2017

cc @holdenk

@SparkQA

SparkQA commented Mar 31, 2017

Test build #75430 has finished for PR 17494 at commit 9b81ce9.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class Correlation(object):

@viirya viirya force-pushed the correlation-python-api branch from 9b81ce9 to 47f0257 Compare March 31, 2017 09:50
@viirya viirya force-pushed the correlation-python-api branch from 47f0257 to e129e06 Compare March 31, 2017 09:51
@SparkQA

SparkQA commented Mar 31, 2017

Test build #75432 has finished for PR 17494 at commit e129e06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class Correlation(object):

Contributor

@MLnick MLnick left a comment

Minor comment, LGTM otherwise.

... [Vectors.dense([9, 0, 0, 1])]]
>>> dataset = spark.createDataFrame(dataset, ["features"])
>>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
>>> print(str(pearsonCorr).replace('nan', 'NaN'))
Contributor

Any reason for this replacement?

Member Author

The test is mainly adapted from mllib's old Correlation. I can't think of a reason for the replacement other than a better representation of the 'NaN' values.
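
For readers following the thread, here is a minimal plain-Python sketch of where the NaN entries come from and what the replacement changes. It does not use Spark: `pearson` is a hypothetical helper written for illustration, the rows are copied from the doctest, and pinning the diagonal to 1.0 is an assumption made only to mirror the printed matrix.

```python
import math

# Rows copied from the doctest under review; each inner list is one feature vector.
rows = [[1, 0, 0, -2],
        [4, 5, 0, 3],
        [6, 7, 0, 8],
        [9, 0, 0, 1]]
columns = list(zip(*rows))


def pearson(xs, ys):
    """Plain-Python Pearson correlation (hypothetical helper, not Spark's code).

    Returns NaN when either column is constant, because its variance is zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else float('nan')


# The doctest output shows 1. on the diagonal even for the all-zero column,
# so this sketch pins the diagonal instead of computing it (an assumption).
matrix = [[1.0 if i == j else pearson(a, b)
           for j, b in enumerate(columns)]
          for i, a in enumerate(columns)]

raw = str(matrix[0][2])                 # Python spells a NaN float as 'nan'
normalized = raw.replace('nan', 'NaN')  # the spelling used in the expected output
```

The third feature is all zeros, so every off-diagonal entry involving it is NaN. The `replace` call only changes how those entries are spelled in the printed output.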

* val Row(coeff: Matrix) = Correlation.corr(data, "value").head
* // coeff now contains the Pearson correlation matrix.
* }}}
*
Contributor

While we're here - below it says "cache the input RDD" but that should be "the input Dataset"

Member Author

OK. Fixed it.

Contributor

Also since we are here as well, there is a reference to input RDD up above in the docstring.

Member Author

oh, right. fixed. :-)

@SparkQA

SparkQA commented Mar 31, 2017

Test build #75434 has finished for PR 17494 at commit a684ac8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Apr 3, 2017

ping @MLnick

[ 0.05564149, 1. , NaN, 0.91359586],
[ NaN, NaN, 1. , NaN],
[ 0.40047142, 0.91359586, NaN, 1. ]])
>>> spearmanCorr = Correlation.corr(dataset, 'features', method="spearman").collect()[0][0]
Contributor

@MLnick MLnick Apr 3, 2017

Super minor nit - but let's use single ' everywhere here rather than have a mix of single & double - as in "spearman" here -> 'spearman' and above

>>> from pyspark.ml.stat import Correlation
>>> dataset = [[Vectors.dense([1, 0, 0, -2])],
... [Vectors.dense([4, 5, 0, 3])],
... [Vectors.dense([6, 7, 0, 8])],
Contributor

another minor nit - seems an extra space here

.. note:: Experimental
Compute the correlation matrix for the input dataset of Vectors using the specified method.
Methods currently supported: `pearson` (default), `spearman`.
Contributor

So the Scala documentation had a warning about caching being suggested when using Spearman, would it make sense to copy this warning over as well?

Member Author

Sounds good. Fixed.
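
As background for the caching warning, a rough plain-Python sketch of a rank correlation follows. It is not Spark's implementation: `average_ranks`, `pearson`, and `spearman` are hypothetical helpers, and giving tied values the average of their positions is an assumption of this sketch. The point is that every column has to be sorted on its own before the ranks can be correlated, which is the per-column work the Scala warning refers to.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their positions.

    The tie rule is an assumption of this sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])  # one sort per column
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Plain-Python Pearson correlation; NaN when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else float('nan')


def spearman(xs, ys):
    """Spearman correlation as Pearson correlation of the per-column ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Any strictly increasing relationship has rank correlation 1.
monotonic = spearman([1, 4, 6, 9], [1, 16, 36, 81])
# Two columns taken from the doctest data (first and fourth features).
first_vs_fourth = spearman([1, 4, 6, 9], [-2, 3, 8, 1])
```

In a distributed setting each of those per-column sorts re-reads the input, which is why caching the input first can help.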

@viirya viirya force-pushed the correlation-python-api branch from ec6c003 to 8936880 Compare April 4, 2017 00:46
@viirya viirya force-pushed the correlation-python-api branch from 8936880 to fbcc1fe Compare April 4, 2017 00:47
@SparkQA

SparkQA commented Apr 4, 2017

Test build #75495 has finished for PR 17494 at commit 8936880.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 4, 2017

Test build #75496 has finished for PR 17494 at commit fbcc1fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Apr 5, 2017

@jkbradley @MLnick @holdenk If there are no more questions about this change, maybe we can make it into 2.2 in time?

Compute the correlation matrix for the input dataset of Vectors using the specified method.
Methods currently supported: `pearson` (default), `spearman`.
@note For Spearman, a rank correlation, we need to create an RDD[Double] for each column
Contributor

I don't think @note will work for PyDoc?

Member Author

Replaced it.

Compute the correlation matrix for the input dataset of Vectors using the specified method.
Methods currently supported: `pearson` (default), `spearman`.
Notice: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
Contributor

I think we should use a .. note::?

Member Author

Ok. Not quite familiar with PyDoc...

Compute the correlation matrix for the input dataset of Vectors using the specified method.
Methods currently supported: `pearson` (default), `spearman`.
.. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
Contributor

Sorry, I picked up that the doc gen will fail here - there needs to be 2 spaces before the start of each subsequent line, like this:

.. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
  and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
  which is fairly costly. Cache the input Dataset before calling corr with `method = 'spearman'`
  to avoid recomputing the common lineage.
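
For context, a small sketch of how such a note might sit inside a Python docstring. `corr_sketch` is a hypothetical stand-in, not the method added by this patch. The docstring text is taken from the snippets in this thread, and the check at the end only confirms that the continuation lines keep their two-space indent once the docstring's common indentation is stripped.

```python
import inspect


def corr_sketch(dataset, column, method='pearson'):
    """
    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    Methods currently supported: `pearson` (default), `spearman`.

    .. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
      and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
      which is fairly costly. Cache the input Dataset before calling corr with `method = 'spearman'`
      to avoid recomputing the common lineage.
    """
    # Documentation-only stand-in; it performs no computation.
    raise NotImplementedError("documentation-only sketch")


# cleandoc removes the indentation shared by all docstring lines, leaving only
# the extra indent that marks the lines belonging to the note.
doc_lines = inspect.cleandoc(corr_sketch.__doc__).splitlines()
note_start = next(i for i, line in enumerate(doc_lines) if line.startswith('.. note::'))
continuation = doc_lines[note_start + 1:]
continuation_is_indented = all(line.startswith('  ') for line in continuation)
```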

Member Author

ah. ok. fixed. see if this time it's ok.

@SparkQA

SparkQA commented Apr 6, 2017

Test build #75568 has finished for PR 17494 at commit fd76901.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Apr 6, 2017

LGTM pending Jenkins confirming.

@SparkQA

SparkQA commented Apr 6, 2017

Test build #75564 has finished for PR 17494 at commit 5d9d70f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Apr 6, 2017

Thanks @MLnick

@SparkQA

SparkQA commented Apr 6, 2017

Test build #75569 has finished for PR 17494 at commit 5d04326.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

>>> dataset = spark.createDataFrame(dataset, ['features'])
>>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
>>> print(str(pearsonCorr).replace('nan', 'NaN'))
DenseMatrix([[ 1. , 0.05564149, NaN, 0.40047142],
Contributor

So maybe I'm being overly cautious - but doctests with floats have bitten me in the past - would it be good to use the ... syntax here or is this going to be ok? (Just asking).

Contributor

Fair point - it may lead to flaky tests I guess at some point.

Member Author

Although we have many tests in pyspark now with floats like this, it is a fair point, I agree.
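
One reading of the "... syntax" suggestion is doctest's ELLIPSIS option, sketched below with the standard library only. `run_doctest` is a hypothetical helper, and the expression is a stand-in chosen to print a value close to the 0.05564149 entry in the matrix above. The sketch is not the doctest from this patch.

```python
import doctest


def run_doctest(source, name):
    """Run the doctest examples found in `source`; returns (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(source, {}, name, '<sketch>', 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts matter for this sketch.
    return runner.run(test, out=lambda text: None)


# Expected output copied with eight decimals: the comparison is textual, so it
# fails because the float prints with more digits than that.
rounded_source = """
>>> 2 / (34 * 38) ** 0.5
0.05564149
"""

# With ELLIPSIS, '...' matches any trailing digits, so only the leading digits
# written out are compared.
ellipsis_source = """
>>> 2 / (34 * 38) ** 0.5  # doctest: +ELLIPSIS
0.055641...
"""

strict = run_doctest(rounded_source, 'rounded')
tolerant = run_doctest(ellipsis_source, 'with_ellipsis')
```

The trade-off is that the ellipsis form checks fewer digits, which is usually acceptable for documentation examples.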

@viirya viirya force-pushed the correlation-python-api branch from 1a7b0f9 to 601d9eb Compare April 6, 2017 13:36
@SparkQA

SparkQA commented Apr 6, 2017

Test build #75574 has finished for PR 17494 at commit 601d9eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

LGTM if others are ok too

@viirya
Member Author

viirya commented Apr 7, 2017

Thanks @jkbradley

@holdenk
Contributor

holdenk commented Apr 7, 2017

LGTM as well

@viirya
Member Author

viirya commented Apr 7, 2017

Thanks @holdenk

@MLnick
Contributor

MLnick commented Apr 7, 2017

Merged to master. Thanks!

@asfgit asfgit closed this in 1a52a62 Apr 7, 2017
@viirya viirya deleted the correlation-python-api branch December 27, 2023 18:34