[SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD and PCA #7963

MechCoder · 2015-08-05T15:44:38Z

Singular Value Decomposition wrappers are missing in PySpark. Since the base for a RowMatrix has already been laid, writing the wrappers is straightforward. I will follow up with the PCA wrappers in another PR.

MechCoder · 2015-08-05T16:13:31Z

Actually I'll add the PCA wrappers in this PR as well.

SparkQA · 2015-08-05T18:16:50Z

Test build #39888 has finished for PR 7963 at commit 25999f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-05T18:48:00Z

Test build #39895 has finished for PR 7963 at commit a65efbb.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class SingularValueDecomposition(JavaModelWrapper):

SparkQA · 2015-08-05T19:23:40Z

Test build #39901 has finished for PR 7963 at commit f64a83f.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class SingularValueDecomposition(JavaModelWrapper):
- case class In(value: Expression, list: Seq[Expression]) extends Predicate
- case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate

SparkQA · 2015-08-05T20:00:26Z

Test build #39904 has finished for PR 7963 at commit 2286bfd.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class SingularValueDecomposition(JavaModelWrapper):
- case class In(value: Expression, list: Seq[Expression]) extends Predicate
- case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate

MechCoder · 2015-08-05T20:58:22Z

All right, this PR is ready for review.

cc: @dusenberrymw @mengxr

SparkQA · 2015-08-05T21:22:44Z

Test build #39916 has finished for PR 7963 at commit 30ef817.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class SingularValueDecomposition(JavaModelWrapper):
- case class In(value: Expression, list: Seq[Expression]) extends Predicate
- case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate

dusenberrymw · 2015-08-07T19:17:33Z

docs/mllib-dimensionality-reduction.md

Minor: This import isn't needed.

dusenberrymw · 2015-08-07T19:35:34Z

Great work, @MechCoder! I left some very small comments, and otherwise it looks good.

MechCoder · 2015-08-10T07:47:20Z

Thanks for the reviews, I have addressed your comments. Do you have anything else?

SparkQA · 2015-08-10T08:10:13Z

Test build #40286 has finished for PR 7963 at commit c62e622.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class SingularValueDecomposition(JavaModelWrapper):

dusenberrymw · 2015-08-12T18:56:11Z

python/pyspark/mllib/linalg/distributed.py

I'd check that matrix is a DenseMatrix here as well.

dusenberrymw · 2015-08-12T18:57:26Z

@MechCoder I'd just add the DenseMatrix checks, and then this will be great. Thanks!
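A minimal sketch of the kind of guard being requested, for readers following along. The `DenseMatrix` class below is a hypothetical stand-in for `pyspark.mllib.linalg.DenseMatrix`, and `require_dense_matrix` is an illustrative helper name; neither the exception type nor the message is taken from the patch.

```python
class DenseMatrix:
    """Hypothetical stand-in for pyspark.mllib.linalg.DenseMatrix."""

    def __init__(self, num_rows, num_cols, values):
        self.numRows = num_rows
        self.numCols = num_cols
        self.values = list(values)


def require_dense_matrix(matrix):
    """Return `matrix` unchanged if it is a DenseMatrix, else raise.

    Illustrative helper (an assumption, not from the PR): fail fast in
    Python with a readable message instead of surfacing a JVM-side error.
    """
    if not isinstance(matrix, DenseMatrix):
        raise TypeError(
            "expected a DenseMatrix, got %s" % type(matrix).__name__
        )
    return matrix
```

Calling such a helper at the top of a method like `multiply` would reject plain lists or NumPy arrays before anything is sent to the JVM.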

…ted Linear Algebra Classes

This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:

* `RowMatrix` **[1]**
  1. `computeGramianMatrix`
  2. `computeCovariance`
  3. `computeColumnSummaryStatistics`
  4. `columnSimilarities`
  5. `tallSkinnyQR` **[2]**
* `IndexedRowMatrix` **[3]**
  1. `computeGramianMatrix`
* `CoordinateMatrix`
  1. `transpose`
* `BlockMatrix`
  1. `validate`
  2. `cache`
  3. `persist`
  4. `transpose`

**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR apache#7963 for SPARK-6227.

**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue, likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR. Therefore, I have marked this PR as WIP and am open to discussion.

**[3]**: Note: `multiply` and `computeSVD` are already part of PR apache#7963 for SPARK-6227.

Author: Mike Dusenberry <[email protected]>

Closes apache#9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.

cavaunpeu · 2016-05-20T19:10:56Z

Any progress on this, @dusenberrymw @MechCoder? It would be really helpful if I could do matrix multiplication in PySpark.

SparkQA · 2016-05-27T00:23:27Z

Test build #59437 has finished for PR 7963 at commit 70a871d.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

MechCoder · 2016-05-27T01:28:53Z

@cavaunpeu Thanks for the ping! I think I've addressed the pending diff comment.

It will take me some time to refresh my knowledge of the codebase. Can @MLnick or @holdenk give a final pass?

SparkQA · 2016-05-27T01:48:19Z

Test build #59445 has finished for PR 7963 at commit 0bc6a3c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MechCoder · 2016-06-13T22:08:10Z

Bump?

MLnick · 2016-06-17T09:47:14Z

@MechCoder thanks for updating this - may need to wait until after 2.0 release for review.

holdenk · 2016-10-07T21:03:51Z

Now that it's past the 2.0 release, should we maybe take another look @MLnick / @davies?

holdenk

Thanks for working on this and sorry this fell through the cracks post 2.0. I've left some initial comments - likely the same comments apply to the indexed one as well.

holdenk · 2016-10-07T21:19:36Z

docs/mllib-dimensionality-reduction.md


+</div>
+<div data-lang="python" markdown="1">
+{% highlight python %}


Nowadays we tend to write new examples separately and then use the include example syntax to bring them in.

holdenk · 2016-10-07T21:19:56Z

docs/mllib-dimensionality-reduction.md

+
+The following code demonstrates how to compute principal components on a `RowMatrix`
+and use them to project the vectors into a low-dimensional space.
+


Same as above
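For readers unfamiliar with what the quoted doc passage describes, here is a rough pure-Python sketch of "compute a principal component, then project the rows onto it". It is not the PySpark implementation and does not use the wrappers from this PR: the function names are illustrative, and power iteration is swapped in as a simple stand-in for a full eigendecomposition (it assumes the start vector is not orthogonal to the leading eigenvector and only recovers the top component).

```python
import math


def top_principal_component(rows, iterations=100):
    """Estimate the leading principal component of `rows` (list of lists)."""
    n = len(rows)
    d = len(rows[0])
    # Center each column, then build the sample covariance matrix.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply cov and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project(rows, component):
    """Project each row onto a single component (a 1-D embedding)."""
    return [sum(x * c for x, c in zip(r, component)) for r in rows]
```

With points lying on the line y = 2x, such as `[[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]`, the estimated component points along (1, 2) normalized to unit length.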

holdenk · 2016-10-07T21:21:45Z

python/pyspark/mllib/linalg/distributed.py

        R = decomp.call("R")
        return QRDecomposition(Q, R)

+    def computeSVD(self, k, computeU=False, rCond=1e-9):


Would be good to add a since annotation here
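To illustrate what a since annotation does, here is a simplified sketch of a decorator that appends a Sphinx `versionadded` note to a docstring. This is a hypothetical stand-in written for this thread, not PySpark's own `since` implementation, and the version string `X.Y.Z` is a placeholder rather than the release this method shipped in.

```python
def since(version):
    """Hypothetical simplified decorator: append a versionadded note."""

    def decorator(func):
        # Keep the existing docstring and add the Sphinx directive after it.
        doc = (func.__doc__ or "").rstrip()
        func.__doc__ = doc + "\n\n.. versionadded:: %s" % version
        return func

    return decorator


@since("X.Y.Z")  # placeholder version, not the real release number
def computeSVD(k, computeU=False, rCond=1e-9):
    """Placeholder body; only the docstring handling is illustrated."""
    raise NotImplementedError
```

The signature mirrors the one in the diff above; the generated API docs would then show the version alongside the method description.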

holdenk · 2016-10-07T21:23:22Z

python/pyspark/mllib/linalg/distributed.py

+        For more specific details on implementation, please refer
+        the scala documentation.
+
+        :param k: Set the number of singular values to keep.


It might be good to copy the longer description from RowMatrix for the k param

holdenk · 2016-10-07T21:24:04Z

python/pyspark/mllib/linalg/distributed.py

+
+    def computePrincipalComponents(self, k):
+        """
+        Computes the k principal components of the given row matrix


It might be good to copy the warnings from RowMatrix here as well.

holdenk · 2016-10-07T21:25:11Z

python/pyspark/mllib/linalg/distributed.py

+
+
+class SingularValueDecomposition(JavaModelWrapper):
+    """Wrapper around the SingularValueDecomposition scala case class"""


Probably add a versionAdded

MechCoder · 2016-10-11T15:47:15Z

Thanks for the reviews, @holdenk. Unfortunately, I will not be able to work on this anytime soon. Feel free to cherry-pick the commits if you wish.

holdenk · 2016-10-14T01:48:56Z

@MechCoder Thanks! I'll look around and see if anyone else is interested in taking this over and bringing it to the finish line; otherwise I'll pick it up myself after OSCON :)

HyukjinKwon · 2017-02-09T12:25:26Z

Ping @MechCoder, are you able to proceed with this PR and address the comments above? If not, it might be good to close this for now.

MLnick · 2017-04-12T10:00:02Z

Note I revived this at #17621 based on @MechCoder's work.

…CA (v2)

Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only).

Based on apache#7963, updated.

## How was this patch tested?

New doc tests and unit tests. Ran all examples locally.

Author: MechCoder <[email protected]>
Author: Nick Pentreath <[email protected]>

Closes apache#17621 from MLnick/SPARK-6227-pyspark-svd-pca.

…CA (v2)

Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only).

Based on #7963, updated.

## How was this patch tested?

New doc tests and unit tests. Ran all examples locally.

Author: MechCoder <[email protected]>
Author: Nick Pentreath <[email protected]>

Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca.

(cherry picked from commit db2fb84)
Signed-off-by: Nick Pentreath <[email protected]>

SixAlien3 · 2017-06-22T16:35:19Z

@MLnick Hi, I'm interested in this PySpark wrapper for SVD. How many columns can it support? I see in the old documentation that it can only support fewer than 1000 columns. Does the same limit apply to this wrapper?

MechCoder changed the title ~~[SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD~~ [SPARK-6227] [WIP] [MLlib] [PySpark] Implement PySpark wrappers for SVD Aug 5, 2015

MechCoder force-pushed the svd_pyspark branch 3 times, most recently from 56978ae to f64a83f Compare August 5, 2015 19:07

MechCoder force-pushed the svd_pyspark branch from f64a83f to 2286bfd Compare August 5, 2015 19:37

MechCoder changed the title ~~[SPARK-6227] [WIP] [MLlib] [PySpark] Implement PySpark wrappers for SVD~~ [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD Aug 5, 2015

MechCoder changed the title ~~[SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD~~ [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD and PCA Aug 5, 2015

MechCoder force-pushed the svd_pyspark branch from 30ef817 to c62e622 Compare August 10, 2015 07:46

dusenberrymw mentioned this pull request Nov 3, 2015

[SPARK-9656] [MLlib] [Python] Add missing methods to PySpark's Distributed Linear Algebra Classes #9441

Closed

MechCoder added 6 commits May 26, 2016 20:17

[SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD

6248d0e

Add PCA Wrappers

921d5b6

Added docs

5558fa9

Add support for multiply and computeSVD in IRM

50ed700

Added tests

59f53d5

minor changes to doc

70a871d

MechCoder force-pushed the svd_pyspark branch from c62e622 to 70a871d Compare May 27, 2016 00:21

Add check for DenseMatrix

0bc6a3c

HyukjinKwon mentioned this pull request Feb 15, 2017

[BUILD] Close stale PRs #16937

Closed

asfgit closed this in ed338f7 Feb 17, 2017

MLnick mentioned this pull request Apr 12, 2017

[SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2) #17621

Closed

MechCoder deleted the svd_pyspark branch May 1, 2017 04:05



8 participants