@MechCoder
Contributor

Singular Value Decomposition wrappers are missing in PySpark. Since the base for a RowMatrix has been laid, writing the wrappers becomes straightforward. I will follow up with the PCA wrappers in another PR.
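
For readers unfamiliar with what an SVD call hands back: a rank-k decomposition of a matrix A yields factors U, s and V with A ≈ U diag(s) Vᵀ. The snippet below is a plain-Python sketch of that relationship for the leading singular triplet only, using power iteration on AᵀA. It is not Spark code and not the implementation in this PR; the helper names (`matvec`, `top_singular_triplet`) and the sample matrix are assumptions made for illustration.

```python
import math


def matvec(a, x):
    """Multiply matrix a (a list of rows) by vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def transpose(a):
    """Return the transpose of matrix a as a list of rows."""
    return [list(col) for col in zip(*a)]


def norm(x):
    """Euclidean norm of vector x."""
    return math.sqrt(sum(v * v for v in x))


def top_singular_triplet(a, iterations=200):
    """Return (u, s, v) for the largest singular value of a.

    Hypothetical helper: runs power iteration on A^T A, assuming the
    starting vector is not orthogonal to the leading right singular vector
    and that the largest singular value is nonzero.
    """
    at = transpose(a)
    n = len(a[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))
        length = norm(w)
        v = [x / length for x in w]
    av = matvec(a, v)
    s = norm(av)               # singular value: |A v|
    u = [x / s for x in av]    # left singular vector: A v / s
    return u, s, v


# Illustrative 3x2 matrix whose singular values are 3 and 2.
A = [[3.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
u, s, v = top_singular_triplet(A)
```

The returned triplet satisfies A v = s u, which is the column-by-column form of A V = U diag(s).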

@MechCoder MechCoder changed the title [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD [SPARK-6227] [WIP] [MLlib] [PySpark] Implement PySpark wrappers for SVD Aug 5, 2015
@MechCoder
Contributor Author

Actually I'll add the PCA wrappers in this PR as well.

@SparkQA

SparkQA commented Aug 5, 2015

Test build #39888 has finished for PR 7963 at commit 25999f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 5, 2015

Test build #39895 has finished for PR 7963 at commit a65efbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SingularValueDecomposition(JavaModelWrapper):

@MechCoder MechCoder force-pushed the svd_pyspark branch 3 times, most recently from 56978ae to f64a83f Compare August 5, 2015 19:07
@SparkQA

SparkQA commented Aug 5, 2015

Test build #39901 has finished for PR 7963 at commit f64a83f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SingularValueDecomposition(JavaModelWrapper):
    • case class In(value: Expression, list: Seq[Expression]) extends Predicate
    • case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate

@SparkQA

SparkQA commented Aug 5, 2015

Test build #39904 has finished for PR 7963 at commit 2286bfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SingularValueDecomposition(JavaModelWrapper):
    • case class In(value: Expression, list: Seq[Expression]) extends Predicate
    • case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate

@MechCoder MechCoder changed the title [SPARK-6227] [WIP] [MLlib] [PySpark] Implement PySpark wrappers for SVD [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD Aug 5, 2015
@MechCoder
Contributor Author

All right, this PR is ready for review.

cc: @dusenberrymw @mengxr

@MechCoder MechCoder changed the title [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD and PCA Aug 5, 2015
@SparkQA

SparkQA commented Aug 5, 2015

Test build #39916 has finished for PR 7963 at commit 30ef817.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SingularValueDecomposition(JavaModelWrapper):
    • case class In(value: Expression, list: Seq[Expression]) extends Predicate
    • case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate

Contributor

Minor: This import isn't needed.

@dusenberrymw
Contributor

Great work, @MechCoder! I left some very small comments, and otherwise it looks good.

@MechCoder
Contributor Author

Thanks for the reviews; I have addressed your comments. Do you have anything else?

@SparkQA

SparkQA commented Aug 10, 2015

Test build #40286 has finished for PR 7963 at commit c62e622.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SingularValueDecomposition(JavaModelWrapper):

Contributor

I'd check that matrix is a DenseMatrix here as well.
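
A sketch of the kind of guard this comment asks for. `DenseMatrix` below is a stand-in class so the snippet runs without Spark, and `require_dense_matrix` and the choice of `ValueError` are assumptions for illustration, not the code in this PR.

```python
class DenseMatrix:
    """Stand-in for a local dense matrix type (hypothetical, Spark-free)."""

    def __init__(self, num_rows, num_cols, values):
        self.numRows = num_rows
        self.numCols = num_cols
        self.values = list(values)


def require_dense_matrix(matrix):
    """Fail early on the Python side if the argument is not a DenseMatrix."""
    if not isinstance(matrix, DenseMatrix):
        raise ValueError(
            "Expected a DenseMatrix, got %s" % type(matrix).__name__)
    return matrix


# A 2x2 identity passes the check and is returned unchanged.
identity = require_dense_matrix(DenseMatrix(2, 2, [1.0, 0.0, 0.0, 1.0]))
```

Checking the type before handing the argument to the JVM side would surface a readable Python error instead of a Java stack trace.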

@dusenberrymw
Contributor

@MechCoder I'd just add the DenseMatrix checks, and then this will be great. Thanks!

ghost pushed a commit to dbtsai/spark that referenced this pull request Apr 27, 2016
…ted Linear Algebra Classes

This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:

* `RowMatrix` <sup>**[1]**</sup>
  1. `computeGramianMatrix`
  2. `computeCovariance`
  3. `computeColumnSummaryStatistics`
  4. `columnSimilarities`
  5. `tallSkinnyQR` <sup>**[2]**</sup>
* `IndexedRowMatrix` <sup>**[3]**</sup>
  1. `computeGramianMatrix`
* `CoordinateMatrix`
  1. `transpose`
* `BlockMatrix`
  1. `validate`
  2. `cache`
  3. `persist`
  4. `transpose`

**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR apache#7963 for SPARK-6227.
**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.  However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR.  Therefore, I have marked this PR as WIP and am open to discussion.
**[3]**: Note: `multiply` and `computeSVD` are already part of PR apache#7963 for SPARK-6227.

Author: Mike Dusenberry

Closes apache#9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
@cavaunpeu

Any progress on this, @dusenberrymw @MechCoder? It would be really helpful if I could do matrix multiplication in PySpark.

@SparkQA

SparkQA commented May 27, 2016

Test build #59437 has finished for PR 7963 at commit 70a871d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

@cavaunpeu Thanks for the ping! I think I've addressed the pending diff comment.

It will take me some time to refresh my knowledge of the codebase. Can @MLnick or @holdenk give a final pass?

@SparkQA

SparkQA commented May 27, 2016

Test build #59445 has finished for PR 7963 at commit 0bc6a3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

Bump?

@MLnick
Contributor

MLnick commented Jun 17, 2016

@MechCoder thanks for updating this - may need to wait until after 2.0 release for review.

@holdenk
Contributor

holdenk commented Oct 7, 2016

Now that it's past the 2.0 release, should we maybe take another look, @MLnick / @davies?

Contributor

@holdenk holdenk left a comment

Thanks for working on this and sorry this fell through the cracks post 2.0. I've left some initial comments - likely the same comments apply to the indexed one as well.


</div>
<div data-lang="python" markdown="1">
{% highlight python %}
Contributor

Nowadays we tend to write new examples separately and then use the `include_example` syntax to bring them in.


The following code demonstrates how to compute principal components on a `RowMatrix`
and use them to project the vectors into a low-dimensional space.

Contributor

Same as above

R = decomp.call("R")
return QRDecomposition(Q, R)

def computeSVD(self, k, computeU=False, rCond=1e-9):
Contributor

Would be good to add a since annotation here

For more specific details on implementation, please refer to
the Scala documentation.
:param k: Set the number of singular values to keep.
Contributor

It might be good to copy the longer description from RowMatrix for the k param


def computePrincipalComponents(self, k):
"""
Computes the k principal components of the given row matrix
Contributor

It might be good to copy the warnings from RowMatrix here as well.



class SingularValueDecomposition(JavaModelWrapper):
"""Wrapper around the SingularValueDecomposition scala case class"""
Contributor

Probably add a versionAdded

@MechCoder
Contributor Author

Thanks for the reviews, @holdenk. Unfortunately, I will not be able to work on this anytime soon. Feel free to cherry-pick the commits if you wish.

@holdenk
Contributor

holdenk commented Oct 14, 2016

@MechCoder Thanks! I'll look around and see if anyone else is interested in taking this over and bringing it to the finish line; otherwise, I'll pick it up myself after OSCON :)

@HyukjinKwon
Member

Ping @MechCoder, are you able to proceed with this PR and address the comments above? If not, it might be good to close this for now.

@MLnick
Contributor

MLnick commented Apr 12, 2017

Note I revived this at #17621 based on @MechCoder's work.

@MechCoder MechCoder deleted the svd_pyspark branch May 1, 2017 04:05
ghost pushed a commit to dbtsai/spark that referenced this pull request May 3, 2017
…CA (v2)

Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only).

Based on apache#7963, updated.

## How was this patch tested?

New doc tests and unit tests. Ran all examples locally.

Author: MechCoder
Author: Nick Pentreath

Closes apache#17621 from MLnick/SPARK-6227-pyspark-svd-pca.
asfgit pushed a commit that referenced this pull request May 3, 2017
…CA (v2)

Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only).

Based on #7963, updated.

## How was this patch tested?

New doc tests and unit tests. Ran all examples locally.

Author: MechCoder
Author: Nick Pentreath

Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca.

(cherry picked from commit db2fb84)
Signed-off-by: Nick Pentreath
@SixAlien3

@MLnick Hi, I'm interested in this PySpark wrapper for SVD. How many columns can it support? I see in the old documentation that it only supports fewer than 1000 columns. How about this wrapper?
