
Conversation

@psuszyns

What changes were proposed in this pull request?

A new method in PCAModel that auto-trims the model to the minimal number of features, calculated from the required variance to be retained by those features.

How was this patch tested?

Unit tests.

@AmplabJenkins

Can one of the admins verify this patch?

}

def minimalByVarianceExplained(requiredVarianceRetained: Double): PCAModel = {
  val minFeaturesNum = explainedVariance
@srowen (Member) commented Apr 15, 2016


How about `explainedVariance.values.scanLeft(0.0)(_ + _).indexWhere(_ >= requiredVarianceRetained) + 1`? OK, to make that robust you'd have to handle the case where that returns 0 (meaning you need to keep all the PCs, so you could just return this), and also arg-check the required variance to be in [0, 1].
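A minimal sketch of the robust version described above, assuming a free-standing helper (the name minComponentsForVariance is invented here, and the +1 is dropped because the leading 0.0 from scanLeft already offsets the index):

```scala
import org.apache.spark.mllib.linalg.DenseVector

// Hypothetical helper, not the PR's code: smallest number of leading
// components whose cumulative explained variance reaches the target.
def minComponentsForVariance(explainedVariance: DenseVector,
                             requiredVarianceRetained: Double): Int = {
  require(requiredVarianceRetained > 0.0 && requiredVarianceRetained <= 1.0,
    s"required variance must be in (0, 1] but got $requiredVarianceRetained")
  // scanLeft(0.0)(_ + _) yields cumulative sums with a leading 0.0, so the
  // index of the first sum >= threshold equals the number of components.
  val k = explainedVariance.values
    .scanLeft(0.0)(_ + _)
    .indexWhere(_ >= requiredVarianceRetained)
  // indexWhere returns -1 when rounding keeps the total just below the
  // threshold; in that case keep every component.
  if (k == -1) explainedVariance.size else k
}
```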

@sethah (Contributor) commented Apr 15, 2016

It looks like this could just as well be implemented in ML instead of MLlib. It is my understanding that we should avoid adding new features to MLlib unless it's blocking an improvement in ML. That doesn't seem to be the case here.

@psuszyns (Author) commented Apr 15, 2016

@srowen, @sethah: Thank you for the feedback; I've pushed a commit that should address your concerns.

@sethah (Contributor) commented Apr 15, 2016

@psuszyns I have some high-level comments. To me, it does not make sense to train a PCA model keeping k components and then trim it by variance explained. If I have data with 10 columns and I train a PCA model with k = 6 components, I retain some fraction of the variance. If I then request to trim the model by some fraction greater than the variance originally retained, it will be impossible.

I think this should be implemented by having two parameters, k and retainedVariance: the full PCA is trained, and then the model is trimmed by one of the two possible criteria. When you set one of the params, you can automatically unset the other, since it doesn't make sense to use them both (this is done, for example, in Logistic Regression with threshold and thresholds). This would require changing ML and MLlib, which is OK. Perhaps @srowen could provide some thoughts.
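As a rough illustration of the mutually exclusive params idea (modeled on LogisticRegression's threshold/thresholds handling), here is a hypothetical sketch; the trait and param names are assumptions, not the proposed API:

```scala
import org.apache.spark.ml.param.{DoubleParam, IntParam, Params}

// Hypothetical trait: setting one trimming criterion clears the other,
// since using both at once would be ambiguous.
trait PCATrimParams extends Params {
  final val k = new IntParam(this, "k", "number of principal components")
  final val retainedVariance = new DoubleParam(this, "retainedVariance",
    "fraction of total variance to retain, in (0, 1]")

  def setK(value: Int): this.type = {
    clear(retainedVariance)
    set(k, value)
  }

  def setRetainedVariance(value: Double): this.type = {
    clear(k)
    set(retainedVariance, value)
  }
}
```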

@psuszyns (Author)

Well, I wanted to add this option without breaking/amending the current API. In my app I use it by first training the PCA with k = number of features and then calling the method I added. But I agree that it would be nicer to have the 'variance retained' as an input parameter of the PCA. I'll add appropriate setters to the 'ml' version and another constructor to the 'mllib' version, OK?

@sethah (Contributor) commented Apr 15, 2016

I don't believe this will break the API. You can get away without even changing the MLlib API by adding a private constructor, or a private call to the fit method that passes in a retained-variance parameter. Also, it looks like it would be good to update the computePrincipalComponentsAndExplainedVariance method to alternatively work with a retained-variance parameter. Otherwise, you'd have to pass it a value of k equal to the full number of principal components and then trim the result manually. Thoughts?
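One hypothetical shape for that non-breaking route, with the actual fitting logic stubbed out (the Option parameter and the private overload are assumptions, and the class is assumed to live in org.apache.spark.mllib.feature so that private[spark] applies):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

class PCA(val k: Int) {
  /** Existing public entry point; behavior unchanged. */
  def fit(sources: RDD[Vector]): PCAModel =
    fit(sources, requiredVariance = None)

  /** Private overload that spark.ml could call with a variance target:
   *  compute the full decomposition, then trim by k or by variance. */
  private[spark] def fit(sources: RDD[Vector],
                         requiredVariance: Option[Double]): PCAModel =
    ??? // full SVD via RowMatrix, then column trimming; elided in this sketch
}
```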

A commit was pushed: …incipalComponentsAndExplainedVariance instead of PCAModel mutation function)
@psuszyns (Author)

@sethah please review my latest commit. Is it close to what you had in mind?

@sethah (Contributor) commented Apr 27, 2016

@psuszyns This introduces a breaking change to the MLlib API, which we should avoid since it is not strictly necessary. Looking at this more carefully, the simplest way to do this seems to be to add it only in spark.ML: request the full PCA from MLlib, then trim according to retained variance in the spark.ML fit method. I'm not sure we ought to make this available in MLlib at all, given that we could avoid some of the complexity. If we do, we need to do it in a way that does not break the APIs.

Also, please do run the style checker, and see Contributing to Spark for Spark-specific style guidelines.

@srowen @mengxr What do you think about this change?
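A rough sketch of that spark.ML-only route (assuming the code lives under org.apache.spark.ml, since PCAModel's constructor is private[spark]; the object and method names are illustrative, not the PR's code):

```scala
package org.apache.spark.ml.feature

import org.apache.spark.mllib.feature.{PCA => OldPCA, PCAModel => OldPCAModel}
import org.apache.spark.mllib.linalg.{DenseMatrix, DenseVector, Vector}
import org.apache.spark.rdd.RDD

private[ml] object PCAUtil {
  /** Fit the full PCA, then keep only the leading components needed to
   *  reach the retained-variance target. */
  def fitTrimmed(input: RDD[Vector], numFeatures: Int,
                 retainedVariance: Double): OldPCAModel = {
    require(retainedVariance > 0.0 && retainedVariance <= 1.0)
    val full = new OldPCA(numFeatures).fit(input)  // full decomposition
    val cumulative = full.explainedVariance.values.scanLeft(0.0)(_ + _)
    val k = cumulative.indexWhere(_ >= retainedVariance) match {
      case -1 => numFeatures  // rounding kept the total below the target
      case i  => i
    }
    // pc is column-major, so its first k columns are a prefix of `values`.
    val pc = new DenseMatrix(full.pc.numRows, k,
      full.pc.values.take(full.pc.numRows * k))
    val variance = new DenseVector(full.explainedVariance.values.take(k))
    new OldPCAModel(k, pc, variance)
  }
}
```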

*/
@Since("1.6.0")
- def computePrincipalComponentsAndExplainedVariance(k: Int): (Matrix, Vector) = {
+ def computePrincipalComponentsAndExplainedVariance(filter: Either[Int, Double])
@jodersky (Member)

I'm no expert in the ML domain, but from a user perspective, this breaks API backwards compatibility.
An alternative could be to create a new method and factor out common behaviour shared with the current computePrincipalComponentsAndExplainedVariance into a private utility method.

@psuszyns (Author)

@sethah @jodersky It looks like the Since("1.6.0") annotation is wrong, because this method is not available in Spark 1.6; this change was merged to master instead of the 1.6 branch. Do you still consider this change API-breaking, given that it modifies an API that hasn't been released yet? If so, I'll do as @jodersky said and introduce a new method, moving the common code into a new private one. I'd really like to have this feature in the MLlib version because I use it.

@jodersky (Member)

Not sure about the breakage; nevertheless, I would recommend implementing a new method regardless. I find the method's parameter type Either[Int, Double] quite confusing.

@srowen (Member)

Yeah, having one method mean two things via an Either is too strange. At the least, you would provide two overloads. And then, there's no reason to overload versus giving them distinct and descriptive names.

I don't understand the question about unreleased APIs -- 1.6.0 was released a while ago and this method takes an Int parameter there. We certainly want to keep the ability to set a fixed number of principal components.
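To make the distinct-names idea concrete, here is a hypothetical sketch of how RowMatrix could grow a second, descriptively named method next to the existing Int-based one, with the shared trimming factored into a private utility (both new names are invented; the trim body is elided since the cumulative-sum logic is shown earlier in the thread):

```scala
/** Hypothetical new method: keeps the smallest number of leading components
 *  whose cumulative explained variance reaches `requiredVariance`. Assumed
 *  to live in RowMatrix alongside the existing Int-based method. */
def computePrincipalComponentsUpToVariance(
    requiredVariance: Double): (Matrix, Vector) = {
  require(requiredVariance > 0.0 && requiredVariance <= 1.0,
    s"requiredVariance must be in (0, 1] but got $requiredVariance")
  val (pc, variance) =
    computePrincipalComponentsAndExplainedVariance(numCols().toInt)
  trimToVariance(pc, variance, requiredVariance)
}

/** Private utility factored out of both methods, per the review suggestion. */
private def trimToVariance(pc: Matrix, variance: Vector,
                           requiredVariance: Double): (Matrix, Vector) =
  ??? // cumulative-sum trim as sketched earlier in the thread
```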

@psuszyns (Author)

This is RowMatrix as of the 1.6.1 release: https://github.com/apache/spark/blob/15de51c238a7340fa81cb0b80d029a05d97bfc5c/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala, am I correct? If so, can you find a method named computePrincipalComponentsAndExplainedVariance there? I can't, yet on master it is annotated with Since("1.6.0"). Isn't that wrong?

@srowen (Member)

Aha, you're right, it wasn't in 1.6. This is my fault: 21b3d2a. It was never added to branch 1.6, despite the apparent intention. At this point I think it should be considered 2.0+, and you can fix that annotation here. So yes, this method was never 'released'. Still, I think we want to do something different with the argument anyway.

@HyukjinKwon (Member)

Hi @psuszyns, I just wanted to check whether this is still active.

@asfgit closed this in 5d2750a on May 18, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close PRs ...

  - inactive on review comments for more than a month
  - WIP and inactive for more than a month
  - with a Jenkins build failure and inactive for more than a month
  - suggested to be closed, with no comment against that
  - obviously inappropriate (e.g., Branch 0.5)

To make sure, I left a comment on each PR about a week ago, and I did not get a response back from the authors of the PRs below:

Closes apache#11129
Closes apache#12085
Closes apache#12162
Closes apache#12419
Closes apache#12420
Closes apache#12491
Closes apache#13762
Closes apache#13837
Closes apache#13851
Closes apache#13881
Closes apache#13891
Closes apache#13959
Closes apache#14091
Closes apache#14481
Closes apache#14547
Closes apache#14557
Closes apache#14686
Closes apache#15594
Closes apache#15652
Closes apache#15850
Closes apache#15914
Closes apache#15918
Closes apache#16285
Closes apache#16389
Closes apache#16652
Closes apache#16743
Closes apache#16893
Closes apache#16975
Closes apache#17001
Closes apache#17088
Closes apache#17119
Closes apache#17272
Closes apache#17971

Added:
Closes apache#17778
Closes apache#17303
Closes apache#17872

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18017 from HyukjinKwon/close-inactive-prs.