[SPARK-5207] [MLLIB] StandardScalerModel mean and variance re-use #4140

ogeagla · 2015-01-21T17:45:35Z

This seems complete. The duplication of tests for provided means/variances might be overkill; I would appreciate some feedback.

AmplabJenkins · 2015-01-21T17:47:09Z

Can one of the admins verify this patch?

mengxr · 2015-01-21T17:59:38Z

ok to test

mengxr · 2015-01-21T17:59:51Z

cc @dbtsai

SparkQA · 2015-01-21T18:02:30Z

Test build #25895 has started for PR 4140 at commit 64408a4.

This patch merges cleanly.

SparkQA · 2015-01-21T19:11:07Z

Test build #25895 has finished for PR 4140 at commit 64408a4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-21T19:11:10Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25895/
Test PASSed.

dbtsai · 2015-01-23T22:25:08Z

mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala

The default argument is not friendly for Java though; why don't we add another constructor which takes only mean and variance?

Also, users will want to know whether withMean or withStd is used. Do we really need to keep them as private variables?
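For illustration, here is a minimal sketch of the auxiliary-constructor approach suggested above. Scala default arguments cannot be omitted by Java callers, whereas an explicit overload can be called directly. The class name `ScalerModelSketch`, the use of `Array[Double]` in place of MLlib's `Vector`, and the chosen default flags are assumptions made for this sketch, not the code under review.

```scala
// Hypothetical stand-in for the model under review.
// Array[Double] replaces MLlib's Vector so the sketch has no Spark dependency.
class ScalerModelSketch(
    val mean: Array[Double],
    val variance: Array[Double],
    val withMean: Boolean,
    val withStd: Boolean) {

  // Auxiliary constructor taking only mean and variance.
  // Java code can call it as `new ScalerModelSketch(mean, variance)`.
  // The flag values below are assumed defaults for this sketch.
  def this(mean: Array[Double], variance: Array[Double]) =
    this(mean, variance, false, true)
}
```

Exposing `withMean` and `withStd` as public `val`s, as done here, would also let callers inspect which transformations are enabled.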

dbtsai · 2015-01-23T23:04:58Z

For the unit-test part, is it possible not to change too much? Also, it will be easier to debug if the assertions are in the test itself instead of being abstracted out. For example, the validateConstant function is not necessary; it is probably easier to read if all the assert code is in the test.

Having the data as global variables is okay for me.

Thanks.

SparkQA · 2015-01-24T03:02:41Z

Test build #26038 has started for PR 4140 at commit 997d2e0.

This patch merges cleanly.

ogeagla · 2015-01-24T03:03:22Z

@dbtsai that makes sense. I've changed this back in latest commit.

SparkQA · 2015-01-24T04:11:31Z

Test build #26038 has finished for PR 4140 at commit 997d2e0.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StandardScalerModel (

AmplabJenkins · 2015-01-24T04:11:36Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26038/
Test PASSed.

dbtsai · 2015-01-27T00:14:15Z

mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala

Move the require to the bottom of this constructor. Also add the @DeveloperApi annotation to both the setWithMean and setWithStd APIs.

I have a question about this API. If the default withMean is false, why do we require mean in the constructor? If the feature dimension is really large, this adds a cost that cannot be ignored. Similarly, should we take std directly instead of variance in the constructor? My proposal is the following:

StandardScalerModel(std: Vector, mean: Vector, withStd: Boolean, withMean: Boolean). I put variance in front of mean because scaling is used more frequently than shifting.

this(std: Vector, mean: Vector): enable withMean and withStd based on whether the input arguments are null or not. Throw an exception if both are null.

this(std: Vector) = this(std, null).

setWithMean and setWithStd check whether the corresponding mean/variance is null and throw an exception if a user wants to set the flag to true while the value is null.

Sounds reasonable to me. Although the changes will be larger, this will be handier and will save space when withMean is not used.

@mengxr Just to make sure I'm clear, are you suggesting changing the StandardScalerModel to take the standard deviation vector (instead of variance)? Or are you just calling it 'std' for short?

In my opinion, taking variance will be ideal since it's the output of MultivariateOnlineSummarizer.
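The constructor proposal discussed above can be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: `ProposedScalerModel` is a hypothetical name, `Array[Double]` stands in for MLlib's `Vector`, and the exception messages are invented for the example.

```scala
// Hypothetical sketch of the proposed API shape.
// Array[Double] (nullable) stands in for MLlib's Vector.
class ProposedScalerModel(
    val std: Array[Double],
    val mean: Array[Double],
    private var _withStd: Boolean,
    private var _withMean: Boolean) {

  // Reject a model with neither vector, and flags that lack their vector.
  require(std != null || mean != null, "std and mean cannot both be null")
  require(!_withStd || std != null, "withStd requires a non-null std")
  require(!_withMean || mean != null, "withMean requires a non-null mean")

  // Enable each transformation only when its vector was supplied.
  def this(std: Array[Double], mean: Array[Double]) =
    this(std, mean, std != null, mean != null)

  // Scaling only: no mean vector needs to be allocated.
  def this(std: Array[Double]) = this(std, null)

  def withStd: Boolean = _withStd
  def withMean: Boolean = _withMean

  def setWithMean(value: Boolean): this.type = {
    require(!(value && mean == null), "cannot enable withMean: mean is null")
    _withMean = value
    this
  }

  def setWithStd(value: Boolean): this.type = {
    require(!(value && std == null), "cannot enable withStd: std is null")
    _withStd = value
    this
  }
}
```

With this shape, `new ProposedScalerModel(std)` yields a scale-only model, and calling `setWithMean(true)` on it fails with an `IllegalArgumentException` because no mean was provided.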

dbtsai · 2015-01-27T00:18:56Z

LGTM except those two minor details. Thanks.

SparkQA · 2015-01-28T12:52:49Z

Test build #26224 has started for PR 4140 at commit 9078fe0.

This patch merges cleanly.

SparkQA · 2015-01-28T12:53:50Z

Test build #26224 has finished for PR 4140 at commit 9078fe0.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StandardScalerModel (

AmplabJenkins · 2015-01-28T12:53:51Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26224/
Test FAILed.

ogeagla · 2015-01-28T12:57:21Z

@mengxr I've incorporated your comments.

@dbtsai I ran lint-scala and it fails due to files being too big, and that's what's failing the build. Shall I separate the StandardScaler and StandardScalerModel impl and tests into separate files?

Also, I added a tiny bit to the docs. I would be happy to add more if you think it's appropriate.

mengxr · 2015-01-28T17:51:21Z

mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala

line too wide. the limit is 100 chars.

mengxr · 2015-01-28T17:53:04Z

@ogeagla You can see the error messages from the Jenkins build results (https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26224/), where it shows

[error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala:29: File line length exceeds 100 characters
[error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala:78: File line length exceeds 100 characters
[error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala:79: File line length exceeds 100 characters
[error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala:93: File line length exceeds 100 characters
[error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala:117: File line length exceeds 100 characters
[error] (mllib/compile:scalastyle) errors exist
[error] Total time: 5 s, completed Jan 28, 2015 4:53:33 AM

So the problem is the line length, not the file length.
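As a rough local check before pushing, the 100-character limit can be tested with a few lines of plain Scala. This is only a sketch of the rule itself, not how scalastyle implements it; `LineLengthSketch` and `tooLong` are hypothetical names.

```scala
object LineLengthSketch {
  // Returns the 1-based numbers of all lines longer than `limit` characters.
  def tooLong(lines: Seq[String], limit: Int = 100): Seq[Int] =
    lines.zipWithIndex.collect {
      case (line, i) if line.length > limit => i + 1
    }
}
```

Passing the lines of a source file to `tooLong` gives the same kind of line numbers that appear in the error messages above.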

mengxr · 2015-01-28T18:01:02Z

mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala

Back to the discussion on using std instead of variance: essentially, we are computing 1/std here. If we want to save the model, we would certainly prefer saving the factor to saving the variance, so that we don't need to recompute it when loading the model back. It also saves storage. If we have std, we don't need to create the factor, and the transformation can be done on the fly with std. The overhead is just an if check.

I agree with that; I can make those changes. An additional one-time overhead is computing the sqrt of the variance from the summary in StandardScaler.fit, to pass to the StandardScalerModel constructor.
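To make the trade-off concrete, here is a minimal sketch of both pieces: the one-time square root in fit, and scaling on the fly with std guarded by a single check. `StdScalingSketch` and its methods are hypothetical, `Array[Double]` stands in for MLlib's `Vector`, and mapping zero-std features to 0.0 is an assumption of this example.

```scala
// Hypothetical helpers, free of any Spark dependency.
object StdScalingSketch {
  // One-time cost at fit time: convert the summarizer's variance to std.
  def stdFromVariance(variance: Array[Double]): Array[Double] =
    variance.map(v => math.sqrt(v))

  // Scale on the fly with std, without materializing a 1/std factor vector.
  // The only extra per-element work is the zero check.
  // Assumption: features with zero std are mapped to 0.0.
  def scale(values: Array[Double], std: Array[Double]): Array[Double] = {
    require(values.length == std.length, "dimension mismatch")
    val out = new Array[Double](values.length)
    var i = 0
    while (i < values.length) {
      out(i) = if (std(i) != 0.0) values(i) / std(i) else 0.0
      i += 1
    }
    out
  }
}
```

Storing std means a saved model can be loaded and applied directly, with no square roots recomputed at load time.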

SparkQA · 2015-02-01T11:37:50Z

Test build #26472 has started for PR 4140 at commit fa64dfa.

This patch merges cleanly.

SparkQA · 2015-02-01T12:48:58Z

Test build #26472 has finished for PR 4140 at commit fa64dfa.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StandardScalerModel (

AmplabJenkins · 2015-02-01T12:49:01Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26472/
Test PASSed.

mengxr · 2015-02-01T17:21:31Z

LGTM. Merged into master. Thanks!

[SPARK-5207] [MLLIB] [WIP] change StandardScalerModel contructor to not be private to mllib, added tests for newly-exposed functionality

64408a4

dbtsai reviewed Jan 23, 2015

[SPARK-5207] [MLLIB] [WIP] make withMean and withStd public, add constructor which uses defaults, un-refactor test class

997d2e0

dbtsai reviewed Jan 27, 2015

[SPARK-5207] [MLLIB] [WIP] Incorporate code review feedback: change arg ordering, add dev api annotations, do better null checking, add another test and some doc for this.

9078fe0

mengxr reviewed Jan 28, 2015

[SPARK-5207] [MLLIB] [WIP] change StandardScalerModel to take stddev instead of variance

fa64dfa

asfgit closed this in bdb0680 Feb 1, 2015
