@ogeagla
Contributor

@ogeagla ogeagla commented Jan 21, 2015

This seems complete. The duplication of tests for provided means/variances might be overkill; I would appreciate some feedback.

…ot be private to mllib, added tests for newly-exposed functionality
@AmplabJenkins

Can one of the admins verify this patch?

@mengxr
Contributor

mengxr commented Jan 21, 2015

ok to test

@mengxr
Contributor

mengxr commented Jan 21, 2015

cc @dbtsai

@SparkQA

SparkQA commented Jan 21, 2015

Test build #25895 has started for PR 4140 at commit 64408a4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 21, 2015

Test build #25895 has finished for PR 4140 at commit 64408a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25895/

Member

The default argument is not friendly for Java though; why don't we add another constructor which takes only mean and variance?
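To illustrate the concern: Scala default arguments are not usable as defaults from Java, so a Java caller has to pass every parameter, whereas an auxiliary constructor shows up as an ordinary overload. A minimal sketch, assuming hypothetical class names, `Array[Double]` in place of MLlib's `Vector`, and defaults of `withMean = false`, `withStd = true`:

```scala
// Hypothetical stand-ins for the class under review, not the actual Spark code.
// Array[Double] replaces MLlib's Vector so the sketch compiles without Spark.

// Variant 1: default arguments. Convenient from Scala, but a Java caller
// must spell out all four arguments.
class ScalerModelWithDefaults(
    val mean: Array[Double],
    val variance: Array[Double],
    val withMean: Boolean = false,
    val withStd: Boolean = true)

// Variant 2: explicit overload. Java sees a plain two-argument constructor.
class ScalerModelWithOverload(
    val mean: Array[Double],
    val variance: Array[Double],
    val withMean: Boolean,
    val withStd: Boolean) {

  // Auxiliary constructor that fills in the assumed defaults.
  def this(mean: Array[Double], variance: Array[Double]) =
    this(mean, variance, false, true)
}
```

With the second variant, `new ScalerModelWithOverload(mean, variance)` works the same way from Scala and from Java.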

Member

Also, users will want to know whether withMean or withStd is used. Do we really need to keep them as private variables?

@dbtsai
Member

dbtsai commented Jan 23, 2015

For the unit tests, is it possible to avoid changing too much? Also, it will be easier to debug if the assertions are in the test instead of abstracted out. For example, the validateConstant function is not necessary; it is probably easier to read with all the assert code in the test.

Having the data as global variables is okay with me.

Thanks.

…tructor which uses defaults, un-refactor test class
@SparkQA

SparkQA commented Jan 24, 2015

Test build #26038 has started for PR 4140 at commit 997d2e0.

  • This patch merges cleanly.

@ogeagla
Contributor Author

ogeagla commented Jan 24, 2015

@dbtsai That makes sense. I've changed this back in the latest commit.

@SparkQA

SparkQA commented Jan 24, 2015

Test build #26038 has finished for PR 4140 at commit 997d2e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StandardScalerModel (

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26038/

Member

Move the require to the bottom of this constructor. Also, add the @DeveloperApi annotation to both the setWithMean and setWithStd APIs.

Contributor

I have a question about this API. If the default withMean is false, why do we require mean in the constructor? If the feature dimension is really large, this adds a cost that cannot be ignored. Similarly, should we take std directly instead of variance in the constructor? My proposal is the following:

  • StandardScalerModel(std: Vector, mean: Vector, withStd: Boolean, withMean: Boolean). I put variance in front of mean because scaling is used more frequently than shifting.
  • this(std: Vector, mean: Vector): enable withMean and withStd based on whether the input arguments are null or not. Throw an exception if both are null.
  • this(std: Vector) = this(std, null).
  • setWithMean and setWithStd check whether the corresponding mean/variance is null and throw an exception if a user wants to set the flag to true while the value is null.
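The proposal above could look roughly like the following. This is only a sketch under stated assumptions, not the merged code: the class name `ScalerModelSketch` and the field names `useStd`/`useMean` are hypothetical, and `Array[Double]` stands in for MLlib's `Vector` so the example compiles without Spark.

```scala
// Hypothetical sketch of the proposed constructors and setters.
class ScalerModelSketch(
    val std: Array[Double],
    val mean: Array[Double],
    private var useStd: Boolean,
    private var useMean: Boolean) {

  // Enable each flag only when the matching statistic was supplied.
  // The require comes after the constructor call, as Scala demands.
  def this(std: Array[Double], mean: Array[Double]) = {
    this(std, mean, std != null, mean != null)
    require(std != null || mean != null, "at least one of std or mean must be provided")
  }

  // Scaling only: no mean vector has to be allocated or stored.
  def this(std: Array[Double]) = this(std, null)

  def withStd: Boolean = useStd
  def withMean: Boolean = useMean

  // Reject turning a flag on when the statistic it needs is missing.
  def setWithMean(flag: Boolean): this.type = {
    require(!(flag && mean == null), "cannot set withMean to true while mean is null")
    useMean = flag
    this
  }

  def setWithStd(flag: Boolean): this.type = {
    require(!(flag && std == null), "cannot set withStd to true while std is null")
    useStd = flag
    this
  }
}
```

For example, `new ScalerModelSketch(Array(1.0, 2.0))` yields a model with scaling on and centering off, and calling `setWithMean(true)` on it throws an `IllegalArgumentException`.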

Member

Sounds reasonable to me. Although the changes will be larger, this will be handier and will save space when withMean is not used.

Contributor Author

@mengxr Just to make sure I'm clear, are you suggesting changing the StandardScalerModel to take the standard deviation vector (instead of variance)? Or are you just calling it 'std' for short?

Member

In my opinion, taking variance will be ideal since it's the output of MultivariateOnlineSummarizer.

@dbtsai
Member

dbtsai commented Jan 27, 2015

LGTM except for those two minor details. Thanks.

…rg ordering, add dev api annotations, do better null checking, add another test and some doc for this.
@SparkQA

SparkQA commented Jan 28, 2015

Test build #26224 has started for PR 4140 at commit 9078fe0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 28, 2015

Test build #26224 has finished for PR 4140 at commit 9078fe0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StandardScalerModel (

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26224/

@ogeagla
Contributor Author

ogeagla commented Jan 28, 2015

@mengxr I've incorporated your comments.

@dbtsai I ran lint-scala and it fails due to files being too big, and that's what's failing the build. Shall I separate the StandardScaler and StandardScalerModel impl and tests into separate files?

Also, I added a tiny bit to the docs. I would be happy to add more if you think it's appropriate.

Contributor

Line too wide. The limit is 100 chars.

@mengxr
Contributor

mengxr commented Jan 28, 2015

@ogeagla You can see the error messages from the Jenkins build results (https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26224/), where it shows

[error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala:29: File line length exceeds 100 characters
[error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala:78: File line length exceeds 100 characters
[error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala:79: File line length exceeds 100 characters
[error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala:93: File line length exceeds 100 characters
[error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala:117: File line length exceeds 100 characters
[error] (mllib/compile:scalastyle) errors exist
[error] Total time: 5 s, completed Jan 28, 2015 4:53:33 AM

So it is the line length, not the file length, that fails the style check.

Contributor

Back to the discussion on using std instead of variance: essentially, we are computing 1/std here. If we want to save the model, we certainly prefer saving factor to saving variance, so that we don't need to recompute it when loading the model back. It also saves storage. If we have std, we don't need to create factor, and the transformation can be done on the fly with std. The overhead is just an if check.
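A rough illustration of transforming on the fly with std, with no precomputed factor array. This is a sketch only: the object and method names are hypothetical, `Array[Double]` stands in for MLlib's `Vector`, and mapping a zero std to 0.0 is an assumption of the example.

```scala
// Hypothetical helper, not the Spark implementation.
object StdScaling {
  // Divide each value by its std directly. The only per-element overhead
  // compared with a stored 1/std factor is the zero check.
  def scale(values: Array[Double], std: Array[Double]): Array[Double] = {
    require(values.length == std.length, "dimension mismatch")
    val out = new Array[Double](values.length)
    var i = 0
    while (i < values.length) {
      // Assumption: a feature with zero std is mapped to 0.0.
      out(i) = if (std(i) != 0.0) values(i) / std(i) else 0.0
      i += 1
    }
    out
  }
}
```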

Contributor Author

I agree with that; I can make those changes. An additional one-time overhead is computing the sqrt of the variance from the summary in StandardScaler.fit, so that std can be passed to the StandardScalerModel constructor.
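The one-time conversion mentioned here amounts to an element-wise square root. A minimal sketch, assuming a hypothetical helper name and `Array[Double]` in place of the summarizer's variance `Vector`:

```scala
// Hypothetical helper, not the code in StandardScaler.fit.
object VarianceToStd {
  // One pass over the variance values, done once when fitting.
  def toStd(variance: Array[Double]): Array[Double] =
    variance.map(v => math.sqrt(v))
}
```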

@SparkQA

SparkQA commented Feb 1, 2015

Test build #26472 has started for PR 4140 at commit fa64dfa.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 1, 2015

Test build #26472 has finished for PR 4140 at commit fa64dfa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StandardScalerModel (

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26472/

@mengxr
Contributor

mengxr commented Feb 1, 2015

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in bdb0680 Feb 1, 2015