-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-5207] [MLLIB] StandardScalerModel mean and variance re-use #4140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ot be private to mllib, added tests for newly-exposed functionality
|
Can one of the admins verify this patch? |
|
ok to test |
|
cc @dbtsai |
|
Test build #25895 has started for PR 4140 at commit
|
|
Test build #25895 has finished for PR 4140 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default argument is not friendly for Java though; why don't we add another constructor which takes only mean and variance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, users will want to know if withMean or withStd is used, do we really need to have them as private variables?
|
For the unit-test part, is it possible not to change too much? Also, it will be easier to debug if the assertion is in the test instead of abstract out. For example, having Having the data as global variables is okay for me. Thanks. |
…tructor which uses defaults, un-refactor test class
|
Test build #26038 has started for PR 4140 at commit
|
|
@dbtsai that makes sense. I've changed this back in latest commit. |
|
Test build #26038 has finished for PR 4140 at commit
|
|
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving require to the bottom of this constructor. Also add @DeveloperApi annotation to both setWithMean and setWithStd APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a question about this API. If the default withMean is false, why do we require mean in the constructor? If the feature dimension is really large, this puts some extra cost that cannot be ignored. Similarly, should we take std directly instead of variance in the constructor? My proposal is the following:
StandoardScalerModel(std: Vector, mean: Vector, withStd: Boolean, withMean: Boolean). I putvariancein front ofmeanbecause scaling is used more frequently than shifting.this(std: Vector, mean: Vector): enablewithMeanandwithStdbased on whether the input arguments are null or not. Throw exception is both are null.this(std: Vector) = this(std, null).setWithMeanandsetWithStdcheck whether the correspondingmean/varianceis null or not and throw exceptions if a user want to set it to true while the value is null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable for me. Although the changes will be larger, this will be more handy and save extra space if withMean is not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mengxr Just to make sure I'm clear, are you suggesting changing the StandardScalerModel to take the standard deviation vector (instead of variance)? Or are you just calling it 'std' for short?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, taking variance will be ideal since it's the output of MultivariateOnlineSummarizer.
|
LGTM except those two minor details. Thanks. |
…rg ordering, add dev api annotations, do better null checking, add another test and some doc for this.
|
Test build #26224 has started for PR 4140 at commit
|
|
Test build #26224 has finished for PR 4140 at commit
|
|
Test FAILed. |
|
@mengxr I've incorporated your comments. @dbtsai I ran lint-scala and it fails due to files being too big, and that's what's failing the build. Shall I separate the Also, I added a tiny bit to the docs. I would be happy to add more if you think it's appropriate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line too wide. the limit is 100 chars.
|
@ogeagla You can see the error messages from the Jenkins build results (https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26224/), where it shows So it is the line length but not the file length. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back to the discussion on using std instead of variance, Essentially we are computing the 1/std here. If we want to save the model, we certainly prefer saving factor to saving variance, so we don't need to recompute this while loading it back. It also saves storage. If we have std, we don't need to create factor and the transformation can be done on the fly with std. The overhead is just a if check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that, I can make those changes. An additional one-time overhead is the computing of sqrt of the variance from the summary in StandardScaler.fit to provide to the StandardScalerModel constructor.
…instead of variance
|
Test build #26472 has started for PR 4140 at commit
|
|
Test build #26472 has finished for PR 4140 at commit
|
|
Test PASSed. |
|
LGTM. Merged into master. Thanks! |
This seems complete, the duplication of tests for provided means/variances might be overkill, would appreciate some feedback.