[SPARK-11852] [ML] StandardScaler minor refactor #9839
We only need to check mean and std, which are part of the model; withMean and withStd are params.
withMean and withStd of StandardScalerModel must be inherited from StandardScaler, so we cannot construct a StandardScalerModel directly by specifying the two variables. Here we combine the original test cases into one using testEstimatorAndModelReadWrite, which tests both the estimator and the model.
I think this is not an ideal unit test for read/write, because model fitting is already covered by other tests and shouldn't be part of it. Constructing the estimator and model directly can save some test time.
Test build #46330 has finished for PR 9839 at commit
Test build #46333 has finished for PR 9839 at commit
If withMean and withStd are parameters, we should save them only under metadata/, not under both data/ and metadata/. Can we change the constructor of ml.StandardScalerModel to take only std and mean, and construct the scaler only inside transform? Then scaler is no longer a member variable. We can fix performance issues in 1.7.
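To illustrate the shape of that proposal, here is a minimal, dependency-free Python sketch. It is not Spark's implementation: the class below is a toy stand-in, the `save`/`load` helpers and the `paramMap` JSON layout are hypothetical, and mapping a zero std to 0.0 is an assumption. It only shows the split being asked for: std and mean are the sole constructor data, withMean and withStd live with the params, and the scaling logic is derived inside transform instead of being held as a member.

```python
import json
from typing import Dict, List, Optional, Tuple


class StandardScalerModel:
    """Toy stand-in for ml.StandardScalerModel; not Spark's actual class."""

    def __init__(self, std: List[float], mean: List[float],
                 params: Optional[Dict[str, bool]] = None) -> None:
        # Model data: only the fitted statistics are constructor data.
        self.std = list(std)
        self.mean = list(mean)
        # Params: kept apart from the model data (defaults assumed here).
        self.params = {"withMean": False, "withStd": True}
        if params:
            self.params.update(params)

    def transform(self, row: List[float]) -> List[float]:
        # The "scaler" is derived on each call rather than stored as a member.
        with_mean = self.params["withMean"]
        with_std = self.params["withStd"]
        out = []
        for x, m, s in zip(row, self.mean, self.std):
            if with_mean:
                x = x - m
            if with_std:
                # Assumption: a feature with zero std is mapped to 0.0.
                x = x / s if s != 0.0 else 0.0
            out.append(x)
        return out

    def save(self) -> Tuple[str, str]:
        # Hypothetical layout: params go to metadata, statistics go to data.
        metadata = json.dumps({"paramMap": self.params})
        data = json.dumps({"std": self.std, "mean": self.mean})
        return metadata, data

    @classmethod
    def load(cls, metadata: str, data: str) -> "StandardScalerModel":
        fitted = json.loads(data)
        params = json.loads(metadata)["paramMap"]
        return cls(fitted["std"], fitted["mean"], params)


# Construct the model directly (no fitting), then round-trip it.
model = StandardScalerModel(std=[2.0, 0.0], mean=[1.0, 5.0],
                            params={"withMean": True})
metadata, data = model.save()
restored = StandardScalerModel.load(metadata, data)
```

Because the flags are written once, under the metadata side only, a read/write test can build the model from std and mean directly and compare just those two fields after loading.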
Jenkins, test this please.
Test build #46413 has finished for PR 9839 at commit
Test build #46415 has finished for PR 9839 at commit
`withStd` and `withMean` should be params of `StandardScaler` and `StandardScalerModel`.

Author: Yanbo Liang <[email protected]>

Closes #9839 from yanboliang/standardScaler-refactor.

(cherry picked from commit 9ace2e5)
Signed-off-by: Xiangrui Meng <[email protected]>
LGTM. Merged into master and branch-1.6. Thanks!