[SPARK-22060][ML] Fix CrossValidator/TrainValidationSplit param persist/load bug #19278
cc @BryanCutler Thanks! |
|
Test build #81927 has finished for PR 19278 at commit
|
smurching
left a comment
Thanks for the PR! I've left a few comments.
(This might be a task for a future PR) I'm curious why setEstimator() and setEvaluator() aren't part of the ValidatorParams API - they're currently called by both classes that extend ValidatorParams (TrainValidationSplit, CrossValidator) and moving these setter methods up the class hierarchy would allow you to share more loading/persistence code in ValidatorParams.saveImpl
Use an Option[List[String]] that defaults to None instead of a List[String] that defaults to null?
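To make the suggestion concrete, here is a minimal, self-contained sketch of an `Option`-based skip list. It does not use Spark: `ToyParams` and `ToyParamsReader` are hypothetical stand-ins, the metadata is modelled as a plain `Map`, and the param names are only illustrative.

```scala
// Hypothetical stand-in for a Params instance: a mutable name -> value store.
final class ToyParams {
  private val values = scala.collection.mutable.Map.empty[String, String]
  def set(name: String, value: String): Unit = values(name) = value
  def get(name: String): Option[String] = values.get(name)
}

object ToyParamsReader {
  /**
   * Copies every entry of `metadata` onto `instance`, except names listed in `skipParams`.
   * Assumption: metadata is a flat Map here, not Spark's JSON metadata.
   *
   * @param instance   object whose params are set
   * @param metadata   param name -> encoded value
   * @param skipParams names that must not be set; None means "skip nothing"
   */
  def getAndSetParams(
      instance: ToyParams,
      metadata: Map[String, String],
      skipParams: Option[List[String]] = None): Unit = {
    metadata.foreach { case (name, value) =>
      // exists(...) is false for None, so no null check is needed.
      if (!skipParams.exists(_.contains(name))) instance.set(name, value)
    }
  }
}
```

With `None` as the default, callers that have nothing to skip keep the two-argument call, and the method body never has to guard against `null`.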
Update the docstring to state that params included in skipParams aren't set.
This exception is unused & can be removed.
BryanCutler
left a comment
Thanks for fixing the parallelism param persistence. I think it is a little confusing that some of the params are loaded manually and others by DefaultParamsReader.getAndSetParams, which then requires you to skip certain params. Is it possible to do this without the list of params to skip? If not, then maybe it would be better not to use DefaultParamsReader.getAndSetParams at all.
Is this included by accident?
Yes. I will remove it. Sorry.
Was this intentional?
No. I will remove it. Sorry.
Should this also skip estimator and evaluator?
No, because estimator and evaluator aren't included in the metadata. You can check saveImpl.
Do you also need to skip estimator and evaluator?
No, because estimator and evaluator aren't included in the metadata. You can check saveImpl.
|
@BryanCutler The reason I add |
042b3d5 to
cc30578
Compare
|
Test build #81959 has finished for PR 19278 at commit
|
jkbradley
left a comment
Thanks for catching this! One question: I think this maintains backwards compatibility, but would you mind testing that manually by:
- Exporting a CV model using spark/master
- Importing that CV model using this PR's branch, and making sure that works?
Thanks!
| .setTrainRatio(0.5) | ||
| .setEstimatorParamMaps(paramMaps) | ||
| .setSeed(42L) | ||
| .setParallelism(2) |
Could you update the test for the Model too please?
No. The model does not own the parallelism parameter. This was discussed before.
Ah, you're right, thanks
| .setNumFolds(20) | ||
| .setEstimatorParamMaps(paramMaps) | ||
| .setSeed(42L) | ||
| .setParallelism(2) |
Update the test for the model too please
ditto.
| * TODO: Move to [[Metadata]] method | ||
| */ | ||
| def getAndSetParams(instance: Params, metadata: Metadata): Unit = { | ||
| def getAndSetParams(instance: Params, metadata: Metadata, |
fix scala style: 1 arg per line for multiline declarations
| * This works if all Params (except params included by `skipParams` list) implement | ||
| * [[org.apache.spark.ml.param.Param.jsonDecode()]]. | ||
| * | ||
| * The params included in `skipParams` won't be set. This is useful if some params don't |
Document params using @param
| val (metadata, estimator, evaluator, estimatorParamMaps) = | ||
| ValidatorParams.loadImpl(path, sc, className) | ||
| val numFolds = (metadata.params \ "numFolds").extract[Int] |
numFolds is no longer needed
|
@jkbradley Sure I tested the backwards compatibility. Part of the reason I changed into |
|
Test build #82057 has finished for PR 19278 at commit
|
|
LGTM |
What changes were proposed in this pull request?
Currently, persistence and loading of the CrossValidator/TrainValidationSplit params is hardcoded, unlike other ML estimators. This causes a persistence bug for the newly added `parallelism` param.
I refactored the related code to avoid hardcoding which params are persisted and loaded. This also fixes the `parallelism` persistence bug.
This refactoring is useful because more new params will be added in #19208, and hardcoded param persistence/loading makes adding new params troublesome.
How was this patch tested?
Test added.
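A round-trip check of the kind this fix targets could look like the sketch below. It is not the test added in this PR. It assumes a Spark build in which `CrossValidator` has `setParallelism` (2.3 or later), a local SparkSession, and a hypothetical scratch path; the object name `CvPersistenceCheck` is made up for illustration.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object CvPersistenceCheck {
  def main(args: Array[String]): Unit = {
    // ML writers and readers need an active SparkSession.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("cv-persistence-check")
      .getOrCreate()

    // Hypothetical scratch location; pass a different path as the first argument.
    val path = args.headOption.getOrElse("/tmp/cv-persistence-check")

    val lr = new LogisticRegression()
    val paramMaps = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramMaps)
      .setNumFolds(20)
      .setSeed(42L)
      .setParallelism(2)

    // Save the (unfitted) estimator and read it back.
    cv.write.overwrite().save(path)
    val loaded = CrossValidator.load(path)

    // Every simple param, including parallelism, should survive the round trip.
    assert(loaded.getNumFolds == cv.getNumFolds)
    assert(loaded.getSeed == cv.getSeed)
    assert(loaded.getParallelism == cv.getParallelism)
    assert(loaded.getEstimatorParamMaps.length == paramMaps.length)

    spark.stop()
  }
}
```

Only the estimator is checked here, since, as noted in the review, the fitted model does not own the `parallelism` param.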