[SPARK-22126][ML][WIP] Fix model-specific optimization support for ML tuning #19350
Conversation
```scala
  @Since("2.3.0")
  def fit(dataset: Dataset[_], paramMaps: Array[ParamMap],
```
I will add docs for this interface later, but it can be reviewed first; currently this interface looks a little ugly.
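For reference, a minimal sketch of the shape this interface could take, with a default implementation mirroring the Future-based loop further down; the class name, return type, and default body are assumptions for illustration, not the final API:

```scala
import scala.concurrent.{ExecutionContext, Future}

import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

// Hypothetical sketch, not the PR's final API: a multi-ParamMap fit that an
// estimator can override to apply model-specific optimizations. The default
// simply fits each ParamMap independently on the supplied ExecutionContext.
abstract class ParallelFitEstimator[M <: Model[M]] extends Estimator[M] {

  def fit(
      dataset: Dataset[_],
      paramMaps: Array[ParamMap],
      executionContext: ExecutionContext): Array[Future[M]] = {
    paramMaps.map { paramMap =>
      Future {
        // Fall back to the existing single-ParamMap fit for each setting.
        fit(dataset, paramMap)
      }(executionContext)
    }
  }
}
```

An estimator with model-specific optimizations would override this method so it can look at the whole paramMaps array at once instead of fitting each map independently.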
Test build #82189 has finished for PR 19350 at commit

retest this please.

Test build #82191 has finished for PR 19350 at commit

Test build #83889 has finished for PR 19350 at commit

Test build #84023 has finished for PR 19350 at commit
```scala
// Fit models in a Future for training in parallel
val modelFutures = paramMaps.map { paramMap =>
  Future[Model[_]] {
    fit(dataset, paramMap).asInstanceOf[Model[_]]
```
How will this work in a pipeline?
If the Estimator in CV is a Pipeline, then here it will call fit(dataset, paramMap) on the Pipeline, which will in turn fit each stage with that paramMap. This is what the current parallel CV is doing.
But if we have a stage with model-specific optimization (say, for argument's sake, a LinearRegression that can internally optimize maxIter), then its fit will be called with only a single paramMap arg.
So pushing the parallel fit into the Estimator nullifies any benefit from model-specific optimizations?
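To make the concern concrete, here is a hypothetical setup (the stages and grid values are arbitrary, and the dataset is a placeholder): when CV's estimator is a Pipeline, each parallel task ends up calling pipeline.fit(dataset, paramMap) with a single ParamMap, so the LinearRegression stage never sees the full maxIter grid it could optimize over internally.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.sql.DataFrame

// Placeholder: assume a DataFrame with "words" and "label" columns exists.
val dataset: DataFrame = ???

val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LinearRegression()
val pipeline = new Pipeline().setStages(Array(hashingTF, lr))

val paramMaps = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .addGrid(lr.maxIter, Array(10, 50, 100))
  .build()

// What each Future above effectively runs: one full pipeline fit per ParamMap.
// The LinearRegression stage sees exactly one maxIter value per call, so it
// cannot share optimization work across the maxIter grid.
val models = paramMaps.map(paramMap => pipeline.fit(dataset, paramMap))
```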
@MLnick Oh, the design is still under discussion on JIRA and will be changed, I think. I should mark this WIP. Thanks!
@MLnick I discussed this with @jkbradley and @MrBago offline, and here is the newest proposal:
https://docs.google.com/document/d/1xw5M4sp1e0eQie75yIt-r6-GTuD5vpFf_I6v-AFBM3M/edit?usp=sharing
Thanks!
Design changed. @MrBago will create a new PR for this later. The new design is here: https://docs.google.com/document/d/1xw5M4sp1e0eQie75yIt-r6-GTuD5vpFf_I6v-AFBM3M/edit?usp=sharing
What changes were proposed in this pull request?
Push the fitting parallelization code down from CrossValidator/TrainValidationSplit into Estimators.
See discussions in SPARK-19357.
Design doc: https://docs.google.com/document/d/1xw5M4sp1e0eQie75yIt-r6-GTuD5vpFf_I6v-AFBM3M/edit?usp=sharing
Scala API:
Java API:
Note: In either Scala or Java, a developer can use the helper method HasParallelism.getExecutionContext to get the ExecutionContext object that this API requires.
Discussion: Do we need to provide both a Scala and a Java API, or only a Java API? In this PR, I provide both, so on the Scala side a developer can override the Scala API (which is easier to use), and on the Java side a developer can override the Java API.
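For illustration, a caller-side sketch of how tuning code could drive such an API; `fitAllAndWait` and its `fitAll` function parameter are hypothetical stand-ins for the proposed `est.fit(dataset, paramMaps, executionContext)` overload, and the local thread pool stands in for the ExecutionContext that `HasParallelism.getExecutionContext` provides internally:

```scala
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.spark.ml.Model
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

// Hypothetical caller-side sketch of CrossValidator-style tuning code.
// `fitAll` stands in for the proposed multi-ParamMap fit overload.
def fitAllAndWait(
    dataset: Dataset[_],
    paramMaps: Array[ParamMap],
    parallelism: Int,
    fitAll: (Dataset[_], Array[ParamMap], ExecutionContext) => Array[Future[Model[_]]])
    : Array[Model[_]] = {
  val pool = Executors.newFixedThreadPool(parallelism)
  val executionContext = ExecutionContext.fromExecutorService(pool)
  try {
    // Hand the whole grid to the estimator at once, then block for all models.
    val modelFutures = fitAll(dataset, paramMaps, executionContext)
    modelFutures.map(future => Await.result(future, Duration.Inf))
  } finally {
    pool.shutdown()
  }
}
```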
How was this patch tested?
N/A