[SPARK-21911][ML][PySpark] Parallel Model Evaluation for ML Tuning in PySpark #19122
Conversation
python/pyspark/ml/tuning.py
Outdated
Here maybe need a discussion.
Currently the PySpark implementation caches neither the training dataset nor the validation dataset, while the Scala implementation caches both.
I would prefer to cache the validation dataset but not the training dataset: the validation dataset is only 1/numFolds of the input dataset, so it deserves caching; otherwise we would scan the input dataset again for it. The training dataset, on the other hand, is (numFolds - 1)/numFolds of the input dataset, so we can generate it by scanning the input dataset directly without slowing things down too much.
@BryanCutler @MLnick What do you think about it ? Thanks!
We will do multi-model training when fitting the estimator, so I think it is still beneficial to cache the training dataset?
Suppose we have already cached the input dataset; then generating the training dataset only needs a filter over the cached DataFrame that drops 1/numFolds of the rows. So the cost should not be much higher than caching the training dataset itself.
That's right, but it seems we don't check whether the input dataset is cached here. Should we cache it if it is not already cached?
Hmm... checking the caching status of the input dataset is not easy; there are still ongoing discussions in SPARK-18608. For now I think we can stay consistent with the Scala side and cache both the training and validation datasets, so I have updated the code accordingly.
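For reference, a minimal sketch of a fold loop that caches both splits this way, assuming the general structure of CrossValidator._fit; the function and column names are illustrative rather than the exact code in this PR:

```python
from pyspark.sql import functions as F

def _fit_folds(dataset, est, epm, eva, nFolds, seed):
    h = 1.0 / nFolds
    df = dataset.select("*", F.rand(seed).alias("_rand"))
    metrics = [0.0] * len(epm)
    for i in range(nFolds):
        lb, ub = i * h, (i + 1) * h
        condition = (df["_rand"] >= lb) & (df["_rand"] < ub)
        # Cache both splits, matching the Scala implementation.
        validation = df.filter(condition).cache()
        train = df.filter(~condition).cache()
        for j in range(len(epm)):
            model = est.fit(train, epm[j])
            metrics[j] += eva.evaluate(model.transform(validation, epm[j])) / nFolds
        validation.unpersist()
        train.unpersist()
    return metrics
```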
Test build #81386 has finished for PR 19122 at commit
Test build #81387 has finished for PR 19122 at commit
python/pyspark/ml/tuning.py
Outdated
Can we have a benchmark for this? Could the Python GIL be a problem here and end up degrading performance?
Do you mean that the line `metrics[index] += metric / nFolds` will degrade performance because of locking? I can change the code to avoid this. Thanks!
The actual fitting and evaluation methods run here might include CPU-bound code, so I am not sure multithreading can really boost performance here.
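For context on the GIL concern: in this code path most of the heavy lifting (fitting and Spark-side evaluation) happens in the JVM, and the Python worker threads spend their time blocked on Py4J socket calls, during which the GIL is released. A minimal sketch of the thread-based approach, with results returned from pool.map rather than accumulated in shared state (names are illustrative, not the exact PR code):

```python
from multiprocessing.pool import ThreadPool

def parallel_eval(est, train, validation, eva, epm, parallelism):
    def single_train(param_map):
        # Blocks on the JVM via Py4J; the GIL is released while waiting.
        model = est.fit(train, param_map)
        return eva.evaluate(model.transform(validation, param_map))

    pool = ThreadPool(processes=min(parallelism, len(epm)))
    # Results come back in the same order as epm, so no shared mutable state is needed.
    return pool.map(single_train, epm)
```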
Test build #81411 has finished for PR 19122 at commit
Test build #81457 has finished for PR 19122 at commit
BryanCutler
left a comment
I had a few questions, but why don't we get OneVsRest with the shared param merged in first?
python/pyspark/ml/tuning.py
Outdated
Since the param is not being passed to Java, should we check that it is >=1 here and in setParam?
I added a check when creating the thread pool.
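Roughly, the guard could look like the sketch below; this is a hypothetical standalone helper for illustration, not the exact inline check in the PR:

```python
from multiprocessing.pool import ThreadPool

def _make_pool(parallelism, num_models):
    # Fail fast with a clear message instead of letting ThreadPool raise later.
    if parallelism < 1:
        raise ValueError("parallelism should be >= 1, but got %d" % parallelism)
    return ThreadPool(processes=min(parallelism, num_models))
```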
python/pyspark/ml/tuning.py
Outdated
Could you just use epm as the argument in the function instead of an index? e.g. pool.map(singleTrain, epm)
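Illustratively, the two styles differ only in what is mapped over; the stand-in definitions below are assumptions for the sake of a runnable snippet:

```python
from multiprocessing.pool import ThreadPool

epm = [{"maxIter": 5}, {"maxIter": 6}]           # stand-in param maps
singleTrain = lambda param_map: len(param_map)   # stand-in for the real fit-and-evaluate closure
pool = ThreadPool(processes=2)

# Index-based version
metrics = pool.map(lambda index: singleTrain(epm[index]), range(len(epm)))

# Mapping over the param maps directly, as suggested
metrics = pool.map(singleTrain, epm)
```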
I think the description should be more general. Is the plan to put the shared param in here or in OneVsRest first?
No worries, #19110 will be merged first, and then I will merge it into this PR.
python/pyspark/ml/tuning.py
Outdated
Are you planning on adding a unit test to verify that parallel has the same results as serial?
Test added.
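A minimal sketch of the kind of test, assuming the parallelism param introduced by this PR; the toy dataset and grid values are illustrative:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [5, 6]).build()
evaluator = BinaryClassificationEvaluator()

cvSerial = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                          evaluator=evaluator, parallelism=1)
cvParallel = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                            evaluator=evaluator, parallelism=2)

# With the same default seed, both runs see the same folds, so the averaged
# metrics should match.
assert cvSerial.fit(dataset).avgMetrics == cvParallel.fit(dataset).avgMetrics
```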
@BryanCutler Code updated, thanks!
Test build #81705 has finished for PR 19122 at commit
Test build #81707 has finished for PR 19122 at commit
@BryanCutler Do you have more comments? I can check it out now but don't want to review at the same time.
BryanCutler
left a comment
I had just a few minor suggestions, otherwise LGTM
python/pyspark/ml/tuning.py
Outdated
minor: import order
python/pyspark/ml/tests.py
Outdated
I think it would be a little better to check if the bestModel chosen was the same in both cases, same with the TrainValidationSplit test.
Hmm... I tried, but how do I get the models' parents?
oh right, I guess you'd have to check cvSerialModel.bestModel.weights which isn't too ideal either. It's fine how it is.
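For illustration, the check discussed might look roughly like this, assuming cvSerialModel and cvParallelModel are CrossValidatorModel instances fitted with parallelism 1 and 2 over a LogisticRegression grid (in recent PySpark the attribute is coefficients rather than weights):

```python
# Compare the selected best models directly; not ideal, but workable for LogisticRegression.
assert cvSerialModel.bestModel.coefficients == cvParallelModel.bestModel.coefficients
assert cvSerialModel.bestModel.intercept == cvParallelModel.bestModel.intercept
```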
python/pyspark/ml/tuning.py
Outdated
Sorry, I asked for this check, but I don't think we really need it. The ValueError raised when creating a ThreadPool with processes < 1 makes the problem easy enough to spot. If you want to leave it in, that's fine too.
python/pyspark/ml/tuning.py
Outdated
not a big deal, but you could use variable j instead of k here now
jkbradley
left a comment
I started to review a little...and then realized I made an error when reviewing the original patches in this work. I posted here: https://issues.apache.org/jira/browse/SPARK-19357 Please let me know what your thoughts are on the best way to fix this.
python/pyspark/ml/tests.py
Outdated
Don't sort the metrics. The metrics are guaranteed to be returned in the same order as the estimatorParamMaps, so they should match up already.
python/pyspark/ml/tests.py
Outdated
ditto: don't sort the metrics
python/pyspark/ml/tuning.py
Outdated
style: This should be grouped with the other 3rd-party library imports
Test build #82058 has finished for PR 19122 at commit
ping @jkbradley
Test build #82275 has finished for PR 19122 at commit
Discussed elsewhere: We'll delay the multi-model fitting optimization in favor of getting this in for now. Taking a look now...
jkbradley
left a comment
Looks basically ready; just a tiny comment
python/pyspark/ml/tests.py
Outdated
| ["features", "label"]) | ||
|
|
||
| lr = LogisticRegression() | ||
| grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build() |
With only 0 or 1 iterations, I don't think we could expect to see big differences between parallelism 1 and 2, even if there were bugs in our implementation. How about trying more, say 5 and 6 iterations?
Same for TrainValidationSplit
Test build #3961 has finished for PR 19122 at commit
Test build #83075 has finished for PR 19122 at commit
LGTM
Whoops, could you please send a follow-up PR to do one doc update?
@jkbradley Sure I will!
What changes were proposed in this pull request?
Fix doc issue mentioned here: #19122 (comment)
How was this patch tested?
N/A
Author: WeichenXu <[email protected]>
Closes #19641 from WeichenXu123/fix_doc.
What changes were proposed in this pull request?
Add parallelism support for ML tuning in PySpark.
How was this patch tested?
Test updated.