[SPARK-21911][ML][PySpark] Parallel Model Evaluation for ML Tuning in PySpark #19122
Conversation
python/pyspark/ml/tuning.py
Outdated
Here maybe need a discussion.
Currently the PySpark implementation caches neither the training dataset nor the validation dataset, while the Scala implementation caches both.
I would prefer to cache the validation dataset but not the training dataset: the validation dataset is only 1/numFolds of the input dataset, so it deserves caching; otherwise we would scan the input dataset again for it. The training dataset, on the other hand, is (numFolds - 1)/numFolds of the input dataset, so we can generate it by scanning the input dataset directly without slowing things down too much.
@BryanCutler @MLnick What do you think about it ? Thanks!
We will do multi-model training when fitting the estimator, so I think it is still beneficial to cache the training dataset?
Suppose we have already cached the input dataset; then generating the training dataset only needs a filter over the cached DataFrame that drops 1/numFolds of the rows. So the cost should not be much higher than caching the training dataset itself.
That's right, but it seems we don't check whether the input dataset is cached here. Should we cache it if it is not already cached?
Hmm... checking the caching status of the input dataset is not easy; there are still ongoing discussions in SPARK-18608. For now I think we can stay consistent with the Scala side and cache both the training and validation datasets, so I have updated the code accordingly.
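For reference, a minimal sketch of a fold loop that caches both splits this way, assuming the general structure of CrossValidator._fit; the function and column names are illustrative rather than the exact code in this PR:

```python
from pyspark.sql import functions as F

def _fit_folds(dataset, est, epm, eva, nFolds, seed):
    h = 1.0 / nFolds
    df = dataset.select("*", F.rand(seed).alias("_rand"))
    metrics = [0.0] * len(epm)
    for i in range(nFolds):
        lb, ub = i * h, (i + 1) * h
        condition = (df["_rand"] >= lb) & (df["_rand"] < ub)
        # Cache both splits, matching the Scala implementation.
        validation = df.filter(condition).cache()
        train = df.filter(~condition).cache()
        for j in range(len(epm)):
            model = est.fit(train, epm[j])
            metrics[j] += eva.evaluate(model.transform(validation, epm[j])) / nFolds
        validation.unpersist()
        train.unpersist()
    return metrics
```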
Test build #81386 has finished for PR 19122 at commit
Test build #81387 has finished for PR 19122 at commit
python/pyspark/ml/tuning.py
Outdated
Can we have a benchmark for this? Could the Python GIL be a problem here and end up degrading performance?
Do you mean that the line `metrics[index] += metric / nFolds` will degrade performance because of locking? I can change the code to avoid this. Thanks!
The actual fitting and evaluation methods run here might include CPU-bound code, so I am not sure multithreading can really boost performance here.
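For context on the GIL concern: in this code path most of the heavy lifting (fitting and Spark-side evaluation) happens in the JVM, and the Python worker threads spend their time blocked on Py4J socket calls, during which the GIL is released. A minimal sketch of the thread-based approach, with results returned from pool.map rather than accumulated in shared state (names are illustrative, not the exact PR code):

```python
from multiprocessing.pool import ThreadPool

def parallel_eval(est, train, validation, eva, epm, parallelism):
    def single_train(param_map):
        # Blocks on the JVM via Py4J; the GIL is released while waiting.
        model = est.fit(train, param_map)
        return eva.evaluate(model.transform(validation, param_map))

    pool = ThreadPool(processes=min(parallelism, len(epm)))
    # Results come back in the same order as epm, so no shared mutable state is needed.
    return pool.map(single_train, epm)
```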
Test build #81411 has finished for PR 19122 at commit
Test build #81457 has finished for PR 19122 at commit
BryanCutler
left a comment
I had a few questions, but why don't we get OneVsRest with the shared param merged in first?
python/pyspark/ml/tuning.py
Outdated
Since the param is not being passed to Java, should we check that it is >=1 here and in setParam?
I added a check when creating the thread pool.
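Roughly, the guard could look like the sketch below; this is a hypothetical standalone helper for illustration, not the exact inline check in the PR:

```python
from multiprocessing.pool import ThreadPool

def _make_pool(parallelism, num_models):
    # Fail fast with a clear message instead of letting ThreadPool raise later.
    if parallelism < 1:
        raise ValueError("parallelism should be >= 1, but got %d" % parallelism)
    return ThreadPool(processes=min(parallelism, num_models))
```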
python/pyspark/ml/tuning.py
Outdated
Could you just use epm as the argument in the function instead of an index? e.g. pool.map(singleTrain, epm)
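Illustratively, the two styles differ only in what is mapped over; the stand-in definitions below are assumptions for the sake of a runnable snippet:

```python
from multiprocessing.pool import ThreadPool

epm = [{"maxIter": 5}, {"maxIter": 6}]           # stand-in param maps
singleTrain = lambda param_map: len(param_map)   # stand-in for the real fit-and-evaluate closure
pool = ThreadPool(processes=2)

# Index-based version
metrics = pool.map(lambda index: singleTrain(epm[index]), range(len(epm)))

# Mapping over the param maps directly, as suggested
metrics = pool.map(singleTrain, epm)
```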
I think the description should be more general. Is the plan to put the shared param in here or in OneVsRest first?
No worries, #19110 will be merged first, and then I will merge it into this PR.
python/pyspark/ml/tuning.py
Outdated
Are you planning on adding a unit test to verify that parallel has the same results as serial?
Test added.
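A minimal sketch of the kind of test, assuming the parallelism param introduced by this PR; the toy dataset and grid values are illustrative:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [5, 6]).build()
evaluator = BinaryClassificationEvaluator()

cvSerial = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                          evaluator=evaluator, parallelism=1)
cvParallel = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                            evaluator=evaluator, parallelism=2)

# With the same default seed, both runs see the same folds, so the averaged
# metrics should match.
assert cvSerial.fit(dataset).avgMetrics == cvParallel.fit(dataset).avgMetrics
```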
@BryanCutler Code updated, thanks!
Test build #81705 has finished for PR 19122 at commit
Test build #81707 has finished for PR 19122 at commit
@BryanCutler Do you have more comments? I can check it out now but don't want to review at the same time.
BryanCutler
left a comment
I had just a few minor suggestions, otherwise LGTM
python/pyspark/ml/tuning.py
Outdated
minor: import order
python/pyspark/ml/tests.py
Outdated
I think it would be a little better to check if the bestModel chosen was the same in both cases, same with the TrainValidationSplit test.
Hmm... I tried, but how do I get the models' parents?
oh right, I guess you'd have to check cvSerialModel.bestModel.weights which isn't too ideal either. It's fine how it is.
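For illustration, the check discussed might look roughly like this, assuming cvSerialModel and cvParallelModel are CrossValidatorModel instances fitted with parallelism 1 and 2 over a LogisticRegression grid (in recent PySpark the attribute is coefficients rather than weights):

```python
# Compare the selected best models directly; not ideal, but workable for LogisticRegression.
assert cvSerialModel.bestModel.coefficients == cvParallelModel.bestModel.coefficients
assert cvSerialModel.bestModel.intercept == cvParallelModel.bestModel.intercept
```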
python/pyspark/ml/tuning.py
Outdated
Sorry, I asked for this check, but I don't think we really need it. The ValueError raised when creating a ThreadPool with processes < 1 makes the problem easy enough to spot. If you want to leave it in, that's fine too.
python/pyspark/ml/tuning.py
Outdated
not a big deal, but you could use variable j instead of k here now
jkbradley
left a comment
I started to review a little...and then realized I made an error when reviewing the original patches in this work. I posted here: https://issues.apache.org/jira/browse/SPARK-19357 Please let me know what your thoughts are on the best way to fix this.
python/pyspark/ml/tests.py
Outdated
Don't sort the metrics. The metrics are guaranteed to be returned in the same order as the estimatorParamMaps, so they should match up already.
python/pyspark/ml/tests.py
Outdated
ditto: don't sort the metrics
python/pyspark/ml/tuning.py
Outdated
style: This should be grouped with the other 3rd-party library imports
Test build #82058 has finished for PR 19122 at commit
ping @jkbradley
Test build #82275 has finished for PR 19122 at commit
Discussed elsewhere: We'll delay the multi-model fitting optimization in favor of getting this in for now. Taking a look now...
jkbradley
left a comment
Looks basically ready; just a tiny comment
python/pyspark/ml/tests.py
Outdated
| ["features", "label"]) | ||
|
|
||
| lr = LogisticRegression() | ||
| grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build() |
With only 0 or 1 iterations, I don't think we could expect to see big differences between parallelism 1 and 2, even if there were bugs in our implementation. How about trying more, say 5 and 6 iterations?
Same for TrainValidationSplit
Test build #3961 has finished for PR 19122 at commit
Test build #83075 has finished for PR 19122 at commit
LGTM
Whoops, could you please send a follow-up PR to do one doc update?
@jkbradley Sure I will!
What changes were proposed in this pull request?
Fix doc issue mentioned here: #19122 (comment)
How was this patch tested?
N/A
Author: WeichenXu <[email protected]>
Closes #19641 from WeichenXu123/fix_doc.
What changes were proposed in this pull request?
Add parallelism support for ML tuning in PySpark.
How was this patch tested?
Test updated.