[SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API #19627

WeichenXu123 · 2017-11-01T13:49:35Z

What changes were proposed in this pull request?

Add a Python API for collecting sub-models during CrossValidator/TrainValidationSplit fitting.
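For readers unfamiliar with the feature, here is a toy sketch in plain Python of what "collecting sub-models during fitting" means: every model trained for each fold and each parameter setting is kept when the flag is on, and dropped otherwise. This is not the PySpark implementation. `fit_constant_model`, `mse`, and `cross_validate` are hypothetical names, and the folds-by-grid indexing of `sub_models` is an assumption about the intended layout.

```python
# Toy sketch only: plain Python, not the PySpark code in this PR.
# All function names here are hypothetical.
from statistics import mean


def fit_constant_model(train, shrink):
    """Fit a 'model' that always predicts a shrunken mean of the training targets."""
    return {"shrink": shrink, "prediction": mean(train) * (1.0 - shrink)}


def mse(model, rows):
    """Mean squared error of the constant prediction on the given targets."""
    return mean((y - model["prediction"]) ** 2 for y in rows)


def cross_validate(data, param_grid, num_folds=3, collect_sub_models=False):
    """Return (avg_metrics, sub_models); sub_models is None unless requested."""
    folds = [data[i::num_folds] for i in range(num_folds)]
    metrics = [0.0] * len(param_grid)
    # Assumed layout: sub_models[fold_index][param_index].
    sub_models = None
    if collect_sub_models:
        sub_models = [[None] * len(param_grid) for _ in range(num_folds)]
    for i, validation in enumerate(folds):
        train = [y for j, fold in enumerate(folds) if j != i for y in fold]
        for k, shrink in enumerate(param_grid):
            model = fit_constant_model(train, shrink)
            metrics[k] += mse(model, validation) / num_folds
            if collect_sub_models:
                sub_models[i][k] = model  # keep every trained model
    return metrics, sub_models


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
grid = [0.0, 0.5]
avg_metrics, sub_models = cross_validate(data, grid, num_folds=3, collect_sub_models=True)
_, no_sub_models = cross_validate(data, grid, num_folds=3, collect_sub_models=False)
```

Keeping every trained model costs memory in proportion to the number of folds times the grid size, which is presumably why collection is opt-in.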

How was this patch tested?

Unit tests added.

SparkQA · 2017-11-01T13:53:41Z

Test build #83290 has finished for PR 19627 at commit c071183.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class HasCollectSubModels(Params):
class CrossValidator(Estimator, ValidatorParams, HasParallelism, HasCollectSubModels, MLReadable, MLWritable):
class TrainValidationSplit(Estimator, ValidatorParams, HasParallelism, HasCollectSubModels,

WeichenXu123 · 2017-11-01T14:13:03Z

Jenkins, test this please.

SparkQA · 2017-11-17T03:43:32Z

Test build #83953 has finished for PR 19627 at commit 4c3a7ea.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-17T03:58:30Z

Test build #83954 has finished for PR 19627 at commit 9e27f6b.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123 · 2017-11-17T04:00:42Z

My local tests passed. This failure looks like a test-system issue.

holdenk · 2017-11-18T14:49:40Z

What happens when you run check-license locally? I agree it doesn't look like any of these changes would impact the license headers.

WeichenXu123 · 2017-11-19T02:38:11Z

@holdenk Found the reason: there was an empty file in the directory. :)

SparkQA · 2017-11-19T02:39:34Z

Test build #83991 has finished for PR 19627 at commit 758bc24.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-19T03:49:39Z

Test build #83992 has finished for PR 19627 at commit ae082f5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class CrossValidator(Estimator, ValidatorParams, HasParallelism, HasCollectSubModels,

WeichenXu123 · 2017-11-21T12:07:17Z

@holdenk Thanks!

jkbradley · 2017-12-01T18:44:09Z

Is this still WIP or ready?

WeichenXu123 · 2017-12-02T00:32:09Z

@jkbradley I think it is better to review and merge #19857 (fix Python model-specific optimization) first; then I will rebase and update this PR. :)

WeichenXu123 · 2018-04-10T11:25:01Z

@MrBago @yogeshg @jkbradley Updated and ready for review now!

SparkQA · 2018-04-10T12:41:28Z

Test build #89111 has finished for PR 19627 at commit 81473b0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley

Thanks for the PR!

You'll need to update _from_java and _to_java for CrossValidator and TrainValidationSplit.

Also, please update the PR description.

jkbradley · 2018-04-11T01:13:22Z

python/pyspark/ml/tests.py

+        tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator,
+                                   collectSubModels=True)
+        tvsModel = tvs.fit(dataset)
+        assert len(tvsModel.subModels) == len(grid)


Use self.assertEqual here and elsewhere.
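A minimal sketch of the suggested style, using stand-in values rather than a real Spark session. The class name `SubModelCountTest` and the placeholder lists are hypothetical. A bare `assert` is skipped when Python runs with `-O` and reports no values on failure, whereas `assertEqual` prints both sides.

```python
import unittest


class SubModelCountTest(unittest.TestCase):
    def test_sub_model_count(self):
        # Stand-ins for estimatorParamMaps and tvsModel.subModels (hypothetical data).
        grid = [{"maxIter": 0}, {"maxIter": 1}]
        sub_models = ["model-0", "model-1"]
        # On failure this reports both lengths, unlike a bare assert.
        self.assertEqual(len(sub_models), len(grid))


# Run the test programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SubModelCountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```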

jkbradley · 2018-04-11T01:19:54Z

python/pyspark/ml/param/_shared_params_code_gen.py

         "TypeConverters.toInt"),
        ("parallelism", "the number of threads to use when running parallel algorithms (>= 1).",
         "1", "TypeConverters.toInt"),
+        ("collectSubModels", "whether to collect a list of sub-models trained during tuning",


It would be nice to add the full description from Scala.

jkbradley · 2018-04-11T18:55:50Z

python/pyspark/ml/tuning.py



-class CrossValidator(Estimator, ValidatorParams, HasParallelism, MLReadable, MLWritable):
+class CrossValidator(Estimator, ValidatorParams, HasParallelism, HasCollectSubModels,


You'll need to update _from_java and _to_java as well to pass collectSubModels around. (Same for TrainValidationSplit)

Let's also clarify in the doc for CrossValidatorModel.copy() that it does not copy the extra Params into the subModels. (same for TrainValidationSplitModel)
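To illustrate the copy() behaviour being asked about, here is a toy sketch in plain Python. `ToyModel` and `ToyTunedModel` are hypothetical classes, not the PySpark ones, and copying each sub-model individually is an assumption made for the illustration. The point is only that the extra params reach the best model and not the sub-models.

```python
# Toy sketch only: hypothetical classes, not pyspark.ml.tuning.
class ToyModel:
    def __init__(self, params):
        self.params = dict(params)

    def copy(self, extra=None):
        """Return a new model whose params are overlaid with `extra`."""
        merged = dict(self.params)
        merged.update(extra or {})
        return ToyModel(merged)


class ToyTunedModel:
    def __init__(self, best_model, sub_models=None):
        self.best_model = best_model
        self.sub_models = sub_models

    def copy(self, extra=None):
        # Extra params are applied to the best model only.
        best = self.best_model.copy(extra)
        # Sub-models are carried over without the extra params.
        subs = None
        if self.sub_models is not None:
            subs = [m.copy() for m in self.sub_models]
        return ToyTunedModel(best, subs)


original = ToyTunedModel(
    ToyModel({"maxIter": 1}),
    [ToyModel({"maxIter": 0}), ToyModel({"maxIter": 1})],
)
copied = original.copy({"threshold": 0.9})
```

A test of copy() along these lines can check both that the sub-models survive the copy and that they are unaffected by the extra params.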

jkbradley · 2018-04-11T19:47:58Z

python/pyspark/ml/tests.py

        cvParallelModel = cv.fit(dataset)
        self.assertEqual(cvSerialModel.avgMetrics, cvParallelModel.avgMetrics)

+    def test_expose_sub_models(self):


Nice tests. Can you make one addition: Test the copy() method to make sure it copies the submodels.

SparkQA · 2018-04-13T11:51:45Z

Test build #89334 has finished for PR 19627 at commit 80f07fb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2018-04-16T16:30:29Z

LGTM
Merging with master
Thanks!

WeichenXu123 added 2 commits November 17, 2017 11:51

init pr

1edd66b

add submodels save load support

9e27f6b

WeichenXu123 force-pushed the expose-model-list-py branch from 4c3a7ea to 9e27f6b Compare November 17, 2017 03:52

WeichenXu123 changed the title ~~[SPARK-21088][ML][WIP] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API~~ [SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API Nov 17, 2017

fix_RAT_check

758bc24

fix python style

ae082f5

WeichenXu123 mentioned this pull request Dec 1, 2017

[SPARK-22667][ML][WIP] Fix model-specific optimization support for ML tuning: Python API #19857

Closed

WeichenXu123 changed the title ~~[SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API~~ [SPARK-21088][ML][WIP] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API Dec 1, 2017

WeichenXu123 closed this Apr 10, 2018

merge master & update code logic

81473b0

WeichenXu123 reopened this Apr 10, 2018

WeichenXu123 changed the title ~~[SPARK-21088][ML][WIP] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API~~ [SPARK-21088][ML] CrossValidator, TrainValidationSplit support collect all models when fitting: Python API Apr 10, 2018

jkbradley reviewed Apr 11, 2018


address comments

80f07fb

asfgit closed this in 0461482 Apr 16, 2018

WeichenXu123 deleted the expose-model-list-py branch April 16, 2018 23:36


