
Conversation

@hhbyyh
Contributor

@hhbyyh hhbyyh commented Jun 15, 2017

What changes were proposed in this pull request?

Allow CrossValidatorModel and TrainValidationSplitModel to preserve all of the trained models.

Add a new string param, modelPath. If set, all the models fitted during training will be saved under the specified directory path. By default the models are not saved.

The models are saved during training to avoid the expensive memory consumption of caching them:

  1. For CrossValidator, it avoids caching all the trained models in memory, so no extra memory consumption is required. (GC can kick in and clear the models trained for previous splits.)

  2. It allows some models to be preserved even if the training process is interrupted or stopped by an exception.

Sample for cross validation models:
file name pattern: paramMap-split#-metric
[screenshot]

Sample for train validation split:
file name pattern: paramMap-metric
[screenshot]
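
Roughly, the per-model save inside fit would look something like the sketch below (the helper, its parameters, and the warning string are illustrative placeholders, not the exact patch):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.ml.Model
import org.apache.spark.ml.util.MLWritable

// Sketch only, not the actual patch: persist one fitted model under the
// user-supplied directory as soon as it is trained, so it never has to be
// cached in driver memory.
def saveSubModel(model: Model[_], modelPath: String, fileName: String): Unit = model match {
  case writable: MLWritable =>
    // e.g. fileName = "paramMap-split#-metric", as in the samples above
    writable.save(new Path(modelPath, fileName).toString)
  case _ =>
    // models that cannot be serialized are skipped with a warning
    println(s"${model.uid} did not implement MLWritable. Serialization omitted.")
}
```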

How was this patch tested?

New unit tests and local testing.

@SparkQA

SparkQA commented Jun 15, 2017

Test build #78085 has started for PR 18313 at commit 0fc43e1.

@SparkQA

SparkQA commented Jun 24, 2017

Test build #78554 has finished for PR 18313 at commit c7e0bcd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 26, 2017

Test build #78647 has finished for PR 18313 at commit 39c025f.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

logWarning(models(i).uid + " did not implement MLWritable. Serialization omitted.")
}
}
metrics(i) += metric
Contributor

ping @hhbyyh @jkbradley
I am sorry, but I have to say this code is incorrect. Saving the list of fitted models inside the fit method doesn't help.
What we need is to let CrossValidatorModel/TrainValidationSplitModel preserve the full list of fitted models; when saving the CrossValidatorModel/TrainValidationSplitModel we can then save the list of models, or choose not to, controlled by a parameter.

Contributor Author

So you want to keep all the trained models in CrossValidatorModel?

Contributor

Yes, I think so.
In order to save time, I would like to take over this feature, if you don't mind.
ping @jkbradley

Contributor Author

@hhbyyh hhbyyh Jul 24, 2017

That's interesting...
If we followed your suggestion, with numFolds = 10 and a param grid size of 8, you would be holding 80 models in driver memory (in the CrossValidatorModel) at the same time. I would be very surprised to see anyone go along with that.
I'll happily close the PR if your suggestion turns out to be good.

Contributor

@hhbyyh So we need to set the parameter's default value to false; users turn this feature on only when they really need it. I will discuss with @jkbradley later.

Member

I agree that 80 models in driver memory sounds like a lot. However, we are already holding that many in driver memory at once in val models = est.fit(trainingDataset, epm), so that should not be a problem for current use cases.

Scaling to large models which do not fit in memory is a different problem, but your PR does bring up the issue that exposing something like models: Seq[...] could cause problems in the future if we want to scale more. I'd suggest 2 things:

  • The models could be exposed via a getter, rather than a val. In the future, if the models are not available, the getter could throw a nice exception (a rough sketch follows this list).
  • In the future, we could add the Param which you are suggesting for dumping the models to some directory during training. Feel free to preserve this PR for that, but I think this PR is overkill for most users' needs.
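
A rough sketch of what such a getter could look like (the trait and member names here are illustrative, not the actual API):

```scala
import org.apache.spark.ml.Model

// Illustrative sketch only: expose sub-models through a getter so that,
// when they were not retained, callers get a descriptive error instead of
// a null or empty field.
trait HasSubModels {
  private var _subModels: Option[Array[Model[_]]] = None

  def setSubModels(models: Array[Model[_]]): this.type = {
    _subModels = Some(models)
    this
  }

  def subModels: Array[Model[_]] = _subModels.getOrElse {
    throw new UnsupportedOperationException(
      "Sub-models were not retained during fitting; enable the relevant option before calling fit().")
  }
}
```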

@hhbyyh
Contributor Author

hhbyyh commented Jul 25, 2017

The current implementation of CrossValidator (with or without this PR) NEVER holds all the trained models in driver memory at the same time. It collects models sequentially for each split and allows GC to clear previously trained models:
splits.zipWithIndex.foreach { ...
which means the driver memory only keeps 8 models rather than the 80 models in my example.
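
A simplified outline of that loop (not the real CrossValidator code; the helper and its parameters are illustrative):

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

// Splits are processed one at a time, so at most epm.length models are
// referenced at once; models from earlier splits become unreachable and
// can be garbage collected.
def sumMetricsPerSplit[M <: Model[M]](
    splits: Seq[(DataFrame, DataFrame)],
    est: Estimator[M],
    eval: Evaluator,
    epm: Array[ParamMap]): Array[Double] = {
  val metrics = new Array[Double](epm.length)
  splits.foreach { case (training, validation) =>
    val models = est.fit(training, epm) // models for the current split only
    models.zip(epm).zipWithIndex.foreach { case ((model, paramMap), i) =>
      metrics(i) += eval.evaluate(model.transform(validation, paramMap))
    }
    // `models` goes out of scope here; nothing from this split is retained
  }
  metrics
}
```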

With all due respect, @jkbradley, for the record: I strongly vote against keeping all the models in the CrossValidatorModel in any form, which would make the feature impractical for any production usage.

@hhbyyh hhbyyh closed this Jul 25, 2017
@jkbradley
Member

Oh, you're right; I overlooked that it only holds all of the models for a single split. In that case, I agree it could be problematic to keep all in memory by default. How does this sound then: We can do 2 separate PRs, each of which adds a separate option:

  1. Add an option for keeping the models in memory so that they are available as a data field in the Model object after training. This caters to smaller use cases, focusing on ease-of-use.
  2. (your PR) Add an option to dump models to a directory, not keeping them in memory. This caters to big use cases, focusing on scalability.

What do you think? If this sounds good, feel free to reopen your current PR.

@hhbyyh
Contributor Author

hhbyyh commented Aug 2, 2017

@jkbradley Thanks for the suggestion. After the discussion, I found that actually we can reduce the memory requirement for the tuning process. Check https://issues.apache.org/jira/browse/SPARK-21535 if you have some time.

And I agree with your suggestion overall.

About 1: we'd better have a proper way to manage the cached models for indexing. Another thought is that we could estimate the driver memory requirement in advance and issue a warning or error if necessary.

About 2: if we want to use indices rather than the param combination for model names, we probably need to save a summary table with the models, perhaps like #16158?

@WeichenXu123
Contributor

@jkbradley
I think this is simple.
When the persist-model-list param is false, keep the code logic the same and it won't increase the memory cost (this is the default case).
When the persist-model-list param is true, we can collect all the models inside the fit method and pass them to the CrossValidatorModel/TrainValidationSplitModel. This will increase the memory cost, but that doesn't matter because it is not the default case.
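
In other words, a trivial sketch of the switch (the name collectSubModels is a placeholder, not the final param):

```scala
import org.apache.spark.ml.Model

// With the flag off (the default) nothing is kept, so memory usage is
// unchanged; with it on, the fitted models are handed to the resulting
// CrossValidatorModel / TrainValidationSplitModel.
def maybeKeepSubModels(
    models: Seq[Model[_]],
    collectSubModels: Boolean): Option[Seq[Model[_]]] =
  if (collectSubModels) Some(models) else None
```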

@hhbyyh Your new PR #18733 is meaningful, I think, since it makes the memory cost O(1), but we need to consider the parallelism introduced in #16774. Maybe we should wait for #16774 to be merged, then consider how to optimize memory usage.

Contributor

@WeichenXu123 WeichenXu123 left a comment

I suggest holding this feature and #18733 until #16774 gets merged.

@MLnick
Contributor

MLnick commented Aug 3, 2017

I commented on the JIRA.

I would like to better understand the use case for keeping all the models (as per my JIRA comment). I suspect that #16158 may be the more useful approach in practice.

But overall, if there is a use case for keeping the models, I would agree with @jkbradley's suggestion that we offer a simple "keep all sub-models as a field in the model" approach, as well as consider the large-scale case with possibly the "dump to file" option.

In addition, we could have an option to keep "best", "all" or "k" models (user-specified as a number or %)?

@jkbradley
Member

+1 for merging #16774 before proceeding with the other work since it will affect everything else.

@MLnick I'd be OK with adding options for best/all/k models in the future, but I'm not convinced it's needed, since the top-k could be computed post hoc and since the 2 approaches for keeping all models should handle big & small use cases.

@MLnick
Contributor

MLnick commented Aug 23, 2017

The idea with best/all/k is to allow use cases with fairly large models (say, large enough that collecting all 100 or 1000 or however many param combinations to the driver is not feasible) to still store more than just the best model.

So it's a way to satisfy the small-to-medium use case of storing "all", the default use case of "best", and to partly address the large-model use case using "k". The idea with the "k" version is not to do a full "collect then top-k" but instead to keep a running top-k (a PriorityQueue, perhaps) in order to limit memory consumption.
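
A minimal sketch of such a running top-k, assuming a higher metric is better (the helper name and types are made up for illustration):

```scala
import scala.collection.mutable

// Keep only the k best (metric, model) pairs seen so far; the heap never
// grows beyond k + 1 entries, so memory stays bounded regardless of how
// many param combinations are evaluated.
def runningTopK[M](k: Int, scored: Iterator[(Double, M)]): Seq[(Double, M)] = {
  // min-heap on the metric: the worst retained model sits at the head
  val heap = mutable.PriorityQueue.empty[(Double, M)](Ordering.by[(Double, M), Double](-_._1))
  scored.foreach { pair =>
    heap.enqueue(pair)
    if (heap.size > k) heap.dequeue() // evict the current worst
  }
  heap.dequeueAll.reverse // best first
}
```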

But I agree that if we have both solutions (memory- and file-based) then it's not necessary (though if one did want to do a top-k in the file-based scenario it would be quite clunky). So if that is the end goal, let's do the 2-step process suggested above.

@WeichenXu123
Contributor

@hhbyyh I owe you an apology: your PR is valuable (for the case where the model list is very big).
But your PR is now stale, so I have integrated it into my new PR #19208.
Would you mind taking a look?
Thanks!

@hhbyyh
Contributor Author

hhbyyh commented Sep 14, 2017

That's all right. Please just proceed.
