
Conversation

@BryanCutler
Member

@BryanCutler BryanCutler commented Feb 2, 2017

What changes were proposed in this pull request?

Modified CrossValidator and TrainValidationSplit to be able to evaluate models in parallel for a given parameter grid. The level of parallelism is controlled by a parameter numParallelEval, which sets how many models are scheduled for training/evaluation at once so that those jobs can run concurrently. This is a naive approach that does not check the cluster for available resources, so the user must take care to tune the parameter appropriately. The default value is 1, which trains/evaluates in serial.
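
As an illustration only, a minimal sketch of how the proposed param might be set (the estimator, grid, and evaluator are placeholders; the setter name follows the numParallelEval param proposed here, which later in the PR becomes setParallelism):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// Train/evaluate up to 2 parameter settings at a time instead of serially.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setNumFolds(3)
  .setNumParallelEval(2) // default is 1, i.e. serial evaluation
```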

How was this patch tested?

Added unit tests for CrossValidator and TrainValidationSplit to verify that model selection is the same when run in serial vs parallel. Manual testing to verify tasks run in parallel when param is > 1. Added parameter usage to relevant examples.

@BryanCutler
Member Author

CC @MLnick, this doesn't yet address the issue of double caching, but I wanted to post what I had so far to discuss the additional param and default value. I'll be looking into the caching next.

@SparkQA

SparkQA commented Feb 2, 2017

Test build #72267 has finished for PR 16774 at commit 5650e98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@MLnick MLnick left a comment

Made a first pass.

Let's add unit tests in both test suites to check that the metric results for default numPar are the same as for numPar > 1.

It would be good to also add a section about the parallel execution option to each validator's doc page.

I think it's important to note in the docs that this will only really work as expected if the FAIR scheduler is enabled; otherwise I don't think things will actually be executed concurrently.
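
For reference, a minimal sketch of enabling the FAIR scheduler via the standard spark.scheduler.mode setting (the SparkSession builder and app name are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// spark.scheduler.mode defaults to FIFO; FAIR lets concurrently submitted
// jobs share executor resources instead of queuing behind each other.
val spark = SparkSession.builder()
  .appName("ParallelTuningExample")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()
```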


// Evaluate models concurrently, limited by using a sliding window over models
val foldMetrics = models.zip(epm).grouped(numPar).map { win =>
win.par.map { m =>
Contributor

I'd like to use a case here to be clearer, so .map { case (model, paramMap) => or similar.
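
A sketch of the suggested rewrite, assuming the names from the quoted hunk (models, epm, numPar, eval, validationDataset) are in scope:

```scala
val foldMetrics = models.zip(epm).grouped(numPar).flatMap { window =>
  window.par.map { case (model, paramMap) =>
    // Destructuring the pair names both the model and its ParamMap explicitly.
    eval.evaluate(model.transform(validationDataset, paramMap))
  }.seq
}.toArray
```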


// Fit models concurrently, limited by using a sliding window over models
val models = epm.grouped(numPar).map { win =>
win.par.map(est.fit(trainingDataset, _))
Contributor

I'd prefer to be a bit more verbose here and use a closure variable like paramMap (or epm to follow existing impl) instead of _, just to be clearer & more readable.
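
A sketch of the same hunk with an explicit parameter name instead of _, again assuming the quoted names are in scope:

```scala
val models = epm.grouped(numPar).map { window =>
  // Naming the ParamMap makes the fit call easier to read than `_`.
  window.par.map(paramMap => est.fit(trainingDataset, paramMap))
}
```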

val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
// Fit models concurrently, limited by using a sliding window over models
val models = epm.grouped(numPar).map { win =>
win.par.map(est.fit(trainingDataset, _))
Contributor

Same comment as for cross-val applies here

}
// Evaluate models concurrently, limited by using a sliding window over models
val metrics = models.zip(epm).grouped(numPar).map { win =>
win.par.map { m =>
Contributor

Same comment as for cross-val applies here

// multi-model training
logDebug(s"Train split with multiple sets of parameters.")
val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
// Fit models concurrently, limited by using a sliding window over models
Contributor

perhaps add "sliding window of size 'numPar' over ..." - can add to all instances of this comment

def getNumFolds: Int = $(numFolds)

setDefault(numFolds -> 3)
setDefault(numFolds -> 3, numParallelEval -> 1)
Contributor

May make more sense to put the setDefault call in the parent trait ValidatorParams

* @group param
*/
val numParallelEval: IntParam = new IntParam(this, "numParallelEval",
"max number of models to evaluate in parallel, 1 for serial evaluation")
Contributor

Add note about default value being 1

instr.logParams(numFolds, seed)
logTuningParams(instr)

// Compute metrics for each model over each fold
Contributor

technically it's each estimator rather than model - since both the fitting (estimator or pipeline) and evaluation (model or pipeline model) are done in parallel.

Member Author

Hmm, I was just referencing the general process that a metric is computed by evaluating the model predictions on the validation set, and that this is done at each fold. I don't think that has any bearing on the concurrency here, but I added this just because the process is a little more complicated with the addition of a sliding window...

@BryanCutler
Member Author

Thanks for the review @MLnick! All good suggestions, I'll work on an update and addition of the unit test.

@BryanCutler
Member Author

Regarding the default value of the new parameter, do you think 1, so that it runs in serial as it currently does, is the best option? It's definitely safe, but I wonder if most people will even know that the parameter exists...

Do you know of any possible way to automatically choose a value? I remember a while ago I saw the mllib ALS algorithm automatically choose the block size, but I don't think that's done in the ml version. Maybe there were too many problems doing this...

@MLnick
Contributor

MLnick commented Feb 15, 2017

I'd say coming up with a heuristic or algorithm to automatically set the parallel execution param is going to be pretty challenging, since it depends on the details of the individual pipeline components, cluster resources, etc. So for now let's leave it fixed.

I think setting the default to match current behavior is best (i.e. 1). We can add a section to the docs explaining its effect, and also add a couple lines to each example to illustrate typical usage.

@BryanCutler
Member Author

I think setting the default to match current behavior is best (i.e. 1).

I agree, just wanted to bring it up in case others had differing view since it was a concern in the JIRA.

I think it's important to note in the docs that this will actually only really work as expected if the FAIR scheduler is enabled, otherwise I don't think things will actually be executed concurrently.

I ran some tests with and without the FAIR scheduler enabled. Without the FAIR scheduler, if there are enough resources in the cluster (e.g. cores) then the tasks are run concurrently; if there are not enough resources, the jobs will wait. With the FAIR scheduler the tasks are run concurrently and share the available cores. Both will complete model selection in about the same time, given equal resources, so I think the main benefit of FAIR is for multi-user setups, so that one user's jobs don't get starved.

@SparkQA

SparkQA commented Feb 16, 2017

Test build #73021 has finished for PR 16774 at commit 6a9b735.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler BryanCutler changed the title [SPARK-19357][ML][WIP] Adding parallel model evaluation in ML tuning [SPARK-19357][ML] Adding parallel model evaluation in ML tuning Feb 16, 2017
@BryanCutler
Member Author

@MLnick , I updated this PR:

  • added unit tests
  • added usage of parameter to examples
  • updated ml-tuning.md documentation
  • changed the sliding window to a bounded semaphore to limit parallelism; this way, as soon as one job completes, another gets scheduled without waiting for the entire window to finish - it's a bit cleaner too
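
A rough sketch of the bounded-semaphore idea from the last bullet, assuming est, epm, trainingDataset, and numParallelEval are in scope (this is not the actual diff, and the global execution context is only a placeholder):

```scala
import java.util.concurrent.Semaphore
import scala.concurrent.{ExecutionContext, Future}

// Allow at most numParallelEval fits in flight; as soon as one finishes,
// its permit is released and the next queued fit can start.
val permits = new Semaphore(numParallelEval)
implicit val ec: ExecutionContext = ExecutionContext.global // placeholder pool

val modelFutures = epm.map { paramMap =>
  permits.acquire() // blocks the submitting thread once the limit is reached
  Future {
    try est.fit(trainingDataset, paramMap)
    finally permits.release()
  }
}
```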

Please take a look when you get a chance, thanks!

.setNumFolds(2)
.setNumParallelEval(1)
val cvSerialModel = cv.fit(dataset)
cv.setNumParallelEval(2)
Contributor

this test is not deterministic now. @MLnick, do you know if we have some utilities to retry tests a number of times?

Member Author

Yes, the models could now be evaluated in a different order, but the end result of returning the best model would be deterministic and shouldn't ever change.

Contributor

Yes, I'd say we don't care about deterministic execution order here - we care about the result being the same regardless of execution order.

@thunterdb
Contributor

Thanks for working on this task, this is a much requested feature. While it will work for simple cases in its current shape, it is going to cause some issues for complex deployments (Apache Toree, Livy, Databricks, etc.) because the thread pool that controls the computations is not managed. The default assumption with .par is a lot of quick tasks. With the current implementation, because the same thread pool is going to be shared across all the parallel collections, users are going to encounter some mysterious freezes in other places while the ML models finish training (I am talking from experience here).

While the situation with .par has notably improved with Scala 2.10, it is better to:

  • create a dedicated thread pool for each .fit, which users can replace.
  • use futures whose execution context is tied to the thread pool above.
  • not use semaphores, but instead rely on the thread limit at the thread-pool level to cap the number of concurrent executions.

If you do not do that, users in a shared environment like any of the above will experience some mysterious freezes depending on what other users are doing. Ideally the default resources should be tied to SparkSession, but we can start with a default static pool marked as experimental API.

More concretely, the API should look like this, I believe:

  def setNumParallelEval(num: Int) // Creates an execution context with the given max number of threads
  def setExecutionExecutorService(exec: ExecutorService) // Will use the given executor service instead of an executor service shared by all the ML calculations

See the doc in:
https://twitter.github.io/scala_school/concurrency.html#executor
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html

More about parallel collections:
http://stackoverflow.com/questions/5424496/scala-parallel-collections-degree-of-parallelism
http://stackoverflow.com/questions/14214757/what-is-the-benefit-of-using-futures-over-parallel-collections-in-scala
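
A minimal sketch of the pattern suggested above - a dedicated, replaceable thread pool backing an ExecutionContext, with Futures and no semaphore - assuming est, eval, epm, trainingDataset, validationDataset, and numParallelEval are in scope; this is illustrative, not the PR's code:

```scala
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// A dedicated pool: ML fits never compete with the global Scala pool,
// and a caller could swap in their own ExecutorService instead.
val pool: ExecutorService = Executors.newFixedThreadPool(numParallelEval)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

val metricFutures = epm.toSeq.map { paramMap =>
  Future {
    val model = est.fit(trainingDataset, paramMap)
    eval.evaluate(model.transform(validationDataset, paramMap))
  }
}

// The pool size, not a semaphore, caps how many fits run concurrently.
val metrics = Await.result(Future.sequence(metricFutures), Duration.Inf)
pool.shutdown()
```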

@BryanCutler
Member Author

Hi @thunterdb , thanks for the review and all of the details you provided! I agree that a configurable execution service would be needed for running under a shared environment instead of simply using the default context. I'll rework this and make use of Futures too.

@BryanCutler
Member Author

@thunterdb and @MLnick I updated this to use a configurable ExecutorService and Futures instead of Scala parallel collections. The ExecutorService is retrieved via a function so that the thread pool is not initialized until it is needed. When running cross-validation there is a trade-off between waiting for all calculations to complete at each split before continuing to the next, and excessively caching all of the split datasets at once. I chose to be safe and just wait for the calculations at each split to complete, but let me know what you guys think - hopefully this makes sense :)

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73545 has finished for PR 16774 at commit 1c2e391.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73549 has finished for PR 16774 at commit 9e055cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@MLnick MLnick left a comment

Made a pass on these updates.

I think it overall looks pretty good. Mostly a few clean up comments, and also I think we should look at a simpler API / method for setting a custom execution context.

In terms of CrossValidator, I think for now it's fine to do parallel execution within each fold. Otherwise it may be very easy to flood a cluster. We can gather user feedback from usage in the wild and always update at a later point.

method in each of these evaluators.

To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder) utility.
Sets of parameters from the parameter grid can be evaluated in parallel by setting `numParallelEval` with a value of 2 or more (a value of 1 will evaluate in serial) before running model selection with `CrossValidator` or `TrainValidationSplit`.
Contributor

I think it's a little clearer to say something like "By default, sets of parameters ... will be evaluated in serial. Parameter evaluation can be done in parallel by setting ..."

To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder) utility.
Sets of parameters from the parameter grid can be evaluated in parallel by setting `numParallelEval` with a value of 2 or more (a value of 1 will evaluate in serial) before running model selection with `CrossValidator` or `TrainValidationSplit`.
The value of `numParallelEval` should be chosen carefully to maximize parallelism without exceeding cluster resources, and will be capped at the number of cores in the driver system. Generally speaking, a value up to 10 should be sufficient for most clusters.

Contributor

Also will need to mention that custom ExecutorService can be specified, and some detail on the default thread pool it creates (and that it is a new separate pool to avoid blocking any of the default Scala pools).

Member Author

Since that API is marked as experimental, maybe it would be better to not document right away until we are sure this is what we need?

method in each of these evaluators.

To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder) utility.
Sets of parameters from the parameter grid can be evaluated in parallel by setting `numParallelEval` with a value of 2 or more (a value of 1 will evaluate in serial) before running model selection with `CrossValidator` or `TrainValidationSplit`.
Contributor

We should add a note that this only works in Scala/Java, not Python (or R)

*/
@Experimental
@InterfaceStability.Unstable
def setExecutorService(getExecutorService: (Int) => ExecutorService): Unit = {
Contributor

I think I prefer a simpler API:

 def setExecutorService(executorService: ExecutorService): Unit = {

Not sure the function version will work nicely with Java.

I think the idea should be that setting numParallelEval will specify the number of threads for the default thread-pool, while a custom one can be set with this method (advanced usage). If set, the custom one will override.
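
A sketch of the override semantics described above, using a hypothetical getExecutionContext helper (not the PR's actual code): a user-supplied ExecutorService wins, otherwise a fixed pool sized by the parallelism param is created.

```scala
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.ExecutionContext

// Hypothetical helper: prefer a user-supplied ExecutorService; otherwise
// build a default fixed pool sized by the parallelism param.
def getExecutionContext(
    userExecutor: Option[ExecutorService],
    numParallelEval: Int): ExecutionContext = {
  val service = userExecutor.getOrElse(Executors.newFixedThreadPool(numParallelEval))
  ExecutionContext.fromExecutorService(service)
}
```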

Member Author

I think you are probably right about the Java thing, I'll test that out.

custom one can be set with this method (advanced usage). If set, the custom one will override.

That's how this works: the executor service can be overridden to set any size thread pool; it is just passed the numParallelEval param for convenience in case it wants to use it.

Member Author

One reason I did not go with the simpler API that you mentioned is that it then requires the ExecutorService to be instantiated with a default and exist for the life of the CrossValidator or TrainValidationSplit. The way it is currently, by default the ExecutorService is created at the beginning of fit when it is needed, and then cleaned up at the end.

Member Author

Ah, I see what you are getting at. Even if I override with a custom ExecutorService but leave numParallelEval at 1, it will not use the custom one... I can fix that.

*
* @group param
*/
val numParallelEval: IntParam = new IntParam(this, "numParallelEval",
Contributor

Perhaps set this as group expertParam for now

i += 1

// Fit models in a Future with thread-pool size determined by '$numParallelEval'
val models = epm.map { paramMap =>
Contributor

If we use the implicit val above we can do:

    val models = epm.map { paramMap =>
      Future { est.fit(trainingDataset, paramMap) }.mapTo[Model[_]]
    }

} (executionContext)
}

Future.sequence[Model[_], Iterable](models)(implicitly, executionContext).onComplete { _ =>
Contributor

Likewise:

    Future.sequence[Model[_], Iterable](models).onComplete { _ =>
      trainingDataset.unpersist()
    }

val metric = eval.evaluate(model.transform(validationDataset, paramMap))
logDebug(s"Got metric $metric for model trained with $paramMap.")
metric
} (executionContext)
Contributor

Can then remove the (executionContext) here and elsewhere


// Evaluate models in a Future with thread-pool size determined by '$numParallelEval'
val foldMetricFutures = models.zip(epm).map { case (modelFuture, paramMap) =>
modelFuture.flatMap { model =>
Contributor

Note we could use a for comprehension here. But I tried it and it doesn't really make it all that much simpler, and the explicit flatMap and Future here make it a bit clearer.
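
For comparison, the for-comprehension form would look roughly like this, assuming the names from the quoted hunk (models, epm, eval, validationDataset, executionContext) are in scope:

```scala
import scala.concurrent.{ExecutionContext, Future}

implicit val ec: ExecutionContext = executionContext // as in the diff

val foldMetricFutures = models.zip(epm).map { case (modelFuture, paramMap) =>
  for {
    model  <- modelFuture // Future[Model[_]]
    metric <- Future(eval.evaluate(model.transform(validationDataset, paramMap)))
  } yield metric
}
```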


val parentSerial = cvSerialModel.bestModel.parent.asInstanceOf[LogisticRegression]
val parentParallel = cvParallelModel.bestModel.parent.asInstanceOf[LogisticRegression]
assert(parentSerial.getRegParam === parentParallel.getRegParam)
Contributor

Does it make sense to also check uid equality here?

Member Author

It should probably be done in the test that already runs checkCopy on line 62 (at least until we clean up these basic checks). I'll take a look at that.

trainingDataset.unpersist()
} (executionContext)

// Evaluate models concurrently, limited by a barrier with '$numParallelEval' permits
Contributor

This comment is stale

}

// Wait for metrics to be calculated before unpersisting validation dataset
val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
Contributor

Does it make sense to also use sequence here?

    val metrics = (ThreadUtils.awaitResult(
      Future.sequence[Double, Iterable](metricFutures), Duration.Inf)).toArray

Member Author

I thought about that, but since it's a blocking call anyway, it will still be bound by the longest running thread.

Contributor

Sure, not a big deal either way

@SparkQA

SparkQA commented Aug 23, 2017

Test build #81046 has finished for PR 16774 at commit 658aacb.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2017

Test build #81047 has finished for PR 16774 at commit 2c73b0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

@MLnick @WeichenXu123 I updated to use the trait HasParallel and fixed up some of the docs, please take another look, thanks!

@BryanCutler
Member Author

ping @MLnick , does this look ok to merge?

Contributor

@WeichenXu123 WeichenXu123 left a comment

LGTM except a minor comment, thanks!

.setEstimatorParamMaps(lrParamMaps)
.setEvaluator(eval)
.setNumFolds(2)
.setParallelism(1)
Contributor

add .setSeed(XXX)

Member Author

So the seed param here is fixed by default and doesn't need to be set to ensure consistent results. I think that's why it's not set in the other tests in this suite. I'm not a fan of this behavior and I think it's better to set it explicitly in tests, but then we should probably be consistent and set it elsewhere too. What are your thoughts on this @MLnick?

Contributor

OK I agree.

Contributor

Yeah seed defaults to a hash of the class name. There has been debate over this (see SPARK-16832). Personally I also don't like that behavior, but for now that's what it is.

.setNumFolds(2)
.setParallelism(1)
val cvSerialModel = cv.fit(dataset)
cv.setParallelism(2)
Member

@felixcheung felixcheung Aug 31, 2017

how do we validate setParallelism is parallelizing?
I mean - can we test and validate that it is functionally working?

Member Author

It's a little difficult to do this in a unit test without making it flaky. I have run tests manually and verified it is working, both by the expected speedup in timing and by the expected number of tasks running concurrently. I can post some results if that would help.

Contributor

@BryanCutler may be worth posting the result to the JIRA for posterity.

Contributor

@WeichenXu123 WeichenXu123 left a comment

LGTM except minor comments. Thanks!

i += 1

// Fit models in a Future for training in parallel
val models = epm.map { paramMap =>
Contributor

I think using the variable name modelFutures instead of models would be clearer.

}.transpose.map(_.sum)

// Calculate average metric over all splits
f2jBLAS.dscal(numModels, 1.0 / $(numFolds), metrics, 1)
Contributor

I don't like this line of code because it uses a low-level API, which makes the code difficult to read.
And this is not a bottleneck.
So I think we can simply use:

val metrics = ...
   ....
   .transpose.map(_.sum / $(numFolds))

instead.
What do you think about it?

Contributor

I must say I wondered why the author of this bothered with a BLAS call to do the computation... even if there were a few thousand param combinations it would still not be significant overhead. +1

Member Author

Yeah, I agree. I'll go ahead and make that change here.
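
A sketch of the suggested non-BLAS averaging, assuming a hypothetical splitMetrics: Seq[Seq[Double]] holding one metric sequence per fold and numFolds in scope:

```scala
// Average each parameter setting's metric across folds without a BLAS call.
// splitMetrics: Seq[Seq[Double]] -- one metric per param map, per fold.
val metrics: Array[Double] =
  splitMetrics.transpose.map(_.sum / numFolds).toArray
```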

logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
metrics(i) += metric
i += 1
val models = epm.map { paramMap =>
Contributor

models ==> modelFutures

val metrics = new Array[Double](epm.length)

// Create execution context based on $(parallelism)
val executionContext = getExecutionContext
Contributor

In the corresponding PR for the PySpark implementation, the number of threads is limited by the number of models to be trained (https://github.com/WeichenXu123/spark/blob/be2f3d0ec50db4730c9e3f9a813a4eb96889f5b6/python/pyspark/ml/tuning.py#L261). We might do that, for instance, by overriding the getParallelism method. What do you think about this?
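
A one-line sketch of that capping idea, with parallelism and epm assumed in scope (hypothetical placement; the cap could live wherever the thread pool is sized):

```scala
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// Never spin up more threads than there are param maps to evaluate.
val poolSize = math.min(parallelism, epm.length)
val executionContext = ExecutionContext.fromExecutorService(
  Executors.newFixedThreadPool(poolSize))
```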

Contributor

@MLnick MLnick left a comment

LGTM pending the minor outstanding comments from @WeichenXu123

@jkbradley
Member

I think #19110 is ready to merge now. @BryanCutler @WeichenXu123 do you have a preference for which gets merged first?

@BryanCutler
Member Author

If you wouldn't mind, please merge this first (once it passes Jenkins). I had already updated this with the HasParallelism trait after waiting on OneVsRest for a while. Thanks!

@SparkQA

SparkQA commented Sep 5, 2017

Test build #81417 has finished for PR 16774 at commit 7a8221b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

This should be ready to merge @jkbradley @MLnick

@MLnick
Contributor

MLnick commented Sep 6, 2017

LGTM. Merged this to master. Thanks @BryanCutler and everyone for reviewing!

(FYI merged this before #19110 since in this PR the doc in the trait is a little more detailed which is slightly better IMO).

@asfgit asfgit closed this in 16c4c03 Sep 6, 2017
@WeichenXu123
Contributor

@BryanCutler @MLnick I found a bug in this PR: after saving the estimator (CV or TVS) and then loading it again, the "parallelism" setting will be lost. But I fixed this in #19208 by the way.

@jkbradley
Member

@WeichenXu123 Thanks for finding that bug! Can you please separate out your bugfix? It's good to get fixes in, rather than attaching them to PRs which may require discussion, so that we make sure that bugs don't slip into the next release.

@BryanCutler
Member Author

@WeichenXu123 , it would be great if you could separate out the bugfix. I looked in #19208 but couldn't find what you were referring to.

@WeichenXu123
Contributor

OK, I will open a separate PR. Thanks!

ghost pushed a commit to dbtsai/spark that referenced this pull request Dec 25, 2017
… in fitting

## What changes were proposed in this pull request?

Via some tests I found CrossValidator still has a memory issue: it will still occupy `O(n*sizeof(model))` memory for holding models when fitting; if well optimized, it should be `O(parallelism*sizeof(model))`.

This is because modelFutures will hold the reference to the model object after the future is complete (we can use `future.value.get.get` to fetch it), and the `Future.sequence` and the `modelFutures` array hold references to each model future. So all model objects stay referenced, and it will still occupy `O(n*sizeof(model))` memory.

I fix this by merging the `modelFuture` and `foldMetricFuture` together, and use an `AtomicInteger` to count completed fitting tasks; when all are done, it triggers `trainingDataset.unpersist`.

I previously commented on this issue on the old PR [SPARK-19357]
apache#16774 (review)
Unfortunately, at that time I did not realize that the issue still existed, but now I have confirmed it and created this PR to fix it.

## Discussion
I give 3 approaches which we can compare; after discussion I realized none of them is ideal, so we have to make a trade-off.

**After discussion with jkbradley , choose approach 3**

### Approach 1
~~The approach proposed by MrBago at~~ apache#19904 (comment)
~~This approach resolves the model object reference issue, allowing the model objects to be GCed in time. **BUT, in some cases, it still does not resolve the O(N) model memory occupation issue**. Let me use an extreme case to describe it:~~
~~Suppose we set `parallelism = 1` and there are 100 paramMaps, so we have 100 fitting & evaluation tasks. In this approach, because `parallelism = 1`, the code has to wait for all 100 fitting tasks to complete **(at this point the memory occupied by models already reaches 100 * sizeof(model))**, and only then will it unpersist the training dataset and run the 100 evaluation tasks.~~

### Approach 2
~~This approach is the old version of my PR's code~~ apache@2cc7c28
~~This approach can make sure that in any case the peak memory occupied by models is `O(numParallelism * sizeof(model))`, but it has an issue: in some extreme cases, the "unpersist training dataset" will be delayed until most of the evaluation tasks complete. Suppose `parallelism = 1` and there are 100 fitting & evaluation tasks; each fitting & evaluation task has to be executed one by one, so the "unpersist training dataset" will only be triggered after the first 99 fitting & evaluation tasks and the 100th fitting task complete.~~

### Approach 3
After I compared approach 1 and approach 2, I realized that, in the case where parallelism is low but there are many fitting & evaluation tasks, we cannot achieve both of the following two goals:
- Make the peak memory occupied by models (driver-side) O(parallelism * sizeof(model))
- Unpersist the training dataset before most of the evaluation tasks start.

So I vote for a simpler approach: move the unpersist of the training dataset to the end (does this really matter?).
Because goal 1 is more important, we must make sure the peak memory occupied by models (driver-side) is O(parallelism * sizeof(model)); otherwise it brings a high risk of OOM.
Like the following code:
```
      val foldMetricFutures = epm.zipWithIndex.map { case (paramMap, paramIndex) =>
        Future[Double] {
          val model = est.fit(trainingDataset, paramMap).asInstanceOf[Model[_]]
          //...other minor codes
          val metric = eval.evaluate(model.transform(validationDataset, paramMap))
          logDebug(s"Got metric metricformodeltrainedwithparamMap.")
          metric
        } (executionContext)
      }
      val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
      trainingDataset.unpersist() // <------- unpersist at the end
      validationDataset.unpersist()
```

## How was this patch tested?

N/A

Author: WeichenXu <[email protected]>

Closes apache#19904 from WeichenXu123/fix_cross_validator_memory_issue.
