[SPARK-18724][ML] Add TuningSummary for TrainValidationSplit and CrossValidator #16158

hhbyyh · 2016-12-05T22:25:07Z

What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-18724

Currently TrainValidationSplitModel exposes tuning metrics only as an Array[Double], which makes it hard to match each metric back to the paramMap that produced it and hurts the user experience of the tuning framework.
This PR adds a Tuning Summary to present the tuning metrics more clearly. For now, the idea is to use a DataFrame listing all the params and their corresponding metrics.

The Tuning Summary class can be further extended for CrossValidator.
We can also add training time statistics and metric ranks to the DataFrame if that sounds good.

Update:

To support pipeline estimator, change the tuning summary column name to include full param reference:

How was this patch tested?

existing and new unit tests

hhbyyh · 2016-12-05T22:30:58Z

mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala

+  def summary: TuningSummary = trainingSummary.getOrElse {
+    throw new SparkException(
+      s"No training summary available for the ${this.getClass.getSimpleName}")
+  }


I'm thinking we should add a new trait hasSummary to wrap the summary-related code. I can create another jira if that's reasonable.

addressed in #17654
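
For illustration, the trait idea mentioned above could look roughly like the sketch below. This is not the code from this PR or from #17654: the names HasSummary, setSummary and DemoModel are hypothetical, RuntimeException stands in for SparkException, and getName is used instead of getSimpleName so the snippet runs without Spark on the classpath.

```scala
// Hypothetical trait; HasSummary, setSummary and DemoModel are illustrative
// names and are not taken from the Spark code base.
trait HasSummary[S] {
  // The summary is absent until fitting attaches one.
  private var summaryOpt: Option[S] = None

  def setSummary(s: Option[S]): this.type = {
    summaryOpt = s
    this
  }

  def hasSummary: Boolean = summaryOpt.nonEmpty

  // Mirrors the accessor in the diff above: fail loudly when nothing is attached.
  // RuntimeException is a stand-in for SparkException.
  def summary: S = summaryOpt.getOrElse {
    throw new RuntimeException(
      s"No training summary available for the ${this.getClass.getName}")
  }
}

// Stand-in model type used only to exercise the trait.
class DemoModel extends HasSummary[Seq[Double]]
```

With a trait like this, each model would mix in the accessor once instead of repeating the hasSummary and summary definitions.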

SparkQA · 2016-12-05T23:40:59Z

Test build #69690 has finished for PR 16158 at commit 425a419.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh · 2016-12-08T07:44:27Z

@MLnick Does this match your thoughts? Appreciate your opinions.

MLnick

Will go through in more detail but just a quick comment about needing the DataFrame ref in the summary ctor.

MLnick · 2016-12-12T14:33:02Z

mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala

    val bestModel = est.fit(dataset, epm(bestIndex)).asInstanceOf[Model[_]]
-    copyValues(new TrainValidationSplitModel(uid, bestModel, metrics).setParent(this))
+    val model = copyValues(new TrainValidationSplitModel(uid, bestModel, metrics).setParent(this))
+    val summary = new TuningSummary(bestModel.transform(dataset), epm, metrics, bestIndex)


It seems wasteful to do bestModel.transform(dataset) just to get access to the sqlContext. Is it really necessary?

Indeed that's not necessary. I just replaced it with SparkSession.builder().getOrCreate(). Is there a better way to get the default contexts? Thanks

SparkQA · 2016-12-13T22:32:29Z

Test build #70097 has finished for PR 16158 at commit bd18c00.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-15T12:35:28Z

Test build #71393 has finished for PR 16158 at commit bd18c00.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2017-02-22T06:53:52Z

Test build #73262 has finished for PR 16158 at commit 2a0af1d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2017-04-06T10:27:46Z

Sorry this slipped! I'd like to revisit soon after 2.2 settles down.

I think we may need to consider how this integrates with training / evaluation summaries to create a holistic solution (see SPARK-19053)

hhbyyh · 2017-06-30T23:28:01Z

@MLnick Thanks for your attention. I'm not sure if SPARK-19053 is still active and maybe it's not a blocking issue for this change. If you don't mind, I'll extend the jira/PR scope to involve CrossValidator to have an integrated improvement.

MLnick · 2017-07-03T12:24:07Z

Yeah maybe do the CV one in this PR too.

SparkQA · 2017-07-06T00:34:54Z

Test build #79249 has finished for PR 16158 at commit 0a698fe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-07-06T01:05:46Z

Test build #79251 has finished for PR 16158 at commit bd459b1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh · 2017-07-06T18:33:01Z

add tuning summary for crossValidator.

SparkQA · 2017-07-24T23:40:57Z

Test build #79919 has finished for PR 16158 at commit c0bc81a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2017-08-03T12:55:27Z

mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala

+  def hasSummary: Boolean = trainingSummary.nonEmpty
+
+  /**
+   * Gets summary of model on training set. An exception is


Should probably rather be "summary of model performance on the validation set"?

MLnick · 2017-08-03T12:56:15Z

mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

+  def hasSummary: Boolean = trainingSummary.nonEmpty
+
+  /**
+   * Gets summary of model on training set. An exception is


Likewise, "cross-validation performance of each model" or similar?

MLnick · 2017-08-03T12:57:19Z

mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala

-    copyValues(new CrossValidatorModel(uid, bestModel, metrics).setParent(this))
+    val model = new CrossValidatorModel(uid, bestModel, metrics).setParent(this)
+    val summary = new TuningSummary(epm, metrics, bestIndex)
+    model.setSummary(Some(summary))


Just to confirm, the tuning summary will not be saved? Since it's a small dataframe, perhaps we should consider saving it with the model? (Can do that in a later PR however)

If we want to just save the tuning summary in the model, perhaps we can just discard the TuningSummary, and add a tuningSummary: DataFrame field/function in the models. Sounds good?

Are there other obvious things that might go into the summary in future, that would make a TuningSummary class a better fit?

Future support for say, multiple metrics, could simply extend the dataframe columns so that is ok. But is there anything else you can think of?

There might be something like a detailed training log and the training time for each model. But I think the current Summary pattern has some room for improvement (e.g., save/load and API); it makes me feel bad when I have to duplicate code like def hasSummary: Boolean = trainingSummary.nonEmpty. Thus saving it to the models sounds like a good idea to me.

The latest implementation does not need to save the extra dataframe, since the dataframe can be generated from $(estimatorParamMaps) and avgMetrics.
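
A minimal sketch of that derivation, in plain Scala without Spark: SimpleParamMap is a simplified stand-in for ParamMap, tuningTable is a hypothetical helper, and the metric column name is passed in by the caller. It assumes every param map has the same keys as the first one, as the code under review did at this point.

```scala
object TuningTableSketch {
  // Simplified stand-in for Spark's ParamMap: param name -> value.
  type SimpleParamMap = Map[String, Any]

  /** Pairs each param map with its metric and returns (header, rows). */
  def tuningTable(
      paramMaps: Seq[SimpleParamMap],
      metrics: Seq[Double],
      metricName: String): (Seq[String], Seq[Seq[String]]) = {
    require(paramMaps.length == metrics.length, "one metric per param map")
    // Column order comes from the first param map, sorted by name.
    val names = paramMaps.headOption.map(_.keys.toSeq.sorted).getOrElse(Seq.empty)
    val header = names :+ metricName
    val rows = paramMaps.zip(metrics).map { case (paramMap, metric) =>
      // Assumption: every map contains every name; paramMap(n) throws otherwise.
      names.map(n => paramMap(n).toString) :+ metric.toString
    }
    (header, rows)
  }
}
```

Because the table is a pure function of the param maps and the metrics, nothing extra has to be persisted with the model; it can be rebuilt on demand.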

MLnick · 2017-08-03T12:59:15Z

mllib/src/main/scala/org/apache/spark/ml/tuning/TuningSummary.scala

+    val spark = SparkSession.builder().getOrCreate()
+    val sqlContext = spark.sqlContext
+    val sc = spark.sparkContext
+    val fields = params(0).toSeq.sortBy(_.param.name).map(_.param.name) ++ Seq("metrics")


"metrics" is a bit generic. Perhaps it's better (and more user-friendly) to make this be something like metric_name metric so that it's obvious what metric was being optimized for? such as ROC metric or AUC metric or MSE metric? etc

MLnick · 2017-08-03T13:00:58Z

mllib/src/main/scala/org/apache/spark/ml/tuning/TuningSummary.scala

+private[tuning] class TuningSummary private[tuning](
+    private[tuning] val params: Array[ParamMap],
+    private[tuning] val metrics: Array[Double],
+    private[tuning] val bestIndex: Int) {


It appears bestIndex is never used?

MLnick · 2017-08-03T13:01:35Z

mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala

+    }
+    assert(cvModel.summary.trainingMetrics.collect().toSet === expected.toSet)
+  }
+


Shall we add a test for the exception being thrown if no summary?

MLnick · 2017-08-03T13:02:36Z

@hhbyyh sorry for the delay. Left a few review comments.

Tested the examples and it looks cool! Very useful

hhbyyh · 2017-08-09T22:25:12Z

Moved the tuningSummary to Models and updated the name of the metrics column.

SparkQA · 2017-08-09T23:30:02Z

Test build #80467 has finished for PR 16158 at commit 72aea62.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

WeichenXu123

Thanks for the PR! I left some comments.

WeichenXu123 · 2017-09-07T13:46:27Z

mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala

+    val rows = sc.parallelize(params.zip(metrics)).map { case (param, metric) =>
+      val values = param.toSeq.sortBy(_.param.name).map(_.value.toString) ++ Seq(metric.toString)
+      Row.fromSeq(values)
+    }


The variable names here are a little confusing. Renaming
params ==> paramMaps
case (param, metric) ==> case (paramMap, metric)
would be clearer.

WeichenXu123 · 2017-09-07T13:56:18Z

mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala

+    val fields = params(0).toSeq.sortBy(_.param.name).map(_.param.name) ++ Seq(metricName)
+    val schema = new StructType(fields.map(name => StructField(name, StringType)).toArray)
+    val rows = sc.parallelize(params.zip(metrics)).map { case (param, metric) =>
+      val values = param.toSeq.sortBy(_.param.name).map(_.value.toString) ++ Seq(metric.toString)


There seems to be a problem here:
Suppose params(0) (which is a ParamMap) contains ParamA and ParamB,
and params(1) (which is a ParamMap) contains ParamA and ParamC.
The code here will run into problems, because you compose the row values sorted by param name but do not check whether every row exactly matches the first row.
I think a better way is to go through the whole ParamMap list, collect all params used, and sort them by name to form the dataframe schema.

Also, using param_value.toString here converts some array-type params to unreadable strings.
For example, for a DoubleArrayParam, doubleArray.toString will become "[DXXXXX".
Using Param.jsonEncode is better.

Thanks, we should support the case for custom paramMap.
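
The two points raised in this thread can be sketched in plain Scala as follows. This is an illustration under assumptions, not the PR's code: SimpleParamMap stands in for ParamMap, render and unionTable are hypothetical helpers, mkString is used as a simpler stand-in for the suggested Param.jsonEncode, and filling missing params with an empty string is an arbitrary choice.

```scala
object UnionSchemaSketch {
  // Simplified stand-in for Spark's ParamMap: param name -> value.
  type SimpleParamMap = Map[String, Any]

  // Calling toString on a JVM array yields an identity string such as "[D@...",
  // so array elements are rendered explicitly instead.
  def render(value: Any): String = value match {
    case arr: Array[_] => arr.mkString("[", ",", "]")
    case other => other.toString
  }

  /** Builds the schema from the union of all param names, sorted by name. */
  def unionTable(paramMaps: Seq[SimpleParamMap]): (Seq[String], Seq[Seq[String]]) = {
    val names = paramMaps.flatMap(_.keys).distinct.sorted
    val rows = paramMaps.map { paramMap =>
      // Assumption: a param missing from this map is shown as an empty cell.
      names.map(n => paramMap.get(n).map(render).getOrElse(""))
    }
    (names, rows)
  }
}
```

With the ParamA/ParamB and ParamA/ParamC case described above, both rows then share the three-column schema ParamA, ParamB, ParamC instead of being misaligned against the first row.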

SparkQA · 2017-09-11T06:38:25Z

Test build #81622 has finished for PR 16158 at commit 297091f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh · 2017-09-11T17:13:28Z

Update:

To support pipeline estimator, change the tuning summary column name to include full param reference:

hhbyyh · 2018-03-21T03:55:30Z

Please advise if this is a good feature to add. If not, I'll close it. Thanks.

SparkQA · 2018-07-30T00:12:31Z

Test build #93758 has finished for PR 16158 at commit 4aef3aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh · 2018-07-30T03:28:44Z

Gentle ping @MLnick. Thanks for the review. I'd appreciate it if you have some time for further comments.

WeichenXu123 · 2019-10-20T14:08:20Z

@hhbyyh This PR is stale. If nobody is interested in it and there are no further updates, would you mind closing it? Thanks!

YY-OnCall added 4 commits December 2, 2016 16:49

tuning summary

d1e22d5

Merge remote-tracking branch 'upstream/master' into tuningsummary

a7cfa63

add ut

ad73c12

add comments

425a419

hhbyyh commented Dec 5, 2016

MLnick reviewed Dec 12, 2016

get default spark session

bd18c00

resolve merge conflict

2a0af1d

YY-OnCall added 3 commits July 5, 2017 12:29

merge conflict

1a594d0

support cross validation

0a698fe

update version

bd459b1

hhbyyh changed the title ~~[SPARK-18724][ML] Add TuningSummary for TrainValidationSplit~~ [SPARK-18724][ML] Add TuningSummary for TrainValidationSplit and CountVectorizer Jul 6, 2017

YY-OnCall added 2 commits July 24, 2017 12:19

Merge remote-tracking branch 'upstream/master' into tuningsummary

4e3e19c

improve unit test

c0bc81a

hhbyyh mentioned this pull request Aug 2, 2017

[SPARK-21087] [ML] CrossValidator, TrainValidationSplit should preserve all models after fitting: Scala #18313

Closed

MLnick reviewed Aug 3, 2017

Merge remote-tracking branch 'upstream/master' into tuningsummary

bbf3f9f

YY-OnCall added 2 commits August 9, 2017 10:55

Merge remote-tracking branch 'upstream/master' into tuningsummary

b6a7c53

remove TuningSummary

72aea62

WeichenXu123 reviewed Sep 7, 2017

YY-OnCall added 2 commits September 10, 2017 09:37

Merge remote-tracking branch 'upstream/master' into tuningsummary

91da358

update for pipeline

297091f

Merge remote-tracking branch 'upstream/master' into tuningsummary

670467a

hhbyyh changed the title ~~[SPARK-18724][ML] Add TuningSummary for TrainValidationSplit and CountVectorizer~~ [SPARK-18724][ML] Add TuningSummary for TrainValidationSplit and CrossValidator Dec 10, 2017

YY-OnCall added 2 commits December 28, 2017 16:32

Merge remote-tracking branch 'upstream/master' into tuningsummary

36b1dd5

Merge remote-tracking branch 'upstream/master' into tuningsummary

5844e0c

YY-OnCall added 5 commits July 23, 2018 10:42

merge conflict

8c829b5

Merge remote-tracking branch 'upstream/master' into tuningsummary

4aaf8e5

Merge remote-tracking branch 'upstream/master' into tuningsummary

ceaad1c

Merge remote-tracking branch 'upstream/master' into tuningsummary

41c4c12

remove sort add comments

4aef3aa

dongjoon-hyun added the ML label Jun 14, 2019

hhbyyh closed this Oct 26, 2019
