Contributor

@hhbyyh hhbyyh commented Dec 5, 2016

What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-18724

Currently TrainValidationSplitModel only provides tuning metrics as an Array[Double], which makes it hard to match each metric back to the paramMap that produced it and hurts the user experience of the tuning framework.
Add a Tuning Summary to present the tuning metrics more clearly. For now, the idea is to use a DataFrame listing all the params and their corresponding metrics.

image

The Tuning Summary Class can be further extended for CrossValidator.
We can also add training time statistics and metrics rank to the data frame if that sounds good.

Update:

To support pipeline estimator, change the tuning summary column name to include full param reference:
image

How was this patch tested?

existing and new unit tests

def summary: TuningSummary = trainingSummary.getOrElse {
  throw new SparkException(
    s"No training summary available for the ${this.getClass.getSimpleName}")
}
Contributor Author

@hhbyyh hhbyyh Dec 5, 2016

I'm thinking we should add a new trait hasSummary to wrap the summary-related code. I can create another jira if that's reasonable.

Contributor Author

addressed in #17654


SparkQA commented Dec 5, 2016

Test build #69690 has finished for PR 16158 at commit 425a419.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

hhbyyh commented Dec 8, 2016

@MLnick Does this match your thoughts? Appreciate your opinions.

Contributor

@MLnick MLnick left a comment

Will go through in more detail but just a quick comment about needing the DataFrame ref in the summary ctor.

val bestModel = est.fit(dataset, epm(bestIndex)).asInstanceOf[Model[_]]
copyValues(new TrainValidationSplitModel(uid, bestModel, metrics).setParent(this))
val model = copyValues(new TrainValidationSplitModel(uid, bestModel, metrics).setParent(this))
val summary = new TuningSummary(bestModel.transform(dataset), epm, metrics, bestIndex)
Contributor

It seems wasteful to do bestModel.transform(dataset) just to get access to the sqlContext. Is it really necessary?

Contributor Author

@hhbyyh hhbyyh Dec 13, 2016

Indeed that's not necessary. I just replaced it with SparkSession.builder().getOrCreate(). Is there a better way to get the default contexts? Thanks


SparkQA commented Dec 13, 2016

Test build #70097 has finished for PR 16158 at commit bd18c00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 15, 2017

Test build #71393 has finished for PR 16158 at commit bd18c00.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 22, 2017

Test build #73262 has finished for PR 16158 at commit 2a0af1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

MLnick commented Apr 6, 2017

Sorry this slipped! I'd like to revisit soon after 2.2 settles down.

I think we may need to consider how this integrates with training / evaluation summaries to create a holistic solution (see SPARK-19053)

Contributor Author

hhbyyh commented Jun 30, 2017

@MLnick Thanks for your attention. I'm not sure if SPARK-19053 is still active and maybe it's not a blocking issue for this change. If you don't mind, I'll extend the jira/PR scope to involve CrossValidator to have an integrated improvement.

Contributor

MLnick commented Jul 3, 2017

Yeah maybe do the CV one in this PR too.


SparkQA commented Jul 6, 2017

Test build #79249 has finished for PR 16158 at commit 0a698fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 6, 2017

Test build #79251 has finished for PR 16158 at commit bd459b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh hhbyyh changed the title from "[SPARK-18724][ML] Add TuningSummary for TrainValidationSplit" to "[SPARK-18724][ML] Add TuningSummary for TrainValidationSplit and CountVectorizer" on Jul 6, 2017
Contributor Author

hhbyyh commented Jul 6, 2017

Added tuning summary for CrossValidator.


SparkQA commented Jul 24, 2017

Test build #79919 has finished for PR 16158 at commit c0bc81a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def hasSummary: Boolean = trainingSummary.nonEmpty

/**
* Gets summary of model on training set. An exception is
Contributor

Should probably rather be "summary of model performance on the validation set"?

def hasSummary: Boolean = trainingSummary.nonEmpty

/**
* Gets summary of model on training set. An exception is
Contributor

Likewise, "cross-validation performance of each model" or similar?

copyValues(new CrossValidatorModel(uid, bestModel, metrics).setParent(this))
val model = new CrossValidatorModel(uid, bestModel, metrics).setParent(this)
val summary = new TuningSummary(epm, metrics, bestIndex)
model.setSummary(Some(summary))
Contributor

Just to confirm, the tuning summary will not be saved? Since it's a small dataframe, perhaps we should consider saving it with the model? (Can do that in a later PR however)

Contributor Author

@hhbyyh hhbyyh Aug 3, 2017

If we want to just save the tuning summary in the model, perhaps we can just discard the TuningSummary, and add a tuningSummary: DataFrame field/function in the models. Sounds good?

Contributor

Are there other obvious things that might go into the summary in future, that would make a TuningSummary class a better fit?

Future support for say, multiple metrics, could simply extend the dataframe columns so that is ok. But is there anything else you can think of?

Contributor Author

There might be something like a detailed training log and the training time for each model. But I think the current Summary pattern has some room for improvement (e.g., save/load and API); I don't like having to duplicate code such as def hasSummary: Boolean = trainingSummary.nonEmpty. So saving it to the models sounds like a good idea to me.

Contributor Author

The latest implementation does not need to save the extra dataframe, since the dataframe can be generated from $(estimatorParamMaps) and avgMetrics.
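As a rough illustration of why nothing extra needs to be persisted, the sketch below pairs each parameter setting with its metric. It is plain Scala, not the code in this PR: `TuningSummarySketch`, `ParamSetting` and `summaryRows` are hypothetical names, and a `Map[String, Any]` stands in for Spark's `ParamMap`.

```scala
// Hypothetical stand-in for Spark's ParamMap: param name -> value.
object TuningSummarySketch {
  type ParamSetting = Map[String, Any]

  /** Pairs each parameter setting with its metric, one row per candidate model. */
  def summaryRows(
      paramMaps: Seq[ParamSetting],
      metrics: Seq[Double]): Seq[(ParamSetting, Double)] = {
    // The two sequences are index-aligned, so a length mismatch is a caller error.
    require(paramMaps.length == metrics.length, "one metric per param map expected")
    paramMaps.zip(metrics)
  }
}

// Usage: two candidate settings and their validation metrics.
val summaryRowsExample = TuningSummarySketch.summaryRows(
  Seq(Map("regParam" -> 0.1), Map("regParam" -> 0.01)),
  Seq(0.82, 0.87))
```

Because the rows depend only on the two inputs, they can be rebuilt on demand after a model is loaded.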

val spark = SparkSession.builder().getOrCreate()
val sqlContext = spark.sqlContext
val sc = spark.sparkContext
val fields = params(0).toSeq.sortBy(_.param.name).map(_.param.name) ++ Seq("metrics")
Contributor

"metrics" is a bit generic. Perhaps it's better (and more user-friendly) to name this column something like "metric_name metric", so that it's obvious which metric was being optimized, e.g. "ROC metric", "AUC metric" or "MSE metric"?

private[tuning] class TuningSummary private[tuning](
    private[tuning] val params: Array[ParamMap],
    private[tuning] val metrics: Array[Double],
    private[tuning] val bestIndex: Int) {
Contributor

It appears bestIndex is never used?

}
assert(cvModel.summary.trainingMetrics.collect().toSet === expected.toSet)
}

Contributor

Shall we add a test for the exception being thrown if no summary?
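A test along these lines might look like the following sketch. It is plain Scala under stated assumptions: `ModelSketch` is a hypothetical stand-in for the tuning model, the summary is a `String` rather than a `TuningSummary`, and `IllegalStateException` replaces the `SparkException` used in the PR. In the Spark test suites the same check would more likely use ScalaTest's `intercept`.

```scala
import scala.util.Try

// Hypothetical stand-in for a tuning model that may or may not carry a summary.
class ModelSketch(trainingSummary: Option[String]) {
  def hasSummary: Boolean = trainingSummary.nonEmpty

  // Mirrors the getOrElse-and-throw pattern from the PR, with a stand-in exception type.
  def summary: String = trainingSummary.getOrElse {
    throw new IllegalStateException(
      s"No training summary available for the ${this.getClass.getSimpleName}")
  }
}

// A model built without a summary should report that and fail on access.
val modelWithoutSummary = new ModelSketch(None)
assert(!modelWithoutSummary.hasSummary)
assert(Try(modelWithoutSummary.summary).isFailure)
```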

Contributor

MLnick commented Aug 3, 2017

@hhbyyh sorry for the delay. Left a few review comments.

Tested the examples and it looks cool! Very useful

Contributor Author

hhbyyh commented Aug 9, 2017

Moved the tuningSummary to the models and updated the name of the metrics column.
image


SparkQA commented Aug 9, 2017

Test build #80467 has finished for PR 16158 at commit 72aea62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@WeichenXu123 WeichenXu123 left a comment

Thanks for the PR! I left some comments.

val rows = sc.parallelize(params.zip(metrics)).map { case (param, metric) =>
  val values = param.toSeq.sortBy(_.param.name).map(_.value.toString) ++ Seq(metric.toString)
  Row.fromSeq(values)
}
Contributor

The variable names here are a little confusing. Renaming
params ==> paramMaps
case (param, metric) ==> case (paramMap, metric)
would make it clearer.

Contributor Author

OK

val fields = params(0).toSeq.sortBy(_.param.name).map(_.param.name) ++ Seq(metricName)
val schema = new StructType(fields.map(name => StructField(name, StringType)).toArray)
val rows = sc.parallelize(params.zip(metrics)).map { case (param, metric) =>
val values = param.toSeq.sortBy(_.param.name).map(_.value.toString) ++ Seq(metric.toString)
Contributor

There seems to be a problem here:
Suppose params(0) (which is a ParamMap) contains ParamA and ParamB,
and params(1) (which is a ParamMap) contains ParamA and ParamC.
The code here will run into problems, because you compose the row values sorted by param name but do not check whether every row exactly matches the first row.
I think a better way is to go through the whole ParamMap list, collect all params used, and sort them by name to form the dataframe schema.
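The suggestion can be sketched in plain Scala as follows. This is an illustration only, not the PR's code: `UnionSchemaSketch` and its methods are hypothetical names, a `Map[String, Any]` stands in for `ParamMap`, and representing a missing param as `None` is one possible choice among several.

```scala
object UnionSchemaSketch {
  // Hypothetical stand-in for Spark's ParamMap: param name -> value.
  type ParamSetting = Map[String, Any]

  /** All param names used by any setting, sorted so the column order is stable. */
  def columns(paramMaps: Seq[ParamSetting]): Seq[String] =
    paramMaps.flatMap(_.keys).distinct.sorted

  /** One row per setting, aligned to `columns`; params a setting lacks become None. */
  def rows(paramMaps: Seq[ParamSetting]): Seq[Seq[Option[String]]] = {
    val cols = columns(paramMaps)
    paramMaps.map(pm => cols.map(c => pm.get(c).map(_.toString)))
  }
}

// Usage: the two settings from the comment above, which share only paramA.
val mixedSettings: Seq[Map[String, Any]] = Seq(
  Map("paramA" -> 1, "paramB" -> 2),
  Map("paramA" -> 3, "paramC" -> 4))
val unionColumns = UnionSchemaSketch.columns(mixedSettings)
val alignedRows = UnionSchemaSketch.rows(mixedSettings)
```

Building the schema from the union of names keeps every row the same width, so a value can never end up under the wrong column.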

Contributor

Also, this uses param_value.toString, so some array-type params will be converted to an unreadable string.
For example, for a DoubleArrayParam, doubleArray.toString will become "[DXXXXX".
Using Param.jsonEncode would be better.
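The toString problem is easy to reproduce without Spark. The sketch below is plain Scala and does not call Param.jsonEncode; `ParamValueRendering.render` is a hypothetical helper that only shows the difference between an array's default toString and an element-wise rendering.

```scala
object ParamValueRendering {
  /** Renders a param value readably; arrays get their elements listed. */
  def render(value: Any): String = value match {
    case arr: Array[_] => arr.mkString("[", ",", "]")
    case other => other.toString
  }
}

// Default toString on a JVM array yields a type tag plus a hash code, not the contents.
val thresholds = Array(0.1, 0.2)
val unreadable = thresholds.toString
val readable = ParamValueRendering.render(thresholds)
```

Here `unreadable` starts with "[D", while `readable` lists the two elements.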

Contributor Author

Thanks, we should support the case for custom paramMap.


SparkQA commented Sep 11, 2017

Test build #81622 has finished for PR 16158 at commit 297091f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

hhbyyh commented Sep 11, 2017

Update:

To support pipeline estimator, change the tuning summary column name to include full param reference:
image

@hhbyyh hhbyyh changed the title from "[SPARK-18724][ML] Add TuningSummary for TrainValidationSplit and CountVectorizer" to "[SPARK-18724][ML] Add TuningSummary for TrainValidationSplit and CrossValidator" on Dec 10, 2017
Contributor Author

hhbyyh commented Mar 21, 2018

Please advise whether this is a good feature to add. If not, I'll close it. Thanks.


SparkQA commented Jul 30, 2018

Test build #93758 has finished for PR 16158 at commit 4aef3aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

hhbyyh commented Jul 30, 2018

Gentle ping @MLnick. Thanks for the review. I'd appreciate it if you have some time for further comments.

@WeichenXu123
Contributor

@hhbyyh This PR is stale. If nobody is interested in this and there are no further updates, would you mind closing it? Thanks!

@hhbyyh hhbyyh closed this Oct 26, 2019