
Conversation

@sethah (Contributor) commented Jul 22, 2016

What changes were proposed in this pull request?

This patch adds support for stratified sampling in cross validation for ML pipelines. It does so by modifying methods in StratifiedSamplingUtils to support multiple splits rather than a single subsample of the data, and by adding a randomSplitByKey method to PairRDDFunctions. See the detailed explanation below.

How was this patch tested?

Unit tests were added to PairRDDFunctionsSuite, MLUtilsSuite, CrossValidatorSuite, and TrainValidationSuite.

Algorithm changes

Currently, Spark implements stratified sampling on pair RDDs via the methods sampleByKeyExact and sampleByKey, which call a stratified sampling routine implemented in StratifiedSamplingUtils. The underlying algorithm is described here in the paper by Xiangrui Meng. When exact stratified samples are required, the algorithm makes an extra pass through the data. Each item is mapped onto the interval [0, 1] (for sampling without replacement); for, say, a 50% sample, we expect to split the interval at 0.5 and accept the items that fell below that threshold. Items near 0 are highly likely to be accepted, items near 1 are highly unlikely to be accepted, and items near 0.5 are uncertain, so they are added to a waitlist on the first pass. The waitlist is then sorted and used to determine the exact split point that produces a 50/50 split.

[figure: items mapped onto [0, 1], with a waitlist band around the 0.5 threshold]
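To make the single-threshold mechanism concrete, here is a plain-Python sketch of exact-fraction sampling with a waitlist. This is illustrative only, not the Spark/Scala implementation; the function name, band width (`margin`), and seed are assumptions.

```python
import random

def exact_sample(items, fraction, seed=42, margin=0.1):
    # One uniform draw per item: well below the threshold -> accept,
    # well above -> reject, inside the uncertain band -> waitlist.
    rng = random.Random(seed)
    target = round(len(items) * fraction)
    accepted, waitlist = [], []
    for item in items:
        x = rng.random()
        if x < fraction - margin:
            accepted.append(item)
        elif x < fraction + margin:
            waitlist.append((x, item))
    # Only the waitlist (a small fraction of the data) is ever sorted;
    # taking its smallest entries tops the sample up to exactly `target`.
    waitlist.sort()
    need = max(0, target - len(accepted))
    accepted.extend(item for _, item in waitlist[:need])
    return accepted

print(len(exact_sample(list(range(1000)), 0.5)))  # 500
```

The point of the waitlist is that only the draws near the threshold need sorting, rather than the whole dataset.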

This patch modifies the routine to produce multiple splits by generating multiple waitlists on the first pass. Each waitlist is sorted to determine the exact split points and then we can sample as normal.

[figure: multiple exact split points, each with its own waitlist band]

One potential concern is that for a large number of splits, the waitlists get closer and closer together and the approach may degrade to the point where sorting the entire dataset would be quicker. It could also cause OOM errors on the driver if too many waitlists are collected. Still, before this patch there was no way to take a single exact split of the data, since sampleByKey does not return the complement of the sample; this patch fixes that as well.
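The multi-waitlist idea can be sketched the same way in plain Python (again illustrative, not the patch's Scala code): one uniform draw per item, one waitlist per split boundary, and every item lands in exactly one split, so the complement of any sample comes for free.

```python
import random

def exact_splits(items, weights, seed=7, margin=0.1):
    rng = random.Random(seed)
    draws = [(rng.random(), item) for item in items]
    n = len(items)
    # Provisional boundaries at cumulative weights, e.g. [0.25, 0.5, 0.75].
    bounds, acc = [], 0.0
    for w in weights[:-1]:
        acc += w
        bounds.append(acc)
    # One waitlist per boundary: sort only the draws that fell near it,
    # then pick the cutoff that makes the cumulative counts exact.
    cutoffs = []
    for b in bounds:
        waitlist = sorted(x for x, _ in draws if abs(x - b) < margin)
        below = sum(1 for x, _ in draws if x <= b - margin)
        k = round(b * n) - below  # waitlisted draws that must fall below the cut
        cutoffs.append(waitlist[k - 1] if k > 0 else b - margin)
    # Assign each item to the split its draw falls into.
    splits = [[] for _ in weights]
    for x, item in draws:
        splits[sum(1 for c in cutoffs if x > c)].append(item)
    return splits

parts = exact_splits(list(range(1000)), [0.25, 0.25, 0.25, 0.25])
print([len(p) for p in parts])  # [250, 250, 250, 250]
```

This also shows the degradation concern directly: as the number of splits grows, the bands around neighboring boundaries approach each other until nearly every draw is waitlisted.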

ML API

This patch also allows users to specify a stratification column in the CrossValidator and TrainValidationSplit estimators. This is done by converting the input DataFrame to a pair RDD and calling the randomSplitByKey method. It is exposed via a stratifiedCol param (set with setStratifiedCol) which, when set, causes exact stratified splits to be used for cross validation.

Future considerations

This could be implemented as a function on DataFrames in the future, if there is interest. It is somewhat inconvenient to convert the DataFrame to a pair RDD, perform the sampling, and then convert back to a DataFrame.

@sethah (Contributor, Author) commented Jul 22, 2016

cc @MLnick @hhbyyh @mengxr I believe there is still interest in stratified sampling methods. Could you provide feedback/review on this patch? Thanks!

@SparkQA commented Jul 22, 2016

Test build #62738 has finished for PR 14321 at commit 37be0b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val keys = pairData.keys.distinct.collect()
val weights: Array[scala.collection.Map[Any, Double]] =
Array(keys.map((_, $(trainRatio))).toMap, keys.map((_, 1 - $(trainRatio))).toMap)
val splitsWithKeys = pairData.randomSplitByKey(weights, exact = true, $(seed))
Review comment (Contributor):

Does it perhaps make sense to have a convenience version of randomSplitByKey that takes an Array[Double] for weights and applies the same sampling weight to each key? I would expect that in the vast majority of cases the use case is to split the dataset into folds with the same sampling ratio across keys.
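For illustration, the suggested convenience overload amounts to expanding a flat weight array into one per-key map per split. A plain-Python sketch (`broadcast_weights` is a hypothetical name, not part of any Spark API):

```python
def broadcast_weights(keys, weights):
    # Expand a flat list of split weights into one {key: weight} map per
    # split, so the same sampling ratio applies to every stratum.
    return [{k: w for k in keys} for w in weights]

print(broadcast_weights(["a", "b"], [0.8, 0.2]))
# [{'a': 0.8, 'b': 0.8}, {'a': 0.2, 'b': 0.2}]
```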

n: Int,
exact: Boolean): Unit = {

def countByKey[K, V](xs: TraversableOnce[(K, V)]): Map[K, Int] = {
Review comment — @MLnick (Contributor) commented Aug 11, 2016:

Do we need this? We should be able to use rdd.countByKey in L824 (totalCounts), L835 & L836 (sampleCounts and complementCounts) below? (You've basically done that in the test for kFoldStratified).
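In plain Python, what the comment suggests relying on instead — a per-key pair count, as Spark's `RDD.countByKey` provides — is a one-liner (`count_by_key` here is just an illustration, not the Spark method):

```python
from collections import Counter

def count_by_key(pairs):
    # Count how many (key, value) pairs share each key.
    return Counter(k for k, _ in pairs)

print(count_by_key([("a", 1), ("a", 2), ("b", 3)]))
# Counter({'a': 2, 'b': 1})
```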

@pramitchoudhary commented:
Has progress on this initiative stalled for any reason? Maybe I could be of help. @sethah

@HyukjinKwon (Member) commented:

I just happened to look at this PR. Is this still WIP, or waiting for more review comments? If the author is simply not currently able to take this further, then maybe it'd be better to close it for now.

@sethah sethah closed this Feb 13, 2017
@idlecool

Hi @sethah, any plans to work on it again?
