[SPARK-17057] [ML] ProbabilisticClassifierModels' thresholds should have at most one 0 #15149
Conversation
Test build #65593 has finished for PR 15149 at commit
Test build #65597 has finished for PR 15149 at commit
Does it make sense to test the all-0 case too?
Sounds good.
BTW I'm finding that many cases use thresholds that sum to 1. Is it actually important to prohibit this? I don't see that thresholds/cutoffs are actually interpreted as a probability distribution or anything.
Actually I think summing to 1 is fine - the R package actually enforces !(sum > 1), i.e. sum <= 1 (https://github.com/cran/randomForest/blob/master/R/predict.randomForest.R#L47) - I think I misread it earlier.
No, they don't seem to be probabilities. In fact I don't see a theoretical reason why the sum should be <= 1, since they're just scalings or "weights". Requiring all > 0 of course makes practical sense. Anyway, we should just match what they do, since that is what this thresholds implementation was based on.
@MLnick Yeah, I was wondering: is the point of this PR just to match R? Actually, the following works:

```r
fit <- randomForest(as.factor(V4) ~ ., data=data, cutoff=c(0.05, 0.05, 0.05, 0.05))
> fit$forest$cutoff = c(1000, 1000, 1000, 1)
> table(testClass=data$V4, predict(fit, newdata=data))

testClass   1   2   3   4
        1   0   3   2  56
        2   0   6   1 144
        3   0   2   3 116
        4   0   0   0  67
```

So you can actually predict with arbitrary cutoff values; perhaps this is a hack or a bug in R.
Honestly, I don't really see the theoretical justification even for the cutoff approach - a hard threshold works in the binary case but not really in multi-class. Most threshold adjustment I've seen comes up in OneVsRest scenarios, where the threshold can be adjusted per class...
So yes my point here was that if we are just copying the approach from the R package, let's be consistent with it.
But yes, the main issue is ensuring they're all positive, so I'm ok with removing the <=1 constraint.
Why does the sum need to be less than one? That is not the case for R's randomForest "cutoff" parameter.

@sethah that is the case for R's randomForest - well, <= 1 at least: https://github.com/cran/randomForest/blob/master/R/predict.randomForest.R#L47
Test build #65647 has finished for PR 15149 at commit
This doc is good, but we essentially say "the class with highest p/t is chosen" twice - once here and again in the paragraph below. Perhaps we can consolidate?
Test build #65648 has finished for PR 15149 at commit
Test build #65650 has finished for PR 15149 at commit
Why change from the while loop?
I was going to ask you the opposite question: why was it changed to a while loop - is this performance-critical?
This is called for every instance in the dataset when using transform method, so I think it is. I haven't done explicit testing to see the difference, though.
Yeah sounds OK to me. I also got rid of an extra conversion to an array here.
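For context, here is a minimal sketch of the kind of while-loop-based p/t argmax being discussed. This is not the actual Spark code; the names `probability` and `thresholds` follow the surrounding diff, and it assumes the thresholds have already been validated elsewhere.

```scala
// Illustrative sketch: pick the class maximizing probability / threshold.
// Assumes thresholds were validated up front (all >= 0, at most one exactly 0).
def predictWithThresholds(probability: Array[Double], thresholds: Array[Double]): Int = {
  var argMax = 0
  var max = Double.NegativeInfinity
  var i = 0
  while (i < probability.length) { // while loop: no per-row iterator overhead
    val scaled = probability(i) / thresholds(i) // +Infinity when t == 0 and p > 0
    if (scaled > max) {
      max = scaled
      argMax = i
    }
    i += 1
  }
  argMax
}
```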
Requiring these thresholds to sum to <= 1 seems entirely arbitrary. I don't know why thresholds that sum to more than one should be prohibited.
Right now, that limit is only for parity with the randomForest package that this is apparently based on. I agree that it's not clear why these couldn't sum to something more than 1. If they were to be interpreted as prior probabilities then they really should sum to 1 exactly. I'm neutral on changing it ... not changing this would make this less of a behavior change, which is nice. The real problem we're trying to solve here is requiring all thresholds to be positive.
+1 for not changing the sum requirement. I agree that we need to restrict them to sum to something non-zero and all positive. Thanks for the clarification.
Test build #65659 has finished for PR 15149 at commit
Test build #65660 has finished for PR 15149 at commit
Match ProbabilisticClassifier.thresholds requirements to R randomForest cutoff, requiring all > 0
Test build #65713 has finished for PR 15149 at commit
```scala
  new TestProbabilisticClassificationModel("myuid", 2, 2).setThresholds(Array(0.0, 0.0))
}
intercept[IllegalArgumentException] {
  new TestProbabilisticClassificationModel("myuid", 2, 2).setThresholds(Array(0.8, 0.8))
```
This is now valid, correct? So we should remove this test case.
Yeah oops I missed a couple things on my last merge. Fixing it up now ...
Test build #65714 has finished for PR 15149 at commit

LGTM
| " The class with largest value p/t is predicted, where p is the original probability" + | ||
| " of that class and t is the class' threshold", | ||
| " of that class and t is the class's threshold", | ||
| isValid = "(t: Array[Double]) => t.forall(_ >= 0)", finalMethods = false), |
This line doesn't line up with what is in sharedParams.scala. This file should generate sharedParams.scala via `build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen"`. `t.forall(_ >= 0)` should be `t.forall(_ > 0)`.
```scala
while (i < probabilitySize) {
  if (thresholds(i) == 0.0) {
    max = Double.PositiveInfinity
    val scaled = probability(i) / thresholds(i)
```
Maybe we can add a comment for future developers that we don't have to worry about divide-by-zero errors here.
python/pyspark/ml/param/shared.py (outdated)

```diff
- thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.", typeConverter=TypeConverters.toListFloat)
+ thresholds = Param(Params._dummy(), "thresholds","Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.", typeConverter=TypeConverters.toListFloat)
```
This file should be generated via `python _shared_params_code_gen.py > shared.py`, but looking at the space that was deleted, maybe it wasn't?
Aha, I've just put 2 + 2 together about what these files named 'code gen' are about. Let me regen them.
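For reference, the two regeneration commands quoted in this review thread are:

```
build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen"
python _shared_params_code_gen.py > shared.py
```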
```scala
intercept[IllegalArgumentException] {
  new TestProbabilisticClassificationModel("myuid", 2, 2).setThresholds(Array(0.0, 0.0))
}
}
```
My apologies for not thinking of this earlier, but maybe we should test negative values as well, while we're here - something like the sketch below.
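A sketch of what that test might look like, following the existing test pattern above (the -0.1 value is just illustrative):

```scala
intercept[IllegalArgumentException] {
  new TestProbabilisticClassificationModel("myuid", 2, 2).setThresholds(Array(-0.1, 0.5))
}
```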
One small comment, otherwise LGTM. Thanks!

Test build #65720 has finished for PR 15149 at commit
This looks really reasonable; the only catch is that the thresholds can be effectively set through `setThreshold` as well, so we probably also want to update the range notation used in `threshold`'s documentation. We should also probably add a test for this as well, since it almost went in without this.
Sorry for my late review - was giving a talk yesterday so was focused on that.
Ah, do we need to update that? It looks like `threshold` already allows values in [0, 1]. In the binary case, using this as a cutoff gives the same answer as this ratio-based rule anyway. (Really ... it would make sense to allow one threshold to be 0, which would effectively mean always predicting that class, in the multiclass case. But let's leave that.)
Test build #65724 has finished for PR 15149 at commit
It does introduce a slight inconsistency, because setting `threshold` to 0 or 1 would translate to a `thresholds` array containing a 0. Also, anyone doing binary classification will be using `threshold`.
Yah, I guess we can consider the case where the user explicitly states multinomial as the family but then only has two classes and uses `setThreshold`.
Hmm, yes, that would be an inconsistent scenario. Either way it could blow up the calculation. So we could just allow at most one threshold to be 0.
I think it would also be ok to explicitly fail if we don't want to support that - but fail intentionally.
Sure - though actually I think it is perhaps simpler to just allow one 0 in validation for `thresholds`.
Test build #65768 has finished for PR 15149 at commit
```scala
max = Double.PositiveInfinity
// thresholds are all > 0, excepting that at most one may be 0
val scaled = probability(i) / thresholds(i)
if (scaled > max) {
```
If probability(i) and thresholds(i) are both 0.0 here, we will have scaled = NaN. Maybe we can break out of the loop early if we encounter a zero threshold. BTW, this also begs the question of what the answer should be with an infinitely low probability and an infinitely low threshold - but I'm totally fine just predicting whichever class has the zero threshold in that case :D
Yeah, that occurred to me. It will never be selected, because NaN isn't bigger than anything, including NegativeInfinity. If for some reason you have only one class (is this even valid?) you'd select this class, which I guess is correct.
Very small but positive prob / threshold? That should still work fine, to the limits of machine precision here.
I actually meant what is the correct answer when they are both zero. Essentially saying "never predict this class" and "always predict this class" at the same time. I figured we would just predict the class with zero threshold regardless of the probability, but it seems currently it's the opposite. We give the probability higher precedence.
BTW, it is valid to have only one class.
You're right. It will never be predicted, and I think that's more sensible because probability = 0 means "never predict" and 0 threshold only sort of implies always predicting the class. It's undefined, so either one seems coherent as a result. I prefer the current behavior I guess.
I see it's possible code-wise to have one class but don't think it's a valid use case, so, not worried about the behavior (even if it will still return the one single class here always anyway).
OK, I think that's reasonable behavior. Is it better to handle the zero-threshold case explicitly in the code? It confused me at first, and I had to refer to divide-by-zero behavior in Scala to understand the code.
It's worth a comment at least, yes. I think it's probably no simpler or faster code-wise to special-case it.
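For anyone new to this thread, the IEEE-754 semantics being relied on are plain Scala Double behavior, easy to verify in a REPL:

```scala
val nan = 0.0 / 0.0 // p == 0 and t == 0: NaN
val inf = 0.5 / 0.0 // p > 0 and t == 0: +Infinity
assert(nan.isNaN)
assert(!(nan > Double.NegativeInfinity)) // NaN loses every '>' comparison,
                                         // so a 0/0 class is never selected
assert(inf > Double.MaxValue)            // a p > 0, t == 0 class always wins
```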
Right now, if a multinomial family is used in LOR, it silently ignores `threshold`.
Test build #65776 has finished for PR 15149 at commit
Merged to master

Thanks all for handling this edge case! Coming late to the discussion... but I like the decisions made here.
What changes were proposed in this pull request?
Match ProbabilisticClassifier.thresholds requirements to R randomForest cutoff, requiring all > 0
How was this patch tested?
Jenkins tests plus new test cases