
Conversation

@srowen (Member) commented Sep 19, 2016

What changes were proposed in this pull request?

Match ProbabilisticClassifier.thresholds requirements to R randomForest cutoff, requiring all values > 0

How was this patch tested?

Jenkins tests plus new test cases

@srowen (Member, Author) commented Sep 19, 2016

CC @MLnick @zhengruifeng

@SparkQA commented Sep 19, 2016

Test build #65593 has finished for PR 15149 at commit ddc8dab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • * is largest. A class may be selected even if this ratio is less than 1 (that is, the class

@SparkQA commented Sep 19, 2016

Test build #65597 has finished for PR 15149 at commit f4ce7c5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Does it make sense to test the all-zeros case too?

Member Author

Sounds good.

BTW I'm finding that many cases use thresholds that sum to 1. Is it actually important to prohibit this? I don't see that thresholds/cutoffs are actually interpreted as a probability distribution or anything.

@MLnick (Contributor) Sep 20, 2016

Actually I think summing to 1 is fine - the R package actually limits !(sum > 1), i.e. sum <= 1 (https://github.com/cran/randomForest/blob/master/R/predict.randomForest.R#L47) - I think I misread it earlier.

Contributor

No they don't seem to be probabilities. In fact I don't see a theoretical reason why the sum <=1, since they're just scalings or "weights". Requiring all > 0 of course makes practical sense. Anyway we should just match what they do as that is what this thresholds implementation was based on.

Contributor

@MLnick Yeah, I was wondering, the point of this PR is just to match R? Actually, the following works

> fit <- randomForest(as.factor(V4) ~ ., data=data, cutoff=c(0.05, 0.05, 0.05, 0.05))
> fit$forest$cutoff = c(1000, 1000, 1000, 1)
> table(testClass=data$V4, predict(fit, newdata=data))

testClass   1   2   3   4
        1   0   3   2  56
        2   0   6   1 144
        3   0   2   3 116
        4   0   0   0  67

So, you can actually predict with arbitrary cutoff values; perhaps this is a hack or a bug in R.

@MLnick (Contributor) Sep 21, 2016

Honestly I don't really see the theoretical justification even for the cutoff approach - a hard threshold works in the binary case but not really in multi-class. Anything related to thresholds I've seen is mostly related to OneVsRest scenarios where the threshold can be adjusted per class...

So yes my point here was that if we are just copying the approach from the R package, let's be consistent with it.

But yes, the main issue is ensuring they're all positive, so I'm ok with removing the <=1 constraint.

@sethah (Contributor) commented Sep 19, 2016

Why does the sum need to be less than one? That is not the case for R's randomForest "cutoff" parameter.

@MLnick (Contributor) commented Sep 20, 2016

@sethah that is the case for R's randomForest - well <=1 at least: https://github.com/cran/randomForest/blob/master/R/predict.randomForest.R#L47

@SparkQA commented Sep 20, 2016

Test build #65647 has finished for PR 15149 at commit 175d084.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • * and sum in (0,1]. The class with largest value p/t is predicted, where p is the original
    • * is largest. A class may be selected even if this ratio is less than 1 (that is, the class
    • * "values > 0 and sum in (0,1]. The class with largest value p/t is predicted, where p " +
    • values > 0 and sum in (0,1]. The class with largest value p/t is predicted, where p is the
    • thresholds = Param(Params._dummy(), "thresholds","Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 and sum in (0,1]. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.", typeConverter=TypeConverters.toListFloat)

Contributor

This doc is good but we essentially say "the class with highest p/t is chosen" twice - once here and again in the paragraph below. Perhaps we can consolidate?
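For anyone following the doc discussion, the p/t rule the param documentation describes can be sketched in a few lines. This is plain Python for illustration only, not the Spark implementation, and predict_by_thresholds is a made-up name:

```python
def predict_by_thresholds(probability, thresholds):
    # Predict the class with the largest p/t ratio, where p is the class's
    # original probability and t is that class's threshold.
    scaled = [p / t for p, t in zip(probability, thresholds)]
    return max(range(len(scaled)), key=lambda i: scaled[i])

# Uniform thresholds reduce to a plain argmax over the probabilities:
predict_by_thresholds([0.2, 0.5, 0.3], [0.5, 0.5, 0.5])  # -> 1
# A smaller threshold makes its class easier to predict:
predict_by_thresholds([0.2, 0.5, 0.3], [0.1, 0.5, 0.5])  # -> 0
```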

@SparkQA commented Sep 20, 2016

Test build #65648 has finished for PR 15149 at commit da2b5fd.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen changed the title from "[SPARK-17057] [ML] ProbabilisticClassifierModels' thresholds should be > 0 and sum < 1 to match randomForest cutoff" to "[SPARK-17057] [ML] ProbabilisticClassifierModels' thresholds should be > 0 and sum <= 1 to match randomForest cutoff" on Sep 20, 2016
@SparkQA commented Sep 20, 2016

Test build #65650 has finished for PR 15149 at commit 8fe92e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Why change from the while loop?

Member Author

I was going to ask you the opposite question: why was it changed to a while loop - is this performance-critical?

Contributor

This is called for every instance in the dataset when using the transform method, so I think it is. I haven't done explicit testing to see the difference, though.

Member Author

Yeah sounds OK to me. I also got rid of an extra conversion to an array here.

@sethah (Contributor) commented Sep 20, 2016

Requiring these thresholds to sum to <= 1 seems entirely arbitrary. I don't know why thresholds that sum to 0.347 are any more valid than thresholds that sum to 347. If these are not meant to represent a probability distribution, then what basis is there for requiring the sum to be <= 1?

@srowen (Member, Author) commented Sep 20, 2016

Right now, that limit is only for parity with the randomForest package that this is apparently based on. I agree that it's not clear why these couldn't sum to something more than 1. If they were to be interpreted as prior probabilities then they really should sum to 1 exactly.

I'm neutral on changing it ... not changing this would make this less of a behavior change, which is nice. The real problem we're trying to solve here is requiring all thresholds to be positive.

@sethah (Contributor) commented Sep 20, 2016

+1 for not changing the sum requirement. I agree that we need to require them to be all positive (and hence sum to something non-zero). Thanks for the clarification.

@SparkQA commented Sep 20, 2016

Test build #65659 has finished for PR 15149 at commit 6712d7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 20, 2016

Test build #65660 has finished for PR 15149 at commit 6d7a2d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen changed the title from "[SPARK-17057] [ML] ProbabilisticClassifierModels' thresholds should be > 0 and sum <= 1 to match randomForest cutoff" to "[SPARK-17057] [ML] ProbabilisticClassifierModels' thresholds should be > 0" on Sep 21, 2016
@SparkQA commented Sep 21, 2016

Test build #65713 has finished for PR 15149 at commit 727f732.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

new TestProbabilisticClassificationModel("myuid", 2, 2).setThresholds(Array(0.0, 0.0))
}
intercept[IllegalArgumentException] {
new TestProbabilisticClassificationModel("myuid", 2, 2).setThresholds(Array(0.8, 0.8))
Contributor

This is now valid, correct? So we should remove this test case.

Member Author

Yeah, oops - I missed a couple of things on my last merge. Fixing it up now...

@SparkQA commented Sep 21, 2016

Test build #65714 has finished for PR 15149 at commit 80934ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick (Contributor) commented Sep 21, 2016

LGTM

" The class with largest value p/t is predicted, where p is the original probability" +
" of that class and t is the class' threshold",
" of that class and t is the class's threshold",
isValid = "(t: Array[Double]) => t.forall(_ >= 0)", finalMethods = false),
Contributor

This line doesn't line up with what is in sharedParams.scala. This file should generate sharedParams.scala via build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen". t.forall(_ >= 0) should be t.forall(_ > 0).

while (i < probabilitySize) {
if (thresholds(i) == 0.0) {
max = Double.PositiveInfinity
val scaled = probability(i) / thresholds(i)
Contributor

Maybe we can add a comment for future developers that we don't have to worry about divide-by-zero errors here.

"""

thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.", typeConverter=TypeConverters.toListFloat)
thresholds = Param(Params._dummy(), "thresholds","Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.", typeConverter=TypeConverters.toListFloat)
Contributor

This file should be generated via python _shared_params_code_gen.py > shared.py, but looking at the space that was deleted maybe it wasn't?

Member Author

Aha, I've just put 2 + 2 together about what these files named 'code gen' are about. Let me regen them.

intercept[IllegalArgumentException] {
new TestProbabilisticClassificationModel("myuid", 2, 2).setThresholds(Array(0.0, 0.0))
}
}
Contributor

My apologies for not thinking of this earlier - maybe we should test negative values as well, while we're here.

@sethah (Contributor) commented Sep 21, 2016

One small comment, otherwise LGTM. Thanks!

@SparkQA commented Sep 21, 2016

Test build #65720 has finished for PR 15149 at commit 278a193.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val thresholds: DoubleArrayParam = new DoubleArrayParam(this, "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold", (t: Array[Double]) => t.forall(_ > 0))
    • thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.", typeConverter=TypeConverters.toListFloat)

@holdenk (Contributor) commented Sep 21, 2016

This looks really reasonable; the only catch is that the thresholds can effectively be set through setThreshold as well as setThresholds.

  • We probably want to update the range notation used in setThreshold: right now it lists the valid thresholds as [0, 1], and since 0 is no longer valid we probably want to swap it to (0, 1].
  • We probably also want to change the validator used for threshold to exclude 0 in SharedParamsCodeGen.scala.
  • We should probably add a test for this as well, since it almost went in without one.

@holdenk (Contributor) commented Sep 21, 2016

Sorry for my late review - was giving a talk yesterday so was focused on that.

@srowen (Member, Author) commented Sep 21, 2016

Ah, do we need to update that? It looks like threshold is separate, and overrides thresholds. It's just used as a cutoff for the positive class, so it doesn't have the same problem when it's 0. You could legitimately set it to 0 to always predict the positive class.

In the binary case, using this as a cutoff gives the same answer as this ratio-based rule anyway.

(Really ... it would make sense to allow one threshold to be 0, which would effectively mean always predict the class, in the multiclass case. But let's leave that.)
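Sean's binary-case claim can be checked algebraically as well as numerically: with setThreshold(t) corresponding to thresholds = [1 - t, t], the ratio rule p1/t > (1 - p1)/(1 - t) rearranges to p1 > t. A quick sketch (plain Python for illustration; the helper names are made up):

```python
def ratio_predict(p1, t):
    # Ratio rule with thresholds = [1 - t, t]: predict class 1 iff
    # p1 / t beats p0 / (1 - t).
    p0 = 1.0 - p1
    return 1 if p1 / t > p0 / (1.0 - t) else 0

def cutoff_predict(p1, t):
    # Plain cutoff rule: predict class 1 iff p1 exceeds the threshold.
    return 1 if p1 > t else 0

# The two rules agree for probabilities in [0, 1] and thresholds in (0, 1):
assert all(ratio_predict(p / 100, t / 100) == cutoff_predict(p / 100, t / 100)
           for p in range(0, 101) for t in range(1, 100))
```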

@SparkQA commented Sep 21, 2016

Test build #65724 has finished for PR 15149 at commit 0810e4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick (Contributor) commented Sep 21, 2016

It does introduce a slight inconsistency because setting thresholds to [0, 1] for binary is now not allowed, but setting threshold to 0 is fine. Still, I think it's valid to set threshold to 0. As Sean says, it's also technically valid to allow one 0 in thresholds... but really that seems to just complicate things for no particularly good reason.

Also, anyone doing binary classification will be using threshold - and if they're changing that they probably know what they're doing anyway.

@holdenk (Contributor) commented Sep 21, 2016

Yeah, I guess we can consider the case where the user explicitly states multinomial as the family but then only has two classes and uses setThreshold rather than setThresholds an error state regardless.
(And even then, right now in the current code, that will result in the thresholds being ignored, since we only call getThresholds when thresholds is defined and setThreshold clears the current thresholds.)

@MLnick (Contributor) commented Sep 21, 2016

Hmm, yes, that would be an inconsistent scenario because thresholds would be used in that case rather than threshold. And threshold could have been set to 0 => thresholds = [1, 0] (or set to 1 => [0, 1]). Even though it's most likely users would use binary classification with binomial, this is definitely a corner case.

Either way it could blow up the calculation. So we could just allow at most one threshold to be 0, and in this case always predict that class (similar to the way it is now, except at most one class can have an Inf p/t score).

@holdenk (Contributor) commented Sep 21, 2016

I think it would also be ok to explicitly fail if we don't want to support that - but fail intentionally.

@MLnick (Contributor) commented Sep 21, 2016

Sure - though actually I think it is perhaps simpler to just allow one 0 in validation for thresholds, because we definitely don't want to throw an error only at prediction time, once the user has gone and trained a model. That could be ok for a single model, but in a complex pipeline it would be super frustrating. And it seems like validating the combination of what is set for family, threshold, and thresholds could be convoluted and more error-prone?
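For concreteness, the validation being discussed here - all values nonnegative, with at most one exact zero - is simple to state. A sketch in plain Python (valid_thresholds is a hypothetical name, not Spark API):

```python
def valid_thresholds(thresholds):
    # All thresholds must be nonnegative, and at most one may be exactly
    # zero (a single zero would mean "always predict that class").
    return (all(t >= 0 for t in thresholds)
            and sum(t == 0 for t in thresholds) <= 1)

valid_thresholds([0.3, 0.7])       # -> True
valid_thresholds([0.0, 0.5, 0.5])  # -> True: one zero is allowed
valid_thresholds([0.0, 0.0, 1.0])  # -> False: two zeros
valid_thresholds([-0.1, 0.5])      # -> False: negative value
```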

@SparkQA commented Sep 22, 2016

Test build #65768 has finished for PR 15149 at commit e6de49b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final val thresholds: DoubleArrayParam = new DoubleArrayParam(this, "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold", (t: Array[Double]) => t.forall(_ >= 0) && t.count(_ == 0) <= 1)
    • thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.", typeConverter=TypeConverters.toListFloat)

max = Double.PositiveInfinity
// thresholds are all > 0, excepting that at most one may be 0
val scaled = probability(i) / thresholds(i)
if (scaled > max) {
Contributor

If probability(i) and thresholds(i) are both 0.0 here, we will have scaled = NaN. Maybe we can break out of the loop early if we encounter a zero threshold. BTW, this also begs the question of what the answer should be with an infinitely low probability and an infinitely low threshold - but I'm totally fine just predicting whatever threshold is zero in that case :D

Member Author

Yeah, that occurred to me. It will never be selected, because NaN isn't bigger than anything, including NegativeInfinity. If for some reason you have one class only (is this even valid?), you'd select this class, which I guess is correct.

Very small but positive prob / threshold? that should still work fine to the limits of machine precision here.
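The floating-point behavior this exchange leans on is easy to verify directly. Python is shown for illustration; Scala's Double follows the same IEEE 754 comparison rules, though note that plain Python raises ZeroDivisionError for 1.0 / 0.0 where Scala yields Infinity:

```python
nan = float("nan")
inf = float("inf")

# NaN compares false against everything, so a class whose scaled score is
# NaN (zero probability over a zero threshold) never wins a running-max
# comparison - not even against negative infinity.
assert not (nan > float("-inf"))
assert not (nan > nan)
assert not (nan > 0.0)

# A zero threshold with a nonzero probability scales to +infinity, which
# beats every finite score, so that class would always be predicted.
assert inf > 1.7976931348623157e308  # larger than any finite double
```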

Contributor

I actually meant what is the correct answer when they are both zero. Essentially saying "never predict this class" and "always predict this class" at the same time. I figured we would just predict the class with zero threshold regardless of the probability, but it seems currently it's the opposite. We give the probability higher precedence.

BTW, it is valid to have only one class.

Member Author

You're right. It will never be predicted, and I think that's more sensible because probability = 0 means "never predict" and 0 threshold only sort of implies always predicting the class. It's undefined, so either one seems coherent as a result. I prefer the current behavior I guess.

I see it's possible code-wise to have one class but don't think it's a valid use case, so, not worried about the behavior (even if it will still return the one single class here always anyway).

Contributor

Ok, I think that's reasonable behavior. Is it better to handle the zero-threshold case in the code explicitly? It confused me at first, and I had to refer to divide-by-zero behavior in Scala to understand the code.

Member Author

It's worth a comment at least, yes. I think it's probably no simpler or faster code-wise to special case it.

@sethah (Contributor) commented Sep 22, 2016

Right now, if a multinomial family is used in LOR, it silently ignores threshold regardless. I don't really like that behavior, but perhaps we can focus on it (and add some tests) in SPARK-11543. I like allowing a single zero threshold, since it gives users a way to always predict a certain class, and its behavior is rather clear.

@SparkQA commented Sep 22, 2016

Test build #65776 has finished for PR 15149 at commit 9796492.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • // ('scaled' = +Infinity). However in the case that this class also has
    • // 0 probability, the class will not be selected ('scaled' is NaN).

@srowen changed the title from "[SPARK-17057] [ML] ProbabilisticClassifierModels' thresholds should be > 0" to "[SPARK-17057] [ML] ProbabilisticClassifierModels' thresholds should have at most one 0" on Sep 24, 2016
@srowen (Member, Author) commented Sep 24, 2016

Merged to master

@asfgit asfgit closed this in 248916f Sep 24, 2016
@srowen srowen deleted the SPARK-17057 branch September 28, 2016 13:28
@jkbradley (Member) commented

Thanks all for handling this edge case! Coming late to the discussion... but I like the decisions made here.
