[SPARK-17057] [ML] ProbabilisticClassifierModels' thresholds should have at most one 0 #15149
Conversation
Test build #65593 has finished for PR 15149 at commit
Test build #65597 has finished for PR 15149 at commit
Does it make sense to test the all-0 case too?
Sounds good.
BTW I'm finding that many cases use thresholds that sum to 1. Is it actually important to prohibit this? I don't see that thresholds/cutoffs are actually interpreted as a probability distribution or anything.
Actually I think summing to 1 is fine - the R package actually enforces !(sum > 1), i.e. sum <= 1 (https://github.com/cran/randomForest/blob/master/R/predict.randomForest.R#L47) - I think I misread it earlier.
No, they don't seem to be probabilities. In fact I don't see a theoretical reason why the sum should be <= 1, since they're just scalings or "weights". Requiring all > 0 of course makes practical sense. Anyway, we should just match what they do, since that is what this thresholds implementation was based on.
@MLnick Yeah, I was wondering: is the point of this PR just to match R? Actually, the following works:

```r
fit <- randomForest(as.factor(V4) ~ ., data=data, cutoff=c(0.05, 0.05, 0.05, 0.05))
> fit$forest$cutoff = c(1000, 1000, 1000, 1)
> table(testClass=data$V4, predict(fit, newdata=data))

testClass   1   2   3   4
        1   0   3   2  56
        2   0   6   1 144
        3   0   2   3 116
        4   0   0   0  67
```

So you can actually predict with arbitrary cutoff values; perhaps this is a hack or a bug in R.
Honestly, I don't really see the theoretical justification even for the cutoff approach - a hard threshold works in the binary case but not really in multi-class. Most threshold adjustment I've seen comes up in OneVsRest scenarios, where the threshold can be adjusted per class...
So yes my point here was that if we are just copying the approach from the R package, let's be consistent with it.
But yes, the main issue is ensuring they're all positive, so I'm ok with removing the <=1 constraint.
Why does the sum need to be less than one? That is not the case for R's randomForest "cutoff" parameter.

@sethah that is the case for R's randomForest - well, <= 1 at least: https://github.com/cran/randomForest/blob/master/R/predict.randomForest.R#L47
Test build #65647 has finished for PR 15149 at commit
This doc is good, but we essentially say "the class with highest p/t is chosen" twice - once here and again in the paragraph below. Perhaps we can consolidate?
Test build #65648 has finished for PR 15149 at commit
Test build #65650 has finished for PR 15149 at commit
Why change from the while loop?
I was going to ask you the opposite question: why was it changed to a while loop - is this performance-critical?
This is called for every instance in the dataset when using transform method, so I think it is. I haven't done explicit testing to see the difference, though.
Yeah sounds OK to me. I also got rid of an extra conversion to an array here.
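For context, here is a minimal sketch of the kind of while-loop-based p/t argmax being discussed. This is not the actual Spark code; the names `probability` and `thresholds` follow the surrounding diff, and it assumes the thresholds have already been validated elsewhere.

```scala
// Illustrative sketch: pick the class maximizing probability / threshold.
// Assumes thresholds were validated up front (all >= 0, at most one exactly 0).
def predictWithThresholds(probability: Array[Double], thresholds: Array[Double]): Int = {
  var argMax = 0
  var max = Double.NegativeInfinity
  var i = 0
  while (i < probability.length) { // while loop: no per-row iterator overhead
    val scaled = probability(i) / thresholds(i) // +Infinity when t == 0 and p > 0
    if (scaled > max) {
      max = scaled
      argMax = i
    }
    i += 1
  }
  argMax
}
```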
Requiring these thresholds to sum to <= 1 seems entirely arbitrary. I don't know why thresholds that sum to more than one should be prohibited.
Right now, that limit is only for parity with the randomForest package that this is apparently based on. I agree that it's not clear why these couldn't sum to something more than 1. If they were to be interpreted as prior probabilities then they really should sum to 1 exactly. I'm neutral on changing it ... not changing this would make this less of a behavior change, which is nice. The real problem we're trying to solve here is requiring all thresholds to be positive.
+1 for not changing the sum requirement. I agree that we need to restrict them to sum to something non-zero and all positive. Thanks for the clarification.
Test build #65659 has finished for PR 15149 at commit
Test build #65660 has finished for PR 15149 at commit
Match ProbabilisticClassifier.thresholds requirements to R randomForest cutoff, requiring all > 0
Test build #65713 has finished for PR 15149 at commit
```scala
  new TestProbabilisticClassificationModel("myuid", 2, 2).setThresholds(Array(0.0, 0.0))
}
intercept[IllegalArgumentException] {
  new TestProbabilisticClassificationModel("myuid", 2, 2).setThresholds(Array(0.8, 0.8))
```
This is now valid, correct? So we should remove this test case.
Yeah oops I missed a couple things on my last merge. Fixing it up now ...
Test build #65714 has finished for PR 15149 at commit

LGTM
| " The class with largest value p/t is predicted, where p is the original probability" + | ||
| " of that class and t is the class' threshold", | ||
| " of that class and t is the class's threshold", | ||
| isValid = "(t: Array[Double]) => t.forall(_ >= 0)", finalMethods = false), |
This line doesn't line up with what is in sharedParams.scala. This file should generate sharedParams.scala via `build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen"`. `t.forall(_ >= 0)` should be `t.forall(_ > 0)`.
```scala
while (i < probabilitySize) {
  if (thresholds(i) == 0.0) {
    max = Double.PositiveInfinity
    val scaled = probability(i) / thresholds(i)
```
Maybe we can add a comment for future developers that we don't have to worry about divide-by-zero errors here.
python/pyspark/ml/param/shared.py (outdated)

```diff
- thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.", typeConverter=TypeConverters.toListFloat)
+ thresholds = Param(Params._dummy(), "thresholds","Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.", typeConverter=TypeConverters.toListFloat)
```
This file should be generated via `python _shared_params_code_gen.py > shared.py`, but looking at the space that was deleted, maybe it wasn't?
Aha, I've just put 2 + 2 together about what these files named 'code gen' are about. Let me regen them.
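For reference, the two regeneration commands quoted in this review thread are:

```
build/sbt "mllib/runMain org.apache.spark.ml.param.shared.SharedParamsCodeGen"
python _shared_params_code_gen.py > shared.py
```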
```scala
intercept[IllegalArgumentException] {
  new TestProbabilisticClassificationModel("myuid", 2, 2).setThresholds(Array(0.0, 0.0))
}
}
```
My apologies for not thinking of this earlier, but maybe we should test negative values as well, while we're here - something like the sketch below.
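A sketch of what that test might look like, following the existing test pattern above (the -0.1 value is just illustrative):

```scala
intercept[IllegalArgumentException] {
  new TestProbabilisticClassificationModel("myuid", 2, 2).setThresholds(Array(-0.1, 0.5))
}
```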
One small comment, otherwise LGTM. Thanks!

Test build #65720 has finished for PR 15149 at commit
This looks really reasonable; the only catch is that the thresholds can be effectively set through `setThreshold` as well, so we probably also want to update the range notation used in `threshold`'s documentation. We should also probably add a test for this as well, since it almost went in without this.
Sorry for my late review - was giving a talk yesterday so was focused on that.
Ah, do we need to update that? It looks like `threshold` already allows values in [0, 1]. In the binary case, using this as a cutoff gives the same answer as this ratio-based rule anyway. (Really ... it would make sense to allow one threshold to be 0, which would effectively mean always predicting that class, in the multiclass case. But let's leave that.)
Test build #65724 has finished for PR 15149 at commit
It does introduce a slight inconsistency, because setting `threshold` to 0 or 1 would translate to a `thresholds` array containing a 0. Also, anyone doing binary classification will be using `threshold`.
Yah, I guess we can consider the case where the user explicitly states multinomial as the family but then only has two classes and uses `setThreshold`.
Hmm, yes, that would be an inconsistent scenario. Either way it could blow up the calculation. So we could just allow at most one threshold to be 0.
I think it would also be ok to explicitly fail if we don't want to support that - but fail intentionally.
Sure - though actually I think it is perhaps simpler to just allow one 0 in validation for `thresholds`.
Test build #65768 has finished for PR 15149 at commit
```scala
max = Double.PositiveInfinity
// thresholds are all > 0, excepting that at most one may be 0
val scaled = probability(i) / thresholds(i)
if (scaled > max) {
```
If probability(i) and thresholds(i) are both 0.0 here, we will have scaled = NaN. Maybe we can break out of the loop early if we encounter a zero threshold. BTW, this also begs the question of what the answer should be with an infinitely low probability and an infinitely low threshold - but I'm totally fine just predicting whichever class has the zero threshold in that case :D
Yeah, that occurred to me. It will never be selected, because NaN isn't bigger than anything, including NegativeInfinity. If for some reason you have only one class (is this even valid?) you'd select this class, which I guess is correct.
Very small but positive prob / threshold? That should still work fine, to the limits of machine precision here.
I actually meant what is the correct answer when they are both zero. Essentially saying "never predict this class" and "always predict this class" at the same time. I figured we would just predict the class with zero threshold regardless of the probability, but it seems currently it's the opposite. We give the probability higher precedence.
BTW, it is valid to have only one class.
You're right. It will never be predicted, and I think that's more sensible because probability = 0 means "never predict" and 0 threshold only sort of implies always predicting the class. It's undefined, so either one seems coherent as a result. I prefer the current behavior I guess.
I see it's possible code-wise to have one class but don't think it's a valid use case, so, not worried about the behavior (even if it will still return the one single class here always anyway).
OK, I think that's reasonable behavior. Is it better to handle the zero-threshold case explicitly in the code? It confused me at first, and I had to refer to divide-by-zero behavior in Scala to understand the code.
It's worth a comment at least, yes. I think it's probably no simpler or faster code-wise to special-case it.
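For anyone new to this thread, the IEEE-754 semantics being relied on are plain Scala Double behavior, easy to verify in a REPL:

```scala
val nan = 0.0 / 0.0 // p == 0 and t == 0: NaN
val inf = 0.5 / 0.0 // p > 0 and t == 0: +Infinity
assert(nan.isNaN)
assert(!(nan > Double.NegativeInfinity)) // NaN loses every '>' comparison,
                                         // so a 0/0 class is never selected
assert(inf > Double.MaxValue)            // a p > 0, t == 0 class always wins
```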
Right now, if a multinomial family is used in LOR, it silently ignores `threshold`.
Test build #65776 has finished for PR 15149 at commit
Merged to master

Thanks all for handling this edge case! Coming late to the discussion... but I like the decisions made here.
What changes were proposed in this pull request?
Match ProbabilisticClassifier.thresholds requirements to R randomForest cutoff, requiring all > 0
How was this patch tested?
Jenkins tests plus new test cases