[SPARK-14975][ML] Fixed GBTClassifier to predict probability per training instance and fixed interfaces #16441
Conversation
Test build #70759 has finished for PR 16441 at commit

Jenkins, retest this please

Test build #70760 has finished for PR 16441 at commit
Thanks for the PR; I do want to get this fixed. However, I don't think this is the right way to predict probabilities for GBTs. I believe it should depend on the loss used. E.g., check out page 8 of Friedman (1999), "Greedy Function Approximation: A Gradient Boosting Machine".
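For reference, under logistic loss Friedman's two-class formulation maps the ensemble margin F(x) to a probability via P(y = 1 | x) = 1 / (1 + exp(-2 F(x))). A minimal, self-contained sketch of that mapping (names like `margin` and `probabilityOfPositive` are illustrative, not Spark API):

```scala
// Illustrative sketch (not Spark API): probability of the positive class
// from a GBT margin under logistic loss, following Friedman (1999).
object GbtProbabilitySketch {
  // F(x): weighted sum of per-tree predictions (each in {-1.0, +1.0}).
  def margin(treePredictions: Array[Double], treeWeights: Array[Double]): Double = {
    var acc = 0.0
    var i = 0
    while (i < treePredictions.length) {
      acc += treePredictions(i) * treeWeights(i)
      i += 1
    }
    acc
  }

  // For log loss on {-1, +1} labels, the modeled probability is
  // P(y = 1 | x) = 1 / (1 + exp(-2 F(x))).
  def probabilityOfPositive(margin: Double): Double =
    1.0 / (1.0 + math.exp(-2.0 * margin))
}
```

The factor of 2 in the exponent comes from the {-1, +1} label encoding used by the loss.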
Test build #70935 has finished for PR 16441 at commit
Thanks, I've updated the PR based on your comment. The only disadvantage of the current code is that I do the probability computation within the classifier, but it seems like it should be moved into the LogLoss.scala class. However, it's not a problem right now because GBTClassifier only uses logistic loss, and other learners would probably have to be modified in a similar way as well.
Force-pushed from 2b842e5 to 9def0ca
Test build #70938 has finished for PR 16441 at commit

Test build #70939 has finished for PR 16441 at commit
@jkbradley I've updated based on your comments; please take another look, thanks!
sethah left a comment:

Thanks for the patch. I made a first pass.
put this back on one line
done
This is actually not correct since the constructor was private[ml] before. Since this has always been private, and we aren't actually using it anywhere, I think we can remove this constructor entirely.
removed
The @Since tag isn't needed since it's private.
Removed the @Since tag.
We should import org.apache.spark.ml.linalg.BLAS and call BLAS.dot here and in predict.
It looks like BLAS.dot is only defined for Vector, but these are both arrays. I'm worried that this may degrade performance. Is this specifically what you are looking for:

BLAS.dot(Vectors.dense(treePredictions), Vectors.dense(_treeWeights))

Is the extra dense vector allocation worth it?
Yeah, I see it's not quite the same as in other places. We can leave it
oh ok, thank you for confirming
My concern is that this is hard-coded to logistic loss. Maybe we can add a static method to GBTClassificationModel:

```scala
private def classProbability(label: Int, loss: String, rawPrediction: Double): Double = {
  loss match {
    case "logistic" => ...
    case _ => throw new Exception("Only logistic loss is supported ...")
  }
}
```
done
Just use defaults here. And I'm in favor of only setting parameters that matter for the given test, otherwise it may give the impression that the test depends on a certain, say checkpoint interval.
done
Could you take a look at this test, and make it line up here? Specifically:
- compute probabilities manually from rawPrediction and ensure that it matches the probabilities column
- make sure that probabilities.argmax and rawPrediction.argmax equal the prediction
- make sure probabilities sum to one
- check the different code paths by unsetting some of the output columns
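The invariants in that checklist can be sketched outside Spark with plain-Scala stand-ins for a single row's rawPrediction and probability columns (illustrative only; the raw-to-probability mapping assumes logistic loss):

```scala
// Sketch of the consistency checks listed above, using plain Scala
// stand-ins for one row's rawPrediction and probability vectors.
// All names are illustrative, not Spark test code.
object GbtOutputChecks {
  // raw = (-margin, +margin); probability of the positive class under
  // logistic loss is 1 / (1 + exp(-2 * margin)).
  def rawToProbability(raw: Array[Double]): Array[Double] = {
    val p1 = 1.0 / (1.0 + math.exp(-2.0 * raw(1)))
    Array(1.0 - p1, p1)
  }

  def argmax(v: Array[Double]): Int = v.indexOf(v.max)

  // Two of the invariants from the review comment, for a single row:
  // probabilities sum to one, and probability.argmax agrees with raw.argmax.
  def consistent(raw: Array[Double]): Boolean = {
    val prob = rawToProbability(raw)
    val sumsToOne = math.abs(prob.sum - 1.0) < 1e-8
    val argmaxAgrees = argmax(prob) == argmax(raw)
    sumsToOne && argmaxAgrees
  }
}
```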
done
In logistic regression we had previously overridden some of the methods in probabilistic classifier since we were only dealing with two classes, which makes those methods a bit faster (hard to say how much). We can do it here for now, but I'd be slightly in favor of not doing it since I'm not sure how much we gain from it and it makes the code harder to follow. Thoughts?
Sorry, I'm a bit confused: this classifier also only deals with two classes; it does not support multiclass data. Instead of overriding, what is the alternative? There is no default predictRaw or raw2probability implemented in ProbabilisticClassifier, and it seems that this is the minimum required for GBTClassifier to use ProbabilisticClassifier. Can you please give more information on this point?
I can see how my comment was confusing now :) Since GBT only supports two classes right now, we could override methods like probability2prediction, which by default call what is implemented in ProbabilisticClassifier. When thresholds are not defined, it calls probability.argmax, which for two classes we could simplify to:

```scala
if (probability(1) > probability(0)) 1 else 0
```

Looking now, logistic regression also had a getThreshold method which allowed it to avoid loops in some cases, but we don't have it here. Let's leave things how they are.
Sorry, I'm still a little confused: should I override probability2prediction and simplify, or should I keep the argmax as is? The argmax seems better because it is more general anyway, but please let me know if you would prefer that I make any changes here.
Let's not change anything for now, it's fine as it is. Sorry for the confusion.
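For completeness, the two prediction rules being compared here, the general argmax and the two-class shortcut, agree for binary probability vectors; a small illustrative sketch:

```scala
// The two equivalent prediction rules discussed above, for two classes.
// Illustrative only, not Spark code.

// General rule: pick the index of the largest probability.
def predictByArgmax(probability: Array[Double]): Double =
  probability.indexOf(probability.max).toDouble

// Two-class shortcut from the review comment.
def predictBinary(probability: Array[Double]): Double =
  if (probability(1) > probability(0)) 1.0 else 0.0
```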
You can use .foreach { case Row(raw: DenseVector, pred: Double, prob: DenseVector) => ... } here.
done
@sethah @jkbradley thank you for the review - could you please take another look since I've updated the code based on your comments?

Test build #70963 has finished for PR 16441 at commit
It looks like I am failing the binary compatibility tests despite this constructor being private: class GBTClassificationModel private[ml](. This is the same thing that happened in my original PR, when I had to add the additional this() overload to pass the tests. In the PR comments it was mentioned that I should be able to remove the unused constructor; does this mean that I need to change the binary compatibility test somehow as well? My guess is that the binary compat tests are Java-based rather than Scala-based, in which case private[ml] doesn't matter, so the solution would be to keep the extra constructor I had before (still private[ml]), only so I can pass the binary compat tests.

Test build #70982 has finished for PR 16441 at commit
Indeed, re-adding the constructor seems to make the binary compatibility tests pass (see the Spark QA build above). In favor of making the binary compat tests pass, I think we can keep the extra private constructor, even though for most people it won't do anything. Please let me know if there are any outstanding comments that still need to be addressed. Thank you!

I've removed the WIP from the title to reflect the status of the pull request.

ping @sethah @jkbradley could you please take another look since I've updated the code based on your comments? Thank you!
sethah left a comment:

Made another pass. Thanks for working on this, and my apologies for the delayed review.
DenseVector is unused
removed
predFromRaw?
Also, can we leave a comment regarding the fact that we'd want to check other loss types here for classification if they are ever added.
done and done
check that prob(0) + prob(1) ~== 1.0 absTol 1e-8
Good idea! Done. I added absEps for 1e-8 so that there won't be any magic constants floating around.
We can save ourselves some computation here:

```scala
case dv: DenseVector =>
  dv.values(0) = computeProb(dv.values(0))
  dv.values(1) = 1.0 - dv.values(0)
  dv
```
done
Actually, this would be better served embedded in the loss object. One solution would be to make a few changes to the loss:

```scala
trait ClassificationLoss extends Loss {
  private[spark] def computeProbability(prediction: Double): Double
}

object LogLoss extends ClassificationLoss
```

Then we could add a class member to the model, private val oldLoss: ClassificationLoss = getOldLossType, and just call oldLoss.computeProbability(pred) inside raw2ProbabilityInPlace. There might be a better solution too, but really I think it should be part of the loss.
Adding private val oldLoss: ClassificationLoss = getOldLossType won't work because getOldLossType returns a Loss, not a LogLoss, and Loss doesn't have computeProbability. However, I did add the ClassificationLoss trait, and in classProbability I just call LogLoss.computeProbability. I'm not sure if it will pass the binary compat checks though; let's see...
You can change getOldLossType to return a classification loss, can't you?
good point, will update
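Putting these comments together, the eventual shape might look roughly like the following self-contained sketch. The Loss trait here is a minimal stand-in for Spark's org.apache.spark.mllib.tree.loss.Loss, and visibility modifiers are omitted so it runs standalone; this is an illustration of the design, not the actual Spark source:

```scala
// Minimal stand-in for Spark's Loss trait so the sketch runs standalone.
trait Loss {
  def computeError(prediction: Double, label: Double): Double
}

// The ClassificationLoss trait proposed in the review: classification
// losses additionally know how to turn a raw margin into a probability.
trait ClassificationLoss extends Loss {
  def computeProbability(margin: Double): Double
}

object LogLoss extends ClassificationLoss {
  // 2 log(1 + exp(-2 y F)) with labels y in {-1, +1}.
  def computeError(prediction: Double, label: Double): Double =
    2.0 * math.log1p(math.exp(-2.0 * label * prediction))

  // For log loss, P(y = 1 | x) = 1 / (1 + exp(-2 F(x))).
  def computeProbability(margin: Double): Double =
    1.0 / (1.0 + math.exp(-2.0 * margin))
}

// With getOldLossType returning ClassificationLoss (as suggested above),
// the model can resolve the loss once instead of once per instance.
def getOldLossType: ClassificationLoss = LogLoss
val oldLoss: ClassificationLoss = getOldLossType
```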
The @Since tag isn't needed since it's private.
We should just use numClasses = 2 for now, since getNumClasses can make an extra pass over the data, and >2 classes are not supported anyway.
Hmm, logistic regression gets the number of classes and throws in the binomial case, and getNumClasses should ideally read the number of classes from the metadata, which shouldn't make an extra pass (ideally the label column is categorical). But I think it's ok for now to make it 2 until we make GBT support the multiclass case.
If getNumClasses doesn't find metadata, then it will make a pass over the data.
Right, I removed it for now, but ideally the user would preprocess the data and make the label column categorical. They would either do that through StringIndexer, or, if they know the labels ahead of time, add the metadata themselves (although unfortunately only advanced users can currently do this; there is no transform that allows them to pre-specify the labels).
It's still there...
oops, I thought I changed it, sorry
I prefer to leave the handling of thresholds for another JIRA, but technically users will be able to set it. We can either do it here in this PR, or throw an error until we get it implemented in a follow up. Thoughts @jkbradley?
it looks like decision tree classifier has the same problem with thresholds
Actually, it looks like both this classifier and the decision tree already handle thresholds in the probability2prediction method of ProbabilisticClassifier.scala. Can you give more information on why GBTClassifier is not handling thresholds correctly?
There is no setThresholds method, and there are no unit tests off the top of my head.
I do see a setThresholds method on both the classifier and the model. It comes from ProbabilisticClassifier:

```scala
abstract class ProbabilisticClassifier[
    FeaturesType,
    E <: ProbabilisticClassifier[FeaturesType, E, M],
    M <: ProbabilisticClassificationModel[FeaturesType, M]]
  extends Classifier[FeaturesType, E, M] with ProbabilisticClassifierParams {

  /** @group setParam */
  def setProbabilityCol(value: String): E = set(probabilityCol, value).asInstanceOf[E]

  /** @group setParam */
  def setThresholds(value: Array[Double]): E = set(thresholds, value).asInstanceOf[E]
}
```
ah, ok good catch. We should handle thresholds in this PR then. Can you look at other test suites and add those tests?
Sure, I've added more tests in the latest commit. I've also fixed an issue where predict was not using thresholds; if they are defined, we now use them.
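ProbabilisticClassificationModel documents threshold-based prediction as choosing the class i with the largest probability(i) / threshold(i). A self-contained sketch of that rule (illustrative, not the actual Spark source):

```scala
// Sketch of threshold-based prediction as documented for
// ProbabilisticClassificationModel: predict the class i that maximizes
// probability(i) / threshold(i). Illustrative, not the Spark source.
def predictWithThresholds(probability: Array[Double],
                          thresholds: Array[Double]): Double = {
  require(probability.length == thresholds.length)
  var best = 0
  var bestScore = Double.NegativeInfinity
  var i = 0
  while (i < probability.length) {
    // Treat a zero threshold as "this class always wins" to avoid 0/0.
    val score =
      if (thresholds(i) == 0.0) Double.PositiveInfinity
      else probability(i) / thresholds(i)
    if (score > bestScore) { best = i; bestScore = score }
    i += 1
  }
  best.toDouble
}
```

With equal thresholds this reduces to plain argmax; a high threshold on one class makes it harder for that class to win.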
we can avoid duplicating this code. Maybe, as in LogisticRegression, we can create a private function called score or margin and then use that in predict and predictRaw
Good idea; refactored this into a private margin method.
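The refactor can be sketched as follows: a single private margin helper computes the weighted ensemble sum, and both predict and predictRaw reuse it (names are illustrative, not the actual Spark model, which works from the tree ensemble and feature vector directly):

```scala
// Sketch of the margin refactor: one private helper computes the weighted
// ensemble sum, and both predict and predictRaw reuse it.
// Names are illustrative, not actual Spark API.
class GbtModelSketch(treeWeights: Array[Double]) {
  // treePredictions stands in for per-tree predictions on one instance
  // (each in {-1.0, +1.0} space).
  private def margin(treePredictions: Array[Double]): Double = {
    var acc = 0.0
    var i = 0
    while (i < treePredictions.length) {
      acc += treePredictions(i) * treeWeights(i)
      i += 1
    }
    acc
  }

  // Raw prediction as (negative margin, positive margin).
  def predictRaw(treePredictions: Array[Double]): (Double, Double) = {
    val m = margin(treePredictions)
    (-m, m)
  }

  def predict(treePredictions: Array[Double]): Double =
    if (margin(treePredictions) > 0.0) 1.0 else 0.0
}
```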
Shall we add a "default params" test for parity with other suites like LogisticRegression?
good idea, added the extra test
Test build #71142 has finished for PR 16441 at commit

Test build #71144 has finished for PR 16441 at commit

Test build #71145 has finished for PR 16441 at commit

ping @sethah @jkbradley could you please take another look since I've updated the code based on your comments? Thank you!

Test build #71150 has finished for PR 16441 at commit
sethah left a comment:

Looking good! Thanks for all the updates.
style: put each arg on its own line, using 4-space indentation, as is done with the constructor
Done, thanks. I also updated the other constructor (my default IntelliJ settings don't seem to match the suggested ones).
nit: if (margin(features) > 0.0) 1.0 else 0.0
done
This comment should be removed since we made this function generic
Moved the comment to the LogLoss computeProbability method (kept for the positive result only).
Should we make a private class member private val loss = getOldLossType? Otherwise we call getOldLossType (which calls getLossType) for every single instance.
Hmm, this is a tricky point: in the future, if we have more than one loss, then when the user changes it the results should change as well. But since we only have one loss function, I guess it is ok... I'll make the update but add a warning comment.
You mean that if someone takes a model and changes the loss type via set(lossType, "other") that the probability function should change? I don't think it makes sense to change the probability function for a model, since the probability is chosen to be optimal for a specific loss, but it's a good point. What do you think?
It's still there...
this can be private[spark] I think
done
nit: prefer explicit doubles like 1.0 instead of 1
done
ping @sethah @jkbradley could you please take another look since I've updated the code based on your comments? Thank you!

Test build #71169 has finished for PR 16441 at commit

Test build #71170 has finished for PR 16441 at commit

Test build #71171 has finished for PR 16441 at commit
Force-pushed from 0def50c to 1abfee0
Test build #71616 has finished for PR 16441 at commit

Test build #71617 has finished for PR 16441 at commit
LGTM

@imatiach-msft thanks for this, really great to have GBT in the classification trait hierarchy, and now usable with binary evaluator metrics!
In which release is this fix going to be available? Thanks!

Should be in 2.2.0.
Great! Thanks Nick! - Yong
What changes were proposed in this pull request?
For all of the classifiers in MLlib we can predict probabilities, except for GBTClassifier.
Also, all classifiers inherit from ProbabilisticClassifier but GBTClassifier strangely inherits from Predictor, which is a bug.
This change corrects the interface and adds the ability for the classifier to give a probabilities vector.
How was this patch tested?
The basic ML tests were run after making the changes. I've marked this as WIP as I need to add more tests.