[SPARK-18701][ML] Fix Poisson GLM failure due to wrong initialization #16131
Conversation
Jenkins, add to whitelist
```diff
   require(y >= 0.0, "The response variable of Poisson family " +
     s"should be non-negative, but got $y")
-  y
+  y + 0.1
```
You're saying that the initial value of the response variable could be 0 because it could be a mean over all-0 values? Yes, maybe. Why 0.1? Shouldn't this just be `math.max(y, EPSILON)` for a small epsilon?
The issue is that initializing mu = y when y = 0 can lead to an almost-zero weight in the subsequent IWLS. I'll explain in detail below.
The following initialization in FamilyAndLink shows that the mean mu is initialized via family.initialize, so mu can be zero when y = 0. That is incorrect, since the mean of a Poisson distribution can never be zero.
```scala
val newInstances = instances.map { instance =>
  val mu = family.initialize(instance.label, instance.weight)
  val eta = predict(mu)
  Instance(eta, instance.weight, instance.features)
}
```
In each iteration of the reweighted least squares, the weight fed to WLS is defined as follows (reweightFunc):
```scala
val eta = model.predict(instance.features)
val mu = fitted(eta)
val offset = eta + (instance.label - mu) * link.deriv(mu)
val weight = instance.weight / (math.pow(this.link.deriv(mu), 2.0) * family.variance(mu))
```
Let's use one observation as an example. Suppose mu = y = 0 (after projection, mu becomes the small epsilon). Then running the above prints:

offset: -32.43970761868737 weight: 2.2177289643212133E-14

The weight is almost zero, which is the cause of the issue. To see why: in the Poisson case with log link, link.deriv(mu) = 1/mu, so math.pow(this.link.deriv(mu), 2.0) * family.variance(mu) = (1/mu^2) * mu = 1/mu, since family.variance(mu) = mu. The weight is therefore instance.weight * mu, i.e., basically mu, which vanishes as mu goes to zero.
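A minimal sketch of that arithmetic (the value of mu here is illustrative, roughly what the epsilon projection produces when y = 0):

```scala
// Poisson family with log link:
//   link.deriv(mu)      = 1 / mu
//   family.variance(mu) = mu
// so the denominator of the WLS weight is (1/mu)^2 * mu = 1/mu.
val mu = 2.2e-14                           // illustrative: mu after projection when y = 0
val denom = math.pow(1.0 / mu, 2.0) * mu   // = 1 / mu
val weight = 1.0 / denom                   // = mu, effectively zero
println(s"weight: $weight")                // ~2.2e-14
```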
That also explains why using a small epsilon for initialization may not help. I added 0.1 because that is how R does it:
```r
> poisson()$initialize
expression({
    if (any(y < 0))
        stop("negative values not allowed for the 'Poisson' family")
    n <- rep.int(1, nobs)
    mustart <- y + 0.1
})
```
@srowen Actually we already protect the value against 0.0 with:
```scala
override def project(mu: Double): Double = {
  if (mu < epsilon) {
    epsilon          // clamp the mean away from zero
  } else if (mu.isInfinity) {
    Double.MaxValue  // guard against overflow
  } else {
    mu
  }
}
```
but it seems that epsilon is not enough, since the curve is very steep near zero. I'm OK with this change, and matching R makes sense. Thanks.
The problem I see is that the initially learned model always produces mu ~= 0, which causes the adjusted response to blow up (since it depends on 1/mu). That causes the predicted response to blow up, which finally causes the weights to become infinite.
BTW, statsmodels in Python initializes all families except Binomial to mu_0 = (y + avg(y)) / 2. I am curious whether we have a reference for defaulting to mu_0 = y when we first implemented GLR. It would be nice to have a sound reason for the initialization beyond matching one package or another, though that's probably not strictly necessary.
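For reference, a hypothetical sketch of that statsmodels-style initialization on a plain Scala collection (the names and data here are illustrative, not Spark's API); note that it needs avg(y) up front, i.e. one extra pass over the data:

```scala
// statsmodels-style initialization: mu_0 = (y + avg(y)) / 2
val ys = Seq(0.0, 2.0, 5.0)              // toy responses, including a zero
val yMean = ys.sum / ys.size             // requires a full pass over the data
val mu0 = ys.map(y => (y + yMean) / 2.0) // strictly positive as long as avg(y) > 0
```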
@yanboliang Thanks much for clarifying and approving this.
@sethah That's exactly the issue. Using avg(y), as statsmodels does, would be more costly than the current approach since it takes one more pass through the data, right?
@srowen Theoretically, one only needs to add 0.1 in the y = 0 case, as a guess of the mean for those observations. But I think it may be better to add this small number in all cases. Imagine that one models rates of occurrence, i.e., frequency divided by exposure. For a sufficiently large exposure, the rate can get tiny and close to zero, and adding 0.1 there may help avoid numerical issues too. Does that make sense?
This is still an argument for making tiny values at least some epsilon (here 0.1), right? Why does 30 need to become 30.1? Isn't that just needlessly introducing a small inaccuracy?
@srowen This is not adjusting the data (y). Rather, the adjustment is to the mean E(y) in the very first step of the iteration. When you observe y = 30, it probably does not matter whether E(y) = 30 or E(y) = 30.1, because the posterior will be very close. But when you observe y = 0, it is certainly not the case that E(y) = 0, because a Poisson must have positive mean. The mean E(y) in that case is probably a small number, say 0.1. One could set it to 1e-16, but that would cause numerical issues in the subsequent iterations. Do you prefer adding 0.1 only when y = 0?
@srowen I verified that the following approaches give the same results for a few data sets:

a) `y + 0.1`
b) `if (y == 0) y + 0.1 else y`
c) `math.max(0.1, y)`

Which one do you want to use, c)?
c) seems most consistent with the intent. Unless I'm totally missing something, that seems preferable (with a comment to explain what it's for).
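A minimal sketch of what option c) could look like in the Poisson family's initialize method, assuming the surrounding class context from the diff above (a sketch of the change under discussion, not necessarily the exact merged code):

```scala
override def initialize(y: Double, weight: Double): Double = {
  require(y >= 0.0, "The response variable of Poisson family " +
    s"should be non-negative, but got $y")
  // force Poisson mean > 0 to avoid numerical instability in IRLS;
  // 0.1 matches R's poisson()$initialize (mustart <- y + 0.1)
  math.max(y, 0.1)
}
```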
@srowen I have tried running the 2.1 version of Poisson GLM on our data, and it fails for most of it (it does sometimes work when there are not lots of zeros). I traced down the cause, and the fix proposed here seems to be correct. At least the Poisson GLM now works on the data where it failed before.
@srowen @yanboliang
sethah left a comment:
I don't have a problem with adding 0.1 in all cases since this is just an initialization step, but it'd still be nice to have a reference to point to. Is there anything that discusses initializations for different families in IRLS?
```scala
  .setFitIntercept(fitIntercept).setLinkPredictionCol("linkPrediction")
val model = trainer.fit(dataset)
val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
println("coeff is " + actual)
```
remove :)
```diff
   require(y >= 0.0, "The response variable of Poisson family " +
     s"should be non-negative, but got $y")
-  y
+  // Set lower bound for mean in the FIRST step in IWLS
```
Let's change this to `// force Poisson mean > 0 to avoid numerical instability in IRLS`.
@sethah Thanks for the review. I have updated according to your suggestion. @yanboliang @srowen Please take another look. Thanks.
```diff
   require(y >= 0.0, "The response variable of Poisson family " +
     s"should be non-negative, but got $y")
-  y
+  // force Poisson mean > 0 to avoid numerical instability in IRLS
```
It wouldn't hurt to give a reference to the R source code here if you can, even just pointing out that this is (sort of) how R deals with it, as justification. It is a surprisingly big epsilon after all.
@srowen Done. Thanks for the suggestion.
@srowen Is this ready to be merged?
LGTM
Merged to master, and to 2.1 to match SPARK-18166
Poisson GLM fails for many standard data sets (see example in test or JIRA). The issue is incorrect initialization leading to almost-zero probability and weights. Specifically, the mean is initialized as the response, which could be zero. Applying the log link results in very negative numbers (protected against -Inf), which again leads to close-to-zero probability and weights in the weighted least squares. Fix and test are included in the commits.

## What changes were proposed in this pull request?

Update initialization in Poisson GLM.

## How was this patch tested?

Add test in GeneralizedLinearRegressionSuite.

@srowen @sethah @yanboliang @HyukjinKwon @mengxr