
Conversation

@actuaryzhang (Contributor) commented Dec 4, 2016

Poisson GLM fails for many standard data sets (see example in test or JIRA). The issue is incorrect initialization, which leads to almost-zero probabilities and weights. Specifically, the mean is initialized as the response, which could be zero. Applying the log link then yields very negative numbers (protected against -Inf), which in turn produce near-zero probabilities and weights in the weighted least squares. Fix and test are included in the commits.

What changes were proposed in this pull request?

Update initialization in Poisson GLM

How was this patch tested?

Add test in GeneralizedLinearRegressionSuite

@srowen @sethah @yanboliang @HyukjinKwon @mengxr

@actuaryzhang (Contributor Author)

Jenkins, add to whitelist

actuaryzhang changed the title from "[SPARK-18701][ML] Poisson GLM fails due to wrong initialization" to "[SPARK-18701][ML] Fix Poisson GLM failure due to wrong initialization" Dec 4, 2016
  require(y >= 0.0, "The response variable of Poisson family " +
    s"should be non-negative, but got $y")
- y
+ y + 0.1
Member

You're saying that the initial value of the response variable could be 0 because it could be a mean over all 0 values? yes, maybe. Why 0.1? shouldn't this just be math.max(y, EPSILON) for a small epsilon?

Contributor Author

The issue is that initializing mu = y when y = 0 can lead to an almost-zero weight in the subsequent IWLS. I'll explain in detail below.

The following initialization method from FamilyAndLink shows that the mean mu is initialized using family.initialize. So mu could be zero if y = 0, which is incorrect since the mean of a Poisson distribution can never be zero.

      val newInstances = instances.map { instance =>
        val mu = family.initialize(instance.label, instance.weight)
        val eta = predict(mu)
        Instance(eta, instance.weight, instance.features)
      }

In each iteration of the reweighted least squares, the weight fed to WLS is defined as follows (reweightFunc):

        val eta = model.predict(instance.features)
        val mu = fitted(eta)
        val offset = eta + (instance.label - mu) * link.deriv(mu)
        val weight = instance.weight / (math.pow(this.link.deriv(mu), 2.0) * family.variance(mu))

Let's use one observation as an example. Suppose mu = y = 0. Then running the above prints:

offset: -32.43970761868737 weight: 2.2177289643212133E-14 

The weight is almost zero, which is the cause of the issue. To see why: in the Poisson case with log link, link.deriv(mu) = 1/mu and family.variance(mu) = mu, so (math.pow(this.link.deriv(mu), 2.0) * family.variance(mu)) = (1/mu^2) * mu = 1/mu. So the weight is basically mu.
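
To make that concrete, here is a minimal sketch of the weight computation; the clamping value for mu is an assumption (Spark protects mu with some small lower bound, but the exact epsilon is not shown here):

    // Sketch of the near-zero weight when y = 0, assuming mu is clamped to a
    // small epsilon (the exact epsilon Spark uses is an assumption here).
    val epsilon = 1e-14               // hypothetical lower bound for mu
    val mu = math.max(0.0, epsilon)   // initialized mean for an observation with y = 0
    val linkDeriv = 1.0 / mu          // log link: d/dmu log(mu) = 1/mu
    val variance = mu                 // Poisson variance function: Var(mu) = mu
    val weight = 1.0 / (math.pow(linkDeriv, 2.0) * variance)  // simplifies to mu
    println(s"weight = $weight")      // ~1e-14: the observation is effectively ignored by WLS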

That also explains why using a small epsilon as initialization may not be helpful. I added 0.1 because that is how R does it:

> poisson()$initialize
expression({
    if (any(y < 0)) 
        stop("negative values not allowed for the 'Poisson' family")
    n <- rep.int(1, nobs)
    mustart <- y + 0.1
})

@yanboliang (Contributor) commented Dec 5, 2016

@srowen Actually we have protected the value against 0.0 with:

    override def project(mu: Double): Double = {
      if (mu < epsilon) {
        epsilon
      } else if (mu.isInfinity) {
        Double.MaxValue
      } else {
        mu
      }
    }

but it seems that epsilon is not enough, since the function curve is very steep near zero. I'm OK with this change, and matching R makes sense. Thanks.
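
A quick sketch of that steepness, reusing the weight formula from reweightFunc above (for Poisson with log link the weight simplifies to mu itself, so clamping at a tiny epsilon still gives a near-zero weight, while a floor of 0.1 does not):

    // The IRLS weight for Poisson with log link equals mu itself, so a tiny
    // epsilon floor still produces a near-zero weight; 0.1 does not.
    for (mu <- Seq(1e-16, 1e-8, 1e-3, 0.1)) {
      val weight = 1.0 / (math.pow(1.0 / mu, 2.0) * mu)
      println(f"mu = $mu%.0e  ->  weight = $weight%.0e")
    }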

Contributor

The problem I see is that the initial learned model always produces mu ~= 0, which causes the adjusted response to blow up (since it depends on 1/mu). That causes the predicted response to blow up, which finally causes the weights to become infinity.

BTW, statsmodels in Python initializes all families except Binomial to mu_0 = (y + avg(y)) / 2. I am curious if we have a reference for defaulting to mu_0 = y when we first implemented GLR. It would be nice to have a sound reason for the initialization other than matching one package or the other, though probably not strictly necessary.
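
For comparison, a sketch of that statsmodels-style initialization (the helper name is hypothetical; note the extra pass over the data it needs to compute avg(y)):

    // Hypothetical statsmodels-style initialization: mu0 = (y + avg(y)) / 2,
    // which costs one extra pass over the data to compute avg(y).
    def initializeWithMean(labels: Seq[Double]): Seq[Double] = {
      val yBar = labels.sum / labels.size    // the extra pass
      labels.map(y => (y + yBar) / 2.0)      // strictly positive whenever any y > 0
    }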

Contributor Author

@yanboliang Thanks much for clarifying and approving this.

@sethah That's exactly the issue. Using avg(y), as statsmodels does, would be more costly than the current approach since it takes one more pass through the data, right?

Contributor Author

@srowen Theoretically, one only needs to add 0.1 in the y = 0 case, as a guess of the mean for those observations. But I think it may be better to add this small number in all cases. Imagine that one models rates of occurrence, i.e., frequency divided by exposure. With a large exposure, the rate can get tiny and close to zero. Adding 0.1 may help avoid numerical issues in that case too. Does that make sense?
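
A hypothetical rate example along those lines (all numbers invented for illustration):

    // Modeling rates of occurrence: frequency divided by exposure can be tiny
    // even when nonzero, so shifting every response by 0.1 guards these too.
    val frequency = 3.0     // invented event count
    val exposure = 1000.0   // invented exposure
    val rate = frequency / exposure   // 0.003: positive but close to zero
    val mu0 = rate + 0.1              // 0.103: keeps the first IRLS weight away from 0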

Member

This is still an argument for making tiny values at least some epsilon (here 0.1) right? why does 30 need to become 30.1? isn't that just introducing some small inaccuracy needlessly?

@actuaryzhang (Contributor Author) commented Dec 5, 2016

@srowen This is not adjusting the data (y). Rather, the adjustment is to the mean E(y) in the very first step of the iteration. When you observe y = 30, it probably does not matter whether E(y) = 30 or E(y) = 30.1, because the posterior will be very close. But when you observe y = 0, it is certainly not E(y) = 0, because a Poisson distribution must have a positive mean. The mean E(y) in this case is probably a small number, say 0.1. One may set it to 1e-16, but that will cause numerical issues in the subsequent iterations. Do you prefer just adding 0.1 when y = 0?

Contributor Author

@srowen I verified that the following approaches give the same results for a few data sets:
a) y + 0.1
b) if (y == 0) y + 0.1 else y
c) math.max(0.1, y)
Which one do you want to use, c)?

Member

c) seems most consistent with the intent. Unless I'm totally missing something, that seems preferable (with a comment to explain what it's for).
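
A sketch of what option c) with the requested comment could look like (an illustration of the discussion above, not necessarily the exact merged diff):

    // Sketch of option c) with an explanatory comment (illustrative only).
    override def initialize(y: Double, weight: Double): Double = {
      require(y >= 0.0, "The response variable of Poisson family " +
        s"should be non-negative, but got $y")
      // force Poisson mean > 0: a zero mean makes the first IRLS weight ~0;
      // R uses y + 0.1 (see poisson()$initialize)
      math.max(y, 0.1)
    }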

@actuaryzhang (Contributor Author)

@srowen
Try this example below, or the example @sethah had an issue with in #15683.

I have tried running the 2.1 version of the Poisson GLM on our data and it fails for most of the data sets (it sometimes works when there are not many zeros). I traced down the cause, and the fix proposed here seems to be correct: at least the Poisson GLM now works on the data where it failed before.

import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import spark.implicits._  // assumes a SparkSession in scope named `spark`

val datasetPoissonLogWithZero = Seq(
  LabeledPoint(0.0, Vectors.dense(18, 1.0)),
  LabeledPoint(1.0, Vectors.dense(12, 0.0)),
  LabeledPoint(0.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(13, 2.0)),
  LabeledPoint(0.0, Vectors.dense(15, 1.0)),
  LabeledPoint(1.0, Vectors.dense(16, 1.0)),
  LabeledPoint(0.0, Vectors.dense(10, 0.0)),
  LabeledPoint(0.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(12, 2.0)),
  LabeledPoint(0.0, Vectors.dense(13, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(1.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(15, 0.0)),
  LabeledPoint(0.0, Vectors.dense(12, 2.0)),
  LabeledPoint(1.0, Vectors.dense(12, 2.0))
).toDF()

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setMaxIter(20)
  .setRegParam(0)

val model = glr.fit(datasetPoissonLogWithZero)

@actuaryzhang (Contributor Author)

@srowen @yanboliang
I have updated the code and further cleaned up the test. Please review and let me know if there are any questions. Thanks.

@sethah (Contributor) left a comment

I don't have a problem with adding 0.1 in all cases since this is just an initialization step, but it'd still be nice to have a reference to point to. Is there anything that discusses initializations for different families in IRLS?

.setFitIntercept(fitIntercept).setLinkPredictionCol("linkPrediction")
val model = trainer.fit(dataset)
val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
println("coeff is " + actual)
Contributor

remove :)

  require(y >= 0.0, "The response variable of Poisson family " +
    s"should be non-negative, but got $y")
- y
+ // Set lower bound for mean in the FIRST step in IWLS
Contributor

let's change this to // force Poisson mean > 0 to avoid numerical instability in IRLS

@actuaryzhang (Contributor Author)

@sethah Thanks for the review. I have updated according to your suggestion.

@yanboliang @srowen Please take another look. Thanks.

  require(y >= 0.0, "The response variable of Poisson family " +
    s"should be non-negative, but got $y")
- y
+ // force Poisson mean > 0 to avoid numerical instability in IRLS
Member

It wouldn't hurt to give a reference to the R source code here if you can, even just pointing out that this is (sort of) how R deals with it, as justification. It is a surprisingly big epsilon after all.

@actuaryzhang (Contributor Author)

@srowen Done. Thanks for the suggestion.

@actuaryzhang (Contributor Author)

@srowen Is this ready to be merged?

@sethah (Contributor) commented Dec 7, 2016

LGTM

@srowen (Member) commented Dec 7, 2016

Merged to master, and to 2.1 to match SPARK-18166

asfgit pushed a commit that referenced this pull request Dec 7, 2016

Author: actuaryzhang <[email protected]>

Closes #16131 from actuaryzhang/master.

(cherry picked from commit b828027)
Signed-off-by: Sean Owen <[email protected]>
asfgit closed this in b828027 Dec 7, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016

Closes apache#16131 from actuaryzhang/master.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Closes apache#16131 from actuaryzhang/master.