[SPARK-18701][ML] Fix Poisson GLM failure due to wrong initialization #16131
Conversation
Jenkins, add to whitelist
```diff
   require(y >= 0.0, "The response variable of Poisson family " +
     s"should be non-negative, but got $y")
-  y
+  y + 0.1
```
You're saying that the initial value of the response variable could be 0 because it could be a mean over all-0 values? Yes, maybe. Why 0.1? Shouldn't this just be `math.max(y, EPSILON)` for a small epsilon?
The issue is that initializing mu = y when y = 0 can lead to an almost-zero weight in the subsequent IWLS. I'll explain in detail below.
The following initialization in FamilyAndLink shows that the mean mu is initialized via family.initialize, so mu can be zero when y = 0. That is incorrect, since the mean of a Poisson distribution can never be zero.
```scala
val newInstances = instances.map { instance =>
  val mu = family.initialize(instance.label, instance.weight)
  val eta = predict(mu)
  Instance(eta, instance.weight, instance.features)
}
```
In each iteration of the reweighted least squares, the weight fed to WLS is defined as follows (reweightFunc):
```scala
val eta = model.predict(instance.features)
val mu = fitted(eta)
val offset = eta + (instance.label - mu) * link.deriv(mu)
val weight = instance.weight / (math.pow(this.link.deriv(mu), 2.0) * family.variance(mu))
```
Let's use one observation as an example. Suppose mu = y = 0 (after projection, mu becomes the small epsilon). Then running the above prints:

offset: -32.43970761868737 weight: 2.2177289643212133E-14

The weight is almost zero, which is the cause of the issue. To see why: in the Poisson case with log link, link.deriv(mu) = 1/mu, so math.pow(this.link.deriv(mu), 2.0) * family.variance(mu) = (1/mu^2) * mu = 1/mu, since family.variance(mu) = mu. The weight is therefore instance.weight * mu, i.e., basically mu, which vanishes as mu goes to zero.
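A minimal sketch of that arithmetic (the value of mu here is illustrative, roughly what the epsilon projection produces when y = 0):

```scala
// Poisson family with log link:
//   link.deriv(mu)      = 1 / mu
//   family.variance(mu) = mu
// so the denominator of the WLS weight is (1/mu)^2 * mu = 1/mu.
val mu = 2.2e-14                           // illustrative: mu after projection when y = 0
val denom = math.pow(1.0 / mu, 2.0) * mu   // = 1 / mu
val weight = 1.0 / denom                   // = mu, effectively zero
println(s"weight: $weight")                // ~2.2e-14
```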
That also explains why using a small epsilon for initialization may not help. I added 0.1 because that is how R does it:
```r
> poisson()$initialize
expression({
    if (any(y < 0))
        stop("negative values not allowed for the 'Poisson' family")
    n <- rep.int(1, nobs)
    mustart <- y + 0.1
})
```
@srowen Actually we already protect the value against 0.0 with:
```scala
override def project(mu: Double): Double = {
  if (mu < epsilon) {
    epsilon          // clamp the mean away from zero
  } else if (mu.isInfinity) {
    Double.MaxValue  // guard against overflow
  } else {
    mu
  }
}
```
but it seems that epsilon is not enough, since the curve is very steep near zero. I'm OK with this change, and matching R makes sense. Thanks.
The problem I see is that the initially learned model always produces mu ~= 0, which causes the adjusted response to blow up (since it depends on 1/mu). That causes the predicted response to blow up, which finally causes the weights to become infinite.
BTW, statsmodels in Python initializes all families except Binomial to mu_0 = (y + avg(y)) / 2. I am curious whether we have a reference for defaulting to mu_0 = y when we first implemented GLR. It would be nice to have a sound reason for the initialization beyond matching one package or another, though that's probably not strictly necessary.
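For reference, a hypothetical sketch of that statsmodels-style initialization on a plain Scala collection (the names and data here are illustrative, not Spark's API); note that it needs avg(y) up front, i.e. one extra pass over the data:

```scala
// statsmodels-style initialization: mu_0 = (y + avg(y)) / 2
val ys = Seq(0.0, 2.0, 5.0)              // toy responses, including a zero
val yMean = ys.sum / ys.size             // requires a full pass over the data
val mu0 = ys.map(y => (y + yMean) / 2.0) // strictly positive as long as avg(y) > 0
```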
@yanboliang Thanks much for clarifying and approving this.
@sethah That's exactly the issue. Using avg(y), as statsmodels does, would be more costly than the current approach since it takes one more pass through the data, right?
@srowen Theoretically, one only needs to add 0.1 in the y = 0 case, as a guess of the mean for those observations. But I think it may be better to add this small number in all cases. Imagine that one models rates of occurrence, i.e., frequency divided by exposure. For a sufficiently large exposure, the rate can get tiny and close to zero, and adding 0.1 there may help avoid numerical issues too. Does that make sense?
This is still an argument for making tiny values at least some epsilon (here 0.1), right? Why does 30 need to become 30.1? Isn't that just needlessly introducing a small inaccuracy?
@srowen This is not adjusting the data (y). Rather, the adjustment is to the mean E(y) in the very first step of the iteration. When you observe y = 30, it probably does not matter whether E(y) = 30 or E(y) = 30.1, because the posterior will be very close. But when you observe y = 0, it is certainly not the case that E(y) = 0, because a Poisson must have positive mean. The mean E(y) in that case is probably a small number, say 0.1. One could set it to 1e-16, but that would cause numerical issues in the subsequent iterations. Do you prefer adding 0.1 only when y = 0?
@srowen I verified that the following approaches give the same results for a few data sets:

a) `y + 0.1`
b) `if (y == 0) y + 0.1 else y`
c) `math.max(0.1, y)`

Which one do you want to use, c)?
c) seems most consistent with the intent. Unless I'm totally missing something, that seems preferable (with a comment to explain what it's for).
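A minimal sketch of what option c) could look like in the Poisson family's initialize method, assuming the surrounding class context from the diff above (a sketch of the change under discussion, not necessarily the exact merged code):

```scala
override def initialize(y: Double, weight: Double): Double = {
  require(y >= 0.0, "The response variable of Poisson family " +
    s"should be non-negative, but got $y")
  // force Poisson mean > 0 to avoid numerical instability in IRLS;
  // 0.1 matches R's poisson()$initialize (mustart <- y + 0.1)
  math.max(y, 0.1)
}
```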
@srowen I have tried running the 2.1 version of Poisson GLM on our data, and it fails for most of it (it does sometimes work when there are not lots of zeros). I traced down the cause, and the fix proposed here seems to be correct. At least the Poisson GLM now works on the data where it failed before.
@srowen @yanboliang
sethah left a comment:
I don't have a problem with adding 0.1 in all cases since this is just an initialization step, but it'd still be nice to have a reference to point to. Is there anything that discusses initializations for different families in IRLS?
```scala
  .setFitIntercept(fitIntercept).setLinkPredictionCol("linkPrediction")
val model = trainer.fit(dataset)
val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
println("coeff is " + actual)
```
remove :)
```diff
   require(y >= 0.0, "The response variable of Poisson family " +
     s"should be non-negative, but got $y")
-  y
+  // Set lower bound for mean in the FIRST step in IWLS
```
Let's change this to `// force Poisson mean > 0 to avoid numerical instability in IRLS`.
@sethah Thanks for the review. I have updated according to your suggestion. @yanboliang @srowen Please take another look. Thanks.
```diff
   require(y >= 0.0, "The response variable of Poisson family " +
     s"should be non-negative, but got $y")
-  y
+  // force Poisson mean > 0 to avoid numerical instability in IRLS
```
It wouldn't hurt to give a reference to the R source code here if you can, even just pointing out that this is (sort of) how R deals with it, as justification. It is a surprisingly big epsilon after all.
@srowen Done. Thanks for the suggestion.
@srowen Is this ready to be merged?
LGTM
Merged to master, and to 2.1 to match SPARK-18166
Poisson GLM fails for many standard data sets (see example in test or JIRA). The issue is incorrect initialization leading to almost-zero probability and weights. Specifically, the mean is initialized as the response, which could be zero. Applying the log link results in very negative numbers (protected against -Inf), which again leads to close-to-zero probability and weights in the weighted least squares. Fix and test are included in the commits.

## What changes were proposed in this pull request?

Update initialization in Poisson GLM.

## How was this patch tested?

Add test in GeneralizedLinearRegressionSuite.

@srowen @sethah @yanboliang @HyukjinKwon @mengxr