[SPARK-12230][ML] WeightedLeastSquares.fit() should handle division by zero properly if standard deviation of target variable is zero. #10274
Conversation
If this PR is meant to address this "TODO", then the comment should be removed.
@dbtsai Would you have time to take a look at this? Thank you!
jenkins, add to whitelist
The LinearRegression has a bug related to this: when fitIntercept is false, the code should still train the model. Can you fix it either in a separate PR or here?
Thanks.
Let's fix it in a separate PR to make things easier.
I did notice that bug. I was planning to create a separate JIRA for this.
@dbtsai I just created a PR for this bug under a separate JIRA (SPARK-12732).
Test build #48713 has finished for PR 10274 at commit
Test build #48962 has finished for PR 10274 at commit
Test build #49023 has finished for PR 10274 at commit
Added a test for standardizeLabel = false with non-zero regParam, checked against the analytic normal-equation solution.
Added a failure test for standardizeLabel = true, regParam != 0 and yStd == 0.0.
I've added the exception and the test for standardizeLabel = true, regParam != 0 and yStd == 0.0. The only thing left now is to add a test for the case standardizeLabel = false, regParam != 0 and yStd == 0.0. As I mentioned before, in this case I cannot compare with glmnet, so I'll try to implement the normal equation in Python myself and compare with that. The good thing is that, for this particular case, both the normal equation and L-BFGS solvers give identical results!
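For reference, here is a minimal sketch of that Python cross-check using plain NumPy. It reuses the feature matrix from the R example further down but with a constant target, and the regParam value is only illustrative; the objective is plain ridge with an unpenalized intercept, which may not match Spark's exact scaling.
import numpy as np

# Hypothetical data: four samples, two features, constant target (yStd == 0).
A = np.array([[0.0, 5.0], [1.0, 7.0], [2.0, 11.0], [3.0, 13.0]])
b = np.array([17.0, 17.0, 17.0, 17.0])
lam = 0.1  # illustrative regParam

# Penalized normal equations (X^T X + lambda * I) w = X^T b, with an added
# intercept column that is left unregularized.
X = np.column_stack([np.ones_like(b), A])
I = np.eye(X.shape[1])
I[0, 0] = 0.0  # do not regularize the intercept
w = np.linalg.solve(X.T @ X + lam * I, X.T @ b)
print(w)  # expected: intercept close to 17, coefficients close to 0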
Awesome! Great that you see the same result! For the normal equation, it would be nice to have it in R so we can keep the comments consistent. I did implement it once when I implemented the L-BFGS version; let me try to find it.
@dbtsai I've implemented the normal equation with regularization in R. Here is my code:
ridge_regression <- function(A, b, lambda, intercept=TRUE) {
  if (intercept) {
    # Prepend a column of ones and leave the intercept unregularized.
    A = cbind(rep(1.0, length(b)), A)
    I = diag(ncol(A))
    I[1,1] = 0.0
  } else {
    I = diag(ncol(A))
  }
  # Solve (A^T A + lambda * I) w = A^T b via the Cholesky factorization.
  R = chol( t(A) %*% A + lambda*I )
  z = solve(t(R), t(A) %*% b)
  w = solve(R, z)
  return(w)
}
And here are the results I get using this function.
A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
b <- c(17, 19, 23, 29)
ridge_regression(A, b, 0.1)
[,1]
[1,] 12.9048179
[2,] 2.1151586
[3,] 0.6580494
The problem is that these don't quite match glmnet; the difference can be at the few-percent level:
> model <- glmnet(A, b, intercept=TRUE, lambda=0.1, standardize=FALSE,
+ alpha=0, thresh=1E-20)
> print(as.vector(coef(model)))
[1] 13.1018870 2.2362361 0.6159732
But my results match exactly what I get from ridge regression in sklearn:
from sklearn.linear_model import Ridge
import numpy as np

A = np.array([[0, 1, 2, 3], [5, 7, 11, 13]]).T
b = np.array([17.0, 19.0, 23.0, 29.0])
model = Ridge(alpha=0.1, solver='cholesky', fit_intercept=True)
model.fit(A, b)
print(model.intercept_)
print(model.coef_)
12.9048178613
[ 2.11515864 0.65804935]
Even if I use other solvers (svd, lsqr, sparse_cg) in sklearn.linear_model.Ridge, I get exactly the same results.
If I don't use regularization by setting lambda=0, then the results from glmnet are identical to what I get from normal equation and sklearn.linear_model.Ridge.
Have you observed such differences before? Is glmnet making some other correction, or is it just a numerical precision issue? I can't seem to reproduce the glmnet results.
Sorry for getting back to you so late. The difference is due to the fact that glmnet always standardizes the labels, even when standardization == false; standardization == false only turns off standardization of the features. As a result, at least in glmnet, the training is not valid when yStd == 0.0.
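To see concretely why yStd == 0.0 breaks the standardized problem, here is a tiny NumPy illustration with made-up values; label standardization divides by the label's standard deviation, which is the division by zero this PR guards against.
import numpy as np

y = np.array([17.0, 17.0, 17.0, 17.0])  # constant target, so yStd == 0
with np.errstate(invalid="ignore"):
    y_scaled = (y - y.mean()) / y.std()  # standardizing the label divides by yStd
print(y.std())    # 0.0
print(y_scaled)   # [nan nan nan nan]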
…and modified test accordingly.
Test build #49437 has finished for PR 10274 at commit
LGTM. Wait for one extra test. Thanks.
Merged into master. Thanks!
This fixes the behavior of WeightedLeastSquares.fit() when the standard deviation of the target variable is zero. If fitIntercept is true, there is no need to train the model.
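As a quick sanity check of that last point, here is a sketch with made-up data using scikit-learn rather than Spark: when the target is constant and an intercept is fit, the ridge solution is simply intercept = label value with all coefficients zero, so there is nothing to train.
import numpy as np
from sklearn.linear_model import Ridge

A = np.array([[0.0, 5.0], [1.0, 7.0], [2.0, 11.0], [3.0, 13.0]])
b = np.full(4, 17.0)  # constant target: yStd == 0

model = Ridge(alpha=0.1, fit_intercept=True).fit(A, b)
print(model.intercept_)  # 17.0
print(model.coef_)       # [0. 0.] (up to floating point)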