[SPARK-12230][ML] WeightedLeastSquares.fit() should handle division by zero properly if standard deviation of target variable is zero. #10274
Conversation
If this PR is meant to address this "TODO", then the comment should be removed.
@dbtsai Would you have time to take a look at this? Thank you!
jenkins, add to whitelist
The LinearRegression has a bug related to this: when fitIntercept is false, the code should still train the model. Can you fix it either in a separate PR or here?
Thanks.
Let's fix it in a separate PR to make things easier.
I did notice that bug. I was planning to create a separate JIRA for this.
@dbtsai I just created a PR for this bug under a separate JIRA (SPARK-12732).
Test build #48713 has finished for PR 10274 at commit
Test build #48962 has finished for PR 10274 at commit
Test build #49023 has finished for PR 10274 at commit
Added a test for standardizeLabel = false with non-zero regParam, checked against the analytic normal-equation solution.
Added a failure test for standardizeLabel = true, regParam != 0 and yStd == 0.0.
I've added the exception and the test for standardizeLabel = true, regParam != 0 and yStd == 0.0. The only thing left now is to add a test for the case standardizeLabel = false, regParam != 0 and yStd == 0.0. As I mentioned before, in this case I cannot compare with glmnet, so I'll try to implement the normal equation in Python myself and compare with that. The good thing is that, for this particular case, both the normal equation and L-BFGS solvers give identical results!
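For reference, here is a minimal sketch of that Python cross-check using plain NumPy. It reuses the feature matrix from the R example further down but with a constant target, and the regParam value is only illustrative; the objective is plain ridge with an unpenalized intercept, which may not match Spark's exact scaling.
import numpy as np

# Hypothetical data: four samples, two features, constant target (yStd == 0).
A = np.array([[0.0, 5.0], [1.0, 7.0], [2.0, 11.0], [3.0, 13.0]])
b = np.array([17.0, 17.0, 17.0, 17.0])
lam = 0.1  # illustrative regParam

# Penalized normal equations (X^T X + lambda * I) w = X^T b, with an added
# intercept column that is left unregularized.
X = np.column_stack([np.ones_like(b), A])
I = np.eye(X.shape[1])
I[0, 0] = 0.0  # do not regularize the intercept
w = np.linalg.solve(X.T @ X + lam * I, X.T @ b)
print(w)  # expected: intercept close to 17, coefficients close to 0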
Awesome! Great that you see the same result! For the normal equation, it would be nice to have it in R so we can keep the comments consistent. I did implement it once when I implemented the L-BFGS version; let me try to find it.
@dbtsai I've implemented the normal equation with regularization in R. Here is my code:
ridge_regression <- function(A, b, lambda, intercept=TRUE) {
  if (intercept) {
    # Prepend a column of ones and leave the intercept unregularized.
    A = cbind(rep(1.0, length(b)), A)
    I = diag(ncol(A))
    I[1,1] = 0.0
  } else {
    I = diag(ncol(A))
  }
  # Solve (A^T A + lambda * I) w = A^T b via the Cholesky factorization.
  R = chol( t(A) %*% A + lambda*I )
  z = solve(t(R), t(A) %*% b)
  w = solve(R, z)
  return(w)
}
And here are the results I get using this function.
A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
b <- c(17, 19, 23, 29)
ridge_regression(A, b, 0.1)
[,1]
[1,] 12.9048179
[2,] 2.1151586
[3,] 0.6580494
The problem is that these don't quite match glmnet; the difference can be at the few-percent level:
> model <- glmnet(A, b, intercept=TRUE, lambda=0.1, standardize=FALSE,
+ alpha=0, thresh=1E-20)
> print(as.vector(coef(model)))
[1] 13.1018870 2.2362361 0.6159732
But my results match exactly what I get from ridge regression in sklearn:
from sklearn.linear_model import Ridge
import numpy as np

A = np.array([[0, 1, 2, 3], [5, 7, 11, 13]]).T
b = np.array([17.0, 19.0, 23.0, 29.0])
model = Ridge(alpha=0.1, solver='cholesky', fit_intercept=True)
model.fit(A, b)
print(model.intercept_)
print(model.coef_)
12.9048178613
[ 2.11515864 0.65804935]
Even if I use other solvers (svd, lsqr, sparse_cg) in sklearn.linear_model.Ridge, I get exactly the same results.
If I don't use regularization by setting lambda=0, then the results from glmnet are identical to what I get from normal equation and sklearn.linear_model.Ridge.
Have you observed such differences before? Is glmnet making some other correction, or is it just a numerical precision issue? I can't seem to reproduce the glmnet results.
Sorry for getting back to you so late. The difference is due to the fact that glmnet always standardizes the labels, even when standardization == false; standardization == false only turns off standardization of the features. As a result, at least in glmnet, the training is not valid when yStd == 0.0.
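To see concretely why yStd == 0.0 breaks the standardized problem, here is a tiny NumPy illustration with made-up values; label standardization divides by the label's standard deviation, which is the division by zero this PR guards against.
import numpy as np

y = np.array([17.0, 17.0, 17.0, 17.0])  # constant target, so yStd == 0
with np.errstate(invalid="ignore"):
    y_scaled = (y - y.mean()) / y.std()  # standardizing the label divides by yStd
print(y.std())    # 0.0
print(y_scaled)   # [nan nan nan nan]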
…and modified test accordingly.
Test build #49437 has finished for PR 10274 at commit
LGTM. Wait for one extra test. Thanks.
Merged into master. Thanks!
This fixes the behavior of WeightedLeastSquares.fit() when the standard deviation of the target variable is zero. If fitIntercept is true, there is no need to train the model.
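As a quick sanity check of that last point, here is a sketch with made-up data using scikit-learn rather than Spark: when the target is constant and an intercept is fit, the ridge solution is simply intercept = label value with all coefficients zero, so there is nothing to train.
import numpy as np
from sklearn.linear_model import Ridge

A = np.array([[0.0, 5.0], [1.0, 7.0], [2.0, 11.0], [3.0, 13.0]])
b = np.full(4, 17.0)  # constant target: yStd == 0

model = Ridge(alpha=0.1, fit_intercept=True).fit(A, b)
print(model.intercept_)  # 17.0
print(model.coef_)       # [0. 0.] (up to floating point)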