
Conversation

@iyounus (Contributor) commented Dec 12, 2015

This fixes the behavior of WeightedLeastSquares.fit() when the standard deviation of the target variable is zero. If fitIntercept is true, there is no need to train: every label equals the label mean, so the constant model with intercept = yMean is already optimal.
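For context, here is a minimal, self-contained sketch of the behavior this describes (plain Scala, not the actual Spark internals; WlsModel and fitConstantLabel are illustrative names):

// Sketch only: if the label standard deviation is zero and an intercept is
// fit, every label equals the label mean, so the constant model is exactly
// optimal and no optimization pass is needed.
case class WlsModel(coefficients: Array[Double], intercept: Double)

def fitConstantLabel(numFeatures: Int, yMean: Double, yStd: Double,
                     fitIntercept: Boolean): Option[WlsModel] =
  if (yStd == 0.0 && fitIntercept) {
    Some(WlsModel(Array.fill(numFeatures)(0.0), yMean))
  } else {
    None  // fall through to the regular WeightedLeastSquares solver
  }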

Contributor (review comment)

If this PR is meant to address this "TODO", then the comment should be removed.

@jkbradley (Member)

@dbtsai Would you have time to take a look at this? Thank you!

@dbtsai (Member) commented Jan 5, 2016

jenkins, add to whitelist

Member (review comment)

LinearRegression has a bug related to this:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L226

When fitIntercept is false, the code should still train the model. Can you fix it either in a separate PR or here?

Thanks.
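To make the asymmetry concrete, a hedged sketch of the corrected guard (simplified; not the exact LinearRegression source):

// The constant-label shortcut is valid only when BOTH conditions hold;
// with fitIntercept == false a zero-variance label still constrains the
// coefficients, so fit() must run the usual solver.
def useConstantLabelShortcut(yStd: Double, fitIntercept: Boolean): Boolean =
  yStd == 0.0 && fitIntercept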

Member (review comment)

Let's fix it in a separate PR to make things easier.

Contributor Author (review comment)

I did notice that bug. I was planning to create a separate JIRA for it.

Contributor Author (review comment)

@dbtsai I just created a PR for this bug with a separate JIRA (SPARK-12732).

@SparkQA commented Jan 5, 2016

Test build #48713 has finished for PR 10274 at commit f9573c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 7, 2016

Test build #48962 has finished for PR 10274 at commit a310232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 8, 2016

Test build #49023 has finished for PR 10274 at commit e920c29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member (review comment)

Please add a test for standardizeLabel = false with non-zero regParam, checked against the analytic normal equation solution.

Also add a failure test for standardizeLabel = true, regParam != 0 and yStd == 0.0.
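A hedged sketch of what the failure test could look like (ScalaTest style; the constructor signature and exception type are assumptions based on the private[ml] WeightedLeastSquares of that era, and constantLabelInstances is a hypothetical RDD[Instance] whose labels are all identical):

test("WLS fails when standardizeLabel = true, regParam != 0 and yStd == 0.0") {
  // Sketch only: constantLabelInstances is hypothetical, and the
  // IllegalArgumentException type assumes a require(...) check.
  val wls = new WeightedLeastSquares(fitIntercept = false, regParam = 0.1,
    standardizeFeatures = true, standardizeLabel = true)
  intercept[IllegalArgumentException] {
    wls.fit(constantLabelInstances)
  }
}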

Contributor Author (review comment)

I've added the exception and the test for standardizeLabel = true, regParam != 0 and yStd == 0.0. The only thing left now is to add a test for the case standardizeLabel = false, regParam != 0 and yStd == 0.0. As I mentioned before, I cannot compare this case with glmnet, so I'll implement the normal equation in Python myself and compare with that. The good thing is that, for this particular case, both the normal equation and L-BFGS solvers give identical results!

Member (review comment)

Awesome! Great that you see the same result! For the normal equation, it would be nice to have it in R so we can keep it in the comments consistently. I implemented it once when I wrote the L-BFGS version; let me try to find it.

Contributor Author (review comment)

@dbtsai I've implemented normal equation with regularization in R. Here is my code:

# Ridge regression via the normal equations, solved with a Cholesky
# factorization: (A'A + lambda*I) w = A'b. The intercept, when present,
# is not regularized (its diagonal entry in I is zeroed).
ridge_regression <- function(A, b, lambda, intercept=TRUE){
    if (intercept) {
        A = cbind(rep(1.0, length(b)), A)  # prepend a column of ones
        I = diag(ncol(A))
        I[1,1] = 0.0                       # do not penalize the intercept
    } else {
        I = diag(ncol(A))
    }
    R = chol( t(A) %*% A + lambda*I )      # A'A + lambda*I = R'R
    z = solve(t(R), t(A) %*% b)            # forward solve: R'z = A'b
    w = solve(R, z)                        # back solve: Rw = z
    return(w)
}

And here are the results I get using this function.

A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
b <- c(17, 19, 23, 29)
ridge_regression(A, b, 0.1)
           [,1]
[1,] 12.9048179
[2,]  2.1151586
[3,]  0.6580494

The problem is that these don't quite match glmnet; the difference can be at the few-percent level:

> model <- glmnet(A, b, intercept=TRUE, lambda=0.1, standardize=FALSE,
+ alpha=0, thresh=1E-20)
> print(as.vector(coef(model)))
[1] 13.1018870  2.2362361  0.6159732

But my results match exactly what I get from ridge regression in scikit-learn:

from sklearn.linear_model import Ridge
import numpy as np

A = np.array([[0, 1, 2, 3], [5, 7, 11, 13]]).T
b = np.array([17.0, 19.0, 23.0, 29.0])
# Unpenalized intercept, Cholesky solver -- same setup as the R function above.
model = Ridge(alpha=0.1, solver='cholesky', fit_intercept=True)
model.fit(A, b)
print(model.intercept_)
print(model.coef_)

12.9048178613
[ 2.11515864  0.65804935]

Even if I use other solvers (svd, lsqr, sparse_cg) in sklearn.linear_model.Ridge, I get exactly the same results.

If I don't use regularization (lambda=0), the results from glmnet are identical to what I get from the normal equation and sklearn.linear_model.Ridge.

Have you observed such differences before? Is glmnet making some other correction, or is it just a numerical-precision issue? I can't seem to reproduce the glmnet results.

Member (review comment)

Sorry for getting back to you so late. The difference is due to the fact that glmnet always standardizes the labels, even when standardization == false; standardization == false only turns off standardization of the features. As a result, at least in glmnet, the training is not valid when yStd == 0.0.

@SparkQA commented Jan 15, 2016

Test build #49437 has finished for PR 10274 at commit d591989.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented Jan 15, 2016

LGTM. Waiting for one extra test. Thanks.

@mengxr (Contributor) commented Jan 20, 2016

Merged into master. Thanks!

@asfgit asfgit closed this in 9753835 Jan 20, 2016