@@ -86,6 +86,24 @@ private[ml] class WeightedLeastSquares(
val aaBar = summary.aaBar
val aaValues = aaBar.values

if (bStd == 0) {
if (fitIntercept) {
Member:

LinearRegression has a related bug:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L226

When fitIntercept is false, the code should still train the model. Can you fix it in either a separate PR or here?

Thanks.

Member:

Let's fix it in a separate PR to make things easier.

Contributor Author:

I did notice that bug. I was planning to create a separate JIRA for it.

Contributor Author:

@dbtsai I just created a PR for this bug under a separate JIRA (SPARK-12732).

logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
s"zeros and the intercept will be the mean of the label; as a result, " +
s"training is not needed.")
val coefficients = new DenseVector(Array.ofDim(k-1))
val intercept = bBar
val diagInvAtWA = new DenseVector(Array(0D))
return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA)
} else {
require(!(regParam > 0.0 && standardizeLabel),
"The standard deviation of the label is zero. " +
"Model cannot be regularized with standardization=true")
logWarning(s"The standard deviation of the label is zero. " +
"Consider setting fitIntercept=true.")
}
}

// add regularization to diagonals
var i = 0
var j = 2
@@ -94,8 +112,7 @@ private[ml] class WeightedLeastSquares(
if (standardizeFeatures) {
lambda *= aVar(j - 2)
}
if (standardizeLabel) {
// TODO: handle the case when bStd = 0
if (standardizeLabel && bStd != 0) {
Member:

Can you check what solution R gives when standardizeLabel = true and bStd == 0.0 with regularization, and add it to a unit test? I guess the effective regularization will change.

Contributor Author:

@dbtsai The problem here is that for regularized regression in R, I need to use glmnet. But for this specific case (constant label, no intercept, no regularization) the results from glmnet do not match lm, so I see a discrepancy within R itself. Have a look at the following R code:

library(glmnet)

A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
b <- c(17, 17, 17, 17)  
w <- c(1, 2, 3, 4)  
df <- as.data.frame(cbind(A, b))

lm.model <- lm(b ~ . -1, data=df, weights=w)
print(as.vector(coef(lm.model)))
[1] -9.221298  3.394343

glm.model <- glmnet(A, b, weights=w, intercept=FALSE, lambda=0,
                    standardize=FALSE, alpha=0, thresh=1E-14)
print(as.vector(coef(glm.model)))
[1] 0 0 0

Note that in this example, I expect the same results from both lm and glmnet because I've set lambda=0 in glmnet. (BTW, standardize has no effect here.) It seems to me that glmnet just sets all coefficients to zero if the label is constant and the intercept is not included. This holds even if I include regularization.

Right now WeightedLeastSquares (without regularization) matches lm, and I think this is the correct behaviour given my understanding of the normal equation. With regularization, it should still give some non-zero coefficients, which it does. I don't know why glmnet behaves differently, but I don't think we should try to match it in this particular case.
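
For reference, here is a minimal standalone sketch (not part of this PR; plain Scala with no Spark dependencies, and the object name is illustrative) that solves the weighted normal equation (A^T W A) x = A^T W b for the example above. It reproduces lm's answer rather than glmnet's all-zero solution:

  object ConstantLabelNormalEquation extends App {
    // The example data: constant label, no intercept column.
    val a = Array(Array(0.0, 5.0), Array(1.0, 7.0), Array(2.0, 11.0), Array(3.0, 13.0))
    val b = Array.fill(4)(17.0)
    val w = Array(1.0, 2.0, 3.0, 4.0)

    // Accumulate the symmetric 2x2 Gram matrix A^T W A and the vector A^T W b.
    var m00 = 0.0; var m01 = 0.0; var m11 = 0.0
    var v0 = 0.0; var v1 = 0.0
    for (i <- a.indices) {
      m00 += w(i) * a(i)(0) * a(i)(0)
      m01 += w(i) * a(i)(0) * a(i)(1)
      m11 += w(i) * a(i)(1) * a(i)(1)
      v0 += w(i) * a(i)(0) * b(i)
      v1 += w(i) * a(i)(1) * b(i)
    }

    // Solve the 2x2 system by Cramer's rule; the columns of A are linearly
    // independent, so det != 0 and the solution is unique (and non-zero).
    val det = m00 * m11 - m01 * m01
    val x0 = (v0 * m11 - m01 * v1) / det
    val x1 = (m00 * v1 - m01 * v0) / det
    println(f"coefficients: $x0%.6f, $x1%.6f") // prints -9.221298, 3.394343
  }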

Member:

Thanks. As you said, we expect non-zero coefficients in this case, so we don't have to match glmnet.

However, we may want to throw an exception when standardizeLabel is true and yStd is zero, since the problem is not well defined.

Thanks.

Contributor Author:

The WeightedLeastSquares class is private and is instantiated in the LinearRegression class, where the standardizeLabel parameter is hard-wired to true. So the user doesn't have any control over this parameter.

We can throw an exception when yStd is zero and regParam is non-zero. But if that is the case, why not throw an exception whenever yStd is zero, regardless of the other parameters? I cannot think of any interpretation of the model in that case.

An alternative would be to simply log a warning when we don't standardize the label here.

Let me know what you think.

Member:

It's interesting that when regularization is zero, standardizing the labels and features (or not) does not change the solution of linear regression, which you can verify experimentally.

As a result, the only case where the model is non-interpretable is when yStd is zero and regParam is non-zero. You can add a require there with a proper message. I think a log warning would be very easy for users to ignore. Thanks.
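
A short derivation of that invariance (an editorial note, not from the thread): with standardized features \tilde{a}_{ij} = a_{ij} / \hat\sigma_j, minimizing

  \sum_i w_i (b_i - \sum_j \tilde{a}_{ij} \tilde{x}_j)^2 + \lambda \sum_j \tilde{x}_j^2

is, after the change of variables x_j = \tilde{x}_j / \hat\sigma_j, the same as minimizing

  \sum_i w_i (b_i - \sum_j a_{ij} x_j)^2 + \lambda \sum_j \hat\sigma_j^2 x_j^2.

At \lambda = 0 the penalty vanishes and the two problems have identical minimizers, so standardization is a no-op. At \lambda > 0 the effective per-feature penalty picks up the \hat\sigma_j^2 factor (and label standardization divides it by \hat\sigma_b, via the `lambda /= bStd` line in the diff), which is exactly why \hat\sigma_b = 0 combined with regParam > 0 is ill-defined.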

Member:

Here is the exercise:

  test("WLS against lm") {
    /*
       R code:

       df <- as.data.frame(cbind(A, b))
       for (formula in c(b ~ . -1, b ~ .)) {
         model <- lm(formula, data=df, weights=w)
         print(as.vector(coef(model)))
       }

       [1] -3.727121  3.009983
       [1] 18.08  6.08 -0.60
     */

    val expected = Seq(
      Vectors.dense(0.0, -3.727121, 3.009983),
      Vectors.dense(18.08, 6.08, -0.60))

    var idx = 0
    for (fitIntercept <- Seq(false, true)) {
      for (standardization <- Seq(false, true)) {
        val wls = new WeightedLeastSquares(
          fitIntercept, regParam = 0.0, standardizeFeatures = standardization,
          standardizeLabel = standardization).fit(instances)
        val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1))
        assert(actual ~== expected(idx) absTol 1e-4)
      }
      idx += 1
    }
  }

lambda /= bStd
}
aaValues(i) += lambda
@@ -27,6 +27,7 @@ import org.apache.spark.rdd.RDD
class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext {

private var instances: RDD[Instance] = _
private var instancesConstLabel: RDD[Instance] = _

override def beforeAll(): Unit = {
super.beforeAll()
@@ -43,6 +44,20 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext
Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
), 2)

/*
R code:

A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
b.const <- c(17, 17, 17, 17)
w <- c(1, 2, 3, 4)
*/
instancesConstLabel = sc.parallelize(Seq(
Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
Instance(17.0, 2.0, Vectors.dense(1.0, 7.0)),
Instance(17.0, 3.0, Vectors.dense(2.0, 11.0)),
Instance(17.0, 4.0, Vectors.dense(3.0, 13.0))
), 2)
}

test("WLS against lm") {
@@ -65,15 +80,59 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext

var idx = 0
for (fitIntercept <- Seq(false, true)) {
val wls = new WeightedLeastSquares(
fitIntercept, regParam = 0.0, standardizeFeatures = false, standardizeLabel = false)
.fit(instances)
val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1))
assert(actual ~== expected(idx) absTol 1e-4)
for (standardization <- Seq(false, true)) {
val wls = new WeightedLeastSquares(
fitIntercept, regParam = 0.0, standardizeFeatures = standardization,
standardizeLabel = standardization).fit(instances)
val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1))
assert(actual ~== expected(idx) absTol 1e-4)
}
idx += 1
}
}

test("WLS against lm when label is constant and no regularization") {
/*
R code:

df.const.label <- as.data.frame(cbind(A, b.const))
for (formula in c(b.const ~ . -1, b.const ~ .)) {
model <- lm(formula, data=df.const.label, weights=w)
print(as.vector(coef(model)))
}

[1] -9.221298 3.394343
[1] 17 0 0
*/

val expected = Seq(
Vectors.dense(0.0, -9.221298, 3.394343),
Vectors.dense(17.0, 0.0, 0.0))

var idx = 0
for (fitIntercept <- Seq(false, true)) {
for (standardization <- Seq(false, true)) {
val wls = new WeightedLeastSquares(
fitIntercept, regParam = 0.0, standardizeFeatures = standardization,
standardizeLabel = standardization).fit(instancesConstLabel)
val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1))
assert(actual ~== expected(idx) absTol 1e-4)
}
idx += 1
}
}

test("WLS with regularization when label is constant") {
// if regParam is non-zero and standardization is true, the problem is ill-defined and
// an exception is thrown.
val wls = new WeightedLeastSquares(
fitIntercept = false, regParam = 0.1, standardizeFeatures = true,
standardizeLabel = true)
intercept[IllegalArgumentException]{
wls.fit(instancesConstLabel)
}
}

test("WLS against glmnet") {
/*
R code: