Commit eb00378

WeichenXu123 authored and dbtsai committed
[SPARK-20423][ML] fix MLOR coeffs centering when reg == 0
## What changes were proposed in this pull request?

When reg == 0, MLOR has multiple equivalent solutions, so we need to center the coefficients to get an identical result. However, the current implementation centers the `coefficientMatrix` by the global mean of all coefficients. In fact, the `coefficientMatrix` should be centered on each feature index separately.

This follows from the MLOR probability distribution function: suppose `{ w0, w1, ... w(K-1) }` make up the `coefficientMatrix`; then `{ w0 + c, w1 + c, ... w(K-1) + c }` is an equivalent solution, where `c` is an arbitrary vector of dimension `numFeatures`. Reference: https://core.ac.uk/download/pdf/6287975.pdf

So we need to center the `coefficientMatrix` on each feature dimension separately.

**We can also confirm this with the R library `glmnet`: when reg == 0, MLOR in `glmnet` always generates coefficients whose sum over the classes is zero for each feature dimension.**

## How was this patch tested?

Tests added.

Author: WeichenXu <[email protected]>

Closes #17706 from WeichenXu123/mlor_center.
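The invariance claimed in the description can be written out explicitly. The following is a short sketch of the argument, not part of the patch; the symbols `b_k` (intercepts), `x` (feature vector) and `\bar{w}` (mean coefficient vector across classes) are notation assumed here for illustration.

```latex
% Softmax probability of class k, with coefficient vectors w_k
% (the K sets of coefficients in coefficientMatrix) and intercepts b_k (assumed notation).
P(y = k \mid x) = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=0}^{K-1} \exp(w_j^\top x + b_j)}

% Shifting every w_k by the same vector c leaves the probability unchanged,
% because the common factor exp(c^T x) cancels.
\frac{\exp\big((w_k + c)^\top x + b_k\big)}{\sum_{j=0}^{K-1} \exp\big((w_j + c)^\top x + b_j\big)}
  = \frac{e^{c^\top x}\,\exp(w_k^\top x + b_k)}{e^{c^\top x}\,\sum_{j=0}^{K-1} \exp(w_j^\top x + b_j)}
  = P(y = k \mid x)

% Per-feature centering chooses c = -\bar{w}, so the coefficients of each
% feature sum to zero across the K classes.
\bar{w} = \frac{1}{K} \sum_{k=0}^{K-1} w_k, \qquad
w_k' = w_k - \bar{w} \;\Longrightarrow\; \sum_{k=0}^{K-1} w_k' = 0
```

Subtracting a single global scalar mean `m` corresponds to the special case `c = m * (1, ..., 1)`. Predictions are still unchanged, but only the total sum of all coefficients becomes zero, not the sum for each feature, which is why the result did not match the per-feature convention described for `glmnet`.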
1 parent a750a59 commit eb00378

2 files changed: 14 additions and 3 deletions

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

Lines changed: 8 additions & 3 deletions
@@ -609,9 +609,14 @@ class LogisticRegression @Since("1.2.0") (
     Friedman, et al. "Regularization Paths for Generalized Linear Models via
     Coordinate Descent," https://core.ac.uk/download/files/153/6287975.pdf
    */
-  val denseValues = denseCoefficientMatrix.values
-  val coefficientMean = denseValues.sum / denseValues.length
-  denseCoefficientMatrix.update(_ - coefficientMean)
+  val centers = Array.fill(numFeatures)(0.0)
+  denseCoefficientMatrix.foreachActive { case (i, j, v) =>
+    centers(j) += v
+  }
+  centers.transform(_ / numCoefficientSets)
+  denseCoefficientMatrix.foreachActive { case (i, j, v) =>
+    denseCoefficientMatrix.update(i, j, v - centers(j))
+  }
 }
 
 // center the intercepts when using multinomial algorithm

mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

Lines changed: 6 additions & 0 deletions
@@ -1139,6 +1139,9 @@ class LogisticRegressionSuite
   0.10095851, -0.85897154, 0.08392798, 0.07904499), isTransposed = true)
 val interceptsR = Vectors.dense(-2.10320093, 0.3394473, 1.76375361)
 
+model1.coefficientMatrix.colIter.foreach(v => assert(v.toArray.sum ~== 0.0 absTol eps))
+model2.coefficientMatrix.colIter.foreach(v => assert(v.toArray.sum ~== 0.0 absTol eps))
+
 assert(model1.coefficientMatrix ~== coefficientsR relTol 0.05)
 assert(model1.coefficientMatrix.toArray.sum ~== 0.0 absTol eps)
 assert(model1.interceptVector ~== interceptsR relTol 0.05)
@@ -1204,6 +1207,9 @@ class LogisticRegressionSuite
   -0.3180040, 0.9679074, -0.2252219, -0.4319914,
   0.2452411, -0.6046524, 0.1050710, 0.1180180), isTransposed = true)
 
+model1.coefficientMatrix.colIter.foreach(v => assert(v.toArray.sum ~== 0.0 absTol eps))
+model2.coefficientMatrix.colIter.foreach(v => assert(v.toArray.sum ~== 0.0 absTol eps))
+
 assert(model1.coefficientMatrix ~== coefficientsR relTol 0.05)
 assert(model1.coefficientMatrix.toArray.sum ~== 0.0 absTol eps)
 assert(model1.interceptVector.toArray === Array.fill(3)(0.0))
