[SPARK-13777] [ML] Remove constant features from training in normal solver (WLS) #11610
Conversation
Test build #52764 has finished for PR 11610 at commit
I should point out that to identify constant features, I'm comparing the variance (aVar) to zero. But the variance of a constant feature may not be identically zero due to numerical inaccuracies, and in that case the Cholesky decomposition still fails. I'm thinking of comparing aVar to some very small number, maybe 1e-10, instead of 0.0. Is there a better way to deal with this problem?
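A minimal sketch of that tolerance idea, for illustration only (the helper name, the relative scaling, and the 1e-10 default are assumptions, not the PR's code):

```scala
// Hypothetical helper (not from the PR): treat a feature as constant when its
// variance falls below a small tolerance, scaled by the largest variance so the
// check is not sensitive to the units of the features.
def nonConstantIndices(aVar: Array[Double], relTol: Double = 1e-10): Array[Int] = {
  val cutoff = relTol * math.max(aVar.max, 1.0)
  aVar.indices.filter(i => aVar(i) > cutoff).toArray
}
```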
}
}
/*
If more than of the features in the data are constant (i.e, data matrix has constant columns),
"than of the" -> "than one of the". Also, "i.e," -> "i.e."
Test build #52981 has finished for PR 11610 at commit
val aVarRaw = summary.aVar.values
// this will keep track of features to keep in the model, and remove
// features with zero variance.
val nzVarIndex = aVarRaw.zipWithIndex.filter(_._1 != 0).map(_._2)
Explicitly filter(_._1 != 0.0)
Use foreachActive to build the non-zero element index, given the reason we just discussed.
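A sketch of that suggestion, assuming the ml.linalg Vector API (the method name and the exact predicate are illustrative, not the PR's code):

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.ml.linalg.Vector

// Illustrative only: walk the active entries of the variance vector and collect
// the indices whose variance is non-zero (or above a small tolerance, per the
// discussion above), without building intermediate (value, index) pairs.
def nonZeroVarianceIndices(aVar: Vector): Array[Int] = {
  val kept = ArrayBuffer.empty[Int]
  aVar.foreachActive { (i, v) =>
    if (v != 0.0) kept += i
  }
  kept.toArray
}
```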
@iyounus @dbtsai The normal equation approach will fail if the matrix A is rank-deficient. It happens when there are constant columns. However, more generally, it happens when there are linearly dependent columns in the training dataset. So this PR solves one case but it is not a general solution. We can try two approaches:
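A small Breeze illustration of the more general rank-deficient case (Breeze is already a Spark dependency; the numbers are made up):

```scala
import breeze.linalg.{DenseMatrix, rank}

// Illustration only: the third column is the sum of the first two, so A is
// rank-deficient even though no column is constant. The Gram matrix A^T A is
// then singular and Cholesky decomposition fails.
val a = DenseMatrix(
  (1.0, 2.0, 3.0),
  (2.0, 1.0, 3.0),
  (4.0, 5.0, 9.0),
  (3.0, 3.0, 6.0))
val ata = a.t * a
println(rank(ata))  // 2, not 3
```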
Test build #53091 has finished for PR 11610 at commit
I will vote for approach 1. SVD will be the most stable algorithm, but also the slowest, O(mn^2 + n^3), compared with Cholesky O(mn^2) or QR O(mn^2 - n^3/3). SVD handles both underdetermined (rank-deficient) and overdetermined (more data than equations) least squares problems. Given that it's a local operation, we can just call DGELSD in LAPACK.
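For concreteness, a local Breeze sketch of the kind of rank-revealing least squares solve DGELSD performs (the function name and the rcond default are assumptions; real code would call LAPACK directly):

```scala
import breeze.linalg.{DenseMatrix, DenseVector, svd}

// Minimal sketch (not the PR's code): x = V * diag(1/s) * U^T * b, zeroing
// directions whose singular value falls below rcond * s_max. This is what
// DGELSD does internally for min ||A x - b||_2, including rank-deficient A.
def svdSolve(a: DenseMatrix[Double], b: DenseVector[Double],
             rcond: Double = 1e-12): DenseVector[Double] = {
  val svd.SVD(u, s, vt) = svd.reduced(a)
  val cutoff = rcond * s(0)  // singular values come back in decreasing order
  val uTb = u.t * b
  val y = DenseVector.tabulate(s.length) { i =>
    if (s(i) > cutoff) uTb(i) / s(i) else 0.0
  }
  vt.t * y
}
```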
I'm a bit confused about the use of DGELSD. As far as I can tell, it requires the matrix A itself, but in the current implementation we're decomposing A^T A on the driver. To find the inverse of A^T A, I only need the matrix V and the singular values from the SVD: A^T A = V \Sigma^2 V^T. I can construct this from the eigenvalues and eigenvectors of A^T A, which I can compute on the driver. Then finding the inverse is trivial. Is this what we're actually trying to do?
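A Breeze sketch of that eigendecomposition route (illustrative only; the cutoff on small eigenvalues is an assumption that anticipates the concern raised below):

```scala
import breeze.linalg.{DenseMatrix, DenseVector, eigSym}

// Sketch, not the PR's code: solve (A^T A) x = A^T b via A^T A = V diag(lambda) V^T,
// dropping eigenvalues below a relative cutoff so a rank-deficient Gram matrix
// does not blow up the inverse.
def solveNormalEquation(ata: DenseMatrix[Double], atb: DenseVector[Double],
                        rcond: Double = 1e-12): DenseVector[Double] = {
  val eigSym.EigSym(lambda, v) = eigSym(ata)      // eigenvalues in increasing order
  val cutoff = rcond * lambda(lambda.length - 1)
  val vtAtb = v.t * atb
  val y = DenseVector.tabulate(lambda.length) { i =>
    if (lambda(i) > cutoff) vtAtb(i) / lambda(i) else 0.0
  }
  v * y
}
```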
Locally, we are solving the normal equation A^T A x = A^T b. Ideally, we should solve the least squares problem using tall-skinny QR/SVD to get better stability, but that is a little tricky for sparse data. So I would suggest solving the normal equation locally with SVD.
I'm not an expert in this area, but after thinking about it more, I don't think we can use For those least squares problems, when the number of columns is small, we can always solve it by Thanks.
One problem with the eigendecomposition method is that for a rank-deficient matrix some of the eigenvalues can be extremely small (instead of being exactly zero), and their contribution to the inverse can become very large. I'll try out these methods (DGELSD and eigendecomposition) and see how they behave in this case.
@dbtsai There is a good chance of precision loss during the computation of A^T A if A is ill-conditioned. A better approach is to factorize A directly. It is similar to tall-skinny QR without storing Q (applying Q^T to b directly). SVD is similar. See this paper: http://web.stanford.edu/~paulcon/docs/mapreduce-2013-arbenson.pdf. We can definitely switch to it to get better stability, but we would need to handle sparsity, which might not be worth the time. @iyounus You can use
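As a purely local illustration of "factorize A directly and apply Q^T to b" (not the distributed tall-skinny factorization the paper describes):

```scala
import breeze.linalg.{DenseMatrix, DenseVector, qr}

// Local sketch only: factor A = Q R and solve R x = Q^T b. This avoids forming
// A^T A, whose condition number is the square of A's.
def qrSolve(a: DenseMatrix[Double], b: DenseVector[Double]): DenseVector[Double] = {
  val qr.QR(q, r) = qr.reduced(a)  // q: m x n with orthonormal columns, r: n x n
  r \ (q.t * b)                    // solve R x = Q^T b
}
```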
Ping @iyounus?
@mengxr I looked into using DGELSD to solve
Test build #59136 has finished for PR 11610 at commit
Test build #59144 has finished for PR 11610 at commit
This problem should be handled by #15394 if it is merged. It seems this is no longer active, and we are pursuing alternative solutions. Shall we close this?
Closes apache#11610 Closes apache#15411 Closes apache#15501 Closes apache#12613 Closes apache#12518 Closes apache#12026 Closes apache#15524 Closes apache#12693 Closes apache#12358 Closes apache#15588 Closes apache#15635 Closes apache#15678 Closes apache#14699 Closes apache#9008
What changes were proposed in this pull request?
"normal" solver in LinearRegression uses Cholesky decomposition to calculate the coefficients. If the data has features with identical values (zero variance), then (A^T A) matrix is not positive definite any more and the Cholesky decomposition fails.
Since A^T A and the feature variances are calculated in a single pass over the data, it's better to modify A^T A directly rather than re-calculate it from the data after dropping constant columns. In this PR, I drop the columns and rows of A^T A that correspond to features with zero variance. The Cholesky decomposition can then be performed without any problem.
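A dense, self-contained sketch of this idea (illustrative only; the PR itself operates on the statistics inside WeightedLeastSquares, and the names here are made up):

```scala
import breeze.linalg.{DenseMatrix, DenseVector, cholesky}

// Sketch, not the PR's code: drop the rows/columns of A^T A and the entries of
// A^T b that correspond to zero-variance features, Cholesky-solve the reduced
// system, then scatter the solution back with 0.0 for the dropped features.
def solveDroppingConstantFeatures(
    ata: DenseMatrix[Double],
    atb: DenseVector[Double],
    keep: Array[Int]): DenseVector[Double] = {
  val k = keep.length
  val ataKept = DenseMatrix.tabulate(k, k)((i, j) => ata(keep(i), keep(j)))
  val atbKept = DenseVector.tabulate(k)(i => atb(keep(i)))
  val l = cholesky(ataKept)          // reduced system is positive definite
  val xKept = l.t \ (l \ atbKept)    // solve L y = A^T b, then L^T x = y
  val x = DenseVector.zeros[Double](atb.length)
  keep.zipWithIndex.foreach { case (origIdx, i) => x(origIdx) = xKept(i) }
  x
}
```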
How was this patch tested?
A unit test is added under LinearRegressionSuite which compares the results from this change and from the l-bfgs solver against glmnet. All of these are now consistent.