
Conversation

Contributor

@iyounus iyounus commented Mar 9, 2016

What changes were proposed in this pull request?

"normal" solver in LinearRegression uses Cholesky decomposition to calculate the coefficients. If the data has features with identical values (zero variance), then (A^T A) matrix is not positive definite any more and the Cholesky decomposition fails.

Since A^T A and the feature variances are calculated in a single pass, it's better to modify A^T A directly instead of re-calculating it from the data after dropping constant columns. In this PR, I drop the columns and rows of A^T A that correspond to features with zero variance. The Cholesky decomposition can then be performed without any problem.
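
A minimal sketch of the idea, using Breeze dense types rather than the packed triangular representation the actual WLS code works with (the helper name is made up for illustration):

import breeze.linalg.{DenseMatrix, DenseVector}

// Drop the rows and columns of A^T A, and the entries of A^T b, that
// correspond to zero-variance features, so the usual Cholesky solve can
// run on the reduced system.
def dropConstantFeatures(
    ata: DenseMatrix[Double],
    atb: DenseVector[Double],
    aVar: Array[Double]): (DenseMatrix[Double], DenseVector[Double], Array[Int]) = {
  val keep = aVar.zipWithIndex.filter(_._1 != 0.0).map(_._2)
  val ataKept = DenseMatrix.tabulate(keep.length, keep.length)((i, j) => ata(keep(i), keep(j)))
  val atbKept = DenseVector.tabulate(keep.length)(i => atb(keep(i)))
  (ataKept, atbKept, keep)  // keep maps reduced indices back to the originals
}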

How was this patch tested?

A unit test is added under LinearRegressionSuite which compares the results from this change and from the l-bfgs solver against glmnet. All of these are now consistent.

@iyounus iyounus changed the title [SPARK-13777] [ML] Remove constant features from training in noraml solver (WLS) [SPARK-13777] [ML] Remove constant features from training in normal solver (WLS) Mar 9, 2016

SparkQA commented Mar 9, 2016

Test build #52764 has finished for PR 11610 at commit 9412ef4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

iyounus commented Mar 9, 2016

I should point out that to identify constant features, I'm comparing the variance (aVar) to zero. But it can happen that the variance of a constant feature is not identically zero due to numerical inaccuracies, in which case the Cholesky decomposition still fails. I'm thinking of comparing aVar to some very small number, maybe 1e-10, instead of 0.0. Is there a better way to deal with this problem?
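
A minimal sketch of that tolerance check, treating aVar as an Array[Double] of per-feature variances (the epsilon value is the assumption under discussion, not a settled choice):

// Treat a feature as constant when its variance falls below a small
// epsilon instead of requiring it to be exactly 0.0.
val eps = 1e-10
val nonConstantIdx = aVar.zipWithIndex.filter(_._1 > eps).map(_._2)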

}
}
/*
If more than of the features in the data are constant (i.e, data matrix has constant columns),
Contributor


"than of the" -> "than one of the". Also, "i.e," -> "i.e."


SparkQA commented Mar 12, 2016

Test build #52981 has finished for PR 11610 at commit 652d2bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val aVarRaw = summary.aVar.values
// this will keep track of features to keep in the model, and remove
// features with zero variance.
val nzVarIndex = aVarRaw.zipWithIndex.filter(_._1 != 0).map(_._2)
Member


Explicitly filter(_._1 != 0.0)

Member


Using foreachActive to build the non zero element index given the reason we just discussed.
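
A sketch of the suggested foreachActive variant, assuming summary.aVar is a Spark linalg Vector (on which foreachActive visits only the stored entries):

import scala.collection.mutable.ArrayBuffer

// Collect the indices of features with non-zero variance. On a sparse
// vector this skips the implicit zeros instead of scanning every slot.
val buf = ArrayBuffer.empty[Int]
summary.aVar.foreachActive { (i, v) =>
  if (v != 0.0) buf += i
}
val nzVarIndex = buf.toArray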

Contributor

mengxr commented Mar 14, 2016

@iyounus @dbtsai The normal equation approach will fail if the matrix A is rank-deficient. It happens when there are constant columns. However, more generally, it happens when there are linearly dependent columns in the training dataset. So this PR solves one case but it is not a general solution. We can try two approaches:

  1. Distributed QR/SVD to deal with rank deficiency. QR requires pivoting. SVD might be an easier approach for us, though a little expensive. See DGELSD from LAPACK: http://www.netlib.org/lapack/explore-html/d7/d3b/group__double_g_esolve.html#ga94bd4a63a6dacf523e25ff617719f752
  2. Add a small positive number, e.g., 1e-8, to the diagonal when AtA is rank deficient. This should solve the numerical issue but returns a less accurate solution. However, it is always good to apply regularization. I believe this won't make the model worse. (A sketch of this diagonal perturbation follows below.)
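
A minimal sketch of approach 2, assuming a Breeze DenseMatrix for AtA (the helper name and default lambda are illustrative):

import breeze.linalg.DenseMatrix

// Add a small positive value to the diagonal of A^T A so the Cholesky
// factorization succeeds even when the matrix is rank deficient. This is
// equivalent to applying a tiny L2 (ridge) penalty.
def perturbDiagonal(ata: DenseMatrix[Double], lambda: Double = 1e-8): DenseMatrix[Double] = {
  val out = ata.copy
  var i = 0
  while (i < out.rows) { out(i, i) += lambda; i += 1 }
  out
}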


SparkQA commented Mar 14, 2016

Test build #53091 has finished for PR 11610 at commit e9c80a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

dbtsai commented Mar 15, 2016

I will vote for approach 1.

SVD will be the most stable algorithm, but also the slowest: O(mn^2 + n^3) compared with Cholesky, O(mn^2), or QR, O(mn^2 - n^3/3). SVD solves both underdetermined (rank-deficient) and overdetermined (more equations than unknowns) least squares problems. Given that it's a local operation, we can just call DGELSD in LAPACK.

Contributor Author

iyounus commented Mar 15, 2016

I'm a bit confused about the use of DGELSD. As far as I can tell, it requires the matrix A itself, but in the current implementation we're decomposing A^T A on the driver.

To find the inverse of A^T A, I only need the matrix V and the singular values from the SVD: A^T A = V \Sigma^2 V^T. I can construct this from the eigenvalues and eigenvectors of A^T A, which I can compute on the driver. Then finding the inverse is trivial. Is this what we're actually trying to do?

Contributor

mengxr commented Mar 16, 2016

Locally, we are solving A^T A x = A^T b. In a rank deficient case, we can compute the min-length least squares solution that also minimizes \| x \|_2, which is unique and returned by DGELSD. This is the same as eigen decomposition because A^T A is symmetric. We can try that too.

Ideally, we should solve the least squares problems using tall-skinny QR/SVD to get better stability. But it is a little tricky for sparse data. So I would suggest solving the normal equation locally with SVD.
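
A sketch of that local SVD-based solve, using Breeze's pinv in place of a direct DGELSD call (illustrative, not the proposed implementation):

import breeze.linalg.{DenseMatrix, DenseVector, pinv}

// pinv computes the Moore-Penrose pseudoinverse via SVD, so this returns
// the minimum-norm least-squares solution of A^T A x = A^T b even when
// A^T A is rank deficient -- the same solution DGELSD would produce.
def minNormSolve(ata: DenseMatrix[Double], atb: DenseVector[Double]): DenseVector[Double] =
  pinv(ata) * atb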

Member

dbtsai commented Mar 16, 2016

I'm not an expert in this area, but after thinking about it more, I don't think we can use DGELSD, which minimizes ||b - A*x|| using the singular value decomposition (SVD) of a rectangular matrix A that may be rank-deficient. Since A is an n x m matrix, where n is the number of training samples, only A^T A is tractable locally. Unless we have a distributed version of DGELSD, this will not work. It seems that eigendecomposition is the good approach.

For those least squares problems, when the number of columns is small, we can always solve A^T A x = A^T b. Why do we need tall-skinny QR/SVD for stability? Isn't that for the case where the number of columns is large?

Thanks.

Contributor Author

iyounus commented Mar 16, 2016

One problem with the eigendecomposition method is that, for a rank-deficient matrix, some of the eigenvalues can be extremely small (instead of being exactly zero), and their contribution to the inverse can become very large.

I'll try out these methods (DGELSD and eigen decomposition) and see how they behave in this case.

Contributor

mengxr commented Mar 16, 2016

@dbtsai There is a good chance of precision loss during the computation of A^T A if A is ill-conditioned. A better approach is to factorize A directly. It is similar to tall-skinny QR without storing Q (applying Q^T to b directly). SVD is similar. See this paper: http://web.stanford.edu/~paulcon/docs/mapreduce-2013-arbenson.pdf. We can definitely switch to it to get better stability, but we need to handle sparsity, which might not be worth the time.

@iyounus You can use RCOND to control the rank estimation. Usually a number like 1e-12 should work well.
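
A sketch of the eigendecomposition route with an RCOND-style cutoff, using Breeze's eigSym (the helper name is made up; the rcond default follows the 1e-12 suggestion above):

import breeze.linalg.{diag, eigSym, max, DenseMatrix, DenseVector}
import breeze.linalg.eigSym.EigSym

// Since A^T A = V diag(lambda) V^T, its pseudoinverse is V diag(1/lambda) V^T
// once eigenvalues below rcond * max(lambda) are zeroed out, so near-zero
// eigenvalues cannot blow up the inverse.
def solveViaEigSym(
    ata: DenseMatrix[Double],
    atb: DenseVector[Double],
    rcond: Double = 1e-12): DenseVector[Double] = {
  val EigSym(lambda, v) = eigSym(ata)
  val thresh = rcond * max(lambda)
  val invLambda = lambda.map(l => if (l > thresh) 1.0 / l else 0.0)
  v * diag(invLambda) * (v.t * atb)
}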

Member

srowen commented May 6, 2016

Ping @iyounus ?

Contributor Author

iyounus commented May 10, 2016

@mengxr I looked into using DGELSD to solve A^T A x = A^T b as you suggested. It works fine, but the issue then is how to calculate the errors on the coefficients in the rank-deficient case, since I cannot invert A^T A there. Any suggestions?


SparkQA commented May 23, 2016

Test build #59136 has finished for PR 11610 at commit e9c80a8.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


SparkQA commented May 23, 2016

Test build #59144 has finished for PR 11610 at commit e9c80a8.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Contributor

sethah commented Oct 12, 2016

This problem should be handled by #15394 if it is merged. It seems this is no longer active, and we are pursuing alternative solutions. Shall we close this?

srowen added a commit to srowen/spark that referenced this pull request Oct 31, 2016
@asfgit asfgit closed this in 26b07f1 Oct 31, 2016