[SPARK-13777] [ML] Remove constant features from training in normal solver (WLS) #11610
Conversation
Test build #52764 has finished for PR 11610 at commit
I should point out that to identify constant features, I'm comparing the variance (aVar) to zero. But the variance of a constant feature may not be identically zero due to numerical inaccuracies, and in that case the Cholesky decomposition still fails. I'm thinking of comparing aVar to some very small number, maybe 1e-10, instead of 0.0. Is there a better way to deal with this problem?
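A minimal sketch of that tolerance idea, for illustration only (the helper name, the relative scaling, and the 1e-10 default are assumptions, not the PR's code):

```scala
// Hypothetical helper (not from the PR): treat a feature as constant when its
// variance falls below a small tolerance, scaled by the largest variance so the
// check is not sensitive to the units of the features.
def nonConstantIndices(aVar: Array[Double], relTol: Double = 1e-10): Array[Int] = {
  val cutoff = relTol * math.max(aVar.max, 1.0)
  aVar.indices.filter(i => aVar(i) > cutoff).toArray
}
```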
}
}
/*
If more than of the features in the data are constant (i.e, data matrix has constant columns),
"than of the" -> "than one of the". Also, "i.e," -> "i.e."
Test build #52981 has finished for PR 11610 at commit
val aVarRaw = summary.aVar.values
// this will keep track of features to keep in the model, and remove
// features with zero variance.
val nzVarIndex = aVarRaw.zipWithIndex.filter(_._1 != 0).map(_._2)
Explicitly filter(_._1 != 0.0)
Use foreachActive to build the non-zero element index, given the reason we just discussed.
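A sketch of that suggestion, assuming the ml.linalg Vector API (the method name and the exact predicate are illustrative, not the PR's code):

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.ml.linalg.Vector

// Illustrative only: walk the active entries of the variance vector and collect
// the indices whose variance is non-zero (or above a small tolerance, per the
// discussion above), without building intermediate (value, index) pairs.
def nonZeroVarianceIndices(aVar: Vector): Array[Int] = {
  val kept = ArrayBuffer.empty[Int]
  aVar.foreachActive { (i, v) =>
    if (v != 0.0) kept += i
  }
  kept.toArray
}
```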
@iyounus @dbtsai The normal equation approach will fail if the matrix A is rank-deficient. It happens when there are constant columns. However, more generally, it happens when there are linearly dependent columns in the training dataset. So this PR solves one case but it is not a general solution. We can try two approaches:
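A small Breeze illustration of the more general rank-deficient case (Breeze is already a Spark dependency; the numbers are made up):

```scala
import breeze.linalg.{DenseMatrix, rank}

// Illustration only: the third column is the sum of the first two, so A is
// rank-deficient even though no column is constant. The Gram matrix A^T A is
// then singular and Cholesky decomposition fails.
val a = DenseMatrix(
  (1.0, 2.0, 3.0),
  (2.0, 1.0, 3.0),
  (4.0, 5.0, 9.0),
  (3.0, 3.0, 6.0))
val ata = a.t * a
println(rank(ata))  // 2, not 3
```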
Test build #53091 has finished for PR 11610 at commit
I will vote for approach 1. SVD will be the most stable algorithm, but also the slowest, O(mn^2 + n^3), compared with Cholesky O(mn^2) or QR O(mn^2 - n^3/3). SVD handles both underdetermined (rank-deficient) and overdetermined (more data than equations) least squares problems. Given that it's a local operation, we can just call DGELSD in LAPACK.
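For concreteness, a local Breeze sketch of the kind of rank-revealing least squares solve DGELSD performs (the function name and the rcond default are assumptions; real code would call LAPACK directly):

```scala
import breeze.linalg.{DenseMatrix, DenseVector, svd}

// Minimal sketch (not the PR's code): x = V * diag(1/s) * U^T * b, zeroing
// directions whose singular value falls below rcond * s_max. This is what
// DGELSD does internally for min ||A x - b||_2, including rank-deficient A.
def svdSolve(a: DenseMatrix[Double], b: DenseVector[Double],
             rcond: Double = 1e-12): DenseVector[Double] = {
  val svd.SVD(u, s, vt) = svd.reduced(a)
  val cutoff = rcond * s(0)  // singular values come back in decreasing order
  val uTb = u.t * b
  val y = DenseVector.tabulate(s.length) { i =>
    if (s(i) > cutoff) uTb(i) / s(i) else 0.0
  }
  vt.t * y
}
```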
I'm a bit confused about the use of DGELSD. As far as I can tell, it requires the matrix A itself, but in the current implementation we're decomposing A^T A on the driver. To find the inverse of A^T A, I only need the matrix V and the singular values from the SVD: A^T A = V \Sigma^2 V^T. I can construct this from the eigenvalues and eigenvectors of A^T A, which I can compute on the driver. Then finding the inverse is trivial. Is this what we're actually trying to do?
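A Breeze sketch of that eigendecomposition route (illustrative only; the cutoff on small eigenvalues is an assumption that anticipates the concern raised below):

```scala
import breeze.linalg.{DenseMatrix, DenseVector, eigSym}

// Sketch, not the PR's code: solve (A^T A) x = A^T b via A^T A = V diag(lambda) V^T,
// dropping eigenvalues below a relative cutoff so a rank-deficient Gram matrix
// does not blow up the inverse.
def solveNormalEquation(ata: DenseMatrix[Double], atb: DenseVector[Double],
                        rcond: Double = 1e-12): DenseVector[Double] = {
  val eigSym.EigSym(lambda, v) = eigSym(ata)      // eigenvalues in increasing order
  val cutoff = rcond * lambda(lambda.length - 1)
  val vtAtb = v.t * atb
  val y = DenseVector.tabulate(lambda.length) { i =>
    if (lambda(i) > cutoff) vtAtb(i) / lambda(i) else 0.0
  }
  v * y
}
```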
Locally, we are solving the normal equation A^T A x = A^T b. Ideally, we should solve the least squares problem using tall-skinny QR/SVD to get better stability, but that is a little tricky for sparse data. So I would suggest solving the normal equation locally with SVD.
I'm not an expert in this area, but after thinking about it more, I don't think we can use For those least squares problems, when the number of columns is small, we can always solve it by Thanks.
One problem with the eigendecomposition method is that for a rank-deficient matrix some of the eigenvalues can be extremely small (instead of being exactly zero), and their contribution to the inverse can become very large. I'll try out these methods (DGELSD and eigendecomposition) and see how they behave in this case.
@dbtsai There is a good chance of precision loss during the computation of A^T A if A is ill-conditioned. A better approach is to factorize A directly. It is similar to tall-skinny QR without storing Q (applying Q^T to b directly). SVD is similar. See this paper: http://web.stanford.edu/~paulcon/docs/mapreduce-2013-arbenson.pdf. We can definitely switch to it to get better stability, but we would need to handle sparsity, which might not be worth the time. @iyounus You can use
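As a purely local illustration of "factorize A directly and apply Q^T to b" (not the distributed tall-skinny factorization the paper describes):

```scala
import breeze.linalg.{DenseMatrix, DenseVector, qr}

// Local sketch only: factor A = Q R and solve R x = Q^T b. This avoids forming
// A^T A, whose condition number is the square of A's.
def qrSolve(a: DenseMatrix[Double], b: DenseVector[Double]): DenseVector[Double] = {
  val qr.QR(q, r) = qr.reduced(a)  // q: m x n with orthonormal columns, r: n x n
  r \ (q.t * b)                    // solve R x = Q^T b
}
```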
Ping @iyounus?
@mengxr I looked into using DGELSD to solve
Test build #59136 has finished for PR 11610 at commit
Test build #59144 has finished for PR 11610 at commit
This problem should be handled by #15394 if it is merged. It seems this is no longer active, and we are pursuing alternative solutions. Shall we close this?
Closes apache#11610 Closes apache#15411 Closes apache#15501 Closes apache#12613 Closes apache#12518 Closes apache#12026 Closes apache#15524 Closes apache#12693 Closes apache#12358 Closes apache#15588 Closes apache#15635 Closes apache#15678 Closes apache#14699 Closes apache#9008
What changes were proposed in this pull request?
"normal" solver in LinearRegression uses Cholesky decomposition to calculate the coefficients. If the data has features with identical values (zero variance), then (A^T A) matrix is not positive definite any more and the Cholesky decomposition fails.
Since A^T A and the feature variances are calculated in a single pass over the data, it's better to modify A^T A directly rather than re-calculate it from the data after dropping constant columns. In this PR, I drop the columns and rows of A^T A that correspond to features with zero variance. The Cholesky decomposition can then be performed without any problem.
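A dense, self-contained sketch of this idea (illustrative only; the PR itself operates on the statistics inside WeightedLeastSquares, and the names here are made up):

```scala
import breeze.linalg.{DenseMatrix, DenseVector, cholesky}

// Sketch, not the PR's code: drop the rows/columns of A^T A and the entries of
// A^T b that correspond to zero-variance features, Cholesky-solve the reduced
// system, then scatter the solution back with 0.0 for the dropped features.
def solveDroppingConstantFeatures(
    ata: DenseMatrix[Double],
    atb: DenseVector[Double],
    keep: Array[Int]): DenseVector[Double] = {
  val k = keep.length
  val ataKept = DenseMatrix.tabulate(k, k)((i, j) => ata(keep(i), keep(j)))
  val atbKept = DenseVector.tabulate(k)(i => atb(keep(i)))
  val l = cholesky(ataKept)          // reduced system is positive definite
  val xKept = l.t \ (l \ atbKept)    // solve L y = A^T b, then L^T x = y
  val x = DenseVector.zeros[Double](atb.length)
  keep.zipWithIndex.foreach { case (origIdx, i) => x(origIdx) = xKept(i) }
  x
}
```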
How was this patch tested?
A unit test is added under LinearRegressionSuite which compares the results from this change and from the l-bfgs solver against glmnet. All of these are now consistent.