[SPARK-13029][ml] fix a logistic regression issue when input data has a column with identical value #10940
Conversation
Test build #50168 has finished for PR 10940 at commit
Test build #50181 has finished for PR 10940 at commit
This test is for review only, to show an example on a larger data set. Will remove if merged.
Test build #50184 has finished for PR 10940 at commit
Test build #50191 has finished for PR 10940 at commit
Test build #50189 has finished for PR 10940 at commit
This if branch is not necessary.
Sure, I'll remove it; I was avoiding changing the existing logic.
@coderxiang Would the fix be cleaner if we set
@mengxr You mean do this locally? I was concerned this would create confusion, since we are modifying the true value.
Test build #50210 has finished for PR 10940 at commit
Test build #50213 has finished for PR 10940 at commit
The previous version is correct. Checking value != 0.0 is much cheaper than computing localCoefficientsArray(index).
rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 1.0 / featuresMean(i) }
Change value to 1.0.
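For readers following along, here is a minimal self-contained sketch (with made-up numbers, not the actual MLlib source) of the back-scaling step that the suggested line above belongs to: coefficients fit on standardized features are mapped back to the original scale, with a special branch for zero-variance columns.

```scala
// Hedged sketch only: mirrors the single line quoted in the review above,
// embedded in a runnable loop with hypothetical values.
object RescaleSketch extends App {
  val rawCoefficients: Array[Double] = Array(0.5, -1.2, 0.3) // hypothetical fitted values
  val featuresStd: Array[Double]     = Array(2.0, 0.0, 1.5)  // second feature is constant
  val featuresMean: Array[Double]    = Array(1.0, 3.0, -0.5)

  var i = 0
  while (i < rawCoefficients.length) {
    // Divide by the std when it is non-zero; fall back to the mean for a constant column,
    // as in the suggestion under discussion.
    rawCoefficients(i) *= {
      if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 1.0 / featuresMean(i)
    }
    i += 1
  }
  println(rawCoefficients.mkString(", ")) // 0.25, -0.4, 0.2
}
```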
Intuitively, this makes sense to me. Can you add a couple more tests as the following? Thanks. First, add two new datasets by zeroing out one of the feature columns. Then, match the result against GLMNET like the rest of the tests.
+cc @iyounus Linear Regression may have a similar issue; if you have time, you may check it out. Thanks.
@dbtsai Without regularization, the objective may not be strongly convex, so the uniqueness of the solution is not guaranteed.
Jenkins, test this please.
@coderxiang I agree: without regularization, those features become collinear, so the solution will not be unique. However, for the features with std != 0, the coefficients should be unique. Can you check them at least?
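To make the uniqueness point explicit (my own summary of the standard argument, not text from this PR): with an L2 penalty the objective is strongly convex, so its minimizer is unique even when some feature columns are collinear or constant:

$$
f(\mathbf{w}) \;=\; \sum_{i=1}^{n} \log\!\bigl(1 + \exp(-y_i\,\mathbf{w}^{\top}\mathbf{x}_i)\bigr) \;+\; \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert_2^2,
\qquad
\nabla^2 f(\mathbf{w}) \;\succeq\; \lambda I \;\succ\; 0 \ \text{ for } \lambda > 0 .
$$

With λ = 0, the Hessian can be singular along collinear directions, which is exactly the non-uniqueness caveat raised above.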
Test build #50229 has finished for PR 10940 at commit
Test build #50227 has finished for PR 10940 at commit
@dbtsai Linear regression also has similar issues. There, the "normal" and "l-bfgs" solvers treat this case differently (and incorrectly). The other problem there is that if intercept=true, then a constant feature column makes the Gram matrix singular and the Cholesky decomposition fails. Should I create a separate JIRA for the case of a constant feature?
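To spell out the Cholesky failure mentioned above (my illustration, not text from this PR): with an intercept the design matrix contains an all-ones column, and a feature column that is constant with value c is just c times that column, so

$$
X = \begin{bmatrix} \mathbf{1} & c\,\mathbf{1} & \cdots \end{bmatrix}
\;\Longrightarrow\;
\operatorname{rank}\!\left(X^{\top} X\right) < p
\;\Longrightarrow\;
X^{\top} X \text{ is not positive definite,}
$$

and a Cholesky factorization of the normal equations cannot proceed.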
@iyounus Ideally, it will be great that
@dbtsai I agree that the constant feature doesn't have predictive power. But WeightedLeastSquares just throws an exception.
@iyounus Maybe in this case,
@coderxiang @dbtsai Sorry for the late response! I actually thought this PR had already been merged ... Anyway, I did some testing. If we have a constant column in our training data, do we expect it to change or stay constant in test data? If its value might change, we should set its coefficient to zero, because we cannot estimate how big the change would be. If its value stays constant (or maybe users created this column to add a bias manually), it shouldn't be regularized, and users should really turn on the intercept instead. So my suggestion is to follow glmnet and set the coefficients of constant columns to zero regardless of other settings. If there are constant columns and
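As a rough sketch of the glmnet-style convention described above (a hypothetical helper, not something added in this PR), the post-processing would look like:

```scala
// Hypothetical helper sketching the suggested glmnet-like behavior:
// any feature whose standard deviation is zero gets a zero coefficient.
def zeroOutConstantColumns(
    coefficients: Array[Double],
    featuresStd: Array[Double]): Array[Double] = {
  require(coefficients.length == featuresStd.length, "length mismatch")
  coefficients.zip(featuresStd).map { case (coef, std) =>
    if (std == 0.0) 0.0 else coef
  }
}
```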
Had an offline discussion with @dbtsai and @coderxiang. We agreed to keep the current behavior and have it well documented. I will mark this JIRA as "Won't Fix" and have created SPARK-13590 for documentation and logging improvements. @coderxiang Do you mind closing this PR?
This is a bug that appears when fitting a Logistic Regression model with setFitIntercept(false). If the data matrix has a column with identical values, the resulting model is often incorrect. Specifically, that column will always get a weight of 0, due to the logic that checks for columns with std = 0. However, the correct solution, which is unique for L2-regularized logistic regression, usually gives it a non-zero weight. The fix updates the special-handling logic to make it compatible with columns that have 0 variance.
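For anyone who wants to reproduce the issue, here is a hedged, self-contained sketch using synthetic data and the current spark.ml API (this is not the test added in the PR; the dataset values are made up): the second feature holds the identical value 3.0 in every row.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object ConstantColumnRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-13029-sketch")
      .getOrCreate()
    import spark.implicits._

    // Synthetic data: the second feature is identical (3.0) in every row.
    val df = Seq(
      (1.0, Vectors.dense(1.2, 3.0)),
      (0.0, Vectors.dense(-0.5, 3.0)),
      (1.0, Vectors.dense(2.3, 3.0)),
      (0.0, Vectors.dense(-1.1, 3.0))
    ).toDF("label", "features")

    val lr = new LogisticRegression()
      .setFitIntercept(false)
      .setRegParam(0.1)
      .setElasticNetParam(0.0) // pure L2, so the optimum is unique
    val model = lr.fit(df)

    // Before the fix described above, the constant feature's coefficient is forced
    // to 0 by the std == 0 special case, even though the unique L2 optimum
    // generally assigns it a non-zero value.
    println(model.coefficients)
    spark.stop()
  }
}
```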
Two unit tests are included in this PR; one of them is for review only, and I'll remove it if this is going to be merged.
cc @mengxr @dbtsai