
Conversation

@coderxiang
Contributor

This is a bug that appears when fitting a Logistic Regression model with setFitIntercept(false). If the data matrix has a column with identical values, the resulting model is often incorrect: that column always gets a weight of 0, due to the logic that zeroes out columns with std = 0. However, the correct solution, which is unique for L2-regularized logistic regression, usually has a non-zero weight there.

The fix updates the special-handling logic to make it compatible with columns that have 0 variance.

Two unit tests are included in this PR; one of them is for review only, and I'll remove it if this is going to be merged.
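
To make the failure mode concrete, here is a minimal, self-contained sketch (an editor's illustration with made-up data and coefficients, not the actual Spark source) of how a std == 0 check forces the constant column's weight to 0:

```scala
object ConstantColumnBug {
  def main(args: Array[String]): Unit = {
    // Toy data: the second column is constant, so its standard deviation is 0.
    val data = Array(Array(1.0, 5.0), Array(2.0, 5.0), Array(3.0, 5.0))

    def std(col: Int): Double = {
      val vals = data.map(_(col))
      val mean = vals.sum / vals.length
      math.sqrt(vals.map(v => (v - mean) * (v - mean)).sum / (vals.length - 1))
    }

    // Coefficients found by the optimizer in the standardized feature space.
    val rawCoefficients = Array(0.4, 0.7)

    // The problematic handling: columns with std == 0 are skipped when
    // scaling back to the original space, so their weight is forced to 0.0,
    // even though the unique L2-regularized solution can be non-zero there
    // when fitIntercept is false.
    val coefficients = rawCoefficients.zipWithIndex.map { case (v, i) =>
      if (std(i) != 0.0) v / std(i) else 0.0
    }
    println(coefficients.mkString(", ")) // prints: 0.4, 0.0
  }
}
```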

cc @mengxr @dbtsai

@coderxiang coderxiang changed the title [SPARK-13029][ml] fix a logistic regression issue when inputing data has a column with identical value [SPARK-13029][ml] fix a logistic regression issue when input data has a column with identical value Jan 27, 2016
@SparkQA

SparkQA commented Jan 27, 2016

Test build #50168 has finished for PR 10940 at commit 09e95a7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50181 has finished for PR 10940 at commit eca66df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@coderxiang
Contributor Author

This test is for review only, to show an example on a larger data set. Will remove if merged.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50184 has finished for PR 10940 at commit df49cfe.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50191 has finished for PR 10940 at commit ae4dd3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50189 has finished for PR 10940 at commit ae4dd3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

This if branch is not necessary.

@coderxiang
Contributor Author

Sure, I'll remove it; I was trying to avoid changing the existing logic.

@mengxr
Contributor

mengxr commented Jan 27, 2016

@coderxiang Would the fix be cleaner if we set featuresStd(i) to 1.0 if it is 0.0?

@coderxiang
Contributor Author

@mengxr You mean do this locally? I was concerned this would create confusion, since we would be modifying the true value.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50210 has finished for PR 10940 at commit 43db782.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50213 has finished for PR 10940 at commit 43db782.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

The previous version is correct. Checking value != 0.0 is much cheaper than computing localCoefficientsArray(index).

Member

rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 1.0 / featuresMean(i) }
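
For context, a hedged sketch of how this suggestion would sit in the unscaling loop; the arrays here are made-up stand-ins rather than Spark's internal state, and dividing by the mean is only safe when the constant value is non-zero:

```scala
// Scale coefficients back from the standardized space; for a constant
// column (std == 0), divide by the mean (the constant value) instead.
val featuresStd  = Array(2.0, 0.0)   // second column is constant
val featuresMean = Array(3.0, 5.0)   // its mean (the constant value) is 5.0
val rawCoefficients = Array(0.4, 0.7)
var i = 0
while (i < rawCoefficients.length) {
  rawCoefficients(i) *= {
    if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 1.0 / featuresMean(i)
  }
  i += 1
}
println(rawCoefficients.mkString(", ")) // prints: 0.2, 0.14
```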

Member

Change value to 1.0.

@dbtsai
Member

dbtsai commented Jan 27, 2016

Intuitively, this makes sense to me: when setFitIntercept(false) is used, features with std == 0 act as an intercept, resulting in non-zero coefficients.
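
To make the intuition concrete, a small sketch with made-up numbers (not from the PR): a constant column x(1) == c with weight w(1) contributes w(1) * c to every margin, exactly like an intercept b = w(1) * c.

```scala
// With fitIntercept(false), a constant column behaves like an intercept:
//   w(0) * x(0) + w(1) * c  ==  w(0) * x(0) + b,  where b = w(1) * c.
val c = 2.0
val w = Array(0.3, 0.45)            // coefficients learned without intercept
val x = Array(1.5, c)               // one sample; second feature is constant
val marginNoIntercept = w(0) * x(0) + w(1) * x(1)
val equivalentIntercept = w(1) * c
val marginWithIntercept = w(0) * x(0) + equivalentIntercept
assert(marginNoIntercept == marginWithIntercept) // same decision function
```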

Can you add a couple more tests, as follows? Thanks.

First, add two new datasets derived from binaryDataset: one with a column zeroed out, and one with a column set to a non-zero constant.

Then match the results against GLMNET, like the rest of the tests, when:

  1. setFitIntercept(false), setStandardization(false) with/without regularization
  2. setFitIntercept(false), setStandardization(true) with/without regularization
  3. setFitIntercept(true), setStandardization(false) with/without regularization
  4. setFitIntercept(true), setStandardization(true) with/without regularization

+cc @iyounus. Linear Regression may have a similar issue; if you have time, please check it out. Thanks.

@coderxiang
Contributor Author

@dbtsai Without regularization, the objective may not be strongly convex, which does not guarantee the uniqueness of the solution.

@coderxiang
Contributor Author

Jenkins, test this please.

@dbtsai
Member

dbtsai commented Jan 27, 2016

@coderxiang I agree: without regularization, those features become collinear, so the solution will not be unique. However, for the features with std != 0, the coefficients should be unique. Can you check those at least?

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50229 has finished for PR 10940 at commit 914cffc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50227 has finished for PR 10940 at commit 914cffc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@iyounus
Contributor

iyounus commented Jan 28, 2016

@dbtsai Linear regression also has similar issues. There, the "normal" and "l-bfgs" solvers treat this case differently (and incorrectly). The other problem is that if intercept=true, a constant feature column makes the Gramian matrix singular, and the Cholesky decomposition fails. Should I create a separate JIRA for the case of a constant feature?
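
A toy illustration of the singularity (made-up numbers, not the WeightedLeastSquares code): with an intercept column of ones and a constant feature column, the two columns are linearly dependent, so the Gramian has determinant 0.

```scala
// An intercept column of ones plus a constant feature column are linearly
// dependent, so the 2x2 Gramian A^T A has determinant 0 and Cholesky-based
// solvers must fail.
val n = 3
val c = 5.0
val ones     = Array.fill(n)(1.0)  // intercept column
val constCol = Array.fill(n)(c)    // constant feature column
val g11 = ones.map(v => v * v).sum                             // = n
val g12 = ones.zip(constCol).map { case (a, b) => a * b }.sum  // = n * c
val g22 = constCol.map(v => v * v).sum                         // = n * c^2
val det = g11 * g22 - g12 * g12   // = n^2 * c^2 - n^2 * c^2 = 0
println(det) // 0.0 => singular Gramian
```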

@dbtsai
Member

dbtsai commented Jan 28, 2016

@iyounus Ideally, when intercept=true, we keep the current behavior, in which a constant column doesn't have any predictive power.

@iyounus
Contributor

iyounus commented Jan 28, 2016

@dbtsai I agree that a constant feature doesn't have predictive power. But WeightedLeastSquares just throws an AssertionError in lapack.dpotrs (https://issues.apache.org/jira/browse/SPARK-11918), whereas the "l-bfgs" solver sets the coefficient to zero.

@dbtsai
Member

dbtsai commented Jan 29, 2016

@iyounus Maybe in this case WeightedLeastSquares should drop those columns so the model can still be trained.

@mengxr
Contributor

mengxr commented Mar 1, 2016

@coderxiang @dbtsai Sorry for the late response! I actually thought this PR had already been merged ... Anyway, I tested glmnet and found that it outputs zero coefficients for constant columns regardless of the intercept, regularization, and standardization settings. I thought about it today, and I feel it actually makes sense:

If we have a constant column in our training data, do we expect it to change or stay constant in the test data? If its value might change, we should set its coefficient to zero, because we cannot estimate how big the change would be. If its value stays constant (or maybe the user created the column to add a bias manually), it shouldn't be regularized, and the user should really turn on fitIntercept instead.

So my suggestion is to follow glmnet and set the coefficients of constant columns to zero regardless of the other settings. If there are constant columns and fitIntercept is false, we should output a warning message. Does that sound good to you?
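
As a hedged sketch of the suggested behavior (hypothetical helper name and warning text; Spark itself would log through its Logging trait rather than println):

```scala
// Constant (zero-variance) columns always get a zero coefficient, with a
// warning when fitIntercept is disabled.
def zeroOutConstantColumns(
    coefficients: Array[Double],
    featuresStd: Array[Double],
    fitIntercept: Boolean): Array[Double] = {
  if (featuresStd.contains(0.0) && !fitIntercept) {
    println("WARN: constant columns found with fitIntercept=false; " +
      "their coefficients are set to 0. Consider enabling fitIntercept.")
  }
  coefficients.zip(featuresStd).map { case (w, s) => if (s == 0.0) 0.0 else w }
}
```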

@mengxr
Contributor

mengxr commented Mar 1, 2016

Had an offline discussion with @dbtsai and @coderxiang. We agreed to keep the current behavior and have it well documented. I will mark this JIRA as "Won't Fix"; I have created SPARK-13590 for the documentation and logging improvements.

@coderxiang Do you mind closing this PR?

asfgit closed this in c37bbb3 on Mar 1, 2016