[SPARK-13029][ml] fix a logistic regression issue when input data has a column with identical value #10940
Conversation
Test build #50168 has finished for PR 10940 at commit
Test build #50181 has finished for PR 10940 at commit
This test is for review only, to show an example on a larger data set. Will remove if merged.
Test build #50184 has finished for PR 10940 at commit
Test build #50191 has finished for PR 10940 at commit
Test build #50189 has finished for PR 10940 at commit
This if branch is not necessary.
Sure, I'll remove it; I was avoiding changing the existing logic.
@coderxiang Would the fix be cleaner if we set
@mengxr You mean do this locally? I was concerned this would create confusion, since we are modifying the true value.
Test build #50210 has finished for PR 10940 at commit
Test build #50213 has finished for PR 10940 at commit
The previous version is correct. Checking value != 0.0 is much cheaper than computing localCoefficientsArray(index).
rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 1.0 / featuresMean(i) }
Change value to 1.0.
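For readers following along, here is a minimal self-contained sketch (with made-up numbers, not the actual MLlib source) of the back-scaling step that the suggested line above belongs to: coefficients fit on standardized features are mapped back to the original scale, with a special branch for zero-variance columns.

```scala
// Hedged sketch only: mirrors the single line quoted in the review above,
// embedded in a runnable loop with hypothetical values.
object RescaleSketch extends App {
  val rawCoefficients: Array[Double] = Array(0.5, -1.2, 0.3) // hypothetical fitted values
  val featuresStd: Array[Double]     = Array(2.0, 0.0, 1.5)  // second feature is constant
  val featuresMean: Array[Double]    = Array(1.0, 3.0, -0.5)

  var i = 0
  while (i < rawCoefficients.length) {
    // Divide by the std when it is non-zero; fall back to the mean for a constant column,
    // as in the suggestion under discussion.
    rawCoefficients(i) *= {
      if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 1.0 / featuresMean(i)
    }
    i += 1
  }
  println(rawCoefficients.mkString(", ")) // 0.25, -0.4, 0.2
}
```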
Intuitively, this makes sense to me. Can you add a couple more tests as the following? Thanks. First, add two new datasets by zeroing out one of the feature columns. Then, match the result against GLMNET like the rest of the tests.
+cc @iyounus Linear Regression may have a similar issue; if you have time, you may check it out. Thanks.
@dbtsai Without regularization, the objective may not be strongly convex, so the uniqueness of the solution is not guaranteed.
Jenkins, test this please.
@coderxiang I agree: without regularization, those features become collinear, so the solution will not be unique. However, for the features with std != 0, the coefficients should be unique. Can you check them at least?
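To make the uniqueness point explicit (my own summary of the standard argument, not text from this PR): with an L2 penalty the objective is strongly convex, so its minimizer is unique even when some feature columns are collinear or constant:

$$
f(\mathbf{w}) \;=\; \sum_{i=1}^{n} \log\!\bigl(1 + \exp(-y_i\,\mathbf{w}^{\top}\mathbf{x}_i)\bigr) \;+\; \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert_2^2,
\qquad
\nabla^2 f(\mathbf{w}) \;\succeq\; \lambda I \;\succ\; 0 \ \text{ for } \lambda > 0 .
$$

With λ = 0, the Hessian can be singular along collinear directions, which is exactly the non-uniqueness caveat raised above.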
Test build #50229 has finished for PR 10940 at commit
Test build #50227 has finished for PR 10940 at commit
@dbtsai Linear regression also has similar issues. There, the "normal" and "l-bfgs" solvers treat this case differently (and incorrectly). The other problem there is that if intercept=true, then a constant feature column makes the Gram matrix singular and the Cholesky decomposition fails. Should I create a separate JIRA for the case of a constant feature?
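To spell out the Cholesky failure mentioned above (my illustration, not text from this PR): with an intercept the design matrix contains an all-ones column, and a feature column that is constant with value c is just c times that column, so

$$
X = \begin{bmatrix} \mathbf{1} & c\,\mathbf{1} & \cdots \end{bmatrix}
\;\Longrightarrow\;
\operatorname{rank}\!\left(X^{\top} X\right) < p
\;\Longrightarrow\;
X^{\top} X \text{ is not positive definite,}
$$

and a Cholesky factorization of the normal equations cannot proceed.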
@iyounus Ideally, it will be great that
@dbtsai I agree that the constant feature doesn't have predictive power. But WeightedLeastSquares just throws an exception.
@iyounus Maybe in this case,
@coderxiang @dbtsai Sorry for the late response! I actually thought this PR had already been merged ... Anyway, I did some testing. If we have a constant column in our training data, do we expect it to change or stay constant in test data? If its value might change, we should set its coefficient to zero, because we cannot estimate how big the change would be. If its value stays constant (or maybe users created this column to add a bias manually), it shouldn't be regularized, and users should really turn on the intercept instead. So my suggestion is to follow glmnet and set the coefficients of constant columns to zero regardless of other settings. If there are constant columns and
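As a rough sketch of the glmnet-style convention described above (a hypothetical helper, not something added in this PR), the post-processing would look like:

```scala
// Hypothetical helper sketching the suggested glmnet-like behavior:
// any feature whose standard deviation is zero gets a zero coefficient.
def zeroOutConstantColumns(
    coefficients: Array[Double],
    featuresStd: Array[Double]): Array[Double] = {
  require(coefficients.length == featuresStd.length, "length mismatch")
  coefficients.zip(featuresStd).map { case (coef, std) =>
    if (std == 0.0) 0.0 else coef
  }
}
```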
Had an offline discussion with @dbtsai and @coderxiang. We agreed to keep the current behavior and have it well documented. I will mark this JIRA as "Won't Fix" and have created SPARK-13590 for documentation and logging improvements. @coderxiang Do you mind closing this PR?
This is a bug that appears when fitting a Logistic Regression model with setFitIntercept(false). If the data matrix has a column with identical values, the resulting model is often incorrect. Specifically, that column will always get a weight of 0, due to the logic that checks for columns with std = 0. However, the correct solution, which is unique for L2-regularized logistic regression, usually gives it a non-zero weight. The fix updates the special-handling logic to make it compatible with columns that have 0 variance.
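For anyone who wants to reproduce the issue, here is a hedged, self-contained sketch using synthetic data and the current spark.ml API (this is not the test added in the PR; the dataset values are made up): the second feature holds the identical value 3.0 in every row.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object ConstantColumnRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-13029-sketch")
      .getOrCreate()
    import spark.implicits._

    // Synthetic data: the second feature is identical (3.0) in every row.
    val df = Seq(
      (1.0, Vectors.dense(1.2, 3.0)),
      (0.0, Vectors.dense(-0.5, 3.0)),
      (1.0, Vectors.dense(2.3, 3.0)),
      (0.0, Vectors.dense(-1.1, 3.0))
    ).toDF("label", "features")

    val lr = new LogisticRegression()
      .setFitIntercept(false)
      .setRegParam(0.1)
      .setElasticNetParam(0.0) // pure L2, so the optimum is unique
    val model = lr.fit(df)

    // Before the fix described above, the constant feature's coefficient is forced
    // to 0 by the std == 0 special case, even though the unique L2 optimum
    // generally assigns it a non-zero value.
    println(model.coefficients)
    spark.stop()
  }
}
```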
Two unit tests are included in this PR; one of them is for review only, and I'll remove it if this is going to be merged.
cc @mengxr @dbtsai