
Conversation

@sethah (Contributor) commented Jun 21, 2016

What changes were proposed in this pull request?

This patch adds a new estimator/transformer MultinomialLogisticRegression to Spark ML.

JIRA: SPARK-7159

How was this patch tested?

Added new test suite MultinomialLogisticRegressionSuite.

Approach

Do not use a "pivot" class in the algorithm formulation

Many implementations of multinomial logistic regression treat the problem as K - 1 independent binary logistic regression models, where K is the number of possible outcomes of the output variable. One outcome is chosen as a "pivot" and the other K - 1 outcomes are regressed against it. This is somewhat undesirable since the coefficients returned will differ for different choices of pivot class. An alternative approach models the class-conditional probabilities using the softmax function and returns uniquely identifiable coefficients (assuming regularization is applied). This second approach is used in R's glmnet and was also recommended by @dbtsai.
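For reference, a sketch of the two formulations in my own notation (not taken from the patch), with per-class coefficient vectors \beta_k:

```latex
% Pivot formulation: class K is the reference; only K - 1 coefficient
% vectors are estimated, and they change if a different pivot is chosen.
\[
  \log \frac{P(y = k \mid x)}{P(y = K \mid x)} = \beta_k^\top x,
  \qquad k = 1, \dots, K - 1.
\]

% Softmax formulation: all K coefficient vectors are estimated and the
% class-conditional probabilities are modeled directly.
\[
  P(y = k \mid x) = \frac{\exp(\beta_k^\top x)}{\sum_{j=1}^{K} \exp(\beta_j^\top x)},
  \qquad k = 1, \dots, K.
\]
```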

Separate multinomial logistic regression and binary logistic regression

The initial design makes multinomial logistic regression a separate estimator/transformer from the existing LogisticRegression estimator/transformer. An alternative design would be to merge them into one.

Arguments for:

  • The multinomial case without a pivot is distinct from the current binary case, since the binary case uses a pivot class.
  • The current logistic regression model in ML uses a vector of coefficients and a scalar intercept. In the multinomial case, we require a matrix of coefficients and a vector of intercepts. There are potential workarounds for this issue if we were to merge the two estimators, but none are particularly elegant.

Arguments against:

  • It may be inconvenient for users to have to switch the estimator class when transitioning between binary and multiclass (although the new multinomial estimator can be used for two-class outcomes).
  • Some portions of the code are repeated.

This is a major design point and warrants more discussion.

Mean centering

When no regularization is applied, the coefficients will not be uniquely identifiable. This is not hard to show and is discussed in further detail here. R's glmnet deals with this by choosing the minimum l2 regularized solution (i.e. mean centering). Additionally, the intercepts are never regularized so they are always mean centered. This is the approach taken in this PR as well.
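Concretely, the non-identifiability follows because adding any fixed vector c to every \beta_k leaves the class probabilities unchanged (my notation, continuing the sketch above):

```latex
\[
  \frac{\exp\big((\beta_k + c)^\top x\big)}{\sum_{j=1}^{K} \exp\big((\beta_j + c)^\top x\big)}
  = \frac{\exp(c^\top x)\,\exp(\beta_k^\top x)}{\exp(c^\top x)\,\sum_{j=1}^{K} \exp(\beta_j^\top x)}
  = P(y = k \mid x).
\]
```

Choosing c = -(1/K) \sum_j \beta_j (mean centering) picks the representative with \sum_k \beta_k = 0, the minimum-l2 solution among these equivalent parameter settings.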

Feature scaling

In current ML logistic regression, the features are always standardized when running the optimization algorithm, though the coefficients are returned to the user in the original feature space. This patch maintains the same approach, but the implementation details differ. In ML logistic regression, the unregularized feature values are divided by the column standard deviations in every gradient update iteration. In contrast, MLlib transforms the entire input dataset to the scaled space before optimization begins. In ML, this means that numFeatures * numClasses extra scalar divisions are required in every iteration. Performance testing shows that this causes significant (4x in some cases) slowdowns per iteration. This can be avoided by transforming the input to the scaled space once, before iteration begins, as MLlib does. This adds some up-front overhead, but can yield significant time savings in some cases.

One issue with this approach is that if the input data is already cached, there may not be enough memory to cache the transformed data, which would make the algorithm much slower. The tradeoffs here merit more discussion.
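As a rough illustration of the two strategies (a sketch with hypothetical names, not the patch's actual code):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object ScalingSketch {

  // Strategy A (current ML style): keep the data in the original space and
  // divide by the column standard deviations inside every gradient
  // computation; the divisions are repeated on each of maxIter passes.
  def scaledForGradient(features: Vector, featuresStd: Array[Double]): Array[Double] = {
    val scaled = new Array[Double](features.size)
    features.foreachActive { (i, v) =>
      if (featuresStd(i) != 0.0) scaled(i) = v / featuresStd(i)
    }
    scaled // fed into the per-example gradient update
  }

  // Strategy B (MLlib style): transform the whole dataset to the scaled
  // space once, before optimization begins, and iterate over the copy.
  def scaleOnce(data: RDD[Vector], featuresStd: Array[Double]): RDD[Vector] = {
    data.map { features =>
      val scaled = new Array[Double](features.size)
      features.foreachActive { (i, v) =>
        if (featuresStd(i) != 0.0) scaled(i) = v / featuresStd(i)
      }
      Vectors.dense(scaled)
    }.cache() // this cached copy is what may not fit alongside the input
  }
}
```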

Specifying and inferring the number of outcome classes

The estimator checks the DataFrame label column for metadata specifying the number of label values. If it is absent, the length of the label histogram is used, which is one greater than the maximum label value found in the column. The assumption, then, is that the labels are zero-indexed when they are provided to the algorithm.
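A rough sketch of that inference logic (a hypothetical helper using the public Attribute API, not the patch's exact code):

```scala
import org.apache.spark.ml.attribute.{Attribute, BinaryAttribute, NominalAttribute}
import org.apache.spark.sql.types.StructField

// Prefer the class count recorded in the label column's ML metadata,
// and fall back to the label histogram otherwise.
def inferNumClasses(labelField: StructField, histogram: Array[Double]): Int = {
  Attribute.fromStructField(labelField) match {
    case nominal: NominalAttribute if nominal.getNumValues.isDefined =>
      nominal.getNumValues.get
    case _: BinaryAttribute => 2
    case _ =>
      // No metadata: the histogram is indexed by label value, so its length
      // is maxLabel + 1. This assumes labels are zero-indexed doubles.
      histogram.length
  }
}
```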

Performance

Below are some performance tests I have run so far. I am happy to add more cases or trials if we deem them necessary.

Test cluster: 4 bare metal nodes, 128 GB RAM each, 48 cores each

Notes:

  • Time in units of seconds
  • Metric is classification accuracy
| algo  | elasticNetParam | fitIntercept | metric   | maxIter | numPoints | numClasses | numFeatures | time    | standardization | regParam |
|-------|-----------------|--------------|----------|---------|-----------|------------|-------------|---------|-----------------|----------|
| ml    | 0               | true         | 0.746415 | 30      | 100000    | 3          | 100000      | 327.923 | true            | 0        |
| mllib | 0               | true         | 0.743785 | 30      | 100000    | 3          | 100000      | 390.217 | true            | 0        |
| ml    | 0               | true         | 0.973238 | 30      | 2000000   | 3          | 10000       | 385.476 | true            | 0        |
| mllib | 0               | true         | 0.949828 | 30      | 2000000   | 3          | 10000       | 550.403 | true            | 0        |
| mllib | 0               | true         | 0.864358 | 30      | 2000000   | 3          | 10000       | 543.359 | true            | 0.1      |
| ml    | 0               | true         | 0.867418 | 30      | 2000000   | 3          | 10000       | 401.955 | true            | 0.1      |
| ml    | 1               | true         | 0.807449 | 30      | 2000000   | 3          | 10000       | 334.892 | true            | 0.05     |
| ml    | 0               | true         | 0.602006 | 30      | 2000000   | 500        | 100         | 112.319 | true            | 0        |
| mllib | 0               | true         | 0.567226 | 30      | 2000000   | 500        | 100         | 263.768 | true            | 0        |

References

Friedman, et al. "Regularization Paths for Generalized Linear Models via Coordinate Descent"
http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

Follow up items

  • Consider using level 2 BLAS routines in the gradient computations - SPARK-17134
  • Add model summary for MLOR - SPARK-17139
  • Add initial model to MLOR and add test for intercept priors - SPARK-17140
  • Python API - SPARK-17138
  • Consider changing the tree aggregation level for MLOR/BLOR or making it user configurable to avoid memory problems with high dimensional data - SPARK-17090
  • Refactor helper classes out of LogisticRegression.scala - SPARK-17135
  • Design optimizer interface for added flexibility in ML algos - SPARK-17136
  • Support compressing the coefficients and intercepts for MLOR models - SPARK-17137

@sethah (author) commented on the diff:

I noticed some (somewhat modest) performance gains from explicitly broadcasting the coefficients when the number of coefficients was large.
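For illustration, a minimal sketch of the broadcasting pattern (names and the binary loss here are illustrative, not the patch's code): ship the large coefficient array to executors via a broadcast variable instead of capturing it in every task closure.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

def computeLoss(
    data: RDD[(Double, Vector)],
    coefficients: Array[Double]): Double = {
  val sc = data.sparkContext
  // One broadcast per pass over the data, rather than one closure copy per task.
  val bcCoefficients = sc.broadcast(coefficients)
  val loss = data.treeAggregate(0.0)(
    seqOp = { case (agg, (label, features)) =>
      val coef = bcCoefficients.value // read the broadcast copy on the executor
      var margin = 0.0
      features.foreachActive { (i, v) => margin += coef(i) * v }
      // Binary logistic loss for a label in {0, 1}.
      agg + math.log1p(math.exp(margin)) - label * margin
    },
    combOp = _ + _)
  bcCoefficients.destroy() // release broadcast blocks after the pass
  loss
}
```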

@SparkQA commented Jun 21, 2016

Test build #60900 has finished for PR 13796 at commit e4681b0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (author) commented on the diff:

In testing I found that the extra divisions required to standardize the data on every iteration have a significant impact for large numbers of features/classes. I changed the LogisticAggregator to optionally standardize the data. This way we can change binary LR if needed, and/or we can give users the option so that, if their data is already standardized, the runtime will be lower.

@sethah commented Jun 21, 2016

cc @dbtsai @jkbradley @mengxr If you get a chance to review and/or provide feedback on the approach I'd appreciate it.

@SparkQA commented Jun 21, 2016

Test build #60901 has finished for PR 13796 at commit f4c817f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented Jun 21, 2016

@sethah Thanks for implementing MultinomialLogisticRegression! This would be a major feature for Spark 2.1. @dbtsai is probably the best person to review this PR, but he is taking a break now. Do you mind waiting a couple of days for him?

@dbtsai (Member) commented Jul 4, 2016

@sethah I apologize for the delay. I just got back to the US. Going to make the first pass. Thanks.

@dbtsai (Member) commented on the diff:

If it's not multinomial, just return 1.

@sethah (author) replied:

Done.

(I deleted a previous comment, which was incorrect because I misunderstood your suggestion)

@dbtsai commented Aug 12, 2016

@sethah Please merge master since there is a conflict. Thanks.

@dbtsai (Member) commented on the diff:

Update the doc. It supports softmax (multinomial logistic) loss now.

@sethah (author) commented on the diff:

This is here because, when we subtract off the max margin below, we would end up computing Double.PositiveInfinity - Double.PositiveInfinity, which equals NaN in Scala. Alternatively, we could just use Double.MaxValue instead.
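A tiny illustration of the failure mode (mine, not the patch's code): naively subtracting the max margin turns an infinite margin into NaN, while a guarded version avoids it.

```scala
val margins = Array(1.0, 3.0, Double.PositiveInfinity)
val maxMargin = margins.max

// Infinity - Infinity == NaN, and exp(NaN) == NaN, so the naive version
// poisons the sum with NaN:
val naive = margins.map(m => math.exp(m - maxMargin))

// Guarded version: a margin equal to the max contributes exactly exp(0) = 1.
val guarded = margins.map { m =>
  if (m == maxMargin) 1.0 else math.exp(m - maxMargin)
}
val logSumExp = maxMargin + math.log(guarded.sum) // no NaN produced
```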

@SparkQA commented Aug 12, 2016

Test build #63702 has finished for PR 13796 at commit 9d4559e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 12, 2016

Test build #63704 has finished for PR 13796 at commit 026e1f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented on the diff:

Many people use "softmax" as the more academic term. Maybe we should write "multinomial (softmax) loss" in the documentation, and use "multinomial" in the API and other user-facing places.

@sethah (author) replied:

I've replaced "multinomial" with "multinomial (softmax)" in a couple of comments in the code. I think as long as we're clear on the problem formulation (which will be made explicit when the derivation is added to LogisticAggregator), we shouldn't have to worry too much about semantics. Let me know if you had something else in mind or see other places we should update.

@dbtsai commented Aug 18, 2016

I went through the PR again, and it's in very good shape. Only a couple of minor issues need to be addressed. Thank you @sethah for the great work. This will be a big feature in Spark 2.1.

@sethah commented Aug 18, 2016

Thanks @dbtsai for the detailed review! I addressed most comments. We still need to:

  • Decide whether to handle numClasses being specified in metadata
  • Decide what happens when numClasses == 1 (only label 0.0 is encountered)

Also, one thing I'm concerned about is having separate MultinomialLogisticRegression and LogisticRegression estimators. Of course, we do this mainly because we cannot easily change the LR API to support a matrix of coefficients. Still, I think it's quite annoying to have to switch to a different estimator for multiclass. The multinomial estimator more or less supersedes the functionality of BLOR, but LogisticRegression is a canonical name and users may gravitate to it. Further, even when/if people realize that MLOR can be used for both binary and multiclass, it may be confusing what LogisticRegression is for. I just want to discuss this before we make it public.

@SparkQA commented Aug 18, 2016

Test build #63996 has finished for PR 13796 at commit 0c851d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logWarning(s\"All labels belong to a single class and fitIntercept=false. It's a \" +

@sethah commented Aug 18, 2016

I also created an umbrella JIRA for tracking follow up items - SPARK-17133

@SparkQA commented Aug 18, 2016

Test build #64007 has finished for PR 13796 at commit ffc64d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

where Z is a normalizing constant. Hence,
{{{
b_k = \log(P(k)) + \log(Z)
    = \log(count_k) - \log(count) + \log(Z)
}}}
@dbtsai commented on the diff, Aug 19, 2016:

Minor: this derivation doesn't show why you can choose \lambda freely, given that \log(count) + \log(Z) should already be determined by the problem. Since \lambda is inside the \exp, we call it a phase.

{{{
P(1) = \exp(b_1) / Z
...
P(K) = \exp(b_K) / Z
where Z = \sum_{k=1}^{K} \exp(b_k)

Since this problem doesn't have a unique solution, one solution that satisfies the above equations is

\exp(b_k) = count_k * \exp(\lambda)
hence
b_k = \log(count_k) + \lambda
}}}

@sethah (author) replied:

Updated.

@SparkQA commented Aug 19, 2016

Test build #64033 has finished for PR 13796 at commit fc2aa95.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in 287bea1 on Aug 19, 2016
@dbtsai commented Aug 19, 2016

@sethah Thank you for this great weighted MLOR work in Spark 2.1. I merged this PR into master; let's discuss/work on the follow-ups in separate JIRAs. Thanks.

@sethah commented Aug 19, 2016

@dbtsai Thanks for all of your meticulous review. Very much appreciated! Glad we can have MLOR in Spark ML now.

@dbtsai commented Aug 19, 2016

@sethah I also have a concern about having a separate MLOR class; I'd prefer to consolidate them into one so we can maintain them more easily. This has to be done before the release of 2.1; otherwise, we cannot change the interface anymore. Can you create a JIRA to discuss this issue? Thanks.

@WeichenXu123 (Contributor) commented Aug 23, 2016

I found a problem in the merged code: when reg == 0, the minimizer of the softmax cost is not unique.
In that case the Hessian matrix is non-invertible, and I think it may cause quasi-Newton methods such as L-BFGS to run into numerical problems.
So, is it better to forbid the reg == 0 case for the softmax parameterization?

Also, in the reg == 0 case, the softmax result will be equivalent to pivoted logistic regression.
Or would the better way be to replace it with pivoted logistic regression?

cc @sethah @dbtsai

@dbtsai commented Aug 23, 2016

@WeichenXu123 Have you run into this potential issue with any dataset? If so, we may need to consider optimizing softmax with pivoting when reg == 0. BTW, it's obvious that the minimizer of the softmax cost is not unique, but why would that lead to a non-invertible Hessian matrix?

@WeichenXu123 commented Aug 23, 2016

@dbtsai
I'll give a reference here:
http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression
It mentions this problem:
"the minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)"

@dbtsai commented Aug 23, 2016

The solution to this overparameterized problem in the link is just to add regularization, and users may not want that. I think we need to optimize over (K - 1) parameter sets when there is no regularization, like the implementation in MLlib, and then recover the final (or first) set by centering the intercepts and weights. @sethah, could you open a JIRA to track this issue? Thanks.

@WeichenXu123 commented Aug 23, 2016

@dbtsai
In this JIRA,
https://issues.apache.org/jira/browse/SPARK-17163,
unifying the interfaces for binary logistic regression and softmax is considered.
But I think they are not in fact equivalent (when numClasses == 2, they are equivalent only when reg == 0).
I think the better way is the following (see the sketch after this list):

  1. Extend binary logistic regression to the numClasses > 2 case, but optimize over (numClasses - 1) parameter sets.
  2. Modify the softmax reg == 0 computation to use the approach described in 1.
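For concreteness, the (numClasses - 1)-parameter pivot formulation being proposed fixes the last class's parameters at zero, removing the redundant degree of freedom (my notation):

```latex
\[
  P(y = k \mid x) = \frac{\exp(\theta_k^\top x)}{1 + \sum_{j=1}^{K-1} \exp(\theta_j^\top x)},
  \quad k = 1, \dots, K - 1,
  \qquad
  P(y = K \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(\theta_j^\top x)}.
\]
```

For K = 2 this reduces exactly to binary logistic regression.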

@sethah commented Aug 23, 2016

SPARK-17201
