
Conversation

@sethah (Contributor) commented Jun 21, 2016

What changes were proposed in this pull request?

This patch adds a new estimator/transformer MultinomialLogisticRegression to Spark ML.

JIRA: SPARK-7159

How was this patch tested?

Added new test suite MultinomialLogisticRegressionSuite.

Approach

Do not use a "pivot" class in the algorithm formulation

Many implementations of multinomial logistic regression treat the problem as K - 1 independent binary logistic regression models, where K is the number of possible outcomes of the output variable. One outcome is chosen as a "pivot" and the other K - 1 outcomes are regressed against it. This is somewhat undesirable since the coefficients returned will differ for different choices of pivot class. An alternative approach models the class-conditional probabilities using the softmax function and returns uniquely identifiable coefficients (assuming regularization is applied). This second approach is used in R's glmnet and was also recommended by @dbtsai.
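For reference, a sketch of the two formulations in my own notation (not taken from the patch), with per-class coefficient vectors \beta_k:

```latex
% Pivot formulation: class K is the reference; only K - 1 coefficient
% vectors are estimated, and they change if a different pivot is chosen.
\[
  \log \frac{P(y = k \mid x)}{P(y = K \mid x)} = \beta_k^\top x,
  \qquad k = 1, \dots, K - 1.
\]

% Softmax formulation: all K coefficient vectors are estimated and the
% class-conditional probabilities are modeled directly.
\[
  P(y = k \mid x) = \frac{\exp(\beta_k^\top x)}{\sum_{j=1}^{K} \exp(\beta_j^\top x)},
  \qquad k = 1, \dots, K.
\]
```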

Separate multinomial logistic regression and binary logistic regression

The initial design makes multinomial logistic regression a separate estimator/transformer from the existing LogisticRegression estimator/transformer. An alternative design would be to merge them into one.

Arguments for:

  • The multinomial case without a pivot is distinct from the current binary case, since the binary case uses a pivot class.
  • The current logistic regression model in ML uses a vector of coefficients and a scalar intercept. In the multinomial case, we require a matrix of coefficients and a vector of intercepts. There are potential workarounds for this issue if we were to merge the two estimators, but none are particularly elegant.

Arguments against:

  • It may be inconvenient for users to have to switch the estimator class when transitioning between binary and multiclass (although the new multinomial estimator can be used for two-class outcomes).
  • Some portions of the code are repeated.

This is a major design point and warrants more discussion.

Mean centering

When no regularization is applied, the coefficients will not be uniquely identifiable. This is not hard to show and is discussed in further detail here. R's glmnet deals with this by choosing the minimum l2 regularized solution (i.e. mean centering). Additionally, the intercepts are never regularized so they are always mean centered. This is the approach taken in this PR as well.
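Concretely, the non-identifiability follows because adding any fixed vector c to every \beta_k leaves the class probabilities unchanged (my notation, continuing the sketch above):

```latex
\[
  \frac{\exp\big((\beta_k + c)^\top x\big)}{\sum_{j=1}^{K} \exp\big((\beta_j + c)^\top x\big)}
  = \frac{\exp(c^\top x)\,\exp(\beta_k^\top x)}{\exp(c^\top x)\,\sum_{j=1}^{K} \exp(\beta_j^\top x)}
  = P(y = k \mid x).
\]
```

Choosing c = -(1/K) \sum_j \beta_j (mean centering) picks the representative with \sum_k \beta_k = 0, the minimum-l2 solution among these equivalent parameter settings.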

Feature scaling

In current ML logistic regression, the features are always standardized when running the optimization algorithm, though the coefficients are returned to the user in the original feature space. This patch maintains the same approach, but the implementation details differ. In ML logistic regression, the unregularized feature values are divided by the column standard deviations in every gradient update iteration. In contrast, MLlib transforms the entire input dataset to the scaled space before optimization begins. In ML, this means that numFeatures * numClasses extra scalar divisions are required in every iteration. Performance testing shows that this causes significant (4x in some cases) slowdowns per iteration. This can be avoided by transforming the input to the scaled space once, before iteration begins, as MLlib does. This adds some up-front overhead, but can yield significant time savings in some cases.

One issue with this approach is that if the input data is already cached, there may not be enough memory to cache the transformed data, which would make the algorithm much slower. The tradeoffs here merit more discussion.
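As a rough illustration of the two strategies (a sketch with hypothetical names, not the patch's actual code):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object ScalingSketch {

  // Strategy A (current ML style): keep the data in the original space and
  // divide by the column standard deviations inside every gradient
  // computation; the divisions are repeated on each of maxIter passes.
  def scaledForGradient(features: Vector, featuresStd: Array[Double]): Array[Double] = {
    val scaled = new Array[Double](features.size)
    features.foreachActive { (i, v) =>
      if (featuresStd(i) != 0.0) scaled(i) = v / featuresStd(i)
    }
    scaled // fed into the per-example gradient update
  }

  // Strategy B (MLlib style): transform the whole dataset to the scaled
  // space once, before optimization begins, and iterate over the copy.
  def scaleOnce(data: RDD[Vector], featuresStd: Array[Double]): RDD[Vector] = {
    data.map { features =>
      val scaled = new Array[Double](features.size)
      features.foreachActive { (i, v) =>
        if (featuresStd(i) != 0.0) scaled(i) = v / featuresStd(i)
      }
      Vectors.dense(scaled)
    }.cache() // this cached copy is what may not fit alongside the input
  }
}
```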

Specifying and inferring the number of outcome classes

The estimator checks the DataFrame label column for metadata specifying the number of label values. If it is absent, the length of the label histogram is used, which is one greater than the maximum label value found in the column. The assumption, then, is that the labels are zero-indexed when they are provided to the algorithm.
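A rough sketch of that inference logic (a hypothetical helper using the public Attribute API, not the patch's exact code):

```scala
import org.apache.spark.ml.attribute.{Attribute, BinaryAttribute, NominalAttribute}
import org.apache.spark.sql.types.StructField

// Prefer the class count recorded in the label column's ML metadata,
// and fall back to the label histogram otherwise.
def inferNumClasses(labelField: StructField, histogram: Array[Double]): Int = {
  Attribute.fromStructField(labelField) match {
    case nominal: NominalAttribute if nominal.getNumValues.isDefined =>
      nominal.getNumValues.get
    case _: BinaryAttribute => 2
    case _ =>
      // No metadata: the histogram is indexed by label value, so its length
      // is maxLabel + 1. This assumes labels are zero-indexed doubles.
      histogram.length
  }
}
```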

Performance

Below are some performance tests I have run so far. I am happy to add more cases or trials if we deem them necessary.

Test cluster: 4 bare metal nodes, 128 GB RAM each, 48 cores each

Notes:

  • Time in units of seconds
  • Metric is classification accuracy
| algo  | elasticNetParam | fitIntercept | metric   | maxIter | numPoints | numClasses | numFeatures | time    | standardization | regParam |
|-------|-----------------|--------------|----------|---------|-----------|------------|-------------|---------|-----------------|----------|
| ml    | 0               | true         | 0.746415 | 30      | 100000    | 3          | 100000      | 327.923 | true            | 0        |
| mllib | 0               | true         | 0.743785 | 30      | 100000    | 3          | 100000      | 390.217 | true            | 0        |
| ml    | 0               | true         | 0.973238 | 30      | 2000000   | 3          | 10000       | 385.476 | true            | 0        |
| mllib | 0               | true         | 0.949828 | 30      | 2000000   | 3          | 10000       | 550.403 | true            | 0        |
| mllib | 0               | true         | 0.864358 | 30      | 2000000   | 3          | 10000       | 543.359 | true            | 0.1      |
| ml    | 0               | true         | 0.867418 | 30      | 2000000   | 3          | 10000       | 401.955 | true            | 0.1      |
| ml    | 1               | true         | 0.807449 | 30      | 2000000   | 3          | 10000       | 334.892 | true            | 0.05     |
| ml    | 0               | true         | 0.602006 | 30      | 2000000   | 500        | 100         | 112.319 | true            | 0        |
| mllib | 0               | true         | 0.567226 | 30      | 2000000   | 500        | 100         | 263.768 | true            | 0        |

References

Friedman, et al. "Regularization Paths for Generalized Linear Models via Coordinate Descent"
http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

Follow up items

  • Consider using level 2 BLAS routines in the gradient computations - SPARK-17134
  • Add model summary for MLOR - SPARK-17139
  • Add initial model to MLOR and add test for intercept priors - SPARK-17140
  • Python API - SPARK-17138
  • Consider changing the tree aggregation level for MLOR/BLOR or making it user configurable to avoid memory problems with high dimensional data - SPARK-17090
  • Refactor helper classes out of LogisticRegression.scala - SPARK-17135
  • Design optimizer interface for added flexibility in ML algos - SPARK-17136
  • Support compressing the coefficients and intercepts for MLOR models - SPARK-17137

@sethah (author) commented on the diff:

I noticed some (somewhat modest) performance gains from explicitly broadcasting the coefficients when the number of coefficients was large.
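For illustration, a minimal sketch of the broadcasting pattern (names and the binary loss here are illustrative, not the patch's code): ship the large coefficient array to executors via a broadcast variable instead of capturing it in every task closure.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

def computeLoss(
    data: RDD[(Double, Vector)],
    coefficients: Array[Double]): Double = {
  val sc = data.sparkContext
  // One broadcast per pass over the data, rather than one closure copy per task.
  val bcCoefficients = sc.broadcast(coefficients)
  val loss = data.treeAggregate(0.0)(
    seqOp = { case (agg, (label, features)) =>
      val coef = bcCoefficients.value // read the broadcast copy on the executor
      var margin = 0.0
      features.foreachActive { (i, v) => margin += coef(i) * v }
      // Binary logistic loss for a label in {0, 1}.
      agg + math.log1p(math.exp(margin)) - label * margin
    },
    combOp = _ + _)
  bcCoefficients.destroy() // release broadcast blocks after the pass
  loss
}
```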

@SparkQA commented Jun 21, 2016

Test build #60900 has finished for PR 13796 at commit e4681b0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (author) commented on the diff:

In testing I found that the extra divisions required to standardize the data on every iteration have a significant impact for large numbers of features/classes. I changed the LogisticAggregator to optionally standardize the data. This way we can change binary LR if needed, and/or we can give users the option so that, if their data is already standardized, the runtime will be lower.

@sethah commented Jun 21, 2016

cc @dbtsai @jkbradley @mengxr If you get a chance to review and/or provide feedback on the approach I'd appreciate it.

@SparkQA commented Jun 21, 2016

Test build #60901 has finished for PR 13796 at commit f4c817f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented Jun 21, 2016

@sethah Thanks for implementing MultinomialLogisticRegression! This would be a major feature for Spark 2.1. @dbtsai is probably the best person to review this PR, but he is taking a break now. Do you mind waiting a couple of days for him?

@dbtsai (Member) commented Jul 4, 2016

@sethah I apologize for the delay. I just got back to the US. Going to make the first pass. Thanks.

@dbtsai (Member) commented on the diff:

If it's not multinomial, just return 1.

@sethah (author) replied:

Done.

(I deleted a previous comment, which was incorrect because I misunderstood your suggestion)

@dbtsai commented Aug 12, 2016

@sethah Please merge master since there is a conflict. Thanks.

@dbtsai (Member) commented on the diff:

Update the doc. It supports softmax (multinomial logistic) loss now.

@sethah (author) commented on the diff:

This is here because, when we subtract off the max margin below, we would end up computing Double.PositiveInfinity - Double.PositiveInfinity, which equals NaN in Scala. Alternatively, we could just use Double.MaxValue instead.
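A tiny illustration of the failure mode (mine, not the patch's code): naively subtracting the max margin turns an infinite margin into NaN, while a guarded version avoids it.

```scala
val margins = Array(1.0, 3.0, Double.PositiveInfinity)
val maxMargin = margins.max

// Infinity - Infinity == NaN, and exp(NaN) == NaN, so the naive version
// poisons the sum with NaN:
val naive = margins.map(m => math.exp(m - maxMargin))

// Guarded version: a margin equal to the max contributes exactly exp(0) = 1.
val guarded = margins.map { m =>
  if (m == maxMargin) 1.0 else math.exp(m - maxMargin)
}
val logSumExp = maxMargin + math.log(guarded.sum) // no NaN produced
```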

@SparkQA commented Aug 12, 2016

Test build #63702 has finished for PR 13796 at commit 9d4559e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 12, 2016

Test build #63704 has finished for PR 13796 at commit 026e1f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai (Member) commented on the diff:

Many people use "softmax" as the more academic term. Maybe we should write "multinomial (softmax) loss" in the documentation, and use "multinomial" in the API and other user-facing places.

@sethah (author) replied:

I've replaced "multinomial" with "multinomial (softmax)" in a couple of comments in the code. I think as long as we're clear on the problem formulation (which will be made explicit when the derivation is added to LogisticAggregator), we shouldn't have to worry too much about semantics. Let me know if you had something else in mind or see other places we should update.

@dbtsai commented Aug 18, 2016

I went through the PR again, and it's in very good shape. Only a couple of minor issues need to be addressed. Thank you @sethah for the great work. This will be a big feature in Spark 2.1.

@sethah commented Aug 18, 2016

Thanks @dbtsai for the detailed review! I addressed most comments. We still need to:

  • Decide whether to handle numClasses being specified in metadata
  • Decide what happens when numClasses == 1 (only label 0.0 is encountered)

Also, one thing I'm concerned about is having separate MultinomialLogisticRegression and LogisticRegression estimators. Of course, we do this mainly because we cannot easily change the LR API to support a matrix of coefficients. Still, I think it's quite annoying to have to switch to a different estimator for multiclass. The multinomial estimator more or less supersedes the functionality of BLOR, but LogisticRegression is a canonical name and users may gravitate to it. Further, even when/if people realize that MLOR can be used for both binary and multiclass, it may be confusing what LogisticRegression is for. I just want to discuss this before we make it public.

@SparkQA commented Aug 18, 2016

Test build #63996 has finished for PR 13796 at commit 0c851d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logWarning(s\"All labels belong to a single class and fitIntercept=false. It's a \" +

@sethah commented Aug 18, 2016

I also created an umbrella JIRA for tracking follow up items - SPARK-17133

@SparkQA commented Aug 18, 2016

Test build #64007 has finished for PR 13796 at commit ffc64d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

where Z is a normalizing constant. Hence,
{{{
b_k = \log(P(k)) + \log(Z)
    = \log(count_k) - \log(count) + \log(Z)
}}}
@dbtsai commented on the diff, Aug 19, 2016:

Minor: this derivation doesn't show why you can choose \lambda freely, given that \log(count) + \log(Z) should already be determined by the problem. Since \lambda is inside the \exp, we call it a phase.

{{{
P(1) = \exp(b_1) / Z
...
P(K) = \exp(b_K) / Z
where Z = \sum_{k=1}^{K} \exp(b_k)

Since this problem doesn't have a unique solution, one solution that satisfies the above equations is

\exp(b_k) = count_k * \exp(\lambda)
hence
b_k = \log(count_k) + \lambda
}}}

@sethah (author) replied:

Updated.

@SparkQA commented Aug 19, 2016

Test build #64033 has finished for PR 13796 at commit fc2aa95.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in 287bea1 on Aug 19, 2016
@dbtsai commented Aug 19, 2016

@sethah Thank you for this great weighted MLOR work in Spark 2.1. I merged this PR into master; let's discuss/work on the follow-ups in separate JIRAs. Thanks.

@sethah commented Aug 19, 2016

@dbtsai Thanks for all of your meticulous review. Very much appreciated! Glad we can have MLOR in Spark ML now.

@dbtsai commented Aug 19, 2016

@sethah I also have a concern about having a separate MLOR class; I'd prefer to consolidate them into one so we can maintain them more easily. This has to be done before the release of 2.1; otherwise, we cannot change the interface anymore. Can you create a JIRA to discuss this issue? Thanks.

@WeichenXu123 (Contributor) commented Aug 23, 2016

I found a problem in the merged code: when reg == 0, the minimizer of the softmax cost is not unique.
In that case the Hessian matrix is non-invertible, and I think it may cause quasi-Newton methods such as L-BFGS to run into numerical problems.
So, is it better to forbid the reg == 0 case for the softmax parameterization?

Also, in the reg == 0 case, the softmax result will be equivalent to pivoted logistic regression.
Or would the better way be to replace it with pivoted logistic regression?

cc @sethah @dbtsai

@dbtsai commented Aug 23, 2016

@WeichenXu123 Have you run into this potential issue with any dataset? If so, we may need to consider optimizing softmax with pivoting when reg == 0. BTW, it's obvious that the minimizer of the softmax cost is not unique, but why would that lead to a non-invertible Hessian matrix?

@WeichenXu123 commented Aug 23, 2016

@dbtsai
I'll give a reference here:
http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression
It mentions this problem:
"the minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)"

@dbtsai commented Aug 23, 2016

The solution to this overparameterized problem in the link is just to add regularization, and users may not want that. I think we need to optimize over (K - 1) parameter sets when there is no regularization, like the implementation in MLlib, and then recover the final (or first) set by centering the intercepts and weights. @sethah, could you open a JIRA to track this issue? Thanks.

@WeichenXu123 commented Aug 23, 2016

@dbtsai
In this JIRA,
https://issues.apache.org/jira/browse/SPARK-17163,
unifying the interfaces for binary logistic regression and softmax is considered.
But I think they are not in fact equivalent (when numClasses == 2, they are equivalent only when reg == 0).
I think the better way is the following (see the sketch after this list):

  1. Extend binary logistic regression to the numClasses > 2 case, but optimize over (numClasses - 1) parameter sets.
  2. Modify the softmax reg == 0 computation to use the approach described in 1.
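For concreteness, the (numClasses - 1)-parameter pivot formulation being proposed fixes the last class's parameters at zero, removing the redundant degree of freedom (my notation):

```latex
\[
  P(y = k \mid x) = \frac{\exp(\theta_k^\top x)}{1 + \sum_{j=1}^{K-1} \exp(\theta_j^\top x)},
  \quad k = 1, \dots, K - 1,
  \qquad
  P(y = K \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(\theta_j^\top x)}.
\]
```

For K = 2 this reduces exactly to binary logistic regression.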

@sethah commented Aug 23, 2016

SPARK-17201
