[SPARK-7159][ML] Add multiclass logistic regression to Spark ML #13796
Conversation
I noticed some (somewhat modest) performance gains from explicitly broadcasting the coefficients when the number of coefficients was large.
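A minimal sketch of the pattern being described, with illustrative names rather than the PR's exact code: the coefficient vector is broadcast once, so each executor holds a single shared copy instead of serializing it into the closure of every task.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// Broadcast the coefficients once per pass over the data; tasks read the
// shared broadcast copy instead of a per-closure serialized copy.
def sumOfMargins(data: RDD[(Double, Vector)], coefficients: Vector): Double = {
  val bcCoefficients = data.sparkContext.broadcast(coefficients)
  val total = data.map { case (_, features) =>
    val coef = bcCoefficients.value // dereference the shared copy
    var margin = 0.0
    features.foreachActive { (i, v) => margin += coef(i) * v }
    margin
  }.sum()
  bcCoefficients.destroy()
  total
}
```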
Test build #60900 has finished for PR 13796 at commit
In testing I found that the extra divisions required to standardize the data on every iteration have a significant impact for large numbers of features/classes. I changed the LogisticAggregator to optionally standardize the data. This way we can change binary LR if needed, and/or we can give the option to users so that if they already have standardized data the runtime will be lower.
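A hedged sketch of that option, with illustrative names rather than the PR's actual signature: when the caller promises the features are already scaled, the aggregator skips the per-feature division by the column standard deviation on every gradient update.

```scala
// Illustrative aggregator: `standardize = false` means the input is already
// in the scaled space, so no extra division is performed per feature.
class ExampleLogisticAggregator(
    featuresStd: Array[Double],
    standardize: Boolean) extends Serializable {

  private def scale(i: Int): Double = {
    if (standardize && featuresStd(i) != 0.0) 1.0 / featuresStd(i)
    else 1.0 // data already standardized; avoid the extra division
  }

  // Accumulate one example's contribution to the gradient (loss math elided).
  def add(features: Array[Double], gradient: Array[Double], multiplier: Double): Unit = {
    var i = 0
    while (i < features.length) {
      gradient(i) += multiplier * features(i) * scale(i)
      i += 1
    }
  }
}
```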
cc @dbtsai @jkbradley @mengxr If you get a chance to review and/or provide feedback on the approach, I'd appreciate it.

Test build #60901 has finished for PR 13796 at commit
@sethah I apologize for the delay. I just came back to the US. Gonna make the first pass. Thanks.
If not multinomial, just return 1.
Done.
(I deleted a previous comment, which was incorrect because I misunderstood your suggestion)
@sethah Please merge master since there is a conflict. Thanks.
Update the doc. It supports softmax (multinomial logistic) loss now.
This is here because, when we subtract off the max margin below, we would otherwise end up computing Double.PositiveInfinity - Double.PositiveInfinity, which equals NaN in Scala. Alternatively, we could just use Double.MaxValue instead.
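A minimal sketch of the stabilization being described, in plain Scala: shifting every margin by the maximum keeps math.exp from overflowing to infinity, which would otherwise propagate NaN through the loss.

```scala
// Numerically stable softmax: after subtracting the max margin, the largest
// exponent is exp(0) = 1.0, so exp never overflows to PositiveInfinity.
def stableSoftmax(margins: Array[Double]): Array[Double] = {
  val maxMargin = margins.max
  val exps = margins.map(m => math.exp(m - maxMargin))
  val normalizer = exps.sum
  exps.map(_ / normalizer)
}
```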
Test build #63702 has finished for PR 13796 at commit

Test build #63704 has finished for PR 13796 at commit
Many people use softmax as the more academic term. Maybe we should have both, "multinomial (softmax) loss", in the documentation, and use "multinomial" in the API and other user-facing places.
I've replaced "multinomial" with "multinomial (softmax)" in a couple of comments in the code. I think as long as we're clear on the problem formulation (which will be made explicit when the derivation is added to LogisticAggregator), we shouldn't have to worry too much about semantics. Let me know if you had something else in mind or you see other places we should update.
I went through the PR again, and it's in very good shape. Only a couple of minor issues need to be addressed. Thank you @sethah for the great work. This will be a big feature in Spark 2.1.
Thanks @dbtsai for the detailed review! I addressed most comments. We still need to:
Also, one thing I'm concerned about is having separate
Test build #63996 has finished for PR 13796 at commit

I also created an umbrella JIRA for tracking follow-up items - SPARK-17133

Test build #64007 has finished for PR 13796 at commit
Where Z is a normalizing constant. Hence,
{{{
b_k = \log(P(k)) + \log(Z)
    = \log(count_k) - \log(count) + \log(Z)
}}}
Minor: this derivation doesn't show why you can choose \lambda freely, given that \log(count) + \log(Z) should already be determined by the problem. Since the \lambda sits inside the \exp, we call it a phase.
{{{
P(1) = \exp(b_1) / Z
...
P(K) = \exp(b_K) / Z
where Z = \sum_{k=1}^{K} \exp(b_k)
Since this problem doesn't have a unique solution, one solution that satisfies the above equations is
\exp(b_k) = count_k * \exp(\lambda)
hence
b_k = \log(count_k) + \lambda
}}}
Updated.
Test build #64033 has finished for PR 13796 at commit
@sethah Thank you for this great weighted MLOR work in Spark 2.1. I merged this PR into master; let's discuss/work on the follow-ups in separate JIRAs. Thanks.
@dbtsai Thanks for all of your meticulous review. Very much appreciated! Glad we can have MLOR in Spark ML now.
@sethah I also have a concern about having a separate MLOR class, and I would prefer to consolidate them into one so we can maintain them more easily. This has to be done before the release of 2.1; otherwise, we cannot change the interface anymore. Can you create a JIRA to discuss this issue? Thanks.
I found a problem in the merged code: when reg == 0, the minimizer of the softmax cost is not unique. And, in
@WeichenXu123 Do you run into this potential issue with any dataset? If so, we may need to consider optimizing softmax with pivoting when
@dbtsai
The solution to this overparameterized problem in the link is just adding the regularization, and users may not want it. I think we need to optimize over (k-1) parameters when there is no regularization, like the implementation in mllib, and then put the final or the first one back by centering the intercepts and weights. @sethah could you file a JIRA to track this issue? Thanks.
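For context, a sketch of the (K-1)-parameter pivot formulation being discussed, in standard textbook notation (not code from this PR): fixing class K as the pivot with \beta_K = 0 removes the redundant degree of freedom.

```latex
% Pivot ("reference class") parameterization with class K as pivot:
P(y = k \mid x) = \frac{\exp(\beta_k^T x)}{1 + \sum_{j=1}^{K-1} \exp(\beta_j^T x)},
  \quad k = 1, \dots, K-1
P(y = K \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(\beta_j^T x)}
```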
@dbtsai
What changes were proposed in this pull request?
This patch adds a new estimator/transformer, `MultinomialLogisticRegression`, to Spark ML.

JIRA: SPARK-7159
How was this patch tested?
Added a new test suite, `MultinomialLogisticRegressionSuite`.

Approach
Do not use a "pivot" class in the algorithm formulation
Many implementations of multinomial logistic regression treat the problem as K - 1 independent binary logistic regression models, where K is the number of possible outcomes of the output variable. In this case, one outcome is chosen as a "pivot" and the other K - 1 outcomes are regressed against the pivot. This is somewhat undesirable since the coefficients returned will be different for different choices of pivot variables. An alternative approach models the class conditional probabilities using the softmax function and returns uniquely identifiable coefficients (assuming regularization is applied). This second approach is used in R's glmnet and was also recommended by @dbtsai.
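For reference, the softmax formulation in standard notation (a sketch using conventional symbols, not text from the patch): with coefficient vectors \beta_k and intercepts b_k for classes k = 1, ..., K,

```latex
P(y = k \mid x) = \frac{\exp(\beta_k^T x + b_k)}{\sum_{j=1}^{K} \exp(\beta_j^T x + b_j)}
```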
Separate multinomial logistic regression and binary logistic regression
The initial design makes multinomial logistic regression a separate estimator/transformer from the existing LogisticRegression estimator/transformer. An alternative design would be to merge them into one.
Arguments for:
Arguments against:
This is a major design point and warrants more discussion.
Mean centering
When no regularization is applied, the coefficients will not be uniquely identifiable. This is not hard to show and is discussed in further detail here; a short sketch of the non-identifiability is given below. R's glmnet deals with this by choosing the minimum L2-norm solution (i.e. mean centering). Additionally, the intercepts are never regularized, so they are always mean centered. This is the approach taken in this PR as well.
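The sketch (a standard argument, not taken from the patch): shifting every class's coefficients by the same vector c and every intercept by the same scalar d leaves the class probabilities unchanged, since the common factor cancels,

```latex
\frac{\exp((\beta_k + c)^T x + (b_k + d))}{\sum_{j=1}^{K} \exp((\beta_j + c)^T x + (b_j + d))}
  = \frac{\exp(\beta_k^T x + b_k)}{\sum_{j=1}^{K} \exp(\beta_j^T x + b_j)}
```

so without regularization every such shifted solution attains the same loss; mean centering selects the member of this family with zero-mean coefficients.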
Feature scaling
In current ML logistic regression, the features are always standardized when running the optimization algorithm. They are always returned to the user in the original feature space, however. The same approach is maintained in this patch, but the implementation details are different. In ML logistic regression, the unregularized feature values are divided by the column standard deviation in every gradient update iteration. In contrast, MLlib transforms the entire input dataset to the scaled space before optimization. In ML, this means that `numFeatures * numClasses` extra scalar divisions are required in every iteration. Performance testing shows that this causes significant (4x in some cases) slowdowns in each iteration. This can be avoided by transforming the input to the scaled space, à la MLlib, once before iteration begins. This does add some overhead initially, but can yield significant time savings in some cases (see the sketch below). One issue with this approach is that if the input data is already cached, there may not be enough memory to cache the transformed data, which would make the algorithm much slower. The tradeoffs here merit more discussion.
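A sketch of the one-time transformation, with assumed names (not the patch's actual code):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Scale the input once before optimization, à la MLlib, instead of dividing
// by the column standard deviations inside every gradient iteration.
def toScaledSpace(
    data: RDD[(Double, Vector)],
    featuresStd: Array[Double]): RDD[(Double, Vector)] = {
  data.map { case (label, features) =>
    val scaled = Array.tabulate(features.size) { i =>
      if (featuresStd(i) != 0.0) features(i) / featuresStd(i) else 0.0
    }
    (label, Vectors.dense(scaled))
  }.cache() // caching this second copy is the memory tradeoff noted above
}
```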
Specifying and inferring the number of outcome classes
The estimator checks the dataframe label column for metadata which specifies the number of values. If the metadata is absent, the length of the `histogram` variable is used, which is essentially the maximum value found in the column. The assumption, then, is that the labels are zero-indexed when they are provided to the algorithm. A minimal sketch of this inference appears below.
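The sketch uses Spark ML's attribute API; the helper name is hypothetical. A None result means the caller falls back to scanning the data (the length of the label histogram).

```scala
import org.apache.spark.ml.attribute.{Attribute, BinaryAttribute, NominalAttribute}
import org.apache.spark.sql.types.StructField

// Read the number of label values from the column's ML attribute metadata
// when present; otherwise signal that the data must be scanned.
def numClassesFromMetadata(labelField: StructField): Option[Int] = {
  Attribute.fromStructField(labelField) match {
    case nominal: NominalAttribute => nominal.getNumValues
    case _: BinaryAttribute => Some(2)
    case _ => None
  }
}
```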
Performance

Below are some performance tests I have run so far. I am happy to add more cases or trials if we deem them necessary.
Test cluster: 4 bare metal nodes, 128 GB RAM each, 48 cores each
Notes:
References
Friedman, et al. "Regularization Paths for Generalized Linear Models via Coordinate Descent"
http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
Follow-up items

`LogisticRegression.scala` - SPARK-17135