[SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hinge loss for LinearSVC #17862
Conversation
ping @jkbradley Sorry, I know this is the last minute for 2.2, but the change may be important for the user experience. If we're not comfortable making an API change right now, how about we just change the default optimizer to LBFGS before 2.2 is released? Comments from other reviewers are also appreciated.

Test build #76463 has finished for PR 17862 at commit
* @group setParam
*/
@Since("2.2.0")
def setOptimizer(value: String): this.type = set(optimizer, value.toLowerCase(Locale.ROOT))
Maybe just call it solver, since LinearRegression also has a similar param.
👍 - can use HasSolver trait
For other algorithms such as LinearRegression and LogisticRegression, the solver is l-bfgs whether training uses LBFGS (L2 reg) or OWLQN (L1/elastic-net reg), so I'm concerned it may confuse users if we distinguish them in LinearSVC. Or we should clarify it in the documentation. Thanks.
def regParamL1Fun = (index: Int) => 0D
val optimizer = new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
val optimizerAlgo = $(optimizer) match {
Could you refresh me on why OWLQN was used originally? AFAIK OWLQN is for L1 regularized loss functions, while L-BFGS only supports L2?
@MLnick Hinge loss is a non-smooth and non-differentiable loss, so the convex optimization problem is a non-smooth one; that is why we use OWL-QN rather than L-BFGS as the underlying optimizer.
@hhbyyh AFAIK, in cases where L-BFGS does not encounter any non-smooth point, it often converges to the optimum faster, but there is a chance of failure at a non-smooth point.
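As a minimal illustration of the smoothness difference being discussed (plain Scala, not the PR's aggregator code; the object and method names are made up for this sketch), here are the two losses and their (sub)gradients with respect to the margin m = y * (w dot x):

```scala
object LossSketch {
  // Standard hinge loss: max(0, 1 - m). It has a kink at m = 1, so it is not
  // differentiable there, which is why plain L-BFGS can fail on it.
  def hinge(margin: Double): Double = math.max(0.0, 1.0 - margin)

  // A subgradient of the hinge loss w.r.t. the margin: any value in [-1, 0]
  // is valid at margin == 1; this sketch picks 0.
  def hingeSubgradient(margin: Double): Double =
    if (margin < 1.0) -1.0 else 0.0

  // Squared hinge loss: max(0, 1 - m)^2, differentiable everywhere.
  def squaredHinge(margin: Double): Double = {
    val h = math.max(0.0, 1.0 - margin)
    h * h
  }

  // Its gradient is continuous (both one-sided limits are 0 at m = 1),
  // so L-BFGS applies cleanly.
  def squaredHingeGradient(margin: Double): Double =
    -2.0 * math.max(0.0, 1.0 - margin)
}
```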
Thanks for providing the info @yanboliang. I'm thinking we should provide both LBFGS and OWLQN as options for L1-SVM (standard hinge). On the several datasets I tested, LBFGS converges as well as (or even better than) OWLQN in a fraction of the time. Yet due to the potential failure, we may not want to use LBFGS as the default optimizer (solver).
A better solution, as I see it, is to use L2-SVM (squared hinge loss) with LBFGS (combining this with #17645), but the change may be too large to be included in 2.2 now.
I saw that other libraries like sklearn.svm.LinearSVC use squared hinge loss as the default loss function, and it's differentiable. What about merging #17645 first, and then we can use LBFGS as the default solver as well? Otherwise, we would be making a behavior change if we made this switch after the 2.2 release. cc @jkbradley
Sounds good to me. But we'll also need to update the Python and R APIs. I don't know the release date of 2.2.
Let's wait for comments from other reviewers; if we reach a consensus, I'd like to help review and merge this to catch 2.2. cc @jkbradley @MLnick
Personally I would be in favor of making squared hinge the default with L-BFGS. But we would need to make it clear to users that if they select the pure hinge loss, there is a risk involved in using L-BFGS rather than OWL-QN.
Update: switched to the HasSolver trait and use OWLQN as the default optimizer.
Test build #76508 has finished for PR 17862 at commit
@hhbyyh can we smooth the hinge loss using soft-max (a variant of ReLU) and then use LBFGS?
@debasish83 There are several approaches to smoothing the hinge loss: https://en.wikipedia.org/wiki/Hinge_loss. For the one you're proposing, do you know if it's used in other SVM implementations? Please provide a reference or evaluation data if possible. Thanks.
Test build #76637 has started for PR 17862 at commit
val trainer1 = new LinearSVC()
test("linearSVC OWLQN comparison with R e1071 and scikit-learn") {
  val trainer1 = new LinearSVC().setSolver(LinearSVC.OWLQN)
    .setRegParam(0.00002) // set regParam = 2.0 / datasize / c
Can I ask why we set regParam = 2.0 / datasize / c to match the solution of sklearn.svm.LinearSVC? AFAIK, sklearn calls liblinear to train the linear SVM classification model. I can understand that liblinear uses C as the penalty parameter on the error term, which is different from regParam, and that it doesn't multiply the error term by 1/n, but where does the 2.0 come from? Thanks.
http://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf
Please refer to page 36.
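For reference, a sketch of the correspondence being assumed here (it presumes the slides' convention of writing the SVM objective without a 1/2 factor on the regularizer, and Spark's objective of the form $\frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_i \max(0,\, 1 - y_i w^\top x_i)$ with $\lambda$ = regParam):

$$\min_w\; \|w\|^2 + C\sum_{i=1}^{n} \max(0,\, 1 - y_i w^\top x_i) \;\equiv\; \min_w\; \frac{1}{Cn}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \max(0,\, 1 - y_i w^\top x_i)$$

after dividing the objective by $Cn$; matching $\frac{\lambda}{2} = \frac{1}{Cn}$ then gives $\lambda = \frac{2}{Cn}$, i.e. regParam = 2.0 / datasize / c.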
These slides also explain it... please see slide 32. The max can be replaced by a soft-max whose softness parameter lambda can be tuned; log-sum-exp is a standard soft-max, which is similar to the ReLU function, and we can re-use it from MLP:
ftp://ftp.cs.wisc.edu/math-prog/talks/informs99ssv.ps
ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.pdf
I can add the formulation if there is interest... it needs some tuning of the soft-max parameter, but convergence will be good with LBFGS (OWLQN is not needed).
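To make the suggestion concrete, here is a rough sketch of a log-sum-exp smoothed hinge loss (an assumed formulation based on the soft-max idea above, not code from this PR; the names and the default lambda are hypothetical). It is differentiable everywhere and approaches the true hinge loss as lambda grows:

```scala
object SmoothedHingeSketch {
  // Replace max(0, z) with the soft-max (1/lambda) * log(1 + exp(lambda * z)),
  // written in a numerically stable form, where z = 1 - margin.
  def smoothedHinge(margin: Double, lambda: Double = 10.0): Double = {
    val z = 1.0 - margin
    if (lambda * z > 0.0) z + math.log1p(math.exp(-lambda * z)) / lambda
    else math.log1p(math.exp(lambda * z)) / lambda
  }

  // Gradient w.r.t. the margin: a smooth sigmoid transition instead of the
  // hinge loss's step at margin = 1, so L-BFGS can be used directly.
  def smoothedHingeGradient(margin: Double, lambda: Double = 10.0): Double =
    -1.0 / (1.0 + math.exp(-lambda * (1.0 - margin)))
}
```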
@debasish83 Not sure if it's a better solution than squared hinge loss, but I would be interested to see the performance (accuracy and speed). Can you please help evaluate it?
The hinge loss is not differentiable... how are you smoothing it before using a quasi-Newton solver? Since the papers smooth the max, a Newton/quasi-Newton solver should work well; if you keep the non-differentiable loss, it would be better to use a sub-gradient solver as suggested by the talk... I will evaluate the formulation...
@hhbyyh I saw some posts saying the hinge loss is not differentiable but the squared hinge loss is for practical purposes... can you please point to a reference on the squared hinge loss?
| test("linearSVC L-BFGS comparison with R e1071 and scikit-learn") { | ||
| val trainer1 = new LinearSVC().setSolver(LinearSVC.LBFGS) | ||
| .setRegParam(0.00003) |
Why do we switch to a different regParam compared with the test above? Does it mean the converged solution depends on the optimizer?
Indeed. I could use your help here, since I cannot find a theoretical proof for this case.
LR may have the same behavior: if I set the optimizer of LR to OWLQN, all the L2 regularization cases fail.
Test build #78002 has finished for PR 17862 at commit

Test build #78019 has finished for PR 17862 at commit
Merged the change from #17645 into this PR as a single change.
@hhbyyh Thanks for doing the extra work to use the new aggregator here. I do think it's better to separate those changes from this one, though. There is actually more that needs to be done for the conversion (need to use ...)
Test build #78024 has finished for PR 17862 at commit
Sure. That's reasonable. I'll move the hingeAggregator to a new PR.

Test build #78119 has finished for PR 17862 at commit
Catching up here... To make sure I caught the decisions made in the discussion above, is it correct that this PR will:
If those are the decisions, could you please update the PR description and JIRA accordingly when you update the PR, @hhbyyh? Thank you!

+1 @jkbradley for testing on large-scale datasets. @hhbyyh Do you have time to test it? If not, I can help. Thanks.
+1 for adding tests on large-scale datasets.

Sure, I can find some larger datasets to test with. But I expect, as shown in the PR description, that LBFGS will generally outperform OWLQN, though not in all cases. I assume a single large-scale dataset is not sufficient for the testing, so if you plan to test on some dataset, please leave a comment here. Thank you.
Test build #81367 has finished for PR 17862 at commit
Tested with several larger datasets using the hinge loss function, to compare the l-bfgs and owlqn solvers.
l-bfgs does not always take fewer iterations, but it converges to a smaller final loss.
@hhbyyh Test results look good!
Test build #81836 has finished for PR 17862 at commit

Test build #82376 has finished for PR 17862 at commit

Test build #82377 has finished for PR 17862 at commit

Test build #82766 has finished for PR 17862 at commit
Please let me know if there are any unresolved comments. Thanks.
private[classification] val HINGE: String = "hinge".toLowerCase(Locale.ROOT)

/** String name for Squared Hinge Loss. */
private[classification] val SQUARED_HINGE: String = "squared_hinge".toLowerCase(Locale.ROOT)
Why do we need .toLowerCase(Locale.ROOT) here?
To ensure consistency with param validation across all Locales.
Ditto. IMO these characters are all ASCII, so I don't think they would run into a locale issue. (But have you encountered such an issue in some environment?)
Personally I never have, but I cannot guarantee it for all locales.
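For context, a small self-contained sketch of the kind of locale surprise that toLowerCase(Locale.ROOT) guards against, the well-known Turkish-I case (illustrative only, not part of this PR):

```scala
import java.util.Locale

object LocaleSketch extends App {
  // Under a Turkish locale, uppercase 'I' lowercases to the dotless 'ı',
  // so a default-locale toLowerCase on user input would fail to match "hinge".
  val turkish = new Locale("tr", "TR")
  println("HINGE".toLowerCase(turkish))      // prints "hınge" (dotless ı)
  println("HINGE".toLowerCase(Locale.ROOT))  // prints "hinge"
}
```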
| @Since("2.3.0") | ||
| final val loss: Param[String] = new Param(this, "loss", "Specifies the loss " + | ||
| "function. hinge is the standard SVM loss while squared_hinge is the square of the hinge loss.", | ||
| (s: String) => LinearSVC.supportedLoss.contains(s.toLowerCase(Locale.ROOT))) |
For the isValid function you can use ParamValidators.inArray[String](supportedLosses).
Correct me if I'm wrong, but IMO we need toLowerCase here.
Yeah, I thought about this, but the solver param in LinearRegression also ignores this. I tend to keep them consistent; what do you think?
I tend to support case-insensitive params in LinearRegression, or to change the default behavior of ParamValidators.inArray. And we should improve consistency in supporting case-insensitive String params anyway.
Created a jira to address that issue: https://issues.apache.org/jira/browse/SPARK-22331
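For reference, a sketch contrasting the two validation styles discussed above (the allowed values come from the diff, but the object and val names are made up; this is illustrative rather than the PR's final code):

```scala
import java.util.Locale
import org.apache.spark.ml.param.ParamValidators

object LossParamValidationSketch {
  // ParamValidators.inArray does an exact, case-sensitive membership check.
  val caseSensitiveCheck: String => Boolean =
    ParamValidators.inArray[String](Array("hinge", "squared_hinge"))

  // The explicit closure keeps the param case-insensitive across locales.
  val caseInsensitiveCheck: String => Boolean =
    (s: String) => Array("hinge", "squared_hinge").contains(s.toLowerCase(Locale.ROOT))
}
```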
private[classification] val SQUARED_HINGE: String = "squared_hinge".toLowerCase(Locale.ROOT)

/* Set of loss function that LinearSVC supports */
private[classification] val supportedLoss = Array(HINGE, SQUARED_HINGE)
supportedLoss ==> supportedLosses
Sure. I can update it.
Thanks @WeichenXu123 for the comments. I'll update supportedLoss ==> supportedLosses along with other possible changes.

Test build #105501 has finished for PR 17862 at commit

Test build #112418 has finished for PR 17862 at commit
@hhbyyh This PR is stale. If nobody is interested in this and there are no further updates, would you mind closing it? Thanks!
What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-20602
Currently LinearSVC in Spark only supports OWLQN as the optimizer (see https://issues.apache.org/jira/browse/SPARK-14709). I compared LBFGS and OWLQN on several public datasets and found that LBFGS converges much faster for LinearSVC in most cases.
The following table presents the number of training iterations and the F1 score of both optimizers at convergence.
data source: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
training code: new LinearSVC().setMaxIter(10000).setTol(1e-6) with master=local[1]
LBFGS requires far fewer iterations in most cases and is probably a better default optimizer.
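As a usage illustration of the proposed API (a sketch based on the setSolver calls and constants that appear in the tests quoted above; setLoss is assumed to mirror them for the new loss param, and the exact names and visibility are still subject to review):

```scala
import org.apache.spark.ml.classification.LinearSVC

// Hypothetical usage; `training` is assumed to be a DataFrame of (label, features).
val svc = new LinearSVC()
  .setMaxIter(10000)
  .setTol(1e-6)
  .setSolver(LinearSVC.LBFGS)   // or LinearSVC.OWLQN, as in the tests above
  .setLoss("squared_hinge")     // assumed setter for the "loss" param; or "hinge"
// val model = svc.fit(training)
```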
How was this patch tested?
Strengthened existing unit tests.