
Conversation

@hhbyyh
Contributor

@hhbyyh hhbyyh commented May 4, 2017

What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-20602

Currently LinearSVC in Spark only supports OWLQN as the optimizer (see https://issues.apache.org/jira/browse/SPARK-14709). I compared LBFGS and OWLQN on several public datasets and found that LBFGS converges much faster for LinearSVC in most cases.
The following table presents the number of training iterations and the F1 score of both optimizers at convergence:

(image: table of per-dataset training iterations and F1 scores for LBFGS vs. OWLQN)

data source: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
training code: new LinearSVC().setMaxIter(10000).setTol(1e-6) with master=local[1]
LBFGS requires far fewer iterations in most cases and is probably a better default optimizer.
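
For reproducibility, a minimal sketch of the benchmark setup above (the dataset path is a placeholder for the LIBSVM files linked above; not part of the patch):

import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.sql.SparkSession

// Benchmark setup as described: single local core, high maxIter, tight tolerance.
val spark = SparkSession.builder().master("local[1]").appName("linearsvc-bench").getOrCreate()
val data = spark.read.format("libsvm").load("path/to/binary/dataset")
val model = new LinearSVC().setMaxIter(10000).setTol(1e-6).fit(data)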

How was this patch tested?

strengthen existing unit tests

@hhbyyh
Contributor Author

hhbyyh commented May 4, 2017

ping @jkbradley Sorry, I know this is pretty much the last minute for 2.2, but the change may be important for the user experience. If we're not comfortable making an API change right now, how about we just change the default optimizer to LBFGS before 2.2 releases?

I'd also appreciate comments from other reviewers.

@SparkQA

SparkQA commented May 4, 2017

Test build #76463 has finished for PR 17862 at commit d46e5ed.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @group setParam
*/
@Since("2.2.0")
def setOptimizer(value: String): this.type = set(optimizer, value.toLowerCase(Locale.ROOT))
Contributor

Maybe just call it solver, since LinearRegression already has something similar.

Contributor

@MLnick MLnick May 5, 2017

👍 - can use HasSolver trait

Contributor

For other algorithms such as LinearRegression and LogisticRegression, the solver is l-bfgs whether training uses LBFGS (L2 reg) or OWLQN (L1/elasticNet reg), so I'm concerned it may confuse users if we distinguish them in LinearSVC. Or we should clarify it in the documentation. Thanks.


def regParamL1Fun = (index: Int) => 0D
val optimizer = new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
val optimizerAlgo = $(optimizer) match {
Contributor

Could you refresh me on why OWLQN was used originally? AFAIK OWLQN is for L1 regularized loss functions, while L-BFGS only supports L2?

Contributor

@yanboliang yanboliang May 5, 2017

@MLnick Hinge loss is a non-smooth, non-differentiable loss, so the convex optimization problem is a non-smooth one; that is why we use OWL-QN rather than L-BFGS as the underlying optimizer.
@hhbyyh AFAIK, in cases where L-BFGS does not encounter any non-smooth point it often converges to the optimum faster, but there is a chance of failure at a non-smooth point.
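
For reference, the kink that makes the hinge loss non-smooth, written on the margin z = y * (w^T x + b) (standard definitions, added for context):

hinge(z) = max(0, 1 - z)
d/dz hinge(z) = -1 for z < 1; 0 for z > 1; undefined at z = 1

L-BFGS assumes a differentiable objective, so z = 1 is exactly the point where it can fail.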

Contributor Author

Thanks for providing the info @yanboliang. I'm thinking we should provide both LBFGS and OWLQN as options for L1-SVM (standard hinge). On the several datasets I tested, LBFGS converges as well as (or even better than) OWLQN in a fraction of the time. Yet due to the potential failure, we may not want LBFGS as the default optimizer (solver).

A better solution, as I see it, is to use L2-SVM (squared hinge loss) with LBFGS (combining this with #17645), but the change may be too large to be included in 2.2 now.
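
A minimal sketch of what offering both optimizers behind a single param could look like, reusing the Breeze constructors from the diff above (the method and the string values are assumptions, not the final patch):

import breeze.linalg.{DenseVector => BDV}
import breeze.optimize.{DiffFunction, FirstOrderMinimizer, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN}

// Select the Breeze optimizer from the solver param; both share the
// FirstOrderMinimizer interface, so the calling code stays unchanged.
def chooseOptimizer(solver: String, maxIter: Int, tol: Double)
    : FirstOrderMinimizer[BDV[Double], DiffFunction[BDV[Double]]] = solver match {
  case "l-bfgs" =>
    new BreezeLBFGS[BDV[Double]](maxIter, 10, tol) // memory depth 10, as in the diff
  case "owlqn" =>
    val regParamL1Fun = (index: Int) => 0D // zero L1 weight per coefficient, as in the diff
    new BreezeOWLQN[Int, BDV[Double]](maxIter, 10, regParamL1Fun, tol)
}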

Contributor

@yanboliang yanboliang May 8, 2017

I saw that other libraries like sklearn.svm.LinearSVC use squared hinge loss as the default loss function; it's differentiable. What about merging #17645 first and then using LBFGS as the default solver as well? Otherwise we would introduce a behavior change if we made this switch after the 2.2 release. cc @jkbradley

Contributor Author

Sounds good to me. But we'll also need to update the Python and R APIs. I don't know the release date of 2.2.

Contributor

Let's wait for other comments; if we reach a consensus, I'd like to help review and merge this to catch 2.2. cc @jkbradley @MLnick

Contributor

Personally I would be in favor of making squared hinge the default with L-BFGS. But we would need to make it clear to users that if they select the pure hinge loss, there is a risk involved in using L-BFGS rather than OWL-QN.

@hhbyyh
Contributor Author

hhbyyh commented May 6, 2017

Update: switched to the HasSolver trait, keeping OWLQN as the default optimizer.
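
For illustration, the user-facing call after this update would look roughly like this (the lowercase string values are assumed from the diff's toLowerCase convention, not a confirmed API):

import org.apache.spark.ml.classification.LinearSVC

val svc = new LinearSVC()
  .setSolver("owlqn") // the default kept here; "l-bfgs" opts into the new optimizer
  .setMaxIter(100)
  .setTol(1e-6)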

@SparkQA

SparkQA commented May 6, 2017

Test build #76508 has finished for PR 17862 at commit f7d5559.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@debasish83

@hhbyyh can we smooth the hinge loss using a soft-max (a variant of ReLU) and then use LBFGS?

@hhbyyh
Contributor Author

hhbyyh commented May 8, 2017

@debasish83 There are several approaches to smoothing the hinge loss: https://en.wikipedia.org/wiki/Hinge_loss. For the one you're proposing, do you know whether it's used in other SVM implementations? Please provide references or evaluation data if possible. Thanks.

@SparkQA

SparkQA commented May 9, 2017

Test build #76637 has started for PR 17862 at commit 8a7c10f.

val trainer1 = new LinearSVC()
test("linearSVC OWLQN comparison with R e1071 and scikit-learn") {
val trainer1 = new LinearSVC().setSolver(LinearSVC.OWLQN)
.setRegParam(0.00002) // set regParam = 2.0 / datasize / c
Contributor

Can I ask why we set regParam = 2.0 / datasize / c to match the solution of sklearn.svm.LinearSVC? AFAIK sklearn calls liblinear to train linear SVM classification models. I can understand that liblinear uses C as the penalty parameter of the error term, which is different from regParam, and that it doesn't multiply the error term by 1/n, but where does the 2.0 come from? Thanks.
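
For context, a sketch of the usual correspondence between the two objectives (standard formulations; this alone does not pin down the factor of 2):

liblinear: min_w 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * w^T x_i)
Spark ML:  min_w regParam * 0.5 * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * w^T x_i)

Dividing the liblinear objective by n * C leaves the minimizer unchanged and gives regParam = 1 / (n * C); any remaining constant factor depends on how the regularizer is normalized.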


These slides also explain it... please see slide 32... the max can be replaced by a soft-max whose softness parameter lambda can be tuned... log-sum-exp is a standard soft-max that can be used, which is similar to ReLU functions, and we can reuse it from MLP:
ftp://ftp.cs.wisc.edu/math-prog/talks/informs99ssv.ps
ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.pdf
I can add the formulation if there is interest... it needs some tuning of the soft-max parameter, but convergence will be good with LBFGS (OWLQN is not needed)
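
For context, the standard log-sum-exp (softplus) smoothing referenced here, with smoothing parameter lambda > 0:

max(0, x) ≈ lambda * log(1 + exp(x / lambda))

As lambda → 0 the right-hand side converges to max(0, x), so applying it to x = 1 - y * w^T x gives a differentiable surrogate of the hinge loss that a smooth quasi-Newton solver such as L-BFGS can minimize.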

Contributor Author

@debasish83 I'm not sure it's a better solution than squared hinge loss, but I would be interested to learn about the performance (accuracy and speed). Can you please help evaluate it?

@debasish83 debasish83 May 10, 2017

Hinge loss is not differentiable... how are you smoothing it before you can use a quasi-Newton solver? Since the papers smooth the max, a Newton/quasi-Newton solver should work well... if you are keeping the non-differentiable loss, it would be better to use a sub-gradient solver as suggested by the talk... I will evaluate the formulation...


@hhbyyh I saw some posts saying that hinge loss is not differentiable but that squared hinge loss is for practical purposes... can you please point to a reference on squared hinge loss?
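
For reference, the squared hinge loss and its derivative on the margin z = y * (w^T x + b) (standard definitions):

squared_hinge(z) = max(0, 1 - z)^2
d/dz squared_hinge(z) = -2 * max(0, 1 - z)

The derivative is continuous everywhere, including at z = 1 where both one-sided limits are 0, so a smooth solver such as L-BFGS is safe for the squared hinge even though it is risky for the plain hinge.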


test("linearSVC L-BFGS comparison with R e1071 and scikit-learn") {
val trainer1 = new LinearSVC().setSolver(LinearSVC.LBFGS)
.setRegParam(0.00003)
Contributor

Why do we switch to a different regParam compared with the above test? Does it mean the converged solution depends on the optimizer?

Contributor Author

@hhbyyh hhbyyh May 10, 2017

Indeed. I could use your help here since I cannot find a theoretical proof for this case.

LR may have the same behavior: if I set the optimizer of LR to OWLQN, all the L2 regularization test cases fail.

@SparkQA

SparkQA commented Jun 13, 2017

Test build #78002 has finished for PR 17862 at commit 3707580.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 14, 2017

Test build #78019 has finished for PR 17862 at commit 2ca5a74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Contributor Author

hhbyyh commented Jun 14, 2017

Merged the changes from #17645 into this PR as a single change.

@hhbyyh hhbyyh changed the title [SPARK-20602] [ML]Adding LBFGS as optimizer for LinearSVC [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hinge loss for LinearSVC Jun 14, 2017
@sethah
Contributor

sethah commented Jun 14, 2017

@hhbyyh Thanks for doing the extra work to use the new aggregator here. I do think it's better to separate those changes from this one, though. There is actually more that needs to be done for the conversion (need to use RDDLossFunction and also add a test suite for the aggregator). Would you mind submitting a PR for just the conversion changes?

@SparkQA

SparkQA commented Jun 14, 2017

Test build #78024 has finished for PR 17862 at commit 5f7f456.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Contributor Author

hhbyyh commented Jun 14, 2017

Sure. That's reasonable. I'll move the hingeAggregator to a new PR.

@SparkQA

SparkQA commented Jun 15, 2017

Test build #78119 has finished for PR 17862 at commit 0297057.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Catching up here... To make sure I caught the decisions made in the discussion above, is it correct that this PR will:

  • Add support for squared hinge loss, and use that as the default (which I fully support)
  • Switch from OWLQN to LBFGS (which is fine with me if reasonably large-scale tests support that choice)
    • On this note, has anyone tested on large-scale datasets (ideally with millions of rows and columns)?

If those are the decisions, can you please update the PR description and JIRA as such when you update the PR @hhbyyh ? Thank you!

@yanboliang
Contributor

+1 to @jkbradley on testing with large-scale datasets. @hhbyyh Do you have time to test it? If not, I can help. Thanks.

@WeichenXu123
Contributor

+1 for adding tests on large-scale datasets.
Another thing I'd like to know: please compare the final loss value at the resulting coefficients between LIBLINEAR (scikit-learn), LBFGS, and OWLQN, so that we can see which optimizer reaches the more optimal result (testing and comparing on different kinds of datasets).

@hhbyyh
Contributor Author

hhbyyh commented Aug 30, 2017

Sure, I can find some larger datasets to test with.

But I expect, as shown in the PR description, that LBFGS will generally outperform OWLQN, though not in all cases. A single large-scale dataset is not sufficient for this testing, so if you plan to test on some dataset, please leave a comment here. Thank you.

@SparkQA

SparkQA commented Sep 4, 2017

Test build #81367 has finished for PR 17862 at commit cec628b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Contributor Author

hhbyyh commented Sep 12, 2017

Tested several larger datasets with the hinge loss function to compare the l-bfgs and owlqn solvers.
Each run continued until convergence or until maxIter (2000) was exceeded.

dataset      | numRecords | numFeatures | l-bfgs iterations | owlqn iterations | l-bfgs final loss | owlqn final loss
url_combined | 2396130    | 3231961     | 317 (952 sec)     | 287 (1661 sec)   | 9.71E-5           | 1.64E-4
kdda         | 8407752    | 20216830    | 2000+ (29729 sec) | 288 (13664 sec)  | 0.0068            | 0.0135
webspam      | 350000     | 254         | 344 (67 sec)      | 1502 (714 sec)   | 0.18273           | 0.18273
SUSY         | 5000000    | 18          | 152 (145 sec)     | 1242 (3357 sec)  | 0.499             | 0.499

l-bfgs does not always take fewer iterations, but it converges to a final loss that is as small or smaller.
Each owlqn iteration also takes longer (2 to 3 times) than an l-bfgs iteration; Logistic Regression exhibits similar behavior.

@WeichenXu123
Contributor

@hhbyyh The test results look good!
OWLQN takes longer per iteration because the line search in each iteration makes more passes over the dataset.

@SparkQA

SparkQA commented Sep 16, 2017

Test build #81836 has finished for PR 17862 at commit 0f5cad5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 2, 2017

Test build #82376 has finished for PR 17862 at commit bf4d955.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 2, 2017

Test build #82377 has finished for PR 17862 at commit 1f8e984.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2017

Test build #82766 has finished for PR 17862 at commit 0bb5afe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Contributor Author

hhbyyh commented Oct 17, 2017

Please let me know if there are any unresolved comments. Thanks.

private[classification] val HINGE: String = "hinge".toLowerCase(Locale.ROOT)

/** String name for Squared Hinge Loss. */
private[classification] val SQUARED_HINGE: String = "squared_hinge".toLowerCase(Locale.ROOT)
Contributor

Why do we need .toLowerCase(Locale.ROOT) here?

Contributor Author

@hhbyyh hhbyyh Oct 23, 2017

To ensure consistency with param validation across all Locales.

Contributor

Ditto. IMO these characters are all ASCII, so I don't think they will hit locale issues. (But did you encounter such an issue in some environment?)

Contributor Author

Personally I never have, but I cannot guarantee it for all Locales.
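
For what it's worth, the classic counterexample is the Turkish locale, where the ASCII letter I lowercases to a dotless ı (a small sketch, runnable in a Scala REPL):

import java.util.Locale

Locale.setDefault(new Locale("tr", "TR"))
"HINGE".toLowerCase               // "hınge" -- dotless ı under the default (Turkish) locale
"HINGE".toLowerCase(Locale.ROOT)  // "hinge" -- stable regardless of the default locale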

@Since("2.3.0")
final val loss: Param[String] = new Param(this, "loss", "Specifies the loss " +
"function. hinge is the standard SVM loss while squared_hinge is the square of the hinge loss.",
(s: String) => LinearSVC.supportedLoss.contains(s.toLowerCase(Locale.ROOT)))
Contributor

For the isValid function you can use ParamValidators.inArray[String](supportedLosses).

Contributor Author

Correct me if I'm wrong, but IMO we need toLowerCase here.

Contributor

Yeah, I thought about this, but the solver param in LinearRegression also ignores it. I tend to keep them consistent; what do you think?

Contributor Author

I tend to support case-insensitive params in LinearRegression, or to change the default behavior of ParamValidators.inArray. We should improve consistency in supporting case-insensitive String params anyway.
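
One hypothetical shape for such a validator (the name is illustrative, not an existing Spark API):

// Case-insensitive variant of ParamValidators.inArray for String params.
def inArrayIgnoreCase(allowed: Array[String]): String => Boolean =
  (value: String) => allowed.exists(_.equalsIgnoreCase(value))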

Contributor Author

Created a jira to address that issue: https://issues.apache.org/jira/browse/SPARK-22331

private[classification] val SQUARED_HINGE: String = "squared_hinge".toLowerCase(Locale.ROOT)

/* Set of loss functions that LinearSVC supports */
private[classification] val supportedLoss = Array(HINGE, SQUARED_HINGE)
Contributor

supportedLoss ==> supportedLosses

Contributor Author

Sure. I can update it.

@hhbyyh
Contributor Author

hhbyyh commented Oct 23, 2017

Thanks @WeichenXu123 for the comments. I'll update supportedLoss ==> supportedLosses along with other possible changes.

@SparkQA

SparkQA commented May 17, 2019

Test build #105501 has finished for PR 17862 at commit 0bb5afe.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2019

Test build #112418 has finished for PR 17862 at commit 64bc339.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor

@hhbyyh This PR is stale. If nobody is interested in it and there are no further updates, would you mind closing it? Thanks!

@hhbyyh hhbyyh closed this Oct 26, 2019
