
Commit 5c2275d

DB Tsai authored and rxin committed
L-BFGS Documentation
Documentation for L-BFGS, and an example of training binary L2 logistic regression using L-BFGS.

Author: DB Tsai <[email protected]>

Closes apache#702 from dbtsai/dbtsai-lbfgs-doc and squashes the following commits:

0712215 [DB Tsai] Update
38fdfa1 [DB Tsai] Removed extra empty line
5745b64 [DB Tsai] Update again
e9e418e [DB Tsai] Update
7381521 [DB Tsai] L-BFGS Documentation
1 parent a5150d1 commit 5c2275d

File tree

1 file changed: +116 -4 lines changed


docs/mllib-optimization.md

Lines changed: 116 additions & 4 deletions
@@ -28,7 +28,6 @@ title: <a href="mllib-guide.html">MLlib</a> - Optimization
## Mathematical description

### Gradient descent
The simplest method to solve optimization problems of the form `$\min_{\wv \in\R^d} \; f(\wv)$`
is [gradient descent](http://en.wikipedia.org/wiki/Gradient_descent).
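
As a reminder of the basic recipe (the step-size symbol `$\gamma$` below is local notation for this note,
not a parameter name taken from MLlib), gradient descent repeatedly moves the weights a small step against
the gradient of the objective:

`$\wv^{(t+1)} := \wv^{(t)} - \gamma \, \nabla f(\wv^{(t)}).$`
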
Such first-order optimization methods (including gradient descent and stochastic variants
@@ -128,10 +127,19 @@ is sampled, i.e. `$|S|=$ miniBatchFraction $\cdot n = 1$`, then the algorithm is
standard SGD. In that case, the step direction depends on the uniformly random sampling of the
point.
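
In other words, writing the loss part of the objective as an average of per-example terms `$f_i(\wv)$`
(this naming is only shorthand for this note), each step uses the sampled estimate
`$\frac{1}{|S|} \sum_{i \in S} \nabla f_i(\wv)$`, which reduces to the gradient at a single randomly
chosen point when `$|S| = 1$`.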

### Limited-memory BFGS (L-BFGS)
[L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS) is an optimization
algorithm in the family of quasi-Newton methods for solving optimization problems of the form
`$\min_{\wv \in\R^d} \; f(\wv)$`. The L-BFGS method approximates the objective function locally as a
quadratic without evaluating the second partial derivatives of the objective function to construct the
Hessian matrix. Instead, the Hessian matrix is approximated from previous gradient evaluations, so there
is no vertical scalability issue (in the number of training features), unlike in Newton's method, which
computes the Hessian matrix explicitly. As a result, L-BFGS often achieves faster convergence than other
first-order optimization methods.
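
As an illustrative sketch of the idea (the symbols `$\Delta\wv$` and `$\hat{H}$` below are local notation,
not identifiers from MLlib), each iteration works with a local quadratic model of the objective,
`$f(\wv + \Delta\wv) \approx f(\wv) + \nabla f(\wv)^T \Delta\wv + \frac{1}{2} \Delta\wv^T \hat{H} \Delta\wv$`,
where the approximation `$\hat{H}$` is never formed explicitly: only the most recent gradient and step
differences (see `numCorrections` below) are kept, and they are used to apply `$\hat{H}^{-1} \nabla f(\wv)$`
implicitly when choosing the step direction.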

## Implementation in MLlib

### Gradient descent and stochastic gradient descent
Gradient descent methods, including stochastic subgradient descent (SGD), are included as a
low-level primitive in `MLlib`, upon which various ML algorithms are developed; see the
@@ -142,12 +150,12 @@ The SGD method
[GradientDescent.runMiniBatchSGD](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent)
has the following parameters:

* `Gradient` is a class that computes the stochastic gradient of the function
being optimized, i.e., with respect to a single training example, at the
current parameter value. MLlib includes gradient classes for common loss
functions, e.g., hinge, logistic, least-squares. The gradient class takes as
input a training example, its label, and the current parameter value.
* `Updater` is a class that performs the actual gradient descent step, i.e.
updating the weights in each iteration, for a given gradient of the loss part.
The updater is also responsible for performing the update from the regularization
part. MLlib includes updaters for cases without regularization, as well as
@@ -163,3 +171,107 @@ each iteration, to compute the gradient direction.
Available algorithms for gradient descent:

* [GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
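
Below is a minimal sketch of calling this low-level primitive directly. It assumes an existing
`SparkContext` named `sc`, that `runMiniBatchSGD` takes `(data, gradient, updater, stepSize,
numIterations, regParam, miniBatchFraction, initialWeights)` and returns the final weights together
with the loss history, and uses arbitrary hyperparameter values chosen only for illustration; check the
linked API documentation for the exact signature and sensible settings.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{GradientDescent, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

// The low-level optimizer consumes (label, feature vector) pairs.
val data = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
  .map(point => (point.label, point.features))
  .cache()

val numFeatures = data.take(1)(0)._2.size
val initialWeights = Vectors.dense(new Array[Double](numFeatures))

// Illustrative hyperparameters only.
val stepSize = 1.0
val numIterations = 100
val regParam = 0.1
val miniBatchFraction = 1.0

// Minimize the L2-regularized logistic loss with mini-batch SGD.
val (weights, lossHistory) = GradientDescent.runMiniBatchSGD(
  data,
  new LogisticGradient(),
  new SquaredL2Updater(),
  stepSize,
  numIterations,
  regParam,
  miniBatchFraction,
  initialWeights)

println("Final weights: " + weights)
println("Last stochastic loss: " + lossHistory.last)
{% endhighlight %}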

### L-BFGS
L-BFGS is currently only a low-level optimization primitive in `MLlib`. To use L-BFGS in
ML algorithms such as linear regression or logistic regression, you have to pass the gradient of the
objective function and an updater into the optimizer yourself, instead of using training APIs like
[LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD).
See the example below. This will be addressed in the next release.

L1 regularization via
[L1Updater](api/mllib/index.html#org.apache.spark.mllib.optimization.L1Updater) will not work, since the
soft-thresholding logic in L1Updater is designed for gradient descent. See the developer's note below.

The L-BFGS method
[LBFGS.runLBFGS](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS)
has the following parameters:

* `Gradient` is a class that computes the gradient of the objective function
being optimized, i.e., with respect to a single training example, at the
current parameter value. MLlib includes gradient classes for common loss
functions, e.g., hinge, logistic, least-squares. The gradient class takes as
input a training example, its label, and the current parameter value.
* `Updater` is a class that computes the gradient and loss of the regularization
part of the objective function for L-BFGS. MLlib includes updaters for cases without
regularization, as well as the L2 regularizer.
* `numCorrections` is the number of corrections used in the L-BFGS update. 10 is
recommended.
* `maxNumIterations` is the maximal number of iterations that L-BFGS can run.
* `regParam` is the regularization parameter when using regularization.

The return value is a tuple of two elements. The first element is a column matrix
containing the weights for every feature, and the second element is an array containing
the loss computed in every iteration.

Here is an example of training binary logistic regression with L2 regularization using
the L-BFGS optimizer.
{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

val data = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
val numFeatures = data.take(1)(0).features.size

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)

// Append 1 to each training vector so the last weight acts as the intercept.
val training = splits(0).map(x => (x.label, MLUtils.appendBias(x.features))).cache()

val test = splits(1)

// Run the training algorithm to build the model.
val numCorrections = 10
val convergenceTol = 1e-4
val maxNumIterations = 20
val regParam = 0.1
val initialWeightsWithIntercept = Vectors.dense(new Array[Double](numFeatures + 1))

val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  numCorrections,
  convergenceTol,
  maxNumIterations,
  regParam,
  initialWeightsWithIntercept)

// Separate the intercept (last entry) from the feature weights.
val model = new LogisticRegressionModel(
  Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1)),
  weightsWithIntercept(weightsWithIntercept.size - 1))

// Clear the default threshold so that raw scores are returned.
model.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Loss of each step in training process")
loss.foreach(println)
println("Area under ROC = " + auROC)
{% endhighlight %}

#### Developer's note
Since the Hessian is constructed approximately from previous gradient evaluations,
the objective function cannot change during the optimization process.
As a result, stochastic L-BFGS will not work naively by just using mini-batches;
therefore, we don't provide this until we have a better understanding.

* `Updater` is a class originally designed for gradient descent which computes
the actual gradient descent step. However, we are able to obtain the gradient and
loss of the regularization part of the objective function for L-BFGS by ignoring the
logic that applies only to gradient descent, such as the adaptive step size. We will
refactor this into a regularizer that replaces the updater, in order to separate the
regularization logic from the step update.
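
The following is an illustrative sketch of the point above, assuming the `Updater.compute` signature
`(weightsOld, gradient, stepSize, iter, regParam)` returning `(newWeights, regularizationValue)`. It is
not how user code normally interacts with the optimizer; it only shows that the regularization value
and gradient can be recovered from an updater without performing a real descent step.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater

val updater = new SquaredL2Updater()
val regParam = 0.1
val weights = Vectors.dense(0.5, -1.0, 2.0)
val zeroGradient = Vectors.dense(new Array[Double](weights.size))

// With a zero gradient and step size 0, no descent step is taken, so the
// returned loss is just the regularization value at the current weights.
val regValue = updater.compute(weights, zeroGradient, 0.0, 1, regParam)._2

// With a zero gradient and step size 1 on the first iteration, the L2 updater
// moves the weights by exactly one regularization gradient, so the difference
// recovers that gradient.
val shifted = updater.compute(weights, zeroGradient, 1.0, 1, regParam)._1
val regGradient = weights.toArray.zip(shifted.toArray).map { case (w, s) => w - s }

println("L2 regularization value = " + regValue)
println("L2 regularization gradient = " + regGradient.mkString("[", ", ", "]"))
{% endhighlight %}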
