@@ -28,7 +28,6 @@ title: <a href="mllib-guide.html">MLlib</a> - Optimization
## Mathematical description

### Gradient descent
The simplest method to solve optimization problems of the form `$\min_{\wv \in\R^d} \; f(\wv)$`
is [gradient descent](http://en.wikipedia.org/wiki/Gradient_descent).
Such first-order optimization methods (including gradient descent and stochastic variants
@@ -128,10 +127,19 @@ is sampled, i.e. `$|S|=$ miniBatchFraction $\cdot n = 1$`, then the algorithm is
standard SGD. In that case, the step direction depends on the uniformly random sampling of the
point.

### Limited-memory BFGS (L-BFGS)
[L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS) is an optimization
algorithm in the family of quasi-Newton methods that solves optimization problems of the form
`$\min_{\wv \in\R^d} \; f(\wv)$`. The L-BFGS method approximates the objective function locally as a
quadratic without evaluating the second partial derivatives of the objective function to construct the
Hessian matrix. Instead, the Hessian matrix is approximated from previous gradient evaluations, so there is
no vertical scalability issue (with respect to the number of training features), unlike computing the
Hessian matrix explicitly as in Newton's method. As a result, L-BFGS often achieves more rapid convergence
than other first-order optimization methods.
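
To sketch the idea in standard quasi-Newton notation (background only, not MLlib-specific API): at
iteration `$k$`, the curvature of the objective is summarized by the pair
`\[
s_k := \wv_{k+1} - \wv_k, \qquad y_k := \nabla f(\wv_{k+1}) - \nabla f(\wv_k),
\]`
and the inverse-Hessian approximation `$H_{k+1}$` is chosen to satisfy the secant condition
`$H_{k+1} y_k = s_k$`. L-BFGS stores only the most recent few `$(s_k, y_k)$` pairs (see the
`numCorrections` parameter below) instead of a dense `$d \times d$` matrix.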

## Implementation in MLlib

### Gradient descent and stochastic gradient descent
Gradient descent methods, including stochastic subgradient descent (SGD), are
included as a low-level primitive in `MLlib`, upon which various ML algorithms
are developed; see the
@@ -142,12 +150,12 @@ The SGD method
[GradientDescent.runMiniBatchSGD](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent)
has the following parameters:

* `Gradient` is a class that computes the stochastic gradient of the function
being optimized, i.e., with respect to a single training example, at the
current parameter value. MLlib includes gradient classes for common loss
functions, e.g., hinge, logistic, least-squares. The gradient class takes as
input a training example, its label, and the current parameter value.
* `Updater` is a class that performs the actual gradient descent step, i.e.
updating the weights in each iteration, for a given gradient of the loss part.
The updater is also responsible for performing the update from the regularization
part. MLlib includes updaters for cases without regularization, as well as
@@ -163,3 +171,107 @@ each iteration, to compute the gradient direction.
Available algorithms for gradient descent:

* [GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
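
As a usage sketch of this low-level primitive, the following minimizes a binary logistic regression
objective with L2 regularization. The parameter values are illustrative only, `sc` is assumed to be an
existing `SparkContext` (as in the L-BFGS example below), and the call follows the `runMiniBatchSGD`
parameters documented above:

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{GradientDescent, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

// Convert the LabeledPoint data into the (label, features) pairs
// expected by the low-level optimizer.
val data = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
val training = data.map(p => (p.label, p.features)).cache()
val numFeatures = data.take(1)(0).features.size

// Run mini-batch SGD with a logistic loss and an L2 regularizer.
// The numeric values below are illustrative, not tuned.
val (weights, lossHistory) = GradientDescent.runMiniBatchSGD(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  1.0,   // stepSize
  100,   // numIterations
  0.1,   // regParam
  1.0,   // miniBatchFraction (1.0 means each step uses the full data set)
  Vectors.dense(new Array[Double](numFeatures)))  // initial weights (all zeros)
{% endhighlight %}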

### L-BFGS
L-BFGS is currently only a low-level optimization primitive in `MLlib`. If you want to use L-BFGS in various
ML algorithms such as Linear Regression or Logistic Regression, you have to pass the gradient of the objective
function and the updater into the optimizer yourself, instead of using training APIs like
[LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD).
See the example below. This will be addressed in the next release.

L1 regularization via
[L1Updater](api/mllib/index.html#org.apache.spark.mllib.optimization.L1Updater) will not work, since the
soft-thresholding logic in L1Updater is designed for gradient descent. See the developer's note below.
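
To see why (a brief sketch; here `$\gamma$` denotes the per-iteration step size and `$\lambda$` the
regularization parameter): after each gradient step, the soft-thresholding logic applies the operator
`\[
w_i \mapsto \mathrm{sign}(w_i) \, \max\{0, \, |w_i| - \gamma \lambda\},
\]`
whose shrinkage amount is tied to the gradient-descent step size, rather than being expressed as a
gradient and loss of the regularization part that L-BFGS could consume.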

The L-BFGS method
[LBFGS.runLBFGS](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS)
has the following parameters:

* `Gradient` is a class that computes the gradient of the objective function
being optimized, i.e., with respect to a single training example, at the
current parameter value. MLlib includes gradient classes for common loss
functions, e.g., hinge, logistic, least-squares. The gradient class takes as
input a training example, its label, and the current parameter value.
* `Updater` is a class that computes the gradient and loss of the regularization
part of the objective function for L-BFGS. MLlib includes updaters for cases without
regularization, as well as the L2 regularizer.
* `numCorrections` is the number of corrections used in the L-BFGS update. 10 is
recommended.
* `maxNumIterations` is the maximal number of iterations that L-BFGS can run.
* `convergenceTol` controls how much relative change is still allowed when L-BFGS
is considered to have converged.
* `regParam` is the regularization parameter when using regularization.

The return value is a tuple containing two elements. The first element is a column matrix
containing weights for every feature, and the second element is an array containing
the loss computed for every iteration.

Here is an example that trains binary logistic regression with L2 regularization using
the L-BFGS optimizer.
{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
val numFeatures = data.take(1)(0).features.size

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)

// Append 1 into the training data as intercept.
val training = splits(0).map(x => (x.label, MLUtils.appendBias(x.features))).cache()

val test = splits(1)

// Run the training algorithm to build the model.
val numCorrections = 10
val convergenceTol = 1e-4
val maxNumIterations = 20
val regParam = 0.1
val initialWeightsWithIntercept = Vectors.dense(new Array[Double](numFeatures + 1))

val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  numCorrections,
  convergenceTol,
  maxNumIterations,
  regParam,
  initialWeightsWithIntercept)

val model = new LogisticRegressionModel(
  Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1)),
  weightsWithIntercept(weightsWithIntercept.size - 1))

// Clear the default threshold so that predict returns raw scores.
model.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Loss of each step in training process")
loss.foreach(println)
println("Area under ROC = " + auROC)
{% endhighlight %}

#### Developer's note
Since the Hessian is constructed approximately from previous gradient evaluations,
the objective function cannot be changed during the optimization process.
As a result, Stochastic L-BFGS will not work naively by just using miniBatch;
therefore, we don't provide this until we have a better understanding of it.

* `Updater` is a class originally designed for gradient descent which computes
the actual gradient descent step. However, we are able to obtain the gradient and
loss of the regularization part of the objective function for L-BFGS by ignoring the
logic that applies only to gradient descent, such as the adaptive step size. We will
later refactor this into a regularizer that replaces the updater, separating the
regularization logic from the step update.