
Commit c39aa6d

update user guide

1 parent c24076d commit c39aa6d

File tree: 1 file changed (+12 −7 lines)

docs/ml-advanced.md

Lines changed: 12 additions & 7 deletions
@@ -59,17 +59,22 @@ Given $n$ weighted observations $(w_i, a_i, b_i)$:

 The number of features for each observation is $m$. We use the following weighted least squares formulation:
 `\[
-minimize_{x}\frac{1}{2} \sum_{i=1}^n \frac{w_i(a_i^T x -b_i)^2}{\sum_{k=1}^n w_k} + \frac{1}{2}\frac{\lambda}{\delta}\sum_{j=1}^m(\sigma_{j} x_{j})^2
+\min_{\mathbf{x}}\frac{1}{2} \sum_{i=1}^n \frac{w_i(\mathbf{a}_i^T \mathbf{x} - b_i)^2}{\sum_{k=1}^n w_k} + \frac{1}{2}\frac{\lambda}{\delta}\sum_{j=1}^m(\sigma_{j} x_{j})^2
 \]`
 where $\lambda$ is the regularization parameter, $\delta$ is the population standard deviation of the label
 and $\sigma_j$ is the population standard deviation of the j-th feature column.

-This objective function has an analytic solution and it requires only one pass over the data to collect necessary statistics to solve.
-Unlike the original dataset which can only be stored in a distributed system,
-these statistics can be loaded into memory on a single machine if the number of features is relatively small, and then we can solve the objective function through Cholesky factorization on the driver.
+This objective function has an analytic solution and requires only one pass over the data to collect the statistics necessary to solve it. For an
+$n \times m$ data matrix, these statistics require only $O(m^2)$ storage and so can be stored on a single machine when $m$ (the number of features) is
+relatively small. We can then solve the normal equations on a single machine using local methods like direct Cholesky factorization or iterative optimization programs.

-WeightedLeastSquares only supports L2 regularization and provides options to enable or disable regularization and standardization.
-In order to make the normal equation approach efficient, WeightedLeastSquares requires that the number of features be no more than 4096. For larger problems, use L-BFGS instead.
+Spark ML currently supports two types of solvers for the normal equations: Cholesky factorization and Quasi-Newton methods (L-BFGS/OWL-QN). Cholesky factorization
+depends on a positive definite covariance matrix (i.e. the columns of the data matrix must be linearly independent) and will fail if this condition is violated. Quasi-Newton methods
+are still capable of providing a reasonable solution even when the covariance matrix is not positive definite, so the normal equation solver can also fall back to
+Quasi-Newton methods in this case. This fallback is currently always enabled for the `LinearRegression` estimator.
+
+`WeightedLeastSquares` supports L1, L2, and elastic-net regularization and provides options to enable or disable regularization and standardization.
+In order to make the normal equation approach efficient, `WeightedLeastSquares` requires that the number of features be no more than 4096. For larger problems, use L-BFGS instead.

 ## Iteratively reweighted least squares (IRLS)

@@ -83,6 +88,6 @@ It solves certain optimization problems iteratively through the following procedure:
 * solve a weighted least squares (WLS) problem by WeightedLeastSquares.
 * repeat above steps until convergence.

-Since it involves solving a weighted least squares (WLS) problem by WeightedLeastSquares in each iteration,
+Since it involves solving a weighted least squares (WLS) problem by `WeightedLeastSquares` in each iteration,
 it also requires the number of features to be no more than 4096.
 Currently IRLS is used as the default solver of [GeneralizedLinearRegression](api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression).
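The IRLS procedure above is essentially a small driver loop around the WLS solver. The sketch below is hypothetical, not Spark's implementation: the caller supplies a model-specific `reweight` function that derives a working weight and working label from the current coefficients (for a GLM this encodes the link function), and `solveWls` is the sketch shown earlier.

```scala
import breeze.linalg.{DenseVector, norm}

// Hypothetical IRLS driver; not Spark's code. Each iteration re-derives
// working (weight, label) pairs from the current coefficients and solves
// a WLS subproblem, so the same <= 4096 feature limit applies.
def irls(
    data: Seq[(Double, DenseVector[Double], Double)],
    reweight: (DenseVector[Double], (Double, DenseVector[Double], Double)) => (Double, Double),
    maxIter: Int = 25,
    tol: Double = 1e-8): DenseVector[Double] = {
  var x = DenseVector.zeros[Double](data.head._2.length)
  var iter = 0
  var done = false
  while (!done && iter < maxIter) {
    // Step 1: linearize the model at the current solution.
    val working = data.map { obs =>
      val (w, b) = reweight(x, obs)
      (w, obs._2, b)
    }
    // Step 2: solve the weighted least squares subproblem.
    val xNew = solveWls(working)
    // Step 3: stop when the coefficients no longer move.
    done = norm(xNew - x) <= tol * math.max(norm(x), 1.0)
    x = xNew
    iter += 1
  }
  x
}
```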
