Commit b1b2030

rezazadeh authored and mengxr committed
[MLlib][SPARK-2997] Update SVD documentation to reflect roughly square
Update the documentation to reflect the fact we can handle roughly square matrices.

Author: Reza Zadeh <[email protected]>

Closes apache#2070 from rezazadeh/svddocs and squashes the following commits:

826b8fe [Reza Zadeh] left singular vectors
3f34fc6 [Reza Zadeh] PCA is still TS
7ffa2aa [Reza Zadeh] better title
aeaf39d [Reza Zadeh] More docs
788ed13 [Reza Zadeh] add computational cost explanation
6429c59 [Reza Zadeh] Add link to rowmatrix docs
1eeab8b [Reza Zadeh] Update SVD documentation to reflect roughly square

1 parent 572952a commit b1b2030

1 file changed, with 23 additions and 6 deletions.

docs/mllib-dimensionality-reduction.md

Lines changed: 23 additions & 6 deletions
@@ -11,7 +11,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Dimensionality Reduction
 of reducing the number of variables under consideration.
 It can be used to extract latent features from raw and noisy features
 or compress data while maintaining the structure.
-MLlib provides support for dimensionality reduction on tall-and-skinny matrices.
+MLlib provides support for dimensionality reduction on the <a href="mllib-basics.html#rowmatrix">RowMatrix</a> class.
 
 ## Singular value decomposition (SVD)
 
@@ -39,8 +39,26 @@ If we keep the top $k$ singular values, then the dimensions of the resulting low
 * `$\Sigma$`: `$k \times k$`,
 * `$V$`: `$n \times k$`.
 
-MLlib provides SVD functionality to row-oriented matrices that have only a few columns,
-say, less than $1000$, but many rows, i.e., *tall-and-skinny* matrices.
+### Performance
+We assume $n$ is smaller than $m$. The singular values and the right singular vectors are derived
+from the eigenvalues and the eigenvectors of the Gramian matrix $A^T A$. The matrix
+storing the left singular vectors $U$, is computed via matrix multiplication as
+$U = A (V S^{-1})$, if requested by the user via the computeU parameter.
+The actual method to use is determined automatically based on the computational cost:
+
+* If $n$ is small ($n < 100$) or $k$ is large compared with $n$ ($k > n / 2$), we compute the Gramian matrix
+first and then compute its top eigenvalues and eigenvectors locally on the driver.
+This requires a single pass with $O(n^2)$ storage on each executor and on the driver, and
+$O(n^2 k)$ time on the driver.
+* Otherwise, we compute $(A^T A) v$ in a distributive way and send it to
+<a href="http://www.caam.rice.edu/software/ARPACK/">ARPACK</a> to
+compute $(A^T A)$'s top eigenvalues and eigenvectors on the driver node. This requires $O(k)$
+passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.
+
+### SVD Example
+
+MLlib provides SVD functionality to row-oriented matrices, provided in the
+<a href="mllib-basics.html#rowmatrix">RowMatrix</a> class.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -124,9 +142,8 @@ MLlib supports PCA for tall-and-skinny matrices stored in row-oriented format.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-The following code demonstrates how to compute principal components on a tall-and-skinny `RowMatrix`
+The following code demonstrates how to compute principal components on a `RowMatrix`
 and use them to project the vectors into a low-dimensional space.
-The number of columns should be small, e.g, less than 1000.
 
 {% highlight scala %}
 import org.apache.spark.mllib.linalg.Matrix
@@ -144,7 +161,7 @@ val projected: RowMatrix = mat.multiply(pc)
 
 <div data-lang="java" markdown="1">
 
-The following code demonstrates how to compute principal components on a tall-and-skinny `RowMatrix`
+The following code demonstrates how to compute principal components on a `RowMatrix`
 and use them to project the vectors into a low-dimensional space.
 The number of columns should be small, e.g, less than 1000.
 
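
The Performance text added in this commit can be illustrated with a small, self-contained Java sketch. Since $A = U \Sigma V^T$ implies $A^T A = V \Sigma^2 V^T$, the eigenvalues of the Gramian are the squared singular values and its eigenvectors are the right singular vectors; $U$ then follows from $U = A (V S^{-1})$. This is not MLlib's implementation: the class `GramianSvdSketch`, its methods, and the mode labels `"local"` and `"distributed"` are hypothetical names, and plain power iteration with deflation stands in for both the local eigensolver and ARPACK. The sketch assumes a small dense matrix with full column rank, distinct singular values, and $k \le n$.

```java
import java.util.Arrays;

/** Illustrative sketch only; names here are hypothetical and not part of MLlib. */
final class GramianSvdSketch {

    /** Top-k factors: u is m x k, s holds k singular values, v is n x k. */
    static final class Svd {
        final double[][] u;
        final double[] s;
        final double[][] v;
        Svd(double[][] u, double[] s, double[][] v) { this.u = u; this.s = s; this.v = v; }
    }

    /** Selection rule quoted in the docs: n < 100 or k > n / 2 picks the local path. */
    static String chooseMode(int n, int k) {
        return (n < 100 || k > n / 2) ? "local" : "distributed";
    }

    /** Gramian G = A^T A, accumulated in a single pass over the rows of A. */
    static double[][] gramian(double[][] a) {
        int n = a[0].length;
        double[][] g = new double[n][n];
        for (double[] row : a) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    g[i][j] += row[i] * row[j];
                }
            }
        }
        return g;
    }

    /** Top-k SVD via eigenpairs of the Gramian (power iteration with deflation). */
    static Svd svd(double[][] a, int k) {
        int m = a.length, n = a[0].length;
        double[][] g = gramian(a);
        double[] s = new double[k];
        double[][] v = new double[n][k];
        double[][] u = new double[m][k];
        for (int c = 0; c < k; c++) {
            double[] x = new double[n];
            for (int i = 0; i < n; i++) x[i] = 1.0 / (i + 1 + c); // fixed start vector
            double lambda = 0.0;
            for (int it = 0; it < 1000; it++) {
                double[] y = new double[n];
                for (int i = 0; i < n; i++) {
                    for (int j = 0; j < n; j++) y[i] += g[i][j] * x[j];
                }
                double norm = 0.0;
                for (double t : y) norm += t * t;
                norm = Math.sqrt(norm);
                for (int i = 0; i < n; i++) x[i] = y[i] / norm;
                lambda = norm; // converges to the dominant eigenvalue of g
            }
            s[c] = Math.sqrt(lambda); // singular value = sqrt(eigenvalue of A^T A)
            for (int i = 0; i < n; i++) v[i][c] = x[i];
            // Deflate so the next round converges to the next eigenpair.
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) g[i][j] -= lambda * x[i] * x[j];
            }
            // Column c of U = A (V S^{-1}).
            for (int r = 0; r < m; r++) {
                double dot = 0.0;
                for (int j = 0; j < n; j++) dot += a[r][j] * x[j];
                u[r][c] = dot / s[c];
            }
        }
        return new Svd(u, s, v);
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}, {5, 6}};
        Svd r = svd(a, 2);
        System.out.println("mode: " + chooseMode(a[0].length, 2));
        System.out.println("singular values: " + Arrays.toString(r.s));
    }
}
```

Forming the Gramian explicitly needs $O(n^2)$ memory, which is why the documented rule reserves that path for small $n$ or large $k$; the other path only ever needs products of the form $(A^T A) v$.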
