---
layout: global
- title: <a href="mllib-guide.html">MLlib</a> - Basics
+ title: Basics - MLlib
+ displayTitle: <a href="mllib-guide.html">MLlib</a> - Basics
---

* Table of contents
@@ -26,11 +27,11 @@ of the vector.
<div data-lang="scala" markdown="1">

The base class of local vectors is
- [`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
- implementations: [`DenseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseVector) and
- [`SparseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
+ [`Vector`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
+ implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseVector) and
+ [`SparseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
using the factory methods implemented in
- [`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.
+ [`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
@@ -53,11 +54,11 @@ Scala imports `scala.collection.immutable.Vector` by default, so you have to imp
<div data-lang="java" markdown="1">

The base class of local vectors is
- [`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
- implementations: [`DenseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseVector) and
- [`SparseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
+ [`Vector`](api/java/org/apache/spark/mllib/linalg/Vector.html), and we provide two
+ implementations: [`DenseVector`](api/java/org/apache/spark/mllib/linalg/DenseVector.html) and
+ [`SparseVector`](api/java/org/apache/spark/mllib/linalg/SparseVector.html). We recommend
using the factory methods implemented in
- [`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.
+ [`Vectors`](api/java/org/apache/spark/mllib/linalg/Vector.html) to create local vectors.

{% highlight java %}
import org.apache.spark.mllib.linalg.Vector;
@@ -78,13 +79,13 @@ MLlib recognizes the following types as dense vectors:

and the following as sparse vectors:

- * MLlib's [`SparseVector`](api/pyspark/pyspark.mllib.linalg.SparseVector-class.html).
+ * MLlib's [`SparseVector`](api/python/pyspark.mllib.linalg.SparseVector-class.html).
* SciPy's
  [`csc_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
  with a single column

We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented
- in [`Vectors`](api/pyspark/pyspark.mllib.linalg.Vectors-class.html) to create sparse vectors.
+ in [`Vectors`](api/python/pyspark.mllib.linalg.Vectors-class.html) to create sparse vectors.

{% highlight python %}
import numpy as np
@@ -117,7 +118,7 @@ For multiclass classification, labels should be class indices starting from zero:
<div data-lang="scala" markdown="1">

A labeled point is represented by the case class
- [`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint).
+ [`LabeledPoint`](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint).

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
@@ -134,7 +135,7 @@ val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
<div data-lang="java" markdown="1">

A labeled point is represented by
- [`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint).
+ [`LabeledPoint`](api/java/org/apache/spark/mllib/regression/LabeledPoint.html).

{% highlight java %}
import org.apache.spark.mllib.linalg.Vectors;
@@ -151,7 +152,7 @@ LabeledPoint neg = new LabeledPoint(1.0, Vectors.sparse(3, new int[] {0, 2}, new
<div data-lang="python" markdown="1">

A labeled point is represented by
- [`LabeledPoint`](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html).
+ [`LabeledPoint`](api/python/pyspark.mllib.regression.LabeledPoint-class.html).

{% highlight python %}
from pyspark.mllib.linalg import SparseVector
@@ -184,28 +185,40 @@ After loading, the feature indices are converted to zero-based.
<div class="codetabs">
<div data-lang="scala" markdown="1">

- [`MLUtils.loadLibSVMFile`](api/mllib/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
+ [`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
examples stored in LIBSVM format.

{% highlight scala %}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

- val training: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
+ val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
- [`MLUtils.loadLibSVMFile`](api/mllib/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
+ [`MLUtils.loadLibSVMFile`](api/java/org/apache/spark/mllib/util/MLUtils.html) reads training
examples stored in LIBSVM format.

{% highlight java %}
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
- import org.apache.spark.rdd.RDDimport;
+ import org.apache.spark.api.java.JavaRDD;
+
+ JavaRDD<LabeledPoint> examples =
+   MLUtils.loadLibSVMFile(jsc.sc(), "mllib/data/sample_libsvm_data.txt").toJavaRDD();
+ {% endhighlight %}
+ </div>
+
+ <div data-lang="python" markdown="1">
+ [`MLUtils.loadLibSVMFile`](api/python/pyspark.mllib.util.MLUtils-class.html) reads training
+ examples stored in LIBSVM format.

- RDD<LabeledPoint> training = MLUtils.loadLibSVMFile(jsc, "mllib/data/sample_libsvm_data.txt");
+ {% highlight python %}
+ from pyspark.mllib.util import MLUtils
+
+ examples = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
{% endhighlight %}
</div>
</div>
@@ -227,10 +240,10 @@ We are going to add sparse matrix in the next release.
<div data-lang="scala" markdown="1">

The base class of local matrices is
- [`Matrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
- implementation: [`DenseMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
+ [`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
+ implementation: [`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
Sparse matrices will be added in the next release. We recommend using the factory methods implemented
- in [`Matrices`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
+ in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
matrices.

{% highlight scala %}
@@ -244,10 +257,10 @@ val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
<div data-lang="java" markdown="1">

The base class of local matrices is
- [`Matrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
- implementation: [`DenseMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
+ [`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide one
+ implementation: [`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html).
Sparse matrices will be added in the next release. We recommend using the factory methods implemented
- in [`Matrices`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
+ in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to create local
matrices.

{% highlight java %}
@@ -269,6 +282,15 @@ and distributed matrices. Converting a distributed matrix to a different format
global shuffle, which is quite expensive. We implemented three types of distributed matrices in
this release and will add more types in the future.

+ The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed
+ matrix without meaningful row indices, e.g., a collection of feature vectors.
+ It is backed by an RDD of its rows, where each row is a local vector.
+ We assume that the number of columns is not huge for a `RowMatrix`.
+ An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
+ which can be used for identifying rows and executing joins.
+ A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix) format,
+ backed by an RDD of its entries.
+
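As a quick orientation, here is a minimal Scala sketch of how each of the three types is constructed (assuming a live `SparkContext` named `sc`); each type is covered in detail in the sections below.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, IndexedRowMatrix, MatrixEntry, RowMatrix}

// A RowMatrix is backed by an RDD of local vectors; rows carry no indices.
val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val rowMat = new RowMatrix(rows)

// An IndexedRowMatrix keeps a long-typed index with each row.
val indexedRows = rows.zipWithIndex().map { case (v, i) => IndexedRow(i, v) }
val indexedMat = new IndexedRowMatrix(indexedRows)

// A CoordinateMatrix stores (i, j, value) entries in COO format.
val entries = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 4.0)))
val cooMat = new CoordinateMatrix(entries)
{% endhighlight %}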
***Note***

The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
@@ -284,7 +306,7 @@ limited by the integer range but it should be much smaller in practice.
<div class="codetabs">
<div data-lang="scala" markdown="1">

- A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
+ A [`RowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
created from an `RDD[Vector]` instance. Then we can compute its column summary statistics.

{% highlight scala %}
@@ -303,7 +325,7 @@ val n = mat.numCols()

<div data-lang="java" markdown="1">

- A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
+ A [`RowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be
created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.

{% highlight java %}
@@ -333,8 +355,8 @@ which could be faster if the rows are sparse.
<div class="codetabs">
<div data-lang="scala" markdown="1">

- `RowMatrix#computeColumnSummaryStatistics` returns an instance of
- [`MultivariateStatisticalSummary`](api/mllib/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
+ [`RowMatrix#computeColumnSummaryStatistics`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
+ [`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.

@@ -355,6 +377,31 @@ println(summary.numNonzeros) // number of nonzeros in each column
val cov: Matrix = mat.computeCovariance()
{% endhighlight %}
</div>
+
+ <div data-lang="java" markdown="1">
+
+ [`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
+ [`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
+ which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
+ total count.
+
+ {% highlight java %}
+ import org.apache.spark.mllib.linalg.Matrix;
+ import org.apache.spark.mllib.linalg.distributed.RowMatrix;
+ import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
+
+ RowMatrix mat = ... // a RowMatrix
+
+ // Compute column summary statistics.
+ MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
+ System.out.println(summary.mean()); // a dense vector containing the mean value for each column
+ System.out.println(summary.variance()); // column-wise variance
+ System.out.println(summary.numNonzeros()); // number of nonzeros in each column
+
+ // Compute the covariance matrix.
+ Matrix cov = mat.computeCovariance();
+ {% endhighlight %}
+ </div>
</div>

### IndexedRowMatrix
@@ -366,9 +413,9 @@ an RDD of indexed rows, where each row is represented by its index (long-typed)
<div data-lang="scala" markdown="1">

An
- [`IndexedRowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
+ [`IndexedRowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
can be created from an `RDD[IndexedRow]` instance, where
- [`IndexedRow`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
+ [`IndexedRow`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
wrapper over `(Long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.

@@ -391,9 +438,9 @@ val rowMat: RowMatrix = mat.toRowMatrix()
<div data-lang="java" markdown="1">

An
- [`IndexedRowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
+ [`IndexedRowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
can be created from a `JavaRDD<IndexedRow>` instance, where
- [`IndexedRow`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
+ [`IndexedRow`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.

@@ -427,9 +474,9 @@ dimensions of the matrix are huge and the matrix is very sparse.
<div data-lang="scala" markdown="1">

A
- [`CoordinateMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
+ [`CoordinateMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
can be created from an `RDD[MatrixEntry]` instance, where
- [`MatrixEntry`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
+ [`MatrixEntry`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
wrapper over `(Long, Long, Double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`. In this release, we do not provide other
computation for `CoordinateMatrix`.
@@ -453,21 +500,21 @@ val indexedRowMatrix = mat.toIndexedRowMatrix()
<div data-lang="java" markdown="1">

A
- [`CoordinateMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
+ [`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
can be created from a `JavaRDD<MatrixEntry>` instance, where
- [`MatrixEntry`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
+ [`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`.

- {% highlight scala %}
+ {% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
- CoordinateMatrix mat = new CoordinateMatrix(entries);
+ CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());

// Get its size.
long m = mat.numRows();