
Commit 26d35f3

mengxr authored and pwendell committed
[SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0
Preview: http://54.82.240.23:4000/mllib-guide.html

Table of contents:

* Basics
  * Data types
  * Summary statistics
* Classification and regression
  * linear support vector machine (SVM)
  * logistic regression
  * linear least squares, Lasso, and ridge regression
  * decision tree
  * naive Bayes
* Collaborative Filtering
  * alternating least squares (ALS)
* Clustering
  * k-means
* Dimensionality reduction
  * singular value decomposition (SVD)
  * principal component analysis (PCA)
* Optimization
  * stochastic gradient descent
  * limited-memory BFGS (L-BFGS)

Author: Xiangrui Meng <[email protected]>

Closes #422 from mengxr/mllib-doc and squashes the following commits:

944e3a9 [Xiangrui Meng] merge master
f9fda28 [Xiangrui Meng] minor
9474065 [Xiangrui Meng] add alpha to ALS examples
928e630 [Xiangrui Meng] initialization_mode -> initializationMode
5bbff49 [Xiangrui Meng] add imports to labeled point examples
c17440d [Xiangrui Meng] fix python nb example
28f40dc [Xiangrui Meng] remove localhost:4000
369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc
7dc95cc [Xiangrui Meng] update linear methods
053ad8a [Xiangrui Meng] add links to go back to the main page
abbbf7e [Xiangrui Meng] update ALS argument names
648283e [Xiangrui Meng] level down statistics
14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide
8cd2441 [Xiangrui Meng] minor updates
186ab07 [Xiangrui Meng] update section names
6568d65 [Xiangrui Meng] update toc, level up lr and svm
162ee12 [Xiangrui Meng] rename section names
5c1e1b1 [Xiangrui Meng] minor
8aeaba1 [Xiangrui Meng] wrap long lines
6ce6a6f [Xiangrui Meng] add summary statistics to toc
5760045 [Xiangrui Meng] claim beta
cc604bf [Xiangrui Meng] remove classification and regression
92747b3 [Xiangrui Meng] make section titles consistent
e605dd6 [Xiangrui Meng] add LIBSVM loader
f639674 [Xiangrui Meng] add python section to migration guide
c82ffb4 [Xiangrui Meng] clean optimization
31660eb [Xiangrui Meng] update linear algebra and stat
0a40837 [Xiangrui Meng] first pass over linear methods
1fc8271 [Xiangrui Meng] update toc
906ed0a [Xiangrui Meng] add a python example to naive bayes
5f0a700 [Xiangrui Meng] update collaborative filtering
656d416 [Xiangrui Meng] update mllib-clustering
86e143a [Xiangrui Meng] remove data types section from main page
8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples
d1b5cbf [Xiangrui Meng] merge master
72e4804 [Xiangrui Meng] one pass over tree guide
64f8995 [Xiangrui Meng] move decision tree guide to a separate file
9fca001 [Xiangrui Meng] add first version of linear algebra guide
53c9552 [Xiangrui Meng] update dependencies
f316ec2 [Xiangrui Meng] add migration guide
f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction
182460f [Xiangrui Meng] add guide for naive Bayes
137fd1d [Xiangrui Meng] re-organize toc
a61e434 [Xiangrui Meng] update mllib's toc
1 parent bf9d49b · commit 26d35f3

12 files changed: +1543, -769 lines changed

docs/mllib-basics.md

Lines changed: 476 additions & 0 deletions
Large diffs are not rendered by default.

docs/mllib-classification-regression.md

Lines changed: 0 additions & 568 deletions
This file was deleted.

docs/mllib-clustering.md

Lines changed: 21 additions & 23 deletions
@@ -1,19 +1,21 @@
 ---
 layout: global
-title: MLlib - Clustering
+title: <a href="mllib-guide.html">MLlib</a> - Clustering
 ---
 
 * Table of contents
 {:toc}
 
 
-# Clustering
+## Clustering
 
 Clustering is an unsupervised learning problem whereby we aim to group subsets
 of entities with one another based on some notion of similarity. Clustering is
 often used for exploratory analysis and/or as a component of a hierarchical
 supervised learning pipeline (in which distinct classifiers or regression
-models are trained for each cluster). MLlib supports
+models are trained for each cluster).
+
+MLlib supports
 [k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, one of
 the most commonly used clustering algorithms that clusters the data points into
 predfined number of clusters. The MLlib implementation includes a parallelized
@@ -31,17 +33,14 @@ a given dataset, the algorithm returns the best clustering result).
 * *initializiationSteps* determines the number of steps in the k-means\|\| algorithm.
 * *epsilon* determines the distance threshold within which we consider k-means to have converged.
 
-Available algorithms for clustering:
-
-* [KMeans](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans)
-
-
-
-# Usage in Scala
+## Examples
 
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
 Following code snippets can be executed in `spark-shell`.
 
-In the following example after loading and parsing data, we use the KMeans object to cluster the data
+In the following example after loading and parsing data, we use the
+[`KMeans`](api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans) object to cluster the data
 into two clusters. The number of desired clusters is passed to the algorithm. We then compute Within
 Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact the
 optimal *k* is usually one where there is an "elbow" in the WSSSE graph.
@@ -63,22 +62,22 @@ val clusters = KMeans.train(parsedData, numClusters, numIterations)
 val WSSSE = clusters.computeCost(parsedData)
 println("Within Set Sum of Squared Errors = " + WSSSE)
 {% endhighlight %}
+</div>
 
-
-# Usage in Java
-
+<div data-lang="java" markdown="1">
 All of MLlib's methods use Java-friendly types, so you can import and call them there the same
 way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
 Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
 calling `.rdd()` on your `JavaRDD` object.
+</div>
 
-# Usage in Python
+<div data-lang="python" markdown="1">
 Following examples can be tested in the PySpark shell.
 
-In the following example after loading and parsing data, we use the KMeans object to cluster the data
-into two clusters. The number of desired clusters is passed to the algorithm. We then compute Within
-Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact the
-optimal *k* is usually one where there is an "elbow" in the WSSSE graph.
+In the following example after loading and parsing data, we use the KMeans object to cluster the
+data into two clusters. The number of desired clusters is passed to the algorithm. We then compute
+Within Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In
+fact the optimal *k* is usually one where there is an "elbow" in the WSSSE graph.
 
 {% highlight python %}
 from pyspark.mllib.clustering import KMeans
@@ -91,7 +90,7 @@ parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
 
 # Build the model (cluster the data)
 clusters = KMeans.train(parsedData, 2, maxIterations=10,
-        runs=10, initialization_mode="random")
+        runs=10, initializationMode="random")
 
 # Evaluate clustering by computing Within Set Sum of Squared Errors
 def error(point):
@@ -101,7 +100,6 @@ def error(point):
 WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
 print("Within Set Sum of Squared Error = " + str(WSSSE))
 {% endhighlight %}
+</div>
 
-Similarly you can use RidgeRegressionWithSGD and LassoWithSGD and compare training Mean Squared
-Errors.
-
+</div>
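
For readers following the Scala tab this diff assembles, here is a minimal, self-contained sketch of the k-means flow for a 1.0-era `spark-shell` (which provides `sc`). It is an illustration, not part of the commit: the data path and the `Vectors.dense` parsing step are assumptions, since the diff does not show those context lines, while `KMeans.train` and `computeCost` are the calls the guide itself uses.

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data: one space-separated point per line (path assumed).
val data = sc.textFile("mllib/data/kmeans_data.txt")
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Cluster the data into two classes using KMeans.
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors (WSSSE).
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
{% endhighlight %}

Increasing *k* drives this cost down; the guide's "elbow" heuristic picks the *k* where the decrease levels off.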
docs/mllib-collaborative-filtering.md

Lines changed: 44 additions & 34 deletions
@@ -1,57 +1,56 @@
 ---
 layout: global
-title: MLlib - Collaborative Filtering
+title: <a href="mllib-guide.html">MLlib</a> - Collaborative Filtering
 ---
 
 * Table of contents
 {:toc}
 
-# Collaborative Filtering
+## Collaborative filtering
 
 [Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
 is commonly used for recommender systems. These techniques aim to fill in the
 missing entries of a user-item association matrix. MLlib currently supports
 model-based collaborative filtering, in which users and products are described
 by a small set of latent factors that can be used to predict missing entries.
 In particular, we implement the [alternating least squares
-(ALS)](http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf)
+(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
 algorithm to learn these latent factors. The implementation in MLlib has the
 following parameters:
 
-* *numBlocks* is the number of blacks used to parallelize computation (set to -1 to auto-configure).
+* *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
 * *rank* is the number of latent factors in our model.
 * *iterations* is the number of iterations to run.
 * *lambda* specifies the regularization parameter in ALS.
-* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for *implicit feedback* data
-* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the *baseline* confidence in preference observations
+* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
+  *implicit feedback* data.
+* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
+  *baseline* confidence in preference observations.
 
-## Explicit vs Implicit Feedback
+### Explicit vs. implicit feedback
 
 The standard approach to matrix factorization based collaborative filtering treats
 the entries in the user-item matrix as *explicit* preferences given by the user to the item.
 
-It is common in many real-world use cases to only have access to *implicit feedback*
-(e.g. views, clicks, purchases, likes, shares etc.). The approach used in MLlib to deal with
-such data is taken from
-[Collaborative Filtering for Implicit Feedback Datasets](http://www2.research.att.com/~yifanhu/PUB/cf.pdf).
-Essentially instead of trying to model the matrix of ratings directly, this approach treats the data as
-a combination of binary preferences and *confidence values*. The ratings are then related
-to the level of confidence in observed user preferences, rather than explicit ratings given to items.
-The model then tries to find latent factors that can be used to predict the expected preference of a user
-for an item.
+It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
+clicks, purchases, likes, shares etc.). The approach used in MLlib to deal with such data is taken
+from
+[Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
+Essentially instead of trying to model the matrix of ratings directly, this approach treats the data
+as a combination of binary preferences and *confidence values*. The ratings are then related to the
+level of confidence in observed user preferences, rather than explicit ratings given to items. The
+model then tries to find latent factors that can be used to predict the expected preference of a
+user for an item.
 
-Available algorithms for collaborative filtering:
+## Examples
 
-* [ALS](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS)
-
-
-# Usage in Scala
-
-Following code snippets can be executed in `spark-shell`.
+<div class="codetabs">
 
+<div data-lang="scala" markdown="1">
 In the following example we load rating data. Each row consists of a user, a product and a rating.
-We use the default ALS.train() method which assumes ratings are explicit. We evaluate the recommendation
-model by measuring the Mean Squared Error of rating prediction.
+We use the default [ALS.train()](api/mllib/index.html#org.apache.spark.mllib.recommendation.ALS$)
+method which assumes ratings are explicit. We evaluate the
+recommendation model by measuring the Mean Squared Error of rating prediction.
 
 {% highlight scala %}
 import org.apache.spark.mllib.recommendation.ALS
@@ -64,8 +63,9 @@ val ratings = data.map(_.split(',') match {
 })
 
 // Build the recommendation model using ALS
+val rank = 10
 val numIterations = 20
-val model = ALS.train(ratings, 1, 20, 0.01)
+val model = ALS.train(ratings, rank, numIterations, 0.01)
 
 // Evaluate the model on rating data
 val usersProducts = ratings.map{ case Rating(user, product, rate) => (user, product)}
@@ -85,19 +85,19 @@ If the rating matrix is derived from other source of information (i.e., it is inferred from
 other signals), you can use the trainImplicit method to get better results.
 
 {% highlight scala %}
-val model = ALS.trainImplicit(ratings, 1, 20, 0.01)
+val alpha = 0.01
+val model = ALS.trainImplicit(ratings, rank, numIterations, alpha)
 {% endhighlight %}
+</div>
 
-# Usage in Java
-
+<div data-lang="java" markdown="1">
 All of MLlib's methods use Java-friendly types, so you can import and call them there the same
 way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
 Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
 calling `.rdd()` on your `JavaRDD` object.
+</div>
 
-# Usage in Python
-Following examples can be tested in the PySpark shell.
-
+<div data-lang="python" markdown="1">
 In the following example we load rating data. Each row consists of a user, a product and a rating.
 We use the default ALS.train() method which assumes ratings are explicit. We evaluate the
 recommendation by measuring the Mean Squared Error of rating prediction.
@@ -111,7 +111,9 @@ data = sc.textFile("mllib/data/als/test.data")
 ratings = data.map(lambda line: array([float(x) for x in line.split(',')]))
 
 # Build the recommendation model using Alternating Least Squares
-model = ALS.train(ratings, 1, 20)
+rank = 10
+numIterations = 20
+model = ALS.train(ratings, rank, numIterations)
 
 # Evaluate the model on training data
 testdata = ratings.map(lambda p: (int(p[0]), int(p[1])))
@@ -126,5 +128,13 @@ signals), you can use the trainImplicit method to get better results.
 
 {% highlight python %}
 # Build the recommendation model using Alternating Least Squares based on implicit ratings
-model = ALS.trainImplicit(ratings, 1, 20)
+model = ALS.trainImplicit(ratings, rank, numIterations, alpha = 0.01)
 {% endhighlight %}
+</div>
+
+</div>
+
+## Tutorial
+
+[AMP Camp](http://ampcamp.berkeley.edu/) provides a hands-on tutorial for
+[personalized movie recommendation with MLlib](http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html).
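
For readers piecing the Scala tab together from these hunks, the following is a minimal end-to-end sketch of the explicit-feedback ALS example plus the implicit variant, runnable in a 1.0-era `spark-shell` (which provides `sc`). The MSE evaluation lines reconstruct context the diff only partially shows, and the five-argument `ALS.trainImplicit` overload (rank, iterations, lambda, alpha) is an assumption standing in for the shorter call in the doc.

{% highlight scala %}
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the ratings: each line is "user,product,rating".
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS on explicit ratings.
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)

// Evaluate the model: predict every observed (user, product) pair and
// measure the Mean Squared Error against the known ratings.
val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}
val ratesAndPreds = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
  math.pow(r1 - r2, 2)
}.mean()
println("Mean Squared Error = " + MSE)

// Implicit-feedback variant: alpha governs the baseline confidence in
// observed preferences (overload with explicit lambda assumed).
val alpha = 0.01
val implicitModel = ALS.trainImplicit(ratings, rank, numIterations, 0.01, alpha)
{% endhighlight %}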
