@@ -14,7 +14,7 @@ Clustering is an unsupervised learning problem whereby we aim to group subsets
 of entities with one another based on some notion of similarity. Clustering is
 often used for exploratory analysis and/or as a component of a hierarchical
 supervised learning pipeline (in which distinct classifiers or regression
-models are trained for each cluster).
+models are trained for each cluster).
 
 MLlib supports the following models:
 
@@ -25,7 +25,7 @@ most commonly used clustering algorithms that clusters the data points into a
 predefined number of clusters. The MLlib implementation includes a parallelized
 variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
 called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
-The implementation in MLlib has the following parameters:
+The implementation in MLlib has the following parameters:
 
 * *k* is the number of desired clusters.
 * *maxIterations* is the maximum number of iterations to run.
@@ -35,12 +35,12 @@ initialization via k-means\|\|.
 guaranteed to find a globally optimal solution, and when run multiple times on
 a given dataset, the algorithm returns the best clustering result).
 * *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
-* *epsilon* determines the distance threshold within which we consider k-means to have converged.
+* *epsilon* determines the distance threshold within which we consider k-means to have converged.
 
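 As a quick illustration of how these parameters map onto the API, the following sketch
 builds a `KMeans` instance explicitly (it assumes a SparkContext `sc`; the input path and
 parameter values are made-up placeholders, not part of the examples below):
 
 {% highlight scala %}
 import org.apache.spark.mllib.clustering.KMeans
 import org.apache.spark.mllib.linalg.Vectors
 
 // Parse a text file with one whitespace-separated vector per line.
 val data = sc.textFile("data/mllib/kmeans_data.txt")
 val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
 
 // Configure the clustering with the parameters described above.
 val model = new KMeans()
   .setK(2)                                        // k: number of desired clusters
   .setMaxIterations(20)                           // maxIterations
   .setInitializationMode(KMeans.K_MEANS_PARALLEL) // initializationMode: "k-means||"
   .setInitializationSteps(5)                      // initializationSteps
   .run(parsedData)
 {% endhighlight %}
 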
 ### Gaussian mixture
 
 A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
-represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
+represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
 each with its own probability. The MLlib implementation uses the
 [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
 algorithm to induce the maximum-likelihood model given a set of samples. The implementation
@@ -221,8 +221,8 @@ print("Within Set Sum of Squared Error = " + str(WSSSE))
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 In the following example, after loading and parsing data, we use a
-[GaussianMixture](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture)
-object to cluster the data into two clusters. The number of desired clusters is passed
+[GaussianMixture](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture)
+object to cluster the data into two clusters. The number of desired clusters is passed
 to the algorithm. We then output the parameters of the mixture model.
 
 {% highlight scala %}
@@ -238,7 +238,7 @@ val gmm = new GaussianMixture().setK(2).run(parsedData)
 
 // output parameters of max-likelihood model
 for (i <- 0 until gmm.k) {
-  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
+  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
     (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
 }
 
@@ -298,7 +298,7 @@ public class GaussianMixtureExample {
 <div data-lang="python" markdown="1">
 In the following example, after loading and parsing data, we use a
 [GaussianMixture](api/python/pyspark.mllib.html#pyspark.mllib.clustering.GaussianMixture)
-object to cluster the data into two clusters. The number of desired clusters is passed
+object to cluster the data into two clusters. The number of desired clusters is passed
 to the algorithm. We then output the parameters of the mixture model.
 
 {% highlight python %}
@@ -326,7 +326,7 @@ for i in range(2):
 
 In the following example, we load word count vectors representing a corpus of documents.
 We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
-to infer three topics from the documents. The number of desired clusters is passed
+to infer three topics from the documents. The number of desired topics is passed
 to the algorithm. We then output the topics, represented as probability distributions over words.
 
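 As a minimal sketch of the Scala call (assuming a SparkContext `sc` and a file of
 space-separated word-count vectors; the path is illustrative):
 
 {% highlight scala %}
 import org.apache.spark.mllib.clustering.LDA
 import org.apache.spark.mllib.linalg.Vectors
 
 // Load word-count vectors and pair each document with a unique Long ID,
 // the (id, vector) input format that LDA.run expects.
 val data = sc.textFile("data/mllib/sample_lda_data.txt")
 val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
 val corpus = parsedData.zipWithIndex.map(_.swap).cache()
 
 // Infer three topics.
 val ldaModel = new LDA().setK(3).run(corpus)
 {% endhighlight %}
 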
 <div class="codetabs">
@@ -428,27 +428,27 @@ a dependency.
 
 ## Streaming clustering
 
-When data arrive in a stream, we may want to estimate clusters dynamically,
-updating them as new data arrive. MLlib provides support for streaming k-means clustering,
-with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm
-uses a generalization of the mini-batch k-means update rule. For each batch of data, we assign
+When data arrive in a stream, we may want to estimate clusters dynamically,
+updating them as new data arrive. MLlib provides support for streaming k-means clustering,
+with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm
+uses a generalization of the mini-batch k-means update rule. For each batch of data, we assign
 all points to their nearest cluster, compute new cluster centers, then update each cluster using:
 
 `\begin{equation}
     c_{t+1} = \frac{c_t n_t \alpha + x_t m_t}{n_t \alpha + m_t}
 \end{equation}`
 `\begin{equation}
-    n_{t+1} = n_t + m_t
+    n_{t+1} = n_t + m_t
 \end{equation}`
 
-Where `$c_t$` is the previous center for the cluster, `$n_t$` is the number of points assigned
-to the cluster thus far, `$x_t$` is the new cluster center from the current batch, and `$m_t$`
-is the number of points added to the cluster in the current batch. The decay factor `$\alpha$`
-can be used to ignore the past: with `$\alpha$=1` all data will be used from the beginning;
-with `$\alpha$=0` only the most recent data will be used. This is analogous to an
-exponentially-weighted moving average.
+Where `$c_t$` is the previous center for the cluster, `$n_t$` is the number of points assigned
+to the cluster thus far, `$x_t$` is the new cluster center from the current batch, and `$m_t$`
+is the number of points added to the cluster in the current batch. The decay factor `$\alpha$`
+can be used to ignore the past: with `$\alpha$=1` all data will be used from the beginning;
+with `$\alpha$=0` only the most recent data will be used. This is analogous to an
+exponentially-weighted moving average.
 
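 For concreteness, here is the center update worked through with made-up numbers: a cluster
 with center `$c_t = 10.0$` and `$n_t = 100$` points so far receives a batch of `$m_t = 25$`
 points whose mean is `$x_t = 20.0$`. With `$\alpha$=1`,
 
 `\begin{equation}
     c_{t+1} = \frac{10.0 \cdot 100 \cdot 1 + 20.0 \cdot 25}{100 \cdot 1 + 25} = \frac{1500}{125} = 12.0,
 \end{equation}`
 
 i.e. the center moves toward the batch mean in proportion to the 100:25 weighting of old and new points.
 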
-The decay can be specified using a `halfLife` parameter, which determines the
+The decay can be specified using a `halfLife` parameter, which determines the
 correct decay factor `a` such that, for data acquired
 at time `t`, its contribution by time `t + halfLife` will have dropped to 0.5.
 The unit of time can be specified either as `batches` or `points` and the update rule
@@ -472,9 +472,9 @@ import org.apache.spark.mllib.clustering.StreamingKMeans
 
 {% endhighlight %}
 
-Then we make an input stream of vectors for training, as well as a stream of labeled data
-points for testing. We assume a StreamingContext `ssc` has been created, see
-[Spark Streaming Programming Guide](streaming-programming-guide.html#initializing) for more info.
+Then we make an input stream of vectors for training, as well as a stream of labeled data
+points for testing. We assume a StreamingContext `ssc` has been created; see the
+[Spark Streaming Programming Guide](streaming-programming-guide.html#initializing) for more info.
 
 {% highlight scala %}
 
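 // A sketch of the stream setup (assumes Vectors and LabeledPoint are imported
 // from org.apache.spark.mllib.linalg and org.apache.spark.mllib.regression;
 // the directories match those described below):
 val trainingData = ssc.textFileStream("/training/data/dir").map(Vectors.parse)
 val testData = ssc.textFileStream("/testing/data/dir").map(LabeledPoint.parse)
 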
@@ -496,24 +496,24 @@ val model = new StreamingKMeans()
 
 {% endhighlight %}
 
-Now register the streams for training and testing and start the job, printing
+Now register the streams for training and testing and start the job, printing
 the predicted cluster assignments on new data points as they arrive.
 
 {% highlight scala %}
 
 model.trainOn(trainingData)
-model.predictOnValues(testData).print()
+model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
 
 ssc.start()
 ssc.awaitTermination()
-
+
 {% endhighlight %}
 
-As you add new text files with data the cluster centers will update. Each training
+As you add new text files with data, the cluster centers will update. Each training
 point should be formatted as `[x1, x2, x3]`, and each test data point
-should be formatted as `(y, [x1, x2, x3])`, where `y` is some useful label or identifier
-(e.g. a true category assignment). Anytime a text file is placed in `/training/data/dir`
-the model will update. Anytime a text file is placed in `/testing/data/dir`
+should be formatted as `(y, [x1, x2, x3])`, where `y` is some useful label or identifier
+(e.g. a true category assignment). Anytime a text file is placed in `/training/data/dir`
+the model will update. Anytime a text file is placed in `/testing/data/dir`
 you will see predictions. With new data, the cluster centers will change!
 
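 As a concrete illustration (values made up), a file dropped into `/training/data/dir`
 could contain:
 
 {% highlight text %}
 [1.0, 2.0, 3.0]
 [4.1, 5.2, 6.3]
 {% endhighlight %}
 
 while a file in `/testing/data/dir` could contain the line `(1, [1.0, 2.0, 3.0])`,
 where `1` is the label for that point.
 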
 </div>