Commit 1029f66

SPARK-5805 Fixed the type error in documentation.
1 parent 077eec2 commit 1029f66

File tree: 1 file changed (+31, -31 lines)

docs/mllib-clustering.md

Lines changed: 31 additions & 31 deletions
@@ -14,7 +14,7 @@ Clustering is an unsupervised learning problem whereby we aim to group subsets
 of entities with one another based on some notion of similarity. Clustering is
 often used for exploratory analysis and/or as a component of a hierarchical
 supervised learning pipeline (in which distinct classifiers or regression
-models are trained for each cluster).
+models are trained for each cluster).
 
 MLlib supports the following models:
 
@@ -25,7 +25,7 @@ most commonly used clustering algorithms that clusters the data points into a
 predefined number of clusters. The MLlib implementation includes a parallelized
 variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
 called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
-The implementation in MLlib has the following parameters:
+The implementation in MLlib has the following parameters:
 
 * *k* is the number of desired clusters.
 * *maxIterations* is the maximum number of iterations to run.
@@ -35,12 +35,12 @@ initialization via k-means\|\|.
 guaranteed to find a globally optimal solution, and when run multiple times on
 a given dataset, the algorithm returns the best clustering result).
 * *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
-* *epsilon* determines the distance threshold within which we consider k-means to have converged.
+* *epsilon* determines the distance threshold within which we consider k-means to have converged.
 
 ### Gaussian mixture
 
 A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
-represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
+represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
 each with its own probability. The MLlib implementation uses the
 [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
 algorithm to induce the maximum-likelihood model given a set of samples. The implementation
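
For readers skimming the diff, a minimal k-means run that exercises the parameters listed in the hunk above might look like the following sketch (assuming an existing SparkContext `sc`; the file path and the values of *k* and *maxIterations* are illustrative only):

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse space-separated numeric data (path is hypothetical).
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// k = 2 desired clusters, at most 20 iterations (the *k* and *maxIterations* parameters above).
val clusters = KMeans.train(parsedData, 2, 20)

// Evaluate the clustering by the within-set sum of squared errors.
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Error = " + WSSSE)
{% endhighlight %}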
@@ -221,8 +221,8 @@ print("Within Set Sum of Squared Error = " + str(WSSSE))
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 In the following example after loading and parsing data, we use a
-[GaussianMixture](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture)
-object to cluster the data into two clusters. The number of desired clusters is passed
+[GaussianMixture](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture)
+object to cluster the data into two clusters. The number of desired clusters is passed
 to the algorithm. We then output the parameters of the mixture model.
 
 {% highlight scala %}
@@ -238,7 +238,7 @@ val gmm = new GaussianMixture().setK(2).run(parsedData)
 
 // output parameters of max-likelihood model
 for (i <- 0 until gmm.k) {
-  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
+  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
     (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
 }
 
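The `parsedData` passed to `GaussianMixture` in this hunk's context is an RDD of dense vectors; a sketch of that setup (the input path is an assumption) is:

{% highlight scala %}
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Parse each line of space-separated doubles into a dense vector (path is illustrative).
val data = sc.textFile("data/mllib/gmm_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble))).cache()

// Fit a two-component Gaussian mixture, as in the hunk above.
val gmm = new GaussianMixture().setK(2).run(parsedData)
{% endhighlight %}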

@@ -298,7 +298,7 @@ public class GaussianMixtureExample {
 <div data-lang="python" markdown="1">
 In the following example after loading and parsing data, we use a
 [GaussianMixture](api/python/pyspark.mllib.html#pyspark.mllib.clustering.GaussianMixture)
-object to cluster the data into two clusters. The number of desired clusters is passed
+object to cluster the data into two clusters. The number of desired clusters is passed
 to the algorithm. We then output the parameters of the mixture model.
 
 {% highlight python %}
@@ -326,7 +326,7 @@ for i in range(2):
 
 In the following example, we load word count vectors representing a corpus of documents.
 We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
-to infer three topics from the documents. The number of desired clusters is passed
+to infer three topics from the documents. The number of desired clusters is passed
 to the algorithm. We then output the topics, represented as probability distributions over words.
 
 <div class="codetabs">
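
A minimal Scala sketch of the LDA usage this passage describes (the word-count file and the use of line numbers as document IDs are assumptions):

{% highlight scala %}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Each line is one document's word-count vector (path is hypothetical).
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val wordCounts = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// LDA expects (documentId, wordCountVector) pairs.
val corpus = wordCounts.zipWithIndex.map(_.swap).cache()

// Infer three topics, as described above.
val ldaModel = new LDA().setK(3).run(corpus)

// topicsMatrix has one column per topic, giving that topic's distribution over words.
val topics = ldaModel.topicsMatrix
{% endhighlight %}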
@@ -428,27 +428,27 @@ a dependency.
 
 ## Streaming clustering
 
-When data arrive in a stream, we may want to estimate clusters dynamically,
-updating them as new data arrive. MLlib provides support for streaming k-means clustering,
-with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm
-uses a generalization of the mini-batch k-means update rule. For each batch of data, we assign
+When data arrive in a stream, we may want to estimate clusters dynamically,
+updating them as new data arrive. MLlib provides support for streaming k-means clustering,
+with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm
+uses a generalization of the mini-batch k-means update rule. For each batch of data, we assign
 all points to their nearest cluster, compute new cluster centers, then update each cluster using:
 
 `\begin{equation}
 c_{t+1} = \frac{c_tn_t\alpha + x_tm_t}{n_t\alpha+m_t}
 \end{equation}`
 `\begin{equation}
-n_{t+1} = n_t + m_t
+n_{t+1} = n_t + m_t
 \end{equation}`
 
-Where `$c_t$` is the previous center for the cluster, `$n_t$` is the number of points assigned
-to the cluster thus far, `$x_t$` is the new cluster center from the current batch, and `$m_t$`
-is the number of points added to the cluster in the current batch. The decay factor `$\alpha$`
-can be used to ignore the past: with `$\alpha$=1` all data will be used from the beginning;
-with `$\alpha$=0` only the most recent data will be used. This is analogous to an
-exponentially-weighted moving average.
+Where `$c_t$` is the previous center for the cluster, `$n_t$` is the number of points assigned
+to the cluster thus far, `$x_t$` is the new cluster center from the current batch, and `$m_t$`
+is the number of points added to the cluster in the current batch. The decay factor `$\alpha$`
+can be used to ignore the past: with `$\alpha$=1` all data will be used from the beginning;
+with `$\alpha$=0` only the most recent data will be used. This is analogous to an
+exponentially-weighted moving average.
 
-The decay can be specified using a `halfLife` parameter, which determines the
+The decay can be specified using a `halfLife` parameter, which determines the
 correct decay factor `a` such that, for data acquired
 at time `t`, its contribution by time `t + halfLife` will have dropped to 0.5.
 The unit of time can be specified either as `batches` or `points` and the update rule
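
To make the update rule concrete, here is a small self-contained sketch (plain Scala, not the MLlib API) of a single center update, together with the halfLife-to-decay-factor relationship implied by the description above (`a^halfLife = 0.5`):

{% highlight scala %}
// c_{t+1} = (c_t * n_t * a + x_t * m_t) / (n_t * a + m_t), applied per coordinate.
def updateCenter(c: Array[Double], n: Double, x: Array[Double], m: Double, a: Double): Array[Double] =
  c.zip(x).map { case (ci, xi) => (ci * n * a + xi * m) / (n * a + m) }

// A half-life of h (in batches or points) corresponds to a decay factor a with a^h = 0.5.
def decayFactor(halfLife: Double): Double = math.pow(0.5, 1.0 / halfLife)

// Example: old center (0, 0) with 10 points, batch center (1, 1) with 5 points, and a = 1
// (no forgetting) gives the plain weighted average (1/3, 1/3).
val c1 = updateCenter(Array(0.0, 0.0), 10, Array(1.0, 1.0), 5, 1.0)
{% endhighlight %}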
@@ -472,9 +472,9 @@ import org.apache.spark.mllib.clustering.StreamingKMeans
 
 {% endhighlight %}
 
-Then we make an input stream of vectors for training, as well as a stream of labeled data
-points for testing. We assume a StreamingContext `ssc` has been created, see
-[Spark Streaming Programming Guide](streaming-programming-guide.html#initializing) for more info.
+Then we make an input stream of vectors for training, as well as a stream of labeled data
+points for testing. We assume a StreamingContext `ssc` has been created, see
+[Spark Streaming Programming Guide](streaming-programming-guide.html#initializing) for more info.
 
 {% highlight scala %}
 
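The stream construction that the diff skips between this hunk and the next typically builds a vector stream for training and a labeled-point stream for testing; a sketch (the `StreamingKMeans` settings are illustrative, and `ssc` is the StreamingContext assumed above):

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Training data: text files of vectors formatted like [x1, x2, x3].
val trainingData = ssc.textFileStream("/training/data/dir").map(Vectors.parse)
// Test data: text files of labeled points formatted like (y, [x1, x2, x3]).
val testData = ssc.textFileStream("/testing/data/dir").map(LabeledPoint.parse)

// Streaming k-means with two clusters in three dimensions and no forgetting.
val model = new StreamingKMeans()
  .setK(2)
  .setDecayFactor(1.0)
  .setRandomCenters(3, 0.0)
{% endhighlight %}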

@@ -496,24 +496,24 @@ val model = new StreamingKMeans()
 
 {% endhighlight %}
 
-Now register the streams for training and testing and start the job, printing
+Now register the streams for training and testing and start the job, printing
 the predicted cluster assignments on new data points as they arrive.
 
 {% highlight scala %}
 
 model.trainOn(trainingData)
-model.predictOnValues(testData).print()
+model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
 
 ssc.start()
 ssc.awaitTermination()
- 
+
 {% endhighlight %}
 
-As you add new text files with data the cluster centers will update. Each training
+As you add new text files with data the cluster centers will update. Each training
 point should be formatted as `[x1, x2, x3]`, and each test data point
-should be formatted as `(y, [x1, x2, x3])`, where `y` is some useful label or identifier
-(e.g. a true category assignment). Anytime a text file is placed in `/training/data/dir`
-the model will update. Anytime a text file is placed in `/testing/data/dir`
+should be formatted as `(y, [x1, x2, x3])`, where `y` is some useful label or identifier
+(e.g. a true category assignment). Anytime a text file is placed in `/training/data/dir`
+the model will update. Anytime a text file is placed in `/testing/data/dir`
 you will see predictions. With new data, the cluster centers will change!
 
 </div>
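
The substantive fix in this commit is the `predictOnValues` call: `StreamingKMeans.predictOnValues` expects a stream of (key, feature vector) pairs, while `testData` is a stream of `LabeledPoint`s, so each point is first mapped to a `(label, features)` pair. In sketch form, using the streams and model from the example:

{% highlight scala %}
// testData is a DStream[LabeledPoint]; predictOnValues needs (key, Vector) pairs,
// so pair each point's label with its feature vector before predicting.
val keyedTest = testData.map(lp => (lp.label, lp.features))

model.trainOn(trainingData)
model.predictOnValues(keyedTest).print()  // prints (label, predicted cluster index) pairs
{% endhighlight %}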
