[SPARK-5879][MLLIB] update PIC user guide and add a Java example
Updated PIC user guide to reflect API changes and added a simple Java example. The API is still not very Java-friendly. I created SPARK-5990 for this issue.
Author: Xiangrui Meng <[email protected]>
Closes #4680 from mengxr/SPARK-5897 and squashes the following commits:
847d216 [Xiangrui Meng] apache header
87719a2 [Xiangrui Meng] remove PIC image
2dd921f [Xiangrui Meng] update PIC user guide and add a Java example
(cherry picked from commit d12d2ad)
Signed-off-by: Xiangrui Meng <[email protected]>
docs/mllib-clustering.md: 82 additions & 13 deletions
@@ -270,23 +270,92 @@ for i in range(2):
 
 ## Power iteration clustering (PIC)
 
-Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
+Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering vertices of a
+graph given pairwise similarities as edge properties,
+described in [Lin and Cohen, Power Iteration Clustering](http://www.icml2010.org/papers/387.pdf).
+It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via
+[power iteration](http://en.wikipedia.org/wiki/Power_iteration) and uses it to cluster vertices.
+MLlib includes an implementation of PIC using GraphX as its backend.
+It takes an `RDD` of `(srcId, dstId, similarity)` tuples and outputs a model with the clustering assignments.
+The similarities must be nonnegative.
+PIC assumes that the similarity measure is symmetric.
+A pair `(srcId, dstId)` regardless of the ordering should appear at most once in the input data.
+If a pair is missing from input, their similarity is treated as zero.
+MLlib's PIC implementation takes the following (hyper-)parameters:
+
+* `k`: number of clusters
+* `maxIterations`: maximum number of power iterations
+* `initializationMode`: initialization mode. This can be either "random", which is the default,
+  to use a random vector as vertex properties, or "degree" to use normalized sum similarities.
 
-* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
-* calculates the principal eigenvalue and eigenvector
-* Clusters each of the input points according to their principal eigenvector component value
+**Examples**
+
+In the following, we show code snippets to demonstrate how to use PIC in MLlib.
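For readers who want to see the idea behind the updated guide text without a Spark cluster, the following is a minimal pure-Python sketch of power iteration clustering on a small dense graph. It is not MLlib's implementation and does not use its API. The function names (`build_affinity`, `init_vector`, `power_iterate`, `kmeans_1d`, `pic`) are hypothetical, vertex ids are assumed to be integers `0..n-1`, and the graph is assumed to have at least one edge. Row normalization (`D^-1 W`), L1 rescaling of the vector, and a simple 1-D k-means for the final step are assumptions based on the Lin and Cohen paper rather than on anything stated in this diff.

```python
import random


def build_affinity(edges, n):
    """Dense symmetric affinity matrix from (src, dst, similarity) tuples."""
    seen = set()
    w = [[0.0] * n for _ in range(n)]
    for src, dst, sim in edges:
        if sim < 0:
            raise ValueError("similarities must be nonnegative")
        pair = (min(src, dst), max(src, dst))
        if pair in seen:
            raise ValueError("pair %r appears more than once" % (pair,))
        seen.add(pair)
        # The measure is symmetric; pairs missing from the input stay at zero.
        w[src][dst] = w[dst][src] = sim
    return w


def init_vector(w, mode="random", seed=0):
    """Starting vector: 'degree' uses row sums, 'random' uses random values."""
    if mode == "degree":
        v = [sum(row) for row in w]
    elif mode == "random":
        rng = random.Random(seed)
        v = [rng.random() for _ in w]
    else:
        raise ValueError("unknown initialization mode: " + mode)
    total = sum(v)
    return [x / total for x in v]


def power_iterate(w, v, max_iterations):
    """Repeat v <- D^-1 W v, rescaling to unit L1 norm after each step."""
    n = len(w)
    degrees = [sum(row) for row in w]
    for _ in range(max_iterations):
        v = [
            sum(w[i][j] * v[j] for j in range(n)) / degrees[i]
            if degrees[i] > 0 else 0.0
            for i in range(n)
        ]
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]
    return v


def kmeans_1d(values, k, rounds=20):
    """Tiny deterministic 1-D k-means; centers start evenly spaced in range."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * c / max(k - 1, 1) for c in range(k)]
    labels = [0] * len(values)
    for _ in range(rounds):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in values]
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def pic(edges, n, k, max_iterations, initialization_mode="random"):
    """Cluster vertices 0..n-1 into k groups from pairwise similarities."""
    w = build_affinity(edges, n)
    v = power_iterate(w, init_vector(w, initialization_mode), max_iterations)
    return kmeans_1d(v, k)


# Two triangles with different edge weights, joined by one weak edge.
edges = [
    (0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
    (3, 4, 0.5), (3, 5, 0.5), (4, 5, 0.5),
    (2, 3, 0.01),
]
labels = pic(edges, n=6, k=2, max_iterations=10, initialization_mode="degree")
print(labels)
```

The iteration count is deliberately small: run long enough, the vector converges to a constant and the cluster structure disappears, which is why the method clusters on an intermediate pseudo-eigenvector and why `maxIterations` matters as a parameter.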