Commit a64f374

[SPARK-5879][MLLIB] update PIC user guide and add a Java example
Updated PIC user guide to reflect API changes and added a simple Java example. The API is still not very Java-friendly. I created SPARK-5990 for this issue.

Author: Xiangrui Meng <[email protected]>

Closes #4680 from mengxr/SPARK-5897 and squashes the following commits:

847d216 [Xiangrui Meng] apache header
87719a2 [Xiangrui Meng] remove PIC image
2dd921f [Xiangrui Meng] update PIC user guide and add a Java example

(cherry picked from commit d12d2ad)
Signed-off-by: Xiangrui Meng <[email protected]>
1 parent 470cba8 commit a64f374

4 files changed: +149 -13 lines changed
-243 KB
Binary file not shown.

docs/mllib-clustering.md

Lines changed: 82 additions & 13 deletions
@@ -270,23 +270,92 @@ for i in range(2):

 ## Power iteration clustering (PIC)

-Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
+Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering vertices of a
+graph given pairwise similarities as edge properties,
+described in [Lin and Cohen, Power Iteration Clustering](http://www.icml2010.org/papers/387.pdf).
+It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via
+[power iteration](http://en.wikipedia.org/wiki/Power_iteration) and uses it to cluster vertices.
+MLlib includes an implementation of PIC using GraphX as its backend.
+It takes an `RDD` of `(srcId, dstId, similarity)` tuples and outputs a model with the clustering assignments.
+The similarities must be nonnegative.
+PIC assumes that the similarity measure is symmetric.
+A pair `(srcId, dstId)` regardless of the ordering should appear at most once in the input data.
+If a pair is missing from input, their similarity is treated as zero.
+MLlib's PIC implementation takes the following (hyper-)parameters:
+
+* `k`: number of clusters
+* `maxIterations`: maximum number of power iterations
+* `initializationMode`: initialization mode. This can be either "random", which is the default,
+  to use a random vector as vertex properties, or "degree" to use normalized sum similarities.

-* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
-* calculates the principal eigenvalue and eigenvector
-* Clusters each of the input points according to their principal eigenvector component value
+**Examples**
+
+In the following, we show code snippets to demonstrate how to use PIC in MLlib.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`PowerIterationClustering`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClustering)
+implements the PIC algorithm.
+It takes an `RDD` of `(srcId: Long, dstId: Long, similarity: Double)` tuples representing the
+affinity matrix.
+Calling `PowerIterationClustering.run` returns a
+[`PowerIterationClusteringModel`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel),
+which contains the computed clustering assignments.

-Details of this algorithm are found within [Power Iteration Clustering, Lin and Cohen]{www.icml2010.org/papers/387.pdf}
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.mllib.linalg.Vectors

-Example outputs for a dataset inspired by the paper - but with five clusters instead of three- have he following output from our implementation:
+val similarities: RDD[(Long, Long, Double)] = ...
+
+val pic = new PowerIterationClustering()
+  .setK(3)
+  .setMaxIterations(20)
+val model = pic.run(similarities)
+
+model.assignments.foreach { case (vertexId, clusterId) =>
+  println(s"$vertexId -> $clusterId")
+}
+{% endhighlight %}
+
+A full example that produces the experiment described in the PIC paper can be found under
+[`examples/`](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala).
+
+</div>

-<p style="text-align: center;">
-  <img src="img/PIClusteringFiveCirclesInputsAndOutputs.png"
-       title="The Property Graph"
-       alt="The Property Graph"
-       width="50%" />
-  <!-- Images are downsized intentionally to improve quality on retina displays -->
-</p>
+<div data-lang="java" markdown="1">
+
+[`PowerIterationClustering`](api/java/org/apache/spark/mllib/clustering/PowerIterationClustering.html)
+implements the PIC algorithm.
+It takes a `JavaRDD` of `(srcId: Long, dstId: Long, similarity: Double)` tuples representing the
+affinity matrix.
+Calling `PowerIterationClustering.run` returns a
+[`PowerIterationClusteringModel`](api/java/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html)
+which contains the computed clustering assignments.
+
+{% highlight java %}
+import scala.Tuple2;
+import scala.Tuple3;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.mllib.clustering.PowerIterationClustering;
+import org.apache.spark.mllib.clustering.PowerIterationClusteringModel;
+
+JavaRDD<Tuple3<Long, Long, Double>> similarities = ...
+
+PowerIterationClustering pic = new PowerIterationClustering()
+  .setK(2)
+  .setMaxIterations(10);
+PowerIterationClusteringModel model = pic.run(similarities);
+
+for (Tuple2<Object, Object> assignment: model.assignments().toJavaRDD().collect()) {
+  System.out.println(assignment._1() + " -> " + assignment._2());
+}
+{% endhighlight %}
+</div>
+
+</div>

 ## Latent Dirichlet allocation (LDA)

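Reviewer note: the updated guide text describes PIC as power iteration on the normalized affinity matrix, followed by clustering of the resulting pseudo-eigenvector. The following is a minimal, self-contained Java sketch of that idea on a small dense matrix. It is not MLlib's GraphX-based implementation: the class `PicSketch` and all of its method names are hypothetical, the iteration count is fixed with no convergence test, and a simple one-dimensional k-means (centers seeded evenly between the minimum and maximum entries) stands in for the final clustering step. It assumes a symmetric, nonnegative matrix with at least one positive entry.

```java
import java.util.Arrays;

/** Hypothetical dense-matrix illustration of PIC; not the MLlib implementation. */
public class PicSketch {

  /** Row-normalizes the affinity matrix: W = D^-1 A, where D holds the row sums. */
  public static double[][] normalize(double[][] a) {
    int n = a.length;
    double[][] w = new double[n][n];
    for (int i = 0; i < n; i++) {
      double degree = Arrays.stream(a[i]).sum();
      for (int j = 0; j < n; j++) {
        w[i][j] = degree > 0 ? a[i][j] / degree : 0.0;
      }
    }
    return w;
  }

  /** "degree"-style start vector: summed similarities per vertex, scaled to sum to 1. */
  public static double[] degreeInit(double[][] a) {
    double[] v = new double[a.length];
    double total = 0.0;
    for (int i = 0; i < a.length; i++) {
      v[i] = Arrays.stream(a[i]).sum();
      total += v[i];
    }
    for (int i = 0; i < v.length; i++) {
      v[i] /= total;
    }
    return v;
  }

  /** Applies v <- W v / ||W v||_1 a fixed number of times. */
  public static double[] powerIterate(double[][] w, double[] v0, int maxIterations) {
    double[] v = v0.clone();
    for (int iter = 0; iter < maxIterations; iter++) {
      double[] next = new double[v.length];
      double norm = 0.0;
      for (int i = 0; i < v.length; i++) {
        for (int j = 0; j < v.length; j++) {
          next[i] += w[i][j] * v[j];
        }
        norm += Math.abs(next[i]);
      }
      for (int i = 0; i < v.length; i++) {
        next[i] /= norm;
      }
      v = next;
    }
    return v;
  }

  /** One-dimensional k-means over the entries of v. */
  public static int[] cluster1d(double[] v, int k, int rounds) {
    double min = Arrays.stream(v).min().getAsDouble();
    double max = Arrays.stream(v).max().getAsDouble();
    double[] centers = new double[k];
    for (int c = 0; c < k; c++) {
      centers[c] = k == 1 ? min : min + (max - min) * c / (k - 1);
    }
    int[] assign = new int[v.length];
    for (int r = 0; r < rounds; r++) {
      // Assignment step: nearest center wins.
      for (int i = 0; i < v.length; i++) {
        int best = 0;
        for (int c = 1; c < k; c++) {
          if (Math.abs(v[i] - centers[c]) < Math.abs(v[i] - centers[best])) {
            best = c;
          }
        }
        assign[i] = best;
      }
      // Update step: move each non-empty center to the mean of its members.
      double[] sum = new double[k];
      int[] count = new int[k];
      for (int i = 0; i < v.length; i++) {
        sum[assign[i]] += v[i];
        count[assign[i]]++;
      }
      for (int c = 0; c < k; c++) {
        if (count[c] > 0) {
          centers[c] = sum[c] / count[c];
        }
      }
    }
    return assign;
  }

  /** Convenience wrapper: normalize, iterate from the degree vector, then cluster. */
  public static int[] run(double[][] a, int k, int maxIterations) {
    double[] v = powerIterate(normalize(a), degreeInit(a), maxIterations);
    return cluster1d(v, k, 10);
  }

  public static void main(String[] args) {
    // Two disconnected triangles with different edge weights.
    double[][] a = {
      {0, 1, 2, 0, 0, 0},
      {1, 0, 1, 0, 0, 0},
      {2, 1, 0, 0, 0, 0},
      {0, 0, 0, 0, 5, 5},
      {0, 0, 0, 5, 0, 5},
      {0, 0, 0, 5, 5, 0}
    };
    System.out.println(Arrays.toString(run(a, 2, 10)));
  }
}
```

The dense representation is only for readability; the guide's `(srcId, dstId, similarity)` tuples are the sparse equivalent, with missing pairs treated as zero.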
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib;
+
+import scala.Tuple2;
+import scala.Tuple3;
+
+import com.google.common.collect.Lists;
+
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.clustering.PowerIterationClustering;
+import org.apache.spark.mllib.clustering.PowerIterationClusteringModel;
+
+/**
+ * Java example for graph clustering using power iteration clustering (PIC).
+ */
+public class JavaPowerIterationClusteringExample {
+  public static void main(String[] args) {
+    SparkConf sparkConf = new SparkConf().setAppName("JavaPowerIterationClusteringExample");
+    JavaSparkContext sc = new JavaSparkContext(sparkConf);
+
+    @SuppressWarnings("unchecked")
+    JavaRDD<Tuple3<Long, Long, Double>> similarities = sc.parallelize(Lists.newArrayList(
+      new Tuple3<Long, Long, Double>(0L, 1L, 0.9),
+      new Tuple3<Long, Long, Double>(1L, 2L, 0.9),
+      new Tuple3<Long, Long, Double>(2L, 3L, 0.9),
+      new Tuple3<Long, Long, Double>(3L, 4L, 0.1),
+      new Tuple3<Long, Long, Double>(4L, 5L, 0.9)));
+
+    PowerIterationClustering pic = new PowerIterationClustering()
+      .setK(2)
+      .setMaxIterations(10);
+    PowerIterationClusteringModel model = pic.run(similarities);
+
+    for (Tuple2<Object, Object> assignment: model.assignments().toJavaRDD().collect()) {
+      System.out.println(assignment._1() + " -> " + assignment._2());
+    }
+
+    sc.stop();
+  }
+}
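Reviewer note: the user guide in this commit states that similarities must be nonnegative and that a pair `(srcId, dstId)`, regardless of ordering, should appear at most once. A caller may want to check those constraints before building the RDD. The sketch below shows one way to do that in plain Java, without Spark; `SimilarityInputCheck`, its nested `Edge` class (a stand-in for `Tuple3<Long, Long, Double>`), and `findProblems` are hypothetical names, not part of MLlib, and the source does not say whether MLlib performs any such check itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical pre-flight check for PIC input; not part of MLlib. */
public class SimilarityInputCheck {

  /** Stand-in for a (srcId, dstId, similarity) tuple. */
  public static final class Edge {
    public final long src;
    public final long dst;
    public final double similarity;

    public Edge(long src, long dst, double similarity) {
      this.src = src;
      this.dst = dst;
      this.similarity = similarity;
    }
  }

  /** Returns one message per violated constraint; an empty list means the input looks fine. */
  public static List<String> findProblems(List<Edge> edges) {
    List<String> problems = new ArrayList<>();
    Set<List<Long>> seen = new HashSet<>();
    for (Edge e : edges) {
      // The negated comparison also rejects NaN.
      if (!(e.similarity >= 0.0)) {
        problems.add("negative or NaN similarity: " + e.src + " -> " + e.dst);
      }
      // Order the ids so that (a, b) and (b, a) map to the same key.
      List<Long> key = Arrays.asList(Math.min(e.src, e.dst), Math.max(e.src, e.dst));
      if (!seen.add(key)) {
        problems.add("duplicate pair: " + key.get(0) + " <-> " + key.get(1));
      }
    }
    return problems;
  }

  public static void main(String[] args) {
    // Same five edges as the Java example in this commit.
    List<Edge> edges = Arrays.asList(
      new Edge(0L, 1L, 0.9),
      new Edge(1L, 2L, 0.9),
      new Edge(2L, 3L, 0.9),
      new Edge(3L, 4L, 0.1),
      new Edge(4L, 5L, 0.9));
    System.out.println(findProblems(edges).isEmpty());
  }
}
```

The check collects ids on the driver, so it only suits small inputs like the example's; for a large RDD the same key normalization would have to run as a distributed job.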

mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala

Lines changed: 9 additions & 0 deletions
@@ -17,6 +17,7 @@

 package org.apache.spark.mllib.clustering

+import org.apache.spark.api.java.JavaRDD
 import org.apache.spark.{Logging, SparkException}
 import org.apache.spark.annotation.Experimental
 import org.apache.spark.graphx._
@@ -115,6 +116,14 @@ class PowerIterationClustering private[clustering] (
     pic(w0)
   }

+  /**
+   * A Java-friendly version of [[PowerIterationClustering.run]].
+   */
+  def run(similarities: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)])
+    : PowerIterationClusteringModel = {
+    run(similarities.rdd.asInstanceOf[RDD[(Long, Long, Double)]])
+  }
+
   /**
    * Runs the PIC algorithm.
    *
