Binary file not shown.
95 changes: 82 additions & 13 deletions docs/mllib-clustering.md
@@ -270,23 +270,92 @@ for i in range(2):

## Power iteration clustering (PIC)

Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering vertices of a
graph given pairwise similarities as edge properties,
described in [Lin and Cohen, Power Iteration Clustering](http://www.icml2010.org/papers/387.pdf).
It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via
[power iteration](http://en.wikipedia.org/wiki/Power_iteration) and uses it to cluster vertices.
MLlib includes an implementation of PIC using GraphX as its backend.
It takes an `RDD` of `(srcId, dstId, similarity)` tuples and outputs a model with the clustering assignments.
The similarities must be nonnegative.
PIC assumes that the similarity measure is symmetric.
A pair `(srcId, dstId)`, regardless of the ordering, should appear at most once in the input data.
If a pair is missing from the input, its similarity is treated as zero.
MLlib's PIC implementation takes the following (hyper-)parameters:

* `k`: number of clusters
* `maxIterations`: maximum number of power iterations
* `initializationMode`: initialization mode. This can be either "random", which is the default,
to use a random vector as vertex properties, or "degree" to use normalized sum similarities.

* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
* calculates the principal eigenvalue and eigenvector
* Clusters each of the input points according to their principal eigenvector component value
**Examples**

In the following, we show code snippets to demonstrate how to use PIC in MLlib.

<div class="codetabs">
<div data-lang="scala" markdown="1">

[`PowerIterationClustering`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClustering)
implements the PIC algorithm.
It takes an `RDD` of `(srcId: Long, dstId: Long, similarity: Double)` tuples representing the
affinity matrix.
Calling `PowerIterationClustering.run` returns a
[`PowerIterationClusteringModel`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel),
which contains the computed clustering assignments.

Details of this algorithm are found within [Power Iteration Clustering, Lin and Cohen](http://www.icml2010.org/papers/387.pdf)
{% highlight scala %}
import org.apache.spark.mllib.clustering.PowerIterationClustering
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

Example outputs for a dataset inspired by the paper, but with five clusters instead of three, have the following output from our implementation:
val similarities: RDD[(Long, Long, Double)] = ...

val pic = new PowerIterationClustering()
.setK(3)
.setMaxIterations(20)
val model = pic.run(similarities)

model.assignments.foreach { case (vertexId, clusterId) =>
println(s"$vertexId -> $clusterId")
}
{% endhighlight %}

A full example that produces the experiment described in the PIC paper can be found under
[`examples/`](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala).

</div>

<p style="text-align: center;">
<img src="img/PIClusteringFiveCirclesInputsAndOutputs.png"
title="The Property Graph"
alt="The Property Graph"
width="50%" />
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
<div data-lang="java" markdown="1">

[`PowerIterationClustering`](api/java/org/apache/spark/mllib/clustering/PowerIterationClustering.html)
implements the PIC algorithm.
It takes a `JavaRDD` of `(srcId: Long, dstId: Long, similarity: Double)` tuples representing the
affinity matrix.
Calling `PowerIterationClustering.run` returns a
[`PowerIterationClusteringModel`](api/java/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html)
which contains the computed clustering assignments.

{% highlight java %}
import scala.Tuple2;
import scala.Tuple3;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.PowerIterationClustering;
import org.apache.spark.mllib.clustering.PowerIterationClusteringModel;

JavaRDD<Tuple3<Long, Long, Double>> similarities = ...

PowerIterationClustering pic = new PowerIterationClustering()
.setK(2)
.setMaxIterations(10);
PowerIterationClusteringModel model = pic.run(similarities);

for (Tuple2<Object, Object> assignment: model.assignments().toJavaRDD().collect()) {
System.out.println(assignment._1() + " -> " + assignment._2());
}
{% endhighlight %}
</div>

</div>
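
As a rough illustration of the power-iteration step described in this section, the following plain-Java sketch (no Spark) builds a dense affinity matrix from `(srcId, dstId, similarity)` triples, row-normalizes it, and iterates from a degree-based starting vector. The class and method names (`PowerIterationSketch`, `affinityMatrix`, `powerIterate`) are hypothetical and not part of MLlib. The dense matrix, the L1 normalization, and the fixed iteration count are simplifying assumptions, not a description of how the GraphX-backed implementation works.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of MLlib.
class PowerIterationSketch {

  /** Builds a dense symmetric affinity matrix from {src, dst, similarity} rows. */
  static double[][] affinityMatrix(int n, double[][] edges) {
    double[][] a = new double[n][n];
    for (double[] e : edges) {
      int i = (int) e[0];
      int j = (int) e[1];
      a[i][j] = e[2];
      a[j][i] = e[2]; // the similarity measure is assumed symmetric
    }
    return a;
  }

  /** Runs a fixed number of power iterations on W = D^{-1} A. */
  static double[] powerIterate(double[][] a, int maxIterations) {
    int n = a.length;
    double[] degree = new double[n];
    double total = 0.0;
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        degree[i] += a[i][j];
      }
      if (degree[i] <= 0.0) {
        // Assumption of this sketch: every vertex has some positive similarity.
        throw new IllegalArgumentException("vertex " + i + " has zero degree");
      }
      total += degree[i];
    }

    // "degree"-style start: each entry is the vertex's share of the total similarity.
    double[] v = new double[n];
    for (int i = 0; i < n; i++) {
      v[i] = degree[i] / total;
    }

    for (int iter = 0; iter < maxIterations; iter++) {
      double[] next = new double[n];
      double norm = 0.0;
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          next[i] += a[i][j] / degree[i] * v[j]; // row i of W times v
        }
        norm += Math.abs(next[i]);
      }
      for (int i = 0; i < n; i++) {
        next[i] /= norm; // rescale so the entries sum to 1 in absolute value
      }
      v = next;
    }
    return v;
  }

  public static void main(String[] args) {
    // Same toy edges as the Java example in this change.
    double[][] edges = {
      {0, 1, 0.9}, {1, 2, 0.9}, {2, 3, 0.9}, {3, 4, 0.1}, {4, 5, 0.9}
    };
    double[] v = powerIterate(affinityMatrix(6, edges), 10);
    System.out.println(Arrays.toString(v));
  }
}
```

In the paper, the entries of the resulting vector are then clustered with k-means to produce the assignments; that step is omitted from this sketch.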

## Latent Dirichlet allocation (LDA)

@@ -0,0 +1,58 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.examples.mllib;

import scala.Tuple2;
import scala.Tuple3;

import com.google.common.collect.Lists;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.PowerIterationClustering;
import org.apache.spark.mllib.clustering.PowerIterationClusteringModel;

/**
* Java example for graph clustering using power iteration clustering (PIC).
*/
public class JavaPowerIterationClusteringExample {
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("JavaPowerIterationClusteringExample");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

@SuppressWarnings("unchecked")
JavaRDD<Tuple3<Long, Long, Double>> similarities = sc.parallelize(Lists.newArrayList(
new Tuple3<Long, Long, Double>(0L, 1L, 0.9),
new Tuple3<Long, Long, Double>(1L, 2L, 0.9),
new Tuple3<Long, Long, Double>(2L, 3L, 0.9),
new Tuple3<Long, Long, Double>(3L, 4L, 0.1),
new Tuple3<Long, Long, Double>(4L, 5L, 0.9)));

PowerIterationClustering pic = new PowerIterationClustering()
.setK(2)
.setMaxIterations(10);
PowerIterationClusteringModel model = pic.run(similarities);

for (Tuple2<Object, Object> assignment: model.assignments().toJavaRDD().collect()) {
System.out.println(assignment._1() + " -> " + assignment._2());
}

sc.stop();
}
}
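
The documentation in this change requires nonnegative similarities and at most one entry per unordered `(srcId, dstId)` pair. A minimal plain-Java sketch of checking those two constraints before building the input is shown below. The `SimilarityInputCheck` class, its `Edge` holder, and `firstViolation` are hypothetical helpers introduced only for illustration; MLlib does not provide them.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of MLlib.
class SimilarityInputCheck {

  /** Simple holder for one (srcId, dstId, similarity) triple. */
  static final class Edge {
    final long src;
    final long dst;
    final double similarity;

    Edge(long src, long dst, double similarity) {
      this.src = src;
      this.dst = dst;
      this.similarity = similarity;
    }
  }

  /** Returns null if the input satisfies both constraints, else a description of the first violation. */
  static String firstViolation(List<Edge> edges) {
    Set<String> seen = new HashSet<>();
    for (Edge e : edges) {
      if (Double.isNaN(e.similarity) || e.similarity < 0.0) {
        return "similarity must be nonnegative: (" + e.src + ", " + e.dst + ")";
      }
      // Order the ids so that (a, b) and (b, a) map to the same key.
      long lo = Math.min(e.src, e.dst);
      long hi = Math.max(e.src, e.dst);
      if (!seen.add(lo + "," + hi)) {
        return "pair appears more than once: (" + lo + ", " + hi + ")";
      }
    }
    return null;
  }

  public static void main(String[] args) {
    List<Edge> edges = Arrays.asList(
        new Edge(0L, 1L, 0.9),
        new Edge(1L, 2L, 0.9),
        new Edge(2L, 1L, 0.9)); // same unordered pair as the previous entry
    System.out.println(firstViolation(edges));
  }
}
```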
@@ -17,6 +17,7 @@

package org.apache.spark.mllib.clustering

import org.apache.spark.api.java.JavaRDD
import org.apache.spark.{Logging, SparkException}
import org.apache.spark.annotation.Experimental
import org.apache.spark.graphx._
@@ -115,6 +116,14 @@ class PowerIterationClustering private[clustering] (
pic(w0)
}

/**
* A Java-friendly version of [[PowerIterationClustering.run]].
*/
def run(similarities: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)])
: PowerIterationClusteringModel = {
run(similarities.rdd.asInstanceOf[RDD[(Long, Long, Double)]])
}

/**
* Runs the PIC algorithm.
*