From b8b9f9ac6b282e40d220fcfbb24a5c35a8decec0 Mon Sep 17 00:00:00 2001 From: Feynman Liang Date: Mon, 17 Aug 2015 15:20:10 -0700 Subject: [PATCH 1/3] Adds new LDA features to user guide --- docs/mllib-clustering.md | 111 +++++++++++++++--- .../spark/mllib/clustering/LDAModel.scala | 1 - .../spark/mllib/clustering/LDASuite.scala | 1 + 3 files changed, 98 insertions(+), 15 deletions(-) diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index bb875ae2ae6c..8c1562c9a1a5 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -443,23 +443,106 @@ LDA can be thought of as a clustering algorithm as follows: * Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated. -LDA takes in a collection of documents as vectors of word counts. -It supports different inference algorithms via `setOptimizer` function. EMLDAOptimizer learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) -on the likelihood function and yields comprehensive results, while OnlineLDAOptimizer uses iterative mini-batch sampling for [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory friendly. After fitting on the documents, LDA provides: - -* Topics: Inferred topics, each of which is a probability distribution over terms (words). -* Topic distributions for documents: For each non empty document in the training set, LDA gives a probability distribution over topics. (EM only). Note that for empty documents, we don't create the topic distributions. (EM only) +LDA supports different inference algorithms via `setOptimizer` function. +`EMLDAOptimizer` learns clustering using +[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) +on the likelihood function and yields comprehensive results, while +`OnlineLDAOptimizer` uses iterative mini-batch sampling for [online +variational +inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) +and is generally memory friendly. -LDA takes the following parameters: +LDA takes in a collection of documents as vectors of word counts and the +following parameters: * `k`: Number of topics (i.e., cluster centers) -* `maxIterations`: Limit on the number of iterations of EM used for learning -* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions. -* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions. -* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery. - -*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet -support prediction on new documents, and it does not have a Python API. These will be added in the future. +* `LDAOptimizer`: Optimizer to use for learning the LDA model, either +`EMLDAOptimizer` or `OnlineLDAOptimizer` +* `docConcentration`: Dirichlet parameter for prior over documents' +distributions over topics. 
Larger values encourage smoother inferred +distributions. +* `topicConcentration`: Dirichlet parameter for prior over topics' +distributions over terms (words). Larger values encourage smoother +inferred distributions. +* `maxIterations`: Limit on the number of iterations. +* `checkpointInterval`: If using checkpointing (set in the Spark +configuration), this parameter specifies the frequency with which +checkpoints will be created. If `maxIterations` is large, using +checkpointing can help reduce shuffle file sizes on disk and help with +failure recovery. + + +All of MLlib's LDA models support: + +* `describeTopics(n: Int)`: Prints `n` of the inferred topics, each of +which is a probability distribution over terms (words). +* `topicsMatrix`: For each non empty document in the +training set, LDA gives a probability distribution over topics. Note +that for empty documents, we don't create the topic distributions. + +*Note*: LDA is still an experimental feature under active development. +As a result, certain features are only available in one of the two +optimizers / models generated by the optimizer. The following +discussion will describe each optimizer/model pair separately. + +**EMLDAOptimizer and DistributedLDAModel** + +For the parameters provided to `LDA`: + +* `docConcentration`: Only symmetric priors are supported, so all values +in the provided `k`-dimensional vector must be identical. All values +must also be $> 1.0$. Providing `Vector(-1)` results in default behavior +(uniform `k` dimensional vector with value $(50 / k) + 1$ +* `topicConcentration`: Only symmetric priors supported. Values must be +$> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$. +* `maxIterations`: Interpreted as maximum number of EM iterations. + +`EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only +the inferred topics but also the full training corpus and topic +distributions for each document in the training corpus. A +`DistributedLDAModel` supports: + + * `topTopicsPerDocument(k)`: The top `k` topics and their weights for + each document in the training corpus + * `topDocumentsPerTopic(k)`: The top `k` documents for each topic and + the corresponding weight of the topic in the documents. + * `logPrior`: log probability of the estimated topics and + document-topic distributions given the hyperparameters + `docConcentration` and `topicConcentration` + * `logLikelihood`: log likelihood of the training corpus, given the + inferred topics and document-topic distributions + +**OnlineLDAOptimizer and LocalLDAModel** + +For the parameters provided to `LDA`: + +* `docConcentration`: Asymmetric priors can be used by passing in a +vector with values equal to the Dirichlet parameter in each of the `k` +dimensions. Values should be $>= 0$. Providing `Vector(-1)` results in +default behavior (uniform `k` dimensional vector with value $(1.0 / k)$) +* `topicConcentration`: Only symmetric priors supported. Values must be +$>= 0$. Providing `-1` results in defaulting to a value of $(1.0 / k)$. +* `maxIterations`: Interpreted as maximum number of minibatches to +submit. 
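+
+For example, an online LDA run might be configured along the following
+lines. This is only a sketch: it assumes a `SparkContext` `sc` (as in
+`spark-shell`) and a small corpus built inline, and the parameter values
+are purely illustrative.
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+// Toy corpus of (document ID, term count vector) pairs.
+val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
+  (0L, Vectors.dense(1.0, 2.0, 6.0, 0.0, 2.0)),
+  (1L, Vectors.dense(1.0, 3.0, 0.0, 1.0, 3.0)),
+  (2L, Vectors.dense(1.0, 4.0, 1.0, 0.0, 4.0))))
+
+val k = 3
+val onlineModel = new LDA()
+  .setK(k)
+  .setOptimizer(new OnlineLDAOptimizer())
+  // Asymmetric document-topic prior: one Dirichlet parameter per topic.
+  .setDocConcentration(Vectors.dense(0.5, 0.3, 0.2))
+  .setTopicConcentration(1.0 / k)
+  .setMaxIterations(50)  // number of minibatches to submit
+  .run(corpus)
+  .asInstanceOf[LocalLDAModel]
+{% endhighlight %}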
+ +In addition, `OnlineLDAOptimizer` accepts the following parameters: + +* `miniBatchFraction`: Fraction of corpus sampled and used at each +iteration +* `optimizeAlpha`: If set to true, performs maximum-likelihood +estimation of the hyperparameter `alpha` (aka `docConcentration`) +after each minibatch and returns the optimized `alpha` in the resulting +`LDAModel` +* `tau0` and `kappa`: Used for learning-rate decay, which is computed by +$(\tau_0 + iter)^{-\kappa}$ where $iter$ is the current number of iterations. + +`OnlineLDAOptimizer` produces a `LocalLDAModel`, which only stores the +inferred topics. A `LocalLDAModel` supports: + +* `logLikelihood(documents)`: Calculates a lower bound on the provided +`documents` given the inferred topics. +* `logPerplexity(documents)`: Calculates an upper bound on the +perplexity of the provided `documents` given the inferred topics. **Examples** diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala index 82f05e4a18ce..87412a3cd338 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala @@ -420,7 +420,6 @@ object LocalLDAModel extends Loader[LocalLDAModel] { } val topicsMat = Matrices.fromBreeze(brzTopics) - // TODO: initialize with docConcentration, topicConcentration, and gammaShape after SPARK-9940 new LocalLDAModel(topicsMat, docConcentration, topicConcentration, gammaShape) } } diff --git a/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala index 99e28499fd31..73d686226bad 100644 --- a/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala @@ -68,6 +68,7 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext { // Train a model val lda = new LDA() lda.setK(k) + .setOptimizer(new EMLDAOptimizer) .setDocConcentration(topicSmoothing) .setTopicConcentration(termSmoothing) .setMaxIterations(5) From 7401012e597d51113605134b3080b8735c9239c4 Mon Sep 17 00:00:00 2001 From: Feynman Liang Date: Tue, 25 Aug 2015 10:49:12 -0700 Subject: [PATCH 2/3] Code review comments --- docs/mllib-clustering.md | 62 ++++++++++++++++++++++++---------------- 1 file changed, 38 insertions(+), 24 deletions(-) diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index 8c1562c9a1a5..280dff10d6b7 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -438,10 +438,13 @@ sameModel = PowerIterationClusteringModel.load(sc, "myModelPath") is a topic model which infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm as follows: -* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset. -* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts. -* Rather than estimating a clustering using a traditional distance, LDA uses a function based - on a statistical model of how text documents are generated. +* Topics correspond to cluster centers, and documents correspond to +examples (rows) in a dataset. +* Topics and documents both exist in a feature space, where feature +vectors are vectors of word counts (bag of words). 
+* Rather than estimating a clustering using a traditional distance, LDA +uses a function based on a statistical model of how text documents are +generated. LDA supports different inference algorithms via `setOptimizer` function. `EMLDAOptimizer` learns clustering using @@ -453,10 +456,10 @@ inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory friendly. LDA takes in a collection of documents as vectors of word counts and the -following parameters: +following parameters (set using the builder pattern): * `k`: Number of topics (i.e., cluster centers) -* `LDAOptimizer`: Optimizer to use for learning the LDA model, either +* `optimizer`: Optimizer to use for learning the LDA model, either `EMLDAOptimizer` or `OnlineLDAOptimizer` * `docConcentration`: Dirichlet parameter for prior over documents' distributions over topics. Larger values encourage smoother inferred @@ -474,18 +477,25 @@ failure recovery. All of MLlib's LDA models support: -* `describeTopics(n: Int)`: Prints `n` of the inferred topics, each of -which is a probability distribution over terms (words). -* `topicsMatrix`: For each non empty document in the -training set, LDA gives a probability distribution over topics. Note -that for empty documents, we don't create the topic distributions. +* `describeTopics`: Returns the top terms and their weights for each topics +* `topicsMatrix`: Returns a `vocabSize` by `k` matrix where each column +is a topic *Note*: LDA is still an experimental feature under active development. As a result, certain features are only available in one of the two -optimizers / models generated by the optimizer. The following -discussion will describe each optimizer/model pair separately. +optimizers / models generated by the optimizer. Currently, a distributed +model can be converted into a local model (during which we assume a +uniform `docConcentration` document-topic prior), but not vice-versa. -**EMLDAOptimizer and DistributedLDAModel** +The following discussion will describe each optimizer/model pair +separately. + +**Expectation Maximization** + +Implemented in +[`EMLDAOptimizer`](api/scala/index.html#org.apache.spark.mllib.clustering.EMLDAOptimizer) +and +[`DistributedLDAModel`](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel). For the parameters provided to `LDA`: @@ -495,16 +505,16 @@ must also be $> 1.0$. Providing `Vector(-1)` results in default behavior (uniform `k` dimensional vector with value $(50 / k) + 1$ * `topicConcentration`: Only symmetric priors supported. Values must be $> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$. -* `maxIterations`: Interpreted as maximum number of EM iterations. +* `maxIterations`: The maximum number of EM iterations. `EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only the inferred topics but also the full training corpus and topic distributions for each document in the training corpus. A `DistributedLDAModel` supports: - * `topTopicsPerDocument(k)`: The top `k` topics and their weights for + * `topTopicsPerDocument`: The top topics and their weights for each document in the training corpus - * `topDocumentsPerTopic(k)`: The top `k` documents for each topic and + * `topDocumentsPerTopic`: The top documents for each topic and the corresponding weight of the topic in the documents. 
* `logPrior`: log probability of the estimated topics and document-topic distributions given the hyperparameters @@ -512,7 +522,12 @@ distributions for each document in the training corpus. A * `logLikelihood`: log likelihood of the training corpus, given the inferred topics and document-topic distributions -**OnlineLDAOptimizer and LocalLDAModel** +**Online Variational Bayes** + +Implemented in +[`OnlineLDAOptimizer`](api/scala/org/apache/spark/mllib/clustering/OnlineLDAOptimizer.html) +and +[`LocalLDAModel`](api/scala/org/apache/spark/mllib/clustering/LocalLDAModel.html). For the parameters provided to `LDA`: @@ -522,17 +537,16 @@ dimensions. Values should be $>= 0$. Providing `Vector(-1)` results in default behavior (uniform `k` dimensional vector with value $(1.0 / k)$) * `topicConcentration`: Only symmetric priors supported. Values must be $>= 0$. Providing `-1` results in defaulting to a value of $(1.0 / k)$. -* `maxIterations`: Interpreted as maximum number of minibatches to -submit. +* `maxIterations`: Maximum number of minibatches to submit. In addition, `OnlineLDAOptimizer` accepts the following parameters: * `miniBatchFraction`: Fraction of corpus sampled and used at each iteration -* `optimizeAlpha`: If set to true, performs maximum-likelihood -estimation of the hyperparameter `alpha` (aka `docConcentration`) -after each minibatch and returns the optimized `alpha` in the resulting -`LDAModel` +* `optimizeDocConcentration`: If set to true, performs maximum-likelihood +estimation of the hyperparameter `docConcentration` (aka `alpha`) +after each minibatch and sets the optimized `docConcentration` in the +returned `LocalLDAModel` * `tau0` and `kappa`: Used for learning-rate decay, which is computed by $(\tau_0 + iter)^{-\kappa}$ where $iter$ is the current number of iterations. From c8a10133ec40f4e6c76b322a7a0b416486c3d157 Mon Sep 17 00:00:00 2001 From: Feynman Liang Date: Tue, 25 Aug 2015 15:55:53 -0700 Subject: [PATCH 3/3] Code review changes --- docs/mllib-clustering.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index 280dff10d6b7..c187b246dfd8 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -477,15 +477,15 @@ failure recovery. All of MLlib's LDA models support: -* `describeTopics`: Returns the top terms and their weights for each topics +* `describeTopics`: Returns topics as arrays of most important terms and +term weights * `topicsMatrix`: Returns a `vocabSize` by `k` matrix where each column is a topic *Note*: LDA is still an experimental feature under active development. As a result, certain features are only available in one of the two optimizers / models generated by the optimizer. Currently, a distributed -model can be converted into a local model (during which we assume a -uniform `docConcentration` document-topic prior), but not vice-versa. +model can be converted into a local model, but not vice-versa. The following discussion will describe each optimizer/model pair separately.
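+
+For example, the expectation-maximization optimizer/model pair might be
+exercised end to end along the following lines. This is only a sketch:
+it assumes a `SparkContext` `sc` (as in `spark-shell`) and a small corpus
+built inline, the parameter values are purely illustrative, and the
+method names are the MLlib ones described above (`describeTopics`,
+`topDocumentsPerTopic`, `topTopicsPerDocument`, `toLocal`).
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.{DistributedLDAModel, EMLDAOptimizer, LDA}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+// Toy corpus of (document ID, term count vector) pairs.
+val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
+  (0L, Vectors.dense(1.0, 2.0, 6.0, 0.0, 2.0)),
+  (1L, Vectors.dense(1.0, 3.0, 0.0, 1.0, 3.0)),
+  (2L, Vectors.dense(1.0, 4.0, 1.0, 0.0, 4.0))))
+
+val distributedModel = new LDA()
+  .setK(3)
+  .setOptimizer(new EMLDAOptimizer)
+  .setDocConcentration(1.1)    // symmetric prior; EM requires values > 1.0
+  .setTopicConcentration(1.1)
+  .setMaxIterations(20)
+  .run(corpus)
+  .asInstanceOf[DistributedLDAModel]
+
+// Features shared by all LDA models.
+val topics = distributedModel.describeTopics(5)
+val topicsMat = distributedModel.topicsMatrix  // vocabSize x k matrix
+
+// Features specific to DistributedLDAModel.
+val topDocs = distributedModel.topDocumentsPerTopic(2)
+val topTopics = distributedModel.topTopicsPerDocument(2)
+println(s"logLikelihood = ${distributedModel.logLikelihood}, logPrior = ${distributedModel.logPrior}")
+
+// Convert to a LocalLDAModel, e.g. to bound the likelihood or perplexity of documents.
+val localModel = distributedModel.toLocal
+val likelihoodLowerBound = localModel.logLikelihood(corpus)
+val perplexityUpperBound = localModel.logPerplexity(corpus)
+{% endhighlight %}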