From b8b9f9ac6b282e40d220fcfbb24a5c35a8decec0 Mon Sep 17 00:00:00 2001 From: Feynman Liang Date: Mon, 17 Aug 2015 15:20:10 -0700 Subject: [PATCH 1/3] Adds new LDA features to user guide --- docs/mllib-clustering.md | 111 +++++++++++++++--- .../spark/mllib/clustering/LDAModel.scala | 1 - .../spark/mllib/clustering/LDASuite.scala | 1 + 3 files changed, 98 insertions(+), 15 deletions(-) diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index bb875ae2ae6c..8c1562c9a1a5 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -443,23 +443,106 @@ LDA can be thought of as a clustering algorithm as follows: * Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated. -LDA takes in a collection of documents as vectors of word counts. -It supports different inference algorithms via `setOptimizer` function. EMLDAOptimizer learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) -on the likelihood function and yields comprehensive results, while OnlineLDAOptimizer uses iterative mini-batch sampling for [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory friendly. After fitting on the documents, LDA provides: - -* Topics: Inferred topics, each of which is a probability distribution over terms (words). -* Topic distributions for documents: For each non empty document in the training set, LDA gives a probability distribution over topics. (EM only). Note that for empty documents, we don't create the topic distributions. (EM only) +LDA supports different inference algorithms via `setOptimizer` function. +`EMLDAOptimizer` learns clustering using +[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) +on the likelihood function and yields comprehensive results, while +`OnlineLDAOptimizer` uses iterative mini-batch sampling for [online +variational +inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) +and is generally memory friendly. -LDA takes the following parameters: +LDA takes in a collection of documents as vectors of word counts and the +following parameters: * `k`: Number of topics (i.e., cluster centers) -* `maxIterations`: Limit on the number of iterations of EM used for learning -* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions. -* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions. -* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery. - -*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet -support prediction on new documents, and it does not have a Python API. These will be added in the future. +* `LDAOptimizer`: Optimizer to use for learning the LDA model, either +`EMLDAOptimizer` or `OnlineLDAOptimizer` +* `docConcentration`: Dirichlet parameter for prior over documents' +distributions over topics. 
Larger values encourage smoother inferred +distributions. +* `topicConcentration`: Dirichlet parameter for prior over topics' +distributions over terms (words). Larger values encourage smoother +inferred distributions. +* `maxIterations`: Limit on the number of iterations. +* `checkpointInterval`: If using checkpointing (set in the Spark +configuration), this parameter specifies the frequency with which +checkpoints will be created. If `maxIterations` is large, using +checkpointing can help reduce shuffle file sizes on disk and help with +failure recovery. + + +All of MLlib's LDA models support: + +* `describeTopics(n: Int)`: Prints `n` of the inferred topics, each of +which is a probability distribution over terms (words). +* `topicsMatrix`: For each non empty document in the +training set, LDA gives a probability distribution over topics. Note +that for empty documents, we don't create the topic distributions. + +*Note*: LDA is still an experimental feature under active development. +As a result, certain features are only available in one of the two +optimizers / models generated by the optimizer. The following +discussion will describe each optimizer/model pair separately. + +**EMLDAOptimizer and DistributedLDAModel** + +For the parameters provided to `LDA`: + +* `docConcentration`: Only symmetric priors are supported, so all values +in the provided `k`-dimensional vector must be identical. All values +must also be $> 1.0$. Providing `Vector(-1)` results in default behavior +(uniform `k` dimensional vector with value $(50 / k) + 1$ +* `topicConcentration`: Only symmetric priors supported. Values must be +$> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$. +* `maxIterations`: Interpreted as maximum number of EM iterations. + +`EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only +the inferred topics but also the full training corpus and topic +distributions for each document in the training corpus. A +`DistributedLDAModel` supports: + + * `topTopicsPerDocument(k)`: The top `k` topics and their weights for + each document in the training corpus + * `topDocumentsPerTopic(k)`: The top `k` documents for each topic and + the corresponding weight of the topic in the documents. + * `logPrior`: log probability of the estimated topics and + document-topic distributions given the hyperparameters + `docConcentration` and `topicConcentration` + * `logLikelihood`: log likelihood of the training corpus, given the + inferred topics and document-topic distributions + +**OnlineLDAOptimizer and LocalLDAModel** + +For the parameters provided to `LDA`: + +* `docConcentration`: Asymmetric priors can be used by passing in a +vector with values equal to the Dirichlet parameter in each of the `k` +dimensions. Values should be $>= 0$. Providing `Vector(-1)` results in +default behavior (uniform `k` dimensional vector with value $(1.0 / k)$) +* `topicConcentration`: Only symmetric priors supported. Values must be +$>= 0$. Providing `-1` results in defaulting to a value of $(1.0 / k)$. +* `maxIterations`: Interpreted as maximum number of minibatches to +submit. 
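+
+For example, an online LDA run might be configured along the following
+lines. This is only a sketch: it assumes a `SparkContext` `sc` (as in
+`spark-shell`) and a small corpus built inline, and the parameter values
+are purely illustrative.
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+// Toy corpus of (document ID, term count vector) pairs.
+val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
+  (0L, Vectors.dense(1.0, 2.0, 6.0, 0.0, 2.0)),
+  (1L, Vectors.dense(1.0, 3.0, 0.0, 1.0, 3.0)),
+  (2L, Vectors.dense(1.0, 4.0, 1.0, 0.0, 4.0))))
+
+val k = 3
+val onlineModel = new LDA()
+  .setK(k)
+  .setOptimizer(new OnlineLDAOptimizer())
+  // Asymmetric document-topic prior: one Dirichlet parameter per topic.
+  .setDocConcentration(Vectors.dense(0.5, 0.3, 0.2))
+  .setTopicConcentration(1.0 / k)
+  .setMaxIterations(50)  // number of minibatches to submit
+  .run(corpus)
+  .asInstanceOf[LocalLDAModel]
+{% endhighlight %}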
+ +In addition, `OnlineLDAOptimizer` accepts the following parameters: + +* `miniBatchFraction`: Fraction of corpus sampled and used at each +iteration +* `optimizeAlpha`: If set to true, performs maximum-likelihood +estimation of the hyperparameter `alpha` (aka `docConcentration`) +after each minibatch and returns the optimized `alpha` in the resulting +`LDAModel` +* `tau0` and `kappa`: Used for learning-rate decay, which is computed by +$(\tau_0 + iter)^{-\kappa}$ where $iter$ is the current number of iterations. + +`OnlineLDAOptimizer` produces a `LocalLDAModel`, which only stores the +inferred topics. A `LocalLDAModel` supports: + +* `logLikelihood(documents)`: Calculates a lower bound on the provided +`documents` given the inferred topics. +* `logPerplexity(documents)`: Calculates an upper bound on the +perplexity of the provided `documents` given the inferred topics. **Examples** diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala index 82f05e4a18ce..87412a3cd338 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala @@ -420,7 +420,6 @@ object LocalLDAModel extends Loader[LocalLDAModel] { } val topicsMat = Matrices.fromBreeze(brzTopics) - // TODO: initialize with docConcentration, topicConcentration, and gammaShape after SPARK-9940 new LocalLDAModel(topicsMat, docConcentration, topicConcentration, gammaShape) } } diff --git a/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala index 99e28499fd31..73d686226bad 100644 --- a/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala @@ -68,6 +68,7 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext { // Train a model val lda = new LDA() lda.setK(k) + .setOptimizer(new EMLDAOptimizer) .setDocConcentration(topicSmoothing) .setTopicConcentration(termSmoothing) .setMaxIterations(5) From 7401012e597d51113605134b3080b8735c9239c4 Mon Sep 17 00:00:00 2001 From: Feynman Liang Date: Tue, 25 Aug 2015 10:49:12 -0700 Subject: [PATCH 2/3] Code review comments --- docs/mllib-clustering.md | 62 ++++++++++++++++++++++++---------------- 1 file changed, 38 insertions(+), 24 deletions(-) diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index 8c1562c9a1a5..280dff10d6b7 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -438,10 +438,13 @@ sameModel = PowerIterationClusteringModel.load(sc, "myModelPath") is a topic model which infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm as follows: -* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset. -* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts. -* Rather than estimating a clustering using a traditional distance, LDA uses a function based - on a statistical model of how text documents are generated. +* Topics correspond to cluster centers, and documents correspond to +examples (rows) in a dataset. +* Topics and documents both exist in a feature space, where feature +vectors are vectors of word counts (bag of words). 
+* Rather than estimating a clustering using a traditional distance, LDA +uses a function based on a statistical model of how text documents are +generated. LDA supports different inference algorithms via `setOptimizer` function. `EMLDAOptimizer` learns clustering using @@ -453,10 +456,10 @@ inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory friendly. LDA takes in a collection of documents as vectors of word counts and the -following parameters: +following parameters (set using the builder pattern): * `k`: Number of topics (i.e., cluster centers) -* `LDAOptimizer`: Optimizer to use for learning the LDA model, either +* `optimizer`: Optimizer to use for learning the LDA model, either `EMLDAOptimizer` or `OnlineLDAOptimizer` * `docConcentration`: Dirichlet parameter for prior over documents' distributions over topics. Larger values encourage smoother inferred @@ -474,18 +477,25 @@ failure recovery. All of MLlib's LDA models support: -* `describeTopics(n: Int)`: Prints `n` of the inferred topics, each of -which is a probability distribution over terms (words). -* `topicsMatrix`: For each non empty document in the -training set, LDA gives a probability distribution over topics. Note -that for empty documents, we don't create the topic distributions. +* `describeTopics`: Returns the top terms and their weights for each topics +* `topicsMatrix`: Returns a `vocabSize` by `k` matrix where each column +is a topic *Note*: LDA is still an experimental feature under active development. As a result, certain features are only available in one of the two -optimizers / models generated by the optimizer. The following -discussion will describe each optimizer/model pair separately. +optimizers / models generated by the optimizer. Currently, a distributed +model can be converted into a local model (during which we assume a +uniform `docConcentration` document-topic prior), but not vice-versa. -**EMLDAOptimizer and DistributedLDAModel** +The following discussion will describe each optimizer/model pair +separately. + +**Expectation Maximization** + +Implemented in +[`EMLDAOptimizer`](api/scala/index.html#org.apache.spark.mllib.clustering.EMLDAOptimizer) +and +[`DistributedLDAModel`](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel). For the parameters provided to `LDA`: @@ -495,16 +505,16 @@ must also be $> 1.0$. Providing `Vector(-1)` results in default behavior (uniform `k` dimensional vector with value $(50 / k) + 1$ * `topicConcentration`: Only symmetric priors supported. Values must be $> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$. -* `maxIterations`: Interpreted as maximum number of EM iterations. +* `maxIterations`: The maximum number of EM iterations. `EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only the inferred topics but also the full training corpus and topic distributions for each document in the training corpus. A `DistributedLDAModel` supports: - * `topTopicsPerDocument(k)`: The top `k` topics and their weights for + * `topTopicsPerDocument`: The top topics and their weights for each document in the training corpus - * `topDocumentsPerTopic(k)`: The top `k` documents for each topic and + * `topDocumentsPerTopic`: The top documents for each topic and the corresponding weight of the topic in the documents. 
* `logPrior`: log probability of the estimated topics and document-topic distributions given the hyperparameters @@ -512,7 +522,12 @@ distributions for each document in the training corpus. A * `logLikelihood`: log likelihood of the training corpus, given the inferred topics and document-topic distributions -**OnlineLDAOptimizer and LocalLDAModel** +**Online Variational Bayes** + +Implemented in +[`OnlineLDAOptimizer`](api/scala/org/apache/spark/mllib/clustering/OnlineLDAOptimizer.html) +and +[`LocalLDAModel`](api/scala/org/apache/spark/mllib/clustering/LocalLDAModel.html). For the parameters provided to `LDA`: @@ -522,17 +537,16 @@ dimensions. Values should be $>= 0$. Providing `Vector(-1)` results in default behavior (uniform `k` dimensional vector with value $(1.0 / k)$) * `topicConcentration`: Only symmetric priors supported. Values must be $>= 0$. Providing `-1` results in defaulting to a value of $(1.0 / k)$. -* `maxIterations`: Interpreted as maximum number of minibatches to -submit. +* `maxIterations`: Maximum number of minibatches to submit. In addition, `OnlineLDAOptimizer` accepts the following parameters: * `miniBatchFraction`: Fraction of corpus sampled and used at each iteration -* `optimizeAlpha`: If set to true, performs maximum-likelihood -estimation of the hyperparameter `alpha` (aka `docConcentration`) -after each minibatch and returns the optimized `alpha` in the resulting -`LDAModel` +* `optimizeDocConcentration`: If set to true, performs maximum-likelihood +estimation of the hyperparameter `docConcentration` (aka `alpha`) +after each minibatch and sets the optimized `docConcentration` in the +returned `LocalLDAModel` * `tau0` and `kappa`: Used for learning-rate decay, which is computed by $(\tau_0 + iter)^{-\kappa}$ where $iter$ is the current number of iterations. From c8a10133ec40f4e6c76b322a7a0b416486c3d157 Mon Sep 17 00:00:00 2001 From: Feynman Liang Date: Tue, 25 Aug 2015 15:55:53 -0700 Subject: [PATCH 3/3] Code review changes --- docs/mllib-clustering.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index 280dff10d6b7..c187b246dfd8 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -477,15 +477,15 @@ failure recovery. All of MLlib's LDA models support: -* `describeTopics`: Returns the top terms and their weights for each topics +* `describeTopics`: Returns topics as arrays of most important terms and +term weights * `topicsMatrix`: Returns a `vocabSize` by `k` matrix where each column is a topic *Note*: LDA is still an experimental feature under active development. As a result, certain features are only available in one of the two optimizers / models generated by the optimizer. Currently, a distributed -model can be converted into a local model (during which we assume a -uniform `docConcentration` document-topic prior), but not vice-versa. +model can be converted into a local model, but not vice-versa. The following discussion will describe each optimizer/model pair separately.
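+
+For example, the expectation-maximization optimizer/model pair might be
+exercised end to end along the following lines. This is only a sketch:
+it assumes a `SparkContext` `sc` (as in `spark-shell`) and a small corpus
+built inline, the parameter values are purely illustrative, and the
+method names are the MLlib ones described above (`describeTopics`,
+`topDocumentsPerTopic`, `topTopicsPerDocument`, `toLocal`).
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.{DistributedLDAModel, EMLDAOptimizer, LDA}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+// Toy corpus of (document ID, term count vector) pairs.
+val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
+  (0L, Vectors.dense(1.0, 2.0, 6.0, 0.0, 2.0)),
+  (1L, Vectors.dense(1.0, 3.0, 0.0, 1.0, 3.0)),
+  (2L, Vectors.dense(1.0, 4.0, 1.0, 0.0, 4.0))))
+
+val distributedModel = new LDA()
+  .setK(3)
+  .setOptimizer(new EMLDAOptimizer)
+  .setDocConcentration(1.1)    // symmetric prior; EM requires values > 1.0
+  .setTopicConcentration(1.1)
+  .setMaxIterations(20)
+  .run(corpus)
+  .asInstanceOf[DistributedLDAModel]
+
+// Features shared by all LDA models.
+val topics = distributedModel.describeTopics(5)
+val topicsMat = distributedModel.topicsMatrix  // vocabSize x k matrix
+
+// Features specific to DistributedLDAModel.
+val topDocs = distributedModel.topDocumentsPerTopic(2)
+val topTopics = distributedModel.topTopicsPerDocument(2)
+println(s"logLikelihood = ${distributedModel.logLikelihood}, logPrior = ${distributedModel.logPrior}")
+
+// Convert to a LocalLDAModel, e.g. to bound the likelihood or perplexity of documents.
+val localModel = distributedModel.toLocal
+val likelihoodLowerBound = localModel.logLikelihood(corpus)
+val perplexityUpperBound = localModel.logPerplexity(corpus)
+{% endhighlight %}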