[SPARK-9888] [MLLIB] User guide for new LDA features
* Adds two new sections to LDA's user guide, one for each optimizer/model
* Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparam optimization)
* Cleans up a TODO and sets a default parameter in LDA code
jkbradley hhbyyh
Author: Feynman Liang <[email protected]>
Closes #8254 from feynmanliang/SPARK-9888.

[Latent Dirichlet allocation (LDA)](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a topic model which infers topics from a collection of text documents.
LDA can be thought of as a clustering algorithm as follows:

* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts (bag of words).
* Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
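
To make that input shape concrete, here is a minimal sketch of such a bag-of-words corpus in Scala; the three-term vocabulary, document IDs, and counts are invented for illustration, and `sc` is assumed to be an existing `SparkContext`:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical 3-term vocabulary: 0 -> "spark", 1 -> "cluster", 2 -> "topic".
// Each document is (documentID, vector of term counts); this is the shape
// that LDA consumes.
val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
  (0L, Vectors.dense(2.0, 1.0, 0.0)),              // "spark spark cluster"
  (1L, Vectors.sparse(3, Seq((1, 1.0), (2, 3.0)))) // "cluster topic topic topic"
))
```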

LDA supports different inference algorithms via the `setOptimizer` function. `EMLDAOptimizer` learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) on the likelihood function and yields comprehensive results, while `OnlineLDAOptimizer` uses iterative mini-batch sampling for [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory friendly.
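
A minimal sketch of selecting the inference algorithm; `"em"` and `"online"` are the optimizer names `setOptimizer` accepts in `spark.mllib`, and `k = 10` plus the `setMiniBatchFraction` value are arbitrary choices:

```scala
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// By name: expectation-maximization (the default) ...
val emLDA = new LDA().setK(10).setOptimizer("em")

// ... or online variational inference, which is mini-batch based.
val onlineLDA = new LDA().setK(10).setOptimizer("online")

// Equivalently, pass an optimizer instance to tune optimizer-specific knobs.
val tunedLDA = new LDA().setK(10)
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))
```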

LDA takes in a collection of documents as vectors of word counts and the following parameters (set using the builder pattern; see the configuration sketch after this list):

* `k`: Number of topics (i.e., cluster centers)
* `optimizer`: Optimizer to use for learning the LDA model, either `EMLDAOptimizer` or `OnlineLDAOptimizer`
* `docConcentration`: Dirichlet parameter for prior over documents' distributions over topics. Larger values encourage smoother inferred distributions.
* `topicConcentration`: Dirichlet parameter for prior over topics' distributions over terms (words). Larger values encourage smoother inferred distributions.
* `maxIterations`: Limit on the number of iterations.
* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.
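
Putting the parameters together, here is a sketch of configuring and fitting LDA with the builder pattern, reusing the `corpus` from the earlier sketch; the specific values are arbitrary:

```scala
import org.apache.spark.mllib.clustering.{LDA, LDAModel}

// Each setter returns the LDA instance, so the calls chain.
val lda = new LDA()
  .setK(10)                    // number of topics
  .setOptimizer("online")      // or "em"
  .setDocConcentration(1.1)    // prior over documents' topic distributions
  .setTopicConcentration(1.1)  // prior over topics' term distributions
  .setMaxIterations(50)
  .setCheckpointInterval(10)   // effective only if checkpointing is configured

val model: LDAModel = lda.run(corpus) // corpus: RDD[(Long, Vector)]
```
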
All of MLlib's LDA models support:

* `describeTopics`: Returns topics as arrays of most important terms and term weights
* `topicsMatrix`: Returns a `vocabSize` by `k` matrix where each column is a topic

In addition, the model learned by `EMLDAOptimizer` gives, for each non-empty document in the training set, a probability distribution over topics; no topic distributions are created for empty documents.
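
A short sketch of both accessors, reusing `model` from the configuration sketch above; the printing around them is illustrative:

```scala
// Per topic: parallel arrays of the top term indices and their weights.
model.describeTopics(5).zipWithIndex.foreach {
  case ((termIndices, termWeights), topicId) =>
    val top = termIndices.zip(termWeights)
      .map { case (term, weight) => f"term $term%d: $weight%.3f" }
      .mkString(", ")
    println(s"Topic $topicId -> $top")
}

// vocabSize x k matrix; column j is topic j's distribution over terms.
val topics = model.topicsMatrix
println(s"Learned ${topics.numCols} topics over a ${topics.numRows}-term vocabulary")
```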

*Note*: LDA is still an experimental feature under active development. As a result, certain features are only available in one of the two optimizers / models generated by the optimizer. Currently, a distributed model can be converted into a local model, but not vice-versa.
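
A sketch of that one-way conversion, reusing `corpus` from the first sketch; `DistributedLDAModel` is what the EM optimizer produces, and it also exposes the per-document topic distributions described above:

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA, LocalLDAModel}

// EM yields a distributed model; run() is typed as the common LDAModel
// supertype, hence the cast.
val distModel = new LDA().setK(10).setOptimizer("em")
  .run(corpus).asInstanceOf[DistributedLDAModel]

// Topic mixture for each non-empty training document.
val docTopics = distModel.topicDistributions

// Distributed -> local works; there is no conversion in the other direction.
val localModel: LocalLDAModel = distModel.toLocal
```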

The following discussion will describe each optimizer/model pair separately.
0 commit comments