[SPARK-1405][MLLIB] LDA on GraphX #2388

witgo · 2014-09-14T16:57:02Z

Asymmetric Dirichlet priors

Asymmetric Dirichlet priors substantially increase the robustness of LDA to variations in the number of topics and to the highly skewed word frequency distributions common in natural language (Wallach et al., 2009).

Support 1000000 topics

A very large number of topics can better cover the long-tail distribution (Wang et al., 2014).

Topic de-duplication
Add documentation
Add an infer interface
Add unit tests
Add a performance test
Optimize the performance of the infer interface
Verify the correctness of the algorithm
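
To illustrate what the asymmetric Dirichlet prior in the first item changes, here is a minimal sketch of the textbook collapsed Gibbs conditional for a single token, with a per-topic `alpha` vector and a symmetric `beta`. It is not code from this PR: the object name, method name, and argument layout are assumptions made for illustration, and the caller is assumed to have already removed the current token from all counts.

```scala
object AsymmetricPriorSketch {
  /**
   * Normalized collapsed-Gibbs conditional p(z = k | everything else) for one token.
   *
   * docTopicCounts(k):  tokens in this document currently assigned to topic k
   * wordTopicCounts(k): occurrences of this word currently assigned to topic k
   * topicTotals(k):     all tokens currently assigned to topic k
   * alpha(k):           document-topic prior for topic k (asymmetric when entries differ)
   * beta:               symmetric topic-word prior
   * vocabSize:          number of distinct words
   */
  def topicPosterior(
      docTopicCounts: Array[Int],
      wordTopicCounts: Array[Int],
      topicTotals: Array[Int],
      alpha: Array[Double],
      beta: Double,
      vocabSize: Int): Array[Double] = {
    val weights = Array.tabulate(alpha.length) { k =>
      (docTopicCounts(k) + alpha(k)) *
        (wordTopicCounts(k) + beta) / (topicTotals(k) + vocabSize * beta)
    }
    val total = weights.sum
    weights.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // With all counts at zero, the posterior is driven by the prior alone.
    val zeros = Array(0, 0, 0)
    val symmetric = topicPosterior(zeros, zeros, zeros, Array(0.01, 0.01, 0.01), 0.01, 75496)
    val asymmetric = topicPosterior(zeros, zeros, zeros, Array(0.10, 0.01, 0.01), 0.01, 75496)
    println(symmetric.mkString(", "))
    println(asymmetric.mkString(", "))
  }
}
```

With a symmetric `alpha` every topic starts out equally likely; with an asymmetric `alpha` the topics with larger prior mass are favored until the observed counts dominate.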

The performance test:

| Item | Value |
| --- | --- |
| Cluster resources | 36 executors (36 cores, 216g memory) |
| Training set | 253064 documents, 29696335 words, 75496 distinct words |
| Number of iterations | `150` |
| Parameters | alpha 0.01, beta 0.01 |
| Number of topics / running time (minutes) | `2000` / `42.26`, `10000` / `49.47`, `100000` / `73.14`, `1000000` / `125.43` |
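
The running times above grow much more slowly than the topic count. The following sketch only re-expresses the reported numbers as growth factors relative to the 2000-topic run; the object and method names are made up for illustration, and no new measurements are involved.

```scala
object ScalingSketch {
  // (number of topics, running time in minutes), copied from the performance test above
  val runs: Seq[(Int, Double)] = Seq(
    2000 -> 42.26, 10000 -> 49.47, 100000 -> 73.14, 1000000 -> 125.43)

  /** For each run, the growth in topics and in running time relative to the first run. */
  def relativeGrowth(rs: Seq[(Int, Double)]): Seq[(Double, Double)] = {
    val (baseTopics, baseMinutes) = rs.head
    rs.map { case (topics, minutes) =>
      (topics.toDouble / baseTopics, minutes / baseMinutes)
    }
  }

  def main(args: Array[String]): Unit =
    relativeGrowth(runs).foreach { case (topicGrowth, timeGrowth) =>
      println(f"topics x$topicGrowth%.0f -> time x$timeGrowth%.2f")
    }
}
```

Going from 2000 to 1000000 topics is a 500-fold increase, while the reported running time grows by a factor of roughly 3.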

conf/spark-defaults.conf:

spark.akka.frameSize   20
spark.executor.instances 36
spark.rdd.compress true
spark.executor.memory   6g
spark.default.parallelism  72
spark.broadcast.blockSize  8192
spark.storage.memoryFraction 0.2
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.spark.mllib.clustering.LDAKryoRegistrator

REFERENCES:

Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang, Y. Gao, J. Zeng, Q. Yang, et al. "Peacock: Learning Long-Tail Topic Features for Industrial Applications". arXiv:1405.4402, 2014.
H. Wallach, D. Mimno, and A. McCallum. "Rethinking LDA: Why Priors Matter". In Advances in Neural Information Processing Systems 22, pages 1973–1981, 2009.

SparkQA · 2014-09-14T16:59:15Z

QA tests have started for PR 2388 at commit 9860fd1.

This patch merges cleanly.

SparkQA · 2014-09-14T17:09:12Z

QA tests have started for PR 2388 at commit 5fa02ef.

This patch merges cleanly.

SparkQA · 2014-09-14T17:50:10Z

QA tests have finished for PR 2388 at commit 9860fd1.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TopicModeling(@transient val tokens: RDD[(TopicModeling.WordId, TopicModeling.DocId)],

SparkQA · 2014-09-14T18:16:02Z

QA tests have finished for PR 2388 at commit 5fa02ef.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TopicModeling(@transient val tokens: RDD[(TopicModeling.WordId, TopicModeling.DocId)],

SparkQA · 2014-09-15T13:04:15Z

QA tests have started for PR 2388 at commit dc7ef13.

This patch merges cleanly.

SparkQA · 2014-09-15T14:12:29Z

QA tests have finished for PR 2388 at commit dc7ef13.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TopicModeling(@transient val tokens: RDD[(TopicModeling.WordId, TopicModeling.DocId)],

SparkQA · 2014-09-15T15:49:15Z

QA tests have started for PR 2388 at commit 3738e74.

This patch merges cleanly.

SparkQA · 2014-09-15T16:56:38Z

QA tests have finished for PR 2388 at commit 3738e74.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SV)],

SparkQA · 2014-09-18T17:24:28Z

QA tests have started for PR 2388 at commit 0dd8ad0.

This patch merges cleanly.

SparkQA · 2014-09-18T18:33:55Z

QA tests have finished for PR 2388 at commit 0dd8ad0.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SV)],

SparkQA · 2014-09-20T15:04:21Z

QA tests have started for PR 2388 at commit d407854.

This patch merges cleanly.

SparkQA · 2014-09-20T15:05:22Z

QA tests have finished for PR 2388 at commit d407854.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SV)],

SparkQA · 2014-09-20T15:44:20Z

QA tests have started for PR 2388 at commit 14903b1.

This patch merges cleanly.

SparkQA · 2014-09-20T16:35:12Z

QA tests have finished for PR 2388 at commit 14903b1.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SV)],

SparkQA · 2014-09-21T13:39:24Z

QA tests have started for PR 2388 at commit f775916.

This patch merges cleanly.

SparkQA · 2014-09-21T15:39:25Z

Tests timed out after a configured wait of 120m.

SparkQA · 2014-09-23T03:34:26Z

QA tests have started for PR 2388 at commit bf84e7b.

This patch merges cleanly.

SparkQA · 2014-09-23T04:42:25Z

QA tests have finished for PR 2388 at commit bf84e7b.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SSV)],

SparkQA · 2014-09-23T04:42:29Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20686/

witgo · 2014-10-29T16:06:14Z

retest this please

jkbradley · 2014-11-07T00:33:59Z

@witgo Thanks for the PR! This looks like a very featureful implementation, but I think it will require some refactoring to fit in well with future development. I'll give some high-level comments for now, and can perhaps do a lower-level pass later on.

APIs

I suspect we'll have other types of topic modeling in the future, not just LDA. It would be great to think ahead for that. The simplest way is probably to rename everything as "LDA", not "topic modeling," and to minimize the public API. (Other topic models we might want later are LSA, PLSA, HDP, CTM, etc.)

This should probably go under "clustering" instead of "feature."

Code organization

Some of the code is more general than LDA and could go elsewhere in MLlib. E.g., some of the sampling methods could go in stat/. The same applies to minMaxIndexSearch, minMaxValueSearch, etc. (or can those be replaced using existing generic methods in Scala or Java?).
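
As one possible reading of the question about generic methods: if the search helpers are used to locate a sampled value in an array of cumulative weights (an assumption; their actual behavior is not described in this thread), the standard `java.util.Arrays.binarySearch` can do the lookup. The sketch below uses made-up names and is not taken from the PR.

```scala
import java.util.Arrays
import scala.util.Random

object CumulativeSamplingSketch {
  /** Running totals of non-negative weights, e.g. Array(1, 2, 3) -> Array(1, 3, 6). */
  def cumulativeSums(weights: Array[Double]): Array[Double] =
    weights.scanLeft(0.0)(_ + _).tail

  /** Index of the first cumulative sum strictly greater than u, for 0 <= u < cdf.last. */
  def indexFor(cdf: Array[Double], u: Double): Int = {
    val pos = Arrays.binarySearch(cdf, u)
    // Exact hit at i: u == cdf(i), so the first strictly greater sum is at i + 1.
    // Miss: binarySearch returns -(insertionPoint) - 1, and the insertion point is the answer.
    val idx = if (pos >= 0) pos + 1 else -pos - 1
    math.min(idx, cdf.length - 1) // guard against u landing on the last total
  }

  /** Draws an index with probability proportional to its weight. */
  def sample(cdf: Array[Double], rng: Random): Int =
    indexFor(cdf, rng.nextDouble() * cdf.last)
}
```

A caller would build the cumulative array once per distribution with `cumulativeSums` and then call `sample` repeatedly, so each draw costs one binary search.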

Documentation and code clarity

The current thing making this hardest to review is the lack of documentation and the difficulty in understanding what each value and method does. For documentation, it will be helpful to see comments for all classes and methods, and also inline comments explaining code where needed. For code clarity, using more descriptive variable and method names will help a lot.

Other thoughts

It would be nice to remove some experimental items (such as mergeDuplicateTopic) for now.

Thanks again!

SparkQA · 2014-11-07T00:36:50Z

Test build #512 has started for PR 2388 at commit fe40445.

This patch merges cleanly.

SparkQA · 2014-11-07T01:58:59Z

Test build #512 has finished for PR 2388 at commit fe40445.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

debasish83 · 2014-11-12T09:31:39Z

@jkbradley we support LSA (sparse coding) and PLSA through #3221...

jkbradley · 2014-11-13T00:35:28Z

@debasish83 Nice, I'll take a look! @witgo Thanks for making that change.

jkbradley · 2015-01-08T19:44:52Z

@witgo I’m submitting a simple PR for LDA that uses EM for learning. I believe that it would be good to support other learning methods such as Gibbs sampling (as in your PR), where the user can select the learning method via an LDA parameter. If you have feedback on my PR, especially the public API, please do let me know. Thanks very much!
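
One way to picture a learning method selected via an LDA parameter is sketched below. Every name here (`LearningMethod`, `LdaParams`, `setLearningMethod`, `describe`) is hypothetical and chosen only for illustration; this is not the API of either PR or of MLlib.

```scala
object LearningMethodSketch {
  // Closed set of learning methods, so that a match over them is checked by the compiler.
  sealed trait LearningMethod
  case object EM extends LearningMethod
  case object GibbsSampling extends LearningMethod

  /** Hypothetical parameter holder; defaults are arbitrary placeholders. */
  final case class LdaParams(
      numTopics: Int = 10,
      maxIterations: Int = 20,
      learningMethod: LearningMethod = EM) {
    def setLearningMethod(method: LearningMethod): LdaParams =
      copy(learningMethod = method)
  }

  /** Dispatch point where each method would plug in its own training loop. */
  def describe(params: LdaParams): String = params.learningMethod match {
    case EM            => s"EM, ${params.numTopics} topics"
    case GibbsSampling => s"Gibbs sampling, ${params.numTopics} topics"
  }
}
```

Under this sketch a user would write `LdaParams().setLearningMethod(GibbsSampling)`, and adding a new method means adding one case object and one branch in the dispatch.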

witgo · 2015-01-29T17:44:28Z

Here is a branch with faster sampling (work in progress):
https://github.com/witgo/spark/tree/lda_MH

mengxr · 2015-02-03T09:06:25Z

@witgo We've merged #4047 and closed this PR. Thanks for your contribution! Please create JIRAs and propose new features that can be added to the LDA implementation in master.

witgo · 2015-02-03T09:43:06Z

@mengxr
I created a JIRA: SPARK-5556.

witgo force-pushed the graphx_lda branch from 9860fd1 to 5fa02ef Compare September 14, 2014 17:04

witgo force-pushed the graphx_lda branch from 5fa02ef to dc7ef13 Compare September 15, 2014 12:57

witgo force-pushed the graphx_lda branch from dc7ef13 to 3738e74 Compare September 15, 2014 15:46

witgo force-pushed the graphx_lda branch from 3738e74 to 0dd8ad0 Compare September 18, 2014 17:19

witgo changed the title ~~[WIP][SPARK-1405][MLLIB]LDA based on Graphx~~ [WIP][SPARK-1405][MLLIB] topic modeling on Graphx Sep 19, 2014

witgo force-pushed the graphx_lda branch from 0dd8ad0 to d407854 Compare September 20, 2014 15:01

witgo force-pushed the graphx_lda branch from d407854 to 14903b1 Compare September 20, 2014 15:37

witgo force-pushed the graphx_lda branch 2 times, most recently from 71b03f1 to f775916 Compare September 21, 2014 13:35

witgo force-pushed the graphx_lda branch 2 times, most recently from 673771e to bf84e7b Compare September 23, 2014 03:31

witgo force-pushed the graphx_lda branch 2 times, most recently from 3895828 to fe40445 Compare October 30, 2014 16:03

witgo changed the title ~~[SPARK-1405][MLLIB] topic modeling on Graphx~~ [SPARK-1405][MLLIB] topic modeling on GraphX Nov 4, 2014

witgo force-pushed the graphx_lda branch from fe40445 to 5ed277c Compare November 4, 2014 15:09

witgo changed the title ~~[SPARK-1405][MLLIB] topic modeling on GraphX~~ [SPARK-1405][MLLIB] LDA on GraphX Nov 7, 2014

witgo force-pushed the graphx_lda branch 4 times, most recently from 311e542 to 92103d4 Compare November 8, 2014 11:29

witgo force-pushed the graphx_lda branch from 92103d4 to 77223c8 Compare November 13, 2014 06:21

witgo force-pushed the graphx_lda branch from 77223c8 to ee04988 Compare December 1, 2014 08:26

jkbradley mentioned this pull request Dec 15, 2014

[SPARK-2199] [mllib] topic modeling #1269

Closed

witgo added 2 commits January 16, 2015 14:22

LDA on GraphX

0d9ab59

Minor fix

68360c4

witgo force-pushed the graphx_lda branch from ee04988 to 68360c4 Compare January 16, 2015 06:23

asfgit closed this in 980764f Feb 3, 2015

7 participants