@witgo
Contributor

@witgo witgo commented Sep 14, 2014

  • Asymmetric Dirichlet priors

    Asymmetric Dirichlet priors substantially increase the robustness of LDA to variations in the number of topics and to the highly skewed word frequency distributions common in natural language.
    
  • Support 1000000 topics

     This covers the long-tail distribution better.
    
  • Topic de-duplication

  • Add the documentation

  • Add infer interface

  • Add unit tests

  • Add the performance test

  • Optimizing the infer interface performance

  • Verifying the correctness of the algorithm
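
The effect of the first item in the list above can be illustrated with a textbook collapsed Gibbs sampling update. This is a minimal sketch, not the PR's GraphX implementation: it assumes the asymmetric prior is placed on the document-topic distribution as a per-topic vector `alpha`, with a symmetric scalar `beta` on the topic-word distribution. All function and parameter names are hypothetical.

```python
import random


def topic_weights(doc_topic_counts, word_topic_counts, topic_totals,
                  alpha, beta, vocab_size):
    """Unnormalized collapsed-Gibbs weights for one token.

    alpha is a per-topic list (asymmetric prior). A symmetric prior is the
    special case where every entry of alpha is equal.
    """
    return [
        (doc_topic_counts[k] + alpha[k])
        * (word_topic_counts[k] + beta)
        / (topic_totals[k] + vocab_size * beta)
        for k in range(len(alpha))
    ]


def sample_topic(weights, rng):
    """Draw a topic index with probability proportional to its weight."""
    threshold = rng.random() * sum(weights)
    running = 0.0
    for k, weight in enumerate(weights):
        running += weight
        if threshold < running:
            return k
    return len(weights) - 1  # guard against floating-point round-off


# Two topics with identical word statistics and an empty document:
# only the prior distinguishes them.
weights = topic_weights(
    doc_topic_counts=[0, 0],
    word_topic_counts=[1, 1],
    topic_totals=[10, 10],
    alpha=[0.5, 0.01],  # asymmetric: topic 0 is favoured a priori
    beta=0.01,
    vocab_size=100,
)
topic = sample_topic(weights, random.Random(0))
```

With equal word statistics and an empty document, the ratio of the two weights is just `alpha[0] / alpha[1]`, so a larger prior entry makes a topic proportionally more likely before the document contributes any evidence.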

The performance test:

Item                                        Value
cluster resources                           36 executors (36 cores, 216g memory)
training set                                253064 documents, 29696335 words, 75496 distinct words
number of iterations                        150
parameters                                  alpha 0.01, beta 0.01
number of topics / running time (minutes)   2000 / 42.26, 10000 / 49.47, 100000 / 73.14, 1000000 / 125.43
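
To read the timings above as scaling factors, the arithmetic can be sketched as follows. The figures are copied from the performance test; the tokens-per-second estimate assumes that each iteration sweeps every word of the training set once and that the reported running time is all sampling time, which the PR description does not state. Function names are hypothetical.

```python
# (number of topics, running time in minutes), copied from the performance test
runs = [(2000, 42.26), (10000, 49.47), (100000, 73.14), (1000000, 125.43)]


def scaling(runs):
    """Return (topic factor, time factor) for each run, relative to the first run."""
    base_topics, base_minutes = runs[0]
    return [(topics / base_topics, minutes / base_minutes)
            for topics, minutes in runs]


def tokens_per_second(words, iterations, minutes):
    """Rough throughput, assuming one pass over all words per iteration."""
    return words * iterations / (minutes * 60)


for (topics, minutes), (topic_factor, time_factor) in zip(runs, scaling(runs)):
    rate = tokens_per_second(29696335, 150, minutes)
    print(f"{topics:>7} topics: {topic_factor:>4.0f}x topics, "
          f"{time_factor:.2f}x time, {rate:,.0f} tokens/s")
```

Going from 2000 to 1000000 topics is a 500x increase in the number of topics for roughly a 3x increase in running time.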

conf/spark-defaults.conf:

spark.akka.frameSize 20
spark.executor.instances 36
spark.rdd.compress true
spark.executor.memory 6g
spark.default.parallelism 72
spark.broadcast.blockSize 8192
spark.storage.memoryFraction 0.2
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.spark.mllib.clustering.LDAKryoRegistrator
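
As a quick cross-check of these settings against the cluster resources quoted in the performance test, the following sketch parses a few of the lines and multiplies them out. The parser is a simplified stand-in written for this example (whitespace-separated `key value` pairs only); it is not how Spark itself loads the file.

```python
def parse_spark_defaults(text):
    """Parse whitespace-separated `key value` lines, skipping blanks and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split(None, 1)
        conf[key] = value.strip()
    return conf


# Three of the settings listed above, copied verbatim
conf_text = """
spark.executor.instances 36
spark.executor.memory   6g
spark.default.parallelism  72
"""

conf = parse_spark_defaults(conf_text)
instances = int(conf["spark.executor.instances"])
memory_gb = int(conf["spark.executor.memory"].rstrip("g"))

total_memory_gb = instances * memory_gb
partitions_per_executor = int(conf["spark.default.parallelism"]) / instances
```

Here `total_memory_gb` works out to 36 x 6 = 216, matching the 216g quoted for the cluster, and the default parallelism gives two partitions per executor.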

REFERENCES:

@SparkQA

SparkQA commented Sep 14, 2014

QA tests have started for PR 2388 at commit 9860fd1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 14, 2014

QA tests have started for PR 2388 at commit 5fa02ef.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 14, 2014

QA tests have finished for PR 2388 at commit 9860fd1.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TopicModeling(@transient val tokens: RDD[(TopicModeling.WordId, TopicModeling.DocId)],

@SparkQA

SparkQA commented Sep 14, 2014

QA tests have finished for PR 2388 at commit 5fa02ef.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TopicModeling(@transient val tokens: RDD[(TopicModeling.WordId, TopicModeling.DocId)],

@SparkQA

SparkQA commented Sep 15, 2014

QA tests have started for PR 2388 at commit dc7ef13.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 15, 2014

QA tests have finished for PR 2388 at commit dc7ef13.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TopicModeling(@transient val tokens: RDD[(TopicModeling.WordId, TopicModeling.DocId)],

@SparkQA

SparkQA commented Sep 15, 2014

QA tests have started for PR 2388 at commit 3738e74.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 15, 2014

QA tests have finished for PR 2388 at commit 3738e74.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SV)],

@SparkQA

SparkQA commented Sep 18, 2014

QA tests have started for PR 2388 at commit 0dd8ad0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 18, 2014

QA tests have finished for PR 2388 at commit 0dd8ad0.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SV)],

@witgo changed the title from "[WIP][SPARK-1405][MLLIB]LDA based on Graphx" to "[WIP][SPARK-1405][MLLIB] topic modeling on Graphx" on Sep 19, 2014
@SparkQA

SparkQA commented Sep 20, 2014

QA tests have started for PR 2388 at commit d407854.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 20, 2014

QA tests have finished for PR 2388 at commit d407854.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SV)],

@SparkQA

SparkQA commented Sep 20, 2014

QA tests have started for PR 2388 at commit 14903b1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 20, 2014

QA tests have finished for PR 2388 at commit 14903b1.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SV)],

@witgo force-pushed the graphx_lda branch 2 times, most recently from 71b03f1 to f775916, on September 21, 2014 13:35
@SparkQA

SparkQA commented Sep 21, 2014

QA tests have started for PR 2388 at commit f775916.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 21, 2014

Tests timed out after a configured wait of 120m.

@witgo force-pushed the graphx_lda branch 2 times, most recently from 673771e to bf84e7b, on September 23, 2014 03:31
@SparkQA

SparkQA commented Sep 23, 2014

QA tests have started for PR 2388 at commit bf84e7b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 23, 2014

QA tests have finished for PR 2388 at commit bf84e7b.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SSV)],

@SparkQA

SparkQA commented Sep 23, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20686/

@witgo
Contributor Author

witgo commented Oct 29, 2014

retest this please

@witgo force-pushed the graphx_lda branch 2 times, most recently from 3895828 to fe40445, on October 30, 2014 16:03
@witgo changed the title from "[SPARK-1405][MLLIB] topic modeling on Graphx" to "[SPARK-1405][MLLIB] topic modeling on GraphX" on Nov 4, 2014
@jkbradley
Member

@witgo Thanks for the PR! This looks like a very featureful implementation, but I think it will require some refactoring to fit in well with future development. I'll give some high-level comments for now, and can perhaps do a lower-level pass later on.

APIs

I suspect we'll have other types of topic modeling in the future, not just LDA. It would be great to think ahead for that. The simplest way is probably to rename everything as "LDA", not "topic modeling," and to minimize the public API. (Other topic models we might want later are LSA, PLSA, HDP, CTM, etc.)

This should probably go under "clustering" instead of "feature."

Code organization

Some of the code is more general than LDA and could go elsewhere in MLlib. E.g., some of the sampling methods could go in stat/. The same applies to minMaxIndexSearch, minMaxValueSearch, etc. (or can those be replaced using existing generic methods in Scala or Java?).
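
The conversation does not show what minMaxIndexSearch or minMaxValueSearch actually do. Assuming, purely for illustration, that they are binary searches over a sorted array of cumulative weights (a common pattern when sampling from a discrete distribution), a standard-library routine already covers that case. The sketch below uses Python's `bisect` module; on the JVM the analogous call would be `java.util.Arrays.binarySearch`. The function name `sample_index` is hypothetical.

```python
import bisect
import itertools
import random


def sample_index(weights, rng):
    """Sample an index with probability proportional to weights[i]."""
    cumulative = list(itertools.accumulate(weights))
    threshold = rng.random() * cumulative[-1]
    # bisect_right returns the first position whose cumulative weight
    # is strictly greater than the threshold.
    position = bisect.bisect_right(cumulative, threshold)
    return min(position, len(weights) - 1)  # guard against round-off


rng = random.Random(42)
draws = [sample_index([1.0, 3.0], rng) for _ in range(10000)]
share_of_index_1 = draws.count(1) / len(draws)  # expected to be near 0.75
```

Because the search is generic, nothing in it is specific to LDA, which is the point of the suggestion to reuse an existing method or move such helpers to a shared location.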

Documentation and code clarity

What currently makes this hardest to review is the lack of documentation and the difficulty of understanding what each value and method does. For documentation, it would help to have comments for all classes and methods, plus inline comments explaining the code where needed. For code clarity, more descriptive variable and method names would help a lot.

Other thoughts

It would be nice to remove some experimental items (such as mergeDuplicateTopic) for now.

Thanks again!

@SparkQA

SparkQA commented Nov 7, 2014

Test build #512 has started for PR 2388 at commit fe40445.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #512 has finished for PR 2388 at commit fe40445.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo changed the title from "[SPARK-1405][MLLIB] topic modeling on GraphX" to "[SPARK-1405][MLLIB] LDA on GraphX" on Nov 7, 2014
@witgo force-pushed the graphx_lda branch 4 times, most recently from 311e542 to 92103d4, on November 8, 2014 11:29
@debasish83

@jkbradley we support LSA (sparse coding) and PLSA through #3221...

@jkbradley
Member

@debasish83 Nice, I'll take a look! @witgo Thanks for making that change.

@jkbradley
Member

@witgo I’m submitting a simple PR for LDA which uses EM for learning. I believe that it would be good to support other learning methods such as Gibbs sampling (as in your PR), where the user can select the learning method via an LDA parameter. If you have feedback on my PR, especially the public API, please do let me know. Thanks very much!
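
One way to picture "select the learning method via an LDA parameter" is a small registry of learners behind a single entry point. This is only a sketch of the idea: the class, the method names, and the `"em"` / `"gibbs"` keys are hypothetical and do not come from either PR, and the learners are placeholders that return a label rather than a trained model.

```python
from dataclasses import dataclass


def fit_em(corpus, k):
    """Placeholder for an EM-based learner (returns only a description)."""
    return f"em model: {k} topics, {len(corpus)} documents"


def fit_gibbs(corpus, k):
    """Placeholder for a Gibbs-sampling learner (returns only a description)."""
    return f"gibbs model: {k} topics, {len(corpus)} documents"


# Registry of available learning methods, keyed by the user-facing name
LEARNERS = {"em": fit_em, "gibbs": fit_gibbs}


@dataclass
class LDA:
    k: int = 10
    learning_method: str = "em"

    def set_learning_method(self, name):
        """Select the learner by name; unknown names are rejected early."""
        if name not in LEARNERS:
            raise ValueError(f"unknown learning method: {name!r}")
        self.learning_method = name
        return self

    def run(self, corpus):
        """Dispatch to whichever learner is currently selected."""
        return LEARNERS[self.learning_method](corpus, self.k)


corpus = [["spark", "graphx"], ["topic", "model"]]
result = LDA(k=5).set_learning_method("gibbs").run(corpus)
```

Keeping the public surface to one class plus a method name means a new learner can be added to the registry without changing user code.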

@witgo
Contributor Author

witgo commented Jan 29, 2015

Here is a faster sample branch (work in progress):
https://github.com/witgo/spark/tree/lda_MH

@asfgit asfgit closed this in 980764f Feb 3, 2015
@mengxr
Contributor

mengxr commented Feb 3, 2015

@witgo We've merged #4047 and closed this PR. Thanks for your contribution! Please create JIRAs and propose new features that can be added to the LDA implementation in master.

@witgo
Contributor Author

witgo commented Feb 3, 2015

@mengxr
I created a JIRA: SPARK-5556.


7 participants