-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-1405][MLLIB] LDA on GraphX #2388
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
QA tests have started for PR 2388 at commit
|
|
QA tests have started for PR 2388 at commit
|
|
QA tests have finished for PR 2388 at commit
|
|
QA tests have finished for PR 2388 at commit
|
|
QA tests have started for PR 2388 at commit
|
|
QA tests have finished for PR 2388 at commit
|
|
QA tests have started for PR 2388 at commit
|
|
QA tests have finished for PR 2388 at commit
|
|
QA tests have started for PR 2388 at commit
|
|
QA tests have finished for PR 2388 at commit
|
|
QA tests have started for PR 2388 at commit
|
|
QA tests have finished for PR 2388 at commit
|
|
QA tests have started for PR 2388 at commit
|
|
QA tests have finished for PR 2388 at commit
|
71b03f1 to
f775916
Compare
|
QA tests have started for PR 2388 at commit
|
|
Tests timed out after a configured wait of |
673771e to
bf84e7b
Compare
|
QA tests have started for PR 2388 at commit
|
|
QA tests have finished for PR 2388 at commit
|
|
Test PASSed. |
|
retest this please |
3895828 to
fe40445
Compare
|
@witgo Thanks for the PR! This looks like a very featureful implementation, but I think it will require some refactoring to fit in well with future development. I'll give some high-level comments for now, and can perhaps do a lower-level pass later on. APIs I suspect we'll have other types of topic modeling in the future, not just LDA. It would be great to think ahead for that. The simplest way is probably to rename everything as "LDA", not "topic modeling," and to minimize the public API. (Other topic models we might want later are LSA, PLSA, HDP, CTM, etc.) This should probably go under "clustering" instead of "feature." Code organization Some of the code is more general than LDA and could go elsewhere in MLlib. E.g., some of the sampling methods could go in stat/ Also, minMaxIndexSearch, minMaxValueSearch, etc. (or can those be replaced using existing generic methods in Scala or Java?). Documentation and code clarity The current thing making this hardest to review is the lack of documentation and the difficulty in understanding what each value and method does. For documentation, it will be helpful to see comments for all classes and methods, and also inline comments explaining code where needed. For code clarity, using more descriptive variable and method names will help a lot. Other thoughts It would be nice to remove some experimental items (such as mergeDuplicateTopic) for now. Thanks again! |
|
Test build #512 has started for PR 2388 at commit
|
|
Test build #512 has finished for PR 2388 at commit
|
311e542 to
92103d4
Compare
|
@jkbradley we support LSA (sparse coding) and PLSA through #3221... |
|
@debasish83 Nice, I'll take a look! @witgo Thanks for making that change. |
|
@witgo I’m submitting a simple PR for LDA which using EM for learning. I believe that it would be good to support other learning methods such as Gibbs sampling (as in your PR), where the user can select the learning method via an LDA parameter. If you have feedback on my PR, especially the public API, please do let me know. Thanks very much! |
|
Here is a sample faster branch(work in progress): |
|
@mengxr |
Asymmetric Dirichlet priors
Support 1000000 topics
Topic de-duplication
Add the documentation
Add infer interface
Add unit tests
Add the performance test
Optimizing the infer interface performance
Verifying the correctness of the algorithm
The performance test:
1502000/42.26,10000/49.47,100000/73.14,1000000/125.43conf/spark-defaults.conf:
REFERENCES: