[Spark-5567][MLlib] Add predict method to LocalLDAModel #7760
Conversation
Force-pushed from 247e920 to c709dd5.
Test build #38905 has finished for PR 7760.
Force-pushed from c709dd5 to 4a6f323.
Test build #38912 has finished for PR 7760.
Test build #38921 has finished for PR 7760.
use normalize?
Thanks!
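For context, the "use normalize?" suggestion presumably refers to replacing a manual sum-and-divide with a library call such as `breeze.linalg.normalize(v, 1.0)` in MLlib. A minimal stdlib sketch of the L1 equivalence (the function name here is my own, not from the PR):

```python
def normalize_l1(v):
    # Divide by the L1 norm so the entries sum to 1; this mirrors what
    # breeze.linalg.normalize(v, 1.0) computes for a nonnegative vector.
    s = sum(abs(x) for x in v)
    return [x / s for x in v] if s else list(v)
```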
Force-pushed from 4a6f323 to 3be2947.
extra line
OK
Test build #38971 has finished for PR 7760.
Test build #38977 has finished for PR 7760.
I just noticed this: https://github.com//pull/7760/files#diff-965c75b823b8cbfb304a6f6774681ccaR277
I don't think so; the scaling by count for perplexity is done here.
We can be more specific:

Predicts the topic mixture distribution for each document (often called "theta" in the literature). Returns a vector of zeros for an empty document.

This uses a variational approximation following Hoffman et al. (2010), where the approximate distribution is called "gamma". Technically, this method returns this approximation "gamma" for each document.
OK
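The docstring above describes the variational approximation behind `topicDistributions`. As a rough, self-contained sketch of that per-document E-step from Hoffman et al. (2010) — all function and parameter names here are illustrative, not MLlib's actual API, and the flat gamma initialization and fixed iteration count are simplifications:

```python
import math

def digamma(x):
    # Recurrence plus asymptotic series; a stand-in for breeze.numerics.digamma.
    r = 0.0
    while x < 6:
        r -= 1.0 / x
        x += 1
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12. - f * (1/120. - f / 252.))

def topic_distribution(word_ids, word_counts, lambda_, alpha, n_iter=50):
    """Variational E-step for one document (Hoffman et al., 2010).

    Returns the normalized variational parameter gamma, i.e. the
    approximate topic mixture "theta". Empty documents get zeros.
    """
    k = len(lambda_)  # number of topics; lambda_[t] is topic t's Dirichlet params
    if not word_ids:
        return [0.0] * k
    # E[log beta_{tw}] under q(beta_t) = Dirichlet(lambda_[t])
    elog_beta = []
    for lam in lambda_:
        s = digamma(sum(lam))
        elog_beta.append([digamma(l) - s for l in lam])
    gamma = [1.0] * k  # flat initialization (real implementations randomize)
    for _ in range(n_iter):
        dg_total = digamma(sum(gamma))
        elog_theta = [digamma(g) - dg_total for g in gamma]
        new_gamma = [alpha] * k
        for w, cnt in zip(word_ids, word_counts):
            # phi_{wt} proportional to exp(E[log theta_t] + E[log beta_{tw}])
            phi = [math.exp(elog_theta[t] + elog_beta[t][w]) for t in range(k)]
            norm = sum(phi) or 1.0
            for t in range(k):
                new_gamma[t] += cnt * phi[t] / norm
        gamma = new_gamma
    total = sum(gamma)
    return [g / total for g in gamma]
```

With two topics where topic 0 strongly favors word 0, a document containing only word 0 should get nearly all its mass on topic 0.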
About scaling by counts in https://github.com//pull/7760/files#diff-965c75b823b8cbfb304a6f6774681ccaR277:

That bound should count each token (instance of a word), but its current implementation would treat a document "blah" and a document "blah blah" as identical. The division is meant to make the term per-word, i.e., an average over all tokens; the issue is that the value computed earlier is not quite per-word. I think this problem could be caught by modifying the unit test so that some terms have multiple copies. (I'm trying this out currently.)
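To illustrate the point being made: perplexity should come from the bound averaged over tokens, so repeated words change the denominator. A toy sketch (function names are mine, not the PR's, and `log_likelihood` is taken as given):

```python
import math

def total_tokens(docs):
    # docs: list of {term: count} maps. Each *instance* of a word counts,
    # so {"blah": 2} ("blah blah") contributes 2 tokens, not 1 term.
    return sum(sum(d.values()) for d in docs)

def perplexity(log_likelihood, docs):
    # exp of the negated per-token bound, following Hoffman et al. (2010).
    return math.exp(-log_likelihood / total_tokens(docs))
```

Under a term-based (incorrect) denominator, "blah" and "blah blah" would give the same perplexity; under the token-based one they differ.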
Actually, if you have gensim set up already, it might be faster for you to test it. (Thanks!)
I will take a look tomorrow morning. (Sent by email in reply to jkbradley's comment of Wed, Jul 29, 2015, 11:40 PM.)
Sounds good; it's pretty darn late.
@jkbradley WRT scaling by counts: Oh, I misunderstood you. Yes, you are right; that should be scaled by counts. I will fix it in this PR and include tests in the unit tests for …
Test build #39075 has finished for PR 7760.
LGTM, thanks! I'll merge this with master.
jkbradley: Exposes `bound` (the variational log likelihood bound) through the public API as `logLikelihood`. Also adds unit tests, some DRYing of `LDASuite`, and includes the unit tests mentioned in #7760.

Author: Feynman Liang <[email protected]>

Closes #7801 from feynmanliang/SPARK-9481-logLikelihood and squashes the following commits:
6d1b2c9 [Feynman Liang] Negate perplexity definition
5f62b20 [Feynman Liang] Add logLikelihood
@jkbradley @hhbyyh
Adds `topicDistributions` to `LocalLDAModel`. Please review after #7757 is merged.