Conversation

@feynmanliang
Contributor

@jkbradley @hhbyyh

Adds topicDistributions to LocalLDAModel. Please review after #7757 is merged.

@feynmanliang feynmanliang force-pushed the SPARK-5567-predict-in-LDA branch from 247e920 to c709dd5 Compare July 29, 2015 23:40
@SparkQA

SparkQA commented Jul 30, 2015

Test build #38905 has finished for PR 7760 at commit 247e920.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@feynmanliang feynmanliang force-pushed the SPARK-5567-predict-in-LDA branch 2 times, most recently from c709dd5 to 4a6f323 Compare July 30, 2015 00:09
@SparkQA

SparkQA commented Jul 30, 2015

Test build #38912 has finished for PR 7760 at commit c709dd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 30, 2015

Test build #38921 has finished for PR 7760 at commit 4a6f323.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

use normalize?
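The suggestion presumably refers to replacing a hand-rolled division by the vector's sum with a library `normalize` call (Breeze, in the Scala code). A minimal numpy sketch of the idea — the helper name `l1_normalize` is hypothetical, not the PR's code:

```python
import numpy as np

def l1_normalize(v):
    """Scale a nonnegative vector so its entries sum to 1 (a probability vector)."""
    total = v.sum()
    return v / total if total > 0 else v

# Normalizing a variational "gamma" yields the document's topic mixture "theta".
gamma = np.array([2.0, 1.0, 1.0])
theta = l1_normalize(gamma)   # -> array([0.5, 0.25, 0.25])
```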

Contributor Author

Thanks!

@feynmanliang feynmanliang force-pushed the SPARK-5567-predict-in-LDA branch from 4a6f323 to 3be2947 Compare July 30, 2015 03:38
Contributor

extra line

Contributor Author

OK

@SparkQA

SparkQA commented Jul 30, 2015

Test build #38971 has finished for PR 7760 at commit 3be2947.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 30, 2015

Test build #38977 has finished for PR 7760 at commit 6bfb87c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

I just noticed this: [https://github.com//pull/7760/files#diff-965c75b823b8cbfb304a6f6774681ccaR277]
Shouldn't it scale by "count"?

@feynmanliang
Contributor Author

I don't think so; the bound is for the entire document (joint over all words in the document), not per-word. This is needed when doing online updates of the alpha hyperparameter estimate. `bound` is also not part of the public API.

Scaling by count for perplexity is done here

Member

We can be more specific:

Predicts the topic mixture distribution for each document (often called "theta" in the literature).  Returns a vector of zeros for an empty document.

This uses a variational approximation following Hoffman et al. (2010), where the approximate distribution is called "gamma."  Technically, this method returns this approximation "gamma" for each document.
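As an illustration of the variational E-step the docstring describes (not the Spark implementation — the function and all numbers below are hypothetical), a numpy/scipy sketch of how "gamma" is obtained for one document, following Hoffman et al. (2010):

```python
import numpy as np
from scipy.special import digamma

def infer_gamma(word_ids, word_counts, alpha, exp_Elog_beta, n_iter=100, tol=1e-6):
    """Variational E-step for one document, per Hoffman et al. (2010).

    exp_Elog_beta: (K, V) array holding exp(E[log beta]) for the topic-word
    distributions.  Returns gamma, the Dirichlet parameter whose normalization
    approximates the document's topic mixture "theta".
    """
    gamma = np.ones(exp_Elog_beta.shape[0])   # simple init; the paper samples Gamma(100, 0.01)
    beta_d = exp_Elog_beta[:, word_ids]       # (K, N_d) columns for this document's words
    counts = np.asarray(word_counts, dtype=float)
    for _ in range(n_iter):
        last = gamma
        exp_Elog_theta = np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi_norm = exp_Elog_theta @ beta_d + 1e-100   # normalizer of phi for each word
        gamma = alpha + exp_Elog_theta * (beta_d @ (counts / phi_norm))
        if np.mean(np.abs(gamma - last)) < tol:
            break
    return gamma

# Toy model (all numbers hypothetical): 2 topics over a 3-word vocabulary;
# the document is word 0 repeated 5 times, so topic 0 should dominate.
exp_Elog_beta = np.array([[0.80, 0.15, 0.05],
                          [0.05, 0.15, 0.80]])
gamma = infer_gamma([0], [5], alpha=0.5, exp_Elog_beta=exp_Elog_beta)
theta = gamma / gamma.sum()   # approximate topic mixture for the document
```

Note that each iteration preserves the invariant sum(gamma) = K * alpha + (token count), since the per-word phi distributions sum to one over topics.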

Contributor Author

OK

@jkbradley
Member

About scaling by counts in [https://github.com//pull/7760/files#diff-965c75b823b8cbfb304a6f6774681ccaR277]:

I don't think so; the bound is for the entire document (joint over all words in document), not per-word. This is needed when doing online alpha hyperparameter estimate updates. bound also is not part of the public API.

That bound should count each token (instance of a word), but its current implementation would treat a document "blah" and a document "blah blah" as identical.

Scaling by count for perplexity is done here

This is to make the term per-word, i.e., average over all tokens. The issue is that the value computed earlier is not quite per-word.

I think this problem might be caught by modifying the unit test to have some terms have multiple copies. (I'm trying this out currently.)
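The point about token counts can be made concrete with a small sketch (hypothetical helper, not the PR's Scala code): a per-document log likelihood must weight each word's log probability by its token count, otherwise "blah" and "blah blah" score identically.

```python
import numpy as np

def doc_log_likelihood(word_ids, word_counts, theta, beta):
    """Token-weighted log likelihood of a bag-of-words document:
    sum over words w of count(w) * log(sum_k theta[k] * beta[k, w]).
    Dropping the count factor would score "blah" and "blah blah" identically."""
    per_word = np.log(theta @ beta[:, word_ids])
    return float(np.dot(word_counts, per_word))

# Hypothetical 2-topic model over a 2-word vocabulary.
theta = np.array([0.6, 0.4])
beta = np.array([[0.7, 0.3],
                 [0.2, 0.8]])
one = doc_log_likelihood([0], [1], theta, beta)   # the document "blah"
two = doc_log_likelihood([0], [2], theta, beta)   # the document "blah blah"
# two == 2 * one: each extra token contributes its own log probability.
```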

@jkbradley
Member

Actually, if you have gensim set up already, it might be faster for you to test it. (Thanks!)

@feynmanliang
Contributor Author

I will take a look tomorrow morning.

@jkbradley
Member

Sounds good; it's pretty darn late

@feynmanliang
Contributor Author

@jkbradley WRT scaling by counts:

Oh, I misunderstood you. Yes, you are right; that should be scaled by counts.

I will fix it in this PR and include tests in the unit tests for logLikelihood (upcoming PR).
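For reference, the standard relationship between a corpus log likelihood and per-token perplexity (which the follow-up commit "Negate perplexity definition" adjusts the sign of) is perplexity = exp(-logLikelihood / tokenCount). A minimal sketch with a hypothetical function name:

```python
import math

def perplexity(log_likelihood, token_count):
    """Per-token perplexity: exp(-logLikelihood / tokenCount).
    Lower is better; note the negation of the log likelihood."""
    return math.exp(-log_likelihood / token_count)

# Sanity check: a uniform model over a 4-word vocabulary has perplexity 4,
# regardless of how many tokens the corpus contains.
ll = 10 * math.log(0.25)   # log likelihood of 10 tokens, each with probability 1/4
pp = perplexity(ll, 10)    # -> 4.0 (up to floating point)
```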

@SparkQA

SparkQA commented Jul 30, 2015

Test build #39075 has finished for PR 7760 at commit 0ad1134.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

LGTM, thanks! I'll merge this with master.

@asfgit asfgit closed this in d8cfd53 Jul 30, 2015
asfgit pushed a commit that referenced this pull request Jul 31, 2015
jkbradley Exposes `bound` (variational log likelihood bound) through public API as `logLikelihood`. Also adds unit tests, some DRYing of `LDASuite`, and includes unit tests mentioned in #7760

Author: Feynman Liang <[email protected]>

Closes #7801 from feynmanliang/SPARK-9481-logLikelihood and squashes the following commits:

6d1b2c9 [Feynman Liang] Negate perplexity definition
5f62b20 [Feynman Liang] Add logLikelihood
@feynmanliang feynmanliang deleted the SPARK-5567-predict-in-LDA branch August 3, 2015 19:40