[Spark-5567][MLlib] Add predict method to LocalLDAModel #7760
Conversation
Force-pushed from 247e920 to c709dd5.
Test build #38905 has finished for PR 7760.
Force-pushed from c709dd5 to 4a6f323.
Test build #38912 has finished for PR 7760.
Test build #38921 has finished for PR 7760.
use normalize?
Thanks!
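For context, the "use normalize?" suggestion presumably refers to replacing a manual sum-and-divide with a library call such as `breeze.linalg.normalize(v, 1.0)` in MLlib. A minimal stdlib sketch of the L1 equivalence (the function name here is my own, not from the PR):

```python
def normalize_l1(v):
    # Divide by the L1 norm so the entries sum to 1; this mirrors what
    # breeze.linalg.normalize(v, 1.0) computes for a nonnegative vector.
    s = sum(abs(x) for x in v)
    return [x / s for x in v] if s else list(v)
```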
Force-pushed from 4a6f323 to 3be2947.
extra line
OK
Test build #38971 has finished for PR 7760.
Test build #38977 has finished for PR 7760.
I just noticed this: https://github.com//pull/7760/files#diff-965c75b823b8cbfb304a6f6774681ccaR277
I don't think so; the scaling by count for perplexity is done here.
We can be more specific:

Predicts the topic mixture distribution for each document (often called "theta" in the literature). Returns a vector of zeros for an empty document.

This uses a variational approximation following Hoffman et al. (2010), where the approximate distribution is called "gamma". Technically, this method returns this approximation "gamma" for each document.
OK
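The docstring above describes the variational approximation behind `topicDistributions`. As a rough, self-contained sketch of that per-document E-step from Hoffman et al. (2010) — all function and parameter names here are illustrative, not MLlib's actual API, and the flat gamma initialization and fixed iteration count are simplifications:

```python
import math

def digamma(x):
    # Recurrence plus asymptotic series; a stand-in for breeze.numerics.digamma.
    r = 0.0
    while x < 6:
        r -= 1.0 / x
        x += 1
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12. - f * (1/120. - f / 252.))

def topic_distribution(word_ids, word_counts, lambda_, alpha, n_iter=50):
    """Variational E-step for one document (Hoffman et al., 2010).

    Returns the normalized variational parameter gamma, i.e. the
    approximate topic mixture "theta". Empty documents get zeros.
    """
    k = len(lambda_)  # number of topics; lambda_[t] is topic t's Dirichlet params
    if not word_ids:
        return [0.0] * k
    # E[log beta_{tw}] under q(beta_t) = Dirichlet(lambda_[t])
    elog_beta = []
    for lam in lambda_:
        s = digamma(sum(lam))
        elog_beta.append([digamma(l) - s for l in lam])
    gamma = [1.0] * k  # flat initialization (real implementations randomize)
    for _ in range(n_iter):
        dg_total = digamma(sum(gamma))
        elog_theta = [digamma(g) - dg_total for g in gamma]
        new_gamma = [alpha] * k
        for w, cnt in zip(word_ids, word_counts):
            # phi_{wt} proportional to exp(E[log theta_t] + E[log beta_{tw}])
            phi = [math.exp(elog_theta[t] + elog_beta[t][w]) for t in range(k)]
            norm = sum(phi) or 1.0
            for t in range(k):
                new_gamma[t] += cnt * phi[t] / norm
        gamma = new_gamma
    total = sum(gamma)
    return [g / total for g in gamma]
```

With two topics where topic 0 strongly favors word 0, a document containing only word 0 should get nearly all its mass on topic 0.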
About scaling by counts in https://github.com//pull/7760/files#diff-965c75b823b8cbfb304a6f6774681ccaR277:

That bound should count each token (instance of a word), but its current implementation would treat a document "blah" and a document "blah blah" as identical. The division is meant to make the term per-word, i.e., an average over all tokens; the issue is that the value computed earlier is not quite per-word. I think this problem could be caught by modifying the unit test so that some terms have multiple copies. (I'm trying this out currently.)
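To illustrate the point being made: perplexity should come from the bound averaged over tokens, so repeated words change the denominator. A toy sketch (function names are mine, not the PR's, and `log_likelihood` is taken as given):

```python
import math

def total_tokens(docs):
    # docs: list of {term: count} maps. Each *instance* of a word counts,
    # so {"blah": 2} ("blah blah") contributes 2 tokens, not 1 term.
    return sum(sum(d.values()) for d in docs)

def perplexity(log_likelihood, docs):
    # exp of the negated per-token bound, following Hoffman et al. (2010).
    return math.exp(-log_likelihood / total_tokens(docs))
```

Under a term-based (incorrect) denominator, "blah" and "blah blah" would give the same perplexity; under the token-based one they differ.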
Actually, if you have gensim set up already, it might be faster for you to test it. (Thanks!)
I will take a look tomorrow morning. (Sent by email in reply to jkbradley's comment of Wed, Jul 29, 2015, 11:40 PM.)
Sounds good; it's pretty darn late.
@jkbradley WRT scaling by counts: Oh, I misunderstood you. Yes, you are right; that should be scaled by counts. I will fix it in this PR and include tests in the unit tests for …
Test build #39075 has finished for PR 7760.
LGTM, thanks! I'll merge this with master.
jkbradley: Exposes `bound` (the variational log likelihood bound) through the public API as `logLikelihood`. Also adds unit tests, some DRYing of `LDASuite`, and includes the unit tests mentioned in #7760.

Author: Feynman Liang <[email protected]>

Closes #7801 from feynmanliang/SPARK-9481-logLikelihood and squashes the following commits:
6d1b2c9 [Feynman Liang] Negate perplexity definition
5f62b20 [Feynman Liang] Add logLikelihood
@jkbradley @hhbyyh
Adds `topicDistributions` to `LocalLDAModel`. Please review after #7757 is merged.