
Conversation

Contributor
@MLnick commented Apr 21, 2016

This PR adds "recommend all" functionality to ML's ALSModel, similar to what exists in the recommendProductsForUsers/recommendUsersForProducts methods in MLlib's MatrixFactorizationModel.

This is a WIP to get some feedback around:

  1. my approach to generic blockify and recommendAll methods in MatrixFactorizationModel, to handle both Double and Float types, since ml uses Float for ratings and factors while mllib uses Double (see the blockify sketch after this list)
  2. the approach to the new params related to recommending top-k in ALSModel
  3. semantics of ALSModel.transform and how it relates to pipelines, cross-validation & evaluation (see this JIRA as well as SPARK-14409, the linked design doc, and [SPARK-14409][ML] Adding a RankingEvaluator to ML #12461).
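
For feedback point 1, a minimal sketch of what a type-generic blockify could look like, assuming factors arrive as (id, factor array) pairs and the same grouping strategy as the existing MatrixFactorizationModel code; the actual PR code may differ:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Group factor vectors into fixed-size blocks so that top-k prediction can
// use BLAS-level matrix multiplication on each pair of blocks. The ClassTag
// lets one method body serve Array[Float] (ml) and Array[Double] (mllib).
def blockify[T: ClassTag](
    factors: RDD[(Int, Array[T])],
    blockSize: Int = 4096): RDD[Seq[(Int, Array[T])]] = {
  factors.mapPartitions(_.grouped(blockSize))
}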

cc @mengxr @jkbradley @srowen @debasish83 @yongtang

TODO

  • Decide on transform semantics and how this fits in with cross-validation and evaluation.
  • Performance testing vs MatrixFactorizationModel and vs the alternative transform semantics.
  • Clean up schema validation a bit and add related tests.

How was this patch tested?

New unit tests in ml.recommendation.ALSSuite

Contributor Author
@MLnick commented Apr 21, 2016

Adding some further detail...

Proposed semantics

  • ALSModel.transform handles both point predictions (predict a rating for each (user, item) combo in the dataset) and "top-k" recommendations (predict an Array of recommendations for each user in the dataset).
  • Add 3 parameters to handle this (see the usage sketch after this list):
    • k - number of recommendations to compute for each unique id in the input column.
    • recommendFor - whether to recommend for user (recommends top k items for each user) or item (recommends top k users for each item).
    • withScores - (expert) whether to return recommendations as [(id1, score1), (id2, score2), ...] or [id1, id2, ...]. Default false i.e. ids only.
  • Re-use predictionCol for recommendations. This means the output schema / semantics of transform differ when recommending top-k.
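
To make the proposed API concrete, a usage sketch based on the params above (setRecommendFor, setK and setWithScores are the proposed setters, not an existing API):

// Hypothetical usage: top-10 items per user, returned with scores.
val als = new ALS()
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")
  .setRecommendFor("user") // or "item" to recommend users for each item
  .setK(10)
  .setWithScores(true)     // expert param; default false returns ids only
val model = als.fit(training)
val recs = model.transform(users) // predictionCol holds the recommendations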

Interaction with evaluation

Here I've gone with the approach of having ALSModel.transform handle munging the input data into the form required for evaluation using RankingEvaluator (see #12461). So users can currently do the following (using RankingMetrics until RankingEvaluator is merged):

import org.apache.spark.mllib.evaluation.RankingMetrics

val als = new ALS().setRecommendFor("user").setK(10)
val results = als.fit(training).transform(test).select("prediction", "label")
  .as[(Array[Int], Array[Int])]
  .rdd
val metrics = new RankingMetrics(results)
println(metrics.meanAveragePrecision)
println(metrics.ndcgAt(als.getK))
...

Specifically:

  • If the input dataset has the same format as for ALS.fit, i.e. it has a user and item column, then ALSModel.transform returns only a set of recommended items and a set of "ground truth" items for each user. That is, it does a distinct on user and returns only the columns user, prediction, label for (id, recommendations, ground truth) respectively, discarding any other columns. The reasoning here is that in practice this form will only really be used for cross-validation / evaluation (where you only care about the user, recommended, actual output of transform).
  • If the input dataset only has a user column, then ALSModel.transform returns a set of recommended items for each user, along with any other columns in the input dataset. The reasoning here is that once you have your ALSModel and want to make predictions for, say, a bunch of users, you will only pass in a DataFrame with a user column (no item column, and maybe a bunch of columns of user metadata, which in this use case you don't want to discard). See the sketch after this list.
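
A sketch of the two input shapes under these semantics (column names beyond user/item/rating are illustrative):

// Evaluation-style input: same shape as for ALS.fit.
// Output: one row per distinct user with columns [user, prediction, label].
val evalOutput = model.transform(test.select("user", "item", "rating"))

// Scoring-style input: user ids plus arbitrary metadata columns.
// Output: recommendations appended; the metadata columns are preserved.
val scoringOutput = model.transform(users.select("user", "age", "country"))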

Possible alternative semantics

In the current approach, ALSModel handles transforming the input data into a form suitable for RankingEvaluator. The alternative is to instead have RankingEvaluator do that.

In this case:

  • input dataset to ALSModel.transform for recommendations is kept the same as for ALS.fit, i.e. [user, item, rating].
  • output dataset of ALSModel.transform for recommendations looks like [user, item, rating, prediction] - where rating and prediction are real numbers (not arrays as in the above approach). However, both the rating and prediction columns can be null. A null prediction occurs when there is a (user, item, rating) combo in the dataset, but the item does not occur in the top-k recommendations for that user. A null rating occurs when an item occurs in the top-k recommendations for a user, but that (user, item, rating) combo doesn't occur in the dataset.
  • RankingEvaluator.evaluate takes the format [query, document, relevance, prediction], where relevance and prediction are nullable. It then basically groups by query and collects a set of ground truth documents (from document where relevance is not null) and a set of predicted documents (from document where prediction is not null). Additionally, since order matters in predictions, RankingEvaluator would need to sort the predicted set by prediction score (see the sketch after this list).
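
A rough sketch of the grouping step RankingEvaluator.evaluate would need under this alternative, using plain DataFrame aggregations over the nullable [query, document, relevance, prediction] format (a sketch only, not the evaluator's actual code):

import org.apache.spark.sql.functions._

// collect_list skips nulls, so when(...) without an otherwise drops rows
// where relevance (resp. prediction) is null; sort_array over (prediction,
// document) structs orders the predicted set by descending score.
val grouped = predictions.groupBy("query").agg(
  collect_list(when(col("relevance").isNotNull, col("document"))).as("labels"),
  sort_array(
    collect_list(when(col("prediction").isNotNull,
      struct(col("prediction"), col("document")))),
    asc = false).as("ranked"))
// A final projection would strip the scores from "ranked" to get the
// ordered document ids.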

This alternative seems more "DataFrame"-like, but it would require changes to RankingEvaluator. I have a version of RankingEvaluator that works in this manner pretty much ready to go, and I've done most of the work on this alternative of ALSModel.transform too, so this PR can be adjusted in that direction quickly.

The downside to this alternative is that performance may suffer, as it's perhaps not that efficient: ALSModel.transform blocks the user and item factors and predicts using BLAS, but would then need to explode/flatMap those results back into the required format. RankingEvaluator.evaluate then needs to group by user and collect the two sets of documents, as well as sort the predicted set (again, since we already did that in ALSModel.transform).
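
For reference, the explode/flatMap step referred to above might look like this (names are illustrative; topKPerUser is assumed to hold the blocked top-k results as (user, Array[(item, score)]) pairs):

// Flatten blocked top-k results back into one (user, item, prediction) row
// per recommended pair, ready to be combined with the original ratings.
val flat = topKPerUser.flatMap { case (user, recs) =>
  recs.map { case (item, score) => (user, item, score) }
}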

@SparkQA commented Apr 21, 2016

Test build #56533 has finished for PR 12574 at commit f8cf8b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author
@MLnick commented Apr 21, 2016

retest this please

@SparkQA commented Apr 21, 2016

Test build #56544 has finished for PR 12574 at commit f8cf8b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author
@MLnick commented Apr 22, 2016

The test failure seems to be caused by an issue in #12599.

Contributor
@frreiss commented Apr 22, 2016

These two joins and distinct should probably be a cogroup.mapGroups, since you know that id is a key and dataset is unlikely to contain many duplicates. As far as I can see, the current Catalyst rules will generate a conservative plan with three stages for the expression distinct.join.join, since there's no information available on key constraints.

EDIT: On second thought, there's currently no cogroup operator in the DataFrame API, so I guess the way this expression is currently written is the best one can do.
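
For illustration, a cogroup along the suggested lines is possible with the typed Dataset API (a sketch with hypothetical case classes, assuming spark.implicits._ is in scope; as noted, the untyped DataFrame API has no equivalent operator):

case class Rating(user: Int, item: Int, rating: Float)
case class Factor(id: Int, features: Array[Float])

// One shuffle pairs each distinct user id with its factor vector, instead
// of distinct + two joins; each group easily fits in memory.
val paired = ratings.groupByKey(_.user)
  .cogroup(userFactors.groupByKey(_.id)) { (id, ratingRows, factorRows) =>
    factorRows.map(f => (id, f.features))
  }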

Contributor Author

I'll add a note to look into optimizing this when cogroup exists.

By the way, when you say that dataset is unlikely to contain many duplicates, are you referring to duplicate rows? If so, yes, that is generally true, as each (user, item, rating) tuple is usually distinct (though that's not guaranteed).

Contributor

Sorry, I meant duplicate values of srcCol that are filtered out by the first distinct. What I was trying to say was that, if a cogroup was used here, every group would fit handily in memory.

Contributor Author

Ah, ok. Yes, that should generally be true (though in larger-scale social network style domains, one could run into "super-node" problems with highly popular users or items).

Contributor Author
@MLnick commented Apr 24, 2016

jenkins retest this please

@SparkQA commented Apr 24, 2016

Test build #56849 has finished for PR 12574 at commit f8cf8b4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick force-pushed the SPARK-13857-als-parity branch from c529c89 to 4af7105 on April 25, 2016
@SparkQA commented Apr 25, 2016

Test build #56894 has finished for PR 12574 at commit c529c89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 25, 2016

Test build #56895 has finished for PR 12574 at commit 4af7105.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor
@frreiss commented Apr 25, 2016

LGTM

Member

Would it make sense to set recommendFor and k together? Like this:

def setRecommendFor(forval: String, kval: Int)

That way you would be guaranteed that both are set when using recommendFor.
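
A sketch of that combined setter, assuming recommendFor and k are defined as Params on the model (illustrative only, not the PR's code):

// Setting both params in one call guarantees neither is forgotten.
def setRecommendFor(forVal: String, kVal: Int): this.type = {
  set(recommendFor, forVal)
  set(k, kVal)
}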

Contributor Author

That's interesting. I like the idea, though no other estimator/model uses this approach (but AFAIK there's no other estimator where 2 specific non-default params must be set together). @jkbradley @holdenk thoughts?

@MLnick force-pushed the SPARK-13857-als-parity branch from 4af7105 to af432cc on July 29, 2016
@SparkQA commented Jul 29, 2016

Test build #63005 has finished for PR 12574 at commit af432cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author
@MLnick commented Jul 29, 2016

ping @mengxr @jkbradley @srowen @yanboliang for thoughts and comments on this.

@debasish83

@MLnick I recently visited IBM STC but unfortunately missed you at the meeting... we discussed the ML/MLlib changes for matrix factorization...

@debasish83

I will take a pass at the PR as well.

Contributor Author
@MLnick commented Aug 8, 2016

@debasish83 thanks, I would like to get your comments, especially around the transform semantics. I will be doing performance testing on this and will post numbers soon.

Contributor Author
@MLnick commented Sep 12, 2016

@debasish83 any chance to take a look yet?

@debasish83

Can we close it? It looks like SPARK-18235 opened up recommendForAll.

Contributor
@hhbyyh commented Dec 27, 2016

Hi @debasish83, SPARK-18235 was already closed as a duplicate. I think @MLnick has put a more thorough solution here.

@SparkQA commented Jan 12, 2017

Test build #71257 has finished for PR 12574 at commit 32fde78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jayzes pushed a commit to ello/spark-jobs that referenced this pull request Jan 22, 2017
Until apache/spark#12574 is merged, the ML pipeline doesn’t have the same ability to generate top recommendations for a set of users. So, we’ll tune the model with an ML pipeline, build it from the same data with the RDD-based ALS implementation, then have a component to load that model and generate recs into an intermediate store TBD.
@MLnick closed this Apr 11, 2017