
Conversation

Contributor
@MLnick commented Apr 21, 2016

This PR adds "recommend all" functionality to ML's ALSModel, similar to what exists in the recommendProductsForUsers/recommendUsersForProducts methods in MLlib's MatrixFactorizationModel.

This is a WIP to get some feedback around:

  1. my approach to generic blockify and recommendAll methods in MatrixFactorizationModel, to handle both Double and Float types, since ml uses Float for ratings and factors while mllib uses Double (see the blockify sketch after this list)
  2. the approach to the new params related to recommending top-k in ALSModel
  3. semantics of ALSModel.transform and how it relates to pipelines, cross-validation & evaluation (see this JIRA as well as SPARK-14409, the linked design doc, and [SPARK-14409][ML] Adding a RankingEvaluator to ML #12461).
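
For feedback point 1, a minimal sketch of what a type-generic blockify could look like, assuming factors arrive as (id, factor array) pairs and the same grouping strategy as the existing MatrixFactorizationModel code; the actual PR code may differ:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Group factor vectors into fixed-size blocks so that top-k prediction can
// use BLAS-level matrix multiplication on each pair of blocks. The ClassTag
// lets one method body serve Array[Float] (ml) and Array[Double] (mllib).
def blockify[T: ClassTag](
    factors: RDD[(Int, Array[T])],
    blockSize: Int = 4096): RDD[Seq[(Int, Array[T])]] = {
  factors.mapPartitions(_.grouped(blockSize))
}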

cc @mengxr @jkbradley @srowen @debasish83 @yongtang

TODO

  • Decide on transform semantics and how this fits in with cross-validation and evaluation.
  • Performance testing vs MatrixFactorizationModel and vs the alternative transform semantics.
  • Clean up schema validation a bit and add related tests.

How was this patch tested?

New unit tests in ml.recommendation.ALSSuite

Contributor Author
@MLnick commented Apr 21, 2016

Adding some further detail...

Proposed semantics

  • ALSModel.transform handles both point predictions (predict a rating for each (user, item) combo in the dataset) and "top-k" recommendations (predict an Array of recommendations for each user in the dataset).
  • Add 3 parameters to handle this (see the usage sketch after this list):
    • k - number of recommendations to compute for each unique id in the input column.
    • recommendFor - whether to recommend for user (recommends top k items for each user) or item (recommends top k users for each item).
    • withScores - (expert) whether to return recommendations as [(id1, score1), (id2, score2), ...] or [id1, id2, ...]. Default false i.e. ids only.
  • Re-use predictionCol for recommendations. This means the output schema / semantics of transform differ when recommending top-k.
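
To make the proposed API concrete, a usage sketch based on the params above (setRecommendFor, setK and setWithScores are the proposed setters, not an existing API):

// Hypothetical usage: top-10 items per user, returned with scores.
val als = new ALS()
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")
  .setRecommendFor("user") // or "item" to recommend users for each item
  .setK(10)
  .setWithScores(true)     // expert param; default false returns ids only
val model = als.fit(training)
val recs = model.transform(users) // predictionCol holds the recommendations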

Interaction with evaluation

Here I've gone with the approach of having ALSModel.transform handle munging the input data into the form required for evaluation using RankingEvaluator (see #12461). So users can currently do the following (using RankingMetrics until RankingEvaluator is merged):

import org.apache.spark.mllib.evaluation.RankingMetrics

val als = new ALS().setRecommendFor("user").setK(10)
val results = als.fit(training).transform(test).select("prediction", "label")
  .as[(Array[Int], Array[Int])]
  .rdd
val metrics = new RankingMetrics(results)
println(metrics.meanAveragePrecision)
println(metrics.ndcgAt(als.getK))
...

Specifically:

  • If the input dataset has the same format as for ALS.fit, i.e. it has a user and item column, then ALSModel.transform returns only a set of recommended items and a set of "ground truth" items for each user. That is, it does a distinct on user and returns only the columns user, prediction, label for (id, recommendations, ground truth) respectively, discarding any other columns. The reasoning here is that in practice this form will only really be used for cross-validation / evaluation (where you only care about the user, recommended, actual output of transform).
  • If the input dataset only has a user column, then ALSModel.transform returns a set of recommended items for each user, along with any other columns in the input dataset. The reasoning here is that once you have your ALSModel and want to make predictions for, say, a bunch of users, you will only pass in a DataFrame with a user column (no item column, and maybe a bunch of columns of user metadata, which in this use case you don't want to discard). See the sketch after this list.
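
A sketch of the two input shapes under these semantics (column names beyond user/item/rating are illustrative):

// Evaluation-style input: same shape as for ALS.fit.
// Output: one row per distinct user with columns [user, prediction, label].
val evalOutput = model.transform(test.select("user", "item", "rating"))

// Scoring-style input: user ids plus arbitrary metadata columns.
// Output: recommendations appended; the metadata columns are preserved.
val scoringOutput = model.transform(users.select("user", "age", "country"))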

Possible alternative semantics

In the current approach, ALSModel handles transforming the input data into a form suitable for RankingEvaluator. The alternative is to instead have RankingEvaluator do that.

In this case:

  • input dataset to ALSModel.transform for recommendations is kept the same as for ALS.fit, i.e. [user, item, rating].
  • output dataset of ALSModel.transform for recommendations looks like [user, item, rating, prediction] - where rating and prediction are real numbers (not arrays as in the above approach). However, both the rating and prediction columns can be null. A null prediction occurs when there is a (user, item, rating) combo in the dataset, but the item does not occur in the top-k recommendations for that user. A null rating occurs when an item occurs in the top-k recommendations for a user, but that (user, item, rating) combo doesn't occur in the dataset.
  • RankingEvaluator.evaluate takes the format [query, document, relevance, prediction], where relevance and prediction are nullable. It then basically groups by query and collects a set of ground truth documents (from document where relevance is not null) and a set of predicted documents (from document where prediction is not null). Additionally, since order matters in predictions, RankingEvaluator would need to sort the predicted set by prediction score (see the sketch after this list).
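
A rough sketch of the grouping step RankingEvaluator.evaluate would need under this alternative, using plain DataFrame aggregations over the nullable [query, document, relevance, prediction] format (a sketch only, not the evaluator's actual code):

import org.apache.spark.sql.functions._

// collect_list skips nulls, so when(...) without an otherwise drops rows
// where relevance (resp. prediction) is null; sort_array over (prediction,
// document) structs orders the predicted set by descending score.
val grouped = predictions.groupBy("query").agg(
  collect_list(when(col("relevance").isNotNull, col("document"))).as("labels"),
  sort_array(
    collect_list(when(col("prediction").isNotNull,
      struct(col("prediction"), col("document")))),
    asc = false).as("ranked"))
// A final projection would strip the scores from "ranked" to get the
// ordered document ids.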

This alternative seems more "DataFrame"-like, but it would require changes to RankingEvaluator. I have a version of RankingEvaluator that works in this manner pretty much ready to go, and I've done most of the work on this alternative of ALSModel.transform too, so this PR can be adjusted in that direction quickly.

The downside to this alternative is that performance may suffer, as it's perhaps not that efficient: ALSModel.transform blocks the user and item factors and predicts using BLAS, but would then need to explode/flatMap those results back into the required format. RankingEvaluator.evaluate then needs to group by user and collect the two sets of documents, as well as sort the predicted set (again, since we already did that in ALSModel.transform).
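
For reference, the explode/flatMap step referred to above might look like this (names are illustrative; topKPerUser is assumed to hold the blocked top-k results as (user, Array[(item, score)]) pairs):

// Flatten blocked top-k results back into one (user, item, prediction) row
// per recommended pair, ready to be combined with the original ratings.
val flat = topKPerUser.flatMap { case (user, recs) =>
  recs.map { case (item, score) => (user, item, score) }
}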

@SparkQA commented Apr 21, 2016

Test build #56533 has finished for PR 12574 at commit f8cf8b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author
@MLnick commented Apr 21, 2016

retest this please

@SparkQA commented Apr 21, 2016

Test build #56544 has finished for PR 12574 at commit f8cf8b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author
@MLnick commented Apr 22, 2016

The test failure seems to be caused by an issue in #12599.

Contributor
@frreiss commented Apr 22, 2016

These two joins and distinct should probably be a cogroup.mapGroups, since you know that id is a key and dataset is unlikely to contain many duplicates. As far as I can see, the current Catalyst rules will generate a conservative plan with three stages for the expression distinct.join.join, since there's no information available on key constraints.

EDIT: On second thought, there's currently no cogroup operator in the DataFrame API, so I guess the way this expression is currently written is the best one can do.
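
For illustration, a cogroup along the suggested lines is possible with the typed Dataset API (a sketch with hypothetical case classes, assuming spark.implicits._ is in scope; as noted, the untyped DataFrame API has no equivalent operator):

case class Rating(user: Int, item: Int, rating: Float)
case class Factor(id: Int, features: Array[Float])

// One shuffle pairs each distinct user id with its factor vector, instead
// of distinct + two joins; each group easily fits in memory.
val paired = ratings.groupByKey(_.user)
  .cogroup(userFactors.groupByKey(_.id)) { (id, ratingRows, factorRows) =>
    factorRows.map(f => (id, f.features))
  }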

Contributor Author

I'll add a note to look into optimizing this when cogroup exists.

By the way, when you say that dataset is unlikely to contain many duplicates, are you referring to duplicate rows? If so, yes, that is generally true, as each (user, item, rating) tuple is usually distinct (though that's not guaranteed).

Contributor

Sorry, I meant duplicate values of srcCol that are filtered out by the first distinct. What I was trying to say was that, if a cogroup was used here, every group would fit handily in memory.

Contributor Author

Ah, ok. Yes, that should generally be true (though in larger-scale social network style domains, one could run into "super-node" problems with highly popular users or items).

Contributor Author
@MLnick commented Apr 24, 2016

jenkins retest this please

@SparkQA commented Apr 24, 2016

Test build #56849 has finished for PR 12574 at commit f8cf8b4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick force-pushed the SPARK-13857-als-parity branch from c529c89 to 4af7105 on April 25, 2016
@SparkQA commented Apr 25, 2016

Test build #56894 has finished for PR 12574 at commit c529c89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 25, 2016

Test build #56895 has finished for PR 12574 at commit 4af7105.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor
@frreiss commented Apr 25, 2016

LGTM

Member

Would it make sense to set recommendFor and k together? Like this:

def setRecommendFor(forval: String, kval: Int)

That way you would be guaranteed that both are set when using recommendFor.
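
A sketch of that combined setter, assuming recommendFor and k are defined as Params on the model (illustrative only, not the PR's code):

// Setting both params in one call guarantees neither is forgotten.
def setRecommendFor(forVal: String, kVal: Int): this.type = {
  set(recommendFor, forVal)
  set(k, kVal)
}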

Contributor Author

That's interesting. I like the idea, though no other estimator/model uses this approach (but AFAIK there's no other estimator where 2 specific non-default params must be set together). @jkbradley @holdenk thoughts?

@MLnick force-pushed the SPARK-13857-als-parity branch from 4af7105 to af432cc on July 29, 2016
@SparkQA commented Jul 29, 2016

Test build #63005 has finished for PR 12574 at commit af432cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author
@MLnick commented Jul 29, 2016

ping @mengxr @jkbradley @srowen @yanboliang for thoughts and comments on this.

@debasish83

@MLnick I recently visited IBM STC but unfortunately missed you at the meeting... we discussed the ML/MLlib changes for matrix factorization...

@debasish83

I will take a pass at the PR as well.

Contributor Author
@MLnick commented Aug 8, 2016

@debasish83 thanks, I would like to get your comments, especially around the transform semantics. I will be doing performance testing on this and will post numbers soon.

Contributor Author
@MLnick commented Sep 12, 2016

@debasish83 any chance to take a look yet?

@debasish83

Can we close it? It looks like SPARK-18235 opened up recommendForAll.

Contributor
@hhbyyh commented Dec 27, 2016

Hi @debasish83, SPARK-18235 was already closed as a duplicate. I think @MLnick has put a more thorough solution here.

@SparkQA commented Jan 12, 2017

Test build #71257 has finished for PR 12574 at commit 32fde78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jayzes pushed a commit to ello/spark-jobs that referenced this pull request Jan 22, 2017
Until apache/spark#12574 is merged, the ML pipeline doesn’t have the same ability to generate top recommendations for a set of users. So, we’ll tune the model with an ML pipeline, build it from the same data with the RDD-based ALS implementation, then have a component to load that model and generate recs into an intermediate store TBD.
@MLnick closed this Apr 11, 2017