[SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy #12896

MLnick · 2016-05-04T10:33:14Z

This PR adds a param to ALS/ALSModel to set the strategy used when encountering unknown users or items at prediction time in transform. This can occur in 2 scenarios: (a) production scoring, and (b) cross-validation & evaluation.

The current behavior returns NaN if a user/item is unknown. In scenario (b), this can easily occur when using CrossValidator or TrainValidationSplit since some users/items may only occur in the test set and not in the training set. In this case, the evaluator returns NaN for all metrics, making model selection impossible.

The new param, coldStartStrategy, defaults to nan (the current behavior). The other option supported initially is drop, which drops all rows with NaN predictions. This flag allows users to use ALS in cross-validation settings. It is made an expertParam. The param is made a string so that the set of strategies can be extended in future (some options are discussed in SPARK-14489).

How was this patch tested?

New unit tests, and manual "before and after" tests for Scala & Python using MovieLens ml-latest-small as example data. Here, using CrossValidator or TrainValidationSplit with the default param setting results in metrics that are all NaN, while setting coldStartStrategy to drop results in valid metrics.

MLnick · 2016-05-04T10:34:37Z

cc @sethah @holdenk @srowen @jkbradley

SparkQA · 2016-05-04T11:15:57Z

Test build #57753 has finished for PR 12896 at commit fc43745.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-05-04T15:22:33Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

I don't like the name "unknownStrategy" - seems very ambiguous. I'm not sure what is best, but for example I think "unknownPredictionStrategy" or "newUserItemPredictionStrategy" are better. Just something that will be obvious what it is by the name.

Fair enough - I was trying to keep it brief. But it could be more clear.
How about "unknownIdPredictionStrategy"?
On Wed, 4 May 2016 at 17:23, Seth Hendrickson [email protected]
wrote:

In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
#12896 (comment):

@@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo

/** @group getParam */
def getItemCol: String = $(itemCol)
+

/**

* Param for strategy for dealing with unknown users or items at prediction time.

* Supported values:

* - "nan": prediction value for unknown ids will be NaN.

* - "drop": rows in the input DataFrame containing unknown ids will be dropped.

* Default: "nan".

* @group expertParam

*/

val unknownStrategy = new Param[String](this, "unknownStrategy",

I don't like the name "unknownStrategy" - seems very ambiguous. I'm not
sure what is best, but for example I think "unknownPredictionStrategy" or
"newUserItemPredictionStrategy" are better. Just something that will be
obvious what it is by the name.

—
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12896/files/fc437451a598221f0878b7a2e0b87d17572019cc#r62059332

Thinking about it more, using "unknown" just seems confusing. "newIdPredictionStrategy" ? I don't have a great idea for it.

I personally don't see much difference between "new" and "unknown". Will
think about it some more

On Wed, 4 May 2016 at 18:08, Seth Hendrickson [email protected]
wrote:

In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
#12896 (comment):

@@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo

/** @group getParam */
def getItemCol: String = $(itemCol)
+

/**

* Param for strategy for dealing with unknown users or items at prediction time.

* Supported values:

* - "nan": prediction value for unknown ids will be NaN.

* - "drop": rows in the input DataFrame containing unknown ids will be dropped.

* Default: "nan".

* @group expertParam

*/

val unknownStrategy = new Param[String](this, "unknownStrategy",

Thinking about it more, using "unknown" just seems confusing.
"newIdPredictionStrategy" ? I don't have a great idea for it.

—
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12896/files/fc437451a598221f0878b7a2e0b87d17572019cc#r62067123

I think "unknownIdPredictionStrategy" or just "unknownIdStrategy" sound fine - they are unknown to the trained model right?

How about "coldStartStrategy"? For people familiar with recommendations "cold start" is clear in meaning. But it may be less clear for people less familiar.

@srowen what do you think?

@MLnick I think it's good. It's an "expert" param anyway. Also, it's better in a way because it's hard to assume what it means. For those that don't know it, they can check out the doc.

MLnick · 2016-05-07T10:33:30Z

Changed param name to coldStartStrategy, added a little more doc to the param, and removed the Python tests.

MLnick · 2016-05-07T10:36:33Z

I think I may create a new example for ALS cross-validation (or port the Movielens example from mllib even though ml won't yet support top-k recs) to illustrate the usage of this param.

SparkQA · 2016-05-07T11:15:13Z

Test build #58064 has finished for PR 12896 at commit b71dcac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-05-11T15:24:09Z

mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

"will be dropped." -> "will not be included in the recommendations."

It could sound like we are modifying the input dataframe, but we are not.

Fair enough. Will reword this a bit.
On Wed, 11 May 2016 at 17:25, Seth Hendrickson [email protected]
wrote:

In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
#12896 (comment):

@@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo

/** @group getParam */
def getItemCol: String = $(itemCol)
+

/**

* Param for strategy for dealing with unknown or new users/items at prediction time.

* This may be useful in cross-validation or production scenarios, for handling user/item ids

* the model has not seen in the training data.

* Supported values:

* - "nan": predicted value for unknown ids will be NaN.

* - "drop": rows in the input DataFrame containing unknown ids will be dropped.

"will be dropped." -> "will not be included in the recommendations."

It could sound like we are modifying the input dataframe, but we are not.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12896/files/b71dcaca39d5f10b484932eaf25e290c7796d93b#r62869002

MLnick · 2016-05-11T18:54:59Z

@srowen @jkbradley - this is technically not a "bug", but it impacts usability of CrossValidator / TrainValidationSplit for users of ALS. As such I'd like to have it in 2.0, if you have time to take a quick pass. I don't feel super strongly about that if you prefer to leave it for 2.1 though.

SparkQA · 2016-07-20T11:10:15Z

Test build #62598 has finished for PR 12896 at commit 0d0278a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-08-01T13:30:34Z

Hm, let me go through the logic with you one more time here. Isn't it better in theory to fix the model to not return NaN, but rather return some default answer, even if it's "0" or equivalent? this is at least no worse for scoring, and fixes the evaluation problem. New users and items are reasonable conditions for this model, not an error case.

The current behavior isn't that helpful, so I'm not sure leaving it as a choice is doing anybody a favor. My concern with the "drop" mode is that it is not penalizing any case where the model can't make an answer.

MLnick · 2016-08-01T14:50:51Z

I think there is a fair bit of difference between cross-validating the model and scoring in production.

In most practical live scoring situations, there may be multiple levels of fallbacks / defaults for the cold-start case (including e.g. "most popular", "newest", "content-based methods" etc etc). There may also be various post-processing steps applied to the results. I don't think it's feasible to re-create live behaviour perfectly for cross-validation scenarios (especially as these systems are often totally different from Spark).

Even for offline bulk scoring, again there may be many different options for cold start. Do we intend to support all of them within Spark? Again I don't think that's feasible, though as discussed on the JIRA we can certainly support a few useful options, such as "average user" which could indeed serve for both CV and live scoring purposes.

I actually think NaN for live scoring is "better" than say 0, because then it is very clear that it's a missing data point (which the system can choose how to handle) rather than a prediction of 0.

For CV, I'd expect that predicting 0 would have a dramatic negative impact on RMSE. So for CV I'd say the drop option is more reasonable.

This is not arguing against other reasonable options (average rating, average user vectors and so on) - we can add those later on user demand. This is just a start.

MLnick · 2016-08-01T14:51:42Z

As for leaving the current behavior as default - it's more for compatibility. Though as I mention above I believe returning NaN is reasonable for live scoring purposes.

srowen · 2016-08-01T14:54:51Z

For scoring, returning "I don't know" could be reasonable in order to let some other logic take over. I could see that. It would be important to do that, but maybe a separable concern. For evaluation, hm, if the goal is to ignore cases where the result will reasonably be "I don't know" then what about constructing the evaluation differently, so that the test set doesn't have users that the train set doesn't for example? that's more sound, because then any 'bad' NaNs still show up. Is it that this is hard to implement in the current structure?

MLnick · 2016-08-01T15:03:33Z

Your suggestion is, to me, the ideal solution. It's probably the more common method of splitting "ratings" datasets for CV purposes.

I'm interested in working on it but I think it would be a whole new specific cross-validator class. I'm not quite sure what the best approach is for efficiency (refer #14321 for stratified sampling approach, it's more for labels and is not efficient for this case, but the general concept might apply). In short, it's obviously a lot more effort and will take time. Perhaps it also starts life outside of Spark in packages. Not sure on this yet, but happy to collaborate on ideas!

Originally this PR was intended for 2.0 to at least make ALS useable with the CV classes.

srowen · 2016-08-01T15:14:00Z

OK. I can see that this is a bit of a band-aid, but might still be implemented as part of the beginnings of a more general 'cold start' configuration. As long as the API itself makes sense in the long term, and it seems so.

"drop" really drops any prediction that's NaN, not necessarily only for unknown users and items. I suppose the assumption is that those are the same cases, which is probably true in practice now.

What would it mean if a method was later added to make point predictions? you'd get nothing back, or maybe this would be considered synonymous with NaN in this case and get NaN back.

It's a little stretch to implement the required behavior here as a cold-start strategy but not terrible. If you're OK with it go for it. I think there are a few more comments to resolve about the @since tags

MLnick · 2016-08-02T09:05:16Z

Another option is to make predictionCol nullable and return null predictions. The drop strategy can still apply (though it will need to be a custom filter rather than df.na.drop), but it makes it totally clear when a prediction is "missing" vs NaN.

However, is it even possible to get a bunch of NaNs, e.g. if the model somehow diverged (I don't think that's even possible with ALS?). So, this may just add needless complexity.

MLnick · 2016-08-02T09:10:57Z

Sean, let's assume we ultimately have the following available:

drop - drop NaN
average - return the average rating for new users/items
average_factor - predict using the average factor vector for new users/items

Would you suggest that this a "good" end result? Or would you suggest to not have the drop option available in this list?

srowen · 2016-08-02T16:25:58Z

Yes that sounds like the direction this is heading. In a way, "drop" and "nan" aren't really different cold start strategies. Both return an implicit or explicit "I don't know; ask something else". One of these wouldn't have a counterpart in an API that offers point predictions.

"drop" and "nan" have distinct use in the context of evaluation, and it sounds like we'd even get rid of "nan" except for potential backwards compatibility. But, maybe that's just fine.

SparkQA · 2016-08-10T14:12:17Z

Test build #63535 has finished for PR 12896 at commit 9ff2a0a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-08-11T11:56:24Z

@srowen @jkbradley any further comments / issues? I plan to create follow-up JIRAs for docs & examples, as well as expansions to the cold-start strategies.

also cc @mengxr @yanboliang for any comments.

dzhus · 2016-09-06T15:35:56Z

Excuse me for barging in, but speaking of compatibility and ALS cross-validation, am I correct that RegressionEvaluator is currently unaware of special logic required for RMSE to be at least remotely helpful with implicit ALS, as discussed in #597? I am getting RMSE of well over 1 using RegressionEvaluator on implicit ALS training results.

srowen · 2016-09-06T15:39:37Z

RMSE is just RMSE, regardless of what data it's applied to. You're right that RMSE should be <= 1 if all input/outputs are in [0,1], but that's not actually the case in the implicit model, where predictions can actually be outside this range. Really, RMSE is not the right metric for implicit results to begin with. You would want ranking metrics for this case.

SparkQA · 2016-11-21T07:32:35Z

Test build #68925 has started for PR 12896 at commit a439899.

SparkQA · 2016-11-30T11:16:28Z

Test build #69401 has finished for PR 12896 at commit a439899.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2017-01-18T09:47:45Z

Jenkins retest this please

SparkQA · 2017-01-18T10:57:30Z

Test build #71592 has finished for PR 12896 at commit a439899.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-23T14:18:48Z

Test build #71838 has finished for PR 12896 at commit 9981fd8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2017-01-24T08:30:01Z

Reviving after a hiatus. Updated since tags. I've actually recently come across a number of users hitting this issue in production and are unable to use ALS with cross-validation as a result.

Added SPARK-19345 for adding detail to the docs / examples and SPARK-19346 for adding the advanced strategies discussed above.

ping @sethah @srowen @jkbradley in case you have any more comment. I'll merge in a day or two if no major comments.

MLnick · 2017-02-20T10:37:09Z

jenkins retest this please

SparkQA · 2017-02-20T13:26:33Z

Test build #73164 has finished for PR 12896 at commit 9981fd8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-27T17:45:06Z

Test build #73521 has finished for PR 12896 at commit 041a3b3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2017-02-28T14:17:32Z

Merged to master

MLnick mentioned this pull request May 4, 2016

[SPARK-14489][SPARK-14153][ML][PYSPARK] Support dropping NaN predicted values in RegressionEvaluator #12577

Closed

sethah reviewed May 4, 2016
View reviewed changes

MLnick force-pushed the SPARK-14489-als-nan branch from fc43745 to b71dcac Compare May 7, 2016 10:32

sethah reviewed May 11, 2016
View reviewed changes

Nick Pentreath added 8 commits July 20, 2016 11:56

Scala param and tests

c1ebc81

Python param and tests

674bd0d

Improve test a bit

39c6e49

Doc tweak

9c99699

Doc remove space

e87432f

Rename param to 'coldStartStrategy'

b4c1e6c

Update Python param and remove unnecessary test

8d605d2

Fix compile issues due to SparkSession and mllib-local

0d0278a

MLnick force-pushed the SPARK-14489-als-nan branch from b71dcac to 0d0278a Compare July 20, 2016 10:11

Nick Pentreath added 2 commits July 20, 2016 13:28

case insensitive param value. Bump since annotations

f40f506

Clarify comment for 'drop' case

6599372

update since tags in python

9ff2a0a

Merge branch 'master' into SPARK-14489-als-nan

a439899

Bump since tags to 2.2.0

9981fd8

check that remaining prediction is equal

041a3b3

asfgit closed this in b405466 Feb 28, 2017

MLnick deleted the SPARK-14489-als-nan branch May 4, 2017 09:04

[SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy #12896

[SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy #12896

Uh oh!

Conversation

MLnick commented May 4, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How was this patch tested?

Uh oh!

MLnick commented May 4, 2016

Uh oh!

SparkQA commented May 4, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MLnick commented May 7, 2016

Uh oh!

MLnick commented May 7, 2016

Uh oh!

SparkQA commented May 7, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MLnick commented May 11, 2016

Uh oh!

SparkQA commented Jul 20, 2016

Uh oh!

srowen commented Aug 1, 2016

Uh oh!

MLnick commented Aug 1, 2016

Uh oh!

MLnick commented Aug 1, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

srowen commented Aug 1, 2016

Uh oh!

MLnick commented Aug 1, 2016

Uh oh!

srowen commented Aug 1, 2016

Uh oh!

MLnick commented Aug 2, 2016

Uh oh!

MLnick commented Aug 2, 2016

Uh oh!

srowen commented Aug 2, 2016

Uh oh!

SparkQA commented Aug 10, 2016

Uh oh!

MLnick commented Aug 11, 2016

Uh oh!

dzhus commented Sep 6, 2016

Uh oh!

srowen commented Sep 6, 2016

Uh oh!

SparkQA commented Nov 21, 2016

Uh oh!

SparkQA commented Nov 30, 2016

Uh oh!

MLnick commented Jan 18, 2017

Uh oh!

SparkQA commented Jan 18, 2017

Uh oh!

SparkQA commented Jan 23, 2017

Uh oh!

MLnick commented Jan 24, 2017

Uh oh!

MLnick commented Feb 20, 2017

MLnick commented May 4, 2016 •

edited

Loading

MLnick commented Aug 1, 2016 •

edited

Loading