Skip to content

Conversation

@MLnick
Copy link
Contributor

@MLnick MLnick commented May 4, 2016

This PR adds a param to ALS/ALSModel to set the strategy used when encountering unknown users or items at prediction time in transform. This can occur in 2 scenarios: (a) production scoring, and (b) cross-validation & evaluation.

The current behavior returns NaN if a user/item is unknown. In scenario (b), this can easily occur when using CrossValidator or TrainValidationSplit since some users/items may only occur in the test set and not in the training set. In this case, the evaluator returns NaN for all metrics, making model selection impossible.

The new param, coldStartStrategy, defaults to nan (the current behavior). The other option supported initially is drop, which drops all rows with NaN predictions. This flag allows users to use ALS in cross-validation settings. It is made an expertParam. The param is made a string so that the set of strategies can be extended in future (some options are discussed in SPARK-14489).

How was this patch tested?

New unit tests, and manual "before and after" tests for Scala & Python using MovieLens ml-latest-small as example data. Here, using CrossValidator or TrainValidationSplit with the default param setting results in metrics that are all NaN, while setting coldStartStrategy to drop results in valid metrics.

@MLnick
Copy link
Contributor Author

MLnick commented May 4, 2016

cc @sethah @holdenk @srowen @jkbradley

@SparkQA
Copy link

SparkQA commented May 4, 2016

Test build #57753 has finished for PR 12896 at commit fc43745.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the name "unknownStrategy" - seems very ambiguous. I'm not sure what is best, but for example I think "unknownPredictionStrategy" or "newUserItemPredictionStrategy" are better. Just something that will be obvious what it is by the name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough - I was trying to keep it brief. But it could be more clear.
How about "unknownIdPredictionStrategy"?
On Wed, 4 May 2016 at 17:23, Seth Hendrickson [email protected]
wrote:

In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
#12896 (comment):

@@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo

/** @group getParam */
def getItemCol: String = $(itemCol)
+

  • /**
  • * Param for strategy for dealing with unknown users or items at prediction time.
  • * Supported values:
  • * - "nan": prediction value for unknown ids will be NaN.
  • * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
  • * Default: "nan".
  • * @group expertParam
  • */
  • val unknownStrategy = new Param[String](this, "unknownStrategy",

I don't like the name "unknownStrategy" - seems very ambiguous. I'm not
sure what is best, but for example I think "unknownPredictionStrategy" or
"newUserItemPredictionStrategy" are better. Just something that will be
obvious what it is by the name.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12896/files/fc437451a598221f0878b7a2e0b87d17572019cc#r62059332

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it more, using "unknown" just seems confusing. "newIdPredictionStrategy" ? I don't have a great idea for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally don't see much difference between "new" and "unknown". Will
think about it some more

On Wed, 4 May 2016 at 18:08, Seth Hendrickson [email protected]
wrote:

In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
#12896 (comment):

@@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo

/** @group getParam */
def getItemCol: String = $(itemCol)
+

  • /**
  • * Param for strategy for dealing with unknown users or items at prediction time.
  • * Supported values:
  • * - "nan": prediction value for unknown ids will be NaN.
  • * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
  • * Default: "nan".
  • * @group expertParam
  • */
  • val unknownStrategy = new Param[String](this, "unknownStrategy",

Thinking about it more, using "unknown" just seems confusing.
"newIdPredictionStrategy" ? I don't have a great idea for it.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12896/files/fc437451a598221f0878b7a2e0b87d17572019cc#r62067123

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "unknownIdPredictionStrategy" or just "unknownIdStrategy" sound fine - they are unknown to the trained model right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about "coldStartStrategy"? For people familiar with recommendations "cold start" is clear in meaning. But it may be less clear for people less familiar.

@srowen what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MLnick I think it's good. It's an "expert" param anyway. Also, it's better in a way because it's hard to assume what it means. For those that don't know it, they can check out the doc.

@MLnick MLnick force-pushed the SPARK-14489-als-nan branch from fc43745 to b71dcac Compare May 7, 2016 10:32
@MLnick
Copy link
Contributor Author

MLnick commented May 7, 2016

Changed param name to coldStartStrategy, added a little more doc to the param, and removed the Python tests.

@MLnick
Copy link
Contributor Author

MLnick commented May 7, 2016

I think I may create a new example for ALS cross-validation (or port the Movielens example from mllib even though ml won't yet support top-k recs) to illustrate the usage of this param.

@SparkQA
Copy link

SparkQA commented May 7, 2016

Test build #58064 has finished for PR 12896 at commit b71dcac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"will be dropped." -> "will not be included in the recommendations."

It could sound like we are modifying the input dataframe, but we are not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Will reword this a bit.
On Wed, 11 May 2016 at 17:25, Seth Hendrickson [email protected]
wrote:

In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
#12896 (comment):

@@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo

/** @group getParam */
def getItemCol: String = $(itemCol)
+

  • /**
  • * Param for strategy for dealing with unknown or new users/items at prediction time.
  • * This may be useful in cross-validation or production scenarios, for handling user/item ids
  • * the model has not seen in the training data.
  • * Supported values:
  • * - "nan": predicted value for unknown ids will be NaN.
  • * - "drop": rows in the input DataFrame containing unknown ids will be dropped.

"will be dropped." -> "will not be included in the recommendations."

It could sound like we are modifying the input dataframe, but we are not.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12896/files/b71dcaca39d5f10b484932eaf25e290c7796d93b#r62869002

@MLnick
Copy link
Contributor Author

MLnick commented May 11, 2016

@srowen @jkbradley - this is technically not a "bug", but it impacts usability of CrossValidator / TrainValidationSplit for users of ALS. As such I'd like to have it in 2.0, if you have time to take a quick pass. I don't feel super strongly about that if you prefer to leave it for 2.1 though.

@MLnick MLnick force-pushed the SPARK-14489-als-nan branch from b71dcac to 0d0278a Compare July 20, 2016 10:11
@SparkQA
Copy link

SparkQA commented Jul 20, 2016

Test build #62598 has finished for PR 12896 at commit 0d0278a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Aug 1, 2016

Hm, let me go through the logic with you one more time here. Isn't it better in theory to fix the model to not return NaN, but rather return some default answer, even if it's "0" or equivalent? this is at least no worse for scoring, and fixes the evaluation problem. New users and items are reasonable conditions for this model, not an error case.

The current behavior isn't that helpful, so I'm not sure leaving it as a choice is doing anybody a favor. My concern with the "drop" mode is that it is not penalizing any case where the model can't make an answer.

@MLnick
Copy link
Contributor Author

MLnick commented Aug 1, 2016

I think there is a fair bit of difference between cross-validating the model and scoring in production.

In most practical live scoring situations, there may be multiple levels of fallbacks / defaults for the cold-start case (including e.g. "most popular", "newest", "content-based methods" etc etc). There may also be various post-processing steps applied to the results. I don't think it's feasible to re-create live behaviour perfectly for cross-validation scenarios (especially as these systems are often totally different from Spark).

Even for offline bulk scoring, again there may be many different options for cold start. Do we intend to support all of them within Spark? Again I don't think that's feasible, though as discussed on the JIRA we can certainly support a few useful options, such as "average user" which could indeed serve for both CV and live scoring purposes.

I actually think NaN for live scoring is "better" than say 0, because then it is very clear that it's a missing data point (which the system can choose how to handle) rather than a prediction of 0.

For CV, I'd expect that predicting 0 would have a dramatic negative impact on RMSE. So for CV I'd say the drop option is more reasonable.

This is not arguing against other reasonable options (average rating, average user vectors and so on) - we can add those later on user demand. This is just a start.

@MLnick
Copy link
Contributor Author

MLnick commented Aug 1, 2016

As for leaving the current behavior as default - it's more for compatibility. Though as I mention above I believe returning NaN is reasonable for live scoring purposes.

@srowen
Copy link
Member

srowen commented Aug 1, 2016

For scoring, returning "I don't know" could be reasonable in order to let some other logic take over. I could see that. It would be important to do that, but maybe a separable concern. For evaluation, hm, if the goal is to ignore cases where the result will reasonably be "I don't know" then what about constructing the evaluation differently, so that the test set doesn't have users that the train set doesn't for example? that's more sound, because then any 'bad' NaNs still show up. Is it that this is hard to implement in the current structure?

@MLnick
Copy link
Contributor Author

MLnick commented Aug 1, 2016

Your suggestion is, to me, the ideal solution. It's probably the more common method of splitting "ratings" datasets for CV purposes.

I'm interested in working on it but I think it would be a whole new specific cross-validator class. I'm not quite sure what the best approach is for efficiency (refer #14321 for stratified sampling approach, it's more for labels and is not efficient for this case, but the general concept might apply). In short, it's obviously a lot more effort and will take time. Perhaps it also starts life outside of Spark in packages. Not sure on this yet, but happy to collaborate on ideas!

Originally this PR was intended for 2.0 to at least make ALS useable with the CV classes.

@srowen
Copy link
Member

srowen commented Aug 1, 2016

OK. I can see that this is a bit of a band-aid, but might still be implemented as part of the beginnings of a more general 'cold start' configuration. As long as the API itself makes sense in the long term, and it seems so.

"drop" really drops any prediction that's NaN, not necessarily only for unknown users and items. I suppose the assumption is that those are the same cases, which is probably true in practice now.

What would it mean if a method was later added to make point predictions? you'd get nothing back, or maybe this would be considered synonymous with NaN in this case and get NaN back.

It's a little stretch to implement the required behavior here as a cold-start strategy but not terrible. If you're OK with it go for it. I think there are a few more comments to resolve about the @since tags

@MLnick
Copy link
Contributor Author

MLnick commented Aug 2, 2016

Another option is to make predictionCol nullable and return null predictions. The drop strategy can still apply (though it will need to be a custom filter rather than df.na.drop), but it makes it totally clear when a prediction is "missing" vs NaN.

However, is it even possible to get a bunch of NaNs, e.g. if the model somehow diverged (I don't think that's even possible with ALS?). So, this may just add needless complexity.

@MLnick
Copy link
Contributor Author

MLnick commented Aug 2, 2016

Sean, let's assume we ultimately have the following available:

  • drop - drop NaN
  • average - return the average rating for new users/items
  • average_factor - predict using the average factor vector for new users/items

Would you suggest that this a "good" end result? Or would you suggest to not have the drop option available in this list?

@srowen
Copy link
Member

srowen commented Aug 2, 2016

Yes that sounds like the direction this is heading. In a way, "drop" and "nan" aren't really different cold start strategies. Both return an implicit or explicit "I don't know; ask something else". One of these wouldn't have a counterpart in an API that offers point predictions.

"drop" and "nan" have distinct use in the context of evaluation, and it sounds like we'd even get rid of "nan" except for potential backwards compatibility. But, maybe that's just fine.

@SparkQA
Copy link

SparkQA commented Aug 10, 2016

Test build #63535 has finished for PR 12896 at commit 9ff2a0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor Author

MLnick commented Aug 11, 2016

@srowen @jkbradley any further comments / issues? I plan to create follow-up JIRAs for docs & examples, as well as expansions to the cold-start strategies.

also cc @mengxr @yanboliang for any comments.

@dzhus
Copy link

dzhus commented Sep 6, 2016

Excuse me for barging in, but speaking of compatibility and ALS cross-validation, am I correct that RegressionEvaluator is currently unaware of special logic required for RMSE to be at least remotely helpful with implicit ALS, as discussed in #597? I am getting RMSE of well over 1 using RegressionEvaluator on implicit ALS training results.

@srowen
Copy link
Member

srowen commented Sep 6, 2016

RMSE is just RMSE, regardless of what data it's applied to. You're right that RMSE should be <= 1 if all input/outputs are in [0,1], but that's not actually the case in the implicit model, where predictions can actually be outside this range. Really, RMSE is not the right metric for implicit results to begin with. You would want ranking metrics for this case.

@SparkQA
Copy link

SparkQA commented Nov 21, 2016

Test build #68925 has started for PR 12896 at commit a439899.

@SparkQA
Copy link

SparkQA commented Nov 30, 2016

Test build #69401 has finished for PR 12896 at commit a439899.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor Author

MLnick commented Jan 18, 2017

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Jan 18, 2017

Test build #71592 has finished for PR 12896 at commit a439899.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2017

Test build #71838 has finished for PR 12896 at commit 9981fd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor Author

MLnick commented Jan 24, 2017

Reviving after a hiatus. Updated since tags. I've actually recently come across a number of users hitting this issue in production and are unable to use ALS with cross-validation as a result.

Added SPARK-19345 for adding detail to the docs / examples and SPARK-19346 for adding the advanced strategies discussed above.

ping @sethah @srowen @jkbradley in case you have any more comment. I'll merge in a day or two if no major comments.

@MLnick
Copy link
Contributor Author

MLnick commented Feb 20, 2017

jenkins retest this please

@SparkQA
Copy link

SparkQA commented Feb 20, 2017

Test build #73164 has finished for PR 12896 at commit 9981fd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 27, 2017

Test build #73521 has finished for PR 12896 at commit 041a3b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor Author

MLnick commented Feb 28, 2017

Merged to master

@asfgit asfgit closed this in b405466 Feb 28, 2017
@MLnick MLnick deleted the SPARK-14489-als-nan branch May 4, 2017 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants