-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy #12896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #57753 has finished for PR 12896 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the name "unknownStrategy" - seems very ambiguous. I'm not sure what is best, but for example I think "unknownPredictionStrategy" or "newUserItemPredictionStrategy" are better. Just something that will be obvious what it is by the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough - I was trying to keep it brief. But it could be more clear.
How about "unknownIdPredictionStrategy"?
On Wed, 4 May 2016 at 17:23, Seth Hendrickson [email protected]
wrote:
In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
#12896 (comment):@@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
/** @group getParam */
def getItemCol: String = $(itemCol)
+
- /**
- * Param for strategy for dealing with unknown users or items at prediction time.
- * Supported values:
- * - "nan": prediction value for unknown ids will be NaN.
- * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
- * Default: "nan".
- * @group expertParam
- */
- val unknownStrategy = new Param[String](this, "unknownStrategy",
I don't like the name "unknownStrategy" - seems very ambiguous. I'm not
sure what is best, but for example I think "unknownPredictionStrategy" or
"newUserItemPredictionStrategy" are better. Just something that will be
obvious what it is by the name.—
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12896/files/fc437451a598221f0878b7a2e0b87d17572019cc#r62059332
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it more, using "unknown" just seems confusing. "newIdPredictionStrategy" ? I don't have a great idea for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally don't see much difference between "new" and "unknown". Will
think about it some more
On Wed, 4 May 2016 at 18:08, Seth Hendrickson [email protected]
wrote:
In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
#12896 (comment):@@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
/** @group getParam */
def getItemCol: String = $(itemCol)
+
- /**
- * Param for strategy for dealing with unknown users or items at prediction time.
- * Supported values:
- * - "nan": prediction value for unknown ids will be NaN.
- * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
- * Default: "nan".
- * @group expertParam
- */
- val unknownStrategy = new Param[String](this, "unknownStrategy",
Thinking about it more, using "unknown" just seems confusing.
"newIdPredictionStrategy" ? I don't have a great idea for it.—
You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12896/files/fc437451a598221f0878b7a2e0b87d17572019cc#r62067123
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "unknownIdPredictionStrategy" or just "unknownIdStrategy" sound fine - they are unknown to the trained model right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about "coldStartStrategy"? For people familiar with recommendations "cold start" is clear in meaning. But it may be less clear for people less familiar.
@srowen what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MLnick I think it's good. It's an "expert" param anyway. Also, it's better in a way because it's hard to assume what it means. For those that don't know it, they can check out the doc.
fc43745 to
b71dcac
Compare
|
Changed param name to |
|
I think I may create a new example for ALS cross-validation (or port the Movielens example from mllib even though ml won't yet support top-k recs) to illustrate the usage of this param. |
|
Test build #58064 has finished for PR 12896 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"will be dropped." -> "will not be included in the recommendations."
It could sound like we are modifying the input dataframe, but we are not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. Will reword this a bit.
On Wed, 11 May 2016 at 17:25, Seth Hendrickson [email protected]
wrote:
In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
#12896 (comment):@@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
/** @group getParam */
def getItemCol: String = $(itemCol)
+
- /**
- * Param for strategy for dealing with unknown or new users/items at prediction time.
- * This may be useful in cross-validation or production scenarios, for handling user/item ids
- * the model has not seen in the training data.
- * Supported values:
- * - "nan": predicted value for unknown ids will be NaN.
- * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
"will be dropped." -> "will not be included in the recommendations."
It could sound like we are modifying the input dataframe, but we are not.
—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/12896/files/b71dcaca39d5f10b484932eaf25e290c7796d93b#r62869002
|
@srowen @jkbradley - this is technically not a "bug", but it impacts usability of |
b71dcac to
0d0278a
Compare
|
Test build #62598 has finished for PR 12896 at commit
|
|
Hm, let me go through the logic with you one more time here. Isn't it better in theory to fix the model to not return NaN, but rather return some default answer, even if it's "0" or equivalent? this is at least no worse for scoring, and fixes the evaluation problem. New users and items are reasonable conditions for this model, not an error case. The current behavior isn't that helpful, so I'm not sure leaving it as a choice is doing anybody a favor. My concern with the "drop" mode is that it is not penalizing any case where the model can't make an answer. |
|
I think there is a fair bit of difference between cross-validating the model and scoring in production. In most practical live scoring situations, there may be multiple levels of fallbacks / defaults for the cold-start case (including e.g. "most popular", "newest", "content-based methods" etc etc). There may also be various post-processing steps applied to the results. I don't think it's feasible to re-create live behaviour perfectly for cross-validation scenarios (especially as these systems are often totally different from Spark). Even for offline bulk scoring, again there may be many different options for cold start. Do we intend to support all of them within Spark? Again I don't think that's feasible, though as discussed on the JIRA we can certainly support a few useful options, such as "average user" which could indeed serve for both CV and live scoring purposes. I actually think For CV, I'd expect that predicting This is not arguing against other reasonable options (average rating, average user vectors and so on) - we can add those later on user demand. This is just a start. |
|
As for leaving the current behavior as default - it's more for compatibility. Though as I mention above I believe returning |
|
For scoring, returning "I don't know" could be reasonable in order to let some other logic take over. I could see that. It would be important to do that, but maybe a separable concern. For evaluation, hm, if the goal is to ignore cases where the result will reasonably be "I don't know" then what about constructing the evaluation differently, so that the test set doesn't have users that the train set doesn't for example? that's more sound, because then any 'bad' NaNs still show up. Is it that this is hard to implement in the current structure? |
|
Your suggestion is, to me, the ideal solution. It's probably the more common method of splitting "ratings" datasets for CV purposes. I'm interested in working on it but I think it would be a whole new specific cross-validator class. I'm not quite sure what the best approach is for efficiency (refer #14321 for stratified sampling approach, it's more for labels and is not efficient for this case, but the general concept might apply). In short, it's obviously a lot more effort and will take time. Perhaps it also starts life outside of Spark in packages. Not sure on this yet, but happy to collaborate on ideas! Originally this PR was intended for |
|
OK. I can see that this is a bit of a band-aid, but might still be implemented as part of the beginnings of a more general 'cold start' configuration. As long as the API itself makes sense in the long term, and it seems so. "drop" really drops any prediction that's NaN, not necessarily only for unknown users and items. I suppose the assumption is that those are the same cases, which is probably true in practice now. What would it mean if a method was later added to make point predictions? you'd get nothing back, or maybe this would be considered synonymous with NaN in this case and get NaN back. It's a little stretch to implement the required behavior here as a cold-start strategy but not terrible. If you're OK with it go for it. I think there are a few more comments to resolve about the |
|
Another option is to make However, is it even possible to get a bunch of |
|
Sean, let's assume we ultimately have the following available:
Would you suggest that this a "good" end result? Or would you suggest to not have the |
|
Yes that sounds like the direction this is heading. In a way, "drop" and "nan" aren't really different cold start strategies. Both return an implicit or explicit "I don't know; ask something else". One of these wouldn't have a counterpart in an API that offers point predictions. "drop" and "nan" have distinct use in the context of evaluation, and it sounds like we'd even get rid of "nan" except for potential backwards compatibility. But, maybe that's just fine. |
|
Test build #63535 has finished for PR 12896 at commit
|
|
@srowen @jkbradley any further comments / issues? I plan to create follow-up JIRAs for docs & examples, as well as expansions to the cold-start strategies. also cc @mengxr @yanboliang for any comments. |
|
Excuse me for barging in, but speaking of compatibility and ALS cross-validation, am I correct that RegressionEvaluator is currently unaware of special logic required for RMSE to be at least remotely helpful with implicit ALS, as discussed in #597? I am getting RMSE of well over 1 using |
|
RMSE is just RMSE, regardless of what data it's applied to. You're right that RMSE should be <= 1 if all input/outputs are in [0,1], but that's not actually the case in the implicit model, where predictions can actually be outside this range. Really, RMSE is not the right metric for implicit results to begin with. You would want ranking metrics for this case. |
|
Test build #68925 has started for PR 12896 at commit |
|
Test build #69401 has finished for PR 12896 at commit
|
|
Jenkins retest this please |
|
Test build #71592 has finished for PR 12896 at commit
|
|
Test build #71838 has finished for PR 12896 at commit
|
|
Reviving after a hiatus. Updated since tags. I've actually recently come across a number of users hitting this issue in production and are unable to use ALS with cross-validation as a result. Added SPARK-19345 for adding detail to the docs / examples and SPARK-19346 for adding the advanced strategies discussed above. ping @sethah @srowen @jkbradley in case you have any more comment. I'll merge in a day or two if no major comments. |
|
jenkins retest this please |
|
Test build #73164 has finished for PR 12896 at commit
|
|
Test build #73521 has finished for PR 12896 at commit
|
|
Merged to master |
This PR adds a param to
ALS/ALSModelto set the strategy used when encountering unknown users or items at prediction time intransform. This can occur in 2 scenarios: (a) production scoring, and (b) cross-validation & evaluation.The current behavior returns
NaNif a user/item is unknown. In scenario (b), this can easily occur when usingCrossValidatororTrainValidationSplitsince some users/items may only occur in the test set and not in the training set. In this case, the evaluator returnsNaNfor all metrics, making model selection impossible.The new param,
coldStartStrategy, defaults tonan(the current behavior). The other option supported initially isdrop, which drops all rows withNaNpredictions. This flag allows users to useALSin cross-validation settings. It is made anexpertParam. The param is made a string so that the set of strategies can be extended in future (some options are discussed in SPARK-14489).How was this patch tested?
New unit tests, and manual "before and after" tests for Scala & Python using MovieLens
ml-latest-smallas example data. Here, usingCrossValidatororTrainValidationSplitwith the default param setting results in metrics that are allNaN, while settingcoldStartStrategytodropresults in valid metrics.