
Conversation

@techaddict
Contributor

Add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/

…sALS

Add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14584/

@mengxr
Contributor

mengxr commented Apr 30, 2014

@techaddict Thanks for working on this JIRA. You also need to change the evaluation code. Implicit ALS predicts 0/1 instead of the original rating. So you need some mapping before computing RMSE.

@techaddict
Contributor Author

Mapping the rating to {r = 0 --> 0, r > 0 --> 1} when implicitPrefs is true, i.e. changing
Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble) to
Rating(fields(0).toInt, fields(1).toInt, if (fields(2).toDouble == 0) 0.0 else 1.0).
Will this work?

@MLnick
Contributor

MLnick commented May 4, 2014

It is true that implicit prefs predict 0/1 (i.e. a "preference" matrix rather than a "rating" matrix), but the ratings are taken as confidence levels indicating preference (or, in the case of negative ratings, lack of preference). So there is already an implicit mapping of 1 if r > 0, 0 if r == 0, with the actual rating being a confidence value in the case of r > 0.

So keeping the ratings input as-is is a reasonable approach. Even better would be to map low ratings to zero or perhaps even negative scores, as a low rating certainly indicates a lack of preference.

@srowen
Member

srowen commented May 4, 2014

On this note, recall there was a change a while back to handle the case of negative confidence levels. 0 still means "don't know" and positive values mean "confident that the prediction should be 1". Negative values mean "confident that the prediction should be 0".

I have in this case used some kind of weighted RMSE. The weight is the absolute value of the confidence. The error is the difference between the prediction and either 1 or 0, depending on whether r is positive or negative.
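The weighted RMSE described above can be sketched in plain Scala over `Seq`s rather than Spark RDDs; the names `weightedRmse`, `confidences`, and `preds` are illustrative, not from the PR:

```scala
// Weighted RMSE sketch: weight = |confidence|, target = 1.0 if r > 0 else 0.0.
def weightedRmse(confidences: Seq[Double], preds: Seq[Double]): Double = {
  val terms = confidences.zip(preds).map { case (r, p) =>
    val weight = math.abs(r)               // confidence magnitude
    val target = if (r > 0) 1.0 else 0.0   // implied 0/1 preference
    (weight * (p - target) * (p - target), weight)
  }
  val (sqErr, totalWeight) =
    terms.foldLeft((0.0, 0.0)) { case ((e, w), (te, tw)) => (e + te, w + tw) }
  math.sqrt(sqErr / totalWeight)
}
```

The normalization by total weight (rather than count) is one reasonable choice; the comments here describe the idea, not a fixed API.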

@mengxr
Contributor

mengxr commented May 6, 2014

MovieLens ratings are on a scale of 1-5:

5: Must see
4: Will enjoy
3: It's okay
2: Fairly bad
1: Awful

So we should not recommend a movie if the predicted rating is less than 3. To map ratings to confidence scores, I would use 5 -> 2, 4 -> 1, 3 -> 0, 2 -> -1, 1 -> -2 or 5 -> 2.5, 4 -> 1.5, 3 -> 0.5, 2 -> -0.5, 1 -> -1.5. The latter mapping means unobserved entries are generally between "It's okay" and "Fairly bad".

For evaluation, the mapping should be if (r >= 3) 1.0 else 0.0 for MovieLens ratings, and I agree with @srowen on weighted RMSE.
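The two pieces above fit in a couple of lines of plain Scala; the function names `confidence` and `evalLabel` are mine, not from the example:

```scala
// mengxr's second mapping (5 -> 2.5 ... 1 -> -1.5) is just rating - 2.5;
// the evaluation label is binary: liked (r >= 3) or not.
def confidence(r: Double): Double = r - 2.5
def evalLabel(r: Double): Double = if (r >= 3) 1.0 else 0.0
```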

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14695/

@srowen
Member

srowen commented May 6, 2014

Can I make a tiny suggestion to map from ratings to weights with something like "rating - 2.5" instead of "rating - 3"? So that 3 becomes a small positive value like 0.5?

There is an argument that even neutral ratings are weak positive interactions; to have even consumed the item to be able to rate it means you had an interest.

But more than that, in this expanded world of non-positive weights, the semantics of 0 are "the same as never having interacted at all" -- which doesn't quite fit. I don't know whether the intermediate sparse representations do this internally at the moment, but it's possible that 0 values are ignored when constructing the sparse representation, because 0s are implicit. This would be a problem, or at least a theoretical one.
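A quick check in plain Scala of the point about zeros: with "rating - 2.5", no observed MovieLens rating maps to exactly 0.0, so none can be silently dropped as an implicit sparse zero (variable names are illustrative):

```scala
// Shift each MovieLens rating 1..5 by -2.5; none of the results is 0.0.
val shifted = (1 to 5).map(_ - 2.5)       // -1.5, -0.5, 0.5, 1.5, 2.5
val collidesWithSparseZero = shifted.contains(0.0)
```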

@mengxr
Contributor

mengxr commented May 6, 2014

+1 on @srowen's suggestion.

@techaddict
Contributor Author

@mengxr @srowen good now?

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@mengxr
Contributor

mengxr commented May 6, 2014

@techaddict For training, we should keep the r - 2.5 values, which indicate confidence. For evaluation, we could either use if (r > 2.5) 1.0 else 0.0 or the weighted RMSE suggested by @srowen. Also, we need to map the predictions to the interval [0.0, 1.0].
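The clamping into [0.0, 1.0] can be sketched as follows (plain Scala; `clamp01` is an illustrative name):

```scala
// Map a raw implicit-ALS prediction into [0.0, 1.0] before comparing with labels.
def clamp01(pred: Double): Double = math.max(math.min(pred, 1.0), 0.0)
```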

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14708/

@techaddict
Contributor Author

@mengxr I'm a bit confused.

val ratings = sc.textFile(params.input).map { line =>
  val fields = line.split("::")
  if (params.implicitPrefs) {
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble - 2.5)
  } else {
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
  }
}.cache()

val test = splits(1)
  .map(x => Rating(x.user, x.product, if (x.rating >= 0) 1.0 else 0.0))
  .cache()

def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long) = {
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), (x.rating + 2.5) / 5.0))
    .join(data.map(x => ((x.user, x.product), x.rating)))
    .values
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean())
}

@mengxr
Contributor

mengxr commented May 6, 2014

ratings is correct. test also needs to check implicitPrefs. If implicitPrefs is set, predictions should be mapped with if (pred > 1.0) 1.0 else if (pred < 0.0) 0.0 else pred.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.


@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14777/

Contributor

Can we change it to the following:

if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14793/

@techaddict
Contributor Author

@mengxr done

@mengxr
Contributor

mengxr commented May 8, 2014

LGTM. Thanks!

@rxin
Contributor

rxin commented May 8, 2014

Merged. Thanks!

asfgit pushed a commit that referenced this pull request May 8, 2014
…sALS

Add --implicitPrefs as an command-line option to the example app MovieLensALS under examples/

Author: Sandeep <[email protected]>

Closes #597 from techaddict/SPARK-1668 and squashes the following commits:

8b371dc [Sandeep] Second Pass on reviews by mengxr
eca9d37 [Sandeep] based on mengxr's suggestions
937e54c [Sandeep] Changes
5149d40 [Sandeep] Changes based on review
1dd7657 [Sandeep] use mean()
42444d7 [Sandeep] Based on Suggestions by mengxr
e3082fa [Sandeep] SPARK-1668: Add implicit preference as an option to examples/MovieLensALS Add --implicitPrefs as an command-line option to the example app MovieLensALS under examples/

(cherry picked from commit 108c4c1)
Signed-off-by: Reynold Xin <[email protected]>
@asfgit asfgit closed this in 108c4c1 May 8, 2014
@invkrh
Contributor

invkrh commented Jun 6, 2014

Just a question on the result.

implicitPrefs rank numIterations lambda -> rmse
true          30   40            1.0    -> 0.5776665087027969

Here, 0.57 is the error we make when we predict 0/1 -- isn't that too much?
That means the preference we predict is on average about 0.57 away from 0/1. It doesn't look good enough to me.
Tell me if I am missing something. And how can I know what RMSE indicates a good prediction?

In the paper on which implicit ALS is based, expected percentile rank is used.
Maybe mean average precision at k (MAP@K) would also be useful for evaluation. Do you have any results of this kind?

Thank you. =)

@srowen
Member

srowen commented Jun 6, 2014

Simple RMSE is not a great metric for this model, because it treats all errors equally when the model itself does not at all. 1s are much more important than 0s. The predictions are not rating-like. See my comment above.

I usually try to look at metrics that measure how good the top of the ranking is, since this is far more like what the user experiences. MAP or something like area under the curve are about as good as you can hope for, but still somewhat flawed. It's hard to evaluate recommenders since you have such incomplete information on what the "right" or "relevant" items are.
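For concreteness, average precision at k (the per-user building block of MAP@K) can be sketched in plain Scala; this is illustrative, not MLlib's implementation:

```scala
// Average precision at k: accumulate precision at each rank where a hit occurs,
// then normalize by min(#relevant, k). Ranks are 1-based, so use (i + 1).
def apAtK(ranked: Seq[Int], relevant: Set[Int], k: Int): Double = {
  if (relevant.isEmpty) return 0.0
  var hits = 0
  var sum = 0.0
  ranked.take(k).zipWithIndex.foreach { case (item, i) =>
    if (relevant.contains(item)) {
      hits += 1
      sum += hits.toDouble / (i + 1)
    }
  }
  sum / math.min(relevant.size, k)
}
```

MAP@K is then the mean of `apAtK` over users.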

@invkrh
Contributor

invkrh commented Jun 10, 2014

I have recently tested the expected percentile rank (EPR) evaluation method proposed in the paper on the MovieLens data set and a real-world data set. However, I got an expected rank of about 50% in both sets, which, according to the paper, means implicit ALS actually does not predict anything.

I am not sure if any evaluation like this has been done before.

How can we make sure that implicit ALS is implemented correctly in MLlib without checking the code?
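As a reference point, the EPR statistic from the implicit-feedback ALS paper (Hu, Koren & Volinsky, 2008) is a confidence-weighted mean of percentile ranks; a sketch in plain Scala, with `epr` an illustrative name:

```scala
// obs holds (r, rank) pairs: r is the observed confidence, rank is the
// percentile position of the item in the user's ranked list
// (0.0 = recommended first, 1.0 = last).
// EPR around 0.5 corresponds to a random ranking; lower is better.
def epr(obs: Seq[(Double, Double)]): Double = {
  val num = obs.map { case (r, rank) => r * rank }.sum
  num / obs.map(_._1).sum
}
```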

@srowen
Member

srowen commented Jun 10, 2014

The results depend a whole lot on the choice of parameters. Did you try some degree of search for the best lambda / # features? It's quite possible to make a model that can't predict anything. I have generally found ALS works fine on the MovieLens data set.

@invkrh
Contributor

invkrh commented Jun 11, 2014

I have tried different lambda and # features values, but nothing changed. To be clear: initially, the MovieLens dataset is divided into a training set (80%) and a test set (20%). The ratings are re-interpreted as rating - 2.5, and we take only the positives in both the training and test sets, as we want to simulate an implicit-feedback case where no negative feedback exists. All the negative ratings are considered non-observed. Finally, we evaluated EPR on both the training set and the test set. It's about 49%~50% in both cases. Am I doing the right thing?

@srowen
Member

srowen commented Jun 11, 2014

You mentioned trying lots of values, but what did you try? What about other test metrics -- to rule out some problem in the evaluation? Maybe you can share some of how you ran the test in a gist.

@invkrh
Contributor

invkrh commented Jun 12, 2014

Here are the values I have tried (seed is set to 42):

in & out mean in-sample (training set) and out-of-sample (test set)

#factors = 12, lambda = 1, alpha = 1

  iter 20 =>
              MAP_in  = 0.035399855240788425
              MAP_out = 0.007907455900941737
              EPR_in  = 0.4902389595686534
              EPR_out = 0.4931204751436468

  iter 40 =>
              MAP_in  = 0.033210624652830374
              MAP_out = 0.007158070987320343
              EPR_in  = 0.4907502816419743
              EPR_out = 0.49214166351173705

#factors = 50, alpha = 1, iter = 30

  lambda = 1 =>
              MAP_in  = 0.029096938174350682
              MAP_out = 0.006634856811818636
              EPR_in  = 0.4928298931862564
              EPR_out = 0.49328834081999423

  lambda = 0.001 =>
              MAP_in  = 0.02903970778838223
              MAP_out = 0.006569378517284138
              EPR_in  = 0.4929466287464198
              EPR_out = 0.49337539845412665

I have not tried other metrics; as said before, RMSE is not that good here.
Maybe there are some errors in my metrics. I will give AUC and ROC a try to rule them out.

I listed some code snippets here (the two evaluation methods and the main):
https://gist.github.com/coderh/05a83be081c1f713e15b

@invkrh
Contributor

invkrh commented Jun 12, 2014

OK, I have found the error in my metric.

val itemFactors = model.productFeatures.collect()

This line creates the item-factor matrix; the problem is that the item factors are not ordered by item id when collected, which leads to a wrong matrix. That's why the result is nonsense.

Adding a sortBy(_._1), like

val itemFactors = model.productFeatures.collect().sortBy(_._1)

gives an EPR of about 9% (in sample) and 10% (out of sample).

Implicit ALS works. Thanks.
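A minimal plain-Scala illustration of the bug described above: building a matrix from collected (id, factors) pairs assumes row i holds item i, which only holds after sorting by id (the data here is made up):

```scala
// Unsorted pairs, as collect() may return them in arbitrary order.
val unordered = Seq((2, Array(0.2)), (0, Array(0.0)), (1, Array(0.1)))
// Sorting by id restores the assumption that row i belongs to item id i.
val rows = unordered.sortBy(_._1).map(_._2)
```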

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
rvesse pushed a commit to rvesse/spark that referenced this pull request Mar 2, 2018
…pache#597)

* Avoids adding duplicated secret volumes when init-container is used

Cherry-picked from apache#20148.

* Added the missing commit from upstream
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
we use project_domain_name instead of project_domain_id for citynetwork
provider
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
