-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-20790] [MLlib] Correctly handle negative values for implicit feedback in ALS #18022
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -37,6 +37,7 @@ import org.apache.spark.ml.recommendation.ALS._ | |
| import org.apache.spark.ml.recommendation.ALS.Rating | ||
| import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} | ||
| import org.apache.spark.ml.util.TestingUtils._ | ||
| import org.apache.spark.mllib.recommendation.MatrixFactorizationModelSuite | ||
| import org.apache.spark.mllib.util.MLlibTestSparkContext | ||
| import org.apache.spark.rdd.RDD | ||
| import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted} | ||
|
|
@@ -78,7 +79,7 @@ class ALSSuite | |
| val k = 2 | ||
| val ne0 = new NormalEquation(k) | ||
| .add(Array(1.0f, 2.0f), 3.0) | ||
| .add(Array(4.0f, 5.0f), 6.0, 2.0) // weighted | ||
| .add(Array(4.0f, 5.0f), 12.0, 2.0) // weighted | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Was this test change intentional?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes this test change was intentional, because I change the semantic meaning of the arguments to add, before add would multiply the second and third arguments together internally, so to make this test valid I premultiplied them together. In the usage of this function in ALS.scala, for non-implicit the third argument is 1, so there is no change, and implicit is now handled correctly. |
||
| assert(ne0.k === k) | ||
| assert(ne0.triK === k * (k + 1) / 2) | ||
| // NumPy code that computes the expected values: | ||
|
|
@@ -347,6 +348,37 @@ class ALSSuite | |
| ALSSuite.genFactors(size, rank, random, a, b) | ||
| } | ||
|
|
||
| /** | ||
| * Train ALS using the given training set and parameters | ||
| * @param training training dataset | ||
| * @param rank rank of the matrix factorization | ||
| * @param maxIter max number of iterations | ||
| * @param regParam regularization constant | ||
| * @param implicitPrefs whether to use implicit preference | ||
| * @param numUserBlocks number of user blocks | ||
| * @param numItemBlocks number of item blocks | ||
| * @return a trained ALSModel | ||
| */ | ||
| def trainALS( | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do you need a new overload?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is a helper function, because I call it twice in the test. I also wanted to use this in the testALS function, but it wasn't straightforward. I can't use testALS in my test since it does more than just train the model and it doesn't allow me to compare the two models the test generates, one with negative values and one with those negative values zeroed out. |
||
| training: RDD[Rating[Int]], | ||
| rank: Int, | ||
| maxIter: Int, | ||
| regParam: Double, | ||
| implicitPrefs: Boolean = false, | ||
| numUserBlocks: Int = 2, | ||
| numItemBlocks: Int = 3): ALSModel = { | ||
| val spark = this.spark | ||
| import spark.implicits._ | ||
| val als = new ALS() | ||
| .setRank(rank) | ||
| .setRegParam(regParam) | ||
| .setImplicitPrefs(implicitPrefs) | ||
| .setNumUserBlocks(numUserBlocks) | ||
| .setNumItemBlocks(numItemBlocks) | ||
| .setSeed(0) | ||
| als.fit(training.toDF()) | ||
| } | ||
|
|
||
| /** | ||
| * Test ALS using the given training/test splits and parameters. | ||
| * @param training training dataset | ||
|
|
@@ -455,6 +487,22 @@ class ALSSuite | |
| targetRMSE = 0.3) | ||
| } | ||
|
|
||
| test("implicit feedback regression") { | ||
| val trainingWithNeg = sc.parallelize(Array(Rating(0, 0, 1), Rating(1, 1, 1), Rating(0, 1, -3))) | ||
| val trainingWithZero = sc.parallelize(Array(Rating(0, 0, 1), Rating(1, 1, 1), Rating(0, 1, 0))) | ||
| val modelWithNeg = | ||
| trainALS(trainingWithNeg, rank = 1, maxIter = 5, regParam = 0.01, implicitPrefs = true) | ||
| val modelWithZero = | ||
| trainALS(trainingWithZero, rank = 1, maxIter = 5, regParam = 0.01, implicitPrefs = true) | ||
| val userFactorsNeg = modelWithNeg.userFactors | ||
| val itemFactorsNeg = modelWithNeg.itemFactors | ||
| val userFactorsZero = modelWithZero.userFactors | ||
| val itemFactorsZero = modelWithZero.itemFactors | ||
| userFactorsNeg.collect().foreach(arr => logInfo(s"implicit test " + arr.mkString(" "))) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Small nit here but ideally we don't usually log info during this sort of test?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point, I meant to remove, shall I open another pr? |
||
| userFactorsZero.collect().foreach(arr => logInfo(s"implicit test " + arr.mkString(" "))) | ||
| assert(userFactorsNeg.intersect(userFactorsZero).count() == 0) | ||
| assert(itemFactorsNeg.intersect(itemFactorsZero).count() == 0) | ||
| } | ||
| test("using generic ID types") { | ||
| val (ratings, _) = genImplicitTestData(numUsers = 20, numItems = 40, rank = 2, noiseStd = 0.01) | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the substance of the change? I might need some help understanding why this is needed. Yes, even negative values should be recorded for implicit prefs, I agree. It adds 1 + c1 now instead of (1 + c1) / c1, so that's why the factor of c is taken out above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, this is the crux of the change (moving outside of the if condition). Changing the arguments was more to be less confusing and more direct, since it was very confusing to me before where the
(1+c1)/c1was coming from and then when it is actually used inadd, it gets multiplied byc1, which is a wasted operation and may not even exactly yield1+c1in the end.