-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-13857][ML][WIP] Add "recommend all" functionality in ALS #12574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Adding some further detail... Proposed semantics
Interaction with evaluationHere I've gone with the approach of having val als = new ALS().setRecommendFor("user").setK(10)
val results = als.fit(training).transform(test).select("prediction", "label")
.as[(Array[Int], Array[Int])]
.rdd
val metrics = new RankingMetrics()
println(metrics.meanAveragePrecision)
println(metrics.ndcgAt(als.getK()))
...Specifically:
Possible alternative semanticsIn the current approach, In this case:
This alternative seems more "DataFrame"-like, but it would require changes to The downside to this alternative is that performance may suffer as it's perhaps not that efficient - since |
|
Test build #56533 has finished for PR 12574 at commit
|
|
retest this please |
|
Test build #56544 has finished for PR 12574 at commit
|
|
Test failure seems to be caused by issue in #12599 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two joins and distinct should probably be a cogroup.mapGroups, since you know that id is a key and dataset is unlikely to contain many duplicates. As far as I can see, the current Catalyst rules will generate a conservative plan with three stages for the expression distinct.join.join, since there's no information available on key constraints.
EDIT: On second thought, there's currently no cogroup operator in the DataFrame API, so I guess the way this expression is currently written is the best one can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a note to look into optimizing this when cogroup exists.
By the way, when you say that dataset is unlikely to contain many duplicates, are you referring to duplicate rows? If so, yes that is true as generally each user, item, rating tuple is distinct (though it's not guaranteed to be the case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant duplicate values of srcCol that are filtered out by the first distinct. What I was trying to say was that, if a cogroup was used here, every group would fit handily in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok. Yes, that should generally be true (though in larger-scale social network style domains, one could run into "super-node" problems with highly popular users or items).
|
jenkins retest this please |
|
Test build #56849 has finished for PR 12574 at commit
|
c529c89 to
4af7105
Compare
|
Test build #56894 has finished for PR 12574 at commit
|
|
Test build #56895 has finished for PR 12574 at commit
|
|
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to set recommendFor and k together? like this
def setRecommendFor(forval: String, kval: Int)
That way you would be guaranteed they are both set if using recommendFor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's interesting. I like the idea, though no other estimator/model uses this approach (but AFAIK there's no other estimator where 2 specific non-default params must be set together). @jkbradley @holdenk thoughts?
4af7105 to
af432cc
Compare
|
Test build #63005 has finished for PR 12574 at commit
|
|
ping @mengxr @jkbradley @srowen @yanboliang for thoughts and comments on this. |
|
@MLnick I recently visited IBM STC but unfortunately missed you on the meeting...we discussed about the ML/MLlib changes for matrix factorization... |
|
I will take a pass at the PR as well.. |
|
@debasish83 thanks, would like to get your comments especially around |
|
@debasish83 any chance to take a look yet? |
|
Can we close it ? Looks like SPARK-18235 opened up recommendForAll |
|
Hi @debasish83 SPARK-18235 was already closed as duplicated. I think @MLnick put a more thorough solution here. |
|
Test build #71257 has finished for PR 12574 at commit
|
Until apache/spark#12574 is merged, the ML pipeline doesn’t have the same ability to generate top recommendations for a set of users. So, we’ll tune the model with an ML pipeline, build it from the same data with the RDD-based ALS implementation, then have a component to load that model and generate recs into an intermediate store TBD.
This PR adds "recommend all" functionality to ML's
ALSModel, similar to what exists in therecommendProductsForUsers/recommendUsersForProductsmethods in MLlib'sMatrixFactorizationModel.This is a WIP to get some feedback around:
blockifyandrecommendAllmethods inMatrixFactorizationModel, to handleDoubleandFloattypes (sincemlusesFloatfor ratings and factors whilemllibusesDouble)ALSModelALSModel.transformand how it relates to pipelines, cross-validation & evaluation (see this JIRA as well as SPARK-14409 and the linked design doc and [SPARK-14409][ML] Adding a RankingEvaluator to ML #12461).cc @mengxr @jkbradley @srowen @debasish83 @yongtang
TODO
transformsemantics and how this fits in with cross-validation and evaluation.MatrixFactorizationModeland vs the alternativetransformsemantics.How was this patch tested?
New unit tests in
ml.recommendation.ALSSuite