[ML][MLLIB] SPARK-2426: Integrate Breeze QuadraticMinimizer with ALS #3221
Conversation
…nds, Qp with smoothness, Qp with L1
…ing for distributed runs; rho=50 for equality constraint, default rho=1.0, alpha = 1.0 (no over-relaxation) for convergence study
…zer;Elastic net formulation in ALS, elastic net parameter not exposed to users
…pute map measure along with rmse
…tric for movielens dataset
… BoundedPriorityQueue similar to RDD.top
I looked more into it, and I will open up an API in Breeze QuadraticMinimizer where an upper-triangular gram can be sent in place of the dense gram matrix. The inner workspace still has to be n x n, though, because for Cholesky we need to compute LL' and for the quasi-definite system we have to compute LDL' / LU, and both of them need n x n space. So I won't be able to decrease the QuadraticMinimizer workspace size. For dposv, the LL' storage is handled inside LAPACK and is not visible to the user.
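Not part of the PR, just a small Breeze sketch of the point above, assuming a dense symmetric positive definite gram matrix; the names (`gram`, `q`) are illustrative. Even when only the upper triangle is meaningful, the Cholesky factor is materialized as a full n x n DenseMatrix, which is why the workspace cannot shrink below n x n.

```scala
import breeze.linalg.{DenseMatrix, DenseVector, cholesky}

val n = 4
val a = DenseMatrix.rand(n, n)
val gram = a.t * a + DenseMatrix.eye[Double](n) * 1e-3 // SPD gram matrix H (illustrative)
val q = DenseVector.rand(n)                            // linear term c (illustrative)

// The factor is a full n x n DenseMatrix even though it is lower triangular.
val l = cholesky(gram)        // gram = l * l.t
val x = gram \ (-q)           // stationary point of 0.5 * x'Hx + c'x, i.e. solve H x = -c
```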
Test build #29023 has finished for PR 3221 at commit
@mengxr I added the optimization for the lower-triangular matrix and now they are very close. Let me know what you think and if there are any other tricks you would like me to try. Note that with these optimizations, QuadraticMinimizer with the POSITIVE constraint will also run much faster.

Breeze QuadraticMinimizer (default): unset solver;
./bin/spark-submit --master spark://tusca09lmlvt00c.uswin.ad.vzwcorp.com:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 1 ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --rank 50 --numIterations 2 ~/datasets/ml-1m/ratings.dat
Got 1000209 ratings from 6040 users on 3706 movies.
15/03/23 12:26:55 INFO ALS: solveTime 205.379 ms

ML CholeskySolver: export solver=mllib;
./bin/spark-submit --master spark://tusca09lmlvt00c.uswin.ad.vzwcorp.com:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 1 ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --rank 50 --numIterations 2 ~/datasets/ml-1m/ratings.dat
Got 1000209 ratings from 6040 users on 3706 movies.
TUSCA09LMLVT00C:spark-qp-als v606014$ grep solveTime ./work/app-20150323122612-0002/0/stderr

I am running only 2 iterations, but you can see in the tail that the solvers run at par.
All the runtime enhancements are being added to Breeze in this PR: scalanlp/breeze#386
@mengxr I discussed with David, and the only reason I can think of is that inside the solvers I am using DenseMatrix and DenseVector in place of primitive arrays for workspace creation. That might be causing the first-iteration runtime difference, due to loading up the interface classes and other features that come with DenseMatrix and DenseVector. I can move to primitive arrays for the workspace, but then the code will look ugly. Let me know if I should. I am surprised that this issue does not show up after the first call!
@mengxr any updates on it? Breeze 0.11.2 is now integrated with Spark, so I can clean up the PR for reviews.
…arse and simplex constraints
I integrated with Breeze 0.11.2. The only visible difference is the first iteration.

Breeze QuadraticMinimizer:
TUSCA09LMLVT00C:spark-qp-als v606014$ grep solveTime ./work/app-20150327221722-0000/0/stderr

mllib CholeskySolver:
TUSCA09LMLVT00C:spark-qp-als v606014$ grep solveTime ./work/app-20150327221/0/stderr

The visible difference is in the first 2 iterations, as shown in the previous experiments as well. I fixed the random seed test now, and so different runs will not produce the same result.

I need this structure to build ALM, since ALM extends mllib.ALS and adds LossType in the constructor along with userConstraint and itemConstraint. Right now I am experimenting with LeastSquare (for validation against ALS), and then I will start experimenting with the LogLikelihood loss. I am keen to run very large document, word, count datasets through ALM once it is ready.

For this PR I have updated MovieLensALS with userConstraint and itemConstraint, and I am considering whether we should add a sparse coding formulation in examples now or bring that in a separate PR. I have not cleaned up CholeskySolver from ALS yet and am waiting for feedback, but I have added test cases in ml.ALSSuite for all the constraints. At the ALS flow level I need to construct more test cases, and I can bring them in a separate PR as well.
Test build #29340 has finished for PR 3221 at commit
Test build #29353 has finished for PR 3221 at commit
What are MiMa tests? I am a bit confused about them. From the logs it looks like the ALS class now takes userConstraint and productConstraint, which are Enumerations, and that caused the failure. Is Enumeration.Value not allowed in class parameters?
@mengxr @jkbradley In my internal testing, I am finding the sparse formulations useful for extracting genre/topic information out of the Netflix/MovieLens datasets. I did not get any improvement on MAP / RMSE (which was expected from other papers). The sparse formulations are:
The reference: I am considering whether it makes sense to add a 20 newsgroups flow in examples (the one shown in the paper) to demonstrate the value of the sparsity support implemented in ALS. Also, do we have perplexity implemented so that we can start comparing topic models? The ALS runtime with the sparse formulations is also pretty good.
@debasish83 Including sparsity for both recommendation and for topic modeling will be useful, to be sure. We don't have perplexity (or even prediction) implemented yet, but that definitely needs to be done for LDA. New JIRA: https://issues.apache.org/jira/browse/SPARK-6793
Can we use RankingMetrics to see the prediction quality on a document dataset? LDA vs. sparse coding in this PR, for example. Actually we can do that on 20NG or some other known dataset; let me know which one I should try. Perplexity is a bit of a different measure than MAP, for example.
There isn't a clear answer about which metric is best, and none correspond that well with human perception. Most of the literature seems to think perplexity is better than log likelihood. If you can get labeled data, then using predictions would be reasonable, but even then, it's hard to say what the best metric is. For 20 newsgroups, maybe we could compare intra- and inter-newsgroup similarity based on topic distributions?
In this paper https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf Xi Chen compared classifier accuracy on the features extracted from sparse coding and LDA; we can also do that. Ranking might be a good idea as well. Log likelihood is tricky, since with only a positivity constraint and no sparsity constraint you will for sure get a higher log likelihood than someone who is adding sparsity!
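Not from the PR; just a rough sketch of that evaluation idea with the 1.x MLlib API: take the per-document factors produced by sparse coding (or LDA) as features, train a multiclass classifier on them, and compare accuracy. The input RDD and the helper name are hypothetical.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical input: one (newsgroupLabel, documentFactor) pair per document,
// where the factor comes from either sparse coding (this PR) or LDA.
def classifierAccuracy(docs: RDD[(Double, Array[Double])]): Double = {
  val data = docs.map { case (label, features) =>
    LabeledPoint(label, Vectors.dense(features))
  }
  val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

  val model = new LogisticRegressionWithLBFGS()
    .setNumClasses(20)           // 20 newsgroups
    .run(train.cache())

  // Fraction of held-out documents whose newsgroup is predicted correctly.
  val correct = test.map(p => (model.predict(p.features), p.label))
    .filter { case (pred, label) => pred == label }
    .count()
  correct.toDouble / test.count()
}
```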
Good point about log likelihood not being very meaningful. Perhaps the same will be true about perplexity? Using the topics as features sounds like a good idea. If we want to argue that topics are meaningful, comparing within and between newsgroups might be best. That covers the 2 main use cases for LDA I've heard of.
Cool...let me do that and add a 20NG flow.
Just out of curiosity, what's the second use case? Topics as features is a good idea, but most datasets where someone will run this are unlabeled, so we have to just pick one of ranking/perplexity. I am comparing whether log likelihood with positivity vs. log likelihood with sparsity improves the ranking metric in 0-1 recommendation as compared to ALS implicit feedback, but I am not sure it is possible to use a ranking metric on an arbitrary unlabeled dataset. Then again, ranking does not care about the underlying topic quality, so maybe we need both the feature-extractor flow and perplexity.
By "2nd use case," I meant trying to recover topics which correspond to human perceptions. Since the 20 newsgroup dataset is nicely divided into 20 human-perceived groups, it seems reasonable to expect that:
It's just a heuristic, but at least it's based on human-defined groupings.
@jkbradley We still could not access the Wikipedia dataset on EC2. Will it be possible for you to upload the 1-billion-token dataset to EC2? I wanted to do a sparse coding scalability run on the large dataset as well.
@jkbradley Let me know if you need vzcloud access and I can create a few nodes for you. EC2 might be easier for others to access as well.
@debasish83 Just to make sure, you're specifying it as a requester-pays bucket, right? http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html
Ohh sorry, I don't know about requester pays...let me look into it.
@mengxr @jkbradley is this still relevant given the recent changes in ALS?
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!
ALS is a generic algorithm for matrix factorization that is equally applicable to both feature space and similarity space. Currently, ALS supports L2 regularization and a positivity constraint. This PR introduces userConstraint and productConstraint to ALS and lets the user select different constraints for the user and product solves. The supported constraints are the following (formalized in the sketch after this list):
SMOOTH : default ALS with L2 regularization
ELASTIC NET: ALS with Elastic Net regularization
POSITIVE: ALS with positive factors
BOUNDS: ALS with factors bounded between a lower and an upper bound (default between 0 and 1)
EQUALITY: ALS with an equality constraint (default: the factors sum to 1 and are positive)
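One way to write the per-factor subproblem these options induce, as a sketch rather than the PR's exact notation: the quadratic part x^T H x + c^T x is defined in the formulation below, the indicator I{...} encodes a hard constraint, and lambda, alpha, l, u, s are illustrative parameter names (the SPARSE option used in the example run is included for completeness).

```latex
\min_{x \in \mathbb{R}^n} \; \tfrac{1}{2}\, x^\top H x + c^\top x + g(x), \qquad
g(x) =
\begin{cases}
  \tfrac{\lambda}{2}\,\lVert x \rVert_2^2
    & \text{SMOOTH} \\[2pt]
  \lambda\left(\alpha \lVert x \rVert_1 + \tfrac{1-\alpha}{2}\,\lVert x \rVert_2^2\right)
    & \text{ELASTIC NET} \\[2pt]
  \mathbb{I}\{x \ge 0\}
    & \text{POSITIVE} \\[2pt]
  \mathbb{I}\{l \le x \le u\} \quad (\text{default } 0 \le x \le 1)
    & \text{BOUNDS} \\[2pt]
  \mathbb{I}\{\mathbf{1}^\top x = 1,\; x \ge 0\}
    & \text{EQUALITY} \\[2pt]
  \mathbb{I}\{\lVert x \rVert_1 \le s\}
    & \text{SPARSE}
\end{cases}
```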
First, let's focus on the problem formulation. Both the implicit and the explicit feedback ALS formulations can be written as quadratic minimization problems, with a quadratic objective of the form x^T H x + c^T x. Each of the constraints then takes the following form (shown here for the sparsity constraint):

minimize x^T H x + c^T x
s.t. ||x||_1 <= s (SPARSE constraint)

We rewrite the objective as f(x) = x^T H x + c^T x and the constraint as an indicator function g(x). Minimization of f(x) + g(x) can then be carried out using various forward-backward splitting algorithms; we choose ADMM for this PR (a minimal sketch is given below).
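To make the splitting concrete, here is a minimal, self-contained Breeze sketch of the ADMM iteration for the POSITIVE case, where g is the indicator of the nonnegative orthant. This is not the PR's QuadraticMinimizer: the objective is taken as 0.5 * x'Hx + c'x, the defaults (rho = 1.0, no over-relaxation) mirror the convergence-study settings mentioned in the commits, the function name is made up, and a real implementation would factor H + rho*I once (Cholesky) instead of re-solving every iteration. For SPARSE / ELASTIC NET, the z-update becomes soft-thresholding or a projection onto the L1 ball instead of the max with 0.

```scala
import breeze.linalg.{DenseMatrix, DenseVector, norm}

// Sketch of ADMM for: minimize 0.5 * x'Hx + c'x  subject to x >= 0.
// Split as f(x) + g(z) with x = z, where g is the indicator of {z >= 0}.
def admmPositive(h: DenseMatrix[Double],
                 c: DenseVector[Double],
                 rho: Double = 1.0,
                 maxIters: Int = 200,
                 tol: Double = 1e-6): DenseVector[Double] = {
  val n = c.length
  val a = h + DenseMatrix.eye[Double](n) * rho   // a real solver would factor this once
  var x = DenseVector.zeros[Double](n)
  var z = DenseVector.zeros[Double](n)
  var u = DenseVector.zeros[Double](n)           // scaled dual variable
  var iter = 0
  var done = false
  while (iter < maxIters && !done) {
    // x-update: solve (H + rho * I) x = rho * (z - u) - c
    x = a \ ((z - u) * rho - c)
    // z-update: proximal step for g, i.e. projection onto the nonnegative orthant
    val zOld = z
    z = (x + u).map(v => math.max(v, 0.0))
    // u-update: scaled dual ascent
    u = u + x - z
    // crude stopping rule based on primal and dual residuals
    done = norm(x - z) < tol && norm((z - zOld) * rho) < tol
    iter += 1
  }
  z
}
```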
For use cases, the PR is focused on the following:
Example run:
MASTER=spark://localhost:7077 ./bin/run-example mllib.MovieLensALS --rank 20 --numIterations 10 --userConstraint SMOOTH --lambdaUser 0.065 --productConstraint SPARSE --lambdaProduct 0.1 --kryo hdfs://localhost:8020/sandbox/movielens/
References:
2007 Sparse coding: papers.nips.cc/paper/2979-efficient-sparse-coding-algorithms.pdf
2011 Sparse Latent Semantic Analysis (SLSA; some of it is implemented in GraphLab):
https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf
2012 Sparse Coding + MR/MPI Microsoft: http://web.stanford.edu/group/mmds/slides2012/s-hli.pdf
Implementing the 20NG flow to validate the sparse coding result improvement over LDA-based topic modeling.
Reference:
Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization
The EQUALITY formulation with a quadratic loss is an approximation to the KL-divergence loss used in PLSA (see the sketch below). We are interested in seeing whether it improves the results further compared to the sparse coding formulation.
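Roughly, writing that out (a sketch only; W is the word-topic factor, h_j the document factor being solved for, r_j the document's normalized counts, and Delta the probability simplex — all names illustrative):

```latex
\text{PLSA-style loss:}\quad
\min_{h_j \in \Delta} \; \mathrm{KL}\!\left(r_j \,\middle\|\, W h_j\right)
\qquad\leadsto\qquad
\text{EQUALITY:}\quad
\min_{h_j} \; \tfrac{1}{2}\,\lVert r_j - W h_j \rVert_2^2
\;\; \text{s.t. } \mathbf{1}^\top h_j = 1,\; h_j \ge 0 .
```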
Next steps:
Detailed experiments are on the JIRA https://issues.apache.org/jira/browse/SPARK-2426