
Conversation

@debasish83

ALS is a generic algorithm for matrix factorization that is equally applicable in both feature space and similarity space. The current ALS supports L2 regularization and a positivity constraint. This PR introduces userConstraint and productConstraint to ALS and lets the user select different constraints for the user and product solves. The supported constraints are the following:

SMOOTH : default ALS with L2 regularization
ELASTIC NET: ALS with Elastic Net regularization
POSITIVE: ALS with positive factors
BOUNDS: ALS with factors bounded between a lower and an upper bound (defaults 0 and 1)
EQUALITY: ALS with an equality constraint (by default the factors are positive and sum to 1)

First, let's focus on the problem formulation. Both the implicit and explicit feedback ALS formulations can be written as a quadratic minimization problem with objective x^T H x + c^T x. Each constraint then takes a form like the following (shown here for the sparsity constraint):

minimize x^T H x + c^T x
s.t. ||x||_1 <= c (SPARSE constraint)

We rewrite the objective as f(x) = x^T H x + c^T x and the constraint as an indicator function g(x). Minimization of f(x) + g(x) can then be carried out using various forward-backward splitting algorithms; we choose ADMM for this PR.
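
As an illustration of the idea (not the actual QuadraticMinimizer API), here is a minimal scaled-form ADMM sketch in Breeze for the SPARSE case, written with the (1/2) x^T H x + c^T x scaling of the quadratic plus lambda * ||x||_1; all names and defaults below are hypothetical:

import breeze.linalg._

// Minimal ADMM sketch for: minimize 0.5 * x'Hx + c'x + lambda * ||x||_1
def admmSparse(h: DenseMatrix[Double], c: DenseVector[Double],
               lambda: Double, rho: Double = 1.0, iters: Int = 200): DenseVector[Double] = {
  val n = c.length
  val a = h + (DenseMatrix.eye[Double](n) * rho)   // x-update matrix is fixed across iterations
  var x = DenseVector.zeros[Double](n)
  var z = DenseVector.zeros[Double](n)
  var u = DenseVector.zeros[Double](n)
  for (_ <- 0 until iters) {
    x = a \ ((z - u) * rho - c)                    // x-update: solve (H + rho*I) x = rho*(z - u) - c
    val v = x + u
    z = v.map(vi => math.signum(vi) * math.max(math.abs(vi) - lambda / rho, 0.0))  // prox of lambda*||.||_1
    u = u + x - z                                  // scaled dual update
  }
  z
}

For POSITIVE the z-update would instead clamp negative entries to zero, and for EQUALITY it would project onto the probability simplex.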

For use cases, the PR focuses on the following:

  1. Sparse matrix factorization
    Example run:
    MASTER=spark://localhost:7077 ./bin/run-example mllib.MovieLensALS --rank 20 --numIterations 10 --userConstraint SMOOTH --lambdaUser 0.065 --productConstraint SPARSE --lambdaProduct 0.1 --kryo hdfs://localhost:8020/sandbox/movielens/
  2. Topic modeling using LSA
    References:
    2007 Sparse coding: papers.nips.cc/paper/2979-efficient-sparse-coding-algorithms.pdf
    2011 Sparse Latent Semantic Analysis (SLSA; partially implemented in GraphLab):
    https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf
    2012 Sparse Coding + MR/MPI Microsoft: http://web.stanford.edu/group/mmds/slides2012/s-hli.pdf
    We are implementing the 20NG flow to validate the improvement of sparse coding results over LDA-based topic modeling.
  3. Topic modeling using PLSA
    Reference:
    Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization
    The EQUALITY formulation with a quadratic loss is an approximation to the KL-divergence loss used in PLSA. We are interested to see whether it improves the result further compared to sparse coding (see the note below).
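    As a side note (my reading of the connection, not spelled out elsewhere in the PR): PLSA maximizes the multinomial log-likelihood, which up to constants is equivalent to minimizing the KL divergence between the normalized document-word counts P and the factor model WH, whereas the EQUALITY formulation replaces it with a quadratic loss under the same constraints:
    minimize_{W,H} sum_{ij} P_ij * log(P_ij / (WH)_ij)   (PLSA / KL loss)
    minimize_{W,H} ||P - WH||_F^2                        (quadratic approximation)
    s.t. factors nonnegative and summing to 1 (the EQUALITY constraint)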

Next steps:

  1. Move QuadraticMinimizerSuite.scala to breeze
  2. Scale the factorization rank and remove the need to construct H matrix
  3. Replace the quadratic loss x^T H x + c^T x with a general convex loss

Detailed experiments are on the JIRA https://issues.apache.org/jira/browse/SPARK-2426

Debasish Das added 30 commits June 12, 2014 20:24
…ing for distributed runs; rho=50 for equality constraint, default rho=1.0, alpha = 1.0 (no over-relaxation) for convergence study
…zer;Elastic net formulation in ALS, elastic net parameter not exposed to users
@debasish83
Author

I looked into it more, and I will open up an API in Breeze QuadraticMinimizer where an upper-triangular gram matrix can be passed in place of the dense gram matrix. However, the inner workspace has to be n x n, because for Cholesky we need to compute LL' and for a quasi-definite system we have to compute LDL' / LU, and both of them need n x n space. So I won't be able to decrease the QuadraticMinimizer workspace size. For dposv, the memory for LL' is allocated by the BLAS/LAPACK call and is not visible to the user.

@SparkQA

SparkQA commented Mar 23, 2015

Test build #29023 has finished for PR 3221 at commit 196d8c8.

  • This patch fails RAT tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class PowerMethod(maxIters: Int = 10,tolerance: Double = 1E-5) extends SerializableLogging
    • trait Proximal
    • case class ProjectIdentity() extends Proximal
    • case class ProjectProbabilitySimplex(s: Double) extends Proximal
    • case class ProjectL1(s: Double) extends Proximal
    • case class ProjectBox(l: DenseVector[Double], u: DenseVector[Double]) extends Proximal
    • case class ProjectPos() extends Proximal
    • case class ProjectSoc() extends Proximal
    • case class ProjectEquality(Aeq: DenseMatrix[Double], beq: DenseVector[Double]) extends Proximal
    • case class ProjectHyperPlane(a: DenseVector[Double], b: Double) extends Proximal
    • case class ProximalL1(var lambda: Double = 1.0) extends Proximal
    • case class ProximalL2() extends Proximal
    • case class ProximalSumSquare() extends Proximal
    • case class ProximalLogBarrier() extends Proximal
    • case class ProximalHuber() extends Proximal
    • case class ProximalLinear(c: DenseVector[Double]) extends Proximal
    • case class ProximalLp(c: DenseVector[Double]) extends Proximal
    • class QuadraticMinimizer(nGram: Int,

@debasish83
Author

@mengxr I added the optimization for the lower-triangular matrix and now they are very close. Let me know what you think and whether there are any other tricks you would like me to try. Note that with these optimizations, QuadraticMinimizer with the POSITIVE constraint will also run much faster.

Breeze QuadraticMinimizer (default):

unset solver; ./bin/spark-submit --master spark://tusca09lmlvt00c.uswin.ad.vzwcorp.com:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 1 ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --rank 50 --numIterations 2 ~/datasets/ml-1m/ratings.dat

Got 1000209 ratings from 6040 users on 3706 movies.
Training: 800670, test: 199539.
Quadratic minimization userConstraint SMOOTH productConstraint SMOOTH
Running Breeze QuadraticMinimizer for users with constraint SMOOTH
Running Breeze QuadraticMinimizer for items with constraint SMOOTH
Test RMSE = 2.4985081126233846.

15/03/23 12:26:55 INFO ALS: solveTime 205.379 ms
15/03/23 12:26:55 INFO ALS: solveTime 72.116 ms
15/03/23 12:26:56 INFO ALS: solveTime 74.034 ms
15/03/23 12:26:56 INFO ALS: solveTime 77.379 ms
15/03/23 12:26:57 INFO ALS: solveTime 36.532 ms
15/03/23 12:26:57 INFO ALS: solveTime 29.775 ms
15/03/23 12:26:58 INFO ALS: solveTime 48.925 ms
15/03/23 12:26:58 INFO ALS: solveTime 51.904 ms
15/03/23 12:26:59 INFO ALS: solveTime 30.882 ms
15/03/23 12:26:59 INFO ALS: solveTime 30.658 ms

ML CholeskySolver:

export solver=mllib; ./bin/spark-submit --master spark://tusca09lmlvt00c.uswin.ad.vzwcorp.com:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 1 ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --rank 50 --numIterations 2 ~/datasets/ml-1m/ratings.dat

Got 1000209 ratings from 6040 users on 3706 movies.
Training: 800670, test: 199539.
Quadratic minimization userConstraint SMOOTH productConstraint SMOOTH
Test RMSE = 2.4985081126233846.

TUSCA09LMLVT00C:spark-qp-als v606014$ grep solveTime ./work/app-20150323122612-0002/0/stderr
15/03/23 12:26:20 INFO ALS: solveTime 102.243 ms
15/03/23 12:26:21 INFO ALS: solveTime 38.195 ms
15/03/23 12:26:21 INFO ALS: solveTime 60.583 ms
15/03/23 12:26:22 INFO ALS: solveTime 59.882 ms
15/03/23 12:26:22 INFO ALS: solveTime 36.59 ms
15/03/23 12:26:23 INFO ALS: solveTime 36.021 ms
15/03/23 12:26:23 INFO ALS: solveTime 59.271 ms
15/03/23 12:26:24 INFO ALS: solveTime 59.217 ms
15/03/23 12:26:24 INFO ALS: solveTime 36.344 ms
15/03/23 12:26:25 INFO ALS: solveTime 35.838 ms

I am running only 2 iterations, but you can see in the tail that the solvers run at par.

@debasish83
Author

All the runtime enhancements are being added to Breeze in this PR: scalanlp/breeze#386
Please let me know if there is additional feedback. Except for the first 2 solves, the rest are pretty much the same. I looked into the memory allocation, and Breeze matrices/vectors are backed by new Array[Double] as well, similar to the MLlib version.

@debasish83
Author

@mengxr I discussed this with David, and the only reason I can think of is that inside the solvers I am using DenseMatrix and DenseVector in place of primitive arrays for workspace creation. That might be causing the first-iteration runtime difference, due to loading up the interface classes and other features that come with DenseMatrix and DenseVector. I can move to primitive arrays for the workspace, but then the code will look ugly. Let me know if I should. I am surprised that this issue does not show up after the first call!

@debasish83
Author

@mengxr any updates on this? Breeze 0.11.2 is now integrated with Spark. I can clean up the PR for review.

@debasish83
Author

I integrated with Breeze 0.11.2. The only visible difference is in the first iteration.

Breeze QuadraticMinimizer:

TUSCA09LMLVT00C:spark-qp-als v606014$ grep solveTime ./work/app-20150327221722-0000/0/stderr
15/03/27 22:17:32 INFO ALS: solveTime 234.153 ms
15/03/27 22:17:32 INFO ALS: solveTime 82.499 ms
15/03/27 22:17:33 INFO ALS: solveTime 83.579 ms
15/03/27 22:17:33 INFO ALS: solveTime 83.039 ms
15/03/27 22:17:34 INFO ALS: solveTime 35.545 ms
15/03/27 22:17:34 INFO ALS: solveTime 30.707 ms
15/03/27 22:17:35 INFO ALS: solveTime 53.025 ms
15/03/27 22:17:36 INFO ALS: solveTime 53.021 ms
15/03/27 22:17:36 INFO ALS: solveTime 31.329 ms
15/03/27 22:17:37 INFO ALS: solveTime 32.136 ms

mllib CholeskySolver:

TUSCA09LMLVT00C:spark-qp-als v606014$ grep solveTime ./work/app-20150327221803-0001/0/stderr
15/03/27 22:18:11 INFO ALS: solveTime 98.692 ms
15/03/27 22:18:12 INFO ALS: solveTime 38.997 ms
15/03/27 22:18:12 INFO ALS: solveTime 62.361 ms
15/03/27 22:18:13 INFO ALS: solveTime 60.316 ms
15/03/27 22:18:13 INFO ALS: solveTime 36.569 ms
15/03/27 22:18:14 INFO ALS: solveTime 36.321 ms
15/03/27 22:18:14 INFO ALS: solveTime 60.007 ms
15/03/27 22:18:15 INFO ALS: solveTime 59.771 ms
15/03/27 22:18:15 INFO ALS: solveTime 36.519 ms
15/03/27 22:18:16 INFO ALS: solveTime 38.295 ms

The visible difference is in the first 2 iterations, as shown in the previous experiments as well. I have fixed the random-seed test now, so different runs will not produce the same result.

I need this structure to build ALM, since ALM extends mllib.ALS and adds a LossType to the constructor along with userConstraint and itemConstraint.

Right now I am experimenting with the LeastSquare loss (for validation against ALS), and then I will start experimenting with the LogLikelihood loss. I am keen to run very large (document, word, count) datasets through ALM once it is ready.

For this PR I have updated MovieLensALS with userConstraint and itemConstraint, and I am considering whether we should add a sparse coding formulation to the examples now or bring it in a separate PR.

I have not cleaned up CholeskySolver from ALS yet since I am waiting for feedback, but I have added test cases in ml.ALSSuite for all the constraints. At the ALS flow level I need to construct more test cases, and I can bring them in a separate PR as well.

@SparkQA

SparkQA commented Mar 28, 2015

Test build #29340 has finished for PR 3221 at commit 1848181.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 28, 2015

Test build #29353 has finished for PR 3221 at commit 33b5a97.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@debasish83
Author

What are MiMa tests? I am a bit confused about them. From the logs it looks like the ALS class now takes userConstraint and productConstraint, which are Enumerations, and that caused the failure. Is Enumeration.Value not allowed in class parameters?

@debasish83
Author

@mengxr @jkbradley In my internal testing, I am finding the sparse formulations useful for extracting genre/topic information out of the Netflix/MovieLens datasets. I did not get any improvement in MAP / RMSE (which was expected based on other papers). The sparse formulations are:

  1. Sparse coding: L2 on users/words, L1 on documents/movies
  2. L2 on users/words, probability simplex on documents/movies (see the projection sketch at the end of this comment)

The reference,
2011 Sparse Latent Semantic Analysis (SLSA; partially implemented in GraphLab):
https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf
showed sparse coding producing better results than LDA.

I am considering whether it makes sense to add to the examples the 20 newsgroups flow shown in the paper, to demonstrate the value of the sparsity implemented in ALS. Also, do we have perplexity implemented, so that we can start comparing topic models? The ALS runtime with the sparse formulations is also pretty good.
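
For reference, the probability-simplex projection behind formulation 2 (and the EQUALITY constraint) is the standard sort-based Euclidean projection (Duchi et al., 2008). A minimal plain-array sketch, purely illustrative and not the PR's ProjectProbabilitySimplex class:

// Euclidean projection onto {x : x >= 0, sum(x) = s}.
def projectSimplex(v: Array[Double], s: Double = 1.0): Array[Double] = {
  val mu = v.sorted(Ordering[Double].reverse)          // sort descending
  val cumsum = mu.scanLeft(0.0)(_ + _).tail            // running sums of the sorted values
  var rho = 0
  var i = 0
  while (i < mu.length) {                              // largest 1-based j with mu(j) - (cumsum(j) - s)/j > 0
    if (mu(i) - (cumsum(i) - s) / (i + 1) > 0) rho = i + 1
    i += 1
  }
  val theta = (cumsum(rho - 1) - s) / rho
  v.map(vi => math.max(vi - theta, 0.0))
}

// Example: projectSimplex(Array(0.2, 1.3, -0.1)) == Array(0.0, 1.0, 0.0), which sums to 1.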

@jkbradley
Member

@debasish83 Including sparsity for both recommendation and topic modeling will be useful, to be sure. We don't have perplexity (or even prediction) implemented yet, but that definitely needs to be done for LDA. New JIRA: [https://issues.apache.org/jira/browse/SPARK-6793]

@debasish83
Author

Can we use RankingMetrics to evaluate the predictions on a document dataset? LDA vs. sparse coding in this PR, for example; we can actually do that on 20NG or some other known dataset. Let me know which one I should try. Perplexity is a somewhat different measure than MAP, for example.

@jkbradley
Member

There isn't a clear answer about which metric is best, and none correspond that well with human perception. Most of the literature seems to think perplexity is better than log likelihood. If you can get labeled data, then using predictions would be reasonable, but even then, it's hard to say what the best metric is. For 20 newsgroups, maybe we could compare intra- and inter- newsgroup similarity based on topic distributions?

@debasish83
Author

In this paper https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf Xi compared classifier accuracy on the features extracted by sparse coding and LDA; we can also do that. Ranking might be a good idea as well. Log likelihood is tricky, since with only a positivity constraint and no sparsity you will for sure get a higher log likelihood than someone who is adding sparsity!

@jkbradley
Member

Good point about log likelihood not being very meaningful. Perhaps the same will be true about perplexity?

Using the topics as features sounds like a good idea. If we want to argue that topics are meaningful, comparing within and between newsgroups might be best. That covers the 2 main use cases for LDA I've heard of.

@debasish83
Author

Cool...let me do that and add a 20NG flow...

@debasish83
Author

Just out of curiosity, what's the second use case? Topics as features is good, but most datasets where someone will run this are unlabeled, so we have to just pick one of ranking/perplexity. I am comparing whether log likelihood with positivity vs. log likelihood with sparsity improves the ranking metric in 0-1 recommendation compared to ALS implicit feedback, but I am not sure if it is possible to use a ranking metric on any unlabeled dataset. Then again, ranking does not care about the underlying topic quality, so maybe we need both the feature-extractor flow and perplexity.

@jkbradley
Member

By "2nd use case," I meant trying to recover topics which correspond to human perceptions. Since the 20 newsgroup dataset is nicely divided into 20 human-perceived groups, it seems reasonable to expect that:

  • For documents within a newsgroup, topic distributions should be similar
  • For documents across newsgroups, topic distributions should be different

It's just a heuristic, but at least it's based on human-defined groupings.
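
If it helps, a rough sketch of that heuristic (names are illustrative; docs pairs a newsgroup label with a per-document topic distribution):

def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
}

// Returns (average intra-newsgroup similarity, average inter-newsgroup similarity).
def intraInterSimilarity(docs: Seq[(String, Array[Double])]): (Double, Double) = {
  val sims = for {
    i <- docs.indices
    j <- docs.indices if i < j
  } yield (docs(i)._1 == docs(j)._1, cosine(docs(i)._2, docs(j)._2))
  val (intra, inter) = sims.partition(_._1)            // same-group vs. cross-group document pairs
  (intra.map(_._2).sum / intra.size, inter.map(_._2).sum / inter.size)
}

If the learned topics are meaningful, the intra-group average should be noticeably higher than the inter-group one.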

@debasish83
Author

@jkbradley we still could not access the Wikipedia dataset on EC2. Would it be possible for you to upload the 1-billion-token dataset to EC2? I wanted to do a sparse coding scalability run on the large dataset as well.

@debasish83
Author

@jkbradley let me know if you need vzcloud access and I can create a few nodes for you. EC2 might be easier for others to access as well.

@jkbradley
Member

@debasish83 Just to make sure, you're specifying it as a requester-pays bucket, right? [http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html]

@debasish83
Author

Oh sorry, I don't know about requester pays; let me look into it.

@andrewor14
Contributor

@mengxr @jkbradley is this still relevant given the recent changes in ALS?

@rxin
Contributor

rxin commented Dec 31, 2015

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

@asfgit asfgit closed this in 7b4452b Dec 31, 2015