Skip to content

Commit d31c618

Browse files
hhbyyhmengxr
authored andcommitted
[SPARK-7368] [MLLIB] Add QR decomposition for RowMatrix
jira: https://issues.apache.org/jira/browse/SPARK-7368 Add QR decomposition for RowMatrix. I'm not sure what's the blueprint about the distributed Matrix from community and whether this will be a desirable feature , so I sent a prototype for discussion. I'll go on polish the code and provide ut and performance statistics if it's acceptable. The implementation refers to the [paper: https://www.cs.purdue.edu/homes/dgleich/publications/Benson%202013%20-%20direct-tsqr.pdf] Austin R. Benson, David F. Gleich, James Demmel. "Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures", 2013 IEEE International Conference on Big Data, which is a stable algorithm with good scalability. Currently I tried it on a 400000 * 500 rowMatrix (16 partitions) and it can bring down the computation time from 8.8 mins (using breeze.linalg.qr.reduced) to 2.6 mins on a 4 worker cluster. I think there will still be some room for performance improvement. Any trial and suggestion is welcome. Author: Yuhao Yang <[email protected]> Closes #5909 from hhbyyh/qrDecomposition and squashes the following commits: cec797b [Yuhao Yang] remove unnecessary qr 0fb1012 [Yuhao Yang] hierarchy R computing 3fbdb61 [Yuhao Yang] update qr to indirect and add ut 0d913d3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition 39213c3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition c0fc0c7 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition 39b0b22 [Yuhao Yang] initial draft for discussion
1 parent 6175d6c commit d31c618

File tree

3 files changed

+70
-1
lines changed

3 files changed

+70
-1
lines changed

mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,3 +25,11 @@ import org.apache.spark.annotation.Experimental
2525
*/
2626
@Experimental
2727
case class SingularValueDecomposition[UType, VType](U: UType, s: Vector, V: VType)
28+
29+
/**
30+
* :: Experimental ::
31+
* Represents QR factors.
32+
*/
33+
@Experimental
34+
case class QRDecomposition[UType, VType](Q: UType, R: VType)
35+

mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala

Lines changed: 45 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,7 @@ import java.util.Arrays
2222
import scala.collection.mutable.ListBuffer
2323

2424
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV, SparseVector => BSV, axpy => brzAxpy,
25-
svd => brzSvd}
25+
svd => brzSvd, MatrixSingularException, inv}
2626
import breeze.numerics.{sqrt => brzSqrt}
2727
import com.github.fommil.netlib.BLAS.{getInstance => blas}
2828

@@ -497,6 +497,50 @@ class RowMatrix(
497497
columnSimilaritiesDIMSUM(computeColumnSummaryStatistics().normL2.toArray, gamma)
498498
}
499499

500+
/**
501+
* Compute QR decomposition for [[RowMatrix]]. The implementation is designed to optimize the QR
502+
* decomposition (factorization) for the [[RowMatrix]] of a tall and skinny shape.
503+
* Reference:
504+
* Paul G. Constantine, David F. Gleich. "Tall and skinny QR factorizations in MapReduce
505+
* architectures" ([[http://dx.doi.org/10.1145/1996092.1996103]])
506+
*
507+
* @param computeQ whether to computeQ
508+
* @return QRDecomposition(Q, R), Q = null if computeQ = false.
509+
*/
510+
def tallSkinnyQR(computeQ: Boolean = false): QRDecomposition[RowMatrix, Matrix] = {
511+
val col = numCols().toInt
512+
// split rows horizontally into smaller matrices, and compute QR for each of them
513+
val blockQRs = rows.glom().map { partRows =>
514+
val bdm = BDM.zeros[Double](partRows.length, col)
515+
var i = 0
516+
partRows.foreach { row =>
517+
bdm(i, ::) := row.toBreeze.t
518+
i += 1
519+
}
520+
breeze.linalg.qr.reduced(bdm).r
521+
}
522+
523+
// combine the R part from previous results vertically into a tall matrix
524+
val combinedR = blockQRs.treeReduce{ (r1, r2) =>
525+
val stackedR = BDM.vertcat(r1, r2)
526+
breeze.linalg.qr.reduced(stackedR).r
527+
}
528+
val finalR = Matrices.fromBreeze(combinedR.toDenseMatrix)
529+
val finalQ = if (computeQ) {
530+
try {
531+
val invR = inv(combinedR)
532+
this.multiply(Matrices.fromBreeze(invR))
533+
} catch {
534+
case err: MatrixSingularException =>
535+
logWarning("R is not invertible and return Q as null")
536+
null
537+
}
538+
} else {
539+
null
540+
}
541+
QRDecomposition(finalQ, finalR)
542+
}
543+
500544
/**
501545
* Find all similar columns using the DIMSUM sampling algorithm, described in two papers
502546
*

mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/RowMatrixSuite.scala

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ package org.apache.spark.mllib.linalg.distributed
1919

2020
import scala.util.Random
2121

22+
import breeze.numerics.abs
2223
import breeze.linalg.{DenseVector => BDV, DenseMatrix => BDM, norm => brzNorm, svd => brzSvd}
2324

2425
import org.apache.spark.SparkFunSuite
@@ -238,6 +239,22 @@ class RowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
238239
}
239240
}
240241
}
242+
243+
test("QR Decomposition") {
244+
for (mat <- Seq(denseMat, sparseMat)) {
245+
val result = mat.tallSkinnyQR(true)
246+
val expected = breeze.linalg.qr.reduced(mat.toBreeze())
247+
val calcQ = result.Q
248+
val calcR = result.R
249+
assert(closeToZero(abs(expected.q) - abs(calcQ.toBreeze())))
250+
assert(closeToZero(abs(expected.r) - abs(calcR.toBreeze.asInstanceOf[BDM[Double]])))
251+
assert(closeToZero(calcQ.multiply(calcR).toBreeze - mat.toBreeze()))
252+
// Decomposition without computing Q
253+
val rOnly = mat.tallSkinnyQR(computeQ = false)
254+
assert(rOnly.Q == null)
255+
assert(closeToZero(abs(expected.r) - abs(rOnly.R.toBreeze.asInstanceOf[BDM[Double]])))
256+
}
257+
}
241258
}
242259

243260
class RowMatrixClusterSuite extends SparkFunSuite with LocalClusterSparkContext {

0 commit comments

Comments
 (0)