Conversation

@johnc1231
Contributor

@johnc1231 johnc1231 commented Mar 28, 2017

What changes were proposed in this pull request?

  • I added the method toBlockMatrixDense to the IndexedRowMatrix class. The current implementation of toBlockMatrix is insufficient for users with relatively dense IndexedRowMatrix objects, since it assumes sparsity.

EDIT: Ended up deciding that there should be just a single toBlockMatrix method, which creates a BlockMatrix whose blocks may be dense or sparse depending on the sparsity of the rows. This method will work better on any current use case of toBlockMatrix and doesn't go through CoordinateMatrix like the old method.
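To make the behavior concrete, here is a minimal usage sketch of the unified method (assuming a spark-shell SparkContext sc; the data is made up for illustration):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Two dense rows and one sparse row; each resulting block is stored as a
// DenseMatrix or a SparseMatrix depending on how many non-zeros it holds.
val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
  IndexedRow(1L, Vectors.dense(5.0, 6.0, 7.0, 8.0)),
  IndexedRow(2L, Vectors.sparse(4, Array(1), Array(9.0)))))

val mat = new IndexedRowMatrix(rows)
val blockMat = mat.toBlockMatrix(2, 2)   // 2 x 2 blocks
blockMat.validate()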

How was this patch tested?

I used the same tests already written for toBlockMatrix() to test this method. I also added a new unit test for an edge case that was not adequately covered by the current test suite.

I ran the original IndexedRowMatrix tests, plus wrote more to better handle edge cases ignored by the original tests.

  .zipWithIndex
  .map({ case (values, blockColumn) =>
    ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values))
  })
Member

If I'm not missing anything, the parameters of GridPartitioner are wrong. They should be:

GridPartitioner(numRowBlocks, numColBlocks, rows.partitions.length)
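For reference, the two block counts come from ceiling division of the matrix dimensions by the block dimensions; a small standalone illustration with made-up numbers:

// Hypothetical 100 x 10 matrix split into 3 x 4 blocks.
val m = 100L
val n = 10L
val rowsPerBlock = 3
val colsPerBlock = 4
val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt   // 34
val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt   // 3
// These, together with rows.partitions.length, are the GridPartitioner arguments.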

Contributor Author

You're right. My code makes the assumption that there is a single block per partition, which is incorrect. Thanks for that.

toBlockMatrix(1024, 1024)
}


Member

@viirya viirya Apr 2, 2017

Please remove the extra line.

}
}

test("toBlockMatrixDense") {
Member

I don't see a test for the newly added toBlockMatrixDense, do you?

Contributor Author

I'm confused; you seem to have commented right on the toBlockMatrixDense tests. Originally, toBlockMatrix had only the tests marked with the comment // Tests when n % colsPerBlock != 0. I added the tests marked with // Tests when m % rowsPerBlock != 0 to toBlockMatrix, then used the same tests for the Dense version.

Contributor Author

Oh I see what you mean now, will fix.


/**
 * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
 */
Member

Is it a good idea to have both toBlockMatrix and toBlockMatrixDense for converting to BlockMatrix?

Shall we combine them and have just one toBlockMatrix method?

Contributor Author

I have been going back and forth on this myself. I think converting to a BlockMatrix backed by dense matrices is better default behavior than one backed by sparse matrices, but the current implementation of toBlockMatrix advertises that it converts to a BlockMatrix backed by SparseMatrices, and I thought changing that could negatively affect people who want that behavior. I suppose we could add a default argument to toBlockMatrix like isSparse = true so that it would not break anyone's code, but people would be able to convert to the dense version if they wanted. What do you think of that?

Member

I am not sure that DenseMatrix-backed is the better default behavior for toBlockMatrix. The rows in an IndexedRowMatrix can be sparse or dense. Choosing between SparseMatrix-backed and DenseMatrix-backed depends entirely on the use case.

Looks like toBlockMatrixDense is already a fairly large function. Merging it with the current toBlockMatrix might not be a good idea. I'd keep it as it is now.

Member

Actually I think we can generalize the change to SparseMatrix-based BlockMatrix too. But maybe we can do that in a follow-up PR.

Contributor Author

@johnc1231 johnc1231 Apr 3, 2017

Ignore this comment. I moved my thoughts on this to a new comment at the bottom of the thread.

@johnc1231
Contributor Author

Addressed the comments where everything was clear, and replied to the last one about only having one toBlockMatrix. Back to you, @viirya. Thanks for the feedback.

* a smaller value. Must be an integer value greater than 0.
* @param colsPerBlock The number of columns of each block. The blocks at the right edge may have
* a smaller value. Must be an integer value greater than 0.
* @return a [[BlockMatrix]]
Member

Add a @Since annotation like toBlockMatrix. Although I doubt this can make it into 2.2.0, you can set it to 2.2.0 temporarily. If there is a suggested version from the committers, we can change it later.

@viirya
Member

viirya commented Apr 3, 2017

ok to test.

@viirya
Member

viirya commented Apr 3, 2017

Oh, it seems only committers can trigger the Jenkins test. cc @jkbradley @MLnick

@MLnick
Contributor

MLnick commented Apr 3, 2017

ok to test

@SparkQA

SparkQA commented Apr 3, 2017

Test build #75475 has finished for PR 17459 at commit 06c2b3a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock)
}

/**
Member

The style of the comments here and below is not correct. Can you fix it?


ir.vector.toArray
.grouped(colsPerBlock)
.zipWithIndex
Member

Style: where you are writing ({ ... }) just write { ... }
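A generic before/after illustration of this nit (made-up data, not from the PR):

val pairs = Seq((1, "a"), (2, "b"))
pairs.map({ case (k, v) => s"$k$v" })   // redundant wrapping parentheses
pairs.map { case (k, v) => s"$k$v" }    // preferred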

((blockRow.toInt, blockColumn), (rowInBlock.toInt, values))
})
}).groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rows.getNumPartitions)).map({
case ((blockRow, blockColumn), itr) =>
Member

We usually don't put a type on vals/vars unless it's important for clarity or needed for a cast

@johnc1231
Contributor Author

Thanks for the feedback, guys. All comments addressed, though if anyone else has an opinion on the discussion @viirya and I are having about whether there should be a separate toBlockMatrixDense method or just an argument to specify dense or sparse in the default toBlockMatrix method, please weigh in. Thanks.

@SparkQA

SparkQA commented Apr 3, 2017

Test build #75482 has finished for PR 17459 at commit 12e78bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@johnc1231
Contributor Author

Did some thinking about this, and I think that to make the API cleaner maybe we could deprecate the regular toBlockMatrix method and add toBlockMatrixSparse. Until it's removed, we could just have toBlockMatrix call toBlockMatrixSparse. I think that'd be more explicit and make it clearer for users what kind of BlockMatrix they're creating. After that I think it'd be easy enough for me to abstract out the DenseMatrix creation step of toBlockMatrixDense and make it a general purpose helper method that toBlockMatrixDense and toBlockMatrixSparse can call to create the specific type of BlockMatrix they want.

I think this explicitness is important since it seems a lot of users create a BlockMatrix through these methods as opposed to with a BlockMatrix constructor since it's kind of a hard constructor to use (official Spark docs also suggest to users that it's easier to use toBlockMatrix than to attempt to use constructors: https://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix).

If we don't want to go the deprecation route, we could have toBlockMatrix take an argument specifying whether the data is sparse or dense, but I think that should be an explicitly required argument since it's otherwise easy to create something unintended.

@viirya
Member

viirya commented Apr 4, 2017

I've done a bit of prototyping locally to generalize this change to SparseMatrix. While doing that, a question came up: do we require that all matrices in a BlockMatrix be the same kind of Matrix (i.e., all DenseMatrix or all SparseMatrix)?

Actually, we can easily have a single toBlockMatrix method that creates a BlockMatrix containing both DenseMatrix and SparseMatrix blocks, depending on whether each block is sparse or not.

From the external view of this API, there is no explicit difference between a SparseMatrix-backed and a DenseMatrix-backed BlockMatrix. We don't have subclasses for it, nor any property that can be used to tell them apart. Doesn't that mean we don't really care about it?

@johnc1231
Contributor Author

johnc1231 commented Apr 4, 2017

@viirya I think we definitely care about giving users the ability to make either dense or sparse Block matrices. I made a 100k by 10k IndexedRowMatrix of random doubles, then converted it to a BlockMatrix to multiply it by its transpose. With the current toBlockMatrix implementation, that took 252 seconds on 128 cores. With my implementation, that took 35 seconds. The backing of a BlockMatrix matters a lot, and we need to let users be explicit about it.

I considered having toBlockMatrix check if the rows of IndexedRowMatrix were dense or sparse, but there is no guarantee of consistency. Like, an IndexedRowMatrix could be a mix of Dense and Sparse Vectors. In that case, it would not be clear what type of BlockMatrix to create. A decent approximation of this would be to just decide the matrix type based on the first vector we look at in the iterator we get from groupByKey, creating a mix of Dense and Sparse matrices in a BlockMatrix, but I still think it's best to be explicit. Also, we currently have the description of toBlockMatrix promising to make a BlockMatrix backed by instances of SparseMatrix, so we have made promises to users about the composition of the BlockMatrix before.
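For anyone who wants to reproduce the dense comparison, here is a rough sketch of the setup described above, scaled down so it runs locally (assumes a spark-shell SparkContext sc; the sizes and block size are illustrative, not the exact benchmark code):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import scala.util.Random

// Scaled-down stand-in for the 100k by 10k matrix of random doubles.
val numRows = 10000L
val numCols = 1000
val rows = sc.parallelize(0L until numRows, 64).map { i =>
  IndexedRow(i, Vectors.dense(Array.fill(numCols)(Random.nextDouble())))
}
val blockMat = new IndexedRowMatrix(rows).toBlockMatrix(1024, 1024).cache()

val start = System.nanoTime()
blockMat.multiply(blockMat.transpose).blocks.count()   // force the multiplication
println(s"A * A^T took ${(System.nanoTime() - start) / 1e9} s")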

@viirya
Member

viirya commented Apr 4, 2017

I considered having toBlockMatrix check if the rows of IndexedRowMatrix were dense or sparse, but there is no guarantee of consistency. Like, an IndexedRowMatrix could be a mix of Dense and Sparse Vectors. In that case, it would not be clear what type of BlockMatrix to create. A decent approximation of this would be to just decide the matrix type based on the first vector we look at in the iterator we get from groupByKey, creating a mix of Dense and Sparse matrices in a BlockMatrix, but I still think it's best to be explicit. Also, we currently have the description of toBlockMatrix promising to make a BlockMatrix backed by instances of SparseMatrix, so we have made promises to users about the composition of the BlockMatrix before.

I don't mean we don't care about it. I meant that whether a BlockMatrix is backed by DenseMatrix or SparseMatrix, it has the same behavior, and we can't tell the difference between them now. By the way, there is no guarantee that a BlockMatrix consists purely of DenseMatrix or SparseMatrix blocks. It could be a mix of them.

Thus, we can have a toBlockMatrix that creates a BlockMatrix that is a mix of DenseMatrix and SparseMatrix. A block in a BlockMatrix can be either a DenseMatrix or a SparseMatrix, depending on the ratio of non-zero values in the block. Yes, it is like the "decent approximation" you mentioned.

For an IndexedRowMatrix composed entirely of DenseVectors, this toBlockMatrix definitely returns a BlockMatrix backed by DenseMatrix. For other cases, DenseMatrix might not be the best choice for all blocks in the BlockMatrix, as many blocks will be sparse.

About the promise that toBlockMatrix makes a BlockMatrix backed by instances of SparseMatrix: as I said, it is not explicitly bound at the API level. I think it is not a big problem.

@johnc1231
Contributor Author

Alright, I agree with this. We'll switch between Dense and Sparse matrix backings based on the type of the first vector in the iterator. I'd be happy to take on making these adjustments.

@viirya
Member

viirya commented Apr 4, 2017

@johnc1231 The prototype I did: https://github.com/apache/spark/compare/master...viirya:general-toblockmatrix?expand=1

Maybe you can take a look and see if it is useful to you.

@johnc1231
Contributor Author

I think your prototype looks good. I'm pretty much just gonna do exactly that then.

@johnc1231
Contributor Author

@viirya I made changes exactly as you did in your prototype, plus a few style edits. But yeah, I think this is a good, easy-to-use implementation that will be better in all use cases than the current one.

@SparkQA

SparkQA commented Apr 4, 2017

Test build #75518 has finished for PR 17459 at commit 4582a7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 24, 2017

Test build #76107 has finished for PR 17459 at commit d692d30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@johnc1231
Contributor Author

@viirya I fixed the test as you asked, so please take a look when you get a chance. I'm having a little bit of trouble with my local spark build for some reason, but I'll do that other benchmark when it's resolved.

@johnc1231
Contributor Author

@viirya Any more feedback on this?

@viirya
Member

viirya commented Apr 28, 2017

@johnc1231 Thanks for updating this. I'll review it over the weekend.

idxRowMatDense.toBlockMatrix(2, 0)
}

assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
Member

nit: the style looks weird.

Maybe:

assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
  matrix.isInstanceOf[DenseMatrix]
}.reduce(_ && _))


assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _))
assert(blockMat2.blocks.map { case (_, matrix: Matrix) =>
Member

nit: same styling as suggested above.

matrix.isInstanceOf[SparseMatrix]}.reduce(_ && _))
assert(blockMat2.blocks.map { case (_, matrix: Matrix) =>
matrix.isInstanceOf[SparseMatrix]}.reduce(_ && _))
}
Member

nit: same styling as suggested above.

@viirya
Member

viirya commented May 3, 2017

Except for a few comments regarding style, the code changes LGTM.

And it'd be good if we could have a benchmark for the sparse case too.

cc @MLnick @jkbradley for review.

@johnc1231
Contributor Author

Did a sparse benchmark (2014 MacBook Pro with a 2.2 GHz i7): 60 partitions, a 10k by 10k matrix of mostly 0s with 10% 1s, made of SparseVectors. Both the old method and the new method took about 7.5 seconds.
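A rough sketch of how that sparse input could be generated (the exact benchmark code was not posted; assumes a spark-shell SparkContext sc):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import scala.util.Random

val n = 10000
val rows = sc.parallelize(0L until 10000L, 60).map { i =>
  // Roughly 10% of the entries in each row are 1.0; the rest are implicit zeros.
  val indices = (0 until n).filter(_ => Random.nextDouble() < 0.1).toArray
  IndexedRow(i, Vectors.sparse(n, indices, Array.fill(indices.length)(1.0)))
}
val sparseMat = new IndexedRowMatrix(rows)

val start = System.nanoTime()
sparseMat.toBlockMatrix(1024, 1024).blocks.count()   // force the conversion
println(s"toBlockMatrix took ${(System.nanoTime() - start) / 1e9} s")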

@johnc1231
Contributor Author

johnc1231 commented May 5, 2017

@viirya Addressed style nitpicks and did sparse benchmarks. Think that should be everything.

@SparkQA

SparkQA commented May 5, 2017

Test build #76504 has finished for PR 17459 at commit 994b457.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@johnc1231
Contributor Author

@viirya Now that style nitpicks and sparse benchmarks are done, are you good with this? Also, per your recommendation, CCing @MLnick and @jkbradley for review of this. Should be easy to review, since we've iterated on it a lot.

@johnc1231
Contributor Author

Also made the changes suggested by @srowen. I don't know if he now has to sign off on those changes being done.

@johnc1231
Contributor Author

This has been reviewed pretty thoroughly at this point. Can a committer give this a quick look? @srowen @MLnick @jkbradley I think it's basically ready to go in.

Member

@srowen srowen left a comment

Looking pretty good, though I had a few questions on returning to look again

}
}
val denseMatrix = new DenseMatrix(actualNumRows, actualNumColumns, matrixAsArray)
val finalMatrix = if (countForValues / arraySize.toDouble >= 0.5) {
Member

BlockMatrix seems to use sparse representations when <= 10% of values are non-zero when converting to an indexed row matrix. Maybe go with that?
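A small standalone illustration of that threshold (the 0.5 in the diff above would become 0.1; the names and numbers here are made up):

import org.apache.spark.mllib.linalg.DenseMatrix

// A 4 x 4 block with a single non-zero value is only 1/16 full, so with a
// 10% cut-off it would be stored as a SparseMatrix rather than a DenseMatrix.
val denseMatrix = new DenseMatrix(4, 4, Array.tabulate(16)(i => if (i == 5) 3.0 else 0.0))
val fillRatio = denseMatrix.numNonzeros / (denseMatrix.numRows * denseMatrix.numCols).toDouble
val finalMatrix = if (fillRatio > 0.1) denseMatrix else denseMatrix.toSparse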


val m = numRows()
val n = numCols()
val lastRowBlockIndex = m / rowsPerBlock
Member

So this is the index of the final, smaller block, if any? I get it, but if m = 100 and rowsPerBlock = 10 then this is 10, which is not the index of the last row block. There is no leftover smaller block and the last one is 9. I think the code works and I'm splitting hairs, but I wonder if this would be clearer as the "remainder" block index or something?

Contributor Author

Good point. Replaced word "last" with "remainder" and added a small clarifying comment.

ir.vector match {
  case SparseVector(size, indices, values) =>
    indices.zip(values).map { case (index, value) =>
      val blockColumn = index / colsPerBlock
Member

I think there's an assumption here that the block index can't be larger than an Int, but it could, right? Conceptually, the index in an IndexedRow could be huge. Does blockRow need to stay a Long, or am I overlooking why it won't happen?

Contributor Author

So it is true that IndexedRowMatrix could have a Long number of rows, but BlockMatrix is backed by an RDD of ((Int, Int), Matrix), so we're limited by that. I can just add a check that computes whether it's possible to make a BlockMatrix from the given IndexedRowMatrix.

val finalMatrix = if (countForValues / arraySize.toDouble >= 0.5) {
denseMatrix
} else {
denseMatrix.toSparse
Member

OK, this isn't inefficient because making the dense matrix doesn't copy or anything. Seems OK

@johnc1231
Contributor Author

@srowen Addressed the comments, back to you. And thanks for taking the time to look this over.

s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")

// Since block matrices require an integer row index
require(numRows() / rowsPerBlock < Int.MaxValue,
Member

Nit: isn't <= OK too? very much a corner case, but hey

Contributor Author

Good catch, that is true. Will change.

Contributor Author

Well, it's true if I do floating-point division. It's not necessarily true if it's long/int division.
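A quick illustration of the distinction (the numbers are made up to show the corner case):

val rowsPerBlock = 1024
val m = Int.MaxValue.toLong * rowsPerBlock + 1          // one extra row spills into one block too many

m / rowsPerBlock <= Int.MaxValue                        // true: truncating Long division hides the overflow
math.ceil(m.toDouble / rowsPerBlock) <= Int.MaxValue    // false: the extra block is counted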

assert(blockMat2.numCols() === n)
assert(blockMat2.toBreeze() === idxRowMatSparse.toBreeze())

assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
Member

Isn't this just blockMat.blocks.forall { case (_, matrix) => matrix.isInstanceOf[SparseMatrix] }?

Contributor Author

Pretty sure there is no forall on RDDs, which is why I wrote it this way. Could do it as collect().forall, I suppose.
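For reference, another option that stays distributed instead of collecting (a sketch equivalent in intent to the map/reduce version above):

// True when no block fails the check; filter + isEmpty avoids pulling every block to the driver.
assert(blockMat.blocks.filter { case (_, matrix) => !matrix.isInstanceOf[SparseMatrix] }.isEmpty())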

@johnc1231
Contributor Author

@srowen Fixed both, back to you

@SparkQA

SparkQA commented May 26, 2017

Test build #77437 has finished for PR 17459 at commit a7a03dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 26, 2017

Test build #77439 has finished for PR 17459 at commit 289dbdb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

Aside from the comment above, I reviewed the logic again and it looks good. CC @viirya

s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")

// Since block matrices require an integer row index
require(numRows() / rowsPerBlock.toDouble <= Int.MaxValue,
Member

So I guess the previous toBlockMatrix would have failed too when the number of rows exceeded this threshold? It looks like it, given how CoordinateMatrix.toBlockMatrix works. Hm, I wonder if you should put this warning over there too, because it will fail mysteriously otherwise. The result might even be wrong.

BTW, on second look, I realize this check isn't quite the same as the math that's performed below: math.ceil(m.toDouble / rowsPerBlock).toInt. I think you want to check exactly the same thing. Maybe move the check below the declarations of m and n, and just say require(math.ceil(m.toDouble / rowsPerBlock) <= Int.MaxValue). That's very clear.

Also, the cols need to be checked.

Member

+1 We should fix CoordinateMatrix.toBlockMatrix too.

Member

For cols, we may not need to do this check, because each IndexedRow can have at most Int.MaxValue columns.

Contributor Author

Yeah, even with a block size of one, IndexedRows are limited by the length of an array, which is itself limited by Int.MaxValue, so this should be fine.


((blockRow, blockColumn), finalMatrix)
}
new BlockMatrix(blocks, rowsPerBlock, colsPerBlock, this.numRows(), this.numCols())
Member

Nit: can the last two args simply be m, n for clarity?

@johnc1231
Contributor Author

@srowen @viirya All comments addressed, back to you guys. Hopefully we've just about reached something ready to commit.

@SparkQA

SparkQA commented May 31, 2017

Test build #77606 has finished for PR 17459 at commit f9c5506.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 1, 2017

Merged to master

@asfgit asfgit closed this in 0975019 Jun 1, 2017