-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-20109][MLlib] Rewrote toBlockMatrix method on IndexedRowMatrix #17459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| .zipWithIndex | ||
| .map({ case (values, blockColumn) => | ||
| ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values)) | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I don't miss anything, the parameters of GridPartitioner are wrong. Should be:
GridPartitioner(numRowBlocks, numColBlocks, rows.partitions.length)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. My code makes the assumption that there is a single block per partition, which is incorrect. Thanks for that.
| toBlockMatrix(1024, 1024) | ||
| } | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the extra line.
| } | ||
| } | ||
|
|
||
| test("toBlockMatrixDense") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see you test newly added toBlockMatrixDense, do you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused, you seem to have commented right on the toBlockMatrixDense tests. Originally, toBlockMatrix had only the tests marked with the comment // Tests when n % colsPerBlock != 0. I added the tests marked with // Tests when m % rowsPerBlock != 0 to toBlockMatrix, then used the same tests for the Dense version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see what you mean now, will fix.
|
|
||
| /** | ||
| * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a good idea to have both toBlockMatrix and toBlockMatrixDense for converting to BlockMatrix ?
Shall we combine them and have just one toBlockMatrix method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been going back and forth on this myself. I think converting to a BlockMatrix backed by dense matrices is better default behavior than one backed by sparse matrices, but the the current implementation of toBlockMatrix advertises that it converts to a BlockMatrix backed SparseMatrices, and I thought changing that could negatively affect people who want that behavior. I suppose we could add a default argument to toBlockMatrix like isSparse = true so that it would not break anyone's code but people would be able to convert to dense version if they wanted. What do you think of that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if DenseMatrix-backed is better default behavior for toBlockMatrix. Actually the rows in IndexedRowMatrix can be sparse or dense. Choose which one, SparkMatrix-backed or DenseMartix-backed, is totally depending on the use case.
Looks like toBlockMatrixDense is already a non-small function. Merging it with current toBlockMatrix might not a good idea. I'd keep it as it's now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think we can generalize the change to SparseMatrix-based BlockMatrix too. But maybe we can do it in following PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore this comment. Moved my thoughts on this to new comment down at the bottom of the thread.
|
Addressed comments where everything was clear, replied to the last one about only having one toBlockMatrix. Back to you @viirya . Thanks for feedback. |
| * a smaller value. Must be an integer value greater than 0. | ||
| * @param colsPerBlock The number of columns of each block. The blocks at the right edge may have | ||
| * a smaller value. Must be an integer value greater than 0. | ||
| * @return a [[BlockMatrix]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a @Since annotation like toBlockMatrix. Although I doubt this can make in 2.2.0, you can set it to 2.2.0 temporarily. If there is a suggested version from committers, we can change it later.
|
ok to test. |
|
oh, seems only the committers can trigger jenkins test. cc @jkbradley @MLnick |
|
ok to test |
|
Test build #75475 has finished for PR 17459 at commit
|
| toCoordinateMatrix().toBlockMatrix(rowsPerBlock, colsPerBlock) | ||
| } | ||
|
|
||
| /** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The style of the comments here and below is not correct. Can you fix it?
|
|
||
| ir.vector.toArray | ||
| .grouped(colsPerBlock) | ||
| .zipWithIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style: where you are writing ({ ... }) just write { ... }
| ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values)) | ||
| }) | ||
| }).groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rows.getNumPartitions)).map({ | ||
| case ((blockRow, blockColumn), itr) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually don't put a type on vals/vars unless it's important for clarity or needed for a cast
|
Thanks for the feedback guys. All comments addressed, though if anyone else has feedback on discussion me and @viirya are having about whether there should be a separate |
|
Test build #75482 has finished for PR 17459 at commit
|
|
Did some thinking about this, and I think that to make the API cleaner maybe we could deprecate the regular I think this explicitness is important since it seems a lot of users create a If we don't want to go the deprecation route, we could have |
|
I've done a bit prototype locally to generalize this change to Actually we can easily have only one From the external view of this API, we don't have an explicit difference between |
|
@viirya I think we definitely care about giving users the ability to make either dense or sparse Block matrices. I made a 100k by 10k IndexedRowMatrix of random doubles, then converted it to a BlockMatrix to multiply it by its transpose. With the current toBlockMatrix implementation, that took 252 seconds on 128 cores. With my implementation, that took 35 seconds. The backing of a BlockMatrix matters a lot, and we need to let users be explicit about it. I considered having toBlockMatrix check if the rows of |
I don't mean we don't care about it. I meant whether a Thus, we can have a For a About the promise that |
|
Alright, I agree with this. We'll switch off Dense or Sparse matrix backings based on what the type of the first vector in the iterator is. I'd be happy to take on making these adjustments. |
|
@johnc1231 The prototype I did: https://github.com/apache/spark/compare/master...viirya:general-toblockmatrix?expand=1 Maybe you can take a look and see if it is useful to you. |
|
I think your prototype looks good. I'm pretty much just gonna do exactly that then. |
|
@viirya I made changes exactly as you did in your prototype, plus a few style edits. But yeah, I think this is a good, easy to use implementation that will be better in all use cases than current implementation. |
|
Test build #75518 has finished for PR 17459 at commit
|
|
Test build #76107 has finished for PR 17459 at commit
|
|
@viirya I fixed the test as you asked, so please take a look when you get a chance. I'm having a little bit of trouble with my local spark build for some reason, but I'll do that other benchmark when it's resolved. |
|
@viirya Any more feedback on this? |
|
@johnc1231 Thanks for updating this. I'll review it in the weekend. |
| idxRowMatDense.toBlockMatrix(2, 0) | ||
| } | ||
|
|
||
| assert(blockMat.blocks.map { case (_, matrix: Matrix) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the style looks weird.
Maybe:
assert(blockMat.blocks.map { case (_, matrix: Matrix) =>
matrix.isInstanceOf[DenseMatrix]
}.reduce(_ && _))
|
|
||
| assert(blockMat.blocks.map { case (_, matrix: Matrix) => | ||
| matrix.isInstanceOf[DenseMatrix]}.reduce(_ && _)) | ||
| assert(blockMat2.blocks.map { case (_, matrix: Matrix) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: styling like suggested above.
| matrix.isInstanceOf[SparseMatrix]}.reduce(_ && _)) | ||
| assert(blockMat2.blocks.map { case (_, matrix: Matrix) => | ||
| matrix.isInstanceOf[SparseMatrix]}.reduce(_ && _)) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: styling like above suggested.
|
Except for few comments regarding style, the code changes LGTM. And it'd be good if we can have benchmark for sparse case too. cc @MLnick @jkbradley for review. |
|
Did a sparse benchmark (2014 Macbook Pro with 2.2Hz i7) 60 partitions, 10k by 10k matrix with mostly 0's, 10% 1's, made of SparseVectors. Both old method and new method took about 7.5 seconds. |
|
@viirya Addressed style nitpicks and did sparse benchmarks. Think that should be everything. |
|
Test build #76504 has finished for PR 17459 at commit
|
|
@viirya Now that style nitpicks and sparse benchmarks are done, are you good with this? Also, per your recommendation, CCing @MLnick and @jkbradley for review of this. Should be easy to review, since we've iterated on it a lot. |
|
Also, made changes suggested by @srowen . Don't know if he now has to sign off on those changes being done. |
|
This has been reviewed pretty thoroughly at this point. Can a committer give this a quick look? @srowen @MLnick @jkbradley I think it's basically ready to go in. |
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking pretty good, though I had a few questions on returning to look again
| } | ||
| } | ||
| val denseMatrix = new DenseMatrix(actualNumRows, actualNumColumns, matrixAsArray) | ||
| val finalMatrix = if (countForValues / arraySize.toDouble >= 0.5) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BlockMatrix seems to use sparse representations when <= 10% of values are non-zero when converting to an indexed row matrix. Maybe go with that?
|
|
||
| val m = numRows() | ||
| val n = numCols() | ||
| val lastRowBlockIndex = m / rowsPerBlock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is the index of the final, smaller block, if any? I get it, but if m = 100 and n = 10 then this is 10, which is not the index of the last row block. There is no leftover smaller block and the last one is 9. I think the code works and I'm splitting hairs but wonder if this is clearer if it's the "remainder" block index or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Replaced word "last" with "remainder" and added a small clarifying comment.
| ir.vector match { | ||
| case SparseVector(size, indices, values) => | ||
| indices.zip(values).map { case (index, value) => | ||
| val blockColumn = index / colsPerBlock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's an assumption here that the block index can't be larger than an Int, but it could, right? conceptually the index in an IndexedRow could be huge. Does blockRow need to stay a Long or am I overlooking why it won't happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it is true that IndexedRowMatrix could have a Long number of rows, but BlockMatrix is backed by an RDD of ((Int, Int), Matrix), so we're limited by that. I can just add a check that computes whether it's possible to make a BlockMatrix from the given IndexedRowMatrix.
| val finalMatrix = if (countForValues / arraySize.toDouble >= 0.5) { | ||
| denseMatrix | ||
| } else { | ||
| denseMatrix.toSparse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this isn't inefficient because making the dense matrix doesn't copy or anything. Seems OK
|
@srowen Addressed comments, back to you. And thanks for taking time to look this over. |
| s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock") | ||
|
|
||
| // Since block matrices require an integer row index | ||
| require(numRows() / rowsPerBlock < Int.MaxValue, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: isn't <= OK too? very much a corner case, but hey
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, that is true. Will change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it's true if I do floating point division. It's not necessarily true if it's long / int division.
| assert(blockMat2.numCols() === n) | ||
| assert(blockMat2.toBreeze() === idxRowMatSparse.toBreeze()) | ||
|
|
||
| assert(blockMat.blocks.map { case (_, matrix: Matrix) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this just blockMat.blocks.forall { case (_, matrix) => matrix.isInstanceOf[SparseMatrix] }?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure there is no forall on RDD's, which is why I wrote it this way. Could do it as collect().forall I suppose.
|
@srowen Fixed both, back to you |
|
Test build #77437 has finished for PR 17459 at commit
|
|
Test build #77439 has finished for PR 17459 at commit
|
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from the comment above, I reviewed the logic again and it looks good. CC @viirya
| s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock") | ||
|
|
||
| // Since block matrices require an integer row index | ||
| require(numRows() / rowsPerBlock.toDouble <= Int.MaxValue, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess the previous toBlockMatrix would have failed too when the number of rows exceeded this threshold? it looks like it, given how CoordinateMatrix.toBlockMatrix works. Hm, I wonder if you should even put this warning over there too because it will fail mysteriously otherwise. The result might even be wrong.
BTW on second look, I realize this check isn't quite the same as the math that's performed below: math.ceil(m.toDouble / rowsPerBlock).toInt. I think you want to check exactly the same thing. Maybe move the check below the declaration of m, n, and just say: require(math.ceil(m.toDouble / rowsPerBlock) <= Int.MaxValue) That's very clear.
Also, cols need to be checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 We should fix CoordinateMatrix.toBlockMatrix too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For cols, we may not need to do this check. Because each IndexedRow can only have the number of columns less than Int.MaxValue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, even with block size one IndexedRows are limited by length of array which is itself limited by max int, so should be fine.
|
|
||
| ((blockRow, blockColumn), finalMatrix) | ||
| } | ||
| new BlockMatrix(blocks, rowsPerBlock, colsPerBlock, this.numRows(), this.numCols()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can the last two args simply be m, n for clarity?
|
Test build #77606 has finished for PR 17459 at commit
|
|
Merged to master |
What changes were proposed in this pull request?
I added the methodtoBlockMatrixDenseto the IndexedRowMatrix class. The current implementation oftoBlockMatrixis insufficient for users with relatively dense IndexedRowMatrix objects, since it assumes sparsity.EDIT: Ended up deciding that there should be just a single
toBlockMatrixmethod, which creates a BlockMatrix whose blocks may be dense or sparse depending on the sparsity of the rows. This method will work better on any current use case oftoBlockMatrixand doesn't go throughCoordinateMatrixlike the old method.How was this patch tested?
I used the same tests already written fortoBlockMatrix()to test this method. I also added a new additional unit test for an edge case that was not adequately tested by current test suite.I ran the original
IndexedRowMatrixtests, plus wrote more to better handle edge cases ignored by original tests.