[SPARK-10599][MLLIB] Lower communication for block matrix multiplication #8757
Conversation
|
Test build #42448 has finished for PR 8757 at commit
|
|
@brkyvz Thank you for notifying me. I would be interested to benchmark this PR. Should I use the same code from the mailing list? It can be found here as well https://github.com/avulanov/blockmatrix-benchmark |
|
@avulanov Feel free to benchmark it in any way you like; the same code is also useful. I'm interested in how it would scale and how it would perform if the matrix is fully dense. I'm doing some benchmarks of my own; it would be nice to have some sanity checks. |
|
More results. Tests were performed on 4 executors, each with 30 GB RAM and 4 cores:
Note that for the old implementation, the cluster ran out of disk space for sparsity=0.2 |
|
Thank you for the update. Indeed, the tests now finish in finite time. Let's add @mengxr to the discussion. Distributed matrix multiplication makes sense when it is faster than doing it on a single node. Let's assume that we have square blocks, and
I've done a benchmark of single-node multiplication: for example, it takes 0.04s to multiply two 1000x1000 matrices and 16.55s for 10000x10000 with OpenBLAS on 2x Xeon X5650 @ 2.67GHz. More results are here: https://github.com/avulanov/scala-blas. For the following distributed experiment, I am using 6 nodes with the same CPU: 5 workers and 1 master. Block-diagonal matrix multiplication:
For some reason, the distributed operations are slower than the single-node estimate, even though they parallelize well. Do you know the reason for that? Column and row matrix multiplication:
The distributed operations become faster than single-node as the columnar matrix grows. The test did not finish for a block size of 10000 because of an "Out of free space" exception, even though I used an 18GB tmpfs as both spark.local.dir and the Java tmp dir. The shuffle seems to be really large. Should it be this big? Link to the tests: https://github.com/avulanov/blockmatrix-benchmark/blob/master/src/blockmatrix.scala |
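One plausible reason for the large shuffle: a cogroup-style block multiply replicates every block of A once per column of result blocks (and every block of B once per row of result blocks), whereas each block only needs to reach the distinct partitions that actually hold those result blocks. A rough sketch of that count in plain Scala (the toy partitioner below is illustrative, not Spark's actual GridPartitioner code):

```scala
// Hypothetical sketch: count the distinct destination partitions that a
// single block of A must reach, under a toy grid partitioner. This is an
// illustration of the communication pattern, not Spark's implementation.
object DestinationSketch {
  // Toy grid partitioner: result block (i, k) lives on this partition id.
  def partitionOf(i: Int, k: Int, rowsPerPart: Int, colsPerPart: Int, partCols: Int): Int =
    (i / rowsPerPart) * partCols + (k / colsPerPart)

  // A's block (i, j) contributes to all result blocks (i, k) for
  // k in 0 until numColBlocksB. A cogroup-based approach ships one copy per
  // result block; this returns the smaller set of distinct partitions.
  def destinations(i: Int, numColBlocksB: Int,
                   rowsPerPart: Int, colsPerPart: Int, partCols: Int): Set[Int] =
    (0 until numColBlocksB)
      .map(k => partitionOf(i, k, rowsPerPart, colsPerPart, partCols))
      .toSet
}
```

With 8 result-block columns and 2 block columns per partition, a block of A needs to reach only 4 partitions instead of being copied 8 times, which is the kind of saving this PR is after.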
|
Making a pass now |
Update documentation
|
Minor comments only. Other than that, it looks fine to me. @avulanov In your "Block-diagonal matrix multiplication" tests, do you know if data were shuffled during the multiplications? I'm wondering if Spark/BlockMatrix properly avoided shuffling the data. |
|
Test build #1864 has finished for PR 8757 at commit
|
|
@brkyvz Can you please add "[MLLIB]" to the PR title? |
|
@jkbradley Thank you for the review. Addressed your comments |
|
Test build #43726 has finished for PR 8757 at commit
|
|
retest this please |
|
Test build #43727 has finished for PR 8757 at commit
|
|
@jkbradley According to the time taken, it actually did the shuffle. However, I am not sure how useful these block-diagonal matrices are in practice. |
|
This LGTM. I'll merge it into master. Thanks for the PR! @avulanov I looked at your code, but the results seem strange to me; we'll have to look into it more, I guess. As for the utility of block-diagonal matrices, I've mainly seen them in the context of specialized applications with very structured feature interactions, but my experience there is from research, not industry. |

This PR aims to decrease communication costs in BlockMatrix multiplication in two ways:
NOTE: One important caveat is that the old behavior of checking for multiple blocks with the same index is lost in this version. It is not hard to add back, but it is a little more expensive than it used to be.
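For reference, the duplicate-index check mentioned here amounts to verifying that no two blocks share the same (rowIndex, colIndex) key. A minimal plain-Scala sketch of that logic (names are illustrative, not MLlib's actual code):

```scala
// Hypothetical sketch of the duplicate-block check: return any block
// indices that occur more than once. An empty result means the blocks
// form a valid layout with one block per grid cell.
object DuplicateCheckSketch {
  def duplicateIndices(indices: Seq[(Int, Int)]): Set[(Int, Int)] =
    indices.groupBy(identity)
      .collect { case (idx, occurrences) if occurrences.size > 1 => idx }
      .toSet
}
```

In a distributed setting this grouping is itself a shuffle over the block keys, which is why the check costs more under the new scheme.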
Initial benchmarking showed promising results (see below); however, I did hit some
FileNotFoundException errors with the new implementation after the shuffle.
Size A: 1e5 x 1e5
Size B: 1e5 x 1e5
Block Sizes: 1024 x 1024
Sparsity: 0.01
Old implementation: 1m 13s
New implementation: 9s
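For context, the operation being benchmarked is the block-wise product C(i, k) = sum over j of A(i, j) * B(j, k); the distributed version differs mainly in where each per-block GEMM runs and how blocks are shipped between partitions. A hypothetical single-machine sketch in plain Scala, assuming square blocks of equal size (illustrative names, not MLlib code):

```scala
// Hypothetical local sketch of the block-wise product that
// BlockMatrix.multiply computes in a distributed fashion.
// Assumes all blocks are square and of equal size n x n.
object BlockMultiplySketch {
  type Block = Array[Array[Double]]

  // Dense product of two n x n blocks (naive GEMM).
  def gemm(a: Block, b: Block): Block = {
    val n = a.length
    val c = Array.fill(n, n)(0.0)
    for (i <- 0 until n; j <- 0 until n; k <- 0 until n)
      c(i)(k) += a(i)(j) * b(j)(k)
    c
  }

  // In-place accumulation: c += d.
  def addInPlace(c: Block, d: Block): Unit =
    for (i <- c.indices; j <- c(i).indices) c(i)(j) += d(i)(j)

  // Multiply two grids of blocks: blocksA(i)(j) x blocksB(j)(k),
  // accumulating each C(i, k) over the shared index j.
  def multiply(blocksA: Array[Array[Block]],
               blocksB: Array[Array[Block]]): Array[Array[Block]] = {
    val n    = blocksA.head.head.length
    val rows = blocksA.length
    val mid  = blocksB.length
    val cols = blocksB.head.length
    val out  = Array.fill(rows, cols)(Array.fill(n, n)(0.0))
    for (i <- 0 until rows; k <- 0 until cols; j <- 0 until mid)
      addInPlace(out(i)(k), gemm(blocksA(i)(j), blocksB(j)(k)))
    out
  }
}
```

The distributed cost is dominated by shipping the A and B blocks to wherever each gemm call runs, which is exactly the communication this PR reduces.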
cc @avulanov Would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?). The old implementation didn't even run, but the new implementation completed in 268s on a cluster with 120 GB of memory and 16 cores.