@brkyvz
Contributor

@brkyvz brkyvz commented Sep 14, 2015

This PR aims to decrease communication costs in BlockMatrix multiplication in two ways:

  • Simulate the multiplication on the driver to figure out which blocks actually need to be shuffled (a sketch of this idea follows the note below)
  • Send each block to a partition only once, and join inside the partition, rather than sending multiple copies of the same block to the same partition

NOTE: With this change, the old behavior of checking for multiple blocks with the same index is lost. It is not hard to add back, but doing so is a little more expensive than before.
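To make the first idea concrete, here is a minimal sketch (illustrative only, not the code in this patch; `simulateMultiply` and `partitioner` are hypothetical names): given just the block coordinates of A and B and a partitioner for the output blocks, the driver can compute which partitions each block of A must reach, and ship only those.

```scala
// Illustrative sketch, not the patch code. Given the non-empty block
// coordinates of A and B and a partitioner mapping an output block (i, j)
// of C = A * B to a partition id, compute which partitions each block of A
// must be shipped to. Blocks of B are handled symmetrically.
def simulateMultiply(
    aBlockIndices: Seq[(Int, Int)],           // (rowBlock, colBlock) of A
    bBlockIndices: Seq[(Int, Int)],           // (rowBlock, colBlock) of B
    partitioner: (Int, Int) => Int): Map[(Int, Int), Set[Int]] = {
  // For each row-block index k of B, the set of column-block indices j
  // for which a block B(k, j) actually exists.
  val bColsByRowBlock: Map[Int, Set[Int]] =
    bBlockIndices.groupBy(_._1).mapValues(_.map(_._2).toSet).toMap
  aBlockIndices.map { case (i, k) =>
    // A(i, k) contributes to C(i, j) only where B(k, j) exists.
    val destinations =
      bColsByRowBlock.getOrElse(k, Set.empty[Int]).map(j => partitioner(i, j))
    (i, k) -> destinations
  }.toMap
}
```

A block whose destination set is empty never needs to leave its partition at all, which is where the communication savings come from.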

Initial benchmarking showed promising results (see below); however, I did hit some FileNotFound exceptions with the new implementation after the shuffle.

Size A: 1e5 x 1e5
Size B: 1e5 x 1e5
Block Sizes: 1024 x 1024
Sparsity: 0.01
Old implementation: 1m 13s
New implementation: 9s

cc @avulanov Would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?). The old implementation didn't even complete, but the new implementation finished in 268s on a 120 GB / 16-core cluster.

@SparkQA

SparkQA commented Sep 14, 2015

Test build #42448 has finished for PR 8757 at commit 8dac58f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@avulanov
Contributor

@brkyvz Thank you for notifying me. I would be interested in benchmarking this PR. Should I use the same code from the mailing list? It can also be found here: https://github.com/avulanov/blockmatrix-benchmark

@brkyvz
Contributor Author

brkyvz commented Sep 14, 2015

@avulanov Feel free to benchmark it in any way you like; the same code would also be useful. I'm interested in how it scales, and how it performs when the matrix is fully dense. I'm running some benchmarks of my own, so it would be nice to have some sanity checks.

@brkyvz
Contributor Author

brkyvz commented Sep 15, 2015

More results. Tests were performed on 4 executors, each with 30 GB RAM and 4 cores:
Code: https://github.com/brkyvz/git-rest-api/blob/c530778f3e6df6378a1a1d6495c5d52d6d590410/notebooks/block%20matrix%20benchmarking.scala

  • Size of A: 1e5 x 1e5
  • Size of B: 1e5 x 1e5
  • Block Size: 1024
  • Number of partitions: 128

Note that for the old implementation, the cluster ran out of disk space at sparsity = 0.2.
[Figure: runtime vs. sparsity]

@avulanov
Contributor

Thank you for the update. Indeed, the tests now take finite time to finish. Let's add @mengxr to the discussion.

Distributed matrix multiplication makes sense when it is faster than doing it on a single node. Let's assume that we have square blocks, and that one block-by-block multiplication takes time T_block on a single machine. I prepared two tests:

  • Block-diagonal matrix multiplication (M * M), where M is N x N blocks. Single-machine multiplication time will be N * T_block. The optimal distributed time would be T_block if the number of nodes is <= N. This seems to be embarrassingly parallel (see the construction sketch after this list).
  • Columnar and row matrix multiplication (M * M^T), where M has 1 column block and N row blocks. Single-machine multiplication time will be N * N * T_block.
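For readers following along, a minimal sketch of how the block-diagonal test matrix can be built with MLlib's `BlockMatrix` (the helper name and sizes are illustrative; the actual test code is in the benchmark repo linked below):

```scala
import java.util.Random
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Block-diagonal test matrix: n dense blocks on the diagonal; blocks that
// are absent from the RDD are treated as zero blocks by BlockMatrix.
def blockDiagonal(sc: SparkContext, n: Int, blockSize: Int): BlockMatrix = {
  val blocks = sc.parallelize(0 until n).map { i =>
    ((i, i), DenseMatrix.rand(blockSize, blockSize, new Random(i)): Matrix)
  }
  new BlockMatrix(blocks, blockSize, blockSize)
}

// M * M only ever pairs diagonal block (i, i) with itself, so ideally almost
// no block data has to cross partition boundaries:
// val product = blockDiagonal(sc, 5, 1000).multiply(blockDiagonal(sc, 5, 1000))
```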

I've done a benchmark for single-node multiplication; for example, it takes 0.04s to multiply a 1000x1000 matrix and 16.55s for a 10000x10000 matrix with OpenBLAS on 2x Xeon X5650 @ 2.67GHz. More results are here: https://github.com/avulanov/scala-blas

For the following distributed experiments, I am using 6 nodes with the same CPU: 5 workers and 1 master.

Block-diagonal matrix multiplication:

| Size | Block | Time (s) | Est. single-node time (s) |
| --- | --- | --- | --- |
| 1000x1000 | 1000 | 0.539322901 | 0.04 |
| 2000x2000 | 1000 | 0.594227124 | 0.08 |
| 3000x3000 | 1000 | 0.541293169 | 0.12 |
| 4000x4000 | 1000 | 0.520753395 | 0.16 |
| 5000x5000 | 1000 | 0.702532957 | 0.2 |

| Size | Block | Time (s) | Est. single-node time (s) |
| --- | --- | --- | --- |
| 10000x10000 | 10000 | 27.565218631 | 16.55 |
| 20000x20000 | 10000 | 28.363953039 | 33.1 |
| 30000x30000 | 10000 | 114.133834717 | 49.65 |
| 40000x40000 | 10000 | 117.701914787 | 66.2 |
| 50000x50000 | 10000 | 141.827804904 | 82.75 |

For some reason, the distributed operations are slower than the single-node estimate, even though they should parallelize well. Do you know the reason for that?

Columnar and row matrix multiplication:

| Size | Block | Time (s) | Est. single-node time (s) |
| --- | --- | --- | --- |
| 1000x1000 | 1000 | 0.281162649 | 0.04 |
| 2000x1000 | 1000 | 0.461582522 | 0.16 |
| 3000x1000 | 1000 | 0.520122422 | 0.36 |
| 4000x1000 | 1000 | 0.560923767 | 0.64 |
| 5000x1000 | 1000 | 0.887406721 | 1 |

Distributed operations become faster than single-node as the columnar matrix gets bigger. The test did not finish for a block size of 10000 because of an out-of-free-space exception, even though I used an 18 GB tmpfs as both spark.local.dir and the Java tmp dir. It seems that the shuffle is really huge. Should it be that big? (A sketch of this test shape follows.)
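To make the shuffle question concrete, here is a sketch of the columnar test shape under the same illustrative assumptions as the block-diagonal sketch above: M has N row blocks in a single block column, so M * M^T produces all N x N output blocks, and every pair of row blocks has to meet on some partition, which is what drives the shuffle volume.

```scala
import java.util.Random
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Columnar test matrix: M is (n * blockSize) x blockSize, stored as n row
// blocks in a single block column.
def columnar(sc: SparkContext, n: Int, blockSize: Int): BlockMatrix = {
  val blocks = sc.parallelize(0 until n).map { i =>
    ((i, 0), DenseMatrix.rand(blockSize, blockSize, new Random(i)): Matrix)
  }
  new BlockMatrix(blocks, blockSize, blockSize)
}

// C = M * M^T has n x n blocks, and block (i, j) of C needs both row block i
// and row block j, so each input block is shipped to up to n destinations:
// val gram = columnar(sc, 5, 1000).multiply(columnar(sc, 5, 1000).transpose)
```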

Link to the tests: https://github.com/avulanov/blockmatrix-benchmark/blob/master/src/blockmatrix.scala

@jkbradley
Member

Making a pass now

Member

Update documentation

@jkbradley
Member

Minor comments only. Other than that, it looks fine to me.

@avulanov In your "Block-diagonal matrix multiplication" tests, do you know whether data were shuffled during the multiplications? I'm wondering whether Spark/BlockMatrix properly avoided shuffling the data.

@SparkQA

SparkQA commented Oct 9, 2015

Test build #1864 has finished for PR 8757 at commit ae98edc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

@brkyvz Can you please add "[MLLIB]" to the PR title?

@brkyvz brkyvz changed the title [SPARK-10599] Lower communication for block matrix multiplication [SPARK-10599][MLLIB] Lower communication for block matrix multiplication Oct 14, 2015
@brkyvz
Contributor Author

brkyvz commented Oct 14, 2015

@jkbradley Thank you for the review. I've addressed your comments.

@SparkQA

SparkQA commented Oct 14, 2015

Test build #43726 has finished for PR 8757 at commit 19c4b13.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Contributor Author

brkyvz commented Oct 14, 2015

retest this please

@SparkQA

SparkQA commented Oct 14, 2015

Test build #43727 has finished for PR 8757 at commit 19c4b13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@avulanov
Contributor

@jkbradley According to the time taken, it actually did the shuffle. However, I am not sure how useful these block-diagonal matrices are in practice.

@jkbradley
Member

This LGTM. I'll merge it with master. Thanks for the PR!

@avulanov I looked at your code, but the results seem strange to me. We'll have to look into it more, I guess. As for the utility of block-diagonal matrices, I've mainly seen them in the context of specialized applications with very structured feature interactions, but my experience there is from research, not industry.

@asfgit asfgit closed this in 10046ea Oct 16, 2015
@brkyvz brkyvz deleted the opt-bmm branch February 3, 2019 20:55