[SPARK-10599][MLLIB] Lower communication for block matrix multiplication #8757
Conversation
|
Test build #42448 has finished for PR 8757 at commit
|
|
@brkyvz Thank you for notifying me. I would be interested to benchmark this PR. Should I use the same code from the mailing list? It can be found here as well https://github.com/avulanov/blockmatrix-benchmark |
|
@avulanov Feel free to benchmark it in any way you like; the same code is also useful. I'm interested in how it would scale and how it would perform if the matrix is fully dense. I'm doing some benchmarks of my own; it would be nice to have some sanity checks. |
|
More results. Tests were performed on 4 executors, each with 30 GB RAM and 4 cores:
Note that for the old implementation, the cluster ran out of disk space for sparsity=0.2 |
|
Thank you for the update. Indeed, the tests now finish in finite time. Let's add @mengxr to the discussion. Distributed matrix multiplication makes sense when it is faster than doing it on a single node. Let's assume that we have square blocks, and
I've done a benchmark of single-node multiplication: for example, it takes 0.04s to multiply two 1000x1000 matrices and 16.55s for 10000x10000 with OpenBLAS on 2x Xeon X5650 @ 2.67GHz. More results are here: https://github.com/avulanov/scala-blas. For the following distributed experiment, I am using 6 nodes with the same CPU: 5 workers and 1 master. Block-diagonal matrix multiplication:
For some reason, the distributed operations are slower than the single-node estimate, even though they parallelize well. Do you know the reason for that? Column and row matrix multiplication:
The distributed operations become faster than single-node as the columnar matrix grows. The test did not finish for a block size of 10000 because of an "Out of free space" exception, even though I used an 18GB tmpfs as both spark.local.dir and the Java tmp dir. The shuffle seems to be really large. Should it be this big? Link to the tests: https://github.com/avulanov/blockmatrix-benchmark/blob/master/src/blockmatrix.scala |
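One plausible reason for the large shuffle: a cogroup-style block multiply replicates every block of A once per column of result blocks (and every block of B once per row of result blocks), whereas each block only needs to reach the distinct partitions that actually hold those result blocks. A rough sketch of that count in plain Scala (the toy partitioner below is illustrative, not Spark's actual GridPartitioner code):

```scala
// Hypothetical sketch: count the distinct destination partitions that a
// single block of A must reach, under a toy grid partitioner. This is an
// illustration of the communication pattern, not Spark's implementation.
object DestinationSketch {
  // Toy grid partitioner: result block (i, k) lives on this partition id.
  def partitionOf(i: Int, k: Int, rowsPerPart: Int, colsPerPart: Int, partCols: Int): Int =
    (i / rowsPerPart) * partCols + (k / colsPerPart)

  // A's block (i, j) contributes to all result blocks (i, k) for
  // k in 0 until numColBlocksB. A cogroup-based approach ships one copy per
  // result block; this returns the smaller set of distinct partitions.
  def destinations(i: Int, numColBlocksB: Int,
                   rowsPerPart: Int, colsPerPart: Int, partCols: Int): Set[Int] =
    (0 until numColBlocksB)
      .map(k => partitionOf(i, k, rowsPerPart, colsPerPart, partCols))
      .toSet
}
```

With 8 result-block columns and 2 block columns per partition, a block of A needs to reach only 4 partitions instead of being copied 8 times, which is the kind of saving this PR is after.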
|
Making a pass now |
Update documentation
|
Minor comments only. Other than that, it looks fine to me. @avulanov In your "Block-diagonal matrix multiplication" tests, do you know if data were shuffled during the multiplications? I'm wondering if Spark/BlockMatrix properly avoided shuffling the data. |
|
Test build #1864 has finished for PR 8757 at commit
|
|
@brkyvz Can you please add "[MLLIB]" to the PR title? |
|
@jkbradley Thank you for the review. Addressed your comments |
|
Test build #43726 has finished for PR 8757 at commit
|
|
retest this please |
|
Test build #43727 has finished for PR 8757 at commit
|
|
@jkbradley According to the time taken, it actually did the shuffle. However, I am not sure how useful these block-diagonal matrices are in practice. |
|
This LGTM. I'll merge it into master. Thanks for the PR! @avulanov I looked at your code, but the results seem strange to me; we'll have to look into it more, I guess. As for the utility of block-diagonal matrices, I've mainly seen them in the context of specialized applications with very structured feature interactions, but my experience there is from research, not industry. |

This PR aims to decrease communication costs in BlockMatrix multiplication in two ways:
NOTE: One important caveat is that the old behavior of checking for multiple blocks with the same index is lost in this version. It is not hard to add back, but it is a little more expensive than it used to be.
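For reference, the duplicate-index check mentioned here amounts to verifying that no two blocks share the same (rowIndex, colIndex) key. A minimal plain-Scala sketch of that logic (names are illustrative, not MLlib's actual code):

```scala
// Hypothetical sketch of the duplicate-block check: return any block
// indices that occur more than once. An empty result means the blocks
// form a valid layout with one block per grid cell.
object DuplicateCheckSketch {
  def duplicateIndices(indices: Seq[(Int, Int)]): Set[(Int, Int)] =
    indices.groupBy(identity)
      .collect { case (idx, occurrences) if occurrences.size > 1 => idx }
      .toSet
}
```

In a distributed setting this grouping is itself a shuffle over the block keys, which is why the check costs more under the new scheme.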
Initial benchmarking showed promising results (see below); however, I did hit some
FileNotFoundException errors with the new implementation after the shuffle.
Size A: 1e5 x 1e5
Size B: 1e5 x 1e5
Block Sizes: 1024 x 1024
Sparsity: 0.01
Old implementation: 1m 13s
New implementation: 9s
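For context, the operation being benchmarked is the block-wise product C(i, k) = sum over j of A(i, j) * B(j, k); the distributed version differs mainly in where each per-block GEMM runs and how blocks are shipped between partitions. A hypothetical single-machine sketch in plain Scala, assuming square blocks of equal size (illustrative names, not MLlib code):

```scala
// Hypothetical local sketch of the block-wise product that
// BlockMatrix.multiply computes in a distributed fashion.
// Assumes all blocks are square and of equal size n x n.
object BlockMultiplySketch {
  type Block = Array[Array[Double]]

  // Dense product of two n x n blocks (naive GEMM).
  def gemm(a: Block, b: Block): Block = {
    val n = a.length
    val c = Array.fill(n, n)(0.0)
    for (i <- 0 until n; j <- 0 until n; k <- 0 until n)
      c(i)(k) += a(i)(j) * b(j)(k)
    c
  }

  // In-place accumulation: c += d.
  def addInPlace(c: Block, d: Block): Unit =
    for (i <- c.indices; j <- c(i).indices) c(i)(j) += d(i)(j)

  // Multiply two grids of blocks: blocksA(i)(j) x blocksB(j)(k),
  // accumulating each C(i, k) over the shared index j.
  def multiply(blocksA: Array[Array[Block]],
               blocksB: Array[Array[Block]]): Array[Array[Block]] = {
    val n    = blocksA.head.head.length
    val rows = blocksA.length
    val mid  = blocksB.length
    val cols = blocksB.head.length
    val out  = Array.fill(rows, cols)(Array.fill(n, n)(0.0))
    for (i <- 0 until rows; k <- 0 until cols; j <- 0 until mid)
      addInPlace(out(i)(k), gemm(blocksA(i)(j), blocksB(j)(k)))
    out
  }
}
```

The distributed cost is dominated by shipping the A and B blocks to wherever each gemm call runs, which is exactly the communication this PR reduces.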
cc @avulanov Would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?). The old implementation didn't even run, but the new implementation completed in 268s on a cluster with 120 GB of memory and 16 cores.