
Conversation

@liyezhang556520
Contributor

What changes were proposed in this pull request?

When netty transfers data that is not a FileRegion, the data is handled as a ByteBuf. If the data is large, this causes a significant performance problem: sun.nio.ch.IOUtil.write makes an internal memory copy of the buffer on every write, so the CPU sits at 100% while network throughput stays very low.

In this PR, if the data is large, we split it into small chunks when calling WritableByteChannel.write(), which avoids the wasted memory copies. Since such data cannot be written in a single write anyway, netty will call transferTo multiple times regardless.

How was this patch tested?

Spark unit tests and a manual test.
Manual test:
sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length

For more details, please refer to SPARK-14290
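
To make the idea concrete, here is a minimal sketch of the chunked write, assuming a helper shaped like MessageWithHeader.copyByteBuf (the class name, the exact limit, and the surrounding code are illustrative, not the actual diff; see the review below for how the limit was chosen):

    import io.netty.buffer.ByteBuf;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.WritableByteChannel;

    // Sketch: expose at most NIO_BUFFER_LIMIT bytes of the heap ByteBuf per write,
    // so the temporary direct-buffer copy made inside sun.nio.ch.IOUtil.write stays
    // small. Netty's transferTo loop keeps calling back until the message is done.
    class ChunkedWriteSketch {
      private static final int NIO_BUFFER_LIMIT = 256 * 1024; // value debated below

      static int copyByteBuf(ByteBuf buf, WritableByteChannel target) throws IOException {
        int length = Math.min(buf.readableBytes(), NIO_BUFFER_LIMIT);
        ByteBuffer chunk = buf.nioBuffer(buf.readerIndex(), length);
        int written = target.write(chunk);
        buf.skipBytes(written);  // advance past whatever the channel accepted
        return written;
      }
    }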

* The size should not be too large, since too big a value wastes the underlying memory copy.
* e.g. if the network's available buffer is smaller than this limit, the data cannot be sent
* within a single write operation, yet a copy of this full size is still made each time.
*/
Contributor Author

I set this limit to 512K because, in my tests, each WritableByteChannel.write() can successfully write about 600KB ~ 1.5MB of data. This size needs to be validated with more tests by someone else.

Member

Is it possible to know the accurate number? I guess not, because it's OS dependent and may be changed via OS settings.

However, I saw that Hadoop uses private static int NIO_BUFFER_LIMIT = 8*1024; //should not be more than 64KB.

Contributor

I'm also a little worried that 512k might be a bit too much. On my machine, /proc/sys/net/core/wmem_default is around 200k, which (I assume) means you'd be copying about half of the buffer needlessly here.

Instead, how about using a more conservative value (like Hadoop's), and looping in copyByteBuf until you either write the whole source buffer or get a short write?
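
A rough sketch of what that loop might look like (a hypothetical helper, not code from the PR; chunkSize stands for the conservative limit suggested above):

    import io.netty.buffer.ByteBuf;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.WritableByteChannel;

    // Hypothetical loop: keep writing small slices of the source ByteBuf until it
    // is drained, or the channel accepts fewer bytes than offered (a short write),
    // in which case we stop and let netty retry on the next transferTo call.
    class LoopUntilShortWriteSketch {
      static long write(ByteBuf buf, WritableByteChannel target, int chunkSize)
          throws IOException {
        long total = 0;
        while (buf.readableBytes() > 0) {
          int length = Math.min(buf.readableBytes(), chunkSize);
          ByteBuffer chunk = buf.nioBuffer(buf.readerIndex(), length);
          int written = target.write(chunk);
          buf.skipBytes(written);
          total += written;
          if (written < length) {
            break; // short write: the socket send buffer is full
          }
        }
        return total;
      }
    }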

Member

I think too small a value will waste a lot of system calls. Our use case is different from Hadoop's: here we may send large messages.

Member

What if we create a DirectByteBuffer here manually for a big buf (big enough that we still benefit even though creating a direct buffer is slow) and try to write as much as possible? Then we can avoid the memory copy in IOUtil.write.
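
Roughly, that idea would look something like the fragment below (illustrative only, not part of this PR): copy the heap ByteBuf into a manually allocated direct ByteBuffer once, then write from it, so sun.nio.ch.IOUtil.write skips its internal copy. Real code would have to keep the direct buffer (and any unwritten bytes) across transferTo calls, which is where the pooling discussion below comes in.

    import io.netty.buffer.ByteBuf;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.WritableByteChannel;

    // Illustrative only: one explicit heap -> direct copy, then writes go straight
    // from the direct buffer, so IOUtil.write does not copy again.
    class DirectBufferWriteSketch {
      static long write(ByteBuf buf, WritableByteChannel target) throws IOException {
        ByteBuffer direct = ByteBuffer.allocateDirect(buf.readableBytes());
        buf.readBytes(direct);  // the single explicit copy
        direct.flip();
        long written = 0;
        while (direct.hasRemaining()) {
          int n = target.write(direct);
          if (n == 0) {
            break; // socket send buffer full; a real impl must resume from 'direct' later
          }
          written += n;
        }
        return written;
      }
    }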

Contributor Author

Is it possible to know the accurate number? I guess not, because it's OS dependent and may be changed via OS settings.

@zsxwing There might be a way to get the accurate size of the network buffer, but I think doing so is pointless: even if we get the accurate number, we cannot guarantee that the network send buffer is empty each time we write, which means it is always possible that only part of the data gets written, whatever value we set NIO_BUFFER_LIMIT to. We can only say that the smaller NIO_BUFFER_LIMIT is, the less redundant copying is done.

Contributor Author

On my machine, /proc/sys/net/core/wmem_default is around 200k, which (I assume) means you'd be copying about half of the buffer needlessly here.

@vanzin, on my machine both wmem_default and wmem_max are also around 200K, but in my test I can successfully write more than 512K per WritableByteChannel.write() call; that size should match the return value of writeFromNativeBuffer, as in line http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/sun/nio/ch/IOUtil.java#65. I don't know why. Can you run a test as well?

Contributor Author

What if we create a DirectByteBuffer here manually for a big buf (big enough that we still benefit even though creating a direct buffer is slow) and try to write as much as possible? Then we can avoid the memory copy in IOUtil.write.

@zsxwing, yes, the redundant copy can be avoided if we hand a direct buffer straight to WritableByteChannel.write(), because of the code in line http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/sun/nio/ch/IOUtil.java#50, but I don't know whether that is worthwhile. IOUtil maintains a direct-buffer pool to avoid allocating direct buffers too frequently. I think that is why, in my tests, network throughput on the executor side was extremely low the first time I ran sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Long](1024 * 1024 * 200)).iterator).reduce((a,b)=> a).length, but was much higher when I ran it after first running sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length.

So, if we want to create direct buffers manually in Spark, we should also maintain a buffer pool, but that introduces much more complexity and carries the risk of memory leaks.

@liyezhang556520
Contributor Author

cc @rxin

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54613 has finished for PR 12083 at commit 63ca85a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

@zsxwing

@vanzin
Contributor

vanzin commented Mar 31, 2016

This is a little unexpected; I'd expect that if there isn't enough buffer space in the WritableByteChannel, you'd get a short write and that's it. The code already takes care of that by keeping track of how many bytes were written, and a quick look at the netty code shows it does the same (it does spin a few times, by default 16, calling transferTo in a loop to see if it makes progress).

Do you know where that is breaking? Are we maybe failing to set some flag somewhere that properly configures the channels as non-blocking? Or is this an issue with the underlying WritableByteChannel implementation that ends up being used (and do you know what that implementation is)?

@zsxwing
Member

zsxwing commented Mar 31, 2016

@liyezhang556520 Could you point out the class calling IOUtil.write? The implementation I found is ByteArrayWritableChannel.write. It just calls ByteBuffer.get, which doesn't use a buffer.

Never mind. I found there is another implementation: sun.nio.ch.SocketChannelImpl.

@zsxwing
Member

zsxwing commented Mar 31, 2016

I just read the code here: http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/sun/nio/ch/IOUtil.java#46

sun.nio.ch.IOUtil.write still needs the memory copy even for small chunks (anything that is not a DirectByteBuffer).

@liyezhang556520
Contributor Author

Hi @vanzin, the location of the memory copy was pointed out by @zsxwing; the call stack is as follows:

        at java.nio.Bits.copyFromArray(Bits.java:754)
        at java.nio.DirectByteBuffer.put(DirectByteBuffer.java:371)
        at java.nio.DirectByteBuffer.put(DirectByteBuffer.java:342)
        at sun.nio.ch.IOUtil.write(IOUtil.java:60)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:466)
        - locked <0x00007f8a8a28d400> (a java.lang.Object)
        at org.apache.spark.network.protocol.MessageWithHeader.copyByteBuf(MessageWithHeader.java:131)
        at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:114)

The whole buffer is copied in line http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/sun/nio/ch/IOUtil.java#60, but the buffer cannot be written out completely if its size is greater than the available underlying buffer's, as seen in line http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/sun/nio/ch/IOUtil.java#65. So on each call we copy the whole input ByteBuf but write only part of it when the input is relatively large, which results in multiple unnecessary copies of the input ByteBuf.

This PR handles the issue the same way Hadoop does; please refer to https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java#L2957

@zsxwing
Member

zsxwing commented Apr 1, 2016

@liyezhang556520 thanks for clarifying. Now I think I understand this issue: writeFromNativeBuffer doesn't guarantee writing all bytes of the copied buffer (it is limited by the underlying OS buffer). So if we write a 1M buffer, it may only write NIO_BUFFER_LIMIT (512K), and we need to write the remaining 512K again; in that case we copy 1M + 512K bytes. If we divide the 1M buffer into 2 * 512K buffers, then we only need to copy 512K + 512K bytes. Is that correct?

@vanzin
Contributor

vanzin commented Apr 1, 2016

@liyezhang556520 ah, I see. Thanks for the pointer. So basically, if the source buffer is not a direct buffer, that class is making a copy of the whole source buffer before trying to write it to the channel. That's, uh, a little silly, but I guess it's something we have to live with...

@liyezhang556520
Contributor Author

So if we write a 1M buffer, it may only write NIO_BUFFER_LIMIT (512K), and we need to write the remaining 512K again; in that case we copy 1M + 512K bytes. If we divide the 1M buffer into 2 * 512K buffers, then we only need to copy 512K + 512K bytes. Is that correct?

@zsxwing, yes, that's right, so there will be a tremendous amount of copying if the data to be written is huge.

So basically, if the source buffer is not a direct buffer, that class is making a copy of the whole source buffer before trying to write it to the channel. That's, uh, a little silly, but I guess it's something we have to live with...

@vanzin Yes, we have to live with it if the buffer is not a direct buffer.

@liyezhang556520
Contributor Author

@zsxwing , @vanzin Any further comments?

@vanzin
Contributor

vanzin commented Apr 5, 2016

@liyezhang556520 I like the idea of eagerly copying into a direct buffer, but I understand that might be a lot of code for not much gain. I still think we should reduce that limit though - maybe 256k?

@liyezhang556520
Contributor Author

@vanzin, @zsxwing, I have changed the buffer limit to 256K. I agree that it would be better to handle this by manually copying the data into a direct buffer, so that no duplicate copy is made, but that introduces some code complexity. How about we merge this PR first, and later I create a new JIRA and submit a new PR that uses a direct ByteBuffer pool to avoid the duplicate memory copy? Once we use direct buffers, the code needs careful handling and should be well reviewed.

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55082 has finished for PR 12083 at commit a793696.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liyezhang556520
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55099 has finished for PR 12083 at commit a793696.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liyezhang556520
Contributor Author

retest this please.

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55102 has finished for PR 12083 at commit a793696.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Apr 6, 2016

LGTM

@vanzin
Contributor

vanzin commented Apr 6, 2016

Merging to master, thanks!

@asfgit asfgit closed this in c4bb02a Apr 6, 2016
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Apr 7, 2016
@davies
Contributor

davies commented Apr 11, 2016

@liyezhang556520 Could you send a patch for the 1.6 branch?

@liyezhang556520
Contributor Author

@davies , please see #12296
