[SPARK-14290][CORE][Network] avoid significant memory copy in netty's transferTo #12083
Conversation
 * The size should not be too large as it will waste underlying memory copy. e.g. If the network
 * available buffer is smaller than this limit, the data cannot be sent within one single write
 * operation while it still will make a memory copy of this size.
 */
I set this limit to 512K because in my test it could successfully write about 600KB ~ 1.5MB of data for each WritableByteChannel.write(). This size needs to be decided after more tests by someone else.
Is it possible to know the accurate number? I guess not, because it's OS dependent and may be changed via OS settings.
However, I saw Hadoop uses private static int NIO_BUFFER_LIMIT = 8*1024; // should not be more than 64KB.
I'm also a little worried that 512k might be a bit too much. On my machine, /proc/sys/net/core/wmem_default is around 200k, which (I assume) means you'd be needlessly copying about half of the buffer here.
Instead, how about using a more conservative value (like Hadoop's), and looping in copyByteBuf until you either write the whole source buffer or get a short write?
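Something like the sketch below is what I have in mind (illustrative only; the constant, class, and method names here are placeholders, not the actual Spark code):

```java
import io.netty.buffer.ByteBuf;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

class ChunkedCopy {
  // Cap how much of the source is exposed per write(), so IOUtil.write copies
  // at most NIO_BUFFER_LIMIT bytes into its temporary direct buffer each time.
  private static final int NIO_BUFFER_LIMIT = 256 * 1024;

  static long copyByteBuf(ByteBuf buf, WritableByteChannel target) throws IOException {
    long totalWritten = 0;
    while (buf.readableBytes() > 0) {
      int length = Math.min(buf.readableBytes(), NIO_BUFFER_LIMIT);
      ByteBuffer chunk = buf.nioBuffer(buf.readerIndex(), length);
      int written = target.write(chunk);
      buf.skipBytes(written);
      totalWritten += written;
      if (written < length) {
        break; // short write: the socket send buffer is full, retry on the next transferTo call
      }
    }
    return totalWritten;
  }
}
```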
I think a value that is too small will waste a lot of system calls. Our use case is different from Hadoop's; here we may send large messages.
What if we create a DirectByteBuffer here manually for a big buf (big enough that we still get benefits even if creating a direct buffer is slow) and try to write as much as possible? Then we can avoid the memory copy in IOUtil.write.
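A minimal sketch of this idea (names are hypothetical, not a concrete patch): copy the heap ByteBuf once into a direct ByteBuffer we control, so the write path sees a direct buffer and IOUtil.write makes no extra copy.

```java
import io.netty.buffer.ByteBuf;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

class DirectCopyWrite {
  static int writeViaDirectBuffer(ByteBuf buf, WritableByteChannel target) throws IOException {
    // One explicit copy from the heap ByteBuf into native memory.
    ByteBuffer direct = ByteBuffer.allocateDirect(buf.readableBytes());
    buf.getBytes(buf.readerIndex(), direct);
    direct.flip();
    int totalWritten = 0;
    while (direct.hasRemaining()) {
      int n = target.write(direct); // direct buffer: no temporary copy inside IOUtil.write
      if (n == 0) {
        break; // socket send buffer is full; retry on the next writability event
      }
      totalWritten += n;
    }
    buf.skipBytes(totalWritten);
    return totalWritten;
  }
}
```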
> Is it possible to know the accurate number? I guess not, because it's OS dependent and may be changed via OS settings.

@zsxwing There might be a way to get the accurate number for the network buffer, but I think it's meaningless to do that, because even if we get the accurate number, we cannot guarantee that the network send buffer is empty each time we write the data. That means it's always possible that we can only write part of the data, whatever size we set NIO_BUFFER_LIMIT to. We can only say that the smaller NIO_BUFFER_LIMIT is, the less redundant copying will be made.
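For example (with illustrative numbers): if a 4 MB ByteBuf is written while only about 200 KB of socket send buffer is free, a single write() still copies the whole 4 MB into the temporary direct buffer but only ~200 KB reaches the socket, so roughly 3.8 MB of that copy is wasted; with a 256 KB limit, at most ~56 KB of copying can be wasted per call, at the price of more write() system calls.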
> On my machine, /proc/sys/net/core/wmem_default is around 200k, which (I assume) means you'd be needlessly copying about half of the buffer here.

@vanzin, on my machine both wmem_default and wmem_max are also around 200K, but in my test I can successfully write more than 512K for each WritableByteChannel.write(). This size should be the same as the size returned by writeFromNativeBuffer, as in line http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/sun/nio/ch/IOUtil.java#65. I don't know why. Can you also run a test?
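If anyone wants to reproduce this, a rough probe (host, port, and buffer size are placeholders; it assumes something is listening on the other end) could be:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class SingleWriteProbe {
  public static void main(String[] args) throws Exception {
    SocketChannel channel = SocketChannel.open(new InetSocketAddress("remote-host", 9999));
    channel.configureBlocking(false);
    // Heap buffer, so sun.nio.ch.IOUtil.write copies it into a temporary direct buffer first.
    ByteBuffer buf = ByteBuffer.allocate(4 * 1024 * 1024);
    int accepted = channel.write(buf);
    System.out.println("bytes accepted by one write(): " + accepted);
    channel.close();
  }
}
```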
> What if we create a DirectByteBuffer here manually for a big buf (big enough that we still get benefits even if creating a direct buffer is slow) and try to write as much as possible? Then we can avoid the memory copy in IOUtil.write.

@zsxwing, yes, the redundant copy can be avoided if we give a direct buffer directly to WritableByteChannel.write(), because of the code in line http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/sun/nio/ch/IOUtil.java#50, but I don't know if that's worthwhile. IOUtil maintains a direct buffer pool to avoid frequently allocating direct buffers. I think that's why, in my test, the first time I ran sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Long](1024 * 1024 * 200)).iterator).reduce((a,b)=> a).length the network throughput was extremely low on the executor side, while if I ran it after running sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length, the network throughput was much higher.
So, if we want to create direct buffers manually in Spark, it's better to also maintain a buffer pool, but that would introduce much more complexity and carries the risk of memory leaks.
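For reference, even a minimal per-thread cache like the sketch below (illustrative only, not a concrete proposal) shows the trade-off: every writer thread pins CHUNK_SIZE bytes of native memory until the thread and its buffer become unreachable, which is exactly the leak risk mentioned above.

```java
import java.nio.ByteBuffer;

final class DirectBufferCache {
  private static final int CHUNK_SIZE = 256 * 1024;

  // One reusable direct buffer per writer thread; the native memory stays
  // allocated for as long as the thread is alive.
  private static final ThreadLocal<ByteBuffer> CACHE =
      ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(CHUNK_SIZE));

  static ByteBuffer acquire() {
    ByteBuffer buf = CACHE.get();
    buf.clear();
    return buf;
  }
}
```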
cc @rxin
Test build #54613 has finished for PR 12083 at commit
This is a little unexpected; I'd expect that if there isn't enough buffer space in the Do you know where that is breaking? Are we maybe failing to set some flag somewhere that properly configures the channels as non-blocking? Or is this an issue with the underlying
Never mind. I found there is another implementation:
I just read the code here: http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/sun/nio/ch/IOUtil.java#46
Hi @vanzin, the place where the memory copy happens was pointed out by @zsxwing; the call stack is as follows: The whole-buffer copy is in line http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/sun/nio/ch/IOUtil.java#60, but the buffer cannot be fully written if its size is greater than the available underlying buffer's, which is in line http://www.grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/sun/nio/ch/IOUtil.java#65. So each time we will make a copy of the whole input. The way this PR handles the issue is the same as in Hadoop; please refer to https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java#L2957
@liyezhang556520 thanks for clarifying. Now I think I understand this issue. It's because
@liyezhang556520 ah, I see. Thanks for the pointer. So basically, if the source buffer is not a direct buffer, that class is making a copy of the whole source buffer before trying to write it to the channel. That's, uh, a little silly, but I guess it's something we have to live with...
@zsxwing, yes, that's right, so there will be a tremendous amount of copying if the data to be written is huge.
@vanzin Yes, we have to live with it if the buffer is not a direct buffer.
@liyezhang556520 I like the idea of eagerly copying into a direct buffer, but I understand that might be a lot of code for not much gain. I still think we should reduce that limit though - maybe 256k?
@vanzin, @zsxwing, I have changed the buffer limit to 256K. I do agree that it would be better to handle this issue by manually copying data into a direct buffer, so no duplicate copy is made, but that would add some code complexity. How about we merge this PR first, and later I can create a new JIRA and submit a new PR to use a direct ByteBuffer pool to avoid the duplicate memory copy? Once we start using direct buffers, the code needs to be handled carefully and should be well reviewed.
Test build #55082 has finished for PR 12083 at commit
Jenkins, retest this please.
Test build #55099 has finished for PR 12083 at commit
retest this please.
Test build #55102 has finished for PR 12083 at commit
LGTM
Merging to master, thanks!
@liyezhang556520 Could you send a patch for the 1.6 branch?
What changes were proposed in this pull request?
When netty transfers data that is not a FileRegion, the data will be in the form of a ByteBuf. If the data is large, a significant performance issue occurs because of the memory copy underlying sun.nio.ch.IOUtil.write: the CPU is 100% used while the network throughput is very low.
In this PR, if the data size is large, we split it into small chunks when calling WritableByteChannel.write(), so that wasteful memory copies are avoided. Because the data can't be written within a single write, transferTo will be called multiple times.
How was this patch tested?
Spark unit test and manual test.
Manual test:
sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length
For more details, please refer to SPARK-14290