[SPARK-26674][CORE] Consolidate CompositeByteBuf when reading large frame #23602
Conversation
Do you have a sense of how much time the consolidation takes vs memory saved? Just trying to get a handle on what the tradeoff is here. It's probably a good change. Any other places we can do this?
@srowen Thanks for the suggestion. Currently it's just a thought, without much sense of the tradeoff; it seems the memory saved depends on the readable size of the socket buffer and the memory allocation strategy (default ...
I have done some benchmark tests on my local machine; it seems consolidation can save a large amount of memory at a small time cost -- roughly 200 millis for a 50% saving on a 1GB CompositeByteBuf.

Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Linux 4.15.0-43-generic

[test consolidate 100 buffers each with 10m, 50% used for 1 loop]
[test consolidate 100 buffers each with 10m, 100% used for 1 loop]
[test consolidate 100 buffers each with 10m, 50% used for 10 loop]
[test consolidate 100 buffers each with 10m, 100% used for 10 loop]
[test consolidate 100 buffers each with 10m, 50% used for 50 loop]
[test consolidate 100 buffers each with 10m, 100% used for 50 loop]
[test consolidate 20 buffers each with 50m, 50% used for 1 loop]
[test consolidate 20 buffers each with 50m, 100% used for 1 loop]
[test consolidate 20 buffers each with 50m, 50% used for 10 loop]
[test consolidate 20 buffers each with 50m, 100% used for 10 loop]
[test consolidate 20 buffers each with 50m, 50% used for 50 loop]
[test consolidate 20 buffers each with 50m, 100% used for 50 loop]
[test consolidate 10 buffers each with 100m, 50% used for 1 loop]
[test consolidate 10 buffers each with 100m, 100% used for 1 loop]
[test consolidate 10 buffers each with 100m, 50% used for 10 loop]
[test consolidate 10 buffers each with 100m, 100% used for 10 loop]
[test consolidate 10 buffers each with 100m, 50% used for 50 loop]
[test consolidate 10 buffers each with 100m, 100% used for 50 loop]
The benchmark code is as below:
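A minimal sketch of such a consolidation timing (assuming Netty's Unpooled API; the buffer counts and sizes are illustrative, smaller than the 100 x 10 MB cases listed above, and this is not the author's original benchmark):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

public class ConsolidateBenchmark {
  public static void main(String[] args) {
    int numBuffers = 20;
    int bufSize = 10 * 1024 * 1024;   // 10 MB backing buffer per component
    int usedBytes = bufSize / 2;      // only 50% of each buffer actually written

    CompositeByteBuf composite = Unpooled.compositeBuffer(Integer.MAX_VALUE);
    for (int i = 0; i < numBuffers; i++) {
      ByteBuf component = Unpooled.buffer(bufSize);
      component.writerIndex(usedBytes);
      composite.addComponent(true, component);
    }

    long start = System.currentTimeMillis();
    composite.consolidate();  // copies all components into one backing buffer
    long elapsed = System.currentTimeMillis() - start;

    System.out.println("capacity after consolidation: " + composite.capacity()
        + " bytes, took " + elapsed + " ms");
    composite.release();
  }
}
```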
The only risk now is that while doing the consolidation, the memory will double before it's done. The default maxComponents value is 16, and if the number of components exceeds this threshold, consolidation happens. Since the default socket buffer is small (usually <1M, according to net.ipv4.tcp_rmem), it's safe.
Interesting, see #12038 where there is a lot of discussion about this; it looks to be pretty on purpose. @vanzin and @liyezhang556520 discussed it. |
@srowen I just ran a benchmark test for the above code, and it's true that it's rather slow. The test report is as below:

// --- Test Reports for plan 2 ------
But I came up with another idea: we can just check the writerIndex of the CompositeByteBuf, and if the delta exceeds some threshold we do a consolidation. How do we set a reasonable threshold that takes good care of both performance and memory?
1. Estimate the memoryOverhead for shuffle.
2. Estimate the threshold upon the shuffle memoryOverhead. This is a conservative estimation; in most cases we could make the threshold higher, but we keep it unchanged for better safety. Then, say we fetch a 1GB shuffle block (memoryOverhead should be larger), we get at least 300M as the threshold.
3. How about the performance?
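A rough sketch of the delta-threshold idea (assuming Netty's CompositeByteBuf API; the threshold value and names here are illustrative, not the final implementation):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

public class DeltaThresholdConsolidation {
  // Hypothetical threshold: consolidate once the un-consolidated part exceeds 1 MB.
  private static final long DELTA_THRESHOLD = 1024 * 1024;

  private long consolidatedSize = 0;
  private int consolidatedComponents = 0;

  // Add one network read to the frame, and consolidate only the new components
  // when the delta since the last consolidation exceeds the threshold.
  void addAndMaybeConsolidate(CompositeByteBuf frameBuf, ByteBuf next) {
    frameBuf.addComponent(true, next);
    if (frameBuf.capacity() - consolidatedSize > DELTA_THRESHOLD) {
      int newComponents = frameBuf.numComponents() - consolidatedComponents;
      frameBuf.consolidate(consolidatedComponents, newComponents);
      consolidatedSize = frameBuf.capacity();
      consolidatedComponents = frameBuf.numComponents();
    }
  }

  public static void main(String[] args) {
    DeltaThresholdConsolidation demo = new DeltaThresholdConsolidation();
    CompositeByteBuf frameBuf = Unpooled.compositeBuffer(Integer.MAX_VALUE);
    for (int i = 0; i < 100; i++) {
      // Simulated socket read: 64 KB buffer, half of it carrying frame bytes.
      ByteBuf read = Unpooled.buffer(64 * 1024);
      read.writerIndex(32 * 1024);
      demo.addAndMaybeConsolidate(frameBuf, read);
    }
    System.out.println("components: " + frameBuf.numComponents()
        + ", readable: " + frameBuf.readableBytes());
    frameBuf.release();
  }
}
```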
It seems we can gain a huge memory saving with little time spent (at most ~500 millis for a 1GB shuffle). This method has many advantages:
```java
  return conf.getBoolean(SPARK_NETWORK_IO_PREFERDIRECTBUFS_KEY, true);
}

/** The threshold for consolidation, it is derived upon the memoryOverhead in yarn mode. */
```
This replicates a lot of logic from elsewhere with hard-coded constants. Is it really important to vary it so finely and to add a whole new conf? It seems like this ought to be pretty independent of the environment, whether consolidating a buffer of size X is worthwhile.
@srowen Thanks, yes, I think you are right, we can just make it some fixed factor of the frame size.
After refining, the perf tests show that consolidation works well for reading a 1GB block with little extra memory (using consolidate(cIndex, numComponents) to avoid unnecessary consolidation of already consolidated components). Results with my laptop at low battery. Results with my laptop at normal battery.
```java
// to reduce memory consumption.
if (frameBuf.capacity() - consolidatedFrameBufSize > consolidateFrameBufsDeltaThreshold) {
  int newNumComponents = frameBuf.numComponents() - consolidatedNumComponents;
  frameBuf.consolidate(consolidatedNumComponents, newNumComponents);
```
The logic here seems correct, but how is this different than just calling frameBuf.consolidate() without having to keep track of the component count in this class?
The no-argument consolidate() will do unnecessary consolidation of already consolidated components (i.e. there is always exactly one component after consolidation), which is slow and wastes memory. However, consolidate(cIndex, numComponents) will only consolidate the specified new components.
For instance, let's say we add 10 components and do a first consolidation; then we get one consolidated component. If we use consolidate(cIndex, numComponents) here, then the next time we consolidate after another 10 components are added, we do not need to re-consolidate the components that were already consolidated (no extra memory allocation and copy).
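A small illustration of the difference (a sketch assuming Netty's CompositeByteBuf API; the sizes are arbitrary):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

public class ConsolidateRangeDemo {
  public static void main(String[] args) {
    CompositeByteBuf buf = Unpooled.compositeBuffer(Integer.MAX_VALUE);

    addComponents(buf, 10);
    buf.consolidate();                       // copies everything: 10 components -> 1
    System.out.println(buf.numComponents()); // 1

    addComponents(buf, 10);
    // Only consolidate the 10 components added since the last consolidation;
    // the first consolidated component is left untouched (no re-copy).
    buf.consolidate(1, 10);
    System.out.println(buf.numComponents()); // 2

    buf.release();
  }

  private static void addComponents(CompositeByteBuf buf, int n) {
    for (int i = 0; i < n; i++) {
      ByteBuf c = Unpooled.buffer(1024);
      c.writerIndex(512);                    // half-filled component
      buf.addComponent(true, c);
    }
  }
}
```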
Ok, so in the end you don't end up with a single component, but with many components of size CONSOLIDATE_THRESHOLD each (minus the last one). I thought I saw the tests checking for a single component after consolidation, but may have misread.
Yes, that's it.
```java
// Reset buf and size for next frame.
ByteBuf frameBufCopy = frameBuf.duplicate();
frameBuf = null;
```
To follow up Sean's question, aren't you leaking frameBuf here now? You're returning a duplicate and not releasing this instance to decrement its ref count.
(Another way of saying that: returning the buffer itself is probably the right thing.)
No, frameBuf.duplicate() creates a derived buffer which shares the memory region of the parent buffer. A derived buffer does not have its own reference count; it shares the reference count of the parent buffer.
Here we can return a local variable that refers to the frameBuf object, and null out frameBuf for the next frame decoding.
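A quick way to see the shared reference count (a sketch using Netty's Unpooled API, not code from this patch):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

public class DuplicateRefCntDemo {
  public static void main(String[] args) {
    ByteBuf parent = Unpooled.buffer(8).writeLong(42L);
    ByteBuf dup = parent.duplicate();    // derived buffer, shares memory and refCnt

    System.out.println(parent.refCnt()); // 1
    System.out.println(dup.refCnt());    // 1 -- same underlying counter

    dup.release();                       // decrements the shared count
    System.out.println(parent.refCnt()); // 0 -- parent is released too
  }
}
```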
```java
private int frameRemainingBytes = UNKNOWN_FRAME_SIZE;
private volatile Interceptor interceptor;

public TransportFrameDecoder() {
```
I thought you were going to make this configurable. Where are you reading the value from the configuration?
Now I think maybe we can just make it a fixed value; users are unlikely to change this threshold in most cases, and it requires little memory as shown in the newest test reports.
ok to test
Test build #102205 has finished for PR 23602 at commit
```java
}

@Test
public void testConsolidationForDecodingNonFullyWrittenByteBuf() {
```
If I understand correctly, this is testing that consolidation is reducing the amount of memory needed to hold a frame? But since you're writing just 1 MB to the decoder, that's not triggering consolidation, is it?
Playing with CompositeByteBuf, it adjusts the internal capacity based on the readable bytes of the components, but the component buffers remain unchanged, so still holding on to the original amount of memory:
```scala
scala> cb.numComponents()
res4: Int = 2

scala> cb.capacity()
res5: Int = 8

scala> cb.component(0).capacity()
res6: Int = 1048576
```
So I'm not sure this test is testing anything useful.
Also it would be nice not to use so many magic numbers.
@vanzin I think the test should be refined, but I have a question about your test.
CompositeByteBuf.capacity returns the last component's endOffset, so I think using the capacity for testing is OK.
https://github.com/netty/netty/blob/8fecbab2c56d3f49d0353d58ee1681f3e6d3feca/buffer/src/main/java/io/netty/buffer/CompositeByteBuf.java#L730
Maybe my question wasn't clear. I'm asking what part of Spark code is this test testing.
As far as I can see, it's testing netty code, and these are not netty unit tests.
@vanzin I think this test largely duplicates testConsolidationPerf; we can just remove it. I will update soon. Sorry for that.
```java
while (frameRemainingBytes > 0 && !buffers.isEmpty()) {
  ByteBuf next = nextBufferForFrame(frameRemainingBytes);
  frameRemainingBytes -= next.readableBytes();
  frameBuf.addComponent(next).writerIndex(frameBuf.writerIndex() + next.readableBytes());
```
Not sure if that's a new call, but this can be frameBuf.addComponent(true, next)
@vanzin This is a copy of existing code. I can replace it with what you suggested.
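For reference, a minimal comparison of the two calls (a sketch assuming Netty 4.1's ByteBuf API; not code from this patch):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

public class AddComponentDemo {
  public static void main(String[] args) {
    ByteBuf next = Unpooled.buffer(16).writeLong(1L);   // 8 readable bytes

    // Existing form: add the component, then advance the writer index by hand.
    CompositeByteBuf a = Unpooled.compositeBuffer();
    a.addComponent(next.retainedDuplicate())
     .writerIndex(a.writerIndex() + next.readableBytes());

    // Suggested form: the boolean flag advances the writer index automatically.
    CompositeByteBuf b = Unpooled.compositeBuffer();
    b.addComponent(true, next.retainedDuplicate());

    System.out.println(a.readableBytes() + " == " + b.readableBytes()); // 8 == 8

    a.release();
    b.release();
    next.release();
  }
}
```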
Test build #102351 has finished for PR 23602 at commit

Test build #102377 has finished for PR 23602 at commit
```java
// Reset buf and size for next frame.
ByteBuf frame = frameBuf;
frameBuf = null;
nextFrameSize = UNKNOWN_FRAME_SIZE;
```
You have to reset consolidatedFrameBufSize and consolidatedNumComponents back to 0 for the next frame buffer.
Otherwise, after a very huge frame, all the smaller but still quite huge frames are not consolidated at all.
And when consolidation does kick in again (for a frame bigger than the previous maximum), only the components beyond the previous maximum get consolidated.
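A sketch of the kind of reset being asked for, continuing the snippet above (field names are taken from this PR's diff; the exact merged fix may differ):

```java
// Reset ALL per-frame state, including the consolidation bookkeeping,
// so the next frame starts consolidating from scratch.
ByteBuf frame = frameBuf;
frameBuf = null;
consolidatedFrameBufSize = 0;
consolidatedNumComponents = 0;
nextFrameSize = UNKNOWN_FRAME_SIZE;
return frame;
```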
@attilapiros Good catch! Thank you so much! I will fix it.
@attilapiros done!
I see you fixed this, but it should have been caught by unit tests. So there's probably a check missing in your tests (expected number of components?).
I think what's missing is not the check for the expected number of components but testing with multiple messages. Right now, within the loop body (where a new TransportFrameDecoder is created too), only one 1GB message is sent.
Yes, I can add some code to test multiple messages; we just need to do the same check on the consolidated buf capacity. I think this is more result-oriented.
Test build #102393 has finished for PR 23602 at commit

Test build #102409 has finished for PR 23602 at commit
```java
  totalBytesGot += buf.capacity();
}
assertEquals(numMessages, retained.size());
assertEquals(targetBytes * numMessages, totalBytesGot);
```
Does this mean this test now requires 3GB of memory just to store the data it's checking?
That seems wasteful. Either change the test to do checks after each separate message is written, or lower the size of the messages.
Done!
```java
decoder.channelRead(ctx, buf);
while (writtenBytes < targetBytes) {
  buf = Unpooled.buffer(pieceBytes * 2);
  ByteBuf writtenBuf = Unpooled.buffer(pieceBytes).writerIndex(pieceBytes);
```
Just wanted to point out you're counting this allocation time in your performance measurement, which isn't optimal.
Done, thank you @vanzin
Test build #102552 has finished for PR 23602 at commit
retest this please

Test build #102758 has finished for PR 23602 at commit
retest this please

looks good pending tests (which failed last with an unrelated issue that should now be fixed).
Test build #102760 has finished for PR 23602 at commit
Merging to master.
What changes were proposed in this pull request?
Currently, TransportFrameDecoder does not consolidate the buffers read from the network, which may cause memory waste. In most cases a ByteBuf's writerIndex is far less than its capacity, so we can optimize this by doing consolidation.
This PR will do this optimization.
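As an illustration of the waste (a sketch assuming Netty's Unpooled API; the sizes are made up): each component can retain a backing buffer much larger than the bytes it actually holds.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

public class MemoryWasteDemo {
  public static void main(String[] args) {
    CompositeByteBuf frame = Unpooled.compositeBuffer(Integer.MAX_VALUE);
    long componentCapacity = 0;
    // Simulate frame decoding from socket reads: each read buffer is 64 KB,
    // but only a small part of it carries bytes for this frame.
    for (int i = 0; i < 16; i++) {
      ByteBuf read = Unpooled.buffer(64 * 1024);
      read.writerIndex(4 * 1024);            // only 4 KB of the 64 KB is used
      componentCapacity += read.capacity();
      frame.addComponent(true, read);
    }
    System.out.println("readable bytes: " + frame.readableBytes());  // 64 KB of payload
    System.out.println("retained capacity: " + componentCapacity);   // 1 MB held in components
    frame.release();
  }
}
```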
Related code:
spark/common/network-common/src/main/java/org/apache/spark/network/util/TransportFrameDecoder.java
Line 143 in 9a30e23
How was this patch tested?
UT
Please review http://spark.apache.org/contributing.html before opening a pull request.