liupc commented Jan 21, 2019

What changes were proposed in this pull request?

Currently, TransportFrameDecoder does not consolidate the buffers it reads from the network, which may waste memory: in most cases a ByteBuf's writerIndex is far below its capacity, so we can optimize by consolidating the buffers.

This PR implements that optimization.

Related code:

CompositeByteBuf frame = buffers.getFirst().alloc().compositeBuffer(Integer.MAX_VALUE);
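
For illustration, here is a minimal standalone sketch (my own, not part of this PR; it assumes netty 4.x and mirrors the compositeBuffer usage above) of the waste being described -- half-empty components keep their full backing memory until consolidate() copies the readable bytes into one right-sized buffer:

  import io.netty.buffer.ByteBuf;
  import io.netty.buffer.CompositeByteBuf;
  import io.netty.buffer.Unpooled;

  public class ConsolidationWasteDemo {
    public static void main(String[] args) {
      CompositeByteBuf frame = Unpooled.compositeBuffer(Integer.MAX_VALUE);
      for (int i = 0; i < 4; i++) {
        ByteBuf piece = Unpooled.buffer(1024 * 1024); // 1 MiB of backing memory...
        piece.writerIndex(1024);                      // ...but only 1 KiB was written
        frame.addComponent(true, piece);              // true: advance the writerIndex
      }
      System.out.println(frame.readableBytes());         // 4096
      System.out.println(frame.component(0).capacity()); // 1048576: memory still held
      frame.consolidate();                               // copies into one right-sized buffer
      System.out.println(frame.component(0).capacity()); // 4096
      frame.release();
    }
  }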

How was this patch tested?

UT


liupc changed the title from "[SPARK-26674]Consolidate CompositeByteBuf when reading large frame" to "[SPARK-26674][CORE]Consolidate CompositeByteBuf when reading large frame" on Jan 28, 2019
srowen (Member) commented Jan 29, 2019

Do you have a sense of how much time the consolidation takes vs memory saved? Just trying to get a handle on what the tradeoff is here. It's probably a good change. Any other places we can do this?

liupc (Author) commented Jan 30, 2019

@srowen Thanks for the suggestion. Right now it's just an idea; I don't have hard numbers on the tradeoff yet. The memory saved seems to depend on the readable size of the socket buffer and the memory allocation strategy (the default AdaptiveRecvByteBufAllocator), so I will try to run some benchmarks to put numbers on the tradeoff.
This seems to be the only place CompositeByteBuf is used.

liupc (Author) commented Feb 1, 2019

I have run some benchmarks on my local machine; consolidation appears to save a lot of memory at a small time cost -- roughly 200 millis to reclaim 50% of a 1GB CompositeByteBuf.


Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Linux 4.15.0-43-generic
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz

[test consolidate 100 buffers each with 10m, 50% used for 1 loop]
Allocating 5242880 bytes
Time cost with 1 loop for consolidating: 223 millis

[test consolidate 100 buffers each with 10m, 100% used for 1 loop]
Allocating 10485760 bytes
Time cost with 1 loop for consolidating: 451 millis

[test consolidate 100 buffers each with 10m, 50% used for 10 loop]
Allocating 5242880 bytes
Time cost with 10 loop for consolidating: 1924 millis

[test consolidate 100 buffers each with 10m, 100% used for 10 loop]
Allocating 10485760 bytes
Time cost with 10 loop for consolidating: 3787 millis

[test consolidate 100 buffers each with 10m, 50% used for 50 loop]
Allocating 5242880 bytes
Time cost with 50 loop for consolidating: 8870 millis

[test consolidate 100 buffers each with 10m, 100% used for 50 loop]
Allocating 10485760 bytes
Time cost with 50 loop for consolidating: 17472 millis

[test consolidate 20 buffers each with 50m, 50% used for 1 loop]
Allocating 26214400 bytes
Time cost with 1 loop for consolidating: 184 millis

[test consolidate 20 buffers each with 50m, 100% used for 1 loop]
Allocating 52428800 bytes
Time cost with 1 loop for consolidating: 367 millis

[test consolidate 20 buffers each with 50m, 50% used for 10 loop]
Allocating 26214400 bytes
Time cost with 10 loop for consolidating: 1847 millis

[test consolidate 20 buffers each with 50m, 100% used for 10 loop]
Allocating 52428800 bytes
Time cost with 10 loop for consolidating: 3638 millis

[test consolidate 20 buffers each with 50m, 50% used for 50 loop]
Allocating 26214400 bytes
Time cost with 50 loop for consolidating: 9126 millis

[test consolidate 20 buffers each with 50m, 100% used for 50 loop]
Allocating 52428800 bytes
Time cost with 50 loop for consolidating: 19391 millis

[test consolidate 10 buffers each with 100m, 50% used for 1 loop]
Allocating 52428800 bytes
Time cost with 1 loop for consolidating: 211 millis

[test consolidate 10 buffers each with 100m, 100% used for 1 loop]
Allocating 104857600 bytes
Time cost with 1 loop for consolidating: 400 millis

[test consolidate 10 buffers each with 100m, 50% used for 10 loop]
Allocating 52428800 bytes
Time cost with 10 loop for consolidating: 1954 millis

[test consolidate 10 buffers each with 100m, 100% used for 10 loop]
Allocating 104857600 bytes
Time cost with 10 loop for consolidating: 3846 millis

[test consolidate 10 buffers each with 100m, 50% used for 50 loop]
Allocating 52428800 bytes
Time cost with 50 loop for consolidating: 9747 millis

[test consolidate 10 buffers each with 100m, 100% used for 50 loop]
Allocating 104857600 bytes
Time cost with 50 loop for consolidating: 19542 millis

liupc (Author) commented Feb 1, 2019

The benchmark code is as below:

 private CompositeByteBuf createCompositeBuf(ByteBufAllocator alloc, int numComponents, int size, int writtenBytes) {
    CompositeByteBuf compositeByteBuf = alloc.compositeBuffer(Integer.MAX_VALUE);
    for (int i = 0; i < numComponents; i++) {
      // Each component has `size` bytes of capacity, of which only `writtenBytes` are used.
      ByteBuf buf = alloc.ioBuffer(size);
      buf.writerIndex(writtenBytes);
      // addComponent does not advance the composite's writerIndex, so bump it manually.
      compositeByteBuf.addComponent(buf).writerIndex(compositeByteBuf.writerIndex() + buf.readableBytes());
    }
    return compositeByteBuf;
  }

  private void testConsolidateWithLoop(String testName, ByteBufAllocator alloc, int numComponents, int size, int util, int loopCount) {
    long totalTime = 0L;
    int writtenBytes = (int) ((double) size * util / 100);  // bytes actually used per component
    for (int i = 0; i < loopCount; i++) {
      CompositeByteBuf buf = createCompositeBuf(alloc, numComponents, size, writtenBytes);
      // Time only the consolidation itself, not the allocation above.
      long start = System.currentTimeMillis();
      buf.consolidate();
      long cost = System.currentTimeMillis() - start;
      totalTime += cost;
      buf.release();
    }
    System.out.println("[" + testName + "]");
    System.out.println("Allocating " + writtenBytes + " bytes");
    System.out.println("Time cost with " + loopCount + " loop for consolidating: " + totalTime + " millis");
    System.out.println();
  }

  @Test
  public void benchmarkForConsolidation() throws Exception {
    PooledByteBufAllocator alloc = new PooledByteBufAllocator(true);

    testConsolidateWithLoop("test consolidate 100 buffers each with 10m, 50% used for 1 loop",
        alloc, 100, 1024 * 1024 * 10, 50, 1);

    testConsolidateWithLoop("test consolidate 100 buffers each with 10m, 100% used for 1 loop",
        alloc, 100, 1024 * 1024 * 10, 100, 1);

    testConsolidateWithLoop("test consolidate 100 buffers each with 10m, 50% used for 10 loop",
        alloc, 100, 1024 * 1024 * 10, 50, 10);

    testConsolidateWithLoop("test consolidate 100 buffers each with 10m, 100% used for 10 loop",
        alloc, 100, 1024 * 1024 * 10, 100, 10);

    testConsolidateWithLoop("test consolidate 100 buffers each with 10m, 50% used for 50 loop",
        alloc, 100, 1024 * 1024 * 10, 50, 50);

    testConsolidateWithLoop("test consolidate 100 buffers each with 10m, 100% used for 50 loop",
        alloc, 100, 1024 * 1024 * 10, 100, 50);

    // 20 buffers of 50m each
    testConsolidateWithLoop("test consolidate 20 buffers each with 50m, 50% used for 1 loop",
        alloc, 20, 1024 * 1024 * 50, 50, 1);
    testConsolidateWithLoop("test consolidate 20 buffers each with 50m, 100% used for 1 loop",
        alloc, 20, 1024 * 1024 * 50, 100, 1);
    testConsolidateWithLoop("test consolidate 20 buffers each with 50m, 50% used for 10 loop",
        alloc, 20, 1024 * 1024 * 50, 50, 10);
    testConsolidateWithLoop("test consolidate 20 buffers each with 50m, 100% used for 10 loop",
        alloc, 20, 1024 * 1024 * 50, 100, 10);
    testConsolidateWithLoop("test consolidate 20 buffers each with 50m, 50% used for 50 loop",
        alloc, 20, 1024 * 1024 * 50, 50, 50);
    testConsolidateWithLoop("test consolidate 20 buffers each with 50m, 100% used for 50 loop",
        alloc, 20, 1024 * 1024 * 50, 100, 50);

    // 10 buffers of 100m each
    testConsolidateWithLoop("test consolidate 10 buffers each with 100m, 50% used for 1 loop",
        alloc, 10, 1024 * 1024 * 100, 50, 1);
    testConsolidateWithLoop("test consolidate 10 buffers each with 100m, 100% used for 1 loop",
        alloc, 10, 1024 * 1024 * 100, 100, 1);
    testConsolidateWithLoop("test consolidate 10 buffers each with 100m, 50% used for 10 loop",
        alloc, 10, 1024 * 1024 * 100, 50, 10);
    testConsolidateWithLoop("test consolidate 10 buffers each with 100m, 100% used for 10 loop",
        alloc, 10, 1024 * 1024 * 100, 100, 10);
    testConsolidateWithLoop("test consolidate 10 buffers each with 100m, 50% used for 50 loop",
        alloc, 10, 1024 * 1024 * 100, 50, 50);
    testConsolidateWithLoop("test consolidate 10 buffers each with 100m, 100% used for 50 loop",
        alloc, 10, 1024 * 1024 * 100, 100, 50);
  }

liupc (Author) commented Feb 1, 2019

The only risk now is that while consolidating, memory usage doubles until the consolidation completes.
Or maybe we should just replace
CompositeByteBuf frame = buffers.getFirst().alloc().compositeBuffer(Integer.MAX_VALUE);
with
CompositeByteBuf frame = buffers.getFirst().alloc().compositeBuffer();

The default maxComponents value is 16, and once the number of components exceeds that threshold, consolidation happens automatically. Since the default socket buffer is small (usually <1M, according to net.ipv4.tcp_rmem), this should be safe.
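
For reference, a minimal sketch (my own illustration, assuming netty 4.x default behavior) of that automatic consolidation kicking in once the component count exceeds maxComponents:

  import io.netty.buffer.CompositeByteBuf;
  import io.netty.buffer.Unpooled;

  public class AutoConsolidateDemo {
    public static void main(String[] args) {
      CompositeByteBuf cb = Unpooled.compositeBuffer(); // default maxNumComponents = 16
      for (int i = 0; i < 17; i++) {
        cb.addComponent(true, Unpooled.buffer(8).writerIndex(8));
      }
      // Adding the 17th component crossed the threshold, so netty merged all
      // existing components into a single backing buffer.
      System.out.println(cb.numComponents()); // 1
      cb.release();
    }
  }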

srowen (Member) commented Feb 2, 2019

Interesting; see #12038, where there is a lot of discussion about this. The current behavior looks quite deliberate. @vanzin and @liyezhang556520 discussed it.

liupc (Author) commented Feb 2, 2019

CompositeByteBuf frame = buffers.getFirst().alloc().compositeBuffer();

@srowen I just ran a benchmark for the above code, and it is indeed rather slow. The test report is below:

--- Test Reports for plan 2 ---

[test consolidate 1000 buffers each with 1m, 50% used for 1 loop]
Allocating 524288 bytes
Time cost with 1 loop for consolidating: 5338 millis

[test consolidate 1000 buffers each with 1m, 100% used for 1 loop]
Allocating 1048576 bytes
Time cost with 1 loop for consolidating: 10220 millis

[test consolidate 1000 buffers each with 1m, 50% used for 10 loop]
Allocating 524288 bytes
Time cost with 10 loop for consolidating: 49249 millis

[test consolidate 1000 buffers each with 1m, 100% used for 10 loop]
Allocating 1048576 bytes
Time cost with 10 loop for consolidating: 99247 millis

[test consolidate 1000 buffers each with 1m, 50% used for 50 loop]
Allocating 524288 bytes
Time cost with 50 loop for consolidating: 249160 millis
...... too slow

liupc (Author) commented Feb 2, 2019

But I came up with another idea: we can just check the writerIndex of the CompositeByteBuf, and if the delta exceeds some threshold, do a consolidation.

How do we set a reasonable threshold that balances performance and memory?

1. Estimate the memoryOverhead available for shuffle
We can assume sixty percent of memoryOverhead is usable for shuffle, because for most applications the memoryOverhead is mainly consumed during the shuffle phase.

2. Derive the threshold from the shuffle memoryOverhead
In the worst case the writerIndex equals the capacity, so we must reserve half of this memory for the consolidation copy, which gives 0.3 * memoryOverhead as the threshold.

This is a conservative estimate; in most cases we could set the threshold higher, but we keep it as is for safety. For example, if we fetch a 1GB shuffle block (memoryOverhead should be larger than that), we get at least 300M as the threshold. (A quick sanity check of this arithmetic follows.)
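
As that sanity check (the constants are illustrative, not actual Spark config values):

  public class ThresholdSketch {
    public static void main(String[] args) {
      long memoryOverhead = 1024L * 1024 * 1024; // say 1 GiB of memoryOverhead
      double shuffleFraction = 0.6;              // ~60% assumed usable during shuffle
      double copyReserve = 0.5;                  // worst case: consolidation doubles memory
      long threshold = (long) (memoryOverhead * shuffleFraction * copyReserve);
      System.out.println(threshold);             // 322122547 bytes, i.e. roughly 300M
    }
  }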

3. How about the performance?
Benchmark code and test report:

https://github.com/liupc/spark/blob/01372cec7208c9a82be64f8a92d0a1515c1ce560/common/network-common/src/test/java/org/apache/spark/network/util/TransportFrameDecoderSuite.java#L141

--- Test Reports for plan 3 ---

[test consolidate 1000 buffers each with 1m, 50% used for 1 loop]
Allocating 524288 bytes
Time cost with 1 loop for consolidating: 116 millis

[test consolidate 1000 buffers each with 1m, 100% used for 1 loop]
Allocating 1048576 bytes
Time cost with 1 loop for consolidating: 635 millis

[test consolidate 1000 buffers each with 1m, 50% used for 10 loop]
Allocating 524288 bytes
Time cost with 10 loop for consolidating: 1003 millis

[test consolidate 1000 buffers each with 1m, 100% used for 10 loop]
Allocating 1048576 bytes
Time cost with 10 loop for consolidating: 5702 millis

[test consolidate 1000 buffers each with 1m, 50% used for 50 loop]
Allocating 524288 bytes
Time cost with 50 loop for consolidating: 4799 millis

[test consolidate 1000 buffers each with 1m, 100% used for 50 loop]
Allocating 1048576 bytes
Time cost with 50 loop for consolidating: 28440 millis

[test consolidate 100 buffers each with 10m, 50% used for 1 loop]
Allocating 5242880 bytes
Time cost with 1 loop for consolidating: 96 millis

[test consolidate 100 buffers each with 10m, 100% used for 1 loop]
Allocating 10485760 bytes
Time cost with 1 loop for consolidating: 571 millis

[test consolidate 100 buffers each with 10m, 50% used for 10 loop]
Allocating 5242880 bytes
Time cost with 10 loop for consolidating: 940 millis

[test consolidate 100 buffers each with 10m, 100% used for 10 loop]
Allocating 10485760 bytes
Time cost with 10 loop for consolidating: 5727 millis

[test consolidate 100 buffers each with 10m, 50% used for 50 loop]
Allocating 5242880 bytes
Time cost with 50 loop for consolidating: 4739 millis

[test consolidate 100 buffers each with 10m, 100% used for 50 loop]
Allocating 10485760 bytes
Time cost with 50 loop for consolidating: 28356 millis

[test consolidate 20 buffers each with 50m, 50% used for 1 loop]
Allocating 26214400 bytes
Time cost with 1 loop for consolidating: 96 millis

[test consolidate 20 buffers each with 50m, 100% used for 1 loop]
Allocating 52428800 bytes
Time cost with 1 loop for consolidating: 577 millis

[test consolidate 20 buffers each with 50m, 50% used for 10 loop]
Allocating 26214400 bytes
Time cost with 10 loop for consolidating: 966 millis

[test consolidate 20 buffers each with 50m, 100% used for 10 loop]
Allocating 52428800 bytes
Time cost with 10 loop for consolidating: 5731 millis

[test consolidate 20 buffers each with 50m, 50% used for 50 loop]
Allocating 26214400 bytes
Time cost with 50 loop for consolidating: 4826 millis

[test consolidate 20 buffers each with 50m, 100% used for 50 loop]
Allocating 52428800 bytes
Time cost with 50 loop for consolidating: 28734 millis

[test consolidate 10 buffers each with 100m, 50% used for 1 loop]
Allocating 52428800 bytes
Time cost with 1 loop for consolidating: 98 millis

[test consolidate 10 buffers each with 100m, 100% used for 1 loop]
Allocating 104857600 bytes
Time cost with 1 loop for consolidating: 576 millis

[test consolidate 10 buffers each with 100m, 50% used for 10 loop]
Allocating 52428800 bytes
Time cost with 10 loop for consolidating: 965 millis

[test consolidate 10 buffers each with 100m, 100% used for 10 loop]
Allocating 104857600 bytes
Time cost with 10 loop for consolidating: 6122 millis

[test consolidate 10 buffers each with 100m, 50% used for 50 loop]
Allocating 52428800 bytes
Time cost with 50 loop for consolidating: 5228 millis

[test consolidate 10 buffers each with 100m, 100% used for 50 loop]
Allocating 104857600 bytes
Time cost with 50 loop for consolidating: 30797 millis

liupc (Author) commented Feb 2, 2019

It seems we can get huge memory savings with little time spent (at most ~500 millis for a 1GB shuffle).

This approach has several advantages:

  1. For small shuffle blocks, consolidation is never triggered because they don't reach the threshold, so small applications stay as fast as they are today.
  2. For large shuffle blocks, consolidation saves a huge amount of memory, and in some cases it can avoid a direct-memory OOM because the ByteBufs are consolidated early. So it also benefits large applications.

liupc (Author) commented Feb 2, 2019

cc @vanzin @liyezhang556520

return conf.getBoolean(SPARK_NETWORK_IO_PREFERDIRECTBUFS_KEY, true);
}

/** The threshold for consolidation, it is derived upon the memoryOverhead in yarn mode. */
srowen (Member):

This replicates a lot of logic from elsewhere with hard-coded constants. Is it really important to vary it so finely and add a whole new conf? Whether consolidating a buffer of size X is worthwhile seems like it ought to be pretty independent of the environment.

liupc (Author):

@srowen Thanks, yes, I think you are right; we can just make it a fixed factor of the frame size.

liupc (Author) commented Feb 8, 2019

After refining, the perf tests show that consolidation works well for reading a 1GB block with little extra memory (using consolidate(cIndex, numComponents) to avoid re-consolidating components that were already consolidated).
Here are the newest test results; 20MiB seems to be enough for the threshold.

Results with my laptop on low battery:

Build frame buf with consolidation threshold 1048576 cost 6057 millis
Build frame buf with consolidation threshold 5242880 cost 4899 millis
Build frame buf with consolidation threshold 10485760 cost 2809 millis
Build frame buf with consolidation threshold 20971520 cost 2762 millis
Build frame buf with consolidation threshold 31457280 cost 3074 millis
Build frame buf with consolidation threshold 52428800 cost 2399 millis
Build frame buf with consolidation threshold 83886080 cost 4010 millis
Build frame buf with consolidation threshold 104857600 cost 2808 millis
Build frame buf with consolidation threshold 314572800 cost 4150 millis
Build frame buf with consolidation threshold 524288000 cost 2519 millis
Build frame buf with consolidation threshold 9223372036854775807 cost 1664 millis // no consolidation

Results with my laptop on normal battery:

Build frame buf with consolidation threshold 1048576 cost 3356 millis
Build frame buf with consolidation threshold 5242880 cost 2075 millis
Build frame buf with consolidation threshold 10485760 cost 1948 millis
Build frame buf with consolidation threshold 20971520 cost 1421 millis
Build frame buf with consolidation threshold 31457280 cost 1371 millis
Build frame buf with consolidation threshold 52428800 cost 1286 millis
Build frame buf with consolidation threshold 83886080 cost 1769 millis
Build frame buf with consolidation threshold 104857600 cost 2068 millis
Build frame buf with consolidation threshold 314572800 cost 2448 millis
Build frame buf with consolidation threshold 524288000 cost 2317 millis
Build frame buf with consolidation threshold 9223372036854775807 cost 1072 millis

// to reduce memory consumption.
if (frameBuf.capacity() - consolidatedFrameBufSize > consolidateFrameBufsDeltaThreshold) {
int newNumComponents = frameBuf.numComponents() - consolidatedNumComponents;
frameBuf.consolidate(consolidatedNumComponents, newNumComponents);
Contributor:

The logic here seems correct, but how is this different from just calling frameBuf.consolidate(), without having to keep track of the component count in this class?

liupc (Author) replied on Feb 9, 2019:

The no-argument consolidate() re-consolidates components that are already consolidated (i.e. there is always a single component left after consolidation), which is slow and wastes memory. However, consolidate(cIndex, numComponents) only consolidates the specified new components.

For instance, say we add 10 components and do a first consolidation, leaving one consolidated component. If we use consolidate(cIndex, numComponents) here, then the next time we consolidate, after another 10 components have been added, we do not need to re-consolidate the components that were already consolidated (no extra memory allocation and copy).
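
To make the difference concrete, here is a minimal standalone sketch (my own, against the netty 4.x API) of this incremental consolidation: each pass merges only the components added since the previous pass, leaving earlier consolidated components untouched:

  import io.netty.buffer.CompositeByteBuf;
  import io.netty.buffer.Unpooled;

  public class IncrementalConsolidationDemo {
    public static void main(String[] args) {
      CompositeByteBuf frame = Unpooled.compositeBuffer(Integer.MAX_VALUE);
      int consolidatedNumComponents = 0; // components already merged in earlier passes

      for (int pass = 0; pass < 2; pass++) {
        // Simulate ten network reads arriving for this frame.
        for (int i = 0; i < 10; i++) {
          frame.addComponent(true, Unpooled.buffer(1024).writerIndex(1024));
        }
        // Merge only the components added since the last consolidation.
        int newComponents = frame.numComponents() - consolidatedNumComponents;
        frame.consolidate(consolidatedNumComponents, newComponents);
        consolidatedNumComponents = frame.numComponents();
        System.out.println("components after pass " + pass + ": " + frame.numComponents());
      }
      // Prints 1, then 2: the first pass's result is never copied again.
      frame.release();
    }
  }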

Contributor:

Ok, so in the end you don't end up with a single component, but with many components of size CONSOLIDATE_THRESHOLD each (minus the last one). I thought I saw the tests checking for a single component after consolidation, but may have misread.

liupc (Author):

Yes, that's it.


// Reset buf and size for next frame.
ByteBuf frameBufCopy = frameBuf.duplicate();
frameBuf = null;
Contributor:

To follow up on Sean's question: aren't you leaking frameBuf here now? You're returning a duplicate and not releasing this instance to decrement its ref count.

(Another way to say it: returning the buffer itself is probably the right thing.)

liupc (Author) replied on Feb 9, 2019:

No, frameBuf.duplicate() creates a derived buffer that shares the memory region of the parent buffer. A derived buffer does not have its own reference count; it shares the reference count of its parent.

Here we can return a local variable referring to the frameBuf object and null out frameBuf for the next frame's decoding.
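
A minimal sketch (my own, netty 4.x) of the shared reference count in action:

  import io.netty.buffer.ByteBuf;
  import io.netty.buffer.Unpooled;

  public class DuplicateRefCntDemo {
    public static void main(String[] args) {
      ByteBuf parent = Unpooled.buffer(16);
      ByteBuf dup = parent.duplicate();    // derived view, no ref count of its own
      System.out.println(parent.refCnt()); // 1
      System.out.println(dup.refCnt());    // 1 -- same counter as the parent
      dup.release();                       // decrements the shared count
      System.out.println(parent.refCnt()); // 0 -- the parent is released too
    }
  }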

private int frameRemainingBytes = UNKNOWN_FRAME_SIZE;
private volatile Interceptor interceptor;

public TransportFrameDecoder() {
Contributor:

I thought you were going to make this configurable. Where are you reading the value from the configuration?

liupc (Author):

Now I think maybe we can just make it a fixed value; users are unlikely to change this threshold in most cases, and it requires little memory, as shown in the newest test reports.

vanzin (Contributor) commented Feb 11, 2019

ok to test

SparkQA commented Feb 11, 2019

Test build #102205 has finished for PR 23602 at commit 3fb7484.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

@Test
public void testConsolidationForDecodingNonFullyWrittenByteBuf() {
Contributor:

If I understand correctly, this is testing that consolidation reduces the amount of memory needed to hold a frame? But since you're writing just 1 MB to the decoder, that doesn't trigger consolidation, does it?

Playing with CompositeByteBuf, it adjusts the internal capacity based on the readable bytes of the components, but the component buffers remain unchanged, so they still hold on to the original amount of memory:

scala> cb.numComponents()
res4: Int = 2

scala> cb.capacity()
res5: Int = 8

scala> cb.component(0).capacity()
res6: Int = 1048576

So I'm not sure this test is testing anything useful.

Also it would be nice not to use so many magic numbers.

liupc (Author):

@vanzin I think the test should be refined, but I have a question about your test:
CompositeByteBuf.capacity() returns the last component's endOffset, so I think using the capacity for the test is OK.
https://github.com/netty/netty/blob/8fecbab2c56d3f49d0353d58ee1681f3e6d3feca/buffer/src/main/java/io/netty/buffer/CompositeByteBuf.java#L730

Contributor:

Maybe my question wasn't clear. I'm asking what part of the Spark code this test is testing.

As far as I can see, it's testing netty code, and these are not netty unit tests.

liupc (Author):

@vanzin I think this test largely duplicates testConsolidationPerf; we can just remove it. I will update soon. Sorry for that.

while (frameRemainingBytes > 0 && !buffers.isEmpty()) {
ByteBuf next = nextBufferForFrame(frameRemainingBytes);
frameRemainingBytes -= next.readableBytes();
frameBuf.addComponent(next).writerIndex(frameBuf.writerIndex() + next.readableBytes());
Contributor:

Not sure if that's a new call, but this can be frameBuf.addComponent(true, next).

liupc (Author):

@vanzin This is a copy of existing code. I can replace it with what you suggested.
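
For clarity, a small sketch (illustrative only) showing the two styles produce the same readable content:

  import io.netty.buffer.ByteBuf;
  import io.netty.buffer.CompositeByteBuf;
  import io.netty.buffer.Unpooled;

  public class AddComponentDemo {
    public static void main(String[] args) {
      ByteBuf next = Unpooled.copiedBuffer(new byte[] {1, 2, 3});

      // Existing style: add the component, then manually bump the writer index.
      CompositeByteBuf a = Unpooled.compositeBuffer();
      a.addComponent(next.retainedDuplicate())
          .writerIndex(a.writerIndex() + next.readableBytes());

      // Suggested style: let addComponent advance the writer index itself.
      CompositeByteBuf b = Unpooled.compositeBuffer();
      b.addComponent(true, next.retainedDuplicate());

      System.out.println(a.readableBytes() + " == " + b.readableBytes()); // 3 == 3
      a.release();
      b.release();
      next.release();
    }
  }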

SparkQA commented Feb 14, 2019

Test build #102351 has finished for PR 23602 at commit ef63cdb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 15, 2019

Test build #102377 has finished for PR 23602 at commit 449efed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Reset buf and size for next frame.
ByteBuf frame = frameBuf;
frameBuf = null;
nextFrameSize = UNKNOWN_FRAME_SIZE;
Contributor:

You have to reset consolidatedFrameBufSize and consolidatedNumComponents back to 0 for the next frame buffer.

Otherwise, after a very large frame, the smaller (but still quite large) frames that follow are not consolidated at all. And when consolidation does kick in for a frame bigger than the maximum seen so far, only the components beyond the previous maximum get consolidated.
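
A hedged sketch of the fix being asked for (the field names follow the snippets quoted in this thread; the surrounding class is illustrative, not the actual decoder):

  import io.netty.buffer.ByteBuf;
  import io.netty.buffer.CompositeByteBuf;

  class FrameStateSketch {
    private static final long UNKNOWN_FRAME_SIZE = -1;
    private CompositeByteBuf frameBuf;
    private long nextFrameSize = UNKNOWN_FRAME_SIZE;
    private int consolidatedFrameBufSize;
    private int consolidatedNumComponents;

    ByteBuf completeFrame() {
      ByteBuf frame = frameBuf;
      // Reset all per-frame state, including the consolidation bookkeeping,
      // so the next frame starts from a clean slate.
      frameBuf = null;
      nextFrameSize = UNKNOWN_FRAME_SIZE;
      consolidatedFrameBufSize = 0;
      consolidatedNumComponents = 0;
      return frame;
    }
  }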

liupc (Author):

@attilapiros Good catch! Thank you so much! I will fix it.

liupc (Author):

@attilapiros done!

Contributor:

I see you fixed this, but it should have been caught by unit tests. So there's probably a check missing in your tests (expected number of components?).

Contributor:

I think it's not a check for the expected number of components that's missing, but testing with multiple messages. Right now within the loop body, where a new TransportFrameDecoder is also created, only one 1GB message is sent.

liupc (Author):

Yes, I can add some code to test multiple messages; we just need to do the same check on the consolidated buf capacity. I think that is more result-oriented.

SparkQA commented Feb 15, 2019

Test build #102393 has finished for PR 23602 at commit 5f8c4eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 16, 2019

Test build #102409 has finished for PR 23602 at commit 3aad18a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

totalBytesGot += buf.capacity();
}
assertEquals(numMessages, retained.size());
assertEquals(targetBytes * numMessages, totalBytesGot);
Contributor:

Does this mean this test now requires 3GB of memory just to store the data it's checking?

That seems wasteful. Either change the test to do checks after each separate message is written, or lower the size of the messages.

liupc (Author):

Done!

decoder.channelRead(ctx, buf);
while (writtenBytes < targetBytes) {
buf = Unpooled.buffer(pieceBytes * 2);
ByteBuf writtenBuf = Unpooled.buffer(pieceBytes).writerIndex(pieceBytes);
Contributor:

Just wanted to point out you're counting this allocation time in your performance measurement, which isn't optimal.

liupc (Author):

Done, thank you @vanzin

SparkQA commented Feb 20, 2019

Test build #102552 has finished for PR 23602 at commit 6ca6f71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

vanzin (Contributor) commented Feb 25, 2019

retest this please

SparkQA commented Feb 25, 2019

Test build #102758 has finished for PR 23602 at commit 6ca6f71.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

vanzin (Contributor) commented Feb 25, 2019

retest this please

vanzin (Contributor) commented Feb 25, 2019

Looks good pending tests (the last run failed with an unrelated issue that should now be fixed).

SparkQA commented Feb 26, 2019

Test build #102760 has finished for PR 23602 at commit 6ca6f71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

vanzin (Contributor) commented Feb 26, 2019

Merging to master.

vanzin closed this in 52a180f on Feb 26, 2019