[SPARK-19659] Fetch big blocks to disk when shuffle-read. #16989
Conversation
Hi @jinxing64 I posted a comment on JIRA about the design -- I think this is a big enough change that it's worth discussing the design first. It's fine to keep working on the code as a demonstration if you want, but for now I'd ask that you label this a work-in-progress "[WIP]". (I personally have only briefly glanced at the code and am unlikely to look more closely till we sort out the design issues.) FWIW I think this will be a great feature, we just need to be thoughtful about it.
Jenkins, add to whitelist

Test build #73224 has finished for PR 16989 at commit

@squito

@rxin @squito @davies @andrewor14 @JoshRosen

Test build #75358 has finished for PR 16989 at commit

Test build #75435 has finished for PR 16989 at commit

Test build #75436 has finished for PR 16989 at commit

Test build #75441 has finished for PR 16989 at commit

Test build #75834 has finished for PR 16989 at commit

Jenkins, test this please

Remove the protected modifier and make this visible for tests.
Test build #75855 has finished for PR 16989 at commit

Test build #75853 has finished for PR 16989 at commit

Test build #75858 has finished for PR 16989 at commit

Test build #75883 has finished for PR 16989 at commit

Test build #75935 has finished for PR 16989 at commit

Test build #75938 has finished for PR 16989 at commit

Ideally we should use DiskBlockManager.getFile to store data in the file system.
Yes, I wanted to use DiskBlockManager.getFile, but I found it's hard to import DiskBlockManager from OneForOneBlockFetcher.
SparkEnv.get.blockManager.diskBlockManager
@cloud-fan
Yes, but OneForOneBlockFetcher is in the network-shuffle package, and I find it hard to import SparkEnv from the core package. Did I miss something? (Sorry if this is a stupid question.)
Instead of passing a boolean fetchToDisk, shall we ask the caller to pass in an Option<File> file? My concern is that Spark has rules about where to write temp files; we can't just write to the current directory.
@cloud-fan
Understood ~
I will refine this and replace the boolean fetchToDisk with Option<File[]> shuffleFilesOpt.
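
To illustrate the direction agreed on here, a minimal Scala sketch (the helper name and types are illustrative, not Spark's actual OneForOneBlockFetcher API): the caller decides where temporary files live, and the fetcher only sees an optional target file.

```scala
import java.io.{File, FileOutputStream}
import java.nio.ByteBuffer
import java.nio.channels.Channels

// Hypothetical helper, not part of OneForOneBlockFetcher: if the caller supplies a
// target file the chunk is streamed to disk, otherwise it stays in memory as before.
def handleReceivedChunk(buf: ByteBuffer, targetFile: Option[File]): Option[ByteBuffer] = {
  targetFile match {
    case Some(file) =>
      val channel = Channels.newChannel(new FileOutputStream(file, true))
      try channel.write(buf) finally channel.close()
      None // the data now lives on disk at `file`
    case None =>
      Some(buf) // fetch-to-memory path unchanged
  }
}
```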
Shall we remove the partially written file when the fetch fails?
Yes, that would be good!
add parameter doc for this
Yes, I will refine.
How much overhead is there when serializing the hash map with Kryo?
can we move this into ShuffleBlockFetcherIterator?
Yes, ideally this should be moved into ShuffleBlockFetcherIterator, but I didn't find a better implementation other than:

    extends MemoryConsumer(tmm, tmm.pageSizeBytes(),
      if (SparkTransportConf.fromSparkConf(SparkEnv.get.conf, "shuffle").preferDirectBufs()) {
        MemoryMode.OFF_HEAP
      } else {
        MemoryMode.ON_HEAP
      })

I don't think the above looks good, and I'd be a little hesitant to expose a setMode in MemoryConsumer.
Test build #77316 has finished for PR 16989 at commit

Test build #77319 has finished for PR 16989 at commit

Test build #77321 has finished for PR 16989 at commit

    val diskBlockManager = mock(classOf[DiskBlockManager])
    doReturn {
      var blockId = new TempLocalBlockId(UUID.randomUUID())
nit: can be val
.. sorry for nit ...
good job! merging to master/2.2!

## What changes were proposed in this pull request?

Currently the whole block is fetched into memory (off-heap by default) when shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can be large in skew situations. If OOM happens during shuffle read, the job is killed and users are notified to "Consider boosting spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more memory can resolve the OOM, but that approach is not well suited to production environments, especially data warehouses.

Using Spark SQL as the data engine in a warehouse, users hope to have a unified parameter (e.g. memory) with less wasted resource (resource that is allocated but not used). The need is especially strong when migrating the data engine to Spark from another one (e.g. Hive); tuning the parameter for thousands of SQLs one by one is very time consuming.

Skew situations are not always easy to predict. When they happen, it makes sense to fetch remote blocks to disk for shuffle-read rather than kill the job because of OOM. In this PR, I propose to fetch big blocks to disk (which is also mentioned in SPARK-3019):

1. Track the average size and also the outliers (which are larger than 2*avgSize) in MapStatus;
2. Request memory from `MemoryManager` before fetching blocks and release the memory to `MemoryManager` when the `ManagedBuffer` is released;
3. Fetch remote blocks to disk when acquiring memory from `MemoryManager` fails, otherwise fetch to memory.

This is an improvement of memory control when shuffling blocks and helps to avoid OOM in scenarios like the following:

1. A single huge block;
2. The sizes of many blocks are underestimated in `MapStatus` and the actual footprint of the blocks is much larger than estimated.

## How was this patch tested?

Added unit tests in `MapStatusSuite` and `ShuffleBlockFetcherIteratorSuite`.

Author: jinxing <[email protected]>

Closes #16989 from jinxing64/SPARK-19659.

(cherry picked from commit 3f94e64)
Signed-off-by: Wenchen Fan <[email protected]>
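
To make step 1 above concrete, here is a hedged Scala sketch of the idea (a simplified stand-in, not Spark's actual MapStatus implementation): exact sizes are kept only for blocks larger than twice the average, and every other block is reported as the average.

```scala
// Simplified stand-in for the MapStatus change: block sizes larger than 2 * average
// are tracked exactly; all other blocks are reported as the average size.
case class CompressedSizes(avgSize: Long, hugeBlockSizes: Map[Int, Long]) {
  def sizeFor(reduceId: Int): Long = hugeBlockSizes.getOrElse(reduceId, avgSize)
}

def compressSizes(uncompressed: Array[Long]): CompressedSizes = {
  val nonEmpty = uncompressed.filter(_ > 0)
  val avg = if (nonEmpty.nonEmpty) nonEmpty.sum / nonEmpty.length else 0L
  val huge = uncompressed.zipWithIndex
    .collect { case (size, reduceId) if size > 2 * avg => reduceId -> size }
    .toMap
  CompressedSizes(avg, huge)
}
```

The reduce side can then use a size lookup like `sizeFor` to decide whether a block is likely too big to fetch into memory.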
@cloud-fan @JoshRosen @mridulm @squito @viirya

    // Shuffle remote blocks to disk when the request is too large.
    // TODO: Encryption and compression should be considered.
Could you expand on what the TODO here is? I want to make sure this doesn't slip through the cracks and become forgotten.
Actually I'm just going to send a follow-up PR. Ideally all local files written by Spark could be encrypted and compressed according to config. One example is UnsafeSorterSpillWriter: it writes data with DiskBlockObjectWriter, which calls SerializerManager.wrapStream and handles encryption and compression automatically.
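
To make the wrap-the-output-stream idea concrete, here is a hedged, self-contained Scala sketch using plain JDK classes (GZIP and a throwaway AES key are stand-ins; Spark's actual SerializerManager.wrapStream plugs in the configured codec and IO-encryption key instead):

```scala
import java.io.OutputStream
import java.util.zip.GZIPOutputStream
import javax.crypto.{Cipher, CipherOutputStream, KeyGenerator}

// Illustration only: real code would reuse Spark's configured codec and key material
// (which also has to be kept somewhere so the spill can be read back later).
def wrapForWrite(sink: OutputStream): OutputStream = {
  val key = KeyGenerator.getInstance("AES").generateKey()
  val cipher = Cipher.getInstance("AES")
  cipher.init(Cipher.ENCRYPT_MODE, key)
  // Bytes are compressed first, then encrypted on their way to the underlying sink.
  new GZIPOutputStream(new CipherOutputStream(sink, cipher))
}
```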
I haven't really followed this review (sorry), but shuffle data is transmitted encrypted and compressed over the wire, so there might be a chance that there's nothing to do here.
ah that's a good point! Yea we don't need to encrypt and compress the data again here. I'll update this comment.
One question: do we need to encrypt and compress the data for sort buffer spill and aggregate buffer spill? cc @JoshRosen
do we need to encrypt and compress the data for sort buffer spill and aggregate buffer spill?
Yes, but I thought I had done that in a previous change. Maybe I missed something.
…read

## What changes were proposed in this pull request?

This PR includes some minor improvements for the comments and tests in #16989.

## How was this patch tested?

N/A

Author: Wenchen Fan <[email protected]>

Closes #18117 from cloud-fan/follow.

(cherry picked from commit 1d62f8a)
Signed-off-by: Wenchen Fan <[email protected]>
…uffle.

In current code (#16989), big blocks are shuffled to disk. This PR proposes to collect metrics for remote bytes fetched to disk.

Author: jinxing <[email protected]>

Closes #18249 from jinxing64/SPARK-19937.
      return nextChunk;
    }

    @Override
@jinxing64 this breaks the old shuffle service. We should avoid changing server-side code. For now I've just disabled this feature in #18467.
Thanks, I will try to make a PR as soon as possible.
    public DownloadCallback(File targetFile, int chunkIndex) throws IOException {
      this.targetFile = targetFile;
      this.channel = Channels.newChannel(new FileOutputStream(targetFile));
Does this work with RetryingBlockFetcher? Let's say we have 2 chunks: "chunk 1", "chunk 2". If "chunk 1" fails, it will fail "chunk 2" as well. However, DownloadCallbacks for "chunk 2" are still running. In this case, RetryingBlockFetcher will retry "chunk 2" as well. Hence, there will be 2 DownloadCallbacks writing to the same file.
One possible fix is writing to a temp file and renaming it to the target file.
👍
I will make a PR today for this.
@zsxwing @cloud-fan
OneForOneBlockFetcher "opens blocks" asynchronously. If I understand correctly, the retry of start() in OneForOneBlockFetcher is only triggered on a failure to send OpenBlocks, and a failure while fetching a chunk cannot trigger the retry in RetryingBlockFetcher. DownloadCallback is not initialized when the "open blocks" failure happens, so there cannot be two DownloadCallbacks for the same stream working at the same time.
@jinxing64 The retry logic is here:
Line 215 in 88a536b:

    if (shouldRetry(exception)) {

The issue is that there will be two DownloadCallbacks downloading the same content to the same target file. When the first one finishes, ShuffleBlockFetcherIterator may start to read it while the second DownloadCallback is still running and writing to the target file. That could cause ShuffleBlockFetcherIterator to read a partial result.
Pardon, I can hardly believe there are two DownloadCallbacks downloading the same content to the same target file. In my understanding:

- When RetryingBlockFetcher retries, no DownloadCallback has been initialized;
- When fetching a chunk fails, the retry from RetryingBlockFetcher will not be triggered.
@zsxwing
Sorry, I just realized this issue. There can be a conflict between two DownloadCallbacks. I will figure out a way to resolve it.
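
A hedged Scala sketch of the temp-file-then-rename fix suggested above (commitDownload is an illustrative helper, not the actual follow-up PR): each download attempt writes to its own temporary file, and only a completed attempt is moved to the target path.

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Each download attempt writes to its own temp file, so concurrent retries never
// write to the same target path; only a finished attempt is renamed into place.
def commitDownload(targetFile: File)(write: File => Unit): Unit = {
  val tmp = File.createTempFile(targetFile.getName, ".tmp", targetFile.getParentFile)
  try {
    write(tmp)
    Files.move(tmp.toPath, targetFile.toPath, StandardCopyOption.REPLACE_EXISTING)
  } finally {
    tmp.delete() // no-op if the move already succeeded
  }
}
```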
    @Override
    public void onData(String streamId, ByteBuffer buf) throws IOException {
      channel.write(buf);
I am super-late on reviewing this, apologies; just asking questions for my own understanding and to consider possible future improvements -- this won't do a zero-copy transfer, will it? That ByteBuffer is still in user space?
From my understanding, we'd need to do special handling to use netty's spliceTo when possible:
https://stackoverflow.com/questions/30322957/is-there-transferfrom-like-functionality-in-netty-for-zero-copy
but I'm still working on putting all the pieces together here and admittedly this is out of my area of expertise
@squito This is a Java Channel. Not sure how to call io.netty.channel.epoll.AbstractEpollStreamChannel.spliceTo here.
By the way, I think this is a zero-copy transfer since the underlying buffer is an off heap buffer.
Anyway, I found a bug here...
Right, I realize there isn't a simple one-line change here to switch to using spliceTo; I was just wondering what the behavior is.
I actually thought zero-copy and off-heap were orthogonal -- any time netty gives you direct access to bytes, it has to be copied to user space, right?
@squito You are right. It needs a copy between user space and kernel space.
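
For context on the linked question, the closest JDK-level analogue is FileChannel.transferFrom; a hedged sketch follows (downloadTo is illustrative, not this PR's code). Whether the user-space copy is actually avoided depends on what backs the source channel.

```scala
import java.io.{File, FileOutputStream}
import java.nio.channels.ReadableByteChannel

// Sketch only: copies `count` bytes from `source` into `target`. The JVM can use an
// in-kernel fast path when the source is another FileChannel; with a Netty-supplied
// ByteBuffer (as in this callback) the bytes are already in user space, so there is
// no such saving to be had.
def downloadTo(source: ReadableByteChannel, target: File, count: Long): Long = {
  val out = new FileOutputStream(target).getChannel
  try out.transferFrom(source, 0L, count) finally out.close()
}
```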
## What changes were proposed in this pull request?

This is a followup of #16989. The fetch-big-block-to-disk feature is disabled by default because it's not compatible with external shuffle services prior to Spark 2.2: the client sends a stream request to fetch block chunks, and the old shuffle service can't support it. After 2 years, Spark 2.2 has reached EOL, and now it's safe to turn this feature on by default.

## How was this patch tested?

Existing tests.

Closes #23625 from cloud-fan/minor.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?

Currently the whole block is fetched into memory (off-heap by default) when shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can be large in skew situations. If OOM happens during shuffle read, the job is killed and users are notified to "Consider boosting spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more memory can resolve the OOM, but that approach is not well suited to production environments, especially data warehouses.

Using Spark SQL as the data engine in a warehouse, users hope to have a unified parameter (e.g. memory) with less wasted resource (resource that is allocated but not used). The need is especially strong when migrating the data engine to Spark from another one (e.g. Hive); tuning the parameter for thousands of SQLs one by one is very time consuming.

Skew situations are not always easy to predict. When they happen, it makes sense to fetch remote blocks to disk for shuffle-read rather than kill the job because of OOM. In this PR, I propose to fetch big blocks to disk (which is also mentioned in SPARK-3019):

1. Track the average size and also the outliers (which are larger than 2*avgSize) in `MapStatus`;
2. Request memory from `MemoryManager` before fetching blocks and release the memory to `MemoryManager` when the `ManagedBuffer` is released;
3. Fetch remote blocks to disk when acquiring memory from `MemoryManager` fails, otherwise fetch to memory.

This is an improvement of memory control when shuffling blocks and helps to avoid OOM in scenarios like the following (see the sketch after this list for steps 2 and 3):

1. A single huge block;
2. The sizes of many blocks are underestimated in `MapStatus` and the actual footprint of the blocks is much larger than estimated.

How was this patch tested?

Added unit tests in `MapStatusSuite` and `ShuffleBlockFetcherIteratorSuite`.
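
A hedged Scala sketch of steps 2 and 3 above (the names are illustrative; the real code uses Spark's MemoryManager and ShuffleBlockFetcherIterator): reserve the block's estimated size before fetching, fall back to a disk fetch when the reservation fails, and give the reservation back when the buffer is released.

```scala
import java.util.concurrent.atomic.AtomicLong

// Toy memory accounting standing in for Spark's MemoryManager: a reservation is
// granted only if it fits under a fixed cap, and callers give it back when the
// corresponding buffer is released.
class FetchMemoryPool(maxBytes: Long) {
  private val used = new AtomicLong(0L)
  def tryAcquire(bytes: Long): Boolean = {
    val after = used.addAndGet(bytes)
    if (after <= maxBytes) true else { used.addAndGet(-bytes); false }
  }
  def release(bytes: Long): Unit = used.addAndGet(-bytes)
}

sealed trait FetchTarget
case object ToMemory extends FetchTarget
case object ToDisk extends FetchTarget

// Per block: fetch to memory if the estimated size can be reserved, otherwise
// stream the block to a file on disk.
def chooseTarget(pool: FetchMemoryPool, estimatedSize: Long): FetchTarget =
  if (pool.tryAcquire(estimatedSize)) ToMemory else ToDisk
```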