
Conversation

@Ngone51 (Member) commented Apr 22, 2021

What changes were proposed in this pull request?

This PR proposes a workaround to address the Netty OOM issue (SPARK-24989, SPARK-27991):

Basically, ShuffleBlockFetcherIterator would catch the OutOfDirectMemoryError from Netty and then set a global flag for the shuffle module. While the flag is set, any pending fetch requests are deferred as long as there are still in-flight requests, and the flag is unset once a fetch request succeeds.

Note that catching the Netty OOM rather than aborting the application is feasible because Netty manages its own memory region (off-heap by default) separately, so a Netty OOM does not mean Spark itself is short of memory.
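
As a rough illustration, here is a minimal sketch of the catch-and-defer idea, assuming a simplified global flag and a plain queue of deferred sends; the names and structure are illustrative only and do not mirror the actual ShuffleBlockFetcherIterator code:

```scala
// Minimal sketch only: simplified names/structure, not Spark's actual implementation.
import java.util.concurrent.ConcurrentLinkedQueue

import io.netty.util.internal.OutOfDirectMemoryError

object ShuffleNettyOomSketch {
  // Global flag shared by all shuffle fetches in this executor (simplified).
  @volatile private var isNettyOOMOnShuffle: Boolean = false

  // Fetch requests put aside while Netty is out of direct memory.
  private val deferredRequests = new ConcurrentLinkedQueue[() => Unit]()

  // On fetch failure: swallow only Netty's direct-memory OOM, and only while
  // other requests are still in flight so progress remains possible.
  def onFetchFailure(cause: Throwable, numRequestsInFlight: Int): Unit = cause match {
    case _: OutOfDirectMemoryError if numRequestsInFlight > 0 =>
      isNettyOOMOnShuffle = true   // defer new sends until memory frees up
    case other =>
      throw other                  // anything else still fails the fetch
  }

  // On fetch success: Netty has released memory, so unset the flag and
  // replay the deferred requests.
  def onFetchSuccess(): Unit = {
    isNettyOOMOnShuffle = false
    var send = deferredRequests.poll()
    while (send != null) {
      send()
      send = deferredRequests.poll()
    }
  }

  // Before sending a request: defer it while the flag is set.
  def sendOrDefer(send: () => Unit): Unit =
    if (isNettyOOMOnShuffle) deferredRequests.offer(send) else send()
}
```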

Why are the changes needed?

The Netty OOM issue is a corner case. It usually happens in large-scale clusters, where a reduce task can fetch shuffle blocks from hundreds of nodes concurrently within a short time. Internally, we found a cluster that had created 260+ clients within 6 seconds before throwing the Netty OOM.

Although Spark has configurations, e.g., spark.reducer.maxReqsInFlight, to tune the number of concurrent requests, it is usually not an easy decision for the user to set a reasonable value given the workloads, machine resources, etc. With this fix, Spark heals the Netty memory issue itself without any specific configuration.
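
For reference, the existing knobs mentioned above are set like any other Spark configuration; the values below are purely illustrative, not recommendations:

```scala
// Illustrative values only; suitable settings depend on workload and cluster size,
// which is exactly why a self-healing mechanism is preferable.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.reducer.maxReqsInFlight", "64")              // cap concurrent fetch requests
  .set("spark.reducer.maxBlocksInFlightPerAddress", "128") // cap blocks fetched per remote host
```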

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests.

@Ngone51 (Member, Author) commented Apr 22, 2021

cc @mridulm @tgravescs @attilapiros could you take a look? thanks!

@mridulm (Contributor) commented Apr 22, 2021

+CC @otterc

@github-actions bot added the CORE label Apr 22, 2021

@mridulm (Contributor) left a comment

Took an initial pass, looks like an interesting issue @Ngone51!
Thanks for working on it.

@SparkQA commented Apr 22, 2021

Test build #137776 has finished for PR 32287 at commit 972bd33.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42306/

@SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42306/

@SparkQA commented Apr 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42384/

@SparkQA commented Apr 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42384/

@SparkQA commented Apr 23, 2021

Test build #137854 has finished for PR 32287 at commit b941e72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member, Author) commented Apr 27, 2021

In the latest update, there are two major changes (sketched in the code below):

  1. Only unset isNettyOOMOnShuffle when NettyUtils.freeDirectMemory() > averageRemoteBlockSize.

  2. Defer blocks in batches; see enqueueDeferredFetchRequestIfNecessary. (cc @cloud-fan)
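
A hedged sketch of these two behaviors, with simplified names and types; freeDirectMemory stands in for the NettyUtils.freeDirectMemory() value mentioned above, and the real logic in ShuffleBlockFetcherIterator differs in detail:

```scala
// Simplified sketch only; placeholder types and names, not Spark's actual code.
import scala.collection.mutable

class DeferredFetchSketch(averageRemoteBlockSize: Long) {
  @volatile private var isNettyOOMOnShuffle: Boolean = false
  private val deferredFetchRequests = new mutable.Queue[() => Unit]()

  // (1) Unset the flag only when Netty reports enough free direct memory to
  //     hold at least one average-sized remote block.
  def maybeResetNettyOOMFlag(freeDirectMemory: Long): Unit = synchronized {
    if (isNettyOOMOnShuffle && freeDirectMemory > averageRemoteBlockSize) {
      isNettyOOMOnShuffle = false
    }
  }

  // (2) Once the flag is cleared, release all deferred requests as a batch
  //     rather than one at a time.
  def enqueueDeferredFetchRequestsIfNecessary(): Unit = synchronized {
    if (!isNettyOOMOnShuffle) {
      while (deferredFetchRequests.nonEmpty) {
        val send = deferredFetchRequests.dequeue()
        send()
      }
    }
  }

  // Park a request while Netty is out of direct memory.
  def defer(send: () => Unit): Unit = synchronized {
    deferredFetchRequests.enqueue(send)
  }
}
```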

@SparkQA commented Apr 27, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42517/

@SparkQA commented Apr 27, 2021

Test build #137997 has finished for PR 32287 at commit 16b4065.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DeferFetchRequestResult(fetchRequest: FetchRequest) extends FetchResult

@SparkQA commented Apr 29, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42581/

@SparkQA commented Apr 29, 2021

Test build #138062 has finished for PR 32287 at commit 31a08a6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42585/

@SparkQA commented Apr 29, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42585/

@SparkQA commented Apr 29, 2021

Test build #138066 has finished for PR 32287 at commit e85db7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 29, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42604/

@SparkQA commented May 19, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43227/

@SparkQA commented May 19, 2021

Test build #138706 has finished for PR 32287 at commit 4af6ee7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm (Contributor) commented May 19, 2021

@Ngone51 Now that #32389 is merged, can we look at the Suite changes to leverage that?
I am fine with the rest of the PR.

@Ngone51 (Member, Author) commented May 20, 2021

@mridulm I already rebased against that PR :)

@cloud-fan (Contributor) commented

thanks, merging to master!

@cloud-fan closed this in 00b63c8 May 20, 2021

@dongjoon-hyun (Member) commented

Sorry for missing the ping here. +1, late LGTM.

@Ngone51 (Member, Author) commented May 20, 2021

Thank you, everyone!!

@Ngone51 deleted the SPARK-27991 branch May 20, 2021 05:16
domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022

Closes apache#32287 from Ngone51/SPARK-27991.

Authored-by: yi.wu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

@loukey-lj commented

I'm also hitting this issue; is there a configuration I can use now to increase the Netty memory?
