[SPARK-18553][CORE] Fix leak of TaskSetManager following executor loss #16045

JoshRosen · 2016-11-28T22:00:05Z

What changes were proposed in this pull request?

This is the master branch version of #15986; the original description follows:

This patch fixes a critical resource leak in the TaskScheduler which could cause RDDs and ShuffleDependencies to be kept alive indefinitely if an executor with running tasks is permanently lost and the associated stage fails.

This problem was originally identified by analyzing the heap dump of a driver belonging to a cluster that had run out of shuffle space. This dump contained several ShuffleDependency instances that were retained by TaskSetManagers inside the scheduler but were not otherwise referenced. Each of these TaskSetManagers was considered a "zombie" but had no running tasks and therefore should have been cleaned up. However, these zombie task sets were still referenced by the TaskSchedulerImpl.taskIdToTaskSetManager map.

Entries are added to the taskIdToTaskSetManager map when tasks are launched and are removed inside of TaskScheduler.statusUpdate(), which is invoked by the scheduler backend while processing StatusUpdate messages from executors. The problem with this design is that a completely dead executor will never send a StatusUpdate. There is some code in statusUpdate which handles tasks that exit with the TaskState.LOST state (which is supposed to correspond to a task failure triggered by total executor loss), but this state only seems to be used in Mesos fine-grained mode. There doesn't seem to be any code which performs per-task state cleanup for tasks that were running on an executor that completely disappears without sending any sort of final death message. The executorLost and removeExecutor methods don't appear to perform any cleanup of the taskId -> * mappings, causing the leaks observed here.
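
The leak described above can be reproduced in miniature. The following is only a toy sketch, not Spark's actual code: `ToyTaskSetManager`, `LeakyScheduler`, `launchTask`, and `LeakDemo` are hypothetical names introduced for illustration, and only the map name `taskIdToTaskSetManager` and the method names `statusUpdate` and `removeExecutor` are borrowed from the description.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a TaskSetManager; in Spark it would keep the
// stage's RDDs and ShuffleDependencies reachable.
final class ToyTaskSetManager(val stageId: Int)

// Toy model of the pre-patch bookkeeping (an assumption for illustration).
final class LeakyScheduler {
  val taskIdToTaskSetManager = mutable.HashMap.empty[Long, ToyTaskSetManager]

  // An entry is added when a task is launched.
  def launchTask(taskId: Long, manager: ToyTaskSetManager): Unit =
    taskIdToTaskSetManager(taskId) = manager

  // An entry is removed only when an executor reports a final task state.
  def statusUpdate(taskId: Long, finished: Boolean): Unit =
    if (finished) taskIdToTaskSetManager.remove(taskId)

  // Executor loss touches no per-task state, so entries for tasks that
  // were running on the lost executor are never removed.
  def removeExecutor(executorId: String): Unit = ()
}

object LeakDemo {
  // Returns the number of map entries left after an executor dies silently.
  def leakedEntries(): Int = {
    val scheduler = new LeakyScheduler
    val manager = new ToyTaskSetManager(stageId = 0)
    scheduler.launchTask(0L, manager)
    scheduler.launchTask(1L, manager)
    scheduler.statusUpdate(0L, finished = true) // task 0 reports back
    scheduler.removeExecutor("exec-1")          // task 1's executor never reports
    scheduler.taskIdToTaskSetManager.size
  }
}
```

In this model, the entry for task 1 survives the executor loss, so the manager (and anything it references) stays reachable from the map.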

This patch's fix is to maintain an executorId -> running task id mapping so that these taskId -> * maps can be properly cleaned up following an executor loss.
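
A minimal sketch of that idea, under stated assumptions, might look like the following. It is a toy model rather than the patch itself: `FixedScheduler`, `ToyTaskSetManager`, `launchTask`, `cleanupTaskState`, and `FixDemo` are hypothetical names, and `taskIdToExecutorId` is assumed here as one example of the `taskId -> *` maps mentioned above.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a TaskSetManager.
final class ToyTaskSetManager(val stageId: Int)

// Toy model of the bookkeeping with an executorId -> running task ids map.
final class FixedScheduler {
  val taskIdToTaskSetManager = mutable.HashMap.empty[Long, ToyTaskSetManager]
  // Assumed second per-task map, standing in for the "taskId -> *" maps.
  val taskIdToExecutorId = mutable.HashMap.empty[Long, String]
  val executorIdToRunningTaskIds =
    mutable.HashMap.empty[String, mutable.HashSet[Long]]

  // Record the task in every map when it is launched.
  def launchTask(taskId: Long, executorId: String, manager: ToyTaskSetManager): Unit = {
    taskIdToTaskSetManager(taskId) = manager
    taskIdToExecutorId(taskId) = executorId
    executorIdToRunningTaskIds
      .getOrElseUpdate(executorId, mutable.HashSet.empty[Long]) += taskId
  }

  // Drop all per-task state for a single finished task.
  private def cleanupTaskState(taskId: Long): Unit = {
    taskIdToTaskSetManager.remove(taskId)
    taskIdToExecutorId.remove(taskId).foreach { executorId =>
      executorIdToRunningTaskIds.get(executorId).foreach(_ -= taskId)
    }
  }

  def statusUpdate(taskId: Long, finished: Boolean): Unit =
    if (finished) cleanupTaskState(taskId)

  // On executor loss, use the reverse mapping to find and clean up every
  // task that was still running there, even though no StatusUpdate arrives.
  def removeExecutor(executorId: String): Unit =
    executorIdToRunningTaskIds.remove(executorId).foreach { taskIds =>
      taskIds.foreach { taskId =>
        taskIdToTaskSetManager.remove(taskId)
        taskIdToExecutorId.remove(taskId)
      }
    }
}

object FixDemo {
  // Returns the total number of entries left across all maps.
  def remainingEntries(): Int = {
    val scheduler = new FixedScheduler
    val manager = new ToyTaskSetManager(stageId = 0)
    scheduler.launchTask(0L, "exec-1", manager)
    scheduler.launchTask(1L, "exec-1", manager)
    scheduler.statusUpdate(0L, finished = true) // task 0 reports back
    scheduler.removeExecutor("exec-1")          // task 1 is cleaned up here
    scheduler.taskIdToTaskSetManager.size +
      scheduler.taskIdToExecutorId.size +
      scheduler.executorIdToRunningTaskIds.size
  }
}
```

The reverse map costs one extra set update per task launch and completion, but it lets executor removal find the affected task ids directly instead of scanning every per-task map.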

There are some potential corner-case interactions that I'm concerned about here, especially some details in the comment in removeExecutor, so I'd appreciate a very careful review of these changes.

How was this patch tested?

I added a new unit test to TaskSchedulerImplSuite.

/cc @kayousterhout and @markhamstra, who reviewed #15986.

SparkQA · 2016-11-29T02:12:42Z

Test build #69265 has finished for PR 16045 at commit 9689763.

This patch fails from timeout after a configured wait of `250m`.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2016-11-29T02:15:13Z

Jenkins, retest this please

SparkQA · 2016-11-29T04:24:27Z

Test build #69286 has finished for PR 16045 at commit 9689763.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2016-11-29T06:47:21Z

Jenkins, retest this please

SparkQA · 2016-11-29T06:52:41Z

Test build #69304 has started for PR 16045 at commit 9689763.

JoshRosen · 2016-11-29T08:45:55Z

Jenkins, retest this please

SparkQA · 2016-11-29T11:31:59Z

Test build #69317 has finished for PR 16045 at commit 9689763.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kayousterhout · 2016-11-29T19:39:30Z

LGTM

JoshRosen · 2016-11-30T00:26:50Z

Cool, I'm going to merge this into master and branch-2.1 in that case. Thanks!

Links referenced in the description: [some code](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L338) in `statusUpdate` that handles `TaskState.LOST`; [`removeExecutor`](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L527); [the comment](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L523) in `removeExecutor`.

Author: Josh Rosen <[email protected]>

Closes #16045 from JoshRosen/fix-leak-following-total-executor-loss-master.

(cherry picked from commit 9a02f68)
Signed-off-by: Josh Rosen <[email protected]>

Port of apache#15986 to master branch.

9689763

asfgit closed this in 9a02f68 Nov 30, 2016
