
Conversation

WangTaoTheTonic (Contributor) commented Aug 11, 2016

What changes were proposed in this pull request?

We send RequestExecutors directly to the AM instead of relaying it through YarnSchedulerBackend first, to avoid a potential deadlock.
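
For illustration only: a minimal, self-contained Scala sketch of the deadlock shape being avoided, not the actual Spark code. `Backend`, `askAmDirectly`, and the single-threaded dispatcher below are hypothetical stand-ins for the scheduler backend, the AM endpoint, and the local RPC endpoint that previously relayed the request.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object DeadlockSketch {
  // Stand-in for the AM: answers on its own thread and needs no driver-side locks.
  private val amExecutor = Executors.newSingleThreadExecutor()
  private val amPool = ExecutionContext.fromExecutor(amExecutor)
  def askAmDirectly(n: Int): Future[Boolean] = Future(n >= 0)(amPool)

  // Stand-in for the local RPC endpoint: one dispatcher thread, one message at a time.
  private val dispatcher = Executors.newSingleThreadExecutor()
  def askViaLocalEndpoint(n: Int): Future[Boolean] = {
    val p = Promise[Boolean]()
    dispatcher.execute(() => p.success(n >= 0)) // only runs once the dispatcher is free
    p.future
  }

  object Backend { // stand-in for the scheduler backend and its monitor lock
    def requestExecutors(n: Int): Boolean = synchronized {
      // Deadlock-prone shape: hold the backend lock while blocking on a reply that
      // only the (possibly busy) dispatcher thread can produce:
      //   Await.result(askViaLocalEndpoint(n), 10.seconds)
      // Shape after this change: the reply no longer depends on that thread.
      Await.result(askAmDirectly(n), 10.seconds)
    }
    // Another message handler that needs the same lock (e.g. removing an executor).
    def handleRemoveExecutor(): Unit = synchronized { /* mutate shared state */ }
  }

  def main(args: Array[String]): Unit = {
    // The dispatcher gets busy with a message that needs the backend lock...
    dispatcher.execute(() => Backend.handleRemoveExecutor())
    // ...while the backend, holding that lock, asks for executors.
    println(Backend.requestExecutors(2))
    dispatcher.shutdown()
    amExecutor.shutdown()
  }
}
```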

How was this patch tested?

manual tests

SparkQA commented Aug 11, 2016

Test build #63619 has finished for PR 14605 at commit 80c2d11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

vanzin (Contributor) commented Aug 11, 2016

Argh. I really dislike askWithRetry. This is just another reason why. LGTM, since this maintains the current behavior.

Merging to master / 2.0.

asfgit closed this in ea0bf91 on Aug 11, 2016
asfgit pushed a commit that referenced this pull request Aug 11, 2016
…ages

## What changes were proposed in this pull request?

We send RequestExecutors directly to the AM instead of relaying it through YarnSchedulerBackend first, to avoid a potential deadlock.

## How was this patch tested?

manual tests

Author: WangTaoTheTonic <[email protected]>

Closes #14605 from WangTaoTheTonic/lock.

(cherry picked from commit ea0bf91)
Signed-off-by: Marcelo Vanzin <[email protected]>
zzcclp added a commit to zzcclp/spark that referenced this pull request Aug 12, 2016
asfgit pushed a commit that referenced this pull request Sep 1, 2016
## What changes were proposed in this pull request?
This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside of message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us to queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response (see the sketch below). This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process-local components, and so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.
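
For illustration only: a minimal Scala sketch of both ideas, not the actual Spark code. `EndpointRef`, `Backend`, `RequestExecutors`, and the plain `java.util.concurrent` timer are hypothetical stand-ins; the point is the two patterns, fixed-delay scheduling and awaiting the reply outside the lock.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object ScheduleAndAskSketch {
  private val timer = Executors.newSingleThreadScheduledExecutor()

  // Idea 1: scheduleWithFixedDelay starts the next run a full interval *after*
  // the previous run finishes, so a run held up by lock contention is not
  // chased by an immediate back-to-back run (as scheduleAtFixedRate's
  // catch-up behaviour would cause).
  def startScheduleTimer(schedule: () => Unit, intervalMs: Long): Unit =
    timer.scheduleWithFixedDelay(() => schedule(), 0L, intervalMs, TimeUnit.MILLISECONDS)

  // Stand-in for an RPC endpoint reference whose ask() just queues the message
  // and returns a Future, instead of blocking for the reply like askWithRetry.
  trait EndpointRef { def ask(message: Any): Future[Boolean] }

  final case class RequestExecutors(total: Int)

  // Idea 2: send the request while holding the lock, but block on the reply
  // only after the lock has been released.
  final class Backend(endpoint: EndpointRef) {
    private var requestedTotal = 0

    def requestTotalExecutors(total: Int): Boolean = {
      val reply = synchronized {
        requestedTotal = total                // state updates stay under the lock
        endpoint.ask(RequestExecutors(total)) // non-blocking: only queues the message
      }
      Await.result(reply, 120.seconds)        // await the reply outside the lock
    }
  }
}
```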

## How was this patch tested?
Existing tests, and manual tests under yarn-client mode.

Author: Angus Gerry <[email protected]>

Closes #14710 from angolon/SPARK-16533.
angolon added a commit to angolon/spark that referenced this pull request Sep 2, 2016