-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-1978] In some cases, spark-yarn does not automatically restart the failed container #921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
e00b656
3c464bd
1faf4f4
5c376e0
aff827c
04c6f7e
056b8c7
32ac7af
8800eba
bc3aa66
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -200,17 +200,25 @@ class ExecutorLauncher(args: ApplicationMasterArguments, conf: Configuration, sp | |
|
|
||
| logInfo("Allocating " + args.numExecutors + " executors.") | ||
| // Wait until all containers have finished | ||
| // TODO: This is a bit ugly. Can we make it nicer? | ||
| // TODO: Handle container failure | ||
|
|
||
| yarnAllocator.addResourceRequests(args.numExecutors) | ||
| yarnAllocator.allocateResources() | ||
| while ((yarnAllocator.getNumExecutorsRunning < args.numExecutors) && (!driverClosed)) { | ||
| allocateMissingExecutor() | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. similar here can you move this up above allocateResources
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you change this to have similar logic - allocate outside loop, then inside loop add missing and then allocate. |
||
| yarnAllocator.allocateResources() | ||
| Thread.sleep(100) | ||
| } | ||
|
|
||
| logInfo("All executors have launched.") | ||
| } | ||
|
|
||
| private def allocateMissingExecutor() { | ||
| val missingExecutorCount = args.numExecutors - yarnAllocator.getNumExecutorsRunning - | ||
| yarnAllocator.getNumPendingAllocate | ||
| if (missingExecutorCount > 0) { | ||
| logInfo("Allocating %d containers to make up for (potentially) lost containers". | ||
| format(missingExecutorCount)) | ||
| yarnAllocator.addResourceRequests(missingExecutorCount) | ||
| } | ||
| } | ||
|
|
||
| // TODO: We might want to extend this to allocate more containers in case they die ! | ||
|
|
@@ -220,13 +228,7 @@ class ExecutorLauncher(args: ApplicationMasterArguments, conf: Configuration, sp | |
| val t = new Thread { | ||
| override def run() { | ||
| while (!driverClosed) { | ||
| val missingExecutorCount = args.numExecutors - yarnAllocator.getNumExecutorsRunning - | ||
| yarnAllocator.getNumPendingAllocate | ||
| if (missingExecutorCount > 0) { | ||
| logInfo("Allocating %d containers to make up for (potentially) lost containers". | ||
| format(missingExecutorCount)) | ||
| yarnAllocator.addResourceRequests(missingExecutorCount) | ||
| } | ||
| allocateMissingExecutor() | ||
| sendProgress() | ||
| Thread.sleep(sleepTime) | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move these 2 lines up above the allocateResources() call. That way max failed check happens first, then we would add in any missing executors, and then call allocate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that only after the call
yarnAllocator.allocateResources(),yarnAllocator.getNumExecutorsFailedvalue will change.This is appropriate: after changing the value of
yarnAllocator.getNumExecutorsFailed, immediately callcheckNumExecutorsFailedThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct that it isn't updated until that is called and in this has it really doesn't matter to much since we are in while loop and then after this we will just go to the reporter thread to continue to loop. I just find it odd to add in missing executors but then not call allocate immediately afterwards (we sleep and then we possibly break from the loop). For that reason I would prefer these to be moved up. Yes we have one extra unneeded call to them, but I think it flows better and will be more resilient to other code changes in the future.
Also note that launchReporterThread runs the logic the same way so right now you are doing the failed check and and allocateMissing twice in a row on the first time it goes into the reporter thread.
If you really have objections to that then we could make one call of allocateResources() outside of the while loop and then do the checkFailed, allocateMissing, and then allocateResources in side the while loop.