[SPARK-20079][yarn] Fix client AM not allocating executors after restart. #18663
Conversation
The main goal of this change is to avoid the situation described in the bug, where an AM restart in the middle of a job may cause no new executors to be allocated because of faulty logic in the reset path.

The change does two things:
- fixes the executor alloc manager's reset() so that it does not stop allocation after a reset() in the middle of a job
- re-orders the initialization of the YarnAllocator class so that it fetches the current executor ID before triggering the reset() above

This ensures both that the new allocator gets new requests for executors, and that it starts from the correct executor id.

Tested with unit tests and by manually causing AM restarts while running jobs using spark-shell in YARN mode.
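For readers following the description above, here is a minimal sketch of the failure mode and the fix, assuming a heavily simplified allocation manager; the class, fields, and methods below are illustrative stand-ins, not Spark's actual internals:

```scala
// Simplified, hypothetical model of the executor allocation manager.
class ExecutorAllocationManagerSketch {
  private var initializing = true     // true until the first stage is submitted
  private var numExecutorsToAdd = 1   // exponential ramp-up batch size
  private var numExecutorsTarget = 0  // number of executors we currently want

  def onStageSubmitted(): Unit = synchronized { initializing = false }

  // The fix described above: reset the batching state when a new AM registers,
  // but do NOT re-enter the initializing state. Before the fix, a reset() in
  // the middle of a job flipped the manager back into its initial state, which
  // froze allocation until a new stage happened to be submitted.
  def reset(): Unit = synchronized {
    numExecutorsToAdd = 1
    numExecutorsTarget = 0  // the new AM will re-request what is still needed
  }

  private def maxNumExecutorsNeeded(): Int = 0  // stand-in for pending + running tasks

  def updateAndSyncNumExecutorsTarget(): Unit = synchronized {
    // While initializing, nothing is requested. This is why a reset() that
    // re-entered the initializing state starved a running job of executors.
    if (!initializing) {
      numExecutorsTarget = math.max(numExecutorsTarget, maxNumExecutorsNeeded())
      // ...sync the new target with the cluster manager here...
    }
  }
}
```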
FYI, I started by trying to clean up the referenced PR but ended up pretty much rewriting it...
Test build #79692 has finished for PR 18663 at commit
```scala
// a new one registered after the failure. This will only happen in yarn-client mode.
reset()
}
reset()
```
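To make the second part of the change concrete, here is a hedged sketch of the registration ordering, with the driver endpoint stubbed out; the names below are hypothetical and the real code goes through Spark's RPC layer:

```scala
// Stubbed driver endpoint so the sketch is self-contained.
trait DriverEndpoint {
  def lastAllocatedExecutorId: Int   // highest executor ID the driver has handed out
  def resetExecutorRequests(): Unit  // driver-side reset() of allocation state
}

class YarnAllocatorSketch(driver: DriverEndpoint) {
  @volatile private var executorIdCounter: Int = 0

  /** Called when the (possibly restarted) AM registers with the driver. */
  def onAmRegistered(): Unit = {
    // 1. Fetch the current executor ID first, so executors requested by this
    //    new allocator continue numbering from where the old one stopped.
    executorIdCounter = driver.lastAllocatedExecutorId
    // 2. Only then trigger the reset; the fresh executor requests it issues
    //    will be served with correct, non-conflicting IDs.
    driver.resetExecutorRequests()
  }
}
```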
Is it OK to trigger reset() even on the first attempt?
Yes. There are no executors when the first attempt registers with the driver, so everything reset() does basically amounts to a no-op.
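As a quick illustration against the simplified sketch earlier in the thread:

```scala
// On the first attempt nothing has been allocated yet, so reset() writes back
// the values the fields already hold.
val mgr = new ExecutorAllocationManagerSketch
mgr.reset()  // numExecutorsToAdd stays 1, numExecutorsTarget stays 0: a no-op
```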
Thanks for the explanation, let me try your patch locally.
LGTM. I tried it locally; executors can now be ramped up soon after an AM restart.
I haven't had a chance to look at this yet, but does this by chance also fix the allocator constantly re-evaluating whether it needs executors? I have seen issues where executors idle-timeout because the scheduler isn't scheduling onto them fast enough (it might be busy, or the locality wait settings interfere), and the job gets down to only a few executors even though it still has 10,000+ tasks to run. If this doesn't cover that, I will file a separate JIRA. I'll try to review this later today.
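For context, the behavior described above interacts with Spark's dynamic-allocation and locality settings. A sketch of the relevant knobs (the values are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

// Settings that interact with the idle-timeout behavior described above.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "120s")  // default 60s
  .set("spark.locality.wait", "1s")  // default 3s; long waits can leave executors idle
```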
That sounds like a different issue. I've seen @squito debugging issues that sound similar, but I'm not sure whether he got as far as making any scheduler changes.
Test build #79940 has finished for PR 18663 at commit
retest this please
Test build #80085 has finished for PR 18663 at commit
Merging to master.
Closes #17882