Skip to content

Conversation

@vanzin
Copy link
Contributor

@vanzin vanzin commented Jul 18, 2017

The main goal of this change is to avoid the situation described
in the bug, where an AM restart in the middle of a job may cause
no new executors to be allocated because of faulty logic in the
reset path.

The change does two things:

  • fixes the executor alloc manager's reset() so that it does not
    stop allocation after a reset() in the middle of a job
  • re-orders the initialization of the YarnAllocator class so that
    it fetches the current executor ID before triggering the reset()
    above.

This ensures both that the new allocator gets new requests for executors,
and that it starts from the correct executor id.

Tested with unit tests and by manually causing AM restarts while
running jobs using spark-shell in YARN mode.

Closes #17882

witgo and others added 2 commits July 17, 2017 13:26
…tart.

The main goal of this change is to avoid the situation described
in the bug, where an AM restart in the middle of a job may cause
no new executors to be allocated because of faulty logic in the
reset path.

The change does two things:

- fixes the executor alloc manager's reset() so that it does not
  stop allocation after a reset() in the middle of a job
- re-orders the initialization of the YarnAllocator class so that
  it fetches the current executor ID before triggering the reset()
  above.

This ensures both that the new allocator gets new requests for executors,
and that it starts from the correct executor id.

Tested with unit tests and by manually causing AM restarts while
running jobs using spark-shell in YARN mode.
@vanzin
Copy link
Contributor Author

vanzin commented Jul 18, 2017

FYI, I started by trying to clean up the referenced PR but ended up pretty much re-writing it...

@SparkQA
Copy link

SparkQA commented Jul 18, 2017

Test build #79692 has finished for PR 18663 at commit 1496b78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor Author

vanzin commented Jul 18, 2017

@tgravescs @jerryshao

@vanzin vanzin changed the title [SPARK-20079][yarn] Fix client AM not allocating executors aftert restart. [SPARK-20079][yarn] Fix client AM not allocating executors after restart. Jul 18, 2017
// a new one registered after the failure. This will only happen in yarn-client mode.
reset()
}
reset()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it OK to trigger reset even in the first attempt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. There are no executors when the first attempt registers with the driver, so everything reset() does basically amounts to a no-op.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explain, let me try your patch locally.

@jerryshao
Copy link
Contributor

LGTM, I tried it locally, looks now executors can be ramped up soon after AM restart.

@tgravescs
Copy link
Contributor

I haven't had a chance to look at this yet, but this doesn't by chance fix the allocator re-evaluating if it needs executors all the time does it?

I have seen issues where executors can idle timeout, because scheduler isn't scheduling them fast enough, might be busy or the locality wait settings interfere. it gets down to only a few executors even though it has 10000+ tasks to run still. If it doesn't I will file a separate jira for that.

I'll try to review this later today

@vanzin
Copy link
Contributor Author

vanzin commented Jul 19, 2017

That sounds like a different issue. I've seen @squito debugging issues that sound similar to that, not sure if he got down to making any scheduler changes.

@SparkQA
Copy link

SparkQA commented Jul 25, 2017

Test build #79940 has finished for PR 18663 at commit 56abc80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor Author

vanzin commented Jul 31, 2017

retest this please

@SparkQA
Copy link

SparkQA commented Jul 31, 2017

Test build #80085 has finished for PR 18663 at commit 56abc80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor Author

vanzin commented Aug 1, 2017

Merging to master.

@asfgit asfgit closed this in 6735433 Aug 1, 2017
@vanzin vanzin deleted the SPARK-20079 branch August 7, 2017 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants