
@suyanNone
Contributor

@suyanNone suyanNone commented Aug 23, 2016

What changes were proposed in this pull request?

In Spark's current dynamic allocation (DA) mode, enabling the blacklist can cause a Spark job to hang.

Example:
executor A, a taskset containing task1, blacklist time > 60s

  1. task1 is scheduled on exec-A and fails, so exec-A is blacklisted for task1.
  2. exec-A idles out because it cannot get any task to run. Because exec-A idled out, YarnAllocator decreases YarnAllocator.targetExecutorNumber to 0.
    Meanwhile, DA keeps calculating DA.targetExecutor = 1, so DA.oldTargetNumExecutors is also 1. DA.delta is therefore 0, and as a result DA never tells YarnAllocator the number of executors it actually needs.
  3. Because delta = 0, DA skips propagating DA.targetExecutor to YarnAllocator.targetExecutor. DA.targetExecutor stays 1 while YarnAllocator.targetExecutor stays 0, so no executor is ever obtained to run task1 and the job hangs.
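A toy model of steps 1 to 3 may make the divergence easier to see. This is a simplified sketch, not Spark's code: `State`, `idle_out`, and `sync_pre_patch` are hypothetical names, and the model assumes only that an idle timeout lowers the allocator's target by one and that DA skips its request whenever its own delta is 0.

```python
from dataclasses import dataclass


@dataclass
class State:
    # Hypothetical snapshot of the two targets described in the steps above.
    da_target: int    # DA.targetExecutor
    yarn_target: int  # YarnAllocator.targetExecutorNumber


def idle_out(state: State) -> State:
    # Step 2 (assumed behaviour): exec-A idles out, so the allocator lowers its own target.
    return State(state.da_target, max(state.yarn_target - 1, 0))


def sync_pre_patch(state: State, needed: int) -> State:
    # Hypothetical model of DA's sync before the patch.
    old_target = state.da_target  # DA.oldTargetNumExecutors
    delta = needed - old_target   # DA.delta
    if delta == 0:
        # Step 3: DA's own target did not change, so nothing is sent to YARN.
        return State(needed, state.yarn_target)
    # Otherwise DA pushes its new total to the allocator.
    return State(needed, needed)


state = State(da_target=1, yarn_target=1)  # step 1: exec-A is alive, task1 just failed on it
state = idle_out(state)                    # step 2: blacklisted exec-A gets no work and idles out
for _ in range(5):                         # task1 stays pending, so DA keeps needing 1 executor
    state = sync_pre_patch(state, needed=1)
print(state)  # State(da_target=1, yarn_target=0)
```

However many times DA re-syncs, the allocator's target stays at 0, which is the hang described in step 3.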

This patch takes the simplest approach: it removes the delta = 0 check. The drawback is that DA will always communicate with YarnAllocator, even when its target has not changed.
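To make the trade-off concrete, here is a toy simulation that counts how many requests DA sends to the allocator with and without the delta = 0 short-circuit, starting from the diverged state in step 3 (DA wants 1 executor, the allocator holds 0). The function `simulate` and its parameters are hypothetical; this is a sketch of the idea, not the actual patch.

```python
from typing import Iterable, Tuple


def simulate(
    needed_per_tick: Iterable[int],
    da_target: int,
    yarn_target: int,
    skip_unchanged: bool,
) -> Tuple[int, int]:
    """Hypothetical model: return (final allocator target, requests sent to the allocator)."""
    requests = 0
    for needed in needed_per_tick:
        delta = needed - da_target  # DA.delta
        da_target = needed
        if skip_unchanged and delta == 0:
            continue  # pre-patch: an unchanged DA target means no message is sent
        yarn_target = needed  # DA tells the allocator its actual target
        requests += 1
    return yarn_target, requests


ticks = [1] * 10  # task1 stays pending, so DA needs 1 executor on every tick
print(simulate(ticks, da_target=1, yarn_target=0, skip_unchanged=True))   # (0, 0)
print(simulate(ticks, da_target=1, yarn_target=0, skip_unchanged=False))  # (1, 10)
```

Without the check, the allocator converges to DA's target on the first tick, but DA also sends a request on every tick even when nothing has changed, which is the cost of always communicating with YarnAllocator.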

How was this patch tested?

manual test

@suyanNone suyanNone changed the title from [SPARK-15815] K、 to [SPARK-15815] Keeping tell yarn the target executors in DA mode Aug 23, 2016
@SparkQA

SparkQA commented Aug 23, 2016

Test build #64263 has finished for PR 14765 at commit 59de77b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@suyanNone
Contributor Author

jenkins retest.

@andrewor14
Contributor

andrewor14 commented Sep 16, 2016

From the JIRA description it seems that this issue arises not only in the context of DA. If that's the case, then we should definitely not just arbitrarily remove code from ExecutorAllocationManager. Let's discuss a more general solution on the JIRA, but for now we should close this PR since it's neither sufficient nor correct.
