Conversation

@KaiXinXiaoLei

While tasks are running and the total number of executors has reached spark.dynamicAllocation.maxExecutors, the AM may fail and a new AM restarts. Because the total number of executors tracked by ExecutorAllocationManager has not changed, the driver does not send a RequestExecutors message to the new AM to ask for executors, so the AM's total stays at spark.dynamicAllocation.initialExecutors. As a result, the total number of executors seen by the driver and by the AM is different.
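For context, here is a minimal sketch of the dynamic-allocation settings under which this mismatch shows up; the concrete values are illustrative assumptions, not taken from this report:

```scala
import org.apache.spark.SparkConf

// Illustrative settings only (assumed values, not from the PR description).
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.initialExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")

// While a long stage keeps the driver-side target at maxExecutors (50), an AM
// failure brings up a fresh AM that only knows initialExecutors (2). Because
// the driver's target never changes, no RequestExecutors message is sent, and
// the driver and the new AM disagree on the executor total.
```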

@SparkQA

SparkQA commented Sep 14, 2015

Test build #42394 has finished for PR 8737 at commit d7ed6dc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2015

Test build #42396 has finished for PR 8737 at commit 564725b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2015

Test build #42397 has finished for PR 8737 at commit 209f4da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

reset ?

+1 reset

I would just call this reset() since it needs to do more than just setting the target

@KaiXinXiaoLei
Author

Run a long job whose stages have many tasks. While tasks are running, the AM fails and a new AM restarts. In ExecutorAllocationManager, because there are still many tasks to run, the value of numExecutorsTarget is spark.dynamicAllocation.maxExecutors and does not change, so the driver never sends a RequestExecutors message to the AM. As a result, the new AM does not know the total number of executors.

@jerryshao
Contributor

I'm wondering: when the AM is restarted, what is the initial value of numExecutorsTarget?

@jerryshao
Contributor

From my point of view, I think we should fix this in ExecutorAllocationManager: when the AM is reconnected, ramp numExecutorsTarget back down to the initial number.

Also, I think this problem only arises in yarn-client mode.

@SparkQA

SparkQA commented Sep 17, 2015

Test build #42578 has finished for PR 8737 at commit 258f146.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vidma

vidma commented Sep 17, 2015

Looks more legitimate now. Could we craft a test for this?

indentation is off

@jerryshao
Contributor

Hi @vanzin, according to my test, when the AM fails and is restarted by the YARN RM, all of its internal state is refreshed (including YarnAllocator), since it is a new process. Also, all the containers (executors) related to the old AM will exit. So, to some extent, the AM side of executor management is reset to the initial state; what we should care about is resetting the driver-side ExecutorAllocationManager back to the initial state.

So, from my understanding, we don't need to take care of YarnAllocator; what we need to care about is the driver-side state.

@vanzin
Contributor

vanzin commented Sep 21, 2015

Ok, I think I read too much into what was going on. But:

Also all the containers (executors) related to the old AM will be exited.

I see. That, though, seems to be caused by the reset state in the AM, not because the executors depend on the AM in any way; the AM will send a request to YARN that basically means "I need far fewer executors than I currently have, so feel free to kill all the others".

If some of that state were somehow kept between AM instances (or updated once the new AM registers with the driver), that could be avoided. But since this really only affects yarn-client mode, it doesn't seem worth the extra work.

Given that, LGTM aside from the minor method rename.

@vanzin
Contributor

vanzin commented Sep 21, 2015

(Actually, my comment regarding an explicit message vs. resetting the target executor count still applies; I just don't think it's that important, although perhaps a comment would be nice.)

@jerryshao
Contributor

Yeah, I agree with you; an explicit code path to handle this issue is always better.

Besides, I think we should add more documentation to this fix; it is kind of strange for others to have to guess the meaning of this code snippet.

@vanzin
Contributor

vanzin commented Sep 22, 2015

Also, there might be an issue with this patch; ExecutorAllocationManager keeps things like executorsPendingToRemove, which may be stale after the AM goes down. If you're unlucky enough, the new AM will not get rid of those particular executors, and now you're stuck with them forever, since the driver thinks it already asked for them to be killed.

So there's more state in the driver that needs to either be reset or communicated to the new AM.

@srowen
Member

srowen commented Oct 1, 2015

@KaiXinXiaoLei are you working on this, or else do you mind closing this PR? I'm also not clear if it's the same thing as #8945

@KaiXinXiaoLei
Author

@srowen This problem is not the same as #8945. In this PR, the AM fails and is restarted while tasks are running. The number of executors then goes back to its initial state, which is not the same as the total number of executors tracked by dynamic executor allocation.

@KaiXinXiaoLei KaiXinXiaoLei changed the title [SPARK-10582] using dynamic-executor-allocation, if a new AM restarts, executors should be registered. [SPARK-10582] If a new AM restarts, the total number of executors should be in initial state in driver side. Oct 8, 2015

This is too vague...

Reset this manager to the initial starting state.
This must be called if the cluster manager is restarted.

@andrewor14
Contributor

@KaiXinXiaoLei We sync the target with the AM every time we call updateAndSyncNumExecutorsTarget so the target is updated fairly often anyway. The real problem is that all of the "pending executors" variables must be reset. This includes

ExecutorAllocationManager#executorsPendingToRemove
CoarseGrainedSchedulerBackend#executorsPendingToRemove
CoarseGrainedSchedulerBackend#numPendingExecutors

As @vanzin suggested, these all need to be cleared, right? This patch in its current state seems insufficient. Also, in future revisions, please use more detailed javadocs and commit messages.
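As a rough, self-contained sketch of how the backend-side bookkeeping listed above could be cleared when a new AM registers (the class below is a toy stand-in, not CoarseGrainedSchedulerBackend itself, and the onAmReregistered hook is an assumed name):

```scala
import scala.collection.mutable

// Toy stand-in for the driver-side scheduler backend state named above.
class BackendPendingState {
  var numPendingExecutors: Int = 0
  val executorsPendingToRemove = mutable.Set.empty[String]

  // Assumed hook: called when the driver learns that a new AM has registered,
  // so stale add/remove requests sent to the old AM are forgotten.
  def onAmReregistered(): Unit = synchronized {
    numPendingExecutors = 0
    executorsPendingToRemove.clear()
  }
}

object BackendPendingStateDemo {
  def main(args: Array[String]): Unit = {
    val state = new BackendPendingState
    state.numPendingExecutors = 5          // driver asked the old AM for 5 more executors
    state.executorsPendingToRemove += "7"  // and asked it to kill executor 7
    state.onAmReregistered()               // a new AM registers after a failure
    assert(state.numPendingExecutors == 0 && state.executorsPendingToRemove.isEmpty)
  }
}
```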

@asfgit asfgit closed this in 8d4449c Oct 18, 2015
@jerryshao
Contributor

@andrewor14, are we still planning to address this issue? It seems it is actually a problem with dynamic allocation enabled.

@andrewor14
Contributor

@KaiXinXiaoLei were you able to address the comments?

@KaiXinXiaoLei
Author

@andrewor14 I am sorry to reply so late. I just tested the latest code, and the problem still exists, so I will continue tracking this problem. Thanks.

@jerryshao
Contributor

Yes, the problem still exists. @KaiXinXiaoLei, are you still working on this issue to address the comments mentioned above?

@KaiXinXiaoLei
Author

@jerryshao Ok, Thanks.

asfgit pushed a commit that referenced this pull request Dec 9, 2015
…tion

Because of an AM failure, the target executor number in the driver and in the AM will differ, which leads to unexpected behavior in dynamic allocation. So when the AM is re-registered with the driver, state in `ExecutorAllocationManager` and `CoarseGrainedSchedulerBackend` should be reset.

This issue was originally addressed in #8737 and is re-opened here. Thanks a lot KaiXinXiaoLei for finding this issue.

andrewor14 and vanzin, would you please help review this? Thanks a lot.

Author: jerryshao <[email protected]>

Closes #9963 from jerryshao/SPARK-10582.