[SPARK-10582] If a new AM restarts, the total number of executors should be in initial state in driver side. #8737
Conversation
Test build #42394 has finished for PR 8737 at commit
Test build #42396 has finished for PR 8737 at commit
Test build #42397 has finished for PR 8737 at commit
reset?
+1 reset
I would just call this reset() since it needs to do more than just setting the target
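For illustration, a minimal sketch of what such a reset() might cover beyond just the target; the field names here are assumptions based on this discussion, not the committed code:

```scala
// Hypothetical sketch of the suggested reset() inside ExecutorAllocationManager;
// field names are assumed from the discussion, not taken from the actual patch.
def reset(): Unit = synchronized {
  numExecutorsTarget = initialNumExecutors  // back to spark.dynamicAllocation.initialExecutors
  numExecutorsToAdd = 1                     // restart the exponential ramp-up from scratch
  executorsPendingToRemove.clear()          // drop removals the restarted AM never saw
}
```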
Run a long job whose stages have many tasks. While tasks are running, the AM fails and a new AM restarts. In ExecutorAllocationManager, because there are still many tasks to run, the value of numExecutorsTarget stays at spark.dynamicAllocation.maxExecutors and does not change, so the driver never sends a RequestExecutors message to the AM. As a result, the new AM does not know the total number of executors.
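To make the mechanism concrete, here is a small self-contained sketch of the sync behavior described above (the class and method names are approximations for illustration, not the actual Spark code):

```scala
// Simplified model of the driver-side sync described above: the driver only
// notifies the cluster manager when its target changes, so a restarted AM
// (which lost its state) never hears the still-unchanged target again.
class AllocationSketch(maxNumExecutors: Int, requestTotal: Int => Unit) {
  private var numExecutorsTarget = 0

  def updateAndSync(maxNeeded: Int): Unit = {
    val oldTarget = numExecutorsTarget
    numExecutorsTarget = math.min(maxNeeded, maxNumExecutors)
    // Only sent on a change: after an AM restart the target is still
    // maxExecutors, so no RequestExecutors message reaches the new AM.
    if (numExecutorsTarget != oldTarget) {
      requestTotal(numExecutorsTarget)
    }
  }
}
```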
I'm wondering, when the AM is restarted, what is the initial value of
From my point of view, I think we should fix this in
Also, I think your problem only lies in yarn-client mode.
Test build #42578 has finished for PR 8737 at commit
Looks more legitimate now. Could we craft a test for this?
indentation is off
Hi @vanzin, according to my test, when the AM fails and is restarted by the YARN RM, all of its internal state is refreshed (including YarnAllocator), since it is a new process now. Also, all the containers (executors) related to the old AM will exit. So to some extent the AM side of executor management is reset to the initial state; what we should care about is resetting the driver side
So from my understanding we don't need to take care of
Ok, I think I read too much into what was going on. But:
I see. That, though, seems to be caused by the reset state in the AM, not because the executors depend on the AM in any way; the AM will send a request to YARN that basically means "I need way fewer executors than I currently have, so feel free to kill all the others". If some of that state were somehow kept between AM instances (or updated once the new AM registers with the driver), that could be avoided. But since this really only affects yarn-client mode, that seems like not enough value for the extra work. Given that, LGTM aside from the minor method rename.
(Actually, my comment regarding an explicit message vs. resetting the target executor count still applies; I just don't think it's that important, although perhaps a comment would be nice.)
Yeah, I agree with you, an explicit code path to handle this issue is always better. Besides, I think we should add more documentation to this fix; it is kind of strange for others to have to guess the meaning of this code snippet.
Also, there might be an issue with this patch; there's more state in the driver that needs to either be reset or communicated to the new AM.
@KaiXinXiaoLei are you working on this, or else do you mind closing this PR? I'm also not clear if it's the same thing as #8945 |
This is too vague...
Reset this manager to the initial starting state.
This must be called if the cluster manager is restarted.
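For illustration, the suggested wording applied as a scaladoc on the method (a sketch of the suggestion, not the committed doc):

```scala
/**
 * Reset this manager to the initial starting state.
 * This must be called if the cluster manager is restarted.
 */
def reset(): Unit = synchronized { /* ... */ }
```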
@KaiXinXiaoLei We sync the target with the AM every time we call
As @vanzin suggested, these all need to be cleared, right? This patch in its current state seems insufficient. Also, in future revisions, please use more detailed javadocs and commit messages.
@andrewor14, are we still planning to address this issue? It seems it is actually a problem with dynamic allocation enabled.
@KaiXinXiaoLei were you able to address the comments?
@andrewor14 I am sorry to reply so late. I just tested the latest code, and the problem still exists. So I will continue tracking this problem. Thanks.
Yes, the problem still exists. @KaiXinXiaoLei, are you still working on this issue to address the comments mentioned above?
@jerryshao OK, thanks.
…tion

Because of AM failure, the target executor number between the driver and the AM will be different, which leads to unexpected behavior in dynamic allocation. So when the AM is re-registered with the driver, state in `ExecutorAllocationManager` and `CoarseGrainedSchedulerBackend` should be reset.

This issue was originally addressed in #8737 and is re-opened here. Thanks a lot, KaiXinXiaoLei, for finding this issue. andrewor14 and vanzin, would you please help to review this? Thanks a lot.

Author: jerryshao <[email protected]>

Closes #9963 from jerryshao/SPARK-10582.
While tasks are running, the total number of executors is at spark.dynamicAllocation.maxExecutors when the AM fails and a new AM restarts. Because the target number of executors in ExecutorAllocationManager has not changed, the driver does not send RequestExecutors to the AM to ask for executors, while the new AM starts from spark.dynamicAllocation.initialExecutors. So the total number of executors tracked by the driver and by the AM is different.
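A hedged sketch of the shape of the eventual fix described above: when a fresh AM registers, the driver resets its dynamic-allocation bookkeeping so the next sync actually reaches the new AM. Only the two class names come from the commit message; the hook name and the stand-in traits below are assumptions for illustration:

```scala
// Simplified sketch of the fix described above; these minimal traits stand in
// for Spark's CoarseGrainedSchedulerBackend and ExecutorAllocationManager.
trait ResettableBackend { def reset(): Unit }    // stand-in for the scheduler backend
trait ResettableAllocator { def reset(): Unit }  // stand-in for the allocation manager

// On AM re-registration, reset driver-side dynamic-allocation state so the
// next target sync actually sends RequestExecutors to the new AM.
def onAmRegistered(
    backend: ResettableBackend,
    allocationManager: Option[ResettableAllocator]): Unit = {
  backend.reset()                      // forget executors tied to the old AM
  allocationManager.foreach(_.reset()) // target back to initialExecutors
}
```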