-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-10515] When killing executor, the pending replacement executors should not be lost #8945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #43124 has finished for PR 8945 at commit
|
|
Hmm, test failure looks like it might be related. Also, does this replace #8668? If so, could you close one of them? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation is off
|
Test build #43389 has finished for PR 8945 at commit
|
|
jenkins test failed is not caused by my code. please retest please. |
|
retest this please |
|
Thanks LGTM. I'll merge this once tests pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has the same problem I mentioned before in the other PR. sc.killExecutor does not immediately update the state of the master.apps list. So these tests are bound to fail in weird ways.
You need to use killNExecutors instead of sc.killExecutor and getApplications instead of master.apps. See other tests in this same file.
|
Test build #43453 has finished for PR 8945 at commit
|
|
Test build #43458 has finished for PR 8945 at commit
|
|
Test build #43464 has finished for PR 8945 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, not sure if I understand this. Why does the number of executors stay at 2 after you call sc.killExecutor, which does not replace it? Shouldn't it go down to 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewor14 Here i use sc.killExecutor(executors.head), I want to say a executor lost and a new executor should start to replace. Before the new executor registers, the executor is idle timeout. Then the total number of executors should not change. So "apps.head.executors.size === 2"
Kill a executor and a new executor should replaces it. Make sure the total number of executor be not changed.
|
Test build #43628 has finished for PR 8945 at commit
|
|
LGTM, will let Andrew have a final look. |
…s should not be lost If the heartbeat receiver kills executors (and new ones are not registered to replace them), the idle timeout for the old executors will be lost (and then change a total number of executors requested by Driver), So new ones will be not to asked to replace them. For example, executorsPendingToRemove=Set(1), and executor 2 is idle timeout before a new executor is asked to replace executor 1. Then driver kill executor 2, and sending RequestExecutors to AM. But executorsPendingToRemove=Set(1,2), So AM doesn't allocate a executor to replace 1. see: #8668 Author: KaiXinXiaoLei <[email protected]> Author: huleilei <[email protected]> Closes #8945 from KaiXinXiaoLei/pendingexecutor.
…s should not be lost If the heartbeat receiver kills executors (and new ones are not registered to replace them), the idle timeout for the old executors will be lost (and then change a total number of executors requested by Driver), So new ones will be not to asked to replace them. For example, executorsPendingToRemove=Set(1), and executor 2 is idle timeout before a new executor is asked to replace executor 1. Then driver kill executor 2, and sending RequestExecutors to AM. But executorsPendingToRemove=Set(1,2), So AM doesn't allocate a executor to replace 1. see: apache#8668 Author: KaiXinXiaoLei <[email protected]> Author: huleilei <[email protected]> Closes apache#8945 from KaiXinXiaoLei/pendingexecutor.
|
Forgot to add: merged into master 1.5. |
If the heartbeat receiver kills executors (and new ones are not registered to replace them), the idle timeout for the old executors will be lost (and then change a total number of executors requested by Driver), So new ones will be not to asked to replace them.
For example, executorsPendingToRemove=Set(1), and executor 2 is idle timeout before a new executor is asked to replace executor 1. Then driver kill executor 2, and sending RequestExecutors to AM. But executorsPendingToRemove=Set(1,2), So AM doesn't allocate a executor to replace 1.
see: #8668