@KaiXinXiaoLei

If the heartbeat receiver kills executors and their replacements have not yet registered, a subsequent idle timeout on another executor changes the total number of executors requested by the driver, and the pending replacements are lost: no new executors are requested to replace the killed ones.
For example, suppose executorsPendingToRemove=Set(1) and executor 2 hits its idle timeout before a new executor has been requested to replace executor 1. The driver then kills executor 2 and sends RequestExecutors to the AM. Because executorsPendingToRemove is now Set(1,2), the AM does not allocate an executor to replace executor 1.

see: #8668
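
The accounting problem described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the code in this patch: the names `State`, `kill`, `pendingReplacements`, and `trackReplacements` are hypothetical, and the target formula (existing + pending replacements - pending removals) is a simplification chosen to reproduce the Set(1,2) example.

```scala
object PendingReplacementSketch {
  // Hypothetical, simplified view of the driver-side bookkeeping.
  final case class State(
      existing: Set[String],        // executors currently registered
      pendingToRemove: Set[String], // executors asked to die, not yet gone
      pendingReplacements: Int,     // replacements expected but not yet registered
      requestedTotal: Int)          // total last requested from the cluster manager

  def kill(s: State, id: String, replace: Boolean, trackReplacements: Boolean): State = {
    val removing = s.pendingToRemove + id
    if (replace) {
      // The killed executor should be replaced, so the requested total stays as is.
      // Only the tracking variant remembers that a replacement is still owed.
      val pending =
        if (trackReplacements) s.pendingReplacements + 1 else s.pendingReplacements
      s.copy(pendingToRemove = removing, pendingReplacements = pending)
    } else {
      // No replacement wanted (e.g. idle timeout): recompute the requested total.
      val target = s.existing.size + s.pendingReplacements - removing.size
      s.copy(pendingToRemove = removing, requestedTotal = target)
    }
  }

  // Executor 1 is killed with replacement, then executor 2 times out while idle.
  def scenario(trackReplacements: Boolean): Int = {
    val start = State(Set("1", "2"), Set.empty, 0, 2)
    val afterHeartbeatKill =
      kill(start, "1", replace = true, trackReplacements = trackReplacements)
    val afterIdleKill =
      kill(afterHeartbeatKill, "2", replace = false, trackReplacements = trackReplacements)
    afterIdleKill.requestedTotal
  }
}
```

In this model, `scenario(trackReplacements = false)` yields a requested total of 0, so the replacement for executor 1 is never allocated, while `scenario(trackReplacements = true)` yields 1.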

@SparkQA

SparkQA commented Sep 30, 2015

Test build #43124 has finished for PR 8945 at commit 1bdde8e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Sep 30, 2015

Hmm, test failure looks like it might be related.

Also, does this replace #8668? If so, could you close one of them?

Contributor

indentation is off

@SparkQA

SparkQA commented Oct 8, 2015

Test build #43389 has finished for PR 8945 at commit e382315.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@KaiXinXiaoLei
Author

The Jenkins test failure is not caused by my code. Please retest.

@andrewor14
Contributor

retest this please

@andrewor14
Contributor

Thanks LGTM. I'll merge this once tests pass.

Contributor

This has the same problem I mentioned before in the other PR. sc.killExecutor does not immediately update the state of the master.apps list. So these tests are bound to fail in weird ways.

You need to use killNExecutors instead of sc.killExecutor and getApplications instead of master.apps. See other tests in this same file.
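
The race described in this comment can be shown with a toy analogy. This is a sketch only: `ToyMaster`, `killAsync`, `peek`, and `ask` are hypothetical names and do not correspond to Spark's Master or to the test helpers mentioned above. It assumes state is owned by a single-threaded message loop, so a query sent through the same loop is ordered after earlier updates, whereas reading the field directly is not.

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

object AskInsteadOfPeekSketch {
  // Hypothetical stand-in for a component that applies updates on its own thread.
  final class ToyMaster {
    private val loop = Executors.newSingleThreadExecutor()
    @volatile private var executors: Set[String] = Set("1", "2")

    // Fire-and-forget: returns before the state change has been applied.
    def killAsync(id: String): Unit = {
      loop.submit(new Runnable {
        def run(): Unit = {
          Thread.sleep(50) // simulate processing delay
          executors = executors - id
        }
      })
      ()
    }

    // Direct read: may observe the state before a pending kill is applied.
    def peek: Set[String] = executors

    // Query routed through the same single-threaded loop: runs only after
    // every previously submitted update has finished.
    def ask(): Set[String] =
      loop
        .submit(new Callable[Set[String]] { def call(): Set[String] = executors })
        .get(5, TimeUnit.SECONDS)

    def stop(): Unit = loop.shutdown()
  }
}
```

A test that calls `killAsync("1")` and then asserts on `peek` depends on timing, while asserting on `ask()` does not, because the query is queued behind the kill.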

@SparkQA

SparkQA commented Oct 9, 2015

Test build #43453 has finished for PR 8945 at commit e382315.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 9, 2015

Test build #43458 has finished for PR 8945 at commit b7b42cc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 9, 2015

Test build #43464 has finished for PR 8945 at commit cb69dc5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

hm, not sure if I understand this. Why does the number of executors stay at 2 after you call sc.killExecutor, which does not replace it? Shouldn't it go down to 1?

Author

@andrewor14 Here I use sc.killExecutor(executors.head) to simulate an executor being lost, so that a new executor should start to replace it. Before the new executor registers, an executor reaches its idle timeout. The total number of executors should then not change, hence "apps.head.executors.size === 2".

Kill an executor and a new executor should replace it. Make sure the total number of executors does not change.
@SparkQA

SparkQA commented Oct 13, 2015

Test build #43628 has finished for PR 8945 at commit da13040.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Oct 14, 2015

LGTM, will let Andrew have a final look.

asfgit pushed a commit that referenced this pull request Oct 15, 2015
…s should not be lost

If the heartbeat receiver kills executors and their replacements have not yet registered, a subsequent idle timeout on another executor changes the total number of executors requested by the driver, and the pending replacements are lost: no new executors are requested to replace the killed ones.
For example, suppose executorsPendingToRemove=Set(1) and executor 2 hits its idle timeout before a new executor has been requested to replace executor 1. The driver then kills executor 2 and sends RequestExecutors to the AM. Because executorsPendingToRemove is now Set(1,2), the AM does not allocate an executor to replace executor 1.

see: #8668

Author: KaiXinXiaoLei <[email protected]>
Author: huleilei <[email protected]>

Closes #8945 from KaiXinXiaoLei/pendingexecutor.
@asfgit asfgit closed this in 2d00012 Oct 15, 2015
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Oct 16, 2015
@andrewor14
Contributor

Forgot to add: merged into master and 1.5.
