-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster killed for max n… #10794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
30048ac
fa7d54b
3a1724c
659c505
ebe3c7f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -37,6 +37,7 @@ import org.apache.spark.deploy.yarn.YarnSparkHadoopUtil._ | |
| import org.apache.spark.rpc.{RpcCallContext, RpcEndpointRef} | ||
| import org.apache.spark.scheduler.{ExecutorExited, ExecutorLossReason} | ||
| import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages.RemoveExecutor | ||
| import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages.RetrieveLastAllocatedExecutorId | ||
| import org.apache.spark.util.ThreadUtils | ||
|
|
||
| /** | ||
|
|
@@ -81,8 +82,23 @@ private[yarn] class YarnAllocator( | |
| new ConcurrentHashMap[ContainerId, java.lang.Boolean]) | ||
|
|
||
| @volatile private var numExecutorsRunning = 0 | ||
| // Used to generate a unique ID per executor | ||
| private var executorIdCounter = 0 | ||
|
|
||
| /** | ||
| * Used to generate a unique ID per executor | ||
| * | ||
| * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then | ||
| * the id of new executor will start from 1, this will conflict with the executor has | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we need to clarify this to say this is required for client mode when driver isn't running on yarn. this isn't an issue in cluster mode.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @tgravescs I think we can clarify this in
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would prefer to do it here in this comment to better describe the situation this is needed. It should be a line or two and I personally much prefer that then pointing at jiras unless its a big discussion/background required, then the jira makes more sense.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @tgravescs Ok, I will do it soon. Thanks a lot. |
||
| * already created before. So, we should initialize the `executorIdCounter` by getting | ||
| * the max executorId from driver. | ||
| * | ||
| * And this situation of executorId conflict is just in yarn client mode, so this is an issue | ||
| * in yarn client mode. For more details, can check in jira. | ||
| * | ||
| * @see SPARK-12864 | ||
| */ | ||
| private var executorIdCounter: Int = | ||
| driverRef.askWithRetry[Int](RetrieveLastAllocatedExecutorId) | ||
|
|
||
| @volatile private var numExecutorsFailed = 0 | ||
|
|
||
| @volatile private var targetNumExecutors = | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is kind of awkward. You don't need to keep track of another variable; just compute the max executor ID when the AM asks for it. You already have all the information you need in
executorDataMap.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewor14 Thank for review it. For my understanding, I don't think we can get the max executor ID in executorDataMap. Because, when AM is failure, all the executor are disconnect and be removed, by this time, as the code in method
CoarseGrainedSchedulerBackend.removeExecutorshow, the executor information in executorDataMap also be removed.So, I think the executor information in executorDataMap is not complete. What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, with dynamic allocation, for example, the executor with the max known id may be gone already.
minor:
executorId.toIntinstead ofInteger.parseIntThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vanzin Thanks for your comments. I will optimize it.