-
Couldn't load subscription status.
- Fork 9.1k
YARN-11323. [Federation] Improve ResourceManager Handler FinishApps. #4954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
| federationStateStoreService.cleanUpFinishApplicationsWithRetries(applicationId); | ||
| if (response != null) { | ||
| LOG.info("applicationId = {} remove from state store success.", | ||
| applicationId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the var appId and single line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for your help reviewing the code, I will fix it.
|
|
||
| T runWithRetries() throws Exception { | ||
| int retry = 0; | ||
| while (true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no such method in the Yarn Federation module, I will make a separate abstract class and put it in the yarn-server-common federation package. This method can be used by other Yarn Federation modules.
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
| public static final String FEDERATION_STATESTORE_CLEANUP_RETRY_SLEEP_TIME = | ||
| FEDERATION_PREFIX + "state-store.clean-up-retry-sleep-time"; | ||
|
|
||
| public static final long DEFAULT_FEDERATION_STATESTORE_CLEANUP_RETRY_SLEEP_TIME = 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TimeUnit.SECONDS.toMillis(1);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for your help reviewing the code, I will fix it.
| The default value is 1000ms. | ||
| </description> | ||
| <name>yarn.federation.state-store.clean-up-retry-sleep-time</name> | ||
| <value>1000ms</value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix it.
|
|
||
| public abstract class FederationActionRetry<T> { | ||
|
|
||
| public static final Logger LOG = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Single line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix it.
...va/org/apache/hadoop/yarn/server/resourcemanager/federation/FederationStateStoreService.java
Show resolved
Hide resolved
|
|
||
| return new FederationActionRetry<Boolean>() { | ||
| @Override | ||
| public Boolean run() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be written as a lambda?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion, I will refactor this part of the code.
| stateStore = rm.getFederationStateStoreService().getStateStoreClient(); | ||
|
|
||
| // generate an [app01] and join the [SC-1] cluster. | ||
| List<ApplicationId> appIds = new ArrayList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix it.
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
@goiri Please help to review this pr again, thank you very much! |
| if (!rmApps.containsKey(applicationId)) { | ||
| try { | ||
| Boolean cleanUpSuccess = | ||
| cleanUpFinishApplicationsWithRetries(applicationId, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spacing
|
|
||
| // print app cleanup log | ||
| LOG.info("cleanup finished applications size = {}, number = {} successful cleanup.", | ||
| applicationHomeSCs.size(), successCleanUpAppCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spacing
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
@goiri Thank you very much for helping to review the code! |
|
@slfan1989 Hi, I found that newly submitted apps may be mistakenly deleted in Delete triggered condition 1: |
JIRA: YARN-11323. [Federation] Improve Router Handler FinishApps.
In YARN-8755 and YARN-7599, the function of cleaning up the Finish App has been completed, but there are some ways to implement the function that need to be optimized.
YARN-7599 calls 2 methods:
The usage method is as follows:
The usage method is as follows:
Clean up the thread to compare list 1 and list 2, and delete the apps that do not exist in list 2.
There are several problems with this approach:
For example, the cluster has 2 SubClusters, namely SubClusterA and SubClusterB; SubCluserB restarts at this time; when
the cleaning thread calls the router's interface to obtain the RM App list, it may only obtain the App list of SubClusterA.
If it is deleted at this time, some apps may be lost.
The design ideas of YARN-11323 are as follows:
Apps that are already in the completed state should be handed over to SubCluster's RM for cleanup.
Delete triggered condition 1:
We illustrate by way of example, now our Federation cluster has 2 sub-clusters, namely SubClusterA and SubClusterB
The cleaning process of SubClusterB and SubClusterA is the same.
Delete triggered condition 2: