-
Notifications
You must be signed in to change notification settings - Fork 9.1k
YARN-11127. Potential deadlock in AsyncDispatcher caused by RMNodeImp… #4270
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
|
💔 -1 overall
This message was automatically generated. |
|
To make sure unittest pass, I submit f887c7a. But I think this unittest fail, was not introduced by this PR. I test the two unittest in our MAC on trunk branch, still fail. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
Many unit test failed, reason is 'Metrics source DelegationTokenSecretManagerMetrics already exists!', wait for #4289 to solve this problem. I solve this problem by below way on my own mac. |
|
🎊 +1 overall
This message was automatically generated. |
...-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestRMDeadLockTriggerByApp.java
Outdated
Show resolved
Hide resolved
| yarnCluster.start(); | ||
|
|
||
| // start rm client | ||
| yarnClient = YarnClient.createYarnClient(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.yarnClient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I don't know this code standards. Should I fix all member variable like pattern this.xxx?
...-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestRMDeadLockTriggerByApp.java
Outdated
Show resolved
Hide resolved
| attemptId = appReport.getCurrentApplicationAttemptId(); | ||
| appAttempt = | ||
| yarnCluster.getResourceManager().getRMContext().getRMApps() | ||
| .get(attemptId.getApplicationId()).getCurrentAppAttempt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this appId?
...-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestRMDeadLockTriggerByApp.java
Show resolved
Hide resolved
...-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestRMDeadLockTriggerByApp.java
Outdated
Show resolved
Hide resolved
| // Avoid the EventHandlingThread struck forever | ||
| if (deadLock) { | ||
| LOG.info("Found dead lock, stop EventHandlingThread manually!"); | ||
| ((AsyncDispatcher) rm.getRMContext().getDispatcher()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably use fail() after this.
| if (deadLock) { | ||
| LOG.info("Found dead lock, stop EventHandlingThread manually!"); | ||
| ((AsyncDispatcher) rm.getRMContext().getDispatcher()) | ||
| .forceEventHandlingThreadStop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do the forceEventHandlingThreadStop() in the tearDown?
| Assert.assertFalse("There is dead lock!", deadLock); | ||
|
|
||
| // join all thread | ||
| allocateThread.join(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the assert triggers, this won't join()
…l, SchedulerApplicationAttempt and RMAppImpl's lock contention.
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
@goiri Can you help me review this again? This patch have been applied on our cluster for long time. |
|
@goiri @slfan1989 The main reason for the deadlock described in this issue is that there are many types of locks in ResourceManager, and we do not have a clear rule to restrict the use of locks. I think this is an important topic. For me, I think we should have this rules: We can only try to acquire the lock in this order: Node --> App --> AppAttempt.But the lock cannot be acquired in the opposite direction. And in addition, there is no need to use write lock protect |
Description of PR
detail message see: https://issues.apache.org/jira/browse/YARN-11127
How was this patch tested?
uni-test in PR
For code changes: