YARN-11698. Store pending log aggregators in the NM State Store #6845
base: trunk
Conversation
if (!context.getContainers().containsKey(cid)) {
  ApplicationId appId =
      cid.getApplicationAttemptId().getApplicationId();
  if (isApplicationStopped(appId)) {
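For context, here is a minimal sketch (not the PR's actual code) of how a recovery path could use a guard like this to decide which recovered pending-aggregation entries are still worth keeping; isApplicationStopped and the surrounding method are hypothetical placeholders for illustration only.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Illustrative only: on recovery, keep a pending log-aggregation entry unless
// the container is no longer live and its application has already stopped.
final class PendingEntryFilterSketch {

  static List<ContainerId> stillPending(List<ContainerId> recovered,
      Map<ContainerId, ?> liveContainers) {
    List<ContainerId> pending = new ArrayList<>();
    for (ContainerId cid : recovered) {
      if (!liveContainers.containsKey(cid)) {
        ApplicationId appId = cid.getApplicationAttemptId().getApplicationId();
        if (isApplicationStopped(appId)) {
          continue; // application is gone; nothing left to aggregate here
        }
      }
      pending.add(cid);
    }
    return pending;
  }

  // Hypothetical placeholder; a real check would consult the NM's application state.
  private static boolean isApplicationStopped(ApplicationId appId) {
    return false;
  }

  private PendingEntryFilterSketch() {
  }
}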
The current code will also delete all delegation tokens in the application, including those used for log aggregation, causing log aggregation to fail.
Can you elaborate? Are you saying there is a problem with these changes, or is something wrong in the existing code?
I carefully reviewed the code again and there should be no problem with aggregation failing.
By the way: what about aggregating the logs of containers that completed more than 30 minutes ago (or longer) onto HDFS in advance, rather than waiting for the job to complete? Would that implementation be better than the current one?
This doesn't actually change anything about how log aggregation works. It just prevents the NodeManager from holding onto finished containers in the state store after their logs have been aggregated.
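As a rough sketch of that lifecycle, assuming hypothetical state store methods (storePendingLogAggregation and removePendingLogAggregation are illustrative names, not the PR's actual API): an entry is written when a container finishes and removed once its logs are uploaded, so only unaggregated containers survive in the store across a restart.

import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Illustrative lifecycle of a "pending log aggregation" entry; the store
// methods below are hypothetical names, not the NM state store's real API.
interface PendingAggregationStoreSketch {
  void storePendingLogAggregation(ContainerId cid) throws IOException;
  void removePendingLogAggregation(ContainerId cid) throws IOException;
}

class ContainerLogLifecycleSketch {
  private final PendingAggregationStoreSketch store;

  ContainerLogLifecycleSketch(PendingAggregationStoreSketch store) {
    this.store = store;
  }

  void onContainerFinished(ContainerId cid) throws IOException {
    // Remember that this finished container still needs its logs uploaded,
    // so an NM restart can resume aggregation for it.
    store.storePendingLogAggregation(cid);
  }

  void onLogsAggregated(ContainerId cid) throws IOException {
    // Once the logs are on the remote filesystem there is nothing left to
    // recover, so drop the entry instead of keeping the container around
    // until the application ends.
    store.removePendingLogAggregation(cid);
  }
}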
Description of PR
Stores containers pending log aggregation in the NodeManager state store so logs can still be aggregated for completed containers after a NodeManager restart. This undoes and replaces https://issues.apache.org/jira/browse/YARN-4771 with a finer-grained approach that doesn't involve storing containers indefinitely until the application finishes.
The original approach has several issues, some of which were mentioned in the JIRA but deemed acceptable at the time.
Instead, this adds a new state store entry for containers pending log aggregation. This solves all of the above issues while still providing the same guarantees about logs being aggregated after a NodeManager restart.
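A sketch of what the restart path could look like under this design, with loadPendingLogAggregations and scheduleAggregation as hypothetical names standing in for whatever the patch actually exposes:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Sketch of the restart path under this design; loadPendingLogAggregations
// and scheduleAggregation are hypothetical names, not the patch's real API.
class RestartRecoverySketch {

  interface PendingStore {
    List<ContainerId> loadPendingLogAggregations() throws IOException;
  }

  interface LogAggregation {
    void scheduleAggregation(ContainerId cid);
  }

  void recover(PendingStore store, LogAggregation aggregation)
      throws IOException {
    // Each recovered ID is a finished container whose logs had not been
    // uploaded when the NodeManager went down; re-queue it for aggregation.
    for (ContainerId cid : store.loadPendingLogAggregations()) {
      aggregation.scheduleAggregation(cid);
    }
  }
}

The point is that recovery replays only the finished containers whose logs were never uploaded, rather than every container of every still-running application.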
How was this patch tested?
New UTs added
For code changes:
If applicable, have the LICENSE, LICENSE-binary, NOTICE-binary files been updated?