YARN-11698. Store pending log aggregators in the NM State Store #6845
base: trunk
Conversation
if (!context.getContainers().containsKey(cid)) {
  ApplicationId appId =
      cid.getApplicationAttemptId().getApplicationId();
  if (isApplicationStopped(appId)) {
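For context, here is a minimal sketch (not the PR's actual code) of how a recovery path could use a guard like this to decide which recovered pending-aggregation entries are still worth keeping; isApplicationStopped and the surrounding method are hypothetical placeholders for illustration only.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Illustrative only: on recovery, keep a pending log-aggregation entry unless
// the container is no longer live and its application has already stopped.
final class PendingEntryFilterSketch {

  static List<ContainerId> stillPending(List<ContainerId> recovered,
      Map<ContainerId, ?> liveContainers) {
    List<ContainerId> pending = new ArrayList<>();
    for (ContainerId cid : recovered) {
      if (!liveContainers.containsKey(cid)) {
        ApplicationId appId = cid.getApplicationAttemptId().getApplicationId();
        if (isApplicationStopped(appId)) {
          continue; // application is gone; nothing left to aggregate here
        }
      }
      pending.add(cid);
    }
    return pending;
  }

  // Hypothetical placeholder; a real check would consult the NM's application state.
  private static boolean isApplicationStopped(ApplicationId appId) {
    return false;
  }

  private PendingEntryFilterSketch() {
  }
}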
The current code will also delete all delegation tokens in the application, including those used for log aggregation, causing log aggregation to fail.
Can you elaborate? Are you saying there is a problem with these changes, or is something wrong in the existing code?
I carefully reviewed the code again and there should be no problem with aggregation failing.
By the way: what about aggregating the logs of containers that completed more than 30 minutes ago (or longer) onto HDFS in advance, rather than waiting for the job to complete? Would that implementation be better than the current one?
This doesn't actually change anything about how log aggregation works. It just prevents the NodeManager from holding onto finished containers in the state store after their logs have been aggregated.
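As a rough sketch of that lifecycle, assuming hypothetical state store methods (storePendingLogAggregation and removePendingLogAggregation are illustrative names, not the PR's actual API): an entry is written when a container finishes and removed once its logs are uploaded, so only unaggregated containers survive in the store across a restart.

import java.io.IOException;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Illustrative lifecycle of a "pending log aggregation" entry; the store
// methods below are hypothetical names, not the NM state store's real API.
interface PendingAggregationStoreSketch {
  void storePendingLogAggregation(ContainerId cid) throws IOException;
  void removePendingLogAggregation(ContainerId cid) throws IOException;
}

class ContainerLogLifecycleSketch {
  private final PendingAggregationStoreSketch store;

  ContainerLogLifecycleSketch(PendingAggregationStoreSketch store) {
    this.store = store;
  }

  void onContainerFinished(ContainerId cid) throws IOException {
    // Remember that this finished container still needs its logs uploaded,
    // so an NM restart can resume aggregation for it.
    store.storePendingLogAggregation(cid);
  }

  void onLogsAggregated(ContainerId cid) throws IOException {
    // Once the logs are on the remote filesystem there is nothing left to
    // recover, so drop the entry instead of keeping the container around
    // until the application ends.
    store.removePendingLogAggregation(cid);
  }
}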
Description of PR
Stores containers pending log aggregation in the NodeManager state store so logs can still be aggregated for completed containers after a NodeManager restart. This undoes and replaces https://issues.apache.org/jira/browse/YARN-4771 with a finer-grained approach that doesn't involve storing containers indefinitely until the application finishes.
The original approach has several issues, some of which were mentioned in the JIRA but deemed acceptable at the time.
Instead, this adds a new state store entry for containers pending log aggregation. This solves all of the above issues while still providing the same guarantees about logs being aggregated after a NodeManager restart.
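A sketch of what the restart path could look like under this design, with loadPendingLogAggregations and scheduleAggregation as hypothetical names standing in for whatever the patch actually exposes:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerId;

// Sketch of the restart path under this design; loadPendingLogAggregations
// and scheduleAggregation are hypothetical names, not the patch's real API.
class RestartRecoverySketch {

  interface PendingStore {
    List<ContainerId> loadPendingLogAggregations() throws IOException;
  }

  interface LogAggregation {
    void scheduleAggregation(ContainerId cid);
  }

  void recover(PendingStore store, LogAggregation aggregation)
      throws IOException {
    // Each recovered ID is a finished container whose logs had not been
    // uploaded when the NodeManager went down; re-queue it for aggregation.
    for (ContainerId cid : store.loadPendingLogAggregations()) {
      aggregation.scheduleAggregation(cid);
    }
  }
}

The point is that recovery replays only the finished containers whose logs were never uploaded, rather than every container of every still-running application.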
How was this patch tested?
New UTs added
For code changes:
If applicable, have the LICENSE, LICENSE-binary, NOTICE-binary files been updated?