HDDS-1284. Adjust default values of pipeline recovery for more resilient service restart #733
Conversation
💔 -1 overall
This message was automatically generated.
The original commit (which was reverted) was fixed by @linyiqun in HDDS-1297 (thx here again). I applied it to this branch to prove that the two commits together don't cause any problem. In case of a merge please don't squash the two commits, just rebase this branch.
Just merged.
As of now we have the following algorithm to handle node failures: if a datanode doesn't send heartbeats for ozone.scm.stale.node.interval (90s by default), it is marked as stale, its pipelines are moved to the CLOSING state, and after ozone.scm.pipeline.destroy.timeout the pipelines are destroyed.
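A minimal sketch of that flow, with hypothetical class and method names (this is not the actual SCM code, just an illustration of the state transitions and timeouts described above):
{code:java}
// Illustrative sketch of the node-failure handling described above.
// All class, method and field names are hypothetical, not the real SCM implementation.
import java.time.Duration;
import java.time.Instant;

class NodeFailureHandler {
  static final Duration STALE_NODE_INTERVAL = Duration.ofSeconds(90);     // ozone.scm.stale.node.interval
  static final Duration PIPELINE_DESTROY_TIMEOUT = Duration.ofMinutes(5); // ozone.scm.pipeline.destroy.timeout

  void onMissedHeartbeat(Node node, Instant lastHeartbeat, Instant now) {
    if (Duration.between(lastHeartbeat, now).compareTo(STALE_NODE_INTERVAL) > 0) {
      node.markStale();
      for (Pipeline pipeline : node.pipelines()) {
        pipeline.moveToClosing();                            // pipeline is unusable from this point
        scheduleDestroy(pipeline, PIPELINE_DESTROY_TIMEOUT); // destroyed only after the timeout expires
      }
    }
  }

  void scheduleDestroy(Pipeline pipeline, Duration timeout) { /* timer-based destroy */ }

  interface Node { void markStale(); Iterable<Pipeline> pipelines(); }
  interface Pipeline { void moveToClosing(); }
}
{code}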
While this algorithm can work well with a big cluster, it doesn't provide very good usability on small clusters:
Use case 1:
Given 3 nodes, in case of a service restart, if the restart takes more than 90s, the pipeline will be moved to the CLOSING state. For the next 5 minutes (ozone.scm.pipeline.destroy.timeout) the pipeline will remain in the CLOSING state. As there are no more nodes and we can't assign the same node to two different pipelines, the cluster will be unavailable for 5 minutes.
Use case 2:
Given 90 nodes and 30 pipelines, where all the pipelines are spread across 3 racks, let's stop one rack. As all the pipelines are affected, all of them will be moved to the CLOSING state. We have no free nodes, therefore we need to wait for 5 minutes before we can write any data to the cluster.
These problems can be solved in multiple ways:
1.) Instead of waiting 5 minutes, destroy the pipeline when all of its containers are reported to be closed. (Most of the time this is enough, but some container reports can be missing; a rough sketch of this check is shown after this list.)
2.) Support multi-raft and open a new pipeline as soon as we have enough nodes (even if the nodes already have CLOSING pipelines).
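A rough sketch of the check behind option 1; all names here are illustrative and assume SCM already tracks per-container replica state, this is not the real SCM API:
{code:java}
// Sketch for option 1: destroy a CLOSING pipeline as soon as every container on it
// is reported CLOSED, instead of waiting for the full destroy timeout.
import java.util.Collection;

class EagerPipelineDestroyer {
  boolean readyToDestroy(Collection<ContainerInfo> containersOnPipeline) {
    // If a container report is missing we never reach "all closed" and have to fall back
    // to the timeout-based destroy, which is why this optimization alone is not sufficient.
    return containersOnPipeline.stream().allMatch(ContainerInfo::isClosed);
  }

  interface ContainerInfo { boolean isClosed(); }
}
{code}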
Both options require more work on the pipeline management side. For 0.4.0 we can adjust the following parameters to get a better user experience:
{code}
ozone.scm.stale.node.interval
90s
OZONE, MANAGEMENT
The interval for stale node flagging. Please see
ozone.scm.heartbeat.thread.interval before changing this value.
{code}
{code}
ozone.scm.pipeline.destroy.timeout
60s
OZONE, SCM, PIPELINE
Once a pipeline is closed, SCM should wait for the above configured time
before destroying a pipeline.
{code}
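As a sanity check, here is a small, hedged example of reading and overriding these keys through the standard Hadoop Configuration API. The key names are taken from the descriptions above; the values (5 minutes and 66 seconds) are the ones proposed in this description, not necessarily the final committed defaults:
{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class PipelineRecoveryDefaults {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Values proposed in this issue; in a real cluster these would go into ozone-site.xml.
    conf.set("ozone.scm.stale.node.interval", "5m");
    conf.set("ozone.scm.pipeline.destroy.timeout", "66s");

    // Read them back, falling back to the old defaults (90s and 300s) if unset.
    long staleSec = conf.getTimeDuration("ozone.scm.stale.node.interval", 90, TimeUnit.SECONDS);
    long destroySec = conf.getTimeDuration("ozone.scm.pipeline.destroy.timeout", 300, TimeUnit.SECONDS);

    System.out.println("stale node interval   = " + staleSec + "s");
    System.out.println("pipeline destroy wait = " + destroySec + "s");
  }
}
{code}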
First of all, we can be more optimistic and mark a node as stale only after 5 minutes instead of 90s. 5 minutes should be enough most of the time to recover the nodes.
Second, we can decrease ozone.scm.pipeline.destroy.timeout. Ideally the close command is sent by the SCM to the datanode with a HB. Between two HBs we have enough time to close all the containers via Ratis. With the next HB, the datanode can report the successful close. (If the containers can't be closed via Ratis, the SCM can handle the resulting QUASI_CLOSED containers.)
We need to wait up to 29 seconds (worst case) for the next HB, and 29+30 seconds for the confirmation, so 66 seconds seems to be a safe choice (assuming that 6 seconds is enough to process the report about the successful closing).
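The worst-case budget behind that 66-second figure, spelled out as a small calculation (the 30-second heartbeat interval and the 6-second processing margin are the assumptions from the paragraph above):
{code:java}
// Worst-case time budget for confirming that all containers on a pipeline are closed,
// using the numbers from the paragraph above (illustrative only).
public class DestroyTimeoutBudget {
  public static void main(String[] args) {
    int heartbeatIntervalSec = 30;                       // HB interval assumed in the description
    int waitForCloseCommand = heartbeatIntervalSec - 1;  // 29s: close command goes out with the next HB
    int waitForConfirmation = heartbeatIntervalSec;      // +30s: the HB after that carries the close report
    int processingMargin = 6;                            // time for SCM to process the report

    int worstCase = waitForCloseCommand + waitForConfirmation + processingMargin; // 29 + 30 + 6 = 65
    System.out.println("worst case ~" + worstCase + "s, so 66s is a safe destroy timeout");
  }
}
{code}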
See: https://issues.apache.org/jira/browse/HDDS-1284