Conversation

@elek (Member) commented Apr 12, 2019

As of now we have the following algorithm to handle node failures:

  1. In case of a missing node, the leader of the pipeline or the SCM can detect the missing heartbeats.
  2. SCM will start to close the pipeline (CLOSING state) and try to close the containers with the remaining nodes in the pipeline.
  3. After 5 minutes the pipeline will be destroyed (CLOSED) and a new pipeline can be created from the healthy nodes (one node can be part of only one pipeline at a time).
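The failure-handling flow above can be sketched as a small state machine. This is only an illustration of the described behavior, not the actual SCM code; the class names and the hard-coded 5-minute constant are assumptions mirroring the steps above:

```python
from enum import Enum

class PipelineState(Enum):
    OPEN = "OPEN"
    CLOSING = "CLOSING"
    CLOSED = "CLOSED"

DESTROY_TIMEOUT = 5 * 60  # ozone.scm.pipeline.destroy.timeout, in seconds

class Pipeline:
    """Simplified sketch of the pipeline failure handling described above."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.state = PipelineState.OPEN
        self.closing_since = None

    def on_missing_heartbeat(self, dead_node, now):
        # Steps 1-2: a missing heartbeat moves the pipeline to CLOSING and
        # SCM starts closing containers on the remaining nodes.
        self.nodes.discard(dead_node)
        if self.state is PipelineState.OPEN:
            self.state = PipelineState.CLOSING
            self.closing_since = now

    def tick(self, now):
        # Step 3: after the destroy timeout the pipeline is destroyed
        # (CLOSED) and its nodes become available for a new pipeline.
        if (self.state is PipelineState.CLOSING
                and now - self.closing_since >= DESTROY_TIMEOUT):
            self.state = PipelineState.CLOSED

p = Pipeline(["dn1", "dn2", "dn3"])
p.on_missing_heartbeat("dn3", now=0)
p.tick(now=100)   # still CLOSING: the nodes stay unavailable
p.tick(now=300)   # destroy timeout reached: pipeline destroyed
print(p.state)    # PipelineState.CLOSED
```

Note that during the whole CLOSING window the surviving nodes cannot join a new pipeline, which is exactly what hurts small clusters in the use cases below.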

While this algorithm works well on a big cluster, it doesn't provide very good usability on small clusters:

Use case 1:

Given 3 nodes, in case of a service restart, if the restart takes more than 90s, the pipeline will be moved to the CLOSING state. For the next 5 minutes (ozone.scm.pipeline.destroy.timeout) the pipeline will remain in the CLOSING state. As there are no more nodes and we can't assign the same node to two different pipelines, the cluster will be unavailable for 5 minutes.

Use case 2:

Given 90 nodes and 30 pipelines, where all the pipelines are spread across 3 racks. Let's stop one rack. As all the pipelines are affected, all of them will be moved to the CLOSING state. We have no free nodes, therefore we need to wait 5 minutes before we can write any data to the cluster.

These problems can be solved in multiple ways:

1.) Instead of waiting 5 minutes, destroy the pipeline as soon as all of its containers are reported to be closed. (Most of the time this is enough, but some container reports can be missing.)
2.) Support multi-raft and open a new pipeline as soon as we have enough nodes (even if the nodes already have CLOSING pipelines).

Both options require more work on the pipeline management side. For 0.4.0 we can adjust the following parameters to get a better user experience:

{code}
<property>
  <name>ozone.scm.pipeline.destroy.timeout</name>
  <value>60s</value>
  <tag>OZONE, SCM, PIPELINE</tag>
  <description>
    Once a pipeline is closed, SCM should wait for the above configured time
    before destroying a pipeline.
  </description>
</property>
<property>
  <name>ozone.scm.stale.node.interval</name>
  <value>90s</value>
  <tag>OZONE, MANAGEMENT</tag>
  <description>
    The interval for stale node flagging. Please see
    ozone.scm.heartbeat.thread.interval before changing this value.
  </description>
</property>
{code}

First of all, we can be more optimistic and mark a node stale only after 5 minutes instead of 90s. 5 minutes should be enough to recover the nodes most of the time.

Second: we can decrease ozone.scm.pipeline.destroy.timeout. Ideally the close command is sent by the SCM to the datanode with a heartbeat (HB). Between two HBs we have enough time to close all the containers via Ratis. With the next HB, the datanode can report the successful close. (If the containers can't be closed, the SCM can manage the QUASI_CLOSED containers.)

We need to wait at most 29 seconds (worst case) for the next HB, and 29+30 seconds for the confirmation, so 66 seconds seems to be a safe choice (assuming that 6 seconds is enough to process the report about the successful close).
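The worst-case arithmetic above can be double-checked with a quick calculation. This is only a sanity check: the 30-second heartbeat interval (implied by the 29-second worst-case wait) and the 6-second report-processing margin are the assumptions stated in the text, not measured values.

```python
# Back-of-the-envelope check of the worst-case confirmation time.
HB_INTERVAL = 30   # assumed seconds between heartbeats
REPORT_MARGIN = 6  # assumed time to process the close report

wait_for_next_hb = HB_INTERVAL - 1                      # 29s worst case
wait_for_confirmation = wait_for_next_hb + HB_INTERVAL  # 59s until the report
safe_destroy_timeout = wait_for_confirmation + REPORT_MARGIN

# ~65s, consistent with the ~66s destroy timeout proposed above
print(safe_destroy_timeout)  # 65
```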

See: https://issues.apache.org/jira/browse/HDDS-1284

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|:----------|--------:|:--------|
| 0 | reexec | 851 | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| -1 | test4tests | 0 | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | _ trunk Compile Tests _ | | |
| +1 | mvninstall | 1306 | trunk passed |
| +1 | compile | 66 | trunk passed |
| +1 | checkstyle | 21 | trunk passed |
| +1 | mvnsite | 43 | trunk passed |
| +1 | shadedclient | 736 | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 80 | trunk passed |
| +1 | javadoc | 39 | trunk passed |
| | _ Patch Compile Tests _ | | |
| +1 | mvninstall | 44 | the patch passed |
| +1 | compile | 33 | the patch passed |
| +1 | javac | 33 | the patch passed |
| +1 | checkstyle | 15 | the patch passed |
| +1 | mvnsite | 36 | the patch passed |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | xml | 1 | The patch has no ill-formed XML file. |
| +1 | shadedclient | 744 | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 81 | the patch passed |
| +1 | javadoc | 35 | the patch passed |
| | _ Other Tests _ | | |
| +1 | unit | 85 | common in the patch passed. |
| +1 | asflicense | 32 | The patch does not generate ASF License warnings. |
| | | 4315 | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-733/1/artifact/out/Dockerfile |
| GITHUB PR | #733 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml |
| uname | Linux 630e39838fbf 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / abace70 |
| maven version | Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/1/testReport/ |
| Max. process+thread count | 440 (vs. ulimit of 5500) |
| modules | C: hadoop-hdds/common U: hadoop-hdds/common |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/1/console |
| Powered by | Apache Yetus 0.9.0 http://yetus.apache.org |

This message was automatically generated.

@elek (Member, Author) commented Apr 17, 2019

The original commit (which was reverted) was fixed by @linyiqun in HDDS-1297 (thanks again). I applied it to this branch to prove that the two commits together don't cause any problems.

In case of a merge, please don't squash the two commits; just rebase this branch.

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|:----------|--------:|:--------|
| 0 | reexec | 40 | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 1 new or modified test files. |
| | _ trunk Compile Tests _ | | |
| 0 | mvndep | 27 | Maven dependency ordering for branch |
| +1 | mvninstall | 1185 | trunk passed |
| +1 | compile | 78 | trunk passed |
| +1 | checkstyle | 30 | trunk passed |
| +1 | mvnsite | 136 | trunk passed |
| +1 | shadedclient | 899 | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 194 | trunk passed |
| +1 | javadoc | 106 | trunk passed |
| | _ Patch Compile Tests _ | | |
| 0 | mvndep | 12 | Maven dependency ordering for patch |
| -1 | mvninstall | 23 | server-scm in the patch failed. |
| -1 | compile | 66 | hadoop-hdds in the patch failed. |
| -1 | javac | 66 | hadoop-hdds in the patch failed. |
| -0 | checkstyle | 25 | hadoop-hdds: The patch generated 3 new + 0 unchanged - 0 fixed = 3 total (was 0) |
| -1 | mvnsite | 26 | server-scm in the patch failed. |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | xml | 1 | The patch has no ill-formed XML file. |
| +1 | shadedclient | 820 | patch has no errors when building and testing our client artifacts. |
| -1 | findbugs | 24 | server-scm in the patch failed. |
| +1 | javadoc | 97 | the patch passed |
| | _ Other Tests _ | | |
| +1 | unit | 75 | common in the patch passed. |
| +1 | unit | 31 | framework in the patch passed. |
| +1 | unit | 63 | container-service in the patch passed. |
| -1 | unit | 27 | server-scm in the patch failed. |
| +1 | asflicense | 26 | The patch does not generate ASF License warnings. |
| | | 4368 | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-733/2/artifact/out/Dockerfile |
| GITHUB PR | #733 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml |
| uname | Linux a444bdbf61a0 4.4.0-144-generic #170~14.04.1-Ubuntu SMP Mon Mar 18 15:02:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / d608be6 |
| maven version | Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| mvninstall | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/2/artifact/out/patch-mvninstall-hadoop-hdds_server-scm.txt |
| compile | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/2/artifact/out/patch-compile-hadoop-hdds.txt |
| javac | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/2/artifact/out/patch-compile-hadoop-hdds.txt |
| checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/2/artifact/out/diff-checkstyle-hadoop-hdds.txt |
| mvnsite | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/2/artifact/out/patch-mvnsite-hadoop-hdds_server-scm.txt |
| findbugs | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/2/artifact/out/patch-findbugs-hadoop-hdds_server-scm.txt |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/2/artifact/out/patch-unit-hadoop-hdds_server-scm.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/2/testReport/ |
| Max. process+thread count | 341 (vs. ulimit of 5500) |
| modules | C: hadoop-hdds/common hadoop-hdds/framework hadoop-hdds/container-service hadoop-hdds/server-scm U: hadoop-hdds |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/2/console |
| Powered by | Apache Yetus 0.9.0 http://yetus.apache.org |

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|:----------|--------:|:--------|
| 0 | reexec | 57 | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 1 new or modified test files. |
| | _ trunk Compile Tests _ | | |
| 0 | mvndep | 38 | Maven dependency ordering for branch |
| +1 | mvninstall | 1186 | trunk passed |
| +1 | compile | 90 | trunk passed |
| +1 | checkstyle | 35 | trunk passed |
| +1 | mvnsite | 152 | trunk passed |
| +1 | shadedclient | 941 | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 199 | trunk passed |
| +1 | javadoc | 108 | trunk passed |
| | _ Patch Compile Tests _ | | |
| 0 | mvndep | 11 | Maven dependency ordering for patch |
| +1 | mvninstall | 133 | the patch passed |
| +1 | compile | 67 | the patch passed |
| +1 | javac | 67 | the patch passed |
| -0 | checkstyle | 25 | hadoop-hdds: The patch generated 3 new + 0 unchanged - 0 fixed = 3 total (was 0) |
| +1 | mvnsite | 110 | the patch passed |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | xml | 1 | The patch has no ill-formed XML file. |
| +1 | shadedclient | 816 | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 215 | the patch passed |
| +1 | javadoc | 97 | the patch passed |
| | _ Other Tests _ | | |
| +1 | unit | 74 | common in the patch passed. |
| +1 | unit | 31 | framework in the patch passed. |
| +1 | unit | 55 | container-service in the patch passed. |
| -1 | unit | 111 | server-scm in the patch failed. |
| +1 | asflicense | 27 | The patch does not generate ASF License warnings. |
| | | 4564 | |

| Reason | Tests |
|:-------|:------|
| Failed junit tests | hadoop.hdds.scm.node.TestSCMNodeManager |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-733/3/artifact/out/Dockerfile |
| GITHUB PR | #733 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml |
| uname | Linux 270564ca6d65 4.4.0-141-generic #167~14.04.1-Ubuntu SMP Mon Dec 10 13:20:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / d608be6 |
| maven version | Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/3/artifact/out/diff-checkstyle-hadoop-hdds.txt |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/3/artifact/out/patch-unit-hadoop-hdds_server-scm.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/3/testReport/ |
| Max. process+thread count | 349 (vs. ulimit of 5500) |
| modules | C: hadoop-hdds/common hadoop-hdds/framework hadoop-hdds/container-service hadoop-hdds/server-scm U: hadoop-hdds |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/3/console |
| Powered by | Apache Yetus 0.9.0 http://yetus.apache.org |

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|:----------|--------:|:--------|
| 0 | reexec | 28 | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 | @author | 0 | The patch does not contain any @author tags. |
| +1 | test4tests | 0 | The patch appears to include 2 new or modified test files. |
| | _ trunk Compile Tests _ | | |
| 0 | mvndep | 22 | Maven dependency ordering for branch |
| +1 | mvninstall | 1080 | trunk passed |
| +1 | compile | 82 | trunk passed |
| +1 | checkstyle | 28 | trunk passed |
| +1 | mvnsite | 138 | trunk passed |
| +1 | shadedclient | 814 | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 195 | trunk passed |
| +1 | javadoc | 98 | trunk passed |
| | _ Patch Compile Tests _ | | |
| 0 | mvndep | 13 | Maven dependency ordering for patch |
| +1 | mvninstall | 134 | the patch passed |
| +1 | compile | 67 | the patch passed |
| +1 | javac | 67 | the patch passed |
| -0 | checkstyle | 23 | hadoop-hdds: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 | mvnsite | 122 | the patch passed |
| +1 | whitespace | 0 | The patch has no whitespace issues. |
| +1 | xml | 1 | The patch has no ill-formed XML file. |
| +1 | shadedclient | 744 | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 222 | the patch passed |
| +1 | javadoc | 100 | the patch passed |
| | _ Other Tests _ | | |
| +1 | unit | 75 | common in the patch passed. |
| +1 | unit | 34 | framework in the patch passed. |
| +1 | unit | 59 | container-service in the patch passed. |
| -1 | unit | 95 | server-scm in the patch failed. |
| +1 | asflicense | 29 | The patch does not generate ASF License warnings. |
| | | 4201 | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-733/4/artifact/out/Dockerfile |
| GITHUB PR | #733 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml |
| uname | Linux dadbccbe879f 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / b979fdd |
| maven version | Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/4/artifact/out/diff-checkstyle-hadoop-hdds.txt |
| unit | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/4/artifact/out/patch-unit-hadoop-hdds_server-scm.txt |
| Test Results | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/4/testReport/ |
| Max. process+thread count | 465 (vs. ulimit of 5500) |
| modules | C: hadoop-hdds/common hadoop-hdds/framework hadoop-hdds/container-service hadoop-hdds/server-scm U: hadoop-hdds |
| Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-733/4/console |
| Powered by | Apache Yetus 0.9.0 http://yetus.apache.org |

This message was automatically generated.

@elek (Member, Author) commented May 16, 2019

Just merged.

@elek closed this May 16, 2019
shanthoosh pushed a commit to shanthoosh/hadoop that referenced this pull request Oct 15, 2019
Changed the KVSerde to only a value Serde for the Eventhubs input and output descriptors.
Since the key is always a `String`, the key serde should always be `NoOpSerde`; it will lead to an error otherwise, since the Samza `serializers.SerdeManager.scala` expects a `byte[]`.

Author: Daniel Chen <[email protected]>

Reviewers: Prateek Maheshwari <[email protected]>

Closes apache#733 from dxichen/eventhubs-example-cleanup