HDDS-1284. Adjust default values of pipeline recovery for more resilient service restart #733
Conversation
💔 -1 overall
This message was automatically generated.
The original commit (which was reverted) was fixed by @linyiqun in HDDS-1297 (thx here again). I applied it to this branch to prove that the two commits together don't cause any problem. In case of a merge please don't squash the two commits, just rebase this branch.
Just merged.
As of now we have the following algorithm to handle node failures: if a datanode doesn't send heartbeats for ozone.scm.stale.node.interval (90s by default), it is marked as stale, its pipelines are moved to the CLOSING state, and after ozone.scm.pipeline.destroy.timeout the pipelines are destroyed.
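A minimal sketch of that flow, with hypothetical class and method names (this is not the actual SCM code, just an illustration of the state transitions and timeouts described above):
{code:java}
// Illustrative sketch of the node-failure handling described above.
// All class, method and field names are hypothetical, not the real SCM implementation.
import java.time.Duration;
import java.time.Instant;

class NodeFailureHandler {
  static final Duration STALE_NODE_INTERVAL = Duration.ofSeconds(90);     // ozone.scm.stale.node.interval
  static final Duration PIPELINE_DESTROY_TIMEOUT = Duration.ofMinutes(5); // ozone.scm.pipeline.destroy.timeout

  void onMissedHeartbeat(Node node, Instant lastHeartbeat, Instant now) {
    if (Duration.between(lastHeartbeat, now).compareTo(STALE_NODE_INTERVAL) > 0) {
      node.markStale();
      for (Pipeline pipeline : node.pipelines()) {
        pipeline.moveToClosing();                            // pipeline is unusable from this point
        scheduleDestroy(pipeline, PIPELINE_DESTROY_TIMEOUT); // destroyed only after the timeout expires
      }
    }
  }

  void scheduleDestroy(Pipeline pipeline, Duration timeout) { /* timer-based destroy */ }

  interface Node { void markStale(); Iterable<Pipeline> pipelines(); }
  interface Pipeline { void moveToClosing(); }
}
{code}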
While this algorithm can work well with a big cluster, it doesn't provide very good usability on small clusters:
Use case 1:
Given 3 nodes, in case of a service restart, if the restart takes more than 90s, the pipeline will be moved to the CLOSING state. For the next 5 minutes (ozone.scm.pipeline.destroy.timeout) the pipeline will remain in the CLOSING state. As there are no more nodes and we can't assign the same node to two different pipelines, the cluster will be unavailable for 5 minutes.
Use case 2:
Given 90 nodes and 30 pipelines, where all the pipelines are spread across 3 racks, let's stop one rack. As all the pipelines are affected, all of them will be moved to the CLOSING state. We have no free nodes, therefore we need to wait for 5 minutes before we can write any data to the cluster.
These problems can be solved in multiple ways:
1.) Instead of waiting 5 minutes, destroy the pipeline when all of its containers are reported to be closed. (Most of the time this is enough, but some container reports can be missing; a rough sketch of this check is shown after this list.)
2.) Support multi-raft and open a new pipeline as soon as we have enough nodes (even if the nodes already have CLOSING pipelines).
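A rough sketch of the check behind option 1; all names here are illustrative and assume SCM already tracks per-container replica state, this is not the real SCM API:
{code:java}
// Sketch for option 1: destroy a CLOSING pipeline as soon as every container on it
// is reported CLOSED, instead of waiting for the full destroy timeout.
import java.util.Collection;

class EagerPipelineDestroyer {
  boolean readyToDestroy(Collection<ContainerInfo> containersOnPipeline) {
    // If a container report is missing we never reach "all closed" and have to fall back
    // to the timeout-based destroy, which is why this optimization alone is not sufficient.
    return containersOnPipeline.stream().allMatch(ContainerInfo::isClosed);
  }

  interface ContainerInfo { boolean isClosed(); }
}
{code}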
Both options require more work on the pipeline management side. For 0.4.0 we can adjust the following parameters to get a better user experience:
{code}
ozone.scm.stale.node.interval
90s
OZONE, MANAGEMENT
The interval for stale node flagging. Please see
ozone.scm.heartbeat.thread.interval before changing this value.
{code}
{code}
ozone.scm.pipeline.destroy.timeout
60s
OZONE, SCM, PIPELINE
Once a pipeline is closed, SCM should wait for the above configured time
before destroying a pipeline.
{code}
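As a sanity check, here is a small, hedged example of reading and overriding these keys through the standard Hadoop Configuration API. The key names are taken from the descriptions above; the values (5 minutes and 66 seconds) are the ones proposed in this description, not necessarily the final committed defaults:
{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class PipelineRecoveryDefaults {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Values proposed in this issue; in a real cluster these would go into ozone-site.xml.
    conf.set("ozone.scm.stale.node.interval", "5m");
    conf.set("ozone.scm.pipeline.destroy.timeout", "66s");

    // Read them back, falling back to the old defaults (90s and 300s) if unset.
    long staleSec = conf.getTimeDuration("ozone.scm.stale.node.interval", 90, TimeUnit.SECONDS);
    long destroySec = conf.getTimeDuration("ozone.scm.pipeline.destroy.timeout", 300, TimeUnit.SECONDS);

    System.out.println("stale node interval   = " + staleSec + "s");
    System.out.println("pipeline destroy wait = " + destroySec + "s");
  }
}
{code}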
First of all, we can be more optimistic and mark a node as stale only after 5 minutes instead of 90s. 5 minutes should be enough most of the time to recover the nodes.
Second, we can decrease ozone.scm.pipeline.destroy.timeout. Ideally the close command is sent by the SCM to the datanode with a HB. Between two HBs we have enough time to close all the containers via Ratis. With the next HB, the datanode can report the successful close. (If the containers can't be closed via Ratis, the SCM can handle the resulting QUASI_CLOSED containers.)
We need to wait up to 29 seconds (worst case) for the next HB, and 29+30 seconds for the confirmation, so 66 seconds seems to be a safe choice (assuming that 6 seconds is enough to process the report about the successful closing).
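The worst-case budget behind that 66-second figure, spelled out as a small calculation (the 30-second heartbeat interval and the 6-second processing margin are the assumptions from the paragraph above):
{code:java}
// Worst-case time budget for confirming that all containers on a pipeline are closed,
// using the numbers from the paragraph above (illustrative only).
public class DestroyTimeoutBudget {
  public static void main(String[] args) {
    int heartbeatIntervalSec = 30;                       // HB interval assumed in the description
    int waitForCloseCommand = heartbeatIntervalSec - 1;  // 29s: close command goes out with the next HB
    int waitForConfirmation = heartbeatIntervalSec;      // +30s: the HB after that carries the close report
    int processingMargin = 6;                            // time for SCM to process the report

    int worstCase = waitForCloseCommand + waitForConfirmation + processingMargin; // 29 + 30 + 6 = 65
    System.out.println("worst case ~" + worstCase + "s, so 66s is a safe destroy timeout");
  }
}
{code}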
See: https://issues.apache.org/jira/browse/HDDS-1284