Description
Each node needs to keep track of the last few master node changes (keeping this in memory should be fine) along with its local node information, because a node might see the master flapping while the master node itself is fine. For example, if the master node has changed more than 3 times in the last 30 minutes, it is not stable; otherwise, there is nothing to report.
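The in-memory tracking described above could look roughly like the following sketch. The class name `MasterHistorySketch`, its methods, and the pruning strategy are hypothetical and only for illustration; they are not the Elasticsearch implementation. One simplification to note: the oldest retained entry has no predecessor, so a change right at the edge of the window is not counted.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Objects;

/** Hypothetical sketch of a per-node master history; not the Elasticsearch class. */
class MasterHistorySketch {
    /** One observed change; masterId is null when no master was visible. */
    static final class Change {
        final long timeMillis;
        final String masterId;

        Change(long timeMillis, String masterId) {
            this.timeMillis = timeMillis;
            this.masterId = masterId;
        }
    }

    static final int MAX_CHANGES = 3; // threshold quoted in the description

    private final Deque<Change> changes = new ArrayDeque<>();
    private final long windowMillis;

    MasterHistorySketch(Duration window) {
        this.windowMillis = window.toMillis();
    }

    /** Records the master seen at nowMillis if it differs from the last one, then drops entries older than the window. */
    void onMasterObserved(long nowMillis, String masterId) {
        Change last = changes.peekLast();
        if (last == null || !Objects.equals(last.masterId, masterId)) {
            changes.addLast(new Change(nowMillis, masterId));
        }
        while (!changes.isEmpty() && changes.peekFirst().timeMillis < nowMillis - windowMillis) {
            changes.removeFirst();
        }
    }

    /** Counts switches between distinct non-null masters among the retained entries. */
    int identityChanges() {
        int count = 0;
        String previous = null;
        for (Change change : changes) {
            if (change.masterId == null) {
                continue; // a gap without a master is not an identity change
            }
            if (previous != null && !previous.equals(change.masterId)) {
                count++;
            }
            previous = change.masterId;
        }
        return count;
    }

    /** Stable means the master changed at most MAX_CHANGES times within the window. */
    boolean isStable() {
        return identityChanges() <= MAX_CHANGES;
    }
}
```

A node would call `onMasterObserved` whenever it applies a cluster state with a different master, and the health check would read `isStable()`.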
The coordinating node might have to contact a master-eligible node, which in turn might have to contact other master-eligible nodes, but this is the worst-case scenario and definitely does not involve fanning out to all nodes.
- Store a view of the last 30 minutes of master history on each node, and add the ability to query any node for its view of master history (see "Adding a view of master history" #85941)
- Determine if the node has a master (if one is present in the cluster state, or otherwise if it has seen one in the last 30 seconds) (see "Master stability health indicator part 1 (when a master has been seen recently)" #86524)
In case the node has a master according to the above definition:
- Check if a seen master is stable - master did not go `null` repeatedly in the last 30 minutes. Report GREEN (see "Master stability health indicator part 1 (when a master has been seen recently)" #86524)
- Check if master is UNSTABLE - did it change more than 3 times in the last 30 minutes? Report YELLOW (add history info on who was master and when) (see "Master stability health indicator part 1 (when a master has been seen recently)" #86524)
- RCA for unstable master - try to contact a previous master about the reason it stepped down (take disconnect/timeout of the network call into account). For this we'll need to expose the "master history log" we built at the above points over the wire. (see "Master stability health indicator part 1 (when a master has been seen recently)" #86524)
- Check if we (the node coordinating this check) are unstable - has the witnessed master gone null/not-null more than 3 times in the last 30 minutes but the identity hasn't changed? Report the unstable master case above (report YELLOW) with the same RCA (see "Master stability health indicator part 1 (when a master has been seen recently)" #86524)
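The checks for the has-master case can be sketched as a small classification function. This is only an illustration under assumptions: `MasterStabilitySketch`, `hasMaster`, and `classify` are hypothetical names, the history is simplified to an ordered list of master ids (with `null` for "no master") already restricted to the last 30 minutes, and the RCA step is left out.

```java
import java.util.List;

/** Hypothetical sketch of the has-master checks; not the Elasticsearch implementation. */
class MasterStabilitySketch {
    enum Status { GREEN, YELLOW }

    static final int MAX_CHANGES = 3;               // "more than 3 times" from the checklist
    static final long RECENT_MASTER_MILLIS = 30_000; // "seen one in the last 30 seconds"

    /** A node has a master if one is in the cluster state or one was seen recently. */
    static boolean hasMaster(String currentMaster, long lastMasterSeenMillis, long nowMillis) {
        return currentMaster != null || nowMillis - lastMasterSeenMillis <= RECENT_MASTER_MILLIS;
    }

    /** Counts switches between distinct non-null masters. */
    static int identityChanges(List<String> masters) {
        int count = 0;
        String previous = null;
        for (String master : masters) {
            if (master == null) {
                continue;
            }
            if (previous != null && !previous.equals(master)) {
                count++;
            }
            previous = master;
        }
        return count;
    }

    /** Counts transitions between having a master and having none. */
    static int nullTransitions(List<String> masters) {
        int count = 0;
        for (int i = 1; i < masters.size(); i++) {
            boolean previousIsNull = masters.get(i - 1) == null;
            boolean currentIsNull = masters.get(i) == null;
            if (previousIsNull != currentIsNull) {
                count++;
            }
        }
        return count;
    }

    /** YELLOW if the identity changed or the master flapped to/from null too often, else GREEN. */
    static Status classify(List<String> mastersInWindow) {
        if (identityChanges(mastersInWindow) > MAX_CHANGES) {
            return Status.YELLOW; // master identity is unstable
        }
        if (nullTransitions(mastersInWindow) > MAX_CHANGES) {
            return Status.YELLOW; // same master, but this node keeps losing it
        }
        return Status.GREEN;
    }
}
```

Keeping the two counters separate mirrors the checklist: an identity change points at the cluster, while null/not-null flapping with an unchanged identity points at the observing node.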
In case the node does not have a master node:
- Check if we know of any master-eligible nodes - in case we don't know of any, report RED due to a discovery problem (include witnessed masters history). (see "Adding additional capability to the master_is_stable health indicator service" #87482)

In case we know of some, use the `PeerFinder` to check:
- Check if one of them is master - in which case report RED as we are having issues joining a cluster (see "Adding additional capability to the master_is_stable health indicator service" #87482)
- None of them is master and we aren't master eligible
- RCA: reach out to a master-eligible node and run the same checks (take disconnect/timeout of the network call into account)
- If we are a master-eligible node, collect the information from all known master-eligible nodes (the information about term/version/voting config should be available in `ClusterFormationFailureHelper`, however we'd need that exposed over the wire; see "Adding a transport action to get cluster formation info" #87306) - take disconnect/timeout of the network call into account
- Document the settings we create for the master stability and cluster diagnostics service(s) (see "Documenting master_is_stable health API settings" #87901)
- Create `master-is-stable` indicator
- Outline the details we'll be able to provide on drill down
- Add impacts
- Add user-actions
- Investigate what troubleshooting docs we can offer
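
The branch of the checklist for a node that does not have a master can be read as a short decision tree. The sketch below is a hedged illustration only: `NoMasterDiagnosisSketch`, `Peer`, and the `Outcome` values are hypothetical names, the peer list stands in for whatever the `PeerFinder` reports, and the remote calls (with their disconnect/timeout handling) are reduced to outcome labels.

```java
import java.util.List;

/** Hypothetical sketch of the no-master decision tree; not the Elasticsearch implementation. */
class NoMasterDiagnosisSketch {
    enum Outcome {
        RED_DISCOVERY_PROBLEM,     // no master-eligible nodes known
        RED_CANNOT_JOIN,           // a known peer is master but we have not joined it
        ASK_MASTER_ELIGIBLE_NODE,  // we are not master-eligible: delegate the checks
        COLLECT_FORMATION_INFO     // we are master-eligible: gather cluster formation info
    }

    /** A known master-eligible peer and whether it currently claims to be master. */
    static final class Peer {
        final String id;
        final boolean isMaster;

        Peer(String id, boolean isMaster) {
            this.id = id;
            this.isMaster = isMaster;
        }
    }

    static Outcome diagnose(boolean localNodeIsMasterEligible, List<Peer> knownMasterEligiblePeers) {
        if (knownMasterEligiblePeers.isEmpty()) {
            return Outcome.RED_DISCOVERY_PROBLEM;
        }
        for (Peer peer : knownMasterEligiblePeers) {
            if (peer.isMaster) {
                return Outcome.RED_CANNOT_JOIN;
            }
        }
        // None of the known peers is master.
        if (!localNodeIsMasterEligible) {
            return Outcome.ASK_MASTER_ELIGIBLE_NODE;
        }
        return Outcome.COLLECT_FORMATION_INFO;
    }
}
```

The last two outcomes are starting points for further remote calls rather than final statuses, which is why they are not mapped to a color here.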