More/progressive health check commands #292
Description
Yes, even more. Per discussion with @gerhard:
- There is no single health check command that would be "universal": too many things can go wrong, and different teams would consider different conditions to be failures
- There are node-local and cluster-wide checks, which should be reflected in command names
- Health checks are staged (just like human or animal health checks), so we need commands that perform increasingly comprehensive checks with an increasing likelihood of false positives. For example, `rabbitmqctl node_health_check` today checks every channel process, which takes a long time with tens of thousands of channels
Below is a proposal draft that will be refined as we go.
Introduction
Since relatively few multi-service systems that use messaging are completely
identical, and different operators consider different things to be within
normal parameters, team RabbitMQ (and some other folks who work on data services and their automation [1][2]) has long concluded that
there is no such thing as a "one true way" to health check a RabbitMQ node.
The Docker image maintainer community has arrived at a similar conclusion.
Things get even more involved with clusters, since distributed system monitoring,
the level of fault tolerance acceptable for a given system,
and preferred ways of reacting/recovering can vary greatly from
ops team to ops team.
Another important aspect of node monitoring is how it should be altered during
upgrades. This proposal doesn't cover that part.
Two Types of Health Checks
The proposal is to classify every health check RabbitMQ offers into one of
two categories:
- Node-local checks
- Cluster checks
Each category will have a number of checks organized into stages, with
progressively more aspects of the system covered. This means the probability
of false positives will also be higher at higher stages. Which stage
is used by a given deployment is a choice of that system's operators.
Node-local Checks
Stage 1
What `rabbitmqctl ping` offers today: it ensures that the runtime is running
and (indirectly) that CLI tools can authenticate to it.
This is the most basic check possible. Aside from the CLI tool authentication
part, the probability of false positives approaches 0
outside of upgrades and maintenance windows.
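A stage 1 probe can be wrapped in a few lines of shell. This is a sketch, not an official script: the `probe_stage` helper is hypothetical and simply maps a command's exit code to a PASS/FAIL line.

```shell
#!/bin/sh
# Hypothetical helper: run a check command and report PASS/FAIL based on
# its exit code. It propagates the success/failure status so it can be
# used directly in orchestration scripts.
probe_stage() {
  name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
    return 1
  fi
}

# Stage 1: the runtime is up and CLI tools can authenticate to it.
# Only attempted when rabbitmqctl is actually installed.
if command -v rabbitmqctl >/dev/null 2>&1; then
  probe_stage "stage 1 (ping)" rabbitmqctl ping
fi
```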
Stage 2
Includes all checks in stage 1, plus makes sure that `rabbitmqctl status`
(well, the function that backs it) succeeds.
This is a common way of sanity-checking a node.
The probability of false positives approaches 0
outside of upgrades and maintenance windows.
Stage 3
Includes all checks in stage 2, plus checks that the RabbitMQ application is running
(not stopped/"paused" with `rabbitmqctl stop_app`
or by the Pause Minority partition
handling strategy) and that there are no resource alarms in effect.
The probability of false positives is generally low but can rise
significantly during upgrades and maintenance windows.
Systems hovering around their maximum allowed memory usage will have a high
probability of false positives.
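Under this proposal, a stage 3 probe could combine two of the commands from the task list below (`check_running` and `check_local_alarms`). A sketch, assuming those commands exist and follow the usual exit-code convention; the `DIAG` override only exists to make the function easy to exercise without a live node:

```shell
#!/bin/sh
# Stage 3 sketch: the node counts as healthy only if the RabbitMQ
# application is running AND no local resource alarms are in effect.
# DIAG can be overridden (e.g. for testing); defaults to the real CLI.
DIAG="${DIAG:-rabbitmq-diagnostics}"

node_stage3() {
  "$DIAG" check_running >/dev/null 2>&1 &&
    "$DIAG" check_local_alarms >/dev/null 2>&1
}

# Only attempt the real check when the CLI tool is installed.
if command -v "$DIAG" >/dev/null 2>&1; then
  if node_stage3; then
    echo "stage 3: healthy"
  else
    echo "stage 3: unhealthy"
  fi
fi
```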
Stage 4
Includes all checks in stage 3, plus checks that there are no failing virtual hosts.
The probability of false positives is generally low but can rise
significantly during upgrades and maintenance windows.
Stage 5
Includes all checks in stage 4, plus a check of all enabled listeners
(using a temporary TCP connection).
The probability of false positives is generally low but can rise
significantly during upgrades and maintenance windows.
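The listener part of such a check boils down to opening a short-lived TCP connection to each listener's port. A minimal sketch using bash's `/dev/tcp` pseudo-device (the proposed `rabbitmq-diagnostics check_port_connectivity` command would do this properly; the `port_open` helper and the default AMQP port 5672 are illustrative assumptions):

```shell
#!/bin/bash
# Hypothetical listener probe: open (and immediately close) a TCP
# connection to a host/port pair, roughly what a stage 5 check does.
port_open() {
  # bash's /dev/tcp pseudo-device; fails if the port is unreachable
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Probe the default AMQP listener port on the local node.
if port_open 127.0.0.1 5672; then
  echo "AMQP listener reachable"
else
  echo "AMQP listener not reachable"
fi
```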
Stage 6
Includes all checks in stage 5, plus what `rabbitmqctl node_health_check`
does today (it sanity-checks every local queue master process and every channel).
The probability of false positives is moderate for systems under
above-average load or with a large number of queues and channels
(starting with tens of thousands).
Optional Check 1
Includes all checks in stage 4, plus checks that an expected set of plugins is
enabled.
The probability of false positives is generally low but can rise
significantly during upgrades and maintenance windows, depending on the deployment
tools/strategies used (e.g. all plugins can be temporarily disabled).
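This optional check maps onto the `rabbitmq-plugins is_enabled` command proposed in this issue (#295 in the task list below). A sketch, assuming the usual exit-code convention; the plugin names and the `PLUGINS_CLI` override are illustrative:

```shell
#!/bin/sh
# Optional check sketch: verify that an expected set of plugins is
# enabled. PLUGINS_CLI can be overridden for testing; defaults to the
# real CLI tool.
PLUGINS_CLI="${PLUGINS_CLI:-rabbitmq-plugins}"

expected_plugins_enabled() {
  # is_enabled is expected to exit non-zero when any listed plugin
  # is not enabled on the node
  "$PLUGINS_CLI" is_enabled "$@" >/dev/null 2>&1
}

# Only attempt the real check when the CLI tool is installed.
if command -v "$PLUGINS_CLI" >/dev/null 2>&1; then
  if expected_plugins_enabled rabbitmq_management rabbitmq_shovel; then
    echo "expected plugins are enabled"
  else
    echo "expected plugin set is not enabled"
  fi
fi
```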
Cluster Checks
Stage 1
Checks for the expected number of nodes in a cluster.
The probability of false positives can be considered approaching 0.
Stage 2
Checks for network partitions detected by a node.
The probability of false positives is a function of the partition
detection algorithm used. With a timer-based strategy it is moderate
(say, within the (0, 0.2] range). With adaptive accrual failure detectors it is lower (according to our team's anecdotal evidence).
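A cluster stage 1 probe reduces to comparing the observed number of nodes against a known expected cluster size. How the count is obtained is deployment-specific (e.g. parsed from `rabbitmqctl cluster_status` output); the sketch below only shows the comparison, and the expected size of 3 is an assumption:

```shell
#!/bin/sh
# Cluster stage 1 sketch: fail when the observed node count differs
# from the expected cluster size. EXPECTED_NODES is deployment-specific.
EXPECTED_NODES="${EXPECTED_NODES:-3}"

check_node_count() {
  observed="$1"
  if [ "$observed" -eq "$EXPECTED_NODES" ]; then
    echo "cluster size OK ($observed/$EXPECTED_NODES nodes)"
  else
    echo "cluster size MISMATCH ($observed/$EXPECTED_NODES nodes)"
    return 1
  fi
}

# Example with a hard-coded observed count; a real probe would obtain
# it from the node itself.
check_node_count 3
```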
Tasks
- Agree on a range of checks and document it
- `rabbitmq-diagnostics is_running` (Introduce rabbitmq-diagnostics is_running, is_booting #294)
- `rabbitmq-diagnostics is_booting` (Introduce rabbitmq-diagnostics is_running, is_booting #294)
- `rabbitmq-plugins is_enabled [plugin]` (Introduce rabbitmq-plugins is_enabled [plugin 1] [plugin 2] [...] #295)
- `rabbitmq-diagnostics alarms` (Introduce rabbitmq-diagnostics alarms #296)
- `rabbitmq-diagnostics check_alarms` (Introduce rabbitmq-diagnostics alarms #296)
- `rabbitmq-diagnostics check_local_alarms` (Introduce rabbitmq-diagnostics alarms #296)
- `rabbitmq-diagnostics listeners` (Introduce rabbitmq-diagnostics listeners #298)
- `rabbitmq-diagnostics check_running`
- `rabbitmq-diagnostics check_protocol_listener` (Introduces listener check commands #300)
- `rabbitmq-diagnostics check_port_listener` (Introduces listener check commands #300)
- `rabbitmq-diagnostics check_port_connectivity` (Introduces listener check commands #300)

`rabbitmq-diagnostics check_virtual_hosts` was extracted into a separate issue scheduled for 3.7.12.