This repository was archived by the owner on Nov 18, 2020. It is now read-only.

More/progressive health check commands #292

@michaelklishin

Yes, even more. Per discussion with @gerhard:

  • There is no single health check command that would be "universal": too many things can go wrong and would be considered a failure by different teams
  • There are node-local and cluster-wide checks, which should be reflected in command names
  • Health checks are staged (just like human or animal health checks), so we need commands that perform increasingly comprehensive checks with a correspondingly higher likelihood of false positives, e.g. node_health_check today checks every channel process, which takes a long time with tens of thousands of channels

Below is a proposal draft that will be refined as we go.

Introduction

Since relatively few multi-service systems that use messaging can be considered
completely identical, and different operators consider different things to be
within normal parameters, team RabbitMQ (and other folks who work on data
services and their automation [1][2]) has long concluded that
there is no such thing as a "one true way to health check" a RabbitMQ node.

The Docker image maintainer community has arrived at a similar conclusion.

Things get even more involved with clusters, since distributed system monitoring,
the level of fault tolerance acceptable for a given system,
and the preferred ways of reacting/recovering can vary greatly from
ops team to ops team.

Another important aspect of node monitoring is how it should be altered during
upgrades. This proposal doesn't cover that part.

Two Types of Health Checks

The proposal is to classify every health check RabbitMQ offers into one of
two categories:

  • Node-local checks
  • Cluster checks

Each category will have a number of checks organized into stages, with
increasingly more aspects of the system checked. This means the probability
of false positives for higher stages will also be higher. Which stage
is used by a given deployment is a choice of that system's operators.

Node-local Checks

Stage 1

What rabbitmqctl ping offers today: it ensures that the runtime is running
and (indirectly) that CLI tools can authenticate to it.

This is the most basic check possible. Except for the CLI tool authentication
part, the probability of false positives approaches zero outside of
upgrades and maintenance windows.
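
As a sketch, a stage 1 probe can be a thin wrapper around the exit code of
rabbitmqctl ping (the check_stage1 function name and its output are ours,
not an existing command):

```shell
#!/usr/bin/env bash
# Stage 1: the runtime is up and CLI tools can authenticate to it.
# Relies solely on the exit code of `rabbitmqctl ping`.
check_stage1() {
  if rabbitmqctl ping >/dev/null 2>&1; then
    echo "stage 1: ok"
  else
    echo "stage 1: node did not respond to ping" >&2
    return 1
  fi
}

# Example: run the check only where the CLI is actually installed.
command -v rabbitmqctl >/dev/null 2>&1 && check_stage1 || true
```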

Stage 2

Includes all checks in stage 1 plus makes sure that rabbitmqctl status
(well, the function that backs it) succeeds.

This is a common way of sanity checking a node.
The probability of false positives approaches zero except during
upgrades and maintenance windows.
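
A stage 2 probe can chain the two commands, failing fast on the first
error; a minimal sketch (the function name is ours):

```shell
#!/usr/bin/env bash
# Stage 2: everything in stage 1 plus a successful `rabbitmqctl status`.
check_stage2() {
  rabbitmqctl ping >/dev/null 2>&1 \
    || { echo "stage 2: ping failed" >&2; return 1; }
  rabbitmqctl status >/dev/null 2>&1 \
    || { echo "stage 2: status failed" >&2; return 1; }
  echo "stage 2: ok"
}
```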

Stage 3

Includes all checks in stage 2 plus checks that the RabbitMQ application is running
(not stopped/"paused" with rabbitmqctl stop_app or the Pause Minority partition
handling strategy) and there are no resource alarms.

The probability of false positives is generally low but can rise
significantly during upgrades and maintenance windows.

Systems hovering around their max allowed memory usage will have a high
probability of false positives.
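
Pending dedicated commands, a stage 3 probe could approximate these checks
via rabbitmqctl eval. Note that rabbit:is_running/0 and
rabbit_alarm:get_alarms/0 are internal APIs, and their use here (and the
exact output format) is our assumption, not part of the proposal:

```shell
#!/usr/bin/env bash
# Stage 3 sketch: stage 2 plus "app is running" and "no resource alarms".
check_stage3() {
  rabbitmqctl status >/dev/null 2>&1 \
    || { echo "stage 3: status failed" >&2; return 1; }
  # Assumption: the rabbit application reports itself as running.
  rabbitmqctl eval 'rabbit:is_running().' 2>/dev/null | grep -q '^true' \
    || { echo "stage 3: rabbit application is not running" >&2; return 1; }
  # Assumption: an empty alarm list prints as [].
  rabbitmqctl eval 'rabbit_alarm:get_alarms().' 2>/dev/null | grep -q '^\[\]' \
    || { echo "stage 3: resource alarms in effect" >&2; return 1; }
  echo "stage 3: ok"
}
```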

Stage 4

Includes all checks in stage 3 plus checks that there are no failing virtual hosts.

The probability of false positives is generally low but can rise
significantly during upgrades and maintenance windows.

Stage 5

Includes all checks in stage 4 plus a check on all enabled listeners
(using a temporary TCP connection).

The probability of false positives is generally low but can rise
significantly during upgrades and maintenance windows.
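
A listener probe can be as simple as a short-lived TCP connection. The
sketch below uses bash's /dev/tcp pseudo-device; the host/port pairs in the
usage example are hypothetical (actual listeners are deployment-specific):

```shell
#!/usr/bin/env bash
# Stage 5 building block: probe one listener with a temporary TCP connection.
check_listener() {
  local host="$1" port="$2"
  # Open (and immediately discard) a TCP connection, bounded by a timeout.
  if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "listener ${host}:${port}: ok"
  else
    echo "listener ${host}:${port}: unreachable" >&2
    return 1
  fi
}

# Example (hypothetical ports for AMQP and the management plugin):
# check_listener 127.0.0.1 5672
# check_listener 127.0.0.1 15672
```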

Stage 6

Includes all checks in stage 5 plus what rabbitmqctl node_health_check
does (it sanity checks every local queue master process and every channel).

The probability of false positives is moderate for systems under
above-average load or with a large number of queues and channels
(starting at tens of thousands).
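
A stage 6 probe can delegate to the existing command; given its cost on
nodes with many queues and channels, operators may want to wrap it with a
generous timeout. A minimal sketch (the function name is ours):

```shell
#!/usr/bin/env bash
# Stage 6 sketch: a basic sanity check plus the (expensive) per-queue and
# per-channel check that `rabbitmqctl node_health_check` performs today.
check_stage6() {
  rabbitmqctl status >/dev/null 2>&1 \
    || { echo "stage 6: status failed" >&2; return 1; }
  rabbitmqctl node_health_check >/dev/null 2>&1 \
    || { echo "stage 6: node_health_check failed" >&2; return 1; }
  echo "stage 6: ok"
}
```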

Optional Check 1

Includes all checks in stage 4 plus checks that an expected set of plugins is
enabled.

The probability of false positives is generally low but can rise
significantly during upgrades and maintenance windows, depending on the
deployment tools/strategies used (e.g. all plugins can be temporarily disabled).
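
Such a check can diff the enabled plugin set, as reported by
rabbitmq-plugins list -e -m, against an expected one. In this sketch the
check_plugins name and the PLUGINS_CMD override (handy for testing) are
ours:

```shell
#!/usr/bin/env bash
# Optional check sketch: compare enabled plugins against an expected set.
# PLUGINS_CMD can be overridden (e.g. in tests); defaults to the real CLI.
check_plugins() {
  local expected="$1"   # newline-separated, sorted plugin names
  local enabled
  # -e: enabled plugins only, -m: minimal output (names, one per line)
  enabled="$(${PLUGINS_CMD:-rabbitmq-plugins} list -e -m 2>/dev/null | sort)"
  if [ "$enabled" = "$expected" ]; then
    echo "plugins: ok"
  else
    echo "plugins: enabled set differs from expected" >&2
    return 1
  fi
}

# Example (hypothetical expected set):
# check_plugins "$(printf 'rabbitmq_management\nrabbitmq_shovel' | sort)"
```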

Cluster Checks

Stage 1

Checks for the expected number of nodes in a cluster.

The probability of false positives can be considered to approach zero.
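
Pending a dedicated command, the running node count can be obtained via
rabbitmqctl eval; rabbit_mnesia:cluster_nodes/1 is an internal API, so treat
this as an assumption-laden sketch (the function name is ours):

```shell
#!/usr/bin/env bash
# Cluster stage 1 sketch: compare the number of running cluster nodes
# against the expected count supplied by the operator.
check_cluster_size() {
  local expected="$1"
  local running
  # Assumption: this internal call prints a bare integer.
  running="$(rabbitmqctl eval 'length(rabbit_mnesia:cluster_nodes(running)).' 2>/dev/null)"
  if [ "$running" = "$expected" ]; then
    echo "cluster size: ok (${running} nodes running)"
  else
    echo "cluster size: expected ${expected}, found ${running:-unknown}" >&2
    return 1
  fi
}

# Example for a three-node cluster:
# check_cluster_size 3
```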

Stage 2

Checks for network partitions detected by a node.

The probability of false positives is a function of the partition
detection algorithm used. With a timer-based strategy it is moderate
(say, within the (0, 0.2] range). With adaptive accrual failure detectors it is lower (according to our team's anecdotal evidence).

Tasks

rabbitmq-diagnostics check_virtual_hosts was extracted into a separate issue scheduled for 3.7.12.

Footnotes

  1. Add a healthcheck script: docker-library/rabbitmq#174 (comment)
  2. HEALTHCHECK directive in Dockerfile: docker-library/cassandra#76 (comment)
