More/progressive health check commands #292
Description
Yes, even more. Per discussion with @gerhard:
- There is no single health check command that would be "universal": too many things can go wrong, and different teams would consider different conditions to be failures
- There are node-local and cluster-wide checks, which should be reflected in command names
- Health checks are staged (just like human or animal health checks), so we need commands that perform increasingly comprehensive checks with an increasing likelihood of false positives. For example, `rabbitmqctl node_health_check` today checks every channel process, which takes a long time with tens of thousands of channels
Below is a proposal draft that will be refined as we go.
Introduction
Since relatively few multi-service systems that use messaging are completely
identical, and different operators consider different things to be within
normal parameters, team RabbitMQ (and some other folks who work on data services and their automation [1][2]) has long concluded that
there is no such thing as a "one true way" to health check a RabbitMQ node.
The Docker image maintainer community has arrived at a similar conclusion.
Things get even more involved with clusters, since distributed system monitoring,
the level of fault tolerance acceptable for a given system,
and preferred ways of reacting/recovering can vary greatly from
ops team to ops team.
Another important aspect of node monitoring is how it should be altered during
upgrades. This proposal doesn't cover that part.
Two Types of Health Checks
The proposal is to classify every health check RabbitMQ offers into one of
two categories:
- Node-local checks
- Cluster checks
Each category will have a number of checks organized into stages, with
progressively more aspects of the system covered. This means the probability
of false positives will also be higher at higher stages. Which stage
is used by a given deployment is a choice of that system's operators.
Node-local Checks
Stage 1
What `rabbitmqctl ping` offers today: it ensures that the runtime is running
and (indirectly) that CLI tools can authenticate to it.
This is the most basic check possible. Aside from the CLI tool authentication
part, the probability of false positives approaches 0
outside of upgrades and maintenance windows.
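A stage 1 probe can be wrapped in a few lines of shell. This is a sketch, not an official script: the `probe_stage` helper is hypothetical and simply maps a command's exit code to a PASS/FAIL line.

```shell
#!/bin/sh
# Hypothetical helper: run a check command and report PASS/FAIL based on
# its exit code. It propagates the success/failure status so it can be
# used directly in orchestration scripts.
probe_stage() {
  name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
    return 1
  fi
}

# Stage 1: the runtime is up and CLI tools can authenticate to it.
# Only attempted when rabbitmqctl is actually installed.
if command -v rabbitmqctl >/dev/null 2>&1; then
  probe_stage "stage 1 (ping)" rabbitmqctl ping
fi
```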
Stage 2
Includes all checks in stage 1, plus makes sure that `rabbitmqctl status`
(well, the function that backs it) succeeds.
This is a common way of sanity-checking a node.
The probability of false positives approaches 0
outside of upgrades and maintenance windows.
Stage 3
Includes all checks in stage 2, plus checks that the RabbitMQ application is running
(not stopped/"paused" with `rabbitmqctl stop_app`
or by the Pause Minority partition
handling strategy) and that there are no resource alarms in effect.
The probability of false positives is generally low but can rise
significantly during upgrades and maintenance windows.
Systems hovering around their maximum allowed memory usage will have a high
probability of false positives.
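Under this proposal, a stage 3 probe could combine two of the commands from the task list below (`check_running` and `check_local_alarms`). A sketch, assuming those commands exist and follow the usual exit-code convention; the `DIAG` override only exists to make the function easy to exercise without a live node:

```shell
#!/bin/sh
# Stage 3 sketch: the node counts as healthy only if the RabbitMQ
# application is running AND no local resource alarms are in effect.
# DIAG can be overridden (e.g. for testing); defaults to the real CLI.
DIAG="${DIAG:-rabbitmq-diagnostics}"

node_stage3() {
  "$DIAG" check_running >/dev/null 2>&1 &&
    "$DIAG" check_local_alarms >/dev/null 2>&1
}

# Only attempt the real check when the CLI tool is installed.
if command -v "$DIAG" >/dev/null 2>&1; then
  if node_stage3; then
    echo "stage 3: healthy"
  else
    echo "stage 3: unhealthy"
  fi
fi
```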
Stage 4
Includes all checks in stage 3, plus checks that there are no failing virtual hosts.
The probability of false positives is generally low but can rise
significantly during upgrades and maintenance windows.
Stage 5
Includes all checks in stage 4, plus a check of all enabled listeners
(using a temporary TCP connection).
The probability of false positives is generally low but can rise
significantly during upgrades and maintenance windows.
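The listener part of such a check boils down to opening a short-lived TCP connection to each listener's port. A minimal sketch using bash's `/dev/tcp` pseudo-device (the proposed `rabbitmq-diagnostics check_port_connectivity` command would do this properly; the `port_open` helper and the default AMQP port 5672 are illustrative assumptions):

```shell
#!/bin/bash
# Hypothetical listener probe: open (and immediately close) a TCP
# connection to a host/port pair, roughly what a stage 5 check does.
port_open() {
  # bash's /dev/tcp pseudo-device; fails if the port is unreachable
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Probe the default AMQP listener port on the local node.
if port_open 127.0.0.1 5672; then
  echo "AMQP listener reachable"
else
  echo "AMQP listener not reachable"
fi
```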
Stage 6
Includes all checks in stage 5, plus what `rabbitmqctl node_health_check`
does today (it sanity-checks every local queue master process and every channel).
The probability of false positives is moderate for systems under
above-average load or with a large number of queues and channels
(starting with tens of thousands).
Optional Check 1
Includes all checks in stage 4, plus checks that an expected set of plugins is
enabled.
The probability of false positives is generally low but can rise
significantly during upgrades and maintenance windows, depending on the deployment
tools/strategies used (e.g. all plugins can be temporarily disabled).
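This optional check maps onto the `rabbitmq-plugins is_enabled` command proposed in this issue (#295 in the task list below). A sketch, assuming the usual exit-code convention; the plugin names and the `PLUGINS_CLI` override are illustrative:

```shell
#!/bin/sh
# Optional check sketch: verify that an expected set of plugins is
# enabled. PLUGINS_CLI can be overridden for testing; defaults to the
# real CLI tool.
PLUGINS_CLI="${PLUGINS_CLI:-rabbitmq-plugins}"

expected_plugins_enabled() {
  # is_enabled is expected to exit non-zero when any listed plugin
  # is not enabled on the node
  "$PLUGINS_CLI" is_enabled "$@" >/dev/null 2>&1
}

# Only attempt the real check when the CLI tool is installed.
if command -v "$PLUGINS_CLI" >/dev/null 2>&1; then
  if expected_plugins_enabled rabbitmq_management rabbitmq_shovel; then
    echo "expected plugins are enabled"
  else
    echo "expected plugin set is not enabled"
  fi
fi
```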
Cluster Checks
Stage 1
Checks for the expected number of nodes in a cluster.
The probability of false positives can be considered approaching 0.
Stage 2
Checks for network partitions detected by a node.
The probability of false positives is a function of the partition
detection algorithm used. With a timer-based strategy it is moderate
(say, within the (0, 0.2] range). With adaptive accrual failure detectors it is lower (according to our team's anecdotal evidence).
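A cluster stage 1 probe reduces to comparing the observed number of nodes against a known expected cluster size. How the count is obtained is deployment-specific (e.g. parsed from `rabbitmqctl cluster_status` output); the sketch below only shows the comparison, and the expected size of 3 is an assumption:

```shell
#!/bin/sh
# Cluster stage 1 sketch: fail when the observed node count differs
# from the expected cluster size. EXPECTED_NODES is deployment-specific.
EXPECTED_NODES="${EXPECTED_NODES:-3}"

check_node_count() {
  observed="$1"
  if [ "$observed" -eq "$EXPECTED_NODES" ]; then
    echo "cluster size OK ($observed/$EXPECTED_NODES nodes)"
  else
    echo "cluster size MISMATCH ($observed/$EXPECTED_NODES nodes)"
    return 1
  fi
}

# Example with a hard-coded observed count; a real probe would obtain
# it from the node itself.
check_node_count 3
```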
Tasks
- Agree on a range of checks and document it
- `rabbitmq-diagnostics is_running` (Introduce rabbitmq-diagnostics is_running, is_booting #294)
- `rabbitmq-diagnostics is_booting` (Introduce rabbitmq-diagnostics is_running, is_booting #294)
- `rabbitmq-plugins is_enabled [plugin]` (Introduce rabbitmq-plugins is_enabled [plugin 1] [plugin 2] [...] #295)
- `rabbitmq-diagnostics alarms` (Introduce rabbitmq-diagnostics alarms #296)
- `rabbitmq-diagnostics check_alarms` (Introduce rabbitmq-diagnostics alarms #296)
- `rabbitmq-diagnostics check_local_alarms` (Introduce rabbitmq-diagnostics alarms #296)
- `rabbitmq-diagnostics listeners` (Introduce rabbitmq-diagnostics listeners #298)
- `rabbitmq-diagnostics check_running`
- `rabbitmq-diagnostics check_protocol_listener` (Introduces listener check commands #300)
- `rabbitmq-diagnostics check_port_listener` (Introduces listener check commands #300)
- `rabbitmq-diagnostics check_port_connectivity` (Introduces listener check commands #300)

`rabbitmq-diagnostics check_virtual_hosts` was extracted into a separate issue scheduled for 3.7.12.