Conversation

@the-mikedavis
This is an extension of the free disk space alarm that allows configuring additional mount points to monitor and which queue type(s) to block when they are nearly full. For example, with a config like so:

stream.data_dir = /mnt/data/streams
# Directory where the file system is mounted.
disk_free_limits.stream.mount_point = /mnt/data/streams
# Alarm threshold: if free space falls under this absolute
# limit then an alarm fires per queue type.
disk_free_limits.stream.absolute = 2GB
# Queue types to block when the threshold is breached.
disk_free_limits.stream.queue_types = stream

Publishers to streams would be blocked once the free space of /mnt/data/streams falls under 2GB. Publishers to classic or quorum queues could continue, though.

The motivation for this feature is that you may want to use separate disks for different queue types. For example, you might use volume(s) with better throughput and/or IOPS for streams but standard disks for other queue data. Also, alarms are currently fairly aggressive: they block all publishing. Ideally you should be able to continue using queues when the space you have allocated for streams fills up, or vice versa.

This is a different approach from #14086. Instead of measuring disk usage under a directory like du(1), rabbit_disk_monitor is updated to measure the free space of all mounts at once with disksup:get_disk_info/0. Under the hood this performs the same df(1) check that rabbit_disk_monitor had been doing previously - measuring mount-point free space is much cheaper than measuring a directory's disk footprint. Monitoring mount points is also quite flexible: you can back one mount point with multiple disks via RAID-0 striping or split a single disk up with partitions.
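To sketch the mechanics (this is an illustration, not the code in this branch; the Limits shape and function name are made up):

alarmed_queue_types(Limits) ->
    %% disksup:get_disk_info/0 (OTP 26+) returns one tuple per mount:
    %% {MountPoint, TotalKiB, AvailableKiB, CapacityPercent}
    lists:flatmap(
      fun({Mount, _TotalKiB, AvailKiB, _Capacity}) ->
              case maps:get(Mount, Limits, undefined) of
                  {LimitBytes, QueueTypes} when AvailKiB * 1024 < LimitBytes ->
                      %% free space fell under the absolute limit:
                      %% alarm these queue types
                      QueueTypes;
                  _ ->
                      []
              end
      end, disksup:get_disk_info()).

With Limits = #{"/mnt/data/streams" => {2_000_000_000, [stream]}}, this would return [stream] once that mount has less than 2GB free.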

This is a draft - it needs tests, and currently only AMQP 0-9-1 is updated to perform selective blocking. All other protocols currently block for any alarm.

Some of the commits in this branch are refactors that could be cherry-picked out. #14814 is pretty trivial, and the refactors to use maps instead of dict in rabbit_alarm and to use disksup instead of the custom df code in rabbit_disk_monitor are not strictly related to the feature here.

Discussed in #14590

This is the same as the `raft.data_dir` option but for Osiris' data
directory. Configuring this in Cuttlefish is nicer than the existing
`$RABBITMQ_STREAM_DIR` environment variable way of changing the dir.

This is not a functional change, just a refactor to eliminate dicts and
use maps instead. This cleans up some helper functions like
dict_append/3, and we can use map comprehensions in some places to
avoid intermediate lists.
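Roughly (the helper name is from this branch; the rewrite below is just an illustration): dict_append/3 collapses to a maps:update_with/4 call, and map comprehensions (OTP 26+) replace fold-and-rebuild patterns:

%% append a value to the list stored under Key, starting a new
%% list if the key is absent
dict_append(Key, Val, Map) ->
    maps:update_with(Key, fun(Vs) -> [Val | Vs] end, [Val], Map).

%% transform every value without an intermediate to_list/from_list:
%% #{K => transform(V) || K := V <- Map}
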
Previously we set `start_disksup` to `false` to avoid OTP's automatic
monitoring of disk space. `disksup`'s gen_server starts a port (which
runs `df` on Unix) to measure disk usage, and sets an alarm through
OTP's `alarm_handler` when usage exceeds the configured
`disk_almost_full_threshold`. We can set this threshold to 1.0 to
effectively turn off disksup's monitoring (i.e. the alarm will never be
set).

By enabling disksup we have access to `get_disk_data/0` and
`get_disk_info/0,1` which can be used to replace the copied versions in
`rabbit_disk_monitor`.
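The idea, roughly (not necessarily the exact code in this branch):

%% keep disksup running for its polling, but make its own
%% alarm effectively unreachable
ok = application:set_env(os_mon, disk_almost_full_threshold, 1.0),
%% then rabbit_disk_monitor can lean on:
%%   disksup:get_disk_data/0
%%   disksup:get_disk_info/0,1
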
`disksup` now exposes the calculation of available disk space for a
given path using the same `df` mechanism on Unix. We can use this
directly and drop the custom code that reimplements it.
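For example (path made up), getting the free space of the mount backing a directory becomes:

%% one entry for the mount point that backs the given path
[{_MountPoint, _TotalKiB, AvailableKiB, _Capacity}] =
    disksup:get_disk_info("/mnt/data/streams"),
AvailableBytes = AvailableKiB * 1024.
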
This introduces a new variant of `rabbit_alarm:resource_alarm_source()`:
`{disk, QueueType}`, which triggers when the configured mount point for
a queue type falls under its limit of available space.
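Sketched as a type (the spelling here is illustrative, with disk and memory being the existing variants):

-type resource_alarm_source() :: disk | memory | {disk, QueueType :: atom()}.
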
This covers both network and direct connections for 0-9-1. We store a
set of the queue types that have been published to at both the channel
and connection level, since blocking is done at the connection level but
only the channel knows which queue types have been published to.

Then, when the set of published queue types or the set of alarms changes,
the connection evaluates whether it is affected by an alarm. If not, it
may keep publishing, but once a channel publishes to an alarmed queue
type the connection blocks until the channel exits or the alarm clears.
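The connection-side check could look roughly like this (names invented for illustration):

%% PublishedQueueTypes: set accumulated from the channels' casts
%% Alarms: the resource alarms currently in effect
blocked_by_alarms(PublishedQueueTypes, Alarms) ->
    lists:any(fun({disk, QType}) ->
                      %% a per-queue-type disk alarm only blocks
                      %% connections that published to that type
                      sets:is_element(QType, PublishedQueueTypes);
                 (_MemoryOrPlainDisk) ->
                      %% all other resource alarms block everyone
                      true
              end, Alarms).
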
        false ->
            {noreply, State1}
    end;
handle_cast({channel_published_to_queue_type, _ChPid, QT},
@the-mikedavis
This feature might need a feature flag. Here, for direct connections, if old client code is used on a newer server then it would error after publishing, since it isn't expecting this cast. I think it would be unlikely to happen in practice, but the mixed-version test suite will probably run into this.

@samuelmasse

What would the config setup be for having a main disk that contains quorum and classic queues and a secondary disk that contains streams? Would we specify the same mount point for quorum and classic, with each defining queue_types as quorum and classic respectively? Would that result in a common alarm for both or two alarms looking at the same thing?

@the-mikedavis commented Oct 24, 2025

Ah yeah, in that scenario you could have a config like so:

disk_free_limits.streaming.mount_point = /mnt/data/streams
disk_free_limits.streaming.absolute = 2GB
disk_free_limits.streaming.queue_types = stream

disk_free_limits.messaging.mount_point = /mnt/data/queues
disk_free_limits.messaging.absolute = 2GB
disk_free_limits.messaging.queue_types = classic,quorum

And if /mnt/data/queues fell under its configured limit it would set two alarms (disk for classic and disk for quorum queue types) but wouldn't affect streams.

@samuelmasse

Ah, I see, thanks! So if I understand correctly, the name in disk_free_limits.[name].mount_point can be anything we want to set it to; it doesn't have to be the name of a queue type. So I could set disk_free_limits.bob.mount_point, for example.

Taking that thought further, what would the process of adding that "bob" disk alarm to an existing broker look like? If node A thinks the "bob" alarm exists but node B doesn't, can there be issues that come from the disagreement? When node A restarts with the new configuration, do all nodes now know of the "bob" alarm, or just node A until all other nodes also restart?

Also, after I add my new "bob" disk alarm, what ways do I have as a user to monitor it: to see whether it's currently alarming, what value it is configured to, and how close it's getting to the alarm point? For MQ's use case we currently get this information from the /api/nodes endpoint using the disk_free_limit, disk_free and disk_free_alarm values. Are we thinking about adding an alarm-name map to the output of this API to give those values for each disk alarm? Then I would maybe access disk_free_map.bob.disk_free to know the status of my "bob" disk alarm.

Lastly, for the RabbitMQ console we currently have a column named "Disk space" that displays the information for the (as of now) only disk alarm. When I add this "bob" disk alarm, would we want to dynamically add a new column to that table named something like "Disk space (bob)"? In that case, would we also support defining the ordering of those columns? For example, if we consider it most relevant to display the disk alarm for quorum queues on the left, then streams, then classic, and at the end the disk alarm for non-queue storage, would we be able to define that order manually in some console config?
