Quorum Queue followers may stop taking snapshots under specific usage patterns #14137
-
Describe the bug

Under specific usage patterns, followers in a QQ cluster may stop taking snapshots while the leader continues taking snapshots normally. This is reproducible up to v4.1.0 and should be fixed by #13971. I've seen this in 3.13.7, and the reproduction works up to v4.1.0 as long as the delivery limit is undefined.

Reproduction steps

I have edited the v4.1.0 source to add a list shuffle (a sketch of that kind of helper is included under Additional context below). With those changes, the following steps can cause the followers of a 3-node QQ cluster to stop taking snapshots.

$ make start-cluster
# Set the delivery limit of QQ "qq" to unlimited.
$ rabbitmqctl -n rabbit-1 set_policy qq-unlimited-delivery-limit '^qq$' '{"delivery-limit": -1}' --priority 123 --apply-to "quorum_queues"
# Publish many messages so we can consume them with basic.get.
$ perf-test --quorum-queue --queue qq --consumers 0 --producers 1 --pmessages 10000
# Run the reproduction code from the patch in another terminal.
$ rabbitmq-diagnostics -n rabbit-1 remote_shell
> rabbit_repro:run().
# Consume the remaining messages in the queue. This will cause the leader to take a
# snapshot.
$ perf-test --quorum-queue --queue qq --consumers 1 --producers 0 --time 2
# (optional) Publish and consume many messages. This is not necessary but it
# moves the leader's snapshot index forward.
# View the quorum status.
$ rabbitmq-queues -n rabbit-1 quorum_status qq
Status of quorum queue qq on node rabbit-1@mango2 ...
┌─────────────────┬────────────┬────────────┬────────────────┬──────────────┬──────────────┬──────────────┬────────────────┬──────┬─────────────────┐
│ Node Name       │ Raft State │ Membership │ Last Log Index │ Last Written │ Last Applied │ Commit Index │ Snapshot Index │ Term │ Machine Version │
├─────────────────┼────────────┼────────────┼────────────────┼──────────────┼──────────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-1@mango2 │ leader     │ voter      │ 1590120        │ 1590120      │ 1590120      │ 1590120      │ 1550540        │ 1    │ 5               │
├─────────────────┼────────────┼────────────┼────────────────┼──────────────┼──────────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-2@mango2 │ follower   │ voter      │ 1590120        │ 1590120      │ 1590120      │ 1590120      │ -1             │ 1    │ 5               │
├─────────────────┼────────────┼────────────┼────────────────┼──────────────┼──────────────┼──────────────┼────────────────┼──────┼─────────────────┤
│ rabbit-3@mango2 │ follower   │ voter      │ 1590120        │ 1590120      │ 1590120      │ 1590120      │ -1             │ 1    │ 5               │
└─────────────────┴────────────┴────────────┴────────────────┴──────────────┴──────────────┴──────────────┴────────────────┴──────┴─────────────────┘

As the QQ sees continued usage, the followers' disk space will be consumed while the leader's will remain relatively empty. The followers are 'stuck': they cannot snapshot and truncate their old data away.

Expected behavior

Leader and follower snapshot indices are not necessarily identical, but if the leader and followers have similar last applied indices, the snapshot indices should be 'within spitting distance' of each other. The followers should not stop taking snapshots while the leader continues taking snapshots.

Additional context

The reproduction relies on unstable map ordering. The iteration order of an Erlang map is not defined and can vary across OTP releases or even across nodes. (TODO: information about reproducing this would be valuable.) In the reproduction code in … The next step of … Specifically, what can happen here is that when followers handle the …
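
For reference, the shuffle itself is unremarkable; what matters is where it is applied, and that call site is not shown here. A hypothetical helper of the kind such a patch can use (plain Erlang, not the actual patch) just produces a random but otherwise valid reordering of whatever list it is given:

> Shuffle = fun(L) -> [X || {_, X} <- lists:sort([{rand:uniform(), X} || X <- L])] end.
> Shuffle(lists:seq(1, 5)).   %% a random permutation of [1,2,3,4,5]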
-
This is fixed by #13971, which was released in v4.1.1. The reproduction relies on unstable map ordering, which the PR addresses.
-
I say that this usage pattern is very specific because it relies on multiple … It might also technically be possible to reproduce this with …
-
One quick remediation for this is to send a … Another way to remediate is to use …

Also, to avoid the issue, you can set a delivery limit on the QQ. This is not a remediation (it doesn't fix followers that are already lagging behind), but it prevents the problem since it avoids the …
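
For example (the policy name and the limit value here are illustrative; the command mirrors the unlimited-delivery-limit policy from the reproduction steps):

$ rabbitmqctl -n rabbit-1 set_policy qq-delivery-limit '^qq$' '{"delivery-limit": 20}' --priority 123 --apply-to "quorum_queues"

If I remember correctly, 20 is also the default delivery limit that 4.x applies when nothing overrides it.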
-
I'm not satisfied with the shuffle part of the reproduction steps. I'll be looking into the large map implementation upstream and trying to figure out where natural changes in ordering can come from. From what I've seen so far, it sounds like collision nodes in the HAMT are expected to be rare - maybe that's a lead.
-
@the-mikedavis #13971 is not safe to backport to …
-
Actually I was wrong here ☝️, both about the delivery-limit and the purge part.

When a channel with 33 or more consumers closes, the QQ returns the messages in a different order on all replicas. … basic.get it will return basic.get_empty, but the followers will check out their extra messages. Those checkouts will sit there stuck in the followers since positive or negative acknowledgement…
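
The 33-consumer threshold lines up with how Erlang stores maps: with up to 32 keys a map is a flatmap whose keys are kept (and iterated) in term order, while at 33 keys it switches to a hash-based representation whose iteration order follows the internal hashing. A minimal illustration in a plain Erlang shell (this is OTP behaviour, not RabbitMQ code):

> Small = maps:from_list([{N, N} || N <- lists:seq(1, 32)]).
> Large = Small#{33 => 33}.
> maps:keys(Small) =:= lists:seq(1, 32).   %% true: flatmap keys come out in term order
> maps:keys(Large) =:= lists:seq(1, 33).   %% usually false: the larger map iterates in hash order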