Conversation

@mergify mergify bot commented Oct 1, 2025

Keep exclusive/auto-delete queues with Khepri + network partition

Why

With Mnesia, when the network partition strategy is set to pause_minority, nodes on the "minority side" are stopped.

Thus, the exclusive queues that were hosted by nodes on that minority side are lost:

  • Consumers connected on these nodes are disconnected because the nodes are stopped.
  • Queue records on the majority side are deleted from the metadata store.

This was acceptable with Mnesia, given how this network partition handling strategy is implemented. However, it does not work with Khepri, because the nodes on the "minority side" continue to run and serve clients. Therefore the cluster ends up in an inconsistent situation:

  1. The "majority side" deleted the queue records.
  2. When the network partition is resolved, the "minority side" receives the record deletion, but the queue processes continue to run.

The situation was similar for auto-delete queues.

How

With Khepri, we no longer delete transient queue records simply because a node goes down. Thanks to this, an exclusive or auto-delete queue and its consumer(s) are not affected by a network partition: they continue to work.

However, if a node is really lost, we need to clean up dead queue records. This was already done for durable queues with both Mnesia and Khepri. But with Khepri, transient queue records persist in the store like durable queue records (unlike with Mnesia).

That's why this commit replaces the clean-up function rabbit_amqqueue:forget_all_durable/1 with rabbit_amqqueue:forget_all/1, which deletes all records of queues that were hosted on the given node, regardless of whether they are transient or durable.
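
For illustration only, here is a minimal Erlang sketch of the idea behind such a clean-up function. The record shape, the queue list and delete_record/1 are simplified stand-ins for this example, not RabbitMQ's actual internals:

```erlang
%% Sketch of the clean-up idea: drop every queue record whose home node
%% is the lost node, regardless of durability. All names below are
%% placeholders for this example, not RabbitMQ's real API.
-module(queue_cleanup_sketch).
-export([forget_all/2]).

-record(queue, {name, node, durable}).

%% Queues :: [#queue{}], DeadNode :: node()
forget_all(Queues, DeadNode) ->
    [delete_record(Q) || Q = #queue{node = Node} <- Queues,
                         Node =:= DeadNode],
    ok.

delete_record(#queue{name = Name, durable = Durable}) ->
    %% With Khepri, transient records live in the store just like
    %% durable ones, so both kinds are removed here.
    io:format("deleting record for ~p (durable: ~p)~n", [Name, Durable]).
```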

In addition, if no other process is waiting for a reply from the queue process, the queue process spawns a temporary process that retries deleting the underlying record indefinitely. That's the case for queues that are deleted because of an internal event (such as the exclusive or auto-delete conditions). The queue process then exits, which notifies connections that the queue is gone.

Thanks to this, the temporary process will do its best to delete the record during a network partition, whether the consumers go away during or after that partition. In addition, the node monitor drives some failsafe code that cleans up the record if the queue process was killed before it could delete its own record.
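
A rough Erlang sketch of that retry behaviour, with delete_record/1 standing in for the real metadata store call and a 1-second retry interval chosen arbitrarily:

```erlang
%% Sketch only: a helper process that keeps retrying the record
%% deletion until the metadata store accepts it, e.g. once the
%% partition heals and a Khepri leader is reachable again.
-module(retry_delete_sketch).
-export([spawn_record_deleter/1]).

spawn_record_deleter(QName) ->
    spawn(fun() -> retry_delete(QName) end).

retry_delete(QName) ->
    case delete_record(QName) of
        ok ->
            ok;
        {error, timeout} ->
            %% Store unreachable (e.g. no Khepri leader during the
            %% partition); wait a bit and try again, indefinitely.
            timer:sleep(1000),
            retry_delete(QName)
    end.

%% Placeholder for the real metadata store deletion call.
delete_record(_QName) ->
    ok.
```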

Fixes #12949, #12597, #14527.


This is an automatic backport of pull request #14573 done by Mergify.

… message

[Why]
So far, when there was a network partition with Mnesia, the most popular
partition handling strategies restarted RabbitMQ nodes. Therefore,
`rabbit` would execute the boot steps and one of them would notify other
members of the cluster that "this RabbitMQ node is live".

With Khepri, nodes are not restarted anymore, and thus boot steps are
not executed at the end of a network partition. As a consequence, other
members are not notified that a member is back online.

[How]
When the node monitor receives the `nodeup` message (managed by Erlang,
meaning that "a remote Erlang node just connected to this node through
Erlang distribution"), a `node_up` message is sent to all cluster
members (meaning "RabbitMQ is now running on the originating node").
Yeah, very poor naming...

This lets the RabbitMQ node monitor know when other nodes running
RabbitMQ are back online and react accordingly.

If a node is restarted, another node could receive the `node_up`
message twice, so the actions behind it must be idempotent.
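
For illustration, a minimal gen_server sketch of this pattern; it is not
the actual rabbit_node_monitor code, and cluster_members/0 as well as
the state shape are placeholders:

```erlang
%% Illustrative sketch only, not RabbitMQ's node monitor. On the Erlang
%% `nodeup` event, broadcast a RabbitMQ-level `node_up` message to all
%% cluster members; the handler is idempotent so that receiving the
%% message twice (e.g. after a node restart) is harmless.
-module(node_monitor_sketch).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    ok = net_kernel:monitor_nodes(true),
    {ok, #{up => sets:new()}}.

handle_info({nodeup, _Node}, State) ->
    %% Erlang distribution reconnected; tell every member that RabbitMQ
    %% on this node is (still) running.
    [gen_server:cast({?MODULE, Member}, {node_up, node()})
     || Member <- cluster_members(), Member =/= node()],
    {noreply, State};
handle_info(_Other, State) ->
    {noreply, State}.

handle_cast({node_up, Node}, #{up := Up} = State) ->
    %% Idempotent: adding an already-known node to the set is a no-op,
    %% so a duplicate `node_up` changes nothing.
    {noreply, State#{up := sets:add_element(Node, Up)}};
handle_cast(_Other, State) ->
    {noreply, State}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

%% Placeholder: a real implementation would read the cluster membership
%% from the metadata store.
cluster_members() ->
    [node() | nodes()].
```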

(cherry picked from commit 2c1b752)
[Why]
With Mnesia, when the network partition strategy is set to
`pause_minority`, nodes on the "minority side" are stopped.

Thus, the exclusive queues that were hosted by nodes on that minority
side are lost:
* Consumers connected on these nodes are disconnected because the nodes
  are stopped.
* Queue records on the majority side are deleted from the metadata
  store.

This was acceptable with Mnesia, given how this network partition
handling strategy is implemented. However, it does not work with Khepri,
because the nodes on the "minority side" continue to run and serve
clients. Therefore the cluster ends up in an inconsistent situation:
1. The "majority side" deleted the queue records.
2. When the network partition is resolved, the "minority side" receives
   the record deletion, but the queue processes continue to run.

The situation was similar for auto-delete queues.

[How]
With Khepri, we no longer delete transient queue records simply because
a node goes down. Thanks to this, an exclusive or auto-delete queue and
its consumer(s) are not affected by a network partition: they continue
to work.

However, if a node is really lost, we need to clean up dead queue
records. This was already done for durable queues with both Mnesia and
Khepri. But with Khepri, transient queue records persist in the store
like durable queue records (unlike with Mnesia).

That's why this commit replaces the clean-up function
`rabbit_amqqueue:forget_all_durable/1` with
`rabbit_amqqueue:forget_all/1`, which deletes all records of queues that
were hosted on the given node, regardless of whether they are transient
or durable.

In addition, if no other process is waiting for a reply from the queue
process, the queue process spawns a temporary process that retries
deleting the underlying record indefinitely. That's the case for queues
that are deleted because of an internal event (such as the exclusive or
auto-delete conditions). The queue process then exits, which notifies
connections that the queue is gone.

Thanks to this, the temporary process will do its best to delete the
record during a network partition, whether the consumers go away during
or after that partition. In addition, the node monitor drives some
failsafe code that cleans up the record if the queue process was killed
before it could delete its own record.

Fixes #12949, #12597, #14527.

(cherry picked from commit 3c4d073)
…mixed-version testing

[Why]
The `*_queue_after_partition_recovery_1` testcases rely on the fact that
the queue process retries deleting its queue record in its terminate
function.

RabbitMQ 4.1.x and earlier don't have that behaviour. Thus, depending
on the timing of the Khepri leader election after a network partition,
the test might fail.

(cherry picked from commit 69e1703)
@dumbbell dumbbell force-pushed the mergify/bp/v4.2.x/pr-14573 branch from 28cac81 to 4826a38 on October 3, 2025 12:47
@dumbbell dumbbell marked this pull request as ready for review October 3, 2025 15:12
@dumbbell dumbbell merged commit 1c856ee into v4.2.x Oct 3, 2025
291 checks passed
@dumbbell dumbbell deleted the mergify/bp/v4.2.x/pr-14573 branch October 3, 2025 15:13