TQ: Add support for "Expunge" messages #8874

andrewjstone · 2025-08-20T22:48:25Z

When a node requests a share for an old configuration it will receive
an Expunged message if the share request receiving node has a later
committed configuration where the node is not a member. When the
Expunged message is receieved, the expunged node persists this fact
and stops replying to peer messages.

We also fix a bug in the test where nexus wasn't actually committing
configurations at all.

So far invariants really slow the test down. On my machine checking the
invariants after every applied event makes the test take ~120s. Without
invariant checking, it takes about 14s.

I definitely want to add more invariants for correctness, but maybe not
check them when they aren't applicable. Some may become postconditions
on certain events instead.

andrewjstone · 2025-08-20T22:48:59Z

This PR builds on #8801

When a node requests a share for an old configuration it will receive an `Expunged` message if the share request receiving node has a later committed configuration where the node is not a member. When the `Expunged` message is receieved, the expunged node persists this fact and stops replying to peer messages. We also fix a bug in the test where nexus wasn't actually committing configurations at all. So far invariants really slow the test down. On my machine checking the invariants after every applied event makes the test take ~120s. Without invariant checking, it takes about 14s. I definitely want to add more invariants for correctness, but maybe not check them when they aren't applicable. Some may become postconditions on certain events instead.

This builds on #8874 To make using using tqdb easier, we put all necessary information inside events. This works because of deterministic replay using the same infrastucture as the proptest itself. To ensure we are truly deterministically replaying we must assert that any Event in our log has actually been generated when we go to apply it in tqdb. In the common case case this is equivalent to asserting that the `DeliveredEnvelope` is actually the same as the one pulled off the `bootstrap_network` vec during runs. In order to guarantee this a few changes needed to be made: 1. Determinism exists, except for in the crypto code because we don't seed the various random number generators, and therefore different key shares and rack secrets get generated in different runs. We could seed them, but then this makes it possible that we accidentally seed them in production code. More importantly for our current purposes, though, is that it's tedious and unnecessary. Instead we just implement comparison methods for messages that ignore things like key shares. This works fine because if the shares are not self-consistent with the rack secret and the parameters such as threshold that we don't change the crypto code itself will fail immediately. 2. We had a few actions that generated multiple events. Unfortunately, applying these events would result in mutating the test networks multiple times, resulting in a difference between which message was recorded and which was actually pulled out to be delivered. This was fixed by changing each action to only do a single thing at a time, like deliver one envelope or commit a configuration at one node. This also allows for finer grain interleaving and matches better with our TLA+ spec/model checking. This was observable by looking at the event logs. The change to an action typically generating a single event means, however, that we need to generate more actions per run to get the equivalent functionality. A single action won't result in N commits or N envelopes delivered. Therefore, each test run now needs to have significantly more actions generated. Unfortunately this can make test runs even longer. In order to help alleviate this we made three other changes: 1. We change the invariant checks to only look at nodes that could possibly be mutated by an event application. We call these the "affected nodes". This prevents looping over every node in each check. 2. We reduce the member universe, and hence the size of the state space. 3. We reduce the total number of test cases per run.

andrewjstone requested review from sunshowers and plotnick August 20, 2025 22:48

andrewjstone force-pushed the tq-expunged branch 2 times, most recently from ddfc395 to 665ef30 Compare August 20, 2025 22:52

andrewjstone changed the title ~~Add support for "Expunge" messages~~ TQ: Add support for "Expunge" messages Aug 23, 2025

andrewjstone mentioned this pull request Aug 23, 2025

TQ: cluster proptest Action/Event cleanup #8899

Merged

andrewjstone force-pushed the tqdb branch from bc1ecc7 to b22ee4a Compare August 27, 2025 23:29

Base automatically changed from tqdb to main August 28, 2025 02:07

andrewjstone added 2 commits August 28, 2025 03:37

Do not allow commits on expunged nodes

d85c5d9

andrewjstone force-pushed the tq-expunged branch from 0ca88b1 to d85c5d9 Compare August 28, 2025 03:38

andrewjstone enabled auto-merge (squash) August 28, 2025 03:39

andrewjstone merged commit 93c7540 into main Aug 28, 2025
16 checks passed

andrewjstone deleted the tq-expunged branch August 28, 2025 06:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

TQ: Add support for "Expunge" messages #8874

TQ: Add support for "Expunge" messages #8874

Uh oh!

andrewjstone commented Aug 20, 2025 •

edited

Loading

Uh oh!

andrewjstone commented Aug 20, 2025

Uh oh!

Uh oh!

Uh oh!

TQ: Add support for "Expunge" messages #8874

TQ: Add support for "Expunge" messages #8874

Uh oh!

Conversation

andrewjstone commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andrewjstone commented Aug 20, 2025

Uh oh!

Uh oh!

Uh oh!

andrewjstone commented Aug 20, 2025 •

edited

Loading