Skip to content

Conversation

andrewjstone
Copy link
Contributor

@andrewjstone andrewjstone commented Aug 20, 2025

When a node requests a share for an old configuration it will receive
an Expunged message if the share request receiving node has a later
committed configuration where the node is not a member. When the
Expunged message is receieved, the expunged node persists this fact
and stops replying to peer messages.

We also fix a bug in the test where nexus wasn't actually committing
configurations at all.

So far invariants really slow the test down. On my machine checking the
invariants after every applied event makes the test take ~120s. Without
invariant checking, it takes about 14s.

I definitely want to add more invariants for correctness, but maybe not
check them when they aren't applicable. Some may become postconditions
on certain events instead.

@andrewjstone
Copy link
Contributor Author

This PR builds on #8801

@andrewjstone andrewjstone force-pushed the tq-expunged branch 2 times, most recently from ddfc395 to 665ef30 Compare August 20, 2025 22:52
@andrewjstone andrewjstone changed the title Add support for "Expunge" messages TQ: Add support for "Expunge" messages Aug 23, 2025
Base automatically changed from tqdb to main August 28, 2025 02:07
When a node requests a share for an old configuration it will receive
an `Expunged` message if the share request receiving node has a later
committed configuration where the node is not a member. When the
`Expunged` message is receieved, the expunged node persists this fact
and stops replying to peer messages.

We also fix a bug in the test where nexus wasn't actually committing
configurations at all.

So far invariants really slow the test down. On my machine checking the
invariants after every applied event makes the test take ~120s. Without
invariant checking, it takes about 14s.

I definitely want to add more invariants for correctness, but maybe not
check them when they aren't applicable. Some may become postconditions
on certain events instead.
@andrewjstone andrewjstone enabled auto-merge (squash) August 28, 2025 03:39
@andrewjstone andrewjstone merged commit 93c7540 into main Aug 28, 2025
16 checks passed
@andrewjstone andrewjstone deleted the tq-expunged branch August 28, 2025 06:31
andrewjstone added a commit that referenced this pull request Aug 28, 2025
This builds on #8874 

To make using using tqdb easier, we put all necessary information inside
events. This works because of deterministic replay using the same
infrastucture as the proptest itself. To ensure we are truly
deterministically replaying we must assert that any Event in our log has
actually been generated when we go to apply it in tqdb. In the common
case case this is equivalent to asserting that the `DeliveredEnvelope`
is actually the same as the one pulled off the `bootstrap_network` vec
during runs. In order to guarantee this a few changes needed to be made:

1. Determinism exists, except for in the crypto code because we don't
seed the various random number generators, and therefore different key
shares and rack secrets get generated in different runs. We could seed
them, but then this makes it possible that we accidentally seed them in
production code. More importantly for our current purposes, though, is
that it's tedious and unnecessary. Instead we just implement comparison
methods for messages that ignore things like key shares. This works fine
because if the shares are not self-consistent with the rack secret and
the parameters such as threshold that we don't change the crypto code
itself will fail immediately.

2. We had a few actions that generated multiple events. Unfortunately,
applying these events would result in mutating the test networks
multiple times, resulting in a difference between which message was
recorded and which was actually pulled out to be delivered. This was
fixed by changing each action to only do a single thing at a time, like
deliver one envelope or commit a configuration at one node. This also
allows for finer grain interleaving and matches better with our TLA+
spec/model checking. This was observable by looking at the event logs.

The change to an action typically generating a single event means,
however, that we need to generate more actions per run to get the
equivalent functionality. A single action won't result in N commits or N
envelopes delivered. Therefore, each test run now needs to have
significantly more actions generated. Unfortunately this can make test
runs even longer. In order to help alleviate this we made three other
changes:

1. We change the invariant checks to only look at nodes that could
possibly be mutated by an event application. We call these the "affected
nodes". This prevents looping over every node in each check.
2. We reduce the member universe, and hence the size of the state space.
3. We reduce the total number of test cases per run.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant