Conversation

davepacheco (Collaborator):

Before this change, blueprint execution populated db_metadata_nexus records that marked every Nexus zone as active. Now, per RFD 588, it writes not_yet records for zones whose generation is newer than the currently active one.

I've pulled much of this straight out of #8936. Differences from what's there:

  • During blueprint execution, we directly compute the active/not-yet Nexus sets based on the blueprint we're executing (sketched below). This is functionally equivalent to reading them out of the database, since the final query is conditional on the blueprint still being the current target.
  • Rather than create a pub datastore function that reads one Nexus's record for the tests, I used the one I added in coordinated Nexus quiesce (#9010), which reads multiple records, and just added a helper for it in the test suite.
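
Roughly, the split works like the sketch below. This is a minimal, self-contained illustration of the idea only, not the actual omicron code: `NexusZone`, `split_nexus_zones`, and the plain integer ids/generations are stand-ins for the real blueprint types (`OmicronZoneUuid`, the zone configs, etc.).

    use std::collections::BTreeSet;

    /// Stand-in for a Nexus zone as described by a blueprint (illustrative).
    struct NexusZone {
        id: u64,         // stand-in for OmicronZoneUuid
        generation: u64, // stand-in for this zone's nexus_generation
    }

    /// Partition the blueprint's Nexus zones into "active" and "not yet"
    /// sets, relative to the generation of the currently running Nexus.
    fn split_nexus_zones(
        zones: &[NexusZone],
        active_generation: u64,
    ) -> (BTreeSet<u64>, BTreeSet<u64>) {
        let mut active = BTreeSet::new();
        let mut not_yet = BTreeSet::new();
        for zone in zones {
            if zone.generation > active_generation {
                // Newer than the generation currently in control: this zone
                // gets a "not_yet" record.
                not_yet.insert(zone.id);
            } else {
                // At the current generation: treated as active here. (The
                // real code may handle older generations differently.)
                active.insert(zone.id);
            }
        }
        (active, not_yet)
    }

The key point is that both sets come straight from the blueprint being executed, not from a separate database read.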

Depends on #9010.

    opctx: &OpContext,
    blueprint: &nexus_types::deployment::Blueprint,
    blueprint_id: BlueprintUuid,
    active: &BTreeSet<OmicronZoneUuid>,
Collaborator:

More a note for myself: this is identical to #8936, but using BTreeSets instead of Vecs (totally reasonable).

    opctx: &'a OpContext,
    datastore: &'a DataStore,
    blueprint: &'a Blueprint,
    nexus_id: Option<OmicronZoneUuid>,
Collaborator:

Why would we want to let this be optional? When are we running the executor outside Nexus?

(The consequences of a Nexus being configured to not supply this value seem really bad)

Collaborator (Author):

Yeah, good question. There are basically three callers of realize_blueprint():

  • The nexus-reconfigurator-execution tests (via realize_blueprint_and_expect). These pass a made-up Nexus id.
  • reconfigurator-exec-unsafe, a dev tool. This passes None.
  • The Nexus blueprint execution background task. This passes a real value here.

So this value would be None when running from reconfigurator-exec-unsafe. This isn't really new: it's already the case that the Nexus id is optional for blueprint execution, and some steps (namely, saga re-assignment and marking failed support bundles) are skipped if it's not provided. I adopted the same pattern here.
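
The optional-id pattern is roughly the following shape. This is a hand-written sketch under assumptions, not the actual realize_blueprint() code; the function name and plain integer id are illustrative only.

    /// Illustrative step: only write db_metadata_nexus records when we know
    /// which Nexus is running the executor.
    fn maybe_write_nexus_records(nexus_id: Option<u64>) {
        let Some(nexus_id) = nexus_id else {
            // reconfigurator-exec-unsafe (or any caller without a real Nexus
            // id) skips this step entirely.
            println!("no Nexus id provided; skipping db_metadata_nexus step");
            return;
        };
        println!("would write records on behalf of Nexus {nexus_id}");
    }

    fn main() {
        maybe_write_nexus_records(None);     // dev tool: step is skipped
        maybe_write_nexus_records(Some(42)); // real Nexus: step runs
    }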

Collaborator (Author):

Something I realized in addressing your other comment is that it wouldn't be safe for reconfigurator-exec-unsafe to try to do this step, at least not all the time. Consider:

  • Suppose we have Nexus instances N1, N2, N3 at generation G and blueprint generation = G.
  • We provision instances N4, N5, N6 at generation G + 1 in preparation for handoff.
  • We create blueprint B1 with generation G + 1, starting the handoff process.
  • Someone runs reconfigurator-exec-unsafe. It needs to compute the set of active vs. not-yet Nexus zones as we do here.

The way this code is written now, we won't get here at all because we don't have a valid nexus id, so it won't do anything.

If we instead queried the database for the list of currently-active Nexus zones, there are two possibilities:

  1. Handoff has not happened at the time that we query it. reconfigurator-exec-unsafe finds N1, N2, and N3 active and N4, N5, and N6 "not-yet".
  2. Handoff has happened at the time that we query it. reconfigurator-exec-unsafe finds N1, N2, and N3 quiesced and N4, N5, and N6 "active" (or maybe even not-yet?)

But there's a time-of-check-to-time-of-use race in case (1). The handoff could happen immediately after it queries the database, and then it might insert records with the wrong state. The check against the current target blueprint does not save us here because the target blueprint doesn't change in the handoff transaction.

This is not a problem for Nexus doing blueprint execution because it cannot quiesce while it's doing blueprint execution.

It's possible that we could allow reconfigurator-exec-unsafe to do this step in some cases (e.g., if blueprint.nexus_generation matches the highest-valued Nexus generation, then we know that no handoff is in progress) but I don't think it's worth the complexity right now.

Collaborator:

Ack, so to confirm:

  • reconfigurator-exec-unsafe could have issues if it supplied a non-None value here
  • ... but, as written, it can't do that
  • ... but also, this is probably fine; it's a dev tool anyway

Comment on lines +28 to +30
    .find_map(|(_sled_id, zone_cfg, nexus_config)| {
        (zone_cfg.id == nexus_id).then_some(nexus_config.nexus_generation)
    })
Collaborator:

Just to make sure we're super-clear on terminology:

  • There is a top-level blueprint nexus_generation which will be bumped to start the quiesce process
  • After this value is bumped, but before quiesce is complete, we have nexuses running with a nexus_generation value less than this top-level nexus_generation

In this scenario, we could have:

  • Nexus (running, quiescing, supposed to have active record) @ generation = N
  • Nexus (waiting, supposed to have not_yet record) @ generation = N + 1
  • blueprint.nexus_generation @ generation = N + 1

I think we've said the blueprint_generation identifies the "Nexus instances currently in control", but it's a little weird this is not totally overlapping with the notion of "active". (I know this is by design - I'm okay with the process, just wondering about how we're referring to these concepts separately).

I added this comment in my PR, to try to help clarify:

    // We need to determine the generation of the currently running set of
    // Nexuses. This is usually the same as "blueprint.nexus_generation", but
    // can lag behind it if we are one of those Nexuses running after quiescing
    // has started.

WDYT about adding a note explaining this? I just want to make it very clear why we aren't simply matching on the top-level blueprint.nexus_generation.

Collaborator (Author):

> I think we've said the blueprint_generation identifies the "Nexus instances currently in control", but it's a little weird this is not totally overlapping with the notion of "active". (I know this is by design - I'm okay with the process, just wondering about how we're referring to these concepts separately).

Agreed -- I don't think we should describe the blueprint nexus_generation as identifying the Nexus instances that are in control, but rather the ones that we want to be in control.

Updated the comment to be much more explicit in f01ffe9.
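
To summarize the distinction being discussed here, a minimal sketch (hypothetical names; the real code reads both values out of the blueprint, as in the find_map excerpt quoted above):

    /// Illustrative only: the generation the executing Nexus is running at
    /// can lag the blueprint's target generation while a handoff is in flight.
    struct Generations {
        /// Generation of the Nexus running the executor, looked up from its
        /// own zone config in the blueprint.
        running: u64,
        /// blueprint.nexus_generation: the Nexuses we want to be in control.
        target: u64,
    }

    impl Generations {
        /// A handoff is in flight when the target generation has moved past
        /// the generation of the Nexus that is still doing the work.
        fn handoff_in_flight(&self) -> bool {
            self.target > self.running
        }
    }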

Base automatically changed from dap/quiesce-with-db to main on September 16, 2025 at 22:42.
davepacheco merged commit d98f378 into main on Sep 17, 2025.
17 checks passed.
davepacheco deleted the dap/write-notyet-records branch on September 17, 2025 at 13:21.