Conversation

davepacheco (Collaborator):

Before this change, blueprint execution populated db_metadata_nexus records that marked every Nexus zone as active. Now, per RFD 588, it writes not_yet records for zones whose generation is newer than the currently active one.

I've pulled much of this straight out of #8936. Differences from what's there:

  • During blueprint execution, we directly compute the active/not-yet Nexus sets based on the blueprint we're executing (sketched below). This is functionally equivalent to reading them out of the database, since the final query is conditional on the blueprint still being the current target.
  • Rather than create a pub datastore function that reads one Nexus's record for the tests, I used the one I added in coordinated Nexus quiesce (#9010), which reads multiple records, and just added a helper for it in the test suite.
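
Roughly, the split works like the sketch below. This is a minimal, self-contained illustration of the idea only, not the actual omicron code: `NexusZone`, `split_nexus_zones`, and the plain integer ids/generations are stand-ins for the real blueprint types (`OmicronZoneUuid`, the zone configs, etc.).

    use std::collections::BTreeSet;

    /// Stand-in for a Nexus zone as described by a blueprint (illustrative).
    struct NexusZone {
        id: u64,         // stand-in for OmicronZoneUuid
        generation: u64, // stand-in for this zone's nexus_generation
    }

    /// Partition the blueprint's Nexus zones into "active" and "not yet"
    /// sets, relative to the generation of the currently running Nexus.
    fn split_nexus_zones(
        zones: &[NexusZone],
        active_generation: u64,
    ) -> (BTreeSet<u64>, BTreeSet<u64>) {
        let mut active = BTreeSet::new();
        let mut not_yet = BTreeSet::new();
        for zone in zones {
            if zone.generation > active_generation {
                // Newer than the generation currently in control: this zone
                // gets a "not_yet" record.
                not_yet.insert(zone.id);
            } else {
                // At the current generation: treated as active here. (The
                // real code may handle older generations differently.)
                active.insert(zone.id);
            }
        }
        (active, not_yet)
    }

The key point is that both sets come straight from the blueprint being executed, not from a separate database read.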

Depends on #9010.

    opctx: &OpContext,
    blueprint: &nexus_types::deployment::Blueprint,
    blueprint_id: BlueprintUuid,
    active: &BTreeSet<OmicronZoneUuid>,
Collaborator:

More a note for myself: this is identical to #8936, but using BTreeSets instead of Vecs (totally reasonable).

    opctx: &'a OpContext,
    datastore: &'a DataStore,
    blueprint: &'a Blueprint,
    nexus_id: Option<OmicronZoneUuid>,
Collaborator:

Why would we want to let this be optional? When are we running the executor outside Nexus?

(The consequences of a Nexus being configured to not supply this value seem really bad)

Collaborator (Author):

Yeah, good question. There are basically three callers of realize_blueprint():

  • The nexus-reconfigurator-execution tests (via realize_blueprint_and_expect). These pass a made-up Nexus id.
  • reconfigurator-exec-unsafe, a dev tool. This passes None.
  • The Nexus blueprint execution background task. This passes a real value here.

So this value would be None when running from reconfigurator-exec-unsafe. This isn't really new: it's already the case that the Nexus id is optional for blueprint execution, and some steps (namely, saga re-assignment and marking failed support bundles) are skipped if it's not provided. I adopted the same pattern here.
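
The optional-id pattern is roughly the following shape. This is a hand-written sketch under assumptions, not the actual realize_blueprint() code; the function name and plain integer id are illustrative only.

    /// Illustrative step: only write db_metadata_nexus records when we know
    /// which Nexus is running the executor.
    fn maybe_write_nexus_records(nexus_id: Option<u64>) {
        let Some(nexus_id) = nexus_id else {
            // reconfigurator-exec-unsafe (or any caller without a real Nexus
            // id) skips this step entirely.
            println!("no Nexus id provided; skipping db_metadata_nexus step");
            return;
        };
        println!("would write records on behalf of Nexus {nexus_id}");
    }

    fn main() {
        maybe_write_nexus_records(None);     // dev tool: step is skipped
        maybe_write_nexus_records(Some(42)); // real Nexus: step runs
    }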

Collaborator (Author):

Something I realized in addressing your other comment is that it wouldn't be safe for reconfigurator-exec-unsafe to try to do this step, at least not all the time. Consider:

  • Suppose we have Nexus instances N1, N2, N3 at generation G and blueprint generation = G.
  • We provision instances N4, N5, N6 at generation G + 1 in preparation for handoff.
  • We create blueprint B1 with generation G + 1, starting the handoff process.
  • Someone runs reconfigurator-exec-unsafe. It needs to compute the set of active vs. not-yet Nexus zones as we do here.

The way this code is written now, we won't get here at all because we don't have a valid nexus id, so it won't do anything.

If we instead queried the database for the list of currently-active Nexus zones, there are two possibilities:

  1. Handoff has not happened at the time that we query it. reconfigurator-exec-unsafe finds N1, N2, and N3 active and N4, N5, and N6 "not-yet".
  2. Handoff has happened at the time that we query it. reconfigurator-exec-unsafe finds N1, N2, and N3 quiesced and N4, N5, and N6 "active" (or maybe even not-yet?)

But there's a time-of-check-to-time-of-use race in case (1). The handoff could happen immediately after it queries the database, and then it might insert records with the wrong state. The check against the current target blueprint does not save us here because the target blueprint doesn't change in the handoff transaction.

This is not a problem for Nexus doing blueprint execution because it cannot quiesce while it's doing blueprint execution.

It's possible that we could allow reconfigurator-exec-unsafe to do this step in some cases (e.g., if blueprint.nexus_generation matches the highest-valued Nexus generation, then we know that no handoff is in progress) but I don't think it's worth the complexity right now.

Collaborator:

Ack, so to confirm:

  • reconfigurator-exec-unsafe could have issues if it supplied a non-None value here
  • ... but, as written, it can't do that
  • ... but also, this is probably fine; it's a dev tool anyway

Comment on lines +28 to +30
    .find_map(|(_sled_id, zone_cfg, nexus_config)| {
        (zone_cfg.id == nexus_id).then_some(nexus_config.nexus_generation)
    })
Collaborator:

Just to make sure we're super-clear on terminology:

  • There is a top-level blueprint nexus_generation which will be bumped to start the quiesce process
  • After this value is bumped, but before quiesce is complete, we have nexuses running with a nexus_generation value less than this top-level nexus_generation

In this scenario, we could have:

  • Nexus (running, quiescing, supposed to have active record) @ generation = N
  • Nexus (waiting, supposed to have not_yet record) @ generation = N + 1
  • blueprint.nexus_generation @ generation = N + 1

I think we've said the blueprint_generation identifies the "Nexus instances currently in control", but it's a little weird this is not totally overlapping with the notion of "active". (I know this is by design - I'm okay with the process, just wondering about how we're referring to these concepts separately).

I added this comment in my PR, to try to help clarify:

    // We need to determine the generation of the currently running set of
    // Nexuses. This is usually the same as "blueprint.nexus_generation", but
    // can lag behind it if we are one of those Nexuses running after quiescing
    // has started.

WDYT about adding a note explaining this? I just want to make it very clear why we aren't simply matching on the top-level blueprint.nexus_generation.

Collaborator (Author):

> I think we've said the blueprint_generation identifies the "Nexus instances currently in control", but it's a little weird this is not totally overlapping with the notion of "active". (I know this is by design - I'm okay with the process, just wondering about how we're referring to these concepts separately).

Agreed -- I don't think we should describe the blueprint nexus_generation as identifying the Nexus instances that are in control, but rather the ones that we want to be in control.

Updated the comment to be much more explicit in f01ffe9.
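
To summarize the distinction being discussed here, a minimal sketch (hypothetical names; the real code reads both values out of the blueprint, as in the find_map excerpt quoted above):

    /// Illustrative only: the generation the executing Nexus is running at
    /// can lag the blueprint's target generation while a handoff is in flight.
    struct Generations {
        /// Generation of the Nexus running the executor, looked up from its
        /// own zone config in the blueprint.
        running: u64,
        /// blueprint.nexus_generation: the Nexuses we want to be in control.
        target: u64,
    }

    impl Generations {
        /// A handoff is in flight when the target generation has moved past
        /// the generation of the Nexus that is still doing the work.
        fn handoff_in_flight(&self) -> bool {
            self.target > self.running
        }
    }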

Base automatically changed from dap/quiesce-with-db to main on September 16, 2025 at 22:42.
davepacheco merged commit d98f378 into main on Sep 17, 2025.
17 checks passed.
davepacheco deleted the dap/write-notyet-records branch on September 17, 2025 at 13:21.