write correct db_metadata_nexus records during blueprint execution #9023
Conversation
```rust
opctx: &OpContext,
blueprint: &nexus_types::deployment::Blueprint,
blueprint_id: BlueprintUuid,
active: &BTreeSet<OmicronZoneUuid>,
```
More a note for myself: this is identical to #8936, but using `BTreeSet`s instead of `Vec`s (totally reasonable)
```rust
opctx: &'a OpContext,
datastore: &'a DataStore,
blueprint: &'a Blueprint,
nexus_id: Option<OmicronZoneUuid>,
```
Why would we want to let this be optional? When are we running the executor outside Nexus?
(The consequences of a Nexus being configured to not supply this value seem really bad)
Yeah, good question. There are basically three callers of `realize_blueprint()`:

- The nexus-reconfigurator-execution tests (via `realize_blueprint_and_expect`). These pass a made-up Nexus id.
- `reconfigurator-exec-unsafe`, a dev tool. This passes `None`.
- The Nexus blueprint execution background task. This passes a real value here.

So this value would be `None` when running from `reconfigurator-exec-unsafe`. This isn't really new: it's already the case that the Nexus id is optional for blueprint execution, and some steps (namely: saga re-assignment and marking failed support bundles) are skipped if it's not provided. I adopted the same pattern here (sketched below).
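To make that skip-if-absent pattern concrete, here is a minimal, self-contained sketch; it is not the actual omicron code, and `NexusId` and `write_db_metadata_nexus_records` are illustrative stand-ins for `OmicronZoneUuid` and the real executor step:

```rust
// Toy stand-in for OmicronZoneUuid; illustrative only.
type NexusId = u64;

fn write_db_metadata_nexus_records(nexus_id: Option<NexusId>) {
    // When run from `reconfigurator-exec-unsafe`, nexus_id is None, so this
    // step is skipped entirely, just like saga re-assignment and marking
    // failed support bundles.
    let Some(nexus_id) = nexus_id else {
        println!("no Nexus id supplied; skipping db_metadata_nexus step");
        return;
    };
    println!("writing db_metadata_nexus records as Nexus {nexus_id}");
}

fn main() {
    write_db_metadata_nexus_records(None); // dev-tool path: step skipped
    write_db_metadata_nexus_records(Some(42)); // background-task path: step runs
}
```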
Something I realized in addressing your other comment is that it wouldn't be safe for `reconfigurator-exec-unsafe` to try to do this step, at least not all the time. Consider:

- Suppose we have Nexus instances N1, N2, N3 at generation G, and blueprint generation = G.
- We provision instances N4, N5, N6 at generation G + 1 in preparation for handoff.
- We create blueprint `B1` with generation G + 1, starting the handoff process.
- Someone runs `reconfigurator-exec-unsafe`. It needs to compute the set of active vs. not-yet Nexus zones, as we do here.

The way this code is written now, we won't get here at all because we don't have a valid Nexus id, so it won't do anything.

If we instead queried the database for the list of currently-active Nexus zones, there are two possibilities:

1. Handoff has not happened at the time that we query it. `reconfigurator-exec-unsafe` finds N1, N2, and N3 active and N4, N5, and N6 "not-yet".
2. Handoff has happened at the time that we query it. `reconfigurator-exec-unsafe` finds N1, N2, and N3 quiesced and N4, N5, and N6 "active" (or maybe even "not-yet"?).

But there's a time-of-check-to-time-of-use race in case (1): the handoff could happen immediately after the tool queries the database, and then it might insert records with the wrong state (see the toy model below). The check against the current target blueprint does not save us here because the target blueprint doesn't change in the handoff transaction.

This is not a problem for Nexus doing blueprint execution, because Nexus cannot quiesce while it's doing blueprint execution.

It's possible that we could allow `reconfigurator-exec-unsafe` to do this step in some cases (e.g., if `blueprint.nexus_generation` matches the highest-valued Nexus generation, then we know that no handoff is in progress), but I don't think it's worth the complexity right now.
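Here is a toy model of that race, assuming bare integers for generations and a simple two-state classification; none of these names come from the omicron codebase:

```rust
#[derive(Debug, PartialEq)]
enum RecordState {
    Active,
    NotYet,
}

// Classify one Nexus zone relative to the generation we believe is active.
fn classify(zone_generation: u64, active_generation: u64) -> RecordState {
    if zone_generation > active_generation {
        RecordState::NotYet
    } else {
        RecordState::Active
    }
}

fn main() {
    // Time of check: handoff hasn't happened yet, so the tool snapshots
    // the active generation as G = 1.
    let snapshot = 1;

    // Handoff commits here: the generation 2 Nexuses (N4, N5, N6) become
    // active and the generation 1 Nexuses (N1, N2, N3) quiesce. The
    // snapshot is now stale.

    // Time of use: based on the stale snapshot, N1 (generation 1) is still
    // classified as Active even though it has just quiesced.
    assert_eq!(classify(1, snapshot), RecordState::Active);
}
```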
Ack, so to confirm:

- `reconfigurator-exec-unsafe` could have issues if it supplied a non-`None` value here
- ... but, as written, it can't do that
- ... but also, this is probably fine; it's a dev tool anyway
```rust
.find_map(|(_sled_id, zone_cfg, nexus_config)| {
    (zone_cfg.id == nexus_id).then_some(nexus_config.nexus_generation)
})
```
Just to make sure we're super-clear on terminology:

- There is a top-level blueprint `nexus_generation` which will be bumped to start the quiesce process.
- After this value is bumped, but before quiesce is complete, we have Nexuses running with a `nexus_generation` value less than this top-level `nexus_generation`.

In this scenario, we could have:

- Nexus (running, quiescing, supposed to have `active` record) @ generation = N
- Nexus (waiting, supposed to have `not_yet` record) @ generation = N + 1
- `blueprint.nexus_generation` @ generation = N + 1

I think we've said the blueprint `nexus_generation` identifies the "Nexus instances currently in control", but it's a little weird that this is not totally overlapping with the notion of "active". (I know this is by design - I'm okay with the process, just wondering about how we're referring to these concepts separately.)

I added this comment in my PR, to try to help clarify:

```rust
// We need to determine the generation of the currently running set of
// Nexuses. This is usually the same as "blueprint.nexus_generation", but
// can lag behind it if we are one of those Nexuses running after quiescing
// has started.
```

WDYT about adding a note explaining this? I just want to make it very clear why we aren't simply matching on the top-level `blueprint.nexus_generation`.
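A self-contained sketch of the distinction being discussed, using stand-in types rather than the real blueprint structures: the generation of the running set comes from our own zone's config (as in the `find_map` above), and only zones newer than that become `not_yet`:

```rust
#[derive(Debug, PartialEq)]
enum DbMetadataNexusState {
    Active,
    NotYet,
}

struct NexusZone {
    id: u64,
    nexus_generation: u64,
}

// Classify every Nexus zone relative to the generation of the currently
// running set, found via our own zone's config. Returns None if our id
// isn't in the blueprint (mirroring the "we won't get here" case above).
fn classify_zones(
    zones: &[NexusZone],
    our_id: u64,
) -> Option<Vec<(u64, DbMetadataNexusState)>> {
    let current_generation = zones
        .iter()
        .find_map(|z| (z.id == our_id).then_some(z.nexus_generation))?;
    Some(
        zones
            .iter()
            .map(|z| {
                let state = if z.nexus_generation > current_generation {
                    DbMetadataNexusState::NotYet
                } else {
                    DbMetadataNexusState::Active
                };
                (z.id, state)
            })
            .collect(),
    )
}

fn main() {
    // blueprint.nexus_generation has been bumped to N + 1 = 2, but we
    // (id 1) are still running at generation N = 1, so generation 1 zones
    // are active and generation 2 zones are not_yet.
    let zones = [
        NexusZone { id: 1, nexus_generation: 1 },
        NexusZone { id: 4, nexus_generation: 2 },
    ];
    let classified = classify_zones(&zones, 1).unwrap();
    assert_eq!(classified[0], (1, DbMetadataNexusState::Active));
    assert_eq!(classified[1], (4, DbMetadataNexusState::NotYet));
}
```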
> I think we've said the blueprint `nexus_generation` identifies the "Nexus instances currently in control", but it's a little weird that this is not totally overlapping with the notion of "active". (I know this is by design - I'm okay with the process, just wondering about how we're referring to these concepts separately.)

Agreed -- I don't think we should describe the blueprint `nexus_generation` as identifying the Nexus instances that are in control, but rather the ones that we want to be in control.

Updated the comment to be much more explicit in f01ffe9.
Before this change, blueprint execution populated `db_metadata_nexus` records for Nexus zones that were all `active`. Now, per RFD 588, it writes `not_yet` records for zones that have a generation newer than the currently active one.

I've pulled much of this straight out of #8936. Difference from what's there:
- Instead of adding a `pub` function in the datastore to read one Nexus's record for the tests, I used the one I added in coordinated Nexus quiesce (#9010) that reads multiple, and just added a helper for it in the test suite (roughly sketched below).

Depends on #9010.
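A rough sketch of that helper pattern; the map-backed "datastore" and the function names here are hypothetical stand-ins, not the real `DataStore` API:

```rust
use std::collections::BTreeMap;

type NexusId = u64;

// Hypothetical stand-in for the multi-record read added in #9010.
fn all_db_metadata_nexus_records() -> BTreeMap<NexusId, &'static str> {
    BTreeMap::from([(1, "active"), (4, "not_yet")])
}

// The test-suite helper: fetch one Nexus's record via the multi-record
// read instead of adding a dedicated pub datastore function.
fn db_metadata_nexus_record(nexus_id: NexusId) -> Option<&'static str> {
    all_db_metadata_nexus_records().get(&nexus_id).copied()
}

fn main() {
    assert_eq!(db_metadata_nexus_record(4), Some("not_yet"));
    assert_eq!(db_metadata_nexus_record(9), None);
}
```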