Planner refactor: add `ZoneSafetyChecks` type #9202

jgallagher · 2025-10-13T13:46:46Z

This should be (almost) entirely moving things around, based on discussions @karencfv and I had on #9044. There's one very small behavioral change that didn't affect any tests but I think is more correct; I'll point it out in a comment below.

As I was doing this I found a couple spots where I think we should change the behavior slightly, but I didn't want to roll that into this same PR. I left // TODO-john comments on those, and I'd be happy to stage those in a followup PR if others agree they should change.

jgallagher · 2025-10-13T13:57:59Z

nexus/reconfigurator/planning/src/mgs_updates/mod.rs

+        // In the (very unlikely!) event that `boards_preferred` references
+        // boards that are no longer present in `current_boards`, we'll skip
+        // them: presumably they are not currently present in inventory and we
+        // wouldn't be able to configure updates for them anyway.


This comment describes the behavioral change made by this PR; before, if we had something in boards_preferred that wasn't in current_boards, we'd still attempt to configure an update for it.

That said, I think we have a slightly different bug here around sleds that aren't in inventory more generally - will discuss that in chat and then file an issue if I'm understanding it correctly.

Brought this up in chat and I think the fix is straightforward, so I pushed it onto this PR as 80d8f82

davepacheco · 2025-10-13T22:36:07Z

nexus/reconfigurator/planning/src/planner.rs

+                    zone_safety_checks.is_zone_unsafe_to_shut_down(&zone.id)
                {
-                    report.unsafe_zones.insert(zone.id, zone_report);
+                    report.unsafe_zones.insert(zone.id, reason.clone());


nit: do we need to clone here?

Yeah: report.unsafe_zones wants to take ownership of the reason, and we have a reference to a reason here because zone_safety_checks is still holding ownership internally. (For good reason: we'll continue to reference it in later steps of planning.)

davepacheco · 2025-10-13T22:40:24Z

nexus/reconfigurator/planning/src/mgs_updates/host_phase_1.rs

-        BTreeMap<TypedUuid<OmicronZoneKind>, ZoneUnsafeToShutdown>,
-    >,
-) -> Result<MgsUpdateOutcome, FailedHostOsUpdateReason> {
+    unsafe_zone_reason: F,


Suggestion, take it or leave it since I think it'd be non-trivial and wouldn't really change anything but (arguably) reduce complexity: I would either accept a &dyn FnOnce() -> Option<String> to avoid the generic here or (better) define a struct for this in this crate. e.g., UnsafeZoneChecker (probably a better name) with a constructor that could be used by the planner with whatever information it has and a method to be called here.

davepacheco · 2025-10-13T22:45:28Z

nexus/reconfigurator/planning/src/mgs_updates/mod.rs

+    // Helper closure to determine whether this board is safe to restart. We'll
+    // use this when performing an update that will result in a sled reboot (SP
+    // or Host OS).
+    let unsafe_zone_reason = || {
+        // If this board isn't a sled, it's not hosting zones.
+        let sled_id = board.sled_id()?;
+
+        // If the board is a sled, check `zone_safety_checks`.
+        zone_safety_checks.is_sled_unsafe_to_shut_down(&sled_id)
+    };


Following up on my other comment, why not pass zone_safety_checks down to the component-specific try_make_update and have it invoke a method like can_shutdown_sled(sled_id)?

As I said, this isn't really important. I just find it a little confusing in reading the helpers that they take this closure that takes no arguments and somehow knows how to make this decision.

I went back and forth on this a bunch of times; basically all the difficulty I was having is because we don't always have a sled_id, and passing in an Option<SledUuid> where "it's always Some(_) if the SP is a sled and always None if it isn't" feels bad. Especially because there are other contexts where we can have a sled but still have an Option<SledUuid> (for reasons I don't remember; I think this comes up when we're trying to pair SPs to sleds based on inventory, but if we don't have the sled record in inventory we can't figure out its sled ID, maybe?).

Maybe I could pass in zone_safety_checks and replace the SpType argument with the UpdateableBoard type I have here? The latter is still a 3-variant enum (sled / switch / power), but statically contains the sled ID if it's a sled. I'll try that and see how it looks.

Maybe I could pass in zone_safety_checks and replace the SpType argument with the UpdateableBoard type I have here? The latter is still a 3-variant enum (sled / switch / power), but statically contains the sled ID if it's a sled. I'll try that and see how it looks.

I think this worked out pretty well: 73cdd89

davepacheco · 2025-10-13T22:49:06Z

nexus/reconfigurator/planning/src/planner/zone_safety.rs

+        // TODO-john Should this be looking at the `PlanningInput`'s policy
+        // instead of the hard coded constant we use for the default policy?
+        // (Same question for the other *_REDUNDANCY constants in this file.)


Yes, if we have a separate constant I feel like it should be using that.

Ok. I'm worried this will break some tests; even if it doesn't I'll need to thread some extra arguments through, so I'll do this in a separate PR.

Filed #9220 for this and updated the comment to reference it.

davepacheco · 2025-10-13T22:49:38Z

nexus/reconfigurator/planning/src/planner/zone_safety.rs

+            // TODO-john Should we also check the number of cockroach nodes in
+            // `self.blueprint` like we do for boundary NTP and internal DNS?
+            // Inventory could have 5 live nodes, but if we've just expunged
+            // one, we know that's out of date.


I'm not sure what to make of this one. This means we're making a decision based an out-of-date inventory, which we usually don't sweat too much (we know there's always that risk). But if it's a predictable, detectable, and important case, maybe we should?

I think for us to be able to detect it by looking at the blueprint, it really has to be that the zone was expunged, right? This check wouldn't help us identify a case where we just bounced a sled for one of the (only) 5 Cockroach nodes and so it's not healthy but we didn't notice. What prevents that?

If this is really just for the expungement case, I'm not sure it's worth special casing.

Looking at the zones in the blueprint would just be for expungement, yeah. We could look for a PendingMgsUpdate targeting an SP or host and then assume any zones on that sled are going to be down shortly, to try to catch that specific "inventory is out of date but we know we just made things underreplicated" case? (If that sounds right I think I'd also do that as a followup.)

karencfv

Thanks for the refactor. This is looking much nicer!

nexus/reconfigurator/planning/src/mgs_updates/mod.rs

nexus/reconfigurator/planning/src/planner/zone_safety.rs

karencfv · 2025-10-14T00:41:25Z

nexus/reconfigurator/planning/src/planner/zone_safety.rs

+    // Helper method only used by `ZoneSafetyChecksBuilder` during our
+    // construction; ensure we update both our fields together.
+    fn insert(
+        &mut self,
+        sled_id: SledUuid,
+        zone_id: OmicronZoneUuid,
+        reason: ZoneUnsafeToShutdown,
+    ) {
+        self.zone_to_sled.insert(zone_id, sled_id);
+        self.sleds_with_unsafe_zones
+            .entry(sled_id)
+            .or_default()
+            .insert(zone_id, reason);
+    }
+}


It feels a bit brittle to have to update both fields at the same time with this method 🤔 What do you think about having a helper method that uses self.sleds_with_unsafe_zones to construct and return a BTreeMap<OmicronZoneUuid, SledUuid>, and removing the sleds_with_unsafe_zones field entirely?

Hm. This method is private and is only called during construction of this type, so it's the one place where we do any insertion, and it ensures both are updated. I don't entirely follow your suggestion. I assume you meant remove the zone_to_sled field entirely? But that means zone_unsafe_shutdown_reason() would need to loop over all the zone IDs inside all the sleds of sleds_with_unsafe_zones (either to allocate a BTreeMap<OmicronZoneUuid, SledUuid> or more likely just to find the zone, if it exists somewhere inside self.sleds_with_unsafe_zones), which seems worse than having this method handle the bookkeeping of both maps.

Hm, fair. My worry is that somewhere down the line, someone ends up making that field public or some change like that happens and a bug is introduced. Having two sources of truth for what is essentially the same information just organised differently makes me nervous.

How about adding a big warning comment that the fields should ever only be updated via this helper method?

karencfv · 2025-10-14T01:02:16Z

nexus/reconfigurator/planning/src/planner.rs

+        // There shouldn't be any sleds left in `included_sled_baseboards`; if
+        // there are, those are sleds our planning input says are part of the
+        // system but whose SPs aren't currently in inventory. We have to
+        // consider them "not updated" in terms of planning.
+        for (baseboard_id, _sled_id) in included_sled_baseboards {
+            blocked_mgs_updates.push(BlockedMgsUpdate {
+                baseboard_id: Arc::new(baseboard_id.clone()),
+                reason: FailedMgsUpdateReason::Sp(
+                    FailedSpUpdateReason::SpNotInInventory,
+                ),
+            });
+        }
+


I'm not 100% sure about this check. By returning a FailedMgsUpdateReason::Sp(reason) we are saying that we specifically attempted an SP update and failed, but this check isn't really doing that. Would it not make sense to just include all of the boards and let them through? Each individual component is already checking for the existence of it's corresponding SP anyway.

Example:

omicron/nexus/reconfigurator/planning/src/mgs_updates/rot.rs

Lines 112 to 114 in a3ee742

let Some(sp_info) = inventory.sps.get(baseboard_id) else {

return Err(FailedRotUpdateReason::SpNotInInventory);

};

I almost put a comment here that was something like

// We could put a `BlockedMgsUpdate` in for any or all of this sled's components. // It's sufficient to just pick one, so we'll pick the SP.

Would it not make sense to just include all of the boards and let them through?

I thought this wouldn't work, but looking again I'm not sure why I thought that... Lemme try it. I think it means we'd report a blocked update for the first component we try (RotBootloader) instead of the SP, but maybe that's okay?

I added a test to cover this case in 6788c2c, then changed it as you suggested in 59623e6: indeed it works fine, and the only visible difference is that the planner report complains about the bootloader instead of the SP (which is fine - we just need to insert some blocked update for missing sleds).

…fety-checks

jgallagher · 2025-10-14T21:55:21Z

This changed enough I think it warrants a rereview. This now includes a bugfix (the planner shouldn't move on to updating zones if there's a sled missing from inventory) with a test, so probably makes sense to get it in before R17 and the upcoming dogfood hubris update tests?

jgallagher · 2025-10-14T21:56:14Z

This changed enough I think it warrants a rereview. This now includes a bugfix (the planner shouldn't move on to updating zones if there's a sled missing from inventory) with a test, so probably makes sense to get it in before R17 and the upcoming dogfood hubris update tests?

I could also separate out the bugfix and land just that on main, if the general refactor here seems too risky.

karencfv

Looks great thanks! I think we have enough test coverage to land this without being too worried about breaking things before R17. I'll leave it up to you to decide whether to open up a different PR with the bug fix though.

karencfv · 2025-10-14T23:50:09Z

nexus/reconfigurator/planning/src/planner/zone_safety.rs

+    // Helper method only used by `ZoneSafetyChecksBuilder` during our
+    // construction; ensure we update both our fields together.
+    fn insert(
+        &mut self,
+        sled_id: SledUuid,
+        zone_id: OmicronZoneUuid,
+        reason: ZoneUnsafeToShutdown,
+    ) {
+        self.zone_to_sled.insert(zone_id, sled_id);
+        self.sleds_with_unsafe_zones
+            .entry(sled_id)
+            .or_default()
+            .insert(zone_id, reason);
+    }
+}


Hm, fair. My worry is that somewhere down the line, someone ends up making that field public or some change like that happens and a bug is introduced. Having two sources of truth for what is essentially the same information just organised differently makes me nervous.

How about adding a big warning comment that the fields should ever only be updated via this helper method?

…fety-checks

Planner refactor: add ZoneSafetyChecks type

c056c5e

jgallagher requested review from karencfv and smklein October 13, 2025 13:46

jgallagher commented Oct 13, 2025

View reviewed changes

add blocked update reason for sled SPs not in inventory

80d8f82

jgallagher requested a review from davepacheco October 13, 2025 14:27

davepacheco reviewed Oct 13, 2025

View reviewed changes

davepacheco approved these changes Oct 13, 2025

View reviewed changes

karencfv reviewed Oct 14, 2025

View reviewed changes

jgallagher added 5 commits October 14, 2025 15:54

pass ZoneSafetyChecks down into SP/host try_make_update()s

73cdd89

style

b2afd85

add reconfigurator-cli test for blocked updates due to missing sleds

6788c2c

pass missing sleds down to MGS planner and let errors bubble out

59623e6

Merge remote-tracking branch 'origin/main' into john/refactor-zone-sa…

0422db0

…fety-checks

jgallagher requested review from davepacheco and karencfv October 14, 2025 21:53

karencfv approved these changes Oct 15, 2025

View reviewed changes

jgallagher added 2 commits October 15, 2025 10:36

comment cleanup

0855c0b

Merge remote-tracking branch 'origin/main' into john/refactor-zone-sa…

de99a37

…fety-checks

jgallagher merged commit ef318b3 into main Oct 16, 2025
16 checks passed

jgallagher deleted the john/refactor-zone-safety-checks branch October 16, 2025 15:25

	let Some(sp_info) = inventory.sps.get(baseboard_id) else {
	return Err(FailedRotUpdateReason::SpNotInInventory);
	};

Planner refactor: add ZoneSafetyChecks type #9202

Planner refactor: add ZoneSafetyChecks type #9202

Uh oh!

Conversation

jgallagher commented Oct 13, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

karencfv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgallagher Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgallagher Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jgallagher commented Oct 14, 2025

Uh oh!

jgallagher commented Oct 14, 2025

Uh oh!

karencfv left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Planner refactor: add `ZoneSafetyChecks` type #9202

Planner refactor: add `ZoneSafetyChecks` type #9202

jgallagher Oct 14, 2025 •

edited

Loading

jgallagher Oct 14, 2025 •

edited

Loading