Skip to content

Conversation

@jgallagher
Copy link
Contributor

This should be (almost) entirely moving things around, based on discussions @karencfv and I had on #9044. There's one very small behavioral change that didn't affect any tests but I think is more correct; I'll point it out in a comment below.

As I was doing this I found a couple spots where I think we should change the behavior slightly, but I didn't want to roll that into this same PR. I left // TODO-john comments on those, and I'd be happy to stage those in a followup PR if others agree they should change.

Comment on lines +304 to +307
// In the (very unlikely!) event that `boards_preferred` references
// boards that are no longer present in `current_boards`, we'll skip
// them: presumably they are not currently present in inventory and we
// wouldn't be able to configure updates for them anyway.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment describes the behavioral change made by this PR; before, if we had something in boards_preferred that wasn't in current_boards, we'd still attempt to configure an update for it.

That said, I think we have a slightly different bug here around sleds that aren't in inventory more generally - will discuss that in chat and then file an issue if I'm understanding it correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brought this up in chat and I think the fix is straightforward, so I pushed it onto this PR as 80d8f82

zone_safety_checks.is_zone_unsafe_to_shut_down(&zone.id)
{
report.unsafe_zones.insert(zone.id, zone_report);
report.unsafe_zones.insert(zone.id, reason.clone());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we need to clone here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah: report.unsafe_zones wants to take ownership of the reason, and we have a reference to a reason here because zone_safety_checks is still holding ownership internally. (For good reason: we'll continue to reference it in later steps of planning.)

BTreeMap<TypedUuid<OmicronZoneKind>, ZoneUnsafeToShutdown>,
>,
) -> Result<MgsUpdateOutcome, FailedHostOsUpdateReason> {
unsafe_zone_reason: F,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion, take it or leave it since I think it'd be non-trivial and wouldn't really change anything but (arguably) reduce complexity: I would either accept a &dyn FnOnce() -> Option<String> to avoid the generic here or (better) define a struct for this in this crate. e.g., UnsafeZoneChecker (probably a better name) with a constructor that could be used by the planner with whatever information it has and a method to be called here.

Comment on lines 682 to 691
// Helper closure to determine whether this board is safe to restart. We'll
// use this when performing an update that will result in a sled reboot (SP
// or Host OS).
let unsafe_zone_reason = || {
// If this board isn't a sled, it's not hosting zones.
let sled_id = board.sled_id()?;

// If the board is a sled, check `zone_safety_checks`.
zone_safety_checks.is_sled_unsafe_to_shut_down(&sled_id)
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on my other comment, why not pass zone_safety_checks down to the component-specific try_make_update and have it invoke a method like can_shutdown_sled(sled_id)?

As I said, this isn't really important. I just find it a little confusing in reading the helpers that they take this closure that takes no arguments and somehow knows how to make this decision.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth on this a bunch of times; basically all the difficulty I was having is because we don't always have a sled_id, and passing in an Option<SledUuid> where "it's always Some(_) if the SP is a sled and always None if it isn't" feels bad. Especially because there are other contexts where we can have a sled but still have an Option<SledUuid> (for reasons I don't remember; I think this comes up when we're trying to pair SPs to sleds based on inventory, but if we don't have the sled record in inventory we can't figure out its sled ID, maybe?).

Maybe I could pass in zone_safety_checks and replace the SpType argument with the UpdateableBoard type I have here? The latter is still a 3-variant enum (sled / switch / power), but statically contains the sled ID if it's a sled. I'll try that and see how it looks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I could pass in zone_safety_checks and replace the SpType argument with the UpdateableBoard type I have here? The latter is still a 3-variant enum (sled / switch / power), but statically contains the sled ID if it's a sled. I'll try that and see how it looks.

I think this worked out pretty well: 73cdd89

Comment on lines 217 to 219
// TODO-john Should this be looking at the `PlanningInput`'s policy
// instead of the hard coded constant we use for the default policy?
// (Same question for the other *_REDUNDANCY constants in this file.)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we have a separate constant I feel like it should be using that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm worried this will break some tests; even if it doesn't I'll need to thread some extra arguments through, so I'll do this in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #9220 for this and updated the comment to reference it.

Comment on lines 266 to 269
// TODO-john Should we also check the number of cockroach nodes in
// `self.blueprint` like we do for boundary NTP and internal DNS?
// Inventory could have 5 live nodes, but if we've just expunged
// one, we know that's out of date.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what to make of this one. This means we're making a decision based an out-of-date inventory, which we usually don't sweat too much (we know there's always that risk). But if it's a predictable, detectable, and important case, maybe we should?

I think for us to be able to detect it by looking at the blueprint, it really has to be that the zone was expunged, right? This check wouldn't help us identify a case where we just bounced a sled for one of the (only) 5 Cockroach nodes and so it's not healthy but we didn't notice. What prevents that?

If this is really just for the expungement case, I'm not sure it's worth special casing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the zones in the blueprint would just be for expungement, yeah. We could look for a PendingMgsUpdate targeting an SP or host and then assume any zones on that sled are going to be down shortly, to try to catch that specific "inventory is out of date but we know we just made things underreplicated" case? (If that sounds right I think I'd also do that as a followup.)

Copy link
Contributor

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the refactor. This is looking much nicer!

Comment on lines +99 to +113
// Helper method only used by `ZoneSafetyChecksBuilder` during our
// construction; ensure we update both our fields together.
fn insert(
&mut self,
sled_id: SledUuid,
zone_id: OmicronZoneUuid,
reason: ZoneUnsafeToShutdown,
) {
self.zone_to_sled.insert(zone_id, sled_id);
self.sleds_with_unsafe_zones
.entry(sled_id)
.or_default()
.insert(zone_id, reason);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit brittle to have to update both fields at the same time with this method 🤔 What do you think about having a helper method that uses self.sleds_with_unsafe_zones to construct and return a BTreeMap<OmicronZoneUuid, SledUuid>, and removing the sleds_with_unsafe_zones field entirely?

Copy link
Contributor Author

@jgallagher jgallagher Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. This method is private and is only called during construction of this type, so it's the one place where we do any insertion, and it ensures both are updated. I don't entirely follow your suggestion. I assume you meant remove the zone_to_sled field entirely? But that means zone_unsafe_shutdown_reason() would need to loop over all the zone IDs inside all the sleds of sleds_with_unsafe_zones (either to allocate a BTreeMap<OmicronZoneUuid, SledUuid> or more likely just to find the zone, if it exists somewhere inside self.sleds_with_unsafe_zones), which seems worse than having this method handle the bookkeeping of both maps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, fair. My worry is that somewhere down the line, someone ends up making that field public or some change like that happens and a bug is introduced. Having two sources of truth for what is essentially the same information just organised differently makes me nervous.

How about adding a big warning comment that the fields should ever only be updated via this helper method?

Comment on lines 1477 to 1489
// There shouldn't be any sleds left in `included_sled_baseboards`; if
// there are, those are sleds our planning input says are part of the
// system but whose SPs aren't currently in inventory. We have to
// consider them "not updated" in terms of planning.
for (baseboard_id, _sled_id) in included_sled_baseboards {
blocked_mgs_updates.push(BlockedMgsUpdate {
baseboard_id: Arc::new(baseboard_id.clone()),
reason: FailedMgsUpdateReason::Sp(
FailedSpUpdateReason::SpNotInInventory,
),
});
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure about this check. By returning a FailedMgsUpdateReason::Sp(reason) we are saying that we specifically attempted an SP update and failed, but this check isn't really doing that. Would it not make sense to just include all of the boards and let them through? Each individual component is already checking for the existence of it's corresponding SP anyway.

Example:

let Some(sp_info) = inventory.sps.get(baseboard_id) else {
return Err(FailedRotUpdateReason::SpNotInInventory);
};

Copy link
Contributor Author

@jgallagher jgallagher Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost put a comment here that was something like

// We could put a `BlockedMgsUpdate` in for any or all of this sled's components.
// It's sufficient to just pick one, so we'll pick the SP.

Would it not make sense to just include all of the boards and let them through?

I thought this wouldn't work, but looking again I'm not sure why I thought that... Lemme try it. I think it means we'd report a blocked update for the first component we try (RotBootloader) instead of the SP, but maybe that's okay?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test to cover this case in 6788c2c, then changed it as you suggested in 59623e6: indeed it works fine, and the only visible difference is that the planner report complains about the bootloader instead of the SP (which is fine - we just need to insert some blocked update for missing sleds).

@jgallagher
Copy link
Contributor Author

This changed enough I think it warrants a rereview. This now includes a bugfix (the planner shouldn't move on to updating zones if there's a sled missing from inventory) with a test, so probably makes sense to get it in before R17 and the upcoming dogfood hubris update tests?

@jgallagher
Copy link
Contributor Author

This changed enough I think it warrants a rereview. This now includes a bugfix (the planner shouldn't move on to updating zones if there's a sled missing from inventory) with a test, so probably makes sense to get it in before R17 and the upcoming dogfood hubris update tests?

I could also separate out the bugfix and land just that on main, if the general refactor here seems too risky.

Copy link
Contributor

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great thanks! I think we have enough test coverage to land this without being too worried about breaking things before R17. I'll leave it up to you to decide whether to open up a different PR with the bug fix though.

Comment on lines +99 to +113
// Helper method only used by `ZoneSafetyChecksBuilder` during our
// construction; ensure we update both our fields together.
fn insert(
&mut self,
sled_id: SledUuid,
zone_id: OmicronZoneUuid,
reason: ZoneUnsafeToShutdown,
) {
self.zone_to_sled.insert(zone_id, sled_id);
self.sleds_with_unsafe_zones
.entry(sled_id)
.or_default()
.insert(zone_id, reason);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, fair. My worry is that somewhere down the line, someone ends up making that field public or some change like that happens and a bug is introduced. Having two sources of truth for what is essentially the same information just organised differently makes me nervous.

How about adding a big warning comment that the fields should ever only be updated via this helper method?

@jgallagher jgallagher merged commit ef318b3 into main Oct 16, 2025
16 checks passed
@jgallagher jgallagher deleted the john/refactor-zone-safety-checks branch October 16, 2025 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants