-
Notifications
You must be signed in to change notification settings - Fork 62
Planner refactor: add ZoneSafetyChecks type
#9202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| // In the (very unlikely!) event that `boards_preferred` references | ||
| // boards that are no longer present in `current_boards`, we'll skip | ||
| // them: presumably they are not currently present in inventory and we | ||
| // wouldn't be able to configure updates for them anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment describes the behavioral change made by this PR; before, if we had something in boards_preferred that wasn't in current_boards, we'd still attempt to configure an update for it.
That said, I think we have a slightly different bug here around sleds that aren't in inventory more generally - will discuss that in chat and then file an issue if I'm understanding it correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brought this up in chat and I think the fix is straightforward, so I pushed it onto this PR as 80d8f82
| zone_safety_checks.is_zone_unsafe_to_shut_down(&zone.id) | ||
| { | ||
| report.unsafe_zones.insert(zone.id, zone_report); | ||
| report.unsafe_zones.insert(zone.id, reason.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do we need to clone here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah: report.unsafe_zones wants to take ownership of the reason, and we have a reference to a reason here because zone_safety_checks is still holding ownership internally. (For good reason: we'll continue to reference it in later steps of planning.)
| BTreeMap<TypedUuid<OmicronZoneKind>, ZoneUnsafeToShutdown>, | ||
| >, | ||
| ) -> Result<MgsUpdateOutcome, FailedHostOsUpdateReason> { | ||
| unsafe_zone_reason: F, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion, take it or leave it since I think it'd be non-trivial and wouldn't really change anything but (arguably) reduce complexity: I would either accept a &dyn FnOnce() -> Option<String> to avoid the generic here or (better) define a struct for this in this crate. e.g., UnsafeZoneChecker (probably a better name) with a constructor that could be used by the planner with whatever information it has and a method to be called here.
| // Helper closure to determine whether this board is safe to restart. We'll | ||
| // use this when performing an update that will result in a sled reboot (SP | ||
| // or Host OS). | ||
| let unsafe_zone_reason = || { | ||
| // If this board isn't a sled, it's not hosting zones. | ||
| let sled_id = board.sled_id()?; | ||
|
|
||
| // If the board is a sled, check `zone_safety_checks`. | ||
| zone_safety_checks.is_sled_unsafe_to_shut_down(&sled_id) | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up on my other comment, why not pass zone_safety_checks down to the component-specific try_make_update and have it invoke a method like can_shutdown_sled(sled_id)?
As I said, this isn't really important. I just find it a little confusing in reading the helpers that they take this closure that takes no arguments and somehow knows how to make this decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and forth on this a bunch of times; basically all the difficulty I was having is because we don't always have a sled_id, and passing in an Option<SledUuid> where "it's always Some(_) if the SP is a sled and always None if it isn't" feels bad. Especially because there are other contexts where we can have a sled but still have an Option<SledUuid> (for reasons I don't remember; I think this comes up when we're trying to pair SPs to sleds based on inventory, but if we don't have the sled record in inventory we can't figure out its sled ID, maybe?).
Maybe I could pass in zone_safety_checks and replace the SpType argument with the UpdateableBoard type I have here? The latter is still a 3-variant enum (sled / switch / power), but statically contains the sled ID if it's a sled. I'll try that and see how it looks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I could pass in
zone_safety_checksand replace theSpTypeargument with theUpdateableBoardtype I have here? The latter is still a 3-variant enum (sled / switch / power), but statically contains the sled ID if it's a sled. I'll try that and see how it looks.
I think this worked out pretty well: 73cdd89
| // TODO-john Should this be looking at the `PlanningInput`'s policy | ||
| // instead of the hard coded constant we use for the default policy? | ||
| // (Same question for the other *_REDUNDANCY constants in this file.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if we have a separate constant I feel like it should be using that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I'm worried this will break some tests; even if it doesn't I'll need to thread some extra arguments through, so I'll do this in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #9220 for this and updated the comment to reference it.
| // TODO-john Should we also check the number of cockroach nodes in | ||
| // `self.blueprint` like we do for boundary NTP and internal DNS? | ||
| // Inventory could have 5 live nodes, but if we've just expunged | ||
| // one, we know that's out of date. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what to make of this one. This means we're making a decision based an out-of-date inventory, which we usually don't sweat too much (we know there's always that risk). But if it's a predictable, detectable, and important case, maybe we should?
I think for us to be able to detect it by looking at the blueprint, it really has to be that the zone was expunged, right? This check wouldn't help us identify a case where we just bounced a sled for one of the (only) 5 Cockroach nodes and so it's not healthy but we didn't notice. What prevents that?
If this is really just for the expungement case, I'm not sure it's worth special casing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the zones in the blueprint would just be for expungement, yeah. We could look for a PendingMgsUpdate targeting an SP or host and then assume any zones on that sled are going to be down shortly, to try to catch that specific "inventory is out of date but we know we just made things underreplicated" case? (If that sounds right I think I'd also do that as a followup.)
karencfv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the refactor. This is looking much nicer!
| // Helper method only used by `ZoneSafetyChecksBuilder` during our | ||
| // construction; ensure we update both our fields together. | ||
| fn insert( | ||
| &mut self, | ||
| sled_id: SledUuid, | ||
| zone_id: OmicronZoneUuid, | ||
| reason: ZoneUnsafeToShutdown, | ||
| ) { | ||
| self.zone_to_sled.insert(zone_id, sled_id); | ||
| self.sleds_with_unsafe_zones | ||
| .entry(sled_id) | ||
| .or_default() | ||
| .insert(zone_id, reason); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels a bit brittle to have to update both fields at the same time with this method 🤔 What do you think about having a helper method that uses self.sleds_with_unsafe_zones to construct and return a BTreeMap<OmicronZoneUuid, SledUuid>, and removing the sleds_with_unsafe_zones field entirely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. This method is private and is only called during construction of this type, so it's the one place where we do any insertion, and it ensures both are updated. I don't entirely follow your suggestion. I assume you meant remove the zone_to_sled field entirely? But that means zone_unsafe_shutdown_reason() would need to loop over all the zone IDs inside all the sleds of sleds_with_unsafe_zones (either to allocate a BTreeMap<OmicronZoneUuid, SledUuid> or more likely just to find the zone, if it exists somewhere inside self.sleds_with_unsafe_zones), which seems worse than having this method handle the bookkeeping of both maps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, fair. My worry is that somewhere down the line, someone ends up making that field public or some change like that happens and a bug is introduced. Having two sources of truth for what is essentially the same information just organised differently makes me nervous.
How about adding a big warning comment that the fields should ever only be updated via this helper method?
| // There shouldn't be any sleds left in `included_sled_baseboards`; if | ||
| // there are, those are sleds our planning input says are part of the | ||
| // system but whose SPs aren't currently in inventory. We have to | ||
| // consider them "not updated" in terms of planning. | ||
| for (baseboard_id, _sled_id) in included_sled_baseboards { | ||
| blocked_mgs_updates.push(BlockedMgsUpdate { | ||
| baseboard_id: Arc::new(baseboard_id.clone()), | ||
| reason: FailedMgsUpdateReason::Sp( | ||
| FailedSpUpdateReason::SpNotInInventory, | ||
| ), | ||
| }); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure about this check. By returning a FailedMgsUpdateReason::Sp(reason) we are saying that we specifically attempted an SP update and failed, but this check isn't really doing that. Would it not make sense to just include all of the boards and let them through? Each individual component is already checking for the existence of it's corresponding SP anyway.
Example:
omicron/nexus/reconfigurator/planning/src/mgs_updates/rot.rs
Lines 112 to 114 in a3ee742
| let Some(sp_info) = inventory.sps.get(baseboard_id) else { | |
| return Err(FailedRotUpdateReason::SpNotInInventory); | |
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I almost put a comment here that was something like
// We could put a `BlockedMgsUpdate` in for any or all of this sled's components.
// It's sufficient to just pick one, so we'll pick the SP.
Would it not make sense to just include all of the boards and let them through?
I thought this wouldn't work, but looking again I'm not sure why I thought that... Lemme try it. I think it means we'd report a blocked update for the first component we try (RotBootloader) instead of the SP, but maybe that's okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test to cover this case in 6788c2c, then changed it as you suggested in 59623e6: indeed it works fine, and the only visible difference is that the planner report complains about the bootloader instead of the SP (which is fine - we just need to insert some blocked update for missing sleds).
|
This changed enough I think it warrants a rereview. This now includes a bugfix (the planner shouldn't move on to updating zones if there's a sled missing from inventory) with a test, so probably makes sense to get it in before R17 and the upcoming dogfood hubris update tests? |
I could also separate out the bugfix and land just that on |
karencfv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great thanks! I think we have enough test coverage to land this without being too worried about breaking things before R17. I'll leave it up to you to decide whether to open up a different PR with the bug fix though.
| // Helper method only used by `ZoneSafetyChecksBuilder` during our | ||
| // construction; ensure we update both our fields together. | ||
| fn insert( | ||
| &mut self, | ||
| sled_id: SledUuid, | ||
| zone_id: OmicronZoneUuid, | ||
| reason: ZoneUnsafeToShutdown, | ||
| ) { | ||
| self.zone_to_sled.insert(zone_id, sled_id); | ||
| self.sleds_with_unsafe_zones | ||
| .entry(sled_id) | ||
| .or_default() | ||
| .insert(zone_id, reason); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, fair. My worry is that somewhere down the line, someone ends up making that field public or some change like that happens and a bug is introduced. Having two sources of truth for what is essentially the same information just organised differently makes me nervous.
How about adding a big warning comment that the fields should ever only be updated via this helper method?
This should be (almost) entirely moving things around, based on discussions @karencfv and I had on #9044. There's one very small behavioral change that didn't affect any tests but I think is more correct; I'll point it out in a comment below.
As I was doing this I found a couple spots where I think we should change the behavior slightly, but I didn't want to roll that into this same PR. I left
// TODO-johncomments on those, and I'd be happy to stage those in a followup PR if others agree they should change.