add module for planning MGS-based updates #8261
Conversation
karencfv left a comment
Looks great! Mostly I just have a bunch of questions
```rust
NotDone,
/// the requested update has not completed and the preconditions are not
/// currently met
Impossible,
```
At first glance it took me a second to understand what was happening here. How about Skipped? Feels a bit closer to what the description is saying?
I keep wondering about this one. Is there a chance some preconditions are not met, but the update is still possible? I'm thinking about this in the scope of the RoT that has so many preconditions like pending_persistent_boot_preference or transient_boot_preference.
(Sorry, this might cover a lot of ground you already know, but I wasn't sure if the confusion was about the name or the behavior.)
Hmm. "Skipped" isn't right. For one, we don't know that the update was ever tried. The really important thing about this variant is that the planner must fix the update's preconditions, because otherwise the system may get stuck with this update pending (and impossible) forever.
Recall that we added preconditions to deal with a sequence like this:
- initial state: active slot v2, inactive slot v1
- blueprint B1 created to update to v3
- Nexus 1 executes B1. New state: active = v3, inactive = v2
- blueprint B2 created to update to v4
- Nexus 1 executes B2. New state: active = v4, inactive = v3
- Nexus 2 executes old blueprint B1. (This is always a thing that can happen: imagine Nexus 2 loaded the target blueprint, was busy for a while doing other things, and only just got around to executing it.)
Now Nexus 2 downgrades the SP to active = v3, inactive = v4. The idea with preconditions is that Nexus 2 checks the slots and sees that the preconditions don't hold and skips execution. On the next lap through the executor, it gets an updated blueprint and realizes there's nothing to do.
But the risk introduced by preconditions is: what if they're wrong for whatever reason? Execution would skip this update, presumably forever. The planner has to notice that case and update the preconditions to match reality. This could happen if:
- initial state: active slot v2, inactive slot v1
- blueprint B1 created to update to v3 (precondition: active = v2, inactive = v1)
- blueprint B2 created to update to v4 (precondition: active = v2, inactive = v1)
- blueprint B3 created to update to v5 (precondition: active = v2, inactive = v1)
- Nexus 1 executes B2. New state: active = v4, inactive = v2.
- at this point, the latest blueprint (B3) has a pending update whose preconditions (active = v2, inactive = v1) can never be satisfied. But presumably we still do want to update to v5. What we want is to create a blueprint B4 with precondition: active = v4, inactive = v2.
Another name for this variant would be NeedsUpdate or NeedsPreconditionsUpdated, as in "this PendingMgsUpdate needs its preconditions updated before anybody can execute it". The code is structured instead as: "this PendingMgsUpdate is impossible to execute [because its preconditions aren't true], so just remove it [and the usual process will add a new one with the right preconditions, if it can]".
> Is there a chance some preconditions are not met, but the update is still possible? I'm thinking about this in the scope of the RoT that has so many preconditions like pending_persistent_boot_preference or transient_boot_preference.
It's conceivable that the preconditions are not met, but there's an update in progress already that will cause them to be met. This might be possible if the inventory was collected in the middle of an update. But there's no way for the planner to know whether that's true. If the planner takes no action, and that's not what's going on, then the system will be stuck forever. On the other hand, if the planner assumes it needs to update the preconditions, and there was an update going on, that will eventually converge. (On the next planning lap, we'll see that the inventory has changed again, and either the update we're trying to do is done (in which case we just remove it altogether) or its preconditions need another update.)
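A minimal, self-contained sketch of the keep/drop rule described above (stand-in types of my own; this is not the planner's actual code):

```rust
/// Status of a pending MGS-driven update, as in the excerpt above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MgsUpdateStatus {
    /// the requested update has completed
    Done,
    /// the requested update has not completed, but its preconditions hold
    NotDone,
    /// the requested update has not completed and the preconditions are not
    /// currently met
    Impossible,
}

/// Whether a pending update should be carried into the next blueprint.
/// `Done` means it's finished; `Impossible` means its preconditions are
/// stale, so it gets removed and the usual planning pass adds a new update
/// whose preconditions match current reality.
fn keep_pending_update(status: MgsUpdateStatus) -> bool {
    matches!(status, MgsUpdateStatus::NotDone)
}
```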
Thanks for the explanation!
```rust
    expected_inactive_version,
} => {
    let Some(active_caboose) =
        inventory.caboose_for(CabooseWhich::SpSlot0, baseboard_id)
```
What happens if a collection hasn't occurred since the time an update happened and this code is called?
The inventory will reflect the pre-update state and consider the update NotDone. That's fine. It'll stay in the blueprint until an inventory is collected that does reflect the change.
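A sketch of that status check, reusing the MgsUpdateStatus enum from the sketch above and simplifying versions to plain strings (the real code compares inventory cabooses against a PendingMgsUpdate's expected versions; this struct is a stand-in):

```rust
/// Stand-in for the relevant fields of `PendingMgsUpdate`.
struct PendingUpdateSketch {
    artifact_version: String,
    expected_active_version: String,
    expected_inactive_version: String,
}

/// A collection taken before the update ran still shows the expected
/// (pre-update) versions, so we land in `NotDone` and keep waiting.
fn sp_update_status(
    found_active: &str,
    found_inactive: &str,
    update: &PendingUpdateSketch,
) -> MgsUpdateStatus {
    if found_active == update.artifact_version {
        // The new version is in the active slot: the update is complete.
        MgsUpdateStatus::Done
    } else if found_active == update.expected_active_version
        && found_inactive == update.expected_inactive_version
    {
        // Preconditions still hold; the update just hasn't happened yet
        // (or inventory hasn't caught up).
        MgsUpdateStatus::NotDone
    } else {
        // Neither done nor executable as written: preconditions are stale.
        MgsUpdateStatus::Impossible
    }
}
```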
```rust
// TODO When we add support for planning RoT, RoT bootloader, and host OS
// updates, we'll try these in a hardcoded priority order until any of them
// returns `Some`. For now, we only plan SP updates.
```
What would the order be?
- SP
- RoT
- RoT bootloader
- Host OS?
I believe it's: RoT bootloader, RoT, SP, host OS. This is described in https://rfd.shared.oxide.computer/rfd/565#_update_sequence, where @labbott mentioned that we generally assume it goes bootloader, then RoT, then SP.
Updated the comment to point to the RFD.
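For illustration, a sketch of that priority order with hypothetical per-component helpers (only the ordering is taken from RFD 565; the names, signatures, and stub bodies are invented):

```rust
// Hypothetical per-component planners; the real functions would compare
// inventory against the target release's artifacts and return
// `Some(update)` when that component is out of date.
fn try_update_rot_bootloader(_board: &str) -> Option<String> { None }
fn try_update_rot(_board: &str) -> Option<String> { None }
fn try_update_sp(_board: &str) -> Option<String> { None }
fn try_update_host_os(_board: &str) -> Option<String> { None }

/// Try each component in the RFD 565 order, planning at most one update.
fn plan_one_mgs_update(board: &str) -> Option<String> {
    try_update_rot_bootloader(board)
        .or_else(|| try_update_rot(board))
        .or_else(|| try_update_sp(board))
        .or_else(|| try_update_host_os(board))
}
```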
```rust
}

if matching_artifacts.len() > 1 {
    // This is noteworthy, but not a problem. We'll just pick one.
```
Why is it not a problem?
Well, what I meant by the comment was: this doesn't prevent us from picking one and proceeding (as opposed to the previous case). I can't think of a reason why this would happen under normal conditions though.
Updated the comment.
I could see an argument that we shouldn't do anything in this case but I don't feel strongly either way.
plotnick left a comment
Thanks for tackling this; it seems like it's mostly hairy edge cases. Getting pretty close to full update capability!
```rust
/// (it is possible to have baseboards in inventory that would never be
/// updated because they're not considered part of the current system)
```
Is this referring to the eventual full-TQ case, when we could have boards that are physically present but not trusted? Are there any other circumstances in which this would differ from what's in inventory?
The main case I was thinking of is that in dogfood we have historically had sleds physically in the rack but not part of the control plane for various reasons. From Omicron's perspective, they're not at all part of the system, but they do get seen by inventory (which, by design, discovers everything that's physically connected).
There's also the case that during the "add sled" process, sleds are present and seen in inventory before they're added to the cluster. In this case, we might actually want to update their software, but I think we want to do that via a separate "Nexus-driven sled recovery" process (similar to MUPdate), not ordinary Reconfigurator actions.
Thanks for the reviews! I think I've addressed the open feedback so I'm going to enable auto-merge. This is zero risk since it's not used by anything and we can always keep iterating on it.
jgallagher left a comment
Looks great! (Extra comments are better late than never, I hope?)
```rust
// alone, great. Return that.
if matches!(
    caboose_status,
    Err(_) | Ok(MgsUpdateStatus::Done) | Ok(MgsUpdateStatus::Impossible)
```
Can caboose_status ever match Err(_)? It looks like errors are immediately returned above.
I think you're correct, the way this is structured right now. I guess the cleanup we could do here is to have caboose_status not be a Result at all, since we return early in the Err case today. I imagine this will get reworked when we add the RoT/bootloader/host OS cases shortly.
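For example, with that hypothetical restructuring (reusing the MgsUpdateStatus enum from the sketch earlier in this thread), the check shrinks to the status variants alone:

```rust
/// With `caboose_status` no longer a `Result`, the `Err(_)` arm disappears.
fn is_resolved(caboose_status: MgsUpdateStatus) -> bool {
    matches!(
        caboose_status,
        MgsUpdateStatus::Done | MgsUpdateStatus::Impossible
    )
}
```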
```rust
    current_artifacts: &TufRepoDescription,
) -> Option<PendingMgsUpdate> {
    let Some(sp_info) = inventory.sps.get(baseboard_id) else {
        warn!(
```
Thinking out loud about #8265 - do we need a way to elevate these warnings out to something that can be reported to omdb / an operator? (Not urgent, necessarily, but might be something we want "pretty soon" after things are functioning?)
Yes, I've been thinking about this a bunch. I am imagining that we'll want to have an accumulator for "things that we're waiting for right now". Any time we bail out of a planning path (like we do here) because we're waiting on some condition, we'll want to append to that. Then I think we want to be able to report that with the upgrade status. What do you think?
I like it; I've wanted a similar accumulated log to attach context to actions taken, although that's been a lot less urgent because the diff at least shows the action itself.
Filed #8284.
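A sketch of what that accumulator might look like (all names here are hypothetical; #8284 tracks the real work):

```rust
/// Hypothetical accumulator for "things the planner is waiting for right
/// now", along the lines discussed above.
#[derive(Debug, Default)]
struct PlanningWaitlist {
    reasons: Vec<String>,
}

impl PlanningWaitlist {
    /// Record a condition we bailed out on during this planning lap.
    fn waiting_on(&mut self, reason: impl Into<String>) {
        self.reasons.push(reason.into());
    }
}

fn example() {
    let mut waitlist = PlanningWaitlist::default();
    // At a bail-out point like the one in the excerpt above:
    waitlist.waiting_on("no SP info in inventory for baseboard");
    // Later, `waitlist.reasons` could be reported alongside the upgrade
    // status (e.g., surfaced via omdb).
}
```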
```rust
if a.id.name != *board {
    return false;
}

match a.id.kind.to_known() {
    None => false,
    Some(
        KnownArtifactKind::GimletSp
        | KnownArtifactKind::PscSp
        | KnownArtifactKind::SwitchSp,
    ) => true,
```
This really illuminates the comment that @rmustacc made a while back about the multiple KnownArtifactKind::*Sp variants not really being useful, since the thing we really need to key on is the board. Another "not urgent" thing, but it might be nice to squash these down to just KnownArtifactKind::Sp?
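For illustration, the match in the excerpt above might collapse to something like this if the variants were squashed (a single KnownArtifactKind::Sp variant does not exist today; the board name does the real disambiguation):

```rust
match a.id.kind.to_known() {
    // Hypothetical squashed variant: one SP kind for all board families.
    Some(KnownArtifactKind::Sp) => a.id.name == *board,
    _ => false,
}
```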
```rust
// This should be impossible unless we shipped a TUF repo with multiple
// artifacts for the same board. But it doesn't prevent us from picking
// one and proceeding. Make a note and proceed.
warn!(log, "found more than one matching artifact for SP update");
```
If we hit this, are there any extra checks we should do to separate "it's fine to just take the first one" from "I don't know how to pick which one I'm supposed to use"? (E.g., should they all be the same version?)
I'm not really sure. It's hard to reason about what to do here because I don't think we'd expect to see it, and if we did, it's not clear what it would mean. (What did we intend if we shipped a TUF repo with multiple artifacts for the same board?) Again, I'm open to making this a harder error (don't plan anything for this case) but it didn't quite seem worth it when I was looking at it.
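One possible shape for the extra check jgallagher suggests, sketched with simplified types (this is not in the PR; it just illustrates "refuse to guess when the duplicates disagree on version"):

```rust
/// Pick among duplicate artifacts for one board only if they all agree on
/// version; otherwise plan nothing rather than guess.  Each entry is
/// (artifact id, version), simplified to strings.
fn pick_sp_artifact(matching: &[(String, String)]) -> Option<&(String, String)> {
    let first = matching.first()?;
    if matching.iter().any(|(_, version)| version != &first.1) {
        // Multiple distinct versions for the same board: ambiguous.
        return None;
    }
    // The duplicates share a version, so taking the first is safe.
    Some(first)
}
```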
This PR adds a new module within nexus_reconfigurator_planning for planning MGS-managed updates. Some notes:
- The entry point is plan_mgs_updates(). I modeled its arguments on what I expect the planner to have available to it, based on Plan zone updates for target release #8024.