add module for planning MGS-based updates #8261
Conversation
karencfv left a comment
Looks great! Mostly I just have a bunch of questions
```rust
NotDone,
/// the requested update has not completed and the preconditions are not
/// currently met
Impossible,
```
At first glance it took me a second to understand what was happening here. How about Skipped? Feels a bit closer to what the description is saying?
I keep wondering about this one. Is there a chance some preconditions are not met, but the update is still possible? I'm thinking about this in the scope of the RoT that has so many preconditions like pending_persistent_boot_preference or transient_boot_preference.
(Sorry, this might cover a lot of ground you already know, but I wasn't sure if the confusion was about the name or the behavior.)
Hmm. "Skipped" isn't right. For one, we don't know that the update was ever tried. The really important thing about this variant is that the planner must fix the update's preconditions, because otherwise the system may get stuck with this update pending (and impossible) forever.
Recall that we added preconditions to deal with a sequence like this:
- initial state: active slot v2, inactive slot v1
- blueprint B1 created to update to v3
- Nexus 1 executes B1. New state: active = v3, inactive = v2
- blueprint B2 created to update to v4
- Nexus 1 executes B2. New state: active = v4, inactive = v3
- Nexus 2 executes old blueprint B1. (This is always a thing that can happen: imagine Nexus 2 loaded the target blueprint, was busy for a while doing other things, and only just got around to executing it.)
Now Nexus 2 downgrades the SP to active = v3, inactive = v4. The idea with preconditions is that Nexus 2 checks the slots and sees that the preconditions don't hold and skips execution. On the next lap through the executor, it gets an updated blueprint and realizes there's nothing to do.
But the risk introduced by preconditions is: what if they're wrong for whatever reason? Execution would skip this update, presumably forever. The planner has to notice that case and update the preconditions to match reality. This could happen if:
- initial state: active slot v2, inactive slot v1
- blueprint B1 created to update to v3 (precondition: active = v2, inactive = v1)
- blueprint B2 created to update to v4 (precondition: active = v2, inactive = v1)
- blueprint B3 created to update to v5 (precondition: active = v2, inactive = v1)
- Nexus 1 executes B2. New state: active = v4, inactive = v2.
- at this point, the latest blueprint (B3) has a pending update whose preconditions (active = v2, inactive = v1) can never be satisfied. But presumably we still do want to update to v5. What we want is to create a blueprint B4 with precondition: active = v4, inactive = v2.
Another name for this variant would be NeedsUpdate or NeedsPreconditionsUpdated, as in "this PendingMgsUpdate needs its preconditions updated before anybody can execute it". The code is structured instead as: "this PendingMgsUpdate is impossible to execute [because its preconditions aren't true], so just remove it [and the usual process will add a new one with the right preconditions, if it can]".
> Is there a chance some preconditions are not met, but the update is still possible? I'm thinking about this in the scope of the RoT that has so many preconditions like pending_persistent_boot_preference or transient_boot_preference.
It's conceivable that the preconditions are not met, but there's an update in progress already that will cause them to be met. This might be possible if the inventory was collected in the middle of an update. But there's no way for the planner to know whether that's true. If the planner takes no action, and that's not what's going on, then the system will be stuck forever. On the other hand, if the planner assumes it needs to update the preconditions, and there was an update going on, that will eventually converge. (On the next planning lap, we'll see that the inventory has changed again, and either the update we're trying to do is done (in which case we just remove it altogether) or its preconditions need another update.)
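A minimal, self-contained sketch of the keep/drop rule described above (stand-in types of my own; this is not the planner's actual code):

```rust
/// Status of a pending MGS-driven update, as in the excerpt above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MgsUpdateStatus {
    /// the requested update has completed
    Done,
    /// the requested update has not completed, but its preconditions hold
    NotDone,
    /// the requested update has not completed and the preconditions are not
    /// currently met
    Impossible,
}

/// Whether a pending update should be carried into the next blueprint.
/// `Done` means it's finished; `Impossible` means its preconditions are
/// stale, so it gets removed and the usual planning pass adds a new update
/// whose preconditions match current reality.
fn keep_pending_update(status: MgsUpdateStatus) -> bool {
    matches!(status, MgsUpdateStatus::NotDone)
}
```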
Thanks for the explanation!
```rust
    expected_inactive_version,
} => {
    let Some(active_caboose) =
        inventory.caboose_for(CabooseWhich::SpSlot0, baseboard_id)
```
What happens if a collection hasn't occurred since the time an update happened and this code is called?
The inventory will reflect the pre-update state and consider the update NotDone. That's fine. It'll stay in the blueprint until an inventory is collected that does reflect the change.
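A sketch of that status check, reusing the MgsUpdateStatus enum from the sketch above and simplifying versions to plain strings (the real code compares inventory cabooses against a PendingMgsUpdate's expected versions; this struct is a stand-in):

```rust
/// Stand-in for the relevant fields of `PendingMgsUpdate`.
struct PendingUpdateSketch {
    artifact_version: String,
    expected_active_version: String,
    expected_inactive_version: String,
}

/// A collection taken before the update ran still shows the expected
/// (pre-update) versions, so we land in `NotDone` and keep waiting.
fn sp_update_status(
    found_active: &str,
    found_inactive: &str,
    update: &PendingUpdateSketch,
) -> MgsUpdateStatus {
    if found_active == update.artifact_version {
        // The new version is in the active slot: the update is complete.
        MgsUpdateStatus::Done
    } else if found_active == update.expected_active_version
        && found_inactive == update.expected_inactive_version
    {
        // Preconditions still hold; the update just hasn't happened yet
        // (or inventory hasn't caught up).
        MgsUpdateStatus::NotDone
    } else {
        // Neither done nor executable as written: preconditions are stale.
        MgsUpdateStatus::Impossible
    }
}
```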
```rust
// TODO When we add support for planning RoT, RoT bootloader, and host OS
// updates, we'll try these in a hardcoded priority order until any of them
// returns `Some`. For now, we only plan SP updates.
```
What would the order be?
- SP
- RoT
- RoT bootloader
- Host OS?
I believe it's: RoT bootloader, RoT, SP, host OS. This is described in https://rfd.shared.oxide.computer/rfd/565#_update_sequence, where @labbott mentioned that we generally assume it goes bootloader, then RoT, then SP.
Updated the comment to point to the RFD.
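For illustration, a sketch of that priority order with hypothetical per-component helpers (only the ordering is taken from RFD 565; the names, signatures, and stub bodies are invented):

```rust
// Hypothetical per-component planners; the real functions would compare
// inventory against the target release's artifacts and return
// `Some(update)` when that component is out of date.
fn try_update_rot_bootloader(_board: &str) -> Option<String> { None }
fn try_update_rot(_board: &str) -> Option<String> { None }
fn try_update_sp(_board: &str) -> Option<String> { None }
fn try_update_host_os(_board: &str) -> Option<String> { None }

/// Try each component in the RFD 565 order, planning at most one update.
fn plan_one_mgs_update(board: &str) -> Option<String> {
    try_update_rot_bootloader(board)
        .or_else(|| try_update_rot(board))
        .or_else(|| try_update_sp(board))
        .or_else(|| try_update_host_os(board))
}
```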
```rust
}

if matching_artifacts.len() > 1 {
    // This is noteworthy, but not a problem. We'll just pick one.
```
Why is it not a problem?
Well, what I meant by the comment was: this doesn't prevent us from picking one and proceeding (as opposed to the previous case). I can't think of a reason why this would happen under normal conditions though.
Updated the comment.
I could see an argument that we shouldn't do anything in this case but I don't feel strongly either way.
plotnick left a comment
Thanks for tackling this; it seems like it's mostly hairy edge cases. Getting pretty close to full update capability!
```rust
/// (it is possible to have baseboards in inventory that would never be
/// updated because they're not considered part of the current system)
```
Is this referring to the eventual full-TQ case, when we could have boards that are physically present but not trusted? Are there any other circumstances in which this would differ from what's in inventory?
The main case I was thinking of is that in dogfood we have historically had sleds physically in the rack but not part of the control plane for various reasons. From Omicron's perspective, they're not at all part of the system, but they do get seen by inventory (which, by design, discovers everything that's physically connected).
There's also the case that during the "add sled" process, sleds are present and seen in inventory before they're added to the cluster. In this case, we might actually want to update their software, but I think we want to do that via a separate "Nexus-driven sled recovery" process (similar to MUPdate), not ordinary Reconfigurator actions.
Thanks for the reviews! I think I've addressed the open feedback so I'm going to enable auto-merge. This is zero risk since it's not used by anything and we can always keep iterating on it.
jgallagher left a comment
Looks great! (Extra comments are better late than never, I hope?)
```rust
// alone, great. Return that.
if matches!(
    caboose_status,
    Err(_) | Ok(MgsUpdateStatus::Done) | Ok(MgsUpdateStatus::Impossible)
```
Can caboose_status ever match Err(_)? It looks like errors are immediately returned above.
I think you're correct, the way this is structured right now. I guess the cleanup we could do here is to have caboose_status not be a Result at all, since we return early in the Err case today. I imagine this will get reworked when we add the RoT/bootloader/host OS cases shortly.
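For example, with that hypothetical restructuring (reusing the MgsUpdateStatus enum from the sketch earlier in this thread), the check shrinks to the status variants alone:

```rust
/// With `caboose_status` no longer a `Result`, the `Err(_)` arm disappears.
fn is_resolved(caboose_status: MgsUpdateStatus) -> bool {
    matches!(
        caboose_status,
        MgsUpdateStatus::Done | MgsUpdateStatus::Impossible
    )
}
```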
```rust
    current_artifacts: &TufRepoDescription,
) -> Option<PendingMgsUpdate> {
    let Some(sp_info) = inventory.sps.get(baseboard_id) else {
        warn!(
```
Thinking out loud about #8265 - do we need a way to elevate these warnings out to something that can be reported to omdb / an operator? (Not urgent, necessarily, but might be something we want "pretty soon" after things are functioning?)
Yes, I've been thinking about this a bunch. I am imagining that we'll want to have an accumulator for "things that we're waiting for right now". Any time we bail out of a planning path (like we do here) because we're waiting on some condition, we'll want to append to that. Then I think we want to be able to report that with the upgrade status. What do you think?
I like it; I've wanted a similar accumulated log to attach context to actions taken, although that's been a lot less urgent because the diff at least shows the action itself.
Filed #8284.
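A sketch of what that accumulator might look like (all names here are hypothetical; #8284 tracks the real work):

```rust
/// Hypothetical accumulator for "things the planner is waiting for right
/// now", along the lines discussed above.
#[derive(Debug, Default)]
struct PlanningWaitlist {
    reasons: Vec<String>,
}

impl PlanningWaitlist {
    /// Record a condition we bailed out on during this planning lap.
    fn waiting_on(&mut self, reason: impl Into<String>) {
        self.reasons.push(reason.into());
    }
}

fn example() {
    let mut waitlist = PlanningWaitlist::default();
    // At a bail-out point like the one in the excerpt above:
    waitlist.waiting_on("no SP info in inventory for baseboard");
    // Later, `waitlist.reasons` could be reported alongside the upgrade
    // status (e.g., surfaced via omdb).
}
```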
```rust
if a.id.name != *board {
    return false;
}

match a.id.kind.to_known() {
    None => false,
    Some(
        KnownArtifactKind::GimletSp
        | KnownArtifactKind::PscSp
        | KnownArtifactKind::SwitchSp,
    ) => true,
```
This really illuminates the comment that @rmustacc made a while back about the multiple KnownArtifactKind::*Sp variants not really being useful, since the thing we really need to key on is the board. Another "not urgent" thing, but it might be nice to squash these down to just KnownArtifactKind::Sp?
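For illustration, the match in the excerpt above might collapse to something like this if the variants were squashed (a single KnownArtifactKind::Sp variant does not exist today; the board name does the real disambiguation):

```rust
match a.id.kind.to_known() {
    // Hypothetical squashed variant: one SP kind for all board families.
    Some(KnownArtifactKind::Sp) => a.id.name == *board,
    _ => false,
}
```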
```rust
// This should be impossible unless we shipped a TUF repo with multiple
// artifacts for the same board. But it doesn't prevent us from picking
// one and proceeding. Make a note and proceed.
warn!(log, "found more than one matching artifact for SP update");
```
If we hit this, are there any extra checks we should do to separate "it's fine to just take the first one" from "I don't know how to pick which one I'm supposed to use"? (E.g., should they all be the same version?)
I'm not really sure. It's hard to reason about what to do here because I don't think we'd expect to see it, and if we did, it's not clear what it would mean. (What did we intend if we shipped a TUF repo with multiple artifacts for the same board?) Again, I'm open to making this a harder error (don't plan anything for this case) but it didn't quite seem worth it when I was looking at it.
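One possible shape for the extra check jgallagher suggests, sketched with simplified types (this is not in the PR; it just illustrates "refuse to guess when the duplicates disagree on version"):

```rust
/// Pick among duplicate artifacts for one board only if they all agree on
/// version; otherwise plan nothing rather than guess.  Each entry is
/// (artifact id, version), simplified to strings.
fn pick_sp_artifact(matching: &[(String, String)]) -> Option<&(String, String)> {
    let first = matching.first()?;
    if matching.iter().any(|(_, version)| version != &first.1) {
        // Multiple distinct versions for the same board: ambiguous.
        return None;
    }
    // The duplicates share a version, so taking the first is safe.
    Some(first)
}
```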
This PR adds a new module within nexus_reconfigurator_planning for planning MGS-managed updates. Some notes:
- The entry point is plan_mgs_updates(). I modeled its arguments on what I expect the planner to have available to it, based on Plan zone updates for target release #8024.