-
Couldn't load subscription status.
- Fork 60
[sled-agent] Add endpoints and omdb subcommands to stop and start the switch zone
#8959
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
We cannot shut down the switch zone from itself: But we can shut down the other switch zone, and restart it. From switch zone 1: |
sled-agent/src/hardware_monitor.rs
Outdated
| { | ||
| error!(self.log, "Failed to activate switch"; e); | ||
| } | ||
| async fn set_tofino_loaded(&mut self, tofino_loaded: bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: don't love this name; set_... to me usually implies a setter, which this is, but it's doing much more than that.
Maybe set_tofino_loaded_and_update_switch_zone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good call. Looking at this again, I also don't love that the borrow_and_update() read of the policy is kinda hidden inside ensure_switch_zone_activated_or_deactivated(). I took a crack at both of these in c0b9899 - what do you think of that change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New stuff looks good to me. Policy as an argument makes sense.
| match policy { | ||
| OperatorSwitchZonePolicy::StartIfSwitchPresent => (), | ||
| OperatorSwitchZonePolicy::StopDespiteSwitchPresence => { | ||
| // Disabling our switch zone is very dangerous: if our switch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Way to be thorough with this case
sled-agent/api/src/lib.rs
Outdated
| /// A debugging endpoint only used by `omdb` that allows us to test | ||
| /// restarting the switch zone without restarting sled-agent. See | ||
| /// https://github.com/oxidecomputer/omicron/issues/8480 for context. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably clear to you after the reconciliation work, but IMO it's worth calling out:
The addition or removal of a switch zone is asynchronous with respect to this method. Specifically, it's possible to call this API with the "disable" option while a tofino driver is loading, and a new switch zone could even be initialized after this function returns.
| ); | ||
| self.handle_hardware_update(update).await; | ||
| } | ||
| Ok(()) = self.switch_zone_policy_rx.changed() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to my comment above on the API) this arm is not biased to happen before e.g., the hardware_rx re-loading, or the service manager being ready.
I think that's fine, but it might result in the following sequence:
- See no switch zone
- Call the endpoint to disable the switch zone
- A new switch zone shows up
- (later) the switch zone gets disabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that sequence being possible and I also think it's fine. I added a note to the API doc comment about this being async and racy if called while the switch zone is starting or stopping.
…ed_or_deactivated()
This is to assist with debugging and development around #8480; specifically, it's item 2 on the initial plan:
This PR adds
aandbso we can docon a racklette.I'll put some testing notes in a comment. The diff stat is silly; this makes a sled-agent API change so we get a new 8000+ line OpenAPI doc. The actual changes here are small (but in critical parts of sled-agent - please review carefully!).