@jgallagher
Contributor

This is to assist with debugging and development around #8480; specifically, it's item 2 on the initial plan:

De-risk the existing sled agent mechanism for tearing down the switch zone in this case.
a. add an internal API for it
b. add an omdb command to invoke that API
c. test this a bunch by putting the Sidecar into A2 and/or using ignition to power it off and then using the omdb command to trigger the response that we want

This PR adds (a) and (b) so we can do (c) on a racklette.

I'll put some testing notes in a comment. The diff stat is silly; this makes a sled-agent API change so we get a new 8000+ line OpenAPI doc. The actual changes here are small (but in critical parts of sled-agent - please review carefully!).

@jgallagher
Contributor Author

We cannot shut down the switch zone from itself:

# omdb -w sled-agent --sled-agent-url http://[fd00:1122:3344:103::1]:12345 switch-zone-policy danger-danger-disable
Error: setting policy

Caused by:
    Error Response: status: 400 Bad Request; headers: {"content-type": "application/json", "x-request-id": "24320a26-5e72-49bb-b86c-fc13365abe24", "content-length": "147", "date": "Fri, 29 Aug 2025 19:44:23 GMT"}; value: Error { error_code: None, message: "requests to disable the switch zone must come from the other switch zone", request_id: "24320a26-5e72-49bb-b86c-fc13365abe24" }

But we can shut down the other switch zone, and restart it. From switch zone 1:

# pilot host exec -c 'zoneadm list | grep switch || echo "no switch zone"' 14
14  BRM42220026        ok: oxz_switch
# omdb -w sled-agent --sled-agent-url http://[fd00:1122:3344:101::1]:12345 switch-zone-policy danger-danger-disable
switch zone DISABLED and will not start even if a switch is present
# pilot host exec -c 'zoneadm list | grep switch || echo "no switch zone"' 14
14  BRM42220026        ok: no switch zone
# omdb -w sled-agent --sled-agent-url http://[fd00:1122:3344:101::1]:12345 switch-zone-policy enable
switch zone will start if a switch is present (default)

... wait a bit ... (didn't measure, but something like 15-30 seconds?)

# pilot host exec -c 'zoneadm list | grep switch || echo "no switch zone"' 14
14  BRM42220026        ok: oxz_switch
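
For anyone reading along, here is a rough sketch of the kind of source check that would produce the 400 above. This is not the code in this PR: the function names and the rule itself (reject when the caller shares the sled's /64 prefix) are assumptions for illustration, and the `::2` peer addresses are made up.

```rust
use std::net::Ipv6Addr;

/// Upper 64 bits of an IPv6 address, i.e. its /64 network prefix.
fn prefix64(addr: Ipv6Addr) -> u64 {
    (u128::from(addr) >> 64) as u64
}

/// Hypothetical guard: refuse a "disable" request whose peer address is on
/// the same /64 as this sled-agent, on the assumption that such a peer is
/// this sled's own switch zone.
pub fn check_disable_request(
    sled_agent_addr: Ipv6Addr,
    peer_addr: Ipv6Addr,
) -> Result<(), String> {
    if prefix64(sled_agent_addr) == prefix64(peer_addr) {
        Err("requests to disable the switch zone must come from the other switch zone"
            .to_string())
    } else {
        Ok(())
    }
}
```

Under that assumption, a caller at `fd00:1122:3344:103::2` asking the sled-agent at `fd00:1122:3344:103::1` is rejected, while a caller on the `fd00:1122:3344:101::/64` subnet is allowed through.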

{
error!(self.log, "Failed to activate switch"; e);
}
async fn set_tofino_loaded(&mut self, tofino_loaded: bool) {
Collaborator

Nit: don't love this name; set_... to me usually implies a setter, which this is, but it's doing much more than that.

Maybe set_tofino_loaded_and_update_switch_zone

Contributor Author

Yeah, good call. Looking at this again, I also don't love that the borrow_and_update() read of the policy is kinda hidden inside ensure_switch_zone_activated_or_deactivated(). I took a crack at both of these in c0b9899 - what do you think of that change?

Collaborator

New stuff looks good to me. Policy as an argument makes sense.
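
A minimal sketch of what "policy as an argument" can look like, for readers who have not opened the diff. It is not the code from c0b9899: only the `OperatorSwitchZonePolicy` variant names come from this PR, while `SwitchZoneAction` and `desired_switch_zone_action` are hypothetical names, and the rule that the zone runs exactly when the tofino is loaded is a simplification.

```rust
/// Operator-set policy; the variant names match the diff context in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperatorSwitchZonePolicy {
    StartIfSwitchPresent,
    StopDespiteSwitchPresence,
}

/// Hypothetical: what the caller should do with the switch zone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SwitchZoneAction {
    Activate,
    Deactivate,
}

/// Hypothetical pure helper. The policy is passed in explicitly, so the
/// caller decides when to read it instead of the read being hidden inside
/// the function that acts on it.
pub fn desired_switch_zone_action(
    tofino_loaded: bool,
    policy: OperatorSwitchZonePolicy,
) -> SwitchZoneAction {
    match (tofino_loaded, policy) {
        // Switch present and the operator has not intervened: run the zone.
        (true, OperatorSwitchZonePolicy::StartIfSwitchPresent) => {
            SwitchZoneAction::Activate
        }
        // Operator override wins even though a switch is present.
        (true, OperatorSwitchZonePolicy::StopDespiteSwitchPresence) => {
            SwitchZoneAction::Deactivate
        }
        // No switch: nothing to run, whatever the policy says.
        (false, _) => SwitchZoneAction::Deactivate,
    }
}
```

Keeping the decision a pure function of its two inputs also makes each combination easy to unit test without standing up a sled-agent.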

match policy {
OperatorSwitchZonePolicy::StartIfSwitchPresent => (),
OperatorSwitchZonePolicy::StopDespiteSwitchPresence => {
// Disabling our switch zone is very dangerous: if our switch
Collaborator

Way to be thorough with this case

Comment on lines 746 to 748
/// A debugging endpoint only used by `omdb` that allows us to test
/// restarting the switch zone without restarting sled-agent. See
/// https://github.com/oxidecomputer/omicron/issues/8480 for context.
Collaborator

This is probably clear to you after the reconciliation work, but IMO it's worth calling out:

The addition or removal of a switch zone is asynchronous with respect to this method. Specifically, it's possible to call this API with the "disable" option while a tofino driver is loading, and a new switch zone could even be initialized after this function returns.

);
self.handle_hardware_update(update).await;
}
Ok(()) = self.switch_zone_policy_rx.changed() => {
Collaborator

(Related to my comment above on the API.) This arm is not biased to happen before, e.g., the hardware_rx re-loading or the service manager being ready.

I think that's fine, but it might result in the following sequence:

  • See no switch zone
  • Call the endpoint to disable the switch zone
  • A new switch zone shows up
  • (later) the switch zone gets disabled

Contributor Author

@jgallagher Sep 2, 2025

I agree with that sequence being possible and I also think it's fine. I added a note to the API doc comment about this being async and racy if called while the switch zone is starting or stopping.
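
To make the ordering concrete, here is a toy model of the sequence discussed in this thread. It is a sketch, not sled-agent's loop: `Policy`, `Event`, and `replay` are hypothetical names, the real loop uses watch channels and a select rather than a slice of events, and "zone running" is reduced to a single boolean.

```rust
/// Simplified stand-in for the operator policy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Policy {
    StartIfSwitchPresent,
    StopDespiteSwitchPresence,
}

/// Hypothetical events, in the order the loop happens to handle them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    TofinoLoaded(bool),
    PolicyChanged(Policy),
}

/// Replays events one at a time and records whether the switch zone would
/// be running after each one. The loop only acts on a policy change once it
/// reaches that event, even if the API call that set it returned earlier.
pub fn replay(events: &[Event]) -> Vec<bool> {
    let mut tofino_loaded = false;
    let mut policy = Policy::StartIfSwitchPresent;
    let mut history = Vec::with_capacity(events.len());
    for event in events {
        match *event {
            Event::TofinoLoaded(loaded) => tofino_loaded = loaded,
            Event::PolicyChanged(new_policy) => policy = new_policy,
        }
        history.push(tofino_loaded && policy == Policy::StartIfSwitchPresent);
    }
    history
}
```

If the hardware update is handled first, `replay(&[Event::TofinoLoaded(true), Event::PolicyChanged(Policy::StopDespiteSwitchPresence)])` gives `[true, false]`: the zone shows up and is disabled afterwards. With the two events swapped, the result is `[false, false]` and the zone never starts.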

@jgallagher merged commit 4f51537 into main Sep 2, 2025
16 checks passed
@jgallagher deleted the john/omdb-switch-zone-control branch September 2, 2025 20:58
3 participants