sled-agent: `HardwareMonitor` task gets wedged if it tries to start up an unhealthy switch zone

In testing #8959 on dublin, I followed this sequence on scrimlet14:

* starting state: switch zone is up and healthy as normal
* I put the associated sidecar in A2 (at this point dendrite went into maintenance inside the switch zone)
* I used the new `omdb` command (from 8959) to set the switch zone policy to "disabled"
* the `HardwareMonitor` task responded to this policy change and shut down the switch zone
* I used the new `omdb` command (from 8959) to set the switch zone policy to "enabled"
* the `HardwareMonitor` task responded to this policy change and started the switch zone; however, the sidecar was still in A2 so dendrite immediately went into maintenance, and the switch zone was in general very unhealthy (as expected for a sidecar in A2!)
* I used the new `omdb` command (from 8959) to set the switch zone policy to "disabled"
* the `HardwareMonitor` task did NOT respond to this policy change; that's the subject of this issue

The `HardareMonitor` task follows the typical pattern where it's a spawned tokio task that receives incoming messages and acts on them. In the case where it attempts to start a switch zone but that switch zone is not healthy, it gets stuck waiting for the switch zone to _become_ healthy, and no longer processes other incoming messages. This affected my ability to control the switch zone via the new policy knobs above, but would also mean we'd fail to respond to tofino presence/absence changes or disk presence/absence changes if we got stuck starting an unhealthy switch zone.

The chain here (omitting some details and linking directly to specific branches we take in this path) is roughly:

1. The `HardwareMonitor` [awaits `activate_switch()`](https://github.com/oxidecomputer/omicron/blob/c7961db45126fa7f413f5230009a4c84d3474df4/sled-agent/src/hardware_monitor.rs#L209)
2. `activate_switch()` [awaits `ensure_switch_zone()`](https://github.com/oxidecomputer/omicron/blob/c7961db45126fa7f413f5230009a4c84d3474df4/sled-agent/src/services.rs#L3248-L3256)
3. `ensure_switch_zone()` [calls `start_switch_zone()`](https://github.com/oxidecomputer/omicron/blob/c7961db45126fa7f413f5230009a4c84d3474df4/sled-agent/src/services.rs#L3460-L3465), which is not async 
4. `start_switch_zone()` is not itself async because it [spawns a task](https://github.com/oxidecomputer/omicron/blob/c7961db45126fa7f413f5230009a4c84d3474df4/sled-agent/src/services.rs#L3438-L3440) that goes into an infinite loop trying to start the switch zone 
5. At this point, `ensure_switch_zone()` returns `Ok(())`, even if we haven't _actually_ started the switch zone yet. We've successfully spawned the task that will try forever to start it.
6. ...but there's still one more thing to do back in `activate_switch()`: if we have an underlay address, we await [ensuring uplinks are configured](https://github.com/oxidecomputer/omicron/blob/c7961db45126fa7f413f5230009a4c84d3474df4/sled-agent/src/services.rs#L3258-L3263)
7. This goes into an [infinite retry loop until it can get our local switch ID from MGS](https://github.com/oxidecomputer/omicron/blob/c7961db45126fa7f413f5230009a4c84d3474df4/sled-agent/src/bootstrap/early_networking.rs#L379-L397). In the case where the sidecar is in A2, MGS can't determine this, so we're stuck here indefinitely, and we never return control to the `HardwareMonitor` task.

I think (and will attempt and test) that we can address this _specific_ path with a couple changes:

1. Move the "ensure uplinks are configured" (step 6) work into the task spawned by `start_switch_zone()` (step 4), instead of doing it outside and after we spawn that task.
2. Change the infinite retry loop talking to MGS (step 7) to a single attempt; if this fails, we'll retry at a higher level (specifically the "start the switch zone" task, which retries until success or it's told to fail).

This should allow us to avoid getting wedged if the switch zone is so unhealthy MGS can't determine the local switch ID. But I think there's opportunity for other kinds of failures here that would require more extensive rework. For example, later in the early networking code, there are at least two other infinite retry loops where we could get stuck, and there are several cases where we don't really handle errors (other than logging them at the `error!` level, we proceed as though they had succeeded and will never retry them). I hope that with the small changes outlined above we handle the common case of "nothing works", but "partly working" probably needs something more along the lines of a reconciler task that understands how to retry individual pieces of applying configuration.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

sled-agent: `HardwareMonitor` task gets wedged if it tries to start up an unhealthy switch zone #8970

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

sled-agent: HardwareMonitor task gets wedged if it tries to start up an unhealthy switch zone #8970

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

sled-agent: `HardwareMonitor` task gets wedged if it tries to start up an unhealthy switch zone #8970