-
Couldn't load subscription status.
- Fork 60
Description
In testing #8959 on dublin, I followed this sequence on scrimlet14:
- starting state: switch zone is up and healthy as normal
- I put the associated sidecar in A2 (at this point dendrite went into maintenance inside the switch zone)
- I used the new
omdbcommand (from 8959) to set the switch zone policy to "disabled" - the
HardwareMonitortask responded to this policy change and shut down the switch zone - I used the new
omdbcommand (from 8959) to set the switch zone policy to "enabled" - the
HardwareMonitortask responded to this policy change and started the switch zone; however, the sidecar was still in A2 so dendrite immediately went into maintenance, and the switch zone was in general very unhealthy (as expected for a sidecar in A2!) - I used the new
omdbcommand (from 8959) to set the switch zone policy to "disabled" - the
HardwareMonitortask did NOT respond to this policy change; that's the subject of this issue
The HardareMonitor task follows the typical pattern where it's a spawned tokio task that receives incoming messages and acts on them. In the case where it attempts to start a switch zone but that switch zone is not healthy, it gets stuck waiting for the switch zone to become healthy, and no longer processes other incoming messages. This affected my ability to control the switch zone via the new policy knobs above, but would also mean we'd fail to respond to tofino presence/absence changes or disk presence/absence changes if we got stuck starting an unhealthy switch zone.
The chain here (omitting some details and linking directly to specific branches we take in this path) is roughly:
- The
HardwareMonitorawaitsactivate_switch() activate_switch()awaitsensure_switch_zone()ensure_switch_zone()callsstart_switch_zone(), which is not asyncstart_switch_zone()is not itself async because it spawns a task that goes into an infinite loop trying to start the switch zone- At this point,
ensure_switch_zone()returnsOk(()), even if we haven't actually started the switch zone yet. We've successfully spawned the task that will try forever to start it. - ...but there's still one more thing to do back in
activate_switch(): if we have an underlay address, we await ensuring uplinks are configured - This goes into an infinite retry loop until it can get our local switch ID from MGS. In the case where the sidecar is in A2, MGS can't determine this, so we're stuck here indefinitely, and we never return control to the
HardwareMonitortask.
I think (and will attempt and test) that we can address this specific path with a couple changes:
- Move the "ensure uplinks are configured" (step 6) work into the task spawned by
start_switch_zone()(step 4), instead of doing it outside and after we spawn that task. - Change the infinite retry loop talking to MGS (step 7) to a single attempt; if this fails, we'll retry at a higher level (specifically the "start the switch zone" task, which retries until success or it's told to fail).
This should allow us to avoid getting wedged if the switch zone is so unhealthy MGS can't determine the local switch ID. But I think there's opportunity for other kinds of failures here that would require more extensive rework. For example, later in the early networking code, there are at least two other infinite retry loops where we could get stuck, and there are several cases where we don't really handle errors (other than logging them at the error! level, we proceed as though they had succeeded and will never retry them). I hope that with the small changes outlined above we handle the common case of "nothing works", but "partly working" probably needs something more along the lines of a reconciler task that understands how to retry individual pieces of applying configuration.