sled-agent: HardwareMonitor task gets wedged if it tries to start up an unhealthy switch zone #8970

@jgallagher

In testing #8959 on dublin, I followed this sequence on scrimlet14:

  • starting state: switch zone is up and healthy as normal
  • I put the associated sidecar in A2 (at this point dendrite went into maintenance inside the switch zone)
  • I used the new omdb command (from 8959) to set the switch zone policy to "disabled"
  • the HardwareMonitor task responded to this policy change and shut down the switch zone
  • I used the new omdb command (from 8959) to set the switch zone policy to "enabled"
  • the HardwareMonitor task responded to this policy change and started the switch zone; however, the sidecar was still in A2 so dendrite immediately went into maintenance, and the switch zone was in general very unhealthy (as expected for a sidecar in A2!)
  • I used the new omdb command (from 8959) to set the switch zone policy to "disabled"
  • the HardwareMonitor task did NOT respond to this policy change; that's the subject of this issue

The HardwareMonitor task follows the typical pattern where it's a spawned tokio task that receives incoming messages and acts on them. In the case where it attempts to start a switch zone but that switch zone is not healthy, it gets stuck waiting for the switch zone to become healthy, and no longer processes other incoming messages. This affected my ability to control the switch zone via the new policy knobs above, but would also mean we'd fail to respond to tofino presence/absence changes or disk presence/absence changes if we got stuck starting an unhealthy switch zone.

The chain here (omitting some details and linking directly to specific branches we take in this path) is roughly:

  1. The HardwareMonitor awaits activate_switch()
  2. activate_switch() awaits ensure_switch_zone()
  3. ensure_switch_zone() calls start_switch_zone(), which is not async
  4. start_switch_zone() is not itself async because it spawns a task that goes into an infinite loop trying to start the switch zone
  5. At this point, ensure_switch_zone() returns Ok(()), even if we haven't actually started the switch zone yet. We've successfully spawned the task that will try forever to start it.
  6. ...but there's still one more thing to do back in activate_switch(): if we have an underlay address, we await ensuring uplinks are configured
  7. This goes into an infinite retry loop until it can get our local switch ID from MGS. In the case where the sidecar is in A2, MGS can't determine this, so we're stuck here indefinitely, and we never return control to the HardwareMonitor task.
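
To make the shape of the problem concrete, here is a minimal sketch of the same pattern. It is not sled-agent code: it uses std threads and channels as a stand-in for the tokio task and its message channel, and every name in it (`PolicyMsg`, `Event`, `ensure_uplinks_blocking`, `spawn_wedging_monitor`, the `switch_id_known` flag standing in for "MGS can determine the local switch ID") is hypothetical. The point is only that a retry loop running inline in the message handler keeps the loop from receiving the next message.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical policy messages sent to the monitor.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PolicyMsg {
    Enable,
    Disable,
}

/// Events the monitor reports so a caller can observe its progress.
#[derive(Debug, PartialEq)]
pub enum Event {
    EnableStarted,
    EnableFinished,
    DisableHandled,
}

/// Stand-in for the retry loop that asks MGS for the local switch ID:
/// it does not return until the flag is set.
fn ensure_uplinks_blocking(switch_id_known: &AtomicBool) {
    while !switch_id_known.load(Ordering::SeqCst) {
        thread::sleep(Duration::from_millis(5));
    }
}

/// Spawn a monitor whose message handler runs the retry loop inline.
pub fn spawn_wedging_monitor(
    switch_id_known: Arc<AtomicBool>,
) -> (Sender<PolicyMsg>, Receiver<Event>, JoinHandle<()>) {
    let (msg_tx, msg_rx) = mpsc::channel();
    let (event_tx, event_rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        // This loop ends once every Sender has been dropped.
        for msg in msg_rx {
            match msg {
                PolicyMsg::Enable => {
                    let _ = event_tx.send(Event::EnableStarted);
                    // The loop cannot receive the next message until
                    // this call returns.
                    ensure_uplinks_blocking(&switch_id_known);
                    let _ = event_tx.send(Event::EnableFinished);
                }
                PolicyMsg::Disable => {
                    let _ = event_tx.send(Event::DisableHandled);
                }
            }
        }
    });
    (msg_tx, event_rx, handle)
}
```

If a caller sends `Enable` followed by `Disable` while `switch_id_known` is false, the `Disable` message sits in the channel unprocessed until the flag is set, which mirrors the last bullet of the reproduction above.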

I think (and will attempt and test) that we can address this specific path with a couple changes:

  1. Move the "ensure uplinks are configured" (step 6) work into the task spawned by start_switch_zone() (step 4), instead of doing it outside and after we spawn that task.
  2. Change the infinite retry loop talking to MGS (step 7) to a single attempt; if this fails, we'll retry at a higher level (specifically the "start the switch zone" task, which retries until success or it's told to fail).
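
A rough sketch of what those two changes could look like, again using std threads and channels as a stand-in for tokio and with entirely hypothetical names (`try_ensure_uplinks`, `StartWorker`, `spawn_start_worker`, `spawn_monitor`). The uplink step becomes a single attempt that returns an error, and the only retry loop lives in the spawned worker, which can be told to stop. This is an illustration of the idea under those assumptions, not the actual change.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical policy messages sent to the monitor.
pub enum PolicyMsg {
    Enable,
    Disable,
}

/// Events the monitor reports so a caller can observe its progress.
#[derive(Debug, PartialEq)]
pub enum Event {
    EnableHandled,
    DisableHandled,
}

/// A single attempt at the uplink step; no retry loop of its own.
fn try_ensure_uplinks(switch_id_known: &AtomicBool) -> Result<(), &'static str> {
    if switch_id_known.load(Ordering::SeqCst) {
        Ok(())
    } else {
        Err("MGS cannot determine the local switch ID")
    }
}

/// Background worker that owns all retrying and can be told to stop.
struct StartWorker {
    stop: Arc<AtomicBool>,
    handle: JoinHandle<bool>,
}

fn spawn_start_worker(switch_id_known: Arc<AtomicBool>) -> StartWorker {
    let stop = Arc::new(AtomicBool::new(false));
    let stop_in_worker = Arc::clone(&stop);
    let handle = thread::spawn(move || {
        while !stop_in_worker.load(Ordering::SeqCst) {
            // Zone startup itself would also be attempted here.
            if try_ensure_uplinks(&switch_id_known).is_ok() {
                return true; // finished successfully
            }
            thread::sleep(Duration::from_millis(5));
        }
        false // told to stop before succeeding
    });
    StartWorker { stop, handle }
}

impl StartWorker {
    /// Ask the worker to stop and wait for it to exit. Returns true only
    /// if the worker had already succeeded before it saw the stop request.
    fn shutdown(self) -> bool {
        self.stop.store(true, Ordering::SeqCst);
        self.handle.join().unwrap_or(false)
    }
}

/// Spawn a monitor whose handlers never wait on the unhealthy zone.
pub fn spawn_monitor(
    switch_id_known: Arc<AtomicBool>,
) -> (Sender<PolicyMsg>, Receiver<Event>, JoinHandle<()>) {
    let (msg_tx, msg_rx) = mpsc::channel();
    let (event_tx, event_rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut worker: Option<StartWorker> = None;
        for msg in msg_rx {
            match msg {
                PolicyMsg::Enable => {
                    if worker.is_none() {
                        worker = Some(spawn_start_worker(Arc::clone(&switch_id_known)));
                    }
                    let _ = event_tx.send(Event::EnableHandled);
                }
                PolicyMsg::Disable => {
                    if let Some(w) = worker.take() {
                        w.shutdown();
                    }
                    let _ = event_tx.send(Event::DisableHandled);
                }
            }
        }
        // Clean up if the channel closes while a worker is still running.
        if let Some(w) = worker.take() {
            w.shutdown();
        }
    });
    (msg_tx, event_rx, handle)
}
```

With this structure, sending `Enable` and then `Disable` while `switch_id_known` stays false still produces both `EnableHandled` and `DisableHandled` promptly, because the message loop only spawns or stops the worker and never runs the retry itself.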

This should allow us to avoid getting wedged if the switch zone is so unhealthy MGS can't determine the local switch ID. But I think there's opportunity for other kinds of failures here that would require more extensive rework. For example, later in the early networking code, there are at least two other infinite retry loops where we could get stuck, and there are several cases where we don't really handle errors (other than logging them at the error! level, we proceed as though they had succeeded and will never retry them). I hope that with the small changes outlined above we handle the common case of "nothing works", but "partly working" probably needs something more along the lines of a reconciler task that understands how to retry individual pieces of applying configuration.
