[sled-agent] PUT /omicron-zones holds a lock while initializing new zones #7379

@jgallagher

In the PUT /omicron-zones handler, sled-agent takes a lock on its in-memory map of zones:

let mut existing_zones = self.inner.zones.lock().await;

This lock is held throughout the call to ensure_all_omicron_zones(), which both tears down zones that should no longer be running and starts up new zones that should be running.

In #7373, we saw sled-agent get stuck forever holding this lock due to a complex set of circumstances that led to it sitting in an infinite retry loop waiting for internal DNS to respond successfully. In logs, this showed up as a series of warnings:

20:09:36.420Z WARN SledAgent (ServiceManager): Failed to look up switch zone locations
    error = Error resolving dendrite services in internal DNS: no record found for Query { name: Name("_dendrite._tcp.control-plane.oxide.internal."), query_type: SRV, query_class: IN }
    file = sled-agent/src/bootstrap/early_networking.rs:233
    retry_after = 17.59195738s

Other issues track the set of circumstances leading to the internal DNS failures there. This issue is for "we shouldn't hold a lock that blocks other requests across a series of potentially-very-complex operations". This same zone map is read when reporting inventory, so while a PUT /omicron-zones is operating with the lock, any inventory requests have to block.
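One way to picture the narrower locking this issue asks for is sketched below. It is a simplified illustration only: it uses `std::sync::Mutex` and a blocking closure rather than sled-agent's async lock, and `ZoneMap`, `ZoneState`, `inventory`, and `start_zone` are hypothetical names that do not come from the sled-agent code. The lock is taken briefly to record intent, released during the slow initialization, and taken again to publish the result, so an inventory-style read never waits on zone startup.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

// Hypothetical stand-ins for illustration; not sled-agent's real types.
#[derive(Clone, Debug, PartialEq)]
enum ZoneState {
    Starting,
    Running,
}

#[derive(Default)]
struct ZoneMap {
    zones: Mutex<BTreeMap<String, ZoneState>>,
}

impl ZoneMap {
    /// Inventory-style read: copy the map under the lock and return at once.
    fn inventory(&self) -> BTreeMap<String, ZoneState> {
        self.zones.lock().unwrap().clone()
    }

    /// Start a zone without holding the lock across the slow step.
    fn start_zone(&self, name: &str, slow_init: impl FnOnce()) {
        // Short critical section: record that the zone is being started.
        self.zones
            .lock()
            .unwrap()
            .insert(name.to_string(), ZoneState::Starting);

        // Slow, possibly retrying work (e.g. waiting on DNS). The guard
        // above was a temporary and has already been dropped here.
        slow_init();

        // Short critical section: publish the result.
        self.zones
            .lock()
            .unwrap()
            .insert(name.to_string(), ZoneState::Running);
    }
}
```

With this shape, a reader that calls `inventory()` while `slow_init` is stuck retrying would see the zone as `Starting` instead of blocking. The cost is that the map can now expose intermediate states, which callers would have to tolerate.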

Fixing this is probably nontrivial. One idea we've bounced around is that sled-agent should accept its new config and return immediately, then asynchronously try to realize that config. (There are nuances here that would need to be worked out, like how this request can fail and what that would mean, and whether any callers have expectations for what a successful response means.)
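The "accept and return, then realize asynchronously" idea could look roughly like the following. This is a sketch under stated assumptions, not a proposed implementation: `Reconciler`, `ZonesConfig`, `PutError`, and the use of a `generation` number to reject stale requests are all hypothetical and are not taken from sled-agent, and `std::sync::Mutex` stands in for whatever async primitives the real code would use. The handler side only validates and records the desired config; a separate pass (driven by a background task in practice) snapshots it, does the slow work with no lock held, and records which generation was actually realized.

```rust
use std::sync::Mutex;

// Hypothetical types for illustration; not sled-agent's actual API.
#[derive(Clone, Debug, PartialEq)]
struct ZonesConfig {
    generation: u64,
    zones: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum PutError {
    /// The request is older than a config that was already accepted.
    StaleGeneration { requested: u64, current: u64 },
}

#[derive(Default)]
struct Reconciler {
    desired: Mutex<Option<ZonesConfig>>,
    realized_generation: Mutex<Option<u64>>,
}

impl Reconciler {
    /// Handler side: validate, record, return. No zone work happens here.
    fn put_config(&self, new: ZonesConfig) -> Result<(), PutError> {
        let mut desired = self.desired.lock().unwrap();
        if let Some(cur) = desired.as_ref() {
            if new.generation < cur.generation {
                return Err(PutError::StaleGeneration {
                    requested: new.generation,
                    current: cur.generation,
                });
            }
        }
        *desired = Some(new);
        Ok(())
    }

    /// Background side: one reconciliation pass over the desired config.
    /// `apply` stands in for tearing down and starting zones.
    fn reconcile_once(
        &self,
        apply: impl FnOnce(&ZonesConfig) -> Result<(), String>,
    ) -> Result<(), String> {
        // Snapshot under the lock; the guard is dropped at the end of
        // this statement, before any slow work begins.
        let snapshot = self.desired.lock().unwrap().clone();
        let config = match snapshot {
            Some(config) => config,
            None => return Ok(()), // nothing requested yet
        };
        apply(&config)?;
        *self.realized_generation.lock().unwrap() = Some(config.generation);
        Ok(())
    }

    /// What a caller could poll to learn whether its config took effect.
    fn realized_generation(&self) -> Option<u64> {
        *self.realized_generation.lock().unwrap()
    }
}
```

In this sketch a successful `put_config` means only "the config was accepted", not "the zones are running". That is exactly the change in contract the nuances above refer to, and callers that currently rely on a successful response implying running zones would need some equivalent of `realized_generation()` to wait on.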
