Description
In the PUT /omicron-zones handler, sled-agent takes a lock on its in-memory map of zones:
omicron/sled-agent/src/services.rs, line 3393 in 4285881:
    let mut existing_zones = self.inner.zones.lock().await;
This lock is held throughout the call to ensure_all_omicron_zones(), which both tears down zones that should no longer be running and starts up new zones that should be running.
In #7373, we saw sled-agent get stuck forever holding this lock due to a complex set of circumstances that led to it sitting in an infinite retry loop waiting for internal DNS to respond successfully. In logs, this showed up as a series of warnings:
20:09:36.420Z WARN SledAgent (ServiceManager): Failed to look up switch zone locations
error = Error resolving dendrite services in internal DNS: no record found for Query { name: Name("_dendrite._tcp.control-plane.oxide.internal."), query_type: SRV, query_class: IN }
file = sled-agent/src/bootstrap/early_networking.rs:233
retry_after = 17.59195738s
Other issues track the set of circumstances leading to the internal DNS failures there. This issue is for "we shouldn't hold a lock that blocks other requests across a series of potentially-very-complex operations". This same zone map is read when reporting inventory, so while a PUT /omicron-zones request holds the lock, any inventory request has to block until it is released.
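The blocking behavior can be reproduced in miniature. The following is a minimal sketch, not sled-agent code: `ZoneMap` and `inventory_blocked_during_put` are hypothetical names, a plain `std::sync::Mutex` and OS threads stand in for the async lock, and a channel `recv()` stands in for the slow work (such as the DNS retry loop) done while the lock is held.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for sled-agent's in-memory map of zones.
type ZoneMap = BTreeMap<String, String>;

/// Returns a pair of booleans:
/// (could inventory take the lock while the PUT was in flight,
///  could inventory take the lock after the PUT finished).
fn inventory_blocked_during_put() -> (bool, bool) {
    let zones: Arc<Mutex<ZoneMap>> = Arc::new(Mutex::new(ZoneMap::new()));
    let (holding_tx, holding_rx) = mpsc::channel::<()>();
    let (finish_tx, finish_rx) = mpsc::channel::<()>();

    // Simulated PUT handler: takes the lock, then does "slow work" while
    // still holding it.
    let put_zones = Arc::clone(&zones);
    let put = thread::spawn(move || {
        let mut existing_zones = put_zones.lock().unwrap();
        holding_tx.send(()).unwrap();
        // Stand-in for the long-running work; here it simply waits until
        // the main thread says to finish.
        finish_rx.recv().unwrap();
        existing_zones.insert("zone-a".to_string(), "running".to_string());
        // The lock is released only here, when the guard is dropped.
    });

    // Simulated inventory request arriving while the PUT is in flight.
    holding_rx.recv().unwrap();
    let during = zones.try_lock().is_ok();

    // Let the PUT finish, then try again.
    finish_tx.send(()).unwrap();
    put.join().unwrap();
    let after = zones.try_lock().is_ok();

    (during, after)
}
```

Calling `inventory_blocked_during_put()` yields `(false, true)`: the lock is unavailable for as long as the simulated PUT is working and becomes available only once it completes. If the slow work never finishes, as in #7373, neither does any waiting inventory request.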
Fixing this is probably nontrivial. One idea we've bounced around is that sled-agent should accept its new config and return immediately, then asynchronously try to realize that config. (There are nuances here that would need to be worked out, like how this request can fail and what that would mean, and whether any callers have expectations for what a successful response means.)
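One possible shape for the "accept and return immediately" idea is sketched below. This is an illustration under stated assumptions rather than a proposed implementation: `Agent`, `ZonesConfig`, `Inventory`, `put_zones`, and `realize` are all hypothetical names, a standard-library channel and thread stand in for whatever async machinery sled-agent would actually use, and the open questions about failure reporting are not addressed.

```rust
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

// Hypothetical desired-state config sent with PUT /omicron-zones.
#[derive(Clone, Debug, PartialEq)]
struct ZonesConfig {
    generation: u64,
    zones: Vec<String>,
}

// Hypothetical snapshot returned to inventory requests.
#[derive(Clone, Debug, Default, PartialEq)]
struct Inventory {
    realized_generation: Option<u64>,
    running: Vec<String>,
}

struct Agent {
    tx: Sender<ZonesConfig>,
    state: Arc<Mutex<Inventory>>,
}

impl Agent {
    fn start() -> (Agent, JoinHandle<()>) {
        let (tx, rx) = mpsc::channel::<ZonesConfig>();
        let state = Arc::new(Mutex::new(Inventory::default()));
        let worker_state = Arc::clone(&state);
        // Background reconciler: realizes each accepted config in order.
        let worker = thread::spawn(move || {
            for config in rx {
                // The slow work runs with no lock held.
                let running = realize(&config);
                // The lock is taken only briefly, to publish the result.
                let mut inv = worker_state.lock().unwrap();
                inv.realized_generation = Some(config.generation);
                inv.running = running;
            }
        });
        (Agent { tx, state }, worker)
    }

    // Accepts the config and returns without waiting for it to be realized.
    fn put_zones(&self, config: ZonesConfig) -> Result<(), String> {
        self.tx.send(config).map_err(|e| e.to_string())
    }

    // Never waits on zone setup or teardown, only on the brief publish step.
    fn inventory(&self) -> Inventory {
        self.state.lock().unwrap().clone()
    }
}

// Stand-in for tearing down and starting zones; it could block for a long
// time without affecting `inventory()`.
fn realize(config: &ZonesConfig) -> Vec<String> {
    config.zones.clone()
}
```

In this sketch a successful `put_zones` means only "config accepted", not "zones running", which is exactly the change in semantics that callers would need to tolerate. A caller wanting to know whether the config took effect would compare `realized_generation` from a later inventory against the generation it sent.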