Description
Currently, when sled-agent receives an omicron_zones_put, it calls omicron_zones_ensure, which creates datasets and brings up zones before omicron-zones.json is persisted to the ledger. If the sled reboots before that file is persisted, it comes back up with the old omicron-zones.json and starts launching the old zones. We believe this is technically safe, because inventory collection will read the old omicron-zones.json and the blueprint executor will redeploy the new zones on its next activation.
While safe, this pattern is "backwards" from what is normally done in distributed systems: typically you persist the intended state first and then go about realizing it. That's what we'd like to do here. Inventory collection could then also report which zones are actually up, in addition to reading omicron-zones.json.
@davepacheco proposed this path forward:
- validate that we should be able to honor this request
- immediately store it. It is now the current intended state. Return immediately.
- in the background, constantly try to make sure the real state matches the intended state
- have a way to report:
  - the last generation successfully realized
  - the generation we're trying to realize
  - any transient errors blocking that
  - any persistent errors (where it's come to rest having not done this) -- hopefully these don't exist. if they do, we should identify them during validation instead.
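
The proposal above could be sketched roughly as follows. This is a minimal illustration, not sled-agent's actual code: `ZonesConfig`, `ReconcileStatus`, `ZoneRuntime`, `FakeRuntime`, `Agent`, and `converge` are all hypothetical names, the ledger is modeled as an in-memory field rather than a file on disk, and validation is reduced to rejecting an older generation. Persistent errors are not modeled, on the assumption that validation would catch them.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the contents of a zones request.
#[derive(Clone, Debug, PartialEq)]
pub struct ZonesConfig {
    pub generation: u64,
    pub zones: BTreeSet<String>,
}

/// What the agent could report about reconciliation progress.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct ReconcileStatus {
    pub last_realized: Option<u64>,
    pub target: Option<u64>,
    pub transient_error: Option<String>,
}

/// Hypothetical zone runtime; a real one would create datasets and boot zones.
pub trait ZoneRuntime {
    fn running(&self) -> BTreeSet<String>;
    fn start(&mut self, zone: &str) -> Result<(), String>;
    fn stop(&mut self, zone: &str) -> Result<(), String>;
}

/// In-memory runtime for illustration; `fail_starts` simulates a transient fault.
#[derive(Default)]
pub struct FakeRuntime {
    pub up: BTreeSet<String>,
    pub fail_starts: bool,
}

impl ZoneRuntime for FakeRuntime {
    fn running(&self) -> BTreeSet<String> {
        self.up.clone()
    }
    fn start(&mut self, zone: &str) -> Result<(), String> {
        if self.fail_starts {
            return Err(format!("could not start {zone}"));
        }
        self.up.insert(zone.to_string());
        Ok(())
    }
    fn stop(&mut self, zone: &str) -> Result<(), String> {
        self.up.remove(zone);
        Ok(())
    }
}

pub struct Agent<R: ZoneRuntime> {
    /// Stands in for the on-disk ledger: the last persisted config.
    pub ledger: Option<ZonesConfig>,
    pub runtime: R,
    pub status: ReconcileStatus,
}

impl<R: ZoneRuntime> Agent<R> {
    pub fn new(runtime: R) -> Self {
        Agent { ledger: None, runtime, status: ReconcileStatus::default() }
    }

    /// Validate, persist, and return. No zones are touched here.
    pub fn put(&mut self, config: ZonesConfig) -> Result<(), String> {
        if let Some(current) = &self.ledger {
            if config.generation < current.generation {
                return Err(format!(
                    "generation {} is older than persisted generation {}",
                    config.generation, current.generation
                ));
            }
        }
        self.status.target = Some(config.generation);
        self.ledger = Some(config); // a real agent would write the ledger to disk here
        Ok(())
    }

    /// One pass of the background loop: drive running zones toward the ledger.
    pub fn reconcile_once(&mut self) {
        let intended = match self.ledger.clone() {
            Some(config) => config,
            None => return,
        };
        match converge(&mut self.runtime, &intended) {
            Ok(()) => {
                self.status.last_realized = Some(intended.generation);
                self.status.transient_error = None;
            }
            Err(e) => self.status.transient_error = Some(e),
        }
    }
}

/// Stop zones that are not in the intended set, then start the missing ones.
fn converge<R: ZoneRuntime>(runtime: &mut R, intended: &ZonesConfig) -> Result<(), String> {
    let running = runtime.running();
    for zone in running.difference(&intended.zones) {
        runtime.stop(zone)?;
    }
    for zone in intended.zones.difference(&running) {
        runtime.start(zone)?;
    }
    Ok(())
}
```

In this sketch a reboot between `put` and `reconcile_once` would be harmless, because the intended state is already in the ledger and the next pass picks it up. A failed pass only records a transient error and leaves `last_realized` untouched, so a caller can tell "accepted but not yet realized" apart from "realized".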