Description
Right now, a failure at many blueprint execution steps causes all of execution to stop. That feels like the conservative thing to do, but it isn't necessarily. When fixing a broken system requires Reconfigurator to proceed despite some failures, this choice is problematic. This came up in a colo incident a few weeks ago where we wanted to use Reconfigurator to repair the system by deploying a second Oximeter zone. In trying to do so, execution failed because it couldn't deploy datasets to a sled agent that was known to be broken (and unrelated to the problem we were trying to repair).
There's no reason that failure to deploy datasets to one sled should prevent Reconfigurator from trying to deploy zones to a different sled. Arguably, it's even wrong to say that failing to deploy datasets to a sled should prevent us from trying to deploy zones to that same sled. After all, it's entirely possible that we successfully deployed the datasets during a previous execution, or that there are no new datasets to be created there. We could instead leave it to the sled agent to check this dependency, failing the zones request if it's missing some dataset that needs to be there.
In today's update call I proposed a pattern where:
- we build the individual execution operations such that when they have strict dependencies like this, the request fails cleanly when the dependency is not satisfied (i.e., `PUT /omicron-zones` should fail if the body refers to datasets that haven't been configured).
- having done that, it's generally safe to call these operations even if their dependencies haven't succeeded; this allows Reconfigurator execution to always try to do these operations rather than skipping them if some dependent operation seems to have failed
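
To make the pattern concrete, here is a minimal sketch in Rust. It is not Omicron code: `SledAgent`, `SledPlan`, `ZoneRequest`, `StepError`, and `execute_all` are hypothetical names invented for illustration, and the real sled agent API and executor are shaped differently. It only shows the two halves of the idea: the zones request validates its own dataset dependency and fails without applying anything, and the executor attempts every step on every sled, collecting errors instead of stopping at the first one.

```rust
use std::collections::BTreeSet;

/// Hypothetical error type for one execution step.
#[derive(Debug, PartialEq)]
enum StepError {
    SledUnreachable,
    MissingDataset(String),
}

/// Hypothetical zone request that depends on one dataset.
struct ZoneRequest {
    name: String,
    dataset: String,
}

/// Hypothetical per-sled slice of a blueprint.
struct SledPlan {
    datasets: Vec<String>,
    zones: Vec<ZoneRequest>,
}

/// Hypothetical in-memory stand-in for a sled agent.
struct SledAgent {
    healthy: bool,
    datasets: BTreeSet<String>,
    zones: Vec<String>,
}

impl SledAgent {
    fn new(healthy: bool) -> Self {
        SledAgent { healthy, datasets: BTreeSet::new(), zones: Vec::new() }
    }

    fn put_datasets(&mut self, datasets: &[String]) -> Result<(), StepError> {
        if !self.healthy {
            return Err(StepError::SledUnreachable);
        }
        self.datasets.extend(datasets.iter().cloned());
        Ok(())
    }

    /// Checks every dataset dependency before changing any state, so a
    /// request with an unmet dependency fails cleanly.
    fn put_zones(&mut self, zones: &[ZoneRequest]) -> Result<(), StepError> {
        if !self.healthy {
            return Err(StepError::SledUnreachable);
        }
        for zone in zones {
            if !self.datasets.contains(&zone.dataset) {
                return Err(StepError::MissingDataset(zone.dataset.clone()));
            }
        }
        self.zones = zones.iter().map(|z| z.name.clone()).collect();
        Ok(())
    }
}

/// Attempts every step on every sled. Failures are recorded as
/// (sled index, step name, error) and never short-circuit later steps.
fn execute_all(
    sleds: &mut [SledAgent],
    plans: &[SledPlan],
) -> Vec<(usize, &'static str, StepError)> {
    let mut errors = Vec::new();
    for (i, (sled, plan)) in sleds.iter_mut().zip(plans).enumerate() {
        if let Err(e) = sled.put_datasets(&plan.datasets) {
            errors.push((i, "datasets", e));
        }
        // Attempted even if the dataset step failed: the datasets may
        // already exist from a previous execution.
        if let Err(e) = sled.put_zones(&plan.zones) {
            errors.push((i, "zones", e));
        }
    }
    errors
}
```

In this sketch, a broken sled contributes errors to the returned list but does not stop a healthy sled from receiving its zones, which is the behavior that would have helped in the incident described above.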
The conclusion was that this is directionally a good pattern, but it's not clear how unsafe it would be to just ignore these dependencies today. So the proposal here is:
- build new operations using this pattern where possible
- for existing operations, evaluate them to determine if we can do things this way and convert them where possible