many Reconfigurator execution steps are fatal that shouldn't be

Right now, failure at many blueprint execution steps causes all of execution to stop. That feels like the conservative thing to do, but it isn't necessarily. When fixing a broken system requires Reconfigurator to proceed despite some failures, this choice is problematic.

This came up in a colo incident a few weeks ago where we wanted to use Reconfigurator to repair the system by deploying a second Oximeter zone. When we tried, execution failed because it couldn't deploy datasets to a sled agent that was known to be broken (and unrelated to the problem we were trying to repair).

There's no reason that failure to deploy datasets to one sled should prevent Reconfigurator from trying to deploy zones to a _different_ sled. Arguably, it's even wrong for a failure to deploy datasets to a sled to prevent us from trying to deploy zones to the _same_ sled. After all, it's entirely possible that we successfully deployed the datasets during a previous execution, or that there are no new datasets to be created there. We could instead leave it to the sled agent to check this dependency, failing the zones request if it's missing some dataset that needs to be there.
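To make the sled-agent-side check concrete, here is a minimal sketch of what validating that dependency could look like. Everything in it is hypothetical and simplified for illustration: `ZoneRequest`, `ZoneError`, and `check_zone_request` are not real sled agent types, and datasets are identified by plain strings.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified stand-in for one zone in a zones request.
pub struct ZoneRequest {
    pub zone_name: String,
    /// Names of the datasets this zone needs (illustrative only).
    pub required_datasets: Vec<String>,
}

/// Hypothetical error type for a rejected zones request.
#[derive(Debug, PartialEq)]
pub enum ZoneError {
    MissingDatasets { zone_name: String, missing: Vec<String> },
}

/// Reject the request unless every dataset it refers to has already been
/// configured on this sled.  It doesn't matter whether the datasets were
/// deployed during this execution or an earlier one.
pub fn check_zone_request(
    configured: &BTreeSet<String>,
    request: &ZoneRequest,
) -> Result<(), ZoneError> {
    let missing: Vec<String> = request
        .required_datasets
        .iter()
        .filter(|d| !configured.contains(*d))
        .cloned()
        .collect();

    if missing.is_empty() {
        Ok(())
    } else {
        Err(ZoneError::MissingDatasets {
            zone_name: request.zone_name.clone(),
            missing,
        })
    }
}
```

With a check like this in place, the zones request fails cleanly when a dataset really is missing, and succeeds when the datasets were configured by an earlier execution.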

In today's update call I proposed a pattern where:

- we build the individual execution operations such that when they have strict dependencies like this, the request fails cleanly when the dependency is not satisfied (e.g., `PUT /omicron-zones` should fail if the body refers to datasets that haven't been configured)
- having done that, it's generally safe to call these operations even if their dependencies haven't succeeded; this allows Reconfigurator execution to _always_ try to do these operations rather than skipping them if some operation they depend on seems to have failed
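On the execution side, the pattern amounts to attempting every step and collecting failures, rather than returning at the first error. The sketch below is an assumption about shape only: `Step`, `StepFailure`, and `execute_all` are hypothetical names, not Reconfigurator's actual executor, and steps are modeled as simple closures.

```rust
/// Hypothetical record of one failed execution step.
#[derive(Debug, PartialEq)]
pub struct StepFailure {
    pub step: &'static str,
    pub message: String,
}

/// Hypothetical execution step: a name plus the operation to run.
pub struct Step<'a> {
    pub name: &'static str,
    pub run: Box<dyn FnMut() -> Result<(), String> + 'a>,
}

impl<'a> Step<'a> {
    pub fn new(
        name: &'static str,
        run: impl FnMut() -> Result<(), String> + 'a,
    ) -> Self {
        Step { name, run: Box::new(run) }
    }
}

/// Run every step in order.  A failure is recorded but does not stop
/// later steps from being attempted; each operation is trusted to fail
/// cleanly on its own if a dependency is really unsatisfied.
pub fn execute_all(steps: Vec<Step<'_>>) -> Vec<StepFailure> {
    let mut failures = Vec::new();
    for mut step in steps {
        if let Err(message) = (step.run)() {
            failures.push(StepFailure { step: step.name, message });
        }
    }
    failures
}
```

In the incident described above, this would mean a "deploy datasets" step failing against the broken sled gets reported, while the "deploy zones" step is still attempted. The caller can then decide how to surface the collected failures.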

The conclusion was that this is directionally a good pattern, but it's not clear how unsafe it would be to just ignore these dependencies today.  So the proposal here is:

- build new operations using this pattern where possible
- for existing operations, evaluate them to determine if we _can_ do things this way and convert them where possible

many Reconfigurator execution steps are fatal that shouldn't be #6999