I haven't verified this, but after talking with @smklein, we believe that if a sled reboots, any customer Instances that were running on that sled will no longer be running (not there, nor anywhere else). The API state, however, will probably still report them as running. It's not clear whether there'd be any way to get them running again.
Part of the design here was that the sled_agent_put() call from the Sled Agent to Nexus would be an opportunity for Nexus to verify that the expected Instances were still running. In practice, this probably needs to trigger an RFD 373-style RPW that determines what's supposed to be on each sled, what's actually running on each sled, and fixes things appropriately. It might be cleanest to factor that into two RPWs (sketched in code below):
- first RPW: check what's supposed to be running on each sled, check what's actually running there, and for any discrepancy, mark the affected Instance failed (or something like that)
- second RPW: for each failed Instance, try to start it elsewhere
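
To make the split concrete, here's a minimal sketch of the two RPWs. Everything in it is hypothetical: the types (`SledId`, `InstanceId`, `InstanceState`) and functions are illustrative stand-ins, not the actual Nexus datastore or saga machinery, and a real implementation would read and write database state rather than in-memory maps.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for the real control-plane types.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct InstanceId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct SledId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum InstanceState {
    Running,
    Failed,
}

/// RPW 1: compare what the control plane believes is running on a sled
/// against what the sled actually reports, and mark anything missing Failed.
fn reconcile_sled(
    believed_running: &HashSet<InstanceId>,
    actually_running: &HashSet<InstanceId>,
    states: &mut HashMap<InstanceId, InstanceState>,
) {
    for instance in believed_running.difference(actually_running) {
        // Discrepancy: the database says Running but the sled disagrees.
        states.insert(*instance, InstanceState::Failed);
    }
}

/// RPW 2: for each Failed instance, try to start it on some other sled.
fn restart_failed(
    states: &mut HashMap<InstanceId, InstanceState>,
    candidate_sleds: &[SledId],
) {
    for (instance, state) in states.iter_mut() {
        if *state == InstanceState::Failed {
            if let Some(sled) = candidate_sleds.first() {
                // A real RPW would kick off an instance-start saga here;
                // this just shows the shape of the loop.
                println!("restarting {instance:?} on {sled:?}");
                *state = InstanceState::Running;
            }
        }
    }
}

fn main() {
    let believed: HashSet<_> = [InstanceId(1), InstanceId(2)].into();
    let actual: HashSet<_> = [InstanceId(1)].into();
    let mut states: HashMap<_, _> = believed
        .iter()
        .map(|id| (*id, InstanceState::Running))
        .collect();

    reconcile_sled(&believed, &actual, &mut states);
    restart_failed(&mut states, &[SledId(7)]);
}
```

One reason to keep the loops separate: the detect-and-mark step stays cheap and idempotent, while the restart step can go through the normal instance-start path with its own retry behavior.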
There's a related issue here around sleds that have failed more permanently. I'd suggest we treat this as a different kind of event and not try to detect it automatically using a heartbeat mechanism or the like, since that kind of automation can make things worse. For this case (which really should be rare), we could require that an operator mark the sled as "permanently gone -- remove it from the cluster", after which we mark its Instances failed.
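
Continuing the hypothetical types from the sketch above, the operator-driven path might look like the following: an explicit "expunge" action marks the sled's Instances failed, and the restart RPW then handles them like any other failure. Again, the function and its arguments are illustrative, not a real Nexus API.

```rust
/// Hypothetical operator-driven path for a permanently failed sled: no
/// heartbeat-based auto-detection. An operator explicitly expunges the
/// sled; only then are its Instances marked Failed, after which the
/// restart RPW above picks them up like any other failure.
fn expunge_sled(
    sled: SledId,
    instances_by_sled: &HashMap<SledId, Vec<InstanceId>>,
    states: &mut HashMap<InstanceId, InstanceState>,
) {
    if let Some(instances) = instances_by_sled.get(&sled) {
        for instance in instances {
            states.insert(*instance, InstanceState::Failed);
        }
    }
}
```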