Open
Labels: Sled Agent (Related to the Per-Sled Configuration and Management), nexus (Related to nexus), virtualization (Propolis Integration & VM Management)
Description
- Updating Instance State Information within Nexus
- "Sled Agent registering itself with Nexus" should also transfer information about "Here are the instances the sled agent knows about". It can start as an empty set. See "restart customer Instances after sled reboot" (#3633) for a lot more detail here.
- The Sled Agent should refuse to handle instance requests until it successfully registers itself with Nexus. This would help avoid a race condition in which Nexus sends a request to a rebooting sled at the same time as the sled registers with Nexus and reports that "all instances are dead now", inadvertently marking a very new instance as failed.
- Nexus should look up all instances that should have been running on the sled and mark them failed.
- Later Nexus can use an RPW to look for instances that are marked as "failed + auto_boot_on_fault", and re-provision them in the background.
- Idea: We could plausibly update the "normal" instance provisioning workflow to rely on this RPW for provisioning, too. This would let "instance create" return much faster, and leave the work of "finding an appropriate sled and starting the instance" to a background task that could tolerate slower APIs to the backend.
- Ensuring metric registration: As part of the above RPW, one would like to also ensure that running instances have an assignment to an `oximeter` collector recorded in the `omicron.public.metric_producer` table. When instances are stopped, that assignment needs to be removed by the cleanup portion of that RPW.
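
The registration handshake described above could be sketched roughly as follows. This is an in-process illustration only, not Omicron's actual API: `Nexus`, `SledAgent`, `register_sled`, and `ensure_instance` are hypothetical names, and plain integers stand in for real instance and sled IDs. It shows the two proposed rules: the sled agent rejects instance requests until it has registered, and on registration Nexus marks as failed every instance it expected on that sled that the agent did not report.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Running,
    Failed,
}

/// Hypothetical Nexus-side view: instance ID -> (sled ID, state).
struct Nexus {
    instances: BTreeMap<u32, (u32, InstanceState)>,
}

impl Nexus {
    /// Record a sled's registration along with the set of instances it
    /// knows about. Any instance Nexus believed was running on that sled
    /// but which the sled did not report is marked failed.
    /// Returns the IDs that were newly marked failed.
    fn register_sled(&mut self, sled_id: u32, known: &BTreeSet<u32>) -> Vec<u32> {
        let mut failed = Vec::new();
        for (id, (sled, state)) in self.instances.iter_mut() {
            let expected_here = *sled == sled_id && *state == InstanceState::Running;
            if expected_here && !known.contains(id) {
                *state = InstanceState::Failed;
                failed.push(*id);
            }
        }
        failed
    }
}

/// Hypothetical sled agent that gates instance requests on registration.
struct SledAgent {
    sled_id: u32,
    registered: bool,
    instances: BTreeSet<u32>,
}

impl SledAgent {
    /// A freshly booted agent knows about no instances.
    fn new(sled_id: u32) -> Self {
        SledAgent { sled_id, registered: false, instances: BTreeSet::new() }
    }

    /// Register with Nexus, reporting the (possibly empty) instance set.
    fn register(&mut self, nexus: &mut Nexus) -> Vec<u32> {
        let failed = nexus.register_sled(self.sled_id, &self.instances);
        self.registered = true;
        failed
    }

    /// Refuse instance requests until registration has completed.
    fn ensure_instance(&mut self, id: u32) -> Result<(), String> {
        if !self.registered {
            return Err("sled agent has not registered with Nexus yet".to_string());
        }
        self.instances.insert(id);
        Ok(())
    }
}
```

Because the agent only accepts new instances after `register` returns, a request that arrives during a reboot is rejected instead of creating an instance that the registration pass would then mark failed.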
- Instances without Sleds
- We need to make it possible for Instances to not have a propolis ID / sled ID, in the case that they are stopped.
- We also need cleanup to ensure that the virtual resources consumed by an instance are released when the instance is stopped but not deleted.
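
One way to picture an instance without a sled is to make its location optional and release resources at the moment the location is cleared. The sketch below assumes hypothetical types (`VmmLocation`, `Instance`, `SledUsage`) and a hypothetical `stop_instance` helper; it does not reflect the real database schema.

```rust
/// Hypothetical pairing of the Propolis and sled IDs an instance runs on.
#[derive(Debug, Clone, PartialEq, Eq)]
struct VmmLocation {
    propolis_id: u64,
    sled_id: u64,
}

/// A stopped instance has `location == None`; a running one has `Some(..)`.
#[derive(Debug)]
struct Instance {
    ncpus: u32,
    memory_gib: u32,
    location: Option<VmmLocation>,
}

/// Hypothetical accounting of virtual resources in use on a sled.
#[derive(Debug, Default, PartialEq, Eq)]
struct SledUsage {
    cpus: u32,
    memory_gib: u32,
}

/// Stop an instance: clear its location and release the resources it held.
/// Returns true if resources were released, false if it was already stopped,
/// so calling it twice never releases the same resources twice.
fn stop_instance(instance: &mut Instance, usage: &mut SledUsage) -> bool {
    match instance.location.take() {
        Some(_) => {
            usage.cpus = usage.cpus.saturating_sub(instance.ncpus);
            usage.memory_gib = usage.memory_gib.saturating_sub(instance.memory_gib);
            true
        }
        None => false,
    }
}
```

Tying the resource release to `Option::take` keeps the two facts in step: an instance with no location holds no sled resources, whether or not the instance record itself is later deleted.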
- Handling Failed Instances
- Confirm that instances can be forcefully deleted after being marked failed
- Plumb through the sled agent API @gjcolombo mentioned to "force-stop an instance" through the public-facing API for this failed case, to ensure that the instance is truly destroyed.
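
The forced-deletion path for failed instances might look something like the sketch below. Everything here is an assumption for illustration: the `State` enum, the `SledAgentClient` trait with its `force_stop` method, and the `delete_instance` function are hypothetical stand-ins, not the sled agent API mentioned above. The point is only the ordering: a failed instance is force-stopped on its sled before it is treated as destroyed.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Running,
    Stopped,
    Failed,
    Destroyed,
}

/// Hypothetical client for the sled agent's "force-stop an instance" call.
trait SledAgentClient {
    fn force_stop(&mut self, instance_id: u64) -> Result<(), String>;
}

/// Decide whether an instance may be deleted, force-stopping it first
/// when it is in the failed state.
fn delete_instance(
    state: State,
    instance_id: u64,
    sled: &mut dyn SledAgentClient,
) -> Result<State, String> {
    match state {
        // A running instance must be stopped through the normal path first.
        State::Running => Err("instance must be stopped before deletion".to_string()),
        // Nothing is left on a sled, so deletion can proceed directly.
        State::Stopped | State::Destroyed => Ok(State::Destroyed),
        // Nexus cannot be sure a failed instance is really gone, so ask the
        // sled agent to tear it down before declaring it destroyed.
        State::Failed => {
            sled.force_stop(instance_id)?;
            Ok(State::Destroyed)
        }
    }
}

/// Test double that records which instances were force-stopped.
struct RecordingSled {
    stopped: Vec<u64>,
}

impl SledAgentClient for RecordingSled {
    fn force_stop(&mut self, instance_id: u64) -> Result<(), String> {
        self.stopped.push(instance_id);
        Ok(())
    }
}
```

If `force_stop` returns an error, `delete_instance` propagates it and the instance stays failed, so a later retry can attempt the teardown again.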