Description
(Copying from #8740.)
This is extremely unlikely, but suppose:
- we're doing an upgrade and decide to start quiescing (as part of the handoff from old Nexus to new Nexus)
- there are sagas running in some Nexus N1, so it's waiting for those to finish
- there are no sagas running in any other Nexus, so those instances have already disabled their database access
- the sled hosting N1 fails and needs to be expunged
As of #8740 and #8794, there'd be no way to expunge the busted sled, because the only remaining Nexus instances have shut off their database access. The upgrade would be stuck, since quiescing cannot complete without affirmative confirmation from all in-service Nexus instances. In this situation, support would need to either (1) pause the upgrade before the handoff, get Nexus back up, do the expungement, and then proceed again; or (2) force the handoff (e.g., by writing `quiesce_completed`) and allow the saga to be abandoned. This sounds bad in that it seems hard to fix, but it's also extremely unlikely (both sled and disk failure seem quite rare in practice, and this would have to happen in a pretty small window), and it is recoverable, so I'm marking this an "important non-blocker" for self-service update.
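To make the stuck condition concrete, here is a minimal sketch of the scenario as a toy model. Everything in it is hypothetical: `NexusInstance`, `SledState`, and the helper functions are illustrative names invented for this example, not Nexus types or APIs. It assumes that an instance confirms quiescing only when it is reachable, has no running sagas, and has dropped database access, and that an expungement needs at least one reachable instance that still has database access.

```rust
// Toy model of the scenario above. All names are hypothetical and exist
// only for illustration; none of this is real Nexus code.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SledState {
    Healthy,
    Failed,
}

#[derive(Clone, Debug)]
struct NexusInstance {
    name: &'static str,
    sled: SledState,
    running_sagas: usize,
    db_access: bool,
}

impl NexusInstance {
    /// An instance on a failed sled can neither report nor act.
    fn is_reachable(&self) -> bool {
        self.sled == SledState::Healthy
    }

    /// Assumption: confirmation requires being reachable, having no
    /// running sagas, and having given up database access.
    fn has_quiesced(&self) -> bool {
        self.is_reachable() && self.running_sagas == 0 && !self.db_access
    }

    /// Assumption: expungement needs a reachable instance with database access.
    fn can_expunge(&self) -> bool {
        self.is_reachable() && self.db_access
    }
}

/// Quiescing completes only when every in-service instance has confirmed.
fn quiesce_complete(fleet: &[NexusInstance]) -> bool {
    fleet.iter().all(|n| n.has_quiesced())
}

/// True if any instance could still carry out an expungement.
fn expungement_possible(fleet: &[NexusInstance]) -> bool {
    fleet.iter().any(|n| n.can_expunge())
}

/// Names of the instances that are holding up quiescing.
fn blocking_instances(fleet: &[NexusInstance]) -> Vec<&'static str> {
    fleet
        .iter()
        .filter(|n| !n.has_quiesced())
        .map(|n| n.name)
        .collect()
}

/// The upgrade is stuck when quiescing is incomplete and nothing is left
/// that could expunge the instance blocking it.
fn upgrade_stuck(fleet: &[NexusInstance]) -> bool {
    !quiesce_complete(fleet) && !expungement_possible(fleet)
}

/// The fleet described in the issue: N1 had a running saga when its sled
/// failed; N2 and N3 were idle and already shut off database access.
fn scenario() -> Vec<NexusInstance> {
    vec![
        NexusInstance { name: "N1", sled: SledState::Failed, running_sagas: 1, db_access: true },
        NexusInstance { name: "N2", sled: SledState::Healthy, running_sagas: 0, db_access: false },
        NexusInstance { name: "N3", sled: SledState::Healthy, running_sagas: 0, db_access: false },
    ]
}
```

Under these assumptions, `upgrade_stuck(&scenario())` is true and `blocking_instances` returns only `"N1"`. Setting N1's sled back to `Healthy` (the "get Nexus back up" recovery path) makes `upgrade_stuck` false again, because N1 still holds database access and can finish its saga.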