better deal with expungement during quiesce window #8795

@davepacheco

(Copying from #8740.)

This is extremely unlikely, but suppose:

  • we're doing an upgrade and decide to start quiescing (as part of the handoff from old Nexus to new Nexus)
  • there are sagas running in some Nexus N1, so it's waiting for those to finish
  • there are no sagas running in any other Nexus instance, so those instances have disabled their database access
  • the sled hosting N1 fails and needs to be expunged

As of #8740 and #8794, there'd be no way to expunge the busted sled because the only remaining Nexus instances have shut off their database access. The upgrade would be stuck, since quiescing cannot complete without affirmative confirmation from all in-service Nexus instances.

In this situation, support would need to either (1) pause the upgrade before the handoff, get Nexus back up, do the expungement, and then proceed again; or (2) force the handoff (e.g., by writing quiesce_completed) and allow the saga to be abandoned.

This sounds bad in that it seems hard to fix, but it's also extremely unlikely (both sled and disk failure seem quite rare in practice, and this would have to happen in a pretty small window), and it is recoverable, so I'm marking this an "important non-blocker" for self-service update.
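
To make the stuck state concrete, here is a minimal sketch of the scenario as a toy model. Everything in it is hypothetical: `NexusState`, `quiesce_can_complete`, `can_expunge`, and `is_stuck` are illustrative names, not types or functions from Nexus. It assumes that an instance on a failed sled still counts as in-service until it is expunged, and that expungement needs at least one reachable instance with database access.

```rust
// Toy model of the quiesce/expungement deadlock described above.
// All names here are hypothetical and do not correspond to real Nexus code.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum NexusState {
    /// Quiescing but still waiting on running sagas; database access is on.
    DrainingSagas,
    /// No sagas left; this instance has shut off its database access.
    DbAccessDisabled,
    /// The hosting sled failed; the instance is unreachable but (assumption)
    /// still counted as in-service until it is expunged.
    SledFailed,
}

/// Quiescing completes only when every in-service instance has confirmed,
/// modeled here as every instance having reached `DbAccessDisabled`.
fn quiesce_can_complete(instances: &[NexusState]) -> bool {
    instances.iter().all(|s| *s == NexusState::DbAccessDisabled)
}

/// Assumption: expungement requires some reachable instance that can still
/// write to the database.
fn can_expunge(instances: &[NexusState]) -> bool {
    instances.iter().any(|s| *s == NexusState::DrainingSagas)
}

/// The upgrade is stuck when quiescing cannot complete and no instance is
/// left that could carry out the expungement.
fn is_stuck(instances: &[NexusState]) -> bool {
    !quiesce_can_complete(instances) && !can_expunge(instances)
}
```

In this model, `[DrainingSagas, DbAccessDisabled, DbAccessDisabled]` is not stuck, since N1 can still finish its sagas. Once N1's sled fails, `[SledFailed, DbAccessDisabled, DbAccessDisabled]` is stuck: N1 can never confirm, and no other instance can expunge it.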
