SWIM: when pinged node is immediately dead, we sometimes don't issue .unreachable #397

@ktoso

Found on a "real cluster" and in some tests, after the Membership became more stable thanks to the seen tables in #376.

There are a number of cases here which I think were not covered:

  • if we try .connect and the connection fails, we never reply back to SWIM, so it has no chance to mark the node .unreachable
    • it never stores such a member in its members to ping either, so even if the node is completely killed, SWIM never keeps pinging it and never issues unreachable
    • since it might not issue unreachable, downing never has a chance to trigger and nodes never get removed
  • if we don't notice a node is unreachable, but other nodes tell us in gossip that it is
    • our gossip instance applies the change to its membership but does not notify the cluster about the -> unreachable, it seems 🤔
    • this again leads to not issuing down
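
To make the two cases concrete, here is a rough sketch of the behavior I would expect, not the actual implementation. All names in it (`Status`, `ClusterEvent`, `FailureDetector`, `connectFailed`, `onGossip`) are hypothetical and do not exist in the codebase, and it deliberately collapses SWIM's suspicion stages into a single jump to unreachable. The idea is that a failed connect and a gossiped status both go through the same `mark` path, so the member is always tracked and the -> unreachable change is always emitted exactly once.

```swift
// Hypothetical, simplified model. None of these types are the library's real API.
enum Status: Comparable {
    case alive, unreachable, dead
}

enum ClusterEvent: Equatable {
    case reachabilityChanged(node: String, status: Status)
}

struct FailureDetector {
    private(set) var members: [String: Status] = [:]
    private(set) var emitted: [ClusterEvent] = []

    init() {}

    /// Case 1: start tracking the node before the connection outcome is known,
    /// so a failed connect still leaves a member that can be pinged.
    mutating func willConnect(to node: String) {
        if members[node] == nil {
            members[node] = .alive
        }
    }

    /// Case 1: a failed connect is reported back, just like a failed ping would be.
    mutating func connectFailed(to node: String) {
        willConnect(to: node)
        mark(node, as: .unreachable)
    }

    /// Case 2: a status learned through gossip takes the same path as a local
    /// observation, so the cluster is notified as well.
    mutating func onGossip(node: String, status: Status) {
        mark(node, as: status)
    }

    /// Single place where status changes are applied and events are emitted.
    private mutating func mark(_ node: String, as status: Status) {
        // Statuses only move forward; repeated or older information is ignored.
        if let previous = members[node], previous >= status {
            return
        }
        members[node] = status
        if status == .unreachable {
            emitted.append(.reachabilityChanged(node: node, status: .unreachable))
        }
    }
}
```

With this shape, calling `connectFailed(to:)` for a node we never managed to connect to leaves it in `members` as unreachable and records one event, and receiving the same unreachable gossip twice records the event only once. Downing would then be driven from those events.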

Uncovering this depends on a bunch of stuff from the hardening, so I will fix this in separate specific commits, but likely as part of the larger #376 PR.
