
Conversation

@ktoso (Member) commented Jan 26, 2020

Motivation:

Test and harden the SWIM implementation against:

  • already dead nodes
  • nodes we don't yet realize are dead, but another member informs us via gossip that they are, so we must issue unreachability information to our cluster shell

TODO:

  • cleanups
  • even more tests

Modifications:

Result:

@ktoso changed the title from "Wip swim hardening" to "=swim #397 swim must detect unreachable via other nodes" on Jan 26, 2020
return "REPL(to:\(to.address))"
case .ask(let who):
return "ASK(\(who.path))"
return "ASK(\(who.address))"
@ktoso (Member Author):
Tracelog saved the day here ❤️ it was useful to be able to see all messages in and out.


try expectUnreachability(p3)
try expectUnreachability(p1)
}
@ktoso (Member Author) Jan 26, 2020:

TODO: more tests here, including:

  • nodes that are dead from the start and stay dead,
  • flaky nodes which come back alive; in all such cases we must then see a .reachable event


/// Interval at which gossip messages should be issued.
/// Every `interval` a `fanout` number of gossip messages will be sent. // TODO which fanout?
public var probeInterval: TimeAmount = .seconds(1)
@ktoso (Member Author):

So this was weird: the value was used but shouldn't have been. We had `probeInterval` in both swim.gossip and swim.failureDetector, so some settings we applied in tests were never effective, since they set "the wrong one". Now that value lives only in the failure detector.
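For illustration, here is a small self-contained sketch of the pitfall described above; the type names are stand-ins, not the library's real settings types:

struct GossipSettings { var probeInterval: Int = 1_000 }          // the duplicate this PR removes (milliseconds)
struct FailureDetectorSettings { var probeInterval: Int = 1_000 } // the value the failure detector actually reads

var gossip = GossipSettings()
var failureDetector = FailureDetectorSettings()
gossip.probeInterval = 100 // a test that intends faster probing, but sets "the wrong one"
// the failure detector still probes every failureDetector.probeInterval = 1_000 ms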

/// and only after exceeding `suspicionTimeoutPeriodsMax` shall the node be declared as `.unreachable`,
/// which results in a `Cluster.MemberReachabilityChange` `Cluster.Event` which downing strategies may act upon.
public var pingTimeout: TimeAmount = .milliseconds(300)
}
@ktoso (Member Author):

TODO: add a computed property that multiplies the two, so we can easily print "at earliest, we'll notice an unreachable node in N seconds" etc.
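A minimal sketch of what such a computed property could look like, using a stand-in settings type holding both values (names and defaults are illustrative, not the library's API):

struct ProbeTimingSettings {
    var probeIntervalMillis: Int64 = 1_000     // milliseconds between probes
    var suspicionTimeoutPeriodsMax: Int64 = 10 // probe periods a member may remain suspect

    /// "At earliest, we'll notice an unreachable node in N milliseconds."
    var earliestUnreachableDetectionMillis: Int64 {
        self.probeIntervalMillis * self.suspicionTimeoutPeriodsMax
    }
}

let timing = ProbeTimingSettings()
print("at earliest, we'll notice an unreachable node in \(timing.earliestUnreachableDetectionMillis / 1_000) seconds")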

case applied(previousStatus: SWIM.Status?, currentStatus: SWIM.Status)

/// True if the directive was `applied` and the from/to statuses differ, meaning that a change notification has been issued.
var isEffectiveStatusChange: Bool {
@ktoso (Member Author):

This is preparing for #401 already
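As a self-contained sketch of the idea behind `isEffectiveStatusChange` (the real `SWIM.Status` and directive types live in the library; the ones below are stand-ins):

enum Status: Equatable {
    case alive, suspect, unreachable, dead
}

enum GossipDirective {
    case ignored
    case applied(previousStatus: Status?, currentStatus: Status)

    /// True only when `applied` actually changed the status, i.e. only then should a change notification be issued.
    var isEffectiveStatusChange: Bool {
        switch self {
        case .applied(let previous, let current):
            return previous != current
        case .ignored:
            return false
        }
    }
}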

case .suspect:
return .markedSuspect(member: member)
case .dead:
return .confirmedDead(member: member)
@ktoso (Member Author):

This dance is to be reformulated as a "change (from, to)" so we can more reliably emit events to the cluster (about reachability).
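A hypothetical shape such a "change" could take, reusing the stand-in `Status` enum from the sketch further above; this is illustrative, not the PR's final API:

struct StatusChange {
    let fromStatus: Status? // nil when we had no previous knowledge of the member
    let toStatus: Status

    /// Only a transition across the reachability boundary should be escalated to the cluster.
    var isReachabilityChange: Bool {
        let wasReachable = self.fromStatus.map { $0 == .alive || $0 == .suspect } ?? true
        let isReachable = self.toStatus == .alive || self.toStatus == .suspect
        return wasReachable != isReachable
    }
}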

case .applied where member.status.isUnreachable || member.status.isDead:
// FIXME: rather, we should return in applied if it was a change or not, only if it was we should emit...

// TODO: ensure we don't double emit this
@ktoso (Member Author):

That TODO is why `applied` should perhaps return a "change".

@ktoso (Member Author):

Working on this in the next PR.

@ktoso (Member Author) commented Jan 27, 2020

If you want to and/or have the time, @drexin, please have a look. I'll continue with more hardening of the cluster+SWIM interactions; next up is #401, since we never emitted cluster events on the unreachable -> alive | suspect edge. I guess that may have been because none of the cluster events existed back then, so this is more about bringing both ends up to date.

@ktoso merged commit 2feeded into apple:master on Jan 27, 2020
@ktoso deleted the wip-swim-hardening branch on January 27, 2020 at 03:41

// FIXME: rather, we should return in applied if it was a change or not, only if it was we should emit...

// TODO: ensure we don't double emit this
self.escalateMemberUnreachable(context: context, member: member)
@ktoso (Member Author):

So that's the change that makes unreachability something that can be signalled to the cluster shell when another node told us about it. Also known as "the fix", though more hardening is needed around it, as we do not want to emit those too many times.
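A sketch (not the merged code) of how the escalation could be guarded using the "effective change" idea from the sketches above, so that receiving the same dead/unreachable gossip repeatedly does not re-emit the event; `escalate` stands in for `escalateMemberUnreachable`:

func handleConfirmedUnreachable(_ change: StatusChange, escalate: () -> Void) {
    guard change.toStatus == .unreachable || change.toStatus == .dead else {
        return // still reachable, nothing to escalate
    }
    guard change.isReachabilityChange else {
        return // we already knew; avoid double emitting
    }
    escalate()
}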

Development

Successfully merging this pull request may close these issues.

SWIM: when pinged node is immediately dead, we sometimes don't issue .unreachable
