=swim #397 swim must detect unreachable via other nodes #400
Conversation
| return "REPL(to:\(to.address))" | ||
| case .ask(let who): | ||
| return "ASK(\(who.path))" | ||
| return "ASK(\(who.address))" |
tracelog saved the day here ❤️ it was useful to be able to see all messages in/out.
    try expectUnreachability(p3)
    try expectUnreachability(p1)
}
TODO: more tests here, including:
- trying nodes that are completely dead from the start,
- flaky nodes which come back alive; in all such cases we must then see a .reachable event
/// Interval at which gossip messages should be issued.
/// Every `interval` a `fanout` number of gossip messages will be sent. // TODO which fanout?
public var probeInterval: TimeAmount = .seconds(1)
So this was weird: the setting was used but shouldn't have been. We had probeInterval both in swim.gossip and in swim.failureDetector, so some settings we applied in tests were never effective, since they set "the wrong one". Now that value lives only in the failure detector settings.
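For illustration, roughly the shape this leaves the settings in. The nested settings type names, the inits, and the suspicion default are assumptions; only the probeInterval/pingTimeout defaults are taken from this diff:

```swift
// Sketch only: probeInterval now lives solely on the failure detector settings,
// so a test tweaking it always touches the value the probing logic actually reads.
public struct SWIMSettings {
    public init() {}
    public var gossip: SWIMGossipSettings = .init()                   // no probeInterval here anymore
    public var failureDetector: SWIMFailureDetectorSettings = .init() // the only place the probing knobs live
}

public struct SWIMGossipSettings {
    public init() {}
}

public struct SWIMFailureDetectorSettings {
    public init() {}
    /// Interval at which probe (ping) messages are issued.
    public var probeInterval: TimeAmount = .seconds(1)
    /// How long to wait for an ack before a probe counts as failed.
    public var pingTimeout: TimeAmount = .milliseconds(300)
    /// How many probe periods a `.suspect` member gets before being declared `.unreachable` (illustrative default).
    public var suspicionTimeoutPeriodsMax: Int = 10
}
```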
/// and only after exceeding `suspicionTimeoutPeriodsMax` shall the node be declared as `.unreachable`,
/// which results in an `Cluster.MemberReachabilityChange` `Cluster.Event` which downing strategies may act upon.
public var pingTimeout: TimeAmount = .milliseconds(300)
}
TODO: add a computed property that multiplies the two, so we can easily print "at the earliest, we'll notice an unreachable node in N seconds" etc.
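A rough sketch of what such a property could look like, assuming "the two" means probeInterval and suspicionTimeoutPeriodsMax, reusing the assumed settings layout sketched above, and assuming TimeAmount exposes a nanoseconds value the way NIO's does; the property name is made up:

```swift
extension SWIMSettings {
    /// Earliest point at which an unresponsive node can be declared `.unreachable`:
    /// the suspicion has to survive `suspicionTimeoutPeriodsMax` full probe periods.
    public var earliestUnreachableDetection: TimeAmount {
        .nanoseconds(
            self.failureDetector.probeInterval.nanoseconds
                * Int64(self.failureDetector.suspicionTimeoutPeriodsMax)
        )
    }
}
```

With the illustrative defaults above (1 second probe interval, 10 periods) that works out to 10 seconds.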
case applied(previousStatus: SWIM.Status?, currentStatus: SWIM.Status)

/// True if the directive was `applied` and the from/to statuses differ, meaning that a change notification has issued.
var isEffectiveStatusChange: Bool {
This is preparing for #401 already
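A possible body for that property, using the `applied` case from this hunk; the enum name, the `ignored` case, and the assumption that SWIM.Status is Equatable are mine:

```swift
enum MarkDirective {
    case ignored
    case applied(previousStatus: SWIM.Status?, currentStatus: SWIM.Status)

    /// True only when applying the gossip actually moved the member to a different status;
    /// a nil previousStatus means the member was unknown before, which also counts as a change.
    var isEffectiveStatusChange: Bool {
        switch self {
        case .applied(let previousStatus, let currentStatus):
            return previousStatus != currentStatus
        case .ignored:
            return false
        }
    }
}
```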
case .suspect:
    return .markedSuspect(member: member)
case .dead:
    return .confirmedDead(member: member)
This dance is to be reformulated as a "change (from, to)" so we can more reliably emit reachability events to the cluster.
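A sketch of that "change (from, to)" shape; the type and property names are assumptions, not what this PR ships:

```swift
struct StatusChange {
    /// nil when the member was not known to us before this gossip was applied
    let fromStatus: SWIM.Status?
    let toStatus: SWIM.Status
}
```

Carrying both ends of the change makes "did reachability actually flip" a local question on the change value, instead of something every call site has to re-derive from the member's current status.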
case .applied where member.status.isUnreachable || member.status.isDead:
    // FIXME: rather, we should return in applied if it was a change or not, only if it was we should emit...

    // TODO: ensure we don't double emit this
That TODO is exactly why `applied` should perhaps return a "change".
Working on this in the next PR.
…o immediately dead node
If you want to and/or have the time, @drexin, please have a look. I'll continue with more hardening of the cluster+swim interactions; next up: #401, as we never emitted events to the cluster on the
    // FIXME: rather, we should return in applied if it was a change or not, only if it was we should emit...

    // TODO: ensure we don't double emit this
    self.escalateMemberUnreachable(context: context, member: member)
So that's the change that makes unreachability something that can be signalled to the cluster shell when another node told us about it. Also known as "the fix", though more hardening is needed around it, as we do not want to emit those too many times.
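A sketch of the double-emission guard being alluded to; `escalateMemberUnreachable(context:member:)` and the unreachable/dead check come from this hunk, everything else (the shell extension, the directive handling, the context type) is an assumption about how it could be wired up:

```swift
extension SWIMShell {
    /// Hypothetical handler: only an *effective* change to unreachable/dead is escalated,
    /// so re-hearing the same gossip about an already-unreachable member does not
    /// signal the cluster shell a second time.
    func handleMarkResult(_ directive: MarkDirective, member: SWIM.Member, context: ActorContext<SWIM.Message>) {
        switch directive {
        case .applied(let previousStatus, let currentStatus)
            where (currentStatus.isUnreachable || currentStatus.isDead) && previousStatus != currentStatus:
            self.escalateMemberUnreachable(context: context, member: member)
        default:
            break // no effective change, nothing to tell the cluster
        }
    }
}
```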
Motivation:
Test and harden the SWIM implementation against:
TODO:
Modifications:
Result: