
Conversation

@ktoso (Member) commented Jan 26, 2020

Motivation:

Test and harden the SWIM implementation against:

  • already dead nodes
  • nodes we don't yet realize are dead, but another member informs us via gossip that they are, so we must issue unreachability information to our cluster shell

TODO:

  • cleanups
  • even more tests

Modifications:

Result:

@ktoso changed the title from "Wip swim hardening" to "=swim #397 swim must detect unreachable via other nodes" on Jan 26, 2020
return "REPL(to:\(to.address))"
case .ask(let who):
return "ASK(\(who.path))"
return "ASK(\(who.address))"
@ktoso (Member Author):
Tracelog saved the day here ❤️ it was useful to be able to see all messages in and out.


try expectUnreachability(p3)
try expectUnreachability(p1)
}
@ktoso (Member Author) Jan 26, 2020:

TODO: more tests here, including:

  • nodes that are dead from the start and stay dead,
  • flaky nodes which come back alive; in all such cases we must then see a .reachable event


/// Interval at which gossip messages should be issued.
/// Every `interval` a `fanout` number of gossip messages will be sent. // TODO which fanout?
public var probeInterval: TimeAmount = .seconds(1)
@ktoso (Member Author):

So this was weird: the value was used but shouldn't have been. We had `probeInterval` in both swim.gossip and swim.failureDetector, so some settings we applied in tests were never effective, since they set "the wrong one". Now that value lives only in the failure detector.
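For illustration, here is a small self-contained sketch of the pitfall described above; the type names are stand-ins, not the library's real settings types:

struct GossipSettings { var probeInterval: Int = 1_000 }          // the duplicate this PR removes (milliseconds)
struct FailureDetectorSettings { var probeInterval: Int = 1_000 } // the value the failure detector actually reads

var gossip = GossipSettings()
var failureDetector = FailureDetectorSettings()
gossip.probeInterval = 100 // a test that intends faster probing, but sets "the wrong one"
// the failure detector still probes every failureDetector.probeInterval = 1_000 ms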

/// and only after exceeding `suspicionTimeoutPeriodsMax` shall the node be declared as `.unreachable`,
/// which results in a `Cluster.MemberReachabilityChange` `Cluster.Event` which downing strategies may act upon.
public var pingTimeout: TimeAmount = .milliseconds(300)
}
@ktoso (Member Author):

TODO: add a computed property that multiplies the two, so we can easily print "at earliest, we'll notice an unreachable node in N seconds" etc.
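A minimal sketch of what such a computed property could look like, using a stand-in settings type holding both values (names and defaults are illustrative, not the library's API):

struct ProbeTimingSettings {
    var probeIntervalMillis: Int64 = 1_000     // milliseconds between probes
    var suspicionTimeoutPeriodsMax: Int64 = 10 // probe periods a member may remain suspect

    /// "At earliest, we'll notice an unreachable node in N milliseconds."
    var earliestUnreachableDetectionMillis: Int64 {
        self.probeIntervalMillis * self.suspicionTimeoutPeriodsMax
    }
}

let timing = ProbeTimingSettings()
print("at earliest, we'll notice an unreachable node in \(timing.earliestUnreachableDetectionMillis / 1_000) seconds")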

case applied(previousStatus: SWIM.Status?, currentStatus: SWIM.Status)

/// True if the directive was `applied` and the from/to statuses differ, meaning that a change notification has been issued.
var isEffectiveStatusChange: Bool {
@ktoso (Member Author):

This is preparing for #401 already
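As a self-contained sketch of the idea behind `isEffectiveStatusChange` (the real `SWIM.Status` and directive types live in the library; the ones below are stand-ins):

enum Status: Equatable {
    case alive, suspect, unreachable, dead
}

enum GossipDirective {
    case ignored
    case applied(previousStatus: Status?, currentStatus: Status)

    /// True only when `applied` actually changed the status, i.e. only then should a change notification be issued.
    var isEffectiveStatusChange: Bool {
        switch self {
        case .applied(let previous, let current):
            return previous != current
        case .ignored:
            return false
        }
    }
}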

case .suspect:
return .markedSuspect(member: member)
case .dead:
return .confirmedDead(member: member)
@ktoso (Member Author):

This dance is to be reformulated as a "change (from, to)" so we can more reliably emit events to the cluster (about reachability).
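A hypothetical shape such a "change" could take, reusing the stand-in `Status` enum from the sketch further above; this is illustrative, not the PR's final API:

struct StatusChange {
    let fromStatus: Status? // nil when we had no previous knowledge of the member
    let toStatus: Status

    /// Only a transition across the reachability boundary should be escalated to the cluster.
    var isReachabilityChange: Bool {
        let wasReachable = self.fromStatus.map { $0 == .alive || $0 == .suspect } ?? true
        let isReachable = self.toStatus == .alive || self.toStatus == .suspect
        return wasReachable != isReachable
    }
}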

case .applied where member.status.isUnreachable || member.status.isDead:
// FIXME: rather, we should return in applied if it was a change or not, only if it was we should emit...

// TODO: ensure we don't double emit this
@ktoso (Member Author):

That TODO is why `applied` should perhaps return a "change".

@ktoso (Member Author):

Working on this in the next PR.

@ktoso (Member Author) commented Jan 27, 2020

If you want to and/or have the time, @drexin, please have a look. I'll continue with more hardening of the cluster+SWIM interactions; next up is #401, since we never emitted cluster events on the unreachable -> alive | suspect edge. I guess that may have been because none of the cluster events existed back then, so this is more about bringing both ends up to date.

@ktoso merged commit 2feeded into apple:master on Jan 27, 2020
@ktoso deleted the wip-swim-hardening branch on January 27, 2020 at 03:41

// FIXME: rather, we should return in applied if it was a change or not, only if it was we should emit...

// TODO: ensure we don't double emit this
self.escalateMemberUnreachable(context: context, member: member)
@ktoso (Member Author):

So that's the change that makes unreachability something that can be signalled to the cluster shell when another node told us about it. Also known as "the fix", though more hardening is needed around it, as we do not want to emit those too many times.
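A sketch (not the merged code) of how the escalation could be guarded using the "effective change" idea from the sketches above, so that receiving the same dead/unreachable gossip repeatedly does not re-emit the event; `escalate` stands in for `escalateMemberUnreachable`:

func handleConfirmedUnreachable(_ change: StatusChange, escalate: () -> Void) {
    guard change.toStatus == .unreachable || change.toStatus == .dead else {
        return // still reachable, nothing to escalate
    }
    guard change.isReachabilityChange else {
        return // we already knew; avoid double emitting
    }
    escalate()
}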

Development

Successfully merging this pull request may close these issues.

SWIM: when pinged node is immediately dead, we sometimes don't issue .unreachable
