
Conversation

Member

@ktoso ktoso commented Jan 17, 2020

Motivation:

Normally a down node is "scary and should really just die ASAP".

A node could have been marked .down wrongly after all, and thus be a "zombie"; we don't like zombie nodes, so such a node should initiate a shutdown immediately.

Even if it didn't, we have the "shoot the other node" messages (RIP) that should cause it to die if it attempted further communication (more tests there I want to add though).

Modifications:

  • when a node notices it is supposed to be down, it initiates a shutdown (see the sketch below)
  • it should NOT attempt super fancy graceful handovers
    • these can happen if a node leaves willingly (i.e. goes into .leaving, does the handover, and then does .down itself)
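For orientation, a rough sketch of the flow this PR introduces; the helper name and signature below are illustrative only (the real logic lives in the ClusterShell), while the settings call is the one from the diff discussed further down:

// Illustrative sketch only: when the cluster shell notices that it itself has
// been marked .down in the membership, it runs whatever action the settings prescribe.
func onSelfNodeDown(_ context: ActorContext<ClusterShell.Message>, _ state: ClusterShellState) {
    context.log.warning("Self node was determined [.down].")
    // no graceful handover at this point; that belongs to a willing .leaving, not to .down
    context.system.settings.cluster.onDownAction.make()(context.system)
}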

Result:

A node that notices it has been marked .down automatically performs the configured on-down action; by default this shuts the actor system down.

Member Author

ktoso commented Jan 17, 2020

The failure was a new one: #377

public static func make(system: ActorSystem, identifier: String? = nil) -> Logger {
    if let overriddenLoggerFactory = system.settings.overrideLoggerFactory {
        return overriddenLoggerFactory(identifier ?? system.name)
    }
Member Author

This was missing, and thus logs were not being captured on the system level.
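For illustration, a sketch of how such a factory override could be installed; the closure shape (String) -> Logger is inferred from the call site above, the swift-log usage is standard, and the log level choice is just an example:

import Logging

// Route all loggers, including the system-level one fixed above, through a custom
// factory (e.g. so tests can capture system logs).
settings.overrideLoggerFactory = { identifier in
    var log = Logger(label: identifier)
    log.logLevel = .trace // capture everything while debugging
    return log
}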


public var log: Logger {
    var l = ActorLogger.make(system: self) // we only do this to go "through" the proxy; we may not need it in the future?
    var l = ActorLogger.make(system: self)
Member Author

This works now


self.swimRef.tell(.local(.confirmDead(state.myselfNode)))
context.log.warning("Self node was determined [.down]. (TODO: initiate shutdown based on config)") // TODO: initiate a shutdown if configured to do so
context.system.settings.cluster.onDownAction.make()(context.system)
Member Author

The feature.

Member

@yim-lee yim-lee left a comment


👍

Member Author

ktoso commented Jan 20, 2020

One of the failures was a bad test; now that we're shutting down the system in this PR, it had to be adjusted. #377

Resolves #377

Member Author

ktoso commented Jan 20, 2020

The #378 failure makes sense and "is correct" given the settings the test is run under. I'll make .leaving a real thing and then that'll work well.

Member Author

ktoso commented Jan 20, 2020

Applied workaround for #378 and following up separately.

ktoso added 12 commits January 21, 2020 19:44
…f down immediately, must become leaving instead
…ting

- more tests for all cases of "self downing": by shutting down, leaving,
and downing myself.
- downing is now reimplemented in terms of Cluster.Events, which makes
it more resilient -- it had a bug where, when a not-self leader was
marked, it would throw, since the membership was only partially
maintained
- the singleton tests are now passing consistently
- the cluster OnDownAction shutdown is now a graceful shutdown and has a
timeout. The current impl is harsh and just delays the shutting down.
Member Author

ktoso commented Jan 21, 2020

It's not the cleanest / most separated-concerns PR of all time... it was quite hard to figure out / locate some of the issues 🙇 I expect the follow-ups to be more self-contained; will comment some more inline.


public func leave() {
    self.ref.tell(.command(.downCommand(self.node.node)))
}
Member Author

may get its own command; we have a .leaving status in membership.

The idea is that while leaving we may still perform actions, but others would not give us new work etc.
This would be used by virtual actors and singletons.

}
})
}
}
Member Author

A slight delay is useful to allow the .down gossip to spread to others before we really die.

Tests also cover the "shutdown immediately" case, so that all works correctly 👍
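To make the "slight delay" concrete, here is a sketch of the shape such an on-down action can take. The enum below is illustrative and not the actual OnDownActionStrategySettings; it only demonstrates delaying the shutdown so the gossip has a chance to spread, and assumes ActorSystem.shutdown() as the shutdown entry point:

import Dispatch

// Illustrative only.
enum IllustrativeOnDownAction {
    case none
    case gracefulShutdown(delaySeconds: Double)

    func make() -> (ActorSystem) -> Void {
        switch self {
        case .none:
            return { _ in } // keep running; the "zombie" case the PR description warns about
        case .gracefulShutdown(let delaySeconds):
            return { system in
                // wait a moment so the .down gossip reaches peers, then shut down
                DispatchQueue.global().asyncAfter(deadline: .now() + delaySeconds) {
                    system.shutdown()
                }
            }
        }
    }
}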

public var downingStrategy: DowningStrategySettings = .none
/// Strategy how members determine if others (or myself) shall be marked as `.down`.
/// This strategy should be set to the same (or compatible) strategy on all members of a cluster to avoid split brain situations.
public var downingStrategy: DowningStrategySettings = .timeout(.default)
Member Author

I figured we should make downing on by default after all...

We should soon implement a slightly better one, but for the sake of showing how singletons move around etc., I guess let's leave it on...

Opinions @drexin @yim-lee ?

Member

I figured we should make downing on by default after all...

+1
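Related usage sketch for nodes that want to opt back out of the new default (assuming the usual settings-closure initializer; both strategy values appear in this diff):

// All members should use the same (or a compatible) downing strategy to avoid split-brain.
let system = ActorSystem("Example") { settings in
    settings.cluster.downingStrategy = .none // opt out; .timeout(.default) is the new default
}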

/// When this member node notices it has been marked as `.down` in the membership, it can automatically perform an action.
/// This setting determines which action to take. Generally speaking, the best course of action is to quickly and gracefully
/// shut down the node and process, potentially leaving a higher level orchestrator to replace the node (e.g. k8s starting a new pod for the cluster).
public var onDownAction: OnDownActionStrategySettings = .gracefulShutdown(delay: .seconds(3))
Member Author

This resolves an ancient ticket, #55. I think it is indeed correct to keep the default of killing the actor system, as we chatted in the ticket -- sanity check that we still agree with this, @drexin? :)
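Usage-wise this stays a one-liner; the delay mostly needs to be long enough for the .down gossip to spread, while restarting the process is left to the orchestrator (the value below is illustrative):

settings.cluster.onDownAction = .gracefulShutdown(delay: .seconds(5)) // or .seconds(3), the default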

}
}

func tryIntroduceGossipPeer(_ context: ActorContext<Message>, _ state: ClusterShellState, change: Cluster.MembershipChange, file: String = #file, line: UInt = #line) {
Member Author

This is because of the lack of #371, but also because it's "faster": we know exactly who the peers are, so we're cheating instead of waiting for gossip rounds of the receptionist 🤔

state.gossipControl.introduce(peer: gossipPeer)
}
}
// TODO: was this needed here? state.gossipControl.update(Cluster.Gossip())
Member Author

it was not :)

case .retryHandshake(let initiated):
return self.connectSendHandshakeOffer(context, state, initiated: initiated)

// FIXME: this is now a cluster event !!!!!
Member Author

The comment is not aligned with reality anymore; the command here is "from the failure detector" and is used only by SWIM to tell us about reachability changes. SWIM does not know about Cluster.Member, thus it cannot emit the reachability cluster event. This is ok.

}

public func down(member: Cluster.Member) {
    self.ref.tell(.command(.downCommandMember(member)))
Member Author

This is interesting / important: thanks to using this version internally, we correctly carry all the metadata that a member has -- such as the reachability at the moment when someone decided to call down. Mostly a "more correct view of the cluster membership" change. For end users, calling either of them will yield the expected result.
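For illustration, both entry points as an end user would call them (otherNode / otherMember are placeholders):

system.cluster.down(node: otherNode)     // by node; convenient for user code
system.cluster.down(member: otherMember) // by Cluster.Member; carries reachability etc., preferred internally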

/// and `.down` in the high-level membership.
public var suspicionTimeoutPeriodsMax: Int = 10
public var suspicionTimeoutPeriodsMin: Int = 10
// public var suspicionTimeoutPeriodsMin: Int = 10 // FIXME: this is once we have LHA, Local Health Aware Suspicion
Member Author

SWIM: this was not used nor implemented yet. It indeed is important, but only once LHA is implemented #352
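For reference, a sketch of the Lifeguard-style Local Health Aware suspicion timeout that these min/max periods would eventually feed into once #352 lands (formula per the Lifeguard paper, not code from this repository):

import Foundation

// The timeout starts at maxPeriods and shrinks toward minPeriods as more independent
// confirmations (c) of the suspicion arrive, out of k expected confirmers.
func suspicionTimeoutPeriods(minPeriods: Double, maxPeriods: Double, confirmations c: Double, expectedConfirmations k: Double) -> Double {
    max(minPeriods, maxPeriods - (maxPeriods - minPeriods) * (log(c + 1) / log(k + 1)))
}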

self._cachedAssociationRemoteControl = remoteControl // TODO: atomically...
self.system.log.warning("FIXME: Workaround, ActorRef's RemotePersonality had to spin \(spinNr) times to obtain remoteControl to send message to \(self.address)")
// self._cachedAssociationRemoteControl = remoteControl // TODO: atomically cache a remote control?
return remoteControl
Member Author

The terrible concurrency issue around association lookup 😱 We must fix this; more details in #383 (I will work on it ASAP, as it means message loss).

import DistributedActorsTestKit
import XCTest

final class ActorSingletonPluginClusteredTests: ClusteredNodesTestBase {
Member Author

Separated cluster-less tests from Cluster tests for the singleton


second.cluster.down(node: first.cluster.node.node)

try self.capturedLogs(of: first).awaitLogContaining(self.testKit(first), text: "Self node was marked [.down]!")
Member Author

Checking logs is, weirdly enough, more reliable than inspecting the nodes' status, as we want to check that they didn't accidentally cause a shutdown etc.

@ktoso ktoso changed the title =cluster #55 add automatic OnDownActions, default to system shutdown =cluster #55 #377 #383 OnDownActions, harden singleton & Downing tests, fix TimeoutDowningStrategy Jan 21, 2020
@ktoso ktoso changed the title =cluster #55 #377 #383 OnDownActions, harden singleton & Downing tests, fix TimeoutDowningStrategy =cluster #55 #377 #383 #378 OnDownActions, harden singleton & Downing tests, fix TimeoutDowningStrategy Jan 21, 2020
Member Author

ktoso commented Jan 21, 2020

Whooo boy... this hardened a ton of stuff. 🤗
Onwards to #383 and merging #376.

Post-merge reviews are welcome, though don't stress too much about it.

@ktoso ktoso merged commit 8668220 into apple:master Jan 21, 2020
@ktoso ktoso deleted the wip-a-downed-node-automatically-shutdown branch January 21, 2020 11:28
metadata["managerRef"] = "\(managerRef.address)"
}

return metadata
Member

👍

remoteControl.sendUserMessage(type: Message.self, envelope: Envelope(payload: .message(message)), recipient: self.address)
} else {
self.deadLetters.adapted().tell(message, file: file, line: line)
pprint("no remote control!!!! \(self.address)")
Member

intended?



Development

Successfully merging this pull request may close these issues.

test_singletonByClusterLeadership_withLeaderChange MUST work when down(self) is issued
Node upon being .down should initiate shutdown
