=cluster #55 #377 #383 #378 OnDownActions, harden singleton & Downing tests, fix TimeoutDowningStrategy #375
Conversation
Failure was new #377
public static func make(system: ActorSystem, identifier: String? = nil) -> Logger {
    if let overriddenLoggerFactory = system.settings.overrideLoggerFactory {
        return overriddenLoggerFactory(identifier ?? system.name)
    }
This was missing, and thus logs were not being captured at the system level.
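For illustration, a minimal sketch of how such an overridden factory could be wired up. Only `overrideLoggerFactory` and the `(String) -> Logger` factory shape come from the diff above; the trailing-closure `ActorSystem` initializer and the closure body are assumptions.

```swift
import Logging // swift-log

// Hypothetical sketch: route every logger, including the system-level one,
// through a single factory (e.g. so tests can capture and assert on log output).
let system = ActorSystem("LogCaptureExample") { settings in
    settings.overrideLoggerFactory = { identifier in
        var log = Logger(label: identifier)
        log.logLevel = .trace // capture everything; a test capture handler could be installed here
        return log
    }
}

// With the fix above, ActorLogger.make(system:) also consults the override,
// so `system.log` flows through the same factory.
```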
public var log: Logger {
    var l = ActorLogger.make(system: self) // we only do this to go "through" the proxy; we may not need it in the future?
    var l = ActorLogger.make(system: self)
This works now
self.swimRef.tell(.local(.confirmDead(state.myselfNode)))
context.log.warning("Self node was determined [.down]. (TODO: initiate shutdown based on config)") // TODO: initiate a shutdown if configured to do so
context.system.settings.cluster.onDownAction.make()(context.system)
This is the feature itself: once the node observes itself as `.down`, the configured on-down action is invoked.
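To make that concrete, here is a rough sketch of the shape such a setting could take. Only `onDownAction.make()(context.system)` and the `.gracefulShutdown(delay: .seconds(3))` default appear in this PR; the `.none` case, the `TimeInterval` stand-in for the library's own time type, and `system.shutdown()` are assumptions.

```swift
import Dispatch
import Foundation

// Hypothetical sketch: the setting produces a closure that is invoked once this
// node observes itself as .down in the membership.
enum OnDownActionStrategySketch {
    /// Do nothing automatically; leave the reaction to the application.
    case none
    /// Shut the system down after a short delay, giving the .down gossip time to spread.
    case gracefulShutdown(delay: TimeInterval)

    func make() -> (ActorSystem) -> Void {
        switch self {
        case .none:
            return { _ in () }
        case .gracefulShutdown(let delay):
            return { system in
                // The delay lets peers learn about the .down before this node disappears.
                DispatchQueue.global().asyncAfter(deadline: .now() + delay) {
                    system.shutdown() // assumed ActorSystem API for stopping the node
                }
            }
        }
    }
}
```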
yim-lee left a comment
👍
The #378 failure makes sense and "is correct" given the settings the test is run under. I'll make …

Applied workaround for #378 and following up separately.
…leadership change event
…f down immediately, must become leaving instead
…ting
- more tests for all cases of "self downing": shutting down, leaving, and downing myself
- downing is now reimplemented in terms of Cluster.Events, which makes it more resilient -- it had a bug where marking a non-self leader as down would throw, since the membership was only partially maintained
- the singleton tests are now passing consistently
- the cluster OnDownAction shutdown is now a graceful shutdown with a timeout; the current impl is harsh and just delays the shutdown
…fail" to stash (for same reason)
…est address reachable nodes
It's not the cleanest / most well-separated PR of all time... it was quite hard to figure out / locate some of the issues 🙇 The follow-ups I expect to be more self-contained; will comment some more inline.
public func leave() {
    self.ref.tell(.command(.downCommand(self.node.node)))
}
This may get its own command; we have a `.leaving` status in membership.
The idea is that while leaving we may still perform actions, but others would not give us new work etc.
This would be used by virtual actors and singletons.
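Purely as an illustration of that idea (none of this is in the PR), a work-distribution component could keep `.leaving` members out of new assignments while they drain; a minimal sketch assuming `Cluster.Member` exposes its `status`:

```swift
// Hypothetical sketch: hand new work only to fully .up members;
// .leaving members finish what they already own but receive nothing new.
func membersEligibleForNewWork(_ members: [Cluster.Member]) -> [Cluster.Member] {
    members.filter { $0.status == .up } // excludes .leaving, .down, ...
}
```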
            }
        })
    }
}
A slight delay is useful for allowing the `.down` gossip to spread to others before we really die.
Tests also cover the "shut down immediately" case, confirming that it all works correctly 👍
public var downingStrategy: DowningStrategySettings = .none
/// Strategy for how members determine whether others (or this member itself) shall be marked as `.down`.
/// This strategy should be set to the same (or a compatible) strategy on all members of a cluster to avoid split brain situations.
public var downingStrategy: DowningStrategySettings = .timeout(.default)
I figured we should turn downing on by default after all...
+1
/// When this member node notices it has been marked as `.down` in the membership, it can automatically perform an action.
/// This setting determines which action to take. Generally speaking, the best course of action is to quickly and gracefully
/// shut down the node and process, potentially leaving a higher level orchestrator to replace the node (e.g. k8s starting a new pod for the cluster).
public var onDownAction: OnDownActionStrategySettings = .gracefulShutdown(delay: .seconds(3))
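Putting the two settings together, end-user configuration could look roughly like this; the setting names, cases, and defaults are taken from the diffs above, while the trailing-closure `ActorSystem` initializer is an assumption.

```swift
let system = ActorSystem("ExampleNode") { settings in
    // Downing is now on by default; shown explicitly here for illustration.
    settings.cluster.downingStrategy = .timeout(.default)

    // When this node notices it was marked .down, shut down gracefully after a short
    // delay so the .down information can still be gossiped to peers before we exit.
    settings.cluster.onDownAction = .gracefulShutdown(delay: .seconds(3))
}
```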
        }
    }

    func tryIntroduceGossipPeer(_ context: ActorContext<Message>, _ state: ClusterShellState, change: Cluster.MembershipChange, file: String = #file, line: UInt = #line) {
This is because of the lack of #371, but also because it's "faster": since we know exactly who the peers are, we're cheating instead of waiting for gossip rounds of the receptionist 🤔
            state.gossipControl.introduce(peer: gossipPeer)
        }
    }
    // TODO: was this needed here? state.gossipControl.update(Cluster.Gossip())
it was not :)
        case .retryHandshake(let initiated):
            return self.connectSendHandshakeOffer(context, state, initiated: initiated)

        // FIXME: this is now a cluster event !!!!!
The comment is not aligned with reality anymore; the command here is "from the failure detector" and is used only by SWIM to tell us about a reachability change. SWIM does not know about Cluster.Member, thus it cannot emit the reachability cluster event itself. This is ok.
    }

    public func down(member: Cluster.Member) {
        self.ref.tell(.command(.downCommandMember(member)))
This is interesting / important: thanks to using this version internally, we correctly carry all the metadata a member has -- such as its reachability at the moment someone decided to call down. It is mostly a "more correct view of the cluster membership" change; for end users, calling either of them will yield the expected result.
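A small sketch of how the two entry points relate from a caller's perspective; `markDown` and its parameters are hypothetical, while `cluster.down(node:)` and `cluster.down(member:)` are the APIs shown in this PR.

```swift
// Hypothetical helper: prefer the member-typed down when we already hold a Cluster.Member,
// so its metadata (e.g. reachability at decision time) travels along with the change.
func markDown(_ member: Cluster.Member?, node: Node, in system: ActorSystem) {
    if let member = member {
        system.cluster.down(member: member) // carries reachability and other member metadata
    } else {
        system.cluster.down(node: node)     // node-based variant; presumably resolves the member internally
    }
}
```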
    /// and `.down` in the high-level membership.
    public var suspicionTimeoutPeriodsMax: Int = 10
    public var suspicionTimeoutPeriodsMin: Int = 10
    // public var suspicionTimeoutPeriodsMin: Int = 10 // FIXME: this is once we have LHA, Local Health Aware Suspicion
SWIM: this was neither used nor implemented yet. It is indeed important, but only once LHA (Local Health Aware Suspicion) is implemented, see #352.
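For context on what those min/max periods would feed into once LHA lands, here is a sketch of a Lifeguard-style Local Health Aware Suspicion timeout; this is not part of the PR or the current SWIM implementation, just the general formula.

```swift
import Foundation

// Sketch: the suspicion timeout starts at the max and shrinks toward the min as more
// independent peers confirm the suspicion (Lifeguard-style LHA suspicion).
func suspicionTimeoutPeriods(min minPeriods: Double, max maxPeriods: Double,
                             confirmations c: Double, expectedConfirmations k: Double) -> Double {
    guard k > 0 else { return maxPeriods }
    let shrink = log(c + 1) / log(k + 1) // 0 with no confirmations, 1 once k confirmations arrived
    return Swift.max(minPeriods, maxPeriods - (maxPeriods - minPeriods) * shrink)
}

// Example: min = 3, max = 10 periods, 3 expected confirmations:
//   0 confirmations -> 10 periods; 3 confirmations -> 3 periods.
```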
    self._cachedAssociationRemoteControl = remoteControl // TODO: atomically...
    self.system.log.warning("FIXME: Workaround, ActorRef's RemotePersonality had to spin \(spinNr) times to obtain remoteControl to send message to \(self.address)")
    // self._cachedAssociationRemoteControl = remoteControl // TODO: atomically cache a remote control?
    return remoteControl
The terrible concurrency issue around association lookup 😱 We must fix this; more details in #383 (I will work on it ASAP, as it means message loss).
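Not the actual fix (that is tracked in #383), but one possible shape for removing the spin; a hypothetical sketch that serializes the cache lookup behind a lock.

```swift
import Foundation

// Hypothetical sketch: protect the cached remote control with a lock and perform a single
// lookup instead of spinning, so concurrent senders either reuse the cache or wait briefly.
final class CachedRemoteControl<RemoteControl> {
    private let lock = NSLock()
    private var cached: RemoteControl?

    func getOrLookup(_ lookup: () -> RemoteControl?) -> RemoteControl? {
        self.lock.lock()
        defer { self.lock.unlock() }
        if let existing = self.cached {
            return existing
        }
        let found = lookup() // e.g. resolve the association for this address
        self.cached = found
        return found
    }
}
```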
import DistributedActorsTestKit
import XCTest

final class ActorSingletonPluginClusteredTests: ClusteredNodesTestBase {
Separated cluster-less tests from Cluster tests for the singleton
second.cluster.down(node: first.cluster.node.node)

try self.capturedLogs(of: first).awaitLogContaining(self.testKit(first), text: "Self node was marked [.down]!")
Checking logs is, weirdly enough, more reliable than inspecting the node's status, since we also want to check that it did not accidentally cause a shutdown etc.
    metadata["managerRef"] = "\(managerRef.address)"
}

return metadata
👍
remoteControl.sendUserMessage(type: Message.self, envelope: Envelope(payload: .message(message)), recipient: self.address)
} else {
    self.deadLetters.adapted().tell(message, file: file, line: line)
    pprint("no remote control!!!! \(self.address)")
intended?
Motivation:
Normally a down node is "scary" and should really die ASAP.
A node could have been wrongly marked as `.down` after all, and thus it is a "zombie"; we don't like zombie nodes, so it should initiate a shutdown immediately.
Even if it didn't, we have the "shoot the other node" messages (RIP) that should cause it to die if it attempted further communication (more tests there I want to add, though).
Modifications:
Result: