
Conversation

Member

@ktoso ktoso commented Jan 17, 2020

Motivation:

Normally a down node is "scary and should really just die ASAP".

A node could have been marked .down wrongly after all, and thus be a "zombie"; we don't like zombie nodes, so such a node should initiate a shutdown immediately.

Even if it didn't, we have the "shoot the other node" messages (RIP) that should cause it to die if it attempted further communication (more tests there I want to add though).

Modifications:

  • when a node notices it is supposed to be down, it initiates a shutdown (see the sketch below)
  • it should NOT attempt super fancy graceful handovers
    • these can happen if a node leaves willingly (i.e. goes into .leaving, does the handover, and then does .down itself)
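For orientation, a rough sketch of the flow this PR introduces; the helper name and signature below are illustrative only (the real logic lives in the ClusterShell), while the settings call is the one from the diff discussed further down:

// Illustrative sketch only: when the cluster shell notices that it itself has
// been marked .down in the membership, it runs whatever action the settings prescribe.
func onSelfNodeDown(_ context: ActorContext<ClusterShell.Message>, _ state: ClusterShellState) {
    context.log.warning("Self node was determined [.down].")
    // no graceful handover at this point; that belongs to a willing .leaving, not to .down
    context.system.settings.cluster.onDownAction.make()(context.system)
}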

Result:

A node that notices it has been marked .down automatically performs the configured on-down action; by default this shuts the actor system down.

Member Author

ktoso commented Jan 17, 2020

The failure was a new one: #377

public static func make(system: ActorSystem, identifier: String? = nil) -> Logger {
    if let overriddenLoggerFactory = system.settings.overrideLoggerFactory {
        return overriddenLoggerFactory(identifier ?? system.name)
    }
Member Author

This was missing, and thus logs were not being captured on the system level.
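For illustration, a sketch of how such a factory override could be installed; the closure shape (String) -> Logger is inferred from the call site above, the swift-log usage is standard, and the log level choice is just an example:

import Logging

// Route all loggers, including the system-level one fixed above, through a custom
// factory (e.g. so tests can capture system logs).
settings.overrideLoggerFactory = { identifier in
    var log = Logger(label: identifier)
    log.logLevel = .trace // capture everything while debugging
    return log
}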


public var log: Logger {
    var l = ActorLogger.make(system: self) // we only do this to go "through" the proxy; we may not need it in the future?
    var l = ActorLogger.make(system: self)
Member Author

This works now


self.swimRef.tell(.local(.confirmDead(state.myselfNode)))
context.log.warning("Self node was determined [.down]. (TODO: initiate shutdown based on config)") // TODO: initiate a shutdown if configured to do so
context.system.settings.cluster.onDownAction.make()(context.system)
Member Author

The feature.

Member

@yim-lee yim-lee left a comment


👍

Member Author

ktoso commented Jan 20, 2020

One of the failures was a bad test; now that we're shutting down the system in this PR, it had to be adjusted. #377

Resolves #377

Member Author

ktoso commented Jan 20, 2020

The #378 failure makes sense and "is correct" given the settings the test is run under. I'll make .leaving a real thing and then that'll work well.

Member Author

ktoso commented Jan 20, 2020

Applied workaround for #378 and following up separately.

ktoso added 12 commits January 21, 2020 19:44
…f down immediately, must become leaving instead
…ting

- more tests for all cases of "self downing": by shutting down, leaving,
and downing myself.
- downing is now reimplemented in terms of Cluster.Events, which makes
it more resilient -- it had a bug where, when a not-self leader was
marked, it would throw, since the membership was only partially
maintained
- the singleton tests are now passing consistently
- the cluster OnDownAction shutdown is now a graceful shutdown and has a
timeout. The current impl is harsh and just delays the shutting down.
Member Author

ktoso commented Jan 21, 2020

It's not the cleanest / most separated-concerns PR of all time... it was quite hard to figure out / locate some of the issues 🙇 I expect the follow-ups to be more self-contained; will comment some more inline.


public func leave() {
    self.ref.tell(.command(.downCommand(self.node.node)))
}
Member Author

may get its own command; we have a .leaving status in membership.

The idea is that while leaving we may still perform actions, but others would not give us new work etc.
This would be used by virtual actors and singletons.

}
})
}
}
Member Author

A slight delay is useful to allow the .down gossip to spread to others before we really die.

Tests also cover the "shutdown immediately" case, so that all works correctly 👍
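To make the "slight delay" concrete, here is a sketch of the shape such an on-down action can take. The enum below is illustrative and not the actual OnDownActionStrategySettings; it only demonstrates delaying the shutdown so the gossip has a chance to spread, and assumes ActorSystem.shutdown() as the shutdown entry point:

import Dispatch

// Illustrative only.
enum IllustrativeOnDownAction {
    case none
    case gracefulShutdown(delaySeconds: Double)

    func make() -> (ActorSystem) -> Void {
        switch self {
        case .none:
            return { _ in } // keep running; the "zombie" case the PR description warns about
        case .gracefulShutdown(let delaySeconds):
            return { system in
                // wait a moment so the .down gossip reaches peers, then shut down
                DispatchQueue.global().asyncAfter(deadline: .now() + delaySeconds) {
                    system.shutdown()
                }
            }
        }
    }
}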

public var downingStrategy: DowningStrategySettings = .none
/// Strategy how members determine if others (or myself) shall be marked as `.down`.
/// This strategy should be set to the same (or compatible) strategy on all members of a cluster to avoid split brain situations.
public var downingStrategy: DowningStrategySettings = .timeout(.default)
Member Author

I figured we should make downing on by default after all...

We should soon implement a slightly better one, but for the sake of showing how singletons move around etc., I guess let's leave it on...

Opinions @drexin @yim-lee ?

Member

I figured we should make downing on by default after all...

+1
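Related usage sketch for nodes that want to opt back out of the new default (assuming the usual settings-closure initializer; both strategy values appear in this diff):

// All members should use the same (or a compatible) downing strategy to avoid split-brain.
let system = ActorSystem("Example") { settings in
    settings.cluster.downingStrategy = .none // opt out; .timeout(.default) is the new default
}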

/// When this member node notices it has been marked as `.down` in the membership, it can automatically perform an action.
/// This setting determines which action to take. Generally speaking, the best course of action is to quickly and gracefully
/// shut down the node and process, potentially leaving a higher level orchestrator to replace the node (e.g. k8s starting a new pod for the cluster).
public var onDownAction: OnDownActionStrategySettings = .gracefulShutdown(delay: .seconds(3))
Member Author

This resolves an ancient ticket, #55. I think it is indeed correct to keep the default of killing the actor system, as we chatted in the ticket -- sanity check that we still agree with this, @drexin? :)
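Usage-wise this stays a one-liner; the delay mostly needs to be long enough for the .down gossip to spread, while restarting the process is left to the orchestrator (the value below is illustrative):

settings.cluster.onDownAction = .gracefulShutdown(delay: .seconds(5)) // or .seconds(3), the default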

}
}

func tryIntroduceGossipPeer(_ context: ActorContext<Message>, _ state: ClusterShellState, change: Cluster.MembershipChange, file: String = #file, line: UInt = #line) {
Member Author

This is because of the lack of #371, but also because it's "faster": we know exactly who the peers are, so we're cheating instead of waiting for gossip rounds of the receptionist 🤔

state.gossipControl.introduce(peer: gossipPeer)
}
}
// TODO: was this needed here? state.gossipControl.update(Cluster.Gossip())
Member Author

it was not :)

case .retryHandshake(let initiated):
return self.connectSendHandshakeOffer(context, state, initiated: initiated)

// FIXME: this is now a cluster event !!!!!
Member Author

The comment is not aligned with reality anymore; the command here is "from the failure detector" and is used only by SWIM to tell us about reachability changes. SWIM does not know about Cluster.Member, thus it cannot emit the reachability cluster event. This is ok.

}

public func down(member: Cluster.Member) {
    self.ref.tell(.command(.downCommandMember(member)))
Member Author

This is interesting / important: thanks to using this version internally, we correctly carry all the metadata that a member has -- such as the reachability at the moment when someone decided to call down. Mostly a "more correct view of the cluster membership" change. For end users, calling either of them will yield the expected result.
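For illustration, both entry points as an end user would call them (otherNode / otherMember are placeholders):

system.cluster.down(node: otherNode)     // by node; convenient for user code
system.cluster.down(member: otherMember) // by Cluster.Member; carries reachability etc., preferred internally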

/// and `.down` in the high-level membership.
public var suspicionTimeoutPeriodsMax: Int = 10
public var suspicionTimeoutPeriodsMin: Int = 10
// public var suspicionTimeoutPeriodsMin: Int = 10 // FIXME: this is once we have LHA, Local Health Aware Suspicion
Member Author

SWIM: this was not used nor implemented yet. It indeed is important, but only once LHA is implemented #352
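For reference, a sketch of the Lifeguard-style Local Health Aware suspicion timeout that these min/max periods would eventually feed into once #352 lands (formula per the Lifeguard paper, not code from this repository):

import Foundation

// The timeout starts at maxPeriods and shrinks toward minPeriods as more independent
// confirmations (c) of the suspicion arrive, out of k expected confirmers.
func suspicionTimeoutPeriods(minPeriods: Double, maxPeriods: Double, confirmations c: Double, expectedConfirmations k: Double) -> Double {
    max(minPeriods, maxPeriods - (maxPeriods - minPeriods) * (log(c + 1) / log(k + 1)))
}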

self._cachedAssociationRemoteControl = remoteControl // TODO: atomically...
self.system.log.warning("FIXME: Workaround, ActorRef's RemotePersonality had to spin \(spinNr) times to obtain remoteControl to send message to \(self.address)")
// self._cachedAssociationRemoteControl = remoteControl // TODO: atomically cache a remote control?
return remoteControl
Member Author

The terrible concurrency issue around association lookup 😱 We must fix this; more details in #383 (I will work on it ASAP, as it means message loss).

import DistributedActorsTestKit
import XCTest

final class ActorSingletonPluginClusteredTests: ClusteredNodesTestBase {
Member Author

Separated cluster-less tests from Cluster tests for the singleton


second.cluster.down(node: first.cluster.node.node)

try self.capturedLogs(of: first).awaitLogContaining(self.testKit(first), text: "Self node was marked [.down]!")
Member Author

Checking logs is, weirdly enough, more reliable than inspecting the nodes' status, as we want to check that they didn't accidentally cause a shutdown etc.

@ktoso ktoso changed the title =cluster #55 add automatic OnDownActions, default to system shutdown =cluster #55 #377 #383 OnDownActions, harden singleton & Downing tests, fix TimeoutDowningStrategy Jan 21, 2020
@ktoso ktoso changed the title =cluster #55 #377 #383 OnDownActions, harden singleton & Downing tests, fix TimeoutDowningStrategy =cluster #55 #377 #383 #378 OnDownActions, harden singleton & Downing tests, fix TimeoutDowningStrategy Jan 21, 2020
Member Author

ktoso commented Jan 21, 2020

Whooo boy... this hardened a ton of stuff. 🤗
Onwards to #383 and merging #376.

Post-merge reviews are welcome, though don't stress too much about it.

@ktoso ktoso merged commit 8668220 into apple:master Jan 21, 2020
@ktoso ktoso deleted the wip-a-downed-node-automatically-shutdown branch January 21, 2020 11:28
metadata["managerRef"] = "\(managerRef.address)"
}

return metadata
Member

👍

remoteControl.sendUserMessage(type: Message.self, envelope: Envelope(payload: .message(message)), recipient: self.address)
} else {
self.deadLetters.adapted().tell(message, file: file, line: line)
pprint("no remote control!!!! \(self.address)")
Member

intended?



Development

Successfully merging this pull request may close these issues.

test_singletonByClusterLeadership_withLeaderChange MUST work when down(self) is issued
Node upon being .down should initiate shutdown
