
Conversation

@ktoso
Member

@ktoso ktoso commented Apr 22, 2020

Motivation:

We did not handle some situations in which an association may be needed.

E.g. you get a ref to node C from node B, but you have never seen node C before. If you start sending messages to that ref before the rest of the system has established the association, you could end up sending them into the void, causing message loss.

Modifications:

  • the association is created eagerly: we store it as we initiate the handshake dance, not only as its result
  • this gives us a place to enqueue / buffer the messages to be sent

Result:

@ktoso ktoso changed the title from "[WIP] Association creation revamp" to "Remote Association creation revamp, removal of hack-workaround in RemoteRef" on Apr 24, 2020
@ktoso
Member Author

ktoso commented Apr 24, 2020

Trying to figure out why test_ensureDownAndRemovalSpreadsToAllMembers fails more often now... This PR should not have impacted SWIM much, really 🤔

@ktoso
Member Author

ktoso commented Apr 24, 2020

Follow up for future: Reduce locking or lockless Associations #579

@ktoso
Member Author

ktoso commented Apr 24, 2020

Facing an always-reproducing failure in ClusterLeaderActionsClusteredTests.test_ensureDownAndRemovalSpreadsToAllMembers now, but haven't been able to find out why for a few hours... The node is not moving into suspect, but that's SWIM behavior which is really unchanged by this PR, yet it reliably fails here and is fine on master 🤔

@yim-lee
Member

yim-lee commented Apr 24, 2020

Is the failing test making the right assumptions/assertions? 🤔

@ktoso
Member Author

ktoso commented Apr 27, 2020

Sadly it seems it has the right assumptions hm hm, digging more.

@ktoso
Member Author

ktoso commented Apr 27, 2020

Seems I'm not used to the revised SWIM behavior after the introduction of the LHA modifications :) It can take a while to trigger the down now in small clusters. That's fair and fine to be honest -- being conservative about triggering those is better safe than sorry. Tweaked some things and made some minor cleanups to make the logs less verbose.

@ktoso
Member Author

ktoso commented Apr 27, 2020

Huh, an integration test failure... probably timing there, checking. Turned out to be a place that used localhost instead of 127.0.0.1.

@ktoso
Member Author

ktoso commented Apr 27, 2020

@swift-server-bot test this please

@ktoso
Member Author

ktoso commented Apr 27, 2020

Ok all green here, unlocked another test now :)

/// is clear about its current lifecycle state (it may have already terminated the moment the message was sent,
/// or even before then). To obtain lifecycle status of this actor the usual strategy of watching it needs to be employed.
// TODO: reimplement as CellDelegate as it shall become simply another transport?
public final class RemoteClusterActorPersonality<Message: Codable> {
Member

Can public be removed?

Member Author

I wish! Sadly that's very tricky :-(

I'll make a ticket to see if we'd be able to make it so. It shows up in the public enum Personality { so it has to be public, and in some protocol requirements...

It would be nice if we could make those internal though; but on the other hand, if we need external ActorTransports, we may need to keep those public -- and e.g. CellDelegate even has to be open 🤔

Member Author

To revisit: #591

let testKey: NIOSSLPrivateKeySource = .privateKey(try NIOSSLPrivateKey(bytes: [UInt8](testKey1.utf8), format: .pem))
let local = self.setUpNode("local") { settings in
settings.cluster.node.host = "127.0.0.1"
settings.cluster.node.host = "localhost"
Member

When should one use 127.0.0.1 vs localhost?

Member Author

That's a slight annoyance :-( We compare whether the address someone extends a handshake to is "the same" as the one we bound to. So if a service bound as localhost, other nodes must join("localhost", ...), and the same goes for 127.0.0.1... We should perhaps add some special handling for this, so we can treat localhost as 127.0.0.1 when doing this comparison?
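To make that concrete, here is a minimal, purely illustrative sketch (the function names are made up; this is not what the library does today) of normalizing the host before comparing endpoints:

```swift
// Illustrative sketch only: treat "localhost" as "127.0.0.1" for the bind/handshake comparison.
func normalizedHost(_ host: String) -> String {
    host == "localhost" ? "127.0.0.1" : host
}

func isSameEndpoint(host: String, port: Int, boundHost: String, boundPort: Int) -> Bool {
    normalizedHost(host) == normalizedHost(boundHost) && port == boundPort
}

// isSameEndpoint(host: "localhost", port: 7337, boundHost: "127.0.0.1", boundPort: 7337) == true
```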

Member Author

New ticket: #590

Member

So the behavior is the same for localhost and 127.0.0.1--basically pick one and stick with it? I guess we just need to document/explain it somewhere. I saw other code comments and thought 127.0.0.1 was preferred (maybe it is) and wondered why we changed to localhost here.

Member Author

Yeah, good point -- we should "stick to 127.0.0.1"; this change here likely was not necessary, hm...

@ktoso
Member Author

ktoso commented Apr 28, 2020

And a nasty double-completed association... debugging

@ktoso
Member Author

ktoso commented Apr 28, 2020

I broke things while trying to get more tests passing in a stable way... Will have to revisit again.

test_sendingMessageToNotYetAssociatedNode_mustCauseAssociationAttempt now fails, which was the entire purpose of this PR, so I messed up big time somewhere since Monday :-/ and the Singleton tests are manifesting problems...

@ktoso
Member Author

ktoso commented Apr 29, 2020

Woohoo, all green on the "nasty ones", though the PR needs a rebase as it hit:

18:48:55 error: Expected String [actor:sact://CRDTSerializationTests@127.0.0.1:9001/user/alpha] to contain: [actor:sact://CRDTSerializationTests@localhost:9001/user/alpha]

which was introduced yesterday -- since we stick to 127.0.0.1 by default now.

public struct ActorTestKitSettings {
/// Timeout used by default by all the `expect...` and `within` functions defined on the testkit and test probes.
var expectationTimeout: TimeAmount = .seconds(3)
var expectationTimeout: TimeAmount = .seconds(5)
Member Author

I looked at a test failure on CI just now; these tests often are "just on the edge of making the timeout" -- locally they're fine, and they miss it more often on CI. This matters for multi-node tests where we're waiting for some random gossip to cause the event etc. They're not entirely predictable, so raising the default general timeout will, I think, be good for us.

If a message comes in quickly there's no problem at all -- an expect gets it and returns immediately, so this increase should NOT have an impact on build times, except for tests which are correct but slow (because it's non-deterministic how/when that message is going to be sent).

Member Author

In other words, 3 seconds often means we give gossip only 2 rounds (often not lucky enough to make 3) to get us the information; with 5 we give it more chances -- so the information (most often cluster events like down / removed) has a chance to do the gossip dance (convergence needs a few rounds) and get us the event.

let eventsOnFirstSub = try p1.expectMessages(count: 9)
for event in eventsOnFirstSub {
pinfo("Captured event: \(event)")
}
Member Author

To help debug good vs. bad executions.

let system = ActorSystem("it_XPCActorable_echo_service") { settings in
settings.transports += .xpcService

settings.cluster.swim.failureDetector.pingTimeout = .seconds(3)
Member Author

@ktoso ktoso Apr 30, 2020

This really is an acknowledgement of how we now use SWIM: it IS our failure detector.

It used to be both the only membership we had and the failure detector.
Now the settings are more true to how we actually use it.

wait_log_exists ${first_logs} 'Binding to: ' 200 # since it might be compiling again...

stdbuf -i0 -o0 -e0 swift run it_Clustered_swim_suspension_reachability 8228 localhost 7337 > ${second_logs} 2>&1 &
stdbuf -i0 -o0 -e0 swift run it_Clustered_swim_suspension_reachability 8228 127.0.0.1 7337 > ${second_logs} 2>&1 &
Member Author

Changed the defaults everywhere; I was getting confused by the mix. Now it's 127.0.0.1 everywhere by default.

UniqueNode from = 2;
Node to = 3;
UniqueNode originNode = 2;
Node targetNode = 3;
Member Author

one of the first things / types I ever did for this project 😉
The naming from/to seemed good at the time, but since then we have settled on the more common local/remote and target/origin wording, so we keep those words all the way through.

Member

Just curious--How come targetNode is Node instead of UniqueNode?

Member Author

That's on purpose -- if, say, a user writes join(Node(127.0.0.1, 8888)) we don't know that node's NodeID, and thus it cannot be a UniqueNode; only once the node replies do we know its unique id.

(If we discovered the node via gossip, or via an actor ref pointing at an actor on the remote node, we would indeed know the NID; but the handshake is extended to "some node on that address".)

If we end up associating and that NID is different from the uniqueNode used by some actor ref, it means that other unique node is likely "wrong", in the sense that it likely was a previous actor system instance on the same host:port while this is a new instance/process. Such messages would end up being dead-lettered.
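For readers of this thread, a simplified sketch of the distinction (field names are illustrative; the real types carry more information):

```swift
// Simplified sketch: a plain Node is just an address, while a UniqueNode additionally
// carries the per-process NodeID that is only learned during the handshake.
struct Node: Hashable {
    let host: String
    let port: Int
}

struct NodeID: Hashable {
    let value: UInt64 // random per actor-system instance (i.e. per process start)
}

struct UniqueNode: Hashable {
    let node: Node   // what join(...) can know up front
    let nid: NodeID  // only known once the target node replies
}
```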

Member Author

@ktoso ktoso May 1, 2020

I'll add a specific test for this: Test: Hold "old ref" while new associated node on same host:port is associated #603

// end::cluster-sample-actors-discover-and-chat[]

system.park(atMost: .seconds(60))
system.park(atMost: .seconds(6000))
Member Author

Basically keep them around "forever" when testing and looking at logs ;)

// where Message: ActorMessage {
// self.watch(try self.spawn(naming, props: props, behavior))
// }

Member Author

cleanup

@ktoso
Member Author

ktoso commented Apr 30, 2020

Heh, all green on 5.3 but some flaky tests remain. I won't fix them all in this PR, but will continue the fight ⚔️

Was:

Message: \(message)
Actor: \(context)
This is a bug, please open a ticket.
""", file: file, line: line)
Member Author

More details in case this happens again; the previous crash message would not have been easy to follow up on.


// ==== ----------------------------------------------------------------------------------------------------------------
// MARK: Remote Association State Machine
// MARK: Association
Member Author

the change™

Associations used to be created only once a handshake completed. That notion -- "an association is the result of a handshake" -- is wrong and cannot work.

Reasons:

  • we might be racing handshakes, and there are various reasons to kick off a handshake
  • while the handshake is being negotiated, we might already be trying to send to the remote; we MUST have a place to queue up messages rather than dropping them on the floor while the handshake is in flight
    • this means the association MUST be created eagerly, and we then always use it to "send"; a send may be a) a real send, b) an enqueue, or c) a no-op because the association is a tombstone (the node is down/dead)

These associations must be accessible without actor messaging and cacheable by remote refs -- this way sending to a remote does not need a "request/reply" with the actor shell; we just grab the association (and we can cache it (!)). Getting the association today takes a normal lock, but we will be able to either use an RWLock or go lockless (hopefully... it's technically possible, though somewhat hard -- I'd want to do this with Swift atomics).
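To make the intended shape concrete, here is a minimal sketch of the eager-association idea (purely illustrative: type and member names are made up and this is not the actual Association implementation; the real code guards more state under its lock):

```swift
import Foundation

// Illustrative sketch: an association created eagerly, before the handshake completes.
final class EagerAssociation<Message> {
    private enum State {
        case associating(buffer: [Message])       // handshake in flight: queue sends
        case associated(send: (Message) -> Void)  // handshake done: send directly
        case tombstone                            // node is down/dead: drop sends
    }

    private let lock = NSLock()
    private var state: State = .associating(buffer: [])

    /// Always callable: either sends right away, enqueues, or drops (tombstone).
    func sendUserMessage(_ message: Message) {
        lock.lock(); defer { lock.unlock() }
        switch state {
        case .associating(var buffer):
            buffer.append(message)
            state = .associating(buffer: buffer)
        case .associated(let send):
            send(message)
        case .tombstone:
            break // the real system dead-letters these; here we simply drop
        }
    }

    /// Called once the handshake completes: flush buffered messages in order, then send directly.
    func completeAssociation(send: @escaping (Message) -> Void) {
        lock.lock(); defer { lock.unlock() }
        if case .associating(let buffer) = state {
            buffer.forEach(send)
        }
        state = .associated(send: send)
    }

    /// Called when the node is declared down/dead.
    func terminate() {
        lock.lock(); defer { lock.unlock() }
        state = .tombstone
    }
}
```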

}

/// Complete the association and drain any pending message sends onto the channel.
// TODO: This style can only ever work since we lock around the entirety of enqueueing messages and this setting; make it such that we don't need the lock eventually
Member Author

Reduce locking or lockless Associations #579

}

// 2) execute any pending tasks and clear them
self.runCompletionTasks()
Member Author

Long story short, it would be very annoying to have futures in here, since we need to create associations in remote refs and other places -- keeping this list of tasks is much simpler.
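As a rough illustration of the pattern (names made up; the real code runs these under the association's lock), callers register closures that are executed exactly once when the association completes, instead of being handed futures:

```swift
// Illustrative sketch of a "completion tasks" list used instead of futures.
final class CompletionTasks {
    private var tasks: [() -> Void] = []

    /// Register work to run once the association completes (e.g. flushing a remote ref's sends).
    func onComplete(_ task: @escaping () -> Void) {
        tasks.append(task)
    }

    /// Execute any pending tasks and clear them.
    func runCompletionTasks() {
        let pending = tasks
        tasks = []
        pending.forEach { $0() }
    }
}
```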

///
/// Tombstones are slightly lighter than a real association, and are kept for a maximum of `settings.cluster.associationTombstoneTTL` TODO: make this setting (!!!)
/// before being cleaned up.
struct Tombstone: Hashable {
Member Author

Upcoming work to polish tombstone handling: Cluster: Prune association tombstones #356

return state
}
self.terminateAssociation(context.system, state: &state, memberToDown.node)
state = self.interpretLeaderActions(context.system, state, state.collectLeaderActions())
Member Author

The leader actions internally check whether we're the leader and should act; less special-casing here 👍

/// This MAY return `inFlight`, in which case it means someone already initiated a handshake with given node,
/// and we should _do nothing_ and trust that our `whenCompleted` will be notified when the already in-flight handshake completes.
mutating func registerHandshake(with remoteNode: Node, whenCompleted: EventLoopPromise<Wire.HandshakeResponse>) -> HandshakeStateMachine.State {
mutating func beginHandshake(with remoteNode: Node, whenCompleted: EventLoopPromise<Wire.HandshakeResponse>) -> HandshakeStateMachine.State {
Member Author

It's not just registering the handshake, it's kicking it off.

let replyProbe1 = self.testKit(first).spawnTestProbe(expecting: String.self)
let ref1 = try first.singleton.host(GreeterSingleton.Message.self, settings: singletonSettings, GreeterSingleton.makeBehavior(instance: GreeterSingleton("Hello-1")))
ref1.tell(.greet(name: "Charlie-1", _replyTo: replyProbe1.ref))
ref1.tell(.greet(name: "Alice-1", _replyTo: replyProbe1.ref))
Member Author

The names here were a bit confusing (they were the same), so I adjusted them a bit.

try assertAssociated(remote, withExactly: local.cluster.node)
}

func test_boundServer_shouldAcceptAssociate_raceFromBothNodes() throws {
Member Author

This would have failed before, now it's solid.

}
}

func test_sendingMessageToNotYetAssociatedNode_mustCauseAssociationAttempt() throws {
Member Author

This would have dropped messages and never worked before, now just works 🚀

@ktoso
Member Author

ktoso commented Apr 30, 2020

FAILED TESTS:
  - 3089:Test Case 'ClusterLeaderActionsClusteredTests.test_down_to_removed_ensureRemovalHappensWhenAllHaveSeenDown' failed (3.134 seconds)
  - 3432:Test Case 'ClusterLeaderActionsClusteredTests.test_up_ensureAllSubscribersGetMovingUpEvents' failed (3.06 seconds)
  - 3886:Test Case 'SWIMShellClusteredTests.test_SWIMShell_shouldMonitorJoinedClusterMembers' failed (1.13 seconds)

Hmm, weird -- those are pretty stable locally. I'll dig some more tomorrow before merging.

@ktoso ktoso requested a review from yim-lee April 30, 2020 15:19
@ktoso
Member Author

ktoso commented Apr 30, 2020

(Not expecting a full review here, but if you'd want to skim it @yim-lee or @drexin -- most notably Association, ClusterShell and ClusterShellState -- that'd be cool; the other files I think you can ignore. I commented on the more crucial parts as well.)

@ktoso ktoso requested a review from drexin April 30, 2020 15:20

/// If enabled, logs membership changes (including the entire membership table from the perspective of the current node).
public var logMembershipChanges: Logger.Level? = .info
public var logMembershipChanges: Logger.Level? = .debug
Member

Doesn't .debug lead to more logging though?

/// Since we MAY have 2 connections open at this point in time -- one we opened, and another that was opened
/// to us when the other node tried to associated, we'll perform a tie-breaker to ensure we predictably
/// only use _one_ of them, and close the other.
// let selectedChannel: Channel
Member

not used?

}
if let n = uniqueNode {
metadata["handshake/peer"] = "\(n)"
}
Member

Is it possible to have both node and uniqueNode be non-nil?

// ./' ./' ./' Never gonna give you up, never gonna let you down ./' ./' ./'
// we currently never give up

// FIXME: implement giving up reconnecting
Member

ticketed?

Member Author

Related to: Cluster: Prune association tombstones #356
I'll make a new one specifically for this though, thanks! Smaller tickets == better..

Cluster: Implement retrying and giving up in handshakes #604
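For illustration only (the policy and names here are made up; #604 tracks the real design), "giving up" could look roughly like capping reconnect attempts with exponential backoff and tombstoning the association once exhausted:

```swift
import Foundation

// Illustrative reconnection policy sketch; not the actual cluster implementation.
struct ReconnectPolicy {
    var attemptsMade = 0
    let maxAttempts = 8
    let baseDelay: TimeInterval = 0.25

    /// Returns the delay before the next attempt, or nil once we should give up
    /// (at which point the association would be replaced with a tombstone).
    mutating func nextDelay() -> TimeInterval? {
        guard attemptsMade < maxAttempts else { return nil }
        let delay = baseDelay * pow(2.0, Double(attemptsMade))
        attemptsMade += 1
        return delay
    }
}
```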


do {
let bytes: ByteBuffer = try proto.serializedByteBuffer(allocator: context.channel.allocator)
// TODO: should we use the serialization infra ourselves here? I guess so...
Member

need ticket?

Member Author

Thanks yes #605

_ = try p7337.expectMessage()
_ = try p8228.expectMessage()
// _ = try firstProbe.expectMessage()
// _ = try secondProbe.expectMessage()
Member

intended?

// MARK: Joining into existing cluster

// FIXME: unlock this test
// FIXME: unlock this test // revisit
Member

need this still?

Member Author

Oh no, I forgot to unlock it; thanks. It's not passing... :\

ktoso added 17 commits May 1, 2020 10:41
solve the handshake truncation issue; ordering of flushing and sending the handshake

we solve this by strongly ordering that flushing messages only happens after we write the handshake reply; as well as signaling the replyTo of the handshake only once we've swapped the association to ready

much confusion and in one of many code paths we get it slightly wrong -- and it's really only used in tests... not entirely worth it -- we will revive when we need it
@ktoso
Member Author

ktoso commented May 1, 2020

(Fighting some last timing-sensitive tests on CI, adding more debug information / timings to diagnose better.)

Oh I see, there's two issues:

@ktoso
Member Author

ktoso commented May 1, 2020

Ok, I think I've had it with this PR growing too large; we have far fewer flaky tests now, but enough remain to be annoying and we must fix them ASAP as it hinders development -- I'll continue on this when I'm back 🙇

I want to go back to small PRs addressing specific failures.

Now: #608

@ktoso ktoso merged commit 6fb66f4 into apple:master May 1, 2020
@ktoso ktoso deleted the wip-transport-association-revamp branch May 1, 2020 12:34