Remote Association creation revamp, removal of hack-workaround in RemoteRef #577
Conversation
Trying to figure out why

Follow up for the future: Reduce locking or lockless Associations #579

Facing an always reproducing

Is the failing test making the right assumptions/assertions? 🤔

Sadly, it seems it does have the right assumptions, hm hm; digging more.

Seems I'm not used to the revised SWIM behavior after the introduction of the LHA modifications :) It can take a while to trigger the down now in small clusters. That's fair and fine to be honest; being conservative about triggering those is better safe than sorry. Tweaked some things and did some minor cleanups to make the logs less verbose.

@swift-server-bot test this please

Ok, all green here; unlocked another test now :)
/// is clear about its current lifecycle state (it may have already terminated the moment the message was sent,
/// or even before then). To obtain lifecycle status of this actor the usual strategy of watching it needs to be employed.
// TODO: reimplement as CellDelegate as it shall become simply another transport?
public final class RemoteClusterActorPersonality<Message: Codable> {
Can public be removed?
I wish! Sadly, that's very tricky :-(
I'll make a ticket to see if we'd be able to make it so. It shows up in the public enum Personality { so it has to be public, and it appears in some protocol requirements...
Would be nice if we could make those internal though; on the other hand, if we need external ActorTransports, we may not only need to keep those public, but e.g. CellDelegate may even have to be open 🤔
To revisit: #591
let testKey: NIOSSLPrivateKeySource = .privateKey(try NIOSSLPrivateKey(bytes: [UInt8](testKey1.utf8), format: .pem))
let local = self.setUpNode("local") { settings in
settings.cluster.node.host = "127.0.0.1"
settings.cluster.node.host = "localhost"
When should one use 127.0.0.1 vs localhost?
That's a slight annoyance :-( We compare whether the address someone extends a handshake to is "the same" as the address we bound to. So if a service bound as localhost, other nodes must join("localhost" ..., and the same goes for 127.0.0.1... We should perhaps add some special handling for this, to be able to treat localhost as 127.0.0.1 when doing this comparison?
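For illustration, a minimal sketch of such special handling, using a simplified stand-in node type (the real Node type and comparison logic in the library differ):

```swift
// Hypothetical, simplified sketch: treat "localhost" as "127.0.0.1" when
// comparing the node we bound to against the node a handshake targets.
struct SimpleNode: Equatable {
    var host: String
    var port: Int

    /// Copy with the well-known loopback alias normalized, so that
    /// "localhost" and "127.0.0.1" compare as the same host.
    var normalized: SimpleNode {
        var copy = self
        if copy.host == "localhost" {
            copy.host = "127.0.0.1"
        }
        return copy
    }
}

let bound = SimpleNode(host: "localhost", port: 7337)
let target = SimpleNode(host: "127.0.0.1", port: 7337)
assert(bound.normalized == target.normalized) // treated as the same node
```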
New ticket: #590
So the behavior is the same for localhost and 127.0.0.1--basically pick one and stick with it? I guess we just need to document/explain it somewhere. I saw other code comments and thought 127.0.0.1 was preferred (maybe it is) and wondered why we changed to localhost here.
Yeah, good point, we should "stick to 127.0.0.1"; this change here likely was not necessary, hm...
And a nasty double "complete association"... debugging

I broke things while trying to get more things passing in a stable way... Will have to revisit again.

Woohoo, all green out of the "nasty ones", though the PR needs a rebase as it hit
public struct ActorTestKitSettings {
/// Timeout used by default by all the `expect...` and `within` functions defined on the testkit and test probes.
var expectationTimeout: TimeAmount = .seconds(3)
var expectationTimeout: TimeAmount = .seconds(5)
I looked at a test failure on CI just now, and they often are "just on the edge of making the timeout": locally they're fine, and they more often miss it on CI. These matter for multi-node tests where we're waiting for some random gossip to cause the event etc. They're not entirely predictable, so increasing the default general timeout I think will be good for us.
If a message comes in quickly there's no problem at all -- an expect immediately gets it and returns after all, so this increase should NOT have an impact on build times, except for hitting those tests which are right but slow (because it is non-deterministic how/when that message is going to be sent).
In other words, 3 seconds often means that we give gossip 2 rounds (often not lucky enough to make 3) to get us the information; with 5 we give it more chances, so the information (most often cluster events like down / removed) has a chance to do the gossip dance (convergence needs a few rounds) and get us the event.
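To illustrate why a larger default does not slow down passing tests: an expect call returns as soon as the message arrives and only waits out the full timeout when no message shows up at all. A simplified, hypothetical probe sketch (not the actual ActorTestKit implementation):

```swift
import Foundation

// Hypothetical, simplified probe: polls a buffer until a message arrives or the
// deadline passes. A larger timeout only costs time when the test would fail anyway.
final class SimpleProbe<Message> {
    private var messages: [Message] = []
    private let lock = NSLock()

    struct ExpectationTimeout: Error {}

    func tell(_ message: Message) {
        lock.lock()
        messages.append(message)
        lock.unlock()
    }

    func expectMessage(within timeout: TimeInterval = 5) throws -> Message {
        let deadline = Date().addingTimeInterval(timeout)
        while Date() < deadline {
            lock.lock()
            let next = messages.isEmpty ? nil : messages.removeFirst()
            lock.unlock()
            if let message = next {
                return message // returns immediately once a message is available
            }
            Thread.sleep(forTimeInterval: 0.01)
        }
        throw ExpectationTimeout()
    }
}
```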
let eventsOnFirstSub = try p1.expectMessages(count: 9)
for event in eventsOnFirstSub {
pinfo("Captured event: \(event)")
}
To help with debugging good vs. bad executions.
let system = ActorSystem("it_XPCActorable_echo_service") { settings in
settings.transports += .xpcService

settings.cluster.swim.failureDetector.pingTimeout = .seconds(3)
This really is an acknowledgement of how we now use SWIM: it IS our failure detector.
It used to be both the only membership we had and the failure detector.
Now the settings are more true to how we use it.
wait_log_exists ${first_logs} 'Binding to: ' 200 # since it might be compiling again...

stdbuf -i0 -o0 -e0 swift run it_Clustered_swim_suspension_reachability 8228 localhost 7337 > ${second_logs} 2>&1 &
stdbuf -i0 -o0 -e0 swift run it_Clustered_swim_suspension_reachability 8228 127.0.0.1 7337 > ${second_logs} 2>&1 &
Changed defaults everywhere; I was getting confused by the mix. Now it's 127.0.0.1 everywhere by default.
UniqueNode from = 2;
Node to = 3;
UniqueNode originNode = 2;
Node targetNode = 3;
One of the first things / types I ever did for this project 😉
The naming from/to seemed good at the time, but since then we have settled on the more common local/remote and target/origin wording, so I'm keeping those words consistent all the way through.
Just curious--How come targetNode is Node instead of UniqueNode?
That's on purpose -- say a user writes join(Node(127.0.0.1, 8888)); we don't know that node's NodeID, and thus it cannot be a UniqueNode. Only once the node replies do we know its unique id.
If we discovered the node using gossip, or via an actor ref indicating an actor on the remote node, we would indeed know the NID; but the handshake is extended to "some node at that address".
If we end up associating and that NID is different from the uniqueNode used by some actor ref, it means that other unique node is likely "wrong", in the sense that it likely was a previous actor system instance on the same host:port, while this is a new instance/process. Such messages would end up being dead-lettered.
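A rough sketch of the Node vs. UniqueNode distinction, with simplified stand-in types (the real definitions carry more information, such as the system name):

```swift
// Simplified stand-ins for illustration only.
struct Node: Hashable {
    var host: String
    var port: Int
}

struct UniqueNode: Hashable {
    var node: Node
    var nid: UInt64 // assigned per actor-system instance; unknown until the node replies
}

// join(...) can only take a Node: the caller knows the address, not the NID.
let seed = Node(host: "127.0.0.1", port: 8888)

// Once the handshake completes, the reply tells us which unique instance lives there.
let discovered = UniqueNode(node: seed, nid: 0xCAFE)

// A ref holding an older UniqueNode (same host:port, different NID) refers to a
// previous incarnation of the system; messages sent through it get dead-lettered.
let stale = UniqueNode(node: seed, nid: 0xBEEF)
assert(stale != discovered)
```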
I'll add a specific test for this: Test: Hold "old ref" while new associated node on same host:port is associated #603
// end::cluster-sample-actors-discover-and-chat[]

system.park(atMost: .seconds(60))
system.park(atMost: .seconds(6000))
Basically keep them around "forever" when testing and looking at logs ;)
// where Message: ActorMessage {
// self.watch(try self.spawn(naming, props: props, behavior))
// }
cleanup
Heh, all green on 5.3, but some flaky tests remain. I'll not fix them all in this PR, but will continue the fight ⚔️ Was:
Message: \(message)
Actor: \(context)
This is a bug, please open a ticket.
""", file: file, line: line)
More details in case this happens again; the previous crash would not have been easy to follow up on.
// ==== ----------------------------------------------------------------------------------------------------------------
// MARK: Remote Association State Machine
// MARK: Association
The change™
Associations used to be created only once a handshake completed. So "an association is the result of a handshake"; this is wrong and cannot work.
Reasons:
- we might be racing handshakes, and there are various reasons to kick off a handshake
- while the handshake is being made, we might already be trying to send to the remote; we MUST have a place to queue up messages rather than dropping them on the floor while the handshake is negotiating
- this means the association MUST be created eagerly, and then we always use it to "send"; the send may be a) a real send, b) an enqueue, or c) hit a tombstone (the node is down/dead).

These associations must be accessible without actor messaging and cacheable by remote refs -- this way, sending to a remote does not need a "request/reply" with the actor shell; we just grab the association (and we can cache it (!)). Getting the association today uses a normal lock, but we will be able to either RWLock it or go lockless (hopefully... it's technically possible, though somewhat hard -- I'd want to do this with Swift atomics).
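A condensed sketch of the eager-association idea, assuming simplified message and send types and a plain lock (the real Association differs in many details):

```swift
import Foundation

// Hypothetical, simplified association: created eagerly, before the handshake
// completes, so that sends always have somewhere to go.
final class AssociationSketch<Message> {
    enum State {
        case associating(pending: [Message])      // buffer until the handshake completes
        case associated(send: (Message) -> Void)  // hand messages to the real channel
        case tombstone                            // node is down/dead; dead-letter messages
    }

    private var state: State = .associating(pending: [])
    private let lock = NSLock() // TODO: RWLock or lockless via atomics (#579)

    /// Always callable: either sends, enqueues, or drops onto the tombstone.
    func sendUserMessage(_ message: Message) {
        lock.lock()
        defer { lock.unlock() }
        switch state {
        case .associating(var pending):
            pending.append(message)
            state = .associating(pending: pending)
        case .associated(let send):
            send(message)
        case .tombstone:
            break // dead-letter the message
        }
    }

    /// Called once the handshake completes: flush buffered messages in order, then go live.
    func completeAssociation(send: @escaping (Message) -> Void) {
        lock.lock()
        defer { lock.unlock() }
        if case .associating(let pending) = state {
            pending.forEach(send)
        }
        state = .associated(send: send)
    }

    /// Called when the node is declared down: further sends become dead letters.
    func terminate() {
        lock.lock()
        defer { lock.unlock() }
        state = .tombstone
    }
}
```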
}

/// Complete the association and drain any pending message sends onto the channel.
// TODO: This style can only ever work since we lock around the entirety of enqueueing messages and this setting; make it such that we don't need the lock eventually
Reduce locking or lockless Associations #579
}

// 2) execute any pending tasks and clear them
self.runCompletionTasks()
Long story short, it would be very annoying to have futures in here, since we need to create associations in remote refs and other places; keeping this task list is much simpler.
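A minimal sketch of the completion-task pattern, with hypothetical names; in the real code these tasks would be registered and run while holding the association's lock:

```swift
// Hypothetical sketch: instead of handing out futures from every place that may
// need the association, callers register plain closures that run once the
// association completes (or immediately, if it already has).
final class CompletionTasks {
    private var completed = false
    private var tasks: [() -> Void] = []

    func onComplete(_ task: @escaping () -> Void) {
        if completed {
            task()
        } else {
            tasks.append(task)
        }
    }

    // 2) execute any pending tasks and clear them
    func runCompletionTasks() {
        completed = true
        tasks.forEach { $0() }
        tasks = []
    }
}
```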
///
/// Tombstones are slightly lighter than a real association, and are kept for a maximum of `settings.cluster.associationTombstoneTTL` TODO: make this setting (!!!)
/// before being cleaned up.
struct Tombstone: Hashable {
Upcoming work to polish tombstone handling: Cluster: Prune association tombstones #356
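For context, a sketch of what TTL-based tombstone pruning could look like, with a hypothetical TTL value standing in for settings.cluster.associationTombstoneTTL (actual pruning is tracked in #356):

```swift
import Foundation

// Hypothetical sketch: a tombstone remembers when it may be cleaned up.
struct TombstoneSketch: Hashable {
    let remoteNodeID: UInt64
    let removalDeadline: Date
}

func makeTombstone(remoteNodeID: UInt64, ttl: TimeInterval) -> TombstoneSketch {
    TombstoneSketch(remoteNodeID: remoteNodeID, removalDeadline: Date().addingTimeInterval(ttl))
}

/// Drops all tombstones whose deadline has passed.
func prune(_ tombstones: Set<TombstoneSketch>, now: Date = Date()) -> Set<TombstoneSketch> {
    tombstones.filter { $0.removalDeadline > now }
}
```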
return state
}
self.terminateAssociation(context.system, state: &state, memberToDown.node)
state = self.interpretLeaderActions(context.system, state, state.collectLeaderActions())
The leader actions internally check whether we're the leader and should do things; less special-casing here 👍
/// This MAY return `inFlight`, in which case it means someone already initiated a handshake with given node,
/// and we should _do nothing_ and trust that our `whenCompleted` will be notified when the already in-flight handshake completes.
mutating func registerHandshake(with remoteNode: Node, whenCompleted: EventLoopPromise<Wire.HandshakeResponse>) -> HandshakeStateMachine.State {
mutating func beginHandshake(with remoteNode: Node, whenCompleted: EventLoopPromise<Wire.HandshakeResponse>) -> HandshakeStateMachine.State {
it's not just register, it's kicking it off
let replyProbe1 = self.testKit(first).spawnTestProbe(expecting: String.self)
let ref1 = try first.singleton.host(GreeterSingleton.Message.self, settings: singletonSettings, GreeterSingleton.makeBehavior(instance: GreeterSingleton("Hello-1")))
ref1.tell(.greet(name: "Charlie-1", _replyTo: replyProbe1.ref))
ref1.tell(.greet(name: "Alice-1", _replyTo: replyProbe1.ref))
The names here were a bit confusing (they were the same), so I adjusted them a bit.
try assertAssociated(remote, withExactly: local.cluster.node)
}

func test_boundServer_shouldAcceptAssociate_raceFromBothNodes() throws {
This would have failed before, now it's solid.
}
}

func test_sendingMessageToNotYetAssociatedNode_mustCauseAssociationAttempt() throws {
This would have dropped messages and never worked before, now just works 🚀
Hmm, weird, those are pretty stable locally; I'll dig some more tomorrow before merging.
/// If enabled, logs membership changes (including the entire membership table from the perspective of the current node).
public var logMembershipChanges: Logger.Level? = .info
public var logMembershipChanges: Logger.Level? = .debug
Doesn't .debug lead to more logging though?
/// Since we MAY have 2 connections open at this point in time -- one we opened, and another that was opened
/// to us when the other node tried to associated, we'll perform a tie-breaker to ensure we predictably
/// only use _one_ of them, and close the other.
// let selectedChannel: Channel
not used?
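For illustration, a tie-breaker can be derived deterministically from the two endpoints, so that both sides independently agree which connection to keep; a minimal sketch with hypothetical types:

```swift
// Hypothetical sketch: when both nodes opened a connection to each other at the
// same time, both sides must pick the same surviving channel without coordination.
// Comparing a stable property of the two endpoints gives that agreement.
struct EndpointID {
    let host: String
    let port: Int
}

/// Deterministic ordering both sides can compute identically.
func isLowerEndpoint(_ lhs: EndpointID, than rhs: EndpointID) -> Bool {
    (lhs.host, lhs.port) < (rhs.host, rhs.port)
}

/// Both nodes evaluate this with the same pair of endpoints, so they agree:
/// keep the connection initiated by the "lower" endpoint, close the other.
func shouldKeepConnectionInitiated(by initiator: EndpointID, over other: EndpointID) -> Bool {
    isLowerEndpoint(initiator, than: other)
}
```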
}
if let n = uniqueNode {
metadata["handshake/peer"] = "\(n)"
}
Is it possible to have both node and uniqueNode be non-nil?
// ./' ./' ./' Never gonna give you up, never gonna let you down ./' ./' ./'
// we currently never give up

// FIXME: implement giving up reconnecting
ticketed?
do {
let bytes: ByteBuffer = try proto.serializedByteBuffer(allocator: context.channel.allocator)
// TODO: should we use the serialization infra ourselves here? I guess so...
need ticket?
Thanks, yes: #605
_ = try p7337.expectMessage()
_ = try p8228.expectMessage()
// _ = try firstProbe.expectMessage()
// _ = try secondProbe.expectMessage()
intended?
// MARK: Joining into existing cluster

// FIXME: unlock this test
// FIXME: unlock this test // revisit
need this still?
Oh no, I forgot to unlock it; thanks. Not passing... :\
Solved the handshake truncation issue (the ordering of flushing and sending the handshake): we solve this by strictly ordering things so that flushing pending messages only happens after we write the handshake reply, and by signaling the handshake's replyTo only once we've swapped the association to ready (see the sketch below).
Much confusion, and in one of many code paths we get it slightly wrong -- and it's really only used in tests... not entirely worth it; we will revive it when we need it.
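A sketch of that ordering, with hypothetical closure parameters standing in for the real handshake/association machinery:

```swift
// Hypothetical sketch of the ordering described above. The invariants:
// (1) buffered messages are flushed only after the handshake reply was written,
// (2) the handshake's replyTo is signalled only after the association is ready.
func completeHandshakeSketch(
    writeHandshakeReply: () -> Void,
    swapAssociationToReady: () -> Void,
    flushPendingMessages: () -> Void,
    signalReplyTo: () -> Void
) {
    writeHandshakeReply()    // 1) the reply goes onto the wire first
    swapAssociationToReady() // 2) the association becomes usable for direct sends
    flushPendingMessages()   // 3) only now drain buffered messages, so they cannot
                             //    race ahead of (truncate) the handshake reply
    signalReplyTo()          // 4) finally notify whoever awaited the handshake
}
```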
(Fighting some last timing-sensitive tests on CI, adding more debug information / timings to diagnose better.) Oh I see, there are two issues:

Ok, I think I've had it with this PR growing too large; we have far fewer flaky tests, but enough remain to be annoying, and we must fix them ASAP as they hinder development -- I'll continue on this when I'm back 🙇 I want to go back to small PRs addressing specific failures. Now: #608
Motivation:
We did not solve, for some situations, how an association may need to be established.
E.g. you get a ref to node C from node B, but you never knew about node C yet. If you started sending messages to that ref before the other parts of the system ensured the association, you could end up sending them into the void, causing message loss.
Modifications:
Result: