Closed
This is pretty rare, but our model of association allocation is wrong and has to change.
We can get message loss 😱 (and we have, in singleton tests) when we manually resolve a ref to a remote node we know from membership while its association is not yet established (it is still handshaking).
The issue:
- when we send out membership updates, the associations may still be doing handshakes
- when someone uses resolve to get a ref to an actor on another node, a send on that ref may hit the remote send message path before the handshake completes, at which point there is no association yet
- when no association is found, this causes dead letters
This is hard to solve because of the concurrency involved, and because there is nowhere to put a queue: the association that would hold it does not exist yet.
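The failure mode can be illustrated with a toy, single-threaded model. This is only a sketch and not the project's code: `Node`, `complete_handshake`, and `send` are hypothetical names, and the real system is concurrent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model (hypothetical): an association exists only after its handshake."""
    associations: dict = field(default_factory=dict)  # remote -> delivered messages
    dead_letters: list = field(default_factory=list)

    def complete_handshake(self, remote: str) -> None:
        # The association is created here, as the *result* of the handshake.
        self.associations[remote] = []

    def send(self, remote: str, message: str) -> None:
        delivered = self.associations.get(remote)
        if delivered is None:
            # No association yet, so there is no queue to hold the message.
            self.dead_letters.append(message)
        else:
            delivered.append(message)


node = Node()
# The remote is known from membership, so a ref can already be resolved and used...
node.send("second", "m1")
node.send("second", "m2")
# ...but the handshake only finishes now.
node.complete_handshake("second")
node.send("second", "m3")

print(node.dead_letters)            # ['m1', 'm2']
print(node.associations["second"])  # ['m3']
```

Everything sent between resolving the ref and the end of the handshake is lost, which matches the dead letters described above.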
Solution:
- associations should not be the result of a handshake; they should be created at the start of one
- they should gain a queue, so that messages can be enqueued in the "right order" even if the channel is not ready yet
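A minimal sketch of the proposed shape, under the same toy assumptions: the `Association` class, its `pending` queue, and the list standing in for the channel are all hypothetical and only illustrate the ordering guarantee.

```python
from collections import deque


class Association:
    """Hypothetical association that exists from the moment the handshake starts."""

    def __init__(self) -> None:
        self.pending = deque()  # messages sent while still handshaking
        self.channel = None     # stand-in for the real channel; None until ready

    def send(self, message: str) -> None:
        if self.channel is None:
            self.pending.append(message)  # buffer instead of dropping
        else:
            self.channel.append(message)

    def complete_handshake(self, channel: list) -> None:
        # Flush what was buffered during the handshake first, preserving send order.
        while self.pending:
            channel.append(self.pending.popleft())
        self.channel = channel


wire = []              # records what reaches the remote, in order
assoc = Association()  # created when the handshake begins
assoc.send("m1")
assoc.send("m2")
assoc.complete_handshake(wire)
assoc.send("m3")

print(wire)  # ['m1', 'm2', 'm3']
```

This sketch is single-threaded. A real implementation would additionally have to make the flush and the switch to direct sends safe against concurrent senders, which is the hard part noted above.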
Test:
- 2 nodes
- spawn an actor on the second node
- resolve a ref to it on the first node
- start joining the nodes
- IMMEDIATELY start sending many messages to the resolved one
- there's likely to be message loss
This also shows up in test_singletonByClusterLeadership_stashMessagesIfNoLeader.