
Test and harden: Sending many messages to a ref while it is trying to associate #383

@ktoso

This is pretty rare, but our model of association allocation is wrong and has to change.

We can get message loss 😱 (and we have, in singleton tests) when we manually resolve a ref to a remote node that we know about from membership while the association with that node is not yet finished (it is still handshaking).

The issue:

  • when we send out membership updates, the associations may still be handshaking
  • when someone uses resolve to get a ref on another node, a send through that ref may hit the remote send path before the handshake completes, so there is no association yet
  • when no association is found, the message ends up in dead letters
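
The failure mode in the list above can be reproduced with a toy model. This is only a sketch: `ToyNodeAddress`, `ToyTransport`, and their methods are hypothetical names made up for illustration, not the library's actual types. The only assumption carried over from the description is that an association is stored once its handshake completes, and that a send without an association goes to dead letters.

```swift
// Toy model of the current behavior; every name here is hypothetical.
struct ToyNodeAddress: Hashable {
    let host: String
    let port: Int
}

final class ToyTransport {
    // An association (modeled as its delivered messages) is only stored
    // once the handshake with that node has completed.
    private var associations: [ToyNodeAddress: [String]] = [:]
    private(set) var deadLetters: [String] = []

    func completeHandshake(with node: ToyNodeAddress) {
        if associations[node] == nil {
            associations[node] = []
        }
    }

    func send(_ message: String, to node: ToyNodeAddress) {
        if associations[node] != nil {
            associations[node]?.append(message) // "delivered" to the remote
        } else {
            deadLetters.append(message) // no association yet: the message is lost
        }
    }

    func delivered(to node: ToyNodeAddress) -> [String] {
        return associations[node] ?? []
    }
}

let secondNode = ToyNodeAddress(host: "127.0.0.1", port: 7337)
let transport = ToyTransport()

// The ref was resolved from membership, but the handshake is still running.
transport.send("early-1", to: secondNode)
transport.send("early-2", to: secondNode)
transport.completeHandshake(with: secondNode)
transport.send("late-1", to: secondNode)

print(transport.deadLetters)              // ["early-1", "early-2"]
print(transport.delivered(to: secondNode)) // ["late-1"]
```

Everything sent before `completeHandshake(with:)` is dropped, which matches the dead letters seen in the singleton tests.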

This is hard to solve because of the concurrency involved, and because there is no place to put a queue: the association that would own it does not exist yet.

Solution:

  • associations should not be the result of a handshake; they should be created when the handshake begins
  • they should gain a queue, so we can enqueue messages in the "right order" even if the channel is not ready yet
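
A minimal sketch of that idea, assuming an association that is created when the handshake starts and buffers sends until the channel is ready. `ToyAssociation`, its `State` cases, and the `wire` array (which stands in for the network channel) are hypothetical names for illustration only. The sketch is deliberately single-threaded; a real implementation would also need to synchronize `send` against handshake completion.

```swift
// Hypothetical sketch; not the library's actual types.
final class ToyAssociation {
    enum State {
        case handshaking(pending: [String]) // channel not ready: messages are buffered
        case ready                          // channel ready: messages go straight out
    }

    private var state: State = .handshaking(pending: [])
    private(set) var wire: [String] = [] // stands in for the network channel

    func send(_ message: String) {
        switch state {
        case .handshaking(var pending):
            pending.append(message)
            state = .handshaking(pending: pending)
        case .ready:
            wire.append(message)
        }
    }

    func completeHandshake() {
        guard case .handshaking(let pending) = state else { return }
        wire.append(contentsOf: pending) // flush in the order the sends arrived
        state = .ready
    }
}

let association = ToyAssociation() // exists before the handshake finishes
association.send("early-1")
association.send("early-2")
association.completeHandshake()
association.send("late-1")

print(association.wire) // ["early-1", "early-2", "late-1"]
```

Because the association exists from the start, there is always somewhere to enqueue a message, and the flush on completion preserves send order.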

Test:

  • 2 nodes
  • spawn on the second node
  • resolve on the first node
  • start joining
  • IMMEDIATELY start sending many messages to the resolved ref
  • with the current model, message loss is likely
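
The shape of such a test can be sketched against a toy stand-in. This is not the real cluster test: it spawns no nodes and uses no library APIs. `LockedAssociation` and `runStress` are hypothetical names, and the "join" is modeled as a handshake that completes on another thread at an arbitrary point while a single sender is pushing messages.

```swift
import Foundation
import Dispatch

// Hypothetical thread-safe stand-in for an association that buffers until ready.
final class LockedAssociation: @unchecked Sendable {
    private let lock = NSLock()
    private var ready = false
    private var pending: [Int] = []
    private var delivered: [Int] = []

    func send(_ message: Int) {
        lock.lock()
        defer { lock.unlock() }
        if ready {
            delivered.append(message)
        } else {
            pending.append(message)
        }
    }

    func completeHandshake() {
        lock.lock()
        defer { lock.unlock() }
        guard !ready else { return }
        delivered.append(contentsOf: pending) // flush buffered sends in order
        pending.removeAll()
        ready = true
    }

    func snapshot() -> [Int] {
        lock.lock()
        defer { lock.unlock() }
        return delivered
    }
}

func runStress(messageCount: Int) -> [Int] {
    let association = LockedAssociation()
    let group = DispatchGroup()

    // "IMMEDIATELY start sending": the sender does not wait for the handshake.
    DispatchQueue.global().async(group: group) {
        for i in 0..<messageCount {
            association.send(i)
        }
    }
    // The handshake completes concurrently, at some point during the sends.
    DispatchQueue.global().async(group: group) {
        association.completeHandshake()
    }

    group.wait()
    return association.snapshot()
}

let received = runStress(messageCount: 10_000)
print(received.count == 10_000)            // no message loss
print(received == Array(0..<10_000))       // order preserved
```

With a buffering association, every message arrives and stays in send order no matter when the handshake lands. The same assertions run against the current model would be expected to fail whenever the sender wins the race.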

This also shows up in test_singletonByClusterLeadership_stashMessagesIfNoLeader.
