LoadBalancer keyed on slot instead of primary node, not reset on NodesManager.initialize() #3683

Open · wants to merge 3 commits into base: master

Conversation

@drewfustin commented Jun 19, 2025

Pull Request check-list

  • Do tests and lints pass with this change?
  • Do the CI tests pass with this change (enable it first in your forked repo and wait for the GitHub Actions build to finish)? link
  • Is the new or changed code fully tested?
  • Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?
  • Is there an example added to the examples folder (if applicable)?

Description of change

As noted in #3681, resetting the load balancer on NodesManager.initialize() causes the index associated with the primary node to reset to 0. If a ConnectionError or TimeoutError is raised by an attempt to connect to a primary node, NodesManager.initialize() is called and the load balancer's index for that node is reset to 0. As a result, the next attempt in the retry loop does not move on from the primary node to a replica node (with index > 0) as expected, but instead retries the primary node again (and presumably raises the same error).
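
For context, a condensed sketch of the pre-fix scheme (paraphrased, not a verbatim copy of the redis-py source): the balancer is keyed on the primary's host:port name, and NodesManager.initialize() wipes it, so every reinitialization sends the next read back to index 0.

    # Condensed sketch of the pre-fix scheme (paraphrased; not the exact redis-py source).
    class LoadBalancer:
        def __init__(self, start_index: int = 0) -> None:
            self.primary_to_idx: dict[str, int] = {}
            self.start_index = start_index

        def get_server_index(self, primary: str, list_size: int) -> int:
            # Round-robin over [primary, replica 1, replica 2, ...] for this primary.
            server_index = self.primary_to_idx.setdefault(primary, self.start_index)
            self.primary_to_idx[primary] = (server_index + 1) % list_size
            return server_index

        def reset(self) -> None:
            self.primary_to_idx.clear()

    # NodesManager.initialize(), which runs after a ConnectionError/TimeoutError,
    # calls reset() on this balancer. Clearing the mapping means the retry starts
    # over at index 0 -- the primary that just failed.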

Calling NodesManager.initialize() on ConnectionError or TimeoutError is the right strategy, and the primary node's host is often replaced in tandem with the events that cause these errors (e.g. when a primary node is deleted and then recreated in Kubernetes), so keying the LoadBalancer dictionary on the primary node's name (host:port) doesn't feel appropriate. Keying the dictionary on the Redis Cluster slot is a better strategy: a slot is not expected to change across reinitialization (only the host:port would), so the server_index for a slot never needs to be reset to 0 in NodesManager.initialize(). The slot entry can therefore keep its state even when the NodesManager is reinitialized, which resolves #3681.
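
A minimal sketch of the slot-keyed approach described above (illustrative only; the actual diff may differ in naming and detail):

    # Illustrative sketch of a slot-keyed LoadBalancer (names may not match the PR exactly).
    class LoadBalancer:
        def __init__(self, start_index: int = 0) -> None:
            self.slot_to_idx: dict[int, int] = {}
            self.start_index = start_index

        def get_server_index(self, slot: int, list_size: int) -> int:
            # Round-robin per slot. Slots are stable across topology refreshes,
            # so there is nothing that needs resetting in NodesManager.initialize().
            server_index = self.slot_to_idx.setdefault(slot, self.start_index)
            self.slot_to_idx[slot] = (server_index + 1) % list_size
            return server_index

With this keying, NodesManager.initialize() can simply skip the reset: a retry triggered by a ConnectionError or TimeoutError continues the rotation at the next index (a replica) instead of starting over at the likely still unreachable primary.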

With the fix in this PR implemented, the output of the loop from #3681 matches what is expected when the primary node goes down: on a TimeoutError, the load balancer moves on to the next node instead of retrying the primary.

Attempt 1
idx: 2 | node: 100.66.151.143:6379 | type: replica
'bar'

Attempt 2
idx: 0 | node: 100.66.122.229:6379 | type: primary
Exception: Timeout connecting to server

Attempt 3
idx: 1 | node: 100.66.151.143:6379 | type: replica
'bar'

Attempt 4
idx: 2 | node: 100.66.106.241:6379 | type: replica
'bar'

Attempt 5
idx: 0 | node: 100.66.122.229:6379 | type: primary
Exception: Error 113 connecting to 100.66.122.229:6379. No route to host.
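
For reference, output like the above can be produced with a loop along these lines (a stand-in for the reproduction script in #3681, not the exact script: the seed host, key name, and retry handling are illustrative assumptions, and the idx/node introspection from the original output is omitted here).

    from redis.cluster import RedisCluster
    from redis.exceptions import ConnectionError, TimeoutError

    # Placeholder seed node; read_from_replicas makes GET eligible for replica routing.
    rc = RedisCluster(host="100.66.122.229", port=6379, read_from_replicas=True)

    for attempt in range(1, 6):
        print(f"Attempt {attempt}")
        try:
            print(repr(rc.get("foo")))  # "foo" is assumed to already hold "bar"
        except (ConnectionError, TimeoutError) as exc:
            print(f"Exception: {exc}")
        print()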

Successfully merging this pull request may close these issues.

Round robin load balancing isn't working as expected if primary node goes down for Redis cluster mode (#3681)