Round robin load balancing isn't working as expected if primary node goes down for Redis cluster mode #3681

Open
@drewfustin

Description

Using redis-py==6.2.0. I have a 6-node Redis Cluster running in EKS (3 primaries / 3 replicas). Whenever one of the primary nodes goes down, my service raises TimeoutError even though the replica is available, the client has a sufficient Retry configured, and client-side load balancing is set to LoadBalancingStrategy.ROUND_ROBIN.
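
For context, the service builds its client roughly like this (a sketch: the connection settings mirror the repro below, and the Retry values are illustrative):

import os

from redis.backoff import NoBackoff
from redis.cluster import LoadBalancingStrategy, RedisCluster
from redis.retry import Retry

# Sketch of the service-side client: same cluster settings as the repro below,
# plus the client-level Retry the service actually runs with (illustrative values).
service_client = RedisCluster(
    host=os.getenv('REDIS_HOST'),
    port=os.getenv('REDIS_PORT'),
    password=os.getenv('REDIS_PASSWORD'),
    load_balancing_strategy=LoadBalancingStrategy.ROUND_ROBIN,
    retry=Retry(backoff=NoBackoff(), retries=3),
    socket_connect_timeout=1.0,
    require_full_coverage=False,
    decode_responses=True,
)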

I've boiled it down to the following so I can better observe what's happening in the RedisCluster._internal_execute_command retry loop (note that the client in this example has no Retry on it because the manual loop handles retries; the actual service runs with something like the retry=Retry(backoff=NoBackoff(), retries=3) shown above):

import os
from redis.cluster import LoadBalancingStrategy, RedisCluster
rc = RedisCluster(
    host=os.getenv('REDIS_HOST'),
    port=os.getenv('REDIS_PORT'),
    password=os.getenv('REDIS_PASSWORD'),
    load_balancing_strategy=LoadBalancingStrategy.ROUND_ROBIN,
    socket_connect_timeout=1.0,
    require_full_coverage=False,
    decode_responses=True,
)
rc.set("foo", "bar")  # True
slot = rc.determine_slot("GET", "foo")  # 12182
lbs = rc.load_balancing_strategy  # ROUND_ROBIN
# Manually mimic one pass of the retry loop: ask the load balancer for the
# next node serving the slot, then run the command against that node.
for i in range(5):
    print(f"\nAttempt {i + 1}")
    try:
        primary_name = rc.nodes_manager.slots_cache[slot][0].name
        n_slots = len(rc.nodes_manager.slots_cache[slot])
        node_idx = rc.nodes_manager.read_load_balancer.get_server_index(primary_name, n_slots, lbs)
        node = rc.nodes_manager.slots_cache[slot][node_idx]
        print(f"idx: {node_idx} | node: {node.name} | type: {node.server_type}")
        print(repr(rc._execute_command(node, "GET", "foo")))
    except Exception as e:
        print(f"Exception: {e}")

With a healthy cluster, this will output:

Attempt 1
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'

Attempt 2
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'

Attempt 3
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'

Attempt 4
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'

Attempt 5
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'

If I kill the primary node in EKS (kubectl delete pod redis-node-3, where that pod was 100.66.97.179) and run the loop again, I get the following (until EKS gets redis-node-3 back up and running):

Attempt 1
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'

Attempt 2
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server

Attempt 3
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server

Attempt 4
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server

Attempt 5
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server

Basically, as soon as I get the TimeoutError from the primary node, the load balancer gets stuck: it stops bouncing between the primary and the replica and just keeps trying the primary over and over.
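
For what it's worth, the stuck counter can also be watched directly between attempts (primary_to_idx is an internal LoadBalancer attribute, so treat the name as an assumption about the current implementation; this is just for poking around):

# Peek at the round-robin bookkeeping between attempts.
# NOTE: primary_to_idx is an internal LoadBalancer attribute (assumption about
# the current implementation), so this is debugging only, not stable API.
lb = rc.nodes_manager.read_load_balancer
print(lb.primary_to_idx)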

If I instead kill the replica node in EKS, I get exactly what I'd expect, with the load balancer still alternating between the two nodes:

Attempt 1
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'

Attempt 2
idx: 1 | node: 100.66.106.241:6379 | type: replica
Exception: Timeout connecting to server

Attempt 3
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'

Attempt 4
idx: 1 | node: 100.66.106.241:6379 | type: replica
Exception: Timeout connecting to server

Attempt 5
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'
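
For illustration, a crude fallback along these lines sidesteps the stuck balancer by walking every node that serves the slot (a sketch only, reusing the same private helpers as the repro above, so not a real fix):

from redis.exceptions import ConnectionError, TimeoutError

def get_with_fallback(rc, key):
    # Try every node that serves the key's slot, in slots_cache order,
    # instead of asking the (stuck) load balancer for an index.
    slot = rc.determine_slot("GET", key)
    last_exc = None
    for node in rc.nodes_manager.slots_cache[slot]:
        try:
            return rc._execute_command(node, "GET", key)
        except (ConnectionError, TimeoutError) as exc:
            last_exc = exc
    raise last_exc

get_with_fallback(rc, "foo")  # 'bar' from the replica even while the primary is down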
