-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Description
Version:
redis-py: 5.0.3
redis: 7.2.4
Platform:
Python 3.12.2 running on Debian bookworm
Description:
First off I want to say thanks for maintaining this library :)
So to the issue at hand:
It seems like the retry-functionality in the PubSub-client (both async and sync) is not working correctly.
To simulate a crash of redis I am closing and restarting redis while our system is running. Even though our connection is set up with an infinite amount of retry, the PubSub-client always crashes and never recovers.
Having dug though the source code I have found the pathways causing the crash.
redis-py/redis/asyncio/client.py
Lines 885 to 896 in 1784b37
| async def _execute(self, conn, command, *args, **kwargs): | |
| """ | |
| Connect manually upon disconnection. If the Redis server is down, | |
| this will fail and raise a ConnectionError as desired. | |
| After reconnection, the ``on_connect`` callback should have been | |
| called by the # connection to resubscribe us to any channels and | |
| patterns we were previously listening to | |
| """ | |
| return await conn.retry.call_with_retry( | |
| lambda: command(*args, **kwargs), | |
| lambda error: self._disconnect_raise_connect(conn, error), | |
| ) |
The
_execute-function runs periodicially, and whenever redis is closed the command fails.This causes the
_disconnect_raise_connect-function to be called:redis-py/redis/asyncio/client.py
Lines 870 to 883 in 1784b37
| async def _disconnect_raise_connect(self, conn, error): | |
| """ | |
| Close the connection and raise an exception | |
| if retry_on_error is not set or the error is not one | |
| of the specified error types. Otherwise, try to | |
| reconnect | |
| """ | |
| await conn.disconnect() | |
| if ( | |
| conn.retry_on_error is None | |
| or isinstance(error, tuple(conn.retry_on_error)) is False | |
| ): | |
| raise error | |
| await conn.connect() |
_disconnect_raise_connect closes the connection, conn.retry_on_error is True making the function skip raising error.This is followed by an attempt at reconnecting.
redis-py/redis/asyncio/connection.py
Lines 269 to 284 in 1784b37
| async def connect(self): | |
| """Connects to the Redis server if not already connected""" | |
| if self.is_connected: | |
| return | |
| try: | |
| await self.retry.call_with_retry( | |
| lambda: self._connect(), lambda error: self.disconnect() | |
| ) | |
| except asyncio.CancelledError: | |
| raise # in 3.7 and earlier, this is an Exception, not BaseException | |
| except (socket.timeout, asyncio.TimeoutError): | |
| raise TimeoutError("Timeout connecting to server") | |
| except OSError as e: | |
| raise ConnectionError(self._error_message(e)) | |
| except Exception as exc: | |
| raise ConnectionError(exc) from exc |
Since we are not currently connected, the first if-statement is
False. A call with retry is made to the _connect-function.redis-py/redis/asyncio/connection.py
Lines 678 to 700 in 1784b37
| async def _connect(self): | |
| """Create a TCP socket connection""" | |
| async with async_timeout(self.socket_connect_timeout): | |
| reader, writer = await asyncio.open_connection( | |
| **self._connection_arguments() | |
| ) | |
| self._reader = reader | |
| self._writer = writer | |
| sock = writer.transport.get_extra_info("socket") | |
| if sock: | |
| sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) | |
| try: | |
| # TCP_KEEPALIVE | |
| if self.socket_keepalive: | |
| sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1) | |
| for k, v in self.socket_keepalive_options.items(): | |
| sock.setsockopt(socket.SOL_TCP, k, v) | |
| except (OSError, TypeError): | |
| # `socket_keepalive_options` might contain invalid options | |
| # causing an error. Do not leave the connection open. | |
| writer.close() | |
| raise |
Inside the
_connect-function an attempt is made to connect to redis. Since redis is unavailable an OSError is created. OSError is not in the list of supported_errors causing it to bypass the error handling in the call_with_retry-function and is propagated to the connect-function. Here it is converted to a ConnectionError and is propagated all the way to the top unhandled.
Is this intended behaviour?
Is this something that could be handled in a different way?
We have temporarily fixed this by adding OSError to the list of supported_errors, but this feels like quite a hacky solution.