Conversation

@Honza-Kubat

No description provided.

jpignata and others added 30 commits October 4, 2012 16:46
If there is a timeout set, check that an existing clientId's timestamp is
newer than double that timeout. If no timeout is set, just check for the
clientId's existence.

This mitigates a race condition that is triggered when a connect and/or
subscribe occurs for a clientId that is in the midst of being reaped by
the garbage collector. In that case, orphaned message queues and channel
subscriptions could remain after the clientId was reaped.
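
For illustration, a minimal sketch of the kind of existence check described above, assuming clients live in a `/clients` sorted set scored by their last-ping timestamp; the method name and option access are illustrative, not the exact code in this PR:

```javascript
clientExists: function(clientId, callback, context) {
  var self = this;
  this._redis.zscore(this._ns + '/clients', clientId, function(error, score) {
    if (self._options.timeout) {
      // With a timeout configured, the client must have pinged within twice that window.
      var cutoff = new Date().getTime() / 1000 - self._options.timeout * 2;
      callback.call(context, score !== null && parseFloat(score) > cutoff);
    } else {
      // Without a timeout, mere presence in the sorted set is enough.
      callback.call(context, score !== null);
    }
  });
}
```
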
Our Redis backend is becoming CPU bound so we need to shard the keys
Faye uses over multiple Redis systems to reduce the amount of work
each system needs to do.

This change introduces a lightweight wrapper that abstracts a
distributed set of Redis servers. The Engine now accepts an array of
Redis URL strings in the `servers` option key. Note that since pub/sub
isn't shardable, the first server provided will be used as the bus.

The only significant change made to the existing engine is the call
to multi. The signature for multi has been changed to accept a key
so that it can return a specific connection. This implies that multi
can only operate on one key per call and is leveraged for atomicity.
That said, nothing enforces this: multi returns a connection on which
anything can be called. Utilitor emptor.
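
A rough sketch of what such a wrapper could look like, assuming a simple deterministic key hash; `Sharded`, `connectionFor`, and `hashKey` are illustrative names, not the actual implementation:

```javascript
var redis = require('redis'),
    url   = require('url');

// Simple deterministic hash so every process maps a key to the same shard.
function hashKey(key) {
  var h = 0;
  for (var i = 0; i < key.length; i++) h = (h * 31 + key.charCodeAt(i)) >>> 0;
  return h;
}

function Sharded(serverUrls) {
  this._connections = serverUrls.map(function(serverUrl) {
    var parsed = url.parse(serverUrl);
    return redis.createClient(parsed.port, parsed.hostname);
  });
  // Pub/sub isn't shardable, so the first server doubles as the message bus.
  this._bus = this._connections[0];
}

// Every keyed operation goes through here so a given key always lands on one shard.
Sharded.prototype.connectionFor = function(key) {
  return this._connections[hashKey(key) % this._connections.length];
};

// multi(key) returns a MULTI on the shard that owns `key`; callers should only
// touch that one key inside it, though nothing enforces this.
Sharded.prototype.multi = function(key) {
  return this.connectionFor(key).multi();
};
```
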
SUNION isn't shardable because the sets are likely distributed amongst
multiple Redis servers. Instead, we'll have to do more work to collect
the clientIds and ensure we deliver to each one only once.
To reduce Redis chatter, we don't support wildcard channel delivery.
This is not something we use in our Bayeux implementation as these
subscriptions are not permitted.
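
A sketch of per-channel delivery without SUNION, assuming each channel's subscribers live in a `/channels<name>` set on its own shard and reusing the illustrative `connectionFor` from the sketch above; `_deliver` is a placeholder for the engine's actual message delivery:

```javascript
publish: function(message, channels) {
  var self = this, seen = {};
  channels.forEach(function(channel) {
    var key = self._ns + '/channels' + channel;
    self._shards.connectionFor(key).smembers(key, function(error, clientIds) {
      if (error) return;
      clientIds.forEach(function(clientId) {
        if (seen[clientId]) return;        // each client gets the message at most once
        seen[clientId] = true;
        self._deliver(clientId, message);  // illustrative delivery helper
      });
    });
  });
}
```
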
We only need the connection items while connecting, so instead of
keeping them around, we'll just set an ivar for the original
URLs and use them as our index values.
It's more generic, and it'll make more sense if we set it for other
processes (e.g. the dedicated GC process). Also, some docs and a
version bump.
If the set of clients is really large, the ZRANGEBYSCORE Redis command
can overwhelm the process by taking too much time and too much
memory. Setting the `gc_limit` option will now set the LIMIT parameter
on that command, allowing the GC cycle to finish within a reasonable
amount of time. Also adds some logging.
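
Roughly, the option could be applied like this; a sketch only, with `this._options.gc_limit`, the key names, and the `destroyClient` usage assumed from the descriptions in this PR:

```javascript
var self   = this,
    cutoff = new Date().getTime() / 1000 - 2 * this._options.timeout,
    key    = this._ns + '/clients';

var onExpired = function(error, clientIds) {
  if (error) return;
  clientIds.forEach(function(clientId) { self.destroyClient(clientId) });
};

if (this._options.gc_limit) {
  // Cap the result set so a huge /clients sorted set can't stall the GC cycle.
  this._redis.zrangebyscore(key, 0, cutoff, 'LIMIT', 0, this._options.gc_limit, onExpired);
} else {
  this._redis.zrangebyscore(key, 0, cutoff, onExpired);
}
```
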
Publishing was stepping on GC's toes by calling `destroyClient` whenever
it detected a client ID was past its prime. I *think* there may have
been some contention between the two (publisher and gc), so this moves
the cleanup responsibility wholly to the latter.

Well, not completely. The `disconnect` function inside of Faye itself
calls `destroyClient` as well, but let's leave that be.
A lot of changes here. Primarily, the destroyClient function now
tolerates Redis failures all the way downstream, and if any error is
detected, the destroy process stops. Instead of trying to recover, we
rely on the hope that GC will pick up a client and re-run the calls in
the future, since the removal from the /clients sorted set is the final
Redis call.

The GC lock has also been removed. After grabbing a list of expired
client IDs, we iterate over them with destroyClient, which returns
immediately. Since we're simply launching destroyClient into flight, and
don't care if it succeeds or not, we allow GC to re-run in the future
without any restrictions.

Because destroyClient doesn't block, we have an additional, expected
test failure: "Redis engine destroyClient when the client has
subscriptions stops the client receiving messages:". It fails because
it's written to expect the destroy to occur immediately. If the test is
rewritten with the expectation in a callback, it succeeds.
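
As a sketch of the fail-fast chain described here, assuming the usual `/clients/<id>/channels`, `/channels<name>`, and `/clients/<id>/messages` keys and a single Redis connection for brevity:

```javascript
destroyClient: function(clientId, callback, context) {
  var self = this;
  this._redis.smembers(this._ns + '/clients/' + clientId + '/channels', function(error, channels) {
    if (error) return;                       // bail out; a later GC pass will retry
    var multi = self._redis.multi();
    channels.forEach(function(channel) {
      multi.srem(self._ns + '/channels' + channel, clientId);
      multi.srem(self._ns + '/clients/' + clientId + '/channels', channel);
    });
    multi.del(self._ns + '/clients/' + clientId + '/messages');
    multi.exec(function(error) {
      if (error) return;                     // again, leave the client for a future GC run
      // Removing the client from the sorted set is deliberately the last call,
      // so any earlier failure leaves it visible to the next GC cycle.
      self._redis.zrem(self._ns + '/clients', clientId, function() {
        if (callback) callback.call(context);
      });
    });
  });
}
```
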
Previously, it would set the initial score to zero and then allow
ping() to update the score to the current time in the callback. This
seemed like it would allow a window of opportunity for GC to sneak in
and kill the newly created client, however, so now the score is simply
set in the initial call.
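
A minimal sketch of that change, with `generateId` standing in for whatever ID generator the engine actually uses:

```javascript
createClient: function(callback, context) {
  var self     = this,
      clientId = generateId(),                 // illustrative ID helper
      now      = new Date().getTime() / 1000;

  // Score the client with the current time immediately, instead of adding it at
  // zero and waiting for ping() to bump it, so GC never sees a zero-scored client.
  this._redis.zadd(this._ns + '/clients', now, clientId, function(error, added) {
    if (added === 0) return self.createClient(callback, context);  // ID collision: retry
    callback.call(context, clientId);
  });
}
```
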
Rewrite GC to handle Redis failures
Just fails rather than trying again.
This is just burning a Redis call that we don't need to make.
Let's be optimistic about cleaning up the channels and messages for a
client (the latter expires anyway).
It turns out that having publishers GC is pretty important. That
mechanism was largely responsible for keeping memory use down on the
backend Redis servers, and turning it off proved a catastrophe.
Since we want to move to a serial GC, the callback is now always
invoked, but with a "false" argument if the GC didn't complete for
whatever reason.
This replaces the batch interval-based GC with a serial process. Now,
clients are processed one-at-a-time, and we depend heavily on the
callbacks (and errbacks) to re-invoke the GC loop. An experiment,
really.
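
Sketched out, the serial loop might look like this, assuming `destroyClient` reports completion via the success flag mentioned two commits up and an illustrative `_gcInterval` pause between empty passes:

```javascript
gc: function() {
  var self   = this,
      cutoff = new Date().getTime() / 1000 - 2 * this._options.timeout;

  // Fetch a single expired client; the next iteration is only scheduled from
  // the callbacks, so clients are processed strictly one at a time.
  this._redis.zrangebyscore(this._ns + '/clients', 0, cutoff, 'LIMIT', 0, 1,
    function(error, clientIds) {
      if (error || clientIds.length === 0)
        return setTimeout(function() { self.gc() }, self._gcInterval);

      self.destroyClient(clientIds[0], function(succeeded) {
        // The callback always fires; `succeeded` is false when the destroy did
        // not complete, but either way we move on to the next client.
        self.gc();
      });
    });
}
```
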
Since we're hooking into the `destroyClient` function, we should get
stats from any source that destroys clients, including the GC process,
publishers, and front-ends.