Skip to content

Conversation

@aravindyeduvaka
Copy link
Contributor

Fixing a bug where READONLY is run before AUTH when reconnecting to clustered Slaves and since Readonly is run as Fire and Forget we never get an error either.

This causes MOVED errors until the Multiplexer is recreated (easily seen when NoRedirect is specified) or all MOVED errors are handled internally but this means the Slaves are not at all used to distribute load which can cause connection issues with much smaller loads.

@mgravell
Copy link
Collaborator

mgravell commented Feb 27, 2020

This is a great find, thank you. I think this has been a thorn for ages - and seeing this coming from someone with the "First-time contributor" flag is: just amazing.

(edit: I just spotted the "microsoft" tag in your profile, so: fair enough, but: I'm still delighted to get a great fix like this!)

Just for the paperwork, can I get a confirmation that you're able and willing (and thus doing so) to contribute this code to the project under the project's license? (we need to get a CLA bot!)

@mgravell
Copy link
Collaborator

btw; once I've validated this, I want to get this into the 2.1 drop that has been sat at preview for a few weeks, and get it shipped

@aravindyeduvaka
Copy link
Contributor Author

Licensing: Yes, I am able and willing (and thus doing so) to contribute this code to the project under the project's license.
Yeah Microsoft but I am a recent grad so this is a 'true' First-time contribution.

Getting it into the 2.1 release sounds good but some people on my team have concerns since it seems like the latest version has some connection issues (probably unrelated to this) and customers might hit these issues by upgrading.
#1120
#1178

@mgravell mgravell merged commit 2a2ab6e into StackExchange:master Feb 28, 2020
@mgravell
Copy link
Collaborator

Much appreciated.

I hear you on the upgrade issue; we can try and continue work on the stability issue in the background, but; not trivial, and ultimately we don't have dedicated time to investigate this. I'm 75% sure we should just prioritize swapping out the entire network layer in favor of RESPite; the actual RESP part is feature-complete (at least for RESP2) - we mainly just need the "pool" pieces. This would solve a lot of problems in one go. But that's also a lot of moving parts, so: probably a v3. Sigh; managing libraries is tricky!

@lostindark
Copy link

@mgravell Is it possible to release this fix on top of 2.0.519 (maybe create a 2.0.519.1)? Our production service are impacted by this bug several times per week and we're struggle to keep up our availability SLA.
We can't wait for the fix for stability issue in newer build. Looks like we don't have an ETA on that yet. Publish a build on top of 519 seems the best option here.

@mgravell
Copy link
Collaborator

mgravell commented Mar 18, 2020 via email

@lostindark
Copy link

lostindark commented Mar 18, 2020

There only reason for 2.1.* is a build / versioning snafu, not anything more fundamental; so any fix will be in 2.1., but this 2.0. Vs 2.1.* doesn't represent anything of significance. Sometimes build go weird. This fix is high on my priorities. In fact, let me deal with the home chores (in these unusual days), and then see if I can merge it.

On Wed, 18 Mar 2020, 06:08 Teddy Zhang, @.***> wrote: @mgravell https://github.com/mgravell Is it possible to release this fix on top of 2.0.519 (maybe create a 2.0.519.1)? Our production service are impacted by this bug several times per week and we're struggle to keep up our availability SLA. We can't wait for the fix for stability issue in newer build. Looks like we don't have an ETA on that yet. Publish a build on top of 519 seems the best option here. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1367 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAEHMHP43EGRKNHKJ5TEPLRIBQNTANCNFSM4K4LTSKQ .

I'm not sure if I fully understand your reply. You seems indicate that you'll apply the fix to 2.1., but that's not what we want right now.
I think the newer stable release should be on 2.1.
and the fix will be part of it. However, we can't use it yet as there a connection issue introduced after 2.0.519 and hasn't been fixed yet.

My ask is create a branch 2.0.519 from this commit: https://github.com/StackExchange/StackExchange.Redis/tree/v2.0.519
Apply this fix: 2a2ab6e
And produce a new version 2.0.519.1.
Basically we want the fix on top of the known good build 2.0.519.
Looks like @aravindyeduvaka already created one: https://github.com/aravindyeduvaka/StackExchange.Redis/tree/StableVersion. Can this be merge back and create an official nuget out of this?

I think this is the best short term solution. With this, we can have enough time to wait for newer 2.1.* to become stable.

@NickCraver
Copy link
Collaborator

@lostindark The connection issue was also fixed in #1374 and will be in the 2.1.x release.

It's not trivial for us to go back on Git history due to height differences, but there's also no need here. Expect 2.1.x to be released tomorrow - we're running it in production on Stack Overflow right now. The 2.1.0-preview.53 release at https://www.myget.org/feed/stackoverflow/package/nuget/StackExchange.Redis will be what the final 2.1.x stable release consists of tomorrow unless we find any issues today in production (we have no expectation of such).

@lostindark
Copy link

@NickCraver Thanks for the update. I'll wait for the new stable release and test it when it's ready.

@lostindark
Copy link

@NickCraver I saw 2.1.0 disappeared from nuget.org. But no new version comes out yet. What is the plan?

@NickCraver
Copy link
Collaborator

Hey @lostindark sorry we're busy with kids and family as well - bit slow on updates here.

We found some race conditions that caused errors in badly handled ways (e.g. an event handler) specifically in the Sentinel code after the 2.1.0 cut. Some in master are fixes (async of sync - e.g. #1402), and I've got more bits to change in #1403. Once we hammer these tests more in CI and dogfood, we'll put another build live that's more stable for users.

Now, all of that is specific to connecting with Sentinel, and only when the downstream master is down (we're hammering that case in tests, but it's not the norm). If you're not connecting to Sentinel servers: you're fine but we'll still update here.

Another set of changes post-2.1.0 is improved error messages especially for Azure/AWS cases, see #1397 for details.

odinserj added a commit to HangfireIO/StackExchange.Redis that referenced this pull request Mar 26, 2020
@lostindark
Copy link

@NickCraver Thanks for the detail update!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

After enable the Authentication and SSL, it has a high chance to get "MOVED" errors

4 participants