Fixing a bug where READONLY is run before AUTH. #1367

aravindyeduvaka · 2020-02-26T18:43:53Z

Fixing a bug where READONLY is run before AUTH when reconnecting to clustered Slaves and since Readonly is run as Fire and Forget we never get an error either.

This causes MOVED errors until the Multiplexer is recreated (easily seen when NoRedirect is specified) or all MOVED errors are handled internally but this means the Slaves are not at all used to distribute load which can cause connection issues with much smaller loads.

…hen reconnecting to Clustered Slaves.

mgravell · 2020-02-27T07:25:45Z

This is a great find, thank you. I think this has been a thorn for ages - and seeing this coming from someone with the "First-time contributor" flag is: just amazing.

(edit: I just spotted the "microsoft" tag in your profile, so: fair enough, but: I'm still delighted to get a great fix like this!)

Just for the paperwork, can I get a confirmation that you're able and willing (and thus doing so) to contribute this code to the project under the project's license? (we need to get a CLA bot!)

mgravell · 2020-02-27T07:37:14Z

btw; once I've validated this, I want to get this into the 2.1 drop that has been sat at preview for a few weeks, and get it shipped

aravindyeduvaka · 2020-02-27T22:51:03Z

Licensing: Yes, I am able and willing (and thus doing so) to contribute this code to the project under the project's license.
Yeah Microsoft but I am a recent grad so this is a 'true' First-time contribution.

Getting it into the 2.1 release sounds good but some people on my team have concerns since it seems like the latest version has some connection issues (probably unrelated to this) and customers might hit these issues by upgrading.
#1120
#1178

mgravell · 2020-02-28T07:27:20Z

Much appreciated.

I hear you on the upgrade issue; we can try and continue work on the stability issue in the background, but; not trivial, and ultimately we don't have dedicated time to investigate this. I'm 75% sure we should just prioritize swapping out the entire network layer in favor of RESPite; the actual RESP part is feature-complete (at least for RESP2) - we mainly just need the "pool" pieces. This would solve a lot of problems in one go. But that's also a lot of moving parts, so: probably a v3. Sigh; managing libraries is tricky!

lostindark · 2020-03-18T06:08:12Z

@mgravell Is it possible to release this fix on top of 2.0.519 (maybe create a 2.0.519.1)? Our production service are impacted by this bug several times per week and we're struggle to keep up our availability SLA.
We can't wait for the fix for stability issue in newer build. Looks like we don't have an ETA on that yet. Publish a build on top of 519 seems the best option here.

mgravell · 2020-03-18T10:01:23Z

There only reason for 2.1.* is a build / versioning snafu, not anything more fundamental; so any fix will be in 2.1.*, but this 2.0.* Vs 2.1.* doesn't represent anything of significance. Sometimes build go weird. This fix is high on my priorities. In fact, let me deal with the home chores (in these unusual days), and then see if I can merge it.

…

On Wed, 18 Mar 2020, 06:08 Teddy Zhang, ***@***.***> wrote: @mgravell <https://github.com/mgravell> Is it possible to release this fix on top of 2.0.519 (maybe create a 2.0.519.1)? Our production service are impacted by this bug several times per week and we're struggle to keep up our availability SLA. We can't wait for the fix for stability issue in newer build. Looks like we don't have an ETA on that yet. Publish a build on top of 519 seems the best option here. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1367 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAEHMHP43EGRKNHKJ5TEPLRIBQNTANCNFSM4K4LTSKQ> .

lostindark · 2020-03-18T17:49:47Z

There only reason for 2.1.* is a build / versioning snafu, not anything more fundamental; so any fix will be in 2.1., but this 2.0. Vs 2.1.* doesn't represent anything of significance. Sometimes build go weird. This fix is high on my priorities. In fact, let me deal with the home chores (in these unusual days), and then see if I can merge it.
…
On Wed, 18 Mar 2020, 06:08 Teddy Zhang, @.***> wrote: @mgravell https://github.com/mgravell Is it possible to release this fix on top of 2.0.519 (maybe create a 2.0.519.1)? Our production service are impacted by this bug several times per week and we're struggle to keep up our availability SLA. We can't wait for the fix for stability issue in newer build. Looks like we don't have an ETA on that yet. Publish a build on top of 519 seems the best option here. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1367 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAEHMHP43EGRKNHKJ5TEPLRIBQNTANCNFSM4K4LTSKQ .

I'm not sure if I fully understand your reply. You seems indicate that you'll apply the fix to 2.1., but that's not what we want right now.
I think the newer stable release should be on 2.1. and the fix will be part of it. However, we can't use it yet as there a connection issue introduced after 2.0.519 and hasn't been fixed yet.

My ask is create a branch 2.0.519 from this commit: https://github.com/StackExchange/StackExchange.Redis/tree/v2.0.519
Apply this fix: 2a2ab6e
And produce a new version 2.0.519.1.
Basically we want the fix on top of the known good build 2.0.519.
Looks like @aravindyeduvaka already created one: https://github.com/aravindyeduvaka/StackExchange.Redis/tree/StableVersion. Can this be merge back and create an official nuget out of this?

I think this is the best short term solution. With this, we can have enough time to wait for newer 2.1.* to become stable.

NickCraver · 2020-03-18T18:14:22Z

@lostindark The connection issue was also fixed in #1374 and will be in the 2.1.x release.

It's not trivial for us to go back on Git history due to height differences, but there's also no need here. Expect 2.1.x to be released tomorrow - we're running it in production on Stack Overflow right now. The 2.1.0-preview.53 release at https://www.myget.org/feed/stackoverflow/package/nuget/StackExchange.Redis will be what the final 2.1.x stable release consists of tomorrow unless we find any issues today in production (we have no expectation of such).

lostindark · 2020-03-18T19:42:36Z

@NickCraver Thanks for the update. I'll wait for the new stable release and test it when it's ready.

lostindark · 2020-03-24T17:59:23Z

@NickCraver I saw 2.1.0 disappeared from nuget.org. But no new version comes out yet. What is the plan?

NickCraver · 2020-03-24T19:12:24Z

Hey @lostindark sorry we're busy with kids and family as well - bit slow on updates here.

We found some race conditions that caused errors in badly handled ways (e.g. an event handler) specifically in the Sentinel code after the 2.1.0 cut. Some in master are fixes (async of sync - e.g. #1402), and I've got more bits to change in #1403. Once we hammer these tests more in CI and dogfood, we'll put another build live that's more stable for users.

Now, all of that is specific to connecting with Sentinel, and only when the downstream master is down (we're hammering that case in tests, but it's not the norm). If you're not connecting to Sentinel servers: you're fine but we'll still update here.

Another set of changes post-2.1.0 is improved error messages especially for Azure/AWS cases, see #1397 for details.

@aravindyeduvaka

From StackExchange#1367 by @aravindyeduvaka

lostindark · 2020-03-27T17:52:23Z

@NickCraver Thanks for the detail update!

Fixing a bug where READONLY is run before AUTH. Causes MOVED errors w…

5bba252

…hen reconnecting to Clustered Slaves.

mgravell merged commit 2a2ab6e into StackExchange:master Feb 28, 2020

NickCraver mentioned this pull request Mar 15, 2020

After enable the Authentication and SSL, it has a high chance to get "MOVED" errors #1266

Closed

NickCraver linked an issue Mar 15, 2020 that may be closed by this pull request

After enable the Authentication and SSL, it has a high chance to get "MOVED" errors #1266

Closed

This was referenced Mar 15, 2020

StackExchange.Redis.RedisServerException: Key has MOVED to Endpoint error #1274

Closed

"ERR This instance has cluster support disabled" under load with master/slave #1047

Closed

odinserj added a commit to HangfireIO/StackExchange.Redis that referenced this pull request Mar 26, 2020

Fixing a bug where READONLY is run before AUTH

2e6f6ed

From StackExchange#1367 by @aravindyeduvaka

Fixing a bug where READONLY is run before AUTH. #1367

Fixing a bug where READONLY is run before AUTH. #1367

Uh oh!

Conversation

aravindyeduvaka commented Feb 26, 2020

Uh oh!

mgravell commented Feb 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mgravell commented Feb 27, 2020

Uh oh!

aravindyeduvaka commented Feb 27, 2020

Uh oh!

mgravell commented Feb 28, 2020

Uh oh!

lostindark commented Mar 18, 2020

Uh oh!

mgravell commented Mar 18, 2020 via email

Uh oh!

lostindark commented Mar 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NickCraver commented Mar 18, 2020

Uh oh!

lostindark commented Mar 18, 2020

Uh oh!

lostindark commented Mar 24, 2020

Uh oh!

NickCraver commented Mar 24, 2020

Uh oh!

lostindark commented Mar 27, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mgravell commented Feb 27, 2020 •

edited

Loading

lostindark commented Mar 18, 2020 •

edited

Loading