-
Notifications
You must be signed in to change notification settings - Fork 246
DRIVERS-1934: withTransaction API retries too frequently #1851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| callbacks. | ||
|
|
||
| A previous design had no limits for retrying commits or entire transactions. The callback is always able indicate that | ||
| A previous design had no limits for retrying commits or entire transactions. The callback is always able to indicate that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I read it wrong, or maybe its a typo? Not sure.
| [abortTransaction](../transactions/transactions.md#aborttransaction) on the session. | ||
| 2. If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is | ||
| less than 120 seconds, jump back to step two. | ||
| less than 120 seconds, sleep for `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know python uses ** as the exponential operator. I don't believe that is a standard across other programming languages. Is it clear that its exponent? Is there a different preferred symbol for exponent? (I know math commonly uses ^ but it typically also means bitwise XOR in code so I felt like that could be confusing.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
** seems fine to me, but I might also be biased because that's what Node does 🙂. But I don't think people will have trouble understanding what this means.
| ### Retry Backoff is Enforced | ||
|
|
||
| Drivers should test that retries within `withTransaction` do not occur immediately. Ideally, set BACKOFF_INITIAL 500ms | ||
| and configure a failpoint that forces one retry. Ensure that the operation took more than 500ms so succeed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test would not work as described because jitter is non-deterministic.
An alternative test would be to use a failpoint to fail the transaction X times and then assert the overall time is larger than some threshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I realized that when implementing it. I set the fail point to fail 3 times and that seems to be consistently working (and without jitter failing 3 times would still cause the success to happen within 500ms)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update: it was consistent locally but super flakey on atlas. I modified the with transaction code to append backoff values to make this easier to test and the test is now more stable.
source/transactions-convenient-api/transactions-convenient-api.md
Outdated
Show resolved
Hide resolved
| }, | ||
| "data": { | ||
| "failCommands": ["commitTransaction"], | ||
| "errorCode": 24, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use this error code instead for these tests?:
code: 251,
name: NoSuchTransaction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in f40529f
| ``` | ||
|
|
||
| Let the callback for the transaction be a simple `insertOne` command. Check that the total time for all retries exceeded | ||
| 3.5 seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two tests are identical. What if we run this test twice, first with backoff disabled (eg jitter set to always be 0%) and second with it enabled and jitter set to always be 100%. Then we can assert the second attempt takes 3 seconds longer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooo i like this idea! Changed in f40529f
I decreased the number of failures to 13 so the test wouldn't take as long (resulting in total backing time to be 2.2 seconds) and added a +/- 1 second window for the difference between the two attempts.
| ``` | ||
|
|
||
| Let the callback for the transaction be a simple `insertOne` command. Let `no_backoff_time` be the time it took for the | ||
| command to succeed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"command" -> "withTransaction call"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in 56e88f0
| return `1`. Set the fail point to force 13 retries using the same command as before. Using the same callback as before, | ||
| check that the total time for the withTransaction command is within +/-1 second of `no_backoff_time` plus 2.2 seconds. | ||
| Note that 2.2 seconds is the sum of backoff 13 consecutive backoff values and the 1-second window is just to account for | ||
| potential networking differences between the two runs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit for tone:
"the 1-second window is just to account for
potential networking differences between the two runs."
->
"the 1-second window accounts for potential variance between the two runs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in 56e88f0
| This method can be expressed by the following pseudo-code: | ||
|
|
||
| ```typescript | ||
| var BACKOFF_INITIAL = 1 // 1ms initial backoff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's update this to use 5ms as described in the tech doc. And the growth rate as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in 56e88f0
source/transactions-convenient-api/transactions-convenient-api.md
Outdated
Show resolved
Hide resolved
source/transactions-convenient-api/transactions-convenient-api.md
Outdated
Show resolved
Hide resolved
source/transactions-convenient-api/transactions-convenient-api.md
Outdated
Show resolved
Hide resolved
source/transactions-convenient-api/transactions-convenient-api.md
Outdated
Show resolved
Hide resolved
| > [!NOTE] | ||
| > errorCode 251 is NoSuchTransaction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't formatting for me? Maybe it's something to do with the indentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm wacky? it seemed to render in my preview in pycharm when I was editing this so i assumed it worked. I'll play around with the indentation after looking into CSOT. worse case i'll be boring and just have this note be normal text lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you were right about it being indentation! but unindenting it messes with the formatting of the bullets below it so no fancy alerts this time ;-;
i gotta keep this fancy alerts thing in mind for a future opportunity -- I want to use it now haha
(made it a normal comment in 1f04505)
Please complete the following before merging:
clusters).