Set TCP_USER_TIMEOUT to 15 seconds for database connections
#4645
In the December 22, 2021 crates.io outage, we experienced full packet loss between the database server and the application server(s). While the crates.io application supports running without a database during outages, those mitigations kicked in only a bit more than 15 minutes after the packet loss started.
The default parameters of the Linux network stack result in a TCP connection being marked as broken after roughly 15 minutes without any acknowledgements being received, which is what I believe happened during that outage. Having the broken-database mitigations kick in after 15 minutes is too slow for crates.io, as we ideally want them to kick in as soon as possible (ensuring crates.io users can continue downloading crates).
This commit tells libpq (the underlying PostgreSQL client used by diesel) to set the Linux network stack timeout to a configurable value, which we set to 15 seconds. With this change, if an outage like the one on Dec 22, 2021 happens again, crates.io will be fully unavailable for only 15 seconds rather than 15 minutes.
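For readers wondering where the "roughly 15 minutes" comes from, here is a rough sketch of the arithmetic. It assumes the stock Linux settings (`net.ipv4.tcp_retries2 = 15`, a 200 ms minimum and a 120 s maximum retransmission timeout) and a simple doubling backoff. The function name is made up for illustration, and on a real connection the starting timeout depends on the measured round-trip time, so treat the result as a lower bound rather than an exact figure.

```rust
/// Hypothetical helper: sums the waits of an exponential retransmission
/// backoff that starts at `rto_min_ms`, doubles on every retry and is
/// capped at `rto_max_ms`. There are `retries + 1` waits in total: one
/// before each retransmission plus the final wait before giving up.
fn retransmission_timeout_ms(retries: u32, rto_min_ms: u64, rto_max_ms: u64) -> u64 {
    (0..=retries)
        .map(|i| {
            // Clamp the shift so very large retry counts cannot overflow.
            let factor = 1u64 << i.min(62);
            rto_min_ms.saturating_mul(factor).min(rto_max_ms)
        })
        .sum()
}

fn main() {
    // Assumed Linux defaults: 15 retries, 200 ms minimum, 120 s maximum.
    let total_ms = retransmission_timeout_ms(15, 200, 120_000);
    println!(
        "{:.1} s (~{:.1} min)",
        total_ms as f64 / 1000.0,
        total_ms as f64 / 60_000.0
    );
    // prints "924.6 s (~15.4 min)"
}
```

Under those assumptions the connection is declared dead after about 924.6 seconds, which lines up with the "a bit more than 15 minutes" observed during the outage.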
Just to restate the obvious point:
Thanks for your comment, weiznich! That is indeed true in theory. In practice, with the traffic patterns we get on crates.io, all connections in the pool will be requested at roughly the same time (we don't get a steady stream of requests, we get bursts of 100+ parallel requests whenever someone runs
Turbo87
left a comment
LGTM thanks! :)
feel free to r=me
@bors r=Turbo87
📌 Commit 399c8bb has been approved by
☀️ Test successful - checks-actions
Note that the "15 seconds" value is just a hunch I had. I'm open to feedback on different values we could use.
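As a rough illustration of the mechanism (not the code in this PR), libpq accepts a `tcp_user_timeout` connection parameter expressed in milliseconds, which it applies as the `TCP_USER_TIMEOUT` socket option on platforms that support it. A minimal sketch of adding it to a connection URL could look like the following. The helper name and the example URL are hypothetical, and the sketch assumes the URL has no fragment component.

```rust
use std::time::Duration;

/// Hypothetical helper (not taken from the crates.io codebase): appends
/// libpq's `tcp_user_timeout` parameter, in milliseconds, to a PostgreSQL
/// connection URL, reusing the existing query string if there is one.
fn with_tcp_user_timeout(database_url: &str, timeout: Duration) -> String {
    let separator = if database_url.contains('?') { '&' } else { '?' };
    format!(
        "{}{}tcp_user_timeout={}",
        database_url,
        separator,
        timeout.as_millis()
    )
}

fn main() {
    // Example URL is made up for illustration.
    let url = with_tcp_user_timeout(
        "postgres://localhost/cargo_registry",
        Duration::from_secs(15),
    );
    println!("{}", url);
    // prints "postgres://localhost/cargo_registry?tcp_user_timeout=15000"
}
```

Keeping the value as a `Duration` until the last moment makes it harder to mix up seconds and milliseconds when the timeout is read from configuration.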
This PR intentionally doesn't have any tests: while I have a local commit changing ChaosProxy to simulate packet loss, that's done by shelling out to iptables, which is both Linux-only and root-only. Unfortunately, I am not aware of any cross-platform, unprivileged tool we could use to simulate this in our test suite.
Huge thanks to @weiznich in diesel-rs/diesel#3016 for helping debug this issue and for trying possible fixes on the diesel side.