Auto merge of #4645 - ferrous-systems:pa-tcp-user-timeout, r=Turbo87
Set `TCP_USER_TIMEOUT` to 15 seconds for database connections
In the December 22, 2021 crates.io outage, we experienced full packet loss between the database server and the application server(s). While the crates.io application supports running without a database during outages, those mitigations kicked in only a bit more than 15 minutes after the packet loss started.
The default parameters of the Linux network stack result in a TCP connection being marked as broken after roughly 15 minutes of no acknowledgements being received, which is what I believe happened during that outage. Broken database mitigations kicking in after 15 minutes is too long for crates.io, as we ideally want those mitigations to kick in as soon as possible (ensuring crates.io users can continue downloading crates).
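As a rough illustration of where the "roughly 15 minutes" comes from, here is a minimal sketch of the retransmission backoff arithmetic. It assumes the Linux defaults of `tcp_retries2 = 15`, a minimum retransmission timeout of 200 ms and a maximum of 120 s, with the timeout doubling after every unacknowledged retransmission. The function name is hypothetical, and the real kernel derives its timeout from the measured round-trip time, so actual values will differ somewhat.

```rust
/// Sums the waiting periods of an exponential backoff: one initial wait plus
/// one wait after each of `retries` retransmissions, doubling every time and
/// capped at `rto_max_ms`. All parameter values are assumptions supplied by
/// the caller, not values read from the running kernel.
fn retransmission_timeout_ms(retries: u32, rto_min_ms: u64, rto_max_ms: u64) -> u64 {
    let mut total_ms = 0;
    let mut rto_ms = rto_min_ms;
    for _ in 0..=retries {
        total_ms += rto_ms;
        // Double the timeout for the next attempt, up to the cap.
        rto_ms = (rto_ms * 2).min(rto_max_ms);
    }
    total_ms
}
```

With the assumed defaults, `retransmission_timeout_ms(15, 200, 120_000)` evaluates to 924,600 ms, or about 15.4 minutes, which is consistent with the delay observed during the outage.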
This PR tells libpq (the underlying PostgreSQL client used by diesel) to set the Linux network stack timeout to a configurable value, which we set to **15 seconds**. With this change, if an outage like the one on December 22, 2021 happens again, crates.io will be fully unavailable only for 15 seconds rather than 15 minutes.
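One way to pass such a timeout to libpq is through its `tcp_user_timeout` connection parameter, which takes a value in milliseconds. The sketch below appends that parameter to a connection URL. The helper name `with_tcp_user_timeout` is hypothetical and this is not necessarily how the PR implements the change. It also assumes the URL has no fragment and does not already set the parameter.

```rust
use std::time::Duration;

/// Appends libpq's `tcp_user_timeout` parameter (in milliseconds) to a
/// PostgreSQL connection URL. Hypothetical helper for illustration only:
/// it does not check whether the parameter is already present.
fn with_tcp_user_timeout(database_url: &str, timeout: Duration) -> String {
    // Start a query string if there is none yet, otherwise extend it.
    let separator = if database_url.contains('?') { '&' } else { '?' };
    format!(
        "{}{}tcp_user_timeout={}",
        database_url,
        separator,
        timeout.as_millis()
    )
}
```

For example, calling it with `postgres://localhost/cargo_registry` and `Duration::from_secs(15)` yields `postgres://localhost/cargo_registry?tcp_user_timeout=15000`. The database name here is only a placeholder.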
Note that the "15 seconds" value is a hunch I had; I'm open to feedback on different values we could use.
This PR intentionally doesn't have any tests: while I have a local commit changing ChaosProxy to simulate packet loss, that's done by shelling out to `iptables`, which is both Linux-only and root-only. Unfortunately, I am not aware of any cross-platform, unprivileged tool we could use to simulate this in our test suite.
Huge thanks to `@weiznich` in diesel-rs/diesel#3016 for helping debug this issue and for trying possible fixes on the diesel side.