Set `TCP_USER_TIMEOUT` to 15 seconds for database connections #4645

pietroalbini · 2022-03-17T11:04:03Z

In the December 22, 2021 crates.io outage, we experienced full packet loss between the database server and the application server(s). While the crates.io application supports running without a database during outages, those mitigations kicked in only a bit more than 15 minutes after the packet loss started.

The default parameters of the Linux network stack result in a TCP connection being marked as broken after roughly 15 minutes without any acknowledgements being received, which is what I believe happened during that outage. Waiting 15 minutes for the broken-database mitigations to kick in is too long for crates.io: ideally we want them to kick in as soon as possible, so that crates.io users can continue downloading crates.
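As a rough sketch of where the "roughly 15 minutes" comes from: this assumes the relevant parameter is `net.ipv4.tcp_retries2` at its default of 15, with the retransmission timeout doubling from a 200 ms minimum up to a 120 s cap. Those values and the helper name `retransmission_timeout` are assumptions for illustration, not something stated in this PR, and the real kernel behaviour depends on the measured round-trip time.

```rust
use std::time::Duration;

/// Hypothetical helper: total time spent waiting before the kernel gives up on
/// an unacknowledged segment, modelling exponential backoff capped at `rto_max`.
fn retransmission_timeout(retries: u32, rto_min: Duration, rto_max: Duration) -> Duration {
    let mut total = Duration::ZERO;
    let mut rto = rto_min;
    // One wait before each retransmission, plus a final wait after the last one.
    for _ in 0..=retries {
        total += rto;
        rto = (rto * 2).min(rto_max);
    }
    total
}

fn main() {
    // Assumed defaults: tcp_retries2 = 15, minimum RTO 200 ms, maximum RTO 120 s.
    let total = retransmission_timeout(15, Duration::from_millis(200), Duration::from_secs(120));
    println!("{:.1} s (~{:.1} min)", total.as_secs_f64(), total.as_secs_f64() / 60.0);
}
```

Under these assumptions the model gives 924.6 seconds, which lines up with the 15 minutes observed during the outage.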

This PR tells libpq (the underlying PostgreSQL client used by diesel) to set the Linux network stack timeout to a configurable value, which we set to 15 seconds. With this change, if an outage like the one on Dec 22, 2021 happens again, crates.io will be fully unavailable for only 15 seconds rather than 15 minutes.
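A minimal sketch of the idea, not the code in this PR: libpq accepts a `tcp_user_timeout` connection parameter, expressed in milliseconds, which can be passed as a query parameter on the connection URL. The helper name `with_tcp_user_timeout` and the example URL are hypothetical, and the naive `?`/`&` handling stands in for proper URL parsing.

```rust
use std::time::Duration;

/// Hypothetical helper: append libpq's `tcp_user_timeout` connection parameter
/// (in milliseconds) to a PostgreSQL connection URL.
fn with_tcp_user_timeout(database_url: &str, timeout: Duration) -> String {
    // Use `&` if the URL already carries query parameters, `?` otherwise.
    let separator = if database_url.contains('?') { '&' } else { '?' };
    format!("{database_url}{separator}tcp_user_timeout={}", timeout.as_millis())
}

fn main() {
    // Placeholder URL, for illustration only.
    let url = with_tcp_user_timeout("postgres://localhost/example_db", Duration::from_secs(15));
    println!("{url}");
}
```

libpq only honours this parameter on systems where `TCP_USER_TIMEOUT` is available, so on other platforms the setting has no effect.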

Note that the "15 seconds" value is a hunch I had; I'm open to feedback on different values we could use.

This PR intentionally doesn't have any tests: while I have a local commit changing ChaosProxy to simulate packet loss, that's done by shelling out to iptables, which is both Linux-only and root-only. Unfortunately, I am not aware of any cross-platform, unprivileged tool we could use to simulate this in our test suite.

Huge thanks to @weiznich in diesel-rs/diesel#3016 for helping debug this issue and for trying possible fixes on the diesel side.

weiznich · 2022-03-17T12:27:45Z

> With this change, if an outage like the one on Dec 22, 2021 happens again, crates.io will be fully unavailable for only 15 seconds rather than 15 minutes.

Just to restate the obvious point: TCP_USER_TIMEOUT refers to the timeout for an individual network request. That does not necessarily mean that everything will be fine after exactly 15 seconds; that depends on more factors. As crates.io uses a connection pool, the worst-case scenario would be something like:
An operation tries to check out a connection from the pool. -> The pool runs into the 15s timeout for the first connection, so it tries the next one, and so on. -> Only after the pool runs out of connections or time will it raise an error, not earlier. That could easily take more than 15 seconds, depending on the pool configuration. That means TCP_USER_TIMEOUT is only one part of what can help prevent such issues from happening again.
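The worst case described above can be sketched with a deliberately simplified model. It assumes the pool probes connections strictly one after another, each probe costs the full TCP timeout, and the pool stops once its own checkout deadline has passed. The function name `worst_case_checkout` and all parameter values are hypothetical, and this is not how any particular pool implementation is known to behave.

```rust
use std::time::Duration;

/// Hypothetical model: time until a checkout fails when every pooled
/// connection is dead and the pool tries them sequentially.
fn worst_case_checkout(
    pool_size: u32,
    tcp_user_timeout: Duration,
    checkout_timeout: Duration,
) -> Duration {
    let mut elapsed = Duration::ZERO;
    for _ in 0..pool_size {
        // The pool gives up once its own deadline has passed.
        if elapsed >= checkout_timeout {
            break;
        }
        // Each dead connection costs one full TCP_USER_TIMEOUT.
        elapsed += tcp_user_timeout;
    }
    elapsed
}

fn main() {
    // Illustrative numbers only: 10 connections, 15 s TCP timeout, 30 s checkout deadline.
    let worst = worst_case_checkout(10, Duration::from_secs(15), Duration::from_secs(30));
    println!("worst-case checkout failure after {} s", worst.as_secs());
}
```

In this model the error surfaces after 30 seconds rather than 15, which is the kind of gap the comment warns about.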

pietroalbini · 2022-03-17T15:33:51Z

Thanks for your comment, weiznich! That is indeed true in theory. In practice, with the traffic patterns we get on crates.io, all connections in the pool will be requested at roughly the same time (we don't get a steady stream of requests, we get bursts of 100+ parallel requests whenever someone runs cargo build), causing all of them to time out in parallel.
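To illustrate the parallel case with a simplified model: if every connection starts waiting within a short burst, the whole pool is known to be broken shortly after the timeout, not after a multiple of it. The function name `pool_failure_time` and the arrival offsets below are hypothetical, and the model assumes each connection fails exactly one timeout after it is first used.

```rust
use std::time::Duration;

/// Hypothetical model: each connection fails `tcp_user_timeout` after the
/// request that uses it arrives; the pool is fully failed at the latest of those.
fn pool_failure_time(arrivals: &[Duration], tcp_user_timeout: Duration) -> Option<Duration> {
    arrivals.iter().map(|arrival| *arrival + tcp_user_timeout).max()
}

fn main() {
    // Illustrative burst: three requests arriving within 50 ms of each other.
    let arrivals = [
        Duration::from_millis(0),
        Duration::from_millis(20),
        Duration::from_millis(50),
    ];
    if let Some(failed_at) = pool_failure_time(&arrivals, Duration::from_secs(15)) {
        println!("all connections timed out after {} ms", failed_at.as_millis());
    }
}
```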

Turbo87

LGTM thanks! :)

feel free to r=me

pietroalbini · 2022-03-18T21:58:34Z

@bors r=Turbo87

bors · 2022-03-18T21:58:36Z

📌 Commit 399c8bb has been approved by Turbo87

bors · 2022-03-18T21:58:42Z

⌛ Testing commit 399c8bb with merge 0cbb775...

bors · 2022-03-18T22:04:07Z

☀️ Test successful - checks-actions
Approved by: Turbo87
Pushing 0cbb775 to master...

pietroalbini added 2 commits March 17, 2022 11:24

require the server configuration when connecting to the database

e03f65b

pietroalbini requested review from Turbo87 and jtgeibel March 17, 2022 11:04

Turbo87 reviewed Mar 18, 2022

Turbo87 added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-backend ⚙️ labels Mar 18, 2022

accept the database config instead of the server config

399c8bb

pietroalbini requested a review from Turbo87 March 18, 2022 13:53

Turbo87 approved these changes Mar 18, 2022

bors merged commit 0cbb775 into rust-lang:master Mar 18, 2022

This was referenced Mar 18, 2022

Yet more database metrics #3914

Closed

Add an environment variable to let us exclude crate IDs from the recent downloads list #4649

Merged

pietroalbini deleted the pa-tcp-user-timeout branch March 18, 2022 22:30
