Skip to content

Conversation

@pietroalbini
Copy link
Member

In the December 22, 2021 crates.io outage, we experienced full packet loss between the database server and the application server(s). While the crates.io application supports running without a database during outages, those mitigations kicked in only after a bit more than 15 minutes of the packet loss starting.

The default parameters of the Linux network stack result in a TCP connection being marked as broken after roughly 15 minutes of no acknowledgements being received, which is what I believe happened during that outage. Broken database mitigations kicking in after 15 minutes is too long for crates.io, as we ideally want those mitigations to kick in as soon as possible (ensuring crates.io users can continue downloading crates).

This PR tells libpq (the underlying PostgreSQL client used by diesel) to set the Linux network stack timeout to a configurable value, which we set as 15 seconds. With this change, if an outage like the one on Dec 22 2021 one happens again, crates.io will be fully unavailable only for 15 seconds rather than 15 minutes.

Note that the "15 seconds" value is a hunch I had, I'm open on feedback on different values we could use.

This PR intentionally doesn't have any tests: while I have a local commit changing ChaosProxy to simulate packet loss, that's done by shelling out to iptables, which is both linux-only and root-only. I am not aware of any cross-platform, unprivileged tool we could use to simulate this in our test suite unfortunately.

Huge thanks to @weiznich in diesel-rs/diesel#3016 for helping debug this issue and for trying possible fixes on the diesel side.

In the December 22, 2021 crates.io outage, we experienced full packet
loss between the database server and the application server(s). While
the crates.io application supports running without a database during
outages, those mitigations kicked in only after a bit more than 15
minutes of the packet loss starting.

The default parameters of the Linux network stack result in a TCP
connection being marked as broken after roughly 15 minutes of no
acknowledgements being received, which is what I believe happened during
that outage.

Broken database mitigations kicking in after 15 minutes is too long for
crates.io, as we ideally want those mitigations to kick in as soon as
possible (ensuring crates.io users can continue downloading crates).

This commit tells libpq (the underlying PostgreSQL client used by
diesel) to set the Linux network stack timeout to a configurable value,
which we set as 15 seconds. With this change, if an outage like the one
on Dec 22 2021 one happens again, crates.io will be fully unavailable
only for 15 seconds rather than 15 minutes.
@weiznich
Copy link
Contributor

With this change, if an outage like the one on Dec 22 2021 one happens again, crates.io will be fully unavailable only for 15 seconds rather than 15 minutes.

Just to make restate the obvious point: TCP_USER_TIMEOUT refers to the timeout for a individual network request. That does not necessarily mean that after exactly 15 seconds everything will be fine. That depends on more factors. As crates.io use a connection pool the worst case scenario would be something like:
An operation tries to checkout a connection from the pool. -> The pool runs in the 15s timeout for the first connection, so it tries the next one and so on -> After the pool runs out of connections/time it will raise an error, but not earlier. That could be easily more than 15 seconds depending on the pool configuration. That means TCP_USER_TIMEOUT is just only one part that can help preventing such issues again.

@pietroalbini
Copy link
Member Author

Thanks for your comment weiznich! That is indeed true in theory. In practice, with the traffic patterns we get in crates.io, all connections in the pool will be requested roughly at the same time (we don't get a steady stream of requests, we get bursts of 100+ parallel requests whenever someone runs cargo build), causing all of them to time out in parallel.

@Turbo87 Turbo87 added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-backend ⚙️ labels Mar 18, 2022
@pietroalbini pietroalbini requested a review from Turbo87 March 18, 2022 13:53
Copy link
Member

@Turbo87 Turbo87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks! :)

feel free to r=me

@pietroalbini
Copy link
Member Author

@bors r=Turbo87

@bors
Copy link
Contributor

bors commented Mar 18, 2022

📌 Commit 399c8bb has been approved by Turbo87

@bors
Copy link
Contributor

bors commented Mar 18, 2022

⌛ Testing commit 399c8bb with merge 0cbb775...

@bors
Copy link
Contributor

bors commented Mar 18, 2022

☀️ Test successful - checks-actions
Approved by: Turbo87
Pushing 0cbb775 to master...

@bors bors merged commit 0cbb775 into rust-lang:master Mar 18, 2022
@pietroalbini pietroalbini deleted the pa-tcp-user-timeout branch March 18, 2022 22:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-backend ⚙️ C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants