
Commit 6ae40c3

set TCP_USER_TIMEOUT to 15 seconds for database connections
In the December 22, 2021 crates.io outage, we experienced full packet loss between the database server and the application server(s). While the crates.io application supports running without a database during outages, those mitigations kicked in only a bit more than 15 minutes after the packet loss started.

The default parameters of the Linux network stack result in a TCP connection being marked as broken after roughly 15 minutes without any acknowledgements being received, which is what I believe happened during that outage.

Waiting 15 minutes for the broken-database mitigations to kick in is too long for crates.io: ideally we want them to kick in as soon as possible, so that crates.io users can continue downloading crates.

This commit tells libpq (the underlying PostgreSQL client used by diesel) to set the Linux network stack timeout to a configurable value, which we set to 15 seconds. With this change, if an outage like the one on December 22, 2021 happens again, crates.io will be fully unavailable for only 15 seconds rather than 15 minutes.
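
The "roughly 15 minutes" figure can be reproduced with a back-of-the-envelope calculation. The sketch below is not part of this commit. It assumes the usual Linux defaults of `tcp_retries2 = 15`, a minimum retransmission timeout of 200 ms and a maximum of 120 s, and it models the retransmission timeout as simply doubling after each unacknowledged attempt. Real connections derive their timeout from the measured round-trip time, so treat the result as an estimate; the function name is hypothetical.

```rust
/// Sum the waits of an exponentially backed-off retransmission schedule.
/// The parameters are assumptions about Linux defaults, not values taken
/// from the crates.io code base.
fn hypothetical_timeout_ms(retries: u32, rto_min_ms: u64, rto_max_ms: u64) -> u64 {
    let mut total: u64 = 0;
    let mut rto = rto_min_ms;
    // One wait before each of the `retries` retransmissions, plus the final
    // wait after the last one before the connection is declared broken.
    for _ in 0..=retries {
        total += rto;
        rto = (rto * 2).min(rto_max_ms);
    }
    total
}

fn main() {
    let ms = hypothetical_timeout_ms(15, 200, 120_000);
    // prints "924600 ms (~15.4 minutes)"
    println!("{} ms (~{:.1} minutes)", ms, ms as f64 / 60_000.0);
}
```

Under those assumptions the connection is only reported as broken after about 924.6 seconds, which is consistent with the delay observed during the outage. Setting `TCP_USER_TIMEOUT` replaces this retry-count-based limit with an explicit time limit.
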
1 parent e03f65b commit 6ae40c3

2 files changed: +34 -4 lines changed


src/config/database_pools.rs

Lines changed: 16 additions & 0 deletions
@@ -9,6 +9,7 @@
 //! - `DB_OFFLINE`: If set to `leader` then use the read-only follower as if it was the leader.
 //!   If set to `follower` then act as if `READ_ONLY_REPLICA_URL` was unset.
 //! - `READ_ONLY_MODE`: If defined (even as empty) then force all connections to be read-only.
+//! - `DB_TCP_TIMEOUT_MS`: TCP timeout in milliseconds. See the doc comment for more details.
 
 use crate::env;
 
@@ -18,6 +19,12 @@ pub struct DatabasePools {
     pub primary: DbPoolConfig,
     /// An optional follower database. Always read-only.
     pub replica: Option<DbPoolConfig>,
+    /// Number of milliseconds to wait for unacknowledged TCP packets before treating the
+    /// connection as broken. This value will determine how long crates.io stays unavailable in
+    /// case of full packet loss between the application and the database: setting it too high
+    /// will result in an unnecessarily long outage (before the unhealthy database logic kicks
+    /// in), while setting it too low might result in healthy connections being dropped.
+    pub tcp_timeout_ms: u64,
 }
 
 #[derive(Debug)]
@@ -67,6 +74,11 @@ impl DatabasePools {
             _ => None,
         };
 
+        let tcp_timeout_ms = match dotenv::var("DB_TCP_TIMEOUT_MS") {
+            Ok(num) => num.parse().expect("couldn't parse DB_TCP_TIMEOUT_MS"),
+            Err(_) => 15 * 1000, // 15 seconds
+        };
+
         match dotenv::var("DB_OFFLINE").as_deref() {
             // The actual leader is down, use the follower in read-only mode as the primary and
             // don't configure a replica.
@@ -79,6 +91,7 @@ impl DatabasePools {
                     min_idle: primary_min_idle,
                 },
                 replica: None,
+                tcp_timeout_ms,
             },
             // The follower is down, don't configure the replica.
             Ok("follower") => Self {
@@ -89,6 +102,7 @@ impl DatabasePools {
                     min_idle: primary_min_idle,
                 },
                 replica: None,
+                tcp_timeout_ms,
             },
             _ => Self {
                 primary: DbPoolConfig {
@@ -106,6 +120,7 @@ impl DatabasePools {
                     pool_size: replica_pool_size,
                     min_idle: replica_min_idle,
                 }),
+                tcp_timeout_ms,
             },
         }
     }
@@ -119,6 +134,7 @@ impl DatabasePools {
                 min_idle: None,
             },
             replica: None,
+            tcp_timeout_ms: 1000, // 1 second
         }
     }
 }
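
The `DB_TCP_TIMEOUT_MS` handling above falls back to 15 seconds when the variable is unset and panics when it cannot be parsed. As a minimal, self-contained sketch of the same rule, the example below reads the variable with `std::env` instead of `dotenv` and returns a `Result` instead of panicking, so the rule can be exercised in isolation. The function and constant names are hypothetical and do not appear in the commit.

```rust
use std::env;

/// Default used when `DB_TCP_TIMEOUT_MS` is unset: 15 seconds, as in the commit.
const DEFAULT_DB_TCP_TIMEOUT_MS: u64 = 15 * 1000;

/// Hypothetical helper: turn the raw environment value into a timeout in
/// milliseconds. Unlike the commit, parse failures are returned as errors
/// rather than triggering a panic.
fn parse_tcp_timeout_ms(raw: Option<&str>) -> Result<u64, String> {
    match raw {
        Some(s) => s
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("couldn't parse DB_TCP_TIMEOUT_MS ({s:?}): {e}")),
        None => Ok(DEFAULT_DB_TCP_TIMEOUT_MS),
    }
}

fn main() {
    // An unset (or non-Unicode) variable is treated as "use the default".
    let raw = env::var("DB_TCP_TIMEOUT_MS").ok();
    match parse_tcp_timeout_ms(raw.as_deref()) {
        Ok(ms) => println!("tcp_timeout_ms = {ms}"),
        Err(err) => eprintln!("{err}"),
    }
}
```

Whether a malformed value should abort startup (as in the commit) or be reported and handled is a design choice; failing fast at boot is a reasonable option for a value that affects outage behaviour.
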

src/db.rs

Lines changed: 18 additions & 4 deletions
@@ -54,8 +54,8 @@ impl DieselPool {
     }
 
     pub(crate) fn new_test(config: &config::Server, url: &str) -> DieselPool {
-        let conn =
-            PgConnection::establish(&connection_url(config, url)).expect("failed to establish connection");
+        let conn = PgConnection::establish(&connection_url(config, url))
+            .expect("failed to establish connection");
         conn.begin_test_transaction()
             .expect("failed to begin test transaction");
         DieselPool::Test(Arc::new(ReentrantMutex::new(conn)))
@@ -142,13 +142,27 @@ pub fn connection_url(config: &config::Server, url: &str) -> String {
     let mut url = Url::parse(url).expect("Invalid database URL");
 
     // Enforce secure connections in production.
-    if config.base.env == Env::Production && !url.query_pairs().any(|(k, _)| k == "sslmode") {
-        url.query_pairs_mut().append_pair("sslmode", "require");
+    if config.base.env == Env::Production {
+        maybe_append_url_param(&mut url, "sslmode", "require");
     }
 
+    // Configure the time it takes for diesel to return an error when there is full packet loss
+    // between the application and the database.
+    maybe_append_url_param(
+        &mut url,
+        "tcp_user_timeout",
+        &config.db.tcp_timeout_ms.to_string(),
+    );
+
     url.into()
 }
 
+fn maybe_append_url_param(url: &mut Url, key: &str, value: &str) {
+    if !url.query_pairs().any(|(k, _)| k == key) {
+        url.query_pairs_mut().append_pair(key, value);
+    }
+}
+
 pub trait RequestTransaction {
     /// Obtain a read/write database connection from the primary pool
     fn db_conn(&self) -> Result<DieselPooledConn<'_>, PoolError>;
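
The "append only if absent" rule in `maybe_append_url_param` means that a parameter already present in the database URL always wins over the default. The sketch below illustrates that behaviour with the standard library only: it works on a plain `String` rather than the `url` crate's `Url` type used in the commit, does no percent-encoding, and ignores URL fragments, so it is an illustration of the rule and not a replacement for the real helper. The function name is hypothetical.

```rust
/// Append `key=value` to the query string of `url` unless `key` is already
/// present. Simplified stand-in for the commit's helper: no percent-encoding
/// and no fragment handling.
fn maybe_append_param(url: &mut String, key: &str, value: &str) {
    // Everything after the first '?' is treated as the query string.
    let query = url.split_once('?').map(|(_, q)| q).unwrap_or("");
    let present = query
        .split('&')
        .any(|pair| pair.split('=').next() == Some(key));
    if !present {
        let separator = if url.contains('?') { '&' } else { '?' };
        url.push(separator);
        url.push_str(key);
        url.push('=');
        url.push_str(value);
    }
}

fn main() {
    let mut url = String::from("postgres://user@db.example/cratesio");
    maybe_append_param(&mut url, "sslmode", "require");
    maybe_append_param(&mut url, "tcp_user_timeout", "15000");
    // An explicit value already in the URL is left untouched.
    maybe_append_param(&mut url, "tcp_user_timeout", "1000");
    println!("{url}");
}
```

In practice this means an operator can still override the timeout for a single database by putting `tcp_user_timeout` directly in its connection URL.
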
