@internet-diglett was deploying Omicron (a branch off "main" but I think that's not relevant here) and ran into a case where RSS was looping forever trying to retry database initialization that had already completed. There were two sleds: "workstation-1" (running RSS) and "test-station-1".
On workstation-1, we have a bunch of log messages like this one:
[2023-07-05T17:10:03.798208372Z] INFO: SledAgent/dropshot (SledAgent)/2525 on test-station-1: request completed (req_id=f8557412-0bea-4465-ad3f-68218bb4ba50, uri=/cockroachdb, method=POST, remote_addr=[fd00:1122:3344:102::1]:62144, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
error_message_internal: Failed to get address: Failed to get address for name oxControlService3/omicron6 in oxz_cockroachdb_oxp_f3393d6c-25e2-4435-9602-b34bc8f42d43: Zone execution error: Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_f3393d6c-25e2-4435-9602-b34bc8f42d43 /usr/sbin/ipadm show-addr -p -o ADDR oxControlService3/omicron6] executed and failed with status: exit status: 1 stdout:
stderr: ipadm: Address object not found
followed by exactly one message like this:
[2023-07-05T17:12:48.114002609Z] INFO: SledAgent/dropshot (SledAgent)/2525 on test-station-1: request completed (req_id=6ab3e28e-9af9-460c-980a-5890fe4f8b4e, uri=/cockroachdb, method=POST, remote_addr=[fd00:1122:3344:102::1]:62144, local_addr=[fd00:1122:3344:101::1]:12345, response_code=204)
followed by a bunch of messages like this:
[2023-07-05T17:12:48.304971089Z] INFO: SledAgent/dropshot (SledAgent)/2525 on test-station-1: request completed (req_id=310d8023-bda5-420c-bcd4-69116f18bea3, uri=/cockroachdb, method=POST, remote_addr=[fd00:1122:3344:102::1]:41882, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
error_message_internal: Failed to initialize CockroachDb: Error running command in zone 'oxz_cockroachdb_oxp_f3393d6c-25e2-4435-9602-b34bc8f42d43': Command [/opt/oxide/cockroachdb/bin/cockroach init --insecure --host [fd00:1122:3344:101::6]:32221] executed and failed with status: exit status: 1 stdout: stderr: ERROR: cluster has already been initialized
Failed running "init"
All of these correspond to requests from RSS for `POST /cockroachdb`, which initializes the CockroachDB cluster (runs `cockroach init` and loads a few SQL files) using any of the CockroachDB zones on that sled. To summarize what's above: initially, these requests fail, apparently because the zone is not yet set up. At some point, one request completes successfully. Then Sled Agent keeps getting these requests, and they fail because this step has already been done.
Over on the RSS side, the only log entries for this I found were:
$ grep /cockroachdb sled-agent-1-RSS.log | bunyan
[2023-07-05T17:14:21.843265387Z] WARN: SledAgent/RSS/8449 on workstation-1: Failed to initialize CockroachDB
error: Communication Error: error sending request for url (http://[fd00:1122:3344:101::1]:12345/cockroachdb): operation timed out
[2023-07-05T17:15:24.605878005Z] WARN: SledAgent/RSS/8449 on workstation-1: Failed to initialize CockroachDB
error: Communication Error: error sending request for url (http://[fd00:1122:3344:101::1]:12345/cockroachdb): operation timed out
We discovered at this point that these hosts' clocks are not synced: workstation-1 is about 197 seconds ahead of test-station-1. Thus, those timeouts happened around 2023-07-05T17:11:04.843Z and 2023-07-05T17:12:07.605Z on test-station-1's clock.
So what happened here? The "operation timed out" sounds like either an explicit request timeout or else the TCP connection failed because one side didn't ACK within some expected time window. @internet-diglett found that we are using a 60-second request timeout here:
`omicron/sled-agent/src/rack_setup/service.rs`, line 932 (as of commit c3048a1): `.timeout(dur)`
So here's a plausible sequence of events:
- 17:10:03.798Z on test-station-1: a `POST /cockroachdb` request (call it R1) fails because networking is not yet set up in the zone (direct evidence: the log entry above)
- 17:10:04.483Z on workstation-1: RSS tries another `POST /cockroachdb` request (call it R2) (indirect evidence: this event is inferred from the next event plus the known 60-second request timeout; it's also consistent with the failure just milliseconds earlier)
- 17:11:04.483Z on workstation-1: RSS times out request R2 (direct evidence: log entry above, with the timestamp adjusted for the clock difference between the two servers). Critically, _request R2 would still be running_.
- 17:11:07.605Z on workstation-1: RSS tries another `POST /cockroachdb` request (call it R3) (indirect evidence: similar to above). This gets stuck behind `SledAgent.inner.zones.lock().await` because that lock is held by R2.
- 17:12:07.605Z on workstation-1: RSS times out request R3 (direct evidence: similar to above)
- 17:12:48.114Z on test-station-1: request R2 completes successfully, having taken about 2m43s
Subsequently, all requests by RSS to `POST /cockroachdb` fail because the database is already initialized (by R2).
All of this is what we'd expect if the operations done during `POST /cockroachdb` took more than 60s. So did that happen? We can't tell for sure, but @internet-diglett ran the two `cockroach sql` steps that are part of db initialization by hand. Together they took over 2m24s. So there's every reason to believe this is what happened.
I think there's room for several improvements here:
- Most immediately, I think we should not time out this request on the RSS side. A connect timeout is fine, but I don't think any good can come from abandoning an in-flight request we haven't gotten a response for if we have no reason to think there's a networking problem in the way. Update: see "do not time out database init requests" #3503.
- Sled Agent's `POST /cockroachdb` is not idempotent. Update: I've filed "Sled Agent `POST /cockroachdb` is not idempotent" #3499 for this.
- I think it'd be worthwhile for RSS to log all requests that it makes, either at "debug" or "trace" level. This would have made it easier (and higher confidence) to construct the above timeline. Update: I'm not sure how best to do this. Filed "log all RSS HTTP requests by default" #3502 for it.
- The "request completed" log entries could include the request latency. That would have helped confirm that the `POST /cockroachdb` that succeeded did take longer than 60s. Update: this is "'request completed' log message should include response latency" dropshot#692.
- The warning log message on the Sled Agent side could include how long before it's going to retry. This information is available at the time we log the message, and it would make it easier (and higher confidence) to construct the above timeline. Update: see "do not time out database init requests" #3503.
I'm going to try to roll most of those into a PR shortly and file separate issues for the rest.