Description
On the ClickHouse cluster on dogfood, we can query most tables normally:
oximeter_cluster_3 :) select * from oximeter.fields_string limit 1;
SELECT *
FROM oximeter.fields_string
LIMIT 1
Query id: 0d9ae787-7841-437f-af63-5ddaa507dde7
┌─timeseries_name─────────────────────────┬──────timeseries_key─┬─field_name─┬─field_value─┐
│ bgp_session:active_connections_accepted │ 7162701739278076253 │ hostname │ oxz_switch │
└─────────────────────────────────────────┴─────────────────────┴────────────┴─────────────┘
1 row in set. Elapsed: 0.003 sec.
But queries against some tables fail with an opaque error:
oximeter_cluster_3 :) select * from oximeter.fields_uuid limit 1;
SELECT *
FROM oximeter.fields_uuid
LIMIT 1
Query id: 840c5568-56f3-4ae8-8de0-2c014791fd9f
0 rows in set. Elapsed: 0.032 sec.
Received exception from server (version 23.8.7):
Code: 1000. DB::Exception: Received from fd00:1122:3344:107::23:9000. DB::Exception: Invalid argument. (POCO_EXCEPTION)
That error doesn't tell us much, but the logs include messages like this (for a different table, but the same kind of log line):
2025.10.15 21:24:27.770525 [ 210 ] {} <Information> oximeter.measurements_cumulativei64_local (2df79e33-9497-4083-93a1-c2d9ee96177d): Code: 234. DB::Exception: No active replica has part all_13521460_13522322_553 or covering part (cannot execute queue-0023159588: GET_PART with virtual parts [all_13521460_13522322_553]). (NO_REPLICA_HAS_PART)
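To confirm that the part really is gone everywhere, we can look for it on every replica. A minimal sketch, assuming the logical cluster is named oximeter_cluster (that name is a guess; system.clusters has the real one), and noting that this matches only the exact part name, not a covering part:
SELECT
    hostName() AS replica,
    name,
    active
FROM clusterAllReplicas('oximeter_cluster', system.parts)
WHERE (database = 'oximeter')
    AND (table = 'measurements_cumulativei64_local')
    AND (name = 'all_13521460_13522322_553')
An empty result would be consistent with the NO_REPLICA_HAS_PART message above.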
These logs go back to at least 2025-10-06, and the errors have been frequent:
root@oxz_clickhouse_server_f9065d53:~# grep NO_REPLICA_HAS_PART /data/log/clickhouse.* | wc -l
18086
I asked Claude how to look up tables with missing parts, and here's what I came up with:
SELECT
    database,
    table,
    type,
    new_part_name
FROM system.replication_queue
WHERE last_exception LIKE '%NO_REPLICA_HAS_PART%'
ORDER BY
    database ASC,
    table ASC
Query id: 680cf37e-3d6a-42c9-8da4-28754c235187
┌─database─┬─table────────────────────────────┬─type─────┬─new_part_name──────────────┐
│ oximeter │ fields_ipaddr_local │ GET_PART │ all_0_10278596_5391567 │
│ oximeter │ fields_u16_local │ GET_PART │ all_0_6894663_2847443 │
│ oximeter │ fields_u32_local │ GET_PART │ all_0_19635386_8237090 │
│ oximeter │ fields_u8_local │ GET_PART │ all_0_1007375_202061 │
│ oximeter │ fields_uuid_local │ GET_PART │ all_0_20936853_4604456 │
│ oximeter │ measurements_bool_local │ GET_PART │ all_782266_782270_1 │
│ oximeter │ measurements_bool_local │ GET_PART │ all_779265_782265_1293 │
│ oximeter │ measurements_cumulativei64_local │ GET_PART │ all_13521460_13522322_553 │
│ oximeter │ measurements_f64_local │ GET_PART │ all_757486_797440_37935 │
│ oximeter │ measurements_histogramu64_local │ GET_PART │ all_14980829_14981155_72 │
│ oximeter │ measurements_u64_local │ GET_PART │ all_14553900_14555829_1664 │
│ oximeter │ measurements_u64_local │ GET_PART │ all_14553900_14555830_1665 │
│ oximeter │ measurements_u64_local │ GET_PART │ all_14553900_14555828_1663 │
└──────────┴──────────────────────────────────┴──────────┴────────────────────────────┘
13 rows in set. Elapsed: 0.002 sec.
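system.replication_queue also records when each entry was created and how many times it has been retried, which should tell us how long these have been stuck. A sketch against the same table, using columns I believe exist on 23.8:
SELECT
    database,
    table,
    new_part_name,
    create_time,
    num_tries,
    num_postponed,
    postpone_reason
FROM system.replication_queue
WHERE last_exception LIKE '%NO_REPLICA_HAS_PART%'
ORDER BY create_time ASC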
Interestingly, I haven't noticed any errors from nexus or oximeter related to this, but I also haven't looked exhaustively.
I think we need to figure out how to remediate the problem, how to prevent recurrence, and also whether we're missing alerting that would have told us something was wrong.
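As one possible starting point for remediation, and only a sketch (I have not validated this against our cluster): SYSTEM RESTART REPLICA re-initializes a table's ZooKeeper session and replication queue, which is sometimes enough for ClickHouse to notice a part is lost and move on. Whether it actually clears these stuck GET_PART entries is something we'd need to test:
-- Re-initialize replication state for one affected table (sketch, not a runbook).
SYSTEM RESTART REPLICA oximeter.measurements_cumulativei64_local;
-- Then check whether the stuck entries for that table are gone.
SELECT count()
FROM system.replication_queue
WHERE table = 'measurements_cumulativei64_local'
    AND last_exception LIKE '%NO_REPLICA_HAS_PART%';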