Description
On the ClickHouse cluster on dogfood, we can query most tables normally:
oximeter_cluster_3 :) select * from oximeter.fields_string limit 1;
SELECT *
FROM oximeter.fields_string
LIMIT 1
Query id: 0d9ae787-7841-437f-af63-5ddaa507dde7
┌─timeseries_name─────────────────────────┬──────timeseries_key─┬─field_name─┬─field_value─┐
│ bgp_session:active_connections_accepted │ 7162701739278076253 │ hostname │ oxz_switch │
└─────────────────────────────────────────┴─────────────────────┴────────────┴─────────────┘
1 row in set. Elapsed: 0.003 sec.
But queries against some tables fail with an opaque error:
oximeter_cluster_3 :) select * from oximeter.fields_uuid limit 1;
SELECT *
FROM oximeter.fields_uuid
LIMIT 1
Query id: 840c5568-56f3-4ae8-8de0-2c014791fd9f
0 rows in set. Elapsed: 0.032 sec.
Received exception from server (version 23.8.7):
Code: 1000. DB::Exception: Received from fd00:1122:3344:107::23:9000. DB::Exception: Invalid argument. (POCO_EXCEPTION)
That error doesn't tell us much, but the logs include messages like this (for a different table, but the same kind of log line):
2025.10.15 21:24:27.770525 [ 210 ] {} <Information> oximeter.measurements_cumulativei64_local (2df79e33-9497-4083-93a1-c2d9ee96177d): Code: 234. DB::Exception: No active replica has part all_13521460_13522322_553 or covering part (cannot execute queue-0023159588: GET_PART with virtual parts [all_13521460_13522322_553]). (NO_REPLICA_HAS_PART)
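To confirm that the part really is gone everywhere, we can look for it on every replica. A minimal sketch, assuming the logical cluster is named oximeter_cluster (that name is a guess; system.clusters has the real one), and noting that this matches only the exact part name, not a covering part:
SELECT
    hostName() AS replica,
    name,
    active
FROM clusterAllReplicas('oximeter_cluster', system.parts)
WHERE (database = 'oximeter')
    AND (table = 'measurements_cumulativei64_local')
    AND (name = 'all_13521460_13522322_553')
An empty result would be consistent with the NO_REPLICA_HAS_PART message above.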
These logs go back to at least 2025-10-06, and the errors have been frequent:
root@oxz_clickhouse_server_f9065d53:~# grep NO_REPLICA_HAS_PART /data/log/clickhouse.* | wc -l
18086
I asked Claude how to look up tables with missing parts, and here's what I came up with:
SELECT
    database,
    table,
    type,
    new_part_name
FROM system.replication_queue
WHERE last_exception LIKE '%NO_REPLICA_HAS_PART%'
ORDER BY
    database ASC,
    table ASC
Query id: 680cf37e-3d6a-42c9-8da4-28754c235187
┌─database─┬─table────────────────────────────┬─type─────┬─new_part_name──────────────┐
│ oximeter │ fields_ipaddr_local │ GET_PART │ all_0_10278596_5391567 │
│ oximeter │ fields_u16_local │ GET_PART │ all_0_6894663_2847443 │
│ oximeter │ fields_u32_local │ GET_PART │ all_0_19635386_8237090 │
│ oximeter │ fields_u8_local │ GET_PART │ all_0_1007375_202061 │
│ oximeter │ fields_uuid_local │ GET_PART │ all_0_20936853_4604456 │
│ oximeter │ measurements_bool_local │ GET_PART │ all_782266_782270_1 │
│ oximeter │ measurements_bool_local │ GET_PART │ all_779265_782265_1293 │
│ oximeter │ measurements_cumulativei64_local │ GET_PART │ all_13521460_13522322_553 │
│ oximeter │ measurements_f64_local │ GET_PART │ all_757486_797440_37935 │
│ oximeter │ measurements_histogramu64_local │ GET_PART │ all_14980829_14981155_72 │
│ oximeter │ measurements_u64_local │ GET_PART │ all_14553900_14555829_1664 │
│ oximeter │ measurements_u64_local │ GET_PART │ all_14553900_14555830_1665 │
│ oximeter │ measurements_u64_local │ GET_PART │ all_14553900_14555828_1663 │
└──────────┴──────────────────────────────────┴──────────┴────────────────────────────┘
13 rows in set. Elapsed: 0.002 sec.
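system.replication_queue also records when each entry was created and how many times it has been retried, which should tell us how long these have been stuck. A sketch against the same table, using columns I believe exist on 23.8:
SELECT
    database,
    table,
    new_part_name,
    create_time,
    num_tries,
    num_postponed,
    postpone_reason
FROM system.replication_queue
WHERE last_exception LIKE '%NO_REPLICA_HAS_PART%'
ORDER BY create_time ASC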
Interestingly, I haven't noticed any errors from nexus or oximeter related to this, but I also haven't looked exhaustively.
I think we need to figure out how to remediate the problem, how to prevent recurrence, and also whether we're missing alerting that would have told us something was wrong.
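As one possible starting point for remediation, and only a sketch (I have not validated this against our cluster): SYSTEM RESTART REPLICA re-initializes a table's ZooKeeper session and replication queue, which is sometimes enough for ClickHouse to notice a part is lost and move on. Whether it actually clears these stuck GET_PART entries is something we'd need to test:
-- Re-initialize replication state for one affected table (sketch, not a runbook).
SYSTEM RESTART REPLICA oximeter.measurements_cumulativei64_local;
-- Then check whether the stuck entries for that table are gone.
SELECT count()
FROM system.replication_queue
WHERE table = 'measurements_cumulativei64_local'
    AND last_exception LIKE '%NO_REPLICA_HAS_PART%';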