clickhouse cluster: tables unavailable due to missing parts. #9234

@jmcarp

Description

On the clickhouse cluster on dogfood, we can query most tables normally:

oximeter_cluster_3 :) select * from oximeter.fields_string limit 1;

SELECT *
FROM oximeter.fields_string
LIMIT 1

Query id: 0d9ae787-7841-437f-af63-5ddaa507dde7

┌─timeseries_name─────────────────────────┬──────timeseries_key─┬─field_name─┬─field_value─┐
│ bgp_session:active_connections_accepted │ 7162701739278076253 │ hostname   │ oxz_switch  │
└─────────────────────────────────────────┴─────────────────────┴────────────┴─────────────┘

1 row in set. Elapsed: 0.003 sec.

But some tables fail queries with an opaque error:

oximeter_cluster_3 :) select * from oximeter.fields_uuid limit 1;

SELECT *
FROM oximeter.fields_uuid
LIMIT 1

Query id: 840c5568-56f3-4ae8-8de0-2c014791fd9f


0 rows in set. Elapsed: 0.032 sec.

Received exception from server (version 23.8.7):
Code: 1000. DB::Exception: Received from fd00:1122:3344:107::23:9000. DB::Exception: Invalid argument. (POCO_EXCEPTION)

That error doesn't tell us much, but the logs include messages like this (for a different table, but same log line):

2025.10.15 21:24:27.770525 [ 210 ] {} <Information> oximeter.measurements_cumulativei64_local (2df79e33-9497-4083-93a1-c2d9ee96177d): Code: 234. DB::Exception: No active replica has part all_13521460_13522322_553 or covering part (cannot execute queue-0023159588: GET_PART with virtual parts [all_13521460_13522322_553]). (NO_REPLICA_HAS_PART)

These logs go back to at least 10-06, and they've been frequent:

root@oxz_clickhouse_server_f9065d53:~# grep NO_REPLICA_HAS_PART /data/log/clickhouse.* | wc -l
   18086

I asked Claude how to look up tables with missing parts, and here's the query I came up with:

SELECT
    database,
    table,
    type,
    new_part_name
FROM system.replication_queue
WHERE last_exception LIKE '%NO_REPLICA_HAS_PART%'
ORDER BY
    database ASC,
    table ASC

Query id: 680cf37e-3d6a-42c9-8da4-28754c235187

┌─database─┬─table────────────────────────────┬─type─────┬─new_part_name──────────────┐
│ oximeter │ fields_ipaddr_local              │ GET_PART │ all_0_10278596_5391567     │
│ oximeter │ fields_u16_local                 │ GET_PART │ all_0_6894663_2847443      │
│ oximeter │ fields_u32_local                 │ GET_PART │ all_0_19635386_8237090     │
│ oximeter │ fields_u8_local                  │ GET_PART │ all_0_1007375_202061       │
│ oximeter │ fields_uuid_local                │ GET_PART │ all_0_20936853_4604456     │
│ oximeter │ measurements_bool_local          │ GET_PART │ all_782266_782270_1        │
│ oximeter │ measurements_bool_local          │ GET_PART │ all_779265_782265_1293     │
│ oximeter │ measurements_cumulativei64_local │ GET_PART │ all_13521460_13522322_553  │
│ oximeter │ measurements_f64_local           │ GET_PART │ all_757486_797440_37935    │
│ oximeter │ measurements_histogramu64_local  │ GET_PART │ all_14980829_14981155_72   │
│ oximeter │ measurements_u64_local           │ GET_PART │ all_14553900_14555829_1664 │
│ oximeter │ measurements_u64_local           │ GET_PART │ all_14553900_14555830_1665 │
│ oximeter │ measurements_u64_local           │ GET_PART │ all_14553900_14555828_1663 │
└──────────┴──────────────────────────────────┴──────────┴────────────────────────────┘

13 rows in set. Elapsed: 0.002 sec.
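For a higher-level health check, it may also be worth looking at `system.replicas` for the same tables; this is a sketch (not from the original investigation), and `queue_size` and `parts_to_check` would be expected to grow while entries like the stuck `GET_PART`s above remain in the queue:

```sql
-- Sketch: per-replica health summary for the oximeter tables.
SELECT
    database,
    table,
    is_readonly,
    queue_size,
    inserts_in_queue,
    parts_to_check,
    last_queue_update_exception
FROM system.replicas
WHERE database = 'oximeter'
ORDER BY queue_size DESC
```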

Interestingly, I haven't noticed any errors from nexus or oximeter related to this, though I haven't looked exhaustively.

I think we need to figure out how to remediate the problem, how to prevent recurrence, and whether we're missing alerting that would have told us something was wrong.
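On the remediation side, one possible approach (a sketch, not a verified fix; whether these commands clear stuck virtual-part queue entries on our ClickHouse version, 23.8.7, would need to be confirmed against the docs) is to restart the replica for an affected table, and, if the parts are genuinely gone from every replica and the data loss is acceptable, drop them explicitly so the queue entries referencing them can be discarded:

```sql
-- Sketch only. Re-initialize the replica's ZooKeeper session and queue
-- for one affected table (repeat per table as needed):
SYSTEM RESTART REPLICA oximeter.measurements_cumulativei64_local;

-- If that doesn't clear the entry and the part is confirmed lost
-- everywhere, explicitly drop it (part name from the
-- system.replication_queue output above):
ALTER TABLE oximeter.measurements_cumulativei64_local
    DROP PART 'all_13521460_13522322_553';
```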

cc @bnaecker @karencfv
