Keeper expungement failing on a4x2 #6945

@andrewjstone

Description

Steps to reproduce

All omdb operations were performed on sled g0.

  1. Launch a4x2.
  2. Set clickhouse-policy to both via omdb.
  3. Regenerate a blueprint and make it the target.
  4. Hyperstop g2 (the node with one keeper).
  5. Expunge g2 via omdb.
  6. Regenerate a couple of blueprints and set each as the target.
  7. Confirm that the zones are expunged in the blueprints.

Evidence

Sled g2 can definitely no longer be reached: I see a log entry from the Nexus on g3 about failing to contact it. However, the keeper cluster still lists g2's keeper as a member, both in keeper-config.xml and via the clickhouse keeper-client command.

root@oxz_clickhouse_keeper_1d4c8dac:~# clickhouse keeper-client --host [fd00:1122:3344:104::21]
Connected to ZooKeeper at [fd00:1122:3344:104::21]:9181 with session_id 8
Keeper feature flag FILTERED_LIST: enabled
Keeper feature flag MULTI_READ: enabled
Keeper feature flag CHECK_NOT_EXISTS: disabled
/ :) get /keeper/config
server.1=fd00:1122:3344:101::21:9234;participant;1
server.2=fd00:1122:3344:104::21:9234;participant;1
server.3=fd00:1122:3344:103::21:9234;participant;1
server.4=fd00:1122:3344:103::22:9234;participant;1
server.5=fd00:1122:3344:102::21:9234;participant;1

The keeper on sled g2 is server.5.
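The `get /keeper/config` output above has one `server.<id>=<address>:<port>;<role>;<priority>` line per raft member. A minimal sketch, assuming exactly that line format, for parsing it and checking whether an expunged keeper ID is still listed. `parse_keeper_config`, `still_member`, and `KeeperServer` are hypothetical helpers for illustration, not part of omicron or ClickHouse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeeperServer:
    """One `server.<id>=...` entry from the keeper raft configuration."""
    server_id: int
    address: str
    raft_port: int
    role: str
    priority: int


def parse_keeper_config(text: str) -> dict[int, KeeperServer]:
    """Parse `get /keeper/config` output into a map of server id -> entry.

    Assumes each member line looks like
    `server.<id>=<ipv6>:<port>;<role>;<priority>`; other lines are ignored.
    """
    servers: dict[int, KeeperServer] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("server."):
            continue
        key, value = line.split("=", 1)
        server_id = int(key[len("server."):])
        endpoint, role, priority = value.split(";")
        # The IPv6 address is unbracketed, so split on the LAST colon only.
        address, port = endpoint.rsplit(":", 1)
        servers[server_id] = KeeperServer(
            server_id, address, int(port), role, int(priority)
        )
    return servers


def still_member(servers: dict[int, KeeperServer], expunged_id: int) -> bool:
    """Hypothetical check: is the expunged keeper still in the raft config?"""
    return expunged_id in servers


CONFIG = """\
server.1=fd00:1122:3344:101::21:9234;participant;1
server.2=fd00:1122:3344:104::21:9234;participant;1
server.3=fd00:1122:3344:103::21:9234;participant;1
server.4=fd00:1122:3344:103::22:9234;participant;1
server.5=fd00:1122:3344:102::21:9234;participant;1
"""

servers = parse_keeper_config(CONFIG)
print(sorted(servers))         # [1, 2, 3, 4, 5]
print(servers[5].address)      # fd00:1122:3344:102::21
print(still_member(servers, 5))  # True: g2's keeper has not been removed
```

Splitting on the last colon is what makes this work for IPv6 addresses, which contain colons of their own.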

I then checked that the leader has been committing keeper log entries and that the committed index is increasing.

/ :) lgif
first_log_idx   1
first_log_term  1
last_log_idx    1515
last_log_term   1
last_committed_log_idx  1515
leader_committed_log_idx        1515
target_committed_log_idx        1515
last_snapshot_idx       0
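The `lgif` output above is a list of whitespace-separated `<name> <value>` pairs. A minimal sketch, assuming that format, for parsing it and comparing two samples taken some seconds apart to confirm the leader's committed index is advancing. `parse_lgif` and `commits_advancing` are hypothetical helpers for illustration.

```python
def parse_lgif(text: str) -> dict[str, int]:
    """Parse `lgif` output into a map of field name -> integer value.

    Lines that are not exactly `<name> <non-negative integer>` (for
    example the `/ :) lgif` prompt) are skipped.
    """
    fields: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            fields[parts[0]] = int(parts[1])
    return fields


def commits_advancing(earlier: dict[str, int], later: dict[str, int]) -> bool:
    """True if the leader's committed index grew between two samples."""
    key = "leader_committed_log_idx"
    return later[key] > earlier[key]


SAMPLE = """\
/ :) lgif
first_log_idx   1
first_log_term  1
last_log_idx    1515
last_log_term   1
last_committed_log_idx  1515
leader_committed_log_idx        1515
target_committed_log_idx        1515
last_snapshot_idx       0
"""

info = parse_lgif(SAMPLE)
print(info["leader_committed_log_idx"])  # 1515
```

Calling `commits_advancing(first, second)` on two parsed samples returns `True` when the leader has committed new entries in between.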

I then checked CRDB to see what ClickHouse cluster configuration each blueprint records:

root@[fd00:1122:3344:101::3]:32221/omicron> select * from bp_clickhouse_cluster_config ;
              blueprint_id             | generation | max_used_server_id | max_used_keeper_id |   cluster_name   |            cluster_secret            | highest_seen_keeper_leader_committed_log_index
---------------------------------------+------------+--------------------+--------------------+------------------+--------------------------------------+-------------------------------------------------
  16dfac44-0091-453a-b5e0-2e1b8cad2329 |          2 |                  3 |                  5 | oximeter_cluster | 5b815633-062c-438d-8acc-1858bb059e9e |                                              0
  69cdc490-9a9d-46e9-b0c0-c8661b0b4794 |          2 |                  3 |                  5 | oximeter_cluster | 5b815633-062c-438d-8acc-1858bb059e9e |                                              0
  79d919a7-13cd-4b47-9e9c-d15515c8532f |          2 |                  3 |                  5 | oximeter_cluster | 5b815633-062c-438d-8acc-1858bb059e9e |                                              0
  bc23843c-1b2a-49d0-9b7b-224f1ed2e892 |          2 |                  3 |                  5 | oximeter_cluster | 5b815633-062c-438d-8acc-1858bb059e9e |                                              0

Interestingly, highest_seen_keeper_leader_committed_log_index is 0 in all blueprints, even though the leader's committed log index is 1515.

There are also no keeper membership rows in inventory:

root@[fd00:1122:3344:101::3]:32221/omicron> select * from  inv_clickhouse_keeper_membership;
  inv_collection_id | queried_keeper_id | leader_committed_log_index | raft_config
--------------------+-------------------+----------------------------+--------------
(0 rows)


Time: 4ms total (execution 4ms / network 1ms)

root@[fd00:1122:3344:101::3]:32221/omicron>

It appears that retrieving keeper membership data from clickhouse-admin-keeper into inventory is not working, which in turn prevents the keepers from being reconfigured.

Metadata

Assignees

Labels

No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
