@jgallagher
Contributor

With all the zone start and ledgering moved to sled-agent-config-reconciler, we can remove this type entirely from sled-agent. I kept the schema check but moved it to the legacy_configs.rs module in the config reconciler, where the same structs still exist to allow conversion of the old ledgers -> the new combined ledger.

Builds on top of #8219.

@jgallagher
Contributor Author

Results of testing an upgrade from main to this branch on dublin:

  • Before the upgrade, each sled had the normal triple of ledgers in both /pool/int/*/config directories (omicron-{physical-disks,datasets,zones}.json). After the upgrade, these had successfully been combined into omicron-sled-config.json.
  • A raw inventory JSON blob taken from the scrimlet while the other sleds were still parked/mupdating is too large for a comment, so here's a gist. The most interesting bits are the statuses reported in the new last_reconciliation field. All of the disks failed with "Failed to access keys necessary to unlock storage. This error may be transient.", which caused all the datasets to fail with "could not find matching zpool oxp_*". All the zones failed with either "Time not yet synchronized" (for zones needing timesync) or "zone's transient root dataset is not available: oxp_*" (for zones that don't need timesync). All of this is as expected for a cold-booted sled before LRTQ has unlocked the rack secret.
  • Gist containing the omdb inventory output after the sleds were all back online. It notes that all disks, datasets, and zones were reconciled successfully on all sleds.
  • All the transient oxp_*/crypt/zone datasets correctly had their properties reapplied (confirmed again by cold booting an individual sled), without Nexus pushing down new sled configs (i.e., blueprint execution was disabled). This confirms that this work fixes #7546 ("[sled-agent] Transient zone datasets don't have expected properties after a sled reboot"). zpool history shows that the properties were applied when the datasets were recreated, before time had synchronized, and zfs get confirms they're set.
  • All of the transient zone root datasets were also recreated with all properties applied after a cold boot, as expected since they are no longer created on demand when zones are launched.
  • The console came back online successfully. I was able to create and log in to new instances.
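
The cold-boot failure cascade described in the second bullet can be modeled roughly as follows. This is an illustrative sketch only: `ZoneOutcome`, `reconcile_zone`, and their parameters are hypothetical names rather than the reconciler's actual types, and the order of the two checks is an assumption inferred from the errors reported in the inventory.

```rust
/// Hypothetical outcome of trying to start one zone (illustrative names only).
#[derive(Debug, PartialEq)]
enum ZoneOutcome {
    Started,
    /// Mirrors the "Time not yet synchronized" error.
    TimeNotSynchronized,
    /// Mirrors "zone's transient root dataset is not available: oxp_*".
    RootDatasetUnavailable(String),
}

/// Assumed check order: timesync first, then the pool backing the zone's
/// transient root dataset. Before the rack secret is unlocked no external
/// pools are usable, so `imported_pools` would be empty.
fn reconcile_zone(
    needs_timesync: bool,
    time_synced: bool,
    root_pool: &str,
    imported_pools: &[&str],
) -> ZoneOutcome {
    if needs_timesync && !time_synced {
        return ZoneOutcome::TimeNotSynchronized;
    }
    if !imported_pools.contains(&root_pool) {
        return ZoneOutcome::RootDatasetUnavailable(root_pool.to_string());
    }
    ZoneOutcome::Started
}
```

With no pools available and time unsynchronized, every zone lands in one of the two error variants, which matches what the inventory showed for the cold-booted scrimlet.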

@jgallagher
Contributor Author

Expunging a disk backing a crucible zone and a propolis zone (disk 4deb1041-f79b-4244-99ef-fc13ba01248a). Blueprint diff after marking disk expunged:

 MODIFIED SLEDS:

  sled 57ac088d-3edc-4b30-8132-4ccd72dc1e2a (active, config generation 5 -> 6):

    physical disks:
    ----------------------------------------------------
    vendor   model             serial     disposition
    ----------------------------------------------------
    1b96     WUS4C6432DSP3X3   A079DE84   in service
    1b96     WUS4C6432DSP3X3   A079DEE9   in service
    1b96     WUS4C6432DSP3X3   A079DF1E   in service
    1b96     WUS4C6432DSP3X3   A079DFBF   in service
    1b96     WUS4C6432DSP3X3   A079E342   in service
    1b96     WUS4C6432DSP3X3   A079E35A   in service
    1b96     WUS4C6432DSP3X3   A079E3AE   in service
    1b96     WUS4C6432DSP3X3   A079E708   in service
    1b96     WUS4C6432DSP3X3   A084A7EA   in service
*   1b96     WUS4C6432DSP3X3   A079E184   - in service
     └─                                   + expunged ⏳


    datasets:
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    dataset name                                                                                                dataset id                             disposition    quota     reservation   compression
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    oxp_adcdc14c-b565-4bdb-b786-5c12f5396a0d/crypt/cockroachdb                                                  54a8d294-70d8-447e-8af9-b9547af51661   in service     none      none          off
    oxp_1d0862e0-88eb-4464-94ae-36008c15ead9/crucible                                                           6e136ef1-d779-4019-9ada-5cba375c6950   in service     none      none          off
    oxp_239e9bf1-b727-40c4-9d4a-c2d756b5d1e2/crucible                                                           5dc0bc0e-e6c7-4bd6-8826-c29ebba8f8c9   in service     none      none          off
    oxp_377390a9-a470-47f9-bab1-3a7d6dddeea1/crucible                                                           02b7a5b8-9d72-4395-9ac3-df49f85d2375   in service     none      none          off
    oxp_425b288c-be5f-4722-9d28-1c585ef74329/crucible                                                           ebcb4b47-9395-43c0-9ded-06f0b181881a   in service     none      none          off
    oxp_6b19ca72-b559-4502-bbe0-8f538d88be8f/crucible                                                           c752719d-2b36-4184-8f34-c4fb3e6a4abe   in service     none      none          off
    oxp_8ba0eb01-a6c9-4d66-8943-f14e619e4bae/crucible                                                           9d2206c5-f5e1-46e9-baa7-b249e413bcc6   in service     none      none          off
    oxp_adcdc14c-b565-4bdb-b786-5c12f5396a0d/crucible                                                           481be232-08b7-400d-a7c7-e74a29bf6bd2   in service     none      none          off
    oxp_d59df55b-1061-4926-9f9a-0bd9bbfe19e8/crucible                                                           908be71b-b759-4f7d-bf15-25eeca106742   in service     none      none          off
    oxp_ecd66a2b-4dda-48a7-8563-c97d054614c8/crucible                                                           0bb2e606-cc70-4537-b2eb-cf2107cd4054   in service     none      none          off
    oxp_adcdc14c-b565-4bdb-b786-5c12f5396a0d/crypt/clickhouse                                                   7fe941b9-100e-4861-aeda-c573e3fae85c   in service     none      none          off
    oxp_adcdc14c-b565-4bdb-b786-5c12f5396a0d/crypt/internal_dns                                                 81434447-2729-4198-a83d-1f5c2fb74ba2   in service     none      none          off
    oxp_1d0862e0-88eb-4464-94ae-36008c15ead9/crypt/zone                                                         d5c118ff-0408-4781-b7ac-6fcaff0e42d6   in service     none      none          off
    oxp_239e9bf1-b727-40c4-9d4a-c2d756b5d1e2/crypt/zone                                                         94421f3a-ef13-4c23-b216-37009fda2cd5   in service     none      none          off
    oxp_377390a9-a470-47f9-bab1-3a7d6dddeea1/crypt/zone                                                         7b179f54-8261-45d0-8184-b9ea04e90f33   in service     none      none          off
    oxp_425b288c-be5f-4722-9d28-1c585ef74329/crypt/zone                                                         3d65f625-db23-4965-84eb-54309b1e9f60   in service     none      none          off
    oxp_6b19ca72-b559-4502-bbe0-8f538d88be8f/crypt/zone                                                         531b4dea-ff7f-46e6-af63-dd18411d1c81   in service     none      none          off
    oxp_8ba0eb01-a6c9-4d66-8943-f14e619e4bae/crypt/zone                                                         724562e2-a524-422e-bddf-38564425778d   in service     none      none          off
    oxp_adcdc14c-b565-4bdb-b786-5c12f5396a0d/crypt/zone                                                         d2c6bad8-c894-4726-8b17-2bdd76c05e75   in service     none      none          off
    oxp_d59df55b-1061-4926-9f9a-0bd9bbfe19e8/crypt/zone                                                         99a523af-dbbe-4497-b678-c5e54db8c7dd   in service     none      none          off
    oxp_ecd66a2b-4dda-48a7-8563-c97d054614c8/crypt/zone                                                         8fde560b-5198-4252-84a9-3b9f8c22d7d3   in service     none      none          off
    oxp_adcdc14c-b565-4bdb-b786-5c12f5396a0d/crypt/zone/oxz_clickhouse_eda41da5-4a85-42be-9cc2-1442ed1f8f82     a888d5c1-1813-4645-a95f-394d9d9ed295   in service     none      none          off
    oxp_adcdc14c-b565-4bdb-b786-5c12f5396a0d/crypt/zone/oxz_cockroachdb_7e40a063-dd58-4c35-a34f-3d8f6f3a5daf    3f42df9b-2762-4c22-b0bb-8fa1d2db6800   in service     none      none          off
    oxp_425b288c-be5f-4722-9d28-1c585ef74329/crypt/zone/oxz_crucible_12bf5955-66f6-4e91-9ba6-09858bebcbd1       2727efcd-f0cc-48ef-8171-2022f3fba549   in service     none      none          off
    oxp_8ba0eb01-a6c9-4d66-8943-f14e619e4bae/crypt/zone/oxz_crucible_15111508-b94c-4048-97a3-bdc1f2ea5744       50a766e9-b66d-4c94-b6fe-fe71c56aa336   in service     none      none          off
    oxp_d59df55b-1061-4926-9f9a-0bd9bbfe19e8/crypt/zone/oxz_crucible_47ac13eb-ca95-4f79-8558-87bcf57c6155       6cf69375-77d8-48f6-98e2-48b0f67c41dc   in service     none      none          off
    oxp_adcdc14c-b565-4bdb-b786-5c12f5396a0d/crypt/zone/oxz_crucible_76ba7765-4314-4a38-94a4-1dfc3df64d60       d4175907-cba8-4f56-92c8-e1d586f140bd   in service     none      none          off
    oxp_239e9bf1-b727-40c4-9d4a-c2d756b5d1e2/crypt/zone/oxz_crucible_7c9249ae-0be0-432a-b656-c45e1c068238       eea95245-c50b-411c-a3cf-bd753795a517   in service     none      none          off
    oxp_1d0862e0-88eb-4464-94ae-36008c15ead9/crypt/zone/oxz_crucible_8fc41b37-ccd4-480e-bdb1-228c54550f07       3b50b27d-88f9-4770-81cc-d74a5b53c58e   in service     none      none          off
    oxp_377390a9-a470-47f9-bab1-3a7d6dddeea1/crypt/zone/oxz_crucible_96c3218b-cf64-4fe7-b462-66caed9fc930       ff8057c6-4791-425a-aa45-a898054f18b0   in service     none      none          off
    oxp_ecd66a2b-4dda-48a7-8563-c97d054614c8/crypt/zone/oxz_crucible_f902d6d9-75a5-497b-b1f2-cc88b3d5542a       20269607-cbc8-4d21-8f8f-54161a0188ed   in service     none      none          off
    oxp_6b19ca72-b559-4502-bbe0-8f538d88be8f/crypt/zone/oxz_crucible_fa2c89e4-ed76-4256-bcb8-95211418a67e       f902454f-6f86-4739-8d66-fb6cbbd8dc91   in service     none      none          off
    oxp_adcdc14c-b565-4bdb-b786-5c12f5396a0d/crypt/zone/oxz_internal_dns_0c41634d-72f0-44f2-96c8-52daf2c4c2e1   0fd72575-fde6-4113-9f75-effe3dff0b52   in service     none      none          off
    oxp_425b288c-be5f-4722-9d28-1c585ef74329/crypt/zone/oxz_nexus_fa64f0dc-770d-4b37-befc-054a3bd62cb8          a4e4be96-b64a-4617-a17f-d249a5e4b8d3   in service     none      none          off
    oxp_ecd66a2b-4dda-48a7-8563-c97d054614c8/crypt/zone/oxz_ntp_b572817c-29da-41bc-b063-3bb23fcca50b            7a26e295-0606-4aba-bb15-46350d747196   in service     none      none          off
    oxp_1d0862e0-88eb-4464-94ae-36008c15ead9/crypt/debug                                                        7b5bf9f0-68ac-436b-b282-5c7e76f44712   in service     100 GiB   none          gzip-9
    oxp_239e9bf1-b727-40c4-9d4a-c2d756b5d1e2/crypt/debug                                                        322ecf67-07b7-402a-8294-8c05fb62230a   in service     100 GiB   none          gzip-9
    oxp_377390a9-a470-47f9-bab1-3a7d6dddeea1/crypt/debug                                                        6dddae6c-e09a-4c1f-a495-27f13a678e79   in service     100 GiB   none          gzip-9
    oxp_425b288c-be5f-4722-9d28-1c585ef74329/crypt/debug                                                        74022bb3-827b-4cf3-a5df-5fd120a843fd   in service     100 GiB   none          gzip-9
    oxp_6b19ca72-b559-4502-bbe0-8f538d88be8f/crypt/debug                                                        61202b95-a2bc-4c77-8488-a7ac2c63ff89   in service     100 GiB   none          gzip-9
    oxp_8ba0eb01-a6c9-4d66-8943-f14e619e4bae/crypt/debug                                                        19f4e756-9cb1-4579-aacc-7fc53affdb31   in service     100 GiB   none          gzip-9
    oxp_adcdc14c-b565-4bdb-b786-5c12f5396a0d/crypt/debug                                                        b90c7c17-defd-4c60-98e6-9950a505874d   in service     100 GiB   none          gzip-9
    oxp_d59df55b-1061-4926-9f9a-0bd9bbfe19e8/crypt/debug                                                        719004e5-16d4-4e21-9cb8-d68f9e7748e2   in service     100 GiB   none          gzip-9
    oxp_ecd66a2b-4dda-48a7-8563-c97d054614c8/crypt/debug                                                        0efd3318-45c7-4583-81fa-9627e8b977d9   in service     100 GiB   none          gzip-9
*   oxp_8bef130c-8e10-4eee-a0c4-9dcf6ee91116/crucible                                                           d5bae49f-fe38-410b-ae90-eff0d22b35b0   - in service   none      none          off
     └─                                                                                                                                                + expunged
*   oxp_8bef130c-8e10-4eee-a0c4-9dcf6ee91116/crypt/zone                                                         bdea1929-4ffa-47d1-8944-140e21f5ef35   - in service   none      none          off
     └─                                                                                                                                                + expunged
*   oxp_8bef130c-8e10-4eee-a0c4-9dcf6ee91116/crypt/zone/oxz_crucible_8203de2e-7003-4469-9031-ade1f20232ab       9bea5d6f-d7ad-49af-9e4c-c3a95a1f1652   - in service   none      none          off
     └─                                                                                                                                                + expunged
*   oxp_8bef130c-8e10-4eee-a0c4-9dcf6ee91116/crypt/debug                                                        61733534-80d6-4815-890e-2e9278182d6d   - in service   100 GiB   none          gzip-9
     └─                                                                                                                                                + expunged


    omicron zones:
    ---------------------------------------------------------------------------------------------------------------
    zone type      zone id                                image source      disposition      underlay IP
    ---------------------------------------------------------------------------------------------------------------
    boundary_ntp   b572817c-29da-41bc-b063-3bb23fcca50b   install dataset   in service       fd00:1122:3344:102::10
    clickhouse     eda41da5-4a85-42be-9cc2-1442ed1f8f82   install dataset   in service       fd00:1122:3344:102::5
    cockroach_db   7e40a063-dd58-4c35-a34f-3d8f6f3a5daf   install dataset   in service       fd00:1122:3344:102::3
    crucible       12bf5955-66f6-4e91-9ba6-09858bebcbd1   install dataset   in service       fd00:1122:3344:102::7
    crucible       15111508-b94c-4048-97a3-bdc1f2ea5744   install dataset   in service       fd00:1122:3344:102::b
    crucible       47ac13eb-ca95-4f79-8558-87bcf57c6155   install dataset   in service       fd00:1122:3344:102::8
    crucible       76ba7765-4314-4a38-94a4-1dfc3df64d60   install dataset   in service       fd00:1122:3344:102::6
    crucible       7c9249ae-0be0-432a-b656-c45e1c068238   install dataset   in service       fd00:1122:3344:102::f
    crucible       8fc41b37-ccd4-480e-bdb1-228c54550f07   install dataset   in service       fd00:1122:3344:102::d
    crucible       96c3218b-cf64-4fe7-b462-66caed9fc930   install dataset   in service       fd00:1122:3344:102::c
    crucible       f902d6d9-75a5-497b-b1f2-cc88b3d5542a   install dataset   in service       fd00:1122:3344:102::a
    crucible       fa2c89e4-ed76-4256-bcb8-95211418a67e   install dataset   in service       fd00:1122:3344:102::e
    internal_dns   0c41634d-72f0-44f2-96c8-52daf2c4c2e1   install dataset   in service       fd00:1122:3344:2::1
    nexus          fa64f0dc-770d-4b37-befc-054a3bd62cb8   install dataset   in service       fd00:1122:3344:102::4
*   crucible       8203de2e-7003-4469-9031-ade1f20232ab   install dataset   - in service     fd00:1122:3344:102::9
     └─                                                                     + expunged ⏳

Relevant sled-agent logs after making that blueprint the target - we see the new config ledgered to both internal pools:

18:36:11.520Z INFO SledAgent (SledConfigLedgerTask): Writing ledger to /pool/int/8d21e428-0e1d-4282-b7d5-a69c5655685b/config/.omicron-sled-config.json.tmp
    file = common/src/ledger.rs:197
18:36:11.522Z INFO SledAgent (SledConfigLedgerTask): Writing ledger to /pool/int/d82ba65f-6b72-425b-98f7-60842853459a/config/.omicron-sled-config.json.tmp
    file = common/src/ledger.rs:197
18:36:11.522Z INFO SledAgent (SledConfigLedgerTask): updated sled config ledger
    file = sled-agent/config-reconciler/src/ledger.rs:358
    generation = 6
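
The `.omicron-sled-config.json.tmp` paths in these logs suggest a write-to-temporary-file step for each internal pool. A minimal sketch of that general pattern, assuming the temporary file is subsequently renamed over the final name (the rename is an assumption; the logs only show the temporary write), with hypothetical function names:

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `contents` to `<dir>/.<file_name>.tmp`, flush it to disk, then
/// rename it to `<dir>/<file_name>`. Hypothetical helper, not the code in
/// common/src/ledger.rs.
fn write_ledger_atomically(
    dir: &Path,
    file_name: &str,
    contents: &[u8],
) -> io::Result<PathBuf> {
    let tmp = dir.join(format!(".{}.tmp", file_name));
    let dest = dir.join(file_name);
    let mut f = fs::File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?;
    // On the same filesystem, rename replaces the destination in one step,
    // so readers see either the old ledger or the new one.
    fs::rename(&tmp, &dest)?;
    Ok(dest)
}

/// Attempt the write in every config directory, returning one result per
/// directory so the caller can decide how many successes are enough.
fn write_ledger_to_all(
    dirs: &[PathBuf],
    file_name: &str,
    contents: &[u8],
) -> Vec<io::Result<PathBuf>> {
    dirs.iter()
        .map(|dir| write_ledger_atomically(dir, file_name, contents))
        .collect()
}
```

Returning a result per directory keeps a failure on one pool from hiding a success on the other; how many successful writes are actually required is not something this sketch tries to decide.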

That immediately triggers the reconciler:

18:36:11.523Z INFO SledAgent (ConfigReconcilerTask): starting reconciliation due to config change
    file = sled-agent/config-reconciler/src/reconciler_task.rs:303

First it shuts down the crucible zone:

18:36:11.523Z INFO SledAgent (ConfigReconcilerTask): shutting down running zone
    file = sled-agent/config-reconciler/src/reconciler_task/zones.rs:569
    zone = oxz_crucible_8203de2e-7003-4469-9031-ade1f20232ab
18:36:11.523Z INFO SledAgent (ZoneBundler): creating zone bundle
    context = ZoneBundleContext { storage_dirs: ["/pool/int/8d21e428-0e1d-4282-b7d5-a69c5655685b/debug/bundle/zone", "/pool/int/d82ba65f-6b72-425b-98f7-60842853459a/debug/bundle/zone"], cause: UnexpectedZone, extra_log_dirs: ["/pool/ext/ecd66a2b-4dda-48a7-8563-c97d054614c8/crypt/debug", "/pool/ext/8ba0eb01-a6c9-4d66-8943-f14e619e4bae/crypt/debug", "/pool/ext/239e9bf1-b727-40c4-9d4a-c2d756b5d1e2/crypt/debug", "/pool/ext/6b19ca72-b559-4502-bbe0-8f538d88be8f/crypt/debug", "/pool/ext/8bef130c-8e10-4eee-a0c4-9dcf6ee91116/crypt/debug", "/pool/ext/377390a9-a470-47f9-bab1-3a7d6dddeea1/crypt/debug", "/pool/ext/d59df55b-1061-4926-9f9a-0bd9bbfe19e8/crypt/debug", "/pool/ext/425b288c-be5f-4722-9d28-1c585ef74329/crypt/debug", "/pool/ext/1d0862e0-88eb-4464-94ae-36008c15ead9/crypt/debug", "/pool/ext/adcdc14c-b565-4bdb-b786-5c12f5396a0d/crypt/debug"] }
    file = sled-agent/src/zone_bundle.rs:369
    zone_name = oxz_crucible_8203de2e-7003-4469-9031-ade1f20232ab
18:36:13.296Z INFO SledAgent (ZoneBundler): finished zone bundle
    file = sled-agent/src/zone_bundle.rs:1178
    metadata = ZoneBundleMetadata { id: ZoneBundleId { zone_name: "oxz_crucible_8203de2e-7003-4469-9031-ade1f20232ab", bundle_id: 9b089647-485d-4096-b80c-9dd8473f6d75 }, time_created: 2025-05-28T18:36:11.523802410Z, version: 0, cause: UnexpectedZone }
18:36:16.224Z INFO SledAgent (ConfigReconcilerTask): halt_and_remove_logged: Previous zone state: Running
    file = illumos-utils/src/zone.rs:461
    zone = oxz_crucible_8203de2e-7003-4469-9031-ade1f20232ab

Next we would remove all the datasets on that disk, but we note that we're leaking them instead pending #6177:

18:36:16.227Z WARN oxp_8bef130c-8e10-4eee-a0c4-9dcf6ee91116/crypt/debug (ConfigReconcilerTask): leaking ZFS dataset (should be deleted: omicron#6177)
    file = sled-agent/config-reconciler/src/reconciler_task/datasets.rs:173
    id = 61733534-80d6-4815-890e-2e9278182d6d
18:36:16.227Z WARN oxp_8bef130c-8e10-4eee-a0c4-9dcf6ee91116/crypt/zone/oxz_crucible_8203de2e-7003-4469-9031-ade1f20232ab (ConfigReconcilerTask): leaking ZFS dataset (should be deleted: omicron#6177)
    file = sled-agent/config-reconciler/src/reconciler_task/datasets.rs:173
    id = 9bea5d6f-d7ad-49af-9e4c-c3a95a1f1652
18:36:16.227Z WARN oxp_8bef130c-8e10-4eee-a0c4-9dcf6ee91116/crypt/zone (ConfigReconcilerTask): leaking ZFS dataset (should be deleted: omicron#6177)
    file = sled-agent/config-reconciler/src/reconciler_task/datasets.rs:173
    id = bdea1929-4ffa-47d1-8944-140e21f5ef35
18:36:16.227Z WARN oxp_8bef130c-8e10-4eee-a0c4-9dcf6ee91116/crucible (ConfigReconcilerTask): leaking ZFS dataset (should be deleted: omicron#6177)
    file = sled-agent/config-reconciler/src/reconciler_task/datasets.rs:173
    id = d5bae49f-fe38-410b-ae90-eff0d22b35b0

Next we remove the disk:

18:36:16.227Z INFO SledAgent (ConfigReconcilerTask): removing managed disk: no longer present in config
    disk = DiskIdentity { vendor: "1b96", model: "WUS4C6432DSP3X3", serial: "A079E184" }
    disk_id = 4deb1041-f79b-4244-99ef-fc13ba01248a
    file = sled-agent/config-reconciler/src/reconciler_task/external_disks.rs:327

This wakes up the InstanceManager and causes it to shut down the instance whose propolis zone was backed by this disk:

18:36:16.232Z INFO SledAgent (InstanceManager): use_only_these_disks: Removing instance
    file = sled-agent/src/instance_manager.rs:781
    instance_id = c63de85a-9531-4ba7-8852-97b1592e2a59 (propolis)
18:36:16.232Z INFO SledAgent (InstanceManager): Received request to terminate instance
    file = sled-agent/src/instance.rs:2119
    instance_id = 339e49a5-7220-4cbc-96dd-a9e4cc3c0577
    propolis_id = c63de85a-9531-4ba7-8852-97b1592e2a59
... snip the rest of the InstanceManager logs ...

It also wakes up the dump setup task, which now realizes there are only 9 debug datasets left:

18:36:16.355Z INFO SledAgent (DumpSetup-worker): Updated view of disks
    core_datasets = 2
    debug_datasets = 9
    dump_slices = 2
    file = sled-agent/config-reconciler/src/dump_setup.rs:627

The rest of the reconciliation proceeds and makes no changes (we didn't add any disks, datasets, or zones).
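
The removal half of the pass traced in the logs above (zones first, then datasets, then disks) can be sketched as a set difference between what is currently present and what the new config asks for. The types and names here (`SledState`, `RemovalPlan`, `plan_removals`) are hypothetical and only model the ordering observed in the logs, not the reconciler's real data structures.

```rust
use std::collections::BTreeSet;

/// Hypothetical snapshot of what exists (or is wanted) on a sled, by name.
#[derive(Debug, Default)]
struct SledState {
    zones: BTreeSet<String>,
    datasets: BTreeSet<String>,
    disks: BTreeSet<String>,
}

/// Work to perform, in the order observed in the logs:
/// shut down zones, handle datasets, then remove disks.
#[derive(Debug, PartialEq)]
struct RemovalPlan {
    zones_to_shut_down: Vec<String>,
    datasets_no_longer_wanted: Vec<String>,
    disks_to_remove: Vec<String>,
}

/// Convenience constructor for building name sets.
fn set(items: &[&str]) -> BTreeSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}

/// Anything present now but absent from the desired config is removed.
fn plan_removals(current: &SledState, desired: &SledState) -> RemovalPlan {
    RemovalPlan {
        zones_to_shut_down: current.zones.difference(&desired.zones).cloned().collect(),
        datasets_no_longer_wanted: current
            .datasets
            .difference(&desired.datasets)
            .cloned()
            .collect(),
        disks_to_remove: current.disks.difference(&desired.disks).cloned().collect(),
    }
}
```

When the desired state equals the current state, every list in the plan is empty, which corresponds to a reconciliation that makes no changes.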

@jgallagher
Contributor Author

Followup from the disk expungement testing: maybe it's not right that we note that we're leaking datasets and/or would delete them for an expunged disk - we should probably only try to delete datasets if the disk is still present?
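
One possible shape for that check, purely as a sketch of the idea and not a proposed implementation: decide per unwanted dataset whether its zpool is still part of the sled config, and skip datasets whose pool is gone. `pool_of`, `UnwantedDatasetAction`, and `action_for_unwanted_dataset` are hypothetical names; the only thing taken from the logs is that dataset names begin with the zpool name (oxp_<uuid>) followed by a slash.

```rust
use std::collections::BTreeSet;

/// Extract the zpool name from a dataset name such as
/// "oxp_<uuid>/crypt/debug" (everything before the first '/').
fn pool_of(dataset_name: &str) -> &str {
    dataset_name.split('/').next().unwrap_or(dataset_name)
}

/// Hypothetical decision for a dataset that is no longer in the config.
#[derive(Debug, PartialEq)]
enum UnwantedDatasetAction {
    /// The backing pool is gone from the config (e.g. disk expunged):
    /// nothing to delete and nothing to warn about.
    Skip,
    /// The pool is still managed, so deletion would be attempted
    /// (today this is the "leaking ZFS dataset" warning, pending #6177).
    AttemptDelete,
}

fn action_for_unwanted_dataset(
    dataset_name: &str,
    pools_in_config: &BTreeSet<String>,
) -> UnwantedDatasetAction {
    if pools_in_config.contains(pool_of(dataset_name)) {
        UnwantedDatasetAction::AttemptDelete
    } else {
        UnwantedDatasetAction::Skip
    }
}
```

Under this sketch, the four datasets on the expunged disk's pool would all be skipped rather than logged as leaked.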

@jgallagher jgallagher merged commit 4b230b3 into john/sled-agent-config-reconciler-zone-deps Jun 4, 2025
16 of 17 checks passed
@jgallagher jgallagher deleted the john/sled-agent-config-reconciler-cleanup-zones-local branch June 4, 2025 19:59