-
Notifications
You must be signed in to change notification settings - Fork 62
[reconfigurator] MGS updates to skip sled boards with zones that should not be shut down #9044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
490ace9
29f079b
bff626f
cd70897
6d54774
509b7c3
4c94b62
714cb16
3cd57bc
53d86b5
7967fde
65b5228
abc6c9d
c405fba
0a0ebb2
cf22b7e
6800b08
8db56d9
31bb2bc
85d3894
0b892ea
983b6f8
5f3d891
e934428
b38f053
19de828
a630198
332858b
9cfaf9e
f39561f
675fd60
b36fee1
5f4532c
112a8b8
85976da
fdf7846
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,105 @@ | ||
| # Load example system | ||
| load-example --nsleds 3 --ndisks-per-sled 3 | ||
|
|
||
| # Create a TUF repository from a fake manifest. (The output TUF repo is | ||
| # written to a temporary directory that this invocation of `reconfigurator-cli` | ||
| # is running out of as its working directory.) | ||
| # | ||
| # This is used to simulate the initial version of the system. | ||
| tuf-assemble ../../update-common/manifests/fake-0.0.1.toml | ||
|
|
||
| # Load the target release from the assembled TUF repository. | ||
| set target-release repo-0.0.1.zip | ||
|
|
||
| # Print the default target release. | ||
| show | ||
|
|
||
| # Update the install dataset on all sleds to the target release. | ||
| # This will cause zones to be noop converted over to Artifact, | ||
| # unblocking upgrades. | ||
| sled-update-install-dataset serial0 --to-target-release | ||
| sled-update-install-dataset serial1 --to-target-release | ||
| sled-update-install-dataset serial2 --to-target-release | ||
|
|
||
| # Generate inventory, then do a planning run to ensure that all zones | ||
| # are set to Artifact. | ||
| inventory-generate | ||
| blueprint-plan latest latest | ||
| blueprint-diff latest | ||
| # The above blueprint includes a pending MGS update, which we should delete | ||
| # (we want to start from a fresh state). | ||
| blueprint-edit latest delete-sp-update serial0 | ||
| # Also set the Omicron config for all sleds to reflect the | ||
| # corresponding image sources. | ||
| sled-set serial0 omicron-config latest | ||
| sled-set serial1 omicron-config latest | ||
| sled-set serial2 omicron-config latest | ||
| # Generate inventory once more to reflect the omicron config changes. | ||
| inventory-generate | ||
| inventory-show latest | ||
|
|
||
| # Setup is now done -- create another TUF repository which will act as the | ||
| # target release being updated to. | ||
| tuf-assemble ../../update-common/manifests/fake.toml | ||
|
|
||
| # Load the target release from the assembled TUF repository. | ||
| set target-release repo-1.0.0.zip | ||
|
|
||
| # First, print out sled information. | ||
| sled-list | ||
|
|
||
| # Retrieve blueprint information to know the ID of the zone to expunge | ||
| blueprint-show latest | ||
|
|
||
| # We expunge a single internal_dns zone | ||
| # | ||
| # Note that since we are not running the command to emulate the sled-agent | ||
| # performing an inventory collection and seeing the DNS zone has gone away, the | ||
| # planner will not attemt to restore the internal DNS zone. This is intentional | ||
| # as we want the zone to stay in this state for the purposes of this test | ||
| blueprint-edit latest expunge-zones 99e2f30b-3174-40bf-a78a-90da8abba8ca | ||
| blueprint-diff latest | ||
|
|
||
| # Attempt to upgrade one RoT bootloader. This should successfully plan the | ||
| # pending RoT bootloader update | ||
| blueprint-plan latest latest | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why doesn't this planning run also attempt to restore internal DNS redundancy? Is it because we're trying to do MGS updates, and therefore we don't do any zone additions? I'm a little worried we might have a sort of planning deadlock here:
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not as familiar with the code to expunge and restore zones. That said, as far as I know the only thing gated by unfinished MGS driven updates are zone updates. Not restoring zone redundancy. Looking at the I added this command locally to the test here and this is the output I get > # Mark for cleanup
> blueprint-edit latest mark-for-cleanup 99e2f30b-3174-40bf-a78a-90da8abba8ca
blueprint df06bb57-ad42-4431-9206-abff322896c7 created from latest blueprint (af934083-59b5-4bf6-8966-6fb5292c29e1): marked zone 99e2f30b-3174-40bf-a78a-90da8abba8ca ready for cleanup
> blueprint-plan latest latest
INFO performed noop zone image source checks on sled, sled_id: 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c, num_total: 8, num_already_artifact: 8, num_eligible: 0, num_ineligible: 0
INFO BootPartitionDetails inventory hash not found in TUF repo, ignoring for noop checks, sled_id: 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c, slot: a, expected_hash: 0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a
INFO BootPartitionDetails inventory hash not found in TUF repo, ignoring for noop checks, sled_id: 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c, slot: b, expected_hash: 0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
INFO performed noop zone image source checks on sled, sled_id: 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6, num_total: 8, num_already_artifact: 8, num_eligible: 0, num_ineligible: 0
INFO BootPartitionDetails inventory hash not found in TUF repo, ignoring for noop checks, sled_id: 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6, slot: a, expected_hash: 0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a
INFO BootPartitionDetails inventory hash not found in TUF repo, ignoring for noop checks, sled_id: 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6, slot: b, expected_hash: 0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
INFO performed noop zone image source checks on sled, sled_id: d81c6a84-79b8-4958-ae41-ea46c9b19763, num_total: 8, num_already_artifact: 8, num_eligible: 0, num_ineligible: 0
INFO BootPartitionDetails inventory hash not found in TUF repo, ignoring for noop checks, sled_id: d81c6a84-79b8-4958-ae41-ea46c9b19763, slot: a, expected_hash: 0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a
INFO BootPartitionDetails inventory hash not found in TUF repo, ignoring for noop checks, sled_id: d81c6a84-79b8-4958-ae41-ea46c9b19763, slot: b, expected_hash: 0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
INFO skipping board for MGS-driven update, serial_number: serial0, part_number: model0 due to unsafe zones: internal_dns
INFO skipping board for MGS-driven update, serial_number: serial2, part_number: model2 due to unsafe zones: internal_dns
INFO configuring MGS-driven update, artifact_version: 1.0.0, artifact_hash: 5b0f601b1fbb8674db9c751a02f8b14f8e6d4e8470f4f7b686fecb2c49ec11f9, expected_stage0_next_version: NoValidVersion, expected_stage0_version: 0.0.1, component: rot_bootloader, sp_slot: 1, sp_type: Sled, serial_number: serial1, part_number: model1
INFO ran out of boards for MGS-driven update
generated blueprint 7f976e0d-d2a5-4eeb-9e82-c82bc2824aba based on parent blueprint df06bb57-ad42-4431-9206-abff322896c7
blueprint source: planner with report:
planning report:
* skipping noop zone image source check on sled 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c: all 8 zones are already from artifacts
* skipping noop zone image source check on sled 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6: all 8 zones are already from artifacts
* skipping noop zone image source check on sled d81c6a84-79b8-4958-ae41-ea46c9b19763: all 8 zones are already from artifacts
* 1 pending MGS update:
* model1:serial1: RotBootloader(PendingMgsUpdateRotBootloaderDetails { expected_stage0_version: ArtifactVersion("0.0.1"), expected_stage0_next_version: NoValidVersion })
* 2 zones not ready to shut down safely for an MGS driven update:
* zone 427ec88f-f467-42fa-9bbb-66a91a36103c: only 2/2 internal DNS zones are synchronized; require at least 3
* zone ea5b4030-b52f-44b2-8d70-45f15f987d01: only 2/2 internal DNS zones are synchronized; require at least 3
* discretionary zones placed:
* internal_dns zone on sled 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c from source install dataset
* zone updates waiting on discretionary zones
* waiting to update top-level nexus_generation: some non-Nexus zone are not yet updated
> blueprint-diff latest
from: blueprint df06bb57-ad42-4431-9206-abff322896c7
to: blueprint 7f976e0d-d2a5-4eeb-9e82-c82bc2824aba
MODIFIED SLEDS:
sled 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c (active, config generation 4 -> 5):
host phase 2 contents:
------------------------
slot boot image source
------------------------
A current contents
B current contents
physical disks:
------------------------------------------------------------------------------------
vendor model serial disposition
------------------------------------------------------------------------------------
fake-vendor fake-model serial-727522a7-934f-494d-b5b3-160968e74463 in service
fake-vendor fake-model serial-72c59873-31ff-4e36-8d76-ff834009349a in service
fake-vendor fake-model serial-b5fd5bc1-099e-4e77-8028-a9793c11f43b in service
datasets:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dataset name dataset id disposition quota reservation compression
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
oxp_727522a7-934f-494d-b5b3-160968e74463/crucible 2f204c50-a327-479c-8852-f53ec7a19c1f in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crucible 78f34ce7-42f1-41da-995f-318f32054ad2 in service none none off
oxp_b5fd5bc1-099e-4e77-8028-a9793c11f43b/crucible 1640adb6-70bf-44cf-b05c-bff6dd300cf3 in service none none off
oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/clickhouse 841d5648-05f0-47b0-b446-92f6b60fe9a6 in service none none off
oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/external_dns 8e0bd2bd-23b7-4bc6-9e73-c4d4ebc0bc8c in service none none off
oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/internal_dns 2ad1875a-92ac-472f-8c26-593309f0e4da expunged none none off
oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone 4829f422-aa31-41a8-ab73-95684ff1ef48 in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/zone 775f9207-c42d-4af2-9186-27ffef67735e in service none none off
oxp_b5fd5bc1-099e-4e77-8028-a9793c11f43b/crypt/zone 3b66453b-7148-4c1b-84a9-499e43290ab4 in service none none off
oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_clickhouse_353b3b65-20f7-48c3-88f7-495bd5d31545 b46de15d-33e7-4cd0-aa7c-e7be2a61e71b in service none none off
oxp_b5fd5bc1-099e-4e77-8028-a9793c11f43b/crypt/zone/oxz_crucible_86a22a56-0168-453d-9df1-cb2a7c64b5d3 3e0d6188-c503-49cf-a441-fa7df40ceb43 in service none none off
oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_crucible_bd354eef-d8a6-4165-9124-283fb5e46d77 5ae11c7e-08fa-4d78-a4ea-14b4a9a10241 in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/zone/oxz_crucible_e2fdefe7-95b2-4fd2-ae37-56929a06d58c b8f2a09f-8bd2-4418-872b-a4457a3f958c in service none none off
oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_crucible_pantry_ad6a3a03-8d0f-4504-99a4-cbf73d69b973 49f8fbb6-5bac-4609-907f-6e3dfc206059 in service none none off
oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_external_dns_6c3ae381-04f7-41ea-b0ac-74db387dbc3a 8c4fa711-1d5d-4e93-85f0-d17bff47b063 in service none none off
oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_internal_dns_99e2f30b-3174-40bf-a78a-90da8abba8ca c31623de-c19b-4615-9f1d-5e1daa5d3bda expunged none none off
oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_nexus_466a9f29-62bf-4e63-924a-b9efdb86afec 3560dd69-3b23-4c69-807d-d673104cfc68 in service none none off
oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_ntp_62620961-fc4a-481e-968b-f5acbac0dc63 09b9cc9b-3426-470b-a7bc-538f82dede03 in service none none off
oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/debug 93957ca0-9ed1-4e7b-8c34-2ce07a69541c in service 100 GiB none gzip-9
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/debug 2db6b7c1-0f46-4ced-a3ad-48872793360e in service 100 GiB none gzip-9
oxp_b5fd5bc1-099e-4e77-8028-a9793c11f43b/crypt/debug 318fae85-abcb-4259-b1b6-ac96d193f7b7 in service 100 GiB none gzip-9
+ oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/internal_dns 0e04c3b5-47fc-412c-b352-dbf09a6bfac8 in service none none off
+ oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_internal_dns_ff6dceb4-84c5-4e0e-afe1-3c884cb656cb 71d27157-3ff7-4806-be2e-1ad6d6b5bfa2 in service none none off
omicron zones:
------------------------------------------------------------------------------------------------------------------------
zone type zone id image source disposition underlay IP
------------------------------------------------------------------------------------------------------------------------
clickhouse 353b3b65-20f7-48c3-88f7-495bd5d31545 artifact: version 0.0.1 in service fd00:1122:3344:102::23
crucible 86a22a56-0168-453d-9df1-cb2a7c64b5d3 artifact: version 0.0.1 in service fd00:1122:3344:102::28
crucible bd354eef-d8a6-4165-9124-283fb5e46d77 artifact: version 0.0.1 in service fd00:1122:3344:102::26
crucible e2fdefe7-95b2-4fd2-ae37-56929a06d58c artifact: version 0.0.1 in service fd00:1122:3344:102::27
crucible_pantry ad6a3a03-8d0f-4504-99a4-cbf73d69b973 artifact: version 0.0.1 in service fd00:1122:3344:102::25
external_dns 6c3ae381-04f7-41ea-b0ac-74db387dbc3a artifact: version 0.0.1 in service fd00:1122:3344:102::24
internal_dns 99e2f30b-3174-40bf-a78a-90da8abba8ca artifact: version 0.0.1 expunged ✓ fd00:1122:3344:1::1
internal_ntp 62620961-fc4a-481e-968b-f5acbac0dc63 artifact: version 0.0.1 in service fd00:1122:3344:102::21
nexus 466a9f29-62bf-4e63-924a-b9efdb86afec artifact: version 0.0.1 in service fd00:1122:3344:102::22
+ internal_dns ff6dceb4-84c5-4e0e-afe1-3c884cb656cb install dataset in service fd00:1122:3344:1::1
COCKROACHDB SETTINGS:
state fingerprint::::::::::::::::: (none) (unchanged)
cluster.preserve_downgrade_option: (do not modify) (unchanged)
METADATA:
internal DNS version::: 1 (unchanged)
external DNS version::: 1 (unchanged)
target release min gen: 1 (unchanged)
nexus gen:::::::::::::: 1 (unchanged)
OXIMETER SETTINGS:
generation: 1 (unchanged)
read from:: SingleNode (unchanged)
PENDING MGS UPDATES:
Pending MGS-managed updates (all baseboards):
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sp_type slot part_number serial_number artifact_hash artifact_version details
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ sled 1 model1 serial1 5b0f601b1fbb8674db9c751a02f8b14f8e6d4e8470f4f7b686fecb2c49ec11f9 1.0.0 RotBootloader(PendingMgsUpdateRotBootloaderDetails { expected_stage0_version: ArtifactVersion("0.0.1"), expected_stage0_next_version: NoValidVersion })
internal DNS:
* DNS zone: "control-plane.oxide.internal":
* name: @ (records: 2 -> 3)
- NS ns1.control-plane.oxide.internal
- NS ns2.control-plane.oxide.internal
+ NS ns1.control-plane.oxide.internal
+ NS ns2.control-plane.oxide.internal
+ NS ns3.control-plane.oxide.internal
* name: _nameservice._tcp (records: 2 -> 3)
- SRV port 5353 427ec88f-f467-42fa-9bbb-66a91a36103c.host.control-plane.oxide.internal
- SRV port 5353 ea5b4030-b52f-44b2-8d70-45f15f987d01.host.control-plane.oxide.internal
+ SRV port 5353 427ec88f-f467-42fa-9bbb-66a91a36103c.host.control-plane.oxide.internal
+ SRV port 5353 ea5b4030-b52f-44b2-8d70-45f15f987d01.host.control-plane.oxide.internal
+ SRV port 5353 ff6dceb4-84c5-4e0e-afe1-3c884cb656cb.host.control-plane.oxide.internal
+ name: ff6dceb4-84c5-4e0e-afe1-3c884cb656cb.host (records: 1)
+ AAAA fd00:1122:3344:1::1
* name: ns1 (records: 1 -> 1)
- AAAA fd00:1122:3344:2::1
+ AAAA fd00:1122:3344:1::1
* name: ns2 (records: 1 -> 1)
- AAAA fd00:1122:3344:3::1
+ AAAA fd00:1122:3344:2::1
+ name: ns3 (records: 1)
+ AAAA fd00:1122:3344:3::1
unchanged names: 46 (records: 58)
It seems to me like everything is working as expected. I don't think I want to add the
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ahhh yeah I forgot about needing to see it cleaned up. Thanks, the comment is great. |
||
| blueprint-diff latest | ||
|
|
||
| # We generate another plan, there should be no changes | ||
| blueprint-plan latest latest | ||
| blueprint-diff latest | ||
|
|
||
| # Now, forcibly update the simulated RoT bootloader in all sleds to reflect that | ||
| # the update completed. | ||
| # Collect inventory from it and use that collection for another planning step. | ||
| # This should report that the update completed, and successfully plan a | ||
| # pending RoT update | ||
| sled-update-rot-bootloader serial0 --stage0 1.0.0 | ||
| sled-update-rot-bootloader serial1 --stage0 1.0.0 | ||
| sled-update-rot-bootloader serial2 --stage0 1.0.0 | ||
| inventory-generate | ||
| blueprint-plan latest latest | ||
| blueprint-diff latest | ||
|
|
||
| # We repeat the same process with the RoT, but now expecting to see a single | ||
| # blocked SP update due to serial0 containing an unsafe zone. We also see a | ||
| # pending SP update on serial1 which has no unsafe zones. | ||
| sled-update-rot serial0 --slot-b 1.0.0 --active-slot b --persistent-boot-preference b | ||
| sled-update-rot serial1 --slot-b 1.0.0 --active-slot b --persistent-boot-preference b | ||
| sled-update-rot serial2 --slot-b 1.0.0 --active-slot b --persistent-boot-preference b | ||
| inventory-generate | ||
| blueprint-plan latest latest | ||
| blueprint-diff latest | ||
|
|
||
| # Since there was a blocked update on serial0, the planner has moved on to | ||
| # serial1. This sled has no unsafe zones, so we forcibly update it along with | ||
| # the SPs to see the errors with the Host OS on serial2 and serial0, and plan no | ||
| # pending updates | ||
| sled-update-sp serial0 --active 1.0.0 | ||
| sled-update-sp serial1 --active 1.0.0 | ||
| sled-update-sp serial2 --active 1.0.0 | ||
| sled-update-host-phase2 serial1 --boot-disk B --slot-b d944ae205b61ccf4322448f7d0311a819c53d9844769de066c5307c1682abb47 | ||
| sled-update-host-phase1 serial1 --active B --slot-b b99d5273ba1418bebb19d74b701d716896409566d41de76ada71bded4c9b166b | ||
| inventory-generate | ||
| blueprint-plan latest latest | ||
| blueprint-diff latest | ||
Large diffs are not rendered by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have a similar command to easily update all the hubris images of a given sled (or baseboard?) to match the latest blueprint like this one that we have for Omicron configs? Having to manually delete the staged SP update is kinda confusing.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it's my unfamiliarity with this part of the codebase, but I couldn't find an easy quick way to create this command as part of this PR. Do you mind if this is done as a follow-up PR? I want to get this merged soon so it can be tested before the release
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, totally fine.