[sled-agent] Integrate config-reconciler #8064
#[clap(subcommand)]
Zones(ZoneCommands),

/// print information about zpools
Are you expecting that inventory will supplant this info? Or are you planning on replacing this access to the sled agent later?
I was expecting that inventory would supplant this. (I think maybe it already has, in practice? I definitely only look at inventory when I'm curious about zpools; I don't think I've ever used these omdb subcommands.)
This is somewhat extracted from #8064, but can be landed independently and will make some of the followup sled-agent-config-reconciler PRs a little cleaner. We don't yet ledger `OmicronSledConfig`s to disk, so we're free to fiddle with the details of its fields without worrying about backwards compatibility. Fixes #7774.
…ig reconciler (#8188)

The primary change here is replacing these inventory fields (a subset of `OmicronSledConfig`):

```rust
pub omicron_zones: OmicronZonesConfig,
pub omicron_physical_disks_generation: Generation,
```

with these:

```rust
pub ledgered_sled_config: Option<OmicronSledConfig>,
pub reconciler_status: ConfigReconcilerInventoryStatus,
pub last_reconciliation: Option<ConfigReconcilerInventory>,
```

Once #8064 lands, all three of these will be filled in meaningfully; as of this PR, only `ledgered_sled_config` is populated. (`reconciler_status` is always `NotYetRun` and `last_reconciliation` is always `None`, since there is no reconciler yet.)

The rest of the changes are all fallout from changing inventory:

* Update `omdb` printing
* Update sled-agent to report the new inventory fields
* Update consumers of inventory (tests, reconfigurator planner, one Nexus RPW) - these all just look at `ledgered_sled_config` for now, but will need to be updated on #8064 once other fields are populated
* Update database schema, model, and queries (the bulk of the diff). This requires dropping all preexisting collections, since there's no way to migrate from just `omicron_zones` to a full `OmicronSledConfig`. The first few schema migrations take care of this.

Before merging I'll go through an upgrade on a racklette and confirm things come back up okay after the schema migration blows away all the pre-update inventory collections. (We think this is fine, but it'd be good to confirm.) But I think this is close enough that it's reviewable.

Couple other minor changes that came along for the ride:

* Closes #6770 (`inv_sled_omicron_zones` is gone now)
* Fixes #8084 (added `image_source` columns to the inventory zone config table, so we don't lose `ImageSource::Artifact { hash }` values reported by sled-agent)
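As a rough illustration of the new inventory shape described above, here is a minimal sketch of a consumer that only looks at `ledgered_sled_config`. The types are simplified stand-ins, not the real omicron definitions: the `generation` and `last_reconciled_generation` fields, the `SledInventory` struct name, and the `ledgered_generation` helper are all assumptions made for the example, and only the `NotYetRun` variant named in the description is modeled.

```rust
// Hypothetical, simplified stand-ins for the real omicron types.
#[derive(Debug, Clone, PartialEq)]
pub struct OmicronSledConfig {
    // Assumed field; the real config carries much more than a generation.
    pub generation: u64,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ConfigReconcilerInventoryStatus {
    // The only variant named in the description; others are omitted here.
    NotYetRun,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ConfigReconcilerInventory {
    // Assumed field, for illustration only.
    pub last_reconciled_generation: u64,
}

// The three inventory fields listed in the description.
pub struct SledInventory {
    pub ledgered_sled_config: Option<OmicronSledConfig>,
    pub reconciler_status: ConfigReconcilerInventoryStatus,
    pub last_reconciliation: Option<ConfigReconcilerInventory>,
}

// A consumer that, like the planner and tests in this PR, only inspects
// the ledgered config and ignores the reconciler fields for now.
pub fn ledgered_generation(inv: &SledInventory) -> Option<u64> {
    inv.ledgered_sled_config.as_ref().map(|config| config.generation)
}
```

A sled that has never ledgered a config would report `None` here, which consumers need to handle explicitly.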
I'm putting racklette testing notes for this branch plus a few followups in comments on the last of those followups (#8220).
andrewjstone
left a comment
This is a hard PR to review, given its broad scope. It was made somewhat easier by recognizing a few patterns such as replacing calls to the storage manager with rx channels for disk and datasets.
It all appears correct to me, but again, hard to really tell. I'm sure it was tedious to implement as well :)
Regardless, looks good enough to merge and continue with.
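The "rx channel" pattern mentioned in this review can be sketched roughly as follows. This is a std-only stand-in for a watch-style channel, not the PR's actual implementation: `InternalDisks`, `DisksTx`, `DisksRx`, `disks_channel`, and `publish` are hypothetical names, and only the `current()` accessor mirrors a call that appears in the diff.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical snapshot of disk state that consumers want to read.
#[derive(Debug, Clone, PartialEq)]
pub struct InternalDisks {
    pub debug_datasets: Vec<String>,
}

// Sending half: owned by whoever produces new disk state.
pub struct DisksTx {
    shared: Arc<RwLock<Arc<InternalDisks>>>,
}

// Receiving half: cheap to clone and hand to each consumer.
#[derive(Clone)]
pub struct DisksRx {
    shared: Arc<RwLock<Arc<InternalDisks>>>,
}

pub fn disks_channel(initial: InternalDisks) -> (DisksTx, DisksRx) {
    let shared = Arc::new(RwLock::new(Arc::new(initial)));
    (DisksTx { shared: Arc::clone(&shared) }, DisksRx { shared })
}

impl DisksTx {
    // Replace the current snapshot; receivers see it on their next read.
    pub fn publish(&self, disks: InternalDisks) {
        *self.shared.write().unwrap() = Arc::new(disks);
    }
}

impl DisksRx {
    // Return the latest snapshot without calling back into a manager.
    pub fn current(&self) -> Arc<InternalDisks> {
        let guard = self.shared.read().unwrap();
        Arc::clone(&*guard)
    }
}
```

The point of the pattern is that consumers read the most recent snapshot synchronously instead of making a request to a central storage manager.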
method = GET,
path = "/datasets",
}]
async fn datasets_get(
Were these only used by the OMDB commands that got removed?
Yes, and I think these are obviated by all the information available via /inventory?
impl LocalStorage for ConfigReconcilerHandle {
    async fn dyn_datasets_config_list(&self) -> Result<DatasetsConfig, Error> {
        self.datasets_config_list().await.map_err(|err| err.into())
        // TODO-cleanup This is super gross; add a better API (maybe fetch a
Did you want to clean this up in this PR?
No, but thanks for the reminder that I need to do this. I think once this stack of work lands, the only consumers of the current StorageManager are tests (including one that implements this API); I'd like to do this cleanup along with changing those tests to interact with the config reconciler instead.
/// Given a sled config, produce a reconciler result that sled-agent could
/// have emitted if reconciliation succeeded.
///
/// This method should only be used by tests and dev tools; real code should
Should we mark this with #[test] and maybe #[cfg(any(test, feature = "testing"))] ?
I don't think we can, because of the "and dev tools". IIRC reconfigurator-cli and some of the "example system" stuff uses this, neither of which is gated by test / testing.
// reflects the parent blueprint disk generation. If it does
// then we mark any expunged disks decommissioned.
//
// TODO-correctness We inspect `last_reconciliation` here to confirm
On balance, this seems like the right choice to me. We should know the sled agent has acted before decommissioning.
Thanks for the confirmation, I'll update the comment.
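For readers following this thread, the check under discussion can be sketched as below. This is a guess at the shape of the logic, not the planner's actual code: `LastReconciliation`, its `disks_generation` field, the function name, and the use of `>=` rather than strict equality are all assumptions.

```rust
// Hypothetical summary of what a sled reported from its last reconciliation.
pub struct LastReconciliation {
    // Assumed field: the disk generation the sled agent actually acted on.
    pub disks_generation: u64,
}

// Only treat expunged disks as safe to decommission once the sled agent has
// reported a reconciliation that reflects the parent blueprint's disk
// generation. A sled with no reconciliation yet never qualifies.
pub fn may_decommission_expunged_disks(
    last_reconciliation: Option<&LastReconciliation>,
    parent_blueprint_disk_generation: u64,
) -> bool {
    match last_reconciliation {
        Some(last) => last.disks_generation >= parent_blueprint_disk_generation,
        None => false,
    }
}
```

Gating on what the sled has reconciled, rather than on what it has merely ledgered, matches the reviewer's point that the sled agent should have acted before a disk is decommissioned.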
… starting zones (#8219)

This dramatically reduces the work that `ServiceManager::start_omicron_zone()` does by moving most of it to the config-reconciler:

* Moved: shutting down existing zone of the same name
* Moved: checking for time sync
* Reworked: checking datasets and choosing a root zpool (now checks are performed against the most-recently-reconciled `DatasetConfig`s, and we never choose a root zpool since all zones have a property specifying which they should use)

Builds on #8064 + #8218. Fixes #8173.
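The reworked dataset check might look roughly like the following sketch. Everything here is hypothetical: `ZoneConfig`, its `filesystem_pool` field, and `root_zpool_for_zone` are illustrative names, and the set of reconciled pools stands in for whatever the reconciler derives from the most-recently-reconciled `DatasetConfig`s.

```rust
use std::collections::BTreeSet;

// Hypothetical zone config: every zone names the zpool its root lives on.
pub struct ZoneConfig {
    pub name: String,
    pub filesystem_pool: String,
}

// Validate the zone's own root zpool against the pools known from the last
// reconciliation. Nothing is chosen on the zone's behalf; a missing pool is
// an error rather than a reason to pick a different one.
pub fn root_zpool_for_zone<'a>(
    zone: &'a ZoneConfig,
    reconciled_pools: &BTreeSet<String>,
) -> Result<&'a str, String> {
    let pool = zone.filesystem_pool.as_str();
    if reconciled_pools.contains(pool) {
        Ok(pool)
    } else {
        Err(format!(
            "zone {} wants zpool {}, which was not in the last reconciliation",
            zone.name, pool
        ))
    }
}
```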
papertigers
left a comment
I skimmed some of the nexus bits and mostly focused on the sled-agent/support-bundle side of things and it looks okay to me.
Just noticed one comment with weird wording and a note mostly to future me about some support bundle stuff that looks like it will be an easy change once this lands.
.all_sled_diagnostics_directories();
let tempdir = m2_debug_datasets.first().ok_or(Error::MissingStorage)?;
let current_internal_disks = self.internal_disks_rx.current();
let mut m2_debug_datasets = current_internal_disks.all_debug_datasets();
Not really related to this PR, but I wanted to call out that we will be switching this to a U.2 rather than an M.2 for encryption reasons. #8197
I think once this lands I will pick up that issue and swap things around.
…8218)

This is more consistent with how the reconciler remembers zones and disks. This is almost all just moving code around. The only nontrivial changes are:

1. In the spot where we ought to delete datasets, we at least "forget" them in-memory and log an error about leaking the ZFS dataset
2. The dataset serialization task remembers fewer details about datasets it ensured (just the names - that's all we need for the error checking it does when dealing with nested datasets)

Staged on top of #8064. PR 1 of 3 working towards #8173.
Just a note before merging: The diff is now quite a bit larger than it was when this PR was reviewed, because d811bda pulls in the stack of 4 PRs that were built on top of this. They were all reviewed independently, and I'd like to land all of this at once.
@askfongjojo noticed that our `uptime`s on dogfood are all nonsense:

```
 8  BRM44220011 ok: 14:05:34 up 14066 day(s), 14:06, 1 user, load average: 5.54, 5.60, 5.62
 9  BRM44220005 ok: 14:05:34 up 14066 day(s), 14:06, 1 user, load average: 17.51, 17.81, 17.81
10  BRM42220009 ok: 14:05:35 up 14066 day(s), 14:06, 1 user, load average: 15.00, 14.46, 14.01
11  BRM42220006 ok: 14:05:35 up 14066 day(s), 14:06, 0 users, load average: 11.49, 11.21, 11.10
12  BRM42220057 ok: 14:05:35 up 14066 day(s), 14:06, 0 users, load average: 2.88, 2.62, 2.03
...
```

I _think_ #8064 introduced this. It shuffled around how time sync is checked and added a callback that the config-reconciler is supposed to run when it detects time has synchronized; that callback is responsible for rewriting `uptime` (among other things), but it never actually executes the callback. This PR fixes that.

However, we have some racklettes that are running commits that include #8064 that have reasonable uptimes. I'm not sure how that's possible - is there some other way `uptime` can be correct if sled-agent doesn't fix it?
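To make the bug class concrete, here is a minimal sketch of a "run this once when time first synchronizes" hook. It is not the sled-agent code: `TimeSyncWatcher`, `observe`, and the closure-based callback are hypothetical. The failure described above corresponds to storing the callback but never reaching the line that invokes it.

```rust
// Hypothetical holder for a callback that must run exactly once, the first
// time the reconciler observes that time has synchronized.
pub struct TimeSyncWatcher<F: FnMut()> {
    on_time_synced: F,
    fired: bool,
}

impl<F: FnMut()> TimeSyncWatcher<F> {
    pub fn new(on_time_synced: F) -> Self {
        Self { on_time_synced, fired: false }
    }

    // Called on every reconciliation pass with the latest sync state.
    pub fn observe(&mut self, time_is_synced: bool) {
        if time_is_synced && !self.fired {
            self.fired = true;
            // This invocation is the step that must not be skipped.
            (self.on_time_synced)();
        }
    }
}
```

A test that counts invocations across several `observe` calls is a cheap way to catch a callback that is registered but never run.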
This doorbell API was used by Sled Agent prior to the config reconciler introduced in #8064. Nothing currently appears to use it.
This PR integrates the new `sled-agent-config-reconciler` crate with `sled-agent`. It will not currently pass tests due to the reconciler not being completely implemented, but I'd like to get any feedback on this integration work itself (particularly as it pertains to the API of `sled-agent-config-reconciler`). See the description of #8063 for more context.

There are a couple serious warts with this PR:
`StorageManager` (because its functionality is being absorbed into `sled-agent-config-reconciler`); however, the storage manager also has a rich set of test support. This PR leaves a couple sled-agent submodules using that test support (support-bundle/storage and zone-bundle). In the long run I think it'd be better to rework these (if there are no remaining production uses of `StorageManager`), but for now I think this is... okay? Feedback welcome.