
Conversation

@smklein (Collaborator) commented Apr 28, 2023

History

The Sled Agent has historically had two different "managers" responsible for Zones:

  1. ServiceManager, which presided over zones that do not operate on Datasets
  2. StorageManager, which manages disks, but also manages zones which operate on those disks

This separation is even reflected in the sled agent API exposed to Nexus - the Sled Agent exposes:

  • PUT /services
  • PUT /filesystem

For "add a service (within a zone) to this sled" vs "add a dataset (and corresponding zone) to this sled within a particular zpool".

This has been kinda handy for Nexus, since "provision CRDB on this dataset" and "start the CRDB service on that dataset" don't need to be separate operations. Within the Sled Agent, however, it has been a pain-in-the-butt from the perspective of diverging implementations. The StorageManager and ServiceManager have evolved their own mechanisms for storing configs, identifying zpools on which to place filesystems, etc., even though their responsibilities (managing running zones) overlap quite a lot.

This PR

This PR migrates the responsibility for "service management" entirely into the ServiceManager, leaving the StorageManager responsible for monitoring disks.

In detail, this means:

  • The responsibility for launching Clickhouse, CRDB, and Crucible zones has moved from storage_manager.rs into services.rs
  • The StorageManager no longer requires an Etherstub device during construction
  • The ServiceZoneRequest can operate on an optional dataset argument
  • The "config management" for dataset-based zones is now much more aligned with non-dataset zones. Each sled stores /var/oxide/services.toml and /var/oxide/storage-services.toml for each group.
  • filesystem_ensure - which previously asked the StorageManager to format a dataset and also launch a zone - now asks the StorageManager to format a dataset, and separately asks the ServiceManager to launch a zone (see the sketch after this list).
    • In the future, this may become vectorized ("ensure the sled has all the datasets we want...") to have parity with the service management, but this would require a more invasive change in Nexus.
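
As referenced in the filesystem_ensure bullet, here is a minimal sketch of that split, assuming hypothetical names throughout (DatasetKind, format_dataset, launch_zone); the actual sled agent types and signatures live in the PR diff, not here.

```rust
// Illustrative only: DatasetKind, format_dataset, and launch_zone are
// hypothetical stand-ins, not the actual sled agent API.

struct StorageManager;
struct ServiceManager;

#[allow(dead_code)]
enum DatasetKind {
    CockroachDb,
    Clickhouse,
    Crucible,
}

impl StorageManager {
    /// Formats (or finds) the dataset on the given zpool; no zone work here.
    fn format_dataset(&self, zpool: &str, kind: &DatasetKind) -> Result<String, String> {
        let _ = kind;
        Ok(format!("{zpool}/example-dataset"))
    }
}

impl ServiceManager {
    /// Launches the zone that serves the dataset; no disk work here.
    fn launch_zone(&self, dataset: &str, kind: &DatasetKind) -> Result<(), String> {
        let _ = (dataset, kind);
        Ok(())
    }
}

/// The shape of the new flow: format the dataset first, then launch the zone.
fn filesystem_ensure(
    storage: &StorageManager,
    services: &ServiceManager,
    zpool: &str,
    kind: DatasetKind,
) -> Result<(), String> {
    // Step 1: the StorageManager only formats the dataset.
    let dataset = storage.format_dataset(zpool, &kind)?;
    // Step 2: the ServiceManager, separately, launches the zone on it.
    services.launch_zone(&dataset, &kind)
}

fn main() {
    let (storage, services) = (StorageManager, ServiceManager);
    filesystem_ensure(&storage, &services, "oxp_example", DatasetKind::CockroachDb).unwrap();
}
```

The point of the split is that each manager keeps a single responsibility: the StorageManager touches disks and datasets, while the ServiceManager owns every running zone, dataset-backed or not.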

@smklein added the storage and Sled Agent labels and removed the bootstrap services label Apr 30, 2023
@andrewjstone (Contributor) left a comment

This is a nice cleanup! I'm happy to get the StorageManager out of the business of launching zones.

  // pretty tight; we should consider merging them together.
- let storage_manager =
-     StorageManager::new(&log, underlay_etherstub.clone()).await;
+ let storage_manager = StorageManager::new(&log).await;
andrewjstone (Contributor) commented:

Woohoo!

.join(STORAGE_SERVICES_CONFIG_FILENAME)
}

// TODO(ideas):
andrewjstone (Contributor) commented:

Looks like this first part is implemented.

smklein (Collaborator, Author) commented:

I actually implemented this fully within #2972

smklein (Collaborator, Author) commented:

Comment removed in 541f68d

// - ... Writer which *knows the type* to be serialized, so can direct it to the
// appropriate output path.
//
// - TODO: later: Can also make the path writing safer, by...
andrewjstone (Contributor) commented:

Maybe move this technique out into an issue?

smklein (Collaborator, Author) commented:

See: #2972 - I just finished implementing this.

andrewjstone (Contributor) commented:

Oh. Excellent!
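
For concreteness, a minimal sketch of the "writer which knows the type" idea from the TODO quoted above; every name here (Ledgerable, write_ledger, StorageServices) is hypothetical, and the implementation that actually landed in #2972 may look quite different.

```rust
// Illustrative only: Ledgerable, write_ledger, and StorageServices are
// hypothetical names; the real version landed separately in #2972.

use std::io;
use std::path::PathBuf;

/// Each "ledgerable" type knows how to serialize itself and where it lives.
trait Ledgerable {
    fn path() -> PathBuf;
    fn serialize(&self) -> String;
}

/// A writer that is generic over the type, so the output path is chosen by
/// the type itself and a ledger can no longer be written to the wrong file.
fn write_ledger<T: Ledgerable>(value: &T) -> io::Result<()> {
    std::fs::write(T::path(), value.serialize())
}

/// Example: a list of storage services persisted at a fixed path.
struct StorageServices {
    names: Vec<String>,
}

impl Ledgerable for StorageServices {
    fn path() -> PathBuf {
        PathBuf::from("/var/oxide/storage-services.toml")
    }
    fn serialize(&self) -> String {
        // Stand-in for real TOML serialization.
        self.names.join("\n")
    }
}

fn main() -> io::Result<()> {
    let ledger = StorageServices { names: vec!["crucible".to_string()] };
    write_ledger(&ledger)
}
```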

);
}
self.load_non_storage_services().await?;
// TODO: These will fail if the disks aren't attached.
andrewjstone (Contributor) commented:

I'm having a hard time reasoning about when we'd want to retry, since it's unclear to me when a disk would become attached within a retry period. My admittedly somewhat uninformed take at this moment is that we shouldn't retry.

I think the distinction between these calls not retrying and NTP retrying is twofold:

  1. NTP retries even on successful startup, since we are waiting for time sync
  2. NTP is a barrier to starting other services. If it fails, later services will not work.

smklein (Collaborator, Author) commented:

This definitely deserves a follow-up bug - it's only relevant in the "cold boot" case, but that does matter!

Here's the deal:

  • When we boot the sled agent, it's possible that not all disks have been parsed (e.g., suppose there's a U.2 that's slow to bind a driver).
  • When we call this function, we'll read the service ledger from the M.2s, see what services should be started, and try starting them.
  • If any of those services...
    • ... have datasets in the U.2s, or
    • ... have zone filesystems in the U.2s
    ... then we'd fail to start them.

But that doesn't mean we should never launch the service - if it's in the M.2 service ledger, either RSS or Nexus provisioned that service, so we should keep giving it a shot. "The driver did bind, but it just took a while" and "The U.2 was unplugged, but someone put it back" are both cases where we should be able to launch these services, even if we might initially fail.

If, after a long enough period of time, we determine that we cannot launch that service (if the U.2 is detached, fails, etc.), then we'd have an opportunity to do some notification through the fault tolerance system (we either tell Nexus that the service isn't booting, or Nexus notices on its own somehow).
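
A minimal sketch of that "keep giving it a shot" behavior, with hypothetical names throughout (try_launch_service, launch_with_retry); the sled agent's real retry and fault-reporting machinery is not shown here.

```rust
// Illustrative only: try_launch_service and launch_with_retry are hypothetical;
// the actual sled agent retry and Nexus-notification paths are not shown.

use std::thread::sleep;
use std::time::Duration;

/// Stand-in for "try to launch one ledgered service", which can fail while
/// the U.2 holding its dataset has not yet attached.
fn try_launch_service(name: &str) -> Result<(), String> {
    Err(format!("dataset for {name} not yet available"))
}

/// Keep giving the service a shot rather than giving up at cold boot; after
/// `max_attempts` we would instead surface a fault (e.g., notify Nexus).
fn launch_with_retry(name: &str, max_attempts: u32) -> Result<(), String> {
    let mut delay = Duration::from_secs(1);
    for attempt in 1..=max_attempts {
        match try_launch_service(name) {
            Ok(()) => return Ok(()),
            Err(err) if attempt < max_attempts => {
                eprintln!("attempt {attempt} failed ({err}); retrying");
                sleep(delay);
                // Exponential backoff, capped so we keep probing periodically.
                delay = (delay * 2).min(Duration::from_secs(60));
            }
            Err(err) => return Err(err),
        }
    }
    Err(format!("{name}: gave up after {max_attempts} attempts"))
}

fn main() {
    if let Err(err) = launch_with_retry("crucible", 3) {
        eprintln!("would report to Nexus: {err}");
    }
}
```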

smklein (Collaborator, Author) commented:

Filed #2973, will mention it in this PR

smklein (Collaborator, Author) commented:

Mentioned in ed20fff

andrewjstone (Contributor) commented:

Thanks for the explanation. Those justifications make sense to me.

pub all_svcs_config_path: PathBuf,
// The path for the ServiceManager to store information about
// all running services.
all_svcs_ledger_path: PathBuf,
andrewjstone (Contributor) commented:

Why the change from config to ledger here?

Also, if we are going to change the names, we may also want to change the constants to refer to LEDGER instead of CONFIG.

smklein (Collaborator, Author) commented:

I think I got confused by the following overloaded uses of the term "config":

  • We use "config" to describe parameters for a variety of modules within the sled agent - tweaks on how they should be executed. For example, the sled agent has a "config", the bootstrap agent has a "config", and the storage manager also has a "config".
  • Totally separately, we're storing the list of services that the sled manages. I had previously called this "config", but I think the name was not totally accurate, so I'm updating it to "ledger". After all, it's just a ledger of "what am I responsible for running".

RE: the constants, will do!

smklein (Collaborator, Author) commented:

Done in 5d59951

@smklein marked this pull request as ready for review May 1, 2023 16:49
@andrewjstone (Contributor) left a comment

Thanks for the explanations and cleanup. Ship it!

@smklein merged commit ccc28fe into main May 1, 2023
@smklein deleted the storage-manager-cleanup branch May 1, 2023 18:16
Labels: Sled Agent (Related to the Per-Sled Configuration and Management), storage (Related to storage).