
Conversation

@smklein (Collaborator) commented Apr 28, 2023

History

The Sled Agent has historically had two different "managers" responsible for Zones:

  1. ServiceManager, which presided over zones that do not operate on Datasets
  2. StorageManager, which manages disks, but also manages zones which operate on those disks

This separation is even reflected in the sled agent API exposed to Nexus - the Sled Agent exposes:

  • PUT /services
  • PUT /filesystem

For "add a service (within a zone) to this sled" vs "add a dataset (and corresponding zone) to this sled within a particular zpool".

This has been kinda handy for Nexus, since "provision CRDB on this dataset" and "start the CRDB service on that dataset" don't need to be separate operations. Within the Sled Agent, however, it has been a pain-in-the-butt from the perspective of diverging implementations. The StorageManager and ServiceManager have evolved their own mechanisms for storing configs, identifying zpools on which to place filesystems, etc., even though their responsibilities (managing running zones) overlap quite a lot.

This PR

This PR migrates the responsibility for "service management" entirely into the ServiceManager, leaving the StorageManager responsible for monitoring disks.

In detail, this means:

  • The responsibility for launching Clickhouse, CRDB, and Crucible zones has moved from storage_manager.rs into services.rs
  • The StorageManager no longer requires an Etherstub device during construction
  • The ServiceZoneRequest can operate on an optional dataset argument
  • The "config management" for dataset-based zones is now much more aligned with non-dataset zones. Each sled stores /var/oxide/services.toml and /var/oxide/storage-services.toml for each group.
  • filesystem_ensure - which previously asked the StorageManager to format a dataset and also launch a zone - now asks the StorageManager to format a dataset, and separately asks the ServiceManager to launch a zone (see the sketch after this list).
    • In the future, this may become vectorized ("ensure the sled has all the datasets we want...") to have parity with the service management, but this would require a more invasive change in Nexus.
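
As referenced in the filesystem_ensure bullet, here is a minimal sketch of that split, assuming hypothetical names throughout (DatasetKind, format_dataset, launch_zone); the actual sled agent types and signatures live in the PR diff, not here.

```rust
// Illustrative only: DatasetKind, format_dataset, and launch_zone are
// hypothetical stand-ins, not the actual sled agent API.

struct StorageManager;
struct ServiceManager;

#[allow(dead_code)]
enum DatasetKind {
    CockroachDb,
    Clickhouse,
    Crucible,
}

impl StorageManager {
    /// Formats (or finds) the dataset on the given zpool; no zone work here.
    fn format_dataset(&self, zpool: &str, kind: &DatasetKind) -> Result<String, String> {
        let _ = kind;
        Ok(format!("{zpool}/example-dataset"))
    }
}

impl ServiceManager {
    /// Launches the zone that serves the dataset; no disk work here.
    fn launch_zone(&self, dataset: &str, kind: &DatasetKind) -> Result<(), String> {
        let _ = (dataset, kind);
        Ok(())
    }
}

/// The shape of the new flow: format the dataset first, then launch the zone.
fn filesystem_ensure(
    storage: &StorageManager,
    services: &ServiceManager,
    zpool: &str,
    kind: DatasetKind,
) -> Result<(), String> {
    // Step 1: the StorageManager only formats the dataset.
    let dataset = storage.format_dataset(zpool, &kind)?;
    // Step 2: the ServiceManager, separately, launches the zone on it.
    services.launch_zone(&dataset, &kind)
}

fn main() {
    let (storage, services) = (StorageManager, ServiceManager);
    filesystem_ensure(&storage, &services, "oxp_example", DatasetKind::CockroachDb).unwrap();
}
```

The point of the split is that each manager keeps a single responsibility: the StorageManager touches disks and datasets, while the ServiceManager owns every running zone, dataset-backed or not.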

@smklein added the storage and Sled Agent labels and removed the bootstrap services label Apr 30, 2023
@andrewjstone (Contributor) left a comment

This is a nice cleanup! I'm happy to get the StorageManager out of the business of launching zones.

  // pretty tight; we should consider merging them together.
- let storage_manager =
-     StorageManager::new(&log, underlay_etherstub.clone()).await;
+ let storage_manager = StorageManager::new(&log).await;
andrewjstone (Contributor) commented:

Woohoo!

.join(STORAGE_SERVICES_CONFIG_FILENAME)
}

// TODO(ideas):
andrewjstone (Contributor) commented:

Looks like this first part is implemented.

smklein (Collaborator, Author) commented:

I actually implemented this fully within #2972

smklein (Collaborator, Author) commented:

Comment removed in 541f68d

// - ... Writer which *knows the type* to be serialized, so can direct it to the
// appropriate output path.
//
// - TODO: later: Can also make the path writing safer, by...
andrewjstone (Contributor) commented:

Maybe move this technique out into an issue?

smklein (Collaborator, Author) commented:

See: #2972 - I just finished implementing this.

andrewjstone (Contributor) commented:

Oh. Excellent!
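
For concreteness, a minimal sketch of the "writer which knows the type" idea from the TODO quoted above; every name here (Ledgerable, write_ledger, StorageServices) is hypothetical, and the implementation that actually landed in #2972 may look quite different.

```rust
// Illustrative only: Ledgerable, write_ledger, and StorageServices are
// hypothetical names; the real version landed separately in #2972.

use std::io;
use std::path::PathBuf;

/// Each "ledgerable" type knows how to serialize itself and where it lives.
trait Ledgerable {
    fn path() -> PathBuf;
    fn serialize(&self) -> String;
}

/// A writer that is generic over the type, so the output path is chosen by
/// the type itself and a ledger can no longer be written to the wrong file.
fn write_ledger<T: Ledgerable>(value: &T) -> io::Result<()> {
    std::fs::write(T::path(), value.serialize())
}

/// Example: a list of storage services persisted at a fixed path.
struct StorageServices {
    names: Vec<String>,
}

impl Ledgerable for StorageServices {
    fn path() -> PathBuf {
        PathBuf::from("/var/oxide/storage-services.toml")
    }
    fn serialize(&self) -> String {
        // Stand-in for real TOML serialization.
        self.names.join("\n")
    }
}

fn main() -> io::Result<()> {
    let ledger = StorageServices { names: vec!["crucible".to_string()] };
    write_ledger(&ledger)
}
```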

);
}
self.load_non_storage_services().await?;
// TODO: These will fail if the disks aren't attached.
andrewjstone (Contributor) commented:

I'm having a hard time reasoning about when we'd want to retry, since it's unclear to me when a disk would become attached within a retry period. My admittedly somewhat uninformed take at this moment is that we shouldn't retry.

I think the distinction between these calls not retrying and NTP retrying is twofold:

  1. NTP retries even on successful startup, since we are waiting for time sync
  2. NTP is a barrier to starting other services. If it fails, later services will not work.

smklein (Collaborator, Author) commented:

This definitely deserves a follow-up bug - it's only relevant in the "cold boot" case, but that does matter!

Here's the deal:

  • When we boot the sled agent, it's possible that not all disks have been parsed (e.g., suppose there's a U.2 that's slow to bind a driver).
  • When we call this function, we'll read the service ledger from the M.2s, see what services should be started, and try starting them.
  • If any of those services...
    • ... have datasets in the U.2s, or
    • ... have zone filesystems in the U.2s
    ... then we'd fail to start them.

But that doesn't mean we should never launch the service - if it's in the M.2 service ledger, either RSS or Nexus provisioned that service, so we should keep giving it a shot. "The driver did bind, but it just took a while" and "The U.2 was unplugged, but someone put it back" are both cases where we should be able to launch these services, even if we might initially fail.

If, after a long enough period of time, we determine that we cannot launch that service (if the U.2 is detached, fails, etc.), then we'd have an opportunity to do some notification through the fault tolerance system (we either tell Nexus that the service isn't booting, or Nexus notices on its own somehow).
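
A minimal sketch of that "keep giving it a shot" behavior, with hypothetical names throughout (try_launch_service, launch_with_retry); the sled agent's real retry and fault-reporting machinery is not shown here.

```rust
// Illustrative only: try_launch_service and launch_with_retry are hypothetical;
// the actual sled agent retry and Nexus-notification paths are not shown.

use std::thread::sleep;
use std::time::Duration;

/// Stand-in for "try to launch one ledgered service", which can fail while
/// the U.2 holding its dataset has not yet attached.
fn try_launch_service(name: &str) -> Result<(), String> {
    Err(format!("dataset for {name} not yet available"))
}

/// Keep giving the service a shot rather than giving up at cold boot; after
/// `max_attempts` we would instead surface a fault (e.g., notify Nexus).
fn launch_with_retry(name: &str, max_attempts: u32) -> Result<(), String> {
    let mut delay = Duration::from_secs(1);
    for attempt in 1..=max_attempts {
        match try_launch_service(name) {
            Ok(()) => return Ok(()),
            Err(err) if attempt < max_attempts => {
                eprintln!("attempt {attempt} failed ({err}); retrying");
                sleep(delay);
                // Exponential backoff, capped so we keep probing periodically.
                delay = (delay * 2).min(Duration::from_secs(60));
            }
            Err(err) => return Err(err),
        }
    }
    Err(format!("{name}: gave up after {max_attempts} attempts"))
}

fn main() {
    if let Err(err) = launch_with_retry("crucible", 3) {
        eprintln!("would report to Nexus: {err}");
    }
}
```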

smklein (Collaborator, Author) commented:

Filed #2973, will mention it in this PR

smklein (Collaborator, Author) commented:

Mentioned in ed20fff

andrewjstone (Contributor) commented:

Thanks for the explanation. Those justifications make sense to me.

pub all_svcs_config_path: PathBuf,
// The path for the ServiceManager to store information about
// all running services.
all_svcs_ledger_path: PathBuf,
andrewjstone (Contributor) commented:

Why the change from config to ledger here?

Also, if we are going to change the names, we may also want to change the constants to refer to LEDGER instead of CONFIG.

smklein (Collaborator, Author) commented:

I think I got confused by the following overloaded uses of the term "config":

  • We use "config" to describe parameters for a variety of modules within the sled agent - tweaks on how they should be executed. For example, the sled agent has a "config", the bootstrap agent has a "config", and the storage manager also has a "config".
  • Totally separately, we're storing the list of services that the sled manages. I had previously called this "config", but I think the name was not totally accurate, so I'm updating it to "ledger". After all, it's just a ledger of "what am I responsible for running".

RE: the constants, will do!

smklein (Collaborator, Author) commented:

Done in 5d59951

@smklein marked this pull request as ready for review May 1, 2023 16:49
@andrewjstone (Contributor) left a comment

Thanks for the explanations and cleanup. Ship it!

@smklein merged commit ccc28fe into main May 1, 2023
@smklein deleted the storage-manager-cleanup branch May 1, 2023 18:16
Labels: Sled Agent (Related to the Per-Sled Configuration and Management), storage (Related to storage).