
Conversation

@smklein (Collaborator) commented Jul 22, 2024

This PR exposes an API from the Sled Agent which allows Nexus to configure datasets independently from Zones.

Here's an example subset of `zfs list -o name` on a deployed system, with some annotations inline:

```
# This is the pool of an arbitrary U.2
oxp_e12f29b8-1ab8-431e-bc96-1c1298947980

# Crucible has a dataset that isn't encrypted at the ZFS layer, because it's encrypted internally...
oxp_e12f29b8-1ab8-431e-bc96-1c1298947980/crucible
# ... and it contains a lot of region datasets.
oxp_e12f29b8-1ab8-431e-bc96-1c1298947980/crucible/regions/...

# We have a dataset which uses a trust-quorum-derived encryption key.
oxp_e12f29b8-1ab8-431e-bc96-1c1298947980/crypt
# Durable datasets (e.g., Cockroach's) can be stored in here.
oxp_e12f29b8-1ab8-431e-bc96-1c1298947980/crypt/cockroachdb
# The "debug" dataset has historically been created and managed by the Sled Agent.
oxp_e12f29b8-1ab8-431e-bc96-1c1298947980/crypt/debug
# Transient zone filesystems also exist here, and are encrypted.
oxp_e12f29b8-1ab8-431e-bc96-1c1298947980/crypt/zone
oxp_e12f29b8-1ab8-431e-bc96-1c1298947980/crypt/zone/oxz_cockroachdb_8bbea076-ff60-4330-8302-383e18140ef3
oxp_e12f29b8-1ab8-431e-bc96-1c1298947980/crypt/zone/oxz_crucible_a232eba2-e94f-4592-a5a6-ec23f9be3296
```

History

Prior to this PR, the sled agent exposed no interfaces to explicitly manage datasets on their own. Datasets could be created in one of two ways:

  1. Created and managed by the sled agent, without telling Nexus. See: the `debug` dataset.
  2. Created in response to requests from Nexus to create zones. See: `crucible`, `cockroachdb`, and the zone filesystems above.

These APIs offered little control over dataset usage, and provided no mechanism for setting quotas and reservations.

This PR

  • Expands Nexus' notion of "dataset kind" to include the following variants:
    • `zone_root`, for the `crypt/zone` dataset,
    • `zone`, for any dataset within `crypt/zone` (e.g., `crypt/zone/oxz_cockroachdb_8bbea076-ff60-4330-8302-383e18140ef3`),
    • `debug`, for the `crypt/debug` dataset.
  • Adds two endpoints to the Sled Agent, `datasets_put` and `datasets_get`, for setting a configuration of expected datasets. At the moment, `datasets_put` is purely additive, and does not remove any missing datasets.
    • This API provides a mechanism for Nexus to manage quotas and reservations, which it will do in the future.

This PR is related to #6167, which provides additional tooling through the inventory for inspecting dataset state on deployed sleds.

Fixes #6042, #6107
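Roughly speaking, the configuration handled by these endpoints pairs a Nexus-supplied generation number with the set of expected datasets and their properties. A hypothetical illustration follows; the field names below are guesses for exposition, not the actual `datasets_put` schema:

```json
{
  "generation": 2,
  "datasets": [
    {
      "id": "00000000-0000-0000-0000-000000000001",
      "name": {
        "pool_name": "oxp_e12f29b8-1ab8-431e-bc96-1c1298947980",
        "kind": "debug"
      },
      "quota": 1073741824,
      "reservation": null,
      "compression": "off"
    }
  ]
}
```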

```rust
    InternalDns,
}

impl DatasetKind {
```
@smklein (author):

A chunk of this implementation was moved from sled-storage/src/dataset.rs.

History lesson: We used to have types defined that would be usable by Nexus (common/src/api/internal), and types that were internal to the sled agent (sled-storage/src/dataset.rs) for defining dataset types/kinds.

This PR merges both concepts, but tweaks some names to avoid changing any schemas that are saved on M.2s.

@smklein smklein marked this pull request as ready for review July 30, 2024 18:12
@bnaecker (Collaborator) left a comment:

Overall this is great! I have a few questions and suggestions, but the structure looks quite straightforward to me. Thanks!

```diff
     Hash,
 )]
-#[serde(rename_all = "snake_case")]
+#[serde(tag = "type", rename_all = "snake_case")]
```
Collaborator:

Will adding this tag break any serialization? I'm thinking of code which attempts to deserialize these values, but which is not necessarily running against the same version of the code here.

Contributor:

I'm wondering if we should just always require tag in serde impls to ensure forward compatibility, in case any particular variant starts having associated data in the future. (The requirement to use a tag, if any variant has associated data, is imposed by the openapi tooling.)

@smklein (author):

I went down the route of "serialize/deserialize via string", as @sunshowers recommended later.

```diff
 let s = match self {
     Crucible => "crucible",
-    Cockroach => "cockroach",
+    Cockroach => "cockroachdb",
```
Collaborator:

When is the `Display` implementation used and when is a value serialized? I'm wondering why this variant is `"cockroachdb"` here but `"cockroach_db"` when serialized.

Member:

+1, I would have a slight preference for the `Display` and `Serialize` strings being identical...

Contributor:

+1 as well. For `ZoneKind`, which is related, we already have 4 different string representations X_X.

@smklein (author) commented Aug 7, 2024:

So my mishaps with naming here were intended to keep backwards compatibility between:

  • Names for cockroach we have in deployed systems (e.g., the `zfs list`-ed name is `cockroachdb`)
  • Names for cockroach we have stored on-disk in configuration files (e.g., grepping for `cockroach_db` in the schema directory of Omicron shows that it's used in, e.g., `all-zones-requests.json`, which is used for bootstrapping "what zones get auto-launched when we reboot")

To be clear, I think I misinterpreted this in my original PR.

I was attempting to merge `DatasetKind` (from this file) and `DatasetType` (from `sled-storage/src/dataset.rs`) without breaking backwards compatibility. It was in the `all-zones-requests.json` configuration file where I saw the underscored `cockroach_db` name and tried to keep compatibility.

HOWEVER, upon closer inspection, the `cockroach_db` name actually comes from `OmicronZoneType` in `nexus-sled-agent-shared/src/inventory.rs`, so I should be able to just stick with the `cockroachdb` name (no underscores) in this `DatasetKind` variant.

Updated in 93134c2 to just use `cockroachdb` in both Serialize and Display.
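The fix described above can be sketched as follows. This is a hypothetical, trimmed-down stand-in for the real `DatasetKind` (variant set and error type simplified), showing how routing `Display` and string serialization through a single mapping keeps them from drifting apart:

```rust
use std::fmt;
use std::str::FromStr;

// Hypothetical, simplified stand-in for the real DatasetKind.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DatasetKind {
    Crucible,
    Cockroach,
    Debug,
}

impl fmt::Display for DatasetKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Single source of truth for the string form; a string-based
        // Serialize impl can simply call to_string().
        let s = match self {
            DatasetKind::Crucible => "crucible",
            DatasetKind::Cockroach => "cockroachdb",
            DatasetKind::Debug => "debug",
        };
        write!(f, "{s}")
    }
}

impl FromStr for DatasetKind {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "crucible" => Ok(DatasetKind::Crucible),
            "cockroachdb" => Ok(DatasetKind::Cockroach),
            "debug" => Ok(DatasetKind::Debug),
            other => Err(format!("unknown dataset kind: {other}")),
        }
    }
}

fn main() {
    // Round-trip: Display and FromStr agree on every variant.
    for kind in [DatasetKind::Crucible, DatasetKind::Cockroach, DatasetKind::Debug] {
        assert_eq!(kind.to_string().parse::<DatasetKind>().unwrap(), kind);
    }
}
```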


```rust
// The "crypt" dataset needs these details, but should already exist
// by the time we're creating datasets inside.
let encryption_details = None;
```
Collaborator:

So None here does not mean "remove encryption"? Is there a way or a need to do that?

@smklein (author):

The `Zfs::ensure_filesystem` function probably needs some rework; I agree that this is confusing...

No, this isn't necessary, at least not for this PR. We only supply a value here for "encryption roots" (e.g., the crypt root) which is still automatically created by sled agents.

All our encrypted filesystems use dataset names within the `crypt/` dataset, and are implicitly encrypted.

See `DatasetName::full_name`, which places the dataset under `crypt/` if it should be encrypted.

I'll add some extra docs to `ensure_filesystem` in the meanwhile.
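The naming convention described above can be sketched as a small helper. Note this is a hypothetical illustration, not the actual `DatasetName::full_name` implementation or signature:

```rust
// Hypothetical sketch: an "encrypted" dataset gets a name under crypt/,
// where it inherits ZFS encryption from the crypt encryption root, so no
// per-dataset encryption details are needed at creation time.
fn full_dataset_name(pool: &str, kind: &str, encrypted: bool) -> String {
    if encrypted {
        format!("{pool}/crypt/{kind}")
    } else {
        // e.g. Crucible, which encrypts its data internally.
        format!("{pool}/{kind}")
    }
}

fn main() {
    let pool = "oxp_e12f29b8-1ab8-431e-bc96-1c1298947980";
    assert_eq!(
        full_dataset_name(pool, "cockroachdb", true),
        format!("{pool}/crypt/cockroachdb")
    );
    assert_eq!(full_dataset_name(pool, "crucible", false), format!("{pool}/crucible"));
}
```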

Comment on lines +1029 to +1032
```rust
// Ensure the dataset has a usable UUID.
if let Ok(id_str) = Zfs::get_oxide_value(&fs_name, "uuid") {
    if let Ok(id) = id_str.parse::<DatasetUuid>() {
        if id != config.id {
```
Collaborator:

How would this happen? We've asked to set a UUID, which succeeded previously, but somehow it didn't "take"? Related, is there a reason we don't just call `Zfs::set_oxide_value()` unconditionally?

@smklein (author):

I view it as more of a defensive guard against mismatched configuration, or against "an old device/dataset was imported, and we want to be careful not to auto-import and overwrite it without some explicit intervention".

I don't think this is likely, but the name for this dataset is basically something like:

```
oxp_<pool UUID>/crucible
```

This name is not that unique, and doesn't really give a unique identifier that signifies this control plane is managing the disk.

By adding (and checking!) the UUID here, we ensure not only "this is a crucible dataset", but also, "this is our crucible dataset, and not one configured by someone else".
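The defensive check described above can be sketched as a small decision function. This is a hypothetical helper for illustration, not the actual sled-agent code (which reads the UUID via ZFS properties, as in the snippet earlier):

```rust
// Hypothetical sketch of the defensive UUID guard: only adopt a dataset
// if it has no UUID yet, or its UUID matches what Nexus asked for.
#[derive(Debug, PartialEq)]
enum DatasetAction {
    AlreadyOurs,
    Adopt,
    RefuseForeign,
}

fn check_dataset(existing_uuid: Option<&str>, expected_uuid: &str) -> DatasetAction {
    match existing_uuid {
        // UUID matches: this really is "our" dataset.
        Some(id) if id == expected_uuid => DatasetAction::AlreadyOurs,
        // A UUID is present but differs: the dataset may have been
        // imported from elsewhere; don't silently overwrite it.
        Some(_) => DatasetAction::RefuseForeign,
        // No UUID yet: stamp it with ours.
        None => DatasetAction::Adopt,
    }
}

fn main() {
    assert_eq!(check_dataset(None, "abc"), DatasetAction::Adopt);
    assert_eq!(check_dataset(Some("abc"), "abc"), DatasetAction::AlreadyOurs);
    assert_eq!(check_dataset(Some("xyz"), "abc"), DatasetAction::RefuseForeign);
}
```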


@sunshowers (Contributor) left a comment:

Had a look at the db and reconfigurator bits, they generally look quite good. Thanks for doing this!

```rust
        self.generation > other.generation
    }

    // No need to do this, the generation number is provided externally.
```
Contributor:

Hmm, as someone unfamiliar with this code I'm not quite sure what it means here -- I see that there's a generation field. Do you mean it gets updated by whatever owns this config?

@smklein (author):

Yeah, the generation field exists in the config, so we don't actually need (nor want) to change anything in the implementation of `Ledgerable::generation_bump`.

The `Ledgerable` trait exists to help us write a file to many disks, and ensure that we'll read back the latest one, if it exists. Part of this is deciphering: if we failed halfway through a write, and we see two distinct ledgers, which should we use?

There are some configurations (basically, in the bootstore only now) that want to bump the generation number on every single write, so this just tracks "what's the latest thing that was written".

However, in this case -- and many others, for Nexus-supplied data -- we already have a generation from Nexus, and we don't need to have a duplicate generation number for writing things to disk. As long as the data is unmodified from Nexus -> Sled Agent -> Disk, we can safely re-use this field.
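The idea above can be sketched as follows. This is a hypothetical, simplified version of the trait and config for illustration (the real `Ledgerable` trait and `DatasetsConfig` type live in Omicron and differ in detail):

```rust
// Hypothetical, simplified Ledgerable-style trait.
trait Ledgerable {
    fn is_newer_than(&self, other: &Self) -> bool;
    fn generation_bump(&mut self);
}

struct DatasetsConfig {
    // Generation supplied by Nexus; the sled agent never modifies it.
    generation: u64,
}

impl Ledgerable for DatasetsConfig {
    fn is_newer_than(&self, other: &Self) -> bool {
        self.generation > other.generation
    }
    // No need to bump: the generation number is provided externally,
    // so writing the ledger must not invent a new one.
    fn generation_bump(&mut self) {}
}

fn main() {
    let mut newer = DatasetsConfig { generation: 2 };
    let older = DatasetsConfig { generation: 1 };
    assert!(newer.is_newer_than(&older));

    newer.generation_bump();
    // Nexus remains the source of truth: the generation is untouched.
    assert_eq!(newer.generation, 2);
}
```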

```rust
    Debug,
}

impl Serialize for DatasetKind {
```
Contributor:

Oh hmm, I think you'll also want to update the `JsonSchema` impl to match this.

@smklein (author):

Just to confirm -- what about the `JsonSchema` implementation should we update? It's currently being derived, and is used as part of internal APIs, but I don't think it needs to match the serialization and to/from-string implementations.

Contributor:

Updated the PR with a fixed JSON schema.

Basically, the issue was that JSON schemas are meant to be a description of the serialization format, and they were out of sync since we'd manually implemented `Serialize` and `Deserialize`. It was moot for us because we were using `replace`, but it's a lurking bug that could lead to issues down the road.

I've changed the schema to be just `string`, which is overly generic but at least correct. A more sophisticated impl would be something like `string` with a validation regex listing all the possibilities, but that seems like overkill given that we're using `replace` anyway.


smklein commented Aug 8, 2024

Thank you everyone for the feedback! I think I've addressed all comments so far.


smklein commented Aug 26, 2024

Hey all, I think I've addressed all feedback here, and rebased onto main as of 8/26.

If y'all have a moment, I'd appreciate another look -- I have a few PRs downstream of this, and want to make sure this is a foundation we approve of.

@papertigers (Contributor) left a comment:

The changes here make sense to me. Only one minor nit that I could go either way on.

@smklein smklein enabled auto-merge (squash) August 28, 2024 18:02
@smklein smklein merged commit 648507d into main Aug 28, 2024
@smklein smklein deleted the explicit-datasets branch August 28, 2024 18:27
smklein added a commit that referenced this pull request Sep 6, 2024
Provides visibility into deployed datasets, building atop
#6144.

Similar to Physical Disks (and Zpools), Datasets are observable in two
ways:

- The Sled Agent provides an API to manage their configuration ("this is
what is intended")
- The Sled Agent exposes information about datasets via inventory ("this
is what exists")

This PR implements the inventory aspect of datasets, to provide
visibility into the state of sled storage.

Additionally, this PR provides some omdb commands for inspection:

- `omdb sled-agent datasets list` has been added to show dataset
configuration.
- `omdb db inventory collections show` has been updated to emit disk,
zpool, and dataset info from inventory.

---------

Co-authored-by: Rain <[email protected]>


Development

Successfully merging this pull request may close these issues.

Sled Agent: API to manage datasets independently of zones

6 participants