Sled Agent uses implicit interfaces with components it provisions #3407

When Sled Agent sets up the various control plane zones, it makes assumptions about the software running in those zones.  These assumptions look easy to break by accident.

## The problem

One example: [Sled Agent writes out the "deployment" section of the Nexus config file.](https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/services.rs#L1318-L1388)  There's an implicit interface here.  Suppose someone wants to change the format of the Nexus config file.  They also need to update Sled Agent to write the updated format.  But that's not enough: an old Sled Agent could be asked to provision a new Nexus zone, or a new Sled Agent could be asked to provision an old Nexus zone.  In both cases, it'd likely write out a config file that Nexus won't be able to parse.

Another example: [Sled Agent runs `cockroach` commands inside the CockroachDB zone.](https://github.com/oxidecomputer/omicron/blob/351c47edc7ebdebc943f3fd60420023a6e9adc93/sled-agent/src/services.rs#L1055-L1065)  This assumes that the zone delivers an executable called `cockroach` at that path, that it will do the right thing in whatever process environment this ends up running in, that it supports the arguments provided, etc.  If we want to change any of this, we have the same problem as above: an old Sled Agent can't provision a new CockroachDB zone and vice versa.  I don't think this is purely theoretical: presumably before long we will want to remove the `--insecure` flag from these command lines, and we won't really be able to do that without a fleet-wide Sled Agent/Nexus flag day.

More generally, any time Sled Agent either executes commands in the zone or writes files into the zone (including the SMF profile), it's interfacing with software that may be delivered separately and at runtime could be older or newer than the Sled Agent itself.

## What do we do?

I'm not sure.

I think we probably need some stable interface for Sled Agent to expose metadata to Nexus.  Whether that's a file, SMF properties, or an HTTP interface similar to what other clouds provide in their metadata service -- I don't know.  If we can give this interface a name and version, then for example when we package Nexus it can say something like "I depend on a Sled Agent that provides version X of this API".  Then we can avoid trying to deploy Nexuses to Sled Agents that can't run them.
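To make the "name and version" idea concrete, here is a minimal sketch of the compatibility check a deployment planner could run before placing a zone on a sled.  Everything in it is hypothetical: the type names, the `zone-metadata` interface name, and the semver-like rule (same major, provided minor at least the required minor) are assumptions for illustration, not anything that exists in omicron today.

```rust
/// Hypothetical version of a named Sled Agent interface.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct InterfaceVersion {
    pub major: u32,
    pub minor: u32,
}

/// What a zone image (e.g. Nexus) would declare at packaging time:
/// "I depend on a Sled Agent that provides version X of this API".
#[derive(Debug, Clone)]
pub struct ZoneRequirement {
    pub interface: &'static str,
    pub min_version: InterfaceVersion,
}

/// What a particular Sled Agent build would advertise.
#[derive(Debug, Clone)]
pub struct SledAgentCapability {
    pub interface: &'static str,
    pub version: InterfaceVersion,
}

/// Returns true if this Sled Agent can provision the zone.
///
/// Assumed rule: the interface names match, the major versions are equal
/// (a major bump is a breaking change), and the Sled Agent's minor version
/// is at least the one the zone was packaged against.
pub fn can_provision(provided: &SledAgentCapability, required: &ZoneRequirement) -> bool {
    provided.interface == required.interface
        && provided.version.major == required.min_version.major
        && provided.version.minor >= required.min_version.minor
}
```

With something like this, a Sled Agent advertising `zone-metadata` 1.2 would accept a Nexus packaged against 1.1 but be skipped for one packaged against 1.3 or 2.0, instead of writing out a config file that Nexus can't parse.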

As for running commands inside the zone: one option is to minimize the surface of that interface and consider it immutable.  Per @jclulow, having the CockroachDB zone deliver a `/opt/oxide/bin/init_database` that takes no arguments is a lot more stable than running individual `cockroach` commands.  We might be able to avoid even this by instead exposing metadata (with the above mechanism), like "please initialize the database" and then having a service in the zone be responsible for the action itself.  This is still an implicit interface, but I think it's much narrower and better defined than running an arbitrary command.
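As a rough sketch of the metadata-driven variant: the service inside the zone could read a small, closed set of requested actions from whatever metadata Sled Agent exposes and decide for itself how to carry them out.  The `requested_action` key, the `initialize_database` value, and the types below are all made up for illustration; the point is only that the zone sees an intent and never a command line.

```rust
use std::collections::BTreeMap;

/// Hypothetical closed set of actions Sled Agent may request via metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RequestedAction {
    InitializeDatabase,
}

/// Interpret the (assumed) `requested_action` metadata key.
///
/// - No key present: nothing was requested.
/// - A known value: the corresponding action.
/// - Anything else: an explicit error, so version skew shows up as a clear
///   failure rather than a half-understood command.
pub fn requested_action(
    metadata: &BTreeMap<String, String>,
) -> Result<Option<RequestedAction>, String> {
    match metadata.get("requested_action").map(String::as_str) {
        None => Ok(None),
        Some("initialize_database") => Ok(Some(RequestedAction::InitializeDatabase)),
        Some(other) => Err(format!("unsupported requested_action: {other}")),
    }
}
```

How `InitializeDatabase` is actually performed (which `cockroach` subcommands, which flags) would then live entirely inside the zone image, so it can change without touching Sled Agent.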
