Description
When Sled Agent sets up the various control plane zones, it makes assumptions about the software running in those zones. These assumptions are implicit, so it looks easy to break things by accident.
The problem
One example: Sled Agent writes out the "deployment" section of the Nexus config file. There's an implicit interface here. Suppose someone wants to change the format of the Nexus config file. They also need to update Sled Agent to write the updated format. But that's not enough: an old Sled Agent could be asked to provision a new Nexus zone, or a new Sled Agent could be asked to provision an old Nexus zone. In both cases, it'd likely write out a config file that Nexus won't be able to parse.
Another example: Sled Agent runs `cockroach` commands inside the CockroachDB zone. This assumes that the zone delivers an executable called `cockroach` at that path, that it will do the right thing in whatever process environment this ends up running in, that it supports the arguments provided, etc. If we want to change any of this, we have the same problem as above: an old Sled Agent can't provision a new CockroachDB zone and vice versa. I don't think this is that theoretical: at some point, presumably not that far from now, we will want to remove the `--insecure` flag from these command lines, and we won't really be able to do that without a fleet-wide Sled Agent/Nexus flag day.
More generally, any time Sled Agent either executes commands in the zone or writes files into the zone (including the SMF profile), it's interfacing with software that may be delivered separately and at runtime could be older or newer than the Sled Agent itself.
What do we do?
I'm not sure.
I think we probably need some stable interface for Sled Agent to expose metadata to Nexus. Whether that's a file, SMF properties, or an HTTP interface similar to what other clouds provide in their metadata service -- I don't know. If we can give this interface a name and version, then for example when we package Nexus it can say something like "I depend on a Sled Agent that provides version X of this API". Then we can avoid trying to deploy Nexuses to Sled Agents that can't run them.
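To make the "I depend on version X" idea concrete, here is a minimal Rust sketch of the placement check it would enable. Every name in it (`MetadataApiVersion`, `ZoneRequirement`, `can_provision`, `eligible_sleds`) is hypothetical and invented for illustration; nothing like this exists in the codebase today, and the real mechanism might well look different.

```rust
/// Hypothetical version number for the Sled Agent metadata interface.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct MetadataApiVersion(pub u32);

/// Hypothetical declaration packaged with a zone image (for example, Nexus):
/// "I depend on a Sled Agent that provides version X of this API".
#[derive(Debug, Clone)]
pub struct ZoneRequirement {
    pub zone_name: String,
    pub required: MetadataApiVersion,
}

/// Returns true if a Sled Agent that provides the `provided` versions
/// can host the given zone.
pub fn can_provision(provided: &[MetadataApiVersion], zone: &ZoneRequirement) -> bool {
    provided.contains(&zone.required)
}

/// Filters a list of (sled name, provided versions) pairs down to the sleds
/// that can host the zone, so we never try to deploy to one that cannot.
pub fn eligible_sleds<'a>(
    sleds: &'a [(String, Vec<MetadataApiVersion>)],
    zone: &ZoneRequirement,
) -> Vec<&'a str> {
    sleds
        .iter()
        .filter(|(_, versions)| can_provision(versions, zone))
        .map(|(name, _)| name.as_str())
        .collect()
}
```

In this sketch a Sled Agent can advertise several versions at once, which is what would let an updated Sled Agent keep provisioning older zones during a rolling upgrade.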
As for running commands inside the zone: one option is to minimize the surface of that interface and consider it immutable. Per @jclulow, having the CockroachDB zone deliver a `/opt/oxide/bin/init_database` that takes no arguments is a lot more stable than running individual `cockroach` commands. We might be able to avoid even this by instead exposing metadata (with the above mechanism), like "please initialize the database", and then having a service in the zone be responsible for the action itself. This is still an implicit interface, but I think it's much narrower and better defined than running an arbitrary command.
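As a rough sketch of the metadata-driven variant, the service inside the zone could accept a small, closed set of named requests and reject anything it does not recognize. The `ZoneAction` enum, the `UnknownAction` error, the `"initialize-database"` string, and `command_for` are all assumptions made up for this example; only the `/opt/oxide/bin/init_database` path comes from the suggestion above.

```rust
use std::str::FromStr;

/// Hypothetical closed set of actions that Sled Agent may request via metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneAction {
    InitializeDatabase,
}

/// Returned when the zone does not understand a request, so a version
/// mismatch shows up as an explicit error instead of a failed command.
#[derive(Debug, PartialEq, Eq)]
pub struct UnknownAction(pub String);

impl FromStr for ZoneAction {
    type Err = UnknownAction;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim() {
            // The request name is an assumption made for this sketch.
            "initialize-database" => Ok(ZoneAction::InitializeDatabase),
            other => Err(UnknownAction(other.to_string())),
        }
    }
}

/// The zone, not Sled Agent, decides how each action is carried out.
/// Here it maps to the argument-free program delivered by the zone.
pub fn command_for(action: ZoneAction) -> Vec<&'static str> {
    match action {
        ZoneAction::InitializeDatabase => vec!["/opt/oxide/bin/init_database"],
    }
}
```

The point of the design is that the specific `cockroach` arguments stay private to the zone: they can change from one zone image to the next without Sled Agent noticing, as long as the request name keeps its meaning.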