@smklein
Collaborator

@smklein smklein commented Feb 7, 2024

Resolves #5002

This change attempts to make zone setup simpler and more independent. Specifically, if a particular zone cannot start because of NTP timesync, an internal DNS lookup, or a missing disk, it should be able to fail without necessarily preventing the other zones from initializing.
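
As an illustration of the idea, here is a minimal sketch (not the code in this PR) of starting zones so that each one succeeds or fails on its own. `ZoneRequest`, `start_zone`, and `start_all` are hypothetical names, the two boolean preconditions stand in for the real timesync and disk checks, and std threads stand in for whatever concurrency mechanism sled-agent actually uses.

```rust
use std::collections::BTreeMap;
use std::thread;

/// Hypothetical stand-in for a zone request; not the sled-agent type.
#[derive(Clone, Debug)]
struct ZoneRequest {
    name: String,
    // Simulated preconditions (assumptions for this sketch only).
    time_synced: bool,
    disk_present: bool,
}

/// Hypothetical per-zone startup: it fails only on this zone's own preconditions.
fn start_zone(req: &ZoneRequest) -> Result<String, String> {
    if !req.time_synced {
        return Err(format!("{}: time not yet synchronized", req.name));
    }
    if !req.disk_present {
        return Err(format!("{}: backing disk missing", req.name));
    }
    Ok(format!("{} running", req.name))
}

/// Start every zone concurrently and collect one outcome per zone.
/// A failure in one zone never short-circuits the others.
fn start_all(requests: &[ZoneRequest]) -> BTreeMap<String, Result<String, String>> {
    thread::scope(|s| {
        // Spawn all zones first so they run concurrently.
        let handles: Vec<_> = requests
            .iter()
            .map(|req| (req.name.clone(), s.spawn(move || start_zone(req))))
            .collect();

        // Then join each one, turning a panic into an ordinary per-zone error.
        let mut outcomes = BTreeMap::new();
        for (name, handle) in handles {
            let outcome = match handle.join() {
                Ok(result) => result,
                Err(_) => Err(format!("{name}: startup panicked")),
            };
            outcomes.insert(name, outcome);
        }
        outcomes
    })
}
```

A caller can retry only the zones whose entry is `Err`, leaving the zones that did start untouched.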

@smklein smklein changed the title from "[sled-agent] Check zone validity concurrently" to "[sled-agent] Initialize zones more independently of each other" Feb 7, 2024
@smklein smklein marked this pull request as ready for review February 7, 2024 18:47
Contributor

@jgallagher jgallagher left a comment

These changes LGTM; all my comments are ignorable nitpicks in the interest of urgency.

Should we run this through RSS and cold boot on madrid to confirm we didn't miss something subtle?

Contributor

@andrewjstone andrewjstone left a comment

These changes look good to me. All the behavior seems to track with retries happening from within load_services rather than nested just for timesync. Nice work @smklein!

omicron_generation,
ledger_generation,
zones,
};
Contributor

Right below here is where we should check if anything has changed before committing the ledger. That should fix #5014

This does not have to be done in this PR. I just wanted to point it out for future reference.
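
For future reference, a minimal sketch of that kind of check, assuming a simplified ledger. The field names come from the snippet above, but their types, the `ZoneLedger` and `Ledger` structs, and `commit_if_changed` are hypothetical and are not the sled-agent API.

```rust
/// Hypothetical, simplified ledger contents; the field types are assumptions.
#[derive(Clone, Debug, PartialEq, Eq)]
struct ZoneLedger {
    omicron_generation: u64,
    ledger_generation: u64,
    zones: Vec<String>,
}

/// Hypothetical in-memory ledger that counts how many commits it performed.
struct Ledger {
    current: ZoneLedger,
    commits: usize,
}

impl Ledger {
    /// Commit `new` only if it differs from what is already stored.
    /// Returns true when a commit actually happened.
    fn commit_if_changed(&mut self, new: ZoneLedger) -> bool {
        if self.current == new {
            // Nothing changed, so skip the write entirely.
            return false;
        }
        self.current = new;
        self.commits += 1;
        true
    }
}
```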

// etc, that NTP and the internal DNS system it depends on MUST be
// initialized prior to other zones.
// Destroy zones that should not be running
for zone in zones_to_be_removed {
Contributor

It's possible bundling could take a while. Can we do this concurrently as well? I think for this to work we'd have to move the removal from the existing_zones into this method because it's a locked mutex guard which should not be Sync.

Looking at ZoneBundler::create, we take a lock, but I don't think that lock is actually required anymore, as all the storage-related stuff uses message passing and the heavy call to create at the bottom that does all the work operates on local copies of the parameters.

Even if we don't decide to make this concurrent, we should probably go ahead and remove that lock call in ZoneBundler::create unless there is something I'm missing, which is of course possible. I'm happy to do that in a separate PR if desired.
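
To make the suggestion concrete, here is a rough sketch of the shape it could take: drop the entries from the shared map while holding the lock, release the guard, then run the slow per-zone work concurrently. Everything here is an assumption for illustration. `remove_zones` and `bundle_and_destroy` are hypothetical names, the map's `u32` value is a placeholder for real zone state, and std threads with a std `Mutex` stand in for whatever the real code uses.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;
use std::thread;

/// Hypothetical slow per-zone cleanup (stands in for bundling plus teardown).
fn bundle_and_destroy(name: &str) -> String {
    format!("bundled and destroyed {name}")
}

/// Remove the named zones from the shared map under the lock, then run the
/// slow cleanup concurrently with no lock held.
fn remove_zones(existing: &Mutex<BTreeMap<String, u32>>, to_remove: &[&str]) -> Vec<String> {
    let mut removed: Vec<String> = Vec::new();
    {
        let mut guard = existing.lock().unwrap();
        for name in to_remove {
            // Only zones that were actually present need cleanup.
            if guard.remove(*name).is_some() {
                removed.push((*name).to_string());
            }
        }
    } // guard dropped here, before any slow work starts

    thread::scope(|s| {
        // Spawn every cleanup first so they overlap in time.
        let handles: Vec<_> = removed
            .iter()
            .map(|name| s.spawn(move || bundle_and_destroy(name)))
            .collect();

        // Join in spawn order so results line up with `removed`.
        let mut results = Vec::new();
        for handle in handles {
            results.push(handle.join().unwrap());
        }
        results
    })
}
```

Because the guard never crosses a thread boundary in this arrangement, the question of whether the guard can be shared between threads does not come up.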

Collaborator Author

Agreed, tracking in #5024

@andrewjstone
Contributor

> Should we run this through RSS and cold boot on madrid to confirm we didn't miss something subtle?

I'll give this a run on the testbed as a sanity check. I'll leave it up to you guys to decide if it should also run on madrid.

@andrewjstone
Contributor

I ran this branch through RSS and all zones came up as expected. I also coldbooted sled g0 and all zones relaunched correctly.

@augustuswm
Contributor

When this is ready I'm available to get the build on to dogfood, and then the colo.

Development

Successfully merging this pull request may close these issues.

Sled agent fails to get past ensuring ledgered zones when a dataset goes missing

5 participants