-
Notifications
You must be signed in to change notification settings - Fork 62
[sled-agent] Initialize zones more independently of each other #5012
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jgallagher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes LGTM; all my comments are ignorable nitpicks in the interest of urgency.
Should we run this through RSS and cold boot on madrid to confirms we didn't miss something subtle?
andrewjstone
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes look good to me. All the behavior seems to track with retries happening from within load_services rather than nested just for timesync. Nice work @smklein!
| omicron_generation, | ||
| ledger_generation, | ||
| zones, | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right below here is where we should check if anything has changed before committing the ledger. That should fix #5014
This does not have to be done in this PR. I just wanted to point it out for future reference.
| // etc, that NTP and the internal DNS system it depends on MUST be | ||
| // initialized prior to other zones. | ||
| // Destroy zones that should not be running | ||
| for zone in zones_to_be_removed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible bundling could take a while. Can we do this concurrently as well? I think for this to work we'd have to move the removal from the existing_zones into this method because it's a locked mutex guard which should not be Sync.
Looking at ZoneBundler::create we take a lock, but I don't think that lock is actually required anymore as the all the storage related stuff uses message passing and the heavy call to create at the bottom that does all the work operates on local copies of the parameters.
Even if we don't decide to make this concurrent, we should probably go ahead and remove that lock call in ZoneBundle::create unless there is something I'm missing, which is of course possible. I'm happy to do that in a separate PR if desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, tracking in #5024
I'll give this a run on the testbed as a sanity check. I'll leave it up to you guys to decide if it should also run on madrid. |
I ran this branch through RSS and all zones came up as expected. I also coldbooted sled |
|
When this is ready I'm available to get the build on to dogfood, and then the colo. |
Resolves #5002
This change attempts to make zone setup simpler and more independent. Namely: If any particular zone cannot start, due to NTP timesync, internal DNS lookup, or a missing disk, it should be able to fail without necessarily preventing all other zones from initializing.