-
Notifications
You must be signed in to change notification settings - Fork 54
coordinated Nexus quiesce #9010
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is now ready for review. Note that there should be minimal risk / impact to existing systems by landing this PR because existing systems do not quiesce until after #8936. By the time we do that, I hope we'll have more complete testing in place (e.g., a successful live test). I need to put together a few more PRs for that. |
|
||
let db_nexus_ids: BTreeSet<_> = nexus_ids | ||
.iter() | ||
.cloned() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit - since UUIDs impl Copy
, I think this could be
.cloned() | |
.copied() |
(which is not different in terms of generated code but it's more obvious that this is cheap)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in e8eb204.
Ok(count) | ||
} | ||
|
||
/// Updates the "last_drained_blueprint_id" for the given Nexus id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong docstring on this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed in e8eb204.
} | ||
} | ||
|
||
pub async fn extra_datastore(&self, log: &Logger) -> Arc<DataStore> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this, as opposed to cloning the return value of datastore()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The datastore (really, the underlying pool) has in-memory state associated with being quiesced. In our test, we want independent instances of this so that they can be quiesced independently.
edit: I'll add a comment about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, fair enough. Probably worth at least a doc comment noting how this is different? I'm tempted to suggest a more descriptive name too, but I'm not sure what. independent_datastore()
maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment added in e8eb204.
/// Channel for TUF repository artifacts to be replicated out to sleds | ||
pub tuf_artifact_replication_rx: mpsc::Receiver<ArtifactsWithPlan>, | ||
/// Channel for exposing the latest loaded blueprint | ||
pub blueprint_load_tx: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just making sure I understand: we have this field now because in nexus/app/mod.rs
, we need to construct this channel to get a handle to the receiver before we've set up the background task system?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
.map_err(|e| public_error_from_diesel(e, ErrorHandler::Server)) | ||
} | ||
|
||
/// Updates the "last_drained_blueprint_id" for the given Nexus id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, should we update this to return a bool? It's always going to be 0 or 1 rows updated, right?
Alternatively - we could just return an error if the "count == 0", right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to return an error for the 0
case because that could just mean we've already updated it to this blueprint id and that's fine. The caller needs to retry errors, but not that case.
I can see the appeal of the bool. I want to log the value either way, which feels slightly more idiomatic at the caller level, so I'm going to leave it.
pub async fn database_nexus_access_update_quiesced( | ||
&self, | ||
nexus_id: OmicronZoneUuid, | ||
) -> Result<usize, Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here about returning a usize - since we're indexing on nexus_id
already, seems like we'll either perform the update successfully or not, and can return that more idiomatically than a usize
.
Implements coordinated Nexus quiesce per RFD 588.
This PR only changes the existing Nexus quiesce process to do what RFD 588 documents. There are several other things that need to happen for a successful handoff. I will create follow-on PRs for those.
Fixes #8859
Fixes #8857
Fixes #8796
Fixes #8795
Fixes #8971
Fixes #8858