[nexus] Use cockroachdb range stats in reconfigurator planner #8441
@@ -1227,20 +1228,27 @@ impl<'a> Planner<'a> {
.map(|z| (z.id, z.image_source.clone()))
.collect::<BTreeMap<_, _>>();
for &sled_id in &sleds {
if !self
let zones_currently_updating = self
I really only changed this to make visibility in tests better, but it helps when you "forget to update inventory after editing a blueprint".
// <https://github.com/oxidecomputer/omicron/issues/6404>
// ZoneKind::CockroachDb => todo!("check cluster status in inventory"),
ZoneKind::CockroachDb => {
debug!(self.log, "Checking if Cockroach node can shut down");
I'd really like to return more structured "why we cannot update" results here, but using debug messages in the meantime.
This feels like a very doable follow-up; I'd also like to test we're getting sufficient coverage here.
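As a rough illustration of what a structured "why we cannot update" result could look like, here is a minimal sketch. It is not the code in this PR: `CockroachNodeHealth`, `UpdateBlocker`, `check_node`, and the field names are all hypothetical, and the rule that a missing value blocks the update is an assumption.

```rust
use std::fmt;

/// Hypothetical per-node health summary; the field names are illustrative
/// and not taken from the inventory types in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CockroachNodeHealth {
    pub ranges_underreplicated: Option<u64>,
    pub live_nodes: Option<u64>,
}

/// Hypothetical structured reason for refusing to update a zone.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum UpdateBlocker {
    /// The node did not report one of the values we need (assumed to block).
    MissingStatus,
    /// The node reported a nonzero number of underreplicated ranges.
    UnderreplicatedRanges { count: u64 },
    /// The node sees fewer live nodes than the required redundancy.
    InsufficientLiveNodes { live: u64, required: u64 },
}

impl fmt::Display for UpdateBlocker {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            UpdateBlocker::MissingStatus => {
                write!(f, "node status is incomplete")
            }
            UpdateBlocker::UnderreplicatedRanges { count } => {
                write!(f, "{count} underreplicated ranges")
            }
            UpdateBlocker::InsufficientLiveNodes { live, required } => {
                write!(f, "{live} live nodes, need {required}")
            }
        }
    }
}

/// Returns `Ok(())` if this node raises no objection to an update,
/// otherwise the first reason found.
pub fn check_node(
    health: &CockroachNodeHealth,
    required_live_nodes: u64,
) -> Result<(), UpdateBlocker> {
    let (Some(underreplicated), Some(live)) =
        (health.ranges_underreplicated, health.live_nodes)
    else {
        return Err(UpdateBlocker::MissingStatus);
    };
    if underreplicated > 0 {
        return Err(UpdateBlocker::UnderreplicatedRanges {
            count: underreplicated,
        });
    }
    if live < required_live_nodes {
        return Err(UpdateBlocker::InsufficientLiveNodes {
            live,
            required: required_live_nodes,
        });
    }
    Ok(())
}
```

A typed result like this could be asserted on directly in tests rather than by inspecting debug logs, and the `Display` impl keeps log messages available in the meantime.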

// All nodes must report: "We have the necessary redundancy, and
// have observed no underreplicated ranges".
for (_node_id, status) in all_statuses {
Could include the node ID in the log messages (and later the more structured "why we can't update" results), so we know which node is blocking update?
I wonder if it's worthwhile to gather every reason, from every node, that we might be blocking an update, instead of only logging the first one?
Neither of these is at all urgent, and both would be fine to defer to #8284
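For reference, the "collect everything, keyed by node ID" idea might look roughly like the sketch below. It assumes a hypothetical `NodeStatus` type, string node IDs, and a hypothetical `collect_blockers` helper; none of these names come from the PR, and treating a missing value as a blocker is an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-node status; field names are illustrative only.
#[derive(Debug, Clone, Copy)]
pub struct NodeStatus {
    pub ranges_underreplicated: Option<u64>,
    pub live_nodes: Option<u64>,
}

/// Returns every (node ID, reason) pair that would block an update,
/// rather than stopping at the first one.
///
/// An empty result means no reporting node objected. An empty `statuses`
/// map also yields an empty result, so a caller would need to handle
/// "no statuses at all" separately.
pub fn collect_blockers(
    statuses: &BTreeMap<String, NodeStatus>,
    required_live_nodes: u64,
) -> Vec<(String, String)> {
    let mut blockers = Vec::new();
    for (node_id, status) in statuses {
        // Check the underreplicated range count for this node.
        match status.ranges_underreplicated {
            None => blockers.push((
                node_id.clone(),
                "underreplicated range count missing".to_string(),
            )),
            Some(0) => {}
            Some(n) => blockers.push((
                node_id.clone(),
                format!("{n} underreplicated ranges"),
            )),
        }
        // Check the number of live nodes this node can see.
        match status.live_nodes {
            None => blockers.push((
                node_id.clone(),
                "live node count missing".to_string(),
            )),
            Some(live) if live < required_live_nodes => blockers.push((
                node_id.clone(),
                format!("{live} live nodes, need {required_live_nodes}"),
            )),
            Some(_) => {}
        }
    }
    blockers
}
```

Because the map is a `BTreeMap`, the blockers come back in node ID order, which keeps log output and test assertions deterministic.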
@@ -5365,6 +5404,302 @@ pub(crate) mod test {
logctx.cleanup_successful();
}

#[test]
fn test_update_cockroach() {
Nice test 👍
Updates the reconfigurator to evaluate CockroachDB cluster health before upgrading zones.
Only updates zones if:
Builds on #8379 and #8426