Skip to content

[nexus] Use cockroachdb range stats in reconfigurator planner #8441

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 70 commits into from
Jul 10, 2025

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Jun 24, 2025

Updates the reconfigurator to evaluate cockroachdb cluster health before upgrading zones

Only updates zones if:

  • Ranges underreplicated == 0, and
  • Live nodes == COCKROACH_REDUNDANCY
  • At least COCKROACH_REDUNDANCY nodes are reporting their status successfully

Builds on #8379 and #8426

@@ -1227,20 +1228,27 @@ impl<'a> Planner<'a> {
.map(|z| (z.id, z.image_source.clone()))
.collect::<BTreeMap<_, _>>();
for &sled_id in &sleds {
if !self
let zones_currently_updating = self
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really only changed this to make visibility in tests better, but it helps when you "forget to update inventory after editing a blueprint".

// <https://github.com/oxidecomputer/omicron/issues/6404>
// ZoneKind::CockroachDb => todo!("check cluster status in inventory"),
ZoneKind::CockroachDb => {
debug!(self.log, "Checking if Cockroach node can shut down");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to return more structured "why we cannot update" results here, but using debug messages in the meantime.

This feels like a very doable follow-up; I'd also like to test we're getting sufficient coverage here.

@smklein smklein marked this pull request as ready for review July 7, 2025 19:12
@smklein smklein requested a review from jgallagher July 8, 2025 19:34

// All nodes must report: "We have the necessary redundancy, and
// have observed no underreplicated ranges".
for (_node_id, status) in all_statuses {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could include the node ID in the log messages (and later the more structured "why we can't update" results), so we know which node is blocking update?

I wonder if it's worthwhile to gather all the ways from all the nodes we might be blocking an update, instead of only logging the first one?

Neither of these is at all urgent, and both would be fine to defer to #8284

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went ahead and updated the logs in 55e30a3 - but I'm going to defer to #8284 to make this structured, since that seems dependent on @plotnick 's work.

@@ -5365,6 +5404,302 @@ pub(crate) mod test {
logctx.cleanup_successful();
}

#[test]
fn test_update_cockroach() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test 👍

Base automatically changed from range-in-inventory to main July 10, 2025 19:26
smklein added a commit that referenced this pull request Jul 10, 2025
This PR directly follows the extraction from
#8379

It pulls two of these metrics into inventory, where they will be used by
the reconfigurator to decide if a Cockroach zone can be safely updated
in #8441.
@smklein smklein merged commit 33ecf8f into main Jul 10, 2025
17 checks passed
@smklein smklein deleted the use-range-in-reconfigurator branch July 10, 2025 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants