OCPBUGS-57444: set appropriate rolling update settings #789
Conversation
@tjungblu: This pull request references Jira Issue OCPBUGS-57444, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough: Updates the OAuth OpenShift deployment manifest and controller: replicas now use the control plane count, and the rolling update parameters (MaxUnavailable, MaxSurge) are computed from the control plane size. Adds a unit test verifying the rolling update calculations.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 0
🧹 Nitpick comments (2)
bindata/oauth-openshift/deployment.yaml (1)
10-13: Note: Static rolling update values will be overridden by the controller.
These hardcoded values (maxUnavailable: 1, maxSurge: 0) will be dynamically overridden by the controller based on the control plane node count. Consider adding a comment to indicate that these are placeholder values.

pkg/controllers/deployment/deployment_controller.go (1)
268-269: Consider extracting the rolling update calculation for testability.
The inline calculation logic could benefit from being extracted into a separate function for easier unit testing and potential reuse:
```diff
+// calculateRollingUpdateStrategy determines optimal rolling update parameters based on control plane count
+func calculateRollingUpdateStrategy(controlPlaneCount int32) (maxUnavailable, maxSurge intstr.IntOrString) {
+	maxUnavailable = intstr.FromInt32(max(controlPlaneCount-1, 1))
+	maxSurge = intstr.FromInt32(controlPlaneCount)
+	return maxUnavailable, maxSurge
+}
+
 // Given the control plane sizes, we adjust the max unavailable and max surge values to mimic "MinAvailable".
 // We always ensure it is controlPlaneCount - 1, as this allows us to keep have at least a single replica running.
 // We also set MaxSurge to always be exactly the control plane count, as this allows us to more quickly replace failing
 // deployments with a new replica set. This does not clash with the pod anti affinity set above.
-maxUnavailable := intstr.FromInt32(max(*controlPlaneCount-1, 1))
-maxSurge := intstr.FromInt32(*controlPlaneCount)
+maxUnavailable, maxSurge := calculateRollingUpdateStrategy(*controlPlaneCount)
 expectedDeployment.Spec.Strategy.RollingUpdate.MaxUnavailable = &maxUnavailable
 expectedDeployment.Spec.Strategy.RollingUpdate.MaxSurge = &maxSurge
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- bindata/oauth-openshift/deployment.yaml (2 hunks)
- pkg/controllers/deployment/deployment_controller.go (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🔇 Additional comments (6)
bindata/oauth-openshift/deployment.yaml (2)
30-30: LGTM! Consistent terminology update to control-plane.
The node selector change from master to control-plane aligns with Kubernetes' standardized terminology and matches the controller changes.
42-44: LGTM! Toleration properly updated for control-plane nodes.
The toleration key change ensures OAuth pods can be scheduled on control-plane nodes with the NoSchedule taint.
pkg/controllers/deployment/deployment_controller.go (4)
14-14: LGTM! Required import for the IntOrString type.
The intstr package import is correctly added to support the new dynamic MaxUnavailable and MaxSurge configuration.
257-263: LGTM! Clear transition from master to control-plane terminology.
The variable naming and error message properly reflect the shift to control-plane nodes, maintaining consistency with the deployment manifest changes.
264-272: Solid rolling update strategy implementation with appropriate safeguards.
The implementation correctly:
- ensures at least one replica remains available (max(controlPlaneCount-1, 1))
- allows faster rollouts with maxSurge = controlPlaneCount
- won't conflict with the pod anti-affinity rules
The comments clearly explain the "MinAvailable" mimicking behavior (see the sketch below).
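For illustration only, here is a minimal, self-contained Go sketch of the arithmetic described above; rollingUpdateFor is a hypothetical helper, not the operator's actual function, and the real controller wraps these values in intstr.IntOrString:

```go
package main

import "fmt"

// rollingUpdateFor mirrors the strategy described above:
// MaxUnavailable = max(controlPlaneCount-1, 1), MaxSurge = controlPlaneCount.
func rollingUpdateFor(controlPlaneCount int32) (maxUnavailable, maxSurge int32) {
	maxUnavailable = controlPlaneCount - 1
	if maxUnavailable < 1 {
		maxUnavailable = 1
	}
	return maxUnavailable, controlPlaneCount
}

func main() {
	for _, n := range []int32{1, 3, 5} {
		mu, ms := rollingUpdateFor(n)
		fmt.Printf("control plane nodes=%d -> maxUnavailable=%d, maxSurge=%d\n", n, mu, ms)
	}
}
```

For a standard three-node control plane this prints maxUnavailable=2 and maxSurge=3, which is what allows a rollout to proceed while one node is unhealthy.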
268-268: Confirmed — Go 1.24 declared; the builtin max is supported.
go.mod declares "go 1.24.0", so the use of max at pkg/controllers/deployment/deployment_controller.go:268 is valid; no change required.
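As a side note on that claim: Go 1.21 and later ship a builtin generic max, so a snippet like the following compiles with a go 1.24 toolchain (a standalone sketch, unrelated to the operator's code):

```go
package main

import "fmt"

func main() {
	var controlPlaneCount int32 = 3
	// Builtin generic max (Go 1.21+); the untyped constant 1 adopts int32 here.
	fmt.Println(max(controlPlaneCount-1, 1)) // prints 2
}
```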
/jira refresh
@tjungblu: This pull request references Jira Issue OCPBUGS-57444, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@tjungblu: This pull request references Jira Issue OCPBUGS-57444, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/retest-required
```diff
       topologyKey: kubernetes.io/hostname
     tolerations:
-    - key: node-role.kubernetes.io/master
+    - key: node-role.kubernetes.io/control-plane
```
seems the taints are not yet there:
```
message: '0/6 nodes are available: 3 node(s) didn''t match Pod''s node affinity/selector,
  3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption:
  0/6 nodes are available: 6 Preemption is not helpful for scheduling.'
```
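For context, here is a hedged sketch of what tolerating both the legacy and the new taint key looks like when expressed with the Kubernetes Go types; carrying both tolerations during the transition is an assumption for illustration, not necessarily what this PR ends up doing:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Tolerate both taint keys so the pod can schedule on control plane nodes
	// regardless of whether the cluster still carries the legacy "master" taint.
	tolerations := []corev1.Toleration{
		{Key: "node-role.kubernetes.io/control-plane", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoSchedule},
		{Key: "node-role.kubernetes.io/master", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoSchedule},
	}
	for _, t := range tolerations {
		fmt.Printf("tolerate taint %q with effect %s\n", t.Key, t.Effect)
	}
}
```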
Force-pushed from c1e805a to aeebe6a.
/lgtm

Can you add a unit test checking this behavior? /hold
Signed-off-by: Thomas Jungblut <[email protected]>
Force-pushed from aeebe6a to 0f3451c.
Thanks @tjungblu! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: liouk, tjungblu
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
/cherry-pick release-4.19 release-4.18 release-4.17 release-4.16
@tjungblu: once the present PR merges, I will cherry-pick it on top of … In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
shoot, we've already branched
/cherry-pick release-4.20 release-4.19 release-4.18 release-4.17 release-4.16
@tjungblu: once the present PR merges, I will cherry-pick it on top of … In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Actionable comments posted: 2
♻️ Duplicate comments (1)
pkg/controllers/deployment/deployment_controller.go (1)
262-264: Nil-deref guard added — good.
This addresses the prior review about a potential panic on nil controlPlaneCount.
🧹 Nitpick comments (4)
pkg/controllers/deployment/deployment_controller_test.go (2)
11-101: Good table-driven coverage; add a 0-node edge case.
Please include a test for controlPlaneCount = 0 to prevent accidental negative math and to document the desired behavior when no control plane nodes are selectable.
Apply this diff to extend the table:
{ name: "five control plane nodes", controlPlaneCount: 5, expectedMaxUnavailable: 4, // max(5-1, 1) = max(4, 1) = 4 expectedMaxSurge: 5, }, + { + name: "zero control plane nodes", + controlPlaneCount: 0, + expectedMaxUnavailable: 0, + expectedMaxSurge: 0, + },
52-67: Also cover nil RollingUpdate to guard against panics.
Add a test that passes a Deployment with Strategy.RollingUpdate == nil; the helper should initialize it.
Additional test (outside the shown hunk):
```go
func TestSetRollingUpdateParameters_AllowsNilRollingUpdate(t *testing.T) {
	deployment := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "test", Namespace: "ns"},
		Spec: appsv1.DeploymentSpec{
			Strategy: appsv1.DeploymentStrategy{
				Type:          appsv1.RollingUpdateDeploymentStrategyType,
				RollingUpdate: nil,
			},
		},
	}
	setRollingUpdateParameters(3, deployment)
	if deployment.Spec.Strategy.RollingUpdate == nil {
		t.Fatalf("RollingUpdate should be initialized")
	}
	if got := deployment.Spec.Strategy.RollingUpdate.MaxUnavailable.IntVal; got != 2 {
		t.Fatalf("MaxUnavailable = %d, want 2", got)
	}
	if got := deployment.Spec.Strategy.RollingUpdate.MaxSurge.IntVal; got != 3 {
		t.Fatalf("MaxSurge = %d, want 3", got)
	}
}
```

pkg/controllers/deployment/deployment_controller.go (2)
262-264: Tighten error wording.
Minor grammar fix for clarity.
```diff
-		return nil, false, append(errs, fmt.Errorf("found nil control plane nodes count"))
+		return nil, false, append(errs, fmt.Errorf("control plane node count is nil"))
```
321-325: Clarify comment to reflect the single-node caveat.
The current comment claims to "always keep at least a single replica running," which isn't true for a single-node control plane under required anti-affinity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (3)
- bindata/oauth-openshift/deployment.yaml (1 hunk)
- pkg/controllers/deployment/deployment_controller.go (3 hunks)
- pkg/controllers/deployment/deployment_controller_test.go (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (1)
- bindata/oauth-openshift/deployment.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cursor Bugbot
🔇 Additional comments (2)
pkg/controllers/deployment/deployment_controller_test.go (1)
11-49: LGTM overall.
Solid, readable assertions; types validated explicitly.
pkg/controllers/deployment/deployment_controller.go (1)
14-14: Import looks correct.
```diff
 	// Set the replica count to the number of control plane nodes.
 	controlPlaneCount, err := c.countNodes(expectedDeployment.Spec.Template.Spec.NodeSelector)
 	if err != nil {
-		return nil, false, append(errs, fmt.Errorf("failed to determine number of master nodes: %v", err))
+		return nil, false, append(errs, fmt.Errorf("failed to determine number of control plane nodes: %v", err))
 	}
-	expectedDeployment.Spec.Replicas = masterNodeCount
+	if controlPlaneCount == nil {
+		return nil, false, append(errs, fmt.Errorf("found nil control plane nodes count"))
+	}
+
+	expectedDeployment.Spec.Replicas = controlPlaneCount
+	setRollingUpdateParameters(*controlPlaneCount, expectedDeployment)
```
🧩 Analysis chain
Behavioral check: replicas tied to CP count — OK.
Replica assignment from countNodes and the nil-guard look correct. After fixing the helper, this should roll safely on N>=2 and degrade gracefully on N==1.
Run this to catch the undefined symbols before CI:
🏁 Script executed:
```bash
#!/bin/bash
# Verify no usage of intstr.FromInt32 and bare max(…) in Go sources
rg -nP --type=go '\bintstr\.FromInt32\s*\(' -C1
rg -nP --type=go '(?<![A-Za-z0-9_])max\s*\(' -C1
```

Length of output: 569
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "---- intstr.FromInt32 usages ----"
rg -n --type=go 'intstr\.FromInt32' -C1 || true
echo "---- occurrences of max(...) in Go files ----"
rg -n --type=go '\bmax\s*\(' -C3 || true
echo "---- definitions of func max(...) ----"
rg -n --type=go 'func\s+max\s*\(' -C3 || true
echo "---- show setRollingUpdateParameters context (lines ~310-340) ----"
sed -n '310,340p' pkg/controllers/deployment/deployment_controller.go || true
echo "---- show file imports (top 1-140 lines) ----"
sed -n '1,140p' pkg/controllers/deployment/deployment_controller.go || true
```

Length of output: 11363
Fix undefined 'max' in setRollingUpdateParameters
setRollingUpdateParameters calls max(controlPlaneCount-1, 1) but no max function is defined; add an int32 max helper (e.g. func max(a, b int32) int32 { if a > b { return a } return b }) or replace with an inline conditional.
Location: pkg/controllers/deployment/deployment_controller.go:325-329
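For reference, a self-contained sketch of the helper this comment suggests; it is renamed maxInt32 here to avoid shadowing the builtin, and it is only needed if the toolchain predates Go 1.21's builtin max:

```go
package main

import "fmt"

// maxInt32 returns the larger of two int32 values.
func maxInt32(a, b int32) int32 {
	if a > b {
		return a
	}
	return b
}

func main() {
	var controlPlaneCount int32 = 3
	fmt.Println(maxInt32(controlPlaneCount-1, 1)) // prints 2
}
```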
🤖 Prompt for AI Agents
In pkg/controllers/deployment/deployment_controller.go around lines 257 to 268,
setRollingUpdateParameters uses an undefined max(...) when computing
max(controlPlaneCount-1, 1); add a small int32 max helper (for example func
max(a, b int32) int32 { if a > b { return a } return b }) near the top of the
file or replace the call with an inline conditional (if controlPlaneCount-1 > 1
{ ... } else { ... }) so the expression compiles and correctly returns the
larger int32 value.
```go
// Given the control plane sizes, we adjust the max unavailable and max surge values to mimic "MinAvailable".
// We always ensure it is controlPlaneCount - 1, as this allows us to keep have at least a single replica running.
// We also set MaxSurge to always be exactly the control plane count, as this allows us to more quickly replace failing
// deployments with a new replica set. This does not clash with the pod anti affinity set above.
func setRollingUpdateParameters(controlPlaneCount int32, deployment *appsv1.Deployment) {
	maxUnavailable := intstr.FromInt32(max(controlPlaneCount-1, 1))
	maxSurge := intstr.FromInt32(controlPlaneCount)
	deployment.Spec.Strategy.RollingUpdate.MaxUnavailable = &maxUnavailable
	deployment.Spec.Strategy.RollingUpdate.MaxSurge = &maxSurge
}
```
Fix compile-time errors and add defensive init.
max is undefined in this file, and intstr.FromInt32 does not exist (use FromInt). Also guard against a nil RollingUpdate to avoid panics. Clamp the 0-node case to 0 unavailability.
Apply this diff:
```diff
-// Given the control plane sizes, we adjust the max unavailable and max surge values to mimic "MinAvailable".
-// We always ensure it is controlPlaneCount - 1, as this allows us to keep have at least a single replica running.
-// We also set MaxSurge to always be exactly the control plane count, as this allows us to more quickly replace failing
-// deployments with a new replica set. This does not clash with the pod anti affinity set above.
+// Given the control plane size, adjust MaxUnavailable and MaxSurge to approximate a "MinAvailable"-like policy.
+// When controlPlaneCount > 1, we keep at least one replica running (MaxUnavailable = controlPlaneCount - 1).
+// On single-node control planes, anti-affinity may cause a brief gap during updates.
+// MaxSurge is set to controlPlaneCount to speed up replacement; anti-affinity prevents co-location.
 func setRollingUpdateParameters(controlPlaneCount int32, deployment *appsv1.Deployment) {
-	maxUnavailable := intstr.FromInt32(max(controlPlaneCount-1, 1))
-	maxSurge := intstr.FromInt32(controlPlaneCount)
-	deployment.Spec.Strategy.RollingUpdate.MaxUnavailable = &maxUnavailable
-	deployment.Spec.Strategy.RollingUpdate.MaxSurge = &maxSurge
+	if deployment.Spec.Strategy.RollingUpdate == nil {
+		deployment.Spec.Strategy.Type = appsv1.RollingUpdateDeploymentStrategyType
+		deployment.Spec.Strategy.RollingUpdate = &appsv1.RollingUpdateDeployment{}
+	}
+	var mu int32
+	switch {
+	case controlPlaneCount <= 0:
+		mu = 0
+	case controlPlaneCount == 1:
+		mu = 1
+	default:
+		mu = controlPlaneCount - 1
+	}
+	maxUnavailable := intstr.FromInt(int(mu))
+	maxSurge := intstr.FromInt(int(controlPlaneCount))
+	deployment.Spec.Strategy.RollingUpdate.MaxUnavailable = &maxUnavailable
+	deployment.Spec.Strategy.RollingUpdate.MaxSurge = &maxSurge
 }
```

🤖 Prompt for AI Agents
In pkg/controllers/deployment/deployment_controller.go around lines 321-330,
replace the use of the undefined max and non-existent intstr.FromInt32 and add a
defensive init: ensure deployment.Spec.Strategy.RollingUpdate is non-nil
(allocate a new RollingUpdate if nil), compute maxUnavailableInt :=
int(controlPlaneCount-1) and if maxUnavailableInt < 0 set it to 0 (clamping the
0-node case), set maxSurgeInt := int(controlPlaneCount), then use
intstr.FromInt(maxUnavailableInt) and intstr.FromInt(maxSurgeInt) to create
values and assign them to deployment.Spec.Strategy.RollingUpdate.MaxUnavailable
and .MaxSurge respectively.
@tjungblu: The following test failed, say …
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@xingxingxia are you able to verify this please?
Sure. I'd do a post-merge verification, since other tasks currently keep me from pre-merge testing several PRs.
@xingxingxia: This PR has been marked to be verified later by … In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Merged ed0d09e into openshift:master.
@tjungblu: Jira Issue OCPBUGS-57444: All pull requests linked via external trackers have merged. This pull request has the … In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@tjungblu: new pull request created: #791. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@tjungblu: new pull request created: #792. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Fix included in accepted release 4.21.0-0.nightly-2025-09-18-193221
This PR configures the rolling update parameters according to the size of the control plane. This also ensures that, in the event of an unhealthy node, the rolling update can still proceed.
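To make that concrete, here is a small hedged sketch of the resulting Deployment strategy for a three-node control plane, with values derived from the formula discussed above (illustrative only, not the operator's code):

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// For a three-node control plane: replicas = 3, maxUnavailable = max(3-1, 1) = 2, maxSurge = 3.
	replicas := int32(3)
	maxUnavailable := intstr.FromInt(2)
	maxSurge := intstr.FromInt(3)

	d := appsv1.Deployment{
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Strategy: appsv1.DeploymentStrategy{
				Type: appsv1.RollingUpdateDeploymentStrategyType,
				RollingUpdate: &appsv1.RollingUpdateDeployment{
					MaxUnavailable: &maxUnavailable,
					MaxSurge:       &maxSurge,
				},
			},
		},
	}
	fmt.Printf("replicas=%d maxUnavailable=%s maxSurge=%s\n",
		*d.Spec.Replicas,
		d.Spec.Strategy.RollingUpdate.MaxUnavailable.String(),
		d.Spec.Strategy.RollingUpdate.MaxSurge.String())
}
```

With these settings, even if one control plane node cannot schedule pods, the rollout is not blocked: up to two old replicas may be unavailable and up to three new ones may surge onto the remaining nodes.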