OCPBUGS-62634, OCPBUGS-62624: Fix DeploymentController Progressing #2034

jsafrane · 2025-10-10T09:22:23Z

The Progressing condition should be set only when the Deployment has new content and is actively rolling out its update. After all replicas are updated, the Progressing condition should be false, even when some pods are missing, e.g. because a node was drained or something evicted them.

Use the Deployment condition Progressing with reason NewReplicaSetAvailable to detect that the Deployment has been fully rolled out in the past.
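
The check described above could look roughly like the following. This is a minimal sketch, not the code from this pull request: the `Deployment` and `DeploymentCondition` structs are hypothetical stand-ins for the few `appsv1` fields involved, and `isProgressing` here is an illustrative function written for this example. It relies only on the documented Kubernetes behavior that a completed rollout leaves a `Progressing` condition with status `True` and reason `NewReplicaSetAvailable`.

```go
package main

import "fmt"

// DeploymentCondition is a hypothetical stand-in for the fields of
// appsv1.DeploymentCondition that matter for this sketch.
type DeploymentCondition struct {
	Type   string
	Status string
	Reason string
}

// Deployment is a hypothetical, trimmed-down view of a Deployment's status.
type Deployment struct {
	Replicas          int32
	AvailableReplicas int32
	Conditions        []DeploymentCondition
}

// isProgressing is an illustrative sketch. It returns false once the
// Deployment carries Progressing=True with reason NewReplicaSetAvailable,
// i.e. once it has been fully rolled out at some point in the past.
// The number of currently available replicas is deliberately ignored.
func isProgressing(d Deployment) bool {
	for _, c := range d.Conditions {
		if c.Type != "Progressing" {
			continue
		}
		rolledOut := c.Status == "True" && c.Reason == "NewReplicaSetAvailable"
		return !rolledOut
	}
	// No Progressing condition yet: assume the rollout has not finished.
	return true
}

func main() {
	// Fully rolled out, but two pods are missing (e.g. a node was drained).
	drained := Deployment{
		Replicas:          3,
		AvailableReplicas: 1,
		Conditions: []DeploymentCondition{
			{Type: "Progressing", Status: "True", Reason: "NewReplicaSetAvailable"},
		},
	}
	// A new ReplicaSet is still being rolled out.
	updating := Deployment{
		Replicas:          3,
		AvailableReplicas: 3,
		Conditions: []DeploymentCondition{
			{Type: "Progressing", Status: "True", Reason: "ReplicaSetUpdated"},
		},
	}
	fmt.Println("drained progressing:", isProgressing(drained))
	fmt.Println("updating progressing:", isProgressing(updating))
}
```

In this sketch the drained Deployment is not reported as progressing even though only one of three replicas is available, while the Deployment with a rollout in flight is.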

openshift-ci-robot · 2025-10-10T09:22:31Z

@jsafrane: This pull request references Jira Issue OCPBUGS-62634, which is invalid:

expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references Jira Issue OCPBUGS-62624, which is invalid:

expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The Progression condition should be set only when the Deployment has a new content and is actively rolling out its update. After all replicas are updated, the Progressing condition should be false, even when some pods are missing. E.g. because a node is drained, something evicted them or so on.

Use Deployment condition Progressing with reason NewReplicaSetAvailable to detect that Deployment has been fully rolled out in the past.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jsafrane · 2025-10-10T09:33:00Z

/jira refresh

openshift-ci-robot · 2025-10-10T09:33:09Z

@jsafrane: This pull request references Jira Issue OCPBUGS-62634, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.21.0) matches configured target version for branch (4.21.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

This pull request references Jira Issue OCPBUGS-62624, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.21.0) matches configured target version for branch (4.21.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

hongkailiu

The change looks reasonable to me.
The only concern is that it may suppress Progressing in other cases where we want to keep it.
If we can verify before merging that the pull solves the issue in the bugs, I think it is worth trying.
I also left a couple of nits and questions in the pull.

hongkailiu · 2025-10-10T14:49:23Z

pkg/operator/deploymentcontroller/deployment_controller.go

+	}
+	// Deployments that are fully deployed get a Progressing condition with Reason NewReplicaSetAvailable.
+	// https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment
+	// Any subsequent missing replicas (e.g. caused by a node reboot) must not change the Progressing condition.


From the Go doc on "NewReplicaSetAvailable", I cannot see "a node reboot" as a cause.

https://github.com/kubernetes/kubernetes/blob/ee1ff4866e30ac3685da3e007979b0e9ab7651a6/pkg/controller/deployment/util/deployment_util.go#L76-L77

But the Kubernetes docs linked above imply that "NewReplicaSetAvailable" indicates the Deployment rolled out successfully, even though new pods might still be on the way. That sounds good enough to me for not reporting Progressing=True on a CO.

jsafrane · 2025-10-13T13:20:50Z

The only concern is that it may suppress Progressing in other cases where we want to keep it.

My assumption is that any reconfiguration must lead to a re-deployment, i.e. the Deployment's spec.template changes, its generation increases, and so on.

If there is an OCP reconfiguration that should lead to Progressing=true while the Deployment pods stay as they are, another controller must catch it, not DeploymentController. It's very common to run many controllers in a library-go style operator.
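
To illustrate the assumption above, here is a minimal sketch of how a spec change shows up to a controller: a bumped `metadata.generation` that the Deployment controller has not yet recorded in `status.observedGeneration`. The `deploymentState` struct and both function names are hypothetical and written for this example only; they are not taken from this pull request or from library-go.

```go
package main

import "fmt"

// deploymentState is a hypothetical summary of a Deployment:
// Generation mirrors metadata.generation, ObservedGeneration mirrors
// status.observedGeneration, and RolledOut stands for "Progressing
// condition has reason NewReplicaSetAvailable".
type deploymentState struct {
	Generation         int64
	ObservedGeneration int64
	RolledOut          bool
}

// specChangePending reports whether the spec was changed (generation bumped)
// and the Deployment controller has not processed that change yet.
func specChangePending(s deploymentState) bool {
	return s.ObservedGeneration < s.Generation
}

// shouldReportProgressing is an illustrative combination of both signals:
// a pending spec change or an unfinished rollout counts as progressing;
// anything else (including missing pods after a rollout) does not.
func shouldReportProgressing(s deploymentState) bool {
	return specChangePending(s) || !s.RolledOut
}

func main() {
	steady := deploymentState{Generation: 4, ObservedGeneration: 4, RolledOut: true}
	reconfigured := deploymentState{Generation: 5, ObservedGeneration: 4, RolledOut: true}
	fmt.Println("steady:", shouldReportProgressing(steady))
	fmt.Println("reconfigured:", shouldReportProgressing(reconfigured))
}
```

A reconfiguration that does not touch the Deployment at all leaves both signals unchanged, which is why it would have to be reported by a different controller.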

jsafrane · 2025-10-13T13:31:03Z

/label tide/merge-method-squash
Does it work here?

openshift-ci · 2025-10-13T13:38:17Z

@jsafrane: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hongkailiu · 2025-10-14T23:03:33Z

The testing results look good.

And here also good.

But this one is bad.
The fix from this pull does not solve the issue on co/storage. Is it expected?

Oct 14 19:02:20.982 W clusteroperator/storage condition/Progressing reason/GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying status/True GCPPDCSIDriverOperatorCRProgressing: GCPPDDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods (exception: https://issues.redhat.com/browse/OCPBUGS-62634)

jsafrane · 2025-10-15T09:56:18Z

The fix from this pull does not solve the issue on co/storage. Is it expected?

Yes. cluster-storage-operator runs Deployments with other CSI driver operators, which then actually install a CSI driver (= create Deployments + DaemonSets). We will need to bump library-go in ~10 other repos to pick up these changes.

I'll comment on openshift/cluster-storage-operator#634

hongkailiu · 2025-10-28T19:27:47Z

/lgtm

/hold

Feel free to cancel the hold when ready.

openshift-ci · 2025-10-28T19:28:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/operator/OWNERS~~ [jsafrane]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 10, 2025

openshift-ci bot requested review from hexfusion and p0lyn0mial October 10, 2025 09:22

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 10, 2025

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 10, 2025

jsafrane mentioned this pull request Oct 10, 2025

WIP: OCPBUGS-62624: Bump library-go to fix Progressing condition openshift/cluster-csi-snapshot-controller-operator#247

Open

hongkailiu reviewed Oct 10, 2025

jsafrane and others added 2 commits October 13, 2025 15:23

Update pkg/operator/deploymentcontroller/deployment_controller.go

2b323fc

Co-authored-by: Hongkai Liu <[email protected]>

Make isProgressing a simple function

b1970ef

It does not need anything from DeploymentController

jsafrane force-pushed the progressing-deployment branch from 15d18d7 to b1970ef Compare October 13, 2025 13:26

openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 13, 2025

This was referenced Oct 15, 2025

WIP: Bump library-go to fix progressing condition openshift/gcp-pd-csi-driver-operator#152

Open

WIP: Fix progressing openshift/cluster-storage-operator#639

Open

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 28, 2025

openshift-ci bot assigned hongkailiu Oct 28, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 28, 2025
