-
Notifications
You must be signed in to change notification settings - Fork 36
NO-ISSUE:UPSTREAM: <carry>: Fix: Race condition in ClusterExtension cleanup #533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…meout for singleownnamespace tests
|
@kuiwang02: This pull request explicitly references no jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: kuiwang02 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-techpreview 5 |
|
@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c3130730-ae56-11f0-8ca2-47048bb63af9-0 |
|
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-ovn-ipv6-techpreview 5 |
|
@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d0093450-ae56-11f0-8264-d6e3474f0c3b-0 |
|
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-azure-ovn-runc-techpreview 5 |
|
@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e8acbae0-ae56-11f0-8ec9-6cccbfc04901-0 |
| err := k8sClient.Get(ctx, client.ObjectKey{Name: ce.Name}, &olmv1.ClusterExtension{}) | ||
| return errors.IsNotFound(err) | ||
| }).WithTimeout(1*time.Minute).WithPolling(2*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name) | ||
| }).WithTimeout(3*time.Minute).WithPolling(2*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 minutes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those tests report a signal for Sippy; to avoid a bad signal, we use bug timeouts.
| }).WithTimeout(3*time.Minute).WithPolling(2*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name) | |
| }).WithTimeout(5*time.Minute).WithPolling(3*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can check the other cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeouts should be generous — set them to 5 minutes with a pool size of 3.
Those return signals to Sippy and block other teams.
So, we cannot fail due to it. And yes, it was merged before and should be 5 minutes as the others
| err := k8sClient.Get(ctx, client.ObjectKey{Name: watchNamespace}, ns) | ||
| g.Expect(err).To(HaveOccurred(), "expected namespace %s to be deleted", watchNamespace) | ||
| g.Expect(errors.IsNotFound(err)).To(BeTrue(), "expected NotFound error for namespace %s", watchNamespace) | ||
| }).WithTimeout(2 * time.Minute).WithPolling(2 * time.Second).Should(Succeed()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to wait to delete the NS before running the other scenario?
Each scenario has its own unique bundles (CE, etc.), so they can run in parallel and are not marked as SERIAL. Therefore, they should not impact
Because of that, there’s no reason to block other teams or create concern if a namespace takes longer to delete — this can happen for known Kubernetes reasons. And we should not send bad signal for Sippy or block other teams due that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for looking on that
But to address the flake:
See; https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-[…]ly-4.21-e2e-azure-ovn-runc-techpreview/1980106371709800448
fail [/build/openshift/tests-extension/pkg/helpers/cluster_extension.go:185]: Timed out after 60.039s.
Cleanup ClusterExtension install-webhook-bothns-ownns-ce-tz9c failed to delete
Expected
: false
to be true
We should:
-> Not wait for the deletion of the CE
-> We can warn but not fail
See that k8s, for many reasons, can take longer to uninstall resources — and that’s normal.
We no longer have a SERIAL test, so each scenario can run in parallel and is fully isolated.
That means if the ClusterExtension (CE) is not removed right away, it should not impact any other test.
Therefore, we should not risk sending a bad signal to Sippy or blocking other teams because of it.
|
/close |
|
@kuiwang02: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Fix: Race condition in ClusterExtension cleanup timeout for singleownnamespace tests
Why / Problem Statement
The test
[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for ownNamespace and single namespace watch mode with operator should install cluster extensions successfully in both watch modeswas failing intermittently with a 60-second timeout during ClusterExtension cleanup.This is a race condition issue, not a regression introduced by recent changes. The test has a pre-existing robustness problem where asynchronous Kubernetes deletion time (variable: 45-120s depending on cluster load, resources, and finalizers) races against a fixed timeout (constant: 60s). The test passes when deletion completes quickly (<60s) and fails when it takes longer (>60s).
Failure evidence:
Root causes:
Insufficient timeout for foreground deletion: ClusterExtension with
foregroundDeletionfinalizer must wait for complete deletion chain (Deployment → ReplicaSet → Pods with 30s graceful shutdown + CRD instances + ServiceAccount + RBAC). This can legitimately take 60-120 seconds, but timeout was hardcoded to 60s.Kubernetes Delete() is asynchronous:
client.Delete()returns immediately (~50ms) after API server accepts the request, but actual deletion happens in background (45-90s later). The test did not properly wait for actual deletion completion.No wait between scenario iterations: The test runs two scenarios sequentially (singleNamespace, then ownNamespace) but only called
Delete()without waiting forIsNotFound, causing the next scenario to potentially start before previous resources are fully cleaned up.This is NOT introduced by PR #524: Analysis of PR #524 shows it only changed which operator is tested (quay-operator → singleown-operator) and added in-cluster builds. The deletion logic and timeout remained unchanged. PR #524 simply exposed this pre-existing race condition by changing environmental factors that made deletion slightly slower.
What / Solution
This PR fixes the race condition by implementing two changes to make the test robust against timing variations:
Changes Made
1. Increase ClusterExtension cleanup timeout (Required Fix)
File:
pkg/helpers/cluster_extension.go:185Eventually(func() bool { err := k8sClient.Get(ctx, client.ObjectKey{Name: ce.Name}, &olmv1.ClusterExtension{}) return errors.IsNotFound(err) - }).WithTimeout(1*time.Minute).WithPolling(2*time.Second).Should(BeTrue(), + }).WithTimeout(3*time.Minute).WithPolling(2*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name)Rationale:
2. Wait for namespace deletion between scenarios (Defense in Depth)
File:
test/olmv1-singleownnamespace.goAdded import:
+ "k8s.io/apimachinery/pkg/api/errors"Added wait logic after namespace deletion (lines 476-492):
Rationale:
IsNotFound), not just thatDelete()call succeededKey Technical Decisions
3-minute timeout for ClusterExtension cleanup
Wait for IsNotFound instead of trusting Delete() success
Eventuallywait checkingerrors.IsNotFound()after namespace deletionDelete()is asynchronous - it returns when API server accepts the request, not when deletion completes. Must poll forIsNotFoundto confirm actual deletion.time.Sleep()was rejected as an anti-pattern (hardcoded timing assumptions)2-minute timeout for namespace deletion wait
Benefits of Combined Fix
Testing
Assisted-by: Claude Code