Conversation

@kuiwang02
Contributor

Fix: Race condition in ClusterExtension cleanup timeout for singleownnamespace tests

Why / Problem Statement

The test [sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for ownNamespace and single namespace watch mode with operator should install cluster extensions successfully in both watch modes was failing intermittently with a 60-second timeout during ClusterExtension cleanup.

This is a race condition issue, not a regression introduced by recent changes. The test has a pre-existing robustness problem where asynchronous Kubernetes deletion time (variable: 45-120s depending on cluster load, resources, and finalizers) races against a fixed timeout (constant: 60s). The test passes when deletion completes quickly (<60s) and fails when it takes longer (>60s).

Failure evidence:

[FAILED] - /build/openshift/tests-extension/pkg/helpers/cluster_extension.go:185
Timed out after 60.039s.
Cleanup ClusterExtension install-webhook-bothns-ownns-ce-tz9c failed to delete

Timeline:
- 05:33:22 - Delete ClusterExtension called
- 05:34:22 - Timeout (60 seconds later)
- ClusterExtension status: DeletionTimestamp set, but object still exists with foregroundDeletion finalizer

Root causes:

  1. Insufficient timeout for foreground deletion: A ClusterExtension deleted with the foregroundDeletion finalizer must wait for the complete deletion chain (Deployment → ReplicaSet → Pods with 30s graceful shutdown, plus CRD instances, ServiceAccount, and RBAC). This can legitimately take 60-120 seconds, but the timeout was hardcoded to 60s. (A sketch of how a foreground delete is issued follows this list.)

  2. Kubernetes Delete() is asynchronous: client.Delete() returns immediately (~50ms) after API server accepts the request, but actual deletion happens in background (45-90s later). The test did not properly wait for actual deletion completion.

  3. No wait between scenario iterations: The test runs two scenarios sequentially (singleNamespace, then ownNamespace) but only called Delete() without waiting for IsNotFound, causing the next scenario to potentially start before previous resources are fully cleaned up.
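
For context, here is a minimal sketch of how a foreground delete is typically issued with controller-runtime. The helper's actual delete call is not part of this PR's diff, so the function name and parameters below are assumptions for illustration only:

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteForeground issues a delete with foreground propagation. The API server
// adds the foregroundDeletion finalizer, so the object stays visible until its
// dependents (Deployment -> ReplicaSet -> Pods, RBAC, ...) are gone, which is
// why a subsequent Get can keep succeeding for 60-120s after Delete returns.
func deleteForeground(ctx context.Context, k8sClient client.Client, obj client.Object) error {
	return k8sClient.Delete(ctx, obj, client.PropagationPolicy(metav1.DeletePropagationForeground))
}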

This is NOT introduced by PR #524: Analysis of PR #524 shows it only changed which operator is tested (quay-operator → singleown-operator) and added in-cluster builds. The deletion logic and timeout remained unchanged. PR #524 simply exposed this pre-existing race condition by changing environmental factors that made deletion slightly slower.

What / Solution

This PR fixes the race condition by implementing two changes to make the test robust against timing variations:

Changes Made

1. Increase ClusterExtension cleanup timeout (Required Fix)

File: pkg/helpers/cluster_extension.go:185

  Eventually(func() bool {
      err := k8sClient.Get(ctx, client.ObjectKey{Name: ce.Name}, &olmv1.ClusterExtension{})
      return errors.IsNotFound(err)
- }).WithTimeout(1*time.Minute).WithPolling(2*time.Second).Should(BeTrue(),
+ }).WithTimeout(3*time.Minute).WithPolling(2*time.Second).Should(BeTrue(),
      "Cleanup ClusterExtension %s failed to delete", ce.Name)

Rationale:

  • Foreground deletion legitimately takes 60-120 seconds in production clusters
  • 3 minutes provides sufficient buffer for pod graceful shutdown, finalizer processing, and CRD cleanup
  • Still fails fast enough (within 3 minutes) to detect real deletion issues
  • Addresses the core race condition: variable async deletion time vs fixed timeout

2. Wait for namespace deletion between scenarios (Defense in Depth)

File: test/olmv1-singleownnamespace.go

Added import:

+ "k8s.io/apimachinery/pkg/api/errors"

Added wait logic after namespace deletion (lines 476-492):

By(fmt.Sprintf("waiting for namespace %s to be fully deleted before next scenario", installNamespace))
Eventually(func(g Gomega) {
    ns := &corev1.Namespace{}
    err := k8sClient.Get(ctx, client.ObjectKey{Name: installNamespace}, ns)
    g.Expect(err).To(HaveOccurred(), "expected namespace %s to be deleted", installNamespace)
    g.Expect(errors.IsNotFound(err)).To(BeTrue(), "expected NotFound error for namespace %s", installNamespace)
}).WithTimeout(2 * time.Minute).WithPolling(2 * time.Second).Should(Succeed())

if watchNSObj != nil {
    By(fmt.Sprintf("waiting for watch namespace %s to be fully deleted before next scenario", watchNamespace))
    Eventually(func(g Gomega) {
        ns := &corev1.Namespace{}
        err := k8sClient.Get(ctx, client.ObjectKey{Name: watchNamespace}, ns)
        g.Expect(err).To(HaveOccurred(), "expected namespace %s to be deleted", watchNamespace)
        g.Expect(errors.IsNotFound(err)).To(BeTrue(), "expected NotFound error for namespace %s", watchNamespace)
    }).WithTimeout(2 * time.Minute).WithPolling(2 * time.Second).Should(Succeed())
}

Rationale:

  • Ensures namespace is actually deleted (IsNotFound), not just that Delete() call succeeded
  • Prevents resource conflicts between scenario iterations
  • Properly handles Kubernetes asynchronous deletion semantics
  • Provides complete isolation between test scenarios

Key Technical Decisions

  1. 3-minute timeout for ClusterExtension cleanup

    • Decision: Increase from 60s to 180s
    • Rationale: Based on analysis of foreground deletion chain timing (Deployment → ReplicaSet → Pods with 30s graceful shutdown + finalizers). 180s provides comfortable buffer while still detecting real issues.
    • Alternatives considered: 120s; 180s was chosen for extra margin on slow clusters
  2. Wait for IsNotFound instead of trusting Delete() success

    • Decision: Add explicit Eventually wait checking errors.IsNotFound() after namespace deletion
    • Rationale: In Kubernetes, Delete() is asynchronous - it returns when API server accepts the request, not when deletion completes. Must poll for IsNotFound to confirm actual deletion.
    • Alternatives considered: Using time.Sleep() was rejected as an anti-pattern (hardcoded timing assumptions). A sketch of a shared wait helper follows this list.
  3. 2-minute timeout for namespace deletion wait

    • Decision: Use 120s timeout for namespace cleanup verification
    • Rationale: Namespace deletion is typically faster than ClusterExtension deletion (no complex finalizers), but it needs a buffer for the various resources within the namespace to clean up
    • Alternatives considered: 60s rejected as potentially too short in slow environments
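
The two namespace waits in change 2 above differ only in the namespace name. A minimal sketch of a shared helper, assuming the same Ginkgo/Gomega dot-imports and client setup as the test file (the helper name is hypothetical and not part of this PR's diff):

// waitForNamespaceDeleted polls until the namespace is gone (NotFound).
// Hypothetical refactor sketch; the PR keeps the two inline Eventually blocks.
func waitForNamespaceDeleted(ctx context.Context, k8sClient client.Client, name string) {
	Eventually(func(g Gomega) {
		ns := &corev1.Namespace{}
		err := k8sClient.Get(ctx, client.ObjectKey{Name: name}, ns)
		g.Expect(errors.IsNotFound(err)).To(BeTrue(), "expected namespace %s to be deleted", name)
	}).WithTimeout(2 * time.Minute).WithPolling(2 * time.Second).Should(Succeed())
}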

Benefits of Combined Fix

Aspect                   | Before                     | After
ClusterExtension cleanup | 60s (insufficient)         | 180s (sufficient)
Scenario isolation       | No wait (race condition)   | Wait for IsNotFound (guaranteed)
Async handling           | Assumes Delete() = deleted | Waits for actual deletion
Robustness               | Timing-dependent (flaky)   | State-dependent (reliable)
Debugging                | Vague timeout errors       | Clear error messages with namespace names

Testing

INFO[0194] Found 0 must-gather tests                    
started: 0/1/5 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation should reject invalid watch namespace configuration and update the status conditions accordingly should fail to install the ClusterExtension when watch namespace is invalid"

started: 0/2/5 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for ownNamespace and single namespace watch mode with operator should install cluster extensions successfully in both watch modes"

started: 0/3/5 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for ownNamespace watch mode with operator should install a cluster extension successfully"

started: 0/4/5 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for singleNamespace watch mode with operator should install a cluster extension successfully"


passed: (40.7s) 2025-10-21T07:48:00 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for ownNamespace watch mode with operator should install a cluster extension successfully"


passed: (46s) 2025-10-21T07:48:05 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation should reject invalid watch namespace configuration and update the status conditions accordingly should fail to install the ClusterExtension when watch namespace is invalid"


passed: (51.2s) 2025-10-21T07:48:11 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for singleNamespace watch mode with operator should install a cluster extension successfully"


passed: (1m17s) 2025-10-21T07:48:36 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace] OLMv1 operator installation support for ownNamespace and single namespace watch mode with operator should install cluster extensions successfully in both watch modes"

started: 0/5/5 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace][Serial] OLMv1 operator installation support for ownNamespace watch mode with an operator that does not support ownNamespace installation mode should fail to install a cluster extension successfully"


passed: (38.3s) 2025-10-21T07:49:21 "[sig-olmv1][OCPFeatureGate:NewOLMOwnSingleNamespace][Serial] OLMv1 operator installation support for ownNamespace watch mode with an operator that does not support ownNamespace installation mode should fail to install a cluster extension successfully"

Shutting down the monitor
Collecting data.
INFO[0325] Starting CollectData for all monitor tests   
INFO[0325]   Starting CollectData for [Monitor:watch-namespaces][Jira:"Test Framework"] monitor test watch-namespaces collection 
INFO[0325]   Finished CollectData for [Monitor:watch-namespaces][Jira:"Test Framework"] monitor test watch-namespaces collection 
INFO[0325] Finished CollectData for all monitor tests   
Computing intervals.
Evaluating tests.
Cleaning up.
INFO[0325] beginning cleanup                             monitorTest=watch-namespaces
Serializing results.
Writing to storage.
  m.startTime = 2025-10-21 15:47:11.194084 +0800 CST m=+194.609326834
  m.stopTime  = 2025-10-21 15:49:21.634841 +0800 CST m=+325.051185959
Processing monitorTest: watch-namespaces
  finalIntervals size = 10
  first interval time: From = 2025-10-21 15:47:11.202394 +0800 CST m=+194.617636834; To = 2025-10-21 15:47:11.202394 +0800 CST m=+194.617636834
  last interval time: From = 2025-10-21 15:49:21.632643 +0800 CST m=+325.048988168; To = 2025-10-21 15:49:21.632643 +0800 CST m=+325.048988168
Writing junits.
Writing JUnit report to e2e-monitor-tests__20251021-074409.xml
5 pass, 0 flaky, 0 skip (5m12s)

Assisted-by: Claude Code

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 21, 2025
@openshift-ci-robot

@kuiwang02: This pull request explicitly references no jira issue.

In response to this:

Fix: Race condition in ClusterExtension cleanup timeout for singleownnamespace tests


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kuiwang02
Once this PR has been reviewed and has the lgtm label, please assign perdasilva for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kuiwang02
Contributor Author

/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-techpreview 5

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c3130730-ae56-11f0-8ca2-47048bb63af9-0

@kuiwang02
Contributor Author

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-ovn-ipv6-techpreview 5

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-ovn-ipv6-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d0093450-ae56-11f0-8264-d6e3474f0c3b-0

@kuiwang02
Contributor Author

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-azure-ovn-runc-techpreview 5

@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-azure-ovn-runc-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e8acbae0-ae56-11f0-8ec9-6cccbfc04901-0

err := k8sClient.Get(ctx, client.ObjectKey{Name: ce.Name}, &olmv1.ClusterExtension{})
return errors.IsNotFound(err)
}).WithTimeout(1*time.Minute).WithPolling(2*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name)
}).WithTimeout(3*time.Minute).WithPolling(2*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name)

5 minutes


Those tests report a signal to Sippy; to avoid a bad signal, we use big timeouts.

Suggested change
}).WithTimeout(3*time.Minute).WithPolling(2*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name)
}).WithTimeout(5*time.Minute).WithPolling(3*time.Second).Should(BeTrue(), "Cleanup ClusterExtension %s failed to delete", ce.Name)


You can check the other cases

@camilamacedo86 Oct 21, 2025

The timeouts should be generous: set them to 5 minutes with a 3-second polling interval.
Those tests report signals to Sippy, and failures block other teams.
So we cannot let this fail because of cleanup timing. And yes, this was merged before; it should be 5 minutes like the others.

err := k8sClient.Get(ctx, client.ObjectKey{Name: watchNamespace}, ns)
g.Expect(err).To(HaveOccurred(), "expected namespace %s to be deleted", watchNamespace)
g.Expect(errors.IsNotFound(err)).To(BeTrue(), "expected NotFound error for namespace %s", watchNamespace)
}).WithTimeout(2 * time.Minute).WithPolling(2 * time.Second).Should(Succeed())
@camilamacedo86 Oct 21, 2025

Why do we need to wait for the NS to be deleted before running the other scenario?

Each scenario has its own unique bundles (CE, etc.), so they can run in parallel and are not marked as Serial; they should not impact each other.
Because of that, there is no reason to block other teams or raise concern if a namespace takes longer to delete; this can happen for known Kubernetes reasons, and we should not send a bad signal to Sippy or block other teams because of it.

@camilamacedo86 left a comment

Thank you for looking into this.
But to address the flake:

See: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-[…]ly-4.21-e2e-azure-ovn-runc-techpreview/1980106371709800448
fail [/build/openshift/tests-extension/pkg/helpers/cluster_extension.go:185]: Timed out after 60.039s.
Cleanup ClusterExtension install-webhook-bothns-ownns-ce-tz9c failed to delete
Expected
: false
to be true

We should:
-> Not wait for the deletion of the CE
-> We can warn but not fail

Note that k8s, for many reasons, can take longer to uninstall resources, and that's normal.
We no longer have a SERIAL test, so each scenario can run in parallel and is fully isolated.
That means if the ClusterExtension (CE) is not removed right away, it should not impact any other test.

Therefore, we should not risk sending a bad signal to Sippy or blocking other teams because of it.
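
For illustration, a minimal sketch of the "warn but not fail" approach described above, using the names from the diff; this is not code from this PR, and it assumes "k8s.io/apimachinery/pkg/util/wait" and apierrors "k8s.io/apimachinery/pkg/api/errors" are imported:

// Poll for deletion, but on timeout only warn instead of failing the spec.
pollErr := wait.PollUntilContextTimeout(ctx, 3*time.Second, 5*time.Minute, true,
	func(ctx context.Context) (bool, error) {
		err := k8sClient.Get(ctx, client.ObjectKey{Name: ce.Name}, &olmv1.ClusterExtension{})
		return apierrors.IsNotFound(err), nil
	})
if pollErr != nil {
	// Slow foreground deletion is expected on busy clusters, and the scenarios
	// are isolated, so surface a warning in the test output rather than a failure.
	fmt.Fprintf(GinkgoWriter, "warning: ClusterExtension %s still terminating after 5m: %v\n", ce.Name, pollErr)
}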

@kuiwang02
Contributor Author

/close

@openshift-ci openshift-ci bot closed this Oct 21, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 21, 2025

@kuiwang02: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
