Conversation

@kuiwang02 (Contributor) commented Oct 14, 2025

Summary

Enable parallel execution for webhook operator tests to reduce overall test execution time. This PR splits the webhook tests into two groups: parallel tests (validating, mutating, and conversion webhooks) and serial tests (TLS secret deletion).

Changes

1. Split Tests into Two Describe Blocks

Parallel Tests (Lines 42-177)

  • Removed Serial marker from the Describe block
  • Removed [Serial] marker from individual test cases:
    • should have a working validating webhook
    • should have a working mutating webhook
    • should have a working conversion webhook
  • Removed cleanup call in BeforeEach to allow multiple ClusterExtensions to coexist

Serial Tests (Lines 180-306)

  • New Describe block with Serial marker
  • Contains the TLS secret deletion test
  • Executes cleanup before running to ensure a clean environment

Architecture Analysis

Resource Isolation Strategy

When parallel tests run in separate processes on the same cluster:

Per-Process Resources (Isolated)

  • Namespace: webhook-operator-<random> (unique per test)
  • ClusterExtension: named after namespace (unique)
  • ServiceAccount, ClusterRoleBinding: namespace-scoped
  • Deployment, Service, Secret: namespace-scoped
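The isolation above hinges on every test deriving its resource names from a unique namespace. A minimal sketch of how such a name could be generated (the helper name is hypothetical; the real suite's generator may differ):

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// uniqueNamespace returns "webhook-operator-" plus a random 5-character
// suffix, so parallel test processes never collide on namespace-scoped
// resources. (Hypothetical helper; the real suite's naming may differ.)
func uniqueNamespace() string {
	const charset = "abcdefghijklmnopqrstuvwxyz0123456789"
	b := make([]byte, 5)
	for i := range b {
		n, _ := rand.Int(rand.Reader, big.NewInt(int64(len(charset))))
		b[i] = charset[n.Int64()]
	}
	return "webhook-operator-" + string(b)
}

func main() {
	ns := uniqueNamespace()
	fmt.Println(ns)
	// The ClusterExtension, ServiceAccount, etc. are then named after ns.
}
```

Because the ClusterExtension and all namespace-scoped objects take their names from this value, two processes can run the same test body concurrently without touching each other's resources.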

Shared Resources (Cluster-Scoped)

  • CRD: webhooktests.webhook.operators.coreos.io (shared)
  • MutatingWebhookConfiguration: multiple instances (uses generateName)
  • ValidatingWebhookConfiguration: multiple instances (uses generateName)

CRD Conflict Resolution

Question: When multiple ClusterExtensions try to install the same CRD simultaneously, what happens?

Answer: OLM handles this gracefully because:

  1. No OwnerReferences on CRDs: CRDs created by OLM do not have ownerReferences pointing to ClusterExtension
  2. Server-Side Apply (SSA): OLM uses SSA (see boxcutter.go:226), which allows multiple field managers to co-manage the same resource
  3. By Design: OLM is designed to support multiple operators sharing the same CRD

Evidence from Code:

```go
// internal/operator-controller/applier/boxcutter.go:67
obj.SetOwnerReferences(nil) // reset OwnerReferences for migration

// internal/operator-controller/applier/boxcutter.go:226
return bc.Client.Patch(ctx, obj, client.Apply, client.FieldOwner(bc.FieldOwner), client.ForceOwnership)
```
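The co-management idea behind SSA can be pictured with a toy field-manager model. This is only an illustration of the claim above, not the apiserver's actual merge algorithm, and the manager names are invented:

```go
package main

import "fmt"

// applyFields mimics, very loosely, server-side apply: each manager
// takes ownership only of the fields it sets, so two appliers can
// co-manage one object without ownerReferences tying it to either.
func applyFields(obj, owners map[string]string, manager string, fields map[string]string) {
	for k, v := range fields {
		obj[k] = v
		owners[k] = manager
	}
}

func main() {
	obj := map[string]string{}
	owners := map[string]string{}

	// Two ClusterExtensions apply fields of the shared CRD (names illustrative).
	applyFields(obj, owners, "ce-webhook-operator-abc12",
		map[string]string{"spec.group": "webhook.operators.coreos.io"})
	applyFields(obj, owners, "ce-webhook-operator-def34",
		map[string]string{"spec.scope": "Namespaced"})

	// Each manager owns only what it applied; the object itself is
	// bound to neither, which is why deleting one ClusterExtension
	// need not delete the CRD.
	fmt.Println(owners["spec.group"], owners["spec.scope"])
}
```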

Cleanup Behavior

When a ClusterExtension is deleted:

  • ✅ ClusterExtension itself is deleted
  • ✅ ClusterExtensionRevision is deleted (via ownerReference)
  • ✅ Namespace-scoped resources are deleted (via namespace deletion)
  • ✅ ClusterRoleBinding is deleted (via DeferCleanup)
  • ❌ CRD is NOT deleted (no ownerReference, managed via SSA)

This behavior is intentional and allows parallel tests to safely delete their ClusterExtensions without affecting other running tests.
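The DeferCleanup-based teardown can be sketched as a LIFO stack of cleanup functions that runs even when the test body fails. This is a simplified analogue of Ginkgo's DeferCleanup, not its implementation:

```go
package main

import "fmt"

// cleanupStack is a simplified analogue of Ginkgo's DeferCleanup:
// teardowns run in reverse registration order, whether or not the
// test body succeeded.
type cleanupStack struct{ fns []func() }

func (c *cleanupStack) deferCleanup(f func()) { c.fns = append(c.fns, f) }

func (c *cleanupStack) run() {
	for i := len(c.fns) - 1; i >= 0; i-- {
		c.fns[i]()
	}
}

func main() {
	var c cleanupStack
	c.deferCleanup(func() { fmt.Println("delete ClusterRoleBinding") })
	c.deferCleanup(func() { fmt.Println("delete ClusterExtension (CRD remains)") })
	// ... test body; cleanup runs regardless of its outcome ...
	c.run()
}
```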

Known Side Effects (Acceptable)

  1. Multiple Webhook Interceptors

Each WebhookTest resource creation will be validated/mutated by all active webhook configurations (~3x in parallel execution).

Impact:

  • Minor performance overhead (~3x network calls)
  • Potential timeout if one webhook is unreachable
  • Functional behavior should remain correct
  2. ConversionWebhook Namespace Mismatch

The CRD contains a hardcoded namespace reference (webhook-operator-system), but actual deployments use random namespaces (webhook-operator-<random>).

Impact:

  • May affect conversion webhook routing
  • Depends on whether OLM dynamically updates the CRD namespace reference
  • Conversion webhook test might experience issues, but validating/mutating tests should pass
  3. Cleanup Timing

The serial test block uses helpers.EnsureCleanupClusterExtension() to clean up all ClusterExtensions and the shared CRD before running. This ensures the TLS deletion test runs in a clean environment.

Testing Strategy

Execution Flow

Phase 1: Parallel Tests (Processes 1-3)
T0: All processes start simultaneously
├─ Process 1: Creates CE "webhook-operator-abc12" → OLM creates CRD ✅
├─ Process 2: Creates CE "webhook-operator-def34" → OLM detects existing CRD, validates, continues ✅
└─ Process 3: Creates CE "webhook-operator-ghi56" → OLM detects existing CRD, validates, continues ✅

T1: All tests execute in parallel
└─ Each test uses its isolated namespace and resources

T2: Tests complete and clean up
├─ Process 1: Deletes CE (CRD remains) ✅
├─ Process 2: Deletes CE (CRD remains) ✅
└─ Process 3: Deletes CE (CRD remains) ✅

Phase 2: Serial Test (After all parallel tests complete)
T3: Serial test starts
├─ Cleanup all remaining ClusterExtensions
├─ Delete shared CRD
├─ Create fresh ClusterExtension
└─ Run TLS deletion test
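The phase ordering above — parallel processes first, the serial test strictly after — can be sketched with a WaitGroup acting as the barrier (process names illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// runPhases runs the parallel tests concurrently, waits for all of
// them, and only then runs the serial test; it returns the completion
// order, in which the serial entry is always last.
func runPhases(parallel []string) []string {
	var (
		mu    sync.Mutex
		order []string
		wg    sync.WaitGroup
	)
	for _, name := range parallel {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			// ... run the webhook test in its own namespace ...
			mu.Lock()
			order = append(order, n)
			mu.Unlock()
		}(name)
	}
	wg.Wait() // barrier: the serial phase cannot start earlier
	order = append(order, "serial: tls-secret-deletion")
	return order
}

func main() {
	order := runPhases([]string{
		"webhook-operator-abc12",
		"webhook-operator-def34",
		"webhook-operator-ghi56",
	})
	fmt.Println(order[len(order)-1]) // serial: tls-secret-deletion
}
```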

Verification Plan (repeated runs; all pass)

```console
INFO[0194] Found 0 must-gather tests
started: 0/1/4 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests should have a working validating webhook"
started: 0/2/4 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests should have a working mutating webhook"
started: 0/3/4 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests should have a working conversion webhook"
passed: (1m14s) 2025-10-14T07:32:03 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests should have a working mutating webhook"
passed: (1m36s) 2025-10-14T07:32:25 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests should have a working conversion webhook"
passed: (2m7s) 2025-10-14T07:32:57 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests should have a working validating webhook"
started: 0/4/4 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - serial tests should be tolerant to tls secret deletion [Serial]"
passed: (1m4s) 2025-10-14T07:34:07 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - serial tests should be tolerant to tls secret deletion [Serial]"
Shutting down the monitor
```

Assisted-by: Claude Code

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 14, 2025
@openshift-ci-robot

@kuiwang02: This pull request explicitly references no jira issue.

In response to this:

Summary

Enable parallel execution for webhook operator tests to reduce overall test execution time. This PR splits the webhook tests into two groups: parallel tests (validating, mutating, and conversion webhooks) and serial tests (TLS secret deletion).

Changes

1. Split Tests into Two Describe Blocks

Parallel Tests (Lines 42-177)

  • Removed Serial marker from the Describe block
  • Removed [Serial] marker from individual test cases:
    • should have a working validating webhook
    • should have a working mutating webhook
    • should have a working conversion webhook
  • Removed cleanup call in BeforeEach to allow multiple ClusterExtensions to coexist

Serial Tests (Lines 180-306)

  • New Describe block with Serial marker
  • Contains the TLS secret deletion test
  • Executes cleanup before running to ensure a clean environment

Architecture Analysis

Resource Isolation Strategy

When parallel tests run in separate processes on the same cluster:

Per-Process Resources (Isolated)

  • Namespace: webhook-operator-<random> (unique per test)
  • ClusterExtension: named after namespace (unique)
  • ServiceAccount, ClusterRoleBinding: namespace-scoped
  • Deployment, Service, Secret: namespace-scoped

Shared Resources (Cluster-Scoped)

  • CRD: webhooktests.webhook.operators.coreos.io (shared)
  • MutatingWebhookConfiguration: multiple instances (uses generateName)
  • ValidatingWebhookConfiguration: multiple instances (uses generateName)

CRD Conflict Resolution

Question: When multiple ClusterExtensions try to install the same CRD simultaneously, what happens?

Answer: OLM handles this gracefully because:

  1. No OwnerReferences on CRDs: CRDs created by OLM do not have ownerReferences pointing to ClusterExtension
  2. Server-Side Apply (SSA): OLM uses SSA (see boxcutter.go:226), which allows multiple field managers to co-manage the same resource
  3. By Design: OLM is designed to support multiple operators sharing the same CRD

Evidence from Code:

// internal/operator-controller/applier/boxcutter.go:67
obj.SetOwnerReferences(nil) // reset OwnerReferences for migration

// internal/operator-controller/applier/boxcutter.go:226
return bc.Client.Patch(ctx, obj, client.Apply, client.FieldOwner(bc.FieldOwner), client.ForceOwnership)

Cleanup Behavior

When a ClusterExtension is deleted:
-ClusterExtension itself is deleted
-ClusterExtensionRevision is deleted (via ownerReference)
-Namespace-scoped resources are deleted (via namespace deletion)
-ClusterRoleBinding is deleted (via DeferCleanup)
-CRD is NOT deleted (no ownerReference, managed via SSA)

This behavior is intentional and allows parallel tests to safely delete their ClusterExtensions without affecting other running tests.

Known Side Effects (Acceptable)

1. Multiple Webhook Interceptors

Each WebhookTest resource creation will be validated/mutated by all active webhook configurations (~3x in parallel execution).

Impact:
- Minor performance overhead (~3x network calls)
- Potential timeout if one webhook is unreachable
- Functional behavior should remain correct

2. ConversionWebhook Namespace Mismatch

The CRD contains a hardcoded namespace reference (webhook-operator-system), but actual deployments use random namespaces (webhook-operator-<random>).

Impact:
- May affect conversion webhook routing
- Depends on whether OLM dynamically updates the CRD namespace reference
- Conversion webhook test might experience issues, but validating/mutating tests should pass

3. Cleanup Timing

The serial test block uses helpers.EnsureCleanupClusterExtension() to clean up all ClusterExtensions and the shared CRD before running. This ensures the TLS deletion test runs in a clean environment.

Testing Strategy

Execution Flow

Phase 1: Parallel Tests (Processes 1-3)
T0: All processes start simultaneously
  ├─ Process 1: Creates CE "webhook-operator-abc12"OLM creates CRD ✅
  ├─ Process 2: Creates CE "webhook-operator-def34"OLM detects existing CRD, validates, continues ✅
  └─ Process 3: Creates CE "webhook-operator-ghi56"OLM detects existing CRD, validates, continuesT1: All tests execute in parallel
  └─ Each test uses its isolated namespace and resources

T2: Tests complete and clean up
  ├─ Process 1: Deletes CE (CRD remains) ✅
  ├─ Process 2: Deletes CE (CRD remains) ✅
  └─ Process 3: Deletes CE (CRD remains) ✅

Phase 2: Serial Test (After all parallel tests complete)
T3: Serial test starts
  ├─ Cleanup all remaining ClusterExtensions
  ├─ Delete shared CRD
  ├─ Create fresh ClusterExtension
  └─ Run TLS deletion test

Verification Plan (run them more times and all pass)

```console
INFO[0194] Found 0 must-gather tests                    
started: 0/1/4 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests should have a working validating webhook"

started: 0/2/4 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests should have a working mutating webhook"

started: 0/3/4 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests should have a working conversion webhook"


passed: (1m14s) 2025-10-14T07:32:03 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests should have a working mutating webhook"


passed: (1m36s) 2025-10-14T07:32:25 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests should have a working conversion webhook"


passed: (2m7s) 2025-10-14T07:32:57 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests should have a working validating webhook"

started: 0/4/4 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - serial tests should be tolerant to tls secret deletion [Serial]"


passed: (1m4s) 2025-10-14T07:34:07 "[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - serial tests should be tolerant to tls secret deletion [Serial]"

Shutting down the monitor

Assisted-by: Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from dtfranz and pedjak October 14, 2025 08:03
openshift-ci bot commented Oct 14, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kuiwang02
Once this PR has been reviewed and has the lgtm label, please assign jianzhangbjz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kuiwang02 (Contributor Author)

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-ovn-ipv6-techpreview 5

@kuiwang02 (Contributor Author)

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-serial-ovn-ipv6-techpreview-1of2 5

openshift-ci bot commented Oct 14, 2025

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-ovn-ipv6-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/29ee9f50-a8d5-11f0-9faf-3ce017251e3c-0

@kuiwang02 (Contributor Author)

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-serial-ovn-ipv6-techpreview-2of2 5

openshift-ci bot commented Oct 14, 2025

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-serial-ovn-ipv6-techpreview-1of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/30c73f30-a8d5-11f0-8208-a2ad207adfea-0

@kuiwang02 (Contributor Author)

/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-techpreview 5

openshift-ci bot commented Oct 14, 2025

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-serial-ovn-ipv6-techpreview-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/38daa900-a8d5-11f0-90de-e05da449773b-0

openshift-ci bot commented Oct 14, 2025

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/40989fd0-a8d5-11f0-94e4-4e81c75cc544-0

@kuiwang02 (Contributor Author)

/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-techpreview-serial 5

openshift-ci bot commented Oct 14, 2025

@kuiwang02: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-techpreview-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/47f563d0-a8d5-11f0-8a0c-c0779eb1d1a3-0

openshift-ci bot commented Oct 14, 2025

@kuiwang02: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                               Commit   Required  Rerun command
ci/prow/e2e-aws-techpreview-olmv1-ext   d59e968  yes       /test e2e-aws-techpreview-olmv1-ext
ci/prow/openshift-e2e-aws               d59e968  yes       /test openshift-e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

```go
var _ = Describe("[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks",
	Ordered, Serial, func() {
// Parallel webhook tests - can run concurrently
var _ = Describe("[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks - parallel tests",
```
Do we need parallel test in the name?

```go
}
// Note: cleanup is now handled by DeferCleanup in BeforeEach, which ensures
// cleanup runs even if BeforeEach or the test fails
})
```
@camilamacedo86 commented Oct 14, 2025

If it passes in the payload aggregation, I am OK with it.
However, running as SERIAL, we already have the data we need; see:

  1. Go to https://sippy.dptools.openshift.org/sippy-ng/tests/4.21
  2. Search for the suite ID, e.g. NewOLMWebhook
  3. You will then see all the tests (image below)
  4. Lastly, if we click ANY, we can see the execution with IPv6
[Screenshots: Sippy test listing for the NewOLMWebhook suite]

Example

[Screenshot: test execution details with IPv6]

So, I am not sure we need to change anything, since our goal is only to make the tests pass in the validation of this PR: openshift/api#2491. See the result from yesterday's execution: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_api/[…]ft-api-master-verify-feature-promotion/1977693404213022720

It no longer says 0; it says it has X executions and needs 14.

@perdasilva (Contributor) commented Oct 14, 2025

@kuiwang02

By Design: OLM is designed to support multiple operators sharing the same CRD

This is not accurate. For OLMv1 we specifically want to stay away from multi-tenancy-type features. We should not be able to install the same operator twice.

Upstream:

CRD Gets deleted with ClusterExtension

```console
$ kubectl apply -f debug-samples/test-operator.yaml
namespace/test-operator created
serviceaccount/test-operator-installer created
clusterrolebinding.rbac.authorization.k8s.io/test-operator-installer-binding created
clusterextension.olm.operatorframework.io/test-operator created
$ kubectl get crd
NAME                                                 CREATED AT
...
olme2etests.olm.operatorframework.io                 2025-10-14T16:01:25Z
$ kubectl delete clusterextension test-operator
clusterextension.olm.operatorframework.io "test-operator" deleted
$ kubectl get crd
NAME                                                 CREATED AT
...
```

(olme2etests.olm.operatorframework.io is deleted)

No two operators with the same CRD

If we try to install two packages with the same CRD, I get an error like:

```yaml
 message: |
      error for resolved bundle "test-operator-two.1.0.0" with version "1.0.0": revision object collisions in phase 1
      Object CustomResourceDefinition.apiextensions.k8s.io/v1 /olme2etests.olm.operatorframework.io
      Action: "Collision"
      Probes:
      - Progress: Succeeded
      Conflicts:
      - "kube-apiserver"
        .status.acceptedNames.kind
        .status.acceptedNames.plural
      Other:
      - "kube-apiserver"
        .status.acceptedNames.listKind
        .status.acceptedNames.singular
        .status.conditions[type="Established"]
        .status.conditions[type="NamesAccepted"]
        .status.conditions[type="Established"].lastTransitionTime
        .status.conditions[type="Established"].message
        .status.conditions[type="Established"].reason
        .status.conditions[type="Established"].status
        .status.conditions[type="Established"].type
        .status.conditions[type="NamesAccepted"].lastTransitionTime
        .status.conditions[type="NamesAccepted"].message
        .status.conditions[type="NamesAccepted"].reason
        .status.conditions[type="NamesAccepted"].status
        .status.conditions[type="NamesAccepted"].type
      Comparison:
      - Added:
        .metadata.ownerReferences[uid="93dc3e01-97f0-4924-a35b-f56692766a31"]
        .spec.conversion.strategy
      - Modified:
        .metadata.labels.olm.operatorframework.io/owner
        .metadata.labels.olm.operatorframework.io/owner-name
      - Removed:
        .metadata.ownerReferences[uid="4794e10c-a4c1-41c4-8f68-1a9cc5f1adee"]
      Conflicting Owner: &OwnerReference{Kind:ClusterExtensionRevision,Name:test-operator-1,UID:93dc3e01-97f0-4924-a35b-f56692766a31,APIVersion:olm.operatorframework.io/v1,Controller:*true,BlockOwnerDeletion:*true,}
```

If we try to install the same operator twice, we'll get a conflict on cluster resources like ClusterRole.

@camilamacedo86 (Contributor) commented Oct 14, 2025

/hold

Our goal is not to remove SERIAL, just to have a test that runs in a disconnected env / IPv6 so we can promote the feature to GA. More info: https://redhat-internal.slack.com/archives/C09EPHWT0NM/p1760460889991629 and #522 (comment)

If we indeed change the name, it will delay us from achieving the goal (we already have X executions).
The tests, as they are, run in the required profiles; we just need to collect more data.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2025
@kuiwang02 (Contributor Author)

/close

@openshift-ci openshift-ci bot closed this Oct 15, 2025
openshift-ci bot commented Oct 15, 2025

@kuiwang02: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kuiwang02 kuiwang02 deleted the supportwebhookparallel branch November 10, 2025 00:42
Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
