MCO-1940: Enhance MCS layered image serving safety during node scale-up by requiring node validation #5382
@dkhater-redhat: This pull request references MCO-1940 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Force-pushed from f2f3c2f to ad43c64
Force-pushed from ad43c64 to 722c732
|
@pablintino this is the follow-up PR for the reboot work. Please let me know what you think. |
Force-pushed from 5bcb821 to ca8c4dd
Force-pushed from 0359b8c to a3889ee
Force-pushed from a3889ee to f9536e2
isabella-janssen left a comment
/lgtm
All previous review comments were addressed, so this should be good to go!
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dkhater-redhat, isabella-janssen |
|
Pre-merge verified
Environment Setup
Testing Scenarios
Tested for both internal and external registry.
Note: The internal registry always requires 2 reboots for a new node to join.
Test Steps:
$ oc create secret docker-registry layering-push-secret \
--docker-server=quay.io \
--docker-username=<username> \
--docker-password=<password> \
--docker-email="" \
-n openshift-machine-config-operator
secret/layering-push-secret created
$ oc get secret layering-push-secret -n openshift-machine-config-operator
layering-push-secret kubernetes.io/dockerconfigjson 1 12s
MOSC external template:
$ oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineOSConfig
metadata:
  name: worker
spec:
  machineConfigPool:
    name: worker
  imageBuilder:
    imageBuilderType: Job
  baseImagePullSecret:
    name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
  renderedImagePushSecret:
    name: layering-push-secret
  renderedImagePushSpec: "quay.io/mcoqe/layering:ocl"
EOF
$ oc get machineosbuilds
NAME PREPARED BUILDING SUCCEEDED INTERRUPTED FAILED AGE
worker-ab4af923d4c7fcbbc4ba9b96e9b99153 False False True False False 66m
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-276efa5b5cb35d6ff2e2a87d495bf2fc True False False 3 3 3 0 87m
worker rendered-worker-6fd45a1e637e9f3af77af474bb8bb138 False True False 3 0 0 0 87m
worker rendered-worker-6fd45a1e637e9f3af77af474bb8bb138 False True False 3 1 1 0 90m
$ oc scale --replicas 2 machinesets.machine.openshift.io -n openshift-machine-api dalia-1411a-qxnvp-worker-us-east-2c
machineset.machine.openshift.io/dalia-1411a-qxnvp-worker-us-east-2c scaled
$ oc get machinesets.machine.openshift.io -n openshift-machine-api
NAME DESIRED CURRENT READY AVAILABLE AGE
dalia-1411a-qxnvp-worker-us-east-2a 1 1 1 1 107m
dalia-1411a-qxnvp-worker-us-east-2b 1 1 1 1 107m
dalia-1411a-qxnvp-worker-us-east-2c 2 2 2 2 107m
$ oc debug node/ip-10-0-79-73.us-east-2.compute.internal -- chroot /host rpm-ostree status
Starting pod/ip-10-0-79-73us-east-2computeinternal-debug-s4dpd ...
To use host binaries, run `chroot /host`
State: idle
Deployments:
* ostree-unverified-registry:quay.io/mcoqe/layering@sha256:da49d058c6b431d20fc7875e1248dea283e85fec725259ef8bd9920f304708ee
Digest: sha256:da49d058c6b431d20fc7875e1248dea283e85fec725259ef8bd9920f304708ee
Version: 9.6.20251105-0 (2025-11-14T05:57:48Z)
Removing debug pod ...
$ oc scale --replicas 2 machinesets.machine.openshift.io -n openshift-machine-api dalia-1411a-qxnvp-worker-us-east-2b
machineset.machine.openshift.io/dalia-1411a-qxnvp-worker-us-east-2b scaled
$ oc get machinesets.machine.openshift.io -n openshift-machine-api
NAME DESIRED CURRENT READY AVAILABLE AGE
dalia-1411a-qxnvp-worker-us-east-2a 1 1 1 1 122m
dalia-1411a-qxnvp-worker-us-east-2b 2 2 2 2 121m
dalia-1411a-qxnvp-worker-us-east-2c 2 2 2 2 121m
$ oc debug node/ip-10-0-38-95.us-east-2.compute.internal -- chroot /host rpm-ostree status
Starting pod/ip-10-0-38-95us-east-2computeinternal-debug-pxnkc ...
To use host binaries, run `chroot /host`
State: idle
Deployments:
* ostree-unverified-registry:quay.io/mcoqe/layering@sha256:da49d058c6b431d20fc7875e1248dea283e85fec725259ef8bd9920f304708ee
Digest: sha256:da49d058c6b431d20fc7875e1248dea283e85fec725259ef8bd9920f304708ee
Version: 9.6.20251105-0 (2025-11-14T05:57:48Z)
Removing debug pod ...
$ oc get machinesets.machine.openshift.io -n openshift-machine-api dalia-1411a-qxnvp-worker-us-east-2c -o yaml > ms.yaml
# Edit the name in ms.yaml
$ oc apply -f ms.yaml
machineset.machine.openshift.io/dalia-1411a-qxnvp-worker-us-east-2c-ocl created
$ oc get machinesets.machine.openshift.io -n openshift-machine-api
NAME DESIRED CURRENT READY AVAILABLE AGE
dalia-1411a-qxnvp-worker-us-east-2a 1 1 1 1 135m
dalia-1411a-qxnvp-worker-us-east-2b 2 2 2 2 135m
dalia-1411a-qxnvp-worker-us-east-2c 2 2 2 2 135m
dalia-1411a-qxnvp-worker-us-east-2c-ocl 2 2 2 2 4m20s
$ oc debug node/ip-10-0-66-77.us-east-2.compute.internal -- chroot /host rpm-ostree status
Starting pod/ip-10-0-66-77us-east-2computeinternal-debug-j94ds ...
To use host binaries, run `chroot /host`
State: idle
Deployments:
* ostree-unverified-registry:quay.io/mcoqe/layering@sha256:da49d058c6b431d20fc7875e1248dea283e85fec725259ef8bd9920f304708ee
Digest: sha256:da49d058c6b431d20fc7875e1248dea283e85fec725259ef8bd9920f304708ee
Version: 9.6.20251105-0 (2025-11-14T05:57:48Z)
Removing debug pod ...
Similarly verified for the internal registry.
/label qe-approved |
|
@ptalgulk01: This PR has been marked as verified by In response to this:
|
/test e2e-gcp-op-1of2 |
|
/test e2e-hypershift |
|
/test e2e-aws-ovn |
|
/test e2e-hypershift |
|
/test e2e-hypershift |
|
/override ci/prow/e2e-hypershift |
|
@dkhater-redhat: Overrode contexts on behalf of dkhater-redhat: ci/prow/e2e-hypershift |
|
/retest-required |
|
/override ci/prow/images |
|
@dkhater-redhat: Overrode contexts on behalf of dkhater-redhat: ci/prow/images, ci/prow/okd-scos-images |
|
@dkhater-redhat: The following test failed, say
|
/override e2e-aws-ovn-upgrade |
|
@dkhater-redhat: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context. |
|
/override ci/prow/e2e-aws-ovn |
|
@dkhater-redhat: Overrode contexts on behalf of dkhater-redhat: ci/prow/e2e-aws-ovn, ci/prow/e2e-aws-ovn-upgrade |
Merged commit b3e3c8a into openshift:main
- What I did
Added a safety check so that the MCS serves a layered image during node bootstrap only when UpdatedMachineCount > 0. This ensures that at least one node has validated the build before it is served to newly scaled nodes. The check also prevents serving internal registry images during bootstrap, because DNS is unavailable at that stage.
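The decision described above can be sketched roughly as follows. This is only an illustration in shell, not the MCS code from this PR (which is Go). The function name `image_for_bootstrap` is hypothetical, and matching on the internal registry service hostname is an assumption about how an internal registry pullspec might be detected.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the rule from the PR description, not the real MCS code.
# Assumption: internal registry pullspecs start with the registry service hostname.
INTERNAL_REGISTRY_HOST="image-registry.openshift-image-registry.svc"

# image_for_bootstrap UPDATED_MACHINE_COUNT LAYERED_PULLSPEC
# Prints "layered" if the layered image may be served at bootstrap, else "base".
image_for_bootstrap() {
  local updated_count="${1:-0}" pullspec="${2:-}"

  # No layered image built yet: nothing to serve but the base image.
  if [ -z "$pullspec" ]; then
    echo base
    return
  fi

  # No node has adopted (validated) the build yet.
  if [ "$updated_count" -le 0 ]; then
    echo base
    return
  fi

  case "$pullspec" in
    # Internal registry is not resolvable during bootstrap (DNS unavailable).
    "${INTERNAL_REGISTRY_HOST}"*) echo base ;;
    *) echo layered ;;
  esac
}
```

For example, `image_for_bootstrap 1 quay.io/mcoqe/layering:ocl` takes the layered (1-reboot) path, while an internal registry pullspec falls back to the base image. That fallback is consistent with the QE note that the internal registry always needs 2 reboots.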
- How to verify it
- Builds layered image
- Waits for first node to adopt image (UpdatedMachineCount > 0)
- Scales up MachineSet
- Verifies new node gets layered image during bootstrap (1-reboot path)
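For manual verification, the "wait for first node" step could be scripted along these lines. This is a minimal sketch, assuming `oc` is already logged in to the cluster. The function name, timeout, and poll interval are illustrative and not taken from the PR. `status.updatedMachineCount` is the MachineConfigPool field behind the UPDATEDMACHINECOUNT column of `oc get mcp`.

```shell
#!/usr/bin/env bash
# Sketch only: block until at least one node in the pool has adopted the new config.
# wait_for_updated_machine POOL [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_updated_machine() {
  local pool="${1:-worker}" timeout="${2:-1800}" interval="${3:-15}"
  local elapsed=0 count

  while [ "$elapsed" -lt "$timeout" ]; do
    # Read the pool's updated machine count; empty output is treated as 0.
    count="$(oc get mcp "$pool" -o jsonpath='{.status.updatedMachineCount}' 2>/dev/null)"
    if [ "${count:-0}" -gt 0 ]; then
      echo "pool ${pool}: updatedMachineCount=${count}"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done

  echo "pool ${pool}: no updated machine after ${timeout}s" >&2
  return 1
}
```

Once `wait_for_updated_machine worker` returns successfully, the MachineSet can be scaled with `oc scale --replicas 2 machinesets.machine.openshift.io -n openshift-machine-api <machineset>`, and the new node checked with `oc debug node/<node> -- chroot /host rpm-ostree status`, as in the QE steps.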
- Description for the changelog