@cheesesashimi cheesesashimi commented May 28, 2025

- What I did

If a MachineConfig is applied with empty systemd unit contents, the MCD will degrade because it skips writing the file in that particular situation. For parity with the CoreOS Ignition implementation, we should not attempt to enable or disable any systemd units where the unit file does not have any contents.

This PR also extends the TestIgn3Cfg e2e test to include systemd units, along with assertions verifying that the on-disk state and the systemd state are as expected.
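
To make the intended behavior concrete, here is a minimal Go sketch of the guard described above. The `Unit` type and the `planUnits` helper are hypothetical stand-ins for illustration. Only the name `unitHasContent` appears in this PR's diff, and the body shown here is an assumption, not the actual MCD code.

```go
package main

import "fmt"

// Unit is a hypothetical, trimmed-down stand-in for an Ignition systemd unit.
type Unit struct {
	Name     string
	Enabled  *bool
	Contents *string
}

// unitHasContent reports whether the unit carries a non-empty unit file body.
// Assumption: the real helper may check more than this.
func unitHasContent(u Unit) bool {
	return u.Contents != nil && *u.Contents != ""
}

// planUnits splits units into names to enable, names to disable, and names
// skipped because they have no contents (and therefore no file on disk).
func planUnits(units []Unit) (enable, disable, skipped []string) {
	for _, u := range units {
		if u.Enabled == nil {
			continue // no enablement was requested for this unit
		}
		if !unitHasContent(u) {
			skipped = append(skipped, u.Name)
			continue
		}
		if *u.Enabled {
			enable = append(enable, u.Name)
		} else {
			disable = append(disable, u.Name)
		}
	}
	return enable, disable, skipped
}

func main() {
	enabled := true
	body := "[Service]\nExecStart=/bin/true\n"
	units := []Unit{
		{Name: "empty.service", Enabled: &enabled},
		{Name: "real.service", Enabled: &enabled, Contents: &body},
	}
	enable, disable, skipped := planUnits(units)
	fmt.Println("enable:", enable, "disable:", disable, "skipped:", skipped)
}
```

In this sketch, a unit with no contents is neither enabled nor disabled, so a later systemctl call never runs against a file that was never written.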

- How to verify it

  1. Bring up a cluster.
  2. Apply a MachineConfig which adds an empty systemd unit.
  3. The node(s) should roll out the config as usual. However, the file will not be created on the node(s) nor will systemd be aware of the unit.

The e2e tests have been modified to perform this automatically.

- Description for the changelog
Fixes systemd unit creation for empty units

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 28, 2025
@openshift-ci-robot
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-56648, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from djoshy and yuqi-zhang May 28, 2025 21:46
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 28, 2025
@@ -745,7 +745,7 @@ func TestDontDeleteRPMFiles(t *testing.T) {
 func TestIgn3Cfg(t *testing.T) {
 	cs := framework.NewClientSet("")
 
-	delete := helpers.CreateMCP(t, cs, "infra")
+	deleteFunc := helpers.CreateMCP(t, cs, "infra")
Member Author

Note to reviewer: I changed this because delete() is a built-in function in Go. Although shadowing it was not causing any problems, I opted to fix it proactively.
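
For readers unfamiliar with the issue, here is a small illustrative Go sketch of why the rename matters. The `removePool` function and its arguments are hypothetical and are not taken from the test file.

```go
package main

import "fmt"

// removePool mimics the shape of the test: a cleanup closure plus a map
// deletion in the same scope. All names here are hypothetical.
func removePool(pools map[string]bool, name string) int {
	// Had this variable been named "delete", it would shadow the built-in
	// for the rest of the function, and the delete(pools, name) call below
	// would no longer compile.
	deleteFunc := func() { fmt.Println("cleanup for", name) }
	defer deleteFunc()

	delete(pools, name) // the built-in is still reachable here
	return len(pools)
}

func main() {
	pools := map[string]bool{"infra": true, "worker": true}
	fmt.Println("pools left:", removePool(pools, "infra"))
}
```

Shadowing a built-in is legal Go, so the compiler only complains once something in the same scope needs the built-in again.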

Member

@isabella-janssen isabella-janssen left a comment

lgtm, I will leave final tagging to someone else on the team with more context.

Side note: Good catch with changing the delete var name to deleteFunc.

Contributor

@yuqi-zhang yuqi-zhang left a comment

Overall lgtm, some thoughts inline on logging and erroring. Thanks for adding the comprehensive testing! Bonus points for adding it to the existing general ignition test so we don't have to add any extra time/reboots :)

disabledUnits = append(disabledUnits, u.Name)
// Only when a unit has contents should we attempt to enable or disable it.
// See: https://issues.redhat.com/browse/OCPBUGS-56648
if unitHasContent(u) {
Contributor

Thought: maybe we should log it like we do for the preset below, just to have something to reference that we attempted to enable/disable but we could not find contents.

If we do, we should be careful about the message, since some bug reporters have thought that the preset failure is fatal (presumably because it says "Error msg"). This might be a good time to rephrase that as well.
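
As a rough sketch of the kind of message this suggests, the following Go snippet builds an informational line that avoids the word "Error". The `skipMessage` helper is hypothetical. The wording mirrors the log line quoted in the QE verification later in this thread, but how the MCD actually emits it is not shown here.

```go
package main

import "fmt"

// skipMessage builds an informational (non-error) line for a unit whose
// enable or disable step is skipped. The helper name is hypothetical.
func skipMessage(action, unitName string) string {
	return fmt.Sprintf("Could not %s unit %q, because it has no contents, skipping", action, unitName)
}

func main() {
	fmt.Println(skipMessage("enable", "my22unit.service"))
	fmt.Println(skipMessage("disable", "my22unit.service"))
}
```

Phrasing the line as a skipped step rather than a failure should make it less likely that bug reporters read it as fatal.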

// Only when a unit has contents should we attempt to enable or disable it.
// See: https://issues.redhat.com/browse/OCPBUGS-56648
if unitHasContent(u) {
if *u.Enabled {
Contributor

I am +1 on going with the match-ignition approach. One very minor concern is that the few times this bug has popped up, it was because the user was trying to use a layered image build with a service built in, but enabling the service via a MachineConfig.

Before, if they applied both changes in the same update, the update failed because the MCD couldn't find the service, which at least allowed the user to try to figure out what happened.

Now, if they apply both changes in the same update, the service enablement is skipped on this update because the MCD can't find the service, even though the actual service is being staged as part of the OS update. Upon reboot, the MCD doesn't error, but the service that the user might expect to be enabled is not there. Upon a second reboot sometime in the future, it actually gets enabled properly since the service now exists, which might catch some users off guard.

So then the options I can see are:

  1. say that's fine and direct users to either apply the image first and then enable the service via a second update, or try to build that into the image directly
  2. have the post-reboot MCD run the list again and try to make sure everything is enabled/started
  3. leave it as is for now and eventually, once layering is the default mechanism, we should probably be building the service and enablement into the update image directly instead of the hybrid management we have now

I'm happy with 1/3 but just wanted to write that out in case someone feels differently
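
For illustration only, option 2 could look roughly like the following Go sketch. Nothing here is part of this PR: the `pendingUnit` type, the `retryAfterReboot` function, and the `exists` callback (a stand-in for an on-disk lookup) are all hypothetical.

```go
package main

import "fmt"

// pendingUnit is a hypothetical record of an enablement that was skipped
// before the reboot because the unit had no contents in the MachineConfig.
type pendingUnit struct {
	Name    string
	Enabled bool
}

// retryAfterReboot returns the units whose enablement could now be applied
// because a unit file exists on disk, and those that are still missing.
// The exists callback stands in for an on-disk lookup.
func retryAfterReboot(pending []pendingUnit, exists func(name string) bool) (apply, missing []pendingUnit) {
	for _, u := range pending {
		if exists(u.Name) {
			apply = append(apply, u)
		} else {
			missing = append(missing, u)
		}
	}
	return apply, missing
}

func main() {
	// Pretend the layered OS image shipped layered.service but not typo.service.
	onDisk := map[string]bool{"layered.service": true}
	pending := []pendingUnit{
		{Name: "layered.service", Enabled: true},
		{Name: "typo.service", Enabled: true},
	}
	apply, missing := retryAfterReboot(pending, func(name string) bool { return onDisk[name] })
	fmt.Println("apply now:", apply)
	fmt.Println("still missing:", missing)
}
```

The trade-off is that a second pass makes the MCD responsible for units it did not write, which is part of why options 1 and 3 may be preferable.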

Member Author

Overall, I'm in agreement. But something I want to point out is the OCL reboot work which will effectively stop baking the MachineConfigs into the OS image. The advice to give at that point is that they should not use a MachineConfig for enabling the service in that case.

@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-56648 branch from 41489d7 to 713bbd7 Compare June 12, 2025 19:12
Contributor

@yuqi-zhang yuqi-zhang left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2025
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 12, 2025
@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 12, 2025
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 12, 2025
@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-56648 branch from 713bbd7 to 6a29a07 Compare October 15, 2025 19:01
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 15, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 15, 2025
@cheesesashimi
Member Author

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 16, 2025
@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-56648 branch from 6a29a07 to 40ad111 Compare October 20, 2025 16:01
@cheesesashimi
Member Author

/retest-required

@cheesesashimi cheesesashimi force-pushed the zzlotnik/OCPBUGS-56648 branch from 40ad111 to b14bd5c Compare October 22, 2025 14:49
@cheesesashimi
Member Author

/test unit

Member

@isabella-janssen isabella-janssen left a comment

/lgtm

Re-applying the label that was lost due to a rebase. The tests are still passing post-rebase, so this should be good to go.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 28, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, isabella-janssen, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cheesesashimi,isabella-janssen,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@isabella-janssen
Member

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 29, 2025
@openshift-ci-robot
Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-56648, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/jira refresh

@openshift-ci openshift-ci bot requested a review from sergiordlr October 29, 2025 14:42
@sergiordlr
Contributor

Verified using IPI on AWS

  1. Apply a MC with an enabled empty unit
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: mc-unit-22
spec:
  config:
    ignition:
      version: 2.2.0
    systemd:
      units:
      - name: my22unit.service
        enable: true
  2. Wait for the configuration to be applied. No degradation happens.
  3. Check the logs
I1113 08:29:34.794049    2761 file_writers.go:330] Unit "my22unit.service" has no content, skipping write
I1113 08:29:34.794056    2761 update.go:2140] Could not enable unit "my22unit.service", because it has no contents, skipping
  4. No unit is created
 oc debug node/ip-10-0-11-108.us-east-2.compute.internal -q -- chroot /host systemctl status my22unit.service
Unit my22unit.service could not be found.
error: non-zero exit code from debug container

The following tests were executed and passed:

passed: (25m30s) 2025-11-13T09:06:44 "[sig-mco] MCO Author:sregidor-Longduration-NonPreRelease-High-47008-Config Drift. Dropin file. [Serial]"
passed: (14m18s) 2025-11-13T09:21:02 "[sig-mco] MCO NodeDisruptionPolicy Author:rioliu-NonPreRelease-Longduration-High-73411-NodeDisruptionPolicy units with multiple actions [Disruptive] [Serial]"
passed: (25m34s) 2025-11-13T09:46:36 "[sig-mco] MCO Author:sregidor-Longduration-NonPreRelease-High-47009-Config Drift. New Service Unit. [Serial]"
passed: (19m32s) 2025-11-13T10:06:08 "[sig-mco] MCO Author:sregidor-NonPreRelease-Longduration-Medium-56614-[P2][OnCLayer] Create unit with content and mask=true[Disruptive] [Serial]"
passed: (3m32s) 2025-11-13T10:09:41 "[sig-mco] MCO NodeDisruptionPolicy Author:rioliu-NonPreRelease-High-73414-[P1] NodeDisruptionPolicy units with action None [Disruptive] [Serial]"
passed: (18m23s) 2025-11-13T10:28:04 "[sig-mco] MCO NodeDisruptionPolicy Author:rioliu-NonPreRelease-Longduration-High-73413-[P2] NodeDisruptionPolicy units with action Reboot [Disruptive] [Serial]"

/label qe-approved
/verified by @sergiordlr

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Nov 13, 2025
@openshift-ci-robot
Contributor

@sergiordlr: This PR has been marked as verified by @sergiordlr.

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Nov 13, 2025
@openshift-ci-robot
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-56648, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD eeabc73 and 2 for PR HEAD b14bd5c in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ca0c19d and 1 for PR HEAD b14bd5c in total

@isabella-janssen
Member

/retest-required

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 661e30e and 0 for PR HEAD b14bd5c in total

@openshift-ci
Contributor

openshift-ci bot commented Nov 17, 2025

@cheesesashimi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-hypershift-techpreview | 41489d7 | link | false | /test e2e-hypershift-techpreview |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 713bbd7 | link | false | /test e2e-azure-ovn-upgrade-out-of-change |
| ci/prow/e2e-gcp-op | 713bbd7 | link | true | /test e2e-gcp-op |
| ci/prow/e2e-gcp-op-techpreview | 713bbd7 | link | false | /test e2e-gcp-op-techpreview |
| ci/prow/e2e-gcp-op-ocl | 713bbd7 | link | false | /test e2e-gcp-op-ocl |
| ci/prow/e2e-agent-compact-ipv4 | 713bbd7 | link | false | /test e2e-agent-compact-ipv4 |
| ci/prow/bootstrap-unit | b14bd5c | link | false | /test bootstrap-unit |
| ci/prow/okd-scos-e2e-aws-ovn | b14bd5c | link | false | /test okd-scos-e2e-aws-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot
Contributor

/hold

Revision b14bd5c was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 17, 2025
@isabella-janssen
Member

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 18, 2025
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD b3e3c8a and 2 for PR HEAD b14bd5c in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD d151372 and 1 for PR HEAD b14bd5c in total

@openshift-merge-bot openshift-merge-bot bot merged commit b4ea81d into openshift:main Nov 19, 2025
13 of 15 checks passed
@openshift-ci-robot
Contributor

@cheesesashimi: Jira Issue Verification Checks: Jira Issue OCPBUGS-56648
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-56648 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
