-
Notifications
You must be signed in to change notification settings - Fork 458
OCPBUGS-56648: fixes systemd unit creation for empty units #5086
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-56648: fixes systemd unit creation for empty units #5086
Conversation
|
@cheesesashimi: This pull request references Jira Issue OCPBUGS-56648, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| @@ -745,7 +745,7 @@ func TestDontDeleteRPMFiles(t *testing.T) { | |||
| func TestIgn3Cfg(t *testing.T) { | |||
| cs := framework.NewClientSet("") | |||
|
|
|||
| delete := helpers.CreateMCP(t, cs, "infra") | |||
| deleteFunc := helpers.CreateMCP(t, cs, "infra") | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewer: I changed this because delete() is a built-in function in Golang. While it was not causing any problems, I opted to proactively fix it.
isabella-janssen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, I will leave final tagging to someone else on the team with more context.
Side note: Good catch with changing the delete var name to deleteFunc.
yuqi-zhang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall lgtm, some thoughts inline on logging and erroring. Thanks for adding the comprehensive testing! Bonus points for adding it to the existing general ignition test so we don't have to add any extra time/reboots :)
| disabledUnits = append(disabledUnits, u.Name) | ||
| // Only when a unit has contents should we attempt to enable or disable it. | ||
| // See: https://issues.redhat.com/browse/OCPBUGS-56648 | ||
| if unitHasContent(u) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: maybe we should log it like we do for the preset below, just to have something to reference that we attempted to enable/disable but we could not find contents.
If we do, we should be careful about the message since some bug reporters have thought that the preset failure is fatal (presumably since it says Error msg), might be a good time to rephrase that as well.
| // Only when a unit has contents should we attempt to enable or disable it. | ||
| // See: https://issues.redhat.com/browse/OCPBUGS-56648 | ||
| if unitHasContent(u) { | ||
| if *u.Enabled { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am +1 on going with the match-ignition approach. One very minor concern is that the few times this bug has popped up, it was because the user was trying to use a layered image build with a service built in, but enabling the service via a MachineConfig.
Before, if they apply both changes in the same update, what happened was: the update fails since it can't find the service, which allows the user to at least try to figure out what happened
Now, if they apply both changes in the same update, the service enablement is skipped on this update, since the MCD can't find it, but the actual service is being staged as part of the OS update. Upon reboot, the MCD doesn't error, but the service that the user might expect to be enabled is not there. Upon a second reboot sometime in the future, it actually gets enabled properly since now the service exists, which might catch some users off guard.
So then the options I can see are:
- say that's fine and direct users to either apply the image first and then enable the service via a second update, or try to build that into the image directly
- have the post-reboot MCD run the list again and try to make sure everything is enabled/started
- leave it as is for now and eventually, once layering is the default mechanism, we should probably be building the service and enablement into the update image directly instead of the hybrid management we have now
I'm happy with 1/3 but just wanted to write that out in case someone feels differently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I'm in agreement. But something I want to point out is the OCL reboot work which will effectively stop baking the MachineConfigs into the OS image. The advice to give at that point is that they should not use a MachineConfig for enabling the service in that case.
41489d7 to
713bbd7
Compare
yuqi-zhang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
|
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
713bbd7 to
6a29a07
Compare
|
/remove-lifecycle rotten |
6a29a07 to
40ad111
Compare
|
/retest-required |
1 similar comment
|
/retest-required |
If a MachineConfig is applied with empty systemd unit contents, the MCD will degrade because it skips writing the file in that particular situation. For parity with the CoreOS Ignition implementation, we should not attempt to enable or disable any systemd units where the unit file does not have any contents.
40ad111 to
b14bd5c
Compare
|
/test unit |
isabella-janssen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Re-applying the label that was lost due to a rebase. The tests are still passing post-rebase, so this should be good to go.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cheesesashimi, isabella-janssen, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/jira refresh |
|
@isabella-janssen: This pull request references Jira Issue OCPBUGS-56648, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Verified using IPI on AWS
The following tests were executed and passed: /label qe-approved |
|
@sergiordlr: This PR has been marked as verified by In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@cheesesashimi: This pull request references Jira Issue OCPBUGS-56648, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest-required |
|
@cheesesashimi: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/hold Revision b14bd5c was retested 3 times: holding |
|
/unhold |
b4ea81d
into
openshift:main
|
@cheesesashimi: Jira Issue Verification Checks: Jira Issue OCPBUGS-56648 Jira Issue OCPBUGS-56648 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
- What I did
If a MachineConfig is applied with empty systemd unit contents, the MCD will degrade because it skips writing the file in that particular situation. For parity with the CoreOS Ignition implementation, we should not attempt to enable or disable any systemd units where the unit file does not have any contents.
This PR also extends the TestIgn3Cfg e2e test to include systemd units as well as assertions for verifying that the on-disk state and systemd state is as expected.
- How to verify it
The e2e tests have been modified to perform this automatically.
- Description for the changelog
Fixes systemd unit creation for empty units