OCPBUGS-33013: certsyncpod+installerpod: Swap secret/cm directories atomically #2009
base: master
|
@tchap: This pull request references Jira Issue OCPBUGS-33013, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
I actually have to make sure this can be merged as this is only supported on Linux 3.15 or later. /hold |
|
This patch should be OK for RHEL 8 or later based on https://access.redhat.com/articles/3078 The latest CI for OCP 4.21 actually uses RHEL 9.6. |
|
The PR using this change in cluster-kube-apiserver-operator seems to be passing on CI, I deem this ready. /unhold |
|
@tchap is there a must-gather from an incident I could take a look at? |
|
@p0lyn0mial I think that I consulted this one: https://access.redhat.com/support/cases/#/case/03849958/discussion?attachmentId=a096R00003JpGgMQAV |
|
@vrutkovs do you have time to take a look at this issue? I think the issue might be real. I think it occurs when a two-file cert is replaced: it can happen that the server picks up the update, notices the public/private key mismatch, and crashes. Is there a way to reproduce this issue? |
}

func (c *CertSyncController) sync(ctx context.Context, syncCtx factory.SyncContext) error {
if err := dirutils.RemoveContent(getStagingDir(c.destinationDir)); err != nil {
mhm, maybe this could be done in the Sync method, after creating the staging dir. wdyt?
It would make sense, but here we just call it once for the whole staging area while Sync works per object/directory. So actually it makes sense to call it here to prune old staging directories, not just the directory for the object being staged.
contentDir := getSecretDir(resourceDir, secretBaseName)
stagingDir := getSecretStagingDir(resourceDir, secretBaseName)

if err := atomicdir.Sync(contentDir, 0700, stagingDir, secret.Data, 0600); err != nil {
The Sync method expects that the filenames don't contain a path, right?
Yeah, it's actually being checked.
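For illustration, a minimal sketch of what such a check could look like. This is not the code from the PR: `validateFileNames` is a hypothetical helper, and it assumes that the keys of the secret/configmap data map are used directly as file names inside the target directory.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// validateFileNames is a hypothetical helper (not taken from the PR).
// It rejects any key that is not a plain file name, so that a key can
// never point outside of, or into a subdirectory of, the target directory.
func validateFileNames(files map[string][]byte) error {
	for name := range files {
		if name == "" || name == "." || name == ".." || filepath.Base(name) != name {
			return fmt.Errorf("invalid file name %q: must not contain a path", name)
		}
	}
	return nil
}

func main() {
	// Plain file names pass the check.
	fmt.Println(validateFileNames(map[string][]byte{"tls.crt": nil, "tls.key": nil}))
	// prints: <nil>

	// A key with a path component is rejected.
	fmt.Println(validateFileNames(map[string][]byte{"../tls.key": nil}))
	// prints: invalid file name "../tls.key": must not contain a path
}
```

Rejecting such keys before anything is written means a bad object cannot leave a half-populated staging directory behind.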
|
@p0lyn0mial I pushed a complete change now with some unit tests added. They are not complete: failing FS operations are not tested, and neither are some combinations of sync/get errors, because mocking them is annoying. Let me know whether you require more tests. There were none before, so... |
files[k] = []byte(v)
}
c.eventRecorder.Eventf("CertificateUpdated", "Wrote updated configmap: %s/%s", configMap.Namespace, configMap.Name)
// XXX: Are these permissions correct?
Not sure about this actually...
})
}
}
I just moved this into a separate package to be able to import it.
|
installerpod BTW does contain some tests regarding the files written, so I haven't touched or improved that. |
|
Updated openshift/cluster-kube-apiserver-operator#1917 with the current changes. |
}

func (c *CertSyncController) sync(ctx context.Context, syncCtx factory.SyncContext) error {
if err := dirutils.RemoveContent(getStagingDir(c.destinationDir)); err != nil {
I think we should clean up per resource (cm/secret); otherwise old data might be left over from previous runs, right?
I think we should do that inside the Sync func, wdyt?
We could add the cleaning somewhere here and fail if the cleaning fn returns an err, wdyt?
atomicdir.Sync works with a particular object, right? So you would tell it to sync cm into configmaps/cm and use staging/cert-sync/configmaps/cm for staging. So what should it prune exactly? When there is a leftover in staging/cert-sync, Sync cannot really remove it as it works with a subdir of that path. So my idea was to prune staging/cert-sync at the beginning of sync so that we are clean and that's it.
Having said that, we can also extend Sync to remove everything from staging/cert-sync/configmaps/cm just to be sure there are no leftovers and it's encapsulated, but I wanted to ensure that on a higher level with a single rm call.
filePerms := os.FileMode(0600)
if strings.HasSuffix(fullFilename, ".sh") {
filePerms = 0755
}
Hmm, didn't notice this check for custom permission setting...
strings.HasSuffix(path, "/staging/cert-sync/secrets") ||
strings.HasSuffix(path, "/staging/cert-sync/configmaps") ||
path == filepath.Join(controller.destinationDir, "configmaps") ||
path == filepath.Join(controller.destinationDir, "secrets") {
This is not too pretty, but meh.
added a few more comments. overall lgtm.
please also test this pr with some operator e.g. kas-o
Use atomicdir.Sync to write the target secret/configmap directories so that they stay synchronized with the relevant objects. Added unit tests, but the coverage is not complete. In particular, failing filesystem operations are not tested.
|
@tchap: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Currently it can happen that cert-syncer replaces some of the secret/configmap files successfully and then fails. This can lead to problems when these are e.g. TLS cert/key files and the directory gets inconsistent. This may seem transient, but when cert-syncer is terminated in the middle, it can later fail to start as the whole kube-apiserver gets into a crash loop.
This introduces a new staticpod.SwapDirectoriesAtomic, which uses unix.Renameat2 with the RENAME_EXCHANGE flag set.