Conversation

@munnerz (Member) commented Aug 31, 2023

replaces #48

Some issuers take a very long time to approve and issue a certificate request. Because the CSI driver's NodePublishVolume call has its own implicit timeout (1 minute), which cannot be changed easily, we would like to be able to 'resume' a request even if it has taken longer than 1 minute to complete.

As a concrete example, say an issuer always takes 90s to complete a request (for whatever reason). In this case, the driver will wait 60s, the context will time out, and the request will be issued 30s later; however, each retried NodePublishVolume call will create a brand new request, so we will keep creating new requests indefinitely.

This is obviously not desirable, so persisting a reference to the crypto.PrivateKey in memory allows us to 'resume' the request if it is still usable.

This differs from #48 in that I've avoided modifying large parts of the existing code-flows; instead, it basically uses a map as a cache to look up an existing private key.
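
To illustrate the shape of this, here is a minimal sketch of the UID-keyed cache. The field names requestToPrivateKeyLock and requestToPrivateKeyMap and the method deletePendingRequestPrivateKey are taken from later in this thread; pendingRequestPrivateKey is a hypothetical accessor, and the exact types in the PR may differ (e.g. the lock is stored as a *sync.Mutex):

```go
package manager

import (
	"crypto"
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// Manager (heavily abridged): holds the in-memory cache that maps a
// CertificateRequest's UID to the private key used when creating it.
type Manager struct {
	requestToPrivateKeyLock sync.Mutex
	requestToPrivateKeyMap  map[types.UID]crypto.PrivateKey
}

// pendingRequestPrivateKey returns the cached key for a request, if any,
// allowing a retried NodePublishVolume call to 'resume' an in-flight request.
func (m *Manager) pendingRequestPrivateKey(uid types.UID) (crypto.PrivateKey, bool) {
	m.requestToPrivateKeyLock.Lock()
	defer m.requestToPrivateKeyLock.Unlock()
	key, ok := m.requestToPrivateKeyMap[uid]
	return key, ok
}

// deletePendingRequestPrivateKey drops the entry once its request is deleted,
// preventing the map from growing forever.
func (m *Manager) deletePendingRequestPrivateKey(uid types.UID) {
	m.requestToPrivateKeyLock.Lock()
	defer m.requestToPrivateKeyLock.Unlock()
	delete(m.requestToPrivateKeyMap, uid)
}
```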

I've also added an event handler that monitors 'delete' operations on CertificateRequest objects, so we can handle the case where another entity deletes requests and we don't keep stale private keys in memory forever (aka a memory leak).

This PR is still WIP, as I need to add a number of tests for it. To do that effectively, I need #46 to be merged so we can time out the first issue call.

cc @7ing @irbekrm

@jetstack-bot added the dco-signoff: no and size/L labels Aug 31, 2023
return nil, nil
}

// TODO: check if this request is still actually valid for the input metadata
Member Author

I've not done this just yet; I don't think it's quite as important as we may think (volumeAttributes on a pod are not mutable).

The only case where this could be problematic is if a driver's implementation of generateRequest is non-deterministic/can change between calls. To properly handle the wide range of weird setups users may have, we may actually need to push this comparison function into the driver implementer's interface...

Contributor

Consider checking privateKey against the CSR's public key here?

Member Author

Given that the internal cache is in memory, UIDs are guaranteed to be unique, and the CertificateRequest resource is immutable, I don't think it's actually essential for us to implement the private key <-> public key check...

I think I am also going to leave this TODO for now, as it isn't really something we handle at all at the moment (and it does raise questions around timing, e.g. if the driver is random, when can we ever really stop?). I'd like us to expand on our expectations around how drivers implement generateRequest before over-complicating this code path :)
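
For reference, the suggested check could look roughly like this in Go. This is a sketch only; keyMatchesCSR is a hypothetical helper, not something this PR adds:

```go
package manager

import (
	"crypto"
	"crypto/x509"
	"encoding/pem"
	"errors"
)

// keyMatchesCSR illustrates the suggested check: verify that a cached private
// key corresponds to the public key embedded in a PEM-encoded CSR.
func keyMatchesCSR(key crypto.PrivateKey, csrPEM []byte) (bool, error) {
	block, _ := pem.Decode(csrPEM)
	if block == nil {
		return false, errors.New("failed to decode CSR PEM")
	}
	csr, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return false, err
	}
	signer, ok := key.(crypto.Signer)
	if !ok {
		return false, errors.New("private key does not implement crypto.Signer")
	}
	// rsa.PublicKey, ecdsa.PublicKey and ed25519.PublicKey all implement
	// Equal(crypto.PublicKey) bool since Go 1.15.
	pub, ok := signer.Public().(interface{ Equal(crypto.PublicKey) bool })
	if !ok {
		return false, errors.New("public key type does not support Equal")
	}
	return pub.Equal(csr.PublicKey), nil
}
```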

Comment on lines +199 to +206
// begin a background routine which periodically checks to ensure all members of the pending request map actually
// have corresponding CertificateRequest objects in the apiserver.
// This avoids leaking memory if we don't observe a request being deleted, or we observe it after the lister has purged
// the request data from its cache.
// this routine must be careful to not delete entries from this map that have JUST been added to the map, but haven't
// been observed by the lister yet (else it may purge data we want to keep, causing a whole new request cycle).
// for now, to avoid this case, we only run the routine every 5 minutes. It would be better if we recorded the time we
// added the entry to the map instead, and only purged items from the map that are older than N duration (TBD).
Contributor

Just curious, when would the informer lose a delete event from the API server? I thought the informer factory guarantees a periodic resync to ensure it captures all events, for eventual consistency?

As mentioned in your comment, I'm not sure how to prevent a newly added entry from being deleted while the lister is not yet in sync. If a request happens right at the 5-minute edge, it will be deleted immediately, as the lister does not have it in its cache yet.

Member Author

Yep, with a resync period we will definitely observe that the CR has been deleted. However, it's not guaranteed that the lister will still have a copy of the object stored; without the object, we can't convert the namespace/name to a known UID to look up in the requestToPrivateKeyMap.

Hence, if we don't have access to the UID once we have observed the delete, we won't be able to de-register/remove it from our own internal map.
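
One related client-go detail: when a delete event is delivered at all, the handler receives either the object itself or a DeletedFinalStateUnknown tombstone wrapping its last known state, so the UID can often be recovered straight from the event; the periodic janitor discussed above is then only needed for deletes that were never observed. A sketch of tombstone-aware handling, assuming a SharedIndexInformer (the PR's actual handler may be shaped differently):

```go
import (
	cmapi "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
	"k8s.io/client-go/tools/cache"
)

// registerDeleteHandler wires a delete handler that can still recover the UID
// when the event arrives as a tombstone.
func (m *Manager) registerDeleteHandler(informer cache.SharedIndexInformer) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			req, ok := obj.(*cmapi.CertificateRequest)
			if !ok {
				// If the watch was interrupted, the delete can be delivered as a
				// DeletedFinalStateUnknown tombstone wrapping the last known object.
				tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
				if !ok {
					return
				}
				if req, ok = tombstone.Obj.(*cmapi.CertificateRequest); !ok {
					return
				}
			}
			m.deletePendingRequestPrivateKey(req.UID)
		},
	})
}
```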

@@ -259,6 +321,10 @@ type Manager struct {
// lister is used as a read-only cache of CertificateRequest resources
lister cmlisters.CertificateRequestLister

// A map that associates a CertificateRequest's UID with its private key.
requestToPrivateKeyLock *sync.Mutex
Contributor

Consider using a sync.RWMutex to improve read performance a little bit?

Member Author

We don't expect to have many concurrent readers, so it seemed like a very negligible performance gain (and a regular mutex keeps things a little simpler for future readers).


@munnerz force-pushed the resume-pending-requests branch 2 times, most recently from d1d751f to 96a41ac on September 1, 2023 12:16
@jetstack-bot added the dco-signoff: yes and dco-signoff: no labels and removed the dco-signoff: no and dco-signoff: yes labels Sep 1, 2023
@munnerz force-pushed the resume-pending-requests branch from 8a6d253 to 356358e on September 1, 2023 12:26
@jetstack-bot added the dco-signoff: yes label and removed the dco-signoff: no label Sep 1, 2023
Comment on lines 549 to 569
case cmapi.CertificateRequestReasonFailed:
return false, fmt.Errorf("request %q has failed: %s", updatedReq.Name, readyCondition.Message)
Contributor

Should we consider calling m.deletePendingRequestPrivateKey(req.UID) when the CertificateRequest is in a Failed condition? It is likely to fail again with the same private key. Creating a new CR might help resolve the problem if it is due to a key reuse issue.

Member Author

We'll automatically use a new private key next time anyway: if a request has failed, it is terminal and will never be returned again by findPendingRequest (i.e. it won't be re-used). This will in turn trigger a new CR to be created and a new private key to be generated (or at least, a new call to generatePrivateKey).

The item will be deleted from the map once the CR is deleted, although yep, perhaps a future optimisation could be to delete terminal failed items from the map a bit early just to save on memory... but it shouldn't make any functional difference :)
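
To make 'terminal' concrete, a failed request can be recognised from the Ready condition's reason, e.g. using cert-manager's apiutil helper (assuming its usual signature; isTerminal is a hypothetical name, not this PR's code):

```go
import (
	apiutil "github.com/cert-manager/cert-manager/pkg/api/util"
	cmapi "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
)

// isTerminal reports whether a CertificateRequest has permanently failed, in
// which case findPendingRequest should never return it for re-use.
func isTerminal(req *cmapi.CertificateRequest) bool {
	cond := apiutil.GetCertificateRequestCondition(req, cmapi.CertificateRequestConditionReady)
	return cond != nil && cond.Reason == cmapi.CertificateRequestReasonFailed
}
```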

func (m *Manager) handleRequest(ctx context.Context, volumeID string, meta metadata.Metadata, key crypto.PrivateKey, req *cmapi.CertificateRequest) error {
log := m.log.WithValues("volume_id", volumeID)

// Poll every 200ms for the CertificateRequest to be ready
Contributor

Any reason to choose 200ms now? This looks like a typical round-trip time for a remote data center query.

Member Author

I had reduced it here as this whole block only ever reads from a local in-memory cache anyway, and it reduced test flakes (there were a few awkward timing issues where we had 2s timeouts but 1s sleeps between each 'loop' here).
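
The polling shape under discussion is roughly as follows. This is a sketch (waitForReady is a hypothetical name, it assumes a recent apimachinery with wait.PollUntilContextCancel and the Manager's lister field shown above, and the real loop also has to handle failed/denied requests):

```go
import (
	"context"
	"time"

	apiutil "github.com/cert-manager/cert-manager/pkg/api/util"
	cmapi "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
	cmmeta "github.com/cert-manager/cert-manager/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForReady polls the (in-memory) lister cache every 200ms until the
// request gains a Ready=True condition or the context is cancelled.
func (m *Manager) waitForReady(ctx context.Context, req *cmapi.CertificateRequest) error {
	return wait.PollUntilContextCancel(ctx, 200*time.Millisecond, true, func(ctx context.Context) (bool, error) {
		updatedReq, err := m.lister.CertificateRequests(req.Namespace).Get(req.Name)
		if err != nil {
			// not found in the cache yet; keep polling rather than failing hard
			return false, nil
		}
		return apiutil.CertificateRequestHasCondition(updatedReq, cmapi.CertificateRequestCondition{
			Type:   cmapi.CertificateRequestConditionReady,
			Status: cmmeta.ConditionTrue,
		}), nil
	})
}
```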

@JoshVanL (Contributor) left a comment

@munnerz A few comments from me, but otherwise looks good to me 🙂

Comment on lines 220 to 221
requestToPrivateKeyLock.Lock()
defer requestToPrivateKeyLock.Unlock()
Contributor

To keep a consistent view of the state of the world between routines, we should lock at the beginning of this function (before listing).

Contributor

This doesn't need to be done ^

Comment on lines 208 to 213
go wait.Until(func() {
reqs, err := lister.List(labels.Everything())
if err != nil {
janitorLogger.Error(err, "failed listing existing requests")
Contributor

A general comment: we currently have no coordination between Stop and our goroutines, so we return from Stop before all resources have been released. We should consider adding this in another PR.
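
A common shape for that coordination, as a sketch rather than this PR's code: track each background routine in a WaitGroup and have Stop block until they have all exited (stopCh would be initialised with make(chan struct{}) at construction time):

```go
import (
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

type janitor struct {
	stopCh chan struct{}
	wg     sync.WaitGroup
}

// start launches the periodic routine and tracks it in the WaitGroup.
func (j *janitor) start(interval time.Duration, run func()) {
	j.wg.Add(1)
	go func() {
		defer j.wg.Done()
		wait.Until(run, interval, j.stopCh)
	}()
}

// stop signals the routine to exit and blocks until it has done so, ensuring
// resources are fully released before returning.
func (j *janitor) stop() {
	close(j.stopCh)
	j.wg.Wait()
}
```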

  // start at the end of the slice and work back to maxRequestsPerVolume
- for i := len(reqs) - 1; i >= m.maxRequestsPerVolume-1; i-- {
+ for i := len(reqs) - 1; i > m.maxRequestsPerVolume-1; i-- {
Contributor

Why has this been changed? Is it because we can now recover the private key between syncs?

Member Author

yeah exactly - this function was previously doing some ✨ weird ✨ counting logic, which is now fixed (and you can see the behaviour change by taking a look at how the unit tests have changed too)

@jetstack-bot added the dco-signoff: no label and removed the dco-signoff: yes label Oct 13, 2023
@munnerz force-pushed the resume-pending-requests branch from 1ecbbb3 to 0f8a34e on October 13, 2023 08:59
@jetstack-bot added the dco-signoff: yes label and removed the dco-signoff: no label Oct 13, 2023
@JoshVanL (Contributor)

/lgtm

@jetstack-bot added the lgtm label Oct 13, 2023
@JoshVanL (Contributor)

/approve

@jetstack-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 7ing, JoshVanL

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot added the approved label Oct 13, 2023
@munnerz force-pushed the resume-pending-requests branch from 0f8a34e to 8293161 on November 30, 2023 14:23
@jetstack-bot added the dco-signoff: no label and removed the lgtm and dco-signoff: yes labels Nov 30, 2023
@munnerz force-pushed the resume-pending-requests branch from 8293161 to 0ce8db0 on November 30, 2023 14:24
@jetstack-bot added the dco-signoff: yes label and removed the dco-signoff: no label Nov 30, 2023
@@ -313,16 +312,16 @@ func TestManager_ManageVolume_exponentialBackOffRetryOnIssueErrors(t *testing.T)
Jitter: expBackOffJitter,
Steps: expBackOffSteps,
}
opts.ReadyToRequest = func(meta metadata.Metadata) (bool, string) {
// ReadyToRequest will be called by issue()
Member Author

This is no longer true, hence as part of this PR I've added a temporary, test-only function that is invoked whenever issue() is called, so tests can count issuances.

This is the lesser of two evils IMO, until we have actual metrics support throughout csi-lib, which will allow us to do stuff like counting issue() calls properly :)
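
The mechanism being described is roughly a test-only callback on the Manager; the names below are illustrative rather than the PR's actual identifiers:

```go
import (
	"context"
	"sync/atomic"
)

// Manager sketch: calledOnEachIssue is only ever set from tests, as a
// stopgap until csi-lib has real metrics for counting issue() calls.
type Manager struct {
	calledOnEachIssue func()
}

func (m *Manager) issue(ctx context.Context, volumeID string) error {
	if m.calledOnEachIssue != nil {
		m.calledOnEachIssue()
	}
	// ... the rest of the issuance flow ...
	return nil
}

// In a test, the hook can count how many times issue() actually ran:
func exampleTestWiring(m *Manager) *int32 {
	var issueCalls int32
	m.calledOnEachIssue = func() { atomic.AddInt32(&issueCalls, 1) }
	return &issueCalls
}
```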

@JoshVanL (Contributor)

/lgtm

@jetstack-bot added the lgtm label Nov 30, 2023
@jetstack-bot merged commit b58fb32 into cert-manager:main Nov 30, 2023
@munnerz deleted the resume-pending-requests branch November 30, 2023 14:29
@7ing mentioned this pull request Jan 16, 2024