add pending request cache to allow for resuming in-flight requests that take longer than a single issuance cycle #51
Conversation
	return nil, nil
}

// TODO: check if this request is still actually valid for the input metadata
I've not done this just yet, as I don't think it's quite as important as we may think (volumeAttributes on a pod are not mutable).
The only case where this could be problematic is if a driver's implementation of generateRequest is non-deterministic/can change between calls. To properly handle the wide range of weird setups users may have, we may actually need to push this comparison function to the driver implementers' interface...
Consider checking privateKey against the CSR's public key here?
Given the internal cache is in memory, UIDs are guaranteed to be unique, and the CertificateRequest resource is immutable, I don't think it's actually essential for us to implement the private key <> public key check...
I think I am also going to leave this TODO for now, as it isn't really something we handle at all at the moment (and it does raise questions around timing, e.g. what if the driver is random - when can we ever really stop?). I'd like us to expand on our expectations around how drivers implement generateRequest before over-complicating this code path :)
// begin a background routine which periodically checks to ensure all members of the pending request map actually
// have corresponding CertificateRequest objects in the apiserver.
// This avoids leaking memory if we don't observe a request being deleted, or we observe it after the lister has purged
// the request data from its cache.
// This routine must be careful not to delete entries from this map that have JUST been added to the map, but haven't
// been observed by the lister yet (else it may purge data we want to keep, causing a whole new request cycle).
// For now, to avoid this case, we only run the routine every 5 minutes. It would be better if we recorded the time we
// added the entry to the map instead, and only purged items from the map that are older than N duration (TBD).
Just curious, when would the informer miss the delete event from the API server? I thought the informerFactory guarantees a resync within a period of time, to ensure it captures all events for eventual consistency?
As mentioned in your comment, I'm not sure how to prevent a newly added entry from being deleted while the lister is not yet in sync. If the request happens right at the 5-minute edge, it will be deleted immediately because the lister does not have it in its cache yet.
Yep, with a resync period we will definitely observe that the CR has been deleted. However, it's not guaranteed that the lister will still have a copy of the object stored - without the object, we can't convert the namespace/name to a known UID to look up in the requestToPrivateKeyMap.
Hence, if we don't have access to the UID once we have observed the delete, we won't be able to de-register/remove it from our own internal map.
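To illustrate the time-based purge idea from the TODO above (record when each entry was added and only purge entries older than some duration), here is a minimal sketch. All names here (pendingRequest, purgeStale, minAge) are illustrative assumptions, not the actual csi-lib code.

```go
package pendingcache

import (
	"crypto"
	"sync"
	"time"
)

// pendingRequest is a hypothetical cache entry: the private key of an
// in-flight CertificateRequest plus the time it was added, so the janitor
// can tolerate entries the lister has not observed yet.
type pendingRequest struct {
	privateKey crypto.PrivateKey
	addedAt    time.Time
}

type cache struct {
	lock    sync.Mutex
	entries map[string]pendingRequest // keyed by CertificateRequest UID
}

// purgeStale removes entries whose UID is not known to the lister AND that
// are older than minAge. Younger unknown entries are kept, since they may
// simply not have been observed by the lister yet.
func (c *cache) purgeStale(knownUIDs map[string]struct{}, minAge time.Duration) {
	c.lock.Lock()
	defer c.lock.Unlock()
	for uid, entry := range c.entries {
		if _, known := knownUIDs[uid]; known {
			continue
		}
		if time.Since(entry.addedAt) < minAge {
			continue // recently added; give the lister time to catch up
		}
		delete(c.entries, uid)
	}
}
```

With timestamps recorded per entry, the janitor could run more often than every 5 minutes without hitting the "just added but not yet observed" edge case described above.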
@@ -259,6 +321,10 @@ type Manager struct {
	// lister is used as a read-only cache of CertificateRequest resources
	lister cmlisters.CertificateRequestLister

	// A map that associates a CertificateRequest's UID with its private key.
	requestToPrivateKeyLock *sync.Mutex
Consider using sync.RWMutex to improve read performance a little?
We don't expect to have many concurrent readers, so it seemed like a negligible performance gain (and using a regular mutex keeps things a little simpler for future readers).
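For reference, a small sketch of the RWMutex variant being suggested (type and method names are illustrative, not the actual csi-lib code); with few concurrent readers, the plain sync.Mutex used in the PR behaves effectively the same:

```go
package pendingcache

import (
	"crypto"
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// requestKeyMap guards a UID -> private key map with a sync.RWMutex so that
// concurrent readers only take the (shared) read lock.
type requestKeyMap struct {
	lock sync.RWMutex
	keys map[types.UID]crypto.PrivateKey
}

func (m *requestKeyMap) get(uid types.UID) (crypto.PrivateKey, bool) {
	m.lock.RLock()
	defer m.lock.RUnlock()
	key, ok := m.keys[uid]
	return key, ok
}

func (m *requestKeyMap) set(uid types.UID, key crypto.PrivateKey) {
	m.lock.Lock()
	defer m.lock.Unlock()
	m.keys[uid] = key
}

func (m *requestKeyMap) delete(uid types.UID) {
	m.lock.Lock()
	defer m.lock.Unlock()
	delete(m.keys, uid)
}
```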
case cmapi.CertificateRequestReasonFailed:
	return false, fmt.Errorf("request %q has failed: %s", updatedReq.Name, readyCondition.Message)
Do we consider calling m.deletePendingRequestPrivateKey(req.UID) when the CertificateRequest is in the Failed condition? It is likely to fail again with the same private key. Creating a new CR might help resolve the problem if it is due to a key-reuse issue.
We'll automatically use a new private key the next time anyway - if a request is failed, it is terminal and so will never be returned again by findPendingRequest (i.e. it won't be re-used). This will in turn trigger a new CR to be created and a new private key generated (or at least, a new call to generatePrivateKey).
The item will be deleted from the map once the CR is deleted, although yep, perhaps a future optimisation could be to delete terminally failed items from the map a bit early just to save on memory... but it shouldn't make any functional difference :)
func (m *Manager) handleRequest(ctx context.Context, volumeID string, meta metadata.Metadata, key crypto.PrivateKey, req *cmapi.CertificateRequest) error {
	log := m.log.WithValues("volume_id", volumeID)

	// Poll every 200ms for the CertificateRequest to be ready
Any reason to choose 200ms now? This looks like a typical round-trip time for a remote data center query.
I reduced it here because this whole block only ever reads from a local in-memory cache anyway, and it reduced test flakes (there were a few awkward timing issues where we had timeouts of 2s, but 1s sleeps between each 'loop' here).
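For context, a rough sketch of what a 200ms poll against the lister-backed cache might look like, using the generic wait helpers from k8s.io/apimachinery. The getFromCache callback and the function name are assumptions rather than the csi-lib implementation, and the cmapi import path may differ depending on the cert-manager version in use.

```go
package pendingcache

import (
	"context"
	"fmt"
	"time"

	cmapi "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForReady polls the local cache every 200ms until the CertificateRequest
// returned by getFromCache has a Ready condition set to True, or ctx expires.
func waitForReady(ctx context.Context, getFromCache func() (*cmapi.CertificateRequest, error)) (*cmapi.CertificateRequest, error) {
	var ready *cmapi.CertificateRequest
	err := wait.PollImmediateUntil(200*time.Millisecond, func() (bool, error) {
		req, err := getFromCache()
		if err != nil {
			return false, nil // not in the cache yet; keep polling
		}
		for _, cond := range req.Status.Conditions {
			if cond.Type == cmapi.CertificateRequestConditionReady && cond.Status == "True" {
				ready = req
				return true, nil
			}
		}
		return false, nil
	}, ctx.Done())
	if err != nil {
		return nil, fmt.Errorf("waiting for CertificateRequest to become ready: %w", err)
	}
	return ready, nil
}
```

Because every iteration only reads from an in-memory cache, a short 200ms interval adds no apiserver load.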
@munnerz A few comments from me, but otherwise looks good to me 🙂
manager/manager.go
	requestToPrivateKeyLock.Lock()
	defer requestToPrivateKeyLock.Unlock()
To keep a consistent view of the state of the world between routines, we should lock at the beginning of this function (before listing).
This doesn't need to be done ^
manager/manager.go
go wait.Until(func() {
	reqs, err := lister.List(labels.Everything())
	if err != nil {
		janitorLogger.Error(err, "failed listing existing requests")
A general comment is that we currently have no coordination between Stop and our goroutines, so we return from Stop before all resources have been released. We should consider adding this in another PR.
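A minimal sketch of the kind of coordination being suggested for a follow-up PR - a stop channel plus a sync.WaitGroup so Stop blocks until background routines have exited. All names here are illustrative assumptions, not the csi-lib API.

```go
package pendingcache

import (
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

type runner struct {
	stopCh chan struct{}
	wg     sync.WaitGroup
}

// startJanitor launches the periodic purge loop and registers it with the
// WaitGroup so Stop can wait for it to return.
func (r *runner) startJanitor(purge func()) {
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		wait.Until(purge, 5*time.Minute, r.stopCh)
	}()
}

// Stop signals all background routines to exit and blocks until they have.
func (r *runner) Stop() {
	close(r.stopCh)
	r.wg.Wait()
}
```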
	// start at the end of the slice and work back to maxRequestsPerVolume
-	for i := len(reqs) - 1; i >= m.maxRequestsPerVolume-1; i-- {
+	for i := len(reqs) - 1; i > m.maxRequestsPerVolume-1; i-- {
Why has this been changed? Is it because we can now recover the private key between syncs?
yeah exactly - this function was previously doing some ✨ weird ✨ counting logic, which is now fixed (and you can see the behaviour change by taking a look at how the unit tests have changed too)
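To make the counting concrete, here is a hedged sketch of the fixed behaviour; pruneToLimit and the []string slice are illustrative stand-ins (the real code operates on CertificateRequest objects):

```go
package pendingcache

// pruneToLimit deletes items from the end of the slice until only
// maxRequestsPerVolume remain. The condition `i > maxRequestsPerVolume-1`
// keeps indices 0..maxRequestsPerVolume-1, i.e. exactly maxRequestsPerVolume
// entries. The previous `>=` condition also deleted index
// maxRequestsPerVolume-1, leaving one entry too few.
func pruneToLimit(reqs []string, maxRequestsPerVolume int, deleteFn func(name string)) {
	for i := len(reqs) - 1; i > maxRequestsPerVolume-1; i-- {
		deleteFn(reqs[i])
	}
}
```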
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: 7ing, JoshVanL
The full list of commands accepted by this bot can be found here. The pull request process is described here.
@@ -313,16 +312,16 @@ func TestManager_ManageVolume_exponentialBackOffRetryOnIssueErrors(t *testing.T)
		Jitter: expBackOffJitter,
		Steps: expBackOffSteps,
	}
	opts.ReadyToRequest = func(meta metadata.Metadata) (bool, string) {
		// ReadyToRequest will be called by issue()
This is no longer true, hence as part of this PR I've added a temporary function that will only be called during tests, which increments whenever issue() is called.
This is the lesser of the evils IMO, until we have actual metrics support throughout csi-lib, which will allow us to do things like count issue() calls properly :)
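A rough sketch of the kind of test-only hook described here; the field name, type, and wiring are assumptions for illustration, not the actual csi-lib code:

```go
package pendingcache

import "context"

type issuer struct {
	// onIssueCalled, if non-nil, is invoked at the start of every issue()
	// call. It exists purely so tests can count issuance attempts until
	// proper metrics support is available.
	onIssueCalled func()
}

func (i *issuer) issue(ctx context.Context) error {
	if i.onIssueCalled != nil {
		i.onIssueCalled()
	}
	// ... build the CSR, create the CertificateRequest, wait for issuance ...
	return nil
}
```

A test can then set onIssueCalled to increment a counter (guarded by a mutex if the test exercises concurrency) and assert on the number of issuance attempts.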
/lgtm
Replaces #48.
Some issuers take a very long time to approve and issue a certificate request. Because the CSI driver's NodePublishVolume call has its own implicit timeout (1 minute) which cannot easily be changed, we would like to be able to 'resume' a request even if it has taken longer than 1 minute to complete.
As a concrete example, say an issuer always takes 90s to complete a request (for whatever reason). In that case, the driver will wait 60s and the context will time out; 30s later the certificate will be issued, but each time the NodePublishVolume call is retried we will create yet another new request.
This is obviously not desirable, so persisting a reference to the crypto.PrivateKey in memory allows us to 'resume' the request if it is still usable. This is different to #48 in that I've avoided modifying large parts of existing code-flows - instead, I'm basically using a map as a cache to look up an existing private key.
I've also added an event handler that monitors 'delete' operations on CertificateRequest objects so we can handle the case where another entity deletes requests, and we don't keep persisting stale private keys in memory forever (aka a memory leak).
This PR is still WIP, as I need to add a number of tests for it. To do that effectively, I need #46 to be merged so we can time out the first issue call.
cc @7ing @irbekrm
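To summarise the flow described above in code form, here is a simplified sketch. The helper names echo those mentioned in the review (findPendingRequest, generatePrivateKey), but every signature here is an assumption rather than the actual csi-lib API.

```go
package pendingcache

import (
	"crypto"

	cmapi "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
	"k8s.io/apimachinery/pkg/types"
)

// resolveKeyAndRequest decides which private key and CertificateRequest to
// use for this issuance cycle. If a pending (non-terminal) request already
// exists and its private key is still cached in memory, both are reused so a
// retried NodePublishVolume call resumes the in-flight request rather than
// starting a new one.
func resolveKeyAndRequest(
	findPendingRequest func() *cmapi.CertificateRequest, // nil if nothing pending
	cachedKey func(uid types.UID) (crypto.PrivateKey, bool),
	generatePrivateKey func() (crypto.PrivateKey, error),
	createRequest func(key crypto.PrivateKey) (*cmapi.CertificateRequest, error),
) (crypto.PrivateKey, *cmapi.CertificateRequest, error) {
	if pending := findPendingRequest(); pending != nil {
		if key, ok := cachedKey(pending.UID); ok {
			// Resume: the request outlived a previous NodePublishVolume
			// timeout, but its private key is still held in memory.
			return key, pending, nil
		}
	}
	// Nothing resumable: generate a fresh key and create a new request.
	key, err := generatePrivateKey()
	if err != nil {
		return nil, nil, err
	}
	req, err := createRequest(key)
	if err != nil {
		return nil, nil, err
	}
	return key, req, nil
}
```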