CNTRLPLANE-1575: Add support for event-ttl in Kube API Server Operator #1857
Conversation
@tjungblu: This pull request references CNTRLPLANE-1575 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
I'm also curious about the event-ttl setting in OCP. Please correct me if I'm missing something - why not set the event-ttl in the apiserver (config.openshift.io/v1), just like:
Maybe we will set more parameters for all OpenShift-provided apiservers in the future, not only the event-ttl.
Thanks for the review @lance5890 - events are only created through the kube-apiserver, so adding it to the others doesn't make much sense. As for other configuration values (we have debated the GOAWAY chance parameter recently), this would be more suitable on the general apiserver config CRD.
1. Allow customers to configure the event-ttl setting for kube-apiserver through the OpenShift API
2. Provide a reasonable range of values (5 minutes to 3 hours) that covers most customer needs
3. Maintain backward compatibility with the current default of 3 hours (180 minutes)
Is this just so that we don't have to configure this value in CI? Given kube's default is 1 hour and we state there is little reason for a customer to need a longer period than 3 hours, could we also argue customers would generally want the default to be 1 hour, and it's only 3 because of our internal processes?
As usual, nothing was functionally specified on what the intended behavior is supposed to be.
I would also argue we could reduce the default to 1h safely and configure the CI for 3h. That would be a discussion for the apiserver folks to chime in, though.
If we have one customer running a batch job to export events every 2 hours, then automatically reducing the TTL to 1 hour breaks them.
Events are meant to be transient, and if you need to retain them, they can be replicated elsewhere. That sort of replication is not currently a part of the platform (maybe there are optional operators on top that can be configured to do this), and I'm afraid to know how many programs are reliant on events for doing "real work".
I would prefer to keep the scope as small as possible while satisfying the request to allow shorter event TTLs. Users with constrained deployments or high event creation rates will appreciate the escape valve. I'm not opposed to migrating the default eventually, but it carries more risk and is not necessary to satisfy the RFE. Besides, the steady-state number of events depends on both the arrival rate and the TTL; I suspect APF would be a more effective way to mitigate the worst-case number of events by limiting the arrival rate.
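A quick back-of-envelope sketch of that steady-state relation; the arrival rate below is purely illustrative, not a measured value:

```go
// Back-of-envelope sketch: the steady-state number of stored events is
// roughly arrivalRate * TTL, so a shorter TTL shrinks storage but does not
// change how many leases expire per second once the backlog has drained.
// The arrival rate is an assumption for illustration only.
package main

import (
	"fmt"
	"time"
)

func main() {
	arrivalRate := 0.8 // events per second (assumed)
	for _, ttl := range []time.Duration{5 * time.Minute, time.Hour, 3 * time.Hour} {
		fmt.Printf("TTL %-7s -> ~%.0f events in steady state\n", ttl, arrivalRate*ttl.Seconds())
	}
}
```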
## Proposal

We propose to add an `eventTTLMinutes` field to the `APIServer` resource in `config.openshift.io/v1` that allows customers to configure the event-ttl setting for kube-apiserver.
API PR is currently operator, not config btw, which do you intend to proceed with?
good one, no this is going to the KAS Operator CRD, it's an exclusive kube-apiserver setting
This isn't important for other API servers? Looking at the existing operator CR, there isn't any config there right now. It might be worth checking with the API server folks where they think this should be landing.
Only the kube-apiserver handles events and attaches leases to them, so it would only make sense there.
might be worth checking with the API server folks where they think this should be landing
Yep, just pinged them on slack. Happy to move it wherever it makes most sense.
Worth noting that, in Hypershift, the APIServer cluster config is part of the HostedCluster but the KASO operator config is not and we do not allow configuration inside the hosted cluster to modify how components in the hosted control plane (HCP) are configured.
@tjungblu is right that events are a kube-apiserver resource, not a generic API server thing. The event-ttl option only exists for kube-apiserver. I think we want to avoid expanding the surface of the config/v1 APIServer type with things that are not common across most/all API servers. At least, that's the stated intent:
// APIServer holds configuration (like serving certificates, client CA and CORS domains)
// shared by all API servers in the system, among them especially kube-apiserver
// and openshift-apiserver. The canonical name of an instance is 'cluster'.
KubeAPIServerConfig looks like it could work; it seems abandoned besides last year's addition of MinimumKubeletVersion.
@sjenning I think you're referring to this, right?
https://github.com/openshift/hypershift/blob/main/api/hypershift/v1beta1/hostedcluster_types.go#L1855-L1859
that would be:
https://github.com/openshift/api/blob/master/config/v1/types_apiserver.go#L37
if the others are OK with moving the setting there, we can also lift and shift this into Hypershift that way.
KubeAPIServerConfig in kubecontrolplane/v1 is the format of the config file, it's not a resource that we serve anywhere.
I also think that in general config/v1 APIServer is meant to be used for settings shared among most API servers.
The operator/v1 KubeAPIServer resource seems like a good candidate for a setting that applies only to the kube-apiserver. This resource already contains a spec.logLevel field, which is used to set the log level for the operand.
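For illustration, a hypothetical sketch of how such a field could hang off the operator/v1 KubeAPIServer spec, alongside existing knobs like logLevel; the authoritative shape is whatever lands in openshift/api#2520, and the field name and bounds below simply follow this enhancement's proposal:

```go
// Hypothetical sketch only: not the actual openshift/api type.
package sketch

import operatorv1 "github.com/openshift/api/operator/v1"

type KubeAPIServerSpec struct {
	operatorv1.StaticPodOperatorSpec `json:",inline"`

	// eventTTLMinutes maps to the kube-apiserver --event-ttl flag, in minutes.
	// The proposed range is 5 to 180; when unset, the current default of
	// 180 minutes (3 hours) is kept.
	// +kubebuilder:validation:Minimum=5
	// +kubebuilder:validation:Maximum=180
	// +optional
	EventTTLMinutes int32 `json:"eventTTLMinutes,omitempty"`
}
```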
As a cluster administrator in a regulated environment, I want to configure a longer event retention period so that I can meet compliance requirements for audit trails and debugging.

#### Story 2: Storage Optimization

As a cluster administrator with limited etcd storage, I want to configure a shorter event retention period so that I can reduce etcd storage usage while maintaining sufficient event history for troubleshooting.
Do we have any metrics in terms of bytes of data that events generate over time in a regular openshift cluster?
I'll check telemeter tomorrow, I think we may have the counts. Bytes are usually difficult to obtain because etcd does not expose those per resource type.
I think it would be useful to understand the impact on the data, so if we can get some estimates of the impact of this change without too much effort, I suspect that will be useful for folks to know.
I hope I got the query right:
`topk(10, cluster:usage:resources:sum{resource="events"})`
the biggest cluster has about 3-4 million events in it and then it levels off real quick, putting it into quantiles:

the very large majority of clusters probably won't notice when changing the TTL to 5m with so few events in storage.
The biggest cluster with 3-4 million also has not much else beyond events in it, so taking it as a sample against its five gigabytes of used etcd storage gives roughly 1.5 kB per event on average.
Looking into the arrival rate here, this is constant at about 0.2/s with a one to two hour window at 0.8/s. Not super large either.
#### Hypershift / Hosted Control Planes

This enhancement does not apply to Hypershift.
Do we know if HyperShift also uses 3h by default? Do they have similar requests to be able to change the config?
github is down more than it is up right now, but they also use 3h:
https://github.com/openshift/hypershift/blob/69c68a4003cbeb0048c9c31d8b68bed6fc287970/control-plane-operator/controllers/hostedcontrolplane/v2/kas/config.go#L214
As for the other question, I'll ping Seth tomorrow on reviewing this.
Hypershift follows KASO for default config generally and does use 3h for event-ttl, same as KASO.
While we don't have an explicit request for this in HCP, we should look to allow this for all topologies.
I'll note this down, we can do the necessary hypershift changes as well
@sjenning I was just looking into how GOAWAY chance was implemented and it seems to be based on an annotation:
https://github.com/openshift/hypershift/blob/main/control-plane-operator/controllers/hostedcontrolplane/v2/kas/params.go#L121-L123
do you think this could be a viable implementation here, too? Given that, in the other thread, we wanted to keep it in the operator CRD for OCP.
In general we shouldn't be adding new annotations, is there any reason this couldn't be a proper structured API on whichever is the appropriate HyperShift object?
kind: APIServer
apiVersion: config.openshift.io/v1
spec:
  eventTTLMinutes: 60 # Integer value in minutes, e.g., 60, 180
Are there any configuration subsections that it might make sense to group this into? How does this look in the upstream API server configuration file?
this is just a plain cmdline argument with a duration in kube-apiserver
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
--event-ttl duration Default: 1h0m0s
Amount of time to retain events.
downstream, not sure, also something for the apiserver folks to chime in
There is no resource/file for configuring kas upstream; settings are passed as CLI flags. We could group this under ApiServerArguments, but I'm not sure if that would provide a good UX for administrators to set/use.
I suspect then we just leave it where it is for now
yeah, for the downstream fork we have KubeAPIServerConfig, which is filled by the operator and then consumed by the forked kas.
I think the only alternative would be to put this into config/v1 APIServer and mark that field as kas-specific.
(unless we also consider creating a CM that would be created by admins, but that would provide terrible UX/validation)
Cool, then let's leave it where it is. We need to find a way to add this into Hypershift then.
1. **etcd Compaction Bandwidth**: With faster expiring events, etcd will need to perform compaction more frequently to remove expired events. This increases the bandwidth usage for etcd compaction operations.

2. **etcd CPU Usage**: More frequent compaction operations will increase CPU usage on etcd nodes, as the compaction process requires CPU cycles to identify and remove expired events.

3. **Event Availability**: Events will be deleted more quickly, potentially reducing the time window available for debugging and troubleshooting.
These impacts might be worth touching on briefly in the godoc for the field, in terms of trading lower storage for increased CPU usage. At the moment the godoc only mentions the positive side and doesn't explain the trade-off.
I still need to quantify the order of magnitude here. I think this is too negligible to even notice, but I wasn't able to test this in-depth yet.
having said that, the compactions do not become more frequent, they just become more expensive.
Is there a practical risk of decreasing event-ttl by 90%+ for a cluster with many events (i.e. how expensive can the first expiry cycle be)?
Good scenario, let me also measure this. The compaction is always O(n) where n is the number of revisions to be compacted, either 10k or whatever we write in the 5 minute interval.
When we reduce the TTL to 1h, we would still be expiring the prior 3h worth of events, and after 1h the events created since the change start expiring as well. For the remaining two hours there would be an elevated rate of deletes until the 3h backlog is fully expired; then it should settle into a steady state for the 1h TTL.
What could also be worrisome is any watchers that run on the events resource. I'm not sure whether a lease-based delete would trigger a delete notification. This seems quite wasteful, and I think the keys are just internally tombstoned on the index until compacted.
I've filled etcd with about 4 gigs of events at 3h, then left it running for a while with just 5m TTL.
It's a sharp drop in storage usage and memory, which makes sense as the defrag needs to kick in first to reclaim the space.

Outside of that, nothing worth noting. CPU seems a little increased, then again reduced (both etcd + apiserver stacked):

memory on apiserver stays about the same:

Here's the pattern based on the event count:

And as expected, the compaction duration on etcd is linear w.r.t. the number of keys - also no surprise here:

2. **Cluster Administrator** edits the operator configuration resource
3. **Cluster Administrator** sets the `eventTTLMinutes` field to the desired value in minutes (e.g., 60, 180)
4. **kube-apiserver-operator** detects the configuration change
5. **kube-apiserver-operator** updates the kube-apiserver deployment with the new configuration
yeah, I think in practice we will create an observer that sets apiServerArguments/event-ttl, which will eventually be propagated to the operator's observedConfig, and later propagated as a config to kas.
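A minimal sketch of what such an observer could produce; the function name and plumbing are assumptions, and the real implementation would follow library-go's configobserver conventions and whatever lands in openshift/api#2520:

```go
// Minimal sketch of mapping eventTTLMinutes from the operator spec into
// apiServerArguments["event-ttl"], yielding an observedConfig fragment like
// {"apiServerArguments": {"event-ttl": ["60m"]}}.
package eventttl

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func observedEventTTL(eventTTLMinutes int32) (map[string]interface{}, error) {
	observed := map[string]interface{}{}
	if eventTTLMinutes == 0 {
		// Field unset: keep the current default (3h) baked into the operator.
		return observed, nil
	}
	ttl := fmt.Sprintf("%dm", eventTTLMinutes) // kube-apiserver parses Go duration strings
	if err := unstructured.SetNestedStringSlice(observed, []string{ttl}, "apiServerArguments", "event-ttl"); err != nil {
		return nil, err
	}
	return observed, nil
}
```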
#### Impact of removing 3 gigabytes of events

To represent the worst case of removing 3 gigabytes of events, we have filled a 4.21 nightly cluster with 3 million events and the default TTL, then configured a 5 minute TTL and watched the resource usage over the following three hours...
it would be great if you could share some data, especially regarding cpu/memory usage of etcd/kas.
in general kas shouldn’t be affected, since the watch cache should be/is (will try to verify this) disabled for events.
I've put 1.89M events into a 4.21 nightly cluster so far, nothing noteworthy regarding the apiserver.

Increased CPU is expected because I'm writing quite a lot of big events in there that need to be serialized/deserialized.

memory is growing, but not as insanely as I've seen in the past when writing CRDs.
The machine has 64 gigs; I assume it is mostly due to the Go GC's laziness in freeing memory:

also in relation to the etcd size, you can see that the apiserver scales a lot better here with memory than etcd:

1. **etcd Compaction Bandwidth**: With faster expiring events, etcd will need more bandwidth to remove expired events.

2. **etcd CPU Usage**: More expensive compaction operations will increase CPU usage on etcd nodes, as the compaction process requires CPU cycles to identify and remove expired events.
won't a lower TTL reduce CPU usage?
Whether we are using a TTL of 5m or 3h, the number of expired leases per minute will roughly be the same (after the initial TTL time). However, since the number of active leases will be lower with a lower TTL, it should be cheaper for etcd to go through the list of all the leases to check if they are expired.
let me adjust this, thanks for reading this thoroughly.
I've looked into the way it is implemented in etcd. Effectively a lessor is running in the background and it polls every 500ms for new leases that are expired.
It uses a queue by expiration time however, so it is not an O(N) iteration over all leases. The found leases are put into a channel, which is polled/processed in parallel by the leader only.
If there's a lease deletion, it will be published via raft.
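A simplified illustration of that expiration-queue idea follows; this is not etcd's actual code, just a sketch of why the periodic check is not an O(all leases) scan:

```go
// Leases sit in a queue ordered by expiry; a background loop wakes up
// periodically, pops only the entries that are already expired, and hands
// them off for revocation (in etcd the leader publishes the deletes via raft).
package main

import (
	"container/heap"
	"fmt"
	"time"
)

type lease struct {
	id     int64
	expiry time.Time
}

type leaseQueue []*lease

func (q leaseQueue) Len() int            { return len(q) }
func (q leaseQueue) Less(i, j int) bool  { return q[i].expiry.Before(q[j].expiry) }
func (q leaseQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *leaseQueue) Push(x interface{}) { *q = append(*q, x.(*lease)) }
func (q *leaseQueue) Pop() interface{} {
	old := *q
	item := old[len(old)-1]
	*q = old[:len(old)-1]
	return item
}

// expireDue pops only the leases whose expiry has already passed.
func expireDue(q *leaseQueue, now time.Time) []*lease {
	var due []*lease
	for q.Len() > 0 && !(*q)[0].expiry.After(now) {
		due = append(due, heap.Pop(q).(*lease))
	}
	return due
}

func main() {
	q := &leaseQueue{}
	heap.Init(q)
	now := time.Now()
	heap.Push(q, &lease{id: 1, expiry: now.Add(-time.Minute)})   // already expired
	heap.Push(q, &lease{id: 2, expiry: now.Add(5 * time.Minute)}) // still active
	fmt.Println("expired leases:", len(expireDue(q, now)))
}
```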
the number of expired leases per minute will roughly be the same
That's right, it depends on the arrival rate of the events.
However, since the number of active leases will be lower with a lower TTL, it should be cheaper for etcd to go through the list of all the lease to check if they are expired.
Realistically, the only thing that is saved here is the memory and disk space, which is also what we see in the perf screenshots I've shared in the other threads. Raft is pretty expensive and the list "polling" is already optimized to not look through the entire list of leases.
Realistically, the only thing that is saved here is the memory and disk space, which is also what we see in the perf screenshots I've shared in the other threads. Raft is pretty expensive and the list "polling" is already optimized to not look through the entire list of leases.
Yes, thank you for sharing all these performance screenshots, it helped me confirm that this feature shouldn't have any negative impact on performance which I was kind of worried about after reading Lukasz's comment in openshift/api#2520 (comment).
LGTM
API PR in openshift/api#2520 Feature Gate PR in openshift/api#2525 Signed-off-by: Thomas Jungblut <[email protected]>
Thank you! Added you as a reviewer as well.
@tjungblu: all tests passed! Full PR test history. Your PR dashboard.
Now that David has left the org, we are all missing his insightful commentary on our code and design documents. This command has been synthesized by claude itself and countless comments and git commits on k/k o/k and their respective api and enhancement repos, including o/library-go. Example on how to use it: ``` /deads openshift/enhancements#1857 ● Technical Review API Semantics & Configuration Placement: Where's the observability? The enhancement adds eventTTLMinutes to the KubeAPIServerOperatorSpec, but I don't see status tracking. How do I know the value was actually applied? How do I debug when it wasn't? You need to reflect the configured value and actual applied value in status. If you don't observe the status, you don't even know if the configuration is active. Validation & Edge Cases: 5-180 minute range - show me the data on why 5 minutes is safe. What's the actual event churn rate in production clusters? Have you tested with clusters generating events at high velocity? What happens when event creation rate exceeds the TTL window - do we start thrashing etcd with constant lease renewals? Use whatever off-the-shelf validation you have, but the operator should go Degraded when the kube-apiserver rejects the value or fails to apply it. Can't force Progressing to false - it'll go Progressing for other reasons. Instead use Degraded conditions. Performance Impact: The proposal mentions "minimal performance impact" - that's meaningless without actual metrics. What's the watch cache behavior with shorter TTLs? Are we creating more etcd churn? What's the lease management overhead when TTL is set to 5 minutes vs 180 minutes? Show me the CPU and memory deltas in the apiserver. You tested with 1.89M events - what was the actual performance delta? Latency? Throughput? Memory footprint? "No significant negative impacts" doesn't tell me anything. Upgrade Impact & Rollout Safety: You should be sure. Commonly when a configuration fails to rollout all future configuration changes (including upgrades) fail. What happens if: - User sets invalid eventTTLMinutes and the kube-apiserver crashes? - The operator rolls out this config but kube-apiserver version doesn't support the flag yet? - Downgrade scenario - does an older apiserver reject the unknown flag? The enhancement says "existing clusters maintain default configuration" - but what's the actual rollout mechanism? Is this applied via a static pod update? What's the blast radius if this causes the apiserver to crashloop? Cross-Platform Considerations: HyperShift/HCP: How is this configured in hosted control planes? The proposal mentions "annotations" but HCP doesn't allow admission webhooks and cannot rely on kube-apiserver patches. How will this be done? MicroShift: "Not directly applicable" - why not? If this is a kube-apiserver flag, MicroShift runs kube-apiserver. Do we need to be sure this is disabled in MicroShift, or should it work there too? Security & Reserved Values: Can users set this to the minimum (5 minutes) in a way that causes problems for system components that rely on event audit trails? What about compliance requirements - have you validated that 5 minutes meets any regulatory retention requirements customers might have? Metrics & Debugging: What metrics are we tracking for this? - Histogram of actual event TTL values in the cluster? - Count of events pruned due to TTL expiration? - Etcd storage savings vs baseline? How do we debug when someone complains events are disappearing too quickly? 
Is there a metric showing "events_pruned_by_ttl" so we can correlate user complaints with the configuration? Alternatives Not Considered: Why is this in the operator API instead of a cluster-scoped config? This feels like cluster-wide configuration, not operator-specific. Have you considered putting this in the APIServer config CR instead? Also - why configure TTL at all instead of just letting etcd compaction handle it? What's the actual problem we're solving beyond "storage optimization"? Specific Questions: 1. What's the watch cache impact when TTL drops from 180min to 5min? 2. Have you benchmarked etcd lease churn with minimum TTL under high event velocity? 3. What happens when the apiserver restarts - do all event leases get recreated? 4. Is this really necessary or are we overengineering a storage problem that etcd compaction already solves? Show me the data. Performance benchmarks with actual numbers. Upgrade test results. Metrics definitions. Then we can talk about whether this should ship. ``` Signed-off-by: Thomas Jungblut <[email protected]>
### Downgrade Strategy

- Configuration will be ignored by older versions
Will downgraded clusters with the older CRDs still be able to deserialize the stored-in-etcd resources? I'm not sure how that handles unrecognized properties in etcd. But even if this breaks, we don't support downgrading from 4.y to 4.(y-1), so it would only be a brief hiccup for rollbacks within 4.dev Engineering Candidates and such.
event-ttl is just signaling to etcd to attach a lease on the given event resource, so there are no downgrade concerns. The setting would also only apply to new incoming events; in a downgrade scenario, the default of 3h would be set on every incoming event.
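For illustration, this is roughly what "attach a lease" means at the etcd client level (a sketch only; the kube-apiserver's storage layer does this internally, and the endpoint and key path below are assumptions):

```go
// Sketch: write a key with an attached lease so etcd deletes it when the
// lease expires. Keys written earlier keep their original lease, which is
// why a TTL change only affects new events.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // assumed endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()
	// Grant a lease matching the configured event TTL (e.g. 3h = 10800s).
	lease, err := cli.Grant(ctx, 3*60*60)
	if err != nil {
		log.Fatal(err)
	}
	// Illustrative key path, not the exact storage prefix used by the apiserver.
	if _, err := cli.Put(ctx, "/registry/events/default/example", "...", clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}
}
```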
API PR in openshift/api#2520
Feature Gate PR in openshift/api#2525