CNTRLPLANE-1575: Add support for event-ttl in Kube API Server Operator
Signed-off-by: Thomas Jungblut <[email protected]>
1 parent 7f59958 commit 6168a91

Lines changed: 251 additions & 0 deletions
@@ -0,0 +1,251 @@
---
title: event-ttl
authors:
  - "@tjungblu"
  - "CursorAI"
reviewers:
  - benluddy
  - p0lyn0mial
approvers:
  - sjenning
api-approvers:
  - JoelSpeed
creation-date: 2025-10-08
last-updated: 2025-10-08
tracking-link:
  - https://issues.redhat.com/browse/OCPSTRAT-2095
  - https://issues.redhat.com/browse/CNTRLPLANE-1539
  - https://github.com/openshift/api/pull/2520
status: proposed
see-also:
replaces:
superseded-by:
---

# Event TTL Configuration

## Summary

This enhancement adds a configuration option to the `config.openshift.io/v1` API group that sets the `event-ttl` value for the kube-apiserver. The `event-ttl` setting controls how long events are retained in etcd before they are automatically deleted.

Currently, OpenShift uses a default event-ttl of 3 hours, while upstream Kubernetes uses 1 hour. This enhancement allows customers to configure this value based on their specific requirements, with a suggested range of 10 minutes to 6 hours.

## Motivation

The event-ttl setting in kube-apiserver controls the retention period for events in etcd. Events are automatically deleted after this duration to prevent etcd from growing indefinitely. Different customers have different requirements for event retention:

- Some customers need longer retention for compliance or debugging purposes
- Others may want shorter retention to reduce etcd storage usage
- The current fixed value of 3 hours may not suit all use cases

### Goals

1. Allow customers to configure the event-ttl setting for kube-apiserver through the OpenShift API
2. Provide a reasonable range of values (10 minutes to 6 hours) that covers most customer needs
3. Maintain backward compatibility with the current default of 3 hours
4. Ensure the configuration is properly validated and applied

### Non-Goals

- Changing the default event-ttl value (will remain 3 hours)
- Supporting event-ttl values outside the recommended range
- Modifying the underlying etcd compaction behavior beyond what the event-ttl setting provides

## Proposal

We propose to add an `eventTTL` field to the `APIServer` resource in `config.openshift.io/v1` that allows customers to configure the event-ttl setting for kube-apiserver.

### User Stories

#### Story 1: Compliance Requirements
As a cluster administrator in a regulated environment, I want to configure a longer event retention period so that I can meet compliance requirements for audit trails and debugging.

#### Story 2: Storage Optimization
As a cluster administrator with limited etcd storage, I want to configure a shorter event retention period so that I can reduce etcd storage usage while maintaining sufficient event history for troubleshooting.

#### Story 3: Default Behavior
As a cluster administrator, I want the current default behavior to be preserved so that existing clusters continue to work without changes.

### API Extensions

This enhancement modifies the `APIServer` resource in `config.openshift.io/v1` by adding a new `eventTTL` field.

### Workflow Description

Configuring event-ttl involves the following steps:

1. **Cluster Administrator** accesses the OpenShift cluster via CLI or web console
2. **Cluster Administrator** edits the `APIServer` resource in the `config.openshift.io/v1` API group
3. **Cluster Administrator** sets the `eventTTL` field to the desired duration (e.g., "1h", "30m", "6h")
4. **kube-apiserver-operator** detects the configuration change
5. **kube-apiserver-operator** validates the new event-ttl value
6. **kube-apiserver-operator** updates the kube-apiserver deployment with the new configuration
7. **kube-apiserver** restarts with the new event-ttl setting
8. **etcd** begins using the new event retention policy for future events

The new value applies to events created after the kube-apiserver has restarted with the new setting. Existing events keep their original TTL until they expire.

### Topology Considerations

#### Hypershift / Hosted Control Planes

This enhancement does not apply to Hypershift.

#### Standalone Clusters

This enhancement is fully applicable to standalone OpenShift clusters. The event-ttl configuration will be applied to the kube-apiserver running in the control plane, affecting event retention in the cluster's etcd.

#### Single-node Deployments or MicroShift

For single-node OpenShift (SNO) deployments, this enhancement will work as expected. The event-ttl configuration will be applied to the kube-apiserver running on the single node.

For MicroShift, this enhancement is not directly applicable as MicroShift uses a different architecture and may not have the same event-ttl configuration options. However, if MicroShift adopts similar event management, the same principles would apply.

### Implementation Details/Notes/Constraints

The proposed API looks like this:

```yaml
kind: APIServer
apiVersion: config.openshift.io/v1
spec:
  eventTTL: "3h" # Duration string, e.g., "1h", "30m", "6h"
```

The `eventTTL` field will be a duration string that follows the standard Kubernetes duration format (e.g., "1h", "30m", "6h"). The field will be validated to ensure it falls within the required range.

The API design is based on the changes in [openshift/api PR #2520](https://github.com/openshift/api/pull/2520), which includes:

```go
type KubeAPIServerSpec struct {
	StaticPodOperatorSpec `json:",inline"`

	// eventTTL specifies the amount of time that the events are stored before being deleted.
	//
	// The value must be parseable as a time duration value;
	// see <https://pkg.go.dev/time#ParseDuration>.
	//
	// If configured, it must be a value of 1m (one minute) or greater, we only allow setting
	// minute and hour durations (e.g. 5m or 5h).
	//
	// The default value is 3h.
	//
	// +kubebuilder:validation:Format=duration
	// +kubebuilder:validation:Pattern=^(0|([0-9]+(\.[0-9]+)?(m|h))+)$
	// +kubebuilder:validation:Type:=string
	// +default="3h"
	// +optional
	EventTTL metav1.Duration `json:"eventTTL,omitempty"`
}
```

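To see which strings the `Pattern` marker admits, the expression can be exercised with Go's `regexp` package. This is only a sketch: the pattern is copied from the marker in the quoted type, while the helper name `matchesEventTTLPattern` and the sample inputs are made up for illustration. How the API server evaluates the pattern in practice may differ in detail from Go's regexp engine.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern copied from the kubebuilder validation marker quoted in this proposal.
var eventTTLPattern = regexp.MustCompile(`^(0|([0-9]+(\.[0-9]+)?(m|h))+)$`)

// matchesEventTTLPattern is a hypothetical helper that reports whether s
// satisfies the pattern. It does not check any minimum or maximum value.
func matchesEventTTLPattern(s string) bool {
	return eventTTLPattern.MatchString(s)
}

func main() {
	// Sample inputs chosen for illustration only.
	for _, s := range []string{"3h", "30m", "1h30m", "1.5h", "0", "45s", "3 h", ""} {
		fmt.Printf("%q matches=%v\n", s, matchesEventTTLPattern(s))
	}
}
```

Under this reading, the pattern restricts the units to minutes and hours but also accepts a bare `0`, so a minimum such as the one described in the field comment would have to be enforced by a separate check.
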
### Impact of Lower TTL Values

Setting the event-ttl to values lower than the upstream default of 1 hour will primarily impact:

1. **etcd Compaction Bandwidth**: With faster-expiring events, etcd will need to perform compaction more frequently to remove expired events. This increases the bandwidth usage for etcd compaction operations.

2. **etcd CPU Usage**: More frequent compaction operations will increase CPU usage on etcd nodes, as the compaction process requires CPU cycles to identify and remove expired events.

3. **Event Availability**: Events will be deleted more quickly, potentially reducing the time window available for debugging and troubleshooting.

The main reason for this impact is that with faster-expiring events, the system needs to delete events much more frequently, increasing the overhead of the cleanup process.

### Risks and Mitigations

**Risk**: Customers might set extremely low values that could impact etcd performance.
**Mitigation**: The API validation ensures values are within a reasonable range.

### Drawbacks

- Adds complexity to the configuration API
- Additional validation and error handling required

## Alternatives (Not Implemented)

1. **Hardcoded Values**: Keep the current fixed value of 3 hours
   - **Rejected**: Does not meet customer requirements for configurability

2. **Environment Variable**: Use environment variables instead of API configuration
   - **Rejected**: Less user-friendly and harder to manage

3. **Separate CRD**: Create a separate CRD for event configuration
   - **Rejected**: Overkill for a single setting; it is better to include it in the existing `APIServer` resource

## Test Plan

**Note:** *Section not required until targeted at a release.*

The test plan will include:

1. **Unit Tests**: Test the API validation and parsing logic
2. **Integration Tests**: Test that the configuration is properly applied to kube-apiserver
3. **E2E Tests**: Test that events are properly deleted after the configured TTL
4. **Performance Tests**: Test the impact of different TTL values on etcd performance

## Graduation Criteria

### Dev Preview -> Tech Preview

- API is implemented and validated
- Basic functionality works end-to-end
- Documentation is available
- Sufficient test coverage

### Tech Preview -> GA

- More comprehensive testing (upgrade, downgrade, scale)
- Performance testing with various TTL values
- User feedback incorporated
- Documentation updated in openshift-docs

### Removing a deprecated feature

This enhancement does not remove any existing features. It only adds new configuration options while maintaining backward compatibility with the existing default behavior.

## Upgrade / Downgrade Strategy

### Upgrade Strategy

- Existing clusters will continue to use the default 3-hour TTL
- No changes required for existing clusters
- New configuration option is available immediately

### Downgrade Strategy

- Configuration will be ignored by older versions
- No impact on cluster functionality
- Events will continue to use the default TTL

## Version Skew Strategy

- The event-ttl setting is a kube-apiserver configuration
- No coordination required with other components
- Version skew is not a concern for this enhancement

## Operational Aspects of API Extensions

This enhancement modifies the `APIServer` resource but does not add new API extensions. The impact is limited to:

- Configuration validation in the kube-apiserver-operator
- Application of the setting to the kube-apiserver deployment
- No impact on API availability or performance

## Support Procedures

### Detection

- Configuration can be verified by checking the `APIServer` resource
- kube-apiserver logs will show the configured event-ttl value
- etcd metrics can be monitored for compaction frequency

### Troubleshooting

- If events are not being deleted as expected, check the event-ttl configuration
- Monitor etcd compaction metrics for unusual patterns

## Implementation History

- 2025-10-08: Initial enhancement proposal
