| 1 | +--- |
| 2 | +title: event-ttl |
| 3 | +authors: |
| 4 | + - "@tjungblu" |
| 5 | + - "CursorAI" |
| 6 | +reviewers: |
| 7 | + - benluddy |
| 8 | + - p0lyn0mial |
| 9 | +approvers: |
| 10 | + - sjenning |
| 11 | +api-approvers: |
| 12 | + - JoelSpeed |
| 13 | +creation-date: 2025-10-08 |
| 14 | +last-updated: 2025-10-08 |
| 15 | +tracking-link: |
| 16 | + - https://issues.redhat.com/browse/OCPSTRAT-2095 |
| 17 | + - https://issues.redhat.com/browse/CNTRLPLANE-1539 |
| 18 | + - https://github.com/openshift/api/pull/2520 |
| 19 | +status: proposed |
| 20 | +see-also: |
| 21 | +replaces: |
| 22 | +superseded-by: |
| 23 | +--- |
| 24 | + |
| 25 | +# Event TTL Configuration |
| 26 | + |
| 27 | +## Summary |
| 28 | + |
| 29 | +This enhancement describes a configuration option in the `config.openshift.io/v1` API group to configure the event-ttl setting for the kube-apiserver. The event-ttl setting controls how long events are retained in etcd before being automatically deleted. |
| 30 | + |
| 31 | +Currently, OpenShift uses a default event-ttl of 3 hours, while upstream Kubernetes uses 1 hour. This enhancement allows customers to configure this value based on their specific requirements, with a suggested range of 10 minutes to 6 hours. |
| 32 | + |
| 33 | +## Motivation |
| 34 | + |
| 35 | +The event-ttl setting in kube-apiserver controls the retention period for events in etcd. Events are automatically deleted after this duration to prevent etcd from growing indefinitely. Different customers have different requirements for event retention: |
| 36 | + |
| 37 | +- Some customers need longer retention for compliance or debugging purposes |
| 38 | +- Others may want shorter retention to reduce etcd storage usage |
| 39 | +- The current fixed value of 3 hours may not suit all use cases |
| 40 | + |
| 41 | +### Goals |
| 42 | + |
| 43 | +1. Allow customers to configure the event-ttl setting for kube-apiserver through the OpenShift API |
| 44 | +2. Provide a reasonable range of values (10 minutes to 6 hours) that covers most customer needs |
| 45 | +3. Maintain backward compatibility with the current default of 3 hours |
| 46 | +4. Ensure the configuration is properly validated and applied |
| 47 | + |
| 48 | +### Non-Goals |
| 49 | + |
| 50 | +- Changing the default event-ttl value (will remain 3 hours) |
| 51 | +- Supporting event-ttl values outside the recommended range |
| 52 | +- Modifying the underlying etcd compaction behavior beyond what the event-ttl setting provides |
| 53 | + |
| 54 | +## Proposal |
| 55 | + |
| 56 | +We propose to add an `eventTTL` field to the `APIServer` resource in `config.openshift.io/v1` that allows customers to configure the event-ttl setting for kube-apiserver. |
| 57 | + |
| 58 | +### User Stories |
| 59 | + |
| 60 | +#### Story 1: Compliance Requirements |
| 61 | +As a cluster administrator in a regulated environment, I want to configure a longer event retention period so that I can meet compliance requirements for audit trails and debugging. |
| 62 | + |
| 63 | +#### Story 2: Storage Optimization |
| 64 | +As a cluster administrator with limited etcd storage, I want to configure a shorter event retention period so that I can reduce etcd storage usage while maintaining sufficient event history for troubleshooting. |
| 65 | + |
| 66 | +#### Story 3: Default Behavior |
| 67 | +As a cluster administrator, I want the current default behavior to be preserved so that existing clusters continue to work without changes. |
| 68 | + |
| 69 | +### API Extensions |
| 70 | + |
| 71 | +This enhancement modifies the `APIServer` resource in `config.openshift.io/v1` by adding a new `eventTTL` field. |
| 72 | + |
| 73 | +### Workflow Description |
| 74 | + |
| 75 | +The workflow for configuring event-ttl is straightforward: |
| 76 | + |
| 77 | +1. **Cluster Administrator** accesses the OpenShift cluster via CLI or web console |
| 78 | +2. **Cluster Administrator** edits the `APIServer` resource in the `config.openshift.io/v1` API group |
| 79 | +3. **Cluster Administrator** sets the `eventTTL` field to the desired duration (e.g., "1h", "30m", "6h") |
| 80 | +4. **kube-apiserver-operator** detects the configuration change |
| 81 | +5. **kube-apiserver-operator** validates the new event-ttl value |
| 82 | +6. **kube-apiserver-operator** rolls out the new configuration to the kube-apiserver static pods |
| 83 | +7. **kube-apiserver** restarts with the new event-ttl setting |
| 84 | +8. **kube-apiserver** writes new events to etcd with the new TTL |
| 85 | + |
| 86 | +The new TTL applies to events created after the kube-apiserver has restarted; existing events keep the TTL they were written with until they expire. |
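
The validation and apply steps in the workflow above could be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the helper name `eventTTLArg` and the constants are hypothetical, and the bounds simply reuse the 10 minute to 6 hour range suggested in this proposal. The only externally defined names are Go's `time.ParseDuration` and the kube-apiserver `--event-ttl` flag.

```go
package main

import (
	"fmt"
	"time"
)

// Bounds taken from the range suggested in this proposal (10 minutes to 6 hours).
// They are illustrative; the final API may enforce different limits.
const (
	minEventTTL = 10 * time.Minute
	maxEventTTL = 6 * time.Hour
)

// eventTTLArg is a hypothetical helper: it parses a duration string such as
// "30m" or "3h", checks it against the bounds above, and renders the
// kube-apiserver --event-ttl argument.
func eventTTLArg(raw string) (string, error) {
	ttl, err := time.ParseDuration(raw)
	if err != nil {
		return "", fmt.Errorf("eventTTL %q is not a valid duration: %w", raw, err)
	}
	if ttl < minEventTTL || ttl > maxEventTTL {
		return "", fmt.Errorf("eventTTL %s is outside the range %s to %s", ttl, minEventTTL, maxEventTTL)
	}
	return "--event-ttl=" + ttl.String(), nil
}

func main() {
	for _, raw := range []string{"3h", "30m", "5m", "bogus"} {
		arg, err := eventTTLArg(raw)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", arg)
	}
}
```

Rejecting out-of-range values before rendering the argument means a bad value never reaches a kube-apiserver rollout, which is the behaviour the workflow's validation step implies.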
| 87 | + |
| 88 | +### Topology Considerations |
| 89 | + |
| 90 | +#### Hypershift / Hosted Control Planes |
| 91 | + |
| 92 | +This enhancement does not apply to Hypershift. |
| 93 | + |
| 94 | +#### Standalone Clusters |
| 95 | + |
| 96 | +This enhancement is fully applicable to standalone OpenShift clusters. The event-ttl configuration will be applied to the kube-apiserver running in the control plane, affecting event retention in the cluster's etcd. |
| 97 | + |
| 98 | +#### Single-node Deployments or MicroShift |
| 99 | + |
| 100 | +For single-node OpenShift (SNO) deployments, this enhancement will work as expected. The event-ttl configuration will be applied to the kube-apiserver running on the single node. |
| 101 | + |
| 102 | +For MicroShift, this enhancement is not directly applicable as MicroShift uses a different architecture and may not have the same event-ttl configuration options. However, if MicroShift adopts similar event management, the same principles would apply. |
| 103 | + |
| 104 | +### Implementation Details/Notes/Constraints |
| 105 | + |
| 106 | +The proposed API looks like this: |
| 107 | + |
| 108 | +```yaml |
| 109 | +kind: APIServer |
| 110 | +apiVersion: config.openshift.io/v1 |
| 111 | +spec: |
| 112 | + eventTTL: "3h" # Duration string, e.g., "1h", "30m", "6h" |
| 113 | +``` |
| 114 | + |
| 115 | +The `eventTTL` field will be a duration string that follows the standard Kubernetes duration format (e.g., "1h", "30m", "6h"). The field will be validated to ensure it falls within the required range. |
| 116 | + |
| 117 | +The API design is based on the changes in [openshift/api PR #2520](https://github.com/openshift/api/pull/2520), which includes: |
| 118 | + |
| 119 | +```go |
| 120 | +type KubeAPIServerSpec struct { |
| 121 | + StaticPodOperatorSpec `json:",inline"` |
| 122 | + |
| 123 | + // eventTTL specifies the amount of time that the events are stored before being deleted. |
| 124 | + // |
| 125 | + // The value must be parseable as a time duration value; |
| 126 | + // see <https://pkg.go.dev/time#ParseDuration>. |
| 127 | + // |
| 128 | + // If configured, it must be a value of 1m (one minute) or greater, we only allow setting |
| 129 | + // minute and hour durations (e.g. 5m or 5h). |
| 130 | + // |
| 131 | + // The default value is 3h. |
| 132 | + // |
| 133 | + // +kubebuilder:validation:Format=duration |
| 134 | + // +kubebuilder:validation:Pattern=^(0|([0-9]+(\.[0-9]+)?(m|h))+)$ |
| 135 | + // +kubebuilder:validation:Type:=string |
| 136 | + // +default="3h" |
| 137 | + // +optional |
| 138 | + EventTTL metav1.Duration `json:"eventTTL,omitempty"` |
| 139 | +} |
| 140 | +``` |
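
To see which strings the `Pattern` marker in the struct above admits, the same regular expression can be exercised directly. This is only a sketch of the pattern check: the function name `matchesEventTTLPattern` is made up for illustration, and the API server evaluates the pattern through CRD schema validation rather than Go's `regexp` package, so edge-case behaviour may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// eventTTLPattern is copied from the +kubebuilder:validation:Pattern marker.
var eventTTLPattern = regexp.MustCompile(`^(0|([0-9]+(\.[0-9]+)?(m|h))+)$`)

// matchesEventTTLPattern (hypothetical name) reports whether s passes the
// pattern check only; it does not check any minimum or maximum duration.
func matchesEventTTLPattern(s string) bool {
	return eventTTLPattern.MatchString(s)
}

func main() {
	for _, s := range []string{"3h", "30m", "1h30m", "1.5h", "0", "90s", "45"} {
		fmt.Printf("%-6s matches=%v\n", s, matchesEventTTLPattern(s))
	}
}
```

Note that the pattern on its own accepts `0` and places no bound on the number, so any minimum or maximum has to be enforced by a separate validation rule.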
| 141 | + |
| 142 | +### Impact of Lower TTL Values |
| 143 | + |
| 144 | +Setting the event-ttl to values lower than the upstream default of 1 hour will primarily impact: |
| 145 | + |
| 146 | +1. **etcd Compaction Bandwidth**: With faster expiring events, etcd will need to perform compaction more frequently to remove expired events. This increases the bandwidth usage for etcd compaction operations. |
| 147 | + |
| 148 | +2. **etcd CPU Usage**: More frequent compaction operations will increase CPU usage on etcd nodes, as the compaction process requires CPU cycles to identify and remove expired events. |
| 149 | + |
| 150 | +3. **Event Availability**: Events will be deleted more quickly, potentially reducing the time window available for debugging and troubleshooting. |
| 151 | + |
| 152 | +The main reason for this impact is that with faster expiring events, the system needs to delete events much more frequently, increasing the overhead of the cleanup process. |
| 153 | + |
| 154 | +### Risks and Mitigations |
| 155 | + |
| 156 | +**Risk**: Customers might set extremely low values that could impact etcd performance. |
| 157 | +**Mitigation**: The API validation ensures values are within a reasonable range. |
| 158 | + |
| 159 | + |
| 160 | +### Drawbacks |
| 161 | + |
| 162 | +- Adds complexity to the configuration API |
| 163 | +- Additional validation and error handling required |
| 164 | + |
| 165 | +## Alternatives (Not Implemented) |
| 166 | + |
| 167 | +1. **Hardcoded Values**: Keep the current fixed value of 3 hours |
| 168 | + - **Rejected**: Does not meet customer requirements for configurability |
| 169 | + |
| 170 | +2. **Environment Variable**: Use environment variables instead of API configuration |
| 171 | + - **Rejected**: Less user-friendly and harder to manage |
| 172 | + |
| 173 | +3. **Separate CRD**: Create a separate CRD for event configuration |
| 174 | + - **Rejected**: Overkill for a single setting; it fits better in the existing `APIServer` resource |
| 175 | + |
| 176 | +## Test Plan |
| 177 | + |
| 178 | +**Note:** *Section not required until targeted at a release.* |
| 179 | + |
| 180 | +The test plan will include: |
| 181 | + |
| 182 | +1. **Unit Tests**: Test the API validation and parsing logic |
| 183 | +2. **Integration Tests**: Test that the configuration is properly applied to kube-apiserver |
| 184 | +3. **E2E Tests**: Test that events are properly deleted after the configured TTL |
| 185 | +4. **Performance Tests**: Test the impact of different TTL values on etcd performance |
| 186 | + |
| 187 | +## Graduation Criteria |
| 188 | + |
| 189 | +### Dev Preview -> Tech Preview |
| 190 | + |
| 191 | +- API is implemented and validated |
| 192 | +- Basic functionality works end-to-end |
| 193 | +- Documentation is available |
| 194 | +- Sufficient test coverage |
| 195 | + |
| 196 | +### Tech Preview -> GA |
| 197 | + |
| 198 | +- More comprehensive testing (upgrade, downgrade, scale) |
| 199 | +- Performance testing with various TTL values |
| 200 | +- User feedback incorporated |
| 201 | +- Documentation updated in openshift-docs |
| 202 | + |
| 203 | +### Removing a deprecated feature |
| 204 | + |
| 205 | +This enhancement does not remove any existing features. It only adds new configuration options while maintaining backward compatibility with the existing default behavior. |
| 206 | + |
| 207 | +## Upgrade / Downgrade Strategy |
| 208 | + |
| 209 | +### Upgrade Strategy |
| 210 | + |
| 211 | +- Existing clusters will continue to use the default 3-hour TTL |
| 212 | +- No changes required for existing clusters |
| 213 | +- New configuration option is available immediately |
| 214 | + |
| 215 | +### Downgrade Strategy |
| 216 | + |
| 217 | +- Configuration will be ignored by older versions |
| 218 | +- No impact on cluster functionality |
| 219 | +- Events will continue to use the default TTL |
| 220 | + |
| 221 | +## Version Skew Strategy |
| 222 | + |
| 223 | +- The event-ttl setting is a kube-apiserver configuration |
| 224 | +- No coordination required with other components |
| 225 | +- Version skew is not a concern for this enhancement |
| 226 | + |
| 227 | +## Operational Aspects of API Extensions |
| 228 | + |
| 229 | +This enhancement modifies the `APIServer` resource but does not add new API extensions. The impact is limited to: |
| 230 | + |
| 231 | +- Configuration validation in the kube-apiserver-operator |
| 232 | +- Application of the setting to the kube-apiserver static pods |
| 233 | +- No impact on API availability or performance |
| 234 | + |
| 235 | +## Support Procedures |
| 236 | + |
| 237 | +### Detection |
| 238 | + |
| 239 | +- Configuration can be verified by checking the `APIServer` resource |
| 240 | +- kube-apiserver logs will show the configured event-ttl value |
| 241 | +- etcd metrics can be monitored for compaction frequency |
| 242 | + |
| 243 | +### Troubleshooting |
| 244 | + |
| 245 | +- If events are not being deleted as expected, check the event-ttl configuration |
| 246 | +- Monitor etcd compaction metrics for unusual patterns |
| 247 | + |
| 248 | +## Implementation History |
| 249 | + |
| 250 | +- 2025-10-08: Initial enhancement proposal |
| 251 | + |