-
Couldn't load subscription status.
- Fork 576
OCPBUGS-61858: Add HTTPKeepAliveTimeout to IngressController API #2547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
OCPBUGS-61858: Add HTTPKeepAliveTimeout to IngressController API #2547
Conversation
|
Hello @alebedev87! Some important instructions when contributing to openshift/api: |
|
@alebedev87: This pull request references Jira Issue OCPBUGS-61858, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@alebedev87: This pull request references Jira Issue OCPBUGS-61858, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
operator/v1/types_ingress.go
Outdated
| // httpKeepAliveTimeout defines the maximum allowed time to wait for | ||
| // a new HTTP request to appear. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This refers to waiting for a new HTTP request to appear on an idle connection that is being considered for closure, right? If that is the purpose, or one of the purposes of this timeout, it should be mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Closure of an idle connection is one of the purposes. There are quite some others mentioned in the haproxy docs. I didn't want to favorise one over another or list them all (as there are many). I think this message is consitent with other timeouts we have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear where the wait is happening, so it would be better to be explicit here, or give examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrased it to:
// httpKeepAliveTimeout defines the maximum allowed time to wait for
// a new HTTP request to appear on a connection from the client to the router.
I hope this makes it clearer.
| // fraction and a unit suffix, e.g. "300ms", "1.5h" or "2h45m". | ||
| // Valid time units are "ns", "us" (or "µs" U+00B5 or "μs" U+03BC), "ms", "s", "m", "h". | ||
| // | ||
| // When omitted, this means the user has no opinion and the platform is left |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding a section starting with "// Setting this field is generally not recommended..." is always helpful. We have it on most of the other tuning options to help people understand the consequence of changing the default value, with explanation for what happens if you set it too high (idle connections remain open longer and use unnecessary resources?), and what happens if you set it too low (idle connection could be closed sooner than wanted and interrupt traffic?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"// Setting this field is generally not recommended..."
I cannot say that it's "not recommended". It's a prerogative of a customer, that's why we had an RFE fo it. I can elaborate on corner cases though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added 2 "paragraphs" about the potential impact of setting low and high values.
| // to choose a reasonable default. This default is subject to change over time. | ||
| // The current default is 300s. | ||
| // | ||
| // +kubebuilder:validation:Pattern=^(0|([0-9]+(\.[0-9]+)?(ns|us|µs|μs|ms|s|m|h))+)$ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// +kubebuilder:validation:Format=duration
does this works instead? spotted it on other field above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the duration validation is not aligned in the tuning options. I used the explicit regex for a reason that some kubebuilder's duration != golang's duration. We have a bug which showcases this for the client timeout.
However this made me think of the fact that I forgot to add tests for the new field, will do them.
| // +kubebuilder:validation:Pattern=^(0|([0-9]+(\.[0-9]+)?(ns|us|µs|μs|ms|s|m|h))+)$ | ||
| // +kubebuilder:validation:Type:=string | ||
| // +optional | ||
| HTTPKeepAliveTimeout *metav1.Duration `json:"httpKeepAliveTimeout,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note
We prefer not to use duration values anymore. Instead, we would create a int32 type, with units in the name. For example, this should be httpKeepAliveTimeoutSeconds.
Referring linter: kubernetes-sigs/kube-api-linter#24
We have a bunch of other *metav1.Duration types as part of this structure and I think we should keep them consistent with the new field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a bunch of other *metav1.Duration types as part of this structure and I think we should keep them consistent with the new field.
I think this makes sense for new APIs (or new fields of existing APIs). But here I'm thinking of the consistency in the scope of the same API field. All other timeouts we have in IngressController.Spec.TuningOptions are metav1.Duration. Using httpKeepAliveTimeoutSeconds will break the existing pattern and harm the user experience. I acknowledge the new rule but I would like to stay consistent with other timeouts. Unless it's a hard requirement without which we won't get an approval from the API team.
7a86cb6 to
61a5799
Compare
operator/v1/types_ingress.go
Outdated
| // Low values (tens of milliseconds or less) can cause clients to close and reopen connections | ||
| // for each request, leading to excessive TCP or SSL handshakes. | ||
| // For HTTP/2, special care should be taken with low values. | ||
| // A few seconds is a reasonable starting point to avoid holding idle connections open | ||
| // while still allowing subsequent requests to reuse the connection. | ||
| // | ||
| // High values (more than a minute) can cause idle connections to linger, | ||
| // increasing exposure to long-lived but inactive connection attacks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since our default is 300s, I don't think we can can bluntly say that values more than a minute increases exposure to attacks. If so, why don't we change the default now?
As a matter of fact, why not mention the use case from the bug? Also, unless we've tested every single value, we can't really be explicit about what is a high or low value.
| // Low values (tens of milliseconds or less) can cause clients to close and reopen connections | |
| // for each request, leading to excessive TCP or SSL handshakes. | |
| // For HTTP/2, special care should be taken with low values. | |
| // A few seconds is a reasonable starting point to avoid holding idle connections open | |
| // while still allowing subsequent requests to reuse the connection. | |
| // | |
| // High values (more than a minute) can cause idle connections to linger, | |
| // increasing exposure to long-lived but inactive connection attacks. | |
| // Setting this value requires a careful consideration of the impact. Choosing a value too | |
| // low can cause clients to close and reopen connections for every request, leading to reduced | |
| // connection sharing. Choosing a value too high can cause idle connections to linger, | |
| // increasing exposure to long-lived but inactive connection attacks. | |
| // | |
| // For HTTP/2, special care should be taken with low values. | |
| // A few seconds is a reasonable starting point to avoid holding idle connections open | |
| // while still allowing subsequent requests to reuse the connection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since our default is 300s, I don't think we can can bluntly say that values more than a minute increases exposure to attacks.
I'm going to rephrase it to:
// High values (minutes or more) favor connection reuse but may cause idle
// connections to linger longer.
to highlight more the "reuse" aspect which we chose initially with 300s value.
If so, why don't we change the default now?
I believe that there is no perfect default value here. We favor the connection reuse by default. Some customers want this timeout to be even longer (like in https://issues.redhat.com/browse/RFE-1284), some prefer it to be smaller (like in https://issues.redhat.com//browse/OCPBUGS-61858). Changing the default can result in the behavior change as we saw in https://issues.redhat.com//browse/OCPBUGS-61858 where the bug was fixed and timeout http-keep-alive started to be respected leading to more concurrent connections.
As a matter of fact, why not mention the use case from the bug?
What use case from the bug you think can be mentioned here? Just for the context: https://issues.redhat.com//browse/OCPBUGS-61858 discovered a fix for timeout http-keep-alive which was done in HAProxy 2.8. In HAProxy 2.6 the timeout http-keep-alive was not respected and was falling back to shorter client timeout.
Also, unless we've tested every single value, we can't really be explicit about what is a high or low value.
I feel like we have to define what "low" and "high" values at least in order of magnitude. Otherwise this will always raise a follow-up question about what is "low" and what is "high".
HAProxy documentation is explicit about the values, I think we can safely use them:
In general "timeout http-keep-alive" is best used to prevent clients from
holding open an otherwise idle connection too long on sites seeing large
amounts of short connections. This can be accomplished by setting the value
to a few tens to hundreds of milliseconds in HTTP/1.1.
Another use case is the exact opposite: some sites want to permit clients
to reuse idle connections for a long time (e.g. 30 seconds to one minute)
A suggested low starting value for HTTP/2 connections would be around
4 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disregarding the word-smithing, if our default is still 5 minutes, then I think we're not addressing the bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bug will not be addressed by changing the default value because this will impact all the other customers who use HAProxy 2.8 right now. We will suggest the customer from https://issues.redhat.com//browse/OCPBUGS-61858 to set the timeout to the value they need.
This commit introduces `HTTPKeepAliveTimeout` tuning option to the IngressController API, allowing customers to configure `timeout http-keep-alive`. In OCP versions prior to 4.16, this timeout was not respected (see haproxy/haproxy#2334). This addition brings the ability to adjust the behavior to match pre-4.16 configurations.
61a5799 to
a47ad11
Compare
|
@alebedev87: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Tested it with 4.21.0-0-2025-10-29-070835-test-ci-ln-g86vs5k-latest |
|
/label qe-approved |
|
@alebedev87: This pull request references Jira Issue OCPBUGS-61858, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from an API shadow reviewer perspective.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: saschagrunert The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This PR introduces
HTTPKeepAliveTimeouttuning option to the IngressController API, allowing customers to configuretimeout http-keep-alive.In OCP versions prior to 4.16, this timeout was not respected (see haproxy/haproxy#2334). This addition brings the ability to adjust the behavior to match pre-4.16 configurations.
Xref old RFE: https://issues.redhat.com/browse/RFE-1284.