Push back excessive requests for stats #83832
Conversation
Hi @gmarouli, I've created a changelog YAML for you.
Hi @gmarouli, I've updated the changelog YAML for you.
Hi @gmarouli, I've updated the changelog YAML for you.
Pinging @elastic/es-data-management (Team:Data Management)
dakrone left a comment
Thanks for working on this Mary, I left a few small comments.
Another question I have: is this something that makes sense to track as a stats metric? It seems like it might be useful to have a "number of times diagnostics were requested but rejected" metric, similar to the way we track threadpool rejections. Having something like that would let an administrator or support engineer tweak the setting when stats requests are overloading a cluster.
If that sounds useful, maybe we can think about it as a follow-up to this?
...in/java/org/elasticsearch/action/support/broadcast/node/BoundedDiagnosticRequestPermits.java (outdated; resolved)
...in/java/org/elasticsearch/action/support/broadcast/node/BoundedDiagnosticRequestPermits.java (outdated; resolved)
This is a very good point. I think it should be part of this PR, if it doesn't explode the scope, because it makes this a more rounded feature. If we want this to provide enough information to users, then I would say we need more than just the rejected requests, right? Shouldn't we monitor all three aspects (current, completed, and rejected requests)?
The reason I am suggesting this is that the problem may only become apparent once requests are rejected, but there are usually earlier signs that could be useful and can help users find the beginning of the problem. What do you think?
DaveCTurner left a comment
I'm undecided whether we should exclude requests that specify target indices. Monitoring clients just request everything, and these are the ones that cause trouble when they build up, but a client that just asks for stats about a single index will cause fewer problems; moreover such a client might want to make many more requests in parallel.
There are a few TransportNodesAction subclasses that we might also want to consider limiting, particularly cluster stats and node stats.
...in/java/org/elasticsearch/action/support/broadcast/node/BoundedDiagnosticRequestPermits.java (outdated; resolved)
In threadpool stats we track counts of completed and rejected actions, cumulative over the life of each node, and also report the number of currently-active threads (and various other things that don't translate to this situation). I find the currently-active value to be particularly useful when visualised as a time series. So +1 to tracking these things here too.
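[Editor's note] To make the idea concrete, here is a minimal sketch of such per-node tracking. The class and method names are illustrative placeholders, not the actual Elasticsearch implementation.

import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: cumulative counters for the three values discussed above
// (currently-coordinating, completed, rejected), kept for the life of the node.
public class RequestTrackingCounters {
    private final LongAdder current = new LongAdder();   // requests being coordinated right now
    private final LongAdder completed = new LongAdder(); // requests that finished coordinating
    private final LongAdder rejected = new LongAdder();  // requests pushed back by the limiter

    public void onRequestStarted()   { current.increment(); }
    public void onRequestCompleted() { current.decrement(); completed.increment(); }
    public void onRequestRejected()  { rejected.increment(); }

    public long getCurrent()   { return current.sum(); }
    public long getCompleted() { return completed.sum(); }
    public long getRejected()  { return rejected.sum(); }
}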
DaveCTurner left a comment
Sorry for the quick-fire reviews, this one is just alternative naming suggestions.
...in/java/org/elasticsearch/action/support/broadcast/node/BoundedDiagnosticRequestPermits.java (outdated; resolved)
...n/java/org/elasticsearch/action/support/broadcast/node/TransportBoundedDiagnosticAction.java (outdated; resolved)
...in/java/org/elasticsearch/action/support/broadcast/node/BoundedDiagnosticRequestPermits.java (outdated; resolved)
Hm... I am not sure this is worth the complexity. I am concerned that it makes the behaviour more difficult to explain to users, and it is not easy to know where to draw the line: what if the index expression has a wildcard that matches almost all indices? So I would like to add this only if it really adds value. For example, what is the worst thing that can happen if we do not distinguish between the two? The case I can think of is a user making a separate call for every single index they have and exceeding the limit, but is that a common case?
I think it makes sense to limit them too if they could cause the same issue, now that the stats calls hopefully will not. That would change the structure of the code a bit, but I think it's worth it. I will snoop around the code to try to come up with a list of relevant cases, but if you already have things in mind, like the cluster and node stats you mentioned, feel free to add them :).
Suggested calls to be limited:
Internal actions are excluded and more specialized stats such as
dakrone left a comment
This generally looks good to me, I left a couple more comments, but I'd also like to wait for David's review (and confirmation about the double-release comment) before merging.
server/src/main/java/org/elasticsearch/action/support/StatsRequestLimiter.java (outdated; resolved)
    executeAction.apply(task, request, ActionListener.runBefore(listener, release::run));
    success = true;
} finally {
    if (success == false) {
        release.run();
    }
I think (but I'm not sure) this can double-invoke the release.run().
Since the request wraps the release::run before the original listener, the executeAction can run, the first invocation of release.run() can happen (the one from runBefore), then the original listener can throw an exception, which would cause the success boolean not to be updated, and then the release.run() in the finally block would run a second time.
I guess it depends on whether ActionListeners are expected to ever throw exceptions (@DaveCTurner can probably clarify on this point), but if they can, then maybe the runBefore call could do something like () -> { success.getAndSet(true); release.run(); } and then it wouldn't have the potential for invoking twice? (if they can't, then this is probably a moot point)
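[Editor's note] To make the sequencing concern concrete, here is a tiny standalone illustration of the scenario described above. The names and wiring are hypothetical stand-ins for the executeAction/listener/release interplay, not the actual Elasticsearch code.

import java.util.concurrent.atomic.AtomicInteger;

// Stand-in demo: if the release is a plain Runnable (not a RunOnce) and the
// action completes its listener (running the release) and then throws, the
// finally block releases a second time because success was never set to true.
public class DoubleReleaseDemo {
    public static void main(String[] args) {
        AtomicInteger releases = new AtomicInteger();
        Runnable release = releases::incrementAndGet; // plain Runnable, not a RunOnce

        boolean success = false;
        try {
            executeAction(release); // runs the release, then throws
            success = true;         // never reached because executeAction threw
        } catch (RuntimeException e) {
            // swallowed so the demo can print; in the real code the exception
            // would propagate, but the finally block would still run
        } finally {
            if (success == false) {
                release.run(); // second release, from the finally block
            }
        }
        System.out.println("release invoked " + releases.get() + " times"); // prints 2
    }

    // Stand-in for executeAction: it "completes its listener" (running the
    // runBefore release) and then throws, as in the scenario described above.
    private static void executeAction(Runnable release) {
        release.run();
        throw new RuntimeException("listener threw after being completed");
    }
}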
I am processing the possibilities but I do not have an answer yet. Coming asap
Yes I believe it's possible that executeAction might throw an exception but still go on to complete its listener. I mean it probably shouldn't, but in general we're ok with completing an ActionListener multiple times so we don't protect against this.
However here the release runnable is a RunOnce so it doesn't matter.
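[Editor's note] For readers following along, the RunOnce behaviour referred to here can be pictured roughly as follows. This is a simplified sketch of the idea, not the actual Elasticsearch class.

import java.util.concurrent.atomic.AtomicBoolean;

// Simplified sketch: the wrapped Runnable executes at most once, so a second
// release of the permit becomes a no-op rather than an over-release.
final class RunOnceSketch implements Runnable {
    private final Runnable delegate;
    private final AtomicBoolean hasRun = new AtomicBoolean(false);

    RunOnceSketch(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        if (hasRun.compareAndSet(false, true)) {
            delegate.run();
        }
    }
}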
I will start with the suggestion that the runBefore() does something like () -> { success.getAndSet(true); release.run(); }, because I feel more confident about my analysis :). I believe this will cause double invocation in almost all cases, because:
1. executeAction introduces the runBefore to run before the original action listener, starts whatever async work needs to be run, and finishes.
2. Chances are that by this point the async work hasn't finished yet, which means success is false and the finally block will run release.run(), effectively releasing the semaphore while the node is still coordinating.
3. When the async work finishes, the original listener will be notified and release.run() will be called again.
Now let's go to the trickier part: it's not clear to me yet where the original listener could throw an exception. If I look at the code of the RunBeforeActionListener:
@Override
public void onResponse(T response) {
    try {
        runBefore.run();
    } catch (Exception ex) {
        super.onFailure(ex);
        return;
    }
    delegate.onResponse(response);
}
The failure path is handled similarly: the release will be called even if the original listener has an error. I would expect that the only situation where this can go wrong is if the listener is never completed even though executeAction finished. But that would effectively mean the node is still coordinating, so we would have a different problem then, right?
Did I address your concern properly?
PS: I do not know if my understanding is correct or complete, so I would also appreciate @DaveCTurner's input.
I think there's no need for any changes in this area - RunOnce already does what is needed.
It's not just about completing the listener once or releasing things in the finally block; you also need to protect against completing the listener multiple times. But we already do the right thing here.
Yep, totally didn't notice that the release was a RunOnce, that's fine then and no worries about this.
server/src/main/java/org/elasticsearch/action/support/StatsRequestLimiter.java (outdated; resolved)
...sticsearch/action/support/broadcast/node/AbstractTransportBroadcastByNodeActionTestCase.java (outdated; resolved)
Co-authored-by: Lee Hinman <[email protected]>
...n/java/org/elasticsearch/action/support/broadcast/node/TransportBoundedDiagnosticAction.java (outdated; resolved)
dakrone left a comment
LGTM, thanks for iterating Mary!
Thanks for the review @dakrone & @DaveCTurner. With your input the PR is much better than the initial effort!
This reverts commit ed0bb2a.
I reverted this from the 8.2 branch via #85504, and I've updated the version tag here to v8.3.0 rather than v8.2.0.
The issue we are addressing
We are trying to avoid overwhelming the node that is coordinating a stats or recovery request when other nodes are having trouble. See #51992:
Approach
We are introducing a limit on how many stats and recovery requests a node can coordinate concurrently. When the node receives one request too many, it rejects it with a 409. We are not concerned with further handling of this error because we assume these requests would have timed out anyway. The limit is configurable as a cluster setting.

The choice to have one limit that applies to both request types was made for the sake of simplicity. This is meant as a protective mechanism that triggers only to defend the node when the cluster is in trouble, not to rate-limit requests when things are going well. We could alternatively create separate limits for each request type we want to bound, if there are objections to this approach.
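[Editor's note] A rough sketch of how such a dynamically updatable limit could be expressed as a cluster setting follows. The setting key and default value shown here are illustrative assumptions, not necessarily the ones used in the PR.

import org.elasticsearch.common.settings.Setting;

// Illustrative only: a dynamically updatable, node-scoped integer setting that
// bounds how many stats/recovery requests a node may coordinate at once.
public class StatsRequestLimitSettingSketch {
    public static final Setting<Integer> MAX_CONCURRENT_STATS_REQUESTS = Setting.intSetting(
        "node.stats.max_concurrent_requests", // hypothetical key
        100,                                  // hypothetical default
        1,                                    // minimum value
        Setting.Property.Dynamic,
        Setting.Property.NodeScope
    );
}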
Implementation details
We introduce StatsRequestLimiter, which uses AdjustableSemaphore to bound the requests. Via the method tryToExecute(Task task, Request request, ActionListener<Response> listener, TriConsumer<Task, Request, ActionListener<Response>> execute), this class orchestrates all aspects of pushing back: it checks the semaphore, tracks metrics, and handles exceptions.

Affected actions:
- IndicesStatsAction
- RecoveryAction
- IndicesSegmentsAction
- NodesStatsAction
- ClusterStatsAction
- NodesInfoAction
- NodesUsageAction

Resolves #51992
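[Editor's note] As a rough, self-contained illustration of the orchestration described above, here is a simplified sketch of the limiter. The Listener and TriConsumer types are cut-down stand-ins rather than the real Elasticsearch interfaces, the task is a plain String instead of a Task object, and the exception type used for rejection is an assumption.

import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.LongAdder;

// Simplified sketch of the limiting pattern: acquire a permit before the action
// starts coordinating, release it exactly once when the action completes (or
// fails to start), and track completed/rejected counts along the way.
public class StatsRequestLimiterSketch {

    interface Listener<Resp> {
        void onResponse(Resp response);
        void onFailure(Exception e);
    }

    interface TriConsumer<A, B, C> {
        void apply(A a, B b, C c);
    }

    private final Semaphore permits;
    private final LongAdder completed = new LongAdder();
    private final LongAdder rejected = new LongAdder();

    public StatsRequestLimiterSketch(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    public <Req, Resp> void tryToExecute(String task, Req request, Listener<Resp> listener,
                                         TriConsumer<String, Req, Listener<Resp>> execute) {
        if (permits.tryAcquire() == false) {
            rejected.increment();
            // In the PR this surfaces as an HTTP-level rejection; a plain exception stands in here.
            listener.onFailure(new IllegalStateException("too many concurrent stats requests"));
            return;
        }
        Runnable release = runOnce(() -> { permits.release(); completed.increment(); });
        Listener<Resp> wrapped = runBefore(listener, release);
        boolean success = false;
        try {
            execute.apply(task, request, wrapped);
            success = true;
        } finally {
            if (success == false) {
                release.run(); // the action threw before the wrapped listener could take over
            }
        }
    }

    // Runs the given runnable before delegating, on both the response and failure paths.
    private static <Resp> Listener<Resp> runBefore(Listener<Resp> delegate, Runnable before) {
        return new Listener<Resp>() {
            @Override public void onResponse(Resp response) {
                before.run();
                delegate.onResponse(response);
            }
            @Override public void onFailure(Exception e) {
                before.run();
                delegate.onFailure(e);
            }
        };
    }

    // Equivalent in spirit to RunOnce: the delegate executes at most once.
    private static Runnable runOnce(Runnable delegate) {
        AtomicBoolean done = new AtomicBoolean(false);
        return () -> {
            if (done.compareAndSet(false, true)) {
                delegate.run();
            }
        };
    }
}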