
Conversation

@gmarouli (Contributor) commented Feb 11, 2022

The issue we are addressing

We are trying to avoid overwhelming the node that coordinates a stats or recovery request when other nodes are having trouble. See #51992:

if a single node becomes unresponsive for a few minutes then we could see a couple hundred requests build up and in a decent sized cluster each could consume many MBs of heap on the coordinating node.

Approach

We are introducing a limit on how many stats and recovery requests a node can coordinate concurrently. When a node receives one request too many, it rejects it with a 409. We are not concerned with further handling of this error because we assume that these requests would have timed out anyway. The limit is configurable as a cluster setting.

The choice to have one limit that applies to both kinds of requests was made for the sake of simplicity. This is a protective mechanism that is triggered only to defend the node when the cluster is having trouble, not a rate limiter for when things are going well. We could alternatively create a separate limit for each request type we want to bound, if there are objections to this approach.
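
For illustration, a minimal sketch of how such a configurable limit could be exposed as a dynamic cluster setting. The setting key, default value, and class name below are assumptions for the sketch, not necessarily what this PR uses:

    import org.elasticsearch.common.settings.Setting;

    // Hypothetical holder class; key and default value are illustrative only.
    public class StatsRequestLimits {
        public static final Setting<Integer> MAX_CONCURRENT_STATS_REQUESTS = Setting.intSetting(
            "node.stats.max_concurrent_requests", // assumed setting key
            100,                                  // assumed default number of permits
            1,                                    // minimum allowed value
            Setting.Property.Dynamic,             // updatable at runtime via the cluster settings API
            Setting.Property.NodeScope
        );
    }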

Implementation details

We introduce StatsRequestLimiter, which uses AdjustableSemaphore to bound the requests. Via the method tryToExecute(Task task, Request request, ActionListener<Response> listener, TriConsumer<Task, Request, ActionListener<Response>> execute), this class orchestrates all aspects of pushing back: it checks the semaphore, tracks metrics, and handles exceptions.
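
A rough sketch of that flow, assuming a plain java.util.concurrent.Semaphore in place of the AdjustableSemaphore and an assumed exception type for the 409 rejection; field names and class wiring are illustrative, not the exact PR code:

    import java.util.concurrent.Semaphore;

    import org.elasticsearch.ElasticsearchStatusException;
    import org.elasticsearch.action.ActionListener;
    import org.elasticsearch.common.TriConsumer;
    import org.elasticsearch.common.util.concurrent.RunOnce;
    import org.elasticsearch.rest.RestStatus;
    import org.elasticsearch.tasks.Task;

    // Simplified sketch of StatsRequestLimiter.tryToExecute: acquire a permit, run the
    // action, and release the permit once the listener completes (or if execution fails).
    class StatsRequestLimiter {
        // In the PR this is an AdjustableSemaphore sized by the cluster setting.
        private final Semaphore semaphore = new Semaphore(100);

        <Request, Response> void tryToExecute(
            Task task,
            Request request,
            ActionListener<Response> listener,
            TriConsumer<Task, Request, ActionListener<Response>> execute
        ) {
            if (semaphore.tryAcquire() == false) {
                // Push back: the node is already coordinating the maximum number of these requests.
                listener.onFailure(
                    new ElasticsearchStatusException("too many concurrent stats requests", RestStatus.CONFLICT)
                );
                return;
            }
            final Runnable release = new RunOnce(semaphore::release); // safe against double invocation
            boolean success = false;
            try {
                // Release the permit just before the original listener is completed.
                execute.apply(task, request, ActionListener.runBefore(listener, release::run));
                success = true;
            } finally {
                if (success == false) {
                    release.run(); // execute threw before the listener was wired up
                }
            }
        }
    }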

Affected actions:

  • IndicesStatsAction
  • RecoveryAction
  • IndicesSegmentsAction
  • NodesStatsAction
  • ClusterStatsAction
  • NodesInfoAction
  • NodesUsageAction

Resolves #51992

@gmarouli added the >enhancement, :Data Management/Stats, Team:Data Management, and v8.2.0 labels on Feb 11, 2022
@gmarouli self-assigned this on Feb 11, 2022
@elasticsearchmachine (Collaborator)

Hi @gmarouli, I've created a changelog YAML for you.

@elasticsearchmachine (Collaborator)

Hi @gmarouli, I've updated the changelog YAML for you.

@gmarouli marked this pull request as ready for review on February 11, 2022 17:57
@elasticmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@dakrone (Member) left a comment

Thanks for working on this Mary, I left a few small comments.

Another question I have: is this something that makes sense to track as a stats metric? It seems like it might be useful to have a "number of times diagnostics were requested but rejected" counter, similar to the way we track threadpool rejections. Having something like that would help an administrator or support engineer tweak the setting when stats requests are overloading a cluster.

If that sounds useful maybe we can think about it as a follow-up to this?

@gmarouli (Contributor, Author)

Another question I have: is this something that makes sense to track as a stats metric? It seems like it might be useful to have a "number of times diagnostics were requested but rejected" counter, similar to the way we track threadpool rejections. Having something like that would help an administrator or support engineer tweak the setting when stats requests are overloading a cluster.

This is a very good point. I think it should be part of this PR, as long as it does not explode its scope, because it makes this a more rounded feature. If we want this to give users enough information, I would say we need more than just the rejected requests, right? Shouldn't we monitor all three aspects:

  • Acquiring
  • Releasing
  • Rejecting

The reason I am suggesting this is that the problem only becomes apparent once requests are rejected, but there are usually earlier signs that could be useful and help users find the beginning of the problem. What do you think?

@DaveCTurner (Contributor) left a comment

I'm undecided whether we should exclude requests that specify target indices. Monitoring clients just request everything, and these are the ones that cause trouble when they build up, but a client that just asks for stats about a single index will cause fewer problems; moreover such a client might want to make many more requests in parallel.

There's a few TransportNodesAction subclasses that we might also want to consider limiting, particularly cluster stats and node stats.

@DaveCTurner (Contributor)

Shouldn't we monitor all three aspects:

In threadpool stats we track counts of completed and rejected actions, cumulative over the life of each node, and also report the number of currently-active threads (and various other things that don't translate to this situation). I find the currently-active value to be particularly useful when visualised as a time series. So +1 to tracking these things here too.
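
As a sketch of what that could look like here, mirroring the threadpool stats shape (cumulative completed/rejected counters plus a current gauge); the class and method names are assumptions, not what the PR ended up with:

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical per-node counters for the limiter, to be reported via node stats.
    class StatsRequestTracking {
        private final AtomicInteger current = new AtomicInteger();  // requests being coordinated right now
        private final AtomicLong completed = new AtomicLong();      // cumulative over the node's lifetime
        private final AtomicLong rejected = new AtomicLong();       // cumulative rejections (the 409s)

        void onAcquired() {
            current.incrementAndGet();
        }

        void onReleased() {
            current.decrementAndGet();
            completed.incrementAndGet();
        }

        void onRejected() {
            rejected.incrementAndGet();
        }
    }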

@DaveCTurner (Contributor) left a comment

Sorry for the quick-fire reviews; this one is just alternative naming suggestions.

@gmarouli (Contributor, Author)

I'm undecided whether we should exclude requests that specify target indices. Monitoring clients just request everything, and these are the ones that cause trouble when they build up, but a client that just asks for stats about a single index will cause fewer problems; moreover such a client might want to make many more requests in parallel.

Hm... I am not sure this is worth the complexity. I am concerned that it makes the behaviour harder to explain to users, and it is not easy to know where to draw the line: what if the index expression has a wildcard that matches almost all indices? So I would like to add this only if it really adds value; for example, what is the worst thing that can happen if we do not distinguish between the two? The case I can think of is a user making a separate call for every single index they have and exceeding the limit, but is that a common case?

There's a few TransportNodesAction subclasses that we might also want to consider limiting, particularly cluster stats and node stats.

I think it makes sense to add them too if they could cause the same issue, now that (hopefully) the stats calls will not. That would change the structure of the code a bit, but I think it's worth it.

I will snoop around the code to try to come up with a list of relevant cases, but if you already have things in mind, like the cluster and node stats you mentioned, feel free to add them :).

@gmarouli (Contributor, Author)

Suggested calls to be limited:

  • TransportIndicesSegmentsAction extends TransportBroadcastByNodeAction
  • TransportNodesStatsAction extends TransportNodesAction
  • TransportClusterStatsAction extends TransportNodesAction
  • TransportNodesHotThreadsAction extends TransportNodesAction (I am not sure whether this one is too specialized to be worth limiting)
  • TransportNodesInfoAction extends TransportNodesAction
  • TransportNodesUsageAction extends TransportNodesAction

Internal actions are excluded, as are more specialized stats actions such as GeoIpDownloaderStatsTransportAction. What do you think about this list?

@dakrone (Member) left a comment

This generally looks good to me. I left a couple more comments, but I'd also like to wait for David's review (and confirmation about the double-release comment) before merging.

Comment on lines +76 to +81
        executeAction.apply(task, request, ActionListener.runBefore(listener, release::run));
        success = true;
    } finally {
        if (success == false) {
            release.run();
        }
@dakrone (Member) commented Feb 23, 2022

I think (but I'm not sure) this can double-invoke the release.run().

Since the request wraps the release::run before the original listener, the executeAction can run, the first invocation of release.run() can happen (the one from runBefore), then the original listener can throw an exception, which would cause the success boolean not to be updated, and then the release.run() in the finally block would run a second time.

I guess it depends on whether ActionListeners are expected to ever throw exceptions (@DaveCTurner can probably clarify on this point), but if they can, then maybe the runBefore call could do something like () -> { success.getAndSet(true); release.run(); } and then it wouldn't have the potential for invoking twice? (if they can't, then this is probably a moot point)

@gmarouli (Contributor, Author)

I am processing the possibilities but I do not have an answer yet. Coming asap

@DaveCTurner (Contributor)

Yes I believe it's possible that executeAction might throw an exception but still go on to complete its listener. I mean it probably shouldn't, but in general we're ok with completing an ActionListener multiple times so we don't protect against this.

However here the release runnable is a RunOnce so it doesn't matter.
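
For reference, the guarantee being relied on here is roughly the following; this is a standalone sketch of a run-once wrapper, not the actual org.elasticsearch.common.util.concurrent.RunOnce source:

    import java.util.concurrent.atomic.AtomicBoolean;

    // A Runnable wrapper whose delegate runs at most once; later calls are no-ops.
    final class RunOnceSketch implements Runnable {
        private final Runnable delegate;
        private final AtomicBoolean hasRun = new AtomicBoolean();

        RunOnceSketch(Runnable delegate) {
            this.delegate = delegate;
        }

        @Override
        public void run() {
            if (hasRun.compareAndSet(false, true)) {
                delegate.run();
            }
        }
    }

So even if both the runBefore hook and the finally block end up calling release.run(), the semaphore permit is released only once.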

@gmarouli (Contributor, Author) commented Feb 24, 2022

I will start with the suggestion that the runBefore() does something like () -> { success.getAndSet(true); release.run(); }, because I feel more confident about my analysis :). I believe this will cause double invocation in almost all cases, because:

  1. executeAction wires the runBefore to run before the original action listener, starts whatever async work needs to run, and returns.
  2. Chances are that by this point the async work hasn't finished yet, which means success is still false and the finally block will call release.run(), effectively releasing the semaphore while the node is still coordinating.
  3. When the async work finishes, the original listener is notified and release.run() is called again.

Now for the trickier part: it is not yet clear to me where the original listener could throw an exception. If I look at the code of the RunBeforeActionListener:

        @Override
        public void onResponse(T response) {
            try {
                runBefore.run(); // the release runs here, before the wrapped listener is notified
            } catch (Exception ex) {
                super.onFailure(ex); // a failing runBefore is routed to onFailure instead
                return;
            }
            delegate.onResponse(response);
        }

The failure path is handled similarly: the release will be called even if the original listener ends in an error. I would expect the only situation where this can go wrong is if the listener is never completed even though executeAction has finished. But that would effectively mean the node is still coordinating, so we would have a different problem then, right?

Did I address your concern properly?

PS: I do not know if my understanding is correct or complete, so I would also appreciate @DaveCTurner's input.

@DaveCTurner (Contributor)

I think there's no need for any changes in this area - RunOnce already does what is needed.

It's not just about completing the listener once but also about releasing things in the finally block; you also need to protect against completing the listener multiple times. But we already do the right thing here.

@dakrone (Member)

Yep, totally didn't notice that the release was a RunOnce, that's fine then and no worries about this.

gmarouli and others added 2 commits February 24, 2022 12:50

@dakrone (Member) left a comment

LGTM, thanks for iterating Mary!

@gmarouli (Contributor, Author)

Thanks for the review @dakrone & @DaveCTurner. With your input the PR is much better than the initial effort!

@gmarouli merged commit ed0bb2a into elastic:master on Feb 28, 2022
@gmarouli deleted the push-back-excessive-requests-for-stats branch on February 28, 2022 07:46
joegallo added a commit to joegallo/elasticsearch that referenced this pull request Mar 30, 2022
@joegallo added the v8.3.0 label and removed the v8.2.0 label on Mar 31, 2022
@joegallo (Contributor) commented Mar 31, 2022

I reverted this from the 8.2 branch via #85504, and I've updated the version tag here to v8.3.0 rather than v8.2.0.
