
Conversation

@gmarouli (Contributor) commented Feb 11, 2022

The issue we are addressing

We are trying to avoid overwhelming the node that coordinates a stats or recovery request when other nodes are having trouble. See #51992:

if a single node becomes unresponsive for a few minutes then we could see a couple hundred requests build up and in a decent sized cluster each could consume many MBs of heap on the coordinating node.

Approach

We are introducing a limit on how many stats and recovery requests a node can coordinate concurrently. When a node receives one request too many, it rejects it with a 409. We are not concerned with further handling of this error because we assume that these requests would have timed out anyway. The limit is configurable as a cluster setting.

The choice to have one limit that applies to both kinds of requests was made for the sake of simplicity. This is a protective mechanism that is triggered only to defend the node when the cluster is having trouble, not a rate limiter for when things are going well. We could alternatively create a separate limit for each request type we want to bound, if there are objections to this approach.
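
For illustration, a minimal sketch of how such a configurable limit could be exposed as a dynamic cluster setting. The setting key, default value, and class name below are assumptions for the sketch, not necessarily what this PR uses:

    import org.elasticsearch.common.settings.Setting;

    // Hypothetical holder class; key and default value are illustrative only.
    public class StatsRequestLimits {
        public static final Setting<Integer> MAX_CONCURRENT_STATS_REQUESTS = Setting.intSetting(
            "node.stats.max_concurrent_requests", // assumed setting key
            100,                                  // assumed default number of permits
            1,                                    // minimum allowed value
            Setting.Property.Dynamic,             // updatable at runtime via the cluster settings API
            Setting.Property.NodeScope
        );
    }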

Implementation details

We introduce StatsRequestLimiter, which uses AdjustableSemaphore to bound the requests. Via the method tryToExecute(Task task, Request request, ActionListener<Response> listener, TriConsumer<Task, Request, ActionListener<Response>> execute), this class orchestrates all aspects of pushing back: it checks the semaphore, tracks metrics, and handles exceptions.
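
A rough sketch of that flow, assuming a plain java.util.concurrent.Semaphore in place of the AdjustableSemaphore and an assumed exception type for the 409 rejection; field names and class wiring are illustrative, not the exact PR code:

    import java.util.concurrent.Semaphore;

    import org.elasticsearch.ElasticsearchStatusException;
    import org.elasticsearch.action.ActionListener;
    import org.elasticsearch.common.TriConsumer;
    import org.elasticsearch.common.util.concurrent.RunOnce;
    import org.elasticsearch.rest.RestStatus;
    import org.elasticsearch.tasks.Task;

    // Simplified sketch of StatsRequestLimiter.tryToExecute: acquire a permit, run the
    // action, and release the permit once the listener completes (or if execution fails).
    class StatsRequestLimiter {
        // In the PR this is an AdjustableSemaphore sized by the cluster setting.
        private final Semaphore semaphore = new Semaphore(100);

        <Request, Response> void tryToExecute(
            Task task,
            Request request,
            ActionListener<Response> listener,
            TriConsumer<Task, Request, ActionListener<Response>> execute
        ) {
            if (semaphore.tryAcquire() == false) {
                // Push back: the node is already coordinating the maximum number of these requests.
                listener.onFailure(
                    new ElasticsearchStatusException("too many concurrent stats requests", RestStatus.CONFLICT)
                );
                return;
            }
            final Runnable release = new RunOnce(semaphore::release); // safe against double invocation
            boolean success = false;
            try {
                // Release the permit just before the original listener is completed.
                execute.apply(task, request, ActionListener.runBefore(listener, release::run));
                success = true;
            } finally {
                if (success == false) {
                    release.run(); // execute threw before the listener was wired up
                }
            }
        }
    }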

Affected actions:

  • IndicesStatsAction
  • RecoveryAction
  • IndicesSegmentsAction
  • NodesStatsAction
  • ClusterStatsAction
  • NodesInfoAction
  • NodesUsageAction

Resolves #51992

@gmarouli added the >enhancement, :Data Management/Stats, Team:Data Management, and v8.2.0 labels on Feb 11, 2022
@gmarouli self-assigned this on Feb 11, 2022
@elasticsearchmachine (Collaborator)

Hi @gmarouli, I've created a changelog YAML for you.

@elasticsearchmachine (Collaborator)

Hi @gmarouli, I've updated the changelog YAML for you.

@gmarouli marked this pull request as ready for review on February 11, 2022 17:57
@elasticmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@dakrone (Member) left a comment

Thanks for working on this Mary, I left a few small comments.

Another question I have: is this something that makes sense to track as a stats metric? It seems like it might be useful to have a "number of times diagnostics were requested but rejected" counter, similar to the way we track threadpool rejections. Having something like that would help an administrator or support engineer tweak the setting when stats requests are overloading a cluster.

If that sounds useful maybe we can think about it as a follow-up to this?

@gmarouli (Contributor, Author)

Another question I have: is this something that makes sense to track as a stats metric? It seems like it might be useful to have a "number of times diagnostics were requested but rejected" counter, similar to the way we track threadpool rejections. Having something like that would help an administrator or support engineer tweak the setting when stats requests are overloading a cluster.

This is a very good point. I think it should be part of this PR, as long as it does not explode its scope, because it makes this a more rounded feature. If we want this to give users enough information, I would say we need more than just the rejected requests, right? Shouldn't we monitor all three aspects:

  • Acquiring
  • Releasing
  • Rejecting

The reason I am suggesting this is that the problem only becomes apparent once requests are rejected, but there are usually earlier signs that could be useful and help users find the beginning of the problem. What do you think?

@DaveCTurner (Contributor) left a comment

I'm undecided whether we should exclude requests that specify target indices. Monitoring clients just request everything, and these are the ones that cause trouble when they build up, but a client that just asks for stats about a single index will cause fewer problems; moreover such a client might want to make many more requests in parallel.

There's a few TransportNodesAction subclasses that we might also want to consider limiting, particularly cluster stats and node stats.

@DaveCTurner (Contributor)

Shouldn't we monitor all three aspects:

In threadpool stats we track counts of completed and rejected actions, cumulative over the life of each node, and also report the number of currently-active threads (and various other things that don't translate to this situation). I find the currently-active value to be particularly useful when visualised as a time series. So +1 to tracking these things here too.
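
As a sketch of what that could look like here, mirroring the threadpool stats shape (cumulative completed/rejected counters plus a current gauge); the class and method names are assumptions, not what the PR ended up with:

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical per-node counters for the limiter, to be reported via node stats.
    class StatsRequestTracking {
        private final AtomicInteger current = new AtomicInteger();  // requests being coordinated right now
        private final AtomicLong completed = new AtomicLong();      // cumulative over the node's lifetime
        private final AtomicLong rejected = new AtomicLong();       // cumulative rejections (the 409s)

        void onAcquired() {
            current.incrementAndGet();
        }

        void onReleased() {
            current.decrementAndGet();
            completed.incrementAndGet();
        }

        void onRejected() {
            rejected.incrementAndGet();
        }
    }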

@DaveCTurner (Contributor) left a comment

Sorry for the quick-fire reviews; this one is just alternative naming suggestions.

@gmarouli (Contributor, Author)

I'm undecided whether we should exclude requests that specify target indices. Monitoring clients just request everything, and these are the ones that cause trouble when they build up, but a client that just asks for stats about a single index will cause fewer problems; moreover such a client might want to make many more requests in parallel.

Hm... I am not sure this is worth the complexity. I am concerned that it makes the behaviour harder to explain to users, and it is not easy to know where to draw the line: what if the index expression has a wildcard that matches almost all indices? So I would like to add this only if it really adds value; for example, what is the worst thing that can happen if we do not distinguish between the two? The case I can think of is a user making a separate call for every single index they have and exceeding the limit, but is that a common case?

There's a few TransportNodesAction subclasses that we might also want to consider limiting, particularly cluster stats and node stats.

I think it makes sense to add them too if they could cause the same issue, now that (hopefully) the stats calls will not. That would change the structure of the code a bit, but I think it's worth it.

I will snoop around the code to try to come up with a list of relevant cases, but if you already have things in mind, like the cluster and node stats you mentioned, feel free to add them :).

@gmarouli (Contributor, Author)

Suggested calls to be limited:

  • TransportIndicesSegmentsAction extends TransportBroadcastByNodeAction
  • TransportNodesStatsAction extends TransportNodesAction
  • TransportClusterStatsAction extends TransportNodesAction
  • TransportNodesHotThreadsAction extends TransportNodesAction (I am not sure whether this one is too specialized to be worth limiting)
  • TransportNodesInfoAction extends TransportNodesAction
  • TransportNodesUsageAction extends TransportNodesAction

Internal actions are excluded, as are more specialized stats actions such as GeoIpDownloaderStatsTransportAction. What do you think about this list?

@dakrone (Member) left a comment

This generally looks good to me. I left a couple more comments, but I'd also like to wait for David's review (and confirmation about the double-release comment) before merging.

Comment on lines +76 to +81
        executeAction.apply(task, request, ActionListener.runBefore(listener, release::run));
        success = true;
    } finally {
        if (success == false) {
            release.run();
        }
@dakrone (Member) commented Feb 23, 2022

I think (but I'm not sure) this can double-invoke the release.run().

Since the request wraps the release::run before the original listener, the executeAction can run, the first invocation of release.run() can happen (the one from runBefore), then the original listener can throw an exception, which would cause the success boolean not to be updated, and then the release.run() in the finally block would run a second time.

I guess it depends on whether ActionListeners are expected to ever throw exceptions (@DaveCTurner can probably clarify on this point), but if they can, then maybe the runBefore call could do something like () -> { success.getAndSet(true); release.run(); } and then it wouldn't have the potential for invoking twice? (if they can't, then this is probably a moot point)

@gmarouli (Contributor, Author)

I am processing the possibilities but I do not have an answer yet. Coming asap

@DaveCTurner (Contributor)

Yes I believe it's possible that executeAction might throw an exception but still go on to complete its listener. I mean it probably shouldn't, but in general we're ok with completing an ActionListener multiple times so we don't protect against this.

However here the release runnable is a RunOnce so it doesn't matter.
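
For reference, the guarantee being relied on here is roughly the following; this is a standalone sketch of a run-once wrapper, not the actual org.elasticsearch.common.util.concurrent.RunOnce source:

    import java.util.concurrent.atomic.AtomicBoolean;

    // A Runnable wrapper whose delegate runs at most once; later calls are no-ops.
    final class RunOnceSketch implements Runnable {
        private final Runnable delegate;
        private final AtomicBoolean hasRun = new AtomicBoolean();

        RunOnceSketch(Runnable delegate) {
            this.delegate = delegate;
        }

        @Override
        public void run() {
            if (hasRun.compareAndSet(false, true)) {
                delegate.run();
            }
        }
    }

So even if both the runBefore hook and the finally block end up calling release.run(), the semaphore permit is released only once.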

@gmarouli (Contributor, Author) commented Feb 24, 2022

I will start with the suggestion that the runBefore() does something like () -> { success.getAndSet(true); release.run(); }, because I feel more confident about my analysis :). I believe this will cause double invocation in almost all cases, because:

  1. executeAction wires the runBefore to run before the original action listener, starts whatever async work needs to run, and returns.
  2. Chances are that by this point the async work hasn't finished yet, which means success is still false and the finally block will call release.run(), effectively releasing the semaphore while the node is still coordinating.
  3. When the async work finishes, the original listener is notified and release.run() is called again.

Now for the trickier part: it is not yet clear to me where the original listener could throw an exception. If I look at the code of the RunBeforeActionListener:

        @Override
        public void onResponse(T response) {
            try {
                runBefore.run(); // the release runs here, before the wrapped listener is notified
            } catch (Exception ex) {
                super.onFailure(ex); // a failing runBefore is routed to onFailure instead
                return;
            }
            delegate.onResponse(response);
        }

The failure path is handled similarly: the release will be called even if the original listener ends in an error. I would expect the only situation where this can go wrong is if the listener is never completed even though executeAction has finished. But that would effectively mean the node is still coordinating, so we would have a different problem then, right?

Did I address your concern properly?

PS: I do not know if my understanding is correct or complete, so I would also appreciate @DaveCTurner's input.

@DaveCTurner (Contributor)

I think there's no need for any changes in this area - RunOnce already does what is needed.

It's not just about completing the listener once but also about releasing things in the finally block; you also need to protect against completing the listener multiple times. But we already do the right thing here.

@dakrone (Member)

Yep, totally didn't notice that the release was a RunOnce, that's fine then and no worries about this.

gmarouli and others added 2 commits February 24, 2022 12:50

@dakrone (Member) left a comment

LGTM, thanks for iterating Mary!

@gmarouli (Contributor, Author)

Thanks for the review @dakrone & @DaveCTurner. With your input the PR is much better than the initial effort!

@gmarouli merged commit ed0bb2a into elastic:master on Feb 28, 2022
@gmarouli deleted the push-back-excessive-requests-for-stats branch on February 28, 2022 07:46
joegallo added a commit to joegallo/elasticsearch that referenced this pull request Mar 30, 2022
@joegallo added the v8.3.0 label and removed the v8.2.0 label on Mar 31, 2022
@joegallo (Contributor) commented Mar 31, 2022

I reverted this from the 8.2 branch via #85504, and I've updated the version tag here to v8.3.0 rather than v8.2.0.
