Push Back on Excessive Snapshot Repository API Requests

Currently, requests for the status of snapshots (`TransportGetSnapshotsAction` as well as `TransportSnapshotsStatusAction`) can result in long-running executions on the generic thread pool.
This is especially true for `TransportSnapshotsStatusAction`, which can easily take multiple minutes to run for cloud-backed repositories and large snapshots.
If a client sends a number of these requests at once, a large number of generic pool threads can become busy, and the in-flight requests add a lot of heap pressure.

One scenario where this could become troublesome is a client that retries a slow snapshot status request because its own timeout is shorter than the time the request takes to finish, adding ever more tasks to the `GENERIC` pool on the master node.
Another scenario, which has been observed, is a user simply sending status requests for multiple snapshots in parallel. This caused a number of multi-second tasks to run on the master's generic pool at the same time, destabilizing the master node through heap pressure and potentially causing significant latency on the generic pool.

Currently, there is no push-back against a flood of snapshot status requests from a client other than the (real-memory) circuit breaker. Given that it's fairly easy to DoS a master node via `TransportSnapshotsStatusAction` calls, should we add a mechanism that pushes back by limiting how many of these requests we service concurrently?
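
To make the question concrete, here is a minimal sketch of one possible push-back mechanism: a fixed number of permits, with requests rejected immediately (rather than queued) once all permits are taken. `SnapshotStatusThrottle`, `TooManyRequestsException`, and the limit value are hypothetical names for illustration only, not existing Elasticsearch classes or settings, and the sketch is synchronous for brevity.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper (not an Elasticsearch class): caps concurrent snapshot status requests. */
final class SnapshotStatusThrottle {

    /** Hypothetical rejection type, thrown instead of queueing when the limit is reached. */
    static final class TooManyRequestsException extends RuntimeException {
        TooManyRequestsException(String message) {
            super(message);
        }
    }

    private final int maxConcurrent;
    private final Semaphore permits;

    SnapshotStatusThrottle(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the request if a permit is free; otherwise rejects without blocking a thread. */
    <T> T execute(Supplier<T> request) {
        if (!permits.tryAcquire()) {
            throw new TooManyRequestsException(
                "more than " + maxConcurrent + " snapshot status requests in flight");
        }
        try {
            return request.get();
        } finally {
            // Always return the permit, even if the request fails.
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

In an asynchronous transport action the permit would presumably have to be released when the response listener completes rather than in a `finally` block. Rejecting instead of queueing is a deliberate choice in this sketch: a retrying client gets a fast failure and cannot pile up more long-running tasks on the generic pool.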

Similar to https://github.com/elastic/elasticsearch/issues/51992 but affecting the generic pool.

Push Back on Excessive Snapshot Repository API Requests #55153
