Push Back on Excessive Snapshot Repository API Requests #55153

@original-brownbear

Currently, requests for the status of snapshots (TransportGetSnapshotsAction as well as TransportSnapshotsStatusAction) can result in long-running executions on the generic thread pool.
This is especially true for TransportSnapshotsStatusAction, which can easily take multiple minutes to run for cloud-backed repositories and large snapshots.
If a client sends a number of these requests at once, a large number of generic pool threads can become busy, and the in-flight requests add a lot of heap pressure.

One scenario where this could become troublesome is a client that retries a slow snapshot status request because its timeout is shorter than the time the request takes to finish, adding ever more tasks to the GENERIC pool on the master node.
Another scenario, which has been observed, was a user simply sending status requests for multiple snapshots in parallel. This caused a number of multi-second tasks to run on the master's generic pool at the same time, destabilizing the master node through heap pressure and potentially causing significant latency on the generic pool.

Currently, there is no push-back against a flood of snapshot status requests from a client other than the (real-memory) circuit breaker. Given that it's fairly easy to DoS a master node via TransportSnapshotsStatusAction calls, should we add a mechanism that pushes back by limiting how many of these requests we service concurrently?

Similar to #51992 but affecting the generic pool.
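
To make the question about limiting concurrency more concrete, here is a minimal sketch of one possible push-back mechanism: a fixed number of permits, with any request beyond that limit rejected immediately instead of being queued on the pool. Everything named here (`SnapshotStatusLimiter`, `RejectedException`, `maxConcurrent`) is hypothetical and not existing Elasticsearch code. The sketch also runs the request synchronously for simplicity; an asynchronous action would presumably release the permit in its completion callback instead.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical limiter for illustration only; not part of Elasticsearch. */
public class SnapshotStatusLimiter {

    /** Thrown when the concurrency limit is reached (assumed to map to a "too many requests" response). */
    public static class RejectedException extends RuntimeException {
        public RejectedException(String message) {
            super(message);
        }
    }

    private final Semaphore permits;

    /** @param maxConcurrent assumed upper bound on status requests serviced at the same time */
    public SnapshotStatusLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the request if a permit is free; otherwise rejects without blocking or queueing. */
    public <T> T execute(Supplier<T> statusRequest) {
        if (!permits.tryAcquire()) {
            throw new RejectedException("too many concurrent snapshot status requests");
        }
        try {
            return statusRequest.get();
        } finally {
            // Always hand the permit back, even if the request fails.
            permits.release();
        }
    }

    /** Number of additional requests that could start right now. */
    public int availablePermits() {
        return permits.availablePermits();
    }
}
```

Rejecting instead of queueing is a deliberate choice in this sketch: a client that retries after timing out then gets a fast failure rather than adding another long-running task to the pool.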

Metadata

Labels

:Distributed Coordination/Snapshot/Restore: Anything directly related to the `_snapshot/*` APIs
Team:Distributed (Obsolete): Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
