Description
Currently, requests for the status of snapshots (TransportGetSnapshotsAction as well as TransportSnapshotsStatusAction) can result in long-running executions on the generic thread pool.
This is especially true for TransportSnapshotsStatusAction, which can easily take multiple minutes to run for cloud-backed repositories and large snapshots.
If a client sends a number of these requests at once, a large number of generic pool threads can become busy, and the in-flight work puts a lot of pressure on the heap.
One scenario where this could become troublesome is a client that retries a slow snapshot status request because its own timeout is shorter than the time the request takes to finish, adding ever more tasks to the GENERIC pool on the master node.
Another scenario that was observed was a user simply sending status requests for multiple snapshots in parallel. This caused a number of multi-second tasks to run on the master's generic pool at the same time, destabilizing the master node through heap pressure and potentially causing significant latency on the generic pool.
Currently, there is no pushback against a flood of snapshot status requests from a client other than the (real-memory) circuit breaker. Given that it's fairly easy to DoS a master node via TransportSnapshotsStatusAction calls, should we add a mechanism that pushes back on these requests by limiting how many of them we service concurrently?
Similar to #51992 but affecting the generic pool.
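
To make the idea of limiting concurrent requests concrete, here is a minimal sketch of one possible shape for such a mechanism: a fixed number of permits, with immediate rejection once they are all in use. It uses only the JDK. The class name `SnapshotStatusLimiter`, the `execute` method and the permit count are hypothetical and are not existing Elasticsearch APIs. The sketch is also synchronous for brevity, whereas the real transport actions are asynchronous, so an actual implementation would need to release the permit when the response listener completes.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical limiter for illustration only; not part of Elasticsearch. */
final class SnapshotStatusLimiter {
    // One permit per snapshot status request that may run at the same time.
    private final Semaphore permits;

    SnapshotStatusLimiter(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    /**
     * Runs the task if a permit is free. Otherwise it fails fast instead of
     * queueing, so retries from an impatient client cannot pile up work.
     */
    <T> T execute(Supplier<T> task) {
        if (!permits.tryAcquire()) {
            throw new RejectedExecutionException("too many concurrent snapshot status requests");
        }
        try {
            return task.get();
        } finally {
            // Always hand the permit back, even if the task throws.
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Rejecting rather than queueing is a deliberate choice in this sketch: a queue would still hold on to request state on the master while the client keeps retrying, which is the failure mode described above.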