Elasticsearch version: Any
Plugins installed: N/A
JVM version (java -version): Any
OS version (uname -a if on a Unix-like system): Any
Description of the problem including expected versus actual behavior:
The starting point for the retry timeout implemented in RestClient is when the request is submitted to the Apache HTTP client, instead of when the request is actually sent to the Elasticsearch cluster. But the client has an unbounded request queue, so the request may be sent a long time after it got submitted.
As a result, when the HTTP client is already busy handling a lot of requests, the RestClient will never retry failing requests, because failures will always occur more than 30s after the failing request has been submitted (or whatever your timeout is, 30s is just the default).
The problem is even worse when you try to use more reasonable timeouts, such as 1 or 2 seconds. As soon as the client is clogged with more than 1 or 2 seconds' worth of requests, retries are effectively disabled.
With synchronous requests, this is not really a problem, because the submitting thread will stop waiting after maxRequestTimeout anyway, so the failure will happen long after the submitting thread gave up on the request.
With asynchronous requests, though, there's a good chance you chose asynchronous execution because you knew the request could take a long time to execute, in which case the problem will happen every time a request fails.
One example of a use case where this behavior is undesirable is when indexing a huge amount of data, for example when initializing the indexes from an external data source. In this case, you want to send a lot of asynchronous requests to the client, so as to be sure you're using the cluster to the maximum of its capacity, and as a result the client's request queue will probably be very long. Yet you still want failing requests to be retried...
Steps to reproduce:
- Set the max retry timeout to 10s, and use a single Elasticsearch host
- Submit (asynchronously) 40 requests, each taking about 1s to execute, to the RestClient; this will keep the two transport threads busy for approximately 20s.
- Just after that, submit (still asynchronously) another request that will fail after, say, 1s.
- The request from the previous step will fail, but will not be retried, despite the fact that it only spent 1 second executing.
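The timeline of these steps can be worked out with a toy model. This is only a sketch: it does not call the real RestClient, the class and method names (`RetryTimeoutModel`, `startOfNextRequest`, `retryAllowed`) are hypothetical, and it assumes a FIFO queue, two worker threads, all requests submitted at time 0, and that a retry is allowed only while the elapsed time is below the max retry timeout.

```java
import java.util.PriorityQueue;

/** Toy model of the reproduction steps; hypothetical names, no real RestClient involved. */
final class RetryTimeoutModel {

    /**
     * Returns the time (ms after submission) at which the next request can start,
     * once `queuedRequests` requests of `requestMillis` each are queued ahead of it
     * on `workers` threads. Assumes FIFO scheduling and simultaneous submission.
     */
    static long startOfNextRequest(int queuedRequests, long requestMillis, int workers) {
        PriorityQueue<Long> freeAt = new PriorityQueue<>();
        for (int i = 0; i < workers; i++) {
            freeAt.add(0L); // every worker is idle at time 0
        }
        for (int i = 0; i < queuedRequests; i++) {
            long start = freeAt.poll();        // earliest available worker takes the request
            freeAt.add(start + requestMillis); // and is busy until the request completes
        }
        return freeAt.peek();
    }

    /** Assumed retry rule: retry only if the elapsed time is still below the timeout. */
    static boolean retryAllowed(long deadlineStartMillis, long failureMillis, long maxRetryTimeoutMillis) {
        return failureMillis - deadlineStartMillis < maxRetryTimeoutMillis;
    }

    public static void main(String[] args) {
        long start = startOfNextRequest(40, 1_000, 2); // failing request waits behind 40 requests
        long failure = start + 1_000;                  // it fails after 1s of execution
        System.out.println("failing request starts at " + start + " ms");
        System.out.println("retry, timeout measured from submission: "
                + retryAllowed(0, failure, 10_000));
        System.out.println("retry, timeout measured from execution start: "
                + retryAllowed(start, failure, 10_000));
    }
}
```

Under these assumptions the failing request only starts after about 20s in the queue, so a 10s timeout measured from submission has already expired by the time the failure occurs, even though the request itself ran for 1s.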
Solution:
One solution would be to set the starting point for the retry timeout when the Apache HTTP client actually starts processing the request.
This would not affect the timeout for synchronous requests, but would still provide a significant improvement for asynchronous requests.
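As a rough illustration of the idea, the sketch below uses a plain `ExecutorService` as a stand-in for the Apache HTTP client's queue. Everything in it is an assumption made for illustration: the class `ExecutionStartDeadline`, the `Outcome` enum, and the injectable `AtomicLong` clock are hypothetical and are not part of RestClient. The only point it makes is where the timestamp is captured: at submission, or inside the task once it actually starts running.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a client with an unbounded request queue. */
final class ExecutionStartDeadline {
    enum Outcome { SUCCEEDED, WOULD_RETRY, TIMED_OUT }

    private final ExecutorService executor;
    private final AtomicLong clockMillis; // fake clock, advanced by the simulated work
    private final long maxRetryTimeoutMillis;

    ExecutionStartDeadline(ExecutorService executor, AtomicLong clockMillis, long maxRetryTimeoutMillis) {
        this.executor = executor;
        this.clockMillis = clockMillis;
        this.maxRetryTimeoutMillis = maxRetryTimeoutMillis;
    }

    Future<Outcome> submit(Callable<Void> work, boolean measureFromExecutionStart) {
        long submittedAt = clockMillis.get(); // current behavior: timeout starts here
        return executor.submit(() -> {
            // Proposed behavior: start the timeout when the task is actually picked up.
            long deadlineStart = measureFromExecutionStart ? clockMillis.get() : submittedAt;
            try {
                work.call();
                return Outcome.SUCCEEDED;
            } catch (Exception e) {
                long elapsed = clockMillis.get() - deadlineStart;
                return elapsed < maxRetryTimeoutMillis ? Outcome.WOULD_RETRY : Outcome.TIMED_OUT;
            }
        });
    }

    /** Queues 20s of simulated work ahead of a request that fails after 1s. */
    static Outcome run(boolean measureFromExecutionStart) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        AtomicLong clock = new AtomicLong();
        CountDownLatch allSubmitted = new CountDownLatch(1);
        try {
            ExecutionStartDeadline client = new ExecutionStartDeadline(executor, clock, 10_000);
            client.submit(() -> {
                allSubmitted.await();    // hold the queue until both requests are submitted
                clock.addAndGet(20_000); // simulated backlog
                return null;
            }, measureFromExecutionStart);
            Future<Outcome> failing = client.submit(() -> {
                clock.addAndGet(1_000);  // the request itself runs for 1s, then fails
                throw new IOException("simulated failure");
            }, measureFromExecutionStart);
            allSubmitted.countDown();
            return failing.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("from submission:      " + run(false));
        System.out.println("from execution start: " + run(true));
    }
}
```

In this sketch the only difference between the two runs is the line that picks `deadlineStart`. A real change would have to hook into the point where the HTTP client begins processing the request, which this stand-in does not attempt to show.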