Description
Today the ?timeout= query parameter to the repository analysis API applies to the regular blob operations, but not to the linearizable register operations. The assumption here was that the register operations simply increment a counter once per node, which should take almost no time at all. In practice, however, we've seen a couple of S3-like repositories with incomplete or incorrect support for the multipart upload APIs which underpin the S3 linearizable register implementation, giving spurious responses that cause endless retries. Specifically, the S3 ListMultipartUploads API returns "all in-progress uploads", but some repositories claiming to be S3-compatible incorrectly omit recently-started uploads from responses to this API.
We should apply the timeout to both kinds of operation so that these repository implementations can fail the analysis at the timeout instead of waiting forever.
Relates #101185, which adds verification for uncontended register operations. These operations need no retries, so this verification will allow us to distinguish this incorrect behaviour from other reasons for an analysis timeout.
Workaround
To work around this issue, implement a client-side timeout when requesting a repository analysis, using a timeout value a few seconds longer than the server-side timeout specified with the ?timeout= query parameter. Treat the expiry of the client-side timeout as indicative of a repository incompatibility which you should work with your storage supplier to address.
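The workaround above could be sketched roughly as follows in Python, using only the standard library. The helper names (`analyze_url`, `client_timeout`, `run_analysis`), the 5-second margin, and the example base URL are illustrative assumptions, not part of Elasticsearch. Authentication and TLS settings are omitted. Note that `urlopen`'s `timeout` is a socket-level timeout, so this sketch assumes the server sends nothing until the analysis finishes.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request


def analyze_url(base_url, repository, server_timeout_s, **params):
    """Build the repository analysis URL, including the server-side ?timeout=."""
    query = dict(params, timeout=f"{server_timeout_s}s")
    path = f"/_snapshot/{urllib.parse.quote(repository, safe='')}/_analyze"
    return base_url.rstrip("/") + path + "?" + urllib.parse.urlencode(query)


def client_timeout(server_timeout_s, margin_s=5):
    """Client-side timeout: a few seconds longer than the server-side one.

    The 5-second default margin is an assumption; pick whatever suits you.
    """
    return server_timeout_s + margin_s


def run_analysis(base_url, repository, server_timeout_s=120, margin_s=5):
    """Request an analysis and fail if the client-side timeout expires."""
    url = analyze_url(base_url, repository, server_timeout_s)
    request = urllib.request.Request(url, method="POST")
    message = (
        "client-side timeout expired: the repository may be incompatible, "
        "contact your storage supplier"
    )
    try:
        with urllib.request.urlopen(
            request, timeout=client_timeout(server_timeout_s, margin_s)
        ) as response:
            return response.read().decode("utf-8")
    except socket.timeout:
        # Raised directly when waiting for the response times out.
        raise RuntimeError(message) from None
    except urllib.error.URLError as e:
        # Timeouts while connecting or sending are wrapped in URLError.
        if isinstance(e.reason, socket.timeout):
            raise RuntimeError(message) from None
        raise
```

For example, `run_analysis("http://localhost:9200", "my_repository", server_timeout_s=120)` would send the request with `?timeout=120s` and give up on the client side after 125 seconds.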
Test your repository's behaviour with linearizable registers first by setting the query parameters ?blob_count=1&max_blob_size=1b. If this analysis takes more than a few seconds to complete, it is likely that your repository behaves incorrectly in a manner that will cause Elasticsearch to retry endlessly.
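As a minimal sketch of this check, the snippet below builds the analysis URL with the two query parameters mentioned above and times an arbitrary call. The helper names and the 10-second threshold standing in for "more than a few seconds" are assumptions chosen for illustration. The HTTP request itself is left to the caller.

```python
import time
import urllib.parse


def register_probe_url(base_url, repository):
    """URL for a minimal analysis that mostly exercises the register operations."""
    query = urllib.parse.urlencode({"blob_count": 1, "max_blob_size": "1b"})
    repo = urllib.parse.quote(repository, safe="")
    return f"{base_url.rstrip('/')}/_snapshot/{repo}/_analyze?{query}"


def timed(call):
    """Run call() and return (result, elapsed seconds) using a monotonic clock."""
    start = time.monotonic()
    result = call()
    return result, time.monotonic() - start


def is_suspiciously_slow(elapsed_s, threshold_s=10.0):
    """True if the probe took longer than the (assumed) threshold."""
    return elapsed_s > threshold_s
```

A caller would wrap its own POST to `register_probe_url(...)` in `timed(...)` and treat a `True` result from `is_suspiciously_slow` as a hint that the repository may trigger the endless retries described above.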