Avoid timeout-and-retry in CoordinationDiagnosticsService and friends

Both `o.e.c.c.CoordinationDiagnosticsService#sendTransportRequest` and `o.e.c.c.MasterHistoryService#refreshRemoteMasterHistory` use transport-layer timeouts to fail a request if it takes too long. A timeout makes sense here: we want to know if the remote node isn't responding promptly, and we don't want to act on arbitrarily stale responses. But a _transport-layer_ timeout is problematic because once it fires we lose track of the request, which may still be in flight. We might then keep retrying, piling up work on the handling node and making the situation progressively worse.

I think we should avoid using transport-layer timeouts here and ensure that we only ever have one request in flight at a time. We can still use something like `o.e.a.s.SubscribableListener#addTimeout` to fail as expected and to avoid seeing arbitrarily stale responses, but we should wait for the transport response before sending another request.
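To illustrate the shape of this, here is a minimal sketch using only JDK classes rather than the real Elasticsearch APIs. `SingleFlightRequester`, `trySend` and `isInFlight` are hypothetical names, and `CompletableFuture` stands in for the transport response listener. The caller's view of the request fails on timeout, but the in-flight slot is only released once the transport response (or failure) actually arrives.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical helper, not part of Elasticsearch: allows at most one outstanding
// request and applies the timeout on the caller's side, not at the transport layer.
final class SingleFlightRequester<T> {
    private final AtomicBoolean inFlight = new AtomicBoolean(false);
    private final ScheduledExecutorService scheduler;

    SingleFlightRequester(ScheduledExecutorService scheduler) {
        this.scheduler = scheduler;
    }

    // Returns an empty Optional if an earlier request has not yet received
    // its transport response, so no further work is sent to the remote node.
    Optional<CompletableFuture<T>> trySend(Supplier<CompletableFuture<T>> transportCall, long timeoutMillis) {
        if (!inFlight.compareAndSet(false, true)) {
            return Optional.empty();
        }
        final CompletableFuture<T> transportResponse;
        try {
            transportResponse = transportCall.get(); // sent with no transport-layer timeout
        } catch (RuntimeException e) {
            inFlight.set(false);
            throw e;
        }

        CompletableFuture<T> callerView = new CompletableFuture<>();
        transportResponse.whenComplete((response, error) -> {
            // Release the slot only when the transport layer really completes.
            inFlight.set(false);
            if (error != null) {
                callerView.completeExceptionally(error);
            } else {
                callerView.complete(response); // no-op if the timeout already fired
            }
        });

        // The timeout fails only the caller's view; the request stays tracked.
        ScheduledFuture<?> timer = scheduler.schedule(
            () -> callerView.completeExceptionally(new TimeoutException("no response within " + timeoutMillis + "ms")),
            timeoutMillis,
            TimeUnit.MILLISECONDS
        );
        callerView.whenComplete((response, error) -> timer.cancel(false));
        return Optional.of(callerView);
    }

    boolean isInFlight() {
        return inFlight.get();
    }
}
```

With this shape, a periodic poller that gets an empty `Optional` simply skips that round instead of sending another request, so a slow handling node sees at most one outstanding request from each caller.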


Avoid timeout-and-retry in CoordinationDiagnosticsService and friends #97514
