Avoid timeout-and-retry in CoordinationDiagnosticsService and friends #97514

@DaveCTurner

Both o.e.c.c.CoordinationDiagnosticsService#sendTransportRequest and o.e.c.c.MasterHistoryService#refreshRemoteMasterHistory use transport-layer timeouts to fail a request if it takes too long. A timeout in this area makes sense: we want to know if the remote node isn't responding promptly, and we don't want to see arbitrarily-stale responses. But a transport-layer timeout is problematic because once it fires we lose track of the in-flight request, so we might just keep retrying and piling up work on the handling node, making things worse and worse.

I think we should avoid using transport-layer timeouts here and ensure that we only ever have one request in flight at once. We can still use something like o.e.a.s.SubscribableListener#addTimeout to fail as expected and ensure we don't see arbitrarily-stale responses, but we should still wait for the transport response before sending another request.
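To make the proposal concrete, here is a minimal sketch of the pattern in plain Java. It deliberately does not use the Elasticsearch classes: `SingleFlightPoller`, `tryPoll` and the `transport` supplier are hypothetical names invented for illustration, and `CompletableFuture#orTimeout` stands in for something like `SubscribableListener#addTimeout`. The caller sees a timeout failure promptly, but the in-flight flag is only cleared when the real response (or failure) arrives, so no second request is sent in the meantime.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical helper for illustration only; not part of Elasticsearch. */
final class SingleFlightPoller<T> {
    // Assumed to send the request WITHOUT any transport-layer timeout.
    private final Supplier<CompletableFuture<T>> transport;
    private final AtomicBoolean inFlight = new AtomicBoolean(false);

    SingleFlightPoller(Supplier<CompletableFuture<T>> transport) {
        this.transport = transport;
    }

    /**
     * Sends a request unless an earlier one is still outstanding.
     * Returns empty if the request was skipped; otherwise a future that
     * completes with the response, or fails with TimeoutException once
     * the timeout elapses.
     */
    Optional<CompletableFuture<T>> tryPoll(long timeout, TimeUnit unit) {
        if (!inFlight.compareAndSet(false, true)) {
            return Optional.empty(); // previous request has not come back yet
        }
        CompletableFuture<T> response;
        try {
            response = transport.get();
        } catch (RuntimeException e) {
            inFlight.set(false); // nothing was sent, so free the slot
            throw e;
        }
        // Free the slot only when the real transport response (or failure) arrives.
        response.whenComplete((r, e) -> inFlight.set(false));

        // Caller-visible view: fails on timeout, and ignores a response that
        // arrives later, so callers never see an arbitrarily-stale result.
        CompletableFuture<T> view = new CompletableFuture<>();
        response.whenComplete((r, e) -> {
            if (e != null) {
                view.completeExceptionally(e);
            } else {
                view.complete(r);
            }
        });
        return Optional.of(view.orTimeout(timeout, unit));
    }

    boolean isInFlight() {
        return inFlight.get();
    }
}
```

With this shape, a periodic caller that gets an empty result simply skips that round rather than adding another request to the handling node's queue.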

Labels: :Data Management/Health, >bug, Supportability, Team:Data Management
