Description
Both o.e.c.c.CoordinationDiagnosticsService#sendTransportRequest and o.e.c.c.MasterHistoryService#refreshRemoteMasterHistory use transport-layer timeouts to fail a request if it takes too long. A timeout in this area makes sense: we want to know if the remote node isn't responding promptly, and we don't want to see arbitrarily-stale responses. However, a transport-layer timeout is problematic because once it fires we lose track of the in-flight request, so we might just keep retrying and piling up work on the handling node, making things worse and worse.
I think we should avoid using transport-layer timeouts here and ensure that we only ever have one request in flight at once. We can still use something like o.e.a.s.SubscribableListener#addTimeout to fail as expected and ensure we don't see arbitrarily-stale responses, but we should still wait for the transport response before sending another request.
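To illustrate the proposed behaviour, here is a minimal sketch using only java.util.concurrent rather than the actual Elasticsearch listener classes. The class SingleFlightRequester, its trySend and isInFlight methods, and the Supplier standing in for the transport call are all hypothetical names chosen for this example; none of them come from the Elasticsearch codebase. The idea is that the caller sees a failure once the timeout elapses, but the in-flight slot is only released when the underlying transport response (or failure) actually arrives.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical helper: at most one request in flight, with a caller-side timeout. */
final class SingleFlightRequester<T> {
    private final AtomicBoolean inFlight = new AtomicBoolean(false);
    private final ScheduledExecutorService scheduler;

    SingleFlightRequester(ScheduledExecutorService scheduler) {
        this.scheduler = scheduler;
    }

    /**
     * Sends a request unless one is already outstanding.
     * Returns null if a previous request has not yet received its transport
     * response; otherwise returns a future that completes with the response,
     * or fails with TimeoutException if the response is too slow.
     */
    CompletableFuture<T> trySend(Supplier<CompletableFuture<T>> transport, long timeoutMillis) {
        if (!inFlight.compareAndSet(false, true)) {
            return null; // previous request still outstanding: do not pile up more work
        }
        final CompletableFuture<T> transportFuture;
        try {
            transportFuture = transport.get(); // assumed stand-in for the real transport call
        } catch (RuntimeException e) {
            inFlight.set(false); // nothing was sent, so release the slot
            throw e;
        }
        final CompletableFuture<T> result = new CompletableFuture<>();
        // Caller-side timeout: fails the result but does NOT release the in-flight slot.
        final ScheduledFuture<?> timeoutTask = scheduler.schedule(() -> {
            result.completeExceptionally(
                new TimeoutException("no response within " + timeoutMillis + "ms"));
        }, timeoutMillis, TimeUnit.MILLISECONDS);
        transportFuture.whenComplete((response, error) -> {
            inFlight.set(false); // released only when the transport really completes
            timeoutTask.cancel(false);
            if (error != null) {
                result.completeExceptionally(error); // no-op if the timeout already fired
            } else {
                result.complete(response); // a late (stale) response is dropped here
            }
        });
        return result;
    }

    boolean isInFlight() {
        return inFlight.get();
    }
}
```

With this shape, a periodic poller that calls trySend on every tick simply skips ticks while an earlier request is still outstanding, so a slow remote node never has more than one of these requests queued against it.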