
Client-side stats collection timeouts can result in overloaded master #60188

@DaveCTurner

Description

Today the monitoring subsystem collects stats from a cluster with a client-side timeout, e.g.:

```java
() -> client.admin().cluster().prepareClusterStats().get(getCollectionTimeout());
```

This timeout is configurable for each collector and defaults to 10 seconds:

```java
protected static Setting<TimeValue> collectionTimeoutSetting(final String settingName) {
    String name = collectionSetting(settingName);
    return timeSetting(name, TimeValue.timeValueSeconds(10), Property.Dynamic, Property.NodeScope);
}
```
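For the cluster stats collector, for instance, this corresponds to a dynamic node-level setting along the following lines (setting name shown for illustration; check the monitoring settings reference for the exact key):

```yaml
# Hypothetical example: per-collector collection timeout, defaulting to 10s
xpack.monitoring.collection.cluster.stats.timeout: 10s
```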

Handlers of stats requests generally reach out to all the nodes in the cluster, collect their responses, and once all nodes have responded they send a summary of the results to the originating client. These responses can be rather large, perhaps 10s of MBs, and this data all lives on-heap on the coordinating node until every node has responded.

The problem with the client-side timeouts that monitoring uses is that they do not clean up the partial results held on the coordinating node. If one node stops responding to stats requests for a while then monitoring will retry every 10 seconds, each retry registering another handler and holding tens of MBs more heap on the coordinating node.
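The accumulation can be illustrated with a minimal stand-alone simulation (this is not Elasticsearch code; the names and the ~10 MB figure are illustrative). Each collection attempt buffers partial per-node results on the coordinating node; the client-side timeout abandons the request but removes nothing, so each retry stacks another buffer on the heap:

```java
import java.util.ArrayList;
import java.util.List;

public class TimeoutAccumulation {
    // Stand-in for the coordinating node's in-flight stats handlers,
    // each holding partial results until every node has responded.
    static final List<byte[]> inFlightPartialResults = new ArrayList<>();

    // One collection attempt: buffer partial results, then "time out" client-side.
    static void attemptCollection(int partialResultBytes) {
        inFlightPartialResults.add(new byte[partialResultBytes]); // held on-heap
        // The client gives up after its timeout, but nothing deregisters the
        // handler or frees the buffer, because the slow node has not responded.
    }

    public static void main(String[] args) {
        // Monitoring retries every collection interval while one node is stuck.
        for (int retry = 0; retry < 6; retry++) {
            attemptCollection(10 * 1024 * 1024); // ~10 MB of partial results
        }
        long heldBytes = inFlightPartialResults.stream().mapToLong(b -> b.length).sum();
        System.out.println("handlers=" + inFlightPartialResults.size()
            + " heldMB=" + heldBytes / (1024 * 1024));
    }
}
```

After one minute of a single unresponsive node, six abandoned handlers are pinning roughly 60 MB on the coordinating node, and the count keeps growing until the node responds or the coordinator runs short of heap.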

I think we should remove these client-side timeouts so that we avoid the accumulation of on-heap junk caused by these retries. If we feel that the timeout/retry behaviour is necessary then I think we should move it into the server so that it can clean up properly on a failure (relates #52616).
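The difference a server-side timeout makes can be sketched as follows (again a hypothetical illustration, not an Elasticsearch API): because the timeout fires where the partial results live, its expiry handler can fail the request and release the buffers in the same step, which a client-side timeout cannot do:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ServerSideTimeout {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Partial per-node results buffered on the coordinating node.
        List<byte[]> partialResults = new ArrayList<>();
        partialResults.add(new byte[10 * 1024 * 1024]); // one node has responded

        CountDownLatch timedOut = new CountDownLatch(1);
        // Server-side timeout: on expiry, fail the request AND free the buffers,
        // the cleanup that a client-side timeout cannot perform.
        scheduler.schedule(() -> {
            partialResults.clear();
            timedOut.countDown();
        }, 100, TimeUnit.MILLISECONDS);

        timedOut.await();
        System.out.println("buffersAfterTimeout=" + partialResults.size());
        scheduler.shutdown();
    }
}
```

With this shape, a retry after the timeout starts from a clean coordinator rather than piling onto abandoned state.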
