Description
Today the monitoring subsystem collects stats from a cluster with a client-side timeout, e.g.:
```java
// Line 87 in 8b7c556
() -> client.admin().cluster().prepareClusterStats().get(getCollectionTimeout());

// Line 74 in 8b7c556
.get(getCollectionTimeout());

// Line 73 in 8b7c556
.get(getCollectionTimeout());

// Line 76 in 8b7c556
.actionGet(getCollectionTimeout());
```
This timeout is configurable for each collector and defaults to 10 seconds:
Lines 174 to 177 in 8b7c556:

```java
protected static Setting<TimeValue> collectionTimeoutSetting(final String settingName) {
    String name = collectionSetting(settingName);
    return timeSetting(name, TimeValue.timeValueSeconds(10), Property.Dynamic, Property.NodeScope);
}
```
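For context, each collector builds its own timeout setting from this helper. The snippet below is an illustrative reconstruction rather than a quote from 8b7c556: the constant name, the setting suffix, and the assumption that `collectionSetting` prepends the `xpack.monitoring.collection.` prefix are all mine.

```java
// Illustrative only: roughly how a cluster-stats collector would declare its
// client-side timeout via the helper above. Assuming collectionSetting() prepends
// "xpack.monitoring.collection.", this yields a dynamic node setting along the lines
// of "xpack.monitoring.collection.cluster.stats.timeout", defaulting to 10s.
public static final Setting<TimeValue> CLUSTER_STATS_TIMEOUT =
    collectionTimeoutSetting("cluster.stats.timeout");
```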
Handlers of stats requests generally reach out to all the nodes in the cluster, collect their responses, and, once all nodes have responded, send a summary of the results to the originating client. These responses can be rather large, perhaps tens of MB, and all of this data lives on-heap on the coordinating node until every node has responded.
The problem with the client-side timeouts that monitoring uses is that they do not clean up the partial results held on the coordinating node. If one node stops responding to stats requests for a while then monitoring will retry every 10 seconds, each retry adding a new handler, and tens of MB of additional heap usage, to the coordinating node.
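To make the failure mode concrete, here is a small, self-contained sketch of the pattern. This is plain Java, not the actual transport code, and the sizes and node counts are made up: the point is that the waiter times out after 10 seconds, nothing cancels the in-flight collection, and every retry leaves another set of partial results reachable on the heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ClientSideTimeoutDemo {

    /**
     * Stands in for one fan-out stats collection: it buffers per-node responses
     * and only completes once every node has replied.
     */
    static CompletableFuture<List<byte[]>> collectStats(ExecutorService executor) {
        return CompletableFuture.supplyAsync(() -> {
            List<byte[]> perNodeResponses = new ArrayList<>();
            for (int node = 0; node < 3; node++) {
                perNodeResponses.add(new byte[10 * 1024 * 1024]); // ~10MB per responding node
            }
            try {
                // One node never responds, so the collection never completes and the
                // buffered responses above stay strongly reachable.
                new CountDownLatch(1).await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return perNodeResponses;
        }, executor);
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newCachedThreadPool();
        for (int attempt = 1; attempt <= 3; attempt++) {
            CompletableFuture<List<byte[]>> future = collectStats(executor);
            try {
                // The client-side timeout, analogous to .get(getCollectionTimeout()).
                future.get(10, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                // The waiter gives up, but nothing cancels the in-flight collection:
                // each retry leaves another ~30MB of partial results on the heap.
                System.out.println("attempt " + attempt + " timed out; partial results still held");
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
        executor.shutdownNow(); // only here do the stuck collections get interrupted
    }
}
```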
I think we should remove these client-side timeouts so that we avoid the accumulation of on-heap junk caused by these retries. If we feel that the timeout/retry behaviour is necessary then I think we should move it into the server so that it can clean up properly on a failure (relates #52616).
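For the sake of argument, here is a rough sketch of what a server-side version of the timeout could look like. Again this is plain Java rather than a proposal for the real transport-layer code, and all the names are invented: the point is that the expiry runs on the node that holds the partial results, so it can release them and discard late responses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch of a coordinating-side aggregator that owns its own timeout. */
class ServerSideTimeoutAggregator {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final List<byte[]> perNodeResponses = new ArrayList<>();
    private final CompletableFuture<List<byte[]>> result = new CompletableFuture<>();
    private final ScheduledFuture<?> timeoutTask;
    private final int expectedNodes;

    ServerSideTimeoutAggregator(int expectedNodes, long timeoutSeconds) {
        this.expectedNodes = expectedNodes;
        // The timer lives where the partial results live, so expiry can free them.
        this.timeoutTask = scheduler.schedule(this::onTimeout, timeoutSeconds, TimeUnit.SECONDS);
    }

    /** Called once per node response; responses arriving after a timeout are discarded. */
    synchronized void onNodeResponse(byte[] response) {
        if (result.isDone()) {
            return;
        }
        perNodeResponses.add(response);
        if (perNodeResponses.size() == expectedNodes) {
            result.complete(new ArrayList<>(perNodeResponses));
            timeoutTask.cancel(false);
            scheduler.shutdown();
        }
    }

    private synchronized void onTimeout() {
        if (result.isDone() == false) {
            perNodeResponses.clear(); // release the partial results instead of leaking them
            result.completeExceptionally(new TimeoutException("stats collection timed out on the server"));
        }
        scheduler.shutdown();
    }

    CompletableFuture<List<byte[]>> result() {
        return result;
    }
}
```

The important difference from the status quo is that the timeout fires on the node that actually holds the buffered per-node responses, so it can drop them and reject stragglers, instead of an impatient client walking away while the heap usage stays behind.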