Description
Today the monitoring subsystem collects stats from a cluster with a client-side timeout, e.g.:
```java
// Line 87 in 8b7c556
() -> client.admin().cluster().prepareClusterStats().get(getCollectionTimeout());

// Line 74 in 8b7c556
.get(getCollectionTimeout());

// Line 73 in 8b7c556
.get(getCollectionTimeout());

// Line 76 in 8b7c556
.actionGet(getCollectionTimeout());
```
This timeout is configurable for each collector and defaults to 10 seconds:
Lines 174 to 177 in 8b7c556:

```java
protected static Setting<TimeValue> collectionTimeoutSetting(final String settingName) {
    String name = collectionSetting(settingName);
    return timeSetting(name, TimeValue.timeValueSeconds(10), Property.Dynamic, Property.NodeScope);
}
```
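For context, each collector builds its own timeout setting from this helper. The snippet below is an illustrative reconstruction rather than a quote from 8b7c556: the constant name, the setting suffix, and the assumption that `collectionSetting` prepends the `xpack.monitoring.collection.` prefix are all mine.

```java
// Illustrative only: roughly how a cluster-stats collector would declare its
// client-side timeout via the helper above. Assuming collectionSetting() prepends
// "xpack.monitoring.collection.", this yields a dynamic node setting along the lines
// of "xpack.monitoring.collection.cluster.stats.timeout", defaulting to 10s.
public static final Setting<TimeValue> CLUSTER_STATS_TIMEOUT =
    collectionTimeoutSetting("cluster.stats.timeout");
```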
Handlers of stats requests generally reach out to all the nodes in the cluster, collect their responses, and, once all nodes have responded, send a summary of the results to the originating client. These responses can be rather large, perhaps tens of MB, and all of this data lives on-heap on the coordinating node until every node has responded.
The problem with the client-side timeouts that monitoring uses is that they do not clean up the partial results held on the coordinating node. If one node stops responding to stats requests for a while then monitoring will retry every 10 seconds, each retry adding a new handler, and tens of MB of additional heap usage, to the coordinating node.
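To make the failure mode concrete, here is a small, self-contained sketch of the pattern. This is plain Java, not the actual transport code, and the sizes and node counts are made up: the point is that the waiter times out after 10 seconds, nothing cancels the in-flight collection, and every retry leaves another set of partial results reachable on the heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ClientSideTimeoutDemo {

    /**
     * Stands in for one fan-out stats collection: it buffers per-node responses
     * and only completes once every node has replied.
     */
    static CompletableFuture<List<byte[]>> collectStats(ExecutorService executor) {
        return CompletableFuture.supplyAsync(() -> {
            List<byte[]> perNodeResponses = new ArrayList<>();
            for (int node = 0; node < 3; node++) {
                perNodeResponses.add(new byte[10 * 1024 * 1024]); // ~10MB per responding node
            }
            try {
                // One node never responds, so the collection never completes and the
                // buffered responses above stay strongly reachable.
                new CountDownLatch(1).await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return perNodeResponses;
        }, executor);
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newCachedThreadPool();
        for (int attempt = 1; attempt <= 3; attempt++) {
            CompletableFuture<List<byte[]>> future = collectStats(executor);
            try {
                // The client-side timeout, analogous to .get(getCollectionTimeout()).
                future.get(10, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                // The waiter gives up, but nothing cancels the in-flight collection:
                // each retry leaves another ~30MB of partial results on the heap.
                System.out.println("attempt " + attempt + " timed out; partial results still held");
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
        executor.shutdownNow(); // only here do the stuck collections get interrupted
    }
}
```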
I think we should remove these client-side timeouts so that we avoid the accumulation of on-heap junk caused by these retries. If we feel that the timeout/retry behaviour is necessary then I think we should move it into the server so that it can clean up properly on a failure (relates #52616).
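For the sake of argument, here is a rough sketch of what a server-side version of the timeout could look like. Again this is plain Java rather than a proposal for the real transport-layer code, and all the names are invented: the point is that the expiry runs on the node that holds the partial results, so it can release them and discard late responses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch of a coordinating-side aggregator that owns its own timeout. */
class ServerSideTimeoutAggregator {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final List<byte[]> perNodeResponses = new ArrayList<>();
    private final CompletableFuture<List<byte[]>> result = new CompletableFuture<>();
    private final ScheduledFuture<?> timeoutTask;
    private final int expectedNodes;

    ServerSideTimeoutAggregator(int expectedNodes, long timeoutSeconds) {
        this.expectedNodes = expectedNodes;
        // The timer lives where the partial results live, so expiry can free them.
        this.timeoutTask = scheduler.schedule(this::onTimeout, timeoutSeconds, TimeUnit.SECONDS);
    }

    /** Called once per node response; responses arriving after a timeout are discarded. */
    synchronized void onNodeResponse(byte[] response) {
        if (result.isDone()) {
            return;
        }
        perNodeResponses.add(response);
        if (perNodeResponses.size() == expectedNodes) {
            result.complete(new ArrayList<>(perNodeResponses));
            timeoutTask.cancel(false);
            scheduler.shutdown();
        }
    }

    private synchronized void onTimeout() {
        if (result.isDone() == false) {
            perNodeResponses.clear(); // release the partial results instead of leaking them
            result.completeExceptionally(new TimeoutException("stats collection timed out on the server"));
        }
        scheduler.shutdown();
    }

    CompletableFuture<List<byte[]>> result() {
        return result;
    }
}
```

The important difference from the status quo is that the timeout fires on the node that actually holds the buffered per-node responses, so it can drop them and reject stragglers, instead of an impatient client walking away while the heap usage stays behind.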