Description
A number of performance issues have been found in production clusters running metric solutions. They need to be addressed for Elasticsearch to offer competitive query latency in the metrics space. This is part of the TSDB effort, which aims to make Elasticsearch better at storing and querying metric data. The tasks listed here are improvements that significantly reduce query time, either across many metric query workloads or for specific ones.
Our current observations indicate that the poor performance is caused by the default refresh behaviour. By default, shards go search-idle after 30 seconds of search inactivity. When a search-idle shard is queried, a refresh is performed as part of the search, and search execution continues only after it completes. This adds significant latency to the query, especially because the search does not trigger the refresh itself but waits for the next scheduled refresh, which often means that nothing happens for about 1 second.
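To make the refresh behaviour concrete, here is a minimal sketch of the index settings involved. It is not taken from this issue: `index.search.idle.after` and `index.refresh_interval` are the Elasticsearch index settings that govern search-idle and scheduled refreshes, while the helper name `build_idle_settings` and the example values are hypothetical. The sketch only builds a request body for `PUT /<index>/_settings` and does not contact a cluster.

```python
import json


def build_idle_settings(idle_after=None, refresh_interval=None):
    """Build a body for PUT /<index>/_settings (hypothetical helper).

    idle_after:       value for index.search.idle.after (default is 30s)
    refresh_interval: value for index.refresh_interval (default is 1s);
                      setting it explicitly keeps scheduled refreshes
                      running on shards that receive no searches
    """
    body = {}
    if idle_after is not None:
        body["index.search.idle.after"] = idle_after
    if refresh_interval is not None:
        body["index.refresh_interval"] = refresh_interval
    return body


# Example: keep shards search-active for longer between dashboard loads.
print(json.dumps(build_idle_settings(idle_after="5m")))
```

Tuning these settings is only a workaround under the assumptions above. The tasks below aim to fix the latency in Elasticsearch itself, so that the defaults behave well.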
Additionally, we observed that any search with a percentile aggregation is slow. Under the hood, the percentile aggregation uses the AVL-based t-digest to compute percentiles, and this shows up as a significant hotspot when profiling.
- Build a new Rally track that measures performance when shards go search-idle (Add new track for tsdb based on k8s integration rally-tracks#373).
- Trigger a refresh when a shard becomes search active instead of waiting for it. #95544
- Avoid refreshing search-idle shards that don't yield results after query rewrite #95541
- Improve the performance of the `percentile` aggregation by switching to the merging-based t-digest implementation. The current AVL-based implementation performs slowly in production with metric data sets of any reasonable size. This work consists of forking the t-digest library (Fork tdigest library #95903) and then changing the implementation to the merging t-digest (Feature/replace avl digest with merging digest #35182).
- Improve `cardinality` aggregation performance on low-cardinality fields (Add support for dynamic pruning to cardinality aggregations on low-cardinality keyword fields. #92060).
- Better detect when execution hint `map` or `global_ordinals` should be used.
- [Search] Async search backing off strategy makes dashboards slow when searches take less than 1s to be served kibana#157837
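
For context on the execution-hint task, this is roughly how a caller sets the hint by hand today. It is a minimal sketch and not part of this issue: it assumes the hint refers to the `execution_hint` parameter of the `terms` aggregation, and the helper name `terms_agg_body`, the aggregation name `by_field`, and the field name `kubernetes.pod.name` are hypothetical. It only builds a search request body.

```python
import json

# The two hints named in the task list above.
VALID_HINTS = ("map", "global_ordinals")


def terms_agg_body(field, execution_hint=None, size=10):
    """Build a search body with a single terms aggregation (hypothetical helper)."""
    if execution_hint is not None and execution_hint not in VALID_HINTS:
        raise ValueError(f"unsupported execution_hint: {execution_hint!r}")
    terms = {"field": field, "size": size}
    if execution_hint is not None:
        # Without this key, Elasticsearch picks the execution mode itself.
        terms["execution_hint"] = execution_hint
    # "size": 0 skips returning hits, since only the aggregation is of interest.
    return {"size": 0, "aggs": {"by_field": {"terms": terms}}}


print(json.dumps(terms_agg_body("kubernetes.pod.name", "map"), indent=2))
```

Better automatic detection would make the explicit hint unnecessary in most cases, so dashboards would not need to know which mode suits a given field.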