-
Notifications
You must be signed in to change notification settings - Fork 840
Closed
Description
The ruler service in our cluster is frequently (every day) running into issues that end up meaning no rules are processed. The main issue seen is upper-percentile (90th percentile and above) ruler query time durations increase to 10 - 20 seconds, which causes the ruler to run into the group timeout (left at the default 10s in our cluster). Since we evaluate ~100 rules per tenant, these high percentile latencies cause every evaluation to fail.

Queries for this graph look like:
histogram_quantile(0.99, sum(rate(cortex_distributor_query_duration_seconds_bucket{name="ruler"}[1m])) by (le))
Lots of log messages like:
ts=2018-02-13T09:38:55.273063356Z caller=log.go:108 level=error org_id=0 msg="error in mergeQuerier.selectSamples" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
ts=2018-02-13T09:38:55.274565552Z caller=log.go:108 level=warn msg="context error" error="context deadline exceeded"
Metadata
Metadata
Assignees
Labels
No labels