Skip to content

Ruler performance frequently degrades #702

@csmarchbanks

Description

@csmarchbanks

The ruler service in our cluster is frequently (every day) running into issues that end up meaning no rules are processed. The main issue seen is upper-percentile (90th percentile and above) ruler query time durations increase to 10 - 20 seconds, which causes the ruler to run into the group timeout (left at the default 10s in our cluster). Since we evaluate ~100 rules per tenant, these high percentile latencies cause every evaluation to fail.

screen shot 2018-02-13 at 1 05 51 pm
Queries for this graph look like:

histogram_quantile(0.99, sum(rate(cortex_distributor_query_duration_seconds_bucket{name="ruler"}[1m])) by (le))

Lots of log messages like:

ts=2018-02-13T09:38:55.273063356Z caller=log.go:108 level=error org_id=0 msg="error in mergeQuerier.selectSamples" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
ts=2018-02-13T09:38:55.274565552Z caller=log.go:108 level=warn msg="context error" error="context deadline exceeded"

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions