ruler: config updates are nearly always overwritten by rescheduling #482

@bboreham

Symptom: after updating alerting rules in the config server, ruler was still running the previous rules for several minutes.

There is a race between the ticker triggering a config update and the scheduler re-scheduling rules it has just evaluated. I enhanced the logging to make this clearer:

time="2017-07-05T17:29:42Z" level=debug msg="Scheduler: work item added: {orgID=1 scheduled=2017-07-05 17:29:42 ["ALERT AAA" "ALERT BBB"]}" source="scheduler.go:204" 
time="2017-07-05T17:29:42Z" level=debug msg="Evaluated rule "ALERT AAA"" source="manager.go:287" 
time="2017-07-05T17:29:42Z" level=debug msg="Scheduler: work item {orgID=1 scheduled=2017-07-05 17:29:42 ["ALERT AAA"]} rescheduled for 2017-07-05 17:29:57" source="scheduler.go:228" 
time="2017-07-05T17:29:42Z" level=debug msg="Scheduler: work item added: {orgID=1 scheduled=2017-07-05 17:29:57 ["ALERT AAA"]}" source="scheduler.go:204" 

The first line comes from a config update with two rules, but it is immediately overwritten when the previous rule set, which had only one rule, is re-scheduled.

In practice this race happens very often because the config polling loop uses the same duration as the rule re-evaluation, so a trivial improvement would be to make those durations relatively prime.

Perhaps a better fix would be to look up the current rules when re-scheduling rather than using the set we have at hand.

Labels: postmortem (an issue arising out of a serious production issue), type/bug
