Add support for Prometheus 2.0 rule format #689
This is based on the vendoring updates in #688. @jml This is basically ready and working, but marked [WIP] only because:
Great, thanks. I'm stretched a bit thin, so I'll wait until you've addressed the first two points and try to find someone else to look into the conversion plan. If I can't find someone else, I'll do it myself.
76fed09 to 4b19e36
@jml I rebased on top of the latest master and added a couple more fixups and improvements.

The Prometheus rules YAML parser can return multiple errors if it finds multiple problems. I wasn't sure how best to present those to the user, so I opted to always show only the first one (the user can still fix their errors iteratively).

In terms of metrics, we still measure the evaluation duration of each rule group (as before). Previously all rules for a user were dumped into one large group, whereas now users can specify groupings themselves in the new YAML rules config. So you will see more, but smaller, groups and accordingly shorter per-group evaluation durations. The metric that tracks the overall latency of completing a scheduler work item is still the same, except for a rename and help text update to reflect that it covers not one rule group but the whole set of groups for a given config.

I think this should be ready from a code perspective now, but I'm keeping the [WIP] so that nobody accidentally merges it before we have a transition plan. A transition plan should include converting all existing user configs to the new format, updating example/default configs and the Cortex documentation around configs, and maybe notifying users of the change.

Do we also want to rename the current "prometheus-1518408565633.rules"-style names to end with ".yml"?
Thanks Julius!
No movement on a transition plan. I haven't found anyone w/ spare cycles to think about it. Will keep pinging.
Renaming seems sensible.
(In reply to Julius Volz's comment of Mon, 12 Feb 2018 at 05:40.)
0b077fb to 57d6477
57d6477 to ab34987
@jml I completely redid this PR on top of #719, which touched all the same code. I also added flag-based, binary-wide support for setting the rule format, still defaulting to v1. Ideally this should be deployable without breaking anything unless you explicitly set flags to select the v2 rule format.
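A flag that selects the rule format and defaults to v1 might be wired up along these lines. This is a sketch under assumptions: the flag name `-ruler.rule-format-version` comes from the commit message in this thread, but the type `RuleFormatVersion`, its constants, the accepted values "1" and "2", and `parseFlags` are hypothetical and not taken from the PR.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"strconv"
)

// RuleFormatVersion is a hypothetical type naming the supported formats.
type RuleFormatVersion int

const (
	RuleFormatV1 RuleFormatVersion = 1 // Prometheus 1.x rule format
	RuleFormatV2 RuleFormatVersion = 2 // Prometheus 2.0 YAML rule format
)

// String and Set implement flag.Value.
func (v *RuleFormatVersion) String() string {
	if v == nil {
		return ""
	}
	return strconv.Itoa(int(*v))
}

func (v *RuleFormatVersion) Set(s string) error {
	switch s {
	case "1":
		*v = RuleFormatV1
	case "2":
		*v = RuleFormatV2
	default:
		return fmt.Errorf("unsupported rule format version %q", s)
	}
	return nil
}

// parseFlags returns the selected version, defaulting to v1 when the
// flag is not given, so existing deployments keep their behavior.
func parseFlags(args []string) (RuleFormatVersion, error) {
	v := RuleFormatV1
	fs := flag.NewFlagSet("ruler", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage output quiet in this sketch
	fs.Var(&v, "ruler.rule-format-version", "rule format version to use (1 or 2)")
	err := fs.Parse(args)
	return v, err
}

func main() {
	v, err := parseFlags([]string{"-ruler.rule-format-version=2"})
	fmt.Println(v.String(), err)
}
```

Rejecting unknown values in `Set` means a mistyped version fails at startup rather than at rule evaluation time.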
bboreham left a comment
Just nits really
cmd/lite/main.go
Outdated
Why is this here? Could we have something more descriptive than "VERSION"?
Whoops, this was just local dev debugging output. Removed.
pkg/ruler/api_test.go
Outdated
Could we have the indentation the same between v1 and v2 here? For consistency, and I think this would also improve the diffs.
Done, I outdented this to what it used to be.
pkg/ruler/scheduler.go
Outdated
Maybe factor this switch out to a function in config, so we don't have to edit this file when v3 is added?
(And it would help keep down the complexity of this function)
Great idea, I factored that into a Parse() method in config, which also got rid of those switches in other places.
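Moving the version switch into the config type could be sketched as below. This is illustrative only: `RulesConfig` and `Parse()` are named in this thread, but the fields, the return type, and the placeholder parsers `parseV1` and `parseV2` are assumptions, and the commit message notes that the real return type involves more subtlety than shown here.

```go
package main

import "fmt"

// Hypothetical stand-ins, not the actual Cortex definitions.
type RuleFormatVersion int

const (
	RuleFormatV1 RuleFormatVersion = 1
	RuleFormatV2 RuleFormatVersion = 2
)

type RulesConfig struct {
	FormatVersion RuleFormatVersion
	Files         map[string]string // file name -> raw rule text
}

// parseV1 and parseV2 are placeholders for the format-specific parsers.
func parseV1(content string) (string, error) { return "parsed as v1", nil }
func parseV2(content string) (string, error) { return "parsed as v2", nil }

// Parse is the only place that switches on the format version, so
// callers such as the scheduler need no changes when a format is added.
func (c RulesConfig) Parse() (map[string]string, error) {
	var parse func(string) (string, error)
	switch c.FormatVersion {
	case RuleFormatV1:
		parse = parseV1
	case RuleFormatV2:
		parse = parseV2
	default:
		return nil, fmt.Errorf("unknown rule format version %d", c.FormatVersion)
	}
	out := make(map[string]string, len(c.Files))
	for name, content := range c.Files {
		res, err := parse(content)
		if err != nil {
			return nil, fmt.Errorf("%s: %v", name, err)
		}
		out[name] = res
	}
	return out, nil
}

func main() {
	cfg := RulesConfig{
		FormatVersion: RuleFormatV2,
		Files:         map[string]string{"alerts.yml": "groups: []"},
	}
	parsed, err := cfg.Parse()
	fmt.Println(parsed, err)
}
```

This sketch returns an error for an unknown version; treating that case as a programming error and panicking instead is the alternative discussed in the next review comment.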
pkg/ruler/scheduler.go
Outdated
hmm
Panics on programming errors are pretty normal/ok, no? Best way to actually catch them quickly...
The rule format to use is now set binary-wide via the `-ruler.rule-format-version` and `configs.rule-format-version` flags, which still default to the Prometheus 1.x rule format.

There's some trickiness here regarding what data type to return from parsing, regarding the ability to track alert states, and not being able to create final rule groups yet. That's laid out in the comment above RulesConfig.Parse().

Fixes #622
cb56d6d to 1f17463
@bboreham As discussed on Slack, I rebased this PR on top of latest master, ran