Skip to content

Conversation

@juliusv
Copy link
Contributor

@juliusv juliusv commented Feb 6, 2018

The rule format to use is now set binary-wide via the
-ruler.rule-format-version and configs.rule-format-version
flags, which still default to the Prometheus 1.x rule format.

There's some trickiness here regarding what data type to return from parsing,
regarding the ability to track alert states, and not being able to create final
rule groups yet. That's laid out in the comment above RulesConfig.Parse().

Fixes #622

@juliusv
Copy link
Contributor Author

juliusv commented Feb 6, 2018

This is based on the vendoring updates in #688.

@jml This is basically ready and working, but marked [WIP] only because:

  • I need to make slight readability improvements to error reporting to the user.
  • I need to check whether we want to change the evaluation metrics a bit, due to the fact that we're now executing multiple groups per user, with multiple rules each.
  • We can't just deploy this without having a conversion plan for existing rule files in place.

@jml
Copy link
Contributor

jml commented Feb 7, 2018

Great, thanks.

I'm stretched a bit thin, so I'll wait until you've addressed the first two points and try to find someone else to look into conversion plan. If I can't find someone else, will do it myself.

@juliusv
Copy link
Contributor Author

juliusv commented Feb 12, 2018

@jml I rebased ontop of the latest master and added a couple more fixups and improvements.

The Prometheus rules YAML parser can return multiple errors if it finds multiple problems, but I wasn't sure how to best present those to the user, so I simply opted to just always show the first one (so then the user can still iteratively fix their errors).

In terms of metrics, we are still measuring the durations of each rule group (as before), except that previously all rules for a user were dumped into one large group, whereas now a user can specify groupings themselves in the new YAML rules config. So you will see more, but smaller groups, and accordingly faster per-group evaluation durations. The metric that tracks the overall latency of a scheduler work item completion is still the same, except for a rename / help text update to reflect that it's not about one rule group, but a whole set of them for a given config.

I think this should be ready from a code perspective now, but I'm keeping the [WIP] so that nobody accidentally merges it before we have a transition plan. A transition plan should include converting all existing user configs to the new format, updating example/default configs, Cortex documentation around configs, and maybe notifying users of the change.

Do we also want to rename the current "prometheus-1518408565633.rules"-style names to end with ".yml"?

@jml
Copy link
Contributor

jml commented Feb 12, 2018 via email

@juliusv juliusv force-pushed the update-rule-format branch from 0b077fb to 57d6477 Compare March 25, 2018 20:27
@juliusv juliusv changed the title [WIP] Update to Prometheus 2.0 YAML-based rule format Add support for Prometheus 2.0 rule format Mar 25, 2018
@juliusv juliusv force-pushed the update-rule-format branch from 57d6477 to ab34987 Compare March 25, 2018 20:30
@juliusv
Copy link
Contributor Author

juliusv commented Mar 25, 2018

@jml I completely re-did this PR ontop of #719, which touched all the same code places.

I also added flag-based binary-wide support for setting the rule format, with it still defaulting to v1. Ideally this should be deployable without breaking anything unless you explicitly set flags to indicate a v2 rule format.

Copy link
Contributor

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just nits really

cmd/lite/main.go Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this here? Could we have something more descriptive than "VERSION"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, this was just local dev debugging output. Removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have the indentation the same between v1 and v2 here? For consistency, and I think this would also improve the diffs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I outdented this to what it used to be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe factor this switch out to a function in config, so we don't have to edit this file when v3 is added?
(And it would help keep down the complexity of this function)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, I factored that into a Parse() method in config, which also got rid of those switches in other places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panics on programming errors are pretty normal/ok, no? Best way to actually catch them quickly...

@bboreham
Copy link
Contributor

bboreham commented Apr 3, 2018

@juliusv can you check your dep command please - this PR seems to bring in a bunch of *_test.go files which were removed in #705. Maybe you have an older version?

juliusv and others added 3 commits April 3, 2018 14:56
The rule format to use is now set binary-wide via the
`-ruler.rule-format-version` and `configs.rule-format-version`
flags, which still default to the Prometheus 1.x rule format.

There's some trickiness here regarding what data type to return from parsing,
regarding the ability to track alert states, and not being able to create final
rule groups yet. That's laid out in the comment above RulesConfig.Parse().

Fixes #622
@juliusv juliusv force-pushed the update-rule-format branch from cb56d6d to 1f17463 Compare April 3, 2018 12:58
@juliusv
Copy link
Contributor Author

juliusv commented Apr 3, 2018

@bboreham As discussed on Slack, I rebased this PR ontop of latest master, ran dep ensure (with newest dep version) again, and squashed the extra vendoring changes into the original vendor update commit.

@juliusv juliusv merged commit 2f36aa6 into master Apr 3, 2018
@tomwilkie tomwilkie deleted the update-rule-format branch August 29, 2018 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants