-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Add prometheus registry to transaction pool, with couple of initial metrics #5657
Conversation
026c5c2 to
376c74d
Compare
376c74d to
3a6eae0
Compare
| /// Transaction pool prometheus metrics. | ||
| pub struct Metrics { | ||
| pub validation_pending: Gauge<U64>, | ||
| pub total_validated: Counter<U64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to have some median over the time it takes to validate a transaction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, I will think of what I need most while resolving current issues or others to come
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a bit of experience with Prometheus now. 🙃
In my opinion, validation_pending should be replaced with a validations_started of type Counter and that counts the number of validations that have started. Then you can do total_validated - validation_started to see the number of pending validations.
Gauges in general are not necessarily a good idea. Since metrics are collected only every couple of seconds, you don't know how many times inc() and dec() have been called in this interval. If instead you use two Counters and use Grafana to calculate a - b then you know.
Also, I don't know how difficult it is to implement that, but total_validated could instead be a Histogram, and when a validation is finished you would call observe() with the time it took to validate.
Histograms don't just report the time it took but also the number of times observe() has been called, so this can serve as a counter for the number of finished validations (and you can also get averages, quantiles, and so on).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of exposing a median I would suggest exposing a Prometheus Histogram. Medians might be misleading as they can hide long tail latencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm.. Thanks for heads up about Gauges, probably saved me some time.
I thought even metrics are scrapped every couple of seconds, they also scrap all changes that were made between scraps, not just immediate value, but according to docs this seems to not be the case.
I will change to counter.
As for time, I'm not interested in it currently, but will probably add later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they also scrap all changes that were made between scraps
No, this is a bit of a pitfall at the beginning. Prometheus is making a tradeoff between resolution and performance favoring performance in this place. E.g. recording every single change would require an allocated vector whereas only taking snapshots enables one to use cheap atomic variables.
| Ok(sc_client::LongestChain::new(backend.clone())) | ||
| })? | ||
| .with_transaction_pool(|config, client, _fetcher| { | ||
| .with_transaction_pool(|config, client, _fetcher, prometheus_registry| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is the most straight-forward solution for this PR, but ugh this API. It doesn't make sense at all.
We should really remove these closures and simply have prometheus_registry (and all the other components) be a local variable.
We can't just continue adding parameters to these closures (which is a breaking change every single time) whenever we need something that is in the builder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not requesting any change on this PR in particular, but I thought I'd comment on that in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you seen this PR?
#5557
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me overall. I got a couple of comments. Thanks for keeping the instrumentation style (new Metrics struct which has register function, ...) consistent with the remaining code base!
| /// Transaction pool prometheus metrics. | ||
| pub struct Metrics { | ||
| pub validation_pending: Gauge<U64>, | ||
| pub total_validated: Counter<U64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of exposing a median I would suggest exposing a Prometheus Histogram. Medians might be misleading as they can hide long tail latencies.
Co-Authored-By: Max Inden <[email protected]>
Co-Authored-By: Max Inden <[email protected]>
Co-Authored-By: Max Inden <[email protected]>
Co-Authored-By: Max Inden <[email protected]>
Co-Authored-By: Max Inden <[email protected]>
|
|
||
| /// Transaction pool prometheus metrics. | ||
| pub struct Metrics { | ||
| pub validation_pending: Gauge<U64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd traitify the accessors to the fields and implement them for Option<Metrics> so that you don't have to do the annoying if let Some all the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you meant by traitifying accessors, but refactored to avoid if let Some everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
He meant that you implement a trait for Option<Arc<Metrics>>, but your solution is achieving the same simplifications ;)
…e into nv-txpool-prometheus # Conflicts: # client/transaction-pool/src/lib.rs
# Conflicts: # client/transaction-pool/Cargo.toml
| use prometheus_endpoint::{register, Counter, PrometheusError, Registry, U64}; | ||
|
|
||
| #[derive(Clone, Default)] | ||
| pub struct MetricsLink(Arc<Option<Metrics>>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to add some basic docs at least, but since it's not public it's not really enforced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, seems pretty self-documenting to me 🤷♀️
polkadot: paritytech/polkadot#1003