Skip to content
This repository was archived by the owner on Nov 15, 2023. It is now read-only.

Conversation

@NikVolf
Copy link
Contributor

@NikVolf NikVolf commented Apr 16, 2020

@NikVolf NikVolf added the A0-please_review Pull request needs code review. label Apr 16, 2020
@NikVolf NikVolf added this to the 2.0 milestone Apr 16, 2020
@NikVolf NikVolf force-pushed the nv-txpool-prometheus branch from 026c5c2 to 376c74d Compare April 16, 2020 09:32
@bkchr bkchr requested a review from mxinden April 16, 2020 09:46
@NikVolf NikVolf force-pushed the nv-txpool-prometheus branch from 376c74d to 3a6eae0 Compare April 16, 2020 09:49
@NikVolf NikVolf requested a review from tomusdrw as a code owner April 16, 2020 09:49
/// Transaction pool prometheus metrics.
pub struct Metrics {
pub validation_pending: Gauge<U64>,
pub total_validated: Counter<U64>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to have some median over the time it takes to validate a transaction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, I will think of what I need most while resolving current issues or others to come

Copy link
Contributor

@tomaka tomaka Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a bit of experience with Prometheus now. 🙃

In my opinion, validation_pending should be replaced with a validations_started of type Counter and that counts the number of validations that have started. Then you can do total_validated - validation_started to see the number of pending validations.

Gauges in general are not necessarily a good idea. Since metrics are collected only every couple of seconds, you don't know how many times inc() and dec() have been called in this interval. If instead you use two Counters and use Grafana to calculate a - b then you know.

Also, I don't know how difficult it is to implement that, but total_validated could instead be a Histogram, and when a validation is finished you would call observe() with the time it took to validate.
Histograms don't just report the time it took but also the number of times observe() has been called, so this can serve as a counter for the number of finished validations (and you can also get averages, quantiles, and so on).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of exposing a median I would suggest exposing a Prometheus Histogram. Medians might be misleading as they can hide long tail latencies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.. Thanks for heads up about Gauges, probably saved me some time.
I thought even metrics are scrapped every couple of seconds, they also scrap all changes that were made between scraps, not just immediate value, but according to docs this seems to not be the case.

I will change to counter.

As for time, I'm not interested in it currently, but will probably add later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they also scrap all changes that were made between scraps

No, this is a bit of a pitfall at the beginning. Prometheus is making a tradeoff between resolution and performance favoring performance in this place. E.g. recording every single change would require an allocated vector whereas only taking snapshots enables one to use cheap atomic variables.

Ok(sc_client::LongestChain::new(backend.clone()))
})?
.with_transaction_pool(|config, client, _fetcher| {
.with_transaction_pool(|config, client, _fetcher, prometheus_registry| {
Copy link
Contributor

@tomaka tomaka Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is the most straight-forward solution for this PR, but ugh this API. It doesn't make sense at all.
We should really remove these closures and simply have prometheus_registry (and all the other components) be a local variable.
We can't just continue adding parameters to these closures (which is a breaking change every single time) whenever we need something that is in the builder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not requesting any change on this PR in particular, but I thought I'd comment on that in general.

Copy link
Contributor Author

@NikVolf NikVolf Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you seen this PR?
#5557

Copy link
Contributor

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me overall. I got a couple of comments. Thanks for keeping the instrumentation style (new Metrics struct which has register function, ...) consistent with the remaining code base!

/// Transaction pool prometheus metrics.
pub struct Metrics {
pub validation_pending: Gauge<U64>,
pub total_validated: Counter<U64>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of exposing a median I would suggest exposing a Prometheus Histogram. Medians might be misleading as they can hide long tail latencies.

@NikVolf NikVolf added A8-mergeoncegreen and removed A0-please_review Pull request needs code review. labels Apr 16, 2020

/// Transaction pool prometheus metrics.
pub struct Metrics {
pub validation_pending: Gauge<U64>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd traitify the accessors to the fields and implement them for Option<Metrics> so that you don't have to do the annoying if let Some all the time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you meant by traitifying accessors, but refactored to avoid if let Some everywhere

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He meant that you implement a trait for Option<Arc<Metrics>>, but your solution is achieving the same simplifications ;)

NikVolf added 3 commits April 17, 2020 09:45
…e into nv-txpool-prometheus

# Conflicts:
#	client/transaction-pool/src/lib.rs
# Conflicts:
#	client/transaction-pool/Cargo.toml
use prometheus_endpoint::{register, Counter, PrometheusError, Registry, U64};

#[derive(Clone, Default)]
pub struct MetricsLink(Arc<Option<Metrics>>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to add some basic docs at least, but since it's not public it's not really enforced.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, seems pretty self-documenting to me 🤷‍♀️

@bkchr bkchr merged commit b7b60fa into master Apr 17, 2020
@bkchr bkchr deleted the nv-txpool-prometheus branch April 17, 2020 09:02
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants