Influxdb3 monitor metrics #6422
base: 6403-influxdb3-perf-tuning
chore(qol): Instruction to use /version/ in shared links
… reference doc for /metrics output.
- Add monitoring guide:
  - Core and general metrics
  - Enterprise cluster and node-specific metrics
  - Using metrics and relabeling using Prometheus or Telegraf.
Great info. Lots of questions 😄
```
{{% /show-in %}}

Replace {{% code-placeholder-key %}}`AUTH_TOKEN`{{% /code-placeholder-key %}} with your {{< product-name >}} {{% token-link %}} that has read access to the `/metrics` endpoint.
What tokens can read the `/metrics` endpoint? I assume it's just admin tokens since it's in both Core and Enterprise. I think we should call this out.
In Enterprise, you can create a non-admin, fine-grained token with the `system:metrics:read` permission, which grants access to that endpoint.
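For reference, a minimal Python sketch of an authenticated scrape. It assumes the endpoint accepts the token as a Bearer header, as the Telegraf example later in this thread does. The base URL, the port 8181 (borrowed from the curl example in the diff), and the helper names `build_metrics_request` and `fetch_metrics` are illustrative assumptions, not part of the PR.

```python
import urllib.request


def build_metrics_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for /metrics with a Bearer token (hypothetical helper)."""
    url = base_url.rstrip("/") + "/metrics"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_metrics(base_url: str, token: str, timeout: float = 5.0) -> str:
    """Return the raw Prometheus text exposition from one node."""
    request = build_metrics_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_metrics("http://localhost:8181", token)` with a token that lacks the required permission would be expected to fail with an HTTP error rather than return metrics.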
{{% show-in "enterprise" %}}
### Aggregate metrics across cluster

```bash
# Get metrics from all nodes in cluster
for node in ingester-01 query-01 compactor-01; do
  echo "=== Node: $node ==="
  curl -s http://$node:8181/metrics | grep 'http_requests_total.*status="ok"'
done
```
{{% /show-in %}}
So these metrics are specific to each node. Does the Prometheus schema include the node ID, or is there additional processing a user would have to do to identify the source node?
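If the exposition does not carry a node identifier (an open question in this thread), the extra processing could look like the following Python sketch, which injects a `node` label into each sample line of a scraped payload. The function name `add_node_label` and the label name `node` are hypothetical, and the sketch assumes the node name contains no quotes or backslashes. Prometheus itself attaches an `instance` label at scrape time, so this only matters for hand-rolled collection such as the curl loop above.

```python
def add_node_label(exposition: str, node: str) -> str:
    """Return the Prometheus text payload with node="<node>" added to every sample."""
    label = f'node="{node}"'
    out = []
    for line in exposition.splitlines():
        stripped = line.strip()
        # Keep blank lines and # HELP / # TYPE comments unchanged.
        if not stripped or stripped.startswith("#"):
            out.append(line)
            continue
        if "{" in stripped:
            # Sample already has labels: prepend ours inside the braces.
            head, rest = stripped.split("{", 1)
            sep = "" if rest.startswith("}") else ","
            out.append(f"{head}{{{label}{sep}{rest}")
        else:
            # Sample has no labels: add a label set after the metric name.
            name, rest = stripped.split(None, 1)
            out.append(f"{name}{{{label}}} {rest}")
    return "\n".join(out)
```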
Different metrics are more relevant depending on node [mode configuration](/influxdb3/version/admin/clustering/#configure-node-modes):
Do irrelevant metrics still get reported? Do all nodes report the same metric, no matter what mode they're running in?
```promql
# 95th percentile query latency by query node
histogram_quantile(0.95,
  sum(rate(influxdb_iox_query_log_execute_duration_seconds_bucket[5m])) by (instance, le)
)

# Average inter-node coordination time
avg(rate(influxdb_iox_query_log_ingester_latency_to_full_data_seconds_sum[5m]) /
  rate(influxdb_iox_query_log_ingester_latency_to_full_data_seconds_count[5m])) by (instance)
```
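To make the first query easier to reason about, here is a simplified Python sketch of how a quantile is estimated from cumulative histogram buckets: find the bucket containing the target rank, then interpolate linearly inside it. It is an approximation of what `histogram_quantile` does, not Prometheus's implementation, and it skips several edge cases. The bucket values in the usage note are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound, with +Inf as the last bound.
    """
    if not buckets or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            span = count - prev_count
            if span == 0:
                return prev_bound
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return math.nan
```

With buckets `[(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]`, the 95th percentile falls in the `+Inf` bucket, so the estimate is capped at the highest finite bound, `1.0`. This is one reason bucket boundaries matter when reading latency percentiles.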
Just a thought, but why not suggest that users collect these metrics with Telegraf and store them in another InfluxDB instance rather than in Prometheus? I think we can provide PromQL queries, but they should be secondary to InfluxDB queries.
Setting up a sidecar monitoring instance is basically standard practice with v1 and v2 production deployments. I think it should be with v3 as well.
The Telegraf config would look something like:
[[inputs.prometheus]]
  urls = [
    "http://ingester-1.com/metrics",
    "http://querier-1.com/metrics",
    "http://compactor-1.com/metrics"
  ]
  metric_version = 2
  http_headers = {"Authorization" = "Bearer ${READ_AUTH_TOKEN}"}

[[outputs.influxdb_v2]]
  urls = ["http://influxdb3-monitor.com"]
  token = "${WRITE_AUTH_TOKEN}"
  organization = ""
  bucket = "DATABASE_NAME"
I actually see that you cover this later under "Node Labeling", but I still think this should be the first suggestion.
Create role-specific dashboards with the following suggested metrics for each dashboard:

#### Cluster Overview Dashboard
- Node status and availability
- Request rates across all nodes
- Error rates by node and operation type
- Resource utilization summary

#### Ingest Performance Dashboard
- Write throughput by ingest node
- Snapshot creation rates
- Memory usage and pressure
- WAL-to-Parquet conversion metrics

#### Query Performance Dashboard
- Query latency percentiles by query node
- Cache hit rates and efficiency
- Inter-node coordination times
- Memory usage during query execution

#### Operations Dashboard
- Compaction progress and performance
- Object store operation success rates
- Processing engine trigger rates
- System health indicators
This section doesn't seem all that helpful unless we're going to actually provide a dashboard for them, or, at a minimum, the queries for each. But I know that depends on where they're storing the metrics.
# Add node name from URL
[inputs.prometheus.tags]
  node_name = "$1"
I'm surprised we don't actually include the node-id as a label in the metrics.
[[processors.regex]]
  # Extract node role from node name
  [[processors.regex.tags]]
    key = "node_name"
    pattern = '^(ingester|query|compactor)-.*'
    replacement = "${1}"
    result_key = "node_role"
This assumes that users include the node role in the node URL. I feel like this should be part of the reported metrics.
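To illustrate the naming assumption, here is a small Python sketch that applies the same pattern as the `processors.regex` block in the diff. The function name `node_role` and the sample node names are hypothetical. A node whose name does not start with a role prefix gets no role, which is the gap this comment points out.

```python
import re
from typing import Optional

# Same pattern as the Telegraf regex processor in the diff.
ROLE_PATTERN = re.compile(r"^(ingester|query|compactor)-.*")


def node_role(node_name: str) -> Optional[str]:
    """Return the role prefix of a node name, or None if the name has no such prefix."""
    match = ROLE_PATTERN.match(node_name)
    return match.group(1) if match else None
```

For example, `node_role("ingester-01")` yields `"ingester"`, while a name like `"prod-node-7"` yields `None` and would leave the `node_role` tag unset.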
Part of #6420