Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions postgres-mixin/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
/alerts.yaml
/rules.yaml
dashboards_out
19 changes: 19 additions & 0 deletions postgres-mixin/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
# Postgres Mixin

_This is a work in progress. We aim for it to become a good role model for alerts
and dashboards eventually, but it is not quite there yet._

The Postgres Mixin is a set of configurable, reusable, and extensible alerts and
dashboards based on the metrics exported by the Postgres Exporter. The mixin creates
recording and alerting rules for Prometheus and suitable dashboard descriptions
for Grafana.

To use them, you need to have `mixtool` and `jsonnetfmt` installed. If you
have a working Go development environment, it's easiest to run the following:
```bash
$ go get github.com/monitoring-mixins/mixtool/cmd/mixtool
$ go get github.com/google/go-jsonnet/cmd/jsonnetfmt
```

For more advanced uses of mixins, see
https://github.com/monitoring-mixins/docs.
57 changes: 57 additions & 0 deletions postgres-mixin/alerts/alerts.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
---
groups:
- name: PostgreSQL
rules:
- alert: PostgreSQLMaxConnectionsReached
expr: sum(pg_stat_activity_count) by (instance) >= sum(pg_settings_max_connections) by (instance) - sum(pg_settings_superuser_reserved_connections) by (instance)
for: 1m
labels:
severity: email
annotations:
summary: "{{ $labels.instance }} has maxed out Postgres connections."
description: "{{ $labels.instance }} is exceeding the currently configured maximum Postgres connection limit (current value: {{ $value }}s). Services may be degraded - please take immediate action (you probably need to increase max_connections in the Docker image and re-deploy."

- alert: PostgreSQLHighConnections
expr: sum(pg_stat_activity_count) by (instance) > (sum(pg_settings_max_connections) by (instance) - sum(pg_settings_superuser_reserved_connections) by (instance)) * 0.8
for: 10m
labels:
severity: email
annotations:
summary: "{{ $labels.instance }} is over 80% of max Postgres connections."
description: "{{ $labels.instance }} is exceeding 80% of the currently configured maximum Postgres connection limit (current value: {{ $value }}s). Please check utilization graphs and confirm if this is normal service growth, abuse or an otherwise temporary condition or if new resources need to be provisioned (or the limits increased, which is mostly likely)."

- alert: PostgreSQLDown
expr: pg_up != 1
for: 1m
labels:
severity: email
annotations:
summary: "PostgreSQL is not processing queries: {{ $labels.instance }}"
description: "{{ $labels.instance }} is rejecting query requests from the exporter, and thus probably not allowing DNS requests to work either. User services should not be effected provided at least 1 node is still alive."

- alert: PostgreSQLSlowQueries
expr: avg(rate(pg_stat_activity_max_tx_duration{datname!~"template.*"}[2m])) by (datname) > 2 * 60
for: 2m
labels:
severity: email
annotations:
summary: "PostgreSQL high number of slow on {{ $labels.cluster }} for database {{ $labels.datname }} "
description: "PostgreSQL high number of slow queries {{ $labels.cluster }} for database {{ $labels.datname }} with a value of {{ $value }} "

- alert: PostgreSQLQPS
expr: avg(irate(pg_stat_database_xact_commit{datname!~"template.*"}[5m]) + irate(pg_stat_database_xact_rollback{datname!~"template.*"}[5m])) by (datname) > 10000
for: 5m
labels:
severity: email
annotations:
summary: "PostgreSQL high number of queries per second {{ $labels.cluster }} for database {{ $labels.datname }}"
description: "PostgreSQL high number of queries per second on {{ $labels.cluster }} for database {{ $labels.datname }} with a value of {{ $value }}"

- alert: PostgreSQLCacheHitRatio
expr: avg(rate(pg_stat_database_blks_hit{datname!~"template.*"}[5m]) / (rate(pg_stat_database_blks_hit{datname!~"template.*"}[5m]) + rate(pg_stat_database_blks_read{datname!~"template.*"}[5m]))) by (datname) < 0.98
for: 5m
labels:
severity: email
annotations:
summary: "PostgreSQL low cache hit rate on {{ $labels.cluster }} for database {{ $labels.datname }}"
description: "PostgreSQL low on cache hit rate on {{ $labels.cluster }} for database {{ $labels.datname }} with a value of {{ $value }}"
Loading