There is always some fallout, but over the past two weeks the ratio never exceeded 0.0001. It rose to 0.0004 during an email delivery issue, so 0.0002 seems a reasonable threshold to trigger an investigation.
- .build.yml
- LICENSE
- README.md
- backup_rules.yml
- build_rules.yml
- chat_rules.yml
- meta_rules.yml
- node_rules.yml
- postgres_rules.yml
- process_rules.yml
- service_rules.yml
- ssl_rules.yml
- test_rules.yml
metrics.sr.ht
This repository tracks our Prometheus alert rules. They are available as a package from mirror.sr.ht (for Alpine only) as metrics.sr.ht-rules.
Our Prometheus instance is public:
Usage instructions
- Install our package
- Add our `rule_files` entries to your `prometheus.yml` for each set of rules you wish to use
- Configure alertmanager accordingly
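As a rough sketch of the second step, the relevant part of `prometheus.yml` might look like the following. The directory `/etc/prometheus/rules/` is an assumption (check where the package actually installs its files), as is the alertmanager address; only the rule file names come from this repository.

```yaml
# prometheus.yml (fragment)
# NOTE: the /etc/prometheus/rules/ prefix is a hypothetical install location.
rule_files:
  - /etc/prometheus/rules/node_rules.yml
  - /etc/prometheus/rules/postgres_rules.yml
  - /etc/prometheus/rules/ssl_rules.yml

# Where Prometheus sends firing alerts; the target address is an assumption.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
```

Listing only the files you need keeps unrelated alerts (for example, rules for services you do not run) from being loaded.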
Our alerts are categorized into three severity groups:
- interesting alerts are worth noting, as they may be useful in identifying trends over time, for forensic attention after an outage, or for addressing on a rainy day. Upstream, we send these to our IRC channel.
- important alerts are likely to be actionable, but do not require immediate attention.
- urgent alerts require immediate attention.
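One way to act on these groups is to route on severity in alertmanager. The fragment below is a minimal sketch, assuming the rules expose the group as a label named `severity` (verify this against the rule files) and using hypothetical receiver names; each receiver still needs its own notification settings, which are omitted here.

```yaml
# alertmanager.yml (fragment)
# NOTE: the "severity" label name and all receiver names are assumptions.
route:
  receiver: chat              # fallback for anything unmatched
  routes:
    - matchers:
        - 'severity = "urgent"'
      receiver: pager         # needs immediate attention
    - matchers:
        - 'severity = "important"'
      receiver: email         # actionable, but not immediately
    - matchers:
        - 'severity = "interesting"'
      receiver: chat          # e.g. a chat channel, for later review

receivers:
  - name: pager
  - name: email
  - name: chat
```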