Commit Graph

71 Commits

Author SHA1 Message Date
Conrad Hoffmann 94ca073cd1 Alert on increase in unconfirmed registrations
There is always some fallout, but over the past two weeks the ratio was
never above 0.0001. It did go up to 0.0004 when there was an issue with
email delivery, so 0.0002 seems to be a decent value to trigger an
investigation.
2024-04-09 10:58:59 +02:00
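As a rough illustration of the threshold reasoning in this commit, the check can be sketched in Python. The function name, its parameters, and the choice of denominator are assumptions made for the example; the commit message does not say which metrics the actual rule divides.

```python
def unconfirmed_ratio_alert(unconfirmed: int, total: int,
                            threshold: float = 0.0002) -> bool:
    """Return True when the unconfirmed ratio exceeds the threshold.

    `unconfirmed` and `total` are hypothetical inputs; the real rule's
    numerator and denominator are not given in the commit message.
    """
    if total <= 0:
        # No baseline to compare against, so do not alert.
        return False
    return unconfirmed / total > threshold


# Normal fallout (0.0001) stays quiet; the email delivery incident
# level (0.0004) would trigger an investigation.
print(unconfirmed_ratio_alert(1, 10_000))
print(unconfirmed_ratio_alert(4, 10_000))
```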
Conrad Hoffmann 00b53bbe3d node_rules: take all CPU modes into account
Currently, the "high CPU usage" alert only looks at time spent in user
mode. This can hide issues where a significant amount of time is spent in
kernel mode ("system"), iowait, or similar.

To take all activity into account, invert the query to assert that the
CPU always spends at least 20% of capacity idling. Using 20% instead of
25% here to try to make this stay somewhat equivalent. Previously, 75%
user plus an assumed 5% system overhead was fine, so it should be again.
2024-04-02 15:58:26 +02:00
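The inverted check described in this commit can be sketched in Python as follows. This is a minimal illustration, not the rule itself: the per-mode dictionaries stand in for two readings of cumulative CPU seconds, and the function names are hypothetical.

```python
from typing import Mapping


def idle_fraction(before: Mapping[str, float],
                  after: Mapping[str, float]) -> float:
    """Fraction of elapsed CPU time spent idle between two readings.

    Both mappings hold cumulative CPU seconds per mode (user, system,
    iowait, idle, ...), an assumed shape chosen for this sketch.
    """
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        # No time elapsed: treat as fully idle so nothing fires.
        return 1.0
    return deltas.get("idle", 0.0) / total


def high_cpu_usage(before: Mapping[str, float],
                   after: Mapping[str, float],
                   min_idle: float = 0.20) -> bool:
    """Alert when the CPU idles for less than min_idle of its capacity."""
    return idle_fraction(before, after) < min_idle


start = {"user": 0.0, "system": 0.0, "iowait": 0.0, "idle": 0.0}
# Only 50% user time, so a user-only 75% rule stays silent,
# yet the CPU idles for just 10% of the interval.
busy = {"user": 50.0, "system": 20.0, "iowait": 20.0, "idle": 10.0}
print(high_cpu_usage(start, busy))
```

Asserting on idle time means any non-idle mode counts against the budget, which is why system and iowait time no longer go unnoticed.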
Drew DeVault 8d45627721 Loosen up backup rules
The git.sr.ht backups tend to take a pretty long time these days and we
get some false positives on this.

Might tune this figure back down a bit if/when we switch to bupstash.
2024-01-08 14:59:22 +01:00
Simon Ser df23347a96 chat: add alarm for synIRC 2023-10-24 13:32:49 +02:00
Drew DeVault 775fe37356 build_rules.yml: correct name of builds submitted metric 2023-10-04 11:03:41 +02:00
Ignas Kiela 61dd449a4a Add alerts for high worker utilization
Additionally, update the metric used for the high number of builds timing
out, and double the limit for the high number of build submissions, since
the high worker utilization alarm should cover most of the cases that the
submission alarm was meant to handle.
2023-06-22 10:34:38 +02:00
Simon Ser 594b2448b0 chat: add rules for /media/soju-logs
Sigh, really thought we had this already, but apparently not…
2023-06-01 12:36:55 +02:00
Jackson Chen 3c54b74879 fix incorrect expression for "Instance rebooted"
node_boot_time_seconds is not in "seconds since boot", it is "unix time of
boot". Therefore, the current unix time minus the boot unix time is actually
seconds since boot.
2023-01-26 09:55:09 +01:00
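The arithmetic in this fix can be sketched in Python. The helper names and the 300-second window are assumptions for illustration; the commit message does not state the threshold the rule uses.

```python
import time
from typing import Optional


def seconds_since_boot(boot_time_seconds: float,
                       now: Optional[float] = None) -> float:
    """Uptime is the current unix time minus the unix time of boot."""
    if now is None:
        now = time.time()
    return now - boot_time_seconds


def recently_rebooted(boot_time_seconds: float,
                      now: Optional[float] = None,
                      window: float = 300.0) -> bool:
    """True when the instance booted less than `window` seconds ago.

    `window` is a hypothetical threshold chosen for this sketch.
    """
    return seconds_since_boot(boot_time_seconds, now) < window


# Comparing the raw boot timestamp against a small window never matches,
# because a unix timestamp is far larger than any plausible window.
boot = 1_700_000_000.0
print(seconds_since_boot(boot, now=boot + 120.0))
print(recently_rebooted(boot, now=boot + 120.0))
```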
Drew DeVault e8260f8add .build.yml: upgrade to 3.17
metrics was bumped
2023-01-19 11:49:38 +01:00
Drew DeVault 3fb9af0dec Add postgres_rules.yml 2023-01-19 11:49:12 +01:00
Drew DeVault 5b509b3ecf Update libera chat alarm 2023-01-05 18:20:34 +01:00
Simon Ser 676751cb3e chat: bump Rizon alert to 40
The hard limit is now 50. Set the alert to 40 so that we can contact
Rizon support in time when we're getting close.
2022-12-01 11:55:51 +01:00
Drew DeVault b87b19ebd4 backup_rules.yml: bump to 72 hours
borg is super slow and only getting slower as our dataset grows. The
long-term solution is to switch to bupstash, but for now this should
reduce the noise.
2022-07-04 14:36:33 +02:00
Simon Ser 4c6a07356d Add chat.sr.ht rules
Set up alerts monitoring the number of connections to some
well-known IRC networks.
2022-03-14 17:39:12 +01:00
Ignas Kiela 5ce3d52183 Fix build queue length alert
Accidentally left in an old in-development metric name I used.
2022-02-28 11:29:57 +01:00
Drew DeVault a709d7864f build.yml: upgrade to Alpine 3.15 2022-02-14 19:21:21 +01:00
Ignas Kiela 74b7d859d5 Fix High number of 500 errors alert to work instance-wide
This was originally intended to look at the instance-wide stats,
but I accidentally copied the wrong query from my experiments.
2022-02-14 16:50:14 +01:00
Ignas Kiela 927f06f0f3 Filter out low traffic routes from high number of errors alert
Set the cutoff to at least 1 request per minute over the past hour.
Currently around 40 routes reach this rate, which is about 10% of all
routes.
2022-02-09 08:05:14 +01:00
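The cutoff described in this commit can be illustrated with a short Python sketch. The route names and the input shape (request counts per route over the past hour) are made up for the example.

```python
from typing import Mapping, Set


def busy_routes(requests_last_hour: Mapping[str, int],
                min_per_minute: float = 1.0) -> Set[str]:
    """Keep only routes averaging at least min_per_minute over the hour."""
    return {
        route
        for route, count in requests_last_hour.items()
        if count / 60.0 >= min_per_minute
    }


# Hypothetical traffic: only routes with at least 60 requests in the
# past hour remain eligible for the error alert.
traffic = {"/": 6000, "/rare": 12, "/edge": 60}
print(sorted(busy_routes(traffic)))
```

Filtering first keeps a handful of failed requests on a rarely used route from dominating an error ratio.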
Ignas Kiela 4c8f6f8587 Remove builds short-circuit for patches
I can't get it right, and I'd rather have builds deploy than have
patches succeed builds
2022-02-03 13:34:11 +01:00
Ignas Kiela 367feee072 Bring back service alarms
This brings back an improved version of the high error count alarm that
was removed for being too noisy, which was mostly caused by the fact
that python services didn't report consistent metrics without prometheus
multiprocessing mode, which has now been implemented. An alert for
webhook queues is also added.
2022-02-03 11:17:14 +01:00
Ignas Kiela c4f0b537e7 Add a time component to queued up builds alert
There are a fair few cases where a number high enough to trigger this
alert would queue up and clear up in under 5 minutes without needing
any operator intervention. Requiring that the condition continue for
a few minutes will make such transients silent.
2022-02-03 11:16:36 +01:00
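The effect of adding a time component can be sketched in Python. This is a simplified model of a "must hold for a duration" condition; the sample layout, threshold, and hold time are assumptions for the example.

```python
from typing import Iterable, Optional, Tuple


def fires(samples: Iterable[Tuple[float, float]],
          threshold: float,
          hold_seconds: float) -> bool:
    """True if the value stays above threshold for hold_seconds or more.

    `samples` is an iterable of (timestamp, value) pairs ordered by time.
    """
    pending_since: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            if ts - pending_since >= hold_seconds:
                return True
        else:
            pending_since = None  # condition cleared, start over
    return False


# A burst that clears within a few minutes stays silent.
transient = [(0, 5), (60, 20), (120, 20), (180, 3)]
# A queue that stays long for five minutes fires.
sustained = [(t, 20) for t in range(0, 301, 60)]
print(fires(transient, threshold=10, hold_seconds=300))
print(fires(sustained, threshold=10, hold_seconds=300))
```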
Ignas Kiela 7653974e37 Fix build deployment (again)
Mixed up the way shell conditionals work.
2022-02-03 11:16:14 +01:00
Ignas Kiela 713e596ad2 Fix build deployment
Commit 184e0fd broke the deployment of the metrics package because the
metrics.sr.ht repo doesn't use tags, so `git describe` always fails, and
the build script always ends early with complete-build. Replacing that
with a check of $BUILD_REASON, which is set by hub.sr.ht on patchset
submission, still works for most cases where the early completion is
important.
2022-01-18 19:13:17 +01:00
Simon Ser 874390245a Add alert for process open FDs 2022-01-18 19:13:10 +01:00
Ignas Kiela 184e0fd51d Don't fail the build without secrets 2021-12-15 11:30:18 +01:00
Ignas Kiela cbdcce5662 meta_rules.yml: use increase instead of delta
delta is meant for gauges and does not handle resets
2021-12-15 11:30:16 +01:00
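The difference this commit relies on can be shown with a simplified Python model. It is a sketch only: it ignores the extrapolation Prometheus applies at the edges of the range, and the sample values are invented.

```python
from typing import Sequence


def delta(samples: Sequence[float]) -> float:
    """Gauge-style difference: last sample minus first sample."""
    return samples[-1] - samples[0]


def increase(samples: Sequence[float]) -> float:
    """Counter-style growth that tolerates resets.

    A drop between consecutive samples is treated as a counter reset,
    so the new value is counted as growth from zero.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            total += cur          # reset: counter restarted from zero
        else:
            total += cur - prev   # normal monotonic growth
    return total


# The counter restarts between the second and third sample.
counter = [100.0, 110.0, 5.0, 15.0]
print(delta(counter))     # negative, misleading for a counter
print(increase(counter))  # growth with the reset accounted for
```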
Ignas Kiela 3ea3fa2957 build_rules.yml: Alert on queued up builds 2021-12-15 11:30:14 +01:00
Ignas Kiela 3166280b41 build_rules.yml: track rate of job submission at the services 2021-12-15 11:30:13 +01:00
Simon Ser b6b61e6f7c Add low available memory alert 2021-11-16 07:39:58 +01:00
Drew DeVault bce1792825 Reschedule weekly test alarm to CEST window 2021-07-29 09:18:59 +02:00
Drew DeVault faa07f55d6 .build.yml: upgrade to Alpine 3.14 2021-07-26 09:42:37 +02:00
Drew DeVault 0d919bd352 Tweak node alarms 2021-07-18 09:03:12 +02:00
Drew DeVault 2d62c1aacf Reduce threshold for initial I/O alarm
proxy.golang.org generates large bursts of traffic every now and then
which can cause our nominal I/O usage to increase in short bursts. This
behavior is normal, so let's re-tune the alarm to avoid bothering us.
2021-05-17 08:27:56 -04:00
Drew DeVault 0012f555c8 Remove trigger happy alarm 2021-02-08 09:50:07 -05:00
Drew DeVault 482e0d6656 .build.yml: update to Alpine 3.13 2021-01-21 09:55:27 -05:00
Bor Grošelj Simić 82e4020e49 align annotations with actual thresholds 2021-01-14 08:03:24 -05:00
Drew DeVault 3b1aef8a3f Bump weekly test alarm to urgent 2020-12-08 22:21:11 -05:00
Drew DeVault 2563d63019 Fix typo in password reset alarm 2020-11-25 14:13:03 -05:00
Drew DeVault d0f1cadda9 I/O alarm: correct error for write metric 2020-10-08 20:10:37 -04:00
Drew DeVault 9e2a456510 Loosen alarm for login failures
This is a bit too noisy
2020-10-04 13:51:59 -04:00
Drew DeVault 1a5eaba152 Fix high CPU usage alert 2020-07-10 09:43:38 -04:00
Drew DeVault 9bb58d3cdb Add alarm for aging ZFS snapshots 2020-07-03 11:05:06 -04:00
Ignas Kiela 9c1389a8f8 Add an alert for high rate of server errors 2020-06-28 09:56:36 -04:00
Drew DeVault 3435b2ca01 .build.yml: switch to Alpine 3.12 2020-06-22 20:47:30 -04:00
Drew DeVault 19dfb3cac9 add urgent alarm for sustained login failure rate 2020-06-22 20:37:01 -04:00
Drew DeVault 8bdda41097 Improve CPU utilization rules 2020-04-29 12:38:09 -04:00
Drew DeVault bbb32ed90f Re-introduce test alarm 2020-04-22 11:52:49 -04:00
Drew DeVault b5b49bdfa8 Bump disk usage alarms to 90% 2020-03-26 09:06:37 -04:00
Philipp Riegger ee7e72ff8f Fix builds.sr.ht alerts
According to prometheus documentation, delta should only be used with
gauges and increase can be used with counters.

This changes the threshold for high rate of build submissions from 25
per second to 25 per 5 minutes. According to the metrics.sr.ht data, a
threshold of 20 build submissions per 5 minutes was never exceeded.
2020-03-02 11:27:49 -05:00
Philipp Riegger 7169b1b775 Fix read-only filesystem alert 2020-02-28 09:13:14 -05:00