metrics.sr.ht

Commit Graph

Author	SHA1	Message	Date
Conrad Hoffmann	94ca073cd1	Alert on increase in unconfirmed registrations There is always some fallout, but over the past two weeks the ratio was never above 0.0001. It did go up to 0.0004 when there was an issue with email delivery, so 0.0002 seems to be a decent value to trigger an investigation.	2024-04-09 10:58:59 +02:00
Conrad Hoffmann	00b53bbe3d	node_rules: take all CPU modes into account Currently, the "high CPU usage" alert only looks at time spent in user mode. This can hide issues where a significant amount of time is spent in kernel mode ("system"), iowait, or similar. To take all activity into account, invert the query to assert that the CPU always spends at least 20% of capacity idling. Using 20% instead of 25% here to try to make this stay somewhat equivalent. Previously, 75% user plus an assumed 5% system overhead was fine, so it should be again.	2024-04-02 15:58:26 +02:00
Drew DeVault	8d45627721	Loosen up backup rules The git.sr.ht backups tend to take a pretty long time these days and we get some false positives on this. Might tune this figure back down a bit if/when we switch to bupstash.	2024-01-08 14:59:22 +01:00
Simon Ser	df23347a96	chat: add alarm for synIRC	2023-10-24 13:32:49 +02:00
Drew DeVault	775fe37356	build_rules.yml: correct name of builds submitted metric	2023-10-04 11:03:41 +02:00
Ignas Kiela	61dd449a4a	Add alerts for high worker utilization Additionally, update the metric used for high number of builds timing out and double the limit for high number of build submissions, since the high worker utilization alarm should cover most of the cases that the submission alarm was meant to handle.	2023-06-22 10:34:38 +02:00
Simon Ser	594b2448b0	chat: add rules for /media/soju-logs Sigh, really thought we had this already, but apparently not…	2023-06-01 12:36:55 +02:00
Jackson Chen	3c54b74879	fix incorrect expression for "Instance rebooted" node_boot_time_seconds is not in "seconds since boot", it is "unix time of boot". Therefore, the current unix time minus the boot unix time is actually seconds since boot.	2023-01-26 09:55:09 +01:00
Drew DeVault	e8260f8add	.build.yml: upgrade to 3.17 metrics was bumped	2023-01-19 11:49:38 +01:00
Drew DeVault	3fb9af0dec	Add postgres_rules.yml	2023-01-19 11:49:12 +01:00
Drew DeVault	5b509b3ecf	Update libera chat alarm	2023-01-05 18:20:34 +01:00
Simon Ser	676751cb3e	chat: bump Rizon alert to 40 The hard limit is now 50. Set the alert to 40 so that we can contact Rizon support in time when we're getting close.	2022-12-01 11:55:51 +01:00
Drew DeVault	b87b19ebd4	backup_rules.yml: bump to 72 hours borg is super slow and only getting slower as our dataset grows. The long-term solution is to switch to bupstash, but for now this should reduce the noise.	2022-07-04 14:36:33 +02:00
Simon Ser	4c6a07356d	Add chat.sr.ht rules Setup alerts monitoring the number of connections to some well-known IRC networks.	2022-03-14 17:39:12 +01:00
Ignas Kiela	5ce3d52183	Fix build queue length alert Accidentally left in an old in-development metric name I used.	2022-02-28 11:29:57 +01:00
Drew DeVault	a709d7864f	build.yml: upgrade to Alpine 3.15	2022-02-14 19:21:21 +01:00
Ignas Kiela	74b7d859d5	Fix High number of 500 errors alert to work instance-wide This was originally intended to look at the instance-wide stats, but I accidentally copied the wrong query from my experiments.	2022-02-14 16:50:14 +01:00
Ignas Kiela	927f06f0f3	Filter out low traffic routes from high number of errors alert Set the cutoff to at least 1 request per minute over the past hour. Currently around 40 routes reach this rate, which is about 10% of all routes.	2022-02-09 08:05:14 +01:00
Ignas Kiela	4c8f6f8587	Remove builds short-circuit for patches I can't get it right, and I'd rather have builds deploy than have patches succeed builds	2022-02-03 13:34:11 +01:00
Ignas Kiela	367feee072	Bring back service alarms This brings back an improved version of the high error count alarm that was removed for being too noisy, which was mostly caused by the fact that python services didn't report consistent metrics without prometheus multiprocessing mode, which has now been implemented. An alert for webhook queues is also added.	2022-02-03 11:17:14 +01:00
Ignas Kiela	c4f0b537e7	Add a time component to queued up builds alert There are a fair few cases where a number high enough to trigger this alert would queue up and clear up in under 5 minutes without needing any operator intervention. Requiring that the condition continues for a few minutes will make such transients silent.	2022-02-03 11:16:36 +01:00
Ignas Kiela	7653974e37	Fix build deployment (again) Mixed up the way shell conditionals work.	2022-02-03 11:16:14 +01:00
Ignas Kiela	713e596ad2	Fix build deployment Commit `184e0fd` broke the deployment of the metrics package because the metrics.sr.ht repo doesn't use tags, so `git describe` always fails, and the build script always ends early with complete-build. Replacing that with a check of $BUILD_REASON, which is set by hub.sr.ht on patchset submission, still works for most cases where the early completion is important.	2022-01-18 19:13:17 +01:00
Simon Ser	874390245a	Add alert for process open FDs	2022-01-18 19:13:10 +01:00
Ignas Kiela	184e0fd51d	Don't fail the build without secrets	2021-12-15 11:30:18 +01:00
Ignas Kiela	cbdcce5662	meta_rules.yml: use increase instead of delta delta is meant for gauges and does not handle resets	2021-12-15 11:30:16 +01:00
Ignas Kiela	3ea3fa2957	build_rules.yml: Alert on queued up builds	2021-12-15 11:30:14 +01:00
Ignas Kiela	3166280b41	build_rules.yml: track rate of job submission at the services	2021-12-15 11:30:13 +01:00
Simon Ser	b6b61e6f7c	Add low available memory alert	2021-11-16 07:39:58 +01:00
Drew DeVault	bce1792825	Reschedule weekly test alarm to CEST window	2021-07-29 09:18:59 +02:00
Drew DeVault	faa07f55d6	.build.yml: upgrade to Alpine 3.14	2021-07-26 09:42:37 +02:00
Drew DeVault	0d919bd352	Tweak node alarms	2021-07-18 09:03:12 +02:00
Drew DeVault	2d62c1aacf	Reduce threshold for initial I/O alarm proxy.golang.org generates large bursts of traffic every now and then which can cause our nominal I/O usage to increase in short bursts. This behavior is normal, so let's re-tune the alarm to avoid bothering us.	2021-05-17 08:27:56 -04:00
Drew DeVault	0012f555c8	Remove trigger happy alarm	2021-02-08 09:50:07 -05:00
Drew DeVault	482e0d6656	.build.yml: update to Alpine 3.13	2021-01-21 09:55:27 -05:00
Bor Grošelj Simić	82e4020e49	align annotations with actual thresholds	2021-01-14 08:03:24 -05:00
Drew DeVault	3b1aef8a3f	Bump weekly test alarm to urgent	2020-12-08 22:21:11 -05:00
Drew DeVault	2563d63019	Fix typo in password reset alarm	2020-11-25 14:13:03 -05:00
Drew DeVault	d0f1cadda9	I/O alarm: correct error for write metric	2020-10-08 20:10:37 -04:00
Drew DeVault	9e2a456510	Loosen alarm for login failures This is a bit too noisy	2020-10-04 13:51:59 -04:00
Drew DeVault	1a5eaba152	Fix high CPU usage alert	2020-07-10 09:43:38 -04:00
Drew DeVault	9bb58d3cdb	Add alarm for aging ZFS snapshots	2020-07-03 11:05:06 -04:00
Ignas Kiela	9c1389a8f8	Add an alert for high rate of server errors	2020-06-28 09:56:36 -04:00
Drew DeVault	3435b2ca01	.build.yml: switch to Alpine 3.12	2020-06-22 20:47:30 -04:00
Drew DeVault	19dfb3cac9	add urgent alarm for sustained login failure rate	2020-06-22 20:37:01 -04:00
Drew DeVault	8bdda41097	Improve CPU utilization rules	2020-04-29 12:38:09 -04:00
Drew DeVault	bbb32ed90f	Re-introduce test alarm	2020-04-22 11:52:49 -04:00
Drew DeVault	b5b49bdfa8	Bump disk usage alarms to 90%	2020-03-26 09:06:37 -04:00
Philipp Riegger	ee7e72ff8f	Fix builds.sr.ht alerts According to prometheus documentation, delta should only be used with gauges and increase can be used with counters. This changes the threshold for high rate of build submissions from 25 per second to 25 per 5 minutes. According to the metrics.sr.ht data, a threshold of 20 build submissions per 5 minutes was never exceeded.	2020-03-02 11:27:49 -05:00
Philipp Riegger	7169b1b775	Fix read-only filesystem alert	2020-02-28 09:13:14 -05:00
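
Commits cbdcce5662 and ee7e72ff8f both replace delta with increase because delta is meant for gauges and does not handle counter resets. The following is a minimal Python sketch of that difference, not Prometheus's actual implementation: the function names and sample values are hypothetical, and the extrapolation that the real PromQL functions apply to the edges of the time window is left out.

```python
def simple_delta(samples):
    """Last sample minus first sample, as for a gauge (no reset handling)."""
    return samples[-1] - samples[0]


def simple_increase(samples):
    """Total growth of a counter, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter restarted: everything counted since the reset is new growth.
            total += cur
    return total


# Hypothetical counter samples: the process restarts after the third scrape.
samples = [100, 110, 120, 5, 15]
print(simple_delta(samples))     # -85
print(simple_increase(samples))  # 35.0
```

With these samples the delta-style calculation reports a negative change after the restart, while the increase-style calculation still counts the 35 events that were observed.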

71 Commits

All Branches