There is always some fallout, but over the past two weeks the ratio was
never above 0.0001. It did go up to 0.0004 when there was an issue with
email delivery, so 0.0002 seems to be a decent value to trigger an
investigation.
Currently, the "high CPU usage" alert only looks at time spent in user
mode. This can hide issues where a significant amount of time is spent in
kernel mode ("system"), iowait, or similar.
To take all activity into account, invert the query to assert that the
CPU always spends at least 20% of capacity idling. This uses 20% instead
of 25% to stay roughly equivalent: previously, 75% user plus an assumed
5% system overhead was fine, so it should be fine again.
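To illustrate why the inverted check catches more, here is a minimal
Python sketch. The function names and the sample numbers are made up for
illustration and are not taken from the actual alert rules.

```python
def idle_fraction(mode_seconds):
    """Return the share of CPU time spent idle, given seconds per mode."""
    total = sum(mode_seconds.values())
    if total == 0:
        return 1.0  # no samples yet; treat as fully idle
    return mode_seconds.get("idle", 0.0) / total


def cpu_alert(mode_seconds, min_idle=0.20):
    """Fire when less than min_idle of capacity is spent idling."""
    return idle_fraction(mode_seconds) < min_idle


# Hypothetical sample: 60s user, 25s iowait, 15s idle out of 100s.
sample = {"user": 60.0, "system": 0.0, "iowait": 25.0, "idle": 15.0}

# The old user-only check (more than 75% user) stays quiet here...
old_check = sample["user"] / sum(sample.values()) > 0.75
# ...while the inverted check fires, because only 15% is idle.
new_check = cpu_alert(sample)
```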
The git.sr.ht backups tend to take a pretty long time these days and we
get some false positives on this.
Might tune this figure back down a bit if/when we switch to bupstash.
Additionally, update the metric used for the high number of builds
timing out, and double the limit for the high number of build
submissions, since the high worker utilization alarm should cover most
of the cases that the submission alarm was meant to handle.
node_boot_time_seconds is not "seconds since boot", it is the "unix time
of boot". Therefore, the current unix time minus the boot unix time is
the actual number of seconds since boot.
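The same arithmetic as a small Python sketch. The function name and the
timestamps are hypothetical; only the subtraction reflects the fix.

```python
import time


def uptime_seconds(boot_time_unix, now_unix=None):
    """Seconds since boot, given the unix timestamp of the boot."""
    if now_unix is None:
        now_unix = time.time()
    return now_unix - boot_time_unix


# A host that booted at 1700000000, observed one hour later.
example_uptime = uptime_seconds(1_700_000_000, 1_700_003_600)
```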
borg is super slow and only getting slower as our dataset grows. The
long-term solution is to switch to bupstash, but for now this should
reduce the noise.
This brings back an improved version of the high error count alarm that
was removed for being too noisy. The noise was mostly caused by Python
services not reporting consistent metrics without Prometheus
multiprocessing mode, which has now been implemented. An alert for
webhook queues is also added.
There are a fair few cases where a number high enough to trigger this
alert would queue up and clear in under 5 minutes without needing any
operator intervention. Requiring that the condition persists for a few
minutes will make such transients silent.
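A rough Python sketch of the idea, similar in spirit to a hold duration
on an alerting rule. The sample values, threshold and function name are
invented for illustration.

```python
def sustained(samples, threshold, hold_seconds):
    """Return True if the value stayed above threshold for hold_seconds.

    samples is a list of (timestamp, value) pairs in ascending time order.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of this breach
            if ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # condition cleared, start over
    return False


# A two-minute spike that clears on its own stays silent...
transient = [(0, 5), (60, 50), (120, 50), (180, 5), (240, 5)]
# ...while a queue that stays high for five minutes fires.
stuck = [(t, 50) for t in range(0, 361, 60)]

transient_fires = sustained(transient, threshold=20, hold_seconds=300)
stuck_fires = sustained(stuck, threshold=20, hold_seconds=300)
```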
Commit 184e0fd broke the deployment of the metrics package because the
metrics.sr.ht repo doesn't use tags, so `git describe` always fails and
the build script always ends early with complete-build. Replacing that
with a check of $BUILD_REASON, which is set by hub.sr.ht on patchset
submission, still works for most cases where the early completion is
important.
proxy.golang.org generates large bursts of traffic every now and then
which can cause our nominal I/O usage to increase in short bursts. This
behavior is normal, so let's re-tune the alarm to avoid bothering us.
According to the Prometheus documentation, delta should only be used
with gauges and increase can be used with counters.
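The difference matters when a counter resets, for example after a
service restart. The Python sketch below is a simplified model, not
Prometheus's implementation: it ignores the extrapolation to window
boundaries that Prometheus performs, and the sample values are made up.

```python
def delta(samples):
    """Difference between last and first sample; meant for gauges."""
    return samples[-1] - samples[0]


def increase(samples):
    """Total growth of a counter, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter restarted, so everything counted
        # since the restart is new growth.
        total += cur if cur < prev else cur - prev
    return total


# Hypothetical counter that resets between the second and third sample.
counter = [100, 110, 5, 15]

naive = delta(counter)       # negative, which is meaningless for a counter
correct = increase(counter)  # 10 before the reset, 5 + 10 after it
```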
This changes the threshold for high rate of build submissions from 25
per second to 25 per 5 minutes. According to the metrics.sr.ht data, a
threshold of 20 build submissions per 5 minutes was never exceeded.
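For a sense of scale, a short Python calculation puts both thresholds in
the same unit. The variable names are illustrative only.

```python
WINDOW_SECONDS = 5 * 60

# Old threshold: 25 submissions per second, expressed per 5-minute window.
old_limit_per_window = 25 * WINDOW_SECONDS

# New threshold: 25 submissions per 5-minute window.
new_limit_per_window = 25

# How many times stricter the new threshold is.
factor = old_limit_per_window // new_limit_per_window

# Highest 5-minute count noted above as never exceeded.
observed_peak = 20
headroom = new_limit_per_window - observed_peak
```

In other words, the old limit could practically never be reached, while
the new one sits just above the highest level observed so far.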