2018-11-30 - Progress - Tony Finch
This is a postmortem of an incident that caused a large amount of cronspam, but not an outage. However, the incident exposed a lot of latent problems that need addressing.
Description of the incident
I arrived at work late on Tuesday morning to find that the DHCP servers were sending cronspam every minute from monit. monit thought dhcpd was not working, although it was.
A few minutes before I arrived, a colleague had run our Ansible playbook to update the DHCP server configuration. This was the trigger for the cronspam.
Cause of the cronspam
We are using monit as a basic daemon supervisor for our critical services. The monit configuration doesn't have an "include" facility (or at least it didn't when we originally set it up) so we are using Ansible's "assemble" feature to concatenate configuration file fragments into a complete monit config.
The problem was that our Ansible setup didn't have any explicit dependencies between installing monit config fragments and reassembling the complete config and restarting monit.
Running the complete playbook caused the monit config to be reassembled, so an incorrect but previously inactive config fragment was activated, causing the cronspam.
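As a rough illustration of how this kind of latent fragment could be spotted, the sketch below rebuilds the config from its fragments and compares it with the file that is actually installed. It only approximates what Ansible's "assemble" does (concatenating fragments in filename order), and the function names and paths are hypothetical, not taken from our playbook.

```python
from pathlib import Path


def assemble(fragment_dir: Path) -> bytes:
    """Concatenate the fragment files in filename order."""
    parts = []
    fragments = (p for p in fragment_dir.iterdir() if p.is_file())
    for fragment in sorted(fragments, key=lambda p: p.name):
        data = fragment.read_bytes()
        # Keep a fragment without a trailing newline from running
        # into the start of the next fragment.
        if data and not data.endswith(b"\n"):
            data += b"\n"
        parts.append(data)
    return b"".join(parts)


def is_stale(fragment_dir: Path, assembled: Path) -> bool:
    """True if the installed config differs from what the fragments produce."""
    if not assembled.exists():
        return True
    return assembled.read_bytes() != assemble(fragment_dir)


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever the fragments really live.
    print(is_stale(Path("/etc/monit/fragments"), Path("/etc/monit/monitrc")))
```

If is_stale() returns true on a server, the next full playbook run will change the live monit config, which is worth knowing before running it.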
Origin of the problem
How was there an inactive monit config fragment on the DHCP servers?
The DHCP servers had an OS upgrade and reinstall in February. This was when the spammy broken monit config fragment was written.
What were the mistakes at that time?
- The config fragment was not properly tested. A good monit config is normally silent, but in this case we didn't check that it sent cronspam when things were broken, which would have revealed that the config fragment was not actually installed properly.
- The Ansible playbook was not verified to be properly idempotent. It should be possible to wipe a machine and reinstall it with one run of Ansible, and a second run should be all green. We didn't check the second run properly. Check mode isn't enough to verify idempotency of "assemble".
During routine config changes in the nine months since the servers were reinstalled, the usual practice was to run the DHCP-specific subset of the Ansible playbook (because that is much faster) so the bug was not revealed.
Deeper issues
There was a lot more anxiety than there should have been when debugging this problem, because at the time the Ansible playbooks were going through a lot of churn for upgrading and reinstalling other servers, and it wasn't clear whether or not this had caused some unexpected change.
This gets close to the heart of the matter:
- It should always be safe to check out and run the Ansible playbook against the production systems, and expect that nothing will change.
There are other issues related to being a (nearly) solo developer, which makes it easier to get into bad habits. The DHCP server config has the most contributions from colleagues at the moment, so it is not really surprising that this is where we find out the consequences of the bad habits of soloists.
Resolutions
It turns out that monit and dhcpd do not really get along. The monit UDP health checker doesn't work with DHCP (which was the cause of the cronspam) and monit's process checker gets upset by dhcpd being restarted when it needs to be reconfigured.
The monit DHCP UDP checker has been disabled; the process checker needs review to see if it can be useful without sending cronspam on every reconfig.
There should be routine testing to ensure the Ansible playbooks committed to the git server run green, at least in check mode. Unfortunately it's risky to automate this because it requires root access to all the servers; at the moment root access is restricted to admins in person.
We should be in the habit of running the complete playbook on all the servers (e.g. before pushing to the git server), to detect any differences between check mode and normal (active) mode. This is necessary for Ansible tasks that are skipped in check mode.
Future work
This incident also highlights longstanding problems with our low bus protection factor and lack of automated testing. The resolutions listed above are small steps towards fixing these weaknesses.