2018-01-12 - Progress - Tony Finch
(This item isn't really related to progress towards a bright shiny future, but since I'm blogging here I might as well include other work-related articles.)
This week I have been helping
Mark Andrews and Evan Hunt
to track down a bug in BIND9. The problem manifested as named
occasionally failing to re-sign a DNSSEC zone; the underlying cause
was access to uninitialized memory.
It was difficult to pin down, partly because there is naturally a lot of nondeterminism in uninitialized memory bugs, but there is also a lot of nondeterminism in the DNSSEC signing process, and it is time-dependent so it is hard to re-run a failure case, and normally the DNSSEC signing process is very slow - three weeks to process a zone, by default.
Timeline
Oct 9 - latent bug exposed
Nov 12 - first signing failure
I rebuild and restart my test DNS server quite frequently, and the bug is quite rare, which explains why it took so long to appear.
Nov 18 - Dec 6 - Mark fixes several signing-related bugs
Dec 28 - another signing failure
Jan 2 - I try adding some debugging diagnostics, without success
Jan 9 - more signing failures
Jan 10 - I make the bug easier to reproduce
Mark and Evan identify a likely cause
Jan 11 - I confirm the cause and fix
The debugging process
The incremental re-signing code in named
is tied into BIND's core
rbtdb
data structure (the red-black tree database). This is tricky
code that I don't understand, so I mostly took a black-box approach to
try to reproduce it.
I started off by trying to exercise the signing code harder. I set up a test zone with the following options:
# signatures valid for 1 day (default is 30 days) # re-sign 23 hours before expiry # (whole zone is re-signed every hour) sig-validity-interval 1 23; # restrict the size of a batch of signing to examine # at most 10 names and generate at most 2 signatures sig-signing-nodes 10; sig-signing-signatures 2;
I also populated the zone with about 500 records (not counting DNSSEC records) so that several records would get re-signed each minute.
This helped a bit, but I often had to wait a long time before it went
wrong. I wrote a script to monitor the zone using rndc zonestatus
, so
I could see if the "next resign time" matches the zone's earliest
expiring signature.
There was quite a lot of flailing around trying to exercise the code harder, by making the zone bigger and changing the configuration options, but I was not successful at making the bug appear on demand.
To make it churn faster, I used dnssec-signzone
to construct a version
of the zone in which all the signatures expire in the next few minutes:
rndc freeze test.example dig axfr test.example | grep -v RRSIG | dnssec-signzone -e now+$((86400 - 3600 - 200)) \ -i 3600 -j 200 \ -f signed -o test.example /dev/stdin rm -f test.example test.example.jnl mv signed test.example # re-load the zone rndc thaw test.example # re-start signing rndc sign test.example
I also modified BIND's re-signing co-ordination code; normally each batch will re-sign any records that are due in the next 5 seconds; I reduced that to 1 second to keep batch sizes small, on the assumption that more churn would help - which it did, a little bit.
But the bug still took a random amount of time to appear, sometimes within a few minutes, sometimes it would take ages.
Finding the bug
Mark (who knows the code very well) took a bottom-up approach; he ran
named
under valgrind
which identified an access to uninitialized
memory. (I don't know what led Mark to try valgrind
- whether he does
it routinely or whether he tried it just for this bug.)
Evan had not been able to reproduce the bug, but once the cause was identified it became clear where it came from.
The commit on the 9th October that exposed the bug was a change to BIND's memory management code, to stop it from deliberately filling newly-allocated memory with garbage.
Before this commit, the missing initialization was hidden by the memory fill, and the byte used to fill new allocations (0xbe) happened to have the right value (zero in the bottom bit) so the signer worked correctly.
Evan builds BIND in developer mode, which enables memory filling, which stopped him from being able to reproduce it.
Verifying the fix
I changed BIND to fill memory with 0xff which (if we were right) should provoke signing failures much sooner. And it did!
Then applying the one-character fix to remove the access to uninitialized memory made the signer work properly again.
Lessons learned
BIND has a lot of infrastructure that tries to make C safer to use, for instance:
Run-time assertions to ensure that internal APIs are used correctly;
Canary elements at the start of most objects to detect memory overruns;
buffer
andregion
types to prevent memory overruns;A memory management system that keeps statistics on memory usage, and helps to debug memory leaks and other mistakes.
The bug was caused by failing to use buffer
s well, and hidden by the
memory management system.
The bug occurred when initializing an rdataslab
data structure, which
is an in-memory serialization of a set of DNS records. The records are
copied into the rdataslab
in traditional C style, without using a
buffer
. (This is most blatant when the code
manually serializes a 16 bit number
instead of using isc_buffer_putuint16
.) This code is particularly
ancient which might explain the poor style; I think it needs
refactoring for safety.
It's ironic that the bug was hidden by the memory management code - it's
supposed to help expose these kinds of bug, not hide them! Nowadays, the
right approach would be to link to jemalloc
or some other advanced
allocator, rather than writing a complicated wrapper around standard
malloc
. However that wasn't an option when BIND9 development started.
Conclusion
Memory bugs are painful.