2018-08-03 - News - Tony Finch
Earlier this year, we had an abortive attempt to turn on BIND 9.12's
new serve-stale feature. This helps to make the DNS more resilient
when there are local network problems or when DNS servers out on the
Internet are temporarily unreachable. After many trials and
tribulations we have at last successfully enabled serve-stale.
Popular websites tend to have very short DNS TTLs, which means cached
answers expire quickly and the DNS stops working soon after a network
problem starts. As a result, network problems look more like DNS
problems, so they get reported to the wrong people. We hope that
serve-stale will reduce this kind of misattribution.
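As an aside, it is easy to see how short these TTLs are. The following is a minimal sketch using dnspython; it is not part of our tooling, and the names queried are just examples:

    import dns.resolver   # dnspython >= 2.0; `pip install dnspython`

    def answer_ttl(name, rdtype="A"):
        """Return the TTL of the answer RRset for name/rdtype."""
        answer = dns.resolver.resolve(name, rdtype)
        return answer.rrset.ttl

    if __name__ == "__main__":
        # Illustrative names only; big CDN-hosted sites often
        # return answer TTLs of well under a minute.
        for name in ["www.example.com", "www.cam.ac.uk"]:
            print(name, answer_ttl(name), "seconds")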
Pinning down CVE-2018-5737
The original attempt to roll out serve-stale was rolled back after one
of the recursive DNS servers crashed. My normal upgrade testing wasn't
enough to trigger the crash, which happened after a few hours of
production load.
Since this was a crash that could be triggered by query traffic, it counted as a security bug. After I reported it to ISC.org, there followed a lengthy effort to reproduce it in a repeatable manner, so that it could be debugged and fixed.
I have a tool called adns-masterfile which I use for testing server
upgrades and suchlike. I eventually found that sending lots of reverse
DNS queries was a good way to provoke the crash; the reverse DNS has
quite a large proportion of broken DNS servers, which exercise the
serve-stale machinery.
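What I actually ran was adns-masterfile over a cache dump, as described below, but for a flavour of the idea, here is a hypothetical dnspython sketch that throws PTR queries for a block of addresses at a recursive server:

    # Hypothetical sketch, not the adns-masterfile run described below:
    # send PTR queries for a /24 of address space to a recursive
    # server, to exercise the (often broken) reverse DNS delegations.
    import dns.exception
    import dns.message
    import dns.query
    import dns.reversename

    RESOLVER = "192.0.2.53"   # placeholder address of the server under test

    def ptr_sweep(prefix="192.0.2", resolver=RESOLVER):
        for last in range(256):
            qname = dns.reversename.from_address(f"{prefix}.{last}")
            query = dns.message.make_query(qname, "PTR")
            try:
                dns.query.udp(query, resolver, timeout=2)
            except dns.exception.Timeout:
                pass   # broken delegations time out; that is the point

    if __name__ == "__main__":
        ptr_sweep()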
The best I was able to do was get the server to crash after 1 hour; I
could sometimes get it to crash sooner, but not reliably. I used a
cache dump (from rndc dumpdb) truncated after the .arpa TLD so it
contained 58MB of reverse DNS, nearly 700,000 queries. I then set up
several concurrent copies of adns-masterfile to run in loops. The
different copies tended to synchronize with each other, because when
one of them got blocked on a broken domain name the others would catch
up. So I added random delays between each run to encourage different
copies to make queries from different parts of the dump file.
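The harness amounted to something like the following sketch, written here in Python for illustration; it is not my actual harness, and the adns-masterfile invocation is an assumption:

    # Illustrative sketch of the reproduction harness, not the real thing.
    # Run several concurrent copies of adns-masterfile over the truncated
    # cache dump, with a random pause between runs so that the copies
    # drift apart and query different parts of the dump file.
    import random
    import subprocess
    import threading
    import time

    DUMP = "named_dump.db.arpa"          # cache dump truncated after .arpa
    COMMAND = ["adns-masterfile", DUMP]  # assumed invocation; check its usage
    COPIES = 4

    def run_in_a_loop():
        while True:
            subprocess.run(COMMAND, stdout=subprocess.DEVNULL)
            time.sleep(random.uniform(0, 300))   # random delay between runs

    threads = [threading.Thread(target=run_in_a_loop, daemon=True)
               for _ in range(COPIES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()   # run until interrupted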
It was difficult for our friends at ISC.org to provoke a crash. After
valgrind failed to provide any clues, I tried Mozilla's rr debugger,
which supports record/replay time-travel debugging with efficient
reverse execution. It allowed me to bundle up the binary, libraries,
and execution trace and send them to ISC.org so they could investigate
what happened in detail.
Cosmetic issues
I waited for BIND 9.12.2 before deploying the fixed serve-stale
implementation because earlier versions had very verbose logging that
could not easily be turned off.
I submitted a patch that moved serve-stale logging to a separate category so that it can be turned on and off independently of other logging or moved to a separate file. This was merged for the 9.12.2 release, which made it usable in production.
Improved upgrade workflow
I also investigated options for better testing of new versions before
putting them into production. The disadvantage of adns-masterfile is
that it makes a large number of unique queries, whereas the
CVE-2018-5737 crash required repetition.
I now have a little script which can extract queries from tcpdump
output and replay them against another server. I can use it to mirror
production traffic onto a staging server, and let it soak for several
hours before performing a live/staging switch-over.
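The script boils down to something like this sketch (simplified; the tcpdump output format varies between versions, and the staging server address is a placeholder):

    # Simplified sketch of the replay idea, not the actual script.
    # Read `tcpdump -l -n port 53` output on stdin, pull the query type
    # and name out of lines like
    #   ... > 192.0.2.53.53: 12345+ A? www.example.com. (33)
    # and re-send each query to a staging server. Requires dnspython.
    import re
    import sys

    import dns.message
    import dns.query

    STAGING = "192.0.2.54"   # placeholder address of the staging resolver

    # Approximate pattern; the exact tcpdump format varies.
    QUERY = re.compile(r" ([A-Za-z0-9]+)\? (\S+) \(\d+\)")

    for line in sys.stdin:
        match = QUERY.search(line)
        if not match:
            continue
        rdtype, qname = match.groups()
        try:
            message = dns.message.make_query(qname, rdtype)
            dns.query.udp(message, STAGING, timeout=2)
        except Exception:
            pass   # skip timeouts, unknown types, malformed names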
Selfish motivations
Earlier this year we had a number of network outages during which we
lost connectivity to about half of the Internet. I was quite keen to
get serve-stale deployed before these outages were fixed, so that I
could observe it working; however, I lost that race.
The outages were triggered by work on the network links between our CUDN border equipment and JANET's routers. In theory, this should not have affected our connectivity, because traffic should have seamlessly moved to use our other JANET uplink.
However, the routing changes propagated further than expected: they appeared to one of JANET's connectivity providers as route withdrawals and readvertisements. If more than a few ups and downs happened within a number of minutes, our routes were deemed to be flapping, triggering the flap-damping protection mechanism. Flap-damping meant our routes were ignored for 20 minutes by JANET's connectivity provider.
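For those unfamiliar with flap-damping, the toy model below shows how it works: each withdraw/re-advertise adds a fixed penalty, the penalty decays exponentially, and the route is suppressed while the penalty is above a threshold. The numbers are traditional illustrative defaults, not the actual configuration of JANET or its connectivity provider.

    # Toy model of BGP route flap damping (RFC 2439 style). The figures
    # below are illustrative defaults, not the actual configuration of
    # JANET or its connectivity provider.
    import math

    PENALTY_PER_FLAP = 1000    # added on each withdraw/re-advertise
    SUPPRESS_LIMIT = 2000      # suppress the route above this penalty
    REUSE_LIMIT = 750          # re-use the route below this penalty
    HALF_LIFE = 15 * 60        # penalty halves every 15 minutes (seconds)

    def decay(penalty, seconds):
        return penalty * math.exp(-seconds * math.log(2) / HALF_LIFE)

    # Three flaps a minute apart, then watch the route recover.
    penalty = 0.0
    for _ in range(3):
        penalty = decay(penalty, 60) + PENALTY_PER_FLAP

    suppressed = penalty > SUPPRESS_LIMIT
    print(f"penalty after 3 flaps: {penalty:.0f}, suppressed: {suppressed}")

    # How long until the penalty decays below the reuse limit?
    minutes = 0
    while penalty > REUSE_LIMIT:
        penalty = decay(penalty, 60)
        minutes += 1
    print(f"route usable again after about {minutes} minutes")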
Addressing this required quite a lot of back-and-forth between network engineers in three organizations. The problem hasn't been completely eliminated, but the flap-damping has been made less sensitive, and we have amended our border router work processes to avoid multiple up/down events.