Network setup for Cambridge's new DNS servers

2015-01-07 - Progress - Tony Finch

The SCCS-to-git project that I wrote about previously was the prelude to setting up new DNS servers with an entirely overhauled infrastructure.

Old and new hardware

The current setup, which I am replacing, uses Solaris Zones (like FreeBSD Jails or Linux Containers) to host the various name server instances on three physical boxes. The new setup will use Ubuntu virtual machines on our shared VM service (should I call it a "private cloud"?) for the authoritative servers. I am making a couple of changes to the authoritative setup: switching to a hidden master, and eliminating the differences in which zones are served by each server.

I have obtained dedicated hardware for the recursive servers. Our main concern is that they should be able to boot and work with no dependencies on other services beyond power and networking, because basically all the other services rely on the recursive DNS servers. The machines are Dell R320s, each with one Xeon E5-2420 (6 hyperthreaded cores, 2.2GHz), 32 GB RAM, and a Dell-branded Intel 160GB SSD.

Failover for recursive DNS servers

The most important change to the recursive DNS service will be automatic failover. Whenever I need to loosen my bowels I just contemplate dealing with a failure of one of the current elderly machines, which involves a lengthy and delicate manual playbook described on our wiki...

Often when I mention DNS and failover, the immediate response is "Anycast?". We will not be doing anycast on the new servers, though that may change in the future. My current plan is to do failover with VRRP using keepalived. (Several people have told me they are successfully using keepalived, though its documentation is shockingly bad. I would like to know of any better alternatives.) There are a number of reasons for using VRRP rather than anycast:

  • The recursive DNS server addresses are 131.111.8.42 (aka recdns0) and 131.111.12.20 (aka recdns1). (They have IPv6 addresses too.) They are on different subnets which are actually VLANs on the same physical network. It is not feasible to change these addresses.

  • The 8 and 12 subnets are our general server subnets, used for a large proportion of our services, most of which use the recdns servers. So anycasting recdns[01] requires punching holes in the server network routing.

  • The server network routers do not provide proxy ARP and my colleagues in network systems do not want to change this. But our Cisco routers can't punch a /32 anycast hole in the server subnets without proxy ARP. So if we did do anycast we would also have to do VRRP to support failover for recdns clients on the server subnets.

  • The server network spans four sites, connected via our own city-wide fibre network. The sites are linked at layer 2: the same Ethernet VLANs are present at all four sites. So VRRP failover gives us pretty good resilience in the face of server, rack, or site failures.

VRRP will be a massive improvement over our current setup, and it should give us much of the resilience for which other sites would normally need anycast, with significantly less complexity. And less complexity means less time before I can take the old machines out of service.
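For illustration, a minimal keepalived instance for the recdns0 address might look something like the sketch below. The interface name, virtual router ID, priority, and prefix length are assumptions for the sketch, not our real settings.

    # /etc/keepalived/keepalived.conf (sketch; values are illustrative)
    vrrp_instance recdns0 {
        state MASTER                  # BACKUP on the standby servers
        interface em1                 # subnet 8 is untagged on em1
        virtual_router_id 80          # arbitrary ID, must match on all peers
        priority 200                  # lower values on the backups
        advert_int 1
        virtual_ipaddress {
            131.111.8.42/24 dev em1   # recdns0 service address; prefix assumed
        }
    }

The backup servers carry the same instance with state BACKUP and lower priorities, so the service address moves automatically when the master stops sending advertisements.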

After the new setup is in place, it might make sense for us to revisit anycast. For instance, we could put recursive servers at other points of presence where our server network does not reach (e.g. the Addenbrooke's medical research site). But in practice there are not many situations when our server network is unreachable but the rest of the University data network is functioning, so it might not be worth it.

Configuration management

The old machines are special snowflake servers. The new setup is being managed by Ansible.

I first used Ansible in 2013 to set up the DHCP servers that were a crucial part of the network renumbering we did when moving our main office from the city centre to the West Cambridge site. I liked how easy it was to get started with Ansible. The way its --check mode prints a diff of remote config file changes is a killer feature for me. And it uses ssh rather than rolling its own crypto and host authentication like some other config management software.
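For example, a dry run like the following (the playbook and host names are made up for illustration) prints a diff of every remote file the playbook would change, without changing anything:

    $ ansible-playbook site.yml --check --diff --limit dns-staging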

I spent a lot of December working through the configuration of the new servers, starting with the hidden master and an authoritative server (a staging server which is a clone of the future live servers). It felt like quite a lot of elapsed time without much visible progress, though I was steadily knocking items off the list of things to get working.

The best bit was the last day before the xmas break. The new recdns hardware arrived on Monday 22nd, so I spent Tuesday racking them up and getting them running.

My Ansible setup already included most of the special cases required for the recdns servers, so I just uncommented their hostnames in the inventory file and told Ansible to run the playbook. It pretty much Just Worked, which was extremely pleasing :-) All that steady work paid off big time.
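In outline it was no more complicated than something like this; the group name, playbook name, and layout are illustrative rather than copied from the real ipreg repository:

    # inventory: uncomment a host to bring it under Ansible's control
    [recdns]
    recdns-wcdc
    recdns-cnh
    recdns-rnb
    recdns-sby

    $ ansible-playbook site.yml --limit recdns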

Multi-VLAN network setup

The main part of the recdns config that did not work was the network interface configuration, which was OK because I hadn't expected it to work without some fiddling.

The recdns servers are plugged into switch ports which present subnet 8 untagged (mainly to support initial bootstrap without requiring special setup of the machine's BIOS), and subnet 12 with VLAN tags (VLAN number 812). Each server has its own IPv4 and IPv6 addresses on subnet 8 and subnet 12.
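In /etc/network/interfaces terms the per-server addresses come out roughly like this (IPv4 only, with analogous inet6 stanzas; the addresses, netmask, and gateway below are placeholders, and the real file is generated from a template, as described later):

    # subnet 8, untagged on em1
    auto em1
    iface em1 inet static
        address 131.111.8.101         # per-server address (placeholder)
        netmask 255.255.255.0         # placeholder netmask
        gateway 131.111.8.1           # subnet 8 router (placeholder)

    # subnet 12, tagged as VLAN 812 (ifupdown needs the vlan package for this)
    auto em1.812
    iface em1.812 inet static
        address 131.111.12.101        # per-server address (placeholder)
        netmask 255.255.255.0         # placeholder netmask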

The service addresses recdns0 (subnet 8) and recdns1 (subnet 12) will be additional (virtual) addresses which can be brought up on any of the four servers. They will usually be configured something like:

  • recdns-wcdc: VRRP master for recdns0
  • recdns-rnb: VRRP backup for recdns0
  • recdns-sby: VRRP backup for recdns1
  • recdns-cnh: VRRP master for recdns1

And in case of multi-site failures, the recdns1 servers will act as additional backups for the recdns0 servers and vice versa.

There were two problems with my initial untested configuration.

The known problem was that I was likely to need policy routing, to ensure that packets with a subnet 12 source address were sent out with VLAN 812 tags. This turned out to be true for IPv4, whereas IPv6 does the Right Thing by default.
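Concretely, the policy routing amounts to giving subnet 12 its own routing table, plus a rule that sends packets with subnet 12 source addresses to that table, along these lines (the table number, prefix length, and router address are illustrative):

    # packets with a subnet 12 source address use routing table 812
    ip rule add from 131.111.12.0/24 lookup 812
    # table 812 routes everything out of the tagged interface
    ip route add 131.111.12.0/24 dev em1.812 table 812
    ip route add default via 131.111.12.1 dev em1.812 table 812

In the interfaces file these would live as up/post-up lines on the em1.812 stanza rather than being typed by hand.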

The unknown problem was that the VLAN 812 interface came up only half-configured: it was using SLAAC for IPv6 instead of the static address that I specified. This took a while to debug. The clue to the solution came from running ifup with the -v flag to get it to print out what it was doing:

    # ip link delete em1.812
    # ifup -v em1.812

This showed that interface configuration was failing when it tried to set up the default route on that interface. Because there can be only one default route, and there was already one on the main subnet 8 interface. D'oh!

Having got ifup to run to completion I was able to verify that the subnet 12 routing worked for IPv6 but not for IPv4, pretty much as expected. With advice from my colleagues David McBride and Anton Altaparmakov I added the necessary runes to the configuration.

My final /etc/network/interfaces files on the recdns servers are generated from a Jinja template you can see in the ipreg Ansible repository.

Edited to add:

The original minimal policy routing configuration sometimes failed to work, depending on which routers were active and how ECMP split the traffic. Eventually, after a number of rounds of reachability bug fixes, I extended the configuration to repeat the policy routing setup, mutatis mutandis, for both IPv4 and IPv6 on both subnet 8 and subnet 12.
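The extended scheme looks roughly like this for each address family, with documentation prefixes standing in for our real IPv6 ranges and each table getting its own connected and default routes via the corresponding router, as above:

    # IPv4: one table per subnet
    ip rule add from 131.111.8.0/24  lookup 8
    ip rule add from 131.111.12.0/24 lookup 812
    # IPv6 equivalents (2001:db8:... are placeholder prefixes)
    ip -6 rule add from 2001:db8:0:8::/64  lookup 8
    ip -6 rule add from 2001:db8:0:12::/64 lookup 812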