Resilience of the Domain Name System

Over the past six months I had the opportunity to write my Master's thesis at SIDN Labs. In this article I introduce the research I carried out and summarize the main results. The topic of my thesis was the resilience of the Domain Name System, with the second-level domains in the .nl zone as a case study.

Motivation

The Internet is a collection of separate networks, called autonomous systems (ASs), which together form a global network. Past incidents show that parts of this global network sometimes fail, with varying impact on the rest of the network. However, the impact of such incidents on the availability of DNS data had not been analyzed before. Such an analysis can help to estimate the impact of possible future incidents and give insight into which countermeasures would lower it. The main purpose of my study was therefore to develop a method for estimating this impact.

Methodology

The method I propose consists of several subsequent steps. The most important ones are the following:

  1. Map the domain names to be analyzed onto the autonomous systems in which their name servers are hosted

  2. Obtain a graph representing the network topology at AS level

  3. Simulate malfunction of parts of the obtained topology

  4. Investigate shortest paths between ASs to analyze whether all ASs in which domain names are hosted are still reachable
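As a rough illustration of step 1, the sketch below maps name server IP addresses onto AS numbers with a longest-prefix match. The prefix table, the domain names, and the function names are hypothetical and use documentation address ranges and AS numbers; a real run would resolve the name servers via DNS and load a full IP-to-ASN dataset instead.

```python
import ipaddress

# Hypothetical prefix-to-ASN table (documentation prefixes and ASNs only).
PREFIX_TO_ASN = {
    "192.0.2.0/24": 64496,
    "198.51.100.0/24": 64497,
    "203.0.113.0/24": 64498,
}

# Hypothetical outcome of the DNS lookups: domain -> name server IP addresses.
DOMAIN_TO_NS_IPS = {
    "example-one.nl": ["192.0.2.53", "198.51.100.53"],
    "example-two.nl": ["203.0.113.53", "203.0.113.54"],
}


def ip_to_asn(ip, table):
    """Return the ASN of the most specific prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None  # (prefix length, ASN) of the best match so far
    for prefix, asn in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, asn)
    return best[1] if best else None


def domains_to_asns(domain_ips, table):
    """Map each domain to the set of ASs that host its name servers."""
    result = {}
    for domain, ips in domain_ips.items():
        asns = {ip_to_asn(ip, table) for ip in ips}
        asns.discard(None)  # drop addresses that match no known prefix
        result[domain] = asns
    return result


domain_asns = domains_to_asns(DOMAIN_TO_NS_IPS, PREFIX_TO_ASN)
```

In this toy data, one domain has name servers in two different ASs while the other has both name servers in the same AS, which is the distinction the later analysis relies on.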

The mapping of domain names onto autonomous systems was done by resolving the NS records of all domain names within .nl and then resolving the A records of the name servers obtained. The resulting name server IP addresses were mapped onto the autonomous systems they belong to using the GeoLite ASN dataset provided by MaxMind.

As network topology, the AS relationship dataset provided by the Center for Applied Internet Data Analysis (CAIDA) was used. This dataset can be used directly to build a graph representing the network topology, with autonomous systems as vertices and the connections between them as edges.

Failures of parts of the infrastructure were simulated by removing those parts from the graph. The impact of each simulated failure was analyzed by checking whether all autonomous systems were still reachable from the other ASs in the network.
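The graph construction, failure simulation, and reachability check can be sketched as follows. This is a minimal illustration and not the code used in the thesis: the sample topology is made up, the function names are hypothetical, and the sketch treats every link as undirected, ignoring the relationship type in the third field. Whether and how the thesis takes relationship types into account is not covered here.

```python
from collections import defaultdict, deque

# Made-up sample in the pipe-separated text layout of the CAIDA
# AS relationship files: <AS1>|<AS2>|<relationship>, '#' starts a comment.
SAMPLE = """\
# toy topology, not real data
64496|64497|-1
64496|64498|-1
64497|64499|-1
64498|64499|0
64499|64500|-1
"""


def build_graph(text):
    """Build an undirected adjacency map from AS relationship lines."""
    graph = defaultdict(set)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        a, b = line.split("|")[:2]  # relationship type is ignored here
        graph[int(a)].add(int(b))
        graph[int(b)].add(int(a))
    return graph


def remove_as(graph, failed):
    """Return a copy of the graph without the failed AS and its links."""
    return {n: nbrs - {failed} for n, nbrs in graph.items() if n != failed}


def hop_counts(graph, source):
    """Breadth-first search: shortest hop count to every reachable AS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


full = build_graph(SAMPLE)
before = hop_counts(full, 64496)                      # full topology
after = hop_counts(remove_as(full, 64499), 64496)     # AS 64499 fails
lost = set(before) - set(after)                       # no longer reachable
```

Here AS 64496 plays the role of the resolver's network. Removing AS 64499 also cuts off AS 64500, which has no other connection, so both end up in `lost`.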

Results

To show some results obtained with this method, 60 different failure scenarios for parts of the Internet’s infrastructure were investigated. These scenarios include failure of the ASs of large hosting providers and transit providers, failure of single connections, and even failure of an Internet Exchange Point, the AMS-IX.

In addition to the availability of the autonomous systems and the domain names hosted in them, the length of the shortest path to reach a domain name was investigated, both as a measure of network performance and as a measure of the severity of failures that do not make all domain names unavailable. The following figures show two examples: the results of simulations in which the autonomous systems maintained by LeaseWeb (left) and Cogent Communications (right) stop functioning normally.

[Figures: charts showing the results for unreachable domains, LeaseWeb simulation (left) and Cogent Communications simulation (right)]

The y-axis of these plots shows the network location of the investigated resolvers, whereas the x-axis shows the number of autonomous systems and domain names that become unreachable in the simulations. The blue parts show the ASs and domain names that are always unreachable, even in the full topology as inferred from CAIDA data; the most probable explanation is that the topology dataset is incomplete. The green part shows the AS removed in the simulation and the domain names hosted solely in that AS. Finally, the orange part shows the ASs and domain names that additionally become unavailable in the simulation.

A general observation is that many more domain names become unavailable when a hosting AS fails. However, because no ASs other than the failing one become unavailable, this problem could easily be circumvented by adding a redundant name server located in another AS. When an AS belonging to a transit provider fails, some additional ASs also become unavailable, but these host only a small number of domain names, so few domain names become unavailable in these situations. In these cases, too, mainly domain names hosted in a single AS are affected.
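The blue, green, and orange groups can be derived from two reachability sets, one for the full topology and one for the topology with the failed AS removed. The sketch below is one possible way to do this; the function name, group labels, AS numbers, and domain names are all hypothetical, and a domain is assumed to be reachable as long as at least one AS hosting one of its name servers is reachable.

```python
def classify_domains(domain_asns, reachable_full, reachable_failed, failed_as):
    """Split domains into the groups shown in the plots."""
    groups = {
        "still_reachable": [],
        "always_unreachable": [],        # blue
        "hosted_only_in_failed_as": [],  # green
        "additionally_unreachable": [],  # orange
    }
    for domain, asns in domain_asns.items():
        if asns & reachable_failed:
            # At least one hosting AS survives the failure.
            groups["still_reachable"].append(domain)
        elif not (asns & reachable_full):
            # Unreachable even before anything was removed.
            groups["always_unreachable"].append(domain)
        elif (asns & reachable_full) == {failed_as}:
            # The only reachable hosting AS was the one that failed.
            groups["hosted_only_in_failed_as"].append(domain)
        else:
            # Hosted in other ASs that were cut off by the failure.
            groups["additionally_unreachable"].append(domain)
    return groups


# Hypothetical example: AS 64497 fails and AS 64499 is cut off as a result;
# AS 64501 is missing from the topology and therefore never reachable.
domain_asns = {
    "a.nl": {64497},
    "b.nl": {64497, 64498},
    "c.nl": {64499},
    "d.nl": {64501},
}
reachable_full = {64496, 64497, 64498, 64499}
reachable_failed = {64496, 64498}

groups = classify_domains(domain_asns, reachable_full, reachable_failed, 64497)
```

The domain `b.nl` stays reachable because its second name server sits in a different AS, which is the redundancy effect described above.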

Conclusion

All investigated situations indicate that the availability of DNS records belonging to second-level domain names of .nl is either only marginally influenced by failure of single parts of the network, or the influence can be easily explained, e.g. because some ASs have only a single connection to the rest of the network and failure of this connection completely separates that AS from the Internet.

The main conclusion that can be drawn from the results is that they strengthen the recommendation to use multiple redundant name servers for every domain name, located in different parts of the Internet's infrastructure. In general, most maintainers of domain names within the .nl zone follow this recommendation, although there are still some exceptions. When this recommendation is followed, DNS data remains available even if some parts of the Internet malfunction, as no bottlenecks were identified in the Internet’s infrastructure that could cause large numbers of domain names to become unavailable.

Please be aware that this analysis is based on logical connections rather than physical connections. The impact of the failure scenarios shown might therefore be different in a real-world situation, which is also influenced by many other factors, such as the available bandwidth on certain links. Nevertheless, the method allows the impact of malfunction of parts of the Internet’s infrastructure to be estimated.

The full thesis is public and can be found here.