Monitoring highly distributed DNS deployments: challenges and recommendations
The root server system examined as a use case
DNS name servers are crucial for the reachability of domain names. For this reason, name server operators rely on multiple name servers and often replicate each server and distribute the instances across locations around the world. Operators monitor the name servers to verify that they meet the expected performance requirements. Monitoring can be done from within the system, e.g. using metrics such as CPU utilisation, and from the outside, mimicking the experience of the clients. In this article, we focus on the latter. We take the root server system as a use case and highlight the challenges that operators and researchers face when monitoring highly distributed DNS deployments from the outside. We also present recommendations on building a monitoring system that is more reliable and that captures only the relevant metrics.
The DNS is a prime example of a decentralised and distributed system. For example, it is a decentralised system in that SIDN is responsible for .nl, Verisign is responsible for .com, and you, the reader, are responsible for your own domain names. It is also a distributed system because each of those operators uses a geographically dispersed set of authoritative name servers to provide information about their part of the name space.
Probably the most distributed and decentralised service in the DNS is the root zone. The root server system consists of 13 different name servers, operated by 12 different organisations, all reachable via IPv4 and IPv6. "Under the hood", the name servers consist of more than 1,900 "sites" (physical or virtual machines) located in networks all over the world. The root server operators accomplish that by using a technique called "anycast". The internet's routing protocol (BGP, the Border Gateway Protocol) makes sure that queries to each root server are sent to the site that is topologically closest, in terms of network distance, to the machine sending the request.
The root server system provides information about all the authoritative name servers for TLDs, such as .nl, and also acts as a trust anchor for the DNS Security Extensions (DNSSEC). And even though a resolver sends relatively few queries to the root servers in the course of a day, a large-scale outage of the root servers, however unlikely, would ultimately have consequences for the reachability of all domain names globally.
Monitoring a group of distributed DNS servers from within (endogenous monitoring) is relatively straightforward and makes use of metrics of both individual server instances and the system as a whole. Basic examples are CPU and memory utilisation, the number of queries received per second, and the serial number of the zone currently being served. More advanced metrics include whether resolvers reach their nearest anycast sites, and the time it takes for requests to reach the name server.
However, by relying on internal monitoring only, operators might overlook how their clients experience the service. In the DNS, monitoring from the outside (exogenous monitoring), using vantage points that mimic real clients, provides valuable insights. For example, it can reveal problems on the network path towards a name server, help during operational changes, and reveal if third parties are trying to meddle with the information the name server is providing.
Finally, external monitoring enables others to monitor public DNS services independently. That is where the root servers come into the picture again. As described above, the root server system plays an important role in the DNS. It is therefore in the internet community's interest that the root servers are available and performing correctly at all times. Today, the only standard operational data for the root server system is the RSSAC002 data, which the root server operators (RSOs) provide voluntarily. The ICANN Root Server System Advisory Committee (RSSAC) developed RSSAC047 for use by a future root server system governance body, under which root server operators may need to meet contractual obligations for performance, availability and quality of service.
The .nl zone relies on 3 name servers, which together form the .nl name server system. In total, that system consists of more than 80 sites located in networks all over the world.
To monitor the .nl name servers externally, SIDN also relies on measurements by RIPE Atlas. As with the root servers, all RIPE Atlas probes query our name servers on a regular basis. The measurements are processed and presented on Grafana dashboards internal to SIDN. Additionally, SIDN relies on anycast-based performance measurements obtained using Verfploeter, a hybrid external and internal monitoring tool.
We recommend reading our blog post on the DNS infrastructure for .nl if you would like to know more.
RSSAC047v2 proposes the requirements for the root server monitoring system, including the metrics that the system needs to monitor. The metrics apply both to individual root servers and to the root server system as a whole.
The metrics are:
Availability: the amount of time root servers and the root server system are unreachable
Response latency: the time it takes to respond to a query
Correctness: whether the servers respond with the expected information
Publication delay: the time it takes to serve the latest version of the root zone
RSSAC047 also proposes how the measurements should be aggregated and what performance levels are expected.
For example, to measure the availability of a root server, the measurement system should send an SOA query for the root zone every 5 minutes from each measurement vantage point. At the end of a month, the measurement system calculates the percentage of queries that did not successfully obtain a response (e.g. because they timed out). If more than 4% of the queries were unsuccessful, the measurement system flags a root server as not having met the defined availability threshold. Once a month, a report is generated summarising how the root servers and the root server system as a whole performed for each metric.
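To make that aggregation concrete, the sketch below shows the calculation in Python. It is a simplified illustration of the procedure described above, not the actual ICANN implementation, and the probe counts in the example are hypothetical.

```python
# Simplified sketch of the RSSAC047-style monthly availability check for a
# single root server, as seen from one vantage point. Each element of
# `results` represents one SOA probe: True if a valid response was received,
# False if the query timed out or otherwise failed.

def availability_report(results: list[bool], threshold: float = 0.04) -> dict:
    """Aggregate one month of probe results for a single root server."""
    total = len(results)
    failed = results.count(False)
    failure_rate = failed / total if total else 0.0
    return {
        "queries": total,
        "failed": failed,
        "failure_rate": failure_rate,
        # The server is flagged if more than 4% of queries went unanswered.
        "meets_threshold": failure_rate <= threshold,
    }

# Hypothetical example: one probe every 5 minutes over a 30-day month
# (8,640 probes), of which 200 timed out (~2.3%): the threshold is still met.
month = [True] * 8440 + [False] * 200
print(availability_report(month))
```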
ICANN has developed an initial implementation of the monitoring specifications and has been monitoring the metrics listed above for more than 2 years. The software is open source.
The initial implementation is currently still in the development phase, and the generated reports are solely for information purposes. Nevertheless, on multiple occasions, the generated reports did not match the experience and expectations of the operators or the community. First, the initial implementation reported on several occasions that the root server system did not meet the availability threshold. Second, in May 2024, one of the root servers ("letters") was serving a stale root zone for several days, but the initial implementation did not report anything unusual.
In response, the root server operators Verisign and ISC asked us and NLnet Labs to study the implementation and deployment of the measurement software, and the measurements obtained. The goal was to understand whether the root server system was actually performing inadequately. Additionally, during the course of the study, one of the root servers failed to publish root zones on time. We therefore also wanted to understand why the measurement software did not report on the high publication delay.
During the course of the study, we identified several challenges that need to be taken into account when measuring the root server system externally. We also formulated recommendations on how those challenges could be addressed. You can find more details of our study in our report.
We believe that the insights we obtained also apply to monitoring systems for other distributed DNS services. The general challenges we identified and our associated recommendations are discussed in the remainder of this article, with our study of the root server system used for illustration.
The first major challenge when measuring highly distributed DNS systems is the selection of vantage points. Where RSSAC047 is concerned, the initial implementation relies on 20 custom vantage points, deployed on different cloud platforms and at data centres all over the world.
For RSSAC047, the goal was to strike a good balance between coverage and manageability. In general, however, monitoring system operators have the following 3 options:
Decide which sites are most crucial for monitoring, and try to find vantage points that allow them to be reached.
Select vantage points that reflect the most "important" clients.
Select vantage points that are evenly distributed (e.g. across a country, continent or the world).
Note that route changes can cause a vantage point to reach different sites at different times. The DNS Name Server Identifier (NSID) option helps the measurement platform to track which site each vantage point (VP) reaches.
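As an illustration, the snippet below sends an SOA query for the root zone with the NSID option and prints the identifier of the site that answered. It uses the dnspython library and K-root's published IPv4 address purely as an example; the initial implementation does not necessarily work this way.

```python
# Query a root server with the EDNS NSID option to see which anycast site
# answers. Uses the dnspython library; 193.0.14.129 is K-root's IPv4 address.
import dns.edns
import dns.message
import dns.query
import dns.rdatatype

def query_nsid(server_ip: str) -> str | None:
    query = dns.message.make_query(".", dns.rdatatype.SOA)
    # Ask the server to include its Name Server Identifier in the response.
    query.use_edns(edns=0, options=[dns.edns.GenericOption(dns.edns.NSID, b"")])
    response = dns.query.udp(query, server_ip, timeout=5)
    for option in response.options:
        if option.otype == dns.edns.NSID:
            # Depending on the dnspython version, the payload is exposed as
            # .nsid (NSIDOption) or .data (GenericOption).
            payload = getattr(option, "nsid", None) or getattr(option, "data", b"")
            return payload.decode(errors="replace")
    return None

print(query_nsid("193.0.14.129"))  # prints the identifier of the answering K-root site
```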
For the deployment of the initial implementation of the root monitoring software, we found that the low number of vantage points relative to the number of sites reduced confidence in the measurement results. Because of the low coverage, usually only 1 vantage point observed a timeout for a given site at a given time. It was therefore often unclear whether the timeout was actually caused by the root server site, the network, or the vantage point itself.
In addition, we found strong signs that some vantage points were at locations with poor connectivity. For example, they often failed to reach all root servers at the same time, which strongly suggests that the timeouts were caused by the vantage point or its network, rather than by the root servers.
The second major challenge involves the definition of the metrics.
This became apparent when we looked closer at the publication delay reported for one of the root servers. We found that, while the measurement system did pick up on the missing zone files, the reports did not mention the problem. The reason for this is the way the measurements are aggregated.
According to RSSAC047, the monthly publication delay of a root server is calculated by taking the median across all measured delays. Zone files that are never published are not taken into account. The metric is simple, but has the disadvantage that it only signals problems with zone publication when the delay was high for at least half of the month.
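The sketch below illustrates the effect with hypothetical numbers. It reflects our simplified reading of the aggregation, not the code used by the measurement system.

```python
# Simplified sketch of the median-based publication-delay aggregation and why
# it can hide a multi-day problem. Delays are in minutes per published zone
# version; versions that were never published do not appear in the list at all.
from statistics import median

def monthly_publication_delay(delays_minutes: list[float]) -> float:
    """Monthly metric: the median of all measured publication delays."""
    return median(delays_minutes)

# Hypothetical month: 55 zone versions served with a normal delay of ~10
# minutes, and 20 versions served roughly two days late.
normal = [10.0] * 55
delayed = [2 * 24 * 60.0] * 20
print(monthly_publication_delay(normal + delayed))  # 10.0: the multi-day problem is invisible
```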
As the RSSAC047 example shows, metrics often need to strike a balance between being simple but vague and being complex but detailed. The publication delay metric is calculated in a very simple manner, which results in the loss of important details.
Operators and researchers who would like to monitor highly distributed systems also face other challenges that we did not address in our study. Examples include the cost of the monitoring platform, the availability of vantage points at different locations, and the measurement frequency.
Based on our analysis of the RSSAC047 measurements, we make 3 recommendations for improving monitoring systems for distributed DNS services like the root server system.
First, we recommend implementing health checks of the vantage points themselves.
In the case of RSSAC047, we developed and deployed 2 extensions to the current implementation in a 1-month trial.
First, we continuously monitored the availability of the vantage points from an additional monitoring node. While these measurements can also be ambiguous (e.g. because of network problems between the monitoring node and the vantage point), they add an extra signal that helps with interpreting the measurements made by the vantage points.
Second, we implemented traceroute measurements directed to services not related to the root server system. The assumption was that if these measurements were to fail at the same time as the root server measurements, that would be a strong sign that the connectivity problems were not caused by the root servers.
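The logic of that cross-check can be sketched as follows. The data model and function names are hypothetical; this is not the actual code of the extensions we deployed.

```python
# Sketch of the cross-check: a root-server timeout observed by a vantage point
# is only attributed to the root server if the VP's control measurements to
# unrelated targets succeeded around the same time.
from datetime import datetime, timedelta

def timeout_attributable_to_root(
    timeout_at: datetime,
    control_failures: list[datetime],  # timestamps of failed traceroutes to non-root targets
    window: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True if no control measurement failed close to the root-server timeout."""
    return not any(abs(failure - timeout_at) <= window for failure in control_failures)

# If a control traceroute failed 2 minutes after the timeout, the VP itself
# (or its network) is the more likely culprit.
t = datetime(2024, 5, 10, 12, 0)
print(timeout_attributable_to_root(t, [datetime(2024, 5, 10, 12, 2)]))  # False
```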
Second, we argue that using more vantage points enables a highly distributed system to be monitored more accurately, even though it might increase the noise in the monitoring system.
In order to retrospectively test whether a timeout was caused by the root servers or not, we relied on the measurements by more than 10,000 RIPE Atlas probes that query the root servers every few minutes. If tens or hundreds of probes reported connectivity problems at the same time, the root server very likely had problems. The figure below shows an example where the initial implementation signalled a timeout (vertical blue line) and a large number of RIPE Atlas probes failed to reach the same server (orange line). For us, that was a strong sign that the measurement from the VP could be trusted and that the root server, or the path towards it, was indeed unavailable.
Figure 1: Example of a timeout observed by the initial implementation (vertical line) automatically correlated with a drop in reachability observed by RIPE Atlas.
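The correlation in Figure 1 can be automated along the following lines. The time window and probe threshold below are illustrative assumptions rather than the values used in our study.

```python
# Sketch: corroborate a timeout reported by a vantage point with RIPE Atlas
# measurements towards the same root server.
from datetime import datetime, timedelta

def corroborated_by_atlas(
    timeout_at: datetime,
    atlas_failure_times: list[datetime],  # Atlas probes failing to reach the same server
    window: timedelta = timedelta(minutes=10),
    min_probes: int = 50,
) -> bool:
    """True if enough independent Atlas probes saw failures around the same time."""
    nearby = sum(1 for ts in atlas_failure_times if abs(ts - timeout_at) <= window)
    return nearby >= min_probes
```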
By implementing recommendations #1 and #2, we were able to confirm that the availability of the root server system was very likely higher than reported by the initial implementation.
Our third recommendation is to develop test cases that help operators to come up with metrics that actually enable reporting on expected outages and to define appropriate alarm thresholds. By developing test cases of outages that should and should not be picked up by the metric, operators can make sure that the metric only triggers reports on relevant events.
In the example of RSSAC047, test cases that simulate different scenarios of delayed zone publication could have helped RSSAC to ascertain whether the metric was actually capable of reflecting relevant outages.
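As an illustration of what such a test case might look like, the sketch below reuses the simplified median-based metric from earlier and checks whether a week-long publication delay would be flagged. The alert threshold and the scenario values are assumptions made for the example.

```python
# Sketch of a test case for the publication-delay metric: a week of zone
# versions served two days late should be flagged, but the median-based
# aggregation never signals it. The alert threshold is an illustrative value.
from statistics import median

ALERT_THRESHOLD_MINUTES = 65  # e.g. "a new zone version should be served within ~an hour"

def metric_flags_problem(delays_minutes: list[float]) -> bool:
    return median(delays_minutes) > ALERT_THRESHOLD_MINUTES

def test_week_long_publication_delay_is_flagged():
    normal_week = [10.0] * 14            # ~2 zone versions per day, served promptly
    late_week = [2 * 24 * 60.0] * 14     # one week of versions served two days late
    month = normal_week * 3 + late_week
    # This assertion fails: the median still reflects the three "good" weeks,
    # so the metric never reports the week-long problem.
    assert metric_flags_problem(month)
```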
How you want to monitor a distributed DNS service depends on many different things: the scale of the service, whether you're operating the service yourself or not, the reason why you want to monitor in the first place, and who the audience is for the reports created by the monitoring system. That applies not only to the RSSAC047 measurements, but also to other platforms that monitor distributed DNS systems.
Where RSSAC047 is concerned, it has already taken 2 iterations to improve the metrics and a third will likely follow. We hope that our study will help with improving the metrics further. You can find more details of our study on our website. The information on our site also looks more closely at the source code and describes in great detail how we came to the conclusion that the root server performance is very likely better than originally reported.
We would like to thank Verisign, ISC and ICANN for their support throughout this study. Also, we would like to thank the root server operators and the members of the RSSAC caucus for their feedback.
If you have any feedback about this blog post or our report, please contact us at moritz.muller@sidn.nl.