Regularly rolling the DNSSEC keys of a top-level domain (TLD) is important for security. However, it’s also a risky task because a failed rollover can render the entire TLD unavailable for validating resolvers. In December 2017, we monitored the algorithm rollover of Sweden's .se TLD. In this blog post, we share our results and methodology.
Rolling DNSSEC keys is a critical operation
DNSSEC key rollovers are a necessary part of operating signed zones. There are various reasons why an operator may need to replace their keys: replacement may be required by the key-management policy, keys may have been compromised in a security breach, or the operator may need to move to a new algorithm. Whatever the reason, a rollover is a critical operation, because a failed rollover can render the entire zone unavailable to validating resolvers for minutes or even hours. That has particularly serious implications for larger zones, such as top-level domains (TLDs), which may contain millions of domain names.
Rolling the .se algorithm
IIS, the operator of .se, planned their algorithm rollover for December 2017. .se was the first ccTLD to be signed with DNSSEC. Back in 2005, IIS chose the algorithm RSA/SHA-1 to create their first keys and signatures, and they had used it ever since. However, RSA/SHA-1 is now dated, so it was time to move to a different algorithm. IIS chose to create their new keys with RSA/SHA-256. They explain their decision-making and planning in more detail in their blog.
Measurements and methodology
IIS asked SIDN Labs to monitor their rollover. That meant we had the opportunity to thoroughly measure an algorithm rollover for the first time, and to develop a methodology that operators can use to monitor this critical operation. Our objective was to give operators more insight into their rollovers, so that they can make more confident decisions at each stage of the rollover and thus maintain zone availability.
Together with colleagues from Northeastern University, University of Twente, and IIS, we measured the whole rollover process using RIPE Atlas and the HTTP proxy network Luminati. In total, we used more than 46,000 vantage points, located in over 12,000 autonomous systems. As a result, we had a realistic view of the .se rollover and also covered resolvers that might not behave as expected.
We are currently using the same platforms to monitor the KSK rollover in the root zone in the Root Canary project.
Timing issues with rollovers
A failed key rollover or algorithm rollover can have a particularly serious impact on zone availability. One of the biggest challenges for operators is deciding when to add the new keys and withdraw the old ones.
That is because of the caching behaviour of recursive resolvers. For example, a resolver may have cached a signature created with the old key, but not the key itself. If the resolver then tries to validate the signature, it has to fetch the key again. If the operator has withdrawn the old key too early, then the resolver will only receive the new key. As a consequence, it will fail to validate the signature in its cache.
That scenario is realistic: in the early days of DNSSEC at .nl, we made the mistake of withdrawing an old key too early, causing issues with validating resolvers.
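To make the timing issue concrete, the sketch below (our illustration, not part of the rollover tooling; the zone name and dnspython usage are assumptions) fetches a zone's published DNSKEY RRset and the signatures over its SOA record, and checks whether every signature still has a matching published key. A signature whose key tag no longer matches any DNSKEY is exactly the situation a resolver with a stale cache ends up in.

```python
# Hypothetical check: do the RRSIGs currently served for a zone's SOA still
# have a matching DNSKEY published? (dnspython >= 2.0)
import dns.resolver
import dns.dnssec
import dns.rdatatype
import dns.flags

ZONE = "se."   # example zone

resolver = dns.resolver.Resolver()
resolver.use_edns(0, dns.flags.DO, 4096)   # set the DO bit so RRSIGs are returned

# Key tags of all DNSKEYs currently published in the zone
dnskeys = resolver.resolve(ZONE, "DNSKEY")
published_tags = {dns.dnssec.key_id(key) for key in dnskeys}

# Signatures covering the zone's SOA record
response = resolver.resolve(ZONE, "SOA").response
for rrset in response.answer:
    if rrset.rdtype == dns.rdatatype.RRSIG:
        for sig in rrset:
            status = "key published" if sig.key_tag in published_tags else "KEY MISSING"
            print(f"RRSIG with key tag {sig.key_tag} (algorithm {sig.algorithm}): {status}")
```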
DS replacement
To avoid such problems, a rollover is divided into several distinct stages, which RFC 6781 describes in detail. One of the most crucial stages in the .se algorithm rollover was the moment when the DS needed to be replaced in the root. A failure at that point in the rollover could render every .se domain effectively unavailable for validating resolvers.
The .se operators had to make sure that the new DS was distributed correctly to all the root name servers. They also had to give resolvers enough time to pick up the new DS before withdrawing the old keys.
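A basic version of such a check is sketched below with dnspython: it compares the DS records served by a root server with the key-signing keys served by one of the TLD's own name servers. The server choices, the digest-algorithm mapping and the digest comparison are our illustrative assumptions, not IIS's actual procedure.

```python
# Hypothetical consistency check: does every DS for the TLD in the root match a
# key-signing key published by the TLD itself? (dnspython >= 2.0)
import dns.message
import dns.query
import dns.resolver
import dns.rdatatype
import dns.dnssec

TLD = "se."
ROOT_SERVER = "193.0.14.129"                                     # k.root-servers.net
TLD_SERVER = dns.resolver.resolve("a.ns.se.", "A")[0].address    # one .se auth server

DIGEST_NAMES = {1: "SHA1", 2: "SHA256", 4: "SHA384"}

def fetch_rrset(server, name, rdtype):
    query = dns.message.make_query(name, rdtype, want_dnssec=True)
    answer = dns.query.udp(query, server, timeout=5).answer
    wanted = dns.rdatatype.from_text(rdtype)
    return next(rrset for rrset in answer if rrset.rdtype == wanted)

ds_rrset = fetch_rrset(ROOT_SERVER, TLD, "DS")
dnskey_rrset = fetch_rrset(TLD_SERVER, TLD, "DNSKEY")
ksks = [key for key in dnskey_rrset if key.flags & 0x0001]       # SEP bit: key-signing keys

for ds in ds_rrset:
    digest_name = DIGEST_NAMES[ds.digest_type]
    match = any(dns.dnssec.make_ds(TLD, key, digest_name).digest == ds.digest
                for key in ksks)
    print(f"DS key tag {ds.key_tag}, algorithm {ds.algorithm}: "
          f"{'matches a KSK' if match else 'NO MATCHING KEY'}")
```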
In the .se rollover, the DS was replaced in the root zone on 14 December 2017 at around 17:30 UTC. From our RIPE Atlas vantage points, we observed a publication delay of around 10 minutes at the root. After that, every root site responded with the new DS (see Figure 1).
The TTL of the DS is 1 day and, as expected, after 24 hours 99 per cent of the measured resolvers had the new DS in their caches (see Figure 2). However, some resolvers kept the old DS in their caches for longer; the maximum propagation delay we observed was 48 hours (see Figure 3). Similar behaviour had previously been observed by my colleague Giovane.
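The sketch below shows how a comparable measurement could be set up with the ripe.atlas.cousteau library: worldwide probes query their own resolvers for the DS of .se every 15 minutes, so cache propagation can be followed over time. The API key, probe selection and interval are placeholders, not the exact configuration we used.

```python
# Hypothetical RIPE Atlas measurement: probes ask their configured resolvers
# for the DS of se., to follow propagation of the new DS through caches.
from datetime import datetime
from ripe.atlas.cousteau import Dns, AtlasSource, AtlasCreateRequest

ATLAS_API_KEY = "YOUR_ATLAS_API_KEY"        # placeholder

ds_measurement = Dns(
    af=4,
    description="DS propagation for se.",
    query_class="IN",
    query_type="DS",
    query_argument="se",
    use_probe_resolver=True,                # query the probe's configured resolvers
    set_do_bit=True,                        # request DNSSEC records
    udp_payload_size=4096,
    interval=900,                           # every 15 minutes
)

probes = AtlasSource(type="area", value="WW", requested=500)

is_success, response = AtlasCreateRequest(
    start_time=datetime.utcnow(),
    key=ATLAS_API_KEY,
    measurements=[ds_measurement],
    sources=[probes],
    is_oneoff=False,
).create()

print(is_success, response)
```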
Any failing resolvers?
The important question, however, is: did we observe any failing resolvers during this stage of the rollover?
To answer that question, we continuously validated signed .se domains during the whole period of the rollover (a minimal sketch of the classification follows the list below). Figure 4 shows the share of monitored resolvers that were:
resolving and validating .se domains successfully (secure)
resolving .se domains successfully but not validating them (insecure)
failing to resolve .se domains (bogus)
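The classification itself is straightforward. The sketch below shows the idea for a single resolver that can be queried directly; the real measurements were of course carried out through the RIPE Atlas probes and Luminati nodes, and the test name and resolver address here are only illustrative.

```python
# Hypothetical secure/insecure/bogus classification of a single resolver,
# based on the RCODE and the AD (Authenticated Data) flag in its answer.
import dns.message
import dns.query
import dns.flags
import dns.rcode

def classify(resolver_ip, qname="iis.se.", qtype="A"):
    query = dns.message.make_query(qname, qtype, want_dnssec=True)
    try:
        response = dns.query.udp(query, resolver_ip, timeout=5)
    except Exception:
        return "no answer"

    if response.rcode() == dns.rcode.SERVFAIL:
        # A validating resolver that cannot build a chain of trust returns SERVFAIL.
        return "bogus"
    if response.rcode() != dns.rcode.NOERROR:
        return "other error"
    if response.flags & dns.flags.AD:
        # The AD flag indicates that the resolver validated the answer.
        return "secure"
    return "insecure"

print(classify("9.9.9.9"))     # example public resolver
```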
And the good news is: we didn’t see an increase in failing resolvers during the replacement of the DS in the root. In fact, we didn’t observe an increase in failing resolvers at ANY stage of the rollover.
We may therefore conclude that the .se algorithm rollover was a success. Each stage of the rollover was carried out correctly, with enough time for new records to be published and to propagate. The zone stayed secure at all times and end users were not affected.
It is worth noting that the measurements illustrated in Figure 4 would also have picked up other issues, such as publication of the incorrect DS in the root or the use of incorrect signatures. We didn’t observe any such failures, however.
Making rollovers a bit less scary
On the basis of our observations of the .se rollover, we have developed a measurement methodology that other operators can follow to monitor their algorithm rollovers and other key rollovers. Following the methodology gives operators more insight into the rollover and enables decisions to be made with greater confidence at each stage of the process, so that the zone remains available throughout.
We are also developing a small open-source tool that operators can use to easily set up the necessary measurements. It will use RIPE Atlas probes via the RIPE Atlas API.
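As a rough indication of the kind of result retrieval the tool will need, the snippet below pulls the results of an existing measurement with ripe.atlas.cousteau and parses them with ripe.atlas.sagan. The measurement ID is a placeholder and the snippet is not the tool itself.

```python
# Hypothetical result retrieval: fetch the raw results of a RIPE Atlas DNS
# measurement and print the response code reported by each probe.
from ripe.atlas.cousteau import AtlasResultsRequest
from ripe.atlas.sagan import DnsResult

MEASUREMENT_ID = 12345678                    # placeholder measurement ID

is_success, raw_results = AtlasResultsRequest(msm_id=MEASUREMENT_ID).create()

if is_success:
    for raw in raw_results:
        result = DnsResult(raw)
        for response in result.responses:
            if response.abuf:                # parsed answer buffer, if present
                print(result.probe_id, response.abuf.header.return_code)
```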
Detailed descriptions of the methodology and tool will be available on this blog in the next couple of weeks.