tsuNAME: DNS loops are a well-known problem, but aren't properly addressed by current RFCs

Recursive resolvers should cache DNS records that are misconfigured with loops


Authors: Giovane Moura (1,2), Sebastian Castro (3), John Heidemann (4) and Wes Hardaker (4)
Affiliations: (1) SIDN Labs, (2) TU Delft, (3) InternetNZ, (4) USC/ISI

Last May, we publicly disclosed tsuNAME, a DNS vulnerability that could be exploited to mount DDoS attacks, where resolvers, clients and/or forwarders send endless queries to authoritative DNS servers. Although earlier RFCs have documented the existence of DNS name loops, none of them have fully addressed the problem. To fix that, we have proposed a new IETF draft to the DNS Operations Working Group (DNSOP WG).

DNS loops

A loop is a well-known type of configuration error in a DNS zone, documented as far back as the seminal RFC 1034 (‘Domain Names – Concepts and Facilities’, November 1987). For example, a CNAME loop can be created as follows:

.org zone file: example.org CNAME example.nl
.nl zone file: example.nl CNAME example.org

In the example above, a DNS resolver will not be able to resolve either of the domains, because each points to the other.
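
The CNAME loop above can be detected by following the chain of targets and stopping as soon as a name repeats. The snippet below is a minimal sketch of that idea, not how any particular resolver implements it. The `CNAME_RECORDS` table and the `find_cname_loop` helper are hypothetical names introduced for illustration; a real resolver would obtain the records through DNS queries rather than from a dictionary.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory view of the two zone files from the example above.
CNAME_RECORDS: Dict[str, str] = {
    "example.org": "example.nl",
    "example.nl": "example.org",
}


def find_cname_loop(name: str, records: Dict[str, str]) -> Optional[List[str]]:
    """Follow CNAME targets from `name`; return the looping chain, or None."""
    chain: List[str] = []
    seen = set()
    current = name
    while current in records:
        if current in seen:
            # We are back at a name already visited: the chain is a loop.
            return chain + [current]
        seen.add(current)
        chain.append(current)
        current = records[current]
    return None  # The chain ended at a name without a CNAME record.


print(find_cname_loop("example.org", CNAME_RECORDS))
# ['example.org', 'example.nl', 'example.org']
```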

We found that domain names configured with looping NS records can cause resolvers to send endless queries to authoritative servers, ultimately flooding the servers. We refer to this phenomenon as a tsuNAME. A real-world tsuNAME hit New Zealand's .nz authoritative servers when two misconfigured domain names, which had little traffic, caused a 50% surge in total server traffic (see Figure 1).

Graph showing a 50% increase in DNS traffic on the .nz servers during the tsuNAME incident
https://images.ctfassets.net/yj8364fopk6s/7nc6GcbOolZBH5ht1rXYqq/ea9bdb8da81648d91b71fe69ceb5206e/Fig_1_Autoritatieve_.nz-servers_zien_verkeer_met_50-_toenemen_tijdens_tsuNAME-incident.png

Figure 1: .nz authoritative servers experience 50% traffic growth during tsuNAME event caused by two domain names.

That raises the question: if loops have been known about for so long, why are they still a problem?

Past solutions

The first solution was proposed in RFC 1034, which states that CNAME loops should be ‘signalled as an error’ (Section 3.6.2). To keep resolvers from looping indefinitely in the presence of configuration errors, RFC 1034 also recommends that a resolver limit the number of queries it sends when resolving an individual domain name.

RFC 1035 (‘Domain Names – Implementation and Specification’) stipulates that resolvers should use counters to implement the limits. The later RFC 1536 (‘Common DNS Implementation Errors and Suggested Fixes’) states that ‘a set of servers might form a loop wherein A refers to B and B refers to A’. It does not, however, specify what types of record might create such a loop. Nor does it offer solutions beyond what RFC 1034 and RFC 1035 suggested.
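
The counter-based limit described in these RFCs can be sketched as follows. This is a toy model under stated assumptions, not code from any resolver: the delegation table `NS_RECORDS`, the names `cat.nl` and `dog.org`, the helper functions, and the budget of 20 queries are all hypothetical. Each step of the loop stands in for one query sent to an authoritative server.

```python
from typing import Dict, Tuple

MAX_QUERIES = 20  # Assumed per-name query budget; real resolvers choose their own.

# Hypothetical delegations that form an NS loop: each zone's name server
# lives in the other zone, so neither name server can ever be resolved.
NS_RECORDS: Dict[str, str] = {
    "cat.nl": "ns1.dog.org",
    "dog.org": "ns1.cat.nl",
}


def zone_of(host: str) -> str:
    """Return the zone a host name belongs to (toy rule: drop the first label)."""
    return host.split(".", 1)[1]


def resolve(name: str, ns_records: Dict[str, str],
            max_queries: int = MAX_QUERIES) -> Tuple[str, int]:
    """Chase delegations for `name`, giving up once the query budget is spent."""
    queries = 0
    zone = name
    while zone in ns_records:
        if queries >= max_queries:
            return "SERVFAIL", queries  # Budget exhausted: report an error.
        queries += 1  # One (simulated) query to an authoritative server.
        zone = zone_of(ns_records[zone])
    return "NOERROR", queries


print(resolve("cat.nl", NS_RECORDS))
# ('SERVFAIL', 20)
```

Without the `max_queries` check, the `while` loop would never terminate for `cat.nl`, which is the behaviour of a looping resolver.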

In short, RFC 1034, RFC 1035 and RFC 1536 describe the problem and do provide guidance to resolver implementers to help avoid infinite loops caused by misconfigured zone files with NS or CNAME loops. However, we continue to see various forms of loop problem, so in this blog we seek to further clarify that guidance.

Root causes of traffic surge

Our research paper identifies two ways in which NS loops can lead to traffic surges:

  • Looping recursive resolvers: these are resolvers that send endless queries to authoritative servers after receiving a single client query (Figure 1) targeting a domain with an NS loop. Such recursive resolvers do not conform to the guidance in RFC 1034 and RFC 1035, both of which set limits on the number of queries a resolver should send when resolving a name.

  • Looping clients, stub-resolvers, and forwarders: problems can also arise if parts of the DNS infrastructure, behind a recursive resolver, send endless queries because of NS loops. The queries reach their upstream recursive resolvers, which then send queries to authoritative servers (which may further amplify the query stream).

Current RFCs provide solutions to prevent resolvers from looping (although we found two old versions of popular DNS software that do not implement the solutions). When a client queries its recursive resolver for a name with looping NS records, the resolver should return SERVFAIL to the client. The loop is detected using counters, as specified in current RFCs.

However, that still does not prevent clients, stubs and DNS forwarders from asking the same question over and over. Suppose that a resolver is set with twenty queries as the upper limit for resolving a name before it returns SERVFAIL, as in the case of a loop. In that scenario, each new query from each client will trigger up to twenty new queries to authoritative servers. That is exactly the problem we found in the Google Public DNS implementation: the source of loops was not the Google resolver software, but the clients, which were endlessly sending the same query to the resolver.
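
To make the amplification concrete, the arithmetic can be written out. The limit of twenty comes from the scenario above; the client retry rate and duration are hypothetical numbers chosen only for illustration, and `authoritative_queries` is a helper name introduced here.

```python
QUERY_LIMIT = 20  # Upper limit per resolution, as in the scenario above.


def authoritative_queries(client_queries: int, limit: int = QUERY_LIMIT) -> int:
    """Upper bound on queries sent upstream when failures are not cached."""
    return client_queries * limit


# Hypothetical client: retries the same name 10 times per second for one hour.
client_queries = 10 * 3600
print(authoritative_queries(client_queries))
# 720000
```

A single misbehaving client can therefore generate hundreds of thousands of upstream queries even though the resolver itself never loops.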

How to fix it

To fix this problem, we recommend that recursive resolvers MUST cache DNS records that are misconfigured with loops. Once a resolver has returned SERVFAIL to a client for a looping name, all subsequent queries for that name can be answered directly from the resolver's cache. Caching thus works as a barrier between the resolver and looping clients, ultimately preventing excessive traffic from reaching authoritative servers.

How long these looping records should be cached for is an implementation choice. However, a recursive resolver MUST answer from its cache for at least fifteen minutes, given that most looping NS/CNAME record situations will require human intervention. Google Public DNS implemented that solution and, once they started caching looping records, the issue was fixed (see Figure 2).

Graph showing a drop in query volumes on Google Public DNS after caching records that cause a loop.
https://images.ctfassets.net/yj8364fopk6s/5KUbIsmYm6n5LK5EVVC1SF/9296d7231aa4b91a268416614d15c4e6/Fig_2_Oplossing_voor_tsuNAME_op_Google_Public_DNS.png

Figure 2: Fixing tsuNAME on Google Public DNS: significant drop in query volumes after caching of looping records. See our research paper for details.
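
The caching barrier can be sketched as a small wrapper around a resolver. This is an illustrative sketch only, not the Google Public DNS implementation or the text of the draft: the `LoopCache` class, its parameters, and the `looping_upstream` stand-in are hypothetical. As a simplification, it caches every SERVFAIL, whereas a real resolver would need to distinguish loop-induced failures from other errors. The fifteen-minute minimum is taken from the recommendation above.

```python
import time
from typing import Callable, Dict, List

MIN_LOOP_TTL = 15 * 60  # Seconds; the minimum caching time recommended above.


class LoopCache:
    """Answer repeated queries for failing names from cache (simplified)."""

    def __init__(self, resolve_upstream: Callable[[str], str],
                 ttl: float = MIN_LOOP_TTL,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve_upstream
        self._ttl = ttl
        self._clock = clock
        self._failed_until: Dict[str, float] = {}

    def query(self, name: str) -> str:
        now = self._clock()
        expiry = self._failed_until.get(name)
        if expiry is not None and now < expiry:
            return "SERVFAIL"  # Served from cache; nothing is sent upstream.
        rcode = self._resolve(name)
        if rcode == "SERVFAIL":
            self._failed_until[name] = now + self._ttl
        else:
            self._failed_until.pop(name, None)
        return rcode


upstream_log: List[str] = []


def looping_upstream(name: str) -> str:
    """Stand-in for a full resolution attempt that hits an NS loop."""
    upstream_log.append(name)
    return "SERVFAIL"


cache = LoopCache(looping_upstream)
for _ in range(1000):  # A looping client repeating the same query.
    cache.query("cat.nl")
print(len(upstream_log))
# 1
```

Only the first of the thousand client queries triggers a resolution attempt; the rest are absorbed by the cache until the entry expires.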

Next steps

We presented this proposal in the form of an IETF draft to the DNSOP WG last November. We have received feedback on it, and we will shortly be submitting an updated version.

For more details on tsuNAME, see our research paper and video presentation from the ACM Internet Measurement Conference (IMC 2021).