DNS2Vec: applying representation learning to DNS data
Exploratory research confirms promising internet security applications
The original blog is in Dutch; this is the English translation.
Representation learning is a technique for concise and computer-friendly data representation. It has been the basis of numerous machine learning successes over the last 10 years, but it's largely unknown in the DNS community. In 2023, therefore, we started the DNS2Vec project to assess the technique's potential added value for the DNS. This blog describes an initial exploratory study in which we applied representation learning to resolvers. Our findings show that the representations produced describe the resolvers' query behaviour accurately and are suitable for use in machine learning tasks. We will therefore extend our research into representation learning in 2024.
SIDN Labs makes frequent use of machine learning methods to boost the security and resilience of .nl. A good example is RegCheck, for which we use machine learning to estimate the probability that a newly registered domain name will be used for a malicious purpose.
One of the main requirements for using machine learning is the availability of an informative, machine-readable description of the concept to be considered. In the context of the DNS, that concept might, for example, be "suspect domain name registration" (for RegCheck) or "resolver that looks up .nl domain names" (e.g. for .nl name server optimisation). Concept descriptions used for machine learning are often referred to as representations.
Representation methods can be divided into 2 broad groups. The first is knowledge-driven: a representation is generated from multiple features manually defined by an expert. Each of the features needs to be informative regarding the concept under consideration. For example, the inclusion in a domain name of a word often associated with abuse is an informative feature in relation to the concept "suspect domain name registration". The knowledge-driven approach is often used here at SIDN Labs, and is the most familiar approach within the DNS community.
The second approach, widely known as representation learning or feature learning, is data-driven. It involves using an algorithm to extract features from a body of data, rather than getting an expert to define them manually. In combination, the extracted features form an informative representation of a concept but are abstract and difficult for humans to interpret.
Over the last 10 years, representation learning has grown in popularity, because the extracted features often define concepts better than features defined using the knowledge-driven approach. They also tend to lend themselves to use in multiple tasks. Nevertheless, representation learning is relatively unfamiliar within the DNS community.
We expect that representation learning can also increase the effectiveness of machine learning when applied to DNS data. In 2023, we therefore started a project called DNS2Vec. Its aim is to explore the possibility of applying representation learning to DNS query data to extract data-driven representations of DNS concepts. We are also investigating whether the representations are suitable for use in the context of machine learning tasks such as classification and regression.
In this blog, we focus on recursive resolver representations and on simple machine learning tasks, with a view to efficiently determining whether representation learning has added value for SIDN Labs and the wider DNS community. The possible DNS2Vec applications we envisage include the optimisation of our .nl name servers and the detection of abnormal resolvers and hacked domain names.
As its title reflects, the DNS2Vec project makes use of Word2vec, a proven technique published by Google researchers about 10 years ago. Word2vec is able to extract semantic word representations by analysing large corpora of textual data. When it appeared, Word2vec was a major breakthrough, because it opened the way for more effective estimation of semantic similarities between words by means of machine learning.
Encouraged by the successful use of Word2vec with textual data, various researchers and companies have since started experimenting with using it in other contexts. For example, Spotify is using the technique to extract representations of music tracks. The representations are then used to make recommendations and automatically generate playlists.
To learn more about Word2vec, see the original publication or Jay Alammar's accessible explanation. In this blog, we'll confine ourselves to a brief description of Word2vec's input and output, and how we translate it to the DNS query data we store in ENTRADA. After that, we'll look at the extracted representations and how we can use them.
As input, Word2vec uses a large corpus of sentences. The resulting output is a representation consisting of abstract features for each unique word that appears in the sentences. The representations are based on the assumption that words have a similar meaning if they are often used in the same context. So, for example, 'chair' and 'sofa' will have similar representations because both tend to be used in sentences such as "I'm sitting on a …"
We can also use Word2vec to generate representations of recursive resolvers, where, instead of consisting of words, the input 'sentences' are made up of IP addresses. For each domain name, we create a 'sentence' out of the IP addresses of the recursive resolvers that have looked up the domain name. Just as a conventional Word2vec application makes an assumption about the meanings of words, we assume that resolvers are similar if they frequently look up the same domain names.
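To make that translation concrete, the minimal sketch below shows how such 'sentences' could be built from raw query data. The query log format, domain names and IP addresses are hypothetical stand-ins; in our case the records come from ENTRADA.

```python
from collections import defaultdict

# Hypothetical query log: (queried domain name, IP address of the querying
# resolver). In our study, these records come from ENTRADA.
query_log = [
    ("example.nl", "192.0.2.10"),
    ("example.nl", "198.51.100.7"),
    ("another-example.nl", "192.0.2.10"),
]

# One 'sentence' per domain name, listing the resolvers that looked it up.
sentences = defaultdict(list)
for domain, resolver_ip in query_log:
    sentences[domain].append(resolver_ip)

corpus = list(sentences.values())  # the input for Word2vec
```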
In order to generate recursive resolver representations, we applied Word2vec to DNS query data gathered in the period 10 to 12 September 2023. Both successful and unsuccessful DNS queries were included in our corpus. Domain names and resolvers with fewer than 15 queries a day were excluded, because we thought it unlikely that reliable representations could be generated for them. We also used downsampling to make the dataset more manageable, and to prevent popular domain names having an unduly large influence on the representations. We set the threshold for downsampling at 1,500 queries per domain name per day, resulting in 5.4% of the domain names being downsampled.
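The sketch below illustrates roughly how those thresholds could be applied to one day of query data. It is a simplified illustration, not our exact procedure; the function name and log format are assumptions.

```python
import random
from collections import Counter, defaultdict

MIN_QUERIES_PER_DAY = 15        # exclusion threshold for domains and resolvers
MAX_QUERIES_PER_DOMAIN = 1500   # downsampling cap per domain name per day

def filter_and_downsample(query_log):
    """Apply the thresholds above to one day of (domain, resolver) records."""
    domain_counts = Counter(domain for domain, _ in query_log)
    resolver_counts = Counter(resolver for _, resolver in query_log)

    # Keep only queries whose domain and resolver both occur often enough.
    kept = [(d, r) for d, r in query_log
            if domain_counts[d] >= MIN_QUERIES_PER_DAY
            and resolver_counts[r] >= MIN_QUERIES_PER_DAY]

    # Cap popular domains at a random sample of MAX_QUERIES_PER_DOMAIN queries.
    per_domain = defaultdict(list)
    for d, r in kept:
        per_domain[d].append(r)

    downsampled = []
    for d, resolvers in per_domain.items():
        if len(resolvers) > MAX_QUERIES_PER_DOMAIN:
            resolvers = random.sample(resolvers, MAX_QUERIES_PER_DOMAIN)
        downsampled.extend((d, r) for r in resolvers)
    return downsampled
```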
Finally, we configured Word2vec to extract representations containing 30 features, otherwise keeping the default settings. In total, Word2vec processed more than 10 million unique domain names and 999,384 unique resolvers.
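We don't describe our exact tooling in this blog, but a minimal sketch of this step using the gensim library's Word2Vec implementation could look like the following. The tiny corpus and IP addresses are placeholders so the snippet runs on its own.

```python
from gensim.models import Word2Vec

# 'corpus' holds the per-domain resolver sentences built earlier; a tiny
# stand-in is included here so the snippet is self-contained.
corpus = [
    ["192.0.2.10", "198.51.100.7", "203.0.113.5"],
    ["192.0.2.10", "203.0.113.5"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=30,  # 30 abstract features per resolver
    min_count=1,     # rare resolvers were already filtered out upstream
    workers=4,
)

vector = model.wv["192.0.2.10"]  # the learned 30-dimensional representation
print(vector.shape)              # (30,)
```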
In the remainder of this blog, we consider the resolver representations we extracted using Word2vec – in particular, whether the representations are suitable for use in the context of machine learning tasks such as classification and regression. We start with 2 visual analyses to establish whether the representations are informative. In other words, does Word2vec generate similar representations of resolvers whose query behaviour is similar, and distinct representations of resolvers with distinct behaviour?
Figure 1 visualises Word2vec's output for 4 recursive resolvers. Each resolver representation is made up of 30 numbers, which together describe the query behaviour of the resolver in question. Each of the numbers is the value of a feature, automatically extracted by Word2vec. As indicated earlier, the features are abstract and do not lend themselves to human interpretation, but should be informative.
Figure 1: Representations of 4 recursive DNS resolvers. Each representation is made up of 30 abstract features, which together describe the query behaviour of the resolver in question.
The top 2 rows in figure 1 show 2 resolvers whose representations are almost identical, implying that Word2vec has learned that the resolvers display very similar query behaviour. That inference is correct, because the 2 resolvers belong to the same application, OpenINTEL, which scans the entire .nl zone on a daily basis for DNS monitoring purposes. That constitutes good anecdotal evidence that our representations are informative.
The lower 2 rows show a DMAP resolver (third row) and a Cloudflare resolver (fourth row). DMAP is our crawler, which visits all .nl websites once a month. The Cloudflare resolver is part of the 1.1.1.1 public open resolver and represents end users. These 2 representations are strikingly different from each other and from the OpenINTEL resolver representations. Given that we know the resolvers differ in their query behaviour, this is again good anecdotal evidence that our representations are informative.
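Such visual comparisons can also be quantified, for example with the cosine similarity between two representation vectors. The sketch below uses random stand-in vectors rather than real resolver representations; in practice the vectors would be taken from the trained model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two resolver representations."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 30-dimensional representations; in practice they would be
# looked up via model.wv[resolver_ip].
rng = np.random.default_rng(0)
resolver_a = rng.normal(size=30)
resolver_b = resolver_a + rng.normal(scale=0.05, size=30)  # near-identical behaviour
resolver_c = rng.normal(size=30)                           # unrelated behaviour

print(cosine_similarity(resolver_a, resolver_b))  # close to 1
print(cosine_similarity(resolver_a, resolver_c))  # much lower
```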
Our second visual analysis focuses on the 22,000 or so resolvers we observed in Google's autonomous system. For this analysis, we use t-SNE to reduce the 30 features to just 2 in order to facilitate plotting of the resolvers. The result of the analysis is presented in figure 2, where each dot is one Google resolver.
Figure 2: 22k resolvers in Google's AS, with the 30 features representing a resolver reduced to 2 using t-SNE.
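The dimensionality reduction itself takes only a few lines; a minimal sketch using scikit-learn's t-SNE implementation is shown below. The matrix of representations is a random stand-in for the real Google resolver vectors.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the ~22k Google resolver representations
# (rows = resolvers, columns = the 30 Word2vec features).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2000, 30))

# Reduce the 30 features to 2 for plotting, as in figure 2.
coords_2d = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
print(coords_2d.shape)  # (2000, 2)
```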
Figure 2 shows the 2 categories of Google resolver that we use. The first category is resolvers constituting Google's 8.8.8.8 public resolver, which are shown in green. All other Google resolvers are shown in blue. It's immediately apparent that the 8.8.8.8 resolvers are concentrated in the upper right-hand region of the chart. Hence, on the basis of just 2 features, we can readily distinguish between the 2 categories of Google resolver. We therefore have further anecdotal evidence that our representations are informative.
It's also exciting to see that the blue and green dots form clear clusters, implying that Word2vec has learned that (minor) differences in query behaviour can be observed amongst the Google resolvers. We have not yet fully analysed the clustering, but one hypothesis is that the clusters correspond to groups of resolvers located in particular countries or data centres.
Figures 1 and 2 show that our resolver representations are informative. The next step is to consider whether they are suitable for machine learning tasks. To do that, we first classify the countries where the recursive resolvers are located, on the basis of the representations obtained from Word2vec.
Our analysis is confined to resolvers from the 15 countries that send the most queries to the .nl name servers. Of those resolvers, 70% are assigned to the training set, and 30% are used to generate test results. We then train a simple k-nearest neighbours (k-NN) classifier to predict a resolver's country on the basis of the 30 features in the resolver representations.
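A minimal sketch of this setup with scikit-learn is shown below. The feature matrix and country labels are random stand-ins; in the study, the features come from Word2vec and the labels cover 15 countries.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# Stand-in data: X holds 30-feature resolver representations, y the country
# label of each resolver (in the study: the 15 countries sending most queries).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 30))
y = rng.choice(["NL", "DE", "US"], size=10_000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # 70% training, 30% test

clf = KNeighborsClassifier().fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
print(accuracy, macro_f1)
```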
The k-NN classifier predicts the country correctly for 77% of the resolvers in the test set, giving a macro-averaged F1 score of 60%. That's a very good result, particularly considering that k-NN is a simple method and that we didn't optimise the process.
Figure 3 shows the results for each country. The rows indicate the actual country, while the columns show the predicted country. Ideally, the prediction should be correct in all cases. The majority of errors involve resolvers mistakenly predicted to be in Germany (DE) or the US.
Figure 3: Confusion matrix for the k-NN classifier used to predict the countries where resolvers are located.
All things considered, we view the results as positive. Our provisional conclusion is that the resolver representations are suitable for use in machine learning tasks.
We can assess the robustness of that provisional conclusion by testing the representations in the context of a second task, namely predicting the volume of queries a resolver will send to our .nl name servers. That information is needed for purposes such as scaling our DNS infrastructure.
The test involves training a k-NN regressor. In contrast to the previous test, all the resolvers are used. The k-NN regressor achieves an R² score of 0.88. The R² score has a maximum of 1 (higher is better) and reflects the fraction of the variance in the query volumes that is explained by the trained model. A score of 0.88 is therefore a very good initial result.
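The regression setup mirrors the classification sketch above; a minimal version with scikit-learn is shown below. The features and query volumes are synthetic stand-ins, included only to make the snippet self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# Stand-in data: X holds 30-feature resolver representations, y the query
# volume each resolver sent to the .nl name servers.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 30))
y = np.abs(X[:, 0]) * 1_000 + rng.normal(scale=50, size=10_000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

reg = KNeighborsRegressor().fit(X_train, y_train)
print(r2_score(y_test, reg.predict(X_test)))  # the study reports 0.88
```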
Figure 4 shows the results in more detail. You can see that the blue dots follow the direction of the green line. This implies a positive correlation between the predicted values and the true values. However, it's also clear that many blue dots are a long way from the green line, which means the k-NN regressor's predictions aren't very accurate.
Figure 4: Spread of the k-NN regressor's resolver query volume predictions.
In the context of this exploratory study, we regard the correlation as more significant than the inaccuracy. That's because the correlation shows that the resolver representations are suitable for regression tasks, while the inaccuracy can probably be addressed by optimising the procedure or by adopting a more complex method.
Our exploratory study involved applying a proven technique called Word2vec to DNS data in order to extract data-driven representations of recursive resolvers. The representations were then tested by performing 2 visual analyses and 2 machine learning tasks. The results show that the representations describe recursive resolvers' query behaviour well, and are suitable for use in machine learning tasks.
We will therefore extend our research into representation learning in 2024. For example, we plan to use representation learning as the basis for an operational machine learning application, such as the detection of abnormal resolvers and hacked domain names. We also want to improve the representations by optimising Word2vec and experimenting with other methods. Another line we intend to follow is the application of representation learning to other DNS concepts, such as domain names. Finally, we'll be investigating the possibility of making representations available for other researchers to use in their work.
Got feedback on our work or ideas about how we might use representation learning within the DNS? Drop a line to thymen.wabeke@sidn.nl.