Thesis: Detecting malicious .nl registrations using representation learning

A novel approach to detecting malicious registrations by embedding DNS query data

Concept of malicious domain name detection

Detecting malicious domain name registrations based on DNS query data is difficult because the data is sparse and the signals are weak. In our previous blog post, we showed that applying word embeddings to DNS data is a promising approach. In this post, I summarise the findings of my internship at SIDN Labs, during which I examined the viability of using word embeddings to detect maliciously registered .nl domain names. The details are in my master’s thesis.

Use of embeddings for DNS appears promising

In our January 2024 blog, we reported on our first experiments with the use of so-called “word embeddings” to aid in the detection of potentially malicious domain names. Word embeddings are generated using representation learning methods, which are designed to capture a word’s meaning from the words it frequently co-occurs with. Similarly, we can represent domain names as short numerical vectors based on the resolvers that query them.

In our first experiments, we found that the resulting embeddings can indeed be used to tackle various domain name-related questions. We expect that the query behaviour associated with a domain name can be indicative of its maliciousness. It may therefore be possible to detect maliciously registered domain names early on the basis of this representation.

Detecting malicious .nl domain names based on DNS embeddings

My research project was therefore intended to determine the extent to which the use of domain embeddings created from DNS query data is a viable mechanism for detecting malicious .nl domain name registrations. More specifically, we hope that this method is able to detect malicious domains before the existing detection methods used by SIDN. Our goal is to deploy this method in production systems at SIDN.

We consider a domain registration to be malicious if the domain is flagged for malicious activities within 30 days of registration. Malicious activities come in many different types, such as hosting a phishing, scam or malware distribution website or controlling a command-and-control server.

My approach relies upon the assumption that the traffic patterns associated with malicious domain registrations are distinct from those associated with benign domains. If our approach is successful at detecting malicious domain registrations, we may conclude that the assumption holds, since the classifier uses only DNS traffic to make its predictions.

Datasets used in my study

For this study, we considered all .nl domain name registrations between March and August of 2024.

Netcraft

We used the Netcraft dataset as ground truth for the maliciousness of each domain. Netcraft is a commercial dataset that contains abuse reports for .nl domain names, which SIDN’s support team as well as researchers at SIDN Labs use for abuse mitigation.

ENTRADA

ENTRADA is a dataset that stores DNS query data relating to .nl domain names. From this data source, we collected all the queries concerning every domain that was newly registered within our selected date range, for the period up to 10 days after registration.

The data used for the final classifier relates to all domains registered between 1 March 2024 and 1 July 2024. The training and testing data that we have is highly imbalanced, with only a fraction of all registrations being malicious. The training data contains 259,317 domain registrations, 509 of which are malicious (0.2%). The test data contains 49,358 domain registrations, with 66 (0.1%) being malicious. Detecting these malicious registrations is a needle-in-a-haystack problem, making it very hard for models to learn and perform.

Our detection method based on DNS embeddings

Our proposed method involves two main steps: embedding domains and classifying domains.

Step 1: embedding domains

We use Doc2Vec to create embeddings of domains based on the resolver IP addresses querying them. Doc2Vec normally embeds a document based on its words, which in our case translates to embedding a domain name based on the resolvers that query it. More specifically, the domain name takes the role of the document ID, and the list of resolvers takes the role of the sentence used to train the embedder. This is shown in Figure 1.


Figure 1. Representation of a domain name based on the resolvers that query it.

In the example in Figure 1, we collect the list of resolvers that have queried domain X and feed it to the trained Doc2Vec model. The result is an embedding of domain X based on the resolvers that have queried it. The numbers in the embedding of domain X are not meaningful to a human observer, but they capture characteristics that distinguish one domain name from another in terms of the resolvers that query it.

Step 2: classifying domains

We feed the resulting embeddings into a classifier to determine the likelihood of a .nl domain being maliciously registered. The classifier predicts whether a newly registered domain is malicious or benign solely on the basis of the domain’s embedding, as shown in the figure below.


Figure 2. Classifying a domain’s embedding.
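This post does not name the classifier used, so the sketch below uses a random forest with class weighting purely for illustration; the embeddings and labels are synthetic stand-ins for the Doc2Vec output and the Netcraft ground truth.

```python
# Step 2 sketched with scikit-learn: classify domains from their embeddings.
# The data is synthetic; a random forest is an illustrative choice, not
# necessarily the classifier used in the thesis.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 16))   # one 16-dim embedding per domain
y_train = np.zeros(200, dtype=int)     # mostly benign registrations ...
y_train[:4] = 1                        # ... with a handful of malicious ones

# class_weight="balanced" counteracts the extreme class imbalance.
clf = RandomForestClassifier(n_estimators=100,
                             class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)

# Score a newly registered domain: the probability of it being malicious.
new_embedding = rng.normal(size=(1, 16))
score = clf.predict_proba(new_embedding)[0, 1]
```

Returning a probability rather than a hard label is what makes the threshold-based evaluation and the registry integration discussed below possible.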

Classifier results

The training of the embedder takes a few hours, while the training of the classifier only takes a few minutes. The prediction of a new domain registration happens in a fraction of a second once enough queries have been collected.

One fact that is important to keep in mind while analysing the results is that no dataset is perfect. It is possible that our ground truth (Netcraft) has inaccuracies in both the training and testing data, such as false positives and false negatives. One possible effect of this is that our classifier might correctly detect a malicious registration which is considered benign by Netcraft.

We use a precision-recall curve to assess the performance of our classifier, because it remains a reliable measure of performance even on highly imbalanced datasets, unlike alternatives such as the ROC curve, which can paint an overly optimistic picture in that setting. The precision-recall curve records the precision and recall of the classifier as its decision threshold varies.
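Computing such a curve with scikit-learn is straightforward. The labels and scores below are synthetic stand-ins for the classifier output; note that a random classifier’s precision-recall baseline sits at the prevalence of the positive class, which is what the random curve in Figure 3 represents.

```python
# Precision-recall curve from scikit-learn on synthetic labels and scores.
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true  = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.1, 0.2, 0.9, 0.3, 0.7, 0.15, 0.05, 0.8, 0.4, 0.25])

# One (precision, recall) point per decision threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)

# A random classifier's PR curve sits at the positive-class prevalence.
baseline = y_true.mean()
```

In this toy example the three positives receive the three highest scores, so the curve is perfect (area 1.0); a real classifier on imbalanced DNS data sits well below that.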

Figure 3 shows the precision-recall curve of our classifier, together with the curves of an imaginary random classifier and a perfect one. The figure clearly shows a significant difference between the curve of the random classifier and that of our classifier, which indicates that the domain name embeddings carry information relevant to whether a domain name is malicious. Hence, the classifier is to a certain extent able to predict the maliciousness of a registration solely from the resolver IP addresses that query it. This supports our assumption that malicious domains have distinct traffic patterns, and shows that the embedder and classifier are able to pick up those patterns and use them for prediction.


Figure 3. Precision-recall curves of our classifier, compared with random and perfect classifiers.

Detection speed

Now that we have a list of correctly classified malicious registrations, we can analyse whether the classifier is able to detect the malicious domains before Netcraft does by displaying their detection timeline (see Figure 4).


Figure 4. Detection time of maliciously registered domain names (ours vs Netcraft).

Figure 4 shows how many hours it takes for our classifier (green) and Netcraft (red) to detect the malicious domains after registration. The classifier was able to detect malicious registrations on average 18 hours sooner than Netcraft, with the largest difference between the detection by the classifier and detection by Netcraft being 33 hours.
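The per-domain comparison behind Figure 4 boils down to simple timestamp arithmetic. The timestamps below are hypothetical, chosen only so that the lead time happens to match the 18-hour average reported above.

```python
# Hours from registration to detection, for our classifier vs the abuse feed.
# All timestamps are made up for illustration.
from datetime import datetime

registered        = datetime(2024, 7, 1, 9, 0)
detected_by_model = datetime(2024, 7, 1, 15, 0)   # hypothetical
detected_by_feed  = datetime(2024, 7, 2, 9, 0)    # hypothetical

hours_model = (detected_by_model - registered).total_seconds() / 3600
hours_feed  = (detected_by_feed - registered).total_seconds() / 3600
lead_time   = hours_feed - hours_model  # how much earlier the classifier was

print(hours_model, hours_feed, lead_time)  # 6.0 24.0 18.0
```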

Use at SIDN and other registries

One way this method could be integrated within a DNS registry for use in malicious domain detection is by having the classifier simply return the registration’s score (the probability of the registration being malicious). This probability could then be added to a table of the most recent registrations, along with data obtained by other means, such as RegCheck. That would allow the registry’s staff to make more informed decisions about which domain name registrations to investigate, with the score serving as an additional metric.

The resulting risk score, or even the domain name embeddings themselves, could also be included as a feature in another detection method, such as RegCheck, improving the overall classification performance of the system. This idea will be tested in the coming year.

Conclusions

Our research demonstrates that malicious .nl domain name registrations can be detected by using only the resolver IP addresses from their DNS traffic. Using domain embeddings to classify newly registered malicious domains is not only feasible but could also prove to be a valuable addition to existing detection methods such as RegCheck. The embedder is able to do this in a relatively short space of time, making deployment in production feasible.

When applied to the test dataset of July 2024, there were a total of 49,358 registrations, of which 66 were malicious according to Netcraft. Our classifier predicted 33 domains as malicious, of which 12 were indeed malicious (a precision of about 36%, at a recall of about 18%). Therefore, this method is not sufficiently accurate to deploy on its own. However, as stated earlier, it can add to the performance of existing detection methods or provide additional information for support employees looking at cases individually.

Future work

There are many ways that the methodology described in my work could be explored further. For example, instead of only doing one prediction after 2,000 queries, we could train multiple classifiers in order to detect malicious domains at different stages (such as after 50, 250, 1,000, and 2,000 queries).

Another way performance could be significantly improved is by incorporating domain name-specific features into the classifier. Examples include the registrant and their reputation, the RegCheck score, and the average number of queries and maximum number of DNS queries within a given time span. The classifier would not then have to rely solely on the domain’s traffic pattern, but could use it as additional information to make more informed decisions.

The detection of compromised domains might also be possible using these methods, since embeddings are able to capture the complex traffic patterns associated with domains. One possible way of doing that would be to analyse the evolution of each domain over time. If we have multiple embeddings of the same domain over a certain period, we might be able to detect anomalies in its traffic patterns. The drawback of that approach is that it would be significantly more complex, as it would require embedding all existing domains every day and then comparing each domain’s successive embeddings.

Epilogue

This project has been a challenging and valuable experience. I have learned a lot throughout it, and I highly appreciate the opportunity I was given by SIDN, as well as the continued guidance by Thymen, Thijs and Giovane. I hope SIDN Labs will continue research on this topic as it is very interesting, and I believe that it shows a lot of promise.