Thesis: Detecting malicious .nl registrations using representation learning
A novel approach to detecting malicious registrations by embedding DNS query data
Detecting malicious domain name registrations based on DNS query data is difficult, because the data is sparse and the signals are weak. In our previous blog post, we showed that the use of word embeddings on DNS data is a promising approach. In this post, I summarise the findings of my internship at SIDN Labs, during which I examined the viability of using word embeddings to detect maliciously registered .nl domain names. The details are in my master’s thesis.
In our January 2024 blog, we reported on our first experiments with the use of so-called “word embeddings” to aid in the detection of potentially malicious domain names. Word embeddings are generated using representation learning methods, which capture the meaning of a word based on the words that it often co-occurs with. Similarly, we can represent domain names as short numerical vectors based on the resolvers that query them.
In our first experiments, we found that the resulting embeddings can indeed be used to tackle various domain name-related questions. We expect that the query behaviour associated with a domain name can be indicative of its maliciousness. So it may be possible to detect maliciously registered domain names early based on this representation.
My research project was therefore intended to determine the extent to which the use of domain embeddings created from DNS query data is a viable mechanism for detecting malicious .nl domain name registrations. More specifically, we hope that this method is able to detect malicious domains before the existing detection methods used by SIDN. Our goal is to deploy this method in production systems at SIDN.
We consider a domain registration to be malicious if the domain is flagged for malicious activities within 30 days of registration. Malicious activities come in many different types, such as hosting a phishing, scam or malware distribution website, or operating a command-and-control server.
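The labelling rule above can be sketched in a few lines of Python. This is an illustration, not the code used in the study: the function name, the inclusive 30-day boundary and the decision to ignore flags dated before the registration are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

# Labelling window taken from the definition above.
LABEL_WINDOW = timedelta(days=30)


def is_malicious_registration(registered_at: datetime,
                              first_flagged_at: Optional[datetime]) -> bool:
    """Return True if the domain was first flagged within 30 days of registration.

    Assumptions (not stated in the article): the boundary is inclusive, and a
    flag dated before the registration is not counted.
    """
    if first_flagged_at is None:
        return False  # never flagged: treated as benign
    delay = first_flagged_at - registered_at
    return timedelta(0) <= delay <= LABEL_WINDOW


registered = datetime(2024, 3, 1, 12, 0)
print(is_malicious_registration(registered, datetime(2024, 3, 15)))  # flagged after about two weeks
print(is_malicious_registration(registered, datetime(2024, 5, 1)))   # flagged after about two months
print(is_malicious_registration(registered, None))                   # never flagged
```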
My approach relies upon the assumption that the traffic patterns associated with malicious domain registrations are distinct from those associated with benign domains. If our approach is successful at detecting malicious domain registrations, we may conclude that that assumption is correct, since the classifier only uses DNS traffic to make its predictions.
For this study, we considered all .nl domain name registrations between March and August of 2024.
We used the Netcraft dataset as ground truth for the maliciousness of each domain. Netcraft is a commercial dataset that contains abuse reports for .nl domain names, which SIDN’s support team as well as researchers at SIDN Labs use for abuse mitigation.
ENTRADA is a dataset that stores DNS query data relating to .nl domain names. From this data source, we collected all the queries concerning every domain that was newly registered within our selected date range, for the period up to 10 days after registration.
The data used for the final classifier relates to all domains registered between 1 March 2024 and 1 July 2024. The training and testing data that we have is highly imbalanced, with only a fraction of all registrations being malicious. The training data contains 259,317 domain registrations, 509 of which are malicious (0.2%). The test data contains 49,358 domain registrations, with 66 (0.1%) being malicious. Detecting these malicious registrations is a needle-in-a-haystack problem, which makes it very hard for models to learn from the data and to perform well.
Our proposed method involves two main steps: embedding domains and classifying domains.
We use Doc2Vec to create embeddings of domains based on the resolver IP addresses querying them. Doc2Vec is normally used to embed a document based on its words, which in our case translates to embedding a domain name based on the resolvers that query it. More specifically, the domain name takes the role of the document ID, and the list of resolvers takes the role of the sentence used to train the embedder. This is shown in Figure 1.
Figure 1. Representation of a domain name based on the resolvers that query it.
In the example in Figure 1, we collect a list of resolvers which have queried domain X and feed this to the trained Doc2Vec model. The result of this is an embedding of domain X based on the resolvers that have queried it. The numbers in the embedding of domain X are not meaningful to a human observer, but they represent meaningful characteristics that distinguish one domain name from another in terms of the resolvers that query it.
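To make the mapping concrete, the sketch below groups raw query records into one "document" per domain, with the resolver IP addresses as its "words". It uses only the Python standard library and stops short of training an embedder. The record layout, the function name `build_documents` and the choice to keep repeated resolvers in arrival order are assumptions made for illustration. Each resulting pair corresponds to what a Doc2Vec implementation such as gensim calls a tagged document (a word list plus a tag).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_documents(queries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (domain, resolver_ip) query records into one token list per domain.

    The domain name plays the role of the document ID; the resolver addresses
    play the role of the words in that document.
    """
    documents: Dict[str, List[str]] = defaultdict(list)
    for domain, resolver_ip in queries:
        documents[domain].append(resolver_ip)
    return dict(documents)


# Hypothetical query log using documentation-only IP ranges.
queries = [
    ("example-a.nl", "192.0.2.10"),
    ("example-b.nl", "198.51.100.7"),
    ("example-a.nl", "203.0.113.5"),
    ("example-a.nl", "192.0.2.10"),
]

documents = build_documents(queries)
for domain, resolvers in documents.items():
    print(domain, resolvers)
```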
We feed the resulting embeddings into a classifier to determine the likelihood of a .nl domain being maliciously registered. The classifier predicts whether the newly registered domain is malicious or benign solely based on the domain’s embedding, as shown in Figure 2.
Figure 2. Classifying a domain’s embedding.
The training of the embedder takes a few hours, while the training of the classifier only takes a few minutes. The prediction of a new domain registration happens in a fraction of a second once enough queries have been collected.
One fact that is important to keep in mind while analysing the results is that no dataset is perfect. It is possible that our ground truth (Netcraft) has inaccuracies in both the training and testing data, such as false positives and false negatives. One possible effect of this is that our classifier might correctly detect a malicious registration which is considered benign by Netcraft.
We use a precision-recall curve to assess the performance of our classifier, because it remains informative on highly imbalanced datasets. Metrics such as the ROC curve can look overly optimistic when benign domains vastly outnumber malicious ones. The precision-recall curve plots the precision against the recall of the classifier as its decision threshold varies.
Figure 3 shows the precision-recall curve of our classifier, as well as the same curve of an imaginary random classifier and a perfect one. The figure shows a clear difference between the curve of the random classifier and that of our classifier, which indicates that the domain name embeddings carry information about whether a domain name is malicious or not. Hence, the classifier is to a certain extent able to predict the maliciousness of a registration solely based on the resolver IP addresses that query it. This supports our assumption that malicious domains have distinct traffic patterns, and suggests that the embedder and classifier are able to pick up those patterns and use them for prediction.
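As an illustration of how such a curve is built, the sketch below computes one (threshold, precision, recall) point for every distinct score. The scores and labels are made up, and the function name is hypothetical; it is not the evaluation code from the thesis.

```python
from typing import List, Tuple


def precision_recall_points(scores: List[float],
                            labels: List[int]) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each distinct score used as a threshold.

    A domain is predicted malicious when its score is >= the threshold.
    Labels are 1 for malicious and 0 for benign.
    """
    total_positives = sum(labels)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        # tp + fp >= 1 here, because the threshold is itself one of the scores.
        precision = tp / (tp + fp)
        recall = tp / total_positives if total_positives else 0.0
        points.append((threshold, precision, recall))
    return points


# Hypothetical classifier scores and ground-truth labels.
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10]
labels = [1, 0, 1, 0, 0, 0]

points = precision_recall_points(scores, labels)
for threshold, precision, recall in points:
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

At the lowest threshold every domain is flagged, so recall is 1 and precision equals the share of malicious domains in the data. That share is also the precision a random classifier achieves on average, which is why the random baseline sits so low on a dataset as imbalanced as this one.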
Figure 3. Precision-recall curves of our classifier, compared with random and perfect classifiers.
Now that we have a list of correctly classified malicious registrations, we can analyse whether the classifier is able to detect the malicious domains before Netcraft does by displaying their detection timeline (see Figure 4).
Figure 4. Detection time of maliciously registered domain names (ours vs Netcraft).
Figure 4 shows how many hours it takes for our classifier (green) and Netcraft (red) to detect the malicious domains after registration. The classifier was able to detect malicious registrations on average 18 hours sooner than Netcraft, with the largest difference between the detection by the classifier and detection by Netcraft being 33 hours.
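The comparison behind Figure 4 amounts to subtracting two detection times per domain. The sketch below shows that calculation on hypothetical values; the domain names, the numbers and the function name `lead_times` are invented for illustration and are not the study's data.

```python
from statistics import mean
from typing import Dict, Tuple


def lead_times(detections: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Map each domain to the hours gained by the classifier.

    Each value in `detections` is (classifier_hours, netcraft_hours), both
    measured from the moment of registration. A positive result means the
    classifier was earlier than Netcraft.
    """
    return {
        domain: netcraft_h - classifier_h
        for domain, (classifier_h, netcraft_h) in detections.items()
    }


# Hypothetical detection times in hours after registration.
detections = {
    "example-a.nl": (6.0, 30.0),
    "example-b.nl": (10.0, 22.0),
    "example-c.nl": (4.0, 13.0),
}

gains = lead_times(detections)
mean_gain = mean(list(gains.values()))
max_gain = max(gains.values())
print(f"mean lead: {mean_gain:.1f} h, largest lead: {max_gain:.1f} h")
```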
One way this method could be integrated within a DNS registry for use in malicious domain detection is by having the classifier simply return the registration’s score (probability of the registration being malicious). This probability could then be added to a table of the most recent registrations, along with data obtained by other means, such as RegCheck. That would allow the registry’s staff to make more informed decisions about which domain name registrations to investigate, based on the scores for an additional metric.
The resulting risk score, or even the domain name embeddings themselves, could also be included as a feature in another detection method, such as RegCheck, improving the overall classification performance of the system. This idea will be tested in the coming year.
Our research demonstrates that malicious .nl domain name registrations can be detected by using only the resolver IP addresses from their DNS traffic. Using domain embeddings to classify newly registered malicious domains is not only feasible but could also prove to be a valuable addition to existing detection methods such as RegCheck. Because training takes at most a few hours and a prediction takes a fraction of a second, the method is practical to deploy.
The test dataset of July 2024 contained a total of 49,358 registrations, of which 66 were malicious according to Netcraft. Our classifier flagged 33 domains as malicious, of which 12 were malicious according to Netcraft. Therefore, this method is not sufficiently accurate to deploy on its own. However, as stated earlier, it can add to the performance of existing detection methods or provide additional information for support employees looking at cases individually.
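The figures above translate directly into precision and recall at the chosen operating point. The short calculation below uses only the numbers reported in this post; the variable names are my own.

```python
# Figures reported for the July 2024 test set.
total_registrations = 49_358
actual_malicious = 66      # malicious according to Netcraft
predicted_malicious = 33   # flagged by the classifier
true_positives = 12        # flagged by the classifier and confirmed by Netcraft

false_positives = predicted_malicious - true_positives
false_negatives = actual_malicious - true_positives

precision = true_positives / predicted_malicious
recall = true_positives / actual_malicious

# Expected precision of flagging domains at random: the share of malicious domains.
random_baseline = actual_malicious / total_registrations

print(f"false positives: {false_positives}, false negatives: {false_negatives}")
print(f"precision: {precision:.3f}, recall: {recall:.3f}")
print(f"random baseline precision: {random_baseline:.4f}")
```

In other words, roughly 36% of the flagged domains were malicious and roughly 18% of all malicious registrations were found, against a random baseline precision of about 0.13%.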
There are many ways that the methodology described in my work could be explored further. For example, instead of only doing one prediction after 2,000 queries, we could train multiple classifiers in order to detect malicious domains at different stages (such as after 50, 250, 1,000, and 2,000 queries).
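A staged set-up like this needs a rule for deciding which classifier applies to a domain at a given moment. One possible rule, sketched below, is to use the classifier trained for the largest checkpoint the domain has already reached. The checkpoint values come from the example above; the selection rule and the function name are assumptions.

```python
from bisect import bisect_right
from typing import Optional

# Query-count checkpoints from the example above, in ascending order.
CHECKPOINTS = [50, 250, 1_000, 2_000]


def latest_checkpoint(query_count: int) -> Optional[int]:
    """Return the largest checkpoint the domain has reached, or None if it has reached none."""
    index = bisect_right(CHECKPOINTS, query_count)
    return CHECKPOINTS[index - 1] if index else None


for count in (10, 50, 300, 5_000):
    print(count, "->", latest_checkpoint(count))
```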
Another way performance could be significantly improved is by incorporating domain name-specific features into the classifier. Examples include the registrant and their reputation, the RegCheck score, and the average number of queries and maximum number of DNS queries within a given time span. The classifier would not then have to rely solely on the domain’s traffic pattern, but could use it as additional information to make more informed decisions.
The detection of compromised domains might also be possible using these methods, since embeddings are able to capture the complex traffic patterns associated with domains. One possible way of doing that would be to analyse the evolution of each domain over time. If we have multiple embeddings of the same domain over a certain period, we might be able to detect anomalies in traffic patterns. The only drawback with that approach is that it would be significantly more complex, as it would require an embedding of all existing domains for each day. For each domain we would then have to compare its different embeddings.
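One simple way to compare the embeddings of a single domain over time would be to measure the cosine distance between consecutive days and flag large jumps. The sketch below illustrates that idea; the two-dimensional vectors, the threshold of 0.5 and the function names are hypothetical, and this approach has not been evaluated in this work.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 minus the cosine similarity of two equally sized, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def drift_days(daily_embeddings: List[Sequence[float]], threshold: float) -> List[int]:
    """Return the indices of days whose embedding moved more than `threshold` from the previous day."""
    return [
        day
        for day in range(1, len(daily_embeddings))
        if cosine_distance(daily_embeddings[day - 1], daily_embeddings[day]) > threshold
    ]


# Hypothetical daily embeddings of one domain: stable on day 1, a sharp change on day 2.
daily = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
flagged = drift_days(daily, threshold=0.5)
print(flagged)
```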
This project has been a challenging and valuable experience. I have learned a lot throughout it, and I highly appreciate the opportunity I was given by SIDN, as well as the continued guidance by Thymen, Thijs and Giovane. I hope SIDN Labs will continue research on this topic as it is very interesting, and I believe that it shows a lot of promise.