Feasibility study of automated detection of malicious .nl registrations

Three methods evaluated and their impact analysed


The original blog is in Dutch. This is the English translation.

A large proportion of the .nl domain names listed in abuse reports are registered with malicious intentions. In other words, the registrants always intended to use them for phishing, malware or the publication of other abusive content. We could make the .nl zone even more secure if we could proactively identify such malicious registrations and disable the domains in question, preferably before anyone gets scammed. In this blog, we present a scientific comparison of three candidate methods of identifying suspect registrations. Our provisional results indicate that both machine learning algorithms and static rules can be effective tools for flagging up suspect registrations. We also consider how the adoption of such methods might affect the work of SIDN's anti-abuse team.

Registration of domain names for malicious purposes remains a problem

Broadly speaking, there are two ways that malicious content is spread on the internet. One way is to hack an existing, legitimate website; the other is to register a domain name specifically for malicious purposes. The ratio between the two is difficult to determine exactly, but malicious registrations account for a significant proportion of abuse reports. Our earlier COMAR study showed that 62 per cent of detected abuse cases involved malicious registrations. Using the age of a domain name as a proxy yields a lower figure: 22 per cent of reported domain names are less than thirty days old, and it is reasonable to assume that such recently registered names were registered specifically for malicious purposes. For criminals, one of the advantages of registering their own domain names is that they can choose names that resemble the real domains whose users they want to trick. So, for example, they will often opt for domain names containing keywords linked to widely used and trusted services, with the aim of catching people out. Another advantage for a criminal is that registering a scam domain name removes the need to hack a website before they can get on with phishing, distributing malware or other abuse.

The problem: manual checks at the time of registration aren't feasible

Until now, our anti-abuse team has randomly selected suspect registrations for manual checking. We also make use of security reports issued by Netcraft, which collects phishing URLs from several blocklists and from reports by volunteers. However, manually investigating suspect registrations and identifying those registered for malicious purposes is very time-consuming. For example, checking whether a postal address really exists, or whether a business is registered with the Chamber of Commerce, can take an anti-abuse analyst fifteen minutes or more. Given that an average of 2,580 new .nl domain names a day were registered in the period January to June 2021, and that only 0.11 per cent (531) of the registrations made in that period were reported by Netcraft within thirty days, it's clear that an automated approach is required.
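As a quick back-of-the-envelope check of that percentage, assuming the period January to June 2021 covers 181 days:

\[
2{,}580 \times 181 \approx 467{,}000 \ \text{registrations}, \qquad \frac{531}{467{,}000} \approx 0.11\%
\]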

What do we need?

An algorithm capable of automatically identifying suspect registrations would be extremely useful. It would save our anti-abuse experts the onerous task of manually investigating all registrations and identifying those registered for malicious purposes. Time would no longer be wasted checking bona fide registrations, while prompt action could be taken in genuinely abusive cases, before the domain names are put to malicious use. In this article, we therefore consider whether such an algorithm could actually work in practice and, if so, what the best approach would be.

Detecting scam webshops before they open

Various people and organisations have suggested ways of identifying suspect registrations. In early 2021, one of our graduate students published a thesis on this topic. And some of our fellow registries already have systems for detecting suspect registrations: EURid (.eu) has a machine learning system, for example, and DNS Belgium (.be) uses a scoring system based on static rules to pick up potentially malicious registrations. In this article, we scientifically assess which of those approaches would be most appropriate for us, and whether other methods are available. In the period ahead, we will continue our investigations and report our findings in a later blog. It is no coincidence that the verification of registration data has recently been attracting more attention. A second Network and Information Security Directive (NIS2) has been put forward, which would require the verification of all the information recorded about registrants. The industry is waiting to learn exactly what is proposed, and whether or when the requirements will take effect. It may be that the approach we currently use -- selectively checking suspect registration data -- proves to be a good transitional solution for registries seeking to counter malicious domain name registrations.

Comparison of candidate methods

Before we can get to work on the development of detection methods, we need to define the data available to us. Because we want to identify suspect registrations as early as possible, we have to rely on the data available at the time of registration, such as the domain name itself and the registrant's postal address. Information about the website or other applications associated with the domain name does not become available until later and is therefore outside the scope of the current analysis. In this blog, we consider three candidate algorithms.

Candidate 1: scoring system based on static rules

Our first candidate is a knowledge-driven scoring system based on static rules. A scoring system is straightforward to program and its findings are easy to interpret. However, you do need to keep the rules up to date, and a scoring system easily becomes complex if the number of rules is allowed to grow unchecked. Our candidate system uses eight rules composed on the basis of our knowledge and experience of malicious registrations. The more rules a registration conforms to, the higher the score given to it, and the more suspect it is considered to be. For example, we know that a striking number of malicious registrations are made outside Dutch daytime hours, so one of our rules relates to the time of registration. Another rule relates to the use of words that frequently appear in phishing domain names. So, for example, the score increases if a domain name includes the name of the official ID system used in the Netherlands.
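By way of illustration, the sketch below shows how such a rule-based score might be computed. The rules, keyword list and point weights shown here are simplified examples invented for this blog, not the eight rules actually used by SIDN.

    from datetime import datetime

    # Hypothetical keyword list; the real rules draw on SIDN's operational knowledge.
    SUSPECT_KEYWORDS = {"digid", "login", "verify"}

    def score_registration(domain: str, registered_at: datetime,
                           registrant_country: str) -> int:
        """Return a suspicion score; the higher the score, the more suspect."""
        score = 0
        # Rule: registration made outside Dutch daytime hours (illustrative window).
        if registered_at.hour < 7 or registered_at.hour >= 19:
            score += 1
        # Rule: the domain name contains a keyword frequently seen in phishing names.
        if any(keyword in domain for keyword in SUSPECT_KEYWORDS):
            score += 1
        # Rule: registrant address outside the Netherlands (hypothetical example rule).
        if registrant_country != "NL":
            score += 1
        return score

    # A registration is flagged as suspect when its score exceeds a chosen threshold.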

Candidate 2: machine learning with weak supervision

The second candidate method makes use of a machine learning (ML) algorithm. An ML algorithm teaches itself to identify complex associations and is easy to update using new training data. However, ML requires significant development work and monitoring. What's more, the findings are not always easy to interpret. The ML algorithm we've chosen to train is a random forest classifier, which has the advantages of being efficient, capable of identifying non-linear associations and able to perform well on unseen data. We're using seventeen features as input. The selected features are based on common characteristics of malicious registrations, such as the domain name itself, the registrant's e-mail provider and inconsistencies in the registrant data. All the features are strictly data-driven. So, for example, the system doesn't depend on a list of suspect words, but deduces suspect and legitimate letter combinations by analysing historical registrations. This candidate requires training only once, by means of weak supervision. The underlying philosophy prioritises quantity over quality: you use labels that are easily obtained, even though you know they're not entirely reliable. We're training the algorithm with both bona fide and malicious registrations. In that context, we define as 'malicious' any registration made between January and June 2021 (inclusive) that was reported by Netcraft within thirty days. Our training set includes a total of 531 registrations defined as malicious on that basis. The bona fide registrations used for training consist of 531 random registrations made in the same period that were not reported by Netcraft within thirty days. We know from our COMAR project that our adopted definitions are not perfect, but we hope that they're good enough to train a good-quality classifier with little manual effort.
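A minimal sketch of this setup, using scikit-learn, is shown below. The input files and parameter values are hypothetical placeholders; the blog does not describe the exact features or model settings.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Hypothetical inputs: a matrix with 17 features per registration and weak
    # labels derived from Netcraft reports (1 = reported within 30 days, 0 = not).
    X = np.load("registration_features.npy")
    y = np.load("weak_labels.npy")

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    # The classifier outputs a probability per registration; a threshold on this
    # probability determines which registrations are flagged as suspect.
    suspect_probability = model.predict_proba(X_test)[:, 1]
    flagged = suspect_probability >= 0.5  # the threshold value is a policy choice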

https://images.ctfassets.net/yj8364fopk6s/4mofFxva4xAnuOCXv77EJR/77bc0863fb9c08b0552fdafb4cbc9f61/Model_training.svg

Figure 1: Procedure for training the two machine learning candidates.

Candidate 3: machine learning with active learning

Our third candidate method also uses the random forest machine learning algorithm and the same seventeen features. The difference between candidates 2 and 3 is that candidate 3 uses an active learning strategy to train the algorithm (see the lower portion of Figure 1) and thus make iterative improvements to candidate 2's weak-supervision model. Active learning is based on the principle that analysts do not need to label all data points in order to improve a model; it is sufficient to label the most informative ones. The crux is deciding which data points are informative. Generally speaking, the approach works well if you label a diverse set of boundary cases. Another challenge is combining the weak labels used for the weak-supervision candidate with the analysts' labels. Should weak labels and analysts' labels carry equal weight, for example, or should extra weight be given to the analysts' labels? Before testing candidate 3, we ran the active learning loop three times. For each iteration, our anti-abuse experts labelled an average of 230 relevant registrations, which were then used as training data. The exact implementation of our active learning method is outside the scope of this blog. Anyone interested in that topic is referred to this article, and anyone keen to explore the topic in depth to the book 'Human-in-the-Loop Machine Learning' by Robert Monarch.
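The sketch below illustrates one iteration of such a loop, using simple uncertainty sampling (select the registrations the model is least sure about). It assumes the model and feature matrices from the previous sketch; the selection strategy and label weighting that we actually use are not detailed in this blog.

    import numpy as np

    def most_informative(model, X_pool, batch_size=230):
        """Pick the unlabelled registrations the model is least certain about."""
        proba = model.predict_proba(X_pool)[:, 1]
        uncertainty = np.abs(proba - 0.5)  # closest to 0.5 = most uncertain
        return np.argsort(uncertainty)[:batch_size]

    def active_learning_iteration(model, X_train, y_train, X_pool, ask_expert):
        """One feedback loop: experts label boundary cases, model is retrained."""
        idx = most_informative(model, X_pool)
        new_labels = ask_expert(X_pool[idx])           # manual labels from the anti-abuse team
        X_train = np.vstack([X_train, X_pool[idx]])
        y_train = np.concatenate([y_train, new_labels])
        model.fit(X_train, y_train)                    # retrain with the expert labels added
        X_pool = np.delete(X_pool, idx, axis=0)
        return model, X_train, y_train, X_pool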

How can we evaluate the three candidates?

Having introduced the three candidates, let us move on to the evaluation. Ideally, we want a model that flags up as many malicious registrations as possible, but as few legitimate registrations as possible. We need a model capable of doing that in order to efficiently suppress malicious registrations without requiring our anti-abuse experts to devote large amounts of time to investigating registrations erroneously identified as suspect ('false positives'). We measure the performance of a model using two indicators: sensitivity and specificity. Sensitivity is the probability that a candidate will identify a malicious registration as suspect. Specificity is the probability that a candidate will identify a bona fide registration as such. We've been evaluating the sensitivity and specificity of the candidates using two datasets (see Table 1). The first is a dataset of malicious registrations made between 1 November 2021 and 15 January 2022, used for measuring sensitivity. The registrations in question were identified on the basis of Netcraft reports. Most involve domain names used for phishing, detected by security investigators and end users. The second dataset is derived from nearly a thousand randomly selected registrations, which were manually evaluated by our anti-abuse experts. Of those, 920 were identified as bona fide and are therefore suitable for measuring the specificity of the candidates; the other 48 were judged by our anti-abuse experts to warrant further investigation and were therefore labelled 'suspect'. The proportion of registrations identified by our anti-abuse experts as suspect is slightly higher than the proportion of registrations identified by Netcraft as malicious. That probably reflects the more critical approach taken by our anti-abuse experts, who consider a wider range of possible abuses than Netcraft.
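In standard terms (a general formulation, not specific to this study), the two indicators can be written as:

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}
\]

where a true positive (TP) is a malicious registration flagged as suspect, a false negative (FN) a malicious registration that is missed, a true negative (TN) a bona fide registration that is left alone, and a false positive (FP) a bona fide registration that is wrongly flagged.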

Source                 Purpose       Label       Registrations   Unique registrants
Netcraft               Sensitivity   Bona fide   0               0
Netcraft               Sensitivity   Malicious   150             118
Random registrations   Specificity   Bona fide   920             695
Random registrations   Specificity   Suspect     48              43

Table 1: Details of the two datasets used for evaluation of candidate methods.

How well do the candidate methods work?

Each candidate can be tuned by setting a threshold value, which determines how many registrations are flagged as suspect. Unfortunately, it quickly becomes apparent that there is no threshold value at which a high proportion of malicious registrations is detected (high sensitivity) while very few bona fide registrations are mistakenly classified as suspect (high specificity). That is evident from figure 2, which shows the sensitivity (y-axis) associated with various threshold values (x-axis) for each of the three candidates. From figure 2, we can conclude that all three candidates are capable of detecting a high proportion of the malicious registrations in the Netcraft dataset. The weak-supervision candidate (red line) performs best at most threshold values, but only slightly better than the scoring system (green line). In other words, the loosely defined labels used in the weak-supervision approach are adequate for training a model that performs well. Active learning (purple line) yields the poorest performance: the three feedback loops with manually defined labels do not seem to improve sensitivity. We think that's because our anti-abuse experts consider a wider range of possible abuses and therefore identify more domain names as suspect than Netcraft identifies as malicious.

https://images.ctfassets.net/yj8364fopk6s/6hh0FmGN9yZYltg88tp05m/155eb85c9befc19a6a09c86614ae37b9/sensitivity.svg

Figure 2: Sensitivity (probability of malicious registration being detected) calculated on the basis of Netcraft reports.

The dashed coloured lines in the graph represent the threshold values corresponding to a good balance between sensitivity and specificity. Calculating these balance-point threshold values facilitates comparison of the candidates and datasets. The sensitivity and specificity of each candidate at its balance-point threshold value are given in table 2.
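Assuming the balance point is simply taken to be the threshold at which sensitivity and specificity are closest to each other, it could be located as sketched below; the exact procedure used for the figures may differ.

    import numpy as np

    def balance_point(thresholds, sensitivity, specificity):
        """Return the threshold where the sensitivity and specificity curves are closest."""
        gap = np.abs(np.asarray(sensitivity) - np.asarray(specificity))
        return thresholds[int(np.argmin(gap))]

    # Example: balance_point([0.1, 0.2, 0.3], [0.9, 0.75, 0.6], [0.6, 0.8, 0.95]) -> 0.2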

Candidate          Sensitivity (Netcraft)   Specificity (random registrations)
Scoring system     72.0%                    93.3%
Weak supervision   73.3%                    95.5%
Active learning    66.7%                    98.3%

Table 2: Sensitivity and specificity at the balance-point threshold values shown in figure 2.

https://images.ctfassets.net/yj8364fopk6s/OinX9MmewHqB7Zm5Tgqfu/19d6aa8b5d2d009884efcfd8627631f9/specificity.svg

Figure 3: Specificity (probability of bona fide registrations being recognised) calculated on the basis of random registrations.

The specificity of the models is shown in figure 3. The x-axis again represents the threshold values, while this time the y-axis represents the specificity of each candidate. Note that the specificity scale starts at 75 per cent, because all the candidates correctly recognised most bona fide registrations. The active-learning candidate proved to be the most specific: at its balance-point threshold value, it identified 98.3 per cent of bona fide registrations as such. The scoring system and the weak-supervision candidate were 93.3 and 95.5 per cent specific, respectively. Although those percentages may sound fairly similar, the differences are considerable when translated to the full stream of registrations. Our anti-abuse experts judged about 5 per cent of the random registration dataset to be suspect, implying that roughly 95 per cent of the 2,580 daily registrations are bona fide. On that basis, the active-learning candidate would refer about 43 false positives a day to the anti-abuse experts for investigation, whereas the weak-supervision candidate would refer 109 and the scoring system 165. In terms of the absolute number of cases unnecessarily referred for investigation, the performance differences are very significant, especially given that each case can take fifteen minutes or more to investigate.
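As a rough check of those daily figures, assuming roughly 95 per cent of the 2,580 daily registrations (about 2,450) are bona fide:

\[
(1 - 0.983) \times 2{,}450 \approx 42, \qquad (1 - 0.955) \times 2{,}450 \approx 110, \qquad (1 - 0.933) \times 2{,}450 \approx 164
\]

The small deviations from the figures quoted above are rounding effects, since table 2 shows the specificities to only one decimal place.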

Practical use: technology and policy

From the analyses presented above, it's clear that none of the candidates provides a perfect solution to our problem. We therefore need to work with our anti-abuse experts to define a number of requirements, regarding both technical and policy aspects. First, we need to decide whether the priority is a very sensitive system or a very specific system. The former would maximise the probability of picking up all suspect registrations, but would increase the risk of bona fide registrations being flagged as suspect. Prioritising specificity would provide greater assurance that most flagged domain names really were problematic, but would be more likely to let genuinely malicious domain names slip through the net. We also have to choose between a knowledge-driven scoring system and a data-driven machine-learning system. A scoring system is easier to implement, but less flexible, because adaptations in line with new malicious registration trends have to be made manually. A weak-supervision system would offer greater flexibility, because it could easily be retrained using new weak labels. Finally, an active-learning system would offer greater adaptability than either of the other options, since it could be fine-tuned via the feedback loop. However, each feedback loop would require considerable time input from our anti-abuse team. Policy decisions are also required regarding the desirability of detecting suspect registrations of types that aren't yet included in abuse feeds, and of detecting malicious registrations before they appear in those feeds. Netcraft currently reports 56 per cent of malicious domain names within twenty-four hours, and 81 per cent within three days. In practice, therefore, a system that detects suspect registrations faster than Netcraft is only advantageous if our anti-abuse team is able to follow up most detections within hours. Providing that capability would have major implications for the team.

Conclusions and plans

Our central conclusion is that identifying suspect domain names exclusively on the basis of registration data is indeed possible. However, we have yet to find a candidate detection system that's clearly preferable in all respects. That's partly because the evaluation period was relatively short, and partly because the preferred option depends on policy priorities. In the period ahead, we therefore intend to work with our anti-abuse experts, not only with a view to making technical improvements, but also with a view to clarifying the policy priorities. Features such as a registrant reputation score may be added. We will also continue evaluating the performance of the candidates after they have been upgraded, so that we can ultimately identify a preferred candidate and threshold value for integration into our registration and anti-abuse processes. We wish to thank anti-abuse experts Nanneke Franken and Manouk van Schellen for their feedback and their help in evaluating the candidates.