Thesis on the proactive recognition of domain name abuse

Identifying malicious registrations by means of data validation

Thursday 4 February 2021
Article by: Joost Prins

Phishing attacks and other forms of domain name-related abuse are a widely recognised problem on the internet. As part of my university studies, I recently did a project at SIDN Labs for my thesis. The project explored the scope for proactively detecting domain names registered for abusive purposes before they enter use, purely by analysing registration data. One aspect that I looked at was the validation of registration data.

SIDN Labs and various other teams [1, 2] do a lot of research into ways of tackling phishing websites and other forms of domain name-related abuse. Their research has explored options such as analysing the DNS traffic associated with domains to detect irregularities, and the use of machine learning to classify websites on the basis of their content. In many cases, the detection methods involved are reactive: they cannot detect a phishing site until it is up and running. Other methods seek to identify malicious websites from suspicious registration patterns, such as simultaneous bulk registrations. However, those methods are not good at detecting smaller-scale malicious activities. My study was therefore intended to investigate the scope for addressing such activities by identifying the associated domain names at the time of registration. The approach is explained in this blog post.

Domain registration

The life cycle of a .nl domain name starts with the name's registration in the Domain Registration System. That involves three actors: the person or organisation that wants the domain name (the registrant), the company through which the registration is made (the registrar), and the company that administers the top-level domain (the registry, SIDN where .nl is concerned).

Every domain name's registration includes the following attributes:

The domain name
The date and time of registration
The registrar through which the domain name is registered
The name server
The reseller through which the domain name is registered (where relevant)
The registrant's name
The registrant's phone number
The registrant's address
The registrant's e-mail address

Validation of registrant data

Because SIDN administers the registration data for all .nl registrants, it is potentially possible to use that data to assess the likelihood of a registration being malicious. Analysis of the data could facilitate the identification of falsified information and thus the prevention of abuse.

My research was based on the assumption that registrants who intend to use their domain names for phishing or other malicious purposes will provide falsified data in order to minimise the likelihood of being held to account. We accordingly developed an authenticity index for each individual registration attribute (e.g. registrant's name and registrant's e-mail address). The process of determining authenticity is referred to as 'attribute validation'.

Because registration data is personal data, the registration attributes were validated locally, removing the need to share any data with outside parties.

The data validation process is explained below.

Attribute 1: Registrant's name

The registrant's name may be the name of a natural person or the name of a business or other organisation that wants the domain name in question. We validated registrants' names using natural language processing software, which can assess the probability that a name is a genuine personal name.

Testing on a sample set of names found that the software was good at analysing the names of natural persons, but not the names of organisations. Validation of this attribute was therefore restricted to domain names with private registrants. A name was regarded as valid if the software recognised it as a personal name.

Attribute 2: Registrant's address

Registrants' addresses were validated using a local copy of the Register of Buildings and Addresses ('BAG'). The BAG is a public database of all addresses in the Netherlands, maintained by the Dutch government.

We wrote a program that broke down the registrant's address linked to each new registration into the street name, building number, number suffix (where relevant) and postcode. The program then retrieved the building number and postcode from the BAG database and compared the retrieved data with the registration data. The comparison involved calculating the Levenshtein distance between the two.

Because the BAG contains only data about Dutch addresses, the validation method was applicable only to such addresses. However, roughly 95 per cent of .nl domain names have Dutch registrant addresses, so the validation covered the vast majority of registrations.
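
As a rough illustration of the address comparison described above (not the code from the thesis), the sketch below looks up a registrant's address in a small, made-up stand-in for the BAG and computes the Levenshtein distance between the registered and official street names. The lookup key, the sample entry, the normalisation and the `max_distance` threshold are all assumptions introduced for this example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for a local BAG copy:
# (postcode, building number) -> official street name. The entry is made up.
BAG_SAMPLE: Dict[Tuple[str, str], str] = {
    ("1234AB", "10"): "Voorbeeldstraat",
}


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalise(text: str) -> str:
    """Lower-case the text and remove all whitespace."""
    return "".join(text.lower().split())


def address_distance(street: str, number: str, postcode: str,
                     bag: Dict[Tuple[str, str], str] = BAG_SAMPLE) -> Optional[int]:
    """Distance between the registered and official street name, or None if
    the postcode and building number are not in the lookup table."""
    key = (normalise(postcode).upper(), number.strip())
    official = bag.get(key)
    if official is None:
        return None
    return levenshtein(normalise(street), normalise(official))


def is_valid_address(street: str, number: str, postcode: str,
                     max_distance: int = 2) -> bool:
    """Treat small spelling differences as valid (threshold is an assumption)."""
    distance = address_distance(street, number, postcode)
    return distance is not None and distance <= max_distance
```

With this sample table, `is_valid_address("Voorbeldstraat", "10", "1234 AB")` tolerates the one-letter typo in the street name, whereas an unknown postcode fails outright.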

Attribute 3: Registrant's e-mail address

In order to validate the registrant's e-mail address, we first checked whether the corresponding domain had a valid mail server. If it did, the mail server was queried to establish whether the full address existed on the server. If existence of the address was confirmed, the e-mail address was regarded as valid. Validation was performed using a Python library.
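
The two-step check described above can be sketched roughly as follows. This is not the library used in the study: `find_mail_hosts` is a hypothetical callable that returns the mail servers for a domain (for example via an MX lookup), and `smtp_mailbox_exists` is one possible way of asking a server about a mailbox, using the standard `smtplib` module. Many mail servers accept every recipient or refuse such probes, so the result is only an indication.

```python
import smtplib
from typing import Callable, Optional, Sequence, Tuple


def split_address(address: str) -> Optional[Tuple[str, str]]:
    """Split an address into (local part, lower-cased domain), or return None."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return None
    return local, domain.lower()


def smtp_mailbox_exists(mail_host: str, address: str,
                        timeout: float = 10.0) -> bool:
    """Ask mail_host whether it would accept mail for address (RCPT TO probe)."""
    try:
        with smtplib.SMTP(mail_host, 25, timeout=timeout) as smtp:
            smtp.helo()
            smtp.mail("")                 # null sender, no message is sent
            code, _ = smtp.rcpt(address)
            return code in (250, 251)     # recipient accepted
    except (smtplib.SMTPException, OSError):
        return False


def validate_email(address: str,
                   find_mail_hosts: Callable[[str], Sequence[str]],
                   mailbox_exists: Callable[[str, str], bool] = smtp_mailbox_exists
                   ) -> bool:
    """Step 1: does the domain have a mail server? Step 2: does it know the address?"""
    parts = split_address(address)
    if parts is None:
        return False
    _, domain = parts
    hosts = find_mail_hosts(domain)       # hypothetical MX lookup, injected
    if not hosts:
        return False
    return any(mailbox_exists(host, address) for host in hosts)
```

Passing the lookup and the probe in as functions keeps the sketch testable without network access; in practice `find_mail_hosts` would wrap a DNS resolver.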

Attribute 4: Registrant's phone number

Registrants' phone numbers were validated using publicly available phone number validation software. The software performs syntactic checks on all types of phone number. So, for example, it checks whether the number of digits is valid for the relevant country.
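
To give an idea of what such a syntactic check involves, here is a minimal sketch that verifies the number of digits following the country calling code. It is not the validation software used in the study, and the `NATIONAL_NUMBER_LENGTHS` table is a deliberately tiny, illustrative assumption; real validation software covers every country and many more rules.

```python
import re

# Illustrative only: number of digits expected after the country calling code.
NATIONAL_NUMBER_LENGTHS = {
    "31": {9},   # Netherlands
    "1": {10},   # North American Numbering Plan
}


def is_plausible_phone_number(raw: str) -> bool:
    """Check that an international number has a valid digit count for its country."""
    # Strip common separators: spaces, hyphens, brackets and dots.
    compact = re.sub(r"[\s\-().]", "", raw)
    if not re.fullmatch(r"\+\d+", compact):
        return False
    number = compact[1:]
    for code, lengths in NATIONAL_NUMBER_LENGTHS.items():
        if number.startswith(code):
            return len(number) - len(code) in lengths
    # Countries outside the illustrative table cannot be judged by this sketch.
    return False
```

For example, `is_plausible_phone_number("+31 6 12345678")` passes, while `"+31 6 1234"` fails because it has too few digits for a Dutch number.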

Attribute 5: Other attributes

As well as validating the data provided in the context of registration, we looked at other properties that could point to malicious intent. The properties in question were selected by reference to previous research in this field [1, 2], in conjunction with input from SIDN's Abuse Team regarding the pointers that they regarded as suspect. The selected properties included the time of registration and characteristics of the domain name itself, such as numeric content.
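
The snippet below sketches how properties of this kind could be derived from a registration. The function name and the exact set of properties (hour of registration, label length, digit count and digit ratio) are assumptions chosen for illustration; the thesis describes the properties actually used.

```python
from datetime import datetime
from typing import Dict, Union


def extra_properties(domain_name: str,
                     registered_at: datetime) -> Dict[str, Union[int, float]]:
    """Derive simple properties from the domain name and the registration time."""
    # Keep everything before the final dot, e.g. "bank123-login" for
    # "bank123-login.nl".
    label = domain_name.lower().rstrip(".").rsplit(".", 1)[0]
    digit_count = sum(ch.isdigit() for ch in label)
    return {
        "hour_of_registration": registered_at.hour,
        "label_length": len(label),
        "digit_count": digit_count,
        "digit_ratio": digit_count / len(label) if label else 0.0,
    }
```

Each property becomes one numeric input for the classifier, alongside the validation results for the registrant attributes.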

Recognition of domain name abuse

We tested six algorithms for the recognition of domain name abuse and selected the one that performed best: a random forest. The validation results for the registration data attributes and the extra properties were used in combination as input for the algorithm. We trained the algorithm using various test datasets, which included details of known malicious registrations (e.g. as available from Netcraft) and details of registrations linked to Trustmarked websites.

The system worked well on the test datasets, but much less well at detecting abusive registrations in real time. It detected 80 per cent of the malicious registrations in the test dataset. However, when applied to real registration data received by SIDN over a period of a month, it picked up only 15 per cent of the registrations that subsequently proved to be malicious. One likely reason for the discrepancy is that the abusive registrations in the test dataset were on average a year old, and the characteristics of such registrations typically change rapidly.
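
The percentages above are detection rates: the share of registrations known to be malicious that the system flagged. A minimal sketch of that calculation follows; the function name and the placeholder domain names in the usage note are made up to show the arithmetic and are not data from the study.

```python
from typing import Set


def detection_rate(flagged: Set[str], malicious: Set[str]) -> float:
    """Fraction of the known malicious registrations that the system flagged."""
    if not malicious:
        raise ValueError("at least one known malicious registration is required")
    return len(flagged & malicious) / len(malicious)
```

If, say, five registrations are known to be malicious and the system flags four of them, `detection_rate({"a.nl", "b.nl", "c.nl", "d.nl"}, {"a.nl", "b.nl", "c.nl", "d.nl", "e.nl"})` gives 0.8, or 80 per cent. Flagged registrations that are not malicious do not affect this figure.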

Moreover, a fault appears to have occurred in the address data validation during the study, which is likely to have affected performance. After completion of the study, SIDN Labs performed further evaluation of the developed address data verification method. It was found that roughly 28 per cent of malicious .nl registrations involved address data that failed the validation test, compared with just 4 per cent of legitimate registrations. That finding was out of step with the first evaluation, which indicated that 40 per cent of both malicious and legitimate registrations had registrant addresses that could not be validated.

Follow-up

SIDN Labs is continuing to assess the method developed for my project and considers further investigation worthwhile, despite the error made in the first evaluation. An important next step is to train the random forest classifier using newer data and to see how the system then performs.

SIDN Labs is also looking into the possibility of combining the system with existing malicious registration detection methods, in order to improve legitimacy prediction capability.

Anyone interested in a detailed description of the system is welcome to read my thesis.

Article by:

Joost Prins

Graduation trainee

Joost completed his graduation project at SIDN Labs.

j.v.prins@alumnus.utwente.nl