Using new machine learning applications to boost internet security

Our vision and research agenda for the next two years

Monday 26 July 2021
Article by: Thymen Wabeke, Thijs van den Hout, Cristian Hesselman

The original blog is in Dutch. This is the English translation.

SIDN Labs' research is aimed at improving the security, stability and resilience of the internet infrastructure. In that context, machine learning plays an increasingly important role. It has, for example, helped us to automatically detect thousands of fake webshops, establish partnerships for fighting domain name abuse, and help abuse analysts identify malicious .nl sites, e.g. on the basis of logo abuse. In this blog post, we look ahead and discuss the research we are planning in order to further increase internet security with the aid of machine learning.

Machine learning is taking off

Machine learning involves the use of computer algorithms that automatically extract rules and patterns from large volumes of data, and use them as a basis for making decisions or predictions. Numerous major advances in machine learning have been made in recent years, leading to increasing use of machine learning algorithms for high-stakes decisions. Examples include new image recognition algorithms based on deep learning, which enable cars to function on an increasingly autonomous basis and support doctors in the analysis of medical scans. Machine learning has become so important that Apple has recently started making processors for its computers that are specially optimised for ML algorithms. The rate of advance is also reflected in the number of academic articles devoted to machine learning. In 2020, 13,788 articles on the subject were added to arXiv.org, an open library of academic papers. That is more than sixty times as many as in 2010 (Figure 1).

Figure 1: Number of articles in the machine learning category (stat.ML) published on arXiv.org, an open library of academic papers.

Our niche: the use of machine learning to increase internet and DNS security

At SIDN Labs, we research machine learning algorithms for a 'niche' application: increasing the security of the internet and the Domain Name System (DNS). Our aim is to apply and evaluate machine learning algorithms, so that DNS actors (registries, including SIDN itself, registrars and DNS operators) can use them on their own datasets.

Our machine learning work is relevant because many large datasets relating to internet and DNS security topics are available, but their size limits both their usefulness for manual pattern recognition and the scope for keeping them updated. In relation to .nl, for example, we generate DMAP crawler data (6.2 million new measurements a month) that we use for fake webshop identification, and we have a historical DNS database called ENTRADA (more than 2 billion new data points a day) that can be used for botnet pattern detection.

At SIDN Labs, we've chosen to place the emphasis on innovation with machine learning. In other words, we monitor the academic literature in this field and seek to make innovative use of promising methods that can support our aims. We generally leave the innovation of machine learning itself to large research teams at universities and corporations like Google and Microsoft.

Our approach: responsible machine learning

We give explicit consideration to the way we use machine learning, in recognition of the social and ethical impact of algorithms. We therefore follow the philosophy of responsible machine learning, which we interpret as follows:

We apply the 'human-in-the-loop' principle. In other words, our systems don't make any automated decisions about changes to domain name registrations (e.g. the removal of a name from the zone, or the delinking of name servers). We also use feedback from users to keep improving our models.
We believe that we should be able to understand and explain the outcomes associated with our systems. We therefore have a strong preference for algorithms that are intrinsically explainable or can be made explainable.
We seek to collaborate with other parties and we publish our findings. The rationale is that broad input enhances the quality of our work and reduces the risk of bias, while publication gives registrars, other registries and partners the ability to learn from our successes and mistakes. Operating in this way therefore helps others to contribute to internet security.
We continuously monitor the performance of our models by reference to multiple indicators. That enables us to detect various types of error and more accurately assess how well our systems are working.
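
To illustrate why multiple indicators matter, here is a minimal Python sketch that derives several of them from the same set of analyst-reviewed predictions. The class name, function names and example counts are hypothetical and chosen for illustration only; they are not taken from our systems.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts of model predictions compared with analyst-confirmed labels."""
    true_positives: int
    false_positives: int
    true_negatives: int
    false_negatives: int


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def indicators(c: ConfusionCounts) -> dict:
    """Compute several complementary performance indicators."""
    total = (c.true_positives + c.false_positives
             + c.true_negatives + c.false_negatives)
    precision = safe_ratio(c.true_positives,
                           c.true_positives + c.false_positives)
    recall = safe_ratio(c.true_positives,
                        c.true_positives + c.false_negatives)
    return {
        "accuracy": safe_ratio(c.true_positives + c.true_negatives, total),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": safe_ratio(
            c.false_positives, c.false_positives + c.true_negatives),
        "f1": safe_ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical month: 10,000 reviewed domain names, 100 of them malicious,
# and a model that finds only 10 of those 100.
example = indicators(ConfusionCounts(true_positives=10, false_positives=0,
                                     true_negatives=9900, false_negatives=90))
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the accuracy looks excellent because malicious names are rare, while the recall shows that most of them are missed. Looking at one indicator alone would hide that type of error.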

Issues addressed by our research agenda

Our research agenda is structured around three research questions:

RQ1: How can we get even better at proactive abuse detection?
RQ2: How can we train shared abuse models without exchanging data?
RQ3: How can we maximise the effectiveness and speed of our anycast infrastructure monitoring and management?

We are focusing on those three research questions because the answers can contribute to internet security and because the questions lend themselves to investigation with machine learning. They lend themselves to machine learning because numerous relevant data points are available, but rules cannot be derived from them manually.

RQ1: How can we get even better at proactive abuse detection?

We've been investigating the proactive detection of suspect websites using machine learning since 2018, because detection capability helps registries like SIDN to take down malicious content sooner, thus minimising the number of victims.

In relation to RQ1, our primary aim is to continue refining the systems we have already developed:

Fake webshop Detector (FaDe): FaDe has already enabled us to detect thousands of fake webshops. In the period ahead, we plan to use FaDe to monitor the strategies used by fraudsters. It's important to do that, because the detection of fake webshops is a cat-and-mouse game, in which the fraudsters are always changing tactics to escape detection.
LogoMotive: helps abuse analysts (e.g. those working for the government) identify phishing sites on the basis of logo abuse. We will be investigating how we can use LogoMotive output more widely, e.g. to identify suspect webshops. Our focus on visual website content is innovative, and we expect it to make a positive contribution to the detection of malicious sites.

Another way we intend to address RQ1 is by using machine learning in the detection of compromised domain names. In that context, we distinguish two types of attack:

Domain name hacks: these are commonplace and usually involve the exploitation of vulnerable web technologies, e.g. outdated WordPress plugins. We want to investigate the possibility of proactively detecting hacked domain names from changes in their DNS traffic. However, we recognise that it will be challenging, because such changes may have a variety of causes. Moreover, as a TLD operator, we are not well placed to follow up on any suspicions we may have, because we have little information about the websites in question. With those considerations in mind, we have started a pilot with the registrar Realtime Register to explore the possibility of jointly investigating suspect domain names.
DNS hijacks: these are attacks in which criminals attempt to seize control of a domain name. Such attacks rarely come to public attention, but when they do occur their impact can be serious. That is because a hacker who has administrator access to a domain name's name servers can modify its individual DNS records, so that internet users are directed to a different server hosting, say, malicious content or a data harvesting operation. Unfortunately, DNSSEC cannot protect against attacks of this kind. We therefore plan to investigate the possibility of using machine learning algorithms in the detection of DNS hijacks. In the context of that investigation, the shortage of ground truth data represents a challenge. We may therefore start with a cluster analysis or an anomaly detection exercise. That would involve looking for and then investigating unusual data points, on the grounds that they could be associated with suspect modifications.
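
As a rough illustration of what an anomaly detection exercise without ground truth could look like, the Python sketch below flags domain names whose number of recent name server changes is unusual compared with the rest of the population. It uses a robust z-score based on the median absolute deviation, which is one common technique and not necessarily the one we will end up using. The feature, the domain names, the counts and the threshold of 3.5 are all assumptions made for the example.

```python
from statistics import median


def robust_scores(values):
    """Return a robust z-score for each value, based on the median and the
    median absolute deviation (MAD) rather than the mean and standard
    deviation, so that a few extreme points do not distort the baseline."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: this simple sketch gives up and
        # scores everything as 0.0 instead of dividing by zero.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores
    # for normally distributed data.
    return [0.6745 * (v - med) / mad for v in values]


def flag_unusual(changes_per_domain, threshold=3.5):
    """Return the domain names whose score exceeds the threshold.
    Flagged names are candidates for manual investigation, not verdicts."""
    names = list(changes_per_domain)
    scores = robust_scores([changes_per_domain[name] for name in names])
    return [name for name, score in zip(names, scores)
            if abs(score) > threshold]


# Hypothetical feature: name server record changes per domain in one week.
ns_changes = {
    "domain-a.example": 0,
    "domain-b.example": 1,
    "domain-c.example": 0,
    "domain-d.example": 2,
    "domain-e.example": 1,
    "domain-f.example": 14,
    "domain-g.example": 1,
}
print(flag_unusual(ns_changes))
```

In line with the human-in-the-loop principle, the output of such a step would only be a shortlist for an analyst to look at.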

RQ2: How can we train shared abuse models without exchanging data?

Registries often have access to a lot of data (e.g. registration data and DNS data), but have a limited view of the internet environment. At SIDN, for example, we have no access to the information that registrars and hosters hold about .nl domain names. Collaboration with the registries for other top-level domains (TLDs) can also help us to improve our machine learning models and reduce the risk of bias in our work. We therefore want to investigate how we might use technologies such as federated learning to train models in collaboration with partners (e.g. other ccTLD registries or .nl registrars). We would like to do that because the exchange of privacy-sensitive data can be challenging, e.g. due to the need for harmonisation and legal alignment. In federated learning, each individual partner independently trains a temporary machine learning model on its own data. The temporary models are abstractions and contain no sensitive information, but each is inaccurate on its own, because it is based on only one partner's data. The various temporary models are then combined to create a sound model that all the partners can use. One possible route is to work with a group of ccTLDs (e.g. under the CENTR umbrella) to train a model for the detection of suspect domain registrations. After all, the problem of cybercrime is not specific to any one TLD or registrar. By 'looking over the fence', we may discern other patterns, e.g. the bulk registration of domain names under multiple TLDs with a view to using them for fake webshops.
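
The combination step can be sketched with federated averaging, one well-known way of merging locally trained models. The Python example below is a toy under stated assumptions: it fits a one-feature linear model instead of a real abuse classifier, the two partner datasets are invented, and all function names are hypothetical. Only model weights and sample counts leave a partner; the raw data points stay local.

```python
def local_train(weights, data, lr=0.1, steps=5):
    """One partner refines the shared model on its own data.
    The model is y = w * x + b, trained with plain gradient descent
    on the mean squared error."""
    w, b = weights
    for _ in range(steps):
        grad_w = sum((w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum((w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(updates):
    """Combine partner models into one, weighting each partner's weights
    by the number of samples it trained on."""
    total = sum(n for _, n in updates)
    dims = len(updates[0][0])
    return tuple(
        sum(weights[i] * n for weights, n in updates) / total
        for i in range(dims)
    )


def train_federated(partner_datasets, rounds=50):
    """Repeat: send the shared model out, train locally, average the results."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [
            (local_train(global_weights, data), len(data))
            for data in partner_datasets
        ]
        global_weights = federated_average(updates)
    return global_weights


# Invented data: both partners observe the same relation y = 2x + 1,
# but each sees a different part of the input range.
partner_a = [(-1.0, -1.0), (-0.5, 0.0), (0.0, 1.0)]
partner_b = [(0.5, 2.0), (1.0, 3.0)]

w, b = train_federated([partner_a, partner_b])
print(f"shared model: y = {w:.3f} * x + {b:.3f}")
```

A real deployment would have to deal with much more, such as agreeing on features across partners and checking what the shared weights might still reveal, which is part of what we want to investigate.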

RQ3: How can we maximise the effectiveness and speed of our anycast infrastructure monitoring and management?

SIDN has been using anycast for some years. Since last year, SIDN Labs has had its own BGP anycast testbed, which we use for our NTP service time.nl, for example. Anycast boosts the resilience and performance of our DNS infrastructure, but also increases the number of systems and configurations that need to be managed and monitored. We therefore want to explore the scope for using machine learning algorithms to help network operators monitor and manage their anycast infrastructures. For example, we are thinking of developing a system that generates warnings in the event of notable network traffic shifts, such as a sudden surge in the volume of traffic to a particular anycast node. Flagging up traffic shifts would hopefully facilitate rapid response by network operators. That might involve, say, deploying additional nodes or making certain routes more or less attractive and thus rebalancing the traffic distribution. Another question we want to look at is whether we could help operators interpret traffic shifts. Why is it, for instance, that a sudden surge occurs in the volume of traffic to a particular node? The approach we have in mind is to correlate data on network traffic shifts with data from other sources, e.g. RIPEstat, which records BGP routing changes.
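
A first version of such a warning system could be as simple as the Python sketch below, which compares each node's share of the total traffic in the latest interval with its share in earlier intervals. This is an illustrative assumption about how the idea might be prototyped, not a description of an existing system: the node names, query counts, z-score threshold and minimum standard deviation are all hypothetical.

```python
from statistics import mean, pstdev


def traffic_shares(counts):
    """Convert per-node query counts into each node's share of the total."""
    total = sum(counts.values())
    return {node: (count / total if total else 0.0)
            for node, count in counts.items()}


def detect_shifts(history, latest, z_threshold=3.0, min_std=0.01):
    """Return {node: z_score} for nodes whose latest traffic share deviates
    strongly from their historical share.

    history: list of {node: query_count} dicts for earlier intervals
    latest:  {node: query_count} dict for the interval being checked
    min_std: floor on the standard deviation, so that very stable nodes
             do not trigger warnings on tiny fluctuations
    """
    past_shares = [traffic_shares(counts) for counts in history]
    warnings = {}
    for node, share in traffic_shares(latest).items():
        baseline = [shares.get(node, 0.0) for shares in past_shares]
        sigma = max(pstdev(baseline), min_std)
        z_score = (share - mean(baseline)) / sigma
        if abs(z_score) >= z_threshold:
            warnings[node] = z_score
    return warnings


# Invented query counts for three anycast nodes over four earlier intervals.
history = [
    {"ams": 500, "fra": 300, "lon": 200},
    {"ams": 510, "fra": 290, "lon": 200},
    {"ams": 490, "fra": 310, "lon": 200},
    {"ams": 500, "fra": 300, "lon": 200},
]
# Latest interval: traffic has moved from "ams" to "fra".
latest = {"ams": 200, "fra": 600, "lon": 200}

for node, z in detect_shifts(history, latest).items():
    direction = "surge" if z > 0 else "drop"
    print(f"warning: traffic {direction} at node {node} (z = {z:.1f})")
```

Working with traffic shares rather than absolute volumes means that a general rise in traffic across all nodes does not raise a warning, while a redistribution between nodes does. Explaining the shift, for example by correlating it with BGP routing changes, would be a separate step.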

Collaboration on machine learning applications

As you'll have gathered, there are many ways in which machine learning can increase the security of the internet and the Domain Name System (DNS). Over the next two years, we intend to explore some of those possibilities by addressing the research questions set out above. Got any suggestions or feedback? Have you spotted an opportunity for collaboration? If so, please drop a line to thymen.wabeke@sidn.nl.
