Using logo detection technology to identify malicious .nl websites

LogoMotive helps the fight against internet crime by flagging up unauthorised logo use

The original blog is in Dutch. This is the English translation.

Logos give a website a familiar feel and promote trust. Scammers take advantage of that by using well-known organisations' logos on malicious websites. Unsuspecting internet users see the logos and think that they're looking at a legitimate webshop or government website, when it's actually a phishing site, a fake webshop or a site set up to spread misinformation. So, here at SIDN Labs, we've developed LogoMotive: a prototype tool designed to help analysts identify abusive domain names more quickly by flagging up .nl sites that seem to be making unauthorised use of logos. In this blog we explain why we developed LogoMotive, the research on which it's based and how it works.

Malicious websites often use the logos of authoritative organisations

At SIDN Labs, we're constantly looking for new ways of tackling domain name abuse, in order to protect .nl domain name users against internet crime.

One thing we've observed is that phishing sites, fake webshops and other malicious websites often use the logos of authoritative organisations. Examples include a phishing site with a government logo on its forged DigiD login page (Figure 1a) and a fake webshop with the Trustpilot logo and the logo of a standardisation body (Figure 1b). By putting such logos on their sites, scammers lull visitors into a false sense of security. It's then easier to trick them into parting with money or data, or accepting false information.

Figure 1a: A phishing website with a government logo on its forged DigiD login page.

Figure 1b: A fake webshop that has Trustpilot and ISO logos (lower left), although it isn't affiliated or certified.

Identifying suspect websites on the basis of logo use

Against that background, we set ourselves the goal of helping abuse analysts to identify suspect websites in the .nl zone by checking for unauthorised logo use. We have now developed a prototype tool, LogoMotive, which builds on the findings of the pilot project we ran with Currence last year.

SIDN and Currence team up to fight fake webshops

Algorithm

For a tool like LogoMotive, we need two key components. The first is an algorithm capable of automatically detecting logos on .nl websites. The .nl zone contains more than 6.2 million domain names, and that set is changing all the time. The algorithm therefore needs to be very efficient; otherwise, regularly scanning the entire zone would be too time-consuming.

The algorithm also has to be capable of recognising a variety of logos, and we'd ultimately like to be able to add new logos over time, on a semi-automated basis. Flexibility is important, because scammers don't always use the same logos. Recently, for example, we've seen an upturn in the use of postal logos on phishing sites, probably in response to the growth of online shopping.

Dashboard

LogoMotive's second essential component is a dashboard on which abuse analysts can easily check out the websites flagged up by the algorithm. For example, it needs to be possible for the analyst to record whether a flagged domain name is legitimate or malicious, and what needs to be done next, such as disabling a malicious domain name or associating a legitimate domain name with an organisation.

Applying ethical constraints

For ethical reasons, we want to deliberately constrain LogoMotive in certain ways. For example, we want the algorithm to look exclusively at logos, not other visual objects, such as faces. To make sure it does that, we're training the algorithm ourselves, using only logos. LogoMotive will also be limited to analysing screenshots of public web pages.

Another requirement is that LogoMotive must operate on the basis of the 'human in the loop' principle. In other words, as with our other anti-abuse tools, there won't be any autonomous robot decision-making about things such as the deactivation of domain names. LogoMotive's output will be presented on a dashboard to help human anti-abuse analysts to assess suspect domain names thoroughly and efficiently.

Under LogoMotive's hood

Let's begin by outlining how LogoMotive works. We'll then move on to taking a closer look at its two main components: the logo-detection algorithm and the dashboard.

LogoMotive starts with a list of domain names; in this case, a list of all the domain names in the .nl zone. Working from that list, it compiles a bank of screenshots. That involves a robot visiting the websites linked to the domain names on the list and screenshotting all the relevant pages. Screenshotting is a time-consuming process, because the content of each page – including images, backgrounds and other formatting – has to be loaded before a screenshot can be taken. We therefore try to be smart about deciding which websites the robot should visit. For example, we skip any site whose content hasn't changed since the previous week, and any page that's identical to one previously found on another site.
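The skipping rules described above can be sketched in a few lines of Python. This is a minimal illustration, not LogoMotive's actual code: we assume here that "unchanged" and "identical" are decided by comparing a SHA-256 hash of the page's HTML, and the names content_fingerprint, pages_to_screenshot and last_seen are hypothetical.

```python
import hashlib


def content_fingerprint(html: str) -> str:
    """Return a stable fingerprint of a page's content (assumption: raw HTML hash)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def pages_to_screenshot(pages, last_seen):
    """Return the (domain, url) pairs that still need a fresh screenshot.

    pages:     iterable of (domain, url, html) tuples from the current crawl
    last_seen: dict mapping url -> fingerprint recorded during the previous crawl
    """
    seen = set()      # fingerprints already encountered during this run
    selected = []
    for domain, url, html in pages:
        fingerprint = content_fingerprint(html)
        unchanged = last_seen.get(url) == fingerprint  # same content as last time
        duplicate = fingerprint in seen                # same content as another page
        seen.add(fingerprint)
        if unchanged or duplicate:
            continue
        selected.append((domain, url))
    return selected
```

In this sketch, a page is screenshotted only if its content is new for that URL and has not already been seen elsewhere in the same run.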

Next, the screenshots are analysed by our logo-detection algorithm (see below). At present, the algorithm supports sixteen organisations' logos that are often used for phishing, including iDEAL, the Dutch national government, SIDN, various banks, accreditation schemes and postal/courier firms.

The analysis yields a list of logos found on website screenshots, details of which are stored in a database. Finally, an abuse analyst using a straightforward dashboard can work through the 'hits' to assess whether anything is amiss (see below).
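To give an idea of what storing and retrieving hits could look like, here is a small sketch using Python's built-in sqlite3 module. The table layout, the column names and the helper functions are our own assumptions for illustration; the text above doesn't specify which database or schema LogoMotive uses.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical detections store."""
    conn = sqlite3.connect(path)
    # analyst_label stays NULL until a human analyst has reviewed the hit.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS detections (
               domain        TEXT NOT NULL,
               logo          TEXT NOT NULL,
               confidence    REAL NOT NULL,
               screenshot    TEXT NOT NULL,
               analyst_label TEXT
           )"""
    )
    return conn


def store_detections(conn, detections):
    """Insert (domain, logo, confidence, screenshot) tuples."""
    conn.executemany(
        "INSERT INTO detections (domain, logo, confidence, screenshot) "
        "VALUES (?, ?, ?, ?)",
        detections,
    )
    conn.commit()


def unreviewed_hits(conn, logo):
    """Return unreviewed (domain, confidence) hits for one logo, most confident first."""
    rows = conn.execute(
        "SELECT domain, confidence FROM detections "
        "WHERE logo = ? AND analyst_label IS NULL "
        "ORDER BY confidence DESC",
        (logo,),
    )
    return rows.fetchall()
```

A dashboard could then call unreviewed_hits(conn, "SIDN") to list the hits that an analyst still has to assess for that logo.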

Logo-detection using YOLO

At the heart of LogoMotive is the logo-detection algorithm. We opted to use the existing YOLO (You Only Look Once) algorithm. YOLO is a neural network that's specially designed for object detection and able to detect objects more quickly than most other algorithms of its kind. Another factor behind our choice was that we had been impressed by YOLO when we used it in an earlier pilot carried out in partnership with Currence.

Machine learning method identifies brand logos on fake webshops

During the development process, we also considered an alternative approach based on SIFT (Scale-Invariant Feature Transform). The SIFT algorithm identifies the characteristic elements of images and calculates how many elements two images have in common. In the context of our work, a screenshot could be considered to contain a logo if the screenshot and the logo had sufficient shared elements. In theory, the advantage of SIFT is that it requires less computational power than a neural network and doesn't need training. In practice, however, we soon discovered that SIFT was relatively slow and didn't perform as well as YOLO.

Automatic generation of training data

Before YOLO can recognise objects in images, it needs to be trained using a large volume of data. Our training data consisted of screenshots known to contain one or more logos, plus associated labels for describing where the logos appear in the images. One option for compiling the training data was to manually annotate relevant screenshots. However, to arrive at a reasonable dataset that way, we would have had to invest scores of hours per logo. As well as implying considerable time input during development, that would have complicated the task of adding support for new logos in the future. We therefore devised a smart way of automatically generating training data.

Our method requires two inputs: an organisation's logo and a set of screenshots from a few thousand randomly selected websites. Both are easy to obtain: the logo is available from the organisation's website and the screenshot set can be compiled by automatically crawling a random subset of the .nl zone.

We then proceeded to paste the logos onto the screenshots, randomly varying the position, size, sharpness, colour and visibility percentage of the logos. The purpose of the variations was to ensure that the neural network could handle the variety it would encounter on real web pages. The method enabled us to quickly generate large, good-quality datasets, without deploying a disproportionate amount of human capacity. Figure 2 illustrates how we generated a data point.

Figure 2: We generated training data automatically by combining logos with random screenshots to form training data points.
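The geometric part of that process can be sketched as follows. The sketch picks a random size and position for a logo and derives the matching label, assuming the label layout used by common YOLO implementations (class, box centre and box size, all normalised to the image dimensions). The function name place_logo and the parameters scale_range and min_visible are hypothetical, and the actual pasting, as well as the sharpness and colour variations, would need an image library and are left out.

```python
import math
import random


def place_logo(screen_w, screen_h, logo_w, logo_h, class_id, rng,
               scale_range=(0.5, 2.0), min_visible=0.5):
    """Return ((x, y, w, h) paste box, YOLO-style label) for one training data point."""
    # Random size, capped so the logo is never larger than the screenshot.
    scale = rng.uniform(*scale_range)
    w = min(screen_w, max(1, round(logo_w * scale)))
    h = min(screen_h, max(1, round(logo_h * scale)))

    # Random position. The logo may hang over an edge, as long as at least
    # min_visible of its width and of its height stays on screen.
    keep_w = max(1, math.ceil(w * min_visible))
    keep_h = max(1, math.ceil(h * min_visible))
    x = rng.randint(keep_w - w, screen_w - keep_w)
    y = rng.randint(keep_h - h, screen_h - keep_h)

    # Clip to the screenshot: the label describes only the visible part.
    left, top = max(x, 0), max(y, 0)
    right, bottom = min(x + w, screen_w), min(y + h, screen_h)

    # Label: class id, box centre (x, y) and box size (w, h), normalised to 0..1.
    label = (
        class_id,
        (left + right) / 2 / screen_w,
        (top + bottom) / 2 / screen_h,
        (right - left) / screen_w,
        (bottom - top) / screen_h,
    )
    return (x, y, w, h), label


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
paste_box, yolo_label = place_logo(1280, 800, 200, 80, class_id=0, rng=rng)
```

Because the label is computed at the moment the logo is placed, no manual annotation is needed: every generated screenshot comes with an exact bounding box.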

Analysts' LogoMotive dashboard

Figure 3 shows the LogoMotive dashboard, with examples of websites that feature the SIDN logo (i.e. www.sidn.nl and www.sidnlabs.nl). The dashboard uses the output of the YOLO algorithm after training as described above.

Figure 3: The LogoMotive dashboard shows websites on which a particular logo has been found.

Figure 4 shows the screen that an anti-abuse analyst sees if they click on one of the result lines shown in Figure 3 – in this case sidn.nl. The dashboard shows them the screenshot featuring the detected logo, together with the degree of confidence (in this case 0.98, or 98 per cent). The dashboard also displays information to help the analyst to assess the flagged websites thoroughly and efficiently. The analyst can then label the result at the bottom of the screen and, in appropriate cases, select a follow-up action. The labels can easily be modified for each individual logo, in line with the analyst's preferences.

Figure 4: Information about each website can be viewed. The user sees the screenshot(s) of the sites where the logo has been found, and can add notes and comments.

How good is LogoMotive?

A key question is, of course, how good is LogoMotive at detecting logos? Answering that question isn't easy, however. A balance must always be struck between false negatives (instances of actual logo use that are overlooked) and false positives (reported finds where the relevant logo isn't really used). That implies deciding how certain you want the algorithm to be before flagging up a website (the 'confidence threshold'). What's more, the number of false negatives is always unknown, because you can't say how many sites have been overlooked without knowing the total number of sites with relevant logos in the .nl zone.

We have therefore configured the system to yield an expected false positive rate of 10 per cent. That is the percentage that we use for other detection systems, and is acceptable to the anti-abuse analysts who process the output. We could configure LogoMotive to deliver fewer false positives, but the price would be more false negatives.

We set the desired confidence threshold for each individual logo by manually examining a small sample of websites. With the government logo, for example, we found that a confidence threshold of 90 per cent yielded good results, while for the Thuiswinkel Waarborg logo 85 per cent was best. We don't know exactly why a higher threshold is needed for the government logo, but we suspect that it may be because the logo features a dark blue square, which is not very distinctive. The system can detect logos even if they are only half-visible, featured in a different colour, or reproduced as a very small background image. We therefore think that the number of logos overlooked is trivial. Nevertheless, we intend to investigate that more closely in due course.
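One way to derive such a threshold from a manually checked sample is sketched below. It rests on an assumption about terminology: we read the 10 per cent "false positive rate" as the share of flagged hits that turn out not to contain the logo. The function name pick_threshold and the sample format are hypothetical, and in practice the choice was made by manual inspection rather than by a script like this.

```python
def pick_threshold(sample, max_false_share=0.10):
    """Pick the lowest confidence threshold that keeps false hits within budget.

    sample: list of (confidence, is_real_logo) pairs from a manually checked sample
    Returns the threshold, or None if no threshold meets the budget.
    """
    best = None
    # Try every observed confidence as a candidate threshold, highest first,
    # so that the last candidate that qualifies is also the lowest one.
    for threshold in sorted({conf for conf, _ in sample}, reverse=True):
        flagged = [is_real for conf, is_real in sample if conf >= threshold]
        false_share = flagged.count(False) / len(flagged)
        if false_share <= max_false_share:
            best = threshold
    return best
```

Choosing the lowest threshold that still meets the budget keeps the number of overlooked logos as small as the accepted share of false hits allows.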

Joint evaluation and plans for the future

Over the last few months, we have improved our prototype logo-detection system considerably. LogoMotive can now serve as a solid basis for various follow-up studies.

In the period ahead, we intend to evaluate how the system works in practice. That will mean focusing primarily on our own Anti-Abuse Desk, but also reaching out to large organisations with authoritative logos that are often abused. Recently, for example, we started a pilot with the Dutch national government's Publicity and Communication Service (DPC). We're expecting the DPC to use LogoMotive to detect not only malicious websites making unauthorised use of the government logo, such as the phishing site in Figure 1a, but also legitimate government websites that have slipped under the radar. We'll be publishing a blog about the pilot's results before too long.

Another thing we want to do is take a closer look at the machine learning issues. What is the maximum number of logos that the neural network can look for at the same time, for example? We'd also like to investigate the scope for automatically classifying the motive for a site's logo use by drawing on other data sources. We know from our previous research into fake webshops that some data, such as the time of domain registration, has predictive value in the context of abuse detection. It may well be possible, therefore, to discern similar patterns in the field of malicious logo use. We'll explore those ideas more fully in future blogs. Finally, we intend to integrate LogoMotive output with DEX, the tool we developed a while ago for studying the ecosystem around a suspect domain name.