Applying the COMAR classifier to 35k unique phishing URLs

Study highlights characteristics of compromised and maliciously registered domain names


Wednesday 25 May 2022
Article by: Maciej Korczyński, Thymen Wabeke, Marc van der Wal, Benoît Ampeau, Cristian Hesselman

The research was carried out by the University of Grenoble Alpes in collaboration with AFNIC and SIDN Labs.

In our previous blogs [1, 2], we discussed the COMAR classifier, which automatically classifies the domain names of blacklisted URLs as either compromised or maliciously registered. In this blog, we focus on the application of the COMAR classifier to blacklisted URLs serving phishing pages. We study four selected characteristics of the domain names of malicious URLs and analyse their distribution across different types of top-level domain (TLD). In the analysed datasets, for example, we find that 84% of maliciously registered domain names are less than one year old, whereas approximately 57% of compromised domains, which are often exploited at the website level, were registered at least six years before the corresponding URLs were blacklisted. Since COMAR is a fully automated system that performs classification based on multiple characteristics, it is resistant to manipulation (e.g. domain 'aging').

Executive summary

Our main findings are as follows:

Approximately one quarter of domain names abused to launch phishing campaigns are compromised and generally cannot be blocked at the DNS level.
For legacy gTLDs and ccTLDs, between 26% and 32% of domain names are benign but compromised, possibly at the website level, whereas the vast majority of new gTLD domain names are maliciously registered.
The most frequently used keywords in domain names registered by malicious actors to lure victims into providing their credentials are ‘online’, ‘bank’, ‘service’, ‘info’, ‘support’, ‘secure’ and ‘paypal’.
For 84% of maliciously registered domain names, the difference between the domain registration date and the blacklisting date is less than a year, and 13% of them are blacklisted on the same day they are registered.
As many as 71.8% of the maliciously registered domain names have no specific technology on their homepage. In comparison, 67.7% of compromised domains use more than six different frameworks and plugins to build the website, making them susceptible to web application attacks.

Overview of the COMAR system

COMAR (classification of COmpromised versus MAliciously Registered domains) [3] is a machine-learning system capable of distinguishing between domain names registered by cybercriminals solely for fraudulent purposes, and benign but hacked domain names exploited mainly at the hosting level, often by taking advantage of vulnerabilities in web applications. In both cases, cybercriminals abuse such domain names to distribute malicious content, such as malware or phishing websites. The COMAR classifier is a more precise method [3] for making such a distinction than the sets of heuristics, such as domain name age, often used by practitioners. Note that a website may be hacked soon after a domain name is registered, or cybercriminals may register a domain name and use it in a phishing campaign several months after registration. In either case, a heuristic based on domain name age alone could lead to an incorrect assessment of the maliciousness of the domain name. COMAR does not suffer from such limitations because it does not rely heavily on any individual characteristic such as registration date (which is only one of 38 proposed characteristics) [3].

The goal of the COMAR classifier is to help actors in the domain name registration and hosting industries to improve their anti-abuse processes. Specifically, if COMAR classifies a domain name as maliciously registered, then registries and registrars can block the domain name and the hosting provider can remove the malicious content from the hosting server. If COMAR classifies a domain name as benign but compromised at the hosting level, then registries and registrars should not block it at the DNS level to avoid collateral damage to its legitimate users (i.e. the domain name registrant and website visitors). Instead, depending on whether the hosting is unmanaged or managed, the webmaster or hosting provider should remove the malicious content and patch the vulnerable application.
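The mitigation logic described above can be sketched as a small lookup. This is only an illustration of the decision rule in the text, not part of COMAR itself: the names `Verdict` and `recommended_actions` and the wording of the action strings are hypothetical.

```python
from enum import Enum
from typing import List


class Verdict(Enum):
    """The two classes a blacklisted domain name can be assigned to."""
    MALICIOUSLY_REGISTERED = "maliciously registered"
    COMPROMISED = "compromised"


def recommended_actions(verdict: Verdict, managed_hosting: bool) -> List[str]:
    """Map a verdict to the mitigation steps described in the text.

    Hypothetical helper: the action strings are illustrative only.
    """
    if verdict is Verdict.MALICIOUSLY_REGISTERED:
        return [
            "registry/registrar: block the domain name at the DNS level",
            "hosting provider: remove the malicious content",
        ]
    # Compromised domain: no DNS-level blocking, to avoid collateral damage
    # to the registrant and to legitimate website visitors.
    responsible = "hosting provider" if managed_hosting else "webmaster"
    return [
        f"{responsible}: remove the malicious content",
        f"{responsible}: patch the vulnerable application",
    ]


if __name__ == "__main__":
    for step in recommended_actions(Verdict.COMPROMISED, managed_hosting=False):
        print(step)
```

The point of the sketch is that the verdict, rather than the mere presence on a blacklist, decides who acts and at which level.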

By having domain names flagged as malicious or benign but compromised, we can also derive more actionable insights into the attackers' behaviour. For example, COMAR can help uncover a list of popular terms (e.g. support, online, bank) in domain names used in phishing attacks and classified as maliciously registered. Such a list can form the basis for building a proactive domain monitoring system that tracks newly registered domains containing such keywords to identify possible new phishing activities.
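As a rough illustration of such a monitoring system, the sketch below flags newly registered domain names that contain any of the keywords reported in this blog. It is a minimal sketch under stated assumptions, not COMAR's lexical feature extraction: the function names are hypothetical, the TLD is stripped naively (last label only), and plain substring matching will produce false positives (e.g. 'info' inside 'information').

```python
# Keywords reported in this blog as most frequently used by malicious actors.
PHISHING_KEYWORDS = ("online", "bank", "service", "info", "support", "secure", "paypal")


def matched_keywords(domain):
    """Return the keywords that occur in the labels to the left of the TLD."""
    labels = domain.lower().rstrip(".").split(".")
    # Drop the last label so that a TLD such as .info does not count as a hit.
    # Assumption: naive handling; multi-label suffixes like .co.uk are not special-cased.
    name = ".".join(labels[:-1])
    return {kw for kw in PHISHING_KEYWORDS if kw in name}


def flag_new_registrations(domains, min_hits=1):
    """Return {domain: sorted keyword hits} for domains with at least min_hits keywords."""
    flagged = {}
    for domain in domains:
        hits = matched_keywords(domain)
        if len(hits) >= min_hits:
            flagged[domain] = sorted(hits)
    return flagged


if __name__ == "__main__":
    feed = ["paypal-online-support.com", "example.info", "bakery.nl"]
    print(flag_new_registrations(feed))
```

A keyword hit alone is only a signal for closer inspection; as noted elsewhere in this blog, COMAR combines lexical features with many others before classifying a domain name.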

Classification results

For this blog, we analysed phishing URLs that we collected during the first six months of 2021. We automatically evaluated 35,519 unique phishing URLs (with unique underlying domain names across different TLDs) collected from APWG and PhishTank.

Figure 1 shows the overall classification results: 76% of domain names were registered for malicious purposes only, and 24% were classified as registered by benign users but compromised. If those domain names were compromised at the hosting rather than at the DNS level, they should not be blocked by TLD registries or registrars.

Pie chart showing the percentage of overall classification results for the phishing URLs.

Figure 1: Overall classification results for the phishing URLs.

Figure 2 shows the classification results for phishing sites in different types of TLDs:

legacy gTLDs (e.g. .com, .net, or .org)
new gTLDs (e.g. .top, .report, or .xyz)
country-code TLDs (e.g. .nl, .fr, or .br).
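The three TLD types listed above can be told apart roughly as in the sketch below. It is a simplified heuristic for illustration, not the method used in the study: `LEGACY_GTLDS` is a small, deliberately incomplete set, two-letter ASCII TLDs are treated as ccTLDs, internationalised ccTLDs are not handled, and everything else is assumed to be a new gTLD.

```python
from collections import Counter

# Assumption: a small illustrative subset of the legacy gTLDs, not a complete list.
LEGACY_GTLDS = {"com", "net", "org", "info", "biz"}


def tld_type(domain):
    """Classify a domain name by the type of its top-level domain."""
    tld = domain.lower().rstrip(".").rsplit(".", 1)[-1]
    # Two-letter ASCII TLDs are country-code TLDs.
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        return "ccTLD"
    if tld in LEGACY_GTLDS:
        return "legacy gTLD"
    # Assumption: anything else is a new gTLD (ignores IDN ccTLDs and sponsored TLDs).
    return "new gTLD"


def tally_by_tld_type(domains):
    """Count how many domain names fall into each TLD type."""
    return Counter(tld_type(d) for d in domains)


if __name__ == "__main__":
    print(tally_by_tld_type(["example.com", "example.nl", "example.xyz", "example.fr"]))
```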

Chart showing the percentage of classification results for the phishing URLs by TLD type.

Figure 2: Classification results for the phishing URLs: breakdown by TLD type.

Figure 2 shows that almost 96% of the domain names of blacklisted phishing URLs in new gTLDs are likely to be maliciously registered, compared with 69% for legacy gTLDs and about 74% for ccTLDs. The question arises: why is the fraction of maliciously registered (as opposed to compromised) domain names much higher in new gTLDs than in ccTLDs and legacy gTLDs? Previous studies [4, 5] have shown that, in general, a relatively large proportion of new gTLD domain names are either parked or contain no content (DNS or HTTP errors), compared to legacy gTLDs. Intuitively, only domain names containing content are likely to be vulnerable to certain types of exploits and thus can be exploited at the website level. This might be a plausible explanation for why only a tiny fraction of new gTLD domain names are likely to be compromised. However, this hypothesis requires systematic future research because no recent studies have conducted such a comparative analysis.

The presented results should be seen merely as trend indicators, and may be influenced by blacklist bias as well as short-term trends in the choices made by attackers. For example, some blacklists may be more effective in detecting maliciously registered domain names (e.g. based on suspicious keywords), while others may be more effective in detecting compromised sites. Some registrars, accredited by a TLD registry, may offer low registration prices for a short period to attract new customers. Malicious actors may take advantage of such special offers and register domain names on a large scale. This may affect the observed percentages of compromised and maliciously registered domains.

Analysis of selected features of COMAR's classification decisions

As discussed in our previous blog [2], COMAR's classification decisions are based on 38 so-called features, which capture the characteristics of a blacklisted URL and the registered domain name. In this blog, we explain how the compromised and maliciously registered domain names that COMAR distinguishes differ in terms of four selected features: popular terms in domain names, the number of web technologies used, domain name age and use of TLS certificates.

Features indicating that a cybercriminal (rather than a benign user) has registered a domain name include special keywords in the domain name, such as 'verification', 'payment' or 'support', and brand names (e.g. paypal-online-support.com). Figure 3 presents a word frequency analysis of the phishing dataset for both domain names automatically classified as maliciously registered (red) and those classified as compromised (blue).

Bar chart showing popular keywords used in phishing domain names.

Figure 3: Popular keywords used in phishing domain names.

Indeed, we find that cybercriminals tend to incorporate such words into domain names to lure victims into entering their credentials. The keywords most frequently used by malicious actors are 'online', 'bank', 'service', 'info', 'support', 'secure', and 'paypal'. On the other hand, the domain names of compromised sites rarely contain such special keywords. Therefore, COMAR leverages lexical features such as ‘special word in domain name’ or ‘name of a well-known brand in domain name’ in the classification.

One of the features of COMAR is ‘number of web technologies’: a count of the JavaScript, Cascading Style Sheets (CSS), or Content Management System (CMS) frameworks and plugins used to build the homepages of maliciously registered and compromised domain names. Developers of professionally designed and high-profile websites usually avoid using too many libraries and frameworks. However, this is not the case for less complex websites. The number of technologies used for developing a website could reflect the amount of effort and time its designer spent to create a fully functional website. Figure 4 shows the results for compromised and maliciously registered domain names.

Bar chart showing the number of technologies for maliciously registered and compromised phishing domains.

Figure 4: Number of technologies for maliciously registered and compromised phishing domains.

As many as 67.7% of compromised domains use more than six different technologies, frameworks and plugins to build the website. In comparison, 71.8% of the maliciously registered domain names have no specific technology on their homepage. We have noticed that many maliciously registered domains either have no homepage (showing the default directory index served by the web server), redirect to another domain (e.g. the landing page of a phishing attack), or display a custom error message (e.g. forbidden page). Instead, they frequently serve the phishing page under a URL path or on a subdomain.

Bar chart showing the age of compromised and maliciously registered domain names.

Figure 5: Age of compromised and maliciously registered domain names.

The age of a domain name, defined as the time between the registration of the domain name and its appearance on the blacklist, is one of the important features of the COMAR classifier. Intuitively, the older the domain name, the more likely it is to have been registered by a benign user but subsequently compromised. On the other hand, cybercriminals tend to abuse a domain name soon after registration. Nevertheless, malicious actors can also compromise domains shortly after they are registered [3]. Also, some criminals may age registered domains, waiting weeks and sometimes months before abusing them. However, as COMAR is a fully automated system performing classification based on multiple features (domain name age is just one of them), it is resistant to manipulation (e.g. domain aging).

Figure 5 shows the age of domain names for all TLDs that provide a registration date as part of their WHOIS data. In the figure, ‘0’ means that registration and blacklisting occurred on the same day, ‘1’ means that the difference between the registration date and the blacklisting date is at most one year, and ‘>6’ means that the difference is at least six years. For 84% of maliciously registered domain names, the difference between the domain registration date and the blacklisting date is less than a year, and 13% of them were blacklisted on the same day they were registered. By contrast, about 57% of compromised domain names were registered at least six years before being blacklisted. A possible explanation for this phenomenon is that websites hosted on older domain names are more likely to use outdated technologies or content management systems (e.g. vulnerable versions of CMSs such as WordPress), making them easier to compromise.
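The age buckets used in Figure 5 can be computed along the following lines. This is a minimal sketch: the meanings of ‘0’, ‘1’ and ‘>6’ follow the definitions above, whereas the intermediate buckets (‘2’ to ‘6’) and the 365-day year are assumptions, and the function name `age_bucket` is hypothetical.

```python
import math
from datetime import date


def age_bucket(registered: date, blacklisted: date) -> str:
    """Return the domain age bucket for a registration/blacklisting date pair."""
    days = (blacklisted - registered).days
    if days < 0:
        # Can happen with stale WHOIS data or a re-registered domain name.
        raise ValueError("blacklisting date precedes registration date")
    if days == 0:
        return "0"  # registered and blacklisted on the same day
    if days >= 6 * 365:
        return ">6"  # at least six years (assumption: 365-day years)
    # Assumption: bucket 'n' covers ages of more than n-1 and at most n years.
    return str(math.ceil(days / 365))


if __name__ == "__main__":
    print(age_bucket(date(2021, 3, 1), date(2021, 3, 1)))
    print(age_bucket(date(2020, 9, 1), date(2021, 3, 1)))
    print(age_bucket(date(2010, 1, 1), date(2021, 3, 1)))
```

Used on its own, such a bucket is exactly the kind of single-feature heuristic that can misjudge a domain name; in COMAR, domain name age is only one of 38 features.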

Chart showing the percentage of TLS certificates issued for maliciously registered and compromised phishing domains.

Figure 6: Issued TLS certificates for maliciously registered and compromised phishing domains.

Another interesting but, according to our analysis [3], less important feature of the COMAR classifier is the use of the Transport Layer Security (TLS) protocol. According to a PhishLabs report [9], three quarters of all phishing sites used HTTPS (HTTP over TLS) in 2020 ‘to add a layer of legitimacy, better mimic the target site in question, and reduce being flagged or blocked from some browsers.’ However, the report conflates compromised with maliciously registered domain names. Therefore, to establish whether cybercriminals are increasingly using TLS certificates, we need to distinguish between compromised and maliciously registered domain names and analyse TLS usage only in the latter group. Otherwise, it is unclear whether the TLS certificate was issued at the request of a criminal for a maliciously registered domain to enhance the website's credibility or at the request of a legitimate domain owner for a benign domain name that was later compromised and abused by a criminal.

Figure 6 shows the percentage of TLS certificates issued for malicious and benign (and later compromised) domain names involved in phishing attacks. TLS certificates are less widespread among maliciously registered domain names than among benign (but compromised) ones. 75% of phishing attacks using compromised domains take advantage of TLS certificates issued at the request of benign domain owners (benefiting, for example, from the green lock displayed in the browser's address bar), while 64% of maliciously registered domains use TLS certificates deliberately deployed by malicious actors to lure their victims.

Conclusions

In this blog, we have presented the results of applying COMAR to phishing websites and classifying registered domains as malicious or compromised. We applied COMAR to malicious URLs blacklisted by reputable providers, i.e. APWG and PhishTank, from January to June 2021. We demonstrated that 76.2% of domain names were maliciously registered, and 23.8% were compromised. We also revealed that the occurrence of certain keywords in domain names, the domain name age, and the number of technologies used are significant discriminators between compromised and maliciously registered domains. We also found that malicious actors deploy TLS certificates less frequently than owners of legitimate (and compromised) domain names. COMAR is a fully automated system that performs classification based on multiple features. It is resistant to manipulation (e.g. domain aging), practical, and much more accurate than rule-based heuristic methods. Thus, it can help streamline the process of mitigating DNS abuse by the various entities involved in domain name registration and hosting.

COMAR project wrap-up

This blog wraps up the COMAR project, which started in late 2018 and was funded by AFNIC and SIDN. COMAR led to a successful Ph.D. thesis at Grenoble Alpes University and a total of four scientific papers: one published in IEEE European Symposium on Security and Privacy 2020 (top-tier venue) [3], one in the ACM Internet Measurement Conference 2020 [6] (another top-tier venue), one at Traffic Measurement and Analysis 2020 [7] (best paper award) and its extended version at IEEE Transactions on Network and Service Management 2021 [8]. AFNIC and SIDN are currently incorporating the prototype of the COMAR classifier that Grenoble Alpes University developed into their production systems to facilitate the process of remediating maliciously registered or compromised domain names that serve malicious content.

We look forward to continued collaboration between Grenoble Alpes University, AFNIC and SIDN!

References

[1] ‘Franco-Dutch research project on automatic classification of domain name abuse’, Cristian Hesselman, Benoît Ampeau and Maciej Korczyński, October 2018
[2] ‘Distinguishing exploited from malicious domain names using COMAR’, Sourena Maroofi, Maciej Korczyński, Benoît Ampeau, Thymen Wabeke, Cristian Hesselman and Andrzej Duda, April 2021
[3] ‘COMAR: Classification of Compromised versus Maliciously Registered Domains’, Sourena Maroofi, Maciej Korczyński, Cristian Hesselman, Benoit Ampeau and Andrzej Duda, IEEE European Symposium on Security and Privacy (IEEE EuroS&P 2020), Virtual Conference, September 2020
[4] ‘Cybercrime After the Sunrise: A Statistical Analysis of DNS Abuse in New gTLDs’, Maciej Korczyński, Maarten Wullink, Samaneh Tajalizadehkhoob, Giovane C.M. Moura, Arman Noroozian, Drew Bagley and Cristian Hesselman, ACM Asia Conference on Computer and Communications Security (ACM AsiaCCS 2018), South Korea, June 2018
[5] ‘From .Academy to .Zone: An Analysis of the New TLD Land Rush’, T. Halvorson, M. F. Der, I. Foster, S. Savage, L. K. Saul and G. M. Voelker, ACM Internet Measurement Conference, October 2015
[6] ‘Are You Human?: Resilience of Phishing Detection to Evasion Techniques Based on Human Verification’, Sourena Maroofi, Maciej Korczyński and Andrzej Duda, ACM Internet Measurement Conference, October 2020
[7] ‘From Defensive Registration to Subdomain Protection: Evaluation of Email Anti-Spoofing Schemes for High-Profile Domains’, Sourena Maroofi, Maciej Korczyński and Andrzej Duda, Network Traffic Measurement and Analysis Conference (TMA 2020), June 2020 (Best Paper Award)
[8] ‘Adoption of Email Anti-Spoofing Schemes: A Large Scale Analysis’, Sourena Maroofi, Maciej Korczyński, Arnold Holzel and Andrzej Duda, IEEE Transactions on Network and Service Management, 2021
[9] ‘Abuse of HTTPS on Nearly Three-Fourths of all Phishing Sites’ (2020), PhishLabs