Detection of phishing domains with DNS traffic analysis

In the last six months, I have been working on the project “SIDekICk” (SuspIcious DomaIn Classification), which is the final project for my M.Sc. degree in Computer Science at the University of Twente. In this blog post, I summarize the major findings and provide recommendations for future work. The full thesis is available for download.

SIDekICk

SIDekICk was part of SIDN Labs’ continuing effort to find novel ways to further reduce the number of misused domains in the .nl zone. The goal of SIDekICk was to develop and evaluate algorithms (so-called “classifiers”) to automatically detect malicious domains in the .nl zone, in particular domain names that were being used for phishing.

The SIDekICk classifiers run on top of SIDN Labs’ ENTRADA platform (ENhanced Top-level domain Resilience through Advanced Data Analysis), which collects and stores the traffic of one of .nl’s authoritative name servers. The ENTRADA platform is an experimental system that is based on Hadoop.

Phishing in .nl

There are relatively few phishing attacks that take place through a .nl domain name. The Global Phishing Survey of the Anti-Phishing Working Group reported 493 phishing attacks that used .nl domain names in the second half of 2014, involving 432 unique .nl domain names [1]. By comparison, [1] reported a total of 123,000 observed phishing campaigns during that same period, which is the highest number in five years. Even SIDN itself was recently targeted by a phishing campaign, although it did not involve any .nl domain names. In my work, I differentiated between domain names that have been registered exclusively for a phishing campaign and domain names that have been compromised by the phisher. Between December 2014 and June 2015, only 6% of the .nl domains reported by the anti-phishing service NetCraft were younger than one week when they got misused for phishing. The majority (82%) were older than one year. We therefore work under the assumption that most of the “old” domains have been compromised and most of the “young” domains were registered by the phisher.
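The age-based working assumption can be written down as a small rule. The following is only a sketch: the function name, the labels and the treatment of domains between one week and one year old are my own illustrative choices, not part of the thesis.

```python
from datetime import date

YOUNG_MAX_DAYS = 7    # "younger than one week" in the text above
OLD_MIN_DAYS = 365    # "older than one year" in the text above


def likely_origin(registered: date, reported: date) -> str:
    """Guess how a reported phishing domain ended up in the phisher's hands.

    Hypothetical helper: it only encodes the working assumption that young
    domains were registered by the phisher and old domains were compromised.
    """
    age_days = (reported - registered).days
    if age_days < 0:
        raise ValueError("report date lies before registration date")
    if age_days < YOUNG_MAX_DAYS:
        return "registered_by_phisher"
    if age_days > OLD_MIN_DAYS:
        return "compromised"
    # The post makes no claim about domains between one week and one year old.
    return "undetermined"


print(likely_origin(date(2015, 3, 1), date(2015, 3, 3)))
print(likely_origin(date(2012, 1, 1), date(2015, 3, 3)))
```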

Geo-distribution

For this post, I want to describe two characteristics of phishing domains in a bit more detail: the geographic location of the querying resolvers and the DNS query patterns. I used both in my thesis to build a classifier that detects malicious domains automatically.

In the latest edition of “The .nlyst” [2], we saw that the majority of .nl websites are in Dutch, and therefore we would expect that the majority of DNS queries come from resolvers in the Netherlands as well. Figure 1 shows the location of resolvers that queried the Top 1000 .nl Alexa Domains, compared with the location for reported phishing domains on the day that they got reported by NetCraft. Most of these domains get queried from the US, which is likely caused by the open resolvers of Google and OpenDNS. We can see that, although many queries for phishing domains come from the same countries as for the popular benign domains, there are still some obvious discrepancies. For example, the phishing domains in my dataset receive more queries from Ireland and Italy than the benign domains in the same dataset. Further, the specimen phishing domain in Figure 1 illustrates that geographic characteristics may vary even more strongly for individual phishing domains. For example, the specimen phishing domain gets significantly more queries from Brazil. This analysis suggests that the geo-distribution of resolvers querying for a particular domain name might be an indicator that the domain is being used for phishing.

Figure 1: Resolver locations for benign and phishing domains
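One way to turn this observation into a number is to compare the per-country share of queries for a single domain with the share for a benign baseline. The sketch below uses the total variation distance for that comparison. This metric is my own choice for illustration and not necessarily the feature used in the thesis, and the country counts are made up.

```python
from collections import Counter


def country_shares(countries):
    """Turn a list of resolver country codes into per-country query shares."""
    counts = Counter(countries)
    total = sum(counts.values())
    return {cc: n / total for cc, n in counts.items()}


def geo_divergence(domain_shares, baseline_shares):
    """Total variation distance between two country distributions.

    0.0 means identical distributions, 1.0 means no country in common.
    """
    all_countries = set(domain_shares) | set(baseline_shares)
    return 0.5 * sum(
        abs(domain_shares.get(cc, 0.0) - baseline_shares.get(cc, 0.0))
        for cc in all_countries
    )


# Made-up example data: one country code per observed query.
baseline = country_shares(["NL"] * 50 + ["US"] * 40 + ["DE"] * 10)
suspect = country_shares(["NL"] * 10 + ["US"] * 30 + ["BR"] * 60)

print(round(geo_divergence(suspect, baseline), 3))
```

A domain whose resolvers are spread over countries that rarely query benign .nl domains gets a value close to 1, while a typical Dutch website stays close to 0.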

Number of queries

Besides the geographic characteristics, we also examined the number of queries for phishing domains. Figure 2 shows the queries we received on the ENTRADA platform for newly registered phishing domains, which usually get queried significantly more often than benign domains on the days right after registration (day 0 in Figure 2).

Figure 2: Average queries for newly registered domains before and after their registration day 0. Phishing domains are reported by NetCraft.

Similarly, we also observed an increase in DNS queries for compromised phishing domains. On the day these domains got reported by NetCraft, we saw that 63% of the domains were queried at least two times more often than on the day before (see Figure 3). Furthermore, in the week of the report, compromised phishing domains were queried on average almost 10 times more often than three weeks earlier (median 2).

Figure 3: Compromised domains that received two times more queries on the day of the report than on the day before.

Figures 2 and 3 suggest that a sudden spike in the number of DNS queries we receive for a (newly created) domain name might be an indicator that the domain is being used for phishing.
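The two growth measures mentioned above can be computed from a list of daily query counts per domain. The following is a minimal sketch with hypothetical function names and made-up counts. In particular, I assume here that "the week of the report" means the seven days ending on the report day and that "three weeks earlier" means the seven-day window that ended 21 days before it; the thesis may define these windows differently.

```python
def day_over_day_ratio(daily_counts, day_index):
    """Queries on the given day divided by queries on the day before."""
    if day_index < 1:
        raise ValueError("need at least one earlier day")
    previous = daily_counts[day_index - 1]
    current = daily_counts[day_index]
    if previous == 0:
        # No queries the day before: any traffic today counts as unbounded growth.
        return float("inf") if current > 0 else 1.0
    return current / previous


def three_week_growth(daily_counts, day_index):
    """Queries in the 7 days ending on day_index versus the 7 days ending 21 days earlier."""
    if day_index < 27:
        raise ValueError("need at least 28 days of history")
    recent = sum(daily_counts[day_index - 6:day_index + 1])
    earlier = sum(daily_counts[day_index - 27:day_index - 20])
    if earlier == 0:
        return float("inf") if recent > 0 else 1.0
    return recent / earlier


def is_daily_spike(daily_counts, day_index, factor=2.0):
    """True if the domain was queried at least `factor` times as often as the day before."""
    return day_over_day_ratio(daily_counts, day_index) >= factor


# Made-up example: 27 quiet days followed by a busy report day.
counts = [10] * 27 + [40]
report_day = len(counts) - 1

print(day_over_day_ratio(counts, report_day))
print(three_week_growth(counts, report_day))
print(is_daily_spike(counts, report_day))
```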

SIDekICk classifiers

Based on the geographic location of the resolvers, rapid increases in queries, and observed query growth over three weeks, I built two classifiers. Both use a decision tree for classification.

The first classifier focuses on finding compromised domain names. It automatically carries out a daily analysis of all 5.5 million domains that have been queried more than 50 times that day. Domains with fewer queries were less likely to be part of phishing campaigns. In seven days, the classifier scored over 1.7 million domains and reported over 14,000 of them as malicious. This number is almost certainly too high, and a quick manual check revealed many false positives among the reported domains. For example, we noticed that some misclassified domains experienced rapid query growth as a result of promotions for online shops or as a result of popular content shared on social networks that day. This illustrates that query patterns and geographic origin can be an indicator for compromised domain names, but that they are not enough to build a sufficiently precise classifier for their detection.

By comparison, the second classifier performs significantly better. It automatically analyses newly registered domains on the day of their registration, as well as one and two days after. During an evaluation of 31 days, this classifier scored over 61,000 domains and detected 33 malicious domains. This was 23 more than NetCraft reported in that same period for .nl. Among those 23 domains were 10 bogus web shops and other online scams. The false positive rate was below 0.3%.
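At prediction time, a decision tree is nothing more than a nested series of threshold tests on the features. The sketch below hand-codes such a tree to show the general shape. It is not the tree from the thesis: the real trees were learned from data, and every threshold, feature name and label here except the 50-query filter is invented for illustration.

```python
from dataclasses import dataclass

# From the post: only domains queried more than 50 times that day are analysed.
MIN_DAILY_QUERIES = 50


@dataclass
class DomainFeatures:
    name: str
    queries_today: int
    day_over_day_ratio: float   # queries today / queries yesterday
    three_week_growth: float    # queries this week / queries three weeks earlier
    geo_divergence: float       # 0.0 = same countries as benign baseline, 1.0 = disjoint


def classify(f: DomainFeatures) -> str:
    """Hand-written stand-in for a learned decision tree (all splits are hypothetical)."""
    if f.queries_today <= MIN_DAILY_QUERIES:
        return "skipped"
    if f.day_over_day_ratio >= 2.0:
        # Sudden spike: only suspicious if the resolvers also look unusual.
        return "suspicious" if f.geo_divergence >= 0.4 else "benign"
    if f.three_week_growth >= 5.0 and f.geo_divergence >= 0.4:
        return "suspicious"
    return "benign"


examples = [
    DomainFeatures("quiet.example.nl", 30, 5.0, 8.0, 0.7),
    DomainFeatures("spike.example.nl", 500, 3.0, 1.0, 0.6),
    DomainFeatures("promo.example.nl", 500, 3.0, 1.0, 0.1),
    DomainFeatures("slow.example.nl", 500, 1.2, 6.0, 0.5),
]
for features in examples:
    print(features.name, classify(features))
```

The "promo" example reflects the false positives described above: a benign web shop with a promotion shows the same query spike as a compromised domain, so a query-count split alone cannot separate the two.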

Conclusion

Malicious domains used for botnets and exploit kits share similar characteristics, such as rapid query growth and geographic diversity [3, 4]. My work shows that newly registered phishing domains can be detected as soon as they become active, and it suggests that this can be done with reasonable accuracy. Detecting compromised domains is a greater challenge, because many benign domains also share the described characteristics.

Recommendations

The next goal is to detect newly registered phishing domains not merely on a daily basis, but on an hourly basis. Also, we are constantly looking into new ways to detect compromised domains with higher precision using DNS data and other data sources. For example, we could consider whether a website is run with a vulnerable version of a content management system, which could indicate that a domain is compromised. If we are confident that a domain name is malicious, then I recommend that SIDN submit the domain to public blacklists, such as the Google Safe Browsing Service. Afterwards, the registrant, the registrar or the web hosting company could be informed. In situations where we have only a suspicion about malicious activity, we are interested in sharing this information with other researchers to turn this suspicion into an educated and confident classification on which we can base further actions. At SIDN Labs, we are always interested in cooperating with other researchers to share insights and to combine efforts to fight harmful activities on the Internet. We are very happy to receive any feedback and new ideas.

References

  1. Aaron, G. and Rasmussen, R. (2015). Global Phishing Survey: Trends and Domain Name Use in 2H2014. APWG Industry Advisory, 2H2014. http://internetidentity.com/wp-content/uploads/2015/05/APWG_Global_Phishing_Report_2H_2014.pdf

  2. The .nlyst No 18 Dutch: https://view.publitas.com/ara/sidn-nlyst-xviii-nl/page/8-9 English: https://view.publitas.com/ara/sidn-nlyst-xviii-en/page/8-9

  3. Villamarín-Salomón, R. and Brustoloni, J. C. (2008). Identifying botnets using anomaly detection techniques applied to DNS traffic. In Consumer Communications and Networking Conference, 2008. CCNC 2008. 5th IEEE, pages 476–481. IEEE.

  4. Perdisci, R., Corona, I., Dagon, D., and Lee, W. (2009). Detecting Malicious Flux Service Networks through Passive Analysis of Recursive DNS Traces. In Annual Computer Security Applications Conference, pages 311–320.

Further Reading

  • Hao, S., Feamster, N., and Pandrangi, R. (2010). An internet-wide view into DNS lookup patterns. School of Computer Science, Georgia Tech, Tech. Rep.

  • Hao, S., Feamster, N., and Pandrangi, R. (2011). Monitoring the initial DNS behavior of malicious domains. In Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference, pages 269–278. ACM.

  • Antonakakis, M., Perdisci, R., Lee, W., Vasiloglou II, N., and Dagon, D. (2011). Detecting malware domains at the upper DNS hierarchy. In USENIX Security Symposium, pages 16–32.