Thesis: Using machine learning to determine the economic activities associated with domain names

Hierarchical text classification in practice

If you visit a website expecting it to be a car dealer's, you can often see immediately whether it actually is a car dealer's site. However, assessing the sites linked to all six million .nl domain names that way would be extraordinarily tedious, time-consuming and impractical. We have therefore developed a system that automatically determines what economic activity is associated with a domain name by analysing the text on the website. That enables us to build up a picture of the make-up of the .nl zone, which in turn makes it easier to do things such as monitor the adoption of internet standards in various sectors of the economy. This blog explains how the system works and how you can try it out yourself.

Definition of economic activity

The aim of our study was to investigate the suitability of hierarchical classification as a means of resolving the challenge of classifying economic activities. That was thought to be worth investigating, because hierarchical classification dovetails neatly with conventional methods of defining economic activity. For example, the Dutch government uses the Standard Business Classification System (SBI), a hierarchical taxonomy based on the European NACE. The system involves a number of general activity types (e.g. 'G: Wholesale and retail trade'), which are subdivided into more specific activities (e.g. 'G.45: Wholesale and retail trade and repair of motor vehicles and motorcycles'). The general activities are referred to as 'sections' and the more specific activities as 'divisions'. The SBI has no fewer than twenty-one superordinate sections and eighty-six subordinate divisions.
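To make the two-level structure concrete, here is a minimal sketch of how a fragment of the taxonomy could be held in code. It is an illustration only, not how our system stores the SBI. The G and G.45 labels come from the text above; the G.46 and G.47 labels follow NACE Rev. 2 wording and should be checked against the official SBI list. The names SBI_FRAGMENT, DIVISION_TO_SECTION and full_code are made up for this example.

```python
# Illustrative fragment of the hierarchy: section code -> label and divisions.
# The full SBI has 21 sections and 86 divisions; only one section is shown.
SBI_FRAGMENT = {
    "G": {
        "label": "Wholesale and retail trade",
        "divisions": {
            "45": "Wholesale and retail trade and repair of motor vehicles and motorcycles",
            # Labels below follow NACE Rev. 2 wording (assumption, not from the blog).
            "46": "Wholesale trade, except of motor vehicles and motorcycles",
            "47": "Retail trade, except of motor vehicles and motorcycles",
        },
    },
}

# Each division belongs to exactly one section, so a reverse index is unambiguous.
DIVISION_TO_SECTION = {
    division: section
    for section, entry in SBI_FRAGMENT.items()
    for division in entry["divisions"]
}


def full_code(division: str) -> str:
    """Return the combined 'section.division' code for a division number."""
    return f"{DIVISION_TO_SECTION[division]}.{division}"


print(full_code("45"))  # G.45
```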

What is machine learning?

Machine learning is the data-driven learning of a function from examples of input and the corresponding output. In our case, the text found on a web page serves as the input, while the output is an economic activity (an SBI division). The technology can therefore be used to assign appropriate economic divisions to business domain names, which makes it possible to automate the economic activity classification process. More information about machine learning is available on Wikipedia and elsewhere.

Development of a hierarchical classification system

Classification (division assignment) can be approached in two ways: you can disregard the sections or use section information to refine the division classification process. In the case of a car dealer, the second approach would imply first determining that the business belonged to the 'Wholesale and retail trade' section, and then determining which of the subordinate divisions applied, without considering any other divisions. This technique is known as hierarchical classification. In academic research, hierarchical classification is often applied to 'benchmark datasets'. These datasets are widely used to compare different methodologies. In theory, therefore, hierarchical classification is known to work. But how useful is it for the resolution of a real-world problem? Our study addressed that question by investigating the suitability of hierarchical classification as a means of resolving the challenge of classifying economic activities. For a detailed explanation of hierarchical classification and an overview of existing applications, refer to this research report.

Can we improve webpage classification (division assignment) by considering the hierarchical context? That was our primary research question. In the example of the car dealer given above, the hierarchical context is the section, G, and the division, 45. We sought to answer the research question by comparing the performance of a non-hierarchical division classification system with that of a hierarchical division classification system.

Figure 1: Illustration of different division classification approaches. The non-hierarchical approach is visualised on the left, the hierarchical approach on the right. The top circle represents the SBI. The middle row of circles represents SBI sections, and the bottom row SBI divisions.

Use of the hierarchical approach involves training multiple systems to determine first the section that an activity belongs to, and then the division within that section. In figure 1, the approach is illustrated in simplified form. In reality there are twenty-one sections (not two) and eighty-six divisions (not three).

In the diagrams, the SBI is represented by the green circle at the top. Beneath are the blue sections, with the orange divisions at the bottom. The sections or divisions that a system must choose between are indicated by a black rectangle. In the left-hand diagram, a single system has to choose from the entire list of divisions in order to classify an activity. In the right-hand diagram, multiple systems are active; the system represented by the dashed rectangle decides which other system should be used to make the ultimate division classification. The grey lines indicate the relationships between the levels: all sections come under the SBI, while each division comes under no more than one section.

The rationale underpinning the approach is that the relatively large and complex problem of division classification is broken down into multiple smaller problems, for each of which a specialist system is used. While that has advantages, it also has the disadvantage that multiple systems need to be trained.
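The difference between the two approaches can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not our actual system: the blog does not say which learning algorithm was used, so a deliberately simple keyword-overlap classifier (KeywordOverlapClassifier, a made-up name) stands in for it, and the five training texts are invented. What matters is the structure: the flat variant trains one classifier over all divisions, while the hierarchical variant trains one section classifier plus one division classifier per section.

```python
from collections import Counter, defaultdict


class KeywordOverlapClassifier:
    """Toy stand-in for a real text classifier (assumption, not the blog's model).

    It predicts the label whose training texts share the most words with the input.
    """

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        return self

    def predict(self, text):
        words = text.lower().split()

        def score(label):
            return sum(self.word_counts[label][word] for word in words)

        # Sorting the labels makes tie-breaking deterministic.
        return max(sorted(self.word_counts), key=score)


# Invented training rows: (website text, section, division).
TRAIN = [
    ("used cars for sale and garage repair", "G", "G.45"),
    ("wholesale supplier of bulk goods to shops", "G", "G.46"),
    ("online shop buy shoes and clothing", "G", "G.47"),
    ("tax advice and accounting for companies", "M", "M.69"),
    ("architects and engineering design office", "M", "M.71"),
]


def train_flat(rows):
    """Left-hand diagram: a single classifier chooses among all divisions."""
    texts = [text for text, _, _ in rows]
    divisions = [division for _, _, division in rows]
    return KeywordOverlapClassifier().fit(texts, divisions)


def train_hierarchical(rows):
    """Right-hand diagram: one section classifier, one division classifier per section."""
    texts = [text for text, _, _ in rows]
    sections = [section for _, section, _ in rows]
    section_clf = KeywordOverlapClassifier().fit(texts, sections)

    rows_by_section = defaultdict(list)
    for text, section, division in rows:
        rows_by_section[section].append((text, division))
    division_clfs = {
        section: KeywordOverlapClassifier().fit(*zip(*pairs))
        for section, pairs in rows_by_section.items()
    }
    return section_clf, division_clfs


def predict_hierarchical(model, text):
    """Pick the section first, then choose only among that section's divisions."""
    section_clf, division_clfs = model
    section = section_clf.predict(text)
    return division_clfs[section].predict(text)


query = "second hand cars and car repair garage"
flat_model = train_flat(TRAIN)
hier_model = train_hierarchical(TRAIN)
print(flat_model.predict(query), predict_hierarchical(hier_model, query))
```

Note that the hierarchical variant needs three trained classifiers for this tiny example (one for sections, one each for G and M), which is the training overhead mentioned above.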

Economic activities in the .nl zone

In our research, the hierarchical variant assigned the correct divisions in 64 per cent of the cases, while the non-hierarchical variant did so in 63 per cent of the cases. On their own, those performance figures don't tell us much. However, it should be borne in mind that the probability of a random division classification being correct (the likelihood of getting the classification right by, say, picking titles out of a hat, blindfold) is just 2 per cent. The performance difference between the two systems is small, but significant scope for improvement remains: various techniques for correcting section classification errors were not used in our research.
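Why section errors matter can be shown with a small calculation. Assuming a purely top-down system without error correction, as described above, a division can only be right if the section was right first, so overall accuracy is the section accuracy multiplied by the division accuracy within correctly chosen sections. The sketch below checks that identity on five invented (true, predicted) pairs; the numbers are hypothetical and are not results from our study.

```python
import math


def section_of(code: str) -> str:
    """Extract the section letter from a 'section.division' code such as 'G.45'."""
    return code.split(".")[0]


def decompose_accuracy(pairs):
    """Return (section accuracy, division accuracy given a correct section, overall accuracy)."""
    section_hits = [(true, pred) for true, pred in pairs
                    if section_of(true) == section_of(pred)]
    division_hits = [(true, pred) for true, pred in section_hits if true == pred]
    section_acc = len(section_hits) / len(pairs)
    conditional_acc = len(division_hits) / len(section_hits) if section_hits else 0.0
    overall_acc = len(division_hits) / len(pairs)
    return section_acc, conditional_acc, overall_acc


# Hypothetical evaluation pairs: (true division, predicted division).
PAIRS = [
    ("G.45", "G.45"),  # section and division correct
    ("G.45", "G.47"),  # section correct, division wrong
    ("G.47", "G.47"),  # section and division correct
    ("M.69", "G.46"),  # section wrong, so the division cannot be right
    ("M.71", "M.71"),  # section and division correct
]

section_acc, conditional_acc, overall_acc = decompose_accuracy(PAIRS)
# Overall accuracy equals the product of the two stage accuracies.
assert math.isclose(section_acc * conditional_acc, overall_acc)
print(section_acc, conditional_acc, overall_acc)
```

In this toy set, every section error costs a division hit outright, which is why techniques that correct section errors are a natural place to look for improvement.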

We used our experimental hierarchical classification system to classify all business and e-commerce-related domains in the .nl zone. We then compared the hierarchical system's output with data published by Statistics Netherlands to see how well the various economic activities are represented in the .nl zone. The section-level and division-level findings are presented in figures 2 and 3. In the figures, a section or division's percentage share of the .nl zone (number of domains) is shown in blue, and the corresponding share of the Dutch economy (number of businesses) is in orange. The green bars represent the differences between the two shares (share of .nl zone minus share of the Dutch economy).
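The comparison behind the figures boils down to a simple calculation per section or division: the share of domains, the share of businesses, and the difference between the two. The sketch below shows that calculation with invented counts; the function name share_differences and all the numbers are hypothetical and do not come from our data or from Statistics Netherlands.

```python
def share_differences(domain_counts, business_counts):
    """Return {category: (zone share %, economy share %, zone minus economy)}."""
    total_domains = sum(domain_counts.values())
    total_businesses = sum(business_counts.values())
    result = {}
    for category in sorted(set(domain_counts) | set(business_counts)):
        zone_share = 100 * domain_counts.get(category, 0) / total_domains
        economy_share = 100 * business_counts.get(category, 0) / total_businesses
        result[category] = (zone_share, economy_share, zone_share - economy_share)
    return result


# Hypothetical counts, for illustration only.
domains = {"G": 300, "M": 100, "other": 600}
businesses = {"G": 200, "M": 300, "other": 500}

for category, (zone, economy, diff) in share_differences(domains, businesses).items():
    print(f"{category}: zone {zone:.1f}%, economy {economy:.1f}%, difference {diff:+.1f}")
```

A negative difference means the category has a smaller share of the zone than of the economy, as is the case for section M in this invented example.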

Figure 2: Comparison of sections' percentage shares of the .nl zone (number of domains; blue) and shares of the Dutch economy (number of businesses; orange). The green bars represent the differences between the two shares.

The figures highlight various discrepancies. For example, section M (consultancy) has a much bigger share of the economy than of the .nl zone. Such information could, for instance, be combined with website security data to identify sectors of the economy where domain security is being neglected. That would provide a starting point for organising awareness campaigns or approaching proprietors personally with information and advice.

Figure 3: Comparison of divisions' percentage shares of the .nl zone (number of domains; blue) and shares of the Dutch economy (number of businesses; orange). The green bars represent the differences between the two shares.

Hierarchical classification works!

The project described in this blog was intended to establish whether hierarchical classification could be used in practice for economic activity classification.

We found that a hierarchical classification system performed better than a non-hierarchical system, albeit only slightly better: just 1 percentage point (64 versus 63 per cent). Nevertheless, the results show that hierarchical classification is useful outside the context of benchmark datasets and does work for economic activity classification. It is at least as good as a conventional non-hierarchical classification method, and we intend to explore the scope for improving its performance further.

We have already added the new classification system to our DMAP crawler, which is used to classify the economic activities associated with all business domain names in the .nl zone. Our findings will shortly be made available on stats.sidnlabs.nl, so that the statistics can be broken down across economic sectors.

Try our system yourself and help us evaluate it

We have made our new classification system available online via a web application. The more often it's used, the more insight we will gain into the system's performance. That will tell us whether the system is really as good as we believe it is. Interested in giving it a try? Go to webcola.sidnlabs.nl and enter a domain name you want to classify. When you get a response, you can tell us whether the classifier has correctly identified the economic activity associated with the domain. Your feedback will help us evaluate how the system performs in practice. We'll also be able to collect annotations that can be used to refine the system going forward.

Want to know more?

If you'd like to know more about the research described here, you're welcome to read my Master's thesis. Just drop a line to robin.deheer@gmail.com.