HPM: a new open-source tool for Hadoop managers
SIDN Labs' latest contribution to the internet research community
The original blog is in Dutch. This is the English translation.
Like many other researchers, here at SIDN Labs we often use Apache Hadoop and related software such as Apache Spark and Apache Impala. For some years, we used Hadoop in combination with Cloudera Express (CDH), a tool for managing clusters of Hadoop servers. However, Cloudera changed its licence model in 2021, meaning that after the next software upgrade we would have had to pay a substantial annual fee. Our preference is, of course, to invest in research aimed at improving the security of the internet infrastructure. So, over the last six months, we have developed the Hadoop Provisioning Manager (HPM). After extensive testing, we are now using the HPM for the production Hadoop cluster on which we run ENTRADA and various other systems. We are also making the HPM source code open, so that other internet researchers can use it on the basis of an MIT licence.
Since 2014, we have been using Hadoop software at SIDN Labs, together with various related software components, such as Apache Spark. Hadoop is open-source software for bulk data storage and efficient data analysis. We use Hadoop for systems such as ENTRADA, our DNS big data storage and analysis platform. The current ENTRADA database has more than 2.3 trillion rows and requires roughly 320TB of storage capacity. By Hadoop standards, our Hadoop environment is very small, consisting of fourteen servers with a 10Gbit network. The Hadoop cluster as a whole has 600TB of storage, 624 CPU cores and 1600GB RAM available for applications.
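As a rough sanity check of those figures, the average storage cost per row works out to only about 140 bytes. A back-of-the-envelope sketch (assuming 1TB = 10^12 bytes; the real per-row cost depends on compression and the columnar storage format):

```python
# Back-of-the-envelope calculation based on the figures above.
# Assumes 1 TB = 10**12 bytes; compression and columnar storage
# determine the actual per-row cost.
rows = 2.3e12            # "more than 2.3 trillion rows"
storage_bytes = 320e12   # "roughly 320 TB"

bytes_per_row = storage_bytes / rows
print(f"~{bytes_per_row:.0f} bytes per row")  # ~139 bytes per row
```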
The installation, configuration and management of a Hadoop cluster's software components are complex and time-consuming, and require considerable specialist knowledge. For some years, therefore, we have used Cloudera Express (CDH) to simplify those tasks and save ourselves time. Previously available as a free Hadoop distribution, CDH features a web application-based management interface that makes the installation and management of a Hadoop cluster considerably easier. Although a Hadoop cluster can be operated without a tool such as CDH, that implies manually downloading and installing Hadoop components, which is time-consuming and error-prone.
Since the start of 2021, however, CDH has been available only on the basis of a paid licence. The annual licence fee depends on the number of servers in the managed cluster, the servers' hardware specifications and the chosen licence type. Even for our small Hadoop cluster, the fee would have been in the region of 150k a year.
Because our preference is to invest in research aimed at improving the security of the internet infrastructure, we decided to investigate the alternatives. After all, we were using only a small portion of CDH's functionality, and we knew that other researchers, including the University of Twente's network security research team, were facing the same problem.
Before looking at possible alternatives to CDH, we drew up a list of requirements, as follows.
High availability (production status) Our CDH alternative must be capable of setting up a high-availability Hadoop cluster. High availability implies that, if some of the software or hardware components go down, the Hadoop cluster as a whole will continue to function without developing significant faults. That's important because we use Hadoop both internally and externally, as with ENTRADA. Our DNS operators use it to optimise the .nl name servers, for example. So there are various applications that depend on our Hadoop environment working properly. Also, SIDN Labs is a research team, and we therefore want to focus on our data analyses, without having to constantly fix problems.
Authentication and authorisation A CDH alternative must be able to configure authentication and authorisation components on our cluster, e.g. Kerberos and Apache Ranger. Authentication involves a server process or an end user being required to prove its/their identity, to which authorisation rules are then linked. The rules ensure that authenticated users aren't all able to do anything they like.
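For illustration, switching Hadoop itself from its default (unauthenticated) mode to Kerberos authentication with service-level authorisation comes down to a couple of properties in `core-site.xml`. This is a minimal sketch; a complete secure setup also involves keytabs, principals and per-component settings, which a tool like the HPM generates:

```xml
<!-- core-site.xml: minimal sketch, not a complete secure configuration -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>  <!-- the default is "simple", i.e. no authentication -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>      <!-- enforce service-level authorisation checks -->
</property>
```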
Data integrity and confidentiality The (web) interfaces of components installed by a CDH alternative (e.g. the Hadoop HDFS name node) must send their data over the network in encrypted form. Encryption is required to ensure data integrity and confidentiality, even if the Hadoop servers are in different networks.
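Concretely, HDFS can be told to serve its web interfaces over HTTPS only and to encrypt RPC and block transfers with a few properties. A sketch (certificate and keystore configuration is also needed on top of this):

```xml
<!-- hdfs-site.xml (sketch): HTTPS-only web UIs and encrypted block transfer -->
<property>
  <name>dfs.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>

<!-- core-site.xml (sketch): encrypt Hadoop RPC traffic -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>
```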
Multiple operating systems A CDH alternative must be able to cope with multiple operating systems simultaneously, since we use a mix of Red Hat Enterprise Linux (RHEL) and Ubuntu for our Hadoop cluster. We use RHEL on the servers running the Hadoop components and Ubuntu on the 'gateway servers' that give users access to the cluster or start applications that run on the cluster.
Scalability A CDH alternative must support the set-up of Hadoop clusters ranging from very small clusters that have only a handful of servers, to larger Hadoop clusters that have tens of servers, as used by major research organisations, such as universities.
Open source As well as meeting the technical requirements set out above, any CDH alternative we adopt must have open source code. That enables us to make modifications and share our experiences with the internet research community, so that together we can contribute to the further enhancement of internet infrastructure security.
We examined the following alternatives to CDH to see whether they met our requirements.
Cloud solutions The big cloud service providers, such as Microsoft, Amazon and Oracle, offer CDH-like solutions. We decided that they were unsuitable, however, because of the high cost associated with the large volumes of data that we process. A cloud solution would also imply less predictable costs than a Hadoop cluster running on our own hardware.
Apache Ambari Ambari is an open-source management tool, fairly similar to CDH. It was originally developed by Hortonworks, which was later acquired by Cloudera. Unfortunately, Ambari is no longer actively maintained, making it unsuitable for us.
Apache Bigtop Bigtop is an open-source project that offers Hadoop components for various operating systems as part of the existing operating system installation mechanism. However, Bigtop doesn't support all components. It doesn't work with Apache Impala or Apache Ranger, for example. Also, Bigtop has only component installation functionality; the user still has to manage the various component configurations, which can be a lot of work. Another drawback is that, with Bigtop (or, indeed, CDH), we would be dependent on the particular versions that Bigtop publishes.
Because none of the alternatives to CDH met our requirements, we developed our own tool: the Hadoop Provisioning Manager (HPM). The HPM uses Ansible, an existing scripting language for configuring a server (also known as 'provisioning').
The HPM (see figure 1) enables Hadoop managers to download and install existing open-source Hadoop components on cluster servers. Another important function of the HPM is the generation, management and rollout of Hadoop component configurations. For example, the user can specify the network ports that certain services should listen on, or where data should be stored. In total, hundreds of different configuration settings can be made. The HPM also prevents the individual Hadoop servers from each downloading the same data from the internet. Direct downloads by all the cluster servers are undesirable for security reasons and also inefficient: they slow down the deployment process.
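In Ansible terms, a provisioning run is a playbook that maps host groups to roles and renders configuration files from central templates. The sketch below is purely illustrative: the host group, role and path names are hypothetical examples, not the HPM's actual layout:

```yaml
# Illustrative playbook sketch; host groups, roles and paths are
# hypothetical examples, not the HPM's real structure.
- hosts: hadoop_workers
  become: true
  roles:
    - hadoop_common      # fetch/unpack components from a local mirror
    - hdfs_datanode
  tasks:
    - name: Render hdfs-site.xml from a central template
      template:
        src: hdfs-site.xml.j2
        dest: /etc/hadoop/conf/hdfs-site.xml
      notify: restart datanode
```

A run would then be started from the command line with something like `ansible-playbook -i inventory cluster.yml`.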
Figure 1: Overview of the HPM.
The HPM is a relatively simple tool that saves Hadoop managers the time-consuming and error-prone tasks of manually downloading and installing components for their clusters. Unlike CDH, the HPM does not feature a web application for quick and easy cluster rollout or configuration. Instead, it relies on command-line-activated scripts. After six months of development and testing, we are now successfully using the HPM for our production Hadoop cluster.
The inspiration for the HPM came partly from similar work by CZ.NIC, the registry for Czechia's .cz domain. The CZ.NIC playbooks serve as a good proof of concept, demonstrating that Ansible can be used to set up a Hadoop cluster. However, unlike the HPM, the CZ.NIC tool isn't yet suitable for use in a production environment.
Because a Hadoop cluster often includes numerous different components, most of which have one or more web interfaces, it is difficult to maintain an overview of, for example, where each component is installed and what other components make use of it.
To address that problem, we have developed a simple management console with a web application (see figure 2). The console enables Hadoop managers to navigate around the various component web interfaces. However, the console does not (yet) support cluster configuration or management.
Figure 2: Screenshot of the HPM web interface.
The HPM supports twenty Hadoop components, the main ones being:
Apache Hadoop
Apache Spark
Apache Impala
Apache Zookeeper
Apache Hive
Apache Ranger
Hue SQL Assistant
The HPM can install the components on either Red Hat Enterprise Linux or Ubuntu. One challenge that we encountered during development is that not all Hadoop component versions are mutually compatible, and it's difficult to establish which will work with which. Mutual dependencies mean that you cannot install any version you like of a given component.
Development of the HPM involved overcoming numerous major and minor challenges, some of which related to the Hadoop cluster requirements we had defined. For example, our requirement that the cluster must be secure implied that we had to configure authentication (Kerberos) for every single component and ensure that authorisation was enforced (Apache Ranger). In many instances, bugs and unclear or missing documentation meant that we had to refer to the components' source code (fortunately open) to find out why things were not working as we expected.
Another challenge was that our existing cluster uses a slightly older version of RHEL. Furthermore, installation-ready, compiled versions of some open-source components (e.g. Apache Impala) were not available. Compiling a component for a recent operating system isn't usually a big problem, but with Apache Impala and an older version of RHEL it was quite an undertaking, because Apache Impala itself depends on particular versions of compilers and code libraries that weren't available for the RHEL version we use. We therefore had to build those compiler versions and libraries for our RHEL version before building Apache Impala. The same had to be done for the latest version of Python, for example.
In the period ahead, we intend to develop functionality for monitoring the various installed components and the cluster hardware, and for detecting issues with them. In readiness, the HPM already adds Prometheus and Grafana when configuring a Hadoop cluster. Prometheus is configured to collect metrics on the various installed components, and Grafana is configured to use Prometheus as its backend datasource, ready to visualise the metrics in a dashboard. We also plan to work on support for other operating system versions, such as Red Hat Enterprise Linux 9.
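By way of illustration, pointing Prometheus at a component's metrics endpoint is a small piece of scrape configuration. A sketch (the hostnames are illustrative; the actual jobs and ports depend on the components and exporters installed):

```yaml
# prometheus.yml (sketch): hostnames are illustrative only
scrape_configs:
  - job_name: hdfs-namenode
    static_configs:
      - targets: ['namenode-1.example.internal:9870']
  - job_name: node-exporter   # host-level hardware metrics
    static_configs:
      - targets: ['worker-1.example.internal:9100']
```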
The HPM's source code is open and available for use under an MIT licence. The intention is that other internet researchers and ENTRADA users should be able to use the HPM and contribute to its further development. The source code is available from GitHub.
Got a question or feedback? Let me know by mailing maarten.wullink@sidn.nl.