Deploying a Hadoop Cluster on Linux VMs in Azure from an ARM Template

This article covers how to deploy a Hadoop Cluster using Apache Ambari running on Linux Virtual Machines in Azure from an ARM Template.

Overview

There are existing options in the Azure Marketplace for deploying Hadoop in Azure. The first option, for Production purposes, is HDInsight. By running a deployment from the Azure Marketplace, you can have a cluster set up and ready in less than 30 minutes. Additionally, for learning purposes, Hortonworks offers another option in the Azure Marketplace: Hortonworks Sandbox with HDP 2.4. This will install almost all of the currently available Hadoop services onto a single stand-alone VM.

In my case, I wanted the ability to deploy Hadoop in Azure on Linux VMs of any size of my choosing and to be able to control the entire deployment of Hadoop and Hadoop related services using Apache Ambari. The reason I wanted to go to this level of effort was so that I could learn Hadoop from the standpoint of both an Administrator and a Developer while being able to manage a Hadoop Cluster as a single or multi-node deployment.

Prerequisites

Before deploying the ARM Template below, make sure you have enough VM cores available in your Azure Subscription.
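
As a rough sizing aid, the core count to check against your quota is the Ambari Server VM's cores plus the cores of every Hadoop Server VM. The sketch below is only an illustration: the function name required_cores and the VM sizes in the example are assumptions, not values taken from the ARM Template.

```shell
# Hypothetical helper: total cores needed for one Ambari Server VM
# plus N Hadoop Server VMs.
# Arguments: cores on the Ambari VM, number of Hadoop VMs, cores per Hadoop VM.
required_cores() {
    echo $(( $1 + $2 * $3 ))
}

# Example with assumed sizes: one 4-core Ambari VM and three 4-core Hadoop VMs.
required_cores 4 3 4   # prints 16
```

If the Azure CLI is installed, the current usage and limits for a region can be compared against this number with az vm list-usage --location <LOCATION> --output table.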

Deploy the new Hadoop Infrastructure to Azure using an ARM Template

Note: This ARM Template should be used for learning and testing purposes, ONLY!

Clicking on the Deploy to Azure button below will deploy the following:

Single Ambari Server VM
Multiple Hadoop Server VMs

Once the Infrastructure is deployed, the following actions will take place on the Ambari Server VM:

The Ambari Server repo is downloaded and installed on the Ambari VM.
All deployed Servers will have their /etc/hosts file modified to contain the IP Address, FQDN, and Hostname of all deployed Servers.
The FQDN for each Server will be based upon the location where the ARM Template is deployed to, e.g. West Europe = westeurope.cloudapp.azure.com.
iptables and Transparent Huge Pages are disabled on all Servers.
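
To confirm on a deployed Server that Transparent Huge Pages really is disabled, the kernel's setting can be read back. This is a minimal sketch, assuming the setting lives at the mainline kernel path shown; some older RHEL/CentOS kernels use a redhat_transparent_hugepage directory instead. The function name thp_mode is illustrative.

```shell
# The kernel reports the THP setting as a list of modes with the active one
# in square brackets, e.g. "always madvise [never]".
# thp_mode prints only the bracketed (active) mode from the given file.
thp_mode() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Assumed location of the setting on the deployed Servers.
thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    echo "Transparent Huge Pages mode: $(thp_mode "$thp_file")"
else
    echo "No THP setting found at $thp_file"
fi
```

A Server prepared by the Custom Script would be expected to report never.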

Some of the options available to you if you decide to test out Hadoop using this ARM Template include:

Deploying Hadoop to VMs smaller than A3 or DS3v2.
Deploying Hadoop into a multi-node environment.
Deploying minimal Hadoop features.
Deploying only the Hadoop features you want.

While the ability to deploy a Hadoop Cluster to a set of A1 or DS1v2 VMs isn’t recommended, it is still possible. Having the opportunity to figure out several different ways to deploy Hadoop by accidentally breaking it was one of my primary reasons for writing this ARM Template. From my personal experience, breaking a product is a great way to learn it (as long as it isn’t in Production).

Accessing the deployed VMs

All of the deployed VMs are externally accessible via SSH on Port 22 from their respective Public IP Addresses.

A list of all Open Ports for all Hadoop Services can be found in the Network Security Group deployed in the Resource Group.

Retrieve the SSH Private Key and Hadoop FQDNs

In order to deploy a Hadoop Cluster from Ambari without having to manually install agents on all the Servers you want in the Hadoop Cluster, an SSH Private Key must be generated and added to all the Servers. This was already done as part of the Custom Script that ran on the Ambari Server at the very end of the ARM Template Deployment. All that remains is to retrieve the SSH Private Key.

For this section, the Name of the Ambari Server will be rei-ambarisrv-iy.westeurope.cloudapp.azure.com and the Linux User will be linuxadmin.

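Log in to the Ambari Server over SSH as the Linux User, using the Server's FQDN in the following format:

<AMBARI_SERVER_NAME>.<LOCATION>.cloudapp.azure.com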
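
The FQDN pattern above can be assembled from the Server name and the region's display name, then used as an SSH target. The sketch below is an illustration only: the function name azure_fqdn is hypothetical, and it assumes the location label is simply the display name lowercased with spaces removed, as in the West Europe example.

```shell
# Hypothetical helper: build <name>.<location>.cloudapp.azure.com.
# Arguments: Server name, region display name (e.g. "West Europe").
azure_fqdn() {
    # Assumption: location label = display name, lowercased, spaces removed.
    region=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]' | tr -d ' ')
    printf '%s.%s.cloudapp.azure.com\n' "$1" "$region"
}

# Print (rather than run) the SSH command for the example Server and User.
ambari_host=$(azure_fqdn rei-ambarisrv-iy "West Europe")
echo "ssh linuxadmin@${ambari_host}"
```

Running the printed command opens the SSH session on Port 22.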

Once you are logged in, change over to root. Type the password of the linuxadmin user when prompted.

sudo su

Run the following command to retrieve the FQDNs of all of the Hadoop Servers. This is optional, as this information can also be found on the Public IP Address resources associated with the Hadoop Servers.

cat /etc/hosts

Sample Output:

10.0.1.4 rei-ambarisrv-iy.westeurope.cloudapp.azure.com rei-ambarisrv-iy
10.0.1.5 rei-hadoopsrv-iy0.westeurope.cloudapp.azure.com rei-hadoopsrv-iy0
10.0.1.6 rei-hadoopsrv-iy1.westeurope.cloudapp.azure.com rei-hadoopsrv-iy1
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
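
Since only the FQDNs of the Hadoop Servers are needed later, they can be filtered out of the hosts file instead of being copied by hand. This is a minimal sketch. The function name hadoop_fqdns is hypothetical, and the match on hadoopsrv assumes the naming seen in the sample output.

```shell
# Hypothetical helper: print the FQDN (second field) of every hosts entry
# whose name contains "hadoopsrv" and ends in .cloudapp.azure.com.
# Argument: path to a hosts file.
hadoop_fqdns() {
    awk '$2 ~ /hadoopsrv.*\.cloudapp\.azure\.com$/ { print $2 }' "$1"
}

# On the Ambari Server this lists one Hadoop Server FQDN per line.
if [ -r /etc/hosts ]; then
    hadoop_fqdns /etc/hosts
fi
```

The result is one FQDN per line, which is convenient to keep for the cluster installation later on.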

Next, run the following command to retrieve the SSH Private Key generated during the installation of Ambari.

cat /root/.ssh/id_rsa

Make note of the SSH Private Key for the next section.
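
If you save the key to a local file rather than copying it from the terminal, a quick check that nothing was cut off can save a failed registration later. The sketch below assumes a PEM-style key whose first and last lines are BEGIN and END markers. The function name is_private_key and the file name ambari_id_rsa are hypothetical.

```shell
# Hypothetical helper: succeed only if the file starts with a
# "-----BEGIN ... PRIVATE KEY-----" line and ends with the matching END line.
# Argument: path to the saved key file.
is_private_key() {
    head -n 1 "$1" | grep -q -e '-----BEGIN .*PRIVATE KEY-----' &&
        tail -n 1 "$1" | grep -q -e '-----END .*PRIVATE KEY-----'
}

# Assumed local file name for the saved key.
key_file=./ambari_id_rsa
if [ -f "$key_file" ]; then
    if is_private_key "$key_file"; then
        echo "$key_file looks complete"
    else
        echo "$key_file is missing its BEGIN or END line"
    fi
fi
```

This only checks the first and last lines. It does not validate the key material itself.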

Deploying the Hadoop Cluster using Ambari

Apache Ambari enables system administrators to provision, manage and monitor a Hadoop cluster as well as to integrate with an existing enterprise infrastructure. By using Ambari, you can control the deployment, management, and removal of the following Hadoop related services in a Hadoop Cluster.

HDFS
YARN + MapReduce2
Tez
Hive
HBase
Pig
Sqoop
Oozie
ZooKeeper
Falcon
Storm
Flume
Accumulo
Ambari Infra
Ambari Metrics
Atlas
Kafka
Knox
Log Search
SmartSense
Spark
Spark2
Zeppelin Notebook
Mahout
Slider

For this section, the Ambari Web UI will be http://rei-ambarisrv-iy.westeurope.cloudapp.azure.com:8080.

Once the ARM Template Deployment is complete, use the following syntax to access the Ambari Web UI.

http://<AMBARI_SERVER_NAME>.<LOCATION>.cloudapp.azure.com:8080

Next, log in to the Ambari Server using the following credentials:

username: admin
password: admin
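
The same credentials can also be used against Ambari's REST API, which is served under /api/v1 on the same port as the Web UI. This gives a quick way to check from a shell that the Server is answering. The sketch below only builds and prints the URL. The helper names ambari_api_url and ambari_get are hypothetical, and the host is the example Server used in this section.

```shell
# Hypothetical helper: build an Ambari REST API URL.
# Arguments: Ambari Server FQDN, resource path under /api/v1 (e.g. hosts).
ambari_api_url() {
    printf 'http://%s:8080/api/v1/%s\n' "$1" "$2"
}

# Hypothetical helper: GET a resource using HTTP basic authentication.
# Arguments: Ambari Server FQDN, username, password, resource path.
ambari_get() {
    curl --silent --show-error --user "$2:$3" "$(ambari_api_url "$1" "$4")"
}

# Print the URL that would be queried for the example Server.
ambari_host=rei-ambarisrv-iy.westeurope.cloudapp.azure.com
echo "Would query: $(ambari_api_url "$ambari_host" hosts)"
```

Calling ambari_get "$ambari_host" admin admin hosts against a running Server should return JSON describing the hosts Ambari knows about.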

Next, click on the Launch Install Wizard button under the Create a Cluster section.

Next, type in a name for the Hadoop Cluster and click Next.

Next, make sure HDP-2.5.3.0 is already selected, scroll down to the bottom of the page and click Next.

Next, copy in the FQDN values of the Hadoop Servers that you deployed and the SSH Private Key that you retrieved earlier from the Ambari Server. Afterwards, click on the Register and Confirm button.

The Hadoop Hosts will be registered and checked for any potential issues. The entire process should only take a couple of minutes.

After the registration is completed, ignore the warnings and click on Next.

In the Choose Services section, you have the option to add or remove any of the services you want to install on the Cluster. Better still, if you choose a combination that is missing a dependency, you will be told what is missing and prompted to add it. For the purpose of this walkthrough, scroll down to the bottom of the page and click Next.

In the Assign Masters section, you have the option to assign the master components to whichever server you want them to reside on. Leave the default configuration as is, scroll down to the bottom of the page and click Next.

In the Assign Slaves and Clients section, leave the default values as is and click Next.

In the Customize Services section, any Service that has a red number beside it requires your attention. In this walkthrough, each of the flagged Services requires that you type in a password. Do this for each Service as required, scroll down to the bottom of the page and click Next.

Note: After clicking on Next, if you receive any configuration warnings, choose to proceed anyway.
