How to Deploy a Cluster

 

In this blog post I will talk about how to deploy a cluster, the methods I tried, and how I solved the prerequisites problem.

I’m fairly new to the big data field. While learning about Hadoop, I kept hearing about “clusters”: deploying a cluster, installing some services on the namenode, others on the datanodes, and so on. I also heard about Cloudera Manager, which helps deploy services on a cluster, so I set up a VM and followed several tutorials, including the Cloudera documentation, to install Cloudera Manager. However, every time I reached the “Cluster Installation” step, the installation failed. I later found out that a Cloudera Manager installation has several prerequisites, and the missing prerequisites were the reason for the failure.

Deploy a Cluster

I tried four methods. I describe method 3 in detail, but ultimately I recommend method 4, described below.

Method 1:

I ran my very first MapReduce job on the “labianchin/hadoop” Docker image. It was easy to use because it comes with Hadoop pre-installed, but it did not contain any other services. I manually installed Hive and Oozie, but soon ran into configuration issues.

Method 2:

A lot of people recommend using Cloudera’s QuickStart VM. Sadly, my system did not meet the memory requirements, which led me to look for other methods.

Method 3:

Cloudera Manager installation on Google Cloud. I discuss this method in detail below.

Method 4:

Cloudera Docker image, released on December 1st, 2015. This is by far the quickest and easiest way to use Cloudera services. All services are pre-installed and pre-configured. I recommend creating a single Google Compute Engine VM instance as described below, then installing Docker, pulling the image, and running the container.
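The pull-and-run part of Method 4 can be sketched roughly as follows. This is a minimal sketch, assuming Docker is already installed on the VM and that the image in question is `cloudera/quickstart` on Docker Hub; the image name, the published Hue port, and the `DRY_RUN` switch are assumptions rather than details from this post.

```shell
#!/bin/sh
# Sketch of Method 4: pull the Cloudera image and start a container.
# Assumption: the image is cloudera/quickstart from Docker Hub.
IMAGE="${IMAGE:-cloudera/quickstart}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

start_cloudera_container() {
    run docker pull "$IMAGE"
    # The image expects this hostname and privileged mode;
    # -p 8888:8888 publishes Hue on the VM.
    run docker run --hostname=quickstart.cloudera --privileged=true \
        -t -i -p 8888:8888 "$IMAGE" /usr/bin/docker-quickstart
}

start_cloudera_container
```

By default the script only prints the two commands so they can be reviewed first; running it with `DRY_RUN=0` executes them.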

 

Step-by-Step Guide to Cloudera Manager Installation on Google Cloud (Method 3)

I will be creating a cluster consisting of 4 nodes. One node will run Cloudera Manager and the remaining 3 nodes will run services.

Create a Google Compute Engine VM instance:

  1. From the Developers console, under Resources, select ‘Compute Engine’.
  2. Click ‘New instance’ at the top of the page.
  3. Enter an instance name.
  4. Select a time zone.
  5. Select the machine type ‘n1-highmem-2’ (2 vCPUs, 13 GB RAM) with an 80 GB disk and the CentOS 6.7 image (change the disk size according to your requirements).
  6. Create.
  7. When the instance is ready, select it and open in Edit Mode.
  8. Scroll down to SSH Keys and enter your public key. If you do not have a public key, run the following commands or generate one using PuTTYgen:
    • ssh-keygen -t rsa
    • cat ~/.ssh/id_rsa.pub
  9. Save.
  10. Create 3 clones of this instance.
  11. Start all 4 instances.
  12. Follow the steps in “Repartition a root persistent disk” to expand your disk to the allotted size [Repeat on all instances].
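For readers who prefer the command line to the console, instance creation might look roughly like the sketch below. It assumes the `gcloud` CLI is installed and authenticated. The zone, the node names other than cloudera-cm, and the `centos-6` image family are placeholders: CentOS 6 images may no longer be offered, so substitute whatever image is available to you. Adding the SSH key and resizing the disk are not covered here.

```shell
#!/bin/sh
# Sketch: create the four instances with gcloud instead of the console.
# Assumptions: zone, node names and image family are placeholders.
ZONE="${ZONE:-us-central1-a}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

create_cluster_vms() {
    for name in cloudera-cm cloudera-node1 cloudera-node2 cloudera-node3; do
        run gcloud compute instances create "$name" \
            --zone "$ZONE" \
            --machine-type n1-highmem-2 \
            --image-family centos-6 --image-project centos-cloud \
            --boot-disk-size 80GB
    done
}

create_cluster_vms
```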

Prerequisites:

  • To allow the nodes to reach each other by hostname (for example over SSH), edit the /etc/hosts file on each node to list every node in the cluster. Below is an example [Repeat on all instances]:

View the code on Gist.
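The original snippet lives in the Gist. As a rough illustration of the idea, the sketch below appends one line per node to a hosts file and skips names that are already present. The IP addresses and the names cloudera-node1 to cloudera-node3 are hypothetical placeholders; use the internal IPs and hostnames of your own instances.

```shell
#!/bin/sh
# Sketch: append cluster entries to a hosts file, skipping existing names.
add_host() {
    # usage: add_host <hosts-file> <ip> <hostname>
    hosts_file="$1"; ip="$2"; name="$3"
    if ! grep -qw -- "$name" "$hosts_file" 2>/dev/null; then
        printf '%s\t%s\n' "$ip" "$name" >> "$hosts_file"
    fi
}

# Demonstration on a scratch file, so the sketch is safe to run anywhere.
demo="$(mktemp)"
add_host "$demo" 10.240.0.2 cloudera-cm      # placeholder addresses
add_host "$demo" 10.240.0.3 cloudera-node1
add_host "$demo" 10.240.0.4 cloudera-node2
add_host "$demo" 10.240.0.5 cloudera-node3
cat "$demo"
rm -f "$demo"

# On a real node, as root: add_host /etc/hosts <internal-ip> <hostname>
```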

  • Set swappiness to the minimum value that does not disable it [Repeat on all instances]:

View the code on Gist.
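One way this step could look, assuming the value intended is 1 (the lowest setting that still leaves swapping enabled on CentOS 6 kernels): apply it with `sysctl` and persist it in `/etc/sysctl.conf`. The helper name `set_swappiness_conf` is made up for this sketch.

```shell
#!/bin/sh
# Sketch: persist vm.swappiness in a sysctl configuration file.
set_swappiness_conf() {
    # usage: set_swappiness_conf <sysctl-conf-file> <value>
    conf="$1"; value="$2"
    touch "$conf"
    if grep -q '^vm\.swappiness' "$conf"; then
        # Replace the existing setting (via a temp file for portability).
        sed "s/^vm\.swappiness.*/vm.swappiness = $value/" "$conf" > "$conf.tmp" &&
            mv "$conf.tmp" "$conf"
    else
        echo "vm.swappiness = $value" >> "$conf"
    fi
}

# Demonstration on a scratch file, so the sketch is safe to run anywhere.
demo="$(mktemp)"
echo "vm.swappiness = 60" > "$demo"
set_swappiness_conf "$demo" 1
cat "$demo"
rm -f "$demo"

# On a real node, as root:
#   sysctl -w vm.swappiness=1                 # apply immediately
#   set_swappiness_conf /etc/sysctl.conf 1    # persist across reboots
```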

  • Disable iptables:

View the code on Gist.
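On CentOS 6, stopping the firewall and keeping it off after a reboot is usually done with `service` and `chkconfig`. The sketch below assumes that approach and that it should be repeated on every node; the `DRY_RUN` switch is an addition for safety, not part of the original steps.

```shell
#!/bin/sh
# Sketch: stop iptables now and keep it off after reboot (CentOS 6 style).
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

disable_iptables() {
    run service iptables stop     # stop the running firewall
    run chkconfig iptables off    # do not start it at boot
}

disable_iptables
```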

  • Disable redhat_transparent_hugepage:

View the code on Gist.
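A possible shape for this step: on CentOS 6 the transparent hugepage settings are exposed under `/sys/kernel/mm/redhat_transparent_hugepage`, and writing `never` to them turns the feature off. The function name `disable_thp` and the choice to set both `enabled` and `defrag` are assumptions of this sketch.

```shell
#!/bin/sh
# Sketch: turn off transparent hugepages by writing "never" to the kernel knobs.
disable_thp() {
    # usage: disable_thp [directory]
    thp_dir="${1:-/sys/kernel/mm/redhat_transparent_hugepage}"
    for knob in enabled defrag; do
        if [ -w "$thp_dir/$knob" ]; then
            echo never > "$thp_dir/$knob"
        else
            echo "skipping $thp_dir/$knob (missing or not writable)" >&2
        fi
    done
}

# On a real node, as root:
#   disable_thp
# The setting does not survive a reboot; one common approach is to
# repeat the same echo lines in /etc/rc.local.
```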

  • Install MySQL:

View the code on Gist.

  • Install Java:

View the code on Gist.

  • Download mysql-connector [Repeat on all instances]:

View the code on Gist.

  • Install Cloudera Manager:

View the code on Gist.

  • Create databases and users:

View the code on Gist.
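As an illustration only, the SQL for this step can be generated with a small helper and piped into `mysql`. The database list (scm, amon, hive, oozie, hue), the convention of naming each user after its database, and the password are assumptions; adjust them to the services you plan to install.

```shell
#!/bin/sh
# Sketch: print CREATE DATABASE / GRANT statements for a list of databases.
emit_db_sql() {
    # usage: emit_db_sql <password> <db-name>...
    password="$1"; shift
    for db in "$@"; do
        printf "CREATE DATABASE %s DEFAULT CHARACTER SET utf8;\n" "$db"
        # Each user is named after its database; %% prints a literal %.
        printf "GRANT ALL ON %s.* TO '%s'@'%%' IDENTIFIED BY '%s';\n" \
            "$db" "$db" "$password"
    done
}

# Print the statements for review.
emit_db_sql 'change-me' scm amon hive oozie hue

# To apply them on the MySQL node:
#   emit_db_sql 'change-me' scm amon hive oozie hue | mysql -u root -p
```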

  • Update database name, user and password in ‘db.properties’:

View the code on Gist.
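The file in question is normally `/etc/cloudera-scm-server/db.properties`. A sketch of updating it from the shell follows; the helper names and the example values (localhost, scm, change-me) are hypothetical, and the values must match the database and user created in the previous step.

```shell
#!/bin/sh
# Sketch: set key=value pairs in a Java-style properties file.
set_prop() {
    # usage: set_prop <file> <key> <value>
    file="$1"; key="$2"; value="$3"
    touch "$file"
    # Drop any existing line for the key, then append the new value.
    # (Dots in the key act as regex wildcards, harmless for these keys.)
    grep -v "^$key=" "$file" > "$file.tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

configure_cm_db() {
    # usage: configure_cm_db <file> <host> <db-name> <user> <password>
    set_prop "$1" com.cloudera.cmf.db.type mysql
    set_prop "$1" com.cloudera.cmf.db.host "$2"
    set_prop "$1" com.cloudera.cmf.db.name "$3"
    set_prop "$1" com.cloudera.cmf.db.user "$4"
    set_prop "$1" com.cloudera.cmf.db.password "$5"
}

# Demonstration on a scratch file, so the sketch is safe to run anywhere.
demo="$(mktemp)"
configure_cm_db "$demo" localhost scm scm change-me
cat "$demo"
rm -f "$demo"

# On the Cloudera Manager node, as root:
#   configure_cm_db /etc/cloudera-scm-server/db.properties localhost scm scm '<password>'
```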

  • Start the Cloudera Manager server and watch the logs until “Jetty Server started” appears. This may take a while:

View the code on Gist.
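Instead of watching the log by eye, a small polling helper can wait for the message. This is a sketch under assumptions: the service is named `cloudera-scm-server`, its log is `/var/log/cloudera-scm-server/cloudera-scm-server.log`, and the exact wording of the Jetty message may differ between versions, which is why the match is case-insensitive on a short phrase. The helper `wait_for_line` is made up for illustration.

```shell
#!/bin/sh
# Sketch: poll a log file until a phrase appears or a timeout expires.
wait_for_line() {
    # usage: wait_for_line <log-file> <phrase> <timeout-seconds>
    log="$1"; phrase="$2"; timeout="$3"; waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if grep -qi -- "$phrase" "$log" 2>/dev/null; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # Final check; the exit status tells the caller whether it was found.
    grep -qi -- "$phrase" "$log" 2>/dev/null
}

# Demonstration with a made-up log line in a scratch file.
demo="$(mktemp)"
echo "INFO WebServerImpl: Jetty Server started" > "$demo"
if wait_for_line "$demo" "jetty server" 5; then
    echo "server is up"
fi
rm -f "$demo"

# On the Cloudera Manager node, as root:
#   service cloudera-scm-server start
#   wait_for_line /var/log/cloudera-scm-server/cloudera-scm-server.log "jetty server" 600
```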

  • Access Cloudera Manager from the browser to complete the installation:
    • Install PuTTY.
    • Open your private key file in Pageant. By default it should be located at C:/users/username/.ssh/filename.ppk. The Pageant icon will appear in the system tray.
    • In PuTTY, enter the external IP of the VM instance where the Cloudera Manager server is running as the hostname.
    • In the Category pane on the left, go to Connection > SSH > Tunnels.
    • Enter the internal IP of the VM instance in the Destination textbox with port 7180, e.g. 10.240.0.2:7180.
    • To make the ports easy to remember, set the source port to 7180 (the same as the destination port). You can choose another source port if 7180 is not available. 7180 is the default port for Cloudera Manager.
    • Click Add, then Open the connection.
    • Open the browser and go to “localhost:7180”.
    • Proceed with the cluster installation using Cloudera Manager.
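On Linux or macOS, the same tunnel can be opened with OpenSSH instead of PuTTY. In the sketch below, `REMOTE_USER` and `EXTERNAL_IP` are hypothetical placeholders (the address shown comes from a documentation-only range), and 10.240.0.2 is the example internal IP used above.

```shell
#!/bin/sh
# Sketch: forward local port 7180 to Cloudera Manager through SSH.
REMOTE_USER="${REMOTE_USER:-your-user}"       # user from your SSH key
EXTERNAL_IP="${EXTERNAL_IP:-203.0.113.10}"    # placeholder, replace it
INTERNAL_IP="${INTERNAL_IP:-10.240.0.2}"
CM_PORT="${CM_PORT:-7180}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command only, 0 = execute it

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

open_cm_tunnel() {
    # -N: no remote command, only forwarding
    # -L local_port:destination_host:destination_port
    run ssh -N -L "$CM_PORT:$INTERNAL_IP:$CM_PORT" "$REMOTE_USER@$EXTERNAL_IP"
}

open_cm_tunnel
```

With the tunnel open, “localhost:7180” in the browser reaches Cloudera Manager exactly as in the PuTTY setup.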

Cluster Installation:

  • Log in with username “admin” and password “admin”.
  • Accept the Terms and Conditions. Then continue.
  • Select “Cloudera Express” or “Cloudera Enterprise Data Hub Edition Trial”. Then continue.
  • Search for your machines’ hostnames.
  • On the “Cluster Installation” page continue with all default settings.
  • On the “JDK Installation Options” page select “Install Oracle Java SE Development Kit (JDK)” and “Install Java Unlimited Strength Encryption Policy Files”. Continue.
  • Do not select “Single User Mode”. Continue.
  • On the “Provide SSH Login credentials” page, select Login to all hosts as ‘Another User’ with Authentication method ‘All hosts accept same private key’. Enter the username from the SSH key that was added to the GCE instance (this is the same user that logged in to the PuTTY session). Select the private key file stored on your local machine. Continue without a passphrase.
  • On the next page, the cluster installation may take a while. (NOTE: If you install Cloudera Manager without following the prerequisites, the installation will fail at this step.)
  • Once the installation is complete, continue to install parcels and inspect hosts. Finally, continue to the Cluster Setup.
  • Select ‘Core Hadoop’ when asked to select CDH5 services.
  • When assigning roles, I like to assign all ‘Cloudera Management Service’ roles to one node (in my case ‘cloudera-cm’) and distribute all other roles evenly across the remaining 3 nodes.
  • On the Database Setup page, set all ‘Database Host Name’ fields to the node running the Cloudera Manager server. Enter the database names, usernames and passwords that were created in MySQL earlier.
  • Review the changes. The cluster will now be set up and the services deployed. This is the final step.
  • You are now ready to use services directly from console or access Hue on port 8888. Good Luck :)

 

P.S. I would like to thank Manoj Kukreja for showing me the right way of deploying clusters.

 
