Configuration of Hadoop through Ansible.

Abhinav Shukla
7 min read · Apr 4, 2021

This article will help you configure a Hadoop NameNode and DataNode using Ansible.

Task Description:

🔰 We will configure Hadoop and start the cluster services using an Ansible Playbook.

What is Big Data?

Big Data, in layman's terms, is a huge volume of structured, semi-structured, and unstructured data that keeps growing with time. This data can be tracked and mined for analysis or research purposes. Big Data is usually described by three characteristics, the three Vs:

  • Volume: The name Big Data itself suggests a large amount of data. The size of the data is a key factor in deciding whether a dataset counts as "Big Data", which makes Volume one of its defining characteristics.
  • Velocity: Velocity is the speed at which data is generated. How quickly data is generated and processed determines its real potential, and the huge, continuous flow of data makes Velocity another defining characteristic of Big Data.
  • Variety: Data comes in various forms: structured, unstructured, numeric, etc. Earlier, spreadsheets and databases were considered data; now PDFs, emails, audio, and more are analyzed as well.

What Is Hadoop?

Hadoop is an open-source, Java-based framework used for storing and processing big data. The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system (HDFS) enables concurrent processing and fault tolerance. Developed by Doug Cutting and Michael J. Cafarella, Hadoop uses the MapReduce programming model for faster storage and retrieval of data from its nodes.

What is a DataNode?

DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, an inexpensive system that need not be of high quality or high availability. A DataNode is a block server that stores the data in a local file system such as ext3 or ext4.

Functions of DataNode:

  • These are slave daemons or processes that run on each slave machine.
  • The actual data is stored on the DataNodes.
  • DataNodes serve the low-level read and write requests from the file system's clients.
  • They periodically send heartbeats to the NameNode to report the overall health of HDFS; by default, this frequency is set to 3 seconds (see the configuration sketch below).
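
This heartbeat interval is tunable. A minimal sketch, assuming the standard dfs.heartbeat.interval property in hdfs-site.xml (value in seconds; 3 is the default):

<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
</property>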

What is a NameNode?

NameNode is the master node in the Apache Hadoop HDFS architecture; it maintains and manages the blocks present on the DataNodes (slave nodes). The NameNode is a highly available server that manages the file system namespace and controls clients' access to files. I will be discussing this High Availability feature of Apache Hadoop HDFS in my next blog. The HDFS architecture is built in such a way that user data never resides on the NameNode; the data resides on the DataNodes only.

Functions of NameNode:

  • It is the master daemon that maintains and manages the DataNodes (slave nodes).
  • It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, the size of the files, permissions, hierarchy, etc. Two files are associated with the metadata:
    • FsImage: Contains the complete state of the file system namespace since the start of the NameNode.
    • EditLogs: Contains all the recent modifications made to the file system with respect to the most recent FsImage.
  • It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode immediately records this in the EditLog.
  • It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
  • It keeps a record of all the blocks in HDFS and the nodes on which these blocks are located.
  • The NameNode is also responsible for maintaining the replication factor of all the blocks (see the sketch after this list).
  • In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
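
The replication factor mentioned above is controlled by a single property. A minimal sketch, assuming the standard dfs.replication property in hdfs-site.xml (3 is the usual default):

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>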

Step-1)

First we have to configure our inventory file. In the inventory file we will define the IP addresses of our NameNode and DataNode.
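
A minimal sketch of such an inventory; the group names, IP addresses, user, and password below are placeholders for your own setup:

[namenode]
192.168.1.10 ansible_user=root ansible_ssh_pass=redhat

[datanode]
192.168.1.11 ansible_user=root ansible_ssh_pass=redhat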

Now we will test whether our hosts are reachable. We can check this using Ansible's ping module.
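
For example, assuming the inventory path is already set in ansible.cfg:

ansible all -m ping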

Here all the hosts are pingable.

Step-2)

Now we will configure the NameNode first.

Before installing Hadoop on the NameNode, we have to install the JDK.

- hosts: namenode
  tasks:
    - name: Copy JDK file
      copy:
        src: /root/Desktop/jdk-8u171-linux-x64.rpm
        dest: /root
    - name: Copy HADOOP file
      copy:
        src: /root/Desktop/hadoop-1.2.1-1.x86_64.rpm
        dest: /root
    - name: Installation of JDK
      shell: rpm -i /root/jdk-8u171-linux-x64.rpm
    - name: Installation of Hadoop
      shell: rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force
    - file:
        path: /nn
        state: directory

In this block of code we install the JDK and Hadoop, and also create an empty directory (/nn) that will hold the NameNode's metadata.

After this, we will configure our NameNode. To do this we have to make some changes in hdfs-site.xml and core-site.xml.

    - lineinfile:
        path: /etc/hadoop/hdfs-site.xml
        insertafter: "<configuration>"
        line: "<property> <name>dfs.name.dir</name> <value>/nn</value> </property>"
    - lineinfile:
        path: /etc/hadoop/core-site.xml
        insertafter: "<configuration>"
        line: "<property> <name>fs.default.name</name> <value>hdfs://0.0.0.0:9001</value> </property>"

In these files, we add some lines through which we tell our node that it should work as the NameNode and control the other nodes.

For adding these lines we use the lineinfile module.

After the addition of these lines in hdfs-site.xml and core-site.xml, we will format the directory we created above, and after formatting we will finally start our NameNode.

    - name: Format the NameNode directory
      shell: echo Y | hadoop namenode -format
    - name: Start the NameNode daemon
      shell: hadoop-daemon.sh start namenode

Full code for NameNode:
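
Putting the snippets together, the complete NameNode playbook would look roughly like this (a sketch assembled from the pieces above):

- hosts: namenode
  tasks:
    - name: Copy JDK file
      copy:
        src: /root/Desktop/jdk-8u171-linux-x64.rpm
        dest: /root
    - name: Copy HADOOP file
      copy:
        src: /root/Desktop/hadoop-1.2.1-1.x86_64.rpm
        dest: /root
    - name: Installation of JDK
      shell: rpm -i /root/jdk-8u171-linux-x64.rpm
    - name: Installation of Hadoop
      shell: rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force
    - file:
        path: /nn
        state: directory
    - lineinfile:
        path: /etc/hadoop/hdfs-site.xml
        insertafter: "<configuration>"
        line: "<property> <name>dfs.name.dir</name> <value>/nn</value> </property>"
    - lineinfile:
        path: /etc/hadoop/core-site.xml
        insertafter: "<configuration>"
        line: "<property> <name>fs.default.name</name> <value>hdfs://0.0.0.0:9001</value> </property>"
    - name: Format the NameNode directory
      shell: echo Y | hadoop namenode -format
    - name: Start the NameNode daemon
      shell: hadoop-daemon.sh start namenode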

Now just run the Playbook and our NameNode will be configured.

To run the playbook:

ansible-playbook <File_Name.yml>
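
Optionally, you can verify on the NameNode machine that the daemon is actually running, using the JDK's jps tool, which lists running Java processes:

jps

The output should include a NameNode process.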

Step-3)

DataNode Configuration

Just as we configured the NameNode, we will now configure the DataNode.

To connect the DataNode to the NameNode, we need the NameNode's IP.

- hosts: datanode
  vars_prompt:
    - name: namenode_IP
      prompt: "Enter NameNode IP Address:"
      private: no

Here we ask the user for the NameNode's IP and store it in a variable.

  tasks:
    - name: Copy JDK file
      copy:
        src: /root/Desktop/jdk-8u171-linux-x64.rpm
        dest: /root
    - name: Copy HADOOP file
      copy:
        src: /root/Desktop/hadoop-1.2.1-1.x86_64.rpm
        dest: /root
    - name: Installation of JDK
      shell: rpm -i /root/jdk-8u171-linux-x64.rpm
    - name: Installation of Hadoop
      shell: rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force
    - file:
        path: /dn
        state: directory

This block of code copies the JDK and Hadoop packages, installs them, and creates an empty directory (/dn).

After this, we will configure our DataNode. To do this we have to make some changes to hdfs-site.xml and core-site.xml.

In these files, we add some lines through which we tell our node that it should work as a DataNode and where the NameNode is.

For adding these lines we again use the lineinfile module.

    - lineinfile:
        path: /etc/hadoop/hdfs-site.xml
        insertafter: "<configuration>"
        line: "<property> <name>dfs.data.dir</name> <value>/dn</value> </property>"
    - lineinfile:
        path: /etc/hadoop/core-site.xml
        insertafter: "<configuration>"
        line: "<property> <name>fs.default.name</name> <value>hdfs://{{ namenode_IP }}:9001</value> </property>"

After the addition of these lines in hdfs-site.xml and core-site.xml, we only have to start our DataNode.

    - name: Start the DataNode daemon
      shell: hadoop-daemon.sh start datanode

This block of code will start our DataNode.

Full code for DataNode Playbook:
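
Putting the snippets together, the complete DataNode playbook would look roughly like this (a sketch assembled from the pieces above):

- hosts: datanode
  vars_prompt:
    - name: namenode_IP
      prompt: "Enter NameNode IP Address:"
      private: no
  tasks:
    - name: Copy JDK file
      copy:
        src: /root/Desktop/jdk-8u171-linux-x64.rpm
        dest: /root
    - name: Copy HADOOP file
      copy:
        src: /root/Desktop/hadoop-1.2.1-1.x86_64.rpm
        dest: /root
    - name: Installation of JDK
      shell: rpm -i /root/jdk-8u171-linux-x64.rpm
    - name: Installation of Hadoop
      shell: rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force
    - file:
        path: /dn
        state: directory
    - lineinfile:
        path: /etc/hadoop/hdfs-site.xml
        insertafter: "<configuration>"
        line: "<property> <name>dfs.data.dir</name> <value>/dn</value> </property>"
    - lineinfile:
        path: /etc/hadoop/core-site.xml
        insertafter: "<configuration>"
        line: "<property> <name>fs.default.name</name> <value>hdfs://{{ namenode_IP }}:9001</value> </property>"
    - name: Start the DataNode daemon
      shell: hadoop-daemon.sh start datanode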

After this, we only have to run the playbook.

To run the playbook:

ansible-playbook <filename>

Now let’s check whether our DataNode is connected to the NameNode.
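
One way to check is the dfsadmin report (a standard Hadoop 1.x command), run from any node of the cluster:

hadoop dfsadmin -report

The report lists the live DataNodes registered with the NameNode; ours should appear there.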

Our DataNode is connected to the NameNode.

Github Link:

Thanks For Reading.
