Hadoop is a software framework used to process huge amounts of data. When we say huge, we are talking magnitudes of terabytes to petabytes of raw data. The data is processed and analyzed using the MapReduce algorithm: essentially a set of Java programs in which a master node coordinates a set of worker nodes, each of which processes a subset of the data to produce meaningful results. We are not going to go into the details of MapReduce, but will set up Hadoop on our local machine to get a hang of how it works.
There are several customized distributions of Hadoop, some of which are mentioned below:
- Apache Hadoop
- Cloudera’s Distribution including Apache Hadoop (that’s the official name)
- IBM Distribution of Apache Hadoop
- DataStax Brisk
- Amazon Elastic MapReduce
In this post we will be covering the installation of Apache Hadoop, which is by far the most popular Hadoop distribution currently available. Hadoop primarily runs in three modes:
- Stand-alone mode: In this mode, all the map-reduce daemons (processes) run in a single JVM. (not for production use)
- Pseudo-distributed: In this mode, the daemons run in separate JVMs on one local machine, much like a "pseudo" cluster, hence the name. (not for production use)
- Fully distributed: In this mode, the daemons run on a cluster of machines. (for production use)
Here, we would just cover the pseudo-distributed mode.
Hadoop has a distributed file system, the Hadoop Distributed File System (HDFS), which maintains all the data. It consists of:
- Namenode: the master node, which holds all the metadata, i.e. the locations of the nodes where the file blocks are stored.
- Datanodes: they hold the actual files, split into blocks of 64MB by default. For example, a 200MB file is stored as three 64MB blocks plus one 8MB block, and the namenode records which datanodes hold each block.
- Job tracker: runs on the master node and takes care of distributing and managing the Map and Reduce jobs via the task trackers.
- Task trackers: they run the map and reduce tasks assigned by the job tracker, and run on the datanodes.
Finally, there is a Secondary Namenode, which keeps a backup of the namenode's metadata. It cannot replace the namenode, but it can help in building a new one in case of a failure.
With three easy steps, you can get to a pseudo-distributed Hadoop:
- Download Hadoop
- Set up the pre-requisites
- Run Hadoop
1. Download Hadoop
Download the latest release of Hadoop from the Apache Hadoop website (http://hadoop.apache.org/):
Now, the downloaded tar file can be extracted to any location; here, I have used /opt.
We use the tar command to untar the file:
admin@server:/opt$ tar -xzf hadoop-0.20.203.0rc1.tar.gz
2. Set up the pre-requisites
a) Java 1.6 or above
Hadoop is built and tested on the Sun JDK, and other JDKs may not give the same performance. OpenJDK cannot be used to compile Hadoop MapReduce code in branch-0.23 and beyond.
In order to install Sun Java on Linux, first add the partner repository using the following command:
admin@server:/opt$ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
Then update the local package index by typing the following command:
admin@server:/opt$ sudo apt-get update
Now, to install the Sun JDK:
admin@server:/opt$ sudo apt-get install sun-java6-jdk sun-java6-plugin
Now the system has to start using the newly installed JDK:
admin@server:/opt$ sudo update-java-alternatives -s java-6-sun
In order to check that the JDK was installed, type the following command:
admin@server:/opt$ java -version
The output will show the installed Java version; it should report the Sun JDK, version 1.6.x, confirming the installation.
Then we need to update the ~/.bashrc file to tell the machine where we have installed Hadoop and to also set the JAVA_HOME variable. So, the following lines are added to the ~/.bashrc file (adjust HADOOP_INSTALL if your tarball extracted to a directory with a different name):
export HADOOP_INSTALL=/opt/hadoop-0.20.203
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PATH=$PATH:$HADOOP_INSTALL/bin
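Reload ~/.bashrc so the new variables take effect in the current session:
admin@server:/opt$ source ~/.bashrc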
While running Hadoop in pseudo-distributed mode, we require ssh: the Hadoop control scripts use it to log in to localhost and start the daemon JVMs. So, we use the following command.
admin@server:/opt$ sudo apt-get install ssh
Try connecting to the local host by using the following command:
admin@server:/opt$ ssh localhost
The first time you will be prompted for a password.
We will also create a dedicated user called "hadoop" using the following command:
admin@server:/opt$ sudo adduser --system --shell /bin/bash --gecos 'hadoop' --group --disabled-password --home /home/hadoop hadoop
The rest of the commands are executed after logging in as the hadoop user. To log in as the "hadoop" user, use the following command.
admin@server:/opt$ sudo su - hadoop
So, in order to use ssh without a passphrase, the following two commands need to be executed.
First, to generate the ssh keypair, use the ssh-keygen command below:
hadoop@server:~$ ssh-keygen -t rsa -P ""
Then append the public key (id_rsa.pub) to the file ~/.ssh/authorized_keys using the command:
hadoop@server:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now, ssh is also configured: key-based login to localhost will work without a passphrase.
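To verify, connect to localhost once more; this time there should be no password prompt:
hadoop@server:~$ ssh localhost
hadoop@server:~$ exit
(The exit brings you back out of the nested ssh session.)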
The pre-requisites are done and we have completed the first two stages already!
3. Run Hadoop
To begin with, we need to set the JAVA_HOME variable in the hadoop-env.sh file present in the hadoop-0.20.*/conf directory. We need to uncomment the JAVA_HOME line and point it at the same JDK path we used in ~/.bashrc:
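export JAVA_HOME=/usr/lib/jvm/java-6-sun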
Next, we need to make changes to three configuration files present in the same directory. We will edit core-site.xml, mapred-site.xml and hdfs-site.xml, starting with core-site.xml:
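A minimal pseudo-distributed core-site.xml points the default filesystem at an HDFS instance on the local machine; hdfs://localhost:9000 below is the conventional address, not a requirement:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>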
Then we edit the mapred-site.xml:
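Here we tell Hadoop where the job tracker runs; localhost:9001 is again the conventional choice:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>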
Finally, in hdfs-site.xml:
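Since we have only a single datanode, we lower the block replication factor from the default of 3 to 1, so that HDFS does not report blocks as under-replicated:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>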
That’s it, now we can start running hadoop.
But, before we go ahead and start Hadoop, we will create a symlink to the Hadoop directory in /opt using the following command (replace * with the version you are using):
admin@server:/opt$ ln -s hadoop-0.20.* hadoop
We will start by formatting the HDFS.
The formatting process creates an empty filesystem by creating the storage directories and the initial versions of the namenode’s persistent data structures. Datanodes are not involved in the initial formatting process, since the namenode manages all of the filesystem’s metadata, and datanodes can join or leave the cluster dynamically.
This is done by using the following commands:
admin@server:/opt$ sudo su - hadoop
hadoop@server:~$ cd /opt/hadoop
hadoop@server:/opt/hadoop$ bin/hadoop namenode -format
You will get the namenode's startup log as output; it should end with a message saying that the storage directory has been successfully formatted.
Thus we have successfully formatted the HDFS. We can start Hadoop by calling start-all.sh: navigate to the Hadoop (installation) folder and then use:
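hadoop@server:/opt/hadoop$ bin/start-all.sh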
Alternatively, we can start the HDFS and MapReduce daemons separately:
hadoop@server:/opt/hadoop$ bin/start-dfs.sh
hadoop@server:/opt/hadoop$ bin/start-mapred.sh
Now, Hadoop is running in pseudo-distributed mode, and in order to check that all the daemons are running, we use the jps command:
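hadoop@server:/opt/hadoop$ jps
If everything started correctly, the list should include NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (each preceded by its process id), along with Jps itself.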
To stop the daemons, we can call:
hadoop@server:/opt/hadoop$ bin/stop-dfs.sh
hadoop@server:/opt/hadoop$ bin/stop-mapred.sh
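Or, mirroring start-all.sh, everything can be stopped in one go:
hadoop@server:/opt/hadoop$ bin/stop-all.sh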
Cheers!! With those quick and easy steps, we have installed and run Hadoop in pseudo-distributed mode.