Assignment 1 Write-up
Assignment No.1
Title: To perform Single node/Multiple node Hadoop Installation.
Objective:
1. To configure Hadoop on open-source software (Ubuntu Linux).
Theory:
Hadoop
Hadoop is an open-source software framework written in Java for distributed storage and
distributed processing of very large data sets on computer clusters built from commodity hardware.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of
individual machines or racks of machines) are common and thus should be automatically handled in
software by the framework.
Traditional Approach
In this approach, an enterprise will have a computer to store and process big data. Here data will be
stored in an RDBMS like Oracle Database, MS SQL Server or DB2 and sophisticated software can
be written to interact with the database, process the required data and present it to the users for
analysis purposes.
Limitation
This approach works well when the volume of data is small enough to be accommodated by standard
database servers, or stays within the limit of the processor that is handling the data. But when it comes to
dealing with huge amounts of data, it is a tedious task to process it through a traditional database server.
Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small
parts and assigns those parts to many computers connected over the network, and collects the results to form
the final result dataset.
Hadoop
Doug Cutting, Mike Cafarella and team took the solution provided by Google and
started an Open Source Project called HADOOP in 2005 and Doug named it after
his son's toy elephant. Now Apache Hadoop is a registered trademark of the Apache
Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel on different CPU nodes. In short, the Hadoop framework makes it
possible to develop applications that run on clusters of computers and perform
complete statistical analysis of huge amounts of data.
2. Installing SSH
ssh has two main components:
1. ssh : The command we use to connect to remote machines - the client.
2. sshd : The daemon that is running on the server and allows clients to connect to the server.
The ssh client is usually pre-enabled on Linux, but in order to start the sshd daemon, we need to install the
ssh package first. Use this command to do that:
rashmi@laptop:~$ sudo apt-get install ssh
This will install ssh on our machine. If we get output similar to the following, we can assume it is
set up properly:
rashmi@laptop:~$ which ssh
/usr/bin/ssh
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine. For
our single-node setup of Hadoop, we therefore need to configure SSH access to localhost.
So, we need to have SSH up and running on our machine and configure it to allow SSH public-key
authentication.
Hadoop uses SSH (to access its nodes) which would normally require the user to enter a password.
However, this requirement can be eliminated by creating and setting up SSH certificates using the
following commands. If asked for a filename just leave it blank and press the enter key to continue.
rashmi@laptop:~$ ssh-keygen -t rsa -P ""
The second command adds the newly created key to the list of authorized keys so that Hadoop can
use ssh without prompting for a password.
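Assuming the key was generated with the default file name (id_rsa) as above, this is typically done with:
rashmi@laptop:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys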
We can check if ssh works:
rashmi@laptop:~$ ssh localhost
4. Install Hadoop
We move the extracted Hadoop directory to /usr/local/hadoop using the following command:
rashmi@laptop:~$ sudo mv hadoop/ /usr/local/
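1. ~/.bashrc
Before editing the configuration files, the Hadoop environment variables are usually appended to ~/.bashrc
so that the Hadoop scripts used later (for example start-all.sh) are found on the PATH. The lines below are
a minimal sketch; the JAVA_HOME path is an example and depends on the JDK actually installed on the machine:
rashmi@laptop:~$ gedit ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
rashmi@laptop:~$ source ~/.bashrc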
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
We need to set JAVA_HOME by modifying hadoop-env.sh file.
rashmi@laptop:~$ gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh
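In this file, locate the JAVA_HOME line and set it to the path of the installed JDK. The path shown here is
an example and must match the JDK actually installed on the machine:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64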
Adding the above statement in the hadoop-env.sh file ensures that the value of JAVA_HOME
variable will be available to Hadoop whenever it is started up.
3. /usr/local/hadoop/etc/hadoop/core-site.xml:
The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that Hadoop
uses when starting up. This file can be used to override the default settings that Hadoop starts with.
rashmi@laptop:~$ sudo mkdir -p /app/hadoop/tmp
rashmi@laptop:~$ sudo chown rashmi:rashmi /app/hadoop/tmp
Open the file and enter the following in between the <configuration> </configuration> tag:
rashmi@laptop:~$ gedit /usr/local/hadoop/etc/hadoop/core-site.xml
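A typical single-node configuration is sketched below; hadoop.tmp.dir points to the /app/hadoop/tmp
directory created above, and the HDFS URI (host and port) is an example choice:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.</description>
</property>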
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml
By default, the /usr/local/hadoop/etc/hadoop/ folder contains
/usr/local/hadoop/etc/hadoop/mapred-site.xml.template
file which has to be renamed/copied with the name mapred-site.xml:
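Using the paths above, the copy can be done as follows:
rashmi@laptop:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml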
The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration> </configuration>
tag:
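Since the cluster is later started with start-yarn.sh, a reasonable choice is to run MapReduce on YARN.
A minimal sketch of that property is:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>The runtime framework for executing MapReduce jobs.</description>
</property>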
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the
cluster that is being used. It is used to specify the directories which will be used by the namenode
and the datanode on that host.
Before editing this file, we need to create two directories which will contain the namenode and the
datanode for this Hadoop installation. This can be done using the following commands:
rashmi@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
rashmi@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
rashmi@laptop:~$ sudo chown -R rashmi:rashmi /usr/local/hadoop_store
Open the file and enter the following content in between the <configuration> </configuration> tag:
rashmi@laptop:~$ gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
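A minimal single-node sketch, using the two directories created above and a replication factor of 1
(there is only one datanode), is:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>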
7. Starting Hadoop
Now it's time to start the newly installed single-node cluster. We can use start-all.sh, or start-dfs.sh
and start-yarn.sh separately:
rashmi@laptop:~$ start-all.sh
The startup output means that we now have a functional single-node instance of Hadoop running on
our machine.
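We can also verify which Hadoop daemons are running with the jps tool that ships with the JDK; on a
working single-node setup it should list processes such as NameNode, DataNode, SecondaryNameNode,
ResourceManager and NodeManager:
rashmi@laptop:~$ jps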
With Hadoop running, we can open its web UIs in a browser:
http://localhost:50070/ (NameNode web UI)
http://localhost:8088/ (YARN ResourceManager web UI)
Conclusion: In this way, Hadoop was installed and configured on Ubuntu.