Assignment 1 Write-up
Assignment No.1
Title: To perform Single node/Multiple node Hadoop Installation.
Objective:
1. To configure Hadoop on open-source software (Ubuntu Linux).
Theory:
Hadoop
Hadoop is an open-source software framework written in Java for distributed storage and
distributed processing of very large data sets on computer clusters built from commodity hardware.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of
individual machines or racks of machines) are common and thus should be automatically handled in
software by the framework.
Traditional Approach
In this approach, an enterprise will have a computer to store and process big data. Here data will be
stored in an RDBMS like Oracle Database, MS SQL Server or DB2 and sophisticated software can
be written to interact with the database, process the required data and present it to the users for
analysis purposes.
Limitation
This approach works well when the volume of data is small enough to be accommodated by standard
database servers, or stays within the limit of the processor that is handling the data. But when it comes to
dealing with huge amounts of data, it is a tedious task to process it through a traditional database server.
Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small
parts and assigns those parts to many computers connected over the network, and collects the results to form
the final result dataset.
Hadoop
Doug Cutting, Mike Cafarella and team took the solution provided by Google and
started an Open Source Project called HADOOP in 2005 and Doug named it after
his son's toy elephant. Now Apache Hadoop is a registered trademark of the Apache
Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel on different CPU nodes. In short, the Hadoop framework makes it
possible to develop applications that run on clusters of computers and perform
complete statistical analysis of huge amounts of data.
2. Installing SSH
ssh has two main components:
1. ssh : The command we use to connect to remote machines - the client.
2. sshd : The daemon that is running on the server and allows clients to connect to the server.
The ssh client is usually pre-enabled on Linux, but in order to start the sshd daemon, we need to install the
ssh package first. Use this command to do that:
rashmi@laptop:~$ sudo apt-get install ssh
This will install ssh on our machine. If we get output similar to the following, we can assume it is
set up properly:
rashmi@laptop:~$ which ssh
/usr/bin/ssh
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine. For
our single-node setup of Hadoop, we therefore need to configure SSH access to localhost.
So, we need to have SSH up and running on our machine and configure it to allow SSH public-key
authentication.
Hadoop uses SSH (to access its nodes) which would normally require the user to enter a password.
However, this requirement can be eliminated by creating and setting up SSH certificates using the
following commands. If asked for a filename just leave it blank and press the enter key to continue.
rashmi@laptop:~$ ssh-keygen -t rsa -P ""
The second command adds the newly created key to the list of authorized keys so that Hadoop can
use ssh without prompting for a password.
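Assuming the key was generated with the default file name (id_rsa) as above, this is typically done with:
rashmi@laptop:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys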
We can check if ssh works:
rashmi@laptop:~$ ssh localhost
4. Install Hadoop
We move the extracted Hadoop directory to /usr/local/hadoop using the following command:
rashmi@laptop:~$ sudo mv hadoop/ /usr/local/
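1. ~/.bashrc
Before editing the configuration files, the Hadoop environment variables are usually appended to ~/.bashrc
so that the Hadoop scripts used later (for example start-all.sh) are found on the PATH. The lines below are
a minimal sketch; the JAVA_HOME path is an example and depends on the JDK actually installed on the machine:
rashmi@laptop:~$ gedit ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
rashmi@laptop:~$ source ~/.bashrc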
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
We need to set JAVA_HOME by modifying hadoop-env.sh file.
rashmi@laptop:~$ gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh
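In this file, locate the JAVA_HOME line and set it to the path of the installed JDK. The path shown here is
an example and must match the JDK actually installed on the machine:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64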
Adding the above statement in the hadoop-env.sh file ensures that the value of JAVA_HOME
variable will be available to Hadoop whenever it is started up.
3. /usr/local/hadoop/etc/hadoop/core-site.xml:
The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that Hadoop
uses when starting up. This file can be used to override the default settings that Hadoop starts with.
rashmi@laptop:~$ sudo mkdir -p /app/hadoop/tmp
rashmi@laptop:~$ sudo chown rashmi:rashmi /app/hadoop/tmp
Open the file and enter the following in between the <configuration> </configuration> tag:
rashmi@laptop:~$ gedit /usr/local/hadoop/etc/hadoop/core-site.xml
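A typical single-node configuration is sketched below; hadoop.tmp.dir points to the /app/hadoop/tmp
directory created above, and the HDFS URI (host and port) is an example choice:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.</description>
</property>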
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml
By default, the /usr/local/hadoop/etc/hadoop/ folder contains
/usr/local/hadoop/etc/hadoop/mapred-site.xml.template
file which has to be renamed/copied with the name mapred-site.xml:
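Using the paths above, the copy can be done as follows:
rashmi@laptop:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml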
The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration> </configuration>
tag:
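Since the cluster is later started with start-yarn.sh, a reasonable choice is to run MapReduce on YARN.
A minimal sketch of that property is:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>The runtime framework for executing MapReduce jobs.</description>
</property>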
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the
cluster that is being used. It is used to specify the directories which will be used by the namenode
and the datanode on that host.
Before editing this file, we need to create two directories which will contain the namenode and the
datanode for this Hadoop installation. This can be done using the following commands:
rashmi@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
rashmi@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
rashmi@laptop:~$ sudo chown -R rashmi:rashmi /usr/local/hadoop_store
Open the file and enter the following content in between the <configuration> </configuration> tag:
rashmi@laptop:~$ gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
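A minimal single-node sketch, using the two directories created above and a replication factor of 1
(there is only one datanode), is:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>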
7. Starting Hadoop
Now it's time to start the newly installed single-node cluster. We can use start-all.sh, or start-dfs.sh
and start-yarn.sh separately:
rashmi@laptop:~$ start-all.sh
The startup output means that we now have a functional single-node instance of Hadoop running on
our machine.
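We can also verify which Hadoop daemons are running with the jps tool that ships with the JDK; on a
working single-node setup it should list processes such as NameNode, DataNode, SecondaryNameNode,
ResourceManager and NodeManager:
rashmi@laptop:~$ jps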
With Hadoop running, we can open its web UIs in a browser:
http://localhost:50070/ (NameNode web UI)
http://localhost:8088/ (YARN ResourceManager web UI)
Conclusion: In this way, Hadoop was installed and configured on Ubuntu.