Unit II Hadoop and Map Reduce Overview

Hadoop is an open-source framework developed by Mike Cafarella and Doug Cutting for processing and analyzing large volumes of data, inspired by Google's GFS and MapReduce. It addresses challenges in big data storage and processing by using HDFS for distributed storage and enabling parallel processing through its architecture. Key components include ResourceManager and NodeManager in YARN, which manage resources and application scheduling across the cluster.


Hadoop

• It all started with two people, Mike Cafarella and Doug Cutting, who were building a search engine system that could index 1 billion pages.
• They estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive.
• They came across a paper, published in 2003, that described the architecture of Google’s distributed file system, called GFS.
• Later, in 2004, Google published another paper that introduced MapReduce to the world.
• Finally, these two papers laid the foundation of the framework called “Hadoop“.
What is Hadoop?
• Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very huge in volume.
• Hadoop is written in Java and is not OLAP (online analytical processing).
• It is used for batch/offline processing.
• It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more.
• Moreover, it can be scaled up just by adding nodes to the cluster.
Now, you must have an idea of why Big Data is a problem statement and how Hadoop solves it.

• The first problem is storing the colossal amount of data: storing huge data in a traditional system is not possible. The reason is obvious: the storage is limited to one system, while the data is increasing at a tremendous rate.
• The second problem is storing heterogeneous data: the data is not only huge, but it is also present in various formats, i.e. unstructured, semi-structured, and structured. So, you need to make sure that you have a system to store the different types of data generated from various sources.
• The third problem is processing speed: the time taken to process this huge amount of data is quite high, because the data to be processed is too large.
The first problem is storing the colossal amount of data:

• HDFS provides a distributed way to store Big Data.
• Your data is stored in blocks across DataNodes, and you specify the size of each block. For example, a 512 MB file stored with a 128 MB block size is split into four blocks of 128 MB each, which can be placed on different DataNodes.
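The block-splitting idea above can be sketched in a few lines of Python. This is a simplified illustration (not HDFS code); 128 MB is the default block size in recent Hadoop versions.

```python
# Sketch: how a file is split into HDFS-style fixed-size blocks.
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file would occupy."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full
    if remainder:
        blocks.append(remainder)  # the last block may be smaller than block_size_mb
    return blocks

print(split_into_blocks(512))  # [128, 128, 128, 128]
print(split_into_blocks(200))  # [128, 72] -- the last block only occupies 72 MB
```

Note that a file occupies only as much space as it needs in its last block; a 200 MB file uses one full 128 MB block plus one 72 MB block.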


Second problem was storing a variety of data:
• In HDFS you can store all kinds of data, whether structured, semi-structured, or unstructured.
• In HDFS, there is no schema validation before dumping data. HDFS also follows a write-once-read-many model: you write any kind of data once and read it multiple times for finding insights.
The third challenge was about processing the data faster:
• We move the processing unit to the data instead of moving the data to the processing unit.
• Instead of moving data from different nodes to a single master node for processing, the processing logic is sent to the nodes where the data is stored, so that each node can process a part of the data in parallel.
• Finally, all of the intermediate output produced by each node is merged together, and the final response is sent back to the client.
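The "move processing to the data" pattern above can be simulated in plain Python. This is a single-process sketch, not real MapReduce: each "node" holds one data block, runs the same processing logic locally (here, a word count), and the intermediate results are then merged into the final answer.

```python
from collections import Counter

def process_locally(block):
    """The logic shipped to each node: count words in the local block."""
    return Counter(block.split())

# Two data blocks, as if stored on two different DataNodes.
blocks = ["big data big cluster", "data cluster data"]

# Each node produces an intermediate result from its own block...
intermediate = [process_locally(b) for b in blocks]

# ...and the intermediate outputs are merged into the final response.
final = sum(intermediate, Counter())
print(final["data"])  # 3
```

Only the small per-block counts travel between nodes, not the raw data blocks themselves, which is exactly why this layout is cheaper than shipping all data to one master.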
Hadoop Architecture
• The NameNode's metadata is associated with two files:
• FsImage: contains the complete state of the file system namespace since the start of the NameNode.
• EditLog: contains all the recent modifications made to the file system with respect to the most recent FsImage.
• If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
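The FsImage/EditLog relationship can be sketched as a checkpoint plus a replayable log. The record format below is illustrative, not the real NameNode on-disk format: the current namespace is the checkpointed FsImage with the EditLog operations applied on top of it.

```python
# Sketch: rebuilding the current namespace from FsImage + EditLog.
fsimage = {"/a.txt", "/b.txt"}            # checkpointed namespace state
editlog = [("create", "/c.txt"),           # modifications since the checkpoint
           ("delete", "/a.txt")]

def replay(fsimage, editlog):
    """Apply the EditLog to the FsImage to recover the current namespace."""
    namespace = set(fsimage)
    for op, path in editlog:
        if op == "create":
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
    return namespace

print(sorted(replay(fsimage, editlog)))  # ['/b.txt', '/c.txt']
```

Appending one record to a log is much cheaper than rewriting the whole FsImage on every change, which is why modifications go to the EditLog first and are folded into a new FsImage only at checkpoint time.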
The Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
• Very large files
• Streaming data access
• Commodity hardware

HDFS is not a good fit today for:
• Low-latency data access
• Lots of small files
• Multiple writers, arbitrary file modifications
• ls: This command is used to list all the files. It will print all the directories and files present in the given HDFS path.
$hadoop fs -ls /dir

• mkdir: To create a directory. In Hadoop dfs there is no home directory by default, so let's first create one.
$hadoop fs -mkdir /directory_name

• touchz: It creates an empty file.
$hadoop fs -touchz /filename

• copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store. This is the most important command. Local file system means the files present on the OS.
$hadoop fs -put filename(which you want to put) /path
$hadoop fs -copyFromLocal filename(which you want to put) /path

• copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system.
$hadoop fs -get /file(path)
$hadoop fs -copyToLocal /file(path)

• cat: To print file contents.
$hadoop fs -cat /file(path)

• moveFromLocal: This command will move a file from local to HDFS.
$hadoop fs -moveFromLocal file_name(which you want to move) /path

• cp: This command is used to copy files within HDFS.
$hadoop fs -cp /path1/file /path2/file

• mv: This command is used to move files within HDFS. It cut-pastes a file.
$hadoop fs -mv /path1/file /path2/file

• rmr: This command deletes a file from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
$hadoop fs -rmr /file(path)

• du: It will give the size of each file in a directory.
$hadoop fs -du /file(path)

• dus: This command will give the total size of a directory/file.
$hadoop fs -dus /file(path)

• stat: It will give the last modified time of a directory or path. In short, it will give the stats of the directory or file.
$hadoop fs -stat /dir(file)
YARN
YARN comprises two major components: ResourceManager and NodeManager.

ResourceManager
• It is a cluster-level component (one per cluster) and runs on the master machine.
• It manages resources and schedules applications running on top of YARN.
• It has two components: the Scheduler and the Application Manager.
• The Scheduler is responsible for allocating resources to the various running applications.
• The Application Manager is responsible for accepting job submissions and negotiating the first container for executing the application.
• It keeps track of the heartbeats from the NodeManager.
NodeManager
• It is a node-level component (one on each node) and runs on each slave machine.
• It is responsible for managing containers and monitoring resource utilization in each container.
• It also keeps track of node health and log management.
• It continuously communicates with the ResourceManager to remain up to date.
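The heartbeat tracking mentioned above can be sketched as follows. This is a simplified illustration, not YARN's actual protocol or API: each NodeManager reports in periodically, and the ResourceManager treats a node as dead if no heartbeat arrives within a timeout window.

```python
# Sketch of ResourceManager-side heartbeat tracking (illustrative names only).
class ResourceManagerSketch:
    def __init__(self, timeout=30):
        self.timeout = timeout          # seconds without a heartbeat before a node is dropped
        self.last_heartbeat = {}        # node id -> time of last heartbeat (seconds)

    def heartbeat(self, node, now):
        """Record that a NodeManager heartbeat arrived at time `now`."""
        self.last_heartbeat[node] = now

    def live_nodes(self, now):
        """Nodes whose last heartbeat falls within the timeout window."""
        return {n for n, t in self.last_heartbeat.items()
                if now - t <= self.timeout}

rm = ResourceManagerSketch(timeout=30)
rm.heartbeat("node1", now=0)
rm.heartbeat("node2", now=0)
rm.heartbeat("node1", now=25)          # node1 keeps reporting; node2 goes silent
print(sorted(rm.live_nodes(now=40)))   # ['node1']
```

The Scheduler would only place containers on nodes returned by `live_nodes`, which is why the continuous NodeManager-to-ResourceManager communication matters.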
