Unit II Hadoop and Map Reduce Overview

Hadoop is an open-source framework developed by Mike Cafarella and Doug Cutting for processing and analyzing large volumes of data, inspired by Google's GFS and MapReduce. It addresses challenges in big data storage and processing by using HDFS for distributed storage and enabling parallel processing through its architecture. Key components include ResourceManager and NodeManager in YARN, which manage resources and application scheduling across the cluster.


Hadoop

• It all started with two people, Mike Cafarella and Doug Cutting, who were building a search engine system that could index 1 billion pages.
• They estimated that such a system would cost around half a million dollars in hardware, with a monthly running cost of $30,000, which is quite expensive.
• They came across a paper, published in 2003, that described the architecture of Google’s distributed file system, called GFS.
• Later, in 2004, Google published another paper that introduced MapReduce to the world.
• Finally, these two papers laid the foundation of the framework called “Hadoop“.
What is Hadoop?
• Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very huge in volume.
• Hadoop is written in Java and is not OLAP (online analytical processing).
• It is used for batch/offline processing.
• It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more.
• Moreover, it can be scaled up just by adding nodes to the cluster.
Now, you must have an idea of why Big Data is a problem statement and how Hadoop solves it.

• The first problem is storing the colossal amount of data: storing huge data in a traditional system is not possible. The reason is obvious: the storage is limited to one system, while the data is increasing at a tremendous rate.
• The second problem is storing heterogeneous data: the data is not only huge, but it is also present in various formats, i.e. unstructured, semi-structured, and structured. So, you need to make sure that you have a system to store the different types of data generated from various sources.
• The third problem is processing speed: the time taken to process this huge amount of data is quite high, because the data to be processed is too large.
The first problem is storing the colossal amount of data:

• HDFS provides a distributed way to store Big Data.
• Your data is stored in blocks across DataNodes, and you specify the size of each block. For example, a 512 MB file stored with a 128 MB block size is split into four blocks of 128 MB each, which can be placed on different DataNodes.
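The block-splitting idea above can be sketched in a few lines of Python. This is a simplified illustration (not HDFS code); 128 MB is the default block size in recent Hadoop versions.

```python
# Sketch: how a file is split into HDFS-style fixed-size blocks.
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file would occupy."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full
    if remainder:
        blocks.append(remainder)  # the last block may be smaller than block_size_mb
    return blocks

print(split_into_blocks(512))  # [128, 128, 128, 128]
print(split_into_blocks(200))  # [128, 72] -- the last block only occupies 72 MB
```

Note that a file occupies only as much space as it needs in its last block; a 200 MB file uses one full 128 MB block plus one 72 MB block.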


Second problem was storing a variety of data:
• In HDFS you can store all kinds of data, whether structured, semi-structured, or unstructured.
• In HDFS, there is no schema validation before dumping data. HDFS also follows a write-once-read-many model: you write any kind of data once and read it multiple times for finding insights.
The third challenge was about processing the data faster:
• We move the processing unit to the data instead of moving the data to the processing unit.
• Instead of moving data from different nodes to a single master node for processing, the processing logic is sent to the nodes where the data is stored, so that each node can process a part of the data in parallel.
• Finally, all of the intermediate output produced by each node is merged together, and the final response is sent back to the client.
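The "move processing to the data" pattern above can be simulated in plain Python. This is a single-process sketch, not real MapReduce: each "node" holds one data block, runs the same processing logic locally (here, a word count), and the intermediate results are then merged into the final answer.

```python
from collections import Counter

def process_locally(block):
    """The logic shipped to each node: count words in the local block."""
    return Counter(block.split())

# Two data blocks, as if stored on two different DataNodes.
blocks = ["big data big cluster", "data cluster data"]

# Each node produces an intermediate result from its own block...
intermediate = [process_locally(b) for b in blocks]

# ...and the intermediate outputs are merged into the final response.
final = sum(intermediate, Counter())
print(final["data"])  # 3
```

Only the small per-block counts travel between nodes, not the raw data blocks themselves, which is exactly why this layout is cheaper than shipping all data to one master.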
Hadoop Architecture
• The NameNode's metadata is associated with two files:
• FsImage: contains the complete state of the file system namespace since the start of the NameNode.
• EditLog: contains all the recent modifications made to the file system with respect to the most recent FsImage.
• If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
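The FsImage/EditLog relationship can be sketched as a checkpoint plus a replayable log. The record format below is illustrative, not the real NameNode on-disk format: the current namespace is the checkpointed FsImage with the EditLog operations applied on top of it.

```python
# Sketch: rebuilding the current namespace from FsImage + EditLog.
fsimage = {"/a.txt", "/b.txt"}            # checkpointed namespace state
editlog = [("create", "/c.txt"),           # modifications since the checkpoint
           ("delete", "/a.txt")]

def replay(fsimage, editlog):
    """Apply the EditLog to the FsImage to recover the current namespace."""
    namespace = set(fsimage)
    for op, path in editlog:
        if op == "create":
            namespace.add(path)
        elif op == "delete":
            namespace.discard(path)
    return namespace

print(sorted(replay(fsimage, editlog)))  # ['/b.txt', '/c.txt']
```

Appending one record to a log is much cheaper than rewriting the whole FsImage on every change, which is why modifications go to the EditLog first and are folded into a new FsImage only at checkpoint time.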
The Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
• Very large files
• Streaming data access
• Commodity hardware

HDFS is not a good fit today for:
• Low-latency data access
• Lots of small files
• Multiple writers, arbitrary file modifications
• ls: This command is used to list all the files. It will print all the directories and files present in the given HDFS path.
$hadoop fs -ls /dir

• mkdir: To create a directory. In Hadoop dfs there is no home directory by default, so let's first create one.
$hadoop fs -mkdir /directory_name

• touchz: It creates an empty file.
$hadoop fs -touchz /filename

• copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store. This is the most important command. Local file system means the files present on the OS.
$hadoop fs -put filename(which you want to put) /path
$hadoop fs -copyFromLocal filename(which you want to put) /path

• copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system.
$hadoop fs -get /file(path)
$hadoop fs -copyToLocal /file(path)

• cat: To print file contents.
$hadoop fs -cat /file(path)

• moveFromLocal: This command will move a file from local to HDFS.
$hadoop fs -moveFromLocal file_name(which you want to move) /path

• cp: This command is used to copy files within HDFS.
$hadoop fs -cp /path1/file /path2/file

• mv: This command is used to move files within HDFS. It cut-pastes a file.
$hadoop fs -mv /path1/file /path2/file

• rmr: This command deletes a file from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
$hadoop fs -rmr /file(path)

• du: It will give the size of each file in a directory.
$hadoop fs -du /file(path)

• dus: This command will give the total size of a directory/file.
$hadoop fs -dus /file(path)

• stat: It will give the last modified time of a directory or path. In short, it will give the stats of the directory or file.
$hadoop fs -stat /dir(file)
YARN
YARN comprises two major components: ResourceManager and NodeManager.

ResourceManager
• It is a cluster-level component (one per cluster) and runs on the master machine.
• It manages resources and schedules applications running on top of YARN.
• It has two components: the Scheduler and the Application Manager.
• The Scheduler is responsible for allocating resources to the various running applications.
• The Application Manager is responsible for accepting job submissions and negotiating the first container for executing the application.
• It keeps track of the heartbeats from the NodeManager.
NodeManager
• It is a node-level component (one on each node) and runs on each slave machine.
• It is responsible for managing containers and monitoring resource utilization in each container.
• It also keeps track of node health and log management.
• It continuously communicates with the ResourceManager to remain up to date.
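The heartbeat tracking mentioned above can be sketched as follows. This is a simplified illustration, not YARN's actual protocol or API: each NodeManager reports in periodically, and the ResourceManager treats a node as dead if no heartbeat arrives within a timeout window.

```python
# Sketch of ResourceManager-side heartbeat tracking (illustrative names only).
class ResourceManagerSketch:
    def __init__(self, timeout=30):
        self.timeout = timeout          # seconds without a heartbeat before a node is dropped
        self.last_heartbeat = {}        # node id -> time of last heartbeat (seconds)

    def heartbeat(self, node, now):
        """Record that a NodeManager heartbeat arrived at time `now`."""
        self.last_heartbeat[node] = now

    def live_nodes(self, now):
        """Nodes whose last heartbeat falls within the timeout window."""
        return {n for n, t in self.last_heartbeat.items()
                if now - t <= self.timeout}

rm = ResourceManagerSketch(timeout=30)
rm.heartbeat("node1", now=0)
rm.heartbeat("node2", now=0)
rm.heartbeat("node1", now=25)          # node1 keeps reporting; node2 goes silent
print(sorted(rm.live_nodes(now=40)))   # ['node1']
```

The Scheduler would only place containers on nodes returned by `live_nodes`, which is why the continuous NodeManager-to-ResourceManager communication matters.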
