
UNIT -3 BIG DATA STORAGE AND ANALYSIS

HDFS

The Hadoop Distributed File System (HDFS) was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and designed using low-cost hardware.

HDFS holds very large amounts of data and provides easy access to it. To store such huge data, files are stored across multiple machines. These files are stored in a redundant fashion to protect the system against data loss in case of failure. HDFS also makes applications available for parallel processing.

Features of HDFS
● It is suitable for distributed storage and processing.
● Hadoop provides a command-line interface to interact with HDFS.
● The built-in servers of the namenode and datanodes help users easily check the status of the cluster.
● Streaming access to file system data.
● HDFS provides file permissions and authentication.

DESIGN OF HDFS

HDFS follows the master-slave architecture and it has the following elements.

Namenode

HDFS works in a master-worker pattern, where the namenode acts as the master. The namenode is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS, the metadata being file permissions, names, and the location of each block. The metadata is small, so it is stored in the memory of the namenode, allowing faster access to it. Moreover, since the HDFS cluster is accessed by multiple clients concurrently, all this information is handled by a single machine. The system hosting the namenode acts as the master server and does the following tasks −

● Manages the file system namespace.


● Regulates client’s access to files.
● It also executes file system operations such as renaming, closing, and opening files and
directories.

Datanode

Datanodes store and retrieve blocks when they are told to by clients or the namenode. They report back to the namenode periodically with a list of the blocks they are storing. The datanode, being commodity hardware, also does the work of block creation, deletion, and replication as stated by the namenode.

● Datanodes perform read-write operations on the file system, as per client request.
● They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.

Block

Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments and stored in individual datanodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block.

HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; for example, a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
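The block size is a cluster-wide configuration setting. A minimal hdfs-site.xml sketch, assuming the standard dfs.blocksize property; the value shown is simply the 128 MB default written in bytes:

<configuration>
  <property>
    <!-- HDFS block size: 134217728 bytes = 128 MB (the default); a suffix form such as 128m also works -->
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>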
Hadoop FS Command Line
The Hadoop FS command line is a simple way to access and interface with HDFS. Below are
some basic HDFS commands in Linux, including operations like creating directories, moving
files, deleting files, reading files, and listing directories.

To use HDFS commands, start the Hadoop services using the following command:

sbin/start-all.sh

To check if Hadoop is up and running:

jps
The sections below cover several basic HDFS commands; a full list of file system commands is available via the -help option.

1. ls: This command is used to list all the files. Use lsr for a recursive listing. It is useful when we want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains executables, so bin/hdfs means we want the hdfs executable, and dfs selects the Distributed File System commands.

2. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let’s
first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>

creating home directory:

bin/hdfs dfs -mkdir /user

bin/hdfs dfs -mkdir /user/username -> replace username with the username of your computer

Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path

bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be created relative to the home
directory.
3. touchz: It creates an empty file.
Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt

4. copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store. This is one of the most important commands. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to folder geeks
present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks

5. cat: To print file contents.


Syntax:
bin/hdfs dfs -cat <path>
Example:

bin/hdfs dfs -cat /geeks/AI.txt

6. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero

7. moveFromLocal: This command will move a file from the local file system to HDFS.


Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks

8. cp: This command is used to copy files within HDFS. Let's copy the folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied

9. mv: This command is used to move files within HDFS. Let's cut-paste the file myfile.txt from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
10. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory and then the directory itself.

11. du: It will give the size of each file in a directory.


Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /geeks

12. dus: This command will give the total size of directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /geeks

13. stat: It will give the last modified time of a directory or path. In short, it gives the stats of the directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /geeks
14. setrep: This command is used to change the replication factor of a file/directory in HDFS. By default it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).
Example 1: To change the replication factor to 6 for geeks.txt stored in HDFS:
bin/hdfs dfs -setrep -R -w 6 geeks.txt
Example 2: To change the replication factor to 4 for the directory /geeks stored in HDFS:
bin/hdfs dfs -setrep -R 4 /geeks
Hadoop MapReduce – Data Flow
Map-Reduce is a processing framework used to process data over a large number of machines. Hadoop uses Map-Reduce to process the data distributed in a Hadoop cluster. Map-Reduce is not like regular processing frameworks such as Hibernate, the JDK, .NET, etc. Those frameworks are designed for traditional systems where the data is stored in a single location, such as a network file system or an Oracle database. But when we are processing big data, the data is located on multiple commodity machines with the help of HDFS.

So when the data is stored on multiple nodes, we need a processing framework that can copy the program to the locations where the data is present; that is, it copies the program to all the machines where the data resides. This is where Map-Reduce comes into the picture for processing data on Hadoop over a distributed system. Hadoop has a major drawback of cross-switch network traffic due to the massive volume of data. Map-Reduce comes with a feature called Data Locality, which is the ability to move the computation closer to where the data actually resides on the machines.

Since Hadoop is designed to work on commodity hardware, it uses Map-Reduce, which is widely accepted and provides an easy way to process data over multiple nodes. Map-Reduce is not the only framework for parallel processing.

Map-Reduce consists of a Map phase and a Reduce phase. The Map is used for transformation, while the Reducer is used for aggregation-style operations. The terminology for Map and Reduce is derived from functional programming languages like Lisp and Scala. A Map-Reduce program comes with 3 main components: the Driver code, the Mapper (for transformation), and the Reducer (for aggregation).
Let's take an example where you have a 10 TB file to process on Hadoop. The 10 TB of data is first distributed across multiple nodes on Hadoop with HDFS. Now we have to process it, and for that we have the Map-Reduce framework. To process this data with Map-Reduce, we have Driver code, which is called a Job. If we are using the Java programming language for processing the data on HDFS, then we need to initiate this Driver class with the Job object. Suppose you have a car, which is your framework; then the start button used to start the car is similar to this Driver code in the Map-Reduce framework. We need to initiate the Driver code to utilize the advantages of the Map-Reduce framework.

There are also Mapper and Reducer classes provided by this framework, which are predefined and modified by developers as per the organization's requirements.

Working of Mapper

The Mapper is the first code that interacts with the input dataset. Suppose we have 100 data blocks of the dataset we are analyzing; in that case, there will be 100 Mapper programs or processes that run in parallel on the machines (nodes) and produce their own output, known as intermediate output, which is stored on local disk, not on HDFS. The output of the mappers acts as input for the Reducer, which performs some sorting and aggregation operations on the data and produces the final output.

Working Of Reducer

The Reducer is the second part of the Map-Reduce programming model. The Mapper produces output in the form of key-value pairs, which works as input for the Reducer. But before these intermediate key-value pairs reach the Reducer, a shuffle-and-sort step groups and sorts the key-value pairs according to their keys. The output generated by the Reducer is the final output, which is then stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs computation operations like addition, filtration, and aggregation.
Steps of Data-Flow:

● A single input split is processed at a time. The Mapper is overridden by the developer according to the business logic, and this Mapper runs in a parallel manner on all the machines in our cluster.
● The intermediate output generated by the Mapper is stored on the local disk and shuffled to the Reducers for the reduce task.
● Once the Mappers finish their task, the output is sorted, merged, and provided to the Reducer.
● The Reducer performs reducing tasks like aggregation and other compositional operations, and the final output is then stored on HDFS in a part-r-00000 file (created by default).
Compression in Hadoop
File compression brings two major benefits: it reduces the space needed to store files, and it
speeds up data transfer across the network or to or from disk. When dealing with large volumes
of data, both of these savings can be significant, so it pays to carefully consider how to use
compression in Hadoop.

1. What to compress?
1) Compressing input files
If the input file is compressed, then the bytes read in from HDFS are reduced, which means less time to read data. This saving in time benefits the performance of job execution.

If the input files are compressed, they will be decompressed automatically as they are read by
MapReduce, using the filename extension to determine which codec to use. For example, a file
ending in .gz can be identified as gzip-compressed file and thus read with GzipCodec.

2) Compressing output files


Often we need to store the output as history files. If the amount of output per day is extensive, and we often need to store history results for future use, then these accumulated results will take up an extensive amount of HDFS space. However, these history files may not be used very frequently, resulting in a waste of HDFS space. Therefore, it is worthwhile to compress the output before storing it on HDFS.

3) Compressing map output


Even if your MapReduce application reads and writes uncompressed data, it may benefit from
compressing the intermediate output of the map phase. Since the map output is written to disk
and transferred across the network to the reducer nodes, by using a fast compressor such as LZO
or Snappy, you can get performance gains simply because the volume of data to transfer is
reduced.
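As an illustration, both kinds of output compression can be switched on from the driver code. A minimal sketch, assuming the standard MapReduce Job API and the Snappy and gzip codec classes that ship with Hadoop (whether Snappy is actually usable depends on the native libraries installed on the cluster):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionSettings {
    public static Job configure() throws IOException {
        Configuration conf = new Configuration();
        // Compress the intermediate map output with a fast codec (Snappy).
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed output example");
        // Compress the final job output written to HDFS (gzip here).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        return job;
    }
}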
2. Common compression formats
Compression format | Tool  | Algorithm | File extension | Splittable
gzip               | gzip  | DEFLATE   | .gz            | No
bzip2              | bzip2 | bzip2     | .bz2           | Yes
LZO                | lzop  | LZO       | .lzo           | Yes, if indexed
Snappy             | N/A   | Snappy    | .snappy        | No

gzip:
gzip is naturally supported by Hadoop. gzip is based on the DEFLATE algorithm, which is a
combination of LZ77 and Huffman Coding.
bzip2:
bzip2 is a freely available, patent-free, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), whilst being around twice as fast at compression and six times faster at decompression.

LZO:
The LZO compression format is composed of many smaller (~256K) blocks of compressed data,
allowing jobs to be split along block boundaries. Moreover, it was designed with speed in mind:
it decompresses about twice as fast as gzip, meaning it’s fast enough to keep up with hard drive
read speeds. It doesn’t compress quite as well as gzip — expect files that are on the order of
50% larger than their gzipped version. But that is still 20-50% of the size of the files without any
compression at all, which means that IO-bound jobs complete the map phase about four times
faster.

Snappy:
Snappy is a compression/decompression library. It does not aim for maximum compression, or
compatibility with any other compression library; instead, it aims for very high speeds and
reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of
magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to
100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about
250 MB/sec or more and decompresses at about 500 MB/sec or more. Snappy is widely used inside Google, in everything from BigTable and MapReduce to its internal RPC systems.

What is Serialization?
Serialization is the process of translating data structures or object state into a binary or textual form in order to transport the data over a network or to store it on some persistent storage. Once the data is transported over the network or retrieved from persistent storage, it needs to be deserialized again. Serialization is termed marshalling and deserialization is termed unmarshalling.

Serialization in Hadoop
Generally in distributed systems like Hadoop, the concept of serialization is used for
Interprocess Communication and Persistent Storage.

Interprocess Communication
To establish interprocess communication between the nodes connected in a network, the RPC technique is used.
RPC uses internal serialization to convert the message into binary format before sending it to the remote node via the network. At the other end the remote system deserializes the binary stream into the original message.
The RPC serialization format is required to be as follows −
■ Compact − To make the best use of network bandwidth, which is the most scarce resource in a data center.
■ Fast − Since the communication between the nodes is crucial in distributed systems, the serialization and deserialization process should be quick and produce little overhead.
■ Extensible − Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.
■ Interoperable − The message format should support nodes that are written in different languages.

Persistent Storage

Persistent storage is a digital storage facility that does not lose its data when the power supply is lost. Files, folders, and databases are examples of persistent storage.

Writable Interface

This is the interface in Hadoop which provides methods for serialization and deserialization. The methods are described below −

1. void readFields(DataInput in) − This method is used to deserialize the fields of the given object.

2. void write(DataOutput out) − This method is used to serialize the fields of the given object.
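As an illustration, any custom type becomes usable as a MapReduce value by implementing these two methods (keys additionally need the WritableComparable interface). A minimal sketch; the WordCountPair name and its fields are hypothetical, not part of the original notes:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class WordCountPair implements Writable {
    private String word;
    private int count;

    public WordCountPair() { }                  // no-arg constructor required by Hadoop

    public WordCountPair(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);                     // serialize the fields in a fixed order
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();                    // deserialize in exactly the same order
        count = in.readInt();
    }
}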

Serializing the Data in Hadoop


The procedure to serialize the integer type of data is discussed below.

● Instantiate IntWritable class by wrapping an integer value in it.


● Instantiate ByteArrayOutputStream class.
● Instantiate DataOutputStream class and pass the object of ByteArrayOutputStream
class to it.
● Serialize the integer value in IntWritable object using write() method. This method needs
an object of DataOutputStream class.
● The serialized data will be stored in the byte array object which is passed as parameter to
the DataOutputStream class at the time of instantiation. Convert the data in the object to
byte array.
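These steps translate almost directly into code. A minimal sketch, assuming only the standard org.apache.hadoop.io.IntWritable class:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class IntSerializer {
    public static byte[] serialize(int value) throws IOException {
        IntWritable writable = new IntWritable(value);          // wrap the integer
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(bytes);
        writable.write(dataOut);                                // serialize into the stream
        dataOut.close();
        return bytes.toByteArray();                             // the serialized form (4 bytes)
    }
}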
Deserializing the Data in Hadoop
The procedure to deserialize the integer type of data is discussed below.

● Instantiate the IntWritable class (no value needs to be wrapped; it will be filled in during deserialization).
● Instantiate the ByteArrayInputStream class, passing it the serialized byte array.
● Instantiate the DataInputStream class and pass the object of the ByteArrayInputStream class to it.
● Deserialize the data in the object of DataInputStream using the readFields() method of the IntWritable class.
● The deserialized data will be stored in the object of the IntWritable class. You can retrieve this data using the get() method of this class.
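The reverse direction is symmetric, using the input-stream classes. A minimal sketch that reads a byte array such as the one produced above back into an IntWritable:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class IntDeserializer {
    public static int deserialize(byte[] data) throws IOException {
        IntWritable writable = new IntWritable();               // empty object to be filled
        DataInputStream dataIn =
            new DataInputStream(new ByteArrayInputStream(data));
        writable.readFields(dataIn);                            // deserialize from the stream
        dataIn.close();
        return writable.get();                                  // retrieve the original value
    }
}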

AVRO FILE BASED DATA STRUCTURES

To transfer data over a network or store it persistently, you need to serialize the data. Besides the serialization APIs provided by Java and Hadoop, there is a special utility called Avro, a schema-based serialization technique.

What is Avro?
Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop Writable classes lack language portability, Avro is quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool for serializing data in Hadoop.

Avro has a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes data together with its schema into a compact binary format, which can be deserialized by any application.

Avro Schemas
Avro depends heavily on its schema. It serializes quickly, and the resulting serialized data is smaller in size. The schema is stored along with the Avro data in a file, so the data can later be read without any prior knowledge of the schema.
In RPC, the client and the server exchange schemas during the connection. This exchange helps in reconciling same-named fields, missing fields, extra fields, etc.
Avro schemas are defined in JSON, which simplifies their implementation in languages that have JSON libraries.

Like Avro, there are other serialization mechanisms in Hadoop such as Sequence Files, Protocol
Buffers, and Thrift.

Features of Avro
Listed below are some of the prominent features of Avro −

● Avro is a language-neutral data serialization system.
● It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
● Avro creates a binary structured format that is both compressible and splittable. Hence it can be efficiently used as the input to Hadoop MapReduce jobs.
● Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record. These datatypes can be created in any language, can be processed in Hadoop, and the results can be fed to a third language.
● Avro schemas, defined in JSON, facilitate implementation in languages that already have JSON libraries.
● Avro creates a self-describing file called an Avro Data File, in which it stores data along with its schema in the metadata section.
● Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server exchange schemas in the connection handshake.

General Working of Avro

To use Avro, you need to follow the given workflow −

Step 1 − Create schemas. Here you need to design an Avro schema according to your data.

Step 2 − Read the schemas into your program. This is done in two ways −
■ By generating a class corresponding to the schema − Compile the schema using Avro. This generates a class file corresponding to the schema.
■ By using the parsers library − You can directly read the schema using the parsers library.

Step 3 − Serialize the data using the serialization API provided for Avro, which is found in the package org.apache.avro.specific.

Step 4 − Deserialize the data using the deserialization API provided for Avro, which is found in the package org.apache.avro.specific.
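A minimal end-to-end sketch of this workflow using the generic (parsers library) API rather than generated classes; the Employee schema and the employee.avro file name are made up for illustration:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Steps 1-2: define the schema in JSON and read it with the parsers library.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Employee\","
          + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"id\",\"type\":\"int\"}]}");

        // Step 3: serialize a record into a self-describing Avro data file.
        GenericRecord emp = new GenericData.Record(schema);
        emp.put("name", "Asha");
        emp.put("id", 1);
        File file = new File("employee.avro");
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, file);                 // the schema is stored in the file metadata
        writer.append(emp);
        writer.close();

        // Step 4: deserialize; the reader picks the schema up from the file itself.
        DataFileReader<GenericRecord> reader =
            new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
        while (reader.hasNext()) {
            System.out.println(reader.next());
        }
        reader.close();
    }
}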

Various File systems in Hadoop

Hadoop is utilized for batch/offline processing over a network of many machines forming a physical cluster. The framework works in such a manner that it is capable of providing distributed storage and processing over the same cluster. It is designed to work on cheaper systems, commonly known as commodity hardware, where each system offers its local storage and computation power.
Hadoop is capable of running various file systems, and HDFS is just one implementation among them. Hadoop has a variety of file systems that can be implemented concretely. The Java abstract class org.apache.hadoop.fs.FileSystem represents a file system in Hadoop.
The common filesystems are listed below; the Java implementations are all under org.apache.hadoop.

Local (URI scheme: file; Java implementation: fs.LocalFileSystem)
The Hadoop Local filesystem is used for a locally connected disk with client-side checksumming. The local filesystem with no checksums uses RawLocalFileSystem.

HDFS (URI scheme: hdfs; Java implementation: hdfs.DistributedFileSystem)
HDFS stands for Hadoop Distributed File System, and it is designed to work efficiently with MapReduce.

HFTP (URI scheme: hftp; Java implementation: hdfs.HftpFileSystem)
The HFTP filesystem provides read-only access to HDFS over HTTP. There is no connection between HFTP and FTP. This filesystem is commonly used with distcp to share data between HDFS clusters running different versions.

HSFTP (URI scheme: hsftp; Java implementation: hdfs.HsftpFileSystem)
The HSFTP filesystem provides read-only access to HDFS over HTTPS. This filesystem also has no connection with FTP.

HAR (URI scheme: har; Java implementation: fs.HarFileSystem)
The HAR filesystem is mainly used to reduce the memory usage of the NameNode by archiving files in Hadoop HDFS. This filesystem is layered on some other filesystem for archiving purposes.

KFS / Cloud-Store (URI scheme: kfs; Java implementation: fs.kfs.KosmosFileSystem)
Cloud-Store, or KFS (Kosmos File System), is a filesystem written in C++. It is very similar to distributed file systems like HDFS and GFS (Google File System).

FTP (URI scheme: ftp; Java implementation: fs.ftp.FTPFileSystem)
The FTP filesystem is backed by an FTP server.

S3, native (URI scheme: s3n; Java implementation: fs.s3native.NativeS3FileSystem)
This filesystem is backed by Amazon S3.

S3, block-based (URI scheme: s3; Java implementation: fs.s3.S3FileSystem)
This filesystem, also backed by Amazon S3, stores files in blocks (similar to HDFS) to overcome S3's 5 GB file size limit.
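Whatever the concrete filesystem, client code is written against the same abstraction. A minimal sketch that reads a file through the FileSystem API; the /geeks/AI.txt path simply reuses the file from the command-line examples above:

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromFileSystem {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The concrete implementation (LocalFileSystem, DistributedFileSystem, ...)
        // is chosen from the default filesystem URI in the cluster configuration.
        FileSystem fs = FileSystem.get(conf);
        InputStream in = null;
        try {
            in = fs.open(new Path("/geeks/AI.txt"));
            IOUtils.copyBytes(in, System.out, 4096, false);   // stream the file to stdout
        } finally {
            IOUtils.closeStream(in);
        }
    }
}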
What is MapReduce in Hadoop?

MapReduce is a software framework and programming model used for processing huge amounts of data. A MapReduce program works in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature and are thus very useful for performing large-scale data analysis using multiple machines in the cluster. The input to each phase is key-value pairs. In addition, every programmer needs to specify two functions: a map function and a reduce function.

MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.
● MapReduce consists of two distinct tasks: Map and Reduce.
● As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
● So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
● The output of a Mapper or map job (key-value pairs) is input to the Reducer.
● The reducer receives the key-value pairs from multiple map jobs.
● Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
MapReduce Architecture in Big Data explained with Example

The whole process goes through four phases of execution, namely splitting, mapping, shuffling, and reducing.

Let us understand how MapReduce works by taking an example where we have a text file called example.txt whose contents are as follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Now suppose we have to perform a word count on example.txt using MapReduce. So, we will be finding the unique words and the number of occurrences of those unique words.
The data goes through the following phases of MapReduce in Big Data:

Input Splits:

An input to a MapReduce job in Big Data is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map.
● First, we divide the input into three splits as shown in the figure. This will distribute the
work among all the map nodes.

Mapping

This is the very first phase in the execution of a map-reduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in the input splits (input splits were described above) and prepare a list in the form of <word, frequency>.
● Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, occurs once.
● Now, a list of key-value pairs is created where the key is nothing but the individual words and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs: Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.

Shuffling

This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, the same words are clubbed together along with their respective frequencies.
● After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
● So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
Reducing

In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
● Now, each Reducer counts the values present in its list of values. As shown in the figure, the reducer gets the list of values [1,1] for the key Bear. It then counts the number of ones in that list and gives the final output as Bear, 2.
● Finally, all the output key-value pairs are collected and written to the output file.
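A minimal word-count sketch tying the Mapper, Reducer, and Driver together. It uses the standard org.apache.hadoop.mapreduce API; class names and paths are illustrative, not taken from the original notes:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString(), " ,");
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);                 // e.g. (Bear, 1)
            }
        }
    }

    // Reduce: sum the 1s received for each word, e.g. Bear, [1,1] -> (Bear, 2).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver ("Job"): wires the Mapper and Reducer together and submits the job.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /geeks/example.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}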

Apache Hadoop YARN

YARN, known as Yet Another Resource Negotiator, was introduced in Hadoop version 2.0 by
Yahoo and Hortonworks in 2012. The basic idea of YARN in Hadoop is to divide the functions of
resource management and task scheduling/monitoring into separate daemon processes.

YARN in Hadoop allows for the execution of various data processing engines such as batch
processing, graph processing, stream processing, and interactive processing, as well as the
processing of data stored in HDFS.
Why is YARN Hadoop Used?

Before Hadoop 2.0, Hadoop 1.x had two main components: the Hadoop Distributed File System
(HDFS) and MapReduce. The MapReduce batch processing framework was tightly coupled to
HDFS.

By relying solely on MapReduce, Hadoop ran into many challenges.

● MapReduce had to handle both resource management and processing.
● The Job Tracker was overloaded with the many functions it had to manage, including planning, task control, processing, resource allocation, etc.
● A single Job Tracker was the bottleneck in terms of scalability.
● Overall, the system was computationally inefficient in terms of resource usage.

Why is YARN Used?

● YARN in Hadoop efficiently and dynamically allocates all cluster resources, resulting in higher utilization of the cluster compared to previous versions of Hadoop.
● Clusters in Hadoop YARN can now run streaming data processing and interactive queries in parallel with MapReduce batch jobs.
● Thanks to YARN, Hadoop can now handle several processing methods and can support a wider range of applications.
YARN Architecture in Hadoop

The architecture of YARN is shown in the figure below. The architecture consists of several
components such as Resource Manager, Node Manager and Application Master.

The cluster's Resource Manager and Node Manager are two components in charge of managing
and scheduling Hadoop jobs. The execution of tasks in parallel is the responsibility of the
Application Master. Its daemon is in charge of carrying out the compute jobs, checking them for
errors, and finishing them.

Main Components of YARN Architecture in Hadoop


Resource Manager
The Resource Manager is the central decision maker for allocating resources among all system
applications. When it receives processing requests, it forwards portions of them to the appropriate
node managers, where the actual processing takes place. It acts as the cluster's resource arbitrator,
allocating available resources to competing applications.

The Resource Manager consists of the following:


1. Scheduler:

● It is known as a pure scheduler, as it performs no monitoring or application state tracking.
● If there is a sudden hardware or application failure, it does not guarantee a restart of the failed tasks.
● The scheduler performs its functions based on the resource requirements of the applications. It does this by using the abstraction of resource containers, which bundle memory, CPU, disk, network, and more.

2. Application Manager:

● The Application Manager is responsible for collecting job submissions, selecting the first
container to run the application-specific ApplicationMaster, and providing services to
restart the ApplicationMaster container in case of failure.
● Each application's ApplicationMaster is responsible for negotiating the appropriate
resource containers from the scheduler, maintaining their state and tracking progress.

3. Node Manager and Container

● Monitors resource usage (storage, CPU, etc.) per container and handles log management.
● It registers with the Resource Manager and sends out heartbeats containing the health status of the node.
● The Resource Manager assigns the Node Manager to manage all the application containers.

4. Application Master

● A resource request for a container to perform an application task is sent from the Application Master to the Resource Manager.
● Upon receiving the request from the Application Master, the Resource Manager evaluates the resource requirements, checks resource availability, and grants the container to fulfill the resource request.
● Once the container is granted, the Application Master instructs the Node Manager to use the resources and start the application-specific activities, and it also sends health reports to the Resource Manager from time to time.
Application Workflow in Hadoop YARN

● A client submits an application.
● The Resource Manager allocates a container to launch the Application Master.
● The Resource Manager accepts the Application Master's registration.
● The Application Master negotiates containers with the Resource Manager.
● The Application Master sends a request to the Node Manager to launch the containers.
● The application code is run inside the containers.
● To check on the status of the application, the client contacts the Resource Manager or the Application Master.
● The Application Master deregisters from the Resource Manager once the above processing is finished.

Features

● Multitenancy
YARN provides access to multiple data processing engines, such as Batch Processing
engines, Stream Processing Engines, Interactive Processing Engines, Graph Processing
Engines, etc. This brings the advantage of multi-tenancy to the business.
● Cluster Utilization
YARN optimizes a cluster by dynamically using and allocating its resources. YARN is a
parallel processing framework for implementing distributed computing clusters that
process large amounts of data across multiple computing nodes. Hadoop YARN allows
dividing a computing task into hundreds or thousands of tasks.
● Compatibility
YARN in Hadoop is also compatible with the first version of Hadoop, because existing MapReduce applications can run on it. So, code written for earlier versions of Hadoop can still be used with YARN.
● Scalability
The YARN scheduler in Hadoop Resource Manager allows thousands of clusters and
nodes to be managed and scaled by Hadoop.

Hadoop Scheduler

Hadoop MapReduce is a software framework for writing applications that process huge amounts of data (terabytes to petabytes) in parallel on large Hadoop clusters. This framework is responsible for scheduling tasks, monitoring them, and re-executing failed tasks.

In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced. The basic idea behind introducing YARN is to split the functionalities of resource management and job scheduling/monitoring into separate daemons: the ResourceManager, the ApplicationMaster, and the NodeManager.

The ResourceManager is the master daemon that arbitrates resources among all the applications in the system. The NodeManager is the slave daemon responsible for containers, monitoring their resource usage, and reporting the same to the ResourceManager or Schedulers. The ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManager in order to execute and monitor the tasks.

The ResourceManager has two main components that are Schedulers and ApplicationsManager.

The Scheduler in the YARN ResourceManager is a pure scheduler, responsible for allocating resources to the various running applications.

It is not responsible for monitoring or tracking the status of an application. Also, the scheduler does not guarantee the restarting of tasks that fail, whether due to hardware failure or application failure.

The scheduler performs scheduling based on the resource requirements of the applications.

It has some pluggable policies that are responsible for partitioning the cluster resources among
the various queues, applications, etc.
The FIFO Scheduler, CapacityScheduler, and FairScheduler are such pluggable policies that are
responsible for allocating resources to the applications.
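Which of these policies is active is a cluster configuration choice. A minimal yarn-site.xml sketch, assuming the standard yarn.resourcemanager.scheduler.class property (the default value varies between Hadoop versions and distributions):

<configuration>
  <property>
    <!-- Fully qualified class name of the pluggable scheduler the ResourceManager loads -->
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    <!-- Alternatives: ...scheduler.capacity.CapacityScheduler, ...scheduler.fifo.FifoScheduler -->
  </property>
</configuration>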

Let us now study each of these Schedulers in detail.


TYPES OF HADOOP SCHEDULER

1. FIFO Scheduler

First In First Out is the default scheduling policy used in Hadoop. The FIFO Scheduler gives more preference to applications that arrive earlier than to those that arrive later. It places the applications in a queue and executes them in the order of their submission (first in, first out).

Here, irrespective of size and priority, the request of the first application in the queue is allocated resources first. Only once the first application's request is satisfied is the next application in the queue served.

Advantage:

● It is simple to understand and doesn’t need any configuration.


● Jobs are executed in the order of their submission.
Disadvantage:

● It is not suitable for shared clusters. If a large application arrives before a shorter one, the large application will use all the resources in the cluster, and the shorter application has to wait for its turn. This leads to starvation.
● It does not take into account the balance of resource allocation between long applications and short applications.

2. Capacity Scheduler

The CapacityScheduler allows multiple tenants to securely share a large Hadoop cluster. It is designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing the throughput and the utilization of the cluster.

It supports hierarchical queues to reflect the structure of the organizations or groups that utilize the cluster resources. A queue hierarchy contains three types of queues: root, parent, and leaf. The root queue represents the cluster itself, a parent queue represents an organization/group or sub-organization/sub-group, and leaf queues accept application submissions.

The Capacity Scheduler allows the sharing of the large cluster while giving capacity guarantees
to each organization by allocating a fraction of cluster resources to each queue.

Also, when queues running below capacity demand resources that are sitting free on a queue that has completed its tasks, these free resources are assigned to the applications in the queues running below capacity. This provides elasticity for the organizations in a cost-effective manner.
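The queue hierarchy and the per-queue guarantees are declared in capacity-scheduler.xml. A minimal sketch; the queue names and percentages are made up for illustration, while the property names follow the standard yarn.scheduler.capacity.* convention:

<configuration>
  <property>
    <!-- Two leaf queues under root, e.g. one per organization -->
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <property>
    <!-- Guaranteed share of cluster resources, in percent -->
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>40</value>
  </property>
  <property>
    <!-- Elasticity: dev may borrow idle capacity up to this ceiling -->
    <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
    <value>75</value>
  </property>
</configuration>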

Apart from it, the CapacityScheduler provides a comprehensive set of limits to ensure that a
single application/user/queue cannot use a disproportionate amount of resources in the cluster.

To ensure fairness and stability, it also provides limits on initialized and pending apps from a
single user and queue.

Advantages:

● It maximizes the utilization of resources and throughput in the Hadoop cluster.
● It provides elasticity for groups or organizations in a cost-effective manner.
● It also gives capacity guarantees and safeguards to the organizations utilizing the cluster.

Disadvantage:

● It is the most complex of the schedulers.

3. Fair Scheduler

FairScheduler allows YARN applications to fairly share resources in large Hadoop clusters. With
FairScheduler, there is no need for reserving a set amount of capacity because it will dynamically
balance resources between all running applications.

It assigns resources to applications in such a way that all applications get, on average, an equal
amount of resources over time.

The FairScheduler, by default, takes scheduling fairness decisions only on the basis of memory.
We can configure it to schedule with both memory and CPU.

When a single application is running, that application uses the entire cluster's resources. When other applications are submitted, freed-up resources are assigned to the new applications so that every application eventually gets roughly the same amount of resources. The FairScheduler enables short applications to finish in a reasonable time without starving long-lived applications.

Similar to the CapacityScheduler, the FairScheduler supports hierarchical queues to reflect the structure of the groups sharing the cluster.

Apart from fair scheduling, the FairScheduler allows for assigning minimum shares to queues for
ensuring that certain users, production, or group applications always get sufficient resources.
When an app is present in the queue, then the app gets its minimum share, but when the queue
doesn’t need its full guaranteed share, then the excess share is split between other running
applications.
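Those minimum shares are declared in the Fair Scheduler allocation file (commonly fair-scheduler.xml, pointed to by the yarn.scheduler.fair.allocation.file property). A minimal sketch; the queue names, resource figures, and weights are made up for illustration:

<allocations>
  <queue name="prod">
    <minResources>10000 mb,10 vcores</minResources>  <!-- guaranteed minimum share -->
    <weight>2.0</weight>                              <!-- gets twice the fair share of dev -->
  </queue>
  <queue name="dev">
    <minResources>5000 mb,5 vcores</minResources>
    <weight>1.0</weight>
  </queue>
</allocations>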

Advantages:

● It provides a reasonable way to share the Hadoop cluster among a number of users.
● Also, the FairScheduler can work with app priorities where the priorities are used as
weights in determining the fraction of the total resources that each application
should get.

Disadvantage:

● It requires configuration.
