HDFS
The Hadoop File System (HDFS) was developed using a distributed file system design. It runs on commodity
hardware. Unlike some other distributed systems, HDFS is highly fault tolerant and designed to use
low-cost hardware.
HDFS holds a very large amount of data and provides easier access. To store such huge data, the
files are stored across multiple machines. These files are stored in a redundant fashion to protect the
system from possible data loss in case of failure. HDFS also makes applications available for
parallel processing.
Features of HDFS
● It is suitable for distributed storage and processing.
● Hadoop provides a command interface to interact with HDFS.
● The built-in servers of the namenode and datanode help users easily check the status of the
cluster.
● Streaming access to file system data.
● HDFS provides file permissions and authentication.
DESIGN OF HDFS
HDFS follows the master-slave architecture and it has the following elements.
Namenode
HDFS works in a master-worker pattern where the name node acts as the master. The Name Node is the
controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS;
the metadata information includes file permissions, names, and the location of each block. The metadata
is small, so it is stored in the memory of the name node, allowing faster access to data. Moreover, the
HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a
single machine. The system hosting the namenode acts as the master server and it does the
following tasks −
● Manages the file system namespace.
● Regulates clients' access to files.
● Executes file system operations such as renaming, closing, and opening files and directories.
Datanode
Datanodes store and retrieve blocks when they are told to by the client or the name node. They report back to
the name node periodically with the list of blocks that they are storing. The data node, being
commodity hardware, also does the work of block creation, deletion, and replication as instructed by
the name node.
● Datanodes perform read-write operations on the file systems, as per client request.
● They also perform operations such as block creation, deletion, and replication according to the
instructions of the namenode.
Block
Generally, the user data is stored in the files of HDFS. A file in the file system is divided
into one or more segments and/or stored in individual data nodes. These file segments are called
blocks. In other words, the minimum amount of data that HDFS can read or write is called a
Block.
HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into
block-sized chunks, which are stored as independent units. Unlike a regular file system, if a file in
HDFS is smaller than the block size, it does not occupy the full block's size; e.g., a 5 MB file stored
in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large in order to
minimize the cost of seeks.
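As a small illustration, the default block size and the block size of a stored file can be inspected through the Java FileSystem API. This is only a sketch: the namenode URI hdfs://localhost:9000 and the path /geeks/AI.txt are assumptions, not values prescribed by the text.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed namenode address; replace with the cluster's fs.defaultFS value.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Default block size configured for new files (128 MB unless overridden).
        System.out.println("Default block size: " + fs.getDefaultBlockSize(new Path("/")) + " bytes");

        // Block size and actual length of an existing file (hypothetical path).
        FileStatus status = fs.getFileStatus(new Path("/geeks/AI.txt"));
        System.out.println("File block size : " + status.getBlockSize() + " bytes");
        System.out.println("File length     : " + status.getLen() + " bytes");
    }
}

A 5 MB file would report a block size of 128 MB but a length of only about 5 MB, matching the point above.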
Hadoop FS Command Line
The Hadoop FS command line is a simple way to access and interface with HDFS. Below are
some basic HDFS commands in Linux, including operations like creating directories, moving
files, deleting files, reading files, and listing directories.
To use HDFS commands, first start the Hadoop services using the following command, and then verify that the daemons are running with jps:
sbin/start-all.sh
jps
The sections below cover several basic HDFS commands; a longer list of file system commands can be obtained with the -help option.
1. ls: This command is used to list all the files. Use lsr for a recursive listing; it is useful when
we want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in the HDFS root. The bin directory contains executables, so bin/hdfs
means we want the hdfs executable, specifically its dfs (Distributed File System) commands.
2. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let’s
first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be created relative to the home
directory.
3. touchz: It creates an empty file.
Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt
4. copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store. This
is the most important command. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to folder geeks
present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
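The same copy can also be done programmatically through the Java FileSystem API; the sketch below reuses the hypothetical paths from the example above and an assumed namenode URI.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed namenode URI for this sketch.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Equivalent of: bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
        fs.copyFromLocalFile(new Path("../Desktop/AI.txt"), new Path("/geeks"));

        fs.close();
    }
}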
5. copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
6. cp: This command is used to copy files within HDFS. Let's copy the folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
7. mv: This command is used to move files within HDFS. Let's cut-paste the file myfile.txt from the
geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
8. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when
you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory and then the
directory itself.
9. dus: This command gives the total size of a directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /geeks
10. stat: It gives the last modified time of a directory or path. In short, it gives the stats of the
directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /geeks
11. setrep: This command is used to change the replication factor of a file/directory in HDFS. By default it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).
Example 1: To change the replication factor to 6 for geeks.txt stored in HDFS:
bin/hdfs dfs -setrep -R -w 6 geeks.txt
Example 2: To change the replication factor to 4 for the directory /geeks stored in HDFS:
bin/hdfs dfs -setrep -R 4 /geeks
Hadoop MapReduce – Data Flow
Map-Reduce is a processing framework used to process data over a large number of machines.
Hadoop uses Map-Reduce to process the data distributed in a Hadoop cluster. Map-Reduce is not
similar to other regular processing frameworks like Hibernate, JDK, .NET, etc. All these
frameworks are designed to be used with a traditional system where the data is stored at a
single location, like a Network File System or an Oracle database. But when we are processing big
data, the data is located on multiple commodity machines with the help of HDFS.
So when the data is stored on multiple nodes, we need a processing framework that can copy
the program to the locations where the data is present; that is, it copies the program to all the
machines where the data resides. This is where Map-Reduce comes into the picture for processing
the data on Hadoop over a distributed system. Hadoop has a major drawback of cross-switch
network traffic due to the massive volume of data. Map-Reduce comes with a feature
called Data Locality: the ability to move the computation closer to where the data actually
resides on the machines.
Map-Reduce is a model that comes with a Map phase and a Reduce phase. The map is used
for transformation, while the Reducer is used for aggregation-type operations. The terminology
for Map and Reduce is derived from functional programming languages like Lisp, Scala,
etc. A Map-Reduce program comes with 3 main components, i.e. the
Driver code, the Mapper (for transformation), and the Reducer (for aggregation).
Let's take an example where you have a file of 10 TB to process on Hadoop. The 10 TB of
data is first distributed across multiple nodes on Hadoop with HDFS. Now we have to process it,
and for that we have the Map-Reduce framework. To process this data with Map-Reduce, we have
Driver code, which configures and submits a Job. If we are using the Java programming language for processing the
data on HDFS, then we need to initiate this Driver class with the Job object. Suppose you have a
car, which is your framework; then the start button used to start the car is similar to this Driver
code in the Map-Reduce framework. We need to initiate the Driver code to utilize the advantages
of the Map-Reduce framework.
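To make the driver idea concrete, here is a minimal sketch of a driver class built around the Job object. It is only an illustration: the class name WordCountDriver and the use of Hadoop's built-in TokenCounterMapper and IntSumReducer library classes are choices made for this sketch, not something prescribed by the text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // The Job object is the "start button" of the Map-Reduce framework.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Library mapper/reducer shipped with Hadoop; a real job would plug in its own classes.
        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}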
There are also Mapper and Reducer classes provided by this framework, which are predefined and
can be extended by developers as per the organization's requirements.
Working of Mapper
The Mapper is the first code that interacts with the input dataset. Suppose we have
100 data blocks of the dataset we are analyzing; in that case, there will be 100 Mapper
programs or processes that run in parallel on the machines (nodes) and produce their own output,
known as intermediate output, which is then stored on the local disk, not on HDFS. The output of
the mappers acts as the input for the Reducer, which performs sorting and aggregation operations on the
data and produces the final output.
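A minimal word-count style Mapper is sketched below to show where the intermediate <word, 1> pairs come from. The class name WordCountMapper and the whitespace tokenization are assumptions of this sketch.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One Mapper runs per input split; its output is intermediate data kept on the local disk.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit <word, 1> for every token in the line; the framework shuffles these to the reducers.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}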
Working Of Reducer
The Reducer is the second part of the Map-Reduce programming model. The Mapper produces its
output in the form of key-value pairs, which works as input for the Reducer. But before these
intermediate key-value pairs are sent to the Reducer, a shuffle-and-sort step groups and sorts
the key-value pairs according to their keys. The output generated by the Reducer is
the final output, which is then stored on HDFS (Hadoop Distributed File System). The Reducer
mainly performs computation operations like addition, filtration, and aggregation.
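A matching Reducer sketch that sums the values grouped under each key after the shuffle-and-sort step; again, the class name WordCountReducer is an assumption chosen for illustration.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <word, [1, 1, ...]> after shuffle/sort and writes the final <word, count> to HDFS.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}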
Steps of Data-Flow: the data flows through the splitting, mapping, shuffling, and reducing phases, which are described in detail later in this unit.
Compression in Hadoop
1. What to compress?
1) Compressing input files
If the input file is compressed, then the bytes read in from HDFS are reduced, which means less
time to read data. This time saving is beneficial to the performance of job execution.
If the input files are compressed, they will be decompressed automatically as they are read by
MapReduce, using the filename extension to determine which codec to use. For example, a file
ending in .gz can be identified as gzip-compressed file and thus read with GzipCodec.
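The same codec lookup by file extension can also be done explicitly with Hadoop's CompressionCodecFactory, as in the hedged sketch below; the input path /geeks/input.txt.gz is a placeholder chosen for illustration.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadCompressedFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical gzip-compressed input file in HDFS.
        Path input = new Path("/geeks/input.txt.gz");

        // Pick the codec from the filename extension (.gz -> GzipCodec), as MapReduce does.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(input);

        InputStream in = (codec == null)
                ? fs.open(input)                            // no known extension: read as-is
                : codec.createInputStream(fs.open(input));  // decompress transparently

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            System.out.println("First line: " + reader.readLine());
        }
    }
}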
gzip:
gzip is naturally supported by Hadoop. gzip is based on the DEFLATE algorithm, which is a
combination of LZ77 and Huffman Coding.
bzip2:
bzip2 is a freely available, patent-free, high-quality data compressor. It typically
compresses files to within 10% to 15% of the best available techniques (the PPM family of
statistical compressors), whilst being around twice as fast at compression and six times faster at
decompression.
LZO:
The LZO compression format is composed of many smaller (~256K) blocks of compressed data,
allowing jobs to be split along block boundaries. Moreover, it was designed with speed in mind:
it decompresses about twice as fast as gzip, meaning it’s fast enough to keep up with hard drive
read speeds. It doesn’t compress quite as well as gzip — expect files that are on the order of
50% larger than their gzipped version. But that is still 20-50% of the size of the files without any
compression at all, which means that IO-bound jobs complete the map phase about four times
faster.
Snappy:
Snappy is a compression/decompression library. It does not aim for maximum compression, or
compatibility with any other compression library; instead, it aims for very high speeds and
reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of
magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to
100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about
250 MB/sec or more and decompresses at about 500 MB/sec or more. Snappy is widely used
inside Google, in everything from BigTable and MapReduce to Google's internal RPC systems.
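As a sketch of how a job would opt into compression (assuming the codecs above are installed on the cluster), the final job output and the intermediate map output can be compressed through standard job settings. The class name and the choice of gzip for output and Snappy for map output are assumptions for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionSettings {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // Compress the intermediate map output with Snappy (fast, reasonable ratio);
        // requires the Snappy native library to be available on the nodes.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed output job");

        // Compress the final job output with gzip.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        return job;
    }
}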
What is Serialization?
Serialization is the process of translating data structures or an object's state into binary or textual
form so that the data can be transported over a network or stored on some persistent storage. Once the data is
transported over the network or retrieved from persistent storage, it needs to be deserialized
again. Serialization is also termed marshalling, and deserialization is termed unmarshalling.
Serialization in Hadoop
Generally in distributed systems like Hadoop, the concept of serialization is used for
Interprocess Communication and Persistent Storage.
Interprocess Communication
To establish interprocess communication between the nodes connected in a network, the RPC (Remote Procedure Call) technique is used.
RPC uses internal serialization to convert the message into binary format before sending it to the remote node via the network. At the other end, the remote system deserializes the binary stream into the original message.
The RPC serialization format is required to be as follows −
● Compact − To make the best use of network bandwidth, which is the most scarce resource in a data center.
● Fast − Since the communication between the nodes is crucial in distributed systems, the serialization and deserialization process should be quick, producing little overhead.
● Extensible − Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.
● Interoperable − The message format should support nodes that are written in different languages.
Persistent Storage
Persistent Storage is a digital storage facility that does not lose its data with the loss of power
supply. Files, folders, databases are the examples of persistent storage.
Writable Interface
This is the interface in Hadoop which provides methods for serialization and deserialization. It declares the following two methods −
● void write(DataOutput out) − serializes the fields of the object to the given output stream.
● void readFields(DataInput in) − deserializes the fields of the object from the given input stream.
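As a hedged sketch of how these two methods are used, a custom type can implement Writable by writing and reading its fields in the same order. The EmployeeWritable class below is purely illustrative and not part of the Hadoop API.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Illustrative custom type: fields are written and read back in the same order.
public class EmployeeWritable implements Writable {
    private String name;
    private int age;

    public EmployeeWritable() { }            // no-arg constructor required by Hadoop

    public EmployeeWritable(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);   // serialize fields to the output stream
        out.writeInt(age);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();  // deserialize fields in the order they were written
        age = in.readInt();
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }
}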
To transfer data over a network or store it persistently, you need to serialize the data. In addition
to the serialization APIs provided by Java and Hadoop, we have a special utility called Avro, a
schema-based serialization technique.
What is Avro?
Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting,
the father of Hadoop. Since Hadoop writable classes lack language portability, Avro becomes
quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a
preferred tool to serialize data in Hadoop.
Avro has a schema-based system. A language-independent schema is associated with its read and
write operations. Avro serializes the data which has a built-in schema. Avro serializes the data
into a compact binary format, which can be deserialized by any application.
Avro Schemas
Avro depends heavily on its schema. It allows all data to be written with no prior knowledge
of the schema. It serializes fast, and the resulting serialized data is smaller in size. The schema is stored
along with the Avro data in a file for any further processing.
In RPC, the client and the server exchange schemas during the connection. This exchange helps
in the communication between same named fields, missing fields, extra fields, etc.
Avro schemas are defined in JSON, which simplifies their implementation in languages with JSON
libraries.
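To make this concrete, here is a small sketch that parses a JSON-defined schema and builds a record with Avro's Java API; the employee schema itself is an assumption chosen for the example.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroSchemaExample {
    public static void main(String[] args) {
        // A hypothetical Avro schema, defined in JSON.
        String schemaJson = "{"
                + "\"type\": \"record\","
                + "\"name\": \"Employee\","
                + "\"fields\": ["
                + "  {\"name\": \"name\", \"type\": \"string\"},"
                + "  {\"name\": \"age\",  \"type\": \"int\"}"
                + "]}";

        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema; no generated classes are needed.
        GenericRecord employee = new GenericData.Record(schema);
        employee.put("name", "Ravi");
        employee.put("age", 30);

        System.out.println(employee);
    }
}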
Like Avro, there are other serialization mechanisms in Hadoop such as Sequence Files, Protocol
Buffers, and Thrift.
Features of Avro
Listed below are some of the prominent features of Avro −
● Avro is a language-neutral data serialization system.
● It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
● Avro creates a binary structured format that is both compressible and splittable, so it can be used efficiently as input to Hadoop MapReduce jobs.
● Avro provides rich data structures; for example, a record can contain an array, an enumerated type, and a sub-record.
● Avro stores data along with its schema in a self-describing Avro data file.
● Avro is also used in Remote Procedure Calls (RPCs), where the client and server exchange schemas in the connection handshake.
Hadoop File Systems
Hadoop is utilized for batch/offline processing over a network of many machines forming a
physical cluster. The framework works in such a manner that it is capable of providing
distributed storage and processing over the same cluster. It is designed to work on cheaper
systems, commonly known as commodity hardware, where each system offers its local storage and
computation power.
Hadoop is capable of running on various file systems, and HDFS is just one implementation
out of all of them. Hadoop has a variety of file systems that can be
implemented concretely. The Java abstract class org.apache.hadoop.fs.FileSystem represents a
file system in Hadoop; some of the concrete implementations are listed below.
Filesystem   URI scheme   Java implementation (all under org.apache.hadoop)   Description
Local        file         fs.LocalFileSystem             A filesystem for a locally connected disk with client-side checksums; RawLocalFileSystem is the variant with no checksums.
HDFS         hdfs         hdfs.DistributedFileSystem     Hadoop's distributed filesystem.
HFTP         hftp         hdfs.HftpFileSystem            Read-only access to HDFS over HTTP. This filesystem is commonly used with distcp to share data between HDFS clusters possessing different versions.
FTP          ftp          fs.ftp.FTPFileSystem           A filesystem backed by an FTP server.
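Because org.apache.hadoop.fs.FileSystem is the common abstraction, client code can obtain a concrete implementation from the URI scheme alone, as in this sketch; the URIs are placeholders assumed for illustration.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FileSystemSchemes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The scheme in the URI selects the concrete FileSystem implementation.
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);
        FileSystem hdfs  = FileSystem.get(URI.create("hdfs://localhost:9000/"), conf);

        System.out.println("file:// -> " + local.getClass().getName()); // e.g. LocalFileSystem
        System.out.println("hdfs:// -> " + hdfs.getClass().getName());  // e.g. DistributedFileSystem
    }
}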
MapReduce is a software framework and programming model used for processing huge amounts
of data. MapReduce programs work in two phases, namely Map and Reduce. Map tasks deal with the
splitting and mapping of data, while Reduce tasks shuffle and reduce the data. Hadoop is capable
of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
MapReduce programs are parallel in nature and are thus very useful for
performing large-scale data analysis using multiple machines in the cluster. The input to each
phase is key-value pairs. In addition, every programmer needs to specify two functions: a map
function and a reduce function.
The whole process goes through four phases of execution namely, splitting, mapping, shuffling,
and reducing.
Let us understand how MapReduce works by taking an example where we have a text file called
example.txt whose contents are as follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Now, suppose we have to perform a word count on example.txt using MapReduce. So, we will
be finding the unique words and the number of occurrences of those unique words.
The data goes through the following phases of MapReduce in Big Data
Input Splits:
The input to a MapReduce job is divided into fixed-size pieces called input splits.
An input split is a chunk of the input that is consumed by a single map.
● First, we divide the input into three splits as shown in the figure. This will distribute the
work among all the map nodes.
Mapping
This is the very first phase in the execution of a map-reduce program. In this phase, the data in each
split is passed to a mapping function to produce output values. In our example, the job of the mapping
phase is to count the number of occurrences of each word from the input splits (described above)
and prepare a list in the form of <word, frequency>.
● Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each
of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that
every word, in itself, will occur once.
● Now, a list of key-value pairs will be created where the key is nothing but the individual
words and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs:
Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records
from the Mapping phase output. In our example, the same words are clubbed together along with their
respective frequencies.
● After the mapper phase, a partition process takes place in which sorting and shuffling happen
so that all the tuples with the same key are sent to the corresponding reducer.
● So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values
corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1], etc.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase combines values
from Shuffling phase and returns a single output value. In short, this phase summarizes the
complete dataset.
● Now, each Reducer counts the values which are present in that list of values. As shown in
the figure, reducer gets a list of values which is [1,1] for the key Bear. Then, it counts the
number of ones in the very list and gives the final output as — Bear, 2.
● Finally, all the output key/value pairs are then collected and written in the output file.
YARN, known as Yet Another Resource Negotiator, was introduced in Hadoop version 2.0 by
Yahoo and Hortonworks in 2012. The basic idea of YARN in Hadoop is to divide the functions of
resource management and task scheduling/monitoring into separate daemon processes.
YARN in Hadoop allows for the execution of various data processing engines such as batch
processing, graph processing, stream processing, and interactive processing, as well as the
processing of data stored in HDFS.
Why is YARN in Hadoop Used?
Before Hadoop 2.0, Hadoop 1.x had two main components: the Hadoop Distributed File System
(HDFS) and MapReduce. The MapReduce batch processing framework was tightly coupled to
HDFS.
● MapReduce had to handle both resource management and processing.
● The Job Tracker was overloaded with the many functions it had to manage, including planning, task control, processing, resource allocation, etc.
● One Job Tracker was the bottleneck in terms of scalability.
● Overall, the system was computationally inefficient in terms of resource usage.
● YARN in Hadoop efficiently and dynamically allocates all cluster resources, resulting in
higher Hadoop utilization compared to previous versions which help in better cluster
utilization.
● Clusters in YARN in Hadoop can now run streaming data processing and interactive
queries in parallel with MapReduce batch jobs.
● All thanks to YARN in Hadoop, it can now handle several processing methods and can
support a wider range of applications.
YARN Architecture in Hadoop
The architecture of YARN consists of several
components, such as the Resource Manager, Node Manager, and Application Master.
The cluster's Resource Manager and Node Manager are two components in charge of managing
and scheduling Hadoop jobs. The execution of tasks in parallel is the responsibility of the
Application Master. Its daemon is in charge of carrying out the compute jobs, checking them for
errors, and finishing them.
2. Application Manager:
● The Application Manager is responsible for collecting job submissions, selecting the first
container to run the application-specific ApplicationMaster, and providing services to
restart the ApplicationMaster container in case of failure.
● Each application's ApplicationMaster is responsible for negotiating the appropriate
resource containers from the scheduler, maintaining their state and tracking progress.
3. Node Manager:
● Monitors resource usage (storage, CPU, etc.) per container and handles log management.
● It registers with the Resource Manager and sends out heartbeats containing the health status
of the node.
● The Resource Manager assigns the Node Manager to manage all the application containers.
4. Application Master
● A resource request for a container to perform an application task is sent from the
Application Master to the Resource Manager.
● Upon receiving the request from the Application Master, the Resource Manager evaluates
the resource requirements, checks resource availability, and authorizes the container to fulfill
the resource request.
● Once the container is allocated, the Application Master instructs the Node Manager to
use the resources and start the application-specific activities; it also sends health reports to the
Resource Manager from time to time.
Application Workflow in Hadoop YARN
● A client submits an application.
● The Resource Manager allocates a container to launch the Application Master.
● The Resource Manager accepts the Application Master's registration.
● Containers are negotiated by the Application Master with the Resource Manager.
● The Node Manager receives a request from the Application Master to launch containers.
● The container is used to run the application code.
● To check on the status of the application, the client contacts the Resource Manager or the
Application Master.
● The Application Master deregisters with the Resource Manager once the above processing
is finished.
Features
● Multitenancy
YARN provides access to multiple data processing engines, such as Batch Processing
engines, Stream Processing Engines, Interactive Processing Engines, Graph Processing
Engines, etc. This brings the advantage of multi-tenancy to the business.
● Cluster Utilization
YARN optimizes a cluster by dynamically using and allocating its resources. YARN is a
parallel processing framework for implementing distributed computing clusters that
process large amounts of data across multiple computing nodes. Hadoop YARN allows
dividing a computing task into hundreds or thousands of tasks.
● Compatibility
YARN in Hadoop is also compatible with the first version of Hadoop because it uses
existing MapReduce applications. So, YARN can also be used with earlier versions of
Hadoop.
● Scalability
The YARN scheduler in Hadoop Resource Manager allows thousands of clusters and
nodes to be managed and scaled by Hadoop.
Hadoop Scheduler
Hadoop MapReduce is a software framework for writing applications that process huge amounts
of data (terabytes to petabytes) in parallel on a large Hadoop cluster. This framework is
responsible for scheduling tasks, monitoring them, and re-executing failed tasks.
In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced. The basic idea
behind the introduction of YARN is to split the functionalities of resource management and job
scheduling/monitoring into separate daemons: the ResourceManager, ApplicationMaster,
and NodeManager.
The ResourceManager is the master daemon that arbitrates resources among all the applications in the
system. The NodeManager is the slave daemon responsible for containers, monitoring their resource
usage, and reporting the same to the ResourceManager or the scheduler. The ApplicationMaster negotiates
resources from the ResourceManager and works with the NodeManager in order to execute and
monitor the tasks.
The ResourceManager has two main components: the Scheduler and the ApplicationsManager.
The Scheduler is not responsible for monitoring or tracking the status of an application. Also, the scheduler
does not guarantee the restarting of tasks that fail due to either hardware failure or
application failure.
The scheduler performs scheduling based on the resource requirements of the applications.
It has some pluggable policies that are responsible for partitioning the cluster resources among
the various queues, applications, etc.
The FIFO Scheduler, CapacityScheduler, and FairScheduler are such pluggable policies that are
responsible for allocating resources to the applications.
1. FIFO Scheduler
First In First Out is the default scheduling policy used in Hadoop. The FIFO Scheduler gives more
preference to the applications that come first than to those that come later. It places the applications in a
queue and executes them in the order of their submission (first in, first out).
Here, irrespective of size and priority, the request of the first application in the queue is
allocated first. Only once the first application's request is satisfied is the next application in the
queue served.
Advantages:
● It is simple to understand and does not need any configuration.
Disadvantages:
● It is not suitable for shared clusters. If the large application comes before the shorter
one, then the large application will use all the resources in the cluster, and the shorter
application has to wait for its turn. This leads to starvation.
● It does not take into account the balance of resource allocation between the long
applications and short applications.
2. Capacity Scheduler
It supports hierarchical queues to reflect the structure of the organizations or groups that utilize the
cluster resources. A queue hierarchy contains three types of queues: root, parent, and leaf.
The root queue represents the cluster itself, a parent queue represents an organization/group or
sub-organization/sub-group, and leaf queues accept application submissions.
The Capacity Scheduler allows the sharing of the large cluster while giving capacity guarantees
to each organization by allocating a fraction of cluster resources to each queue.
Also, when queues running below capacity demand the free resources available on a queue that has
completed its tasks, those resources are assigned to the applications on the queues running below
capacity. This provides elasticity for the organization in a cost-effective manner.
Apart from it, the CapacityScheduler provides a comprehensive set of limits to ensure that a
single application/user/queue cannot use a disproportionate amount of resources in the cluster.
To ensure fairness and stability, it also provides limits on initialized and pending apps from a
single user and queue.
Advantages:
● It maximizes the utilization and throughput of the shared cluster by giving each organization a guaranteed capacity while reclaiming unused capacity.
● Hierarchical queues and per-user/per-queue limits make it suitable for clusters shared by multiple organizations.
3. Fair Scheduler
FairScheduler allows YARN applications to fairly share resources in large Hadoop clusters. With
FairScheduler, there is no need for reserving a set amount of capacity because it will dynamically
balance resources between all running applications.
It assigns resources to applications in such a way that all applications get, on average, an equal
amount of resources over time.
The FairScheduler, by default, takes scheduling fairness decisions only on the basis of memory.
We can configure it to schedule with both memory and CPU.
When a single application is running, that app uses the entire cluster's resources. When
other applications are submitted, the freed-up resources are assigned to the new apps so that every
app eventually gets roughly the same amount of resources. The FairScheduler enables short apps to
finish in a reasonable time without starving long-lived apps.
Apart from fair scheduling, the FairScheduler allows for assigning minimum shares to queues for
ensuring that certain users, production, or group applications always get sufficient resources.
When an app is present in the queue, then the app gets its minimum share, but when the queue
doesn’t need its full guaranteed share, then the excess share is split between other running
applications.
Advantages:
● It provides a reasonable way to share the Hadoop cluster among a number of
users.
● Also, the FairScheduler can work with app priorities where the priorities are used as
weights in determining the fraction of the total resources that each application
should get.
Disadvantage:
● It requires configuration.