Unit 3: Introduction to Hadoop
Syllabus:
Introduction to Hadoop: Introducing Hadoop, RDBMS versus Hadoop, Distributed Computing Challenges, History of Hadoop, Hadoop Overview, Use Case of Hadoop, Hadoop Distributors, HDFS (Hadoop Distributed File System), Processing Data with Hadoop, Managing Resources and Applications with Hadoop YARN (Yet Another Resource Negotiator), Interacting with Hadoop Ecosystem
Introducing Hadoop
Today, Big Data seems to be the buzzword! Enterprises the world over are beginning to realize that there is a huge volume of untapped information before them in the form of structured, semi-structured, and unstructured data. This wide variety of data is spread across their networks. Let us look at a few statistics to get an idea of the amount of data that gets generated every day, every minute, and every second.
1. Every day:
(a) NYSE (New York Stock Exchange) generates trade data on about 1.5 billion shares.
(b) Facebook stores 2.7 billion comments and Likes.
(c) Google processes about 24 petabytes of data.
2. Every minute:
(a) Facebook users share nearly 2.5 million pieces of content.
(b) Twitter users tweet nearly 300,000 times.
(c) Instagram users post nearly 220,000 new photos.
(d) YouTube users upload 72 hours of new video content.
(e) Apple users download nearly 50,000 apps.
(f) Email users send over 200 million messages.
(g) Amazon generates over $80,000 in online sales.
(h) Google receives over 4 million search queries.
3. Every second:
(a) Banking applications process more than 10,000 credit card transactions.
These numbers keep growing. To store and process such enormous volumes of varied data at a reasonable cost, enterprises are turning to Hadoop, which offers the following advantages:
1. Low cost: Hadoop is an open-source framework and uses commodity hardware (relatively inexpensive and easily obtainable hardware) to store enormous quantities of data.
2. Computing power: Hadoop is based on a distributed computing model and processes very large volumes of data fairly quickly. The more computing nodes there are, the more processing power is at hand.
3. Scalability: Scaling boils down to simply adding nodes as the system grows, and it requires very little administration.
4. Storage flexibility: Unlike traditional relational databases, in Hadoop data need not be pre-processed before being stored. Hadoop provides the convenience of storing as much data as one needs and the added flexibility of deciding later how to use it. In Hadoop, one can store unstructured data such as images, videos, and free-form text.
5. Inherent data protection: Hadoop protects data and application execution against hardware failure. If a node fails, it automatically redirects the jobs that had been assigned to that node to other functional and available nodes, ensuring that the distributed computation does not fail. It goes a step further and stores multiple copies (replicas) of the data on various nodes across the cluster. Hadoop makes use of commodity hardware, a distributed file system, and distributed computing, as shown in Figure 5.3.
In this new design, a group of machines gathered together is known as a cluster.
Figure 5.3: Hadoop framework (distributed file system, commodity hardware).
With this new paradigm, the data can be managed with Hadoop as follows:
1. Distributes the data and duplicates chunks of each data file across several nodes; for example, 25–30 is one chunk of data as shown in Figure 5.3.
2. Locally available compute resources are used to process each chunk of data in parallel.
3. The Hadoop framework handles failover smartly and automatically.
The Name “Hadoop”
The name Hadoop is not an acronym; it is a made-up name. The project creator, Doug Cutting, explains how the name came about: “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.”
Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to
their function, often with an elephant or other animal theme (“Pig”, for example).
HADOOP OVERVIEW
Hadoop is an open-source software framework used to store and process massive amounts of data in a distributed fashion on large clusters of commodity hardware. Basically, Hadoop accomplishes two tasks:
1. Massive data storage.
2. Faster data processing.
Key Aspects of Hadoop
Figure 5.7 describes the key aspects of Hadoop.
Hadoop Components
Figure 5.8 depicts the Hadoop components.
Hadoop Core Components
1. HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.
2. MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
Hadoop Ecosystem: The Hadoop ecosystem consists of support projects that enhance the functionality of the Hadoop core components.
The ecosystem projects are as follows:
1. HIVE
2. PIG
3. SQOOP
4. HBASE
5. FLUME
6. OOZIE
7. MAHOUT
HDFS Daemons
NameNode
HDFS breaks a large file into smaller pieces called blocks.
NameNode uses a rack ID to identify DataNodes in the rack.
A rack is a collection of DataNodes within the cluster.
NameNode keeps track of the blocks of a file as they are placed on various DataNodes.
NameNode manages file-related operations such as read, write, create, and delete.
Its main job is managing the File System Namespace.
A file system namespace is the collection of files in the cluster. NameNode stores the HDFS namespace.
The file system namespace, which includes the mapping of blocks to files and file properties, is stored in a file called FsImage. NameNode uses an EditLog (transaction log) to record every transaction that happens to the file system metadata.
Refer Figure 5.16.
When NameNode starts up, it reads FsImage and EditLog from disk and applies all transactions from the EditLog to the in-memory representation of the FsImage. It then flushes a new version of FsImage out to disk and truncates the old EditLog, because its changes have now been applied to the FsImage. There is a single NameNode per cluster.
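To make the NameNode's role concrete, here is a minimal sketch (not from the text) of a client performing create, read, and delete operations on HDFS through the standard org.apache.hadoop.fs API. The NameNode handles the namespace operations, while the DataNodes store and serve the actual blocks; the path used below is a hypothetical example.

// Minimal HDFS client sketch (illustrative only).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // client handle; namespace calls go to the NameNode

        Path file = new Path("/user/demo/sample.txt");   // hypothetical path

        // Create (write): the NameNode records the file in its namespace;
        // the data itself is pipelined to DataNodes in blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the NameNode returns block locations; the bytes come from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        // Delete: a namespace operation handled by the NameNode.
        fs.delete(file, false);
        fs.close();
    }
}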
DataNode
There are multiple DataNodes per cluster. During pipeline reads and writes, DataNodes communicate with each other. A DataNode also continuously sends a “heartbeat” message to the NameNode to confirm the connectivity between the NameNode and the DataNode. If no heartbeat is received from a DataNode, the NameNode re-replicates the blocks that were stored on that DataNode onto other DataNodes in the cluster and keeps running as if nothing had happened. This is the idea behind the heartbeat report sent by the DataNodes to the NameNode.
Secondary NameNode
The Secondary NameNode takes a snapshot of HDFS metadata at intervals specified in the
Hadoop configuration. Since the memory requirements of Secondary NameNode are the
same as NameNode, it is better to run NameNode and Secondary NameNode on different
machines. In case of failure of the NameNode, the Secondary NameNode can be configured
manually to bring up the cluster. However, the Secondary NameNode does not record any
real-time changes that happen to the HDFS metadata.
The map and reduce functions and the input/output locations are specified by MapReduce applications. These applications use suitable interfaces to construct the job. The application and the job parameters together are known as the job configuration. The Hadoop job client submits the job (jar/executable, etc.) to the JobTracker. It is then the responsibility of the JobTracker to schedule tasks on the slaves. In addition to scheduling, it also monitors the tasks and provides status information to the job client.
MapReduce Daemons
1. JobTracker: It provides connectivity between Hadoop and your application. When you submit code to the cluster, the JobTracker creates the execution plan by deciding which task to assign to which node. It also monitors all the running tasks. When a task fails, it automatically re-schedules the task to a different node after a predefined number of retries. JobTracker is a master daemon responsible for executing the overall MapReduce job. There is a single JobTracker per Hadoop cluster.
2. TaskTracker: This daemon is responsible for executing the individual tasks assigned by the JobTracker. There is a single TaskTracker per slave node, and it spawns multiple Java Virtual Machines (JVMs) to handle multiple map or reduce tasks in parallel. The TaskTracker continuously sends heartbeat messages to the JobTracker. When the JobTracker fails to receive a heartbeat from a TaskTracker, it assumes that the TaskTracker has failed and resubmits its tasks to another available node in the cluster. Once the client submits a job to the JobTracker, the JobTracker partitions it and assigns diverse MapReduce tasks to each TaskTracker in the cluster. Figure 5.22 depicts the JobTracker and TaskTracker interaction.
How Does MapReduce Work?
MapReduce divides a data analysis task into two parts − map and reduce. Figure 5.23 depicts
how the MapReduce Programming works. In this example, there are two mappers and one
reducer.
Each mapper works on the partial dataset that is stored on that node and the reducer
combines the output from the mappers to produce the reduced result set.
Figure 5.24 describes the working model of MapReduce Programming.
The following steps describe how MapReduce performs its task.
1. First, the input dataset is split into multiple pieces of data (several small subsets).
2. Next, the framework creates a master process and several worker processes and executes the worker processes remotely.
3. Several map tasks run simultaneously, each reading the piece of data that was assigned to it. The map worker uses the map function to extract only the relevant data present on its server and generates a key/value pair for each extracted record.
4. The map worker uses a partitioner function to divide the data into regions. The partitioner decides which reducer should get the output of a given mapper.
5. When the map workers complete their work, the master instructs the reduce workers to begin their work. The reduce workers in turn contact the map workers to get the key/value data for their partition. The data thus received is shuffled and sorted by key.
6. The reduce worker then calls the reduce function for every unique key. This function writes the output to the output file.
7. When all the reduce workers complete their work, the master transfers control back to the user program.
MapReduce Example
The classic example of MapReduce programming is Word Count. For example, suppose you need to count the occurrences of each word across 50 files. You can achieve this using MapReduce programming. As a tiny illustration, if one file contains “deer bear river” and another contains “car car river”, the mappers emit pairs such as (deer, 1), (bear, 1), (river, 1), (car, 1), (car, 1), (river, 1), and the reducers sum them to produce (bear, 1), (car, 2), (deer, 1), (river, 2).
Refer Figure 5.25.
Word Count MapReduce Programming using Java
The MapReduce program requires three things (a minimal Java sketch combining them follows this list):
1. Driver Class: This class specifies the job configuration details.
2. Mapper Class: This class overrides the map function based on the problem statement.
3. Reducer Class: This class overrides the reduce function based on the problem statement.
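The sketch below closely follows the standard WordCount example that ships with the Hadoop MapReduce tutorial; the input and output HDFS paths are passed as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper Class: emits (word, 1) for every word in its input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer Class: sums the counts received for each unique word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver Class: specifies the job configuration details.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar (say wordcount.jar, a name chosen here for illustration), the job can be submitted with, for example, hadoop jar wordcount.jar WordCount /input /output.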
YARN Architecture:
Figure 5.29 depicts YARN architecture.
The steps involved in YARN architecture are as follows:
1. A client program submits the application which includes the necessary specifications to
launch the application-specific ApplicationMaster itself.
2. The ResourceManager launches the ApplicationMaster by assigning some container.
3. The ApplicationMaster, on boot-up, registers with the ResourceManager. This helps the
client program to query the ResourceManager directly for the details.
4. During the normal course, ApplicationMaster negotiates appropriate resource containers
via the resource-request protocol.
5. On successful container allocations, the ApplicationMaster launches the container by
providing the container launch specification to the NodeManager.
6. The NodeManager executes the application code and provides necessary information such as progress, status, etc. to its ApplicationMaster via an application-specific protocol.
7. During the application execution, the client that submitted the job directly communicates
with the ApplicationMaster to get status, progress updates, etc. via an application-specific
protocol.
8. Once the application has been processed completely, ApplicationMaster deregisters with
the ResourceManager and shuts down, allowing its own container to be repurposed.
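To illustrate how a client can talk to the ResourceManager (as in step 7 above), here is a minimal sketch, not taken from the text, that uses the YarnClient API to list the applications the ResourceManager is currently tracking along with their states.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml for the ResourceManager address
        yarnClient.start();

        // Ask the ResourceManager for all applications it knows about.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  "
                    + app.getName() + "  "
                    + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}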
Sqoop
Sqoop is a tool that helps to transfer data between Hadoop and relational databases. With the help of Sqoop, you can import data from an RDBMS into HDFS and vice versa. Figure 5.32 depicts Sqoop in the Hadoop ecosystem.
HBase
HBase is a NoSQL database for Hadoop. It is a column-oriented NoSQL database used to store billions of rows and millions of columns. HBase provides random read/write operations. It also supports record-level updates, which are not possible with HDFS alone. HBase sits on top of HDFS. Figure 5.33 depicts HBase in the Hadoop ecosystem.
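Here is a minimal sketch (illustrative only) of the random read/write access described above, using the standard HBase Java client API. The table name "users" and the column family "info" are hypothetical and are assumed to exist already.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: insert/update a single row keyed by "user100".
            Put put = new Put(Bytes.toBytes("user100"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Random read: fetch the same row back by its key.
            Result result = table.get(new Get(Bytes.toBytes("user100")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}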