Hadoop Notes 2

Hadoop is an open-source framework from Apache designed for storing, processing, and analyzing large volumes of data, primarily using batch processing. It consists of several modules, including HDFS for distributed file storage, YARN for resource management, and MapReduce for parallel data computation. Hadoop's architecture features a master/slave setup with a single master node and multiple slave nodes, and it is known for its scalability, cost-effectiveness, and resilience to failures.


What is Hadoop

Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very large in volume. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many others. Moreover, it can be scaled up simply by adding nodes to the cluster.

Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files are broken into blocks and stored across nodes in a distributed architecture.
2. YARN: Yet Another Resource Negotiator, used for job scheduling and for managing the cluster.
3. MapReduce: This is a framework that lets Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result (a minimal sketch of a Mapper and Reducer follows this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by the other Hadoop modules.
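
For illustration, the following is a minimal sketch of the classic word-count example written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class and variable names are invented for this example; the Mapper emits (word, 1) pairs and the Reducer sums them.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: converts each input line into (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}

// Reduce task: sums the counts emitted by the Map tasks for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // emit (word, total count)
    }
}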

Hadoop Architecture
The Hadoop architecture is a package of the HDFS file system and the MapReduce engine. The MapReduce engine can be MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node runs the JobTracker and the NameNode (in small clusters it may also run a TaskTracker and a DataNode), whereas each slave node runs a DataNode and a TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It has a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.

Both the NameNode and the DataNodes are able to run on commodity machines. HDFS is developed in Java, so any machine that supports Java can easily run the NameNode and DataNode software.

NameNode

o It is the single master server in the HDFS cluster.
o Because it is a single node, it can become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming, and closing files.
o It simplifies the architecture of the system.

DataNode

o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
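
As a hedged illustration of how these roles appear to a client, the sketch below uses the Hadoop Java FileSystem API: the namespace operations (create, rename, delete) are metadata requests handled by the NameNode, while the actual bytes are streamed to and from DataNodes. The NameNode address and file paths are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Example NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        FileSystem fs = FileSystem.get(conf);
        Path src = new Path("/user/demo/notes.txt");

        // Writing a file: the NameNode allocates blocks, the data goes to DataNodes.
        try (FSDataOutputStream out = fs.create(src, true)) {
            out.writeUTF("hello hdfs");
        }

        // A namespace operation (rename) is handled entirely by the NameNode.
        Path dst = new Path("/user/demo/notes-renamed.txt");
        fs.rename(src, dst);

        // Reading: the client asks the NameNode for block locations,
        // then reads the blocks directly from the DataNodes.
        try (FSDataInputStream in = fs.open(dst)) {
            System.out.println(in.readUTF());
        }

        fs.delete(dst, false);   // remove the example file
        fs.close();
    }
}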

Job Tracker

o The role of the JobTracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, the NameNode provides metadata to the JobTracker.

Task Tracker

o It works as a slave node for the JobTracker.
o It receives tasks and code from the JobTracker and applies that code to the file. This process can also be called a Mapper.

MapReduce Layer
The MapReduce layer comes into play when a client application submits a MapReduce job to the JobTracker. In response, the JobTracker sends the request to the appropriate TaskTrackers. Sometimes a TaskTracker fails or times out; in such a case, that part of the job is rescheduled.
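
A minimal driver that submits such a job might look like the sketch below, written against the YARN-era org.apache.hadoop.mapreduce API and reusing the hypothetical WordCountMapper and WordCountReducer classes from the earlier sketch; the input and output paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS (placeholders).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job to the cluster and waits for completion;
        // failed or timed-out tasks are rescheduled by the framework.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}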

Advantages of Hadoop
o Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools that process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one node goes down or some other network failure happens, Hadoop uses the other copy of the data. Normally, data is replicated three times, but the replication factor is configurable (a small sketch follows this list).
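
As a small example of that configurability, the replication factor can be set cluster-wide through the dfs.replication property (normally in hdfs-site.xml) or adjusted per file from client code. The sketch below shows the client-side route through the Java API; the file path is invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files created by this client (the usual cluster default is 3).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Lower the replication factor of an existing file to 2.
        fs.setReplication(new Path("/user/demo/notes-renamed.txt"), (short) 2);

        fs.close();
    }
}
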
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.

Let's trace the history of Hadoop in the following steps:

o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open-source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing and processing that data proved very costly, and this problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This project also included MapReduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo ran two clusters of 1,000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster in 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.
