Hadoop Notes 2
Hadoop is an open-source framework from Apache used to store, process, and analyze
data that is very huge in volume. Hadoop is written in Java and is not OLAP
(online analytical processing); it is used for batch/offline processing. It is used by
Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up
just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS
was developed on the basis of it. It states that files are broken into blocks
and stored across nodes in the distributed architecture.
2. YARN: Yet Another Resource Negotiator, used for job scheduling and managing the
cluster.
3. Map Reduce: This is a framework which helps Java programs do parallel
computation on data using key-value pairs. The Map task takes input data and
converts it into a data set which can be computed as key-value pairs. The output of
the Map task is consumed by the Reduce task, and the output of the reducer gives
the desired result (a word-count sketch follows this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by
the other Hadoop modules.
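To make the key-value flow in the Map Reduce module concrete, here is a minimal
word-count sketch written against the standard org.apache.hadoop.mapreduce API. The
class names WordCountMapper and WordCountReducer are illustrative, not part of Hadoop
itself.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map task: converts each input line into (word, 1) key-value pairs.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit (word, 1)
                }
            }
        }
    }

    // Reduce task: consumes the mapper output and sums the counts per word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum)); // emit (word, total)
        }
    }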
Hadoop Architecture
The Hadoop architecture is a package of the file system, the MapReduce engine, and HDFS
(the Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node
includes the Job Tracker and NameNode, whereas each slave node includes a Task Tracker
and DataNode.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It
follows a master/slave architecture: a single NameNode performs the role of master, and
multiple DataNodes perform the role of slaves.
Both the NameNode and the DataNodes are capable of running on commodity machines.
HDFS is developed in Java, so any machine that supports the Java language can easily run
the NameNode and DataNode software.
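As a small illustration of how a client program talks to HDFS, the sketch below opens
and prints a file through the standard Java FileSystem API. The NameNode address and
file path are assumptions made up for this example; in a real deployment the address
comes from core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address, for illustration only.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
            FileSystem fs = FileSystem.get(conf);

            // Example path; the NameNode resolves it to blocks on DataNodes.
            Path path = new Path("/user/demo/input.txt");
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }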
NameNode
o It is the single master server in the HDFS cluster.
o It manages the file system namespace and regulates client access to files.
o It stores metadata (file names, block locations, permissions) rather than the actual
data.
DataNode
o A cluster usually has many DataNodes, typically one per node.
o It stores the actual data in blocks and serves read and write requests from clients.
o It regularly reports the blocks it holds back to the NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the
data by using the NameNode.
o In response, NameNode provides metadata to Job Tracker.
Task Tracker
o It works as a slave node for the Job Tracker.
o It receives tasks and code from the Job Tracker and runs that code on the data held in
its node.
MapReduce Layer
MapReduce comes into play when a client application submits a MapReduce job to the
Job Tracker (a job-submission sketch follows below). In response, the Job Tracker sends
the request to the appropriate Task Trackers. Sometimes a Task Tracker fails or times
out; in such a case, that part of the job is rescheduled.
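The sketch below shows how a client application might build and submit such a job using
the standard Job API; it wires up the illustrative WordCountMapper and WordCountReducer
classes from the earlier sketch.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output locations are taken from the command line.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Submits the job to the cluster and waits for completion;
            // the framework reschedules tasks whose workers fail or time out.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }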
Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in
faster retrieval. Even the tools to process the data are often on the same servers, thus
reducing the processing time. Hadoop can process terabytes of data in minutes and
petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so
it is really cost-effective compared to a traditional relational database management
system.
o Resilient to failure: HDFS can replicate data over the network, so if one node goes
down or some other network failure happens, Hadoop uses another copy of the data.
Normally, data is replicated three times, but the replication factor is configurable
(see the configuration sketch after this list).
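The replication factor mentioned above is controlled by the dfs.replication property,
which can be set cluster-wide in hdfs-site.xml. A typical entry, assuming the default of
three copies, looks like this:

    <property>
      <name>dfs.replication</name>
      <!-- number of copies HDFS keeps of each block -->
      <value>3</value>
    </property>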
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the
Google File System paper published by Google.
o In 2002, Doug Cutting and Mike Cafarella started to work on Apache Nutch, an
open-source web crawler software project.
o While working on Apache Nutch, they were dealing with big data, and storing it
was proving very costly for the project. This problem became one of the
important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google File System). It
is a proprietary distributed file system developed to provide efficient access to
data.
o In 2004, Google released a white paper on Map Reduce. This technique
simplifies the data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as
NDFS (Nutch Distributed File System). This file system also included Map
Reduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug
Cutting introduced a new project, Hadoop, with a file system known as HDFS
(Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released
that year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo ran two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort one terabyte of data, doing so
on a 900-node cluster in 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.