10th August Morning and Afternoon session Hadoop (1)

Hadoop is a distributed processing framework for large data sets, utilizing HDFS for storage and MapReduce for computation. It has evolved since its inception in 2002, becoming a leading platform for big data analytics, with significant milestones including sorting 1 terabyte of data faster than supercomputers. Key components of Hadoop include the NameNode and DataNodes, which manage data storage and processing across clusters, ensuring fault tolerance and data integrity through replication and checksums.


• “Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models”

• Hadoop → an ideal solution to analyze and gain insights from big data.
  – The de facto big-data processing platform
  – Storage: Hadoop Distributed File System (HDFS)
  – Computation: MapReduce (MR)

• HDFS and MR distribute data among the nodes and process it in parallel.


Realizing the benefit
Reading 1 TB of data:

• 10 machines, each with 4 I/O channels, each channel at 100 MB/s
• 1 machine with 4 I/O channels, each channel at 100 MB/s
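
As a back-of-the-envelope check of the benefit, a minimal sketch assuming the figures above, decimal megabytes, and perfectly parallel reads with no coordination overhead:

# Aggregate read throughput and total time to scan 1 TB (illustrative figures).
TB_IN_MB = 1_000_000               # 1 TB expressed in decimal megabytes
CHANNEL_MB_PER_S = 100             # throughput of one I/O channel
CHANNELS_PER_MACHINE = 4

def scan_time_minutes(machines: int) -> float:
    aggregate = machines * CHANNELS_PER_MACHINE * CHANNEL_MB_PER_S   # MB/s
    return TB_IN_MB / aggregate / 60

print(f"1 machine  : {scan_time_minutes(1):.1f} min")    # ~41.7 minutes
print(f"10 machines: {scan_time_minutes(10):.1f} min")   # ~4.2 minutes

Spreading the same data over ten machines cuts the scan time roughly tenfold, which is the core motivation for distributing both storage and computation.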
In 2002, Doug Cutting and Mike Cafarella started the Apache Nutch project, aiming to build a web search engine that could crawl and index websites.

In 2003, Google released a paper on the Google File System (GFS), an architecture for storing large datasets in a distributed environment.

In 2004, Nutch's developers built an open-source implementation of it, the Nutch Distributed File System (NDFS).

In 2004, Google introduced MapReduce to process large datasets in parallel.

In 2006, Nutch's distributed storage and processing code was spun off into an independent subproject called “Hadoop”.

In 2006, Doug Cutting joined Yahoo to scale the Hadoop project to clusters of thousands of nodes.

In 2007, Yahoo started using Hadoop on a 1000-node cluster.

In 2008, Hadoop confirmed its success by becoming a top-level Apache project.

In 2008, Hadoop became the fastest system on the planet at sorting an entire terabyte of data, beating the supercomputers of the day.

In November 2008, Google reported that its MapReduce implementation sorted 1 terabyte in 68 seconds.

In April 2009, a team at Yahoo used Hadoop to sort 1 terabyte in 62 seconds, beating Google's MapReduce implementation.

In December 2011, Apache released Hadoop version 1.0.

In May 2012, the Hadoop 2.0.0-alpha version was released.

In December 2017, release 3.0.0 became available; the 3.3 line followed, with 3.3.4 released in August 2022.
HDFS Architecture

HDFS is a block-structured file system where each file is divided into blocks of a pre-determined size and stored across a cluster of one or several machines. Moving computation is cheaper than moving data.

Name Node:
• Master daemon – maintains and manages the Data Nodes.
• Records the metadata of all the files stored in the cluster, e.g. location of data, size of files, permissions, etc.
• Regularly receives a Heartbeat and block report from the Data Nodes, confirming they are live.
• Responsible for the replication factor.

HDFS Architecture

Data Node:
• Slave daemon.
• The actual data is stored on the Data Nodes.
• Runs on commodity, inexpensive hardware.
• Data Nodes serve read and write requests from the clients.
• Sends heartbeats to the Name Node periodically (every 3 seconds) to report overall health.
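
A minimal sketch of the heartbeat bookkeeping described above, in plain Python rather than Hadoop's actual code; the 3-second interval comes from the slide, while the dead-node timeout is an assumed illustrative value:

import time

HEARTBEAT_INTERVAL = 3          # seconds, per the slide
DEAD_NODE_TIMEOUT = 10 * 60     # assumed cutoff for declaring a node dead

class NameNodeMonitor:
    """Toy view of how a NameNode could track DataNode liveness from heartbeats."""
    def __init__(self):
        self.last_heartbeat = {}           # datanode id -> last heartbeat timestamp

    def receive_heartbeat(self, datanode_id: str) -> None:
        self.last_heartbeat[datanode_id] = time.time()

    def live_datanodes(self) -> list[str]:
        now = time.time()
        return [dn for dn, ts in self.last_heartbeat.items()
                if now - ts < DEAD_NODE_TIMEOUT]

# Each DataNode would call receive_heartbeat(...) every HEARTBEAT_INTERVAL seconds;
# blocks held on nodes that drop out of live_datanodes() get re-replicated elsewhere.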
SECONDARY NAME NODE
• Copies the FsImage and transaction log from the NameNode to a temporary directory
• Merges the FsImage and transaction log into a new FsImage in the temporary directory
• Uploads the new FsImage to the NameNode
  – The transaction log on the NameNode is then purged
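
A toy sketch of that checkpoint cycle; the in-memory FsImage representation and operation names below are invented for illustration and are not Hadoop's real formats:

def checkpoint(fsimage: dict, edit_log: list[tuple]) -> dict:
    """Replay the transaction log over the last FsImage to produce a new FsImage.

    fsimage : path -> file metadata (a toy stand-in for the on-disk format)
    edit_log: ordered operations recorded since the last checkpoint
    """
    new_image = dict(fsimage)
    for op, path, meta in edit_log:
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "update":
            new_image[path] = {**new_image.get(path, {}), **meta}
    return new_image

# The Secondary NameNode performs this merge off the primary, uploads the new
# FsImage back, and the NameNode can then truncate its transaction log.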
Blocks & Replicas
• Blocks are the smallest continuous location • HDFS provides a reliable way to store huge
on your hard drive where data is stored. - data in a distributed environment
HDFS file à blocks
• Blocks are replicated to provide fault tolerance
• Default size of each block is 128 MB in
• Default replication factor is 3
Apache Hadoop 2.x (64 MB in Apache
Hadoop 1.x) – Configure • NN collects Block report – over/under
Example.txt – 514 MB replicated
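
To make the Example.txt figure concrete, a small sketch of how a 514 MB file splits into 128 MB blocks (assuming decimal megabytes and the default replication factor of 3):

def split_into_blocks(file_size_mb: int, block_size_mb: int = 128) -> list[int]:
    """Return the sizes of the HDFS blocks a file of the given size occupies."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([remainder] if remainder else [])

blocks = split_into_blocks(514)
print(blocks)                       # [128, 128, 128, 128, 2] -> 5 blocks
print(len(blocks) * 3, "replicas")  # 15 block replicas with replication factor 3

Note that the last block only occupies the 2 MB it needs, not a full 128 MB.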
Block Placement

• One replica on the local node, a second replica on a node in a remote rack, a third replica on a different node in that same remote rack; additional replicas are placed randomly.
• Data placement is exposed so that computation can be migrated to the data.
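
A rough sketch of that default placement rule; the topology model, rack and node names, and the function itself are invented for illustration:

import random

def place_replicas(topology: dict[str, list[str]], writer_node: str,
                   writer_rack: str, replication: int = 3) -> list[str]:
    """Pick target nodes for a block's replicas following the rule above.

    topology: rack name -> list of node names.
    """
    targets = [writer_node]                                    # 1st: local node
    remote_rack = random.choice([r for r in topology if r != writer_rack])
    targets.append(random.choice(topology[remote_rack]))       # 2nd: node in a remote rack
    same_rack_peers = [n for n in topology[remote_rack] if n not in targets]
    targets.append(random.choice(same_rack_peers))             # 3rd: different node, same remote rack
    # Any further replicas go onto random nodes not already used.
    remaining = [n for nodes in topology.values() for n in nodes if n not in targets]
    targets += random.sample(remaining, max(0, replication - 3))
    return targets

racks = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(place_replicas(racks, writer_node="n1", writer_rack="rack1"))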
HDFS Read Architecture:
• The client reaches out to the NameNode asking for block metadata.
• The NameNode returns the list of DataNodes where each block (Block A & Block B) is stored.
• The client then connects to the DataNodes where the blocks are stored.
• The client starts reading data in parallel from the DataNodes (Block A from DataNode 1 and Block B from DataNode 3).
• Once the client gets all the required file blocks, it combines these blocks to form the file.
• While serving a client's read request, HDFS selects the replica closest to the client, which reduces read latency and bandwidth consumption.
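
A toy walk-through of that read path; the block names, node distances, and functions below are invented and are not the Hadoop client API:

block_locations = {                     # what the NameNode would return
    "Block A": ["DataNode 1", "DataNode 2", "DataNode 4"],
    "Block B": ["DataNode 3", "DataNode 5", "DataNode 6"],
}
distance_to_client = {"DataNode 1": 0, "DataNode 2": 2, "DataNode 3": 1,
                      "DataNode 4": 4, "DataNode 5": 2, "DataNode 6": 4}

def plan_read(blocks: dict[str, list[str]]) -> list[tuple[str, str]]:
    """For each block, pick the replica closest to the client."""
    plan = []
    for block, replicas in blocks.items():
        closest = min(replicas, key=distance_to_client.__getitem__)
        plan.append((block, closest))    # blocks can then be fetched in parallel
    return plan                          # and concatenated in order to rebuild the file

print(plan_read(block_locations))
# [('Block A', 'DataNode 1'), ('Block B', 'DataNode 3')]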
MUTATION ORDER AND LEASES

• A mutation is an operation that changes the contents or metadata of a chunk, such as an append or write operation.
• Each mutation is performed at all replicas.
• Leases are used to maintain consistency.
• The master grants a chunk lease to one replica (the primary).
• The primary picks the serial order for all mutations to the chunk.
• All replicas follow this order (consistency).
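
A minimal sketch of primary-ordered mutations; the class and method names are invented and this is not GFS or HDFS code:

class Primary:
    """The lease holder assigns one global serial number per mutation."""
    def __init__(self):
        self.next_serial = 0
    def order(self, mutation: str) -> tuple[int, str]:
        serial = self.next_serial
        self.next_serial += 1
        return (serial, mutation)

class Replica:
    def __init__(self):
        self.pending, self.applied = [], []
    def receive(self, ordered_mutation: tuple[int, str]) -> None:
        self.pending.append(ordered_mutation)     # may arrive in any order
    def apply_all(self) -> None:
        # Applying strictly in serial order keeps every replica identical.
        for _, mutation in sorted(self.pending):
            self.applied.append(mutation)
        self.pending.clear()

primary, replicas = Primary(), [Replica(), Replica(), Replica()]
for m in ["append:foo", "write:bar", "append:baz"]:
    ordered = primary.order(m)
    for r in replicas:
        r.receive(ordered)
for r in replicas:
    r.apply_all()
assert all(r.applied == replicas[0].applied for r in replicas)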

DATA CORRECTNESS

• Use checksums to validate data
  – Uses CRC32
• File creation
  – The client computes a checksum per 512 bytes
  – The DataNode stores the checksums
• File access
  – The client retrieves the data and checksum from the DataNode
  – If validation fails, the client tries other replicas
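
A small sketch of per-chunk CRC32 validation along these lines, using Python's zlib; the helper names are invented:

import zlib

CHUNK = 512  # bytes per checksum, per the slide

def checksums(data: bytes) -> list[int]:
    """CRC32 per 512-byte chunk, as the client would compute on write."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, stored: list[int]) -> bool:
    """On read, recompute and compare; a mismatch means try another replica."""
    return checksums(data) == stored

original = b"some file contents " * 100
stored = checksums(original)

corrupted = bytearray(original)
corrupted[700] ^= 0xFF                      # flip one byte inside the second chunk
print(verify(original, stored))             # True
print(verify(bytes(corrupted), stored))     # False -> client falls back to another replica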

Guarantees
• Checkpoints for incremental writes
• Checksums for records/chunks
• Unique IDs for records
• Stale replicas detected by version number

