10th August Morning and Afternoon session Hadoop (1)

Hadoop is a distributed processing framework for large data sets, utilizing HDFS for storage and MapReduce for computation. It has evolved since its inception in 2002, becoming a leading platform for big data analytics, with significant milestones including sorting 1 terabyte of data faster than supercomputers. Key components of Hadoop include the NameNode and DataNodes, which manage data storage and processing across clusters, ensuring fault tolerance and data integrity through replication and checksums.


• “Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models”

• Hadoop → an ideal solution to analyze and gain insights from big data.
  – The de facto big-data processing platform
  – Storage: Hadoop Distributed File System (HDFS)
  – Computation: MapReduce (MR)

• HDFS and MR distribute data among the nodes and process it in parallel.


Realizing the benefit
Reading 1 TB of data:

• 10 machines, each with 4 I/O channels, each channel at 100 MB/s
• 1 machine with 4 I/O channels, each channel at 100 MB/s
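
As a back-of-the-envelope check of the benefit, a minimal sketch assuming the figures above, decimal megabytes, and perfectly parallel reads with no coordination overhead:

# Aggregate read throughput and total time to scan 1 TB (illustrative figures).
TB_IN_MB = 1_000_000               # 1 TB expressed in decimal megabytes
CHANNEL_MB_PER_S = 100             # throughput of one I/O channel
CHANNELS_PER_MACHINE = 4

def scan_time_minutes(machines: int) -> float:
    aggregate = machines * CHANNELS_PER_MACHINE * CHANNEL_MB_PER_S   # MB/s
    return TB_IN_MB / aggregate / 60

print(f"1 machine  : {scan_time_minutes(1):.1f} min")    # ~41.7 minutes
print(f"10 machines: {scan_time_minutes(10):.1f} min")   # ~4.2 minutes

Spreading the same data over ten machines cuts the scan time roughly tenfold, which is the core motivation for distributing both storage and computation.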
In 2002, Doug Cutting and Mike Cafarella started the Apache Nutch project, aiming to build a web search engine that could crawl and index websites.

In 2003, Google released a paper on the Google File System (GFS), an architecture for storing large datasets in a distributed environment.

In 2004, Nutch's developers built an open-source implementation of it, the Nutch Distributed File System (NDFS).

In 2004, Google introduced MapReduce to process large datasets in parallel.

In 2006, Nutch's distributed storage and processing code was spun off into an independent subproject called “Hadoop”.

In 2006, Doug Cutting joined Yahoo to scale the Hadoop project to clusters of thousands of nodes.

In 2007, Yahoo started using Hadoop on a 1000-node cluster.

In 2008, Hadoop confirmed its success by becoming a top-level Apache project.

In 2008, Hadoop became the fastest system on the planet at sorting an entire terabyte of data, beating the supercomputers of the day.

In November 2008, Google reported that its MapReduce implementation sorted 1 terabyte in 68 seconds.

In April 2009, a team at Yahoo used Hadoop to sort 1 terabyte in 62 seconds, beating Google's MapReduce implementation.

In December 2011, Apache released Hadoop version 1.0.

In May 2012, the Hadoop 2.0.0-alpha version was released.

In December 2017, release 3.0.0 became available; the 3.3 line followed, with 3.3.4 released in August 2022.
HDFS Architecture

HDFS is a block-structured file system where each file is divided into blocks of a pre-determined size and stored across a cluster of one or several machines. Moving computation is cheaper than moving data.

Name Node:
• Master daemon – maintains and manages the Data Nodes.
• Records the metadata of all the files stored in the cluster, e.g. location of data, size of files, permissions, etc.
• Regularly receives a Heartbeat and block report from the Data Nodes, confirming they are live.
• Responsible for the replication factor.

HDFS Architecture

Data Node:
• Slave daemon.
• The actual data is stored on the Data Nodes.
• Runs on commodity, inexpensive hardware.
• Data Nodes serve read and write requests from the clients.
• Sends heartbeats to the Name Node periodically (every 3 seconds) to report overall health.
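
A minimal sketch of the heartbeat bookkeeping described above, in plain Python rather than Hadoop's actual code; the 3-second interval comes from the slide, while the dead-node timeout is an assumed illustrative value:

import time

HEARTBEAT_INTERVAL = 3          # seconds, per the slide
DEAD_NODE_TIMEOUT = 10 * 60     # assumed cutoff for declaring a node dead

class NameNodeMonitor:
    """Toy view of how a NameNode could track DataNode liveness from heartbeats."""
    def __init__(self):
        self.last_heartbeat = {}           # datanode id -> last heartbeat timestamp

    def receive_heartbeat(self, datanode_id: str) -> None:
        self.last_heartbeat[datanode_id] = time.time()

    def live_datanodes(self) -> list[str]:
        now = time.time()
        return [dn for dn, ts in self.last_heartbeat.items()
                if now - ts < DEAD_NODE_TIMEOUT]

# Each DataNode would call receive_heartbeat(...) every HEARTBEAT_INTERVAL seconds;
# blocks held on nodes that drop out of live_datanodes() get re-replicated elsewhere.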
SECONDARY NAME NODE
• Copies the FsImage and transaction log from the NameNode to a temporary directory
• Merges the FsImage and transaction log into a new FsImage in the temporary directory
• Uploads the new FsImage to the NameNode
  – The transaction log on the NameNode is then purged
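
A toy sketch of that checkpoint cycle; the in-memory FsImage representation and operation names below are invented for illustration and are not Hadoop's real formats:

def checkpoint(fsimage: dict, edit_log: list[tuple]) -> dict:
    """Replay the transaction log over the last FsImage to produce a new FsImage.

    fsimage : path -> file metadata (a toy stand-in for the on-disk format)
    edit_log: ordered operations recorded since the last checkpoint
    """
    new_image = dict(fsimage)
    for op, path, meta in edit_log:
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "update":
            new_image[path] = {**new_image.get(path, {}), **meta}
    return new_image

# The Secondary NameNode performs this merge off the primary, uploads the new
# FsImage back, and the NameNode can then truncate its transaction log.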
Blocks & Replicas
• Blocks are the smallest continuous location • HDFS provides a reliable way to store huge
on your hard drive where data is stored. - data in a distributed environment
HDFS file à blocks
• Blocks are replicated to provide fault tolerance
• Default size of each block is 128 MB in
• Default replication factor is 3
Apache Hadoop 2.x (64 MB in Apache
Hadoop 1.x) – Configure • NN collects Block report – over/under
Example.txt – 514 MB replicated
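
To make the Example.txt figure concrete, a small sketch of how a 514 MB file splits into 128 MB blocks (assuming decimal megabytes and the default replication factor of 3):

def split_into_blocks(file_size_mb: int, block_size_mb: int = 128) -> list[int]:
    """Return the sizes of the HDFS blocks a file of the given size occupies."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([remainder] if remainder else [])

blocks = split_into_blocks(514)
print(blocks)                       # [128, 128, 128, 128, 2] -> 5 blocks
print(len(blocks) * 3, "replicas")  # 15 block replicas with replication factor 3

Note that the last block only occupies the 2 MB it needs, not a full 128 MB.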
Block Placement

• One replica on the local node, a second replica on a node in a remote rack, a third replica on a different node in that same remote rack; additional replicas are placed randomly.
• Data placement is exposed so that computation can be migrated to the data.
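
A rough sketch of that default placement rule; the topology model, rack and node names, and the function itself are invented for illustration:

import random

def place_replicas(topology: dict[str, list[str]], writer_node: str,
                   writer_rack: str, replication: int = 3) -> list[str]:
    """Pick target nodes for a block's replicas following the rule above.

    topology: rack name -> list of node names.
    """
    targets = [writer_node]                                    # 1st: local node
    remote_rack = random.choice([r for r in topology if r != writer_rack])
    targets.append(random.choice(topology[remote_rack]))       # 2nd: node in a remote rack
    same_rack_peers = [n for n in topology[remote_rack] if n not in targets]
    targets.append(random.choice(same_rack_peers))             # 3rd: different node, same remote rack
    # Any further replicas go onto random nodes not already used.
    remaining = [n for nodes in topology.values() for n in nodes if n not in targets]
    targets += random.sample(remaining, max(0, replication - 3))
    return targets

racks = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(place_replicas(racks, writer_node="n1", writer_rack="rack1"))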
HDFS Read Architecture:
• The client reaches out to the NameNode asking for block metadata.
• The NameNode returns the list of DataNodes where each block (Block A & Block B) is stored.
• The client then connects to the DataNodes where the blocks are stored.
• The client starts reading data in parallel from the DataNodes (Block A from DataNode 1 and Block B from DataNode 3).
• Once the client gets all the required file blocks, it combines these blocks to form the file.
• While serving a client's read request, HDFS selects the replica closest to the client, which reduces read latency and bandwidth consumption.
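
A toy walk-through of that read path; the block names, node distances, and functions below are invented and are not the Hadoop client API:

block_locations = {                     # what the NameNode would return
    "Block A": ["DataNode 1", "DataNode 2", "DataNode 4"],
    "Block B": ["DataNode 3", "DataNode 5", "DataNode 6"],
}
distance_to_client = {"DataNode 1": 0, "DataNode 2": 2, "DataNode 3": 1,
                      "DataNode 4": 4, "DataNode 5": 2, "DataNode 6": 4}

def plan_read(blocks: dict[str, list[str]]) -> list[tuple[str, str]]:
    """For each block, pick the replica closest to the client."""
    plan = []
    for block, replicas in blocks.items():
        closest = min(replicas, key=distance_to_client.__getitem__)
        plan.append((block, closest))    # blocks can then be fetched in parallel
    return plan                          # and concatenated in order to rebuild the file

print(plan_read(block_locations))
# [('Block A', 'DataNode 1'), ('Block B', 'DataNode 3')]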
MUTATION ORDER AND LEASES

• A mutation is an operation that changes the contents or metadata of a chunk, such as an append or write operation.
• Each mutation is performed at all replicas.
• Leases are used to maintain consistency.
• The master grants a chunk lease to one replica (the primary).
• The primary picks the serial order for all mutations to the chunk.
• All replicas follow this order (consistency).
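
A minimal sketch of primary-ordered mutations; the class and method names are invented and this is not GFS or HDFS code:

class Primary:
    """The lease holder assigns one global serial number per mutation."""
    def __init__(self):
        self.next_serial = 0
    def order(self, mutation: str) -> tuple[int, str]:
        serial = self.next_serial
        self.next_serial += 1
        return (serial, mutation)

class Replica:
    def __init__(self):
        self.pending, self.applied = [], []
    def receive(self, ordered_mutation: tuple[int, str]) -> None:
        self.pending.append(ordered_mutation)     # may arrive in any order
    def apply_all(self) -> None:
        # Applying strictly in serial order keeps every replica identical.
        for _, mutation in sorted(self.pending):
            self.applied.append(mutation)
        self.pending.clear()

primary, replicas = Primary(), [Replica(), Replica(), Replica()]
for m in ["append:foo", "write:bar", "append:baz"]:
    ordered = primary.order(m)
    for r in replicas:
        r.receive(ordered)
for r in replicas:
    r.apply_all()
assert all(r.applied == replicas[0].applied for r in replicas)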

DATA CORRECTNESS

• Use checksums to validate data
  – Uses CRC32
• File creation
  – The client computes a checksum per 512 bytes
  – The DataNode stores the checksums
• File access
  – The client retrieves the data and checksum from the DataNode
  – If validation fails, the client tries other replicas
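
A small sketch of per-chunk CRC32 validation along these lines, using Python's zlib; the helper names are invented:

import zlib

CHUNK = 512  # bytes per checksum, per the slide

def checksums(data: bytes) -> list[int]:
    """CRC32 per 512-byte chunk, as the client would compute on write."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, stored: list[int]) -> bool:
    """On read, recompute and compare; a mismatch means try another replica."""
    return checksums(data) == stored

original = b"some file contents " * 100
stored = checksums(original)

corrupted = bytearray(original)
corrupted[700] ^= 0xFF                      # flip one byte inside the second chunk
print(verify(original, stored))             # True
print(verify(bytes(corrupted), stored))     # False -> client falls back to another replica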

Guarantees
• Checkpoints for incremental writes
• Checksums for records/chunks
• Unique IDs for records
• Stale replicas detected by version number

