Introduction To Hadoop

MapReduce is a programming model for processing large datasets in a distributed manner. It involves two functions - map and reduce. The map function processes each input and generates intermediate key-value pairs. The reduce function combines these intermediate values based on keys to produce the final output. MapReduce allows distributed processing across large clusters with automatic parallelization, fault tolerance, and load balancing. Implementations like Hadoop provide a runtime system that handles scheduling, data distribution, synchronization, and fault recovery.



Introduction to MapReduce

Today's Topics

Functional programming

MapReduce

Distributed file system


Functional Programming

MapReduce = functional programming meets distributed processing on steroids
Not a new idea... dates back to the 50s (or even 30s!)

What is functional programming?
Computation as application of functions
Theoretical foundation provided by lambda calculus

How is it different?
Traditional notions of "data" and "instructions" are not applicable
Data flows are implicit in the program
Different orders of execution are possible
Exemplified by LISP and ML


Overview of Lisp

Lisp ≈ Lost In Silly Parentheses
We'll focus on a particular dialect: "Scheme"
Lists are primitive data types
Functions written in prefix notation

(+ 1 2) → 3
(* 3 4) → 12
(sqrt (+ (* 3 3) (* 4 4))) → 5
(define x 3) → x
(* x 5) → 15

'(1 2 3 4 5)
'((a 1) (b 2) (c 3))

Functions

Functions = lambda expressions bound to variables
Syntactic sugar for defining functions
The two expressions below are equivalent:
Once defined, a function can be applied:

(define (foo x y)
  (sqrt (+ (* x x) (* y y))))

(define foo
  (lambda (x y)
    (sqrt (+ (* x x) (* y y)))))

(foo 3 4) → 5

Other Features

In Scheme, everything is an s-expression
No distinction between "data" and "code"
Easy to write self-modifying code
Higher-order functions
Functions that take other functions as arguments

(define (bar f x) (f (f x)))
(define (baz x) (* x x))
(bar baz 2) → 16

Doesn't matter what f is, just apply it twice.

Recursion is your friend

Simple factorial example
Even iteration is written with recursive calls!

(define (factorial n)
  (if (= n 1)
      1
      (* n (factorial (- n 1)))))
(factorial 6) → 720

(define (factorial-iter n)
  (define (aux n top product)
    (if (= n top)
        (* n product)
        (aux (+ n 1) top (* n product))))
  (aux 1 n 1))
(factorial-iter 6) → 720

Lisp → MapReduce?

What does this have to do with MapReduce?
After all, Lisp is about processing lists
Two important concepts in functional programming:
Map: do something to everything in a list
Fold: combine the results of a list in some way

Map

Map is a higher-order function
How map works:
The function is applied to every element in a list
The result is a new list

[Diagram: f applied to each element of a list, producing a new list]

Fold

Fold is also a higher-order function
How fold works:
Accumulator set to initial value
Function applied to a list element and the accumulator
Result stored in the accumulator
Repeated for every item in the list
Result is the final value in the accumulator

[Diagram: f folds each element into the accumulator, from the initial value to the final value]

Map/Fold in Action

Simple map example:

(map (lambda (x) (* x x)) '(1 2 3 4 5)) → '(1 4 9 16 25)

Fold examples:

(fold + 0 '(1 2 3 4 5)) → 15
(fold * 1 '(1 2 3 4 5)) → 120

Sum of squares:

(define (sum-of-squares L)
  (fold + 0 (map (lambda (x) (* x x)) L)))
(sum-of-squares '(1 2 3 4 5)) → 55

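For readers more comfortable outside Lisp, here is a minimal Python sketch of the same two ideas, using the built-in map and functools.reduce (Python's fold); it mirrors the Scheme examples above.

from functools import reduce

# map: apply a function to every element, producing a new list
print(list(map(lambda x: x * x, [1, 2, 3, 4, 5])))          # [1, 4, 9, 16, 25]

# fold: combine a list into a single value via an accumulator
print(reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0))   # 15
print(reduce(lambda acc, x: acc * x, [1, 2, 3, 4, 5], 1))   # 120

# sum of squares = fold over the result of map
def sum_of_squares(xs):
    return reduce(lambda acc, x: acc + x, map(lambda v: v * v, xs), 0)

print(sum_of_squares([1, 2, 3, 4, 5]))                      # 55
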
Lisp → MapReduce

Let's assume a long list of records: imagine if...
We can parallelize map operations
We have a mechanism for bringing map results back together in the fold operation
That's MapReduce! (and Hadoop)

Observations:
No limit to map parallelization since maps are independent
We can reorder folding if the fold function is commutative and associative

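To make the parallelization point concrete, here is a toy Python sketch (not Hadoop) that runs the map step across processes with multiprocessing.Pool and then folds the partial results; because + is commutative and associative, the order in which results are folded does not matter.

from functools import reduce
from multiprocessing import Pool

def square(x):
    # the "map" function; must be a top-level function so it can be sent to workers
    return x * x

if __name__ == "__main__":
    records = list(range(1, 1001))
    with Pool() as pool:
        mapped = pool.map(square, records)            # maps run independently, in parallel
    # fold the mapped results; + is commutative and associative,
    # so combining partial results in any order gives the same answer
    total = reduce(lambda acc, v: acc + v, mapped, 0)
    print(total)
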
Typical Problem

Iterate over a large number of records
Extract something of interest from each
Shuffle and sort intermediate results
Aggregate intermediate results
Generate final output

(The first two steps form the Map phase; the last two form the Reduce phase, with the shuffle and sort in between.)

MapReduce

Programmers specify two functions:
map (k, v) → <k', v'>*
reduce (k', v') → <k', v'>*
All values with the same key are reduced together

Usually, programmers also specify:
partition (k', number of partitions) → partition for k'
Often a simple hash of the key, e.g. hash(k') mod n (see the sketch below)
Allows reduce operations for different keys to run in parallel

Implementations:
Google has a proprietary implementation in C++
Hadoop is an open-source implementation in Java (led by Yahoo)

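As an illustration of the hash-mod idea, a default partitioner can be sketched in a few lines of Python. The function below is hypothetical, not Hadoop's Partitioner API; note also that Python's built-in hash of strings is randomized per process, whereas a real system would use a deterministic hash.

def partition(key, num_partitions):
    # send every occurrence of the same key to the same reducer
    return hash(key) % num_partitions

# example: route intermediate keys to 4 reducers
for k in ["apple", "banana", "apple", "cherry"]:
    print(k, "->", partition(k, 4))
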
It's just divide and conquer!
[Diagram: initial (k, v) pairs are read from the data store and handed to parallel map tasks; each map task emits intermediate values for keys k1, k2, k3; a barrier aggregates values by key; reduce tasks then produce the final values for k1, k2, and k3.]

Recall these problems?

How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?

MapReduce Runtime

Handles scheduling
Assigns workers to map and reduce tasks
Handles "data distribution"
Moves the process to the data
Handles synchronization
Gathers, sorts, and shuffles intermediate data
Handles faults
Detects worker failures and restarts
Everything happens on top of a distributed FS (later)

"Hello World": Word Count

Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String key, Iterator intermediate_values):
  // key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Source: Dean and Ghemawat (OSDI 2004)

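The same word-count logic can be simulated end to end in plain Python. The sketch below illustrates the programming model (map, shuffle by key, reduce); it is not Hadoop's actual API, and the sample documents are made up.

from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # emit (word, 1) for every word in the document
    for word in doc_contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # sum all counts for a given word
    return (word, sum(counts))

documents = {"doc1": "the quick brown fox", "doc2": "the lazy dog the fox"}

# map phase, followed by the shuffle: group all values by key
intermediate = defaultdict(list)
for name, contents in documents.items():
    for key, value in map_fn(name, contents):
        intermediate[key].append(value)

# reduce phase
for word in sorted(intermediate):
    print(reduce_fn(word, intermediate[word]))   # e.g. ('the', 3)
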
Bandwidth Optimization

Issue: large number of key-value pairs
Solution: use "Combiner" functions
Executed on the same machine as the mapper
Results in a "mini-reduce" right after the map phase
Reduces key-value pairs to save bandwidth

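Conceptually, a combiner is the reduce function run locally on each mapper's output before anything is sent over the network. The Python sketch below (a simulation, not Hadoop's Combiner interface) shows how local pre-aggregation shrinks the number of key-value pairs that must be shuffled.

from collections import Counter

def map_with_combiner(doc_contents):
    pairs = [(w, 1) for w in doc_contents.split()]   # raw map output: one pair per word
    combined = Counter()
    for word, count in pairs:                        # local "mini-reduce" on the mapper
        combined[word] += count
    return list(combined.items())                    # one pair per distinct word

out = map_with_combiner("to be or not to be")
print(out)   # [('to', 2), ('be', 2), ('or', 1), ('not', 1)] -- 4 pairs instead of 6
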

Skew Problem

Issue: the reduce phase is only as fast as the slowest map
Solution: redundantly execute map operations and use the results of the first to finish (sketched below)
Addresses hardware problems...
But not issues related to the inherent distribution of the data

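The idea behind this kind of speculative execution can be pictured as: launch duplicate copies of the same task and take whichever copy finishes first. A toy Python sketch, purely illustrative; in Hadoop the runtime handles this for you.

import concurrent.futures
import random
import time

def map_task(task_id):
    time.sleep(random.uniform(0.1, 1.0))   # simulate a machine that may be slow
    return f"result-of-task-{task_id}"

with concurrent.futures.ThreadPoolExecutor() as pool:
    # submit the same logical task twice, as if on two different workers
    futures = [pool.submit(map_task, 42), pool.submit(map_task, 42)]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    print(next(iter(done)).result())       # use whichever copy finished first
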
How do we get data to the workers?

[Diagram: compute nodes reading data over the network from shared NAS/SAN storage]
What's the problem here?

Distributed File System

Don't move data to workers... move workers to the data!
Store data on the local disks of nodes in the cluster
Start up the workers on the node that has the data local (see the sketch below)

Why?
Not enough RAM to hold all the data in memory
Disk access is slow, but disk throughput is good

A distributed file system is the answer
GFS (Google File System)
HDFS for Hadoop
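"Move workers to the data" amounts to a scheduling preference: given the nodes that hold a block's replicas, assign the task to an idle node that already has the data, and only fall back to a remote node otherwise. A hypothetical Python sketch; the data structures and function are ours, not HDFS's actual scheduler.

def pick_worker(block_replicas, idle_workers):
    # prefer an idle worker that already stores the block locally
    for worker in idle_workers:
        if worker in block_replicas:
            return worker, "local"
    # otherwise fall back to any idle worker (the data must travel over the network)
    return idle_workers[0], "remote"

replicas = {"node3", "node7", "node9"}                       # where the block's copies live
print(pick_worker(replicas, ["node1", "node7", "node8"]))    # ('node7', 'local')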


GFS: Assumptions

Commodity hardware over "exotic" hardware
High component failure rates
Inexpensive commodity components fail all the time
"Modest" number of HUGE files
Files are write-once, mostly appended to
Perhaps concurrently
Large streaming reads over random access
High sustained throughput over low latency

GFS slides adapted from material by Dean et al.

GFS: Design Decisions

Files stored as chunks
Fixed size (64 MB; see the sketch below)
Reliability through replication
Each chunk replicated across 3+ chunkservers
Single master to coordinate access and keep metadata
Simple centralized management
No data caching
Little benefit due to large data sets, streaming reads
Simplify the API
Push some of the issues onto the client

Source: Ghemawat et al. (SOSP 2003)

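With fixed-size chunks, locating data is simple arithmetic: a byte offset maps directly to a chunk index, and the client then asks the master which chunkservers hold that chunk's replicas. A small sketch assuming the 64 MB chunk size above:

CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB, as in GFS

def chunk_index(byte_offset):
    # which chunk of the file contains this offset?
    return byte_offset // CHUNK_SIZE

print(chunk_index(200_000_000))      # offset 200,000,000 falls in chunk 2
print((1 * 1024**3) // CHUNK_SIZE)   # a 1 GiB file spans 16 chunks
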
Single Master

We know this is a:
Single point of failure
Scalability bottleneck

GFS solutions:
Shadow masters
Minimize master involvement
Never move data through it; use it only for metadata (and cache metadata at clients)
Large chunk size
Master delegates authority to primary replicas for data mutations (chunk leases)

Simple, and good enough!

Master's Responsibilities (1/2)

Metadata storage
Namespace management/locking
Periodic communication with chunkservers
Give instructions, collect state, track cluster health
Chunk creation, re-replication, rebalancing
Balance space utilization and access speed
Spread replicas across racks to reduce correlated failures
Re-replicate data if redundancy falls below a threshold
Rebalance data to smooth out storage and request load

Master's Responsibilities (2/2)

Garbage collection
Simpler and more reliable than traditional file delete
Master logs the deletion and renames the file to a hidden name
Lazily garbage-collects hidden files
Stale replica deletion
Detects "stale" replicas using chunk version numbers

Metadata

Global metadata is stored on the master
File and chunk namespaces
Mapping from files to chunks
Locations of each chunk's replicas

All in memory (64 bytes per chunk; see the calculation below)
Fast
Easily accessible

Master has an operation log for persistent logging of critical metadata updates
Persistent on local disk
Replicated
Checkpoints for faster recovery

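At roughly 64 bytes of metadata per chunk, it is easy to see why the metadata fits in the master's memory. A quick back-of-the-envelope calculation; the 1 PiB cluster size is our own illustrative figure.

CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB chunks
METADATA_PER_CHUNK = 64                # roughly 64 bytes of metadata per chunk

data_stored = 1 * 1024**5              # assume the cluster holds 1 PiB of data
num_chunks = data_stored // CHUNK_SIZE
print(num_chunks)                                                    # 16,777,216 chunks
print(num_chunks * METADATA_PER_CHUNK / 1024**3, "GiB of metadata")  # ~1 GiB
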

Mutations

Mutation = write or append
Must be done for all replicas
Goal: minimize master involvement

Lease mechanism (sketched below):
Master picks one replica as primary; gives it a "lease" for mutations
Primary defines a serial order of mutations
All replicas follow this order
Data flow decoupled from control flow

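The essential point of the lease is that the primary alone chooses a serial order that every replica then applies. A toy Python sketch of that idea; the class and method names are hypothetical, not part of GFS.

class PrimaryReplica:
    def __init__(self):
        self.next_serial = 0

    def order(self, mutation):
        # the primary assigns each mutation its position in the serial order
        serial = self.next_serial
        self.next_serial += 1
        return serial, mutation

class SecondaryReplica:
    def __init__(self):
        self.log = {}

    def apply(self, serial, mutation):
        # every replica applies mutations in the order chosen by the primary
        self.log[serial] = mutation

primary, secondary = PrimaryReplica(), SecondaryReplica()
for m in ["write A", "append B", "write C"]:
    serial, mutation = primary.order(m)
    secondary.apply(serial, mutation)
print(sorted(secondary.log.items()))   # [(0, 'write A'), (1, 'append B'), (2, 'write C')]
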

Parallelization Problems

How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?

How is MapReduce different?

From Theory to Practice

Workflow between you and the Hadoop cluster:
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster

On Amazon: With EC2

The same workflow, with your Hadoop cluster now running on EC2:
0. Allocate Hadoop cluster (EC2)
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
7. Clean up!

Uh oh. Where did the data go?

On Amazon: EC2 and S3

Your Hadoop cluster runs on EC2 ("the cloud"); S3 serves as the persistent store
Copy from S3 to HDFS
Copy from HDFS to S3

Questions?
