Introduction To Hadoop

MapReduce is a programming model for processing large datasets in a distributed manner. It involves two functions - map and reduce. The map function processes each input and generates intermediate key-value pairs. The reduce function combines these intermediate values based on keys to produce the final output. MapReduce allows distributed processing across large clusters with automatic parallelization, fault tolerance, and load balancing. Implementations like Hadoop provide a runtime system that handles scheduling, data distribution, synchronization, and fault recovery.



Introduction to MapReduce

Today's Topics

Functional programming

MapReduce

Distributed file system


Functional Programming

MapReduce = functional programming meets distributed processing on steroids
Not a new idea... dates back to the 50s (or even 30s!)

What is functional programming?
Computation as application of functions
Theoretical foundation provided by lambda calculus

How is it different?
Traditional notions of "data" and "instructions" are not applicable
Data flows are implicit in the program
Different orders of execution are possible
Exemplified by LISP and ML


Overview of Lisp

Lisp ≈ Lost In Silly Parentheses
We'll focus on a particular dialect: "Scheme"
Lists are primitive data types
Functions written in prefix notation

(+ 1 2) → 3
(* 3 4) → 12
(sqrt (+ (* 3 3) (* 4 4))) → 5
(define x 3) → x
(* x 5) → 15

'(1 2 3 4 5)
'((a 1) (b 2) (c 3))

Functions

Functions = lambda expressions bound to variables
Syntactic sugar for defining functions
The two expressions below are equivalent:
Once defined, a function can be applied:

(define (foo x y)
  (sqrt (+ (* x x) (* y y))))

(define foo
  (lambda (x y)
    (sqrt (+ (* x x) (* y y)))))

(foo 3 4) → 5

Other Features

In Scheme, everything is an s-expression
No distinction between "data" and "code"
Easy to write self-modifying code
Higher-order functions
Functions that take other functions as arguments

(define (bar f x) (f (f x)))
(define (baz x) (* x x))
(bar baz 2) → 16

Doesn't matter what f is, just apply it twice.

Recursion is your friend

Simple factorial example
Even iteration is written with recursive calls!

(define (factorial n)
  (if (= n 1)
      1
      (* n (factorial (- n 1)))))
(factorial 6) → 720

(define (factorial-iter n)
  (define (aux n top product)
    (if (= n top)
        (* n product)
        (aux (+ n 1) top (* n product))))
  (aux 1 n 1))
(factorial-iter 6) → 720

Lisp → MapReduce?

What does this have to do with MapReduce?
After all, Lisp is about processing lists
Two important concepts in functional programming:
Map: do something to everything in a list
Fold: combine the results of a list in some way

Map

Map is a higher-order function
How map works:
The function is applied to every element in a list
The result is a new list

[Diagram: f applied to each element of a list, producing a new list]

Fold

Fold is also a higher-order function
How fold works:
Accumulator set to initial value
Function applied to a list element and the accumulator
Result stored in the accumulator
Repeated for every item in the list
Result is the final value in the accumulator

[Diagram: f folds each element into the accumulator, from the initial value to the final value]

Map/Fold in Action

Simple map example:

(map (lambda (x) (* x x)) '(1 2 3 4 5)) → '(1 4 9 16 25)

Fold examples:

(fold + 0 '(1 2 3 4 5)) → 15
(fold * 1 '(1 2 3 4 5)) → 120

Sum of squares:

(define (sum-of-squares L)
  (fold + 0 (map (lambda (x) (* x x)) L)))
(sum-of-squares '(1 2 3 4 5)) → 55

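For readers more comfortable outside Lisp, here is a minimal Python sketch of the same two ideas, using the built-in map and functools.reduce (Python's fold); it mirrors the Scheme examples above.

from functools import reduce

# map: apply a function to every element, producing a new list
print(list(map(lambda x: x * x, [1, 2, 3, 4, 5])))          # [1, 4, 9, 16, 25]

# fold: combine a list into a single value via an accumulator
print(reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0))   # 15
print(reduce(lambda acc, x: acc * x, [1, 2, 3, 4, 5], 1))   # 120

# sum of squares = fold over the result of map
def sum_of_squares(xs):
    return reduce(lambda acc, x: acc + x, map(lambda v: v * v, xs), 0)

print(sum_of_squares([1, 2, 3, 4, 5]))                      # 55
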
Lisp → MapReduce

Let's assume a long list of records: imagine if...
We can parallelize map operations
We have a mechanism for bringing map results back together in the fold operation
That's MapReduce! (and Hadoop)

Observations:
No limit to map parallelization since maps are independent
We can reorder folding if the fold function is commutative and associative

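To make the parallelization point concrete, here is a toy Python sketch (not Hadoop) that runs the map step across processes with multiprocessing.Pool and then folds the partial results; because + is commutative and associative, the order in which results are folded does not matter.

from functools import reduce
from multiprocessing import Pool

def square(x):
    # the "map" function; must be a top-level function so it can be sent to workers
    return x * x

if __name__ == "__main__":
    records = list(range(1, 1001))
    with Pool() as pool:
        mapped = pool.map(square, records)            # maps run independently, in parallel
    # fold the mapped results; + is commutative and associative,
    # so combining partial results in any order gives the same answer
    total = reduce(lambda acc, v: acc + v, mapped, 0)
    print(total)
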
Typical Problem

Iterate over a large number of records
Extract something of interest from each
Shuffle and sort intermediate results
Aggregate intermediate results
Generate final output

(The first two steps form the Map phase; the last two form the Reduce phase, with the shuffle and sort in between.)

MapReduce

Programmers specify two functions:
map (k, v) → <k', v'>*
reduce (k', v') → <k', v'>*
All values with the same key are reduced together

Usually, programmers also specify:
partition (k', number of partitions) → partition for k'
Often a simple hash of the key, e.g. hash(k') mod n (see the sketch below)
Allows reduce operations for different keys to run in parallel

Implementations:
Google has a proprietary implementation in C++
Hadoop is an open-source implementation in Java (led by Yahoo)

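As an illustration of the hash-mod idea, a default partitioner can be sketched in a few lines of Python. The function below is hypothetical, not Hadoop's Partitioner API; note also that Python's built-in hash of strings is randomized per process, whereas a real system would use a deterministic hash.

def partition(key, num_partitions):
    # send every occurrence of the same key to the same reducer
    return hash(key) % num_partitions

# example: route intermediate keys to 4 reducers
for k in ["apple", "banana", "apple", "cherry"]:
    print(k, "->", partition(k, 4))
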
It's just divide and conquer!
[Diagram: initial (k, v) pairs are read from the data store and handed to parallel map tasks; each map task emits intermediate values for keys k1, k2, k3; a barrier aggregates values by key; reduce tasks then produce the final values for k1, k2, and k3.]

Recall these problems?

How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?

MapReduce Runtime

Handles scheduling
Assigns workers to map and reduce tasks
Handles "data distribution"
Moves the process to the data
Handles synchronization
Gathers, sorts, and shuffles intermediate data
Handles faults
Detects worker failures and restarts
Everything happens on top of a distributed FS (later)

"Hello World": Word Count

Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String key, Iterator intermediate_values):
  // key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Source: Dean and Ghemawat (OSDI 2004)

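The same word-count logic can be simulated end to end in plain Python. The sketch below illustrates the programming model (map, shuffle by key, reduce); it is not Hadoop's actual API, and the sample documents are made up.

from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # emit (word, 1) for every word in the document
    for word in doc_contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # sum all counts for a given word
    return (word, sum(counts))

documents = {"doc1": "the quick brown fox", "doc2": "the lazy dog the fox"}

# map phase, followed by the shuffle: group all values by key
intermediate = defaultdict(list)
for name, contents in documents.items():
    for key, value in map_fn(name, contents):
        intermediate[key].append(value)

# reduce phase
for word in sorted(intermediate):
    print(reduce_fn(word, intermediate[word]))   # e.g. ('the', 3)
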
Bandwidth Optimization

Issue: large number of key-value pairs
Solution: use "Combiner" functions
Executed on the same machine as the mapper
Results in a "mini-reduce" right after the map phase
Reduces key-value pairs to save bandwidth

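Conceptually, a combiner is the reduce function run locally on each mapper's output before anything is sent over the network. The Python sketch below (a simulation, not Hadoop's Combiner interface) shows how local pre-aggregation shrinks the number of key-value pairs that must be shuffled.

from collections import Counter

def map_with_combiner(doc_contents):
    pairs = [(w, 1) for w in doc_contents.split()]   # raw map output: one pair per word
    combined = Counter()
    for word, count in pairs:                        # local "mini-reduce" on the mapper
        combined[word] += count
    return list(combined.items())                    # one pair per distinct word

out = map_with_combiner("to be or not to be")
print(out)   # [('to', 2), ('be', 2), ('or', 1), ('not', 1)] -- 4 pairs instead of 6
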

Skew Problem

Issue: the reduce phase is only as fast as the slowest map
Solution: redundantly execute map operations and use the results of the first to finish (sketched below)
Addresses hardware problems...
But not issues related to the inherent distribution of the data

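The idea behind this kind of speculative execution can be pictured as: launch duplicate copies of the same task and take whichever copy finishes first. A toy Python sketch, purely illustrative; in Hadoop the runtime handles this for you.

import concurrent.futures
import random
import time

def map_task(task_id):
    time.sleep(random.uniform(0.1, 1.0))   # simulate a machine that may be slow
    return f"result-of-task-{task_id}"

with concurrent.futures.ThreadPoolExecutor() as pool:
    # submit the same logical task twice, as if on two different workers
    futures = [pool.submit(map_task, 42), pool.submit(map_task, 42)]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    print(next(iter(done)).result())       # use whichever copy finished first
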
How do we get data to the workers?

[Diagram: compute nodes reading data over the network from shared NAS/SAN storage]
What's the problem here?

Distributed File System

Don't move data to workers... move workers to the data!
Store data on the local disks of nodes in the cluster
Start up the workers on the node that has the data local (see the sketch below)

Why?
Not enough RAM to hold all the data in memory
Disk access is slow, but disk throughput is good

A distributed file system is the answer
GFS (Google File System)
HDFS for Hadoop
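"Move workers to the data" amounts to a scheduling preference: given the nodes that hold a block's replicas, assign the task to an idle node that already has the data, and only fall back to a remote node otherwise. A hypothetical Python sketch; the data structures and function are ours, not HDFS's actual scheduler.

def pick_worker(block_replicas, idle_workers):
    # prefer an idle worker that already stores the block locally
    for worker in idle_workers:
        if worker in block_replicas:
            return worker, "local"
    # otherwise fall back to any idle worker (the data must travel over the network)
    return idle_workers[0], "remote"

replicas = {"node3", "node7", "node9"}                       # where the block's copies live
print(pick_worker(replicas, ["node1", "node7", "node8"]))    # ('node7', 'local')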


GFS: Assumptions

Commodity hardware over "exotic" hardware
High component failure rates
Inexpensive commodity components fail all the time
"Modest" number of HUGE files
Files are write-once, mostly appended to
Perhaps concurrently
Large streaming reads over random access
High sustained throughput over low latency

GFS slides adapted from material by Dean et al.

GFS: Design Decisions

Files stored as chunks
Fixed size (64 MB; see the sketch below)
Reliability through replication
Each chunk replicated across 3+ chunkservers
Single master to coordinate access and keep metadata
Simple centralized management
No data caching
Little benefit due to large data sets, streaming reads
Simplify the API
Push some of the issues onto the client

Source: Ghemawat et al. (SOSP 2003)

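With fixed-size chunks, locating data is simple arithmetic: a byte offset maps directly to a chunk index, and the client then asks the master which chunkservers hold that chunk's replicas. A small sketch assuming the 64 MB chunk size above:

CHUNK_SIZE = 64 * 1024 * 1024        # 64 MB, as in GFS

def chunk_index(byte_offset):
    # which chunk of the file contains this offset?
    return byte_offset // CHUNK_SIZE

print(chunk_index(200_000_000))      # offset 200,000,000 falls in chunk 2
print((1 * 1024**3) // CHUNK_SIZE)   # a 1 GiB file spans 16 chunks
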
Single Master

We know this is a:
Single point of failure
Scalability bottleneck

GFS solutions:
Shadow masters
Minimize master involvement
Never move data through it; use it only for metadata (and cache metadata at clients)
Large chunk size
Master delegates authority to primary replicas for data mutations (chunk leases)

Simple, and good enough!

Master's Responsibilities (1/2)

Metadata storage
Namespace management/locking
Periodic communication with chunkservers
Give instructions, collect state, track cluster health
Chunk creation, re-replication, rebalancing
Balance space utilization and access speed
Spread replicas across racks to reduce correlated failures
Re-replicate data if redundancy falls below a threshold
Rebalance data to smooth out storage and request load

Master's Responsibilities (2/2)

Garbage collection
Simpler and more reliable than traditional file delete
Master logs the deletion and renames the file to a hidden name
Lazily garbage-collects hidden files
Stale replica deletion
Detects "stale" replicas using chunk version numbers

Metadata

Global metadata is stored on the master
File and chunk namespaces
Mapping from files to chunks
Locations of each chunk's replicas

All in memory (64 bytes per chunk; see the calculation below)
Fast
Easily accessible

Master has an operation log for persistent logging of critical metadata updates
Persistent on local disk
Replicated
Checkpoints for faster recovery

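At roughly 64 bytes of metadata per chunk, it is easy to see why the metadata fits in the master's memory. A quick back-of-the-envelope calculation; the 1 PiB cluster size is our own illustrative figure.

CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB chunks
METADATA_PER_CHUNK = 64                # roughly 64 bytes of metadata per chunk

data_stored = 1 * 1024**5              # assume the cluster holds 1 PiB of data
num_chunks = data_stored // CHUNK_SIZE
print(num_chunks)                                                    # 16,777,216 chunks
print(num_chunks * METADATA_PER_CHUNK / 1024**3, "GiB of metadata")  # ~1 GiB
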

Mutations

Mutation = write or append
Must be done for all replicas
Goal: minimize master involvement

Lease mechanism (sketched below):
Master picks one replica as primary; gives it a "lease" for mutations
Primary defines a serial order of mutations
All replicas follow this order
Data flow decoupled from control flow

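The essential point of the lease is that the primary alone chooses a serial order that every replica then applies. A toy Python sketch of that idea; the class and method names are hypothetical, not part of GFS.

class PrimaryReplica:
    def __init__(self):
        self.next_serial = 0

    def order(self, mutation):
        # the primary assigns each mutation its position in the serial order
        serial = self.next_serial
        self.next_serial += 1
        return serial, mutation

class SecondaryReplica:
    def __init__(self):
        self.log = {}

    def apply(self, serial, mutation):
        # every replica applies mutations in the order chosen by the primary
        self.log[serial] = mutation

primary, secondary = PrimaryReplica(), SecondaryReplica()
for m in ["write A", "append B", "write C"]:
    serial, mutation = primary.order(m)
    secondary.apply(serial, mutation)
print(sorted(secondary.log.items()))   # [(0, 'write A'), (1, 'append B'), (2, 'write C')]
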

Parallelization Problems

How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?

How is MapReduce different?

From Theory to Practice

Workflow between you and the Hadoop cluster:
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster

On Amazon: With EC2

The same workflow, with your Hadoop cluster now running on EC2:
0. Allocate Hadoop cluster (EC2)
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
7. Clean up!

Uh oh. Where did the data go?

On Amazon: EC2 and S3

Your Hadoop cluster runs on EC2 ("the cloud"); S3 serves as the persistent store
Copy from S3 to HDFS
Copy from HDFS to S3

Questions?
