Map Reduce Intro CS4961-L22

MapReduce is a programming model and framework for processing large datasets in parallel. It allows users to write map and reduce functions to split work across clusters. The framework handles parallelization, fault tolerance, data distribution, and load balancing. Users specify input/output files and the number of tasks, then write map and reduce functions without worrying about parallel implementation details. The framework schedules tasks across nodes, handles failures, and returns final outputs.


L22: SC Report, Map Reduce
November 23, 2010
Map Reduce
• What is MapReduce?
• Example computing environment
• How it works
• Fault Tolerance
• Debugging
• Performance
• Google's implementation is MapReduce; Hadoop is an open-source implementation of the same model

What is MapReduce?

• Parallel programming model meant for large clusters
  - User implements Map() and Reduce()
• Parallel computing framework
  - Libraries take care of EVERYTHING else:
    - Parallelization
    - Fault Tolerance
    - Data Distribution
    - Load Balancing
• Useful model for many practical tasks (large data)


Functional Abstractions Hide Parallelism
• Map and Reduce
  - Functions borrowed from functional programming languages (e.g., Lisp)
• Map()
  - Processes a key/value pair to generate intermediate key/value pairs
• Reduce()
  - Merges all intermediate values associated with the same key

Example: Counting Words

• Map()
- Input <filename, file text>
- Parses file and emits <word, count> pairs
- e.g., <”hello”, 1>

• Reduce()
- Sums values for the same key and emits <word, TotalCount>
- e.g., <”hello”, (3 5 2 7)> => <”hello”, 17>
Example Use of MapReduce
• Counting words in a large set of documents

map(string key, string value)
    // key: document name
    // value: document contents
    for each word w in value
        EmitIntermediate(w, “1”);

reduce(string key, iterator values)
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values
        result += ParseInt(v);
    Emit(AsString(result));
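
To make the dataflow concrete, here is a minimal sequential Python sketch of the same word count, run in a single process; the run_mapreduce driver below is hypothetical glue for illustration, not part of any MapReduce library.

from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word
    return (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Hypothetical sequential driver: map every input, group the
    # intermediate pairs by key, then reduce each group.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    return [reduce_fn(k, vs) for k, vs in intermediate.items()]

docs = [("doc1", "hello world hello"), ("doc2", "hello mapreduce")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('hello', 3), ('world', 1), ('mapreduce', 1)]
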
How MapReduce Works

• User's to-do list:
  - Indicate:
    - Input/output files
    - M: number of map tasks
    - R: number of reduce tasks
    - W: number of machines
  - Write map and reduce functions
  - Submit the job (a hypothetical job specification is sketched below)
• This requires no knowledge of parallel/distributed systems!
• What about everything else?
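
As a rough illustration only (assuming a hypothetical submit() API; real frameworks such as Hadoop expose this differently), the user-visible part of a job boils down to a handful of parameters plus the two functions from the sketch above:

# Hypothetical job specification; field names and paths are illustrative.
job = {
    "input": "dfs://corpus/docs-part-*",   # input files on the distributed FS
    "output": "dfs://results/wordcount",   # output location on the distributed FS
    "M": 200,                              # number of map tasks
    "R": 5,                                # number of reduce tasks
    "W": 50,                               # number of worker machines
    "map": map_fn,                         # user-defined map function (above)
    "reduce": reduce_fn,                   # user-defined reduce function (above)
}
# submit(job)  # the framework handles scheduling, data movement, and retries
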
Data Distribution

• Input files are split into M pieces on the distributed file system
  - Typically ~64 MB blocks (a split sketch follows this list)
• Intermediate files created by map tasks are written to local disk
• Output files are written to the distributed file system
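
A small sketch of how an input file could be carved into ~64 MB splits to derive M; this is illustrative only, since a real distributed file system handles block placement and replication itself.

import os

BLOCK_SIZE = 64 * 1024 * 1024  # ~64 MB per split

def compute_splits(path):
    # Return (offset, length) pairs covering the file in ~64 MB pieces;
    # each pair would become the input of one map task.
    size = os.path.getsize(path)
    return [(offset, min(BLOCK_SIZE, size - offset))
            for offset in range(0, size, BLOCK_SIZE)]

# M is roughly the total number of splits across all input files.
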
Assigning Tasks

• Many copies of the user program are started
• The scheduler tries to exploit data locality by running map tasks
  on machines that already hold the input data
• One instance becomes the master
• The master finds idle machines and assigns them tasks
Execution (map)

• Map workers read in the contents of their corresponding input splits
• They perform the user-defined map computation to create
  intermediate <key, value> pairs
• Buffered output pairs are periodically written to local disk
  - Partitioned into R regions by a partitioning function
Partition Function

• Example partition function: hash(key) mod R
• Why do we need this?
• Example scenario:
  - Want to do word counting on 10 documents
  - 5 map tasks, 2 reduce tasks (see the sketch below)
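
A sketch of that partition function for the scenario above, using a deterministic hash (crc32 here, since Python's built-in hash() is randomized per process) so every map task sends a given word to the same reduce task:

import zlib

R = 2  # number of reduce tasks

def partition(key, r=R):
    # All occurrences of the same word, no matter which of the 5 map
    # tasks emitted them, land in the same one of the 2 reduce regions.
    return zlib.crc32(key.encode("utf-8")) % r

for word in ["hello", "world", "mapreduce"]:
    print(word, "-> reduce task", partition(word))
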
Execution (reduce)

• Reduce workers iterate over the sorted intermediate data (sketched below)
  - For each unique key encountered, the values are passed to the
    user's reduce function
  - e.g., <key, [value1, value2, ..., valueN]>
• Output of the user's reduce function is written to an output file
  on the global file system
• When all tasks have completed, the master wakes up the user program
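
A minimal sketch of the reduce-side grouping, assuming the intermediate pairs have already been fetched from the map workers' local disks:

from itertools import groupby
from operator import itemgetter

# Intermediate pairs pulled from the map workers (illustrative data).
intermediate = [("hello", 1), ("world", 1), ("hello", 1), ("hello", 1)]

intermediate.sort(key=itemgetter(0))              # order by key
for key, group in groupby(intermediate, key=itemgetter(0)):
    values = [v for _, v in group]                # e.g. <"hello", [1, 1, 1]>
    print(key, sum(values))                       # user's reduce: sum the counts
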
Observations

• No reduce can begin until the map phase is complete
• Tasks are scheduled based on the location of data
• If a map worker fails any time before reduce finishes, its task
  must be completely rerun
• The master must communicate the locations of intermediate files
• The MapReduce library does most of the hard work for us!
[Dataflow diagram: map tasks read input key/value pairs from data stores 1..n
and emit (key, values...) pairs; a barrier aggregates intermediate values by
output key; reduce tasks then produce the final values for key 1, key 2, key 3.]
Fault Tolerance

• Workers are periodically pinged by the master
  - No response = failed worker
• The master writes periodic checkpoints
• On errors, workers send a “last gasp” UDP packet to the master
  - This detects records that cause deterministic crashes so they can be skipped
Fault Tolerance

• Input file blocks are stored on multiple machines
• When the computation is almost done, remaining in-progress tasks are
  rescheduled on other machines
  - Avoids “stragglers”
Debugging

• Offers human-readable status info via an HTTP server
  - Users can see jobs completed, jobs in progress, processing rates, etc.
• Sequential implementation
  - Executes the job sequentially on a single machine
  - Allows use of gdb and other debugging tools
MapReduce Conclusions

• Simplifies large-scale computations that fit this model
• Allows the user to focus on the problem without worrying about
  parallelization details
• The underlying computer architecture is not very important
  - Portable model
References

• Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing
  on Large Clusters
• Josh Carter, http://multipart-mixed.com/software/mapreduce_presentation.pdf
• Ralf Lammel, Google's MapReduce Programming Model – Revisited
• http://code.google.com/edu/parallel/mapreduce-tutorial.html

• RELATED
- Sawzall
- Pig
- Hadoop
