Unit-2: MapReduce (2024)
MapReduce
(Hadoop Processing Component)
Hadoop
• If any machine delays its part of the job, the whole job gets delayed (critical-path
problem).
• If any of the machines working on a part of the data fails, managing this failover
becomes a challenge (reliability problem).
• The data must be divided equally so that no individual machine is overloaded or
underutilized (equal-split issue).
• There should be a mechanism to aggregate the results generated by each of the
machines into the final output (result aggregation).
These are the issues we would have to handle individually when performing
parallel processing of huge data sets with traditional approaches.
To overcome them, the MapReduce framework allows us to perform such parallel
computations without worrying about issues like reliability, fault tolerance, etc.
Main Hadoop Components
MapReduce
MapReduce: What It Is and Why It Is Important
• MapReduce is a distributed data processing algorithm, introduced by Google in its
MapReduce technical paper (2004).
• Splitting: takes the input data set from the source and divides it into smaller sub-data sets.
• Mapping: performs the required computation on each sub-data set.
• The output of this Map function is a set of key-value pairs of the form <Key, Value>
(a mapper sketch follows).
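As an illustration (not from the original slides), here is a minimal sketch of a Map function in Java, assuming the standard Hadoop org.apache.hadoop.mapreduce API; the class and variable names are chosen for the example. It shows how each record of a sub-data set is turned into <Key, Value> pairs, here <word, 1> for a word count:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each input record is one line of text; emit <word, 1> for every word in it.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // output key-value pair <Key, Value>
            }
        }
    }
}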
How does MapReduce work?
• Shuffle Function
• It is the second step in the MapReduce algorithm. The Shuffle function sits between the
Map and Reduce steps.
• It takes the list of outputs coming from the Map function and performs the following two sub-
steps on every key-value pair.
• Merging: this step combines all key-value pairs that have the same key (that is, it groups
key-value pairs by comparing keys). This step returns <Key, List<Value>>.
• Sorting: this step takes the input from the Merging step and sorts all key-value pairs by
key. It also returns a <Key, List<Value>> output, but with the key-value pairs sorted by key.
• Finally, the Shuffle function passes the sorted list of <Key, List<Value>> pairs to the next step
(a small illustration follows).
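The grouping and sorting that the Shuffle step performs can be illustrated with a small stand-alone Java sketch. This is only a simulation of the idea, not Hadoop's actual shuffle implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ShuffleSketch {
    public static void main(String[] args) {
        // Simulated map output: a list of <Key, Value> pairs.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("river", 1), Map.entry("bear", 1),
                Map.entry("car", 1), Map.entry("bear", 1));

        // Merging: group all values that share the same key.
        // Sorting: a TreeMap keeps the keys in sorted order.
        SortedMap<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Each entry is now <Key, List<Value>>, e.g. bear -> [1, 1], printed in key order.
        shuffled.forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}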
• Combiner
• It is an optional function, but it provides high performance in terms of network bandwidth and
disk space.
• For example, suppose the map output at some stage is <1,10>, <1,15>, <1,20>, <2,5>, <2,60> and the
purpose of the MapReduce job is to find the maximum value corresponding to each
key.
• In the combiner you can reduce this data to <1,20>, <2,60>, as 20 and 60 are the maximum
values for key 1 and key 2 respectively (see the sketch after this list).
• It is an optimization technique for a MapReduce job.
• The output generated by the combiner is intermediate data, and it is passed to the reducer.
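A sketch of such a combiner in Java, assuming the Hadoop org.apache.hadoop.mapreduce API and the max-per-key job described above (class names are illustrative). A combiner is written like a reducer and registered on the job with job.setCombinerClass(...):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxValueCombiner extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Keep only the maximum value seen for this key on the map side,
        // e.g. <1,10>, <1,15>, <1,20> collapses to <1,20> before the shuffle,
        // so less intermediate data crosses the network and hits the disk.
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(key, new IntWritable(max));
    }
}

Because taking a maximum is associative and commutative, the same class could also serve as the reducer itself.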
Map and Reduce Phases:
• Partitioner
• It happens after the map phase and before the reduce phase.
• A map task returns its output in <key, value> form.
• The partitioner clubs together the data that should go to the same reducer, based on keys
(a sketch of such a partitioner follows this list).
• Example:
• If the map output is <1,10>, <1,15>, <1,20>, <2,13>, <2,6>, <4,8>, <4,20> etc.
• We can see that there are 3 different keys, which are 1, 2 and 4.
• In MapReduce the number of reduce tasks is fixed, and each reduce task
should handle all the data related to one key.
• That means map outputs like <1,10>, <1,15>, <1,20> should be handled by the same
reduce task.
• It is not possible for <1,10> to be handled by one reduce task and <1,15> by another,
because the key, which is 1, is the same.
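A minimal partitioner sketch in Java (again assuming the Hadoop mapreduce API; the class name is illustrative). It mirrors the behaviour of Hadoop's default HashPartitioner, so every pair with key 1 lands in the same partition and therefore reaches the same reduce task:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<IntWritable, IntWritable> {
    @Override
    public int getPartition(IntWritable key, IntWritable value, int numReduceTasks) {
        // All pairs with the same key get the same partition number,
        // so <1,10>, <1,15> and <1,20> always go to the same reduce task.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Such a class would be registered on the job with job.setPartitionerClass(...); without it, Hadoop applies the equivalent default hash partitioning.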
Entire workflow of a job in Hadoop.
How MapReduce Organizes Work?
Hadoop divides the job into two kinds of tasks: 1) map tasks (splits and mapping) and
2) reduce tasks (shuffling and reducing). These tasks are controlled by two types of
entities:
JobTracker: acts as the single master, responsible
for the complete execution of the submitted job.
For each job there is one JobTracker, which
resides on the NameNode.
• It is the responsibility of the JobTracker to coordinate the activity by scheduling tasks to run on
different data nodes.
• Execution of each individual task is then looked after by the TaskTracker, which resides on every
data node and executes a part of the job.
• The TaskTracker's responsibility is to send progress reports to the JobTracker.
• In addition, the TaskTracker periodically sends a 'heartbeat' signal to the JobTracker to notify
it of the current state of the system.
• Thus the JobTracker keeps track of the overall progress of each job. In the event of a task failure, the
JobTracker can reschedule it on a different TaskTracker.
MapReduce Architecture explained in detail
• One map task is created for each split which then executes map function for each record
in the split.
• Execution of map tasks results in the output being written to a local disk on the respective node,
not to HDFS.
• Map output is intermediate output which is processed by reduce tasks to produce the
final output.
• Once the job is complete, the map output can be thrown away. So, storing it in HDFS with
replication becomes overkill.
• In the event of node failure, before the map output is consumed by the reduce task,
Hadoop reruns the map task on another node and re-creates the map output.
• An output of every map task is fed to the reduce task. Map output is transferred to the
machine where reduce task is running.
• On this machine, the output is merged and then passed to the user-defined reduce
function.
• Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on the
local node and the other replicas are stored on off-rack nodes). So, writing the reduce
output does consume network bandwidth, but only as much as a normal HDFS write
(a short reducer and driver sketch follows).
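To tie the pieces together, here is a sketch of a reducer and driver for the word-count example started earlier (Hadoop mapreduce API assumed; the paths and class names are illustrative, and WordCountMapper is the class sketched above). Note how the final output path is in HDFS, while the map output stays on local disk:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Reducer: receives <word, List<1, 1, ...>> from the shuffle and sums the counts.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));   // final output, written to HDFS
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);       // map phase (sketched earlier)
        job.setCombinerClass(WordCountReducer.class);    // optional combiner
        job.setReducerClass(WordCountReducer.class);     // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input read from HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // reduce output stored in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}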
Terminology used in MapReduce
PayLoad: Applications implement the Map and the Reduce functions, and form the core of the job.
Mapper: Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode: Node that manages the Hadoop Distributed File System (HDFS).
DataNode: Node where the data is present in advance, before any processing takes place.
MasterNode: Node where the JobTracker runs and which accepts job requests from clients.
JobTracker: Schedules jobs and tracks the jobs assigned to the TaskTracker.
Benefits of MapReduce
Simplicity: Developers have the freedom to write applications in their language of choice, such as Java, C++
or Python, and MapReduce jobs are easy to run.
Scalability: MapReduce can process huge amounts of data (petabytes and more) stored in
HDFS on one cluster.
Speed: Because it uses parallel processing, MapReduce can take problems that used to take
days to solve and solve them in hours or minutes.
Recovery: MapReduce also takes care of failures. If a machine with one copy of the data is unavailable,
another machine has a copy of the same key/value pair, which can be used to solve the
same sub-task. The JobTracker keeps track of all these processes.
Minimal data motion: MapReduce moves the compute processes to the data on HDFS and not the other way
around. Processing tasks can occur on the physical node where the data resides, which
significantly reduces network I/O and contributes to Hadoop's processing speed.
• Businesses and other organizations run calculations to:
• Determine the price for their products that yields the highest profits.
• Know precisely how effective their advertising is and where they should spend their
ad dollars.
• Analyze web clicks, sales records purchased from retailers, and Twitter trending topics to
determine what new products the company should produce in the upcoming season.
MapReduce Use Case: Global Warming
Query: we want to know how much global warming has raised the ocean’s temperature.
• Input: temperature readings from thousands of OceanSignals (ocean sensors) all over the globe:
(OceanSignal, DateTime, longitude, latitude, lowTemp, highTemp)
• Run a map over every OceanSignal-DateTime reading and add the average temperature as a
field:
(OceanSignal, DateTime, longitude, latitude, lowTemp, highTemp, Average)
• Drop the DateTime column and produce one average temperature for each OceanSignal
(a code sketch follows):
(OceanSignal N, Average) // e.g., (OceanSignal 1, Average), (OceanSignal 2, Average)
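A possible map and reduce sketch for this use case in Java (Hadoop mapreduce API assumed; the CSV field layout and the class names below are assumptions made for the illustration):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: parse one reading and emit <sensorId, (lowTemp + highTemp) / 2>.
public class OceanTempMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: OceanSignal,DateTime,longitude,latitude,lowTemp,highTemp
        String[] fields = line.toString().split(",");
        if (fields.length < 6) {
            return;   // skip malformed records
        }
        double readingAverage = (Double.parseDouble(fields[4]) + Double.parseDouble(fields[5])) / 2.0;
        context.write(new Text(fields[0]), new DoubleWritable(readingAverage));
    }
}

// Reduce: drop DateTime entirely and average all readings for one OceanSignal.
class OceanTempReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (DoubleWritable value : values) {
            sum += value.get();
            count++;
        }
        context.write(key, new DoubleWritable(sum / count));   // (OceanSignal N, Average)
    }
}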