Unit-2: MapReduce (2024)
MapReduce
(Hadoop Processing Component)
Hadoop
• If any machine delays its part of the job, the whole job gets delayed (critical-path
problem).
• If any of the machines working on a part of the data fails, managing this failover
becomes a challenge (reliability problem).
• The data must be divided equally so that no individual machine is overloaded or
underutilized (equal-split issue).
• There should be a mechanism to aggregate the results generated by each of the
machines into the final output (result aggregation).
These are the issues we would have to handle individually when performing
parallel processing of huge data sets with traditional approaches.
To overcome them, the MapReduce framework allows us to perform such parallel
computations without worrying about issues like reliability, fault tolerance, etc.
Main Hadoop Components
MapReduce
MapReduce: What It Is and Why It Is Important
• MapReduce is a distributed data processing algorithm, introduced by Google in its
MapReduce technical paper (2004).
• Splitting: takes the input data set from the source and divides it into smaller sub-data sets.
• Mapping: performs the required computation on each sub-data set.
• The output of this Map function is a set of key-value pairs of the form <Key, Value>
(a mapper sketch follows).
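As an illustration (not from the original slides), here is a minimal sketch of a Map function in Java, assuming the standard Hadoop org.apache.hadoop.mapreduce API; the class and variable names are chosen for the example. It shows how each record of a sub-data set is turned into <Key, Value> pairs, here <word, 1> for a word count:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each input record is one line of text; emit <word, 1> for every word in it.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // output key-value pair <Key, Value>
            }
        }
    }
}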
How does MapReduce work?
• Shuffle Function
• It is the second step in the MapReduce algorithm. The Shuffle function sits between the
Map and Reduce steps.
• It takes the list of outputs coming from the Map function and performs the following two sub-
steps on every key-value pair.
• Merging: this step combines all key-value pairs that have the same key (that is, it groups
key-value pairs by comparing keys). This step returns <Key, List<Value>>.
• Sorting: this step takes the input from the Merging step and sorts all key-value pairs by
key. It also returns a <Key, List<Value>> output, but with the key-value pairs sorted by key.
• Finally, the Shuffle function passes the sorted list of <Key, List<Value>> pairs to the next step
(a small illustration follows).
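The grouping and sorting that the Shuffle step performs can be illustrated with a small stand-alone Java sketch. This is only a simulation of the idea, not Hadoop's actual shuffle implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ShuffleSketch {
    public static void main(String[] args) {
        // Simulated map output: a list of <Key, Value> pairs.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("river", 1), Map.entry("bear", 1),
                Map.entry("car", 1), Map.entry("bear", 1));

        // Merging: group all values that share the same key.
        // Sorting: a TreeMap keeps the keys in sorted order.
        SortedMap<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Each entry is now <Key, List<Value>>, e.g. bear -> [1, 1], printed in key order.
        shuffled.forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}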
• Combiner
• It is an optional function, but it provides high performance in terms of network bandwidth and
disk space.
• For example, suppose the map output at some stage is <1,10>, <1,15>, <1,20>, <2,5>, <2,60> and the
purpose of the MapReduce job is to find the maximum value corresponding to each
key.
• In the combiner you can reduce this data to <1,20>, <2,60>, as 20 and 60 are the maximum
values for key 1 and key 2 respectively (see the sketch after this list).
• It is an optimization technique for a MapReduce job.
• The output generated by the combiner is intermediate data, and it is passed to the reducer.
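A sketch of such a combiner in Java, assuming the Hadoop org.apache.hadoop.mapreduce API and the max-per-key job described above (class names are illustrative). A combiner is written like a reducer and registered on the job with job.setCombinerClass(...):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxValueCombiner extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Keep only the maximum value seen for this key on the map side,
        // e.g. <1,10>, <1,15>, <1,20> collapses to <1,20> before the shuffle,
        // so less intermediate data crosses the network and hits the disk.
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(key, new IntWritable(max));
    }
}

Because taking a maximum is associative and commutative, the same class could also serve as the reducer itself.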
Map and Reduce Phases:
• Partitioner
• It happens after the map phase and before the reduce phase.
• A map task returns its output in <key, value> form.
• The partitioner clubs together the data that should go to the same reducer, based on keys
(a sketch of such a partitioner follows this list).
• Example:
• If the map output is <1,10>, <1,15>, <1,20>, <2,13>, <2,6>, <4,8>, <4,20> etc.
• We can see that there are 3 different keys, which are 1, 2 and 4.
• In MapReduce the number of reduce tasks is fixed, and each reduce task
should handle all the data related to one key.
• That means map outputs like <1,10>, <1,15>, <1,20> should be handled by the same
reduce task.
• It is not possible for <1,10> to be handled by one reduce task and <1,15> by another,
because the key, which is 1, is the same.
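A minimal partitioner sketch in Java (again assuming the Hadoop mapreduce API; the class name is illustrative). It mirrors the behaviour of Hadoop's default HashPartitioner, so every pair with key 1 lands in the same partition and therefore reaches the same reduce task:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<IntWritable, IntWritable> {
    @Override
    public int getPartition(IntWritable key, IntWritable value, int numReduceTasks) {
        // All pairs with the same key get the same partition number,
        // so <1,10>, <1,15> and <1,20> always go to the same reduce task.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Such a class would be registered on the job with job.setPartitionerClass(...); without it, Hadoop applies the equivalent default hash partitioning.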
Entire workflow of a job in Hadoop.
How MapReduce Organizes Work?
Hadoop divides the job into two kinds of tasks: 1) map tasks (splits and mapping) and
2) reduce tasks (shuffling and reducing). These tasks are controlled by two types of
entities:
JobTracker: acts as the single master, responsible
for the complete execution of the submitted job.
For each job there is one JobTracker, which
resides on the NameNode.
• It is the responsibility of the JobTracker to coordinate the activity by scheduling tasks to run on
different data nodes.
• Execution of each individual task is then looked after by the TaskTracker, which resides on every
data node and executes a part of the job.
• The TaskTracker's responsibility is to send progress reports to the JobTracker.
• In addition, the TaskTracker periodically sends a 'heartbeat' signal to the JobTracker to notify
it of the current state of the system.
• Thus the JobTracker keeps track of the overall progress of each job. In the event of a task failure, the
JobTracker can reschedule it on a different TaskTracker.
MapReduce Architecture explained in detail
• One map task is created for each split which then executes map function for each record
in the split.
• Execution of map tasks results in the output being written to a local disk on the respective node,
not to HDFS.
• Map output is intermediate output which is processed by reduce tasks to produce the
final output.
• Once the job is complete, the map output can be thrown away. So, storing it in HDFS with
replication becomes overkill.
• In the event of node failure, before the map output is consumed by the reduce task,
Hadoop reruns the map task on another node and re-creates the map output.
• An output of every map task is fed to the reduce task. Map output is transferred to the
machine where reduce task is running.
• On this machine, the output is merged and then passed to the user-defined reduce
function.
• Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on the
local node and the other replicas are stored on off-rack nodes). So, writing the reduce
output does consume network bandwidth, but only as much as a normal HDFS write
(a short reducer and driver sketch follows).
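To tie the pieces together, here is a sketch of a reducer and driver for the word-count example started earlier (Hadoop mapreduce API assumed; the paths and class names are illustrative, and WordCountMapper is the class sketched above). Note how the final output path is in HDFS, while the map output stays on local disk:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Reducer: receives <word, List<1, 1, ...>> from the shuffle and sums the counts.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));   // final output, written to HDFS
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);       // map phase (sketched earlier)
        job.setCombinerClass(WordCountReducer.class);    // optional combiner
        job.setReducerClass(WordCountReducer.class);     // reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input read from HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // reduce output stored in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}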
Terminology used in MapReduce
PayLoad: Applications implement the Map and the Reduce functions, and form the core of the job.
Mapper: Maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode: Node that manages the Hadoop Distributed File System (HDFS).
DataNode: Node where the data is present in advance, before any processing takes place.
MasterNode: Node where the JobTracker runs and which accepts job requests from clients.
JobTracker: Schedules jobs and tracks the jobs assigned to the TaskTracker.
Benefits of MapReduce
Simplicity: Developers have the freedom to write applications in their language of choice, such as Java, C++
or Python, and MapReduce jobs are easy to run.
Scalability: MapReduce can process huge amounts of data (petabytes and more) stored in
HDFS on one cluster.
Speed: Because it uses parallel processing, MapReduce can take problems that used to take
days to solve and solve them in hours or minutes.
Recovery: MapReduce also takes care of failures. If a machine with one copy of the data is unavailable,
another machine has a copy of the same key/value pair, which can be used to solve the
same sub-task. The JobTracker keeps track of all these processes.
Minimal data motion: MapReduce moves the compute processes to the data on HDFS and not the other way
around. Processing tasks can occur on the physical node where the data resides, which
significantly reduces network I/O and contributes to Hadoop's processing speed.
• Businesses and other organizations run calculations to:
• Determine the price for their products that yields the highest profits.
• Know precisely how effective their advertising is and where they should spend their
ad dollars.
• Analyze web clicks, sales records purchased from retailers, and Twitter trending topics to
determine what new products the company should produce in the upcoming season.
MapReduce Use Case: Global Warming
Query: we want to know how much global warming has raised the ocean’s temperature.
• Input: temperature readings from thousands of OceanSignals (ocean sensors) all over the globe:
(OceanSignal, DateTime, longitude, latitude, lowTemp, highTemp)
• Run a map over every OceanSignal-DateTime reading and add the average temperature as a
field:
(OceanSignal, DateTime, longitude, latitude, lowTemp, highTemp, Average)
• Drop the DateTime column and produce one average temperature for each OceanSignal
(a code sketch follows):
(OceanSignal N, Average) // e.g., (OceanSignal 1, Average), (OceanSignal 2, Average)
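A possible map and reduce sketch for this use case in Java (Hadoop mapreduce API assumed; the CSV field layout and the class names below are assumptions made for the illustration):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: parse one reading and emit <sensorId, (lowTemp + highTemp) / 2>.
public class OceanTempMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: OceanSignal,DateTime,longitude,latitude,lowTemp,highTemp
        String[] fields = line.toString().split(",");
        if (fields.length < 6) {
            return;   // skip malformed records
        }
        double readingAverage = (Double.parseDouble(fields[4]) + Double.parseDouble(fields[5])) / 2.0;
        context.write(new Text(fields[0]), new DoubleWritable(readingAverage));
    }
}

// Reduce: drop DateTime entirely and average all readings for one OceanSignal.
class OceanTempReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (DoubleWritable value : values) {
            sum += value.get();
            count++;
        }
        context.write(key, new DoubleWritable(sum / count));   // (OceanSignal N, Average)
    }
}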