Map Reduce

MapReduce is a software framework introduced by Google for processing large datasets in a distributed computing environment. It involves splitting large tasks into smaller subtasks, running them in parallel on different systems, and then combining the results. The MapReduce model consists of two main phases - the Map phase and the Reduce phase. In the Map phase, data is converted into key-value pairs, and in the Reduce phase, the outputs from all the maps are combined and aggregated to form the final results. MapReduce provides scalability, simplicity, speed, recovery and easy solutions for data processing on large datasets across clusters of computers.


MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets across clusters of computers.

MapReduce follows a functional programming model. It runs behind the scenes in Hadoop to provide scalability, simplicity, speed, recovery, and easy solutions for data processing.

Why MapReduce?

Traditional enterprise systems normally used a centralized server to store and process data. Such a system is not suitable for processing large volumes of data: when multiple files must be processed concurrently, the centralized system becomes a bottleneck. Google solved this bottleneck problem with an algorithm known as MapReduce.

In the MapReduce process, large tasks are split into smaller tasks, which are then assigned to many systems.

Ways to MapReduce

Libraries: HBase, Hive, Pig, Sqoop, Oozie, Mahout and others

Languages: Java*, HiveQL (HQL), Pig Latin, Python, C#, JavaScript, R and more.

In MapReduce, bulk tasks are divided into smaller tasks, which are then allotted to many systems. The two important tasks in the MapReduce algorithm are:

-> Map

-> Reduce

The Map task is always performed first and is followed by the Reduce task. In the Map task, one data set is converted into another data set, with individual elements broken down into tuples (key-value pairs).

The Reduce task takes the Map output as its input and combines those tuples of data into a smaller set of tuples.
Input Phase: A record reader translates each record in the input file and sends the parsed data to the mapper in the form of key-value pairs.

Map Phase: A user-defined function that takes a sequence of key-value pairs and processes each of them, generating zero or more key-value pairs.

Intermediate Keys: The key-value pairs generated by the mapper are known as intermediate keys.

Combiner: The combiner takes the mapper's intermediate keys as input and applies user-defined code to combine the values within the small scope of one mapper.

Shuffle and Sort: Shuffle and Sort is the first step of the reducer task. While the reducer is running, it downloads the key-value pairs onto the local machine, where they are sorted by key into a larger data list. The data list groups equivalent keys together so that their values can be iterated easily in the reducer task.

Reducer Phase: The reducer takes the grouped key-value data as input and runs a reducer function on each group. Here the data can be combined, filtered, and aggregated in a number of ways, and it can require a wide range of processing. The reducer produces zero or more key-value pairs.

Output Phase: An output formatter translates the final key-value pairs from the reducer function and writes them onto a file using a record writer.
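
These phases can be condensed into a toy, single-process Python sketch (illustrative only; run_job, mapper, and reducer are made-up names, not Hadoop APIs, and a real job distributes this work across a cluster):

from itertools import groupby
from operator import itemgetter

def run_job(records, mapper, reducer):
    # Map phase: every (key, value) record yields zero or more pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Shuffle and sort: order the pairs by key so equal keys sit together.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: hand each key and its list of values to the reducer.
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(key, [v for _, v in group]))
    return output

The combiner, when present, would run the same kind of aggregation on each mapper's local output before the sort.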

On a daily basis, the micro-blogging site Twitter receives nearly 500 million tweets, i.e., close to 6,000 tweets per second. We can illustrate how MapReduce handles such data with a Twitter example.

In this example, Twitter data is the input, and MapReduce performs the actions Tokenize, Filter, Count, and Aggregate Counters.
Tokenize: Tokenizes the tweets into maps of tokens and writes them as key-value pairs.

Filter: It filters the unwanted words from maps of tokens.

Count: Generates a token counter per word.

Aggregate Counters: Combines similar counter values into small, manageable units.
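
A toy Python version of these four actions on a handful of tweets (the stop-word list and the sample tweets are invented for illustration):

from collections import Counter

STOP_WORDS = {"the", "a", "is", "to"}              # illustrative filter list

def process_tweets(tweets):
    counts = Counter()
    for tweet in tweets:
        for token in tweet.lower().split():        # Tokenize: tweet -> word tokens
            if token not in STOP_WORDS:            # Filter: drop unwanted words
                counts[token] += 1                 # Count: one counter per word
    return counts                                  # Aggregate Counters: combined totals

print(process_tweets(["Hadoop is fast", "MapReduce is simple"]))
# Counter({'hadoop': 1, 'fast': 1, 'mapreduce': 1, 'simple': 1})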
Hadoop MapReduce Tutorial
Hadoop MapReduce is the programming paradigm at the heart of Apache Hadoop, providing massive scalability across hundreds or thousands of commodity-hardware servers in a Hadoop cluster. The MapReduce model processes large unstructured data sets with a distributed algorithm on a Hadoop cluster.

The term MapReduce represents two separate and distinct tasks Hadoop programs perform:
the Map job and the Reduce job. The Map job takes data sets as input and processes them to
produce key-value pairs. The Reduce job takes the output of the Map job, i.e., the key-value
pairs, and aggregates them to produce the desired results. The input and output of the map
and reduce jobs are stored in HDFS (the Hadoop Distributed File System).

What is MapReduce?
Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing
of large data sets on computing clusters. It is a sub-project of the Apache Hadoop project.
Apache Hadoop is an open-source framework that allows big data to be stored and processed
in a distributed environment across clusters of computers using simple programming models.
MapReduce is the core component for data processing in the Hadoop framework. In layman's
terms, MapReduce splits the input data set into a number of parts and runs a program on all
the parts in parallel at once. The term MapReduce refers to two separate and distinct tasks.
The first is the map operation, which takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key/value pairs). The reduce
operation combines those data tuples based on the key and aggregates their values accordingly.
Bear, Deer, River and Car Example
The following word-count example explains the MapReduce method. For simplicity, let's consider
a few words of a text document. We want to find the number of occurrences of each word. First,
the input is split to distribute the work among all the map nodes, as shown in the figure. Then
each word is identified and mapped to the number one; thus the pairs, also called tuples, are
created. The first mapper node receives three words: Deer, Bear and River. The output of that
node will therefore be three key-value pairs with three distinct keys, each with its value set
to one. The mapping process remains the same on all the nodes. These tuples are then passed to
the reduce nodes. A partitioner comes into action and carries out shuffling so that all the
tuples with the same key are sent to the same node.

The reducer node processes all the tuples such that the pairs sharing a key are counted, and
the count is stored as the value of that key. In the example there are two pairs with the key
'Bear', which are reduced to a single tuple whose value equals the count. All the output
tuples are then collected and written to the output file.

How is MapReduce used?


Various platforms are built on top of Hadoop for easier querying and summarization. For instance,
Apache Mahout provides machine-learning algorithms implemented over Hadoop, and Apache Hive
provides data summarization, querying, and analysis over the data stored in HDFS.

MapReduce is primarily written in Java, so more often than not it is advisable to
learn Java for Hadoop MapReduce. That said, MapReduce libraries have been written in many
programming languages. Though it is mainly implemented in Java, there are non-Java
interfaces such as Streaming (scripting languages), Pipes (C++), Pig, Hive, and Cascading.
In the case of the Streaming API, the corresponding jar is included and the mapper and
reducer are written in Python or another scripting language, as sketched below. Hadoop,
which in turn uses the MapReduce technique, has many use cases. Generally it is used in
'needle in a haystack' scenarios or for continuous monitoring of statistics from a huge
system. One such example is monitoring the traffic in a country's road network and managing
the traffic flow to prevent jams. Another common example is analyzing and storing Twitter
data. It is also used in log analysis, which consists of various summations.
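
For example, with the Streaming API a word-count job can be written as two small Python scripts. This is a minimal sketch: Streaming feeds the mapper raw input lines on stdin, and feeds the reducer the key-sorted map output, also on stdin.

mapper.py:

#!/usr/bin/env python3
# mapper.py: emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

reducer.py:

#!/usr/bin/env python3
# reducer.py: input arrives sorted by key, so we can sum each
# run of identical words and emit one total per word.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")

The job is then submitted along these lines (paths illustrative): hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py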

MapReduce Architecture
The figure shown below illustrates the various parameters and modules that can be configured
during a MapReduce operation:
JobConf is the framework used to provide the various parameters of a MapReduce job to
Hadoop for execution. The Hadoop platform executes the programs based on the configuration
set using JobConf. The parameters include the map function, reduce function, combiner,
partitioning function, and input and output formats. The partitioner controls the shuffling
of the tuples as they are sent from mapper nodes to reducer nodes; the total number of
partitions equals the number of reduce nodes. In simple terms, the output of the partitioning
function determines which reduce node each tuple is sent to.

Input Format describes the format of the input data for a MapReduce job, and Input Location
specifies the location of the data file. The Map Function (Mapper) converts the data into
key-value pairs. For example, let's consider daily temperature data of 100 cities for the
past 10 years. Here the map function is written in such a way that every temperature is
mapped to the corresponding city. The Reduce Function reduces the set of tuples that share
a key to a single tuple with a changed value. In this example, if we have to find the highest
temperature recorded in a city, the reducer function is written in such a way that it returns
the tuple with the highest value, i.e., the highest temperature recorded in that city in the
sample data.
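
A compact Python sketch of this example (the "city,date,temperature" record layout is assumed for illustration; the two functions plug into the toy run_job driver sketched earlier):

def map_temperature(offset, line):
    # line assumed to look like "Mumbai,2009-03-14,32.5"
    city, _date, temp = line.split(",")
    return [(city, float(temp))]          # map every temperature to its city

def reduce_max(city, temps):
    return [(city, max(temps))]           # keep only the highest temperature per city

records = [(0, "Mumbai,2009-03-14,32.5"), (1, "Mumbai,2009-05-01,35.1"),
           (2, "Delhi,2009-05-01,41.0")]
print(run_job(records, map_temperature, reduce_max))
# [('Delhi', 41.0), ('Mumbai', 35.1)]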

The number of map and reduce nodes can also be defined. You can set a partitioner function,
which partitions and transfers the tuples; by default it is based on a hash function. In other
words, we can set the options such that a specific set of key-value pairs is transferred to a
specific reduce task. For example, if the key includes the year the value was recorded, we can
set the parameters such that all the keys of a specific year are transferred to the same reduce
task, as sketched below. The Hadoop framework consists of a single master and many slaves. The
master has a JobTracker and each slave has a TaskTracker. The master distributes the program and
data to the slaves. The TaskTracker, as the name suggests, keeps track of the tasks directed to
it and relays the information to the JobTracker. The JobTracker monitors all the status reports
and re-initiates any failed tasks.
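
Both behaviours can be sketched in a few lines of Python (NUM_REDUCERS and the "city-year" key layout are assumptions made for illustration):

NUM_REDUCERS = 4

def default_partition(key):
    # default behaviour: hash of the key modulo the number of reduce tasks
    return hash(key) % NUM_REDUCERS

def partition_by_year(key):
    # custom behaviour: a key like "Mumbai-2007" is routed purely by its year,
    # so every key from a given year lands on the same reduce task
    year = int(key.rsplit("-", 1)[1])
    return year % NUM_REDUCERS

print(partition_by_year("Mumbai-2007"), partition_by_year("Delhi-2007"))   # 3 3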

The combiner class is run on map task nodes. It takes the tuples emitted by the map node as
input and performs a reduce operation on them locally; it is like a pre-reduce task that saves
a lot of bandwidth. We can also maintain global counts across all the nodes using 'Counters',
which can be used to keep track of events in the map and reduce tasks. For example, we can use
a counter to track the statistics of an event beyond a certain threshold.
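
Here is a minimal Python sketch of a combiner doing this local pre-reduce on one map node's word-count output (plain Python, not the Hadoop Combiner API):

from collections import Counter

def combine(map_output):
    # collapse repeated keys on the map node itself, so far fewer
    # (key, value) pairs have to travel across the network to the reducers
    local = Counter()
    for key, value in map_output:
        local[key] += value
    return list(local.items())

print(combine([("bear", 1), ("car", 1), ("bear", 1)]))
# [('bear', 2), ('car', 1)]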
MapReduce Tutorial: Traditional Way
Fig.: Parallel processing of huge data sets using the traditional approach

Let us understand how parallel and distributed processing used to happen in the traditional
way, before the MapReduce framework existed. So, let us take an example where I have a weather
log containing the daily average temperature of the years 2000 to 2015. Here, I want to
calculate the day having the highest temperature in each year.

So, in the traditional way, I will split the data into smaller parts, or blocks, and store them
on different machines. Then, I will find the highest temperature in each part stored on the
corresponding machine. At last, I will combine the results received from each of the machines
to produce the final output. Let us look at the challenges associated with this traditional
approach:

1. Critical path problem: This is the amount of time taken to finish the job without delaying
the next milestone or the actual completion date. So, if any one of the machines delays its
part of the job, the whole work gets delayed.
2. Reliability problem: What if any of the machines working with a part of the data fails? The
management of this failover becomes a challenge.
3. Equal split issue: How will I divide the data into smaller chunks so that each machine gets
an even part of the data to work with? In other words, how do I divide the data equally so
that no individual machine is overloaded or underutilized?
4. Single split may fail: If any of the machines fails to provide its output, I will not be
able to calculate the result. So, there should be a mechanism to ensure the fault-tolerance
capability of the system.
5. Aggregation of the result: There should be a mechanism to aggregate the results generated
by each of the machines to produce the final output.

These are the issues which I would have to take care of individually while performing parallel
processing of huge data sets using the traditional approach.

To overcome these issues, we have the MapReduce framework, which allows us to perform such
parallel computations without bothering about issues like reliability and fault tolerance.
MapReduce therefore gives you the flexibility to write your code logic without caring about
the design issues of the system.
MapReduce Tutorial: What is MapReduce?

MapReduce is a programming framework that allows us to perform parallel processing on large
data sets in a distributed environment.

 MapReduce consists of two distinct tasks – Map and Reduce.

 As the name MapReduce suggests, the reducer phase takes place after the mapper phase has
been completed.
 So, the first is the map job, where a block of data is read and processed to produce
key-value pairs as intermediate outputs.
 The output of a Mapper or map job (key-value pairs) is the input to the Reducer.
 The reducer receives key-value pairs from multiple map jobs.
 Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs)
into a smaller set of tuples or key-value pairs, which is the final output.

MapReduce Tutorial: A Word Count Example of MapReduce

Let us understand how MapReduce works by taking an example where I have a text file called
example.txt whose contents are as follows:

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

Now, suppose we have to perform a word count on example.txt using MapReduce. So, we will be
finding the unique words and the number of occurrences of those unique words.
 First, we divide the input into three splits as shown in the figure. This will distribute
the work among all the map nodes.
 Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each
of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that
every word, in itself, will occur once.
 Now, a list of key-value pairs will be created where the key is nothing but the individual
words and the value is one. So, for the first line (Dear Bear River) we have 3 key-value
pairs – Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
 After the mapper phase, a partition process takes place where sorting and shuffling happen
so that all the tuples with the same key are sent to the corresponding reducer.
 So, after the sorting and shuffling phase, each reducer will have a unique key and a list of
values corresponding to that very key, e.g., Bear, [1,1]; Car, [1,1,1]; etc.
 Now, each Reducer counts the values present in its list of values. As shown in the figure,
the Reducer gets the list of values [1,1] for the key Bear. Then, it counts the number of
ones in the list and gives the final output as – Bear, 2.
 Finally, all the output key-value pairs are collected and written to the output file, as
condensed in the sketch below.
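
The same walkthrough, condensed into a few lines of Python to make the sorting-and-shuffling step concrete:

from itertools import groupby

pairs = [("Dear", 1), ("Bear", 1), ("River", 1),    # mapper 1
         ("Car", 1), ("Car", 1), ("River", 1),      # mapper 2
         ("Deer", 1), ("Car", 1), ("Bear", 1)]      # mapper 3

pairs.sort()                                        # sort and shuffle: equal keys become adjacent
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    values = [v for _, v in group]                  # e.g. Bear -> [1, 1]
    print(word, sum(values))                        # e.g. Bear 2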

MapReduce Tutorial: Advantages of MapReduce

The two biggest advantages of MapReduce are:

1. Parallel Processing:

In MapReduce, we divide the job among multiple nodes, and each node works on a part of the job
simultaneously. So, MapReduce is based on the divide-and-conquer paradigm, which helps us
process the data using different machines. As the data is processed by multiple machines in
parallel instead of by a single machine, the time taken to process it is reduced by a
tremendous amount, as shown in the figure below.

Fig.: Traditional Way Vs. MapReduce Way – MapReduce Tutorial

2. Data Locality:
Instead of moving the data to the processing unit, in the MapReduce framework we move the
processing unit to the data. In the traditional system, we used to bring the data to the
processing unit and process it. But as the data grew and became very huge, bringing this huge
amount of data to the processing unit posed the following issues:

 Moving huge data to processing is costly and deteriorates the network performance.
 Processing takes time as the data is processed by a single unit, which becomes the bottleneck.
 The master node can get over-burdened and may fail.

Now, MapReduce allows us to overcome these issues by bringing the processing unit to the data.
So, as you can see in the above image, the data is distributed among multiple nodes, where each
node processes the part of the data residing on it. This gives us the following advantages:

 It is very cost-effective to move the processing unit to the data.

 The processing time is reduced as all the nodes work with their part of the data in parallel.
 Every node gets a part of the data to process, so there is no chance of a node getting
overburdened.

MapReduce Tutorial: MapReduce Example Program

Before jumping into the details, let us have a glance at a MapReduce example program to get a
basic idea about how things work in a MapReduce environment in practice. I have taken the same
word-count example, where I have to find out the number of occurrences of each word. And don't
worry if you don't understand the code when you look at it for the first time; just bear with
me while I walk you through each part of the MapReduce code.
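
The original listing is not reproduced here; in its place is a complete, runnable single-process Python equivalent of the word-count job (a stand-in for the real Hadoop program, with the map, shuffle, and reduce steps spelled out):

from collections import defaultdict

def mapper(line):
    # map: emit (word, 1) for every word in one line of input
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # reduce: sum the ones emitted for each distinct word
    return word, sum(counts)

def word_count(lines):
    groups = defaultdict(list)
    for line in lines:                        # each input split feeds a mapper
        for word, one in mapper(line):
            groups[word].append(one)          # shuffle: group the pairs by key
    return [reducer(w, c) for w, c in sorted(groups.items())]

print(word_count(["Dear Bear River", "Car Car River", "Deer Car Bear"]))
# [('Bear', 2), ('Car', 3), ('Dear', 1), ('Deer', 1), ('River', 2)]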

MapReduce Algorithm for Matrix Multiplication

 Matrix Multiplication
o From linear algebra:

A × B = C

cij = ∑k=1,2,...,n aik × bkj

o Example:
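
For instance, with A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] (rows listed top to bottom):

c11 = a11 × b11 + a12 × b21 = 1 × 5 + 2 × 7 = 19

and computing every cij the same way gives C = [[19, 22], [43, 50]].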
 The reduce( ) step in the MapReduce Algorithm for matrix multiplication
o Facts:

1. The final step in the MapReduce algorithm is to produce the matrix A × B

2. The unit of computation of matrix A × B is one element of the matrix:

Conclusion:

 The inputs to the reduce( ) step (function) of the MapReduce algorithm are:

1. One row vector from matrix A


2. One column vector from matrix B

 The reduce( ) function will compute the inner product of:

 One row vector from matrix A

 One column vector from matrix B

 Preprocessing for the map( ) function


o Fact:

 The map( ) function (really) only has one input stream:

 of the format ( key_i , value_i )

o The inputs of the matrix multiplication are:

 Two (2) input matrices:

Therefore:

 We must insert a pre-processing step to:


 Convert the input matrices to the form:

( key1 , value1 )
( key2 , value2 )
( key3 , value3 )
...

Graphically:

o Pre-processing used for matrix multiplication (figure omitted)

 Overview of the MapReduce Algorithm for Matrix Multiplication


o So far, we have discovered:

 The input to the Map( ) is as follows:


( (A, 1, 1) , a11 )
( (A, 1, 2) , a12 )
( (A, 1, 3) , a13 )
...
( (B, 1, 1) , b11 )
( (B, 1, 2) , b12 )
( (B, 1, 3) , b13 )
...

 The input to one reduce( ) function is as follows:

 A row vector from matrix A


 A column vector from matrix B

o Graphical summary: (figure omitted)

 The MapReduce Algorithm for Matrix Multiplication

o The map( ) function:

 The map( ) will duplicate each input value N times, as follows:

( (A, i, j), x ) ---> ( (i,1), x ) ( (i,2), x ) .... ( (i,N), x )

( (B, i, j), x ) ---> ( (1,j), x ) ( (2,j), x ) .... ( (N,j), x )

where N = # rows in matrix A (= # columns in matrix B)

o Example:

o The shuffle mechanism of MapReduce will re-organize (group) the map( ) output as follows:
o The reduce( ) function will compute the inner product of the input vectors

o Postscript:
 We need to tag the map( ) function output with the position so the reduce( )
function can identify the components in the different vectors

Example: (figure omitted; this tagging detail was left out of the earlier figures for brevity)

 The reduce( ) function is as follows (written out as runnable Python rather than the
original pseudocode):

def reduce_inner_product(row_a, col_b):
    # row_a: the N values of one row vector from matrix A, ordered by position
    # col_b: the N values of one column vector from matrix B, ordered by position
    total = 0
    for x, y in zip(row_a, col_b):   # x = first value, y = second value at each position
        total += x * y               # accumulate the inner product
    return total
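
Putting all the pieces together, here is a single-process Python sketch of the whole algorithm: the pre-processing, the map( ) duplication with position tags, the shuffle grouping, and the inner-product reduce( ). The 2 × 2 matrices are just sample data.

from collections import defaultdict

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
N = 2                      # square matrices assumed, as above

# pre-processing: turn each matrix entry into a ( key, value ) pair
inputs = [(("A", i, j), A[i][j]) for i in range(N) for j in range(N)] + \
         [(("B", i, j), B[i][j]) for i in range(N) for j in range(N)]

def map_mm(key, x):
    m, i, j = key
    if m == "A":           # a_ij is needed by every cell (i, k) of C; tag with position j
        return [((i, k), (j, "A", x)) for k in range(N)]
    else:                  # b_ij is needed by every cell (k, j) of C; tag with position i
        return [((k, j), (i, "B", x)) for k in range(N)]

# shuffle: group the map output by output cell (i, j)
groups = defaultdict(list)
for key, value in inputs:
    for cell, tagged in map_mm(key, value):
        groups[cell].append(tagged)

def reduce_mm(tagged_values):
    # pair up the A and B values that share a position tag, then sum the products
    a = {pos: x for pos, tag, x in tagged_values if tag == "A"}
    b = {pos: x for pos, tag, x in tagged_values if tag == "B"}
    return sum(a[pos] * b[pos] for pos in a)

C = {cell: reduce_mm(vals) for cell, vals in groups.items()}
print(C)   # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}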
