
tails by introducing partitioners and combiners, which provide greater control
over data flow. MapReduce would not be practical without a tightly-integrated
distributed file system that manages the data being processed; Section 2.5
covers this in detail. Tying everything together, a complete cluster architecture
is described in Section 2.6 before the chapter ends with a summary.

2.1 Functional Programming Roots


MapReduce has its roots in functional programming, which is exemplified in
languages such as Lisp and ML.4 A key feature of functional languages is the
concept of higher-order functions, or functions that can accept other functions
as arguments. Two common built-in higher-order functions are map and fold,
illustrated in Figure 2.1. Given a list, map takes as an argument a function f
(that takes a single argument) and applies it to all elements in a list (the top
part of the diagram). Given a list, fold takes as arguments a function g (that
takes two arguments) and an initial value: g is first applied to the initial value
and the first item in the list, the result of which is stored in an intermediate
variable. This intermediate variable and the next item in the list serve as
the arguments to a second application of g, the results of which are stored in
the intermediate variable. This process repeats until all items in the list have
been consumed; fold then returns the final value of the intermediate variable.
Typically, map and fold are used in combination. For example, to compute
the sum of squares of a list of integers, one could map a function that squares
its argument (i.e., λx.x²) over the input list, and then fold the resulting list
with the addition function (more precisely, λx.λy.x + y) using an initial value
of zero.
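The sum-of-squares example can be sketched directly in Python, whose built-in map and functools.reduce correspond to the map and fold operations described above:

```python
from functools import reduce

nums = [1, 2, 3, 4]
squares = map(lambda x: x * x, nums)            # map: apply f to every element
total = reduce(lambda x, y: x + y, squares, 0)  # fold: aggregate with g, initial value 0
print(total)  # 30
```

Note that reduce takes the initial value as its last argument, mirroring the description of fold: g is first applied to the initial value and the first list item, and the intermediate result is threaded through subsequent applications.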
We can view map as a concise way to represent the transformation of a
dataset (as defined by the function f ). In the same vein, we can view fold as an
aggregation operation, as defined by the function g. One immediate observation
is that the application of f to each item in a list (or more generally, to elements
in a large dataset) can be parallelized in a straightforward manner, since each
functional application happens in isolation. In a cluster, these operations can
be distributed across many different machines. The fold operation, on the
other hand, has more restrictions on data locality—elements in the list must
be “brought together” before the function g can be applied. However, many
real-world applications do not require g to be applied to all elements of the
list. To the extent that elements in the list can be divided into groups, the fold
aggregations can also proceed in parallel. Furthermore, for operations that are
commutative and associative, significant efficiencies can be gained in the fold
operation through local aggregation and appropriate reordering.
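A minimal single-process sketch of this idea: because addition is commutative and associative, each group can be folded independently (in a cluster, on a different machine) and the partial results combined afterwards. The function name and round-robin grouping scheme here are illustrative, not from the original text:

```python
from functools import reduce

def grouped_fold(g, initial, data, num_groups=4):
    # Divide the list into groups; each group's fold could run in parallel.
    groups = [data[i::num_groups] for i in range(num_groups)]
    # Local aggregation within each group.
    partials = [reduce(g, grp, initial) for grp in groups]
    # Combine the partial results with the same function g.
    return reduce(g, partials, initial)

print(grouped_fold(lambda x, y: x + y, 0, [1, 2, 3, 4, 5]))  # 15
```

This only yields the correct answer because addition is commutative and associative; for an operation without those properties, the grouping and reordering would change the result.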
In a nutshell, we have described MapReduce. The map phase in MapReduce
roughly corresponds to the map operation in functional programming, whereas
the reduce phase in MapReduce roughly corresponds to the fold operation in
4 However, there are important characteristics of MapReduce that make it non-functional
in nature—this will become apparent later.


Figure 2.1: Illustration of map and fold, two higher-order functions commonly
used together in functional programming: map takes a function f and applies it
to every element in a list, while fold iteratively applies a function g to aggregate
results.
functional programming. As we will discuss
in detail shortly, the MapReduce
execution framework coordinates the map and reduce phases of processing over
large amounts of data on large clusters of commodity machines.
Viewed from a slightly different angle, MapReduce codifies a generic “recipe”
for processing large datasets that consists of two stages. In the first stage, a
user-specified computation is applied over all input records in a dataset. These
operations occur in parallel and yield intermediate output that is then aggre-
gated by another user-specified computation. The programmer defines these
two types of computations, and the execution framework coordinates the ac-
tual processing (very loosely, MapReduce provides a functional abstraction).
Although such a two-stage processing structure may appear to be very restric-
tive, many interesting algorithms can be expressed quite concisely—especially
if one decomposes complex algorithms into a sequence of MapReduce jobs.
Subsequent chapters in this book focus on how a number of algorithms can be
implemented in MapReduce.
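The two-stage recipe can be made concrete with the canonical word-count job, sketched here in plain Python. The mapper and reducer are the two user-specified computations; the driver function is a stand-in for the execution framework (the names are illustrative, not part of any real MapReduce API):

```python
from collections import defaultdict

def mapper(record):
    # First stage: applied to every input record, in parallel across records.
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # Second stage: aggregates all intermediate values sharing a key.
    yield (key, sum(values))

def run_job(records):
    # Stand-in for the execution framework: group intermediate
    # (key, value) pairs by key, then invoke the reducer on each group.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    results = {}
    for key, values in groups.items():
        for k, v in reducer(key, values):
            results[k] = v
    return results

print(run_job(["a rose is a rose"]))  # {'a': 2, 'rose': 2, 'is': 1}
```

In a real implementation the grouping step is the distributed shuffle, and mapper and reducer invocations run on many machines; the programmer writes only the two functions.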
To be precise, MapReduce can refer to three distinct but related concepts.
First, MapReduce is a programming model, which is the sense discussed above.
Second, MapReduce can refer to the execution framework (i.e., the “runtime”)
that coordinates the execution of programs written in this particular style. Fi-
nally, MapReduce can refer to the software implementation of the programming
model and the execution framework: for example, Google’s proprietary imple-
mentation vs. the open-source Hadoop implementation in Java. And in fact,
there are many implementations of MapReduce, e.g., targeted specifically for
multi-core processors [127], for GPGPUs [71], for the CELL architecture [126],
etc. There are some differences between the MapReduce programming model
