
Shuffle and Sort

MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by
which the system performs the sort—and transfers the map outputs to the reducers as inputs—is
known as the shuffle. In this section we look at how the shuffle works, since a basic understanding
of it is helpful should you need to optimize a MapReduce program. The shuffle is an area of
the codebase where refinements and improvements are continually being made, so the following
description necessarily conceals many details (and may change over time; it describes the shuffle as of Hadoop 0.20).
In many ways, the shuffle is the heart of MapReduce and is where the “magic” happens.

The Map Side

When the map function starts producing output, it is not simply written to disk. The process is
more involved, and takes advantage of buffering writes in memory and doing some presorting for
efficiency reasons. Figure 6-6 shows what happens.

Each map task has a circular memory buffer that it writes the output to. The buffer is 100 MB by
default, a size that can be tuned by changing the io.sort.mb property. When the contents of the
buffer reach a certain threshold size (io.sort.spill.percent, default 0.80, or 80%), a background
thread will start to spill the contents to disk.

Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer
fills up during this time, the map will block until the spill is complete.
Spills are written in round-robin fashion to the directories specified by the mapred.local.dir
property, in a job-specific subdirectory.
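
As a rough illustration of how these map-side settings can be adjusted, the following sketch sets the 0.20-era property names on a job's Configuration. The 200 MB and 0.90 values are arbitrary examples, not recommendations, and mapred.local.dir is normally fixed cluster-wide rather than per job.

    import org.apache.hadoop.conf.Configuration;

    public class MapSpillTuning {
        public static Configuration tunedConf() {
            Configuration conf = new Configuration();
            // Size of the circular in-memory buffer each map task writes its output into (default 100 MB).
            conf.setInt("io.sort.mb", 200);
            // Fraction of that buffer which, once filled, triggers a background spill to disk (default 0.80).
            conf.setFloat("io.sort.spill.percent", 0.90f);
            return conf;
        }
    }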

Figure 6-6. Shuffle and sort in MapReduce

Before it writes to disk, the thread first divides the data into partitions corresponding to the
reducers that it will ultimately be sent to. Within each partition, the background thread performs
an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
Running the combiner function makes for a more compact map output, so there is less data to write
to local disk and to transfer to the reducer.
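
To make the partitioning step concrete, here is a minimal partitioner sketch in the style of Hadoop's default HashPartitioner; it is an illustrative reimplementation, not the shuffle code itself. Each map output key is hashed and assigned to one of the reducer partitions.

    import org.apache.hadoop.mapreduce.Partitioner;

    // Assigns each map output record to one of numPartitions reducers,
    // mirroring what the default HashPartitioner does.
    public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numPartitions) {
            // Mask off the sign bit so the result is non-negative, then take the remainder.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }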

Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the
map task has written its last output record there could be several spill files. Before the task is
finished, the spill files are merged into a single partitioned and sorted output file. The
configuration property io.sort.factor controls the maximum number of streams to merge at once;
the default is 10.
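
If a map task produces many spill files, the merge factor can be raised so that more streams are combined per round. A minimal sketch, with 25 chosen purely as an example value:

    import org.apache.hadoop.conf.Configuration;

    public class MergeFactorTuning {
        public static Configuration tunedConf() {
            Configuration conf = new Configuration();
            // Maximum number of spill files (streams) merged in a single round (default 10).
            conf.setInt("io.sort.factor", 25);
            return conf;
        }
    }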

If there are at least three spill files (set by the min.num.spills.for.combine property), then the
combiner is run again before the output file is written. Recall that combiners may be run repeatedly
over the input without affecting the final result. If there are only one or two spills, the potential
reduction in map output size is not worth the overhead of invoking the combiner, so it is not run
again for this map output.
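
Because a combiner may be applied any number of times without changing the result, a reducer-style aggregation such as a sum is the typical choice. The following sketch wires one up using the stock IntSumReducer from the new API and shows the spill-threshold property; the job name and values are examples only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class CombinerSetup {
        public static Job configuredJob() throws Exception {
            Configuration conf = new Configuration();
            // Re-run the combiner while merging spill files only if at least this many spills exist (default 3).
            conf.setInt("min.num.spills.for.combine", 3);

            Job job = new Job(conf, "combiner-demo"); // Job.getInstance(conf, ...) in later releases
            // Any reducer that tolerates repeated application can serve as the combiner.
            job.setCombinerClass(IntSumReducer.class);
            return job;
        }
    }
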
It is often a good idea to compress the map output as it is written to disk, since doing so makes
it faster to write to disk, saves disk space, and reduces the amount of data to transfer to the reducer.
By default, the output is not compressed, but compression is easy to enable by setting
mapred.compress.map.output to true. The compression library to use is specified by
mapred.map.output.compression.codec.
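
A hedged sketch of enabling map output compression with these 0.20-era property names; GzipCodec is used here simply because it ships with Hadoop, and other codecs can be substituted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;

    public class MapOutputCompression {
        public static Configuration tunedConf() {
            Configuration conf = new Configuration();
            // Compress map output as it is spilled to disk and transferred to the reducers.
            conf.setBoolean("mapred.compress.map.output", true);
            // Codec class used for the compression.
            conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
            return conf;
        }
    }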

The output file’s partitions are made available to the reducers over HTTP. The maximum number
of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads
property—this setting is per tasktracker, not per map task slot. The default of 40 may need
increasing for large clusters running large jobs. In MapReduce 2, this property is not applicable
since the maximum number of threads used is set automatically based on the number of processors
on the machine. (MapReduce 2 uses Netty, which by default allows up to twice as many threads
as there are processors.)

The Reduce Side

Let’s turn now to the reduce part of the process. The map output file is sitting on the local disk of
the machine that ran the map task (note that although map outputs always get written to local disk,
reduce outputs may not be), but now it is needed by the machine that is about to run the reduce
task for the partition. Furthermore, the reduce task needs the map output for its particular
partition from several map tasks across the cluster. The map tasks may finish at different
times, so the reduce task starts copying their outputs as soon as each completes. This is known
as the copy phase of the reduce task. The reduce task has a small number of copier threads so
that it can fetch map outputs in parallel. The default is five threads, but this number can be
changed by setting the mapred.reduce.parallel.copies property. The map outputs are copied to
the reduce task JVM’s memory if they are small enough (the buffer’s size is controlled by
mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this
purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold
size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map
outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is
specified it will be run during the merge to reduce the amount of data written to disk.
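
As with the map side, these copy and merge settings can be adjusted per job. A minimal sketch with the 0.20-era property names (the values are illustrative, and several simply restate the defaults):

    import org.apache.hadoop.conf.Configuration;

    public class ReduceShuffleTuning {
        public static Configuration tunedConf() {
            Configuration conf = new Configuration();
            // Number of copier threads fetching map outputs in parallel (default 5).
            conf.setInt("mapred.reduce.parallel.copies", 10);
            // Proportion of the reduce task's heap used to buffer copied map outputs.
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
            // Fraction of that buffer which, once filled, triggers a merge and spill to disk.
            conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
            // Alternatively, merge and spill after this many map outputs accumulate in memory.
            conf.setInt("mapred.inmem.merge.threshold", 1000);
            return conf;
        }
    }
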
As the copies accumulate on disk, a background thread merges them into larger, sorted files. This
saves some time merging later on. Note that any map outputs that were compressed (by the
map task) have to be decompressed in memory in order to perform a merge on them.

When all the map outputs have been copied, the reduce task moves into the sort phase (which
should properly be called the merge phase, as the sorting was carried out on the map side), which
merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if
there were 50 map outputs, and the merge factor was 10 (the default, controlled by the io.sort.factor
property, just like in the map’s merge), then there would be five rounds. Each round would merge 10
files into one, so at the end there would be five intermediate files.
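
The arithmetic behind the example can be sketched as follows; this is the simplified model used in the text (one pass merging up to the merge factor at a time), and the real merge logic is somewhat more elaborate.

    public class MergeRounds {
        // Number of merge rounds (and resulting intermediate files) when each round
        // combines up to mergeFactor sorted segments into one.
        static int rounds(int segments, int mergeFactor) {
            return (int) Math.ceil((double) segments / mergeFactor);
        }

        public static void main(String[] args) {
            // 50 map outputs with the default merge factor of 10: five rounds, five intermediate
            // files, which the final merge then feeds straight into the reduce function.
            System.out.println(rounds(50, 10)); // prints 5
        }
    }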

Rather than have a final round that merges these five files into a single sorted file, the merge
saves a trip to disk by directly feeding the reduce function in what is the last phase: the reduce
phase. This final merge can come from a mixture of in-memory and on-disk segments.

During the reduce phase, the reduce function is invoked for each key in the sorted output. The
output of this phase is written directly to the output filesystem, typically HDFS. In the case of
HDFS, since the tasktracker node (or node manager) is also running a datanode, the first block
replica will be written to the local disk.
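
To connect the reduce phase back to user code, here is a minimal reducer in the style of the classic word-count example (the class name is illustrative). The framework calls reduce() once per key with that key's values, already sorted and merged by the shuffle, and whatever is written to the context goes to the job's output format, typically files on HDFS.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Invoked once per key during the reduce phase; the values arrive grouped and sorted by the shuffle.
    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            // Written through the job's output format, typically to HDFS.
            context.write(key, new IntWritable(sum));
        }
    }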
