Big Data Analytics: MapReduce (Contd.)
Dr. Shivangi Shukla
Assistant Professor
Computer Science and Engineering
IIIT Pune
Contents
• Failures
• Job Scheduling
• MapReduce Types
• Input Formats
• Output Formats
• MapReduce Features
• Counters
• Sorting
• Joins
• Side data distribution

Failures
Failures in MapReduce are categorized into three
categories:
i. Task Failure
ii. TaskTracker Failure
iii. JobTracker Failure

Task Failure
• When the JobTracker is notified of a task attempt that has
failed (by the TaskTracker's heartbeat call), it reschedules
execution of the task.
• When rescheduling the task, the JobTracker avoids the
TaskTracker where the task previously failed.
• If a task has failed more than four times, it will not be
retried further.
• This value is configurable: the maximum number of
attempts to run a task is controlled by the
mapred.map.max.attempts property for map tasks,
and mapred.reduce.max.attempts for reduce tasks.
• By default, if any task fails more than four times (or
whatever the maximum number of attempts is configured to),
the whole job fails.
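• For illustration, a minimal driver sketch for raising these limits (a sketch assuming the old mapred Java API; the driver class name and the value 8 are arbitrary):

    JobConf conf = new JobConf(MyJobDriver.class);  // MyJobDriver is a hypothetical driver class
    conf.setMaxMapAttempts(8);                      // same effect as setting mapred.map.max.attempts
    conf.setMaxReduceAttempts(8);                   // same effect as setting mapred.reduce.max.attempts
    // Equivalent property-based form:
    // conf.setInt("mapred.map.max.attempts", 8);
    // conf.setInt("mapred.reduce.max.attempts", 8);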
TaskTracker Failure
• If a TaskTracker fails by crashing, or runs very slowly, it
will stop sending heartbeats to the JobTracker (or send
them very infrequently).
• The JobTracker notices that the TaskTracker has stopped
sending heartbeats and removes it from its pool of
TaskTrackers to schedule tasks on.
• The JobTracker arranges for map tasks that ran and
completed successfully on that TaskTracker to be
rerun if they belong to incomplete jobs,
• because the intermediate output of map tasks
resides on the failed TaskTracker's local filesystem,
which is not accessible to the reduce tasks. Any tasks in
progress are also rescheduled.
TaskTracker Failure..
• A TaskTracker can also be blacklisted by the
JobTracker, even if the TaskTracker has not failed.
• A TaskTracker is blacklisted if the number of
tasks that have failed on it is significantly higher
than the average task failure rate on the cluster.
• Blacklisted TaskTrackers can be restarted to
remove them from the JobTracker’s blacklist.

JobTracker Failure
• Failure of the JobTracker is the most serious failure
mode.
• Currently, Hadoop has no mechanism for dealing with
failure of the JobTracker. JobTracker is a single point of
failure, so in this case the job fails.
• However, this failure of JobTracker has a low chance of
occurring since the chance of a particular machine
failing is low.
• It is possible that a future release of Hadoop will
remove this limitation by running multiple
JobTrackers, only one of which is the primary
JobTracker at any time.
Job Scheduling
• Early versions of Hadoop had a simple approach to
scheduling users' jobs: they ran in order of
submission, using a FIFO scheduler.
• Typically, each job would use the whole cluster, so
jobs had to wait their turn.
• Production jobs need to complete in a timely
manner,
• while allowing users who are making smaller ad hoc
queries to get results back in a reasonable time.
• A shared cluster offers large resources to many users in
the cluster;
• however, the problem of sharing resources fairly between
users requires a better scheduler.
Job Scheduling..
• Later, the ability to set a job’s priority was added via
mapred.job.priority property or setJobPriority()
method on JobClient (which take one of the values
VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW).
• When job scheduler is choosing the next job to run, it
selects one with the highest priority.
• However, FIFO scheduler priorities do not support
preemption, so a high-priority job can still be
blocked by a long-running low priority job that
started before the high-priority job was scheduled.
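• A minimal sketch of setting a job's priority from the driver (assuming the old mapred Java API; whether the priority has any effect depends on the scheduler in use):

    JobConf conf = new JobConf(MyJobDriver.class);  // hypothetical driver class
    conf.setJobPriority(JobPriority.HIGH);          // one of VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW
    // Equivalent property-based form:
    // conf.set("mapred.job.priority", "HIGH");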

Job Scheduling..
• MapReduce in Hadoop now comes with a choice of
schedulers.
• The default is the original FIFO queue-based scheduler.
• There is also a multi-user scheduler called the Fair
Scheduler.

Job Scheduling- Fair Scheduler
• Fair Scheduler aims to give every user a fair share of
the cluster capacity over time.
• If a single job is running, it gets all of the cluster.
• As more jobs are submitted, free task slots are
given to the jobs in such a way that each user gets a
fair share of the cluster.
• This ensures that a short job that belongs to one
user will complete in a reasonable time, while a
long job belonging to another user continues to run
and make progress.

Job Scheduling- Fair Scheduler..
• Jobs are placed in pools,
• and by default, each user gets their own pool.
• If a user submits more jobs than another user,
• the Fair Scheduler ensures that both users receive
roughly the same amount of cluster resources on
average over time, regardless of how many jobs
each one submits.
• It is also possible to define custom pools
• with guaranteed minimum capacities
• defined in terms of number of map and reduce
slots, and to set weightings for each pool.
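• For illustration, a plausible allocation file for the MR1 Fair Scheduler that defines such a custom pool (a sketch only; the pool name, numbers, and element names are assumptions and may vary across Hadoop versions):

    <?xml version="1.0"?>
    <allocations>
      <!-- hypothetical pool with a guaranteed minimum capacity and a higher weight -->
      <pool name="analytics">
        <minMaps>10</minMaps>
        <minReduces>5</minReduces>
        <weight>2.0</weight>
      </pool>
    </allocations>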
Job Scheduling- Fair Scheduler..
• The Fair Scheduler supports preemption,
• If a pool has not received its fair share of
resources for a certain period of time,
• then the scheduler will kill tasks in pools running
over capacity in order to give the slots to the
pool running under capacity.

MapReduce Types
• The map and reduce functions in Hadoop MapReduce
have the following general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
• The map input key and value types (K1 and V1) are
different from the map output types (K2 and V2).
• The reduce input must have the same types as the map
output, although the reduce output types may be
different again (K3 and V3).
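• In the old (mapred) Java API, which this material follows, these general forms correspond to the Mapper and Reducer interfaces. A minimal sketch with the type parameters renamed to match K1, V1, ..., K3, V3:

    // map: (K1, V1) -> list(K2, V2), emitted through the OutputCollector
    public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
      void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
          throws IOException;
    }

    // reduce: (K2, list(V2)) -> list(K3, V3)
    public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
      void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
          throws IOException;
    }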

Types of Input and Output Formats

Input Formats:
• Input Splits and Records
• Text Input: TextInputFormat, NLineInputFormat
• Binary Input: SequenceFileInputFormat, SequenceFileAsTextInputFormat, SequenceFileAsBinaryInputFormat
• Multiple Inputs
• Database Input

Output Formats:
• Text Output
• Binary Output: SequenceFileOutputFormat, SequenceFileAsBinaryOutputFormat, MapFileOutputFormat
• Multiple Outputs: MultipleOutputFormat, MultipleOutputs
• Lazy Output
• Database Output

Input Formats
• Input split is a chunk of the input that is processed by a
single map. InputFormat is responsible for creating input
splits, and dividing them into records
• Each map processes a single split. Each split is divided into
records, and the map processes each record and returns a
key-value pair.
• InputSplit has a length in bytes, and a set of storage
locations (in form of hostname strings).
• The InputSplit doesn’t contain the input data; it is just a
reference to the data.
• thus, the storage locations are used by the MapReduce system
to place map tasks as close to the InputSplit's data as
possible,
• and the size is used to order the splits so that the largest
get processed first, in an attempt to minimize the job
runtime.
Text Input
• Hadoop excels at processing unstructured text.
• The types of text input are:
• TextInputFormat
• NLineInputFormat

TextInputFormat
• TextInputFormat is the default InputFormat. Each
record is a line of input.
• The key is the byte offset within the file of the
beginning of the line.
• The offset is usually sufficient for applications that
need a unique identifier for each line.
• Combined with the file’s name, it is unique within
the filesystem.
• The value is the contents of the line, excluding any
line terminators (newline, carriage return), and is
packaged as a Text object.
TextInputFormat
• For example, a file containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
• is divided into one split of four records. The records
are interpreted as the following key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
NLineInputFormat
• With TextInputFormat, each mapper receives a variable
number of lines of input.
• The number depends on the size of the split and the length of the
lines.
• If the application demands that mappers receive a fixed number
of lines of input,
• then NLineInputFormat is used instead of TextInputFormat.
• Similar to TextInputFormat the keys are the byte offsets
within the file and the values are the lines themselves.
• N refers to the number of lines of input that each mapper
receives.
• If N is set to one (default), each mapper receives exactly one line of
input.
• If N is two, then each split contains two lines and each mapper
receives two key-value pairs.
NLineInputFormat
• For example, consider these four lines:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
• If N is two, then each split contains two lines. One mapper will
receive the first two key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
• And another mapper will receive the second two key-value pairs:
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
• The keys and values are the same as TextInputFormat
produces. The difference lies in the way the splits are
constructed in NLineInputFormat.
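• A minimal driver sketch for using NLineInputFormat with N = 2 (a sketch assuming the old mapred API; the property name is an assumption and may differ in other releases):

    JobConf conf = new JobConf(MyJobDriver.class);            // hypothetical driver class
    conf.setInputFormat(NLineInputFormat.class);              // org.apache.hadoop.mapred.lib.NLineInputFormat
    conf.setInt("mapred.line.input.format.linespermap", 2);   // N = 2 lines per mapper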
Binary Input
• Hadoop MapReduce supports binary formats in
addition to textual data.
• Sequence files are a special type of binary file format
designed for efficient storage and processing of large
datasets.
• These sequence files store data in a binary format,
meaning that the data is written in a compact,
serialized form.
• The input formats related to sequence files are:
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
• SequenceFileAsBinaryInputFormat
SequenceFileInputFormat
• Hadoop MapReduce sequence file format stores
sequences of binary key-value pairs.
• The sequence file format has sync points that
enable readers to synchronize with record boundaries
from an arbitrary point in the file.
• The sequence file format also supports compression as
part of the format.

SequenceFileAsTextInputFormat
• SequenceFileAsTextInputFormat is a variant of
SequenceFileInputFormat that converts the
sequence file’s keys and values to text objects.
• The conversion is performed by calling toString() on
the keys and values.
• This format makes sequence files suitable input for
Streaming.

SequenceFileAsBinaryInputFormat
• SequenceFileAsBinaryInputFormat is a variant of
SequenceFileInputFormat
• that retrieves the sequence file’s keys and values as opaque
binary objects (raw binary data).
• They are encapsulated as BytesWritable objects, and
the application is free to interpret the underlying byte
array as it pleases.
• Combined with SequenceFile.Reader’s appendRaw()
method,
• this provides a way to use any binary data types with
MapReduce (packaged as a sequence file).

Multiple Inputs
• The input to a MapReduce job may consist of multiple input
files; by default, all of the input is interpreted by a single
InputFormat and a single Mapper.
• However, the data format evolves with time,
• for instance, different data sources that provide the
same type of data but in different formats
• or, even if they are in the same format, they may have
different representations, and therefore need to be
parsed differently.
• These cases are handled by using the MultipleInputs class,
• which allows the InputFormat and Mapper to be specified
on a per-path basis, as sketched below.
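• For illustration, a sketch of per-path configuration with MultipleInputs (old mapred API; the paths and mapper classes are hypothetical):

    // Two sources of the same logical data, parsed by different mappers:
    MultipleInputs.addInputPath(conf, new Path("/data/source-a"),
        TextInputFormat.class, SourceAMapper.class);            // hypothetical mapper
    MultipleInputs.addInputPath(conf, new Path("/data/source-b"),
        SequenceFileInputFormat.class, SourceBMapper.class);    // hypothetical mapper
    // These calls replace the usual FileInputFormat.addInputPath() and setMapperClass() calls.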

Database Input
• DBInputFormat is an input format for reading
data from a relational database, using JDBC.
• It is considered best practice to use database
input to load relatively small datasets,
• and to join them with larger datasets from HDFS, using
MultipleInputs.
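• A sketch of DBInputFormat configuration (old mapred API; the JDBC driver, connection details, table, and record class are assumptions, and the record class must implement DBWritable):

    DBConfiguration.configureDB(conf,
        "com.mysql.jdbc.Driver",                          // JDBC driver class (assumption)
        "jdbc:mysql://dbhost/mydb", "user", "password");  // connection details (assumption)
    DBInputFormat.setInput(conf, EmployeeRecord.class,    // hypothetical DBWritable record class
        "employees",                                      // table name
        null,                                             // WHERE conditions (none)
        "id",                                             // ORDER BY column
        "id", "name", "salary");                          // columns to read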

Output Formats – Text Output
• Text Output
• TextOutputFormat is the default output format.
• It writes records as lines of text.
• Its keys and values can be of any type, since
TextOutputFormat converts keys and values to
strings by calling toString() on them.
• By default, the key and the value are separated by a tab
character, although this is configurable.

Binary Output
• SequenceFileOutputFormat
• SequenceFileOutputFormat writes sequence
files for its output.
• This is considered a good choice of output format
• if it forms the input to a further MapReduce job,
since it is compact and readily compressed.
• Compression is controlled via the static methods on
SequenceFileOutputFormat
• SequenceFileAsBinaryOutputFormat
• SequenceFileAsBinaryOutputFormat is the
counterpart to
SequenceFileAsBinaryInputFormat.
• It writes keys and values in raw binary format into a
SequenceFile container.
Binary Output..
• MapFileOutputFormat
• MapFileOutputFormat writes MapFiles as
output.
• The keys in a MapFile must be added in order, so
it is necessary to ensure that the reducers emit
keys in sorted order.

Multiple Outputs
• FileOutputFormat and its subclasses generate a set
of files in the output directory.
• There is one file per reducer
• and files are named by the partition number: part-00000,
part-00001, etc.
• There is sometimes a need to have more control over
the naming of the files, or to produce multiple files per
reducer.
• MapReduce comes with two libraries for this
purpose:
• MultipleOutputFormat and MultipleOutputs.

MultipleOutputFormat
• MultipleOutputFormat allows data to be written to
multiple files
• whose names are derived from the output keys and values.
• MultipleOutputFormat is an abstract class with two
concrete subclasses,
• MultipleTextOutputFormat and
• MultipleSequenceFileOutputFormat,
• which are the multiple file equivalents of
TextOutputFormat and SequenceFileOutputFormat.
• MultipleOutputFormat provides a few protected
methods
• that subclasses can override to control the output filename.

MultipleOutputs
• The MultipleOutputs class is used to generate
additional outputs to the usual output.
• Outputs are given names, and may be written to a
single file (called a single named output),
• or to multiple files (called a multi named output).
• When multiple files are needed, one for each output,
• a multi named output is used; it is initialized in the
driver by calling the addMultiNamedOutput()
method to specify
• the name of the output, the output format, and the
output types, as sketched below.
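• For illustration, a sketch of a multi named output (old mapred API; the output name "station" and the key/value types are assumptions):

    // Driver: declare a multi named output called "station".
    MultipleOutputs.addMultiNamedOutput(conf, "station",
        TextOutputFormat.class, Text.class, Text.class);

    // Reducer: create MultipleOutputs in configure(), write per-station files, close in close().
    private MultipleOutputs mos;
    public void configure(JobConf conf) { mos = new MultipleOutputs(conf); }
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      while (values.hasNext()) {
        // Output files are named like station_<key>-r-<partition>.
        mos.getCollector("station", key.toString(), reporter).collect(key, values.next());
      }
    }
    public void close() throws IOException { mos.close(); }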

Difference between MultipleOutputFormat and MultipleOutputs
• Complete control over names of files and directories: supported by MultipleOutputFormat, not by MultipleOutputs
• Different key and value types for different outputs: supported by MultipleOutputs, not by MultipleOutputFormat
• Use from map and reduce in the same job: supported by MultipleOutputs, not by MultipleOutputFormat
• Multiple outputs per record: supported by MultipleOutputs, not by MultipleOutputFormat
• Use with any OutputFormat: supported by MultipleOutputs, not by MultipleOutputFormat
• It should be noted that MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory structure and file naming.
Lazy Output
• FileOutputFormat subclasses create output files,
• even if they are empty.
• Some applications prefer that empty files not be
created,
• which is where LazyOutputFormat helps.
• It is a wrapper output format
• that ensures that the output file is created only when the
first record is emitted for a given partition.
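• A minimal sketch (assuming the old mapred API), wrapping the real output format so that empty part files are not created:

    // TextOutputFormat is the underlying format; files appear only when a record is written.
    LazyOutputFormat.setOutputFormatClass(conf, TextOutputFormat.class);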

Database Output
• Database output formats write to relational
databases and to HBase.
• The output format is DBOutputFormat,
• which is useful for dumping job outputs (of modest
size) into a database.
• TableOutputFormat is for writing MapReduce
outputs into an HBase table.

MapReduce Features
• MapReduce features are categorized as:
• Counters
• Sorting
• Joins
• Side data distribution

Counters
• Counters are a useful channel for gathering statistics
about the job: for quality control, or for application
level-statistics.
• Counters are classified into two categories:
• Built-in Counters
• User-defined Counters

Built-in Counters
• Hadoop maintains some built-in counters for every job,
which report various metrics for the respective job.
• For instance, there are counters for the number of
bytes and records processed,
• which allow one to confirm that the expected amount
of input was consumed and the expected amount of
output was produced.

Built-in Counters
Group: Map-Reduce Framework
• Map input records: The number of input records consumed by all the maps in the job. It is incremented every time a record is read from a RecordReader and passed to the map method by the framework.
• Map output records: The number of map output records produced by all the maps in the job.
• Map skipped records: The number of input records skipped by all the maps in the job.
• Map input bytes: The number of bytes of uncompressed input consumed by all the maps in the job.
• Map output bytes: The number of bytes of uncompressed output produced by all the maps in the job.
Built-in Counters
Group: File Systems
• Filesystem bytes read: The number of bytes read by each filesystem by map and reduce tasks. There is a counter for each filesystem, which may be Local, HDFS, S3, KFS, etc.
• Filesystem bytes written: The number of bytes written by each filesystem by map and reduce tasks.

Built-in Counters
Group: Job Counters
• Launched map tasks: The number of map tasks that were launched.
• Launched reduce tasks: The number of reduce tasks that were launched.
• Failed map tasks: The number of map tasks that failed.
• Failed reduce tasks: The number of reduce tasks that failed.
• Data-local map tasks: The number of map tasks that ran on the same node as their input data.
• Rack-local map tasks: The number of map tasks that ran on a node in the same rack as their input data.

Built-in Counters
• Counters are maintained by the task with which they
are associated, and periodically sent to the TaskTracker
and then to the JobTracker, so they can be globally
aggregated.

User-Defined Counters
• MapReduce allows user code to define a set of
counters, which are then incremented as desired in the
mapper or reducer.
• Counters are defined by a Java enum, which serves to
group related counters.
• In Java, an enum (short for "enumeration") is a
special data type that enables a variable to be a set
of predefined constants.
• It is used when you have a fixed set of values that a
variable can take, like days of the week, directions,
or states in a process.
User-Defined Counters
• A job may define an arbitrary number of enums, each
with an arbitrary number of fields.
• The name of the enum is the group name,
• and the enum’s fields are the counter names.
• Counters are global: the MapReduce framework
aggregates them across all maps and reduces to
produce a grand total at the end of the job
• User-defined counters can also be created as dynamic
counters, should be given readable counter names, and can
be retrieved after the job completes; these aspects are
discussed next.
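• For illustration, a sketch of an enum-based user-defined counter incremented from a mapper (old mapred API; the enum and the condition being counted are hypothetical):

    // Group name = enum name ("Temperature"); counter names = enum fields.
    enum Temperature { MISSING, MALFORMED }

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      if (line.isEmpty()) {
        reporter.incrCounter(Temperature.MISSING, 1);   // increment the user-defined counter
        return;
      }
      // ... normal map logic (omitted) ...
    }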

Dynamic Counters
• Dynamic counters are user-defined counters
• that can be created dynamically during the execution
of a MapReduce job.
• A user can define a new counter at any point in the
program,
• allowing the counting of various custom events that
happen during the job execution.
• For example, if a mapper processes certain records of
interest,
• a counter can be dynamically created to track how
many such records are encountered.

Dynamic Counters..
• There are two ways of creating and accessing counters:
• using enums
• using Strings
• Both are actually equivalent since Hadoop turns enums
into Strings to send counters over RPC.
• However, enums are slightly easier to work with,
provide type safety, and are suitable for most jobs.
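• A sketch of the String-based (dynamic) form, where the group and counter names are chosen at runtime (the names here are assumptions):

    // qualityCode is assumed to be a String extracted from the current record;
    // a new counter appears the first time a particular value is seen.
    reporter.incrCounter("TemperatureQuality", "quality_" + qualityCode, 1);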

Readable Counter Names
• Readable counter names are names given to counters
that make it easier to understand what the counter is
measuring.
• The names should be meaningful, so anyone reviewing
the job results can quickly understand what is being
counted.
• When defining user-defined counters, users can assign
intuitive names to make them more readable.

Retrieving Counters
• Once a MapReduce job is complete,
• counters can be retrieved to analyze the execution
of the job and extract useful information.
• Counters can be retrieved
• from the job’s output programmatically or from the
Hadoop web UI.
• In a Java program,
• after the job completes, users can get access to
counters and retrieve specific counter values.
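• A sketch of reading a counter value after the job completes (old mapred API; the Temperature enum is the hypothetical one used in the earlier counter example):

    RunningJob job = JobClient.runJob(conf);     // runs the job and waits for completion
    Counters counters = job.getCounters();
    long missing = counters.getCounter(Temperature.MISSING);
    System.out.println("Records with missing temperature: " + missing);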

Sorting
• The ability to sort data is at the heart of MapReduce.
• MapReduce uses the sorting stage
• to organize its data.
• Sorting in MapReduce is categorized as:
• Partial Sort
• Total Sort
• Secondary Sort

Partial Sorting
• By default, MapReduce sorts input records by their
keys.
• Each individual reducer sorts the data it receives.
• It should be noted that
• sorting is local to each reducer, meaning that the
output of the MapReduce job will have each
reducer's data sorted independently,
• but there is no global order across all reducers.

Partial Sorting..
• Working mechanism
• After the Map phase, the shuffle and sort phase
ensures that all key-value pairs with the same key
go to the same reducer.
• Within each reducer, the key-value pairs are sorted
by the key, but there is no coordination between
reducers to ensure global order.
• Partial sort is useful when the data needs to be
grouped or sorted locally per key but doesn't need to
be globally ordered across the entire dataset.

Total Sort
• In Total Sort, the output of the MapReduce job is
globally sorted across all partitions or reducers. This
ensures that the entire output dataset is sorted, not
just within each individual reducer.
• Working Mechanism
• Use a partitioner that respects the total order of the
output.
• A custom partitioner is used to distribute the key-
value pairs in such a way that each reducer receives
keys that are part of a globally sorted range.
• Total sort is useful when the entire dataset needs to be
fully sorted,
• such as when preparing sorted data for output or for
downstream processing that requires sorted input.
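• One common way to obtain such a partitioner, sketched here under the old mapred API, is to sample the input for split points and use TotalOrderPartitioner (the key/value types, sampling parameters, and partition-file path are assumptions):

    conf.setPartitionerClass(TotalOrderPartitioner.class);
    // Sample roughly 10% of records, up to 10,000 samples from at most 10 splits (arbitrary values).
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
    TotalOrderPartitioner.setPartitionFile(conf, new Path("/tmp/_partitions"));  // hypothetical path
    InputSampler.writePartitionFile(conf, sampler);  // writes the split points the partitioner will respect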
Secondary Sort
• The MapReduce framework sorts the records by key
before they reach the reducers.
• For any particular key, however, the values are not
sorted.
• Secondary sort ensures that the values associated
with a key are also sorted.
• Usually, most MapReduce programs are written so as
not to depend on the order in which the values appear at
the reduce function.
• However, it is possible to impose an order on the
values by sorting and grouping the keys in a
particular way.
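• A sketch of the driver settings that impose such an order (old mapred API; FirstPartitioner, KeyComparator, and GroupComparator are hypothetical user-supplied classes built around a composite key of natural key plus value):

    conf.setPartitionerClass(FirstPartitioner.class);              // partition on the natural key only
    conf.setOutputKeyComparatorClass(KeyComparator.class);         // sort by natural key, then by value
    conf.setOutputValueGroupingComparator(GroupComparator.class);  // group reducer input by natural key only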
Joins
• Implementation of join depends on how large the
datasets are and how they are partitioned.
• MapReduce can perform joins between large datasets,
but writing the code to do joins from scratch is fairly
involved.
• However, it is often preferable to use a higher-level
framework such as Pig, Hive, or Cascading,
• in which join operations are a core
part of the implementation.
• MapReduce provides Map-Side Joins and Reduce-Side
Joins.
Map-Side Joins
• A map-side join works by performing the join before
the data reaches the map function.
• For this to work, the inputs to each map must be
partitioned and sorted in a particular way.
• Each input dataset must be divided into the
same number of partitions, and it must be
sorted by the same key (the join key) in each
source.
• All the records for a particular key must reside in
the same partition.
• The map-side join has this strict requirement on its inputs,
but the requirement actually fits the description of the
output of a MapReduce job.
Map-Side Joins..
• A map-side join can be used to join the outputs of
several jobs that had the
• same number of reducers,
• the same keys, and
• output files that are not splittable.
• The advantage of map-side joins is that they are faster and
suitable for cases where at least one dataset is small
enough to fit in memory.
• The disadvantage of map-side joins is that they are less
flexible, since the input datasets have to be pre-sorted and
partitioned, and they are unsuitable for large datasets.

Reduce-Side Joins
• Reduce-side join is the most common and flexible type
of join in MapReduce,
• where the joining of data happens during the reduce
phase
• Working mechanism
• Both input datasets are first distributed to the mappers.
• Mapper processes both datasets, emitting key-value
pairs where the key is the join key, and the value is the
associated data.
• The framework then shuffles and sorts the data, sending
all records with the same key to the same reducer.
• The reducer receives all records for a given key, combines the
records from the two datasets, and performs the join operation.
Reduce-Side Joins..
• The advantage of reduce-side joins is that they work well for
large datasets that cannot fit into memory.
• Limitations are:
• slower speed, since all the data must go through both
the map and reduce phases.
• heavy shuffling between the map and reduce
phases, which increases network and input-output
overhead.

Side Data Distribution
• Side data can be defined as extra read-only data
needed by a job to process the main dataset.
• The challenges associated with side data distribution
are:
• Ensuring that side data is available to all the map or reduce
tasks, which are spread across the cluster, in a
convenient and efficient fashion.
• Sometimes the side data might be too large to pass
as arguments to the tasks but too small to warrant
the overhead of distributing it via HDFS.

Side Data Distribution..
• There are two mechanisms for side data distribution:
1) Distributed Cache
2) Passing files through Job Configuration
• Distributed cache copies the necessary side data to the
local disk of each node before the job execution.
• It is commonly used for small or medium-sized side
data that needs to be read by every task.
• Passing files through the job configuration is useful for very
small pieces of side data, so that the side data can be
directly embedded into the job configuration.
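• For illustration, a sketch of the distributed cache mechanism (old mapred API; the file path is hypothetical):

    // Driver: register a small lookup file so it is copied to every task node.
    // (URI construction may throw URISyntaxException; handling omitted.)
    DistributedCache.addCacheFile(new URI("/meta/lookup-table.txt"), conf);

    // Task side, e.g. in configure(): open the local copy of the cached file.
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    // ... read localFiles[0] into an in-memory lookup structure (omitted) ...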
