MapReduce Types and Formats
29 July 2010
Taikyoung Kim
Outline
MapReduce Types
Input Formats
Output Formats
MapReduce Types (1/7)
General form of the map and reduce functions (written differently from the Java interface):
– map: (K1, V1) → list(K2, V2)
– reduce: (K2, list(V2)) → list(K3, V3)
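For reference, the old (org.apache.hadoop.mapred) Java interfaces carry the same type parameters; the signatures below are paraphrased from that API:

  public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
    // Emits intermediate (K2, V2) pairs for one input (K1, V1) record
    void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
        throws IOException;
  }

  public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
    // Receives all intermediate values for one key and emits final (K3, V3) pairs
    void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
        throws IOException;
  }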
MapReduce Types (2/7)
Combine function
– Has the same form as the reduce function, except that its output types are the intermediate types, so its output can feed the reduce function: combine: (K2, list(V2)) → list(K2, V2)
MapReduce Types (3/7)
Input types are set by the input format
– Ex) setInputFormat(TextInputFormat.class)
– Generates key type LongWritable and value type Text
Intermediate types are also set as the final output types by default
– If K2 and K3 are the same, you only need to call setOutputKeyClass(); the intermediate key type falls back to it (see the sketch below)
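A minimal sketch of how these calls fit together in an old-API driver (the class name and the chosen types are illustrative, not from the slides):

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;

  public class TypeConfigSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf(TypeConfigSketch.class);
      conf.setInputFormat(TextInputFormat.class);       // K1 = LongWritable, V1 = Text
      conf.setMapOutputKeyClass(Text.class);            // K2 (could be omitted here, since K2 equals K3)
      conf.setMapOutputValueClass(IntWritable.class);   // V2 (needed, because V2 differs from V3)
      conf.setOutputKeyClass(Text.class);               // K3
      conf.setOutputValueClass(Text.class);             // V3
      // A real driver would also set the mapper and reducer classes here.
    }
  }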
MapReduce Types (4/7)
Why can't the types be determined from a combination of the mapper and the reducer?
– Type erasure: the Java compiler deletes generic type information at compile time, so the configuration isn't checked at compile time
– It is therefore possible to configure a MapReduce job with incompatible types; type conflicts are detected only at runtime
– Hence the types must be given to Hadoop explicitly
MapReduce Types (5/7)
The Default MapReduce Job
Default input format
– TextInputFormat
– Key: LongWritable, Value: Text
setNumMapTasks(1)
– Does not actually set the number of map tasks to one; the 1 is only a hint
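A minimal sketch of a driver that relies entirely on these defaults (TextInputFormat, the identity mapper and reducer, TextOutputFormat); the input and output paths come from the command line:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class MinimalMapReduce {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(MinimalMapReduce.class);
      // Only the input and output paths are set explicitly; everything else is left at its default.
      FileInputFormat.addInputPath(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }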
MapReduce Types (6/7)
Choosing the Number of Reducers
The optimal number of reducers is related to the total number of available reducer slots
– Total number of available reducer slots = total nodes * mapred.tasktracker.reduce.tasks.maximum
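As a sketch (the node and slot counts are made-up numbers), a common rule of thumb is to target slightly fewer reducers than the total number of slots, so all reducers can run in a single wave:

  import org.apache.hadoop.mapred.JobConf;

  public class ReducerCountSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf(ReducerCountSketch.class);
      int nodes = 10;               // illustrative cluster size
      int reduceSlotsPerNode = 2;   // mapred.tasktracker.reduce.tasks.maximum
      conf.setNumReduceTasks(nodes * reduceSlotsPerNode - 1);  // leave a little headroom
    }
  }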
MapReduce Types (7/7)
Keys and values in Streaming
– A streaming program reads and writes lines of text; within a line, the key is separated from the value by a tab character by default
– The separators are configurable (e.g. via stream.map.output.field.separator)
Input Formats
Input Splits and Records
Input Formats
Input Splits and Records - InputFormat
Creates the input splits and divides them into records
The numSplits argument of the getSplits() method is a hint
– An InputFormat is free to return a different number of splits
The client sends the calculated splits to the jobtracker
– The jobtracker schedules map tasks to process the splits on the tasktrackers
RecordReader
– Iterates over records
– Used by the map task to generate record key-value pairs
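The old-API interface, paraphrased below, shows both responsibilities:

  public interface InputFormat<K, V> {
    // Splits the job's input into roughly numSplits pieces (the count is only a hint)
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    // Creates a reader the map task uses to iterate over one split's records
    RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
        throws IOException;
  }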
Input Formats
Input Splits and Records - FileInputFormat
The base class for all InputFormats that use files as their data source
Input Formats
Input Splits and Records – FileInputFormat input splits
FileInputFormat splits only large files, meaning files larger than an HDFS block
– Normally the split size is the size of an HDFS block
Input Formats
Input Splits and Records – FileInputFormat input splits
The split size calculation (computeSplitSize() method):
– max(minimumSize, min(maximumSize, blockSize))
– By default, minimumSize < blockSize < maximumSize, so the split size is the HDFS block size
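Paraphrased as code (the parameter names are illustrative; the defaults noted in the comment are the usual ones):

  static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
    // minimumSize defaults to 1 byte, maximumSize to Long.MAX_VALUE, and blockSize to the
    // HDFS block size (64 MB), so by default this simply returns blockSize.
    return Math.max(minimumSize, Math.min(maximumSize, blockSize));
  }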
Input Formats
Input Splits and Records – Small files and CombineFileInputFormat
– CombineFileInputFormat packs many small files into each split, so each mapper has more to process than it would with one split per small file
Input Formats
Input Splits and Records – Preventing splitting
– Either set the minimum split size to be larger than the largest file, or subclass the concrete FileInputFormat and override isSplitable() to return false (a sketch follows)
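A minimal sketch of the second approach using the old API (the subclass name is illustrative):

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;  // never split, so each file is processed in its entirety by one mapper
    }
  }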
Input Formats
Input Splits and Records – Processing a whole file as a record
WholeFileRecordReader
– Takes a FileSplit and converts it into a single record
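A sketch of such a record reader in the old API; the constructor signature and field layout are an assumption, and the class is illustrative rather than production code:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.IOUtils;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.RecordReader;

  class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit fileSplit;
    private final Configuration conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit fileSplit, Configuration conf) {
      this.fileSplit = fileSplit;
      this.conf = conf;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? fileSplit.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() throws IOException { }

    // Emit the whole file as a single record on the first call, then signal end of input.
    public boolean next(NullWritable key, BytesWritable value) throws IOException {
      if (processed) {
        return false;
      }
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }
  }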
Input Formats
Text Input - TextInputFormat
TextInputFormat is the default InputFormat
– Key: the byte offset within the file of the beginning of the line (LongWritable), not the line number
– Value: the contents of the line, excluding any line terminators (Text)
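To make the offset-as-key idea concrete, a hypothetical three-line input file such as

  line one
  second line
  third

would be presented to the mapper as the records (0, "line one"), (9, "second line"), and (21, "third"): each key counts the bytes before the line, including the newline that terminates every preceding line.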
Input Formats
Text Input - NLineInputFormat
With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input
– Use NLineInputFormat when each mapper should instead receive a fixed number of lines (N, set via mapred.line.input.format.linespermap); a sketch follows
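A minimal sketch of switching a job to fixed-size line batches (the value of N is illustrative):

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.NLineInputFormat;

  public class NLineSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf(NLineSketch.class);
      conf.setInputFormat(NLineInputFormat.class);
      conf.setInt("mapred.line.input.format.linespermap", 1000);  // N = 1000 lines per mapper
    }
  }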
Input Formats
Text Input - XML
Use the StreamXmlRecordReader class for XML
– In the org.apache.hadoop.streaming package
– Set stream.recordreader.class to org.apache.hadoop.streaming.StreamXmlRecordReader
Input Formats
Binary Input
SequenceFileInputFormat
– Hadoop's sequence file format stores sequences of binary key-value pairs
– The data is splittable (sequence files have sync points)
– Use SequenceFileInputFormat to read sequence files as MapReduce input
SequenceFileAsTextInputFormat
– Converts the sequence file's keys and values to Text objects by calling toString()
SequenceFileAsBinaryInputFormat
– Retrieves the sequence file's keys and values as opaque binary objects
Input Formats
Multiple Inputs
Use MultipleInputs when
– You have data sources that provide the same type of data but in different formats, so they need to be parsed differently
– Ex) one might be tab-separated plain text, the other a binary sequence file (a sketch follows)
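A sketch of the old-API usage; the paths come from the command line, and TabSeparatedMapper and SequenceFileMapper are hypothetical mapper classes assumed to be defined elsewhere:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileInputFormat;
  import org.apache.hadoop.mapred.TextInputFormat;
  import org.apache.hadoop.mapred.lib.MultipleInputs;

  public class MultipleInputsSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf(MultipleInputsSketch.class);
      // Each source gets its own input format and its own mapper;
      // both mappers must emit the same intermediate key and value types.
      MultipleInputs.addInputPath(conf, new Path(args[0]),
          TextInputFormat.class, TabSeparatedMapper.class);        // hypothetical mapper
      MultipleInputs.addInputPath(conf, new Path(args[1]),
          SequenceFileInputFormat.class, SequenceFileMapper.class); // hypothetical mapper
    }
  }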
Outline
MapReduce Types
Input Formats
Output Formats
– Text Output
– Binary Output
– Multiple Outputs
– Lazy Output
– Database Output
Output Formats
The OutputFormat class hierarchy
Output Formats
Text Output
TextOutputFormat (default)
– Writes records as lines of text
– Keys and values may be of any type, since TextOutputFormat calls toString() on them
– Each key-value pair is separated by a tab character
– Set the separator with the mapred.textoutputformat.separator property
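A minimal sketch of changing the separator so the job produces comma-separated rather than tab-separated output:

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextOutputFormat;

  public class CsvOutputSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf(CsvOutputSketch.class);
      conf.setOutputFormat(TextOutputFormat.class);         // the default, shown for clarity
      conf.set("mapred.textoutputformat.separator", ",");   // write key,value instead of key<TAB>value
    }
  }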
Output Formats
Binary Output
SequenceFileOutputFormat
– Writes sequence files as its output
– Compact and readily compressed (useful when the output feeds a further MapReduce job)
SequenceFileAsBinaryOutputFormat
– Writes keys and values in raw binary format into a SequenceFile container
MapFileOutputFormat
– Writes MapFiles as output
– The keys in a MapFile must be added in order, so the reducers must emit keys in sorted order (a requirement only for this format)
Output Formats
Multiple Output
MultipleOutputFormat and MultipleOutputs
– Help to produce multiple files per reducer
MultipleOutputFormat
– The names of the multiple files are derived from the output keys and values
– Is an abstract class with two concrete subclasses (a subclassing sketch follows this list):
MultipleTextOutputFormat
MultipleSequenceFileOutputFormat
MultipleOutputs
– Can emit different types for each output (unlike MultipleOutputFormat)
– Offers less control over the naming of outputs
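A sketch of the MultipleOutputFormat side: the subclass name and the choice of using the key itself as the file name are illustrative, and the class would be registered with setOutputFormat() in the driver.

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

  public class KeyNamedTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
      // "name" is the default leaf file name (e.g. part-00000); here each distinct
      // key gets its own output file inside the job's output directory instead.
      return key.toString();
    }
  }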
Output Formats
Multiple Output
Difference between MultipleOutputFormat and MultipleOutputs
Output Formats
Lazy Output
LazyOutputFormat helps applications that don't want to create empty files
– FileOutputFormat subclasses create output files even if they are empty
To use it
– Call its setOutputFormatClass() method with the JobConf and the underlying output format (a sketch follows)
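A minimal sketch of that call in an old-API driver; TextOutputFormat stands in for whatever output format the job really uses:

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextOutputFormat;
  import org.apache.hadoop.mapred.lib.LazyOutputFormat;

  public class LazyOutputSketch {
    public static void main(String[] args) {
      JobConf conf = new JobConf(LazyOutputSketch.class);
      // Wrap the real output format: output files are created only when
      // the first record is actually written to them.
      LazyOutputFormat.setOutputFormatClass(conf, TextOutputFormat.class);
    }
  }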