Map Reduce Types and Formats

The document provides an overview of MapReduce types and formats, detailing the general form of MapReduce functions, input and output formats, and their configurations. It explains the significance of input splits, the default input format, and how to handle different data types, including text and binary inputs. Additionally, it covers output formats, including text, binary, and multiple outputs, emphasizing the importance of selecting appropriate formats for efficient data processing.


O’Reilly – Hadoop: The Definitive Guide

Ch.7 MapReduce Types and Formats

29 July 2010
Taikyoung Kim
Outline
 MapReduce Types
 Input Formats
 Output Formats

MapReduce Types(1/7)
 General form

– map: (K1, V1) → list(K2, V2)
– reduce: (K2, list(V2)) → list(K3, V3)

 Java interface

– OutputCollector emits key-value pairs

– Reporter updates counters and status
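
The general form can be illustrated with a small word count written in plain Java. This is only a sketch of the type flow: the class and method names (GeneralFormSketch, map, reduce, run) are hypothetical stand-ins, not Hadoop classes, and the offset arithmetic assumes one-byte characters and a one-byte newline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the general form; none of these names are Hadoop classes.
final class GeneralFormSketch {

    // map: (K1, V1) -> list(K2, V2). Here K1 = byte offset, V1 = line, K2 = word, V2 = 1.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce: (K2, list(V2)) -> list(K3, V3). Here K3 = word, V3 = total count.
    static List<Map.Entry<String, Integer>> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return List.of(Map.entry(word, sum));
    }

    // Groups the map output by key (the shuffle), then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // assumption: one-byte characters plus a newline
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            for (Map.Entry<String, Integer> kv : reduce(e.getKey(), e.getValue())) {
                result.put(kv.getKey(), kv.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```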

MapReduce Types(2/7)
 Combine function

– The same form as the reduce function, except for its output types
– Its output types are the intermediate key and value types (K2, V2), the same as the map output
– Often the combine and reduce functions are the same
 Partition function

– Takes the intermediate key and value types (K2, V2) as input

– Returns the partition index
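
A partition function can be sketched in a few lines of plain Java. The version below uses the common hash-and-modulo idea: mask off the sign bit of the key's hash code, then take the remainder by the number of partitions. PartitionSketch and its method are hypothetical names for illustration only.

```java
// Hypothetical helper: maps an intermediate key to a partition index in [0, numPartitions).
final class PartitionSketch {

    static int partition(Object key, int numPartitions) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partition("a", 4)); // "a".hashCode() is 97, so this prints 1
    }
}
```

Records with the same key always receive the same index, which is what lets a single reducer see every value for that key.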

MapReduce Types(3/7)
 Input types are set by the input format

Ex) setInputFormat(TextInputFormat.class)
 Generates key type LongWritable and value type Text

 Other types are set explicitly by calling methods on the JobConf
 Ex) JobConf conf; conf.setMapOutputKeyClass(Text.class)

 The intermediate types default to the final output types
 Just need to call setOutputKeyClass() if K2 and K3 are the same

MapReduce Types(4/7)
 Why can’t the types be determined from a combination of the mapper and the reducer?
– Type erasure: the Java compiler removes generic type information at compile time, so it is not available at runtime
 The configuration isn’t checked at compile time
 Possible to configure a MapReduce job with incompatible types
 Type conflicts are detected at runtime
– So the types must be given to Hadoop explicitly
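
Type erasure can be observed directly in plain Java, independent of Hadoop. The sketch below (ErasureSketch is a hypothetical name) shows that two differently parameterized lists share one runtime class, and that a type mismatch introduced through a raw type compiles but fails only when the value is read.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it uses only the JDK.
final class ErasureSketch {

    // After erasure, List<String> and List<Integer> are the same class at runtime.
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();
        return strings.getClass() == numbers.getClass();
    }

    // A mismatch slipped in through a raw type is caught at runtime, not at compile time.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static boolean mismatchFailsOnlyAtRuntime() {
        List raw = new ArrayList<String>();
        raw.add(42);                  // compiles: the element type is not checked here
        List<String> typed = raw;
        try {
            String s = typed.get(0);  // the compiler-inserted cast throws here
            return s == null;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());           // prints true
        System.out.println(mismatchFailsOnlyAtRuntime()); // prints true
    }
}
```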

MapReduce Types(5/7)
The Default MapReduce Job
 Default Input Format
– TextInputFormat
 LongWritable (Key)
 Text (Value)

 setNumMapTasks(1)
– Does not set the number of map tasks to one
 1 is only a hint

MapReduce Types(6/7)
Choosing the Number of Reducers
 The optimal number of reducers is related to the total number of available reducer slots
– Total number of available reducer slots = number of nodes × mapred.tasktracker.reduce.tasks.maximum

 Use slightly fewer reducers than total slots

– The job can then tolerate a few failures without extending job execution time
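
The slot arithmetic can be worked through as follows. The slide only says "slightly fewer", so the 95 percent figure used here is an assumption chosen for illustration, and ReducerCountSketch with its methods is a hypothetical helper rather than a Hadoop API.

```java
// Hypothetical helper for the reducer-slot arithmetic; integer math avoids rounding surprises.
final class ReducerCountSketch {

    // Total slots = number of nodes x reduce slots per node
    // (the per-node value corresponds to mapred.tasktracker.reduce.tasks.maximum).
    static int totalReduceSlots(int nodes, int reduceSlotsPerNode) {
        return nodes * reduceSlotsPerNode;
    }

    // "Slightly fewer" is modelled as a percentage of the total; the percentage is an assumption.
    static int suggestedReducers(int nodes, int reduceSlotsPerNode, int percentOfSlots) {
        int slots = totalReduceSlots(nodes, reduceSlotsPerNode);
        return Math.max(1, slots * percentOfSlots / 100);
    }

    public static void main(String[] args) {
        System.out.println(totalReduceSlots(10, 2));      // prints 20
        System.out.println(suggestedReducers(10, 2, 95)); // prints 19
    }
}
```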

MapReduce Types(7/7)
Keys and values in Streaming

 A streaming application can control the separator

– Default separator: tab character
– Separators may be configured independently for maps and reduces
– The number of separator-delimited fields to treat as the map output key is configurable
 Set stream.num.map.output.key.fields to n to use the first n fields as the key
 Ex) The output is a,b,c (the separator is a comma) and n=2
– Key: a,b Value: c
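
The key and value split in the example can be reproduced with a short plain-Java sketch. StreamingKeySketch is a hypothetical name, and treating the whole line as the key when it has fewer separators than requested is an assumption made for this sketch.

```java
// Hypothetical helper: splits one output line into {key, value} after the first n fields.
final class StreamingKeySketch {

    static String[] splitKeyValue(String line, String separator, int numKeyFields) {
        if (numKeyFields < 1) {
            throw new IllegalArgumentException("numKeyFields must be at least 1");
        }
        int pos = -1;
        for (int i = 0; i < numKeyFields; i++) {
            pos = line.indexOf(separator, pos + 1);
            if (pos < 0) {
                // Assumption: too few separators means the whole line is the key.
                return new String[] {line, ""};
            }
        }
        return new String[] {line.substring(0, pos), line.substring(pos + separator.length())};
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("a,b,c", ",", 2);
        System.out.println(kv[0] + " | " + kv[1]); // prints a,b | c
    }
}
```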
Outline
 MapReduce Types
 Input Formats
– Input Splits and Records
– Text Input
– Binary Input
– Multiple Inputs
– Database Input(and Output)
 Output Formats

Input Formats
Input Splits and Records

 InputSplit (org.apache.hadoop.mapred package)

– A chunk of the input processed by a single map
– Each split is divided into records
– A split is just a reference to the data (it doesn’t contain the input data)

 Ordering the splits

– Splits are ordered so that the largest is processed first (to minimize the job runtime)
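
Largest-first ordering amounts to a descending sort on split length. A minimal sketch, assuming splits are represented only by their lengths in bytes (SplitOrderSketch is a hypothetical name):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: orders split lengths so the largest comes first.
final class SplitOrderSketch {

    static List<Long> largestFirst(List<Long> splitLengths) {
        List<Long> sorted = new ArrayList<>(splitLengths); // leave the caller's list untouched
        sorted.sort(Comparator.reverseOrder());
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(largestFirst(List.of(64L, 128L, 16L))); // prints [128, 64, 16]
    }
}
```

Starting the longest-running work first makes it less likely that one large split is still being processed after all the others have finished.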

Input Formats
Input Splits and Records - InputFormat
 Creates the input splits and divides them into records
 The numSplits argument of the getSplits() method is a hint
– An InputFormat is free to return a different number of splits
 The client sends the calculated splits to the jobtracker
– The jobtracker schedules map tasks to process them on the tasktrackers
 RecordReader
– Iterates over the records in a split
– Used by the map task to generate record key-value pairs

Input Formats
Input Splits and Records - FileInputFormat
 The base class for all InputFormat implementations that use files as their data source

Input Formats
Input Splits and Records - FileInputFormat

 FileInputFormat offers 4 static methods for setting a JobConf’s input paths
– addInputPath() and addInputPaths()
 Add a path or paths to the list of inputs
 Can be called repeatedly
– setInputPaths() (two overloads)
 Set the entire list of paths in one go (replacing any paths set in previous calls)
– A path may represent
 A file
 A directory (all files in the directory are considered as input)
– An error occurs if the directory contains a subdirectory (solved by a glob or a filter)
 A collection of files and directories, by using a glob
Input Formats
Input Splits and Records - FileInputFormat
 Filters
– FileInputFormat applies a default filter
 Excludes hidden files (names beginning with a dot or an underscore)
– Use the setInputPathFilter() method to set a custom filter
 It acts in addition to the default filter
 Refer to page 61 of the book
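
How a custom filter combines with the default one can be sketched with plain predicates over file names. PathFilterSketch is a hypothetical name and the ".tmp" rule is an invented example; the real mechanism works on Path objects, not strings.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical model of input path filtering using file names only.
final class PathFilterSketch {

    // Default rule: skip hidden files, i.e. names starting with a dot or an underscore.
    static final Predicate<String> NOT_HIDDEN =
            name -> !name.startsWith(".") && !name.startsWith("_");

    // A custom filter is applied in addition to the default, never instead of it.
    static List<String> accept(List<String> names, Predicate<String> custom) {
        return names.stream()
                .filter(NOT_HIDDEN.and(custom))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = List.of("part-00000", "_logs", ".crc", "part-00001.tmp");
        System.out.println(accept(names, n -> !n.endsWith(".tmp"))); // prints [part-00000]
    }
}
```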

Input Formats
Input Splits and Records – FileInputFormat input splits
 FileInputFormat splits only files that are larger than an HDFS block
– Normally the split size is the size of an HDFS block

 Possible to control the split size

– The maximum split size has an effect only when it is less than the block size

Input Formats
Input Splits and Records – FileInputFormat input splits
 The split size calculation (computeSplitSize() method)

 By default

– Split size is blockSize

 Control the split size
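
The split size calculation described in the book is max(minimumSize, min(maximumSize, blockSize)). The sketch below restates that formula in plain Java so the effect of each setting can be checked; SplitSizeSketch is a hypothetical class and the 64 MB block size is an example value.

```java
// Hypothetical restatement of the split size formula; not the Hadoop method itself.
final class SplitSizeSketch {

    static final long MB = 1024L * 1024L;

    static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 64 * MB;
        // Defaults (minimum of 1 byte, effectively unlimited maximum): split size = block size.
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, block) / MB);        // prints 64
        // A minimum above the block size forces larger splits.
        System.out.println(computeSplitSize(128 * MB, Long.MAX_VALUE, block) / MB); // prints 128
        // A maximum below the block size forces smaller splits.
        System.out.println(computeSplitSize(1, 32 * MB, block) / MB);               // prints 32
    }
}
```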

Input Formats
Input Splits and Records – Small files and CombineFileInputFormat

 Hadoop works better with a small number of large files than a large number of small files
– FileInputFormat generates splits in such a way that each split is all or part of a single file
– With many small files, each map processes very little input and the bookkeeping overhead grows

 Use CombineFileInputFormat to pack many files into each split

– Designed to work well with small files
– Takes node and rack locality into account when packing blocks into a split
– Worth using when a large number of small files already exist in HDFS

 Avoiding many small files is a good idea

– Reduces the number of seeks
– Merge small files into larger files by using a SequenceFile

Input Formats
Input Splits and Records – Preventing splitting

 Some applications don’t want files to be split

– Each file should be processed in its entirety by a single mapper

 Two ways of ensuring an existing file is not split

– Set the minimum split size larger than the largest file size
– Override the isSplitable() method to return false
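
Why the first approach works can be seen with a little arithmetic: once the minimum split size exceeds the file length, the resulting split size does too, so the file fits in one split. The sketch below is a simplified model with hypothetical names; the split count uses plain ceiling division and ignores any slack the framework may allow on the last split.

```java
// Hypothetical model of the "large minimum split size" approach to preventing splitting.
final class NoSplitSketch {

    static final long MB = 1024L * 1024L;

    // Split size as max(minimumSize, min(maximumSize, blockSize)).
    static long splitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    // Simplified count: ceiling division, written to avoid overflow for very large split sizes.
    static long splitCount(long fileLength, long splitSize) {
        return fileLength == 0 ? 0 : (fileLength - 1) / splitSize + 1;
    }

    public static void main(String[] args) {
        long file = 1024 * MB;
        long block = 64 * MB;
        System.out.println(splitCount(file, splitSize(1, Long.MAX_VALUE, block)));              // prints 16
        System.out.println(splitCount(file, splitSize(Long.MAX_VALUE, Long.MAX_VALUE, block))); // prints 1
    }
}
```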

Input Formats
Input Splits and Records – Processing a whole file as a record

 WholeFileRecordReader
– Takes a FileSplit and converts it into a single record

Input Formats
Text Input - TextInputFormat
 TextInputFormat is the default InputFormat
– Key: The byte offset of the beginning of the line within the file (LongWritable), not the line number
– Value: The contents of the line excluding any line terminators (Text)

 Each split knows the size of the preceding splits

– Global file offset = the offset within the split + the size of the preceding splits

 Logical records (lines) do not usually fit neatly into HDFS blocks
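
The byte-offset keys can be illustrated with a plain-Java sketch that turns the text of one split into (offset, line) records. OffsetKeySketch is a hypothetical name; the sketch assumes "\n" line terminators, UTF-8 text, and a split that begins at the start of a line.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: produces (global byte offset -> line) records for one split.
final class OffsetKeySketch {

    // splitStart is the total size in bytes of the preceding splits.
    static Map<Long, String> records(String splitText, long splitStart) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = splitStart;
        int start = 0;
        while (start < splitText.length()) {
            int end = splitText.indexOf('\n', start);
            if (end < 0) {
                end = splitText.length(); // last line without a terminator
            }
            String line = splitText.substring(start, end);
            out.put(offset, line); // the value excludes the line terminator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
            start = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("ab\ncde\n", 100L)); // prints {100=ab, 103=cde}
    }
}
```

The keys 100 and 103 are positions in the file, not line numbers, which is why a mapper cannot tell from the key alone which line it is reading.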

Input Formats
Text Input - NLineInputFormat
 With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input

 To receive a fixed number of lines of input, use

– NLineInputFormat as the InputFormat
 N: The number of lines of input each mapper receives
 Control N with the mapred.line.input.format.linespermap property
– Inefficient if a map task takes a small number of lines of input
 Due to task setup overhead
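
The effect of a fixed N can be sketched by grouping lines into chunks of N, one chunk per mapper. NLineSketch is a hypothetical helper that works on an in-memory list rather than on files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups input lines into chunks of n lines, one chunk per mapper.
final class NLineSketch {

    static List<List<String>> chunk(List<String> lines, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            // The final chunk may hold fewer than n lines.
            splits.add(new ArrayList<>(lines.subList(i, Math.min(i + n, lines.size()))));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(chunk(List.of("1", "2", "3", "4", "5"), 2)); // prints [[1, 2], [3, 4], [5]]
    }
}
```

With N = 1 every line gets its own mapper, which is where the task setup overhead mentioned above starts to dominate.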

Input Formats
Text Input - XML
 Use StreamXmlRecordReader class for XML
– In the org.apache.hadoop.streaming package
– Set stream.recordreader.class to org.apache.hadoop.streaming.StreamXmlRecordReader

Input Formats
Binary Input
 SequenceFileInputFormat
– Hadoop’s sequence file format stores sequences of binary key-value pairs
 Data is splittable (Data has sync points)
 Use SequenceFileInputFormat

 SequenceFileAsTextInputFormat
– Convert the sequence file’s keys and values to Text objects
 Use toString() method

 SequenceFileAsBinaryInputFormat
– Retrieve the sequence file’s keys and values as opaque binary objects

Input Formats
Multiple Inputs
 Use MultipleInputs when
– Data sources provide the same type of data but in different formats
 They need to be parsed differently
 Ex) One might be tab-separated plain text, the other a binary sequence file

– Use a different mapper for each input

– The map outputs have the same types
 Reducers are not aware of the different mappers

Outline
 MapReduce Types
 Input Formats
 Output Formats
– Text Output
– Binary Output
– Multiple Outputs
– Lazy Output
– Database Output

Output Formats
 The OutputFormat class hierarchy

Output Formats
Text Output
 TextOutputFormat (default)
– Write records as lines of text
– Keys and values may be of any type
 TextOutputFormat calls toString() on them
– Each key and value is separated by a tab character
 Change the separator with the mapred.textoutputformat.separator property
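
The line layout can be sketched as string formatting: convert the key and the value to strings and join them with the separator. TextOutputSketch is a hypothetical helper, and it is simplified in that it does not model any special handling of null keys or values.

```java
// Hypothetical helper: formats one record the way a text output line is laid out.
final class TextOutputSketch {

    static String formatRecord(Object key, Object value, String separator) {
        // String.valueOf calls toString() on non-null objects.
        return String.valueOf(key) + separator + String.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(formatRecord("word", 3, "\t")); // default-style tab separator
        System.out.println(formatRecord(1950, 22.5, ",")); // prints 1950,22.5
    }
}
```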

Output Formats
Binary Output
 SequenceFileOutputFormat
– Write sequence files for its output
– Compact, readily compressed (Useful for a further MapReduce job)

 SequenceFileAsBinaryOutputFormat
– Write keys and values in raw binary format into a SequenceFile container

 MapFileOutputFormat
– Write MapFiles as output
– The keys in MapFile must be added in order
 Ensure that the reducers emit keys in sorted order (only for this format)

Output Formats
Multiple Output
 MultipleOutputFormat and MultipleOutputs
– Help to produce multiple files per reducer

 MultipleOutputFormat
– The names of the output files are derived from the output keys and values
– Is an abstract class with two concrete subclasses
 MultipleTextOutputFormat
 MultipleSequenceFileOutputFormat

 MultipleOutputs
– Can emit different types for each output (unlike MultipleOutputFormat)
– Less control over the naming of outputs

Output Formats
Multiple Output
 Difference between MultipleOutputFormat and MultipleOutputs

– MultipleOutputs is more fully featured

– MultipleOutputFormat has more control over the output directory structure and file naming

Output Formats
Lazy Output
 LazyOutputFormat helps applications that don’t want to create empty output files
– FileOutputFormat subclasses create output files even if they are empty

 To use it
– Call its setOutputFormatClass() method with the JobConf and the underlying output format

