Unit 2
1. History of Hadoop
Hadoop’s journey began in the early 2000s, when Doug Cutting and Mike Cafarella were
inspired by Google’s innovations in distributed computing. Google had published papers on
two core technologies: GFS (the Google File System), which described distributed storage
across clusters of commodity machines, and MapReduce, which described a model for
processing vast amounts of data in parallel across those machines. Doug Cutting, who soon
joined Yahoo to work on the project, wanted an open-source framework that could handle
petabytes of data with fault tolerance, and thus Hadoop was born in 2005.
The name “Hadoop” was derived from the toy elephant of Doug Cutting’s son, symbolizing
its strength and reliability. Hadoop soon became an Apache project and gained immense
popularity due to its ability to handle large-scale data processing.
• Hadoop 1.x: Initial versions relied on a single master node, the NameNode, to manage
all HDFS metadata, which made the cluster prone to bottlenecks and left it with a single
point of failure.
Today, Hadoop has evolved into an ecosystem of tools that support advanced big data
analytics.
1. HDFS (Hadoop Distributed File System): A distributed file system that stores data in
a fault-tolerant manner.
HDFS Architecture:
• NameNode: The master node that stores the file system namespace and block metadata
and coordinates all access to data.
• DataNode: Worker (slave) nodes that store the actual data blocks and report periodically
to the NameNode.
• Block Size: Large block size (default 128 MB) to minimize metadata overhead.
• Fault Tolerance: If a DataNode goes down, data is retrieved from other replicas.
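To make the NameNode/DataNode division concrete, here is a minimal sketch of writing and
then reading a file through the HDFS Java client API. The path /user/demo/hello.txt is
illustrative, and the cluster address is assumed to come from the usual core-site.xml
configuration on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS (from core-site.xml) decides which NameNode the client contacts.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");   // illustrative path

    // Write: the client streams bytes to DataNodes; the NameNode only records metadata.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read: blocks are fetched from whichever DataNodes hold live replicas.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}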
Hadoop supports various data formats that help optimize storage and processing:
• Text Files: Plain text data, commonly in CSV, TSV, or JSON formats.
• Sequence Files: Binary files containing key-value pairs for efficient processing.
• Avro Files: Highly compact and efficient format for data serialization.
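As a small illustration of the binary key-value format, the following sketch writes a
sequence file of (word, count) pairs. The file path and the sample records are made up for
the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/user/demo/counts.seq");   // illustrative path

    // A sequence file stores binary (key, value) records; here Text keys and IntWritable values.
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(IntWritable.class))) {
      writer.append(new Text("apple"), new IntWritable(3));
      writer.append(new Text("banana"), new IntWritable(7));
    }
  }
}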
1. Data Ingestion: Loading raw data into HDFS using tools like Apache Sqoop or Flume.
2. Data Processing: MapReduce jobs (or other engines in the ecosystem) transform and
analyze the ingested data.
3. Storing Results: Processed data is stored back in HDFS for further analysis.
4. Querying with Hive or Pig: Hive provides SQL-like querying on large datasets, while
Pig enables scripting for complex transformations.
• Horizontal Scaling: New nodes can be added dynamically, and the HDFS balancer can
then redistribute data across the cluster so the new nodes are put to use.
• Fault Tolerance: Data is replicated across multiple nodes, ensuring that failure of one
node does not result in data loss.
• Hadoop Streaming: Allows users to create and run MapReduce jobs with any
executable or script as the mapper and reducer. Any language that can read from
standard input and write to standard output will work, including Python, Perl, and Ruby.
The Hadoop ecosystem consists of several tools that extend Hadoop’s functionality. Here are
the key components:
• Pig: A high-level platform for processing large datasets using a scripting language
called Pig Latin.
• HBase: A NoSQL database that provides real-time read/write access to large
datasets.
• Sqoop: Facilitates importing and exporting data between HDFS and relational
databases.
• Flume: A data ingestion service that collects, aggregates, and moves large amounts
of streaming data into HDFS.
• Oozie: A workflow scheduler that helps manage and coordinate Hadoop jobs.
MapReduce works in two main phases:
1. Map Phase: Processes the input data and converts it into intermediate key-value pairs.
2. Reduce Phase: Aggregates the intermediate key-value pairs and produces the final output.
The entire MapReduce process can be broken down into the following steps:
1. Input Split: Divides the input data into smaller chunks called splits.
2. Mapping: Each split is processed by a mapper, which emits intermediate key-value pairs.
3. Combining/Partitioning: Intermediate pairs may be combined locally and are assigned to
reducers based on their keys.
4. Shuffle and Sort: Data is shuffled so that all values for the same key are grouped
together, and sorted.
5. Reducer Phase: Reducers aggregate the sorted data and produce the final output.
Writing a MapReduce program typically involves the following steps:
1. Define the Mapper Class: Extend the Mapper class and implement the map() method.
2. Define the Reducer Class: Extend the Reducer class and implement the reduce()
method.
3. Configure Job: Use the Job class to configure input and output paths, mapper,
reducer, and partitioner.
4. Submit Job: Submit the job to YARN and monitor its execution.
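As a concrete illustration of these steps, below is a minimal sketch of the classic
word-count job. The class names (WordCount, TokenizerMapper, IntSumReducer) and the use of
the reducer as a combiner are choices made for this sketch; input and output paths are taken
from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures the job (mapper, reducer, paths) and submits it to the cluster.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a JAR, a job like this would be submitted with something along the lines of
"hadoop jar wordcount.jar WordCount /input /output", after which YARN schedules the map and
reduce tasks across the cluster.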
Testing MapReduce programs:
• Testing Mapper: Verifies that each input record is converted into the expected
intermediate key-value pairs.
• Testing Reducer: Ensures that intermediate key-value pairs are aggregated properly.
• Local Tests: Allows running small-scale tests on local data before deploying on a
cluster.
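One common way to unit-test a reducer in isolation is the Apache MRUnit library (now
retired, but still widely used in Hadoop coursework). The sketch below assumes MRUnit and
JUnit 4 are on the classpath and reuses the IntSumReducer class from the word-count sketch
above.

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class IntSumReducerTest {

  @Test
  public void reducerSumsCountsForAKey() throws Exception {
    ReduceDriver<Text, IntWritable, Text, IntWritable> driver =
        ReduceDriver.newReduceDriver(new WordCount.IntSumReducer());

    // Three occurrences of "hadoop" from the map phase should collapse to a single count of 3.
    driver.withInput(new Text("hadoop"),
        Arrays.asList(new IntWritable(1), new IntWritable(1), new IntWritable(1)));
    driver.withOutput(new Text("hadoop"), new IntWritable(3));
    driver.runTest();
  }
}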
At runtime, a submitted job passes through stages such as the following:
3. Map Task Execution: Mappers process input splits and produce intermediate key-
value pairs.
4. Shuffle and Sort: Intermediate data is transferred from the mappers to the reducers
and sorted by key.
5. Reduce Task Execution: Reducers aggregate sorted data and produce final results.
The Shuffle and Sort phase ensures that intermediate key-value pairs are correctly sorted
and grouped before being passed to the reducer.
• Shuffle: Transfers the mappers’ intermediate output to the reducers responsible for
each key.
• Sort: Sorts intermediate data to ensure that all values associated with a key are
processed together.
• Map Task: Processes input splits and generates intermediate key-value pairs.
• Reduce Task: Aggregates the grouped intermediate data and writes the final output.
• MapReduce Types: The key and value types that flow through a job (map input,
intermediate map output, and final reduce output), declared with Writable classes such
as Text and IntWritable.
• Input Formats: TextInputFormat (the default), KeyValueTextInputFormat,
SequenceFileInputFormat, and NLineInputFormat control how input data is split and
read as key-value pairs.
• Output Formats: TextOutputFormat (the default), SequenceFileOutputFormat, and
NullOutputFormat control how the final key-value pairs are written out.
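The sketch below shows how a job can declare non-default formats explicitly. It is only a
configuration fragment (no mapper, reducer, or paths are set), and the chosen formats are
illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfigDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "format demo");

    // Input format: read each line as a tab-separated (key, value) pair
    // instead of the default TextInputFormat's (byte offset, line).
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // Output format: write results as a binary sequence file rather than
    // the default plain-text TextOutputFormat.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  }
}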
MapReduce has revolutionized big data processing and is used in various real-world
scenarios:
• Log Analysis: Analyzing web server logs for trends and anomalies.
• ETL (Extract, Transform, Load): Cleaning and transforming raw data for data
warehousing.