
Unit: Hadoop and MapReduce – Detailed Notes

1. History of Hadoop

Hadoop’s journey began in the early 2000s when Doug Cutting and Mike Cafarella were
inspired by Google’s innovations in distributed computing. Google had published papers on
two core technologies: GFS (Google File System) and MapReduce, which described a
method to process vast amounts of data efficiently across multiple machines. Doug Cutting,
while working at Yahoo, wanted to build an open-source framework that could handle
petabytes of data with fault tolerance, and thus, Hadoop was born in 2005.

The name “Hadoop” was derived from the toy elephant of Doug Cutting’s son, symbolizing
its strength and reliability. Hadoop soon became an Apache project and gained immense
popularity due to its ability to handle large-scale data processing.

• Hadoop 1.x: Initial versions relied on a single master node called NameNode to
manage metadata, which made it prone to bottlenecks and failures.

• Hadoop 2.x: Introduced YARN (Yet Another Resource Negotiator), decoupling resource management from job scheduling, leading to better performance and scalability.

Today, Hadoop has evolved into an ecosystem of tools that support advanced big data
analytics.

2. Apache Hadoop and HDFS (Hadoop Distributed File System)

Apache Hadoop is an open-source framework that allows distributed processing of large datasets across clusters of computers using simple programming models. It is designed to scale from a single server to thousands of machines, with each offering local computation and storage.

Hadoop Core Components:

1. HDFS (Hadoop Distributed File System): A distributed file system that stores data in
a fault-tolerant manner.

2. MapReduce: A parallel processing framework that processes large datasets.

3. YARN: Manages and schedules computing resources in the cluster.

HDFS Architecture:

HDFS follows a master-slave architecture where:


• NameNode: Acts as the master that maintains the directory tree and metadata of all
files and directories in the file system.

• DataNode: Slave nodes that store actual data blocks and report periodically to the
NameNode.

• Secondary NameNode: Periodically merges the NameNode’s edit log into the filesystem image (fsimage) to produce a metadata checkpoint; despite its name, it is not a hot standby.

Key Characteristics of HDFS:

• Block Size: Large block size (default 128 MB) to minimize metadata overhead.

• Replication: Ensures data redundancy by maintaining multiple copies (default replication factor is 3).

• Fault Tolerance: If a DataNode goes down, data is retrieved from other replicas.
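
To make this concrete, the sketch below is a minimal HDFS client in Java, assuming a cluster whose address is supplied by core-site.xml on the classpath; the class name and the path /user/demo/hello.txt are hypothetical. The program talks only to the FileSystem abstraction, while the NameNode records the metadata and the DataNodes hold the replicated blocks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS (the NameNode address) from core-site.xml
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; the NameNode tracks it, DataNodes store its blocks
            Path file = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello, HDFS");
            }

            // Replication factor actually applied to the file (default 3)
            System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
        }
    }

The same client code runs unchanged against a single-node or a thousand-node cluster, which is the point of the abstraction.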

3. Data Format and Analyzing Data with Hadoop

Hadoop supports various data formats that help optimize storage and processing:

• Text Files: Plain text data, commonly in CSV, TSV, or JSON formats.

• Sequence Files: Binary files containing key-value pairs for efficient processing (see the sketch after this list).

• Avro Files: Highly compact and efficient format for data serialization.

• Parquet/ORC: Columnar storage formats that significantly improve query performance.
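
As a small illustration of the sequence-file format, the hedged sketch below writes a few key-value records using Hadoop's SequenceFile API; the output path and record contents are made up, and the same cluster configuration as in the previous section is assumed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path file = new Path("/user/demo/events.seq");   // hypothetical output path

            // A sequence file stores binary (key, value) records; here Text keys, IntWritable values
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(file),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class))) {
                writer.append(new Text("clicks"), new IntWritable(42));
                writer.append(new Text("views"), new IntWritable(128));
            }
        }
    }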

Analyzing Data with Hadoop:

Data analysis using Hadoop involves the following steps:

1. Data Ingestion: Loading raw data into HDFS using tools like Apache Sqoop or Flume.

2. Processing with MapReduce: Applying transformation and aggregation using MapReduce jobs.

3. Storing Results: Processed data is stored back in HDFS for further analysis.

4. Querying with Hive or Pig: Hive provides SQL-like querying on large datasets, while
Pig enables scripting for complex transformations.

4. Scaling Out in Hadoop


Hadoop excels in horizontal scaling, which involves adding more machines to the cluster to
handle increasing data loads. Unlike traditional databases that rely on vertical scaling
(adding more CPU, RAM, or disk), Hadoop distributes data across multiple nodes, allowing it
to handle petabytes of information.

• Horizontal Scaling: New nodes can be added dynamically, and Hadoop automatically
rebalances the cluster to include the new nodes.

• Fault Tolerance: Data is replicated across multiple nodes, ensuring that failure of one
node does not result in data loss.

• Load Balancing: YARN dynamically allocates tasks to nodes based on available resources.

5. Hadoop Streaming and Pipes

Hadoop provides two interfaces that allow non-Java developers to write MapReduce programs:

• Hadoop Streaming: Allows users to create and run MapReduce jobs with any
executable or script as the mapper and reducer. It supports languages like Python,
Perl, and Ruby.

o Usage: Input is read from stdin and output is written to stdout.

o Example: cat data.txt | python mapper.py | sort | python reducer.py

• Hadoop Pipes: A C++ API for writing MapReduce programs.

o Usage: Allows C++ programs to communicate with Hadoop by implementing the Mapper and Reducer interfaces.

o Advantages: Provides lower latency and higher efficiency compared to Hadoop Streaming.

6. Hadoop Ecosystem

The Hadoop ecosystem consists of several tools that extend Hadoop’s functionality. Here are
the key components:

• Hive: A data warehouse infrastructure that provides SQL-like querying capabilities over HDFS.

• Pig: A high-level platform for processing large datasets using a scripting language called Pig Latin.

• HBase: A NoSQL database that provides real-time read/write access to large datasets.

• Sqoop: Facilitates importing and exporting data between HDFS and relational
databases.

• Flume: A data ingestion service that collects, aggregates, and moves large amounts
of streaming data into HDFS.

• Oozie: A workflow scheduler that helps manage and coordinate Hadoop jobs.

• ZooKeeper: A distributed coordination service that manages distributed applications.

7. MapReduce Framework and Basics

MapReduce is a programming model and an associated implementation used to process and generate large datasets. It works by breaking down the processing into two main phases:

1. Map Phase: Processes the input data and converts it into key-value pairs.

2. Reduce Phase: Aggregates and processes these intermediate key-value pairs to generate the final output.
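
The canonical illustration of the two phases is word counting. The following is a minimal sketch against the org.apache.hadoop.mapreduce API; the class names are illustrative, not part of Hadoop itself. The mapper emits (word, 1) for every word, and the reducer sums those counts per word.

    // WordCountMapper.java -- Map phase: emit (word, 1) for each word in a line
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);          // intermediate key-value pair
                }
            }
        }
    }

    // WordCountReducer.java -- Reduce phase: sum the counts for each word
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum)); // final output record
        }
    }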

8. How MapReduce Works

The entire MapReduce process can be broken down into the following steps:

1. Input Split: Divides the input data into smaller chunks called splits.

2. Mapper Phase: Each split is processed by a separate mapper, which generates intermediate key-value pairs.

3. Partitioning: Intermediate data is partitioned by key so that all values for a given key go to the same reducer (see the sketch after this list).

4. Shuffle and Sort: Data is shuffled to ensure that all values for the same key are
grouped together and sorted.

5. Reducer Phase: Reducers aggregate the sorted data and produce the final output.

6. Output: The results are stored in HDFS.
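
Step 3 (partitioning, referenced above) is carried out by a Partitioner; Hadoop's default HashPartitioner assigns each key to a reducer by hashing. The sketch below spells out that same hash-modulo rule for the word-count types from section 7; a custom class like this would be registered with job.setPartitionerClass().

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Decides which reducer receives each intermediate key.
    // This mirrors the default hash-modulo behaviour of HashPartitioner.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask off the sign bit so the result is a valid partition index
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }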

9. Developing a MapReduce Application

To develop a MapReduce application:


1. Define the Mapper Class: Extend the Mapper class and implement the map()
method.

2. Define the Reducer Class: Extend the Reducer class and implement the reduce()
method.

3. Configure Job: Use the Job class to configure input and output paths, mapper,
reducer, and partitioner.

4. Submit Job: Submit the job to YARN and monitor its execution.
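
Putting the four steps together, a minimal driver sketch might look as follows; it reuses the hypothetical WordCountMapper and WordCountReducer from section 7 and takes the input and output paths from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);      // hypothetical classes from section 7
            job.setCombinerClass(WordCountReducer.class);   // optional local pre-aggregation
            job.setReducerClass(WordCountReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

            // Submits the job to YARN and blocks until it completes
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Using the reducer as a combiner is safe here only because summation is associative and commutative; it is an optimization, not a requirement.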

10. Unit Tests with MRUnit

MRUnit is a framework that allows developers to test individual MapReduce components in isolation.

• Testing Mapper: Verifies that input splits are correctly processed.

• Testing Reducer: Ensures that intermediate key-value pairs are aggregated properly.

• Local Tests: Allows running small-scale tests on local data before deploying on a
cluster.
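
A hedged example of such a test, written against the hypothetical WordCountMapper from section 7: MRUnit's MapDriver runs the mapper in isolation, feeds it a single input record, and checks the intermediate pairs it emits.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Before;
    import org.junit.Test;

    public class WordCountMapperTest {
        private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

        @Before
        public void setUp() {
            // Runs the mapper locally, with no cluster or HDFS involved
            mapDriver = MapDriver.newMapDriver(new WordCountMapper());
        }

        @Test
        public void emitsOnePairPerWord() throws IOException {
            mapDriver.withInput(new LongWritable(0), new Text("big data big"))
                     .withOutput(new Text("big"), new IntWritable(1))
                     .withOutput(new Text("data"), new IntWritable(1))
                     .withOutput(new Text("big"), new IntWritable(1))
                     .runTest();
        }
    }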

11. Anatomy of a MapReduce Job Run

A typical MapReduce job runs through the following phases:

1. Job Submission: The client submits the job to YARN.

2. Resource Allocation: YARN allocates resources and launches containers.

3. Map Task Execution: Mappers process input splits and produce intermediate key-value pairs.

4. Shuffle and Sort: Intermediate data is shuffled, sorted, and partitioned.

5. Reduce Task Execution: Reducers aggregate sorted data and produce final results.

6. Job Completion: Results are written to HDFS.

12. Handling Failures in MapReduce

• Task Failure: Failed tasks are retried on other nodes.

• Node Failure: YARN reallocates resources and reruns failed tasks.

• Job Failure: Error logs provide insights for troubleshooting.
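
How persistently a failed task is retried is configurable. A minimal sketch, assuming the standard mapreduce.map.maxattempts and mapreduce.reduce.maxattempts properties (four attempts by default); they would be set on the same configuration used by the driver in section 9.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class FailureTuningExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Attempts allowed before a single map or reduce task is declared failed (default 4)
            conf.setInt("mapreduce.map.maxattempts", 4);
            conf.setInt("mapreduce.reduce.maxattempts", 4);

            Job job = Job.getInstance(conf, "failure tuning demo");
            // ... mapper, reducer and paths configured as in the word-count driver (section 9) ...
        }
    }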


13. Job Scheduling in Hadoop

Hadoop offers multiple job scheduling options:

• FIFO Scheduler: Processes jobs in the order they are submitted.

• Fair Scheduler: Distributes resources fairly among jobs.

• Capacity Scheduler: Divides resources into queues with different capacities.

14. Shuffle and Sort in MapReduce

The Shuffle and Sort phase ensures that intermediate key-value pairs are correctly sorted
and grouped before being passed to the reducer.

• Shuffle: Moves map output to the reducers.

• Sort: Sorts intermediate data to ensure that all values associated with a key are
processed together.

15. Task Execution in MapReduce

• Map Task: Processes input splits and generates intermediate key-value pairs.

• Reduce Task: Aggregates and processes sorted data.

• Speculative Execution: Detects slow-running tasks and launches backup tasks to improve performance.
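
Speculative execution is enabled by default and is usually worth leaving on, but it should be disabled when tasks have side effects (for example, writing to an external system). A minimal sketch, assuming the standard mapreduce.map.speculative and mapreduce.reduce.speculative properties:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Turn speculative backup tasks off for both phases of this job
            conf.setBoolean("mapreduce.map.speculative", false);
            conf.setBoolean("mapreduce.reduce.speculative", false);

            Job job = Job.getInstance(conf, "speculation demo");
            // ... remaining job setup as in the word-count driver (section 9) ...
        }
    }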

16. MapReduce Types, Input, and Output Formats

• MapReduce Types: The general form of a job is map: (K1, V1) → list(K2, V2) and reduce: (K2, list(V2)) → list(K3, V3); the key and value classes (e.g., Text, IntWritable) must be chosen consistently with the input and output formats below.

• Input Formats:

o TextInputFormat: Processes line-based text data.

o KeyValueTextInputFormat: Reads key-value pairs from input.

o SequenceFileInputFormat: Processes binary sequence files.

• Output Formats:

o TextOutputFormat: Writes plain text output.

o SequenceFileOutputFormat: Stores binary key-value pairs.
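
On the Java API these formats are chosen on the Job object. The sketch below is a hypothetical driver variant that reads tab-separated key-value text and writes a binary sequence file; the mapper and reducer would be set exactly as in section 9.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class FormatDemoDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "format demo");
            job.setJarByClass(FormatDemoDriver.class);

            // Input: each line is split into a (key, value) pair at the first tab character
            job.setInputFormatClass(KeyValueTextInputFormat.class);

            // Output: a binary sequence file of (Text, IntWritable) records
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Mapper and reducer classes would be added here as in section 9
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }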


17. Real-world MapReduce Use Cases

MapReduce has revolutionized big data processing and is used in various real-world
scenarios:

• Log Analysis: Analyzing web server logs for trends and anomalies.

• ETL (Extract, Transform, Load): Cleaning and transforming raw data for data
warehousing.

• Recommendation Systems: Building recommendation engines based on user preferences.

• Data Aggregation: Summarizing large-scale data for business insights.
