Unit 2
1. History of Hadoop
Hadoop’s journey began in the early 2000s, when Doug Cutting and Mike Cafarella were
inspired by Google’s innovations in distributed computing. Google had published papers on
two core technologies: GFS (the Google File System), which described distributed storage
across clusters of commodity machines, and MapReduce, which described a model for
processing vast amounts of data in parallel across those machines. Doug Cutting, who soon
joined Yahoo to work on the project, wanted an open-source framework that could handle
petabytes of data with fault tolerance, and thus Hadoop was born in 2005.
The name “Hadoop” was derived from the toy elephant of Doug Cutting’s son, symbolizing
its strength and reliability. Hadoop soon became an Apache project and gained immense
popularity due to its ability to handle large-scale data processing.
• Hadoop 1.x: Initial versions relied on a single master node, the NameNode, to manage
all HDFS metadata, which made the cluster prone to bottlenecks and left it with a single
point of failure.
Today, Hadoop has evolved into an ecosystem of tools that support advanced big data
analytics.
1. HDFS (Hadoop Distributed File System): A distributed file system that stores data in
a fault-tolerant manner.
HDFS Architecture:
• NameNode: The master node that stores the file system namespace and block metadata
and coordinates all access to data.
• DataNode: Worker (slave) nodes that store the actual data blocks and report periodically
to the NameNode.
• Block Size: Large block size (default 128 MB) to minimize metadata overhead.
• Fault Tolerance: If a DataNode goes down, data is retrieved from other replicas.
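To make the NameNode/DataNode division concrete, here is a minimal sketch of writing and
then reading a file through the HDFS Java client API. The path /user/demo/hello.txt is
illustrative, and the cluster address is assumed to come from the usual core-site.xml
configuration on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS (from core-site.xml) decides which NameNode the client contacts.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");   // illustrative path

    // Write: the client streams bytes to DataNodes; the NameNode only records metadata.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read: blocks are fetched from whichever DataNodes hold live replicas.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}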
Hadoop supports various data formats that help optimize storage and processing:
• Text Files: Plain text data, commonly in CSV, TSV, or JSON formats.
• Sequence Files: Binary files containing key-value pairs for efficient processing.
• Avro Files: Highly compact and efficient format for data serialization.
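As a small illustration of the binary key-value format, the following sketch writes a
sequence file of (word, count) pairs. The file path and the sample records are made up for
the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/user/demo/counts.seq");   // illustrative path

    // A sequence file stores binary (key, value) records; here Text keys and IntWritable values.
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(IntWritable.class))) {
      writer.append(new Text("apple"), new IntWritable(3));
      writer.append(new Text("banana"), new IntWritable(7));
    }
  }
}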
1. Data Ingestion: Loading raw data into HDFS using tools like Apache Sqoop or Flume.
2. Data Processing: MapReduce jobs (or other engines in the ecosystem) transform and
analyze the ingested data.
3. Storing Results: Processed data is stored back in HDFS for further analysis.
4. Querying with Hive or Pig: Hive provides SQL-like querying on large datasets, while
Pig enables scripting for complex transformations.
• Horizontal Scaling: New nodes can be added dynamically, and the HDFS balancer can
then redistribute data across the cluster so the new nodes are put to use.
• Fault Tolerance: Data is replicated across multiple nodes, ensuring that failure of one
node does not result in data loss.
• Hadoop Streaming: Allows users to create and run MapReduce jobs with any
executable or script as the mapper and reducer. Any language that can read from
standard input and write to standard output will work, including Python, Perl, and Ruby.
The Hadoop ecosystem consists of several tools that extend Hadoop’s functionality. Here are
the key components:
• Pig: A high-level platform for processing large datasets using a scripting language
called Pig Latin.
• HBase: A NoSQL database that provides real-time read/write access to large
datasets.
• Sqoop: Facilitates importing and exporting data between HDFS and relational
databases.
• Flume: A data ingestion service that collects, aggregates, and moves large amounts
of streaming data into HDFS.
• Oozie: A workflow scheduler that helps manage and coordinate Hadoop jobs.
MapReduce works in two main phases:
1. Map Phase: Processes the input data and converts it into intermediate key-value pairs.
2. Reduce Phase: Aggregates the intermediate key-value pairs and produces the final output.
The entire MapReduce process can be broken down into the following steps:
1. Input Split: Divides the input data into smaller chunks called splits.
2. Mapping: Each split is processed by a mapper, which emits intermediate key-value pairs.
3. Combining/Partitioning: Intermediate pairs may be combined locally and are assigned to
reducers based on their keys.
4. Shuffle and Sort: Data is shuffled so that all values for the same key are grouped
together, and sorted.
5. Reducer Phase: Reducers aggregate the sorted data and produce the final output.
Writing a MapReduce program typically involves the following steps:
1. Define the Mapper Class: Extend the Mapper class and implement the map() method.
2. Define the Reducer Class: Extend the Reducer class and implement the reduce()
method.
3. Configure Job: Use the Job class to configure input and output paths, mapper,
reducer, and partitioner.
4. Submit Job: Submit the job to YARN and monitor its execution.
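As a concrete illustration of these steps, below is a minimal sketch of the classic
word-count job. The class names (WordCount, TokenizerMapper, IntSumReducer) and the use of
the reducer as a combiner are choices made for this sketch; input and output paths are taken
from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures the job (mapper, reducer, paths) and submits it to the cluster.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a JAR, a job like this would be submitted with something along the lines of
"hadoop jar wordcount.jar WordCount /input /output", after which YARN schedules the map and
reduce tasks across the cluster.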
Testing MapReduce programs:
• Testing Mapper: Verifies that each input record is converted into the expected
intermediate key-value pairs.
• Testing Reducer: Ensures that intermediate key-value pairs are aggregated properly.
• Local Tests: Allows running small-scale tests on local data before deploying on a
cluster.
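One common way to unit-test a reducer in isolation is the Apache MRUnit library (now
retired, but still widely used in Hadoop coursework). The sketch below assumes MRUnit and
JUnit 4 are on the classpath and reuses the IntSumReducer class from the word-count sketch
above.

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class IntSumReducerTest {

  @Test
  public void reducerSumsCountsForAKey() throws Exception {
    ReduceDriver<Text, IntWritable, Text, IntWritable> driver =
        ReduceDriver.newReduceDriver(new WordCount.IntSumReducer());

    // Three occurrences of "hadoop" from the map phase should collapse to a single count of 3.
    driver.withInput(new Text("hadoop"),
        Arrays.asList(new IntWritable(1), new IntWritable(1), new IntWritable(1)));
    driver.withOutput(new Text("hadoop"), new IntWritable(3));
    driver.runTest();
  }
}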
At runtime, a submitted job passes through stages such as the following:
3. Map Task Execution: Mappers process input splits and produce intermediate key-
value pairs.
4. Shuffle and Sort: Intermediate data is transferred from the mappers to the reducers
and sorted by key.
5. Reduce Task Execution: Reducers aggregate sorted data and produce final results.
The Shuffle and Sort phase ensures that intermediate key-value pairs are correctly sorted
and grouped before being passed to the reducer.
• Shuffle: Transfers the mappers’ intermediate output to the reducers responsible for
each key.
• Sort: Sorts intermediate data to ensure that all values associated with a key are
processed together.
• Map Task: Processes input splits and generates intermediate key-value pairs.
• Reduce Task: Aggregates the grouped intermediate data and writes the final output.
• MapReduce Types: The key and value types that flow through a job (map input,
intermediate map output, and final reduce output), declared with Writable classes such
as Text and IntWritable.
• Input Formats: TextInputFormat (the default), KeyValueTextInputFormat,
SequenceFileInputFormat, and NLineInputFormat control how input data is split and
read as key-value pairs.
• Output Formats: TextOutputFormat (the default), SequenceFileOutputFormat, and
NullOutputFormat control how the final key-value pairs are written out.
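The sketch below shows how a job can declare non-default formats explicitly. It is only a
configuration fragment (no mapper, reducer, or paths are set), and the chosen formats are
illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfigDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "format demo");

    // Input format: read each line as a tab-separated (key, value) pair
    // instead of the default TextInputFormat's (byte offset, line).
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // Output format: write results as a binary sequence file rather than
    // the default plain-text TextOutputFormat.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  }
}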
MapReduce has revolutionized big data processing and is used in various real-world
scenarios:
• Log Analysis: Analyzing web server logs for trends and anomalies.
• ETL (Extract, Transform, Load): Cleaning and transforming raw data for data
warehousing.