- Hadoop Concepts,
- Design of Hadoop Distributed File System (HDFS),
- Analyzing data with Hadoop,
- Hadoop streaming,
- Data Formats.
Introduction - Hadoop
• Apache Hadoop is an open-source framework that allows for the distributed
processing of large datasets across clusters of computers using simple
programming models. It is designed to scale up from a single server to thousands
of machines, with a high degree of fault tolerance.
Analyzing data with Hadoop
The key stages of a typical Hadoop data-analysis workflow are described below.
1. Data Ingestion:
Data ingestion is the process of transferring data from multiple sources such as
relational databases, log files, social media, IoT devices, and more into Hadoop's
HDFS (Hadoop Distributed File System) or other Hadoop-compatible storage
systems. This is the first and foundational step for any big data analysis pipeline, as
it allows Hadoop to process large volumes of structured, semi-structured, or
unstructured data coming from disparate sources. Data ingestion can happen in batch
mode (moving large amounts of data at once) or in real-time/streaming mode (a
continuous flow of data as it is generated). Both modes are supported by different
tools and frameworks in the Hadoop ecosystem. Common data sources include:
✔ Relational databases (e.g., MySQL, Oracle, SQL Server)
✔ NoSQL databases (e.g., MongoDB, Cassandra)
✔ Flat files (e.g., CSV, JSON, XML)
✔ Log files from web servers, applications, or devices
✔ APIs or social media feeds
✔ IoT devices generating sensor data
✔ Data streams (e.g., clickstreams from websites)
Key Ingestion Tools:
1. Apache Sqoop: A tool to import data from traditional databases (RDBMS) like
MySQL, PostgreSQL, and Oracle into Hadoop’s HDFS, or vice versa. Sqoop
allows you to bring large datasets stored in relational databases into the Hadoop
ecosystem (a usage sketch follows this list).
2. Apache Flume: A service for collecting, aggregating, and moving large
amounts of log and event data from different sources to HDFS. It’s ideal for
ingesting real-time streaming data from sources such as social media, web logs,
or IoT devices.
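As a rough illustration of how a Sqoop import might be triggered from a Python driver script, the sketch below shells out to the `sqoop import` command. The connection string, credentials, table name, and target directory are hypothetical placeholders, and the approach assumes the Sqoop CLI is installed on the machine running the script.

```python
import subprocess

# Hypothetical example: pull the "orders" table from a MySQL database into HDFS.
# Host names, credentials, and paths below are placeholders, not real endpoints.
sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/sales",   # source RDBMS (placeholder)
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_password",   # avoid passwords on the command line
    "--table", "orders",                              # table to import
    "--target-dir", "/data/raw/orders",               # destination directory in HDFS
    "--num-mappers", "4",                             # parallel import tasks
]

# Run the import and fail loudly if Sqoop returns a non-zero exit code.
subprocess.run(sqoop_cmd, check=True)
```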
2. Data Storage in HDFS:
Hadoop Distributed File System (HDFS) is the primary storage layer for Hadoop,
designed to store large datasets across a distributed cluster. It allows for scalable and
fault-tolerant storage of data by splitting large files into smaller blocks, typically 128 MB
or 256 MB in size, and replicating those blocks across multiple nodes. Fault tolerance
is achieved as HDFS replicates each data block across multiple nodes (by default,
three replicas per block). This replication ensures that data remains available even if
a node fails, thus maintaining data reliability. The storage and retrieval of data in HDFS
form the backbone of big data processing, as it enables high-throughput access and
efficient data analysis through parallel processing. HDFS is optimized for write-once,
read-many access patterns, which means once data is written, it is rarely updated.
This design improves performance for reading and analyzing large-scale datasets.
If a file is larger than the block size (128 MB), it will be split into several blocks. For
example, a 512 MB file will be split into 4 blocks of 128 MB each, and each block will be
stored (and replicated) on different DataNodes.
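The arithmetic behind this splitting is simple. A minimal sketch, assuming the default 128 MB block size and a replication factor of 3:

```python
import math

BLOCK_SIZE_MB = 128      # default HDFS block size (configurable per cluster)
REPLICATION = 3          # default replication factor

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total raw storage in MB) for a file of the given size."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_storage = file_size_mb * REPLICATION
    return blocks, raw_storage

# A 512 MB file: 4 blocks of 128 MB, occupying 1536 MB of raw cluster storage.
print(hdfs_footprint(512))   # -> (4, 1536)
```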
5. Workflow Scheduling:
Once we have imported data from the various sources, stored it in HDFS, and prepared
the MapReduce jobs to process it, the next step is to automate and schedule these jobs
using a tool like Apache Oozie. Oozie automates and manages the execution of multiple
Hadoop jobs. It allows users to define dependencies between jobs and schedule them
to run in a particular sequence.
6. Result Storage:
The Result Storage step is the final stage in a Hadoop workflow, particularly after a
MapReduce or other distributed processing job has completed. This step involves
saving the output produced by the reducer tasks to HDFS (Hadoop Distributed File
System) or another storage system such as HBase or Amazon S3. The output files from
different reducers may need to be concatenated or combined into a single file,
depending on the use case. An appropriate storage format, such as CSV, tab-delimited
text, or Avro, may be used to store the results.
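As a rough sketch of this step, the snippet below concatenates reducer output files (the familiar `part-00000`, `part-00001`, ... naming) into one local file. It assumes the third-party `hdfs` (HdfsCLI) Python package and a WebHDFS endpoint at a placeholder URL; neither is prescribed by the notes above.

```python
from hdfs import InsecureClient   # third-party "hdfs" (HdfsCLI) package, assumed installed

# Placeholder WebHDFS endpoint and job output directory.
client = InsecureClient("http://namenode-host:9870", user="hadoop")
output_dir = "/user/hadoop/wordcount/output"

with open("combined_results.csv", "wb") as combined:
    # Reducer outputs are written as part-* files, one per reducer task.
    for name in sorted(client.list(output_dir)):
        if not name.startswith("part-"):
            continue                      # skip _SUCCESS marker and other metadata files
        with client.read(f"{output_dir}/{name}") as reader:
            combined.write(reader.read())
```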
Design of Hadoop distributed file system (HDFS)
• HDFS (Hadoop Distributed File System) is a core component of the Hadoop
ecosystem, designed to store massive amounts of data reliably and efficiently
across a distributed cluster. It follows a set of principles and design ideas tailored to
handle big data workloads, ensuring scalability, fault tolerance, and high throughput.
• The architecture of HDFS is optimized for write-once, read-many patterns, making it
suitable for storing large files and supporting distributed data processing frameworks
like MapReduce, Hive, and Spark.
• The following are the key design principles followed in HDFS.
1. Master-Slave Architecture
HDFS follows a master-slave architecture, where the master node (the NameNode)
manages the metadata and overall operation of the system, while the slave nodes
(DataNodes) store the actual data. The master node does not store actual data but
rather tracks where the blocks of each file are stored across the DataNodes. It only
stores metadata such as file locations, permissions, and block mappings.
The DataNodes are the worker nodes that store the actual data blocks. Each
DataNode is responsible for managing the storage attached to it and periodically
reporting the status of its blocks to the master node. DataNodes serve read and write
requests from clients and replicate data blocks as instructed by the master node.
This architecture allows the master node to focus on metadata while the DataNodes
handle data storage, enabling the cluster to scale out for capacity and performance.
2. Rack Awareness
HDFS is rack-aware, meaning that it is designed to consider the physical topology
of the network when replicating data. It ensures that replicas of data blocks are
stored on different machines to improve fault tolerance. In a large data center,
machines are often grouped into racks, with high-bandwidth connections within a
rack but slower cross-rack communication. By storing replicas across different
racks, HDFS ensures that if an entire rack fails (e.g., due to a network switch
failure), the data is still available from nodes in other racks.
4. Portability
HDFS is designed to be hardware-agnostic and portable across various hardware
configurations. It can run on commodity hardware, allowing organizations to build
large-scale storage systems without expensive infrastructure. This reduces the cost
of setting up a Hadoop cluster.
5. Data Integrity and Consistency
HDFS ensures that data stored in the system is consistent and remains intact even in
the case of failures. It uses checksums to validate the integrity of the data blocks stored
on DataNodes. Each block is checked for corruption, and if a corrupt block is found, it is
replaced with a replica from another node. HDFS supports atomic file writes. Once a file
is written to HDFS, it cannot be modified. This write-once design ensures data
consistency and simplifies file management in a distributed system.
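To make the checksum idea concrete, here is a conceptual sketch in Python: compute a checksum per block when data is written, then recompute it on read and compare against the stored value to detect corruption. This is an illustration only, not the exact scheme HDFS uses internally, and the block size constant simply mirrors the default mentioned above.

```python
import zlib

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, mirroring the default HDFS block size

def block_checksums(path):
    """Compute a CRC32 checksum per block of a local file (conceptual sketch only)."""
    checksums = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            checksums.append(zlib.crc32(block))
    return checksums

# On read, the same computation is repeated; a mismatch with the stored checksum
# means the block is corrupt and should be replaced by a replica from another node.
```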
Hadoop Streaming
• The Hadoop Streaming data flow begins with an Input Reader, which is responsible
for reading the input data and producing a list of key-value pairs.
• We can read data in .csv format, in delimited-text format, from a database table, image
data (.jpg, .png), audio data, etc.
• The input reader contains the complete logic for the data it is reading. Suppose
we want to read an image: we have to specify that logic in the input reader so
that it can read the image data and generate key-value pairs for it.
• The list of key-value pairs is fed to the Map phase. The Mapper works on each
key-value pair (for example, each pixel of an image) and generates intermediate
key-value pairs, which are fed to the Reducer after shuffling and sorting; the final
output produced by the Reducer is written to HDFS.
• We can write our own external mapper and run it as a separate process.
• As the key-value pairs are passed to the internal mapper, the internal mapper
process sends them in turn to the external mapper, where we have written our code
in some other language such as Python, with the help of STDIN.
• The external map processes are not part of the basic MapReduce flow. The
external mapper takes input from STDIN and produces output on STDOUT.
• The external mappers process these key-value pairs and generate intermediate
key-value pairs, which they send back to the internal mappers with the help of STDOUT.
• The Reducer works in a similar way. Once the intermediate key-value pairs have
been processed through the shuffle and sort phase, they are fed to the internal
reducer, which sends them via STDIN to the external reducer process running
separately, gathers the output generated by the external reducer via STDOUT,
and finally stores the output in HDFS.
• This is how Hadoop Streaming works; the capability is available in Hadoop by
default. We simply utilize this feature by supplying our own external mapper and
reducer, as in the word-count sketch below.
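A minimal word-count example of such external mapper and reducer scripts is sketched below. The scripts read from STDIN and write tab-separated key-value pairs to STDOUT, which is the contract Hadoop Streaming expects; the file names are illustrative.

```python
#!/usr/bin/env python3
# mapper.py -- reads raw text lines from STDIN, emits "word<TAB>1" pairs on STDOUT.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- receives "word<TAB>count" pairs sorted by key, sums counts per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Such scripts are typically submitted with the hadoop-streaming JAR, passing the two files through the -mapper and -reducer options along with -input and -output HDFS paths.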
Data Storage Formats
There are several commonly used data storage formats in HDFS (Hadoop Distributed
File System). These formats cater to different data requirements, such as binary
storage, schema support, compression, and ease of use
1. CSV (Comma-Separated Values) and TSV (Tab-Separated Values)
2. JSON
3. Sequence Files
4. Avro
5. Parquet
6. Optimised Row Columnar (ORC)
7. Protobuf
3. Sequence Files
A Sequence File is a binary format that stores data as key-value pairs. It is an
older Hadoop-native format primarily used for intermediate data in MapReduce
jobs and can store any arbitrary Java objects. Sequence files store data as key-
value pairs, where the keys and values are serialized in binary format. This makes
them ideal for scenarios where data needs to be processed as key-value pairs, such
as in MapReduce jobs. They can be compressed using block-level or record-level
compression, and they are splittable, meaning that large files can be split into
multiple smaller pieces, enabling parallel processing across multiple nodes.
4. Avro
Apache Avro is a row-based binary data serialization format that is optimized for
performance and supports schema evolution. It's widely used in big data ecosystems
because of its compactness, versatility, and ability to handle both structured and
semi-structured data. Avro stores data row by row, making it well suited for write-
heavy workloads. Avro files carry their schema along with the data, which allows for
easy serialization and deserialization across systems. Since it uses compact binary
encoding, it reduces storage overhead. Avro files are also splittable, which makes
them suitable for distributed processing frameworks like Hadoop MapReduce or
Apache Spark. Avro is often used to store logs or sensor data that might evolve
over time, as it efficiently handles schema changes.
[In short: easy writes, harder reads.]
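A small sketch of writing and reading Avro records from Python, assuming the third-party fastavro package (one of several Avro libraries); the schema, field names, and file name are made up for illustration:

```python
from fastavro import writer, reader, parse_schema

# Hypothetical schema for simple sensor readings; record and field names are illustrative.
schema = parse_schema({
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "temperature", "type": "float"},
    ],
})

records = [
    {"sensor_id": "s-001", "timestamp": 1700000000, "temperature": 21.5},
    {"sensor_id": "s-002", "timestamp": 1700000060, "temperature": 22.1},
]

# Write: the schema is embedded in the file header, so readers can decode it later.
with open("readings.avro", "wb") as out:
    writer(out, schema, records)

# Read: the embedded schema drives deserialization.
with open("readings.avro", "rb") as inp:
    for rec in reader(inp):
        print(rec)
```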
5. Parquet
Apache Parquet is a columnar storage format, optimized for efficient reading and
writing of large datasets. It was developed to provide high performance for
read-heavy analytical workloads in big data systems. Data in Parquet is
stored in a column-oriented format, meaning that the values from the same column
are stored together. This enables efficient querying and retrieval of specific columns
without scanning entire rows. Parquet uses efficient compression algorithms, such as
Snappy, GZIP, or LZO, which reduce storage space while maintaining high
performance. Parquet files are splittable, meaning large files can be divided into
smaller pieces for parallel processing across different nodes in a Hadoop cluster.
Parquet enforces a schema for the data, ensuring that the data structure is well-
defined and validated. It supports rich data types like nested structures (arrays,
maps), making it a great fit for complex data models.
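A brief sketch of writing and reading Parquet from Python, assuming the pyarrow library is available; the file name, column names, and the choice of Snappy compression are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table; in practice this might come from a DataFrame or a scan.
table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["IN", "US", "DE"],
    "clicks": [14, 7, 22],
})

# Write with Snappy compression (one of the codecs mentioned above).
pq.write_table(table, "clicks.parquet", compression="snappy")

# Columnar advantage: read back only the columns a query actually needs.
subset = pq.read_table("clicks.parquet", columns=["country", "clicks"])
print(subset.to_pydict())
```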