BDA CW Chapter 2
1. Explain the Hadoop Ecosystem with core components. Describe the physical architecture of Hadoop
and state its limitations. [IA1, PYQ]
1. HDFS
o Purpose: HDFS is designed to store large datasets reliably and to stream those
datasets at high bandwidth to user applications.
o Structure: It consists of two main components:
▪ NameNode: Manages the metadata (data about data) and keeps track of which
blocks are stored on which DataNodes.
▪ DataNode: Stores the actual data. Data is split into blocks and distributed
across multiple DataNodes.
o Fault Tolerance: Data is replicated across multiple DataNodes to ensure fault
tolerance and high availability.
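As an illustration, here is a minimal Java sketch of writing a file through the HDFS client API (the path /user/demo/sample.txt is hypothetical, and a reachable cluster configured via core-site.xml/hdfs-site.xml is assumed): the client asks the NameNode for metadata, then streams the data to DataNodes, which replicate it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // client handle; talks to the NameNode
        Path file = new Path("/user/demo/sample.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");              // data is split into blocks on DataNodes
        }
        // replication factor reported by the NameNode for this file
        System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}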
2. YARN
o Purpose: YARN is the resource management layer of Hadoop, responsible for
managing and scheduling resources across the cluster.
o Components:
▪ Resource Manager: Allocates resources to various applications running
in the cluster.
▪ Node Manager: Manages resources on a single node and reports to the
Resource Manager.
▪ ApplicationMaster: Negotiates resources from the Resource Manager on
behalf of a single application and works with Node Managers to execute and monitor its tasks.
o Functionality: YARN allows multiple data processing engines to run and share
resources, improving the utilization and efficiency of the cluster.
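As a sketch of this layer (assuming yarn-site.xml is on the classpath and the cluster is running), the YARN client API can ask the Resource Manager which Node Managers it knows about and what resources each reports:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());                  // reads yarn-site.xml
        yarn.start();
        // each running Node Manager reports its resources to the Resource Manager
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport n : nodes) {
            System.out.println(n.getNodeId() + " -> " + n.getCapability());
        }
        yarn.stop();
    }
}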
3. MapReduce
o Purpose: MapReduce is a programming model used for processing large datasets in a
distributed and parallel manner.
o Process:
▪ Map Function: Takes input data and converts it into a set of intermediate
key-value pairs, applying filtering and transformation logic to each record.
▪ Reduce Function: Takes the shuffled and sorted output of the Map phase and
aggregates the values for each key, producing the final result.
o Execution: The MapReduce framework handles the distribution of tasks, manages data
transfer between nodes, and ensures fault tolerance.
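A minimal word-count sketch of the model (class names are illustrative; job setup and I/O paths are omitted): the map step emits (word, 1) pairs and the reduce step sums the counts per word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: one input line in, (word, 1) pairs out
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }
    // Reduce: sum the 1s collected for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}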
Limitations of Hadoop
1. Complexity: Setting up, managing, and optimizing Hadoop requires specialized
knowledge, making it challenging for non-experts.
2. Real-Time Processing: Hadoop is designed for batch processing and struggles with
real-time data processing tasks.
3. Small File Handling: Hadoop is inefficient at managing a large number of small files,
leading to performance issues and increased overhead.
4. High Latency: Due to its batch processing nature, Hadoop often exhibits higher latency,
which can be problematic for time-sensitive applications.
2. Why is HDFS more suited for applications having large datasets and not when there are small files?
Elaborate. [IA1]
Why HDFS suits large datasets:
1. Large Block Size: HDFS uses large block sizes (typically 128 MB or 256 MB), reducing the
overhead of managing metadata.
2. High Throughput: Optimized for high-throughput access, making it ideal for reading and
writing large files sequentially.
3. Fault Tolerance: Data blocks are replicated across multiple nodes, ensuring data
availability even if some nodes fail.
4. Scalability: Easily scales by adding more nodes to the cluster, distributing large datasets
efficiently.
Why HDFS is poorly suited to many small files:
1. Metadata Overhead: Each small file requires its own metadata object (inode) in the
NameNode's memory, leading to excessive memory usage.
2. Inefficient Storage: Small files do not fully utilize the large block size, resulting in wasted
storage space.
3. High Latency: Accessing many small files incurs high latency due to the overhead of opening
and closing files.
4. Resource Management: Managing numerous small files increases the load on the NameNode,
affecting overall cluster performance.
5. Not Optimized for Random Access: HDFS is designed for sequential access, making it
inefficient for random access patterns typical of small files.
6. Complexity in Handling Small Files: The overhead of handling many small files can degrade
the performance and efficiency of the HDFS cluster.
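A rough back-of-the-envelope illustration of the metadata problem (the ~150 bytes per namespace object used here is a commonly quoted estimate, not an exact figure):

public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        long bytesPerObject = 150L;                      // assumed rough cost per file/block object
        long blockSize = 128L * 1024 * 1024;             // default 128 MB block

        // ~1 GB stored as a single large file: 1 file object + 8 block objects
        long largeFileObjects = 1 + (1024L * 1024 * 1024) / blockSize;

        // ~1 GB stored as 10,000 files of ~100 KB each: 10,000 file + 10,000 block objects
        long smallFileObjects = 10_000L + 10_000L;

        System.out.println("One 1 GB file     : ~" + largeFileObjects * bytesPerObject + " bytes of NameNode heap");
        System.out.println("10,000 small files: ~" + smallFileObjects * bytesPerObject + " bytes of NameNode heap");
    }
}

Under these assumptions, the same volume of data costs roughly a thousand times more NameNode memory when it arrives as many small files instead of one large file.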
3. Explain the distributed storage system of Hadoop with the help of a neat diagram.
4. Describe the structure of HDFS with a neat, labeled diagram.
5. Explain HDFS architecture with read/write operations performed.
6. Explain how Hadoop goals are covered in the Hadoop Distributed File System. [PYQ]
The Hadoop Distributed File System (HDFS) effectively achieves Hadoop's key objectives:
scalability, fault tolerance, high throughput, and reliability.
1. Scalability
• Distributed Architecture: HDFS divides large data into blocks and distributes them across
multiple nodes, enabling horizontal scaling by adding more nodes to the cluster.
• Block-Based Storage: Fixed-size blocks (default: 128 MB) allow parallel processing and
efficient handling of large files.
• Decoupled Design: Storage and computation grow independently, offering flexibility in scaling.
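A hedged sketch of how a client can observe this block distribution (the file path is hypothetical; a configured cluster is assumed): FileSystem.getFileBlockLocations returns, for each block of a file, the DataNodes holding a replica.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayoutExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big.log")); // hypothetical file
        // one BlockLocation per block, each listing the hosts holding a replica
        for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + b.getOffset() + " len=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
    }
}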
2. Fault Tolerance
• Replication: Data blocks are replicated across multiple nodes (default: 3), ensuring data
availability even during node failures.
• Heartbeat and Block Reports: DataNodes send regular updates to the NameNode, which
monitors health and triggers re-replication if failures occur.
• Automatic Recovery: Lost blocks are recreated from healthy replicas to maintain consistency.
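As a small illustrative example (path hypothetical): the default replication factor comes from the dfs.replication setting, and a client can inspect or change it per file; HDFS itself handles re-replication to meet the target.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt");   // hypothetical file
        System.out.println("Current replication: " + fs.getFileStatus(file).getReplication());
        // ask the NameNode to keep 3 replicas of this file's blocks
        fs.setReplication(file, (short) 3);
    }
}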
3. High Throughput
• Data Locality: By moving computation closer to where data resides, HDFS minimizes network
traffic and enhances performance.
• Batch Processing: HDFS is optimized for sequential reads/writes and large-scale processing,
rather than random access.
• Large Block Size: Reduces management overhead and improves processing efficiency for
massive datasets.
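A minimal sketch of the sequential, streaming access pattern HDFS is tuned for (file path hypothetical): the whole file is streamed block after block rather than read with many small random seeks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamingReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() streams the blocks from whichever DataNodes hold them
        try (FSDataInputStream in = fs.open(new Path("/user/demo/big.log"))) { // hypothetical file
            IOUtils.copyBytes(in, System.out, 4096, false);   // sequential copy to stdout
        }
    }
}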
4. Reliability
• Metadata Management: The NameNode handles metadata (e.g., block locations), while
DataNodes manage actual data storage, ensuring efficient operations.
• Data Integrity: Checksums validate data during storage and retrieval, detecting corruption.
Corrupted blocks are automatically replaced from replicas.
• Self-Healing: Failed nodes rejoin after recovery, and HDFS seamlessly restores missing data
from replicas.
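A short hedged sketch of the integrity check from the client side (path hypothetical): HDFS keeps checksums alongside stored data, and an aggregate file checksum can be requested from the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // computed from the block checksums HDFS already stores with the data
        FileChecksum sum = fs.getFileChecksum(new Path("/user/demo/sample.txt")); // hypothetical file
        System.out.println(sum.getAlgorithmName() + " : " + sum);
    }
}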