BDA CW Chapter 2: 20M

1. Explain the Hadoop Ecosystem with core components. Describe the physical architecture of Hadoop
and state its limitations. [IA1, PYQ]

Core Components of Hadoop Ecosystem:

1. HDFS
o Purpose: HDFS is designed to store large datasets reliably and to stream those
datasets at high bandwidth to user applications.
o Structure: It consists of two main components:
▪ NameNode: Manages the metadata (data about data) and keeps track of which
blocks are stored on which DataNodes.
▪ DataNode: Stores the actual data. Data is split into blocks and distributed
across multiple DataNodes.
o Fault Tolerance: Data is replicated across multiple DataNodes to ensure fault
tolerance and high availability.

2. YARN
o Purpose: YARN is the resource management layer of Hadoop, responsible for
managing and scheduling resources across the cluster.
o Components:
▪ Resource Manager: Allocates resources to various applications running
in the cluster.
▪ Node Manager: Manages resources on a single node and reports to the
Resource Manager.
▪ ApplicationMaster: A per-application process that negotiates resources from
the Resource Manager and works with Node Managers to execute and monitor tasks.
o Functionality: YARN allows multiple data processing engines to run and share
resources, improving the utilization and efficiency of the cluster.
3. MapReduce
o Purpose: MapReduce is a programming model used for processing large datasets in a
distributed and parallel manner.
o Process:
▪ Map Function: Takes input data and converts it into a set of intermediate
key-value pairs, performing filtering and initial processing of the records.
▪ Reduce Function: Takes the shuffled and sorted output of the Map function and
aggregates it, producing the final result.
o Execution: The MapReduce framework handles the distribution of tasks, manages data
transfer between nodes, and ensures fault tolerance (a minimal word-count sketch
follows this list).
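
To make the Map and Reduce steps above concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API; the class name WordCount and the input/output paths supplied on the command line are illustrative only.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}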

Physical Architecture of Hadoop

Hadoop operates on a master-slave architecture and comprises the following components:


1. Master Node Components
1. NameNode:
o Manages the file system namespace (opening, renaming, and closing files).
o Stores metadata and oversees DataNodes.
o Single point of failure (critical to HDFS operation).
2. Job Tracker:
o Accepts MapReduce jobs from the client.
o Coordinates tasks between Task Trackers.
o Interacts with NameNode for metadata.
2. Slave Node Components
1. DataNode:
o Stores actual data in blocks.
o Executes read/write requests and performs block creation, deletion, and replication.
2. Task Tracker:
o Receives and executes tasks from the Job Tracker (e.g., Mapper or Reducer tasks).
o Sends progress reports to the Job Tracker.
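
As a small illustration of the master/slave split above, the following client snippet points a Hadoop Configuration at the NameNode and lists the root directory; the host name master-node and port 9000 are assumed values (real clusters normally pick them up from core-site.xml).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Address of the NameNode (master); host and port are assumed for illustration.
        conf.set("fs.defaultFS", "hdfs://master-node:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            // The NameNode serves metadata requests such as this listing;
            // DataNodes are contacted only when block data is actually read or written.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
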
Advantages of Hadoop
1. Scalability: Hadoop can easily scale horizontally by adding more nodes to the cluster, allowing
it to handle vast amounts of data.
2. Cost-Effective: It uses commodity hardware, making it a cost-effective solution for storing and
processing large datasets.
3. Fault Tolerance: Hadoop automatically replicates data across multiple nodes, ensuring data
availability even if some nodes fail.
4. Flexibility: Hadoop can process a wide variety of data types, including structured,
semi-structured, and unstructured data, from multiple sources.

Limitations of Hadoop
1. Complexity: Setting up, managing, and optimizing Hadoop requires specialized
knowledge, making it challenging for non-experts.
2. Real-Time Processing: Hadoop is designed for batch processing and struggles with
real-time data processing tasks.
3. Small File Handling: Hadoop is inefficient at managing a large number of small files,
leading to performance issues and increased overhead.
4. High Latency: Due to its batch processing nature, Hadoop often exhibits higher latency,
which can be problematic for time-sensitive applications.

2. Why is HDFS more suited for applications having large datasets and not when there are small files?
Elaborate. [IA1]

Reasons HDFS is Suited for Large Datasets

1. Large Block Size: HDFS uses large block sizes (128 MB or 256 MB), reducing the overhead of
managing metadata.
2. High Throughput: Optimized for high-throughput access, making it ideal for reading and
writing large files sequentially.
3. Fault Tolerance: Data blocks are replicated across multiple nodes, ensuring data
availability even if some nodes fail.
4. Scalability: Easily scales by adding more nodes to the cluster, distributing large datasets
efficiently.

Challenges with Small Files

1. Metadata Overhead: Each small file is tracked as an in-memory object (inode) on the
NameNode, leading to excessive memory usage (a rough estimate follows this list).
2. Inefficient Storage: Small files do not fully utilize the large block size, resulting in wasted
storage space.
3. High Latency: Accessing many small files incurs high latency due to the overhead of opening
and closing files.
4. Resource Management: Managing numerous small files increases the load on the NameNode,
affecting overall cluster performance.
5. Not Optimized for Random Access: HDFS is designed for sequential access, making it
inefficient for random access patterns typical of small files.
6. Complexity in Handling Small Files: The overhead of handling many small files can degrade
the performance and efficiency of the HDFS cluster.
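
As a rough, back-of-the-envelope estimate of the metadata overhead mentioned in point 1 (assuming the commonly cited figure of roughly 150 bytes of NameNode heap per file, directory, or block object): 10 million small files, each occupying its own block, amount to about 10,000,000 × 2 × 150 B ≈ 3 GB of NameNode memory just for metadata, whereas the same total data stored as a much smaller number of large, multi-block files needs orders of magnitude less.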

3. Explain the distributed storage system of Hadoop with the help of a neat diagram.
4. Explain the structure of HDFS with a neat, labeled diagram.
5. Explain HDFS architecture with read/write operations performed.
6. Explain how Hadoop goals are covered in the Hadoop Distributed File System. [PYQ]
The Hadoop Distributed File System (HDFS) effectively achieves Hadoop's key objectives:
scalability, fault tolerance, high throughput, and reliability.
1. Scalability
• Distributed Architecture: HDFS divides large data into blocks and distributes them across
multiple nodes, enabling horizontal scaling by adding more nodes to the cluster.
• Block-Based Storage: Fixed-size blocks (default: 128 MB) allow parallel processing and
efficient handling of large files.
• Decoupled Design: Storage and computation grow independently, offering flexibility in scaling.
2. Fault Tolerance
• Replication: Data blocks are replicated across multiple nodes (default: 3), ensuring data
availability even during node failures.
• Heartbeat and Block Reports: DataNodes send regular updates to the NameNode, which
monitors health and triggers re-replication if failures occur.
• Automatic Recovery: Lost blocks are recreated from healthy replicas to maintain consistency.
3. High Throughput
• Data Locality: By moving computation closer to where data resides, HDFS minimizes network
traffic and enhances performance.
• Batch Processing: HDFS is optimized for sequential reads/writes and large-scale processing,
rather than random access.
• Large Block Size: Reduces management overhead and improves processing efficiency for
massive datasets.
4. Reliability
• Metadata Management: The NameNode handles metadata (e.g., block locations), while
DataNodes manage actual data storage, ensuring efficient operations.
• Data Integrity: Checksums validate data during storage and retrieval, detecting corruption.
Corrupted blocks are automatically replaced from replicas.
• Self-Healing: Failed nodes rejoin after recovery, and HDFS seamlessly restores missing data
from replicas.
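
A minimal sketch of how a client exercises these properties through the HDFS Java API (org.apache.hadoop.fs.FileSystem) is shown below; the path, replication factor, and block size are chosen purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/example.txt");         // illustrative path

        // Write with a 4 KB buffer, replication factor 3 (fault tolerance)
        // and 128 MB blocks (scalability, high throughput).
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.writeUTF("hello hdfs");
        }

        // On read, checksums are verified automatically; a corrupted replica is
        // skipped and a healthy one is used instead (reliability, data integrity).
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}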

7. Explain the characteristics of Pig and Mahout

Characteristics of Apache Pig


1. High-Level Abstraction: Provides a high-level scripting language (Pig Latin) for data analysis,
abstracting the complexity of MapReduce.
2. Ease of Use: Easy to learn, read, and write, especially for SQL programmers, reducing the
development effort.
3. Extensibility: Allows users to create their own processes and user-defined functions (UDFs) in
languages like Python and Java.
4. Rich Set of Operators: Offers built-in operators for filtering, joining, sorting, and aggregation,
simplifying data operations.
5. Nested Data Types: Supports complex data types such as tuples, bags, and maps, enabling
more sophisticated data handling.
6. Efficient Code: Reduces the length of code significantly compared to writing in Java for
MapReduce.
7. Prototyping and Ad-Hoc Queries: Useful for exploring large datasets, prototyping data
processing algorithms, and running ad-hoc queries.
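
To illustrate the high-level, operator-based style described above, the following sketch embeds a short Pig Latin script in Java through org.apache.pig.PigServer; the input file students.txt and its (name, subject, marks) schema are assumed.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for a quick sketch; ExecType.MAPREDUCE would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load an assumed tab-separated file of (name, subject, marks) records.
        pig.registerQuery("students = LOAD 'students.txt' AS (name:chararray, subject:chararray, marks:int);");
        // Built-in operators (FILTER, GROUP, AVG) replace hand-written MapReduce code.
        pig.registerQuery("passed = FILTER students BY marks >= 40;");
        pig.registerQuery("by_subject = GROUP passed BY subject;");
        pig.registerQuery("avg_marks = FOREACH by_subject GENERATE group AS subject, AVG(passed.marks) AS average_marks;");

        // Pig compiles the script into MapReduce (or Tez/Spark) jobs and writes the result.
        pig.store("avg_marks", "avg_marks_out");
    }
}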

Characteristics of Apache Mahout


1. Scalability: Designed to handle large-scale data processing by leveraging Hadoop and Spark,
making it suitable for big data machine learning projects.
2. Versatility: Offers a wide range of machine learning algorithms, including classification,
clustering, recommendation, and pattern mining.
3. Integration: Seamlessly integrates with other Hadoop ecosystem components like HDFS and
HBase, simplifying data storage and retrieval.
4. Distributed Processing: Utilizes Hadoop’s MapReduce and Spark for distributed data
processing, ensuring efficient handling of large datasets.
5. Extensibility: Easily extensible, allowing users to add custom algorithms and processing
steps to meet specific requirements.
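
As one compact illustration of Mahout's recommendation support, the sketch below uses the classic (non-distributed) Taste API to build a user-based recommender; the input file ratings.csv, with userID,itemID,preference lines, is an assumed input.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // Ratings file with lines of the form userID,itemID,preference (assumed input).
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based collaborative filtering: items liked by similar users are recommended.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-3 recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}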

8. What is Hadoop? How are Big Data and Hadoop linked?


o Hadoop is an open-source framework designed to store and process large datasets efficiently.
It consists of several components: HDFS (Hadoop Distributed File System) for storing data
across multiple machines, MapReduce for processing data in parallel across clusters, YARN
(Yet Another Resource Negotiator) for managing resources and scheduling, and Hadoop
Common, which includes common utilities and libraries. Hadoop is primarily written in Java.
o Big Data and Hadoop are closely linked because Hadoop is specifically designed to handle Big
Data. Hadoop’s HDFS component stores large datasets efficiently, while MapReduce processes
these datasets in parallel, making it possible to manage and analyze Big Data effectively.
Hadoop is also highly scalable, allowing for the addition of more nodes to the cluster to handle
increasing amounts of data. Common use cases for Hadoop include data warehousing, business
intelligence, machine learning, and data mining.

9. Compare Namenode and Datanode in HDFS [PYQ]
