Assignment 4 (Big Data)
Q1. Explain the roles of JobTracker and TaskTracker in Hadoop.
Ans. JobTracker:
Role: The JobTracker is the master node responsible for managing and coordinating MapReduce jobs
submitted to the Hadoop cluster. It is typically run on the master node of the cluster.
Functionality:
Job Scheduling: It schedules MapReduce tasks, allocates resources, and monitors the progress of each
job.
Task Assignment: It assigns tasks to available TaskTracker nodes based on data locality and resource
availability.
TaskTracker:
Role: The TaskTracker is a slave node component responsible for executing tasks assigned by the
JobTracker. Each worker node in the Hadoop cluster runs a TaskTracker daemon.
Functionality:
Task Execution: It executes Map and Reduce tasks assigned by the JobTracker, processing data stored
locally on the node.
Heartbeat: It sends periodic heartbeat signals to the JobTracker to indicate its availability and report task
status updates.
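To make this division of labor concrete, here is a minimal Hadoop 1.x WordCount sketch using the classic org.apache.hadoop.mapred API (class names and paths are illustrative, not prescribed by the assignment). JobClient.runJob() hands the configured job to the JobTracker, which then schedules the map and reduce tasks onto TaskTrackers:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Minimal Hadoop 1.x WordCount sketch (classic "mapred" API).
public class WordCount {

    // Map task: executed by a TaskTracker, ideally on the node holding the data block.
    public static class TokenMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                out.collect(word, ONE);
            }
        }
    }

    // Reduce task: sums the counts emitted for each word.
    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(TokenMapper.class);
        conf.setReducerClass(SumReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submits the job to the JobTracker and blocks until completion;
        // the JobTracker assigns tasks to TaskTrackers and tracks their
        // heartbeats and progress.
        JobClient.runJob(conf);
    }
}

A job like this would typically be packaged into a jar and launched with the hadoop jar command, with input and output HDFS paths passed as arguments.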
Q2. List and explain limitations and solutions of Hadoop for Big Data Analytics.
Ans. The main limitations of Hadoop for Big Data Analytics, along with potential solutions, are:
Limitations:
1. High Latency for Interactive Queries: Hadoop's batch processing model can result in high latency for
interactive queries and real-time analytics.
Solutions:
1. In-Memory Processing with Apache Spark: Apache Spark offers in-memory processing capabilities,
reducing latency for interactive queries and real-time analytics compared to Hadoop's disk-based processing
model (a minimal Spark sketch follows this list).
2. Higher-Level Abstractions with Apache Hive and Pig: Tools like Apache Hive and Pig provide higher-level
abstractions and SQL-like languages, enabling easier development of analytics workflows without extensive
programming knowledge.
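As a hedged illustration of solution 1, here is a minimal Spark word count in Java; the HDFS paths and application name are invented for the example. The cache() call is what keeps the dataset in memory across repeated queries, which is the key contrast with Hadoop MapReduce's disk-based model:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Minimal Spark word-count sketch (Java API); run via spark-submit,
// which supplies the cluster master (e.g., YARN).
public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-wordcount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // cache() keeps the RDD in memory, so later queries over the same
        // data avoid re-reading from disk between stages.
        JavaRDD<String> lines = sc.textFile("hdfs:///input/logs").cache();

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile("hdfs:///output/wordcounts");
        sc.stop();
    }
}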
Q3. Compare Hadoop 1.0 and Hadoop 2.0 with the help of its architecture and features.
Ans.
Hadoop 1.0:
1. Architecture:
Single Resource Manager: Hadoop 1.0 architecture consists of a single JobTracker, which acts as the
central resource manager and scheduler for all MapReduce jobs.
TaskTrackers: Multiple TaskTracker nodes are responsible for executing Map and Reduce tasks on
individual nodes in the cluster.
2. Features:
MapReduce Framework: Provides the MapReduce processing framework for distributed computation over
large datasets, enabling parallel processing of tasks across the cluster (see the WordCount sketch under Q1).
Hadoop 2.0:
1. Architecture:
YARN (Yet Another Resource Negotiator): Hadoop 2.0 introduces YARN, a new resource management
framework that decouples resource management and job scheduling from MapReduce, allowing for more
diverse workloads and improved scalability.
2. Features:
YARN: YARN provides a more flexible and scalable resource management framework, supporting multiple
processing paradigms beyond MapReduce, such as Apache Spark, Apache Tez, and others.
Enhanced HDFS: Hadoop 2.0 includes enhancements to HDFS, such as support for a high-availability (HA)
NameNode and HDFS federation, improving reliability and scalability.
Overall, Hadoop 2.0 represents a significant evolution of the Hadoop ecosystem, addressing limitations of
Hadoop 1.0 and introducing new features and capabilities to meet the growing demands of Big Data processing.
Q4. Explain the architecture of YARN and how it works.
Ans.
Architecture:
1. ResourceManager (RM):
The ResourceManager is the master daemon responsible for managing and allocating cluster resources.
2. NodeManager (NM):
The NodeManager is a per-node daemon responsible for managing resources on individual cluster nodes.
3. ApplicationMaster (AM):
The ApplicationMaster is a framework-specific master daemon responsible for managing the execution of
a single application.
How it Works:
1. Job Submission:
A client submits a job to the ResourceManager by providing details such as the type of application,
resource requirements, and input data location.
2. Resource Allocation:
Resources are allocated in the form of containers, which represent a fixed amount of CPU, memory, and
other resources on a cluster node.
3. Task Execution:
The ApplicationMaster, once launched, is responsible for coordinating the execution of tasks for the
application (a minimal submission sketch follows this list).
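This flow can be seen from the client side in a short Hadoop 2.x submission sketch using the org.apache.hadoop.mapreduce API (paths are illustrative; with no mapper or reducer set, Hadoop's identity defaults simply copy records through):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal Hadoop 2.x (YARN-era) job submission sketch.
public class YarnSubmitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "identity-copy");
        job.setJarByClass(YarnSubmitExample.class);

        // No mapper/reducer set: the identity defaults pass records through.
        // With the default TextInputFormat, the key is the byte offset of a
        // line and the value is the line itself.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion() submits the job to the ResourceManager; the RM
        // launches an ApplicationMaster in a container, the AM negotiates
        // further containers, and NodeManagers run the tasks inside them.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}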
Q5. List and explain the types of NoSQL databases.
Ans.
1. Key-Value Stores:
Data Model: Stores data as simple key-value pairs, where a unique key maps to an opaque value (a toy
sketch follows this list).
2. Document Stores:
Data Model: Stores semi-structured data as documents, typically in JSON or BSON format.
3. Column-Family Stores:
Data Model: Stores data in columns rather than rows, organized into column families.
4. Graph Databases:
Data Model: Stores data as nodes, edges, and properties, representing relationships between entities.
5. Time-Series Databases:
Data Model: Stores data points indexed by time, typically used for tracking and analyzing time-stamped
data.
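To make the key-value model concrete, here is a toy in-memory sketch in Java (a hypothetical class; real stores such as Redis or Riak add persistence, replication, and eviction on top of this basic get/put/delete contract):

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Toy in-memory key-value store illustrating the data model: opaque keys
// mapped to opaque values, with get/put/delete as the only operations.
public class ToyKeyValueStore {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();

    public void put(String key, byte[] value) { data.put(key, value); }

    public Optional<byte[]> get(String key) {
        return Optional.ofNullable(data.get(key));
    }

    public void delete(String key) { data.remove(key); }

    public static void main(String[] args) {
        ToyKeyValueStore store = new ToyKeyValueStore();
        store.put("user:42", "{\"name\":\"Ada\"}".getBytes());
        store.get("user:42").ifPresent(v -> System.out.println(new String(v)));
    }
}

Note the trade-off the sketch makes visible: values are opaque to the store, so lookup by anything other than the key is outside the model; this is precisely what distinguishes key-value stores from document and column-family stores, which expose internal structure for querying.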