
Assignment - 4 (Big Data)

Q1. Explain Job Tracker and Task Tracker in Hadoop.

Ans. JobTracker:

Role: The JobTracker is the master node responsible for managing and coordinating MapReduce jobs
submitted to the Hadoop cluster. It is typically run on the master node of the cluster.

Functionality:

Job Scheduling: It schedules MapReduce tasks, allocates resources, and monitors the progress of each
job.

Task Assignment: It assigns tasks to available TaskTracker nodes based on data locality and resource
availability.

TaskTracker:

Role: The TaskTracker is a slave node component responsible for executing tasks assigned by the
JobTracker. Each worker node in the Hadoop cluster runs a TaskTracker daemon.

Functionality:

Task Execution: It executes Map and Reduce tasks assigned by the JobTracker, processing data stored
locally on the node.

Heartbeat: It sends periodic heartbeat signals to the JobTracker to indicate its availability and report task
status updates.
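
The exchange described above can be pictured with a small, purely conceptual Python sketch. None of the class or method names below belong to Hadoop's real API; they are invented only to illustrate how a TaskTracker's heartbeat doubles as a request for new work.

```python
# Toy model of the JobTracker/TaskTracker heartbeat exchange.
# Illustrative names only; real Hadoop 1.x daemons communicate over RPC.

class Task:
    def __init__(self, task_id, kind):
        self.task_id = task_id      # e.g. "map_0003"
        self.kind = kind            # "map" or "reduce"

class JobTracker:
    def __init__(self, pending_tasks):
        self.pending = list(pending_tasks)

    def heartbeat(self, tracker_name, free_slots):
        """Receive a heartbeat and hand back up to `free_slots` pending tasks."""
        count = min(free_slots, len(self.pending))
        assigned = [self.pending.pop(0) for _ in range(count)]
        print(f"{tracker_name}: heartbeat received, assigning {len(assigned)} task(s)")
        return assigned

class TaskTracker:
    def __init__(self, name, slots):
        self.name = name
        self.slots = slots

    def run(self, job_tracker):
        # In real Hadoop this loop runs every few seconds inside the TaskTracker daemon.
        for task in job_tracker.heartbeat(self.name, self.slots):
            print(f"{self.name}: executing {task.kind} task {task.task_id}")

if __name__ == "__main__":
    jt = JobTracker([Task(f"map_{i:04d}", "map") for i in range(5)])
    for tt in (TaskTracker("tracker-1", slots=2), TaskTracker("tracker-2", slots=2)):
        tt.run(jt)
```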

Q2. List and explain limitations and solutions of Hadoop for Big Data Analytics.

Ans. Limitations of Hadoop for Big Data Analytics, along with potential solutions:

Limitations:

1. High Latency for Interactive Queries: Hadoop's batch processing model can result in high latency for
interactive queries and real-time analytics.

2. Complexity in Programming: Developing MapReduce programs requires expertise in Java or other
programming languages, making it challenging for non-programmers.

Solutions:

1. In-Memory Processing with Apache Spark: Apache Spark offers in-memory processing capabilities,
reducing latency for interactive queries and real-time analytics compared to Hadoop's disk-based processing
model (see the PySpark sketch after this list).

2. Higher-Level Abstractions with Apache Hive and Pig: Tools like Apache Hive and Pig provide higher-level
abstractions and SQL-like languages, enabling easier development of analytics workflows without extensive
programming knowledge.
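
As a hedged illustration of solution 1, the sketch below uses PySpark. It assumes the pyspark package is installed and a local Spark runtime is available; the input file events.csv and its columns are placeholders. Caching the DataFrame keeps it in memory, so the repeated queries avoid re-reading from disk.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster this would point at YARN or another manager.
spark = SparkSession.builder.appName("interactive-analytics").master("local[*]").getOrCreate()

# Placeholder input file; any CSV with 'country' and 'amount' columns would do.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# cache() keeps the data in memory, so the queries below do not re-read the file.
events.cache()

# Two "interactive" queries over the same cached data.
events.groupBy("country").count().show()
print(events.filter(events.amount > 100).count())

spark.stop()
```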

Q3. Compare Hadoop 1.0 and Hadoop 2.0 with the help of their architecture and features.
Ans.
Hadoop 1.0:

1. Architecture:

Single Resource Manager: Hadoop 1.0 architecture consists of a single JobTracker, which acts as the
central resource manager and scheduler for all MapReduce jobs.

TaskTrackers: Multiple TaskTracker nodes are responsible for executing Map and Reduce tasks on
individual nodes in the cluster.

2. Features:



Basic HDFS: Hadoop 1.0 includes the Hadoop Distributed File System (HDFS) for distributed storage,
providing fault tolerance and scalability for storing large datasets.

MapReduce Framework: Provides the MapReduce processing framework for distributed computation of
large datasets, enabling parallel processing of tasks across the cluster.
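
Hadoop Streaming, which ships with the MapReduce framework, lets the Map and Reduce tasks be written as plain scripts that read stdin and write stdout. Below is a minimal word-count sketch in Python; the single-file layout with a map/reduce mode argument is only a convenience for this example.

```python
#!/usr/bin/env python3
# Word-count in the Hadoop Streaming style: the framework pipes input splits to
# the mapper and the sorted map output to the reducer via stdin/stdout.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")                 # emit (word, 1)

def reducer():
    current, total = None, 0
    for line in sys.stdin:                       # input arrives sorted by key
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")     # emit (word, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

In a Hadoop 1.0 cluster, a job like this would be submitted through the hadoop-streaming JAR, and the JobTracker would schedule the resulting Map and Reduce tasks on the TaskTrackers.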

Hadoop 2.0:

1. Architecture:

YARN (Yet Another Resource Negotiator): Hadoop 2.0 introduces YARN, a new resource management
framework that decouples resource management and job scheduling from MapReduce, allowing for more
diverse workloads and improved scalability.

ResourceManager and NodeManager: The YARN architecture includes a ResourceManager, which manages
cluster resources, and multiple NodeManagers, which manage resources on individual nodes.

2. Features:

YARN: YARN provides a more flexible and scalable resource management framework, supporting multiple
processing paradigms beyond MapReduce, such as Apache Spark, Apache Tez, and others (a short Spark-on-YARN sketch follows this list).

Enhanced HDFS: Hadoop 2.0 includes enhancements to HDFS, such as support for a high-availability (HA)
NameNode and HDFS federation, improving reliability and scalability.
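
To make the "beyond MapReduce" point concrete, here is a hedged sketch of a Spark application targeted at a YARN cluster. It assumes pyspark is installed and HADOOP_CONF_DIR points at a working cluster configuration; in practice the master is usually supplied via spark-submit --master yarn rather than set in code.

```python
from pyspark.sql import SparkSession

# Request YARN as the cluster manager; YARN's ResourceManager allocates
# containers for the Spark executors instead of a MapReduce-only JobTracker.
spark = (SparkSession.builder
         .appName("spark-on-yarn-example")
         .master("yarn")                      # assumes HADOOP_CONF_DIR is set
         .config("spark.executor.memory", "1g")
         .getOrCreate())

# A trivial distributed computation just to exercise the executors.
print(spark.sparkContext.parallelize(range(1_000_000)).sum())

spark.stop()
```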

Overall, Hadoop 2.0 represents a significant evolution of the Hadoop ecosystem, addressing limitations of
Hadoop 1.0 and introducing new features and capabilities to meet the growing demands of Big Data processing.

Q4. Explain Hadoop YARN architecture. How does it work?


Ans. Hadoop YARN (Yet Another Resource Negotiator) is a resource management and job scheduling framework
introduced in Hadoop 2.0. It separates resource management and job scheduling from the MapReduce
framework, allowing for more flexible and scalable data processing in Hadoop clusters.

Architecture:

1. ResourceManager (RM):

The ResourceManager is the master daemon responsible for managing cluster resources and scheduling applications across the cluster.

2. NodeManager (NM):

The NodeManager is a per-node daemon responsible for managing resources on individual cluster nodes.

3. ApplicationMaster (AM):

The ApplicationMaster is a framework-specific master daemon responsible for managing the execution of
a single application.

How it Works:

1. Job Submission:

A client submits a job to the ResourceManager by providing details such as the type of application,
resource requirements, and input data location.

2. Resource Allocation:

Resources are allocated in the form of containers, which represent a fixed amount of CPU, memory, and
other resources on a cluster node.

3. Task Execution:

The ApplicationMaster, once launched, is responsible for coordinating the execution of tasks for the
application.
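
The three steps can be traced with a purely conceptual Python sketch. The class names below do not correspond to YARN's real API; they only show who asks whom for containers.

```python
# Conceptual trace of a YARN application's life cycle (not the real YARN API).

class Container:
    def __init__(self, node, memory_mb, vcores):
        self.node, self.memory_mb, self.vcores = node, memory_mb, vcores

class ResourceManager:
    """Cluster-wide scheduler: hands out containers on NodeManager nodes."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb, vcores):
        node = self.nodes[0]                      # toy placement policy
        self.nodes.append(self.nodes.pop(0))      # rotate nodes for a "fair" spread
        return Container(node, memory_mb, vcores)

class ApplicationMaster:
    """Per-application coordinator: requests containers and launches tasks in them."""
    def __init__(self, rm):
        self.rm = rm

    def run(self, num_tasks):
        for i in range(num_tasks):
            c = self.rm.allocate(memory_mb=1024, vcores=1)
            print(f"task {i} launched in a {c.memory_mb} MB container on {c.node}")

def submit_application(rm):
    # 1. Job submission: the client asks the RM to start an ApplicationMaster.
    am_container = rm.allocate(memory_mb=512, vcores=1)
    print(f"ApplicationMaster started on {am_container.node}")
    # 2-3. Resource allocation and task execution are then driven by the AM itself.
    ApplicationMaster(rm).run(num_tasks=4)

if __name__ == "__main__":
    submit_application(ResourceManager(nodes=["node-1", "node-2", "node-3"]))
```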

Q5. Explain the types of NoSQL databases.



Ans. NoSQL databases, also known as "Not Only SQL" databases, are a diverse set of database management
systems that differ from traditional relational databases in their data model, scalability, and flexibility. The main types are listed below; short usage sketches for the first two follow the list.

1. Key-Value Stores:

Data Model: Stores data as a collection of key-value pairs, where the value is opaque to the database (e.g., Redis, Amazon DynamoDB).

2. Document Stores:

Data Model: Stores semi-structured data as documents, typically in JSON or BSON format (e.g., MongoDB, CouchDB).

3. Column-Family Stores (Wide Column Stores):

Data Model: Stores data in columns rather than rows, organized into column families (e.g., Apache Cassandra, HBase).

4. Graph Databases:

Data Model: Stores data as nodes, edges, and properties, representing relationships between entities (e.g., Neo4j).

5. Time-Series Databases:

Data Model: Stores data points indexed by time, typically used for tracking and analyzing time-stamped
data (e.g., InfluxDB, TimescaleDB).
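
As short usage sketches for the first two models, the snippet below assumes a local Redis server and a local MongoDB server are running, and that the redis and pymongo client packages are installed; all key, database, and collection names are made up for illustration.

```python
import redis
from pymongo import MongoClient

# --- Key-value store: opaque values looked up by key (Redis) ---
kv = redis.Redis(host="localhost", port=6379)
kv.set("session:42", "alice")                 # write a key-value pair
print(kv.get("session:42"))                   # read it back (returns bytes)

# --- Document store: semi-structured JSON-like documents (MongoDB) ---
mongo = MongoClient("mongodb://localhost:27017")
orders = mongo["shop"]["orders"]              # database "shop", collection "orders"
orders.insert_one({"order_id": 1, "items": ["book", "pen"], "total": 12.5})
print(orders.find_one({"order_id": 1}))       # query by a field of the document
```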
