BDA Model QP Soln
MODULE-I
1a) What is Big Data? Explain evolution of big data & characteristics.
1. Definition:
o Big Data refers to large, complex data sets that traditional data processing software
cannot handle.
o It includes structured, semi-structured, and unstructured data, often characterized
by high volume, velocity, variety, and veracity.
2. Purpose:
o The goal of Big Data is to analyze vast amounts of data to gain insights, improve
decision-making, and optimize processes.
Characteristics of Big Data (the Vs):
1. Volume:
o Refers to the massive amount of data generated, typically measured in petabytes or
exabytes.
o Data comes from various sources like social media, sensors, and online
transactions.
2. Velocity:
o Describes the speed at which data is generated and needs to be processed.
o Real-time data generation and processing are common (e.g., sensor data, social
media posts).
3. Variety:
o Represents the different types of data: structured (tables), semi-structured (XML,
JSON), and unstructured (text, videos).
o Big Data includes various formats that need to be processed and integrated.
4. Veracity:
o Refers to the quality and accuracy of data.
o Big Data often contains noisy, incomplete, or inconsistent data that needs to be
cleaned and validated for analysis.
Additional Vs (Optional)
5. Value:
o Focuses on the usefulness of the data and the insights that can be extracted.
o The value lies in turning raw data into actionable intelligence.
6. Variability:
o Describes the changing nature of data, especially from sources like social media or
IoT devices.
o Data patterns can fluctuate over time, requiring dynamic analysis techniques.
Classification of Data
Data can be classified into four main categories based on its structure:
1. Structured Data:
o This data is highly organized and adheres to specific schemas, such as rows and
columns in relational databases. Examples include data stored in traditional
databases (e.g., RDBMS).
Big Data can come from various sources, each generating different types of data:
• Social Networks and Web Data: Data generated by users on platforms like Facebook,
Twitter, emails, blogs, and YouTube.
• Transactional and Business Process Data: Includes data from credit card transactions,
flight bookings, medical records, and insurance claims.
• Machine-Generated Data: Includes data from Internet of Things (IoT) devices, sensors,
and machine-to-machine communications.
• Human-Generated Data: Includes biometric data, human-machine interaction data,
emails, and personal documents.
Grid Computing:
• Parallel Processing:
o Grid computing helps combine the power of multiple machines to perform complex
tasks that a single machine would struggle with.
o It allows for distributed and parallel computation of tasks across the grid, improving
efficiency and scalability.
• Relation to Cloud Computing:
o Grid computing is similar to cloud computing in that both allow for the sharing and
pooling of resources. However, while cloud computing is typically provided as a
service, grid computing is more focused on direct coordination of computing
resources across locations.
Cluster Computing:
Cloud Computing:
• Definition:
Cloud computing is a model of Internet-based computing that provides shared computing
resources, data, and applications to devices such as computers and smartphones on demand.
It enables users to access computing services over the internet without needing to own or
maintain the physical infrastructure.
• Key Features of Cloud Computing:
1. On-Demand Service: Users can access and use resources (like storage, computing
power, applications) as needed, without requiring long-term commitments.
2. Resource Pooling: Cloud providers pool resources (computing, storage, etc.) to
serve multiple customers, often using multi-tenant models where resources are
dynamically allocated and reassigned according to demand.
3. Scalability: The ability to scale resources up or down as per user demand. This
allows users to adjust their resource usage based on workload changes.
4. Accountability: Cloud services offer performance tracking, security measures, and
usage audits to ensure transparency and reliability.
5. Broad Network Access: Cloud services are accessible from anywhere and on any
device with an internet connection.
Cloud Services:
Big Data is used across various industries and domains to extract valuable insights, enhance
decision-making, and improve operations. Below are two examples of Big Data applications:
o Big Data helps advertisers optimize campaigns, ensuring the right ads reach the
right audiences, avoiding overuse and ensuring relevance.
2c) How does the Berkeley Data Analytics Stack (BDAS) help in analytics tasks?
The Berkeley Data Analytics Stack (BDAS) is a comprehensive framework designed to handle
Big Data by integrating various components for data processing, management, and resource
management. BDAS aims to improve performance and scalability by leveraging different
computation models and providing in-memory processing. Below are the key components and
architecture layers:
1. Applications:
o AMP-Genomics and Carat are examples of applications running on BDAS.
o AMP (Algorithms, Machines, and People Laboratory) focuses on optimizing
data processing and analytics through innovative algorithms and machine learning
models.
2. Data Processing:
o BDAS supports in-memory processing, which allows data to be processed
efficiently across different frameworks.
o It integrates batch, streaming, and interactive computations, enabling diverse
analytics capabilities.
▪ Batch processing: Handles large volumes of data in bulk.
▪ Streaming: Processes data in real time as it arrives.
▪ Interactive computations: Provides immediate feedback and results for
quick decision-making.
3. Resource Management:
o BDAS incorporates resource management software that ensures efficient
sharing of infrastructure across multiple frameworks, promoting resource
optimization and cost-efficiency.
o The system manages the allocation of resources to ensure the execution of tasks
across various components, such as Hadoop, Spark, and other frameworks.
1. Hadoop:
o A widely used framework for distributed storage and processing of large datasets,
Hadoop provides the foundation for Big Data frameworks.
2. MapReduce:
o The programming model that allows large-scale data processing by distributing
tasks across multiple nodes in a cluster.
3. Spark Core:
o The in-memory distributed computation engine of BDAS, on which batch, streaming,
and interactive workloads are executed.
MODULE-II
03 a) What is Hadoop? Explain Hadoop eco-system with neat diagram
Overview of Hadoop:
Hadoop is a powerful, open-source platform for processing and managing Big Data. It is designed
to handle large volumes of data by distributing the tasks across multiple machines in a cluster. It
uses a MapReduce programming model to break down tasks into smaller chunks and process them
in parallel. Hadoop provides a scalable, fault-tolerant, and self-healing environment that can
process petabytes of data quickly and cost-effectively. The core components of Hadoop are
designed to work together to store, process, and manage data.
1. Core Components of Hadoop:
1. Hadoop Common:
o Contains libraries and utilities required by other Hadoop modules. These include
components for distributed file systems, general I/O operations, and interfaces like
Java RPC (Remote Procedure Call).
2. Hadoop Distributed File System (HDFS):
o A Java-based distributed file system that stores large volumes of data across
multiple machines in the cluster. It ensures high availability through data
replication (default of 3 copies).
3. MapReduce:
o A programming model for processing large data sets in parallel. The Map function
processes input data into key-value pairs, and the Reduce function aggregates the
data from the Map function.
4. YARN (Yet Another Resource Negotiator):
o Manages and schedules resources across the Hadoop cluster. It allocates resources
for MapReduce jobs and manages the distributed environment effectively.
5. MapReduce v2:
o The upgraded version of MapReduce (MapReduce 2.0) that works with YARN for
enhanced resource management and parallel processing of large data sets.
2. Features of Hadoop:
1. Scalability:
o Hadoop is scalable, meaning it can easily scale up or down by adding or removing
nodes in the cluster. This flexibility allows Hadoop to process growing amounts of
data efficiently.
2. Fault Tolerance:
o HDFS ensures fault tolerance by replicating data blocks across different nodes
(default replication factor is 3). If one node fails, the data can still be accessed from
other nodes with replicated data.
3. Robust Design:
o Hadoop has a robust design with built-in data recovery mechanisms. Even if a
node or server fails, Hadoop can continue processing tasks due to replication and
failover mechanisms.
4. Data Locality:
o Data locality means that tasks are executed on the nodes where the data is stored,
reducing data transfer time and improving processing speed.
5. Open-Source and Cost-Effective:
o Hadoop is open-source and works on commodity hardware, making it a cost-
effective solution for managing and processing large datasets.
6. Hardware Fault Tolerance:
o If a hardware failure occurs, Hadoop handles it automatically by replicating data
and reassigning tasks to other nodes, ensuring that the system continues to run
smoothly.
7. Java and Linux-Based:
o Hadoop primarily uses Java for its interfaces and is built to run on Linux
environments. Hadoop’s tools and shell commands are tailored to support Linux
systems.
The Hadoop ecosystem consists of various tools and frameworks designed to handle different
aspects of Big Data processing:
1. Avro:
o A data serialization system used for storing data in a compact format. It enables
efficient communication between layers in the Hadoop ecosystem.
2. ZooKeeper:
o A coordination service that helps synchronize tasks across distributed systems. It is
used for maintaining configuration information, providing distributed
synchronization, and managing naming and configuration.
3. Hive:
o A data warehousing tool that allows users to perform SQL-like queries on data
stored in HDFS. It abstracts the complexity of MapReduce with a simpler querying
interface.
4. Pig:
o A platform for analyzing large datasets with a scripting language. It simplifies
complex data processing tasks by using a high-level language called Pig Latin.
5. Mahout:
o A machine learning library for scalable data analysis. It provides algorithms for
clustering, classification, and collaborative filtering.
HDFS (Hadoop Distributed File System) is the storage layer of the Hadoop ecosystem. It is
designed to store large files across multiple machines in a distributed environment. The key idea
behind HDFS is to distribute large data sets across multiple machines in a cluster and to provide
high availability and fault tolerance.
1. Client:
o Clients are the users or applications that interact with the Hadoop cluster. The client
submits requests to the NameNode to access data stored in HDFS. These clients
can run on machines outside the Hadoop cluster, such as the applications running
Hive, Pig, or Mahout.
2. NameNode (Master Node):
o The NameNode is the central component of HDFS. It stores metadata about the
files in the distributed file system. It manages the file system namespace, keeps
track of the file locations across the cluster, and performs functions like:
▪ Storing the metadata information of the files (e.g., file names, block
locations).
▪ Keeping a record of the file blocks that are stored across various
DataNodes.
▪ Handling client requests for reading and writing files, and directing clients
to the appropriate DataNodes.
3. Secondary NameNode:
o The Secondary NameNode is often misunderstood as a backup for the NameNode,
but it is not. It periodically merges the edits log with the fsimage to prevent the
NameNode from becoming too large. This ensures the NameNode's metadata
remains manageable and can be recovered in the event of failure.
o It does not serve as a failover for the NameNode but helps in preventing NameNode
failure by maintaining an up-to-date checkpoint.
4. DataNode (Slave Node):
o DataNodes are the worker nodes in HDFS. They are responsible for storing the
actual data blocks and serving the client requests for reading and writing data.
o Each file in HDFS is divided into blocks (typically 128MB or 256MB in size), and
these blocks are distributed across multiple DataNodes.
o DataNodes are responsible for reporting to the NameNode about the status of the
blocks (e.g., whether they are healthy, if there are any issues with the blocks).
5. JobTracker (Optional - Not a direct HDFS component but relevant for HDFS
interaction):
o JobTracker manages and schedules MapReduce jobs. It is responsible for
coordinating the execution of tasks (Map and Reduce) and ensuring the efficient
parallel processing of data across DataNodes. The JobTracker interacts with the
NameNode to locate data stored in the HDFS for MapReduce jobs.
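As an illustration (not part of the original answer), a client program can interact with HDFS over the WebHDFS interface. The sketch below assumes the third-party Python hdfs package and a hypothetical NameNode address and paths; it is a minimal sketch, not the standard Java client API.

# Minimal sketch using the third-party Python 'hdfs' (WebHDFS) client.
# The NameNode URL, user name, and file paths are hypothetical.
from hdfs import InsecureClient

client = InsecureClient('http://namenode-host:9870', user='hadoop')

# Write a small file; HDFS splits it into blocks and replicates them on DataNodes.
client.write('/user/hadoop/demo.txt', data='hello hdfs\n', overwrite=True)

# Read it back; the client asks the NameNode for block locations,
# then streams the data from the DataNodes.
with client.read('/user/hadoop/demo.txt') as reader:
    print(reader.read().decode('utf-8'))

# List the directory (metadata is served by the NameNode).
print(client.list('/user/hadoop'))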
Apache Hive:
1. Features:
o ETL Tools: Enables easy Data Extraction, Transformation, and Loading
(ETL).
o Data Structuring: Supports imposing structure on various data formats.
o Integration: Accesses files in HDFS and other systems like HBase.
o Query Execution: Uses MapReduce or Tez for query execution.
2. Basic Commands:
o Create Table: Define schema for data storage.
o Example query (grouping log entries by severity and counting them):
SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4;
3. Advanced Usage:
o Transformations: Allows integrating Python scripts for advanced data processing.
o Example: Converting UNIX timestamps to weekdays using a Python script
(weekday_mapper.py); a sketch of such a script appears after the output example below.
4. Output Example:
Using Hive queries, data can be grouped and summarized efficiently, such as counting log
message types or analyzing user data.
Weekday Count
1 13278
2 14816
3 15426
4 13774
5 17964
6 12318
7 12424
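The transformation script mentioned in point 3 above (weekday_mapper.py) is not reproduced in the answer. A minimal sketch of such a script, assuming tab-separated input rows whose last field is a UNIX timestamp, could look like this:

# weekday_mapper.py -- sketch of a Hive TRANSFORM script (assumed input layout:
# tab-separated fields with a UNIX timestamp in the last column).
import sys
from datetime import datetime

for line in sys.stdin:
    fields = line.strip().split('\t')
    if not fields or not fields[-1]:
        continue
    # Replace the timestamp with the ISO weekday number (1 = Monday ... 7 = Sunday).
    weekday = datetime.fromtimestamp(float(fields[-1])).isoweekday()
    print('\t'.join(fields[:-1] + [str(weekday)]))

In HiveQL, such a script is typically invoked with a SELECT TRANSFORM(...) USING 'python weekday_mapper.py' AS (...) clause; a GROUP BY on the resulting weekday column then produces counts like those shown above.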
Apache Hive simplifies Big Data processing with its SQL-like approach and integration with
Hadoop's ecosystem.
Figure 7.2 Two-step Sqoop data export method (Adapted from Apache Sqoop Documentation)
Apache Sqoop is a tool designed to efficiently transfer data between Hadoop and relational
databases. It simplifies importing and exporting data using a map-only Hadoop job. Here's a
breakdown of its Import and Export methods:
Import Method
1. Metadata Gathering
o Sqoop examines the source database to collect necessary metadata about the data
to be imported.
o Metadata includes information like table schema, data types, and data size.
2. Data Transfer
o A map-only Hadoop job is submitted by Sqoop.
o The job splits the data into chunks (parallel tasks) for distributed processing.
o Each task imports data into an HDFS directory.
o Default Format: Comma-delimited fields with newline-separated records (can be
customized).
Output:
• The imported data is saved in HDFS and can be processed using Hadoop, Hive, or other
tools.
Export Method
1. Metadata Examination
o Sqoop analyzes the database schema to identify how to map the HDFS data to the
target database.
2. Data Transfer
o A map-only Hadoop job is used to export data to the database.
o The data set is divided into splits, and each map task pushes one split to the
database.
o Database access is required for all nodes performing the export.
Output:
• Data from HDFS is written into the target database, enabling further operations in relational
database systems.
Apache Oozie is a workflow scheduling and coordination system specifically designed for
managing Hadoop jobs. It ensures that various interdependent jobs can execute in a defined
sequence or parallel, based on the data processing requirements.
1. Workflow Definition:
Workflows are defined in hPDL (Hadoop Process Definition Language), an XML-based
language.
2. Workflow Execution:
o DAG structure ensures orderly execution.
o Parallel or sequential dependencies are adhered to.
3. Job Monitoring:
o CLI and web UI provide real-time job tracking.
Advantages of Oozie
Limitations of Oozie
1. Complex Configuration:
Requires a good understanding of XML-based definitions.
2. Limited UI Features:
Dependency on the CLI for advanced tasks.
3. Lacks Non-Hadoop Support:
Primarily tied to Hadoop ecosystem jobs.
• Start Node → MapReduce Job → [Success: End Node | Failure: Fail Node]
• Start Node → Fork Node → (Parallel Jobs: MapReduce, Hive Query) → Join Node →
Decision Node → [Conditional Jobs] → End Node
YARN (Yet Another Resource Negotiator) is a core component of Hadoop, acting as a resource
management platform. It handles the allocation and scheduling of computational resources
across applications in the Hadoop ecosystem.
1. Resource Management:
o YARN manages computational resources like memory, CPU, and storage for
various tasks.
2. Task Scheduling:
o Schedules subtasks and ensures resources are allocated during specified time
intervals.
3. Decoupling:
o Separates resource management and processing components for scalability and
efficiency.
1. Client:
o Submits job requests to the Resource Manager (RM).
2. Resource Manager (RM):
o Acts as the master node in the cluster.
o Tracks available resources across all Node Managers (NMs).
o Contains two key services:
▪ Job History Server: Maintains records of completed jobs.
▪ Resource Scheduler: Allocates resources based on availability and
requirements.
3. Node Manager (NM):
o A slave node in the infrastructure.
o Manages local resources on individual nodes.
o Hosts Containers, which execute the subtasks of an application.
4. Application Master (AM):
o Coordinates the execution of a single application.
o Communicates resource requirements to the RM.
5. Containers:
o Units of resource allocation where application tasks run.
Advantages of YARN
1. Scalability:
Supports clusters with thousands of nodes.
2. Efficiency:
Dynamically allocates resources based on demand.
3. Flexibility:
Enables applications to run independently of the MapReduce framework.
4. Fault Tolerance:
Ensures the availability of tasks even during node failures.
MODULE-III
5a) What is NOSQL? Explain CAP Theorem.
NoSQL Overview
NoSQL stands for "Not Only SQL", representing a class of non-relational database systems.
These databases are designed for scalability, flexibility, and handling large volumes of data,
particularly in distributed systems.
1. Schema Flexibility:
No predefined schemas; data models can evolve dynamically.
2. Horizontal Scalability:
Supports the addition of new nodes to handle increasing loads.
CAP Theorem
The CAP theorem, proposed by Eric Brewer, states that a distributed system can guarantee at
most two out of three properties simultaneously:
1. Consistency (C):
o Ensures all nodes in the system see the same data at the same time.
o Example: In a distributed database, if a sale is recorded in one node, it is reflected
across all nodes immediately.
2. Availability (A):
o Guarantees that every request receives a response, even if some nodes are down.
o Example: Even during network failures, a replicated node can respond to a query.
3. Partition Tolerance (P):
o The system continues to function despite network partitions or communication
breakdowns between nodes.
o Example: Even if one region in a distributed database loses connectivity, the other
regions continue to operate.
A distributed system must choose between Consistency (C) and Availability (A) while always
maintaining Partition Tolerance (P).
Scenarios:
Key-Value Store
Description: A schema-less data store using key-value pairs, similar to a hash table.
Features:
Examples:
• Amazon DynamoDB: Scalable and managed database for applications requiring low
latency.
• Redis: In-memory data store for caching, message brokering, and real-time analytics.
• Riak: Distributed database with fault-tolerance for high-availability applications.
• Couchbase: Combines key-value storage with document-based functionality.
Advantages:
Limitations:
Use Cases:
Document Store
Description: Stores data as documents (e.g., JSON, XML) with a hierarchical structure.
Features:
Examples:
• MongoDB: A NoSQL database with a flexible schema, ideal for real-time analytics.
• CouchDB: Designed for web applications with an emphasis on availability and partition
tolerance.
Use Cases:
Column-Family Store
Description: Organizes data into rows and columns, grouping columns into families.
Features:
Examples:
Advantages:
Use Cases:
Graph Databases
Description: Focuses on representing and querying data as nodes (entities), edges (relationships),
and properties.
Features:
Examples:
Use Cases:
Object-Oriented Mapping
Examples:
Use Cases:
The Shared-Nothing (SN) architecture is a cluster-based architecture where each node operates
independently without sharing memory or disk with any other node. It is widely used in big data
tasks because of its scalability, fault tolerance, and efficiency. Below is a detailed explanation of
the SN architecture:
Definition:
• A Shared-Nothing (SN) architecture divides the workload across a cluster of nodes where
each node functions independently.
• Nodes do not share memory, disk, or data, ensuring isolation.
• Data is partitioned across nodes, and processing tasks are distributed among them.
1. Independence:
o Each node operates autonomously without memory or data sharing.
o Nodes possess computational self-sufficiency, reducing contention.
2. Self-Healing:
o The system can recover from link failures by creating alternate links or
redistributing tasks.
3. Sharding of Data:
o Each node stores a shard (a portion of the database).
o Shards improve performance by distributing workloads across nodes.
4. No Network Contention:
o Since nodes work independently, there is minimal network congestion during
processing.
5. Scalability:
o Horizontal scaling is achieved by adding more nodes to the cluster.
o Allows handling of large datasets efficiently.
6. Fault Tolerance:
o A node failure does not disrupt the entire system; tasks are redistributed to other
nodes.
Advantages:
1. Horizontal Scalability:
o Easy to scale by adding more nodes to the cluster.
2. High Performance:
o No resource contention ensures faster processing.
3. Fault Tolerance:
o System remains operational even if some nodes fail.
4. Cost-Effective:
o Commodity hardware can be used for setting up the architecture.
1. Hadoop:
o Distributes data and processing tasks across multiple nodes using HDFS and
MapReduce.
2. Apache Spark:
o Processes data in-memory across independent nodes.
3. Apache Flink:
o Executes stream processing in a distributed, shared-nothing manner.
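As an illustrative sketch (not from the original answer), a shared-nothing style computation in Spark might look like the following PySpark word count, where each partition of the input is processed independently on whichever node holds it. The application name and file path are hypothetical.

# Minimal PySpark sketch: data is partitioned across nodes and each partition
# is processed independently (shared-nothing), with results combined at the end.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sn-wordcount-sketch").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///data/input.txt")   # hypothetical path

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

spark.stop()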
Conclusion:
The Shared-Nothing architecture is highly effective for big data tasks due to its scalability, fault
tolerance, and independence. It is a preferred choice for distributed computing systems like
Hadoop, Spark, and Flink, enabling efficient processing of massive datasets.
MongoDB is an open-source, NoSQL database management system (DBMS) that stores data in a
document-oriented format. It is widely used for handling large-scale data due to its flexibility and
scalability.
1. Non-relational and NoSQL: MongoDB does not rely on a fixed schema and supports
unstructured data.
2. Document-based: Data is stored in BSON (Binary JSON) format, making it easy to store
hierarchical relationships.
3. Dynamic Schema: Unlike RDBMS, MongoDB collections can store documents with
varying fields.
4. Cross-platform: Compatible with multiple operating systems like Windows, Linux, and
macOS.
5. Scalability: Supports horizontal scaling through sharding for better data distribution.
6. High Performance: Supports fast data retrieval and updates due to its efficient indexing
and querying mechanism.
7. Fault Tolerance: Ensures data availability using replication through replica sets.
Features of MongoDB
Advantages of MongoDB
Limitations of MongoDB
Command Function
use <db> Creates or switches to a database.
db.<collection>.insert() Inserts a document into a collection.
db.<collection>.find() Retrieves all documents from a collection.
db.<collection>.update() Updates a document in a collection.
db.<collection>.remove() Deletes a document from a collection.
db.stats() Displays statistics about the MongoDB server.
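For illustration (not part of the original answer), the same operations can be issued from Python with the pymongo driver; the connection URI, database name, and collection name below are hypothetical.

# Sketch of the shell commands above using the pymongo driver.
# The URI, database name ('shop') and collection name ('products') are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]                                                # like: use shop

db.products.insert_one({"name": "pen", "price": 10})               # like: db.products.insert()
for doc in db.products.find({"price": {"$lt": 50}}):               # like: db.products.find()
    print(doc)

db.products.update_one({"name": "pen"}, {"$set": {"price": 12}})   # like: db.products.update()
db.products.delete_one({"name": "pen"})                            # like: db.products.remove()
print(db.command("dbstats"))                                       # like: db.stats()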
1. Replication:
o Achieved using a replica set (one primary and multiple secondary nodes).
o Provides high availability and automatic failover during server failures.
2. Sharding:
o Data distribution method to scale horizontally.
o Automatically balances data across servers and improves write throughput.
Conclusion
MongoDB is a robust, NoSQL database solution tailored for modern application requirements. Its
flexibility, scalability, and ease of use make it an excellent choice for developers dealing with
dynamic and large-scale datasets.
MongoDB and RDBMS (Relational Database Management Systems) differ fundamentally in their
structure, features, and use cases. Below is a comparative analysis:
Replication in MongoDB
Replication ensures data availability and reliability. MongoDB implements replication using
replica sets.
• Replica Set: A group of MongoDB servers that maintain copies of the same dataset.
• Components:
o Primary Node: Receives all write operations.
o Secondary Nodes: Sync data from the primary node and serve as backups.
o Automatic Failover: If the primary node fails, a secondary node is promoted to
primary.
o Reintegration: Failed nodes can rejoin as secondary nodes after recovery.
Command Description
rs.initiate() Initiates a new replica set.
rs.conf() Displays the replica set configuration.
rs.status() Checks the current status of the replica set.
rs.add() Adds members to the replica set.
Auto-Sharding in MongoDB
Sharding distributes data across multiple servers to handle large-scale data efficiently.
• Purpose: Supports horizontal scaling to manage increased data sizes and demands.
• Mechanism: Automatically balances data and load among servers.
• Benefits:
o Improves write throughput by distributing operations across multiple mongod
instances.
o Cost-effective alternative to vertical scaling.
MongoDB supports a wide range of BSON data types to accommodate diverse data requirements:
1. Query Language: Provides a robust query mechanism similar to SQL for document
retrieval.
2. Secondary Indexes: Support text search and geospatial queries.
3. Aggregation Framework: Powerful for real-time data analysis.
4. Flexibility: Dynamic schemas enable handling unstructured and semi-structured data
efficiently.
1. Schema Flexibility: Documents in the same collection can have different structures.
2. High Scalability: Sharding supports horizontal scaling for large datasets.
3. Performance: Optimized for high-speed read and write operations.
4. Embedded Relationships: Avoid complex joins by storing related data together.
Conclusion
MongoDB, with its document-oriented model and features like replication, sharding, and rich data
types, is a powerful NoSQL database. It is particularly suited for Big Data applications, real-time
analytics, and scenarios requiring scalability and flexibility. The comparison with RDBMS
highlights MongoDB’s advantages in modern application development.
MODULE-IV
MapReduce is a programming model used for processing large datasets in a distributed manner. It
is commonly used in Hadoop for parallel processing and fault tolerance. Below is a detailed
explanation of the MapReduce execution steps and how the system handles node failures:
1. Job Submission
The process starts when the user submits a MapReduce job to the Hadoop JobTracker. The job
consists of two main parts: the Map function and the Reduce function.
2. Job Initialization
3. Data Splitting
• The input data is split into smaller chunks (typically called splits) by the JobTracker.
• Each split is processed by a Map task.
• These splits are distributed across the nodes to enable parallel processing.
4. Map Phase
• Map Function: Each Map task takes an input split and processes it. The Map function
reads the input, processes it, and emits key-value pairs as output.
• Data Flow: The output from the Map function is sent to the Map output collector, which
writes the intermediate data to local disk (in sorted order) on each node.
• Partitioning: The system partitions the output based on a partitioning function, which
ensures that related data is sent to the same reducer.
5. Shuffle and Sort Phase
• After the Map phase, the Shuffle and Sort phase begins. In this phase, the data emitted by
the mappers is sorted by key and grouped by the same key.
• The Shuffle phase involves moving data from the Map nodes to the appropriate Reduce
nodes. This is done by the Shuffle function, which groups the data by key.
6. Reduce Phase
• Reduce Function: After the data is shuffled and sorted, each Reduce task receives a set
of key-value pairs grouped by key. The Reduce function processes these pairs to produce
the final output.
• The output of the reduce tasks is written back to the HDFS (Hadoop Distributed File
System).
7. Job Completion
• Once the Reduce tasks complete, the results are stored in HDFS, and the JobTracker
notifies the client that the job is complete.
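The Map and Reduce functions described above can be sketched in Python in the style of Hadoop Streaming, which pipes key-value pairs through standard input and output. This is a minimal, illustrative word-count sketch under that assumption, not the classic Java API.

# mapper.py -- emits (word, 1) pairs, one per line, as described in the Map phase.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- receives lines already sorted by key (word) after shuffle/sort
# and aggregates the counts, as described in the Reduce phase.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Such scripts would typically be launched with the Hadoop Streaming jar, passing -input, -output, -mapper, and -reducer options; the framework handles splitting, shuffling, sorting, and writing the final output to HDFS.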
Hadoop ensures fault tolerance by recovering from node failures during the execution of
MapReduce jobs. The JobTracker and TaskTracker components handle this fault tolerance.
3. JobTracker Failure:
o If the JobTracker fails (which could happen if only one JobTracker is running),
the entire MapReduce job aborts.
o The client is notified about the failure, and the job must restart if there's no backup
JobTracker.
• Each node (TaskTracker) communicates periodically with the JobTracker to signal its
health.
• If a node doesn't communicate with the JobTracker for a specified duration (default 10
minutes), the node is considered failed.
• Re-execution of tasks can happen on another TaskTracker, ensuring that the MapReduce
job continues without disruption.
• Hive is a data warehousing and SQL-like query language system built on top of Hadoop.
• It simplifies the process of querying and managing large datasets in Hadoop using HiveQL,
which is similar to SQL.
• Hive allows users to run queries on data stored in HDFS (Hadoop Distributed File
System).
• It is used for data analysis, summarization, and aggregation in big data environments.
Hive Architecture
1. Hive Server (Thrift)
o Exposes a client API for executing HiveQL queries.
o Supports various programming languages (e.g., Java, Python, C++).
o Allows remote clients to submit queries to Hive and retrieve results.
2. Hive CLI (Command Line Interface)
o A popular interface to interact with Hive.
o Runs Hive in local mode (using local storage) or distributed mode (using HDFS).
o Allows execution of HiveQL queries directly from the command line.
3. Web Interface (HWI)
• Grunt Shell:
The execution of a Pig script goes through several stages, each responsible for transforming and
processing the data.
1. Parser:
o The Parser handles the Pig script after it's passed through the Grunt Shell or Pig
Server.
o Function: It checks the script for syntax errors and performs type checking.
o The output of the parsing step is a Directed Acyclic Graph (DAG).
▪ DAG (Directed Acyclic Graph):
▪ Represents the sequence of Pig Latin statements.
▪ Nodes in the DAG represent logical operators.
▪ Edges between the nodes represent the data flows.
▪ Acyclic means the graph contains no cycles: data flows in one
direction, so each set of inputs is processed once and produces one output.
2. Optimizer:
o The Optimizer optimizes the DAG before passing it for compilation.
o Optimization Features:
▪ PushUpFilter: Splits and pushes up filter conditions to reduce the data
early.
▪ PushDownForEachFlatten: Postpones the flattening operation to
minimize record expansion.
▪ ColumnPruner: Eliminates unused or unnecessary columns to reduce
record size.
▪ MapKeyPruner: Removes unused map keys, optimizing data storage.
▪ Limit Optimizer: If a limit operation is used immediately after loading or
sorting data, it applies optimizations to reduce unnecessary processing by
limiting the dataset size earlier in the process.
3. Compiler:
o After optimization, the Compiler compiles the optimized DAG into a series of
MapReduce jobs.
o These jobs represent the logical steps required to process the Pig script.
4. Execution Engine:
o The Execution Engine is responsible for running the MapReduce jobs.
o The jobs are executed on the Hadoop cluster, and the final results are produced
after processing the data.
• Main Purpose: The Grunt Shell is used to write and execute Pig Latin scripts
interactively.
• Command Syntax:
o sh command: Invokes shell commands from within the Grunt shell.
▪ Example: grunt> sh ls
o ls command: Lists files in the Grunt shell environment.
▪ Example: grunt> ls
1. Execution Process:
o Pig scripts are executed in one of the three ways: Grunt Shell, Script File, or
Embedded Script.
o The parser processes the Pig script and produces a DAG, which is optimized by the
optimizer.
o The optimizer reduces unnecessary data processing, making the execution more
efficient.
o The optimized DAG is compiled into MapReduce jobs, which are executed by the
execution engine.
2. Pig Latin Data Model:
o Pig supports both primitive and complex data types, enabling flexible data
handling during processing.
3. Grunt Shell provides an interactive environment to write and test Pig scripts, making it
easier for users to experiment with Pig Latin queries and functions.
This architecture enables efficient big data processing, making it a powerful tool in the Hadoop
ecosystem for ETL processes, data transformation, and analysis.
• MapReduce uses key-value pairs at different stages to process and manipulate data. The
data must be converted into key-value pairs before being passed to the Mapper, as it only
understands and processes key-value pairs.
• InputSplit:
o Defines a logical representation of the data. It splits the data into smaller chunks
for processing by the map() function.
• RecordReader:
o Communicates with the InputSplit and converts the data into key-value pairs
suitable for processing by the Mapper.
o By default, TextInputFormat is used to convert text data into key-value pairs.
o RecordReader continues processing until the entire file is read.
4. Grouping by Key
• After the map() task completes, the Shuffle process groups all the Mapper outputs by the
key.
o All key-value pairs with the same key are grouped together.
o A "Group By" operation is performed on the intermediate keys, resulting in a list
of values (v2) associated with each key (k2).
o The output of the Shuffle and Sorting phase will be a list of <k2, List(v2)>.
6. Partitioning
• The Partitioner is responsible for distributing the output of the map() tasks into different
partitions.
o It is an optional class and can be specified by the MapReduce driver.
o Partitions help divide the key-value pairs across different Reducer tasks, ensuring
efficient data processing.
• The Partitioner executes locally on each machine that performs a map task.
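The default partitioning behaviour described above can be sketched as a hash of the key modulo the number of reducers. The snippet below is an illustrative Python sketch (Hadoop's actual HashPartitioner is implemented in Java along the same lines).

# Sketch of hash partitioning: every map task applies the same function,
# so identical keys always land in the same reduce partition.
def partition(key, num_reducers):
    return hash(key) % num_reducers

# Example: with 3 reducers, all occurrences of "error" go to the same partition.
print(partition("error", 3))
print(partition("info", 3))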
7. Combiners
8. Reduce Tasks
MODULE-V
Q. 09 a What is Machine Learning? Explain different types of Regression Analysis
Machine Learning:
Machine Learning is a branch of artificial intelligence in which systems learn patterns from data
and improve their performance on a task through experience, without being explicitly
programmed for every case.
K-Means Clustering Algorithm:
1. Initialization:
o Randomly initialize k cluster centroids (C_1, C_2, ..., C_k).
o These centroids act as initial cluster centers.
2. Assignment:
o Calculate the distance between each data point and all centroids.
o Assign each data point to the cluster with the nearest centroid.
o Common distance metrics:
▪ Euclidean Distance: d(x, y) = sqrt( Σ_i (x_i − y_i)^2 )
▪ Manhattan Distance: d(x, y) = Σ_i |x_i − y_i|
▪ Cosine Distance: d(x, y) = 1 − (x · y) / (‖x‖ ‖y‖)
3. Update Centroids:
o Compute the new centroid of each cluster by calculating the mean position of all
points in the cluster: C_k = (1 / N_k) Σ_{i=1}^{N_k} X_i, where N_k is the number of
points in cluster k.
4. Iterative Refinement:
o Repeat the Assignment and Update steps until:
▪ Centroids no longer change, or
▪ A predefined stopping criterion is met (e.g., maximum iterations).
5. Output:
o A set of k clusters with minimal intra-cluster distance and maximal inter-cluster
distance.
Algorithm Steps:
1. Input:
o N: Number of data points (objects).
o k: Number of clusters.
2. Output:
o k clusters with minimized distance between points and their centroids.
3. Steps:
o Step 1: Randomly initialize k centroids.
o Step 2: Assign each point to the nearest centroid.
o Step 3: Update each centroid to the mean of the points in its cluster.
o Step 4: Repeat until centroids stabilize (no change in cluster membership).
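A compact NumPy sketch of the steps above (random initialization, Euclidean distance, mean update); the data, k, and the iteration limit are illustrative, not part of the original answer.

# Minimal k-means sketch following the steps above (illustrative, not optimized).
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Tiny usage example with two obvious clusters.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)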
Diagram Representation:
Naïve Bayes Classifier:
1. Definition:
o A supervised machine learning technique based on probability theory.
o Computes the probability of an instance belonging to each class using prior
probabilities and likelihoods.
2. Formula:
o P(C_k | X) = P(X | C_k) · P(C_k) / P(X), where P(C_k) is the prior probability of
class C_k, P(X | C_k) is the likelihood of instance X given the class, and P(X) is the
evidence.
o With the naïve independence assumption, P(X | C_k) = Π_i P(x_i | C_k), and the
predicted class is the one that maximizes P(C_k) · Π_i P(x_i | C_k).
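As an illustrative sketch (not part of the original answer), the classifier can be tried with scikit-learn's MultinomialNB on a tiny, made-up word-count data set:

# Sketch: Naive Bayes classification with scikit-learn (toy, made-up data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win money now", "meeting schedule today", "win a free prize", "project status meeting"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer().fit(docs)                       # bag-of-words features
model = MultinomialNB().fit(vec.transform(docs), labels)

# P(class | document) is computed from class priors and word likelihoods (Bayes' theorem).
print(model.predict(vec.transform(["free money prize"])))       # expected: ['spam']
print(model.predict_proba(vec.transform(["free money prize"])))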
Advantages:
Disadvantages:
Text mining is a systematic process to extract meaningful information from textual data. The five
phases of text mining are:
Phase 1: Text Pre-processing
This phase focuses on preparing raw textual data for further analysis by performing the following
steps:
1. Text Cleanup:
o Removes unnecessary or unwanted information.
o Corrects typos (e.g., "teh" becomes "the").
o Resolves inconsistencies, removes outliers, and fills missing values.
o Example: Removing comments or replacing "%20" in URLs.
2. Tokenization:
o Splits text into tokens (words) using white spaces and punctuation as delimiters.
3. POS Tagging:
o Labels each word with its part of speech (noun, verb, etc.).
o Helps recognize entities like names or places.
4. Word Sense Disambiguation:
o Identifies the correct meaning of ambiguous words based on context.
o Example: "bank" could mean a financial institution or a riverbank.
5. Parsing:
o Creates a grammatical structure (parse-tree) for sentences.
o Determines relationships between words.
1. Dimensionality Reduction:
o Removes redundant and irrelevant features.
o Methods include PCA and LDA.
2. N-gram Evaluation:
o Identifies sequences of words (e.g., "tasty food" for 2-gram).
3. Noise Detection:
o Identifies and removes unusual or suspicious data points.
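A small illustration (not in the original answer) of the n-gram evaluation described in point 2 above, using scikit-learn's CountVectorizer on a made-up sentence:

# Sketch: extracting 1-grams and 2-grams from text (toy example).
from sklearn.feature_extraction.text import CountVectorizer

text = ["the restaurant serves tasty food and tasty drinks"]
vec = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams such as "tasty food"
counts = vec.fit_transform(text)

for ngram, idx in sorted(vec.vocabulary_.items()):
    print(ngram, counts[0, idx])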
1. Unsupervised Learning:
o Clustering groups similar data without predefined labels.
o Example: Grouping blog posts by topic.
2. Supervised Learning:
o Classification assigns labels based on training data.
o Example: Email spam filtering.
3. Evolutionary Pattern Identification:
o Summarizes changes over time, like trends in news articles.
1. Result Evaluation:
o Determines if results meet expectations.
2. Interpretation:
o Discards or refines processes based on results.
3. Visualization:
o Prepares visual representations for better understanding.
4. Utilization:
o Applies insights to improve industry or enterprise activities.
These phases illustrate the systematic approach and complexity of text mining.
Web usage mining refers to the process of extracting useful information and patterns from the data
generated through webpage visits and transactions. This involves analyzing the activity data
captured at various levels during a user's interaction with websites and applications.
Sources of Data
Additionally, metadata such as page attributes, content attributes, and usage data is collected to
provide a comprehensive dataset.
Analysis Levels
1. Server-Side Analysis:
o Focuses on the relative popularity of web pages accessed.
o Identifies hubs (central resources) and authorities (high-value content sources).
2. Client-Side Analysis:
o Focuses on user activity and content consumed.
o Comprises two main types of analysis:
1. Usage Pattern Analysis:
▪ Uses clickstream analysis to track the sequence of clicks, locations,
and durations of visits.
▪ Applications: Web activity analysis, market research, software
testing, employee productivity analysis.
2. Content Analysis:
▪ Textual information accessed by users is structured using techniques
like the Bag-of-Words model.
▪ The text is analyzed for:
▪ Cluster Analysis: Grouping similar topics.
▪ Association Rules: Finding patterns like user segmentation
or sentiment trends.
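As a hedged sketch (not in the original answer) of this content-analysis step, page texts can be turned into Bag-of-Words vectors and grouped with a clustering algorithm; the page texts below are made up.

# Sketch: Bag-of-Words (TF-IDF) representation of page text plus cluster analysis.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

pages = [
    "laptop review battery performance",
    "smartphone camera review battery",
    "pasta recipe tomato sauce",
    "easy dinner recipe sauce",
]

vectors = TfidfVectorizer().fit_transform(pages)                 # Bag-of-Words weights
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(clusters)   # gadget pages and recipe pages fall into separate clusters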
Key Techniques
Web usage mining provides insights that are crucial for personalized user experiences, business
growth, and effective marketing strategies.