BDA Model QP Solution

21CS71

MODULE-I
1a) What is Big Data? Explain the evolution of Big Data and its characteristics.

Figure 1.1 Evolution of Big Data and their characteristics

1. Definition:
o Big Data refers to large, complex data sets that traditional data processing software
cannot handle.
o It includes structured, semi-structured, and unstructured data, often characterized
by high volume, velocity, variety, and veracity.
2. Purpose:
o The goal of Big Data is to analyze vast amounts of data to gain insights, improve
decision-making, and optimize processes.

Evolution of Big Data

1. Early Data (Pre-2000s):


o Data was mostly structured and stored in relational databases (RDBMS).
o Tools like RDBMS were sufficient to handle smaller amounts of structured data (in
megabytes and gigabytes).
2. Big Data Revolution (2000s-Present):
o Rapid increase in data volume, variety, and velocity with the growth of social
media, IoT, and sensors.
o Traditional systems could not scale, leading to the development of technologies like
Hadoop, NoSQL databases (e.g., MongoDB), and cloud storage solutions.
o The rise of Big Data analytics allowed businesses to process complex data for real-
time decision-making.

3. Modern Era (Present and Future):


o Integration of Big Data with AI and machine learning for predictive analytics and
decision-making.
o Edge computing and IoT play a role in processing data in real-time at the source.
o Data democratization allows non-experts to access and use Big Data insights for
business applications.

Characteristics of Big Data (The 4 Vs)

1. Volume:
o Refers to the massive amount of data generated, typically measured in petabytes or
exabytes.
o Data comes from various sources like social media, sensors, and online
transactions.
2. Velocity:
o Describes the speed at which data is generated and needs to be processed.
o Real-time data generation and processing are common (e.g., sensor data, social
media posts).
3. Variety:
o Represents the different types of data: structured (tables), semi-structured (XML,
JSON), and unstructured (text, videos).
o Big Data includes various formats that need to be processed and integrated.
4. Veracity:
o Refers to the quality and accuracy of data.
o Big Data often contains noisy, incomplete, or inconsistent data that needs to be
cleaned and validated for analysis.

Additional Vs (Optional)

5. Value:
o Focuses on the usefulness of the data and the insights that can be extracted.
o The value lies in turning raw data into actionable intelligence.
6. Variability:
o Describes the changing nature of data, especially from sources like social media or
IoT devices.
o Data patterns can fluctuate over time, requiring dynamic analysis techniques.

Classification of Data

Data can be classified into four main categories based on its structure:

1. Structured Data:
o This data is highly organized and adheres to specific schemas, such as rows and
columns in relational databases. Examples include data stored in traditional
databases (e.g., RDBMS).
o Key operations on structured data include insertion, deletion, updating, and
indexing for fast retrieval.
2. Semi-Structured Data:
o This type of data does not conform to a strict data model but contains markers that
separate elements within the data. Examples include XML and JSON documents,
which can be parsed and processed in a flexible manner.
3. Multi-Structured Data:
o This data combines elements of structured, semi-structured, and unstructured data,
often found in non-transactional systems like data warehouses or customer
interaction logs.
4. Unstructured Data:
o Data that lacks a predefined structure or schema. Examples include text files,
images, videos, social media content, emails, and sensor data. This data often
requires advanced processing techniques to extract meaningful insights.

Big Data can come from various sources, each generating different types of data:

• Social Networks and Web Data: Data generated by users on platforms like Facebook,
Twitter, emails, blogs, and YouTube.
• Transactional and Business Process Data: Includes data from credit card transactions,
flight bookings, medical records, and insurance claims.
• Machine-Generated Data: Includes data from Internet of Things (IoT) devices, sensors,
and machine-to-machine communications.
• Human-Generated Data: Includes biometric data, human-machine interaction data,
emails, and personal documents.

Examples of Big Data Use Cases

• Chocolate Marketing Company: A company with a large network of Automatic
Chocolate Vending Machines (ACVMs) might collect vast amounts of data related to
customer interactions and vending machine usage.
• Predictive Automotive Maintenance: Data from connected vehicles can be used to
predict maintenance needs, improving customer service and reducing downtime.
• Weather Data and Prediction: Large volumes of data from weather stations, satellites,
and sensors can be processed to monitor and predict weather patterns.

1b) Explain the following terms:


i. Scalability & Parallel Processing ii. Grid & Cluster Computing.

i. Scalability & Parallel Processing


Scalability:

• Definition: Scalability refers to the capability of a system to handle a growing amount of
work or its potential to accommodate growth. It is the ability to expand or shrink resources
as per demand.
• Types of Scalability:

o Vertical Scalability (Scaling Up):


▪ Involves adding more resources (like CPUs, RAM, storage) to a single
machine.
▪ Increases the power of a single system to handle more workload.
▪ Example: Adding more memory or processing power to a server to handle
larger data.
o Horizontal Scalability (Scaling Out):
▪ Involves adding more machines to a network to handle larger workloads.
▪ Distributes the load across multiple systems, improving performance by
parallelizing tasks.
▪ Example: Adding more nodes to a cloud infrastructure to distribute data
processing tasks.
• Importance in Big Data:
o Big Data applications require the ability to handle massive data volumes, and
scalability ensures that systems can adjust to increasing data size, complexity, and
processing requirements.

Parallel Processing:

• Definition: Parallel processing refers to the simultaneous execution of multiple tasks or
operations in parallel to speed up computing.
• Types of Parallel Processing:
o Task Parallelism: Distributing different tasks to different processors.
o Data Parallelism: Distributing chunks of data to different processors to perform
the same task.
• Massive Parallel Processing (MPP):
o Involves distributing data and tasks across multiple nodes or processors to process
large datasets concurrently.
o It can be implemented at various levels:
▪ Within a single computer (multiple CPUs or cores),
▪ Across multiple computers (in a cluster or grid setup).
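
As a small single-machine illustration of data parallelism, the sketch below splits a dataset into chunks and applies the same function to each chunk on separate worker processes, then combines the partial results (Python standard library only; the chunk count and sample data are arbitrary choices for the example).

# Data-parallelism sketch: the same task runs on different chunks of the data.
from multiprocessing import Pool

def process_chunk(chunk):
    # The identical task applied to every chunk: summing its numbers.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]             # partition the data into 4 chunks
    with Pool(processes=4) as pool:
        partial_sums = pool.map(process_chunk, chunks)  # parallel "map" over the chunks
    print(sum(partial_sums))                            # combine the partial results

Scaling out replaces the worker processes with separate machines, but the pattern of partitioning data and combining partial results stays the same.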

ii. Grid & Cluster Computing


Grid Computing:

• Definition: Grid computing is a distributed computing model where a network of
computers, often geographically dispersed, works together to solve a common task.
• Key Characteristics:
o Resource Sharing: Resources like processing power, storage, and networks are
shared across multiple machines.
o Geographically Distributed: The computers involved in grid computing may be
located in different places.
o Heterogeneous: The participating machines can vary in architecture, operating
systems, and configurations.
• Functionality:

o Grid computing helps combine the power of multiple machines to perform complex
tasks that a single machine would struggle with.
o It allows for distributed and parallel computation of tasks across the grid, improving
efficiency and scalability.
• Relation to Cloud Computing:
o Grid computing is similar to cloud computing in that both allow for the sharing and
pooling of resources. However, while cloud computing is typically provided as a
service, grid computing is more focused on direct coordination of computing
resources across locations.

Cluster Computing:

• Definition: Cluster computing involves a group of interconnected computers (or nodes)
working together as a single system to perform tasks.
• Key Characteristics:
o Nodes: A cluster consists of multiple nodes, each of which is a computer that is
connected to other computers through a network.
o Load Balancing: Clusters work by distributing tasks across nodes to maintain an
even load, ensuring that no single node is overwhelmed.
o Homogeneous: Typically, the machines in a cluster have similar configurations
and are located close to each other in terms of physical proximity (often in the same
data center).
• Functionality:
o Clusters are primarily used for improving performance by parallelizing workloads,
providing high availability, and ensuring fault tolerance.
o It helps in maintaining performance even if one node fails, as the load is shifted to
the remaining nodes.
• Use Cases:
o High-Performance Computing (HPC): Used for scientific simulations, financial
modeling, etc.
o Big Data Processing: Large-scale data analytics applications often use clusters for
processing large volumes of data quickly.

2a) What is Cloud Computing? Explain the different cloud services.

Cloud Computing:

• Definition:
Cloud computing is a model of Internet-based computing that provides shared computing
resources, data, and applications to devices such as computers and smartphones on demand.
It enables users to access computing services over the internet without needing to own or
maintain the physical infrastructure.
• Key Features of Cloud Computing:
1. On-Demand Service: Users can access and use resources (like storage, computing
power, applications) as needed, without requiring long-term commitments.
2. Resource Pooling: Cloud providers pool resources (computing, storage, etc.) to
serve multiple customers, often using multi-tenant models where resources are
dynamically allocated and reassigned according to demand.
3. Scalability: The ability to scale resources up or down as per user demand. This
allows users to adjust their resource usage based on workload changes.
4. Accountability: Cloud services offer performance tracking, security measures, and
usage audits to ensure transparency and reliability.
5. Broad Network Access: Cloud services are accessible from anywhere and on any
device with an internet connection.

Cloud Services:

There are three primary types of cloud services:

1. Infrastructure as a Service (IaaS):


o Definition: IaaS provides virtualized computing resources over the internet. It
allows users to rent computing infrastructure (such as servers, storage, and
networking) on-demand.
o Key Features:
▪ Provides fundamental resources such as processing power, storage, and
networking.
▪ Users manage operating systems and applications.
▪ Scalable and flexible infrastructure.
o Examples:
▪ Amazon Web Services (AWS) Elastic Compute Cloud (EC2)
▪ Microsoft Azure
▪ Apache CloudStack
o Use Case: IaaS is commonly used by businesses that need to run applications and
manage workloads without investing in physical hardware.
2. Platform as a Service (PaaS):
o Definition: PaaS provides a platform and environment to allow developers to build,
deploy, and manage applications without worrying about the underlying
infrastructure.
o Key Features:

▪ Provides tools for application development, testing, and deployment.


▪ Users focus on developing applications while the provider manages the
infrastructure, operating systems, and middleware.
▪ Includes services like databases, development frameworks, and deployment
tools.
o Examples:
▪ Hadoop Cloud Services (IBM BigInsight, Microsoft Azure HD Insights)
▪ Google App Engine
▪ Oracle Cloud Platform
o Use Case: PaaS is ideal for developers who need a platform to develop and run
applications without managing underlying hardware or software layers.
3. Software as a Service (SaaS):
o Definition: SaaS provides software applications over the internet, where users
access software hosted on the provider's servers, eliminating the need for
installation and maintenance on local machines.
o Key Features:
▪ Software applications are accessible via a web browser.
▪ The provider manages all aspects of the software, including updates,
security, and maintenance.
▪ Subscription-based model with pay-as-you-go pricing.
o Examples:
▪ Google SQL
▪ Microsoft Office 365
▪ Oracle Big Data SQL
o Use Case: SaaS is used by individuals or organizations that need access to software
tools without maintaining infrastructure or performing updates.

Summary of Cloud Services:


Service | Description                                                                 | Examples                          | Use Cases
IaaS    | Provides virtualized resources over the internet (e.g., servers, storage). | AWS EC2, Microsoft Azure          | Businesses requiring scalable infrastructure for apps and workloads.
PaaS    | Platform for building, deploying, and managing applications.               | IBM BigInsight, Google App Engine | Developers needing tools to build apps without managing infrastructure.
SaaS    | Software applications delivered over the internet.                         | Google SQL, Office 365            | Users accessing ready-to-use software applications online.

2b) Explain any two different Big Data applications.

Big Data is used across various industries and domains to extract valuable insights, enhance
decision-making, and improve operations. Below are two examples of Big Data applications:

Big Data Applications

1. Big Data in Marketing and Sales:


o Big data analytics helps businesses understand Customer Value (CV), which is
based on quality, service, and price.
o Customer Value Analytics (CVA) identifies what customers really need, helping
companies like Amazon deliver personalized experiences.
o Data-driven insights allow marketers to predict customer behavior and target the
right products, optimizing sales strategies.
2. Big Data in Fraud Detection:
o Big data enables fraud detection by fusing data from multiple sources such as
social media, websites, and emails.
o It helps identify patterns and anomalies that indicate potential fraud and enhances
data visualization and reporting for better decision-making.
3. Big Data Risks:
o While Big Data offers valuable insights, it also presents risks:
▪ Data Security: Protecting sensitive information is a concern.
▪ Data Privacy Breaches: Personal data may be compromised.
▪ Costs: The expense of processing large volumes of data could reduce
profits.
▪ Bad Analytics and Bad Data: Incorrect data can lead to misleading
insights and poor decisions.
4. Big Data in Credit Card Risk Management:
o Financial institutions use Big Data to assess credit risks, including:
▪ Identifying high-risk individuals or groups.
▪ Predicting loan defaults and issues with timely repayments.
▪ Evaluating risks associated with specific sectors and employees.
5. Big Data in Healthcare:
o Big data analytics in healthcare uses various data sources like clinical records and
medical logs to:
▪ Provide customer-centric healthcare.
▪ Reduce fraud, waste, and abuse in the system.
▪ Monitor patients in real time and improve outcomes.
o It also utilizes the Internet of Things (IoT) for better healthcare services.
6. Big Data in Medicine:
o In medical research, Big Data helps build health profiles and predictive models to
diagnose diseases better.
o Aggregating data from DNA, proteins, cells, and other biological sources helps
enhance disease understanding.
o Data from wearable devices provides real-time insights on patients' health and
helps in risk profiling for diseases.
7. Big Data in Advertising:
o The digital advertising industry uses Big Data to create targeted advertisements
across channels like SMS, email, and social media platforms.
o Real-time analytics uncover trends and insights, enabling hyper-localized
advertising (personalized, targeted ads).

o Big Data helps advertisers optimize campaigns, ensuring the right ads reach the
right audiences, avoiding overuse and ensuring relevance.

2c) How does the Berkeley Data Analytics Stack (BDAS) help in analytics tasks?

Berkeley Data Analytics Stack (BDAS)

The Berkeley Data Analytics Stack (BDAS) is a comprehensive framework designed to handle
Big Data by integrating various components for data processing, management, and resource
management. BDAS aims to improve performance and scalability by leveraging different
computation models and providing in-memory processing. Below are the key components and
architecture layers:

Key Components of BDAS:

1. Applications:
o AMP-Genomics and Carat are examples of applications running on BDAS.
o AMP (Algorithms, Machines, and People Laboratory) focuses on optimizing
data processing and analytics through innovative algorithms and machine learning
models.
2. Data Processing:
o BDAS supports in-memory processing, which allows data to be processed
efficiently across different frameworks.
o It integrates batch, streaming, and interactive computations, enabling diverse
analytics capabilities.
▪ Batch processing: Handles large volumes of data in bulk.
▪ Streaming: Processes data in real time as it arrives.
▪ Interactive computations: Provides immediate feedback and results for
quick decision-making.
3. Resource Management:
o BDAS incorporates resource management software that ensures efficient
sharing of infrastructure across multiple frameworks, promoting resource
optimization and cost-efficiency.
o The system manages the allocation of resources to ensure the execution of tasks
across various components, such as Hadoop, Spark, and other frameworks.

Four-Layer Architecture for Big Data Stack (BDAS):

1. Hadoop:
o A widely used framework for distributed storage and processing of large datasets,
Hadoop provides the foundation for Big Data frameworks.
2. MapReduce:
o The programming model that allows large-scale data processing by distributing
tasks across multiple nodes in a cluster.
3. Spark Core:
o A powerful in-memory data processing engine that supports batch processing,
streaming, and iterative processing, making it faster than traditional MapReduce.
o It integrates other components like SparkSQL, GraphX, MLlib, and R for
enhanced data analytics.
4. Advanced Components:
o SparkSQL: A module for querying structured data with SQL.
o Streaming: A framework for processing real-time data streams.
o R: Supports advanced statistical computing and data analysis.
o GraphX: A component for graph processing and analytics.
o MLlib: A machine learning library for scalable machine learning algorithms.
o Mahout: A library for scalable machine learning algorithms, often used for
recommendation systems.
o Kafka: A distributed messaging system that is highly scalable, used for handling
real-time data streams.
o Arrow: A cross-language development platform for in-memory analytics.
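
To make the Spark Core layer concrete, the hedged sketch below runs a tiny in-memory batch aggregation with PySpark on a local instance (assumes Spark and the pyspark package are installed; the application name and sample data are placeholders).

# Minimal PySpark sketch: in-memory batch processing on a local Spark instance.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("bdas-demo") \
    .getOrCreate()

# Small in-memory dataset; a SparkSQL-style aggregation runs over it without touching disk.
rows = [("sensor-1", 21.5), ("sensor-2", 19.0), ("sensor-1", 22.1)]
df = spark.createDataFrame(rows, ["sensor", "temperature"])
df.groupBy("sensor").avg("temperature").show()

spark.stop()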

MODULE-II
3a) What is Hadoop? Explain the Hadoop ecosystem with a neat diagram.
Overview of Hadoop:

Hadoop is a powerful, open-source platform for processing and managing Big Data. It is designed
to handle large volumes of data by distributing the tasks across multiple machines in a cluster. It
uses a MapReduce programming model to break down tasks into smaller chunks and process them
in parallel. Hadoop provides a scalable, fault-tolerant, and self-healing environment that can
process petabytes of data quickly and cost-effectively. The core components of Hadoop are
designed to work together to store, process, and manage data.

1. Hadoop Core Components:

1. Hadoop Common:
o Contains libraries and utilities required by other Hadoop modules. These include
components for distributed file systems, general I/O operations, and interfaces like
Java RPC (Remote Procedure Call).
2. Hadoop Distributed File System (HDFS):
o A Java-based distributed file system that stores large volumes of data across
multiple machines in the cluster. It ensures high availability through data
replication (default of 3 copies).
3. MapReduce:
o A programming model for processing large data sets in parallel. The Map function
processes input data into key-value pairs, and the Reduce function aggregates the
data from the Map function.
4. YARN (Yet Another Resource Negotiator):
o Manages and schedules resources across the Hadoop cluster. It allocates resources
for MapReduce jobs and manages the distributed environment effectively.
5. MapReduce v2:
o The upgraded version of MapReduce (MapReduce 2.0) that works with YARN for
enhanced resource management and parallel processing of large data sets.

2. Features of Hadoop:

1. Scalability:
o Hadoop is scalable, meaning it can easily scale up or down by adding or removing
nodes in the cluster. This flexibility allows Hadoop to process growing amounts of
data efficiently.
2. Fault Tolerance:
o HDFS ensures fault tolerance by replicating data blocks across different nodes
(default replication factor is 3). If one node fails, the data can still be accessed from
other nodes with replicated data.
3. Robust Design:
o Hadoop has a robust design with built-in data recovery mechanisms. Even if a
node or server fails, Hadoop can continue processing tasks due to replication and
failover mechanisms.
4. Data Locality:
o Data locality means that tasks are executed on the nodes where the data is stored,
reducing data transfer time and improving processing speed.
5. Open-Source and Cost-Effective:
o Hadoop is open-source and works on commodity hardware, making it a cost-
effective solution for managing and processing large datasets.
6. Hardware Fault Tolerance:
o If a hardware failure occurs, Hadoop handles it automatically by replicating data
and reassigning tasks to other nodes, ensuring that the system continues to run
smoothly.
7. Java and Linux-Based:

o Hadoop primarily uses Java for its interfaces and is built to run on Linux
environments. Hadoop’s tools and shell commands are tailored to support Linux
systems.

3. Hadoop Ecosystem Components:

The Hadoop ecosystem consists of various tools and frameworks designed to handle different
aspects of Big Data processing:

1. Distributed Storage Layer:


o This layer handles storage of large datasets across clusters using HDFS.
2. Resource Manager Layer:
o This layer is responsible for job scheduling and resource allocation across the
cluster. It uses YARN to manage the resources.
3. Processing Framework Layer:
o The MapReduce framework is used for distributed processing of data. Tasks are
divided into Map and Reduce phases for parallel execution.
4. Application Support Layer (APIs):
o Hive, Pig, Mahout, and other tools are part of the application support layer,
providing high-level querying (Hive), scripting (Pig), and machine learning
capabilities (Mahout) that work on top of the Hadoop framework.

4. Key Ecosystem Tools:

1. Avro:
o A data serialization system used for storing data in a compact format. It enables
efficient communication between layers in the Hadoop ecosystem.
2. ZooKeeper:
o A coordination service that helps synchronize tasks across distributed systems. It is
used for maintaining configuration information, providing distributed
synchronization, and managing naming and configuration.
3. Hive:
o A data warehousing tool that allows users to perform SQL-like queries on data
stored in HDFS. It abstracts the complexity of MapReduce with a simpler querying
interface.
4. Pig:
o A platform for analyzing large datasets with a scripting language. It simplifies
complex data processing tasks by using a high-level language called Pig Latin.
5. Mahout:
o A machine learning library for scalable data analysis. It provides algorithms for
clustering, classification, and collaborative filtering.

3b) Explain HDFS components with a neat diagram.



Hadoop Distributed File System (HDFS) Components

HDFS (Hadoop Distributed File System) is the storage layer of the Hadoop ecosystem. It is
designed to store large files across multiple machines in a distributed environment. The key idea
behind HDFS is to distribute large data sets across multiple machines in a cluster and to provide
high availability and fault tolerance.

Here’s an explanation of the key HDFS components:

Key HDFS Components:

1. Client:
o Clients are the users or applications that interact with the Hadoop cluster. The client
submits requests to the NameNode to access data stored in HDFS. These clients
can run on machines outside the Hadoop cluster, such as the applications running
Hive, Pig, or Mahout.
2. NameNode (Master Node):
o The NameNode is the central component of HDFS. It stores metadata about the
files in the distributed file system. It manages the file system namespace, keeps
track of the file locations across the cluster, and performs functions like:
▪ Storing the metadata information of the files (e.g., file names, block
locations).
▪ Keeping a record of the file blocks that are stored across various
DataNodes.
▪ Handling client requests for reading and writing files, and directing clients
to the appropriate DataNodes.
3. Secondary NameNode:
o The Secondary NameNode is often misunderstood as a backup for the NameNode,
but it is not. It periodically merges the edits log with the fsimage to prevent the
NameNode from becoming too large. This ensures the NameNode's metadata
remains manageable and can be recovered in the event of failure.
o It does not serve as a failover for the NameNode but helps in preventing NameNode
failure by maintaining an up-to-date checkpoint.
4. DataNode (Slave Node):
o DataNodes are the worker nodes in HDFS. They are responsible for storing the
actual data blocks and serving the client requests for reading and writing data.
o Each file in HDFS is divided into blocks (typically 128MB or 256MB in size), and
these blocks are distributed across multiple DataNodes.
o DataNodes are responsible for reporting to the NameNode about the status of the
blocks (e.g., whether they are healthy, if there are any issues with the blocks).
5. JobTracker (Optional - Not a direct HDFS component but relevant for HDFS
interaction):
o JobTracker manages and schedules MapReduce jobs. It is responsible for
coordinating the execution of tasks (Map and Reduce) and ensuring the efficient
parallel processing of data across DataNodes. The JobTracker interacts with the
NameNode to locate data stored in the HDFS for MapReduce jobs.

6. Zookeeper (for HBase):


o Zookeeper is used for coordination and metadata management when using
HBase (a NoSQL database) in the Hadoop ecosystem. It helps to store the
configuration and metadata information in a distributed manner.
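
As an illustration of how a client talks to the NameNode (for metadata) and ultimately to the DataNodes (for blocks), the hedged sketch below uses the third-party Python hdfs package over WebHDFS; the NameNode URL, user name, and paths are assumptions for a typical single-node setup.

# Hedged sketch: interacting with HDFS over WebHDFS using the third-party `hdfs` package.
# pip install hdfs   (assumes WebHDFS is enabled on the NameNode)
from hdfs import InsecureClient

client = InsecureClient("http://localhost:9870", user="hadoop")  # placeholder URL and user

print(client.list("/"))                    # metadata request served by the NameNode
client.upload("/user/hadoop/sample.log",   # destination path in HDFS (placeholder)
              "sample.log")                # local file whose blocks end up on DataNodes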

3c) Write a short note on Apache Hive.


Short Note on Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for managing and
analyzing large data sets. It provides a SQL-like language called HiveQL for querying and
processing structured data. Below are key points:

1. Features:
o ETL Tools: Enables easy Data Extraction, Transformation, and Loading
(ETL).
o Data Structuring: Supports imposing structure on various data formats.
o Integration: Accesses files in HDFS and other systems like HBase.
o Query Execution: Uses MapReduce or Tez for query execution.
2. Basic Commands:
o Create Table: Define schema for data storage.
Example:

CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

o Load Data: Import data into the table.


Example:

LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO TABLE logs;

o Query Data: Retrieve and analyze data.


Example:

SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%'
GROUP BY t4;

3. Advanced Usage:
o Transformations: Allows integrating Python scripts for advanced data processing.
o Example: Converting UNIX timestamps to weekdays using a Python script
(weekday_mapper.py); a sketch of such a script appears after this list.
4. Output Example:
Using Hive queries, data can be grouped and summarized efficiently, such as counting log
message types or analyzing user data.
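
A rough sketch of what the weekday_mapper.py transform mentioned above could look like is given below: it reads tab-separated rows on standard input, replaces the UNIX timestamp column with the day of the week, and writes the rows back out for Hive's TRANSFORM clause (the column layout follows the classic u_data example and is an assumption here).

# weekday_mapper.py -- hedged sketch of the transform script referenced above.
# Reads (userid, movieid, rating, unixtime) rows from stdin and
# emits (userid, movieid, rating, weekday) rows.
import sys
import datetime

for line in sys.stdin:
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))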

Hive Example Output:


For a query like SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;,
output could show:

1 13278
2 14816
3 15426
4 13774
5 17964
6 12318
7 12424

Apache Hive simplifies Big Data processing with its SQL-like approach and integration with
Hadoop's ecosystem.

4a) Explain Apache Sqoop import and export methods.



Figure 7.2 Two-step Sqoop data export method (Adapted from Apache Sqoop Documentation)

Apache Sqoop Import and Export Methods

Apache Sqoop is a tool designed to efficiently transfer data between Hadoop and relational
databases. It simplifies importing and exporting data using a map-only Hadoop job. Here's a
breakdown of its Import and Export methods:

Import Method

The process involves two main steps:

1. Metadata Gathering
o Sqoop examines the source database to collect necessary metadata about the data
to be imported.
o Metadata includes information like table schema, data types, and data size.
2. Data Transfer
o A map-only Hadoop job is submitted by Sqoop.
o The job splits the data into chunks (parallel tasks) for distributed processing.
o Each task imports data into an HDFS directory.
o Default Format: Comma-delimited fields with newline-separated records (can be
customized).

Output:

• The imported data is saved in HDFS and can be processed using Hadoop, Hive, or other
tools.

Export Method

The export process is the reverse of import and also involves two steps:

1. Metadata Examination
o Sqoop analyzes the database schema to identify how to map the HDFS data to the
target database.
2. Data Transfer
o A map-only Hadoop job is used to export data to the database.
o The data set is divided into splits, and each map task pushes one split to the
database.
o Database access is required for all nodes performing the export.

Output:

• Data from HDFS is written into the target database, enabling further operations in relational
database systems.

Key Features of Sqoop Import and Export

• Parallel processing ensures faster data transfer.


• Customizable data formats (field separators, delimiters).
• Handles large-scale data efficiently between Hadoop and relational databases.

4b) Explain Apache Oozie with a neat diagram.


Manage Hadoop Workflows with Apache Oozie

Apache Oozie is a workflow scheduling and coordination system specifically designed for
managing Hadoop jobs. It ensures that various interdependent jobs can execute in a defined
sequence or parallel, based on the data processing requirements.

Key Features of Oozie

1. Integration with Hadoop Stack:


Oozie seamlessly integrates with Hadoop tools such as MapReduce, Hive, Pig, and
Sqoop.
2. Workflow Representation:
Jobs are represented as Directed Acyclic Graphs (DAGs), ensuring no cyclic
dependencies.
3. Job Types in Oozie:
o Workflow Jobs: Sequential or dependent Hadoop jobs.
o Coordinator Jobs: Time- or data-availability-based scheduled workflows.
o Bundle Jobs: Batch of coordinator jobs combined into one.
4. Error Handling:
Supports nodes for workflow control such as fail and retry.

Nodes in Oozie Workflow

1. Control Flow Nodes:


o Start Node: Initiates the workflow.

o End Node: Marks successful completion.


o Fail Node: Handles failure scenarios.
2. Action Nodes:
o Executes tasks like MapReduce, Hive, or custom shell scripts.
o Notifies Oozie on completion for subsequent action execution.
3. Fork and Join Nodes:
o Fork Node: Enables parallel task execution.
o Join Node: Synchronizes the flow after parallel execution.
4. Decision Nodes:
o Allow conditional execution based on prior action results (e.g., file existence).
o Uses JSP Expression Language for evaluations.

How Oozie Works

1. Workflow Definition:
Workflows are defined in hPDL (Hadoop Process Definition Language), an XML-based
language.
2. Workflow Execution:
o DAG structure ensures orderly execution.
o Parallel or sequential dependencies are adhered to.
3. Job Monitoring:
o CLI and web UI provide real-time job tracking.

Advantages of Oozie

1. Efficient Workflow Management:


Automates and manages complex, multi-step Hadoop data processing tasks.
2. Flexibility:
Handles various job types (Hive, Pig, Sqoop, etc.) and integrates custom scripts.
3. Scalability:
Suitable for small and large-scale workflows.
4. Error Handling:
In-built mechanisms for retries and alternate paths.

Limitations of Oozie

1. Complex Configuration:
Requires a good understanding of XML-based definitions.
2. Limited UI Features:
Dependency on the CLI for advanced tasks.
3. Lacks Non-Hadoop Support:
Primarily tied to Hadoop ecosystem jobs.

Simple Oozie DAG Workflow

Below is a simplified depiction of a DAG workflow:



• Start Node → MapReduce Job → [Success: End Node | Failure: Fail Node]

This showcases Oozie's ability to define actions based on job results.

Complex Oozie DAG Workflow

• Start Node → Fork Node → (Parallel Jobs: MapReduce, Hive Query) → Join Node →
Decision Node → [Conditional Jobs] → End Node

This allows parallel processing, decision-making, and synchronized completion.



4c) Explain the YARN application framework.



Hadoop YARN: Overview

YARN (Yet Another Resource Negotiator) is a core component of Hadoop, acting as a resource
management platform. It handles the allocation and scheduling of computational resources
across applications in the Hadoop ecosystem.

Key Features of YARN

1. Resource Management:
o YARN manages computational resources like memory, CPU, and storage for
various tasks.
2. Task Scheduling:
o Schedules subtasks and ensures resources are allocated during specified time
intervals.
3. Decoupling:
o Separates resource management and processing components for scalability and
efficiency.

YARN Architecture Components

1. Client:
o Submits job requests to the Resource Manager (RM).
2. Resource Manager (RM):
o Acts as the master node in the cluster.
o Tracks available resources across all Node Managers (NMs).
o Contains two key services:
▪ Job History Server: Maintains records of completed jobs.
▪ Resource Scheduler: Allocates resources based on availability and
requirements.
3. Node Manager (NM):
o A slave node in the infrastructure.
o Manages local resources on individual nodes.
o Hosts Containers, which execute the subtasks of an application.
4. Application Master (AM):
o Coordinates the execution of a single application.
o Communicates resource requirements to the RM.
5. Containers:
o Units of resource allocation where application tasks run.

Hadoop 2 Execution Model with YARN


Workflow:

1. Client Submits Request:


o Application submission occurs via the client node to the RM.
2. Resource Manager Actions:
o Tracks resources like location, available memory, and CPU (Rack Awareness).
o Assigns resources using the Resource Scheduler.


3. Node Manager Initialization:
o NMs signal their availability to RM.
o Each NM launches an Application Master Instance (AMI) to initialize tasks.
4. Application Master Coordination:
o AM calculates the resources needed for subtasks.
o Sends requests to RM for required resources.
5. Container Execution:
o NMs allocate containers for subtasks.
o Subtasks execute in containers.

Advantages of YARN

1. Scalability:
Supports clusters with thousands of nodes.
2. Efficiency:
Dynamically allocates resources based on demand.
3. Flexibility:
Enables applications to run independently of the MapReduce framework.
4. Fault Tolerance:
Ensures the availability of tasks even during node failures.

YARN Application Framework


Key Actions:

1. The Client submits the job request to the RM.


2. The RM assigns resources and coordinates task execution.
3. The AM handles application lifecycle and requests additional resources.
4. Tasks are executed within Containers on the NMs.

MODULE-III
5a) What is NoSQL? Explain the CAP theorem.
NoSQL Overview

NoSQL stands for "Not Only SQL", representing a class of non-relational database systems.
These databases are designed for scalability, flexibility, and handling large volumes of data,
particularly in distributed systems.

Key Features of NoSQL Databases

1. Schema Flexibility:
No predefined schemas; data models can evolve dynamically.
2. Horizontal Scalability:
Supports the addition of new nodes to handle increasing loads.

3. Replication and Auto-Sharding:


Automatically divides and replicates data across nodes for fault tolerance and high
availability.
4. Semi-Structured Data Support:
Supports varied formats such as JSON, XML, and other semi-structured data types.
5. BASE Properties:
o Basically Available: The system is mostly available, even under failure scenarios.
o Soft State: The state of the database can change over time, even without new input.
o Eventual Consistency: Data will become consistent eventually, not instantly.

CAP Theorem

The CAP theorem, proposed by Eric Brewer, states that a distributed system can guarantee at
most two out of three properties simultaneously:

1. Consistency (C):
o Ensures all nodes in the system see the same data at the same time.
o Example: In a distributed database, if a sale is recorded in one node, it is reflected
across all nodes immediately.
2. Availability (A):
o Guarantees that every request receives a response, even if some nodes are down.
o Example: Even during network failures, a replicated node can respond to a query.
3. Partition Tolerance (P):
o The system continues to function despite network partitions or communication
breakdowns between nodes.
o Example: Even if one region in a distributed database loses connectivity, the other
regions continue to operate.

Trade-offs in CAP Theorem

A distributed system must choose between Consistency (C) and Availability (A) while always
maintaining Partition Tolerance (P).

Scenarios:

1. CP (Consistency and Partition Tolerance):


o Prioritizes consistency but sacrifices availability during a partition.
o Example: Banking systems, where consistency is critical.
2. AP (Availability and Partition Tolerance):
o Prioritizes availability but allows for eventual consistency.
o Example: Web applications and e-commerce sites.
3. CA (Consistency and Availability):
o Suitable for systems without partition tolerance, which is impractical in distributed
systems.

5b) Explain NoSQL data architecture patterns.


NoSQL Data Architecture Patterns
Key-Value Store

Description: A schema-less data store using key-value pairs, similar to a hash table.
Features:

• High performance, scalability, and flexibility.


• Stores values as BLOBs (e.g., text, images, video).
• Values are retrieved by a unique key.
• Can store any data type and is eventually consistent.

Examples:

• Amazon DynamoDB: Scalable and managed database for applications requiring low
latency.
• Redis: In-memory data store for caching, message brokering, and real-time analytics.
• Riak: Distributed database with fault-tolerance for high-availability applications.
• Couchbase: Combines key-value storage with document-based functionality.

Advantages:

• Supports diverse data types.


• Simple query model with fast retrieval.
• Scalability and low operational costs.
• Flexible key formats (e.g., hashes, logical paths, REST calls).

Limitations:

• No indexes on values; subset search not possible.


• Lacks traditional database capabilities like atomicity.
• Challenges with maintaining unique keys in large datasets.
• No query filters like SQL's WHERE clause.

Use Cases:

• Image storage, document/file storage, lookup tables, query caching.
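
A minimal key-value interaction through the redis-py client is sketched below; it assumes a Redis server on localhost with default settings, and the keys and values are placeholders.

# Hedged sketch: basic key-value operations against a local Redis server.
# pip install redis
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Values are opaque to the store: text, serialized JSON, or binary blobs.
r.set("user:1001:avatar", b"\x89PNG...binary blob...")   # store a value under a unique key
r.set("session:abc123", "logged_in", ex=3600)            # optional expiry, useful for query caching
print(r.get("session:abc123"))                           # retrieval is always by key, never by value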

Document Store

Description: Stores data as documents (e.g., JSON, XML) with a hierarchical structure.
Features:

• Handles unstructured data.


• Easy querying using document trees.
• Exhibits ACID properties.

Examples:

• MongoDB: A NoSQL database with a flexible schema, ideal for real-time analytics.
• CouchDB: Designed for web applications with an emphasis on availability and partition
tolerance.

Use Cases:

• Office documents, forms, inventory data, document exchange/search.

CSV and JSON File Formats

• CSV: Stores flat, record-based data (not hierarchical).


• JSON: Represents semi-structured data, including object-oriented and hierarchical
records.

Widely Used For: Representing and querying structured information.

Column-Family Data Store

Description: Organizes data into rows and columns, grouping columns into families.
Features:

• Uses row IDs and column names for data retrieval.


• Supports hierarchical structures with column families and super columns.

Examples:

• Apache Cassandra: Highly available and scalable for time-series data.


• HBase: Distributed database built on Hadoop for big data analytics.

Advantages:

• Scalability with distributed query processing.


• Partitionability for efficient memory usage.
• High availability through replication across nodes.
• Easy to add new data by extending columns or rows.

Use Cases:

• Large-scale data analytics, distributed systems requiring high performance.
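
A hedged sketch of a column-family interaction using the DataStax Python driver for Cassandra is shown below; the contact point, keyspace, and table (assumed as readings(sensor_id text, ts timestamp, temperature double)) are illustrative assumptions.

# Hedged sketch: writing to and reading from a Cassandra column family (table).
# pip install cassandra-driver   (assumes a local Cassandra node and a pre-created table)
import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo_keyspace")   # placeholder keyspace

# Rows are addressed by a row key (sensor_id); columns hold the values.
session.execute(
    "INSERT INTO readings (sensor_id, ts, temperature) VALUES (%s, %s, %s)",
    ("sensor-1", datetime.datetime(2024, 1, 1, 10, 0), 21.5),
)
for row in session.execute("SELECT ts, temperature FROM readings WHERE sensor_id = %s",
                           ("sensor-1",)):
    print(row.ts, row.temperature)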

Graph Databases

Description: Focuses on representing and querying data as nodes (entities), edges (relationships),
and properties.
Features:

• Highly suitable for connected data.


• Provides traversal algorithms to query relationships efficiently.
• Schema-less with flexible representation.

Examples:

• Neo4j: Industry-standard graph database for querying connected data.


• ArangoDB: Multi-model database supporting graphs, documents, and key-value storage.

Use Cases:

• Social networks, recommendation systems, and networked systems.
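
The hedged sketch below creates two nodes, one relationship, and a traversal query using the official Neo4j Python driver; the connection URI, credentials, labels, and Cypher statements are assumptions for a local instance.

# Hedged sketch: nodes, an edge, and a traversal in Neo4j via the Python driver.
# pip install neo4j   (assumes a local Neo4j server; credentials are placeholders)
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Entities become nodes; the relationship carries its own property.
    session.run("MERGE (a:Person {name: $a}) "
                "MERGE (b:Person {name: $b}) "
                "MERGE (a)-[:FOLLOWS {since: 2023}]->(b)", a="Alice", b="Bob")
    # Traversal: whom does Alice follow?
    result = session.run("MATCH (:Person {name: $a})-[:FOLLOWS]->(p) RETURN p.name AS name",
                         a="Alice")
    for record in result:
        print(record["name"])

driver.close()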

Object-Oriented Mapping

Description: Represents data as objects, enabling close alignment with object-oriented
programming concepts.
Features:

• Eliminates the need for ORM (Object-Relational Mapping) tools.


• Facilitates direct storage and retrieval of objects.
• Aligns well with programming languages like Java and C++.

Examples:

• ObjectDB: Java-based object-oriented database for persistence.



• db4o: Embedded object database for Java and .NET.

Use Cases:

• Applications heavily relying on object-oriented programming paradigms.

6a) Explain the Shared-Nothing architecture for Big Data tasks.


Shared Nothing Architecture for Big Data Tasks

The Shared-Nothing (SN) architecture is a cluster-based architecture where each node operates
independently without sharing memory or disk with any other node. It is widely used in big data
tasks because of its scalability, fault tolerance, and efficiency. Below is a detailed explanation of
the SN architecture:

Definition:

• A Shared-Nothing (SN) architecture divides the workload across a cluster of nodes where
each node functions independently.
• Nodes do not share memory, disk, or data, ensuring isolation.
• Data is partitioned across nodes, and processing tasks are distributed among them.

Features of Shared-Nothing Architecture:

1. Independence:
o Each node operates autonomously without memory or data sharing.
o Nodes possess computational self-sufficiency, reducing contention.
2. Self-Healing:
o The system can recover from link failures by creating alternate links or
redistributing tasks.
3. Sharding of Data:
o Each node stores a shard (a portion of the database).
o Shards improve performance by distributing workloads across nodes.
4. No Network Contention:
o Since nodes work independently, there is minimal network congestion during
processing.
5. Scalability:
o Horizontal scaling is achieved by adding more nodes to the cluster.
o Allows handling of large datasets efficiently.
6. Fault Tolerance:
o A node failure does not disrupt the entire system; tasks are redistributed to other
nodes.

Working of SN Architecture in Big Data Tasks:

• Data is partitioned across multiple nodes using partitioning strategies.


• Each node processes its assigned data independently.

• A coordination protocol may be used to synchronize results if needed.
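
A toy illustration of the sharding idea follows: records are assigned to nodes by hashing their key, each imaginary node processes only its own shard, and the partial results are combined at the end (pure Python; node names and data are arbitrary).

# Toy shared-nothing sketch: hash-partition records, process shards independently, combine results.
import hashlib
from collections import defaultdict

NODES = ["node-0", "node-1", "node-2"]   # imaginary independent nodes

def node_for(key):
    # Deterministic hash partitioning: any node can compute the placement locally.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

records = [("user-1", 10), ("user-2", 25), ("user-3", 7), ("user-4", 13)]
shards = defaultdict(list)
for key, value in records:
    shards[node_for(key)].append((key, value))      # data is partitioned, never shared

# Each "node" works only on its own shard; a final step combines the partial results.
partials = {node: sum(v for _, v in rows) for node, rows in shards.items()}
print(partials, "total:", sum(partials.values()))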

Advantages:

1. Horizontal Scalability:
o Easy to scale by adding more nodes to the cluster.
2. High Performance:
o No resource contention ensures faster processing.
3. Fault Tolerance:
o System remains operational even if some nodes fail.
4. Cost-Effective:
o Commodity hardware can be used for setting up the architecture.

Examples of SN Architecture in Big Data Frameworks:

1. Hadoop:
o Distributes data and processing tasks across multiple nodes using HDFS and
MapReduce.
2. Apache Spark:
o Processes data in-memory across independent nodes.
3. Apache Flink:
o Executes stream processing in a distributed, shared-nothing manner.

Comparison with Other Models:

1. Single Server Model:


o Processes all data on a single server.
o Suitable for small-scale tasks but lacks scalability for large data.
2. Master-Slave Model:
o Master directs and coordinates tasks among slave nodes.
o High resilience but involves replication overhead.
3. Shared-Nothing Model:
o Completely decentralized and ideal for large-scale distributed processing.
o Superior fault tolerance and scalability compared to centralized models

Conclusion:

The Shared-Nothing architecture is highly effective for big data tasks due to its scalability, fault
tolerance, and independence. It is a preferred choice for distributed computing systems like
Hadoop, Spark, and Flink, enabling efficient processing of massive datasets.

6b) Explain MongoDB.


MongoDB - An Overview

MongoDB is an open-source, NoSQL database management system (DBMS) that stores data in a
document-oriented format. It is widely used for handling large-scale data due to its flexibility and
scalability.

Key Characteristics of MongoDB

1. Non-relational and NoSQL: MongoDB does not rely on a fixed schema and supports
unstructured data.
2. Document-based: Data is stored in BSON (Binary JSON) format, making it easy to store
hierarchical relationships.

3. Dynamic Schema: Unlike RDBMS, MongoDB collections can store documents with
varying fields.
4. Cross-platform: Compatible with multiple operating systems like Windows, Linux, and
macOS.
5. Scalability: Supports horizontal scaling through sharding for better data distribution.
6. High Performance: Supports fast data retrieval and updates due to its efficient indexing
and querying mechanism.
7. Fault Tolerance: Ensures data availability using replication through replica sets.

Comparison with RDBMS

Feature             | RDBMS             | MongoDB
Database Model      | Relational        | Document-based
Schema              | Fixed             | Dynamic
Data Storage Format | Rows/Columns      | BSON (Binary JSON)
Joins               | Complex Joins     | Embedded Documents
Transactions        | ACID Transactions | Atomic Operations on Documents
Horizontal Scaling  | Limited           | Supported (Sharding)

Core Components of MongoDB

1. Database: A physical container for collections, similar to databases in RDBMS.


2. Collection: Analogous to tables in RDBMS, collections can hold documents with varying
fields.
3. Document: The basic unit of data storage in JSON-like format containing key-value pairs.

Features of MongoDB

1. Schema-less Data Storage:


o No need to predefine fields or their data types.
o Supports flexible data models.
2. Rich Query Language:
o Enables deep queries using a document-based query language similar to SQL.
3. Indexing:
o MongoDB supports indexing on any field, improving query performance.
o By default, every collection includes an _id field as the primary key.
4. Replication:
o Ensures high availability by maintaining multiple copies of data through replica
sets.
5. Sharding:
o Distributes data across multiple servers to handle large-scale applications.
6. High Performance:
o Fast in-place updates without allocating new memory.
7. Aggregation Framework:
o Performs real-time data analysis using aggregation pipelines.
8. Built-in Support for Geospatial Data:
o Ideal for location-based applications.
9. Horizontal Scalability:
o Easily add new servers to the cluster for expanding storage.

Advantages of MongoDB

• Flexible and Scalable: Suitable for rapidly changing application requirements.


• Fault Tolerance: Redundancy ensures no single point of failure.
• Real-time Analytics: Supports efficient querying and aggregation for data analysis.

Limitations of MongoDB

• Lacks full ACID compliance for multi-document transactions.


• No support for complex joins as seen in RDBMS.
• Consumes more memory due to its BSON format.

Typical Applications of MongoDB

1. Content Management Systems


2. Mobile and Web Applications
3. E-commerce Platforms
4. Real-time Analytics
5. Gaming and User Data Management

Common MongoDB Commands

Command Function
use <db> Creates or switches to a database.
db.<collection>.insert() Inserts a document into a collection.
db.<collection>.find() Retrieves all documents from a collection.
db.<collection>.update() Updates a document in a collection.
db.<collection>.remove() Deletes a document from a collection.
db.stats() Displays statistics about the MongoDB server.
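
The shell commands above have direct equivalents in application code; the hedged sketch below performs the same operations through the PyMongo driver, assuming a local mongod on the default port and placeholder database, collection, and document names.

# Hedged sketch: the CRUD operations from the table above, expressed with PyMongo.
# pip install pymongo   (assumes mongod is running on localhost:27017)
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["shopdb"]                                              # like `use shopdb`

db.products.insert_one({"name": "pen", "price": 10})               # insert()
print(db.products.find_one({"name": "pen"}))                       # find()
db.products.update_one({"name": "pen"}, {"$set": {"price": 12}})   # update()
db.products.delete_one({"name": "pen"})                            # remove()
print(db.command("dbStats"))                                       # db.stats()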

Replication and Sharding in MongoDB

1. Replication:
o Achieved using a replica set (one primary and multiple secondary nodes).
o Provides high availability and automatic failover during server failures.
2. Sharding:
o Data distribution method to scale horizontally.
o Automatically balances data across servers and improves write throughput.

Conclusion

MongoDB is a robust, NoSQL database solution tailored for modern application requirements. Its
flexibility, scalability, and ease of use make it an excellent choice for developers dealing with
dynamic and large-scale datasets.

MongoDB vs. RDBMS

MongoDB and RDBMS (Relational Database Management Systems) differ fundamentally in their
structure, features, and use cases. Below is a comparative analysis:

Feature       | RDBMS                    | MongoDB
Database Type | Relational               | Document-oriented (NoSQL)
Storage Unit  | Table                    | Collection
Row           | Record/Tuple             | Document/Object
Column        | Field                    | Key
Primary Key   | User-defined primary key | Default _id field as the primary key
Joins         | Complex table joins      | Embedded documents simplify relationships
Indexing      | Supports indexing        | Supports indexing for faster querying

Replication in MongoDB

Replication ensures data availability and reliability. MongoDB implements replication using
replica sets.

• Replica Set: A group of MongoDB servers that maintain copies of the same dataset.

• Components:
o Primary Node: Receives all write operations.
o Secondary Nodes: Sync data from the primary node and serve as backups.
o Automatic Failover: If the primary node fails, a secondary node is promoted to
primary.
o Reintegration: Failed nodes can rejoin as secondary nodes after recovery.

Commands for Replica Set Management

Command Description
rs.initiate() Initiates a new replica set.
rs.conf() Displays the replica set configuration.
rs.status() Checks the current status of the replica set.
rs.add() Adds members to the replica set.

Auto-Sharding in MongoDB

Sharding distributes data across multiple servers to handle large-scale data efficiently.

• Purpose: Supports horizontal scaling to manage increased data sizes and demands.
• Mechanism: Automatically balances data and load among servers.
• Benefits:
o Improves write throughput by distributing operations across multiple mongod
instances.
o Cost-effective alternative to vertical scaling.

Data Types in MongoDB

MongoDB supports a wide range of BSON data types to accommodate diverse data requirements:

Data Type Description


Double Stores floating-point numbers.
String UTF-8 encoded strings.
Object Represents embedded documents.
Array Stores lists or sets of values.
Binary Data Arbitrary byte strings, suitable for images or binaries.
ObjectId 12-byte unique identifier for MongoDB documents.
Boolean Represents true or false values.
Date Stores dates as a 64-bit integer (milliseconds since Unix epoch).
Null Represents missing or unknown values.
Regular Expression Maps directly to JavaScript regular expressions.
32-bit Integer Stores numbers without decimal points as 32-bit integers.
64-bit Integer Stores numbers without decimal points as 64-bit integers.
Timestamp Special internal type for tracking operations within a second.
MinKey/MaxKey Represent the smallest and largest BSON values, respectively, for internal use.

Rich Queries and Functionalities

MongoDB offers rich querying capabilities comparable to RDBMS:

1. Query Language: Provides a robust query mechanism similar to SQL for document
retrieval.
2. Secondary Indexes: Support text search and geospatial queries.
3. Aggregation Framework: Powerful for real-time data analysis.
4. Flexibility: Dynamic schemas enable handling unstructured and semi-structured data
efficiently.
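
As a small example of the aggregation framework mentioned above, the hedged PyMongo sketch below groups documents by a field and counts them; the collection, field names, and filter are placeholders.

# Hedged sketch: a minimal aggregation pipeline (match, group, sort) with PyMongo.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["shopdb"]            # placeholder database

pipeline = [
    {"$match": {"status": "shipped"}},                              # filter stage
    {"$group": {"_id": "$city", "orders": {"$sum": 1}}},            # group and count per city
    {"$sort": {"orders": -1}},                                      # largest groups first
]
for doc in db.orders.aggregate(pipeline):
    print(doc["_id"], doc["orders"])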

Advantages of MongoDB over RDBMS

1. Schema Flexibility: Documents in the same collection can have different structures.
2. High Scalability: Sharding supports horizontal scaling for large datasets.
3. Performance: Optimized for high-speed read and write operations.
4. Embedded Relationships: Avoid complex joins by storing related data together.

Conclusion

MongoDB, with its document-oriented model and features like replication, sharding, and rich data
types, is a powerful NoSQL database. It is particularly suited for Big Data applications, real-time
analytics, and scenarios requiring scalability and flexibility. The comparison with RDBMS
highlights MongoDB’s advantages in modern application development.

MODULE-IV

7a) Explain MapReduce execution steps with a neat diagram.



MapReduce Execution Steps

MapReduce is a programming model used for processing large datasets in a distributed manner. It
is commonly used in Hadoop for parallel processing and fault tolerance. Below is a detailed
explanation of the MapReduce execution steps and how the system handles node failures:

1. Job Submission

The process starts when the user submits a MapReduce job to the Hadoop JobTracker. The job
consists of two main parts: the Map function and the Reduce function.

2. Job Initialization

• JobTracker: The JobTracker is responsible for coordinating the MapReduce job


execution. It splits the job into Map tasks and Reduce tasks and assigns them to different
TaskTrackers running on different nodes in the Hadoop cluster.
• TaskTrackers: These are the worker nodes that execute the tasks assigned by the
JobTracker.

3. Data Splitting

• The input data is split into smaller chunks (typically called splits) by the JobTracker.
• Each split is processed by a Map task.

• These splits are distributed across the nodes to enable parallel processing.

4. Map Phase

• Map Function: Each Map task takes an input split and processes it. The Map function
reads the input, processes it, and emits key-value pairs as output.
• Data Flow: The output from the Map function is sent to the Map output collector, which
writes the intermediate data to local disk (in sorted order) on each node.
• Partitioning: The system partitions the output based on a partitioning function, which
ensures that related data is sent to the same reducer.

5. Shuffle and Sort

• After the Map phase, the Shuffle and Sort phase begins. In this phase, the data emitted by
the mappers is sorted by key and grouped by the same key.
• The Shuffle phase involves moving data from the Map nodes to the appropriate Reduce
nodes. This is done by the Shuffle function, which groups the data by key.

6. Reduce Phase

• Reduce Function: After the data is shuffled and sorted, each Reduce task receives a set
of key-value pairs grouped by key. The Reduce function processes these pairs to produce
the final output.
• The output of the reduce tasks is written back to the HDFS (Hadoop Distributed File
System).

7. Job Completion

• Once the Reduce tasks complete, the results are stored in HDFS, and the JobTracker
notifies the client that the job is complete.
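
The Map, shuffle-and-sort, and Reduce phases described above can be sketched in a few lines of Python; the example below simulates a word count in a single process to show how the pieces fit together (an illustration only; in Hadoop these phases run as distributed tasks coordinated by the JobTracker).

# Hedged sketch: word count expressed as map -> shuffle/sort (group by key) -> reduce.
from collections import defaultdict

def map_phase(line):
    # Map: emit intermediate (key, value) pairs -- here (word, 1).
    for word in line.strip().split():
        yield word, 1

def reduce_phase(word, counts):
    # Reduce: aggregate all values that share the same key.
    return word, sum(counts)

documents = ["big data needs big clusters", "data data everywhere"]

grouped = defaultdict(list)               # shuffle and sort: group intermediate pairs by key
for line in documents:
    for word, one in map_phase(line):
        grouped[word].append(one)

results = [reduce_phase(w, c) for w, c in sorted(grouped.items())]
print(results)   # [('big', 2), ('clusters', 1), ('data', 3), ('everywhere', 1), ('needs', 1)]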

Fault Tolerance in MapReduce

Hadoop ensures fault tolerance by recovering from node failures during the execution of
MapReduce jobs. The JobTracker and TaskTracker components handle this fault tolerance.

Handling TaskTracker Failures

1. Map TaskTracker Failure:


o If a Map TaskTracker fails, the JobTracker identifies the failed task.
o The failed task is re-scheduled to another available TaskTracker.
o The tasks that were in-progress or completed on the failed TaskTracker are reset to
idle, and the job continues from the point of failure.
2. Reduce TaskTracker Failure:
o If a Reduce TaskTracker fails, only the in-progress reduce tasks are reset to idle.
o The failed reduce tasks are re-scheduled to another TaskTracker.

3. JobTracker Failure:
o If the JobTracker fails (which could happen if only one JobTracker is running),
the entire MapReduce job aborts.
o The client is notified about the failure, and the job must restart if there's no backup
JobTracker.

Handling Node Failures with Multiple TaskTrackers

• Each node (TaskTracker) communicates periodically with the JobTracker to signal its
health.
• If a node doesn't communicate with the JobTracker for a specified duration (default 10
minutes), the node is considered failed.
• Re-execution of tasks can happen on another TaskTracker, ensuring that the MapReduce
job continues without disruption.

7b) What is HIVE? Explain HIVE Architecture.

• Hive is a data warehousing and SQL-like query language system built on top of Hadoop.
• It simplifies the process of querying and managing large datasets in Hadoop using HiveQL,
which is similar to SQL.
• Hive allows users to run queries on data stored in HDFS (Hadoop Distributed File
System).
• It is used for data analysis, summarization, and aggregation in big data environments.

Hive Architecture
1. Hive Server (Thrift)
o Exposes a client API for executing HiveQL queries.
o Supports various programming languages (e.g., Java, Python, C++).
o Allows remote clients to submit queries to Hive and retrieve results.
2. Hive CLI (Command Line Interface)
o A popular interface to interact with Hive.
o Runs Hive in local mode (using local storage) or distributed mode (using HDFS).
o Allows execution of HiveQL queries directly from the command line.
3. Web Interface (HWI)

o Provides access to Hive through a web browser.


o Requires the HWI Server to run on a designated machine.
o Can be accessed using the URL: http://hadoop:<port_number>/hwi.
4. Metastore
o Stores metadata for tables, columns, data types, and HDFS mapping.
o Acts as the system catalog in Hive.
o Interacts with all other Hive components for schema and data management.
5. Hive Driver
o Manages the lifecycle of a HiveQL statement.
o Handles compilation, optimization, and execution of queries.
o Works with the Query Compiler and Execution Engine to execute queries.
6. Query Compiler
o Compiles HiveQL queries into an executable DAG (Directed Acyclic Graph).
o Converts high-level HiveQL queries into MapReduce, Tez, or Spark jobs for
execution on the Hadoop cluster.
7. Execution Engine
o Executes the compiled query plan (MapReduce/Tez/Spark jobs).
o Runs the MapReduce tasks or other supported tasks in the Hadoop cluster.
o Processes data and provides results.
8. HDFS (Hadoop Distributed File System)
o Serves as the primary storage for data in Hive.
o All data processed by Hive is stored in HDFS.
o Hive interacts with HDFS to read, write, and manage data.
9. HBase Integration (Optional)
o Hive can be integrated with HBase for real-time data processing.
o Allows accessing real-time data stored in HBase tables.
o Suitable for low-latency data access scenarios.
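As a brief, hedged illustration of the client side of this architecture, the sketch below assumes a HiveServer2 (Thrift) endpoint on localhost:10000, a database named default, and a table named web_logs (none of these names come from the original answer), and uses the third-party pyhive package to submit a HiveQL query from Python.

# Minimal sketch: submitting HiveQL to Hive's Thrift server from Python (assumes pyhive is installed)
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# The Hive driver compiles this query into MapReduce/Tez/Spark jobs and runs them on the cluster
cursor.execute("SELECT page, COUNT(*) AS visits FROM web_logs GROUP BY page")
for page, visits in cursor.fetchall():
    print(page, visits)

cursor.close()
conn.close()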

Q. 08 a Explain Pig architecture for scripts dataflow and processing

Pig Architecture: Scripts, Dataflow, and Processing


1. Ways to Execute Pig Scripts

• Grunt Shell:

o An interactive shell for executing Pig scripts.


o Allows direct execution of commands in the shell.
• Script File:
o Pig commands are written in a script file.
o The script is executed at the Pig Server (remote execution environment).
• Embedded Script:
o Custom User Defined Functions (UDFs) are created for functions not available in
Pig's built-in operators.
o UDFs can be written in other programming languages and embedded within the Pig
Latin script file.

2. Pig Dataflow and Processing

The execution of a Pig script goes through several stages, each responsible for transforming and
processing the data.

3. Components in Pig Architecture

1. Parser:
o The Parser handles the Pig script after it's passed through the Grunt Shell or Pig
Server.
o Function: It checks the script for syntax errors and performs type checking.
o The output of the parsing step is a Directed Acyclic Graph (DAG).
▪ DAG (Directed Acyclic Graph):
▪ Represents the sequence of Pig Latin statements.
▪ Nodes in the DAG represent logical operators.
▪ Edges between the nodes represent the data flows.
▪ Acyclic means the graph contains no cycles: data flows in one
direction from the inputs to the final output without looping back.
2. Optimizer:
o The Optimizer optimizes the DAG before passing it for compilation.
o Optimization Features:
▪ PushUpFilter: Splits and pushes up filter conditions to reduce the data
early.
▪ PushDownForEachFlatten: Postpones the flattening operation to
minimize record expansion.
▪ ColumnPruner: Eliminates unused or unnecessary columns to reduce
record size.
▪ MapKeyPruner: Removes unused map keys, optimizing data storage.
▪ Limit Optimizer: If a limit operation is used immediately after loading or
sorting data, it applies optimizations to reduce unnecessary processing by
limiting the dataset size earlier in the process.
3. Compiler:
o After optimization, the Compiler compiles the optimized DAG into a series of
MapReduce jobs.
o These jobs represent the logical steps required to process the Pig script.

4. Execution Engine:
o The Execution Engine is responsible for running the MapReduce jobs.
o The jobs are executed on the Hadoop cluster, and the final results are produced
after processing the data.

4. Pig Latin Data Model

• Primitive (Atomic) Data Types:


o int, float, long, double — standard numeric types.
o char[] — an array of characters (string).
o byte[] — an array of bytes.
• Complex Data Types:
o Tuple: An ordered collection of fields (like a record).
o Bag: A collection of tuples (similar to a table).
o Map: A collection of key-value pairs.
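To make the data model concrete, the following sketch (with invented field names and values) models Pig's complex types using ordinary Python structures; it is only an analogy, not Pig Latin itself.

# Python analogues of Pig Latin's complex types (illustrative analogy only)
# Tuple -> ordered record; Bag -> collection of tuples; Map -> key-value pairs in a field
students_bag = [                                          # Bag: like a relation/table of tuples
    ("asha", 21, {"city": "Mysuru", "branch": "CSE"}),    # Tuple whose third field is a Map
    ("ravi", 22, {"city": "Hubli", "branch": "ISE"}),
]

# Rough equivalent of a FOREACH ... GENERATE projection in Pig Latin
projected = [(name, props["city"]) for name, _age, props in students_bag]
print(projected)   # [('asha', 'Mysuru'), ('ravi', 'Hubli')]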

5. Pig Grunt Shell Usage

• Main Purpose: The Grunt Shell is used to write and execute Pig Latin scripts
interactively.
• Command Syntax:
o sh command: Invokes shell commands from within the Grunt shell.
▪ Example: grunt> sh ls
o ls command: Lists files in the Grunt shell environment.
▪ Example: grunt> ls

Summary of Pig Architecture and Dataflow

1. Execution Process:
o Pig scripts are executed in one of the three ways: Grunt Shell, Script File, or
Embedded Script.
o The parser processes the Pig script and produces a DAG, which is optimized by the
optimizer.
o The optimizer reduces unnecessary data processing, making the execution more
efficient.
o The optimized DAG is compiled into MapReduce jobs, which are executed by the
execution engine.
2. Pig Latin Data Model:
o Pig supports both primitive and complex data types, enabling flexible data
handling during processing.
3. Grunt Shell provides an interactive environment to write and test Pig scripts, making it
easier for users to experiment with Pig Latin queries and functions.

This architecture enables efficient big data processing, making it a powerful tool in the Hadoop
ecosystem for ETL processes, data transformation, and analysis.

8b Explain Key Value pairing in Map Reduce.


Key-Value Pairing in MapReduce:
1. Introduction to Key-Value Pairing

• MapReduce uses key-value pairs at different stages to process and manipulate data. The
data must be converted into key-value pairs before being passed to the Mapper, as it only
understands and processes key-value pairs.

2. Key-Value Pair Generation

• InputSplit:
o Defines a logical representation of the data. It splits the data into smaller chunks
for processing by the map() function.
• RecordReader:
o Communicates with the InputSplit and converts the data into key-value pairs
suitable for processing by the Mapper.
o By default, TextInputFormat is used to convert text data into key-value pairs.
o RecordReader continues processing until the entire file is read.

3. Functions Using Key-Value Pairs

Key-value pairs are used at four primary points:

• map() input: Data received by the Mapper.


• map() output: Data produced by the Mapper, consisting of key-value pairs.
• reduce() input: Data received by the Reducer, which includes the grouped key-value pairs
from the Mapper output.
• reduce() output: Data produced by the Reducer, which results in the final key-value pairs.

4. Grouping by Key

• After the map() task completes, the Shuffle process groups all the Mapper outputs by the
key.
o All key-value pairs with the same key are grouped together.
o A "Group By" operation is performed on the intermediate keys, resulting in a list
of values (v2) associated with each key (k2).
o The output of the Shuffle and Sorting phase will be a list of <k2, List(v2)>.

5. Shuffle and Sorting Phase

• In the Shuffle phase:


o All pairs with the same group key (k2) are grouped together.
o These key-value groups are assigned to different reduce nodes for further
processing.
• The MapReduce framework automatically sorts the partitions on each node before passing them
as inputs to the Reducer.

6. Partitioning

• The Partitioner is responsible for distributing the output of the map() tasks into different
partitions.
o It is an optional class and can be specified by the MapReduce driver.
o Partitions help divide the key-value pairs across different Reducer tasks, ensuring
efficient data processing.
• The Partitioner executes locally on each machine that performs a map task.

7. Combiners

• Combiners act as semi-reducers in MapReduce. They are an optional optimization class.


o The Combiner works to aggregate data locally at the Mapper level before the
Shuffle and Sort phase.
• Function of a Combiner:
o Consolidates map output records with the same key into fewer or smaller records.
o This reduces the volume of data transferred across the network between the map
and reduce tasks.
• Combiners and Reducers:
o Both the combiner() and reducer() functions can implement the same logic, but
the combiner operates on the map output, while the reducer works on the shuffled
and sorted output.
• Usage:
o Combiner helps reduce the cost of data transfer, particularly when dealing with
large datasets.

8. Reduce Tasks

• The Reducer class in Java provides the abstract reduce() function.


o A custom Reducer class needs to override the reduce() function.
• The reduce() function:
o Takes the Mapper output (after being shuffled and sorted) as input.
o Processes the list of values (v2) associated with each key (k2).
o Performs tasks such as aggregation or statistical computation.
o Outputs key-value pairs (k3, v3) that represent the final result.
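The sketch below simulates this key-value data flow in plain Python for a word-count example: map() emits (k2, v2) pairs, a combiner aggregates them locally, a hash partitioner assigns keys to reducers, and reduce() produces the final (k3, v3) pairs. It is a conceptual model of the flow, not the actual Hadoop API.

from collections import defaultdict

def map_fn(_, line):                                  # map() input (k1, v1): (offset, line of text)
    return [(word, 1) for word in line.split()]       # map() output: (k2, v2) pairs

def combine(pairs):                                   # optional combiner: local aggregation per mapper
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return local.items()

def partition(key, num_reducers):                     # partitioner: decides which reducer gets the key
    return hash(key) % num_reducers

def reduce_fn(key, values):                           # reduce() input: (k2, list(v2)); output: (k3, v3)
    return key, sum(values)

lines = ["big data big insights", "big data tools"]   # invented sample input
num_reducers = 2
shuffled = [defaultdict(list) for _ in range(num_reducers)]

for offset, line in enumerate(lines):                 # one map task per split
    for key, value in combine(map_fn(offset, line)):
        shuffled[partition(key, num_reducers)][key].append(value)   # shuffle groups by key

for r in range(num_reducers):
    for key in sorted(shuffled[r]):                   # sort phase on each reduce node
        print(reduce_fn(key, shuffled[r][key]))       # e.g. ('big', 3), ('data', 2)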

MODULE-V
Q. 09 a What is Machine Learning? Explain different types of Regression Analysis
Machine Learning:

• Machine learning is a subset of AI that enables computers to learn from data.


• It allows systems to make decisions or predictions without being explicitly programmed.
• The models improve as they are exposed to more data, allowing for tasks like classification,
regression, and clustering.

Types of Regression Analysis:

1. Simple Linear Regression:


o Models the relationship between one independent variable and one dependent
variable.
o The equation is:
y = b0 + b1x
where y is the dependent variable, x is the independent variable, and b0, b1 are the coefficients.
o The aim is to find a straight line that fits the data with minimal error.
2. Least Squares Estimation (Least Squares Criterion):
o In simple linear regression, the best line is found by minimizing the sum of squared
errors between actual and predicted values.
o The goal is to minimize the deviation (error) from the line.
3. Multiple Regression:
o Used when there are multiple independent variables influencing the dependent
variable.
o The equation is:
y = b0 + b1x1 + b2x2 + ... + bnxn
where y is the dependent variable, x1, x2, ..., xn are the independent variables, and b0, b1, ..., bn are the coefficients.

o It allows for predictions using more than one predictor variable.


4. Non-linear Regression:
o Used when the relationship between variables is not linear.
o The equation involves non-linear functions of the predictors, like
y = b0 + b1x1^2 + b2x2^3.
5. Modeling Possibilities Using Regression:
o Forecasting: Predicting future outcomes (e.g., sales).
o Optimization: Analyzing data for maximum returns (e.g., marketing efforts).
o Risk Minimization: Identifying key factors influencing risk (e.g., customer
defaults).
o Predictive Modeling: Estimating future values (e.g., house prices).
o Understanding Relationships: Identifying how different variables are related
(e.g., marketing campaigns' effects).
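As a worked illustration of the least squares criterion (using invented data points), the sketch below estimates b0 and b1 for a simple linear regression with NumPy and reports the sum of squared errors that the fit minimizes.

import numpy as np

# Invented sample data: x = advertising spend, y = sales
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares estimates for the line y = b0 + b1*x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_pred = b0 + b1 * x
sse = np.sum((y - y_pred) ** 2)     # the quantity the least squares criterion minimizes
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse:.3f}")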

9b) Explain with neat diagram K-means clustering.



Overview of the K-Means Method


Key Steps in K-Means Clustering:

1. Initialization:
o Randomly initialize k cluster centroids (C1, C2, ..., Ck).
o These centroids act as initial cluster centers.
2. Assignment:
o Calculate the distance between each data point and all centroids.
o Assign each data point to the cluster with the nearest centroid.
o Common distance metrics:
▪ Euclidean Distance: d(x, y) = sqrt(Σ (xi − yi)^2)
▪ Manhattan Distance: d(x, y) = Σ |xi − yi|
▪ Cosine Distance: d(x, y) = 1 − (x · y) / (||x|| ||y||)
3. Update Centroids:
o Compute the new centroid of each cluster by calculating the mean position of all points in the cluster:
Ck = (1/Nk) Σ Xi, for i = 1 to Nk, where Nk is the number of points in cluster k.
4. Iterative Refinement:
o Repeat the Assignment and Update steps until:
▪ Centroids no longer change, or
▪ A predefined stopping criterion is met (e.g., maximum iterations).
5. Output:
o A set of k clusters with minimal intra-cluster distance and maximal inter-cluster distance.

Algorithm Steps:

1. Input:
o N: Number of data points (objects).
o k: Number of clusters.
2. Output:
o k clusters with minimized distance between points and their centroids.
3. Steps:
o Step 1: Randomly initialize k centroids.
o Step 2: Assign each point to the nearest centroid.
o Step 3: Update the centroid to the mean of the points in the cluster.
o Step 4: Repeat until centroids stabilize (no change in cluster membership).
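A compact sketch of these steps in Python with NumPy, using Euclidean distance; the sample points, k = 2, and the random seed are arbitrary choices for illustration (empty-cluster handling is omitted for brevity).

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # Step 1: random initialization
    for _ in range(max_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the points in its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # Step 4: stop on convergence
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9], [0.5, 1.2], [8.5, 9.5]])   # invented points
centroids, labels = kmeans(X, k=2)
print(centroids, labels)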

Diagram Representation:

1. Step 1: Randomly initialized centroids.


2. Step 2: Points assigned to nearest centroids.

3. Step 3: Centroids updated to the mean of their clusters.


4. Step 4: Repeat until convergence.

9c) Explain Naïve Bayes Theorem with example.

Naïve Bayes Technique


Overview:

1. Definition:
o A supervised machine learning technique based on probability theory.
o Computes the probability of an instance belonging to each class using prior
probabilities and likelihoods.
2. Formula:
P(C | X) = [P(X | C) × P(C)] / P(X)
where P(C | X) is the posterior probability of class C given the predictor X, P(X | C) is the
likelihood, P(C) is the prior probability of the class, and P(X) is the evidence (prior
probability of the predictor).

Advantages:

1. Performs well with small training datasets.


2. Computationally efficient and easy to implement.
3. Performs better when predictors are independent.

Disadvantages:

1. Assumes independence between predictors, which is unrealistic in many real-world


scenarios.
2. Zero-frequency issue: Assigns zero probability to unseen categories in the test data,
requiring smoothing techniques like Laplace estimation to address this.
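A small worked example (with an invented weather/play dataset) showing how the Naïve Bayes probabilities are computed from priors and per-feature likelihoods, with Laplace smoothing to avoid the zero-frequency issue:

from collections import Counter, defaultdict

# Invented training data: features (outlook, windy), class label 'play'
data = [("sunny", "no", "yes"), ("sunny", "yes", "no"), ("rainy", "yes", "no"),
        ("overcast", "no", "yes"), ("rainy", "no", "yes"), ("sunny", "no", "yes")]

class_counts = Counter(label for *_, label in data)      # counts used for the priors P(C)
feature_counts = defaultdict(Counter)                    # counts used for likelihoods P(xi | C)
feature_values = defaultdict(set)
for *features, label in data:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1
        feature_values[i].add(value)

def predict(features):
    total = sum(class_counts.values())
    scores = {}
    for label, c_count in class_counts.items():
        score = c_count / total                          # prior P(C)
        for i, value in enumerate(features):
            counts = feature_counts[(i, label)]
            # Laplace smoothing: add 1 so unseen values never get zero probability
            score *= (counts[value] + 1) / (c_count + len(feature_values[i]))
        scores[label] = score                            # proportional to the posterior P(C | X)
    return max(scores, key=scores.get), scores

print(predict(("sunny", "no")))    # favours 'yes' on this toy data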

Q. 10a) Explain the five phases in a text mining process pipeline.



Five Phases in Text Mining Process Pipeline

Text mining is a systematic process to extract meaningful information from textual data. The five
phases of text mining are:

Phase 1: Text Pre-Processing

This phase focuses on preparing raw textual data for further analysis by performing the following
steps:

1. Text Cleanup:
o Removes unnecessary or unwanted information.
o Corrects typos (e.g., "teh" becomes "the").
o Resolves inconsistencies, removes outliers, and fills missing values.
o Example: Removing comments or replacing "%20" in URLs.
2. Tokenization:
o Splits text into tokens (words) using white spaces and punctuation as delimiters.
3. POS Tagging:
o Labels each word with its part of speech (noun, verb, etc.).
o Helps recognize entities like names or places.
4. Word Sense Disambiguation:
o Identifies the correct meaning of ambiguous words based on context.
o Example: "bank" could mean a financial institution or a riverbank.
5. Parsing:
o Creates a grammatical structure (parse-tree) for sentences.
o Determines relationships between words.

Phase 2: Feature Generation

This phase transforms text into features for analysis:

1. Bag of Words (BoW):


o Represents text as word occurrences without considering order.
o Useful for document classification.
2. Stemming:
o Reduces words to their root forms (e.g., "speaking" → "speak").
o Normalizes plurals, verb tenses, and affixes.
3. Stop Words Removal:
o Eliminates common words like "a," "the," or "in" that don't contribute to analysis.
4. Vector Space Model (VSM):
o Represents text as numeric vectors based on word frequencies using TF-IDF.
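A brief sketch of feature generation on an invented mini-corpus, using scikit-learn's TfidfVectorizer to combine bag-of-words counting, English stop-word removal, and TF-IDF weighting into a vector space model (plain Python counting would work equally well):

from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus for illustration
docs = [
    "The food at the cafe was tasty and the service was quick",
    "Quick delivery but the food was cold",
    "Tasty food and friendly service",
]

# Bag-of-words with TF-IDF weighting; built-in English stop-word removal
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)          # each document becomes a numeric vector (VSM)

print(vectorizer.get_feature_names_out())   # the vocabulary (features)
print(X.toarray().round(2))                 # TF-IDF weight of each term per document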

Phase 3: Feature Selection

This phase filters and reduces features to relevant subsets:



1. Dimensionality Reduction:
o Removes redundant and irrelevant features.
o Methods include PCA and LDA.
2. N-gram Evaluation:
o Identifies sequences of words (e.g., "tasty food" for 2-gram).
3. Noise Detection:
o Identifies and removes unusual or suspicious data points.

Phase 4: Data Mining Techniques

This phase applies algorithms to structured data:

1. Unsupervised Learning:
o Clustering groups similar data without predefined labels.
o Example: Grouping blog posts by topic.
2. Supervised Learning:
o Classification assigns labels based on training data.
o Example: Email spam filtering.
3. Evolutionary Pattern Identification:
o Summarizes changes over time, like trends in news articles.

Phase 5: Analyzing Results

This final phase evaluates and interprets outcomes:

1. Result Evaluation:
o Determines if results meet expectations.
2. Interpretation:
o Discards or refines processes based on results.
3. Visualization:
o Prepares visual representations for better understanding.
4. Utilization:
o Applies insights to improve industry or enterprise activities.

Text Mining Challenges

• NLP Issues: Ambiguity, tokenization, parsing, and stemming.


• Data Variety: Handling unstructured, multi-language data.
• Mining Techniques: Selecting algorithms and working with large datasets.
• Scalability and Real-time Processing: Efficiently handling massive text streams.

These phases and challenges illustrate the systematic approach and complexity of text mining.

10b) Explain Web Usage Mining.



Web Usage Mining

Web usage mining refers to the process of extracting useful information and patterns from the data
generated through webpage visits and transactions. This involves analyzing the activity data
captured at various levels during a user's interaction with websites and applications.

Sources of Data

• Server Access Logs: Record of the pages served to users.


• Referrer Logs: Information about the source of a user’s visit.
• Agent Logs: Details about the browser or software used by the user.
• Client-side Cookies: Data stored on the user’s device that tracks browsing behavior.

Additionally, metadata such as page attributes, content attributes, and usage data is collected to
provide a comprehensive dataset.

Analysis Levels

Web content and usage can be analyzed at multiple levels:

1. Server-Side Analysis:
o Focuses on the relative popularity of web pages accessed.
o Identifies hubs (central resources) and authorities (high-value content sources).
2. Client-Side Analysis:
o Focuses on user activity and content consumed.
o Comprises two main types of analysis:
1. Usage Pattern Analysis:
▪ Uses clickstream analysis to track the sequence of clicks, locations,
and durations of visits.
▪ Applications: Web activity analysis, market research, software
testing, employee productivity analysis.
2. Content Analysis:
▪ Textual information accessed by users is structured using techniques
like the Bag-of-Words model.
▪ The text is analyzed for:
▪ Cluster Analysis: Grouping similar topics.
▪ Association Rules: Finding patterns like user segmentation
or sentiment trends.
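As an illustration of the client-side usage pattern analysis described above, the sketch below runs a toy clickstream analysis over invented (session, page) records: it measures page popularity and the most frequent page-to-page transitions, the kind of navigation pattern web usage mining extracts.

from collections import Counter, defaultdict

# Invented clickstream records: (session_id, page) in time order
clicks = [
    ("s1", "/home"), ("s1", "/products"), ("s1", "/cart"),
    ("s2", "/home"), ("s2", "/blog"),
    ("s3", "/home"), ("s3", "/products"), ("s3", "/checkout"),
]

page_popularity = Counter(page for _, page in clicks)     # server-side view: popular pages

sessions = defaultdict(list)                              # client-side view: per-session paths
for session_id, page in clicks:
    sessions[session_id].append(page)

transitions = Counter()                                   # page-to-page navigation patterns
for path in sessions.values():
    transitions.update(zip(path, path[1:]))

print(page_popularity.most_common(3))
print(transitions.most_common(3))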

Business Applications of Web Usage Mining

1. User Behavior Prediction:


o Uses previously learned rules and user profiles to predict future actions.
2. Client Value Determination:
o Assesses the lifetime value of customers based on their interactions.
3. Cross-Marketing Strategies:

o Identifies association rules among webpage visits to suggest complementary products or services.
4. Campaign Evaluation:
o Measures the effectiveness of promotional campaigns by analyzing user
engagement with relevant pages.
5. Dynamic Information Presentation:
o Provides targeted ads, coupons, or content based on user interests and access
patterns.

Key Techniques

• Clickstream Analysis: Analyzing the sequence of user clicks to uncover navigation


patterns.
• Text Mining: Analyzing page content for user interests and sentiment.
• Cluster Analysis: Grouping similar data points or users.
• Association Rule Mining: Identifying relationships between different data elements.

Web usage mining provides insights that are crucial for personalized user experiences, business
growth, and effective marketing strategies.

Figure 9.7 shows three phases for web usage mining.
