BDA Answerbank
Sr. No. | Question Text | Marks | CO Number
What is big data analytics? Explain the four 'V's of Big Data. Briefly discuss applications of big data.
1. Healthcare:
2. Transportation:
3. Medicine:
Unstructured Data:
2. Velocity:
3. Variety:
5. Analysis Techniques:
6. Use Cases:
7. Value:
2. MapReduce:
1. NameNode:
2. DataNode:
Draw the HDFS architecture. Explain any two of the following HDFS commands, with syntax and at least one example of each: copyFromLocal, setrep, checksum.
1. copyFromLocal:
2. setrep:
3. checksum:
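The following command lines show the standard syntax of all three commands; the local and HDFS paths (/home/user/sales.csv, /data/sales.csv) are made up for illustration.

# copyFromLocal: copy a file from the local filesystem into HDFS
# Syntax: hdfs dfs -copyFromLocal <local-source> <hdfs-destination>
hdfs dfs -copyFromLocal /home/user/sales.csv /data/sales.csv

# setrep: change the replication factor of a file (add -R for a directory, -w to wait until replication completes)
# Syntax: hdfs dfs -setrep [-R] [-w] <replication-factor> <hdfs-path>
hdfs dfs -setrep -w 2 /data/sales.csv

# checksum: print the checksum information of a file stored in HDFS
# Syntax: hdfs dfs -checksum <hdfs-path>
hdfs dfs -checksum /data/sales.csv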
Q.12 | 7 Marks | CO2
The Hadoop ecosystem is a collection of open-source software components and tools that complement the Hadoop core components, extending their capabilities for distributed storage and processing of large datasets.
Q.13 | 7 Marks | CO2
1. Map Phase:
• Map Phase is the first step in the MapReduce process. It
involves breaking down the input data into key-value pairs
and performing some initial processing or transformation on
each data point. The Map phase runs independently on each
node in the cluster.
• Map Function: The core component of the Map phase is the
Map function. The Map function takes an input record and
emits one or more key-value pairs as intermediate output.
Example: suppose you have a large log file containing web server access logs, and you want to count the occurrences of each unique URL. For every log entry, the Map function would emit the key-value pair (URL, 1).
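A minimal Hadoop Streaming style Map function for this example can be sketched in Python as follows; the assumption that the requested URL is the seventh whitespace-separated field (as in the common Apache log format) is made only for illustration.

#!/usr/bin/env python3
# mapper.py - emits (URL, 1) for every access-log line read from stdin.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 6:
        url = fields[6]        # assumed position of the requested URL
        print(f"{url}\t1")     # tab separates key and value (Hadoop Streaming default)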
• After the Map phase, all the key-value pairs emitted by the
Map functions are collected and grouped by their keys. This
process is known as the Shuffle and Sort phase.
• The Shuffle and Sort phase ensures that all values associated
with a specific key end up on the same reducer node for
processing in the Reduce phase.
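Continuing the same illustrative URL-count example, a matching Hadoop Streaming style reducer (a sketch, not the only possible implementation) sums the values that the Shuffle and Sort phase has grouped and sorted by key:

#!/usr/bin/env python3
# reducer.py - sums the counts for each URL; input arrives sorted by key,
# so all lines for one URL are consecutive.
import sys

current_url, current_count = None, 0
for line in sys.stdin:
    url, count = line.rstrip("\n").split("\t")
    if url == current_url:
        current_count += int(count)
    else:
        if current_url is not None:
            print(f"{current_url}\t{current_count}")
        current_url, current_count = url, int(count)
if current_url is not None:
    print(f"{current_url}\t{current_count}")

The pair could then be run with Hadoop Streaming, for example: hadoop jar hadoop-streaming.jar -input <log dir> -output <output dir> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (paths and jar location are illustrative).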
1. Map Phase:
• Input Data: The Map Phase begins with a large dataset that
needs to be processed. This dataset is divided into smaller
splits or chunks, with each chunk assigned to a worker node
in the cluster for processing.
• Mapper Function: A user-defined Mapper function is applied
to each chunk of data independently. The Mapper function
takes the input data, processes it, and generates a set of key-
value pairs. The Mapper function can perform various
operations on the data, such as filtering, parsing, and
transformation.
• Intermediate Key-Value Pairs: The output of the Mapper function consists of intermediate key-value pairs, where the key represents a specific category or group, and the value is associated data or a count. These intermediate pairs are buffered in memory and spilled to the local disk of the worker node; they are not written to the Hadoop Distributed File System (HDFS).
• Shuffling and Sorting: Once the Mapper tasks complete, the
MapReduce framework performs a shuffle and sort operation.
It groups together intermediate key-value pairs with the same
key, ensuring that all values associated with a particular key
are grouped together. This is a critical step in preparing data
for the Reduce Phase.
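To tie the Map, Shuffle and Sort, and Reduce steps together, here is a small self-contained Python sketch that simulates the whole flow in memory on a toy word-count input; a real MapReduce job distributes the same logic across the cluster.

from collections import defaultdict

records = ["big data big insights", "big data pipelines"]   # toy input splits

# Map: emit (word, 1) for every word in every record.
intermediate = []
for record in records:
    for word in record.split():
        intermediate.append((word, 1))

# Shuffle and Sort: group all emitted values by their key.
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# Reduce: aggregate the grouped values for each key.
result = {key: sum(values) for key, values in sorted(grouped.items())}
print(result)   # {'big': 3, 'data': 2, 'insights': 1, 'pipelines': 1}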
Important Points:
1. Data Model:
• Relational Database: Relational databases use a
structured, tabular data model with predefined
schemas. Data is organized into tables with rows and
columns.
• NoSQL Database: NoSQL databases use various data
models, including document-oriented, key-value,
column-family, and graph-based models. They offer
schema flexibility and can handle semi-structured or
unstructured data.
2. Scalability:
• Relational Database: Traditional relational databases
are typically scaled vertically by adding more resources
(CPU, RAM) to a single server. This approach has
limitations in terms of scalability.
• NoSQL Database: NoSQL databases are designed for
horizontal scalability, allowing data to be distributed
across multiple nodes or servers. They can easily
handle large and growing datasets.
3. Schema Flexibility:
• Relational Database: Relational databases require a
predefined schema, where the structure of the data
(tables, columns, data types) must be determined
before data insertion.
• NoSQL Database: NoSQL databases offer dynamic
schema flexibility, allowing data to be inserted without
a predefined schema. This flexibility is particularly
useful in agile development and handling evolving
data.
4. Query Language:
• Relational Database: Relational databases use SQL
(Structured Query Language) for querying and
manipulation. SQL is a powerful query language with
standardized syntax.
• NoSQL Database: NoSQL databases use various query
languages, some of which are specific to the database
type. These query languages are often optimized for
specific data models.
5. ACID vs. BASE:
• Relational Database: Relational databases adhere to the ACID (Atomicity, Consistency, Isolation, Durability) properties, ensuring strong data consistency and integrity.
• NoSQL Database: NoSQL databases typically follow the BASE (Basically Available, Soft state, Eventually consistent) model, trading strict consistency for availability and scalability.
i) Document-Oriented Database:
Q.19 | 7 Marks | CO3
1. Data Model:
• SQL Database:
• SQL databases use a structured, tabular data model with predefined schemas.
• Data is organized into tables with rows and columns.
• SQL databases enforce strong schema constraints.
• NoSQL Database:
• NoSQL databases use various data models, including document-oriented, key-value, column-family, and graph-based models.
• They offer schema flexibility and can handle semi-structured or unstructured data.
• Some NoSQL databases allow dynamic schema changes.
2. Query Language:
• SQL Database:
• SQL databases use SQL (Structured Query Language)
for querying and manipulation.
• SQL is a powerful query language with standardized
syntax for defining and retrieving data.
• NoSQL Database:
• NoSQL databases use various query languages, some
of which are specific to the database type.
• Query languages may be optimized for specific data
models and use cases.
3. Scaling:
• SQL Database:
• Traditional relational databases are scaled vertically by
adding more resources (CPU, RAM) to a single server.
• Vertical scaling has limitations in terms of scalability.
• NoSQL Database:
• NoSQL databases are designed for horizontal
scalability, allowing data to be distributed across
multiple nodes or servers.
• They can easily handle large and growing datasets.
4. Schema Flexibility:
• SQL Database:
• Relational databases require a predefined schema
where the structure of the data (tables, columns, data
types) must be determined before data insertion.
• Schema changes can be complex and time-consuming.
• NoSQL Database:
• NoSQL databases offer dynamic schema flexibility,
allowing data to be inserted without a predefined
schema.
• This flexibility is particularly useful in agile
development and handling evolving data.
5. ACID vs. BASE:
• SQL Database:
• SQL databases adhere to the ACID (Atomicity,
Consistency, Isolation, Durability) properties, ensuring
strong data consistency and integrity.
• ACID transactions provide strict data guarantees.
• NoSQL Database:
• NoSQL databases follow the BASE (Basically Available,
Soft state, Eventually consistent) model.
6. Use Cases:
• SQL Database:
• SQL databases are well-suited for applications with
complex relationships, structured data, and
transactions.
• They are commonly used in industries like finance,
healthcare, and traditional enterprise applications.
• NoSQL Database:
• NoSQL databases are suitable for applications with
large volumes of unstructured or semi-structured data,
real-time data processing, and high scalability
requirements.
• They are used in industries like social media, e-commerce, IoT, and big data analytics.
7. Examples:
• SQL Database: MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server.
• NoSQL Database: MongoDB (document), Cassandra and HBase (column-family), Redis (key-value), Neo4j (graph).
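To make the query-language difference concrete, the two snippets below express the same request (active users older than 30) first in SQL and then in MongoDB's document query language; the users table/collection and its fields are assumptions made for the example.

-- SQL (relational)
SELECT name, email
FROM users
WHERE age > 30 AND status = 'active';

// MongoDB (document-oriented NoSQL)
db.users.find(
  { age: { $gt: 30 }, status: "active" },
  { name: 1, email: 1, _id: 0 }
)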
A data stream is a continuous, potentially unbounded flow of data generated over time; it does not have a fixed end and can continue indefinitely. These data streams can
originate from various sources, such as sensors, social media feeds,
financial markets, IoT devices, and more. Analyzing and processing
data streams are critical in modern applications to make real-time
decisions, gain insights, and detect patterns or anomalies.
DIAGRAM:
1. Data Sources:
• Data streams originate from a variety of sources,
including sensors, IoT devices, social media, web
applications, and more.
• These sources continuously produce data points, such
as events, measurements, logs, or messages, with
associated timestamps.
2. Data Ingestion:
• Data ingestion is the process of collecting and
importing data streams into the stream processing
system.
• Ingestion components handle data sources and adapt
data into a format suitable for processing.
3. Stream Processing Engine:
• The stream processing engine is the core component
responsible for processing data streams in real-time.
• It includes libraries, APIs, and tools for defining and executing data processing operations, such as filtering, mapping, aggregation, and windowing (a small windowing sketch follows this list).
4. Complex Event Processing (CEP): (Optional)
• CEP is a specialized component that detects complex
patterns or conditions within data streams.
• It enables the identification of specific events or
sequences of events, triggering actions or alerts.
5. Storage (Optional):
• Some stream processing architectures include storage
for historical data, auditing, or offline analysis.
• Storage systems may store processed data, aggregated
results, or raw data for later retrieval.
6. Output and Actions:
• Processed data streams can trigger various real-time
actions, decisions, or outputs.
• Outputs may include alerts, notifications, updates to
databases, visualization on dashboards, or external
system integrations.
7. Scalability:
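As referenced in point 3 above, here is a minimal, purely illustrative Python sketch of one common stream-processing operation, a tumbling-window aggregation; the events and the 5-second window length are made up for the example.

from collections import Counter

# Stream of (timestamp_in_seconds, url) events.
events = [(1, "/home"), (2, "/cart"), (4, "/home"), (7, "/home"), (9, "/cart")]
WINDOW = 5  # tumbling window length in seconds

windows = {}
for ts, url in events:
    window_start = (ts // WINDOW) * WINDOW   # events in 0-4s fall in one window, 5-9s in the next
    windows.setdefault(window_start, Counter())[url] += 1

for start, counts in sorted(windows.items()):
    print(f"window [{start}, {start + WINDOW}) -> {dict(counts)}")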
DIAGRAM:
3. Filter Definition:
5. Inclusion or Exclusion:
• If the data point satisfies the filtering criteria (i.e., it meets the
conditions), it is included in the filtered stream. Otherwise, it is
excluded.
• Included data points are typically passed to downstream
processing stages, while excluded data points are discarded
or archived, depending on the use case.
6. Types of Filters:
7. Continuous Processing:
8. Output Stream:
1. Explicit Filtering:
• Explicit filtering refers to the process of applying filters
directly by specifying the criteria or conditions
explicitly. Users or developers define the filtering rules,
and the filtering operation is carried out accordingly.
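A minimal Python sketch of explicit filtering over a simulated data stream; the sensor readings and the 75-degree threshold are assumptions made for the example.

def sensor_stream():
    # Simulated unbounded source; a real system would read from sensors or a message queue.
    readings = [
        {"sensor": "s1", "temperature": 21.5},
        {"sensor": "s2", "temperature": 78.0},
        {"sensor": "s1", "temperature": 19.9},
        {"sensor": "s3", "temperature": 85.2},
    ]
    for reading in readings:
        yield reading

def explicit_filter(stream, threshold=75.0):
    # Explicit filtering: the criterion is stated directly in the code.
    # Matching data points are passed downstream; the rest are discarded.
    for event in stream:
        if event["temperature"] > threshold:
            yield event

for alert in explicit_filter(sensor_stream()):
    print("ALERT:", alert)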
There are several types of data sampling techniques, each with its own purpose and method (a short Python sketch of two of them follows this list):
1. Random Sampling:
• Random sampling involves selecting data points from
a dataset purely by chance, with each data point
having an equal probability of being included in the
sample.
• Random sampling is unbiased and is often used when
researchers want to avoid introducing any systematic
bias into the sample.
• Methods like simple random sampling and stratified
random sampling fall under this category.
2. Stratified Sampling:
• In stratified sampling, the dataset is divided into
subgroups or strata based on specific characteristics or
attributes.
• A random sample is then taken from each stratum in
proportion to its representation in the overall
population.
• Stratified sampling ensures that important subgroups
are adequately represented in the sample, making it
useful when certain subgroups are of particular
interest.
3. Systematic Sampling:
• Systematic sampling involves selecting data points at
regular intervals from a sorted or ordered dataset.
• For example, if you have a list of customer names in
alphabetical order, you might select every 10th
customer for your sample.
• Systematic sampling is efficient and can be less time-
consuming than pure random sampling.
4. Cluster Sampling:
• In cluster sampling, the dataset is divided into clusters
or groups, and a random sample of clusters is selected.
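As noted above, here is a short Python sketch of two of these techniques, simple random sampling and systematic sampling, applied to a hypothetical list of 1,000 customer IDs.

import random

population = list(range(1, 1001))          # hypothetical customer IDs

# Simple random sampling: every element has an equal chance of selection.
random_sample = random.sample(population, k=50)

# Systematic sampling: every k-th element from the ordered dataset,
# starting at a random offset within the first interval.
k = len(population) // 50
start = random.randrange(k)
systematic_sample = population[start::k]

print(len(random_sample), len(systematic_sample))   # 50 50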
1. Data Ingestion:
2. Hive Metastore:
3. HiveQL:
5. Query Optimization:
7. Query Execution:
11. Query Result Retrieval: Users can retrieve the query results using the Hive command-line interface, a graphical user interface (such as Hue), or by integrating Hive with external tools like BI tools or applications.
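An illustrative HiveQL sketch of this flow; the table name, columns, delimiter, and input path are assumptions made for the example, not part of the original answer.

CREATE TABLE web_logs (ip STRING, url STRING, status INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/data/access_logs' INTO TABLE web_logs;

-- Hive compiles this query into distributed jobs behind the scenes.
SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;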
Q.30 | 7 Marks | CO5
1. User Interface:
• The top layer of Hive's architecture includes various user interfaces through which users interact with the system.
• The command-line interface (CLI), Hive shell, and web-based interfaces like Hue provide ways for users to submit HiveQL queries and manage Hive operations.
2. Driver:
• The Driver is responsible for parsing, compiling,
optimizing, and executing HiveQL queries.
• It coordinates the flow of query processing and
communicates with other components of the Hive
architecture.
3. Compiler and Query Optimizer:
• The Compiler takes the HiveQL queries and generates
an execution plan in the form of a directed acyclic
graph (DAG).
Q.31 | 3 Marks | CO5
1. Embedded Metastore:
• The Embedded Metastore, also known as the built-in Metastore, is the default Metastore provided with Hive.
• It uses an embedded database, such as Apache Derby or Apache HSQLDB, to store metadata.
• The Embedded Metastore is suitable for small to
moderate-sized Hive installations, where the metadata
storage requirements are not extensive.
• It is easy to set up and does not require external
database installation or configuration.
2. Local Metastore:
• The Local Metastore is similar to the Embedded Metastore in that the Metastore service runs within the same JVM as the Hive process.
• However, it differs in that the metadata is stored in a separate, standalone database (for example, MySQL) rather than in the embedded Derby database, which allows multiple Hive sessions to access the Metastore concurrently.
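A hedged configuration sketch: which Metastore mode Hive uses is controlled by properties in hive-site.xml. The snippet below shows the embedded Derby default and, commented out, an external MySQL connection of the kind a Local Metastore setup would use; all host names and values are illustrative.

<configuration>
  <!-- Embedded Metastore: Derby database created alongside the Hive session -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
  <!-- Local Metastore: same JVM, but metadata kept in an external database
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://db-host:3306/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  -->
</configuration>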
1. Abstraction Layer:
• Pig provides a high-level abstraction layer over Hadoop MapReduce. Users can express data transformations using a simple scripting language called Pig Latin, without needing to write complex MapReduce code (a short Pig Latin sketch follows this feature list).
2. Ease of Use:
• Pig Latin is similar to SQL, making it relatively easy for
those familiar with SQL to learn and use Pig.
• Pig abstracts away many of the low-level details of
MapReduce, reducing the learning curve for Hadoop
newcomers.
3. Extensibility:
• Pig allows users to write User-Defined Functions
(UDFs) in Java, Python, or other languages, which can
be integrated into Pig scripts. This extensibility enables
custom data processing.
4. Data Flow Language:
• Pig Latin scripts define data flows, where data is
loaded, transformed, and stored in a series of steps.
This makes it easy to understand and visualize the data
processing logic.
5. Optimization:
• Pig includes an optimization phase that can
automatically optimize the execution plan of a Pig
Latin script for better performance. It can reorder
operations and reduce data movement, improving
query efficiency.
6. Schema Flexibility:
• Pig is schema-agnostic, which means it can handle
structured, semi-structured, and unstructured data.
This flexibility is useful for processing diverse data
sources.
7. Support for Complex Data Types:
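A short illustrative Pig Latin sketch of such a data flow (as mentioned in point 1 above); the file name, schema, and delimiter are assumptions made for the example.

-- Load, filter, group, and count web-server log records.
logs   = LOAD 'access_log.txt' USING PigStorage(' ')
         AS (ip:chararray, url:chararray, status:int);
ok     = FILTER logs BY status == 200;
by_url = GROUP ok BY url;
hits   = FOREACH by_url GENERATE group AS url, COUNT(ok) AS hits;
DUMP hits;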
Q.33 | 4 Marks | CO5
Q.35 | Differentiate between Apache Pig and NoSQL. | 3 Marks | CO5
1. Use Case:
• Apache Pig: Apache Pig is a platform for processing and analyzing large datasets. It is used for data transformation, ETL (Extract, Transform, Load), and data analysis tasks.
• NoSQL Databases: NoSQL databases are designed for efficient storage and retrieval of unstructured, semi-structured, or structured data. They are used for data persistence, often in real-time or near-real-time applications.
2. Data Processing:
• Apache Pig: Pig is primarily used for data processing and analysis, providing a scripting language (Pig Latin) for expressing data transformations.
• NoSQL Databases: NoSQL databases are storage systems optimized for data retrieval and management. They don't offer data processing capabilities like Pig.
3. Scalability:
• Apache Pig: Pig doesn't provide inherent horizontal scalability; performance depends on the underlying Hadoop cluster's scalability.
• NoSQL Databases: NoSQL databases are designed for horizontal scalability, making them suitable for handling large and growing datasets.
4. Data Retrieval:
• Apache Pig: Pig focuses on data processing and doesn't provide real-time data retrieval capabilities.
• NoSQL Databases: NoSQL databases are designed for efficient data retrieval, supporting real-time or near-real-time access to stored data.
5. Learning Curve:
• Apache Pig: Pig has a lower learning curve for users with SQL experience due to its SQL-like syntax.
• NoSQL Databases: NoSQL databases may have varying learning curves depending on the database type and query language.
6. Data Analytics and Reporting:
• Apache Pig: Pig can be used for data analysis, but it is not optimized for real-time analytics or reporting.
• NoSQL Databases: Some NoSQL databases support analytics and reporting features, but they are primarily focused on data storage and retrieval.