Week 2
1. Which statement best describes the data storage model used by HBase?
a. Key-value pairs
b. Document-oriented
c. Encryption
d. Relational tables
Ans-
Option a: Key-value pairs - Correct. HBase is a NoSQL database that stores data
in a key-value format. Each row is identified by a unique key, and columns within
a row are organized into column families.
Option b: Document-oriented - Incorrect. Document-oriented databases like
MongoDB store data as self-contained documents, while HBase uses a
key-value model.
Option c: Encryption - Incorrect. Encryption is a data security mechanism, not a data storage model.
Option d: Relational tables - Incorrect. Relational databases use structured tables
with rows and columns, while HBase is a NoSQL database with a flexible
schema.
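The key-value access pattern can be illustrated with a brief sketch against the standard HBase Java client API; the table name "users", column family "info", qualifier "name", and row key "user#1001" are hypothetical.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyValueExample {
    public static void main(String[] args) throws Exception {
        // Connect using the cluster configuration found on the classpath
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {   // "users" is a hypothetical table

            // Write: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back by row key
            Get get = new Get(Bytes.toBytes("user#1001"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}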
2. What is Apache Avro primarily used for in the context of Big Data?
a. Real-time data streaming
b. Data serialization
c. Machine learning
d. Database management
Ans-
Option a: Real-time data streaming - Incorrect. While Avro can be used in
streaming applications, its primary focus is data serialization.
Option b: Data serialization - Correct. Avro is a data serialization format that
efficiently encodes data structures for storage and transmission.
Option c: Machine learning - Incorrect. Avro can be used to store data for
machine learning models, but its core functionality is data serialization.
Option d: Database management - Incorrect. Avro is not a database
management system, but a format for storing data.
Explanation-
Apache Avro is a framework for data serialization. It provides a compact, fast,
and efficient way to serialize and deserialize data, making it suitable for
communication between different systems or for persisting data in a binary
format. Avro is commonly used in Big Data applications for serialization of data in
a way that supports schema evolution and provides interoperability across
various programming languages.
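A brief sketch of serialization and deserialization with Avro's Java API; the "User" record schema and its fields are illustrative, not taken from the question.

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import java.io.File;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // A hypothetical record schema; Avro schemas are written in JSON
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"name\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Serialize one record into an Avro container file (the schema is embedded in the file)
        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1L);
        user.put("name", "Alice");
        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Deserialize: the reader picks up the writer's schema from the file
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("id") + " " + rec.get("name"));
            }
        }
    }
}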
3. Which component in HDFS is responsible for storing actual data blocks on the
DataNodes?
a. NameNode
b. DataNode
c. Secondary NameNode
d. ResourceManager
Ans- b. DataNode
Explanation-
In the Hadoop Distributed File System (HDFS), DataNodes are responsible for storing
the actual data blocks. Each DataNode manages the storage of the blocks assigned to it
and periodically sends heartbeat signals and block reports to the NameNode to confirm
its status and the health of the blocks it stores.
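As a sketch of how this placement can be observed from an application, the HDFS Java API exposes the DataNodes holding each block of a file; the path "/data/input.txt" below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        // "/data/input.txt" is a hypothetical HDFS path
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/input.txt");
        FileStatus status = fs.getFileStatus(path);

        // Each BlockLocation lists the DataNodes holding a replica of that block
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}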
4. Which feature of HDFS ensures fault tolerance by replicating data blocks
across multiple DataNodes?
a. Partitioning
b. Compression
c. Replication
d. Encryption
Ans- c. Replication
Explanation-
HDFS achieves fault tolerance through the replication of data blocks. Each data
block is replicated across multiple DataNodes, which helps in ensuring data
availability and reliability even in the event of hardware failures. By default, each
block is replicated three times across different nodes to safeguard against data
loss.
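The cluster-wide default replication factor is controlled by the dfs.replication property in hdfs-site.xml; a sketch of the relevant entry, shown with the standard default of 3:

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>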
Ans-
Option a: Mapper - Incorrect. The Mapper generates key-value pairs, but doesn't
perform sorting or grouping.
Option b: Reducer - Incorrect. The Reducer processes the grouped key-value
pairs, but doesn't perform the initial sorting and grouping.
Option c: Partitioner - Correct. The Partitioner determines which Reducer will
process a specific key-value pair.
Option d: Combiner - Incorrect. The Combiner is an optional optimization that can
reduce data volume before it is sent to the Reducer, but it doesn't perform
sorting and grouping.
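To illustrate the Partitioner's role, here is a minimal custom Partitioner sketch against the Hadoop Java API; the class FirstLetterPartitioner and its first-letter routing rule are hypothetical, whereas the framework's default HashPartitioner simply hashes the key modulo the number of reduce tasks.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (word, count) pair to a reducer based on the word's
// first character; the rule itself is only illustrative.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;
        }
        String word = key.toString();
        char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
        return Math.abs(first) % numReduceTasks;
    }
}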
Explanation-
The default replication factor in HDFS is 3. This means that each data block is
replicated three times across different DataNodes. This default replication factor
strikes a balance between data redundancy and storage overhead, providing
fault tolerance and high availability for the data.
Ans-
Option a: Sorting input data - Incorrect. The Mapper and Partitioner handle
data sorting and distribution.
Option b: Transforming intermediate data - Correct. The Reducer can transform
intermediate data based on the key-value pairs it receives.
Option c: Aggregating results - Correct. The Reducer is often used to aggregate
values based on the key.
Option d: Splitting input data - Incorrect. The input data is divided into input
splits by the InputFormat, not by the Reducer.
Explanation-
The Word Count application is a classic example of a MapReduce job. It involves
counting the frequency of each word in a large corpus of text. The Mapper
extracts words from the text and emits them as key-value pairs with a count of 1.
The Reducer then sums up these counts for each unique word to produce the
final word count results.
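A condensed sketch of the classic Hadoop WordCount job in Java, closely following the well-known example; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts received for each unique word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}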
Explanation-
Reversing a web link graph involves inverting the direction of the edges between
nodes (web pages). In a web link graph, each directed edge represents a
hyperlink from one page to another. Reversing the graph means changing the
direction of these links, so a link from Page A to Page B becomes a link from
Page B to Page A. This is useful for various analyses, such as computing
PageRank in a different context or understanding link relationships from a
different perspective.
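A minimal MapReduce sketch of this inversion, assuming each input line holds one tab-separated "source target" edge; the class names and input layout are illustrative, and the driver would be configured in the same way as the WordCount job above.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReverseLinkGraph {

    // Mapper: for each edge "source \t target", emit (target, source),
    // i.e. flip the direction of the link
    public static class InvertMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] edge = value.toString().split("\t");
            if (edge.length == 2) {
                context.write(new Text(edge[1]), new Text(edge[0]));
            }
        }
    }

    // Reducer: for each target page, collect every page that links to it
    public static class CollectReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sources = new StringBuilder();
            for (Text src : values) {
                if (sources.length() > 0) {
                    sources.append(",");
                }
                sources.append(src.toString());
            }
            context.write(key, new Text(sources.toString()));
        }
    }
}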