BDA Viva Questions
Characteristics of Big Data : 4 V’s
1. Volume : Large amount of data.
2. Velocity : Rate at which data is generated and processed, often in real time.
3. Variety : Data may be structured, semi-structured, or unstructured.
4. Veracity : Quality of the data, i.e. its accuracy and trustworthiness.
Definition of Hadoop :
● Hadoop is an Apache open-source framework written in Java that allows distributed
processing of large datasets across clusters of computers using simple programming models.
Pros :
1. Open source
2. Scalable
3. High fault tolerance
Cons :
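1. Security : weak by default (features such as Kerberos authentication must be enabled explicitly)
2. Issues with small files : every file's metadata is held in NameNode memory
3. Vulnerable : a weak or default configuration leaves the cluster exposed
4. No real-time processing : MapReduce is batch-oriented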
HDFS :
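● Works on a master-slave architecture
● Master : NameNode, Secondary NameNode, JobTracker
● Slaves : DataNode, TaskTracker
● Note : JobTracker and TaskTracker are the MapReduce daemons of Hadoop 1.x; the HDFS
daemons proper are the NameNode, Secondary NameNode, and DataNode.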
NameNode :
1. Maintains and manages the DataNodes
2. Stores the file system metadata (directory tree, permissions, file-to-block mapping) in:
● FsImage
● EditLogs
3. Single point of failure (in Hadoop 1.x)
DataNode :
● Responsible for storing the actual data in HDFS
● Serving read and write requests
● Block creation, deletion, and replication
● NameNode and DataNode are in constant communication (periodic heartbeats and block reports)
Secondary NameNode :
● Not a backup (standby) NameNode: it cannot take over if the NameNode fails
● Helper Node: The Secondary NameNode assists the NameNode by managing and
merging the file system metadata.
● Metadata Merging: It periodically merges the edit logs and the fsimage (file system
image) to create an updated fsimage.
● Reduces Edit Logs: By performing this merge, it prevents the edit logs from
becoming too large, improving system efficiency.
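The merge described above can be sketched as a toy model. This is only an illustration: the dictionary layout, the operation names (`create`, `delete`, `rename`), and the `checkpoint` function are assumptions made for this example and do not reflect the real FsImage or EditLog formats.

```python
# Toy model of a checkpoint: replay edit-log entries onto an fsimage snapshot.
# The data layout and operation names are illustrative assumptions only.

def checkpoint(fsimage, edit_log):
    """Return a new fsimage with every logged edit applied in order."""
    merged = dict(fsimage)  # copy, so the old snapshot is left untouched
    for op, path, value in edit_log:
        if op == "create":
            merged[path] = value              # value: list of block ids
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[value] = merged.pop(path)  # value: the new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged

fsimage = {"/logs/a.txt": ["blk_1"], "/logs/b.txt": ["blk_2", "blk_3"]}
edit_log = [
    ("create", "/logs/c.txt", ["blk_4"]),
    ("delete", "/logs/a.txt", None),
    ("rename", "/logs/b.txt", "/archive/b.txt"),
]

new_fsimage = checkpoint(fsimage, edit_log)
edit_log = []  # once merged, the log can start again from empty
print(new_fsimage)
```

Because the merged image already contains every logged change, the old log entries are no longer needed, which is why the edit log stays small.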
Under Replication
● Occurs when the number of replicas for a block falls below the desired replication
factor (e.g., due to DataNode failure).
● HDFS automatically creates more replicas on other nodes to fix this.
Over Replication
● Happens when there are more replicas than needed.
● HDFS removes the extra replicas to save space and maintain the correct replication
factor.
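A minimal sketch of how under- and over-replication could be detected and corrected, assuming a replication factor of 3 (the HDFS default). The `reconcile` function and the node names are hypothetical, and real HDFS chooses targets using rack awareness rather than the simple sorted order used here.

```python
# Toy replication check: compare live replicas per block with the target factor.
# reconcile() and the node names are hypothetical; real HDFS placement is rack-aware.

def reconcile(block_locations, live_nodes, replication_factor=3):
    """Return the action needed for each block that is not at the target count."""
    actions = {}
    for block, nodes in block_locations.items():
        alive = [n for n in nodes if n in live_nodes]
        missing = replication_factor - len(alive)
        if missing > 0:
            # Under-replicated: copy to live nodes that do not hold the block yet.
            candidates = sorted(live_nodes - set(alive))
            actions[block] = ("replicate_to", candidates[:missing])
        elif missing < 0:
            # Over-replicated: drop the surplus replicas.
            actions[block] = ("delete_from", alive[missing:])
    return actions

block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],          # dn1 has failed -> 2 live replicas
    "blk_2": ["dn2", "dn3", "dn4", "dn5"],   # 4 live replicas -> one too many
    "blk_3": ["dn2", "dn3", "dn4"],          # exactly 3 -> nothing to do
}
live_nodes = {"dn2", "dn3", "dn4", "dn5"}

actions = reconcile(block_locations, live_nodes)
print(actions)
```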
Fault Tolerance
● Replication provides fault tolerance, ensuring that even if some DataNodes fail, the
data is still accessible from other replicas.
Data Locality
● Hadoop tries to run each task on, or close to, a DataNode that already holds the
required block. Moving the computation to the data, rather than the data to the
computation, improves performance by reducing network transfer time.
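The idea can be sketched as a simple placement rule: prefer a free node that already stores the block, and fall back to a remote node only when none is available. `pick_node` and the node names are hypothetical, and the sketch ignores the rack-local tier that real Hadoop schedulers also consider.

```python
# Toy locality-aware placement: prefer a node that already holds the block.
# pick_node() is a hypothetical helper; rack-locality is deliberately ignored.

def pick_node(block_replicas, free_nodes):
    """Return (node, locality) for the task that reads the given block."""
    if not free_nodes:
        raise ValueError("no free nodes available")
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"   # data is already on this node
    return free_nodes[0], "remote"      # block must be fetched over the network

print(pick_node({"dn2", "dn3", "dn4"}, ["dn1", "dn3"]))  # ('dn3', 'node-local')
print(pick_node({"dn2"}, ["dn1", "dn5"]))                # ('dn1', 'remote')
```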
Hadoop 1.x vs Hadoop 2.x :
Hadoop 1.x | Hadoop 2.x
Supports only one programming model: MapReduce. | Supports multiple programming models with YARN.
MapReduce handles both data processing and cluster resource management. | YARN, a separate component, handles cluster resource management.
Scalability is limited to about 4,000 nodes per cluster. | Higher scalability limit, up to about 10,000 nodes per cluster.
Only one NameNode manages the metadata of the cluster. | A standby NameNode overcomes the single point of failure (SPOF).
Does not support Microsoft Windows. | Adds support for Microsoft Windows.
What is the default block size in Hadoop 1.x and Hadoop 2.x?
● 64 MB (Hadoop 1.x)
● 128 MB (Hadoop 2.x)
Example :
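As an illustration of what the block size means in practice, the sketch below computes how many blocks a file is split into and how large the last block is. The 500 MB file size and the `block_layout` helper are assumptions chosen for this example; the last block only occupies as much space as the data it holds.

```python
# Toy calculation: how a file is split into HDFS blocks for a given block size.
# block_layout() and the 500 MB file are illustrative assumptions.

MB = 1024 * 1024

def block_layout(file_size_bytes, block_size_bytes):
    """Return (number_of_blocks, size_of_last_block_in_bytes)."""
    if file_size_bytes == 0:
        return 0, 0
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    if remainder == 0:
        return full_blocks, block_size_bytes
    return full_blocks + 1, remainder

file_size = 500 * MB
for label, block_size in (("Hadoop 1.x", 64 * MB), ("Hadoop 2.x", 128 * MB)):
    count, last = block_layout(file_size, block_size)
    print(f"{label}: {count} blocks, last block {last // MB} MB")
# Hadoop 1.x: 8 blocks, last block 52 MB
# Hadoop 2.x: 4 blocks, last block 116 MB
```

Fewer, larger blocks mean fewer metadata entries for the NameNode to track, which is one reason the default was raised.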
Explain all NoSQL Data Architecture Patterns in detail. Give an example of each one.
1. Key-Value Databases
Example: Redis
2. Graph-Based Databases
Example: Neo4j
3. Column-Based Databases
Example: Cassandra
4. Document Store Databases
Example: MongoDB
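To make the four patterns concrete, the sketch below models each one with plain Python data structures. It only illustrates the shape of the data; it does not use the Redis, Neo4j, Cassandra, or MongoDB client APIs, and every key, field, and name in it is hypothetical.

```python
# Shape of the data in each NoSQL pattern, modelled with plain Python structures.
# All keys, fields, and names are hypothetical; no real database client is used.

# 1. Key-value: an opaque value looked up by a single key.
key_value = {"session:42": "alice"}

# 2. Graph: nodes plus labelled edges; queries follow relationships.
nodes = {"alice": {"age": 30}, "bob": {"age": 25}, "carol": {"age": 41}}
edges = [("alice", "FOLLOWS", "bob"), ("bob", "FOLLOWS", "carol")]

def neighbours(node, relation):
    """Return the nodes reachable from `node` over edges labelled `relation`."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

# 3. Column-based: a row key maps to named columns; rows may have different columns.
column_store = {
    "user:1": {"name": "Alice", "city": "Pune"},
    "user:2": {"name": "Bob", "last_login": "2024-01-01"},
}

# 4. Document: a self-describing, possibly nested record.
document = {"_id": 1, "name": "Alice", "orders": [{"item": "book", "qty": 2}]}

print(key_value["session:42"])           # alice
print(neighbours("alice", "FOLLOWS"))    # ['bob']
print(sorted(column_store["user:2"]))    # ['last_login', 'name']
print(document["orders"][0]["item"])     # book
```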