
BDA VIVA QUESTIONS

Differentiate between Traditional and Big Data Approach.

Big Data Definition:

● Big Data is a collection of large datasets, containing both structured and
unstructured data, that is difficult to manage with the traditional approach.
● Traditional approach: data is stored and processed on a single system using an
RDBMS; it scales vertically and handles mainly structured data.
● Big Data approach: data is stored and processed across a cluster of commodity
machines; it scales horizontally and handles structured, semi-structured and
unstructured data.

Characteristics: the 4 V's
1. Volume: Large amount of data.
2. Velocity: Rate at which data is generated, often in real time.
3. Variety: Contains structured, unstructured and semi-structured data.
4. Veracity: Quality of data, i.e. its accuracy and trustworthiness.

Definition of Hadoop:
● Hadoop is an Apache open-source framework, written in Java, that allows
distributed processing of large datasets using simple programming models.
Pros:
1. Open source
2. Scalable
3. High fault tolerance
Cons:
1. Security concerns
2. Issues with small files
3. Vulnerable to weak configurations
4. No real-time processing

What are the types of Big Data?

1. Structured data
2. Unstructured data
3. Semi-structured data

What are the core components of Hadoop?
1. HDFS
2. MapReduce
3. YARN
4. Hadoop Common (utility packages)

HDFS:
● Works on a master-slave architecture
● Master: NameNode, Secondary NameNode, JobTracker
● Slaves: DataNode, TaskTracker

NameNode:
1. Maintains & manages the DataNodes
2. Stores the file system metadata in two files:
● FsImage
● EditLogs
3. Single point of failure (SPOF) in Hadoop 1.x

DataNode:
● Responsible for storing the actual data in HDFS
● Serves read and write requests from clients
● Handles block creation, deletion, and replication
● NameNode and DataNode are in constant communication (heartbeats)

MapReduce: JobTracker (JT) & TaskTracker (TT)

● JT is the master and TT is the slave
● JT coordinates the parallel processing of data using MapReduce
● JT assigns tasks to the TaskTrackers
● TT runs the Map and Reduce tasks and reports progress back to the JT

Secondary NameNode :
● Not a backup for the NameNode
● Helper Node: The Secondary NameNode assists the NameNode by managing and
merging the file system metadata.
● Metadata Merging: It periodically merges the edit logs and the fsimage (file system
image) to create an updated fsimage.
● Reduces Edit Logs: By performing this merge, it prevents the edit logs from
becoming too large, improving system efficiency.

Apache Hadoop Ecosystem:

1. HDFS: distributed file storage
2. YARN: scheduling & resource allocation
3. MapReduce: distributed data processing
4. Sqoop: transfers data between Hadoop and relational databases
5. Flume: collects, aggregates and moves large amounts of data into Hadoop
6. Pig:
● Pre-processing framework and scripting tool
● Converts scripts into Map and Reduce code
7. Hive: data warehouse with Hive Query Language (HQL), similar to SQL
8. Mahout: machine learning library
9. HBase: NoSQL database on top of HDFS; supports all types of data
10. ZooKeeper: coordinates everything on the cluster, keeps track of all nodes
11. Oozie: schedules jobs on the cluster
12. R Connector: lets R run statistical analysis on Hadoop data
13. Ambari: provisions, manages and monitors Hadoop clusters

Replication factor: what is the default replication factor?

The replication factor in Hadoop's HDFS refers to the number of copies (replicas) of
each data block that are stored across different DataNodes in the cluster. The default
replication factor is 3.

Can we change the replication factor? How?

● Yes, the replication factor can be changed.
● You can set it at the file level using the command:
hdfs dfs -setrep <new_replication_factor> <file_path>
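
For example (hypothetical path), the following sets the replication factor of one file to 2;
the -w flag waits until re-replication completes:

hdfs dfs -setrep -w 2 /user/hadoop/sample.txt

The cluster-wide default can be changed via the dfs.replication property in hdfs-site.xml.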

Under Replication
● Occurs when the number of replicas for a block falls below the desired replication
factor (e.g., due to DataNode failure).
● HDFS automatically creates more replicas on other nodes to fix this.

Over Replication
● Happens when there are more replicas than needed.
● HDFS removes the extra replicas to save space and maintain the correct replication
factor.

Fault Tolerance
● Replication provides fault tolerance, ensuring that even if some DataNodes fail, the
data is still accessible from other replicas.

Data Locality
● HDFS tries to run processing on the DataNodes where the data blocks already
reside, improving performance by reducing network transfer time.

Explain HDFS Block Placement Policy.

1. Place the first replica on the local node if the HDFS client is running on a node
inside the cluster, otherwise on a random node.
2. Place the second replica on a node in a different rack.
3. Place the third replica on a different node in the same rack as the second replica.
4. If there are more replicas, spread them across the rest of the racks.
Differentiate between Hadoop 1.x and Hadoop 2.x.

1. Hadoop 1.x supports only one programming model (MapReduce); Hadoop 2.x
supports multiple programming models with YARN.
2. In Hadoop 1.x, MapReduce does both data processing and cluster resource
management; in Hadoop 2.x, a separate component (YARN) handles cluster
resource management.
3. Hadoop 1.x scalability is limited to about 4,000 nodes per cluster; Hadoop 2.x has
a higher scalability limit of up to 10,000 nodes per cluster.
4. Hadoop 1.x has only one NameNode managing the metadata of the cluster;
Hadoop 2.x adds a standby NameNode to overcome the single point of failure
(SPOF).
5. Hadoop 1.x does not support Microsoft Windows; Hadoop 2.x adds support for
Microsoft Windows.

What is the default block size in Hadoop 1.x and Hadoop 2.x?
● 64 MB (Hadoop 1.x)
● 128 MB (Hadoop 2.x)

Explain the WordCount example using the MapReduce algorithm.
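
A minimal sketch of WordCount in Python (hypothetical helper names, simulating the
framework in a single process): the map phase emits a (word, 1) pair for every word,
the shuffle groups the pairs by key, and the reduce sums the counts for each word.

from collections import defaultdict

def map_phase(line):
    # emit (word, 1) for every word in the input line
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # group all values by key, as the MapReduce framework would
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    # sum the per-word counts
    return (word, sum(counts))

if __name__ == "__main__":
    lines = ["big data big hadoop", "hadoop big"]
    pairs = [p for line in lines for p in map_phase(line)]
    result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
    print(result)  # {'big': 3, 'data': 1, 'hadoop': 2}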

Write a MapReduce pseudo code for vector-matrix multiplication.
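
One common formulation, sketched in Python (assuming the vector v fits in memory on
every mapper; data values are hypothetical): matrix entries arrive as (i, j, m_ij) triples,
the map emits (i, m_ij * v[j]), and the reduce sums the partial products for each row i.

from collections import defaultdict

v = [2, 1, 3]  # the vector, available to every mapper (hypothetical data)

def map_phase(i, j, m_ij):
    # each matrix entry contributes m_ij * v_j to row i of the result
    yield (i, m_ij * v[j])

def reduce_phase(i, partial_products):
    # x_i = sum over j of m_ij * v_j
    return (i, sum(partial_products))

if __name__ == "__main__":
    M = [(0, 0, 1), (0, 2, 2), (1, 1, 4)]  # sparse matrix as (i, j, value) triples
    grouped = defaultdict(list)
    for triple in M:
        for i, val in map_phase(*triple):
            grouped[i].append(val)
    print([reduce_phase(i, vals) for i, vals in sorted(grouped.items())])
    # [(0, 8), (1, 4)] since x_0 = 1*2 + 2*3 and x_1 = 4*1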


MapReduce function for selection:
● Map: for each tuple t in relation R, test the selection condition C; if t satisfies C,
emit the key-value pair (t, t).
● Reduce: identity function; passes each tuple through unchanged.
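
A minimal sketch in Python (hypothetical relation and condition):

def map_select(t, condition):
    # emit the tuple only if it satisfies the selection condition C
    if condition(t):
        yield (t, t)

def reduce_select(key, values):
    # identity reduce: pass each tuple through unchanged
    yield key

if __name__ == "__main__":
    R = [("alice", 25), ("bob", 17), ("carol", 31)]
    age_over_18 = lambda t: t[1] > 18
    result = []
    for t in R:
        for key, value in map_select(t, age_over_18):
            result.extend(reduce_select(key, [value]))
    print(result)  # [('alice', 25), ('carol', 31)]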

MapReduce function for projection:
● Map: for each tuple t in R, construct a tuple t' containing only the attributes in the
projection list S; emit the key-value pair (t', t').
● Reduce: for each key t', emit (t', t') once, eliminating the duplicates that projection
can create.
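
A minimal sketch in Python (hypothetical relation, projecting onto the second attribute):

from collections import defaultdict

def map_project(t, attrs):
    # keep only the attributes in the projection list S
    projected = tuple(t[i] for i in attrs)
    yield (projected, projected)

def reduce_project(key, values):
    # emit each projected tuple once, eliminating duplicates
    yield key

if __name__ == "__main__":
    R = [("alice", "HR"), ("bob", "HR"), ("carol", "IT")]
    grouped = defaultdict(list)
    for t in R:
        for key, value in map_project(t, attrs=[1]):
            grouped[key].append(value)
    result = [k for key, vals in sorted(grouped.items())
              for k in reduce_project(key, vals)]
    print(result)  # [('HR',), ('IT',)]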


What is NoSQL?
● It is a database management system that provides a mechanism for the storage and
retrieval of massive amounts of unstructured data in a distributed environment, with a
focus on high scalability, performance, availability and agility.
What do you mean by the CAP theorem?
● The three letters in CAP refer to three desirable properties of distributed systems with
replicated data: Consistency (among replicated copies), Availability (of the system for
read and write operations), and Partition Tolerance (in the face of the nodes in the
system being partitioned by a network fault).
● The CAP Theorem states that it is not possible to guarantee all three of these
properties at the same time in a distributed system with data replication.

Explain all NoSQL Data Architecture Patterns in detail. Give an example of each one.
1. Key-Value Databases: store data as simple key-value pairs; the value is opaque
and is looked up only by its key.
Example: Redis
2. Graph-Based Databases: store data as nodes and edges (relationships); suited
for highly connected data such as social networks.
Example: Neo4j
3. Column-Based Databases: store data in column families rather than rows; suited
for wide, sparse datasets.
Example: Cassandra
4. Document Store Databases: store self-describing, JSON-like documents that can
be nested and queried by their content.
Example: MongoDB
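
As a rough illustration (hypothetical data, plain Python literals rather than real client
APIs), the same user record could take a different shape under each pattern:

# 1. Key-Value (e.g. Redis): an opaque value looked up only by its key
kv_store = {"user:101": '{"name": "Asha", "city": "Pune"}'}

# 2. Graph (e.g. Neo4j): nodes plus typed relationships (edges) between them
nodes = {101: {"name": "Asha"}, 202: {"name": "Ravi"}}
edges = [(101, "FRIENDS_WITH", 202)]

# 3. Column-based (e.g. Cassandra): a row key mapping to column families
column_store = {"101": {"profile": {"name": "Asha", "city": "Pune"},
                        "activity": {"last_login": "2024-01-01"}}}

# 4. Document store (e.g. MongoDB): a self-describing, nested document
document = {"_id": 101, "name": "Asha",
            "address": {"city": "Pune"}, "tags": ["admin"]}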
