HDFS

The Hadoop Distributed File System (HDFS) is a core component of the Apache Hadoop ecosystem, designed for reliable, scalable, and distributed storage of massive datasets across clusters of commodity hardware. Its architecture and design principles are optimized for big data workloads, focusing on high throughput rather than low latency. Here's a breakdown of its design principles, components, and functionalities:
1. Design Principles of HDFS
Scalability
• HDFS scales horizontally to accommodate petabytes of data across thousands of machines (nodes).
• The system expands seamlessly by adding more commodity hardware.
Fault Tolerance
• Data is replicated across multiple nodes; if one node fails, copies of the data remain available on other nodes.
• Automatic recovery of data is handled by re-replication mechanisms.
High Throughput
• Optimized for batch processing of large datasets rather than low-latency access.
• Large block sizes (e.g., 128 MB or 256 MB) and sequential reads improve efficiency.
Cost Efficiency
• Built to run on inexpensive commodity hardware instead of high-end, specialized servers.
Write-Once, Read-Many
• HDFS follows a write-once, read-many model: data is written once and read many times.
• This simplifies the system's design and is ideal for data-analytics workloads.
2. HDFS Architecture
HDFS follows a master-slave architecture with two main components:
NameNode (Master Node)
Role:
• Manages the file system metadata (file names, directories, blocks, and their locations).
• Does not store the actual data; instead, it keeps a metadata map of where each block resides across the cluster.
Responsibilities:
• Maintains the namespace hierarchy (file-to-block mapping).
• Tracks replication levels of data blocks.
• Coordinates file operations such as opening, closing, and renaming.
• Detects node failures and triggers data re-replication.
• High Availability: a standby NameNode can be set up to avoid a single point of failure.
DataNodes (Slave Nodes)
Role:
• Store the actual data blocks.
• Communicate with the NameNode by sending block reports and health status.
Responsibilities:
• Store, retrieve, and replicate data blocks at the NameNode's request.
• Periodically send "heartbeats" to the NameNode to report their status.
• If a DataNode fails, the NameNode initiates re-replication of its blocks to other nodes.
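The heartbeat bookkeeping described above can be sketched as a toy in-memory model. This is an illustrative simplification, not Hadoop's actual implementation: the `ToyNameNode` class name, the 10-second timeout, and the explicit `now` parameter are all assumptions chosen to keep the sketch testable (real HDFS uses much longer stale/dead windows).

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative, real HDFS waits far longer

class ToyNameNode:
    """Minimal sketch of how a NameNode tracks DataNode liveness."""

    def __init__(self):
        self.last_heartbeat = {}  # datanode id -> timestamp of last heartbeat

    def receive_heartbeat(self, datanode_id, now=None):
        # A DataNode periodically reports in; record when we last heard from it.
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        # Any DataNode silent longer than the timeout is considered failed,
        # which would trigger re-replication of its blocks.
        now = now if now is not None else time.time()
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT]

nn = ToyNameNode()
nn.receive_heartbeat("dn1", now=0.0)
nn.receive_heartbeat("dn2", now=8.0)
print(nn.dead_nodes(now=12.0))  # ['dn1'] — dn1 missed its heartbeat window
```

A real NameNode also receives full block reports from each DataNode, so it knows not just that a node died but which block replicas were lost.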
3. HDFS Blocks
• HDFS divides files into fixed-size blocks (default 128 MB, often configured to 256 MB) for storage.
• Blocks are stored across multiple DataNodes for redundancy.
Advantages of blocks:
• Simplifies storage management.
• Optimized for sequential reads of large datasets.
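The block-splitting arithmetic is just ceiling division over the configured block size. A minimal sketch, assuming the 128 MB default (the constant and function name below are illustrative, not part of any Hadoop API):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the common HDFS default

def block_count(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of HDFS blocks needed to store a file (ceiling division).

    The last block may be smaller than block_size; unlike a local file
    system, HDFS does not waste the unused tail of the final block.
    """
    if file_size_bytes == 0:
        return 0
    return (file_size_bytes + block_size - 1) // block_size

# A 1 GB file splits into exactly 8 blocks of 128 MB.
print(block_count(1024 * 1024 * 1024))  # 8
```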
4. Replication Mechanism
• HDFS uses data replication to ensure fault tolerance.
• By default, each block is replicated three times across different DataNodes.
Replication policy:
• One replica is stored on the same rack as the client.
• Another replica is stored on a different rack to avoid single-rack failures.
• The replication factor can be configured based on requirements.
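The rack-level placement rule above can be sketched as follows. This is a simplified model of the commonly described default policy (first replica on the writer's rack, second on a remote rack, third on that same remote rack); the rack names, the rack->nodes map, and the tie-breaking are all illustrative assumptions.

```python
def place_replicas(client_rack: str, racks: dict) -> list:
    """Pick rack placements for 3 replicas given a rack -> nodes map.

    Sketch of the default HDFS idea: replica 1 near the writer, replica 2
    on a different rack (survives a rack failure), replica 3 on the same
    remote rack (limits inter-rack write traffic).
    """
    other_racks = [r for r in racks if r != client_rack]
    if not other_racks:           # single-rack cluster: no rack diversity possible
        return [client_rack] * 3
    remote = other_racks[0]       # real HDFS chooses a remote rack at random
    return [client_rack, remote, remote]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("rack1", racks))  # ['rack1', 'rack2', 'rack2']
```

Note how the policy trades perfect spread (three racks) for cheaper writes: only one copy of the block ever crosses a rack boundary.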
5. HDFS High Availability (HA)
To address the NameNode's single point of failure, HDFS supports High Availability:
• Active NameNode: handles client requests.
• Standby NameNode: mirrors the metadata and takes over in case of failure.
6. File Access Process in HDFS
Here's how a file is written and read in HDFS:
File Write Workflow
• Client interaction: the client contacts the NameNode to create a file.
• Block allocation: the NameNode assigns blocks and selects DataNodes for block storage.
• Data streaming: the client writes data directly to the DataNodes in a pipeline (not through the NameNode).
• Replication: DataNodes replicate the blocks to other nodes according to the replication factor.
• Acknowledgment: once blocks are written and replicated, acknowledgments are sent back to the client.
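The write pipeline above can be modeled as a toy simulation: the client hands a block to the first DataNode, each node forwards it down the chain, and acknowledgments travel back in reverse. The node names, the in-memory "storage" dict, and the function shape are illustrative assumptions, not Hadoop's wire protocol.

```python
def pipeline_write(block: bytes, pipeline: list) -> tuple:
    """Simulate streaming one block through a DataNode pipeline."""
    stored = {}
    for dn in pipeline:              # data flows dn1 -> dn2 -> dn3
        stored[dn] = block           # each node in the chain keeps a replica
    acks = list(reversed(pipeline))  # acks return dn3 -> dn2 -> dn1 -> client
    return stored, acks

stored, acks = pipeline_write(b"block-data", ["dn1", "dn2", "dn3"])
print(acks)  # ['dn3', 'dn2', 'dn1']
```

The pipeline shape is why the client never becomes a fan-out bottleneck: it uploads each block once, and the DataNodes handle the remaining copies among themselves.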
7. File Read Workflow
• Client query: the client contacts the NameNode to fetch the file metadata (block locations).
• Data retrieval: the client reads data directly from the nearest DataNodes for better efficiency.
• Sequential reads: HDFS optimizes reads for large datasets.
Rack Awareness
• HDFS is rack-aware to improve fault tolerance and data locality.
• The NameNode uses a rack topology to place data replicas strategically:
• Ensures at least one copy resides on a different rack.
• Reduces inter-rack traffic for read/write operations.
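The read path above can be sketched as follows: the client fetches per-block replica locations (the NameNode's job) and then reads each block from the replica it considers closest. The metadata layout, file path, block ids, and the preference-list stand-in for real network-topology distance are all illustrative assumptions.

```python
# Toy metadata map, playing the NameNode's role: path -> ordered blocks,
# each with the DataNodes holding a replica.
metadata = {
    "/logs/app.log": [
        {"block": "blk_1", "locations": ["dn1", "dn3"]},
        {"block": "blk_2", "locations": ["dn2", "dn3"]},
    ]
}

def read_plan(path: str, nearest_first: list) -> list:
    """For each block, pick the replica the client considers closest.

    nearest_first stands in for a topology distance function: DataNodes
    earlier in the list are "closer" to the client.
    """
    plan = []
    for entry in metadata[path]:
        best = min(entry["locations"], key=nearest_first.index)
        plan.append((entry["block"], best))
    return plan

# A client topologically closest to dn3 reads both blocks from dn3.
print(read_plan("/logs/app.log", ["dn3", "dn1", "dn2"]))
# [('blk_1', 'dn3'), ('blk_2', 'dn3')]
```

Because every replica is a full copy of its block, the choice of DataNode affects only latency and network traffic, never correctness.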
8. Key Strengths and Limitations of HDFS
Strengths:
• Handles large files efficiently.
• Built-in fault tolerance and reliability.
• Linear scalability across thousands of nodes.
• Cost-effective, using commodity hardware.
Limitations:
• Not suitable for small files: HDFS is optimized for large files; storing many small files can overload the NameNode's metadata.
• High latency: not ideal for real-time applications.
• Write-once limitation: files cannot be updated in place once written.
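The small-files limitation can be made concrete with a back-of-envelope sketch. Each file, directory, and block object costs the NameNode a roughly fixed amount of heap metadata; the ~150-byte figure used below is a commonly cited rough estimate, and should be treated as an assumption for illustration rather than a measured constant.

```python
METADATA_BYTES_PER_BLOCK = 150  # rough, commonly cited estimate per block object

def namenode_metadata_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Approximate NameNode heap consumed by the block metadata of a dataset."""
    return num_files * blocks_per_file * METADATA_BYTES_PER_BLOCK

# The same data as 10 million tiny one-block files vs. 10,000 large files:
print(namenode_metadata_bytes(10_000_000))  # 1500000000 (~1.5 GB of heap)
print(namenode_metadata_bytes(10_000))      # 1500000 (~1.5 MB of heap)
```

Since all of this metadata lives in the NameNode's memory, millions of small files exhaust the master long before the DataNodes' disks fill up, which is why small files are typically packed into larger containers (e.g., sequence files or archives) before landing in HDFS.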
