
HDFS (Hadoop Distributed File System)

HDFS (Hadoop Distributed File System) is a key component in big data analytics, playing a crucial role in the storage and management of vast amounts of data. Here’s how it fits into the larger big data ecosystem:
1. Distributed Storage
HDFS is designed to store large datasets across multiple
machines, ensuring data is distributed and stored in blocks
across various nodes. This allows for the storage of petabytes of
data, which is common in big data analytics.
2. Fault Tolerance
Data stored in HDFS is automatically replicated across different
nodes. If one node fails, the system can recover the data from
another replica, making it highly fault-tolerant.
3. Scalability
HDFS is designed to scale easily by adding more nodes to the
cluster. This is crucial for big data analytics, where the data size
can grow exponentially.
4. High Throughput Access
HDFS is optimized for high throughput rather than low latency.
It’s particularly useful for batch processing of large datasets, a
common requirement in analytics workloads.
5. Integration with Big Data Tools
HDFS is tightly integrated with the Hadoop ecosystem, but it also supports other big data tools and frameworks such as:
• MapReduce: The data stored in HDFS can be processed using the MapReduce programming model.
• Apache Spark: Spark, a widely used big data processing framework, can directly access data from HDFS (see the sketch after this list).
• Hive and Pig: These tools can query and analyze data stored in HDFS.
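For example, a minimal Spark-in-Java sketch of reading a file straight from HDFS might look like the following. This is only an illustration; the hdfs://namenode:9000 address and /logs/access.log path are placeholders, not values from this text.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.SparkSession;

    public class SparkHdfsExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("HdfsRead")
                    .getOrCreate();

            // Spark reads the file directly from HDFS; no copy to local disk is needed.
            Dataset<String> lines = spark.read().textFile("hdfs://namenode:9000/logs/access.log");
            System.out.println("Line count: " + lines.count());

            spark.stop();
        }
    }
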
6. Handling Diverse Data Types
HDFS can handle structured, semi-structured, and unstructured
data, making it ideal for big data analytics where data comes in
many forms.
7. Cost-Effective
HDFS runs on commodity hardware, making it cost-effective for
organizations that need to store and process large amounts of
data without investing in expensive infrastructure.
Key Use Cases in Big Data Analytics
• Log Analysis: HDFS can store massive log files from various sources, which can then be analyzed for insights into system behavior, security, and user activity.
• Data Lake: HDFS is often used to build data lakes, storing raw data that can later be analyzed or refined for specific business use cases.
• Machine Learning: Large datasets for training machine learning models can be stored in HDFS and processed using distributed computing frameworks.
HDFS (Hadoop Distributed File System) is used for storage in Hadoop. It is mainly designed to run on commodity hardware (inexpensive devices) and follows a distributed file system design. HDFS is built on the idea of storing data in large blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and the other nodes in the Hadoop cluster. The data storage nodes in HDFS are:

• NameNode (Master)
• DataNode (Slave)

NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing metadata, i.e. the data about the data. Metadata can include the transaction logs that keep track of user activity in the Hadoop cluster.
Metadata can also include the name of a file, its size, and the location information (block numbers, block IDs) of the DataNodes, which the NameNode stores so it can pick the closest DataNode for faster communication. The NameNode instructs the DataNodes to perform operations such as delete, create, and replicate.
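As a rough illustration, the metadata tracked by the NameNode can be inspected through the Hadoop Java API. The sketch below assumes a reachable cluster at a hypothetical hdfs://namenode:9000 address and an existing /data/sample.txt file; it prints the file's size, replication factor, and block size.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.net.URI;

    public class NameNodeMetadataExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical NameNode address; replace with your cluster's URI.
            FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration());

            // The NameNode answers this call from its metadata, not from the DataNodes.
            FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
            System.out.println("Length (bytes): " + status.getLen());
            System.out.println("Replication:    " + status.getReplication());
            System.out.println("Block size:     " + status.getBlockSize());

            fs.close();
        }
    }
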
DataNode: DataNodes work as slaves. DataNodes are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes the cluster has, the more data it can store, so it is advised that DataNodes have high storage capacity to hold a large number of file blocks.
High Level Architecture Of Hadoop

File Block In HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple blocks of 128 MB each by default, and this block size can also be changed manually.
Let’s understand this concept of breaking a file into blocks with an example. Suppose you upload a 400 MB file to HDFS. The file gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, meaning four blocks are created, each of 128 MB except the last one. Hadoop does not know or care what data is stored in these blocks, so it simply treats the final, smaller block as a partial block. In the Linux file system, the size of a file block is about 4 KB, which is much smaller than the default block size in the Hadoop file system. Hadoop is mainly configured for storing very large data sets, on the scale of petabytes, and this is what makes the Hadoop file system different from other file systems: it can be scaled. Nowadays, block sizes of 128 MB to 256 MB are commonly used in Hadoop.
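The block-count arithmetic from the example can be written down directly. The snippet below is just a worked calculation for a 400 MB file with 128 MB blocks; it is not part of any Hadoop API.

    public class BlockMathExample {
        public static void main(String[] args) {
            long fileSizeMB = 400;   // example file from the text
            long blockSizeMB = 128;  // HDFS default block size

            // Number of full 128 MB blocks and the size of the final partial block.
            long fullBlocks = fileSizeMB / blockSizeMB;                 // 3
            long lastBlockMB = fileSizeMB % blockSizeMB;                // 16
            long totalBlocks = fullBlocks + (lastBlockMB > 0 ? 1 : 0);  // 4

            System.out.println("Total blocks:    " + totalBlocks);
            System.out.println("Last block (MB): " + lastBlockMB);
        }
    }
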
Replication In HDFS
Replication ensures the availability of the data. Replication means making a copy of something, and the number of copies kept of a particular block is its replication factor. As we saw with file blocks, HDFS stores data in the form of blocks, and Hadoop is also configured to make copies of those blocks.
By default, the replication factor in Hadoop is set to 3, and it can be changed manually as per your requirement. In the example above we created 4 file blocks, which means 3 replicas (copies) of each block are kept, for a total of 4 × 3 = 12 blocks stored for backup purposes.
This is because Hadoop runs on commodity hardware (inexpensive system hardware) which can crash at any time; we are not using supercomputers for our Hadoop setup. That is why HDFS needs a feature that makes copies of file blocks for backup purposes, and this is known as fault tolerance.
One thing to note is that after making so many replicas of our file blocks we use a lot of extra storage, but for large organizations the data is far more important than the storage, so nobody minds this overhead. You can configure the replication factor in your hdfs-site.xml file; a programmatic sketch of the same setting follows below.
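Here is a minimal sketch of adjusting replication programmatically, assuming the same hypothetical hdfs://namenode:9000 cluster and /data/sample.txt file as before. The dfs.replication property mirrors what you would set in hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.net.URI;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same property you would set in hdfs-site.xml; the default is 3.
            conf.setInt("dfs.replication", 3);

            FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf);

            // Replication can also be changed per file after it is written.
            fs.setReplication(new Path("/data/sample.txt"), (short) 2);

            fs.close();
        }
    }
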
Rack Awareness: A rack is nothing but a physical collection of nodes in our Hadoop cluster (perhaps 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode to achieve maximum performance while serving read/write requests, which reduces network traffic.
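As a rough illustration, rack placement is visible from the client side through block locations. The sketch below, again assuming the hypothetical cluster and file from the earlier examples, prints each block's hosts and their topology paths, which encode the rack.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.net.URI;
    import java.util.Arrays;

    public class RackAwarenessExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));

            // One BlockLocation per block, each listing the DataNodes holding a replica.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Hosts:    " + Arrays.toString(block.getHosts()));
                // Topology paths look like /rack-name/host and reflect rack awareness.
                System.out.println("Topology: " + Arrays.toString(block.getTopologyPaths()));
            }

            fs.close();
        }
    }
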
HDFS Architecture
