
HDFS (Hadoop Distributed File System)

HDFS (Hadoop Distributed File System) is a key component in big data analytics, playing a crucial role in the storage and management of vast amounts of data. Here’s how it fits into the larger big data ecosystem:
1. Distributed Storage
HDFS is designed to store large datasets across multiple
machines, ensuring data is distributed and stored in blocks
across various nodes. This allows for the storage of petabytes of
data, which is common in big data analytics.
2. Fault Tolerance
Data stored in HDFS is automatically replicated across different
nodes. If one node fails, the system can recover the data from
another replica, making it highly fault-tolerant.
3. Scalability
HDFS is designed to scale easily by adding more nodes to the
cluster. This is crucial for big data analytics, where the data size
can grow exponentially.
4. High Throughput Access
HDFS is optimized for high throughput rather than low latency.
It’s particularly useful for batch processing of large datasets, a
common requirement in analytics workloads.
5. Integration with Big Data Tools
HDFS is tightly integrated with the Hadoop ecosystem, but it also supports other big data tools and frameworks such as:
• MapReduce: The data stored in HDFS can be processed using the MapReduce programming model.
• Apache Spark: Spark, a widely used big data processing framework, can directly access data from HDFS (see the sketch after this list).
• Hive and Pig: These tools can query and analyze data stored in HDFS.
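For example, a minimal Spark-in-Java sketch of reading a file straight from HDFS might look like the following. This is only an illustration; the hdfs://namenode:9000 address and /logs/access.log path are placeholders, not values from this text.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.SparkSession;

    public class SparkHdfsExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("HdfsRead")
                    .getOrCreate();

            // Spark reads the file directly from HDFS; no copy to local disk is needed.
            Dataset<String> lines = spark.read().textFile("hdfs://namenode:9000/logs/access.log");
            System.out.println("Line count: " + lines.count());

            spark.stop();
        }
    }
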
6. Handling Diverse Data Types
HDFS can handle structured, semi-structured, and unstructured
data, making it ideal for big data analytics where data comes in
many forms.
7. Cost-Effective
HDFS runs on commodity hardware, making it cost-effective for
organizations that need to store and process large amounts of
data without investing in expensive infrastructure.
Key Use Cases in Big Data Analytics
• Log Analysis: HDFS can store massive log files from various sources, which can then be analyzed for insights into system behavior, security, and user activity.
• Data Lake: HDFS is often used to build data lakes, storing raw data that can later be analyzed or refined for specific business use cases.
• Machine Learning: Large datasets for training machine learning models can be stored in HDFS and processed using distributed computing frameworks.
HDFS (Hadoop Distributed File System) is used for storage in Hadoop. It is mainly designed to run on commodity hardware (inexpensive devices) and follows a distributed file system design. HDFS is built on the idea of storing data in large blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and the other nodes in the Hadoop cluster. The data storage nodes in HDFS are:

• NameNode (Master)
• DataNode (Slave)

NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing metadata, i.e. the data about the data. Metadata can include the transaction logs that keep track of user activity in the Hadoop cluster.
Metadata can also include the name of a file, its size, and the location information (block numbers, block IDs) of the DataNodes, which the NameNode stores so it can pick the closest DataNode for faster communication. The NameNode instructs the DataNodes to perform operations such as delete, create, and replicate.
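As a rough illustration, the metadata tracked by the NameNode can be inspected through the Hadoop Java API. The sketch below assumes a reachable cluster at a hypothetical hdfs://namenode:9000 address and an existing /data/sample.txt file; it prints the file's size, replication factor, and block size.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.net.URI;

    public class NameNodeMetadataExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical NameNode address; replace with your cluster's URI.
            FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration());

            // The NameNode answers this call from its metadata, not from the DataNodes.
            FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
            System.out.println("Length (bytes): " + status.getLen());
            System.out.println("Replication:    " + status.getReplication());
            System.out.println("Block size:     " + status.getBlockSize());

            fs.close();
        }
    }
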
DataNode: DataNodes work as slaves. DataNodes are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes the cluster has, the more data it can store, so it is advised that DataNodes have high storage capacity to hold a large number of file blocks.
High Level Architecture Of Hadoop

File Block In HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple blocks of 128 MB each by default, and this block size can also be changed manually.
Let’s understand this concept of breaking a file into blocks with an example. Suppose you upload a 400 MB file to HDFS. The file gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, meaning four blocks are created, each of 128 MB except the last one. Hadoop does not know or care what data is stored in these blocks, so it simply treats the final, smaller block as a partial block. In the Linux file system, the size of a file block is about 4 KB, which is much smaller than the default block size in the Hadoop file system. Hadoop is mainly configured for storing very large data sets, on the scale of petabytes, and this is what makes the Hadoop file system different from other file systems: it can be scaled. Nowadays, block sizes of 128 MB to 256 MB are commonly used in Hadoop.
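The block-count arithmetic from the example can be written down directly. The snippet below is just a worked calculation for a 400 MB file with 128 MB blocks; it is not part of any Hadoop API.

    public class BlockMathExample {
        public static void main(String[] args) {
            long fileSizeMB = 400;   // example file from the text
            long blockSizeMB = 128;  // HDFS default block size

            // Number of full 128 MB blocks and the size of the final partial block.
            long fullBlocks = fileSizeMB / blockSizeMB;                 // 3
            long lastBlockMB = fileSizeMB % blockSizeMB;                // 16
            long totalBlocks = fullBlocks + (lastBlockMB > 0 ? 1 : 0);  // 4

            System.out.println("Total blocks:    " + totalBlocks);
            System.out.println("Last block (MB): " + lastBlockMB);
        }
    }
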
Replication In HDFS
Replication ensures the availability of the data. Replication means making a copy of something, and the number of copies kept of a particular block is its replication factor. As we saw with file blocks, HDFS stores data in the form of blocks, and Hadoop is also configured to make copies of those blocks.
By default, the replication factor in Hadoop is set to 3, and it can be changed manually as per your requirement. In the example above we created 4 file blocks, which means 3 replicas (copies) of each block are kept, for a total of 4 × 3 = 12 blocks stored for backup purposes.
This is because Hadoop runs on commodity hardware (inexpensive system hardware) which can crash at any time; we are not using supercomputers for our Hadoop setup. That is why HDFS needs a feature that makes copies of file blocks for backup purposes, and this is known as fault tolerance.
One thing to note is that after making so many replicas of our file blocks we use a lot of extra storage, but for large organizations the data is far more important than the storage, so nobody minds this overhead. You can configure the replication factor in your hdfs-site.xml file; a programmatic sketch of the same setting follows below.
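Here is a minimal sketch of adjusting replication programmatically, assuming the same hypothetical hdfs://namenode:9000 cluster and /data/sample.txt file as before. The dfs.replication property mirrors what you would set in hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.net.URI;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same property you would set in hdfs-site.xml; the default is 3.
            conf.setInt("dfs.replication", 3);

            FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf);

            // Replication can also be changed per file after it is written.
            fs.setReplication(new Path("/data/sample.txt"), (short) 2);

            fs.close();
        }
    }
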
Rack Awareness: A rack is nothing but a physical collection of nodes in our Hadoop cluster (perhaps 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode to achieve maximum performance while serving read/write requests, which reduces network traffic.
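As a rough illustration, rack placement is visible from the client side through block locations. The sketch below, again assuming the hypothetical cluster and file from the earlier examples, prints each block's hosts and their topology paths, which encode the rack.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.net.URI;
    import java.util.Arrays;

    public class RackAwarenessExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));

            // One BlockLocation per block, each listing the DataNodes holding a replica.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Hosts:    " + Arrays.toString(block.getHosts()));
                // Topology paths look like /rack-name/host and reflect rack awareness.
                System.out.println("Topology: " + Arrays.toString(block.getTopologyPaths()));
            }

            fs.close();
        }
    }
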
HDFS Architecture
