BDA Module-1 Notes
In This Chapter:
The design and operation of the Hadoop Distributed File System (HDFS) are presented.
Important HDFS topics such as block replication, Safe Mode, rack awareness, High
Availability, Federation, backup, snapshots, NFS mounting, and the HDFS web GUI are
discussed.
The Hadoop Distributed File System is the backbone of Hadoop MapReduce processing.
New users and administrators often find HDFS different than most other UNIX/Linux file
systems. This chapter highlights the design goals and capabilities of HDFS that make it
useful for Big Data processing.
The Hadoop Distributed File System (HDFS) was designed for Big Data processing.
Although capable of supporting many users simultaneously, HDFS is not designed as a true
parallel file system. Rather, the design assumes a large file write-once/read-many model that
enables other optimizations and relaxes many of the concurrency and coherence overhead
requirements of a true parallel file system. For instance, HDFS rigorously restricts data
writing to one user at a time. All additional writes are “append-only,” and there is no random
writing to HDFS files. Bytes are always appended to the end of a stream, and byte streams
are guaranteed to be stored in the order written.
The design of HDFS is based on the design of the Google File System (GFS). A paper
published by Google provides further background on GFS
(http://research.google.com/archive/gfs.html).
HDFS is designed for data streaming where large amounts of data are read from disk in bulk.
The HDFS block size is typically 64MB or 128MB. Thus, this approach is entirely unsuitable
for standard POSIX file system use. In addition, due to the sequential nature of the data, there
is no local caching mechanism. The large block and file sizes make it more efficient to reread
data from HDFS than to try to cache the data.
Perhaps the most interesting aspect of HDFS—and the one that separates it from other file
systems—is its data locality. A principal design aspect of Hadoop MapReduce is the
emphasis on moving the computation to the data rather than moving the data to the
computation. This distinction is reflected in how Hadoop clusters are implemented. In other
high-performance systems, a parallel file system will exist on hardware separate from the
compute hardware. Data is then moved to and from the computer components via high-speed
interfaces to the parallel file system array. HDFS, in contrast, is designed to work on the
same hardware as the compute portion of the cluster. That is, a single server node in the
cluster is often both a computation engine and a storage engine for the application.
Finally, Hadoop clusters assume node (and even rack) failure will occur at some point. To
deal with this situation, HDFS has a redundant design that can tolerate system failure and still
provide the data needed by the compute part of the program.
The following points summarize important aspects of the HDFS design:
Files may be appended, but random seeks are not permitted. There is no caching of data.
Converged data storage and processing happen on the same server nodes.
A reliable file system maintains multiple copies of data across the cluster. Consequently,
failure of a single node (or even a rack in a large cluster) will not bring down the file system.
A specialized file system is used, which is not designed for general use.
HDFS COMPONENTS
The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes.
In a basic design, a single NameNode manages all the metadata needed to store and retrieve
the actual data from the DataNodes. No data is actually stored on the NameNode, however.
For a minimal Hadoop installation, there needs to be a single NameNode daemon and a single
DataNode daemon running on at least one machine (see the section “Installing Hadoop from
Apache Sources” in Chapter 2, “Installation Recipes”).
The design is a master/slave architecture in which the master (NameNode) manages the file
system namespace and regulates access to files by clients. File system namespace operations
such as opening, closing, and renaming files and directories are all managed by the
NameNode. The NameNode also determines the mapping of blocks to DataNodes and
handles DataNode failures.
The slaves (DataNodes) are responsible for serving read and write requests from the file
system to the clients. The NameNode manages block creation, deletion, and replication.
Figure 3.1 shows the NameNode and DataNode roles in an HDFS deployment.
When a client writes data, it first communicates with the NameNode, which determines the
DataNodes that will store the blocks. The client then writes the blocks directly to those
DataNodes, and the blocks are replicated as they are written.
Reading data happens in a similar fashion. The client requests a file from the NameNode,
which returns the best DataNodes from which to read the data. The client then accesses the
data directly from the DataNodes.
Thus, once the metadata has been delivered to the client, the NameNode steps back and lets
the conversation between the client and the DataNodes proceed. While data transfer is
progressing, the NameNode also monitors the DataNodes by listening for heartbeats sent
from DataNodes. The lack of a heartbeat signal indicates a potential node failure. In such a
case, the NameNode will route around the failed DataNode and begin re-replicating the now-
missing blocks. Because the file system is redundant, DataNodes can be taken offline
(decommissioned) for maintenance by informing the NameNode of the DataNodes to exclude
from the HDFS pool.
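As a quick illustration of this client/NameNode/DataNode interaction, the HDFS command line client can be used to copy a file into and out of HDFS. The following is a minimal sketch; the file and directory names are placeholders. Behind the scenes, the client contacts the NameNode for metadata and block locations and then moves the data directly to or from the DataNodes.
$ hdfs dfs -mkdir -p /user/hdfs/demo                   # NameNode records the new directory in its namespace
$ hdfs dfs -put war-and-peace.txt /user/hdfs/demo      # blocks are written directly to DataNodes
$ hdfs dfs -ls /user/hdfs/demo                         # file metadata (size, replication) comes from the NameNode
$ hdfs dfs -get /user/hdfs/demo/war-and-peace.txt .    # blocks are read directly from DataNodes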
The mappings between data blocks and the physical DataNodes are not kept in persistent
storage on the NameNode. For performance reasons, the NameNode stores all metadata in
memory. Upon startup, each DataNode provides a block report (which it keeps in persistent
storage) to the NameNode. The block reports are sent every 10 heartbeats. (The interval
between reports is a configurable property.) The reports enable the NameNode to keep an up-
to-date account of all data blocks in the cluster.
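The heartbeat and block report intervals are ordinary HDFS configuration properties and can be inspected with the hdfs getconf command. The property names below are the stock Apache Hadoop names, and the dfsadmin report summarizes what the NameNode currently knows about each DataNode (administrator privileges assumed):
$ hdfs getconf -confKey dfs.heartbeat.interval         # heartbeat interval in seconds
$ hdfs getconf -confKey dfs.blockreport.intervalMsec   # block report interval in milliseconds
$ hdfs dfsadmin -report                                # per-DataNode capacity, usage, and last contact time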
The purpose of the SecondaryNameNode is to perform periodic checkpoints that evaluate the
status of the NameNode. Recall that the NameNode keeps all system metadata in memory for
fast access. It also has two disk files that track changes to the metadata:
An image of the file system state when the NameNode was started. This file begins with
fsimage_* and is used only at startup by the NameNode.
A series of modifications done to the file system after starting the NameNode. These files
begin with edits_* and reflect the changes made after the fsimage_* file was read.
The location of these files is set by the dfs.namenode.name.dir property in the hdfs-site.xml
file.
The SecondaryNameNode periodically downloads fsimage and edits files, joins them into a
new fsimage, and uploads the new fsimage file to the NameNode. Thus, when the NameNode
restarts, the fsimage file is reasonably up-to-date and requires only the edit logs to be applied
since the last checkpoint. If the SecondaryNameNode were not running, a restart of the
NameNode could take a prohibitively long time due to the number of changes to the file
system.
The SecondaryNameNode performs checkpoints of the NameNode's file system state but is not
a failover node.
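The checkpoint files can be examined on the NameNode host. This is only a sketch: the directory shown is whatever value dfs.namenode.name.dir holds on a given cluster, and the fsimage_*/edits_* layout is the one used by current Apache Hadoop releases.
$ hdfs getconf -confKey dfs.namenode.name.dir          # reports where the NameNode keeps its metadata
$ ls <value-of-dfs.namenode.name.dir>/current          # contains the fsimage_* and edits_* files described above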
HDFS Block Replication
If several machines must be involved in the serving of a file, then a file could be rendered
unavailable by the loss of any one of those machines. HDFS combats this problem by
replicating each block across a number of machines (three is the default).
In addition, the HDFS default block size is often 64MB. In a typical operating system, the
block size is 4KB or 8KB. The HDFS default block size is not the minimum block size,
however. If a 20KB file is written to HDFS, it will create a block that is approximately 20KB
in size. (The underlying file system may have a minimal block size that increases the actual
file size.) If a file of size 80MB is written to HDFS, a 64MB block and a 16MB block will be
created.
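How a particular file has been split into blocks and replicated can be checked with hdfs fsck, and the replication factor of an existing file can be changed with setrep. The path below is only a placeholder:
$ hdfs fsck /user/hdfs/demo/file.dat -files -blocks -locations   # shows each block, its size, and the DataNodes holding it
$ hdfs dfs -setrep -w 2 /user/hdfs/demo/file.dat                 # raise or lower the replication factor; -w waits for completion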
As mentioned in Chapter 1, “Background and Concepts,” HDFS blocks are not exactly the
same as the data splits used by the MapReduce process. The HDFS blocks are based on size,
while the splits are based on a logical partitioning of the data. For instance, if a file contains
discrete records, the logical split ensures that a record is not split physically across two
separate servers during processing. Each HDFS block may consist of one or more splits.
Figure 3.2 provides an example of how a file is broken into blocks and replicated across the
cluster. In this case, a replication factor of 3 ensures that any one DataNode can fail and the
replicated blocks will be available on other nodes—and then subsequently re-replicated on
other DataNodes.
HDFS Safe Mode
When the NameNode starts, it enters a read-only safe mode where blocks cannot be
replicated or deleted. Safe Mode enables the NameNode to perform two important processes:
1. The previous file system state is reconstructed by loading the fsimage file into memory and
replaying the edit log.
2. The mapping between blocks and DataNodes is created by waiting for enough of the
DataNodes to register so that at least one copy of the data is available. Not all DataNodes are
required to register before HDFS exits from Safe Mode. The registration process may
continue for some time.
HDFS may also enter Safe Mode for maintenance using the hdfs dfsadmin -safemode
command or when there is a file system issue that must be addressed by the administrator.
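Safe Mode can be queried and toggled from the command line with the dfsadmin tool (administrator privileges assumed):
$ hdfs dfsadmin -safemode get     # report whether the NameNode is currently in Safe Mode
$ hdfs dfsadmin -safemode enter   # place HDFS in read-only Safe Mode for maintenance
$ hdfs dfsadmin -safemode leave   # return to normal operation
$ hdfs dfsadmin -safemode wait    # block until the NameNode exits Safe Mode on its own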
Rack Awareness
Rack awareness deals with data locality. Recall that one of the main design goals of Hadoop
MapReduce is to move the computation to the data. Assuming that most data center networks
do not offer full bisection bandwidth, a typical Hadoop cluster will exhibit three levels of data
locality:
1. Data resides on the local machine (best).
2. Data resides in the same rack as the local machine (better).
3. Data resides in a different rack (good).
When the YARN scheduler is assigning MapReduce containers to work as mappers, it will
try to place the container first on the local machine, then on the same rack, and finally on
another rack.
In addition, the NameNode tries to place replicated data blocks on multiple racks for
improved fault tolerance. In such a case, an entire rack failure will not cause data loss or stop
HDFS from working. Performance may be degraded, however.
HDFS can be made rack-aware by using a user-derived script that enables the master node to
map the network topology of the cluster. A default Hadoop installation assumes all the nodes
belong to the same (large) rack. In that case, there is no option 3.
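A minimal sketch of such a topology script follows. The script name, rack names, and address-to-rack mapping are all illustrative; Hadoop simply calls the script named by the net.topology.script.file.name property (set in core-site.xml) with one or more DataNode addresses as arguments and expects one rack path per address on standard output.
#!/bin/bash
# rack-topology.sh -- example script referenced by net.topology.script.file.name
# Prints a rack path (e.g., /rack1) for every host/IP passed on the command line.
for node in "$@"; do
  case "$node" in
    10.0.1.*) echo "/rack1" ;;        # illustrative subnet-to-rack mapping
    10.0.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done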
NameNode High Availability
With early Hadoop installations, the NameNode was a single point of failure that could bring
down the entire Hadoop cluster. NameNode hardware often employed redundant power
supplies and storage to guard against such problems, but it was still susceptible to other
failures. The solution was to implement NameNode High Availability (HA) as a means to
provide true failover service.
As shown in Figure 3.3, an HA Hadoop cluster has two (or more) separate NameNode
machines. Each machine is configured with exactly the same software. One of the NameNode
machines is in the Active state, and the other is in the Standby state. Like a single NameNode
cluster, the Active NameNode is responsible for all client HDFS operations in the cluster.
The Standby NameNode maintains enough state to provide a fast failover (if required).
To guarantee the file system state is preserved, both the Active and Standby NameNodes
receive block reports from the DataNodes. The Active node also sends all file system edits to
a quorum of Journal nodes. At least three physically separate JournalNode daemons are
required, because edit log modifications must be written to a majority of the JournalNodes.
This design will enable the system to tolerate the failure of a single JournalNode machine.
The Standby node continuously reads the edits from the JournalNodes to ensure its
namespace is synchronized with that of the Active node. In the event of an Active NameNode
failure, the Standby node reads all remaining edits from the JournalNodes before promoting
itself to the Active state.
To prevent confusion between NameNodes, the JournalNodes allow only one NameNode to
be a writer at a time. During failover, the NameNode that is chosen to become active takes
over the role of writing to the JournalNodes. A SecondaryNameNode is not required in the
HA configuration because the Standby node also performs the tasks of the Secondary
NameNode.
Apache ZooKeeper is used to monitor the NameNode health. ZooKeeper is a highly available
service for maintaining small amounts of coordination data, notifying clients of changes in
that data, and monitoring clients for failures. HDFS failover relies on ZooKeeper for failure
detection and for Standby to Active NameNode election. The ZooKeeper components are not
depicted in Figure 3.3.
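In an HA configuration, the state of each NameNode can be queried, and a manual failover triggered, with the haadmin tool. The service IDs nn1 and nn2 below are placeholders for whatever names are defined by the cluster's dfs.ha.namenodes.* property:
$ hdfs haadmin -getServiceState nn1   # reports "active" or "standby"
$ hdfs haadmin -getServiceState nn2
$ hdfs haadmin -failover nn1 nn2      # manually promote nn2 to the Active state (fencing rules still apply)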
HDFS NameNode Federation
Older versions of HDFS provided a single namespace for the entire cluster, managed by a
single NameNode. HDFS Federation removes this limitation by allowing multiple NameNodes,
each of which manages a portion of the overall namespace. Federation offers the following
benefits:
Namespace scalability. HDFS cluster storage scales horizontally without placing a burden
on the NameNode.
Better performance. Adding more NameNodes to the cluster scales the file system
read/write operations throughput by separating the total namespace.
HDFS Checkpoints and Backups
The NameNode can be configured to use a CheckpointNode, which performs the same kind of
periodic checkpointing as the SecondaryNameNode described earlier. An HDFS BackupNode
is similar, but also maintains an up-to-date copy of the file system
namespace both in memory and on disk. Unlike a CheckpointNode, the BackupNode does
not need to download the fsimage and edits files from the active NameNode because it
already has an up-to-date namespace state in memory. A NameNode supports one
BackupNode at a time. No CheckpointNodes may be registered if a Backup node is in use.
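Both node types are started with the hdfs namenode command on the machine that will host them. This is only a sketch; the required addresses and configuration properties depend on the cluster setup.
$ hdfs namenode -checkpoint   # run this host as a CheckpointNode
$ hdfs namenode -backup       # run this host as a BackupNode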
HDFS Snapshots
HDFS snapshots are similar to backups, but are created by administrators using the
hdfs dfs -createSnapshot command. HDFS snapshots are read-only point-in-time copies of the
file system.
They offer the following features:
Snapshots can be taken of a sub-tree of the file system or the entire file system.
Snapshots can be used for data backup, protection against user errors, and disaster recovery.
See Chapter 10, “Basic Hadoop Administration Procedures,” for information on creating
HDFS snapshots.
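A typical snapshot session looks like the following sketch; the directory and snapshot names are placeholders. A directory must first be made "snapshottable" by an administrator before snapshots of it can be created.
$ hdfs dfsadmin -allowSnapshot /user/hdfs/demo           # administrator enables snapshots on the directory
$ hdfs dfs -createSnapshot /user/hdfs/demo snap-2024-01  # create a read-only point-in-time copy
$ hdfs lsSnapshottableDir                                # list directories where snapshots are allowed
$ hdfs dfs -ls /user/hdfs/demo/.snapshot                 # snapshots appear under the hidden .snapshot path
$ hdfs dfs -deleteSnapshot /user/hdfs/demo snap-2024-01  # remove the snapshot when no longer needed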
HDFS NFS Gateway
The HDFS NFS Gateway supports NFSv3 and enables HDFS to be mounted as part of the
client’s local file system. Users can browse the HDFS file system through their local file
system on any operating system with an NFSv3-compatible client. This feature offers users
the following capabilities:
Users can easily download/upload files from/to the HDFS file system to/from their local file
system.
Users can stream data directly to HDFS through the mount point. Appending to a file is
supported, but random write capability is not supported.
Mounting HDFS over NFS is explained in Chapter 10, “Basic Hadoop Administration
Procedures.”
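Once the gateway daemons are running, the export can be attached with an ordinary NFSv3 mount. The gateway hostname and local mount point below are placeholders:
$ sudo mkdir -p /mnt/hdfs
$ sudo mount -t nfs -o vers=3,proto=tcp,nolock,sync nfs-gateway-host:/ /mnt/hdfs
$ ls /mnt/hdfs                           # HDFS now appears as part of the local file system
$ cp results.csv /mnt/hdfs/user/hdfs/    # simple uploads and appends work; random writes do not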
HDFS User Commands
The following is a brief command reference that will facilitate navigation within HDFS. Be
aware that there are alternative options for each command and that the examples given here
are simple use-cases. What follows is by no means a full description of HDFS functionality.
For more information, see the section “Summary and Additional Resources” at the end of the
chapter.