BDA Module-1 Notes
In This Chapter:
The design and operation of the Hadoop Distributed File System (HDFS) are presented.
Important HDFS topics such as block replication, Safe Mode, rack awareness, High
Availability, Federation, backup, snapshots, NFS mounting, and the HDFS web GUI are
discussed.
The Hadoop Distributed File System is the backbone of Hadoop MapReduce processing.
New users and administrators often find HDFS different than most other UNIX/Linux file
systems. This chapter highlights the design goals and capabilities of HDFS that make it
useful for Big Data processing.
The Hadoop Distributed File System (HDFS) was designed for Big Data processing.
Although capable of supporting many users simultaneously, HDFS is not designed as a true
parallel file system. Rather, the design assumes a large file write-once/read-many model that
enables other optimizations and relaxes many of the concurrency and coherence overhead
requirements of a true parallel file system. For instance, HDFS rigorously restricts data
writing to one user at a time. All additional writes are “append-only,” and there is no random
writing to HDFS files. Bytes are always appended to the end of a stream, and byte streams
are guaranteed to be stored in the order written.
The design of HDFS is based on the design of the Google File System (GFS). A paper
published by Google provides further background on GFS
(http://research.google.com/archive/gfs.html).
HDFS is designed for data streaming where large amounts of data are read from disk in bulk.
The HDFS block size is typically 64MB or 128MB. Thus, this approach is entirely unsuitable
for standard POSIX file system use. In addition, due to the sequential nature of the data, there
is no local caching mechanism. The large block and file sizes make it more efficient to reread
data from HDFS than to try to cache the data.
Perhaps the most interesting aspect of HDFS—and the one that separates it from other file
systems—is its data locality. A principal design aspect of Hadoop MapReduce is the
emphasis on moving the computation to the data rather than moving the data to the
computation. This distinction is reflected in how Hadoop clusters are implemented. In other
high-performance systems, a parallel file system will exist on hardware separate from the
compute hardware. Data is then moved to and from the computer components via high-speed
interfaces to the parallel file system array. HDFS, in contrast, is designed to work on the
same hardware as the compute portion of the cluster. That is, a single server node in the
cluster is often both a computation engine and a storage engine for the application.
Finally, Hadoop clusters assume node (and even rack) failure will occur at some point. To
deal with this situation, HDFS has a redundant design that can tolerate system failure and still
provide the data needed by the compute part of the program.
The following points summarize important aspects of the HDFS design:
Files may be appended, but random seeks are not permitted. There is no caching of data.
Converged data storage and processing happen on the same server nodes.
A reliable file system maintains multiple copies of data across the cluster. Consequently,
failure of a single node (or even a rack in a large cluster) will not bring down the file system.
A specialized file system is used, which is not designed for general use.
HDFS COMPONENTS
The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes.
In a basic design, a single NameNode manages all the metadata needed to store and retrieve
the actual data from the DataNodes. No data is actually stored on the NameNode, however.
For a minimal Hadoop installation, there needs to be a single NameNode daemon and a single
DataNode daemon running on at least one machine (see the section “Installing Hadoop from
Apache Sources” in Chapter 2, “Installation Recipes”).
The design is a master/slave architecture in which the master (NameNode) manages the file
system namespace and regulates access to files by clients. File system namespace operations
such as opening, closing, and renaming files and directories are all managed by the
NameNode. The NameNode also determines the mapping of blocks to DataNodes and
handles DataNode failures.
The slaves (DataNodes) are responsible for serving read and write requests from the file
system to the clients. The NameNode manages block creation, deletion, and replication.
Figure 3.1 shows the NameNode and DataNode roles in an HDFS deployment.
When a client writes data, it first communicates with the NameNode, which determines the
DataNodes that will store the blocks. The client then writes the blocks directly to those
DataNodes, and the blocks are replicated as they are written.
Reading data happens in a similar fashion. The client requests a file from the NameNode,
which returns the best DataNodes from which to read the data. The client then accesses the
data directly from the DataNodes.
Thus, once the metadata has been delivered to the client, the NameNode steps back and lets
the conversation between the client and the DataNodes proceed. While data transfer is
progressing, the NameNode also monitors the DataNodes by listening for heartbeats sent
from DataNodes. The lack of a heartbeat signal indicates a potential node failure. In such a
case, the NameNode will route around the failed DataNode and begin re-replicating the now-
missing blocks. Because the file system is redundant, DataNodes can be taken offline
(decommissioned) for maintenance by informing the NameNode of the DataNodes to exclude
from the HDFS pool.
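As a quick illustration of this client/NameNode/DataNode interaction, the HDFS command line client can be used to copy a file into and out of HDFS. The following is a minimal sketch; the file and directory names are placeholders. Behind the scenes, the client contacts the NameNode for metadata and block locations and then moves the data directly to or from the DataNodes.
$ hdfs dfs -mkdir -p /user/hdfs/demo                   # NameNode records the new directory in its namespace
$ hdfs dfs -put war-and-peace.txt /user/hdfs/demo      # blocks are written directly to DataNodes
$ hdfs dfs -ls /user/hdfs/demo                         # file metadata (size, replication) comes from the NameNode
$ hdfs dfs -get /user/hdfs/demo/war-and-peace.txt .    # blocks are read directly from DataNodes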
The mappings between data blocks and the physical DataNodes are not kept in persistent
storage on the NameNode. For performance reasons, the NameNode stores all metadata in
memory. Upon startup, each DataNode provides a block report (which it keeps in persistent
storage) to the NameNode. The block reports are sent every 10 heartbeats. (The interval
between reports is a configurable property.) The reports enable the NameNode to keep an up-
to-date account of all data blocks in the cluster.
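The heartbeat and block report intervals are ordinary HDFS configuration properties and can be inspected with the hdfs getconf command. The property names below are the stock Apache Hadoop names, and the dfsadmin report summarizes what the NameNode currently knows about each DataNode (administrator privileges assumed):
$ hdfs getconf -confKey dfs.heartbeat.interval         # heartbeat interval in seconds
$ hdfs getconf -confKey dfs.blockreport.intervalMsec   # block report interval in milliseconds
$ hdfs dfsadmin -report                                # per-DataNode capacity, usage, and last contact time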
The purpose of the SecondaryNameNode is to perform periodic checkpoints that evaluate the
status of the NameNode. Recall that the NameNode keeps all system metadata in memory for
fast access. It also has two disk files that track changes to the metadata:
An image of the file system state when the NameNode was started. This file begins with
fsimage_* and is used only at startup by the NameNode.
A series of modifications done to the file system after starting the NameNode. These files
begin with edits_* and reflect the changes made after the fsimage_* file was read.
The location of these files is set by the dfs.namenode.name.dir property in the hdfs-site.xml
file.
The SecondaryNameNode periodically downloads fsimage and edits files, joins them into a
new fsimage, and uploads the new fsimage file to the NameNode. Thus, when the NameNode
restarts, the fsimage file is reasonably up-to-date and requires only the edit logs to be applied
since the last checkpoint. If the SecondaryNameNode were not running, a restart of the
NameNode could take a prohibitively long time due to the number of changes to the file
system.
The SecondaryNameNode performs checkpoints of the NameNode's file system state but is not
a failover node.
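The checkpoint files can be examined on the NameNode host. This is only a sketch: the directory shown is whatever value dfs.namenode.name.dir holds on a given cluster, and the fsimage_*/edits_* layout is the one used by current Apache Hadoop releases.
$ hdfs getconf -confKey dfs.namenode.name.dir          # reports where the NameNode keeps its metadata
$ ls <value-of-dfs.namenode.name.dir>/current          # contains the fsimage_* and edits_* files described above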
HDFS Block Replication
If several machines must be involved in the serving of a file, then a file could be rendered
unavailable by the loss of any one of those machines. HDFS combats this problem by
replicating each block across a number of machines (three is the default).
In addition, the HDFS default block size is often 64MB. In a typical operating system, the
block size is 4KB or 8KB. The HDFS default block size is not the minimum block size,
however. If a 20KB file is written to HDFS, it will create a block that is approximately 20KB
in size. (The underlying file system may have a minimal block size that increases the actual
file size.) If a file of size 80MB is written to HDFS, a 64MB block and a 16MB block will be
created.
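How a particular file has been split into blocks and replicated can be checked with hdfs fsck, and the replication factor of an existing file can be changed with setrep. The path below is only a placeholder:
$ hdfs fsck /user/hdfs/demo/file.dat -files -blocks -locations   # shows each block, its size, and the DataNodes holding it
$ hdfs dfs -setrep -w 2 /user/hdfs/demo/file.dat                 # raise or lower the replication factor; -w waits for completion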
As mentioned in Chapter 1, “Background and Concepts,” HDFS blocks are not exactly the
same as the data splits used by the MapReduce process. The HDFS blocks are based on size,
while the splits are based on a logical partitioning of the data. For instance, if a file contains
discrete records, the logical split ensures that a record is not split physically across two
separate servers during processing. Each HDFS block may consist of one or more splits.
Figure 3.2 provides an example of how a file is broken into blocks and replicated across the
cluster. In this case, a replication factor of 3 ensures that any one DataNode can fail and the
replicated blocks will be available on other nodes—and then subsequently re-replicated on
other DataNodes.
HDFS Safe Mode
When the NameNode starts, it enters a read-only safe mode where blocks cannot be
replicated or deleted. Safe Mode enables the NameNode to perform two important processes:
1. The previous file system state is reconstructed by loading the fsimage file into memory and
replaying the edit log.
2. The mapping between blocks and DataNodes is created by waiting for enough of the
DataNodes to register so that at least one copy of the data is available. Not all DataNodes are
required to register before HDFS exits from Safe Mode. The registration process may
continue for some time.
HDFS may also enter Safe Mode for maintenance using the hdfs dfsadmin -safemode
command or when there is a file system issue that must be addressed by the administrator.
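Safe Mode can be queried and toggled from the command line with the dfsadmin tool (administrator privileges assumed):
$ hdfs dfsadmin -safemode get     # report whether the NameNode is currently in Safe Mode
$ hdfs dfsadmin -safemode enter   # place HDFS in read-only Safe Mode for maintenance
$ hdfs dfsadmin -safemode leave   # return to normal operation
$ hdfs dfsadmin -safemode wait    # block until the NameNode exits Safe Mode on its own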
Rack Awareness
Rack awareness deals with data locality. Recall that one of the main design goals of Hadoop
MapReduce is to move the computation to the data. Assuming that most data center networks
do not offer full bisection bandwidth, a typical Hadoop cluster will exhibit three levels of data
locality:
1. Data resides on the local machine (best).
2. Data resides in the same rack as the local machine (better).
3. Data resides in a different rack (good).
When the YARN scheduler is assigning MapReduce containers to work as mappers, it will
try to place the container first on the local machine, then on the same rack, and finally on
another rack.
In addition, the NameNode tries to place replicated data blocks on multiple racks for
improved fault tolerance. In such a case, an entire rack failure will not cause data loss or stop
HDFS from working. Performance may be degraded, however.
HDFS can be made rack-aware by using a user-derived script that enables the master node to
map the network topology of the cluster. A default Hadoop installation assumes all the nodes
belong to the same (large) rack. In that case, there is no option 3.
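A minimal sketch of such a topology script follows. The script name, rack names, and address-to-rack mapping are all illustrative; Hadoop simply calls the script named by the net.topology.script.file.name property (set in core-site.xml) with one or more DataNode addresses as arguments and expects one rack path per address on standard output.
#!/bin/bash
# rack-topology.sh -- example script referenced by net.topology.script.file.name
# Prints a rack path (e.g., /rack1) for every host/IP passed on the command line.
for node in "$@"; do
  case "$node" in
    10.0.1.*) echo "/rack1" ;;        # illustrative subnet-to-rack mapping
    10.0.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done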
NameNode High Availability
With early Hadoop installations, the NameNode was a single point of failure that could bring
down the entire Hadoop cluster. NameNode hardware often employed redundant power
supplies and storage to guard against such problems, but it was still susceptible to other
failures. The solution was to implement NameNode High Availability (HA) as a means to
provide true failover service.
As shown in Figure 3.3, an HA Hadoop cluster has two (or more) separate NameNode
machines. Each machine is configured with exactly the same software. One of the NameNode
machines is in the Active state, and the other is in the Standby state. Like a single NameNode
cluster, the Active NameNode is responsible for all client HDFS operations in the cluster.
The Standby NameNode maintains enough state to provide a fast failover (if required).
To guarantee the file system state is preserved, both the Active and Standby NameNodes
receive block reports from the DataNodes. The Active node also sends all file system edits to
a quorum of Journal nodes. At least three physically separate JournalNode daemons are
required, because edit log modifications must be written to a majority of the JournalNodes.
This design will enable the system to tolerate the failure of a single JournalNode machine.
The Standby node continuously reads the edits from the JournalNodes to ensure its
namespace is synchronized with that of the Active node. In the event of an Active NameNode
failure, the Standby node reads all remaining edits from the JournalNodes before promoting
itself to the Active state.
To prevent confusion between NameNodes, the JournalNodes allow only one NameNode to
be a writer at a time. During failover, the NameNode that is chosen to become active takes
over the role of writing to the JournalNodes. A SecondaryNameNode is not required in the
HA configuration because the Standby node also performs the tasks of the Secondary
NameNode.
Apache ZooKeeper is used to monitor the NameNode health. ZooKeeper is a highly available
service for maintaining small amounts of coordination data, notifying clients of changes in
that data, and monitoring clients for failures. HDFS failover relies on ZooKeeper for failure
detection and for Standby to Active NameNode election. The ZooKeeper components are not
depicted in Figure 3.3.
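In an HA configuration, the state of each NameNode can be queried, and a manual failover triggered, with the haadmin tool. The service IDs nn1 and nn2 below are placeholders for whatever names are defined by the cluster's dfs.ha.namenodes.* property:
$ hdfs haadmin -getServiceState nn1   # reports "active" or "standby"
$ hdfs haadmin -getServiceState nn2
$ hdfs haadmin -failover nn1 nn2      # manually promote nn2 to the Active state (fencing rules still apply)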
HDFS NameNode Federation
Older versions of HDFS provided a single namespace for the entire cluster, managed by a
single NameNode. HDFS Federation removes this limitation by allowing multiple NameNodes,
each of which manages a portion of the overall namespace. Federation offers the following
benefits:
Namespace scalability. HDFS cluster storage scales horizontally without placing a burden
on the NameNode.
Better performance. Adding more NameNodes to the cluster scales the file system
read/write operations throughput by separating the total namespace.
HDFS Checkpoints and Backups
The NameNode can be configured to use a CheckpointNode, which performs the same kind of
periodic checkpointing as the SecondaryNameNode described earlier. An HDFS BackupNode
is similar, but also maintains an up-to-date copy of the file system
namespace both in memory and on disk. Unlike a CheckpointNode, the BackupNode does
not need to download the fsimage and edits files from the active NameNode because it
already has an up-to-date namespace state in memory. A NameNode supports one
BackupNode at a time. No CheckpointNodes may be registered if a Backup node is in use.
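Both node types are started with the hdfs namenode command on the machine that will host them. This is only a sketch; the required addresses and configuration properties depend on the cluster setup.
$ hdfs namenode -checkpoint   # run this host as a CheckpointNode
$ hdfs namenode -backup       # run this host as a BackupNode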
HDFS Snapshots
HDFS snapshots are similar to backups, but are created by administrators using the
hdfs dfs -createSnapshot command. HDFS snapshots are read-only point-in-time copies of the
file system.
They offer the following features:
Snapshots can be taken of a sub-tree of the file system or the entire file system.
Snapshots can be used for data backup, protection against user errors, and disaster recovery.
See Chapter 10, “Basic Hadoop Administration Procedures,” for information on creating
HDFS snapshots.
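A typical snapshot session looks like the following sketch; the directory and snapshot names are placeholders. A directory must first be made "snapshottable" by an administrator before snapshots of it can be created.
$ hdfs dfsadmin -allowSnapshot /user/hdfs/demo           # administrator enables snapshots on the directory
$ hdfs dfs -createSnapshot /user/hdfs/demo snap-2024-01  # create a read-only point-in-time copy
$ hdfs lsSnapshottableDir                                # list directories where snapshots are allowed
$ hdfs dfs -ls /user/hdfs/demo/.snapshot                 # snapshots appear under the hidden .snapshot path
$ hdfs dfs -deleteSnapshot /user/hdfs/demo snap-2024-01  # remove the snapshot when no longer needed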
HDFS NFS Gateway
The HDFS NFS Gateway supports NFSv3 and enables HDFS to be mounted as part of the
client’s local file system. Users can browse the HDFS file system through their local file
system on any operating system with an NFSv3-compatible client. This feature offers users
the following capabilities:
Users can easily download/upload files from/to the HDFS file system to/from their local file
system.
Users can stream data directly to HDFS through the mount point. Appending to a file is
supported, but random write capability is not supported.
Mounting HDFS over NFS is explained in Chapter 10, “Basic Hadoop Administration
Procedures.”
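Once the gateway daemons are running, the export can be attached with an ordinary NFSv3 mount. The gateway hostname and local mount point below are placeholders:
$ sudo mkdir -p /mnt/hdfs
$ sudo mount -t nfs -o vers=3,proto=tcp,nolock,sync nfs-gateway-host:/ /mnt/hdfs
$ ls /mnt/hdfs                           # HDFS now appears as part of the local file system
$ cp results.csv /mnt/hdfs/user/hdfs/    # simple uploads and appends work; random writes do not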
HDFS User Commands
The following is a brief command reference that will facilitate navigation within HDFS. Be
aware that there are alternative options for each command and that the examples given here
are simple use-cases. What follows is by no means a full description of HDFS functionality.
For more information, see the section “Summary and Additional Resources” at the end of the
chapter.