Distributed File System
By Bandana Mahapatra
Why files are used
1. Permanent storage of information on a secondary storage medium.
2. Sharing of information between applications.
What is a file system
• A file system is a subsystem of the operating system that performs file
management activities such as organization, storing, retrieval, naming,
sharing, and protection of files.
• A file system frees the programmer from concerns about the details of
space allocation and layout of the secondary storage device.
Desirable features of a distributed file system:
1. Remote information sharing: Any node, irrespective of the physical location of a file, can access that file.
2. User mobility: Users should be permitted to work on different nodes.
3. Availability: For better fault-tolerance, files should be available for use even in the event of temporary failure of
one or more nodes of the system. Thus the system should maintain multiple copies of the files, the existence of which
should be transparent to the user.
4. Diskless workstations: A distributed file system, with its transparent remote-file accessing capability, allows the use of diskless workstations in a system.
A distributed file system provides the
following types of services:
Storage service
- Structure transparency: Clients should not know the number or locations of the file servers and storage devices.
- Access transparency: Both local and remote files should be accessible in the same way. The file system should automatically locate an accessed file and transport it to the client's site.
- Naming transparency: The name of a file should give no hint as to its location, and the name must not change when the file moves from one node to another.
- Replication transparency: If a file is replicated on multiple nodes, both the existence of the multiple copies and their locations should be hidden from clients.
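The transparency properties above can be illustrated with a small sketch (a toy model for teaching purposes, not a real DFS API; all class and method names here are invented): a name service maps a location-independent file name to the nodes holding its replicas, so the client sees neither where a file lives nor how many copies exist.

```python
# Toy sketch of naming, replication, and access transparency.
# All names here (NameService, Client, etc.) are illustrative inventions.

class NameService:
    def __init__(self):
        # file name -> list of (node, local_path) replicas; hidden from clients
        self._replicas = {}

    def register(self, name, node, local_path):
        self._replicas.setdefault(name, []).append((node, local_path))

    def locate(self, name):
        # Return any available replica; the caller never learns how many exist.
        return self._replicas[name][0]

class Client:
    def __init__(self, name_service, nodes):
        self._ns = name_service
        self._nodes = nodes  # node -> {local_path: data}; stands in for the network

    def read(self, name):
        # Same call whether the file is local or remote: access transparency.
        node, path = self._ns.locate(name)
        return self._nodes[node][path]

ns = NameService()
nodes = {"nodeA": {"/disk1/f1": b"hello"}, "nodeB": {"/disk7/f1": b"hello"}}
ns.register("/shared/report.txt", "nodeA", "/disk1/f1")  # two replicas of
ns.register("/shared/report.txt", "nodeB", "/disk7/f1")  # one logical name

client = Client(ns, nodes)
print(client.read("/shared/report.txt"))  # b'hello' -- no hint of location
```

Note that the logical name `/shared/report.txt` carries no location information (naming transparency), and `locate` could return any replica without the client noticing (replication transparency).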
HDFS
• The Hadoop Distributed File System (HDFS) is the primary data
storage system used by Hadoop applications. It employs a NameNode
and DataNode architecture to implement a distributed file system that
provides high-performance access to data across highly scalable
Hadoop clusters.
• HDFS is a key part of many Hadoop ecosystem technologies, as it provides a reliable means of managing pools of big data and supporting related big data analytics applications.
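The NameNode/DataNode split can be sketched as a toy model (illustrative only, not the real Hadoop API): the NameNode keeps only metadata mapping files to blocks and blocks to DataNode replicas, while the actual bytes are fetched from DataNodes directly.

```python
# Minimal sketch of the HDFS read path (toy model; class names mirror the
# HDFS roles, but the methods and data structures are invented for teaching).

class NameNode:
    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode ids (replicas)

    def get_block_locations(self, path):
        # Metadata only: the NameNode never serves file contents itself.
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]

class DataNode:
    def __init__(self):
        self.blocks = {}  # block id -> bytes

def read_file(namenode, datanodes, path):
    data = b""
    for block_id, locations in namenode.get_block_locations(path):
        # Read each block from the first available replica.
        data += datanodes[locations[0]].blocks[block_id]
    return data

nn = NameNode()
dns = {"dn1": DataNode(), "dn2": DataNode()}
nn.file_blocks["/logs/app.log"] = ["blk_1", "blk_2"]
nn.block_locations["blk_1"] = ["dn1", "dn2"]  # blocks are replicated
nn.block_locations["blk_2"] = ["dn2"]
dns["dn1"].blocks["blk_1"] = b"part1-"
dns["dn2"].blocks["blk_1"] = b"part1-"
dns["dn2"].blocks["blk_2"] = b"part2"

print(read_file(nn, dns, "/logs/app.log"))  # b'part1-part2'
```

This mirrors the design choice that makes HDFS scale: clients contact the NameNode once for locations, then stream block data from DataNodes in parallel, keeping the metadata server off the data path.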
Case study: Andrew File System
• Andrew is a distributed computing environment developed jointly by Carnegie Mellon University and IBM. One of the major components of Andrew is a distributed file system.
• The goal of the Andrew File System is to support growth to at least 7,000 workstations (one for each student, faculty member, and staff member at Carnegie Mellon) while providing users, application programs, and system administrators with the amenities of a shared file system.
Characteristics of Big Data
(i) Volume – The name Big Data itself relates to an enormous size. The size of the data plays a crucial role in determining its value, and whether particular data can actually be considered Big Data depends on its volume. Hence, 'Volume' is one characteristic that must be considered when dealing with Big Data.
(ii) Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses issues for storing, mining, and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demands determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency that the data can show at times, which hampers the process of handling and managing the data effectively.
Benefits of Big Data Processing