Distributed File System

By Bandana Mahapatra
Why files are used
1. Permanent storage of information on secondary storage media.
2. Sharing of information between applications.
What is a file system
• A file system is a subsystem of the operating system that performs file
management activities such as organization, storing, retrieval, naming,
sharing, and protection of files.

• A file system frees the programmer from concerns about the details of
space allocation and layout of the secondary storage device.

• The design and implementation of a distributed file system is more
complex than a conventional file system because the users and storage
devices are physically dispersed.
DFS features
In addition to the functions of the file system of a single-processor system, a distributed file system supports the
following:

1. Remote information sharing: Any node can access a file irrespective of the file's physical location.
2. User mobility: Users should be permitted to work on different nodes.
3. Availability: For better fault tolerance, files should be available for use even in the event of temporary failure of
one or more nodes of the system. The system should therefore maintain multiple copies of each file, the existence of
which should be transparent to the user.
4. Diskless workstations: A distributed file system, with its transparent remote-file accessing capability, allows the
use of diskless workstations in a system.
A distributed file system provides the following types of services:

• Storage service
Allocation and management of space on a secondary storage device, thus
providing a logical view of the storage system.

• True file service
Includes file-sharing semantics, the file-caching mechanism, the file replication
mechanism, concurrency control, the multiple-copy update protocol, etc.
Continued…
• Name/Directory service
Responsible for directory-related activities such as creation and deletion
of directories, adding a new file to a directory, deleting a file from a
directory, changing the name of a file, moving a file from one directory
to another, etc. (a toy sketch of these operations follows below).
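To make these directory operations concrete, here is a minimal in-memory model in Python. It is only a sketch: the class and method names are hypothetical, not taken from any real distributed file system.

```python
class DirectoryService:
    """Toy in-memory model of a DFS name/directory service (illustrative only)."""

    def __init__(self):
        self.dirs = {"/": set()}          # directory path -> set of entry names

    def create_dir(self, path):
        self.dirs[path] = set()

    def delete_dir(self, path):
        if self.dirs.get(path):
            raise OSError("directory not empty")
        self.dirs.pop(path, None)

    def add_file(self, dirpath, name):
        self.dirs[dirpath].add(name)      # add a new file entry to a directory

    def delete_file(self, dirpath, name):
        self.dirs[dirpath].discard(name)  # remove a file entry from a directory

    def rename_file(self, dirpath, old, new):
        self.delete_file(dirpath, old)
        self.add_file(dirpath, new)

    def move_file(self, src_dir, dst_dir, name):
        self.delete_file(src_dir, name)   # moving = unlink from source directory...
        self.add_file(dst_dir, name)      # ...and link into the destination


svc = DirectoryService()
svc.create_dir("/home")
svc.add_file("/home", "notes.txt")
svc.rename_file("/home", "notes.txt", "report.txt")
```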
Desirable features of a distributed file system:

• Transparency

- Structure transparency
Clients should not know the number or locations of file servers and storage devices.

- Access transparency
Both local and remote files should be accessible in the same way. The file system should automatically locate an
accessed file and transport it to the client's site.

- Naming transparency
The name of a file should give no hint as to the location of the file.
The name of a file must not change when the file moves from one node to another.

- Replication transparency
If a file is replicated on multiple nodes, both the existence of multiple copies and their locations should be
hidden from the clients.
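As a rough illustration of naming and replication transparency together, the sketch below resolves a location-independent file name to one of several hidden replicas. The replica table and node names are made up for illustration.

```python
import random

# Hypothetical replica table: a location-independent file name maps to the
# nodes holding copies; clients never see these locations directly.
REPLICAS = {
    "/project/report.txt": ["nodeA", "nodeB", "nodeC"],
}

def open_file(name):
    """Resolve a logical name to one replica, hiding copy count and location."""
    nodes = REPLICAS[name]
    node = random.choice(nodes)   # any replica will do; the choice stays hidden
    return f"fetching {name} from {node}"

print(open_file("/project/report.txt"))
```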
HDFS
• The Hadoop Distributed File System (HDFS) is the primary data
storage system used by Hadoop applications. It employs a NameNode
and DataNode architecture to implement a distributed file system that
provides high-performance access to data across highly scalable
Hadoop clusters.
• HDFS is a key part of many Hadoop ecosystem technologies, as it
provides a reliable means for managing pools of big data and
supporting related big data analytics applications.
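As a concrete taste of this architecture, the sketch below writes and reads a file through HDFS using the third-party `hdfs` Python package (a WebHDFS client). The NameNode address, user name, and paths are assumptions made for illustration.

```python
from hdfs import InsecureClient  # third-party package: pip install hdfs

# Connect to the NameNode's WebHDFS endpoint (host and port are assumed here).
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a file; HDFS splits it into blocks and replicates them across DataNodes.
with client.write("/user/hadoop/hello.txt", encoding="utf-8", overwrite=True) as writer:
    writer.write("hello from HDFS\n")

# Read it back: the client asks the NameNode where the blocks live,
# then streams the data from the DataNodes.
with client.read("/user/hadoop/hello.txt", encoding="utf-8") as reader:
    print(reader.read())
```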
Case study: Andrew File System
• Andrew is a distributed computing environment being developed in a joint project by Carnegie
Mellon University and IBM. One of the major components of Andrew is a distributed file system.

• The goal of the Andrew File System is to support growth up to at least 7000 workstations (one for
each student, faculty member, and staff member at Carnegie Mellon) while providing users, application
programs, and system administrators with the amenities of a shared file system.

• The general goal of widespread accessibility of computational and informational facilities,
coupled with the choice of UNIX, led to the decision to provide an integrated, campus-wide file
system with functional characteristics as close to those of UNIX as possible. The first design choice
was to make the file system compatible with UNIX at the system call level.
• The second design decision was to use whole files as the basic unit of data
movement and storage, rather than some smaller unit such as physical or logical
records. This is undoubtedly the most controversial and interesting aspect of the
Andrew File System.
• This means that before a workstation can use a file, it must copy the
entire file to its local disk, and it must write modified files back to the
file system in their entirety. This in turn requires a local disk to hold
recently used files.
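A minimal Python sketch of this whole-file model follows. It illustrates the idea only and is not actual AFS client code: the entire file is copied to a local cache on open and shipped back in its entirety on close.

```python
import os
import shutil
import tempfile

# Hypothetical paths standing in for the central server and the local disk cache.
SERVER_DIR = tempfile.mkdtemp(prefix="server_")
CACHE_DIR = tempfile.mkdtemp(prefix="cache_")

def afs_style_open(name):
    """Copy the *entire* file from the server to the local cache before use."""
    src = os.path.join(SERVER_DIR, name)
    dst = os.path.join(CACHE_DIR, name)
    shutil.copyfile(src, dst)
    return dst

def afs_style_close(name):
    """On close, write the modified file back to the server in its entirety."""
    shutil.copyfile(os.path.join(CACHE_DIR, name), os.path.join(SERVER_DIR, name))

# Seed the "server" with a file, then edit it through the cache.
with open(os.path.join(SERVER_DIR, "notes.txt"), "w") as f:
    f.write("v1\n")

local = afs_style_open("notes.txt")
with open(local, "a") as f:
    f.write("v2\n")           # record-level changes stay local until close
afs_style_close("notes.txt")  # only now does the server see the update
```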
The issues with this whole-file approach are:
• File size
• Record updating
File size
• Only files small enough to fit on the local disk can be handled. Where
this mattered in the Andrew environment, large files had to be broken
into smaller parts that fit.
Record updating
• Modified files are returned to the central system only when they are
closed, thus rendering record-level updates impossible.
This was not considered a serious issue: the main application for
record-level updates is databases, and serious multi-user databases have
many other requirements, such as
1. record- or field-granularity authorization,
2. physical disk write-ordering controls, and
3. update serialization,
which are not satisfied by UNIX file system semantics even in a non-
distributed environment.
• The third and last key design decision in the Andrew File System was
to implement it with many relatively small servers rather than a single
large machine. This decision was based on the desire to support
growth gracefully, to enhance availability (since if any single server
fails, the others should continue), and to simplify the development
process by using the same hardware and operating system as the
workstations.
Current status of AFS
At the present time, an Andrew file server consists of
• a workstation with three to six 400-megabyte disks attached.
• A price/performance goal of supporting at least 50 active workstations
per file server has been achieved, so that the centralized costs of the file
system remain reasonable.
• In a large configuration like the one at Carnegie Mellon, a separate
"system control machine" is used to broadcast global information (such as
where specific users' files are to be found) to the file servers. In a small
configuration, the system control machine is combined with a server
machine.
Question (Skills)
• Comparative analysis of the Network File System, the Google File System,
and file systems in the cloud.
Big Data
• What is Data?
The quantities, characters, or symbols on which operations are performed by a
computer, which may be stored and transmitted in the form of electrical
signals and recorded on magnetic, optical, or mechanical recording media.

• What is Big Data?
Big Data is also data, but of enormous size. Big Data is a term used to describe
a collection of data that is huge in volume and yet growing exponentially with
time. In short, such data is so large and complex that none of the traditional
data management tools can store or process it efficiently.
Types of Big Data

Big Data can be found in three forms:

• Structured: a tabular form of data representation and storage, e.g.
databases or data in Excel format.
• Unstructured: any data with unknown form or structure is classified as
unstructured data, e.g. the output of a Google search.
• Semi-structured: semi-structured data can contain both forms of data.
It appears structured in form but is not actually defined by, e.g., a table
definition in a relational DBMS. An example of semi-structured data is
personal data stored in an XML file, as in the sketch below.
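For instance, the record below is semi-structured: the tags give it form, but no fixed schema constrains which fields appear, and different records may carry different fields. A short Python sketch parsing such a record (the field names and values are made up):

```python
import xml.etree.ElementTree as ET

# Semi-structured data: tagged, but with no table definition behind it.
record = """
<person>
    <name>A. Student</name>
    <email>student@example.com</email>
    <hobby>chess</hobby>
</person>
"""

root = ET.fromstring(record)
for field in root:                 # fields vary record to record; no fixed schema
    print(field.tag, "=", field.text)
```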
Characteristics of Big Data

(i) Volume – The name Big Data itself refers to an enormous size. The size of data
plays a very crucial role in determining the value of data. Whether particular data
can actually be considered Big Data or not depends upon its volume. Hence,
'Volume' is one characteristic which needs to be considered while dealing with Big
Data.
(ii) Variety – The next aspect of Big Data is its variety.
Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. In earlier days, spreadsheets and databases were the only sources of data
considered by most applications. Nowadays, data in the form of emails, photos,
videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis
applications. This variety of unstructured data poses certain issues for storing, mining and
analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How
fast the data is generated and processed to meet demands determines the real
potential of the data.
Big Data velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, social media sites, sensors,
mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency the data can show at times,
hampering the process of handling and managing the data effectively.
Benefits of Big Data Processing

The ability to process Big Data brings multiple benefits, such as:

• Businesses can utilize outside intelligence while taking decisions.
Access to social data from search engines and sites like Facebook and Twitter is enabling
organizations to fine-tune their business strategies.
• Improved customer service.
Traditional customer feedback systems are being replaced by new systems designed with Big
Data technologies. In these new systems, Big Data and natural language processing technologies are
used to read and evaluate consumer responses.
• Early identification of risk to the product/services, if any.
• Better operational efficiency.
Drivers of Big Data
Data-driven initiatives:
These are primarily categorized into 3 broad areas:
a. Data-driven innovation
b. Data-driven decision making
c. Data-driven discovery
• Data science as a competitive advantage
• Sustained processes
• Cost advantages of commodity hardware & open-source software
• Quick turnaround and less bench time
• Automation to backfill redundant/mundane tasks
Technical
• Data continues to grow exponentially.
• Data is everywhere and in many formats.
• Alternate, multiple synchronous & asynchronous data streams.
• Low barrier to entry.
• Traditional solutions failing to catch up with new market conditions.
Applications of Big Data
• Tracking customer spending habits and shopping behaviour
• Recommendation systems
• Smart traffic systems
• Self-driving cars
• Virtual personal assistant tools
• IoT
• Energy sector
• Media and entertainment sector
Algorithms for Big Data Architecture
• Linear regression
• Logistic regression
• Classification and regression trees
• K-nearest neighbours (see the sketch below)
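As a taste of the last item in this list, here is a minimal k-nearest-neighbours classifier in plain Python/NumPy. It is a sketch with made-up sample data, not a distributed or production big-data implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = y_train[nearest]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]              # majority label

# Tiny made-up dataset: two clusters labelled 0 and 1.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])

print(knn_predict(X, y, np.array([1.1, 0.9])))    # -> 0 (near the first cluster)
print(knn_predict(X, y, np.array([5.1, 5.0])))    # -> 1 (near the second cluster)
```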
