PPT - CC - UNIT 5

UNIT- V SYLLABUS:

Storage Systems: Evolution of storage technology, storage models, file systems and
databases, distributed file systems, general parallel file systems, Google File System.



Data storage on a cloud
◼ Storage and processing on the cloud are intimately tied to one another.
 Most cloud applications process very large amounts of data. Effective data
replication and storage management strategies are critical to the computations
performed on the cloud.
 Strategies to reduce the access time and to support real-time multimedia access
are necessary to satisfy the requirements of content delivery.
◼ Sensors feed a continuous stream of data to cloud applications.
◼ An ever-increasing number of cloud-based services collect detailed data about their
services and information about the users of these services. The service providers use
the clouds to analyze the data.
◼ Humongous amounts of data - in 2013:
 Internet video will generate over 18 EB/month.
 Global mobile data traffic will reach 2 EB/month.
• (1 EB = 10^18 bytes, 1 PB = 10^15 bytes, 1 TB = 10^12 bytes, 1 GB = 10^9 bytes)



Big data
◼ New concept ➔ reflects the fact that many applications use data sets that cannot be
stored and processed using local resources.
◼ Applications in genomics, structural biology, high energy physics, astronomy,
meteorology, and the study of the environment carry out complex analysis of data
sets often of the order of TBs (terabytes). Examples:
 In 2010, the four main detectors at the Large Hadron Collider (LHC) produced
13 PB of data.
 The Sloan Digital Sky Survey (SDSS) collects about 200 GB of data per night.
◼ Big data is a three-dimensional phenomenon (the three Vs):
 Volume: an increased amount of data.
 Velocity: increased processing speed is required to process more data and produce
more results.
 Variety: a diversity of data sources and data types.



Evolution of storage technology
◼ The global capacity to store information, in units of 730-MB CD-ROMs:
 1986 - 2.6 EB ➔ <1 CD-ROM/person.
 1993 - 15.8 EB ➔ 4 CD-ROMs/person.
 2000 - 54.5 EB ➔ 12 CD-ROMs/person.
 2007 - 295.0 EB ➔ 61 CD-ROMs/person.
◼ Hard disk drives (HDD) - during the 1980-2003 period:
 Storage density increased by four orders of magnitude, from about 0.01 Gb/in²
to about 100 Gb/in² (Gb/in² = gigabits per square inch).
 Prices fell by five orders of magnitude, to about 1 cent/MB.
 HDD densities are projected to climb to 1,800 Gb/in² by 2016, up from 744 Gb/in² in
2011.
◼ Dynamic Random Access Memory (DRAM) - during the period 1990-2003:
 Density increased from about 1 Gb/in² in 1990 to 100 Gb/in².
 Cost tumbled from about $80/MB to less than $1/MB.
Storage and data models

◼ A storage model ➔ describes the layout of a data structure in physical storage - a
local disk, removable media, or storage accessible via the network.

◼ A data model ➔ captures the most important logical aspects of a data structure in a
database.


Storage and data models
◼ Two abstract models of storage are used.
 Cell storage ➔ assumes that the storage consists of cells of the same size and
that each object fits exactly in one cell. This model reflects the physical
organization of several storage media;
 The primary memory of a computer is organized as an array of memory cells and
a secondary storage device, e.g., a disk, is organized in sectors or blocks read and
written as a unit.
 Journal storage ➔ keeps track of the changes that will be made, in a journal
(usually a circular log in a dedicated area of the file system), before
committing them to the main file system. In the event of a system crash or power
failure, such file systems are quicker to bring back online and less likely to
become corrupted; a minimal sketch follows.
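
The write-ahead idea behind journal storage can be illustrated with a short toy model. The sketch below assumes a single writer and a Python dict standing in for the main file system; the Journal class and the one-record-per-line log layout are hypothetical, not an actual file-system implementation:

```python
# Minimal sketch of journal (write-ahead log) storage; illustrative only.
import json, os

class Journal:
    def __init__(self, log_path, store):
        self.log_path = log_path   # dedicated journal file (the "circular log")
        self.store = store         # stand-in for the main file system: cell -> value

    def write(self, cell, value):
        # 1. Append the intended change to the journal and force it to
        #    persistent storage *before* touching the main store.
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"cell": cell, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then commit the change to the main store.
        self.store[cell] = value

    def recover(self):
        # After a crash, replay the journal; re-applying already committed
        # entries is safe because each record is idempotent.
        with open(self.log_path) as log:
            for line in log:
                rec = json.loads(line)
                self.store[rec["cell"]] = rec["value"]

store = {}
j = Journal("journal.log", store)
j.write("M", "A")   # logged first, then committed: store == {"M": "A"}
```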



[Figure: writing item A to memory cell M and reading item A from memory cell M; a
timeline showing Previous, Current, and Next Read/Write operations.]

◼ Read/write coherence ➔ the result of a Read of memory cell M should be the same as
the result of the most recent Write to that cell.
◼ Before-or-after atomicity ➔ the result of every Read or Write is the same as if that
Read or Write occurred either completely before or completely after any other Read
or Write.
◼ Read/write coherence and before-or-after atomicity are two highly desirable
properties of any storage model, and in particular of cell storage.
Data Base Management System (DBMS)
◼ Database ➔ a collection of logically-related records.
◼ Data Base Management System (DBMS) ➔ the software that controls the access
to the database.
◼ Query language ➔ a dedicated programming language used to develop database
applications.
◼ Most cloud applications do not interact directly with file systems, but through a
DBMS.
◼ Database models ➔ reflect the limitations of the hardware available at the time
and the requirements of the most popular applications of each period:
 the navigational model of the 1960s.
 the relational model of the 1970s - e.g., MySQL, Oracle, and Microsoft SQL Server.
 the object-oriented model of the 1980s.
 the NoSQL model of the first decade of the 2000s - e.g., MongoDB and Cassandra.
Requirements of cloud applications
◼ Most cloud applications are data-intensive and test the limitations of the existing infrastructure.
Requirements:
 Rapid application development and a short time to market.
 Low latency.
 Scalability.
 High availability.
 Consistent view of the data.
◼ These requirements cannot be satisfied simultaneously by existing database models; e.g.,
relational databases are easy to use for application development but do not scale well.
◼ The NoSQL model is useful when the structure of the data does not require a relational
model and the amount of data is very large. A NoSQL store:
 Does not support SQL as a query language.
 May not guarantee the ACID (Atomicity, Consistency, Isolation, Durability) properties of
traditional databases; it usually guarantees only eventual consistency, for transactions
limited to a single data item (illustrated in the sketch below).
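
Eventual consistency can be made concrete with a toy model. The sketch below assumes last-write-wins reconciliation between two replicas of a single data item; the Replica class and its merge step are illustrative, not the mechanism of any particular NoSQL store:

```python
# Toy sketch of eventual consistency on a single data item,
# assuming last-write-wins reconciliation between replicas.
import time

class Replica:
    def __init__(self):
        self.value, self.timestamp = None, 0.0

    def write(self, value):
        # A write is accepted locally without coordinating with other replicas.
        self.value, self.timestamp = value, time.time()

    def merge(self, other):
        # Anti-entropy step: adopt the other replica's value if it is newer.
        if other.timestamp > self.timestamp:
            self.value, self.timestamp = other.value, other.timestamp

r1, r2 = Replica(), Replica()
r1.write("v1")            # accepted on r1; r2 is still stale
assert r2.value is None   # a read on r2 may return old data for a while
r2.merge(r1)              # after reconciliation the replicas converge
assert r2.value == "v1"
```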
Logical and physical organization of a file
◼ File ➔ a linear array of cells stored on a persistent storage device. Viewed
by an application as a collection of logical records; the file is stored on a
physical device as a set of physical records, or blocks, of size dictated by
the physical media.
◼ File pointer ➔ identifies a cell used as a starting point for a read or
write operation.
◼ The logical organization of a file ➔ reflects the data model, the view of the
data from the perspective of the application.
◼ The physical organization of a file ➔ reflects the storage model and
describes the manner in which the file is stored on a given storage medium;
the sketch below shows the mapping between the two views.
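
With fixed-size logical records packed into fixed-size physical blocks, the mapping from the logical view to the physical view is simple arithmetic. A minimal sketch, with hypothetical sizes (4 KB blocks, 100-byte records):

```python
# Mapping logical records to physical blocks; sizes are illustrative.
BLOCK_SIZE = 4096      # size of a physical record (block), in bytes
RECORD_SIZE = 100      # size of a logical record, in bytes
RECORDS_PER_BLOCK = BLOCK_SIZE // RECORD_SIZE   # 40 records fit per block

def locate(record_index):
    """Map a logical record index to (physical block number, offset in block)."""
    block = record_index // RECORDS_PER_BLOCK
    offset = (record_index % RECORDS_PER_BLOCK) * RECORD_SIZE
    return block, offset

print(locate(123))   # record 123 lives in block 3, at byte offset 300
```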



File systems
◼ File system ➔ collection of directories; each directory provides information about a set of
files.
 Traditional – Unix File System.
 Distributed file systems.
 Network File Systems (NFS) - very popular and used for a long time, but they do
not scale well and have reliability problems; an NFS server can be a single point of
failure.
 Storage Area Networks (SAN) - allow cloud servers to deal with non-disruptive changes
in the storage configuration. The storage in a SAN can be pooled and then allocated
based on the needs of the servers. A SAN-based implementation of a file system can
be expensive, as each node must have a Fibre Channel adapter to connect to the
network.
 Parallel File Systems (PFS) - scalable, capable of distributing files across a large
number of nodes, with a global naming space. Several I/O nodes serve data to all
computational nodes; a PFS also includes a metadata server that contains information
about the data stored in the I/O nodes. The interconnection network of a PFS could be
a SAN.
Unix File System (UFS)
◼ The layered design provides flexibility.
 The layered design allows UFS to separate the concerns for the physical file
structure from the logical one.
 The vnode layer allowed UFS to treat uniformly local and remote file access.
◼ The hierarchical design supports scalability, reflected in the file-naming
convention. It allows grouping of files into directories, supports multiple levels of
directories, and collections of directories and files, the so-called file systems.
◼ The metadata supports a systematic design philosophy of the file system and
device-independence.
 Metadata includes file owner, access rights, creation time, time of the last
modification, file size, the structure of the file and the persistent storage device
cells where data is stored.
 The inodes contain information about individual files and directories. The
inodes are kept on persistent media together with the data; a simplified sketch of
the inode contents follows.
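
To make the list of metadata concrete, here is a simplified sketch of the fields a UFS-style inode carries. The field names and types are illustrative; this is not the actual inode structure of any Unix kernel:

```python
# Illustrative model of UFS inode metadata; not a real kernel structure.
from dataclasses import dataclass, field

@dataclass
class Inode:
    owner: str              # file owner
    mode: int               # access rights, e.g. 0o644
    created: float          # creation time
    modified: float         # time of the last modification
    size: int               # file size, in bytes
    blocks: list[int] = field(default_factory=list)  # persistent-storage cells holding the data

root_inode = Inode(owner="root", mode=0o755, created=0.0, modified=0.0, size=4096)
```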



UFS layering

[Figure: the UFS layers, separating the logical file structure from the physical one,
with the vnode layer treating local and remote file access uniformly.]


Network File System (NFS)
◼ Design objectives:
 Provide the same semantics as a local Unix File System (UFS) to ensure
compatibility with existing applications.
 Facilitate easy integration into existing UFS.
 Ensure that the system will be widely used; thus, support clients running on different
operating systems.
 Accept a modest performance degradation due to remote access over a network with
a bandwidth of several Mbps.
◼ NFS is based on the client-server paradigm. The client runs on the local host while the
server is at the site of the remote file system; they interact by means of Remote
Procedure Calls (RPC).
◼ A remote file is uniquely identified by a file handle (fh) rather than a file descriptor.
The file handle is a 32-byte internal name - a combination of the file system
identification, an inode number, and a generation number; one possible layout is
sketched below.
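
A short sketch of how such a 32-byte handle might be composed. The fixed layout below (three 8-byte integers plus padding) is an assumption for illustration; real NFS servers treat the handle as opaque, and layouts vary:

```python
# Hypothetical 32-byte NFS file handle layout; real handles are opaque.
import struct

def make_fh(fsid: int, inode: int, generation: int) -> bytes:
    # Three 8-byte unsigned integers plus 8 bytes of padding = 32 bytes.
    return struct.pack("<QQQ8x", fsid, inode, generation)

fh = make_fh(fsid=1, inode=4711, generation=2)
assert len(fh) == 32   # the handle is a 32-byte internal name
```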

The NFS client-server interaction. The vnode layer implements file operations in a
uniform manner, regardless of whether the file is local or remote.
An operation targeting a local file is directed to the local file system, while one for a
remote file involves NFS; an NFS client packages the relevant information about the
target and sends it to the NFS server, which passes it to the vnode layer on the remote
host; this layer, in turn, directs it to the remote file system.
Comparison of distributed file systems



◼ The API of the Unix File System and the corresponding RPC issued by an NFS client
to the NFS server (the actions of the server in response to an RPC are too complex to
be fully described here; a sketch of the client-side mapping follows this list):
 fd ➔ file descriptor.
 fh ➔ file handle.
 fname ➔ file name.
 dname ➔ directory name.
 dfh ➔ the directory where the file handle can be found.
 count ➔ the number of bytes to be transferred.
 buf ➔ the buffer to transfer the data to/from.
 device ➔ the device where the file system is located.
 fsname ➔ file system name.

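A hypothetical sketch of how a client-side vnode layer might translate Unix file-system calls into NFS RPCs. The procedure names (LOOKUP, READ, WRITE) follow the NFS protocol, but the function signatures and the rpc_call transport stub are illustrative assumptions:

```python
# Illustrative client-side mapping of Unix file operations to NFS RPCs.
# NFS is stateless, so READ/WRITE carry an explicit offset rather than
# relying on a server-side file pointer.

def lookup(dfh: bytes, fname: str) -> bytes:
    """open(fname) on the client becomes LOOKUP(dfh, fname) -> fh."""
    return rpc_call("LOOKUP", dfh=dfh, fname=fname)

def read(fh: bytes, offset: int, count: int) -> bytes:
    """read(fd, buf, count) becomes READ(fh, offset, count) -> data."""
    return rpc_call("READ", fh=fh, offset=offset, count=count)

def write(fh: bytes, offset: int, buf: bytes) -> int:
    """write(fd, buf, count) becomes WRITE(fh, offset, count, data)."""
    return rpc_call("WRITE", fh=fh, offset=offset, count=len(buf), data=buf)

def rpc_call(proc, **args):
    # Placeholder transport: a real client would marshal the arguments
    # and send them to the NFS server over the network.
    raise NotImplementedError(f"send {proc}{tuple(args)} to the server")
```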




General Parallel File System (GPFS)
◼ Parallel I/O implies concurrent execution of multiple input/output operations. Support for
parallel I/O is essential for the performance of many applications.
◼ Concurrency control is a critical issue for parallel file systems. Several semantics for
handling the shared access are possible. For example, when the clients share the file
pointer successive reads issued by multiple clients advance the file pointer; another
semantics is to allow each client to have its own file pointer.
◼ GPFS.
 Developed at IBM in the early 2000s as a successor of the TigerShark multimedia file
system.
 Designed for optimal performance of large clusters; it can support a file system of up to 4
PB consisting of up to 4,096 disks of 1 TB each.
 The maximum file size is (2^63 - 1) bytes.
 A file consists of blocks of equal size, ranging from 16 KB to 1 MB, striped across
several disks.



[Figure: a GPFS configuration. The disks are interconnected by a SAN, and the compute
servers are distributed in four LANs, LAN1-LAN4. The I/O nodes/servers are connected
to LAN1.]
GPFS reliability
◼ To recover from system failures, GPFS records all metadata updates in
a write-ahead log file.
◼ Write-ahead ➔ updates are written to persistent storage only after the
log records have been written.
◼ The log files are maintained by each I/O node for each file system it
mounts; any I/O node can initiate recovery on behalf of a failed node.
◼ Data striping allows concurrent access and improves performance but
can have unpleasant side-effects. When a single disk fails, a large
number of files are affected.
◼ The system uses RAID devices with the stripes equal to the block size
and dual-attached RAID controllers.
◼ To further improve the fault tolerance of the system, GPFS data files as
well as metadata are replicated on two different physical disks.
10/16/2023 Cloud Computing/ Unit-5 22
GPFS distributed locking
◼ In GPFS, consistency and synchronization are ensured by a distributed locking
mechanism. A central lock manager grants lock tokens to local lock managers running in
each I/O node. Lock tokens are also used by the cache management system.
◼ Lock granularity has important implications on the performance.
GPFS uses a variety of techniques for different types of data.
 Byte-range tokens ➔ used for read and write operations to data files as follows: the
first node attempting to write to a file acquires a token covering the entire file; this
node is allowed to carry out all reads and writes to the file without any need for
permission until a second node attempts to write to the same file; then, the range of
the token given to the first node is restricted (see the sketch after this list).
 Data-shipping ➔an alternative to byte-range locking, allows fine-grain data sharing. In
this mode the file blocks are controlled by the I/O nodes in a round-robin manner. A
node forwards a read or write operation to the node controlling the target block, the
only one allowed to access the file.
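
A toy sketch of the byte-range token idea, assuming whole-file tokens that shrink at the offset where a second writer appears; the TokenManager class is illustrative and ignores reads, token revocation, and multi-node generality:

```python
# Toy model of GPFS-style byte-range tokens for two writers;
# not the actual GPFS token protocol.
INF = float("inf")

class TokenManager:
    def __init__(self):
        self.tokens = {}   # node -> (start, end): byte range it may write

    def acquire(self, node, offset):
        if node in self.tokens:             # node already holds a token
            return self.tokens[node]
        if not self.tokens:
            # The first writer gets a token covering the entire file.
            self.tokens[node] = (0, INF)
        else:
            # A second writer forces the first node's range to be restricted.
            (holder, (start, end)), = self.tokens.items()
            self.tokens[holder] = (start, offset)   # restricted range
            self.tokens[node] = (offset, end)       # the rest of the file
        return self.tokens[node]

mgr = TokenManager()
print(mgr.acquire("node1", 0))      # (0, inf): token covering the whole file
print(mgr.acquire("node2", 4096))   # node1 is now restricted to (0, 4096)
```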



Google File System (GFS)
◼ GFS ➔ developed in the late 1990s; uses thousands of storage systems built from
inexpensive commodity components to provide petabytes of storage to a large user
community with diverse needs.
◼ Design considerations.
 Scalability and reliability are critical features of the system; they must be
considered from the beginning, rather than at some stage of the design.
 The vast majority of files range in size from a few GB to hundreds of TB.
 The most common operation is to append to an existing file; random write operations to
a file are extremely infrequent.
 Sequential read operations are the norm.
 The users process the data in bulk and are less concerned with the response time.
 The consistency model should be relaxed to simplify the system implementation but
without placing an additional burden on the application developers.



GFS – design decisions
◼ Segment a file in large chunks.
◼ Implement an atomic file append operation allowing multiple applications operating
concurrently to append to the same file.
◼ Build the cluster around a high-bandwidth rather than low-latency interconnection
network. Separate the flow of control from the data flow. Pipeline data transfer over TCP
connections. Exploit network topology by sending data to the closest node in the
network.
◼ Eliminate caching at the client site. Caching increases the overhead of maintaining
consistency among cached copies.
◼ Ensure consistency by channeling critical file operations through a master, a component
of the cluster which controls the entire system.
◼ Minimize the involvement of the master in file access operations to avoid hot-spot
contention and to ensure scalability.
◼ Support efficient checkpointing and fast recovery mechanisms.
◼ Support an efficient garbage collection mechanism.
GFS chunks
◼ GFS files are collections of fixed-size segments called chunks.
◼ The chunk size is 64 MB; this choice is motivated by the desire to optimize
the performance for large files and to reduce the amount of metadata
maintained by the system.
◼ A large chunk size increases the likelihood that multiple operations will be
directed to the same chunk thus, it reduces the number of requests to locate
the chunk, and, at the same time, it allows the application to maintain a
persistent network connection with the server where the chunk is located.
◼ A chunk consists of 64 KB blocks and each block has a 32-bit checksum.
◼ Chunks are stored on Linux file systems and are replicated on multiple
sites; a user may change the number of replicas from the standard value
of three to any desired value.
◼ At the time of file creation each chunk is assigned a unique chunk handle. The
chunk arithmetic is sketched below.
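The sizes above translate into simple bookkeeping. The sketch below assumes CRC32 as the 32-bit block checksum (the source does not say which algorithm GFS uses), and the function names are illustrative:

```python
# Back-of-the-envelope sketch of GFS chunk bookkeeping; illustrative only.
import zlib

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB per chunk
BLOCK_SIZE = 64 * 1024          # 64 KB blocks inside a chunk

def chunk_index(file_offset: int) -> int:
    """Which chunk of the file a given byte offset falls into."""
    return file_offset // CHUNK_SIZE

def block_checksums(chunk_data: bytes) -> list[int]:
    """One 32-bit checksum per 64 KB block of the chunk (CRC32 assumed)."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

# A 64 MB chunk holds 1,024 blocks of 64 KB, hence 1,024 checksums.
assert CHUNK_SIZE // BLOCK_SIZE == 1024
```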
[Figure: the architecture of a GFS cluster. The application sends a file name & chunk
index to the master and receives a chunk handle & chunk location from the master's
meta-information; it then exchanges the chunk handle & data count and the chunk data
directly with the chunk servers over the communication network. The master exchanges
instructions and state information with the chunk servers, each of which runs on a
Linux file system.]

◼ The architecture of a GFS cluster: the master maintains state information
about all system components and controls a number of chunk servers. A
chunk server runs under Linux; the application uses metadata provided by the
master to communicate directly with the chunk servers. The data and control
paths are shown separately in the figure: data paths with thick lines and
control paths with thin lines. Arrows show the flow of control between the
application, the master, and the chunk servers.
Thank you

