CC Unit 4
Storage Systems
• Today it is possible to store all our data on the Internet. These off-site storage services are provided and maintained by third parties over the Internet. Cloud storage offers a large pool of storage, with very large quantities available for immediate use.
• Cloud storage evolved from traditional network storage and hosted storage. A key benefit of cloud storage is that data can be accessed from anywhere.
• Cloud storage providers offer storage ranging from small amounts of data up to the entire data warehouse of an organization.
• Subscribers pay the cloud storage provider for the storage they use and for the amount of data they transfer to the cloud.
• Basically, the cloud storage subscriber copies data to one of the data servers of the cloud storage provider. That copy is then made available to all the other data servers of the provider.
Two abstract models of storage are commonly used:
I. Cell storage
II. Journal storage
Cell storage
• Cell storage assumes that the storage consists of cells of the same size and that
each object fits exactly in one cell.
• This model reflects the physical organization of several storage media; for example, the primary memory of a computer is organized as an array of memory cells.
Journal storage
• Journal storage is an elaborate organization for storing composite objects
such as records consisting of multiple fields.
• Journal storage consists of a manager and cell storage, where the entire
history of a variable is maintained, rather than just the current value.
• The user does not have direct access to the cell storage; instead the user can
request the journal manager to
(i) start a new action;
(ii) read the value of a cell;
(iii) write the value of a cell;
(iv) commit an action;
(v) abort an action.
The journal manager translates user requests into commands sent to the cell storage.
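A minimal sketch of this layering is shown below, using illustrative class and method names rather than any particular system's API; it shows how the journal manager buffers tentative writes and touches the cell storage only at commit time.

```python
# Minimal sketch of a journal manager layered over cell storage.
# Class and method names are illustrative; a real system adds durable
# logging, crash recovery, and concurrency control on top of this idea.

class CellStorage:
    """Fixed-size cells addressed by index; holds only current values."""
    def __init__(self, n_cells):
        self.cells = [None] * n_cells

class JournalManager:
    def __init__(self, cell_storage):
        self.store = cell_storage
        self.history = []      # history of committed writes (the journal)
        self.pending = {}      # action_id -> {cell index: tentative value}
        self.next_action = 0

    def begin_action(self):
        aid = self.next_action
        self.next_action += 1
        self.pending[aid] = {}
        return aid

    def read(self, aid, index):
        # A read sees the action's own tentative writes first.
        return self.pending[aid].get(index, self.store.cells[index])

    def write(self, aid, index, value):
        # Writes are buffered; the cell storage is untouched until commit.
        self.pending[aid][index] = value

    def commit(self, aid):
        for index, value in self.pending.pop(aid).items():
            self.history.append((aid, index, value))
            self.store.cells[index] = value

    def abort(self, aid):
        # Discard the tentative writes; the cell storage is unchanged.
        self.pending.pop(aid, None)

# Example: a funds transfer is a single action; a failure before commit()
# leaves both account balances in the cell storage unchanged.
store = CellStorage(2)
store.cells = [100, 50]                  # two account balances
jm = JournalManager(store)
t = jm.begin_action()
jm.write(t, 0, jm.read(t, 0) - 30)       # withdraw from account 0
jm.write(t, 1, jm.read(t, 1) + 30)       # credit account 1
jm.commit(t)
print(store.cells)                       # [70, 80]
```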
• Many cloud applications must support online transaction processing and have
to guarantee the correctness of the transactions.
• Transactions consist of multiple actions; for example, the transfer of funds from one account to another requires withdrawing funds from one account and crediting them to the other. The system may fail during or after any one of these actions, so steps must be taken to ensure correctness. Correctness of a transaction means that the result is guaranteed to be the same as if the actions had been applied one after another, regardless of failures or interleaving with other transactions; sometimes even more stringent conditions are imposed.
File Systems
A file system consists of a collection of directories. Each directory provides information about a set of files.
• Today high-performance systems can choose among three classes of file
system: network file systems (NFSs), storage area networks (SANs),
and parallel file systems (PFSs).
Network file systems (NFSs)
• The NFS is very popular and has been used for some time, but it does not
scale well and has reliability problems; an NFS server could be a single
point of failure.
Storage area networks (SANs):
• Advances in networking technology allow the separation of storage systems from computational servers; the two can be connected by a SAN.
• SANs offer additional flexibility and allow cloud servers to deal with nondisruptive changes in the storage configuration. Moreover, the storage in a SAN can be pooled and then allocated based on the needs of the servers; pooling requires additional software and hardware support and represents another advantage of a centralized storage system.
• A SAN-based implementation of a file system can be expensive, since each
node must have a Fibre Channel adapter to connect to the network.
Parallel file systems (PFSs)
• Parallel file systems are scalable, are capable of distributing files across a large number of nodes, and provide a global namespace.
• In a parallel data system, several I/O nodes serve data to all computational
nodes and include a metadata server that contains information about the
data stored in the I/O nodes. The interconnection network of a parallel file
system could be a SAN.
Databases:
• A database is a collection of logically related records. The software that
controls the access to the database is called a database management system
(DBMS).
• The main functions of a DBMS are to enforce data integrity, manage data
access and concurrency control, and support recovery after a failure.
• A DBMS supports a query language, a dedicated programming
language used to develop database applications. Several database models,
including the navigational model of the 1960s, the relational model of
the 1970s, the object-oriented model of the 1980s, and
the NoSQL model of the first decade of the 2000s, reflect the limitations
of the hardware available at the time and the requirements of the most
popular applications of each period.
• Most cloud applications are data intensive and test the limitations of the existing infrastructure. At the same time, cloud applications require low
latency, scalability, and high availability and demand a consistent view of
the data.
• These requirements cannot be satisfied simultaneously by existing
database models; for example, relational databases are easy to use for
application development but do not scale well.
• As its name implies, the NoSQL model does not support SQL as a query
language and may not guarantee the atomicity, consistency, isolation,
durability (ACID) properties of traditional databases.
• NoSQL systems usually guarantee only eventual consistency for transactions limited to a single data item. The NoSQL model is useful when the
structure of the data does not require a relational model and the amount of
data is very large.
• Several types of NoSQL databases have emerged in the last few years. Based on the way they store data, we recognize several types, such as key-value stores, BigTable implementations, document store databases, and graph databases (a small illustration appears after this list).
• Replication, used to ensure fault tolerance of large-scale systems built with
commodity components, requires mechanisms to guarantee that all
replicas are consistent with one another. This is another example of
increased complexity of modern computing and communication systems
due to physical characteristics of components.
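As a rough illustration of the NoSQL data models mentioned above, the sketch below stores the same hypothetical user profile first in a dictionary-backed key-value store, where the value is an opaque blob, and then in a document store, where the fields remain visible and queryable. The record layout and field names are invented for the example.

```python
import json

# Hypothetical user profile used to contrast two NoSQL data models.
profile = {"user_id": "u42", "name": "Alice", "followers": 1250}

# Key-value store: the value is an opaque blob; the application can only
# get/put it by key and must deserialize it itself.
kv_store = {}
kv_store["user:u42"] = json.dumps(profile)

# Document store: the document's fields remain visible to the database,
# so secondary indexes and field-level queries are possible.
doc_store = {"users": [profile]}
matches = [d for d in doc_store["users"] if d["followers"] > 1000]
print(matches)
```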
Distributed file systems:
There are various advantages of a distributed file system. For example:
• It uses the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP) for accessing and delivering data and files.
Andrew File System (AFS)
• Andrew File System (AFS) is a global file system that allows access to files
from Mac, Windows or Linux computers.
• It is similar to cloud-based storage; however, the data is stored locally at Carnegie Mellon University. AFS also allows file sharing with other members of the Carnegie Mellon University community. Network drive storage on Mac lab computers uses AFS space.
• An AFS presents a homogeneous, location-independent file namespace to all
client workstations via a group of trustworthy servers.
• After logging in to their workstations, users exchange data and programs. The goal is to facilitate large-scale information sharing while reducing client-server communication.
• This is accomplished by moving whole files between server and client machines and caching them until the server has a more recent version.
• AFS uses a local cache to improve speed and reduce load in distributed networks. A server, for example, replies to a workstation request by storing data in the workstation's local cache.
Andrew File System Architecture:
Vice: The Andrew File System provides a homogeneous, location-transparent
file namespace to all client workstations by utilizing a group of trustworthy
servers known as Vice.
Venus: The mechanism, known as Venus, caches files from Vice and returns
updated versions of those files to the servers from which they originated. Only
when a file is opened or closed does Venus communicate with Vice.
Venus performs as much of the work as possible rather than Vice; Vice retains only the functionality that is necessary for the file system's integrity, availability, and security. The servers are organized as a loose confederacy with little connectivity between them.
The following are the server and client components used in AFS networks:
• Any computer that creates requests for AFS server files hosted on a network
qualifies as a client.
• The file is saved in the client machine’s local cache and shown to the user once
a server responds and transmits a requested file.
• When a cached file is modified and closed, the client sends the changes back to the server; the server uses a callback mechanism to notify other clients holding cached copies of that file (a simplified sketch of this mechanism follows). The client machine's local cache stores frequently used files for rapid access.
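A simplified sketch of this whole-file caching and callback idea is given below; the class and method names are illustrative, not the actual AFS (Vice/Venus) interfaces, and real AFS carries far more state, such as volumes, access lists, and callback promises with expiration.

```python
# Sketch of AFS-style whole-file caching with callback invalidation.
# Names are illustrative, not the actual AFS interfaces.

class Server:
    def __init__(self):
        self.files = {}         # path -> file contents
        self.callbacks = {}     # path -> set of clients caching the file

    def fetch(self, client, path):
        # The whole file is shipped to the client, which caches it.
        self.callbacks.setdefault(path, set()).add(client)
        return self.files.get(path, "")

    def store(self, client, path, contents):
        # On close, the client writes the whole file back; the server
        # breaks callbacks so other clients drop their stale copies.
        self.files[path] = contents
        for other in self.callbacks.get(path, set()) - {client}:
            other.invalidate(path)

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def open(self, path):
        if path not in self.cache:          # a cache hit needs no traffic
            self.cache[path] = self.server.fetch(self, path)
        return self.cache[path]

    def close(self, path, contents):
        self.cache[path] = contents
        self.server.store(self, path, contents)

    def invalidate(self, path):
        self.cache.pop(path, None)

server = Server()
a, b = Client(server), Client(server)
a.close("/afs/notes.txt", "v1")             # a creates the file
print(b.open("/afs/notes.txt"))             # b fetches and caches "v1"
a.close("/afs/notes.txt", "v2")             # callback invalidates b's copy
print(b.open("/afs/notes.txt"))             # b re-fetches "v2"
```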
Advantages:
1. Shared files that are updated infrequently and local user files remain valid in the cache for a long time.
2. It provides a large amount of storage space for caching.
3. It offers a cache large enough to hold the entire working set of a user's files.
Google File System (GFS): Chunks and Chunk Servers
Chunks:
1. GFS files are collections of fixed-size segments called chunks; at the time of
file creation each chunk is assigned a unique chunk handle. A chunk consists
of 64 KB blocks and each block has a 32 bit checksum.
2. Chunks are stored on Linux file systems and are replicated on multiple sites. The chunk size is 64 MB (see the arithmetic sketch after this list).
3. Some of the metadata is stored in persistent storage.
4. The locations of the chunks are stored only in the control structures of the master's memory and are updated at system startup or when a new chunk server joins the cluster. This strategy allows the master to have up-to-date information about the location of the chunks.
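The sizes quoted above imply, for example, that a full 64 MB chunk is covered by 1,024 block checksums, i.e., about 4 KB of checksum data per chunk. A back-of-the-envelope check:

```python
# Back-of-the-envelope arithmetic for GFS chunk metadata (sizes from the text).
CHUNK_SIZE = 64 * 1024 * 1024      # 64 MB per chunk
BLOCK_SIZE = 64 * 1024             # 64 KB per block
CHECKSUM_BITS = 32                 # one 32-bit checksum per block

blocks_per_chunk = CHUNK_SIZE // BLOCK_SIZE                        # 1024 blocks
checksum_bytes_per_chunk = blocks_per_chunk * CHECKSUM_BITS // 8   # 4 KB

# A 1 TB file would be split into 16,384 chunks of 64 MB each.
chunks_for_1tb = (1024 ** 4) // CHUNK_SIZE
print(blocks_per_chunk, checksum_bytes_per_chunk, chunks_for_1tb)
```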
The GFS consistency model is effective and scalable.
• Operations such as file creation are atomic and are handled by the master.
• To ensure scalability, the master has minimal involvement in file mutations, i.e., operations such as writes or appends, which occur frequently.
• In such cases the master grants a lease for a particular chunk to one of the
chunk servers called the primary; then, the primary creates a serial order for
the updates of that chunk.
The flow of data and control for a write operation is as follows (a short sketch appears after the steps):
1. The client contacts the master, which assigns a lease for the particular chunk to one of the chunk servers if no lease for that chunk exists; the master then replies with the IDs of the primary and the secondary chunk servers holding replicas of the chunk. The client caches this information.
2. The client sends the data to all chunk servers holding replicas of the chunk;
each one of the chunk servers stores the data in an internal LRU buffer and
then sends an acknowledgment to the client.
3. The client sends the write request to the primary chunk server once it has
received the acknowledgments from all chunk servers holding replicas of the
chunk.
4. The primary chunk server identifies mutations by consecutive sequence
numbers.
5. The primary chunk server sends the write requests to all secondaries.
6. Each secondary chunk server applies the mutations in the order of the
sequence number and then sends an acknowledgment to the primary chunk
server.
7. Finally, after receiving the acknowledgments from all secondaries, the
primary informs the client.
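The seven steps can be summarized in the following control-flow sketch; the classes and methods are hypothetical stand-ins, and leases, retries, and failure handling are omitted.

```python
# Illustrative simulation of the GFS write path described above.
# All names are hypothetical; leases, retries, and failures are omitted.

class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.buffer = None     # data pushed by the client (step 2)
        self.chunk = []        # applied mutations, in sequence order
        self.next_seq = 1      # used only when acting as primary

    def buffer_data(self, data):
        self.buffer = data     # stands in for the internal LRU buffer
        return "ack"

    def assign_sequence_number(self):
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        return seq

    def apply_mutation(self, seq, data):
        self.chunk.append((seq, data))   # mutations applied in seq order
        return "ack"

def gfs_write(primary, secondaries, data):
    # Steps 1-2: after learning the replica locations from the master, the
    # client pushes the data to every replica and collects acknowledgments.
    acks = [s.buffer_data(data) for s in [primary] + secondaries]
    assert all(a == "ack" for a in acks)

    # Steps 3-4: the write request goes to the primary, which assigns the
    # mutation a serial sequence number.
    seq = primary.assign_sequence_number()

    # Steps 5-6: the primary applies the mutation and forwards the request;
    # each secondary applies it in sequence-number order and acknowledges.
    primary.apply_mutation(seq, data)
    acks = [s.apply_mutation(seq, data) for s in secondaries]

    # Step 7: after all secondaries acknowledge, the primary informs the client.
    return all(a == "ack" for a in acks)

primary, secondaries = ChunkServer("cs0"), [ChunkServer("cs1"), ChunkServer("cs2")]
print(gfs_write(primary, secondaries, b"record-1"))
```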
• The system supports an efficient checkpointing procedure based on copy-on-write to construct system snapshots. A lazy garbage-collection strategy is used to reclaim space after a file deletion. The master periodically scans the namespace and removes the metadata for files with a hidden name older than a few days. This mechanism gives a user who deleted files by mistake a window of opportunity to recover the files with little effort.
• Periodically, chunk servers exchange with the master the list of chunks stored on
each one of them; the master supplies them with the identity of orphaned chunks,
whose metadata has been deleted and such chunks are then deleted. Even when
control messages are lost, a chunk server will carry out the house cleaning at the
next heartbeat exchange with the master. Each chunk server maintains in core the
checksums for the locally stored chunks to guarantee data integrity.
The system was designed after a careful analysis of the file characteristics and the access models. The conclusions of this analysis, reflected in the GFS design, are:
• Scalability and reliability are critical features of the system;
• The majority of files range in size from a few GB to hundreds of TB.
• The most common operation is to append to an existing file; random write
operations to a file are extremely infrequent.
• Sequential read operations are the norm.
• Users process the data in bulk and are less concerned with the response time.
• To simplify the system implementation the consistency model should be
relaxed without placing an additional burden on the application developers.
As a result of this analysis several design decisions were made:
1. Segment a file in large chunks.
2. Implement an atomic file append operation.
3. Allow multiple applications operating concurrently to append to the same file.
4. Build the cluster around a high-bandwidth rather than low-latency
interconnection network.
5. Separate the flow of control from the data flow;
6. Schedule the high-bandwidth data flow by pipelining the data transfer over
TCP connections to reduce the response time.
7. Exploit network topology by sending data to the closest node in the network.
8. Eliminate caching at the client site;
9. Ensure consistency by channeling critical file operations through a master
controlling the entire system.
10. Minimize the master's involvement in file access operations to avoid hot-spot contention and to ensure scalability.
11. Support efficient check pointing and fast recovery mechanisms.
12. Support efficient garbage collection mechanisms.
Apache Hadoop
• Apache Hadoop is an open source framework that is used to efficiently store
and process large datasets ranging in size from gigabytes to petabytes of data.
• Instead of using one large computer to store and process the data, Hadoop
allows clustering multiple computers to analyze massive datasets in parallel
more quickly.
• It provides a software framework for distributed storage and processing of big
data using the MapReduce programming model.
• Hadoop was originally designed for computer clusters built from commodity
hardware, which is still the common use.
• All the modules in Hadoop are designed with a fundamental assumption that
hardware failures are common occurrences and should be automatically
handled by the framework.
• The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model (a small illustration of this model appears after this list).
• Hadoop splits files into large blocks and distributes them across nodes in a
cluster. It then transfers packaged code into nodes to process the data in
parallel. This approach takes advantage of data locality, where nodes
manipulate the data they have access to.
• This allows the dataset to be processed faster and more efficiently than it
would be in a more conventional supercomputer architecture that relies on a
parallel file system where computation and data are distributed via high-speed
networking.
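As a concrete illustration of the MapReduce programming model that Hadoop implements, the classic word-count example can be simulated locally with plain map and reduce functions. This sketch does not use Hadoop's actual APIs; a real Hadoop job would express the same two functions in a distributed setting.

```python
# Minimal, self-contained word count in the MapReduce style.
# This simulates the model locally; Hadoop runs the same map and reduce
# logic in parallel over HDFS blocks and handles the shuffle itself.
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word; Hadoop would run this on each block.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Sum all counts for one key; Hadoop runs this after the shuffle.
    return word, sum(counts)

documents = ["the cloud stores data", "the data lives in the cloud"]

# Shuffle: group the intermediate pairs by key.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result)   # {'the': 3, 'cloud': 2, 'stores': 1, 'data': 2, 'lives': 1, 'in': 1}
```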
The main components of a Hadoop cluster are:
1. Name Node
2. Secondary Name Node
3. Job tracker
4. Data Node
5. Task Tracker
Bigtable
• Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, enabling us to store terabytes or even petabytes of data.
• Bigtable development began in 2004. It is now used by a number of Google
applications, such as Google Analytics, web indexing, MapReduce, which is often
used for generating and modifying data stored in Bigtable, Google Maps, Google
Books search, "My Search History", Google Earth, Blogger.com, Google Code
hosting, YouTube, and Gmail.
• Google's reasons for developing its own database include scalability and better control of performance characteristics.
• All client requests go through a frontend server before they are sent to a
Bigtable node.
• The nodes are organized into a Bigtable cluster, which belongs to a Bigtable
instance, a container for the cluster.
• Each node in the cluster handles a subset of the requests to the cluster.
• By adding nodes to a cluster, you can increase the number of simultaneous
requests that the cluster can handle. Adding nodes also increases the
maximum throughput for the cluster.
• A Bigtable table is sharded into blocks of contiguous rows, called tablets, to
help balance the workload of queries.
• Tablets are stored on Colossus, Google's file system, in SSTable format.
• An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings (a minimal sketch appears after this list).
• Each tablet is associated with a specific Bigtable node. In addition to the
SSTable files, all writes are stored in Colossus's shared log as soon as they are
acknowledged by Bigtable, providing increased durability.
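A minimal sketch of the SSTable idea mentioned above, assuming an in-memory structure rather than Bigtable's actual on-disk format: the map is built once, stays immutable, and keeps keys sorted so that point lookups and contiguous row scans are cheap.

```python
# Illustrative sketch of an SSTable-like structure: a sorted, immutable
# map from byte-string keys to byte-string values. Not Bigtable's actual
# on-disk format; real SSTables add blocks, an index, and compression.
import bisect

class SSTable:
    def __init__(self, entries):
        # Sort once at build time; the table is never modified afterwards.
        items = sorted(entries.items())
        self._keys = [k for k, _ in items]
        self._values = [v for _, v in items]

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start, end):
        # Ordered keys make contiguous row ranges (tablets) cheap to scan.
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i], self._values[i]
            i += 1

table = SSTable({b"row#002": b"v2", b"row#001": b"v1", b"row#010": b"v10"})
print(table.get(b"row#001"))
print(list(table.scan(b"row#001", b"row#005")))
```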
Importantly, data is never stored in Bigtable nodes themselves; each node has
pointers to a set of tablets that are stored on Colossus.
As a result:
• Rebalancing tablets from one node to another happens quickly, because the
actual data is not copied. Bigtable simply updates the pointers for each node.
• Recovery from the failure of a Bigtable node is fast, because only metadata
must be migrated to the replacement node.
• When a Bigtable node fails, no data is lost.
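A toy illustration of this pointer-based design (the data structures and locations are invented for the example): because nodes hold only pointers to tablets stored on Colossus, rebalancing and recovery move metadata, not data.

```python
# Toy illustration of Bigtable's separation of serving nodes from storage:
# nodes hold only pointers to tablets; the tablet data lives on Colossus.
# All structures and locations below are hypothetical stand-ins.

tablets_on_colossus = {          # tablet id -> stand-in for a data location
    "tablet-A": "colossus://cell1/sstable-00",
    "tablet-B": "colossus://cell1/sstable-01",
    "tablet-C": "colossus://cell1/sstable-02",
}

node_assignments = {             # Bigtable node -> tablets it serves
    "node-1": {"tablet-A", "tablet-B"},
    "node-2": {"tablet-C"},
}

def rebalance(assignments, tablet, src, dst):
    # Moving a tablet only updates pointers; no tablet data is copied.
    assignments[src].discard(tablet)
    assignments[dst].add(tablet)

def recover(assignments, failed, replacement):
    # Recovery migrates only metadata (the pointers); no data is lost,
    # since the tablets themselves remain on Colossus, untouched.
    assignments[replacement] = assignments.pop(failed)

rebalance(node_assignments, "tablet-B", "node-1", "node-2")
recover(node_assignments, "node-1", "node-3")
print(node_assignments)
```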
Megastore
• Megastore is a storage system developed to meet the requirements of today's
interactive online services.
• Megastore provides scalable, highly available storage for interactive services.
• Megastore blends the scalability of a NoSQL datastore with the convenience
of a traditional RDBMS in a novel way, and provides both strong consistency
guarantees and high availability.
• Google's Megastore is the structured data store supporting the Google App Engine.
• Megastore handles more than 3 billion write and 20 billion read transactions
daily and stores a petabyte of primary data across many global datacenters.
• The basic design philosophy of the system is to partition the data into entity
groups and replicate each partition independently in data centers located in
different geographic areas.
• Megastore tries to provide the convenience of a traditional RDBMS together with the scalability of NoSQL: it is a scalable transactional indexed record manager (built on top of BigTable), providing full ACID semantics within partitions but weaker consistency guarantees across partitions. To achieve these consistency guarantees, Megastore employs a Paxos-based algorithm for synchronous replication across geographically distributed datacenters.
• Another distinctive feature of the system is the use of the Paxos consensus algorithm to replicate primary user data, metadata, and system configuration information across data centers and for locking.
• The version of the Paxos algorithm used by Megastore does not require a
master. Instead, any node can initiate read and write operations to a write-
ahead log replicated to a group of symmetric peers.
• The data model is declared in a schema consisting of a set of tables composed of entities, each entity being a collection of named and typed properties. The unique primary key of an entity in a table is created as a composition of the entity's properties.
• A Megastore table can be a root or a child table. Each child entity must
reference a special entity, called a root entity in its root table. An entity group
consists of the primary entity and all entities that reference it.
• The system makes extensive use of BigTable. Entities from different Megastore tables can be mapped to BigTable rows without collisions. This is possible because the BigTable column name is a concatenation of the Megastore table name and the name of the property.
• A BigTable row for the root entity stores the transaction and all the metadata for the entity group. Multiple versions of the data, with different timestamps, can be stored in a cell. Megastore takes advantage of this feature to implement multi-version concurrency control (MVCC).
• A write transaction involves the following steps: (1) Get the timestamp and
the log position of the last committed transaction. (2) Gather the write
operations in a log entry. (3) Use the consensus algorithm to append the log
entry and then commit. (4) Update the BigTable entries. (5) Clean up.
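A self-contained outline of these five steps is sketched below; all class and method names are hypothetical, and in the real system step (3) is a Paxos round across data centers while steps (4) and (5) operate on BigTable.

```python
# Illustrative, self-contained outline of the Megastore write transaction
# described above. All names are hypothetical placeholders.
import time

class EntityGroup:
    """Local stand-in for one replicated entity group."""
    def __init__(self):
        self.log = []            # stand-in for the replicated write-ahead log
        self.versions = []       # (timestamp, writes) pairs: MVCC stand-in

    def last_committed(self):
        ts = self.versions[-1][0] if self.versions else 0
        return ts, len(self.log)

    def paxos_append(self, entry):
        # Stand-in for the consensus round; always "wins" in this sketch.
        self.log.append(entry)
        return True

    def apply_to_bigtable(self, entry):
        # New timestamped version, so multiple versions coexist (MVCC).
        self.versions.append((time.time(), entry["writes"]))

    def cleanup(self):
        pass                     # release locks, drop stale versions, etc.

def write_transaction(group, writes):
    last_ts, log_pos = group.last_committed()           # (1) read last commit
    entry = {"reads_at": last_ts, "position": log_pos + 1, "writes": writes}  # (2)
    if not group.paxos_append(entry):                    # (3) consensus append
        return False                                     #     lost the race: retry
    group.apply_to_bigtable(entry)                       # (4) update BigTable rows
    group.cleanup()                                      # (5) clean up
    return True

print(write_transaction(EntityGroup(), {"account:42": {"balance": 100}}))
```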
Cloud Security
• Cloud security, also known as cloud computing security, is a collection of security
measures designed to protect cloud-based infrastructure, applications, and data.
• These measures ensure user and device authentication, data and resource access
control, and data privacy protection.
• They also support regulatory data compliance. Cloud security is employed in
cloud environments to protect a company's data from distributed denial of service
(DDoS) attacks, malware, hackers, and unauthorized user access or use.
• Cloud computing is continually transforming the way companies store, use,
and share data, workloads, and software.
• The volume of cloud utilization around the globe is increasing, leading to a
greater mass of sensitive material that is potentially at risk.
• The market for worldwide cloud computing is projected to grow to $191 billion within two years.
• There are many benefits of cloud computing that are driving more firms and individuals to the cloud, including lower costs, improved employee productivity, and faster time to market, among many others.
• Despite these great advantages, moving a firm's workloads to a publicly hosted cloud service exposes the organization to new data security risks, which cause unease for some firms' IT departments and clients.
• With more and more data and software moving to the cloud, unique info-
security challenges crop up. Here are the top cloud computing security risks
that every firm faces.
Cloud security risks
The following are the important cloud security risks: