Mct702 All Units
Computing
Course Objectives
The differences among concurrent, networked, distributed, and mobile systems.
Resource allocation, and deadlock detection and avoidance techniques.
Remote procedure calls.
IPC mechanisms in distributed systems.
Course Outcomes
Develop, test and debug RPC-based client-server programs in Unix.
Design and build application programs on distributed systems.
Improve the performance and reliability of distributed programs.
Design and build new distributed file systems for any OS.
Syllabus
UNIT I
Introduction - Examples of Distributed Systems - Resource Sharing and the Web - Challenges, case study on the World Wide Web - System Models - Introduction - Architectural Models - Fundamental Models.
Distributed Objects and Components: Introduction, Distributed Objects, From Objects to Components, Case studies: Enterprise JavaBeans and Fractal.
Remote Invocation- Remote Procedure Call-Events and Notifications.
UNIT II
Distributed Operating Systems - Introduction - Issues - Communication Primitives - Inherent Limitations - Lamport's Logical Clock; Vector Clock; Causal Ordering; Global State; Cuts; Termination Detection. Distributed Mutual Exclusion - Non-Token-based Algorithms - Lamport's Algorithm - Token-based Algorithms - Suzuki-Kasami's Broadcast Algorithm - Consensus and related problems. Distributed Deadlock Detection - Issues - Centralized Deadlock-Detection Algorithms - Distributed Deadlock-Detection Algorithms.
UNIT III
Distributed Resource Management - Distributed File Systems - Architecture - Mechanisms - Design Issues - Case Study: Sun Network File System - Distributed Shared Memory - Architecture - Algorithms - Protocols - Design Issues. Distributed Scheduling - Issues - Components - Algorithms - Load Distributing Algorithms, Load Sharing Algorithms.
UNIT IV
Transactions and Concurrency: Introduction, Transactions, Nested Transactions, Locks, Optimistic Concurrency Control, Timestamp Ordering.
UNIT V
Resource Security and Protection: Access and Flow Control - Introduction - The Access Matrix Model - Implementation of the Access Matrix Model - Safety in the Access Matrix Model - Advanced Models of Protection - Data Security - Introduction - Modern Cryptography: Private-Key Cryptography, Public-Key Cryptography.
UNIT VI
Distributed Multimedia Systems: Introduction - Characteristics - Quality of Service Management - Resource Management - Stream Adaptation - Case Study.
Designing Distributed Systems: Google Case Study - Introducing the Case Study: Google - Overall Architecture and Design Paradigm - Communication Paradigm - Data Storage and Coordination Services - Distributed Computation Services.
Text Books:
Distributed Systems: Concepts and Design, George Coulouris, Jean Dollimore and Tim Kindberg, Pearson Education, 5th Edition.
Advanced Concepts in Operating Systems, Mukesh Singhal and N. G. Shivaratri, McGraw-Hill.
Distributed Operating Systems, Pradeep K. Sinha, PHI, 2005.
References:
Distributed Computing: Principles, Algorithms and Systems, Ajay D. Kshemkalyani and Mukesh Singhal, Cambridge University Press.
Unit I
Chapter 1: Introduction to Distributed Systems
A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be solved individually.
Definition:
A distributed system is one in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages.
Note: Computers that are connected by a network may be spatially separated by any distance. They may be on separate continents, in the same building or in the same room.
Message passing is the key feature of a distributed system.
Major Consequences
1. Concurrency
Concurrent program execution and resource sharing increase the capacity of the system.
2. No global clock
There is no shared memory to provide the abstraction of a common address space, and programs cannot rely on a single shared idea of the time at which their actions occur.
3. Independent failures
Faults in the network result in the isolation of the computers that are connected to it. A computer may fail, or a program somewhere in the system may terminate unexpectedly (a crash). It is the responsibility of system designers to plan for the consequences of possible failures.
Examples of Distributed Systems
1. Application Domains
Web search
Massively multiplayer online games
Financial trading
2. Recent Trends
Pervasive networking and the modern Internet
Mobile and ubiquitous computing
Distributed multimedia systems
Distributed computing as a utility
1. Application Domains

Sr. no.  Application Domain                       Associated applications and networks
01.
02.      Information societies                    WWW; search engines: Google & Yahoo; user-generated content: YouTube, Wikipedia and Flickr; social networking: Facebook and MySpace
03.      Creative industries and entertainment
04.      Healthcare
05.      Education                                E-learning; virtual learning environments; distance learning; community-based learning
06.
07.      Science
08.      Environmental management
3. Financial trading:
The industry employs automated monitoring and trading applications.
2. Recent Trends
1. Pervasive networking and the modern Internet
[Figure: a typical portion of the Internet - intranets connected through ISPs, backbones and satellite links; desktop computers, servers and network links.]
The Internet is a very large distributed system that lets users, wherever they are, make use of services such as the WWW. It is an open system.
[Figure: portable and handheld devices in a distributed system - a host intranet with a WAP gateway and a wireless LAN, and a home intranet containing a mobile phone, printer, camera and laptop, connected to a host site.]
Mobile devices rely on spontaneous interoperation and service discovery.
Mobile computing
Mobile computing is the performance of computing tasks while the user is on the move, or visiting places other than their usual environment, by providing them with access to resources via the devices they carry with them.
Users can continue to access the Internet; they can continue to access resources in their home intranet; and there is increasing provision for users to utilize resources such as printers, or even sales points, that are conveniently nearby as they move around. The latter is also known as location-aware or context-aware computing.
Mobility introduces a number of challenges for distributed systems, including the need to deal with variable connectivity and indeed disconnection, and the need to maintain operation in the face of device mobility.
Ubiquitous computing
This includes many small, cheap computational devices that are present in users' physical environments, including the home, office and even natural settings.
For example, it may be convenient for users to control their washing machine or their entertainment system from their phone, or from a universal remote control device in the home. Equally, the washing machine could notify the user via a smart badge or phone when the washing is done.
3. Distributed multimedia systems
The main objectives are the storage, transmission and presentation of:
discrete media types, such as pictures or text messages;
continuous media types, such as audio and video.
Cloud computing
Clouds are generally implemented on cluster computers: sets of interconnected computers that cooperate closely to provide a single, integrated, high-performance computing capability. Blade servers are minimal computational elements containing, for example, processing and (main memory) storage capabilities. Grid computing provides support for scientific applications.
Resource sharing
We routinely share:

Resources                        Examples
Hardware resources               Printers, disks
Data resources                   Files, databases, web pages
Functionally specific resources  Search engines

Service: a part of the system that manages a collection of related resources and presents their functionality to users and applications.

Services             Purpose
File services        Access files
Printing services    Send documents to printers
Electronic payment   Buying of goods
Comparison of clients and servers:

           Client                  Server
operation  invokes an operation    performs the remote invocation
nature     active                  passive
time       runs only when needed   works continuously
objects    -                       encapsulates and contains the shared objects
Challenges

Sr. no.  Challenge          Remarks
01.      Heterogeneity
02.      Openness
03.      Security
04.      Scalability
05.      Failure handling   Computer or network
06.      Concurrency
07.      Transparency
08.      Quality of service
1. Heterogeneity
Heterogeneous resources include networks, computer hardware, operating systems and implementations by different developers.
Approaches to masking heterogeneity:
1. Middleware: provides a programming abstraction as well as masking the heterogeneity of the underlying resources. E.g. the Common Object Request Broker Architecture (CORBA), Java Remote Method Invocation (RMI).
2. Mobile code: refers to the transfer of program code between computers. E.g. Java applets.
3. Virtual machines: provide a way of making code executable on a variety of host computers. E.g. the Java compiler targets the Java virtual machine.
2. Openness
Openness permits the extension and reimplementation of new resource-sharing services, which can be made available for use by a variety of client programs.
Open systems are characterized by the fact that their key interfaces are published. E.g. the Requests For Comments (RFCs) that define IP.
Open distributed systems are based on the provision of a uniform communication mechanism and published interfaces for access to shared resources. E.g. the World Wide Web Consortium (W3C) provides the standards on which the Web works.
Open distributed systems can be constructed from heterogeneous hardware and software, possibly from different vendors. But the conformance of each component to the published standard must be carefully tested and verified if the system is to work correctly.
3. Security
Security for information resources has three components: confidentiality, integrity and availability. A further concern is the security of mobile code.
4. Scalability
5. Failure handling
6. Concurrency
Concurrency is a problem when two or more users access the same resource at the same time.
7. Transparency
Transparency is the concealment from users of the separation of components:
Access transparency: local and remote resources can be accessed using identical operations (e.g. FTP services).
Location transparency: resources can be accessed without knowing their whereabouts (e.g. URLs).
Concurrency transparency: processes can operate concurrently using shared resources without interference (e.g. threads and semaphores).
Failure transparency: faults can be concealed from users and applications (e.g. retransmission of mail).
Mobility transparency: resources and users can move within a system without affecting their operations (e.g. handover between mobile networks such as Airtel and Idea).
Continued
Replication transparency: enables multiple instances of resources to be used to increase reliability and performance, without knowledge of the replicas by users or application programmers.
Performance transparency: the system can be reconfigured to improve performance.
Scaling transparency: the system can be expanded in scale without change to the applications.
Transparency Examples
A distributed file system allows access transparency and location transparency.
URLs are location transparent, but are not mobility transparent.
Message retransmission governed by TCP is a mechanism for providing failure transparency.
A mobile phone is an example of mobility transparency.
Note: access transparency and location transparency together are called network transparency.
8. Quality of service
Quality of service covers responsiveness and computational throughput.
The ability to meet timeliness guarantees depends on the available computing and communication resources.
QoS applies to operating systems as well as networks: there must be resource managers that provide guarantees. Reservation requests that cannot be met should be rejected.
Case Study: The World Wide Web
Major components: HTML, URL and HTTP, plus related terms.
Short History
The Web began life at the European Centre for Nuclear Research (CERN), Switzerland, in 1989 as a vehicle for exchanging documents between a community of physicists connected by the Internet.
Concept
The Web provides a hypertext structure among the documents that it stores: hyperlinks, i.e. references to other documents and resources that are also stored in the Web.
The Web is an open system: it can be extended and implemented in new ways without disturbing its existing functionality. Its operation is based on communication standards and document or content standards that are freely published and widely implemented.
Concept
There are many types of browser, implemented on several platforms, and there are also many implementations of web servers.
[Figure: web servers and web browsers on the Internet - browsers issuing http://www.google.com/search?q=kindberg to www.google.com, http://www.cdk3.net/ to www.cdk3.net, and http://www.w3c.org/Protocols/Activity.html to www.w3c.org, whose file system holds Protocols/Activity.html.]
Major Components
The Web is based on three main standard technological
components:
the HyperText Markup Language (HTML), a
language for specifying the contents and layout of
pages as they are displayed by web browsers;
Uniform Resource Locators (URLs), also known
as Uniform Resource Identifiers (URIs), which
identify documents and other resources stored as part
of the Web;
a client-server system architecture, with
standard rules for interaction (the HyperText
Transfer Protocol HTTP) by which browsers and other
clients fetch documents and other resources from web
servers.
1.HTML
The HyperText Markup Language is used to specify
the text and images that make up the contents of a
web page, and to specify how they are laid out and
formatted for presentation to the user.
A web page contains such structured items as
headings, paragraphs, tables and images.
HTML is also used to specify links and which
resources are associated with them.
Users may produce HTML by hand, using a standard text editor, but they more commonly use an HTML-aware WYSIWYG editor that generates HTML from a layout that they create graphically.
Example
Consider a piece of HTML stored in a file, say earth.html:

<IMG SRC = http://www.cdk5.net/WebExample/Images/earth.jpg>
<P>
Welcome to Earth! Visitors may also be interested in taking a look at the
<A HREF = http://www.cdk5.net/WebExample/moon.html>Moon</A>.
</P>

Output: [rendered page not reproduced]
2. URL
The purpose of a Uniform Resource Locator is to identify a resource. A URL may use the File Transfer Protocol (FTP) or the HyperText Transfer Protocol (HTTP), for example:
ftp://ftp.downloadIt.com/software/aProg.exe
3. HTTP
The HyperText Transfer Protocol defines the ways in which browsers and other types of client interact with web servers. The main features are:
Request-reply interactions.
Operations: GET, to retrieve data from the resource, and POST, to provide data to the resource.
Content types: if the type is text/html, a browser will interpret the text as HTML and display it; if it is image/gif, the browser will render it as an image in GIF format; if it is application/zip, it is data compressed in zip format.
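The request-reply structure described above can be sketched by building and parsing HTTP messages as plain text. This is an illustrative sketch only (the host name and body are made up, and no real network I/O is performed); it shows the request line, the headers terminated by a blank line, and how the Content-Type header tells the client how to interpret the body.

```python
# Sketch of an HTTP request-reply exchange, built and parsed by hand.

def build_get_request(host, path):
    # A GET request is plain text: a request line plus headers,
    # terminated by a blank line.
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"

def parse_response(raw):
    # Split the status line and headers from the body; Content-Type
    # tells the browser how to render the body (e.g. text/html).
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    status = int(lines[0].split()[1])
    headers = dict(line.split(": ", 1) for line in lines[1:])
    return status, headers, body

request = build_get_request("www.example.org", "/index.html")
reply = ("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
         "<html><body>Hello</body></html>")
status, headers, body = parse_response(reply)
```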
Related Terms
Dynamic pages: downloaded code; Common Gateway Interface programs on the server; JavaScript; Asynchronous JavaScript And XML (AJAX); applets.
Web services: programmatic access to web resources; web resources provide service-specific operations (GET, POST, PUT, DELETE).
Discussion: the Web faces problems of scale, addressed through the use of proxy servers and clusters of computers.
Fundamental Models
Interaction models
Failure models
Security models
Introduction
Difficulties and threats for distributed systems:
Widely varying modes of use: The component parts of systems are subject to wide variations in workload; for example, some web pages are accessed several million times a day. Some parts of a system may be disconnected, or poorly connected, some of the time; for example, when mobile computers are included in a system. Some applications have special requirements for high communication bandwidth and low latency; for example, multimedia applications.
Wide range of system environments: A distributed system must accommodate heterogeneous hardware, operating systems and networks. The networks may differ widely in performance; wireless networks operate at a fraction of the speed of local networks. Systems of widely differing scales, ranging from tens of computers to millions of computers, must be supported.
Internal problems: Non-synchronized clocks, conflicting data updates and many modes of hardware and software failure involving the individual system components.
External threats: Attacks on data integrity and secrecy, and denial of service attacks.
Introduction
The properties and design issues of distributed
systems can be captured and discussed
through the use of descriptive models.
Each type of model is intended to provide an
abstract, simplified but consistent description
of a relevant aspect of distributed system
design.
The basic models under consideration are:
Architectural models
Fundamental models
There is also one more model: the physical model.
I. Architecture elements
Key questions:
What are the entities that are communicating in
the distributed system?
How do they communicate, or, more specifically,
what communication paradigm is used?
What (potentially changing) roles and
responsibilities do they have in the overall
architecture?
How are they mapped on to the physical distributed infrastructure (what is their placement)?
1. Communicating entities
System-oriented entities:
Nodes
Processes (in a distributed environment, possibly with threads)
Problem-oriented entities:
Objects
Components: accessed through interfaces, making all dependencies explicit, and enabling third-party development by removing hidden dependencies.
Web services: defined by web-based technologies; a software application identified by a URI, with message exchanges via Internet-based protocols.
2. Communication paradigms
Inter-process communication: low-level support for communication between processes in a distributed system.
Message passing: client-server architectures.
Socket programming: use of IP (TCP/UDP).
Multicast communication: one message delivered to many recipients.
Remote invocation: a two-way exchange between entities in terms of remote operations, procedures or methods.
Request-reply protocols: client-server communication, encoded as an array of bytes.
Remote procedure calls: procedures in processes on remote computers can be called.
Remote method invocation: a calling object can invoke a method in a remote object.
Indirect communication.
A. Client-server model:
The most important and most widely used distributed system architecture.
Client and server roles are assigned and changeable: servers may in turn be clients of other servers.
Services may be implemented as several interacting processes in different host computers to provide a service to client processes: servers partition the set of objects on which the service is based and distribute them among themselves (e.g. web data and web servers).
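A minimal sketch of the client-server interaction described above, using TCP sockets: the server offers a single "upper-case" service, and the client invokes it by sending a message and waiting for the result. The service itself is made up for illustration; port 0 asks the OS for any free port.

```python
import socket
import threading

def serve(sock):
    # Accept one client, receive its invocation, return the result.
    conn, _ = sock.accept()
    with conn:
        data = conn.recv(1024)       # the client's invocation message
        conn.sendall(data.upper())   # compute and send back the result

# Server side: bind, listen, and handle requests in a background thread.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=serve, args=(server,), daemon=True).start()

# Client side: invoke the service and wait for the reply.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello server")  # invocation
    result = client.recv(1024)       # result
server.close()
```

Note that client and server are only roles: the same process could act as a client of one service and a server for another.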
[Figure: clients invoke individual servers - each client sends an invocation to a server and receives a result; servers may in turn invoke other servers. Key: process, computer.]
B. Peer-to-peer model:
[Figure: a peer-to-peer system - peers 1 to N each run an application and share a common set of sharable objects; the application database and the storage, processing and communication loads for access to objects are distributed across many computers.]
4. Placement
Variations in the models:
i. Mapping of services to multiple servers
ii. Caching using web proxy servers
iii. Web applets in the form of mobile code
iv. Mobile agents
v. Thin clients
i. Multiple servers
[Figure: a service provided by multiple server processes, each interacting with several client processes.]
ii. Proxy servers and caches
[Figure: client processes accessing web servers via a proxy server.]
Cache: a store of recently used data objects that is closer to the client process than the remote objects themselves. When an object is needed by a client process, the caching service checks the cache and supplies the object from there if an up-to-date copy is available.
Proxy server: provides a shared cache of web resources for the client machines at a site or across several sites.
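The cache-checking behaviour described above can be sketched with a toy in-memory cache. This is illustrative only: the "origin" dictionary stands in for a remote web server, and freshness checking is reduced to simple presence in the cache.

```python
# Toy cache: consult the cache first, contact the origin only on a miss.

origin = {"/index.html": "<html>home</html>"}  # stands in for the remote server
cache = {}
fetches = 0   # counts how often the origin server is actually contacted

def get(path):
    global fetches
    if path in cache:
        return cache[path]        # up-to-date copy available locally
    fetches += 1                  # otherwise fetch from the origin
    cache[path] = origin[path]
    return cache[path]

first = get("/index.html")    # miss: goes to the origin server
second = get("/index.html")   # hit: served from the cache
```

A real proxy cache must additionally validate freshness (e.g. by expiry times or revalidation with the origin), which this sketch omits.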
iii. Web applets
[Figure: a client downloads applet code from a web server; the client then interacts with the applet locally.]
Advantage: good interactive response, since the code runs locally in the client.
Disadvantage: downloaded code is a potential security threat.
v. Thin clients
A thin client is a software layer that supports a window-based user interface local to the user, while executing application programs on a remote computer.
This is the same as the network computer scheme, except that instead of downloading the application code into the user's computer, it runs the applications on a server machine, a compute server. A compute server is a powerful computer that has the capacity to run large numbers of applications simultaneously.
Disadvantage: increased delays in highly interactive graphical applications.
Recently this concept has led to the emergence of virtual network computing (VNC), which has superseded network computers. Since all the application data and code is stored by a file server, users may migrate from one network computer to another.
1. Layering
In the layered view of a system, each layer offers its services to the level above and builds its own service on the services of the layer below. Software architecture is the structuring of software in terms of layers (modules) or services that can be requested locally or remotely.
[Figure: software and hardware service layers - applications and services; middleware; operating system; computer and network hardware. The operating system and hardware together form the platform.]
Platform:
The lowest-level layers, which provide services to the higher layers; they offer a system programming interface for communication and coordination between processes.
Examples:
Pentium processor / Windows NT
SPARC processor / Solaris
Middleware:
A layer of software that masks heterogeneity and provides a unified distributed programming interface to application programmers. It provides infrastructure services for use by application programs.
Examples:
The Object Management Group's Common Object Request Broker Architecture (CORBA).
Java Remote Method Invocation (RMI).
Microsoft's Distributed Component Object Model (DCOM).
Limitation: middleware requires application-level involvement in some tasks.
2. Tier Architecture
Architectural Design Requirements
1. Performance Issues:
Considered under the following factors:
Responsiveness:
Fast and consistent response time is important for the
users of interactive applications.
Response speed is determined by the load and
performance of the server and the network and the
delay in all the involved software components.
System must be composed of relatively few software
layers and small quantities of transferred data to
achieve good response times.
Throughput:
The rate at which work is done for all users in a
distributed system.
Load balancing:
Enable applications and service processes to proceed
concurrently without competing for the same resources.
Exploit available processing resources.
2. Quality of Service:
Main system properties that affect the service
quality are:
Reliability: related to failure fundamental model
(discussed later).
Performance: ability to meet timeliness guarantees.
Security: related to security fundamental model
(discussed later).
Adaptability: ability to meet changing resource
availability and system configurations.
3. Dependability issues:
A requirement in most application domains.
Achieved by:
Fault tolerance: continuing to function in the presence of
failures.
Security: locate sensitive data only in secure computers.
Correctness of distributed concurrent programs:
research topic.
2. Fundamental Models
Models of systems share some fundamental properties; the fundamental models are more specific about their characteristics and about the failures and security risks they might exhibit.
The interaction model is concerned with the
performance of processes and communication
channels and the absence of a global clock.
The failure model classifies the failures of
processes and basic communication channels
in a distributed system.
The security model identifies the possible threats to processes and communication channels in an open distributed system.
I. Interaction Model
A distributed system consists of multiple interacting processes, each with a private set of data that it can access. The behaviour of distributed processes is described by distributed algorithms, which define the steps to be taken by each process in the system, including the transmission of messages between them. Transmitted messages transfer information between these processes and coordinate their ordering and synchronization activities.
Interaction Model
Event ordering: sometimes we need to know whether an event at one process (sending or receiving a message) occurred before, after, or concurrently with another event at another process.
It is impossible for any process in a distributed system to have a view of the current global state of the system.
The execution of a system can be described in terms of events and their ordering, despite the lack of accurate clocks. Logical clocks define an event order based on causality. Logical time can be used to provide an ordering among events in different computers in a distributed system (since real clocks cannot be perfectly synchronized).
[Figure: real-time ordering of events - processes exchange messages m1, m2 and m3; along the physical time axis, each send event precedes the corresponding receive event.]
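The causal ordering mentioned above can be made concrete with a sketch of Lamport's logical clock rules: a process increments its clock on each local event, stamps outgoing messages with its clock, and on receipt advances its clock past the timestamp carried by the message. The class below is an illustrative sketch, not code from the course.

```python
class Process:
    """A process carrying a Lamport logical clock."""

    def __init__(self):
        self.clock = 0

    def event(self):
        # Rule 1: increment the clock on each local event.
        self.clock += 1
        return self.clock

    def send(self):
        # Sending is an event; the message carries the new timestamp.
        self.clock += 1
        return self.clock

    def receive(self, ts):
        # Rule 2: on receipt, jump past the message's timestamp.
        self.clock = max(self.clock, ts) + 1
        return self.clock

p, q = Process(), Process()
p.event()               # p's clock: 1
ts = p.send()           # p's clock: 2; message carries timestamp 2
q.event()               # q's clock: 1
t_recv = q.receive(ts)  # q's clock: max(1, 2) + 1 = 3
```

Because a receive always jumps past the send's timestamp, causally related events are guaranteed increasing Lamport times, even though the two processes never share a physical clock.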
II. Failure Model
Omission failures: a process or channel fails to perform actions it is supposed to perform.
Arbitrary failures: any type of error can occur in processes or channels (the worst case).
Timing failures: applicable only to synchronous distributed systems, where time limits may not be met.
1. Omission Failures
[Figure: process p sends message m to process q over a communication channel, via p's outgoing message buffer; an omission failure occurs when the message or a process step is lost.]
2. Arbitrary Failures
The term arbitrary or Byzantine failure is used to describe
the worst possible failure semantics, in which any type of
error may occur. For example, a process may set wrong
values in its data items, or it may return a wrong value in
response to an invocation.
Arbitrary failures in processes cannot be detected by
seeing whether the process responds to invocations,
because it might arbitrarily omit to reply.
Communication channels can also suffer from arbitrary failures; e.g., message contents may be corrupted, nonexistent messages may be delivered, or real messages may be delivered more than once.
Arbitrary failures of communication channels are rare, because the communication software is able to recognize them and reject the faulty messages. E.g., checksums are used to detect corrupted messages, and message sequence numbers can be used to detect nonexistent and duplicated messages.
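The two mechanisms just named, checksums against corruption and sequence numbers against duplicates, can be sketched as follows. The frame format is made up for illustration; CRC32 stands in for whatever checksum a real protocol would use.

```python
import zlib

def make_frame(seq, payload):
    # A frame carries a sequence number, the payload, and its CRC32 checksum.
    return (seq, payload, zlib.crc32(payload))

def accept(frame, expected_seq):
    seq, payload, checksum = frame
    if zlib.crc32(payload) != checksum:
        return False   # checksum mismatch: corrupted in transit, reject
    if seq != expected_seq:
        return False   # wrong sequence number: duplicate or out of order
    return True

good = make_frame(0, b"commit")
dup = good                                        # same frame delivered twice
corrupt = (1, b"cxmmit", zlib.crc32(b"commit"))   # payload flipped in transit
```

The receiver accepts `good` at sequence 0, then rejects the duplicate (its sequence number is stale) and the corrupted frame (its checksum no longer matches the payload).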
Failure Model
Omission and arbitrary failures
3. Timing Failures
Timing failures are applicable in synchronous distributed systems, where time limits are set on process execution time, message delivery time and clock drift rate. Real-time operating systems are designed with a view to providing timing guarantees.
The typical timing failures are:

Class of failure  Affects   Description
Clock             Process   The process's local clock exceeds the bounds on its rate of drift from real time.
Performance       Process   The process exceeds the bounds on the interval between two steps.
Performance       Channel   A message's transmission takes longer than the stated bound.
III. Security Model
The security model is concerned with protecting objects, e.g. a user's private data (a mailbox) or a server's shared data (web pages).
[Figure: objects and principals - a client invokes an object on a server and receives a result; the invocation is made on behalf of a principal (user), the server acts on behalf of a principal (server), and rights are granted by an authority (principal).]
[Figure: the enemy - an attacker on the network can copy, alter or inject messages m exchanged between process p and process q over a communication channel.]
Threats to processes: on the server side or the client side.
Threats to communication channels: to the privacy and integrity of information as it travels over the network.
[Figure: a secure channel between process p, acting for principal A, and process q.]
Secure channels are implemented using cryptography and shared secrets.
Mobile code:
Mobile code requires executability privileges on the target machine, and the code may be malicious (e.g., mail worms).
Overview of the Architectural Model
3 objectives: architectural elements, architectural patterns, and available middleware platforms.
4 architectural elements: entities, communication paradigms, roles and responsibilities, and placement.
2 kinds of entities: system-oriented and problem-oriented entities.
3 problem-oriented entities: objects, components and web services.
3 communication paradigms: inter-process communication; remote invocation (request-reply protocols, RMI, RPC); indirect communication (e.g. event-based).
Chapter 2: Distributed Objects and Components
Introduction
This chapter discusses complete middleware solutions, presenting distributed objects and components as two of the most important styles of middleware in use today.
Software that allows a level of programming beyond processes and message passing is called middleware. Middleware layers are based on protocols and application programming interfaces.
[Figure: middleware layers - applications; RMI, RPC and events; request-reply protocol; external data representation; operating system.]
Programming Models
Remote procedure calls: client programs call procedures in server programs.
Remote method invocation: objects invoke methods of remote objects on distributed hosts.
Event-based programming model: objects receive notice of events in other objects in which they have an interest.
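The RPC model listed above can be sketched with Python's standard `xmlrpc` module: the client calls `add()` as if it were a local procedure, while the middleware marshals the call into a request message, sends it to the server, and unmarshals the reply. The `add` service is made up for illustration.

```python
from xmlrpc.server import SimpleXMLRPCServer
import xmlrpc.client
import threading

# Server: register a procedure and serve requests in a background thread.
# Port 0 asks the OS for any free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
port = server.server_address[1]
server.register_function(lambda a, b: a + b, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client: the proxy makes the remote procedure look like a local one.
proxy = xmlrpc.client.ServerProxy(f"http://127.0.0.1:{port}/")
result = proxy.add(2, 3)   # marshalled, sent, executed remotely, unmarshalled
server.shutdown()
```

The same transparency is what Java RMI provides for objects: the caller's code is written as an ordinary invocation, and the middleware handles the messaging.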
Interface
Current programming languages allow programs to be developed as a set of modules that communicate with each other. Permitted interactions between modules are defined by interfaces. A specified interface can be implemented by different modules without the need to modify other modules using the interface.
In a distributed system, a remote interface defines the remote objects on a server and the input and output arguments of each of the objects' methods that are available to clients.
Remote objects can return objects as arguments back to the client, and can return references to remote objects to the client.
Interfaces do not have constructors.
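The separation between interface and implementation described above can be sketched with an abstract class. All names here (`FileService`, `read`, `a.txt`) are invented for illustration; the point is that the client is written against the interface, so the implementation can be swapped without modifying the client.

```python
from abc import ABC, abstractmethod

class FileService(ABC):
    """The published interface: operations and their arguments."""

    @abstractmethod
    def read(self, name: str) -> str:
        ...

class InMemoryFileService(FileService):
    """One possible implementation of the interface."""

    def __init__(self):
        self.files = {"a.txt": "hello"}

    def read(self, name: str) -> str:
        return self.files[name]

def client(service: FileService) -> str:
    # The client depends only on the interface, not the implementation.
    return service.read("a.txt")

out = client(InMemoryFileService())
```

A remote interface plays the same role across machine boundaries, with the middleware generating the proxy that implements it on the client side.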
Benefits of Middleware
Location Transparency:
Remote Objects seem as if they are on the
same machine as the client
Communication Protocols:
Client/Server does not need to know if the
underlying protocol used by the middleware is
UDP or TCP
Computer Hardware/ Operating System:
Hides differences in data representation
caused by different computer hardware or
operating system
Programming Languages:
Allows the client and server programs to be written in different languages.
Limitations
Implicit dependencies.
Programming complexity.
Lack of separation of distribution concerns.
2. Component-based middleware
Component-based middleware responds to the limitations of distributed object middleware, and also adds significant support for distributed systems development and deployment. Software components are like distributed objects in that they are encapsulated units of composition.
Distributed Objects
Middleware based on distributed objects is designed to provide a programming model based on object-oriented principles, bringing the benefits of the object-oriented approach to distributed programming.
Continued
In a system of distributed objects, the unit of distribution is the object. Objects that can receive remote requests for services are called remote objects. Remote objects must be accessible through a remote reference. To invoke a method, its signature and parameters must be defined in a remote interface. Together, these technologies are called remote method invocation (RMI).
[Figure: remote and local invocations on a remote object - the remote object holds data and an implementation of its methods; its remote interface (defined in an IDL) exposes methods m1-m3, while m4-m6 are invoked locally; objects A and F make remote invocations, and objects C, D and E make local invocations.]
1. Inter-object communication:
2. Lifecycle management:
Lifecycle management is concerned with the creation,
migration and deletion of objects, with each step having to
deal with the distributed nature of the underlying
environment.
Continued
4. Persistence:
Objects typically have state, and it is important to maintain this state across possible cycles of activation and deactivation, and indeed across system failures. Distributed object middleware must therefore offer persistence management for stateful objects.
5. Additional services:
A comprehensive distributed object middleware framework must also provide support for a range of distributed system services, viz. naming, security and transaction services.
Components
A component can be thought of as a collection of objects that provide a set of services to other systems. The set of services includes code providing graphing facilities, network communication services, browsing services related to database tables, etc.
The Object Linking and Embedding (OLE) architecture is one of the first component-based frameworks, on which Microsoft Excel spreadsheets are designed.
Benefits: programming by assembly (manufacturing) rather than development (engineering); reduced skill requirements.
Essence of a Component
A component is specified in terms of a contract, which includes:
A set of provided interfaces, that is, interfaces that the component offers as services to other components.
A set of required interfaces, that is, the dependencies that this component has in terms of other components that must be present and connected to it for it to function correctly.
Note: both kinds of interface are included in component-based middleware.
Component-based Development
Programming in component-based systems is concerned with the development of components and their composition.
Goal: support a style of software development that parallels hardware development, using off-the-shelf components and composing them together to develop more sophisticated services. This supports third-party development of software components and also makes it easier to adapt system configurations at runtime, by replacing one component with another.
Note: components are encapsulated in containers.
Containers
Containers support a common pattern often
encountered in distributed systems development.
It consists of:
A front-end (web-based) client
A container holding one or more components that
implement the application or business logic
System services that manage the associated data
in persistent storage.
Tasks of a container
Provides a managed server-side hosting
environment for components
Provides the necessary separation of concerns
Continued ..
The container implements middleware's services like:
To authenticate users
To make an application remotely accessible
To provide transaction handling
Other services
Activation and passivation, persistence
Life cycle management
Container metadata (introspection)
Packaging and deployment
Structure of Container
This shows a number of components encapsulated
within a container.
The container does not provide direct access to
the components but rather intercepts incoming
invocations and then takes appropriate actions to
ensure the desired properties of the distributed
application.
Application Server
Middleware that supports the container
pattern and the separation of concerns implied
by this pattern is known as an application
server.
A wide range of application servers are now
available.
Note: The Enterprise JavaBeans specification is an example of an application server.
Component-based deployment
Component-based middleware provides support for
the deployment of component configurations.
Deployment descriptors
fully describe how the
configurations should be deployed in a distributed
environment.
Deployment descriptors are typically written in XML
and include sufficient information to
ensure that:
components are correctly connected using appropriate
protocols and associated middleware support;
the underlying middleware and platform are configured to
provide the right level of support to the component
configuration
the associated distributed system services are set up to
provide the right level of security, transaction support and
so on.
What is EJB?
Enterprise beans
The Enterprise JavaBeans architecture is a component
architecture for the development and deployment of
component-based distributed business applications.
Example: In an inventory control application, the
enterprise beans might implement the business logic
in methods called checkInventoryLevel and
orderProduct.
Benefits of Enterprise Beans
The EJB container provides system-level services to
enterprise beans, so the bean developer can concentrate
on solving business problems.
The client developer can focus on the presentation of the
client.
The application assembler can build new applications from
existing beans.
Programming in EJB
The task of programming in EJB has been simplified
significantly through the use of POJOs
(plain old Java objects) together with Java
annotations.
A bean is a POJO supplemented by annotations.
Annotations were introduced in Java 1.5 as a mechanism for
associating metadata with packages, classes, methods,
parameters and variables.
The following are examples of annotated bean definitions
@Stateful public class eShop implements Orders {...}
@Stateless public class CalculatorBean implements Calculator {...}
@MessageDriven public class SharePrice implements MessageListener
{...}
Types of EJBs
1. Session Bean: an EJB used for implementing high-level business logic and processes:
Session beans handle complex tasks that require
interaction with other components (entities,
web services, messaging, etc.)
A session bean is used to represent the state of a single
interactive communication session between
a client and the business tier of the server.
Session beans are transient:
when a session is completed, the
associated bean is discarded;
in case of failures, session beans are lost, as
they are not stored in stable storage.
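The stateless/stateful distinction can be illustrated in plain Java, without an EJB container. The class names here (CalculatorBean, ShoppingCartBean) are hypothetical examples, not from the EJB API: the stateless bean keeps no per-client state, while the stateful bean accumulates conversational state that is lost when the session ends.

```java
import java.util.ArrayList;
import java.util.List;

class CalculatorBean {                 // stateless: any instance can serve any client
    int add(int a, int b) { return a + b; }
}

class ShoppingCartBean {               // stateful: tied to one client's session
    private final List<String> items = new ArrayList<>();
    void addItem(String item) { items.add(item); }
    int itemCount() { return items.size(); }
    // discarding the bean (end of session) loses the items list
}
```

In a real EJB container the classes would additionally carry @Stateless and @Stateful annotations, as in the examples above.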
EJB containers
EJB container
Runtime environment that provides services, such as
transaction management, concurrency control, pooling,
and security authorization.
Historically, application servers have added other
features such as clustering, load balancing, and failover.
Some JEE Application Servers
GlassFish (Sun/Oracle, open source edition)
WebSphere (IBM)
WebLogic (Oracle)
JBoss (Red Hat)
WebObjects (Apple)
Essence of Fractal
Fractal is a lightweight component model that
can be used with various programming
languages to design, implement, deploy and
reconfigure various systems and applications,
from operating systems to middleware to GUIs.
1. Various programming languages
Programming platforms
Julia and AOKell (Java-based)
Cecilia and Think (C-based)
FracNet (.NET-based)
FracTalk (Smalltalk-based)
Julio (Python-based).
Julia and Cecilia are treated as the reference implementations of
Fractal.
Middleware platforms
Think (a configurable operating system kernel),
DREAM (supporting various forms of indirect communication),
GOTM (offering flexible transaction management)
ProActive (Grid computing).
Jasmine (monitoring and management of SOA platforms)
3. Inversion of control
client interfaces
which support outgoing invocations
equivalent to required interfaces
Note:
Communication between Fractal components is
only possible if their interfaces are bound.
This leads to the composition of Fractal components.
Binding in Fractal
To enable composition, Fractal supports
bindings between interfaces.
The two styles of binding are:
Primitive bindings
Composite bindings
Primitive bindings
A direct mapping between one client
interface and one server interface
within the same address space.
Operation invocations emitted by the
client interface are accepted
by the specified server interface.
It can readily be implemented using
pointers or direct language
references (e.g., object references in Java).
Composite bindings
Built out of a set of primitive bindings and binding
components such as stubs, skeletons, adaptors, etc.
Implemented in terms of a communication path
between a number of component interfaces,
potentially on different machines.
Composite bindings are themselves components
in Fractal:
interconnection (remote invocation or indirect,
point-to-point or multiparty)
reconfigurable at runtime (security and scalability)
Structure of Fractal
components
A Fractal component is a runtime entity that is
encapsulated, has a distinct identity and
supports one or more interfaces.
The architecture is based on a membrane and
controllers (non-functional concerns), with external
interfaces, interceptors and internal interfaces.
Controllers:
1. Activity controllers
2. Thread controllers
3. Scheduling controllers
Purpose of Controllers
1. Implementation of lifecycle management:
Activation or deactivation of a process
Even allows replacement of a server by another,
enhanced server
Purpose of Membrane
To provide different levels of control:
simple encapsulation of components
support for non-functional issues such as
transactions and security, as in
application servers.
Level of controls
Low-level controls (runtime entities,
base components, e.g. objects in Java)
Middle-level controls (introspection level,
provides component interfaces, e.g.
external interfaces in client-server, COM
services)
High-level controls (configuration level,
exploits internal elements)
Additional levels of control (a
framework for the instantiation of
components)
Content
The content of a component is composed of (a finite number of)
other components, called sub components, which are under the
control of the controller of the enclosing component.
The Fractal model is thus recursive and allows components to be
nested (i.e. to appear in the content of enclosing components) at an
arbitrary level.
Content
Sub-components :
Hierarchy of components
Components as run-time entities
(computational units)
Caller and callee interface in client-server
Sharing of components:
Software architectures with resources
Menu and toolbar components
Sharing of an Undo button
Standard factories:
create only one kind of component
use templates and sub-templates
Chapter 4: Remote
Invocation
1. Introduction
2. Remote Procedure Call
3. Events & Notification
1. Introduction
Middleware layers are based on protocols and
application programming interfaces.
[Figure: the request-reply protocol — the client's doOperation sends a Request message and waits; the server's getRequest receives it, selects the object, executes the method and returns a Reply message via sendReply; the client then continues.]
Operations of the
request-reply protocol
messageType: int (0 = Request, 1 = Reply)
requestId: int
objectReference: RemoteObjectRef
methodId: int or Method
arguments: array of bytes
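A minimal Java sketch of this message structure (the RemoteObjectRef type is simplified to a String here; the class is illustrative, not a standard API). The requestId lets a client match a reply to its request and filter duplicates:

```java
class RequestMessage {
    static final int REQUEST = 0, REPLY = 1;
    int messageType;        // 0 = Request, 1 = Reply
    int requestId;          // matches replies to requests, filters duplicates
    String objectReference; // simplified RemoteObjectRef
    int methodId;
    byte[] arguments;

    RequestMessage(int type, int id, String ref, int method, byte[] args) {
        messageType = type; requestId = id;
        objectReference = ref; methodId = method; arguments = args;
    }

    // A message answers a request iff it is a Reply carrying the same requestId.
    boolean isReplyTo(RequestMessage req) {
        return messageType == REPLY && requestId == req.requestId;
    }
}
```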
2. Duplicate filtering
Applied when retransmissions are used
3. Retransmission of results
Whether to keep a history of results
Avoids re-execution of server operations
Note: Combinations of these choices lead to a variety of invocation
semantics.
Invocation Semantics
Maybe invocation semantics
At-least-once invocation semantics
At-most-once invocation semantics
1. Maybe invocation
The remote method
may execute once or not at all; the invoker cannot tell
useful only where occasional failures are acceptable
Invocation message lost...
method not executed
Result not received...
was the method executed or not?
Server crash...
before or after the method executed?
if timeout, the result could be received after the timeout...
2. At-least-once invocation
The remote method
invoker receives a result (method executed at least once) or an
exception (no result; executed once or not at all)
achieved by retransmission of request messages
Invocation message retransmitted...
method may be executed more than once
arbitrary failure (wrong result possible)
method must be idempotent (repeated execution has the
same effect as a single execution)
Server crash...
dealt with by timeouts, exceptions
3. At-most-once invocation
The remote method
invoker receives a result (method executed exactly once)
or an exception (no result was received)
achieved by retransmission of reply & request messages
and duplicate filtering
Best fault tolerance...
arbitrary failures prevented, since the method is called at
most once
Used by CORBA and Java RMI
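At-most-once handling on the server side can be sketched as follows — duplicate requests (same requestId) are filtered and answered from a history of saved replies, so the operation itself runs at most once per request. The class and the toy "square" operation are hypothetical, for illustration only:

```java
import java.util.HashMap;
import java.util.Map;

class AtMostOnceServer {
    private final Map<Integer, Integer> replyHistory = new HashMap<>();
    int executions = 0;                       // counts real executions, for illustration

    int handle(int requestId, int arg) {
        Integer saved = replyHistory.get(requestId);
        if (saved != null) return saved;      // duplicate: retransmit the old reply
        executions++;
        int result = arg * arg;               // execute the operation once
        replyHistory.put(requestId, result);  // keep the reply for retransmission
        return result;
    }
}
```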
L2: Programming Models
Remote procedure call (RPC)
call a procedure in a separate process
client programs call procedures in server programs
Event-based model
register interest in events of other objects
receive notifications of the events at other objects
Introduction
Design issues
Implementation
Case study: Sun RPC
Introduction
Remote Procedure Call (RPC) is a high-level model
for client-server communication.
It provides programmers with a familiar
mechanism for building distributed systems.
Examples: file service, authentication service.
Introduction
How to operate RPC?
When a process on machine A calls a procedure on
machine B, the calling process on A is suspended, and the
execution of the called procedure takes place on B.
Information can be transported from the caller to the callee
in the parameters and can come back in the procedure
result.
No message passing or I/O at all is visible to the
programmer.
The RPC model
[Figure: the client calls the procedure and waits for the reply (blocking state); the server receives the request and starts executing (executing state); on receiving the reply, the client resumes execution.]
Characteristics
The called procedure is in another process, which may
reside on another machine.
The processes do not share an address space.
Passing parameters by reference and passing
pointer values are not allowed.
Parameters are passed by value.
The called remote procedure executes within the
environment of the server process.
The called procedure does not have access to the
calling procedure's environment.
Features
Simple call syntax
Familiar semantics
Well defined interface
Ease of use
Efficient
Can communicate between processes on the same
machine or different machines
Limitations
Parameters are passed by value only; pointer values are not
allowed.
Speed: remote procedure call (and return) time (i.e.,
overhead) can be significantly (1-3 orders of magnitude)
slower than that of a local procedure call.
This may affect real-time designs, and the programmer should be aware of
its impact.
Design Issues
Exception handling
Necessary because of the possibility of network and node
failures;
RPC uses the return value to indicate errors;
Transparency
Syntactic transparency is achievable: exactly the same syntax as a local
procedure call;
Semantic transparency is impossible because of RPC's limitations: failure
behavior is similar but not exactly the same;
Design Issues
Delivery guarantees
Retry request message: whether to retransmit the request
message until either a reply is received or the server is
assumed to have failed;
Duplicate filtering: when retransmissions are used, whether
to filter out duplicates at the server;
Retransmission of replies: whether to keep a history of
reply messages to enable lost replies to be retransmitted
without re-executing the server operations.
Call Semantics
Maybe call semantics
After an RPC time-out (or a client crash and restart), the
client cannot tell whether the remote procedure (RP) was
called or not.
This is the case when no fault tolerance is built into the RPC
mechanism.
Clearly, maybe semantics is not desirable.
Call Semantics
At-least-once call semantics
With this call semantics, the client can assume that the RP
is executed at least once (on return from the RP).
Can be implemented by retransmission of the (call) request
message on time-out.
Acceptable only if the server's operations are idempotent,
that is, f(f(x)) = f(x).
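Idempotent vs. non-idempotent server operations can be seen in a small sketch (the Account class is a hypothetical example): re-executing setBalance has the same effect as executing it once, so it is safe under at-least-once semantics, whereas re-executing deposit is not.

```java
class Account {
    int balance = 0;
    void setBalance(int b) { balance = b; }         // idempotent: f(f(x)) = f(x)
    void deposit(int amount) { balance += amount; } // not idempotent
}
```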
Call Semantics
At-most-once call semantics
When an RPC returns, it can be assumed that the remote
procedure (RP) has been called exactly once or not at all.
Implemented by the server filtering duplicate requests
(which are caused by retransmissions due to IPC failure, or a
slow or crashed server) and caching replies (in a reply
history; refer to the RRA protocol).
Call Semantics
This ensures the RP is called exactly once if the server does
not crash during execution of the RP.
When the server crashes during the RP's execution, the
partial execution may lead to erroneous results.
In this case, we want the effect that the RP has not been
executed at all.
RPC Mechanism
[Figure: the client process contains the client program and client stub procedure; the server process contains a dispatcher, server stub procedure and service procedure; Request and Reply messages flow between the communication modules of the two processes.]
RPC Mechanism:
[Figure: on the client computer, the client procedure makes a local call to the client stub, which marshals the arguments and sends the request via the communication module; on the server computer, the communication module receives the request, selects the procedure, the server stub unmarshals the arguments and the service procedure executes; the results are marshalled into a reply, which the client stub unmarshals before the local return.]
RPC Mechanism:
1. The client provides the arguments and calls the client stub in
the normal way.
2. The client stub builds (marshals) a message (call request) and
traps to the OS & network kernel.
3. The kernel sends the message to the remote kernel.
4. The remote kernel receives the message and gives it to the
server dispatcher.
5. The dispatcher selects the appropriate server stub.
6. The server stub unpacks (unmarshals) the parameters and calls
the corresponding server procedure.
RPC Mechanism
7. The server procedure does the work and returns the result to
the server stub.
8. The server stub packs (marshals) it in a message (call return)
and traps to the OS & network kernel.
9. The remote (server) kernel sends the message to the client
kernel.
10. The client kernel gives the message to the client stub.
11. The client stub unpacks (unmarshals) the result and returns it
to the client.
A pair of Stubs
Client-side stub:
Looks like the local server function
Same interface as the local function
Bundles arguments into a message, sends it to the server-side stub
Waits for the reply, unbundles the results, returns
Server-side stub:
Looks like a local client function to the server
Listens on a socket for messages from the client stub
Un-bundles arguments into local variables
Makes a local function call to the server
Bundles the result into a reply message to the client stub
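What the stub pair does can be sketched in a few lines of Java — the client stub marshals two int arguments into a byte message, the server stub unmarshals them and calls the local procedure. The names (Stubs, addProcedure) are illustrative; real stubs are generated from an interface definition:

```java
import java.nio.ByteBuffer;

class Stubs {
    // client-side stub: bundle two int arguments into a request message
    static byte[] marshal(int a, int b) {
        return ByteBuffer.allocate(8).putInt(a).putInt(b).array();
    }

    // server-side stub: unbundle the arguments and call the local procedure
    static int serverStub(byte[] request) {
        ByteBuffer in = ByteBuffer.wrap(request);
        return addProcedure(in.getInt(), in.getInt());
    }

    static int addProcedure(int a, int b) { return a + b; } // service procedure
}
```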
RPC Implementation
Three main tasks:
Interface processing: integrate the RPC mechanism with
client and server programs in conventional programming
languages.
Communication handling: transmitting and receiving
request and reply messages.
Binding: locating an appropriate server for a particular
service.
RPC IDL
program numbers instead of interface names (unique)
procedure numbers instead of procedure names ( version changes)
single input procedure parameter (structs)
[Figure: an IDL interface with procedure definitions — e.g. a WRITE file procedure (procedure 1) and a READ file procedure (procedure 2) — together with a version number and a program number.]
struct readargs {
    FileIdentifier f;
    FilePointer position;
    Length length;
};

program FILEREADWRITE {
    version VERSION {
        void WRITE(writeargs) = 1;
        Data READ(readargs) = 2;
    } = 2;
} = 9999;
Compiler: rpcgen
rpcgen name.x produces:
name.h — header
name_svc.c — server stub
name_clnt.c — client stub
[name_xdr.c] — XDR conversion routines
Authentication
Sun RPC request and reply messages have
additional fields for authentication
information to be passed between client
and server.
Sun RPC supports the following
authentication protocols:
UNIX style, using the uid and gid of the user
A shared key established for signing the RPC
messages
The well-known Kerberos style of authentication
Advantages
No need to worry about getting a unique transport address (port)
But with Sun RPC you need a unique program number
per server
Greater portability
Transport independence
The protocol can be selected at run-time
The application does not have to deal with maintaining message
boundaries, fragmentation and reassembly
Applications need to know only one transport address
Port mapper
A function call model can be used instead of send/receive
Event-Notification model
Idea
One object reacts to a change occurring in another object
Events cause changes in the objects that maintain the state of the
application
Objects that represent events are called notifications
Event examples
modification of a document
entering text in a text box using the keyboard
clicking a button using the mouse
Publish/subscribe paradigm
event generators publish the types of events they produce
event receivers subscribe to the types of events that are of interest to
them
when an event occurs, the receivers are notified
System components
Information provider process
receives new trading information
publishes stock price events
sends stock price update notifications
Dealer process
subscribes to stock price events
[Figure: a dealing room system — an external source feeds information provider processes, which send notifications of stock price updates to dealer processes running on the dealers' computers.]
[Figure: three architectures for event notification — (1) an object of interest sends notifications directly to subscribers; (2) an object of interest sends notifications via an observer to subscribers; (3) an observer queries an outside object of interest and sends notifications to subscribers.]
Three cases
Inside object without an observer: sends
notifications directly to the subscribers
Inside object with an observer: sends
notifications via the observer to the subscribers
Outside object (with an observer):
1. an observer queries the object of interest in
order to discover when events occur
2. the observer sends notifications to the
subscribers
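The publish/subscribe idea above can be sketched as a minimal in-process event bus (the names EventBus and Subscriber are illustrative, not from any library): subscribers register interest in an event type, and when the generator publishes an event of that type, every subscriber is notified.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface Subscriber { void notify(String event); }

class EventBus {
    private final Map<String, List<Subscriber>> subs = new HashMap<>();

    void subscribe(String type, Subscriber s) {
        subs.computeIfAbsent(type, t -> new ArrayList<>()).add(s);
    }

    void publish(String type, String event) {     // deliver a notification
        for (Subscriber s : subs.getOrDefault(type, List.of())) s.notify(event);
    }
}
```

A distributed version would additionally deliver the notifications over the network, e.g. via remote method invocation on the subscribers.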
Notification Delivery
Delivery semantics
Filtering of notifications
Patterns of events
Notification mailboxes
Roles/interfaces: the EventGenerator interface, the RemoteEventListener interface, RemoteEvent objects and third-party agents
MCT702 :UNIT II
Distributed Operating Systems
Chapter 1: Architectures of Distributed Systems
Introduction
Issues in distributed operating systems (9)
Communication primitives (2)
Chapter 2: Theoretical Foundations
Inherent limitations of a distributed system
Lamport's logical clock and its limitations
Vector clocks
Causal ordering of messages
Global state and the Chandy-Lamport recording algorithm
Cuts of a distributed computation
Termination detection
Continued...
Chapter 3: Distributed Mutual Exclusion
Non-token-based algorithms
Lamport's algorithm
Architecture of Distributed OS
Definition:
Distributed operating system:
Integration of system services presenting
a transparent view of a multiple-computer
system with distributed resources and
control.
Consists of concurrent processes
accessing distributed, shared or replicated
resources through message passing in a
network environment.
Issues in Designing
Distributed Operating Systems
1. Global knowledge
2. Naming
3. Scalability
4. Compatibility
5. Process synchronization
6. Resource management
7. Security
8. Structuring
9. Client-server computing models
1. Global Knowledge
Complete and accurate knowledge of all processes
and resources is not easily available
Difficulties arise due to
absence of global shared memory
absence of global clock
unpredictable message delays
Challenges
decentralized system wide control
total temporal ordering of system events
process synchronization (deadlocks, starvation)
2. Naming
Names are used to refer to objects, which include computers,
printers, services, files and users.
Objects are encapsulated in servers, and the only visible entities in the
system are servers. To contact a server, the server must be
identifiable.
Three identification methods:
1. Identification by name (name server)
2. Identification by physical or logical address (network server)
3. Identification by the service that servers provide (components)
3. Scalability
Systems generally grow with time.
The design should be such that growth does not result
in system unavailability or degraded performance.
E.g. broadcast-based protocols work well for small
systems but not for large systems.
Distributed file system: on a larger scale, the increase
in broadcast queries for file locations affects the
performance of every computer.
4. Compatibility
Refers to the interoperability among the resources in
a system.
There are three levels of compatibility in DS
Binary level: all processes execute the same instruction
set, even though the processors may differ in performance
and in input-output.
Compatibility
Execution level: The same source code can be
compiled and executed properly on any computer in
the system
E.g. Andrew and Athena systems support execution level
compatibility
5. Process Synchronization
Process synchronization is difficult because of the
unavailability of shared memory.
A DOS has to synchronize processes running on different
computers when they try to concurrently access
shared resources.
The mutual exclusion problem:
requests must be serialized to secure the integrity of
the shared resources.
In a DS, processes can request resources (local or remote)
and release resources in any order.
If the sequence of resource allocation is not
controlled, deadlock may occur, which can lead to
decreased system performance.
6. Resource Management
Concerned with making both local and
remote resources available to users in an
effective manner.
Users should be able to access remote
resources as easily as they can access
local resources.
Specific location of resources should be
hidden from users in the following ways:
Data Migration
Computation Migration and
Distributed scheduling
6.1:Data Migration
Data can either be file or contents of physical
memory.
In the process of data migration, data is brought by the DOS
to the location of the computation that needs access to it.
If the computation updates a set of data, the original location
may have to be updated.
In the case of files, a DFS is involved as a component of the DOS
that implements a common file system available to
the autonomous computers in the system.
The primary goal is to provide the same functional capability
to access files regardless of their location.
If the data accessed is in the physical memory of
another system, then a computation's data request is
handled by distributed shared memory.
It provides a virtual address space that is shared
among all the computers in a DS; the main issues are
consistency and delays.
6.2:Computation migration
In computation migration, the computation migrates to
another location.
It may be efficient when information is needed
concerning a remote file directory.
It is more efficient to send a message and receive
the information back, instead of transferring the
whole directory.
Remote procedure call has been commonly used for
computation migration.
Only a part of the computation of a process is normally
carried out on a different machine.
7:Security
OS is responsible for the security of the
computer system
Two issues must be considered:
Authentication: process of guaranteeing that an
entity is what it claims to be
Authorization: process of deciding what privileges
an entity has and making only these privileges
available
8: Structuring
1. Monolithic Kernel:
The kernel contains all the services
provided by the operating system.
A copy of the huge kernel runs on all the
machines of the system.
The limitation of this approach is that most
machines will not require most of the
services, but the kernel still provides them.
Note: one size fits all (diskless workstations,
multiprocessors, and file servers)
Communication Primitives
Communication primitives are the means of sending raw bit streams of data in a
distributed environment.
There are two models that are widely accepted
for developing distributed operating systems (DOS):
1. Message passing
2. Remote procedure call (RPC)
Note: In a DS, recall the communication paradigms in the
architecture model:
Inter-process (message passing, socket programming, multicast)
Remote invocation (RRP, RPC, RMI)
Indirect communication (group communication, event-based,
etc.)
[Figure: message passing — user a executes send m; the message travels from buffer a through the kernel buffer of the sending computer and the communication channel to buffer b at the receiver.]
Non-blocking Primitives:
With a non-blocking primitive, the SEND primitive
returns control to the user process immediately.
The RECEIVE primitive responds by signaling
and provides a buffer into which the message is copied.
The primary advantage is that programs have
maximum flexibility to perform computation and
communication in any order.
A significant disadvantage of non-blocking primitives is that
programming becomes difficult.
A natural use of non-blocking communication
occurs in the producer (SEND) - consumer (RECEIVE)
relationship.
Blocking Primitives:
The SEND primitive does not return control to
the user program until the message has been sent.
Synchronous vs Asynchronous
Primitives
The distinction is based on whether a buffer is used or not.
Both can be combined with blocking or non-blocking primitives.
Synchronous primitive:
A SEND primitive blocks until a corresponding
RECEIVE primitive is executed at the receiving
computer.
This strategy is referred to as a blocking synchronous
primitive, or rendezvous.
With a non-blocking synchronous primitive, the
message is copied to a buffer at the sending
side, allowing the process to perform
other computation, except another SEND.
Asynchronous primitive:
The messages are buffered.
A SEND primitive does not block even if
there is no corresponding execution of
a RECEIVE primitive.
The RECEIVE primitive can either be
a blocking or a non-blocking primitive.
The main disadvantage is that using
buffers increases the complexity of
creating, managing and
destroying them.
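The blocking/non-blocking distinction can be illustrated with a bounded buffer standing in for the kernel buffer (this is a plain-Java analogy, not an OS primitive): BlockingQueue.put() behaves like a blocking SEND (it waits for buffer space), while offer() behaves like a non-blocking SEND (it returns false immediately when the buffer is full).

```java
import java.util.concurrent.ArrayBlockingQueue;

class Buffers {
    // non-blocking SEND: returns at once, reporting success or failure
    static boolean nonBlockingSend(ArrayBlockingQueue<String> buf, String m) {
        return buf.offer(m);
    }
    // a blocking SEND would instead call buf.put(m) and wait for space
}
```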
Structure
When a program (client) makes a remote procedure
call, say p(x, y), it actually makes a local call on a
dummy procedure, or client-side stub procedure, p.
The client-side stub procedure constructs a message
containing the identity of the remote procedure and
the parameters, and then sends it to the remote machine.
A stub procedure at the server side receives the
message and makes a local call to the procedure
specified in the message.
After execution, control returns to the server stub
procedure, which returns it to the client-side
stub.
The stub procedures can be generated at compile
time or linked at run time.
Binding
Binding is the process that determines the remote
procedure, and the machine on which it will be
executed.
It may also check the compatibility of the
parameters passed and the procedure type called.
A binding server essentially stores the server
machines along with the services they provide.
Another approach to binding is for the client
to specify the machine and the service
required, with the binding server returning the port
number for communication.
Call semantics and executions of the remote procedure:

Semantics     | Success | Failure   | Partial execution
At-least-once | >= 1    | 0 or more | possible
Exactly-once  | 1       | 0 or 1    | possible
At-most-once  | 1       | 0 or 1    | none
Continued..
Even if we try to synchronize the clocks of different
systems:
These clocks can drift from physical time, and the
drift rate may vary from clock to clock due to
technological limitations.
So we end up with the same problem.
We cannot have a system of perfectly synchronized
clocks.
2. Absence of shared
memory
Due to the lack of shared memory, an up-to-date state
of the entire system is not available to any individual
process.
Such a state is necessary for reasoning about the system's
behavior,
debugging and
recovery.
Coherent means:
all processes make their observations at the same time.
Note: incoherent means that the processes do not all make their
observations at the same time.
[Figure: sites S1 (account A) and S2 (account B) connected by a communication channel, shown in three snapshots: (a) A = $500, B = $200; (b) A = $450, B = $200; (c) A = $500, B = $250.]
Happened-before relationship
[Figure: space-time diagram — events e11..e14 on process P1 and e21..e24 on process P2, plotted against global time.]
If a → b, then C(a) < C(b).
[Figure: Lamport's logical clocks — P1's events e11..e17 take clock values 1..7; P2's events e21..e25 take values 1, 2, 3, 4, 7. A receive event's clock value is max(local + 1, message timestamp + 1); e.g. e15 = max(4+1, 2+1) = 5, e23 = max(2+1, 2+1) = 3, e17 = max(6+1, 4+1) = 7 and e25 = max(4+1, 6+1) = 7.]
Total ordering: a ⇒ b iff Ci(a) < Cj(b), or Ci(a) = Cj(b) and Pi < Pj, where < is any
arbitrary relation that totally orders the processes to break
ties.
Partial Ordering
[Figure: events on processes P1, P2 and P3 with Lamport clock values — values (1), (2) on P1; (1), (3) on P2; and (1), (2), (3) on P3 — plotted against global time.]
Note: if a and b are events in different processes and C(a) < C(b),
then a → b is not necessarily true; events a and b may or may not be
causally related.
Vector Clocks
Generalize logical clocks to provide
non-causality information as well
as causality information.
Implemented with values drawn from
a partially ordered set instead of a
totally ordered set.
Assign a value V(e) to each
computation event e in an
execution such that a → b if and
only if V(a) < V(b).
Vector Timestamps Algorithm
Manipulating Vector Timestamps
Let V and W be two n-vectors of integers.
Equality: V = W iff V[i] = W[i] for all i.
Example: (3,2,4) = (3,2,4)
Less than or equal: V ≤ W iff V[i] ≤ W[i] for all i.
Example: (2,2,3) ≤ (3,2,4) and (3,2,4) ≤ (3,2,4)
Less than: V < W iff V ≤ W but V ≠ W.
Example: (2,2,3) < (3,2,4)
Incomparable: V || W iff !(V ≤ W) and !(W ≤ V).
Example: (3,2,4) || (4,1,4)
[Figure: vector clocks — events on P1 carry timestamps such as (2,0,0) and (3,4,1); events on P2 carry (0,1,0), (2,2,0), (2,3,1), (2,4,1); events on P3 carry (0,0,1), (0,0,2); plotted against global time.]
Causal Ordering of
Messages
If M1 is sent before M2, then every recipient of
both messages must get M1 before M2.
This is not guaranteed by the communication
network, since M1 may be from P1 to P2 and M2
may be from P3 to P4.
Consider a replicated database system:
updates to the entries should be received in
order!
Basic idea for message ordering:
Deliver a message only if the preceding one has
already been delivered.
Otherwise, buffer it, i.e. buffer any later message.
Violation of Causal Ordering of Messages
[Figure: space-time diagram — P1 executes Send(M1) and P2 later executes Send(M2); a receiving process gets M2 (vector timestamp (0,1,1)) before M1 (vector timestamp (0,0,1)), violating the causal ordering of messages.]
PROTOCOLS
1. Birman-Schiper-Stephenson Protocol
2. Schiper-Eggli-Sandoz Protocol
1. Birman-Schiper-Stephenson Protocol
Pi stamps each message m it sends with a
vector time VTm.
Pj, upon receiving message m from Pi
with vector time VTm, buffers it till:
VTpj[i] = VTm[i] − 1, and
for all k ≠ i, VTpj[k] >= VTm[k]
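The delivery test above can be written as code (a sketch; the class name is illustrative): message m from Pi with timestamp vtm is deliverable at Pj iff vtm[i] = vtpj[i] + 1 and vtm[k] <= vtpj[k] for every k ≠ i.

```java
class BSS {
    static boolean canDeliver(int[] vtpj, int[] vtm, int i) {
        if (vtm[i] != vtpj[i] + 1) return false;          // must be the next message from Pi
        for (int k = 0; k < vtm.length; k++)
            if (k != i && vtm[k] > vtpj[k]) return false; // a causally preceding message is missing
        return true;
    }
}
```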
BSS Algorithm
[Figure: two BSS examples — broadcast messages carry vector timestamps (e.g. M1 stamped (0,0,1) and M2 stamped (0,2,1)); a message that arrives before its causal predecessor is buffered by the message service until the delivery conditions hold.]
2. SES Protocol
SES: Schiper-Eggli-Sandoz Algorithm.
No need for broadcast messages.
Each process maintains a vector V_P of size N − 1, where N is
the number of processes in the system.
V_P is a vector of tuples (P, t): P is the destination
process id and t a vector timestamp.
E.g. V_P: (P2, <1,1,0>)
SES Algorithm
Sending a message (P1 to P2):
  Send message M, timestamped tm, along with V_P1 to P2.
  Insert (P2, tm) into V_P1, overwriting the previous value of (P2, t), if any.
  (P2, tm) itself is not sent to P2; any future message carrying (P2, tm) in its vector cannot be delivered to P2 until the message sent at tm has been delivered (tm ≤ tP2).
Delivering a message at P2:
  If V_M (the vector sent with the message) contains no pair (P2, t), the message can be delivered.
  Otherwise /* (P2, t) exists */: if t > tP2 (componentwise), buffer the message (don't deliver); else deliver it.
[SES example: P2 (clock (0,1,0), then (0,2,0)) sends M1 to P1 and M2 to P3. Before sending M1, V_P2 is empty; afterwards V_P2 = (P1, <0,1,0>). M3 from P2 to P3 carries V_P2 = (P1, <0,1,0>), which names P1 rather than P3, so P3 (clock (0,2,1), then (0,2,2)) delivers it and stores (P1, <0,1,0>) in V_P3.]
Global State
The global state of a distributed system consists of:
- the local state of each process, and
- the messages sent but not yet received (the state of the channels/queues).
Many applications need to know the state of the system, e.g., failure recovery and distributed deadlock detection.
Problem: how do we figure out the state of a distributed system, given that each process is independent and there is no global clock or synchronization?
284
Global State
Due to the absence of a global clock, local states are recorded at different times.
For global consistency, the state of a communication channel should be the sequence of messages sent before the sender's state was recorded, excluding the messages received before the receiver's state was recorded.
Local states are defined in the context of an application: a send is part of the local state if it happened before the state was recorded.
Global State Example:
1. GS = {LS1, LS2, LS3}, one local state per site.
2. {LS11, LS22, LS32} is an inconsistent GS: the receipt of M2 is recorded but its send is not.
3. {LS12, LS23, LS33} is a consistent GS: no message is recorded as received without its send being recorded.
4. {LS11, LS21, LS31} is a strongly consistent GS: in addition, every recorded send has a matching recorded receive.
[Space-time diagram: sites S1, S2, S3 with local states LS11–LS12, LS21–LS23, LS31–LS33 and messages M1, M2, M3 exchanged among the sites.]
Chandy-Lamport Snapshot Example
[Diagram: an initiator and processes p, q, r connected by channels c1–c4; markers and checkpoints are shown on each process line as the marker algorithm propagates.]
Step 1:
Process p sends out $10 to process q and then decides to initiate the global state recording algorithm:
p records its current state ($490) and sends out a marker along channel c1.
Meanwhile, process q has sent $20 to p along channel c2, and q has also sent $10 to r along channel c3.
Step 2:
Process q receives the $10 transfer (increasing its value to $480) and then receives the marker on channel c1.
Because it received a marker, process q records its state as $480 and then sends markers along each of its outgoing channels c2 and c3.
Meanwhile, process r has sent $25 along c4. Note that it does not matter whether r sent this message before or after q received the marker.
Step 3:
Process r receives the $10 transfer and the marker from channel c3.
Step 4:
Process q receives the $20 transfer on channel c1 and updates its value to $500.
Since no marker precedes it, process q does not change its already-recorded state value.
Also, process p receives the $20 transfer on channel c2. Because p is still recording channel c2, p records the $20 transfer as part of its recorded state.
Step 5:
Process p receives the $25 on channel c4.
p adds this to its recorded channel state and also changes its current value from $490 to $515 (490 + 25 = 515).
When process p receives the marker on channel c4, the state recording is complete, because all processes have received markers on all of their input channels.
The final recorded state is assembled from the recorded process states and channel states.
Snapshot Example
[Space-time diagram: processes P1, P2, P3 exchanging messages, with markers (M) propagating along the channels and a consistent cut drawn through events e1.., e2.., e3...]
1- P1 initiates snapshot: records its state (S1); sends Markers to P2 & P3;
turns on recording for channels C21 and C31
2- P2 receives Marker over C12, records its state (S2), sets state(C12) = {}
sends Marker to P1 & P3; turns on recording for channel C32
3- P1 receives Marker over C21, sets state(C21) = {a}
4- P3 receives Marker over C13, records its state (S3), sets state(C13) = {}
sends Marker to P1 & P2; turns on recording for channel C23
5- P2 receives Marker over C32, sets state(C32) = {b}
6- P3 receives Marker over C23, sets state(C23) = {}
7- P1 receives Marker over C31, sets state(C31) = {}
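The per-process marker rules used in the steps above can be sketched compactly. This is an illustrative replay of the money-transfer example under the stated FIFO-channel assumption; the `Process` class and driver below are invented for the sketch:

```python
from collections import deque

class Process:
    """Per-process rules of the marker algorithm (FIFO channels assumed)."""
    def __init__(self, name, state, in_channels, out_channels, net):
        self.name, self.state, self.net = name, state, net
        self.in_channels, self.out_channels = in_channels, out_channels
        self.recorded_state = None   # local state recorded at snapshot time
        self.channel_state = {}      # in-channel -> messages recorded in transit
        self.recording = set()       # in-channels currently being recorded

    def record_and_send_markers(self):
        self.recorded_state = self.state
        for c in self.out_channels:              # marker on every outgoing channel
            self.net[c].append("MARKER")
        self.recording = set(self.in_channels)   # start recording all in-channels
        self.channel_state = {c: [] for c in self.in_channels}

    def receive(self, channel, msg):
        if msg == "MARKER":
            if self.recorded_state is None:      # first marker: record state now;
                self.record_and_send_markers()   # this channel's state stays empty
            self.recording.discard(channel)      # a marker closes this channel
        else:
            self.state += msg                    # e.g. a money transfer arrives
            if channel in self.recording:        # in transit at snapshot time
                self.channel_state[channel].append(msg)

# Replay of the $1000 example: p --c1--> q and q --c2--> p, $500 each.
net = {"c1": deque(), "c2": deque()}
p = Process("p", 500, ["c2"], ["c1"], net)
q = Process("q", 500, ["c1"], ["c2"], net)
p.state -= 10; net["c1"].append(10)      # p sends $10 to q ...
p.record_and_send_markers()              # ... then initiates the snapshot ($490)
q.state -= 20; net["c2"].append(20)      # concurrently, q sends $20 to p
while net["c1"]: q.receive("c1", net["c1"].popleft())
while net["c2"]: p.receive("c2", net["c2"].popleft())
# Recorded states plus recorded in-transit messages account for all $1000.
assert p.recorded_state + q.recorded_state + sum(p.channel_state["c2"]) == 1000
```

Note how the snapshot ($490 + $490 + $20 in transit) equals the true total even though no two states were recorded at the same instant.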
Stability Detection
The reachability property of the snapshot algorithm is useful for detecting stable properties.
If a stable predicate is true in the state Ssnap, then we may conclude that the predicate is true in the state Sfinal.
Similarly, if the predicate evaluates to false for Ssnap, then it must also be false for Sinit.
Cut:
Cuts: a graphical representation of a global state.
Cut C = {c1, c2, .., cn} where ci is the cut event at site Si.
Theorem: A cut C is consistent iff no two cut events are causally related.
[Diagram: sites S1, S2, S3 with cut events c1, c2, c3 and messages M1, M2, M3 crossing the cut line.]
Time of a Cut
Let C = {c1, c2, .., cn} be a cut, where ci is the cut event at site Si with vector timestamp VTci.
Vector time of the cut: VTc = sup(VTc1, VTc2, .., VTcn).
sup is the component-wise maximum, i.e., VTc[i] = max(VTc1[i], VTc2[i], .., VTcn[i]).
For consistency: no message sent after the cut event at its sender may be received before the cut event at its receiver.
Theorem: A cut is consistent iff VTc = (VTc1[1], VTc2[2], .., VTcn[n]).
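The theorem above gives a direct mechanical test: a cut is consistent exactly when the component-wise maximum of the cut timestamps equals their diagonal. A sketch (the list-of-tuples encoding is an assumption for the example):

```python
def cut_consistent(cut_timestamps):
    """Cut {c1..cn} with vector timestamps VTc1..VTcn is consistent iff
    sup(VTc1, .., VTcn) == (VTc1[1], VTc2[2], .., VTcn[n]), i.e., no site
    has seen more of site i's history than site i itself had at its cut event."""
    n = len(cut_timestamps)
    sup = [max(vt[i] for vt in cut_timestamps) for i in range(n)]
    return sup == [cut_timestamps[i][i] for i in range(n)]

# Consistent: each component's maximum is achieved at the owning site.
assert cut_consistent([(2, 0, 0), (1, 3, 0), (1, 2, 4)])
# Inconsistent: site 1's cut event has seen 3 events of site 0, but site 0
# itself recorded only 2, so a message crossed the cut from right to left.
assert not cut_consistent([(2, 0, 0), (3, 3, 0), (1, 2, 4)])
```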
[Series of space-time diagrams over processes p0, p1, p2 with logical-clock values on events:
- Consistent cuts: the cut line crosses no message from right to left.
- Inconsistent cuts: messages cross from right to left of the cut, i.e., a receive lies before the cut while the corresponding send lies after it.]
Termination Detection
Termination: completion of the sequence of steps of a distributed algorithm (e.g., leader election, deadlock detection, deadlock resolution).
System Model
Termination Detection: Huang's Algorithm
Suppose the controlling agent P1 (weight 1) sends computations to P2 and P3, and P3 in turn sends computations to P4 and P5.
[Diagram: each send halves the sender's weight and gives the other half to the receiver, yielding weights P1 = 1/4, P2 = 1/2, P3 = 1/16, P4 = 1/8, P5 = 1/16. The weights always sum to 1; termination is detected when weight 1 returns to the controlling agent.]
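The weight bookkeeping of the example above can be replayed exactly with rational arithmetic. A minimal sketch; the `spawn`/`finish` helpers are names invented for the illustration:

```python
from fractions import Fraction

def spawn(weights, parent, child):
    """Parent sends a computation message to child, splitting its weight."""
    half = weights[parent] / 2
    weights[parent] = half
    weights[child] = weights.get(child, Fraction(0)) + half

def finish(weights, controller, proc):
    """proc finishes and returns its weight to the controlling agent."""
    weights[controller] += weights.pop(proc)

w = {"P1": Fraction(1)}                        # controlling agent starts with 1
spawn(w, "P1", "P2"); spawn(w, "P1", "P3")     # P1 -> P2, P1 -> P3
spawn(w, "P3", "P4"); spawn(w, "P3", "P5")     # P3 -> P4, P3 -> P5
assert w == {"P1": Fraction(1, 4), "P2": Fraction(1, 2),
             "P3": Fraction(1, 16), "P4": Fraction(1, 8),
             "P5": Fraction(1, 16)}
assert sum(w.values()) == 1                    # invariant: weights sum to 1
for proc in ["P2", "P3", "P4", "P5"]:
    finish(w, "P1", proc)
assert w == {"P1": Fraction(1)}                # weight 1 back => termination
```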
Unit II : Chapter 3
Distributed Mutual Exclusion
Introduction: Mutual Exclusion
Non-Token-based Algorithms: Lamport's Algorithm
DME Algorithms: Classification
Mutual exclusion algorithms can be grouped into 2 classes:
1. The algorithms in the first class are non-token-based.
2. The algorithms in the second class are token-based.
The execution of DME algorithms centers on the critical section (CS): a critical section is the code segment in a process in which a shared resource is accessed.
System Model
DME: 5 Requirements
1. Mutual exclusion: guarantee that only one request accesses the CS at a time.
2. Freedom from deadlocks: two or more sites should not endlessly wait for messages that will never arrive.
3. Freedom from starvation: a site should not be forced to wait indefinitely to execute the CS while other sites are repeatedly executing the CS. That is, every requesting site should get an opportunity to execute the CS in finite time.
Requirements (contd.)
4. Fairness: requests must be executed in the order they are made (or the order in which they arrive in the system). Since a physical global clock does not exist, time is determined by logical clocks. Note that fairness implies freedom from starvation, but not vice versa.
5. Fault tolerance: a mutual exclusion algorithm is fault-tolerant if, in the wake of a failure, it can reorganize itself so that it continues to function without any (prolonged) disruptions.
DME: 4 Performance Metrics
The performance of mutual exclusion algorithms is generally measured by the following four metrics:
1. The number of messages necessary per CS invocation.
2. The synchronization delay: the time required after a site leaves the CS and before the next site enters the CS.
3. The response time: the time interval between a site's request messages being sent out and the end of its CS execution.
4. System throughput: the rate at which the system executes requests for the CS.
[Timing diagrams: synchronization delay spans from "last site exits CS" to "next site enters CS"; response time spans from "request messages sent" through the CS execution time.]
Non-token-based algorithms:
Lamport's algorithm
Ricart-Agrawala algorithm
Maekawa's algorithm
Lamport's Algorithm
Lamport proposed a DME algorithm based on his clock synchronization scheme.
In Lamport's algorithm:
Every site Si keeps a queue, request_queue_i, which contains mutual exclusion requests ordered by their timestamps.
The algorithm requires messages to be delivered in FIFO order between every pair of sites.
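The per-site data structures and the two entry conditions of the algorithm can be sketched as follows. The `Site` class is an assumption of this illustration, and timestamps are (clock, site_id) pairs ordered lexicographically:

```python
import heapq

class Site:
    def __init__(self, sid, n_sites):
        self.sid, self.n = sid, n_sites
        self.clock = 0
        self.request_queue = []                   # heap of (clock, site_id)
        self.last_seen = {i: (0, i) for i in range(n_sites)}  # latest msg ts/site

    def request_cs(self):
        """Timestamp the request and enqueue it (a REQUEST would also be
        broadcast to every other site, omitted in this sketch)."""
        self.clock += 1
        ts = (self.clock, self.sid)
        heapq.heappush(self.request_queue, ts)
        return ts

    def can_enter_cs(self, my_ts):
        """L1: own request is at the head of the local queue.
        L2: a message with a larger timestamp has arrived from every other
        site (with FIFO channels, no earlier request is still in flight)."""
        head_ok = bool(self.request_queue) and self.request_queue[0] == my_ts
        replies_ok = all(self.last_seen[i] > my_ts
                         for i in range(self.n) if i != self.sid)
        return head_ok and replies_ok
```

A site releases the CS by removing its request from every queue via RELEASE messages, which lets the next-smallest timestamp proceed.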
Example:
[Step 1: Sites S1 and S2 make REQUESTs for the CS with timestamps (2,1) and (1,2) respectively.]
[Step 2: Each site places both requests, (2,1) and (1,2), in its request queue, ordered by timestamp.]
[Step 3: S2 enters the CS, since its request (1,2) has the smaller timestamp and it has received messages with larger timestamps from all other sites.]
[Step 4: S2 exits the CS and sends RELEASE messages; S1's request (2,1) now heads every queue, so S1 enters the CS.]
Performance
Requires 3(N-1) messages per CS invocation: (N-1) REQUEST, (N-1) REPLY, and (N-1) RELEASE messages. Synchronization delay is T.
Optimization
Can be optimized to require between 2(N-1) and 3(N-1) messages per CS execution by suppressing REPLY messages in certain cases.
E.g., suppose site Sj receives a REQUEST message from site Si after it has sent its own REQUEST message with a timestamp higher than the timestamp of site Si's request. In this case, site Sj need not send a REPLY message to site Si.
Suzuki-Kasami Algorithm: Requirements
A process j that wants to enter the CS broadcasts REQUEST(j, num), where num is the sequence number of the request.
[Diagram: each node keeps an array RN of the highest request sequence numbers seen; the token carries a queue Q of waiting nodes and an array LN recording the last request served for each node.]
Algorithm:
If a node wants the TOKEN, it broadcasts a REQUEST message to all other nodes.
Requesting the CS:
1. Node j requests its n-th CS invocation by sending REQUEST(j, n), n = 1, 2, 3, ... (a sequence number).
2. When node i receives the REQUEST from j, it updates RNi[j] = max(RNi[j], n), where RNi[j] is the largest sequence number received so far from node j.
Executing the CS:
3. Node i executes the CS when it has received the TOKEN, where TOKEN(Q, LN) consists of:
   Q -- the queue of requesting nodes
   LN -- an array of size N such that LN[j] is the sequence number of the request that node j executed most recently
Example
There are three processes, p1, p2, and p3. p1 and p3 seek mutually exclusive access to a shared resource.
Initially: the token is at p2, and the token's state is LN = [0, 0, 0] with Q empty;
p1's state is: n1 (seq #) = 0, RN1 = [0, 0, 0];
p2's state is: n2 = 0, RN2 = [0, 0, 0];
p3's state is: n3 = 0, RN3 = [0, 0, 0].
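The example's data structures and the token-release rule can be sketched as follows; the class names and helper functions are assumptions of this illustration:

```python
from collections import deque

class Node:
    def __init__(self, nid, n):
        self.nid, self.n = nid, n
        self.RN = [0] * n          # highest request sequence number seen per node
        self.has_token = False

class Token:
    def __init__(self, n):
        self.LN = [0] * n          # sequence number of each node's last served request
        self.Q = deque()           # queue of waiting node ids

def broadcast_request(nodes, j, num):
    """Node j broadcasts REQUEST(j, num); every node updates RN[j]."""
    for node in nodes:
        node.RN[j] = max(node.RN[j], num)

def release(token, holder):
    """On exiting the CS, the holder updates LN and enqueues every node whose
    request is outstanding (RN[k] == LN[k] + 1) and not yet queued; the head
    of Q, if any, receives the token next."""
    token.LN[holder.nid] = holder.RN[holder.nid]
    for k in range(holder.n):
        if holder.RN[k] == token.LN[k] + 1 and k not in token.Q:
            token.Q.append(k)
    return token.Q.popleft() if token.Q else None

# Replaying the example: token at p2 (index 1); p1 and p3 request (seq # 1).
nodes = [Node(i, 3) for i in range(3)]
token = Token(3)
nodes[1].has_token = True
broadcast_request(nodes, 0, 1)
broadcast_request(nodes, 2, 1)
nxt = release(token, nodes[1])     # p2 passes the token to p1; p3 waits in Q
assert nxt == 0 and list(token.Q) == [2]
```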
Correctness:
Mutual exclusion is guaranteed because there is only one token in the system and a site holds the token during CS execution.
Theorem: A requesting site enters the CS in finite time.
Proof:
Token request messages of a site Si reach other sites in finite time.
Since one of these sites will have the token in finite time, site Si's request will be placed in the token queue in finite time.
Since there can be at most N - 1 requests in front of this request in the token queue, site Si will get the token and execute the CS in finite time.
Performance:
No message is needed and the synchronization delay is zero if a site holds the idle token at the time of its request.
Otherwise it requires at most N message exchanges per CS execution ((N-1) REQUEST messages + 1 TOKEN message).
Synchronization delay is 0 or T.
Deadlock-free (because of the single-token requirement).
No starvation (a requesting site enters the CS in finite time).
Deadlocks: An Introduction
Illustrating a Deadlock
Wait-For-Graph (WFG):
[Diagram: Process 1 waits for Resource 2, which is held by Process 2.]
There are two basic deadlock models: the AND model and the OR model.
AND Model
A process blocks until all of its requests are granted; deadlock is indicated by the presence of a cycle in the WFG.
[Diagram: a WFG over P1–P5 containing a cycle.]
OR Model
A process blocks until any one of its requests is granted; deadlock is indicated by the presence of a knot in the WFG.
Knot: a subset of a graph such that, starting from any node in the subset, it is impossible to leave the knot by following the edges of the graph.
[Diagram: a WFG over P1–P6 containing a knot.]
Cycle vs Knot:
[Diagrams: a WFG over P1–P5 that contains a cycle but no knot (deadlocked under the AND model but not the OR model), and a WFG that is deadlocked under both the AND and OR models.]
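Under the AND model, the deadlock test is plain cycle detection in the WFG. A sketch, with the adjacency-dict encoding chosen here for illustration:

```python
def has_cycle(wfg):
    """DFS with three colors: GRAY marks nodes on the current path, so
    reaching a GRAY successor means a back edge, i.e., a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {p: WHITE for p in wfg}

    def dfs(p):
        color[p] = GRAY
        for q in wfg.get(p, []):
            if color[q] == GRAY or (color[q] == WHITE and dfs(q)):
                return True
        color[p] = BLACK
        return False

    return any(color[p] == WHITE and dfs(p) for p in wfg)

# P1 -> P2 -> P3 -> P1 is a cycle: deadlock under the AND model.
assert has_cycle({"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]})
# A chain with no back edge is deadlock-free.
assert not has_cycle({"P1": ["P2"], "P2": ["P3"], "P3": []})
```

The OR-model test (knot detection) is stricter: a cycle alone is not enough, since a process on the cycle with an edge leading out of it can still be unblocked.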
Distributed Deadlock Detection
Resource vs Communication
Deadlocks
Resource deadlocks:
Set of deadlocked processes,
where each process waits for a
resource held by another process
(e.g., data object in a database, I/O
resource on a server)
Communication deadlocks:
Set of deadlocked processes, where
each process waits to receive
messages (communication) from
other processes in the set.
354
Basic Issues
Deadlock detection and resolution entails addressing two basic issues: detecting existing deadlocks, and then resolving them.
A correct deadlock detection algorithm must satisfy the following two conditions:
1. Progress (no undetected deadlocks): the algorithm must be able to detect all existing deadlocks in finite time, and must remain continuously able to do so.
2. Safety (no false deadlocks): the algorithm should not report deadlocks that do not actually exist (phantom deadlocks).
Control Organisation
Centralized control:
A control site constructs wait-for graphs (WFGs) and checks for directed cycles.
The WFG can be maintained continuously, or built on demand by requesting WFGs from individual sites.
Distributed control:
The WFG is spread over different sites. Any site can initiate the deadlock detection process.
Hierarchical control:
Sites are arranged in a hierarchy; a site checks for cycles only among its descendants.
Centralized Deadlock Detection
The Completely Centralized Algorithm
The Ho-Ramamoorthy Algorithms
The completely centralized approach imposes large delays, large communication overhead, and congestion of the communication links. Moreover, its reliability is poor due to the single point of failure.
Shortcomings:
Occurrence of phantom deadlocks.
High storage and communication costs.
Example of a phantom deadlock:
[Diagram: processes P0, P1, P2 and resource S spread across systems A and B; stale state information at the central site makes a cycle appear that no longer exists.]
Distributed Deadlock Detection Algorithms
- A Path-Pushing Algorithm
- An Edge-Chasing Algorithm
- A Diffusion-Computation-Based Algorithm
- A Global-State-Detection-Based Algorithm
An Overview
All sites collectively cooperate to detect a
cycle in the state graph that is likely to be
distributed over several sites of the system.
The algorithm can be initiated whenever a
process is forced to wait.
The algorithm can be initiated either by the
local site of the process or by the site where
the process waits.
366
Obermarck's Path-Pushing Algorithm
Individual sites maintain local WFGs.
A virtual node x exists at each site; node x represents the external processes.
Detection process:
Case 1: if site Sn finds a cycle not involving x, a deadlock exists.
Case 2: if site Sn finds a cycle involving x, a deadlock is possible.
In Case 2:
Site Sn sends a message containing its detected cycles to other sites. All sites receive the message, update their WFGs, and re-evaluate the graph.
Suppose site Sj receives the message:
Site Sj checks for local cycles. If a cycle is found not involving x (of Sj), a deadlock exists.
If site Sj finds a cycle involving x, it forwards the updated message to other sites.
If a process sees its own label come back, it is part of a cycle, and a deadlock is finally detected.
Note: the algorithm can detect false deadlocks, due to asynchronous snapshots taken at different sites.
[Example over sites S1–S4: in each iteration, sites push their paths involving the external node x to other sites. After several iterations a process sees its own label come back, so it is part of a cycle and a deadlock is finally detected.]
Edge-Chasing Algorithm:
Let Pi be the initiator.
if Pi is locally dependent on itself then
    declare a deadlock
else
    send probe(i, j, k) to the home site of Pk for each j, k such that all of the following hold:
        Pi is locally dependent on Pj
        Pj is waiting on Pk
        Pj and Pk are on different sites
On receipt of probe(i, j, k): if Pk is blocked and the probe's conditions hold, Pk propagates the probe along its own wait-for edges. Deadlock is declared when the initiator Pi receives a probe of the form probe(i, j, i), i.e., its own probe comes back to it.
A Book Example:
[Diagram: processes P1–P10 spread over sites S1, S2, S3. The initiator P1's probes travel along cross-site wait-for edges, e.g. probe(1,3,4), probe(1,6,8), probe(1,7,10); probe(1,9,1) eventually returns to P1, so P1 declares a deadlock.]
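Probe propagation can be sketched with a simple worklist over wait-for edges. This is an illustrative simplification (it collapses the per-site bookkeeping into one graph); the encoding is an assumption of the sketch:

```python
def edge_chase(initiator, waits_for):
    """Propagate probe(i, j, k) along wait-for edges; deadlock is declared
    when a probe of the form probe(i, j, i) reaches the initiator again."""
    probes = [(initiator, initiator, k) for k in waits_for.get(initiator, [])]
    visited = set()
    while probes:
        i, j, k = probes.pop()
        if k == i:
            return True, (i, j, k)           # probe returned: deadlock
        if k in visited:
            continue                          # each process propagates once
        visited.add(k)
        probes.extend((i, k, m) for m in waits_for.get(k, []))
    return False, None

# Cycle P1 -> P2 -> P3 -> P1 (spread over sites): probe (1, 3, 1) comes back.
deadlocked, probe = edge_chase("P1", {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]})
assert deadlocked and probe == ("P1", "P3", "P1")
```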
Diffusion-Computation-Based Algorithm (OR model):
reply(i, k, j): a reply to a query, where i is the initiator.
numi(k): the number of query messages sent by process i to process k.
ALGORITHM
Initiation: process Pi sends query(i, i, j) to every Pj on which Pi depends.
On receipt of query(i, j, k) by Pk:
    if Pk is not blocked, discard the query
    else (Pk is blocked):
        if this is an engaging query (the first received for this computation),
            propagate query(i, k, m) to every Pm in Pk's dependent set
        else if Pk has not been continuously blocked since its engagement,
            discard the query
        else send reply(i, k, j) to Pj
On receipt of reply(i, j, k) by Pk:
    if this is not the last awaited reply,
        just decrement the awaited-reply count: numk(i) = numk(i) - 1
    else send reply(i, k, m) to the engaging process Pm
When the initiator receives its last awaited reply, a knot has been detected and the initiator is deadlocked.
Distributed File Systems: Architecture
Files can be stored at any machine, and computation can be performed at any machine.
A machine can access a file stored on a remote machine, where the file access operations (such as reads) are performed and the data is returned.
Alternatively, dedicated file servers are provided for storing files and performing storage and retrieval operations.
The two most important services in a DFS are the name server and the cache manager.
Architecture of DFS
A cache manager can be present at both the client and the file server.
The cache manager at the client reduces access delay due to network latency.
The cache manager at the server caches files in main memory to reduce delay due to disk latency.
Mechanisms for building distributed file systems:
1. Mounting
2. Caching
3. Hints
5. Encryption
1. Mounting
Mounting allows the binding together of different filename spaces to form a single hierarchically structured name space.
A name space (or collection of files) can be bound to, or mounted at, an internal node or a leaf node of the name-space tree.
A node onto which a name space is mounted is known as a mount point.
The kernel maintains a structure called the mount table, which maps mount points to the appropriate storage devices.
2. Caching
Caching reduces delays in accessing data by exploiting the temporal locality of reference exhibited by programs.
Data can be cached either in main memory or on the local disk of the client.
Data is also cached in main memory at the server (the server cache) to reduce disk access latency.
Caching increases system performance, reduces the frequency of access to the file server and the communication network, and improves the scalability of the file system.
3. Hints
An alternative to cached data that overcomes the inconsistency problem when multiple clients access shared data.
Hints also help in recovery when invalid cached data is discovered.
For example, after the name of a file or directory is mapped to its physical data, the address of the object can be stored as a hint in the cache. If the cached address fails to map to the object, it is discarded and the name is resolved afresh.
5. Encryption
Used to enforce security in distributed systems: two entities wishing to communicate establish a key for the conversation.
It is important to note that the conversation key is determined by the authentication server, which never sends it in plain text to either of the entities.
Design Issues
1. Naming and name resolution
A name refers to an object (e.g., a file or a directory).
Name resolution refers to the process of mapping a name to an object or, in the case of replication, to multiple objects.
A name space is a collection of names which may or may not share an identical resolution mechanism.
Three approaches to naming files in a distributed environment:
- Concatenating the host name to the (unique) names of the files stored on that server.
- Mounting remote directories onto local directories (Sun NFS).
- Maintaining a single global directory structure (Sprite and Apollo).
Name Server:
A name server resolves names in a distributed system.
It is the process that maps a name specified by a client to the stored object, such as a file or a directory.
Clients can send their queries to a single name server, which maps each name to its object.
Drawbacks: single point of failure, performance bottleneck.
An alternative is to have several name servers, e.g., Domain Name Servers, where replication of the mapping tables achieves fault tolerance and high performance.
3. Writing Policy
The writing policy decides when a modified cache block at a client should be transferred to the server.
Write-through policy: all writes requested by applications at clients are also carried out at the server immediately.
Delayed-writing policy: modifications are reflected to the server only after a delay, batching the writes.
4. Cache Consistency
Two approaches guarantee that the data returned to the client is valid:
Server-initiated approach: the server informs cache managers whenever the data in client caches becomes stale. Cache managers at the clients can then retrieve the new data or invalidate the blocks containing the old data.
Client-initiated approach: it is the responsibility of the cache managers at the clients to validate data with the server before returning it to the client.
Both are expensive, since the communication cost is high.
Alternative approach:
Concurrent-write sharing: a file is open at multiple clients and at least one has it open for writing. When this occurs for a file, the file server informs all the clients to flush their cached data items belonging to that file.
Major issue:
Sequential-write sharing causes cache inconsistency when:
- a client opens a file while it still has outdated blocks in its cache, or
- a client opens a file whose current data blocks are still in another client's cache, waiting to be flushed (as can happen under a delayed-writing policy).
5. Availability
Immunity to the failure of the server or the communication network.
Issue: what level of availability do files in a distributed file system have?
Resolution: use replication to increase availability, i.e., maintain many copies (replicas) of files at different sites/servers.
Replication is expensive because of the extra storage space required and the overhead incurred in keeping all the replicas up to date.
Replication issues:
- How to keep replicas consistent?
- How to detect inconsistency among replicas?
Causes of Inconsistency:
Unit of Replication:
- File
- Group of files
  a) Volume: a group of all files of a user or group, or all files in a server.
     Advantage: ease of implementation.
     Disadvantage: wasteful, since a user may need only a subset replicated.
  b) Primary pack vs. pack:
     Primary pack: all files of a user.
     Pack: a subset of the primary pack; each pack can receive a different degree of replication.
Replica Management:
Deals with the maintenance of replicas and with using them to provide increased availability; it is concerned with consistency among replicas.
- Weighted voting scheme (e.g., the Roe File System): reads and writes collect votes, and timestamps are maintained to identify the latest updates.
- Designated-agents scheme (e.g., Locus): one or more processes/sites (also called current synchronization sites) are designated as agents controlling access to the replicas of files.
- Backup-servers scheme (e.g., the Harp File System): one designated site acts as the primary, and the other replicas act as backups.
6. Scalability
The suitability of a system's design to cater to the demands of a growing system.
As the system grows larger, both the size of the server state and the load due to invalidations increase.
The structure of the server process also plays a major role in deciding how many clients a server can support:
- If the server is designed as a single process, many clients have to wait a long time whenever a disk I/O is initiated.
- These waits can be avoided if a separate process is assigned to each client.
- An alternative is to use lightweight processes (threads).
7. Semantics
The semantics of a file system characterize the effects of accesses on files.
Expected semantics: a read returns the data stored by the latest write.
One option to guarantee these semantics is to make all reads and writes from the various clients go through the server.
Disadvantage: communication overhead.
Case Study: Sun Network File System (NFS)
Basic Design
Three important parts:
- The protocol
- The client side
- The server side
1. Protocol
Uses the Sun RPC mechanism and the Sun eXternal Data Representation (XDR) standard.
Defined as a set of remote procedures.
The protocol is stateless:
- Each procedure call contains all the information necessary to complete the call.
- The server maintains no state between calls.
2. Client side
Provides a transparent interface to NFS.
Mapping between remote file names and remote file addresses is done through remote mount:
- An extension of UNIX mounts.
- Specified in a mount table.
3. Server side
The server implements a write-through policy, as required by statelessness: any blocks modified by a write request must be written back to disk before the call completes.
NFS Architecture
1. System call interface layer: presents sanitized, validated requests in a uniform way to the VFS.
2. Virtual file system (VFS) layer: routes each request either to the local file system or to the NFS client.
NFS (Cont.)
Caching:
- File blocks are cached on demand, with the timestamp of when the file was last modified on the server.
- An entire file is cached, if under a certain size, with the timestamp of when it was last modified.
- After a certain age, blocks have to be re-validated with the server.
- Delayed writing policy: modified blocks are flushed to the server after a certain delay.
- Cached attributes are updated when new attributes are received from the server and discarded after a certain time.
Stateless Server:
- File access requests from clients contain all the needed information (pointer position, etc.).
- Servers keep no record of past requests.
DSM Architecture
[Diagram: nodes 1..n, each with local memory and a mapping manager, connected by a communication network; together the memories form a shared virtual address space.]
Architecture of DSM
- Programs access data in the shared address space just as they access data in traditional virtual memory.
- Data moves between main memory and secondary memory (within a node) and between the main memories of different nodes.
- Each data object is owned by a node:
  - The initial owner is the node that created the object.
  - Ownership can change as the object moves from node to node.
- When a process accesses data in the shared address space, the mapping manager maps the shared memory address to physical memory (local or remote).
- Mapping manager: a layer of software, perhaps bundled with the OS or provided as a runtime library.
7. Programs written for shared-memory multiprocessors can be run on DSM systems with minimal changes.
Issues
- How to keep track of the location of remote data.
- How to minimize communication overhead when accessing remote data.
- How to access remote data concurrently at several nodes.
Types of algorithms:
1. Central-server
2. Data migration
3. Read-replication
4. Full-replication
Possible solutions (to the central-server bottleneck):
- Partition the shared data between several servers.
- Use a mapping function to distribute/locate data.
3. Read-replication Algorithm:
Extends the migration algorithm: data is replicated at multiple nodes for read access.
Write operation: only one node has write access (multiple-readers, one-writer protocol).
After a write, all copies of the shared data at other nodes are invalidated (or updated with the modified value).
[Diagram: node i requests write access to data replicated at node j; the DSM invalidates the other replicas before the write proceeds.]
The DSM must keep track of the location of all copies of shared data. Read cost is low; write cost is higher.
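The bookkeeping behind read-replication can be sketched as a tiny directory that tracks replica holders and computes the invalidation set on a write. The `Directory` class is an assumption of this illustration, not a real DSM API:

```python
class Directory:
    """Tracks, per data item, which nodes currently hold a replica."""
    def __init__(self):
        self.copies = {}        # data item -> set of nodes holding a replica

    def read(self, item, node):
        """A read replicates the item at the reading node."""
        self.copies.setdefault(item, set()).add(node)

    def write(self, item, node):
        """Before the write proceeds, every other replica is invalidated;
        returns the set of nodes that must be sent INVALIDATE messages."""
        invalidated = self.copies.get(item, set()) - {node}
        self.copies[item] = {node}          # single writer's copy remains
        return invalidated

d = Directory()
d.read("x", "N1"); d.read("x", "N2"); d.read("x", "N3")
assert d.write("x", "N1") == {"N2", "N3"}   # reads are cheap, writes pay
assert d.copies["x"] == {"N1"}              # only the writer's copy survives
```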
4. Full-replication Algorithm:
An extension of the read-replication algorithm: multiple nodes can read and multiple nodes can write (multiple-readers, multiple-writers protocol).
Memory Coherence
Memory is said to be coherent when the value returned by a read operation is the value the programmer expects (e.g., the value of the most recent write).
In DSM, memory coherence is maintained when the coherence protocol is chosen in accordance with a consistency model.
A mechanism that controls/synchronizes accesses is needed to maintain memory coherence, based on one of the following models:
1. Strict consistency: requires a total ordering of requests, where a read returns the most recent write to the location.
2. Sequential consistency: a system is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
3. General consistency: all copies of a memory location (replicas) eventually contain the same data once all writes issued by every processor have completed.
4. Processor consistency: operations issued by a processor are performed in the order in which they are issued.
5. Weak consistency: synchronization operations are guaranteed to be sequentially consistent.
6. Release consistency: provides acquire and release synchronization operations, with ordinary accesses required to be consistent only at these synchronization points.
Coherence Protocols
Issues:
- How do we ensure that all replicas have the same information?
- How do we ensure that nodes do not access stale (old) data?
1. Write-invalidate protocol
- Invalidate (nullify) all copies except the one being modified before the write can proceed.
- Once invalidated, data copies cannot be used.
- Advantage: good performance when there are many updates between reads and good per-node locality of reference.
- Disadvantage: invalidations are sent to all nodes that have copies; inefficient if many nodes access the same object.
2. Write-update protocol
- Causes all copies of the shared data to be updated.
- More difficult to implement; guaranteeing consistency may be harder, as reads may happen between write-updates.
Examples of implementations of memory coherence:
1. Cache coherence in the PLUS system.
2. Type-specific memory coherence in the Munin system (based on process synchronization).
3. Unified synchronization and data transfer in Clouds.
PLUS: Read/Write Operations
Read operation: performed locally when the page is replicated at the reading node; otherwise forwarded by the MCM to a node holding a copy.
PLUS Write-update Protocol (distributed copy list):
[Diagram: page X is replicated on nodes 1, 2, and 3; each replica's page-table entry records the master copy's location and a next-copy pointer, forming a chain (master on node 2, then node 3, then nil). A write at node 4 is sent to the master, propagated down the copy list (update X), and finally acknowledged by the MCM once the update is complete.]
PLUS: Protocol
- The node issuing a write is not blocked on the write operation.
- However, a read on a location being written into is blocked until the whole update is completed (i.e., pending writes are remembered).
- Strong ordering within a single processor, independent of replication (in the absence of concurrent writes by other processors), but not with respect to another processor.
- write-fence operation: provides strong ordering with synchronization among processors; the MCM waits for previous writes to complete.
Design Issues
1. Granularity: the size of the shared memory unit.
2. Page replacement:
- Needed because physical/main memory is limited.
- Data may be used in many modes: shared, private, read-only, writable, etc.
- A Least Recently Used (LRU) policy cannot be used directly in DSMs that support data movement; modified policies are more effective:
  - Private pages may be removed ahead of shared ones, since shared pages would have to be moved across the network.
  - Read-only pages can simply be deleted, as their owners keep a copy.
Distributed Scheduling: Introduction
Good resource allocation schemes are needed to fully utilize the computing capacity of a distributed system.
A distributed scheduler is a resource management component of a distributed operating system.
It focuses on judiciously and transparently redistributing the load of the system among the computers, with the goal of maximizing the overall performance of the system.
It is most suitable for distributed systems based on LANs.
Classification of Load Distributing Algorithms (LDA)
The basic function is to transfer load from heavily loaded systems to idle or lightly loaded systems.
1. Transfer policy
Determines whether a processor is a sender or a receiver:
- Sender: an overloaded processor.
- Receiver: an underloaded processor.
Threshold-based transfer:
- Establish a threshold, expressed in units of load.
- When a new task originates on a processor, if the load on that processor exceeds the threshold, the transfer policy decides that the processor is a sender.
- When the load at a processor falls below the threshold, the transfer policy decides that the processor can be a receiver.
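A threshold-based transfer policy reduces to a one-line classification. A minimal sketch; the threshold value and the queue-length load metric are assumptions for the example:

```python
THRESHOLD = 4   # load threshold, in units of CPU-queue length (assumed)

def classify(queue_length, threshold=THRESHOLD):
    """A node above the threshold is a sender (tries to shed new tasks);
    one below it is a receiver (willing to accept transferred tasks)."""
    if queue_length > threshold:
        return "sender"
    if queue_length < threshold:
        return "receiver"
    return "neutral"

assert classify(6) == "sender"
assert classify(1) == "receiver"
assert classify(4) == "neutral"
```

In practice the sender/receiver decision is re-evaluated on each task arrival or departure, so a node's role changes dynamically with its load.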
2. Selection Policy
Selects which task to transfer. Good candidates are tasks that are:
- Newly originated (simple: the task has just started).
- Long-running (the response-time improvement compensates for the transfer overhead).
- Small in size.
- Making minimal location-dependent system calls (residual bandwidth is minimized).
- Of lowest priority.
Priority assignment policies:
- Selfish: local processes are given priority.
- Altruistic: remote processes are given priority.
- Intermediate: priority is assigned based on the ratio of local to remote processes in the system.
3. Location Policy
Once the transfer policy designates a processor as a sender, the location policy finds a receiver (or, once a processor is designated a receiver, it finds a sender).
Polling: one processor polls another to find out whether it is a suitable partner for load distribution, selecting the processor to poll either:
- randomly,
- based on information collected in previous polls, or
- on a nearest-neighbor basis.
Processors can be polled either serially or in parallel (e.g., multicast).
There is usually some limit on the number of polls; if that number is exceeded, the load distribution is not done.
4. Information Policy
Decides:
When information about the state of other
processors should be collected
Where it should be collected from
What information should be collected
Demand-driven
A processor collects the state of the other processors only when it becomes either a sender or a receiver (based on the transfer and selection policies)
Dynamic: driven by the system state
Sender-initiated: senders look for receivers to transfer load onto
Receiver-initiated: receivers solicit load from senders
454
Symmetrically initiated: a combination in which both senders look for receivers and receivers solicit load from senders
Periodic
- Processors exchange load information at periodic
intervals.
- Based on information collected, transfer policy on a
processor may decide to transfer tasks.
- Does not adapt to the system state: collects the same information (overhead) at high system load as at low system load.
State-change-driven
Processors propagate state information whenever their state changes by a certain degree.
Differs from demand-driven in that a processor
propagates information about its own state, rather
than collecting information about the state of other
processors.
May send to central collection point or may send to
peers.
455
Stability
The two views of stability are:
The Queuing-Theoretic Perspective
A system is termed unstable if the CPU queues grow without bound, which happens when the long-term arrival rate of work to the system is greater than the rate at which the system can perform work.
The Algorithmic Perspective
An algorithm is unstable if it can perform fruitless actions indefinitely with finite probability, e.g. processor thrashing, where a task is transferred from node to node without ever being executed.
Load Distributing Algorithms:
Sender-Initiated Algorithms
Receiver-Initiated Algorithms
Symmetrically Initiated Algorithms
Adaptive Algorithms
457
1. Sender-Initiated
Algorithms
Activity is initiated
by an overloaded node
(sender)
A task is sent to an underloaded node
(receiver)
CPU queue threshold T is decided for all
nodes
Transfer Policy
A node is identified as a sender if a
new task originating at the node
makes the queue length exceed a
threshold T.
Selection Policy
Only newly arrived tasks are considered for transfer
458
Location Policy
Random: a dynamic location policy (select any node to transfer the task to at random).
The selected node X may be overloaded.
If the transferred task is treated as a new arrival, X may transfer the task again.
No prior information exchange.
Effective under light-load conditions.
Threshold: poll nodes until a receiver is found.
Up to PollLimit nodes are polled.
If none is a receiver, the sender keeps the task and executes it locally.
Shortest: among the polled nodes that were found to be receivers, select the one with the shortest queue.
Information Policy
A demand-driven type
Stability
The location policies adopted cause system instability at high loads
459
[Flowchart: sender-initiated load sharing. When a task arrives and QueueLength + 1 > T, the node selects a node i at random (not already in its poll set) and polls it; if QueueLength at i < T, the task is transferred to i; otherwise polling continues while the number of polls is below PollLimit, after which the task is queued locally.]
460
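The threshold transfer policy and polling location policy above can be sketched as follows. This is an illustrative model only: the threshold T, PollLimit and the node/queue representation are assumptions, not from the slides.

```python
import random

T = 3           # queue-length threshold (assumed value)
POLL_LIMIT = 3  # maximum number of polls (assumed value)

def place_task(local, nodes, queue_len):
    """Return the node chosen to run a task arriving at `local`;
    `queue_len` maps each node to its current CPU queue length."""
    if queue_len[local] + 1 <= T:            # transfer policy: not a sender
        queue_len[local] += 1
        return local
    others = [n for n in nodes if n != local]
    polled = set()
    while len(polled) < min(POLL_LIMIT, len(others)):
        i = random.choice([n for n in others if n not in polled])
        polled.add(i)                         # poll node i
        if queue_len[i] < T:                  # i is a receiver: transfer
            queue_len[i] += 1
            return i
    queue_len[local] += 1                     # no receiver found: queue locally
    return local
```

Note how the sender falls back to local execution when PollLimit is exhausted, which is exactly why the policy degrades (and can become unstable) at high system loads.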
2. Receiver-Initiated
Algorithms
Initiated from an underloaded node (receiver) to obtain a task from an overloaded node (sender)
Transfer Policy
Selection Policy
Location Policy
Information Policy
A demand-driven type
Stability
[Flowchart: receiver-initiated load sharing. When a task departs at node j and j's QueueLength < T, node j selects a node i at random (not already in its poll set) and polls it; if QueueLength at i > T, a task is transferred from i to j; otherwise polling continues while the number of polls is below PollLimit, after which j waits for a predetermined period before trying again.]
462
3. Symmetrically Initiated
Algorithms
463
Location policy
465
4. Adaptive Algorithms
1. A Stable Symmetrically Initiated Algorithm
Utilizes the information gathered during polling to classify
the nodes in the system as either Sender, Receiver or OK.
The knowledge concerning the state of nodes is maintained by a data structure at each node, comprising a senders list, a receivers list, and an OK list.
Initially, each node assumes that every other node is a
receiver.
Transfer Policy
Location Policy
467
468
471
Question bank
1. What are the central issues in load distributing?
2. What are the components of load distributing
algorithm?
3. Differentiate between load balancing & load
sharing.
4. Discuss the Above-average load sharing
algorithm.
5. How will you select a suitable load sharing algorithm?
6. Write short note on (expected any one)
Sender-Initiated Algorithms
Receiver-Initiated Algorithms
Symmetrically Initiated Algorithms
Adaptive Algorithms
472
Unit IV
Chapter 1: Transaction and Concurrency:
Introduction
Transactions
Nested Transactions
Methods of Concurrency Control:
Locks,
Optimistic concurrency control ,
Time Stamp Ordering,
Transaction concept
Transaction: Specified by a client as a set of
operations on objects to be performed as an
indivisible unit where the servers manage those
objects.
Goal of transaction: Ensure all the objects managed
by a server remain in a consistent state when
accessed by multiple transactions (client side) and
in the presence of server crashes.
Objects that can be recovered after the server crashes are called recoverable objects.
Objects on a server are stored in volatile memory (RAM) or in persistent storage (disk)
Enhance reliability
Recovery from failures
Record in permanent storage
476
Banking example on a single server; nested transactions; methods of concurrency control.
Note: each account is represented by a remote object whose interface, Account, provides the operations on the account.
Note: each branch of the bank is represented by a remote object whose interface, Branch, provides the operations on the branch.
The client works on behalf of users: it looks up accounts via the Branch interface and can then perform operations of the Account interface.
478
Transactions
The transaction concept comes originally from database management systems.
Clients require a sequence of separate
requests to a server to be atomic in the
sense that:
They are free from interference by operations
being performed on behalf of other concurrent
clients; and
Either all of the operations must be completed
successfully or they must have no effect at all in
the presence of server crashes.
484
Transaction T:
    a.withdraw(100);
    b.deposit(100);
    c.withdraw(200);
    b.deposit(200);
This is called an atomic transaction.
485
Consistency
A transaction takes the system from one
consistent state to another consistent state
The state during a transaction is invisible to other transactions.
Isolation
Serially equivalent or serializable.
Durability
Successful transactions are saved and are recoverable.
487
Use a transaction
Transaction coordinator
Each transaction is created and
managed by a coordinator
Result of a transaction
Success
Aborted
Initiated by client
Initiated by server
488
Aborted by client
Aborted by server
[Figure: three transaction life histories. (1) Successful: openTransaction; operations; closeTransaction. (2) Aborted by client: openTransaction; operations; abortTransaction. (3) Aborted by server: openTransaction; operations; the server aborts the transaction; a later operation returns ERROR to the client.]
491
1. Concurrency control
Problems of concurrent transaction
The lost update problem
Inconsistent retrievals
Conflict in operations
492
Transaction T:
    balance = b.getBalance();
    b.setBalance(balance*1.1);
    a.withdraw(balance/10)
Transaction U:
    balance = b.getBalance();
    b.setBalance(balance*1.1);
    c.withdraw(balance/10)
Interleaving: T: balance = b.getBalance() -> $200; U: balance = b.getBalance() -> $200; T: b.setBalance(balance*1.1) -> $220; U: b.setBalance(balance*1.1) -> $220 (T's update to b is lost); T: a.withdraw(balance/10) -> $80; U: c.withdraw(balance/10) -> $280
493
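The lost-update interleaving can be replayed in a few lines. This is a minimal sketch with assumed starting balances (a = $100, b = $200, c = $300) and an Account class invented for illustration:

```python
class Account:
    def __init__(self, balance):
        self.balance = balance
    def get_balance(self):
        return self.balance
    def set_balance(self, amount):
        self.balance = amount
    def withdraw(self, amount):
        self.balance -= amount

a, b, c = Account(100), Account(200), Account(300)

bal_t = b.get_balance()      # T reads $200
bal_u = b.get_balance()      # U also reads $200, before T writes back
b.set_balance(bal_t * 1.1)   # T writes $220
b.set_balance(bal_u * 1.1)   # U writes $220: T's interest update is lost
a.withdraw(bal_t / 10)       # T: a drops to $80
c.withdraw(bal_u / 10)       # U: c drops to $280
# A serial execution of T then U would have left b at $242, not $220.
```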
Transaction V:
    a.withdraw(100)
    b.deposit(100)
Transaction W:
    aBranch.branchTotal()
Interleaving: V: a.withdraw(100) -> $100; W: total = a.getBalance() -> $100; total = total + b.getBalance() -> $300; total = total + c.getBalance(); then V: b.deposit(100) -> $300. W's total omits the $100 in transit from a to b.
494
Serial equivalence
What is serial equivalence?
An interleaving of the operations of
transactions in which the combined
effect is the same as if the
transactions had been performed one
at a time in some order.
Significance
The criterion for correct concurrent
execution
Avoid lost update and inconsistent
retrieval
496
Transaction T:
    balance = b.getBalance()
    b.setBalance(balance*1.1)
    a.withdraw(balance/10)
Transaction U:
    balance = b.getBalance()
    b.setBalance(balance*1.1)
    c.withdraw(balance/10)
A serially equivalent interleaving: T: balance = b.getBalance() -> $200; T: b.setBalance(balance*1.1) -> $220; U: balance = b.getBalance() -> $220; U: b.setBalance(balance*1.1) -> $242; T: a.withdraw(balance/10) -> $80; U: c.withdraw(balance/10) -> $278
497
Transaction V:
    a.withdraw(100);
    b.deposit(100)
Transaction W:
    aBranch.branchTotal()
A serially equivalent interleaving: V: a.withdraw(100) -> $100; V: b.deposit(100) -> $300; then W: total = a.getBalance() -> $100; total = total + b.getBalance() -> $400; total = total + c.getBalance(); ...
498
Conflicting operations
When we say a pair of operations conflicts, we mean that their combined effect depends on the order in which they are executed, e.g. read and write.

Operations      Conflict   Reason
read, read      No         The effect of a pair of read operations does not depend on their order
read, write     Yes        The effect depends on their order
write, write    Yes        The effect depends on their order

Transaction T:              Transaction U:
    x = read(i)                 y = read(j)
    write(i, 10)                write(j, 30)
    write(j, 20)                z = read(i)
501
502
Dirty Reads
The isolation property of transaction
requires that the transaction do not see
the uncommitted state of the other
transaction.
The dirty read problem is caused by the
interaction between the read operation in
one transaction and an earlier write
operation in another transaction
503
Transaction T:                        Transaction U:
    a.getBalance();                       a.getBalance();
    a.setBalance(balance+10);             a.setBalance(balance+20);
Interleaving: T: balance = a.getBalance() -> $100; T: a.setBalance(balance+10) -> $110; U: balance = a.getBalance() -> $110; U: a.setBalance(balance+20) -> $130; U: commit transaction; T: abort transaction. U has performed a dirty read of T's uncommitted value and has already committed, so it cannot be undone when T aborts.
Premature writes
This one is related to the interaction
between the write operations on the same
object belonging to different transactions.
It uses the concept of before image on write
operation.
505
Transaction T:                Transaction U:
    a.setBalance(105);            a.setBalance(110);
Interleaving: a starts at $100; T: a.setBalance(105) -> $105; U: a.setBalance(110) -> $110. If T now aborts and restores its before image ($100), U's update is overwritten prematurely.
Tentative versions: update operations performed during a transaction are done in tentative versions of objects in volatile memory.
506
Nested Transactions
Several transactions may be started from within a transaction, allowing transactions to be regarded as modules that can be composed.
[Figure: a nested transaction tree. The top-level transaction T opens subtransactions T1 and T2; T1 opens T11 and T12; T2 opens T21, which opens T211. Each subtransaction completes by provisionally committing (prov. commit) or aborting; in the figure T1, T11, T12 and T211 provisionally commit while T21 aborts.]
508
A child completes by either provisionally committing or aborting. If a parent aborts, all of its children are aborted.
509
When the top-level transaction commits, all of its provisionally committed descendants commit as well.
511
1.Locks
A simple example of a serializing
mechanism is the use of exclusive locks.
Server can lock any object that is about to
be used by a client.
If another client wants to access the same object, it has to wait until the object is unlocked.
512
Transaction T:                          Transaction U:
    bal = b.getBalance()                    bal = b.getBalance()
    b.setBalance(bal*1.1)                   b.setBalance(bal*1.1)
    a.withdraw(bal/10)                      c.withdraw(bal/10)

Operations and locks:
T: openTransaction; bal = b.getBalance(): locks B.
U: openTransaction; bal = b.getBalance(): waits for T's lock on B.
T: b.setBalance(bal*1.1); a.withdraw(bal/10): locks A.
T: closeTransaction: unlocks A, B.
U: lock on B granted; b.setBalance(bal*1.1); c.withdraw(bal/10): locks C.
U: closeTransaction: unlocks B, C.
513
Two-Phase Locking
Basic 2PL: [Figure: number of locks held against time. Phase 1 (from BEGIN, the growing phase): locks are obtained, up to the lock point; Phase 2 (the shrinking phase, to END): locks are released. No lock is obtained after the first lock is released.]
515
Strict 2PL: [Figure: number of locks held against time. All locks are held for the whole transaction duration, from BEGIN until commit or abort.]
516
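The exclusive-lock behaviour under strict 2PL can be sketched with a tiny lock manager. The class and its methods are assumptions for illustration; a real server would block the caller rather than return False:

```python
class LockManager:
    def __init__(self):
        self.owner = {}            # object id -> transaction holding its lock

    def lock(self, obj, txn):
        """Grant an exclusive lock, or report that the caller must wait."""
        holder = self.owner.get(obj)
        if holder is not None and holder != txn:
            return False           # conflicting holder: caller waits
        self.owner[obj] = txn
        return True

    def release_all(self, txn):
        """Strict 2PL: all locks are released only at commit or abort."""
        for obj in [o for o, t in self.owner.items() if t == txn]:
            del self.owner[obj]

lm = LockManager()
assert lm.lock("B", "T")       # T locks B
assert not lm.lock("B", "U")   # U must wait for T's lock on B
lm.release_all("T")            # T commits: unlock everything
assert lm.lock("B", "U")       # now U can lock B
```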
Lock Rules
Lock granularity: as small as possible, to enhance concurrency.
Lock compatibility:
If a transaction T has already performed a read operation on an object, then a concurrent transaction U must not write that object until T commits or aborts.
If a transaction T has already performed a write operation on an object, then a concurrent transaction U must not read or write that object until T commits or aborts.
517
518
Definition:
Deadlocks
Prevention:
Detection:
Timeouts:
520
Hierarchic locks: mixed-granularity locks are used, e.g. a Branch at the upper level and Accounts below it.
521
Lock compatibility (read, write and commit locks):
                     Lock to be set
Lock already set     read     write    commit
none                 OK       OK       OK
read                 OK       OK       wait
write                OK       wait     -
commit               wait     wait     -
522
Lock compatibility for hierarchic (intention) locks:
                     Lock to be set
Lock already set     read     write    I-read   I-write
none                 OK       OK       OK       OK
read                 OK       wait     OK       wait
write                wait     wait     wait     wait
I-read               OK       wait     OK       OK
I-write              wait     wait     OK       OK
523
Drawbacks of locking:
Lock maintenance represents an overhead that
is not present in systems that do not support
concurrent access to shared data.
Deadlock. Deadlock prevention reduces
concurrency. Deadlock detection or timeout not
wholly satisfactory for use in interactive
programs.
To avoid cascading aborts, locks cannot be released until the end of the transaction, which reduces potential concurrency.
524
2. Optimistic Concurrency Control
Basic Idea
Since the validation and update phases are short, only one transaction at a time is in them.
Conflict rules
Validation uses the read-write conflict rules to ensure that the
scheduling of a particular transaction is serially equivalent
with respect to all other overlapping transactions.
E.g. :Tv is serializable with respect to an overlapping
transaction Ti , their operations must conform to the following
rules.
527
Tv       Ti       Rule
write    read     1. Ti must not read objects written by Tv
read     write    2. Tv must not read objects written by Ti
write    write    3. Ti must not write objects written by Tv, and Tv must not write objects written by Ti
Forms of Validation
1. Backward Validation: checks the transaction undergoing validation against other preceding overlapping transactions, i.e. those that entered the validation phase before it.
2. Forward Validation: checks the transaction undergoing validation against other later transactions, which are still active (lagging behind in their respective validation phases).
529
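Backward validation reduces to a set-intersection test: Tv passes only if its read set does not intersect the write set of any committed transaction that overlapped it. A sketch, where the read/write-set representation is an assumption for illustration:

```python
def backward_validate(read_set, overlapping_write_sets):
    """Rule 2: Tv must not have read objects written by an earlier
    overlapping (committed) transaction Ti."""
    for write_set in overlapping_write_sets:
        if read_set & write_set:
            return False   # conflict: Tv must abort
    return True

# Tv read {A, B}; committed overlapping transactions wrote {C} and {B}.
assert not backward_validate({"A", "B"}, [{"C"}, {"B"}])
assert backward_validate({"A"}, [{"C"}, {"B"}])
```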
[Figure: validation of transactions. Transaction Tv, with its working, validation and update phases, is being validated. Backward validation checks Tv against earlier committed transactions T1, T2 and T3; forward validation checks Tv against later transactions (active 1, active 2) that are still active.]
530
startTn: the biggest transaction number assigned (to some other committed transaction) at the time when transaction Tv started its working phase.
finishTn: the biggest transaction number assigned (to some other committed transaction) at the time when Tv entered the validation phase.
Rule 2: automatically fulfilled, because the active transactions do not write until after Tv has completed.
533
Forward validation
Overhead of time: to validate a transaction, it must wait until all active transactions have finished.
535
3. Timestamp Ordering
Basic Idea:
Each transaction has a Timestamp (TS) associated with it.
TS is not necessarily real time, can be a logical
counter.
TS is unique for a transaction.
New transaction has larger TS than older
transaction.
Larger TS transactions wait for smaller TS
transactions and smaller TS transactions die
and restart when confronting larger TS
transactions.
No deadlock.
536
Rule  Tc      Ti      Condition
1.    write   read    Tc must not write an object that has been read by any Ti where Ti > Tc; this requires that Tc >= the maximum read timestamp of the object.
2.    write   write   Tc must not write an object that has been written by any Ti where Ti > Tc; this requires that Tc > the write timestamp of the committed object.
3.    read    write   Tc must not read an object that has been written by any Ti where Ti > Tc; this requires that Tc > the write timestamp of the committed object.
538
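Rules 1 and 2 above (the write rules) reduce to a small predicate; representing timestamps as plain integers is an assumption for illustration:

```python
def write_allowed(tc, max_read_ts, committed_write_ts):
    """Tc may write an object only if no younger transaction has read it
    (rule 1) and no younger transaction's write has been committed (rule 2)."""
    return tc >= max_read_ts and tc > committed_write_ts

assert write_allowed(3, 2, 1)        # T3 writes after T2's read, T1's write
assert not write_allowed(3, 4, 1)    # object already read by T4: too late
assert not write_allowed(3, 2, 4)    # object already written by T4: too late
```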
Timestamp ordering write rule:
if (Tc >= maximum read timestamp on D and Tc > write timestamp on committed version of D)
    perform the write operation on a tentative version of D with write timestamp Tc
else
    /* the write is too late */ abort transaction Tc
539
[Figure: timestamp-ordered writes, with T1 < T2 < T3 < T4. Key: committed versions are shown as such; tentative versions produced by transaction Ti carry write timestamp Ti. (a) T3 writes after T2 has committed: a tentative version with timestamp T3 is added. (b) T3 writes when a tentative version T2 exists: the tentative versions are kept in timestamp order. (c) T3 writes when tentative versions T1 and T4 exist: T3 is inserted between them. (d) T3 writes when a version with timestamp T4 already exists: the write is too late and transaction T3 aborts.]
540
Timestamp ordering read rule:
if (Tc > write timestamp on committed version of D) {
    let Dselected be the version of D with the maximum write timestamp <= Tc
    if (Dselected is committed)
        perform the read operation on the version Dselected
    else
        wait until the transaction that made version Dselected commits or aborts, then
        reapply the read rule
} else
    abort transaction Tc
541
[Figure: timestamp-ordered reads, with T1 < T2 < T3 < T4. (a) T3 reads a committed version with timestamp T2: the version is selected and the read proceeds. (b) T3 reads when the version with timestamp T2 is tentative: that version is selected and the read waits. (c) T3 reads when committed T1 and tentative T4 exist: version T1 is selected and the read proceeds. (d) T3 reads when the only version has timestamp T4: transaction T3 aborts.]
542
543
Pessimistic methods
Less concurrency, but simpler relative to optimistic methods.
548
Question Bank 4
Explain the concepts of Concurrency control and
Recoverability from abort used in the transactions.
Discuss the locking mechanism for the concurrency
control.
Write short note on:
(i) Nested transaction
(ii) Timestamp ordering
What is the purpose of Validation of Transactions?
Explain the various forms of transaction with suitable
examples.
Illustrate a comparative study of the three methods of
concurrency control with suitable example.
549
551
Atomicity of transaction
All or nothing for all involved servers
Two phase commit
Concurrency control
Serialize locally + serialize globally
The three concurrency control methods with respect to distributed transactions
552
Distributed transactions
553
Distributed transactions
2) Nested transaction
554
T = openTransaction
    openSubTransaction
        a.withdraw(10);
    openSubTransaction
        b.withdraw(20);
    openSubTransaction
        c.deposit(10);
    openSubTransaction
        d.deposit(20);
closeTransaction
[Figure: the client's top-level transaction T runs subtransactions T1 (a.withdraw(10)), T2 (b.withdraw(20)), T3 (c.deposit(10)) and T4 (d.deposit(20)) at servers X, Y and Z.]
Coordination in Distributed
Transactions
Each server has a special participant process. The coordinator process (leader) resides in one of the servers and communicates with the transaction and the participants.
[Figure: the coordination process. (1) The client calls openTransaction at the coordinator and receives a TID. (2) The client invokes a.method(TID) at a server. (3) That server's participant calls join(TID, ref) at the coordinator. Participants at servers X, Y and C join in this way; the coordinator also provides closeTransaction and abortTransaction.]
557
[Figure: a flat distributed banking transaction.
T = openTransaction
    a.withdraw(4);
    c.deposit(4);
    b.withdraw(3);
    d.deposit(3);
closeTransaction
The client's transaction T involves participant A at BranchX (a.withdraw(4)), participant B at BranchY (b.withdraw(3)), and participant C at BranchZ (c.deposit(4); d.deposit(3)); each participant joins the transaction with the coordinator.]
Working of Coordinator
Servers for a distributed transaction need to coordinate
their actions.
A client starts a transaction by sending an openTransaction
request to a coordinator. The coordinator returns the TID
to the client. The TID must be unique (serverIP and number
unique to that server)
Coordinator is responsible for committing or aborting it.
Each other server in a transaction is a participant.
Participants are responsible for cooperating with the
coordinator in carrying out the commit protocol, and keep
track of all recoverable objects managed by it.
Each coordinator has a set of references to the participants.
Each participant records a reference to the coordinator.
559
abortTransaction(trans); /* additional method */
560
561
The problem
some servers commit, some servers abort
How to deal with the situation that some
servers decide to abort?
The challenge
work correctly when errors happen
Failure model
563
Second phase
The coordinator tell all participants to commit
564
( or abort)
[Figure: communication in two-phase commit.
Coordinator, step 1: prepared to commit (waiting for votes); sends canCommit? to the participants.
Participant, step 2: prepared to commit (uncertain); replies Yes.
Coordinator, step 3: committed; sends doCommit.
Participant, step 4: committed; sends haveCommitted; the coordinator is done.]
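The two phases can be sketched as follows. The method names mirror the protocol's messages; the single-process setting is an assumption — real participants run on different servers and the messages travel over the network:

```python
class Participant:
    def __init__(self, vote):
        self.vote = vote          # "Yes" or "No"
        self.state = "prepared"
    def can_commit(self):
        return self.vote          # phase 1 reply
    def do_commit(self):
        self.state = "committed"
    def do_abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1 (voting): collect canCommit? votes.
    votes = [p.can_commit() for p in participants]
    # Phase 2 (completion): commit only on a unanimous Yes.
    if all(v == "Yes" for v in votes):
        for p in participants:
            p.do_commit()
        return "committed"
    for p in participants:
        p.do_abort()
    return "aborted"

ps = [Participant("Yes"), Participant("No")]
assert two_phase_commit(ps) == "aborted"
```

A single No vote (or a missing vote after a timeout) forces every participant to abort, which is what makes the outcome atomic.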
Failure of Coordinator
When a participant has voted Yes and is waiting for the coordinator to report the outcome of the vote, that participant is in the uncertain state. If the coordinator has failed, the participant cannot get the decision until the coordinator is replaced, which can result in extensive delays for participants in the uncertain state.
One alternative strategy is to allow the participants to obtain a decision from other participants instead of contacting the coordinator. However, if all participants are in the uncertain state, they will not get a decision.
572
Operations for nested transactions:
openSubTransaction(trans) -> subTrans
    Opens a new subtransaction whose parent is trans and returns a unique subtransaction identifier (an extension of its parent's TID).
getStatus(trans) -> committed, aborted, provisional
    Asks the coordinator to report on the status of the transaction trans. Returns one of: committed, aborted, provisional.
574
[Figure: a distributed nested transaction and the bottom-up decision in two-phase commit. The client's top-level transaction T has subtransactions T1 and T2; T1 has children T11 and T12, and T2 has children T21 and T22, running at servers such as A, B, D, K and N. Some subtransactions abort while the others provisionally commit, and each node answers Yes/No to its parent's canCommit?, so the commit decision propagates bottom-up.]
576
[Figure: transaction T decides whether to commit. Server status: T11 has aborted (at M) and T2 has aborted (at Y), while T1, T12, T21 and T22 are provisionally committed; the client's top-level transaction T then decides whether to commit.]
577
Coordinator of   Child subtrans   Participant           Provisional commit list   Abort list
T                T1, T2           yes                   T1, T12                   T11, T2
T1               T11, T12         yes                   T1, T12                   T11
T2               T21, T22         no (aborted)                                    T2
T11                               no (aborted)                                    T11
T12, T21                          T12 but not T21       T21, T12
T22                               no (parent aborted)   T22
[Figure: communication in two-phase commit for nested transactions. Coordinator, step 1: prepared to commit (waiting for votes); sends canCommit?. Participant, step 2: prepared to commit (uncertain); replies Yes. Coordinator, step 3: committed; sends doCommit. Participant, step 4: committed; replies haveCommitted; done.]
Note:
Phase I: steps 1 and 2. Phase II: steps 3 and 4. The protocol may be carried out in a hierarchic or a flat manner.
canCommit?(trans, subTrans) -> Yes / No
    A call from a coordinator to the coordinator of a child subtransaction, asking whether it can commit subtransaction subTrans. The first argument, trans, is the transaction identifier of the top-level transaction. The participant replies with its vote, Yes or No.
A fourth case: provisionally committed subtransactions of aborted subtransactions (e.g. T22, whose parent T2 has aborted) use getStatus on the parent, whose coordinator should remain active for a while. If the parent does not reply, then abort.
582
584
1. Locking
Each participant locks on objects locally
strict two phase locking scheme
Lock manager at each server decide whether to
grant a lock or make the requesting transaction wait.
Locking
T                                        U
Write(A) at X: locks A
                                         Write(B) at Y: locks B
Read(B) at Y: waits for U
                                         Read(A) at X: waits for T
***************************************************
T is before U at server X, and U is before T at server Y. These different orderings can lead to cyclic dependencies between transactions, and a distributed deadlock arises.
586
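Combining the per-server wait-for edges and searching for a cycle detects this deadlock. A sketch, with the assumed simplification that each blocked transaction waits for exactly one other (so the wait-for graph is a dict):

```python
def has_cycle(wait_for):
    """wait_for maps a blocked transaction to the transaction it waits for."""
    for start in wait_for:
        seen, t = set(), start
        while t in wait_for:      # follow the chain of waits
            if t in seen:
                return True       # came back to a visited transaction: deadlock
            seen.add(t)
            t = wait_for[t]
    return False

# At Y, T waits for U (lock on B); at X, U waits for T (lock on A).
assert has_cycle({"T": "U", "U": "T"})
assert not has_cycle({"T": "U"})
```

Note that neither server sees a cycle locally; the deadlock is visible only in the combined (global) wait-for graph, which is what makes distributed deadlock detection hard.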
Timestamp ordering in distributed transactions: resolution of a conflict.
587
[Figure: transactions T and U perform conflicting Read and Write operations on objects A (at server X) and B (at server Y); the conflicts are resolved according to the transactions' globally unique timestamps.]
588
Optimistic concurrency control: based on validation (Kung & Robinson).
Question Bank 5.
Describe the Flat and Nested distributed
transaction. How these are utilized in a
distributed banking transaction?
How can a transaction be completed in an atomic manner? Explain in detail the working of the two-phase commit protocol.
Discuss in details the concept of two-phase
commit protocol for nested transactions.
How can you achieve concurrency control in distributed transactions?
590
Introduction
The Access Matrix Model
Implementation of Access Matrix Model(3)
Safety in the Access Matrix Model
Advanced Models of Protection(3)
Deals with the control of unauthorized use of software and hardware.
Business applications such as banking require high security and protection during any transaction.
Security techniques should not only prevent the misuse of secret information but also its destruction.
592
592
Basics
Potential Security Violations [By AnderSon]:
1. Unauthorized information release : unauthorized
person is able to read information, unauthorized use
of computer program.
2. Unauthorized information modification: an unauthorized person is able to modify information, e.g. changing the grade of a university student or changing account balances in bank databases.
3. Unauthorized denial of service: an unauthorized person should not succeed in preventing an authorized person from accessing the information.
593
2. Internal Security :
Mechanism
Protection Domain of a
Process
Specifies Resources that a process can access and type
of operation that a process can perform on the
resources.
Required for enforcing security
Allows the process to use only those resources that it requires.
Every process executes in its protection domain and
protection domain is switched appropriately whenever
control jumps from process to process.
Advantage:
Eliminates the possibility of a process violating security maliciously or unintentionally, and increases accountability.
596
Design Principles cont.
The protection state is represented by the triple (S, O, P):
S: the set of current subjects
O: the set of current objects
P: the access matrix
Note: the access matrix has a row for every current subject and a column for every current object.
601
The access matrix P[s,o]:

Subjects   O2            O3            O4 (S1)     O5 (S2)         (S3)
S1         read, write   own, delete   own         sendmail        recmail
S2         execute       copy          recmail     own             block, wakeup
S3         own           read, write   sendmail    block, wakeup   own
603
[Figure: a reference monitor checks each access request against the access matrix and grants or denies the access.]
604
Implementation of Access
Matrix Model
Three Implementations of Access matrix model
1. Capabilities Based
2. Access Control List
3. Lock-key Method
606
Capabilities
A capability pairs an object with access rights such as read, write, execute.
607
[Figure: the access matrix grouped by subject: s1 has rights r1 on O1 and r2 on O3; s2 has r3 on O2 and r4 on O3; s3 has r5 on O1.]
Capability lists:
s1: (r1, O1), (r2, O3)
s2: (r3, O2), (r4, O3)
s3: (r5, O1)
608
Capabilities cont..
Possession of a capability is treated as evidence that the user has authority to access the object in the ways specified in the capability.
At any point in time, a subject is authorized to access only those objects for which it has capabilities.
609
610
Capability-based addressing:
An address (of a request) in a program consists of a capability id and an offset: the capability id identifies the object to be accessed in main memory, and the offset gives the relative location of a word within the object.
[Figure: the capability id selects an entry (the object descriptor) in the object table, holding the object's base address, length and access rights; the offset is checked against the length and added to the base.]
611
1. Tagged approach
2. Partitioned Approach:
Advantages Drawbacks of
Capabilities
Advantages
1. Efficient : validity can be easily tested
2. Simple : due to natural correspondence between structural
properties of capabilities and semantic properties of
addressing variables.
3. Flexible : user can decide which of his address contain
capabilities
616
Disadvantages:
1. Control of propagation: a copy of a capability can be passed from one subject to another without the knowledge of the first subject.
2. Review: determining all the subjects accessing a given object is difficult.
3. Revocation of access rights: revocation requires destroying the object, which prevents all the undesired subjects from accessing it.
4. Garbage collection: when all capabilities for an object disappear from the system, the object is left inaccessible to users and becomes garbage.
617
617
Access Control Lists
[Figure: the access matrix grouped by object: O1 is accessible by s1 with r1 and by s3 with r5; O2 by s2 with r3; O3 by s1 with r2 and by s2 with r4.]
Access control lists:
O1: (s1, r1), (s3, r5)
O2: (s2, r3)
O3: (s1, r2), (s2, r4)

Schematic of an access control list:
Subjects   Access Rights
Smith      read, write, execute
Jones      read
Lee        write
Grant      execute
3. Lock-Key Method
Each subject possesses a set of keys and each object is associated with a set of locks. A lock is a pair (k, {r1, r2, ...}): a subject whose key matches lock k may access the object with the rights {r1, r2, ...}.
622
Comparison of methods
[Table: the capability list, access control list and lock-key methods compared on propagation, review, revocation and reclamation. Capability lists make propagation easy but review, revocation and reclamation hard; access control lists and the lock-key method fare better on review, revocation and reclamation.]
Command execution
All checks in the condition part are evaluated. The
<conditions> part has checks in the form r in
P[s,o].
If all checks pass, primitive operations in <list of
primitive operations> are executed.
626
Example: command to create a file and assign own and read rights
to it
command create-read (process, file)
create object file
enter own into P [process, file]
enter read into P [process, file]
end.
627
628
Mono-Operational
Commands
Single primitive operation in a
command
Example: Make process p the owner
of file g
command makeowner(p, g)
enter own into A[p, g];
end
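The create-read command above can be sketched against a dictionary-based access matrix. The representation is an assumption: P maps (subject, object) pairs to sets of rights, and enter is the primitive operation:

```python
P = {}   # access matrix: (subject, object) -> set of rights

def enter(right, s, o):
    """Primitive operation: enter a right into P[s, o]."""
    P.setdefault((s, o), set()).add(right)

def create_read(process, file):
    """command create-read(process, file):
       create object file; enter own and read into P[process, file]."""
    enter("own", process, file)
    enter("read", process, file)

create_read("p1", "f1")
assert P[("p1", "f1")] == {"own", "read"}
```

A mono-operational command like makeowner(p, g) is then just a single enter("own", p, g) call.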
Advanced model of
Protection
1. Take Grant model
2. Bell Lapadula model
3. Lattice model
631
1.Take-Grant Model
Principles:
632
1.Take-Grant Model
Model:
Graph nodes: subjects and objects
An edge from node x to node y indicates that
subject x has an access right to the object y: the
edge is tagged with the corresponding access rights
Access rights
Read (r), write (w), execute (e)
Special access rights for propagating access
rights to other nodes
Take: if node x has the access right take on node y, then subject x can take for itself any access right that y has on another node.
Grant: if node x has the access right grant on node y, then the entity represented by node y can be granted any of the access rights that node x has.
633
[Figure: take. x has take on y, and y has r, w on z; x may take the r, w rights on z for itself.]
634
[Figure: grant. x has grant on y, and x has r, w on z; x may grant its r, w rights on z to y.]
635
636
2. Bell-LaPadula Model
Used to control information flow
Model components
Subjects, objects, and access matrix
Several ordered security levels
Each subject has a (maximum) clearance and
a current clearance level
Each object has a classification (I.e., belongs
to a security level)
637
Read-only
Append: subject can only write object (no read permitted)
Execute: no read or write
Read-write: both read and write are permitted
Tranquility principle
no operation may change the classification of an
active object
638
[Figure: security levels Level 1 ... Level i-1, Level i, Level i+1 ... A subject at level i can read objects at its own and lower levels, and can write objects at its own and higher levels.]
640
[Figure: the *-property. A subject at level i has r, w access to objects at level i, write-only access to objects at levels above (up to level n), and read-only access to objects at levels below (down to level 1).]
641
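The read-down/write-up rules can be sketched as two predicates, with integer security levels assumed for illustration (a full model would also compare category sets):

```python
def can_read(subject_level, object_level):
    """Simple security property: no read up."""
    return subject_level >= object_level

def can_write(subject_level, object_level):
    """*-property: no write down."""
    return subject_level <= object_level

assert can_read(2, 1) and not can_read(1, 2)   # read down, never up
assert can_write(1, 2) and not can_write(2, 1) # write up, never down
```

Together the two rules ensure that information can only flow upwards through the levels.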
642
643
[Figure: a lattice of the subsets of {x, y, z} ordered by inclusion: the source {} at the bottom; {x}, {y}, {z}; {x,y}, {x,z}, {y,z}; the sink {x,y,z} at the top. Information may flow only upwards in the lattice.]
644
646
648
[Figure: a lattice of security classes of the form (classification, category set): (1, {}), (1, {p}), (1, {p,s}), (2, {}), (2, {p}); class (c1, s1) precedes (c2, s2) when c1 <= c2 and s1 is a subset of s2.]
650
Question Bank 6
Explain various implementation of Access
matrix with suitable example .
Explain the Take-grant model of information
flow with suitable example
How the Bell-LaPadula model deals with the
control of information flow.
Explain the Lattice model of information flow
with suitable example.
Write short note on
(i) Protection State
(ii)Safety in the access matrix model
651
Model of Cryptography
Terminology:
Plaintext [cleartext or original message]
Ciphertext [message in encrypted form]
Encryption [ Process of converting Plaintext to ciphered
text]
Decryption [Process of converting ciphered to Plaintext text]
Cryptosystem [System for encryption and decryption of
information]
Symmetric Cryptography : If the key is same for both
encryption and decryption
Asymmetric Cryptography : If the key is not same for both
encryption and decryption
653
[Figure: model of a cryptosystem. Plaintext M is encrypted with encryption key Ke to give ciphertext C = E_Ke(M); C is decrypted with decryption key Kd to recover M. A cryptanalyst (CA) may observe C.]
Potential threats:
1. Ciphertext-only attack
2. Known-plaintext attack
3. Chosen-plaintext attack
654
Design Principles
Shannon's principles:
(Support conventional cryptography)
1. Principle of Diffusion: spread the correlation and dependencies among key-string variables over substrings as much as possible, so as to maximize the length of plaintext needed to break the system.
2. Principle of Confusion: change a piece of information so that the output has no obvious relation to the input.
Exhaustive search principle:
(Supports modern cryptography)
3. Determination of the key needed to break the system should require an exhaustive search of a large space.
655
Classification of Cryptographic
Systems
Cryptographic Systems:
Conventional systems
Modern systems (open design):
    Private key systems
    Public key systems
656
Conventional Cryptography
Based on substitution cipher
1. Caesar Cipher ( no. of keys <=25)
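A sketch of the Caesar cipher: with the English alphabet there are at most 25 useful keys, since a shift of 0 (or 26) leaves the text unchanged.

```python
def caesar(text, key):
    """Shift each letter by `key` positions, preserving case;
    non-letters pass through unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + key) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

assert caesar("ATTACK", 3) == "DWWDFN"
assert caesar(caesar("attack at dawn", 3), -3) == "attack at dawn"
```

Because the key space is so small, the cipher falls immediately to the exhaustive-search principle above.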
Modern Cryptography
1. Private key Cryptography
Data Encryption Standard [DES]
Three steps:
1. The plaintext undergoes an initial permutation (IP) in which the 64 bits of the block are permuted.
2. The permuted block undergoes a complex transformation using a key, involving 16 iterations and then a 32-bit switch (SW).
3. The output of step (2) undergoes a final permutation, which is the inverse of step (1).
<< The output of step (3) is the ciphertext >>
[Figure: DES structure. The 64-bit plaintext X passes through the initial permutation (IP) and then 16 rounds; round i uses a 48-bit key Ki produced by key generation (KeyGen).]
Iterative transformation (round i), with key Ki:
    Li = Ri-1
    Ri = Li-1 XOR f(Ri-1, Ki)
661
Steps of f:
1. The 32-bit Ri-1 is expanded to a 48-bit E(Ri-1) using permutation and duplication.
2. An Ex-OR operation is performed between the 48-bit key Ki and E(Ri-1); the 48-bit output is partitioned into 8 blocks S1, S2, ..., S8 of 6 bits each.
3. Each Si, 1 <= i <= 8, is fed into a separate 6-to-4 substitution box (S-box).
4. The 32-bit output of the 8 substitution boxes is fed to a permutation box whose 32-bit output is f.
662
Decryption
The same algorithm as encryption, with the order of the keys reversed (Key16, Key15, ..., Key1), based on:
    Ri-1 = Li
    Li-1 = Ri XOR f(Li, Ki)
For example, the first decryption step (IP) undoes the IP-1 step of encryption, and the 3rd decryption step undoes the permutation performed in the 1st encryption step, yielding the original plaintext block.
663
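The round equations and their inversion can be demonstrated with a toy Feistel network. The round function f below is a stand-in (real DES uses the expansion, S-boxes and permutation described above); the point is that decryption is the same algorithm with the keys reversed, regardless of what f is:

```python
def f(right, key):
    # Toy round function (assumed); any function of (right, key) works
    # for a Feistel network, since f is never inverted.
    return (right * 31 + key) & 0xFFFFFFFF

def feistel(left, right, keys):
    """One pass over the rounds: Li = Ri-1; Ri = Li-1 XOR f(Ri-1, Ki)."""
    for k in keys:
        left, right = right, left ^ f(right, k)
    return left, right

def feistel_decrypt(left, right, keys):
    """Run the same rounds with the keys reversed on the swapped halves,
    then swap back: this inverts feistel()."""
    right, left = feistel(right, left, list(reversed(keys)))
    return left, right

keys = list(range(1, 17))                     # 16 toy round keys
l, r = feistel(0x01234567, 0x89ABCDEF, keys)
assert feistel_decrypt(l, r, keys) == (0x01234567, 0x89ABCDEF)
```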
Rivest-Shamir-Adleman Method
Popularly known as the RSA method.
The binary plaintext is divided into blocks; each block is represented by an integer between 0 and n-1.
The encryption key is a pair (e, n) where e is a positive integer.
A message M is encrypted by raising it to the eth power modulo n:
    C = M^e modulo n
The ciphertext C is an integer between 0 and n-1.
Encryption does not increase the length of the plaintext.
The decryption key (d, n) is a pair where d is a positive integer.
665
Rivest-Shamir-Adleman cont..
The ciphertext block C is decrypted by raising it to the dth power modulo n:
    M = C^d modulo n
Each user X possesses an encryption key (eX, nX) and a decryption key (dX, nX); the encryption key is available in the public domain, but the decryption key is known only to the user.
666
Rivest-Shamir-Adleman cont..
[Figure: M -> M^e mod n -> C = M^e mod n, then C -> C^d mod n -> M.
(e, n): encryption key for the user. (d, n): decryption key for the user.]
667
Determination of Keys
Choose two large primes p and q and let n = p x q. Choose d relatively prime to (p-1) x (q-1), and then choose e such that e x d modulo ((p-1) x (q-1)) = 1.
668
Example of RSA
Let p = 5 and q = 11, such that n = p x q => n = 55.
Therefore (p-1) x (q-1) = 40.
Let d = 23, as 23 and 40 are relatively prime, i.e. gcd(23, 40) = 1.
Choose e such that d x e (modulo 40) = 1. Note e = 7.
Consider any integer M between 0 and 55, and execute encryption and decryption:

M    M^e        C = M^e mod 55    C^d                M = C^d mod 55
8    2097152    2                 8388608            8
9    4782969    4                 70368744177664     9
669
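The example can be reproduced directly; a sketch using Python's built-in three-argument pow for modular exponentiation:

```python
from math import gcd

p, q = 5, 11
n = p * q                      # 55
phi = (p - 1) * (q - 1)        # 40
d = 23
assert gcd(d, phi) == 1        # d relatively prime to 40
e = 7
assert (d * e) % phi == 1      # 23 * 7 = 161 = 4 * 40 + 1

def encrypt(m):
    return pow(m, e, n)        # C = M^e mod n

def decrypt(c):
    return pow(c, d, n)        # M = C^d mod n

assert encrypt(8) == 2 and decrypt(2) == 8
assert encrypt(9) == 4 and decrypt(4) == 9
```

With n squarefree, decryption inverts encryption for every block value 0 <= M < n, which is why the block size must keep each block below n.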
Question Bank 7
What do you mean by data security?
Explain in detail the model of Cryptography.
Explain the concept of Public Key
Cryptography with suitable example.
Explain the concept of Private Key
Cryptography with suitable examples.
Write a note on the Data Encryption Standard.
Discuss the Rivest-Shamir-Adleman method
with a suitable example.
670
Introduction
Resource Management
Stream Adaptation
Case Study
Introducing
Communication Paradigm
Introduction:
Modern computers can handle streams of
continuous, time-based data such as digital
audio and video.
This capability has led to the development of
distributed multimedia applications.
The requirements of multimedia applications
differ significantly from those of traditional real-time
applications:
Multimedia applications are highly distributed and
therefore compete with other distributed
applications for network bandwidth and computing
resources.
The resource requirements of multimedia
applications are dynamic.
672
[Figure: a distributed multimedia system — video servers and
digital TV/radio servers delivering multimedia streams to
workstations over local networks]
674
Video-on-demand services
Supply video information in digital form, from large online
storage systems to the user display
Require sufficient dedicated network bandwidth
Assumes that the video server and the receiving stations
are dedicated
675
QoS management
Traditional real-time system
E.g. avionics, air traffic control, telephone
switching
Small quantities of data, strict time requirement
QoS management: fixed schedule that ensures
worst-case requirements are always met
676
Requirements
Low-latency communication
round-trip delays of 100-300 ms, so that interaction between
users appears synchronous
Media synchronization
All participants in a music performance should hear the
performance at approximately the same time
677
[Figure: network resource availability from 1980 to 2000 —
applications such as remote login, network file access and
high-quality audio move from insufficient to scarce to
abundant resources over time]
Characteristics of
Multimedia data
Multimedia data (video and audio) is continuous and
time-based.
Continuous data is represented as a sequence of
discrete values that replace each other over time.
This refers to the user's view of the data:
Video: an image array is replaced 25 times per second
Audio: the amplitude value is replaced 8000 times per
second
Data compression
Reduce bandwidth requirements by factors
between 10 and 100.
Available in various formats like GIF, TIFF,
JPEG, MPEG-1, MPEG-2, MPEG-4.
It imposes substantial additional loads on
processing resources at the source and
destination.
E.g. the video and audio coders/ decoders found on
video cards.
QoS Management
When multimedia run in networks of PCs, they compete
for resources at workstations running the applications and
in the network.
In multi-tasking operating system, the central processor is
allocated to individual tasks in a Round-Robin or other
scheduling scheme.
The key feature of these schemes is that they handle
increases in demand by spreading the available resources
more thinly between the competing tasks.
The timely processing and transmission of multimedia
streams is crucial. In order to achieve timely delivery,
applications need guarantees that the necessary
resources will be allocated and scheduled at the required
times.
The management and allocation of resources to provide
such guarantees is referred to as Quality of Service
management (QoS management).
684
QoS Manager
Connections between components may be network
connections or in-memory transfers.
Target: each process must be allocated adequate CPU time,
memory capacity and network bandwidth.
Resource requirements provide QoS specifications for the
components of multimedia applications to the QoS manager.
685
[Figure: a distributed multimedia application — a camera and
microphones at one PC/workstation feed codecs and a mixer, whose
multimedia streams cross network connections to codecs, a video
store and the window system on other workstations]
White boxes represent media processing components,
many of which are implemented in software, including:
codec: coding/decoding filter
mixer: sound-mixing component
686
QoS specifications for components of the application:

Component            Bandwidth                          Latency      Loss rate  Resources required
Camera               Out: 10 frames/sec, raw video,     Zero         -          -
                     640x480x16 bits
Codec                In: 10 frames/sec, raw video       Interactive  Low        10 ms CPU each 100 ms;
                     Out: MPEG-1 stream                                         10 Mbytes RAM
Mixer                In: 2 x 44 kbps audio              Interactive  Very low   1 ms CPU each 100 ms;
                     Out: 1 x 44 kbps audio                                     1 Mbyte RAM
Window system        In: various                        Interactive  Low        5 ms CPU each 100 ms;
                     Out: 50 frames/sec framebuffer                             5 Mbytes RAM
Network connection   In/Out: MPEG-1 stream,             Interactive  Low        1.5 Mbps, low-loss
                     approx. 1.5 Mbps                                           stream protocol
Network connection   In/Out: audio 44 kbps              Interactive  Very low   44 kbps, very-low-loss
                                                                                stream protocol
Admission control
Applications run under a resource contract.
Released resources are recycled for use by other applications.
689
QoS negotiation
Application components specify their QoS
requirements to the QoS manager as a flow spec.
The QoS manager evaluates the new requirements
against the available resources:
Sufficient? Yes: reserve the requested resources,
issue a resource contract and allow the
application to proceed.
No: negotiate a reduced resource provision
with the application. If an agreement is reached,
reserve the reduced resources; otherwise do not
allow the application to proceed.
If the application later notifies the QoS manager of
increased resource requirements, negotiation is repeated.
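The admission decision above can be sketched as a small function. The bandwidth figures are made-up illustrative numbers; a real QoS manager would evaluate a full flow spec (bandwidth, latency, loss rate), not a single value:

```python
# Sketch of the QoS negotiation decision: grant the requested
# resources if available, otherwise try a reduced provision,
# otherwise reject the application.

def negotiate(requested, minimum, available):
    """Return the bandwidth (kbps) granted, or None if rejected."""
    if requested <= available:
        return requested          # reserve the requested resources
    if minimum <= available:
        return available          # agreed reduced resource provision
    return None                   # do not allow application to proceed

assert negotiate(requested=1500, minimum=500, available=2000) == 1500
assert negotiate(requested=1500, minimum=500, available=1000) == 1000
assert negotiate(requested=1500, minimum=500, available=300) is None
```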
690
QoS Negotiation
The application indicates its resource
requirements to the QoS manager.
To Negotiate QoS between an application and
its underlying system an application must
specify its QoS requirements to the QoS
manager.
This is done by transmitting a set of
parameters.
691
Bandwidth
Specified as a minimum-maximum
value or as an average value
The required bandwidth varies according to the
compression rate of the video, e.g. 1:50 to 1:100 for MPEG video
Specify burstiness
Streams with the same average bandwidth can have
different traffic patterns
LBAP (linear-bounded arrival process) model: over any interval
of length t, at most Rt + B data elements arrive, where R is
the rate and B is the maximum size of a burst
694
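The LBAP bound can be sketched as a conformance test over an arrival trace. The rate and burst values below are illustrative, not from any real contract:

```python
# Check whether an arrival trace conforms to the LBAP model:
# over every interval of length t, at most R*t + B elements arrive.

def conforms_lbap(arrivals, R, B):
    """arrivals: list of (time, n_elements) events sorted by time."""
    for i, (t_start, _) in enumerate(arrivals):
        total = 0
        for t, n in arrivals[i:]:
            total += n
            if total > R * (t - t_start) + B:
                return False     # some interval exceeds R*t + B
    return True

trace = [(0.0, 3), (0.1, 3), (0.2, 3), (1.0, 3)]
assert conforms_lbap(trace, R=10, B=8)      # within the R*t + B bound
assert not conforms_lbap(trace, R=1, B=2)   # too bursty for this contract
```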
Jitter
Jitter: the variation in the period between the delivery
of two adjacent frames
Loss rate
Typically expressed as a probability
Calculated based on worst-case assumptions or on
standard distributions
695
Traffic Shaping
Traffic shaping is the term used to describe
the use of output buffering to smooth
the flow of data elements.
The bandwidth parameter of a multimedia
stream provides an idealized
approximation of the actual traffic pattern.
The closer the actual pattern matches the
description, the better the system will
handle the traffic.
696
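A minimal sketch of traffic shaping with a leaky-bucket style output buffer: arriving bursts are queued and released at a fixed rate. The tick-based timing and rate value are illustrative simplifications:

```python
# Leaky-bucket traffic shaper: buffer arriving data elements and
# release them at a fixed rate (elements per tick).
from collections import deque

def shape(arrivals, rate, horizon):
    """arrivals: dict tick -> number of elements arriving at that tick.
    Returns the number of elements sent at each tick."""
    buffer, out = deque(), []
    for t in range(horizon):
        buffer.extend([t] * arrivals.get(t, 0))   # enqueue this tick's burst
        sent = 0
        while buffer and sent < rate:             # drain at the shaping rate
            buffer.popleft()
            sent += 1
        out.append(sent)
    return out

# A burst of 6 elements at tick 0 leaves the shaper at 2 per tick.
assert shape({0: 6}, rate=2, horizon=4) == [2, 2, 2, 0]
```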
Flow Specifications
A collection of QoS parameters is
typically known as a flow
specification, or flow spec for short.
Several examples of flow specs
exist. In Internet RFC 1363, a flow
spec is defined as a set of 16-bit
numeric values, which reflect the QoS
parameters.
701
Bandwidth reservation:
Reserve the minimum or average bandwidth.
Statistical multiplexing:
Handle bursts that occasionally cause some drop in
service level.
Hypothesis: over a large number of streams, the aggregate
bandwidth required remains nearly constant, regardless of
the bandwidth of individual streams.
703
Resource Management
To provide a certain QoS level to an
application, a system needs to have sufficient
resources; it also needs to make the resources
available to an application when they are
needed (scheduling).
Resource Scheduling: processes need to have
resources assigned to them according to their
priority. The following two methods are used:
Fair scheduling
Round-robin
Packet-by-packet
Bit-by-bit
Real-time scheduling
Earliest-deadline-first (EDF)
704
(i) Fair Scheduling
If several streams compete for the same
resource, it becomes necessary to
consider fairness and to prevent ill-behaved
streams from taking too much
bandwidth.
A straightforward approach is to apply
round-robin scheduling to all streams in
the same class, to ensure fairness.
Nagle introduced a method on a
packet-by-packet basis that provides
more fairness w.r.t. varying packet sizes
and arrival times. This is called Fair
Queuing.
705
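The round-robin approach above can be sketched as follows. This is a simplified per-packet round robin over stream queues, not Nagle's full fair-queuing algorithm (which also accounts for packet sizes):

```python
# Round-robin scheduling across streams: each stream with pending
# packets gets one packet per cycle, so an ill-behaved stream
# cannot monopolize the link.
from collections import deque

def fair_schedule(queues):
    """queues: dict of stream name -> list of packets. Returns send order."""
    pending = {s: deque(p) for s, p in queues.items()}
    order = []
    while any(pending.values()):
        for stream in sorted(pending):          # fixed cycle over streams
            if pending[stream]:
                order.append((stream, pending[stream].popleft()))
    return order

# Stream "a" has many packets, "b" has few, yet they alternate fairly.
sched = fair_schedule({"a": [1, 2, 3, 4], "b": [1, 2]})
assert sched == [("a", 1), ("b", 1), ("a", 2), ("b", 2), ("a", 3), ("a", 4)]
```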
(ii) Real-time scheduling
Several algorithms have been developed to
meet the CPU scheduling needs of
applications.
Traditional real-time scheduling
methods suit the model of regular
continuous multimedia streams very
well.
An Earliest-Deadline-First (EDF)
scheduler uses the deadline
associated with each of its work items
to determine the next item: the item
with the earliest deadline is served first.
706
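EDF can be sketched with a priority queue keyed on deadlines; the item names and deadline values below are illustrative:

```python
# Earliest-Deadline-First scheduling: the pending work item with
# the earliest deadline is always served first.
import heapq

class EDFScheduler:
    def __init__(self):
        self._heap = []                       # min-heap ordered by deadline

    def submit(self, deadline, item):
        heapq.heappush(self._heap, (deadline, item))

    def next_item(self):
        return heapq.heappop(self._heap)[1]   # earliest deadline first

edf = EDFScheduler()
edf.submit(30, "audio frame")
edf.submit(10, "video frame")
edf.submit(20, "file block")
assert edf.next_item() == "video frame"   # deadline 10 served first
assert edf.next_item() == "file block"    # then deadline 20
```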
Stream Adaptation
The simplest form of adjustment
when QoS cannot be guaranteed is
to adjust a stream's performance by
dropping pieces of information.
Two methodologies are used:
Scaling
Filtering
707
Scaling
Best applied when live streams are sampled.
Scaling algorithms are media-dependent,
although the overall scaling approach is the
same: to subsample a given signal.
A system to perform scaling consists of a
monitor process at the target and a scaler
process at the source.
The monitor keeps track of the arrival times of
messages in a stream. Delayed messages
are an indication of a bottleneck in the
system.
When a bottleneck is detected, the monitor sends a
scale-down message to the source; when messages
arrive on time again, the stream is scaled up again.
708
Filtering
Adapts a stored stream on its path from source to
targets, so that each target receives the best quality
the intervening links can sustain.

Case Study: Tiger Video File Server
Design goals
Video-on-demand for a large number of
users
A large stored digital movie library
Delay of receiving the first frame is within a few seconds
Users can perform pause, rewind, fast-forward
Quality of service
Constant rate
a maximum jitter and low loss rate
Low-cost hardware
Constructed by commodity PC
Fault tolerant
Tolerant to the failure of any single server or disk
711
System architecture
One controller
Connects with each cub server over a low-bandwidth network
[Figure: Tiger architecture — cub servers 0..n, each with attached
disks, connected by a high-bandwidth ATM switching network;
start/stop requests from clients go to the controller]
712
Storage organization
Striping
A movie is divided into blocks
The blocks of a movie are stored on disks
attached to different cubs, in disk-number
sequence
Delivering a movie means delivering its blocks
from the different disks in sequence
This load-balances the delivery of hotspot movies
Mirroring
Each block is divided into several portions
(secondaries)
The secondaries are stored on the successor disks:
if a block is on disk i, then its secondaries are stored
on disks i+1 to i+d
713
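The striping and mirroring rules above can be sketched as placement functions. The disk counts and the decluster factor d are illustrative parameters, not Tiger's actual configuration:

```python
# Tiger-style block placement: stripe a movie's blocks across n disks
# in sequence, and mirror each block's secondaries on the d disks
# that follow its primary.

def primary_disk(block, n_disks):
    return block % n_disks                    # striping in disk-number order

def secondary_disks(block, n_disks, d):
    p = primary_disk(block, n_disks)
    return [(p + i) % n_disks for i in range(1, d + 1)]   # mirroring

# With 5 disks, block 3's primary is disk 3 and its secondaries follow it.
assert primary_disk(3, n_disks=5) == 3
assert secondary_disks(3, n_disks=5, d=2) == [4, 0]
# Consecutive blocks land on consecutive disks, balancing the load.
assert [primary_disk(b, 5) for b in range(7)] == [0, 1, 2, 3, 4, 0, 1]
```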
Distributed Schedule
Slot
Deliver a stream
Deliver the blocks of the stream disk by disk
Can be viewed as a slot moving along disks step by step
Viewer state
[Figure: Tiger distributed schedule — a sequence of slots (slot 0:
viewer 4 state, slot 1: free, slot 3: viewer 0 state, slot 4:
viewer 3 state, slot 5: viewer 2 state, slot 6: free, slot 7:
viewer 1 state) advancing one disk per block service time t]
714
716
Question bank
Explain the quality of service management and
resource management in multimedia applications.
Discuss the importance of Quality of service
negotiation and Admission control in the multimedia
applications.
What are the characteristics of multimedia streams?
Explain the impacts of Scaling and Filtering on
Stream adaptation
What is the purpose of Traffic shaping? What are the
various approaches to avoid burstiness of a stream?
Discuss the impact of distributed multimedia in the
Tiger Video file server.
717
721
Google as a cloud
provider
Google is now a major player in cloud computing, which is
defined as a set of Internet-based application, storage and
computing services sufficient to support most users' needs, thus
enabling them to largely or totally dispense with local data
storage and application software.
Software as a service: offering application-level software over
the Internet as web applications. A prime example is the set of
web-based applications including Gmail, Google Docs, Google
Talk and Google Calendar, which aims to replace traditional office
suites. (More examples in the following table.)
Platform as a service: concerned with offering distributed
system APIs and services across the Internet, with these APIs
used to support the development and hosting of web
applications. With the launch of Google App Engine, Google
went beyond software as a service and now offers its distributed
system infrastructure as a cloud service, enabling other
organizations to run their own web applications on the Google platform.
722
723
Google Physical Model
The key philosophy of Google in terms of physical infrastructure is
to use very large numbers of commodity PCs to produce a cost-effective
environment for distributed storage and computation.
Purchasing decisions are based on obtaining the best performance
per dollar rather than absolute performance. This dates back to when
Brin and Page built the first Google search engine from spare hardware
scavenged from around the lab at Stanford University.
724
Physical model
Organization of the Google physical infrastructure
(To avoid clutter the Ethernet connections are shown from only one of the clusters to
the external links)
725
Key Requirements
Scalability: i). Deal with more data ii) deal with more
queries and iii) seeking better results
Reliability: There is a need to provide 24/7 availability.
Google offers a 99.9% service-level agreement to paying
customers of Google Apps, covering Gmail, Google
Calendar, Google Docs, Google Sites and Google Talk.
The well-reported outage of Gmail on September 1st, 2009
(100 minutes, due to a cascading problem of overloaded
servers) acts as a reminder of the challenges.
Performance: Low latency of user interaction. Achieving
the throughput to respond to all incoming requests
while dealing with very large datasets over network.
Openness: Core services and applications should be
open to allow innovation and new applications.
726
727
Google infrastructure
728
Google Infrastructure
730
731
732
Each GFS chunk is 64 megabytes.
Chubby API
Four distinct capabilities:
1. Distributed locks to synchronize
distributed activities in a large-scale
asynchronous environment.
2.File system offering reliable
storage of small files
complementing the service
offered by GFS.
3.Support the election of a
primary in a set of replicas.
4.Used as a name service within
Google.
Chubby might appear to contradict the
overall design principle of
simplicity (doing one thing and
doing it well). However, at its
heart is one core service offering
a solution to distributed consensus,
and the other facets emerge from this
core service.
734
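One of Chubby's capabilities listed above, electing a primary among a set of replicas, can be sketched with a toy lock service: whichever replica acquires the lock first becomes primary. This local stand-in is illustrative only; Chubby's real RPC-based API and consensus protocol are not shown here, and the replica names are made up:

```python
# Toy lock service illustrating primary election: the replica that
# acquires the (advisory) lock becomes the primary.

class LockService:
    def __init__(self):
        self._holder = None

    def try_acquire(self, client):
        if self._holder is None:      # lock is free: grant it
            self._holder = client
            return True
        return False                  # already held by another client

    def holder(self):
        return self._holder

lock = LockService()
replicas = ["replica-1", "replica-2", "replica-3"]
primary = next(r for r in replicas if lock.try_acquire(r))
assert primary == "replica-1"                             # first wins
assert all(not lock.try_acquire(r) for r in replicas[1:]) # others fail
assert lock.holder() == "replica-1"
```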
GFS offers storage of and access to large flat files, which are accessed relative to
byte offsets within a file. It is efficient for storing large quantities of data and
performing sequential read and write (append) operations. However, there is also a
strong need for a distributed storage system that provides access to data indexed in
more sophisticated ways, related to its content and structure.
Google could have used an existing relational database with a full set of relational
operators (union, selection, projection, intersection and join), but performance and
scalability would have been a problem. So Google instead uses Bigtable (2008), which
retains the table model but with a much simpler interface.
A given table is a three-dimensional structure containing cells indexed by a row key,
a column key and a timestamp (to save multiple versions).
739
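The three-dimensional model above can be sketched as a sparse map from (row key, column key, timestamp) to a value. This is a toy illustration, not Bigtable's API; the row and column names are illustrative:

```python
# Toy Bigtable-style store: cells are indexed by row key, column
# key and timestamp; a read returns the newest version at or
# before the requested timestamp.

class ToyBigtable:
    def __init__(self):
        self._cells = {}                          # (row, column) -> {ts: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp):
        versions = self._cells.get((row, column), {})
        ok = [ts for ts in versions if ts <= timestamp]
        return versions[max(ok)] if ok else None  # newest version <= timestamp

t = ToyBigtable()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 5, "<html>v2</html>")
assert t.get("com.cnn.www", "contents:", 3) == "<html>v1</html>"
assert t.get("com.cnn.www", "contents:", 9) == "<html>v2</html>"
```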
Distributed Computation
Services
It is important to support high performance distributed computation
over the large datasets stored in GFS and Bigtable. The Google
infrastructure supports distributed computation through MapReduce
service and also the higher level Sawzall language.
Carry out distributed computation by breaking up the data into
smaller fragments and carrying out analyses (sorting, searching and
constructing inverted indexes) of such fragments in parallel, making
use of the physical architecture.
MapReduce [Dean and Ghemawat 2008] is a simple programming
model to support the development of such applications, hiding
underlying detail from the programmer, including details related to the
parallelization of the computation, monitoring and recovery from
failure, data management and load balancing onto the underlying
physical infrastructure.
The key principle behind MapReduce is that many parallel
computations share the same overall pattern, that is:
Break the input data into a number of chunks
Carry out initial processing on these chunks of data to produce
intermediary results (the map function)
Combine the intermediary results to produce the final output (the
reduce function)
744
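The pattern above can be sketched with the classic word-count example. The map and reduce phases run sequentially here for clarity; in the real system they execute in parallel on many workers:

```python
# Minimal MapReduce sketch: split the input into chunks, map each
# chunk to key/value pairs, group by key, then reduce each group.
from collections import defaultdict

def map_fn(chunk):
    # map: emit (word, 1) for every word in the chunk
    return [(word, 1) for word in chunk.split()]

def reduce_fn(key, values):
    # reduce: combine the intermediary values for one key
    return key, sum(values)

def map_reduce(chunks):
    groups = defaultdict(list)
    for chunk in chunks:                    # map phase
        for key, value in map_fn(chunk):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())   # reduce phase

chunks = ["the quick fox", "the lazy dog", "the fox"]
assert map_reduce(chunks) == {"the": 3, "quick": 1, "fox": 2,
                              "lazy": 1, "dog": 1}
```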
746
The first stage is to split the input file into M pieces, with each piece being
typically 16-64 megabytes in size (no bigger than a single chunk in GFS). The
intermediary results are also partitioned into R pieces, so there are M map tasks
and R reduce tasks.
The library then starts a set of worker machines from the pool available in the
cluster, with one being designated as the master and the others being used for
executing map or reduce steps.
A worker that has been assigned a map task will first read the contents of the
input file allocated to that map task, extract the key/value pairs and supply them
as input to the map function. The output of the map function is a processed set of
key/value pairs that are held in an intermediary buffer.
The intermediary buffers are periodically written to a file local to the map
computation. At this stage, the data are partitioned, resulting in R regions.
Usually a hash function is applied to the key and then modulo R is taken of the
hashed value to produce the R partitions.
When a worker is assigned to carry out a reduce function, it reads its
corresponding partition from the local disks of the map workers using RPC.
747
Question Bank
Discuss the overall Google architecture for
distributed computing.
Discuss in detail the data storage and coordination
services provided in the Google infrastructure.
What is the purpose of distributed computing
services?
Explain how the Google infrastructure supports
distributed computation.
Write short note on
(i) Chubby
(ii) Communication paradigm
750