Unit 3: Coordination and Agreement
Agreement
Allan Clark
School of Informatics
University of Edinburgh
http://www.inf.ed.ac.uk/teaching/courses/ds
Autumn Term 2012
Coordination and Agreement
Overview
I In this part of the course we will examine how distributed
processes can agree on particular values
I It is generally important that the processes within a
distributed system have some sort of agreement
I Agreement may be as simple as the goal of the distributed
system
I Has the general task been aborted?
I Should the main aim be changed?
I This is more complicated than it sounds, since the processes
must not only agree, but also be confident that their peers agree.
I We will look at:
I mutual exclusion to coordinate access to shared resources
I The conditions necessary in general to guarantee that a global
consensus is reached
I Perhaps more importantly the conditions which prevent this
Coordination and Agreement
No Fixed Master
I We will also look at dynamically agreeing upon a master or
leader process, i.e. an election, generally run after the current
master has failed.
I We saw in the Time and Global State section that some
algorithms required a global master/nominee, but there was
no requirement for that master/nominee process to be fixed
I With a fixed master process agreement is made much simpler
I However it then introduces a single point of failure
I So here we are generally assuming no fixed master process
Coordination and Agreement
Synchronous vs Asynchronous
I Again we draw the distinction between synchronous and
asynchronous systems
I The distinction is important here: synchronous systems allow
us to determine bounds on message transmission delays
I This allows us to use timeouts to detect message failure in a
way that cannot be done for asynchronous systems.
Coping with Failures
I In this part we will consider the presence of failures, recall
from our Fundamentals part three decreasingly benign failure
models:
1. Assume no failures occur
2. Assume omission failures may occur; both process and
message delivery omission failures.
3. Assume that arbitrary failures may occur both at a process or
through message corruption whilst in transit.
A Brief Aside
Failure Detectors
I Here I am talking about the detection of a crashed process
I Not one that has started responding erroneously
I Detecting such failures is a major obstacle in designing
algorithms which can cope with them
I A failure detector is a process which responds to requests
querying whether a particular process has failed or not
I The key point is that a failure detector is not necessarily
accurate.
I In a synchronous system one can implement a “reliable failure detector”
I One which responds with: “Unsuspected” or “Failed”
Failure Detectors
Mutual Exclusion
I Ensuring mutual exclusion to shared resources is a common
task
I For example, processes A and B both wish to add a value to a
shared variable ‘a’.
I To do so each must compute a temporary result from the current
value of the shared variable ‘a’ plus the value to be added, then
write it back.

Time  Process A                        Process B
1     t = a + 10 (A stores temporary)
2                                      t′ = a + 20 (B stores temporary)
3                                      a = t′ (a now equals 25)
4     a = t (a now equals 15)

I The intended total increment for a is 30 but B’s increment is
nullified
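I To make the interleaving concrete, here is a minimal Python sketch of
the same lost update; the lock stands in for the mutual exclusion we are
about to study, and the starting value 5 matches the table above
(illustrative only).

import threading

a = 5                           # the shared variable, as in the table
lock = threading.Lock()

def add(delta):
    global a
    with lock:                  # remove the lock to risk a lost update
        t = a + delta           # store temporary (steps 1 and 2)
        a = t                   # write back (steps 3 and 4)

ta = threading.Thread(target=add, args=(10,))   # process A
tb = threading.Thread(target=add, args=(20,))   # process B
ta.start(); tb.start()
ta.join(); tb.join()
print(a)                        # always 35 with the lock; without it an
                                # unlucky interleaving yields 15 or 25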
Coordination and Agreement
Mutual Exclusion
Concurrent removal of two adjacent list nodes, i and i + 1, without
mutual exclusion:

Thread A (removing node i)      Thread B (removing node i + 1)
new-next = i.next               new-next′ = (i+1).next
(i-1).next = new-next           i.next = new-next′

If A reads i.next before B updates it, (i-1).next is left pointing at
the deleted node i + 1.
Shamelessly stolen from Wikipedia
Ring-based Algorithm
I A simple way to arrange for mutual exclusion without the
need for a master process, is to arrange the processes in a
logical ring.
I The ring may of course bear little resemblance to the physical
network or even the direct links between processes.
(Figure: eight processes, numbered 1–8, arranged in a logical ring)
Distributed Mutual Exclusion Algorithms
Ring-based Algorithm
I The token passes around the ring continuously.
I When a process receives the token from its neighbour:
I If it does not require access to the critical section it
immediately forwards on the token to the next neighbour in
the ring
I If it requires access to the critical section, the process:
1. retains the token
2. performs the critical section, and then,
3. to relinquish access to the critical section,
4. forwards the token on to the next neighbour in the ring
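I As an illustration, a minimal Python simulation of this loop: the ring
is modelled as an index that wraps around, and which processes want entry
is fixed in advance (the names and values are illustrative, not part of
the algorithm).

N = 4
wants = {0: False, 1: True, 2: False, 3: True}   # who wants entry

def critical_section(i):
    print(f"process {i} in the critical section")

token_at = 0
for _ in range(2 * N):                 # let the token circulate twice
    if wants[token_at]:                # 1. retain the token
        critical_section(token_at)     # 2. perform the critical section
        wants[token_at] = False        # 3. relinquish access ...
    token_at = (token_at + 1) % N      # 4. ... forward the token on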
Distributed Mutual Exclusion Algorithms
Ring-based Algorithm
I Once again it is straightforward to determine that this
algorithm satisfies the Safety and Liveness properties.
I However once again we fail to satisfy the Fairness property
Ring-based Algorithm
(Figure: the token being passed around the ring of processes)
Ricart and Agrawala’s Algorithm: Requesting Entry
I Each process retains a variable indicating its state, which can be:
1. “Released” — Not in or requiring entry to the critical section
2. “Wanted” — Requiring entry to the critical section
3. “Held” — Acquired entry to the critical section and has not
yet relinquished that access.
I When a process requires entry to the critical section it
updates its state to “Wanted” and multicasts a request to
enter the critical section to all other processes. It stores the
request message {Ti , pi }
I Only once it has received a “permission granted” message
from all other processes does it change its state to “Held” and
use the critical section
Multicast and Logical Clocks
Responding to requests
I Upon receiving such a request a process:
I Currently in the “Released” state can immediately respond
with a permission granted message
I A process currently in the “Held” state:
1. Queues the request and continues to use the critical section
2. Once finished using the critical section responds to all such
queued requests with a permission granted message
3. changes its state back to “Released”
I A process currently in the “Wanted” state:
1. Compares the incoming request message {Tj , pj } with its own
stored request message {Ti , pi } which it broadcasted
2. If {Ti , pi } < {Tj , pj } then the incoming request is queued as if
the current process was already in the “Held” state
3. If {Ti , pi } > {Tj , pj } then the incoming request is responded
to with a permission granted message as if the current process
was in the “Released” state
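I The three cases above reduce to a single comparison of timestamp pairs.
A minimal Python sketch of the responding side, where send_permission is
a hypothetical stand-in for the real message send:

state = "RELEASED"            # "RELEASED", "WANTED" or "HELD"
my_request = None             # (T_i, p_i), set when we multicast a request
deferred = []                 # requests queued for a later reply

def send_permission(p):
    print(f"permission granted to {p}")    # stand-in for a real send

def on_request(their_request):
    # their_request is (T_j, p_j); Python's tuple comparison gives exactly
    # the required order: earlier timestamp first, ties broken by process id
    t, p = their_request
    if state == "HELD" or (state == "WANTED" and my_request < their_request):
        deferred.append(p)    # queue the request and reply later
    else:
        send_permission(p)    # reply immediately

def on_exit():
    global state
    state = "RELEASED"        # leaving the critical section:
    for p in deferred:
        send_permission(p)    # answer all deferred requests
    deferred.clear()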
Multicast and Logical Clocks
Maekawa’s Voting Algorithm: Requesting Permission
I To request permission to access the critical section a process
pi :
1. Updates its state variable to “Wanted”
2. Multicasts a request to all processes in the associated voting
set Vi (voting sets are chosen so that pi ∈ Vi and any two
voting sets intersect)
3. When the process has received a “permission granted”
response from all processes in the voting set Vi : update state
to “Held” and use the critical section
4. Once the process is finished using the critical section, it
updates its state again to “Released” and multicasts a
“release” message to all members of its voting set Vi
Maekawa’s voting algorithm
Granting Permission/Voting
I When a process pj receives a request message from a process
pi :
I If its state variable is “Held” or its voted variable is True:
1. Queue the request from pi without replying
I otherwise:
1. send a “permission granted” message to pi
2. set the voted variable to True
I When a process pj receives a “release” message:
I If there are no queued requests:
1. set the voted variable to False
I otherwise:
1. Remove the head of the queue, pq :
2. send a “permission granted” message to pq
3. The voted variable remains as True
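I A minimal Python sketch of these voting rules at a single voter pj ;
grant is a hypothetical stand-in for the real message send:

state = "RELEASED"
voted = False
queue = []                     # requesters waiting for this vote

def grant(p):
    print(f"permission granted to {p}")    # stand-in for a real send

def on_request(p_i):
    global voted
    if state == "HELD" or voted:
        queue.append(p_i)      # queue the request without replying
    else:
        grant(p_i)             # cast the single vote for p_i
        voted = True

def on_release():
    global voted
    if not queue:
        voted = False          # the vote becomes free again
    else:
        grant(queue.pop(0))    # pass the vote straight to the head of
                               # the queue; voted remains True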
Maekawa’s voting algorithm
Deadlock
I The algorithm as described does not respect the Liveness
property
I Consider three processes p1 , p2 and p3
I Their voting sets: V1 = {p1 , p2 }, V2 = {p2 , p3 } and
V3 = {p3 , p1 }
I Suppose that all three processes concurrently request
permission to access the critical section
I All three processes immediately respond to their own requests
I All three processes have their “voted” variables set to True
I Hence, p1 queues the subsequently received request from p3
I Likewise, p2 queues the subsequently received request from p1
I Finally, p3 queues the subsequently received request from p2
I Deadlock: each process waits for a vote that has already been
cast and will not be released
Maekawa’s voting algorithm
Performance Evaluation
I We have four algorithms: central server, ring based, Ricart
and Agrawala’s and Maekawa’s voting algorithm
I We have three logical properties with which to compare them,
we can also compare them with respect to performance:
I For performance we are interested in:
1. The number of messages sent in order to enter and exit the
critical section
2. The client delay incurred at each entry and exit operation
3. The synchronisation delay, the delay between one process
exiting the critical section and a waiting process entering
I Note: which of these is (more) important depends upon the
application domain, and in particular how often critical section
access is required
Mutual Exclusion Performance Evaluation
Ring-based Algorithm
I Entering the critical section:
I Requires between 0 and N messages
I These messages are serialised, so the client delay is also
between 0 and N messages
I Exiting the critical section:
I Simply requires that the holding process sends the token
forward through the ring
I The synchronisation-delay is between 1 and N-1 messages
Mutual Exclusion Performance Evaluation
Further Considerations
I The ring-based algorithm continuously consumes bandwidth
as the token is passed around the ring even when no process
requires entry
I Ricart and Agrawala — the process that last used the critical
section can simply re-use it if no other requests have been
received in the meantime
Mutual Exclusion Algorithms
Fault Tolerance
I None of the algorithms described above tolerate loss of
messages
I The token based algorithms lose the token if such a message
is lost meaning no further accesses will be possible
I Ricart and Agrawala’s method will mean that the requesting
process will indefinitely wait for (N - 1) “permission granted”
messages that will never come because one or more of them
have been lost
I Maekawa’s algorithm cannot tolerate message loss without it
affecting the system, but parts of the system may be able to
proceed unhindered
Fault Tolerance
Process Crashes
I What happens when a process crashes?
1. Central server: provided the crashed process is not the central
server, does not hold the token, and has not requested the
token, everything else may proceed unhindered
2. Ring-based algorithm — complete meltdown, but we may get
through up to N-1 critical section accesses in the meantime
3. Ricart and Agrawala — complete meltdown, we might get
through additional critical section accesses if the failed process
has already responded to them. But no subsequent requests
will be granted
4. Maekawa’s voting algorithm — This can tolerate some process
crashes, provided the crashed process is not within the voting
set of a process requesting critical section access
Mutual Exclusion Algorithms
Fault Tolerance
I All of these algorithms may be adapted to recover from
process failures
I Given a failure detector
I Note, however, that this problem is non-trivial
I In particular because for all of these algorithms a failed
process looks much like one which is currently using the
critical section
I The key point is that the failure may occur at any point
I A synchronous system may be sure that a process has failed
and take appropriate action
I An asynchronous system cannot be sure and hence may steal
the token from a process currently using the critical section
I Thus violating the Safety property
Mutual Exclusion Fault Tolerance
Considerations
I Central server
I care must be taken to decide whether the server or the failed
process held the token at the time of the failure
I If the server itself fails a new one must be elected, and any
queued requests must be re-made.
I Ring-based algorithm
I The ring can generally be easily fixed to circumvent the failed
process
I The failed process may have held the token or otherwise
blocked its progress
I Ricart and Agrawala
I Each requesting process should record which processes have
granted permission rather than simply how many
I The failed process can simply be removed from the list of
those required
I Maekawa’s voting algorithm
I Trickier, the failed process may have been in the intersection
between two voting sets
Coordination and Agreement
Elections
I Several algorithms which we have visited until now required a
master or nominee process, including:
1. The Berkeley algorithm for clock synchronisation
2. Distributed Debugging
3. The central server algorithm for mutual exclusion
I Even other algorithms may need a nominee to actually report
the results of the algorithm
I For example Chandy and Lamport’s snapshot algorithm
described how to record the local state at each process in
such a way that a consistent global state could be assembled
from the local states recorded at different times
I To actually be useful these local states must be gathered
together, a simple way to do this is for each local process to
send their locally recorded state to a nominee process
Elections
No Fixed Master/Nominee
I A simple way to provide a master process is simply to name
one
I However if the named process fails there should be a recovery
plan
I A recovery plan requires that we dynamically decide who
should become the new master/nominee
I Even with a fixed order this is non-trivial, in particular as all
participants must agree that the current master has failed
I A more dynamic election process can allow for greater
flexibility of a running system
Elections
Requirements
I We require that the result of the election should be unique
I (no hung-parliaments or coalitions)
I Even if multiple processes call for an election concurrently
I We will say that the elected process should be the best choice:
Requirements
I Safety A participant process pi has electedi = ⊥ or
electedi = P, where P is chosen as the non-crashed process at
the end of the run with the largest identifier
I Liveness All processes participate and eventually either crash
or have electedi ≠ ⊥
I Note that there may be some process pj which is not yet a
participant which has electedj = Q for some process which is
not the eventual winner of the election
I An additional property then could be specified as, no two
processes concurrently have electedi set to two different
processes
I Either one may be set to a process and the other to ⊥
I But if they are both set to a process it should be the same one
I We’ll call this property Total Safety
Elections
Election/Nominee Algorithms
I We will look at two distributed election algorithms
1. A ring-based election algorithm similar to the ring-based
mutual-exclusion algorithm
2. The bully election algorithm
I We will evaluate these algorithms with respect to their
performance characteristics, in particular:
I The total number of messages sent during an election — this
is a measure of the bandwidth used
I The turn-around time, measured by the number of serialised
messages sent:
I Recall that Ricart and Agrawala’s algorithm for mutual exclusion
required 2(N − 1) messages to enter the critical section, but its
turn-around time was only that of two serialised messages, since
the only serialisation was that each response message followed
a request message.
Elections
Initiating an election
I Initially all processes are marked as “non-participant”
I Any process may begin an election at any time
I To do so, a process pi :
1. marks itself as a “participant”
2. sets the electedi variable to ⊥
3. Creates an election message and places its own identifier
within the election message
4. Sends the election message to its nearest clockwise neighbour
in the ring
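I A minimal Python simulation of a complete election with a single
initiator, including the standard forwarding rule (forward the larger of
the received identifier and your own; a process that receives its own
identifier back has won and circulates an elected message); the
identifiers are illustrative:

ids = [8, 3, 11, 6, 5]                 # identifiers clockwise around the ring
N = len(ids)
participant = [False] * N
elected = [None] * N

def run_election(starter):
    participant[starter] = True
    msg, i = ids[starter], (starter + 1) % N   # election message sent on
    while msg != ids[i]:                       # election phase
        participant[i] = True
        msg = max(msg, ids[i])                 # forward the larger identifier
        i = (i + 1) % N
    winner = i                                 # got its own identifier back
    for j in range(N):                         # elected message traverses ring
        elected[j] = ids[winner]
        participant[j] = False

run_election(0)
print(elected)                                 # [11, 11, 11, 11, 11]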
Ring-based Election Algorithm
Required Properties
I Safety:
I A process must receive its own identifier back before sending
an elected message
I Therefore the election message containing that identifier must
have travelled the entire ring
I And must therefore have been compared with all process
identifiers
I Since no process updates its electedi variable until it wins the
election or receives an elected message no participating process
will have its electedi variable set to anything other than ⊥
I Liveness:
I Since there are no failures the liveness property follows from
the guaranteed traversals of the ring.
Ring-based Election Algorithm
Performance
I If only a single process starts the election
I Once the process with the highest identifier sends its election
message (either initiating or because it received one), then the
election will consume two full traversals of the ring.
I In the best case, where the process with the highest identifier
initiated the election, it will take 2 × N messages
I The worst case is when the process with the highest identifier
is the nearest anti-clockwise peer of the initiating process
I In which case it takes (N − 1) + 2 × N messages
I Or 3N − 1 messages
I The turn-around time is also 3N − 1 since all the messages are
serialised
Elections
Failure Detector
I We are assuming a synchronous system here and so we can
build a reliable failure detector
I We assume that message transmission delays are bounded by Ttrans
I Further, that message processing time is bounded by Tprocess
I Hence a failure detector can send a process psuspect a message
and expect a response within time T = 2 × Ttrans + Tprocess
I If a response does not occur within that time, the local failure
detector can report that the process psuspect has failed
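I A minimal sketch of such a detector, with assumed illustrative values
for the two bounds; wait_for_pong stands in for a real timed receive:

T_TRANS = 0.2                   # assumed bound on transmission delay (seconds)
T_PROCESS = 0.05                # assumed bound on processing time (seconds)
T = 2 * T_TRANS + T_PROCESS     # one round trip plus processing

def query(send_ping, wait_for_pong):
    # In a synchronous system the bounds are real, so the absence of a
    # reply within T proves the queried process has crashed.
    send_ping()
    return "Unsuspected" if wait_for_pong(timeout=T) else "Failed"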
The Bully Election Algorithm
A simple election
I If the process with the highest identifier is still available
I It knows that it is the process with the highest identifier
I It can therefore elect itself by simply sending a coordinator
message
I You may wonder why it would ever need to do this
I Imagine a task which can be initiated by any process, but
requires some coordinator
I For example global garbage collection
I For which we run a global snapshot algorithm
I And then require a coordinator to:
1. collect the global state
2. figure out which objects may be deleted
3. alert the processes which own those objects to delete them
I The initiator process cannot be sure that the previous
coordinator has not failed since the previous run.
I Hence an election is run each time
The Bully Election Algorithm
An actual election
I A process which does not have the highest identifier begins an
election by sending an election message to all processes with
higher identifiers
I If no answer message arrives within the timeout T, it deems
itself the winner and sends a coordinator message to all
processes with lower identifiers
I If an answer does arrive, it waits a further period for a
coordinator message, beginning a new election if none arrives
Receiving Messages
I coordinator If a process receives a coordinator message it sets
the electedi variable to the named winner
I election If a process receives an election message it sends back
an answer message and begins another election (unless it has
already begun one).
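I Putting the sending and receiving rules together, a minimal Python
sketch of a bully process; the send helpers merely print here, and a real
implementation would use the timed messaging layer with the timeout T
from the failure-detector slide:

T = 1.0                                  # failure-detector timeout (assumed)

class BullyProcess:
    def __init__(self, my_id, all_ids):
        self.my_id, self.all_ids = my_id, all_ids
        self.elected = None

    def start_election(self):
        higher = [p for p in self.all_ids if p > self.my_id]
        if not higher:
            self.announce()              # highest identifier: elect self
            return
        print(f"{self.my_id}: election message sent to {higher}")
        if not self.wait_for_answer():   # no answer within T: all higher
            self.announce()              # processes must have failed

    def on_election(self, from_id):
        print(f"{self.my_id}: answer sent to {from_id}")
        self.start_election()            # begin our own election (unless
                                         # one is already under way)

    def on_coordinator(self, winner):
        self.elected = winner            # record the announced winner

    def wait_for_answer(self):
        return False                     # stand-in for a timed receive

    def announce(self):
        print(f"{self.my_id}: coordinator message multicast")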
The Bully Election Algorithm
Starting a process
I When a process fails a new process may be started to replace
it
I When a new process is started it calls for a new election
I If it is the process with the highest identifier this will be a
simple election in which it simply sends a coordinator message
to elect itself
I This is the origin of the name: Bully
The Bully Election Algorithm
Properties
I The Liveness property is satisfied.
I Some processes may only participate in the sense that they
receive a coordinator message
I But all non-crashed processes will have set electedi to
something other than ⊥.
I The Safety property is also satisfied if we assume that any
process which has crashed, either before or during the
election, is not replaced with another process with the same
identifier during the election.
I Total Safety is not satisfied
The Bully Election Algorithm
Properties
I Unfortunately the Safety property is not met if processes may
be replaced during a run of the election
I One process, say p1 , with the highest identifier may be started
just as another process p2 has determined that it is currently
the process with the highest identifier
I In this case both these processes p1 and p2 will concurrently
send coordinator messages announcing themselves as the new
coordinator
I Since there is no guarantee as to the delivery order of messages
two other processes may receive these in a different order
I such that say: p3 believes the coordinator is p2 whilst p4
believes the coordinator is p1 .
I Of course things can also go wrong if the assumption of a
synchronous system is incorrect
The Bully Election Algorithm
Performance Evaluation
I In the best case the process with the current highest identifier
calls the election
I It requires (N - 1) coordinator messages
I These are concurrent though so the turnaround time is 1
message
I In the worst case though we require O(N²) messages
I This is the case if the process with the lowest identifier calls
for the election
I In this case N − 1 processes all begin elections with processes
with higher identifiers
I The turn-around time is best if the process with the highest
identifier is still alive. In which case it is comparable to a
round-trip time.
I Otherwise the turn-around time depends on the time bounds
for message delivery and processing
Election Algorithms Comparison
Ring-based vs Bully
                                     Ring-based   Bully
Asynchronous                         Yes          No
Allows processes to crash            No           Yes
Satisfies Safety                     Yes          Yes/No
Dynamic process identifiers          Yes          No
Dynamic configuration of processes   Maybe        Maybe
Best case performance                2 × N        N − 1
Worst case performance               3 × N − 1    O(N²)
Global Agreement
Multicast
I Previously we encountered group multicast
I IP multicast and Xcast both delivered “Maybe” semantics
I That is, perhaps some of the recipients of a multicast message
receive it and perhaps not
I Here we look at ways in which we can ensure that all
members of a group have received a message
I And also that multiples of such messages are received in the
correct order
I This is a form of global consensus
Global Agreement
Reliable Multicast
I Reliable multicast, with respect to a multicast operation
multicast(g , m), has three properties:
1. Integrity — A correct process p ∈ g delivers a message m at
most once and m was multicast by some correct process
2. Validity — If a correct process multicasts message m then
some correct process in g will eventually deliver m
3. Agreement — If a correct process delivers m then all other
correct processes in group g will deliver m
I Validity and Agreement together give the property that if a
correct process multicasts a message then it will eventually be
delivered at all correct processes
Global Agreement
Basic Multicast
I Suppose we have a reliable one-to-one send(p, m) operation
I We can implement a Basic Multicast: Bmulticast(g , m) with
a corresponding Bdeliver operation as:
1. Bmulticast(g, m) = for each process p in g:
I send(p, m)
2. On receive(m): Bdeliver(m)
I This works because we can be sure that all processes will
eventually receive the multicast message, since send(p, m) is
reliable
I It does however depend upon the multicasting process not
crashing
I Therefore Bmulticast does not have the Agreement property
Global Agreement
Reliable Multicast
I We will now implement reliable multicast on top of basic
multicast
I This is a good example of protocol layering
I We will implement the operations:
I Rmulticast(g , m) and Rdeliver (m)
I which are analogous to their Bmulticast(g , m) and
Bdeliver (m) counterparts but have additionally the Agreement
property
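I The construction whose properties the next slide argues is the standard
one: each process re-multicasts a message the first time it B-delivers it.
A minimal Python sketch, with send and Rdeliver as printing stand-ins:

received = set()                  # the Received set of messages seen so far

def send(p, m):
    print(f"send {m!r} to {p}")   # reliable one-to-one send (stand-in)

def r_deliver(m):
    print(f"Rdeliver {m!r}")

def b_multicast(g, m):
    for p in g:                   # basic multicast: send to every member
        send(p, m)

def r_multicast(g, m):
    b_multicast(g, m)             # the sender is itself a member of g

def on_b_deliver(m, sender, g, me):
    if m in received:
        return                    # Integrity: deliver at most once
    received.add(m)
    if sender != me:
        b_multicast(g, m)         # re-multicast before delivering: this
                                  # is what buys the Agreement property
    r_deliver(m)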
Global Agreement
Reliable Multicast
I Note that we insist that the sending process is in the receiving
group, hence:
I Validity — is satisfied since the sending process p will deliver
to itself
I Integrity — is guaranteed because of the integrity of the
underlying Bmulticast operation in addition to the rule that m
is only added to Received at most once
I Agreement — follows from the fact that every correct process
that Bdelivers(m) then performs a Bmulticast(g , m) before it
Rdelivers(m).
I However it is somewhat inefficient since each message is sent
to each process |g| times.
Global Agreement
Properties
I The hold-back queue is not strictly necessary but it simplifies
things since then a simple number can represent all messages
that have been delivered
I We assume that IP-multicast can detect message corruption
(for which it uses checksums)
I Integrity is therefore satisfied since we can detect duplicates
and delete them without delivery
I Validity property holds again because the sending process is in
the group and so at least that will deliver the message
I Agreement only holds if messages amongst the group are sent
indefinitely and if sent messages are retained (for re-sending)
until all group members have acknowledged receipt
I Therefore as it stands Agreement does not formally hold,
though in practice the simple protocol can be modified to give
acceptable guarantees of Agreement
Global Agreement
Uniform Agreement
I Our Agreement property specifies that if any correct process
delivers a message m then all correct processes deliver the
message m
I It says nothing about what happens to a failed process
I We can strengthen the condition to Uniform Agreement
I Uniform Agreement states that if a process, whether it then
fails or not, delivers a message m, then all correct processes
also deliver m.
I A moment’s reflection shows how useful this is: if a process
could take some action that put it in an inconsistent state and
then fail, recovery would be difficult
I For example applying an update that not all other processes
receive
Global Agreement
Ordering
I There are several different ordering schemes for multicast
I The three main distinctions are:
1. FIFO — If a correct process performs multicast(g, m) and
then multicast(g, m′) then every correct process which delivers
m′ will deliver m before m′
2. Causal — If multicast(g, m) → multicast(g, m′) then every
process which delivers m′ delivers m before m′
3. Total — If a correct process delivers m before it delivers m′
then every correct process which delivers m′ delivers m before
m′
I Note that Causal ordering implies FIFO ordering
I None of these require or imply reliable multicast
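I As an aside, FIFO ordering can be realised with per-sender sequence
numbers and a hold-back queue; a minimal Python sketch of one standard
realisation (the details are illustrative, not from the slides):

from collections import defaultdict

next_expected = defaultdict(int)     # per-sender next sequence number
holdback = defaultdict(dict)         # sender -> {sequence number: message}

def deliver(sender, m):
    print(f"deliver {m!r} from {sender}")

def on_receive(sender, seq, m):
    holdback[sender][seq] = m
    # deliver any consecutive run starting at the expected number
    while next_expected[sender] in holdback[sender]:
        deliver(sender, holdback[sender].pop(next_expected[sender]))
        next_expected[sender] += 1

on_receive("p1", 1, "second")        # held back: 0 has not yet arrived
on_receive("p1", 0, "first")         # delivers "first" then "second"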
Global Agreement
Total Ordering
I As we saw Causal ordering implies FIFO ordering
I But Total ordering is an orthogonal requirement
I Total ordering only requires an ordering on the delivery order,
but that ordering says nothing of the order in which messages
were sent
I Hence Total ordering can be combined with FIFO and Causal
ordering
I FIFO-Total ordering or Causal-Total ordering
Multicast Ordering
Using a sequencer
I Using a sequencer process is straightforward
I To total-ordered multicast a message m a process p first sends
the message to the sequencer
I The sequencer can determine message sequence numbers
based purely on the order in which they arrive at the
sequencer
I Though it could also use process sequence numbers or
Lamport timestamps should we wish to, for example, provide
FIFO-Total or Causal-Total ordering
I Once determined, the sequencer can either bmulticast the
message itself
I Or, to reduce the load on the sequencer, it may just respond
to process p with the sequence number which then itself
performs the bmulticast
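I A minimal Python sketch of the first variant, in which the sequencer
itself performs the bmulticast; receivers deliver from a hold-back queue
in sequence-number order (the structure is illustrative):

def deliver(m):
    print(f"deliver {m!r}")

class Sequencer:
    def __init__(self):
        self.next_seq = 0
    def on_send_request(self, group, m):
        b_multicast(group, (self.next_seq, m))   # stamp in arrival order
        self.next_seq += 1

class Receiver:
    def __init__(self):
        self.expected = 0
        self.holdback = {}
    def on_b_deliver(self, seq, m):
        self.holdback[seq] = m
        while self.expected in self.holdback:    # deliver in sequence order
            deliver(self.holdback.pop(self.expected))
            self.expected += 1

def b_multicast(group, msg):
    for p in group:                              # simulated basic multicast
        p.on_b_deliver(*msg)

group = [Receiver(), Receiver()]
s = Sequencer()
s.on_send_request(group, "first")                # every receiver delivers in
s.on_send_request(group, "second")               # the same (total) order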
Implementing Total Ordering
Overlapping Groups
I So far we have been happy to assume that each receiving
process belongs to exactly one multicast group
I Or that for overlapping groups the order is unimportant
I For some applications this is insufficient and our orderings can
be updated to account for overlapping groups
Ordered Multicast
Overlapping Groups
I Global FIFO Ordering: If a correct process issues
multicast(g, m) and then multicast(g′, m′) then every correct
process in g ∩ g′ that delivers m′ delivers m before m′
I Global Causal Ordering: If multicast(g, m) → multicast(g′, m′)
then every correct process in g ∩ g′ that delivers m′ delivers m
before m′
I Pairwise Total Ordering: If a correct process delivers message
m sent to g before it delivers m′ sent to g′ then every correct
process in g ∩ g′ which delivers m′ delivers m before m′
I A simple, but inefficient, way to do this is to force all multicasts
to be to the group g ∪ g′; receiving processes then simply
ignore the multicast messages not intended for them.
I e.g. a process p ∈ g − g′ ignores multicast messages sent only to g′
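I A minimal sketch of that workaround: every message is tagged with its
intended group, sent to everyone in the union, and filtered on receipt
(the names are illustrative):

def deliver(m):
    print(f"deliver {m!r}")

def multicast_union(intended, m, union_group, send):
    for p in union_group:
        send(p, (intended, m))     # everyone in the union gets everything

def on_receive(me, tagged):
    intended, m = tagged
    if me not in intended:
        return                     # e.g. p in g - g' drops g'-only traffic
    deliver(m)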
Summary
Further Thoughts
I These algorithms to perform mutual exclusion, nominee
election and agreed multicast suffer many drawbacks
I Many are subject to some assumptions which may be
unreasonable
I Particularly when the network used is not a Local Area
Network
I These problems can be, and are, overcome
I But for each individual application the designer should
consider whether the assumptions are a problem
I It may be that a solution which is less optimal but does not
rely on, say, a reliable communication network is the best
approach
I For example, Routing Information Protocol
Consensus
Three Kinds
I The problems of mutual exclusion, electing a nominee and
multicast are all instances of the more general problem of
consensus.
I Consensus problems more generally then are described as one
of three kinds:
1. Consensus
2. Byzantine Generals
3. Interactive Consensus
Global Agreement
Consensus
I A set of processes {p1 , p2 , . . . pn } each begins in the
undecided state
I Each proposes a single value vi
I The processes then communicate, exchanging values
I To conclude, each process must set their decision variable di
to one value and thus enter the decided state
I Three desired properties:
I Termination: each process sets its decisioni variable
I Agreement: If pi and pj are correct processes and have both
entered the decided state, then di = dj
I Integrity: If the correct processes all proposed the same value
v , then any correct process pi in the decided state has di = v
Global Agreement
Byzantine Generals
I Imagine three or more generals are to decide whether or not
to attack
I We assume that there is a commander who issues the order
I The others must decide whether or not to attack
I Either the lieutenants or the commander can be faulty and
thus send incorrect values
I Three desired properties:
I Termination: each process sets its decisioni variable
I Agreement: If pi and pj are correct processes and have both
entered the decided state, then di = dj
I Integrity: If the commander is correct then all correct
processes decide on the value proposed by the commander
I When the commander is correct, Integrity implies Agreement,
but the commander may not be correct
Global Agreement
Interactive Consensus
I Each process proposes its own value and the goal is for each
process to agree on a vector of values
I Similar to consensus other than that each process contributes
only a part of the final answer which we call the decision
vector
I Three desired properties:
I Termination: each process sets its decisioni variable
I Agreement: The final decision vector of all processes is the
same
I Integrity: If pi is correct and proposes vi then all correct
processes decide on vi as the ith component of the decision
vector
Global Agreement
(Figure: the commander p1 sends “1 says v” to both p2 and p3 ;
p2 relays “2 claims 1 says v” to p3 , while the faulty p3 relays
“3 claims 1 says x” to p2 )
Global Agreement
(Figure: the faulty commander p1 sends “1 says v” to p2 but
“1 says x” to p3 ; each lieutenant truthfully relays what it was told)
Global Agreement
Impossible
I Recall:
I Agreement: If pi and pj are correct processes and have both
entered the decided state, then di = dj
I Integrity: If the commander is correct then all correct
processes decide on the value proposed by the commander
I In both scenarios, process p2 receives different values from the
commander p1 and the other process p3
I It can therefore know that one process is faulty but cannot
know which one
I By the Integrity property then it is bound to choose the value
given by the commander
I By symmetry the process p3 is in the same situation when the
commander is faulty.
I Hence when the commander is faulty there is no way to
satisfy the Agreement property, so no solution exists for three
processes
Global Agreement
N ≤ 3 × f
I In the above case we had three processes and at most one
incorrect process, hence N = 3 and f = 1
I It has been shown by Pease et al. that more generally no
solution can exist whenever N ≤ 3 × f
I However there can exist a solution whenever N > 3 × f
I Such algorithms consist of rounds of messages
I It is known that such algorithms require at least f + 1
message rounds
I The complexity and cost of such algorithms suggest that they
are only applicable where the threat is great
I That means either the threat of an incorrect or malicious
process is great
I and/or the cost of failing due to inability to reach consensus is
large
Global Agreement
So what to do?
I The important word in the impossibility result of Fischer, Lynch
and Paterson (consensus cannot be guaranteed in an asynchronous
system if even one process may crash) is: guarantee
I There is no algorithm which is guaranteed to reach consensus
I Consensus has been reached in asynchronous systems for years
I Some techniques for getting around the impossibility result:
I Masking process failures, for example using persistent storage
such that a crashed process can be replaced by one in
effectively the same state
I Thus meaning some operations appear to take a long time,
but all operations do eventually complete
I Employ failure detectors:
I Although in an asynchronous system we cannot achieve a
reliable failure detector
I We can use one which is “perfect by design”
I Once a process is deemed to have failed, any subsequent
messages that it does send (showing that it had not failed) are
ignored
I To do this the other processes must agree that a given process
has failed
Consensus in an Asynchronous System
(Figure: a cartoon of processes agreeing on the value “Attack!”,
relayed to the right general)
Summary
I We looked at the problem of Mutual Exclusion in a distributed
system
I Giving four algorithms:
1. Central server algorithm
2. Ring-based algorithm
3. Ricart and Agrawala’s algorithm
4. Maekawa’s voting algorithm
I Each had different characteristics for:
1. Performance, in terms of bandwidth and time
2. Guarantees, largely the difficulty of providing the Fairness
property
3. Tolerance to process crashes
I We then looked at two algorithms for electing a master or
nominee process
I Then we looked at providing multicast with a variety of
guarantees in terms of delivery and delivery order
Coordination and Agreement
Summary
I We then noted that these were all specialised versions of the
more general case of obtaining consensus
I We defined three general cases for consensus which could be
used for the above three problems
I We noted that a synchronous system can make some
guarantee about reaching consensus in the presence of a
limited number of process failures
I But that even a single process failure limits our ability to
guarantee reaching consensus in an asynchronous system
I In reality we live with this impossibility and try to figure out
ways to minimise the damage
Any Questions?