Distributed Shared Memory
Advantages/Disadvantages of DSM
Advantages:
Shields programmer from Send/Receive primitives
Single address space; simplifies passing-by-reference and passing complex data
structures
Exploit locality-of-reference when a block is moved
DSM uses simpler software interfaces and cheaper off-the-shelf hardware, and is hence
cheaper than dedicated multiprocessor systems
No memory access bottleneck, as no single bus
Large virtual memory space
DSM programs portable as they use common DSM programming interface
Disadvantages:
Programmers need to understand consistency models, to write correct programs
DSM implementations use async message-passing, and hence cannot be more
efficient than msg-passing implementations
By yielding control to DSM manager software, programmers cannot use their own
msg-passing solutions.
Memory Coherence
Let s_i be the number of memory operations issued by P_i
(s_1 + s_2 + . . . + s_n)! / (s_1! s_2! . . . s_n!) possible interleavings (counted in the sketch after this list)
Memory coherence model defines which interleavings are permitted
Traditionally, a Read returns the value written by the most recent Write
"Most recent" Write is ambiguous with replicas and concurrent accesses
DSM consistency model is a contract between DSM system and application
programmer
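To get a feel for the size of this space, the following small Python sketch (ours; the function name is arbitrary) evaluates the multinomial count above for a few operation-sequence lengths.

from math import factorial

def interleavings(s):
    """Number of interleavings of per-process op sequences of lengths s[i]:
    (s1 + s2 + ... + sn)! / (s1! * s2! * ... * sn!)."""
    total = factorial(sum(s))
    for si in s:
        total //= factorial(si)
    return total

# Even tiny programs have a huge interleaving space:
print(interleavings([2, 2]))      # 6
print(interleavings([3, 3, 3]))   # 1680
print(interleavings([5, 5, 5]))   # 756756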
Figure: a process issues operations op1, op2, . . . , opk to its local memory manager; each operation is an invocation followed by a response.
Strict consistency
1 A Read should return the most recent value written, per a global time axis.
For operations that overlap per the global time axis, the following must hold.
2 All operations appear to be atomic and sequentially executed.
3 All processors see the same order of events, equivalent to the global time
ordering of non-overlapping events.
Figure: sequential invocations and responses to each Read or Write operation.
P1: Write(x,4), then Read(y,2)
P2: Write(y,2), then Read(x,4)
(b) Sequentially consistent and linearizable

P1: Write(x,4), then Read(y,0)
P2: Write(y,2), then Read(x,0)
(c) Not sequentially consistent (and hence not linearizable)

Initial values are zero. Executions (a) and (c) are not linearizable; (b) is linearizable.
Linearizability: Implementation
(shared var)
int: x;
(1) When the Memory Manager receives a Read or Write from application:
(1a) total order broadcast the Read or Write request to all processors;
(1b) await own request that was broadcast;
(1c) perform pending response to the application as follows
(1d) case Read: return value from local replica;
(1e) case Write: write to local replica and return ack to application.
(2) When the Memory Manager receives a total order broadcast(Write, x, val) from network:
(2a) write val to local replica of x.
(3) When the Memory Manager receives a total order broadcast(Read, x) from network:
(3a) no operation.
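A rough, single-machine illustration of this algorithm in Python (our sketch, not from the source): a global queue and one delivery thread stand in for total order broadcast, and every Read and Write completes only when its own broadcast is delivered back, as in steps (1a)-(1e).

import threading
from queue import Queue

class LinearizableDSM:
    """Sketch of the total-order-broadcast implementation of linearizability.
    A single global queue plays the role of total order broadcast: one
    delivery thread applies every request to every replica in the same order."""

    def __init__(self, num_procs):
        self.replicas = [dict() for _ in range(num_procs)]   # one replica per process
        self.tob = Queue()                                    # stands in for total order broadcast
        threading.Thread(target=self._deliver, daemon=True).start()

    def _deliver(self):
        while True:
            op, x, val, done = self.tob.get()
            if op == "write":
                for replica in self.replicas:                 # same delivery order everywhere
                    replica[x] = val
            done.set()                                        # own broadcast delivered (1b)

    def write(self, pid, x, val):
        done = threading.Event()
        self.tob.put(("write", x, val, done))                 # (1a) broadcast the Write
        done.wait()                                           # (1b)-(1e) ack after delivery

    def read(self, pid, x):
        done = threading.Event()
        self.tob.put(("read", x, None, done))                 # Reads are broadcast too
        done.wait()
        return self.replicas[pid].get(x, 0)                   # (1d) value from local replica

dsm = LinearizableDSM(num_procs=3)
dsm.write(0, "x", 4)
print(dsm.read(1, "x"))   # 4: the Read is ordered after the Write in the total order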
Figure: P_i performs Write(x,4) using total order broadcast; P_j's Read(x) returns 0 and P_k's Read(x) returns 4.
Sequential Consistency
The result of any execution is the same as if all operations of the processors were
executed in some sequential order.
The operations of each individual processor appear in this sequence in the local
program order.
Any interleaving of the operations from the different processors is possible. But all
processors must see the same interleaving. Even if two operations from different
processors (on the same or different variables) do not overlap in a global time scale, they
may appear in reverse order in the common sequential order seen by all. See examples
used for linearizability.
Sequential Consistency
Implementation using local Reads:
(shared var)
int: x;
(1) When the Memory Manager at Pi receives a Read or Write from application:
(1a) case Read: return value from local replica;
(1b) case Write(x,val): total order broadcast_i(Write(x,val)) to all processors including itself.
(2) When the Memory Manager at Pi receives a total order broadcast_j(Write, x, val) from network:
(2a) write val to local replica of x;
(2b) if i = j then return ack to application.
Implementation using local Writes (a locally issued Write increments counter, is total order
broadcast, and is acked immediately; a Read is delayed until counter = 0):
(3) When the Memory Manager at Pi receives a total order broadcast_j(Write, x, val) from network:
(3a) write val to local replica of x;
(3b) if i = j then
(3c) counter ← counter − 1;
(3d) if (counter = 0 and any Reads are pending) then
(3e) perform pending responses for the Reads to the application.
Locally issued Writes get acked immediately. Local Reads are delayed until the locally preceding
Writes have been acked. All locally issued Writes are pipelined.
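A minimal sketch of the local-Write variant, under the same stand-in for total order broadcast (class and method names are ours): a Write is acknowledged as soon as it is handed to the broadcast layer, and a Read blocks while this process still has its own Writes in flight (counter > 0).

import threading
from queue import Queue

class SCLocalWrites:
    """Sketch of one memory manager under sequential consistency with local
    Writes: Writes are acked at once; Reads wait until counter == 0."""

    def __init__(self):
        self.replica = {}
        self.counter = 0                       # locally issued, not yet delivered Writes
        self.cond = threading.Condition()
        self.tob = Queue()                     # stands in for total order broadcast
        threading.Thread(target=self._deliver, daemon=True).start()

    def _deliver(self):
        while True:
            x, val = self.tob.get()            # (3) broadcast delivered (all are our own here)
            with self.cond:
                self.replica[x] = val          # (3a)
                self.counter -= 1              # (3c)
                self.cond.notify_all()         # (3e) wake pending Reads

    def write(self, x, val):
        with self.cond:
            self.counter += 1                  # Write is pipelined
        self.tob.put((x, val))                 # broadcast, then ack immediately

    def read(self, x):
        with self.cond:
            self.cond.wait_for(lambda: self.counter == 0)   # delay behind own Writes
            return self.replica.get(x, 0)

mm = SCLocalWrites()
mm.write("x", 4)          # returns immediately
print(mm.read("x"))       # waits for the Write to be delivered, then prints 4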
Causal Consistency
In SC, all Write ops should be seen in a common order.
For causal consistency, only causally related Writes should be seen in a common order.
Causal relation for shared memory systems

Example execution:
P1: W(x,2), W(x,4)
P2: R(x,4), W(x,7)
P3: R(x,2), R(x,7)
P4: R(x,4), R(x,7)
(a) Sequentially consistent and causally consistent
PRAM memory
Only Write ops issued by the same processor are seen by others in the order they
were issued, but Writes from different processors may be seen by other processors
in different orders.
PRAM can be implemented by FIFO broadcast. PRAM memory can exhibit counter-intuitive
behavior; in the example below, each process may see its own Write before it sees the other's,
so both tests can succeed and both processes can be killed, an outcome impossible under
sequential consistency.
(shared variables)
int: x, y ;
Process 1 Process 2
... ...
(1a) x ←− 4; (2a) y ←− 6;
(1b) if y = 0 then kill(P2 ). (2b) if x = 0 then kill(P1 ).
Slow Memory
Only Write operations issued by the same processor and to the same memory
location must be seen by others in that order.
Example execution:
P1: W(x,2), W(y,4), W(x,7)
P2: R(y,4), R(x,0), R(x,0), R(x,7)
(a) Slow memory but not PRAM consistent
Figure: hierarchy of consistency models, from most to least restrictive: Linearizability/Atomic
consistency/Strict consistency, Sequential consistency, Causal consistency, pipelined RAM
(PRAM), Slow memory, no consistency model.
Weak consistency:
All Writes are propagated to other processes, and all Writes done elsewhere are brought
locally, at a sync instruction.
Drawback: a sync instruction cannot indicate whether the process is beginning access to the
shared variables (entering the CS) or has finished accessing them (exiting the CS).
Release Consistency
Acquire indicates CS is to be entered. Hence all Writes from other processors should be
locally reflected at this instruction
Release indicates access to CS is being completed. Hence, all Updates made locally should
be propagated to the replicas at other processors.
Acquire and Release can be defined on a subset of the variables.
If no CS semantics are used, then Acquire and Release act as barrier synchronization
variables.
Lazy release consistency: propagate updates on-demand, not the PRAM way.
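A rough sketch of the Acquire/Release idea (ours, not a protocol from the source): Writes made between Acquire and Release are buffered locally and pushed out only at Release, while Acquire pulls in the Writes propagated by others. The shared 'store' dictionary stands in for the replicas at other processors.

import threading

class RCNode:
    """Sketch of eager release consistency. Local writes are buffered and
    propagated to a shared store only at release(); acquire() refreshes the
    local copy from that store."""

    store = {}                      # stand-in for the other processors' replicas
    store_lock = threading.Lock()

    def __init__(self):
        self.local = {}             # this node's replica
        self.dirty = {}             # writes made since the last release

    def acquire(self):
        with RCNode.store_lock:     # Writes done elsewhere are brought locally
            self.local.update(RCNode.store)

    def write(self, x, val):
        self.local[x] = val
        self.dirty[x] = val         # not yet visible to others

    def read(self, x):
        return self.local.get(x, 0)

    def release(self):
        with RCNode.store_lock:     # propagate local updates to the other replicas
            RCNode.store.update(self.dirty)
        self.dirty.clear()

a, b = RCNode(), RCNode()
a.acquire(); a.write("x", 7); a.release()
b.acquire(); print(b.read("x"))    # 7, visible only after a.release() and b.acquire()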
Entry Consistency
Each ordinary shared variable is associated with a synchronization variable (e.g., lock,
barrier)
On an Acquire or Release of a synchronization variable, consistency actions are performed only
on the ordinary shared variables guarded by that synchronization variable.
Lamport's Bakery Algorithm
(shared vars)
array of boolean: choosing[1 . . . n];
array of integer: timestamp[1 . . . n];
repeat
(1) Pi executes the following for the entry section:
(1a) choosing [i] ←− 1;
(1b) timestamp[i] ← max_{k∈[1...n]}(timestamp[k]) + 1;
(1c) choosing [i] ←− 0;
(1d) for count = 1 to n do
(1e) while choosing [count] do no-op;
(1f) while timestamp[count] ≠ 0 and (timestamp[count], count) < (timestamp[i], i) do
(1g) no-op.
(2) Pi executes the critical section (CS) after the entry section
(3) Pi executes the following exit section after the CS:
(3a) timestamp[i] ←− 0.
(4) Pi executes the remainder section after the exit section
until false;
Mutual exclusion
Role of line (1e)? Wait for the others' timestamp choice to stabilize.
Role of line (1f)? Wait for the higher-priority process (lexicographically lower timestamp) to
enter the CS.
Bounded waiting: Pi can be overtaken by other processes at most once (each)
Progress: the lexicographic order is a total order; the process with the lowest timestamp
in lines (1d)-(1g) enters the CS
Space complexity: lower bound of n registers
Time complexity: O(n) time for the Bakery algorithm
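For concreteness, a direct transcription of the Bakery entry and exit sections into Python threads (a sketch only; it relies on CPython treating individual list-element reads and writes as atomic, and it busy-waits).

import threading, time

N = 4
choosing = [False] * N          # choosing[i], as in (1a)/(1c)
timestamp = [0] * N             # timestamp[i]; 0 means "not competing"
shared_counter = 0              # protected only by the bakery lock

def lock(i):
    # entry section, lines (1a)-(1g)
    choosing[i] = True
    timestamp[i] = max(timestamp) + 1
    choosing[i] = False
    for k in range(N):
        while choosing[k]:                                   # (1e) wait for k's choice to stabilize
            time.sleep(0)                                     # yield the GIL while busy-waiting
        while timestamp[k] != 0 and (timestamp[k], k) < (timestamp[i], i):
            time.sleep(0)                                     # (1f) defer to higher-priority process

def unlock(i):
    timestamp[i] = 0                                          # exit section, line (3a)

def worker(i):
    global shared_counter
    for _ in range(100):
        lock(i)
        shared_counter += 1                                   # critical section: unprotected read-modify-write
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(shared_counter)            # 400 if mutual exclusion held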
Lamport's fast mutual exclusion algorithm takes O(1) time in the absence of contention;
however, it compromises on bounded waiting. It uses the sequence W(x) - R(y) - W(y) - R(x),
which is necessary and sufficient to check for contention and then safely enter the CS.
repeat
(1) Pi (1 ≤ i ≤ n) executes entry section:
(1a) b[i] ←− true;
(1b) x ←− i;
(1c) if y ≠ 0 then
(1d) b[i] ←− false;
(1e) await y = 0;
(1f) goto (1a);
(1g) y ←− i;
(1h) if x ≠ i then
(1i) b[i] ←− false;
(1j) for j = 1 to n do
(1k) await ¬b[j];
(1l) if y ≠ i then
(1m) await y = 0;
(1n) goto (1a);
(2) Pi (1 ≤ i ≤ n) executes critical section:
(3) Pi (1 ≤ i ≤ n) executes exit section:
(3a) y ←− 0;
(3b) b[i] ←− false;
forever.
Need for a boolean vector of size n: Pi must leave a trace of its identity and of the fact that it
wrote to the mutex variables; other processes need to know who (and when) leaves the CS.
Hence the need for a boolean array b[1..n].
(shared variables)
register: Reg ←− false; // shared register initialized
(local variables)
integer: blocked ←− 0; // variable to be checked before entering CS
repeat
(1) Pi executes the following for the entry section:
(1a) blocked ←− true;
(1b) repeat
(1c) Swap(Reg , blocked);
(1d) until blocked = false;
(2) Pi executes the critical section (CS) after the entry section
(3) Pi executes the following exit section after the CS:
(3a) Reg ←− false;
(4) Pi executes the remainder section after the exit section
until false;
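A Python sketch of the Swap-based entry and exit sections above (ours). Python has no hardware Swap, so the shared register is simulated by a small class whose swap() is made atomic with an internal lock; that class is an assumption of the sketch, standing in for the hardware primitive.

import threading

class AtomicBool:
    """Simulates a hardware register with an atomic Swap."""
    def __init__(self, value=False):
        self._value = value
        self._guard = threading.Lock()   # stands in for hardware atomicity
    def swap(self, new):
        with self._guard:
            old, self._value = self._value, new
            return old

Reg = AtomicBool(False)

def lock():
    blocked = True
    while blocked:                       # (1b)-(1d): swap until we read back False
        blocked = Reg.swap(True)

def unlock():
    Reg.swap(False)                      # (3a) Reg ← false

lock()/unlock() can replace the Bakery lock in the previous sketch. The listing that follows does the same job with Test&Set and a waiting array, so that on exit the CS can be handed directly to the next waiting process.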
repeat
(1) Pi executes the following for the entry section:
(1a) waiting [i] ←− true;
(1b) blocked ←− true;
(1c) while waiting [i] and blocked do
(1d) blocked ←− Test&Set(Reg );
(1e) waiting [i] ←− false;
(2) Pi executes the critical section (CS) after the entry section
(3) Pi executes the following exit section after the CS:
(3a) next ←− (i + 1)mod n;
(3b) while next ≠ i and waiting [next] = false do
(3c) next ←− (next + 1)mod n;
(3d) if next = i then
(3e) Reg ←− false;
(3f) else waiting [next] ←− false;
(4) Pi executes the remainder section after the exit section
until false;
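The bounded-waiting version can be sketched the same way, with the same kind of simulated register (again an assumption standing in for the hardware Test&Set).

import threading

class TSRegister:
    """Simulated register with atomic Test&Set (hardware primitive stand-in)."""
    def __init__(self):
        self._value = False
        self._guard = threading.Lock()
    def test_and_set(self):
        with self._guard:
            old, self._value = self._value, True
            return old
    def clear(self):
        with self._guard:
            self._value = False

N = 4
Reg = TSRegister()
waiting = [False] * N

def lock(i):
    waiting[i] = True                    # (1a)
    blocked = True                       # (1b)
    while waiting[i] and blocked:        # (1c)-(1d)
        blocked = Reg.test_and_set()
    waiting[i] = False                   # (1e)

def unlock(i):
    nxt = (i + 1) % N                    # (3a)
    while nxt != i and not waiting[nxt]: # (3b)-(3c) find the next waiting process
        nxt = (nxt + 1) % N
    if nxt == i:
        Reg.clear()                      # (3d)-(3e) nobody waiting: free the register
    else:
        waiting[nxt] = False             # (3f) hand the CS directly to P_nxt

# lock(i)/unlock(i) can be exercised with threads exactly like the Bakery sketch above.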
Wait-freedom
Safe register
A Read that does not overlap with a Write returns the most recent value written
to that register. A Read that overlaps with a Write returns any one of the possible
values that the register could ever contain.
Example execution:
P1: Write^1_1(x,4), Write^2_1(x,6)
P2: Read^1_2(x,?), Read^2_2(x,?), Read^3_2(x,?)
P3: Write^1_3(x,−6)
Regular register
Safe register + if a Read overlaps with a Write, value returned is the value before
the Write operation, or the value written by the Write.
Atomic register
Regular register + linearizable to a sequential register
(local variable)
array of boolean: Val[1 . . . log(m)];
(local variables)
boolean, local to writer P0: previous ← 0;
Construction 5: Algorithm
(shared variables)
boolean MRSW regular registers R1 . . . Rm−1 ← 0; Rm ← 1;
// Ri readable by all, writable by P0.
(local variables)
integer: count;
Write(R, val): write 1 to Rval, then zero out the entries Rval−1 down to R1.
Read(R): scan R1, R2, . . . for the first "1" and return its index.
(boolean MRSW regular registers to integer MRSW regular register)
Construction 6 Read: scan for the first "1", then scan backwards and update the pointer to
the lowest-ranked register containing a "1".
(boolean MRSW atomic registers to integer MRSW atomic register)
Example: Pb reads R1 = 0, R2 = 0, R3 = 1, so its first Read(R) returns 3; it then reads
R1 = 0, R2 = 1, so its second Read(R) returns 2.
Construction 6: Algorithm
(shared variables)
boolean MRSW regular registers R1 . . . Rm−1 ←− 0; Rm ←− 1.
// Ri readable by all; writable by P0 .
(local variables)
integer: count, temp;
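The two Read strategies can be illustrated sequentially in a few lines (a sketch of the scanning logic only; real executions are concurrent, and the Write shown is the one described for Construction 5).

m = 8
R = [0] * (m + 1)          # R[1..m], boolean MRSW registers; R[m] starts at 1
R[m] = 1

def write(val):
    """Write by P0: set R[val], then zero out the lower entries."""
    R[val] = 1
    for k in range(val - 1, 0, -1):
        R[k] = 0

def read_regular():
    """Construction 5 Read: return the index of the first '1' found scanning up."""
    for k in range(1, m + 1):
        if R[k] == 1:
            return k

def read_atomic():
    """Construction 6 Read: find the first '1', then scan back down and return
    the lowest-ranked register still containing a '1'."""
    up = read_regular()
    result = up
    for k in range(up - 1, 0, -1):
        if R[k] == 1:
            result = k
    return result

write(3)
print(read_regular(), read_atomic())   # 3 3 in this sequential illustration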
(local variables, for the MRMW atomic register construction)
array of MRSW atomic registers of type ⟨data, tag⟩, where tag = ⟨seq_no, pid⟩: Reg_Array[1 . . . n];
integer: seq_no, j, k;
Construction 8 uses mailboxes Last_Read_Values[1 . . . n, 1 . . . n] (SRSW atomic registers).
Construction 8: Algorithm
(shared variables)
SRSW atomic registers of type ⟨data, seq_no⟩, where data, seq_no are integers: R1 . . . Rn ← ⟨0, 0⟩;
SRSW atomic register array of type ⟨data, seq_no⟩, where data, seq_no are integers:
Last_Read_Values[1 . . . n, 1 . . . n] ← ⟨0, 0⟩;
(local variables)
array of ⟨data, seq_no⟩: Last_Read[0 . . . n];
integer: seq, count;
(local variables)
array of int: changed[1 . . . n];
array of type ⟨data, seq_no, old_snapshot⟩: v1[1 . . . n], v2[1 . . . n], v[1 . . . n];
(2) Scan_i:
(2a) for count = 1 to n do
(2b) changed[count] ← 0;
(2c) while true do
(2d) v1[1 . . . n] ← collect();
(2e) v2[1 . . . n] ← collect();
(2f) if (∀k, 1 ≤ k ≤ n)(v1[k].seq_no = v2[k].seq_no) then
(2g) return(v2[1].data, . . . , v2[n].data);
(2h) else
(2i) for k = 1 to n do
(2j) if v1[k].seq_no ≠ v2[k].seq_no then
(2k) changed[k] ← changed[k] + 1;
(2l) if changed[k] = 2 then
(2m) return(v2[k].old_snapshot).
Double collect
Figure: Pi's Scan performs successive Collects; Pj writes during the period of one double
collect (changed[j] = 1) and again during the period of the next (changed[j] = 2).
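The Scan above, together with an Update that embeds the updater's own latest scan as old_snapshot, can be sketched as follows (our Python sketch; names are ours, and the demo calls are sequential).

N = 3
# Reg[k] holds P_k's component: (data, seq_no, old_snapshot)
Reg = [(0, 0, tuple([0] * N)) for _ in range(N)]

def collect():
    return [Reg[k] for k in range(N)]          # read all components, one by one

def scan():
    """Wait-free scan using double collect, steps (2a)-(2m)."""
    changed = [0] * N
    while True:
        v1 = collect()                          # (2d)
        v2 = collect()                          # (2e)
        if all(v1[k][1] == v2[k][1] for k in range(N)):
            return tuple(x[0] for x in v2)      # (2g) clean double collect
        for k in range(N):
            if v1[k][1] != v2[k][1]:
                changed[k] += 1                 # (2k) P_k moved during our scan
                if changed[k] == 2:
                    return v2[k][2]             # (2m) borrow P_k's embedded snapshot

def update(i, val):
    """P_i writes val and embeds a fresh scan so others can borrow it."""
    snap = scan()
    data, seq, _ = Reg[i]
    Reg[i] = (val, seq + 1, snap)

update(0, 5)
update(1, 9)
print(scan())   # (5, 9, 0)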