LogDevice Consensus Deepdive
[Figure: a log as a numbered sequence of slots (LSN 31–48); a writer appends at the tail while old entries are trimmed from the head.]
Single-decree Paxos. Goal: agree on a value "v" for a slot.

1. The proposer picks a proposal number n and sends PREPARE(n) to the acceptors (Phase 1(a): Prepare).
2. Each acceptor replies PROMISE(n', v'), reporting any value it has already accepted; the proposer waits for a majority (Phase 1(b): Promise).
3. The proposer selects a value v (the value from the highest-numbered promise, or its own if none was reported) and sends PROPOSE(n, v) (Phase 2(a): Propose).
4. The proposer waits for a majority of ACCEPT(n) replies; the value is then chosen (Phase 2(b): Accepted).
5. Optionally, COMMIT(n) notifies the acceptors of the decision (Phase 3: Commit).

Safety rests on quorum intersection: any two majorities of acceptors intersect, so a later proposer's Phase 1 quorum always overlaps any earlier Phase 2 quorum and learns of a chosen value.
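The acceptor side of this exchange can be sketched as a small state machine. This is a minimal single-process illustration; the class and function names are my own, not LogDevice APIs:

```python
class Acceptor:
    """Single-decree Paxos acceptor state machine (illustrative sketch)."""

    def __init__(self):
        self.promised_n = -1    # highest proposal number promised so far
        self.accepted_n = -1    # proposal number of the accepted value, if any
        self.accepted_v = None  # the accepted value, if any

    def on_prepare(self, n):
        # Phase 1(b): promise to ignore proposals numbered below n, and
        # report any previously accepted (n', v') back to the proposer.
        if n > self.promised_n:
            self.promised_n = n
            return ("PROMISE", self.accepted_n, self.accepted_v)
        return ("REJECT", self.promised_n, None)

    def on_propose(self, n, v):
        # Phase 2(b): accept unless a higher-numbered PREPARE was promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return ("ACCEPT", n)
        return ("REJECT", self.promised_n)


def choose_value(promises, my_value):
    # Phase 2(a): adopt the value of the highest-numbered promise, if any;
    # otherwise the proposer is free to propose its own value.
    accepted = [(n, v) for (_, n, v) in promises if v is not None]
    return max(accepted)[1] if accepted else my_value


a = Acceptor()
assert a.on_prepare(1) == ("PROMISE", -1, None)
assert a.on_propose(1, "v") == ("ACCEPT", 1)
assert a.on_prepare(0) == ("REJECT", 1, None)  # stale proposal number
assert choose_value([("PROMISE", 1, "v"), ("PROMISE", -1, None)], "w") == "v"
```

The `n >= self.promised_n` check in `on_propose` (rather than strict `>`) lets the proposer that made the promise complete its own round.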
• Flexible Paxos [2]: not all quorums need to intersect; it suffices that any Phase 1 quorum intersects any Phase 2 quorum.
[Figure: the same message flow with flexible quorums. Phase 1(a) Prepare, Phase 1(b) Promise (wait for a Phase I quorum, e.g. 8 out of 10 acceptors), Phase 2(a) Propose, Phase 2(b) Accepted (wait for a Phase II quorum, e.g. 3 out of 10; value chosen), optional Phase 3 Commit. Every 8-of-10 quorum intersects every 3-of-10 quorum.]
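The intersection requirement reduces to simple arithmetic: with n acceptors, every Phase I quorum of size q1 intersects every Phase II quorum of size q2 if and only if q1 + q2 > n. A quick sketch (illustrative, not LogDevice code):

```python
def quorums_intersect(n, q1, q2):
    """True iff every Phase I quorum of size q1 intersects every
    Phase II quorum of size q2 among n acceptors (pigeonhole)."""
    return q1 + q2 > n

# Classic majority Paxos: both phases use majorities of 10.
assert quorums_intersect(10, 6, 6)
# Flexible Paxos from the slide: 8-of-10 Phase I, 3-of-10 Phase II.
assert quorums_intersect(10, 8, 3)
# 8-of-10 and 2-of-10 can be disjoint, so this split is unsafe.
assert not quorums_intersect(10, 8, 2)
```

Shrinking the Phase II quorum makes steady-state replication cheaper at the cost of a larger (rarer) leader-election quorum.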
From Single-Decree Paxos to Multi-Paxos
What is Multi-Paxos?
• Scaling Paxos from a single value to a growing chain of single-value consensus slots
• Practically, distributed systems need consensus on multiple values
• High throughput / low latency: one Phase 1 (leader election) amortized over many Phase 2 rounds (replication)
• Directly maps to the log abstraction: append-only, immutable after consensus
The log is THE abstraction for Multi-Paxos
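The amortization above can be sketched as a leader that pays the Phase 1 cost once per term and then runs only one Phase 2 round per appended slot. An illustrative model with my own names, not a LogDevice API:

```python
class MultiPaxosLeader:
    """Illustrative: one Phase 1 per term, one Phase 2 per log slot."""

    def __init__(self, ballot):
        self.ballot = ballot     # proposal number shared by all slots
        self.elected = False
        self.log = []            # append-only; slots immutable once chosen
        self.phase1_rounds = 0
        self.phase2_rounds = 0

    def elect(self):
        # Phase 1 runs once, covering ALL future slots (leader election).
        self.phase1_rounds += 1
        self.elected = True

    def append(self, value):
        # Steady state: a single Phase 2 round per new slot.
        assert self.elected, "must win Phase 1 first"
        self.phase2_rounds += 1
        self.log.append(value)   # slot index acts as the LSN in this term
        return len(self.log) - 1


leader = MultiPaxosLeader(ballot=7)
leader.elect()
for v in ["a", "b", "c"]:
    leader.append(v)
# One Phase 1 round amortized over three Phase 2 rounds.
assert (leader.phase1_rounds, leader.phase2_rounds) == (1, 3)
```

With a stable leader, the per-append cost is a single replication round trip, which is where the high-throughput/low-latency claim comes from.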
Steady state
[Figure: a log of LSNs 31–48 whose record copies are spread across storage acceptors placed in availability zones az1–az9.]
• Advantages:
  • Only f+1 record copies are needed to tolerate f acceptor failures
  • Log throughput and capacity are not bounded by a single storage acceptor
[Figure: slots (LSNs) 1–6 holding records R1–R5, each replicated on three of the storage acceptors N1–N9; the release pointer marks the committed (released) prefix. Each acceptor serves a record stream to the reader; gray Rn copies are filtered out by the SCD optimization so each record is shipped only once, and the reader merges the streams back into the ordered sequence R1, R2, R3, R4, R5.]
Log Segments and configuration management
• A log in LogDevice is a sequence of log segments indexed by a monotonically increasing epoch
• Each segment has its own fixed configuration; the idea is inspired by Stoppable Paxos [5]
[Figure: the current log tail sits in epoch 7, the latest segment, holding records e7n1, e7n2, ……]
Configuration of a log epoch segment (sequencer node, replication property, storage node set):
epoch: 1 | SEQ: N0 | [(region, 2), (node, 3)] | { N1, N2, N3, N4, N5 }
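One plausible way to model such a segment configuration is an immutable record keyed by epoch. The field names below are my own, inferred from the slide's layout, not LogDevice's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # fixed for the lifetime of the segment
class EpochConfig:
    epoch: int            # monotonically increasing segment index
    sequencer: str        # node running the sequencer (proposer)
    replication: tuple    # e.g. (("region", 2), ("node", 3))
    nodeset: frozenset    # storage acceptors serving this segment


cfg = EpochConfig(
    epoch=1,
    sequencer="N0",
    replication=(("region", 2), ("node", 3)),
    nodeset=frozenset({"N1", "N2", "N3", "N4", "N5"}),
)

# A reconfiguration starts a NEW segment with a higher epoch rather
# than mutating this one (frozen=True would raise on mutation).
cfg2 = EpochConfig(2, "N20", cfg.replication, frozenset({"N4", "N5", "N6"}))
assert cfg2.epoch > cfg.epoch
```

Keeping the configuration immutable per segment is what lets readers and recovery interpret every LSN unambiguously: the epoch part of the LSN selects exactly one configuration.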
3. The new sequencer gets its epoch and configuration from the epoch store.
[Figure: the epoch 1 sequencer owns records e1n1 …… e1n7 …… e1n12; the current released pointer trails the current log tail, leaving dirty slots in between; the epoch 2 sequencer has just fetched its epoch and configuration from the epoch store.]
4. The new sequencer performs Paxos Phase I to SEAL a Phase 1 (leader election) quorum of the storage node set for epoch 1, preventing it from completing new epoch 1 appends.
[Figure: the epoch 1 tail is now SEALED; the current released pointer and dirty slots over e1n1 …… e1n7 …… e1n12 are unchanged; the epoch 2 sequencer acts with its epoch from the epoch store.]
Seal, zoomed in:
The sequencer in epoch 2 seals a Phase 1 quorum of the configuration of the previous epoch segment (epoch 1); the new epoch number, 2, is used as the proposal/ballot number.
epoch: 1 | SEQ: N0 | [(region, 2), (node, 3)] | { N1, N2, N3, N4, N5, N6, N7, N8, N9, N10 }
epoch: 2 | SEQ: N20 | [(region, 2), (node, 3)] | { N4, N5, N6, N7, N8, N9, N10, N11 }
[Figure: the epoch 2 sequencer sends SEAL to the epoch 1 nodeset N1–N10.]
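On a storage node, sealing behaves like the Paxos promise: once the seal is durable, STOREs ballot-numbered with the old epoch are rejected. A minimal sketch with my own names, assuming (as the slide states) that the epoch doubles as the proposal/ballot number:

```python
class StorageNode:
    """Illustrative: per-log seal state on one storage acceptor."""

    def __init__(self):
        self.sealed_epoch = 0  # epochs <= this are sealed on this node

    def on_seal(self, epoch):
        # Phase 1 promise: refuse future stores from epochs <= `epoch`.
        self.sealed_epoch = max(self.sealed_epoch, epoch)
        return self.sealed_epoch

    def on_store(self, epoch, record):
        # Phase 2: only a sequencer of an unsealed (newer) epoch may store.
        if epoch <= self.sealed_epoch:
            return "REJECTED: sealed"
        return "STORED"


n1 = StorageNode()
assert n1.on_store(1, "e1n13") == "STORED"          # epoch 1 still active
n1.on_seal(1)  # the epoch 2 sequencer seals everything below epoch 2
assert n1.on_store(1, "e1n14") == "REJECTED: sealed"
assert n1.on_store(2, "e2n1") == "STORED"
```

Once a Phase 1 quorum of the epoch 1 nodeset is sealed, no epoch 1 append can still gather a Phase 2 quorum, because every Phase 2 quorum intersects the sealed Phase 1 quorum.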
5. The sequencer in epoch 2 can start taking new appends, but won't release these records, despite their being fully replicated, until epoch 1 is resolved.
[Figure: the epoch 1 tail is SEALED, with dirty slots between the current released pointer (around e1n7) and e1n12; epoch 2 already holds e2n1, e2n2 at the current log tail.]
6. At the same time, the sequencer in epoch 2 (along with any other potential successors) keeps running FPaxos (Phase I and II) to reach consensus on each slot of epoch 1 in the dirty range, and finally places a bridge record, also via FPaxos, to mark the end of epoch 1.
[Figure: the epoch 1 slots up to e1n13 have reached consensus behind the SEALED tail; slots that never reached consensus are filled with hole plugs*, and e1n13 is the bridge record; epoch 2 continues with e2n1, e2n2 at the current log tail.]
* A hole plug is inserted for LSN slots that were NOT ACKed originally, indicating a benign (non-dataloss) gap in the LSN sequence.
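The per-slot recovery decision mirrors Paxos value selection: if any acceptor in the Phase I quorum still holds a copy, that record is re-proposed; otherwise the append was never ACKed and the slot can safely become a hole plug. A hedged sketch (my own function shape, not LogDevice's recovery code):

```python
def recover_slot(promises):
    """Decide one dirty slot of the sealed epoch (illustrative sketch).

    `promises` are Phase I replies from a quorum of epoch 1 storage
    nodes: each entry is the (ballot, record) that node stored for this
    slot, or None if it has nothing.
    """
    stored = [p for p in promises if p is not None]
    if stored:
        # Some copy survives: re-propose the highest-ballot record so
        # the slot converges on it (standard Paxos value selection).
        _, record = max(stored, key=lambda p: p[0])
        return ("RECORD", record)
    # No acceptor in the quorum has a copy: the append was never ACKed
    # to the writer, so plug the hole to mark a benign gap.
    return ("HOLE_PLUG", None)


assert recover_slot([(1, "e1n8"), None, (1, "e1n8")]) == ("RECORD", "e1n8")
assert recover_slot([None, None, None]) == ("HOLE_PLUG", None)
```

The quorum-intersection argument is what makes the hole plug safe: any ACKed record reached a Phase II quorum, which must intersect the Phase I quorum consulted here, so an all-None reply proves the record was never ACKed.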
7. The sequencer in epoch 2 can finally release all records up to the fully replicated prefix of epoch 2.
[Figure: epoch 1 (SEALED) ends at the bridge record e1n13; the current released pointer now covers e2n1, e2n2 at the current log tail of epoch 2.]
Zero-move, out-of-band reconfiguration
• No data movement on reconfiguration
• Starting a new log segment requires only a transaction (epoch + replication configuration activation) in the metadata store
[Figure: the data path is unchanged by reconfiguration. A writer application's logdevice client lib sends APPEND to the sequencer (proposer), which STOREs copies on storage nodes N1 … N5 and replies APPENDED once enough nodes report STORED; a reader application's logdevice client lib consumes the record stream directly from the storage nodes.]
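The "only a metadata transaction" claim can be sketched as a compare-and-swap against the epoch store. The epoch store is a real LogDevice component, but this API is my own illustration, not its actual interface:

```python
class EpochStore:
    """Illustrative epoch store: reconfiguration = one CAS, no data moved."""

    def __init__(self):
        self.epoch = 1
        self.config = {"nodeset": {"N1", "N2", "N3", "N4", "N5"}}

    def activate(self, expected_epoch, new_config):
        # Atomically bump the epoch and install the next segment's
        # configuration; the old segment's records stay where they are.
        if self.epoch != expected_epoch:
            return None  # lost the race to another would-be sequencer
        self.epoch += 1
        self.config = new_config
        return self.epoch


store = EpochStore()
# Swap N1 out for N6: touches only metadata, not the records already
# stored in the epoch 1 segment.
new_epoch = store.activate(1, {"nodeset": {"N2", "N3", "N4", "N5", "N6"}})
assert new_epoch == 2
# A concurrent activation with a stale expected epoch fails cleanly,
# which is also what makes sequencer takeover race-free.
assert store.activate(1, {"nodeset": {"N9"}}) is None
```

Because old segments keep their own fixed configurations, no record has to move when the nodeset changes; readers simply consult the configuration matching each record's epoch.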