
LogDevice: the Consensus Story

Xi Xiong, LogDevice SWE


LogDevice
• Log data model built on a strongly consistent Paxos consensus engine
• Carefully chosen variants of Paxos to achieve:
  • fault tolerance with fewer copies
  • flexible quorums for highly available, high-throughput, and low-latency steady-state replication
  • zero-copy quorum reconfiguration with high availability
Log abstraction

Log data model
[Diagram: a log is an ordered sequence of slots addressed by LSNs (e.g., 31–48); a writer appends at the tail, the head is trimmed, and multiple reader streams consume the records.]

Log is the abstraction for reliable communication
• RPC: Thrift, etc.
  • requires the strongest inter-service dependencies (availability, RPC format, etc.)
• Log as a communication primitive
  • supports fan-out and streaming subscription
  • messages are durably replicated and persisted as ordered log records
  • messages can be independently replayed again and again by consumers
  • minimal inter-service dependencies
  • consumers can be down for hours or days and still catch up via backfill once they are back up
  • load isolation: consumers won't overwhelm the producer service
  • easier to handle data format changes
Log is the abstraction for distributed state replication and distribution
Let's talk Paxos
Concepts & Roles
• Proposers: propose values to be chosen
  • values are usually proposed on behalf of clients
• Acceptors: accept and persist decided values
• Learners: processes that wish to learn the chosen value
Goal: Agree on value "v" for a slot
[Animated diagram: a client request is received by a proposer, which runs the protocol against a set of acceptors.]

• Phase 1(a) Prepare: the proposer picks a proposal number n and sends PREPARE(n) to the acceptors.
• Phase 1(b) Promise: acceptors reply with PROMISE(n', v'); the proposer waits for a majority.
• Select value v: the v' with the largest n' among the PROMISEs received, or (in case no v' is received) the client-picked value.
• Phase 2(a) Propose: the proposer sends PROPOSE(n, v) to the acceptors.
• Phase 2(b) Accepted: acceptors reply with ACCEPT(n); once the proposer hears from a majority, value v is chosen. Any two majorities intersect.
• Phase 3 Commit (optional): the proposer sends COMMIT(n) to the acceptors.
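To make the acceptor-side rules above concrete, here is a minimal single-decree Paxos acceptor sketch in Python (the names and structure are illustrative assumptions, not LogDevice code):

# Minimal single-decree Paxos acceptor sketch (illustrative; not LogDevice code).
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Acceptor:
    promised_n: int = -1                          # highest proposal number promised so far
    accepted: Optional[Tuple[int, str]] = None    # last accepted (n', v'), if any

    def on_prepare(self, n: int):
        """Phase 1(b): promise to ignore proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            return ("PROMISE", self.accepted)     # report any previously accepted (n', v')
        return ("NACK", None)

    def on_propose(self, n: int, v: str):
        """Phase 2(b): accept the value unless a higher proposal was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, v)
            return ("ACCEPT", n)
        return ("NACK", None)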
Flexible Paxos
• Single-decree Paxos [1] restriction: Phase 1 and Phase 2 must each use a majority quorum of servers, so any two quorums intersect.

• Flexible Paxos [2]: not all quorums need to intersect. It is only required that any Phase 1 quorum and any Phase 2 quorum intersect.
[Diagram: the same Prepare / Promise / Propose / Accepted / Commit (optional) flow, but the proposer waits for a Phase 1 quorum of 8 out of 10 acceptors for PROMISE(n', v') and a Phase 2 quorum of only 3 out of 10 acceptors for ACCEPT(n); any Phase 1 quorum intersects any Phase 2 quorum.]
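A small sketch of this condition for simple counting quorums ("k out of n", as in the diagram above; assumption: no weighted or structured quorums here):

# Flexible Paxos with simple counting quorums: a Phase 1 quorum of size q1 and a
# Phase 2 quorum of size q2 over n acceptors always intersect iff q1 + q2 > n.
def quorums_always_intersect(n: int, q1: int, q2: int) -> bool:
    return q1 + q2 > n

assert quorums_always_intersect(10, 8, 3)       # 8-of-10 and 3-of-10 must overlap
assert not quorums_always_intersect(10, 5, 5)   # two disjoint halves are possible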
From Single-Decree Paxos to Multi-Paxos
What is Multi-Paxos
• Scaling Paxos from a single value to a growing chain of single-value consensus slots
• Practically, we need consensus on multiple values in distributed systems
• High throughput / low latency – one Phase 1 (leader election) + multiple Phase 2 (replication) rounds
• Directly maps to the log abstraction: append-only, immutable after consensus
Log is THE abstraction for multi-Paxos

[Diagram: the log data model again. Appending a record reaches consensus on a newly allocated slot; reader streams learn the established consensus of existing slots.]
Multi-Paxos + Flexible Quorums is a game changer
• Highly performant steady state via a larger acceptor membership and smaller replication quorums
  • Higher throughput: pipelined Phase 2 replication with small quorums (e.g., 3 out of 20)
  • Lower latency: the leader picks the best 3 out of 20
  • Higher write availability: with a larger acceptor membership, the leader can keep writing as long as any 3 of the 20 acceptors are up

• Leader election – less common
  • Phase 1 – leader election with a larger quorum (e.g., 18 out of 20), only during leader failover
[Diagram: timeline of Phase 1 (leader election) won by Proposer A, followed by a long steady state of Phase 2 replication for slot 1, slot 2, …, and then another Phase 1 leader election won by Proposer B.]
Question: with a larger Phase 1 quorum (e.g., 18 out of 20), is it more difficult (i.e., less available) to elect a leader?
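For intuition only, here is a rough calculation under the (oversimplified) assumption that acceptors fail independently with the same probability; correlated failures are exactly what the next slides address:

# Probability that a "k out of n" quorum is available when each acceptor is
# independently up with probability p (simplified model for intuition only).
from math import comb

def quorum_availability(n: int, k: int, p: float) -> float:
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(quorum_availability(20, 3, 0.99))    # Phase 2 quorum (3 of 20): ~1.0
print(quorum_availability(20, 18, 0.99))   # Phase 1 quorum (18 of 20): noticeably lower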
Failure domain aware placement
• Goal: improve availability and fault tolerance for Phase 1 (leader election)

• Solution: failure-domain-aware placement
  • reduce the size requirement of the Phase 1 quorum by enforcing topology constraints on Phase 2 quorums during replication
  • result: the Phase 1 quorum requires a much smaller number of acceptors during correlated failures
Flexible Quorums example: Grid Quorums
[Diagram: acceptors arranged in a grid with rows az1–az4.]

(a) Basic Paxos: Phase 1 and Phase 2 quorums are each a simple majority.
(b) Flexible Paxos (grid quorums): Phase 1 quorum: one full Availability Zone (AZ); Phase 2 quorum: a node in each AZ.

Neither the Phase 1 nor the Phase 2 quorum needs a simple majority!

→ Significantly reduced minimal number of acceptors required: floor(M*N/2)+1 -> M+N-1
→ Higher availability and better latency.
→ Better data availability and durability in correlated failures
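A sketch of the acceptor-count comparison above, for an M x N grid (M AZs with N nodes each; treating that mapping of M and N as an assumption from my reading of the slide):

# Grid quorums: compare the acceptors touched by basic Paxos vs. the grid scheme.
def basic_paxos_quorum(m: int, n: int) -> int:
    # Simple majority over all M*N acceptors.
    return (m * n) // 2 + 1

def grid_quorums_total(m: int, n: int) -> int:
    # Phase 1 quorum = one full AZ (n nodes); Phase 2 quorum = one node per AZ (m nodes).
    # They always share at least one node, so together they touch at most m + n - 1 acceptors.
    return m + n - 1

print(basic_paxos_quorum(4, 5), grid_quorums_total(4, 5))   # 11 vs. 8 for a 4x5 grid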
nodeset size: 18
replication property: (region,2)(az,3)(node,4)

[Diagram: a placement tree rooted at "root" with three regions (Oregon, N. Carolina, Texas), nine availability zones (az1–az9), and 18 nodes (N1–N18). A replication quorum (copyset) and a leader election quorum (f-majority) are highlighted, first in the healthy case and then under three failure scenarios: (1) loss of one entire region, (2) loss of 2 entire AZs, (3) loss of any 3 nodes.]
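As a rough illustration of how a replication property like (region,2)(az,3)(node,4) constrains a copyset, here is a hypothetical checker (the data layout and names are assumptions for illustration, not LogDevice's internal representation):

# Check that a copyset spans at least the required number of distinct failure
# domains at every scope of the replication property.
def satisfies_replication(copyset, replication_property):
    # copyset: list of dicts like {"node": "N1", "az": "az1", "region": "Oregon"}
    for scope, minimum in replication_property.items():
        if len({replica[scope] for replica in copyset}) < minimum:
            return False
    return True

prop = {"region": 2, "az": 3, "node": 4}
copyset = [
    {"node": "N1",  "az": "az1", "region": "Oregon"},
    {"node": "N5",  "az": "az2", "region": "Oregon"},
    {"node": "N8",  "az": "az4", "region": "N. Carolina"},
    {"node": "N14", "az": "az7", "region": "Texas"},
]
assert satisfies_replication(copyset, prop)          # 4 nodes, 4 AZs, 3 regions: OK
assert not satisfies_replication(copyset[:3], prop)  # only 3 nodes: violates (node, 4)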
Storing and learning consensus results
• Consensus log records are stored across acceptors in a data-striping fashion
  • storage acceptors do not store a full copy of the log
  • flexible Paxos enables disjoint small replication quorums over a large acceptor membership – perfect for striping

• Advantages:
  • only f+1 record copies are needed to tolerate f acceptor failures
  • log throughput and capacity are not bounded by a single storage acceptor

• Learning the result of consensus: reading the log via streaming (see the sketch after the diagram below)
  • acceptors stream their local copies of committed records
  • the client reader merges all acceptor record streams in slot (LSN) order
  • the Single Copy Delivery (SCD) optimization achieves 1x read amplification
[Diagram: storage acceptors N1–N9 with slots (LSNs) 1–5 striped across them; each record R1–R5 is stored on 3 of the 9 acceptors, and records up to the release pointer are committed (released). Each acceptor streams its local copies to the reader (gray Rn copies are filtered out by the SCD optimization), and the reader merges the streams into R1, R2, R3, R4, R5 in LSN order.]
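A minimal sketch of the reader-side stream merge (SCD filtering happens on the acceptors, so any duplicate copies that do arrive are simply dropped here; the per-acceptor streams and names are illustrative assumptions):

# Merge per-acceptor record streams into one stream ordered by slot (LSN).
import heapq

def merge_streams(acceptor_streams):
    # acceptor_streams: iterables of (lsn, record) tuples, each already sorted by lsn.
    delivered = set()
    for lsn, record in heapq.merge(*acceptor_streams):
        if lsn not in delivered:     # extra copies of the same slot are dropped
            delivered.add(lsn)
            yield lsn, record

n1 = [(1, "R1"), (4, "R4")]
n2 = [(2, "R2"), (5, "R5")]
n3 = [(1, "R1"), (3, "R3")]
print(list(merge_streams([n1, n2, n3])))
# [(1, 'R1'), (2, 'R2'), (3, 'R3'), (4, 'R4'), (5, 'R5')]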
Log Segments and configuration management
• A log in LogDevice -> a sequence of log segments indexed by a monotonically increasing epoch
  • each segment has its own fixed configuration; the idea is inspired by Stoppable Paxos [5]

• Reconfiguration can happen out of band of replication via an auxiliary metadata store
  • epoch store: stores log segment configurations; backed by Zeus (ZooKeeper)
  • the auxiliary metadata store is inspired by Vertical Paxos [3]

• A new log segment is started when:
  • the leader (sequencer) fails over
  • the replication property or storage acceptor membership is reconfigured

• A similar design is also adopted by Delos [4]

Log Segments
[Diagram: a log as a stack of epoch segments, each ending in a bridge record: epoch 1 (e1n1, e1n2, …, e1n7, e1n8), epoch 2 (e2n1, e2n2, …, e2n18, e2n19), epoch 3 (e3n1, …), …, up to epoch 7 (latest: e7n1, e7n2, …), which contains the current log tail.]
configuration of a log epoch segment

epoch: 1 | SEQ: N0 | [(region, 2), (node, 3)] | { N1, N2, N3, N4, N5 }

(fields: epoch | sequencer (leader) | replication property | storage node set (acceptors))
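A sketch of that per-epoch configuration as a record type (the field names are mine; the slide does not specify the actual schema):

# Per-epoch segment configuration as stored in the epoch store (illustrative shape).
from dataclasses import dataclass

@dataclass(frozen=True)
class EpochConfig:
    epoch: int            # monotonically increasing segment index, also used as the ballot
    sequencer: str        # leader / proposer node, e.g. "N0"
    replication: dict     # replication property, e.g. {"region": 2, "node": 3}
    nodeset: frozenset    # storage node set (acceptors), e.g. {"N1", ..., "N5"}

cfg = EpochConfig(
    epoch=1,
    sequencer="N0",
    replication={"region": 2, "node": 3},
    nodeset=frozenset({"N1", "N2", "N3", "N4", "N5"}),
)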
epoch transition: Sealing and Bridge record
• Starting a new log segment requires first "Sealing" the previous log segment.
  • a procedure similar to executing Paxos Phase 1 on a leader election quorum of the previous segment
  • once sealing is done, no append request can be successfully ACKed in the sealed log epoch segment

• After Sealing, an epoch recovery procedure is performed to:
  • learn the last appended record slot in the sealed epoch segment
  • place a bridge record immediately after the last record, marking the end of the log segment
  • once the bridge record is placed, the log epoch segment becomes immutable (until trimmed)
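A simplified sketch of the sealing step (the helper send_seal and its return values are assumptions for illustration; the real procedure is the Phase 1-like protocol described above):

# Seal the previous epoch segment by getting a Phase 1 (leader election) quorum of
# its storage nodes to reject any append with an epoch lower than the new one.
def seal_previous_epoch(prev_nodeset, phase1_quorum_size, new_epoch, send_seal):
    """send_seal(node, epoch) -> the node's local tail LSN if it accepts the seal, else None."""
    sealed_tails = []
    for node in prev_nodeset:
        tail = send_seal(node, new_epoch)
        if tail is not None:
            sealed_tails.append(tail)
        if len(sealed_tails) >= phase1_quorum_size:
            # Once a Phase 1 quorum is sealed, no replication quorum of the old epoch
            # can complete without touching a sealed node, so no new append can be ACKed.
            return max(sealed_tails)   # an upper bound on the old segment's tail, used by recovery
    raise RuntimeError("could not seal: Phase 1 quorum of the previous epoch unavailable")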
Animation: Leader (sequencer) failure scenario

0. The sequencer in epoch 1 is in steady-state replication.
[Diagram: epoch 1 segment e1n1, e1n2, …, e1n7, …, e1n12; the current released pointer trails the current log tail.]
1. The sequencer in epoch 1 fails or is partitioned.
[Diagram: the slots between the current released pointer and the current log tail become dirty slots.]
2. A new sequencer is elected by the failure detector.
[Diagram: same state as above, with a new sequencer appearing alongside the failed one.]
3. The new sequencer gets its epoch and configuration from the epoch store.
[Diagram: the new sequencer, now in epoch 2, reads from the epoch store; epoch 1 still has dirty slots before its current log tail.]
4. The new sequencer performs Paxos Phase 1 to SEAL a Phase 1 (leader election) quorum of the storage node set for epoch 1, preventing epoch 1 from completing new appends.
[Diagram: epoch 1 is now marked SEALED at its tail; its dirty slots remain unresolved.]
Seal zoom-in:
The sequencer in epoch 2 seals a Phase 1 quorum of the previous epoch segment's configuration; the new epoch number (2) is used as the proposal/ballot number.

epoch: 1 | SEQ: N0  | [(region, 2), (node, 3)] | { N1, N2, N3, N4, N5, N6, N7, N8, N9, N10 }
epoch: 2 | SEQ: N20 | [(region, 2), (node, 3)] | { N4, N5, N6, N7, N8, N9, N10, N11 }

[Diagram: the epoch 2 sequencer sends Seal to the epoch 1 node set N1–N10.]
4. The sequencer in epoch 2 can start taking new appends, but won't release these records even though they are fully replicated.
[Diagram: epoch 2 appends e2n1, e2n2 at the new current log tail, while epoch 1 (SEALED) still has dirty slots before the epoch 1 tail.]
5. At the same time, the sequencer in epoch 2 (along with any potential successors) keeps running FPaxos (Phase 1 and 2) to reach consensus on each slot of epoch 1 in the dirty range, finally placing a bridge record, also via FPaxos, to mark the end of epoch 1.
[Diagram: the dirty slots of epoch 1 reach consensus; hole plugs* fill the unACKed slots and a bridge record e1n13 is placed after the last record e1n12, while epoch 2 continues with e2n1, e2n2 at the current log tail.]

* A hole plug is inserted for LSN slots that were NOT ACKed originally, indicating a benign (non-dataloss) gap in the LSN sequence.
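A very rough sketch of this recovery pass (the helper functions are hypothetical; the real procedure runs full FPaxos Phase 1 and 2 per slot and handles re-replication, concurrent successors, and many more edge cases):

# Epoch recovery over the dirty range of the sealed epoch (simplified illustration).
def recover_sealed_epoch(dirty_lsns, sealed_tail_lsn, read_copies,
                         rereplicate, write_hole_plug, write_bridge):
    """dirty_lsns: slots between the last released record and the sealed epoch's tail."""
    for lsn in sorted(dirty_lsns):
        copies = read_copies(lsn)        # copies seen on a Phase 1 quorum of the old nodeset
        if copies:
            rereplicate(lsn, copies)     # drive the surviving value to a full replication quorum
        else:
            write_hole_plug(lsn)         # never ACKed: mark a benign (non-dataloss) gap
    write_bridge(sealed_tail_lsn + 1)    # bridge record marks the end of the sealed segment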
6. The sequencer in epoch 2 can finally release all records up to the fully replicated prefix of epoch 2.
[Diagram: the current released pointer advances past the epoch 1 bridge record into epoch 2, up to the current log tail.]
Zero-move, out-of-band reconfiguration
• No data movement on reconfiguration
  • starting a new log segment only requires a transaction in the metadata store

• Out-of-band reconfiguration benefits:
  • allows different requirements and design choices in the data plane vs. the metadata plane
    • trade-offs on durability, availability, throughput, …
  • higher availability during reconfiguration
    • scenario: steady-state log replication is stuck (e.g., quorum loss)
    • in-band: cannot reconfigure; requires manual intervention!
    • out-of-band: reconfigure by starting a new log segment with a new, healthy acceptor membership
  • low reconfiguration latency
    • reconfiguration latency: a TX in the metadata store + Sealing the previous segment
    • no joint consensus, no intermediary transition, not blocked by data replication
Highlights
• Superior steady-state replication performance
• Smart placement for IaaS-compliant failure modelling
• Only f+1 copies required to tolerate f failures
  • 40% less space compared with Raft when f = 2
• Low-latency, zero-move reconfiguration
• Log capacity and throughput not bounded by a single node
• High write availability from out-of-band reconfiguration
Takeaways
• Log is THE abstraction for modeling multi-Paxos
• LogDevice is a managed service with the strong consistency of multi-Paxos, made highly performant and efficient by flexible quorums
• Designed to be a reliable, scalable and flexible service
LogDevice: Paxos at Facebook Scale
• LogDevice powers the Scribe use case:
  https://engineering.fb.com/data-infrastructure/scribe/
  • "the total size of these logs is several petabytes every hour."
• 2.5 TB/s writes; 7 TB/s reads globally
References
• [1] Paxos Made Simple. Leslie Lamport. 2001
• [2] Flexible Paxos: Quorum intersection revisited. Heidi Howard, Dahlia
Malkhi, Alexander Spiegelman. 2016
• [3] Vertical Paxos and Primary-Backup Replication. Leslie Lamport, Dahlia
Malkhi, and Lidong Zhou. 2009
• [4] Delos: Simple, flexible control plane storage. Mahesh Balakrishnan and
Jason Flinn. 2019
• [5] Stoppable Paxos. Leslie Lamport, Dahlia Malkhi, and Lidong Zhou. 2008
• [6] Gossip-Style Failure Detection and Distributed Consensus for Scalable
Heterogeneous Clusters. Sridharan Ranganathan, et al. 2001
Appendix
Leader election and APPEND routing
• Leader election is a plug-in in the Multi-Paxos context
• Gossip-based failure detector [6]
  • cluster nodes exchange gossip periodically (e.g., every 100 ms)
  • nodes maintain a local cluster state; clients poll the cluster state from servers
• Placement and routing: weighted consistent hashing (see the sketch after this list)
  • input: logid, sequencer configuration (map of node -> weight), cluster state
  • output: sequencer node id for the log
• "Soft consensus"
  • best effort for local views to converge quickly (i.e., within 1–3 seconds)
  • failing to achieve that won't affect correctness, but may affect liveness (i.e., availability / latency)
  • sequencer ping-pong issue
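A sketch of the placement/routing step, using weighted rendezvous (highest-random-weight) hashing as a stand-in for the weighted consistent hashing mentioned above (the exact scheme isn't specified on the slide; the names and hash choice are assumptions):

# Pick the sequencer node for a log from the sequencer configuration and cluster state.
import hashlib
import math

def pick_sequencer(logid: int, node_weights: dict, alive_nodes: set) -> str:
    def score(node: str, weight: float) -> float:
        digest = hashlib.sha256(f"{logid}:{node}".encode()).hexdigest()
        h = (int(digest, 16) + 1) / (2**256 + 2)    # uniform value in (0, 1)
        return -weight / math.log(h)                # weighted rendezvous hashing score
    candidates = {n: w for n, w in node_weights.items() if n in alive_nodes and w > 0}
    return max(candidates, key=lambda n: score(n, candidates[n]))

# All clients with the same view pick the same node; if it dies, routing moves to the next best.
print(pick_sequencer(42, {"N1": 1.0, "N2": 1.0, "N3": 2.0}, {"N1", "N2", "N3"}))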
[Diagram: overall write/read path. The writer application's logdevice client lib sends APPEND to the sequencer (proposer); on activation the sequencer obtains its epoch and replication configuration (ballot, config) from the epoch store, issues STORE to the storage nodes (acceptors) N1, N2, N3, N4, N5, …, and replies APPENDED to the writer after the storage nodes reply STORED. The reader application's logdevice client lib reads RECORD streams directly from the storage nodes.]