DK Stalk
Ali Ghodsi
http://www.sics.se/~ali/thesis/
Presentation Overview
Gentle introduction to DHTs
Contributions
The future
So what?
Characteristic properties
Scalability
  Number of nodes can be huge
  Number of items can be huge
  Time to find data is logarithmic
  Size of routing tables is logarithmic
  Example: log2(1000000) ≈ 20
  Store number of items proportional to number of nodes
  Typically: with D items and n nodes, store D/n items per node
Self-manage in presence of joins/leaves/failures
  Routing information
  Data items
Self-management of routing info:
  Ensure routing information is up-to-date
Self-management of items:
  Ensure that data is always replicated and available
What's been the general motivation for DHTs?
Gnutella
  Completely decentralized
  Ask everyone you know to find data
  Very inefficient
Decentralized index
Our philosophy
DHT is a useful data structure
Assumptions might not be true
Moderate amount of dynamism
Leave not same thing as failure
Dedicated servers
Nodes can be trusted
Less heterogeneity
How to construct a DHT?
[Figure: identifier ring, 0 to 15]
Definition of Successor
The successor of an identifier is the first node met going in clockwise direction starting at the identifier
Example
  succ(12)=14
  succ(15)=2
  succ(6)=6
[Figure: identifier ring, 0 to 15]
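The successor definition can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function name `succ` follows the slide, but the node set `{2, 6, 14}` is an assumption chosen only so that the three results on the slide hold, and the 16-identifier space is taken from the ring figures.

```python
from bisect import bisect_left

def succ(identifier, nodes, space=16):
    """Return the first node met going clockwise, starting at identifier."""
    ring = sorted(nodes)
    i = bisect_left(ring, identifier % space)
    return ring[i % len(ring)]  # wrap around past the largest node

# Hypothetical node set, chosen to reproduce the slide's example.
nodes = {2, 6, 14}
print(succ(12, nodes), succ(15, nodes), succ(6, nodes))
```

Note that an identifier occupied by a node is its own successor, which is why succ(6)=6.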
Example
  H(Marina)=12
  H(Peter)=2
  H(Seif)=9
  H(Stefan)=14
Node n is responsible for item k
With D items and n nodes
  Store D/n items per node
  Move D/n items when nodes join/leave/fail
EFFICIENT!
[Figure: identifier ring, 0 to 15, with the hashed items]
Example
  0's successor is succ(1)=2
  2's successor is succ(3)=5
  5's successor is succ(6)=6
  6's successor is succ(7)=11
  11's successor is succ(12)=0
[Figure: identifier ring, 0 to 15, with nodes 0, 2, 5, 6 and 11]
DHT Lookup
To look up a key k
  Calculate H(k)
  Follow succ pointers until item k is found
Key        Value
Alexander  Berlin
Marina     Gothenburg
Peter      Louvain-la-Neuve
Seif       Stockholm
Stefan     Stockholm
Example
  Lookup Seif at node 2
  H(Seif)=9
  Traverse nodes: 2, 5, 6, 11 (BINGO)
[Figure: identifier ring, 0 to 15, with the lookup path]
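The lookup walk can be sketched as follows. This is an illustration under stated assumptions: the successor map is the five-node ring from the earlier example, the hash values are the ones printed on the slides rather than the output of a real hash function, and the helper names `in_interval` and `lookup` are made up for this sketch.

```python
def in_interval(x, a, b):
    """True if x lies in the ring interval (a, b], walking clockwise from a."""
    if a < b:
        return a < x <= b
    return x > a or x <= b  # the interval wraps past the top of the space

def lookup(start, ident, succ_of):
    """Follow succ pointers from start; return the list of nodes visited."""
    path = [start]
    node = start
    while ident != node and not in_interval(ident, node, succ_of[node]):
        node = succ_of[node]
        path.append(node)
    if ident != node:
        path.append(succ_of[node])  # this successor is responsible for ident
    return path

succ_of = {0: 2, 2: 5, 5: 6, 6: 11, 11: 0}              # ring from the slides
H = {"Marina": 12, "Peter": 2, "Seif": 9, "Stefan": 14}  # values from the slides
print(lookup(2, H["Seif"], succ_of))
```

With only successor pointers the walk visits up to every node on the ring, which is the cost the next slides set out to reduce.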
Speeding up lookups
If only pointer to succ(n+1) is used
  Worst case lookup time is N, for N nodes
Improving lookup time
  Point to succ(n+1)
  Point to succ(n+2)
  Point to succ(n+4)
  Point to succ(n+8)
  …
  Point to succ(n+2^M)
Time to find data is logarithmic
Size of routing tables is logarithmic
Example: log2(1000000) ≈ 20
[Figure: identifier ring, 0 to 15, with the additional pointers]
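A small sketch of the pointer placement, assuming the five-node ring from the earlier example and a 16-identifier space. The function name `fingers` is borrowed from Chord terminology and is not used on the slides.

```python
import math
from bisect import bisect_left

def succ(identifier, ring, space):
    """First node met going clockwise from identifier; ring is a sorted list."""
    i = bisect_left(ring, identifier % space)
    return ring[i % len(ring)]

def fingers(n, ring, space):
    """Pointers of node n: succ(n+1), succ(n+2), succ(n+4), ... within the space."""
    m = space.bit_length() - 1  # assumes space is a power of two, space = 2**m
    return [succ(n + 2**i, ring, space) for i in range(m)]

ring = [0, 2, 5, 6, 11]
print(fingers(0, ring, 16))
# Number of pointers (and hops) needed for a million identifiers.
print(math.ceil(math.log2(1_000_000)))
```

Each hop along the largest usable pointer at least halves the remaining distance, which is where the logarithmic lookup time comes from.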
If successor fails
If predecessor fails
Set pred to nil
Handling Dynamism
Periodic stabilization used to make pointers eventually correct
  Try pointing succ to closest alive successor
  Try pointing pred to closest alive predecessor
Handling joins
When n joins
  Find n's successor with lookup(n)
  Set succ to n's successor
  Stabilization fixes the rest
Periodically at n:
  1. set v:=succ.pred
  2. if v≠nil and v is in (n,succ]
  3.   set succ:=v
  4. send a notify(n) to succ
When receiving notify(p) at n:
  1. if pred=nil or p is in (pred,n]
  2.   set pred:=p
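The join and stabilization steps can be exercised in a small single-process simulation. This is a sketch under assumptions: message passing is replaced by direct method calls, there are no failures, and the names `Node`, `between`, `join` and `build_ring` are invented here. Only the bodies of `stabilize` and `notify` follow the pseudocode on the slide.

```python
class Node:
    def __init__(self, ident):
        self.id = ident
        self.succ = self  # a lone node is its own successor
        self.pred = None  # nil until some node notifies us

def between(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b  # wrapped interval; a == b covers the whole ring

def stabilize(n):
    v = n.succ.pred
    if v is not None and between(v.id, n.id, n.succ.id):
        n.succ = v
    notify(n.succ, n)

def notify(n, p):
    if n.pred is None or between(p.id, n.pred.id, n.id):
        n.pred = p

def join(new, known):
    """Find new's successor by walking from a known node, then point at it."""
    node = known
    while not between(new.id, node.id, node.succ.id):
        node = node.succ
    new.succ = node.succ

def build_ring(ids, rounds=5):
    nodes = [Node(ids[0])]
    for ident in ids[1:]:
        new = Node(ident)
        join(new, nodes[0])
        nodes.append(new)
        for _ in range(rounds):  # stabilization fixes the rest
            for n in nodes:
                stabilize(n)
    return nodes

nodes = build_ring([0, 11, 5, 2, 6])
print(sorted((n.id, n.succ.id, n.pred.id) for n in nodes))
```

The join itself only sets one pointer. The predecessor's succ and the successor's pred are repaired by later stabilization rounds, which is why lookups can be inconsistent in between.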
Handling leaves
When n leaves
  Just disappear (like failure)
Periodically at n:
  1. set v:=succ.pred
  2. if v≠nil and v is in (n,succ]
  3.   set succ:=v
  4. send a notify(n) to succ
Outline
Lookup consistency
Outline
Atomic Ring Maintenance
Lookup consistency
Naive Approach
Each node i hosts a lock called L_i
For p to join or leave:
  First acquire L_{p.pred}
  Second acquire L_p
  Third acquire L_{p.succ}
  Thereafter update relevant pointers
Safety
Non-interference theorem:
When node p acquires both locks:
  Node p's successor cannot leave
  Node p's predecessor cannot leave
  Other joins cannot affect relevant pointers
Dining Philosophers
Problem similar to the Dining Philosophers problem
Five philosophers around a table
  One fork between each philosopher (5)
  Philosophers eat and think
  To eat:
    grab left fork
    then grab right fork
Deadlocks
Can result in a deadlock
  If all nodes acquire their first lock
  Every node waiting indefinitely for second lock
Pitfalls
Join adds node/philosopher
  Solution: some requests in the lock queue forwarded to new node
[Figure: nodes 10, 12, 14 and 15 with their lock queues]
Pitfalls
Leave removes a node/philosopher
  Problem: if leaving node gives lock queue to its successor, nodes can get worse position in queue: starvation
Correctness
Liveness Theorem:
  Algorithm is starvation free
  Also free from deadlocks and livelocks
Performance drawbacks
If many neighboring nodes are leaving
  All grab local lock
  Sequential progress
[Figure: nodes 10, 12, 14 and 15]
Solution
  Randomized locking
  Release locks and retry
  Liveness with high probability
Lookup consistency
Goal is to make joins and leaves appear as if they happened instantaneously
Every leave has a leave point
  A point in global time, where the whole system behaves as if the node instantaneously left
Leave Algorithm
Node p, Node q (leaving), Node r
Events and messages, in order:
  leave point: LeaveForward=true
  <LeavePoint, pred=p>
  pred:=p
  <UpdateSucc, succ=r>
  succ:=r
  <StopForwarding>
  LeaveForward=false
Join Algorithm
Node p, Node q (joining), Node r
Events and messages, in order:
  <UpdatePred, pred=q>
  succ:=q
  <StopForwarding>
  <Finish>
  JoinForwarding=false
Outline
What about failures?
Contributions
Lookup consistency in presence of joins/leaves
  System not affected by joins/leaves
  Inserts do not disappear
Related Work
Li, Misra, Plaxton ('04, '06) have a similar solution
Advantages
  Assertional reasoning
  Almost machine verifiable proofs
Disadvantages
  Starvation possible
  Not used for lookup consistency
  Failure-free environment assumed
Related Work
Lynch, Malkhi, Ratajczak ('02), position paper with pseudo code in appendix
Advantages
  First to propose atomic lookup consistency
Disadvantages
  No proofs
  Message might be sent to a node that left
  Does not work for both joins and leaves together
  Failures not dealt with
Outline
Additional Pointers on the Ring
Routing
Generalization of Chord to provide arbitrary arity
Provide log_k(n) hops per lookup
  k being a configurable parameter
  n being the number of nodes
Instead of only log_2(n)
[Figure: node 0's view of the identifier space, divided into Interval 0 to Interval 3 (I0 to I3). Level 2: 0-3, 4-7, 8-11, 12-15. Level 3: 0, 1, 2, 3]
Arity important
Maximum number of hops can be configured
  $k = N^{1/r}$
  $\log_{\sqrt{N}}(N) = 2$
Placing pointers
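Each node has (k-1)log_k(N) pointers
Node p's pointers point at
  $f(i) = p \oplus (1 + ((i-1) \bmod (k-1)))\,k^{\lfloor (i-1)/(k-1) \rfloor}$
  where $\oplus$ is addition modulo N
Node 0's pointers
  f(1)=1, f(2)=2, f(3)=3, f(4)=4, f(5)=8, f(6)=12, f(7)=16, f(8)=32, f(9)=48
[Figure: node 0's pointers at 4, 8, 12, 16, 32 and 48]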
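The pointer formula can be checked against the values listed for node 0. The sketch below assumes k=4 and N=64 (so three levels), which is inferred from the nine listed pointers rather than stated on the slide; the function name `pointers` is made up for illustration.

```python
def pointers(p, k, levels):
    """Identifiers f(1), f(2), ... pointed at by node p when N = k**levels."""
    n = k ** levels
    out = []
    for i in range(1, (k - 1) * levels + 1):
        # (1 + ((i-1) mod (k-1))) * k^floor((i-1)/(k-1)), added to p modulo N
        offset = (1 + (i - 1) % (k - 1)) * k ** ((i - 1) // (k - 1))
        out.append((p + offset) % n)
    return out

print(pointers(0, 4, 3))
```

Setting k=2 reduces the offsets to 1, 2, 4, 8, and so on, which is the binary pointer placement shown earlier in the talk.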
Greedy Routing
lookup(i) algorithm
  Use pointer closest to i, without overshooting i
  If no such pointer exists, succ is responsible for i
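A sketch of the greedy rule on a sparsely populated ring. Several things here are assumptions: the node set is hypothetical, k=4 and N=64 are inferred from the pointer example, each pointer is resolved to the successor of its target identifier, and "without overshooting" is read as "strictly before i", so that the node holding i is reached through the succ rule.

```python
from bisect import bisect_left

def succ(ident, ring, n):
    """First node met going clockwise from ident; ring is a sorted list."""
    i = bisect_left(ring, ident % n)
    return ring[i % len(ring)]

def pointers(p, ring, k, levels):
    """Nodes that p points at: the successors of its k-ary pointer identifiers."""
    n = k ** levels
    offsets = [(1 + (i - 1) % (k - 1)) * k ** ((i - 1) // (k - 1))
               for i in range(1, (k - 1) * levels + 1)]
    return [succ(p + off, ring, n) for off in offsets]

def greedy_lookup(start, target, ring, k, levels):
    """Return the nodes visited; the last one is responsible for target."""
    n = k ** levels
    node, path = start, [start]
    while node != target:
        remaining = (target - node) % n
        # Pointers that lie strictly between node and target (no overshoot).
        usable = [q for q in pointers(node, ring, k, levels)
                  if 0 < (q - node) % n < remaining]
        if not usable:
            path.append(succ(node + 1, ring, n))  # succ is responsible
            return path
        node = max(usable, key=lambda q: (q - node) % n)
        path.append(node)
    return path

ring = [0, 5, 12, 20, 33, 47, 60]  # hypothetical nodes in a 64-identifier space
print(greedy_lookup(0, 46, ring, 4, 3))
```

Every hop strictly shrinks the clockwise distance to i, so the walk terminates, and it stops exactly when i falls between the current node and its successor.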
Fault-free Algorithm
No routing failures
General Routing
Three lookup styles
  Recursive
  Iterative
  Transitive
Reliable Routing
Reliable lookup for each style
  If initiator doesn't crash, responsible node reached
  No redundant delivery of messages
General strategy
  Repeat operation until success
  Filter duplicates using unique identifiers
Iterative lookup
  Reliability easy to achieve
Recursive lookup
  Several algorithms possible
Transitive lookup
  Efficient reliability hard to achieve
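The general strategy (repeat until success, filter duplicates by unique identifier) can be illustrated with a toy sender and receiver. Everything named here is hypothetical: the `Receiver` class, the `make_lossy_channel` helper that loses the first few acknowledgements, and the message identifier format. It is not the reliable lookup algorithm from the thesis, only the retry-and-deduplicate idea.

```python
class Receiver:
    def __init__(self):
        self.seen = set()      # identifiers of messages already delivered
        self.delivered = []

    def receive(self, msg_id, payload):
        if msg_id not in self.seen:  # filter duplicates
            self.seen.add(msg_id)
            self.delivered.append(payload)
        return "ack"                 # acknowledge duplicates too

def make_lossy_channel(drop_first_acks):
    """Hypothetical channel: delivers every request, loses the first few acks."""
    state = {"to_drop": drop_first_acks}

    def channel(receiver, msg_id, payload):
        reply = receiver.receive(msg_id, payload)
        if state["to_drop"] > 0:
            state["to_drop"] -= 1
            return None  # ack lost; the sender cannot tell what happened
        return reply

    return channel

def send_reliably(channel, receiver, msg_id, payload, max_tries=10):
    """Repeat the operation until an ack arrives; return the number of attempts."""
    for attempt in range(1, max_tries + 1):
        if channel(receiver, msg_id, payload) == "ack":
            return attempt
    raise RuntimeError("gave up after max_tries attempts")

r = Receiver()
attempts = send_reliably(make_lossy_channel(2), r, ("node2", 1), "lookup(9)")
print(attempts, r.delivered)
```

The request reaches the receiver three times but is delivered once, which is the "no redundant delivery" property the slide asks for.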
Outline
One-to-many Communication
Broadcast Algorithms
Correctness conditions:
Termination
Algorithm should eventually terminate
Coverage
All nodes should receive the broadcast message
Non-redundancy
Each node receives the message at most once
Naive Broadcast
Naive Broadcast Algorithm
  send message to succ until: initiator reached or overshot
Improvement
  Initiator delegates half the space to neighbor
[Figure: identifier ring, 0 to 15, with the initiator marked]
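The delegation idea can be simulated to check the three correctness conditions. This sketch is an assumption-heavy simplification: it works on a globally known, clockwise-ordered list of nodes and splits by node count, whereas a real node would delegate through its own routing pointers. The function name `broadcast` and the returned counters are made up for illustration.

```python
def broadcast(order):
    """order: nodes listed clockwise, initiator first.
    Returns (times each node received the message, messages sent)."""
    received = {node: 0 for node in order}
    messages = 0

    def cover(idx, limit):
        # order[idx] has the message and must cover order[idx+1:limit].
        nonlocal messages
        while limit - idx > 1:
            mid = idx + (limit - idx + 1) // 2  # first node of the upper half
            received[order[mid]] += 1           # send to that node ...
            messages += 1
            cover(mid, limit)                   # ... and delegate the upper half
            limit = mid                         # keep covering the lower half

    received[order[0]] += 1  # the initiator delivers to itself
    cover(0, len(order))
    return received, messages

counts, sent = broadcast([0, 2, 5, 6, 11])
print(counts, sent)
```

Every node receives the message exactly once and n-1 messages are sent in total, compared with the naive version where the message travels the whole ring one hop at a time.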
Advanced Broadcast
Old algorithm on k-ary trees
Getting responses
Getting a reply
  Nodes send directly back to initiator
  Not scalable
Outline
Advanced One-to-many Communication
Expensive
  One node making 1000 lookups
  Marshaling/unmarshaling 1000 requests
Bulk Operation
Define a bulk set: I
  A set of identifiers
bulk_operation(m, I)
  Send message m to every node i ∈ I
bulk_own(m, I)
  Send m to every node responsible for an identifier i ∈ I
Example
  Bulk set I={4}
  Node 4 might not exist
  Some node is responsible for identifier 4
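The difference between the two operations comes down to which nodes end up as recipients. The sketch below computes only the recipient sets, assuming global knowledge of the five-node example ring. It does not model how the message is routed, and the helper names `bulk_operation_targets` and `bulk_own_targets` are invented for this illustration.

```python
from bisect import bisect_left
from collections import defaultdict

def succ(ident, ring, space):
    """First node met going clockwise from ident; ring is a sorted list."""
    i = bisect_left(ring, ident % space)
    return ring[i % len(ring)]

def bulk_operation_targets(bulk_set, ring):
    """Nodes whose own identifier is in the bulk set."""
    return {i for i in bulk_set if i in ring}

def bulk_own_targets(bulk_set, ring, space):
    """Map each responsible node to the identifiers in the bulk set it owns."""
    groups = defaultdict(set)
    for ident in bulk_set:
        groups[succ(ident, ring, space)].add(ident)
    return dict(groups)

ring = [0, 2, 5, 6, 11]
print(bulk_operation_targets({4}, ring))  # node 4 does not exist
print(bulk_own_targets({4}, ring, 16))    # but some node is responsible for 4
print(bulk_own_targets({3, 4, 5, 9}, ring, 16))
```

Grouping by responsible node is what lets one message per node replace one lookup per identifier.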
bulk_feed(m, I)
  Send message m to every node i ∈ I
  Accumulate responses back to initiator
bulk_own_feed(m, I)
  Send message m to every node responsible for i ∈ I
  Accumulate responses back to initiator
Case 2
  Bulk set is a singleton with one identifier
  Identical to ordinary lookup
  Message complexity is log(n)
  Time complexity is log(n)
Filter redundant messages using unique identifiers
Eventually perfect failure detector for termination
  Inaccuracy results in redundant messages
Bulk owner
Multiple inserts into a DHT
Outline
Replication
Successor-list replication
  Replicate a node's items on its f successors
  DKS, Chord, Pastry, Koorde, etc.
Motivation: successor-lists
If a node joins or leaves
f replicas need to be updated
Color represents data item
Multiple hashing
Rehashing
Node 9 crashes
  Node 12 should get item from replica
  Need hash inverse H^{-1}(7)=Seif (impossible)
  Items dispersed all over nodes (inefficient)
[Figure: ring positions 5, 7, 9 and 12 with the item (Seif, Stockholm)]
Symmetric Replication
Basic Idea
  Replicate identifiers, not nodes
  $r(k) = i \oplus k\frac{N}{f}$, for $0 \le k < f$
Identifier space partitioned into m equivalence classes
  Cardinality of each class is f, m=N/f
Each node replicates the equivalence class of all identifiers it is responsible for
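The replica formula and the resulting equivalence classes can be checked directly. This is a small sketch with made-up function names (`replicas`, `classes`); it assumes f divides N and reads the ⊕ in the formula as addition modulo N.

```python
def replicas(i, n, f):
    """Identifiers associated with i: r(k) = i + k*N/f (mod N), for 0 <= k < f."""
    step = n // f  # assumes f divides N
    return [(i + k * step) % n for k in range(f)]

def classes(n, f):
    """The m = N/f equivalence classes, each of cardinality f."""
    return [sorted(replicas(i, n, f)) for i in range(n // f)]

print(replicas(13, 16, 4))
print(classes(16, 4))
```

Because every identifier in a class maps to the same set of f identifiers, a node can locate any replica of an item by arithmetic alone, without inverting the hash function.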
Symmetric replication
Replication degree f=4, Space={0,…,15}
Congruence classes modulo 4:
  {0, 4, 8, 12}
  {1, 5, 9, 13}
  {2, 6, 10, 14}
  {3, 7, 11, 15}
[Figure: identifier ring, 0 to 15, with per-node labels "Data: 15, 0", "Data: 1, 2, 3", "Data: 4, 5", "Data: 6, 7, 8, 9, 10" and "Data: 14, 13, 12, 11"]
Ordinary Chord
Replication degree f=4, Space={0,…,15}
Congruence classes modulo 4:
  {0, 4, 8, 12}
  {1, 5, 9, 13}
  {2, 6, 10, 14}
  {3, 7, 11, 15}
[Figure: identifier ring, 0 to 15, with the data items and replicas held at each node]
Cheap join/leave
Replication degree f=4, Space={0,…,15}
Congruence classes modulo 4:
  {0, 4, 8, 12}
  {1, 5, 9, 13}
  {2, 6, 10, 14}
  {3, 7, 11, 15}
[Figure: identifier ring, 0 to 15, with the data items and replicas held at each node after a join/leave]
Contributions
Message complexity for join/leave O(1)
Bit complexity remains unchanged
Summary
Summary (1/3)
Atomic ring maintenance
  Lookup consistency for joins/leaves
  No routing failures as nodes join/leave
  No bound on number of leaves
  Eventual consistency with failures
Summary (2/3)
Efficient Broadcast
  log(n) time and n message complexity
  Used in overlay multicast
Bulk operations
  Efficient parallel lookups
  Efficient range queries
Summary (3/3)
Symmetric Replication
  Simple, O(1) message complexity for joins/leaves
  O(log f) for failures
Acknowledgments
Seif Haridi
Luc Onana Alima
Cosmin Arad
Per Brand
Sameh El-Ansary
Roland Yap
THANK YOU