AAscript
Mohsen Ghaffari
Lecture notes by Davin Choo
This is a draft version; please check again for updates. Feedback and comments would be greatly appreciated and should be emailed to ghaffari@inf.ethz.ch. Last update: Feb 5, 2019.
Contents (excerpt)
I Approximation algorithms
1 Greedy algorithms
  1.1 Minimum set cover
2 Approximation schemes
  2.1 Knapsack
  2.2 Bin packing
  2.3 Minimum makespan scheduling
4 Rounding ILPs
  4.1 Minimum set cover
  4.2 Minimizing congestion in multi-commodity routing
8 Graph sketching
  8.1 Warm up: Finding the single cut
  8.2 Warm up 2: Finding one out of k > 1 cut edges
  8.3 Maximal forest with O(n log^4 n) memory
13 Paging
  13.1 Types of adversaries
  13.2 Random Marking Algorithm (RMA)
Week 13 (10 Dec - 14 Dec 2018): Section 13.2 till end of Chapter 15
Pr[X] ≥ 1 − 1/poly(n), say, Pr[X] ≥ 1 − 1/n^c for some constant c ≥ 2.
Useful inequalities
• (n/k)^k ≤ (n choose k) ≤ (en/k)^k
• (n choose k) ≤ n^k
• lim_{n→∞} (1 − 1/n)^n = e^{−1}
• ∑_{i=1}^{∞} 1/i² = π²/6
Part I
Approximation algorithms
Chapter 1
Greedy algorithms
Example (Figure: a set cover instance with sets S1, S2, S3, S4 over elements e1, . . . , e5.)
Algorithm 1 GreedySetCover(U, S, c)
  T ← ∅                                          ▷ Selected subsets of S
  C ← ∅                                          ▷ Covered vertices
  while C ≠ U do
    Si ← arg min_{Si ∈ S\T} c(Si) / |Si \ C|      ▷ Pick set with lowest price-per-item
    T ← T ∪ {Si}                                  ▷ Add Si to selection
    C ← C ∪ Si                                    ▷ Update covered vertices
  end while
  return T
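To make the greedy rule concrete, here is a minimal Python sketch of GreedySetCover. The instance encoding (sets stored as Python sets keyed by name, costs in a dict) is an assumption made purely for illustration.

```python
def greedy_set_cover(universe, sets, cost):
    """Greedy set cover: repeatedly pick the set with the lowest price-per-item."""
    covered, chosen = set(), []
    while covered != universe:
        # price-per-item of S_i is c(S_i) / |S_i \ C|; skip sets that add nothing
        best = min(
            (s for s in sets if s not in chosen and sets[s] - covered),
            key=lambda s: cost[s] / len(sets[s] - covered),
        )
        chosen.append(best)
        covered |= sets[best]
    return chosen

# Hypothetical instance on elements e1..e5
universe = {"e1", "e2", "e3", "e4", "e5"}
sets = {"S1": {"e1", "e2"}, "S2": {"e2", "e3"}, "S3": {"e3", "e4", "e5"}, "S4": {"e5"}}
cost = {"S1": 1.0, "S2": 1.0, "S3": 1.0, "S4": 1.0}
print(greedy_set_cover(universe, sets, cost))  # ['S3', 'S1']
```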
(Figure: the elements e1, . . . , en in the order the greedy algorithm covers them; when ek is covered, the elements ek, . . . , en are still uncovered and are covered in OPT by the sets O1, . . . , Op.)
n − k + 1 = |U \ C_{k−1}|
          ≤ |O1 ∩ (U \ C_{k−1})| + · · · + |Op ∩ (U \ C_{k−1})|
          = ∑_{j=1}^{p} |Oj ∩ (U \ C_{k−1})|

3. By definition, for each j ∈ {1, . . . , p}, ppi(Oj) = c(Oj) / |Oj ∩ (U \ C_{k−1})|.

Since the greedy algorithm will pick a set in S \ T with the lowest price-per-item, price(ek) ≤ ppi(Oj) for all j ∈ {1, . . . , p}. Hence,
The second equality is because the cost of the sets is partitioned across the prices of all n elements.
(Figure: a tight example, with sets S1, S2, S3, . . . , Sk covering 2, 4, 8 = 2·2², . . . , 2·2^{k−1} elements respectively, together with a set S_{k+1}.)
ei,k is covered, price(ei,k) ≤ c(Oi) / (d − k + 1) (it is an equality if the greedy algorithm also chose Oi to first cover ei,k, . . . , ei,d). Hence, the greedy cost of covering the elements in Oi (i.e. ei,1, . . . , ei,d) is at most
∑_{k=1}^{d} c(Oi) / (d − k + 1) = c(Oi) · ∑_{k=1}^{d} 1/k = c(Oi) · Hd ≤ c(Oi) · H∆
Summing over all p sets to cover all n elements, we have c(T) ≤ H∆ · c(OPT).
Remark We apply the same greedy algorithm for small ∆ but analyze it in a more localized manner. Crucially, in this analysis, we always work with the exact degree d and only use the fact d ≤ ∆ after the summation. Observe that ∆ ≤ n and the approximation factor equals that of Theorem 1.3 when ∆ = n.
Definition 1.6 (Minimum vertex cover problem). Given a graph G = (V, E), find a subset S ⊆ V such that:
1. Every edge e ∈ E has at least one endpoint in S
2. |S| is minimized
(Figure: an example graph on vertices a, b, c, d, e, f.)
Algorithm 2 GreedyMaximalMatching(V, E)
  M ← ∅                                          ▷ Selected edges
  C ← ∅                                          ▷ Set of incident vertices
  while E ≠ ∅ do
    ei = {u, v} ← Pick any edge from E
    M ← M ∪ {ei}                                  ▷ Add ei to the matching
    C ← C ∪ {u, v}                                ▷ Add endpoints to incident vertices
    Remove all edges in E that are incident to u or v
  end while
  return M
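A minimal Python sketch of the greedy maximal matching and the vertex cover formed by its matched endpoints; the edge-list encoding below is an assumption for illustration.

```python
def greedy_maximal_matching(edges):
    """Pick edges greedily; the matched endpoints form a 2-approximate vertex cover."""
    matching, cover = [], set()
    for (u, v) in edges:
        # keep the edge only if neither endpoint is already matched
        if u not in cover and v not in cover:
            matching.append((u, v))
            cover.update((u, v))
    return matching, cover

# Hypothetical example graph on vertices a..f
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
M, C = greedy_maximal_matching(edges)
print(M, C)  # [('a','b'), ('c','d'), ('e','f')] and a cover of size 2*|M| = 6
```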
(Figure: a maximal matching M and the vertex cover C formed by its matched endpoints, where |C| = 2 · |M|.)
Sketch of Proof Let C be the set of all vertices involved in the greedily selected hyperedges. In a manner similar to the proof of Theorem 1.8, C can be shown to be an f-approximation.
Chapter 2
Approximation schemes
2.1 Knapsack
Definition 2.3 (Knapsack problem). Consider a set S with n items. Each item i has size(i) ∈ Z+ and profit(i) ∈ Z+. Given a budget B, find a subset S* ⊆ S such that:
1. size(S*) = ∑_{i∈S*} size(i) ≤ B
2. profit(S*) = ∑_{i∈S*} profit(i) is maximized
Since each cell can be computed in O(1) using DP via the above recurrence, the matrix M can be filled in O(n² · pmax) time and S* may be extracted by back-tracing from M[n, n · pmax].
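The following Python sketch shows one way to realize the profit-indexed DP table described above; the concrete indexing and recovery of the best profit are assumptions for illustration.

```python
def knapsack_min_size_dp(sizes, profits, budget):
    """M[i][p] = minimum total size of a subset of items 1..i with profit exactly p.
    Fills the table in O(n^2 * pmax) time; the answer is the largest feasible profit."""
    n, pmax = len(sizes), max(profits)
    INF = float("inf")
    M = [[INF] * (n * pmax + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        for p in range(n * pmax + 1):
            M[i][p] = M[i - 1][p]                                    # skip item i
            take = profits[i - 1]
            if p >= take and M[i - 1][p - take] + sizes[i - 1] < M[i][p]:
                M[i][p] = M[i - 1][p - take] + sizes[i - 1]          # take item i
    return max(p for p in range(n * pmax + 1) if M[n][p] <= budget)

# print(knapsack_min_size_dp([3, 4, 2], [30, 50, 25], budget=6))  # 75 (items 2 and 3)
```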
Algorithm 3 FPTAS-Knapsack(S, B, ε)
  k ← max{1, ⌊ε · pmax / n⌋}                      ▷ Choice of k to be justified later
  for i ∈ {1, . . . , n} do
    profit'(i) ← ⌊profit(i) / k⌋                  ▷ Round and scale the profits
  end for
  Run DP in Section 2.1.1 with B, size(i), and re-scaled profit'(i)
  return Items selected by DP
∑_{i=1}^{n} loss(i) ≤ nk                       (loss(i) ≤ k for any item i)
                    < ε · pmax                  (Since k = ⌊ε · pmax / n⌋)
                    ≤ ε · profit(OPT(I))        (Since pmax ≤ profit(OPT(I)))
Example Consider S = {0.5, 0.1, 0.1, 0.1, 0.5, 0.4, 0.5, 0.4, 0.4}, where |S| = n = 9. Since ∑_{i=1}^{n} size(i) = 3, at least 3 bins are needed. One can verify that 3 bins suffice: b1 = b2 = b3 = {0.5, 0.4, 0.1}. Hence, |OPT(S)| = 3.
(Figure: the packing b1 = b2 = b3 = {0.5, 0.4, 0.1}.)
Algorithm 4 FirstFit(S)
  B ← ∅                                          ▷ Collection of bins
  for i ∈ {1, . . . , n} do
    if size(i) ≤ free(b) for some bin b ∈ B (pick the first such b) then
      free(b) ← free(b) − size(i)                 ▷ Put item i into existing bin b
    else
      B ← B ∪ {b'}                                ▷ Put item i into a fresh bin b'
      free(b') ← 1 − size(i)
    end if
  end for
  return B
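A minimal Python sketch of FirstFit, using the example instance from above (the small tolerance constant is an implementation assumption to guard against floating-point error).

```python
def first_fit(sizes):
    """First-Fit bin packing: place each item into the first bin with enough room."""
    bins, free = [], []                      # bins[j] holds items; free[j] is remaining capacity
    for s in sizes:
        for j in range(len(bins)):
            if s <= free[j] + 1e-12:         # first bin that fits
                bins[j].append(s)
                free[j] -= s
                break
        else:                                # no existing bin fits: open a fresh bin
            bins.append([s])
            free.append(1.0 - s)
    return bins

print(len(first_fit([0.5, 0.1, 0.1, 0.1, 0.5, 0.4, 0.5, 0.4, 0.4])))  # 4 bins, while OPT uses 3
```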
Lemma 2.6. Using FirstFit, at most one bin is less than half-full. That is, |{b ∈ B : size(b) ≤ 1/2}| ≤ 1, where B is the output of FirstFit.
Proof. Suppose, for a contradiction, that there are two bins bi and bj such that i < j, size(bi) ≤ 1/2 and size(bj) ≤ 1/2. Then, FirstFit could have put all items in bj into bi, and would not have created bj. This is a contradiction.
(Figure: the four bins b1, . . . , b4 produced by FirstFit on the example instance.)
Remark If we first sort the item weights in non-increasing order, then one can show that running FirstFit on the sorted item weights will yield a 3/2-approximation algorithm for bin packing. See footnote for details¹.
It is natural to wonder whether we can do better than a 3/2-approximation. Unfortunately, unless P = NP, we cannot do so efficiently. To prove this, we show that if we can efficiently derive a (3/2 − ε)-approximation for bin packing, then the partition problem (which is NP-hard) can be solved efficiently.
In the following sections, we work towards a PTAS for bin packing whose runtime will be exponential in 1/ε. To do this, we first consider two simplifying assumptions and design algorithms for them. Then, we adapt the algorithm to a PTAS by removing the assumptions one at a time.
Assumption (1) All items have size at least ε, for some ε > 0.
¹ Curious readers can read the following lecture notes for a proof on First-Fit-Decreasing: http://ac.informatik.uni-freiburg.de/lak_teaching/ws11_12/combopt/notes/bin_packing.pdf and https://dcg.epfl.ch/files/content/sites/dcg/files/courses/2012%20-%20Combinatorial%20Optimization/12-BinPacking.pdf
Algorithm 5 PTAS-BinPacking(I = S, ε)
  k ← ⌈1/ε²⌉
  Q ← ⌊nε²⌋
  Partition the n items into k non-overlapping groups, each with ≤ Q items
  for i ∈ {1, . . . , k} do
    imax ← max_{item j in group i} size(j)
    for item j in group i do
      size(j) ← imax
    end for
  end for
  Denote the modified instance as J
  return A(J)
Figure 2.1: Partition items into k groups, each with ≤ Q items; label groups in ascending sizes; J rounds up item sizes, J' rounds down item sizes.
Proof. By Lemma 2.10 and the fact that |OPT(J')| ≤ |OPT(I)|.
Proof. By Assumption (1), all item sizes are at least ε, so |OPT(I)| ≥ nε. Then, Q = ⌊nε²⌋ ≤ ε · |OPT(I)|. Apply Lemma 2.10.
Proof. If FirstFit does not open a new bin, the theorem trivially holds. Suppose FirstFit opens a new bin (using m bins in total), then we know that at least (m − 1) bins are strictly more than (1 − ε')-full.
Algorithm 6 Full-PTAS-BinPacking(I = S, ε)
  ε' ← min{1/2, ε/2}                              ▷ See analysis why we chose such an ε'
  X ← Items with size < ε'                        ▷ Ignore small items
  P ← PTAS-BinPacking(S \ X, ε')                  ▷ By Theorem 2.12,
                                                  ▷ |P| ≤ (1 + ε') · |OPT(S \ X)|
  P' ← Using FirstFit, add items in X to P        ▷ Handle small items
  return Resultant packing P'
|OPT(I)| ≥ ∑_{i=1}^{n} size(i)                  (Lower bound on |OPT(I)|)
         > (m − 1)(1 − ε')                       (From above observation)
Hence,
m < |OPT(I)| / (1 − ε') + 1                      (Rearranging)
  < |OPT(I)| · (1 + 2ε') + 1                     (Since 1/(1 − ε') ≤ 1 + 2ε' for ε' ≤ 1/2)
  ≤ (1 + ε) · |OPT(I)| + 1                       (By choice of ε' = min{1/2, ε/2})
(Figure: a schedule of jobs p1, . . . , p7 on machines M1, M2, M3 with makespan 11.)
|OPT(I)| ≤ (1/m) · ∑_{i=1}^{n} pi + pmax ≤ 2 · L(I)
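The bound above is achieved by greedy list scheduling (Graham's rule). Below is a minimal Python sketch of that rule, with the heap-based machine selection being an implementation choice rather than part of the notes.

```python
import heapq

def graham_list_scheduling(jobs, m):
    """Assign each job to the currently least-loaded machine; the resulting makespan
    is at most (1/m) * sum(jobs) + max(jobs) <= 2 * L(I)."""
    loads = [0.0] * m
    heap = [(0.0, i) for i in range(m)]      # (load, machine) min-heap
    heapq.heapify(heap)
    for p in jobs:
        load, i = heapq.heappop(heap)        # least-loaded machine
        loads[i] = load + p
        heapq.heappush(heap, (loads[i], i))
    return max(loads)                        # makespan of the greedy schedule

# print(graham_list_scheduling([3, 5, 7, 2, 4, 6, 1], m=3))
```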
(Figure: another schedule of p1, . . . , p7 on M1, M2, M3 with makespan 14.)
Lemma 2.17. Let plast be the last job that finishes running. If plast > (1/3) · |OPT(I)|, then |ModifiedGraham(I)| = |OPT(I)|.
Let us denote the jobs that are alone in C as heavy jobs, and the machines
they are on as heavy machines.
Suppose there are k heavy jobs occupying a machine each in OP T (I). Then,
there are 2(m − k) + 1 jobs (two non-heavy jobs per machine in C, and pn ) to
be distributed across m − k machines. By the pigeonhole principle, at least
one machine M ∗ will get ≥ 3 jobs in OP T (I). However, since the smallest
job pn takes > (1/3) · |OPT(I)| time, M* will spend > |OPT(I)| time. This is
a contradiction.
(Figure: a schedule of p1, . . . , p7 on M1, M2, M3 with makespan 13.)
original job sizes, PTAS-Makespan follows P's bin packing but uses bins of size t(1 + ε) to account for the rounded-down job sizes. Suppose jobs 1 and 2 with sizes p1 and p2 were rounded down to p'1 and p'2, and P assigns them to the same bin (i.e., p'1 + p'2 ≤ t). Then, due to the rounding process, their original sizes also fit into a bin of size t(1 + ε) (i.e., p1 + p2 ≤ t(1 + ε)).
Finally, small jobs are handled using FirstFit. Let α(I, t, ε) be the final bin configuration produced by PTAS-Makespan on parameter t and |α(I, t, ε)| be the number of bins used. Since |OPT(I)| ∈ [L, 2L], there will be a t ∈ {L, L + εL, L + 2εL, . . . , 2L} such that |α(I, t, ε)| ≤ Bin(I, t) ≤ m bins (see Lemma 2.19 for the first inequality). Note that running binary search on t also works, but we only care about poly-time.
Proof. If FirstFit does not open a new bin, then |α(I, t, ε)| ≤ Bin(I, t) since α(I, t, ε) uses an additional (1 + ε) buffer. If FirstFit opens a new bin (say, totalling b bins), then there are at least (b − 1) produced bins from A (exact solving on rounded-down non-small items) that are more than (t(1 + ε) − εt) = t-full. Hence, any bin packing algorithm must use strictly more than (b − 1)t / t = b − 1 bins. In particular, Bin(I, t) ≥ b = |α(I, t, ε)|.
Chapter 3
Randomized approximation schemes
• A runs in poly(|I|, 1/ε)
• F = C1 ∨ · · · ∨ Cm is a disjunction of clauses
Any clause with both xi and ¬xi is trivially false. As they can be removed
in a single scan of F , assume that F does not contain such trivial clauses.
However, there are exponentially many terms and there exist instances where
truncating the sum yields arbitrarily bad approximation.
Let |M| denote the total number of 1's in M. Since |Si| = 2^{n−|Ci|}, |M| = ∑_{i=1}^{m} |Si| = ∑_{i=1}^{m} 2^{n−|Ci|}. As every column represents a satisfying assignment, there are exactly f(F) "topmost" 1's.
       α1   α2   ...   αf(F)
C1     0    1    ...   0
C2     1    1    ...   1
C3     0    0    ...   0
...    ...  ...  ...   ...
Cm     0    1    ...   1

Table 3.1: Red 1's indicate the ("topmost") smallest-index clause Ci satisfied by each assignment αj
Algorithm 10 DNF-Count(F, ε)
  X ← 0                                           ▷ Empirical number of "topmost" 1's sampled
  for k = 9m/ε² times do
    Ci ← Sample one of the m clauses, where Pr[Ci chosen] = 2^{n−|Ci|} / |M|
    αj ← Sample one of the 2^{n−|Ci|} satisfying assignments of Ci
    IsTopmost ← True
    for l ∈ {1, . . . , i − 1} do                  ▷ Check if αj is "topmost"
      if Cl[αj] = 1 then                           ▷ Checkable in O(n) time
        IsTopmost ← False
      end if
    end for
    if IsTopmost then
      X ← X + 1
    end if
  end for
  return (|M| · X) / k
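A minimal Python sketch of the sampling procedure in DNF-Count; the clause encoding (a dict of required literal values per clause) and the tiny example formula are assumptions for illustration.

```python
import random

def dnf_count(clauses, n, eps):
    """Approximately count satisfying assignments of a DNF formula by sampling
    (clause, assignment) pairs and keeping only 'topmost' ones."""
    m = len(clauses)
    sizes = [2 ** (n - len(c)) for c in clauses]           # |S_i| = 2^(n - |C_i|)
    M = sum(sizes)                                         # total number of 1's in the table
    k = int(9 * m / eps ** 2)
    X = 0
    for _ in range(k):
        i = random.choices(range(m), weights=sizes)[0]     # Pr[C_i] = |S_i| / |M|
        alpha = {v: random.random() < 0.5 for v in range(n)}
        alpha.update(clauses[i])                           # force the literals of C_i
        # alpha is "topmost" if no earlier clause is satisfied by it
        if not any(all(alpha[v] == val for v, val in clauses[l].items()) for l in range(i)):
            X += 1
    return M * X / k

# Hypothetical example: F = (x0 AND x1) OR (NOT x0 AND x2) over 3 variables; f(F) = 4.
# print(dnf_count([{0: True, 1: True}, {0: False, 2: True}], n=3, eps=0.5))
```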
Proof. Let Xi be the indicator variable of whether the i-th sampled assignment is "topmost", where p = Pr[Xi = 1]. By Lemma 3.3, p = Pr[Xi = 1] = f(F) / |M|. Let X = ∑_{i=1}^{k} Xi be the empirical number of "topmost" 1's. Then, E(X) = kp by linearity of expectation. By picking k = 9m/ε²,
Pr[ |(|M| · X)/k − f(F)| ≥ ε · f(F) ]
= Pr[ |X − k·f(F)/|M|| ≥ ε · k·f(F)/|M| ]        (Multiply throughout by k/|M|)
= Pr[ |X − kp| ≥ ε · kp ]                        (Since p = f(F)/|M|)
≤ 2 exp(−ε²kp/3)                                 (By Chernoff bound)
= 2 exp(−3m · f(F)/|M|)                          (Since k = 9m/ε² and p = f(F)/|M|)
≤ 2 exp(−3)                                      (Since |M| ≤ m · f(F))
≤ 1/4
Negating, we get:
Pr[ |(|M| · X)/k − f(F)| ≤ ε · f(F) ] ≥ 1 − 1/4 = 3/4
One can see that Ωi ⊆ Ωi−1 as removal of ei in Gi−1 can only increase the number of valid colorings. Furthermore, suppose ei = {u, v}, then Ωi−1 \ Ωi = {c : c(u) = c(v)}. Fix the coloring of, say, the lower-indexed vertex u. Then, there are ≥ q − ∆ ≥ 2∆ + 1 − ∆ = ∆ + 1 possible recolorings of v in Gi. Hence,
|Ωi| ≥ (∆ + 1) · |Ωi−1 \ Ωi| ≥ (∆ + 1) · (|Ωi−1| − |Ωi|)
This implies that ri = |Ωi| / |Ωi−1| ≥ (∆ + 1)/(∆ + 2) ≥ 3/4 since ∆ ≥ 2.
Since f(G) = |Ωm| = |Ω0| · (|Ω1|/|Ω0|) · · · (|Ωm|/|Ωm−1|) = |Ω0| · ∏_{i=1}^{m} ri = q^n · ∏_{i=1}^{m} ri, if we can find a good estimate of each ri with high probability, then we have an FPRAS for counting the number of valid graph colorings of G.
Algorithm 12 Color-Count(G, ε)
  r̂1, . . . , r̂m ← 0                              ▷ Estimates for ri
  for i = 1, . . . , m do
    for k = 128m³/ε² times do
      c ← Sample coloring of Gi−1                  ▷ Using SampleColor
      if c is a valid coloring for Gi then
        r̂i ← r̂i + 1/k                              ▷ Update empirical estimate of ri = |Ωi| / |Ωi−1|
      end if
    end for
  end for
  return q^n · ∏_{i=1}^{m} r̂i
Lemma 3.9. For all i ∈ {1, . . . , m}, Pr[ |r̂i − ri| ≤ (ε/2m) · ri ] ≥ 1 − 1/(4m).
Proof. Let Xj be the indicator variable of whether the j-th sampled coloring for Ωi−1 is a valid coloring for Ωi, where p = Pr[Xj = 1]. From above, we know that p = Pr[Xj = 1] = |Ωi| / |Ωi−1| ≥ 3/4. Let X = ∑_{j=1}^{k} Xj be the empirical number of colorings that are valid for both Ωi−1 and Ωi, captured by k · r̂i. Then, E(X) = kp by linearity of expectation. Picking k = 128m³/ε²,
Pr[ |X − kp| ≥ (ε/2m) · kp ] ≤ 2 exp(−(ε/2m)² kp / 3)      (By Chernoff bound)
                             = 2 exp(−32mp/3)               (Since k = 128m³/ε²)
                             ≤ 2 exp(−8m)                   (Since p ≥ 3/4)
                             ≤ 1/(4m)                       (Since exp(−x) ≤ 1/x for x > 0)
Remark Recall from Claim 3.8 that SampleColor actually gives an approximately uniform coloring. A more careful analysis can absorb the approximation of SampleColor under Color-Count's ε factor.
¹ See https://www.wolframalpha.com/input/?i=e%5Ex+%3C%3D+1%2B2x
Chapter 4
Rounding ILPs
Linear programming (LP) and integer linear programming (ILP) are versa-
tile models but with different solving complexities — LPs are solvable in
polynomial time while ILPs are N P-hard.
Definition 4.1 (Linear program (LP)). The canonical form of an LP is
minimize cT x
subject to Ax ≥ b
x≥0
where x is the vector of n variables (to be determined), b and c are vectors
of (known) coefficients, and A is a (known) matrix of coefficients. cT x and
obj(x) are the objective function and objective value of the LP respectively.
For an optimal variable assignment x∗ , obj(x∗ ) is the optimal value.
ILPs are defined similarly with the additional constraint that variables
take on integer values. As we will be relaxing ILPs into LPs, to avoid confu-
sion, we use y for ILP variables to contrast against the x variables in LPs.
Definition 4.2 (Integer linear program (ILP)). The canonical form of an
ILP is
minimize cT y
subject to Ay ≥ b
y≥0
y ∈ Zn
where y is the vector of n variables (to be determined), b and c are vectors
of (known) coefficients, and A is a (known) matrix of coefficients. cT y and
obj(y) are the objective function and objective value of the LP respectively.
For an optimal variable assignment y ∗ , obj(y ∗ ) is the optimal value.
Remark We can define LPs and ILPs for maximization problems similarly. One can also solve a maximization problem with a minimization LP using the same constraints but a negated objective function. The optimal value of the solved LP will then be the negation of the maximized optimal value.
In this chapter, we illustrate how one can model set cover and multi-
commodity routing as ILPs, and how to perform rounding to yield approx-
imations for these problems. As before, Chernoff bounds will be a useful
inequality in our analysis toolbox.
Example (Figure: the set cover instance from Chapter 1 with sets S1, . . . , S4 and elements e1, . . . , e5.)
ILP_{Set cover}:
  minimize    ∑_{i=1}^{m} yi · c(Si)                              ▷ Cost of chosen set cover
  subject to  ∑_{i : ej ∈ Si} yi ≥ 1    ∀j ∈ {1, . . . , n}       ▷ Every item ej is covered
              yi ∈ {0, 1}               ∀i ∈ {1, . . . , m}
Upon solving ILP_{Set cover}, the set {Si : i ∈ {1, . . . , m} ∧ yi* = 1} is the optimal solution for the given set cover instance. However, as solving ILPs is NP-hard, we consider relaxing the integrality constraint by replacing the binary yi variables with real-valued/fractional xi ∈ [0, 1]. Such a relaxation yields the corresponding LP:
LP_{Set cover}:
  minimize    ∑_{i=1}^{m} xi · c(Si)                              ▷ Cost of chosen fractional set cover
  subject to  ∑_{i : ej ∈ Si} xi ≥ 1    ∀j ∈ {1, . . . , n}       ▷ Every item ej is fractionally covered
              xi ∈ [0, 1]               ∀i ∈ {1, . . . , m}
Since LPs can be solved in polynomial time, we can find the optimal
fractional solution to LPSet cover in polynomial time.
Example The corresponding ILP for the example set cover instance is:
After relaxing:
Proof. Since x* is a feasible (not to mention, optimal) solution for LP_{Set cover}, in each constraint there is at least one xi* that is greater than or equal to 1/f. Hence, every element is covered by some set Si whose yi is rounded to 1.
¹ Using Microsoft Excel. See tutorial: http://faculty.sfasu.edu/fisherwarre/lp_solver.html. Or, use an online LP solver such as: http://online-optimizer.appspot.com/?model=builtin:default.mod
Since e−1 ≈ 0.37, we would expect the rounded y not to cover several
items. However, one can amplify the success probability by considering in-
dependent roundings and taking the union (See ApxSetCoverILP).
Algorithm 13 ApxSetCoverILP(U, S, c)
ILPSet cover ← Construct ILP of problem instance
LPSet cover ← Relax integral constraints on indicator variables y to x
x∗ ← Solve LPSet cover
T ←∅ . Selected subset of S
for k · ln(n) times (for any constant k > 1) do
for i ∈ {1, . . . , m} do
yi ← Set to 1 with probability x∗i
if yi = 1 then
T ← T ∪ {Si } . Add to selected sets T
end if
end for
end for
return T
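Below is a minimal Python sketch of ApxSetCoverILP. It uses scipy.optimize.linprog as a stand-in LP solver (an assumption; any LP solver works), and the instance encoding and constant in the number of rounding rounds are chosen purely for illustration.

```python
import math
import random
from scipy.optimize import linprog

def apx_set_cover_ilp(universe, sets, cost, rounds_const=2):
    """Solve the LP relaxation of set cover, then take the union of k*ln(n)
    independent randomized roundings (each set Si is picked with probability x*_i)."""
    elems, names = sorted(universe), sorted(sets)
    c = [cost[s] for s in names]
    # covering constraint sum_{i: e in Si} x_i >= 1 written as -sum x_i <= -1
    A_ub = [[-1.0 if e in sets[s] else 0.0 for s in names] for e in elems]
    b_ub = [-1.0] * len(elems)
    x_star = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(names)).x
    chosen = set()
    for _ in range(max(1, math.ceil(rounds_const * math.log(len(elems))))):
        for i, s in enumerate(names):
            if random.random() < x_star[i]:
                chosen.add(s)
    return chosen
```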
(iv) (Single path): All demand for commodity i passes through a single path
pi (no repeated vertices).
(v) (Congestion factor): ∀e ∈ E, ∑_{i=1}^{k} di · 1_{e∈pi} ≤ λ · c(e), where the indicator 1_{e∈pi} = 1 ⇐⇒ e ∈ pi.
(vi) (Minimum congestion): λ is minimized.
(Figures: an example multi-commodity routing instance on vertices s1, s2, s3, a, b, c, t1, t2, t3 with edge capacities, followed by several possible routings of the demands that achieve different congestion.)
ILP_{MCR-Given-Paths}:
  minimize    λ                                                              ▷ (1)
  subject to  ∑_{i=1}^{k} ∑_{p∈Pi, e∈p} di · yi,p ≤ λ · c(e)    ∀e ∈ E       ▷ (2)
              ∑_{p∈Pi} yi,p = 1                                 ∀i ∈ [k]     ▷ (3)
Relax the integral constraint on yi,p to xi,p ∈ [0, 1] and solve the corresponding LP. Define λ* = obj(LP_{MCR-Given-Paths}) and denote x* as a fractional path selection that achieves λ*. To obtain a valid path selection, for each commodity i ∈ [k], pick path p ∈ Pi with weighted probability x*_{i,p} / ∑_{p∈Pi} x*_{i,p} = x*_{i,p}. Note that by constraint (3), ∑_{p∈Pi} x*_{i,p} = 1.
Remark 1 For a fixed i, a path is selected exclusively (only one!) (cf. set
cover’s roundings where we may pick multiple sets for an item).
Theorem 4.9. Pr[ obj(y) ≥ (2c log m / log log m) · max{1, λ*} ] ≤ 1/m^{c−1}
E(Ye) = E(∑_{i=1}^{k} di · Ye,i)
      = ∑_{i=1}^{k} di · E(Ye,i)                       (By linearity of expectation)
      = ∑_{i=1}^{k} di · ∑_{p∈Pi, e∈p} xi,p            (Since Pr[Ye,i = 1] = ∑_{p∈Pi, e∈p} xi,p)
For every edge e ∈ E, applying² the tight form of Chernoff bounds with (1 + ε) = 2 log n / log log n on the variable Ye / c(e) gives
Pr[ Ye / c(e) ≥ (2c log m / log log m) · max{1, λ*} ] ≤ 1/m^c
ILP_{MCR-Given-Network}:
  minimize    λ                                                                            ▷ (1)
  subject to  ∑_{e∈out(si)} f(e, i) − ∑_{e∈in(si)} f(e, i) = 1      ∀i ∈ [k]               ▷ (2)
              ∑_{e∈in(ti)} f(e, i) − ∑_{e∈out(ti)} f(e, i) = 1      ∀i ∈ [k]               ▷ (3)
              ∑_{e∈out(v)} f(e, 1) − ∑_{e∈in(v)} f(e, 1) = 0        ∀v ∈ V \ {s1, t1}      ▷ (4)
              . . .
              ∑_{e∈out(v)} f(e, k) − ∑_{e∈in(v)} f(e, k) = 0        ∀v ∈ V \ {sk, tk}      ▷ (4)
              ∑_{i=1}^{k} ∑_{p∈Pi, e∈p} di · yi,p ≤ λ · c(e)        ∀e ∈ E                 (As before)
              ∑_{p∈Pi} yi,p = 1                                     ∀i ∈ [k]               (As before)
min_{e∈pi} f(e, i) on the path as the selection probability (as per xe,i in the previous section). By selecting the path pi with probability min_{e∈pi} f(e, i), one can show by similar arguments as before that E(obj(y)) ≤ obj(x*) ≤ obj(y*).
Chapter 5
Probabilistic tree embedding

Trees are a special kind of graph without cycles, and some NP-hard problems are known to admit exact polynomial-time solutions on trees. Motivated by the existence of efficient algorithms on trees, one hopes to design the following framework for a general graph G = (V, E) with distance metric dG(u, v) between vertices u, v ∈ V:
1. Construct a tree T
2. Solve the problem on T efficiently
3. Map the solution back to G
4. Argue that the transformed solution from T is a good approximation
for the exact solution on G.
Ideally, we want to build a tree T such that dG (u, v) ≤ dT (u, v) and
dT (u, v) ≤ c · dG (u, v), where c is the stretch of the tree embedding. Unfortu-
nately, such a construction is hopeless1 . Instead, we consider a probabilistic
tree embedding of G into a collection of trees T such that
• (Over-estimates cost): ∀u, v ∈ V , ∀T ∈ T , dG (u, v) ≤ dT (u, v)
• (Over-estimate by not too much): ∀u, v ∈ V , ET ∈T [dT (u, v)] ≤ c · dG (u, v)
P
• (T is a probability space): T ∈T Pr[T ] = 1
Bartal [Bar96] gave a construction² for probabilistic tree embedding with poly-logarithmic stretch factor c, and proved³ that a stretch factor c ∈ Ω(log n) is necessary.
¹ For a cycle G with n vertices, the excluded edge in a constructed tree will cause the stretch factor c ≥ n − 1.
² Theorem 8 in [Bar96]
³ Theorem 9 in [Bar96]
(B) ∀u, v ∈ V, Pr[u and v not in same partition] ≤ α · dG(u, v) / D, for some α
Using ball carving, ConstructT recursively partitions the vertices of a
given graph until there is only one vertex remaining. At each step, the upper
bound D indicates the maximum distance between the vertices of C. The
first call of ConstructT starts with C = V and D = diam(V ). Figure 5.1
illustrates the process of building a tree T from a given graph G.
Lemma 5.2. For any two vertices u, v ∈ V and i ∈ N, if T separates u and v at level i, then 2D/2^i ≤ dT(u, v) ≤ 4D/2^i, where D = diam(V).
Proof. If T splits u and v at level i, then the path from u to v in T has to include two edges of length D/2^i, hence dT(u, v) ≥ 2D/2^i. To be precise,
2D/2^i ≤ dT(u, v) = 2 · (D/2^i + D/2^{i+1} + · · · ) ≤ 4D/2^i
See picture: r is the auxiliary node at level i which splits nodes u and v.
(Figure: the tree paths from u ∈ Vu and v ∈ Vv up to their common auxiliary ancestor r at level i, with edge weights D/2^i, D/2^{i+1}, . . . .)
Figure 5.1: Recursive ball carving with ⌈log₂(D)⌉ levels. Red vertices are auxiliary nodes that are not in the original graph G. Denoting the root as the 0th level, edges from level i to level i + 1 have weight D/2^i.
E[dT(u, v)] = ∑_{i=0}^{log(D)−1} Pr[Ei] · [dT(u, v), given Ei]          (Definition of expectation)
            ≤ ∑_{i=0}^{log(D)−1} Pr[Ei] · 4D/2^i                        (By Lemma 5.2)
            ≤ ∑_{i=0}^{log(D)−1} (α · dG(u, v) / (D/2^i)) · 4D/2^i      (Property (B) of ball carving)
            = 4α log(D) · dG(u, v)                                      (Simplifying)
5.1. A TIGHT PROBABILISTIC TREE EMBEDDING CONSTRUCTION51
• If B(u, r) ⊆ B(vj , r), then vertices in B(u, r) would have been removed
before vi is considered.
In any case, if there is a 1 ≤ j < i such that π(vj ) < π(vi ), then Vi does not
cut B(u, r). So,
Pr[B(u, r) is cut]
= Pr[ ⋃_{i=1}^{n} Event that Vi first cuts B(u, r) ]
≤ ∑_{i=1}^{n} Pr[Vi first cuts B(u, r)]                                  (Union bound)
= ∑_{i=1}^{n} Pr[π(vi) = min_{j∈[i]} π(vj)] · Pr[Vi cuts B(u, r)]        (Require vi to appear first)
= ∑_{i=1}^{n} (1/i) · Pr[Vi cuts B(u, r)]                                (By random permutation π)
≤ ∑_{i=1}^{n} (1/i) · 2r / (D/8)                                         (diam(B(u, r)) ≤ 2r, θ ∈ [D/8, D/4])
= 16 · (r/D) · Hn                                                        (Hn = ∑_{i=1}^{n} 1/i)
∈ O(log(n)) · r/D
In the last inequality: for Vi to cut B(u, r), we need θ ∈ (dG(u, vi) − r, dG(u, vi) + r), hence the numerator of ≤ 2r; the denominator D/8 is because the range of values that θ is sampled from is D/4 − D/8 = D/8.
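A minimal Python sketch of one level of the ball-carving partition described above, assuming the metric dist and the diameter bound D are given; the exact procedure in ConstructT may differ in small details.

```python
import random

def ball_carving(vertices, dist, D):
    """One level of ball carving: pick theta uniformly in [D/8, D/4] and a random
    permutation of the vertices; each remaining vertex joins the cluster of the first
    center (in permutation order) within distance theta."""
    theta = random.uniform(D / 8, D / 4)
    order = list(vertices)
    random.shuffle(order)                     # random permutation pi
    remaining, clusters = set(vertices), []
    for center in order:
        cluster = {v for v in remaining if dist(center, v) <= theta}
        if cluster:
            clusters.append(cluster)
            remaining -= cluster
    return clusters
```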
∀u, v ∈ V, Pr[u and v not in same partition] ≤ α · dG(u, v) / D
Proof. Let r = dG(u, v); then v is on the boundary of B(u, r).
To remove the log(D) factor, so that stretch factor c = O(log n), a tighter
analysis is needed by only considering vertices that may cut B(u, dG (u, v))
instead of all n vertices. For details, see Theorem 5.16 in Section 5.3.
5.1.3 Contraction of T
Notice in Figure 5.1 that we introduce auxiliary vertices in our tree construction, and we may wonder whether we can build a T without additional vertices (i.e. V(T) = V(G)). In this section, we look at Contract, which performs tree contractions to remove the auxiliary vertices. It remains to show that the produced tree still preserves the desirable properties of a tree embedding.
Algorithm 16 Contract(T)
  while T has an edge (u, w) such that u ∈ V and w is an auxiliary node do
    Contract edge (u, w) by merging the subtree rooted at u into w
    Identify the new node as u
  end while
  Multiply the weight of every edge by 4
  return Modified T'
Since we do not contract actual vertices, at least one of the (u, w) or (v, w) edges of weight D/2^i will remain. Multiplying the weights of all remaining edges by 4, we get dT(u, v) ≤ 4 · D/2^i ≤ dT'(u, v).
Suppose we only multiplied the weights by 4 without contracting; then dT'(u, v) = 4 · dT(u, v). Since we contract edges, dT'(u, v) can only decrease, so dT'(u, v) ≤ 4 · dT(u, v).
Remark Claim 5.9 tells us that one can construct a tree T' without auxiliary vertices by incurring an additional constant-factor overhead.
Claim 5.11. |AT (I)| using edges in G ≤ |AT (I)| using edges in T .
Lemma 5.14. For i ∈ N, if r > D/16, then Pr[B(u, r) is cut] ≤ 16r/D.
Proof. If r > D/16, then 16r/D > 1. As Pr[B(u, r) is cut at level i] is a probability ≤ 1, the claim holds.
Remark Although lemma 5.14 is not a very useful inequality per se (since
any probability ≤ 1), we use it to partition the value range of r so that we
can say something stronger in the next lemma.
Lemma 5.15. For i ∈ N, if r ≤ D/16, then
Pr[B(u, r) is cut] ≤ O(log(|B(u, D/2)| / |B(u, D/16)|)) · r/D
Proof. Since B(vi, θ) cuts B(u, r) only if D/8 − r ≤ dG(u, vi) ≤ D/4 + r, we have dG(u, vi) ∈ [D/16, 5D/16] ⊆ [D/16, D/2].
(Figure: the vertices v1, . . . , vk ordered by their distance from u; only vj, vj+1, . . . , vk lie at distance between D/16 and D/2.)
We see that only vertices vj, vj+1, . . . , vk have distances from u in the range [D/16, D/2]. Pictorially, only vertices in the shaded region could possibly cut B(u, r). As before, let π(v) be the position in which vertex v appears in the random permutation π. Then,
Pr[B(u, r) is cut]
= Pr[ ⋃_{i=j}^{k} Event that B(vi, θ) cuts B(u, r) ]                      (Only vj, vj+1, . . . , vk can cut)
≤ ∑_{i=j}^{k} Pr[π(vi) < min_{z∈[i−1]}{π(vz)}] · Pr[vi cuts B(u, r)]      (Union bound)
= ∑_{i=j}^{k} (1/i) · Pr[B(vi, θ) cuts B(u, r)]                           (By random permutation π)
≤ ∑_{i=j}^{k} (1/i) · 2r / (D/8)                                          (diam(B(u, r)) ≤ 2r, θ ∈ [D/8, D/4])
= 16 · (r/D) · (Hk − Hj)                                                  (where Hk = ∑_{i=1}^{k} 1/i)
∈ O(log(|B(u, D/2)| / |B(u, D/16)|)) · r/D                                (since Hk ∈ Θ(log(k)))
Proof. As before, let Ei be the event that “vertices u and v get separated at
the ith level”. For Ei to happen, the ball B(u, r) = B(u, dG (u, v)) must be
cut at level i, so Pr[Ei ] ≤ Pr[B(u, r) is cut at level i].
Chapter 6
Warm up
Thus far, we have been ensuring that our algorithms run fast. What if our
system does not have sufficient memory to store all data to post-process it?
For example, a router has relatively small amount of memory while tremen-
dous amount of routing data flows through it. In a memory constrained set-
ting, can one compute something meaningful, possibly approximately, with
limited amount of memory?
More formally, we now look at a slightly different class of algorithms
where data elements from [n] = {1, . . . , n} arrive one at a time, in a stream
S = a1 , . . . , am , where ai ∈ [n] arrives in the ith time step. At each step, our
algorithm performs some computation1 and discards the item ai . At the end
of the stream2 , the algorithm should give us a value that approximates some
value of interest.
probability p > 0.5, the probability that more than half of them fail (and hence the median fails) drops exponentially with respect to k.
Let ε > 0 and δ > 0 denote the precision factor and failure probability respectively. Robust combines the above-mentioned two tricks to yield a (1 ± ε)-approximation to X that succeeds with probability > 1 − δ.
Algorithm 18 Robust(A, I, ε, δ)
  C ← ∅                                           ▷ Initialize candidate outputs
  for k = O(log(1/δ)) times do
    sum ← 0
    for j = O(1/ε²) times do
      sum ← sum + A(I)
    end for
    Add sum / j to candidates C                    ▷ Include new sample of mean
  end for
  return Median of C                               ▷ Return median
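A minimal Python sketch of this median-of-means combination; the exact constants inside the O(·) are illustrative assumptions.

```python
import math
import random
import statistics

def robust(estimator, eps, delta):
    """Average O(1/eps^2) runs of the estimator to reduce variance, then take the
    median of O(log(1/delta)) such averages to boost the success probability."""
    k = max(1, math.ceil(2 * math.log(1 / delta)))   # number of candidate means
    j = max(1, math.ceil(4 / eps ** 2))              # samples per mean
    candidates = [sum(estimator() for _ in range(j)) / j for _ in range(k)]
    return statistics.median(candidates)

# Example: a noisy unbiased estimator of 42 (hypothetical).
# print(robust(lambda: 42 + random.gauss(0, 10), eps=0.05, delta=0.01))
```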
Stream elements 1 3 3 7 5 3 2 3
Guess 1 3 3 3 5 3 2 3
Count 1 1 2 1 1 1 1 1
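The table above traces a guess-and-count (Boyer-Moore style) majority procedure. Here is a minimal Python sketch that reproduces exactly that trace; the tie-breaking convention (replace the guess as soon as the counter hits zero) is chosen to match the table.

```python
def majority_vote(stream):
    """Keep one guess and a counter: matches increment the counter, mismatches
    decrement it, and when it reaches zero the current element becomes the guess."""
    guess, count = None, 0
    for x in stream:
        if x == guess:
            count += 1
        else:
            count -= 1
            if count <= 0:
                guess, count = x, 1
    return guess   # equals the majority element, if one exists

print(majority_vote([1, 3, 3, 7, 5, 3, 2, 3]))  # 3
```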
E[2^{X_{m+1}}] = ∑_{j=1}^{m} E[2^{X_{m+1}} | Xm = j] · Pr[Xm = j]                  (Condition on Xm)
              = ∑_{j=1}^{m} (2^{j+1} · 2^{−j} + 2^{j} · (1 − 2^{−j})) · Pr[Xm = j]  (Increment X w.p. 2^{−j})
              = ∑_{j=1}^{m} (2^{j} + 1) · Pr[Xm = j]                                (Simplifying)
              = ∑_{j=1}^{m} 2^{j} · Pr[Xm = j] + ∑_{j=1}^{m} Pr[Xm = j]             (Splitting the sum)
              = E[2^{Xm}] + ∑_{j=1}^{m} Pr[Xm = j]                                  (Definition of E[2^{Xm}])
              = E[2^{Xm}] + 1                                                        (∑_{j=1}^{m} Pr[Xm = j] = 1)
              = (m + 1) + 1                                                          (Induction hypothesis)
              = m + 2
Proof. Exercise.
Claim 7.3. E[(2^{Xm} − 1 − m)²] ≤ m²/2
Proof.
Pr[ |(2^{Xm} − 1) − m| > εm ] = Pr[ ((2^{Xm} − 1) − m)² > (εm)² ]      (Square both sides)
                              ≤ E[((2^{Xm} − 1) − m)²] / (εm)²         (Markov's inequality)
                              ≤ (m²/2) / (ε²m²)                        (By Claim 7.3)
                              = 1/(2ε²)
Remark Using the discussion in Section 6.1, we can run Morris multiple times to obtain a (1 ± ε)-approximation of the first moment of a stream that succeeds with probability > 1 − δ. For instance, repeating Morris 10/ε² times and reporting the mean m̂, Pr[ |m̂ − m| > εm ] ≤ 1/20.
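A minimal Python sketch of the Morris counter; averaging several independent copies, as in Trick 1, is shown in the comment.

```python
import random

def morris_counter(stream_length):
    """Morris approximate counter: keep X (roughly log2 of the count); on each item,
    increment X with probability 2^(-X). The estimate 2^X - 1 is unbiased."""
    X = 0
    for _ in range(stream_length):
        if random.random() < 2.0 ** (-X):
            X += 1
    return 2 ** X - 1   # estimate of the stream length m

# Averaging independent copies reduces the variance (Trick 1):
# est = sum(morris_counter(10000) for _ in range(1000)) / 1000
```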
Since we are randomly hashing elements into the range [0, 1], we expect the minimum hash output to be 1/(D + 1), so E[1/z − 1] = D. Unfortunately, storing a uniformly random hash function that maps to the interval [0, 1] is infeasible. As storing real numbers is memory intensive, one possible fix is to discretize the interval [0, 1], using O(log n) bits per hash output. However, storing this hash function would still require O(n log n) space.
¹ See https://en.wikipedia.org/wiki/Order_statistic
Proof.
Var(X) = ∑_{i=1}^{n} Var(Xi)                     (The Xi's are pairwise independent)
       = ∑_{i=1}^{n} (E[Xi²] − (E[Xi])²)          (Definition of variance)
       ≤ ∑_{i=1}^{n} E[Xi²]                       (Ignore negative part)
       = ∑_{i=1}^{n} E[Xi]                        (Xi² = Xi since Xi's are indicator random variables)
       = E[∑_{i=1}^{n} Xi]                        (Linearity of expectation)
       = E[X]                                     (Definition of X)
and Xr = ∑_{i=1}^{m} Xi,r = |{ai ∈ S : zeros(h(ai)) ≥ r}|. Notice that Xn ≤ Xn−1 ≤ · · · ≤ X1 since zeros(h(ai)) ≥ r + 1 ⇒ zeros(h(ai)) ≥ r. Now,
E[Xr] = E[∑_{i=1}^{m} Xi,r]              (Since Xr = ∑_{i=1}^{m} Xi,r)
      = ∑_{i=1}^{m} E[Xi,r]              (By linearity of expectation)
      = ∑_{i=1}^{m} Pr[Xi,r = 1]         (Since Xi,r are indicator variables)
      = ∑_{i=1}^{m} 1/2^r                (h is a uniform hash)
      = D/2^r                            (Since h hashes same elements to the same value)
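A minimal Python sketch of the basic FM estimator 2^Z · √2 discussed next; the concrete hash family below is an illustrative assumption, not the one fixed in the notes.

```python
import random

def trailing_zeros(x):
    """zeros(x): number of trailing zero bits of x (0 is treated as having none here)."""
    z = 0
    while x > 0 and x % 2 == 0:
        x //= 2
        z += 1
    return z

def fm_estimate(stream):
    """Hash every element with one random hash, keep Z = max zeros(h(a)), and output
    2^Z * sqrt(2) as a constant-factor estimate of the number of distinct elements D."""
    p = 2 ** 61 - 1                           # a large prime (assumed larger than n)
    a, b = random.randrange(1, p), random.randrange(p)
    Z = 0
    for x in stream:
        Z = max(Z, trailing_zeros((a * x + b) % p))
    return (2 ** Z) * (2 ** 0.5)

# print(fm_estimate([1, 3, 3, 7, 5, 3, 2, 3]))   # D = 5 distinct elements here
```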
Denote τ1 as the smallest integer such that 2^{τ1} · √2 > 3D, and τ2 as the largest integer such that 2^{τ2} · √2 < D/3. We see that if τ2 < Z < τ1, then 2^Z · √2 is a 3-approximation of D.
(Figure: the values r = 0, . . . , τ2, τ2 + 1, . . . , τ1, . . . , log2(D/√2) on a number line.)
• If Z ≥ τ1, then 2^Z · √2 ≥ 2^{τ1} · √2 > 3D
• If Z ≤ τ2, then 2^Z · √2 ≤ 2^{τ2} · √2 < D/3
Pr[ (D/3 > 2^Z · √2) or (2^Z · √2 > 3D) ]
≤ Pr[D/3 ≥ 2^Z · √2] + Pr[2^Z · √2 ≥ 3D]      (By union bound)
≤ 2√2/3                                        (From above)
= 1 − C                                        (For C = 1 − 2√2/3 > 0)
probability C > 0.5, one can then call the routine k times independently and return the median (recall Trick 2).
While Tricks 1 and 2 allow us to strengthen the success probability C, more work needs to be done to improve the approximation factor from 3 to (1 + ε). To do this, we look at a slight modification of FM, due to [BYJK+02].
If tN/Z > (1 + ε)D, then tN / ((1 + ε)D) > Z = the t-th smallest hash value, implying that there are ≥ t hashes smaller than tN / ((1 + ε)D). Since the hash uniformly distributes [n] over [N], for each element ai,
Pr[h(ai) ≤ tN / ((1 + ε)D)] = (tN / ((1 + ε)D)) / N = t / ((1 + ε)D)
Pr[tN/Z > (1 + ε)D] ≤ Pr[X ≥ t]                          (Since the former implies the latter)
                    = Pr[X − E[X] ≥ t − E[X]]             (Subtracting E[X] from both sides)
                    ≤ Pr[X − E[X] ≥ (ε/2)·t]              (Since E[X] = t/(1 + ε) ≤ (1 − ε/2)·t)
                    ≤ Pr[|X − E[X]| ≥ (ε/2)·t]            (Adding absolute sign)
                    ≤ Var(X) / (εt/2)²                    (By Chebyshev's inequality)
                    ≤ E[X] / (εt/2)²                      (Since Var(X) ≤ E[X])
                    ≤ 4(1 − ε/2)t / (ε²t²)                (Since E[X] = t/(1 + ε) ≤ (1 − ε/2)·t)
                    ≤ 4/c                                 (Simplifying with t = c/ε² and (1 − ε/2) < 1)
Similarly, if tN/Z < (1 − ε)D, then tN / ((1 − ε)D) < Z = the t-th smallest hash value, implying that there are < t hashes smaller than tN / ((1 − ε)D). Since the hash uniformly distributes [n] over [N], for each element ai,
Pr[h(ai) ≤ tN / ((1 − ε)D)] = (tN / ((1 − ε)D)) / N = t / ((1 − ε)D)
and Y = ∑_{i=1}^{D} Yi is the number of hashes that are smaller than tN / ((1 − ε)D). From above, Pr[Yi = 1] = t / ((1 − ε)D). By linearity of expectation, E[Y] = t/(1 − ε). Then, by Lemma 7.7, Var(Y) ≤ E[Y]. Now,
Pr[tN/Z < (1 − ε)D]
≤ Pr[Y ≤ t]                                     (Since the former implies the latter)
= Pr[Y − E[Y] ≤ t − E[Y]]                        (Subtracting E[Y] from both sides)
≤ Pr[Y − E[Y] ≤ −εt]                             (Since E[Y] = t/(1 − ε) ≥ (1 + ε)t)
≤ Pr[−(Y − E[Y]) ≥ εt]                           (Swap sides)
≤ Pr[|Y − E[Y]| ≥ εt]                            (Adding absolute sign)
≤ Var(Y) / (εt)²                                 (By Chebyshev's inequality)
≤ E[Y] / (εt)²                                   (Since Var(Y) ≤ E[Y])
≤ (1 + 2ε)t / (ε²t²)                             (Since E[Y] = t/(1 − ε) ≤ (1 + 2ε)t)
≤ 3/c                                            (Simplifying with t = c/ε² and (1 + 2ε) < 3)
Putting together,
Pr[ |tN/Z − D| > εD ] ≤ Pr[tN/Z > (1 + ε)D] + Pr[tN/Z < (1 − ε)D]     (By union bound)
                      ≤ 4/c + 3/c                                      (From above)
                      = 7/c                                            (Simplifying)
                      ≤ 1/4                                            (For c ≥ 28)
7.3.1 k=2
For each element i ∈ [n], we associate a random variable ri ∈u.a.r. {−1, +1}.
Lemma 7.10. In AMS-2, if the random variables {ri}_{i∈[n]} are pairwise independent, then E[Z²] = ∑_{i=1}^{n} fi² = F2. That is, AMS-2 is an unbiased estimator for the 2nd moment.
Proof.
E[Z²] = E[(∑_{i=1}^{n} ri·fi)²]                                        (Since Z = ∑_{i=1}^{n} ri·fi at the end)
      = E[∑_{i=1}^{n} ri²fi² + 2·∑_{1≤i<j≤n} ri·rj·fi·fj]              (Expanding (∑ ri·fi)²)
      = ∑_{i=1}^{n} E[ri²fi²] + 2·∑_{1≤i<j≤n} E[ri·rj·fi·fj]           (Linearity of expectation)
      = ∑_{i=1}^{n} E[ri²]·fi² + 2·∑_{1≤i<j≤n} E[ri·rj]·fi·fj          (fi's are (unknown) constants)
      = ∑_{i=1}^{n} fi² + 2·∑_{1≤i<j≤n} E[ri·rj]·fi·fj                 (Since ri² = 1, ∀i ∈ [n])
      = ∑_{i=1}^{n} fi² + 2·∑_{1≤i<j≤n} E[ri]·E[rj]·fi·fj              (Since {ri}_{i∈[n]} are pairwise independent)
      = ∑_{i=1}^{n} fi² + 2·∑_{1≤i<j≤n} 0·fi·fj                        (Since E[ri] = 0, ∀i ∈ [n])
      = ∑_{i=1}^{n} fi²                                                (Simplifying)
      = F2                                                             (Since F2 = ∑_{i=1}^{n} fi²)
Lemma 7.11. In AMS-2, if random variables {ri }i∈[n] are 4-wise indepen-
dent, then Var[Z 2 ] ≤ 2(E[Z 2 ])2 .
Proof. As before, E[ri] = 0 and E[ri²] = 1 for all i ∈ [n]. By 4-wise independence, the expectation of any product of ≤ 4 different ri's is the product of their expectations, which is zero. For instance, E[ri·rj·rk·rl] = E[ri]·E[rj]·E[rk]·E[rl] = 0. Note that 4-wise independence implies pairwise independence, ri² = ri⁴ = 1 and ri = ri³.
E[Z⁴] = E[(∑_{i=1}^{n} ri·fi)⁴]                                        (Since Z = ∑_{i=1}^{n} ri·fi at the end)
      = ∑_{i=1}^{n} E[ri⁴]·fi⁴ + 6·∑_{1≤i<j≤n} E[ri²rj²]·fi²fj²        (L.o.E. and 4-wise independence)
      = ∑_{i=1}^{n} fi⁴ + 6·∑_{1≤i<j≤n} fi²fj²                         (Since E[ri⁴] = E[ri²] = 1, ∀i ∈ [n])
The coefficient of ∑_{1≤i<j≤n} E[ri²rj²]·fi²fj² is (4 choose 2)·(2 choose 2) = 6. All other terms besides ∑_{i=1}^{n} E[ri⁴]·fi⁴ and 6·∑_{1≤i<j≤n} E[ri²rj²]·fi²fj² evaluate to 0 because of 4-wise independence.
Proof.
Pr[ |Z² − F2| > ε·F2 ] = Pr[ |Z² − E[Z²]| > ε·E[Z²] ]      (By Lemma 7.10)
                       ≤ Var(Z²) / (ε·E[Z²])²              (By Chebyshev's inequality)
                       ≤ 2(E[Z²])² / (ε·E[Z²])²            (By Lemma 7.11)
                       = 2/ε²
Claim 7.13. O(k log n) bits of randomness suffice to obtain a set of k-wise independent random variables.
Proof. Recall the definition of the hash family Hn,m. In a similar fashion², we consider hashes from the family (for prime p):
{ h_{a_{k−1}, a_{k−2}, . . . , a_1, a_0} : h(x) = ∑_{i=0}^{k−1} ai·x^i mod p = a_{k−1}x^{k−1} + a_{k−2}x^{k−2} + · · · + a_1x + a_0 mod p, ∀a_{k−1}, a_{k−2}, . . . , a_1, a_0 ∈ Zp }
This requires k random coefficients, which can be stored with O(k log n) bits.
Observe that the above analysis only requires {ri}_{i∈[n]} to be 4-wise independent. Claim 7.13 implies that AMS-2 only needs O(4 log n) bits to represent {ri}_{i∈[n]}.
Although the failure probability 2/ε² is large for small ε, one can repeat t times and output the mean (recall Trick 1). With t ∈ O(1/ε²) samples, the failure probability drops to 2/(tε²) ∈ O(1). When the failure probability is < 0.5, one can then call the routine k times independently, and return the median (recall Trick 2). On the whole, for any given ε > 0 and δ > 0, O(log(n)·log(1/δ)/ε²) space suffices to yield a (1 ± ε)-approximation algorithm that succeeds with probability > 1 − δ.
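A minimal Python sketch of the AMS estimator for F2. For simplicity the signs below are fully random rather than 4-wise independent, and the averaging over trials stands in for Trick 1; both simplifications are assumptions made for illustration.

```python
import random

def ams_f2_estimate(stream, n, trials=50):
    """Each trial draws random signs r_i in {-1, +1}, maintains Z = sum_i r_i f_i over
    the stream, and outputs Z^2; the mean over trials estimates F2 = sum_i f_i^2."""
    estimates = []
    for _ in range(trials):
        r = [random.choice((-1, 1)) for _ in range(n + 1)]   # signs for elements 1..n
        Z = 0
        for a in stream:
            Z += r[a]                  # streaming update on each arrival of element a
        estimates.append(Z * Z)
    return sum(estimates) / trials

# Stream with f_1 = 3, f_3 = 2, f_7 = 1: F2 = 9 + 4 + 1 = 14.
# print(ams_f2_estimate([1, 3, 3, 7, 1, 1], n=8))
```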
7.3.2 General k
Remark At the end of AMS-k, r = |{i ∈ [m] : i ≥ J and ai = aJ }| will
be the number of occurrences of aJ in suffix of the stream.
² See https://en.wikipedia.org/wiki/K-independent_hashing
E[Z | aJ = i]
= (1/fi)·[m(fi^k − (fi − 1)^k)] + (1/fi)·[m((fi − 1)^k − (fi − 2)^k)] + · · · + (1/fi)·[m(1^k − 0^k)]
= (m/fi)·[(fi^k − (fi − 1)^k) + ((fi − 1)^k − (fi − 2)^k) + · · · + (1^k − 0^k)]
= (m/fi)·fi^k
³ See https://en.wikipedia.org/wiki/Reservoir_sampling
E[Z] = ∑_{i=1}^{n} E[Z | aJ = i] · Pr[aJ = i]        (Condition on the choice of J)
     = ∑_{i=1}^{n} E[Z | aJ = i] · fi/m               (Since the choice of J is uniform at random)
     = ∑_{i=1}^{n} (m/fi)·fi^k · fi/m                 (From above)
     = ∑_{i=1}^{n} fi^k                               (Simplifying)
     = Fk                                             (Since Fk = ∑_{i=1}^{n} fi^k)
(∑_{i=1}^{n} fi)(∑_{i=1}^{n} fi^{2k−1}) ≤ n^{1−1/k} · (∑_{i=1}^{n} fi^k)²
Proof. Let M = max_{i∈[n]} fi; then fi ≤ M for any i ∈ [n] and M^k ≤ ∑_{i=1}^{n} fi^k. Hence,
(∑_{i=1}^{n} fi)(∑_{i=1}^{n} fi^{2k−1}) ≤ (∑_{i=1}^{n} fi)(M^{k−1} · ∑_{i=1}^{n} fi^k)                      (Pulling out an M^{k−1} factor)
  ≤ (∑_{i=1}^{n} fi)(∑_{i=1}^{n} fi^k)^{(k−1)/k}(∑_{i=1}^{n} fi^k)                                          (Since M^k ≤ ∑_{i=1}^{n} fi^k)
  = (∑_{i=1}^{n} fi)(∑_{i=1}^{n} fi^k)^{(2k−1)/k}                                                           (Merging the last two terms)
  ≤ n^{1−1/k}(∑_{i=1}^{n} fi^k)^{1/k}(∑_{i=1}^{n} fi^k)^{(2k−1)/k}                                          (Fact: (∑ fi)/n ≤ (∑ fi^k / n)^{1/k})
  = n^{1−1/k}(∑_{i=1}^{n} fi^k)²                                                                            (Merging the last two terms)
E[Z²] = m·[ (1^k − 0^k)² + (2^k − 1^k)² + · · · + (f1^k − (f1 − 1)^k)²                                          (1)
          + (1^k − 0^k)² + (2^k − 1^k)² + · · · + (f2^k − (f2 − 1)^k)²
          + . . .
          + (1^k − 0^k)² + (2^k − 1^k)² + · · · + (fn^k − (fn − 1)^k)² ]
      ≤ m·[ k·1^{k−1}·(1^k − 0^k) + k·2^{k−1}·(2^k − 1^k) + · · · + k·f1^{k−1}·(f1^k − (f1 − 1)^k)              (2)
          + k·1^{k−1}·(1^k − 0^k) + k·2^{k−1}·(2^k − 1^k) + · · · + k·f2^{k−1}·(f2^k − (f2 − 1)^k)
          + . . .
          + k·1^{k−1}·(1^k − 0^k) + k·2^{k−1}·(2^k − 1^k) + · · · + k·fn^{k−1}·(fn^k − (fn − 1)^k) ]
      ≤ m·[ k·f1^{2k−1} + k·f2^{2k−1} + · · · + k·fn^{2k−1} ]                                                   (3)
      = k · m · F_{2k−1}                                                                                        (4)
      = k · F1 · F_{2k−1}                                                                                       (5)
(1) By definition of E[Z²] (condition on J and expand in the same style as the proof of Theorem 7.14).
(2) For all 0 < b < a with a = b + 1, a^k − b^k = (a − b)(a^{k−1} + a^{k−2}b + · · · + ab^{k−2} + b^{k−1}) ≤ (a − b)·k·a^{k−1}
(5) F1 = ∑_{i=1}^{n} fi = m
Then,
Remark Proofs for Lemma 7.15 and Theorem 7.16 were omitted in class.
The above proofs are presented in a style consistent with the rest of the scribe
notes. Interested readers can refer to [AMS96] for details.
Remark One can apply an analysis similar to the case when k = 2, then
use Tricks 1 and 2.
Claim 7.17. For k > 2, a lower bound of Θ̃(n^{1−2/k}) is known.
Proof. Theorem 3.1 in [BYJKS04] gives the lower bound. See [IW05] for an algorithm that achieves it.
Chapter 8
Graph sketching
Let m be the total number of distinct edges in the stream. There are two
ways to represent connected components on a graph:
1. Every vertex stores a label where vertices in the same connected component have the same label
• Vertices send the coordinator their sums, and the coordinator computes XOR_A = ⊕_{u∈A} XOR_u
S = ⟨{v1, v2}, +⟩, ⟨{v2, v3}, +⟩, ⟨{v1, v3}, +⟩, ⟨{v4, v5}, +⟩, ⟨{v2, v5}, +⟩, ⟨{v1, v2}, −⟩
and we query for the cut edge {v2, v5} with A = {v1, v2, v3} at t = |S|. The figure below shows the graph G6 when t = 6:
³ In reality, the algorithm simulates all the vertices' actions so it is not a real multi-party computation setup.
(Figure: the graph G6 on vertices v1, . . . , v5 after the six updates.)
Vertex v1 sees ⟨{v1, v2}, +⟩, ⟨{v1, v3}, +⟩, and ⟨{v1, v2}, −⟩. So, XOR_{v1} = id({v1, v2}) ⊕ id({v1, v3}) ⊕ id({v1, v2}) = id({v1, v3}).
Remark Bit tricks are often used in the random linear network coding
literature (e.g. [HMK+ 06]).
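A minimal Python sketch of the single-cut XOR trick on the example stream above; the toy edge-ID scheme (concatenating endpoint IDs) is an assumption for illustration.

```python
from functools import reduce

def edge_id(u, v, bits=3):
    """Toy edge ID: concatenate the two endpoint IDs (smaller first) as bit fields."""
    u, v = min(u, v), max(u, v)
    return (u << bits) | v

def xor_sketch_cut_edge(stream, A):
    """Each vertex keeps the XOR of the IDs of its incident edge updates; XORing over
    the query set A cancels every edge with both endpoints in A, leaving the cut edge."""
    xor_at = {}
    for (u, v), _sign in stream:                 # insertions and deletions are both XORed in
        for w in (u, v):
            xor_at[w] = xor_at.get(w, 0) ^ edge_id(u, v)
    return reduce(lambda a, b: a ^ b, (xor_at.get(w, 0) for w in A), 0)

stream = [((1, 2), '+'), ((2, 3), '+'), ((1, 3), '+'), ((4, 5), '+'), ((2, 5), '+'), ((1, 2), '-')]
print(xor_sketch_cut_edge(stream, A={1, 2, 3}) == edge_id(2, 5))  # True: the cut edge {v2, v5}
```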
Pr[|E'| = 1]
= k · Pr[Cut edge {u, v} is marked; others are not]
= k · (1/k̂) · (1 − 1/k̂)^{k−1}                    (Edges marked independently w.p. 1/k̂)
≥ (k̂/2) · (1/k̂) · (1 − 1/k̂)^{k}                  (Since k̂/2 ≤ k ≤ k̂)
≥ (1/2) · 4^{−1}                                  (Since 1 − x ≥ 4^{−x} for x ≤ 1/2)
≥ 1/10
Remark The above analysis assumes that vertices can locally mark the
edges in a consistent manner (i.e. both endpoints of any edge make the
same decision whether to mark the edge or not). This can be done with a
sufficiently large string of shared randomness. We discuss this in Section 8.3.
From above, we know that Pr[|E'| = 1] ≥ 1/10. If |E'| = 1, we can re-use the idea from Section 8.1. However, if |E'| ≠ 1, then XOR_A may correspond erroneously to another edge in the graph. In the above example, id({v1, v2}) ⊕ id({v2, v4}) = 000001 ⊕ 001011 = 001010 = id({v2, v3}).
To fix this, we use random bits as edge IDs instead of simply concatenating vertex IDs: randomly assign (in a consistent manner) each edge a random ID of k = 20 log n bits. Since the XOR of random bits is random, for any edge e, Pr[XOR_A = id(e) | |E'| ≠ 1] = (1/2)^k = (1/2)^{20 log n}. Hence,
• If k̂ ≫ k, the marking probability will select nothing (in expectation).
• If k̂ ≪ k, more than one edge will get marked, which we will then detect (and ignore) since XOR_A will likely not be a valid edge ID.
Remark One can drop the memory constraint per vertex from O(log4 n)
to O(log3 n) by using a constant t instead of t ∈ O(log n) such that the
success probability is a constant larger than 1/2. Then, simulate Borůvka
for d2 log ne steps. See [AGM12] (Note that they use a slightly different
sketch).
⁵ See https://en.wikipedia.org/wiki/Small-bias_sample_space
Part III
Graph sparsification
Chapter 9
Preserving distances
Remark The first inequality is because G' has fewer edges than G. The second inequality upper bounds how much the distances "blow up" in the sparser graph G'.
Remark One way to prove the existence of an (α, β)-spanner is to use the probabilistic method: instead of giving an explicit construction, one designs a random process and argues that the probability that the spanner exists is strictly larger than 0. However, this may be somewhat unsatisfying as such proofs do not usually yield a usable construction. On the other hand, the randomized constructions shown later are explicit and will yield a spanner with high probability¹.
• g(G'') ≥ g(G') ≥ 2k + 1, since girth does not decrease with fewer edges.
n ≥ |V''|                                                            (By construction)
  ≥ |{v}| + ∑_{i=1}^{k} |{u ∈ V'' : dG''(u, v) = i}|                 (Look only at the k-hop neighbourhood of v)
  ≥ 1 + ∑_{i=1}^{k} (n^{1/k} + 1)(n^{1/k})^{i−1}                     (Vertices distinct and have deg ≥ n^{1/k} + 1)
  = 1 + (n^{1/k} + 1) · ((n^{1/k})^k − 1) / (n^{1/k} − 1)            (Sum of geometric series)
  > 1 + (n − 1)                                                      (Since (n^{1/k} + 1) > (n^{1/k} − 1))
  = n
Let us consider the family of graphs G on n vertices with girth > 2k. It can
be shown by contradiction that a graph G with n vertices with girth > 2k can-
not have a proper (2k − 1)-spanner2 : Assume G0 is a proper (2k − 1)-spanner
with edge {u, v} removed. Since G0 is a (2k − 1)-spanner, dG0 (u, v) ≤ 2k − 1.
Adding {u, v} to G0 will form a cycle of length at most 2k, contradicting the
assumption that G has girth > 2k.
Let g(n, k) be the maximum possible number of edges in a graph from G.
By the above argument, a graph on n vertices with g(n, k) edges cannot have
a proper (2k − 1)-spanner. Note that the greedy construction of Theorem ??
will always produce a (2k − 1)-spanner with ≤ g(n, k) edges. The size of the
spanner is asymptotically tight if Conjecture ?? holds.
1. E[|S|] = np
2. For any vertex v with degree d(v) and neighbourhood N(v) = {u ∈ V : (u, v) ∈ E}, Pr[|N(v) ∩ S| = 0] ≤ e^{−d(v)·p/2}
1.
E[|S|] = E[∑_{v∈V} Xv]               (By construction of S)
       = ∑_{v∈V} E[Xv]               (Linearity of expectation)
       = ∑_{v∈V} p                   (Since E[Xv] = Pr[Xv = 1] = p)
       = np                          (Since |V| = n)
2.
E[|N(v) ∩ S|] = E[∑_{u∈N(v)} Xu]     (By definition of N(v) ∩ S)
              = ∑_{u∈N(v)} E[Xu]     (Linearity of expectation)
              = ∑_{u∈N(v)} p         (Since E[Xu] = Pr[Xu = 1] = p)
              = d(v) · p
Then, by a Chernoff bound, Pr[|N(v) ∩ S| = 0] ≤ e^{−d(v)·p/2}.
Proof.
Construction Partition the vertex set V into light vertices L and heavy vertices H, where L = {v ∈ V : deg(v) ≤ n^{1/2}} and H = {v ∈ V : deg(v) > n^{1/2}}.
2. Initialize E2' = ∅.
E[|E'|] = E[|E1' ∪ E2'|] ≤ E[|E1'| + |E2'|] = E[|E1'|] + E[|E2'|] ≤ n^{3/2} + n · 10n^{1/2} log n ∈ Õ(n^{3/2})
Stretch factor Consider two arbitrary vertices u and v with the shortest
path Pu,v in G. Let h be the number of heavy vertices in Pu,v . We split the
analysis into two cases: (i) h ≤ 1; (ii) h ≥ 2. Recall that a heavy vertex has
degree at least n1/2 .
Case (i) All edges in Pu,v are adjacent to a light vertex and are thus in E10 .
Hence, dG0 (u, v) = dG (u, v), with additive stretch 0.
Case (ii)
Claim 9.6. Suppose there exists a vertex w ∈ Pu,v such that (w, s) ∈ E for some s ∈ S; then dG'(u, v) ≤ dG(u, v) + 2.
(Figure: the path Pu,v from u through w to v, with w adjacent to some s ∈ S.)
³ Though we may have repeated edges
Proof.
dG0 (u, v) ≤ dG0 (u, s) + dG0 (s, v) (1)
= dG (u, s) + dG (s, v) (2)
≤ dG (u, w) + dG (w, s) + dG (s, w) + dG (w, v) (3)
≤ dG (u, w) + 1 + 1 + dG (w, v) (4)
≤ dG (u, v) + 2 (5)
Let w be a heavy vertex in Pu,v with degree d(w) > n^{1/2}. By Claim 9.4 with p = 10n^{−1/2} log n, Pr[|N(w) ∩ S| = 0] ≤ e^{−10 log n / 2} = n^{−5}. Taking a union bound over all possible pairs of vertices u and v,
Pr[∃u, v ∈ V, Pu,v has no neighbour in S] ≤ (n choose 2) · n^{−5} ≤ n^{−3}
Then, Claim 9.6 tells us that the additive stretch factor is at most 2 with probability ≥ 1 − 1/n³.
Therefore, with high probability (≥ 1 − 1/n³), the construction yields a 2-additive spanner.
Remark A way to remove log factors from Theorem 9.5 is to sample only
n1/2 nodes into S, and then add all edges incident to nodes that don’t have an
adjacent node in S. The same argument then shows that this costs O(n3/2 )
edges in expectation.
Theorem 9.7. [Che13] For a fixed k ≥ 1, every graph G on n vertices has a 4-additive spanner with Õ(n^{7/5}) edges.
Proof.
Construction Partition vertex set V into light vertices L and heavy vertices
H, where
L = {v ∈ V : deg(v) ≤ n2/5 } and H = {v ∈ V : deg(v) > n2/5 }
2. Initialize E20 = ∅.
3. Initialize E30 = ∅.
Number of edges
• Since there are ≤ n heavy vertices, ≤ n edges of the form (v, s') for v ∈ H, s' ∈ S' will be added to E3'. Then, for shortest s − s' paths with ≤ n^{1/5} heavy internal vertices, only edges adjacent to the heavy vertices need to be counted because those adjacent to light vertices are already accounted for in E1'. By Claim 9.4 with p = 10n^{−2/5} log n, E[|S'|] = n · 10n^{−2/5} log n = 10n^{3/5} log n. So, E3' contributes ≤ n + (|S'| choose 2) · n^{1/5} ≤ n + (10n^{3/5} log n)² · n^{1/5} ∈ Õ(n^{7/5}) edges to the count of |E'|.
⁴ Though we may have repeated edges
Stretch factor Consider two arbitrary vertices u and v with the shortest
path Pu,v in G. Let h be the number of heavy vertices in Pu,v . We split the
analysis into three cases: (i) h ≤ 1; (ii) 2 ≤ h ≤ n1/5 ; (iii) h > n1/5 . Recall
that a heavy vertex has degree at least n2/5 .
Case (i) All edges in Pu,v are adjacent to a light vertex and are thus in E10 .
Hence, dG0 (u, v) = dG (u, v), with additive stretch 0.
Case (ii) Denote the first and last heavy vertices in Pu,v as w and w' respectively. Recall that in Case (ii), including w and w', there are at most n^{1/5} heavy vertices between w and w'. By Claim 9.4, with p = 10n^{−2/5} log n,
Pr[|N(w) ∩ S'| = 0] = Pr[|N(w') ∩ S'| = 0] ≤ e^{−n^{2/5} · 10n^{−2/5} log n / 2} = n^{−5}
• By definition of P*_{s,s'}, l* ≤ dG(s, w) + dG(w, w') + dG(w', s') = dG(w, w') + 2.
• Since there are no internal heavy vertices between u − w and w' − v, Case (i) tells us that dG'(u, w) = dG(u, w) and dG'(w', v) = dG(w', v).
Thus,
dG'(u, v) ≤ dG'(u, w) + dG'(w, w') + dG'(w', v)                                        (1)
          ≤ dG'(u, w) + dG'(w, s) + dG'(s, s') + dG'(s', w') + dG'(w', v)              (2)
          = dG'(u, w) + dG'(w, s) + l* + dG'(s', w') + dG'(w', v)                      (3)
          ≤ dG'(u, w) + dG'(w, s) + dG(w, w') + 2 + dG'(s', w') + dG'(w', v)           (4)
          = dG'(u, w) + 1 + dG(w, w') + 2 + 1 + dG'(w', v)                             (5)
          = dG(u, w) + 1 + dG(w, w') + 2 + 1 + dG(w', v)                               (6)
          ≤ dG(u, v) + 4                                                               (7)
(Figure: the path Pu,v with heavy vertices w and w' adjacent to s, s' ∈ S', joined by a path P*_{s,s'} of length l*.)
Case (iii)
Claim 9.8 tells us that |⋃_{w∈Heavy} N(w)| ≥ ∑_{w∈Heavy} |N(w)| · (1/3). Let Nu,v denote the union of the neighbourhoods of the heavy vertices on Pu,v.
Applying Claim 9.4 with p = 30 · n^{−3/5} · log n and Claim 9.8, we get
E[|Nu,v ∩ S|] ≥ n^{1/5} · n^{2/5} · (1/3) · 30 · n^{−3/5} · log n = 10 log n
and Pr[|Nu,v ∩ S| = 0] ≤ e^{−10 log n / 2} = n^{−5}
Taking a union bound over all possible pairs of vertices u and v,
Pr[∃u, v ∈ V, Pu,v has no neighbour in S] ≤ (n choose 2) · n^{−5} ≤ n^{−3}
Then, Claim 9.6 tells us that the additive stretch factor is at most 4 with probability ≥ 1 − 1/n³.
Therefore, with high probability (≥ 1 − 1/n³), the construction yields a 4-additive spanner.
Concluding remarks
the parts back into one. The distance error must be even over a bipartite
graph, and so the additive (2k + 1)-spanner construction must actually give
an additive 2k-spanner by showing that the error bound is preserved over
the “collapse”.
Chapter 10
Preserving cuts
Pr[Success] ≥ (1 − k/(nk/2)) · (1 − k/((n − 1)k/2)) · · · (1 − k/(3k/2))
            = (1 − 2/n) · (1 − 2/(n − 1)) · · · (1 − 2/3)
            = ((n − 2)/n) · ((n − 3)/(n − 1)) · · · (1/3)
            = 2 / (n(n − 1))
            = 1 / (n choose 2)
Corollary 10.3. There are ≤ 2
minimum cuts in a graph.
Proof. Since RandomContraction successfully produces any given
mini-
mum cut with probability at least 1/ n2 , there can be at most ≤ n2 many
minimum cuts.
In general, we can bound the number of cuts that are of size at most
α · µ(G) for α ≥ 1.
Proof. See Lemma 2.2 and Appendix A (in particular, Corollary A.7) of a
version2 of [Kar99].
1. Let p = Ω(log n / n)
3. Define w(e) = 1/p for each edge e ∈ E'
(Figure: a cut (S, V \ S) of size k'.)
Using Theorem 10.4 and union bound over all possible cuts in G,
Theorem 10.6. [Kar94] For a graph G, consider sampling every edge independently with probability pe into E', and assign weight 1/pe to each edge e ∈ E'. Let H = (V, E') be the sampled graph and suppose µ(H) ≥ c log n / ε², for some constant c. Then, with high probability, every weighted cut size in H is (well-estimated) within (1 ± ε) of the original cut size in G.
Theorem 10.6 can be proved by using a variant of the earlier proof. Interested readers can see Theorem 2.1 of [Kar94].
(Figure: a dumbbell graph, two copies of Kn joined by a single edge.)
Running the uniform edge sampling will not sparsify the above dumbbell graph, as µ(G) = 1 leads to a large sampling probability p.
Proof.
(Figure: nested k1-strong and k2-strong components of a graph.)
1. By definition of maximum
4. Consider a minimum cut CG(S, V \ S). Since ke ≥ µ(G) for all edges e ∈ CG(S, V \ S), these edges contribute ≤ µ(G) · (1/ke) ≤ µ(G) · (1/µ(G)) = 1 to the summation. Remove these edges from G and repeat the argument on any remaining connected components. Since each cut removal contributes at most 1 to the summation and the process stops when we reach n components, ∑_{e∈E} 1/ke ≤ n − 1.
For a graph G with minimum cut size µ(G) = k, consider the following procedure to construct H:
1. Set q = c log n / ε² for some constant c
2. Independently put each edge e ∈ E into E' with probability pe = q / ke
3. Define w(e) = 1/pe = ke/q for each edge e ∈ E'
E[|E'|] = E[∑_{e∈E} Xe]                 (By definition)
        = ∑_{e∈E} E[Xe]                 (Linearity of expectation)
        = ∑_{e∈E} pe                    (Since E[Xe] = Pr[Xe = 1] = pe)
        = ∑_{e∈E} q/ke                  (Since pe = q/ke)
        ≤ q(n − 1)                      (Since ∑_{e∈E} 1/ke ≤ n − 1)
        ∈ O(n log n / ε²)               (Since q = c log n / ε² for some constant c)
Remark One can apply Chernoff bounds to argue that |E 0 | is highly con-
centrated around its expectation.
Proof. Let k1 < k2 < · · · < ks be all possible strength values in the graph. Consider G as a weighted graph with edge weight q/ke for each edge e ∈ E, and a family of unweighted graphs F1, . . . , Fs where Fi = (V, Ei) and Ei = {e ∈ E : ke ≥ ki}. Observe that:
• F1 = G
Chapter 11
Warm up: Ski rental
We now study the class of online problems where one has to commit to
provably good decisions as data arrive in an online fashion. To measure the
effectiveness of online algorithms, we compare the quality of the produced
solution against the solution from an optimal offline algorithm that knows
the whole sequence of information a priori. The tool we will use for doing
such a comparison is competitive analysis.
Linear search
Free swap Move the queried paper from position i to the top of the stack
for 0 cost.
Paid swap For any consecutive pair of items (a, b) before i, swap their rel-
ative order to (b, a) for 1 cost.
What is the best online strategy for manipulating the stack to minimize total
cost on a sequence of queries?
Remark One can reason that the free swap costs 0 because we already
incurred a cost of i to reach it.
double or halve the hash table, incurring a runtime of O(m) time for doubling
or halving a hash table of size m.
Worst case analysis tells us that dynamic resizing will incur O(m) run
time per operation. However, resizing only occurs after O(m) insertion/dele-
tion operations, each costing O(1). Amortized analysis allows us to conclude
that this dynamic resizing runs in amortized O(1) time. There are two equiv-
alent ways to see it:
• Split the O(m) resizing overhead and “charge” O(1) to each of the
earlier O(m) operations.
• The total run time for every sequential chunk of m operations is O(m).
Hence, each step takes O(m)/m = O(1) amortized run time.
12.2 Move-to-Front
Move-to-Front (MTF) [ST85] is an online algorithm for the linear search
problem where we move the queried item to the top of the stack (and do no
other swaps). We will show that MTF is a 2-competitive algorithm for linear
search. Before we analyze MTF, let us first define a potential function Φ and
look at examples to gain some intuition.
Let Φt be the number of pairs of papers (i, j) that are ordered differently
in MTF’s stack and OPT’s stack at time step t. By definition, Φt ≥ 0 for
any t. We also know that Φ0 = 0 since MTF and OPT operate on the same
initial stack sequence.
1 2 3 4 5 6
MTF’s stack a b c d e f
OPT’s stack a e b d c f
Now, we have the inversions (b, e), (c, d), (c, e) and (d, e), so Φ = 4.
Scenario 2 We swap (e, d) in OPT’s stack — The inversion (d, e) was de-
stroyed due to the swap.
1 2 3 4 5 6
MTF’s stack a b c d e f
OPT’s stack a b d e c f
In either case, we see that any paid swap results in ±1 inversions, which
changes Φ by ±1.
Proof. We will consider the potential function Φ as before and perform amor-
tized analysis on any given input sequence σ. Let at = cM T F (t) + (Φt − Φt−1 )
be the amortized cost of MTF at time step t, where cM T F (t) is the cost by
MTF at time t. Suppose the queried item at time step t is at position k in
MTF. Denote:
(Figure: MTF's and OPT's stacks, with item x at position k in MTF's stack; the items above x split into F, the items also in front of x in OPT's stack, and B, the items behind x, with k ≥ |F| = f and k ≥ |B| = b.)
Since x is the k-th item, MTF will incur cM T F (t) = k to reach item x,
then move it to the top. On the other hand, OPT needs to spend at least
f + 1 to reach x. Suppose OPT does p paid swaps, then cOP T (t) ≥ f + 1 + p.
(4) Telescoping
(5) Since Φt ≥ 0 = Φ0
(6) Since cMTF(σ) = ∑_{t=1}^{|σ|} cMTF(t)
Chapter 13
Paging
On the other hand, since OPT can see the entire sequence σ, OPT can choose to evict the page that is requested furthest in the future. Then, in every k steps, OPT has ≤ 1 cache miss. Thus, cOPT ≤ |σ| / k.
Hence, k · cOPT ≤ cA(σ).
Claim 13.4. Any conservative online algorithm A is k-competitive.
Proof. For any given input sequence σ, partition σ into maximal phases —
P1 , P2 , . . . — where each phase has k distinct pages, and a new phase is
created only if the next element is different from the ones in the current
phase. Let xi be the first item that does not belong in Phase i.
σ = ⟨ k distinct pages ⟩ x1 ⟨ k distinct pages ⟩ x2 . . .    (Phase 1, Phase 2, . . . )
– If p is not in cache,
∗ If all pages in cache are marked, unmark all
∗ Evict a random unmarked page
– Mark page p
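A minimal Python sketch of the Random Marking Algorithm as described above; the tie-handling details and counting only misses are implementation assumptions.

```python
import random

def random_marking(requests, k):
    """Unmarked pages are eviction candidates; when a miss occurs and every cached page
    is marked, a new phase starts (unmark all) and a random unmarked page is evicted."""
    cache, marked, misses = set(), set(), 0
    for p in requests:
        if p not in cache:
            misses += 1
            if len(cache) >= k:
                if marked >= cache:                      # all cached pages marked: new phase
                    marked = set()
                victim = random.choice(sorted(cache - marked))
                cache.remove(victim)                     # evict a random unmarked page
            cache.add(p)
        marked.add(p)                                    # mark the requested page
    return misses

# print(random_marking([1, 2, 3, 1, 4, 2, 5, 1], k=3))
```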
Proof. Let Pi be the set of pages at the start of phase i. Since requesting a
marked page does not incur any cost, it suffices to analyze the first time any
request occurs within the phase.
Denote N as the set of new requests (pages that are not in Pi ) and O
as the set of old requests (pages that are in Pi ). By definition, |O| ≤ k and
|N| + |O| = k. Order the old requests in O in the order in which they appear
in the phase and let xj be the j th old request, for j ∈ {1, . . . , |O|}. Define
mi = |N |, and lj as the number of distinct new requests before xj .
Phase i:
σ = ( . . . new new new old new old . . . )
Here the first old request x1 is preceded by l1 = 3 distinct new requests, and the second old request x2 by l2 = 4.
For j ∈ {1, . . . , |O|}, consider the first time the j th old request xj occurs.
Since the adversary is oblivious, xj is equally likely to be in any position in
the cache at the start of the phase. After seeing (j − 1) old requests and
marking their cache positions, there are k − (j − 1) initial positions in the
cache that xj could be in. Since we have only seen lj new requests and (j −1)
old requests, there are at least k − lj − (j − 1) old pages remaining in the cache (with equality if every one of these new requests evicted an old page). So, the probability that xj is in the cache when requested is at least (k − lj − (j − 1))/(k − (j − 1)). Then,
Cost due to O = Σ_{j=1}^{|O|} Pr[xj is not in cache when requested]     Sum over O
             ≤ Σ_{j=1}^{|O|} lj / (k − (j − 1))                         From above
             ≤ Σ_{j=1}^{|O|} mi / (k − (j − 1))                         Since lj ≤ mi = |N|
             ≤ mi · Σ_{j=1}^{k} 1 / (k − (j − 1))                       Since |O| ≤ k
             = mi · Σ_{j=1}^{k} 1/j                                     Rewriting
             = mi · Hk                                                  Since Σ_{i=1}^{n} 1/i = Hn
Since every new request incurs a unit cost, the cost due to N is mi .
Hence, cRM A (Phase i) = (Cost due to N ) + (Cost due to O) ≤ mi + mi · Hk .
We now analyze OPT's performance. By definition of phases, among all requests in two consecutive phases (say, i − 1 and i), a total of k + mi distinct pages are requested. So, OPT incurs at least mi cache misses over these two phases. To avoid double counting, we lower bound cOPT(σ) separately for odd and even i: cOPT(σ) ≥ Σ_{odd i} mi and cOPT(σ) ≥ Σ_{even i} mi. Together,

2 · cOPT(σ) ≥ Σ_{odd i} mi + Σ_{even i} mi = Σ_i mi

Therefore, we have:

cRMA(σ) ≤ Σ_i (mi + mi · Hk) = O(log k) · Σ_i mi ≤ O(log k) · cOPT(σ)
Proof.
C = Σ_x qx · C                            Sum over all possible inputs x
  ≥ Σ_x qx · Ep[c(A, x)]                  Since C = max_{x∈X} Ep[c(A, x)]
  = Σ_x qx · Σ_a pa · c(a, x)             Definition of Ep[c(A, x)]
  = Σ_a pa · Σ_x qx · c(a, x)             Swap summations
  = Σ_a pa · Eq[c(a, X)]                  Definition of Eq[c(a, X)]
  ≥ Σ_a pa · D                            Since D = min_{a∈A} Eq[c(a, X)]
  = D                                     Sum over all possible algorithms a
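As a sanity check of the chain of inequalities, the following Python snippet evaluates C and D on a toy cost matrix with illustrative distributions p (over deterministic algorithms) and q (over inputs); any such choice satisfies C ≥ D.

cost = [[1.0, 3.0],      # cost[a][x] for deterministic algorithms a in {0, 1}
        [2.0, 0.5]]      # and inputs x in {0, 1} (illustrative numbers)
p = [0.6, 0.4]           # randomized algorithm: distribution over deterministic algorithms
q = [0.3, 0.7]           # input distribution

C = max(sum(p[a] * cost[a][x] for a in range(2)) for x in range(2))   # worst input vs. randomized algorithm
D = min(sum(q[x] * cost[a][x] for x in range(2)) for a in range(2))   # best deterministic algorithm vs. random input
print(C, D, C >= D)      # here C = 2.0 and D = 0.95, so C >= D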
Remark We do not fix the starting positions of the k servers, but we compare against the performance of OPT on σ with the same initial server positions.
The paging problem is a special case of the k-server problem where the
points are all possible pages, the distance metric is unit cost between any
two different points, and the servers represent the pages in cache of size k.
[Figure: all k servers lie to the left of the point 0 on the real line; the requests alternate between the points 1 + ε and 2 + ε, and the single server s∗ ends up serving all of them.]
Without loss of generality, suppose all servers currently lie on the left of “0”.
For ε > 0, consider the sequence σ = (1 + ε, 2 + ε, 1 + ε, 2 + ε, . . . ). The first request will move a single server s∗ to “1 + ε”. By the greedy algorithm, subsequent requests then repeatedly use s∗ to satisfy requests from both “1 + ε” and “2 + ε” since s∗ is the closest server. This incurs a total cost of ≥ |σ| while OPT could station 2 servers on “1 + ε” and “2 + ε” and incur a constant total cost on input sequence σ.
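A quick simulation (with illustrative starting positions and ε = 0.1) shows the gap: the greedy strategy keeps paying roughly 1 per request, while stationing one server at each of the two request points costs a constant.

eps = 0.1
servers = [-3.0, -1.0]                          # both servers start to the left of 0
requests = [1 + eps, 2 + eps] * 10              # sigma = (1+eps, 2+eps, 1+eps, 2+eps, ...)

greedy_cost = 0.0
for r in requests:
    i = min(range(len(servers)), key=lambda s: abs(servers[s] - r))   # closest server
    greedy_cost += abs(servers[i] - r)
    servers[i] = r                              # greedy moves only the closest server

opt_cost = abs(-3.0 - (1 + eps)) + abs(-1.0 - (2 + eps))              # station one server at each point
print(greedy_cost, opt_cost)                    # greedy grows with |sigma|; OPT stays constant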
• If request r is on one side of all servers, move the closest server to cover it.
[Figure: before and after; only the nearest server moves to r.]
• If request r falls between two adjacent servers, move both of them towards r at the same speed until the closer one reaches r.
[Figure: before and after; the two neighbouring servers move towards r.]
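To make these two rules concrete, here is a minimal Python sketch of a single Double Coverage step on the line; the function name and the in-place update are assumptions of this sketch, not the notes' pseudocode. Its cost in the "between two servers" case is 2z, matching the analysis below.

def double_coverage_step(servers, r):
    # servers: sorted list of positions on the line; returns DC's movement cost for request r.
    if r <= servers[0]:                         # request to the left of all servers
        cost = servers[0] - r
        servers[0] = r
    elif r >= servers[-1]:                      # request to the right of all servers
        cost = r - servers[-1]
        servers[-1] = r
    else:                                       # request between two adjacent servers
        i = max(j for j in range(len(servers)) if servers[j] <= r)
        z = min(r - servers[i], servers[i + 1] - r)
        servers[i] += z                         # both neighbours move z towards r;
        servers[i + 1] -= z                     # the closer one reaches r
        cost = 2 * z
    return cost

positions = [-3.0, -1.0]
print(double_coverage_step(positions, 1.1), positions)   # 2.1 [-3.0, 1.1]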
Proof. Suppose, for a contradiction, that both xi and xi+1 are moving
away from their partners. That means yi ≤ xi < r < xi+1 ≤ yi+1 at
the end of OPT’s action (before DC moved xi and xi+1 ). This is a
contradiction since OPT must have a server at r but there is no server
between yi and yi+1 by definition.
Since at least one of xi or xi+1 is moving closer to its partner, ∆DC (Φt,1 ) ≤
z − z = 0.
Meanwhile, since xi and xi+1 are moved a distance of z towards each other, the gap xi+1 − xi decreases by 2z, while the changes in the pairwise distances to all other servers cancel out; hence ∆DC(Φt,2) = −2z.
Hence,
cDC(t) + ∆(Φt) = cDC(t) + ∆DC(Φt) + ∆OPT(Φt) ≤ 2z − 2z + k · x = k · x ≤ k · cOPT(t)
In all cases, we see that cDC (t) + ∆(Φt ) ≤ k · cOP T (t). Hence,
Σ_{t=1}^{|σ|} (cDC(t) + ∆(Φt)) ≤ Σ_{t=1}^{|σ|} k · cOPT(t)     Summing over σ
⇒ Σ_{t=1}^{|σ|} cDC(t) + (Φ|σ| − Φ0) ≤ k · cOPT(σ)             Telescoping
⇒ Σ_{t=1}^{|σ|} cDC(t) − Φ0 ≤ k · cOPT(σ)                       Since Φt ≥ 0
⇒ cDC(σ) ≤ k · cOPT(σ) + Φ0                                     Since cDC(σ) = Σ_{t=1}^{|σ|} cDC(t)
Chapter 16
Multiplicative Weights Update (MWU)
Definition 16.1 (The learning from experts problem). Every day, we are to make a binary decision. At the end of the day, a binary output is revealed and we incur a mistake if our decision did not match the output. Suppose we have access to n experts e1, . . . , en, each of which recommends a binary decision every day. How does one make use of the experts to minimize the total number of mistakes on an online binary sequence?
Toy setting Consider a stock market with only a single stock. Every day,
we decide whether to buy the stock or not. At the end of the day, the stock
value will be revealed and we incur a mistake/loss of 1 if we did not buy
when the stock value rose, or bought when the stock value fell.
Day      1  2  3  4  5
Output   1  1  0  0  1
e1       1  1  0  0  1
e2       1  0  0  0  1
e3       1  1  1  1  0
Proof. Observe that when DMWU makes a mistake, the weighted majority of experts was wrong and their weight decreases by a factor of (1 − ε). Suppose that Σ_{i=1}^{n} wi = x at the start of the day. If we make a mistake, x drops to at most (x/2)(1 − ε) + x/2 = x(1 − ε/2). That is, the overall weight reduces by at least a factor of (1 − ε/2). Since the best expert e∗ makes m∗ mistakes, his/her weight at the end is (1 − ε)^{m∗}. By the above observation, the total weight of all experts at the end of the sequence is at most n(1 − ε/2)^m. Then,

(1 − ε)^{m∗} ≤ n(1 − ε/2)^m               Expert e∗'s weight is part of the overall weight
⇒ m∗ ln(1 − ε) ≤ ln n + m ln(1 − ε/2)     Taking ln on both sides
⇒ m∗(−ε − ε²) ≤ ln n + m(−ε/2)            Since −x − x² ≤ ln(1 − x) ≤ −x for x ∈ (0, 1/2)
⇒ m ≤ 2(1 + ε)m∗ + (2 ln n)/ε             Rearranging
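For concreteness, here is a small Python sketch of the deterministic weighted-majority rule analysed above: follow the weighted majority vote, then multiply every wrong expert's weight by (1 − ε). The function name and parameters are illustrative.

def dmwu(expert_predictions, outcomes, eps=0.25):
    # expert_predictions[i][day] in {0, 1}; returns the number of mistakes made.
    n = len(expert_predictions)
    w = [1.0] * n
    mistakes = 0
    for day, outcome in enumerate(outcomes):
        vote1 = sum(w[i] for i in range(n) if expert_predictions[i][day] == 1)
        vote0 = sum(w[i] for i in range(n) if expert_predictions[i][day] == 0)
        decision = 1 if vote1 >= vote0 else 0
        if decision != outcome:
            mistakes += 1
        for i in range(n):
            if expert_predictions[i][day] != outcome:
                w[i] *= 1 - eps                  # penalize wrong experts
    return mistakes

experts = [[1, 1, 0, 0, 1], [1, 0, 0, 0, 1], [1, 1, 1, 1, 0]]   # the toy table above
print(dmwu(experts, [1, 1, 0, 0, 1]))                            # 0 mistakes on this toy input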
Proof. Consider only two experts e0 and e1, where e0 always outputs 0 and e1 always outputs 1. Any binary sequence σ must contain at least |σ|/2 zeroes or at least |σ|/2 ones. Thus, m∗ ≤ |σ|/2. On the other hand, the adversary looks at A and produces a sequence σ which forces A to incur a loss every day. Thus, m = |σ| ≥ 2m∗.
• On each day:
– Pick a random expert with probability proportional to their weight (i.e. pick ei with probability wi / Σ_{j=1}^{n} wj).
– Follow that expert's recommendation.
– For each wrong expert, set wi to (1 − ε) · wi, for some constant ε ∈ (0, 1/2).

Another way to think about the probabilities is to split all experts into two groups A = {Experts that output 0} and B = {Experts that output 1}. Then, decide ‘0’ with probability wA/(wA + wB) and ‘1’ with probability wB/(wA + wB), where wA = Σ_{ei∈A} wi and wB = Σ_{ei∈B} wi are the sums of weights in each set.
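A minimal Python sketch of RMWU, with illustrative names, following the description above:

import random

def rmwu(expert_predictions, outcomes, eps=0.25):
    # Each day, follow a single expert drawn with probability proportional to its weight,
    # then multiply every wrong expert's weight by (1 - eps).
    n = len(expert_predictions)
    w = [1.0] * n
    mistakes = 0
    for day, outcome in enumerate(outcomes):
        chosen = random.choices(range(n), weights=w)[0]
        if expert_predictions[chosen][day] != outcome:
            mistakes += 1
        for i in range(n):
            if expert_predictions[i][day] != outcome:
                w[i] *= 1 - eps
    return mistakes

print(rmwu([[1, 1, 0, 0, 1], [1, 0, 0, 0, 1], [1, 1, 1, 1, 0]], [1, 1, 0, 0, 1]))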
Theorem 16.5. Suppose the best expert makes m∗ mistakes and RMWU makes m mistakes. Then,

E[m] ≤ (1 + ε)m∗ + (ln n)/ε
Proof. Fix an arbitrary day j ∈ {1, . . . , |σ|}. Denote A = {Experts that output 0 on day j} and B = {Experts that output 1 on day j}, where wA = Σ_{ei∈A} wi and wB = Σ_{ei∈B} wi are the sums of weights in each set. Let Fj be the weighted fraction of wrong experts on day j. If σj = 0, then Fj = wB/(wA + wB). If σj = 1, then Fj = wA/(wA + wB). By definition of Fj, RMWU makes a mistake on day j with probability Fj. By linearity of expectation, E[m] = Σ_{j=1}^{|σ|} Fj.

Since the best expert e∗ makes m∗ mistakes, his/her weight at the end is (1 − ε)^{m∗}. On each day, RMWU reduces the overall weight by a factor of (1 − ε · Fj) by penalizing wrong experts. Hence, the total weight of all experts at the end of the sequence is n · Π_{j=1}^{|σ|} (1 − ε · Fj). Then,

(1 − ε)^{m∗} ≤ n · Π_{j=1}^{|σ|} (1 − ε · Fj)     Expert e∗'s weight is part of the overall weight
⇒ (1 − ε)^{m∗} ≤ n · e^{−ε Σ_{j=1}^{|σ|} Fj}      Since (1 − x) ≤ e^{−x}
⇒ (1 − ε)^{m∗} ≤ n · e^{−ε·E[m]}                  Since E[m] = Σ_{j=1}^{|σ|} Fj
⇒ m∗ ln(1 − ε) ≤ ln n − ε · E[m]                  Taking ln on both sides
⇒ m∗(−ε − ε²) ≤ ln n − ε · E[m]                   Since −x − x² ≤ ln(1 − x)
⇒ E[m] ≤ (1 + ε)m∗ + (ln n)/ε                     Rearranging
16.4 Generalization
Denote the loss of expert i on day t as l_i^t ∈ [−ρ, ρ], for some constant ρ. When we incur a loss, update the weights of the affected experts from wi to (1 − ε · l_i^t / ρ) · wi. Note that l_i^t / ρ is simply the loss normalized to [−1, 1].
Claim 16.6 (Without proof). With RMWU, we have

E[m] ≤ min_i ( Σ_t l_i^t + ε · Σ_t |l_i^t| + (ρ ln n)/ε )
Remark If each expert has a different ρi, one can modify the update rule and the claim to use ρi in place of a uniform ρ.
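Under the update rule above, a short Python sketch of the generalized method (names and parameters are illustrative; the return value is the expected loss of following a random expert drawn proportionally to the weights):

def mwu_losses(losses, eps=0.1, rho=1.0):
    # losses[t][i] = l_i^t in [-rho, rho]; update w_i <- w_i * (1 - eps * l_i^t / rho).
    n = len(losses[0])
    w = [1.0] * n
    expected_loss = 0.0
    for day in losses:
        total = sum(w)
        expected_loss += sum(w[i] / total * day[i] for i in range(n))
        for i in range(n):
            w[i] *= 1 - eps * day[i] / rho       # normalized loss in [-1, 1]
    return expected_loss

print(mwu_losses([[0.2, -0.1, 0.4], [0.3, 0.0, -0.2]]))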
[Example: a graph on v1, . . . , v5 in which each edge is labelled with its current load over its capacity (e.g. 5/13, 5/10, 0/21); routing a new request increases the loads along its path, e.g. 5/10 becomes 8/10 and 0/21 becomes 13/21.]
• le∗(j) as the optimal offline algorithm's relative load of edge e after request j.

In other words, the objective is to minimize max_{e∈E} le(|σ|) for a given sequence σ. Denoting Λ as the (unknown) optimal congestion factor, we normalize p̃e(i) = pe(i)/Λ, l̃e(j) = le(j)/Λ, and l̃e∗(j) = le∗(j)/Λ. Let a be a constant to be determined. Consider algorithm A which does the following on request i + 1:
Lemma 16.9. For a = 1 + 1/(2γ), Φ(j + 1) − Φ(j) ≤ 0.
Proof. Let Pj+1 be the path that A found and P∗j+1 be the path that the optimal offline algorithm assigned to the (j + 1)-th request ⟨s(j + 1), t(j + 1), d(j + 1)⟩. For any edge e, observe the following:

• If e ∉ P∗j+1, the load on e due to the optimal offline algorithm remains unchanged. That is, l̃e∗(j + 1) = l̃e∗(j). On the other hand, if e ∈ P∗j+1, then l̃e∗(j + 1) = l̃e∗(j) + p̃e(j + 1).

• If e is neither in Pj+1 nor in P∗j+1, then a^{l̃e(j+1)} (γ − l̃e∗(j + 1)) = a^{l̃e(j)} (γ − l̃e∗(j)).

That is, only edges used by Pj+1 or P∗j+1 affect Φ(j + 1) − Φ(j).
Using the observations above together with Lemma 16.8 and the fact that A computes a shortest path, one can show that Φ(j + 1) − Φ(j) ≤ 0. In detail,
Φ(j + 1) − Φ(j)
= Σ_{e∈E} [ a^{l̃e(j+1)} (γ − l̃e∗(j+1)) − a^{l̃e(j)} (γ − l̃e∗(j)) ]
= Σ_{e∈Pj+1\P∗j+1} (a^{l̃e(j+1)} − a^{l̃e(j)}) (γ − l̃e∗(j))
  + Σ_{e∈P∗j+1} [ a^{l̃e(j+1)} (γ − l̃e∗(j) − p̃e(j+1)) − a^{l̃e(j)} (γ − l̃e∗(j)) ]       (1)
= Σ_{e∈Pj+1} (a^{l̃e(j+1)} − a^{l̃e(j)}) (γ − l̃e∗(j)) − Σ_{e∈P∗j+1} a^{l̃e(j+1)} p̃e(j+1)
≤ Σ_{e∈Pj+1} (a^{l̃e(j+1)} − a^{l̃e(j)}) γ − Σ_{e∈P∗j+1} a^{l̃e(j+1)} p̃e(j+1)            (2)
≤ Σ_{e∈Pj+1} (a^{l̃e(j+1)} − a^{l̃e(j)}) γ − Σ_{e∈P∗j+1} a^{l̃e(j)} p̃e(j+1)              (3)
= Σ_{e∈Pj+1} (a^{l̃e(j)+p̃e(j+1)} − a^{l̃e(j)}) γ − Σ_{e∈P∗j+1} a^{l̃e(j)} p̃e(j+1)        (4)
≤ Σ_{e∈P∗j+1} [ (a^{l̃e(j)+p̃e(j+1)} − a^{l̃e(j)}) γ − a^{l̃e(j)} p̃e(j+1) ]               (5)
= Σ_{e∈P∗j+1} a^{l̃e(j)} [ (a^{p̃e(j+1)} − 1) γ − p̃e(j+1) ]
= Σ_{e∈P∗j+1} a^{l̃e(j)} [ ((1 + 1/(2γ))^{p̃e(j+1)} − 1) γ − p̃e(j+1) ]                   (6)
≤ 0                                                                                      (7)

(2) Since l̃e∗(j) ≥ 0
(3) Since l̃e(j + 1) ≥ l̃e(j)
Proof. Since Φ(0) = mγ and Φ(j + 1) − Φ(j) ≤ 0, we see that Φ(j) ≤ mγ for all j ∈ {1, . . . , |σ|}. Consider the edge e with the highest congestion. Since γ − l̃e∗(j) ≥ 1, we see that

(1 + 1/(2γ))^L ≤ a^L · (γ − l̃e∗(j)) ≤ Φ(j) ≤ mγ ≤ n²γ

Taking log on both sides and rearranging, we get:

L ≤ (2 log(n) + log(γ)) · 1/log(1 + 1/(2γ)) ∈ O(log n)
³Existing paths are preserved; we simply ignore them in the subsequent computations of ce.
Bibliography
[AB17] Amir Abboud and Greg Bodwin. The 4/3 additive spanner exponent is tight. Journal of the ACM (JACM), 64(4):28, 2017.
[ADD+ 93] Ingo Althöfer, Gautam Das, David Dobkin, Deborah Joseph,
and José Soares. On sparse spanners of weighted graphs. Discrete
& Computational Geometry, 9(1):81–100, 1993.
[AGM12] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Ana-
lyzing graph structure via linear measurements. In Proceedings
of the twenty-third annual ACM-SIAM symposium on Discrete
Algorithms, pages 459–467. SIAM, 2012.
[AHK12] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplica-
tive weights update method: a meta-algorithm and applications.
Theory of Computing, 8(1):121–164, 2012.
[AMS96] Noga Alon, Yossi Matias, and Mario Szegedy. The space com-
plexity of approximating the frequency moments. In Proceedings
of the twenty-eighth annual ACM symposium on Theory of com-
puting, pages 20–29. ACM, 1996.
[Lee18] James R Lee. Fusible HSTs and the randomized k-server conjecture. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 438–449. IEEE, 2018.
[Mos15] Dana Moshkovitz. The projection games conjecture and the NP-hardness of ln n-approximating set-cover. Theory of Computing, 11(1):221–235, 2015.
[NY18] Jelani Nelson and Huacheng Yu. Optimal lower bounds for
distributed and streaming spanning forest computation. arXiv
preprint arXiv:1807.05135, 2018.