AAscript
Mohsen Ghaffari
Lecture notes by Davin Choo
This is a draft version; please check again for updates. Feedback and comments would be greatly appreciated and should be emailed to ghaffari@inf.ethz.ch. Last update: Feb 5, 2019.
Contents (excerpt)
I Approximation algorithms
1 Greedy algorithms
  1.1 Minimum set cover
2 Approximation schemes
  2.1 Knapsack
  2.2 Bin packing
  2.3 Minimum makespan scheduling
4 Rounding ILPs
  4.1 Minimum set cover
  4.2 Minimizing congestion in multi-commodity routing
8 Graph sketching
  8.1 Warm up: Finding the single cut
  8.2 Warm up 2: Finding one out of k > 1 cut edges
  8.3 Maximal forest with O(n log^4 n) memory
13 Paging
  13.1 Types of adversaries
  13.2 Random Marking Algorithm (RMA)
Week 13 (10 Dec - 14 Dec 2018): Section 13.2 till end of Chapter 15
Pr[X] ≥ 1 − 1/poly(n), say, Pr[X] ≥ 1 − 1/n^c for some constant c ≥ 2.
Useful inequalities
• (n/k)^k ≤ (n choose k) ≤ (en/k)^k
• (n choose k) ≤ n^k
• lim_{n→∞} (1 − 1/n)^n = e^{−1}
• ∑_{i=1}^{∞} 1/i² = π²/6
Part I
Approximation algorithms
Chapter 1
Greedy algorithms
Example (Figure: a set cover instance with sets S1, S2, S3, S4 over elements e1, . . . , e5.)
Algorithm 1 GreedySetCover(U, S, c)
  T ← ∅                                          ▷ Selected subsets of S
  C ← ∅                                          ▷ Covered vertices
  while C ≠ U do
    Si ← arg min_{Si ∈ S\T} c(Si) / |Si \ C|      ▷ Pick set with lowest price-per-item
    T ← T ∪ {Si}                                  ▷ Add Si to selection
    C ← C ∪ Si                                    ▷ Update covered vertices
  end while
  return T
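To make the greedy rule concrete, here is a minimal Python sketch of GreedySetCover. The instance encoding (sets stored as Python sets keyed by name, costs in a dict) is an assumption made purely for illustration.

```python
def greedy_set_cover(universe, sets, cost):
    """Greedy set cover: repeatedly pick the set with the lowest price-per-item."""
    covered, chosen = set(), []
    while covered != universe:
        # price-per-item of S_i is c(S_i) / |S_i \ C|; skip sets that add nothing
        best = min(
            (s for s in sets if s not in chosen and sets[s] - covered),
            key=lambda s: cost[s] / len(sets[s] - covered),
        )
        chosen.append(best)
        covered |= sets[best]
    return chosen

# Hypothetical instance on elements e1..e5
universe = {"e1", "e2", "e3", "e4", "e5"}
sets = {"S1": {"e1", "e2"}, "S2": {"e2", "e3"}, "S3": {"e3", "e4", "e5"}, "S4": {"e5"}}
cost = {"S1": 1.0, "S2": 1.0, "S3": 1.0, "S4": 1.0}
print(greedy_set_cover(universe, sets, cost))  # ['S3', 'S1']
```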
(Figure: the elements e1, . . . , en in the order the greedy algorithm covers them; when ek is covered, the elements ek, . . . , en are still uncovered and are covered in OPT by the sets O1, . . . , Op.)
n − k + 1 = |U \ C_{k−1}|
          ≤ |O1 ∩ (U \ C_{k−1})| + · · · + |Op ∩ (U \ C_{k−1})|
          = ∑_{j=1}^{p} |Oj ∩ (U \ C_{k−1})|

3. By definition, for each j ∈ {1, . . . , p}, ppi(Oj) = c(Oj) / |Oj ∩ (U \ C_{k−1})|.

Since the greedy algorithm will pick a set in S \ T with the lowest price-per-item, price(ek) ≤ ppi(Oj) for all j ∈ {1, . . . , p}. Hence,
The second equality is because the cost of the sets is partitioned across the prices of all n elements.
(Figure: a tight example, with sets S1, S2, S3, . . . , Sk covering 2, 4, 8 = 2·2², . . . , 2·2^{k−1} elements respectively, together with a set S_{k+1}.)
ei,k is covered, price(ei,k) ≤ c(Oi) / (d − k + 1) (it is an equality if the greedy algorithm also chose Oi to first cover ei,k, . . . , ei,d). Hence, the greedy cost of covering the elements in Oi (i.e. ei,1, . . . , ei,d) is at most
∑_{k=1}^{d} c(Oi) / (d − k + 1) = c(Oi) · ∑_{k=1}^{d} 1/k = c(Oi) · Hd ≤ c(Oi) · H∆
Summing over all p sets to cover all n elements, we have c(T) ≤ H∆ · c(OPT).
Remark We apply the same greedy algorithm for small ∆ but analyze it in a more localized manner. Crucially, in this analysis, we always work with the exact degree d and only use the fact d ≤ ∆ after the summation. Observe that ∆ ≤ n and the approximation factor equals that of Theorem 1.3 when ∆ = n.
Definition 1.6 (Minimum vertex cover problem). Given a graph G = (V, E), find a subset S ⊆ V such that:
1. Every edge e ∈ E has at least one endpoint in S
2. |S| is minimized
(Figure: an example graph on vertices a, b, c, d, e, f.)
Algorithm 2 GreedyMaximalMatching(V, E)
  M ← ∅                                          ▷ Selected edges
  C ← ∅                                          ▷ Set of incident vertices
  while E ≠ ∅ do
    ei = {u, v} ← Pick any edge from E
    M ← M ∪ {ei}                                  ▷ Add ei to the matching
    C ← C ∪ {u, v}                                ▷ Add endpoints to incident vertices
    Remove all edges in E that are incident to u or v
  end while
  return M
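A minimal Python sketch of the greedy maximal matching and the vertex cover formed by its matched endpoints; the edge-list encoding below is an assumption for illustration.

```python
def greedy_maximal_matching(edges):
    """Pick edges greedily; the matched endpoints form a 2-approximate vertex cover."""
    matching, cover = [], set()
    for (u, v) in edges:
        # keep the edge only if neither endpoint is already matched
        if u not in cover and v not in cover:
            matching.append((u, v))
            cover.update((u, v))
    return matching, cover

# Hypothetical example graph on vertices a..f
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
M, C = greedy_maximal_matching(edges)
print(M, C)  # [('a','b'), ('c','d'), ('e','f')] and a cover of size 2*|M| = 6
```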
(Figure: a maximal matching M and the vertex cover C formed by its matched endpoints, where |C| = 2 · |M|.)
Sketch of Proof Let C be the set of all vertices involved in the greedily selected hyperedges. In a manner similar to the proof of Theorem 1.8, C can be shown to be an f-approximation.
Chapter 2
Approximation schemes
2.1 Knapsack
Definition 2.3 (Knapsack problem). Consider a set S with n items. Each item i has size(i) ∈ Z+ and profit(i) ∈ Z+. Given a budget B, find a subset S* ⊆ S such that:
1. size(S*) = ∑_{i∈S*} size(i) ≤ B
2. profit(S*) = ∑_{i∈S*} profit(i) is maximized
Since each cell can be computed in O(1) using DP via the above recurrence, the matrix M can be filled in O(n² · pmax) time and S* may be extracted by back-tracing from M[n, n · pmax].
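The following Python sketch shows one way to realize the profit-indexed DP table described above; the concrete indexing and recovery of the best profit are assumptions for illustration.

```python
def knapsack_min_size_dp(sizes, profits, budget):
    """M[i][p] = minimum total size of a subset of items 1..i with profit exactly p.
    Fills the table in O(n^2 * pmax) time; the answer is the largest feasible profit."""
    n, pmax = len(sizes), max(profits)
    INF = float("inf")
    M = [[INF] * (n * pmax + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        for p in range(n * pmax + 1):
            M[i][p] = M[i - 1][p]                                    # skip item i
            take = profits[i - 1]
            if p >= take and M[i - 1][p - take] + sizes[i - 1] < M[i][p]:
                M[i][p] = M[i - 1][p - take] + sizes[i - 1]          # take item i
    return max(p for p in range(n * pmax + 1) if M[n][p] <= budget)

# print(knapsack_min_size_dp([3, 4, 2], [30, 50, 25], budget=6))  # 75 (items 2 and 3)
```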
Algorithm 3 FPTAS-Knapsack(S, B, ε)
  k ← max{1, ⌊ε · pmax / n⌋}                      ▷ Choice of k to be justified later
  for i ∈ {1, . . . , n} do
    profit'(i) ← ⌊profit(i) / k⌋                  ▷ Round and scale the profits
  end for
  Run DP in Section 2.1.1 with B, size(i), and re-scaled profit'(i)
  return Items selected by DP
∑_{i=1}^{n} loss(i) ≤ nk                       (loss(i) ≤ k for any item i)
                    < ε · pmax                  (Since k = ⌊ε · pmax / n⌋)
                    ≤ ε · profit(OPT(I))        (Since pmax ≤ profit(OPT(I)))
Example Consider S = {0.5, 0.1, 0.1, 0.1, 0.5, 0.4, 0.5, 0.4, 0.4}, where |S| = n = 9. Since ∑_{i=1}^{n} size(i) = 3, at least 3 bins are needed. One can verify that 3 bins suffice: b1 = b2 = b3 = {0.5, 0.4, 0.1}. Hence, |OPT(S)| = 3.
(Figure: the packing b1 = b2 = b3 = {0.5, 0.4, 0.1}.)
Algorithm 4 FirstFit(S)
  B ← ∅                                          ▷ Collection of bins
  for i ∈ {1, . . . , n} do
    if size(i) ≤ free(b) for some bin b ∈ B (pick the first such b) then
      free(b) ← free(b) − size(i)                 ▷ Put item i into existing bin b
    else
      B ← B ∪ {b'}                                ▷ Put item i into a fresh bin b'
      free(b') ← 1 − size(i)
    end if
  end for
  return B
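A minimal Python sketch of FirstFit, using the example instance from above (the small tolerance constant is an implementation assumption to guard against floating-point error).

```python
def first_fit(sizes):
    """First-Fit bin packing: place each item into the first bin with enough room."""
    bins, free = [], []                      # bins[j] holds items; free[j] is remaining capacity
    for s in sizes:
        for j in range(len(bins)):
            if s <= free[j] + 1e-12:         # first bin that fits
                bins[j].append(s)
                free[j] -= s
                break
        else:                                # no existing bin fits: open a fresh bin
            bins.append([s])
            free.append(1.0 - s)
    return bins

print(len(first_fit([0.5, 0.1, 0.1, 0.1, 0.5, 0.4, 0.5, 0.4, 0.4])))  # 4 bins, while OPT uses 3
```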
Lemma 2.6. Using FirstFit, at most one bin is less than half-full. That is, |{b ∈ B : size(b) ≤ 1/2}| ≤ 1, where B is the output of FirstFit.
Proof. Suppose, for a contradiction, that there are two bins bi and bj such that i < j, size(bi) ≤ 1/2 and size(bj) ≤ 1/2. Then, FirstFit could have put all items in bj into bi, and would not have created bj. This is a contradiction.
(Figure: the four bins b1, . . . , b4 produced by FirstFit on the example instance.)
Remark If we first sort the item weights in non-increasing order, then one can show that running FirstFit on the sorted item weights will yield a 3/2-approximation algorithm for bin packing. See footnote for details¹.
It is natural to wonder whether we can do better than a 3/2-approximation. Unfortunately, unless P = NP, we cannot do so efficiently. To prove this, we show that if we can efficiently derive a (3/2 − ε)-approximation for bin packing, then the partition problem (which is NP-hard) can be solved efficiently.
In the following sections, we work towards a PTAS for bin packing whose runtime will be exponential in 1/ε. To do this, we first consider two simplifying assumptions and design algorithms for them. Then, we adapt the algorithm to a PTAS by removing the assumptions one at a time.
Assumption (1) All items have size at least ε, for some ε > 0.
¹ Curious readers can read the following lecture notes for a proof on First-Fit-Decreasing: http://ac.informatik.uni-freiburg.de/lak_teaching/ws11_12/combopt/notes/bin_packing.pdf and https://dcg.epfl.ch/files/content/sites/dcg/files/courses/2012%20-%20Combinatorial%20Optimization/12-BinPacking.pdf
Algorithm 5 PTAS-BinPacking(I = S, ε)
  k ← ⌈1/ε²⌉
  Q ← ⌊nε²⌋
  Partition the n items into k non-overlapping groups, each with ≤ Q items
  for i ∈ {1, . . . , k} do
    imax ← max_{item j in group i} size(j)
    for item j in group i do
      size(j) ← imax
    end for
  end for
  Denote the modified instance as J
  return A(J)
Figure 2.1: Partition items into k groups, each with ≤ Q items; label groups in ascending sizes; J rounds up item sizes, J' rounds down item sizes.
Proof. By Lemma 2.10 and the fact that |OPT(J')| ≤ |OPT(I)|.
Proof. By Assumption (1), all item sizes are at least ε, so |OPT(I)| ≥ nε. Then, Q = ⌊nε²⌋ ≤ ε · |OPT(I)|. Apply Lemma 2.10.
Proof. If FirstFit does not open a new bin, the theorem trivially holds. Suppose FirstFit opens a new bin (using m bins in total), then we know that at least (m − 1) bins are strictly more than (1 − ε')-full.
Algorithm 6 Full-PTAS-BinPacking(I = S, ε)
  ε' ← min{1/2, ε/2}                              ▷ See analysis why we chose such an ε'
  X ← Items with size < ε'                        ▷ Ignore small items
  P ← PTAS-BinPacking(S \ X, ε')                  ▷ By Theorem 2.12,
                                                  ▷ |P| ≤ (1 + ε') · |OPT(S \ X)|
  P' ← Using FirstFit, add items in X to P        ▷ Handle small items
  return Resultant packing P'
|OPT(I)| ≥ ∑_{i=1}^{n} size(i)                  (Lower bound on |OPT(I)|)
         > (m − 1)(1 − ε')                       (From above observation)
Hence,
m < |OPT(I)| / (1 − ε') + 1                      (Rearranging)
  < |OPT(I)| · (1 + 2ε') + 1                     (Since 1/(1 − ε') ≤ 1 + 2ε' for ε' ≤ 1/2)
  ≤ (1 + ε) · |OPT(I)| + 1                       (By choice of ε' = min{1/2, ε/2})
(Figure: a schedule of jobs p1, . . . , p7 on machines M1, M2, M3 with makespan 11.)
|OPT(I)| ≤ (1/m) · ∑_{i=1}^{n} pi + pmax ≤ 2 · L(I)
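The bound above is achieved by greedy list scheduling (Graham's rule). Below is a minimal Python sketch of that rule, with the heap-based machine selection being an implementation choice rather than part of the notes.

```python
import heapq

def graham_list_scheduling(jobs, m):
    """Assign each job to the currently least-loaded machine; the resulting makespan
    is at most (1/m) * sum(jobs) + max(jobs) <= 2 * L(I)."""
    loads = [0.0] * m
    heap = [(0.0, i) for i in range(m)]      # (load, machine) min-heap
    heapq.heapify(heap)
    for p in jobs:
        load, i = heapq.heappop(heap)        # least-loaded machine
        loads[i] = load + p
        heapq.heappush(heap, (loads[i], i))
    return max(loads)                        # makespan of the greedy schedule

# print(graham_list_scheduling([3, 5, 7, 2, 4, 6, 1], m=3))
```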
(Figure: another schedule of p1, . . . , p7 on M1, M2, M3 with makespan 14.)
Lemma 2.17. Let plast be the last job that finishes running. If plast > (1/3) · |OPT(I)|, then |ModifiedGraham(I)| = |OPT(I)|.
Let us denote the jobs that are alone in C as heavy jobs, and the machines
they are on as heavy machines.
Suppose there are k heavy jobs occupying a machine each in OP T (I). Then,
there are 2(m − k) + 1 jobs (two non-heavy jobs per machine in C, and pn ) to
be distributed across m − k machines. By the pigeonhole principle, at least
one machine M ∗ will get ≥ 3 jobs in OP T (I). However, since the smallest
job pn takes > (1/3) · |OPT(I)| time, M* will spend > |OPT(I)| time. This is
a contradiction.
(Figure: a schedule of p1, . . . , p7 on M1, M2, M3 with makespan 13.)
original job sizes, PTAS-Makespan follows P's bin packing but uses bins of size t(1 + ε) to account for the rounded-down job sizes. Suppose jobs 1 and 2 with sizes p1 and p2 were rounded down to p'1 and p'2, and P assigns them to the same bin (i.e., p'1 + p'2 ≤ t). Then, due to the rounding process, their original sizes also fit into a bin of size t(1 + ε) (i.e., p1 + p2 ≤ t(1 + ε)).
Finally, small jobs are handled using FirstFit. Let α(I, t, ε) be the final bin configuration produced by PTAS-Makespan on parameter t and |α(I, t, ε)| be the number of bins used. Since |OPT(I)| ∈ [L, 2L], there will be a t ∈ {L, L + εL, L + 2εL, . . . , 2L} such that |α(I, t, ε)| ≤ Bin(I, t) ≤ m bins (see Lemma 2.19 for the first inequality). Note that running binary search on t also works, but we only care about poly-time.
Proof. If FirstFit does not open a new bin, then |α(I, t, ε)| ≤ Bin(I, t) since α(I, t, ε) uses an additional (1 + ε) buffer. If FirstFit opens a new bin (say, totalling b bins), then there are at least (b − 1) produced bins from A (exact solving on rounded-down non-small items) that are more than (t(1 + ε) − εt) = t-full. Hence, any bin packing algorithm must use strictly more than (b − 1)t / t = b − 1 bins. In particular, Bin(I, t) ≥ b = |α(I, t, ε)|.
Chapter 3
Randomized approximation schemes
• A runs in poly(|I|, 1/ε)
• F = C1 ∨ · · · ∨ Cm is a disjunction of clauses
Any clause with both xi and ¬xi is trivially false. As they can be removed
in a single scan of F , assume that F does not contain such trivial clauses.
However, there are exponentially many terms and there exist instances where
truncating the sum yields arbitrarily bad approximation.
Let |M| denote the total number of 1's in M. Since |Si| = 2^{n−|Ci|}, |M| = ∑_{i=1}^{m} |Si| = ∑_{i=1}^{m} 2^{n−|Ci|}. As every column represents a satisfying assignment, there are exactly f(F) "topmost" 1's.
       α1   α2   ...   αf(F)
C1     0    1    ...   0
C2     1    1    ...   1
C3     0    0    ...   0
...    ...  ...  ...   ...
Cm     0    1    ...   1

Table 3.1: Red 1's indicate the ("topmost") smallest-index clause Ci satisfied by each assignment αj
Algorithm 10 DNF-Count(F, ε)
  X ← 0                                           ▷ Empirical number of "topmost" 1's sampled
  for k = 9m/ε² times do
    Ci ← Sample one of the m clauses, where Pr[Ci chosen] = 2^{n−|Ci|} / |M|
    αj ← Sample one of the 2^{n−|Ci|} satisfying assignments of Ci
    IsTopmost ← True
    for l ∈ {1, . . . , i − 1} do                  ▷ Check if αj is "topmost"
      if Cl[αj] = 1 then                           ▷ Checkable in O(n) time
        IsTopmost ← False
      end if
    end for
    if IsTopmost then
      X ← X + 1
    end if
  end for
  return (|M| · X) / k
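A minimal Python sketch of the sampling procedure in DNF-Count; the clause encoding (a dict of required literal values per clause) and the tiny example formula are assumptions for illustration.

```python
import random

def dnf_count(clauses, n, eps):
    """Approximately count satisfying assignments of a DNF formula by sampling
    (clause, assignment) pairs and keeping only 'topmost' ones."""
    m = len(clauses)
    sizes = [2 ** (n - len(c)) for c in clauses]           # |S_i| = 2^(n - |C_i|)
    M = sum(sizes)                                         # total number of 1's in the table
    k = int(9 * m / eps ** 2)
    X = 0
    for _ in range(k):
        i = random.choices(range(m), weights=sizes)[0]     # Pr[C_i] = |S_i| / |M|
        alpha = {v: random.random() < 0.5 for v in range(n)}
        alpha.update(clauses[i])                           # force the literals of C_i
        # alpha is "topmost" if no earlier clause is satisfied by it
        if not any(all(alpha[v] == val for v, val in clauses[l].items()) for l in range(i)):
            X += 1
    return M * X / k

# Hypothetical example: F = (x0 AND x1) OR (NOT x0 AND x2) over 3 variables; f(F) = 4.
# print(dnf_count([{0: True, 1: True}, {0: False, 2: True}], n=3, eps=0.5))
```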
Proof. Let Xi be the indicator variable of whether the i-th sampled assignment is "topmost", where p = Pr[Xi = 1]. By Lemma 3.3, p = Pr[Xi = 1] = f(F) / |M|. Let X = ∑_{i=1}^{k} Xi be the empirical number of "topmost" 1's. Then, E(X) = kp by linearity of expectation. By picking k = 9m/ε²,
Pr[ |(|M| · X)/k − f(F)| ≥ ε · f(F) ]
= Pr[ |X − k·f(F)/|M|| ≥ ε · k·f(F)/|M| ]        (Multiply throughout by k/|M|)
= Pr[ |X − kp| ≥ ε · kp ]                        (Since p = f(F)/|M|)
≤ 2 exp(−ε²kp/3)                                 (By Chernoff bound)
= 2 exp(−3m · f(F)/|M|)                          (Since k = 9m/ε² and p = f(F)/|M|)
≤ 2 exp(−3)                                      (Since |M| ≤ m · f(F))
≤ 1/4
Negating, we get:
Pr[ |(|M| · X)/k − f(F)| ≤ ε · f(F) ] ≥ 1 − 1/4 = 3/4
One can see that Ωi ⊆ Ωi−1 as removal of ei in Gi−1 can only increase the number of valid colorings. Furthermore, suppose ei = {u, v}, then Ωi−1 \ Ωi = {c : c(u) = c(v)}. Fix the coloring of, say, the lower-indexed vertex u. Then, there are ≥ q − ∆ ≥ 2∆ + 1 − ∆ = ∆ + 1 possible recolorings of v in Gi. Hence,
|Ωi| ≥ (∆ + 1) · |Ωi−1 \ Ωi| ≥ (∆ + 1) · (|Ωi−1| − |Ωi|)
This implies that ri = |Ωi| / |Ωi−1| ≥ (∆ + 1)/(∆ + 2) ≥ 3/4 since ∆ ≥ 2.
Since f(G) = |Ωm| = |Ω0| · (|Ω1|/|Ω0|) · · · (|Ωm|/|Ωm−1|) = |Ω0| · ∏_{i=1}^{m} ri = q^n · ∏_{i=1}^{m} ri, if we can find a good estimate of each ri with high probability, then we have an FPRAS for counting the number of valid graph colorings of G.
Algorithm 12 Color-Count(G, ε)
  r̂1, . . . , r̂m ← 0                              ▷ Estimates for ri
  for i = 1, . . . , m do
    for k = 128m³/ε² times do
      c ← Sample coloring of Gi−1                  ▷ Using SampleColor
      if c is a valid coloring for Gi then
        r̂i ← r̂i + 1/k                              ▷ Update empirical estimate of ri = |Ωi| / |Ωi−1|
      end if
    end for
  end for
  return q^n · ∏_{i=1}^{m} r̂i
Lemma 3.9. For all i ∈ {1, . . . , m}, Pr[ |r̂i − ri| ≤ (ε/2m) · ri ] ≥ 1 − 1/(4m).
Proof. Let Xj be the indicator variable of whether the j-th sampled coloring for Ωi−1 is a valid coloring for Ωi, where p = Pr[Xj = 1]. From above, we know that p = Pr[Xj = 1] = |Ωi| / |Ωi−1| ≥ 3/4. Let X = ∑_{j=1}^{k} Xj be the empirical number of colorings that are valid for both Ωi−1 and Ωi, captured by k · r̂i. Then, E(X) = kp by linearity of expectation. Picking k = 128m³/ε²,
Pr[ |X − kp| ≥ (ε/2m) · kp ] ≤ 2 exp(−(ε/2m)² kp / 3)      (By Chernoff bound)
                             = 2 exp(−32mp/3)               (Since k = 128m³/ε²)
                             ≤ 2 exp(−8m)                   (Since p ≥ 3/4)
                             ≤ 1/(4m)                       (Since exp(−x) ≤ 1/x for x > 0)
Remark Recall from Claim 3.8 that SampleColor actually gives an approximately uniform coloring. A more careful analysis can absorb the approximation of SampleColor under Color-Count's ε factor.
¹ See https://www.wolframalpha.com/input/?i=e%5Ex+%3C%3D+1%2B2x
Chapter 4
Rounding ILPs
Linear programming (LP) and integer linear programming (ILP) are versa-
tile models but with different solving complexities — LPs are solvable in
polynomial time while ILPs are N P-hard.
Definition 4.1 (Linear program (LP)). The canonical form of an LP is
minimize cT x
subject to Ax ≥ b
x≥0
where x is the vector of n variables (to be determined), b and c are vectors
of (known) coefficients, and A is a (known) matrix of coefficients. cT x and
obj(x) are the objective function and objective value of the LP respectively.
For an optimal variable assignment x∗ , obj(x∗ ) is the optimal value.
ILPs are defined similarly with the additional constraint that variables
take on integer values. As we will be relaxing ILPs into LPs, to avoid confu-
sion, we use y for ILP variables to contrast against the x variables in LPs.
Definition 4.2 (Integer linear program (ILP)). The canonical form of an
ILP is
minimize cT y
subject to Ay ≥ b
y≥0
y ∈ Zn
where y is the vector of n variables (to be determined), b and c are vectors
of (known) coefficients, and A is a (known) matrix of coefficients. cT y and
obj(y) are the objective function and objective value of the LP respectively.
For an optimal variable assignment y ∗ , obj(y ∗ ) is the optimal value.
Remark We can define LPs and ILPs for maximization problems similarly. One can also solve a maximization problem with a minimization LP using the same constraints but a negated objective function. The optimal value of the solved LP will then be the negation of the maximized optimal value.
In this chapter, we illustrate how one can model set cover and multi-
commodity routing as ILPs, and how to perform rounding to yield approx-
imations for these problems. As before, Chernoff bounds will be a useful
inequality in our analysis toolbox.
Example (Figure: the set cover instance from Chapter 1 with sets S1, . . . , S4 and elements e1, . . . , e5.)
ILP_{Set cover}:
  minimize    ∑_{i=1}^{m} yi · c(Si)                              ▷ Cost of chosen set cover
  subject to  ∑_{i : ej ∈ Si} yi ≥ 1    ∀j ∈ {1, . . . , n}       ▷ Every item ej is covered
              yi ∈ {0, 1}               ∀i ∈ {1, . . . , m}
Upon solving ILP_{Set cover}, the set {Si : i ∈ {1, . . . , m} ∧ yi* = 1} is the optimal solution for the given set cover instance. However, as solving ILPs is NP-hard, we consider relaxing the integrality constraint by replacing the binary yi variables with real-valued/fractional xi ∈ [0, 1]. Such a relaxation yields the corresponding LP:
LP_{Set cover}:
  minimize    ∑_{i=1}^{m} xi · c(Si)                              ▷ Cost of chosen fractional set cover
  subject to  ∑_{i : ej ∈ Si} xi ≥ 1    ∀j ∈ {1, . . . , n}       ▷ Every item ej is fractionally covered
              xi ∈ [0, 1]               ∀i ∈ {1, . . . , m}
Since LPs can be solved in polynomial time, we can find the optimal
fractional solution to LPSet cover in polynomial time.
Example The corresponding ILP for the example set cover instance is:
After relaxing:
Proof. Since x* is a feasible (not to mention, optimal) solution for LP_{Set cover}, in each constraint there is at least one xi* that is greater than or equal to 1/f. Hence, every element is covered by some set Si whose yi is rounded to 1.
¹ Using Microsoft Excel. See tutorial: http://faculty.sfasu.edu/fisherwarre/lp_solver.html. Or, use an online LP solver such as: http://online-optimizer.appspot.com/?model=builtin:default.mod
Since e−1 ≈ 0.37, we would expect the rounded y not to cover several
items. However, one can amplify the success probability by considering in-
dependent roundings and taking the union (See ApxSetCoverILP).
Algorithm 13 ApxSetCoverILP(U, S, c)
ILPSet cover ← Construct ILP of problem instance
LPSet cover ← Relax integral constraints on indicator variables y to x
x∗ ← Solve LPSet cover
T ←∅ . Selected subset of S
for k · ln(n) times (for any constant k > 1) do
for i ∈ {1, . . . , m} do
yi ← Set to 1 with probability x∗i
if yi = 1 then
T ← T ∪ {Si } . Add to selected sets T
end if
end for
end for
return T
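Below is a minimal Python sketch of ApxSetCoverILP. It uses scipy.optimize.linprog as a stand-in LP solver (an assumption; any LP solver works), and the instance encoding and constant in the number of rounding rounds are chosen purely for illustration.

```python
import math
import random
from scipy.optimize import linprog

def apx_set_cover_ilp(universe, sets, cost, rounds_const=2):
    """Solve the LP relaxation of set cover, then take the union of k*ln(n)
    independent randomized roundings (each set Si is picked with probability x*_i)."""
    elems, names = sorted(universe), sorted(sets)
    c = [cost[s] for s in names]
    # covering constraint sum_{i: e in Si} x_i >= 1 written as -sum x_i <= -1
    A_ub = [[-1.0 if e in sets[s] else 0.0 for s in names] for e in elems]
    b_ub = [-1.0] * len(elems)
    x_star = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(names)).x
    chosen = set()
    for _ in range(max(1, math.ceil(rounds_const * math.log(len(elems))))):
        for i, s in enumerate(names):
            if random.random() < x_star[i]:
                chosen.add(s)
    return chosen
```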
(iv) (Single path): All demand for commodity i passes through a single path
pi (no repeated vertices).
(v) (Congestion factor): ∀e ∈ E, ∑_{i=1}^{k} di · 1_{e∈pi} ≤ λ · c(e), where the indicator 1_{e∈pi} = 1 ⇐⇒ e ∈ pi.
(vi) (Minimum congestion): λ is minimized.
(Figures: an example multi-commodity routing instance on vertices s1, s2, s3, a, b, c, t1, t2, t3 with edge capacities, followed by several possible routings of the demands that achieve different congestion.)
ILP_{MCR-Given-Paths}:
  minimize    λ                                                              ▷ (1)
  subject to  ∑_{i=1}^{k} ∑_{p∈Pi, e∈p} di · yi,p ≤ λ · c(e)    ∀e ∈ E       ▷ (2)
              ∑_{p∈Pi} yi,p = 1                                 ∀i ∈ [k]     ▷ (3)
Relax the integral constraint on yi,p to xi,p ∈ [0, 1] and solve the corresponding LP. Define λ* = obj(LP_{MCR-Given-Paths}) and denote x* as a fractional path selection that achieves λ*. To obtain a valid path selection, for each commodity i ∈ [k], pick path p ∈ Pi with weighted probability x*_{i,p} / ∑_{p∈Pi} x*_{i,p} = x*_{i,p}. Note that by constraint (3), ∑_{p∈Pi} x*_{i,p} = 1.
Remark 1 For a fixed i, a path is selected exclusively (only one!) (cf. set
cover’s roundings where we may pick multiple sets for an item).
Theorem 4.9. Pr[ obj(y) ≥ (2c log m / log log m) · max{1, λ*} ] ≤ 1/m^{c−1}
E(Ye) = E(∑_{i=1}^{k} di · Ye,i)
      = ∑_{i=1}^{k} di · E(Ye,i)                       (By linearity of expectation)
      = ∑_{i=1}^{k} di · ∑_{p∈Pi, e∈p} xi,p            (Since Pr[Ye,i = 1] = ∑_{p∈Pi, e∈p} xi,p)
For every edge e ∈ E, applying² the tight form of Chernoff bounds with (1 + ε) = 2 log n / log log n on the variable Ye / c(e) gives
Pr[ Ye / c(e) ≥ (2c log m / log log m) · max{1, λ*} ] ≤ 1/m^c
ILP_{MCR-Given-Network}:
  minimize    λ                                                                            ▷ (1)
  subject to  ∑_{e∈out(si)} f(e, i) − ∑_{e∈in(si)} f(e, i) = 1      ∀i ∈ [k]               ▷ (2)
              ∑_{e∈in(ti)} f(e, i) − ∑_{e∈out(ti)} f(e, i) = 1      ∀i ∈ [k]               ▷ (3)
              ∑_{e∈out(v)} f(e, 1) − ∑_{e∈in(v)} f(e, 1) = 0        ∀v ∈ V \ {s1, t1}      ▷ (4)
              . . .
              ∑_{e∈out(v)} f(e, k) − ∑_{e∈in(v)} f(e, k) = 0        ∀v ∈ V \ {sk, tk}      ▷ (4)
              ∑_{i=1}^{k} ∑_{p∈Pi, e∈p} di · yi,p ≤ λ · c(e)        ∀e ∈ E                 (As before)
              ∑_{p∈Pi} yi,p = 1                                     ∀i ∈ [k]               (As before)
min_{e∈pi} f(e, i) on the path as the selection probability (as per xe,i in the previous section). By selecting the path pi with probability min_{e∈pi} f(e, i), one can show by similar arguments as before that E(obj(y)) ≤ obj(x*) ≤ obj(y*).
Chapter 5
Probabilistic tree embedding

Trees are a special kind of graph without cycles, and some NP-hard problems are known to admit exact polynomial-time solutions on trees. Motivated by the existence of efficient algorithms on trees, one hopes to design the following framework for a general graph G = (V, E) with distance metric dG(u, v) between vertices u, v ∈ V:
1. Construct a tree T
2. Solve the problem on T efficiently
3. Map the solution back to G
4. Argue that the transformed solution from T is a good approximation
for the exact solution on G.
Ideally, we want to build a tree T such that dG (u, v) ≤ dT (u, v) and
dT (u, v) ≤ c · dG (u, v), where c is the stretch of the tree embedding. Unfortu-
nately, such a construction is hopeless1 . Instead, we consider a probabilistic
tree embedding of G into a collection of trees T such that
• (Over-estimates cost): ∀u, v ∈ V , ∀T ∈ T , dG (u, v) ≤ dT (u, v)
• (Over-estimate by not too much): ∀u, v ∈ V , ET ∈T [dT (u, v)] ≤ c · dG (u, v)
P
• (T is a probability space): T ∈T Pr[T ] = 1
Bartal [Bar96] gave a construction² for probabilistic tree embedding with poly-logarithmic stretch factor c, and proved³ that a stretch factor c ∈ Ω(log n) is necessary.
¹ For a cycle G with n vertices, the excluded edge in a constructed tree will cause the stretch factor c ≥ n − 1.
² Theorem 8 in [Bar96]
³ Theorem 9 in [Bar96]
(B) ∀u, v ∈ V, Pr[u and v not in same partition] ≤ α · dG(u, v) / D, for some α
Using ball carving, ConstructT recursively partitions the vertices of a
given graph until there is only one vertex remaining. At each step, the upper
bound D indicates the maximum distance between the vertices of C. The
first call of ConstructT starts with C = V and D = diam(V ). Figure 5.1
illustrates the process of building a tree T from a given graph G.
Lemma 5.2. For any two vertices u, v ∈ V and i ∈ N, if T separates u and v at level i, then 2D/2^i ≤ dT(u, v) ≤ 4D/2^i, where D = diam(V).
Proof. If T splits u and v at level i, then the path from u to v in T has to include two edges of length D/2^i, hence dT(u, v) ≥ 2D/2^i. To be precise,
2D/2^i ≤ dT(u, v) = 2 · (D/2^i + D/2^{i+1} + · · · ) ≤ 4D/2^i
See picture: r is the auxiliary node at level i which splits nodes u and v.
(Figure: the tree paths from u ∈ Vu and v ∈ Vv up to their common auxiliary ancestor r at level i, with edge weights D/2^i, D/2^{i+1}, . . . .)
Figure 5.1: Recursive ball carving with ⌈log₂(D)⌉ levels. Red vertices are auxiliary nodes that are not in the original graph G. Denoting the root as the 0th level, edges from level i to level i + 1 have weight D/2^i.
E[dT(u, v)] = ∑_{i=0}^{log(D)−1} Pr[Ei] · [dT(u, v), given Ei]          (Definition of expectation)
            ≤ ∑_{i=0}^{log(D)−1} Pr[Ei] · 4D/2^i                        (By Lemma 5.2)
            ≤ ∑_{i=0}^{log(D)−1} (α · dG(u, v) / (D/2^i)) · 4D/2^i      (Property (B) of ball carving)
            = 4α log(D) · dG(u, v)                                      (Simplifying)
5.1. A TIGHT PROBABILISTIC TREE EMBEDDING CONSTRUCTION51
• If B(u, r) ⊆ B(vj , r), then vertices in B(u, r) would have been removed
before vi is considered.
In any case, if there is a 1 ≤ j < i such that π(vj ) < π(vi ), then Vi does not
cut B(u, r). So,
Pr[B(u, r) is cut]
= Pr[ ⋃_{i=1}^{n} Event that Vi first cuts B(u, r) ]
≤ ∑_{i=1}^{n} Pr[Vi first cuts B(u, r)]                                  (Union bound)
= ∑_{i=1}^{n} Pr[π(vi) = min_{j∈[i]} π(vj)] · Pr[Vi cuts B(u, r)]        (Require vi to appear first)
= ∑_{i=1}^{n} (1/i) · Pr[Vi cuts B(u, r)]                                (By random permutation π)
≤ ∑_{i=1}^{n} (1/i) · 2r / (D/8)                                         (diam(B(u, r)) ≤ 2r, θ ∈ [D/8, D/4])
= 16 · (r/D) · Hn                                                        (Hn = ∑_{i=1}^{n} 1/i)
∈ O(log(n)) · r/D
In the last inequality: for Vi to cut B(u, r), we need θ ∈ (dG(u, vi) − r, dG(u, vi) + r), hence the numerator of ≤ 2r; the denominator D/8 is because the range of values that θ is sampled from is D/4 − D/8 = D/8.
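A minimal Python sketch of one level of the ball-carving partition described above, assuming the metric dist and the diameter bound D are given; the exact procedure in ConstructT may differ in small details.

```python
import random

def ball_carving(vertices, dist, D):
    """One level of ball carving: pick theta uniformly in [D/8, D/4] and a random
    permutation of the vertices; each remaining vertex joins the cluster of the first
    center (in permutation order) within distance theta."""
    theta = random.uniform(D / 8, D / 4)
    order = list(vertices)
    random.shuffle(order)                     # random permutation pi
    remaining, clusters = set(vertices), []
    for center in order:
        cluster = {v for v in remaining if dist(center, v) <= theta}
        if cluster:
            clusters.append(cluster)
            remaining -= cluster
    return clusters
```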
∀u, v ∈ V, Pr[u and v not in same partition] ≤ α · dG(u, v) / D
Proof. Let r = dG(u, v); then v is on the boundary of B(u, r).
To remove the log(D) factor, so that stretch factor c = O(log n), a tighter
analysis is needed by only considering vertices that may cut B(u, dG (u, v))
instead of all n vertices. For details, see Theorem 5.16 in Section 5.3.
5.1.3 Contraction of T
Notice in Figure 5.1 that we introduce auxiliary vertices in our tree construction, and we may wonder whether we can build a T without additional vertices (i.e. V(T) = V(G)). In this section, we look at Contract, which performs tree contractions to remove the auxiliary vertices. It remains to show that the produced tree still preserves the desirable properties of a tree embedding.
Algorithm 16 Contract(T)
  while T has an edge (u, w) such that u ∈ V and w is an auxiliary node do
    Contract edge (u, w) by merging the subtree rooted at u into w
    Identify the new node as u
  end while
  Multiply the weight of every edge by 4
  return Modified T'
Since we do not contract actual vertices, at least one of the (u, w) or (v, w) edges of weight D/2^i will remain. Multiplying the weights of all remaining edges by 4, we get dT(u, v) ≤ 4 · D/2^i ≤ dT'(u, v).
Suppose we only multiplied the weights by 4 without contracting; then dT'(u, v) = 4 · dT(u, v). Since we contract edges, dT'(u, v) can only decrease, so dT'(u, v) ≤ 4 · dT(u, v).
Remark Claim 5.9 tells us that one can construct a tree T' without auxiliary vertices by incurring an additional constant-factor overhead.
Claim 5.11. |AT (I)| using edges in G ≤ |AT (I)| using edges in T .
Lemma 5.14. For i ∈ N, if r > D/16, then Pr[B(u, r) is cut] ≤ 16r/D.
Proof. If r > D/16, then 16r/D > 1. As Pr[B(u, r) is cut at level i] is a probability ≤ 1, the claim holds.
Remark Although lemma 5.14 is not a very useful inequality per se (since
any probability ≤ 1), we use it to partition the value range of r so that we
can say something stronger in the next lemma.
Lemma 5.15. For i ∈ N, if r ≤ D/16, then
Pr[B(u, r) is cut] ≤ O(log(|B(u, D/2)| / |B(u, D/16)|)) · r/D
Proof. Since B(vi, θ) cuts B(u, r) only if D/8 − r ≤ dG(u, vi) ≤ D/4 + r, we have dG(u, vi) ∈ [D/16, 5D/16] ⊆ [D/16, D/2].
(Figure: the vertices v1, . . . , vk ordered by their distance from u; only vj, vj+1, . . . , vk lie at distance between D/16 and D/2.)
We see that only vertices vj, vj+1, . . . , vk have distances from u in the range [D/16, D/2]. Pictorially, only vertices in the shaded region could possibly cut B(u, r). As before, let π(v) be the position in which vertex v appears in the random permutation π. Then,
Pr[B(u, r) is cut]
= Pr[ ⋃_{i=j}^{k} Event that B(vi, θ) cuts B(u, r) ]                      (Only vj, vj+1, . . . , vk can cut)
≤ ∑_{i=j}^{k} Pr[π(vi) < min_{z∈[i−1]}{π(vz)}] · Pr[vi cuts B(u, r)]      (Union bound)
= ∑_{i=j}^{k} (1/i) · Pr[B(vi, θ) cuts B(u, r)]                           (By random permutation π)
≤ ∑_{i=j}^{k} (1/i) · 2r / (D/8)                                          (diam(B(u, r)) ≤ 2r, θ ∈ [D/8, D/4])
= 16 · (r/D) · (Hk − Hj)                                                  (where Hk = ∑_{i=1}^{k} 1/i)
∈ O(log(|B(u, D/2)| / |B(u, D/16)|)) · r/D                                (since Hk ∈ Θ(log(k)))
Proof. As before, let Ei be the event that “vertices u and v get separated at
the ith level”. For Ei to happen, the ball B(u, r) = B(u, dG (u, v)) must be
cut at level i, so Pr[Ei ] ≤ Pr[B(u, r) is cut at level i].
Chapter 6
Warm up
Thus far, we have been ensuring that our algorithms run fast. What if our
system does not have sufficient memory to store all data to post-process it?
For example, a router has relatively small amount of memory while tremen-
dous amount of routing data flows through it. In a memory constrained set-
ting, can one compute something meaningful, possibly approximately, with
limited amount of memory?
More formally, we now look at a slightly different class of algorithms
where data elements from [n] = {1, . . . , n} arrive one at a time, in a stream
S = a1 , . . . , am , where ai ∈ [n] arrives in the ith time step. At each step, our
algorithm performs some computation1 and discards the item ai . At the end
of the stream2 , the algorithm should give us a value that approximates some
value of interest.
probability p > 0.5, the probability that more than half of them fail (and hence the median fails) drops exponentially with respect to k.
Let ε > 0 and δ > 0 denote the precision factor and failure probability respectively. Robust combines the above-mentioned two tricks to yield a (1 ± ε)-approximation to X that succeeds with probability > 1 − δ.
Algorithm 18 Robust(A, I, ε, δ)
  C ← ∅                                           ▷ Initialize candidate outputs
  for k = O(log(1/δ)) times do
    sum ← 0
    for j = O(1/ε²) times do
      sum ← sum + A(I)
    end for
    Add sum / j to candidates C                    ▷ Include new sample of mean
  end for
  return Median of C                               ▷ Return median
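A minimal Python sketch of this median-of-means combination; the exact constants inside the O(·) are illustrative assumptions.

```python
import math
import random
import statistics

def robust(estimator, eps, delta):
    """Average O(1/eps^2) runs of the estimator to reduce variance, then take the
    median of O(log(1/delta)) such averages to boost the success probability."""
    k = max(1, math.ceil(2 * math.log(1 / delta)))   # number of candidate means
    j = max(1, math.ceil(4 / eps ** 2))              # samples per mean
    candidates = [sum(estimator() for _ in range(j)) / j for _ in range(k)]
    return statistics.median(candidates)

# Example: a noisy unbiased estimator of 42 (hypothetical).
# print(robust(lambda: 42 + random.gauss(0, 10), eps=0.05, delta=0.01))
```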
Stream elements 1 3 3 7 5 3 2 3
Guess 1 3 3 3 5 3 2 3
Count 1 1 2 1 1 1 1 1
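The table above traces a guess-and-count (Boyer-Moore style) majority procedure. Here is a minimal Python sketch that reproduces exactly that trace; the tie-breaking convention (replace the guess as soon as the counter hits zero) is chosen to match the table.

```python
def majority_vote(stream):
    """Keep one guess and a counter: matches increment the counter, mismatches
    decrement it, and when it reaches zero the current element becomes the guess."""
    guess, count = None, 0
    for x in stream:
        if x == guess:
            count += 1
        else:
            count -= 1
            if count <= 0:
                guess, count = x, 1
    return guess   # equals the majority element, if one exists

print(majority_vote([1, 3, 3, 7, 5, 3, 2, 3]))  # 3
```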
E[2^{X_{m+1}}] = ∑_{j=1}^{m} E[2^{X_{m+1}} | Xm = j] · Pr[Xm = j]                  (Condition on Xm)
              = ∑_{j=1}^{m} (2^{j+1} · 2^{−j} + 2^{j} · (1 − 2^{−j})) · Pr[Xm = j]  (Increment X w.p. 2^{−j})
              = ∑_{j=1}^{m} (2^{j} + 1) · Pr[Xm = j]                                (Simplifying)
              = ∑_{j=1}^{m} 2^{j} · Pr[Xm = j] + ∑_{j=1}^{m} Pr[Xm = j]             (Splitting the sum)
              = E[2^{Xm}] + ∑_{j=1}^{m} Pr[Xm = j]                                  (Definition of E[2^{Xm}])
              = E[2^{Xm}] + 1                                                        (∑_{j=1}^{m} Pr[Xm = j] = 1)
              = (m + 1) + 1                                                          (Induction hypothesis)
              = m + 2
Proof. Exercise.
Claim 7.3. E[(2^{Xm} − 1 − m)²] ≤ m²/2
Proof.
Pr[ |(2^{Xm} − 1) − m| > εm ] = Pr[ ((2^{Xm} − 1) − m)² > (εm)² ]      (Square both sides)
                              ≤ E[((2^{Xm} − 1) − m)²] / (εm)²         (Markov's inequality)
                              ≤ (m²/2) / (ε²m²)                        (By Claim 7.3)
                              = 1/(2ε²)
Remark Using the discussion in Section 6.1, we can run Morris multiple times to obtain a (1 ± ε)-approximation of the first moment of a stream that succeeds with probability > 1 − δ. For instance, repeating Morris 10/ε² times and reporting the mean m̂, Pr[ |m̂ − m| > εm ] ≤ 1/20.
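A minimal Python sketch of the Morris counter; averaging several independent copies, as in Trick 1, is shown in the comment.

```python
import random

def morris_counter(stream_length):
    """Morris approximate counter: keep X (roughly log2 of the count); on each item,
    increment X with probability 2^(-X). The estimate 2^X - 1 is unbiased."""
    X = 0
    for _ in range(stream_length):
        if random.random() < 2.0 ** (-X):
            X += 1
    return 2 ** X - 1   # estimate of the stream length m

# Averaging independent copies reduces the variance (Trick 1):
# est = sum(morris_counter(10000) for _ in range(1000)) / 1000
```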
Since we are randomly hashing elements into the range [0, 1], we expect the minimum hash output to be 1/(D + 1), so E[1/z − 1] = D. Unfortunately, storing a uniformly random hash function that maps to the interval [0, 1] is infeasible. As storing real numbers is memory intensive, one possible fix is to discretize the interval [0, 1], using O(log n) bits per hash output. However, storing this hash function would still require O(n log n) space.
¹ See https://en.wikipedia.org/wiki/Order_statistic
Proof.
Var(X) = ∑_{i=1}^{n} Var(Xi)                     (The Xi's are pairwise independent)
       = ∑_{i=1}^{n} (E[Xi²] − (E[Xi])²)          (Definition of variance)
       ≤ ∑_{i=1}^{n} E[Xi²]                       (Ignore negative part)
       = ∑_{i=1}^{n} E[Xi]                        (Xi² = Xi since Xi's are indicator random variables)
       = E[∑_{i=1}^{n} Xi]                        (Linearity of expectation)
       = E[X]                                     (Definition of X)
and Xr = ∑_{i=1}^{m} Xi,r = |{ai ∈ S : zeros(h(ai)) ≥ r}|. Notice that Xn ≤ Xn−1 ≤ · · · ≤ X1 since zeros(h(ai)) ≥ r + 1 ⇒ zeros(h(ai)) ≥ r. Now,
E[Xr] = E[∑_{i=1}^{m} Xi,r]              (Since Xr = ∑_{i=1}^{m} Xi,r)
      = ∑_{i=1}^{m} E[Xi,r]              (By linearity of expectation)
      = ∑_{i=1}^{m} Pr[Xi,r = 1]         (Since Xi,r are indicator variables)
      = ∑_{i=1}^{m} 1/2^r                (h is a uniform hash)
      = D/2^r                            (Since h hashes same elements to the same value)
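A minimal Python sketch of the basic FM estimator 2^Z · √2 discussed next; the concrete hash family below is an illustrative assumption, not the one fixed in the notes.

```python
import random

def trailing_zeros(x):
    """zeros(x): number of trailing zero bits of x (0 is treated as having none here)."""
    z = 0
    while x > 0 and x % 2 == 0:
        x //= 2
        z += 1
    return z

def fm_estimate(stream):
    """Hash every element with one random hash, keep Z = max zeros(h(a)), and output
    2^Z * sqrt(2) as a constant-factor estimate of the number of distinct elements D."""
    p = 2 ** 61 - 1                           # a large prime (assumed larger than n)
    a, b = random.randrange(1, p), random.randrange(p)
    Z = 0
    for x in stream:
        Z = max(Z, trailing_zeros((a * x + b) % p))
    return (2 ** Z) * (2 ** 0.5)

# print(fm_estimate([1, 3, 3, 7, 5, 3, 2, 3]))   # D = 5 distinct elements here
```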
Denote τ1 as the smallest integer such that 2^{τ1} · √2 > 3D, and τ2 as the largest integer such that 2^{τ2} · √2 < D/3. We see that if τ2 < Z < τ1, then 2^Z · √2 is a 3-approximation of D.
(Figure: the values r = 0, . . . , τ2, τ2 + 1, . . . , τ1, . . . , log2(D/√2) on a number line.)
• If Z ≥ τ1, then 2^Z · √2 ≥ 2^{τ1} · √2 > 3D
• If Z ≤ τ2, then 2^Z · √2 ≤ 2^{τ2} · √2 < D/3
Pr[ (D/3 > 2^Z · √2) or (2^Z · √2 > 3D) ]
≤ Pr[D/3 ≥ 2^Z · √2] + Pr[2^Z · √2 ≥ 3D]      (By union bound)
≤ 2√2/3                                        (From above)
= 1 − C                                        (For C = 1 − 2√2/3 > 0)
probability C > 0.5, one can then call the routine k times independently and return the median (recall Trick 2).
While Tricks 1 and 2 allow us to strengthen the success probability C, more work needs to be done to improve the approximation factor from 3 to (1 + ε). To do this, we look at a slight modification of FM, due to [BYJK+02].
If tN/Z > (1 + ε)D, then tN / ((1 + ε)D) > Z = the t-th smallest hash value, implying that there are ≥ t hashes smaller than tN / ((1 + ε)D). Since the hash uniformly distributes [n] over [N], for each element ai,
Pr[h(ai) ≤ tN / ((1 + ε)D)] = (tN / ((1 + ε)D)) / N = t / ((1 + ε)D)
Pr[tN/Z > (1 + ε)D] ≤ Pr[X ≥ t]                          (Since the former implies the latter)
                    = Pr[X − E[X] ≥ t − E[X]]             (Subtracting E[X] from both sides)
                    ≤ Pr[X − E[X] ≥ (ε/2)·t]              (Since E[X] = t/(1 + ε) ≤ (1 − ε/2)·t)
                    ≤ Pr[|X − E[X]| ≥ (ε/2)·t]            (Adding absolute sign)
                    ≤ Var(X) / (εt/2)²                    (By Chebyshev's inequality)
                    ≤ E[X] / (εt/2)²                      (Since Var(X) ≤ E[X])
                    ≤ 4(1 − ε/2)t / (ε²t²)                (Since E[X] = t/(1 + ε) ≤ (1 − ε/2)·t)
                    ≤ 4/c                                 (Simplifying with t = c/ε² and (1 − ε/2) < 1)
Similarly, if tN/Z < (1 − ε)D, then tN / ((1 − ε)D) < Z = the t-th smallest hash value, implying that there are < t hashes smaller than tN / ((1 − ε)D). Since the hash uniformly distributes [n] over [N], for each element ai,
Pr[h(ai) ≤ tN / ((1 − ε)D)] = (tN / ((1 − ε)D)) / N = t / ((1 − ε)D)
and Y = ∑_{i=1}^{D} Yi is the number of hashes that are smaller than tN / ((1 − ε)D). From above, Pr[Yi = 1] = t / ((1 − ε)D). By linearity of expectation, E[Y] = t/(1 − ε). Then, by Lemma 7.7, Var(Y) ≤ E[Y]. Now,
Pr[tN/Z < (1 − ε)D]
≤ Pr[Y ≤ t]                                     (Since the former implies the latter)
= Pr[Y − E[Y] ≤ t − E[Y]]                        (Subtracting E[Y] from both sides)
≤ Pr[Y − E[Y] ≤ −εt]                             (Since E[Y] = t/(1 − ε) ≥ (1 + ε)t)
≤ Pr[−(Y − E[Y]) ≥ εt]                           (Swap sides)
≤ Pr[|Y − E[Y]| ≥ εt]                            (Adding absolute sign)
≤ Var(Y) / (εt)²                                 (By Chebyshev's inequality)
≤ E[Y] / (εt)²                                   (Since Var(Y) ≤ E[Y])
≤ (1 + 2ε)t / (ε²t²)                             (Since E[Y] = t/(1 − ε) ≤ (1 + 2ε)t)
≤ 3/c                                            (Simplifying with t = c/ε² and (1 + 2ε) < 3)
Putting together,
Pr[ |tN/Z − D| > εD ] ≤ Pr[tN/Z > (1 + ε)D] + Pr[tN/Z < (1 − ε)D]     (By union bound)
                      ≤ 4/c + 3/c                                      (From above)
                      = 7/c                                            (Simplifying)
                      ≤ 1/4                                            (For c ≥ 28)
7.3.1 k=2
For each element i ∈ [n], we associate a random variable ri ∈u.a.r. {−1, +1}.
Lemma 7.10. In AMS-2, if the random variables {ri}_{i∈[n]} are pairwise independent, then E[Z²] = ∑_{i=1}^{n} fi² = F2. That is, AMS-2 is an unbiased estimator for the 2nd moment.
Proof.
E[Z²] = E[(∑_{i=1}^{n} ri·fi)²]                                        (Since Z = ∑_{i=1}^{n} ri·fi at the end)
      = E[∑_{i=1}^{n} ri²fi² + 2·∑_{1≤i<j≤n} ri·rj·fi·fj]              (Expanding (∑ ri·fi)²)
      = ∑_{i=1}^{n} E[ri²fi²] + 2·∑_{1≤i<j≤n} E[ri·rj·fi·fj]           (Linearity of expectation)
      = ∑_{i=1}^{n} E[ri²]·fi² + 2·∑_{1≤i<j≤n} E[ri·rj]·fi·fj          (fi's are (unknown) constants)
      = ∑_{i=1}^{n} fi² + 2·∑_{1≤i<j≤n} E[ri·rj]·fi·fj                 (Since ri² = 1, ∀i ∈ [n])
      = ∑_{i=1}^{n} fi² + 2·∑_{1≤i<j≤n} E[ri]·E[rj]·fi·fj              (Since {ri}_{i∈[n]} are pairwise independent)
      = ∑_{i=1}^{n} fi² + 2·∑_{1≤i<j≤n} 0·fi·fj                        (Since E[ri] = 0, ∀i ∈ [n])
      = ∑_{i=1}^{n} fi²                                                (Simplifying)
      = F2                                                             (Since F2 = ∑_{i=1}^{n} fi²)
Lemma 7.11. In AMS-2, if random variables {ri }i∈[n] are 4-wise indepen-
dent, then Var[Z 2 ] ≤ 2(E[Z 2 ])2 .
Proof. As before, E[ri] = 0 and E[ri²] = 1 for all i ∈ [n]. By 4-wise independence, the expectation of any product of ≤ 4 different ri's is the product of their expectations, which is zero. For instance, E[ri·rj·rk·rl] = E[ri]·E[rj]·E[rk]·E[rl] = 0. Note that 4-wise independence implies pairwise independence, ri² = ri⁴ = 1 and ri = ri³.
E[Z⁴] = E[(∑_{i=1}^{n} ri·fi)⁴]                                        (Since Z = ∑_{i=1}^{n} ri·fi at the end)
      = ∑_{i=1}^{n} E[ri⁴]·fi⁴ + 6·∑_{1≤i<j≤n} E[ri²rj²]·fi²fj²        (L.o.E. and 4-wise independence)
      = ∑_{i=1}^{n} fi⁴ + 6·∑_{1≤i<j≤n} fi²fj²                         (Since E[ri⁴] = E[ri²] = 1, ∀i ∈ [n])
The coefficient of ∑_{1≤i<j≤n} E[ri²rj²]·fi²fj² is (4 choose 2)·(2 choose 2) = 6. All other terms besides ∑_{i=1}^{n} E[ri⁴]·fi⁴ and 6·∑_{1≤i<j≤n} E[ri²rj²]·fi²fj² evaluate to 0 because of 4-wise independence.
Proof.
Pr[ |Z² − F2| > ε·F2 ] = Pr[ |Z² − E[Z²]| > ε·E[Z²] ]      (By Lemma 7.10)
                       ≤ Var(Z²) / (ε·E[Z²])²              (By Chebyshev's inequality)
                       ≤ 2(E[Z²])² / (ε·E[Z²])²            (By Lemma 7.11)
                       = 2/ε²
Claim 7.13. O(k log n) bits of randomness suffice to obtain a set of k-wise independent random variables.
Proof. Recall the definition of the hash family Hn,m. In a similar fashion², we consider hashes from the family (for prime p):
{ h_{a_{k−1}, a_{k−2}, . . . , a_1, a_0} : h(x) = ∑_{i=0}^{k−1} ai·x^i mod p = a_{k−1}x^{k−1} + a_{k−2}x^{k−2} + · · · + a_1x + a_0 mod p, ∀a_{k−1}, a_{k−2}, . . . , a_1, a_0 ∈ Zp }
This requires k random coefficients, which can be stored with O(k log n) bits.
Observe that the above analysis only requires {ri}_{i∈[n]} to be 4-wise independent. Claim 7.13 implies that AMS-2 only needs O(4 log n) bits to represent {ri}_{i∈[n]}.
Although the failure probability 2/ε² is large for small ε, one can repeat t times and output the mean (recall Trick 1). With t ∈ O(1/ε²) samples, the failure probability drops to 2/(tε²) ∈ O(1). When the failure probability is < 0.5, one can then call the routine k times independently, and return the median (recall Trick 2). On the whole, for any given ε > 0 and δ > 0, O(log(n)·log(1/δ)/ε²) space suffices to yield a (1 ± ε)-approximation algorithm that succeeds with probability > 1 − δ.
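A minimal Python sketch of the AMS estimator for F2. For simplicity the signs below are fully random rather than 4-wise independent, and the averaging over trials stands in for Trick 1; both simplifications are assumptions made for illustration.

```python
import random

def ams_f2_estimate(stream, n, trials=50):
    """Each trial draws random signs r_i in {-1, +1}, maintains Z = sum_i r_i f_i over
    the stream, and outputs Z^2; the mean over trials estimates F2 = sum_i f_i^2."""
    estimates = []
    for _ in range(trials):
        r = [random.choice((-1, 1)) for _ in range(n + 1)]   # signs for elements 1..n
        Z = 0
        for a in stream:
            Z += r[a]                  # streaming update on each arrival of element a
        estimates.append(Z * Z)
    return sum(estimates) / trials

# Stream with f_1 = 3, f_3 = 2, f_7 = 1: F2 = 9 + 4 + 1 = 14.
# print(ams_f2_estimate([1, 3, 3, 7, 1, 1], n=8))
```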
7.3.2 General k
Remark At the end of AMS-k, r = |{i ∈ [m] : i ≥ J and ai = aJ }| will
be the number of occurrences of aJ in suffix of the stream.
² See https://en.wikipedia.org/wiki/K-independent_hashing
E[Z | aJ = i]
= (1/fi)·[m(fi^k − (fi − 1)^k)] + (1/fi)·[m((fi − 1)^k − (fi − 2)^k)] + · · · + (1/fi)·[m(1^k − 0^k)]
= (m/fi)·[(fi^k − (fi − 1)^k) + ((fi − 1)^k − (fi − 2)^k) + · · · + (1^k − 0^k)]
= (m/fi)·fi^k
³ See https://en.wikipedia.org/wiki/Reservoir_sampling
E[Z] = ∑_{i=1}^{n} E[Z | aJ = i] · Pr[aJ = i]        (Condition on the choice of J)
     = ∑_{i=1}^{n} E[Z | aJ = i] · fi/m               (Since the choice of J is uniform at random)
     = ∑_{i=1}^{n} (m/fi)·fi^k · fi/m                 (From above)
     = ∑_{i=1}^{n} fi^k                               (Simplifying)
     = Fk                                             (Since Fk = ∑_{i=1}^{n} fi^k)
(∑_{i=1}^{n} fi)(∑_{i=1}^{n} fi^{2k−1}) ≤ n^{1−1/k} · (∑_{i=1}^{n} fi^k)²
Proof. Let M = max_{i∈[n]} fi; then fi ≤ M for any i ∈ [n] and M^k ≤ ∑_{i=1}^{n} fi^k. Hence,
(∑_{i=1}^{n} fi)(∑_{i=1}^{n} fi^{2k−1}) ≤ (∑_{i=1}^{n} fi)(M^{k−1} · ∑_{i=1}^{n} fi^k)                      (Pulling out an M^{k−1} factor)
  ≤ (∑_{i=1}^{n} fi)(∑_{i=1}^{n} fi^k)^{(k−1)/k}(∑_{i=1}^{n} fi^k)                                          (Since M^k ≤ ∑_{i=1}^{n} fi^k)
  = (∑_{i=1}^{n} fi)(∑_{i=1}^{n} fi^k)^{(2k−1)/k}                                                           (Merging the last two terms)
  ≤ n^{1−1/k}(∑_{i=1}^{n} fi^k)^{1/k}(∑_{i=1}^{n} fi^k)^{(2k−1)/k}                                          (Fact: (∑ fi)/n ≤ (∑ fi^k / n)^{1/k})
  = n^{1−1/k}(∑_{i=1}^{n} fi^k)²                                                                            (Merging the last two terms)
E[Z²] = m·[ (1^k − 0^k)² + (2^k − 1^k)² + · · · + (f1^k − (f1 − 1)^k)²                                          (1)
          + (1^k − 0^k)² + (2^k − 1^k)² + · · · + (f2^k − (f2 − 1)^k)²
          + . . .
          + (1^k − 0^k)² + (2^k − 1^k)² + · · · + (fn^k − (fn − 1)^k)² ]
      ≤ m·[ k·1^{k−1}·(1^k − 0^k) + k·2^{k−1}·(2^k − 1^k) + · · · + k·f1^{k−1}·(f1^k − (f1 − 1)^k)              (2)
          + k·1^{k−1}·(1^k − 0^k) + k·2^{k−1}·(2^k − 1^k) + · · · + k·f2^{k−1}·(f2^k − (f2 − 1)^k)
          + . . .
          + k·1^{k−1}·(1^k − 0^k) + k·2^{k−1}·(2^k − 1^k) + · · · + k·fn^{k−1}·(fn^k − (fn − 1)^k) ]
      ≤ m·[ k·f1^{2k−1} + k·f2^{2k−1} + · · · + k·fn^{2k−1} ]                                                   (3)
      = k · m · F_{2k−1}                                                                                        (4)
      = k · F1 · F_{2k−1}                                                                                       (5)
(1) By definition of E[Z²] (condition on J and expand in the same style as the proof of Theorem 7.14).
(2) For all 0 < b < a with a = b + 1, a^k − b^k = (a − b)(a^{k−1} + a^{k−2}b + · · · + ab^{k−2} + b^{k−1}) ≤ (a − b)·k·a^{k−1}
(5) F1 = ∑_{i=1}^{n} fi = m
Then,
Remark Proofs for Lemma 7.15 and Theorem 7.16 were omitted in class.
The above proofs are presented in a style consistent with the rest of the scribe
notes. Interested readers can refer to [AMS96] for details.
Remark One can apply an analysis similar to the case when k = 2, then
use Tricks 1 and 2.
Claim 7.17. For k > 2, a lower bound of Θ̃(n^{1−2/k}) is known.
Proof. Theorem 3.1 in [BYJKS04] gives the lower bound. See [IW05] for an algorithm that achieves it.
Chapter 8
Graph sketching
Let m be the total number of distinct edges in the stream. There are two
ways to represent connected components on a graph:
1. Every vertex stores a label where vertices in the same connected component have the same label
• Vertices send the coordinator their sums, and the coordinator computes XOR_A = ⊕_{u∈A} XOR_u
S = ⟨{v1, v2}, +⟩, ⟨{v2, v3}, +⟩, ⟨{v1, v3}, +⟩, ⟨{v4, v5}, +⟩, ⟨{v2, v5}, +⟩, ⟨{v1, v2}, −⟩
and we query for the cut edge {v2, v5} with A = {v1, v2, v3} at t = |S|. The figure below shows the graph G6 when t = 6:
³ In reality, the algorithm simulates all the vertices' actions so it is not a real multi-party computation setup.
(Figure: the graph G6 on vertices v1, . . . , v5 after the six updates.)
Vertex v1 sees ⟨{v1, v2}, +⟩, ⟨{v1, v3}, +⟩, and ⟨{v1, v2}, −⟩. So, XOR_{v1} = id({v1, v2}) ⊕ id({v1, v3}) ⊕ id({v1, v2}) = id({v1, v3}).
Remark Bit tricks are often used in the random linear network coding
literature (e.g. [HMK+ 06]).
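A minimal Python sketch of the single-cut XOR trick on the example stream above; the toy edge-ID scheme (concatenating endpoint IDs) is an assumption for illustration.

```python
from functools import reduce

def edge_id(u, v, bits=3):
    """Toy edge ID: concatenate the two endpoint IDs (smaller first) as bit fields."""
    u, v = min(u, v), max(u, v)
    return (u << bits) | v

def xor_sketch_cut_edge(stream, A):
    """Each vertex keeps the XOR of the IDs of its incident edge updates; XORing over
    the query set A cancels every edge with both endpoints in A, leaving the cut edge."""
    xor_at = {}
    for (u, v), _sign in stream:                 # insertions and deletions are both XORed in
        for w in (u, v):
            xor_at[w] = xor_at.get(w, 0) ^ edge_id(u, v)
    return reduce(lambda a, b: a ^ b, (xor_at.get(w, 0) for w in A), 0)

stream = [((1, 2), '+'), ((2, 3), '+'), ((1, 3), '+'), ((4, 5), '+'), ((2, 5), '+'), ((1, 2), '-')]
print(xor_sketch_cut_edge(stream, A={1, 2, 3}) == edge_id(2, 5))  # True: the cut edge {v2, v5}
```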
Pr[|E'| = 1]
= k · Pr[Cut edge {u, v} is marked; others are not]
= k · (1/k̂) · (1 − 1/k̂)^{k−1}                    (Edges marked independently w.p. 1/k̂)
≥ (k̂/2) · (1/k̂) · (1 − 1/k̂)^{k}                  (Since k̂/2 ≤ k ≤ k̂)
≥ (1/2) · 4^{−1}                                  (Since 1 − x ≥ 4^{−x} for x ≤ 1/2)
≥ 1/10
Remark The above analysis assumes that vertices can locally mark the
edges in a consistent manner (i.e. both endpoints of any edge make the
same decision whether to mark the edge or not). This can be done with a
sufficiently large string of shared randomness. We discuss this in Section 8.3.
From above, we know that Pr[|E'| = 1] ≥ 1/10. If |E'| = 1, we can re-use the idea from Section 8.1. However, if |E'| ≠ 1, then XOR_A may correspond erroneously to another edge in the graph. In the above example, id({v1, v2}) ⊕ id({v2, v4}) = 000001 ⊕ 001011 = 001010 = id({v2, v3}).
To fix this, we use random bits as edge IDs instead of simply concatenating vertex IDs: randomly assign (in a consistent manner) each edge a random ID of k = 20 log n bits. Since the XOR of random bits is random, for any edge e, Pr[XOR_A = id(e) | |E'| ≠ 1] = (1/2)^k = (1/2)^{20 log n}. Hence,
• If k̂ ≫ k, the marking probability will select nothing (in expectation).
• If k̂ ≪ k, more than one edge will get marked, which we will then detect (and ignore) since XOR_A will likely not be a valid edge ID.
Remark One can drop the memory constraint per vertex from O(log4 n)
to O(log3 n) by using a constant t instead of t ∈ O(log n) such that the
success probability is a constant larger than 1/2. Then, simulate Borůvka
for d2 log ne steps. See [AGM12] (Note that they use a slightly different
sketch).
⁵ See https://en.wikipedia.org/wiki/Small-bias_sample_space
Part III
Graph sparsification
Chapter 9
Preserving distances
Remark The first inequality is because G' has fewer edges than G. The second inequality upper bounds how much the distances "blow up" in the sparser graph G'.
Remark One way to prove the existence of an (α, β)-spanner is to use the probabilistic method: instead of giving an explicit construction, one designs a random process and argues that the probability that the spanner exists is strictly larger than 0. However, this may be somewhat unsatisfying as such proofs do not usually yield a usable construction. On the other hand, the randomized constructions shown later are explicit and will yield a spanner with high probability¹.
• g(G'') ≥ g(G') ≥ 2k + 1, since girth does not decrease with fewer edges.
n ≥ |V''|                                                            (By construction)
  ≥ |{v}| + ∑_{i=1}^{k} |{u ∈ V'' : dG''(u, v) = i}|                 (Look only at the k-hop neighbourhood of v)
  ≥ 1 + ∑_{i=1}^{k} (n^{1/k} + 1)(n^{1/k})^{i−1}                     (Vertices distinct and have deg ≥ n^{1/k} + 1)
  = 1 + (n^{1/k} + 1) · ((n^{1/k})^k − 1) / (n^{1/k} − 1)            (Sum of geometric series)
  > 1 + (n − 1)                                                      (Since (n^{1/k} + 1) > (n^{1/k} − 1))
  = n
Let us consider the family of graphs G on n vertices with girth > 2k. It can
be shown by contradiction that a graph G with n vertices with girth > 2k can-
not have a proper (2k − 1)-spanner2 : Assume G0 is a proper (2k − 1)-spanner
with edge {u, v} removed. Since G0 is a (2k − 1)-spanner, dG0 (u, v) ≤ 2k − 1.
Adding {u, v} to G0 will form a cycle of length at most 2k, contradicting the
assumption that G has girth > 2k.
Let g(n, k) be the maximum possible number of edges in a graph from G.
By the above argument, a graph on n vertices with g(n, k) edges cannot have
a proper (2k − 1)-spanner. Note that the greedy construction of Theorem ??
will always produce a (2k − 1)-spanner with ≤ g(n, k) edges. The size of the
spanner is asymptotically tight if Conjecture ?? holds.
1. E[|S|] = np
2. For any vertex v with degree d(v) and neighbourhood N(v) = {u ∈ V : (u, v) ∈ E}, Pr[|N(v) ∩ S| = 0] ≤ e^{−d(v)·p/2}
1.
E[|S|] = E[∑_{v∈V} Xv]               (By construction of S)
       = ∑_{v∈V} E[Xv]               (Linearity of expectation)
       = ∑_{v∈V} p                   (Since E[Xv] = Pr[Xv = 1] = p)
       = np                          (Since |V| = n)
2.
E[|N(v) ∩ S|] = E[∑_{u∈N(v)} Xu]     (By definition of N(v) ∩ S)
              = ∑_{u∈N(v)} E[Xu]     (Linearity of expectation)
              = ∑_{u∈N(v)} p         (Since E[Xu] = Pr[Xu = 1] = p)
              = d(v) · p
Then, by a Chernoff bound, Pr[|N(v) ∩ S| = 0] ≤ e^{−d(v)·p/2}.
Proof.
Construction Partition the vertex set V into light vertices L and heavy vertices H, where L = {v ∈ V : deg(v) ≤ n^{1/2}} and H = {v ∈ V : deg(v) > n^{1/2}}.
2. Initialize E2' = ∅.
E[|E'|] = E[|E1' ∪ E2'|] ≤ E[|E1'| + |E2'|] = E[|E1'|] + E[|E2'|] ≤ n^{3/2} + n · 10n^{1/2} log n ∈ Õ(n^{3/2})
Stretch factor Consider two arbitrary vertices u and v with the shortest
path Pu,v in G. Let h be the number of heavy vertices in Pu,v . We split the
analysis into two cases: (i) h ≤ 1; (ii) h ≥ 2. Recall that a heavy vertex has
degree at least n1/2 .
Case (i) All edges in Pu,v are adjacent to a light vertex and are thus in E10 .
Hence, dG0 (u, v) = dG (u, v), with additive stretch 0.
Case (ii)
Claim 9.6. Suppose there exists a vertex w ∈ Pu,v such that (w, s) ∈ E for some s ∈ S; then dG'(u, v) ≤ dG(u, v) + 2.
(Figure: the path Pu,v from u through w to v, with w adjacent to some s ∈ S.)
³ Though we may have repeated edges
Proof.
dG0 (u, v) ≤ dG0 (u, s) + dG0 (s, v) (1)
= dG (u, s) + dG (s, v) (2)
≤ dG (u, w) + dG (w, s) + dG (s, w) + dG (w, v) (3)
≤ dG (u, w) + 1 + 1 + dG (w, v) (4)
≤ dG (u, v) + 2 (5)
Let w be a heavy vertex in Pu,v with degree d(w) > n^{1/2}. By Claim 9.4 with p = 10n^{−1/2} log n, Pr[|N(w) ∩ S| = 0] ≤ e^{−10 log n / 2} = n^{−5}. Taking a union bound over all possible pairs of vertices u and v,
Pr[∃u, v ∈ V, Pu,v has no neighbour in S] ≤ (n choose 2) · n^{−5} ≤ n^{−3}
Then, Claim 9.6 tells us that the additive stretch factor is at most 2 with probability ≥ 1 − 1/n³.
Therefore, with high probability (≥ 1 − 1/n³), the construction yields a 2-additive spanner.
Remark A way to remove log factors from Theorem 9.5 is to sample only
n1/2 nodes into S, and then add all edges incident to nodes that don’t have an
adjacent node in S. The same argument then shows that this costs O(n3/2 )
edges in expectation.
Theorem 9.7. [Che13] For a fixed k ≥ 1, every graph G on n vertices has a 4-additive spanner with Õ(n^{7/5}) edges.
Proof.
Construction Partition vertex set V into light vertices L and heavy vertices
H, where
L = {v ∈ V : deg(v) ≤ n2/5 } and H = {v ∈ V : deg(v) > n2/5 }
2. Initialize E20 = ∅.
3. Initialize E30 = ∅.
Number of edges
• Since there are ≤ n heavy vertices, ≤ n edges of the form (v, s') for v ∈ H, s' ∈ S' will be added to E3'. Then, for shortest s − s' paths with ≤ n^{1/5} heavy internal vertices, only edges adjacent to the heavy vertices need to be counted because those adjacent to light vertices are already accounted for in E1'. By Claim 9.4 with p = 10n^{−2/5} log n, E[|S'|] = n · 10n^{−2/5} log n = 10n^{3/5} log n. So, E3' contributes ≤ n + (|S'| choose 2) · n^{1/5} ≤ n + (10n^{3/5} log n)² · n^{1/5} ∈ Õ(n^{7/5}) edges to the count of |E'|.
⁴ Though we may have repeated edges
Stretch factor Consider two arbitrary vertices u and v with the shortest
path Pu,v in G. Let h be the number of heavy vertices in Pu,v . We split the
analysis into three cases: (i) h ≤ 1; (ii) 2 ≤ h ≤ n1/5 ; (iii) h > n1/5 . Recall
that a heavy vertex has degree at least n2/5 .
Case (i) All edges in Pu,v are adjacent to a light vertex and are thus in E10 .
Hence, dG0 (u, v) = dG (u, v), with additive stretch 0.
Case (ii) Denote the first and last heavy vertices in Pu,v as w and w' respectively. Recall that in Case (ii), including w and w', there are at most n^{1/5} heavy vertices between w and w'. By Claim 9.4, with p = 10n^{−2/5} log n,
Pr[|N(w) ∩ S'| = 0] = Pr[|N(w') ∩ S'| = 0] ≤ e^{−n^{2/5} · 10n^{−2/5} log n / 2} = n^{−5}
• By definition of P*_{s,s'}, l* ≤ dG(s, w) + dG(w, w') + dG(w', s') = dG(w, w') + 2.
• Since there are no internal heavy vertices between u − w and w' − v, Case (i) tells us that dG'(u, w) = dG(u, w) and dG'(w', v) = dG(w', v).
Thus,
dG'(u, v) ≤ dG'(u, w) + dG'(w, w') + dG'(w', v)                                        (1)
          ≤ dG'(u, w) + dG'(w, s) + dG'(s, s') + dG'(s', w') + dG'(w', v)              (2)
          = dG'(u, w) + dG'(w, s) + l* + dG'(s', w') + dG'(w', v)                      (3)
          ≤ dG'(u, w) + dG'(w, s) + dG(w, w') + 2 + dG'(s', w') + dG'(w', v)           (4)
          = dG'(u, w) + 1 + dG(w, w') + 2 + 1 + dG'(w', v)                             (5)
          = dG(u, w) + 1 + dG(w, w') + 2 + 1 + dG(w', v)                               (6)
          ≤ dG(u, v) + 4                                                               (7)
(Figure: the path Pu,v with heavy vertices w and w' adjacent to s, s' ∈ S', joined by a path P*_{s,s'} of length l*.)
Case (iii)
Claim 9.8 tells us that |⋃_{w∈Heavy} N(w)| ≥ ∑_{w∈Heavy} |N(w)| · (1/3). Let Nu,v denote the union of the neighbourhoods of the heavy vertices on Pu,v.
Applying Claim 9.4 with p = 30 · n^{−3/5} · log n and Claim 9.8, we get
E[|Nu,v ∩ S|] ≥ n^{1/5} · n^{2/5} · (1/3) · 30 · n^{−3/5} · log n = 10 log n
and Pr[|Nu,v ∩ S| = 0] ≤ e^{−10 log n / 2} = n^{−5}
Taking a union bound over all possible pairs of vertices u and v,
Pr[∃u, v ∈ V, Pu,v has no neighbour in S] ≤ (n choose 2) · n^{−5} ≤ n^{−3}
Then, Claim 9.6 tells us that the additive stretch factor is at most 4 with probability ≥ 1 − 1/n³.
Therefore, with high probability (≥ 1 − 1/n³), the construction yields a 4-additive spanner.
Concluding remarks
the parts back into one. The distance error must be even over a bipartite
graph, and so the additive (2k + 1)-spanner construction must actually give
an additive 2k-spanner by showing that the error bound is preserved over
the “collapse”.
Chapter 10
Preserving cuts
Pr[Success] ≥ (1 − k/(nk/2)) · (1 − k/((n − 1)k/2)) · · · (1 − k/(3k/2))
            = (1 − 2/n) · (1 − 2/(n − 1)) · · · (1 − 2/3)
            = ((n − 2)/n) · ((n − 3)/(n − 1)) · · · (1/3)
            = 2 / (n(n − 1))
            = 1 / (n choose 2)
Corollary 10.3. There are ≤ 2
minimum cuts in a graph.
Proof. Since RandomContraction successfully produces any given
mini-
mum cut with probability at least 1/ n2 , there can be at most ≤ n2 many
minimum cuts.
In general, we can bound the number of cuts that are of size at most
α · µ(G) for α ≥ 1.
Proof. See Lemma 2.2 and Appendix A (in particular, Corollary A.7) of a
version2 of [Kar99].
1. Let p = Ω(log n / n)
3. Define w(e) = 1/p for each edge e ∈ E'
(Figure: a cut (S, V \ S) of size k'.)
Using Theorem 10.4 and union bound over all possible cuts in G,
Theorem 10.6. [Kar94] For a graph G, consider sampling every edge independently with probability pe into E', and assign weight 1/pe to each edge e ∈ E'. Let H = (V, E') be the sampled graph and suppose µ(H) ≥ c log n / ε², for some constant c. Then, with high probability, every weighted cut size in H is (well-estimated) within (1 ± ε) of the original cut size in G.
Theorem 10.6 can be proved by using a variant of the earlier proof. Interested readers can see Theorem 2.1 of [Kar94].
(Figure: a dumbbell graph, two copies of Kn joined by a single edge.)
Running the uniform edge sampling will not sparsify the above dumbbell graph, as µ(G) = 1 leads to a large sampling probability p.
Proof.
(Figure: nested k1-strong and k2-strong components of a graph.)
1. By definition of maximum
4. Consider a minimum cut CG(S, V \ S). Since ke ≥ µ(G) for all edges e ∈ CG(S, V \ S), these edges contribute ≤ µ(G) · (1/ke) ≤ µ(G) · (1/µ(G)) = 1 to the summation. Remove these edges from G and repeat the argument on any remaining connected components. Since each cut removal contributes at most 1 to the summation and the process stops when we reach n components, ∑_{e∈E} 1/ke ≤ n − 1.
For a graph G with minimum cut size µ(G) = k, consider the following procedure to construct H:
1. Set q = c log n / ε² for some constant c
2. Independently put each edge e ∈ E into E' with probability pe = q / ke
3. Define w(e) = 1/pe = ke/q for each edge e ∈ E'
E[|E'|] = E[∑_{e∈E} Xe]                 (By definition)
        = ∑_{e∈E} E[Xe]                 (Linearity of expectation)
        = ∑_{e∈E} pe                    (Since E[Xe] = Pr[Xe = 1] = pe)
        = ∑_{e∈E} q/ke                  (Since pe = q/ke)
        ≤ q(n − 1)                      (Since ∑_{e∈E} 1/ke ≤ n − 1)
        ∈ O(n log n / ε²)               (Since q = c log n / ε² for some constant c)
Remark One can apply Chernoff bounds to argue that |E 0 | is highly con-
centrated around its expectation.
Proof. Let k1 < k2 < · · · < ks be all possible strength values in the graph. Consider G as a weighted graph with edge weight q/ke for each edge e ∈ E, and a family of unweighted graphs F1, . . . , Fs where Fi = (V, Ei) and Ei = {e ∈ E : ke ≥ ki}. Observe that:
• F1 = G
Chapter 11
Warm up: Ski rental
We now study the class of online problems where one has to commit to
provably good decisions as data arrive in an online fashion. To measure the
effectiveness of online algorithms, we compare the quality of the produced
solution against the solution from an optimal offline algorithm that knows
the whole sequence of information a priori. The tool we will use for doing
such a comparison is competitive analysis.
Linear search
Free swap Move the queried paper from position i to the top of the stack
for 0 cost.
Paid swap For any consecutive pair of items (a, b) before i, swap their rel-
ative order to (b, a) for 1 cost.
What is the best online strategy for manipulating the stack to minimize total
cost on a sequence of queries?
Remark One can reason that the free swap costs 0 because we already
incurred a cost of i to reach it.
double or halve the hash table, incurring a runtime of O(m) time for doubling
or halving a hash table of size m.
Worst case analysis tells us that dynamic resizing will incur O(m) run
time per operation. However, resizing only occurs after O(m) insertion/dele-
tion operations, each costing O(1). Amortized analysis allows us to conclude
that this dynamic resizing runs in amortized O(1) time. There are two equiv-
alent ways to see it:
• Split the O(m) resizing overhead and “charge” O(1) to each of the
earlier O(m) operations.
• The total run time for every sequential chunk of m operations is O(m).
Hence, each step takes O(m)/m = O(1) amortized run time.
12.2 Move-to-Front
Move-to-Front (MTF) [ST85] is an online algorithm for the linear search
problem where we move the queried item to the top of the stack (and do no
other swaps). We will show that MTF is a 2-competitive algorithm for linear
search. Before we analyze MTF, let us first define a potential function Φ and
look at examples to gain some intuition.
Let Φt be the number of pairs of papers (i, j) that are ordered differently
in MTF’s stack and OPT’s stack at time step t. By definition, Φt ≥ 0 for
any t. We also know that Φ0 = 0 since MTF and OPT operate on the same
initial stack sequence.
1 2 3 4 5 6
MTF’s stack a b c d e f
OPT’s stack a e b d c f
Now, we have the inversions (b, e), (c, d), (c, e) and (d, e), so Φ = 4.
Scenario 2 We swap (e, d) in OPT’s stack — The inversion (d, e) was de-
stroyed due to the swap.
1 2 3 4 5 6
MTF’s stack a b c d e f
OPT’s stack a b d e c f
In either case, we see that any paid swap results in ±1 inversions, which
changes Φ by ±1.
Proof. We will consider the potential function Φ as before and perform amor-
tized analysis on any given input sequence σ. Let at = cM T F (t) + (Φt − Φt−1 )
be the amortized cost of MTF at time step t, where cM T F (t) is the cost by
MTF at time t. Suppose the queried item at time step t is at position k in
MTF. Denote:
(Figure: MTF's and OPT's stacks, with item x at position k in MTF's stack; the items above x split into F, the items also in front of x in OPT's stack, and B, the items behind x, with k ≥ |F| = f and k ≥ |B| = b.)
Since x is the k-th item, MTF will incur cM T F (t) = k to reach item x,
then move it to the top. On the other hand, OPT needs to spend at least
f + 1 to reach x. Suppose OPT does p paid swaps, then cOP T (t) ≥ f + 1 + p.
(4) Telescoping
(5) Since Φt ≥ 0 = Φ0
(6) Since cMTF(σ) = ∑_{t=1}^{|σ|} cMTF(t)
Chapter 13
Paging
On the other hand, since OPT can see the entire sequence σ, OPT can choose to evict the page that is requested furthest in the future. Then, in every k steps, OPT has ≤ 1 cache miss. Thus, cOPT ≤ |σ| / k.
Hence, k · cOPT ≤ cA(σ).
Claim 13.4. Any conservative online algorithm A is k-competitive.
Proof. For any given input sequence σ, partition σ into maximal phases —
P1 , P2 , . . . — where each phase has k distinct pages, and a new phase is
created only if the next element is different from the ones in the current
phase. Let xi be the first item that does not belong in Phase i.
σ = ⟨ k distinct pages ⟩ x1 ⟨ k distinct pages ⟩ x2 . . .    (Phase 1, Phase 2, . . . )
– If p is not in cache,
∗ If all pages in cache are marked, unmark all
∗ Evict a random unmarked page
– Mark page p
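A minimal Python sketch of the Random Marking Algorithm as described above; the tie-handling details and counting only misses are implementation assumptions.

```python
import random

def random_marking(requests, k):
    """Unmarked pages are eviction candidates; when a miss occurs and every cached page
    is marked, a new phase starts (unmark all) and a random unmarked page is evicted."""
    cache, marked, misses = set(), set(), 0
    for p in requests:
        if p not in cache:
            misses += 1
            if len(cache) >= k:
                if marked >= cache:                      # all cached pages marked: new phase
                    marked = set()
                victim = random.choice(sorted(cache - marked))
                cache.remove(victim)                     # evict a random unmarked page
            cache.add(p)
        marked.add(p)                                    # mark the requested page
    return misses

# print(random_marking([1, 2, 3, 1, 4, 2, 5, 1], k=3))
```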
Proof. Let Pi be the set of pages at the start of phase i. Since requesting a
marked page does not incur any cost, it suffices to analyze the first time any
request occurs within the phase.
Denote N as the set of new requests (pages that are not in Pi ) and O
as the set of old requests (pages that are in Pi ). By definition, |O| ≤ k and
|N| + |O| = k. Order the old requests in O in the order in which they appear
in the phase and let xj be the j th old request, for j ∈ {1, . . . , |O|}. Define
mi = |N |, and lj as the number of distinct new requests before xj .
Phase i:
σ = ( . . . new new new old new old . . . )
Here the first old request x1 is preceded by l1 = 3 distinct new requests, and the second old request x2 by l2 = 4.
For j ∈ {1, . . . , |O|}, consider the first time the j th old request xj occurs.
Since the adversary is oblivious, xj is equally likely to be in any position in
the cache at the start of the phase. After seeing (j − 1) old requests and
marking their cache positions, there are k − (j − 1) initial positions in the
cache that xj could be in. Since we have only seen lj new requests and (j −1)
old requests, there are at least k − lj − (j − 1) old pages remaining in the cache (with equality if every one of these new requests evicted an old page). So, the probability that xj is in the cache when requested is at least (k − lj − (j − 1))/(k − (j − 1)). Then,
Cost due to O = Σ_{j=1}^{|O|} Pr[xj is not in cache when requested]     Sum over O
             ≤ Σ_{j=1}^{|O|} lj / (k − (j − 1))                         From above
             ≤ Σ_{j=1}^{|O|} mi / (k − (j − 1))                         Since lj ≤ mi = |N|
             ≤ mi · Σ_{j=1}^{k} 1 / (k − (j − 1))                       Since |O| ≤ k
             = mi · Σ_{j=1}^{k} 1/j                                     Rewriting
             = mi · Hk                                                  Since Σ_{i=1}^{n} 1/i = Hn
Since every new request incurs a unit cost, the cost due to N is mi .
Hence, cRM A (Phase i) = (Cost due to N ) + (Cost due to O) ≤ mi + mi · Hk .
We now analyze OPT's performance. By definition of phases, among all requests in two consecutive phases (say, i − 1 and i), a total of k + mi distinct pages are requested. So, OPT incurs at least mi cache misses over these two phases. To avoid double counting, we lower bound cOPT(σ) separately for odd and even i: cOPT(σ) ≥ Σ_{odd i} mi and cOPT(σ) ≥ Σ_{even i} mi. Together,

2 · cOPT(σ) ≥ Σ_{odd i} mi + Σ_{even i} mi = Σ_i mi

Therefore, we have:

cRMA(σ) ≤ Σ_i (mi + mi · Hk) = O(log k) · Σ_i mi ≤ O(log k) · cOPT(σ)
Proof.
C = Σ_x qx · C                            Sum over all possible inputs x
  ≥ Σ_x qx · Ep[c(A, x)]                  Since C = max_{x∈X} Ep[c(A, x)]
  = Σ_x qx · Σ_a pa · c(a, x)             Definition of Ep[c(A, x)]
  = Σ_a pa · Σ_x qx · c(a, x)             Swap summations
  = Σ_a pa · Eq[c(a, X)]                  Definition of Eq[c(a, X)]
  ≥ Σ_a pa · D                            Since D = min_{a∈A} Eq[c(a, X)]
  = D                                     Sum over all possible algorithms a
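As a sanity check of the chain of inequalities, the following Python snippet evaluates C and D on a toy cost matrix with illustrative distributions p (over deterministic algorithms) and q (over inputs); any such choice satisfies C ≥ D.

cost = [[1.0, 3.0],      # cost[a][x] for deterministic algorithms a in {0, 1}
        [2.0, 0.5]]      # and inputs x in {0, 1} (illustrative numbers)
p = [0.6, 0.4]           # randomized algorithm: distribution over deterministic algorithms
q = [0.3, 0.7]           # input distribution

C = max(sum(p[a] * cost[a][x] for a in range(2)) for x in range(2))   # worst input vs. randomized algorithm
D = min(sum(q[x] * cost[a][x] for x in range(2)) for a in range(2))   # best deterministic algorithm vs. random input
print(C, D, C >= D)      # here C = 2.0 and D = 0.95, so C >= D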
Remark We do not fix the starting positions of the k servers, but we compare against the performance of OPT on σ with the same initial server positions.
The paging problem is a special case of the k-server problem where the
points are all possible pages, the distance metric is unit cost between any
two different points, and the servers represent the pages in cache of size k.
[Figure: all k servers lie to the left of the point 0 on the real line; the requests alternate between the points 1 + ε and 2 + ε, and the single server s∗ ends up serving all of them.]
Without loss of generality, suppose all servers currently lie on the left of “0”.
For ε > 0, consider the sequence σ = (1 + ε, 2 + ε, 1 + ε, 2 + ε, . . . ). The first request will move a single server s∗ to “1 + ε”. By the greedy algorithm, subsequent requests then repeatedly use s∗ to satisfy requests from both “1 + ε” and “2 + ε” since s∗ is the closest server. This incurs a total cost of ≥ |σ| while OPT could station 2 servers on “1 + ε” and “2 + ε” and incur a constant total cost on input sequence σ.
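A quick simulation (with illustrative starting positions and ε = 0.1) shows the gap: the greedy strategy keeps paying roughly 1 per request, while stationing one server at each of the two request points costs a constant.

eps = 0.1
servers = [-3.0, -1.0]                          # both servers start to the left of 0
requests = [1 + eps, 2 + eps] * 10              # sigma = (1+eps, 2+eps, 1+eps, 2+eps, ...)

greedy_cost = 0.0
for r in requests:
    i = min(range(len(servers)), key=lambda s: abs(servers[s] - r))   # closest server
    greedy_cost += abs(servers[i] - r)
    servers[i] = r                              # greedy moves only the closest server

opt_cost = abs(-3.0 - (1 + eps)) + abs(-1.0 - (2 + eps))              # station one server at each point
print(greedy_cost, opt_cost)                    # greedy grows with |sigma|; OPT stays constant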
• If request r is on one side of all servers, move the closest server to cover it.
[Figure: before and after; only the nearest server moves to r.]
• If request r falls between two adjacent servers, move both of them towards r at the same speed until the closer one reaches r.
[Figure: before and after; the two neighbouring servers move towards r.]
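To make these two rules concrete, here is a minimal Python sketch of a single Double Coverage step on the line; the function name and the in-place update are assumptions of this sketch, not the notes' pseudocode. Its cost in the "between two servers" case is 2z, matching the analysis below.

def double_coverage_step(servers, r):
    # servers: sorted list of positions on the line; returns DC's movement cost for request r.
    if r <= servers[0]:                         # request to the left of all servers
        cost = servers[0] - r
        servers[0] = r
    elif r >= servers[-1]:                      # request to the right of all servers
        cost = r - servers[-1]
        servers[-1] = r
    else:                                       # request between two adjacent servers
        i = max(j for j in range(len(servers)) if servers[j] <= r)
        z = min(r - servers[i], servers[i + 1] - r)
        servers[i] += z                         # both neighbours move z towards r;
        servers[i + 1] -= z                     # the closer one reaches r
        cost = 2 * z
    return cost

positions = [-3.0, -1.0]
print(double_coverage_step(positions, 1.1), positions)   # 2.1 [-3.0, 1.1]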
Proof. Suppose, for a contradiction, that both xi and xi+1 are moving
away from their partners. That means yi ≤ xi < r < xi+1 ≤ yi+1 at
the end of OPT’s action (before DC moved xi and xi+1 ). This is a
contradiction since OPT must have a server at r but there is no server
between yi and yi+1 by definition.
Since at least one of xi or xi+1 is moving closer to its partner, ∆DC (Φt,1 ) ≤
z − z = 0.
Meanwhile, since xi and xi+1 are moved a distance of z towards each other, the gap xi+1 − xi decreases by 2z, while the changes in the pairwise distances to all other servers cancel out; hence ∆DC(Φt,2) = −2z.
Hence,
cDC(t) + ∆(Φt) = cDC(t) + ∆DC(Φt) + ∆OPT(Φt) ≤ 2z − 2z + k · x = k · x ≤ k · cOPT(t)
In all cases, we see that cDC (t) + ∆(Φt ) ≤ k · cOP T (t). Hence,
Σ_{t=1}^{|σ|} (cDC(t) + ∆(Φt)) ≤ Σ_{t=1}^{|σ|} k · cOPT(t)     Summing over σ
⇒ Σ_{t=1}^{|σ|} cDC(t) + (Φ|σ| − Φ0) ≤ k · cOPT(σ)             Telescoping
⇒ Σ_{t=1}^{|σ|} cDC(t) − Φ0 ≤ k · cOPT(σ)                       Since Φt ≥ 0
⇒ cDC(σ) ≤ k · cOPT(σ) + Φ0                                     Since cDC(σ) = Σ_{t=1}^{|σ|} cDC(t)
Chapter 16
Multiplicative Weights Update (MWU)
Definition 16.1 (The learning from experts problem). Every day, we are to make a binary decision. At the end of the day, a binary output is revealed and we incur a mistake if our decision did not match the output. Suppose we have access to n experts e1, . . . , en, each of which recommends a binary decision every day. How does one make use of the experts to minimize the total number of mistakes on an online binary sequence?
Toy setting Consider a stock market with only a single stock. Every day,
we decide whether to buy the stock or not. At the end of the day, the stock
value will be revealed and we incur a mistake/loss of 1 if we did not buy
when the stock value rose, or bought when the stock value fell.
Day      1  2  3  4  5
Output   1  1  0  0  1
e1       1  1  0  0  1
e2       1  0  0  0  1
e3       1  1  1  1  0
Proof. Observe that when DMWU makes a mistake, the weighted majority of experts was wrong and their weight decreases by a factor of (1 − ε). Suppose that Σ_{i=1}^{n} wi = x at the start of the day. If we make a mistake, x drops to at most (x/2)(1 − ε) + x/2 = x(1 − ε/2). That is, the overall weight reduces by at least a factor of (1 − ε/2). Since the best expert e∗ makes m∗ mistakes, his/her weight at the end is (1 − ε)^{m∗}. By the above observation, the total weight of all experts at the end of the sequence is at most n(1 − ε/2)^m. Then,

(1 − ε)^{m∗} ≤ n(1 − ε/2)^m               Expert e∗'s weight is part of the overall weight
⇒ m∗ ln(1 − ε) ≤ ln n + m ln(1 − ε/2)     Taking ln on both sides
⇒ m∗(−ε − ε²) ≤ ln n + m(−ε/2)            Since −x − x² ≤ ln(1 − x) ≤ −x for x ∈ (0, 1/2)
⇒ m ≤ 2(1 + ε)m∗ + (2 ln n)/ε             Rearranging
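For concreteness, here is a small Python sketch of the deterministic weighted-majority rule analysed above: follow the weighted majority vote, then multiply every wrong expert's weight by (1 − ε). The function name and parameters are illustrative.

def dmwu(expert_predictions, outcomes, eps=0.25):
    # expert_predictions[i][day] in {0, 1}; returns the number of mistakes made.
    n = len(expert_predictions)
    w = [1.0] * n
    mistakes = 0
    for day, outcome in enumerate(outcomes):
        vote1 = sum(w[i] for i in range(n) if expert_predictions[i][day] == 1)
        vote0 = sum(w[i] for i in range(n) if expert_predictions[i][day] == 0)
        decision = 1 if vote1 >= vote0 else 0
        if decision != outcome:
            mistakes += 1
        for i in range(n):
            if expert_predictions[i][day] != outcome:
                w[i] *= 1 - eps                  # penalize wrong experts
    return mistakes

experts = [[1, 1, 0, 0, 1], [1, 0, 0, 0, 1], [1, 1, 1, 1, 0]]   # the toy table above
print(dmwu(experts, [1, 1, 0, 0, 1]))                            # 0 mistakes on this toy input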
Proof. Consider only two experts e0 and e1, where e0 always outputs 0 and e1 always outputs 1. Any binary sequence σ must contain at least |σ|/2 zeroes or at least |σ|/2 ones. Thus, m∗ ≤ |σ|/2. On the other hand, the adversary looks at A and produces a sequence σ which forces A to incur a loss every day. Thus, m = |σ| ≥ 2m∗.
• On each day:
– Pick a random expert with probability proportional to their weight (i.e. pick ei with probability wi / Σ_{j=1}^{n} wj).
– Follow that expert's recommendation.
– For each wrong expert, set wi to (1 − ε) · wi, for some constant ε ∈ (0, 1/2).

Another way to think about the probabilities is to split all experts into two groups A = {Experts that output 0} and B = {Experts that output 1}. Then, decide ‘0’ with probability wA/(wA + wB) and ‘1’ with probability wB/(wA + wB), where wA = Σ_{ei∈A} wi and wB = Σ_{ei∈B} wi are the sums of weights in each set.
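A minimal Python sketch of RMWU, with illustrative names, following the description above:

import random

def rmwu(expert_predictions, outcomes, eps=0.25):
    # Each day, follow a single expert drawn with probability proportional to its weight,
    # then multiply every wrong expert's weight by (1 - eps).
    n = len(expert_predictions)
    w = [1.0] * n
    mistakes = 0
    for day, outcome in enumerate(outcomes):
        chosen = random.choices(range(n), weights=w)[0]
        if expert_predictions[chosen][day] != outcome:
            mistakes += 1
        for i in range(n):
            if expert_predictions[i][day] != outcome:
                w[i] *= 1 - eps
    return mistakes

print(rmwu([[1, 1, 0, 0, 1], [1, 0, 0, 0, 1], [1, 1, 1, 1, 0]], [1, 1, 0, 0, 1]))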
Theorem 16.5. Suppose the best expert makes m∗ mistakes and RMWU makes m mistakes. Then,

E[m] ≤ (1 + ε)m∗ + (ln n)/ε
Proof. Fix an arbitrary day j ∈ {1, . . . , |σ|}. Denote A = {Experts that output 0 on day j} and B = {Experts that output 1 on day j}, where wA = Σ_{ei∈A} wi and wB = Σ_{ei∈B} wi are the sums of weights in each set. Let Fj be the weighted fraction of wrong experts on day j. If σj = 0, then Fj = wB/(wA + wB). If σj = 1, then Fj = wA/(wA + wB). By definition of Fj, RMWU makes a mistake on day j with probability Fj. By linearity of expectation, E[m] = Σ_{j=1}^{|σ|} Fj.

Since the best expert e∗ makes m∗ mistakes, his/her weight at the end is (1 − ε)^{m∗}. On each day, RMWU reduces the overall weight by a factor of (1 − ε · Fj) by penalizing wrong experts. Hence, the total weight of all experts at the end of the sequence is n · Π_{j=1}^{|σ|} (1 − ε · Fj). Then,

(1 − ε)^{m∗} ≤ n · Π_{j=1}^{|σ|} (1 − ε · Fj)     Expert e∗'s weight is part of the overall weight
⇒ (1 − ε)^{m∗} ≤ n · e^{−ε Σ_{j=1}^{|σ|} Fj}      Since (1 − x) ≤ e^{−x}
⇒ (1 − ε)^{m∗} ≤ n · e^{−ε·E[m]}                  Since E[m] = Σ_{j=1}^{|σ|} Fj
⇒ m∗ ln(1 − ε) ≤ ln n − ε · E[m]                  Taking ln on both sides
⇒ m∗(−ε − ε²) ≤ ln n − ε · E[m]                   Since −x − x² ≤ ln(1 − x)
⇒ E[m] ≤ (1 + ε)m∗ + (ln n)/ε                     Rearranging
16.4 Generalization
Denote the loss of expert i on day t as l_i^t ∈ [−ρ, ρ], for some constant ρ. When we incur a loss, update the weights of the affected experts from wi to (1 − ε · l_i^t / ρ) · wi. Note that l_i^t / ρ is simply the loss normalized to [−1, 1].
Claim 16.6 (Without proof). With RMWU, we have

E[m] ≤ min_i ( Σ_t l_i^t + ε · Σ_t |l_i^t| + (ρ ln n)/ε )
Remark If each expert has a different ρi, one can modify the update rule and the claim to use ρi in place of a uniform ρ.
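Under the update rule above, a short Python sketch of the generalized method (names and parameters are illustrative; the return value is the expected loss of following a random expert drawn proportionally to the weights):

def mwu_losses(losses, eps=0.1, rho=1.0):
    # losses[t][i] = l_i^t in [-rho, rho]; update w_i <- w_i * (1 - eps * l_i^t / rho).
    n = len(losses[0])
    w = [1.0] * n
    expected_loss = 0.0
    for day in losses:
        total = sum(w)
        expected_loss += sum(w[i] / total * day[i] for i in range(n))
        for i in range(n):
            w[i] *= 1 - eps * day[i] / rho       # normalized loss in [-1, 1]
    return expected_loss

print(mwu_losses([[0.2, -0.1, 0.4], [0.3, 0.0, -0.2]]))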
[Example: a graph on v1, . . . , v5 in which each edge is labelled with its current load over its capacity (e.g. 5/13, 5/10, 0/21); routing a new request increases the loads along its path, e.g. 5/10 becomes 8/10 and 0/21 becomes 13/21.]
• le∗(j) as the optimal offline algorithm's relative load of edge e after request j.

In other words, the objective is to minimize max_{e∈E} le(|σ|) for a given sequence σ. Denoting Λ as the (unknown) optimal congestion factor, we normalize p̃e(i) = pe(i)/Λ, l̃e(j) = le(j)/Λ, and l̃e∗(j) = le∗(j)/Λ. Let a be a constant to be determined. Consider algorithm A which does the following on request i + 1:
Lemma 16.9. For a = 1 + 1/(2γ), Φ(j + 1) − Φ(j) ≤ 0.
Proof. Let Pj+1 be the path that A found and P∗j+1 be the path that the optimal offline algorithm assigned to the (j + 1)-th request ⟨s(j + 1), t(j + 1), d(j + 1)⟩. For any edge e, observe the following:

• If e ∉ P∗j+1, the load on e due to the optimal offline algorithm remains unchanged. That is, l̃e∗(j + 1) = l̃e∗(j). On the other hand, if e ∈ P∗j+1, then l̃e∗(j + 1) = l̃e∗(j) + p̃e(j + 1).

• If e is neither in Pj+1 nor in P∗j+1, then a^{l̃e(j+1)} (γ − l̃e∗(j + 1)) = a^{l̃e(j)} (γ − l̃e∗(j)).

That is, only edges used by Pj+1 or P∗j+1 affect Φ(j + 1) − Φ(j).
Using the observations above together with Lemma 16.8 and the fact that A computes a shortest path, one can show that Φ(j + 1) − Φ(j) ≤ 0. In detail,
Φ(j + 1) − Φ(j)
= Σ_{e∈E} [ a^{l̃e(j+1)} (γ − l̃e∗(j+1)) − a^{l̃e(j)} (γ − l̃e∗(j)) ]
= Σ_{e∈Pj+1\P∗j+1} (a^{l̃e(j+1)} − a^{l̃e(j)}) (γ − l̃e∗(j))
  + Σ_{e∈P∗j+1} [ a^{l̃e(j+1)} (γ − l̃e∗(j) − p̃e(j+1)) − a^{l̃e(j)} (γ − l̃e∗(j)) ]       (1)
= Σ_{e∈Pj+1} (a^{l̃e(j+1)} − a^{l̃e(j)}) (γ − l̃e∗(j)) − Σ_{e∈P∗j+1} a^{l̃e(j+1)} p̃e(j+1)
≤ Σ_{e∈Pj+1} (a^{l̃e(j+1)} − a^{l̃e(j)}) γ − Σ_{e∈P∗j+1} a^{l̃e(j+1)} p̃e(j+1)            (2)
≤ Σ_{e∈Pj+1} (a^{l̃e(j+1)} − a^{l̃e(j)}) γ − Σ_{e∈P∗j+1} a^{l̃e(j)} p̃e(j+1)              (3)
= Σ_{e∈Pj+1} (a^{l̃e(j)+p̃e(j+1)} − a^{l̃e(j)}) γ − Σ_{e∈P∗j+1} a^{l̃e(j)} p̃e(j+1)        (4)
≤ Σ_{e∈P∗j+1} [ (a^{l̃e(j)+p̃e(j+1)} − a^{l̃e(j)}) γ − a^{l̃e(j)} p̃e(j+1) ]               (5)
= Σ_{e∈P∗j+1} a^{l̃e(j)} [ (a^{p̃e(j+1)} − 1) γ − p̃e(j+1) ]
= Σ_{e∈P∗j+1} a^{l̃e(j)} [ ((1 + 1/(2γ))^{p̃e(j+1)} − 1) γ − p̃e(j+1) ]                   (6)
≤ 0                                                                                      (7)

(2) Since l̃e∗(j) ≥ 0
(3) Since l̃e(j + 1) ≥ l̃e(j)
Proof. Since Φ(0) = mγ and Φ(j + 1) − Φ(j) ≤ 0, we see that Φ(j) ≤ mγ for all j ∈ {1, . . . , |σ|}. Consider the edge e with the highest congestion. Since γ − l̃e∗(j) ≥ 1, we see that

(1 + 1/(2γ))^L ≤ a^L · (γ − l̃e∗(j)) ≤ Φ(j) ≤ mγ ≤ n²γ

Taking log on both sides and rearranging, we get:

L ≤ (2 log(n) + log(γ)) · 1/log(1 + 1/(2γ)) ∈ O(log n)
³Existing paths are preserved; we simply ignore them in the subsequent computations of ce.
Bibliography
[AB17] Amir Abboud and Greg Bodwin. The 4/3 additive spanner exponent is tight. Journal of the ACM (JACM), 64(4):28, 2017.
[ADD+ 93] Ingo Althöfer, Gautam Das, David Dobkin, Deborah Joseph,
and José Soares. On sparse spanners of weighted graphs. Discrete
& Computational Geometry, 9(1):81–100, 1993.
[AGM12] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Ana-
lyzing graph structure via linear measurements. In Proceedings
of the twenty-third annual ACM-SIAM symposium on Discrete
Algorithms, pages 459–467. SIAM, 2012.
[AHK12] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplica-
tive weights update method: a meta-algorithm and applications.
Theory of Computing, 8(1):121–164, 2012.
[AMS96] Noga Alon, Yossi Matias, and Mario Szegedy. The space com-
plexity of approximating the frequency moments. In Proceedings
of the twenty-eighth annual ACM symposium on Theory of com-
puting, pages 20–29. ACM, 1996.
[Lee18] James R Lee. Fusible HSTs and the randomized k-server conjecture. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 438–449. IEEE, 2018.
[Mos15] Dana Moshkovitz. The projection games conjecture and the NP-hardness of ln n-approximating set-cover. Theory of Computing, 11(1):221–235, 2015.
[NY18] Jelani Nelson and Huacheng Yu. Optimal lower bounds for
distributed and streaming spanning forest computation. arXiv
preprint arXiv:1807.05135, 2018.