161 Main
MOOR XU
NOTES FROM A COURSE BY TIM ROUGHGARDEN
Abstract. These notes were taken during CS161 (Design and Analysis of Algorithms)
taught by Tim Roughgarden in Winter 2011 at Stanford University. They were live-TeXed
during lectures in vim and compiled using latexmk. Each lecture gets its own section.
The notes are not edited afterward, so there may be typos; please email corrections to
moorxu@stanford.edu.
1. 1/4: Introduction
We’re going to ask and answer lots of questions. Why are you here? For most people,
because it is required. How come? It is fundamental, it is useful, and it is fun.
1.1. Two example problems.
Example 1.1. Internet routing.
Remark. We can think of the internet as a graph, with vertices and with edges. The vertices
are end hosts and routers, while the directed edges are physical or wireless connections.
There are other internet related graphs as well, such as social networks, Web, etc.
Which Stanford to Cornell path to use? We prefer to use the shortest path. Let’s assume
that the fewest number of hops would satisfy this shortest condition. Then evidently we
need an algorithm to find the shortest path.
The first such algorithm is Dijkstra, but running this requires knowing what all the nodes
are, which requires knowing the whole graph; this is infeasible for the internet. We need an
alternative shortest path algorithm that needs only local computation. It’s not obvious that
such an algorithm should exist, but it does exist. It is the Bellman-Ford algorithm, and it
is an application of dynamic programming.
Example 1.2. Sequence alignment. This is a fundamental problem in computational ge-
nomics. The input is two strings over the alphabet {A, C, G, T}. For example, the input to
the algorithm might be AGGGCT and AGGCA, and these would be portions of genomes.
Problem 1.3. Figure out how “similar” two strings are.
Think about how many steps it takes to go from one string to the other. This is one way to
define similarity. Here’s another: Define similarity as the quality of the “best” alignment.
Consider some intuition: AGGGCT and AGGCA can be “nicely” aligned; write them as
AGGGCT and AGG CA. We had to insert a gap, and there was one mismatch.
Assume that we have experimentally determined penalties for gaps and mismatches. So
one gap would produce some penalty, while mismatching A and T would produce some other
penalty. Penalties are additive. In our example, the total penalty would be for one gap and
one A/T mismatch.
The best match is therefore defined as the alignment with smallest total penalty. This is a
famous concept in computational genomics, called the Needleman-Wunsch score. They were
biologists in the early 70’s. A small penalty score means that the sequences are similar.
Why would people care about doing this?
• Extrapolation of genome substrings
• Similarity of genomes can reflect proximity in evolutionary tree.
Note. The definition of NW score is inherently algorithmic. We need a practical algorithm
to find the best alignment.
That algorithm is called brute-force search: try all possible alignments, pick the best
one. But there are a massive number of alignments. The number of alignments exceeds the
number of atoms in the known universe even for string lengths in the hundreds. This is bad.
We instead need a faster nontrivial algorithm. Use dynamical programming.
1.2. Simple problem. Consider the multiplication of two n-digit numbers x and y. For
example, x = 1234 and y = 5678. There exists the grade school algorithm for doing this.
This is O(n^2). Can we do better?
Let’s think about recursive algorithms for doing this.
Algorithm 1.4. Write x = 10^{n/2} a + b and y = 10^{n/2} c + d. The point is that each of a, b, c, d
is an (n/2)-digit number. We are being asked to multiply
(1) xy = 10^n ac + 10^{n/2} (ad + bc) + bd.
Now, all of the multiplications that remain involve numbers with fewer digits. This is a
recursive problem. We now recursively compute ac, ad, bc, bd, and then finish evaluating
the obvious way. Clearly, this algorithm is correct and terminates.
Algorithm 1.5 (Gauss). Recursively compute ac, bd, and (a + b)(c + d) = ac + bd + ad + bc.
The point is that by subtraction we can compute ad + bc, and this only required three
multiplications. This should be better than the previous algorithm, but it’s unclear how
to compare it to the grade school algorithm.
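Gauss's trick can be sketched in runnable Python (this sketch, including the function name and base case, is my own illustration, not the lecture's):

```python
def multiply(x, y):
    """Multiply nonnegative integers with Gauss's three-multiplication trick."""
    if x < 10 or y < 10:                 # base case: a single-digit factor
        return x * y
    half = max(len(str(x)), len(str(y))) // 2
    p = 10 ** half
    a, b = divmod(x, p)                  # x = 10^{half} a + b
    c, d = divmod(y, p)                  # y = 10^{half} c + d
    ac = multiply(a, c)
    bd = multiply(b, d)
    # (a + b)(c + d) = ac + ad + bc + bd, so one subtraction recovers ad + bc
    ad_plus_bc = multiply(a + b, c + d) - ac - bd
    return p * p * ac + p * ad_plus_bc + bd
```

On the running example, multiply(1234, 5678) agrees with the grade school answer.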
2. 1/6: Divide and Conquer
2.1. Merge sort. Today we begin with divide and conquer algorithms. The canonical
example is mergesort. It remains the pedagogically cleanest example.
Given n unsorted numbers, we want to output an array of the same n numbers sorted
in increasing order.
Algorithm 2.1 (Mergesort).
(1) Recursively sort 1st half
(2) Recursively sort 2nd half
(3) Merge two subsets into one.
Pseudocode for merge step:
c = output array, length n
A = first sorted array, length n/2
B = second sorted array, length n/2
i = 1; j = 1
for k = 1 to n
  if A(i) < B(j)
    c(k) = A(i); i++
  else
    c(k) = B(j); j++
(Once one of A or B is exhausted, copy the remainder of the other into c.)
We are now interested in the running time of merge sort. This means the number of
operations executed.
The initialization step is 2 operations. Inside the for loop, we need three operations, and
we need to increment k, so each loop iteration requires 4 operations. The loop runs n times.
So the running time of merge on an array of m numbers is ≤ 4m + 2 ≤ 6m.
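The pseudocode above can be fleshed out into a runnable Python version (my own sketch; note the final copy step for the exhausted half, which the operation count above glosses over):

```python
def merge_sort(a):
    """Recursively sort each half, then merge the two sorted halves."""
    n = len(a)
    if n <= 1:
        return a[:]
    left = merge_sort(a[:n // 2])
    right = merge_sort(a[n // 2:])
    out, i, j = [], 0, 0
    # merge: repeatedly copy over the smaller of the two front elements
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])    # copy whatever remains of the unfinished half
    out.extend(right[j:])
    return out
```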
Claim. Merge sort requires ≤ 6n log_2 n + 6n operations to sort n ≥ 1 numbers.
Proof. Assume that n is a power of 2 for simplicity. We will use a “recursion tree.” The
point is to write down all of the recursive calls in a tree structure. There are log_2 n levels
to the tree. The key idea is to analyze the work done by merge sort, level by level.
Consider a level j = 0, 1, . . . , log_2 n. There are 2^j subproblems, each of size n/2^j. So the
total number of operations that we need at level j is
≤ 2^j · 6(n/2^j) = 6n.
We add this up over all of the levels to get that the total is ≤ 6n(log_2 n + 1).
Remark. We need to make some assumptions.
(1) We used “worst case analysis”: our bound applies to every input of length n. It is
difficult to define “practical inputs”. Also, this is easier to analyze.
(2) We will ignore constants and lower order terms. This makes life much easier and does
not lose predictive power. Constants depend on domain-dependent factors anyway.
(3) We will bound running time for large inputs of size n. For example, we will claim
that 6n log n is “better than” (1/2)n^2. By Moore’s Law, the only interesting problems
are big problems.
This course adopts these as guiding principles. Fast algorithms are those where the worst
case running time grows slowly with input size.
Usually, we want as close to linear O(n) as possible.
2.2. Lightning fast review of asymptotic notation. The most common notations are O(·), Ω(·),
and Θ(·). Let T (n) be a function defined on n = 1, 2, 3, . . . . This is usually the worst case
running time of an algorithm.
Definition 2.2. T (n) = O(f (n)) if there exist constants c, n0 > 0 such that
T (n) ≤ c · f (n)
for every n ≥ n0 .
Example 2.3. If T(n) = a_k n^k + · · · + a_1 n + a_0, then T(n) = O(n^k).
Proof. Choose n_0 = 1. Choose c = |a_k| + |a_{k−1}| + · · · + |a_1| + |a_0|. We need to show that for
all n ≥ 1, T(n) ≤ c · n^k. We have, for every n ≥ 1,
T(n) ≤ |a_k|n^k + · · · + |a_1|n + |a_0| ≤ |a_k|n^k + · · · + |a_1|n^k + |a_0|n^k = c · n^k.
Example 2.4. For every k ≥ 1, n^k is not O(n^{k−1}).
Proof. We prove this by contradiction. Suppose for contradiction that n^k = O(n^{k−1}). There
exist constants c, n_0 > 0 such that n^k ≤ c · n^{k−1} for all n ≥ n_0. But then just cancel n^{k−1}, so
that n ≤ c for all n ≥ n_0, which is clearly false.
Definition 2.5. T (n) = Ω(f (n)) if and only if there exist constants c, n0 such that T (n) ≥
c · f (n) for all n ≥ n0 .
Definition 2.6. T (n) = Θ(f (n)) if and only if T (n) = O(f (n)) and T (n) = Ω(f (n)).
Example 2.7. (1/2)n^2 + n is O(n^3), Ω(n), and Θ(n^2).
[Recursion tree: the root has input size n and splits into a subproblems, each of input size n/b, and so on.]
How many subproblems do we have at level j? There are a^j subproblems, each with input
size m = n/b^j, and there are log_b n levels. The total work performed at level j (ignoring
work in recursive calls) is
≤ a^j · c(n/b^j)^d = cn^d (a/b^d)^j.
The total work is therefore
(2) cn^d Σ_{j=0}^{log_b n} (a/b^d)^j.
Remark. Here’s how to intuitively interpret this formula. a is the rate of subproblem
proliferation, and bd is the rate of work shrinkage per subproblem.
First, suppose that a = bd ; the rate of proliferation equals the rate of work shrinkage. This
is like merge sort; same amount of work per level, and now it’s obvious how much work is
done at each level.
If the rate of work shrinkage is larger, the root should dominate.
If subproblem proliferation is larger, then the work per level increases, so the bottommost
level dominates.
We now evaluate (2). In the case a = b^d, this becomes
cn^d Σ_{j=0}^{log_b n} (a/b^d)^j = cn^d Σ_{j=0}^{log_b n} 1 = cn^d (log_b n + 1) = O(n^d log n).
Note that
1 + r + r^2 + · · · + r^k = (r^{k+1} − 1)/(r − 1),
which is O(r^k) if r > 1 and O(1) if r < 1.
So taking r = a/b^d, if a < b^d, then
cn^d Σ_{j=0}^{log_b n} (a/b^d)^j = cn^d · O(1) = O(n^d).
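The case analysis can be packaged into a small Python helper (my own illustration; the third case, a > b^d, is included for completeness even though only its intuition appears above):

```python
import math

def master_bound(a, b, d):
    """Asymptotic solution of T(n) = a*T(n/b) + O(n^d), comparing the rate of
    subproblem proliferation (a) to the rate of work shrinkage (b^d)."""
    if a == b ** d:
        return f"O(n^{d} log n)"           # same work at every level
    if a < b ** d:
        return f"O(n^{d})"                 # the root dominates
    return f"O(n^{math.log(a, b):g})"      # a > b^d: the leaves dominate, n^{log_b a}
```

For instance, mergesort has a = 2, b = 2, d = 1, giving O(n^1 log n).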
Algorithm 4.3.
• break the array A into groups of 5 elements
• sort each group of 5 elements using merge sort
• get the n/5 medians (the middle element of each group)
• recursively find the median of these medians, and return it as our pivot
Here’s how to actually implement ChoosePivot:
1A. break A into groups of 5 and sort each group
1B. Let C be the n/5 middle elements (the group “medians”)
1C. return Select(C, n/5, n/10)
Note that we are calling merge sort in a supposedly linear time algorithm. Isn’t that bad?
No, because merge sort is being applied to input sizes that are fixed, and it takes a finite
and bounded number of operations to sort a list of 5 elements. By lecture 2, this takes
≤ 6 · 5(log_2 5 + 1) ≤ 120 operations per group, so it takes ≤ 120(n/5) = 24n = O(n) to sort the groups.
We consider the running time. Let’s build a rough recurrence and fix it later. There
exists a constant c > 0 such that T (1) = 1 and T (n) ≤ cn + T (n/5) + T (?). The cn is the
nonrecursive piece, T (n/5) comes from finding the median of medians. If we actually got
the median, T (?) would be T (n/2), but we might not be so lucky. We need the following
lemma:
Lemma 4.4. The second recursive call is guaranteed to be on an array of size ≤ 7n/10.
Proof. Let k = n/5 be the number of groups. Let xl be the l-th smallest of the k middle
elements. In this notation, the pivot that we choose is xk/2 .
Envision the unsorted array A as laid out in a grid. Each column will be a group of 5
elements, with the columns sorted by their middle element. The winning pivot is the one
at the center. Now, there is a geographic region of the grid that we can confidently say
is smaller than the pivot. Using transitivity of <, we see that the bottom left quadrant is
certainly smaller than the center pivot. This means that xk/2 is bigger than 3 out of 5 (60%)
of ≈ 50% of the groups. This means that the pivot is in fact bigger than at least 30% of A.
Similarly, it is guaranteed to be smaller than ≥ 30% of A.
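Putting ChoosePivot together with the surrounding selection routine, a Python sketch looks as follows (the partitioning details are my own choices; the lecture's Select may differ):

```python
def select(arr, i):
    """Return the i-th smallest element of arr (i starts at 1),
    choosing the pivot by median-of-medians."""
    if len(arr) <= 5:
        return sorted(arr)[i - 1]
    # 1A. break into groups of 5 and sort each group
    groups = [sorted(arr[j:j + 5]) for j in range(0, len(arr), 5)]
    # 1B. C = the middle element of each group
    medians = [g[len(g) // 2] for g in groups]
    # 1C. recursively find the median of the medians; use it as the pivot
    pivot = select(medians, (len(medians) + 1) // 2)
    # partition around the pivot
    less = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    greater = [x for x in arr if x > pivot]
    if i <= len(less):
        return select(less, i)
    elif i <= len(less) + len(equal):
        return pivot
    else:
        return select(greater, i - len(less) - len(equal))
```

By the lemma, each of less and greater has size ≤ 7n/10, which is what makes the recurrence work out to linear time.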
[Figure: a graph with vertices S, V, W, T; the visible edge lengths are 1, 2, 3, 4, and 6, with VW = 2.]
Start at S. First, we include V, labeling it with 1. There are now three edges in the
frontier. We advance using edge VW, and we label W with 3. Finally, we want to get to T,
and we choose the WT edge, labeling T with 6; the shortest path is S → V → W → T.
We should prove that this is correct.
Proof. This is by induction on the number of iterations. The base case should be clear.
Suppose that A[v] and B[v] have been computed correctly for all v ∈ X. In the inductive
step, we pick (v ∗ , w∗ ), and add w∗ to X. The length of the s − w∗ path B[w∗ ] is the length
of B[v ∗ ] ∪ (v ∗ , w∗ ), which by the inductive hypothesis is A[v ∗ ] + cv∗ w∗ .
We want to say that this is the shortest path. Note that any such path must cross the
frontier from X to V \ X, so in particular, there is some initial edge y − z that it uses to
cross the frontier. Therefore, we can think of it as having three pieces: s − y − z − w∗ . Let’s
lower bound each piece separately.
First, the y − z path is only one edge, so it has length cyz . We know the shortest path s − y
by the inductive hypothesis; this is at least A[y]. For the final path z − w∗ , we don’t know
anything except that the length is ≥ 0. Therefore, every path s − w∗ has length ≥ A[y] + cyz
for some y, z edge across the frontier. Now, we use Dijkstra’s greedy criterion to get that
the length is ≥ A[v ∗ ] + cv∗ w∗ . Therefore, our path is indeed the shortest one.
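A compact heap-based Python sketch of Dijkstra (my own implementation choices: an adjacency-dict encoding and a lazy "stale entry" trick rather than an explicit frontier scan):

```python
import heapq

def dijkstra(graph, s):
    """graph: dict mapping each node to a list of (neighbor, length) pairs.
    Returns the dict A of shortest-path distances from s (the notes' A[v])."""
    A = {s: 0}
    heap = [(0, s)]                     # (tentative distance, node)
    while heap:
        dist, v = heapq.heappop(heap)
        if dist > A.get(v, float('inf')):
            continue                    # stale entry; v already has a shorter path
        for w, length in graph[v]:
            if dist + length < A.get(w, float('inf')):
                A[w] = dist + length    # Dijkstra's greedy criterion
                heapq.heappush(heap, (A[w], w))
    return A
```

On the worked example above (with the edge lengths as I have reconstructed them), this returns A[W] = 3 and A[T] = 6, matching the hand run.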
Definition 5.6. Let G = (V, E) be undirected. The connected components of G are the
“pieces of G”. Formally, these are the equivalence classes of the equivalence relation u ∼ v
if G has a uv path.
Proposition 5.7. We claim that BFS started at s explores exactly the nodes in the connected
component of s. The run time is proportional to the size of the component.
Proof. For all levels i, the level i nodes of BFS are precisely the nodes that are i hops away
from s. (Check this; prove it by induction.) The point is that we can’t miss anything.
Proposition 5.8. We can compute all connected components in an undirected graph in
linear (O(n + m)) time.
Proof.
For i = 1 to n
if i not yet explored
BFS from i
One BFS is called for each component. There is O(n) overhead for the checks in the
loop, and we do one BFS per component, which have total runtime O(Σ component sizes) =
O(n + m). So overall, this is O(m + n).
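The loop above can be fleshed out in Python (my sketch; the graph is assumed to be an undirected adjacency dict with every node as a key):

```python
from collections import deque

def connected_components(graph):
    """graph: dict mapping each node to its list of neighbors (undirected).
    Outer loop over all nodes; BFS from each not-yet-explored one."""
    explored = set()
    components = []
    for i in graph:
        if i in explored:
            continue
        # BFS from i discovers exactly i's connected component
        comp, queue = [], deque([i])
        explored.add(i)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in graph[u]:
                if v not in explored:
                    explored.add(v)
                    queue.append(v)
        components.append(comp)
    return components
```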
DFS(graph G, node i)
- mark i as explored
- set leader(i) := node s
- for each arc (i,j) in G
- if j not yet explored
- DFS(G, j)
- t++; set f(i) := t
At the end, the SCCs of G are exactly the nodes with the same leader.
Let’s do an example to show how this works.
Example 6.3. We first run DFS on Grev .
[Figure: Grev, a directed graph on nodes 1–9.]
After the first call to DFS, we have f (7) = 1, f (5) = 2, f (8) = 3, f (2) = 4, f (4) = 5,
f (1) = 6. Finally, f (9) = 7, f (6) = 8, and f (3) = 9.
At the end, the SCCs of G are the nodes with the same leader in the second call to DFS.
In the second pass, we relabel the nodes with our f (v) values.
[Figure: the same graph with each node relabeled by its finishing time f(v).]
Nodes 7, 8, 9 have leader = 9. Nodes 1, 5, 6 have leader = 6. Nodes 2, 3, 4 have leader = 4.
It seems that this works, but it isn’t clear why it works.
Consider the running time of this algorithm. This is two depth first searches plus some
overhead, which is certainly linear O(m + n) time. What about correctness?
Note. The SCCs of a graph form an acyclic metagraph, where the meta nodes are the SCCs
C1 , . . . , Ck . There is an arc C → Ĉ if and only if there exists an arc (i, j) with i ∈ C and
j ∈ Ĉ. In the example above, this is simply
◦ → ◦ → ◦.
This is acyclic because if there were a cycle of SCCs then they would collapse into a single
SCC. This is a cool way of looking at directed graphs.
Here’s a key lemma:
12
Lemma 6.4. Consider two “adjacent” SCCs C1 and C2 in a graph G. This means that
there are i ∈ C1 and j ∈ C2 such that there exists an arc i → j. Let f (v) be the finishing
times of general-DFS in Grev . Then,
max_{v∈C1} f(v) < max_{v∈C2} f(v).
Say Ci has maximal f value fi . Then f1 < f2 and f3 < f4 . Here, C4 is a sink SCC.
Correctness Intuition of Algorithm 6.2. The second pass of General-DFS begins somewhere
in a sink SCC C ∗ of G. Since we cannot escape a sink SCC, we will find everything in C ∗ but
not be able to discover anything elsewhere. The first call to DFS discovers C ∗ and nothing
else.
Now, consider the next call of DFS. We effectively recurse on the graph after peeling off
the already explored C ∗ . This starts in a sink SCC of the residual graph G \ C ∗ .
Successive calls of DFS peel off the SCCs one by one.
That’s why the algorithm works. Fundamentally, the first DFS call is just a preprocessing
step. The point is that we will be able to start at sink SCCs, providing an order to find
SCCs. This guarantees that the second DFS call works as intended.
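The two-pass algorithm can be sketched in Python (my own iterative version, using explicit stacks instead of recursion; it assumes every node appears as a key of the adjacency dict):

```python
from collections import defaultdict

def kosaraju(graph):
    """graph: dict mapping every node to its list of successors.
    Pass 1 computes finishing times on the reversed graph Grev; pass 2 runs
    DFS on G in decreasing finishing-time order. Each DFS tree of pass 2
    is exactly one SCC, identified by its leader."""
    rev = defaultdict(list)
    for u in graph:
        for v in graph[u]:
            rev[v].append(u)
    # pass 1: iterative DFS on Grev, recording the finishing order
    order, seen = [], set()
    for s in graph:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(rev[s]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(rev[v])))
                    break
            else:
                order.append(u)   # u finishes; f(u) = len(order)
                stack.pop()
    # pass 2: DFS on G from the highest finishing time downward
    leader, seen = {}, set()
    for s in reversed(order):
        if s in seen:
            continue
        seen.add(s)
        stack = [s]
        while stack:
            u = stack.pop()
            leader[u] = s         # s plays the role of leader(u)
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return leader
```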
Proof of key lemma 6.4. Consider Grev . Suppose we have j ∈ C2 and i ∈ C1 , and there is
an arc j → i in this reversed graph. Note that SCCs are preserved under graph reversal.
Let v be the first node in C1 ∪ C2 reached by the DFS call on Grev . There are two cases.
Suppose that v ∈ C1 . DFS is going to find everything in C1 but nothing in C2 because the
metagraph is acyclic and hence contains no arc C1 → C2 . Therefore, all of C1 is explored
before any of C2 . Hence, f (x) < f (y) for every x ∈ C1 and y ∈ C2 .
In the second case, if v ∈ C2 , DFS will find everything in C2 and it will find everything
in C1 as well because of the arc C2 → C1 . DFS from v will not finish until all of C1 ∪ C2
is completely explored, in particular, f (v) > f (w) for all w ∈ C1 . This proves the key
lemma.
This is a great algorithm if everything fits in main memory. It is very fast. What about
enormous graphs that don’t fit in memory?
6.1. Application: structure of the web. Consider an application to the structure of the
web. Broder et al published a paper in 2000 considering the SCCs of the web graph. This
was a nontrivial task. Search engines now have systems to do this, but 2000 was a long time
ago. This paper proposed something called the bowtie structure.
They found a giant strongly connected component in the web graph, but there are other
SCCs as well. There are SCCs from which you can reach the core but not conversely (e.g.
new pages), and there are SCCs which you can reach from the core but not conversely (e.g.
corporate websites). There were also tubes and tendrils of other SCCs.
The main findings were:
(1) All four parts have roughly the same size.
(2) Within the core, the graph is very well connected. This is the small world property.
(3) Outside the core, the graph is surprisingly poorly connected.
7. 1/25: Greedy algorithms
The exasperating thing about algorithm design is that there is no silver bullet. We will
discuss the major design paradigms to form a toolbox.
7.1. Overview of greedy algorithms. What is a greedy algorithm? You know it when
you see it. The point is to iteratively make “myopic” decisions and hope that it works out
in the end.
Example 7.1. Dijkstra is a greedy algorithm. This was a one-pass algorithm, and we made
a choice at each node.
This has a different flavor than divide and conquer. In divide and conquer, it might take
a bit of thought to see the correct algorithm to recursively solve the problem. For greedy
algorithms, on the other hand, they are really easy to propose. In addition, the running time
is very easy: sort things and do one pass. For divide and conquer, correctness was usually
mechanical, but for greedy algorithms, proving correctness can be tricky. There are often
greedy algorithms that seem correct but do not work. It takes practice to prove correctness:
there isn’t a simple template.
Here are some recurrent principles of correctness proofs. There are two methods. The
first method is the “greedy stays ahead” proof. This uses proof by induction, and Dijkstra
is the canonical example of this. The second method is the “exchange argument”. This is
what we will mainly discuss today. Ultimately, just do whatever works.
7.2. Applications. There are some applications of greedy algorithms.
Example 7.2. Optimal caching. We have different types of memory, in a memory hierarchy.
When programs request memory, we want it to be in the small fast memory. To get something
not in fast memory, we have to evict something from fast memory to make space. This is
an algorithmic problem.
Theorem 7.3. We kick out whatever we will need furthest in the future. This minimizes
the number of cache misses.
Not all greedy algorithms are correct, but this is correct. However, it requires full knowl-
edge of the future, which in practice doesn’t happen. It does provide a good benchmark for
comparing actual performance to what is optimal.
7.3. Scheduling. Today, we will discuss the application of scheduling.
Suppose we have a shared resource, and a lot of people or processes want to use this
resource. We want to sequence the jobs on the processor. Of course, this means that we
need an objective function. There are numerous ways to define an objective function; we’ll
consider one example.
The input is n jobs. Each has a weight wj (importance) and length lj (needed time). Our
goal is to sequence these jobs.
Definition 7.4. The completion time cj of a job j is the sum of the job lengths up to and
including j.
Example 7.5. Suppose we have three jobs, l1 = 1, l2 = 2, and l3 = 3, and they all have
equal weight w1 = w2 = w3 = 1. If we schedule in order 1, 2, 3, our completion time would
be c1 = 1, c2 = 3, and c3 = 6.
Problem 7.6. Our goal is to minimize the weighted sum of the completion times c_j:
min Σ_{j=1}^{n} w_j c_j.
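For a fixed ordering, the objective is easy to compute; here is a small Python sketch of the bookkeeping (mine, not the lecture's):

```python
def weighted_completion_time(jobs):
    """jobs: list of (weight, length) pairs, processed in the given order.
    Returns the sum of w_j * c_j, where c_j accumulates the lengths so far."""
    total, elapsed = 0, 0
    for w, l in jobs:
        elapsed += l          # completion time of this job
        total += w * elapsed
    return total
```

On Example 7.5, scheduling in order 1, 2, 3 gives completion times 1, 3, 6, so the objective is 1 + 3 + 6 = 10.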
[Figure: an example graph on vertices A, B, C, D with edge costs; the edge CD has cost 5.]
The output should be the min-cost tree T ⊂ E that spans all vertices.
We will assume that G is connected and all of the edges have distinct costs. It doesn’t
matter if there are ties; we just won’t consider that case here.
This is like Dijkstra, in that we have a mold growing with a frontier. The intuition and
the picture are more or less the same.
Algorithm 8.1 (Prim’s MST Algorithm).
• Initialize X = {s}. This is chosen arbitrarily.
• T = ∅. We have the invariant that X will be the vertices spanned so far by T.
• while X ≠ V
– Let e = (u, v) be the cheapest edge with u ∈ X, v ∉ X.
– Add e to T
– Add v to X
Example 8.2. In the graph above, starting at node A, Prim would select edges AB, BD,
and AC, in that order.
8.1. Correctness. We claim that Prim always computes a minimum spanning tree.
Definition 8.3. A cut of a graph is a nonempty partition of the vertices of a graph into two
pieces. An edge connecting vertices of the two partitions is called a crossing edge.
Lemma 8.4 (Empty cut lemma). A graph is not connected if and only if there exists a cut
(A, B) with no crossing edges.
Proof. This is easy. Check this at home and just talk through it.
Lemma 8.5 (Double crossing lemma). Suppose we have a cycle C ⊂ E of a graph, and this
cycle has an edge that crosses a cut (A, B). Then some other edge of C also crosses (A, B).
In general, a cycle crosses a cut an even number of times.
Corollary 8.6 (Lonely cut corollary). If e is the only edge crossing some cut (A, B), it is
not in any cycle.
Before we prove that Prim is a minimum spanning tree, we should first make sure that it
is actually a spanning tree.
Lemma 8.7. Prim outputs a spanning tree.
Proof. By induction, this maintains the invariant that T spans X. Do this in the privacy of
your own home.
Why can’t we get stuck with X ≠ V? This would mean that we failed to find an edge
crossing the frontier, which by the Empty Cut Lemma 8.4 would imply that G is not con-
nected.
Why are there no cycles? When e is added to T , it is the first edge of T to cross this
cut (X, V \ X). This means that its addition cannot create a cycle in T by the Lonely Cut
Corollary 8.6.
There are a lot of spanning trees, but Prim cleverly finds the best one.
Even though Prim looks really like Dijkstra, the proofs of correctness are very different.
For Dijkstra, we used a greedy stays ahead proof; for Prim, we use an exchange argument.
This is used to prove the following Cut Property.
Proposition 8.8 (Cut Property). Suppose that e is the cheapest edge of the graph G crossing
some cut (A, B). Then e belongs to every minimal spanning tree.
A greedy algorithm is something that makes myopic decisions. The cut property is pre-
cisely the type of property that we want to use to justify greedy algorithms.
Proposition 8.9. The cut property implies the correctness of Prim.
Proof. By our lemma 8.7, Prim outputs a spanning tree T∗.
Prim considers a sequence of n − 1 cuts, and it adds, by definition, the cheapest edge crossing
each cut. Each edge of T∗ is therefore justified by the Cut Property 8.8. Therefore, T∗ is
the (unique) minimal spanning tree.
Proof of the Cut Property 8.8. We will prove this by contradiction, using an exchange argu-
ment.
Suppose that e is the cheapest edge crossing some cut (A, B) but is not in a minimum
spanning tree T. T is connected, so the Empty Cut Lemma 8.4 implies that T contains an
edge f ≠ e that crosses (A, B).
We know that c_e < c_f. Then T ∪ {e} \ {f} has lower cost than the original T. However,
this might no longer be a spanning tree. We just need to be a little bit more careful.
[Figure: the tree T with cut (A, B); adding e creates a cycle containing another crossing edge e′, while f also crosses the cut.]
Is there some other swap that would work? Let C be the cycle created by adding e to T.
Now we apply the Double Crossing Lemma 8.5 to see that there is some other edge e′ of C
(also in T) that crosses (A, B).
Check that T′ = T ∪ {e} \ {e′} is a spanning tree. Since c_e < c_{e′}, this is cheaper than T,
which is a contradiction.
That was the mathematical part of the lecture. Let’s talk about implementation and
running time.
8.2. Running time. The naive running time is to have O(n) iterations and O(m) time per
iteration, so O(mn) overall.
Data structures play two roles. We will see data structures so that we know what they do.
They will also be used to speed up algorithms like Prim’s algorithm.
8.2.1. Speed up via heaps. Recall that heaps support the operations of Insert, Extract-Min,
and Delete. All of these operations take time linear in the height of the tree (to restore the
heap property after a change), so these are all O(log n).
There will be two invariants. The elements in the heap will be V \ X. For v ∈ V \ X, the
key value of v is the cheapest edge (u, v) with u ∈ X, or +∞ if no such edges exist.
Example 8.10. [Figure: a vertex v ∈ V \ X with three frontier-crossing edges, of costs 2, 4, and 5.] Here the key of v would be 2.
Note. Extract-Min yields the next vertex v and edge (u, v) to add to X and T .
The point is that this is like a two-round knockout tournament. First, we obtain minimum
edges locally at each vertex, and then we choose the minimum of the minimums, which is a
global minimum. This isn’t how it is implemented, but it’s a nice way to think about things.
The above note implies that we only have to do an Extract-Min at each iteration. We
should check that our invariants are preserved. The first invariant causes no problems. The
second invariant, though, might get broken. After one iteration, v is no longer in the heap.
The frontier advances, which means that the edges crossing the frontier changes; edges that
used to not cross the frontier will now cross the frontier, and key values can drop. In each
iteration, we also need to update our key values to preserve the invariants.
When v is added to X, only edges incident to v need to be updated.
• for each edge (v, w) ∈ E
– if w ∈ V \ X
∗ delete w from heap
∗ recompute key[w] = min(key[w], cvw )
∗ reinsert w into the heap
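Python's heapq offers no Delete operation, so a common variant of the delete/recompute/reinsert step (my sketch, not the lecture's; the edge costs in the usage below are assumptions matching Example 8.2) pushes a fresh entry whenever a key drops and discards stale entries on extraction:

```python
import heapq

def prim(graph, s):
    """graph: dict mapping each vertex to a list of (neighbor, cost) pairs,
    with every undirected edge listed in both directions; G assumed connected."""
    X = {s}                 # vertices spanned so far
    T = []                  # edges of the tree
    heap = [(cost, s, v) for v, cost in graph[s]]
    heapq.heapify(heap)
    while len(X) < len(graph):
        cost, u, v = heapq.heappop(heap)
        if v in X:
            continue        # stale entry: v crossed the frontier since this push
        X.add(v)
        T.append((u, v, cost))
        for w, c in graph[v]:       # only edges incident to v need updating
            if w not in X:
                heapq.heappush(heap, (c, v, w))
    return T
```

Each edge is pushed at most twice, so this keeps the O(m log n) bound derived below.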
8.3. Running time with heaps. In this implementation, all we do is use the heap, so the
running time is dominated by heap operations. We do one Extract-Min for each iteration,
for (n − 1) Extract-Mins. We will do the key restoration once for each edge: each time
when the first endpoint of the edge becomes incorporated into X. Each edge triggers one
Delete and Insert. So we have in total O(m) heap operations. (Recall that m ≥ n − 1 for
connected graphs, so O(m + n) = O(m).)
Each heap operation takes time log n, so this takes time O(m log n).
For Dijkstra, the same argument holds. The only difference is that we use the Dijkstra
greedy rule instead of the greedy edge crossing, so we let the key of v be the minimum of
A[u] + cuv over all edges (u, v) crossing the frontier.
9.1. Correctness.
Theorem 9.2. Kruskal is correct.
Proof. Let T ∗ be the output of Kruskal. There are clearly no cycles by the definition of the
algorithm.
We claim that T ∗ is connected. It is sufficient to show that T ∗ crosses every cut (by the
Empty Cut Lemma 8.4). So fix a cut (A, B).
By the Lonely Cut Corollary 8.6, Kruskal will include the first (i.e. cheapest) edge it sees
crossing (A, B), because this edge cannot yet participate in a cycle, and the only reason
Kruskal might skip an edge is because of cycles. This shows that Kruskal outputs a spanning
tree.
We now claim that every edge of T∗ is justified by the Cut Property 8.8. As we
discussed last time, this would imply that T∗ is the unique minimum spanning tree.
To see this final claim, suppose that (u, v) is an edge that Kruskal added to T. At this
time, T had no (u, v) path, since otherwise there would be a cycle. Therefore, the edges of T
chosen so far have u and v in different connected components. We can therefore
cut the graph with some cut (A, B) such that u ∈ A and v ∈ B and such that there are no
edges of T crossing it. Now, the greedy property of Kruskal will take the cheapest edge that
creates no cycles. Since there are no edges crossing (A, B), the first edge crossing (A, B)
creates no cycles, and therefore the edge (u, v) is added by Kruskal and justified by the Cut
Property 8.8.
In Prim’s Algorithm, it is obvious where the cuts are, while in Kruskal it is not obvious,
so this proof was a bit more subtle. Fundamentally, the Cut Property was what we needed.
9.2. Running time. First, we consider the naive running time of Kruskal’s algorithm. The
time to sort the m edges is O(m log m). We do O(m) loop iterations, and we need O(n) time
to check for cycles, using DFS or BFS in the graph (V, T), yielding O(mn) overall. We’ve
been getting spoiled by near linear time algorithms, and we’d like this to be near linear
as well. This takes a bit of work. For Prim and Dijkstra, this was done using heaps. Here,
we need a new data structure.
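The new data structure the notes allude to is presumably union-find; here is a sketch under that assumption (my own version, using path compression), which replaces the O(n) BFS/DFS cycle check with near-constant-time Find operations:

```python
def kruskal(n, edges):
    """n nodes labeled 0..n-1; edges: list of (cost, u, v) triples.
    An edge creates a cycle iff its endpoints already share a component,
    which union-find answers via two Find calls."""
    parent = list(range(n))

    def find(x):                      # follow pointers, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    T = []
    for cost, u, v in sorted(edges):  # the O(m log m) sort dominates
        ru, rv = find(u), find(v)
        if ru != rv:                  # no cycle: u, v in different components
            parent[ru] = rv           # union the two components
            T.append((cost, u, v))
    return T
```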
Example 11.1. [Figure: a four-vertex example graph.] This graph has a min cut with the bottom left vertex separated from everything else.
Algorithm 11.2 (Contraction Algorithm (Karger)).
• while > 2 nodes remain
– pick a remaining edge (u, v) uniformly at random
– merge u and v into a single node
– remove any self-loops
• return cut represented by final 2 nodes.
Example 11.3. In the graph in the example above, suppose we first picked the downward
edge on the left and contracted it. Picking the bottommost remaining edge next yields a
two-node graph, corresponding to the cut separating the top right node from the rest of the graph.
Note that since this is a randomized algorithm, the answer doesn’t have to come out to
the same thing. If in the second step we had chosen to use the downward edge, we would
have split the graph down the middle, which is incorrect.
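The contraction loop can be sketched in Python (my own representation: a leader pointer per node plus an edge list; the graph is assumed connected with no initial self-loops):

```python
import random

def karger_min_cut(edges):
    """edges: list of (u, v) pairs of a connected undirected graph.
    Repeatedly contract a uniformly random remaining edge until two
    supernodes remain; return the list of edges crossing that cut."""
    nodes = {u for e in edges for u in e}
    leader = {v: v for v in nodes}      # supernode pointers

    def find(v):                        # follow pointers to the supernode
        while leader[v] != v:
            v = leader[v]
        return v

    remaining = list(edges)             # invariant: contains no self-loops
    n = len(nodes)
    while n > 2:
        u, v = random.choice(remaining)
        leader[find(u)] = find(v)       # merge u's supernode into v's
        n -= 1
        # remove the self-loops created by the merge
        remaining = [e for e in remaining if find(e[0]) != find(e[1])]
    return remaining
```

Since a single run may fail, one repeats it many times and keeps the smallest cut found; that is exactly the amplification analyzed in Section 11.3.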
Question. We want to know the probability of success of this algorithm.
11.2. Review of probability.
Definition 11.4. The sample space Ω is the set of all possible outcomes of the algorithm’s
random choices. In Karger, Ω contains all possible random choices for edge contractions.
Definition 11.5. Also, each outcome i ∈ Ω has probability mass p(i) ≥ 0. We have the
constraint that Σ_{i∈Ω} p(i) = 1.
Example 11.7. Suppose that G = (V, E) has n vertices, m edges, and min cut (A, B) has
k crossing edges. We know that if the min cut algorithm picks one of these crossing edges,
then we would be guaranteed not to get the min cut (A, B). Conversely, if we never contract
any of these edges, we do get precisely the desired cut (A, B).
Let Si be the event that one of these k crossing edges gets contracted in the ith iteration.
Note that Pr[S1 ] = k/m.
We claim that in G, every node has degree deg(v) ≥ k; otherwise, we could simply cut
that node out of the graph for a cheaper cut. Therefore, the number of edges is at least m ≥ kn/2, because
each one of the n nodes has at least k incident edges and each edge is counted from exactly two nodes.
Therefore, Pr[S1] ≤ 2/n.
Definition 11.8. Let S and T be two events. Then the conditional probability is
Pr[S | T] = Pr[S ∩ T] / Pr[T].
Example 11.9. After the first contraction, n − 1 nodes remain, and at most m − 1 edges
(possibly fewer, due to discarding self-loops). Then
Pr[S2 | ¬S1] = k / (number of remaining edges).
Note that all nodes, including “supernodes” from contracted edges, induce cuts in G.
So, in every iteration of the algorithm, we maintain the invariant that every node of the
contracted graph has degree deg(v) ≥ k. Therefore, in the second iteration, the number of
remaining edges is ≥ k(n − 1)/2. Hence
Pr[S2 | ¬S1] ≤ 2/(n − 1).
Therefore,
Pr[¬S1 ∩ ¬S2] = Pr[¬S1] · Pr[¬S2 | ¬S1] ≥ (1 − 2/n)(1 − 2/(n − 1)).
We can of course iterate this process to get the probability of success, or the probability
of always missing the k crossing edges. This is
Pr[¬S1 ∩ ¬S2 ∩ · · · ∩ ¬S_{n−2}] ≥ (1 − 2/n)(1 − 2/(n − 1))(1 − 2/(n − 2)) · · · (1 − 2/4)(1 − 2/3)
= (n − 2)/n · (n − 3)/(n − 1) · (n − 4)/(n − 2) · · · 2/4 · 1/3 = 2/(n(n − 1)) ≥ 1/n².
We have just lower bounded the success probability of this algorithm: with probability at least 1/n², we end up with the min cut (A, B)! So we're happy, right? There's a minor problem: 1/n² is rather small. Happily, this isn't too hard to fix.
11.3. Improving Karger’s algorithm.
Definition 11.10. Events S, T ⊆ Ω are independent if and only if
Pr[S ∩ T ] = Pr[S] Pr[T ].
Equivalently (when the probabilities involved are positive), this means that Pr[S | T] = Pr[S] and Pr[T | S] = Pr[T].
Be very careful about assuming that things are independent. Here, we do something that
is by definition independent. We run Karger’s algorithm repeatedly on the same graph.
Suppose we run Karger’s algorithm N different times. Let Ti be the event that (A, B) is
found on the ith try. By definition, the different Ti ’s are independent. So the probability
that we are hoping won’t happen is
Pr[all N trials fail] = ∏_{i=1}^{N} (1 − Pr[Ti]) ≤ (1 − 1/n²)^N.
We want to say something about the rate at which this goes to zero.
Proposition 11.11. For all real numbers x, we have 1 + x ≤ ex .
Proof. To be rigorous, use the Taylor series expansion of e^x. Otherwise, just draw a picture: 1 + x is the tangent line to e^x at x = 0, and e^x is convex, so it lies above all of its tangent lines.
So, if we take N = n², then the probability
Pr[all N trials fail] ≤ (e^{−1/n²})^{n²} = 1/e.
If we take N = 2n² ln n, then
Pr[all N trials fail] ≤ (1/e)^{2 ln n} = 1/n²,
which is really tiny.
which is really tiny. By running many trials, we have improved a small success probability
to a small failure probability.
Now, let’s consider the running time of this algorithm. This is polynomial in n and m.
With a lot of clever tricks, this can be implemented to run in near-linear time.
We do not give a proof of correctness because such a proof does not exist. We can only
estimate the probability of correctness, as we did above. Even if we missed the minimum,
we’re still going to get the next-best cut. How to tune the failure probability is a domain-
dependent question. There exists an always-correct algorithm that runs in polynomial time.
12. 2/10: QuickSort
12.1. The QuickSort algorithm. Recall the linear-time selection Algorithm 4.2. There, we used a ChoosePivot function to partition the array and then recursed on one side. QuickSort builds on this idea of selecting pivots and recursing:
Algorithm 12.1.
• if n ≤ 1, return
• p = ChoosePivot(A, n)
• Partition around p.
• Recurse on the first part.
• Recurse on the second part.
Note that assuming that all of the subroutines run in place, the algorithm does too.
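Algorithm 12.1 can be sketched concretely as follows. This is our illustration, not the lecture's code: the partition subroutine is the standard in-place scheme, and ChoosePivot is a uniformly random index, anticipating the analysis below.

```python
import random

def quicksort(a, lo=0, hi=None):
    """In-place QuickSort with a uniformly random pivot (Algorithm 12.1)."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:                      # subarrays of size <= 1 are sorted
        return
    p = random.randint(lo, hi)        # ChoosePivot: uniform over A[lo..hi]
    a[lo], a[p] = a[p], a[lo]         # move the pivot to the front
    pivot, i = a[lo], lo
    for j in range(lo + 1, hi + 1):   # partition around the pivot
        if a[j] < pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[lo], a[i] = a[i], a[lo]         # pivot lands in its final position
    quicksort(a, lo, i - 1)           # recurse on the first part
    quicksort(a, i + 1, hi)           # recurse on the second part
```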
Proposition 12.2. The QuickSort algorithm 12.1 runs correctly.
Proof. Induction on n.
The running time of this algorithm depends on the quality of the pivot. In the worst case,
the algorithm would have time Ω(n2 ) because it chose bad pivots. With ideal pivots, such
as finding the running time by computing the linear-time median with Algorithm 4.2, the
running time would satisfy the recurrence T (n) ≤ 2T (n/2) + O(n), which yields O(n log n).
This is pretty good, but not what’s actually done in practice due to large constant factors.
We can do better. The hope is that a random pivot is “pretty good” “often enough.” It’s
really not obvious that this should work. The benefit is that choosing a pivot is constant
time, but our pivots clearly won’t be optimal.
Theorem 12.3. For every input array (of size n), the average running time of QuickSort
is O(n log n).
Notice that the theorem holds for arbitrary input, so we are still doing a worst-case
analysis. We average over the internal random choices of the algorithm and not over the
possible inputs. We are not doing “average-case analysis,” in which we would assume that
the data itself is random.
12.2. Probability review.
Definition 12.4. A random variable X is a real-valued function X : Ω → R.
Example 12.5. Consider an input array A. Let zi be the ith smallest element of A. Let Ω be
the sample space of all possible random choices QuickSort could make (i.e., pivot sequences).
Let Xij (σ) be the number of times zi , zj get compared in QuickSort with the pivot sequence
σ ∈ Ω.
Note that for all σ, Xij (σ) is either 0 or 1. This is because zi and zj are compared only
when one of them is the pivot, which is then excluded from all recursive calls. This type of
random variable is called an indicator random variable.
Definition 12.6. Let X : Ω → R be a random variable. Then the expectation E[X] of X is
the average value of X, i.e.
E[X] = Σ_{σ∈Ω} p(σ)X(σ).
Example 12.7. For Xij above, we have
E[Xij] = 0 · Pr[Xij = 0] + 1 · Pr[Xij = 1] = Pr[zi and zj get compared].
Proposition 12.8 (Linearity of Expectation). Let X1, X2, . . . , Xn be random variables on Ω, not necessarily independent. Then
E[Σ_{i=1}^{n} Xi] = Σ_{i=1}^{n} E[Xi].
Let C(σ) = Σ_{i=1}^{n} Σ_{j=i+1}^{n} Xij(σ) be the total number of comparisons made by QuickSort. By linearity of expectation,
(4) E[C] = Σ_{i=1}^{n} Σ_{j=i+1}^{n} E[Xij] = Σ_{i=1}^{n} Σ_{j=i+1}^{n} Pr[zi and zj get compared].
Lemma 12.9. The running time of QuickSort is dominated by the number of comparisons,
i.e. by C.
Proof. Just think about it. For full rigor, see the handout.
Proof of Theorem 12.3. To prove the running time of the QuickSort algorithm, we need to
show that (4) is O(n log n). Consider zi and zj for some fixed i < j. Take the set of elements
zi , zi+1 , . . . , zj−1 , zj , and consider the first of them to be chosen as a pivot.
If zi or zj are the first to be chosen as a pivot, then zi and zj get compared. If one of
zi+1 , . . . , zj−1 are first chosen as a pivot, then zi and zj are split into different recursive calls
and are never compared. Since pivots are chosen uniformly at random, each of these j − i + 1 elements is equally likely to be chosen first, so
Pr[zi and zj are compared] = 2/(j − i + 1).
There's still a calculation to be done. Plugging into equation (4), we have
E[C] = 2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} 1/(j − i + 1).
For each fixed i, the inner sum is ≤ 1/2 + 1/3 + 1/4 + · · ·, so
E[C] ≤ 2 · n · Σ_{k=2}^{n} 1/k.
We need a calculus fact:
Lemma 12.10.
Σ_{k=2}^{n} 1/k ≤ ln n.
Proof of Lemma. Consider a lower Riemann sum of the function f(x) = 1/x, and draw a picture! This gives
Σ_{k=2}^{n} 1/k ≤ ∫_1^n dx/x = ln x |_1^n = ln n.
Therefore,
E[C] ≤ 2 · n · ln n,
completing the proof of the running time of QuickSort.
12.3. Sorting lower bounds. Why can’t we have a linear O(n) time algorithm for sorting?
Theorem 12.11. Every “comparison-based” sorting algorithm has worst case running time
Ω(n log n).
Example 12.12. Examples of comparison-based sorting algorithms are MergeSort, Heap-
Sort, and QuickSort.
Example 12.13. Non-comparison based sorting algorithms include RadixSort, BucketSort,
and CountingSort. These are very useful, but you have to make assumptions about the input
data. When these assumptions are made, it is possible to bypass the Ω(n log n) lower bound.
Proof of Theorem 12.11. Suppose that our algorithm always uses ≤ k comparisons to sort arrays of length n. The only thing that the algorithm knows about the relative order of input elements comes from comparison outcomes. Using ≤ k comparisons implies that there are ≤ 2^k possible distinct results, so the algorithm can only differentiate between ≤ 2^k distinct relative orderings. Since we need to distinguish n! possible relative orderings, we must have 2^k ≥ n! ≥ (n/2)^{n/2}. Therefore,
k ≥ (n/2) log2(n/2) = Ω(n log n).
13. 2/15: Hashing
Today we will talk about hash tables, which are an important use of randomization in
data structures.
Definition 13.1. A hash table supports fast (in expectation) operations of
• Insert
• Delete
• Lookup.
They are sometimes called a “dictionary.”
A classic application of hash tables is the symbol table of a compiler.
For a hash table, we have a universe U of everything that might possibly be represented.
(For example, U might be all 232 IP addresses.) We want to maintain an evolving set S ⊆ U .
(For example, S might contain IP addresses of around 200 clients.)
13.0.1. Naive solutions. Here are some naive solutions to this problem. We could keep a bit vector A with A[x] = 1 if and only if x ∈ S. The problem is that this array is massive for a large universe: we get O(1) operations but need O(|U|) space.
Another method is to use a linked list. Then this requires O(|S|) space but O(|S|) lookup.
13.1. Hash tables. Our hash tables will require O(|S|) space and O(1) time for all opera-
tions (in expectation). Here’s the basic idea.
Pick n buckets, where n ≈ |S|. This might change over time, so in practice, we’d keep
track of this and rehash if necessary.
Definition 13.2. A hash function is a function h : U → {0, 1, . . . , n − 1}; it assigns each element of U to one of the n buckets.
We use an array A of length n and store x ∈ S in A[h(x)]. Therefore, the function h maps
the large U onto a much smaller array A. This is too good to be true, and indeed it is: We
need to worry about collisions, where two elements of S are sent to the same bucket. This
means that h(x) = h(y) for some x ≠ y.
To resolve collisions we can use the method called chaining, where we keep a linked list in
each bucket. Given x, we simply Insert, Delete, or Lookup for x in the list A[h(x)]; if there
are multiple elements in each bucket, we simply resort to the naive solution. There are other
ways of solving this problem, such as the method known as “open addressing.”
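A minimal sketch of chaining in Python (the class and method names are ours; Python lists stand in for the per-bucket linked lists, and Python's built-in hash stands in for h):

```python
class ChainedHashTable:
    """Hash table with chaining: one list ("linked list") per bucket."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, x):
        # h(x): Python's hash reduced modulo the number of buckets.
        return self.buckets[hash(x) % len(self.buckets)]

    def insert(self, x):
        if x not in self._bucket(x):     # avoid duplicates in the chain
            self._bucket(x).append(x)

    def delete(self, x):
        b = self._bucket(x)
        if x in b:
            b.remove(x)

    def lookup(self, x):
        # Falls back to a linear scan of the chain, i.e. the naive solution
        # restricted to one bucket.
        return x in self._bucket(x)
```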
13.2. Running time. Since all operations are operations on one of the linked lists, the
running time is determined by the length of the linked list in the relevant bucket. Therefore,
our running time is O(list length) for all operations. How well our algorithm does is therefore
dependent on how well h spreads things out between the different buckets. If we had a bad
hash function that sent everything to the first bucket, our hash table would be really slow.
Remark. It is not trivial to come up with good hash functions. Naive hash functions work
poorly in both theory and practice.
Example 13.3. In the IP address problem, a bad idea would be to take the most significant
or least significant 8 bits. We would have a lot of clumps, and we would end up with very
unbalanced buckets.
Example 13.4. In 2003, there was a paper by Crosby and Wallach. They used an “algo-
rithmic complexity” attack on a naive hash function in a network intrusion detection system.
What would be a good hash function? We would love to use some clever hash function
that would be guaranteed to spread every data set out evenly. Unfortunately, this is not
possible. For any hash function, there is a pathological data set for it, in which everything
devolves to the single linked list solution. Why is this true? Fix any (arbitrarily clever) hash function h : U → {0, 1, . . . , n − 1}, and assume |U| ≫ n. Imagine hashing every object of the universe into the buckets. By the pigeonhole principle, there is some bucket i such that at least |U|/n elements of U hash to i. A pathological data set is therefore any set drawn from these |U|/n elements of h^{−1}(i).
The solution is to design a family H of hash functions such that, for all data sets S,
“almost all” h ∈ H spreads S out evenly.
Definition 13.5. Let H be a set of hash functions from U to {0, 1, 2, . . . , n − 1}. H is universal if for all distinct x, y ∈ U, when h ∈ H is chosen uniformly at random, the probability of a collision satisfies Pr_{h∈H}[h(x) = h(y)] ≤ 1/n.
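A standard concrete universal family (not constructed in this lecture) for integer keys is h_{a,b}(x) = ((ax + b) mod p) mod n, with p a prime larger than every key and a, b drawn at random; a sketch:

```python
import random

def random_universal_hash(n, p=2**31 - 1):
    """Draw h uniformly from the family h_{a,b}(x) = ((a*x + b) % p) % n.

    p = 2**31 - 1 is a Mersenne prime, so this works for integer keys
    0 <= x < 2**31 - 1; pick a larger prime for larger universes.
    """
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % n
```

Each call returns one randomly chosen h from the family; a fresh draw per table is what the universality guarantee assumes.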
Theorem 13.6. If H is universal, then all operations will run in O(1) time (in expectation).
Here, we assume that |S| = O(n).
The proof is another example of the type of argument we saw for QuickSort, using the
decomposition principle and linearity of expectation.
Proof. Let S be the data set. We want to determine if x ∈ S; this means that we want to
look up x. The running time is O(list length in A[h(x)]). Let L = list length in A[h(x)].
This is the crucial random variable for the running time; we need to show that the expected
length of this list is constant.
For y ∈ S, y ≠ x, define the indicator random variables
Zy = 1 if h(x) = h(y), and Zy = 0 otherwise.
Observe that
L ≤ 1 + Σ_{y∈S, y≠x} Zy.
So
E[L] ≤ 1 + Σ_{y∈S, y≠x} E[Zy] = 1 + Σ_{y∈S, y≠x} Pr[h(x) = h(y)] ≤ 1 + Σ_{y∈S, y≠x} 1/n ≤ 1 + |S|/n = O(1).
Example 14.4. [Figure: a red-black tree on the keys 3, 5, 6, 7, 8.]
Proposition 14.5. A red-black tree with n nodes has height ≤ 2 log2 (n + 1).
Proof. Suppose every root-null path has ≥ k nodes. Then the tree includes a full binary tree of depth k near the root, so n ≥ 2^k − 1. Hence the minimum number of nodes on a root-null path is ≤ log2(n + 1). By the invariants, every root-null path has the same number of black nodes, which is therefore ≤ log2(n + 1); and since no two red nodes in a row are allowed, every root-null path has ≤ 2 log2(n + 1) total nodes.
Theorem 14.6. We can implement Insert/Delete such that the invariants are maintained,
and they take O(log n) time each.
Proof. The proof is complicated and we won’t cover this in lecture. See CLRS chapter 13.
The key primitive is the rotation. This is how we always keep search trees balanced: a rotation takes O(1) work and preserves the search tree property. The left rotation is the inverse of the right rotation.
[Figure: a rotation between the trees x(y(A, B), C) and y(A, x(B, C)), with subtrees A, B, C.]
The idea for the Insert/Delete operation is to recolor and rotate until invariants are re-
stored. For example, here’s what we do in the case of Insert.
• insert x as usual; this makes a leaf.
• try coloring it red.
• if parent y is black, we are done
• otherwise, y is red, so the grandparent w is black.
[Figure: x is a red child of the red node y; the grandparent w is black, with other child z.]
We have some cases. In Case 1, the grandparent's other child z is also red. So we recolor y and z to be black, and w to be red. This either fixes the double red or propagates it upward, which can happen only O(log n) times. If we reach the root, recolor it black.
In case 2, z is black or null. We can fix this in O(1) via 2 or 3 rotations and suitable
recoloring. See references for details.
14.3. Randomized alternative: Skip lists. Suppose we are storing S. Then level 0 is a sorted linked list of all of S. Level 1 is a sorted linked list of a random 50% of level 0, level 2 a random 50% of level 1, and so on.
Level 3: −∞ → 2
Level 2: −∞ → 2 → 5
Level 1: −∞ → 2 → 3 → 5
Level 0: −∞ → 1 → 2 → 3 → 4 → 5 → 6
The running time of this iterative version of the algorithm is obviously O(n), and the correctness proof is verbatim the same as for the recursive version.
We now want to present an algorithm to reconstruct the vertices of the maximum-weight independent set, given a complete array A. We claim that vertex vi belongs to a maximum-weight independent set of Gi if and only if
wi + (max weight of an independent set of Gi−2) ≥ (max weight of an independent set of Gi−1).
This is similar to our previous case analysis – check it.
Our reconstruction algorithm is then as follows:
Algorithm 15.4.
• set S = ∅ and i = n
• while i ≥ 1
– if wi + A[i − 2] ≥ A[i − 1], add vi to S and decrease i by 2.
– otherwise, decrease i by 1.
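The forward table-filling pass and Algorithm 15.4's backward reconstruction can be sketched together (our 0-indexed transcription; variable names are ours):

```python
def max_weight_independent_set(w):
    """Max-weight independent set on a path graph, with reconstruction.

    w[0..n-1]: nonnegative vertex weights along the path.
    A[i] = max weight of an independent set among the first i vertices.
    Returns (optimal weight, set of chosen 0-based vertex indices).
    """
    n = len(w)
    A = [0] * (n + 1)
    if n >= 1:
        A[1] = w[0]
    for i in range(2, n + 1):
        A[i] = max(A[i - 1], A[i - 2] + w[i - 1])

    # Reconstruction (Algorithm 15.4): walk backward through A.
    S, i = set(), n
    while i >= 1:
        prev2 = A[i - 2] if i >= 2 else 0
        if w[i - 1] + prev2 >= A[i - 1]:
            S.add(i - 1)          # vertex v_i belongs to an optimal set
            i -= 2
        else:
            i -= 1
    return A[n], S
```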
What we just did is the quintessential example of dynamic programming. Here are the
key ingredients:
• identify a small number of subproblems, e.g. find the maximum independent set in
Gi for all i.
• after solving all subproblems, we can quickly compute the final solution to the main
problem
• can quickly and correctly solve “larger” subproblems given solutions to “smaller sub-
problems”. This is usually via a recursive formula.
15.2. Knapsack problem. Imagine a burglar who breaks into a museum and wants to
escape with as much value as possible.
We have n items, with values v1, . . . , vn ≥ 0 and integral sizes w1, . . . , wn ≥ 0. Our knapsack has integral capacity W ≥ 0.
The goal is to find a subset S ⊆ {1, 2, . . . , n} that maximizes Σ_{i∈S} vi subject to the constraint Σ_{i∈S} wi ≤ W.
Step 1. We formulate a recurrence based on the structure of the optimal solution. Let S be the optimal solution. Note that either n ∈ S or n ∉ S. We have two cases:
Case 1: If n ∉ S, then S is optimal for the first n − 1 items with the same capacity W.
Case 2: Now suppose that n ∈ S. Then S − {n} is optimal for the first n − 1 items with the reduced capacity W − wn. To see why this is true, note that if some S∗ were better for that subproblem, then S∗ ∪ {n} would be better than S in the original problem, which is a contradiction.
Step 2. We relate the subproblems via a recursive formula. Let Vi,x be the value of the best solution using only a subset of the first i items and total weight ≤ x. We claim that for all i ∈ {1, 2, . . . , n} and all x ∈ {0, 1, . . . , W},
Vi,x = max{ Vi−1,x , Vi−1,x−wi + vi },
where the second option is available only when wi ≤ x. This is just mathematical notation for what we proved in Step 1.
This is just mathematical notation for what we proved in Step 1.
Step 3. Use the formula to systematically solve all subproblems.
Algorithm 15.5. Let A be a two-dimensional array, with base case A[0, x] = 0 for every x.
For i = 1, 2, . . . , n:
For x = 0, 1, 2, . . . , W:
A[i, x] = max{ A[i − 1, x] , vi + A[i − 1, x − wi ] },
taking only the first option when wi > x.
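A sketch of Algorithm 15.5 in Python (0-indexed; the two max cases correspond to skipping or taking item i):

```python
def knapsack(values, weights, W):
    """Knapsack DP: A[i][x] = best value achievable using a subset of the
    first i items with capacity x. Runs in O(n*W) time."""
    n = len(values)
    A = [[0] * (W + 1) for _ in range(n + 1)]   # base case A[0][x] = 0
    for i in range(1, n + 1):
        for x in range(W + 1):
            A[i][x] = A[i - 1][x]               # case 1: item i not used
            if weights[i - 1] <= x:             # case 2: item i used
                A[i][x] = max(A[i][x],
                              values[i - 1] + A[i - 1][x - weights[i - 1]])
    return A[n][W]
```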
15.2.1. Run time. The running time of this algorithm is O(nW ), or O(1) per array entry.
This has a dependence on W , which is bad if W is in the billions. There are heuristic
approaches to make this more efficient if W is too big.
Some dependence on W is unavoidable (unless P = NP): the running time O(nW) is only "pseudopolynomial," since W can be exponential in the number of bits of input. In fact, the Knapsack problem is NP-hard.
16.1.1. Optimal substructure. Consider the optimal alignment for X and Y and the final
positions of each string. We have three cases:
(1) xm and yn are matched
(2) xm is matched with a gap
(3) yn is matched with a gap.
If someone told us which case we were in, we would be done by recursion.
Proposition 16.2. Let X′ = X − xm and Y′ = Y − yn. If case 1 holds, then the induced alignment of X′ and Y′ is optimal. If case 2 holds, then the induced alignment of X′ and Y is optimal. If case 3 holds, then the induced alignment of X and Y′ is optimal.
Proof. We will prove case 1 here and leave the rest as exercises to the reader.
Suppose that the induced alignment of X′ and Y′ has penalty p, yet some other alignment has penalty p∗ < p. Appending xm and yn to the purportedly better alignment gives an alignment of X, Y with penalty p∗ + α_{xm yn} < p + α_{xm yn}. This is a contradiction.
The goal here is to think about all subproblems that we would ever need to solve. This is
very similar to the line graph problem, except we have two strings. The relevant subproblems
have the form (Xi , Yj ), where Xi are the first i letters of X and Yj are the first j letters of
Y.
Let Pij be the penalty of the best alignment of Xi and Yj. Then for all i, j ≥ 1,
Pij = min{ Pi−1,j−1 + α_{xi yj} , Pi−1,j + α_gap , Pi,j−1 + α_gap }.
The correctness of this recurrence is immediate from the optimal substructure. There are
only three candidates, and we brute-force search over them. The algorithm will simply again
be populating a suitable array with this recurrence.
Algorithm 16.3. We have a two-dimensional array A.
In the base case, we have A[i, 0] = A[0, i] = i · α_gap.
For i = 1, . . . , m:
For j = 1, . . . , n:
A[i, j] = min{ A[i − 1, j − 1] + α_{xi yj} , A[i − 1, j] + α_gap , A[i, j − 1] + α_gap }.
The correctness of this algorithm is true by induction and the correctness of our recurrence.
The running time is O(mn).
To reconstruct the solution:
• Trace back through A, starting at A[m, n].
– If A[i, j] was produced from case 1, match xi and yj, and go to A[i − 1, j − 1].
– If A[i, j] was produced from case 2, match xi with a gap, and go to A[i − 1, j].
– If A[i, j] was produced from case 3, match yj with a gap, and go to A[i, j − 1].
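Algorithm 16.3 can be sketched with one simple penalty choice, ours rather than the lecture's: α_gap = 1 and α_{xy} = 1 for x ≠ y, 0 otherwise, which makes P_{mn} the edit distance between the strings.

```python
def alignment_penalty(X, Y, gap=1, mismatch=1):
    """Sequence alignment DP (Algorithm 16.3) with uniform penalties:
    alpha_gap = gap, alpha_{xy} = mismatch if x != y else 0."""
    m, n = len(X), len(Y)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        A[i][0] = i * gap              # base case: X_i aligned against gaps
    for j in range(n + 1):
        A[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0 if X[i - 1] == Y[j - 1] else mismatch
            A[i][j] = min(A[i - 1][j - 1] + match,  # case 1: x_i with y_j
                          A[i - 1][j] + gap,        # case 2: x_i with a gap
                          A[i][j - 1] + gap)        # case 3: y_j with a gap
    return A[m][n]
```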
Remark. Why do we use the name dynamic programming? Bellman first created this
concept. He worked at RAND, and his superiors hated the word “research”. Therefore, he
hid the fact that he was doing research. At the time, “programming” was thought of as
tabulation and not coding, and also, “dynamic” cannot be used in a pejorative sense. That’s
why he named it “dynamic programming”.
16.2. Shortest paths. Recall the single source shortest path problem: Given a directed
graph G = (V, E) with edge lengths, compute the lengths of the shortest s − v path for all
v ∈ V . Recall that we already solved this problem using Dijkstra’s algorithm 5.4. However,
Dijkstra works only if all edge costs are nonnegative. Here are some drawbacks of Dijkstra:
(1) Some applications have negative edge lengths. For example, edges might correspond
to financial transactions.
(2) Dijkstra is also not very distributed. It needs the entire graph at each iteration,
which is not feasible for large graphs in applications such as internet routing.
The solution to these two problems is the Bellman-Ford algorithm. This is in fact how
internet routing is done, though with lots of tweaks.
Question. How should we define shortest paths when G has a negative cycle?
Example 16.4. The problem is that we can have a negative cycle:
[Figure: s points into a cycle whose edge costs 4, −6, −5, and 3 sum to a negative number.]
If we just want the shortest s − v path, with cycles allowed, we will fail because we would keep traversing the negative cycle forever, driving the length to −∞.
Instead, we could look for the shortest s − v path with no cycles allowed. Unfortunately,
this problem is NP-hard, which means that there are no polynomial time algorithms unless
P=NP.
Here, we will assume that no negative cycles exist. This means that there will be no
negative cycles in shortest paths. In fact, it is possible to modify the Bellman-Ford algorithm
(without changing the running time) to detect if a negative cycle exists.
16.2.1. Optimal substructure.
Lemma 16.5. Let G = (V, E) be a directed graph, with a source s and edges ce . We will
assume that there are no negative cycles. For some v ∈ V , let i ∈ {1, 2, . . . , n − 1}. Let P
be the shortest s − v path with at most i edges (and no cycles). Then we have two cases.
Case 1: If P has ≤ i − 1 edges, it is a shortest s − v path with ≤ i − 1 edges and no cycles.
Case 2: If P has i edges, with the last hop (w, v). Let P 0 = P − (w, v) be the path P with
the final edge (w, v) removed. Then P 0 is a shortest s − w path with ≤ i − 1 edges (and no
cycles).
Proof. We have to do something nontrivial in the proof because we need to use the assump-
tion that there are no negative cycles.
Case 1 follows by trivial contradiction, analogously to the problem of independent set.
Now, consider case 2. Assume for a contradiction that Q is shorter than P 0 , where Q is a
s − w path, has ≤ (i − 1) edges, and has no cycles. Then Q + (w, v) is an s − v path, has ≤ i
edges, and is shorter than P . We would be done, but what if Q passed through v already?
Then Q + (w, v) could have a cycle.
However, we can still get a contradiction. Since there are no negative cycles, any cycle in Q + (w, v) has nonnegative cost, so removing it yields an s − v path with ≤ i edges that is no longer, and hence still shorter than P. Contradiction.
Now, we should consider the running time of this algorithm. We still have constant
lookup time, but it does not take constant time to compute an element of the array because
it requires searching over all edges incoming to a vertex v. It therefore takes O(in-degree(v))
time to compute each A[i, v]. The total running time is therefore
O(n · Σ_{v∈V} in-degree(v)) = O(nm).
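The table-filling just analyzed, together with the stopping-early check of Section 17.2.1, can be sketched as follows. This is our reconstruction: the rolling one-dimensional array (keeping only A[i − 1, ·] and A[i, ·]) and the edge-list representation are implementation choices.

```python
def bellman_ford(n, edges, s):
    """Bellman-Ford: after round i, dist[v] equals A[i, v], the length of a
    shortest s-v path using at most i edges.

    n: number of vertices 0..n-1; edges: list of (w, v, cost) directed edges.
    Assumes there are no negative cycles.
    """
    INF = float("inf")
    dist = [INF] * n
    dist[s] = 0                       # A[0, s] = 0; A[0, v] = +inf otherwise
    for _ in range(n - 1):            # i = 1, ..., n-1
        new = dist[:]                 # case 1: inherit A[i-1, v]
        for (w, v, c) in edges:       # case 2: last hop (w, v)
            if dist[w] + c < new[v]:
                new[v] = dist[w] + c
        if new == dist:               # stopping early: nothing changed
            return dist
        dist = new
    return dist
```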
17.2. Optimizations.
17.2.1. Stopping early. Suppose that for i < n − 1, we have A[i, v] = A[i − 1, v] for all v. In
this case, the A[i, v] never change again, so we can safely halt.
Proof. Define d(v) = A[i, v] = A[i − 1, v]. Then our recurrence (5) becomes
d(v) = min{ d(v) , min_{(w,v)∈E} (d(w) + cwv) }.
Thus, for (w, v) ∈ E, we have that d(v) ≤ d(w) + cwv. For a cycle C, for each edge e = (w, v), we would therefore have d(v) − d(w) ≤ cwv.
[Figure: the directed cycle w → v → y → x → w.]
Then we have the inequalities
d(v) − d(w) ≤ cwv
d(y) − d(v) ≤ cvy
d(x) − d(y) ≤ cyx
d(w) − d(x) ≤ cxw .
Summing the inequalities then yields that
0 ≤ Σ_{e∈C} ce,
i.e. the cycle C has nonnegative total cost.
[Figure: an example traveling salesman instance on four cities] has a minimum cost tour of 13.
Conjecture 18.6 (Edmonds, 1965). There is no polynomial-time algorithm for the traveling
salesman problem.
This conjecture predates and is actually equivalent to P ≠ NP.
A good idea is to amass evidence of intractability via relative difficulty.
Definition 18.7. Problem Π1 reduces to Π2 if given a polynomial-time subroutine for Π2 ,
we can solve Π1 in polynomial time.
Example 18.8. We saw in lecture 4 that the median algorithm reduces to sorting.
Here, we will use the contrapositive of a reduction: if Π1 is not in P, then neither is Π2. This says that Π2 is at least as hard as Π1.
Definition 18.9. Let C be a set of problems. If Π ∈ C and everything in C reduces to Π, then Π is C-complete. Think of this as the "hardest problem in C".
We therefore want to show that the traveling salesman problem is C-complete for some
really big set C.
What if we were really ambitious, and asked for C being the set of all problems? This
doesn’t make any sense. The halting problem is not solvable, while the traveling salesman
problem is solvable by exponential-time brute force search. There are problems strictly
harder than the traveling salesman problem.
The refined idea is that the traveling salesman problem is as hard as all brute-force-solvable
problems.
Definition 18.10. A problem is in NP if
(1) correct solutions have polynomial length
(2) purported solutions can be verified in polynomial time.
Example 18.11. Is there a tour with length ≤ 1000?
Remark. Every NP problem is solvable by brute-force search.
This definition is so abstract and general that almost any problem you would ever see is
in NP. The vast majority of natural computational problems are in NP. All that we require
is that we can recognize a solution if it is given to us.
By definition, a polynomial-time algorithm for a single NP-complete problem actually
solves every NP problem. This would imply that P=NP. Thus, NP-completeness is strong
evidence of intractability.
Are there NP-complete problems at all?
Fact (Cook, Karp, Levin, early 1970s). NP-complete problems exist. In fact, there are
thousands of them, including the traveling salesman problem.
There is an easy recipe to prove that Π is NP-complete.
(1) Find a known NP-complete problem Π0 .
(2) Reduce Π0 to Π.
Question. Is P ≠ NP?
This is one of the Clay Institute’s Millennium Problems.
18.2. Coping with NP-completeness. There are still productive things that can be done
with NP-complete problems. Here are some possible approaches to NP-complete problems.
(1) Focus on a tractable special case (see Section 10.2 of the text, or Homework 7). See
also the maximum weight independent set on line graphs 15.1.
(2) Use heuristics to find fast algorithms that are not always correct.
(3) Find an exponential time algorithm that is better than brute-force search. For ex-
ample, for the Knapsack Problem 15.2, brute force search gives Ω(2n ) while dynamic
programming had running time O(nW ).
18.2.1. Minimum vertex cover. We are given an undirected graph G = (V, E). The goal is
to produce a minimum size vertex cover S ⊆ V with at least one endpoint of each edge.
Example 18.12.
[Figure: a star-like graph in which the center vertex • is adjacent to the surrounding vertices ◦.]
This graph has a minimum vertex cover of size 1, namely the center. For a complete graph on n vertices, the minimum vertex cover size is n − 1.
Fact. The minimum vertex cover problem is NP-complete.
Suppose that we instead want to determine if there exists a vertex cover with size ≤ k,
with k small. We could try all possibilities, and this would take running time Ω(nk ). We
would like something better.
Here’s a smarter method. Note that for any edge e = (u, v), a vertex cover contains either
u or v (or both). Define G0 to be what remains if we delete vertex u and all of its incident
edges. Analogously, define G00 to be what remains if we delete vertex v and all of its incident
edges.
Lemma 18.13. G has a vertex cover of size k if and only if G0 or G00 has a vertex cover of
size k − 1.
Proof. Exercise.
Algorithm 18.14.
(1) Pick any edge e = (u, v).
(2) Recursively search for a vertex cover S of size k − 1 in G0 . If found, return S ∪ {u}.
(3) Recursively search for a vertex cover S of size k − 1 in G00 . If found, return S ∪ {v}.
(4) If we haven’t found a vertex cover, such a vertex cover does not exist.
Consider the running time of this algorithm. The branching factor is two and the recursion depth is k, so there are at most 2^k function calls. Each call also does some overhead work, which we will sloppily call O(m). The running time is therefore O(2^k m).
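Algorithm 18.14 in sketch form (ours, not from the lecture; representing G′ and G″ by filtering the edge list is an implementation choice):

```python
def vertex_cover(edges, k):
    """Find a vertex cover of size <= k, or return None (Algorithm 18.14).

    edges: list of (u, v) pairs. Runs in O(2^k * m) time.
    """
    if not edges:
        return set()               # nothing left to cover
    if k == 0:
        return None                # edges remain but no budget left
    u, v = edges[0]                # pick any edge (u, v)
    for picked in (u, v):          # any cover must contain u or v
        # G' (resp. G''): delete the picked vertex and its incident edges.
        rest = [e for e in edges if picked not in e]
        S = vertex_cover(rest, k - 1)
        if S is not None:
            return S | {picked}
    return None                    # no size-k cover exists
```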
19. 3/8: Approximation algorithms
Last time, we talked about what to do about NP-complete problems. Using the example
of vertex cover, we looked at fully correct faster exponential algorithms. We also considered
solvable special cases. Today, we will discuss using efficient heuristics. We will relax the
requirement of perfect correctness. We will need to have some measure of how correct
our algorithms are. Ideally, our approximation algorithms should provide a performance
guarantee.
Knapsack problem. Recall the Knapsack Problem 15.2. We have n items, with weights w1, . . . , wn ≤ W, a total capacity W, and integral values v1, . . . , vn.¹ Our goal is to compute S ⊆ {1, 2, . . . , n} maximizing Σ_{i∈S} vi such that Σ_{i∈S} wi ≤ W.
19.1. Greedy algorithm. The motivation is that ideal items have big value and small
weight.
Algorithm 19.1.
Step 1: Sort and reindex the items by the ratio of value to weight, so that
v1/w1 ≥ v2/w2 ≥ · · · ≥ vn/wn.
Step 2: Then pack items in this order until one doesn’t fit, and then halt.
Example 19.2. Suppose W = 5 and
v1 = 5 w1 = 1
v2 = 4 w2 = 2
v3 = 3 w3 = 3
The greedy algorithm picks {1, 2}, which happily is optimal.
Example 19.3. Now, consider the example of W = 1000, with values and weights
v1 = 2 w1 = 1
v2 = 1000 w2 = 1000.
The algorithm picks the first item, with value 2. The optimal solution is to choose the second
item, with value 1000. So this algorithm might be arbitrarily bad.
There is a simple fix though.
Step 3: In the algorithm above, return either the previous solution or the maximum value
item, whichever is better. This guarantees that we achieve at least 50% of the optimal
solution.
Theorem 19.4. The value of the greedy solution with the preceding three steps is always ≥ 50% of the value of the optimal solution. This runs in O(n log n) time, and we call it a 1/2-approximation algorithm.
¹This is slightly different from our previous description of this problem.
Proof. Suppose that in Step 2, the greedy algorithm picks the first k items. What if we were allowed to completely fill the knapsack using a suitable "fraction" of item k + 1?
Exercise 19.5. Show that this fractional solution would be at least as good as every knapsack solution. This is a typical greedy exchange argument, as in 7.8.
Our greedy solution has
value ≥ value of the first k items, and
value ≥ value of item k + 1,
so therefore
2 × (greedy value) ≥ value of the greedy solution with fractional items ≥ value of the best knapsack solution.
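The three-step greedy algorithm can be sketched as follows (our transcription; it assumes every wi > 0 so the value/weight ratios are defined):

```python
def greedy_knapsack(values, weights, W):
    """Greedy 1/2-approximation for Knapsack (Algorithm 19.1 plus Step 3)."""
    # Step 1: reindex items by decreasing value-to-weight ratio.
    items = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    packed_value, used = 0, 0
    for i in items:                   # Step 2: pack until one doesn't fit
        if used + weights[i] <= W:
            packed_value += values[i]
            used += weights[i]
        else:
            break                     # then halt
    # Step 3: compare against the single most valuable item that fits.
    best_single = max((v for v, w in zip(values, weights) if w <= W),
                      default=0)
    return max(packed_value, best_single)
```

On Example 19.3 the packing step alone yields only 2, but Step 3 rescues the answer 1000.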
19.2. Dynamic programming. The goal is that given any user-specified ε > 0 (e.g. ε =
0.01), we want to get a ≥ (1 − ε)-approximation algorithm. The running time will increase
as ε → 0.
The idea is to exactly solve a slightly incorrect version of the problem.
Lemma 19.6. We presented a dynamic programming algorithm for Knapsack (with integral weights) in 15.5 earlier. Here is another Knapsack dynamic programming algorithm (for integral values) with running time O(n² vmax), where vmax = maxi vi.
Proof. Let A[i, x] be the minimum weight needed to achieve value ≥ x using the first i items.
This is now a standard dynamic programming solution.
We now present a rounding algorithm to get integer values from a general Knapsack
problem.
Algorithm 19.7.
(1) Round each vi down to the nearest multiple of m, then divide by m to get
v̂i = ⌊vi/m⌋.
These v̂i are integers.
(2) Compute the optimal solution for the v̂i ’s using the dynamic programming algorithm
from Lemma 19.6.
Note that after rounding,
mv̂i ≤ vi ≤ m(v̂i + 1).
We now do accuracy analysis on this algorithm.
19.2.1. Accuracy analysis. Let S ∗ be the optimal solution to the original problem. Let S be
our solution. Then
∑_{i∈S} v̂_i ≥ ∑_{i∈S∗} v̂_i.
Therefore,
∑_{i∈S∗} v_i ≤ ∑_{i∈S∗} m(v̂_i + 1) = m|S∗| + m ∑_{i∈S∗} v̂_i ≤ m|S∗| + m ∑_{i∈S} v̂_i ≤ m|S∗| + ∑_{i∈S} v_i.
Therefore,
∑_{i∈S} v_i ≥ ∑_{i∈S∗} v_i − m|S∗|.
Our goal is to have this quantity be ≥ (1 − ε) ∑_{i∈S∗} v_i, so choose m to guarantee that
m|S∗| ≤ ε ∑_{i∈S∗} v_i.
Taking m = εv_max /n suffices, since |S∗| ≤ n and ∑_{i∈S∗} v_i ≥ v_max (assuming every item fits in the knapsack by itself).
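Combining the rounding step with the value-indexed DP of Lemma 19.6 gives a sketch of the full (1 − ε)-approximation. Here m = ε·v_max/n is one standard choice of granularity, the DP is written in a compact one-dimensional form, and the function returns m·∑ v̂_i over the chosen set, a lower bound on the true value achieved:

```python
def knapsack_fptas(values, weights, capacity, eps):
    """(1 - eps)-approximation, assuming every item fits by itself."""
    n = len(values)
    m = eps * max(values) / n              # rounding granularity
    vhat = [int(v // m) for v in values]   # vhat_i = floor(v_i / m)
    top = sum(vhat)
    INF = float("inf")
    A = [0] + [INF] * top   # A[x] = min weight for rounded value >= x
    for v, w in zip(vhat, weights):
        for x in range(top, 0, -1):        # reverse scan: each item used once
            A[x] = min(A[x], w + A[max(0, x - v)])
    best = max(x for x in range(top + 1) if A[x] <= capacity)
    return m * best
```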
20.1. Bipartite matching. Suppose that we have a graph with sets of nodes U and V .
Assume that U and V have the same number of nodes2. Each edge has one vertex in U and
one vertex in V .
Our goal is to find a matching M ⊆ E such that each node is paired with at most one
other.
A matching is perfect if every node is matched, i.e. |M| = n.
Example 20.1. This is a perfect matching:
(Figure: two nodes on each side, with crossing edges pairing each node on the left with a node on the right.)
2This is almost never important; just add fictitious nodes to the smaller set.
20.1.1. Stable matching. Each node has a ranking of the nodes on the other side. For example,
one side might represent students and the other side might represent schools, where the
students are applying to schools.
The goal is to find a perfect matching such that if u and v are not matched to each other,
then either u likes his partner v′ more than v, or v likes her partner u′ more than u.
Example 20.2. Suppose the students are A and B and the schools are C and D. Each student
has school preferences C, D (preferring C), and each school has student preferences A, B
(preferring A). Then the matching A–C, B–D is the unique stable matching.
The matching A–D, B–C is unstable because both A and C would be happier if they were
matched to each other.
It is not even clear that a stable matching must exist. There is an algorithm that always produces one.
Algorithm 20.3 (Gale-Shapley Proposal Algorithm).
• while there exists an unpaired man u
– u proposes to top woman v who hasn’t rejected him yet
– each woman entertains only the best proposal received so far
Example 20.4. In Example 20.2, B proposes to C, and there is a tentative acceptance.
A proposes to C, and C accepts A and rejects B. Then B is free again, so B proposes to D,
who accepts. The algorithm then terminates.
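The proposal algorithm can be sketched as follows (a Python sketch with hypothetical names, not code from the lecture; preference lists are given as dicts, most preferred first, with equally many nodes on each side):

```python
from collections import deque

def gale_shapley(men_prefs, women_prefs):
    """Returns a stable matching as a dict woman -> man."""
    # rank[w][m]: position of m on w's list (lower is better).
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}  # next woman m will try
    engaged = {}                             # woman -> current fiance
    free = deque(men_prefs)
    while free:
        m = free.popleft()
        w = men_prefs[m][next_choice[m]]     # top woman not yet tried
        next_choice[m] += 1
        if w not in engaged:
            engaged[w] = m                   # tentative acceptance
        elif rank[w][m] < rank[w][engaged[w]]:
            free.append(engaged[w])          # w trades up; old man freed
            engaged[w] = m
        else:
            free.append(m)                   # rejected; m tries again
    return engaged
```

On Example 20.2 (both students prefer C, both schools prefer A), this returns the matching C–A, D–B.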
Theorem 20.5. The algorithm always terminates after O(n²) proposals, and it always terminates
with a stable matching.
Idea of proof. Each man proposes to each woman at most once, so there are at most n²
proposals in total, which proves the first part of the statement.
We always maintain the invariant that each node has at most one incident edge among the
tentative engagements, so the algorithm always terminates with a matching. Suppose that the
matching is not perfect. Then some man was rejected by every woman. Once a woman receives
a proposal she remains engaged for the rest of the algorithm, so all n women are engaged at
the end. Since there are as many men as women, all men are also paired up, a contradiction.
We should also prove that this gives a stable matching. This follows from the same type
of argument, and it is left as an exercise.
20.1.2. Maximum matching. We want to maximize the number of edges in a matching. We
want to reduce this to a different problem for which we have good algorithms. This problem
reduces to a problem called maximum flow 3.
3See CS 261.
20.2. Max flow. We have a directed graph G = (V, E) with a source s and a sink t. There
are edge capacities u_e ≥ 0. The goal is to route “flow” from s to t, sending the maximum
possible amount subject to the capacities.
Example 20.6. (Figure: a network with source s, sink t, and two intermediate vertices; unit-capacity edges form a top s–t path and a bottom s–t path, and a unit-capacity center edge connects the two intermediate vertices.)
The maximum flow would be to send one unit in the top path and the bottom path.
We now present an algorithm to do this. Pretend that we can go backward across an
edge. In the above example, take two zigzag paths that overlap in the middle path. Then
the paths cancel over the center edge, which yields the above max flow solution4.
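This augmenting-path idea can be sketched in Python (a hypothetical implementation of the Ford-Fulkerson method with breadth-first search, not code from the lecture); the backward residual edges are exactly what lets a later path “cancel” flow on the center edge:

```python
from collections import defaultdict, deque

def max_flow(edges, s, t):
    """edges: list of (u, v, capacity). Repeatedly finds a shortest
    s-t path with residual capacity and pushes flow along it."""
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v, c in edges:
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)                  # residual (backward) direction
    flow = 0
    while True:
        # BFS for an s-t path using only edges with residual capacity.
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow                # no augmenting path: done
        # Recover the path, find its bottleneck, update residuals.
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[e] for e in path)
        for u, v in path:
            cap[(u, v)] -= bottleneck
            cap[(v, u)] += bottleneck  # allow later cancellation
        flow += bottleneck
```

On the network of Example 20.6 (unit capacities everywhere), this returns a maximum flow of 2.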
20.2.1. Selfish flow. We have a flow network with a fixed and known number of commuters
traveling between two cities. Each edge has a cost function. We want to send
one unit of flow from s to t. 5
Example 20.7. Here, the cost x is the fraction of drivers on that route.
(Figure: source s and sink t connected by two two-edge routes through intermediate vertices; the top route is an edge with cost x followed by an edge with constant cost 1, and the bottom route is an edge with constant cost 1 followed by an edge with cost x.)
Over enough time, we will stabilize with half of the drivers taking the top route and half of
the drivers taking the bottom route. In this stable solution everyone takes 1.5 hours:
0.5 hours on the x edge and one hour on the fixed-cost edge.
Suppose that a Stanford student builds a teleportation device of cost 0, to get this graph:
(Figure: the same network, with a new zero-cost edge connecting the two intermediate vertices.)
4See CS 369 (advanced graph algorithms)
5See CS 364A (computational game theory).
Now, everyone wants to use the teleportation device, so everyone takes the top x path, the
bottom x path, and the teleportation device, for a total time of two hours. The commute
time increased with the addition of teleportation! This is Braess’s Paradox.
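A quick numeric check of the two equilibria, under the cost model of Example 20.7 (a sketch, with the equilibrium fractions hard-coded from the analysis above):

```python
def equilibrium_costs():
    """Commute time at equilibrium, without and with the teleporter."""
    # Without the teleporter: half the drivers take each route, so the
    # variable edge costs x = 0.5 and each commute is 0.5 + 1 = 1.5 hours.
    without = 0.5 + 1
    # With the zero-cost teleporter: everyone takes the top x edge, the
    # teleporter, then the bottom x edge, so x = 1 on both variable
    # edges and each commute is 1 + 0 + 1 = 2 hours.
    with_teleporter = 1 + 0 + 1
    return without, with_teleporter
```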
Remark. There is a physical interpretation of this paradox. Given some complicated physical
device with strings holding up a heavy weight, cutting a string might cause the weight to
move upward instead of downward. This is because the same equations govern the physical
weight and the commuters.
20.3. Linear programming. We want to optimize a linear function over the intersection of
half spaces. In two dimensions, each half-space is a half-plane. The set of feasible solutions
will form a polygon. The general idea is that we want to go as far as possible in a particular
direction. This generalizes max flow and tons of other stuff, and is one of the most commonly
used algorithms. 6
Algorithms, especially the simplex method, were invented by Dantzig. Linear program-
ming can be solved efficiently.
20.4. Computational Geometry. This deals with low-dimensional geometry, such as finding
convex hulls. In high-dimensional geometry, we are interested in finding the nearest neighbor.
There are cool data structure ideas in the solutions to these problems. 7
20.5. Approximation and randomized algorithms. A selection of applications include:8
• Heuristics for the Traveling Salesman Problem 18.4.
• Semidefinite programming
• Chernoff bounds
• Markov chains
20.6. Complexity. An important question is: What can’t algorithms do?
• Limits on algorithms
• Completeness for other complexity classes
• Inapproximability results
• Unique games conjecture.
20.7. Conclusion. It’s a joy to teach this class. We see many beautiful ideas. The computer
scientists who predated us were really smart, and we have a lot to learn from them. We should feel
smarter after this class.
E-mail address: moorxu@stanford.edu