Lect11 DP Lcs
Reading: This algorithm is not covered in KT or DPV. It is closely related to the Sequence
Alignment problem of Section 6.6 of KT and the Edit Distance problem in Section 6.3 of DPV.
Strings: One important area of algorithm design is the study of algorithms for character strings.
Finding patterns or similarities within strings is fundamental to various applications, ranging
from document analysis to computational biology. One common measure of similarity between
two strings is the length of their longest common subsequence. Today, we will consider an
efficient solution to this problem based on dynamic programming.
X   = A B R A C A D A B R A
LCS = A B A D A B A
Y   = Y A B B A D A B B A D O O
The Longest Common Subsequence Problem (LCS) is the following. Given two sequences
$X = \langle x_1, \ldots, x_m \rangle$ and $Y = \langle y_1, \ldots, y_n \rangle$, determine the length of their
longest common subsequence, and more generally the sequence itself. Note that the subsequence
is not necessarily unique. For example, the LCS of $\langle ABC \rangle$ and $\langle BAC \rangle$ is either
$\langle AC \rangle$ or $\langle BC \rangle$.
DP Formulation for LCS: The simple brute-force solution to the problem would be to try all
possible subsequences from one string, and search for matches in the other string, but this is
hopelessly inefficient, since there are an exponential number of possible subsequences.
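To make the cost of the brute-force approach concrete, here is a minimal Python sketch. The names `is_subsequence` and `brute_force_lcs_length` are illustrative and do not come from the lecture. It enumerates the subsequences of one string from longest to shortest and tests each against the other string, so in the worst case it examines all 2^m index subsets of a string of length m.

```python
from itertools import combinations


def is_subsequence(s, t):
    """Return True if s can be obtained from t by deleting characters."""
    it = iter(t)
    # "ch in it" advances the iterator until ch is found (or it is exhausted),
    # so the characters of s must appear in t in the same order.
    return all(ch in it for ch in s)


def brute_force_lcs_length(x, y):
    """Try every subsequence of x, longest first. Exponential in len(x)."""
    for k in range(len(x), -1, -1):
        for idx in combinations(range(len(x)), k):
            candidate = "".join(x[i] for i in idx)
            if is_subsequence(candidate, y):
                return k
    return 0
```

For the small example above, `brute_force_lcs_length("ABC", "BAC")` returns 2.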
Instead, we will derive a dynamic programming solution. In typical DP fashion, we need to
break the problem into smaller pieces. There are many ways to do this for strings, but it
turns out for this problem that considering all pairs of prefixes will suffice for us. A prefix of
a sequence is just an initial string of values, $X_i = \langle x_1, \ldots, x_i \rangle$. $X_0$ is the empty sequence.
The idea will be to compute the longest common subsequence for every possible pair of
prefixes. Let lcs(i, j) denote the length of the longest common subsequence of $X_i$ and $Y_j$. For
example, in the above case we have $X_5 = \langle ABRAC \rangle$ and $Y_6 = \langle YABBAD \rangle$. Their longest
common subsequence is $\langle ABA \rangle$. Thus, lcs(5, 6) = 3.
Let us start by deriving a recursive formulation for computing lcs(i, j). As we have seen
with other DP problems, a naive implementation of this recursive rule will lead to a very
inefficient algorithm. Rather than implementing it directly, we will use one of the other
techniques (memoization or bottom-up computation) to produce a more efficient algorithm.
Basis: If either sequence is empty, then the longest common subsequence is empty. Therefore,
lcs(i, 0) = lcs(0, j) = 0.
Last characters match: Suppose $x_i = y_j$. For example, let $X_i = \langle ABCA \rangle$ and let
$Y_j = \langle DACA \rangle$. Since both end in ‘A’, it is easy to see that the LCS must also end in ‘A’
(see footnote 1).
Also, there is no harm in assuming that the last characters of the two strings are
matched to each other, since matching the last ‘A’ of one string to an earlier instance of
‘A’ in the other can only limit our future options.
Since ‘A’ is the last character of the LCS, we may find the overall LCS by (1) removing
‘A’ from both sequences, (2) taking the LCS of $X_{i-1} = \langle ABC \rangle$ and $Y_{j-1} = \langle DAC \rangle$, which
is $\langle AC \rangle$, and (3) adding ‘A’ to the end. This yields $\langle ACA \rangle$ as the LCS. Therefore, the
length of the final LCS is lcs(i − 1, j − 1) + 1 (see Fig. 2), which provides
us with the following rule:
if ($x_i = y_j$) then lcs(i, j) = lcs(i − 1, j − 1) + 1
Fig. 2: The case $x_i = y_j$: lcs(i, j) = lcs(i − 1, j − 1) + 1.
Last characters do not match: Suppose that $x_i \neq y_j$. In this case $x_i$ and $y_j$ cannot both
be in the LCS (since each would have to be the last character of the LCS, and they differ).
Thus either $x_i$ is not part of the LCS, or $y_j$ is not part of the LCS (and possibly both are
not part of the LCS).
At this point it may be tempting to try to make a “smart” choice. By analyzing the
last few characters of Xi and Yj , perhaps we can figure out which character is best to
discard. However, this approach is doomed to failure (and you are strongly encouraged
to think about this, since it is a common point of confusion). Remember the DP selection
principle: When given a set of feasible options to choose from, try them all and take the
best. Let’s consider both options, and see which one provides the better result.
Option 1: ($x_i$ is not in the LCS) Since we know that $x_i$ is out, we can infer that the
LCS of $X_i$ and $Y_j$ is the LCS of $X_{i-1}$ and $Y_j$, whose length is lcs(i − 1, j).
Footnote 1: We leave the formal proof as an exercise, but intuitively it follows by contradiction:
if the LCS did not end in ‘A’, then we could make it longer by adding ‘A’ to its end.
Option 2: ($y_j$ is not in the LCS) Since $y_j$ is out, we can infer that the LCS of $X_i$ and
$Y_j$ is the LCS of $X_i$ and $Y_{j-1}$, whose length is lcs(i, j − 1).
Fig. 3: The case $x_i \neq y_j$: lcs(i, j) is the maximum of lcs(i − 1, j) (skip $x_i$) and
lcs(i, j − 1) (skip $y_j$).
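We compute both options and take the one that gives us the longer LCS (see Fig. 3), which
provides us with the following rule:
if ($x_i \neq y_j$) then lcs(i, j) = max(lcs(i − 1, j), lcs(i, j − 1))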
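Collecting the basis and the two cases, the full recursive rule can be written as a single formula. This is only a restatement of the rules derived above, for $0 \le i \le m$ and $0 \le j \le n$:

```latex
\[
\mathrm{lcs}(i, j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0, \\
\mathrm{lcs}(i-1, j-1) + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j, \\
\max\bigl(\mathrm{lcs}(i-1, j),\ \mathrm{lcs}(i, j-1)\bigr) & \text{if } i, j > 0 \text{ and } x_i \neq y_j.
\end{cases}
\]
```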
As mentioned earlier, a direct recursive implementation of this rule will be very inefficient.
Let’s consider two alternative approaches to computing it.
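As an illustration of memoization, the recursive rule can be cached so that each pair (i, j) is evaluated only once. The following is a sketch under stated assumptions, not the lecture's own code: the name `lcs_length_memo` is hypothetical, Python strings are 0-indexed (so $x_i$ is `x[i - 1]`), and `functools.lru_cache` stands in for an explicit memo table.

```python
from functools import lru_cache


def lcs_length_memo(x, y):
    """Length of the LCS of x and y via the memoized recursive rule."""

    @lru_cache(maxsize=None)
    def lcs(i, j):
        # Basis: an empty prefix has an empty LCS.
        if i == 0 or j == 0:
            return 0
        # Last characters match: x_i is x[i - 1] with 0-indexed strings.
        if x[i - 1] == y[j - 1]:
            return lcs(i - 1, j - 1) + 1
        # Last characters differ: try skipping x_i or y_j, keep the best.
        return max(lcs(i - 1, j), lcs(i, j - 1))

    return lcs(len(x), len(y))
```

There are (m + 1)(n + 1) distinct argument pairs, so the cached version does O(mn) work. The recursion can nest about m + n calls deep, so long inputs may exceed Python's default recursion limit.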
Bottom-up implementation: The alternative to memoization is to just create the lcs table in a
bottom-up manner, working from smaller entries to larger entries. By the recursive rules, in
order to compute lcs[i, j], we need to have already computed lcs[i − 1, j − 1], lcs[i − 1, j], and
lcs[i, j − 1]. Thus, we can compute the entries row-by-row or column-by-column in increasing
order. See the code block below and Fig. 4(a). The running time and space used by the
algorithm are both clearly O(mn).
Bottom-up Longest Common Subsequence
bottom-up-lcs() {
    lcs = new array [0..m, 0..n]
    for (i = 0 to m) lcs[i,0] = 0               // basis cases
    for (j = 0 to n) lcs[0,j] = 0
    for (i = 1 to m) {                          // fill rest of table
        for (j = 1 to n) {
            if (x[i] == y[j])                   // take x[i] (= y[j]) for LCS
                lcs[i,j] = lcs[i-1, j-1] + 1
            else
                lcs[i,j] = max(lcs[i-1, j], lcs[i, j-1])
        }
    }
    return lcs[m, n]                            // final lcs length
}
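For readers who want to run the algorithm, the pseudocode above translates almost line for line into Python. This is a minimal sketch: the function name `bottom_up_lcs` and the list-of-lists table are choices made here, and because Python strings are 0-indexed, the pseudocode's x[i] becomes `x[i - 1]`.

```python
def bottom_up_lcs(x, y):
    """Fill the (m+1) x (n+1) lcs table row by row; return the LCS length."""
    m, n = len(x), len(y)
    # Row 0 and column 0 are the basis cases and stay 0.
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # Take x_i (= y_j) for the LCS.
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return lcs[m][n]
```

On the sequences used in Fig. 4, `bottom_up_lcs("BACDB", "BCDB")` returns 4.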
Extracting the LCS: The algorithms given so far compute only the length of the LCS, not the
actual sequence. The remedy is common to many other DP algorithms. Whenever we make
a decision, we save some information to help us recover the decisions that were made. We
then work backwards, unraveling these decisions to determine all the decisions that led to the
optimal solution. In particular, the algorithm performs three possible actions:
addXY: Add $x_i$ (= $y_j$) to the LCS (‘↖’ in Fig. 4(b)) and continue with lcs[i − 1, j − 1]
skipX: Do not include $x_i$ in the LCS (‘↑’ in Fig. 4(b)) and continue with lcs[i − 1, j]
skipY: Do not include $y_j$ in the LCS (‘←’ in Fig. 4(b)) and continue with lcs[i, j − 1]
An updated version of the bottom-up computation with these added hints is shown in the
code block below and Fig. 4(b).
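One way the hinted bottom-up computation might look in Python is sketched below. Several details are assumptions made for this sketch rather than taken from the lecture: the function name `bottom_up_lcs_with_hints`, the string constants for the three actions, breaking ties in favor of skipX, and storing skipX in column 0 and skipY in row 0 so that a walk back from [m, n] can always reach [0, 0].

```python
ADD_XY, SKIP_X, SKIP_Y = "addXY", "skipX", "skipY"


def bottom_up_lcs_with_hints(x, y):
    """Return (lcs, h): the length table and the table of hints."""
    m, n = len(x), len(y)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    h = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        h[i][0] = SKIP_X          # y-prefix is empty: only x can be skipped
    for j in range(n + 1):
        h[0][j] = SKIP_Y          # x-prefix is empty: only y can be skipped
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
                h[i][j] = ADD_XY
            elif lcs[i - 1][j] >= lcs[i][j - 1]:   # tie goes to skipX (assumption)
                lcs[i][j] = lcs[i - 1][j]
                h[i][j] = SKIP_X
            else:
                lcs[i][j] = lcs[i][j - 1]
                h[i][j] = SKIP_Y
    return lcs, h
```

The hint table doubles the storage but keeps the asymptotic cost at O(mn) time and space.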
How do we use the hints to reconstruct the answer? We start at the last entry of the
table, which corresponds to lcs(m, n). In general, suppose that we are visiting the entry
corresponding to lcs(i, j). If h[i, j] = addXY, we know that $x_i$ (= $y_j$) is added to the LCS
sequence, and we continue with entry [i − 1, j − 1]. If h[i, j] = skipX, we know that $x_i$ is not
in the LCS sequence, and we continue with entry [i − 1, j]. If h[i, j] = skipY, we know that $y_j$
is not in the LCS sequence, and we continue with entry [i, j − 1]. Because the characters of
the LCS are generated in reverse order, we prepend each one to a sequence, so that when we
are done, the sequence is in proper order.
Fig. 4: Contents of the lcs array for the input sequences $X = \langle BACDB \rangle$ and $Y = \langle BCDB \rangle$. The
numeric table entries are the values of lcs[i, j] and the arrow entries are used in the extraction of
the sequence.
Extracting the LCS using the Hints
get-lcs-sequence() {
    LCS = new empty character sequence
    i = m; j = n                                // start at lower right
    while (i != 0 or j != 0) {                  // loop until upper left
        switch h[i,j] {
            case addXY:                         // add x[i] (= y[j])
                prepend x[i] (or equivalently y[j]) to front of LCS
                i--; j--; break
            case skipX: i--; break              // skip x[i]
            case skipY: j--; break              // skip y[j]
        }
    }
    return LCS
}
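The extraction loop can be tried end to end with the Python sketch below. It is self-contained, so it rebuilds the hint table in a compact helper. The names `build_hints` and `get_lcs_sequence` are hypothetical, ties are broken in favor of skipX, and row 0 and column 0 are assumed to hold skipY and skipX respectively, which the `while` condition relies on once one index reaches 0. Instead of prepending, the sketch appends each character and reverses once at the end, which gives the same sequence.

```python
ADD_XY, SKIP_X, SKIP_Y = "addXY", "skipX", "skipY"


def build_hints(x, y):
    """Bottom-up LCS that records which action produced each table entry."""
    m, n = len(x), len(y)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    h = [[SKIP_Y] * (n + 1) for _ in range(m + 1)]   # row 0 keeps skipY
    for i in range(1, m + 1):
        h[i][0] = SKIP_X                             # column 0 holds skipX
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                lcs[i][j], h[i][j] = lcs[i - 1][j - 1] + 1, ADD_XY
            elif lcs[i - 1][j] >= lcs[i][j - 1]:
                lcs[i][j], h[i][j] = lcs[i - 1][j], SKIP_X
            else:
                lcs[i][j], h[i][j] = lcs[i][j - 1], SKIP_Y
    return h


def get_lcs_sequence(x, y):
    """Walk the hints from [m, n] back to [0, 0] and return one LCS."""
    h = build_hints(x, y)
    chars = []
    i, j = len(x), len(y)            # start at lower right
    while i != 0 or j != 0:          # loop until upper left
        if h[i][j] == ADD_XY:
            chars.append(x[i - 1])   # x_i (= y_j) is part of the LCS
            i -= 1
            j -= 1
        elif h[i][j] == SKIP_X:
            i -= 1
        else:
            j -= 1
    chars.reverse()                  # characters were collected back to front
    return "".join(chars)
```

For the sequences of Fig. 4, `get_lcs_sequence("BACDB", "BCDB")` returns `"BCDB"`. When the LCS is not unique, which one is returned depends on the tie-breaking rule.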