SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES

cost alignment between sequences A and B. In honor of its inventor, this score is formally known as the generalized Levenshtein measure or distance between sequences A and B. Indeed, the measure D between sequences forms a metric space over sequences if the underlying scoring function δ forms a metric space over the underlying alphabet. Thus calling this measure a distance is formally correct for a wide class of scoring schemes δ.

Note immediately that the distance and similarity perspectives are complementary. To solve a difference problem, we need only revise our previous discussions by replacing maximum with minimum in every sentence and formula. Alternatively, one could simply take a δ for a difference problem and multiply every score by −1. Applying the similarity algorithm with the modified scores would produce optimal alignments for the original difference problem, and multiplying the resultant similarity score by −1 would give the distance between the two sequences.

Aligning More Than Two Sequences at a Time

Molecular biologists are frequently interested in comparing more than two sequences simultaneously. For instance, given a number of sequences of the same functionality, the similarity responsible for this common function is much more likely to be evident across the whole group than between any two sequences from the group. A closely related problem is to discover the evolutionary relationships between a set of sequences by constructing an evolutionary tree, or phylogeny, that minimizes the evolutionary changes that must have taken place along each branch of the tree. A third application for aligning a collection of sequences is to correct errors in the "raw" experimental data obtained in DNA sequencing experiments. Typically, 1 to 10 percent of the symbols in a sequenced fragment are incorrect, missing, or spurious.
These errors are detected and corrected by sequencing a given stretch several times and then forming a consensus by aligning the sequences. Figure 3.8 illustrates a multi-alignment of such sequence data.
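Forming a consensus from repeated reads can be illustrated with a minimal Python sketch. It assumes the reads have already been aligned into rows of equal length with dashes as gaps, takes the consensus of a column to be its most frequent symbol, and drops columns whose consensus is a dash. The function name `consensus` and the sample reads are hypothetical illustrations, not the data of Figure 3.8.

```python
from collections import Counter

def consensus(rows, gap="-"):
    """Consensus of equal-length aligned rows; columns won by the gap are skipped."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    out = []
    for column in zip(*rows):
        # Most frequent symbol in the column; ties go to the symbol seen first.
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != gap:
            out.append(symbol)
    return "".join(out)

# Hypothetical reads of one stretch, already aligned (assumed input).
reads = [
    "ACGT-ACGT",
    "ACGTTACGT",
    "ACCT-ACGT",
    "ACGT-AC-T",
    "ACGT-ACGT",
]
print(consensus(reads))  # ACGTACGT
```

Each isolated error (a spurious T, a substituted C, a missing G) is outvoted by the other four reads, which is why sequencing the same stretch several times allows such errors to be corrected.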
Figure 3.8 A multi-alignment of five DNA sequences.

Suppose we wish to align K sequences $A_1, A_2, \ldots, A_K$, where $A_i$ is of length $N_i$. As for the basic problem, we wish to arrange the sequences into a tableau using dashes to force the alignment of certain characters in given columns. For example, in Figure 3.8 the dashes are placed so as to arrange columns consisting of primarily one symbol. For each column, the consensus of the column is the symbol that occurs the greatest number of times in that column. Concatenating these consensus characters together, ignoring dashes, gives the consensus sequence for the five experimental trials.

As for pairwise alignments, each column of K symbols of the multi-alignment is scored according to a user-supplied function δ. For example, if δ is the number of symbols in the column not equal to the majority symbol of the column (which can be a dash), then the multi-alignment of Figure 3.8 has score 13, and this is the minimum possible score over all possible multi-alignments of the five sequences.

The problem of finding a maximum (minimum)-scoring alignment among K sequences can be solved by extending the dynamic programming recurrence for the basic problem from a recurrence over a two-dimensional matrix to a recurrence over a K-dimensional matrix. Let $i = (i_1, i_2, \ldots, i_K)$ be a vector in K-dimensional Cartesian space. Now we compute a K-dimensional array S, where S(i) is the score of the best alignment among the prefix sequences formed by the first $i_1$ symbols of $A_1$, the first $i_2$ symbols of $A_2$, and so on through the first $i_K$ symbols of $A_K$. The central recurrence now becomes

$$S(i) = \max_{e \in \{0,1\}^K,\ e \neq 0} \{\, S(i - e) + \delta(c(i, e)) \,\}$$

where e ranges over the nonzero 0/1 vectors of length K for which $i - e$ has no negative component, and c(i, e) denotes the column whose k-th entry is the $i_k$-th symbol of $A_k$ if $e_k = 1$ and a dash if $e_k = 0$. The base case is S(0) = 0, and max is replaced by min for a difference problem.
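A minimal Python sketch of this K-dimensional dynamic program follows. It is an illustration under stated assumptions, not the chapter's own code: it takes the difference (minimizing) perspective, uses the example column score from the text (the number of symbols not equal to the column's majority symbol), and returns only the optimal score rather than the alignment itself. The names `column_cost` and `multi_align_cost` are hypothetical.

```python
from collections import Counter
from itertools import product

def column_cost(column):
    """Symbols in the column that differ from its majority symbol (a dash counts)."""
    return len(column) - max(Counter(column).values())

def multi_align_cost(seqs, cost=column_cost, gap="-"):
    """Minimum total column cost over all multi-alignments of seqs."""
    k = len(seqs)
    # The 2^K - 1 nonzero 0/1 vectors: which sequences supply a symbol
    # (rather than a dash) to the last column of the alignment.
    steps = [e for e in product((0, 1), repeat=k) if any(e)]
    best = {(0,) * k: 0}
    # product() yields index vectors in lexicographic order, so every
    # predecessor i - e is already filled in when i is visited.
    for i in product(*(range(len(s) + 1) for s in seqs)):
        if not any(i):
            continue  # the origin is the base case
        candidates = []
        for e in steps:
            if any(ek > ik for ek, ik in zip(e, i)):
                continue  # this step would leave the grid
            prev = tuple(ik - ek for ik, ek in zip(i, e))
            column = tuple(s[ik - 1] if ek else gap
                           for s, ik, ek in zip(seqs, i, e))
            candidates.append(best[prev] + cost(column))
        best[i] = min(candidates)
    return best[tuple(len(s) for s in seqs)]

print(multi_align_cost(["ACGT", "AGT", "ACT"]))  # 2
print(multi_align_cost(["kitten", "sitting"]))   # 3
```

With two sequences this column score reduces to the unit-cost edit distance, which is why the second call returns 3. The table holds one entry per index vector and each entry examines up to 2^K − 1 predecessors, so the sketch is only usable for a handful of short sequences.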
In terms of an edit graph model, imagine a grid of vertices in K-dimensional space where each vertex i has $2^K - 1$ edges directed into it, each corresponding to a column that, when appended to the alignment for the edge's tail, gives rise to the alignment for the prefix sequences represented by i. Computing the S values in some topological ordering requires a total of $O(N^K)$ time, where $N = \max_j N_j$. While multiple sequence comparison algorithms of this genre are conceptually straightforward, they take an exponential amount of time in K and are thus generally impractical for K > 3. Multiple sequence comparison has been shown to be NP-complete (Garey and Johnson, 1979), which means that it is almost surely the case that any algorithm for this problem must exhibit time behavior that is exponential in K.

Thus many authors have sought heuristic approximations, the most popular of which is to take $O(K^2 N^2)$ time to compute all pairwise optimal alignments between the K sequences, and then produce a multiple sequence alignment by merging these pairwise alignments. Note that any multiple sequence alignment induces an alignment between a given pair of sequences (take the two rows of the tableau and remove any columns consisting of just dashes). However, given all of the possible K(K − 1)/2 pairwise alignments between K sequences, it is almost always impossible to arrange a multi-alignment consistent with them all. Try, for example, merging the best pairwise alignments among ACG, CGA, and GAC. But, given any K − 1 alignments relating all the sequences (that is, a spanning tree of the complete graph of sequence pairs), it is always possible to do so. Feng and Doolittle (1987) compare a number of methods based on this approach.
The most recent algorithms utilize the natural choice of the K − 1 alignments whose scores sum to the minimal possible amount (that is, a minimum spanning tree of the complete graph of sequence pairs). However, such merges do not always lead to optimal alignments, as is illustrated by the following example: