Sequence Alignment
Types of alignment:
• Based on completeness: Global, Local
• Based on number of sequences: Pairwise, Multiple
Local Vs. Global
Methods and Algorithms
• Dot Matrix Methods
• Dynamic Programming
• Progressive methods: Clustal, Tcoffee
• Iterative methods
• FASTA, BLAST
Dot Plot Matrix
• A dot matrix analysis is a method for comparing two sequences to look for possible alignment
(Gibbs and McIntyre 1970)
• One sequence (A) is listed across the top of the matrix and the other (B) is listed down the left
side
• Starting from the first character in B, one moves across the page keeping in the first row and
placing a dot in every column where the character in A is the same
• The process is continued until all possible comparisons between A and B are made
• For each position in the grid, compare the sequence elements at the
top (column) and to the left (row). If and only if they are the same,
place a dot at that position.
Dot Plot
• Dot plots are two-dimensional graphs showing a comparison of two
sequences.
• The principle used to generate the dot plot is: the top (x) and left (y) axes
of a rectangular array are used to represent the two sequences to be
compared.
• Calculation:
• Matrix
• Columns = residues of sequence 1
• Rows = residues of sequence 2
• A dot is plotted at every co-ordinate where there is similarity between the
bases.
Identical Sequences
Seq1: MALWGRL
Seq2: MALWGRL
  M A L W G R L
M *
A   *
L     *
W       *
G         *
R           *
L             *
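The construction rule above (a dot wherever the column residue equals the row residue) can be sketched in a few lines of Python. The function names `dot_matrix` and `render` are illustrative, not taken from any dot plot package.

```python
def dot_matrix(seq1, seq2):
    """Return a grid with one row per residue of seq2 and one column per residue of seq1."""
    return [[a == b for a in seq1] for b in seq2]


def render(seq1, seq2):
    """Format the dot matrix as text, marking each match with '*'."""
    grid = dot_matrix(seq1, seq2)
    lines = ["  " + " ".join(seq1)]
    for residue, row in zip(seq2, grid):
        lines.append(residue + " " + " ".join("*" if hit else " " for hit in row))
    return "\n".join(lines)


print(render("MALWGRL", "MALWGRL"))
```

Applied strictly, the rule also marks the two off-diagonal L/L matches (the third and seventh residues are both L), in addition to the main diagonal drawn in the example.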
Dotplots
Multiple diagonals indicate repetition
Analysis of Dot Plot Matrix
• A region of similarity appears as a diagonal run of dots.
• The principal diagonal shows identical sequences.
• Global and local alignments are shown.
• Multiple diagonals indicate repetition.
• A reverse diagonal (perpendicular to the main diagonal) indicates INVERSION.
• A reverse diagonal crossing the main diagonal (X) indicates PALINDROMES.
• Formation of a box indicates a low complexity region.
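As a rough illustration of reading diagonals programmatically, the sketch below scans every diagonal of the dot matrix and reports runs of consecutive matches. The function name `diagonal_runs` and the `min_len` threshold are assumptions made for this example; real dot plot tools usually apply window and stringency filtering instead.

```python
def diagonal_runs(seq1, seq2, min_len=3):
    """Find runs of consecutive matches along diagonals of the dot matrix.

    Returns (row_start, col_start, length) tuples, where rows index seq2
    and columns index seq1.
    """
    runs = []
    rows, cols = len(seq2), len(seq1)
    # Each diagonal is identified by its offset d = column - row.
    for d in range(-(rows - 1), cols):
        r, c = max(0, -d), max(0, d)
        length = 0
        while r < rows and c < cols:
            if seq2[r] == seq1[c]:
                length += 1
            else:
                if length >= min_len:
                    runs.append((r - length, c - length, length))
                length = 0
            r += 1
            c += 1
        # A run may extend to the edge of the matrix.
        if length >= min_len:
            runs.append((r - length, c - length, length))
    return runs


print(diagonal_runs("ABCABC", "ABCABC"))
```

For the repeated sequence ABCABC compared with itself, the sketch reports the main diagonal plus two shorter parallel diagonals, which is the "multiple diagonals indicate repetition" pattern.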
Repeats
Dot Plot Software
• Needleman-Wunsch: pairwise global alignment only.
• Smith-Waterman
• Differences:
  • Different scoring matrices
  • Gap penalty functions
  • Sequence coverage
Dynamic Programming
• Break problem into overlapping subproblems
• Use memoization: remember solutions to
subproblems that we have already seen
Fibonacci example
• 1,1,2,3,5,8,13,21,...
• fib(n) = fib(n - 2) + fib(n - 1)
• Could implement as a simple recursive function
• However, complexity of simple recursive function is
exponential in n
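To make the exponential cost concrete, here is a minimal sketch of the simple recursive function with a call counter added. The counter argument is an illustrative device, and the base case fib(1) = fib(2) = 1 follows the sequence listed above.

```python
def fib_naive(n, counter):
    """Recursive Fibonacci with fib(1) = fib(2) = 1; counter[0] tallies the calls made."""
    counter[0] += 1
    if n <= 2:
        return 1
    return fib_naive(n - 2, counter) + fib_naive(n - 1, counter)


for n in (10, 20, 25):
    calls = [0]
    value = fib_naive(n, calls)
    print(n, value, calls[0])
```

The number of calls grows in proportion to fib(n) itself, because the same subproblems are recomputed many times.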
Fibonacci dynamic programming
• Two approaches
1. Memoization: store results from previous calls of the function in a table (top-down
approach)
2. Solve subproblems from smallest to largest, storing results in a table (bottom-up
approach)
• Both require evaluating all (n-1) subproblems only once: O(n)
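Both approaches can be sketched as follows. `functools.lru_cache` stands in for the memoization table in the top-down version; the function names are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_memo(n):
    """Top-down: results of earlier calls are kept in a cache and reused."""
    if n <= 2:
        return 1
    return fib_memo(n - 2) + fib_memo(n - 1)


def fib_bottom_up(n):
    """Bottom-up: fill a table from the smallest subproblem to the largest."""
    table = [0, 1, 1]  # table[k] holds fib(k); index 0 is a placeholder
    for k in range(3, n + 1):
        table.append(table[k - 2] + table[k - 1])
    return table[n]


print(fib_memo(8), fib_bottom_up(8))
```

The bottom-up version avoids recursion entirely, which matters in Python because very deep recursion in the memoized version can hit the interpreter's recursion limit.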
Dynamic Programming Graphs
• Dynamic programming algorithms can be
represented by a directed acyclic graph
• Each subproblem is a vertex
• Direct dependencies between subproblems are edges
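As an illustration of the graph view, the sketch below builds the dependency graph for the Fibonacci subproblems 1 to 6 and evaluates them in a topological order, using `graphlib.TopologicalSorter` from the Python standard library (3.9 or later). The helper name `fib_dependency_graph` is made up for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def fib_dependency_graph(n):
    """Map each subproblem k to the set of subproblems it directly depends on."""
    return {k: ({k - 1, k - 2} if k > 2 else set()) for k in range(1, n + 1)}


graph = fib_dependency_graph(6)
# static_order() lists every vertex after all of the vertices it depends on.
order = list(TopologicalSorter(graph).static_order())

values = {}
for k in order:
    deps = graph[k]
    values[k] = sum(values[d] for d in deps) if deps else 1

print(order, values[6])
```

Any topological order of the graph is a valid order for solving the subproblems, which is what a bottom-up dynamic programming algorithm relies on.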
• In a top-down recursive approach we can use memoization to create a potentially large dictionary indexed by
each of the subproblems that we are solving (aligned sequences).
• This needs O(n^2 m^2) space if we index each subproblem by the start and end points of the subsequences
for which an optimal alignment needs to be computed.
• The advantage is that we solve each subproblem at most once: if it is not in the dictionary, the problem gets
computed and then inserted into the dictionary for further reference.
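A minimal sketch of the top-down approach for the global alignment score is shown below. Two assumptions to note: subproblems are indexed by prefix lengths (i, j) rather than by start and end points, so the cache holds at most (n+1)(m+1) entries instead of the O(n^2 m^2) mentioned above, and the scoring values (+1 match, -1 mismatch, -2 per gap) are illustrative defaults.

```python
from functools import lru_cache


def alignment_score_memo(x, y, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of x and y, computed top-down."""

    @lru_cache(maxsize=None)  # the cache plays the role of the dictionary
    def best(i, j):
        # best(i, j): best score for aligning the first i characters of x
        # with the first j characters of y.
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        pair = match if x[i - 1] == y[j - 1] else mismatch
        return max(best(i - 1, j - 1) + pair,
                   best(i - 1, j) + gap,
                   best(i, j - 1) + gap)

    return best(len(x), len(y))


print(alignment_score_memo("AAAC", "AGC"))
```

Each (i, j) pair is computed at most once; later requests are answered from the cache. For long sequences the recursion depth becomes a practical limit, which is one reason to prefer an iterative formulation.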
Dynamic Programming
In a bottom-up iterative approach we can use dynamic programming. We define the order of computing sub-
problems in such a way that a solution to a problem is computed once the relevant sub-problems have been
solved.
In particular, simpler sub-problems will come before more complex ones. This removes the need for keeping
track of which sub-problems have been solved (the dictionary in memoization turns into a matrix) and ensures
that there is no duplicated work (each sub-alignment is computed only once).
Pairwise Alignment Via
Dynamic Programming
Global Alignment
Needleman-Wunsch Algorithm
• The Needleman–Wunsch algorithm is an algorithm used in bioinformatics
to align protein or nucleotide sequences.
alignment of these prefixes + aligning this pair:
AAA + C      AAAC + -      AAA + C
AG  + C      AG   + C      AGC + -
DP Algorithm for Global Alignment
with Linear Gap Penalty
• Subproblem: F(i,j) = score of best alignment of the length i
prefix of x and the length j prefix of y.
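The slide text defines the subproblem but does not spell out the recurrence. For reference, the standard Needleman-Wunsch recurrence with a linear gap penalty can be written as below, where S is the substitution score and s is the score added per gap position (a negative number); both symbols are assumed to match the notation used on the surrounding slides.

```latex
F(i, 0) = i \cdot s, \qquad F(0, j) = j \cdot s

F(i, j) = \max
\begin{cases}
F(i-1, j-1) + S(x_i, y_j) & \text{align } x_i \text{ with } y_j \\
F(i-1, j) + s             & \text{align } x_i \text{ with a gap} \\
F(i, j-1) + s             & \text{align } y_j \text{ with a gap}
\end{cases}
```

The three cases correspond to the three ways the last column of an alignment of the two prefixes can look.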
Dynamic Programming
Implementation
• given an n-character sequence x, and an m-character
sequence y
• construct an (n+1) × (m+1) matrix F
• F(i, j) = score of the best alignment of
x[1…i] with y[1…j]
Example (x = AAAC down the rows, y = AGC across the columns): the cell in row 3, column 2 holds the
score of the best alignment of AAA to AG.
Initializing Matrix: Global Alignment with
Linear Gap Penalty
          A    G    C
     0    s   2s   3s
A    s
A   2s
A   3s
C   4s
DP Algorithm Sketch:
Global Alignment
• initialize first row and column of matrix
• fill in rest of matrix from top to bottom, left to right
• for each F ( i, j ), save pointer(s) to cell(s) that
resulted in best score
• F(n, m) holds the optimal alignment score; trace
pointers back from F(n, m) to F(0, 0) to recover
the alignment
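The sketch above translates fairly directly into code. The following is a minimal illustration, not a reference implementation: the default scores (+1 match, -1 mismatch, -2 per gap) are assumptions, and only one back-pointer is stored per cell, with ties broken in the fixed order diagonal, up, left.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    n, m = len(x), len(y)
    # F has (n + 1) rows and (m + 1) columns; ptr stores one back-pointer per cell.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]

    # Initialize the first column and first row with cumulative gap scores.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = j * gap, "left"

    # Fill the rest of the matrix from top to bottom, left to right.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + pair, "diag"),
                (F[i - 1][j] + gap, "up"),
                (F[i][j - 1] + gap, "left"),
            ]
            # max() keeps the first best candidate, so ties favor diag, then up.
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Trace pointers back from F[n][m] to F[0][0].
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AAAC", "AGC"))
```

Filling the matrix takes time and space proportional to (n+1)(m+1). Storing a single pointer per cell recovers one optimal alignment only; keeping all tied pointers is needed to recover the others.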
Global Alignment Example
• suppose we choose the following scoring scheme:
S(x_i, y_j) =
+1 when x_i = y_j
-1 when x_i ≠ y_j
• the first row and column of the matrix in this example correspond to a gap penalty of s = -2
Global Alignment Example
          A    G    C
     0   -2   -4   -6
A   -2    1   -1   -3
A   -4   -1    0   -2
A   -6   -3   -2   -1
C   -8   -5   -4   -1
one optimal alignment
x: A A A C
y: A G - C
but there are three optimal alignments here (can you find them?)
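One way to check the count is to follow every tied pointer during traceback. The sketch below recomputes the matrix with the example's scores (+1, -1, and -2 per gap) and enumerates all optimal alignments; the function name `all_optimal_alignments` is illustrative.

```python
def all_optimal_alignments(x, y, match=1, mismatch=-1, gap=-2):
    """Return (score, alignments), listing every optimal global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    def pair(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + pair(i, j),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    def walk(i, j):
        """Yield every optimal (aligned_x, aligned_y) for the prefixes x[:i] and y[:j]."""
        if i == 0 and j == 0:
            yield "", ""
            return
        # Follow each predecessor whose score explains F[i][j].
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair(i, j):
            for ax, ay in walk(i - 1, j - 1):
                yield ax + x[i - 1], ay + y[j - 1]
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            for ax, ay in walk(i - 1, j):
                yield ax + x[i - 1], ay + "-"
        if j > 0 and F[i][j] == F[i][j - 1] + gap:
            for ax, ay in walk(i, j - 1):
                yield ax + "-", ay + y[j - 1]

    return F[n][m], list(walk(n, m))


score, alignments = all_optimal_alignments("AAAC", "AGC")
print(score)
for ax, ay in alignments:
    print(ax, ay)
```

The number of optimal alignments can grow very quickly with sequence length, so full enumeration is only practical for small examples like this one.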
Equally Optimal Alignments
• many optimal alignments may exist for a given pair of
sequences
• can use preference ordering over paths when doing
traceback
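[Figure: ranking of the three traceback directions. The high road ranks them 1, 2, 3; the low road ranks the same directions 3, 2, 1.]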
• High road and low road alignments show the two most
different optimal alignments
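A preference ordering can be sketched as a tuple of moves tried in order at each traceback step. The assignment of directions to ranks below is an assumption: "up" first for the high road and "left" first for the low road, chosen because it reproduces the high road alignment shown for the AAAC / AGC example. The scores are those of that example.

```python
HIGH_ROAD = ("up", "diag", "left")   # assumed ranking, see the note above
LOW_ROAD = ("left", "diag", "up")    # the reverse ranking


def traceback_with_preference(x, y, order, match=1, mismatch=-1, gap=-2):
    """Return one optimal global alignment, breaking ties by the given move order."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + pair,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        # Collect the moves that explain F[i][j], i.e. the pointers for this cell.
        valid = set()
        if i > 0 and j > 0:
            pair = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + pair:
                valid.add("diag")
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            valid.add("up")
        if j > 0 and F[i][j] == F[i][j - 1] + gap:
            valid.add("left")
        move = next(mv for mv in order if mv in valid)
        di = 0 if move == "left" else 1  # does the move consume a character of x?
        dj = 0 if move == "up" else 1    # does the move consume a character of y?
        ax.append(x[i - 1] if di else "-")
        ay.append(y[j - 1] if dj else "-")
        i, j = i - di, j - dj
    return "".join(reversed(ax)), "".join(reversed(ay))


print(traceback_with_preference("AAAC", "AGC", HIGH_ROAD))
print(traceback_with_preference("AAAC", "AGC", LOW_ROAD))
```

With these rankings the two tracebacks follow the upper and lower boundaries of the set of optimal paths through the matrix.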
High road & Low road Alignments
High road alignment
x: A A A C
y: A G - C
          A    G    C
     0   -2   -4   -6
A   -2    1   -1   -3
A   -4   -1    0   -2
A   -6   -3   -2   -1
C   -8   -5   -4   -1
Semi-Global Alignment
• Global alignment seeks the best, full length alignment; that is, the
best way to match up two sequences along their entire length.
Filling the Matrix
Local Alignment
Smith-Waterman Algorithm
• Initialize the first row and first column with zeros, so that an alignment can start anywhere in
either sequence without any penalty.
O(n^2)
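A minimal sketch of local alignment is given below. Beyond the zero initialization stated above, it uses the standard Smith-Waterman rules, which the slide text does not spell out: every cell is floored at zero, and traceback starts at the highest-scoring cell and stops at the first zero. The scores (+1, -1, -2 per gap) are illustrative assumptions.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    n, m = len(x), len(y)
    # The first row and column stay at zero, so an alignment may start anywhere.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + pair,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    ax, ay = [], []
    i, j = best_cell
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

The fill step visits every cell once, so the running time is proportional to n times m, which is quadratic when the two sequences have similar lengths.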
Scoring indels: naive approach
• A fixed penalty σ is given to every indel:
• -σ for 1 indel,
• -2σ for 2 consecutive indels
• -3σ for 3 consecutive indels, etc.
• This can be too severe a penalty for a series of 100 consecutive indels
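The arithmetic of the naive scheme is easy to tabulate. In the sketch below, the function name and the value σ = 2 are illustrative assumptions.

```python
def linear_gap_penalty(run_length, sigma):
    """Score contribution of a run of consecutive indels under the naive scheme."""
    return -sigma * run_length


for k in (1, 2, 3, 100):
    print(k, linear_gap_penalty(k, sigma=2))
```

Because the penalty grows in direct proportion to the run length, a single long insertion or deletion event is scored as harshly as many independent ones.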
Gap Penalties
• Minimizing gaps in an alignment is important to create a useful
alignment.
PRT---EINS
PRTWPSEIN-
Total gap penalty = -24
A model for sequence evolution