BLAST N FASTA
Pairwise Alignment
Global
• Best score from among alignments of full-length sequences
• Needleman-Wunsch algorithm
Local
• Best score from among alignments of partial sequences
• Smith-Waterman algorithm
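The difference between the two scores can be sketched in a few lines of Python. This is a minimal illustration, not a production aligner: the scoring values (match +1, mismatch -1, linear gap -2), the function names, and the toy sequences are assumptions chosen for the example and do not come from the slides.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: best alignment of the two full-length sequences."""
    # Row 0: aligning a prefix of b against nothing costs one gap per letter.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best alignment of any pair of partial sequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart anywhere (the "local" part).
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # Toy sequences (hypothetical): they share ACGTACGT but differ at the ends.
    a, b = "TTTTACGTACGT", "ACGTACGTCCCC"
    print("global:", global_score(a, b))
    print("local: ", local_score(a, b))
```

The only structural difference is the floor at zero and taking the maximum over the whole table rather than the final cell, which is why the local score is never lower than the global one.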
Local vs. Global Alignment
• Global Alignment
--T--CC-C-AGT--TATGT-CAGGGGACACG-A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |      | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG-T-CAGAT--C
• Local Alignment: better alignment to find conserved segment
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
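To see how a local aligner recovers the conserved segment, here is a small Smith-Waterman sketch with traceback, run on the two sequences from this slide. The scoring scheme (match +1, mismatch -1, linear gap -2) and the function name are assumptions for illustration; real tools use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score falls to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    # The two sequences shown on the slide, upper-cased for comparison.
    seq1 = "tccCAGTTATGTCAGgggacacgagcatgcagagac".upper()
    seq2 = "aattgccgccgtcgttttcagCAGTTATGTCAGatc".upper()
    score, top, bottom = smith_waterman(seq1, seq2)
    print(score)
    print(top)
    print(bottom)
```

Under these assumed scores the flanking regions only lower the score, so the reported alignment should centre on the shared CAGTTATGTCAG segment instead of being stretched across the full sequences.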
Why do we need local alignments?
Heuristic Methods: FASTA and BLAST
FASTA
• First fast sequence searching algorithm for
comparing a query sequence against a database.
BLAST
• Basic Local Alignment Search Tool
• Improvement on FASTA: search speed, ease of use, statistical rigor.
FASTA and BLAST
• Basic idea: a good alignment contains subsequences of absolute identity (short stretches of exact matches).
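A quick way to see this idea in code is to list the exact words two sequences have in common. The sketch below is only an illustration: the function names and the word length of 6 are assumptions, and the sequences are the two from the local alignment example earlier in the slides.

```python
def words(seq, k):
    """Set of all overlapping k-letter words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_exact_words(a, b, k):
    """Sorted list of k-letter words that occur in both sequences."""
    return sorted(words(a, k) & words(b, k))


if __name__ == "__main__":
    seq1 = "tccCAGTTATGTCAGgggacacgagcatgcagagac".upper()
    seq2 = "aattgccgccgtcgttttcagCAGTTATGTCAGatc".upper()
    # Related sequences share exact words; unrelated ones usually share none.
    print(shared_exact_words(seq1, seq2, 6))
    print(shared_exact_words("AAAAAAAA", "CCCCCCCC", 6))
```

A heuristic search can use a test like this as a cheap filter: only database sequences that share at least one word with the query go on to the more expensive alignment step.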
FASTA
Derived from logic of the dot plot
– compute best diagonals from all frames of
alignment
The method looks for exact matches between
words in query and test sequence
– DNA words are usually 6 nucleotides long
– protein words are 2 amino acids long
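The dot plot logic can be sketched as follows: every exact word match at query position i and subject position j lies on diagonal j - i, and the diagonals that collect the most matches are the candidates for a good alignment. This is a simplified sketch, not the FASTA program itself; the function name is hypothetical, and counting hits per diagonal stands in for FASTA's actual diagonal scoring.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, k=6):
    """Count exact k-letter word matches on each diagonal (subject_pos - query_pos)."""
    # Lookup table: word -> list of start positions in the subject sequence.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)

    diagonals = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            diagonals[j - i] += 1
    return diagonals


if __name__ == "__main__":
    # Sequences from the local alignment slide, with the DNA word size of 6.
    query = "tccCAGTTATGTCAGgggacacgagcatgcagagac".upper()
    subject = "aattgccgccgtcgttttcagCAGTTATGTCAGatc".upper()
    for diagonal, count in diagonal_hits(query, subject).most_common(3):
        print(f"diagonal {diagonal:+d}: {count} word hits")
```

The 12-base conserved segment starts at position 3 in the query and 21 in the subject (0-based), so its seven overlapping 6-letter words all fall on diagonal +18.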
FASTA Algorithm
Makes Longest Diagonal
After all diagonals are found, tries to join
diagonals by adding gaps
FASTA Alignments
FASTA Results - Histogram
!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq from: 1 to: 693 December 9, 2002 14:02
TO: /u/browns02/Victor/Search-set/*.seq Sequences: 2,050 Symbols: 913,285 Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunData:fastadna.cmp
Constant pamfactor used
Gap creation penalty: 16 Gap extension penalty: 4
Histogram Key:
Each histogram symbol represents 4 search set sequences
Each inset symbol represents 1 search set sequences
z-scores computed from opt scores
z-score  obs  exp
         (=)  (*)
 < 20      0    0:
   22      0    0:
   24      3    0:=
   26      2    0:=
   28      5    0:==
   30     11    3:*==
   32     19   11:==*==
   34     38   30:=======*==
   36     58   61:===============*
   38     79  100:====================    *
   40    134  140:==================================*
   42    167  171:==========================================*
   44    205  189:===============================================*====
   46    209  192:===============================================*=====
   48    177  184:=============================================*
FASTA Results - List
The best scores are: init1 initn opt z-sc E(1018780)..
FASTA Results - Alignment
SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3:DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
66.2% identity in 875 nt overlap
(83-957:151-1022)
                   60        70        80        90       100       110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
                                           || |||  | ||||| |    ||| |||||
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
                    130       140       150       160       170       180
>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
BLAST
• Basic Local Alignment Search Tool
– Altschul et al. 1990, 1994, 1997
• Heuristic method for local alignment
• Designed specifically for database searches
• Based on the same assumption as FASTA
that good alignments contain short lengths
of exact matches
BLAST
• Both BLAST and FASTA search for local
sequence similarity - indeed they have exactly
the same goals, though they use somewhat
different algorithms and statistical approaches.
• BLAST benefits
– Speed
– User friendly
– Statistical rigor
– More sensitive
BLAST Searches GenBank
[BLAST= Basic Local Alignment Search Tool]
The NCBI BLAST web server lets you compare your
query sequence to various sections of GenBank:
– nr = non-redundant (main sections)
– month = new sequences from the past few weeks
– refseq_rna = RNA entries from NCBI's Reference Sequence project
– refseq_genomic = genomic entries from NCBI's Reference Sequence project
– ESTs
– Taxon = e.g., human, Drosophila, yeast, E. coli
– proteins (by automatic translation)
– pdb = sequences derived from 3-dimensional structures in the Brookhaven Protein Data Bank
BLAST
• Uses word matching like FASTA
• Similarity matching of words (3 amino acids, 11
bases)
– does not require identical words.
• If no words are similar, then no alignment
– Will not find matches for very short sequences
BLAST Word Matching
Query: MEAAVKEEISVEDEAVDKNI
Break query into words:
MEA
EAA
AAV
AVK
VKE
KEE
EEI
EIS
ISV
...
Break database sequences into words
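Breaking a sequence into words is a sliding window of fixed length. A minimal sketch, using the protein word size of 3 and the query from this slide (the function name is a hypothetical choice):

```python
def break_into_words(seq, w=3):
    """Return every overlapping word of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


if __name__ == "__main__":
    query = "MEAAVKEEISVEDEAVDKNI"
    for word in break_into_words(query):
        print(word)
```

A sequence of length L yields L - w + 1 words, so this 20-residue query gives 18 words; the same function can be applied to each database sequence.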
Find locations of matching words
in database sequences
Query words:
MEA
EAA
AAV
AVK
KLV
KEE
EEI
EIS
ISV
Database sequences:
ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
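Locating word hits is typically done with a lookup table built once from the database. The sketch below assumes the three sequence lines on this slide are three separate database entries (the extracted text does not say), uses exact matching only, and reports 0-based positions; the function names are hypothetical.

```python
from collections import defaultdict


def index_words(database, w=3):
    """Map each word to the (sequence number, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def find_hits(query_words, index):
    """Return the database locations of every query word that occurs at all."""
    return {word: index[word] for word in query_words if word in index}


if __name__ == "__main__":
    database = [
        "ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT",
        "TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY",
        "IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH",
    ]
    query_words = ["MEA", "EAA", "AAV", "AVK", "KLV", "KEE", "EEI", "EIS", "ISV"]
    for word, places in find_hits(query_words, index_words(database)).items():
        print(word, places)
```

Building the index costs one pass over the database; after that each query word is a single dictionary lookup, which is where much of the speed of word-based searching comes from.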
Extend hits one base at a time
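Hit extension can be sketched as walking outward from a word hit in both directions, one position at a time, and keeping the best-scoring extent. The stopping rule used here (give up once the running score falls a fixed amount below the best seen) and the scores (match +1, mismatch -1, drop-off 3) are assumptions for illustration; the slide itself does not state them.

```python
def extend_hit(query, subject, q_pos, s_pos, w, match=1, mismatch=-1, x_drop=3):
    """Extend a word hit of length w without gaps.

    Returns (score, query_start, subject_start, length) of the extended segment.
    """
    def extend(qi, si, step):
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break  # score has dropped too far below the best seen
            qi += step
            si += step
        return best, best_len

    seed = sum(match if query[q_pos + k] == subject[s_pos + k] else mismatch
               for k in range(w))
    right_score, right_len = extend(q_pos + w, s_pos + w, +1)
    left_score, left_len = extend(q_pos - 1, s_pos - 1, -1)
    return (seed + left_score + right_score,
            q_pos - left_len, s_pos - left_len,
            left_len + w + right_len)


if __name__ == "__main__":
    query = "tccCAGTTATGTCAGgggacacgagcatgcagagac".upper()
    subject = "aattgccgccgtcgttttcagCAGTTATGTCAGatc".upper()
    # Seed: the word GTTATG at query position 5 and subject position 23 (0-based).
    print(extend_hit(query, subject, 5, 23, 6))
```

Starting from the 6-letter seed inside the conserved segment of the earlier local alignment example, the extension grows to the full 12-base match and stops once mismatches in the flanking sequence pull the score down.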
BLAST variants
Understanding BLAST output