
An Introduction to Hidden Markov Models
Benjamin Schuster-Böckler and Alex Bateman
Article in Current Protocols in Bioinformatics, July 2007. DOI: 10.1002/0471250953.bia03as18
APPENDIX 3A

An Introduction to Hidden Markov Models

Markov and hidden Markov models have many applications in Bioinformatics. A quick search for "hidden Markov model" in PubMed yields around 500 results from various fields such as gene prediction, sequence comparison, structure prediction, and more specialized tasks such as detection of genomic recombination, determining expression from genomic tiling microarrays, and many others. One might in fact rather ask "What are HMMs not good for?" (this will be discussed later in the unit). First, though, the unit will very briefly introduce the history of Markov and hidden Markov models and outline some common fields of their application.

HISTORY

The name Markov model refers to the Russian mathematician Andrei Markov (1856-1922), who studied sequences of mutually dependent random variables. For many years, his findings could not be used practically due to the sheer complexity of the calculations. With the advent of modern computers, however, this limitation diminished, and hidden Markov models (HMMs) quickly became popular as a tool for supervised machine learning. One of the first applications of HMMs was in the field of speech recognition. According to the classic tutorial on HMMs by Rabiner (1989), the basic theory of HMMs was outlined by Baum and colleagues (Baum and Petrie, 1966) in the late 1960s and further developed by different groups during the 1970s. In one of his early reviews on the subject, Eddy (1996) describes how HMMs were adopted in biology. According to him, Haussler, Krogh, and colleagues first realized that a range of profile methods already in use to describe protein families could be expressed in terms of HMMs. Baldi et al. came up with a very similar concept for profile HMMs at the same time (Baldi et al., 1994). Soon after, several groups began to explore the potential of HMMs for sequence analysis, and later two of the most popular packages, SAM and HMMER, were released by groups in Santa Cruz and Cambridge, respectively.

COMMON APPLICATIONS

Finding signals in noisy data is the problem that Bioinformatics is concerned with most of the time. The signal can be a gene in a genomic sequence, a transcription factor binding site, a member of a protein family from a set of unknown proteins, and so forth. For all of these problems, programs were written that used existing knowledge to filter the noise. For example, once it was known that there are certain transcription initiation motifs and stop codons, one could search for open reading frames in genomes. Most of the early programs were very simple in their design: they just looked for exact occurrences of small strings or used ad-hoc scoring schemes. They lacked a conceptual probabilistic framework to bind together different pieces of evidence and give meaning to the returned scores. It was often difficult to extend these programs or to calculate some kind of confidence value to test the quality of the output.

Markov models and HMMs offered a clean and well-established theory that applied to a wide range of existing problems. Many programs actually already implicitly used HMM-like structures, but employing arbitrary scores rather than probabilities. HMM theory shed new light on those programs, showing that seemingly different approaches were strongly related in their underlying principles. HMMs also made it possible to interpret scores as probabilities. This greatly simplified the creation, extension, and evaluation of programs. As will be seen later, by just changing the model layout, a wide range of questions can be addressed, while the algorithms for parameter estimation and decoding remain the same.

Another great advantage of HMMs is that machine learning techniques can be used to characterize the statistical properties of a signal. Using a data-driven learning approach, a program can be allowed to induce a set of rules from a collection of manually classified cases. Those rules are then converted into an HMM and applied by another program to classify unknown cases.

Contributed by Benjamin Schuster-Böckler and Alex Bateman
Current Protocols in Bioinformatics (2007) A.3A.1-A.3A.9
Copyright © 2007 by John Wiley & Sons, Inc.
Fundamentals of Bioinformatics, Supplement 18
Sequence Comparison

Detecting homologous sequences was one of the first applications in Bioinformatics. Soon after the first protein sequences were collected, algorithms were developed to compare them and evaluate their similarity. Tools like NEEDLE or BLAST compare pairs of sequences and predict whether the sequences are homologous.

In a pairwise alignment, all residue positions are treated the same and only judged in terms of the overall amino acid substitution rate. What a pairwise sequence alignment cannot provide is a characterization of conserved residues within a family of related sequences. For example, an important active site residue is scored just the same as a rapidly mutating surface residue. Multiple sequence alignments (MSAs) offer a handy visualization of a set of related sequences. From a multiple alignment, it can easily be seen which positions are more conserved than others. To test whether a new sequence belongs to a family, and how good the fit is, one needed to recalculate the whole alignment, which is relatively slow. Profile methods for homology detection were invented to allow searching for family members in sets of sequences. The position-specific scores allowed important residues to be given higher weight. But due to the ad-hoc nature of the scoring schemes, it needed an expert to assess which matches were significant. Furthermore, only experts with sufficient knowledge of the particular program and the biology of the sequences of interest were able to take advantage of these programs.

Profile HMMs provided a probabilistic framework to describe a set of sequences by their observed properties. Many existing algorithms could be reinterpreted with respect to HMM theory. Within this framework it was now possible to compare and evaluate the scores of different programs. HMM packages like HMMER or SAM now allow non-specialists to use existing multiple sequence alignments to derive a profile HMM, without having to tweak any part of the process manually.

Once an HMM representing a set of homologous sequences has been created, it can be used to search for other potential members of this family in a fast and reliable way. The Pfam database provides a hand-curated collection of sequence families represented as HMMs. More on Pfam can be found in UNIT 2.5 (Identifying Protein Domains with the Pfam Database).

It should be mentioned here that HMMs have not replaced other methods. Many programs still exist, and new ones are being developed, which use other ways of representing similarity data. HMMs are not even necessarily the most sensitive method. What makes HMMs so popular is that they offer a very general, well-established framework that is maintainable and extendable.

Gene Finding

Several gene finding programs use HMMs to distinguish intergenic sequence, exons, introns, and other elements on raw genomic sequence (Kulp et al., 1996; Burge and Karlin, 1997; Lukashin and Borodovsky, 1998). Unlike in homology detection, these HMMs do not necessarily have to be trained on existing data but can be set up with general values, e.g., about the nucleotide frequencies in different elements. However, training the models on each species gives large gains in accuracy. A simple example of this will be seen below. One program, GeneMark.hmm, is further described in UNITS 4.5 & 4.6 (Prokaryotic/Eukaryotic Gene Prediction Using GeneMark.hmm).

Birney et al. (2004) describe an algorithm that combines gene prediction and homology searching using a strictly probabilistic approach. Their program, called GeneWise, is able to identify regions in raw DNA sequence which, once translated and spliced, would be similar to a known protein family. This approach basically combines a simplified, GenScan-like gene-finder with HMMER-style profile HMMs. The strictly HMM-based approach greatly facilitated this combination. Some more information is provided in UNIT 2.5 (Identifying Protein Domains with the Pfam Database).

Fold Assignment and Structure Prediction

HMMs are routinely used in homology modeling to identify known folds in a target sequence. HMMs are popular in this area for their high sensitivity. After successfully identifying template structures, it is often necessary to improve the accuracy of the alignment. It can be advantageous to compare a profile of the target family to the profile of the fold, rather than just aligning the individual target sequence to the fold profile. Tools for profile-profile comparison and visualization have been developed and shown to be useful (Madera, 2005; Schuster-Böckler and Bateman, 2005; Söding, 2005).
Phylogeny and Function Prediction

An efficient way to infer the properties of an unknown protein is usually to search for evolutionarily related sequences which have some annotations. It has already been mentioned that HMMs are widely used for homology detection. UNIT 6.9 describes the FlowerPower algorithm, which uses an iterative clustering algorithm involving repeated HMM searches to cluster related sequences into subfamilies with distinct features. These subfamilies can then be used for functional inference.

MARKOV MODELS

One type of Markov chain that almost every biologist has come across at some point, probably without realizing it, is the amino acid or nucleotide substitution matrix, like PAM and BLOSUM. Quite generally, a Markov chain is a type of stochastic (random) process. It consists of several states (x) from a system (X), which represent observations at a specific point in time (t), and a set of transition probabilities (Pr). What distinguishes Markov chains from other stochastic processes is the Markov property, which states that Pr depends only on the current state of the system. In mathematical terms, this is usually written as shown in the equation below:

    Pr(X_{t+1} = x_j | X_1 = x_1, X_2 = x_2, ..., X_t = x_i) = Pr(X_{t+1} = x_j | X_t = x_i)

for all t ≥ 1 and x_1, ..., x_n elements of S, where S is the set of all possible states in the system. "|" in this context means "depending on". The equation denotes that the transition probability of going from x_i to x_j at timestep t+1 does not depend on the states visited in the past.

Markov models are often represented by a graph in which states are shown as nodes and the transitions are edges. A graph showing the transition probabilities as defined by a nucleotide substitution matrix is shown in Figure A.3A.1B. Each nucleotide is a possible observation, and there are probabilities of substituting one for another. Imagine looking at one residue in a sequence, and checking it every 1 million years. Within each time-step, the residue can mutate to a different residue or remain the same. The model in Figure A.3A.1B shows exactly that. At every time step, one can transit from one state to another (including, in this case, going to the same state again), with a probability p_ij depending on the state one is currently in. The residue change from T to C in the alignment shown in Figure A.3A.1A then corresponds to a transition from state T to state C, which has a probability of p_TC. While it is theoretically possible to have different transition probabilities for a T→C and a C→T transition, substitution matrices like PAM or BLOSUM are always symmetrical.

In this example, the Markov property demands that what a residue will mutate to in the future does not depend on what residues were observed in the past. This means that if it is known that another homologous sequence exists with a C residue observed at the specified position, that does not change the probability of a T→C transition. It should be mentioned here that relationships between neighboring residues are also not accounted for in any way. Certainly, these are major simplifications, but the benefits usually outweigh the drawbacks.

Markov chains can usually be expressed as a matrix, where each row and column corresponds to a state and the value in the cell is the transition probability between the states (see Fig. A.3A.1C). Representing a Markov chain as a matrix allows a number of useful operations. In the case of substitution matrices, for example, one can use matrix multiplication to extrapolate substitution matrices for any arbitrary time-distance from the one matrix that was measured. Also, because the matrix consists purely of probabilities, one can calculate confidence measures analytically.

HIDDEN MARKOV MODELS

So, what is hidden in a hidden Markov model? In our earlier example, a state represents exactly one residue, and this residue will be observed every time the state is passed. But there are cases where the actual state the process is in at a given time is not known. In the introduction, it was mentioned that HMMs can be used for gene prediction. Following the example from Eddy (2004), imagine one is interested in identifying splice sites in a genomic sequence. Simply put, it is stated that every residue in a genome is either in an exon (E), intron/intergenic (I), or a 5′ splice site (5). A Markov chain describing this model is shown in Figure A.3A.2A. The difference to our earlier example is that the sequence of visited states cannot be observed directly. In fact, the most likely sequence of hidden states is very often what one is interested in.

A hidden Markov model consists of a set of states just like a simple Markov model. The difference is that every hidden state emits a nucleotide when it is visited. Therefore, every state "contains" a probability distribution which defines the emission probability. Just like the transition probabilities, the emission probabilities are again independent of the past.
Figure A.3A.1 A simple Markov model for the transition probabilities as defined by a nucleotide substitution matrix. (A) Example of a short DNA sequence alignment with the transition depicted in green. (B) A simple Markov model for DNA residue substitutions. Every circle represents a state, and arrows denote probabilities to make a transition into the state at the end of the arrow in the next time step. A stochastic process like this is called a Markov chain if the transition probabilities p do not change depending on the states visited in the past. The transition observed in panel (A) is highlighted in green. (C) A transition matrix for the model shown in (B). For the color version of this figure see http://www.currentprotocols.com.
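The extrapolation by matrix multiplication mentioned above can be sketched in a few lines. The one-step matrix below is invented for illustration (it is neither the matrix from the figure nor a real PAM-style matrix); the point is only that raising a stochastic matrix to a power yields substitution probabilities over longer time-distances.

```python
# Toy one-step nucleotide substitution matrix (rows/columns: A, C, G, T).
# Values are illustrative only; each row sums to 1.
STATES = ["A", "C", "G", "T"]
P1 = [
    [0.91, 0.03, 0.03, 0.03],
    [0.03, 0.91, 0.03, 0.03],
    [0.03, 0.03, 0.91, 0.03],
    [0.03, 0.03, 0.03, 0.91],
]

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def extrapolate(p, steps):
    """Raise the one-step transition matrix to the given power, giving
    substitution probabilities over 'steps' time units."""
    result = p
    for _ in range(steps - 1):
        result = mat_mul(result, p)
    return result

P4 = extrapolate(P1, 4)
# The rows of a stochastic matrix still sum to 1 after multiplication,
# and off-diagonal (substitution) probabilities grow with time-distance.
print("P(T -> C) after 4 steps:", round(P4[3][1], 4))
```

Because the product of stochastic matrices is again stochastic, the extrapolated matrix can be read exactly like the measured one.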

A model like the one in Figure A.3A.2A can be thought of as a generator or simulator of whatever it represents, in this case DNA containing splice sites. It can be used to create "random" strings of nucleotide letters. Our example starts at the starting state "S", then passes from one state to the other, choosing the next state according to the transition probabilities on the arrows. When a state is reached, a random letter according to the distribution defined for the state is emitted. This continues until the end state "N" is reached. An example of a state path and the emitted sequence is shown in Figure A.3A.2B. The recorded sequence of emitted residues from the path through the model should have the characteristics of a splice site as was defined, because all paths from start to end must pass through the "5" state.
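Such a generator can be sketched as follows. The transition and emission probabilities below are invented placeholders, not the values used by Eddy (2004); only the model layout — start S, exon E, 5′ splice site 5, intron I, end N — follows the text.

```python
import random

# Toy splice-site HMM after the layout of Eddy (2004).  All probabilities
# are invented for illustration.
TRANSITIONS = {
    "S": [("E", 1.0)],
    "E": [("E", 0.9), ("5", 0.1)],
    "5": [("I", 1.0)],
    "I": [("I", 0.9), ("N", 0.1)],
}
EMISSIONS = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # exon: uniform
    "5": {"A": 0.05, "C": 0.00, "G": 0.95, "T": 0.00},  # splice site: mostly G
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},  # intron: A/T rich
}

def weighted_choice(pairs):
    """Pick an item from (item, probability) pairs."""
    r, acc = random.random(), 0.0
    for item, p in pairs:
        acc += p
        if r < acc:
            return item
    return pairs[-1][0]

def generate():
    """Walk from S to N, emitting one nucleotide per visited state."""
    path, seq = [], []
    state = weighted_choice(TRANSITIONS["S"])
    while state != "N":
        path.append(state)
        seq.append(weighted_choice(list(EMISSIONS[state].items())))
        state = weighted_choice(TRANSITIONS[state])
    return "".join(path), "".join(seq)

path, seq = generate()
print(path)
print(seq)
```

Every run produces a different path and sequence, but every complete path passes through the "5" state exactly once, as the text describes.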
Figure A.3A.2 A simple hidden Markov model for splice site recognition (Eddy, 2004). (A) The
circles again denote states, in this case exon (E), intron (I) or 5′ splice site (5). The edges represent
transition probabilities to move from one state to another in each time step. “Time” in this context
refers to the residue position we are in, not actual time. The sequence of visited states cannot be
observed, rather a nucleotide distribution that is typical for one of the 3 states is seen, here shown
as a bar shaded in 4 colors. The proportion of each color reflects the probability of observing this
nucleotide when the state is visited. (B) An example of a sequence (bottom) and its hidden state
path (top, bold font). Passing through the model and emitting a residue according to the given
emission probabilities creates a sequence like the one shown. Vice-versa, one can calculate the
probability that an observed sequence was created by the given state path. For color version of
this figure see http://www.currentprotocols.com.

Usually, however, one does not want to create a random sequence; instead, it is necessary to decide whether some observed sequence contains a splice site. In other words, the question is: Given an observation, what is the most likely sequence of hidden states that would have created this observation? The so-called Viterbi algorithm finds the state-path with the highest probability given an observed sequence. For this purpose, it is useful to think of an HMM as a generating model: the Viterbi algorithm compares the probability of all paths through the model that could generate the observed sequence and chooses the most likely one.

Another question that is particularly important for profile HMMs representing a sequence family is: How likely is it that a given model generated an observed sequence? This kind of analysis is often called posterior decoding. One can find the probability of generating a given sequence from an HMM by summing over the probabilities of all paths that could create this sequence. Efficient algorithms for these and other problems are implemented in all major HMM packages. More on the details of these algorithms can be found in Durbin et al. (1998).

PROFILE METHODS FOR SEQUENCE ANALYSIS

Profile hidden Markov models are a way to describe families of related sequences. It was previously shown how a hidden Markov chain is made up of a series of states which emit a signal according to a state-dependent distribution. In our earlier example, there were 3 states (E, I, and 5). A schematic outline of a profile HMM (as implemented in the HMMER package) is shown in Figure A.3A.3. In profile HMMs, the number of states varies depending on the length of the sequences in the family. The layout is strictly repetitive: the three types of states in a profile HMM are Match, Insert, and Delete.
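Returning to the decoding question above, the standard Viterbi recursion (Durbin et al., 1998) can be sketched for the toy exon/splice-site/intron model. All probabilities are invented for illustration; computations are done in log space to avoid numerical underflow on long sequences.

```python
import math

# Toy splice-site HMM with invented probabilities (layout from Eddy, 2004).
STATES = ["E", "5", "I"]
START = {"E": 1.0, "5": 0.0, "I": 0.0}
TRANS = {
    "E": {"E": 0.9, "5": 0.1, "I": 0.0},
    "5": {"E": 0.0, "5": 0.0, "I": 1.0},
    "I": {"E": 0.0, "5": 0.0, "I": 1.0},
}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}

def log(p):
    """log of a probability; impossible events get -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(seq):
    """Most probable hidden state path for an observed sequence."""
    # v[s] = best log probability of any path ending in state s.
    v = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back-pointers, one dict per position after the first
    for x in seq[1:]:
        ptr, nxt = {}, {}
        for s in STATES:
            prev, score = max(
                ((r, v[r] + log(TRANS[r][s])) for r in STATES),
                key=lambda t: t[1])
            ptr[s] = prev
            nxt[s] = score + log(EMIT[s][x])
        back.append(ptr)
        v = nxt
    # Trace back from the best final state.
    state = max(v, key=v.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("CTTCGTGAGTTA"))
```

Summing instead of maximizing in the same recursion gives the total probability of the sequence over all paths, the second question discussed above.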
Figure A.3A.3 Schematic representation of a profile hidden Markov model of length 4. The
states are begin (B), end (E), match (M), insert (I), and delete (D). Rectangular states (M and
I) emit amino acids or nucleotides, round states are silent. A sequence family is represented by
a characteristic distribution of amino-acid emission probabilities at every match or insert state.
Delete states are equivalent to gaps.

Match states represent informative positions in a family, while insert states allow insertions of small stretches of nonspecific sequence. Delete states correspond to gaps, where a match position is skipped in a member of the family. With every step, the process moves from left to right, to the next distinct residue position, until the end state has been reached.

The parameters of a profile HMM, its emission and transition probabilities, have to be "learned" before it can be applied. This is done by taking a multiple sequence alignment of a core set of family members and deriving the conservation of positions and residues from it; see Figure A.3A.4. The match states in the profile represent conserved positions in the multiple alignment. Positions in the multiple alignment which contain gaps for most member sequences translate to a high probability of passing through the corresponding delete state. Similarly, regions which are indistinguishable from the background amino acid distribution are modeled as insert states.

Deriving the parameters of the HMM from a sequence alignment necessarily means that a profile HMM can only be as good as this underlying alignment. If, for example, an alignment contains numerous orthologous sequences, the resulting HMM could be biased towards a particular subset of the family. The HMM might then not be able to find more distant family members with a distinctly different residue composition. While software packages like HMMER or SAM try to correct for bias from too closely related sequences in the alignment, great care should still be taken when compiling a training set for a protein family.

PAIR HMMS

A pair HMM is a process that maps one alphabet to another. The beauty of pair HMMs is that the concept can be applied to all sorts of alignment or mapping problems. Comparison of two profile HMMs can be expressed as a pair HMM (Madera, 2005). Birney et al. use pair HMMs in their GeneWise algorithm to link the output of a basic gene-prediction algorithm with a profile HMM search (Birney et al., 2004). Pair HMMs have been used successfully for gene prediction when two related sequences are known (Meyer and Durbin, 2004).

As an example, pairwise sequence alignment can be expressed in terms of pair HMMs, too. Figure A.3A.5 shows the model for an HMM that corresponds to a pairwise alignment.
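The parameter-estimation step described above can be sketched very roughly: count residues per alignment column to obtain match-state emission probabilities, with a pseudocount so that unseen residues keep a small nonzero probability. The toy alignment, the pseudocount value, and the "fewer than half gaps" rule for choosing match columns are illustrative simplifications; real packages such as HMMER use far more elaborate sequence weighting and prior schemes.

```python
from collections import Counter

# Toy multiple sequence alignment; '-' marks a gap.  Invented for illustration.
ALIGNMENT = [
    "MK-LV",
    "MKALV",
    "MR-LI",
    "MKSLV",
]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def match_columns(alignment):
    """Columns gapped in fewer than half the sequences become match states
    (a simple heuristic, not HMMER's actual rule)."""
    ncols, nseq = len(alignment[0]), len(alignment)
    return [c for c in range(ncols)
            if sum(row[c] == "-" for row in alignment) < nseq / 2]

def emission_probabilities(alignment, pseudocount=1.0):
    """Per match state: pseudocount-smoothed residue frequencies."""
    states = []
    for c in match_columns(alignment):
        counts = Counter(row[c] for row in alignment if row[c] != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        states.append({aa: (counts[aa] + pseudocount) / total
                       for aa in ALPHABET})
    return states

profile = emission_probabilities(ALIGNMENT)
print(len(profile), "match states")
```

In this toy alignment the mostly gapped third column would be modeled by an insert state, while the remaining four columns become match states with emission distributions reflecting the observed conservation.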

Figure A.3A.4 The Leucine Rich Repeat family as a multiple sequence alignment and as an HMM. The columns of the alignment and the positions in the profile HMM that correspond to them are indicated by black bars. The bottom part shows an HMM logo. HMM logos resemble sequence logos (Schuster-Böckler et al., 2004). Each position in the HMM corresponds to a column in the logo. Match states are white, insert states are red. The width of a column reflects how likely it is to be skipped by going through the delete state. The height of the letters shows their frequency relative to the overall information content of the state. It can be seen how the information in the HMM mirrors the composition of the sequence alignment. For the color version of this figure see http://www.currentprotocols.com.

Figure A.3A.5 A pair HMM for global pairwise alignment with affine gap penalties as described
by Durbin et al. (1998).
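To make the roles of the gap probabilities concrete, the sketch below scores one explicit alignment under a minimal M/X/Y model with affine gaps, loosely following Durbin et al. (1998). All numbers are invented, the begin and end states are ignored for brevity, and a real implementation would maximize or sum over all alignments by dynamic programming rather than score a single fixed one.

```python
import math

# Invented pair-HMM parameters: G = gap-open, E_EXT = gap-extension.
G, E_EXT = 0.2, 0.5
TRANS = {
    "M": {"M": 1 - 2 * G, "X": G,     "Y": G},
    "X": {"M": 1 - E_EXT, "X": E_EXT, "Y": 0.0},
    "Y": {"M": 1 - E_EXT, "X": 0.0,   "Y": E_EXT},
}

def emit_pair(a, b):
    """Joint emission probability for an aligned pair; its log plays the
    role of a substitution score.  Values are invented."""
    return 0.15 if a == b else 0.02

def emit_gap(_residue):
    return 0.25  # uniform single-residue emission in a gap state

def alignment_log_prob(x_aln, y_aln):
    """Log probability of one explicit alignment (with '-' gaps) under the
    M/X/Y model with affine gaps."""
    assert len(x_aln) == len(y_aln)
    logp, state = 0.0, "M"
    for a, b in zip(x_aln, y_aln):
        if a != "-" and b != "-":
            nxt, emit = "M", emit_pair(a, b)        # aligned pair
        elif b == "-":
            nxt, emit = "X", emit_gap(a)            # gap in sequence y
        else:
            nxt, emit = "Y", emit_gap(b)            # gap in sequence x
        trans = TRANS[state][nxt]
        if trans == 0.0:
            return float("-inf")                    # move not allowed
        logp += math.log(trans) + math.log(emit)
        state = nxt
    return logp

# A gapped alignment pays the affine open/extend penalties:
print(alignment_log_prob("ACGT", "ACGT"))
print(alignment_log_prob("AC-GT", "ACTGT"))
```

Because transition and emission probabilities multiply along the path, their logs add, which is exactly why pair-HMM log probabilities behave like the familiar additive alignment scores.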

The core of the model consists of a match state M that emits a pair of symbols. The X and Y states are gaps in either sequence. The start S and end E states complete the model. The model generates sequences of pairs of symbols, or gaps in one of the sequences. The probability of emitting a certain pair of symbols, p_{x_i y_j}, plays the role of a substitution score. In the model, g is the gap opening probability and e is the gap extension probability.

DRAWBACKS

What are hidden Markov models not good for, then? As mentioned before, care has to be applied when using machine learning to estimate parameters from existing data. A prediction can only be as good as the data used for training. When there are very few training data available, other methods can sometimes perform better, especially if they make use of expert knowledge that was built into the algorithm directly.

The most important limitation of HMMs results from the Markov property itself. Because HMMs are memoryless, it is not possible to model dependencies between distant events. In practice this means that one cannot, for example, model the dependency between two nucleotides in an RNA sequence that form a base pair in the secondary structure but are separated by an unknown number of residues in the primary structure. There are other methods that can deal with such situations (see UNITS 12.1 & 12.5), but they increase the computational complexity significantly.

CONCLUSIONS

The number of applications of HMMs in the field of Bioinformatics is large and still growing. Sophisticated algorithms, complex models, and a specialized terminology sometimes give people the impression that HMMs are an overly complicated piece of machinery. It is the hope of the authors to have opened the "black box" of HMMs a little, shedding some light on what they do and how they achieve it. After all, HMMs can make life easier for bioinformaticians and the users of the programs they develop.

LITERATURE CITED

Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M.A. 1994. Hidden Markov models of biological primary sequence information. PNAS 91:1059-1063.

Baum, L.E. and Petrie, T. 1966. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. 37:1554-1563.

Birney, E., Clamp, M., and Durbin, R. 2004. GeneWise and Genomewise. Genome Res. 14:988-995.

Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78-94.

Durbin, R., Eddy, S.R., Krogh, A., and Mitchison, G. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, U.K.

Eddy, S.R. 1996. Hidden Markov models. Curr. Opin. Struct. Biol. 6:361-365.

Eddy, S.R. 2004. What is a hidden Markov model? Nat. Biotechnol. 22:1315-1316.

Kulp, D., Haussler, D., Reese, M.G., and Eeckman, F.H. 1996. A generalized hidden Markov model for the recognition of human genes in DNA. Proc. Int. Conf. Intell. Syst. Mol. Biol. 4:134-142.

Lukashin, A.V. and Borodovsky, M. 1998. GeneMark.hmm: New solutions for gene finding. Nucl. Acids Res. 26:1107-1115.

Madera, M. 2005. Hidden Markov models for detection of remote homology. PhD thesis, University of Cambridge, MRC Laboratory of Molecular Biology.

Meyer, I.M. and Durbin, R. 2004. Gene structure conservation aids similarity based gene prediction. Nucl. Acids Res. 32:776-783.

Rabiner, L.R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77:257-286.

Schuster-Böckler, B. and Bateman, A. 2005. Visualizing profile-profile alignment: Pairwise HMM logos. Bioinformatics 21:2912-2913.

Schuster-Böckler, B., Schultz, J., and Rahmann, S. 2004. HMM Logos for visualization of protein families. BMC Bioinformatics 5:7.

Söding, J. 2005. Protein homology detection by HMM–HMM comparison. Bioinformatics 21:951-960.

Contributed by Benjamin Schuster-Böckler and Alex Bateman
Wellcome Trust Sanger Institute
Hinxton, Cambridge, United Kingdom

