Computational Genomics: Hidden Markov Models (HMMs)
Lecture 10
© Ydo Wexler & Dan Geiger (Technion) and Nir Friedman (HU). Modified by Benny Chor (TAU)
Outline
Finite, or Discrete, Markov Models
Hidden Markov Models
Three major questions:
Q1: Computing the probability of a given observation.
A1: The Forward-Backward (Baum-Welch) dynamic programming algorithm.
Q2: Computing the most probable sequence of states, given an observation.
A2: Viterbi's dynamic programming algorithm.
Q3: Learning the best model, given an observation.
A3: Expectation Maximization (EM): a heuristic.
Markov Models
A discrete (finite) system:
N distinct states.
Discrete Markov Model: Example
Discrete Markov Model with 5 states.
Each aij represents the probability of moving from state i to state j.
The aij are given in a matrix A = {aij}.
The probability of starting in a given state i is πi; the vector π represents these start probabilities.
Markov Property
• Markov Property: The state of the system at time t+1
depends only on the state of the system at time t
$$P[X_{t+1} = x_{t+1} \mid X_t = x_t, X_{t-1} = x_{t-1}, \dots, X_1 = x_1, X_0 = x_0] = P[X_{t+1} = x_{t+1} \mid X_t = x_t]$$
Markov Chains
Stationarity Assumption
Probabilities independent of t when process is “stationary”
So, for all t, $P[X_{t+1} = x_j \mid X_t = x_i] = p_{ij}$
Simple Minded Weather Example
Simple Minded Weather Example
Transition matrix for our example
$$P = \begin{pmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{pmatrix}$$
• Note that rows sum to 1
• Such a matrix is called a Stochastic Matrix
• If both the rows and the columns of a matrix sum to 1, we have a Doubly Stochastic Matrix
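As a small illustration (a sketch, not from the slides; the rain/no-rain state labels are my assumption, since the slide only gives the matrix), the following Python snippet checks that P is row-stochastic and samples a trajectory using the Markov property:

```python
import numpy as np

# Transition matrix from the example; state 0 = rain, state 1 = no rain (assumed labels).
P = np.array([[0.4, 0.6],
              [0.2, 0.8]])

# A stochastic matrix: every row sums to 1.
assert np.allclose(P.sum(axis=1), 1.0)

def sample_trajectory(P, start, length, rng=np.random.default_rng(0)):
    """Sample a state sequence from a Markov chain with transition matrix P."""
    states = [start]
    for _ in range(length - 1):
        # Markov property: the next state depends only on the current one.
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_trajectory(P, start=0, length=10))
```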
Coke vs. Pepsi (a central cultural dilemma)
Given that a person’s last cola purchase was Coke ™,
there is a 90% chance that her next cola purchase will
also be Coke ™.
If that person’s last cola purchase was Pepsi™, there
is an 80% chance that her next cola purchase will also
be Pepsi™.
[Transition diagram: Coke → Coke 0.9, Coke → Pepsi 0.1; Pepsi → Pepsi 0.8, Pepsi → Coke 0.2]
Coke vs. Pepsi
Given that a person is currently a Pepsi purchaser,
what is the probability that she will purchase Coke
two purchases from now?
The transition matrix corresponding to one purchase ahead is
$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}$$
so two purchases ahead the transitions are given by
$$P^2 = \begin{pmatrix} 0.83 & 0.17 \\ 0.34 & 0.66 \end{pmatrix}$$
and the answer is the (Pepsi, Coke) entry: 0.34.
Coke vs. Pepsi
Assume each person makes one cola purchase per week.
Suppose 60% of all people now drink Coke, and 40% drink
Pepsi.
What fraction of people will be drinking Coke three weeks
from now?
We will regard Coke as 0 and Pepsi as 1, and let (Q0, Q1) = (0.6, 0.4) be the initial probabilities, with
$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}$$
We want to find P(X3 = 0):
$$P(X_3 = 0) = \sum_{i=0}^{1} Q_i\, p^{(3)}_{i0} = Q_0\, p^{(3)}_{00} + Q_1\, p^{(3)}_{10} = 0.6 \cdot 0.781 + 0.4 \cdot 0.438 = 0.6438$$
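A quick way to check this arithmetic (a minimal sketch, not part of the original slides): raise P to the third power and multiply by the initial distribution.

```python
import numpy as np

P = np.array([[0.9, 0.1],   # Coke -> Coke, Coke -> Pepsi
              [0.2, 0.8]])  # Pepsi -> Coke, Pepsi -> Pepsi
Q0 = np.array([0.6, 0.4])   # initial distribution: 60% Coke, 40% Pepsi

# Distribution after 3 weeks: Q0 @ P^3 (row-vector convention).
Q3 = Q0 @ np.linalg.matrix_power(P, 3)
print(Q3)  # ~ [0.6438, 0.3562], matching the slide's 0.6438
```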
Equilibrium (Stationary) Distribution
Suppose 60% of all people now drink Coke, and 40%
drink Pepsi. What fraction will be drinking Coke
10,100,1000,10000 … weeks from now?
For each week, probability is well defined. But does it
converge to some equilibrium distribution [p0,p1]?
If it does, then the equations $0.9 p_0 + 0.2 p_1 = p_0$ and $0.1 p_0 + 0.8 p_1 = p_1$
must hold, yielding $p_0 = 2/3$, $p_1 = 1/3$.
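The same equilibrium can be found numerically (a sketch, assuming the row-vector convention πP = π discussed below): solve the stationarity equations together with the normalization constraint.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Solve pi P = pi together with pi_0 + pi_1 = 1.
# Rearranged: pi (P - I) = 0, plus the normalization row.
A = np.vstack([(P - np.eye(2)).T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)  # ~ [0.6667, 0.3333], i.e. p0 = 2/3, p1 = 1/3
```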
Equilibrium (Stationary) Distribution
Whether or not there is a stationary distribution, and
whether or not it is unique if it does exist, are determined
by certain properties of the process. Irreducible means that
every state is accessible from every other state. Aperiodic
means that there exists at least one state for which the
transition from that state to itself is possible. Positive
recurrent means that the expected return time is finite for
every state.
http://en.wikipedia.org/wiki/Markov_chain
Equilibrium (Stationary) Distribution
If the Markov chain is positive recurrent, there
exists a stationary distribution. If it is positive
recurrent and irreducible, there exists a unique
stationary distribution, and furthermore the process
constructed by taking the stationary distribution as
the initial distribution is ergodic. Then the average
of a function f over samples of the Markov chain is
equal to the average with respect to the stationary
distribution.
http://en.wikipedia.org/wiki/Markov_chain
Equilibrium (Stationary) Distribution
Writing P for the transition matrix (whose rows sum to 1), a stationary
distribution is a row vector π which satisfies the equation
πP = π .
http://en.wikipedia.org/wiki/Markov_chain
Discrete Markov Model - Example
Matrix A –
Problem: given that the weather on day 1 (t=1) is sunny (state 3), what is the
probability of the observation O:
Discrete Markov Model – Example (cont.)
The answer is -
Types of Models
Ergodic model
Strongly connected: there is a directed path with positive probabilities
from each state i to each state j
(but the graph is not necessarily a complete directed graph)
Third Example: A Friendly Gambler
Game starts with $10 in the gambler's pocket
– At each round we have the following:
• Gambler wins 1$ with probability p
or
• Gambler loses 1$ with probability 1-p
– Game ends when the gambler goes broke (no sister in the bank),
or accumulates a capital of $100 (including the initial capital)
– Both $0 and $100 are absorbing states
[Chain diagram: states 0, 1, 2, …, N-1, N; each round moves one state to the right with probability p or one to the left with probability 1-p; 0 and N are absorbing]
Hidden Markov Models
(probabilistic finite state automata)
Often we face scenarios where states cannot be
directly observed.
We need an extension: Hidden Markov Models
[Graphical model: a chain of hidden variables H1 → H2 → … → Hi → … → HL-1 → HL, with transition probabilities aij on the arrows; each hidden Hi emits an observed variable Xi (observed data X1, X2, …, XL)]
Example: Dishonest Casino
Coin-Tossing Example
[HMM diagram: from Start, move to Fair or Loaded with probability 1/2 each.
Fair coin: head 1/2, tail 1/2. Loaded coin: head 3/4, tail 1/4.
Transitions: Fair → Fair 0.9, Fair → Loaded 0.1, Loaded → Loaded 0.9, Loaded → Fair 0.1.]
Over L tosses: hidden variables H1, H2, …, Hi, …, HL-1, HL take values Fair/Loaded;
observed variables X1, X2, …, Xi, …, XL-1, XL take values Head/Tail.
Loaded Coin Example
[Same fair/loaded coin HMM diagram as on the previous slide: over L tosses, hidden H1, …, HL (Fair/Loaded) and observed X1, …, XL (Head/Tail).]
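A minimal sketch (function and variable names are mine) that samples L tosses from this fair/loaded HMM, using the parameters in the diagram:

```python
import numpy as np

states = ["Fair", "Loaded"]
start = np.array([0.5, 0.5])          # Start -> Fair / Loaded
A = np.array([[0.9, 0.1],             # Fair -> Fair / Loaded
              [0.1, 0.9]])            # Loaded -> Fair / Loaded
B = np.array([[0.5, 0.5],             # Fair:   P(head), P(tail)
              [0.75, 0.25]])          # Loaded: P(head), P(tail)

def sample(L, rng=np.random.default_rng(1)):
    """Sample L tosses: returns the hidden Fair/Loaded path and the observed H/T string."""
    h = rng.choice(2, p=start)
    hidden, tosses = [], []
    for _ in range(L):
        hidden.append(states[h])
        tosses.append("H" if rng.random() < B[h, 0] else "T")
        h = rng.choice(2, p=A[h])     # hidden chain moves, observer sees only tosses
    return hidden, tosses

hidden, tosses = sample(20)
print("".join(tosses))
```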
C-G Islands Example
C-G islands: DNA parts which are very rich in C and G
[Figure: two 4-state transition diagrams over {A, G, T, C}, one labeled "regular DNA" and one labeled "C-G island", connected by "change" transitions; the transition probabilities are expressed in terms of p and q (e.g. q/4, (1-q)/6, (1-q)/3, p/6, p/3, (1-p)/4).]
Example: CpG islands
In the human genome, CG dinucleotides are relatively rare.
CG pairs undergo a process called methylation, after which the C has a relatively high chance of mutating to a T.
Promoter regions are CG rich.
These regions are not methylated, and thus their CG pairs are preserved.
CpG Islands
We construct a Markov chain for CpG-rich and for CpG-poor regions.
Using maximum likelihood estimates from 60,000 nucleotides, we get two models.
Ratio Test for CpG islands
Given a sequence X1,…,Xn we compute the
likelihood ratio
$$S(X_1,\dots,X_n) = \log \frac{P(X_1,\dots,X_n \mid +)}{P(X_1,\dots,X_n \mid -)} = \sum_i \log \frac{A^{+}_{X_i X_{i+1}}}{A^{-}_{X_i X_{i+1}}} = \sum_i \beta_{X_i X_{i+1}}$$
where + denotes the CpG-island model and - the non-island model.
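A sketch of the ratio test in code; the two transition matrices here are made-up placeholders, not the slides' maximum-likelihood estimates:

```python
import numpy as np

BASES = "ACGT"
idx = {b: i for i, b in enumerate(BASES)}

# Hypothetical transition matrices for the island (+) and non-island (-)
# models; in the slides these come from maximum-likelihood estimation.
A_plus  = np.full((4, 4), 0.25)
A_minus = np.full((4, 4), 0.25)
A_plus[idx["C"], idx["G"]]  = 0.40   # islands are CG-rich (made-up value)
A_plus[idx["C"]] /= A_plus[idx["C"]].sum()
A_minus[idx["C"], idx["G"]] = 0.05   # CG is rare outside islands (made-up value)
A_minus[idx["C"]] /= A_minus[idx["C"]].sum()

def score(seq):
    """Log-likelihood ratio S(X_1..X_n); positive suggests a CpG island."""
    return sum(np.log(A_plus[idx[a], idx[b]] / A_minus[idx[a], idx[b]])
               for a, b in zip(seq, seq[1:]))

print(score("ACGCGCGT"), score("ATATTAAT"))
```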
Empirical Evaluation
Finding CpG islands
Simple Minded approach:
Pick a window of size N and compute the ratio-test score inside it
Problems:
How do we select N?
A Different C-G Islands Model
[Figure: a combined model with two copies of the 4-state {A, G, T, C} diagram, one for regular DNA and one for the C-G island, with "change" transitions between them.]
Is each position inside a C-G island?
Hidden variables H1, H2, …, Hi, …, HL-1, HL indicate C-G island or not; observed variables X1, X2, …, Xi, …, XL-1, XL take values A/C/G/T.
HMM Recognition (question I)
For a given model M = {A, B, π} and a given state sequence Q1 Q2 Q3 … QT, the probability of an observation sequence O1 O2 O3 … OT is
$$P(O \mid Q, M) = b_{Q_1 O_1}\, b_{Q_2 O_2}\, b_{Q_3 O_3} \cdots b_{Q_T O_T}$$
For a given hidden Markov model M = {A, B, π}, the probability of the state sequence Q1 Q2 Q3 … QT is (the initial probability of Q1 is taken to be $\pi_{Q_1}$)
$$P(Q \mid M) = \pi_{Q_1}\, a_{Q_1 Q_2}\, a_{Q_2 Q_3}\, a_{Q_3 Q_4} \cdots a_{Q_{T-1} Q_T}$$
So, for a given HMM M, the probability of an observation sequence O1 O2 O3 … OT is obtained by summing over all possible state sequences.
HMM – Recognition (cont.)
$$P(O \mid M) = \sum_Q P(O \mid Q, M)\, P(Q \mid M) = \sum_{Q_1,\dots,Q_T} \pi_{Q_1} b_{Q_1 O_1}\, a_{Q_1 Q_2} b_{Q_2 O_2}\, a_{Q_2 Q_3} b_{Q_3 O_3} \cdots$$
HMM – Recognition (cont.)
Why isn’t it efficient? – O(2LQL)
For a given state sequence of length L we have
about 2L calculations
P(Q|M) = Q aQ Q aQ Q aQ Q … aQ
1 1 2 2 3 3 4 T-1 QT
P(O|Q) = bQ O bQ O bQ O … bQ O
1 1 2 2 3 3 T T
The Forward Backward Algorithm
A white board presentation.
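Since the algorithm itself is developed on the whiteboard, here is a minimal forward-pass sketch (my notation: π start probabilities, A transitions, B emissions) computing P(O | M) in O(T · Q²) time; in practice one rescales f at each step, or works in log space, to avoid underflow:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns P(O | M) and the final forward messages f.

    pi : (Q,)   start probabilities
    A  : (Q, Q) transition probabilities, A[i, j] = P(next = j | current = i)
    B  : (Q, V) emission probabilities,   B[i, o] = P(obs = o | state = i)
    obs: list of observation indices of length T
    """
    f = pi * B[:, obs[0]]          # f_k(1) = pi_k * b_k(O_1)
    for o in obs[1:]:
        f = (f @ A) * B[:, o]      # f_l(t+1) = (sum_k f_k(t) a_kl) * b_l(O_{t+1})
    return f.sum(), f              # P(O | M) = sum_k f_k(T)

# Fair/loaded coin example: states (Fair, Loaded), observations 0 = head, 1 = tail.
pi = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.1, 0.9]])
B  = np.array([[0.5, 0.5], [0.75, 0.25]])
prob, _ = forward(pi, A, B, [0, 0, 1, 0])
print(prob)
```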
The F-B Algorithm (cont.)
Option 1) The likelihood is measured using any sequence of states of length T.
This is known as the "Any Path" Method.
HMM – Question II (Harder)
Given an observation sequence, O = (O1 O2 … OT),
and a model, M = {A, B, p }, how do we efficiently
compute the most probable sequence(s) of states,
Q?
Namely, the sequence of states Q = (Q1 Q2 … QT) which maximizes P(O, Q | M), the
probability that the given model M produces the given observation O while it goes
through the specific sequence of states Q.
Recall that given a model M, a sequence of
observations O, and a sequence of states Q, we
can efficiently compute P(O|Q,M) (should watch out
for numeric underflows)
Most Probable States Sequence (Q. II)
Idea:
If we know the identity of Qi, then the most probable sequence of states on
i+1, …, n does not depend on observations before time i.
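A sketch of Viterbi's dynamic program in log space (which sidesteps the numeric underflows mentioned above); the (π, A, B) values reuse the fair/loaded coin example:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable state sequence Q for observations obs (computed in log space)."""
    logA, logB = np.log(A), np.log(B)
    v = np.log(pi) + logB[:, obs[0]]      # best log-prob of a path ending in each state
    back = []                             # backpointers, one array per step
    for o in obs[1:]:
        scores = v[:, None] + logA        # scores[k, l]: come from k, move to l
        back.append(scores.argmax(axis=0))
        v = scores.max(axis=0) + logB[:, o]
    # Trace the best path backwards through the stored pointers.
    q = [int(v.argmax())]
    for bp in reversed(back):
        q.append(int(bp[q[-1]]))
    return q[::-1]

pi = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.1, 0.9]])
B  = np.array([[0.5, 0.5], [0.75, 0.25]])
print(viterbi(pi, A, B, [0, 0, 0, 0, 1]))  # 0 = Fair, 1 = Loaded
```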
Dishonest Casino (again)
Computing posterior probabilities for “fair” at each
point in a long sequence:
HMM – Question III (Hardest)
Given an observation sequence O = (O1 O2 … OT), and
a class of models, each of the form M = {A,B,p}, which
specific model “best” explains the observations?
A solution to question I enables the efficient
computation of P(O|M) (the probability that a specific
model M produces the observation O).
Question III can be viewed as a learning problem: We
want to use the sequence of observations in order to
“train” an HMM and learn the optimal underlying model
parameters (transition and output probabilities).
Learning
Given a sequence of observations x1, …, xn (the hidden states h1, …, hn are not observed),
how do we learn Akl and Bka?
Problem:
The counts are inaccessible, since we do not observe the hi.
If we have Akl and Bka we can compute
$$P(H_i = k, H_{i+1} = l \mid x_1,\dots,x_n) = \frac{P(H_i = k, H_{i+1} = l, x_1,\dots,x_n)}{P(x_1,\dots,x_n)}$$
$$= \frac{P(H_i = k, x_1,\dots,x_i)\; A_{kl}\; P(x_{i+1} \mid H_{i+1} = l)\; P(x_{i+2},\dots,x_n \mid H_{i+1} = l)}{P(x_1,\dots,x_n)}$$
$$= \frac{f_k(i)\; A_{kl}\; B_{l x_{i+1}}\; b_l(i+1)}{P(x_1,\dots,x_n)}$$
where f denotes the forward messages and b the backward messages.
Expected Counts
We can compute the expected number of times hi = k and hi+1 = l:
$$E[N_{kl}] = \sum_i P(H_i = k, H_{i+1} = l \mid x_1,\dots,x_n)$$
Similarly,
$$E[N_{ka}] = \sum_{i \,:\, x_i = a} P(H_i = k \mid x_1,\dots,x_n)$$
Expectation Maximization (EM)
Choose initial Akl and Bka.
E-step:
Compute the expected counts E[Nkl] and E[Nka].
M-step:
Re-estimate:
$$A'_{kl} = \frac{E[N_{kl}]}{\sum_{l'} E[N_{kl'}]} \qquad\qquad B'_{ka} = \frac{E[N_{ka}]}{\sum_{a'} E[N_{ka'}]}$$
Reiterate.
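A compact sketch of one EM (Baum-Welch) iteration implementing these updates; it omits the rescaling needed for long sequences, and all names and example parameters are mine:

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Forward messages f, backward messages b, and P(O | M)."""
    Q, L = len(pi), len(obs)
    f = np.zeros((L, Q)); b = np.zeros((L, Q))
    f[0] = pi * B[:, obs[0]]
    for t in range(1, L):
        f[t] = (f[t - 1] @ A) * B[:, obs[t]]
    b[-1] = 1.0
    for t in range(L - 2, -1, -1):
        b[t] = A @ (B[:, obs[t + 1]] * b[t + 1])
    return f, b, f[-1].sum()

def em_step(pi, A, B, obs):
    """One E-step + M-step; returns the re-estimated (A', B')."""
    f, b, px = forward_backward(pi, A, B, obs)
    Q, V, L = A.shape[0], B.shape[1], len(obs)
    # E-step: expected transition and emission counts.
    Nkl = np.zeros((Q, Q)); Nka = np.zeros((Q, V))
    for t in range(L - 1):
        # P(H_t = k, H_{t+1} = l | x) = f_k(t) A_kl B_{l,x_{t+1}} b_l(t+1) / P(x)
        Nkl += np.outer(f[t], B[:, obs[t + 1]] * b[t + 1]) * A / px
    gamma = f * b / px               # gamma[t, k] = P(H_t = k | x_1..x_n)
    for t in range(L):
        Nka[:, obs[t]] += gamma[t]
    # M-step: normalize each row of expected counts.
    return (Nkl / Nkl.sum(axis=1, keepdims=True),
            Nka / Nka.sum(axis=1, keepdims=True))

pi = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.1, 0.9]])
B  = np.array([[0.5, 0.5], [0.75, 0.25]])
A2, B2 = em_step(pi, A, B, [0, 0, 1, 0, 0, 0, 1, 0])
print(A2); print(B2)
```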
EM - basic properties
$$P(x_1,\dots,x_n \,;\, A_{kl}, B_{ka}) \;\le\; P(x_1,\dots,x_n \,;\, A'_{kl}, B'_{ka})$$
The likelihood grows in each iteration.
Complexity of E-step
Compute forward and backward messages
Time complexity: O(nL²); space complexity: O(nL) (n = sequence length, L = number of hidden values)
EM - problems
Local Maxima:
Learning can get stuck in local maxima
Sensitive to initialization
Choosing L:
We often do not know how many hidden values we should have
Communication Example