Introduction To Hidden Markov Models
Alperen Degirmenci
all i, j; otherwise A will have some zero-valued elements. Fig. 2 shows two state transition diagrams for a 2-state and a 3-state Markov chain. The conditional probability can be written as

P(Z_t | Z_{t−1}) = ∏_{i=1}^{K} ∏_{j=1}^{K} A_{ij}^{Z_{t−1,i} Z_{t,j}}.   (5)

Taking the logarithm, we can write this as

log P(Z_t | Z_{t−1}) = ∑_{i=1}^{K} ∑_{j=1}^{K} Z_{t−1,i} Z_{t,j} log A_{ij}   (6)
                     = Z_t^T log(A) Z_{t−1}.   (7)

4) Observation model, B: Also called the emission probabilities, B is an Ω × K matrix whose elements B_{kj} describe the probability of making observation X_{t,k} given state Z_{t,j}. This can be written as

B_{kj} = P(X_t = k | Z_t = j).   (8)

The conditional probability can be written as

P(X_t | Z_t) = ∏_{j=1}^{K} ∏_{k=1}^{Ω} B_{kj}^{Z_{t,j} X_{t,k}}.   (9)

Taking the logarithm, we can write this as

log P(X_t | Z_t) = ∑_{j=1}^{K} ∑_{k=1}^{Ω} Z_{t,j} X_{t,k} log B_{kj}   (10)
                 = X_t^T log(B) Z_t.   (11)

5) Initial state distribution, π: This is a K × 1 vector of probabilities π_i = P(Z_{1i} = 1). The conditional probability can be written as

P(Z_1 | π) = ∏_{i=1}^{K} π_i^{Z_{1i}}.   (12)

Given these five parameters presented above, an HMM can be completely specified. In the literature, this is often abbreviated as

λ = (A, B, π).   (13)

B. Three Problems of Interest

In [2], Rabiner states that for the HMM to be useful in real-world applications, the following three problems must be solved:

• Problem 1: Given observations X_1, . . . , X_N and a model λ = (A, B, π), how do we efficiently compute P(X_{1:N} | λ), the probability of the observations given the model? This is a part of the exact inference problem presented in the lecture notes, and can be solved using forward filtering.
• Problem 2: Given observations X_1, . . . , X_N and the model λ, how do we find the “correct” hidden state sequence Z_1, . . . , Z_N that best “explains” the observations? This corresponds to finding the most probable sequence of hidden states from the lecture notes, and can be solved using the Viterbi algorithm. A related problem is calculating the probability of being in state Z_{tk} given the observations, P(Z_t = k | X_{1:N}), which can be calculated using the forward-backward algorithm.
• Problem 3: How do we adjust the model parameters λ = (A, B, π) to maximize P(X_{1:N} | λ)? This corresponds to the learning problem presented in the lecture notes, and can be solved using the Expectation-Maximization (EM) algorithm (in the case of HMMs, this is called the Baum-Welch algorithm).

C. The Forward-Backward Algorithm

The forward-backward algorithm is a dynamic programming algorithm that makes use of message passing (belief propagation). It allows us to compute the filtered and smoothed marginals, which can then be used to perform inference, MAP estimation, sequence classification, anomaly detection, and model-based clustering. We will follow the derivation presented in Murphy [3].

[Fig. 3. Factor graph for a slice of the HMM at time t.]

1) The Forward Algorithm: In this part, we compute the filtered marginals, P(Z_t | X_{1:t}), using the predict-update cycle. The prediction step calculates the one-step-ahead predictive density,

P(Z_t = j | X_{1:t−1}) = ∑_{i=1}^{K} P(Z_t = j | Z_{t−1} = i) P(Z_{t−1} = i | X_{1:t−1}),   (14)

which acts as the new prior for time t. In the update step, the observed data from time t is absorbed using Bayes' rule:

α_t(j) ≜ P(Z_t = j | X_{1:t})
       = P(Z_t = j | X_t, X_{1:t−1})
       = P(X_t | Z_t = j, X_{1:t−1}) P(Z_t = j | X_{1:t−1}) / ∑_{j′} P(X_t | Z_t = j′, X_{1:t−1}) P(Z_t = j′ | X_{1:t−1})
       = (1/C_t) P(X_t | Z_t = j) P(Z_t = j | X_{1:t−1}),   (15)

where the conditioning on X_{1:t−1} drops out of the likelihood term because X_t is conditionally independent of the past given Z_t, and C_t is the normalization constant.
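To make the predict-update cycle of Eqs. (14) and (15) concrete, the following is a minimal NumPy sketch of forward filtering; it is not taken from this tutorial, and the function name forward_filter together with the toy values of A, B, π, and X are illustrative assumptions. Normalizing α_t at every step and accumulating log C_t also yields log P(X_{1:N} | λ), i.e., the quantity asked for in Problem 1.

import numpy as np

def forward_filter(X, A, B, pi):
    """Filtered marginals alpha[t, j] = P(Z_t = j | X_{1:t}).

    X  : (N,) array of observation indices in {0, ..., Omega - 1}
    A  : (K, K) transition matrix, A[i, j] = P(Z_t = j | Z_{t-1} = i)
    B  : (Omega, K) emission matrix, B[k, j] = P(X_t = k | Z_t = j)
    pi : (K,) initial state distribution
    """
    N, K = len(X), len(pi)
    alpha = np.zeros((N, K))
    log_lik = 0.0
    alpha[0] = B[X[0], :] * pi                # absorb the first observation into the prior pi
    C = alpha[0].sum()                        # normalization constant C_1
    alpha[0] /= C
    log_lik += np.log(C)
    for t in range(1, N):
        predicted = A.T @ alpha[t - 1]        # Eq. (14): one-step-ahead predictive density
        alpha[t] = B[X[t], :] * predicted     # Eq. (15): update with the local evidence
        C = alpha[t].sum()                    # normalization constant C_t
        alpha[t] /= C
        log_lik += np.log(C)
    return alpha, log_lik

# Toy model: 2 hidden states, 3 observation symbols (illustrative values only).
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])
pi = np.array([0.5, 0.5])
X = np.array([0, 0, 2, 1, 2])
alpha, log_lik = forward_filter(X, A, B, pi)  # alpha[t] = P(Z_t | X_{1:t}), log_lik = log P(X_{1:N} | lambda)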
Algorithm 1 Forward algorithm
1: Input: A, ψ_{1:N}, π
2: [α_1, C_1] = normalize(ψ_1 ⊙ π);
3: for t = 2 : N do
4:     [α_t, C_t] = normalize(ψ_t ⊙ (A^T α_{t−1}));
5: Return α_{1:N} and log P(X_{1:N}) = ∑_t log C_t
6: Sub: [α, C] = normalize(u): C = ∑_j u_j; α_j = u_j / C;

Algorithm 2 Backward algorithm
1: Input: A, ψ_{1:N}, α
2: β_N = 1;
3: for t = N − 1 : 1 do
4:     β_t = normalize(A(ψ_{t+1} ⊙ β_{t+1}));
5: γ = normalize(α ⊙ β, 1)
6: Return γ_{1:N}
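As a companion to the pseudo-code in Algorithms 1 and 2, here is a hedged NumPy sketch of the complete forward-backward pass that returns the smoothed marginals γ_t(j) = P(Z_t = j | X_{1:N}); the toy parameters repeat the illustrative values assumed earlier and are not from the paper. The backward messages are rescaled at each step purely for numerical stability, which leaves the normalized γ unchanged.

import numpy as np

def forward_backward(X, A, B, pi):
    """Smoothed marginals gamma[t, j] = P(Z_t = j | X_{1:N}),
    mirroring the message-passing structure of Algorithms 1 and 2."""
    N, K = len(X), len(pi)
    psi = B[X, :]                              # local evidence psi[t, j] = P(X_t | Z_t = j)
    # Forward pass (Algorithm 1): normalized filtered messages alpha_t.
    alpha = np.zeros((N, K))
    alpha[0] = psi[0] * pi
    alpha[0] /= alpha[0].sum()
    for t in range(1, N):
        alpha[t] = psi[t] * (A.T @ alpha[t - 1])
        alpha[t] /= alpha[t].sum()
    # Backward pass (Algorithm 2): messages beta_t, initialized to 1 at t = N.
    beta = np.ones((N, K))
    for t in range(N - 2, -1, -1):
        beta[t] = A @ (psi[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()               # rescaling only; does not affect gamma
    # Combine: gamma_t is proportional to the elementwise product alpha_t * beta_t.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

# Same illustrative toy model as before.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])
pi = np.array([0.5, 0.5])
X = np.array([0, 0, 2, 1, 2])
gamma = forward_backward(X, A, B, pi)          # gamma[t] = P(Z_t | X_{1:N})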
Algorithm 3 Viterbi algorithm
1: Input: X_{1:N}, K, A, B, π
2: Initialize: δ_1 = π ⊙ B_{X_1}, a_1 = 0;
3: for t = 2 : N do
4:     for j = 1 : K do
5:         [a_t(j), δ_t(j)] = max_i(log δ_{t−1}(:) + log A_{ij} + log B_{X_t}(j));
6: Z*_N = arg max(δ_N);
7: for t = N − 1 : 1 do
8:     Z*_t = a_{t+1}(Z*_{t+1});
9: Return Z*_{1:N}

Algorithm 4 Baum-Welch algorithm
1: Input: X_{1:N}, A, B, α, β
2: for t = 1 : N do
3:     γ(:, t) = (α(:, t) ⊙ β(:, t)) ./ sum(α(:, t) ⊙ β(:, t));
4:     ξ(:, :, t) = ((α(:, t) ⊙ A(t + 1)) ∗ (β(:, t + 1) ⊙ B(X_{t+1}))^T) ./ sum(α(:, t) ⊙ β(:, t));
5: π̂ = γ(:, 1) ./ sum(γ(:, 1));
6: for j = 1 : K do
7:     Â(j, :) = sum(ξ(2 : N, j, :), 1) ./ sum(sum(ξ(2 : N, j, :), 1), 2);
8:     B̂(j, :) = X(:, j)^T γ ./ sum(γ, 1);
9: Return π̂, Â, B̂
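For comparison with Algorithm 3, a minimal NumPy sketch of the Viterbi recursion (log-space scores δ, backpointer table a, and the final backtrace) might look as follows; the model parameters are again the illustrative assumptions used above, not values from the paper.

import numpy as np

def viterbi(X, A, B, pi):
    """Most probable hidden state sequence Z*_{1:N}, computed in log space."""
    N, K = len(X), len(pi)
    log_A, log_B, log_pi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((N, K))                   # delta[t, j]: best log score ending in state j
    a = np.zeros((N, K), dtype=int)            # a[t, j]: argmax backpointer
    delta[0] = log_pi + log_B[X[0], :]
    for t in range(1, N):
        # scores[i, j] = delta[t-1, i] + log A[i, j] + log B[X[t], j] (delta is already in log space)
        scores = delta[t - 1][:, None] + log_A + log_B[X[t], :][None, :]
        a[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    # Backtrace: Z*_N = argmax delta_N, then Z*_t = a_{t+1}(Z*_{t+1}).
    Z = np.zeros(N, dtype=int)
    Z[-1] = delta[-1].argmax()
    for t in range(N - 2, -1, -1):
        Z[t] = a[t + 1, Z[t + 1]]
    return Z

A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])
pi = np.array([0.5, 0.5])
X = np.array([0, 0, 2, 1, 2])
Z_star = viterbi(X, A, B, pi)                  # most probable state sequence under the toy model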
The pseudo-code in Algorithms 3 and 4 outlines the steps of these computations.
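Algorithm 4 packs the Baum-Welch updates into MATLAB-style matrix expressions with a one-hot observation matrix X; the sketch below spells out one EM iteration in NumPy for integer-coded observations instead. The function name baum_welch_step and the toy parameters are assumptions for illustration, and the unscaled α and β messages are only adequate for short sequences; a practical implementation would reuse the scaled quantities from Algorithms 1 and 2.

import numpy as np

def baum_welch_step(X, A, B, pi):
    """One Baum-Welch (EM) iteration; returns re-estimated (pi_hat, A_hat, B_hat)."""
    N, K = len(X), len(pi)
    Omega = B.shape[0]
    psi = B[X, :]                                   # local evidence psi[t, j] = P(X_t | Z_t = j)
    # E-step: unscaled forward/backward messages (fine for short toy sequences).
    alpha = np.zeros((N, K))
    beta = np.ones((N, K))
    alpha[0] = pi * psi[0]
    for t in range(1, N):
        alpha[t] = psi[t] * (A.T @ alpha[t - 1])
    for t in range(N - 2, -1, -1):
        beta[t] = A @ (psi[t + 1] * beta[t + 1])
    gamma = alpha * beta                            # proportional to P(Z_t = j | X_{1:N}); normalized next
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((N - 1, K, K))                    # xi[t, i, j] = P(Z_t = i, Z_{t+1} = j | X_{1:N})
    for t in range(N - 1):
        x = alpha[t][:, None] * A * (psi[t + 1] * beta[t + 1])[None, :]
        xi[t] = x / x.sum()
    # M-step: expected-count re-estimates of the parameters.
    pi_hat = gamma[0]
    A_hat = xi.sum(axis=0) / xi.sum(axis=0).sum(axis=1, keepdims=True)
    B_hat = np.zeros_like(B)
    for k in range(Omega):
        B_hat[k, :] = gamma[X == k].sum(axis=0)     # expected counts of emitting symbol k in each state
    B_hat /= gamma.sum(axis=0)[None, :]
    return pi_hat, A_hat, B_hat

# One update starting from the same illustrative toy model.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])
pi = np.array([0.5, 0.5])
X = np.array([0, 0, 2, 1, 2])
pi_hat, A_hat, B_hat = baum_welch_step(X, A, B, pi)  # iterate until P(X_{1:N} | lambda) converges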
References

[2] L. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[3] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press, 2012.