Cis262 HMM

Chapter 4

Hidden Markov Models (HMMs)

4.1 Definition of a Hidden Markov Model (HMM)

There is a variant of the notion of DFA with output, for
example a transducer such as a gsm (generalized sequential
machine), which is widely used in machine learning.

This machine model is known as a hidden Markov model,
for short HMM.

There are three new twists compared to traditional gsm
models:
(1) There is a finite set of states Q with n elements, a
bijection σ : Q → {1, . . . , n}, and the transitions
between states are labeled with probabilities rather
than symbols from an alphabet. For any two states
p and q in Q, the edge from p to q is labeled with a
probability A(i, j), with i = σ(p) and j = σ(q).
The probabilities A(i, j) form an n × n matrix A =
(A(i, j)).
(2) There is a finite set O of size m (called the observa-
tion space) of possible outputs that can be emitted,
a bijection ω : O → {1, . . . , m}, and for every state
q ∈ Q, there is a probability B(i, j) that output
O ∈ O is emitted (produced), with i = σ(q) and
j = ω(O).
The probabilities B(i, j) form an n × m matrix B =
(B(i, j)).
(3) Sequences of outputs O = (O1, . . . , OT ) (with Ot ∈
O for t = 1, . . . , T ) emitted by the model are di-
rectly observable, but the sequences of states S =
(q1, . . . , qT ) (with qt ∈ Q for t = 1, . . . , T ) that
caused some sequence of output to be emitted are
not observable.

In this sense the states are hidden, and this is the
reason for calling this model a hidden Markov model.

Example 4.1. Say we consider the following behavior
of some professor at some university.

On a hot day (denoted by Hot), the professor comes to
class with a drink (denoted D) with probability 0.7, and
with no drink (denoted N) with probability 0.3.

On the other hand, on a cold day (denoted Cold), the
professor comes to class with a drink with probability
0.2, and with no drink with probability 0.8.

Suppose a student intrigued by this behavior recorded
a sequence showing whether the professor came to class
with a drink or not, say NNND.

Several months later, the student would like to know
whether the weather was hot or cold on the days he recorded
the drinking behavior of the professor.

Now the student heard about machine learning, so he
constructs a probabilistic (hidden Markov) model of the
weather.

Based on some experiments, he determines the probability
of going from a hot day to another hot day to be
0.75, the probability of going from a hot day to a cold
day to be 0.25, the probability of going from a cold day to
another cold day to be 0.7, and the probability of going
from a cold day to a hot day to be 0.3.

He also knows that when he started his observations, it
was a cold day with probability 0.45, and a hot day with
probability 0.55.

The above data determine an HMM depicted in Figure 4.1.

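[State diagram: start → Cold 0.45, start → Hot 0.55;
Cold → Cold 0.7, Cold → Hot 0.3, Hot → Hot 0.75, Hot → Cold 0.25;
Cold emits N with probability 0.8 and D with probability 0.2;
Hot emits N with probability 0.3 and D with probability 0.7.]

Figure 4.1: Example of an HMM modeling the “drinking behavior” of a professor at the
University of Pennsylvania.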

In this example, the set of states is Q = {Cold, Hot}, and
the set of outputs is O = {N, D}.

We have the bijection σ : {Cold, Hot} → {1, 2} given
by σ(Cold) = 1 and σ(Hot) = 2, and the bijection
ω : {N, D} → {1, 2} given by ω(N) = 1 and ω(D) = 2.

The portion of the state diagram involving the states
Cold and Hot is analogous to an NFA in which the
transition labels are probabilities; it is the underlying Markov
model of the HMM.

For any given state, the probabilities of the outgoing edges
sum to 1.

The start state is a convenient way to express the
probabilities of starting either in state Cold or in state Hot.

Also, from each of the states Cold and Hot, we have
emission probabilities of producing the output N or D, and
these probabilities also sum to 1.

We can also express these data using matrices.

The matrix

\[ A = \begin{pmatrix} 0.7 & 0.3 \\ 0.25 & 0.75 \end{pmatrix} \]

describes the transitions of the Markov model, the vector

\[ \pi = \begin{pmatrix} 0.45 \\ 0.55 \end{pmatrix} \]

describes the probabilities of starting either in state Cold
or in state Hot, and the matrix

\[ B = \begin{pmatrix} 0.8 & 0.2 \\ 0.3 & 0.7 \end{pmatrix} \]

describes the emission probabilities.
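
To make the roles of π, A, and B concrete, here is a minimal sketch of how such a model generates data. It is only an illustration: the names `STATES`, `OUTPUTS`, and `sample_hmm` are hypothetical, indices are 0-based (the text uses 1 and 2), and the fixed seed is an arbitrary choice.

```python
import random

STATES = ["Cold", "Hot"]      # index 0 = Cold, 1 = Hot (the text uses 1 and 2)
OUTPUTS = ["N", "D"]          # index 0 = N, 1 = D

pi = [0.45, 0.55]             # initial state probabilities
A = [[0.70, 0.30],            # A[i][j]: probability of moving from state i to j
     [0.25, 0.75]]
B = [[0.8, 0.2],              # B[i][j]: probability that state i emits output j
     [0.3, 0.7]]


def sample_hmm(length, rng):
    """Draw (states, outputs) of the given length from the model above."""
    states, outputs = [], []
    state = rng.choices(range(len(STATES)), weights=pi)[0]
    for _ in range(length):
        states.append(STATES[state])
        out = rng.choices(range(len(OUTPUTS)), weights=B[state])[0]
        outputs.append(OUTPUTS[out])
        # Pick the next state using row `state` of A.
        state = rng.choices(range(len(STATES)), weights=A[state])[0]
    return states, outputs


rng = random.Random(0)        # fixed seed so that runs are repeatable
hidden, observed = sample_hmm(4, rng)
print("observed:", "".join(observed))   # what the student can record
print("hidden:  ", hidden)              # what the student cannot see
```

Only the `observed` string corresponds to what the student wrote down; the `hidden` list is exactly the information the decoding problem tries to recover.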



The student would like to solve what is known as the
decoding problem.

Namely, given the output sequence NNND, find the
most likely state sequence of the Markov model that
produces the output sequence NNND.

Is it (Cold, Cold, Cold, Cold), or (Hot, Hot, Hot, Hot), or
(Hot, Cold, Cold, Hot), or (Cold, Cold, Cold, Hot)?

Given the probabilities of the HMM, it seems unlikely
that it is (Hot, Hot, Hot, Hot), but how can we find the
most likely one?

Before going any further, we wish to address a notational
issue.

The issue is how to denote the states, the outputs, as
well as (ordered) sequences of states and sequences of
outputs.

In most problems, states and outputs have “meaningful”
names.

For example, if we wish to describe the evolution of the
temperature from day to day, it makes sense to use two
states “Cold” and “Hot,” and to describe whether a given
individual has a drink by “D,” and no drink by “N.”

Thus our set of states is Q = {Cold, Hot}, and our set of
outputs is O = {N, D}.

However, when computing probabilities, we need to use
matrices whose rows and columns are indexed by positive
integers, so we need a mechanism to associate a numerical
index to every state and to every output, and this
is the purpose of the bijections σ : Q → {1, . . . , n} and
ω : O → {1, . . . , m}.

In our example, we define σ by σ(Cold) = 1 and σ(Hot) =
2, and ω by ω(N) = 1 and ω(D) = 2.

Some authors circumvent (or do they?) this notational
issue by assuming that the set of outputs is O = {1, 2, . . .,
m}, and that the set of states is Q = {1, 2, . . . , n}.

The disadvantage of doing this is that in “real” situations,
it is often more convenient to name the outputs and the
states with more meaningful names than 1, 2, 3 etc.

Warning: The task of naming the elements of the
output alphabet can be challenging, for example in speech
recognition.

Let us now turn to sequences.

For example, consider the sequence of six states (from the
set Q = {Cold, Hot}),

S = (Cold, Cold, Hot, Cold, Hot, Hot).

Using the bijection σ : {Cold, Hot} → {1, 2} defined above,
the sequence S is completely determined by the sequence
of indices

σ(S) = (σ(Cold), σ(Cold), σ(Hot), σ(Cold), σ(Hot), σ(Hot))
= (1, 1, 2, 1, 2, 2).

More generally, we will denote a sequence of length T of
states from a set Q of size n by
S = (q1, q2, . . . , qT ),
with qt ∈ Q for t = 1, . . . , T .

Using the bijection σ : Q → {1, . . . , n}, the sequence S
is completely determined by the sequence of indices

σ(S) = (σ(q1), σ(q2), . . . , σ(qT )),

where σ(q_t) is some index from the set {1, . . . , n}, for
t = 1, . . . , T .

The problem now is, what is a better notation for the
index denoted by σ(q_t)?

Of course, we could use σ(q_t), but this is a heavy notation,
so we adopt the notational convention to denote the
index σ(q_t) by i_t.

Remark: We contemplated using the notation σ_t for
σ(q_t) instead of i_t. However, we feel that this would
deviate too much from the common practice found in the
literature, which uses the notation i_t.

Going back to our example

S = (q1, q2, q3, q4, q5, q6) = (Cold, Cold, Hot, Cold,
Hot, Hot),

we have

σ(S) = (σ(q1), σ(q2), σ(q3), σ(q4), σ(q5), σ(q6))
= (1, 1, 2, 1, 2, 2),

so the sequence of indices
(i1, i2, i3, i4, i5, i6) = (σ(q1), σ(q2), σ(q3), σ(q4), σ(q5),
σ(q6)) is given by

σ(S) = (i1, i2, i3, i4, i5, i6) = (1, 1, 2, 1, 2, 2).

So, the fourth index i4 has the value 1.

We apply a similar convention to sequences of outputs.

For example, consider the sequence of six outputs (from
the set O = {N, D}),

O = (N, D, N, N, N, D).

Using the bijection ω : {N, D} → {1, 2} defined above,
the sequence O is completely determined by the sequence
of indices

ω(O) = (ω(N), ω(D), ω(N), ω(N), ω(N), ω(D))
= (1, 2, 1, 1, 1, 2).

More generally, we will denote a sequence of length T of
outputs from a set O of size m by

O = (O1, O2, . . . , OT ),

with Ot ∈ O for t = 1, . . . , T .

Using the bijection ω : O → {1, . . . , m}, the sequence O
is completely determined by the sequence of indices

ω(O) = (ω(O1), ω(O2), . . . , ω(OT )),

where ω(Ot) is some index from the set {1, . . . , m}, for
t = 1, . . . , T .

This time, we adopt the notational convention to
denote the index ω(Ot) by ωt.

Going back to our example

O = (O1, O2, O3, O4, O5, O6) = (N, D, N, N, N, D),

we have

ω(O) = (ω(O1), ω(O2), ω(O3), ω(O4), ω(O5), ω(O6))
= (1, 2, 1, 1, 1, 2),

so the sequence of indices
(ω1, ω2, ω3, ω4, ω5, ω6) = (ω(O1), ω(O2), ω(O3), ω(O4),
ω(O5), ω(O6)) is given by

ω(O) = (ω1, ω2, ω3, ω4, ω5, ω6) = (1, 2, 1, 1, 1, 2).
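
The index conventions above can be sketched in a few lines of code. This is only an illustration, assuming the bijections are stored as dictionaries; the names `sigma`, `omega`, `sigma_inv`, `omega_inv`, and `to_indices` are hypothetical.

```python
# sigma and omega as dictionaries; indices start at 1 as in the text.
sigma = {"Cold": 1, "Hot": 2}
omega = {"N": 1, "D": 2}

# Inverse maps recover the names from the indices.
sigma_inv = {index: name for name, index in sigma.items()}
omega_inv = {index: name for name, index in omega.items()}


def to_indices(sequence, bijection):
    """Apply a bijection element-wise to a sequence of names."""
    return tuple(bijection[item] for item in sequence)


S = ("Cold", "Cold", "Hot", "Cold", "Hot", "Hot")
O = ("N", "D", "N", "N", "N", "D")

i = to_indices(S, sigma)      # (i_1, ..., i_6)
w = to_indices(O, omega)      # (omega_1, ..., omega_6)
print(i)
print(w)
print(i[4 - 1])               # i_4; Python tuples are 0-based
```

Because σ and ω are bijections, nothing is lost in the translation: applying `sigma_inv` to each index gives back the original sequence of state names.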

HMMs are among the most effective tools to solve the
following types of problems:

(1) DNA and protein sequence alignment in the
face of mutations and other kinds of evolutionary change.
(2) Speech understanding systems, also called
automatic speech recognition. When we talk,
our mouths produce sequences of sounds from the
sentences that we want to say. This process is complex.

Multiple words may map to the same sound, words
are pronounced differently as a function of the word
before and after them, we all form sounds slightly
differently, and so on.

All a listener (perhaps a computer system) can hear
is the sequence of sounds, and the listener would like
to reconstruct the mapping (backward) in order to
determine what words we were attempting to say.

For example, when you “talk to your TV” to pick a
program, say Game of Thrones, you don’t want to get
Jessica Jones.

(3) Optical character recognition (OCR). When
we write, our hands map from an idealized symbol to
some set of marks on a page (or screen).

The marks are observable, but the process that
generates them isn’t.

A system performing OCR, such as a system used by
the post office to read addresses, must discover which
word is most likely to correspond to the mark it reads.

The reader should review Example 4.1 illustrating the
notion of HMM.

Let us consider another example taken from Stamp [?].

Example 4.2. Suppose we want to determine the
average annual temperature at a particular location over a
series of years in a distant past where thermometers did
not exist.

Since we can’t go back in time, we look for indirect
evidence of the temperature, say in terms of the size of tree
growth rings.

For simplicity, assume that we consider the two
temperatures Cold and Hot, and three different sizes of tree rings:
small, medium and large, which we denote by S, M, L.

In this example, the set of states is Q = {Cold, Hot}, and
the set of outputs is O = {S, M, L}.

We have the bijection σ : {Cold, Hot} → {1, 2} given
by σ(Cold) = 1 and σ(Hot) = 2, and the bijection
ω : {S, M, L} → {1, 2, 3} given by ω(S) = 1, ω(M) = 2,
and ω(L) = 3.

The HMM shown in Figure 4.2 is a model of the situation.

[State diagram: start → Cold 0.4, start → Hot 0.6;
Cold → Cold 0.6, Cold → Hot 0.4, Hot → Hot 0.7, Hot → Cold 0.3;
Cold emits S, M, L with probabilities 0.7, 0.2, 0.1;
Hot emits S, M, L with probabilities 0.1, 0.4, 0.5.]

Figure 4.2: Example of an HMM modeling the temperature in terms of tree growth rings.

Suppose we observe the sequence of tree growth rings
(S, M, S, L).

What is the most likely sequence of temperatures over a
four-year period which yields the observations
(S, M, S, L)?

Going back to Example 4.1, which corresponds to the
HMM graph shown in Figure 4.3, we need to figure out
the probability that a sequence of states S = (q1, q2, . . . , qT )
produces the output sequence O = (O1, O2, . . . , OT ).

[State diagram: the same model as in Figure 4.1.]

Figure 4.3: Example of an HMM modeling the “drinking behavior” of a professor at the
University of Pennsylvania.

Then the probability that we want is just the product
of the probability that we begin with state q1, times the
product of the probabilities of each of the transitions,
times the product of the emission probabilities.

With our notational conventions, σ(q_t) = i_t and ω(O_t) =
ω_t, so we have

\[ \Pr(S, O) = \pi(i_1)\, B(i_1, \omega_1) \prod_{t=2}^{T} A(i_{t-1}, i_t)\, B(i_t, \omega_t). \]

In our example, ω(O) = (ω1, ω2, ω3, ω4) = (1, 1, 1, 2),
which corresponds to NNND.

The brute-force method is to compute these probabilities
for all 2^4 = 16 sequences of states of length 4 (in general,
there are n^T sequences of length T ).

For example, for the sequence S = (Cold, Cold, Cold, Hot),
associated with the sequence of indices
σ(S) = (i1, i2, i3, i4) = (1, 1, 1, 2), we find that

Pr(S, NNND) = π(1)B(1, 1)A(1, 1)B(1, 1)A(1, 1)B(1, 1)A(1, 2)B(2, 2)
= 0.45 × 0.8 × 0.7 × 0.8 × 0.7 × 0.8
× 0.3 × 0.7 = 0.0237.
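
The brute-force method mentioned above can be sketched as follows. This is an illustrative sketch rather than part of the original notes: indices are 0-based (0 = Cold or N, 1 = Hot or D), and the function names `joint_probability` and `brute_force_decode` are hypothetical.

```python
from itertools import product

STATES = ["Cold", "Hot"]
pi = [0.45, 0.55]
A = [[0.70, 0.30], [0.25, 0.75]]
B = [[0.8, 0.2], [0.3, 0.7]]


def joint_probability(state_idx, output_idx):
    """Pr(S, O): start probability, then one transition and one emission per step."""
    prob = pi[state_idx[0]] * B[state_idx[0]][output_idx[0]]
    for t in range(1, len(state_idx)):
        prob *= A[state_idx[t - 1]][state_idx[t]] * B[state_idx[t]][output_idx[t]]
    return prob


def brute_force_decode(output_idx):
    """Try all n^T state sequences and keep the most probable one."""
    candidates = product(range(len(STATES)), repeat=len(output_idx))
    best = max(candidates, key=lambda s: joint_probability(s, output_idx))
    return [STATES[i] for i in best], joint_probability(best, output_idx)


nnnd = [0, 0, 0, 1]                                      # N, N, N, D
print(round(joint_probability([0, 0, 0, 1], nnnd), 4))   # Cold, Cold, Cold, Hot
print(brute_force_decode(nnnd))
```

Enumerating every sequence is fine for T = 4, but the number of candidates grows as n^T, which is why a better method is needed for longer observation sequences.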

A much more efficient way to proceed is to use a method
based on dynamic programming.

Recall the bijection σ : {Cold, Hot} → {1, 2}, so that we
will refer to the state Cold as 1, and to the state Hot as
2.

For t = 1, 2, 3, 4, for every state i = 1, 2, we compute
score(i, t) to be the highest probability that a sequence
of length t ending in state i produces the output
sequence (O1, . . . , Ot), and for t ≥ 2, we let pred(i, t) be
the state that precedes i in a best sequence of length t
ending in i.

Initially, we set
score(j, 1) = π(j)B(j, ω1), j = 1, 2,
and since ω1 = 1 we get score(1, 1) = 0.45 × 0.8 = 0.36
and score(2, 1) = 0.55 × 0.3 = 0.165.

Next we compute score(1, 2) and score(2, 2) as follows.

For j = 1, 2, for i = 1, 2, compute temporary scores
tscore(i, j) = score(i, 1)A(i, j)B(j, ω2);
then pick the best of the temporary scores,
score(j, 2) = max_i tscore(i, j).

Since ω2 = 1, we get tscore(1, 1) = 0.36 × 0.7 × 0.8 =
0.2016, tscore(2, 1) = 0.165 × 0.25 × 0.8 = 0.0330, and
tscore(1, 2) = 0.36 × 0.3 × 0.3 = 0.0324, tscore(2, 2) =
0.165 × 0.75 × 0.3 = 0.0371.

Then
score(1, 2) = max{tscore(1, 1), tscore(2, 1)}
= max{0.2016, 0.0330} = 0.2016,

and

score(2, 2) = max{tscore(1, 2), tscore(2, 2)}
= max{0.0324, 0.0371} = 0.0371.

Since the state that leads to the optimal score score(1, 2)
is 1, we let pred(1, 2) = 1, and since the state that leads
to the optimal score score(2, 2) is 2, we let pred(2, 2) = 2.

We compute score(1, 3) and score(2, 3) in a similar way.

For j = 1, 2, for i = 1, 2, compute
tscore(i, j) = score(i, 2)A(i, j)B(j, ω3);
then pick the best of the temporary scores,
score(j, 3) = max_i tscore(i, j).

Since ω3 = 1, we get

score(1, 3) = max{0.1129, 0.0074} = 0.1129,

and

score(2, 3) = max{0.0181, 0.0083} = 0.0181.

We also get pred(1, 3) = 1 and pred(2, 3) = 1.
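
The four temporary scores behind these two maxima are not written out above. Assuming the same recurrence as for t = 2, and starting from the rounded values score(1, 2) = 0.2016 and score(2, 2) = 0.0371, they work out as follows (values rounded to four decimals):

```latex
\begin{align*}
\mathit{tscore}(1,1) &= 0.2016 \times 0.7 \times 0.8 \approx 0.1129, &
\mathit{tscore}(2,1) &= 0.0371 \times 0.25 \times 0.8 \approx 0.0074, \\
\mathit{tscore}(1,2) &= 0.2016 \times 0.3 \times 0.3 \approx 0.0181, &
\mathit{tscore}(2,2) &= 0.0371 \times 0.75 \times 0.3 \approx 0.0083.
\end{align*}
```

In both columns the larger value comes from state 1 (Cold), which is why both predecessors at t = 3 are equal to 1.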

Finally, we compute score(1, 4) and score(2, 4) in a
similar way.

For j = 1, 2, for i = 1, 2, compute
tscore(i, j) = score(i, 3)A(i, j)B(j, ω4);
then pick the best of the temporary scores,
score(j, 4) = max_i tscore(i, j).

Since ω4 = 2, we get
score(1, 4) = max{0.0158, 0.0009} = 0.0158,
and
score(2, 4) = max{0.0237, 0.0095} = 0.0237,
and pred(1, 4) = 1 and pred(2, 4) = 1.

Since max{score(1, 4), score(2, 4)} = 0.0237, the state
with the maximum score is Hot, and by following the
predecessor list (also called backpointer list), we find the
most likely state sequence to produce the sequence NNND
to be (Cold, Cold, Cold, Hot).

The stages of the computations of score(j, t) for j =
1, 2 and t = 1, 2, 3, 4 can be recorded in the following
diagram called a lattice or a trellis (which means lattice
in French!):

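[Trellis diagram. The nodes carry the scores score(j, t):]

        t = 1     t = 2     t = 3     t = 4
Cold    0.36      0.2016    0.1129    0.0158
Hot     0.1650    0.0371    0.0181    0.0237

[The edges carry the temporary scores:]
Into t = 2: Cold → Cold 0.2016, Hot → Cold 0.033, Cold → Hot 0.0324, Hot → Hot 0.0371.
Into t = 3: Cold → Cold 0.1129, Hot → Cold 0.0074, Cold → Hot 0.0181, Hot → Hot 0.0083.
Into t = 4: Cold → Cold 0.0158, Hot → Cold 0.0009, Cold → Hot 0.0237, Hot → Hot 0.0095.

[Predecessor edges, drawn as double arrows in the figure: Cold → Cold and Hot → Hot
into t = 2; Cold → Cold and Cold → Hot into t = 3; Cold → Cold and Cold → Hot into t = 4.]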

Double arrows represent the predecessor edges.

For example, the predecessor pred(2, 3) of the third node
on the bottom row labeled with the score 0.0181 (which
corresponds to Hot), is the second node on the first row
labeled with the score 0.2016 (which corresponds to Cold).

The two incoming arrows to the third node on the bottom
row are labeled with the temporary scores 0.0181 and
0.0083.

The node with the highest score at time t = 4 is Hot,
with score 0.0237 (shown in bold), and by following the
double arrows backward from this node, we obtain the
most likely state sequence (Cold, Cold, Cold, Hot).

The method we just described is known as the Viterbi
algorithm.

Definition 4.1. A hidden Markov model, for short
HMM, is a quintuple M = (Q, O, π, A, B) where
• Q is a finite set of states with n elements, and there
is a bijection σ : Q → {1, . . . , n}.
• O is a finite output alphabet (also called set of
possible observations) with m observations, and there is
a bijection ω : O → {1, . . . , m}.
• π = (π(1), . . . , π(n)) is the vector of initial state
probabilities, with π(i) ≥ 0 for i = 1, . . . , n, and
π(1) + · · · + π(n) = 1.
• A = (A(i, j)) is an n × n matrix called the state
transition probability matrix, with
\[ A(i, j) \ge 0, \quad 1 \le i, j \le n, \quad \text{and} \quad \sum_{j=1}^{n} A(i, j) = 1, \quad i = 1, \dots, n. \]
• B = (B(i, j)) is an n × m matrix called the state
observation probability matrix (also called confusion
matrix), with
\[ B(i, j) \ge 0, \quad 1 \le i \le n,\ 1 \le j \le m, \quad \text{and} \quad \sum_{j=1}^{m} B(i, j) = 1, \quad i = 1, \dots, n. \]

A matrix satisfying the above conditions is said to be
row stochastic. Both A and B are row-stochastic.
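
The row-stochastic condition is easy to check mechanically. The following is a small sketch, assuming matrices are stored as lists of rows; the helper name `is_row_stochastic` and the tolerance are illustrative choices, not something fixed by the definition.

```python
import math


def is_row_stochastic(matrix, tol=1e-9):
    """True if all entries are >= 0 and every row sums to 1 (within tol)."""
    return all(
        all(entry >= 0 for entry in row)
        and math.isclose(sum(row), 1.0, abs_tol=tol)
        for row in matrix
    )


A = [[0.70, 0.30], [0.25, 0.75]]           # transitions of Example 4.1
B = [[0.8, 0.2], [0.3, 0.7]]               # emissions of Example 4.1
not_stochastic = [[0.5, 0.6], [0.2, 0.8]]  # first row sums to 1.1

print(is_row_stochastic(A), is_row_stochastic(B), is_row_stochastic(not_stochastic))
```

The comparison uses a tolerance because sums of floating-point numbers such as 0.7 and 0.3 are not guaranteed to equal 1 exactly.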

We also need to state the conditions that make M a
Markov model. To do this rigorously requires the notion
of random variable and is a bit tricky (see the remark in
the notes), so we will cheat as follows:
(a) Given any sequence of states (q1, . . . , qt−2, p, q), the
conditional probability that q is the tth state given
that the previous states were q1, . . . , qt−2, p is equal
to the conditional probability that q is the tth state
given that the previous state at time t − 1 is p:
Pr(q | q1, . . . , qt−2, p) = Pr(q | p).
This is the Markov property.

(b) Given any sequence of states (q1, . . . , qi, . . . , qt), and
given any sequence of outputs (O1, . . . , Oi, . . . , Ot),
the conditional probability that the output Oi is emitted
depends only on the state qi, and not any other
states or any other observations:
Pr(Oi | q1, . . . , qi, . . . , qt, O1, . . . , Oi−1, Oi+1, . . . , Ot)
= Pr(Oi | qi).
This is the output independence condition.

Examples of HMMs are shown in Figure 4.1, Figure 4.2,
and Figure 4.4 shown below.

Note that an output is emitted when visiting a state, not
when making a transition, as in the case of a gsm.

So the analogy with the gsm model is only partial; it is
meant as a motivation for HMMs.

If we ignore the output components O and B, then we
have what is called a Markov chain.

There are three types of problems that can be solved using
HMMs:
(1) The decoding problem: Given an HMM M =
(Q, O, π, A, B), for any observed output sequence O =
(O1, O2, . . . , OT ) of length T , find a most likely
sequence of states S = (q1, q2, . . . , qT ) that produces
the output sequence O.

More precisely, with our notational convention that
σ(q_t) = i_t and ω(O_t) = ω_t, this means finding a
sequence S such that the probability

\[ \Pr(S, O) = \pi(i_1)\, B(i_1, \omega_1) \prod_{t=2}^{T} A(i_{t-1}, i_t)\, B(i_t, \omega_t) \]

is maximal.

This problem is solved effectively by the Viterbi
algorithm.

(2) The evaluation problem, also called
the likelihood problem:
Given a finite collection {M1, . . . , ML} of HMMs
with the same output alphabet O, for any output
sequence O = (O1, O2, . . . , OT ) of length T , find which
model Mℓ is most likely to have generated O.

More precisely, given any model Mk , we compute the
probability tprob_k that Mk could have produced O
along any path.

Then we pick an HMM Mℓ for which tprob_ℓ is
maximal. We will return to this point after having
described the Viterbi algorithm.

A variation of the Viterbi algorithm called the
forward algorithm effectively solves the evaluation
problem.

(3) The training problem, also called the learning
problem: Given a set {O1, . . . , Or } of output
sequences on the same output alphabet O, usually called
a set of training data, given Q, find the “best” π, A,
and B for an HMM M that produces all the
sequences in the training set, in the sense that the
HMM M = (Q, O, π, A, B) is the most likely to have
produced the sequences in the training set.

The technique used here is called expectation
maximization, or EM. It is an iterative method that starts
with an initial triple π, A, B, and tries to improve it.

There is such an algorithm known as the Baum-Welch
or forward-backward algorithm, but it is beyond the
scope of this introduction.

Let us now describe the Viterbi algorithm in more detail.

4.2 The Viterbi Algorithm and the Forward Algorithm

Given an HMM M = (Q, O, π, A, B), for any observed
output sequence O = (O1, O2, . . ., OT ) of length T , we
want to find a most likely sequence of states S =
(q1, q2, . . . , qT ) that produces the output sequence O.

Using the bijections σ : Q → {1, . . . , n} and ω : O →
{1, . . . , m}, we can work with sequences of indices, and
recall that we denote the index σ(q_t) associated with the
tth state q_t in the sequence S by i_t, and the index ω(O_t)
associated with the tth output O_t in the sequence O by
ω_t.

Then we need to find a sequence S such that the
probability

\[ \Pr(S, O) = \pi(i_1)\, B(i_1, \omega_1) \prod_{t=2}^{T} A(i_{t-1}, i_t)\, B(i_t, \omega_t) \]

is maximal.

In general, there are n^T sequences of length T .

This problem can be solved efficiently by a method based
on dynamic programming.

For any t, 1 ≤ t ≤ T , for any state q ∈ Q, if σ(q) =
j, then we compute score(j, t), which is the largest
probability that a sequence (q1, . . . , qt−1, q) of length t
ending with q has produced the output sequence
(O1, . . . , Ot−1, Ot).

The point is that if we know score(k, t − 1) for k =
1, . . . , n (with t ≥ 2), then we can find score(j, t) for j =
1, . . . , n, because if we write k = σ(qt−1) and j = σ(q)
(recall that ωt = ω(Ot)), then the probability associated
with the path (q1, . . . , qt−1, q) is

tscore(k, j) = score(k, t − 1)A(k, j)B(j, ωt).

See the illustration below:

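[Diagram: the states q1, . . . , qt−1, q are drawn in a row. Each state is
mapped by σ to its index (i1, . . . , k, j) shown above it, and emits the
output (O1, . . . , Ot−1, Ot) shown below it, which ω maps to the output
index (ω1, . . . , ωt−1, ωt). The path up to qt−1 is labeled score(k, t − 1),
the edge from qt−1 to q is labeled A(k, j), and the emission of Ot from q
is labeled B(j, ωt).]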

So to maximize this probability, we just have to find the
maximum of the probabilities tscore(k, j) over all k,
that is, we must have
score(j, t) = max_k tscore(k, j).

See the illustration below:

[Diagram: each of the states σ^{−1}(1), . . . , σ^{−1}(k), . . . , σ^{−1}(n) has an
edge into the state q = σ^{−1}(j), labeled tscore(1, j), . . . , tscore(k, j), . . . ,
tscore(n, j) respectively.]

To get started, we set score(j, 1) = π(j)B(j, ω1) for j =
1, . . . , n.

The algorithm goes through a forward phase for t =
1, . . . , T , during which it computes the probabilities
score(j, t) for j = 1, . . . , n.

When t = T , we pick an index j such that score(j, T ) is
maximal.

The machine learning community is fond of the notation

j = arg max_k score(k, T )

to express the above fact. Typically, the smallest index
j corresponding to the largest value of score(k, T ) is
returned.

This gives us the last state qT = σ^{−1}(j) in an optimal
sequence that yields the output sequence O.

The algorithm then goes through a path retrieval phase.

To do this, when we compute

score(j, t) = max_k tscore(k, j),

we also record the index k = σ(qt−1) of the state qt−1 in
the best sequence (q1, . . . , qt−1, qt) for which tscore(k, j)
is maximal (with j = σ(qt)), as pred(j, t) = k.

The index k is often called the backpointer of j at time
t.

This state may not be unique; we just pick one of them.
Typically, the smallest index k corresponding to the largest
value of tscore(k, j) is returned.

Again, this can be expressed by

pred(j, t) = arg max_k tscore(k, j).

The predecessors pred(j, t) are only defined for t = 2, . . .,
T , but we can let pred(j, 1) = 0.

Observe that the path retrieval phase of the Viterbi
algorithm is very similar to the phase of Dijkstra’s algorithm
for finding a shortest path that follows the prev array.

The forward phase of the Viterbi algorithm is quite
different from Dijkstra’s algorithm, and the Viterbi
algorithm is actually simpler.

The Viterbi algorithm, invented by Andrew Viterbi in
1967, is shown below.

The input to the algorithm is M = (Q, O, π, A, B) and
the sequence of indices ω(O) = (ω1, . . . , ωT ) associated
with the observed sequence O = (O1, O2, . . . , OT ) of
length T , with ωt = ω(Ot) for t = 1, . . . , T .

The output is a sequence of states (q1, . . . , qT ). This
sequence is determined by the sequence of indices (I1, . . . , IT );
namely, qt = σ^{−1}(It).

The Viterbi Algorithm

begin
  for j = 1 to n do
    score(j, 1) = π(j)B(j, ω1)
  endfor;
  (∗ forward phase to find the best (highest) scores ∗)
  for t = 2 to T do
    for j = 1 to n do
      for k = 1 to n do
        tscore(k) = score(k, t − 1)A(k, j)B(j, ωt)
      endfor;
      score(j, t) = max_k tscore(k);
      pred(j, t) = arg max_k tscore(k)
    endfor
  endfor;
  (∗ second phase to retrieve the optimal path ∗)
  I_T = arg max_j score(j, T );
  q_T = σ^{−1}(I_T );
  for t = T to 2 by −1 do
    I_{t−1} = pred(I_t, t);
    q_{t−1} = σ^{−1}(I_{t−1})
  endfor
end
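
The pseudocode above translates almost line by line into Python. The sketch below is one possible rendering, not the notes' own implementation: it uses 0-based indices for states, outputs, and time, and the function name `viterbi` and its return format are choices made here for illustration.

```python
def viterbi(pi, A, B, obs):
    """Return (best_path, best_score) for a list of 0-based output indices.

    A forward phase fills score and pred, then the path is read off by
    following the backpointers from the best final state.
    """
    n, T = len(pi), len(obs)
    score = [[0.0] * n for _ in range(T)]   # score[t][j], with t 0-based
    pred = [[0] * n for _ in range(T)]      # pred[0] is never used

    for j in range(n):
        score[0][j] = pi[j] * B[j][obs[0]]

    for t in range(1, T):
        for j in range(n):
            tscore = [score[t - 1][k] * A[k][j] * B[j][obs[t]] for k in range(n)]
            best_k = max(range(n), key=tscore.__getitem__)  # smallest k on ties
            score[t][j] = tscore[best_k]
            pred[t][j] = best_k

    last = max(range(n), key=score[T - 1].__getitem__)
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(pred[t][path[-1]])
    path.reverse()
    return path, score[T - 1][last]


# Example 4.1 (0 = Cold, 1 = Hot; 0 = N, 1 = D), observed sequence NNND.
pi = [0.45, 0.55]
A = [[0.70, 0.30], [0.25, 0.75]]
B = [[0.8, 0.2], [0.3, 0.7]]
path, prob = viterbi(pi, A, B, [0, 0, 0, 1])
print([["Cold", "Hot"][i] for i in path], round(prob, 4))
```

Python's `max` returns the first maximal element, which matches the convention of returning the smallest index in case of ties.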

If we run the Viterbi algorithm on the output sequence
(S, M, S, L) of Example 4.2, we find that the sequence
(Cold, Cold, Cold, Hot) has the highest probability, 0.00282,
among all sequences of length four.

One may have noticed that the numbers involved, being
products of probabilities, become quite small.

Indeed, underflow may arise in dynamic programming.
Fortunately, there is a simple way to avoid underflow by
taking logarithms.
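
One way to apply this idea is sketched below: every product of probabilities becomes a sum of logarithms, and since log is increasing, the maximizing path is unchanged. The names `safe_log` and `viterbi_log` are hypothetical, indices are 0-based, and treating log 0 as minus infinity is a convention chosen for this sketch.

```python
import math


def safe_log(p):
    """log(p), with log(0) taken to be minus infinity."""
    return math.log(p) if p > 0 else -math.inf


def viterbi_log(pi, A, B, obs):
    """Viterbi on log-probabilities: products become sums, max is unchanged."""
    n = len(pi)
    score = [safe_log(pi[j]) + safe_log(B[j][obs[0]]) for j in range(n)]
    backpointers = []
    for o in obs[1:]:
        new_score, pred = [], []
        for j in range(n):
            cand = [score[k] + safe_log(A[k][j]) for k in range(n)]
            k_best = max(range(n), key=cand.__getitem__)
            pred.append(k_best)
            new_score.append(cand[k_best] + safe_log(B[j][o]))
        score = new_score
        backpointers.append(pred)
    state = max(range(n), key=score.__getitem__)
    log_prob = score[state]
    path = [state]
    for pred in reversed(backpointers):
        state = pred[state]
        path.append(state)
    return path[::-1], log_prob


pi = [0.45, 0.55]
A = [[0.70, 0.30], [0.25, 0.75]]
B = [[0.8, 0.2], [0.3, 0.7]]

path, log_prob = viterbi_log(pi, A, B, [0, 0, 0, 1])       # NNND
print(path, round(math.exp(log_prob), 4))

# 2000 observations: the probability itself is too small for a 64-bit
# float, but its logarithm is an ordinary number.
long_path, long_log_prob = viterbi_log(pi, A, B, [0, 0, 0, 1] * 500)
print(long_log_prob)
```

Only the scores change representation; the backpointer bookkeeping is the same as in the probability version.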

It is immediately verified that the time complexity of the
Viterbi algorithm is O(n^2 T ).

Let us now turn to the second problem, the evaluation
problem (or likelihood problem).

This time, given a finite collection {M1, . . . , ML} of HMMs
with the same output alphabet O, for any observed
output sequence O = (O1, O2, . . . , OT ) of length T , find
which model Mℓ is most likely to have generated O.

More precisely, given any model Mk , we compute the
probability tprob_k that Mk could have produced O along
any path.

Then we pick an HMM Mℓ for which tprob_ℓ is maximal.

It is easy to adapt the Viterbi algorithm to compute
tprob_k . This algorithm is called the forward algorithm.

Since we are not looking for an explicit path, there is
no need for the second phase, and during the forward
phase, going from t − 1 to t, rather than finding the
maximum of the scores tscore(k) for k = 1, . . . , n, we
just set score(j, t) to the sum over k of the temporary
scores tscore(k).

At the end, tprob_k is the sum over j of the probabilities
score(j, T ).

The input to the algorithm is M = (Q, O, π, A, B) and
the sequence of indices ω(O) = (ω1, . . . , ωT ) associated
with the observed sequence O = (O1, O2, . . . , OT ) of
length T , with ωt = ω(Ot) for t = 1, . . . , T .

The output is the probability tprob.

The Forward Algorithm

begin
  for j = 1 to n do
    score(j, 1) = π(j)B(j, ω1)
  endfor;
  for t = 2 to T do
    for j = 1 to n do
      for k = 1 to n do
        tscore(k) = score(k, t − 1)A(k, j)B(j, ωt)
      endfor;
      score(j, t) = Σ_k tscore(k)
    endfor
  endfor;
  tprob = Σ_j score(j, T )
end
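
A compact Python rendering of the forward algorithm might look as follows. As before, this is a sketch with 0-based indices and a function name (`forward`) chosen here for illustration; it keeps only the scores of the previous time step, since no path has to be recovered.

```python
def forward(pi, A, B, obs):
    """Total probability that the model emits obs, summed over all state paths."""
    n = len(pi)
    score = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        # Same recurrence as Viterbi, with the max over k replaced by a sum.
        score = [
            sum(score[k] * A[k][j] for k in range(n)) * B[j][o]
            for j in range(n)
        ]
    return sum(score)


# Example 4.1 (0 = Cold, 1 = Hot; 0 = N, 1 = D).
pi = [0.45, 0.55]
A = [[0.70, 0.30], [0.25, 0.75]]
B = [[0.8, 0.2], [0.3, 0.7]]
print(forward(pi, A, B, [0, 0, 0, 1]))   # Pr(NNND) under this model
```

A useful sanity check is that, for a fixed length T, the values returned for all m^T possible output sequences add up to 1.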

We can now run the above algorithm on M1, . . . , ML to
compute tprob_1, . . . , tprob_L, and we pick the model Mℓ
for which tprob_ℓ is maximum.

As for the Viterbi algorithm, the time complexity of the
forward algorithm is O(n^2 T ).

Underflow is also a problem with the forward algorithm.

At first glance it looks like taking logarithms does not help
because there is no simple expression for log(x1 + · · · + xn)
in terms of the log xi.

Fortunately, we can use the log-sum-exp trick; see the
notes.
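
For readers without the notes at hand, the idea can be sketched as follows: with a_i = log x_i and m the largest a_i, we have log(x_1 + · · · + x_n) = m + log(Σ exp(a_i − m)), and the largest term inside the sum is exp(0) = 1, so the sum cannot underflow to zero. The helper name `logsumexp` below is a choice made for this sketch.

```python
import math


def logsumexp(log_values):
    """Return log(sum(exp(v) for v in log_values)) without underflow."""
    m = max(log_values)
    if m == -math.inf:            # every term is log(0)
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in log_values))


# log(0.1 + 0.2) computed from log 0.1 and log 0.2
print(logsumexp([math.log(0.1), math.log(0.2)]), math.log(0.3))

# exp(-1000) underflows to 0.0, yet the log of the sum is still available
print(logsumexp([-1000.0, -1000.0]))      # -1000 + log 2
```

In a log-space forward algorithm, the sum over k of the temporary scores would be replaced by a call of this kind on their logarithms.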

Example 4.3. To illustrate the forward algorithm, assume
that our observant student also recorded the drinking
behavior of a professor at Harvard, and that he came
up with the HMM shown in Figure 4.4.

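[State diagram: start → Cold 0.13, start → Hot 0.87;
Cold → Cold 0.33, Cold → Hot 0.67, Hot → Hot 0.9, Hot → Cold 0.1;
Cold emits N with probability 0.95 and D with probability 0.05;
Hot emits N with probability 0.8 and D with probability 0.2.]

Figure 4.4: Example of an HMM modeling the “drinking behavior” of a professor at Harvard.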

However, the student can’t remember whether he
observed the sequence NNND at Penn or at Harvard.

So he runs the forward algorithm on both HMMs to find
the most likely model. Do it!
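
For readers who want to check their work, the comparison can be sketched as below. The Penn parameters are those of Example 4.1; the Harvard parameters are read off Figure 4.4 and should be treated as an assumption of this sketch, as are the names `forward`, `MODELS`, and `most_likely`.

```python
def forward(pi, A, B, obs):
    """Forward algorithm: Pr(obs) summed over all state sequences."""
    n = len(pi)
    score = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        score = [sum(score[k] * A[k][j] for k in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(score)


# Index 0 = Cold, 1 = Hot for states; 0 = N, 1 = D for outputs.
MODELS = {
    "Penn": {        # Figure 4.1
        "pi": [0.45, 0.55],
        "A": [[0.70, 0.30], [0.25, 0.75]],
        "B": [[0.8, 0.2], [0.3, 0.7]],
    },
    "Harvard": {     # Figure 4.4, as transcribed here (an assumption)
        "pi": [0.13, 0.87],
        "A": [[0.33, 0.67], [0.10, 0.90]],
        "B": [[0.95, 0.05], [0.8, 0.2]],
    },
}

nnnd = [0, 0, 0, 1]
tprob = {name: forward(m["pi"], m["A"], m["B"], nnnd) for name, m in MODELS.items()}
most_likely = max(tprob, key=tprob.get)
for name, p in tprob.items():
    print(f"{name}: {p:.6f}")
print("most likely model:", most_likely)
```

With the parameters as transcribed here, the Harvard model assigns the larger probability to NNND; a different reading of the figure could change the numbers.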
