Cis262 HMM

Uploaded by

sofia


Chapter 4

Hidden Markov Models (HMMs)

4.1 Definition of a Hidden Markov Model (HMM)

There is a variant of the notion of DFA with output, for example a transducer such as a gsm (generalized sequential machine), which is widely used in machine learning.

This machine model is known as a hidden Markov model, for short HMM.

There are three new twists compared to traditional gsm

models:
(1) There is a finite set of states Q with n elements, a
bijection σ : Q → {1, . . . , n}, and the transitions
between states are labeled with probabilities rather
than symbols from an alphabet. For any two states
p and q in Q, the edge from p to q is labeled with a
probability A(i, j), with i = σ(p) and j = σ(q).
The probabilities A(i, j) form an n × n matrix A = (A(i, j)).
(2) There is a finite set O of size m (called the observa-
tion space) of possible outputs that can be emitted,
a bijection ω : O → {1, . . . , m}, and for every state
q ∈ Q, there is a probability B(i, j) that output
O ∈ O is emitted (produced), with i = σ(q) and
j = ω(O).
The probabilities B(i, j) form an n × m matrix B =
(B(i, j)).
(3) Sequences of outputs O = (O1, . . . , OT ) (with Ot ∈
O for t = 1, . . . , T ) emitted by the model are di-
rectly observable, but the sequences of states S =
(q1, . . . , qT ) (with qt ∈ Q for t = 1, . . . , T ) that
caused some sequence of output to be emitted are
not observable.

In this sense the states are hidden, and this is the

reason for calling this model a hidden Markov model.

Example 4.1. Say we consider the following behavior

of some professor at some university.

On a hot day (denoted by Hot), the professor comes to

class with a drink (denoted D) with probability 0.7, and
with no drink (denoted N) with probability 0.3.

On the other hand, on a cold day (denoted Cold), the

professor comes to class with a drink with probability
0.2, and with no drink with probability 0.8.

Suppose a student intrigued by this behavior recorded

a sequence showing whether the professor came to class
with a drink or not, say NNND.

Several months later, the student would like to know

whether the weather was hot or cold the days he recorded
the drinking behavior of the professor.

Now the student heard about machine learning, so he

constructs a probabilistic (hidden Markov) model of the
weather.

Based on some experiments, he determines the probability of going from a hot day to another hot day to be
0.75, the probability of going from a hot day to a cold
day to be 0.25, the probability of going from a cold day to
another cold day to be 0.7, and the probability of going
from a cold day to a hot day to be 0.3.

He also knows that when he started his observations, it

was a cold day with probability 0.45, and a hot day with
probability 0.55.

The above data determine an HMM depicted in Figure

4.1.

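[Figure: state diagram. start → Cold 0.45, start → Hot 0.55; Cold → Cold 0.7, Cold → Hot 0.3, Hot → Cold 0.25, Hot → Hot 0.75; Cold emits N with probability 0.8 and D with probability 0.2; Hot emits N with probability 0.3 and D with probability 0.7.]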

Figure 4.1: Example of an HMM modeling the “drinking behavior” of a professor at the
University of Pennsylvania.

In this example, the set of states is Q = {Cold, Hot}, and

the set of outputs is O = {N, D}.

We have the bijection σ : {Cold, Hot} → {1, 2} given

by σ(Cold) = 1 and σ(Hot) = 2, and the bijection
ω : {N, D} → {1, 2} given by ω(N) = 1 and ω(D) = 2.

The portion of the state diagram involving the states

Cold, Hot, is analogous to an NFA in which the tran-
sition labels are probabilities; it is the underlying Markov
model of the HMM.

For any given state, the probabilities of the outgoing edges

sum to 1.

The start state is a convenient way to express the proba-

bilities of starting either in state Cold or in state Hot.

Also, from each of the states Cold and Hot, we have emission
probabilities of producing the output N or D, and
these probabilities also sum to 1.

We can also express these data using matrices.


The matrix

\[ A = \begin{pmatrix} 0.7 & 0.3 \\ 0.25 & 0.75 \end{pmatrix} \]

describes the transitions of the Markov model, the vector

\[ \pi = \begin{pmatrix} 0.45 \\ 0.55 \end{pmatrix} \]

describes the probabilities of starting either in state Cold or in state Hot, and the matrix

\[ B = \begin{pmatrix} 0.8 & 0.2 \\ 0.3 & 0.7 \end{pmatrix} \]

describes the emission probabilities.
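As a minimal sketch (not part of the original notes), the data of Example 4.1 can be written down in Python and checked against the requirement that the start probabilities and every row of A and B sum to 1. The names pi, A, B and rows_sum_to_one are illustrative choices, and the indices are 0-based, so Cold is 0 and Hot is 1.

```python
import math

# Example 4.1, 0-based indices: state 0 = Cold, 1 = Hot; output 0 = N, 1 = D.
pi = [0.45, 0.55]
A = [[0.7, 0.3],
     [0.25, 0.75]]
B = [[0.8, 0.2],
     [0.3, 0.7]]

def rows_sum_to_one(matrix):
    """Return True if every row of `matrix` sums to 1 (up to rounding)."""
    return all(math.isclose(sum(row), 1.0) for row in matrix)

print(rows_sum_to_one(A), rows_sum_to_one(B), math.isclose(sum(pi), 1.0))
```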


The student would like to solve what is known as the

decoding problem.

Namely, given the output sequence NNND, find the

most likely state sequence of the Markov model that
produces the output sequence NNND.

Is it (Cold, Cold, Cold, Cold), or (Hot, Hot, Hot, Hot), or

(Hot, Cold, Cold, Hot), or (Cold, Cold, Cold, Hot)?

Given the probabilities of the HMM, it seems unlikely

that it is (Hot, Hot, Hot, Hot), but how can we find the
most likely one?

Before going any further, we wish to address a notational

issue.

The issue is how to denote the states, the outputs, as well as (ordered) sequences of states and sequences of
outputs.

In most problems, states and outputs have “meaningful”

names.

For example, if we wish to describe the evolution of the

temperature from day to day, it makes sense to use two
states “Cold” and “Hot,” and to describe whether a given
individual has a drink by “D,” and no drink by “N.”

Thus our set of states is Q = {Cold, Hot}, and our set of

outputs is O = {N, D}.

However, when computing probabilities, we need to use

matrices whose rows and columns are indexed by positive
integers, so we need a mechanism to associate a numer-
ical index to every state and to every output, and this
is the purpose of the bijections σ : Q → {1, . . . , n} and
ω : O → {1, . . . , m}.

In our example, we define σ by σ(Cold) = 1 and σ(Hot) =

2, and ω by ω(N) = 1 and ω(D) = 2.

Some authors circumvent (or do they?) this notational issue by assuming that the set of outputs is O = {1, 2, . . . , m}, and that the set of states is Q = {1, 2, . . . , n}.

The disadvantage of doing this is that in “real” situations,

it is often more convenient to name the outputs and the
states with more meaningful names than 1, 2, 3 etc.

Warning: The task of naming the elements of the out-

put alphabet can be challenging, for example in speech
recognition.

Let us now turn to sequences.

For example, consider the sequence of six states (from the

set Q = {Cold, Hot}),

S = (Cold, Cold, Hot, Cold, Hot, Hot).

Using the bijection σ : {Cold, Hot} → {1, 2} defined above,

the sequence S is completely determined by the sequence
of indices

σ(S) = (σ(Cold), σ(Cold), σ(Hot), σ(Cold),

σ(Hot), σ(Hot)) = (1, 1, 2, 1, 2, 2).

More generally, we will denote a sequence of length T of

states from a set Q of size n by
S = (q1, q2, . . . , qT ),
with qt ∈ Q for t = 1, . . . , T .

Using the bijection σ : Q → {1, . . . , n}, the sequence S

is completely determined by the sequence of indices

σ(S) = (σ(q1), σ(q2), . . . , σ(qT )),

where σ(qt) is some index from the set {1, . . . , n}, for
t = 1, . . . , T .

The problem now is, what is a better notation for the

index denoted by σ(qt)?

Of course, we could use σ(qt), but this is a heavy notation,

so we adopt the notational convention to denote the
index σ(qt) by it.

Remark: We contemplated using the notation σt for

σ(qt) instead of it. However, we feel that this would de-
viate too much from the common practice found in the
literature, which uses the notation it.

Going back to our example

S = (q1, q2, q3, q4, q5, q6) = (Cold, Cold, Hot, Cold, Hot, Hot),

we have

σ(S) = (σ(q1), σ(q2), σ(q3), σ(q4),

σ(q5), σ(q6)) = (1, 1, 2, 1, 2, 2),

so the sequence of indices

(i1, i2, i3, i4, i5, i6) = (σ(q1), σ(q2), σ(q3), σ(q4), σ(q5),
σ(q6)) is given by

σ(S) = (i1, i2, i3, i4, i5, i6) = (1, 1, 2, 1, 2, 2).

So, the fourth index i4 has the value 1.


We apply a similar convention to sequences of outputs.

For example, consider the sequence of six outputs (from

the set O = {N, D}),

O = (N, D, N, N, N, D).

Using the bijection ω : {N, D} → {1, 2} defined above,

the sequence O is completely determined by the sequence
of indices

ω(O) = (ω(N), ω(D), ω(N), ω(N), ω(N), ω(D))

= (1, 2, 1, 1, 1, 2).

More generally, we will denote a sequence of length T of

outputs from a set O of size m by

O = (O1, O2, . . . , OT ),

with Ot ∈ O for t = 1, . . . , T .

Using the bijection ω : O → {1, . . . , m}, the sequence O

is completely determined by the sequence of indices

ω(O) = (ω(O1), ω(O2), . . . , ω(OT )),

where ω(Ot) is some index from the set {1, . . . , m}, for
t = 1, . . . , T .

This time, we adopt the notational convention to de-

note the index ω(Ot) by ωt.

Going back to our example

O = (O1, O2, O3, O4, O5, O6) = (N, D, N, N, N, D),

we have

ω(O) = (ω(O1), ω(O2), ω(O3), ω(O4), ω(O5), ω(O6))

= (1, 2, 1, 1, 1, 2),

so the sequence of indices

(ω1, ω2, ω3, ω4, ω5, ω6) = (ω(O1), ω(O2), ω(O3), ω(O4),
ω(O5), ω(O6)) is given by

ω(O) = (ω1, ω2, ω3, ω4, ω5, ω6) = (1, 2, 1, 1, 1, 2).
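The index conventions above can be sketched in Python by writing the bijections σ and ω as dictionaries. This is only an illustration; the variable names sigma, omega, sigma_inv, i and w are assumptions of the sketch, and the indices are kept 1-based to match the text.

```python
# Bijections from Example 4.1, written as dictionaries (1-based, as in the text).
sigma = {"Cold": 1, "Hot": 2}   # states  -> indices
omega = {"N": 1, "D": 2}        # outputs -> indices

S = ["Cold", "Cold", "Hot", "Cold", "Hot", "Hot"]
O = ["N", "D", "N", "N", "N", "D"]

i = [sigma[q] for q in S]       # (i_1, ..., i_6)
w = [omega[o] for o in O]       # (omega_1, ..., omega_6)

# Inverting the bijection recovers the named states from the indices.
sigma_inv = {index: state for state, index in sigma.items()}
assert [sigma_inv[k] for k in i] == S

print(i)  # [1, 1, 2, 1, 2, 2]
print(w)  # [1, 2, 1, 1, 1, 2]
```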

HMMs are among the most effective tools to solve the following types of problems:

(1) DNA and protein sequence alignment in the

face of mutations and other kinds of evolutionary change.
(2) Speech understanding systems, also called
Automatic speech recognition. When we talk,
our mouths produce sequences of sounds from the sen-
tences that we want to say. This process is complex.

Multiple words may map to the same sound, words
are pronounced differently as a function of the word
before and after them, we all form sounds slightly
differently, and so on.

All a listener can hear (perhaps a computer system)

is the sequence of sounds, and the listener would like
to reconstruct the mapping (backward) in order to
determine what words we were attempting to say.

For example, when you “talk to your TV” to pick a
program, say Game of Thrones, you don’t want to get
Jessica Jones.

(3) Optical character recognition (OCR). When

we write, our hands map from an idealized symbol to
some set of marks on a page (or screen).

The marks are observable, but the process that gen-

erates them isn’t.

A system performing OCR, such as a system used by
the post office to read addresses, must discover which
word is most likely to correspond to the mark it reads.

The reader should review Example 4.1 illustrating the

notion of HMM.

Let us consider another example taken from Stamp [?].

Example 4.2. Suppose we want to determine the av-
erage annual temperature at a particular location over a
series of years in a distant past where thermometers did
not exist.

Since we can’t go back in time, we look for indirect evi-

dence of the temperature, say in terms of the size of tree
growth rings.

For simplicity, assume that we consider the two temperatures
Cold and Hot, and three different sizes of tree rings:
small, medium and large, which we denote by S, M, L.

In this example, the set of states is Q = {Cold, Hot}, and

the set of outputs is O = {S, M, L}.

We have the bijection σ : {Cold, Hot} → {1, 2} given

by σ(Cold) = 1 and σ(Hot) = 2, and the bijection
ω : {S, M, L} → {1, 2, 3} given by ω(S) = 1, ω(M) = 2,
and ω(L) = 3.

The HMM shown in Figure 4.2 is a model of the situation.

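[Figure: state diagram. start → Cold 0.4, start → Hot 0.6; Cold → Cold 0.6, Cold → Hot 0.4, Hot → Cold 0.3, Hot → Hot 0.7; Cold emits S, M, L with probabilities 0.7, 0.2, 0.1; Hot emits S, M, L with probabilities 0.1, 0.4, 0.5.]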

Figure 4.2: Example of an HMM modeling the temperature in terms of tree growth rings.

Suppose we observe the sequence of tree growth rings

(S, M, S, L).

What is the most likely sequence of temperatures over a

four-year period which yields the observations
(S, M, S, L)?

Going back to Example 4.1, which corresponds to the

HMM graph shown in Figure 4.3, we need to figure out
the probability that a sequence of states S = (q1, q2, . . . , qT )
produces the output sequence O = (O1, O2, . . . , OT ).

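[Figure: the same state diagram as in Figure 4.1. start → Cold 0.45, start → Hot 0.55; Cold → Cold 0.7, Cold → Hot 0.3, Hot → Cold 0.25, Hot → Hot 0.75; Cold emits N with probability 0.8 and D with probability 0.2; Hot emits N with probability 0.3 and D with probability 0.7.]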

Figure 4.3: Example of an HMM modeling the “drinking behavior” of a professor at the
University of Pennsylvania.

Then the probability that we want is just the product

of the probability that we begin with state q1, times the
product of the probabilities of each of the transitions,
times the product of the emission probabilities.

With our notational conventions, σ(qt) = it and ω(Ot) =

ωt, so we have

\[ \Pr(S, O) = \pi(i_1)\, B(i_1, \omega_1) \prod_{t=2}^{T} A(i_{t-1}, i_t)\, B(i_t, \omega_t). \]

In our example, ω(O) = (ω1, ω2, ω3, ω4) = (1, 1, 1, 2),

which corresponds to NNND.

The brute-force method is to compute these probabilities for all 2^4 = 16 sequences of states of length 4 (in general,
there are n^T sequences of length T).

For example, for the sequence S = (Cold, Cold, Cold, Hot),

associated with the sequence of indices
σ(S) = (i1, i2, i3, i4) = (1, 1, 1, 2), we find that

\[
\begin{aligned}
\Pr(S, NNND) &= \pi(1)\, B(1, 1)\, A(1, 1)\, B(1, 1)\, A(1, 1)\, B(1, 1)\, A(1, 2)\, B(2, 2) \\
&= 0.45 \times 0.8 \times 0.7 \times 0.8 \times 0.7 \times 0.8 \times 0.3 \times 0.7 = 0.0237.
\end{aligned}
\]
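The brute-force method can be sketched as follows. This is a minimal illustration, not code from the notes: the function name joint_probability is an assumption, and the indices are 0-based (Cold = 0, Hot = 1, N = 0, D = 1), so A(1, 2) in the text is A[0][1] here.

```python
from itertools import product

pi = [0.45, 0.55]                   # 0 = Cold, 1 = Hot
A = [[0.7, 0.3], [0.25, 0.75]]
B = [[0.8, 0.2], [0.3, 0.7]]        # columns: 0 = N, 1 = D
names = ["Cold", "Hot"]

def joint_probability(states, outputs):
    """Pr(S, O) = pi(i1) B(i1, w1) * prod_{t>=2} A(i_{t-1}, i_t) B(i_t, w_t)."""
    p = pi[states[0]] * B[states[0]][outputs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][outputs[t]]
    return p

outputs = [0, 0, 0, 1]              # NNND
# Enumerate all n^T = 2^4 = 16 state sequences and keep the most probable one.
best = max(product(range(2), repeat=4),
           key=lambda s: joint_probability(s, outputs))

print([names[i] for i in best], round(joint_probability(best, outputs), 4))
```

The enumeration grows as n^T, which is why it is only practical for tiny examples such as this one.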

A much more efficient way to proceed is to use a method based on dynamic programming.

Recall the bijection σ : {Cold, Hot} → {1, 2}, so that we

will refer to the state Cold as 1, and to the state Hot as
2.

For t = 1, 2, 3, 4, for every state i = 1, 2, we compute

score(i, t) to be the highest probability that a sequence
of length t ending in state i produces the output se-
quence (O1, . . . , Ot), and for t ≥ 2, we let pred(i, t) be
the state that precedes i in a best sequence of length t
ending in i.

Initially, we set
score(j, 1) = π(j)B(j, ω1), j = 1, 2,
and since ω1 = 1 we get score(1, 1) = 0.45 × 0.8 = 0.36
and score(2, 1) = 0.55 × 0.3 = 0.165.

Next we compute score(1, 2) and score(2, 2) as follows.

For j = 1, 2, for i = 1, 2, compute temporary scores

tscore(i, j) = score(i, 1)A(i, j)B(j, ω2);
then pick the best of the temporary scores,
score(j, 2) = max_i tscore(i, j).

Since ω2 = 1, we get tscore(1, 1) = 0.36 × 0.7 × 0.8 =

0.2016, tscore(2, 1) = 0.165 × 0.25 × 0.8 = 0.0330, and
tscore(1, 2) = 0.36 × 0.3 × 0.3 = 0.0324, tscore(2, 2) =
0.165 × 0.75 × 0.3 = 0.0371.

Then
score(1, 2) = max{tscore(1, 1), tscore(2, 1)}
= max{0.2016, 0.0330} = 0.2016,

and

score(2, 2) = max{tscore(1, 2), tscore(2, 2)}

= max{0.0324, 0.0371} = 0.0371.

Since the state that leads to the optimal score score(1, 2)

is 1, we let pred(1, 2) = 1, and since the state that leads
to the optimal score score(2, 2) is 2, we let pred(2, 2) = 2.
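The step from time 1 to time 2 can be reproduced with a short Python sketch. The helper name viterbi_step is an assumption of this illustration, and the indices are 0-based, so the predecessor list [0, 1] printed below corresponds to pred(1, 2) = 1 and pred(2, 2) = 2 in the text.

```python
A = [[0.7, 0.3], [0.25, 0.75]]      # 0 = Cold, 1 = Hot
B = [[0.8, 0.2], [0.3, 0.7]]        # columns: 0 = N, 1 = D

def viterbi_step(prev_score, output):
    """From score(., t-1), compute score(., t) and pred(., t) for one output."""
    n = len(prev_score)
    score, pred = [], []
    for j in range(n):
        tscore = [prev_score[i] * A[i][j] * B[j][output] for i in range(n)]
        best_i = max(range(n), key=lambda i: tscore[i])
        score.append(tscore[best_i])
        pred.append(best_i)
    return score, pred

score1 = [0.45 * 0.8, 0.55 * 0.3]   # score(., 1) for the first output N
score2, pred2 = viterbi_step(score1, 0)   # second output is N

print([round(s, 4) for s in score2], pred2)  # [0.2016, 0.0371] [0, 1]
```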

We compute score(1, 3) and score(2, 3) in a similar way.

For j = 1, 2, for i = 1, 2, compute

tscore(i, j) = score(i, 2)A(i, j)B(j, ω3);
then pick the best of the temporary scores,
score(j, 3) = max_i tscore(i, j).

Since ω3 = 1, we get

score(1, 3) = max{0.1129, 0.0074} = 0.1129,

and

score(2, 3) = max{0.0181, 0.0083} = 0.0181.

We also get pred(1, 3) = 1 and pred(2, 3) = 1.

Finally, we compute score(1, 4) and score(2, 4) in a sim-

ilar way.

For j = 1, 2, for i = 1, 2, compute

tscore(i, j) = score(i, 3)A(i, j)B(j, ω4);
then pick the best of the temporary scores,
score(j, 4) = max_i tscore(i, j).

Since ω4 = 2, we get
score(1, 4) = max{0.0158, 0.0009} = 0.0158,
and
score(2, 4) = max{0.0237, 0.0095} = 0.0237,
and pred(1, 4) = 1 and pred(2, 4) = 1.

Since max{score(1, 4), score(2, 4)} = 0.0237, the state

with the maximum score is Hot, and by following the
predecessor list (also called backpointer list), we find the
most likely state sequence to produce the sequence NNND
to be (Cold, Cold, Cold, Hot).

The stages of the computations of score(j, t) for j = 1, 2 and t = 1, 2, 3, 4 can be recorded in the following
diagram called a lattice or a trellis (which means lattice
in French!):

[Trellis diagram, summarized as two tables.]

Node scores score(j, t):

          t = 1    t = 2    t = 3    t = 4
Cold      0.36     0.2016   0.1129   0.0158
Hot       0.1650   0.0371   0.0181   0.0237

Edge labels (temporary scores) on the arrows into the nodes at time t:

              t = 2    t = 3    t = 4
Cold → Cold   0.2016   0.1129   0.0158
Hot → Cold    0.033    0.0074   0.0009
Cold → Hot    0.0324   0.0181   0.0237
Hot → Hot     0.0371   0.0083   0.0095

The predecessor edges are Cold → Cold and Hot → Hot into t = 2, and Cold → Cold and Cold → Hot into both t = 3 and t = 4.

Double arrows represent the predecessor edges.

For example, the predecessor pred(2, 3) of the third node

on the bottom row labeled with the score 0.0181 (which
corresponds to Hot), is the second node on the first row la-
beled with the score 0.2016 (which corresponds to Cold).

The two incoming arrows to the third node on the bottom

row are labeled with the temporary scores 0.0181 and
0.0083.

The node with the highest score at time t = 4 is Hot, with score 0.0237 (shown in bold), and by following the
double arrows backward from this node, we obtain the
most likely state sequence (Cold, Cold, Cold, Hot).

The method we just described is known as the Viterbi

algorithm.

Definition 4.1. A hidden Markov model, for short HMM, is a quintuple M = (Q, O, π, A, B) where
• Q is a finite set of states with n elements, and there
is a bijection σ : Q → {1, . . . , n}.
• O is a finite output alphabet (also called set of possible
observations) with m observations, and there is
a bijection ω : O → {1, . . . , m}.
• π = (π(i)) is a vector of n initial state probabilities, with
\[ \pi(i) \ge 0, \quad 1 \le i \le n, \qquad \text{and} \qquad \sum_{i=1}^{n} \pi(i) = 1. \]
• A = (A(i, j)) is an n × n matrix called the state
transition probability matrix, with
\[ A(i, j) \ge 0, \quad 1 \le i, j \le n, \qquad \text{and} \qquad \sum_{j=1}^{n} A(i, j) = 1, \quad i = 1, \ldots, n. \]
• B = (B(i, j)) is an n × m matrix called the state observation
probability matrix (also called confusion
matrix), with
\[ B(i, j) \ge 0, \quad 1 \le i \le n, \; 1 \le j \le m, \qquad \text{and} \qquad \sum_{j=1}^{m} B(i, j) = 1, \quad i = 1, \ldots, n. \]

A matrix satisfying the above conditions is said to be

row stochastic. Both A and B are row-stochastic.

We also need to state the conditions that make M a

Markov model. To do this rigorously requires the notion
of random variable and is a bit tricky (see the remark in
the notes), so we will cheat as follows:
(a) Given any sequence of states (q1, . . . , qt−2, p, q), the
conditional probability that q is the tth state given
that the previous states were q1, . . . , qt−2, p is equal
to the conditional probability that q is the tth state
given that the previous state at time t − 1 is p:
Pr(q | q1, . . . , qt−2, p) = Pr(q | p).
This is the Markov property.

(b) Given any sequence of states (q1, . . . , qi, . . . , qt), and

given any sequence of outputs (O1, . . . , Oi, . . . , Ot),
the conditional probability that the output Oi is emit-
ted depends only on the state qi, and not any other
states or any other observations:
Pr(Oi | q1, . . . , qi, . . . , qt, O1, . . . , Oi, . . . , Ot)
= Pr(Oi | qi).
This is the output independence condition.
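Conditions (a) and (b) can be read as a recipe for generating data from an HMM. The following Python sketch, using the model of Example 4.1, is an illustration under that reading; the function name sample_hmm and the fixed seed are assumptions, not part of the notes.

```python
import random

states, outputs = ["Cold", "Hot"], ["N", "D"]
pi = [0.45, 0.55]
A = [[0.7, 0.3], [0.25, 0.75]]
B = [[0.8, 0.2], [0.3, 0.7]]

def sample_hmm(T, rng):
    """Draw a state sequence and an output sequence of length T."""
    S, O = [], []
    i = rng.choices(range(len(states)), weights=pi)[0]
    for _ in range(T):
        # Output independence: the output is drawn using only the current state.
        j = rng.choices(range(len(outputs)), weights=B[i])[0]
        S.append(states[i])
        O.append(outputs[j])
        # Markov property: the next state is drawn using only the current state.
        i = rng.choices(range(len(states)), weights=A[i])[0]
    return S, O

S, O = sample_hmm(4, random.Random(0))
print(S, O)
```

An observer who sees only the list O, and not S, is in exactly the situation of the student in Example 4.1.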

Examples of HMMs are shown in Figure 4.1, Figure 4.2,

and Figure 4.4 shown below.

Note that an output is emitted when visiting a state, not when making a transition, as in the case of a gsm.

So the analogy with the gsm model is only partial; it is

meant as a motivation for HMMs.

If we ignore the output components O and B, then we

have what is called a Markov chain.

There are three types of problems that can be solved using

HMMs:
(1) The decoding problem: Given an HMM M =
(Q, O, π, A, B), for any observed output sequence O =
(O1, O2, . . . , OT ) of length T , find a most likely se-
quence of states S = (q1, q2, . . . , qT ) that produces
the output sequence O.

More precisely, with our notational convention that

σ(qt) = it and ω(Ot) = ωt, this means finding a se-
quence S such that the probability

\[ \Pr(S, O) = \pi(i_1)\, B(i_1, \omega_1) \prod_{t=2}^{T} A(i_{t-1}, i_t)\, B(i_t, \omega_t) \]

is maximal.

This problem is solved effectively by the Viterbi algorithm.

(2) The evaluation problem, also called the likelihood problem:
Given a finite collection {M1, . . . , ML} of HMMs
with the same output alphabet O, for any output sequence
O = (O1, O2, . . . , OT ) of length T , find which
model Mℓ is most likely to have generated O.

More precisely, given any model Mk , we compute the

probability tprobk that Mk could have produced O
along any path.

Then we pick an HMM Mℓ for which tprobℓ is maximal. We will return to this point after having described
the Viterbi algorithm.

A variation of the Viterbi algorithm called the forward algorithm effectively solves the evaluation problem.

(3) The training problem, also called the learning problem: Given a set {O1, . . . , Or } of output
sequences on the same output alphabet O, usually called
a set of training data, given Q, find the “best” π, A,
and B for an HMM M that produces all the sequences
in the training set, in the sense that the
HMM M = (Q, O, π, A, B) is the most likely to have
produced the sequences in the training set.

The technique used here is called expectation maximization, or EM. It is an iterative method that starts
with an initial triple π, A, B, and tries to improve it.

There is such an algorithm known as the Baum-Welch

or forward-backward algorithm, but it is beyond the
scope of this introduction.

Let us now describe the Viterbi algorithm in more detail.

4.2 The Viterbi Algorithm and the Forward Algorithm

Given an HMM M = (Q, O, π, A, B), for any observed

output sequence O = (O1, O2, . . ., OT ) of length T , we
want to find a most likely sequence of states S =
(q1, q2, . . . , qT ) that produces the output sequence O.

Using the bijections σ : Q → {1, . . . , n} and ω : O →

{1, . . . , m}, we can work with sequences of indices, and
recall that we denote the index σ(qt) associated with the
tth state qt in the sequence S by it, and the index ω(Ot)
associated with the tth output Ot in the sequence O by
ωt.

Then we need to find a sequence S such that the proba-

bility

\[ \Pr(S, O) = \pi(i_1)\, B(i_1, \omega_1) \prod_{t=2}^{T} A(i_{t-1}, i_t)\, B(i_t, \omega_t) \]

is maximal.
In general, there are n^T sequences of length T.

This problem can be solved efficiently by a method based on dynamic programming.

For any t, 1 ≤ t ≤ T , for any state q ∈ Q, if σ(q) =

j, then we compute score(j, t), which is the largest
probability that a sequence (q1, . . . , qt−1, q) of length t
ending with q has produced the output sequence
(O1, . . . , Ot−1, Ot).

The point is that if we know score(k, t − 1) for k =

1, . . . , n (with t ≥ 2), then we can find score(j, t) for j =
1, . . . , n, because if we write k = σ(qt−1) and j = σ(q)
(recall that ωt = ω(Ot)), then the probability associated
with the path (q1, . . . , qt−1, q) is

tscore(k, j) = score(k, t − 1)A(k, j)B(j, ωt).

See the illustration below:

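[Diagram: the states q1, . . . , qt−1, q are mapped by σ to the state indices i1, . . . , k, j, and the outputs O1, . . . , Ot−1, Ot are mapped by ω to the output indices ω1, . . . , ωt−1, ωt. The path ending at qt−1 carries the score score(k, t − 1), the transition from qt−1 to q is labeled A(k, j), and the emission of Ot from q is labeled B(j, ωt).]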

So to maximize this probability, we just have to find the

maximum of the probabilities tscore(k, j) over all k,
that is, we must have
score(j, t) = max_k tscore(k, j).

See the illustration below:

[Diagram: each of the states σ^{−1}(1), . . . , σ^{−1}(k), . . . , σ^{−1}(n) has an arrow into the state q = σ^{−1}(j), labeled tscore(1, j), . . . , tscore(k, j), . . . , tscore(n, j) respectively.]

To get started, we set score(j, 1) = π(j)B(j, ω1) for j =

1, . . . , n.

The algorithm goes through a forward phase for t =

1, . . . , T , during which it computes the probabilities
score(j, t) for j = 1, . . . , n.

When t = T , we pick an index j such that score(j, T ) is

maximal.

The machine learning community is fond of the notation

j = arg max_k score(k, T)

to express the above fact. Typically, the smallest index

j corresponding to the largest value of score(k, T ) is re-
turned.

This gives us the last state qT = σ^{−1}(j) in an optimal sequence that yields the output sequence O.

The algorithm then goes through a path retrieval phase.

To do this, when we compute

score(j, t) = max_k tscore(k, j),

we also record the index k = σ(qt−1) of the state qt−1 in

the best sequence (q1, . . . , qt−1, qt) for which tscore(k, j)
is maximal (with j = σ(qt)), as pred(j, t) = k.

The index k is often called the backpointer of j at time t.

This state may not be unique; we just pick one of them.

Typically, the smallest index k corresponding to the largest
value of tscore(k, j) is returned.

Again, this can be expressed by

pred(j, t) = arg max_k tscore(k, j).

The predecessors pred(j, t) are only defined for t = 2, . . .,

T , but we can let pred(j, 1) = 0.

Observe that the path retrieval phase of the Viterbi algo-

rithm is very similar to the phase of Dijkstra’s algorithm
for finding a shortest path that follows the prev array.

The forward phase of the Viterbi algorithm is quite different
from Dijkstra’s algorithm, and the Viterbi algorithm
is actually simpler.

The Viterbi algorithm, invented by Andrew Viterbi in

1967, is shown below.

The input to the algorithm is M = (Q, O, π, A, B) and

the sequence of indices ω(O) = (ω1, . . . , ωT ) associated
with the observed sequence O = (O1, O2, . . . , OT ) of
length T , with ωt = ω(Ot) for t = 1, . . . , T .

The output is a sequence of states (q_1, . . . , q_T). This sequence
is determined by the sequence of indices (I_1, . . . , I_T);
namely, q_t = σ^{−1}(I_t).

The Viterbi Algorithm

begin
  for j = 1 to n do
    score(j, 1) = π(j)B(j, ω_1)
  endfor;
  (∗ forward phase to find the best (highest) scores ∗)
  for t = 2 to T do
    for j = 1 to n do
      for k = 1 to n do
        tscore(k) = score(k, t − 1)A(k, j)B(j, ω_t)
      endfor;
      score(j, t) = max_k tscore(k);
      pred(j, t) = arg max_k tscore(k)
    endfor
  endfor;
  (∗ second phase to retrieve the optimal path ∗)
  I_T = arg max_j score(j, T);
  q_T = σ^{−1}(I_T);
  for t = T to 2 by −1 do
    I_{t−1} = pred(I_t, t);
    q_{t−1} = σ^{−1}(I_{t−1})
  endfor
end
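The pseudocode above translates fairly directly into Python. The following is a sketch rather than a reference implementation: the function name viterbi is an assumption, indices are 0-based instead of 1-based, and the function returns state indices, leaving the application of σ^{−1} to the caller.

```python
def viterbi(pi, A, B, obs):
    """Return (best_path, best_probability) for the 0-based index sequence obs."""
    n, T = len(pi), len(obs)
    score = [[0.0] * n for _ in range(T)]
    pred = [[0] * n for _ in range(T)]
    for j in range(n):
        score[0][j] = pi[j] * B[j][obs[0]]
    # Forward phase: best score and backpointer for every state and time.
    for t in range(1, T):
        for j in range(n):
            tscore = [score[t - 1][k] * A[k][j] * B[j][obs[t]] for k in range(n)]
            k_best = tscore.index(max(tscore))   # smallest index on ties
            score[t][j] = tscore[k_best]
            pred[t][j] = k_best
    # Path retrieval phase: follow the backpointers from the best final state.
    last = score[T - 1].index(max(score[T - 1]))
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(pred[t][path[-1]])
    path.reverse()
    return path, score[T - 1][last]

# Example 4.1: 0 = Cold, 1 = Hot; 0 = N, 1 = D; observed NNND.
pi = [0.45, 0.55]
A = [[0.7, 0.3], [0.25, 0.75]]
B = [[0.8, 0.2], [0.3, 0.7]]
path, prob = viterbi(pi, A, B, [0, 0, 0, 1])
print(path, round(prob, 4))   # [0, 0, 0, 1] 0.0237
```

The printed path [0, 0, 0, 1] is (Cold, Cold, Cold, Hot), in agreement with the trellis computed by hand for Example 4.1.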

If we run the Viterbi algorithm on the output sequence

(S, M, S, L) of Example 4.2, we find that the sequence
(Cold, Cold, Cold, Hot) has the highest probability, 0.00282,
among all sequences of length four.
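As a check of this value, assume that the parameters read off Figure 4.2 are π(1) = 0.4, A(1, 1) = 0.6, A(1, 2) = 0.4, B(1, 1) = 0.7, B(1, 2) = 0.2 and B(2, 3) = 0.5 (this reading of the figure is an assumption). With σ(S) = (1, 1, 1, 2) and ω(O) = (1, 2, 1, 3), the formula for Pr(S, O) gives:

```latex
\begin{aligned}
\Pr(S, O) &= \pi(1)\,B(1,1)\,A(1,1)\,B(1,2)\,A(1,1)\,B(1,1)\,A(1,2)\,B(2,3) \\
          &= 0.4 \times 0.7 \times 0.6 \times 0.2 \times 0.6 \times 0.7 \times 0.4 \times 0.5 \\
          &= 0.0028224 \approx 0.00282 .
\end{aligned}
```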

One may have noticed that the numbers involved, being

products of probabilities, become quite small.

Indeed, underflow may arise in dynamic programming.

Fortunately, there is a simple way to avoid underflow by
taking logarithms.
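One way to apply this idea, sketched below, is to store log-probabilities so that products become sums while the maximizing index is unchanged, since the logarithm is increasing. The function name log_viterbi and the convention of treating log 0 as minus infinity are assumptions of this illustration.

```python
import math

def log_viterbi(pi, A, B, obs):
    """Viterbi with log-probabilities: products become sums, max is unchanged."""
    def log(x):
        # log(0) is treated as -infinity so impossible paths never win.
        return math.log(x) if x > 0 else float("-inf")

    n = len(pi)
    score = [log(pi[j]) + log(B[j][obs[0]]) for j in range(n)]
    back = []
    for o in obs[1:]:
        new_score, pred = [], []
        for j in range(n):
            cand = [score[k] + log(A[k][j]) + log(B[j][o]) for k in range(n)]
            k_best = cand.index(max(cand))
            new_score.append(cand[k_best])
            pred.append(k_best)
        score = new_score
        back.append(pred)
    j = score.index(max(score))
    best_log = score[j]
    path = [j]
    for pred in reversed(back):
        j = pred[j]
        path.append(j)
    path.reverse()
    return path, best_log

pi = [0.45, 0.55]
A = [[0.7, 0.3], [0.25, 0.75]]
B = [[0.8, 0.2], [0.3, 0.7]]
path, best_log = log_viterbi(pi, A, B, [0, 0, 0, 1] * 500)   # 2000 observations
print(path[:4], best_log)   # best_log is a finite log-probability
```

For 2000 observations the plain product of probabilities would be far below the smallest positive double-precision number, whereas the sum of logarithms stays in a comfortable range.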

It is immediately verified that the time complexity of the Viterbi algorithm is O(n^2 T).

Let us now turn to the second problem, the evaluation problem (or likelihood problem).

This time, given a finite collection {M1, . . . , ML} of HMM’s

with the same output alphabet O, for any observed out-
put sequence O = (O1, O2, . . . , OT ) of length T , find
which model Mℓ is most likely to have generated O.

More precisely, given any model Mk , we compute the

probability tprobk that Mk could have produced O along
any path.

Then we pick an HMM Mℓ for which tprobℓ is maximal.

It is easy to adapt the Viterbi algorithm to compute

tprobk . This algorithm is called the forward algorithm.

Since we are not looking for an explicit path, there is no need for the second phase, and during the forward
phase, going from t − 1 to t, rather than finding the
maximum of the scores tscore(k) for k = 1, . . . , n, we
just set score(j, t) to the sum over k of the temporary
scores tscore(k).

At the end, tprobk is the sum over j of the probabilities

score(j, T ).

The input to the algorithm is M = (Q, O, π, A, B) and

the sequence of indices ω(O) = (ω1, . . . , ωT ) associated
with the observed sequence O = (O1, O2, . . . , OT ) of
length T , with ωt = ω(Ot) for t = 1, . . . , T .

The output is the probability tprob.

The Forward Algorithm

begin
  for j = 1 to n do
    score(j, 1) = π(j)B(j, ω_1)
  endfor;
  for t = 2 to T do
    for j = 1 to n do
      for k = 1 to n do
        tscore(k) = score(k, t − 1)A(k, j)B(j, ω_t)
      endfor;
      score(j, t) = Σ_k tscore(k)
    endfor
  endfor;
  tprob = Σ_j score(j, T)
end
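A Python sketch of the forward algorithm follows. The function name forward is an assumption, indices are 0-based, and only the current column of scores is kept, since the earlier columns are not needed once tprob is the only output.

```python
def forward(pi, A, B, obs):
    """Return tprob, the total probability of the 0-based output sequence obs."""
    n = len(pi)
    score = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        # Same recurrence as Viterbi, with the max over k replaced by a sum.
        score = [sum(score[k] * A[k][j] for k in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(score)

# Penn model of Example 4.1: 0 = Cold, 1 = Hot; 0 = N, 1 = D; observed NNND.
pi = [0.45, 0.55]
A = [[0.7, 0.3], [0.25, 0.75]]
B = [[0.8, 0.2], [0.3, 0.7]]
tprob = forward(pi, A, B, [0, 0, 0, 1])
print(round(tprob, 4))   # 0.072
```

The value of tprob is the sum of Pr(S, NNND) over all 16 state sequences S, so it is larger than the Viterbi score 0.0237 of the single best sequence.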

We can now run the above algorithm on M1, . . . , ML to

compute tprob1, . . . , tprobL, and we pick the model Mℓ
for which tprobℓ is maximum.

As for the Viterbi algorithm, the time complexity of the forward algorithm is O(n^2 T).

Underflow is also a problem with the forward algorithm.

At first glance it looks like taking logarithms does not help

because there is no simple expression for log(x1 +· · ·+xn)
in terms of the log xi.

Fortunately, we can use the log-sum-exp trick; see the notes.
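A common form of the log-sum-exp trick is to factor out the largest term, using log(x1 + · · · + xn) = m + log(exp(log x1 − m) + · · · + exp(log xn − m)) with m = max log xi. The sketch below applies it to the forward algorithm; the function names logsumexp and log_forward are assumptions, and the version described in the notes may differ in detail.

```python
import math

def logsumexp(xs):
    """Compute log(exp(x1) + ... + exp(xn)) without overflow or underflow."""
    m = max(xs)
    if m == float("-inf"):          # all terms are log(0)
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_forward(pi, A, B, obs):
    """Forward algorithm on log-probabilities; returns log(tprob)."""
    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    n = len(pi)
    score = [log(pi[j]) + log(B[j][obs[0]]) for j in range(n)]
    for o in obs[1:]:
        score = [logsumexp([score[k] + log(A[k][j]) for k in range(n)])
                 + log(B[j][o]) for j in range(n)]
    return logsumexp(score)

pi = [0.45, 0.55]
A = [[0.7, 0.3], [0.25, 0.75]]
B = [[0.8, 0.2], [0.3, 0.7]]
print(round(math.exp(log_forward(pi, A, B, [0, 0, 0, 1])), 4))   # 0.072
```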
Example 4.3. To illustrate the forward algorithm, assume
that our observant student also recorded the drinking
behavior of a professor at Harvard, and that he came
up with the HMM shown in Figure 4.4.

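[Figure: state diagram. start → Cold 0.13, start → Hot 0.87; Cold → Cold 0.33, Cold → Hot 0.67, Hot → Cold 0.1, Hot → Hot 0.9; the emission edges out of Cold are labeled 0.95 and 0.05, and the emission edges out of Hot are labeled 0.8 and 0.2.]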

Figure 4.4: Example of an HMM modeling the “drinking behavior” of a professor at Harvard.

However, the student can’t remember whether he ob-

served the sequence NNND at Penn or at Harvard.

So he runs the forward algorithm on both HMM’s to find

the most likely model. Do it!

No ratings yet
Chapter Six Hidden Markov Models: Event Data (Philip A. Schrodt and Deborah J. Gerner) - Please Do Not Quote Without
64 pages
HMM
No ratings yet
HMM
41 pages
HIDDEN MARKOV MODEL_20241217_231745_0000
No ratings yet
HIDDEN MARKOV MODEL_20241217_231745_0000
12 pages
Hidden Markov Models and Sequential Data
No ratings yet
Hidden Markov Models and Sequential Data
45 pages
Sequence Model:: Hidden Markov Models
No ratings yet
Sequence Model:: Hidden Markov Models
60 pages
Unit - 4 Hidden Markov Models
No ratings yet
Unit - 4 Hidden Markov Models
39 pages
Recitation4 Notes
No ratings yet
Recitation4 Notes
6 pages
Hidden Markov Model
No ratings yet
Hidden Markov Model
36 pages
Hidden Markov Models: A Simple Markov Chain
No ratings yet
Hidden Markov Models: A Simple Markov Chain
46 pages
Hidden Markov Model (HMM) Architecture
No ratings yet
Hidden Markov Model (HMM) Architecture
15 pages
Hidden Markov Models: Background
No ratings yet
Hidden Markov Models: Background
13 pages
Parametric Models Hidden Markov Models
No ratings yet
Parametric Models Hidden Markov Models
30 pages
Slides
No ratings yet
Slides
69 pages
Algorithms - Hidden Markov Models
No ratings yet
Algorithms - Hidden Markov Models
7 pages
HMM
No ratings yet
HMM
5 pages
2024-Fall-CSE366-12-HMM
No ratings yet
2024-Fall-CSE366-12-HMM
46 pages
Hidden Markov Models Applied To Information Extraction: Part I: Concept
No ratings yet
Hidden Markov Models Applied To Information Extraction: Part I: Concept
34 pages
Factorial Hidden Markov Models
No ratings yet
Factorial Hidden Markov Models
29 pages
Hidden Markov Model
No ratings yet
Hidden Markov Model
35 pages
Unit 4 Full PPT (ML)
No ratings yet
Unit 4 Full PPT (ML)
31 pages
Presentation_20241212_094152_0000
No ratings yet
Presentation_20241212_094152_0000
8 pages
HMM
No ratings yet
HMM
24 pages
Hidden Markov Model
No ratings yet
Hidden Markov Model
4 pages
MLRD 8
No ratings yet
MLRD 8
39 pages
Mahee 718&saba 710
No ratings yet
Mahee 718&saba 710
4 pages
Computational Genomics Hidden Markov Models (HMMS)
No ratings yet
Computational Genomics Hidden Markov Models (HMMS)
55 pages
Probability Theory: A Concise Course
From Everand
Probability Theory: A Concise Course
Y. A. Rozanov
4/5 (2)
Student Solutions Manual to Accompany Economic Dynamics in Discrete Time, secondedition
From Everand
Student Solutions Manual to Accompany Economic Dynamics in Discrete Time, secondedition
Yue Jiang
4.5/5 (2)
The Logical Solution Syracuse Conjecture
From Everand
The Logical Solution Syracuse Conjecture
Rolando Zucchini
No ratings yet
Sequences and Infinite Series, A Collection of Solved Problems
From Everand
Sequences and Infinite Series, A Collection of Solved Problems
Steven Tan
No ratings yet
CS 601 Machine Learning Unit 5
No ratings yet
CS 601 Machine Learning Unit 5
18 pages
AI Associate Trailhead Badges Notes
No ratings yet
AI Associate Trailhead Badges Notes
94 pages
8a - Branchwise Subjects For Even Semester
No ratings yet
8a - Branchwise Subjects For Even Semester
4 pages
Similar Data Points Identification With LLM: A Human-In-The-Loop Strategy Using Summarization and Hidden State Insights
No ratings yet
Similar Data Points Identification With LLM: A Human-In-The-Loop Strategy Using Summarization and Hidden State Insights
14 pages
Literature review on forecasting green hydrogen production using machine learning and deep learning
No ratings yet
Literature review on forecasting green hydrogen production using machine learning and deep learning
10 pages
2017 Phrase Mining From Massive Text and Its Applications
No ratings yet
2017 Phrase Mining From Massive Text and Its Applications
89 pages
Cloud Computing and Artificial Intellegence: Syed Furqan Haider Shah BSE-8 (A) FA14-176
No ratings yet
Cloud Computing and Artificial Intellegence: Syed Furqan Haider Shah BSE-8 (A) FA14-176
12 pages
AIML 1 Mark
No ratings yet
AIML 1 Mark
12 pages
Jshara Icsns Viii Cryptoml
No ratings yet
Jshara Icsns Viii Cryptoml
10 pages
An Introduction To Transformers
No ratings yet
An Introduction To Transformers
10 pages
Diploma in AI and ML Brochure
No ratings yet
Diploma in AI and ML Brochure
14 pages
An Ensemble Method For Phishing Websites Detection Based On XGBoost
No ratings yet
An Ensemble Method For Phishing Websites Detection Based On XGBoost
6 pages
聊天機器人之檢索效能之研究以BERT為方法
No ratings yet
聊天機器人之檢索效能之研究以BERT為方法
79 pages
Improvement in Power Transformer Intelligent Dissolved Gas Analysis Method
No ratings yet
Improvement in Power Transformer Intelligent Dissolved Gas Analysis Method
4 pages
AIA 6600 Module 5
No ratings yet
AIA 6600 Module 5
14 pages
Lecture 09 Softmax Classifier
No ratings yet
Lecture 09 Softmax Classifier
46 pages
Automatic Detection of Narcotics - Smiths Detection
No ratings yet
Automatic Detection of Narcotics - Smiths Detection
7 pages
Age and Gender Detection
No ratings yet
Age and Gender Detection
25 pages
Pre-Test 8 - Attempt Review
No ratings yet
Pre-Test 8 - Attempt Review
2 pages
Class 1 C
No ratings yet
Class 1 C
14 pages
SVM Notes
No ratings yet
SVM Notes
40 pages
Domaining by Clustering Multivariate Geostatistical Data
No ratings yet
Domaining by Clustering Multivariate Geostatistical Data
12 pages
Homework 2 Solution PDF
No ratings yet
Homework 2 Solution PDF
5 pages
Multi-Modal Hate Speech Detection Using Machine Learning
No ratings yet
Multi-Modal Hate Speech Detection Using Machine Learning
4 pages
sem 6 ques data science
No ratings yet
sem 6 ques data science
23 pages
Final Report On Big Data and Advanced Analytics PDF
No ratings yet
Final Report On Big Data and Advanced Analytics PDF
60 pages
Unit III Deep Learning Chapter Notes
No ratings yet
Unit III Deep Learning Chapter Notes
23 pages
Blackberry IVY in Depth
No ratings yet
Blackberry IVY in Depth
12 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.