Hidden Markov models are statistical models used to model systems where an underlying process with unknown (hidden) states can be observed through another set of outputs. The document discusses three key problems with hidden Markov models: 1) evaluation, which calculates the probability of an observed sequence given a model; 2) decoding, which finds the most likely sequence of hidden states that produced an observation sequence; and 3) learning, which adjusts the model parameters to better describe a training dataset of observation sequences.

Uploaded by Thanh Xuân Chu · © Attribution Non-Commercial (BY-NC)

Hidden Markov Models

TS. Nguyễn Văn Vinh

Bộ môn KHMT, Trường ĐHCN, ĐH QG Hà nội
(Dept. of Computer Science, University of Engineering and Technology, Vietnam National University, Hanoi)
Introduction
 Modeling dependencies in input
 Sequences:
 Temporal: in speech, phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language); in handwriting, pen movements
 Spatial: in a DNA sequence, base pairs
Andrei Andreyevich Markov

 Born: 14 June 1856 in Ryazan, Russia
 Died: 20 July 1922 in Petrograd (now St Petersburg), Russia

Markov is particularly remembered for his study of Markov chains: sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.
Discrete Markov Process
 N states: S1, S2, ..., SN; the state at "time" t is qt = Si
 First-order Markov:
P(qt+1=Sj | qt=Si, qt−1=Sk, ...) = P(qt+1=Sj | qt=Si)
 Transition probabilities:
aij ≡ P(qt+1=Sj | qt=Si), with aij ≥ 0 and Σj=1..N aij = 1
 Initial probabilities:
πi ≡ P(q1=Si), with Σi=1..N πi = 1
Markov random processes
 A random sequence has the Markov property if its distribution is
determined solely by its current state. Any random process having
this property is called a Markov random process.
 For observable state sequences (state is known from data), this
leads to a Markov chain model.
 For non-observable states, this leads to a Hidden Markov Model
(HMM).
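As a concrete sketch, an observable Markov chain can be simulated directly. The π and A below are the balls-and-urns values used later in the deck; the helper name is mine:

```python
import random

states = ["S1", "S2", "S3"]
pi = [0.5, 0.2, 0.3]            # initial probabilities, sum to 1
A = [[0.4, 0.3, 0.3],           # A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sample_chain(T, rng=random):
    """Draw a state sequence q_1..q_T from the chain defined by (pi, A)."""
    q = [rng.choices(range(len(pi)), weights=pi)[0]]
    for _ in range(T - 1):
        # next state depends only on the current one (Markov property)
        q.append(rng.choices(range(len(pi)), weights=A[q[-1]])[0])
    return [states[i] for i in q]

print(sample_chain(5))  # a random length-5 state sequence
```

Each next state is drawn from the row of A belonging to the current state, which is exactly the first-order Markov property above.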
Chain Rule & Markov Property
Chain rule (repeated application of the product rule):

P(qt, qt−1, ..., q1) = P(qt | qt−1, ..., q1) P(qt−1, ..., q1)
                     = P(qt | qt−1, ..., q1) P(qt−1 | qt−2, ..., q1) P(qt−2, ..., q1)
                     = P(q1) Πi=2..t P(qi | qi−1, ..., q1)

Markov property:

P(qi | qi−1, ..., q1) = P(qi | qi−1) for i > 1

so that

P(qt, qt−1, ..., q1) = P(q1) Πi=2..t P(qi | qi−1) = P(q1) P(q2 | q1) ... P(qt | qt−1)
Stochastic Automaton

P(O = Q | A, π) = P(q1) Πt=2..T P(qt | qt−1) = πq1 aq1q2 ... aqT−1qT
Example: Balls and Urns
 Three urns, each full of balls of one color:
S1: red, S2: blue, S3: green

π = [0.5, 0.2, 0.3]ᵀ

        0.4  0.3  0.3
A  =    0.2  0.6  0.2
        0.1  0.1  0.8

O = {S1, S1, S3, S3}

P(O | A, π) = P(S1) · P(S1 | S1) · P(S3 | S1) · P(S3 | S3)
            = π1 · a11 · a13 · a33
            = 0.5 · 0.4 · 0.3 · 0.8 = 0.048
Balls and Urns: Learning
 Given K example sequences of length T:

π̂i = (# sequences starting with Si) / (# sequences)
    = Σk 1{q1^(k) = Si} / K

âij = (# transitions from Si to Sj) / (# transitions from Si)
    = Σk Σt=1..T−1 1{qt^(k) = Si and qt+1^(k) = Sj} / Σk Σt=1..T−1 1{qt^(k) = Si}
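These counting formulas translate directly into code. A minimal sketch, assuming the K sequences are given as lists of state indices 0..N−1 (function name and toy data are mine):

```python
from collections import Counter

def estimate_markov(seqs, N):
    """ML estimates (pi_hat, A_hat) by counting, as in the formulas above."""
    starts = Counter(s[0] for s in seqs)          # how often each state begins a sequence
    trans = [[0] * N for _ in range(N)]           # trans[i][j] = # transitions Si -> Sj
    for s in seqs:
        for i, j in zip(s, s[1:]):
            trans[i][j] += 1
    pi_hat = [starts[i] / len(seqs) for i in range(N)]
    A_hat = [[trans[i][j] / max(sum(trans[i]), 1) for j in range(N)]
             for i in range(N)]
    return pi_hat, A_hat

seqs = [[0, 0, 1], [0, 1, 1], [1, 0, 0]]          # K=3 toy sequences, T=3
pi_hat, A_hat = estimate_markov(seqs, 2)
print(pi_hat)   # [0.666..., 0.333...]: two of the three sequences start in state 0
```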
Hidden Markov Models
 States are not observable
 Discrete observations {v1,v2,...,vM} are recorded; a
probabilistic function of the state
 Emission probabilities
bj(m) ≡ P(Ot=vm | qt=Sj)
 Example: In each urn, there are balls of different
colors, but with different probabilities.
 For each observation sequence, there are multiple
state sequences
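Generating from an HMM makes the hidden/observed split concrete. A sketch reusing the urn chain from earlier, with illustrative emission probabilities B — these B values are assumptions, not from the slides:

```python
import random

pi = [0.5, 0.2, 0.3]
A = [[0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
# B[j][m] = P(O_t = v_m | q_t = S_j): each urn now mixes all three colors
# (illustrative numbers, assumed for this sketch)
B = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.2, 0.2, 0.6]]
colors = ["red", "blue", "green"]

def sample_hmm(T, rng=random):
    """Return (hidden state sequence, observed colors), each of length T."""
    hidden, observed = [], []
    q = rng.choices(range(3), weights=pi)[0]
    for _ in range(T):
        hidden.append(q)
        observed.append(colors[rng.choices(range(3), weights=B[q])[0]])
        q = rng.choices(range(3), weights=A[q])[0]
    return hidden, observed
```

Only the colors would be shown to a learner; many different hidden sequences could explain any one observation sequence.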
From Markov To Hidden Markov
 The previous model assumes that each state can be uniquely
associated with an observable event
 Once an observation is made, the state of the system is then trivially retrieved
 This model, however, is too restrictive to be of practical use for most realistic
problems
 To make the model more flexible, we will assume that the outcomes or
observations of the model are a probabilistic function of each state
 Each state can produce a number of outputs according to a unique probability
distribution, and each distinct output can potentially be generated at any state
 These are known as Hidden Markov Models (HMMs), because the state sequence
is not directly observable; it can only be approximated from the sequence of
observations produced by the system
The coin-toss problem

 To illustrate the concept of an HMM consider the following scenario


 Assume that you are placed in a room with a curtain
 Behind the curtain there is a person performing a coin-toss experiment
 This person selects one of several coins, and tosses it: heads (H) or tails (T)
 The person tells you the outcome (H,T), but not which coin was used each time
 Your goal is to build a probabilistic model that best explains a
sequence of observations O={o1,o2,o3,o4,…}={H,T,T,H,…}
 The coins represent the states; these are hidden because you do not know
which coin was tossed each time
 The outcome of each toss represents an observation
 A “likely” sequence of coins may be inferred from the observations, but this
state sequence will not be unique
Speech Recognition
 We record the sound signals associated
with words.
 We’d like to identify the ‘speech
recognition features associated with
pronouncing these words.
 The features are the states and the sound
signals are the observations.
The Coin Toss Example – 2 coins
Hidden Markov Model (HMM)
 HMMs allow you to estimate probabilities
of unobserved events
 Given the observed data, which underlying
parameters generated the surface form?
 E.g., in speech recognition, the observed
data is the acoustic signal and the words
are the hidden parameters
HMMs and their Usage
 HMMs are very common in Computational
Linguistics:
 Speech recognition (observed: acoustic signal,
hidden: words)
 Handwriting recognition (observed: image, hidden:
words)
 Part-of-speech tagging (observed: words, hidden:
part-of-speech tags)
 Machine translation (observed: foreign words,
hidden: words in target language)
Noisy Channel Model
 In speech recognition you observe an
acoustic signal (A=a1,…,an) and you want
to determine the most likely sequence of
words (W=w1,…,wn): P(W | A)
Noisy Channel Model
 Assume that the acoustic signal (A) is already segmented wrt word boundaries
 P(W | A) could then be approximated as

P(W | A) ≈ Πai maxwi P(wi | ai)

 Problem: finding the most likely word corresponding to an acoustic representation depends on the context
 E.g., /'pre-z&ns/ could mean "presents" or "presence" depending on the context
Noisy Channel Model
 Given a candidate sequence W, we need to compute P(W) and combine it with P(W | A)
 Applying Bayes' rule:

argmaxW P(W | A) = argmaxW P(A | W) P(W) / P(A)

 The denominator P(A) can be dropped, because it is constant for all W

Noisy Channel in a Picture

Decoding
The decoder combines evidence from
 The likelihood: P(A | W)
This can be approximated as:

P(A | W) ≈ Πi=1..n P(ai | wi)

 The prior: P(W)
This can be approximated as:

P(W) ≈ P(w1) Πi=2..n P(wi | wi−1)
Search Space
 Given a word-segmented acoustic sequence, list all candidates for each segment:

'bot: boat, bald, bold, bought
ik-'spen-siv: excessive, expensive, expressive, inactive
'pre-z&ns: presidents, presence, presents, press

 Compute the most likely path
Markov Assumption
 The Markov assumption states that the probability of the occurrence of word wi at time t depends only on the occurrence of word wi−1 at time t−1
 Chain rule:

P(w1, ..., wn) = P(w1) Πi=2..n P(wi | w1, ..., wi−1)

 Markov assumption:

P(w1, ..., wn) ≈ P(w1) Πi=2..n P(wi | wi−1)
The Trellis
Parameters of an HMM
 States: a set of states S = s1, ..., sn
 Transition probabilities: A = a1,1, a1,2, ..., an,n. Each ai,j represents the probability of transitioning from state si to sj
 Emission probabilities: a set B of functions of the form bi(ot), which is the probability of observation ot being emitted by si
 Initial state distribution: πi is the probability that si is a start state

The Three Basic HMM Problems
 Problem 1 (Evaluation): Given the observation sequence O = o1, ..., oT and an HMM model λ = (A, B, π), how do we compute the probability of O given the model?
 Problem 2 (Decoding): Given the observation sequence O = o1, ..., oT and an HMM model λ = (A, B, π), how do we find the state sequence that best explains the observations?

The Three Basic HMM Problems
 Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?

Problem 1: Probability of an Observation Sequence
 What is P(O | λ)?
 The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
 Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences.
 Even small HMMs, e.g. T=10 and N=10, contain 10 billion different paths
 The solution to this and to Problem 2 is to use dynamic programming
Examples
Example (cont.)

P(O) = ΣQ P(O, Q) = ΣQ P(O | Q) P(Q)

P(3 1 3) = P(3 1 3, cold cold cold) + P(3 1 3, cold cold hot) + P(3 1 3, hot hot cold) + … = ?

The observation likelihood for the ice-cream events 3 1 3 given the hidden state sequence hot hot cold:

P(O, Q) = P(O | Q) · P(Q) = Πi=1..n P(oi | qi) · Πi=1..n P(qi | qi−1)

P(3 1 3, hot hot cold) = ?
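The naïve sum over all state sequences can be spelled out directly for this example. The slide leaves the probabilities as "?", so the numbers below are illustrative assumptions, not values from the deck:

```python
from itertools import product

states = ["hot", "cold"]
pi = {"hot": 0.8, "cold": 0.2}                       # assumed initial probs
A = {("hot", "hot"): 0.7, ("hot", "cold"): 0.3,      # assumed transitions
     ("cold", "hot"): 0.4, ("cold", "cold"): 0.6}
B = {("hot", 3): 0.4, ("hot", 1): 0.2,               # assumed emissions:
     ("cold", 3): 0.1, ("cold", 1): 0.5}             # B[(state, ice creams)]

O = [3, 1, 3]

def joint(O, Q):
    """P(O, Q) = pi(q1) b_q1(o1) * prod_t a_{q(t-1)q(t)} b_qt(ot)."""
    p = pi[Q[0]] * B[(Q[0], O[0])]
    for t in range(1, len(O)):
        p *= A[(Q[t-1], Q[t])] * B[(Q[t], O[t])]
    return p

# brute force: sum P(O, Q) over all N^T = 2^3 state sequences
total = sum(joint(O, Q) for Q in product(states, repeat=len(O)))
print(total)
```

This enumeration is exactly what makes the naïve approach cost O(N^T); the forward algorithm below gets the same number in O(N²T).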
Forward Probabilities
 What is the probability that, given an HMM λ, at time t the state is i and the partial observation o1 … ot has been generated?

αt(i) = P(o1 … ot, qt = si | λ)

Forward Probabilities

αt(j) = [Σi=1..N αt−1(i) aij] bj(ot)
Forward Algorithm
 Initialization:

α1(i) = πi bi(o1), 1 ≤ i ≤ N

 Induction:

αt(j) = [Σi=1..N αt−1(i) aij] bj(ot), 2 ≤ t ≤ T, 1 ≤ j ≤ N

 Termination:

P(O | λ) = Σi=1..N αT(i)
Example
Forward Algorithm Complexity
 The naïve approach to solving Problem 1 takes on the order of 2T·N^T computations
 The forward algorithm takes on the order of N²T computations
Backward Probabilities
 Analogous to the forward probability, just in the other direction
 What is the probability that, given an HMM λ and given that the state at time t is i, the partial observation ot+1 … oT is generated?

βt(i) = P(ot+1 … oT | qt = si, λ)

Backward Probabilities

βt(i) = Σj=1..N aij bj(ot+1) βt+1(j)

Backward Algorithm
 Initialization:

βT(i) = 1, 1 ≤ i ≤ N

 Induction:

βt(i) = Σj=1..N aij bj(ot+1) βt+1(j), t = T−1, ..., 1, 1 ≤ i ≤ N

 Termination:

P(O | λ) = Σi=1..N πi bi(o1) β1(i)
Problem 2: Decoding
 The solution to Problem 1 (Evaluation) gives us the sum over all paths through an HMM efficiently.
 For Problem 2, we want to find the single path with the highest probability.
 We want to find the state sequence Q = q1 … qT such that

Q = argmaxQ' P(Q' | O, λ)
Viterbi Algorithm
 Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum
 Forward:

αt(j) = [Σi=1..N αt−1(i) aij] bj(ot)

 Viterbi recursion:

δt(j) = [max1≤i≤N δt−1(i) aij] bj(ot)
Viterbi Algorithm
 Initialization:

δ1(i) = πi bi(o1), 1 ≤ i ≤ N

 Induction:

δt(j) = [max1≤i≤N δt−1(i) aij] bj(ot)
ψt(j) = argmax1≤i≤N [δt−1(i) aij], 2 ≤ t ≤ T, 1 ≤ j ≤ N

 Termination:

p* = max1≤i≤N δT(i)
qT* = argmax1≤i≤N δT(i)

 Read out path:

qt* = ψt+1(q*t+1), t = T−1, ..., 1
Problem 3: Learning
 Up to now we've assumed that we know the underlying model λ = (A, B, π)
 Often these parameters are estimated on annotated training data, which has two drawbacks:
 Annotation is difficult and/or expensive
 Training data is different from the current data
 We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ' such that λ' = argmaxλ P(O | λ)
Problem 3: Learning
 Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ' such that λ' = argmaxλ P(O | λ)
 But it is possible to find a local maximum
 Given an initial model λ, we can always find a model λ' such that P(O | λ') ≥ P(O | λ)
Parameter Re-estimation
 Use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm
 Using an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters
Parameter Re-estimation
 Three parameters need to be re-estimated:
 Initial state distribution: πi
 Transition probabilities: ai,j
 Emission probabilities: bi(ot)
Re-estimating Transition Probabilities
 What's the probability of being in state si at time t and going to state sj, given the current model and parameters?

ξt(i, j) = P(qt = si, qt+1 = sj | O, λ)

Re-estimating Transition Probabilities

ξt(i, j) = αt(i) ai,j bj(ot+1) βt+1(j) / Σi=1..N Σj=1..N αt(i) ai,j bj(ot+1) βt+1(j)
Re-estimating Transition Probabilities
 The intuition behind the re-estimation equation for transition probabilities is

âi,j = (expected number of transitions from state si to state sj) / (expected number of transitions from state si)

 Formally:

âi,j = Σt=1..T−1 ξt(i, j) / Σt=1..T−1 Σj'=1..N ξt(i, j')
Re-estimating Transition Probabilities
 Defining

γt(i) = Σj=1..N ξt(i, j)

as the probability of being in state si, given the complete observation O,
 We can say:

âi,j = Σt=1..T−1 ξt(i, j) / Σt=1..T−1 γt(i)
Review of Probabilities
 Forward probability αt(i): the probability of being in state si, given the partial observation o1, …, ot
 Backward probability βt(i): the probability of being in state si, given the partial observation ot+1, …, oT
 Transition probability ξt(i, j): the probability of going from state si to state sj, given the complete observation o1, …, oT
 State probability γt(i): the probability of being in state si, given the complete observation o1, …, oT
Re-estimating Initial State Probabilities
 Initial state distribution: πi is the probability that si is a start state
 Re-estimation is easy:

π̂i = expected number of times in state si at time 1

 Formally: π̂i = γ1(i)
Re-estimation of Emission Probabilities
 Emission probabilities are re-estimated as

b̂i(k) = (expected number of times in state si observing symbol vk) / (expected number of times in state si)

 Formally:

b̂i(k) = Σt=1..T δ(ot, vk) γt(i) / Σt=1..T γt(i)

where δ(ot, vk) = 1 if ot = vk, and 0 otherwise
 Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm!
The Updated Model
 Coming from λ = (A, B, π) we get to λ' = (Â, B̂, π̂) by the following update rules:

π̂i = γ1(i)

âi,j = Σt=1..T−1 ξt(i, j) / Σt=1..T−1 γt(i)

b̂i(k) = Σt=1..T δ(ot, vk) γt(i) / Σt=1..T γt(i)



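All three update rules, together with the E-step quantities they need, fit in one function. A minimal single-sequence sketch (no rescaling, so only suitable for short O; names are mine):

```python
def baum_welch_step(O, pi, A, B):
    """One forward-backward (EM) re-estimation step for a discrete HMM.
    O: observation indices; returns updated (pi', A', B')."""
    N, T, M = len(pi), len(O), len(B[0])
    # E step: forward (alpha) and backward (beta) tables
    alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
                   for i in range(N)]
    pO = sum(alpha[-1])                      # P(O | lambda)
    gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / pO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M step: the three update rules above
    pi_new = gamma[0][:]
    A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    B_new = [[sum(gamma[t][i] for t in range(T) if O[t] == k) /
              sum(gamma[t][i] for t in range(T)) for k in range(M)]
             for i in range(N)]
    return pi_new, A_new, B_new
```

Iterating this step monotonically improves P(O | λ), converging to a local maximum as stated on the previous slides.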
Expectation Maximization
 The forward-backward algorithm is an instance of the more general EM algorithm
 The E step: compute the forward and backward probabilities for a given model
 The M step: re-estimate the model parameters
Exercise
 Implement the Viterbi algorithm
 Apply an HMM to part-of-speech tagging
