
Hidden Markov Models
hidden Markov model
• a doubly (or bivariate) stochastic process in which an underlying stochastic process that is not observable (hidden) can only be observed through another stochastic process that produces a sequence of observations
- state process
- observation process
a little story about HMM
[figure: a small HMM with a start state s1 and two hidden states H = Hot (s2) and C = Cold (s3); start probabilities .2 and .8, self-transitions .7 (H) and .6 (C), cross-transitions .3 (H→C) and .4 (C→H); emission distributions B1 and B2:
P(1|H) = .2, P(2|H) = .4, P(3|H) = .4
P(1|C) = .5, P(2|C) = .4, P(3|C) = .1]
HMM: definition
λ = (S, V, A, B, Π)
- S = {s_1, ..., s_N}: set of states
- V = {v_1, ..., v_M}: set of observation symbols
- A = [a_{ij}]: state transition probabilities, where a_{ij} ≜ P(q_{t+1} = s_j | q_t = s_i), ∀i, j = 1, ..., N, and q_t, t = 1, ..., T, is the state at time t
- B = [b_j(m)]: observation probabilities, where b_j(m) ≜ P(o_t = v_m | q_t = s_j), ∀j = 1, ..., N, and o_t, t = 1, ..., T, is the observation at time t
- Π = [π_i]: initial state probabilities, where π_i ≜ P(q_1 = s_i)
stochastic constraints on HMM
Σ_{j=1}^N a_{ij} = 1, ∀i
Σ_{m=1}^M b_j(m) = 1, ∀j
Σ_{i=1}^N π_i = 1
assumptions

Markov assumption:
P(q_t | q_1 ... q_{t-1}) = P(q_t | q_{t-1})

output independence:
P(o_t | q_1 ... q_t ... q_T, o_1 ... o_{t-1}, o_{t+1} ... o_T) = P(o_t | q_t)

multinomial observations:
P(o_t | q_t = s_j, λ) = Π_{m=1}^M b_j(m)^{r_t^m}
where r_t^m = 1 if o_t = v_m, and 0 otherwise
3 fundamental problems

likelihood problem:
Given an HMM λ = (A, B, Π) and an observation sequence O, determine the likelihood P(O|λ)

decoding problem:
Given an HMM λ = (A, B, Π) and an observation sequence O, discover the best hidden state sequence Q

learning problem:
Given an observation sequence O, the set of states S, and the set of symbols V, learn the HMM parameters A and B
likelihood: probability evaluation
O = o_1 o_2 ... o_T, Q = q_1 q_2 ... q_T

P(O|Q, λ) = Π_{t=1}^T P(o_t | q_t, λ)

P(O, Q|λ) = P(O|Q, λ) P(Q|λ)
= [Π_{t=1}^T P(o_t | q_t, λ)] · [π_{q_1} Π_{t=1}^{T-1} P(q_{t+1} | q_t, λ)]

P(O|λ) = Σ_Q P(O|Q, λ) P(Q|λ)
= Σ_{q_1 q_2 ... q_T} π_{q_1} b_{q_1}(o_1) a_{q_1 q_2} b_{q_2}(o_2) ... a_{q_{T-1} q_T} b_{q_T}(o_T)
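For tiny problems, the sum over all N^T state sequences can be evaluated directly. A minimal sketch in Python, using assumed toy parameters (not taken from the slides); its cost is exponential in T, which is what the forward algorithm later avoids:

```python
import itertools
import numpy as np

# toy HMM (assumed parameters, for illustration only)
pi = np.array([0.6, 0.4])            # initial state probabilities
A = np.array([[0.7, 0.3],            # transition probabilities a_ij
              [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4],       # emission probabilities b_j(m)
              [0.5, 0.4, 0.1]])

def likelihood_brute_force(obs, pi, A, B):
    """P(O|lambda) by enumerating all N^T state sequences."""
    N, T = A.shape[0], len(obs)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):
        p = pi[Q[0]] * B[Q[0], obs[0]]           # pi_{q1} * b_{q1}(o1)
        for t in range(1, T):
            p *= A[Q[t-1], Q[t]] * B[Q[t], obs[t]]
        total += p
    return total

O = [2, 0, 2]   # observation indices, e.g. symbols 3, 1, 3
print(likelihood_brute_force(O, pi, A, B))
```

Even for N = 10 states and T = 20 observations this loop would visit 10^20 sequences, so it is only a correctness check against the dynamic-programming methods below.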
likelihood computation
[figure: the Hot/Cold ice-cream HMM again — start state s1, hidden states H (s2) and C (s3), with the transition and emission probabilities given earlier]

P(313|λ) = P(313, CCC|λ) + P(313, CCH|λ) + P(313, HHC|λ) + ...

P(313, HHC|λ) = P(H|q_0) × P(H|H) × P(C|H) × P(3|H) × P(1|H) × P(3|C)
likelihood: forward variable

α_t(j) ≜ P(o_1 o_2 ... o_t, q_t = j | λ)

α_t(j) = [Σ_{i=1}^N α_{t-1}(i) a_{ij}] b_j(o_t)

[figure: trellis from time t-1 to time t — every state s_i at time t-1 feeds s_j at time t with weight a_{ij}, and s_j emits o_t]
likelihood: forward algorithm

1. Initialization
α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N

2. Induction
α_{t+1}(j) = [Σ_{i=1}^N α_t(i) a_{ij}] b_j(o_{t+1}), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N

3. Termination
P(O|λ) = Σ_{i=1}^N α_T(i)
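The three steps map directly onto code in O(N²T) time. A minimal sketch, with toy parameters assumed for the demo (they are not from the slides):

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm: alpha (T x N) and the likelihood P(O|lambda)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                    # 1. initialization
    for t in range(1, T):                           # 2. induction
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                   # 3. termination

# toy parameters (assumed for illustration)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
alpha, p = forward([2, 0, 2], pi, A, B)
print(p)  # ~0.021968
```

The matrix product `alpha[t-1] @ A` computes all N inner sums at once, which is the usual vectorized form of the induction step.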
backward variable

β_t(i) ≜ P(o_{t+1} o_{t+2} ... o_T | q_t = i, λ)

β_t(i) = Σ_{j=1}^N a_{ij} b_j(o_{t+1}) β_{t+1}(j)

[figure: trellis from time t to time t+1 — state s_i at time t connects to every s_j at time t+1 with weight a_{ij}, and s_j emits o_{t+1}]
backward procedure

1. initialization
β_T(i) = 1, 1 ≤ i ≤ N

2. induction
β_t(i) = Σ_{j=1}^N a_{ij} b_j(o_{t+1}) β_{t+1}(j), t = T-1, T-2, ..., 1, 1 ≤ i ≤ N
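The backward pass is the mirror image of the forward pass. A sketch with assumed toy parameters; as a sanity check, Σ_i π_i b_i(o_1) β_1(i) must reproduce the same P(O|λ) as the forward algorithm:

```python
import numpy as np

def backward(obs, A, B):
    """Backward procedure: beta (T x N)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                   # 1. initialization
    for t in range(T - 2, -1, -1):                   # 2. induction
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    return beta

# toy parameters (assumed for illustration)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
obs = [2, 0, 2]
beta = backward(obs, A, B)
print((pi * B[:, obs[0]] * beta[0]).sum())  # P(O|lambda), ~0.021968
```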
decoding: most likely state

γ_t(i) ≜ P(q_t = i | O, λ) = P(O, q_t = i | λ) / P(O|λ)
= P(O, q_t = i | λ) / Σ_{i=1}^N P(O, q_t = i | λ)
= α_t(i) β_t(i) / Σ_{i=1}^N α_t(i) β_t(i)

[figure: trellis at times t-1, t, t+1 — α_t(i) sums the paths reaching s_i at time t, β_t(i) sums the paths leaving it]

most likely state at time t:
q_t* = arg max_{1≤i≤N} γ_t(i), 1 ≤ t ≤ T
Viterbi algorithm

δ_t(i) ≜ max_{q_1 q_2 ... q_{t-1}} P(q_1 q_2 ... q_{t-1}, q_t = i, o_1 o_2 ... o_t | λ)

δ_{t+1}(j) = [max_i δ_t(i) a_{ij}] · b_j(o_{t+1})

ψ_t(j) ≜ the state that maximizes δ_{t-1}(i) a_{ij} at time t-1
Viterbi algorithm

1. initialization
δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0, 1 ≤ i ≤ N

2. recursion
δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) a_{ij}] b_j(o_t)
ψ_t(j) = arg max_{1≤i≤N} [δ_{t-1}(i) a_{ij}], 2 ≤ t ≤ T, 1 ≤ j ≤ N

3. termination
P* = max_{1≤i≤N} [δ_T(i)]
q_T* = arg max_{1≤i≤N} [δ_T(i)]

4. path backtracking
q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, ..., 1
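The four steps can be sketched as follows (toy parameters assumed for the demo, not from the slides):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Viterbi algorithm: most likely state sequence and its probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                 # 1. initialization
    for t in range(1, T):                        # 2. recursion
        scores = delta[t-1][:, None] * A         # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    q = np.zeros(T, dtype=int)                   # 3. termination
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):               # 4. path backtracking
        q[t] = psi[t+1][q[t+1]]
    return q, delta[-1].max()

# toy parameters (assumed for illustration)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
path, p_star = viterbi([2, 0, 2], pi, A, B)
print(path, p_star)
```

Note the structural similarity to the forward algorithm: the only change is replacing the sum over predecessor states by a max, plus recording the argmax for backtracking.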
learning problem
choose λ such that the likelihood P(O|λ) is maximized

[figure: the Hot/Cold HMM with every transition and emission probability replaced by "?" — the parameters to be learned]

Q = {Hot, Cold}
O = {1, 3, 2, 2, 1, …}
learning problem
ξ_t(i, j) ≜ P(q_t = i, q_{t+1} = j | O, λ)
= P(q_t = i, q_{t+1} = j, O | λ) / P(O|λ)
= α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j) / P(O|λ)
= α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^N Σ_{j=1}^N α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j)

[figure: trellis segment showing the transition s_i → s_j between times t and t+1, with o_{t+1} emitted at s_j]
learning problem
γ_t(i) = Σ_{j=1}^N ξ_t(i, j)

Σ_{t=1}^{T-1} γ_t(i) = expected # of transitions from state i in O

Σ_{t=1}^{T-1} ξ_t(i, j) = expected # of transitions from state i to state j in O
(re)estimation of HMM parameters

π̂_i = expected frequency in state i at t = 1 = γ_1(i)

â_{ij} = expected # of transitions from state i to state j / expected # of transitions from state i
= Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

b̂_j(m) = expected # of times in state j and observing v_m / expected # of times in state j
= Σ_{t ∈ {1,...,T | o_t = v_m}} γ_t(j) / Σ_{t=1}^T γ_t(j)
forward-backward algorithm (Baum-Welch)
initialize A, B, and Π
iterate until convergence:

E-step:
γ_t(i) = α_t(i) β_t(i) / Σ_{i=1}^N α_t(i) β_t(i), ∀t, i
ξ_t(i, j) = α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^N Σ_{j=1}^N α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j), ∀t, i, j

M-step:
π̂_i = γ_1(i)
â_{ij} = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)
b̂_j(m) = Σ_{t=1,...,T s.t. o_t = v_m} γ_t(j) / Σ_{t=1}^T γ_t(j)

where
α_t(j) ≜ P(o_1 o_2 ... o_t, q_t = j | λ) = [Σ_{i=1}^N α_{t-1}(i) a_{ij}] b_j(o_t)
β_t(i) ≜ P(o_{t+1} o_{t+2} ... o_T | q_t = i, λ) = Σ_{j=1}^N a_{ij} b_j(o_{t+1}) β_{t+1}(j)
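The whole E-step/M-step loop for a single training sequence can be sketched as below. This is an unscaled version, adequate only for short sequences; the initial parameters and observation sequence are assumed toy values:

```python
import numpy as np

def baum_welch(obs, pi, A, B, n_iter=50):
    """Single-sequence Baum-Welch: iterate E-step (gamma, xi) and M-step."""
    N, T = A.shape[0], len(obs)
    pi, A, B = pi.copy(), A.copy(), B.copy()
    for _ in range(n_iter):
        # E-step: forward and backward variables
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
        p_obs = alpha[-1].sum()
        gamma = alpha * beta / p_obs                     # gamma_t(i)
        xi = (alpha[:-1, :, None] * A[None] *            # xi_t(i, j)
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_obs
        # M-step: re-estimate pi, A, B from expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for m in range(B.shape[1]):
            B[:, m] = gamma[np.array(obs) == m].sum(axis=0) / gamma.sum(axis=0)
    return pi, A, B

# toy run (assumed initial parameters and observations)
pi0 = np.array([0.5, 0.5])
A0 = np.array([[0.6, 0.4], [0.3, 0.7]])
B0 = np.array([[0.3, 0.3, 0.4], [0.5, 0.3, 0.2]])
obs = [0, 2, 1, 0, 2, 2, 1, 0]
pi_hat, A_hat, B_hat = baum_welch(obs, pi0, A0, B0)
```

After each M-step the rows of A and B still sum to 1, and each iteration is guaranteed not to decrease P(O|λ), which is the EM monotonicity property.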
multiple training sequences
X ≜ {O^k}_{k=1}^K

P(X|λ) = Π_{k=1}^K P(O^k | λ)

π̂_i = (1/K) Σ_{k=1}^K γ_1^k(i)

â_{ij} = Σ_{k=1}^K Σ_{t=1}^{T_k − 1} ξ_t^k(i, j) / Σ_{k=1}^K Σ_{t=1}^{T_k − 1} γ_t^k(i)

b̂_j(m) = Σ_{k=1}^K Σ_{t s.t. o_t^k = v_m} γ_t^k(j) / Σ_{k=1}^K Σ_{t=1}^{T_k} γ_t^k(j)
model selection
• tuning the topology of an HMM
- zeroing out impossible (or unnecessary) transitions: a_{ij} = 0
- moving forward only (left-to-right models): a_{ij} = 0 for j < i
- no big jumps: a_{ij} = 0 for j > i + τ
• # of states
- determined using prior information
- can be fine-tuned by cross-validation, checking the likelihood of validation sequences
finite mixture
P(o_t = v_m) = Σ_{j=1}^N P(o_t = v_m | q_t = s_j) P(q_t = s_j)

if a_{ij} = w_j, j = 1, ..., N, ∀i
⇒ P(q_t = s_j) = w_j
⇒ P(o_t = v_m) = Σ_{j=1}^N b_j(m) · w_j
continuous observation densities
P(o_t | q_t = s_j) ∼ N(μ_j, σ_j²)

M-step equations:
μ̂_j = Σ_t γ_t(j) o_t / Σ_t γ_t(j)
σ̂_j² = Σ_t γ_t(j) (o_t − μ̂_j)² / Σ_t γ_t(j)
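These two updates are just γ-weighted means and variances per state. A minimal sketch; the `gamma` matrix would come from an E-step, here it is a hand-made placeholder:

```python
import numpy as np

def gaussian_m_step(gamma, obs):
    """M-step for Gaussian emissions: gamma-weighted mean and variance per state.

    gamma: (T, N) posteriors gamma_t(j); obs: (T,) real-valued observations.
    """
    obs = np.asarray(obs, dtype=float)
    w = gamma.sum(axis=0)                          # sum_t gamma_t(j)
    mu = gamma.T @ obs / w                         # weighted means mu_j
    var = (gamma * (obs[:, None] - mu) ** 2).sum(axis=0) / w
    return mu, var

# placeholder posteriors: first two observations assigned to state 0, rest to state 1
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mu, var = gaussian_m_step(gamma, [1.0, 3.0, 5.0, 7.0])
print(mu, var)  # [2. 6.] [1. 1.]
```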
earthquake example
(source: D. W. Chambers, et al., Hidden Markov model forecasting of earthquakes)

• T = 1227 earthquakes in southern California, 1932-2004
• (s1, s2, s3) = (short, moderate, long) time between earthquakes
• o_1, o_2, …, o_T are the actual inter-event times
- o_t: # of days between earthquakes t-1 and t
• given q_t = s_i, o_t follows an exponential distribution with mean θ_i s.t. θ_1 < θ_2 < θ_3
earthquake example
[figure: a fully connected three-state HMM (s1, s2, s3) with unknown transition probabilities; each state s_i emits inter-event times from its own exponential density p(o_t | q_t = s_i)]
earthquake: forecasting problem
Given an HMM with parameters (A, θ_1, θ_2, θ_3, Π), and the observations up to the present, find the forecast density:

p(o_{t+1} = x | o_1 o_2 ... o_t, λ)
= Σ_{i=1}^3 p(o_{t+1} = x, q_{t+1} = s_i | o_1 o_2 ... o_t, λ)
= Σ_{i=1}^3 p(o_{t+1} = x | q_{t+1} = s_i, o_1 ... o_t, λ) · p(q_{t+1} = s_i | o_1 ... o_t, λ)
= Σ_{i=1}^3 (1/θ_i) e^{−x/θ_i} · p(q_{t+1} = s_i | o_1 ... o_t, λ)
= Σ_{i=1}^3 (1/θ_i) e^{−x/θ_i} · p(q_{t+1} = s_i, o_1 ... o_t | λ) / p(o_1 ... o_t | λ)
= Σ_{i=1}^3 (1/θ_i) e^{−x/θ_i} · [Σ_{j=1}^3 α_t(j) a_{ji}] / [Σ_{j=1}^3 α_t(j)]
earthquake: forecasting problem
P(o_{t+1} ≤ x | o_1 o_2 ... o_t, λ)
= Σ_{i=1}^3 (1 − e^{−x/θ_i}) · [Σ_{j=1}^3 α_t(j) a_{ji}] / [Σ_{j=1}^3 α_t(j)]
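This forecast probability is a mixture of exponential CDFs weighted by the predicted next-state distribution. A sketch; the θ values follow the slides' fitted means, while the transition matrix and forward vector are made-up placeholders, since the fitted A is not given here:

```python
import numpy as np

def quake_forecast(alpha_t, A, theta, x):
    """P(o_{t+1} <= x | o_1 ... o_t, lambda): mixture of exponential CDFs
    weighted by sum_j alpha_t(j) a_ji / sum_j alpha_t(j)."""
    state_pred = alpha_t @ A / alpha_t.sum()       # P(q_{t+1} = s_i | o_1 ... o_t)
    return ((1.0 - np.exp(-x / theta)) * state_pred).sum()

theta = np.array([1.3, 17.42, 27.92])              # fitted mean inter-event times (days)
A = np.array([[0.5, 0.3, 0.2],                     # placeholder transition matrix
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
alpha_t = np.array([0.2, 0.5, 0.3])                # placeholder forward vector
print(quake_forecast(alpha_t, A, theta, x=7.0))    # P(another quake within 7 days)
```

Note that `alpha_t` need not be normalized: the denominator Σ_j α_t(j) cancels any scaling.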
earthquake: the results
• θ̂_1 = 1.3, θ̂_2 = 17.42, θ̂_3 = 27.92
• 1226 forecasts were made to find the probability of another earthquake within 7 days
• proportion of times an earthquake actually did occur within 7 days, binned by forecast probability:

forecast probability: [.28, .32) | [.32, .36) | [.36, .50) | [.50, 1]
observed proportion:     .292    |    .300    |    .421    |   .625
classification
• set of HMMs, each one modeling the sequences belonging to one class

arg max_i P(λ_i | O) = arg max_i P(O|λ_i) P(λ_i) / Σ_j P(O|λ_j) P(λ_j)
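Since the denominator is the same for every class, classification reduces to scoring O under each class HMM with the forward algorithm and taking the argmax of P(O|λ_i) P(λ_i). A sketch with two hypothetical class models (all parameters assumed):

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """P(O|lambda) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]
    return alpha.sum()

def classify(obs, models, priors):
    """argmax_i P(O|lambda_i) P(lambda_i); the shared denominator is dropped."""
    scores = [forward_likelihood(obs, *m) * p for m, p in zip(models, priors)]
    return int(np.argmax(scores))

# two hypothetical class HMMs: one favors symbol 0, the other symbol 2
uniform = np.array([[0.5, 0.5], [0.5, 0.5]])
models = [
    (np.array([0.5, 0.5]), uniform, np.array([[0.8, 0.1, 0.1], [0.8, 0.1, 0.1]])),
    (np.array([0.5, 0.5]), uniform, np.array([[0.1, 0.1, 0.8], [0.1, 0.1, 0.8]])),
]
print(classify([0, 0, 0], models, [0.5, 0.5]))  # → 0
```

For long sequences the per-class scores should be compared in log space (see the scaling discussion under numerical issues).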
numerical issues
• scaling α and β
- for large t, the α_t(i) values underflow the floating-point range (even in double precision)
• smoothing
- a symbol v_m that does not appear in the training sequence makes b_j(m) = 0
• imbalance between emission and transition probabilities
- b_j(m) << a_{ij}
- P(O|λ) becomes mostly influenced by b_j(m)
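The standard fix for the underflow issue is to rescale α at every step and accumulate the log of the scaling factors. A sketch, with assumed toy parameters; a raw forward pass over 2000 steps would underflow to 0, while the scaled version returns a finite log-likelihood:

```python
import numpy as np

def scaled_forward(obs, pi, A, B):
    """Forward pass with per-step rescaling; returns log P(O|lambda).

    Normalizing alpha at each step and summing log scaling factors keeps
    long sequences inside floating-point range, where the raw alpha_t(i)
    values would underflow.
    """
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for t in range(1, len(obs)):
        c = alpha.sum()                           # scaling factor c_t
        log_p += np.log(c)
        alpha = ((alpha / c) @ A) * B[:, obs[t]]
    return log_p + np.log(alpha.sum())

# toy parameters (assumed for illustration)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
print(scaled_forward([0] * 2000, pi, A, B))       # finite log-likelihood
```

The smoothing issue is usually handled separately, e.g. by adding a small pseudocount to every b_j(m) before normalizing.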
further applications
• HMM for hot topic detection
• user state mining for games
• automatic composition of Mozart-style music
readings
• K. Seymore, A. McCallum, and R. Rosenfeld, “Learning Hidden Markov
Model Structure for Information Extraction,” Proc. AAAI Workshop
on Machine Learning for Information Extraction, 1999.
• A. Srivastava, et al., “Credit Card Fraud Detection Using Hidden Markov
Model,” IEEE Trans. Dependable and Secure Computing, vol. 5, no. 1,
pp. 37-48, 2008.
• S. M. Siddiqi and A. W. Moore, “Fast Inference and Learning in Large-
State-Space HMMs,” Proc. 22nd Int’l Conf. Machine Learning, 2005.
• O. Netzer, J. M. Lattin, and V. Srinivasan, “A Hidden Markov Model of
Customer Relationship Dynamics,” Marketing Science, vol. 27, no. 2, pp.
185-204, 2008.
• M. J. Zaki, C. D. Carothers, and B. K. Szymanski, “VOGUE: A Variable
Order Hidden Markov Model with Duration Based on Frequent Sequence
Mining,” ACM Trans. Knowledge Discovery from Data, vol. 4, no. 1,
2010.
• J. Kleinberg, “Bursty and Hierarchical Structure in Streams,” Proc. 8th
ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, 2002.
references
• E. Alpaydin, Introduction to Machine Learning, The MIT
Press, 2004.
• O. C. Ibe, Markov Processes for Stochastic Modeling,
Academic Press, 2009.
• F. Jelinek, Statistical Methods for Speech Recognition, The
MIT Press, 1997.
• D. Jurafsky and J. H. Martin, Speech and Language
Processing, 2nd Ed., Prentice Hall, 2009
• T. Koski, Hidden Markov Models for Bioinformatics, Kluwer
Academic Publishers, 2001.
• L. Rabiner and B-H. Juang, Fundamentals of Speech
Recognition, Prentice Hall, 1993.
