
Probabilistic Graphical Models

Lecture 8: State-Space Models

Based on slides by Richard Zemel




Sequential data
Turn attention to sequential data
–  Time-series: stock market, speech, video analysis
–  Ordered: text, genes

Simple example: Dealer A is fair; Dealer B is not
[Figure: two-state transition diagram between dealers A and B, with coin outcomes C=h and C=t labelling the transitions]

Process (let Z be dealer A or B):
Loop until tired:
1.  Flip coin C, use it to decide whether to switch dealer
2.  Chosen dealer rolls die, record result

Fully observable formulation: data is sequence of dealer selections

AAAABBBBAABBBBBBBAAAAABBBBB
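A minimal simulation sketch of this process (Python); the switch probability and dealer B's loaded die below are illustrative assumptions, not values given on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed parameters: coin-flip probability of switching dealers, and
# dealer B's loaded die (dealer A's die is fair).
p_switch = 0.1
die = {'A': np.ones(6) / 6,
       'B': np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])}

def simulate(T):
    z, states, rolls = 'A', [], []
    for _ in range(T):
        if rng.random() < p_switch:                  # flip coin C: switch dealer?
            z = 'B' if z == 'A' else 'A'
        states.append(z)
        rolls.append(rng.choice(6, p=die[z]) + 1)    # chosen dealer rolls the die
    return states, rolls

states, rolls = simulate(27)
print(''.join(states))    # fully observable formulation: the dealer sequence
```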
Simple example: Markov model

•  If underlying process unknown, can construct model to predict next letter in sequence
•  In general, product rule expresses joint distribution for sequence:
   P(X1:T) = P(X1) ∏t=2..T P(Xt | X1:t-1)
•  First-order Markov chain: each observation independent of all previous observations except most recent:
   P(X1:T) = P(X1) ∏t=2..T P(Xt | Xt-1)
•  ML parameter estimates are easy
•  Each pair of outputs is a training case; in this example:
   P(Xt = B | Xt-1 = A) = #[t s.t. Xt = B, Xt-1 = A] / #[t s.t. Xt-1 = A]
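A sketch of this counting estimate applied to the dealer sequence above:

```python
from collections import Counter

seq = "AAAABBBBAABBBBBBBAAAAABBBBB"

# Each adjacent pair (X_{t-1}, X_t) is a training case.
pair_counts = Counter(zip(seq, seq[1:]))
prev_counts = Counter(seq[:-1])

P_B_given_A = pair_counts[('A', 'B')] / prev_counts['A']
print(P_B_given_A)   # ML estimate of P(Xt = B | Xt-1 = A)
```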
Higher-order Markov models
•  Consider example of text
•  Can capture some regularities with bigrams (e.g., q nearly always followed by u, very rarely by j) – probability of a letter given just its preceding letter
•  But probability of a letter depends on more than just the previous letter
•  Can formulate as second-order Markov model (trigram model)
•  Need to take care: many counts may be zero in training dataset (see the smoothing sketch below)
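One common remedy for zero counts (not specified on the slide) is additive/Laplace smoothing; a minimal sketch on an illustrative toy string:

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog"
alphabet = sorted(set(text))
V = len(alphabet)

trigram = Counter(zip(text, text[1:], text[2:]))
bigram  = Counter(zip(text, text[1:]))

def p_next(c1, c2, c3, alpha=1.0):
    # P(X_t = c3 | X_{t-2} = c1, X_{t-1} = c2) with add-alpha smoothing,
    # so trigrams unseen in training still get nonzero probability
    return (trigram[(c1, c2, c3)] + alpha) / (bigram[(c1, c2)] + alpha * V)

print(p_next('t', 'h', 'e'), p_next('t', 'h', 'q'))
```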

Character recognition: Transition probabilities
[Figure: letter-to-letter transition probability matrix]
Hidden Markov model (HMM)
•  Return to casino example – now imagine that we do not observe the dealer sequence ABBAA..., but instead just the sequence of die rolls (1-6)

•  Generative process:
Loop until tired:
1. Flip coin C (Z = A or B)
2. Chosen dealer rolls die, record result X

Z is now hidden state variable – 1st order Markov chain generates state sequence (path), governed by transition matrix A

Observations governed by emission probabilities, which convert the state path into a sequence of observable symbols or vectors.
Relationship to other models
•  Can think of HMM as:
–  Markov chain with stochastic measurements
–  Mixture model with states coupled across time

•  Hidden state is 1st-order Markov, but output not Markov of any order
•  Future is independent of past given present, but conditioning on observations couples hidden states
Character Recognition Example

Which letters are these?
[Figure: images of handwritten characters]

HMM: Character Recognition Example

Context matters: recognition easier based on sequence of characters
How to apply HMM to this character string?
Main elements: states? emission, transition probabilities?

HMM: Semantics

[Figure: HMM graphical model – hidden states z1, ..., z5, each in {a, ..., z}, emitting observed character images x1, ..., x5]

Need 3 distributions:
1.  Initial state: P(Z1)
2.  Transition model: P(Zt|Zt-1)
3.  Observation model (emission probabilities): P(Xt|Zt)
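For the casino example these three distributions can be written down explicitly; a sketch with illustrative (assumed) numbers:

```python
import numpy as np

# States: 0 = dealer A (fair), 1 = dealer B (loaded); observations: die faces 1-6
pi = np.array([0.5, 0.5])                      # 1. initial state P(Z1)
A  = np.array([[0.9, 0.1],                     # 2. transition model P(Zt | Zt-1)
               [0.1, 0.9]])
B  = np.array([[1/6] * 6,                      # 3. emission probabilities P(Xt | Zt)
               [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

# Draw one (hidden state, observed symbol) pair from the model
rng = np.random.default_rng(0)
z1 = rng.choice(2, p=pi)
x1 = rng.choice(6, p=B[z1]) + 1                # die face shown by the chosen dealer
```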

HMM: Main tasks

•  Joint probabilities of hidden states and outputs:
   P(x, z) = P(z1) P(x1 | z1) ∏t=2..T P(zt | zt−1) P(xt | zt)

•  Three problems
1.  Computing probability of observed sequence: forward-backward algorithm [good for recognition]
2.  Infer most likely hidden state sequence: Viterbi algorithm [useful for interpretation]
3.  Learning parameters: Baum-Welch algorithm (version of EM)
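The joint probability above translates directly into a log-probability function; a minimal sketch (pi, A, B are the initial, transition, and emission arrays as in the earlier sketch; x and z are integer-coded sequences):

```python
import numpy as np

def log_joint(x, z, pi, A, B):
    """log P(x, z) for integer-coded observations x and state path z."""
    lp = np.log(pi[z[0]]) + np.log(B[z[0], x[0]])
    for t in range(1, len(x)):
        lp += np.log(A[z[t-1], z[t]]) + np.log(B[z[t], x[t]])
    return lp
```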
Fully observed HMM
Learning a fully observed HMM (observe both X and Z) is easy:
1.  Initial state: P(Z1) – proportion of words that start with each letter
2.  Transition model: P(Zt|Zt-1) – proportion of times a given letter follows another (bigram statistics)
3.  Observation model (emission probabilities): P(Xt|Zt) – how often a particular image represents a specific character, relative to all images

But still have to do inference at test time: work out states given observations

HMMs often used where hidden states are identified: words in speech recognition; activity recognition; spatial position of a rat; genes; POS tagging
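A minimal sketch of these counting estimates when both the observation and state sequences are given (hypothetical helper, integer-coded sequences):

```python
import numpy as np

def fit_fully_observed(X, Z, K, V):
    """ML estimates of (pi, A, B) from paired state sequences Z and observation
    sequences X, with K states and V observation symbols (all integer-coded)."""
    pi = np.zeros(K); A = np.zeros((K, K)); B = np.zeros((K, V))
    for x, z in zip(X, Z):
        pi[z[0]] += 1                          # which state starts each sequence
        for t in range(1, len(z)):
            A[z[t-1], z[t]] += 1               # bigram counts over states
        for xt, zt in zip(x, z):
            B[zt, xt] += 1                     # how often each state emits each symbol
    return pi / pi.sum(), A / A.sum(1, keepdims=True), B / B.sum(1, keepdims=True)
```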
HMM: Inference tasks
Important to infer distributions over hidden states:
§  If states are interpretable, infer interpretations
§  Also essential for learning

Can break down hidden state inference into tasks to solve (each based on all observations up to the current time, X1:t)
1.  Filtering: compute posterior over current hidden state: P(Zt | X1:t)
2.  Prediction: compute posterior over future hidden state: P(Zt+k | X1:t)
3.  Smoothing: compute posterior over past hidden state: P(Zk | X1:t), 0 < k < t
4.  Fixed-lag smoothing: P(Zt-a | X1:t): compute posterior over hidden state a few steps back
Filtering, Smoothing & Prediction
P(Zt | X1:t) = P(Zt | Xt, X1:t−1)
             ∝ P(Xt | Zt, X1:t−1) P(Zt | X1:t−1)
             = P(Xt | Zt) P(Zt | X1:t−1)
             = P(Xt | Zt) ∑zt−1 P(Zt | zt−1, X1:t−1) P(zt−1 | X1:t−1)
             = P(Xt | Zt) ∑zt−1 P(Zt | zt−1) P(zt−1 | X1:t−1)

Filtering: for online estimation of state
Pr(state) ∝ observation probability × transition-model prediction

Smoothing: post hoc estimation of state (similar computation)

Prediction is filtering, but with no new evidence:
P(Zt+k | X1:t) = ∑zt+k−1 P(Zt+k | zt+k−1) P(zt+k−1 | X1:t)
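This recursion maps directly onto code: predict with the transition model, weight by the observation probability, renormalize. A sketch using (pi, A, B) arrays as in the earlier casino sketch:

```python
import numpy as np

def filter_step(belief, x_t, A, B):
    """One step of P(Zt | X1:t) from P(Zt-1 | X1:t-1)."""
    predicted = A.T @ belief             # sum_{z_{t-1}} P(Zt | z_{t-1}) P(z_{t-1} | X1:t-1)
    posterior = B[:, x_t] * predicted    # multiply by P(Xt | Zt)
    return posterior / posterior.sum()   # normalize away the proportionality constant

def filter_all(xs, pi, A, B):
    belief = pi * B[:, xs[0]]
    belief /= belief.sum()
    beliefs = [belief]
    for x_t in xs[1:]:
        belief = filter_step(belief, x_t, A, B)
        beliefs.append(belief)
    return np.array(beliefs)             # row t is P(Zt | X1:t)
```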
HMM: Maximum likelihood
Having observed some dataset, use ML to learn the parameters of the HMM

Need to marginalize over the latent variables:
p(X | θ) = ∑Z p(X, Z | θ)

Difficult:
–  does not factorize over time steps
–  involves generalization of a mixture model

Approach: utilize EM for learning

Focus first on how to do inference efficiently
Forward recursion (α)

Clever recursion can compute huge sum efficiently:
α(zt) = P(xt | zt) ∑zt−1 P(zt | zt−1) α(zt−1)

Backward recursion (β)
β(zt) = ∑zt+1 P(xt+1 | zt+1) P(zt+1 | zt) β(zt+1)

α(zt,j): total inflow of prob. to node (t,j)
β(zt,j): total outflow of prob. from node (t,j)

Forward-Backward algorithm
Estimate hidden state given observations

One forward pass to compute all α(zt,i), one backward pass to compute all β(zt,i): total cost O(K²T)
Can compute likelihood at any time t based on α(zt,j) and β(zt,j)
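A sketch of both passes with per-step rescaling to avoid numerical underflow; the scaled α and β still give the state posteriors, and the scale factors give the log-likelihood:

```python
import numpy as np

def forward_backward(xs, pi, A, B):
    """Posteriors P(Zt | X1:T) and log P(X1:T); K states, T steps: O(K^2 T)."""
    T, K = len(xs), len(pi)
    alpha = np.zeros((T, K)); beta = np.ones((T, K)); c = np.zeros(T)

    # Forward pass: alpha[t] proportional to P(zt, x1:t), rescaled by c[t]
    alpha[0] = pi * B[:, xs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, xs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]

    # Backward pass: beta[t] proportional to P(x_{t+1:T} | zt), same scale factors
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, xs[t+1]] * beta[t+1])) / c[t+1]

    gamma = alpha * beta                   # P(Zt | X1:T), rows sum to 1
    return gamma, np.log(c).sum()          # posteriors and log-likelihood
```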
Baum-Welch training algorithm: Summary

Can estimate HMM parameters using maximum likelihood
If state path known, then parameter estimation easy
Instead must estimate states, update parameters, re-estimate states, etc. -- Baum-Welch (form of EM)
State estimation via forward-backward, also need transition statistics (see next slide)
Update parameters (transition matrix A, emission parameters) to maximize likelihood
Transition statistics
Need statistics for adjacent time-steps:
Expected number of transitions from state i to state j that begin at time t-1, given the observations:
ξt(i,j) = P(Zt-1 = i, Zt = j | X1:T) ∝ α(zt-1,i) P(Zt = j | Zt-1 = i) P(xt | Zt = j) β(zt,j)

Can be computed with the same α(zt,j) and β(zt,j) recursions
Parameter updates
Initial state distribution: expected counts in state k at time 1:
πk = γ1(k), where γt(k) = P(Zt = k | X1:T)

Estimate transition probabilities:
Aij = ∑t ξt(i,j) / ∑t ∑j' ξt(i,j')

Emission probabilities are expected number of times observe symbol in particular state:
Bjk = ∑t γt(j) 1[xt = k] / ∑t γt(j)
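A sketch of one EM iteration on a single sequence, combining the scaled forward-backward pass, the transition statistics ξ, and the updates above:

```python
import numpy as np

def baum_welch_step(xs, pi, A, B):
    """One EM iteration of Baum-Welch on a single observation sequence (a sketch)."""
    T, K = len(xs), len(pi)

    # E-step: scaled forward-backward (same recursions as the earlier sketch)
    alpha = np.zeros((T, K)); beta = np.ones((T, K)); c = np.zeros(T)
    alpha[0] = pi * B[:, xs[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, xs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, xs[t+1]] * beta[t+1])) / c[t+1]

    gamma = alpha * beta                                 # P(Zt | X1:T)
    xi = np.zeros((T - 1, K, K))                         # expected i -> j transitions
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, xs[t+1]] * beta[t+1])[None, :] / c[t+1]

    # M-step: re-estimate parameters from expected counts
    new_pi = gamma[0]
    new_A  = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B  = np.zeros_like(B)
    for t in range(T):
        new_B[:, xs[t]] += gamma[t]
    new_B /= gamma.sum(axis=0)[:, None]

    return new_pi, new_A, new_B, np.log(c).sum()         # also returns log-likelihood
```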
Using HMMs for recognition
Can train an HMM to classify a sequence:
1. train a separate HMM per class
2. evaluate prob. of unlabelled sequence under each HMM
3. classify: HMM with highest likelihood

Assumes we can solve two problems:
1. estimate model parameters given some training sequences (we can find a local maximum in parameter space near the initial position)
2. given a model, can evaluate prob. of a sequence
Probability of observed sequence
Want to determine if a given observation sequence is likely under the model (for learning, or recognition)

Compute marginals to evaluate prob. of observed seq.: sum across all paths of joint prob. of observed outputs and states

Take advantage of factorization to avoid exponential cost (#paths = K^T); see the cross-check sketch below
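As a sanity check, the forward recursion can be compared against the brute-force sum over all K^T paths on a tiny example (assumed toy parameters):

```python
from itertools import product
import numpy as np

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.5],           # P(x | z) for a binary observation
               [0.1, 0.9]])
x  = [0, 1, 1, 0]
K, T = len(pi), len(x)

# Brute force: enumerate every state path (only feasible for tiny K and T)
brute = 0.0
for z in product(range(K), repeat=T):
    p = pi[z[0]] * B[z[0], x[0]]
    for t in range(1, T):
        p *= A[z[t-1], z[t]] * B[z[t], x[t]]
    brute += p

# Forward recursion gives the same number in O(K^2 T)
alpha = pi * B[:, x[0]]
for t in range(1, T):
    alpha = (alpha @ A) * B[:, x[t]]
print(brute, alpha.sum())            # these should match
```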
Variants on basic HMM
•  Input-output HMM
–  Have additional observed variables U

•  Semi-Markov HMM
–  Improve model of state duration

•  Autoregressive HMM
–  Allow observations to depend on some previous
observations directly

•  Factorial HMM
–  Expand dim. of latent state
State Space Models
Instead of discrete latent state of the HMM, model Z as a continuous latent variable
Standard formulation: linear-Gaussian (LDS), with (hidden state Z, observation Y, other variables U)
–  Transition model is linear
   zt = At zt−1 + Bt ut + εt
–  with Gaussian noise
   εt ~ N(0, Qt)
–  Observation model is linear
   yt = Ct zt + Dt ut + δt
–  with Gaussian noise
   δt ~ N(0, Rt)

Model parameters typically independent of time: stationary
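A minimal simulation sketch of a stationary linear-Gaussian model (one-dimensional state and observation, no control input u; the parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

A, C = 0.95, 1.0          # transition and observation matrices (scalars here)
Q, R = 0.1, 0.5           # process and observation noise variances

T = 100
z = np.zeros(T); y = np.zeros(T)
z[0] = rng.normal(0.0, 1.0)
y[0] = C * z[0] + rng.normal(0.0, np.sqrt(R))
for t in range(1, T):
    z[t] = A * z[t-1] + rng.normal(0.0, np.sqrt(Q))   # zt = A zt-1 + eps_t
    y[t] = C * z[t] + rng.normal(0.0, np.sqrt(R))     # yt = C zt + delta_t
```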
Kalman Filter
Algorithm for filtering in linear-Gaussian state space model
Everything is Gaussian, so can compute updates exactly

Dynamics update: predict next belief state
p(zt | y1:t−1, u1:t) = ∫ N(zt | At zt−1 + Bt ut, Qt) N(zt−1 | µt−1, Σt−1) dzt−1
                    = N(zt | µt|t−1, Σt|t−1)

µt|t−1 = At µt−1 + Bt ut
Σt|t−1 = At Σt−1 Atᵀ + Qt
Kalman Filter: Measurement Update
Key step: update hidden state given new measurement:
p(zt | y1:t, u1:t) ∝ p(yt | zt, ut) p(zt | y1:t−1, u1:t)

First term a bit complicated, but can apply various identities (such as the matrix inversion lemma, Bayes rule) to obtain:
p(zt | y1:t, u1:t) = N(zt | µt, Σt)

The mean update depends on the Kalman gain matrix Kt and the residual or innovation rt = yt − ŷt:
µt = µt|t−1 + Kt rt
ŷt = E[yt | y1:t−1, ut] = Ct µt|t−1 + Dt ut
Kt = Σt|t−1 Ctᵀ St⁻¹
St = cov[rt | y1:t−1, u1:t] = Ct Σt|t−1 Ctᵀ + Rt
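A sketch of one predict + measurement-update cycle implementing the equations above (no control input; the covariance update Σt = (I − Kt Ct) Σt|t−1 is the standard form, stated here as an assumption since it is not shown on the slide):

```python
import numpy as np

def kalman_step(mu, Sigma, y, A, C, Q, R):
    """One Kalman filter predict + update, matrix form, no control input."""
    # Dynamics update (predict)
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + Q

    # Measurement update
    y_hat = C @ mu_pred                          # predicted observation E[yt | y1:t-1]
    r = y - y_hat                                # residual / innovation
    S = C @ Sigma_pred @ C.T + R                 # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ r
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred   # standard covariance update
    return mu_new, Sigma_new
```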
Kalman Filter: Extensions
Learning similar to HMM
–  Need to solve inference problem – local posterior marginals
for latent variables
–  Use Kalman smoothing instead of forward-backward in E
step, re-derive updates in M step

Many extensions and elaborations


–  Non-linear models: extended KF, unscented KF
–  Non-Gaussian noise
–  More general posteriors (multi-modal, discrete, etc.)
–  Large systems with sparse structure (sparse information
filter)
Viterbi decoding
How to choose single best path through state space?
Choosing the state with largest probability at each time t maximizes the expected number of correct states
But this may not be the best path, the one with highest likelihood of generating the data

To find best path – Viterbi decoding, a form of dynamic programming (like the forward-backward algorithm)
Same recursions, but replace ∑ with max (“brace” example)
Forward: retain best path into each node at time t
Backward: retrace path back from state where most probable path ends
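A sketch of Viterbi decoding in log space: the same shape as the forward recursion with max in place of ∑, plus backpointers for the backward retrace:

```python
import numpy as np

def viterbi(xs, pi, A, B):
    """Most likely state path argmax_z P(z | x), computed in log space."""
    T, K = len(xs), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)

    delta = np.zeros((T, K)); back = np.zeros((T, K), dtype=int)
    delta[0] = logpi + logB[:, xs[0]]
    for t in range(1, T):                        # forward: keep best path into each state
        scores = delta[t-1][:, None] + logA      # scores[i, j]: best path ending with i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, xs[t]]

    path = [delta[-1].argmax()]                  # backward: retrace from the best end state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```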
