Estimation Theory
Roberto Togneri
1 Applications
Modern estimation theory can be found at the heart of many electronic signal
processing systems designed to extract information. These systems include:
• Radar, where the delay of the received pulse echo has to be estimated in the presence of noise
• Sonar, where the delay of the received signal from each sensor has to be estimated in the presence of noise
• Speech, where the parameters of the speech model have to be estimated in the presence of speech/speaker variability and environmental noise
• Image, where the position and orientation of an object from a camera image have to be estimated in the presence of lighting and background noise
• Biomedicine, where the heart rate of a fetus has to be estimated in the presence of sensor and environmental noise
2 Introduction
Define:
• p(x; θ) ≡ mathematical model (i.e. PDF) of the N-point data set, parametrized by θ.
The problem is to find a function of the N-point data set which provides an estimate of θ, that is:
θ̂ = g(x[0], x[1], . . . , x[N − 1]) = g(x)
2.2 EXAMPLE
Consider a fixed signal, A, embedded in a WGN (White Gaussian Noise) signal, w[n]:
x[n] = A + w[n] n = 0, 1, . . . , N − 1
where θ = A is the parameter to be estimated from the observed data, x[n].
Consider the sample-mean estimator function:
θ̂ = (1/N) Σ_{n=0}^{N−1} x[n]
Is the sample-mean an MVU estimator for A?
Unbiased?
E(θ̂) = E( (1/N) Σ x[n] ) = (1/N) Σ E(x[n]) = (1/N) Σ A = (1/N) N A = A
Minimum Variance?
var(θ̂) = var( (1/N) Σ x[n] ) = (1/N²) Σ var(x[n]) = (1/N²) Σ σ² = Nσ²/N² = σ²/N
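A quick Monte Carlo check of these two properties (the values of A, σ and N below are arbitrary choices, not from the text):

import numpy as np

# Monte Carlo check that the sample mean of x[n] = A + w[n] is unbiased
# and has variance sigma^2 / N (A, sigma, N are arbitrary example values).
rng = np.random.default_rng(0)
A, sigma, N, trials = 3.0, 2.0, 50, 100_000

x = A + sigma * rng.standard_normal((trials, N))     # each row is one data record
theta_hat = x.mean(axis=1)                           # sample-mean estimator per record

print("mean of estimates    :", theta_hat.mean())    # ~ A          (unbiased)
print("variance of estimates:", theta_hat.var())     # ~ sigma^2/N = 0.08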
The variance of any unbiased estimator θ̂ must be lower bounded by the CRLB (Cramér–Rao Lower Bound), with the variance of the MVU estimator attaining the CRLB. That is:
var(θ̂) ≥ 1 / ( −E[ ∂² ln p(x; θ) / ∂θ² ] )
and
var(θ̂MVU) = 1 / ( −E[ ∂² ln p(x; θ) / ∂θ² ] )
i.e. Cθ̂ − I⁻¹(θ) is positive semidefinite, where Cθ̂ = E[(θ̂ − E(θ̂))(θ̂ − E(θ̂))ᵀ] is the covariance matrix. The Fisher information matrix, I(θ), is given as:
[I(θ)]ij = −E[ ∂² ln p(x; θ) / ∂θi ∂θj ]
Furthermore, if for some p-dimensional function g and p × p matrix I(θ):
∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ)
then we can find the MVU estimator as θ̂MVU = g(x), and the minimum covariance is I⁻¹(θ).
3.1 EXAMPLE
Consider the case of a signal embedded in noise:
x[n] = A + w[n] n = 0, 1, . . . , N − 1
where w[n] is WGN with variance σ², and thus:
p(x; θ) = Π_{n=0}^{N−1} (1/√(2πσ²)) exp[ −(1/(2σ²)) (x[n] − θ)² ]
        = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − θ)² ]
∂ ln p(x; θ)/∂θ = (N/σ²) ( (1/N) Σ x[n] − θ ) = (N/σ²)(θ̂ − θ)
∂² ln p(x; θ)/∂θ² = −N/σ²
For an MVU estimator the lower bound has to apply, that is:
var(θ̂MVU) = 1 / ( −E[ ∂² ln p(x; θ) / ∂θ² ] ) = σ²/N
but we know from the previous example that var(θ̂) = σ²/N, and thus the sample mean is an MVU estimator. Alternatively we can show this by considering the first derivative:
∂ ln p(x; θ)/∂θ = (N/σ²) ( (1/N) Σ x[n] − θ ) = I(θ)(g(x) − θ)
where I(θ) = N/σ² and g(x) = (1/N) Σ x[n]. Thus the MVU estimator is indeed θ̂MVU = (1/N) Σ_{n=0}^{N−1} x[n], with minimum variance I⁻¹(θ) = σ²/N.
4 Linear Models
x = Hθ + w
where
x = N ×1 observation vector
H = N ×p observation matrix
θ = p×1 vector of parameters to be estimated
w = N ×1 noise vector with PDF N(0, σ 2 I)
then using the CRLB theorem θ̂ = g(x) will be an MVU estimator if:
∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ)
with Cθ̂ = I⁻¹(θ). So we need to factor:
∂ ln p(x; θ)/∂θ = ∂/∂θ [ −ln(2πσ²)^{N/2} − (1/(2σ²)) (x − Hθ)ᵀ(x − Hθ) ]
into the form I(θ)(g(x) − θ). When we do this the MVU estimator for θ is:
θ̂ = (HᵀH)⁻¹Hᵀx
4.1 EXAMPLES
4.1.1 Curve Fitting
Consider fitting the data, x(t), by a pth order polynomial function of t:
x(t) = θ0 + θ1 t + θ2 t² + · · · + θp tᵖ + w(t)
x = [x(t0), x(t1), x(t2), . . . , x(tN−1)]ᵀ
w = [w(t0), w(t1), w(t2), . . . , w(tN−1)]ᵀ
θ = [θ0, θ1, θ2, . . . , θp]ᵀ
so x = Hθ + w, where H is the N × (p + 1) matrix:
H = [ 1  t0    t0²    · · ·  t0ᵖ
      1  t1    t1²    · · ·  t1ᵖ
      ⋮   ⋮     ⋮             ⋮
      1  tN−1  tN−1²  · · ·  tN−1ᵖ ]
Hence the MVU estimate of the polynomial coefficients based on the N samples of data is:
θ̂ = (HᵀH)⁻¹Hᵀx
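A short numerical sketch of this estimator for a quadratic fit (the polynomial order, true coefficients and noise level below are arbitrary choices):

import numpy as np

# LSE/MVU estimate theta_hat = (H^T H)^{-1} H^T x for a quadratic (p = 2) fit.
rng = np.random.default_rng(1)
N, sigma = 100, 0.5
t = np.linspace(0.0, 1.0, N)
theta_true = np.array([1.0, -2.0, 3.0])              # theta_0 + theta_1 t + theta_2 t^2

H = np.vander(t, 3, increasing=True)                 # columns: 1, t, t^2
x = H @ theta_true + sigma * rng.standard_normal(N)

theta_hat = np.linalg.solve(H.T @ H, H.T @ x)        # (H^T H)^{-1} H^T x
print(theta_hat)                                     # close to theta_true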
4.1.2 Fourier Analysis
x[n] = Σ_{k=1}^{M} ak cos(2πkn/N) + Σ_{k=1}^{M} bk sin(2πkn/N) + w[n]
so x = Hθ + w, where:
x = [x[0], x[1], x[2], . . . , x[N − 1]]ᵀ
w = [w[0], w[1], w[2], . . . , w[N − 1]]ᵀ
θ = [a1, a2, . . . , aM, b1, b2, . . . , bM]ᵀ
where the columns of H are the vectors:
hkᵃ = [1, cos(2πk/N), cos(2πk(2)/N), . . . , cos(2πk(N − 1)/N)]ᵀ
hkᵇ = [0, sin(2πk/N), sin(2πk(2)/N), . . . , sin(2πk(N − 1)/N)]ᵀ
Hence the MVU estimate of the Fourier coefficients based on the N samples of data is:
θ̂ = (HᵀH)⁻¹Hᵀx
After carrying out the simplification the solution can be shown to be:
θ̂ = [ (2/N)(h1ᵃ)ᵀx  · · ·  (2/N)(hMᵃ)ᵀx  (2/N)(h1ᵇ)ᵀx  · · ·  (2/N)(hMᵇ)ᵀx ]ᵀ
which is none other than the standard solution found in signal processing textbooks, usually expressed directly as:
âk = (2/N) Σ_{n=0}^{N−1} x[n] cos(2πkn/N)
b̂k = (2/N) Σ_{n=0}^{N−1} x[n] sin(2πkn/N)
and Cθ̂ = (2σ²/N) I.
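The equivalence of the linear-model estimate and the classical formulas can be checked numerically; in this sketch the number of harmonics, the true coefficients and the noise level are arbitrary choices:

import numpy as np

# Compare theta_hat = (H^T H)^{-1} H^T x with the direct formulas
# a_k = (2/N) sum x[n] cos(2 pi k n / N), b_k = (2/N) sum x[n] sin(2 pi k n / N).
rng = np.random.default_rng(2)
N, M, sigma = 64, 3, 0.3
n = np.arange(N)
a_true, b_true = np.array([1.0, 0.5, -0.7]), np.array([0.2, -1.0, 0.4])

x = sum(a_true[k-1] * np.cos(2*np.pi*k*n/N) + b_true[k-1] * np.sin(2*np.pi*k*n/N)
        for k in range(1, M+1)) + sigma * rng.standard_normal(N)

H = np.column_stack([np.cos(2*np.pi*k*n/N) for k in range(1, M+1)] +
                    [np.sin(2*np.pi*k*n/N) for k in range(1, M+1)])
theta_hat = np.linalg.solve(H.T @ H, H.T @ x)

a_direct = (2.0/N) * np.array([np.sum(x*np.cos(2*np.pi*k*n/N)) for k in range(1, M+1)])
b_direct = (2.0/N) * np.array([np.sum(x*np.sin(2*np.pi*k*n/N)) for k in range(1, M+1)])
print(np.allclose(theta_hat, np.concatenate([a_direct, b_direct])))   # True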
4.1.3 System Identification
We assume N input samples, u[n], are used to yield N output samples, and our identification problem is the same as estimation of the linear model parameters for x = Hθ + w, where:
x = [x[0], x[1], x[2], . . . , x[N − 1]]ᵀ
w = [w[0], w[1], w[2], . . . , w[N − 1]]ᵀ
θ = [h[0], h[1], h[2], . . . , h[p − 1]]ᵀ
H = [ u[0]       0          · · ·  0
      u[1]       u[0]       · · ·  0
      ⋮           ⋮                  ⋮
      u[N − 1]   u[N − 2]   · · ·  u[N − p] ]
The MVU estimate of the system model coefficients is given by:
θ̂ = (HᵀH)⁻¹Hᵀx
where Cθ̂ = σ²(HᵀH)⁻¹. Since H is a function of u[n] we would like to choose u[n] to achieve minimum variance. It can be shown that the signal we need is a pseudorandom noise (PRN) sequence, which has the property that the autocorrelation function is zero for k ≠ 0, that is:
ruu[k] = (1/N) Σ_{n=0}^{N−1−k} u[n]u[n + k] = 0   for k ≠ 0
Then, defining the cross-correlation
rux[k] = (1/N) Σ_{n=0}^{N−1−k} u[n]x[n + k]
the MVU estimate reduces to:
ĥ[i] = rux[i] / ruu[0]
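A sketch of this identification procedure (the FIR coefficients, the use of a random ±1 sequence as the PRN input, and the noise level are illustrative assumptions):

import numpy as np

# Identify an FIR system h[0..p-1] from input u[n] and noisy output x[n]
# using the correlation estimate h_hat[i] = r_ux[i] / r_uu[0].
rng = np.random.default_rng(3)
N, sigma = 4096, 0.1
h_true = np.array([1.0, 0.5, -0.3, 0.1])             # p = 4 unknown coefficients
p = len(h_true)

u = rng.choice([-1.0, 1.0], size=N)                  # pseudorandom +/-1 probing sequence
x = np.convolve(u, h_true)[:N] + sigma * rng.standard_normal(N)

r_uu0 = np.dot(u, u) / N                             # r_uu[0] (~1 for a +/-1 sequence)
r_ux = np.array([np.dot(u[:N-k], x[k:]) / N for k in range(p)])   # r_ux[k]
print(r_ux / r_uu0)                                  # close to h_true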
1. The noise vector, w, is no longer white and has PDF N(0, C) (i.e. general
Gaussian noise)
2. The observed data vector, x, also includes the contribution of known signal
components, s.
Thus the general linear model for the observed data is expressed as:
x = Hθ + s + w
where:
Our solution for the simple linear model where the noise is assumed white can
be used after applying a suitable whitening transformation. If we factor the
noise covariance matrix as:
C⁻¹ = DᵀD
then the matrix D is the required transformation since:
E[wwᵀ] = C  ⇒  E[(Dw)(Dw)ᵀ] = DCDᵀ = (DD⁻¹)(D⁻ᵀDᵀ) = I
that is, w′ = Dw has PDF N(0, I). Thus by transforming the general linear model:
x = Hθ + s + w
to:
x′ = Dx = DHθ + Ds + Dw
x′ = H′θ + s′ + w′
or:
x″ = x′ − s′ = H′θ + w′
we can then write the MVU estimator of θ given the observed data x″ as:
θ̂ = (H′ᵀH′)⁻¹H′ᵀx″ = (HᵀDᵀDH)⁻¹HᵀDᵀD(x − s)
That is:
θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹(x − s)
and the covariance matrix is Cθ̂ = (HᵀC⁻¹H)⁻¹.
By the Neyman–Fisher Factorization theorem, if the PDF can be factored as:
p(x; θ) = g(T(x), θ) h(x)
where g(T(x), θ) is a function of T(x) and θ only and h(x) is a function of x only, then T(x) is a sufficient statistic for θ. Conceptually one expects that the PDF after the sufficient statistic has been observed, p(x|T(x) = T0; θ), should not depend on θ since T(x) is sufficient for the estimation of θ and no more knowledge can be gained about θ once we know T(x).
5.1.1 EXAMPLE
Consider a signal embedded in a WGN signal:
x[n] = A + w[n]
Then:
p(x; θ) = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − θ)² ]
where θ = A is the unknown parameter we want to estimate. We factor as follows:
p(x; θ) = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) ( Nθ² − 2θ Σ_{n=0}^{N−1} x[n] ) ] · exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} x²[n] ]
        = g(T(x), θ) · h(x)
where we define T(x) = Σ_{n=0}^{N−1} x[n], which is a sufficient statistic for θ.
The MVU estimator can then be found in one of two ways:
1. Let θ̆ be any unbiased estimator of θ. Then θ̂ = E(θ̆|T(x)) = ∫ θ̆ p(θ̆|T(x)) dθ̆ is the MVU estimator.
2. Alternatively, find the function g such that θ̂ = g(T(x)) is unbiased, i.e. E[g(T(x))] = θ; if T(x) is a complete sufficient statistic then θ̂ is the MVU estimator.
The sufficient statistic, T(x), is complete if there is only one function g(T(x)) that is unbiased. That is, if h(T(x)) is another unbiased estimator (i.e. E[h(T(x))] = θ) then we must have that g = h if T(x) is complete.
5.2.1 EXAMPLE
Consider the previous example of a signal embedded in a WGN signal:
x[n] = A + w[n]
where we derived the sufficient statistic, T(x) = Σ_{n=0}^{N−1} x[n]. Using method 2 we need to find a function g such that E[g(T(x))] = θ = A. Now:
E[T(x)] = E[ Σ_{n=0}^{N−1} x[n] ] = Σ_{n=0}^{N−1} E[x[n]] = Nθ
It is obvious that:
E[ (1/N) Σ_{n=0}^{N−1} x[n] ] = θ
and thus θ̂ = g(T(x)) = (1/N) Σ_{n=0}^{N−1} x[n], which is the sample mean we have already seen before, is the MVU estimator for θ.
6 Best Linear Unbiased Estimators (BLUE)
It may occur that the MVU estimator or a sufficient statistic cannot be found or, indeed, the PDF of the data is itself unknown (only the second-order statistics are known). In such cases one solution is to assume a functional model of the estimator, as being linear in the data, and find the linear estimator which is both unbiased and has minimum variance, i.e. the BLUE.
For the general vector case we want our estimator to be a linear function of the data, that is:
θ̂ = Ax
Our first requirement is that the estimator be unbiased, that is:
E(θ̂) = AE(x) = θ
E(x) = Hθ
The resulting BLUE is θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x, where Cθ̂ = (HᵀC⁻¹H)⁻¹. The form of the BLUE is identical to the MVU estimator for the general linear model. The crucial difference is that the BLUE
does not make any assumptions on the PDF of the data (or noise) whereas the
MVU estimator was derived assuming Gaussian noise. Of course, if the data is
truly Gaussian then the BLUE is also the MVU estimator. The BLUE for the
general linear model can be stated as follows:
x = Hθ + w
6.1 EXAMPLE
Consider a signal embedded in noise:
x[n] = A + w[n]
where w[n] is of unspecified PDF with var(w[n]) = σn² and the unknown parameter θ = A is to be estimated. We assume a BLUE estimate and we derive H by noting:
E[x] = 1θ
where x = [x[0], x[1], x[2], . . . , x[N − 1]]ᵀ, 1 = [1, 1, 1, . . . , 1]ᵀ and we have H ≡ 1. Also:
C = diag(σ0², σ1², . . . , σN−1²)   ⇒   C⁻¹ = diag(1/σ0², 1/σ1², . . . , 1/σN−1²)
θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x = (1ᵀC⁻¹x)/(1ᵀC⁻¹1) = ( Σ_{n=0}^{N−1} x[n]/σn² ) / ( Σ_{n=0}^{N−1} 1/σn² )
Cθ̂ = var(θ̂) = 1/(1ᵀC⁻¹1) = 1 / ( Σ_{n=0}^{N−1} 1/σn² )
and we note that in the case of white noise, where σn² = σ², we get the sample mean:
θ̂ = (1/N) Σ_{n=0}^{N−1} x[n]
and minimum variance var(θ̂) = σ²/N.
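A short numerical sketch of this BLUE (the particular noise variances σn² are drawn arbitrarily for illustration):

import numpy as np

# BLUE of A from x[n] = A + w[n] with independent noise of differing variances:
# theta_hat = sum(x[n]/sigma_n^2) / sum(1/sigma_n^2), var = 1 / sum(1/sigma_n^2).
rng = np.random.default_rng(4)
A, N = 5.0, 200
sigma_n = rng.uniform(0.5, 3.0, size=N)              # known, unequal noise std devs

x = A + sigma_n * rng.standard_normal(N)
weights = 1.0 / sigma_n**2                           # weights 1/sigma_n^2

theta_blue = np.sum(weights * x) / np.sum(weights)
print("BLUE estimate    :", theta_blue)
print("predicted var    :", 1.0 / np.sum(weights))
print("plain sample mean:", x.mean())                # unbiased but higher variance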
7 Maximum Likelihood Estimation
The MLE is the value of θ that maximises the likelihood function, that is:
θ̂ = arg maxθ p(x; θ)
where x is the vector of observed data (of N samples). It can be shown that θ̂ is asymptotically unbiased:
lim_{N→∞} E(θ̂) = θ
and asymptotically efficient:
lim_{N→∞} var(θ̂) = CRLB
7.1.1 EXAMPLE
Consider the signal embedded in noise problem:
x[n] = A + w[n]
where w[n] is WGN with zero mean but unknown variance which is also A, that
is the unknown parameter, θ = A, manifests itself both as the unknown signal
and the variance of the noise. Although a highly unlikely scenario, this simple
example demonstrates the power of the MLE approach since finding the MVU
estimator by the previous procedures is not easy. Consider the PDF:
p(x; θ) = (1/(2πθ)^{N/2}) exp[ −(1/(2θ)) Σ_{n=0}^{N−1} (x[n] − θ)² ]
We consider p(x; θ) as a function of θ, thus it is a likelihood function, and we need to maximise it with respect to θ. For Gaussian PDFs it is easier to find the maximum of the log-likelihood function:
ln p(x; θ) = ln[ 1/(2πθ)^{N/2} ] − (1/(2θ)) Σ_{n=0}^{N−1} (x[n] − θ)²
Differentiating we have:
∂ ln p(x; θ)/∂θ = −N/(2θ) + (1/θ) Σ_{n=0}^{N−1} (x[n] − θ) + (1/(2θ²)) Σ_{n=0}^{N−1} (x[n] − θ)²
and setting the derivative to zero and solving for θ, produces the MLE estimate:
θ̂ = −1/2 + √( (1/N) Σ_{n=0}^{N−1} x²[n] + 1/4 )
The estimate is asymptotically unbiased:
lim_{N→∞} E(θ̂) = θ
and:
lim_{N→∞} var(θ̂) = CRLB = θ² / ( N(θ + 1/2) )
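A quick Monte Carlo check of these asymptotic properties (θ, N and the trial count are arbitrary choices):

import numpy as np

# MLE theta_hat = -1/2 + sqrt(mean(x^2) + 1/4) for x[n] ~ N(theta, theta);
# compare its variance against the CRLB theta^2 / (N (theta + 1/2)).
rng = np.random.default_rng(5)
theta, N, trials = 2.0, 1000, 20_000

x = theta + np.sqrt(theta) * rng.standard_normal((trials, N))
theta_hat = -0.5 + np.sqrt((x**2).mean(axis=1) + 0.25)

crlb = theta**2 / (N * (theta + 0.5))
print("mean of MLE:", theta_hat.mean())              # ~ theta (asymptotically unbiased)
print("var of MLE :", theta_hat.var())               # ~ CRLB
print("CRLB       :", crlb)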
7.1.2 EXAMPLE
Consider a signal embedded in noise:
x[n] = A + w[n]
where w[n] is WGN with zero mean and known variance σ 2 . We know the MVU
estimator for θ is the sample mean. To see that this is also the MLE, we consider
the PDF:
p(x; θ) = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − θ)² ]
and maximise the log-likelihood function by setting its derivative to zero:
∂ ln p(x; θ)/∂θ = (1/σ²) Σ_{n=0}^{N−1} (x[n] − θ) = 0  ⇒  Σ_{n=0}^{N−1} x[n] − Nθ = 0
thus θ̂ = (1/N) Σ_{n=0}^{N−1} x[n], which is the sample mean.
7.2 MLE for Transformed Parameters
The MLE of the transformed parameter, α = g(θ), is given by:
α̂ = g(θ̂)
For the general linear model x = Hθ + w with noise covariance C, setting the derivative of the log-likelihood to zero gives:
HᵀC⁻¹(x − Hθ) = 0
which is solved by the MLE θ̂ = (HᵀC⁻¹H)⁻¹HᵀC⁻¹x.
7.4 EM Algorithm
We wish to use the MLE procedure to find an estimate for the unknown parameter θ, which requires maximisation of the log-likelihood function, ln px(x; θ).
However we may find that this is either too difficult or there are difficulties in finding an expression for the PDF itself. In such circumstances where direct expression and maximisation of the PDF in terms of the observed data x is
difficult or intractable, an iterative solution is possible if another data set, y, can be found such that the PDF in terms of the y data set is much easier to express in closed form and maximise. We term the data set y the complete data and the original data x the incomplete data. In general we can find a mapping from the complete to the incomplete data:
x = g(y)
7.4.1 EXAMPLE
Consider spectral analysis where the observed signal, x[n], is composed of a summation of harmonic components with unknown frequencies, embedded in noise:
x[n] = Σ_{i=1}^{p} cos(2πfi n) + w[n],   n = 0, 1, . . . , N − 1
where w[n] is WGN with known variance σ² and the unknown parameter vector to be estimated is the group of frequencies: θ = f = [f1 f2 . . . fp]ᵀ. The standard MLE would require maximisation of the log-likelihood of a multivariate Gaussian distribution, which is equivalent to minimising the argument of the exponential:
J(f) = Σ_{n=0}^{N−1} ( x[n] − Σ_{i=1}^{p} cos(2πfi n) )²
This is a difficult p-dimensional minimisation problem. If, instead, we could observe each harmonic component separately:
yi[n] = cos(2πfi n) + wi[n],   i = 1, 2, . . . , p,   n = 0, 1, . . . , N − 1
where wi[n] is WGN with known variance σi², then the MLE procedure would result in minimisation of:
J(fi) = Σ_{n=0}^{N−1} ( yi[n] − cos(2πfi n) )²,   i = 1, 2, . . . , p
which are p independent one-dimensional minimisation problems (easy!). Thus
y is the complete data set that we are looking for to facilitate the MLE proce-
dure. However we do not have access to y. The relationship to the known data
x is:
x = g(y)  ⇒  x[n] = Σ_{i=1}^{p} yi[n],   n = 0, 1, . . . , N − 1
with:
w[n] = Σ_{i=1}^{p} wi[n]
σ² = Σ_{i=1}^{p} σi²
♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦
Once we have found the complete data set, y, even though an expression for ln py(y; θ) can now easily be derived, we cannot directly maximise ln py(y; θ) with respect to θ since y is unavailable. However we know x, and if we further assume that we have a good guess estimate for θ, then we can consider the expected value of ln py(y; θ) conditioned on what we know:
E[ln py(y; θ)|x; θ] = ∫ ln py(y; θ) p(y|x; θ) dy
and we attempt to maximise this expectation to yield, not the MLE, but the next best-guess estimate for θ.
What we do is iterate through both an E-step (find the expression for the Expectation) and an M-step (Maximisation of the expectation), hence the name EM algorithm. Specifically:
Convergence of the EM algorithm is guaranteed (under mild conditions) in the
sense that the average log-likelihood of the complete data does not decrease at
each iteration, that is:
Q(θ, θk+1) ≥ Q(θ, θk)
with equality when θk is the MLE. The three main attributes of the EM algo-
rithm are:
1. An initial value for the unknown parameter is needed and as with most
iterative procedures a good initial estimate is required for good conver-
gence
7.4.2 EXAMPLE
Applying the EM algorithm to the previous example requires finding a closed expression for the average log-likelihood.
ln py(y; θ) ≈ h(y) + Σ_{i=1}^{p} (1/σi²) Σ_{n=0}^{N−1} yi[n] cos(2πfi n)
           ≈ h(y) + Σ_{i=1}^{p} (1/σi²) ciᵀyi
where the terms in h(y) do not depend on θ, ci = [1, cos 2πfi, cos 2πfi(2), . . . , cos 2πfi(N − 1)]ᵀ and yi = [yi[0], yi[1], yi[2], . . . , yi[N − 1]]ᵀ. We write the conditional expectation as:
We note that E(yi|x; θk) can be thought of as an estimate of the yi[n] data set given the observed data set x[n] and current estimate θk. Since y is Gaussian
then x is a sum of Gaussians and thus x and y are jointly Gaussian and one of
the standard results is:
ŷi = E(yi|x; θk) = ci + (σi²/σ²) ( x − Σ_{j=1}^{p} cj )
and
ŷi[n] = cos(2πfiᵏ n) + (σi²/σ²) ( x[n] − Σ_{j=1}^{p} cos(2πfjᵏ n) )
Thus:
Q′(θ, θk) = Σ_{i=1}^{p} ciᵀŷi
where:
σ² = Σ_{i=1}^{p} σi²   ⇒   Σ_{i=1}^{p} σi²/σ² = 1
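A heavily simplified numerical sketch of the resulting EM iteration. The choices σi² = σ²/p, a grid search in place of the exact one-dimensional maximisations of ciᵀŷi, and initial frequency guesses close to the true values (a good initial estimate is required, as noted earlier) are all assumptions of this sketch:

import numpy as np

# Minimal EM sketch for x[n] = sum_i cos(2 pi f_i n) + w[n].
# Assumptions: sigma_i^2 = sigma^2 / p, a grid search replaces the exact 1-D
# maximisation in the M-step, and the initial guesses are near the true values.
rng = np.random.default_rng(6)
N, sigma = 200, 0.5
f_true = np.array([0.11, 0.26])                      # unknown frequencies (p = 2)
p = len(f_true)
n = np.arange(N)

x = np.sum(np.cos(2*np.pi*np.outer(f_true, n)), axis=0) + sigma*rng.standard_normal(N)

f = np.array([0.108, 0.263])                         # initial guesses (assumed close)
grid = np.linspace(0.01, 0.49, 2000)
C_grid = np.cos(2*np.pi*np.outer(grid, n))           # candidate c_i(f) vectors

for _ in range(30):
    C = np.cos(2*np.pi*np.outer(f, n))               # current c_i, rows i = 1..p
    resid = x - C.sum(axis=0)
    y_hat = C + (1.0/p) * resid                      # E-step with sigma_i^2/sigma^2 = 1/p
    f = grid[np.argmax(C_grid @ y_hat.T, axis=0)]    # M-step: maximise c_i(f)^T y_hat_i

print(f)                                             # close to f_true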
8 Least Squares Estimation
In the least squares approach we assume a signal model, s[n] = s(n; θ), where s(n; θ) is a function of n parameterised by θ. Due to noise and model inaccuracies, w[n], the signal s[n] can only be observed as:
x[n] = s[n] + w[n]
The LSE of θ is the value for which the error criterion:
J(θ) = Σ_{n=0}^{N−1} (x[n] − s[n])²
is minimised over the N observation samples of interest and we call this the LSE
of θ. More precisely we have:
Jmin = J(θ̂)
A problem that arises from assuming a signal model function s(n; θ) rather than
knowledge of p(x; θ) is the need to choose an appropriate signal model. Then
again, in order to obtain a closed form or parametric expression for p(x; θ)
one usually needs to know what the underlying model and noise characteristics
are anyway.
8.1.1 EXAMPLE
Consider observations, x[n], arising from a DC-level signal model, s[n] = s(n; θ) =
θ:
x[n] = θ + w[n]
where θ is the unknown parameter to be estimated. Then we have:
J(θ) = Σ_{n=0}^{N−1} (x[n] − s[n])² = Σ_{n=0}^{N−1} (x[n] − θ)²
Then differentiating with respect to θ and setting to zero:
∂J(θ)/∂θ |_{θ=θ̂} = 0  ⇒  −2 Σ_{n=0}^{N−1} (x[n] − θ̂) = 0  ⇒  Σ_{n=0}^{N−1} x[n] − Nθ̂ = 0
and hence θ̂ = (1/N) Σ_{n=0}^{N−1} x[n], which is the sample mean. We also have that:
Jmin = J(θ̂) = Σ_{n=0}^{N−1} ( x[n] − (1/N) Σ_{m=0}^{N−1} x[m] )²
For the linear signal model:
s = Hθ
the LSE criterion becomes:
J(θ) = Σ_{n=0}^{N−1} (x[n] − s[n])² = (x − Hθ)ᵀ(x − Hθ)
which is minimised by θ̂ = (HᵀH)⁻¹Hᵀx.
8.3 Order-Recursive Least Squares
In many cases the signal model is unknown and must be assumed. Obviously
we would like to choose the model, s(θ), that minimises Jmin.
We can do this arbitrarily by simply choosing models, obtaining the LSE θ̂, and
then selecting the model which provides the smallest Jmin . However, models
are not arbitrary and some models are more complex (or more precisely have
a larger number of parameters or degrees of freedom) than others. The more
complex a model the lower the Jmin one can expect, but also the more likely the model is to overfit the data or be overtrained (i.e. fit the noise and not generalise to other data sets).
8.3.1 EXAMPLE
Consider the case of line fitting where we have observations x(t) plotted against the sample time index t and we would like to fit the best line to the data. But what line do we fit: s(t; θ) = θ1, a constant? s(t; θ) = θ1 + θ2 t, a straight line? s(t; θ) = θ1 + θ2 t + θ3 t², a quadratic? etc. Each case represents an increase in the order of the model (i.e. order of the polynomial fit), or the number of parameters to be estimated, and a consequent increase in the modelling power. A polynomial fit represents a linear model, s = Hθ, where s = [s(0), s(1), . . . , s(N − 1)]ᵀ and:
Constant: θ = [θ1]ᵀ and H = [1 1 · · · 1]ᵀ is an N × 1 matrix.
Linear: θ = [θ1, θ2]ᵀ and H is the N × 2 matrix with rows [1  n], n = 0, 1, . . . , N − 1.
Quadratic: θ = [θ1, θ2, θ3]ᵀ and H is the N × 3 matrix with rows [1  n  n²], n = 0, 1, . . . , N − 1.
and so on. If the underlying model is indeed a straight line then we would expect not only that the minimum Jmin results with a straight-line model but also that higher order polynomial models (e.g. quadratic, cubic, etc.) will yield essentially the same Jmin (indeed higher-order models would degenerate to a straight line model, except in cases of overfitting). Thus the straight line model is the best model to use.
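A short numerical sketch of this order-comparison idea (the straight-line parameters, noise level and orders tried are arbitrary choices for illustration):

import numpy as np

# Fit polynomial models of increasing order to straight-line data and
# compare J_min = (x - H theta_hat)^T (x - H theta_hat) for each order.
rng = np.random.default_rng(7)
N, sigma = 50, 1.0
t = np.arange(N, dtype=float)
x = 2.0 + 0.5 * t + sigma * rng.standard_normal(N)   # true model is a straight line

for order in range(4):                               # constant, linear, quadratic, cubic
    H = np.vander(t, order + 1, increasing=True)     # columns 1, t, ..., t^order
    theta_hat = np.linalg.lstsq(H, x, rcond=None)[0] # LSE (H^T H)^{-1} H^T x
    J_min = np.sum((x - H @ theta_hat) ** 2)
    print(f"order {order}: J_min = {J_min:.1f}")
# J_min drops sharply from order 0 to 1, then changes little for higher orders.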
♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦
In order-recursive least squares the LSE for a model of order k + 1 is computed recursively from the solution for order k:
θ̂k+1 = θ̂k + UPDATEk
However the success of this approach depends on proper formulation of the linear models in order to facilitate the derivation of the recursive update. For example, if H has orthonormal column vectors then the LSE is equivalent to projecting the observation x onto the space spanned by the orthonormal column vectors of H. Since increasing the order implies increasing the dimensionality of the space by just adding another column to H, this allows a recursive update relationship to be derived.
8.4 Sequential Least Squares
In sequential least squares the estimate is updated as each new data sample arrives:
θ̂[n] = θ̂[n − 1] + K[n](x[n] − s[n|n − 1])
where s[n|n − 1] ≡ s(n; θ̂[n − 1]). The K[n] is the correction gain and (x[n] − s[n|n − 1]) is the prediction error. The magnitude of the correction gain K[n] is
usually directly related to the value of the estimator error variance, var(θ̂[n − 1]), with a larger variance yielding a larger correction gain. This behaviour is reasonable since a larger variance implies a poorly estimated parameter (which should have minimum variance) and a larger correction gain is expected. Thus one expects the variance to decrease with more samples and the estimated parameter to converge to the true value (or the LSE with an infinite number of samples).
8.4.1 EXAMPLE
We consider the specific case of a linear model with a vector parameter:
s[n] = H[n]θ
where hᵀ[n] is the nth row vector of the n × p matrix H[n]. It should be obvious that:
s[n − 1] = H[n − 1]θ
We also have that s[n|n − 1] = hᵀ[n]θ̂[n − 1]. So the estimator update is:
θ̂[n] = θ̂[n − 1] + K[n](x[n] − hᵀ[n]θ̂[n − 1])
Let Σ[n] = Cθ̂[n] be the covariance matrix of θ̂ based on n samples of data, and then it can be shown that:
K[n] = Σ[n − 1]h[n] / ( σn² + hᵀ[n]Σ[n − 1]h[n] )
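A minimal sketch of this sequential update for the simplest case, a DC level with h[n] = [1]. The covariance recursion Σ[n] = (I − K[n]hᵀ[n])Σ[n − 1] used below is the standard companion to the gain formula and, like the initial values, is an assumption of the sketch:

import numpy as np

# Sequential LSE of a DC level (h[n] = [1], scalar theta):
# theta_hat[n] = theta_hat[n-1] + K[n] (x[n] - h^T[n] theta_hat[n-1]).
rng = np.random.default_rng(8)
theta_true, sigma, N = 4.0, 1.0, 500

theta_hat = np.array([0.0])                          # initial estimate (assumed)
Sigma = np.array([[100.0]])                          # large initial uncertainty (assumed)
h = np.array([1.0])

for n in range(N):
    x_n = theta_true + sigma * rng.standard_normal()
    K = Sigma @ h / (sigma**2 + h @ Sigma @ h)       # gain K[n]
    theta_hat = theta_hat + K * (x_n - h @ theta_hat)
    Sigma = (np.eye(1) - np.outer(K, h)) @ Sigma     # covariance update (assumed form)

print(theta_hat[0])                                  # close to theta_true (~ sample mean)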
8.5 Constrained Least Squares
Assume that in the vector parameter LSE problem we are aware of constraints
on the individual parameters, that is the p-dimensional θ is subject to r < p inde-
pendent linear constraints. The constraints can be summarised by the condition
that θ satisfy the following system of linear constraint equations:
Aθ = b
Then using the technique of Lagrangian multipliers our LSE problem is that of minimising the following Lagrangian error criterion:
Jc = (x − Hθ)ᵀ(x − Hθ) + λᵀ(Aθ − b)
Let θ̂ be the unconstrained LSE of the linear model; then the expression for the constrained estimate θ̂c is:
θ̂c = θ̂ − (HᵀH)⁻¹Aᵀ[ A(HᵀH)⁻¹Aᵀ ]⁻¹(Aθ̂ − b)
When the signal model s(θ) is a nonlinear function of θ, minimisation of the error criterion:
J = (x − s(θ))ᵀ(x − s(θ))
becomes much more difficult. Differentiating with respect to θ and setting to zero yields:
∂J/∂θ = 0  ⇒  ( ∂s(θ)ᵀ/∂θ )( x − s(θ) ) = 0
which requires the solution of p nonlinear simultaneous equations. Approximate
solutions based on linearization of the problem exist which require iteration until
convergence.
The Newton–Raphson method linearises the function g(θ) = ∂J(θ)/∂θ about the initialised value θk and directly solves for the zero of the function to produce the next estimate:
θk+1 = θk − [ ∂g(θ)/∂θ ]⁻¹ g(θ) |_{θ=θk}
Of course if the function was linear the next estimate would be the correct value,
but since the function is nonlinear this will not be the case and the procedure
is iterated until the estimates converge.
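A minimal sketch of the Newton–Raphson iteration applied to a nonlinear LSE. The exponential signal model s(n; θ) = exp(−θ tn), the initial value and the use of numerical derivatives for g(θ) and its slope are assumptions of this sketch, not from the text:

import numpy as np

# Newton-Raphson search for the zero of g(theta) = dJ/dtheta, where
# J(theta) = sum_n (x[n] - exp(-theta*t_n))^2 (an assumed example model).
rng = np.random.default_rng(9)
N, sigma, theta_true = 50, 0.05, 2.0
t = np.arange(N) / N
x = np.exp(-theta_true * t) + sigma * rng.standard_normal(N)

def J(theta):
    return np.sum((x - np.exp(-theta * t)) ** 2)

def g(theta, eps=1e-5):                              # numerical dJ/dtheta
    return (J(theta + eps) - J(theta - eps)) / (2 * eps)

theta = 1.0                                          # initial value theta_0 (assumed)
for _ in range(10):
    dg = (g(theta + 1e-5) - g(theta - 1e-5)) / 2e-5  # numerical dg/dtheta
    theta = theta - g(theta) / dg                    # theta_{k+1} = theta_k - g/g'

print(theta)                                         # close to theta_true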
Based on the standard solution to the LSE of the linear model, we have that:
θk+1 = θk + ( Hᵀ(θk)H(θk) )⁻¹ Hᵀ(θk)( x − s(θk) )
where H(θ) = ∂s(θ)/∂θ is the N × p Jacobian of the signal model evaluated at the current estimate.
9 Method of Moments
Although we may not have an expression for the PDF, we assume that we can use the natural estimator for the kth moment, µk = E(xᵏ[n]), that is:
µ̂k = (1/N) Σ_{n=0}^{N−1} xᵏ[n]
If the moment can be written as a function of the unknown parameter, µk = h(θ), then the method of moments estimate is found by inverting h:
θ̂ = h⁻¹(µ̂k) = h⁻¹( (1/N) Σ_{n=0}^{N−1} xᵏ[n] )
If θ is a p-dimensional vector then we require p equations to solve for the p unknowns. That is, we need some set of p moment equations. Using the lowest order p moments what we would like is:
θ̂ = h⁻¹(µ̂)
where:
µ̂ = [ (1/N) Σ_{n=0}^{N−1} x[n],  (1/N) Σ_{n=0}^{N−1} x²[n],  . . . ,  (1/N) Σ_{n=0}^{N−1} xᵖ[n] ]ᵀ
9.1 EXAMPLE
Consider a 2-mixture Gaussian PDF:
p(x; θ) = (1 − θ)g1(x) + θg2(x)
where g1 = N(µ1, σ1²) and g2 = N(µ2, σ2²) are two different Gaussian PDFs and θ is the unknown parameter that has to be estimated. We can write the second moment as a function of θ as follows:
µ2 = E(x²[n]) = ∫ x² p(x; θ) dx = (1 − θ)σ1² + θσ2² = h(θ)
and hence:
θ̂ = (µ̂2 − σ1²)/(σ2² − σ1²) = ( (1/N) Σ_{n=0}^{N−1} x²[n] − σ1² ) / ( σ2² − σ1² )
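A quick numerical check of this estimator (a sketch that assumes zero-mean mixture components so that the second-moment expression above applies; all numbers are arbitrary):

import numpy as np

# Method-of-moments estimate of the mixing parameter theta in
# p(x) = (1 - theta) g1 + theta g2, assuming zero-mean components so that
# E(x^2) = (1 - theta) sigma1^2 + theta sigma2^2.
rng = np.random.default_rng(10)
N, theta_true = 100_000, 0.3
sigma1, sigma2 = 1.0, 3.0

labels = rng.random(N) < theta_true                  # True -> draw from g2
x = np.where(labels, sigma2, sigma1) * rng.standard_normal(N)

mu2_hat = np.mean(x**2)
theta_hat = (mu2_hat - sigma1**2) / (sigma2**2 - sigma1**2)
print(theta_hat)                                     # close to theta_true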
10 Bayesian Philosophy
In the classic approach we derived the MVU estimator by first considering minimisation of the mean square error, i.e. θ̂ = arg minθ̂ mse(θ̂), where:
mse(θ̂) = E[(θ̂ − θ)²] = ∫ (θ̂ − θ)² p(x; θ) dx
In the Bayesian approach we instead minimise:
Bmse(θ̂) = E[(θ − θ̂)²] = ∫∫ (θ − θ̂)² p(x, θ) dx dθ
which is the Bayesian mse, where p(x, θ) is the joint pdf of x and θ (since θ is now a random variable). It should be noted that the Bayesian squared error (θ − θ̂)² and classic squared error (θ̂ − θ)² are the same. The minimum Bmse(θ̂) estimator, or MMSE, is derived by differentiating the expression for Bmse(θ̂) with respect to θ̂ and setting this to zero to yield:
θ̂ = E(θ|x) = ∫ θ p(θ|x) dθ
10.1.1 EXAMPLE
Consider signal embedded in noise:
x[n] = A + w[n]
where as before w[n] = N(0, σ 2 ) is a WGN process and the unknown parameter
θ =A is to be estimated. However in the Bayesian approach we also assume
the parameter A is a random variable with a prior pdf, which in this case is the Gaussian pdf p(A) = N(µA, σA²). We also have that p(x|A) = N(A, σ²) and we can assume that x and A are jointly Gaussian. Thus the posterior pdf:
p(A|x) = p(x|A)p(A) / ∫ p(x|A)p(A) dA = N(µA|x, σ²A|x)
is also a Gaussian pdf, and after the required simplification we have that:
σ²A|x = 1 / ( N/σ² + 1/σA² )   and   µA|x = ( (N/σ²)x̄ + µA/σA² ) σ²A|x = α x̄ + (1 − α) µA
where α = σA² / (σA² + σ²/N). Upon closer examination of the MMSE Â = µA|x we observe the following (depending on how σA² compares with σ²/N):
1. With few data (N is small) σA² ≪ σ²/N and Â → µA, that is, the MMSE tends towards the mean of the prior pdf (and effectively ignores the contribution of the data). Also p(A|x) ≈ N(µA, σA²).
2. With large amounts of data (N is large) σA² ≫ σ²/N and Â → x̄, that is, the MMSE tends towards the sample mean x̄ (and effectively ignores the contribution of the prior information). Also p(A|x) ≈ N(x̄, σ²/N).
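A short numerical sketch of these two regimes (the prior, noise level and sample sizes are illustrative assumptions):

import numpy as np

# MMSE estimate A_hat = alpha*xbar + (1-alpha)*mu_A for the DC level in WGN
# with Gaussian prior A ~ N(mu_A, sigma_A^2) (all numbers are arbitrary examples).
rng = np.random.default_rng(11)
mu_A, sigma_A, sigma = 1.0, 0.5, 2.0

for N in (2, 10, 1000):                              # few data vs. many data
    A = mu_A + sigma_A * rng.standard_normal()       # draw the true parameter
    x = A + sigma * rng.standard_normal(N)
    alpha = sigma_A**2 / (sigma_A**2 + sigma**2 / N)
    A_mmse = alpha * x.mean() + (1 - alpha) * mu_A
    print(f"N={N:5d}  alpha={alpha:.3f}  A={A:+.3f}  A_mmse={A_mmse:+.3f}  xbar={x.mean():+.3f}")
# Small N: A_mmse is pulled towards mu_A; large N: A_mmse approaches the sample mean.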
♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦
x = Hθ + w
E(x) = Hµθ
E(y) = µθ
and we can show that:
♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦
10.3.1 EXAMPLE
Consider the signal embedded in noise problem:
x[n] = A + w[n]
where we have shown that the MMSE is:
 = αx̄ + (1 − α)µA
2
σA 2
where α= 2 . If the prior pdf is noninformative then σA =∞ and α=1
σA + σN
2
p(θ|x) = p(x|θ) p(θ) / ∫ p(x|θ) p(θ) dθ
Now p(x|θ) is, in reality, p(x|θ, α), but we can obtain the true p(x|θ) by:
p(x|θ) = ∫ p(x|θ, α) p(α|θ) dα
or, if α and θ are independent:
p(x|θ) = ∫ p(x|θ, α) p(α) dα
11 General Bayesian Estimators
Bmse(θ̂) = E[(θ − θ̂)²] = ∫∫ (θ − θ̂)² p(x, θ) dx dθ
is one specific case of a general estimator that attempts to minimise the average of a cost function C(ε), that is the Bayes risk R = E[C(ε)], where ε = θ − θ̂. There are three different cost functions of interest:
∫_{−∞}^{θ̂} p(θ|x) dθ = ∫_{θ̂}^{∞} p(θ|x) dθ   or   Pr{θ ≤ θ̂|x} = 1/2
which is the mode of the posterior pdf (the value that maximises the pdf). For the Gaussian posterior pdf it should be noted that the mean, median and mode are identical. Of most interest are the quadratic and hit-or-miss cost
functions which, together with a special case of the latter, yield the following
three important classes of estimator:
2. MAP (Maximum A Posteriori) estimator, which is the mode or maximum of the posterior pdf:
θ̂ = arg maxθ p(θ|x) = arg maxθ p(x|θ)p(θ)
pdf p(x, θ). We consider the class of all affine estimators of the form:
θ̂ = Σ_{n=0}^{N−1} an x[n] + aN = aᵀx + aN
The resultant estimator is termed the linear minimum mean square error (LMMSE) estimator. The LMMSE will be sub-optimal unless the MMSE is also linear. Such would be the case if the Bayesian linear model applied:
x = Hθ + w
The weight coefficients are obtained from ∂Bmse(θ̂)/∂ai = 0 for i = 1, 2, . . . , N, which yields:
aN = E(θ) − Σ_{n=0}^{N−1} an E(x[n]) = E(θ) − aᵀE(x)   and   a = Cxx⁻¹Cxθ
where Cxx = E[(x − E(x))(x − E(x))ᵀ] is the N × N covariance matrix and Cxθ = E[(x − E(x))(θ − E(θ))] is the N × 1 cross-covariance vector. Thus the LMMSE estimator is:
θ̂ = E(θ) + Cθx Cxx⁻¹ (x − E(x))
x = Hθ + w
where x is the N × 1 data vector, H is a known N × p observation matrix, θ is a p × 1 random vector of parameters with mean E(θ) and covariance matrix Cθθ, and w is an N × 1 random vector with zero mean and covariance matrix Cw which is uncorrelated with θ (the joint pdf p(w, θ), and hence also p(x, θ), are otherwise arbitrary), then noting that:
E(x) = HE(θ)
Cxx = HCθθHᵀ + Cw
Cθx = CθθHᵀ
and the covariance of the error, which is the Bayesian MSE matrix, is:
Mθ̂ = ( Cθθ⁻¹ + HᵀCw⁻¹H )⁻¹
where rxx [k] = E(x[n]x[n − k]) is the autocorrelation function (ACF) of the
x[n] process and Rxx denotes the autocorrelation matrix. Note that since x[n]
is WSS the expectation E(x[n]x[n − k]) is independent of the absolute time
index n. In signal processing the estimated ACF is used:
r̂xx[k] = (1/N) Σ_{n=0}^{N−1−|k|} x[n]x[n + |k|]   for |k| ≤ N − 1,   and r̂xx[k] = 0 for |k| ≥ N
Both the data x and the parameter to be estimated, θ, are assumed zero mean. Thus the LMMSE estimator is:
θ̂ = Cθx Cxx⁻¹ x
12.2.1 Smoothing
The problem is to estimate the signal θ = s = [s[0] s[1] . . . s[N − 1]]ᵀ based on the noisy data x = [x[0] x[1] . . . x[N − 1]]ᵀ where:
x = s + w
and w = [w[0] w[1] . . . w[N − 1]]ᵀ is the noise process. An important difference between smoothing and filtering is that the signal estimate ŝ[n] can use the entire data set: the past values (x[0], x[1], . . . , x[n − 1]), the present x[n] and future values (x[n + 1], x[n + 2], . . . , x[N − 1]). This means that the solution cannot be cast as a filtering problem since we cannot apply a causal filter to the data.
We assume that the signal and noise processes are uncorrelated. Hence:
and thus:
Cxx = Rxx = Rss + Rww
also:
Cθx = E(sxᵀ) = E(s(s + w)ᵀ) = Rss
Hence the LMMSE estimator (also called the Wiener estimator) is:
ŝ = Rss (Rss + Rww)⁻¹ x
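A small numerical sketch of this Wiener smoother; the AR(1) signal autocorrelation rss[k] = P a^|k| used to build Rss is an assumption of the sketch, chosen only to supply a valid signal covariance:

import numpy as np

# Wiener smoother s_hat = R_ss (R_ss + R_ww)^{-1} x, with an assumed AR(1)
# signal autocorrelation r_ss[k] = P * a^|k| and white observation noise.
rng = np.random.default_rng(12)
N, a, P, sigma_w = 200, 0.95, 1.0, 1.0

k = np.arange(N)
Rss = P * a ** np.abs(k[:, None] - k[None, :])       # Toeplitz signal covariance
Rww = sigma_w**2 * np.eye(N)

s = rng.multivariate_normal(np.zeros(N), Rss)        # one realisation of the signal
x = s + sigma_w * rng.standard_normal(N)             # noisy observations

s_hat = Rss @ np.linalg.solve(Rss + Rww, x)          # Wiener estimate
print("noisy MSE   :", np.mean((x - s)**2))          # ~ sigma_w^2
print("smoothed MSE:", np.mean((s_hat - s)**2))      # noticeably smaller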
12.2.2 Filtering
The problem is to estimate the signal θ = s[n] based only on the present and
past noisy data x = [x[0] x[1] . . . x[n]]ᵀ. As n increases this allows us to view the estimation process as an application of a causal filter to the data, and we need to cast the LMMSE estimator expression in the form of a filter.
is a 1 × (n + 1) row vector. The LMMSE estimator is:
Then:
ŝ[n] = Σ_{k=0}^{n} ak x[k] = Σ_{k=0}^{n} h⁽ⁿ⁾[n − k] x[k] = Σ_{k=0}^{n} h⁽ⁿ⁾[k] x[n − k]
We define the vector h = [h⁽ⁿ⁾[0] h⁽ⁿ⁾[1] . . . h⁽ⁿ⁾[n]]ᵀ. Then we have that h = ǎ, that is, h is a time-reversed version of a. To explicitly find the impulse response h we note that since:
(Rss + Rww)a = řss
then it is also true that:
(Rss + Rww)h = rss
12.2.3 Prediction
The problem is to estimate θ = x[N − 1 + l] based on the current and past data x = [x[0] x[1] . . . x[N − 1]]ᵀ, where the sample to be estimated lies l ≥ 1 steps in the future. The resulting estimator is termed the l-step linear predictor.
As before we have Cxx = Rxx where Rxx is the N ×N autocorrelation matrix,
and:
Cθx = E(θxᵀ) = E( x[N − 1 + l] [x[0] x[1] . . . x[N − 1]] ) = [ rxx[N − 1 + l]  rxx[N − 2 + l]  . . .  rxx[l] ] = řxxᵀ
where a = [a0 a1 . . . aN−1]ᵀ = Rxx⁻¹ řxx. We can interpret the process of forming the estimator as a filtering operation where h⁽ᴺ⁾[k] = h[k] = aN−k ⇒ h[N − k] = ak, and then:
x̂[N − 1 + l] = Σ_{k=0}^{N−1} h[N − k] x[k] = Σ_{k=1}^{N} h[k] x[N − k]
Defining h = [h[1] h[2] . . . h[N]]ᵀ = [aN−1 aN−2 . . . a0]ᵀ = ǎ as before, we can find an explicit expression for h by noting that:
• the values of −h[n] are termed the linear prediction coefficients, which are used extensively in speech coding, and
• θ̂[n − 1], the estimate based on n−1 data samples,
Using a vector space analogy where we define the following inner product:
(x, y) = E(xy)
the procedure is as follows:
1. From the previous iteration we have the estimate, θ̂[n − 1], or we have an initial estimate θ̂[0]. We can consider θ̂[n − 1] as the true value of θ projected onto the subspace spanned by {x[0], x[1], . . . , x[n − 1]}.
2. We find the LMMSE estimator of x[n] based on the previous n − 1 samples, that is the one-step linear predictor, x̂[n|n − 1], or we have an initial estimate, x̂[0| − 1]. We can consider x̂[n|n − 1] as the true value x[n] projected onto the subspace spanned by {x[0], x[1], . . . , x[n − 1]}.
x[n] = hᵀθ + w[n]
13 Kalman Filters
The state of the system evolves according to the process (state) equation:
s[n] = As[n − 1] + Bu[n]
where s[n] is the p × 1 state vector with unknown initialisation s[−1] = N(µs, Cs), u[n] = N(0, Q) is the WGN r × 1 driving noise vector, and A, B are known p × p and p × r matrices. The observed data, x[n], is then assumed a linear function of the state vector with the following observation equation:
x[n] = H[n]s[n] + w[n]
where H[n] is a known observation matrix and w[n] = N(0, R) is the observation noise, independent of u[n]. The Kalman filter recursively computes the estimate ŝ[n|n] which minimises the Bayesian MSE
E[(s[n] − ŝ[n|n])²]
for each component of the state, and uses the innovation x̃[n] = x[n] − x̂[n|n − 1] to decompose the estimate as:
ŝ[n|n] = E(s[n]|X[n]) = E(s[n]|X[n − 1]) + E(s[n]|x̃[n])
where E(s[n]|X[n − 1]) is the state prediction and E(s[n]|x̃[n]) is the innovation correction. By definition ŝ[n|n − 1] = E(s[n]|X[n − 1]) and from the process equation we can show that:
ŝ[n|n − 1] = Aŝ[n − 1|n − 1]
and:
where:
M[n|n − 1] = E[ (s[n] − ŝ[n|n − 1])(s[n] − ŝ[n|n − 1])ᵀ ]
is the state error covariance predictor at time n based on the observation sequence X[n − 1]. From the process equation we can show that:
M[n|n − 1] = AM[n − 1|n − 1]Aᵀ + BQBᵀ
and by appropriate substitution we can derive the following expression for the
state error covariance estimate at time n:
M[n|n] = (I − K[n]H[n])M[n|n − 1]
Gain:
K[n] = M[n|n − 1]Hᵀ[n]( H[n]M[n|n − 1]Hᵀ[n] + R )⁻¹
Correction:
ŝ[n|n] = ŝ[n|n − 1] + K[n]( x[n] − x̂[n|n − 1] )
M[n|n] = (I − K[n]H[n])M[n|n − 1]
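A minimal scalar sketch of the full recursion (prediction, gain, correction). The random-walk state model and all numerical values are arbitrary choices for illustration:

import numpy as np

# Scalar Kalman filter tracking a random-walk state: s[n] = s[n-1] + u[n],
# x[n] = s[n] + w[n]. A = B = H = 1, Q = driving-noise var, R = obs-noise var.
rng = np.random.default_rng(13)
N, Q, R = 300, 0.01, 1.0

s = np.cumsum(np.sqrt(Q) * rng.standard_normal(N))   # true state trajectory
x = s + np.sqrt(R) * rng.standard_normal(N)          # observations

s_hat, M = 0.0, 100.0                                # initial estimate and its variance
err = np.empty(N)
for n in range(N):
    s_pred, M_pred = s_hat, M + Q                    # prediction step
    K = M_pred / (M_pred + R)                        # gain K[n]
    s_hat = s_pred + K * (x[n] - s_pred)             # correction
    M = (1.0 - K) * M_pred                           # M[n|n]
    err[n] = s_hat - s[n]

print("raw observation MSE:", np.mean((x - s)**2))   # ~ R
print("Kalman filter MSE  :", np.mean(err**2))       # substantially smaller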
14 MAIN REFERENCE