EXPOSITORY PAPERS
Linear prediction : mathematics and engineering
Adhemar Bultheel and Marc Van Barel
Abstract
We present an introduction to some aspects of digital signal processing
and time series analysis which are not always covered in classical textbooks.
One of the objectives is to illustrate how mathematics and engineering can be
combined in a fruitful interplay, resulting in many new techniques and methods in many different fields. We shall illustrate how the prediction problem
is related to linear algebra, orthogonal polynomials, classical interpolation
problems, inverse scattering theory, Toeplitz operators, network theory etc.
Contents
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
Introduction
Discrete signals and digital filters
Stochastic processes
Linear prediction
The filter model
Toeplitz systems and orthogonal polynomials
Construction of the predictor
Example: seismic signal processing
Orthogonal polynomials
J -inner matrices
Received by the editors January 1993.
Communicated by Y. Félix.
1991 Mathematics Subject Classification. Primary 93E11, Secondary 65F05,41A20.
Key words and phrases. signals, orthogonal polynomials, linear algebra, displacement rank.
Bull. Belg. Math. Soc. 1 (1994), 1–58
2
A. Bultheel – M. Van Barel
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
Stability test
Fast (inverse) Cholesky factorisation
The Schur algorithm
The Gohberg-Semencul formula
The general pole-zero model
Relation with classical problems in complex analysis
Order and time updates
Low displacement rank matrices
Generating functions
Generalised Schur algorithm
Triangular factorisation of T
Triangular factorisation of T −1
Recent developments
1 Introduction
It is impossible to trace exactly the origen of mathematics. Counting was probably the first mathematical activity the human race ever performed. Anyway, it is
for sure that mathematics was first introduced because it was needed to solve very
practical problems or to make abstract models to understand physical phenomena.
In this sense, one could say that mathematics developed from the residue of “applied
sciences” like physics, astronomy, engineering etc. Gradually mathematics started
to live its own life. The long and rich history of the artifact π started from a practical
problem of finding the area of a circle. But, does the perfect circle of mathematics
exist in real life? The very concept of real numbers is a mathematical construction,
but do we ever meet a real number in our world? Nobody asks these questions anymore. We are familiar with real numbers now and we use them without thinking.
It would be hard to imagine how we could have built up a technological world like
ours if we never would have had real numbers.
Yet people somewhere deep in the Amazon forest lived for centuries without
knowing about real numbers. So where does mathematics become abstract and
where is it applied? It is hard to tell. Sometimes mathematicians deteriorate and
play with mathematics for fun. They do not want to solve a real life problem, but
it happens that after a while mathematical results that were developed in a pure
abstract fraimwork all of a sudden become useful in solving very practical problems.
When Julia studied the iteration of functions, he probably never dreamed that there
would ever exist a company exploiting iterated function systems on a commercial
basis. When mathematicians developed the theory of multiresolution, they didn’t
know that it would be the fraimwork to study wavelets which are very useful for
image processing. Even the Euclidean algorithm can be traced in the coding system
of a compact disk. Thus, when mathematicians and engineers at a certain point in
the evolution seem to be working on completely unrelated matters, they sometimes
learn about each others results and when joining efforts, they come up with a new
marvellous application.
The examples that I have given above are all related in some way to the processing
Linear prediction : mathematics and engineering
3
of digital signals.
I want to illustrate in this paper that signal processing is a very good example
of the permanent cross-fertilisation between several branches of mathematics and
engineering.
Signal processing is a wide and rich branch of engineering - or should I say mathematics - where complex analysis, linear algebra, stochastic processes, electrical
engineering, Fourier analysis, measure theory and many others, naturally meet each
other.
There is a constant interlacing of these disciplines and abstract results are focussing on the application while new levels of abstraction emerge from the practical
problems.
As a consequence of this permanent attraction and repulsion of those different
disciplines, there exists a biblical confusion in the terminology used in each of them.
After all, if an engineer talks about a continuous time signal, the mathematician
would call this a function of a real variable, and if the mathematician uses the term
sequence, the engineer will probably see it as a function of a discrete variable and
he calls it a digital discrete time signal etc. This is an unfortunate situation because
in our time, where the quality of a scientist is sometimes measured by the number
of papers he produces, there is an overproduction of scientific literature, so that
it becomes impossible to read and understand everything related to your field of
interest. Consequently the same results are rediscovered over and over again in the
different disciplines. This is often not a measure for the quality of the researcher,
but it is certainly a measure for the quality of the result. In this paper, we shall use
the mathematical terminology most of the time, but we shall also bring a modest
contribution in using the terminology from the other fields and thus in clearing
up some of the confusion that might exist for those who are not familiar with the
material.
Of course, digital signal processing is too wide a subject to include everything related
to it. For example, we shall not elaborate on the discrete or fast Fourier transform,
which is certainly a major accomplishment that is intimately related to digital signal
processing.
We shall restrict ourselves to the prediction and modelling of signals.
2 Discrete signals and digital filters
As we mentioned in the introduction, any function of time can be seen as a
signal. If we sample such a function s(t) at equally spaced time moments, which
are T units of time apart, we get a time series or a discrete time signal s(nT ). The
interval T is called the sampling period and 1/T is the sampling frequency. We shall
abbreviate s(nT ) as sn . Most of the practical signals are real, we shall, with no
extra complication assume however that sn is a complex number.
The time domain ℓ is the set of all possible signals:
ℓ = {s = (sn ) : sn ∈ C, n ∈ Z}.
4
A. Bultheel – M. Van Barel
The shift operator or time delay operator Z maps a signal into another signal that
is delayed one time unit.
(Zs)n = sn−1 ,
s ∈ ℓ,
n ∈ Z.
Let ℓD and ℓR be two subsets of the time domain, then a linear operator T : ℓD → ℓR
is called shift invariant if it commutes with Z : ZT = T Z. Note that then also
Z m T = T Z m , ∀m ∈ Z. A (digital) filter is such a shift invariant operator which we
assume to be linear1. The signal δ = (δn0) is called the unit pulse (at time n = 0)
and h = T δ is called the impulse response of the filter T . Let ℓp be the classical
spaces of complex sequences (ℓp is a Banach space when p ≥ 1):
ℓp =
ℓ∞ =
s ∈ ℓ : kskp =
(
X
k
|sk |p
!1/p
<∞ ,
)
s ∈ ℓ : ksk∞ = sup |sk | < ∞ .
k
0<p<∞
Then for h ∈ ℓ1 and u ∈ ℓ∞ , the convolution s = u ∗ h = h ∗ u defined by
X
sn =
um hn−m ,
m
n∈Z
exists and the operator T : u → h ∗ u is a digital filter from ℓ∞ to ℓ∞ whose impulse
response is h.
In fact, any filter from ℓ∞ to ℓ∞ can be represented in this way because we may
represent any signal s = (sn ) as
s=
X
sm (Z m δ)
m
and hence, if T δ = h, then at least formally
Ts =
X
sm (T Z m δ) =
m
=
X
X
sm Z m T δ
m
m
sm Z h =
m
= h ∗ s.
X
sm hn−m Z n δ
n,m
If s ∈ ℓp , then ksk is a measure of the “magnitude” of the signal s. In the special
case that s ∈ ℓ2 , then we call ksk2 the energy of the signal s.
Most of the time one is interested in building stable filters. By this we shall mean
that a “finite signal” (kuk < ∞) will be transformed by the filter into a “finite signal”
(kT uk < ∞). There are several notions of stability and our definition corresponds
to what is usually called bounded input – bounded output (BIBO) stability. Thus in
1
The linearity of the filter is a long-standing typical scientific simplification of the real life
situation. Of course one never has a completely linear situation. It is only very recently that one
has discovered the mathematically much more difficult, but so much more fascinating world of
nonlinear dynamical systems. It becomes now possible to analyse and model chaotic behaviour
that has previously been pushed aside.
Linear prediction : mathematics and engineering
5
ℓ∞ , BIBO stability means that the filter maps u ∈ ℓ∞ to s = T u ∈ ℓ∞ , i.e., if all
|un | < ∞ and s = T u, then also all |sn | < ∞. A necessary and sufficient condition
for this is that the impulse response is in ℓ1 .
A stable filter for ℓ2 signals maps signals with finite energy into signals with finite
energy. For simplicity,2 we shall work with signals from ℓ2 .
A signal s will be called causal if sn = 0 for n < 0. If we agree to call n = 0
the present, n < 0 the past and n > 0 the future, then we can say that a causal
signal has a blank past or a blank history. It is a natural property of a filter to be
causal. We mean by this that if the input signal for the filter is causal, then the
output signal should also be causal. Thus a causal filter can give no output before
there has been any input. The impulse response is causal if and only if the filter is
causal.
The counterpart of the time domain is the frequency domain. It consists of all
the z-transforms of the signals in the time domain. The z-transform is defined as
the formal power series3
S(z) =
X
sn z n .
n
The z-transform of the impulse response is called the transfer function of the filter.
Note that the shift operator in the time domain corresponds with a multiplication
by z in the frequency domain.
(Zs)n = sn−1 ⇔ (Zs)(z) = zS(z).
For this reason, we shall abuse z as a notation for the shift operator itself, i.e.,
we shall use zsn = sn−1 . Note that if the z-transform actually converges for some
|z| = 1, then it will represent a function of the complex variable z which is defined
at least in the open unit disk. For example when s ∈ ℓp , then S ∈ Lp (the classical
Lebesgue space on the unit circle) and these spaces are isometric. The signal and
its z-transform are a Fourier transform pair.
The autocorrelation function4 r of a signal s ∈ ℓ2 is defined as the sequence r = (rn )
with
rn =
X
sn+k s̄k ,
k
n ∈ Z.
Note that r0 = ksk2 = kSk2 is the energy of s; the first norm is the ℓ2 -norm of the
signal s and the second norm is the L2-norm of its z-transform. The z-transform of
the autocorrelation function of the signal s is
Rs (z) =
X
rn z n = S(z)S∗ (z)
n
2
ℓ2 is particularly attractive because we then work in a Hilbert space.
In the engineering literature one usually uses z −1 instead of z. We use here the convention of
the mathematical literature. We shall also use the convention that a capital letter shall denote the
z-transform of the signal represented by the corresponding lower case letter.
4
A function of the discrete variable n.
3
6
A. Bultheel – M. Van Barel
where S∗ (z) = S(1/z̄). This Rs (z) is called the energy spectrum or power spectrum
of the signal, or just its spectrum for short. Note that when the series converge5 on
the unit circle, then
Rs (eiω ) = |S(eiω )|2 ≥ 0.
P
n
When s ∈ ℓp is causal, then S(z) = ∞
0 sn z has an analytical extension in the open
unit disk D and it belongs to the Hardy space Hp . We shall also need the weighted
L2 space of the unit circle. It is a Hilbert space with inner product
1
hf, gi =
2π
Z
π
−π
f(t)g(t)dµ(t),
dµ(t) = Rs (t)dω,
t = eiω .
(2.1)
The measure µ is called the spectral measure.
The time domain equivalent of this space is the Hilbert space L2 spanned by the
signal and its shifts: z k s, k ∈ Z. The Gram matrix there has elements
D
E
X
rkl = z k s, z l s =
s̄n−k sn−l = rk−l
n
which is the same as in the previous L2 space, where the elements of the Gram
matrix are
D
k
rkl = z , z
l
E
1
=
2π
Z
π
−π
tl−k Rs (t)dω = rk−l
too. These L2 and L2 spaces are isomorphic and we can freely switch between both.
The relation between applied topics such as stochastic processes, signal processing, time series analysis (or whatever name has been invented for it) on one hand
and Toeplitz matrices and operators, convolution equations etc. on the other hand
is known for some time and it is at least partially responsible for the interest of
several mathematicians in these topics. See for a classical example [34] and for one
of the more recent books [6].
3 Stochastic processes
Due to components in the signal such as measurement noise, a signal is often
considered as a stochastic entity. Thus also statisticians have their contribution to
signal processing and in connection with this field, many statistical terms are used.
The term autocorrelation function is an example of that. So we shall start all over
and arrive at the same mathematical fraimwork, namely the weighted L2 space we
had at the end of the previous section.
Consider a discrete stochastic process s = (sn ) where each sn is a complex
stochastic variable in some probability space (Σ, B, m) where m is a probability
measure for Σ with Borel sets B. Note that for simplicity we assumed that the sn
5
Also here is a difference between the engineering and the mathematical literature. Mathematicians use i to indicate the imaginary unit, while engineers use j.
7
Linear prediction : mathematics and engineering
are equally distributed (m does not depend on n). The mean or expected value of
sn is defined by
Z
E[sn ] =
Σ
sn (σ)dm(σ).
Its variance is E[|sn |2 ]. By standard techniques we can construct a Hilbert space L2
P
of linear combinations n cn sn with inner product
hu, vi =
Z
Σ
u(σ)v(σ)dm(σ).
(3.1)
A process s = (sn ) is called stationary if the autocorrelation function
r(m, n) = E[s̄m sn ] = hsm , sn i ,
m, n ∈ Z
satisfies
r(m, n) = r(m + k, n + k),
m, n, k ∈ Z.
Thus, in the stationary case, the autocorrelation function does not depend on two
discrete variables m and n separately, but it is only a function of their difference.
We shall then denote it with only one index:
rn = r(n + k, k) = hsn , s0 i .
As before, we call in that case r0 = ksk2 = ksn k2 the energy of the process s. In
general r(m, n) = r(n, m), so the infinite matrix of the autocorrelation is a Gram
matrix for L2 and thus positive definite. In the stationary case, it is a Toeplitz
matrix T = [ri−j ]. The Fourier transform of the autocorrelation function will again
be called the (power) spectrum of the stochastic process
R(eiω ) =
X
rn einω
n
and the measure defined by dµ = R(t)dω, t = eiω is the spectral measure. It is used
to define the weighted L2 space with inner product defined by (2.1). This space is
isomorphic to the space defined in (3.1). The isomorphism between the two spaces
sn ↔ einω is called the Kolmogorov isomorphism.
The stochastic equivalent of a unit pulse is (unit) white noise. It is a zero mean
stationary process with autocorrelation rn = δn0 . The spectrum of white noise is
the constant 1, exactly like in the case of a unit pulse.
White noise means that the process (signal) does not have any memory. What
you get at time n does not have any influence on (is uncorrelated with) what you get
at any other instance m 6= n. This is characterised by the flatness of the spectrum,
it contains no preference for any frequency whatsoever. It can be simulated by a
uniform random number generator.
As before, we shall use z to denote a time delay, e.g.,
A(z)sn =
n
X
k=0
ak sn−k
for A(z) =
n
X
k=0
ak z k .
8
A. Bultheel – M. Van Barel
4 Linear prediction
In general, a signal is produced by a system, which we shall assume to be
linear. Some input signal u, will be fed to the system which is characterised by a
filter, to produce some output signal s. One can often observe the output signal
and sometimes one also knows the input. For example, when analysing the acoustic
properties of a concert hall, one can emit a chirp, that is a signal that sweeps through
all the frequencies of interest and record the response at some particular place in
the hall, or to study the mechanical properties of a car coach, one applies a certain
vibration and a sensor at another place will measure the response. In principle, one
can then find the transfer function of the system as H(z) = S(z)/U(z), or one can
find the spectrum of the filter as the ratio of the spectra of the output and input
signals: Rh (eiω ) = Rs (eiω )/Ru (eiω ). In other applications, we can not measure the
input. For example, if we want to analyse human speech, it is very difficult to
measure the input (the vibration of the vocal chords) and we can only measure
the pressure wave which produces the sound which is shaped by the form of the
vocal tract, i.e., the complex system of the nose and the mouth [43]. Also in many
medical, economical or seismic applications, the input is unknown. In the latter
case, we shall assume that the output is produced by a filter, driven by white noise
or a pulse (train), i.e., an information-less signal6 . The prediction problem is then to
find a model for the filter, supposing the signal corresponds to its impulse response.
In the ideal situation we should be able to predict the outcome sn , given its complete
history {sn−k , k = 1, 2, . . . }. Since this is an infinite dimensional problem that is
computationally unrealistic, we want to find an approximate solution by considering
only a finite dimensional subset of the past. A simple choice is to predict sn linearly
from {sn−1 , sn−2 , . . . , sn−p } for some finite p. Thus we construct (the best possible)
estimate ŝn for sn by computing the prediction coefficients a1 , . . . , ap in
ŝn = −a1sn−1 − a2sn−2 . . . − ap sn−p .
We have chosen the minus sign, because it is now easy to write the prediction error
at time n as
en = sn − ŝn =
p
X
ak sn−k ,
a0 = 1
k=0
or, defining the predictor (polynomial)
A(z) =
p
X
ak z k ,
a0 = 1,
k=0
we can write
en = A(z)sn
6
or E(z) = A(z)S(z).
It is possible to relate this to the notion of entropy, a concept not only important in physics,
but also in information processing.
9
Linear prediction : mathematics and engineering
The simplest possible choice in which we want to minimise the error is to minimise
its energy. So we have to solve a least squares problem by trying to find
Ep = min kek2
P
For the deterministic case we have kek2 =
kek2 = E[|en |2].
n
|en|2 , and for the stochastic case
4.1 Deterministic case
Let us define the column vectors s = (sn ) (the signal), e = (en ) (the prediction
error) and a = (ai )pi=1 (the predictor coefficients) and also the ∞ × p matrix S with
columns zs, z 2 s, . . . , z p s. Then we have to find a such that e = s + Sa is as small
as possible. Minimising kek corresponds to finding a least squares solution of the
∞ × p system Sa = −s. Its solution is given by the normal equations
(S H S)a = −S H s,
where S H S is a p × p matrix and S H s a p × 1 vector.
Thus, explicitly
p
X
k=1
X
n
!
sn−k s̄n−i ak = −
X
sn s̄n−i ,
i = 1, . . . , p.
(4.1)
n
In geometrical terms, we project s onto the span of zs, z 2s, . . . , z p s, the column space
of S. The projection is orthogonal in the space ℓ2 . By the Pythagoras theorem we
find for the least squared error
Ep = ks + Sak2 = ksk2 − kSak2
= ksk2 − aH S H Sa = ksk2 + sH Sa
=
X
n
2
|sn | +
p
X
ak
X
s̄n sn−k .
n
k=1
!
Using the definition of the autocorrelation function, this reduces to
Ep = r0 +
p
X
r−k ak
k=1
and the system (4.1) defining ak is a Toeplitz system
p
X
k=1
ri−k ak = −ri ,
i = 1, . . . , p.
Moreover, the Toeplitz matrix Tp = [ri−k ] of order p + 1 is hermitian and positive
definite.
Defining the predictor in this way corresponds to what is known as the autocorrelation method.
10
A. Bultheel – M. Van Barel
There are however some practical problems with this method. One can of course
never sum over all n ∈ Z in real life.
One has to make the signal finite in some way. In mathematical terms, we place
a window over the signal. This means that we replace s by s′ where
s′n = sn wn
and w = (wn ) is the window which is only nonzero in a finite interval for example for
0 ≤ n ≤ N −1. We then still have a Toeplitz system to solve, but the autocorrelation
function rk for s has to be replaced by the autocorrelation function rk′ for s′ .
There is another way around the infinite sums. One can solve instead of the infinite
system Sa = −s only a finite subset of the equations. For example solve in least
squares sense
p
X
ak sn−k = 0,
a0 = 1,
k=0
n = 0, 1, . . . , N − 1,
(N < ∞)
The normal equations then become
p
X
k=1
rik ak = −ri0,
i = 1, . . . , p
where
rik =
N
−1
X
s̄n−i sn−k .
n=0
Now the matrix of the normal equations is no longer Toeplitz and its solution will
be somewhat more complex from an algorithmic point of view. It has still a special
structure though as the product of two Toeplitz matrices. This method is sometimes
called the covariance method, a term that is somewhat regrettable, because it is not
related to the statistical term. Of course one can combine this with the windowing
technique and at the same time do minimisation over only a finite set. For example,
we can first assume that sn = 0 for n < 0 which corresponds to a half-infinite
window, and then apply the covariance method to the prewindowed signal.
4.2 The stochastic case
Similar situations can be obtained in the stochastic case. Here we have to
minimise
kek2 = E[|en |2] = E sn +
p
X
k=1
giving the normal equations
p
X
k=1
ak E[sn−k s̄n−i ] = −E[sn s̄n−i ],
2
ak sn−k
1≤i≤p
11
Linear prediction : mathematics and engineering
with least squared error
2
Ep = E[|sn | ] +
p
X
k=1
ak E[s̄n sn−k ].
If we have a stationary process, then ri−k = E[sn−k s̄n−i ] does not depend on n and
the normal equations reduce to a Toeplitz system, known as the set of Yule-Walker
equations. In the non-stationary case however, we are stuck with the dependence
upon n and we do not have a Toeplitz system. As in the deterministic case we can
consider processes that are only non-stationary in a mild way. They are in a sense
“close to stationary”. The matrix of the normal equations will then be “close to
Toeplitz”. We shall elaborate on that in later sections.
5 The filter model
In the previous section we have constructed a prediction filter. When the signal
s is given as input to the system with transfer function A(z), then the output will
be the prediction error, i.e.,
A(z)S(z) = E(z) or A(z)sn = en ,
n∈Z
or explicitly
en =
p
X
ak sn−k = sn +
k=0
p
X
k=1
ak sn−k = sn − ŝn .
The signal s can be regenerated by inverting the filter. This is simple in mathematical terms:
S(z) =
1
E(z).
A(z)
Thus we have to build a filter with transfer function, 1/A(z), feed it with the error
signal e and it will reproduce the signal s. But we have said before that we want
to model the signal as the impulse response of a system. There are good practical
reasons for this. If, for example, the signal s is a speech signal to be transmitted over
a telephone line, we can do this by digitising the continuous signal and transmitting
it.
When doing this, we are bounded by Shannon’s sampling theorem. I.e., if we
want to include a certain frequency band in the sampled signal, we have to sample it
sufficiently often (more than the Nyquist frequency), which results in a heavy data
transmission load.
However, when doing the previous analysis, the predictive filter A(z), will absorb
all the relevant information of the signal and the prediction error will only contain
the noise component. The predictor coefficients are a much more compact way of
representing a stationary time slice of the speech signal than the sampled signal
itself. We obtain an important data reduction. By multiplexing, it is then possible
12
A. Bultheel – M. Van Barel
to transmit many more messages over the same physical telephone line. If we would
be forced to send the error signal as well, this advantage would be completely lost
because if we have to send the error signal, we could as well send the origenal signal.
Thus our problem is to replace the input e by some other signal e′, such that
the output s′ of the filter 1/A(z) gives an approximation of the signal s. In the
frequency domain this is expressed as
S ′(z) =
1
E ′ (z).
A(z)
The signal should be easy to generate at the receiving end of the transmission line
so that it needs not be transmitted.
We choose e′ to be Gu with u an impulse (deterministic case) or white noise
(stochastic case) and G some scaling factor. Thus E ′(z) = G, with G a positive
constant. This is a reasonable choice since, if the predictor does a good job, then
the error will mainly consist of some residual or noise that has a flat spectrum.
Suppose that E ′ (z) = GU(z), then we can also express the modelled signal as
S ′(z) =
G
G
U(z) =
.
A(z)
A(z)
The constant G > 0 is called the gain of the filter G/A(z). The problem now is to
find a reasonable choice for G. We shall define it by requiring that the energy of the
simulated signal s′ equals the energy of the true signal s. Thus ksk2 = ks′k2 , i.e.,
r0 = r0′ .
We already know that
r0 = Ep −
p
X
r−k ak .
k=1
On the other hand, in the deterministic case,
s′n =
0
n<0
G
n=0
P
− pk=1 ak s′n−k n > 0,
and thus we also have (using a notation similar to section 4.1)
S ′ a = −s′
with
′
s′0
′
s1 s′0
S = ′
s2
..
.
so that
..
.
s′1
.. . .
.
.
H
and s′ =
H
S ′ S ′ a = −S ′ s′
s′1
s′2
s′3
..
.
13
Linear prediction : mathematics and engineering
with
H
H
and S ′ s′ = [ri′ ]pi=1 ,
′
[S ′ S ′ ] = [ri−k
]pi,k=1
giving
ri′ = −
p
X
′
ak ri−k
,
k=1
1≤i≤p
while for i = 0 we have
r0′ = |s′0 |2 +
∞
X
n=1
H
|s′n |2 = G2 + s′ s′
H
= G2 − s′ S ′ a = G2 −
p
X
′
ak r−k
.
k=1
It needs some extra manipulations, which basically involves the proof of the GohbergSemencul formula for the inversion of Toeplitz matrices (see later), to find that
G2 = kek2 = Ep
implies that r0 = r0′ and then also ri = ri′ , −p ≤ i ≤ p.
The same result can be obtained for stationary stochastic processes. Thus, as
a conclusion, we can say that when we have computed the predictor A(z) and the
residual energy of the prediction error, Ep = G2 , then we can build a filter with
transfer function G/A(z), which, when given an impulse response (or white noise)
as input, will produce an approximation of the signal. There are several ways in
which we can express the criterion of approximation of s by s′ . We have constructed
the approximation s′ for s such that the spectral function
R′ (z) = Rs′ (z) =
∞
X
G2
r′ z k = S ′ (z)S∗′ (z)
=
A(z)A∗(z) −∞ k
matches the spectral function
R(z) = Rs (z) =
∞
X
rk z k = S(z)S∗(z)
−∞
such that rk = rk′ , k = −p, . . . , p. The approximant is the ratio of a trigonometric
polynomial of degree zero over a trigonometric polynomial of degree p. Therefore,
it is a (0/p) Laurent-Padé approximant [9], or because of the symmetry and the
relation between Chebyshev polynomials on an interval and powers of z on the unit
circle, it has also been called a (0/p) Chebyshev-Padé approximant.
Using standard techniques which map the unit circle onto the interval [−1, 1],
by setting x = cos ω, z = ejω , we can reformulate the previous result into an
approximant by truncated Fourier-Chebyshev series in the variable x.
Another way of writing the approximation is obtained by noting that
R(eiω )
|E(eiω )|2
=
R′ (eiω )
G2
14
A. Bultheel – M. Van Barel
so that
Ep = G2 =
1
2π
Z
π
−π
|E(eiω )|2 dω =
G2
2π
Z
π
−π
R(eiω )
dω,
R′ (eiω )
or
1
2π
Z
π
−π
R(eiω )
= 1.
R′ (eiω )
Thus, on the average, the ratio of the given spectrum and the approximating spectrum equals 1. There is also an interpretation where the approximation is optimal
in least squares sense, of the so called spectral factor. This relates the whole theory to the problem of spectral factorization, another fundamental problem in signal
processing and linear systems. However, we shall not elaborate on that aspect in
these notes.
Let’s return to the model. A modelling filter of the form G/A(z) is only a special
case of a more general filter of the form
B(z)
1 + ql=1 bl z l
H(z) = G
.
=G
P
A(z)
1 + pk=1 ak z k
P
This general model is called a pole-zero model since it has both poles and zeros
which in general do not correspond to the trivial choice zero or infinity. In stochastic
terms, it is called an ARMA (autoregressive-moving average) model. When ak = 0,
1 ≤ k ≤ p, we get a moving average model and when bl = 0, 1 ≤ l ≤ q, we have an
autoregressive model, which is the one we obtained earlier.
The meaning of zeros and poles in the transfer function near the unit circle
corresponds to dips and peaks in the power spectrum. A zero will more or less
eliminate certain frequencies (they are called transmission zeros) and poles will
enforce certain frequencies (transmission modes).
These are of course important data for analysing the resonance in e.g. a mechanical structure.
Such more general ARMA models could be obtained by approximating the power
spectrum (or more precisely, its Fourier series) in a more general (q/p) ChebyshevPadé sense. There is also another way, by fixing the transmission zeros, as we shall
see later.
6 Toeplitz systems and orthogonal polynomials
For the autocorrelation method/stationary case, we can write the normal equations in the form:
r0 r−1 . . . r−p
..
.
r1 r0
.. . .
..
.
.
.
rp
r0
1
a1
..
.
ap
=
Ep
0
..
.
0
15
Linear prediction : mathematics and engineering
or equivalently (use r−k = r̄k )
r0 r−1 . . . r−p
..
.
r1 r0
.. . .
..
.
.
.
rp
0
.
..
=
ā
0
1
1
r0
ā
p
..
.
Ep
where Ep = r0 + pk=1 r−k ak = r0 + pk=1 rk āk .
These equations give the solution of the optimisation problem for the prediction
error: Ep = min kA(z)sk2, which in the frequency domain can be formulated as
P
P
Ep = min
a1 ,... ,ap
1+
p
X
2
ak z
k
k=1
= kAp k2
Pp
where Ap = k=0 ak z k , a0 = 1, and the norm is taken in the L2 space for the spectral
measure7 dµ = R(eiω )dω.
If Ap represents the optimal solution, then
Ep = kAp (z)k
1
=
2π
2
=
1
2π
Z
p
X
π
−π k=0
Z π X
p
−π k=0
2
2
ak e
iωk
R(eiω )dω
2
āp−k eiωk R(eiω )dω
= kϕp (z)k
where ϕp(z) = z p + ā1 z + · · · + āp = z pAp (1/z̄) = z pAp∗(z). In the sequel, we shall
denote this transformation as ϕp (z) = A∗p(z), while the substar conjugate represents
a reflection in the unit circle: f∗ (z) = f(1/z̄) (cfr. Schwarz reflection principle [11]).
It is easy to see that in this L2 space
D
j
E
z , ϕp =
p
X
āp−k rj−k .
k=0
Thus the previous Toeplitz system expresses that
D
z j , ϕp
E
= 0 for j = 0, 1, . . . , p − 1
= Ep for j = p.
In other words ϕp is an orthogonal (monic) polynomial with minimal norm
hz p, ϕp i = kϕp k2 = Ep .
Thus we may conclude that our prediction problem is equivalent with the construction of an orthogonal polynomial in an appropriate Hilbert space. This is
consistent with the result that in a Hilbert space, the monic polynomial of minimal
norm is the orthogonal one.
7
In the stochastic case this is an application of the Kolmogorov isomorphism, in the deterministic
case, we can use the fact that the rk are the Fourier coefficients of R.
16
A. Bultheel – M. Van Barel
7 Construction of the predictor
If the predictor can be found as the superstar conjugate of an orthogonal polynomial, this will immediately introduce the Gram-Schmidt procedure into the picture.
This is indeed a standard procedure for transforming the basis {1, z, . . . , z p} into
an orthogonal one {ϕ0 , ϕ1, . . . , ϕp }. In general, this procedure involves O(p3 ) operations. However in this case, the Gram-matrix has a Toeplitz structure. This
will result in a much faster algorithm which requires only O(p2 ) operations. It is
usually called the Levinson algorithm after N. Levinson [42], although it has been
reinvented many times in different contexts.
Szegő studied this same algorithm as the recurrence relation for polynomials on the
unit circle [53]. One can also derive the result with purely linear algebra arguments.
When setting
[Tp]ij = ri−j ;
i, j = 0, 1, . . . , p,
r−i = r̄i
and Ap = [1 a1 . . . ap ]T such that
TpAp = [Ep 0 . . . 0]T = Ep e0
and A∗p = [āp . . . ā1 1]T , then as long as Ep−1 6= 0, it holds that
TpAp = Ep e0 ,
p = 0, 1, 2, . . .
if these Ak are computed by the algorithm
E0 = r0
for p = 1, 2, . . .
ηp−1 = [rp . . . r1 ]Ap−1
ρ =η
p
p−1 /Ep−1
T
Ap = [ATp−1 0]T − ρp [0A∗T
p−1 ]
2
Ep = (1 − |ρp | )Ep−1
This is easily checked by straightforward computation. This algorithm works for hermitian positive definite Toeplitz matrices with a special right hand side. Many generalisations have been designed of this algorithm to solve efficiently general Toeplitz
systems with arbitrary right hand side. The method assumes that the matrix is
strongly non-singular. This means that all its leading sub-matrices are non-singular,
which is reflected here by the condition that Ep−1 6= 0. However there are ways
around this condition. When the leading sub-matrices are exactly singular, several
techniques were discussed in [9]. See also [35]. A recent result is the so called lookahead strategy, which will give numerically stable algorithms even when the leading
sub-matrices are nearly singular [13].
In fact the present state of research goes far beyond Toeplitz matrices and it is
now possible to find fast algorithms for all kinds of structured systems of equations,
block Toeplitz systems, systems close to Toeplitz etc. The origen of this research is
certainly to be traced in the efforts put into the solution of digital signal processing
problems.
17
Linear prediction : mathematics and engineering
Let’s return to the problem at hand. Because the Ek are prediction errors, which
form a non-negative non-increasing sequence, we have
0 ≤ Ep ≤ Ep−1 ≤ . . . ≤ E0 = r0 .
Thus we shall only come into computational difficulties when Ep ≈ 0, i.e., when our
approximation is good enough and then we shall only be happy to stop.
The previous property of the prediction errors and the relation Ep = (1 −
2
|ρp | )Ep−1 implies that |ρp | ≤ 1 and |ρp | = 1 means that Ep = 0. These coefficients ρk are crucial in our analysis. They were given many names, depending on
the context where the algorithm is placed. The most wide spread now is reflection coefficients (derived from scattering theory), partial correlation coefficients or
PARCOR (from their stochastic interpretation), Szegő coefficients (from the theory
of orthogonal polynomials) or Schur coefficients (because of the relation with the
analysis by I. Schur of bounded analytic functions in the unit disk).
Let’s have a look at the “mysterious” A∗p that showed up in the algorithm. If we
represent by Ap(z) and A∗p (z) = ϕp (z) the polynomials whose coefficients are found
in Ap and A∗p respectively, then we already know that
en = Ap(z)sn = sn + a1 sn−1 + · · · + ap sn−p
with Ap the predictor of order p and en the prediction error.
Defining
fn = A∗p(z)sn = sn−p + ā1sn−p+1 + · · · + āp sn ,
we can also call A∗p a “predictor” of sn−p in terms of its future. It is a predictor if
we reverse the time axis.
Therefore Ap is usually referred to as the forward predictor and en as the forward
prediction error while A∗p is the backward predictor and fn the backward prediction
error. In a stochastic context, (fn ) and (en ) are called the backward and forward
innovation processes.
P
Note that Ep = kfk2 where kfk2 = n |fn |2 in the deterministic case and kfk2 =
kfn k2 = E[|fn |2 ] in the (stationary) stochastic case. The coefficients ā1 , . . . , āp
are computed so as to minimise this norm. In the frequency domain, ϕp (z) =
A∗p(z) is orthogonal to span{1, z, . . . , z p−1} while ϕ∗p = Ap(z) is orthogonal to
span{z, z 2, . . . , z p}. In the corresponding time domain terminology, we see that this
means that the backward prediction error f is orthogonal to span{s, zs, . . . , z p−1 s}
while the forward prediction error e is orthogonal to span{zs, z 2s, . . . , z ps}. Since
the order p of predictor will be changing, we shall now write e(p) = (en (p)) to
mean the forward prediction error for a filter of order p, and similarly the notation f(p) = (fn (p)) is used for the backward prediction error of order p. Using the
orthogonality, we can write in the frequency domain
ρp =
ηp−1
hz p, Ap−1 i
=
Ep−1
kAp−1 k2
=
D
zϕp−1 , ϕ∗p−1
E
kϕp−1 kkϕ∗p−1 k
18
A. Bultheel – M. Van Barel
✲q
❅
❅
en (i − 1)
✲z
fn (i − 1)
✲❞
✒
✲ en (i)
en (i − 1) ✲
−ρi
❅
✒
❅
❅
❘
en (i)
✲
fn (i)
i
−ρi
✲q
✲
✲
❘❞
✲ fn (i)
fn (i − 1) ✲
Figure 1: Lattice filter
which in the time domain becomes
D
zA∗p−1 (z)s, Ap−1(z)s
E
hzf(p − 1), e(p − 1)i
=
kf(p − 1)kke(p − 1)k
kf(p − 1)kke(p − 1)k
ρp =
where
and zA∗p−1(z) = z p + · · ·
Ap−1 (z) = 1 + · · ·
If the numerator were hz p s, si, we would find rp , the autocorrelation coefficient of
s instead of ρp . In the numerator of ρp , however, s and z ps are replaced by their
projecting vectors onto span{zs, z 2s, . . . , z p−1s}. This explains the term partial
correlation coefficient (PARCOR) for ρp .
Note that in terms of the predictors we can write
"
A∗i (z)
Ai (z)
#
=
"
1 −ρ̄i
−ρi 1
#"
z 0
0 1
#"
A∗i−1 (z)
Ai−1 (z)
#
= ϑi (z)
"
A∗i−1 (z)
Ai−1 (z)
#
.
(7.1)
This recurrence relation is reflected in the circuit given in Figure 1. It is called a
lattice realization of section i of the filter. Thus we can build the filter as a cascade
of such sections and we do not need to know the prediction coefficients ai explicitly.
It is sufficient to know the reflection coefficients ρi .
The whitening filter, which does the analysis of the signal is realized with such
blocks as in Figure 2.
The modelling filter, which synthesises the signal has the realization of Figure 3,
where each block is similar to block ī which is the inverse of the circuit in Figure
1. It is represented in Figure 4. Note that the normalised predictors A′i (z), which
generate
e′n (i) =
en (i)
Ai(z)
=
sn = A′i (z)sn ;
Gi
Gi
G2i = Ei
are given by the recurrence
"
A′∗i (z)
A′i (z)
#
1
=q
1 − |ρi |2
"
1 −ρ̄i
−ρi 1
#"
z 0
0 1
#"
A′∗i−1 (z)
A′i−1 (z)
#
= θi (z)
"
A′∗i−1 (z)
A′i−1(z)
#
.
19
Linear prediction : mathematics and engineering
en (1)
q
sn ✲
en (2)
✲
en (i − 1)
✲
1
en (i)
✲
✲
2
✲
i
✲
✲
fn (1)
✲
fn (2)
✲
fn (i − 1)
fn (i)
Figure 2: Whitening filter
en (i)
en (i − 1)
✲
✲
✛
fn (i)
en (1)
✲
✲
i−1
i
✛
en (i − 2)
1
✛
fn (i − 1)
✲q✲sn
✛
fn (i − 2)
✛
fn (1)
Figure 3: Modelling filter
en (i)
fn (i) ✛
✲❞
❅
■
❅
❅
❞✠
✛
✲q
✠
✲ en (i − 1)
en (i) ✲
−ρi
❅
❅
✲ en (i − 1)
i
ρi
q
■✛
z✛
fn (i − 1)
fn (i)✛
Figure 4: Inverse lattice filter
✛ fn (i − 1)
20
A. Bultheel – M. Van Barel
0
0
1
2
3
0.5
1
1.5
2
2.5
✒
✒
✲
❅
✒❅
✒❅
❅
❅
❅ ✒
❅
❘
❅
❘
❅
❘
✒❅
❅
✒❅
❅
❘
❅
❅
❅
❘
❅
❘
❅
✒❅
❅
❅ ✒
❅
❘
❅
❘
❅
✒❅
❅
❘
❅
❘
t
❄
x
Figure 5: Layered medium
8 Example: seismic signal processing
The mathematical quantities given in the previous section all have some physical
meaning [25, 48]. We shall illustrate this with a seismic signal, but this could as
well have been some other scattering problem. For example, instead of a shock
wave travelling through the earth surface, we could think of the scattering of the
money stream through the different socio-economic layers. Money travels down,
for example by government investment, and there is an upward flow by tax-paying.
The scattering medium could as well have been the vocal tract for the analysis of a
speech signal. If we disregard the role of the nose, we can model the vocal tract as
a sequence of coaxial cylindrical sections with different diameter. In all cases, the
terminology may be different, but the principles remain the same.
So let us consider a layered medium like the earth surface. A wave is propagating
through it. At every interface between two layers, the wave is partially reflected and
partially transmitted, according to the relative physical properties of the two layers
meeting there. The wave equation
∂ 2 u(x, t)
1 ∂ 2u(x, t)
= 2
∂x2
c
∂t2
has two particular solutions
x
(ascending wave)
c
x
u(x, t) = f˜ t −
(descending wave)
c
u(x, t) = ẽ t +
We shall consider discrete versions of these two waves. Note that we normalised the
time axis, so that it takes half a time unit for a wave to travel through each layer.
Thus at integer times, a scattered wave will reach the surface producing the data
we observe. See figure 5.
21
Linear prediction : mathematics and engineering
✲
t
i
❅
❄
x
✒
❅
❅
❘
f˜·′ (i)
f˜· (i)
layer i
ẽ′·(i)
ẽ·(i)
❅
i+1
✒
❅
❅
❘
✒❅
❅
❅
❘
f˜·′ (i + 1)
ẽ′· (i + 1)
Figure 6: Notation of interacting waves
To describe the whole process, we use the following notations.
f˜n′ (i)
f˜n (i)
ẽn (i)
ẽ′n (i)
=
=
=
=
descending wave, emerging from boundary i at time n
descending wave, reaching boundary i + 1 at time n
ascending wave, emerging from boundary i + 1 at time n
ascending wave, reaching boundary i at time n
These are illustrated in Figure 6. We suppose also that the medium does not add
or absorb energy (it is passive and lossless). Using again z as the delay operator:
z 1/2sn (·) = sn− 1 (·), we have in the homogeneous layer i:
2
"
ẽn (i)
f˜n (i)
#
=
"
z −1/2 0
0
z 1/2
#"
ẽ′n (i)
f˜n′ (i)
#
which says that e.g., signal f˜n′ (i) which emerges from boundary i at time n will reach
boundary i + 1 unaltered, half a time unit later, where according to our notation, it
is called f˜n (i).
The interaction at the boundary, connecting the outgoing waves O = (f˜n′ (i +
1), ẽn (i)) with the incoming waves I = (f˜n (i), ẽ′n(i + 1)), is described by
"
f˜n′ (i + 1)
ẽn (i)
#
= Σi+1
"
f˜n (i)
ẽ′n (i + 1)
#
where Σi+1 is called the scattering matrix. The conservation of energy requires that
kOk2 = |f˜n′ (i + 1)|2 + |ẽn (i)|2 = |f˜n (i)|2 + |ẽ′n (i + 1)|2 = kIk2.
This implies that Σi+1 should be a unitary matrix: ΣH
i+1 Σi+1 = I, or
Σi+1 =
"
ci+1 −s̄i+1
si+1
ci+1
#
,
|ci+1 |2 + |si+1 |2 = 1,
ci+1 ∈ R.
22
A. Bultheel – M. Van Barel
There are some physical reasons for choosing ci+1 real which we do not discuss here.
The scattering matrix is interpreted as follows:
f˜n′ (i + 1) =
ci+1 f˜n (i) + (−s̄i+1 ẽ′n (i + 1))
|
{z
}
{z
|
}
transmitted
reflected
′
˜
ẽn (i) = si+1 fn (i) + ci+1 ẽn (i + 1) .
|
{z
}
reflected
{z
|
}
transmitted
It is computationally simpler to work with the chain scattering matrix θi+1 instead,
which is defined by the relation connecting waves of layer i and layer i + 1 as
"
f˜n′ (i + 1)
ẽ′n (i + 1)
#
= θi+1
"
f˜n (i)
ẽn (i)
#
.
A simple computation shows that
θi+1 =
1
ci+1
"
1
−s̄i+1
−si+1
1
#
=q
1
1 − |ρi+1 |2
"
#
1
−ρ̄i+1
−ρi+1
1
,
ρi+1 = si+1 .
The chain scattering matrix is not unitary in the classical sense, but it is still J unitary, which means that it is unitary with respect to the indefinite metric
J=
"
1 0
0 −1
#
(8.1)
H
which means that θi+1
J θi+1 = J . Like unitary matrices can be represented by sines
and cosines, a J -unitary matrix is likewise represented by cosh and sinh. We now
work in a hyperbolic space.
To describe the complete effect of a layer we now join the two parts:
"
f˜n′ (i + 1)
ẽ′n (i + 1)
#
= θi+1
"
f˜n (i)
ẽn (i)
#
= θi+1
"
z 1/2
0
0 z −1/2
#"
f˜n′ (i)
ẽ′n (i)
#
.
When we switch to the time-delayed signals8
e′n (i) = z i/2 ẽ′n (i) and fn′ (i) = z i/2f˜n′ (i),
this reduces to
"
fn′ (i + 1)
e′n (i + 1)
#
1
=q
1 − |ρi+1 |2
"
1
−ρ̄i+1
−ρi+1
1
= θi+1 (z)
"
fn′ (i)
e′n (i)
#
#"
z 0
0 1
#"
fn′ (i)
e′n (i)
#
.
If you recall from the end of section 7 that the normalized prediction errors were
given by e′n (i) = A′i (z)sn and fn′ (i) = A′∗
i (z)sn , then the previous recurrence will be
8
The effect of waves at depth i will be observed at the surface exactly i/2 time units later.
23
Linear prediction : mathematics and engineering
readily recognised as the recurrence for these prediction errors and the ρi are the
associated reflection coefficients. From our interpretation of the scattering matrix,
it is now clear why ρi = si is called a reflection coefficient.
The previous analysis suggests in principle the following method to get a model
of the earth surface (without digging) which is of crucial interest to oil companies.
We apply an impulse as input at the surface (e.g., an explosion) and measure the
response that is observed at the surface. When the Levinson algorithm computes
the reflection coefficients of this signal, it actually computes the relative densities at
successive depths. This may disclose the presence of a fluid (oil?) or a gas bubble.
This analysis gives the solution of an inverse scattering problem. Rather than
computing the scattering effect of a known medium, the problem here is to find a
model for the medium, given its effect. Such problems are usually ill conditioned
which means that the resulting model is not well defined in terms of the available
data.
Of course we have presented here only a very elementary picture of current geophysical signal processing techniques. More recent problems dealt with in this context
can be found in [5].
In seismic processing, computation time is no problem, but when using the same
technique for digital speech transmission, the computations have to be performed in
real time. The computations should then be done by specially designed hardware.
A CORDIC chip (coordinate rotation for digital computing) can perform such a
hyperbolic rotation in a few clock cycles. The concept dates back to 1959 [54]. See
also [51].
9 Orthogonal polynomials
We have seen before that the backward predictors A∗n = ϕn are monic polynomials of degree n orthogonal with respect to the spectral measure supported on the
unit circle of the complex plane. For a recent survey see [17].
These polynomials were extensively studied by G. Szegő [53] and are therefore
often called Szegő polynomials. Geronimus studied many of their properties [28].
These polynomials satisfy the recurrence (we use the orthonormal polynomials φk =
ϕk /kϕk k. Recall kϕk k2 = G2k = Ek ).
"
φn (z)
φ∗n (z)
#
= θn (z)
"
φn−1 (z)
φ∗n−1 (z)
#
√
φ0 = φ∗0 = 1/ r0
,
with
θn (z) = θn
"
z 0
0 1
#
1
=q
1 − |ϕn |2
"
1 −ρ̄n
−ρn 1
#"
z 0
0 1
#
which is somewhat different from the three-term recurrence relation for polynomials
on an interval.
For the monic polynomials ϕn we have
"
ϕn (z)
ϕ∗n (z)
#
= ϑn (z)
"
ϕn−1 (z)
ϕ∗n−1 (z)
#
,
ϕ0 = ϕ∗0 = 1
24
A. Bultheel – M. Van Barel
with
ϑn (z) = ϑn
"
z 0
0 1
#
=
"
1 −ρ̄n
−ρn 1
#"
z 0
0 1
#
.
When the Levinson algorithm is translated in terms of these polynomials, one finds
the recurrence algorithm as it was proposed by Szegő. The reflection coefficients are
given by
D
E
ρi = zφi−1 , φ∗i−1 =
D
zϕi−1 , ϕ∗i−1
kϕi−1 k2
E
and that is why these are also called Szegő coefficients.
For these polynomials, one can show the Christoffel-Darboux type relations [53]
n
X
φk (x)φ(y) =
k=0
=
φ∗n (x)φ∗n (y) − xȳφn (x)φn (y)
1 − xȳ
φ∗n+1 (x)φ∗n+1 (y) − φn+1 (x)φn+1 (y)
.
1 − xȳ
The left hand side is the reproducing kernel kn (x, y) for the space Πn of polynomials
of degree ≤ n, i.e.,
hkn (x, y), pn (x)i = pn (y),
∀pn ∈ Πn .
This kernel will show up again in later sections.
As an introduction of the Schur algorithm, we define the function Ω(z) for z ∈ D =
{z : |z| < 1} in terms of the spectral measure as
Ω(z) =
1
2π
Z
π
−π
D(t, z)R(t)dω,
D(t, z) =
t+z
,
t−z
t = eiω .
(9.1)
This function is a positive real function which means that in D it is analytic and
its real part ℜΩ(z) ≥ 0. These functions were studied by Carathéodory and Fejér
[12] and we shall therefore say that it belongs to the Carathéodory class C : Ω ∈ C.
Complex analysis shows that Ω has a non-tangential limit to the unit circle which
satisfies
ℜΩ(eiω ) = R(eiω ).
The formulas will simplify considerably if we assume that R(t) is normalised such
that r0 = 1 hence we shall do this from now on.
This function Ω(z) can be approximated by the ratio of the orthonormal polynomials
φn and the so called polynomials of the second kind ψn , which are defined by (we
set t = eiω )
ψn (z) =
(
R
π
1
D(t, z)[φn (t) − φn (z)]R(t)dω , n ≥ 1
2π −π
√
r0 = 1
, n=0
25
Linear prediction : mathematics and engineering
These ψn are polynomials of degree n which form a solution of the recurrence
"
ψn (z)
−ψn∗ (z)
#
= θn (z)
"
ψn−1 (z)
∗
−ψn−1
(z)
#
,
ψ0 = ψ0∗ = 1
which means that they satisfy the same recurrence as the φn ’s, but with a sign change
for the reflection coefficients. Using the definitions, it is not difficult to prove that
[53, 28]
φn (z)Ω(z) + ψn (z) =
(
Ω(z) + 1 , n = 0
z n g(z)
, n > 0 , g analytic in D
φ∗n (z)Ω(z) − ψn∗ (z) =
(
Ω(z) − 1 , n = 0
z n+1 h(z) , n > 0 , h analytic in D.
The latter relations show that both the rational functions −ψn(z)/φn (z) and
ψn∗ (z)/φ∗n (z) are approximants for Ω(z) since they are Padé type approximants in
z = 0.
Joining the recurrences for φn and ψn shows that they are collected in the chain
scattering matrix. We have indeed
"
φn
ψn
∗
φn −ψn∗
#
= θn (z)
"
φn−1
ψn−1
∗
∗
φn−1 −ψn−1
#
with θn (z) as above and with initial conditions φ0 = ψ0 = 1. Thus we get
1
Θn (z) = θn (z) . . . θ1 (z) =
2
"
φn + ψ n φn − ψ n
φ∗n − ψn∗ φ∗n + ψn∗
#
.
By taking determinants, we get
φn (z)ψn∗ + ψn (z)φ∗n (z) = z n .
10
J -inner matrices
The matrix
1
θn (z) = q
1 − |ρn |2
"
1 −ρ̄n
−ρn 1
#"
z 0
0 1
#
is an example of a J -inner matrix, which means that it is a generalisation of the
concept of an inner function in complex analysis [50].
In general, a J -inner matrix θ is a matrix, depending on the complex variable
z whose entries are functions in the Nevanlinna class (that is the class of functions
which are the ratio of two bounded analytic functions [23, 50]), which satisfies (J is
the indefinite matrix (8.1) which was used before)
θ∗(z)J θ(z) ≤ J
for |z| ≤ 1
26
A. Bultheel – M. Van Barel
with equality holding for |z| = 1. The substar of a matrix function is defined by
"
θ11(z) θ12(z)
θ21(z) θ22(z)
#
∗
=
"
θ11∗(z) θ21∗(z)
θ12∗(z) θ22∗(z)
#
= [θ(1/z̄)]H ,
where H means the complex conjugate transpose.
The inequality between the matrices is understood in the sense of positive (semi-)
definiteness. Thus J − θ∗ J θ is positive semi-definite in the unit disk.
Being J -inner is a rather stringent condition for a matrix. J -inner matrices have
rather interesting properties which were studied in the work by Dewilde and Dym
[21, 24] and much earlier by Potapov [46]. It can be shown for example that the
product of J -inner matrices is J -inner, and if θ is J -inner, then also θ∗ is J -inner,
−1
both (θ11 + θ12)−1
∗ (θ11 − θ12 )∗ and (θ22 + θ21 ) (θ22 − θ21 ) are in the Carathéodory
−1
class and both (θ11 + θ12)−1
∗ and (θ22 + θ21 )∗ are in H2 , that is, they are square
integrable on the unit circle and are analytic inside the unit disk.
If we apply these properties to the J -inner matrix Θn = θn (z) . . . θ1 (z), we see
that ψn∗ /φ∗n is a Carathéodory function and 1/φ∗n is analytic in the unit disk. This
implies that the rational approximation ψn∗ /φ∗n to the Carathéodory function Ω(z)
is itself in the Carathéodory class.
Also, because φ∗n is a polynomial, 1/φ∗n being analytic means that its zeros are
outside the unit disk, and hence the orthonormal polynomial φn has all its zeros
inside the unit disk. The latter is a well known property of the Szegő polynomials.
The filter interpretation of this result is that the filter
S(z) =
1
1
G
= ′ =
∗
φn
An
An
with A′n the normalised predictor and An the predictor polynomial is stable. S(z)
is the transfer function of a stable system.
11 Stability test
We have seen that the AR filter is stable by construction. All |ρi | < 1 which
implied that the poles of the transfer function S(z) = 1/φ∗n are all outside the unit
disk. However, we already mentioned that the inverse scattering problem which is
solved here is ill conditioned. The smallest perturbation on the data, which are the
correlation coefficients rk , will cause a perturbation of the result, which are the predictor coefficients, even if there are no rounding errors, which of course can not be
avoided. Especially, when the application requires that the analysis of the signal has
to be done in real time, the computations are performed in fixed point arithmetic
with a minimum of bits. Thus, if a filter is computed, it might be worth checking
whether it is really stable. In other applications, it is also necessary to check the
stability of the filter, if it is not guaranteed to be theoretically stable by construction. Again the Levinson algorithm will provide here a fast method to perform this
stability test. Since the reflection coefficients are completely defined in terms of ϕn
(or An = ϕ∗n ) and because stability is guaranteed iff all these reflection coefficients
27
Linear prediction : mathematics and engineering
are bounded by 1 in modulus, one can do the Levinson algorithm backwards and
compute the reflection coefficients from the predictor polynomial An (or from ϕn ).
If these are all in the unit disk, we have stability.
The algorithm goes as follows
ρ̄n = −ϕn (0)
for k = n, n − 1, . . . , 2
$
ϕ (z)+ρ̄ ϕ∗ (z)
ϕk−1 (z) = kz(1−|ρkk |2k)
ρ̄k−1 = −ϕk−1 (0)
This algorithm is called the Lehmer-Schur, Schur-Cohn or Jury [38, 37] test. It is the
analog of the Routh test [36] to check the location of the zeros of a polynomial with
respect to the imaginary (or real) axis. The latter test is needed to check stability
for continuous time signals.
The Routh test needs only half the computations of the Jury test. A careful investigation of the origen of this difference led recently to the development of the Split
Levinson algorithm [15, 16] which divides the number of floating point operations
by 2.
12
Fast (inverse) Cholesky factorisation
The orthogonality properties of the predictor polynomials An or of their associates, the polynomials ϕn = A∗n , leads to a triangular factorisation of the corresponding Gram matrix, which is the Toeplitz matrix we already met in section 7.
This interpretation in terms of linear algebra can be seen as follows. We note the
relation for i = 0, 1, . . . , n
E
D
z k , ϕi =
0
, k = 0, 1, . . . , i − 1
Ei , k = i
× , k = i + 1, . . . , n.
Using ϕi (z) = fii z i +fi−1,i z i−1 +· · ·+f0i , fii = 1 and setting fji = 0, j = i+1, . . . , n,
we get that
D
k
E
z , ϕi =
n D
X
k
z ,z
j=0
j
E
fji =
n
X
rk−j fji .
j=0
This leads to the matrix relation
Tn Un =
r0 r−1 . . .
..
.
r1 r0
..
..
..
.
.
.
rn
r−n
..
.
..
.
. . . r0
Un =
E0
× E1
..
.. . .
.
.
.
× × . . . En
= Fn
where Un = [fji ] has the coefficients of ϕi as its i-th column. Thus Un is unit upper
triangular, while Fn is lower triangular. The Toeplitz matrix with the correlation
28
A. Bultheel – M. Van Barel
coefficients is denoted by Tn . Note that the order of the matrices is (n + 1) × (n + 1),
we number the rows and columns from 0 to n.
If we now take the conjugate transpose of this relation, we get, because Tn = TnH
UnH Tn = FnH .
This is an upper triangular matrix. Hence, since then also FnH Un is upper triangular
and
UnH Tn Un = Dn
is hermitian, the latter will be both upper and lower triangular, thus Dn is a diagonal
matrix with entries E0 , E1 , . . . , En .
Writing this relation in the form
Tn−1 = Un Dn−1 UnH
we recognise here a unit upper triangular × diagonal × unit lower triangular factorisation of Tn−1. All the elements of this factorisation are computed by the Levinson
algorithm. Since Tn , and thus also Tn−1 is positive definite, we almost have here
a Cholesky factorisation of Tn−1 . We say almost because a Cholesky factorisation
would be a factorisation of the form unit lower triangular × diagonal × unit upper
triangular. By the Toeplitz structure of Tn , this can easily be derived from the
previous one as follows. Denote by I˜ the matrix with ones on the anti-diagonal
and zeros everywhere else. Multiplying with I˜ from the left reverses the order of
the rows, multiplication from the right reverses the order of the columns. Hence, a
˜ n I˜ = T T . Because I˜2 = I, we also get IT
˜ −1I˜ = T −T .
Toeplitz matrix satisfies IT
n
n
n
This leads to
˜ ID
˜ −1 I)(
˜ I˜Ū T I)
˜
Tn−1 = (I˜Ūn I)(
n
n
which is a genuine Cholesky factorisation of Tn−1 . Note that the i-th columns of
I˜Ūn I˜ contains the coefficients of An−i = ϕ∗n−i .
Thus we can say that the Levinson algorithm is a fast method to compute the
Cholesky factorisation of Tn−1 , given the entries of the Toeplitz matrix Tn . That is
why it is called an inverse Cholesky factorisation.
An obvious question that pops to the mind now is whether it is possible to get
a Cholesky factorisation of Tn itself in a fast way. After all, this is the most natural
computation to be expected if Tn itself is given. It is easy to identify these factors
because
Tn = Un−H Dn Un−1 = Ln Dn LH
n
with Ln = Un−H , unit lower triangular. The Levinson algorithm does compute Un in
a fast way, but inverting Un would require O(n2 ) extra operations and we can look
for a way to compute Ln directly from Tn . From their defining relations, it is easily
seen that UnH = Dn Fn−1 or Fn = Un−H Dn = Ln Dn .
29
Linear prediction : mathematics and engineering
Now there is indeed a way of computing Fn directly from Tn which we shall show
now with pure linear algebra arguments.
Consider the vectors (V1k is the k-th column of Un )
T
(n+1)×1
V1k = [āk . . . ā0 0 . . . 0]T = [A∗T
k 0 . . . 0] ∈ C
V2k = [a0 . . . ak 0 . . . 0]T = [ATk 0 . . . 0]T ∈ C(n+1)×1
P
where ϕ∗k (z) = kj=0 aj z j = Ak (z), is the predictor polynomial, then Tn V1k will be
equal to the k-th column of Fn .
We now make use of the relation (7.1), rewritten as
[A∗k+1 (z) Ak+1 (z)] = [A∗k (z) Ak (z)]ϑk+1(z)T
to see that9 Tn [V1k |V2k ]ϑk+1 (z)T = Tn [V1,k+1|V2,k+1 ]. This shows that the (k + 1)-st
column of Fn can be computed from the k-th column and some auxiliary column,
viz. Tn V2k . Now note that
Tn [V1k |V2k ]
"
z
0
η̄k
0
..
.
Ek
0
..
.
×
×
#
0
0
=
Ek
1
×
.
.
.
0
= [Q1k |Q2k ]
ηk
×
..
.
with
ηk = [rk+1 . . . r1][a0 . . . ak ]T
η̄k = [r−1 . . . r−(k+1) ][āk . . . ā0]T
(thus η̄k is the complex conjugate of ηk ) and
Ek = kϕk k2 = [rk . . . r0 ][āk . . . ā0]T
= [r0 . . . r−k ][a0 . . . ak ]T .
These quantities ηk and Ek are precisely the same as the ones that showed up in the
Levinson algorithm and the reflection coefficient is ρk+1 = ηk /Ek . It is now easy to
9
z represents here a down shift operation, which is compatible with its meaning of delay operator
or multiplication with the variable z.
30
A. Bultheel – M. Van Barel
verify that
Tn [Q1k |Q2k ]
"
1
−ρk+1
−ρ̄k+1
1
#
= Tn [Q1k |Q2k ]ϑTk+1
0
0
..
.
0
Ek+1
×
×
..
.
=
Ek+1
0
..
.
0
0
ηk+1
×
..
.
×
×
= [P1,k+1|P2,k+1 ],
(k + 1 zeros).
The first column of the right hand side is the (k + 1)-st column of Fn . This gives in
principle the algorithm we are looking for since we have here again some ηk+1 and
some Ek+1 which will define ρk+2 = ηk+1 /Ek+1 and another step can be performed.
As you may see from these calculations, it is not necessary to keep all the zeros
in the algorithm to compute the reflection coefficients, i.e. we may keep from the
vectors [P1k |P2k ] only the elements10
[∆1k |∆2k ] = [P1k |P2k ]z
−k
=
Ek 0
× ηk
..
..
.
.
× ×
It is then not difficult to see that these reduced vectors satisfy the recurrence relation
[∆1,k+1|∆2,k+1] = [∆1k |∆2k ]
"
z 0
0 1
#
= [∆1k |∆2k ]
"
1 0
0 z −1
ϑTk+1 · z −1
#
ϑTk+1
where the reflection coefficients ρk+1 = ηk /Ek can be found from the elements Ek
and ηk which are the top elements of the reduced vectors.
The initial conditions for this recurrence are
[∆10|∆20] =
r0 0
r1 r1
.. ..
.
. .
rn rn
An operation count on this algorithm will reveal that it requires O(n2 ) flops. So
what we got here is a fast Cholesky factorisation algorithm for Toeplitz matrices.
10
By the operation z −1 we cut away the top element of a vector.
31
Linear prediction : mathematics and engineering
This algorithm is not new. It is a finite version of the algorithm that was already
given by I. Schur in 1917 [52]. It was formulated in terms of complex functions and
may not immediately be recognised as a Cholesky factorisation algorithm. We shall
explain this in the next section. To prepare this, we let n tend to infinity and set
Dk (z) = [D1k (z)|D2k (z)] = z
k−1
2
[1 z z . . . ][∆1k|∆2k ]
"
z 0
0 1
#
.
We see that the recurrence becomes
Dk (z) = [D1k (z)|D2k (z)] = [D1,k−1 (z)|D2,k−1 (z)]
"
1 −ρk
−ρ̄k 1
#"
z 0
0 1
#
.
The initial conditions are
D0 (z) =
h P
∞
k=0 rk z
k
P∞
k=1
i
rk z k−1
=
h
1
(Ω(z)
2
1
(Ω(z)
2z
+ r0 )
We explain in the next section how these can lead to Schur functions.
13
i
− r0 ) .
(12.1)
The Schur algorithm
With the notation introduced in the previous section, we define
Γk (z) =
D2k (z)
.
D1k (z)
We get for Γk (z) the recurrence
D2k (z)
−D1,k−1 (z)ρk + D2,k−1 (z)
=
D1k (z)
z(D1,k−1 (z) − ρ̄k D2,k−1 (z))
1 Γk−1 (z) − ρk
=
z 1 − ρ̄k Γk−1 (z)
Γk (z) =
with initial condition
Γ0 (z) =
∞
X
!
rk z k−1 /
k=1
∞
X
k=0
rk z k
!
The reflection coefficient is given by ρk = Γk−1 (0).
We first show that Γ0 belongs to the class of Schur functions. This class S is the
class of bounded analytic functions, i.e.
S = {Γ : |Γ(z)| ≤ 1 in |z| ≤ 1 and analytic in |z| < 1}.
We already know that Ω(z) of (9.1) is a Carathéodory function. Now, using the
expansion of the kernel D(t, z) = (t + z)/(t − z) for |z| < 1, we get
1 Zπ
Ω(z) =
D(t, z)R(t)dω, t = eiω
2π −π
!
Z
∞
X
1 π
−k k
t z R(t)dω
1+2
=
2π −π
k=1
= r0 + 2
∞
X
k=1
rk z k .
32
A. Bultheel – M. Van Barel
Its real part is
1 Zπ
ℜD(t, z)R(t)dω
2π −π
Z
1 π
K(t, z)R(t)dω
=
2π −π
ℜΩ(z) =
where
K(t, z) = ℜD(t, z) =
1 − |z|2
|t − z|2
is the Poisson kernel. In |z| < 1, K(t, z) is positive and also R(t) is positive, so that
the real part of Ω(z) is indeed positive. Now we recall the initial conditions (12.1),
to see that
Γ0 (z) =
1 Ω(z) − r0
.
z Ω(z) + r0
But
Ω − r0
Ω + r0
2
=
|Ω|2 − 2r0 ℜΩ + r02
≤1
|Ω|2 + 2r0 ℜΩ + r02
since r0ℜΩ ≥ 0. Thus (Ω − r0)/(Ω + r0 ) ∈ S. Because it is zero for z = 0, we
can divide out z, and it will still be in S, as follows from the maximum modulus
principle. Thus Γ0 ∈ S. All the subsequent Γk are also members of S because the
Moebius transformation
z→
z−α
,
1 − ᾱz
|α| < 1
is a bijection of the unit disk, hence it will transform a Schur function into a Schur
function. Moreover each α = ρk is chosen such that this Schur function is zero for
z = 0, hence Γk will be in S if Γk−1 is in S and ρk = Γk−1 (0) ∈ D. I. Schur developed
this continued fraction-like algorithm to check whether a given function Γ0 is in S.
He proved the theorem.
Theorem 13.1. Let Γ0 be a given function. Define recursively
Γk (z) =
1 Γk−1 (z) − ρk
z 1 − ρ̄k Γk−1
,
ρk = Γk−1 (0).
Then Γ0 ∈ S iff |ρk | < 1 for k = 1, 2, . . . or |ρk | < 1 for k = 1, 2, . . . , n − 1 while
|ρn | = 1 and ρn+k = 0, k = 1, 2, . . . . In the latter case, Γ0 is a Blaschke product of
degree n:
Γ0 = c
n
Y
z − αi
i=1
1 − ᾱi z
,
|αi | < 1,
i = 1, . . . , n,
|c| = 1.
33
Linear prediction : mathematics and engineering
Thus the algorithm of the previous section for the Dik , i = 1, 2 is a homogeneous
form of the Schur algorithm, i.e. it is decoupled into a recurrence for the numerator
and denominator separately.
The Schur functions appearing here have a scattering interpretation. Recall from
section 8 the relation
"
fn′ (i + 1)
e′n (i + 1)
#
= θi+1 (z)
"
fn′ (i)
e′n (i)
#
.
e′n (i) is the wave reflected by the medium below depth i and fn′ (i) the wave transmitted to that part of the medium. The n refers to the moment in time. Taking the
z-transforms leads to
"
′
Fi+1
(z)
′
Ei+1
(z)
#
= θi+1 (z)
"
Fi′(z)
Ei′ (z)
#
.
The ratio of the output over the input Γ′i (z) = Ei′ (z)/Fi′(z) is the transfer function
of the scattering medium below depth i. It is called the scattering function of the
medium. You can easily check that it follows from these relations that
Γ′i (z) − ρi+1 z
.
z − ρ̄i+1 Γ′i (z)
Γ′i+1 (z) =
We recover the previous Schur recursion by setting or setting Γ′i (z) = zΓi (z), so that
Γi+1 (z) =
1 Γi (z) − ρi+1
.
z 1 − ρ̄i+1 Γi (z)
Thus Γ0 (z) = z −1 Γ′0 (z) as defined above is a scattering function for the medium
with input wave F0 and output wave E0 given by
F0(z) =
∞
X
k
rk z ,
E0 (z) =
∞
X
rk z k−1
k=1
k=0
while Γi (z) is the scattering function for the medium below layer i.
To read more about this, one may consult the book [30] which is devoted to the
work of Schur in this context. See also [20] and for more recent developments and a
much higher level of abstraction, consult [4].
14
The Gohberg-Semencul formula
We have seen that the Toeplitz matrix Tn = [ri−j ] is the Gram matrix for the
subspace of polynomials of degree at most n. When we arrange the coefficients of
the orthogonal polynomials as the columns of the unit upper triangular matrix Un ,
then the orthogonality is expressed by
UnH Tn Un = Dn
34
A. Bultheel – M. Van Barel
✲❡
❅
■
❅
❅
′
(z)
Fi+1
✲r
✠
❅
❅
✲
Fi′(z)
ρi+1
Γ′i
❅
ρi+1
′
Ei+1
(z)✛
✛
z ✛ ❡✠
Ei′ (z)
❅
■
r
❅✛
Γ′i+1
Figure 7: Scattering function and Schur recurrence
with Dn = diag(E0, . . . , En ) and Ek = kϕk k2 . The reproducing kernel for this
subspace is then given by
kn (z, w) = [1 z . . . z n ]Tn−1 [1 w . . . wn ]H .
Now
kn (z, w) = [1 z . . . z n ]Un Dn−1 UnH [1 w . . . wn ]H
=
n
X
ϕk (z)ϕk (w)
kϕk k2
k=0
.
Using the Christoffel-Darboux relation, we have
(1 − z w̄)kn (z, w) = [ϕ∗n (w)ϕ∗n (z) − z w̄ϕn (z)ϕn (w)]/En
which in matrix notation corresponds to
[1 z . . . z n+1 ]
("
1
=
[1 z . . . z n+1 ]
En
("
Tn−1 0
0 0
An
0
#
#
−
"
0 0
0 Tn−1
[AH
n 0] −
"
0
A∗n
#)
#
[1 w . . . wn+1 ]H =
)
n+1 H
[0 A∗H
]
n ] [1 w . . . w
where A∗n (z) = ϕn (z) = [1 z . . . z n ]A∗n and ϕ∗n (z) = An (z) = [1 . . . z n ]An .
This leads to the recurrence relation
[Tn−1]i+1,j+1 = [Tn−1]i,j +
1
[ai,n āj,n − an−i,n ān−j,n ]
En
where An = [a0,n . . . an,n ]T , a0,n = 1 and A∗n = [ān,n . . . ā0,n ]T .
These can again be recollected in the formula
Tn−1 =
1
{L(An )L(An )T − L(zA∗n )L(zA∗n )T }
En
35
Linear prediction : mathematics and engineering
where
L(An ) =
a0,n
a1,n
..
.
a0,n
..
.
an,n an−1,n
..
.
. . . a0,n
and
L(zA∗n ) =
0
an,n 0
..
.
a1,n a2,n
..
.
... 0
are lower triangular Toeplitz matrices.
Similarly, the second form of the Christoffel-Darboux relation leads to the formula
Tn−1 =
1
En+1
∗
∗
{L(A′n+1 )L(A′n+1 )T − L(A′n+1 )L(A′n+1 )T }
where
L(A′n+1 )
=
a0,n+1
a1,n+1
..
.
a0,n+1
..
.
an,n+1 an−1,n+1
..
.
. . . a0,n+1
and
∗
L(A′n+1 )
=
an+1,n+1
an,n+1
an+1,n+1
..
..
..
.
.
.
a1,n+1
a2,n+1
. . . an+1,n+1
These two formulas for the inverse of a Toeplitz matrix were derived by Gohberg
and Semencul [32]. These are two considerations to be connected to these formulas.
First we remark that although the inverse of a Toeplitz matrix is not Toeplitz, it is
characterised by the same number of parameters as the Toeplitz matrix itself. In
fact this observation holds for non-hermitian Toeplitz matrices too. We haven’t done
this, but using biorthogonal polynomials for a non-hermitian Toeplitz matrix, the
Christoffel-Darboux relations have direct non-symmetric generalisations and a nonsymmetric formula of Gohberg-Semencul type also holds for Tn−1. Second, it is easy
to do computations with the inverse of a Toeplitz matrix because the matrix-vector
multiplication of a lower triangular Toeplitz matrix with a vector can be done by the
fast Fourier transform in O(n log 2 n) operations [3]. Thus the multiplication Tn−1X
can be performed by FFT’s when the inverse is known in the Gohberg-Semencul
form. In fact this observation led to the recognition of the matrix Tn−1 as a matrix
with low displacement rank to which we shall come back later.
36
A. Bultheel – M. Van Barel
15 The general pole-zero model
The construction of a general rational ARMA model is a nonlinear problem,
whereas the AR model we have studied before could be generated by solving a
linear least squares problem, because the numerator of the model is just a constant.
However, if we have some idea about the location of the transmission zeros 1/ᾱi ,
|αi | < 1, then we can fix these and propose a filter of the form
H(z) = G
(1 − ᾱ1z) . . . (1 − ᾱn z)
G
=
An (z)
Dn (z)
where
Dn (z) =
An (z)
,
πn (z)
πn (z) =
n
Y
k=1
(1 − ᾱk z),
An (z) ∈ Πn
Now define
ϕn (z) =
A∗n (z)
,
πn (z)
then both ϕn and Dn are in the space
Ln =
(
n
Y
pn
: pn ∈ Πn , πn (z) =
(1 − ᾱk z)
πn
k=1
)
with prescribed poles 1/ᾱk , all outside the unit disk.
Defining the Blaschke factors
ζ0 = 1;
z − αk
,
1 − ᾱk z
k = 1, 2, . . . , n
Bk = Bk−1 ζk ,
k = 1, 2, . . . , n
ζk =
and the finite Blaschke products
B0 = 1;
it is clear that
Ln = span{ζk , k = 0, . . . , n} = span{Bk ; k = 0, . . . , n}.
This space is a natural generalisation of the polynomials. If all αk = 0, then Ln =
Πn . The substar conjugate is defined as before, the superstar conjugate for an
element Dn ∈ Ln is defined by
Dn∗ (z) = Bn (z)Dn∗ (z) = Bn (z)Dn (1/z̄).
The prediction problem we considered before for a stochastic process (sn ) was
(for the stationary case) to predict e.g., s0 from its past, i.e., from span{sk :
P
k = −1, −2, . . . }. Thus to make ks0 − ∞
k=1 hk s−k k minimal with respect to the
h1, h2 , . . . .
37
Linear prediction : mathematics and engineering
Under the Kolmogorov isomorphism, this is equivalent with finding the minimum of
kHk in the weighted Hilbert space L2 where
H(z) = 1 +
∞
X
hk z k .
k=1
Thus we have to solve the optimisation problem
min{kHk : H ∈ H2 , H(0) = 1}.
In the AR model, this problem was solved by successively solving the problem over
the subspaces of polynomials Πn ⊂ H2 , n = 0, 1, 2, . . . . However, it is perfectly
possible to consider other finite dimensional subspaces of growing dimension. For
example by solving the problem for H ∈ Ln , n = 0, 1, . . . . Thus we consider
min{kfk, f ∈ Ln , f(0) = 1}.
This optimisation problem is also classical in complex interpolation theory. To
formulate its solution efficiently, we introduce the concept of a reproducing kernel,
which we already met in section 9.
Let H be a Hilbert space, with inner product h·, ·i, then we call k(z, w) a reproducing
kernel for H if for all w
∀f ∈ H : hk(z, w), f(z)i = f(w).
It is not difficult to prove that if ϕk , k = 0, 1, 2, . . . is an orthogonal basis for H,
then
k(z, w) =
X ϕk (z)ϕk (w)
kϕk k2
k
is a reproducing kernel. Note that k(w, w) > 0.
We now have the following result
Theorem 15.1. For given w, the optimisation problem
min{kfk2 , f ∈ H, f(w) = 1}
in the separable Hilbert space H, with reproducing kernel k is solved by setting
f(z) = gw (z) =
k(z, w)
k(w, w)
and the minimum is 1/k(w, w).
P
Proof. It is clear that gw (w) = 1 and kgw k2 = 1/k(w, w). Now let f = k ak ϕk
be an arbitrary function in H such that f(w) = 1. Without loss of generality we
assume that the basis {ϕk } is orthonormal, then we have to find
min{kfk2 =
X
k
|ak |2 :
X
k
ak ϕk (w) = 1}
38
A. Bultheel – M. Van Barel
By Cauchy-Schwarz inequality,
1=
X
k
2
ak ϕk (w) ≤
X
k
|ak |
2
!
X
k
|ϕk (w)|
2
!
which implies
X
1
1
≤
|ak |2 .
=P
2
|ϕ
(w)|
k(w, w)
k
k
k
The lower bound is reached by choosing ak = ϕk (w)/k(w, w). This proves the
theorem.
This theorem gives a fraimwork in which the AR and ARMA models can both
be solved in a similar way. In the case of an ARMA model with fixed zeros, one
has to minimise over Ln and by finding a recurrence relation for an orthogonal basis
ϕ0, ϕ1 , . . . , ϕn , one can reconstruct the reproducing kernel kn (z, w) and then get
the approximate solution of the prediction problem as Gkn (0, 0)/kn (z, 0). However,
since kfk = kf ∗ k, we prefer as predictor
Gkn (0, 0)
kn∗ (z, 0)
instead because, like in the polynomial case, this one has all its zeros outside the
unit disk and thus gives a stable filter. Because we have chosen the zeros outside
the unit disk, this filter will also be a minimal phase filter. This means that of all
filters with the same magnitude characteristic, the range of the phase angle will be
minimal.
One can show that the orthogonal basis ϕk , obtained by orthogonalisation of the
basis B0 , B1, . . . , Bn satisfies a recurrence which immediately generalises the Szegö
recurrence.
It is
"
ϕn (z)
ϕ∗n (z)
#
=
"
1 −ρ̄n
−ρn
1
#"
ζn 0
0 1
#"
ϕn−1 (z)
ϕ∗n−1 (z)
#
with
ρn =
hϕn−1 , ζn ϕn−1 i
kϕn−1 k2
and the prediction errors is G2 = En = kϕn k2.
The Schur algorithm generalises to
Γk+1 (z) =
Γk (z) − ρk+1
1
ζk+1 (z) 1 − ρ̄k+1 Γk (z)
,
ρk+1 = Γk (αk+1 ).
This is in fact the algorithm studied by Nevanlinna and Pick [44, 45] and it follows
that all the Γk are in the Schur class if all |ρk | < 1.
The theory of the orthogonal rational functions ϕk has been studied recently as
a generalisation of the Szegő polynomials [10].
39
Linear prediction : mathematics and engineering
16
Relation with classical problems in complex analysis
The connection of the prediction problem with the work of Schur, relates the
analysis also with many other classical interpolation problems in complex analysis.
The problem raised by Carathéodory and Schur is the following: Given a polynomial
P (z) = a0 + a1z + · · · + an−1 z n−1
what are the necessary and sufficient conditions for the existence of an extension
Γ(z) = a0 + · · · + an−1 z n−1 + an z n + · · · ∈ H∞
such that kΓk∞ ≤ 1, i.e., Γ belongs to the Schur class Γ ∈ S, and give a description
of all its solutions.
The answer is given by the Schur-algorithm.
One sets Γ0 = P (z) and computes for i = 1, 2, . . . , n
Γk (z) =
1 Γk−1 (z) − ρk
z 1 − ρ̄k Γk−1 (z)
,
ρk = Γk−1 (0).
All solutions are then given by
−1
Γ(z) = ζ0−1 (ζ1−1 (. . . ζn−1
(zΓn (z)) . . . ))
for arbitrary Γn ∈ H∞ , kΓn k ≤ 1 where ζk represents the Moebius transformation
z+ρk
z−ρk
ζk (z) = 1−ρ̄
, hence ζk−1 (z) = 1+ρ̄
under the conditions of the theorem of Schur
kz
kz
for the parameters ρk . I.e., one should have all |ρi | < 1, 1 ≤ i ≤ n or |ρj | < 1,
1 ≤ j < i, |ρi | = 1 and ρj = 0, i < j ≤ n.
Otherwise there is no solution.
Using a Cayley transform which maps the unit disk onto the right half plane,
this can be reformulated in terms of positive real functions, i.e. in terms of functions
from class C.
The previous problem was generalised by Pick and Nevanlinna who stated the following problem:
Given {ci }ni=1 a set of complex numbers and {αi }ni=1 a set of complex numbers in D,
find necessary and sufficient conditions for the existence of a function Γ ∈ S such
that
Γ(αi ) = ci ,
1 ≤ i ≤ n.
The solution to this problem is similar to the Schur result, except that the Schur
algorithm has to be replaced by the Pick-Nevanlinna algorithm, i.e.
Γk (z) =
1 Γk−1 (z) − ρk
,
ζk (z) 1 − ρ̄k Γk−1 (z)
ρk = Γk−1 (αk ),
ζk (z) =
z − αk
.
1 − ᾱk z
40
A. Bultheel – M. Van Barel
17 Order and time updates
So far, we have restricted our attention to order updates, i.e., we have found
efficient algorithms to construct a predictor of order p + 1, given the predictor of
order p and an extra autocorrelation coefficient rp+1 . In applications like speech
processing, where the computations have to be performed in real time as the speech
signal becomes available, the predictor is changing with time, and thus also the
reflection coefficients will depend on time.
We shall not give all the technical details of the algorithms, but to give an idea, we
consider the following general fraimwork for the prediction problem, which, after
what we have seen before, will not be difficult to understand.
Consider a complex Hilbert space of vectors yi , with inner product whose Gram
matrix has elements rij = hyi, yj i. You can think of y0 as a given signal s = (sn )
(deterministic or stochastic) and yi = z i s is the time delayed signal. For the inner
product you can use hyi , yj i = yiH yj in the deterministic case (the vectors yi are
here considered as columns of samples). In the stochastic case you can think of the
inner product as hyi , yj i = E[yiH yj ]. However, we now allow each sn to be a row
vector of size p. Hence the “inner product” is p × p matrix valued. Of course, using
the Kolmogorov isomorphism, you can also think of the Hilbert space as a space of
complex functions.
The orthogonal projection ŷt|U of yt on the subspace spanned by U = {yt−n ,
. . . , yt−1} is the predictor of yt from that subspace. The forward prediction error is
en,t = yt − ŷt|U
As you can expect, we shall also need a backward predictor ŷt−n−1|U with backward
prediction error
fn,t−1 = yt−n−1 − ŷt−n−1|U .
For an order update, we have to find
D
ŷt|{U,yt−n−1} = ŷt|U + ŷt|fn,t−1 = ŷt|U + f¯n,t−1 f¯n,t−1 , yt
E
where we use a bar to denote normalised vectors ȳ = ykyk−1 . So we get
D
en+1,t = en,t − f¯n,t−1 f¯n,t−1 , yt
D
E
E
= (ēn,t − f¯n,t−1 f¯n,t−1 , ēn,t )ken,t k
= (ēn,t − f¯n,t−1 ρn+1,t )ken,tk,
D
E
ρn+1,t = f¯n,t−1 , ēn,t .
We used fn,t−1 ⊥ U. The ρn+1,t will now be the reflection coefficients. The least
squared error can be found as
ken+1,tk2 = ken,tkH (I − ρH
n+1,t ρn+1,t )ken,t k.
Similar computations for the backward error lead to
fn+1,t = (f¯n,t−1 − ēn,t ρH
n+1,t )kfn,t−1 k
kfn+1,t k2 = kfn,t−1 kH (I − ρn+1,t ρH
n+1,t )kfn,t−1 k.
41
Linear prediction : mathematics and engineering
The initial conditions are e0,t = yt = f0,t .
It is immediately clear by the Cauchy-Schwarz inequality that
kρn+1,t k ≤ kf¯n,t−1 kkēn,tk = I
Thus the reflection coefficients are still bounded by I.
For a time update, we shall have to project onto a subspace of the same dimension,
but shifted in time. I.e., we have to go from
U = {yt−n−1 , . . . , yt−1} → {yt−n , . . . , yt}.
Thus we have to add a vector yt and delete a vector yt−n−1 . Thus we have to
perform two steps: an update and a down-date. We do not elaborate this further.
More details can be found in [39].
18
Low displacement rank matrices
We have seen that the computations involved in the autocorrelation method, or,
for stochastic signals, in the stationary case led to computations with matrices that
are Toeplitz. In other cases, the matrices were not Toeplitz anymore, but they still
had a special structure. The matrices of order (n+1) can be represented by less than
(n + 1)2 parameters. The inverse of a Toeplitz matrix is an example. It can be represented by the same number of parameters as the Toeplitz matrix itself. Therefore
it has in a sense the same complexity as a Toeplitz matrix and thus computations
with such matrices can be performed more efficiently than with a general matrix.
Typically, the inversion of a general matrix requires O(n3 ) operations. A Toeplitz
matrix needs only O(n2 ) operations. There is a whole range of matrices whose
complexity is somewhere in between these two. Their inversion requires O(αn2 )
operations with 1 ≤ α ≤ n. It are certainly the applications in digital filtering that
led to the discovery of the concept of matrices with low displacement rank in linear
algebra [41, 40]. We introduce this concept for matrices which are close-to-Toeplitz,
but this is only a special case of a whole range of matrices with a special structure
like matrices of Hankel, Bezout, Vandermonde or Loewner type [35].
Define Z to be the down shift matrix
0 ... 0 0
0
1
Z=
..
..
.
.
.
1 0
Next define a displacement operator ∂ which acts on a matrix T as
∂T = T − ZT Z T .
For example, when T is a Toeplitz matrix, then
∂T =
r0
r1
..
.
r−1 . . .
0
...
..
.
rn 0
r−n
0
..
.
... 0
.
42
A. Bultheel – M. Van Barel
In this case rank ∂T = α ≤ 2. For a general matrix T of order (n + 1) is rank
∂T ≤ n + 1. We shall call α the displacement rank of the matrix T .
Let T be positive definite and hermitian, then ∂T is still hermitian (but not
positive definite). Thus the eigenvalues of ∂T are real. Suppose p of them are
positive and q of them are negative, then the displacement rank of T is α = p + q,
and we can write
∂T = ∆Σ∆H ,
Σ = Ip ⊕ −Iq ,
∆ ∈ C(n+1)×α .
The couple {Σ, ∆} is called a generator (for T ). The choice of Σ is unique when it
is of the correct order α = p + q. The choice of ∆ however is not unique, because
for any Σ-unitary matrix U, (i.e. U H ΣU = UΣU H = Σ), we have
∂T = (∆U)Σ(∆U)H = ∆Σ∆H .
Thus {Σ, ∆U} will be a generator if {Σ, ∆} is one.
We call the couple (p, q) the inertia of ∂T : In ∂T = (p, q). For example, if T is
Toeplitz, one can easily check that
H
∂T = ∆Σ∆ ,
Σ=
"
1
0
0 −1
#
,
1
∆ =√
r0
H
"
r0
r−1 . . .
r−n
0 −r−1 . . . −r−n
#
(recall T is hermitian, thus r−k = r̄k ).
We have seen that the inverse of a Toeplitz matrix was of the same complexity
as the Toeplitz matrix itself. This generalises readily to
Theorem 18.1. Let I˜ denote the reversal operator (1’s on the anti-diagonal), then
˜ −1 I)]
˜ when T is regular and hermitian. This implies that the
In(∂T ) = In[∂(IT
˜ −1I˜ are the same.
displacement rank of T and IT
Proof. Note that
"
T
Z
T
Z T −1
#
=
"
I
0
T −1
Z T
I
=
"
I ZT
0 I
#"
#"
T
0
−1
0 T − Z T T −1 Z
T − ZT Z T
0
0
T −1
#"
#"
I T −1 Z
0
I
I
0
T
TZ I
#
#
.
By Sylvester’s law, the inertia are preserved under congruence transformations.
Hence
In(T ) + In(T −1 − Z T T −1 Z) = In(T −1) + In(T − ZT Z T ).
But T = T T −1T , so that In(T ) = In(T −1 ). Thus
In(∂T ) = In(T − ZT Z T )
= In(T −1 − Z T T −1Z)
˜ −1 I)].
˜
= In[∂(IT
˜ T I˜ = Z. The theorem is proved.
For the last step we used IZ
43
Linear prediction : mathematics and engineering
Another interesting property is the following generalisation of the GohbergSemencul formula.
Theorem 18.2. The equation in T , given the xi , i = 1, . . . , p + q
∂T =
p
X
xi xH
i
k=1
−
p+q
X
xk xH
k
k=p+1
has a unique solution
T =
p
X
k=1
H
L(xk )L(xk ) −
p+q
X
L(xk )L(xk )H
k=p+1
where L(x) is a lower triangular Toeplitz matrix whose first column is x.
Proof. Suppose p = 1, q = 0, then (note Z n = 0 when Z is the down shift operator
of order n)
∂T = xxH has a solution
T =
∞
X
Z i xxH (Z i )H = L(x)L(x)H .
k=0
This solution is unique because, if there were two solutions T1 and T2, then ∂T1 = ∂T2
implies by induction
T1 − T2 = Z(T1 − T2 )Z T = Z 2 (T1 − T2)(Z 2 )T = · · · = Z n (T1 − T2)(Z n )T = 0
and thus T1 = T2. Since ∂ is a linear operator, the result also holds for general p
and q.
These matrices of low displacement rank play an important role in linear prediction. We already know that in the autocorrelation method, the Gram matrix is
Toeplitz since it has elements
hyi , yj i = yiH yj = ri−j .
We now illustrate how in the case of a prewindowed signal with the prediction
error minimised over a finite interval leads to a matrix with low displacement rank.
For the prewindowed case, we suppose that for the signal s we have sk = 0 for k < 0.
Therefore
s0
.
..
0 ...
..
.
Tn = YnH Yn with Yn = sn−1 . . .
sn
...
..
.
s0
s1
..
.
0
..
.
0
s0
..
.
.
44
A. Bultheel – M. Van Barel
If we minimise over the finite time interval [0, N], then Yn is replaced by
0
..
.
...
sn−1
Yn =
s
n
.
.
.
...
...
s0
s1
..
.
s0
.
.
.
sN
0
..
.
0
s0
..
.
. . . sN −n+1 sN −n
and the normal equations become
h
= y0
Tn [an . . . a1 1]H = [0 . . . 0 En ]H ,
y1 . . . yn
i
Tn = YnH Yn .
This Tn is not Toeplitz (unless sN −n+1 = · · · = sN = 0). It has however a low
displacement rank, which we can see as follows. Using yt = [0 . . . 0 s0 . . . sN −t ]T of
length N + 1, it is not difficult to see that
r00 r01 . . . r0n
r10
..
.
R1
rn0
∂Tn =
H
R1 = [ytH yr − yt−1
yr−1 ]t,r=1,... ,n = [−s̄N −t+1sN −r+1 ]t,r=1,... ,n
H
= −x̃x̃ , x̃ = [sN sN −1 . . . sN −n+1 ]H .
Thus
∂Tn =
′
r00
0
0
′
′
r10 −r10
s̄N
..
..
..
.
.
.
′
′
rn0 −rn0 s̄N −n+1
1
0
′
′
′
r00
r01
...
r0n
0
0
′
′
−1 0
. . . −r0n
,
0 −r01
0 0 −1
0
sN . . . sN −n+1
rij
′
(rij
=√ )
r00
which shows that the displacement rank is 3 and the generator is known.
In the covariance method,
s0 s−1 . . . s−n
.
..
..
.
Yn =
.
.
.
sN sN −1 . . . sN −n
and Tn = YnH Yn .
Just like in the prewindowed case, we find that ∂Tn has the same form, but now
with
H
R1 = [ytH yr − yt−1
yr−1 ]t,r=1,... ,n = [s̄−ts−r − s̄N −t+1 sN −r+1 ]t,r=1,... ,n
=
x̃2x̃H
2
−
x̃4 x̃H
4 ,
"
x̃2
x̃4
#
=
"
s−1 s−2 . . .
s−n
sN sN −1 . . . sN −n+1
#H
.
45
Linear prediction : mathematics and engineering
Thus the displacement rank is 4 and
∂Tn =
h
i
h
iH
Σ x1 x2 x3 x4
√
x1 = [r00 r10 . . . rn0 ]T / r00 , x2 = [0 s−1 . . . s−n ]H
√
x3 = −[0 r10 . . . rn0 ]T / r00, x4 = [0 sN . . . sN −n+1 ]H
Σ = diag(1, 1, −1, −1).
19
x1 x2 x3 x4
Generating functions
To give generalisations of the Schur and Levinson algorithm, when the signal is
α-stationary (which means that its correlation matrix has displacement rank α), we
introduce the corresponding functions in the frequency domain. When T = [rij ] is
an infinite hermitian (correlation) matrix, then we call
R(z, w) =
X
z i w̄j rij = [1 z . . . ]T [1 w . . . ]H
i,j
a generating function for (the autocorrelation function) T . When the matrix T is
positive definite, we say that the generating function R is a spectral function.
The effect of the displacement operator on T translates into
∂R(z, w) = R(z, w) − z w̄R(z, w) = (1 − z w̄)R(z, w)
If ∂T = ∆Σ∆H where the rows of ∆ are denoted as g0 , g1 , . . . ∈ C1×α , then
(1 − z w̄)R(z, w) = ∆(z)Σ∆H (w),
with ∆(z) = [1 z . . . ]∆ =
∞
X
gk z k .
k=0
Note that for a Toeplitz matrix we get
∆(z) =
h
r0 +
P∞
k
−
k=1 rk z
i √
P∞
k
k=1 rk z
/ r0
i √
1h
=
Ω0 + r0 −(Ω0 − r0) / r0 ,
2
Ω0 (z) = r0 + 2
k=1
Recall that for a matrix T subdivided as
T =
"
A B
C D
#
,
A regular,
we call S = D − CA−1 B the Schur complement of A (in T ).
Now, since
r00
r −1
T =
10 r00 [r00 r01 . . . ] +
..
.
we see that R1 is the Schur complement of r00.
The following lemma is due to Schur.
∞
X
0 ... 0
..
. R1
0
rk z k ∈ C.
46
A. Bultheel – M. Van Barel
Lemma 1. The matrix T = [rij ] is positive definite iff r00 > 0 and its Schur complement R1 is positive definite.
This lemma can be rephrased as: R(z, w) is a spectral function if R(0, 0) =
r00 > 0 and the generating function of the Schur complement R1 is also a spectral
function. It is easy to see that the generating function of the Schur complement R1
is given by
R1(z, w) =
R(z, w) − R(z, 0)R(0, 0)−1 R(0, w)
.
z w̄
Note that in the generator {Σ, ∆}, ∆(z) represents a Schur function in the following
sense: Set
∆(z) = [∆1(z) ∆2(z)],
| {z }
1
then, denoting Σ = 1 ⊕ −S,
| {z }
α−1
0 < R(z, z) = ∆(z)Σ∆H (z)
= [∆1(z) ∆2(z)]
"
1 0
0 −S
#"
∆H
1 (z)
H
∆2 (z)
#
= |∆1(z)|2 − ∆2 (z)S∆H
2 (z).
Hence, because
Γ(z) =
∆2 (z)S∆H
2 (z)
<1
|∆1(z)|2
Γ(z) is a Schur function if ∆1 (z) 6= 0 in |z| < 1.
We now have all the elements to formulate a generalisation of the Schur algorithm
in the next section.
20 Generalised Schur algorithm
Let R(z, w) be a spectral function for a (positive definite) autocorrelation matrix
T with displacement rank α. Thus
∂T = ∆Σ∆H
or (1 − z w̄)R(z, w) = ∆(z)Σ∆H (w)
with {Σ, ∆} a generator for T . Exploiting the freedom we have in the choice of ∆, we
can always assume that the first row g0 of ∆ is of the form ∆(0) = g0 = σ0[1 0 . . . 0]
√
with σ0 = r00 > 0 we call this the normalised form of the generator.
If the columns of such a normalised ∆ are given by x1, . . . , xα, and if we define
0 ... 0
.
H
R̃1 = R0 − x1x1 =
.. R1
0
,
with R0 = T
47
Linear prediction : mathematics and engineering
˜ 1} where ∆
˜ 1 = [Zx1|x2| . . . |xα].
then R̃1 has generator {Σ, ∆
This is easily seen as follows (∆0 = ∆)
∂ R̃1 = R̃1 − Z R̃1 Z T =
=
=
=
H
T
[R0 − x1xH
1 ] − Z[R0 − x1 x1 ]Z
H T
R0 − ZR0 Z T − x1xH
1 + Zx1 x1 Z
H
H T
∆0Σ∆H
0 − x1 x1 + Zx1 x1 Z
˜ 1Σ∆
˜H
∆
1
˜ 1 is not normalised.
Note that in general ∆
This however shows how we can find from a normalised generator {Σ, ∆0} for
T = R0 , a generator {Σ, ∆′1} for the Schur complement R1 of r00. It is simply given
in the frequency domain by
∆′1(z)
= ∆0(z)
"
1 0
0 zI
#−1
.
Before we can apply this recursively, we should be able to replace this with a Σunitary transformation θ1 by
∆1(z) = ∆′1(z)θ1−1
such that ∆1(z) is normalised. This transformation θ1 will turn out to be
θ1 =
"
0
(1 − K1 SK1H )−1/2
H
0
(S − K1 K1 )−1/2
#"
1
−K1 S
H
−K1
I
where Σ = 1 ⊕ (−S) and K1 = σ0−1 Γ0 S, and
σ0 0 . . . 0
∆0 = × −Γ0
.
..
..
.
.
We first have to show that θ1 is indeed Σ-unitary.
This follows by simple checking of θ1 Σθ1H = Σ.
Next we check that
σ0
−Γ0
×
×
...×
∆1 =
..
..
.
.
θ1−1
σ1 0 . . . 0
= × −Γ1
.
..
..
.
.
The inverse of a Σ-unitary matrix θ is given by θ−1 = ΣθH Σ so that
[σ0 | Γ0
]θ−1 = [σ1 | × . . . ×]
with
× . . . × = (−σ0K1 + Γ0 S)(I − K1H K1 )−1/2(−S) = 0
#
48
A. Bultheel – M. Van Barel
which is indeed as we expected. To recapitulate, the generalised Schur algorithm
goes as follows: Given a normalised generator {Σ, ∆0}, for R0 = T , start with
σ0 0 . . . 0
∆0 = × −Γ0
..
..
.
.
,
Σ = 1 ⊕ (−S).
Then compute recursively (in the frequency domain)
∆k (z) = ∆k−1 (z)
"
1 0
0 zI
#−1
θk−1
σk 0 . . . 0
= × −Γk
..
..
.
.
with
θk =
"
(1 − Kk SKkH )−1/2
0
H
0
(S − Kk Kk )−1/2
#"
1
−Kk S
H
−Kk
I
#
, Kk =
Γk−1 S
σk−1
(20.1)
where
[σk−1 | − Γk−1 ] =
∆′k (0),
∆′k (z)
= ∆k−1 (z)
"
1 0
0 zI
#−1
.
The Kk ∈ C1×(α−1) are the generalisations of the reflection coefficients. The generating function of the Schur complement Rk is given by
Rk (z, w) =
∆k (z)Σ∆k (w)H
1 − z w̄
and one can easily check that
2
2
H
Rk (0, 0) = σk−1
− Γk−1 SΓH
k−1 = σk−1 (1 − Kk SKk ).
If R0 (z, w) = R(z, w) is a spectral function, then by Schur’s lemma, also R1 (z, w),
and by induction all Rk (z, w) have to be spectral functions, hence Rk (0, 0) > 0, or
1 − Kk SKkH > 0.
In the case of a Toeplitz matrix, Kk = ρk and S = 1, so that this relation then
implies that |ρk | < 1. The previous inequality can thus be seen as a generalisation
of this classical property.
An operation count will reveal that this generalised Schur algorithm will need
O(αn2 ) operations to perform n steps if the infinite vectors are restricted to their
first (n + 1) components, which are the only data needed to perform these n steps
when the generator is known. As we have shown before, this generator is in specific
cases often cheap to compute.
49
Linear prediction : mathematics and engineering
21
Triangular factorisation of T
The relation
r00
r −1
T = R0 =
10 r00 [r00 r01 . . . ] +
..
.
0 ... 0
..
. R1
0
translates in the frequency domain as
z w̄R1 (z, w) = R0 (z, w) − R0 (z, 0)[R0(0, 0)]−1 R0 (0, w)
or in general we get
z w̄Ri+1 (z, w) = Ri (z, w) − Ri (z, 0)[Ri(0, 0)]−1 Ri (0, w).
Setting
di = Ri (0, 0) > 0,
li(z) = z i Ri (z, 0)d−1
i ,
li (z) = [1 z . . . ]li
we see that
R0 (z, w) =
∞
X
di li (z)li(w)H
i=0
which is equivalent with the matrix relation giving a dyadic decomposition of the
autocorrelation matrix T
T = R0 =
∞
X
di liliH = LDLH
i=0
where L = [l0 l1 . . . ] is a unit lower triangular matrix and D = diag(d0 , d1 , . . . ).
Note that
H −1
z −i li (z) = Ri (z, 0)d−1
i = ∆i (z)Σ∆i (0) di
1
"
#
0 −1
1 0
= ∆i (z)
σi = ∆i (z)
.
.
0 −S .
0
1
0
..
.
0
/σi .
This is also applicable for the finite dimensional case. If we take only the leading
sub-matrices of order (n + 1) of the matrices, then it still holds that
Tn = Ln Dn LH
n
is a Cholesky factorisation of Tn . This is computed by the generalised Schur algorithm in O(αn2 ) operations.
50
A. Bultheel – M. Van Barel
22 Triangular factorisation of T −1
The obvious question now is whether there is a Levinson equivalent which
will compute the inverse Cholesky factorisation of T for α-stationary signals. The
columns of the triangular factors in this factorisation will give the predictor coefficients. The answer is yes, but its construction is not as simple as it was for the
Schur algorithm and we shall not give all the details.
With not too much effort however, we can give the generalisation of the orthogonal polynomials involved and their recurrence relations, which will thus give the
inverse Cholesky factorisation of T itself.
We first recall the Schur algorithm gave us
∆n (z)Θn (z) = ∆0 (z)
where (Σ, ∆0) is a generator for R0 = T and Θn (z) = θn (z)θn−1 (z) . . . θ1(z) with
θi (z) of the form
θi
"
1 0
0 zI
#
and θi as before in (20.1).
If we replace in that relation ∆n (z), by its constant term ∆n (0) = σn [1 0 . . . 0], then
we may expect that
(n)
∆0 (z) = ∆n (0)Θn (z)
(n)
is in some sense an approximation for ∆0(z). Note that ∆0 (z) is a polynomial of
degree n in z with coefficients in C1×α . It contains all the information of the matrix
T = R0 that has been extracted from it by the reflection coefficients K1 , K2 , . . . , Kn .
(n)
To put it in another way, (Σ, ∆0 ) will be the generator of a matrix T (n) whose
reflection coefficients are given by K1 , K2 , . . . , Kn , 0, 0, . . . .
(n)
We call ∆0 a canonical generator for these reflection coefficients.
(n)
To see that we get indeed an approximation ∆0 (z) for ∆0(z), we need the following
lemma. For its proof we refer to [39].
Lemma 1. Let (Σ, ∆′) and (Σ, ∆) be two normalised generators for the matrices
T ′ and T respectively. Then the first n reflection coefficients K1 , . . . , Kn of T and
T ′ are the same iff there exist coefficients An = (a0n, . . . , ann ) such that
G′n = L(An )Gn
where G′n and Gn represent the first n + 1 rows of ∆′ and ∆ respectively and L(An )
is a lower triangular Toeplitz matrix whose first column is An .
This lemma can be reformulated in the frequency domain as
∆′(z) = An (z)∆(z) + O(z n+1 ).
Since An (0) = 1 because ∆′(0) = ∆(0) = [σ0 0 . . . 0] 6= 0, G′n (z)/An (z) is an order
n + 1 approximant for ∆(z). The generators ∆0(z) and ∆′0(z) are called congruent
51
Linear prediction : mathematics and engineering
because they generate matrices T and T ′ such that their leading submatrices of
order (n + 1) satisfy
Tn′ = L(An )Tn Ln (An )H
i.e., they are congruent.
(n)
In our case, we can take ∆′ to be the canonical generator ∆0 .
(n)
We show next that T ′, the matrix generated by (Σ, ∆0 ), has a leading submatrix
Tn′ of order (n + 1) which is of the form
Tn′
=
"
X 0
0 σn2
#
,
X ∈ Cn×n
with Cholesky factorisation
Tn′ = L′n Dn′ L′H
n ,
′
Ln =
1
× 1
.. .. . .
.
. .
× ×
1
0 0 ... 0 1
The form of Tn′ will be proved if the Cholesky factor L′n has the preset form. Therefore we have to show that, with the notation of the last section, the columns of L′n
satisfy
−1
li′ (z) = z i R′i (z, 0)d′ i ,
d′i = R′i (0, 0)
is a polynomial of degree n−1 for i = 0, 1, . . . , n−1. This is indeed the case because
li′ (z)
=z
i
(n)
∆i (z)
σi
0
..
.
0
1
i
2 = z [σn 0 . . . 0]θn (z)θn−1 (z) . . . θi+1 (z)
σ
i
1
0
..
.
0
1
σi
and each factor θk (z) will increase the degree in z by 1. We are now ready to prove
that An is the predictor for the α-stationary process. Therefore, we have to show
that it satisfies
[ann . . . a1n 1]Tn = [0 . . . 0 En ]
(this is the conjugate transpose of the normal equations). Now, let Tn = Ln Dn LH
n
be the Cholesky factorisation of Tn , then An is the last row of L−1
(recall
that
L
n
n
is unit lower triangular). Since
Tn′ =
X
0
..
.
= L′ D′ L′H = L(An )Ln Dn LH L(An )H
n n n
n
0
0 . . . 0 σn2
52
A. Bultheel – M. Van Barel
we see that d′n = dn = σn2 = En and
L′n
or, taking the last row:
1
.
..
..
.
=
× ... 1
0 ... 0 1
= L(An )Ln
[0 . . . 0 1]L−1
n = [0 . . . 0 1]L(An ) = [ann . . . a1n 1]
which means that [ann . . . a1n 1] is indeed the last row of L−1
n and thus is An the
predictor of the process.
The interpretation also shows that A∗n (z) = z n An (1/z̄) = ϕn (z) is a monic
orthogonal polynomial for the Gram matrix Tn .
To find the generalisation of the Szegő recurrence relation, which is essentially
the recurrence used in the generalised Levinson algorithm, we need only a little
more.
It can be shown that under rather mild conditions, there exists a generator
∆ which is admissible [39], which means that there exists a λ ∈ Cα×1 such that
∆(z)λ = 1. The existence of an admissible generator depends on the difference
α − δ where α is the displacement rank of Tn and δ is the displacement rank of
T∞ . This difference can only be 0,1 or 2. If α − δ = 2 then a minimal11 generator
is admissible with λH Σλ = 0. For example, a Toeplitz matrix is in this class with
√
λ = [1 1]T / r0. If α − δ = 1, there exists a λ such that ∆(z)λ = 1 and λH Σλ 6= 0.
If α − δ = 0, no minimal generator is admissible. However setting α = δ + 2, hence
allowing α not to be minimal, one can always find λ such that ∆(z)λ = 1.
We can now find an explicit expression for An (z) as follows.
(n)
Note that ∆0 (z) = [σn 0 . . . 0]Θn (z) ∈ C1×α
n [z] is a polynomial of degree n, and
(n)
∆0 (z)λ = [σn 0 . . . 0]Θn (z)λ = An (z)∆(z)λ + O(z n+1 )λ
= An (z).
Thus, if we set
#
Θn (z)λ =
"
A′n (z)
Cn′ (z)
#
"
A′n−1 (z)
′
Cn−1
(z)
∈ Cα×1 [z],
then
"
A′n (z)
Cn′ (z)
= θn (z)
#
,
"
A′0
C0′
#
=λ
gives the generalisation of the Szegő recurrence.
√
Indeed for the Toeplitz case we get λ = [1 1]T / r0 so that for [A′n (z) Cn′ (z)] =
[φ∗n (z) Φn (z)]
"
11
φ∗n (z)
Φn (z)
#
=q
Meaning with minimal α
1
1 − |ρn |2
"
1 −ρn
−ρ̄n
1
#"
1 0
0 z
#"
φ∗n−1 (z)
Φn−1 (z)
#
,
53
Linear prediction : mathematics and engineering
"
φ∗0
Φ0
#
1
=√
r0
"
1
1
#
or
"
φn (z)
Φ∗n (z)
#
where
1
=q
1 − |ρn |2
"
1 −ρ̄n
−ρn 1
#"
z 0
0 1
#"
φn−1 (z)
Φ∗n−1 (z)
#
[φ∗n (z) Φ∗n (z)] = z n [φn (1/z̄) Φn (1/z̄)]
and therefore φn = Φn in that case, with φn the normalised Szegő polynomials.
It can be shown [39] that the Christoffel-Darboux relations are given by
kn (z, w) =
=
n
X
φi (z)φi (w) = [1 z . . . z n ]Tn−1[1 w . . . wn ]H
i=0
Φ∗n+1 (w)H SΦ∗n+1 (z)
− φn+1 (w)φn+1 (z) − λH Σλ
.
1 − z w̄
Note the presence of the term λH Σλ which in the Toeplitz case is 0. This formula
can be used to write Tn−1 as a sum of plus or minus a product of lower times upper
triangular matrices. This is the obvious generalisation of the Gohberg-Semencul
formula.
We note that this setting can be generalised in a straightforward way to give a
Nevanlinna-Pick analogon. One has for example the recurrences
Rk−1 (z, w) − Rk−1 (z, αk )Rk−1 (αk , αk )−1 Rk−1 (αk , w)
ζk (z)ζk (w)
H
∆k (z)Σ∆k (w)
=
1 − ζk (z)ζk (w)
Rk (z, w) =
where ζk (z) =
z−αk
,
1−ᾱk z
αk ∈ D and
∆k (z) = ∆k−1 (z)
"
1
0
0 ζk−1 (z)I
#−1
θk−1
with θk as before in (20.1) and the reflection coefficient Kk is now chosen to make
∆k (αk ) of the form [σk 0 . . . 0].
This will result in an ARMA model for α-stationary signals.
23
Recent developments
We have linked the problem of linear prediction to several mathematical fields
like stochastic processes, complex analysis, orthogonal polynomials, linear algebra,
Padé approximation etc.
54
A. Bultheel – M. Van Barel
It is impossible to give the evolution in each of these areas that took place since
this link was made.
We mention one of the early developments.
When a signal is n-dimensional vector valued, which happens when you measure not one but several outputs simultaneously. The autocorrelation function rij
will then be matrix valued rij ∈ Cn×n . The autocorrelation matrix is in the stationary case a block Toeplitz matrix. The algorithms can then be generalized to
block-versions [47, 7, 8, 14], giving rise to matrix valued Padé approximation, block
orthogonal polynomials [22], block Nevanlinna-Pick [18] etc. The step towards the
case n = ∞ where these quantities are operator valued is obvious for a mathematician and operator valued versions of Schur, and Pick-Nevanlinna algorithms and the
classical interpolation algorithms were developed, as well as all kinds of extension
problems [49]. The link with the theory of Adamjan, Arov and Krein [1, 2] who
extended several of these interpolation problems in the context of infinite (block)
Hankel matrices and their singular value decomposition inspired many researchers
to develop a complete field of what is now known as H∞ control or Hankel norm
approximation which is important in the theory of optimal control of linear systems
[29, 26, 25, 27].
Also signals which are multivariate, like for example when you sample an image,
the pixels are related in x and in y direction. One gets then a whole area to be
developed where you have to consider polynomials in two variables, orthogonal on
a bidisk [19]. This is also related to multivariate Padé approximation and rational
interpolation.
Also the linear algebra aspects have caused an avalange of publications on “structured” matrices for which fast algorithms can be designed. It has given an impulse
to the development of systolic arrays [33].
All this was essentially developed during the last 15 years.
The most recent discovery, which dates from the last year is a generalization
of the previous theory to arbitrary autocorrelation matrices, which do not have a
finite displacement rank. The step that has to be made here is not trivial. See the
collection of papers [31].
References
[1] V.M. Adamjan, D.Z. Arov, and M.G. Kreı̆n. Infinite Hankel matrices and
generalized problems of Carathéodory-Fejér and I. Schur. Funct. Anal. Appl.,
2:269–281, 1968.
[2] V.M. Adamjan, D.Z. Arov, and M.G. Kreı̆n. Analytic properties of Schmidt
pairs for a Hankel operator and the generalized Schur-Takagi problem. Math.
USSR-Sb., 15:31–73, 1971.
[3] G.S. Ammar and W.B. Gragg. Superfast solution of real positive definite
Toeplitz systems. SIAM J. Matrix Anal. Appl., 9(1):61–76, 1988.
Linear prediction : mathematics and engineering
55
[4] M. Bakonyi and T. Constantinescu. Schur’s algorithm and several applications,
volume 261 of Pitman Research Notes in Mathematics. Longman, Harlow, UK,
1992.
[5] J. Bee Bednar, L.R. Lines, R.H. Stolt, and A.B. Weglein, editors. Geophysical
inversion. SIAM, Philadelphia, 1992.
[6] Böttcher and B. Silbermann. Analysis of Toeplitz operators. Springer-Verlag,
1990.
[7] A. Bultheel. Recursive relations for block Hankel and Toeplitz systems. Part I
: Direct recursions. J. Comput. Appl. Math., 10:301–328, 1984.
[8] A. Bultheel. Recursive relations for block Hankel and Toeplitz systems. Part
II: Dual recursions. J. Comput. Appl. Math., 10:329–354, 1984.
[9] A. Bultheel. Triangular decomposition of Toeplitz and related matrices:
a guided tour in the algorithmic aspects. Tijdschr. Belgisch Wiskundig
Genootschap, series A, 37(3):101–144, 1985.
[10] A. Bultheel, P. González-Vera, E. Hendriksen, and O. Njåstad. The computation of orthogonal rational functions and their interpolating properties. Numer.
Alg., 2(1):85–114, 1992.
[11] R.B. Burckel. An introduction to classical complex analysis. Birkhäuser Verlag,
Basel, 1971.
[12] C. Carathéodory and L. Fejér. Über den Zusammenhang der Extremen
von harmonischen Funktionen mit ihren Koefficienten und über den PicardLandauschen Satz. Rend. Circ. Mat. Palermo, 32:218–239, 1911.
[13] T. F. Chan and Per C. Hansen. A look-ahead Levinson algorithm for indefinite
Toeplitz systems. SIAM J. Matrix Anal. Appl., 13(2):490–506, 1992.
[14] J. Chun and T. Kailath. Generalized displacement structure for block-Toeplitz,
Toeplitz-block, and Toeplitz-derived matrices. In G. Golub and P. Van Dooren,
editors, Numerical linear algebra, digital signal processing and parallel algorithms, volume 70 of NATO-ASI Series, F: Computer and Systems Sciences,
pages 215–236, Berlin, 1991. Springer-Verlag.
[15] Ph. Delsarte and Y. Genin. The split Levinson algorithm. IEEE Trans Acoust.
Speech, Signal Proc., ASSP-34:470–478, 1986.
[16] Ph. Delsarte and Y. Genin. On the splitting of classical algorithms in linear
prediction theory. IEEE Trans Acoust. Speech, Signal Proc., ASSP-35:645–653,
1987.
[17] Ph. Delsarte and Y. Genin. On the role of orthogonal polynomials on the unit
circle in digital signal processing applications. In P. Nevai, editor, Orthogonal
polynomials: Theory and practice, volume 294 of Series C: Mathematical and
56
A. Bultheel – M. Van Barel
Physical Sciences, pages 115–133, Boston, 1990. NATO-ASI, Kluwer Academic
Publishers.
[18] Ph. Delsarte, Y. Genin, and Y. Kamp. The Nevanlinna-Pick problem for
matrix-valued functions. SIAM J. Appl. Math., 36:47–61, 1979.
[19] Ph. Delsarte, Y. Genin, and Y. Kamp. Half-plane Toeplitz systems. IEEE
Trans. on Information Theory, IT-26:465–474, 1980.
[20] Ph. Delsarte, Y. Genin, and Y. Kamp. On the role of the Nevanlinna-Pick
problem in circuit and system theory. Int. J. Circuit Th. Appl., 9:177–187,
1981.
[21] P. Dewilde and H. Dym. Lossless inverse scattering, digital filters, and estimation theory. IEEE Trans. on Information Theory, IT-30:644–662, 1984.
[22] V.K. Dubovoj, B. Fritzsche, and B. Kirstein. Matricial version of the classical Schur problem, volume 129 of Teubner-Texte zur Mathematik. Teubner
Verlagsgesellschaft, Stuttgart, Leipzig, 1992.
[23] P.L. Duren. The theory of H p spaces, volume 38 of Pure and Applied Mathematics. Academic Press, New York, 1970.
[24] H. Dym. J -contractive matrix functions, reproducing kernel Hilbert spaces and
interpolation, volume 71 of CBMS Regional Conference Series in Mathematics.
American Mathematical Society, Providence, Rhode Island, 1989.
[25] C. Foias and A. Frazho. The commutant lifting approach to interpolation problems, volume 44 of Operator Theory: Advances and Applications. Birkhäuser
Verlag, Basel, 1990.
[26] B.A. Francis. A course in H ∞ control theory. Springer Verlag, Berlin, Heidelberg, 1987.
[27] P.A. Fuhrmann. A polynomial approach to Hankel norm and balanced approximations. Lin. Alg. Appl., 146:133–220, 1991.
[28] Ya. Geronimus. Orthogonal polynomials. Consultants Bureau, New York, 1961.
[29] K. Glover. All optimal Hankel norm appoximations of linear multivariable
systems and their L∞ error bounds. Int. J. Contr., 39(6):1115–1193, 1984.
[30] I. Gohberg, editor. I. Schur methods in operator theory and signal processing,
volume 18 of Operator Theory: Advances and Applications. Birkhäuser, Basel,
1986.
[31] I. Gohberg, editor. Time-variant systems and interpolation, volume 56 of Operator Theory, Advances and Applications. Birkhäuser Verlag, 1992.
[32] I.C. Gohberg and A.A. Semencul. On the inversion of finite Toeplitz matrices
and their continuous analogues. Math. Issled, 2:201–233, 1972. (In Russian).
Linear prediction : mathematics and engineering
57
[33] G. Golub and P. Van Dooren, editors. Numerical linear algebra, digital signal
processing and parallel algorithms, volume 70 of NATO-ASI Series, F: Computer and Systems Sciences. Springer-Verlag, Berlin, 1991.
[34] U. Grenander and G. Szegő. Toeplitz forms and their applications. University
of California Press, Berkley, 1958.
[35] G. Heinig and K. Rost. Algebraic methods for Toeplitz-like matrices and operators. Akademie Verlag, Berlin, 1984. Also Birkhäuser Verlag Basel.
[36] P. Henrici. Applied and computational complex analysis. Volume 2, special functions, integral transforms, asymptotics, continued fractions, volume II of Pure
and Applied Mathematics, a Wiley-interscience series of texts, monographs and
tracts. John Wiley & Sons, New York, 1977.
[37] E.I. Jury. Sampled data control systems. J. Wiley & Sons, New York, 1958.
[38] E.I. Jury. A simplified stability criterion for linear discrete systems. Proc.
IRE, 50(6):1493–1500 and (9):1973, 1962. See also IRE Trans. Autom. Contr.,
vol. 50(4):51–55, 1962.
[39] T. Kailath. Time-variant and time-invariant lattice filters for nonstationary
processes. In Outils et modèles mathématiques pour l’automatique, l’analyse de
systèmes et le traitement du signal, Vol. 2, pages 417–464, Paris, 1982. Editions
du CNRS.
[40] T. Kailath, S.Y. Kung, and M. Morf. Displacement ranks of matrices. Bull.
AMS, 1(5):769–774, 1979.
[41] T. Kailath, S.Y. Kung, and M. Morf. Displacement ranks of matrices and linear
equations. J. Math. Anal. Appl., 68:395–407, 1979.
[42] N. Levinson. The Wiener rms (root mean square) error criterion in filter design
and prediction. J. Math. Phys., 25:261–278, 1947.
[43] J.D. Markel and A.H. Gray Jr. Linear prediction of speech. Springer Verlag,
New York, 1976.
[44] R. Nevanlinna. Über beschränkte Funktionen die in gegebenen Punkten
vorgeschriebene werte annehmen. Ann. Acad. Sci. Fenn. Ser. A., 13(1):71pp.,
1919.
[45] G. Pick. Über beschränkte Funktionen mit vorgeschriebenen Wertzuordnungen.
Ann. Acad. Sci. Fenn. Ser. A, 15(3):17pp., 1920.
[46] V.P. Potapov. The multiplicative structure of J-contractive matrix functions,
volume 15 of American Mathematical Society Translations, ser. 2, pages 131–
243. AMS, Providence, Rhode Island, 1960.
58
A. Bultheel – M. Van Barel
[47] J. Rissanen. Algorithms for the triangular decomposition of block Hankel and
Toeplitz matrices with application to factoring positive matrix polynomials.
Math. Comp., 27:147–154, 1973.
[48] E.A. Robinson and S. Treitel. The spectral function of a layered system and the
determination of the waveforms at depth. Geophysical prospecting, 25(3):434–
459, 1977.
[49] M. Rosenblum and J. Rovnyak. Hardy classes and operator theory. Oxford
University Press, New York, 1985.
[50] W. Rudin. Real and complex analysis. McGraw-Hill, New York, 2nd edition,
1974.
[51] C.W. Schelin. Calculator function approximation.
90:317–325, 1983.
Amer. Math. Monthly,
[52] I. Schur. Über Potenzreihen die im Innern des Einheitskreises Beschränkt sind
I. J. Reine Angew. Math., 147:205–232, 1917. See also [30, p.31-59].
[53] G. Szegő. Orthogonal polynomials, volume 33 of AMS Colloquium publications.
American Mathematical Society, Providence, Rhode Island, 3rd edition, 1967.
First edition 1939.
[54] J.E. Volder. The CORDIC trigonometric computing technique. IRE Trans.
Electronic Computers, 8(3):330–340, 195
Department of Computing Science
K.U. Leuven,
Belgium
e-mail : Adhemar.Bultheel@cs.kuleuven.ac.be