Module-2
Johan Larsson
Department of Mechanical Engineering
University of Maryland
Second Edition
December, 2024
1 Introduction
  1.1 Numerical methods vs numerical analysis
  1.2 Different types of errors
  1.3 Required background for this book
    1.3.1 Taylor expansions
    1.3.2 Complex variables
    1.3.3 Linear algebra
    1.3.4 Statistics
2 Root-finding
  2.1 Fixed-point iteration
  2.2 Bisection method
  2.3 Newton-Raphson method
  2.4 Secant method
3 Interpolation
  3.1 Piecewise linear
  3.2 Global polynomials
    3.2.1 Approximate agreement with the data
    3.2.2 Least-squares method
  3.3 Splines
    3.3.1 Derivation of cubic spline equations
  3.4 Multi-dimensional interpolation
    3.4.1 Cartesian grids
    3.4.2 Structured but non-Cartesian grids
5 Integration
  5.1 Trapezoidal and midpoint rules
  5.2 Simpson's rule
  5.3 Adaptive quadrature
6 Differentiation
  6.1 Introduction to finite difference schemes
  6.2 General finite difference schemes on uniform grids
  6.3 Finite differences on non-uniform grids
  6.4 Implementation as a matrix-vector product
  6.5 Finite differences in higher dimensions
This book grew out of the course on fundamental numerical methods that I have been
teaching at the University of Maryland for several years. While some of the students in
this class will do their PhD in some area of computational science, the majority are either
undergraduates or graduate students who focus their research on other topics and take
the course to learn the fundamentals of numerical methods. Given this, my experience
is that the material needs to be presented in a way that is intuitive to engineers, with
less initial emphasis on the mathematics. I naturally use the course as a way to improve
the students’ abilities and sense of comfort with math and programming, but I try to
lead with intuition and end with math rather than the other way around. Over the
years, I have ended up developing my own course notes reflecting this style. After
having been asked by many students for access to scans of my hand-written and only
occasionally organized notes, I eventually took the time to write them down in LaTeX.
The first edition reflected what I managed to type during the Fall 2022 semester. This
second edition reflects what I added during the Fall 2024 semester; I hope to add some
additional material and more examples in the coming years.
The book is aimed at beginners of the subject. More comprehensive treatment at a
similar introductory level can be found in, for example, the book by Chapra & Canale
(2015). A more rigorous and mathematical treatment for advanced students can be
found in, for example, Heath (2018). While some linear algebra is used in this book, I
have completely avoided any concept of numerical linear algebra; the book by Trefethen
& Bau III (1997) is recommended. Finally, the books by Bendat & Piersol (2010) and
Sivia (2006) provide comprehensive coverage of the aspects of statistics that we only
barely touch upon in this book; the book by Sivia (2006) is a particularly joyful read.
Johan Larsson
Rockville, MD
December, 2024
Chapter 1
Introduction
Most readers of this book will have spent at least 15 years in school, with mathematics
having been a recurring subject for most of that time. In math classes, students will have
been taught to compute quantities, solve algebraic and differential equations, perform
integration and differentiation, and many other things. So if readers have learnt so much
math, why should they learn about numerical methods? Because the vast majority of
real-world problems can't be solved with the type of math that you have spent 15 or more
years learning! Think about it: at some point you learnt how to solve equations of the form
a + bx = 0, and then later you learned to solve the more complicated a + bx + cx^2 = 0.
But what about solving a + bx + cx^2 + ... + ex^5 = 0, or a + bx^2 + sin(x) + ln(x) = 0 –
surely such equations might occur in real-world problems?
The purpose of numerical methods is to provide ways for us to solve whatever
equations we want, to perform integrals of whatever functions we want, and so on.
This is done by creating algorithms that are executed on computers. An algorithm
is effectively a recipe, a step-by-step codified process which, if followed, leads to the
final answer. Many algorithms require a lot of repeated steps – hence the need for the
computer.
The price we pay for the ability to solve whatever equations we want comes in two
parts: (i) we need to use computer resources, which costs both money and energy; and
(ii) we must be satisfied with approximate answers. A user of numerical methods must
have a basic level of understanding of both of these drawbacks in order to choose the
right method for each problem context.
x0 = 50, say. In that case, the Babylonian algorithm produces: x1 = 26, x2 = 14.9 . . .,
x3 = 10.8 . . ., x4 = 10.03 . . ., x5 = 10.00005 . . ., x6 = 10.0000000001 . . ., and so on.
The algorithm clearly converges to the right answer, and thus provides us with a way
to make this computation even if we don’t know how to compute a square-root. After
6 iterations the error is about 10^-10, and the error is going down quickly with each
additional iteration. However, it will never reach zero, and thus we must stop this
process at some point and accept that the answer is not exact.
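The iteration above is easy to experiment with in code. A minimal Python sketch follows; the function name, tolerance, and iteration limit are our own choices, not part of the original Babylonian recipe:

```python
def babylonian_sqrt(S, x0, tol=1e-12, max_iter=100):
    """Approximate sqrt(S) via x_{i+1} = (x_i + S/x_i)/2, stopping when
    the change per iteration falls below tol (an approximate answer, by design)."""
    x = x0
    for _ in range(max_iter):
        x_new = 0.5 * (x + S / x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Reproduce the iterates from the text: sqrt(100) starting from x0 = 50.
x = 50.0
for i in range(1, 7):
    x = 0.5 * (x + 100.0 / x)
    print(f"x{i} = {x}")   # x1 = 26.0, x2 = 14.92..., ..., x6 = 10.0000000001...
```

Note how the stopping criterion is part of the recipe: the iteration never reaches the exact answer, so we stop once successive iterates agree to within a tolerance.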
When first studying this topic, the distinction between numerical “methods” (im-
plementing a method) and “analysis” (using math to understand how a method will
behave) will probably seem fuzzy. For that reason, this book tries to make a distinction
between these perspectives when covering different types of methods.
many kinds of errors, although not every numerical method will necessarily be affected
by every single kind.
The Babylonian method for computing square-roots (see the example above) is af-
fected by two different kinds of errors. The most obvious one is that the algorithm must
be terminated at some point, meaning that we will in reality compute a finite number of
iterations. Thus an important part of this method (or “recipe”) is the clear specification
of how one should decide when to terminate. Understanding when to terminate requires
an understanding of how the error in xi is decreased at each iteration, something which
can be estimated through numerical analysis.
The second type of error affects all numerical methods, in fact it affects every algo-
rithm that is implemented on a computer. It’s called “round-off” error, and is related
to the fact that a computer stores real numbers with finite precision. In “single pre-
cision”, a computer uses 4 bytes (=32 bits, or 32 zeros or ones) to store real numbers
as ±significand · base^exponent, with 1 bit for the sign, 8 bits for the exponent, and 23
bits for the significand (the base is 2). In “double precision”, the exponent gets 11 bits
and the significand gets 52 bits. This works out to precisions of about 10^-7 and 10^-16,
respectively, when measured in a relative sense.
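These precision limits are easy to verify empirically. A short Python check (NumPy is used here because plain Python has no single-precision type):

```python
import numpy as np

# Relative machine precision for the two formats discussed in the text.
print(np.finfo(np.float32).eps)   # about 1.2e-07: single precision
print(np.finfo(np.float64).eps)   # about 2.2e-16: double precision

# The same limit, observed directly: adding a number much smaller than the
# machine precision to 1 is simply lost to rounding.
print(np.float32(1) + np.float32(1e-8) == np.float32(1))   # True
print(1.0 + 1e-17 == 1.0)                                  # True
```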
The round-off error is a fundamental limit of using a computer. In the best case
scenario our final answer will thus have a relative error of 10^-16 (assuming double
precision), but since the round-off error affects pretty much every single step in a calculation,
there is a risk that all of the round-off errors build up to a larger relative magnitude
in the final answer. Understanding how robust an algorithm is to round-off error is
therefore quite important.
which should be interpreted as a term of unknown exact value but which varies as
(x − x0)^{n+1} if (x − x0) is varied. This notation is used heavily in this book.
Taylor expansions are very commonly used in numerical analysis. In an extremely
hand-wavy way, the Taylor expansion (or Taylor series) in Eqn. (1.1) says that we can
either describe f (x) by knowing the function values at all different locations x (the left-
hand-side of the equation) or we can equivalently describe f (x) by knowing all different
derivative values at a single location x0 (the right-hand-side). An example is shown in
Fig. 1.1.
Figure 1.1: Illustration of the Taylor expansions, approximating a function f (x) around
the base point x0 using two, three and four terms (including the f (x0 ) term).
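A truncated Taylor expansion is easy to evaluate numerically. The sketch below uses f(x) = e^x around x_0 = 0 as an assumed example (the function in Fig. 1.1 is not specified here), exploiting the fact that every derivative of e^x is e^x:

```python
import math

# Taylor approximations of f(x) = exp(x) around x0 = 0, keeping 2, 3, and 4
# terms (including the f(x0) term), evaluated at x = 1.
x, x0 = 1.0, 0.0
exact = math.exp(x)
approx = 0.0
for n in range(4):
    # n-th term: f^(n)(x0) (x - x0)^n / n!, where f^(n)(x0) = exp(x0) here
    approx += math.exp(x0) * (x - x0) ** n / math.factorial(n)
    if n >= 1:
        print(n + 1, "terms:", approx, "error:", exact - approx)
# Keeping more terms shrinks the error: 2 terms give 2.0, 4 terms give 2.666...
```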
which we could also write more compactly as x = (x1 , x2 , . . . , xn )T where the superscript
T denotes the transpose.
An m × n matrix is defined as
a11 a12 . . . a1n
a21 a22 . . . a2n
A= . .. ,
. .. . .
. . . .
am1 am2 . . . amn
with the components aij defined with the first index being the row and the second being
the column.
The matrix-vector and matrix-matrix products are defined only when the two factors
are compatible in size, specifically the number of columns in the first must equal the
number of rows in the second. The product
A = BC
is most easily defined using index notation of the components; specifically, the product
is
a_{ij} = \sum_{k=1}^{n} b_{ik} c_{kj} .
The same expression holds if C is a vector, in which case j = 1 is the only column of C
and thus A. Similarly, if B is a row vector, then i = 1 is the only row of B and thus A.
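The index-notation definition translates directly into a triple loop. A Python/NumPy sketch, for illustration only (in practice one would call an optimized routine such as `B @ C`):

```python
import numpy as np

def matmul_index(B, C):
    """Compute A = BC directly from a_ij = sum_k b_ik c_kj."""
    m, n = B.shape
    p = C.shape[1]
    assert n == C.shape[0], "columns of B must equal rows of C"
    A = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                A[i, j] += B[i, k] * C[k, j]
    return A

B = np.array([[1.0, 2.0], [3.0, 4.0]])
C = np.array([[5.0], [6.0]])          # a 2x1 matrix, i.e., a column vector
print(matmul_index(B, C))             # [[17.], [39.]], same as B @ C
```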
A linear system of equations is written as
Ax = q .
If A is a square non-singular matrix, the solution to this system is x = A−1 q where A−1
is the inverse of A. If A has more rows than columns, the system is over-determined
since there are more equations (=rows in A) than unknowns (=columns in A). If A has
fewer rows than columns, the system is under-determined.
The matrix A is called “sparse” if it has mostly zeros. Having a sparse matrix
is very advantageous in numerical linear algebra since it may drastically reduce the
computational effort required to find the inverse. The basic way to find the inverse
(or solve a linear system of equations) is to use Gaussian elimination, which has a
computational complexity (=cost) that is proportional to n3 for an n × n matrix. In
contrast, a tri- or penta-diagonal matrix (i.e., a matrix with non-zero elements only on
the main diagonal and on one or two diagonals on either side) can be solved using the
Thomas algorithm at a computational complexity that is proportional to only n!
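A sketch of the Thomas algorithm in Python may make the O(n) claim concrete; the argument layout (three diagonals plus a right-hand side) is our own convention:

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system in O(n) operations.
    b: main diagonal (length n), a: sub-diagonal (n-1),
    c: super-diagonal (n-1), d: right-hand side (n)."""
    n = len(b)
    cp = np.zeros(n - 1)
    dp = np.zeros(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                     # forward elimination sweep
        denom = b[i] - a[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / denom
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):            # back substitution sweep
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Example: a small tridiagonal system whose exact solution is [1, 1, 1, 1].
a = np.array([1.0, 1.0, 1.0])       # sub-diagonal
b = np.array([4.0, 4.0, 4.0, 4.0])  # main diagonal
c = np.array([1.0, 1.0, 1.0])       # super-diagonal
d = np.array([5.0, 6.0, 6.0, 5.0])
print(thomas(a, b, c, d))
```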
A square n × n matrix A has n eigenvectors xi and n eigenvalues λi defined by
A x_i = \lambda_i x_i , \qquad i = 1, 2, \ldots, n .
AX = XΛ ,
1.3.4 Statistics
Consider a random variable X with expected value E(X) = μ_X and variance V(X) = σ_X^2.
If we have a set of random samples x_i, i = 1, . . . , n, we can compute the sample
mean

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i ,
Since the set of xi are random samples, by implication the sample mean x and sample
variance s_x^2 are random too; in other words, if we repeat the experiment, we get a new
set of values xi and thus new computed sample mean and sample variance. In contrast,
the expected value μ_X and variance σ_X^2 are not random, but instead deterministic

\mu_X = \bar{x} \pm z_{\alpha/2}\, \sigma_{\bar{x}} , \qquad (1.2)
which is an abbreviated way of saying that the expected value of X (the “true mean”)
differs from the computed sample mean x̄ by less than the amount z_{α/2} σ_{x̄} with 1 − α
confidence or probability. The value z_{α/2} provides the link between the confidence level
and the size of the interval: larger confidence requires a larger interval. The quantity σ_{x̄}
is the so-called standard error, which quantifies the variability in the computed sample
mean. For uncorrelated (or independent) samples, the standard error is σ_X/√n, which
produces the formula for a confidence interval (for the expected value) that is covered
in most undergraduate statistics courses.
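As a concrete illustration, the Python snippet below computes a 95% confidence interval (z_{α/2} ≈ 1.96 for α = 0.05, a standard tabulated value) from synthetic uncorrelated samples; the sample parameters are invented for the example:

```python
import numpy as np

# Synthetic independent samples with true mean mu_X = 3 (invented for the demo).
rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=400)

xbar = x.mean()
s = x.std(ddof=1)                  # sample standard deviation
se = s / np.sqrt(len(x))           # standard error of the sample mean
z = 1.96                           # z_{alpha/2} for 95% confidence
print(f"mu_X in [{xbar - z * se:.3f}, {xbar + z * se:.3f}] with 95% confidence")
```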
Part I
Chapter 2
Root-finding
Consider the function f (x) and the equation f (x) = 0. The values x∗ for which f (x∗ ) =
0 are called the “roots” of the equation. In general there can be many roots of an
equation, perhaps even infinitely many (e.g., the equation sin(x) = 0). “Root-finding”
methods attempt to find a single root from either an initial guess x0 or a specified
interval [x0 , x1 ]. In order to find multiple roots, a user would then have to try different
initial guesses or different specified intervals.
There are many root-finding methods, some of which will be presented here. Some
methods work well on some problems, others work better on other problems – one must
therefore know multiple root-finding methods (and their strengths and weaknesses) in
order to be able to handle a wide range of problems.
Convergence implies that the error becomes smaller in magnitude during each iteration,
which means that convergence requires |g ′ (x∗ )| < 1. This requirement is then useful
when deciding how to define g(x), as the different possibilities mentioned above would
have different slopes g ′ (x∗ ) at the root. There is no single way to choose g(x) that leads
to convergence for all functions f (x), and thus this fixed-point iteration approach is ill-
suited for general purposes; having said that, it can be powerful for certain equations.
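A concrete example may help. Take f(x) = x − cos(x) = 0 with the choice g(x) = cos(x): here |g'(x*)| = |sin(x*)| ≈ 0.67 < 1, so the iteration converges (the tolerance and iteration count below are arbitrary choices):

```python
import math

# Fixed-point iteration x_{i+1} = g(x_i) with g(x) = cos(x),
# which solves f(x) = x - cos(x) = 0.
x = 1.0
for i in range(200):
    x_new = math.cos(x)
    if abs(x_new - x) < 1e-12:     # stop once successive iterates agree
        break
    x = x_new
print(x_new)                       # converges to about 0.739085
```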
Figure 2.1: Sketch of the bisection root-finding method, showing how the interval (hor-
izontal red bars) shrinks with each iteration.
discussed below, one could make other choices, but then it’s not strictly a “bisection”
method.
The algorithm is then
Algorithm 2.1: the bisection method
guess an initial interval [xl , xu ]
while not converged,
compute trial point xt = (xl + xu )/2
if f(xt)f(xl) > 0, (i.e., if the function has the same sign at xl as at xt)
xl ← xt (replace xl by xt)
else,
xu ← xt (replace xu by xt)
end if
end while
take the final estimated root as (xl + xu )/2
equation mismatch); thus the most natural stopping criterion is framed in terms of
xu − xl .
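A minimal Python implementation of Algorithm 2.1, with the stopping criterion framed in terms of xu − xl (the tolerance value is an arbitrary choice):

```python
import math

def bisection(f, xl, xu, tol=1e-10, max_iter=100):
    """Bisection method; requires f(xl) and f(xu) to have opposite signs."""
    if f(xl) * f(xu) > 0:
        raise ValueError("initial interval must bracket a root")
    for _ in range(max_iter):
        xt = 0.5 * (xl + xu)           # trial point: interval midpoint
        if f(xt) * f(xl) > 0:
            xl = xt                    # same sign at xl and xt: root in [xt, xu]
        else:
            xu = xt                    # otherwise the root is in [xl, xt]
        if xu - xl < tol:              # stopping criterion on interval length
            break
    return 0.5 * (xl + xu)

# Example: 2 sin(x) - 1 = 0 on [0, 1.8]; the exact root is pi/6.
root = bisection(lambda x: 2 * math.sin(x) - 1, 0.0, 1.8)
print(root, math.pi / 6)
```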
Having described the method and the algorithm, we next turn to some analysis.
The error at iteration i is

e_i = x_{t,i} - x_* = \frac{x_{l,i} + x_{u,i}}{2} - x_* .
We don’t know the true root x∗ , but we know that the error is bounded in magnitude
as

|e_i| \le \frac{x_{u,i} - x_{l,i}}{2} ,
i.e., the error can not be larger in magnitude than half the length of the interval. Since
the interval is cut in half in each iteration, this implies that

|e_i| \le \frac{x_{u,0} - x_{l,0}}{2} \left( \frac{1}{2} \right)^i , \qquad i = 0, 1, \ldots ,
where subscript 0 means the initial guess. This is a quite powerful statement: it’s a
hard error bound that depends only on the choice of initial interval and the number
of iterations. This property of the bisection method thus implies that convergence to
whatever tolerance one desires is guaranteed, provided one chooses an initial interval
that contains the root (and that f (x) is continuous).
i.e., as the first two terms of a Taylor expansion around x_i. This guarantees that the
linear approximation f_{linear}(x) and f(x) have exactly the same function values and
derivatives at x_i. Finding the root (= zero-crossing) of this linear approximation is
done by setting f_{linear}(x) = 0; labeling the root x_{i+1} then yields

x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} , \qquad (2.2)
which is the Newton-Raphson method. A sketch of the method is shown in Fig. 2.2.
To implement this method, one needs an initial guess x0 and the ability to compute
both f (x) and f ′ (x) for any given x. The need for f ′ (x) limits the applicability of this
method. In some scenarios it is easy to find f ′ (x) analytically (it is often easier to take
a derivative than to find a root), but in others this may not be easy or possible at all,
in which case the Newton-Raphson method can not be used.
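A minimal Python sketch of the method; note that the user must supply both f and f', as discussed above (the tolerance on the update size is our own choice of stopping criterion):

```python
import math

def newton_raphson(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson iteration x_{i+1} = x_i - f(x_i)/f'(x_i)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:        # stop when the update is negligibly small
            break
    return x

# Example: 2 sin(x) - 1 = 0 from x0 = 0; the exact root is pi/6.
root = newton_raphson(lambda x: 2 * math.sin(x) - 1,
                      lambda x: 2 * math.cos(x), 0.0)
print(root, math.pi / 6)
```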
Having described the numerical method, we next turn to some numerical analysis.
As before, this is centered on the behavior of the error ei = xi − x∗ , and as before we use
Taylor expansions to estimate how the error behaves when xi is close to x∗ . Specifically,
we have

e_{i+1} = x_{i+1} - x_* = x_i - \frac{f(x_i)}{f'(x_i)} - x_* = e_i - \frac{f(x_i)}{f'(x_i)} , \qquad (2.3)
which relates the errors at two successive iterations. Clearly there is no bound on this
error due to the explicit dependence on f (xi ) and f ′ (xi ). As a result, we can not prove
convergence for this method: unlike the bisection method, it is entirely possible that the
Newton-Raphson method diverges (i.e., fails to find the root) for some functions f (x).
Even if we can’t say anything about whether the method will converge, can we say
anything useful at all? Actually, yes we can. Assuming that xi is close to x∗ , let us
Taylor-expand both f (xi ) and f ′ (xi ) in the error equation (2.3) around x∗ ; this yields
(recall that ei = xi − x∗ , and that f ′ (x) can be Taylor-expanded just like any other
function)
e_{i+1} \approx e_i - \frac{f(x_*) + f'(x_*) e_i + f''(x_*) e_i^2/2 + O(e_i^3)}{f'(x_*) + f''(x_*) e_i + f'''(x_*) e_i^2/2 + O(e_i^3)} . \qquad (2.4)
The goal of this analysis is to see how the error changes during one iteration: in order
to see that, we next need to Taylor-expand the denominator. To do this, we can define
g(e_i) = \frac{1}{f'(x_*) + f''(x_*) e_i + f'''(x_*) e_i^2/2}
       = \frac{1}{f'(x_*)} - \frac{f''(x_*)}{[f'(x_*)]^2} e_i + O(e_i^2)
       = \frac{1}{f'(x_*)} \left( 1 - \frac{f''(x_*)}{f'(x_*)} e_i + O(e_i^2) \right) .
Inserting this into the error equation (2.4) then yields, also using the fact that by
definition f (x∗ ) = 0,
e_{i+1} \approx e_i - \left[ f'(x_*) e_i + \frac{f''(x_*)}{2} e_i^2 + O(e_i^3) \right] \frac{1}{f'(x_*)} \left( 1 - \frac{f''(x_*)}{f'(x_*)} e_i + O(e_i^2) \right)
        \approx e_i - \left[ e_i + \frac{f''(x_*)}{2 f'(x_*)} e_i^2 + O(e_i^3) \right] \left( 1 - \frac{f''(x_*)}{f'(x_*)} e_i + O(e_i^2) \right)
        \approx \frac{f''(x_*)}{2 f'(x_*)} e_i^2 + O(e_i^3) .
So, when the error is small (i.e., when we are close to the root), the magnitude of the
error in Newton-Raphson behaves as

|e_{i+1}| \approx \left| \frac{f''(x_*)}{2 f'(x_*)} \right| e_i^2 .
In other words, the error is reduced quadratically – if the root is known to 4 significant
digits after iteration i, it is known to 8 significant digits after iteration i + 1. That is
quite remarkable.
To summarize, our analysis of the Newton-Raphson method has concluded that it
may not converge in all cases, but when it does, it converges extremely fast.
f'_{\mathrm{approx}}(x_i, x_{i-1}) = \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}} .
The secant method is then derived from the Newton-Raphson method (2.2) as
x_{i+1} = x_i - \frac{f(x_i)}{f'_{\mathrm{approx}}(x_i, x_{i-1})} = x_i - f(x_i)\, \frac{x_i - x_{i-1}}{f(x_i) - f(x_{i-1})} .
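A minimal Python sketch of the secant method; it needs two initial guesses but, unlike Newton-Raphson, no derivative (the stopping tolerance is our own choice):

```python
import math

def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Secant method: Newton-Raphson with f' replaced by a difference quotient."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)   # secant update
        if abs(x2 - x1) < tol:
            return x2
        x0, f0 = x1, f1                        # shift the two-point history
        x1, f1 = x2, f(x2)
    return x1

# Same example problem as in the text: 2 sin(x) - 1 = 0 with x0 = 0, x1 = 0.2.
root = secant(lambda x: 2 * math.sin(x) - 1, 0.0, 0.2)
print(root, math.pi / 6)
```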
f_1(x) = 2 \sin(x) - 1 = 0 ,
f_2(x) = (x - 1)^3 = 0 .
The exact roots are x∗,1 = π/6 and x∗,2 = 1, but let us instead compute them using
the methods described here. We take the initial guess x0 = 0 for the Newton-Raphson
method. We take the additional initial guess x1 = 0.2 for the secant method, and the
interval of [0, 1.8] for the bisection method (we want to avoid finding the exact root by
luck). The functions and the resulting convergence behavior (the absolute value of the
error in the estimated root at each iteration) are shown in Fig. 2.3.
Figure 2.3: Root-finding algorithms applied to two different problems f_1(x) = 0 and
f_2(x) = 0 (left) and the resulting convergence behavior (right).
The bisection method produces the same convergence behavior for both problems:
this method is “slow” but always converges (provided the function is continuous and
the initial interval is valid). The figure also shows the error bound, and it is clear that
the actual error is always below this bound.
The Newton-Raphson method produces incredibly fast convergence for f1 (x) = 0
but very slow convergence for f2 (x) = 0. In fact, the convergence is linear for the second
problem: this is caused by this problem having a multiple (or repeated) root.
The secant method behaves qualitatively like the Newton-Raphson method, but not
quite as efficiently; this slight lack of convergence speed is generally a small price to pay
for avoiding the need to evaluate the derivative.
Chapter 3
Interpolation
The general interpolation problem in one dimension can be stated as: assuming that
you have a set of N data points (xi , yi ) (numbered i = 1, . . . , N ), what is the estimated
value of y for some different value of x?
Within this general problem statement there are several different flavors of interpo-
lation methods. One key distinction is whether one wants the interpolation function
to pass through the data points or not; this is almost equivalent to asking whether
the data are affected by random noise or not. This is illustrated in Fig. 3.1. If the 4
data points are viewed as being exact, then they suggest that the true function varies
non-monotonically, and the red interpolating curve might be a reasonable choice. If, on
the other hand, one believes that the true function should be monotonic in this interval
and therefore that the variation in the data points must come at least partly from noise,
then the green interpolating function might be a more reasonable choice. This chapter
will cover both types of interpolation strategies.
Interpolation methods also differ in whether a single interpolating function is used
everywhere or not. Two of the examples in Fig. 3.1 use a single (“global”) interpolating
function, whereas the piecewise linear interpolation method actually uses N −1 different
interpolating functions depending on where you want to perform the interpolation; in
a sense, this method is “local”.
Finally, interpolation methods also differ in what types of interpolating functions
are used. All interpolating functions in Fig. 3.1 are polynomials, but there are other
choices. The most common different choice is to use trigonometric functions (sines and
cosines) as interpolating functions, which might make sense if one knows that the true
function is periodic (or nearly so). If one allows for the interpolating function to be a
sum of sines and cosines, then this becomes a Fourier series.
Figure 3.1: General interpolation problem, with given data points (xi , yi ) and different
possible interpolating functions.
between the N data points. The algorithm is necessarily composed of two parts: (1) we
must first identify which interval our point x is in (i.e., which interpolating function to
use); and then (2) we can perform a linear interpolation in that interval.
To find the interval, it is necessary that the data points are ordered, meaning that
x1 < x2 < x3 and so on. We can then define interval i as being the interval between
xi and xi+1 . If our data points are numbered i = 1, 2, . . . , N , then our intervals are
numbered i = 1, 2, . . . , N − 1.
Assume that our point x, for which we want to know the function value y(x), is
located in interval i, i.e., that x ∈ [x_i, x_{i+1}]. The interpolating function is then

f_i(x) = y_i + \frac{y_{i+1} - y_i}{x_{i+1} - x_i} (x - x_i) ,
where we have used a subscript i for the interpolating function fi (x) to show that it
applies only to the ith interval. It is sometimes convenient to use different (but entirely
equivalent) forms of the interpolating function; it is quite straightforward to see that
we can also write
f_i(x) = \left( 1 - \frac{x - x_i}{x_{i+1} - x_i} \right) y_i + \frac{x - x_i}{x_{i+1} - x_i}\, y_{i+1}

or

f_i(x) = \frac{x_{i+1} - x}{x_{i+1} - x_i}\, y_i + \frac{x - x_i}{x_{i+1} - x_i}\, y_{i+1} . \qquad (3.1)
Both of these alternate forms highlight the fact that linear interpolation is effectively
a weighted average between the two data points yi and yi+1 , with the weight given by
the scaled distance from the other point (i.e., the weight for yi is the distance from x
to xi+1 , and the weight for yi+1 is the distance from x to xi , scaled by the distance
between the two points).
The main advantages of piecewise linear interpolation are the simplicity and the
robustness. The fully “local” nature of the method implies that the interpolation is
completely unaffected by data points other than the two defining the interval, which
makes the interpolation method very robust. The main disadvantage is the fact that
the interpolating function is only C 0 : it is continuous but not continuously differentiable.
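Both parts of the algorithm (locating the interval, then applying the weighted average of Eqn. (3.1)) appear explicitly in this Python sketch; the function name is our own:

```python
import numpy as np

def piecewise_linear(x_data, y_data, x):
    """Piecewise linear interpolation; x_data must be sorted ascending."""
    # Step 1: find interval i such that x lies in [x_data[i], x_data[i+1]].
    i = np.searchsorted(x_data, x) - 1
    i = min(max(i, 0), len(x_data) - 2)
    # Step 2: weighted average of the two endpoint values, Eqn. (3.1).
    w = (x_data[i + 1] - x) / (x_data[i + 1] - x_data[i])
    return w * y_data[i] + (1 - w) * y_data[i + 1]

x_data = np.array([0.0, 1.0, 2.0, 4.0])
y_data = np.array([1.0, 3.0, 2.0, 0.0])
print(piecewise_linear(x_data, y_data, 0.5))   # 2.0, halfway between 1 and 3
print(piecewise_linear(x_data, y_data, 3.0))   # 1.0, halfway between 2 and 0
```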
or Aa = y.
It is useful to consider under what conditions this system of equations can be solved.
We first note that the matrix A is a so-called Vandermonde matrix, with rows that are
linearly independent if and only if all the data locations xi are unique (i.e., if there are
two data points with the same x value, then A has two rows that are the same). We
then have the following possible scenarios:
• If N = n + 1 and if all the data locations xi are unique, then A is invertible and we
can find a unique set of polynomial coefficients a. For example, if we have N = 3
data points we can exactly fit a unique quadratic (n = 2) polynomial.
So, given N data points, we have two choices for polynomial interpolation: we can
either (1) find the polynomial of order n = N − 1 that exactly goes through the data (by
solving a = A−1 y); or (2) find a polynomial of any order n < N − 1 that approximately
goes through the data. In the latter case, we must choose which polynomial order n we
want.
which is the mathematical statement saying that the best a (i.e., polynomial) is the one
that produces the smallest size of the misfits as measured by the norm ∥e∥.
Figure 3.2: Interpolation problem with misfits ei defined as the difference between the
interpolating function and the data at each point.
i.e., that the quantity we are trying to minimize (∥e∥) must have zero derivative in the
direction of every variable that we are trying to find the optimal value of (the set of aj
coefficients). We can write this in vector notation as
\nabla_a \|e\| = \begin{pmatrix} \partial/\partial a_0 \\ \partial/\partial a_1 \\ \vdots \\ \partial/\partial a_n \end{pmatrix} \|e\| = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} , \qquad (3.4)
The by-far most common choice is the second one, for which the optimal a can be found
analytically. This approach is called the “least-squares” approach.
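As a sketch of the least-squares approach, the snippet below fits a quadratic (n = 2) to N = 6 noisy data points by solving the normal equations a = (AᵀA)⁻¹Aᵀy, one standard way of writing the analytical least-squares solution. The data values are invented for the example:

```python
import numpy as np

# Invented noisy data, roughly following y = 1 + x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.0, 4.9, 10.2, 16.8, 26.1])

n = 2
A = np.vander(x, n + 1, increasing=True)   # columns: 1, x, x^2
a = np.linalg.solve(A.T @ A, A.T @ y)      # normal equations
print(a)                                   # close to [1, 0, 1]
```

Forming AᵀA explicitly is convenient for a small, well-conditioned problem like this; dedicated least-squares routines are preferable for ill-conditioned ones.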
3.3 Splines
A spline is a collection of polynomials where each polynomial is defined in a single
interval and where the polynomials are forced to match each other as well as possible
at the data points. The most commonly used spline is the cubic one, in which the
interpolating function in the ith interval x ∈ [xi , xi+1 ] could be defined as
For cubic splines, both the first and second derivatives are matched at the data points.
An example is shown in Fig. 3.3, which clearly shows the smoothness at the data points.
With N − 1 polynomials and 4 coefficients in each, we have 4N − 4 unknown coeffi-
cients; we then need 4N − 4 equations in order to find those coefficients. Forcing every
polynomial to match the data at both ends of each interval yields the conditions
g_i(x_i) = y_i , \quad g_i(x_{i+1}) = y_{i+1} , \qquad i = 1, 2, \ldots, N-1 . \qquad (3.5)
We need additional conditions, and impose that the first and second derivatives are
continuous at the interior data points. To figure out what this means in terms of
equations, the easiest way is to consider data point i, which has polynomial gi−1 (x) to
its left and polynomial gi (x) to its right. In other words,
g'_{i-1}(x_i) = g'_i(x_i) , \quad g''_{i-1}(x_i) = g''_i(x_i) , \qquad i = 2, 3, \ldots, N-1 . \qquad (3.6)
• the “clamped” spline, in which the first derivative at the end g1′ (x1 ) is forced
to a user-provided value; this approach is natural in situations where the first
derivative at the end is known from some application-specific insight, for example
at adiabatic walls (the temperature gradient is zero) or in clamped beams (the
slope of the beam is fixed);
• the “natural” spline, in which the second derivative at the end g1′′ (x1 ) = 0; this
mimics the way a plastic ruler is forced to bend around nails on a board (the nails
being the data points), with the ruler having no bending moment and thus no
curvature at the ends;
• the “not-a-knot” condition, in which g1′′′ = g2′′′ (note: for cubic polynomials, the
third derivative is constant, hence no need to specify which xi point to use for
each polynomial).
Splines are often a happy medium between the piecewise linear interpolation and
the global polynomial. They are twice continuously differentiable, but retain a mostly
local nature. Furthermore, they can be implemented in a way that is computationally
very efficient, on par with the methods described above. For these reasons, they are
frequently used in practice.
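In practice one rarely codes a spline from scratch. Assuming SciPy is available, its `CubicSpline` class implements the cubic spline with each of the three end conditions listed above:

```python
import numpy as np
from scipy.interpolate import CubicSpline   # assumes SciPy is installed

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 4.0, 3.0])

# The three end conditions discussed above map onto SciPy's bc_type argument:
natural = CubicSpline(x, y, bc_type='natural')       # g'' = 0 at the ends
clamped = CubicSpline(x, y, bc_type='clamped')       # g' = 0 at the ends
notaknot = CubicSpline(x, y, bc_type='not-a-knot')   # g''' continuous near the ends

# All three interpolate the data exactly at the knots, but differ in between.
print(natural(1.5), clamped(1.5), notaknot(1.5))
print(natural(x))
```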
where Gi and Gi+1 are the unknown parameters (or coefficients) of this line (as it
happens, these parameters are actually equal to the second derivative of the polynomials
at the corresponding points). For convenience we define hi = xi+1 − xi as the length of
each interval, and then integrate twice to get
g_i(x) = G_i \frac{(x_{i+1} - x)^3}{6 h_i} + G_{i+1} \frac{(x - x_i)^3}{6 h_i} + a_{i,1} x + a_{i,0} .
This is a cubic polynomial with built-in continuity of the second derivative at the interior
data points. At this point we have 2(N − 1) unknown coefficients ai,0 and ai,1 , and N
unknown coefficients Gi ; a total of 3N − 2 unknowns. This makes sense, as we have
implicitly used N − 2 conditions to match the second derivative at the interior data
points.
The 2N − 2 conditions from Eqn. (3.5) to match the data can be used to solve for
the ai,0 and ai,1 coefficients, which yields
g_i'(x) = \left( -\frac{(x_{i+1} - x)^2}{2 h_i} + \frac{h_i}{6} \right) G_i + \left( \frac{(x - x_i)^2}{2 h_i} - \frac{h_i}{6} \right) G_{i+1} + \frac{y_{i+1} - y_i}{h_i} .
Matching the first derivative according to Eqn. (3.6) at the interior points then yields
This is the final equation for cubic splines. Note the tri-diagonal structure, i.e., that the
equation for point i involves Gi−1 , Gi , and Gi+1 . Solving a tri-diagonal matrix can be
done very quickly and with minimal coding effort, which is the main reason for defining
the cubic spline from the starting point of the linearly varying second derivative.
Figure 3.4: Example of different interpolation methods applied to the same data.
Equation (3.7) needs to be supplemented by two end conditions, which are encoded
on the first and last row of the matrix-vector equation. The first three rows of the final
matrix-vector equation are, for the special case of a “natural” end condition,
\begin{pmatrix} 1 & 0 & 0 & \cdots \\ \frac{h_1}{6} & \frac{h_1 + h_2}{3} & \frac{h_2}{6} & 0 & \cdots \\ 0 & \frac{h_2}{6} & \frac{h_2 + h_3}{3} & \frac{h_3}{6} & 0 & \cdots \\ & & \vdots & & \end{pmatrix} \begin{pmatrix} G_1 \\ G_2 \\ G_3 \\ \vdots \end{pmatrix} = \begin{pmatrix} 0 \\ \frac{y_3 - y_2}{h_2} - \frac{y_2 - y_1}{h_1} \\ \frac{y_4 - y_3}{h_3} - \frac{y_3 - y_2}{h_2} \\ \vdots \end{pmatrix} .
In two dimensions we have the known data f (xl , yl ) at a sequence of grid points
(xl , yl ) for l = 1, 2, . . . , N , and we seek to find the value f (x∗ , y∗ ) at some point (x∗ , y∗ ).
In the most general case we have no idea how the data is organized; it could be a random
list of points in no particular order. More commonly our data would be organized in
some way. The following subsections cover a few common cases.
function. Our task now is to solve for ξ and η for a given ⃗x∗ . We have two equations (the
mapping for both x and y) and hence this is doable. The equations can be combined into
a quadratic equation for either ξ or η, which can be solved analytically; the equation
gets rather long, though, and is therefore not shown here. Or we could use a root-finding
method to find ξ and η numerically. Either way, once this is done, we apply the bilinear
interpolation formula (3.8) in computational space to find the interpolated value.
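As an illustration of the root-finding route, the following Python sketch applies a two-dimensional Newton iteration to invert a bilinear mapping. All names are illustrative, and the reference square ξ, η ∈ [0, 1] with counter-clockwise corner ordering is an assumption here, since the mapping convention is defined elsewhere in the text:

```python
def inverse_bilinear(corners, xs, ys, tol=1e-12, maxit=50):
    """Find (xi, eta) in the unit square such that the bilinear map of the
    quadrilateral 'corners' = [(x1,y1),...,(x4,y4)] (counter-clockwise,
    starting at (xi,eta) = (0,0)) hits the target point (xs, ys)."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = corners

    def mapping(xi, eta):
        # Bilinear shape functions on the unit square.
        n1, n2 = (1-xi)*(1-eta), xi*(1-eta)
        n3, n4 = xi*eta, (1-xi)*eta
        return (n1*x1 + n2*x2 + n3*x3 + n4*x4,
                n1*y1 + n2*y2 + n3*y3 + n4*y4)

    xi = eta = 0.5                       # initial guess: element center
    for _ in range(maxit):
        x, y = mapping(xi, eta)
        rx, ry = x - xs, y - ys          # residual of the two equations
        if abs(rx) + abs(ry) < tol:
            break
        # Analytical Jacobian of the bilinear mapping.
        dxdxi  = (1-eta)*(x2-x1) + eta*(x3-x4)
        dxdeta = (1-xi)*(x4-x1) + xi*(x3-x2)
        dydxi  = (1-eta)*(y2-y1) + eta*(y3-y4)
        dydeta = (1-xi)*(y4-y1) + xi*(y3-y2)
        det = dxdxi*dydeta - dxdeta*dydxi
        # Newton update: subtract J^{-1} times the residual.
        xi  -= ( dydeta*rx - dxdeta*ry) / det
        eta -= (-dydxi*rx + dxdxi*ry) / det
    return xi, eta
```

Once (ξ, η) is known, the bilinear interpolation formula in computational space gives the interpolated value.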
We actually skipped the issue of how to identify which quadrilateral element the
interpolation target point is within. This is not trivial for a non-Cartesian grid, and
raises the question of what it means to be inside an element. One answer is the following:
imagine that we traverse around the element in a counter-clockwise direction; if the point
is always on our left, it is inside the element. Conversely, if it is to our right for even
a single element edge, it is not inside the element. A similar answer is to consider the
vectors from the target point to each of the grid points (or corners of the quadrilateral):
if a counter-clockwise rotation takes the vector to the (i, j) node to the vector to the
(i + 1, j) node (and similarly for the other corners), the point is inside the element.
This could be formalized as follows: if all of the conditions
\[
\begin{aligned}
\left[ (\vec{x}_{i,j} - \vec{x}_*) \times (\vec{x}_{i+1,j} - \vec{x}_*) \right] \cdot \vec{e}_z &> 0 \, , \\
\left[ (\vec{x}_{i+1,j} - \vec{x}_*) \times (\vec{x}_{i+1,j+1} - \vec{x}_*) \right] \cdot \vec{e}_z &> 0 \, , \\
\left[ (\vec{x}_{i+1,j+1} - \vec{x}_*) \times (\vec{x}_{i,j+1} - \vec{x}_*) \right] \cdot \vec{e}_z &> 0 \, , \\
\left[ (\vec{x}_{i,j+1} - \vec{x}_*) \times (\vec{x}_{i,j} - \vec{x}_*) \right] \cdot \vec{e}_z &> 0
\end{aligned}
\]
are satisfied, then the point $\vec{x}_*$ is inside the element.
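The cross-product test above translates almost directly into code. A minimal Python sketch (names are illustrative), written for a general counter-clockwise polygon so that it covers the quadrilateral case:

```python
def inside_quad(quad, p):
    """Return True if point p = (x, y) is inside the counter-clockwise
    polygon 'quad' (a list of (x, y) corners): every cross product
    (corner_k - p) x (corner_{k+1} - p) must have a positive z-component."""
    n = len(quad)
    for k in range(n):
        ax, ay = quad[k][0] - p[0], quad[k][1] - p[1]
        bx, by = quad[(k+1) % n][0] - p[0], quad[(k+1) % n][1] - p[1]
        if ax * by - ay * bx <= 0.0:     # z-component of the cross product
            return False
    return True
```

Points exactly on an edge give a zero cross product and are rejected here; whether they should count as inside is a convention one has to choose.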
where $\vec{e}_i$ could be (for example) unit vectors in the x-, y-, and z-directions. Having
expressed the vector $\vec{f}$ in terms of a set of basis vectors, we can more easily gain an
appreciation for what "makes up" the vector: e.g., we could see that a vector may lie
mostly in the x–y plane if the $a_3$ component is small.
For continuous functions we use basis functions rather than vectors, but apart from
that there is no real difference between these cases. More to the point (in the context
of this book), once we approximate continuous functions by a set of discrete values, the
function becomes a finite-dimensional vector anyway.
There are many possible choices for basis functions, each of which could be used in
a modal expansion. Sets of basis functions that are orthogonal make for particularly
efficient and useful expansions. The most common of these is to use sines and cosines as
basis functions, which is then called the Fourier transform; this is covered first in this
chapter. We will then cover the concept of filtering, which is the process of changing
the frequency content of a function.
CHAPTER 4. MODAL EXPANSIONS AND FILTERING 31
where fˆl is the amplitude of each mode (the “modal amplitude” or “modal coefficient”
or “Fourier coefficient”), kl = 2πl/L is the wavenumber or angular frequency of
mode number l, eıkl xj is the mode shape or basis function, and ı = √−1. Note that the
wavelength of mode l is λl = L/l (i.e., there are l periods within the distance L) and
that the wavenumber kl = 2π/λl . The Nyquist wavelength, i.e., the smallest wavelength
that can be resolved on the grid, is λNyquist = 2h, for which kNyquist = π/h. The Nyquist
criterion can thus be written as |kh| ≤ π.
This idea of writing the function as a linear combination of modes is general; for
the specific case where the basis functions are sines and cosines, we call it a Fourier
series. It is important to note that the modal amplitudes fˆl have the same units (or
dimension) as fj itself, and that the wavenumber kl has units of inverse length. If fj
is a function of time rather than spatial position, one would equivalently use time t in
place of location x and thus the angular frequency ω instead of the wavenumber k.
Since xj = hj = (L/N )j, we see that kl xj = 2πlj/N . Thus the discrete Fourier
transform becomes
\[
f_j = \sum_{l=-N/2}^{N/2-1} \hat{f}_l \, e^{\imath 2\pi l j/N} \, , \tag{4.1}
\]
in which there is no information required about the grid-spacing h. Eqn. (4.1) is gener-
ally referred to as the “backward” transform.
To derive the “forward” transform, we multiply Eqn. (4.1) by e−ı2πmj/N and sum
over all points j; this yields (after swapping the order of the summations)
\[
\sum_{j=1}^{N} e^{-\imath 2\pi m j/N} f_j
= \sum_{j=1}^{N} e^{-\imath 2\pi m j/N} \sum_{l=-N/2}^{N/2-1} \hat{f}_l \, e^{\imath 2\pi l j/N}
= \sum_{l=-N/2}^{N/2-1} \hat{f}_l \underbrace{\sum_{j=1}^{N} e^{\imath 2\pi (l-m) j/N}}_{g_{lm}} \, .
\]
The term labelled glm is particularly important. First note that it depends on l and m
(the sum over j removes dependence on j). If l = m, the term inside the sum is 1, and
thus the whole sum equals N . If l ̸= m, the sum is over an integer number of periods
of a trigonometric function, and therefore the sum is exactly zero. In other words, we
have
\[
g_{lm} = \begin{cases} N \, , & l = m \, , \\ 0 \, , & l \neq m \, . \end{cases}
\]
This is an expression of the concept of orthogonality: just like two vectors in real space
can be orthogonal to each other, in this case two basis functions can also be orthogonal
to each other. The mathematical definition of orthogonality is that the inner product
between them is zero: the summation over all grid points j is the inner product in this
case.
Returning to our derivation, we then get that
\[
\sum_{j=1}^{N} e^{-\imath 2\pi m j/N} f_j = N \hat{f}_m
\quad \Longrightarrow \quad
\hat{f}_m = \frac{1}{N} \sum_{j=1}^{N} f_j \, e^{-\imath 2\pi m j/N} \, . \tag{4.2}
\]
The combination of Eqns. (4.1) and (4.2) form a discrete Fourier transform pair; they
allow for a function to be expressed either as fj in real space or as fˆl in transformed
(or wavenumber, or frequency) space. It is important to realize that there is no approx-
imation involved, both expressions are exactly valid and a Fourier transform does not
involve any loss of information.
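A direct transcription of the transform pair (4.1)–(4.2) into Python may help make the conventions concrete. This O(N²) sketch is for illustration only (the function names are not from this text); production codes use the fast Fourier transform instead:

```python
import cmath

def dft_forward(f):
    """Forward transform (4.2): modal coefficients for l = -N/2 .. N/2-1,
    returned as a dict keyed by the mode number l."""
    N = len(f)
    return {l: sum(f[j-1] * cmath.exp(-2j * cmath.pi * l * j / N)
                   for j in range(1, N + 1)) / N
            for l in range(-N // 2, N // 2)}

def dft_backward(fhat, N):
    """Backward transform (4.1): reconstruct f_j for j = 1..N."""
    return [sum(fhat[l] * cmath.exp(2j * cmath.pi * l * j / N)
                for l in range(-N // 2, N // 2))
            for j in range(1, N + 1)]
```

Applying the forward and then the backward transform recovers the original signal to machine precision, illustrating that no information is lost.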
i.e., we define the energy as the mean-square of the signal. Note that energy is meant
in a mathematical sense here. Inserting the backward Fourier transform (4.1) for both
fj and fj∗ yields
\[
\text{energy} = \frac{1}{N} \sum_{j=1}^{N} \left( \sum_{l=-N/2}^{N/2-1} \hat{f}_l \, e^{\imath 2\pi l j/N} \right) \left( \sum_{m=-N/2}^{N/2-1} \hat{f}_m \, e^{\imath 2\pi m j/N} \right)^{\!*}
= \frac{1}{N} \sum_{j=1}^{N} \sum_{l=-N/2}^{N/2-1} \hat{f}_l \, e^{\imath 2\pi l j/N} \sum_{m=-N/2}^{N/2-1} \hat{f}_m^{\,*} \, e^{-\imath 2\pi m j/N} \, .
\]
Note that we used m rather than l as the summation index for the fj∗ part, to avoid
confusion between the two sums. Our goal now is to reach a final expression where
$\text{energy} = \sum_l (\ldots)$, i.e., where the energy in the signal is expressed as a sum over all modes –
that way, the contribution from every mode adds up to the total. To achieve this, we
change the order of the sums and get
\[
\text{energy} = \sum_{l=-N/2}^{N/2-1} \sum_{m=-N/2}^{N/2-1} \hat{f}_l \hat{f}_m^{\,*} \, \frac{1}{N} \underbrace{\sum_{j=1}^{N} e^{\imath 2\pi (l-m) j/N}}_{g_{lm}} \, .
\]
The same glm factor appears again, and just like before it is non-zero only for l = m.
Therefore
\[
\text{energy} = \sum_{l=-N/2}^{N/2-1} \hat{f}_l \hat{f}_l^{\,*} = \sum_{l=-N/2}^{N/2-1} \left| \hat{f}_l \right|^2 ,
\]
and we see that the energy of the signal is the sum of the squared magnitude of all
modal coefficients. This is called Parseval’s theorem, and allows us to ask questions like
“how much does each frequency (or wavenumber) contribute to the total energy?”. The
collection of all squared modal amplitudes is called the energy spectrum.
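Parseval's theorem is easy to verify numerically. The following self-contained Python sketch (illustrative names; it re-implements the forward transform (4.2) directly) compares the two sides of the theorem:

```python
import cmath

def parseval_check(f):
    """Compare the real-space energy (1/N) sum |f_j|^2 with the sum of
    squared modal amplitudes sum |fhat_l|^2 (Parseval's theorem)."""
    N = len(f)
    fhat = [sum(f[j-1] * cmath.exp(-2j * cmath.pi * l * j / N)
                for j in range(1, N + 1)) / N
            for l in range(-N // 2, N // 2)]
    e_real = sum(abs(v) ** 2 for v in f) / N    # mean-square of the signal
    e_spec = sum(abs(a) ** 2 for a in fhat)     # sum over the spectrum
    return e_real, e_spec
```

The two energies agree to machine precision for any signal, which is exactly the statement of the theorem.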
For a real-valued signal fj , the forward transform (4.2) implies that fˆ−l = fˆl∗ .
Since this is true for all l = −N/2, . . . , N/2 − 1, we specifically have that fˆ0 = fˆ0∗ ; in
other words, the modal amplitude or Fourier coefficient for mode 0 must be real-valued.
The same is true for mode N/2, since
\[
\hat{f}_{N/2} = \frac{1}{N} \sum_{j=1}^{N} f_j \, e^{-\imath 2\pi (N/2) j/N}
= \frac{1}{N} \sum_{j=1}^{N} f_j \, e^{-\imath \pi j}
= \frac{1}{N} \sum_{j=1}^{N} f_j \, e^{\imath \pi j}
= \hat{f}_{-N/2}^{\,*} \, .
\]
Note that the transformed function now has N/2 − 1 independent complex-valued numbers
and 2 real-valued ones, meaning a total of N real-valued “pieces of data”. This is the
same as required to express a real-valued fj in real space, of course.
4.2 Filtering
The concept of filtering refers to the idea of changing the frequency content of a signal.
For example, think of the way you can change the amount of bass (low frequencies) or
treble (high frequencies) on a music player. The prototype problem is therefore: given
a discrete signal fj described on a uniform grid xj = hj, j = 1, 2, . . . , N , we want to
find a “filtered” signal $\bar{f}_j$ in which we have modified the frequency content.
The easiest and most straightforward scenario is for periodic signals using the Fourier
transform. After applying the forward transform (4.2), which provides the modal coefficients fˆm , we can multiply the modal amplitudes with a function Gm and then apply
the backward transform (4.1) to find the filtered signal $\bar{f}_j$. Modes for which Gm < 1 are
attenuated while modes for which Gm > 1 are amplified. The most common situation
in practice is the low-pass filter in which we want to keep the lowest frequencies (or
wavenumbers) untouched while attenuating the highest frequencies: this would call for
a Gm function that is 1 for the lowest modes and < 1 (or even zero) for the highest
modes.
If the signal fj is not periodic, or if we want to avoid using the Fourier transform for
some other reason, we have to apply the filter operation in real space. A rather general
form for doing that is
\[
\bar{f}_j = \sum_{l=a}^{b} c_l f_{j+l} \, , \tag{4.3}
\]
where the set of {cl } are coefficients that define the filter.
To analyze this filter, we make the idealized assumption that a Fourier transform
applies. Note that this assumption is only made when analyzing the method, not when
implementing it. Assuming that the signal fj can be written as a backward Fourier
transform (4.1) then yields, after inserting this into the filter formula (4.3),
\[
\bar{f}_j = \sum_{l=a}^{b} c_l \sum_{m=-N/2}^{N/2-1} \hat{f}_m \, e^{\imath 2\pi m (j+l)/N}
= \sum_{m=-N/2}^{N/2-1} \hat{f}_m \sum_{l=a}^{b} c_l \, e^{\imath 2\pi m (j+l)/N}
= \sum_{m=-N/2}^{N/2-1} \underbrace{\left( \hat{f}_m \sum_{l=a}^{b} c_l \, e^{\imath 2\pi m l/N} \right)}_{\hat{\bar{f}}_m} e^{\imath 2\pi m j/N} \, ,
\]
where we can identify the quantity marked with the underbrace as the Fourier coefficient
of the filtered signal $\bar{f}_j$. We can then define the “transfer function” as the ratio of each
Fourier coefficient after and before the application of the filter, i.e.,
\[
G_m = \frac{\hat{\bar{f}}_m}{\hat{f}_m} = \sum_{l=a}^{b} c_l \, e^{\imath 2\pi m l/N} \, .
\]
Since the wavenumber of mode m is km = 2πm/L = 2πm/(N h), the exponent can
also be written as ı km h l. If we suppress the mode index m, we then get the transfer
function as
\[
G(kh) = \sum_{l=a}^{b} c_l \, e^{\imath k h l} \, ,
\]
which shows how it is a function of the product kh, the wavenumber scaled by the grid-
spacing h. The maximum possible value of this scaled wavenumber is π, which is called
the Nyquist criterion. We are therefore interested in how the transfer function G(kh)
behaves for kh ∈ [0, π], or from infinitely long waves to waves with just two points per
wavelength.
To see how to interpret the transfer function G(kh), we can consider the simple
example when fj is a single cosine mode, i.e.,
\[
f_j = \cos(k x_j) = \mathrm{Re} \left\{ e^{\imath k x_j} \right\} .
\]
By definition, the filtered version of this signal is then
\[
\bar{f}_j = \mathrm{Re} \left\{ G(kh) \, e^{\imath k x_j} \right\} = \mathrm{Re} \left\{ |G| \, e^{\imath \theta} e^{\imath k x_j} \right\} = |G| \cos(k x_j + \theta) \, .
\]
We then see that the magnitude of the transfer function |G| controls how the amplitude
of the modal coefficient is changed, and that the phase angle θ = tan−1 (Im{G}/Re{G})
controls the phase shift of the filtering process. In other words, a real-valued transfer
function G(kh) introduces no phase shift.
This shows that, if cl = c−l , then the transfer function is purely real. In other
words, a filter operation only introduces a phase shift (non-real G) if the filter is
not symmetric. The most common situation where that occurs is for one-sided
filters, which can occur either near boundaries or if one is applying the filter in
real time with only data from the past.
Solving for the coefficients yields (c0 , c1 ) = (1/2, 1/4). This is therefore the “best”
low-pass filter possible using only those 3 stencil points, if by “best” we mean “adheres
to the requirements we stated”. Perhaps surprisingly, the solution was not c0 = c1 = 1/3,
which one might have expected on grounds of this providing “more” averaging; however,
the transfer function of that filter reaches negative values for large kh, as seen in Fig. 4.1.
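The comparison between these two 3-point filters can be checked numerically. A small Python sketch (illustrative names; the filter is passed as a dict {l: c_l}) evaluates the transfer function G(kh) = Σ c_l e^{ıkhl}:

```python
import cmath

def transfer_function(c, kh):
    """Transfer function G(kh) = sum_l c_l e^{i kh l} of a real-space
    filter given as a dict {l: c_l}; real-valued for symmetric filters."""
    return sum(cl * cmath.exp(1j * kh * l) for l, cl in c.items())
```

For the (1/4, 1/2, 1/4) filter, G(kh) = 1/2 + (1/2)cos(kh), which is 1 at kh = 0 and decays monotonically to 0 at kh = π; the (1/3, 1/3, 1/3) averaging filter instead reaches G(π) = −1/3, the negative value discussed above.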
If one allows for a larger stencil (i.e., including more neighboring points), then one
can impose additional requirements. The transfer functions of two filters using 5-point
stencils are also shown in Fig. 4.1. These were designed to produce d2 G/d(kh)2 = 0 at
either kh = 0 (the red dashed line) or kh = π (the magenta dashed line).
The filter corresponding to the green dash-dotted line in Fig. 4.1 is a high-pass filter,
leaving the highest wavenumbers rather untouched while attenuating the lowest ones.
Finally, note that the low-pass filters in Fig. 4.1 have different wavenumber charac-
teristics: e.g., the filter corresponding to the red dashed line leaves more modes with
relatively large amplitude, while the filter corresponding to the magenta dashed line
attenuates more modes. We would then say that the former has a larger “cut-off fre-
quency” or “cut-off wavenumber”.
Chapter 5
Integration
where
\[
I_j = \int_{x_{j-1}}^{x_j} f(x) \, dx \, .
\]
There are two different scenarios in which we may want to compute an integral
numerically in practice:
1. We may know the function f (x) only through its values at a discrete set of data
points, in which case we could not possibly find the analytical integral. In this
case, the choice of how to decompose the interval [a, b] into segments (how many,
where to place the dividing points) is most likely dictated by where we have data.
2. We may know the function f (x) analytically, but not be able to find the analytical
integral if f (x) is sufficiently complex. In this case, we have full freedom in how
to decompose the interval [a, b].
Some quadrature algorithms can handle both types of problems, but algorithms that rely
on the ability to insert or choose data points wherever one wants are suited only to the
second class of problems.
\[
I_{j,\text{trap}} = h_j \, \frac{f(x_{j-1}) + f(x_j)}{2} \, .
\]
This amounts to approximating the function by a straight line in each segment (i.e.,
piecewise linear interpolation between the data points at the segment boundaries), as
sketched in Fig. 5.2.
Another intuitive alternative is the midpoint rule, in which the integral over a single
segment is approximated as
\[
I_{j,\text{midpoint}} = h_j \, f \!\left( \frac{x_{j-1} + x_j}{2} \right) .
\]
To simplify notation, we often write xj−1/2 = (xj−1 + xj )/2. The integral over each
segment is thus approximated by a rectangle with height f (xj−1/2 ), i.e., the function
value at the midpoint of the segment – hence the name of the method, of course. This
is also sketched in Fig. 5.2. Note that one can also think of the midpoint rule as
approximating the function by a straight line that goes through the midpoint f (xj−1/2 )
(as illustrated by the dash-dotted line in the figure): the area of the resulting trapezoid
is identical to the area of the rectangle, regardless of the slope of this straight line.
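Both rules are one-liners in code. A Python sketch (illustrative names) for an arbitrary, possibly non-uniform grid:

```python
def trapezoid(f, x):
    """Composite trapezoidal rule on the (possibly non-uniform) grid x."""
    return sum((x[j] - x[j-1]) * (f(x[j-1]) + f(x[j])) / 2.0
               for j in range(1, len(x)))

def midpoint(f, x):
    """Composite midpoint rule on the grid x: one rectangle per segment,
    with the height taken at the segment midpoint."""
    return sum((x[j] - x[j-1]) * f((x[j-1] + x[j]) / 2.0)
               for j in range(1, len(x)))
```

Both rules integrate straight lines exactly, consistent with the geometric pictures above, and both are approximate for curved functions.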
Having described these two methods (the numerical methods part of this book),
let us next turn to numerical analysis to see what accuracy we should expect. The
midpoint rule is the easiest to analyze. The true function f (x) can be described in the
Figure 5.2: Approximate integral of a single segment using the midpoint and trapezoidal
rules, with the integral approximated as the area under the black lines.
neighborhood of xj−1/2 as
\[
f(x) \approx f(x_{j-1/2}) + f'(x_{j-1/2}) (x - x_{j-1/2}) + f''(x_{j-1/2}) \frac{(x - x_{j-1/2})^2}{2} + \ldots \, .
\]
With this, the true segment integral is (integrate term by term)
\[
\begin{aligned}
I_j &\approx \int_{x_{j-1}}^{x_j} \left[ f(x_{j-1/2}) + f'(x_{j-1/2}) (x - x_{j-1/2}) + f''(x_{j-1/2}) \frac{(x - x_{j-1/2})^2}{2} + \ldots \right] dx \\
&= \left[ f(x_{j-1/2}) \, x + f'(x_{j-1/2}) \frac{(x - x_{j-1/2})^2}{2} + f''(x_{j-1/2}) \frac{(x - x_{j-1/2})^3}{6} + \ldots \right]_{x_{j-1}}^{x_j} \\
&= f(x_{j-1/2}) \, h_j + f''(x_{j-1/2}) \frac{h_j^3}{24} + O\!\left( h_j^5 \right) .
\end{aligned}
\]
24
Note that the terms with odd order derivatives disappear. This can be visually seen in
Fig. 5.2, where the area between the dash-dotted and solid lines must be zero since the
two triangles are identical (but with different sign of the integral); the same would be
true for higher-order odd derivatives.
We note that the first term on the right-hand-side is exactly the numerical formula
for the midpoint rule. We therefore have
\[
I_{j,\text{midpoint}} \approx I_j - f''(x_{j-1/2}) \frac{h_j^3}{24} + O\!\left( h_j^5 \right) .
\]
Summing over all segments yields
\[
I_{\text{midpoint}} = \sum_{j=1}^{N} I_{j,\text{midpoint}} \approx \sum_{j=1}^{N} \left[ I_j - f''(x_{j-1/2}) \frac{h_j^3}{24} + O\!\left( h_j^5 \right) \right] .
\]
The sum over the first term is the exact integral. For convenience when handling the
other terms, let's define the average segment size $\bar{h} = (b - a)/N$. The remainder term
is then $O(\bar{h}^4)$ since the sum over all segments implies a factor of $N$. We then find the
error in the full integral to be
\[
\varepsilon_{\text{midpoint}} = \left| I_{\text{midpoint}} - I \right| \approx \left| \sum_{j=1}^{N} f''(x_{j-1/2}) \frac{h_j^3}{24} \right| + O\!\left( \bar{h}^4 \right) .
\]
So the error is controlled by the second derivatives at the segment midpoints and the
size of each segment. The second derivative may change sign over the interval, so it is
possible that the first error term is very small (even zero). This will only happen for
very special functions and choices of the segment spacing, so it’s more interesting and
useful to find an approximate bound on the error. This is done by taking the absolute
value operation inside the sum using the triangle inequality (recall: |x + y| ≤ |x| + |y|),
after which we get
\[
\varepsilon_{\text{midpoint}} \lesssim \frac{1}{24} \sum_{j=1}^{N} \left| f''(x_{j-1/2}) \right| h_j^3 + O\!\left( \bar{h}^4 \right) .
\]
This is not a true bound on the error since we still have all the higher-order terms, but
it becomes an approximate bound in the limit of N → ∞. We still would like to simplify
it further, specifically by removing the summation. We can multiply and divide by $\bar{h}^2$
to get
\[
\begin{aligned}
\varepsilon_{\text{midpoint}} &\lesssim \frac{1}{24} \bar{h}^2 \sum_{j=1}^{N} \left| f''(x_{j-1/2}) \right| \left( \frac{h_j}{\bar{h}} \right)^2 h_j + O\!\left( \bar{h}^4 \right) \\
&\le \frac{1}{24} \bar{h}^2 (b - a) \left\{ \left| f'' \right| \left( \frac{h_j}{\bar{h}} \right)^2 \right\}_{\max} + O\!\left( \bar{h}^4 \right) ,
\end{aligned}
\]
where the last step used that $\sum_j h_j = b - a$.
This is our final result for the error of the midpoint rule formula. The error is propor-
tional to the full integration interval b − a, which makes sense. It is proportional to
the square of the average segment size h, implying that, if you double the number of
segments, the error will decrease by a factor of 4: therefore, the midpoint rule is “second
order accurate”. Finally, the error depends on the curvature (=second derivative) of
the function: for more highly curved functions, the error bound is larger.
The presence of the “relative” segment size hj /h shows that one can decrease the
error by using smaller segments in regions where the function has large curvature: we
will use this below in section 5.3 to derive an adaptive quadrature method.
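The claimed second-order accuracy is easy to check numerically: halving the average segment size should reduce the error by a factor of about 4. A Python sketch (illustrative names; ∫₀¹ eˣ dx is used as a test integral):

```python
import math

def midpoint_rule(f, a, b, N):
    """Composite midpoint rule with N equal segments on [a, b]."""
    h = (b - a) / N
    return h * sum(f(a + (j - 0.5) * h) for j in range(1, N + 1))

exact = math.e - 1.0                    # integral of e^x over [0, 1]
e1 = abs(midpoint_rule(math.exp, 0.0, 1.0, 50) - exact)
e2 = abs(midpoint_rule(math.exp, 0.0, 1.0, 100) - exact)
ratio = e1 / e2   # should approach 4 for a second-order method
```

Observing a ratio close to 4 confirms that the leading error term scales as the square of the segment size.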
The error analysis for the trapezoidal rule is a bit more involved. The process for
the analysis of the midpoint rule was that we expressed the function f (x) in terms of
a Taylor expansion around xj−1/2 , which then meant that the midpoint rule formula
appeared directly as the first term of the Taylor expansion. This was crucial in the
analysis, as we needed to find a relationship between the exact segment integral Ij and
the formula of the method. To achieve the same for the trapezoidal method, we will
need to have f (xj−1 ) and f (xj ) in the expansion, which seems impossible. We can do
this by also using Taylor expansions of f (xj−1 ) and f (xj ). This is left as an exercise,
and we simply give the final result here which is that
\[
\varepsilon_{\text{trap}} = \left| I_{\text{trap}} - I \right| \lesssim \frac{1}{12} \bar{h}^2 (b - a) \left\{ \left| f'' \right| \left( \frac{h_j}{\bar{h}} \right)^2 \right\}_{\max} + O\!\left( \bar{h}^4 \right) .
\]
This error bound is exactly twice as large as for the midpoint rule – this might at
first seem surprising (surely approximating a function by trapezoids is better than by
rectangles?), but it makes graphical sense when looking at Fig. 5.2. Recall that the
midpoint rule can be viewed as computing the integral of any trapezoid for which the
top boundary goes through the midpoint, at any angle. The error for the midpoint
rule is then the area between the dash-dotted curve and the function f (x). Similarly,
the error for the trapezoidal rule is the area between the solid curve and the function.
The curvature of the function means that the latter area is larger, in fact exactly twice
as large.
At the end of the day, the choice between the midpoint and trapezoidal rules is often
made based on where one has discretely known data. For example, if one has sampled
data and wants to compute the integral between a set of data points, the trapezoidal
rule is the most natural.
It is 4th-order accurate, meaning: if you double the number of segments, then the error
will decrease by a factor of 2⁴ = 16.
Note that we have defined the segment size hj here as xj − xj−1 . It is quite common
in other parts of the literature to instead view Simpson’s rule as covering two segments
(that must be of equal size): in this case, the formulae above need to be adjusted
accordingly.
using the trapezoidal method. If we treat the whole interval using a single segment, the
error is $\sim (x_r - x_l)^3 |f''|$. We could then try to estimate the second derivative in order
to estimate the magnitude of the error. An alternative and arguably simpler approach
is to instead re-compute the integral using a single segment of Simpson's rule, for which
we know that the error is $\sim (x_r - x_l)^5 |f''''|$. Importantly, the difference between the
two approximations has the error of the trapezoidal method as its leading error term,
and we can therefore use this difference as an approximate estimate of the error in the
trapezoidal method. Equivalently, we can say that the error in the trapezoidal method
is
\[
\left| I_{\text{seg,trap}} - I_{\text{seg,exact}} \right| \approx \left| I_{\text{seg,trap}} - I_{\text{seg,simp}} \right| ,
\]
i.e., we can view Simpson’s rule as an approximation of the exact answer in order to
estimate the error. If the difference between the low- and high-order methods is larger
than the tolerance, the segment is split into two and the process is repeated for each
of those two segments. Clearly the same idea could be used with any combination of
methods, provided that one of them is of higher-order accuracy than the other.
Algorithm 5.1 gives the essential structure of the implementation. Note that this
uses so-called “recursive” programming, in which the function calls itself – this is key
to a compact implementation. Also note that we assign a lower tolerance to the computation of each sub-segment: if the errors of the two segments add up (i.e., if they don't
fortuitously cancel), this ensures that the overall error for the segment is still within
the tolerance.
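Since Algorithm 5.1 itself is not reproduced here, the following Python sketch (our own minimal version, with illustrative names) shows the essential recursive structure described above:

```python
def adaptive_trap(f, xl, xr, tol):
    """Recursive adaptive quadrature: accept the trapezoidal estimate of a
    segment when it agrees with a one-segment Simpson estimate to within
    tol; otherwise split the segment and recurse with tol/2 per half."""
    xm = 0.5 * (xl + xr)
    trap = (xr - xl) * (f(xl) + f(xr)) / 2.0
    simp = (xr - xl) * (f(xl) + 4.0 * f(xm) + f(xr)) / 6.0
    if abs(trap - simp) <= tol:
        return trap
    # Halve the tolerance so the two sub-segment errors sum to at most tol.
    return (adaptive_trap(f, xl, xm, tol / 2.0) +
            adaptive_trap(f, xm, xr, tol / 2.0))
```

This evaluates f three times per segment without reusing values; a production version would pass f(xl), f(xm), f(xr) down the recursion to avoid redundant evaluations.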
A natural question is whether it would be better to use the Simpson’s rule result
rather than the trapezoidal one when assigning the value of the segment integral –
surely it would be better to use the more accurate approximation? While there is some
point to this, the real beauty of the adaptive quadrature is that it only stops when the
difference between the high- and low-order methods is sufficiently small, and thus it
doesn’t really matter which one we take.
Chapter 6
Differentiation
Imagine that we have known data at a discrete set of points and want to compute
the derivative at a specific point. In other words, we know fj = f (xj ) at a set of
xj , j = 1, . . . , N , organized in increasing order, and we want to compute an approximate
value of fj′ = df /dx|xj . We have two broad ways to approach this problem:
1. We can fit an interpolating function to the data, and then take the analytical
derivative of this interpolating function. We could use any of the methods dis-
cussed in chapter 3. If the data is periodic, we could also use the Fourier transform
discussed in chapter 4 as an interpolating function.
2. We can compute an approximate derivative without ever explicitly fitting an in-
terpolating function based on the definition of a derivative as the limit of (f (x +
h) − f (x))/h or (f (x + h) − f (x − h))/(2h) or similar as h → 0; we would then
use formulae like this but with a finitely large step size h. This type of approach
is called “finite differencing”.
\[
\begin{aligned}
f'_{j,\text{left}} &= \frac{f_j - f_{j-1}}{x_j - x_{j-1}} \, , \\
f'_{j,\text{right}} &= \frac{f_{j+1} - f_j}{x_{j+1} - x_j} \, , \\
f'_{j,\text{cent}} &= \frac{f_{j+1} - f_{j-1}}{x_{j+1} - x_{j-1}} \, .
\end{aligned} \tag{6.1}
\]
The first method is biased to the left, the second is its mirror-image that is biased to
the right. The last method is symmetric around xj , using data from both sides. An
example illustrating these different approximations of the derivative is shown in Fig. 6.1.
Figure 6.1: Example comparing the derivatives computed using the three different for-
mulae in Eqn. (6.1).
Now, all of these three schemes will compute valid approximations of the true first
derivative f ′ (xj ), so which should we use in practice? It depends on the context and
what accuracy we desire. For example, the left-biased scheme could be useful in a
situation where we are sampling a signal in time and want to compute the derivative
on-the-fly: in that case, fj+1 is unknown, and hence only the left-biased scheme would
be useful. Similarly, both the left- and right-biased schemes could be useful when
computing derivatives near the edges of where we have data.
Having said all this, assuming there are no restrictions on which scheme we can
use, which of these three schemes is the most accurate? This is where numerical anal-
ysis comes in, to say something about the expected behavior of errors for these three
methods.
Let us analyze what the errors are for the three examples given in Eqn. (6.1). To
simplify the analysis, we assume that the data points are uniformly spaced, i.e., that
xj+1 − xj = xj − xj−1 = h. Taylor expansions of both fj−1 and fj+1 around xj can be
written as
\[
f_{j\pm1} = f_j \pm f'_j h + f''_j \frac{h^2}{2} \pm f'''_j \frac{h^3}{6} + f''''_j \frac{h^4}{24} + \ldots \, .
\]
Inserting these into the formula for $f'_{j,\text{cent}}$ then yields
\[
\begin{aligned}
f'_{j,\text{cent}} &= \frac{f_{j+1} - f_{j-1}}{2h} \\
&\approx \frac{ \left( f_j + f'_j h + f''_j \frac{h^2}{2} + f'''_j \frac{h^3}{6} + f''''_j \frac{h^4}{24} \right) - \left( f_j - f'_j h + f''_j \frac{h^2}{2} - f'''_j \frac{h^3}{6} + f''''_j \frac{h^4}{24} \right) }{2h} \\
&= \frac{ 2 f'_j h + 2 f'''_j \frac{h^3}{6} + O\!\left( h^5 \right) }{2h} \\
&= f'_j + f'''_j \frac{h^2}{6} + O\!\left( h^4 \right) .
\end{aligned}
\]
This shows that the numerical scheme (or “formula” or “recipe”) $f'_{j,\text{cent}}$ for estimating
the derivative is equal to the exact derivative $f'_j = f'(x_j)$ plus an infinite sequence of
error terms composed of higher and higher derivatives and powers of h. As the grid-
spacing h → 0, the values of all the higher derivatives remain the same; therefore,
for h smaller than some “critical” value (which is probably very small), only the first
error term (h2 /6)fj′′′ remains and all other terms are vanishingly small. When that
happens, we are in the so-called “asymptotic range of convergence”: by definition, the
error is described by only a single term. Having only a single error term is important
since it prevents cancellation effects between different terms. In that asymptotic range,
we can therefore say that, if the grid-spacing h is divided by 2, then the error in the
approximation of the first derivative will be reduced by a factor of 4 due to the h2 factor.
We therefore say that the $f'_{j,\text{cent}}$ scheme is “second-order accurate”.
If we repeat the same exercise for the left- and right-biased schemes in Eqn. (6.1),
we find that
\[
\begin{aligned}
f'_{j,\text{left}} &\approx f'_j - f''_j \frac{h}{2} + O\!\left( h^2 \right) , \\
f'_{j,\text{right}} &\approx f'_j + f''_j \frac{h}{2} + O\!\left( h^2 \right) .
\end{aligned}
\]
So these methods also estimate the exact derivative fj′ in the limit of h → 0, but for
finite values of h they have different errors. Specifically, they are “first-order accurate”:
when reducing h by half, the error is reduced by half as well.
Having performed this analysis, we return to the original question: which of these
three schemes is the most accurate? We actually can't say, in general. What we can
say is that the error for the central scheme is reduced faster as h → 0. So there exists
some critical grid-spacing where, for all h smaller than that critical grid-spacing, the
central scheme will be more accurate than the left- or right-biased schemes.
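These convergence rates can be verified numerically: halving h should halve the error of the biased schemes but quarter the error of the central scheme. A Python sketch (illustrative names, using d/dx sin(x) at x = 1 as a test case):

```python
import math

def d_left(f, x, h):
    """Left-biased first derivative, first-order accurate."""
    return (f(x) - f(x - h)) / h

def d_cent(f, x, h):
    """Central first derivative, second-order accurate."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

exact = math.cos(1.0)                    # d/dx sin(x) at x = 1
ratio_left = (abs(d_left(math.sin, 1.0, 0.10) - exact)
              / abs(d_left(math.sin, 1.0, 0.05) - exact))
ratio_cent = (abs(d_cent(math.sin, 1.0, 0.10) - exact)
              / abs(d_cent(math.sin, 1.0, 0.05) - exact))
```

The left-biased error ratio comes out near 2 and the central one near 4, matching the first- and second-order error expansions above.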
just the first derivative. To keep things simple, we will assume a uniform grid-spacing
h = xj − xj−1 for all j throughout this section.
Imagine that we want to compute an approximation of dn f /dxn at xj , i.e., the nth
derivative of f , using a formula of type
\[
\left. \frac{d^n f}{dx^n} \right|_{x_j,\text{num}} = \frac{1}{h^n} \sum_{l=a}^{b} c_l f_{j+l} \, . \tag{6.2}
\]
Before continuing, let us think about this formula. The set of {cl } are the coefficients
of the scheme, at this point unknown. In order to be of general use, these coefficients
must not have any dimension; therefore, there must be a factor of h−n in the formula,
since h is the only thing with the same dimension as x here. The numbers a and b define
which data points are included in the scheme. For example, a = −1 and b = 2 would
imply that four data points are included, with one to the left and two to the right. The
collection of data points included in the scheme is often referred to as the “stencil”.
We now have two scenarios: either (i) we have been given a set of coefficients {cl }
and want to analyze what accuracy they imply; or (ii) we have to find values for the
coefficients in order to derive a formula for a particular derivative with a particular
accuracy. We actually proceed in the same way for both scenarios, since the strategy
for finding coefficient values is to perform the accuracy analysis and then at the end ask
what coefficient values would be required.
The analysis process starts by expressing all function values fj+l in terms of Taylor
expansions around xj as
\[
f_{j+l} = f_j + l h f'_j + \frac{l^2 h^2}{2} f''_j + \frac{l^3 h^3}{6} f'''_j + \frac{l^4 h^4}{24} f''''_j + \ldots \, ,
\]
which holds for both positive and negative integers $l$.
We then insert this into the general formula (6.2) and collect terms to get
\[
\begin{aligned}
\left. \frac{d^n f}{dx^n} \right|_{x_j,\text{num}}
&= \frac{1}{h^n} \sum_{l=a}^{b} c_l \left[ f_j + l h f'_j + \frac{l^2 h^2}{2} f''_j + \frac{l^3 h^3}{6} f'''_j + \frac{l^4 h^4}{24} f''''_j + \ldots \right] \\
&= \frac{\sum_l c_l}{h^n} f_j + \frac{\sum_l l c_l}{h^{n-1}} f'_j + \frac{\sum_l l^2 c_l}{2 h^{n-2}} f''_j + \frac{\sum_l l^3 c_l}{6 h^{n-3}} f'''_j + \frac{\sum_l l^4 c_l}{24 h^{n-4}} f''''_j + \ldots \\
&= \sum_{m=0}^{\infty} \frac{1}{m!} \, h^{m-n} \left. \frac{d^m f}{dx^m} \right|_{x_j} \left( \sum_{l=a}^{b} l^m c_l \right) .
\end{aligned} \tag{6.3}
\]
We now have a relation between the formula (left hand side) and the function and all its
derivatives at xj . If we know the values of the coefficients cl , we simply compute each
sum and then have an expression to analyze. If we instead want to find the coefficient
values, we need to think about what we want the right-hand-side to be.
We first note the following: the formula will only compute an approximation to the
nth derivative as h → 0 under the following conditions:
• All terms with m < n (i.e., terms with lower derivatives of f than the desired nth
derivative) must be exactly zero; if not, they will go to infinity as h → 0 due to
the negative power of h in the factor $h^{m-n}$. This requires $\sum_l l^m c_l = 0$ for every
$m = 0, 1, \ldots, n-1$.
• The term with the nth derivative must be exactly dn f /dxn |j , meaning that we
need
\[
\frac{1}{n!} \sum_{l=a}^{b} l^n c_l = 1 \, .
\]
Note that the h factor disappears for this term, by construction: we argued this
on dimensional grounds above, but we can now see that it is also necessary on
grounds that the coefficients cl must not depend on h.
The conditions above provide n + 1 equations, and they must be enforced exactly. This
implies that we need a minimum of n + 1 non-zero coefficients cl in order to have
a formula for the nth derivative; equivalently, we need a stencil that includes at least
n+1 data points. For example, it would be impossible to approximate the 4th derivative
from only 3 data points.
In addition to the necessary requirements above, we can impose additional desirable
attributes of our approximate formula. The obvious desirable attribute is that the error
should be as small as possible. The simplest way to enforce this is to make some of the
remaining error terms exactly zero, which will then enforce a certain order of accuracy.
For example, if the term with the (n + 1)th derivative is enforced to be zero, then the
leading error term becomes proportional to h2 : a second-order accurate method. Or we
could also enforce that the term with the (n + 2)th derivative is zero, which leads to a
third-order accurate method. In both cases, we add an additional equation which then
means that we will need additional non-zero coefficients cl : the more requirements, the
larger the stencil we need.
Once we have specified the n + 1 required and k desired conditions, we have n + 1 + k
equations. We must then choose a stencil of exactly n + 1 + k non-zero coefficients, i.e.,
we have to choose the value of a or b (and then find the other such that exactly n + 1 + k
data points are involved). This then allows us to write a linear system of equations for
the unknown coefficient values. An example of this process is shown below.
As this only provides 3 equations for the 5 unknowns, we are free to enforce 2 additional
constraints. To maximize the order of accuracy, these should be
\[
\sum_l l^3 c_l = 0 \, , \qquad \sum_l l^4 c_l = 0 \, .
\]
The linear system of equations is then (note that the columns correspond to l =
−2, −1, 0, 1, 2)
\[
\underbrace{\begin{bmatrix}
1 & 1 & 1 & 1 & 1 \\
-2 & -1 & 0 & 1 & 2 \\
4 & 1 & 0 & 1 & 4 \\
-8 & -1 & 0 & 1 & 8 \\
16 & 1 & 0 & 1 & 16
\end{bmatrix}}_{A}
\underbrace{\begin{bmatrix} c_{-2} \\ c_{-1} \\ c_0 \\ c_1 \\ c_2 \end{bmatrix}}_{c}
=
\begin{bmatrix} 0 \\ 0 \\ 2 \\ 0 \\ 0 \end{bmatrix} ,
\]
Example 6.2: how to solve for finite difference coefficients with a biased stencil
Imagine that we want to find the coefficients cl for a finite difference scheme that
approximates the first derivative at xj to the highest possible order of accuracy using a
five-point stencil that is fully biased towards the right (i.e., using points $j, j+1, j+2, j+3, j+4$). We then have the hard (non-negotiable) requirements
$$\sum_l c_l = 0\,, \qquad \sum_l l\,c_l = 1\,,$$
together with the accuracy conditions $\sum_l l^2 c_l = \sum_l l^3 c_l = \sum_l l^4 c_l = 0$ that maximize the order of accuracy.
Figure 6.2: Sketch of the grid points in physical (x) and computational (η) coordinates.
The resulting linear system has the solution $c = (-25/12,\ 4,\ -3,\ 4/3,\ -1/4)^T$, where we need
to remember that the elements of $c$ are $c_0, c_1, \ldots, c_4$.
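This construction is easy to automate: build the matrix whose $k$th row contains $l^k$ for every stencil offset $l$, set the right-hand side to $n!$ in the row corresponding to the desired $n$th derivative, and solve. A short Python sketch (the helper name `fd_coefficients` is our own, not from any library):

```python
import numpy as np
from math import factorial

def fd_coefficients(stencil, deriv):
    """Coefficients c_l satisfying sum_l l^k c_l = k! * delta(k, deriv)
    for k = 0, 1, ..., len(stencil)-1."""
    pts = np.asarray(stencil, dtype=float)
    A = np.vander(pts, increasing=True).T   # row k holds l^k for each offset l
    rhs = np.zeros(len(pts))
    rhs[deriv] = factorial(deriv)
    return np.linalg.solve(A, rhs)

# Example 6.2: first derivative, fully right-biased 5-point stencil
c_biased = fd_coefficients([0, 1, 2, 3, 4], 1)
# Central 5-point stencil for the second derivative (the system shown earlier)
c_central = fd_coefficients([-2, -1, 0, 1, 2], 2)
```

For the biased stencil this reproduces the coefficients $(-25/12,\ 4,\ -3,\ 4/3,\ -1/4)$ quoted above.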
Since $F'(\eta)$ is defined in computational space where the grid-spacing is uniform, we can
use any existing finite difference method to approximate it in
$$\frac{df}{dx} = \frac{F'(\eta)}{G'(\eta)}\,.$$
In other words, to compute the first derivative of a function in physical space, we use
a finite difference scheme to compute the first derivative of the function in computa-
tional space, use a finite difference scheme again to compute the first derivative of the
coordinate mapping, and finally divide the two results with each other.
For example, the second-order accurate central scheme is
$$\left.\frac{df}{dx}\right|_{j,\mathrm{2nd\text{-}cent}} = \frac{\dfrac{F_{j+1}-F_{j-1}}{2\Delta\eta}}{\dfrac{G_{j+1}-G_{j-1}}{2\Delta\eta}} = \frac{f_{j+1}-f_{j-1}}{x_{j+1}-x_{j-1}}\,.$$
This is very intuitive, arguably exactly what one would have guessed. However, since
all the formulae here apply to any finite difference schemes, we can instead do it for the
fourth-order accurate central scheme to get
$$\left.\frac{df}{dx}\right|_{j,\mathrm{4th\text{-}cent}} = \frac{-\frac{1}{12}f_{j+2} + \frac{2}{3}f_{j+1} - \frac{2}{3}f_{j-1} + \frac{1}{12}f_{j-2}}{-\frac{1}{12}x_{j+2} + \frac{2}{3}x_{j+1} - \frac{2}{3}x_{j-1} + \frac{1}{12}x_{j-2}}\,.$$
CHAPTER 6. DIFFERENTIATION 52
This is not something that one would come up with without the formality of the coor-
dinate mapping.
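The mapped formula is also easy to verify numerically: apply the same uniform-grid central scheme (with $\Delta\eta = 1$ in index space) to both $f$ and $x$ and divide. A sketch, using a stretched grid of our own choosing as the test case:

```python
import numpy as np

def ddx_mapped(f, x):
    """First derivative on a non-uniform grid via the coordinate mapping:
    apply the same central scheme (Delta_eta = 1) to f and x in index
    space, then divide the two results."""
    dF = np.empty_like(f)
    dG = np.empty_like(x)
    dF[1:-1] = 0.5 * (f[2:] - f[:-2])            # F'(eta), 2nd-order central
    dG[1:-1] = 0.5 * (x[2:] - x[:-2])            # G'(eta)
    dF[0], dF[-1] = f[1] - f[0], f[-1] - f[-2]   # one-sided at the ends
    dG[0], dG[-1] = x[1] - x[0], x[-1] - x[-2]
    return dF / dG

# Stretched grid and a smooth test function (a made-up example)
eta = np.linspace(0.0, 1.0, 201)
x = eta**2 + eta                 # non-uniform spacing in physical space
f = np.sin(x)
err = np.max(np.abs(ddx_mapped(f, x) - np.cos(x)))
```

The interior error converges at second order; the one-sided end treatment is first-order and dominates the maximum error.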
As a side-point, these formulae show that the concept of the grid-spacing is actually
non-unique for non-uniform grids. In the coordinate mapping approach, the “grid-
spacing” is really the derivative of the coordinate x with respect to the computational
space η; while the derivative of the imagined continuous function is of course unique,
the fact that we can approximate it using any finite difference scheme means that the
discrete value is non-unique.
The equivalent formula for the second derivative is a bit more involved. We have
$$\frac{d^2f}{dx^2} = \frac{d}{dx}\left(\frac{df}{dx}\right) = \frac{d}{dx}\left(\frac{F'(\eta)}{G'(\eta)}\right) = \frac{d\eta}{dx}\,\frac{d}{d\eta}\left(\frac{F'(\eta)}{G'(\eta)}\right) = \frac{1}{G'(\eta)}\left(\frac{F''(\eta)}{G'(\eta)} - \frac{F'(\eta)\,G''(\eta)}{(G'(\eta))^2}\right) = \frac{F''(\eta)}{(G'(\eta))^2} - \frac{F'(\eta)\,G''(\eta)}{(G'(\eta))^3}\,.$$
We then need four different applications of any finite difference scheme to compute
approximations to $F'$ and $F''$ (the first and second derivatives of the function) and
$G'$ and $G''$ (the first and second derivatives of the coordinate mapping). For example,
using the second-order accurate central schemes for both the first and second derivatives
yields
$$\left.\frac{d^2f}{dx^2}\right|_j \approx \frac{1}{(G'_j)^2}\,\frac{f_{j+1}-2f_j+f_{j-1}}{\Delta\eta^2} - \frac{G''_j}{(G'_j)^3}\,\frac{f_{j+1}-f_{j-1}}{2\Delta\eta}\,, \qquad G'_j = \frac{x_{j+1}-x_{j-1}}{2\Delta\eta}\,, \quad G''_j = \frac{x_{j+1}-2x_j+x_{j-1}}{\Delta\eta^2}\,.$$
If the grid is uniform, the coordinate mapping is linear and has zero second derivative;
in that case, the second term is zero. We can therefore think of the formula as being
composed of a main term (the first one) and a correction for the effect of the non-uniform
grid (the second term).
$$\left(\frac{d^nf}{dx^n}\right)_{\mathrm{num}} = \frac{1}{h^n}\,D_n f\,.$$
On a non-uniform grid, the coordinate-mapping formula then becomes the element-wise ratio of $f' = D_1 f$ and
$$x' = D_1 x\,.$$
This shows how the finite difference matrix D1 is re-used for the computation of both
derivatives. This is even more true when computing the second derivative. In practice,
one can get the same benefit by implementing the differencing operation in a function.
To complete the definition, we need to choose schemes for the first two grid points.
The simplest choice is to use a second-order accurate central scheme in the second grid
point, and a right-biased first-order accurate scheme in the first. This would yield
$$D_1 = \begin{pmatrix} -1 & 1 & 0 & 0 & 0 & 0 & \cdots \\ -1/2 & 0 & 1/2 & 0 & 0 & 0 & \cdots \\ 1/12 & -2/3 & 0 & 2/3 & -1/12 & 0 & \cdots \\ 0 & 1/12 & -2/3 & 0 & 2/3 & -1/12 & \cdots \\ 0 & 0 & \ddots & \ddots & \ddots & \ddots & \ddots \end{pmatrix}\,.$$
This would work, but would reduce the order of accuracy near the boundary. If one
wants to avoid that, one can use higher-order methods for those first two grid points.
For example,
$$D_1 = \begin{pmatrix} -25/12 & 4 & -3 & 4/3 & -1/4 & 0 & \cdots \\ -1/4 & -5/6 & 3/2 & -1/2 & 1/12 & 0 & \cdots \\ 1/12 & -2/3 & 0 & 2/3 & -1/12 & 0 & \cdots \\ 0 & 1/12 & -2/3 & 0 & 2/3 & -1/12 & \cdots \\ 0 & 0 & \ddots & \ddots & \ddots & \ddots & \ddots \end{pmatrix}\,,$$
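A differentiation matrix of the first kind is straightforward to assemble. A sketch, assuming (our choice, since the text only shows the top rows) that the same low-order closure is mirrored at the right boundary:

```python
import numpy as np

def d1_matrix(n, h):
    """First-derivative matrix: 4th-order central in the interior, with a
    first-order one-sided scheme at the boundary points and a 2nd-order
    central scheme one point in. The right-end closure mirrors the left."""
    D = np.zeros((n, n))
    D[0, :2] = [-1.0, 1.0]
    D[1, :3] = [-0.5, 0.0, 0.5]
    for j in range(2, n - 2):
        D[j, j-2:j+3] = [1/12, -2/3, 0.0, 2/3, -1/12]
    D[-2, -3:] = [-0.5, 0.0, 0.5]
    D[-1, -2:] = [-1.0, 1.0]
    return D / h

x = np.linspace(0.0, 1.0, 21)
D1 = d1_matrix(21, x[1] - x[0])
dfdx = D1 @ np.sin(x)   # approximates cos(x)
```

Since every scheme in the matrix is at least first-order accurate, applying $D_1$ to a linear function recovers its slope exactly.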
Figure 6.3: A sample two-dimensional grid with the physical and computational coor-
dinate systems, and the indices of three grid points.
coordinates correspond to the same point after the unknown coordinate transformation.
The chain-rule of differentiation yields
$$\begin{pmatrix} \dfrac{\partial f}{\partial x} \\[2mm] \dfrac{\partial f}{\partial y} \end{pmatrix} = \begin{pmatrix} \dfrac{\partial F}{\partial \xi}\dfrac{\partial \xi}{\partial x} + \dfrac{\partial F}{\partial \eta}\dfrac{\partial \eta}{\partial x} \\[2mm] \dfrac{\partial F}{\partial \xi}\dfrac{\partial \xi}{\partial y} + \dfrac{\partial F}{\partial \eta}\dfrac{\partial \eta}{\partial y} \end{pmatrix} = \underbrace{\begin{pmatrix} \dfrac{\partial \xi}{\partial x} & \dfrac{\partial \eta}{\partial x} \\[2mm] \dfrac{\partial \xi}{\partial y} & \dfrac{\partial \eta}{\partial y} \end{pmatrix}}_{A} \begin{pmatrix} \dfrac{\partial F}{\partial \xi} \\[2mm] \dfrac{\partial F}{\partial \eta} \end{pmatrix}. \tag{6.4}$$
The final vector is the gradient of F in computational space, which we can compute
using finite differences along the ξ and η directions since the grid is perfectly Cartesian
and uniform in computational space (compare this with section 6.3: there we used
only the uniformity of the grid in computational space, now we also use the Cartesian
nature). The matrix A contains the coordinate transformation, which can not be easily
computed. However, let us write the inverse operation
$$\begin{pmatrix} \dfrac{\partial F}{\partial \xi} \\[2mm] \dfrac{\partial F}{\partial \eta} \end{pmatrix} = \underbrace{\begin{pmatrix} \dfrac{\partial x}{\partial \xi} & \dfrac{\partial y}{\partial \xi} \\[2mm] \dfrac{\partial x}{\partial \eta} & \dfrac{\partial y}{\partial \eta} \end{pmatrix}}_{B} \begin{pmatrix} \dfrac{\partial f}{\partial x} \\[2mm] \dfrac{\partial f}{\partial y} \end{pmatrix},$$
which shows that A = B −1 . The key point here is that we can compute B using finite
differences in computational space: just like in section 6.3, we can apply our finite
difference method to the coordinates x and y just as well as to the function. Taking the
inverse of B then yields
$$A = \begin{pmatrix} \dfrac{\partial \xi}{\partial x} & \dfrac{\partial \eta}{\partial x} \\[2mm] \dfrac{\partial \xi}{\partial y} & \dfrac{\partial \eta}{\partial y} \end{pmatrix} = \frac{1}{\det(B)} \begin{pmatrix} \dfrac{\partial y}{\partial \eta} & -\dfrac{\partial y}{\partial \xi} \\[2mm] -\dfrac{\partial x}{\partial \eta} & \dfrac{\partial x}{\partial \xi} \end{pmatrix}. \tag{6.5}$$
We are now ready to compute the gradient. We apply our chosen finite difference
scheme along both of the directions of the grid as
$$\left.\frac{\partial g}{\partial \xi}\right|_{i,j} = \frac{1}{\Delta\xi}\sum_l c_l\, g_{i+l,j}\,, \qquad \left.\frac{\partial g}{\partial \eta}\right|_{i,j} = \frac{1}{\Delta\eta}\sum_l c_l\, g_{i,j+l}\,,$$
Data from any real experiment will be contaminated by random noise, making the
measured data random in nature. Engineering decisions or scientific analysis are almost
always based on the underlying deterministic (or systematic) behavior, and one must therefore
use averaging in order to reduce the random fluctuations in the data. In many cases it
is then also important to estimate how large the random effects are.
and measures the degree of correlation between the two random variables. If the value
of X is completely independent of the value of Y , then the covariance will be zero. If
high values of X are more likely to occur when Y takes on high values, the covariance
will be positive.
Imagine a time-dependent signal $x(t)$ or its discrete equivalent $x_i$. We can then ask
how correlated the signal is with itself at a later time, i.e., we can compute
$$R_k = E\left[(x_i - \mu)(x_{i+k} - \mu)\right] \qquad \text{or} \qquad R(\tau) = E\left[(x(t) - \mu)(x(t+\tau) - \mu)\right]\,.$$
This “auto-correlation” (the correlation of the signal with itself) Rk depends only on the
separation between the data points, or on the time delay τ in the continuous example.
If the random process that produced the data “loses memory” of its prior states after
some time delay τ , we should expect the auto-correlation R to become zero for times
larger than τ . In that way, the auto-correlation says something about the correlation
length of a data set.
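A simple estimator of the auto-correlation coefficient $\rho_k = R_k/R_0$ from a single record can be sketched as follows (the helper name is our own, not a library routine):

```python
import numpy as np

def autocorr_coeff(x, max_lag):
    """Estimate the auto-correlation coefficient rho_k = R_k / R_0."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # remove the sample mean
    n = len(x)
    r0 = np.dot(x, x)                     # proportional to the variance
    return np.array([np.dot(x[:n - k], x[k:]) / r0
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
white = rng.standard_normal(4000)
rho = autocorr_coeff(white, 5)   # rho[0] = 1; rho[k>0] near zero for white noise
```

For an uncorrelated (white) signal the coefficients at non-zero lag fluctuate around zero with magnitude of order $1/\sqrt{n}$.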
CHAPTER 7. ANALYZING RANDOM DATA 58
Figure 7.1: Three random signals (left) with different levels of correlation as evidenced
by their auto-correlation coefficients (right).
successive data points are not independent of each other. The subject of this section is
to show how one can compute the estimated standard error for correlated data.
Assume we have sampled data $x_i$, $i = 1, 2, \ldots, n$, which is correlated. The standard
error is defined as the standard deviation of the sample mean, i.e., $\sigma_{\bar x}$. The squared
standard error is
$$\sigma_{\bar x}^2 = E\left[(\bar x - \mu_{\bar x})^2\right]\,.$$
The expected value of the sample mean is the same as the expected value itself (i.e.,
the sample mean is unbiased), and thus $\mu_{\bar x} = \mu_x$. We can then insert the sample mean
formula and perform manipulations as
$$\sigma_{\bar x}^2 = E\left[\left(\frac{1}{n}\sum_{i=1}^n x_i - \mu_x\right)^2\right] = \frac{1}{n^2}\,E\left[\left(\sum_{i=1}^n (x_i - \mu_x)\right)^2\right] = \frac{1}{n^2}\,E\left[\sum_{i=1}^n (x_i - \mu_x)\sum_{j=1}^n (x_j - \mu_x)\right] = \frac{1}{n^2}\sum_{i=1}^n E\left[(x_i - \mu_x)^2\right] + \frac{2}{n^2}\sum_{i\neq j} E\left[(x_i - \mu_x)(x_j - \mu_x)\right]\,.$$
We can then identify the first sum as the variance and the second sum as the covariance
between different data points, which leads us to the final formula for the sample mean
standard error
$$\sigma_{\bar x}^2 = \frac{\sigma_x^2}{n} + \frac{2}{n^2}\sum_{i\neq j} \mathrm{Cov}(x_i, x_j)\,.$$
The second term becomes zero for uncorrelated data since each data point is then
independent of all other data points. For correlated data, the second term is most
likely positive since nearby data points tend to have positive correlation (or covariance)
while data points far apart tend to have zero correlation (or covariance). This then
implies that the standard error for correlated data is higher than for uncorrelated data
(assuming the same underlying variance σx2 and number of samples n). In other words,
if we use the formula for independent data, we will underpredict the true standard error,
and thus our confidence interval will be smaller than it should be.
Divide the data into B batches of m consecutive samples each, so that batch b contains data points i = (b − 1)m + 1, …, bm. Step 1 is then to compute the sample mean for each batch, i.e.,
$$\bar x_b = \frac{1}{m}\sum_{i=(b-1)m+1}^{bm} x_i\,, \qquad b = 1, 2, \ldots, B\,.$$
Now, if the batches are sufficiently large (i.e., if m is sufficiently large), then the different
batch means should be independent of each other. In that case we can approximate the
variance of the sample mean of the complete data as
$$\sigma_{\bar x}^2 \approx \frac{\sigma_{\bar x_b}^2}{B}\,,$$
where $\sigma_{\bar x_b}^2$ is the variance of the individual batch means,
which implies that our formula for the estimated standard error is
$$s_{\bar x} = \sqrt{\frac{1}{B(B-1)}\sum_{b=1}^{B}\left(\bar x_b - \bar x\right)^2}\,.$$
The final question is how one should choose m, the size of each batch. The key
consideration is that we need the batch means to be independent, which implies that
each batch must be large enough such that most data points in a batch b are uncorrelated
with the data points in the nearby batches b − 1 and b + 1. In other words, the batches
need to be sufficiently larger than the correlation length of the signal.
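The batch-means estimate is straightforward to implement; a short sketch (the function name is our own):

```python
import numpy as np

def batch_means_stderr(x, m):
    """Estimated standard error of the sample mean from B batches of size m."""
    x = np.asarray(x, dtype=float)
    B = len(x) // m                             # drop any leftover samples
    xb = x[:B * m].reshape(B, m).mean(axis=1)   # batch means
    return np.sqrt(np.sum((xb - xb.mean())**2) / (B * (B - 1)))

rng = np.random.default_rng(0)
data = rng.standard_normal(10_000)
se = batch_means_stderr(data, 100)   # for white noise, close to 1/sqrt(10000) = 0.01
```

Note that with m = 1 the formula reduces exactly to the classical estimate $s/\sqrt{n}$ for independent data, as it should.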
using the notation from section 4.1. The deterministic energy spectrum is the collection
of expected squared modal amplitudes. In practice, we should therefore do the following
to compute the energy spectrum: run multiple experiments, collect the signal fj for each,
apply the Fourier transform and compute the squared modal amplitudes, and finally
average these squared modal amplitudes over multiple experiments.
A common situation in practice is that we have only a single instance of the signal
fj (i.e., only a single “experiment”) but that this single instance is very long. We can
then divide the signal into several shorter segments or batches, exactly as was done in
section 7.2.1 when estimating the standard error of the sample mean. The idea is then
to treat each segment as an independent experiment, and thus to Fourier transform
and compute the squared modal amplitudes for each segment, and finally to average
the squared modal amplitudes over all segments of the signal. The lowest frequency
captured in a Fourier transform is the inverse of the length of the signal, so the process
of dividing the signal into several shorter segments therefore amounts to trading some
low-frequency resolution for a smoother energy spectrum. As a result, the optimal
number of segments, and thus the length of each segment, is therefore problem-specific.
The Fourier transform is defined for periodic data. In situations where the data is
not periodic (e.g., if obtained by measuring a random process in time), this introduces
so-called spectral leakage in the Fourier transform, where energy associated with the lack
of periodicity is spread across other Fourier modes. One way to reduce this leakage
is to multiply the signal by a window function that tapers smoothly to zero at
the ends of the signal before applying the Fourier transform. If the signal is divided
into several segments, this “windowing” is applied to each segment prior to taking the
Fourier transform.
The complete process for computing the energy spectrum of a long non-periodic
signal is therefore as listed in Algorithm 7.1.
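The described process (segmenting, windowing, averaging the squared modal amplitudes) can be sketched in a Welch-style routine; the Hann window and the normalization convention below are choices of this sketch, since conventions vary:

```python
import numpy as np

def energy_spectrum(f, nseg):
    """Average the squared modal amplitudes of Hann-windowed segments."""
    m = len(f) // nseg
    w = np.hanning(m)
    scale = np.mean(w**2)            # compensate for energy removed by the window
    spec = np.zeros(m // 2 + 1)
    for s in range(nseg):
        seg = f[s * m:(s + 1) * m]
        seg = (seg - np.mean(seg)) * w          # detrend and window each segment
        spec += np.abs(np.fft.rfft(seg))**2 / (m**2 * scale)
    return spec / nseg

t = np.arange(2048)
signal = np.sin(2 * np.pi * 10 * t / 256)   # exactly 10 cycles per 256 samples
spec = energy_spectrum(signal, 8)           # peak at mode 10 of each segment
```

For a pure tone that fits an integer number of periods into each segment, the averaged spectrum peaks at the corresponding mode.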
Figure 7.2: Two sample energy spectral densities F (k) plotted in two different ways.
Note that both spectra have the same total energy (variance).
wavenumber $k$, then
$$\text{energy} = \int_0^\infty F(k)\,dk$$
(note that F (k) is the squared modal amplitude at k divided by the wavenumber “step
size” ∆k). If the energy spectral density F (k) (or “energy spectrum”) is plotted in a
linear-linear plot, the area under the graph is the energy of the signal, which simplifies
interpretation. In most cases, however, it makes sense to plot F (k) with a logarithmic
scale for the frequency or wavenumber k, in which case the area under the curve is no
longer related to the energy of the signal. One can then instead plot the “pre-multiplied
spectrum” $kF(k)$ since $dk = k\,d(\log k)$, and thus
$$\text{energy} = \int_0^\infty F(k)\,dk = \int_{-\infty}^{\infty} kF(k)\,d(\log k)\,.$$
Therefore, if kF (k) is plotted with a linear axis vs k plotted with a logarithmic axis, the
area under the graph is again the energy of the signal. This is illustrated in Fig. 7.2,
which shows two different energy spectra plotted in two different ways. The “pre-
multiplied” plot shows clearly how the blue signal has two spectral peaks while the red
signal has a flat region; especially the former feature is hard to make out in the other
plot.
Part II
Chapter 8
Ordinary differential equations
Differential equations occur in almost every topic area within science and engineering.
Some examples of ordinary differential equations (ODEs) are
$$\frac{du}{ds} = f(u, s)$$
and
$$\frac{d^2u}{ds^2} = f(u, du/ds, s)\,,$$
where we seek the solution u(s) for some interval s ∈ [s1 , s2 ]. In general there will
be infinitely many solutions to ODEs, and we need additional conditions to define
a unique solution to the problem. In most cases these will take the form of boundary
conditions, i.e., conditions on the solution u at either s1 or s2 . For initial value problems
(IVPs), the flow of information is from s1 to s2 , and thus all boundary conditions are
provided at s1 ; these boundary conditions are then referred to as “initial conditions”.
For boundary value problems (BVPs), the flow of information goes in both directions,
and thus boundary conditions are needed at both s1 and s2 . To help build an intuitive
feel for these problems, we will replace s by t (time) for IVPs and by x (space) for
BVPs.
The solution u can be either scalar- or vector-valued, meaning that u(s) can be
either a number or a vector. The more common scenario in practice is for vector-valued
solutions, and we will therefore use the notation u to remind ourselves that the solution
is most likely a vector.
64
CHAPTER 8. ORDINARY DIFFERENTIAL EQUATIONS 65
by finite time steps of size h until the end time t = T . We thus define the solution at
step n as un = u(tn ), with the time step counter n = 0, 1, . . .. All ODE-IVP solvers
are then defined in the sense of going from step n to step n + 1, i.e., assuming that one
knows the solution un , the algorithm should find the solution un+1 at the next time
level.
We note that higher-order ODEs can always be written in the general form of
Eqn. (8.1) by introducing auxiliary variables that describe the lower-order derivatives.
For example, the second-order ODE
$$\frac{d^2u}{dt^2} = f(u, u', t)\,,$$
where $u' = du/dt$, can be written as
$$\frac{dv}{dt} = \begin{pmatrix} f(v_2, v_1, t) \\ v_1 \end{pmatrix}\,, \qquad v = \begin{pmatrix} u' \\ u \end{pmatrix}\,.$$
$$\frac{du_{\rm base}}{dt} + \frac{du_p}{dt} = f(u_{\rm base} + u_p, t) \approx f(u_{\rm base}, t) + \left.\frac{\partial f}{\partial u}\right|_{u_{\rm base},t} u_p(t) + \ldots\,,$$
where we used a Taylor expansion to approximate the right-hand-side for small perturbations $u_p(t)$. By assuming that the base solution $u_{\rm base}(t)$ solves the ODE, i.e., that
$du_{\rm base}/dt = f(u_{\rm base}, t)$, we then get
$$\frac{du_p}{dt} \approx \underbrace{\left.\frac{\partial f}{\partial u}\right|_{u_{\rm base},t}}_{A} u_p(t)\,. \tag{8.5}$$
the tools of linear algebra and linear ODEs to analyze the stability of a nonlinear ODE-
IVP. Finally, we note that the Jacobian matrix A is something we could, at least in
principle, derive and compute from the underlying ODE problem and the assumed base
solution. So, while implementation of an ODE-IVP solver in a code does not require the
computation of A (there are exceptions to this, not covered in this book), the stability
analysis of the ODE-IVP solver requires us to at least imagine that we know A.
Assuming that A is known, it is not trivial to look at Eqn. (8.5) and say whether
the solution up (t) will grow in time or not. The key analytical step then is to write the
solution up (t) in terms of the eigenvectors of A.
Assume that $A$ has an eigendecomposition
$$AX = X\Lambda\,,$$
where the columns of $X$ are the eigenvectors $x_i$ and $\Lambda$ is a diagonal matrix holding the eigenvalues. We then express the solution in terms of a new variable $c(t)$ as
$$u_p(t) = Xc(t) = \sum_i c_i(t)\,x_i\,.$$
The second step shows that this amounts to writing the solution $u_p(t)$ as a linear
combination of all the eigenvectors $x_i$ of $A$, with the coefficients being $c_i(t)$. So we have
not modified the solution in any way; we are just choosing to express it in a different
coordinate system – specifically, we are describing $u_p(t)$ in the coordinate system defined
by the eigenvectors rather than the base coordinate system we used originally.
If we insert this into the stability problem (8.5) we get
$$\frac{d(Xc)}{dt} = A\,Xc\,.$$
Multiplying by X −1 and recognizing that the eigenvectors X are not dependent on time
then yields
$$\frac{dc}{dt} = \Lambda c\,, \tag{8.6}$$
after making use of the fact that X −1 AX = Λ. This is the same ODE as we started
with (the stability problem) but now expressed in a different coordinate system – and
in this new coordinate system the different components are decoupled from each other!
To see this, we can write the equation for the jth component as
$$\frac{dc_j}{dt} = \lambda_j c_j\,.$$
Now that we have used the linear algebra tool of an eigendecomposition, let us use our
analytical understanding of ODE-IVPs to say that this equation has the solution
$$c_j(t) = c_j(0)\,e^{\lambda_j t}\,.$$
We can now say something about the stability of this problem: if even a single eigenvalue
λj has a positive real part, then the solution will blow up. Put in a different way:
the stability problem as defined in our alternative coordinate system (i.e., Eqn. (8.6))
becomes unstable if any eigenvalue of A has a positive real part, and since the two
stability problems are really one and the same this must be true for the problem in the
form of Eqn. (8.5) too. Or put in a last way: the two different definitions of the solution
are linked by up (t) = Xc(t), and thus any statement about stability for one implies the
same for the other.
The upshot of all of this linear algebra is the following: to study stability, we should
study the model problem
$$\frac{dc}{dt} = \lambda c\,, \tag{8.7}$$
where λ is a scalar and c(t) is scalar-valued. Also, since we know that this really
represents each eigenmode in the full system of equations, we should let λ be a complex
number since we know that the eigenvalues of real-valued matrices can still be (and
often are) complex-valued.
The analysis then proceeds by applying a specific ODE-IVP solver to the model
problem (8.7) and asking whether the solution grows in time or not. To do this, we
define the amplification factor
$$G = \frac{c_{n+1}}{c_n}\,.$$
This amplification factor G is complex-valued. If |G| > 1, the solution will grow expo-
nentially in magnitude and the method is unstable. If |G| = 1, the solution will maintain
the same magnitude forever; it might oscillate, but it will neither grow nor decay. In
that case the method is marginally stable. Finally, if |G| < 1, then the solution will
decay in time; it might oscillate, but it will eventually reach zero. In that case the
method is stable.
Figure 8.1: Stability plots for explicit Euler (top left), implicit Euler (top right) and
Crank-Nicolson (bottom left), all showing the magnitude of the amplification factor |G|
in colors and the stability boundary as a solid line. Also showing the stability boundary
for Runge-Kutta methods of increasing accuracy (bottom right), from RK1 (also explicit
Euler) to RK4 in darker shades with larger stability regions.
which implies
$$G_{\rm Crank\text{-}Nic.} = \frac{c_{n+1}}{c_n} = \frac{1 + \lambda h/2}{1 - \lambda h/2}\,.$$
The magnitude of the different amplification factors is visualized in Fig. 8.1. Stability
requires |G| ≤ 1, which is true in different regions of the complex plane for these different
methods (blue colors in the figure). The implicit Euler and Crank-Nicolson methods are
both stable for all λh with real parts ≤ 0. The implicit Euler method is actually stable
for some λh with positive real parts too, but this is arguably more of academic interest
since Re(λ) > 0 implies that the actual problem should be unstable too; arguably the
notion of numerical stability only makes sense if the problem is analytically stable.
The explicit Euler method is stable only for λh values in a small circle. The term
for this is that it is “conditionally stable”, meaning that it is stable for some λh but not
all. In contrast, the implicit Euler and Crank-Nicolson methods are “unconditionally
stable”, meaning that they are stable for all λh in the left half-plane (i.e., with real part
≤ 0).
We also note that the explicit Euler method is unstable for all λh that lie on the
imaginary axis, i.e., with exactly zero real part. This is powerful information: it tells
us that, for problems that create Jacobian matrices with purely imaginary eigenvalues,
the explicit Euler method is useless!
Finally, we observe that the Jacobian eigenvalue λ and the time step h always appear
as a product, meaning that their individual values are unimportant. Let’s think about
what this means. The stability model problem (8.7) shows that the units of λ must
be the inverse of the units of h – since h is a time step (measured in seconds, say),
that means that λ must be a rate (measured in 1/seconds). The product λh then tells
us something about how large the time step is compared to the rate (or speed) of the
process modeled by the ODE, which makes sense.
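The amplification factors above are easy to evaluate numerically. A small sketch that checks the behavior on the imaginary axis discussed in this section:

```python
# Amplification factors for the model problem dc/dt = lambda*c,
# written as functions of z = lambda*h
G_explicit_euler = lambda z: 1 + z
G_implicit_euler = lambda z: 1 / (1 - z)
G_crank_nicolson = lambda z: (1 + z / 2) / (1 - z / 2)

z = 0.5j   # purely imaginary eigenvalue, e.g. an undamped oscillator
growth = {name: abs(G(z)) for name, G in [
    ("explicit Euler", G_explicit_euler),
    ("implicit Euler", G_implicit_euler),
    ("Crank-Nicolson", G_crank_nicolson)]}
```

On the imaginary axis, explicit Euler amplifies (|G| > 1), implicit Euler damps (|G| < 1), and Crank-Nicolson preserves the magnitude exactly (|G| = 1), consistent with Fig. 8.1.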
$$u_{n+1} = u_n + \frac{du}{dt}h + \frac{d^2u}{dt^2}\frac{h^2}{2} + \frac{d^3u}{dt^3}\frac{h^3}{6} + \ldots\,.$$
Replacing du/dt by f (according to Eqn. (8.1)) then yields
$$u_{n+1} = u_n + fh + \frac{df}{dt}\frac{h^2}{2} + \frac{d^2f}{dt^2}\frac{h^3}{6} + \ldots\,.$$
The function f (u, t) is a function of two variables, which then implies
$$\frac{df}{dt} = \frac{\partial f}{\partial t} + \frac{\partial f}{\partial u}\frac{\partial u}{\partial t} = \frac{\partial f}{\partial t} + \frac{\partial f}{\partial u}\frac{du}{dt} = \frac{\partial f}{\partial t} + \frac{\partial f}{\partial u}f\,,$$
where we used the fact that u is only a function of t (thus ∂u/∂t = du/dt) and again
the ODE itself in the last step. Repeating this process for the next higher derivative
yields
$$\frac{d^2f}{dt^2} = \frac{d}{dt}\left(\frac{\partial f}{\partial t} + \frac{\partial f}{\partial u}f\right) = \frac{\partial^2 f}{\partial t^2} + 2\frac{\partial^2 f}{\partial u\,\partial t}f + \frac{\partial f}{\partial u}\frac{\partial f}{\partial t} + \frac{\partial^2 f}{\partial u^2}f^2 + \left(\frac{\partial f}{\partial u}\right)^2 f\,.$$
We then get that the exact $u_{n+1}$ should be an infinite Taylor series that starts as
$$u_{n+1} = u_n + hf + \frac{h^2}{2}\left(\frac{\partial f}{\partial t} + \frac{\partial f}{\partial u}f\right) + \frac{h^3}{6}\left(\frac{\partial^2 f}{\partial t^2} + 2\frac{\partial^2 f}{\partial u\,\partial t}f + \frac{\partial f}{\partial u}\frac{\partial f}{\partial t} + \frac{\partial^2 f}{\partial u^2}f^2 + \left(\frac{\partial f}{\partial u}\right)^2 f\right) + \ldots\,, \tag{8.8}$$
given by Eqn. (8.8). For example, the explicit Euler method (8.2) implies very simply
that
$$u_{n+1} = u_n + hf\,,$$
which matches only the first two terms in the exact Eqn. (8.8). Therefore, the error
in the explicit Euler method is $O(h^2)$ over the course of a single time step. To solve
the ODE over a specific interval $T$ requires $T/h$ time steps, which then implies that the
error of explicit Euler over a fixed time $T$ is $O(h)$.
The implicit Euler method (8.3) involves f (un+1 , tn+1 ) which then means that we
must Taylor-expand this too; we then get
$$u_{n+1} = u_n + hf(u_{n+1}, t_{n+1}) \approx u_n + h\left(f + \frac{df}{dt}h + \frac{d^2f}{dt^2}\frac{h^2}{2} + \ldots\right) = u_n + hf + h^2\left(\frac{\partial f}{\partial t} + \frac{\partial f}{\partial u}f\right) + \ldots\,.$$
This also matches only the first two terms in the exact Eqn. (8.8), and thus implicit
Euler also has an error that scales as O(h) over a fixed time interval.
Repeating this process for the Crank-Nicolson method shows that its Taylor expansion matches the first three terms of the exact one, and thus the error of Crank-Nicolson is of size $O(h^2)$ over a fixed time interval.
The first line shows how the full time step is completed using a weighted average of s
different intermediate slopes fint,i , with s generally referred to as the number of “stages”.
The second line defines how each intermediate slope fint,i is computed, specifically how
it is defined using a weighted average of the prior i − 1 slopes and the current ith slope
as well. If the current ith slope is included (i.e., if aii ̸= 0), the method is implicit and
requires the solution of a nonlinear system of equations; if the current ith slope is not
included (i.e., if aii = 0), the method is explicit.
Given this general definition, a specific Runge-Kutta method is defined by the values
of the coefficients $a_{ij}$, $b_i$, and $c_i$, conventionally collected in a Butcher tableau:

    c1 | a11
    c2 | a21  a22
    :  |  :    :
    cs | as1  as2  ...  ass
    ---+-------------------
       | b1   b2   ...  bs
The most common Runge-Kutta method, often referred to as the “classic RK”
method or “RK4”, is the 4-stage method defined by
    0   | 0
    1/2 | 1/2  0
    1/2 | 0    1/2  0
    1   | 0    0    1    0
    ----+--------------------
        | 1/6  1/3  1/3  1/6
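The tableau translates directly into code; a minimal sketch of one RK4 step:

```python
def rk4_step(f, u, t, h):
    """One step of the classic fourth-order Runge-Kutta method (tableau above)."""
    k1 = f(u, t)                           # slope at the start of the step
    k2 = f(u + 0.5 * h * k1, t + 0.5 * h)  # slope at the midpoint, using k1
    k3 = f(u + 0.5 * h * k2, t + 0.5 * h)  # slope at the midpoint, using k2
    k4 = f(u + h * k3, t + h)              # slope at the end of the step
    return u + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

# Integrate du/dt = u from u(0) = 1 to t = 1; the exact answer is e
u, t, h = 1.0, 0.0, 0.01
for _ in range(100):
    u = rk4_step(lambda u, t: u, u, t, h)
    t += h
```

With this step size the result agrees with $e$ to roughly ten digits, reflecting the $O(h^4)$ global accuracy.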
$$\frac{d^2u}{dx^2} = q(u, x)\,, \qquad x \in [0, 1]\,, \tag{8.10}$$
which could for example model a one-dimensional heat conduction process in which
u is the temperature, x is the coordinate, and q is a heat sink/source. The most
common boundary conditions are either to specify the value of the unknown function
or its derivative; these are called “Dirichlet” and “Neumann” boundary conditions,
respectively. For the heat conduction example, a Dirichlet condition (e.g., u(0) = uL
for a given value uL ) implies that the temperature is known at the boundary while a
Neumann condition (e.g., du/dx(1) = u′R for a given value u′R ) implies that the heat
flux is known instead.
An ODE boundary value problem (ODE-BVP) must necessarily have at least a
second derivative in it in order to require at least one boundary condition at either side
of the domain; however, it could have higher derivatives as well, and if so would need
more boundary conditions at at least one of the two ends of the domain. For example,
the beam bending equation in solid mechanics is of fourth order and thus requires a
total of four boundary conditions, generally taken as two at either end of the domain.
Two different approaches to solving ODE-BVPs will be discussed in this book. The
most general, robust and accurate method of these is the finite difference method, in
which one uses the finite difference schemes developed in chapter 6 and then solves the
resulting system of equations. This approach will be discussed in section 8.2.1. We will
also cover an additional approach, the so-called “shooting method”, that is less robust
but can be highly useful for quick tests due to its simplicity; this will be covered in
section 8.2.3.
We would then find the solution u = A−1 q by using any linear algebra package. Note
how the first and last rows of A are encoding the boundary conditions rather than the
ODE, and how therefore the first and last entries in the right-hand-side vector q are
related to the boundary conditions rather than the source term q(x).
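As a concrete sketch of this assembly (with Dirichlet conditions at both ends for simplicity, a choice made here for the example), consider $u'' = 2$ with $u(0) = 0$, $u(1) = 1$, whose exact solution $u = x^2$ the second-order scheme reproduces to machine precision since the scheme is exact for quadratics:

```python
import numpy as np

N = 100
x = np.linspace(0.0, 1.0, N + 1)
h = x[1] - x[0]

A = np.zeros((N + 1, N + 1))
q = np.full(N + 1, 2.0)          # constant source term q(x) = 2
A[0, 0] = 1.0; q[0] = 0.0        # Dirichlet: u(0) = 0
A[N, N] = 1.0; q[N] = 1.0        # Dirichlet: u(1) = 1
for j in range(1, N):            # interior: (u_{j-1} - 2u_j + u_{j+1})/h^2 = q_j
    A[j, j - 1:j + 2] = [1 / h**2, -2 / h**2, 1 / h**2]

u = np.linalg.solve(A, q)        # exact solution is u(x) = x^2
```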
If we desired greater accuracy, we could use more accurate finite difference schemes
for either the interior points and/or for the Neumann boundary condition. For example,
we could use a 4th-order accurate scheme for the 2nd derivative to replace Eqn. (8.11)
by
$$\frac{-u_{j-2} + 16u_{j-1} - 30u_j + 16u_{j+1} - u_{j+2}}{12h^2} = q_j\,, \qquad j = 2, 3, \ldots, N-2\,.$$
This could of course not be used at grid points j = 1 and j = N − 1, for which we would
need to either keep the 2nd-order accurate scheme or derive a partially right-biased
scheme that would utilize one neighbor on one side and maybe three on the other side.
Let us for simplicity’s sake choose the former option here.
Similarly, we could use a higher-order finite difference scheme to approximate the
Neumann boundary condition, say
$$\left.\frac{du}{dx}\right|_N \approx \frac{u_{N-2} - 4u_{N-1} + 3u_N}{2h}\,.$$
With all this, the new version of the system of equations would have the modified matrix
$$A = \begin{pmatrix}
1 & 0 & \cdots & & & & \\
1/h^2 & -2/h^2 & 1/h^2 & 0 & \cdots & & \\
-\frac{1}{12h^2} & \frac{4}{3h^2} & -\frac{5}{2h^2} & \frac{4}{3h^2} & -\frac{1}{12h^2} & \cdots & \\
& \ddots & \ddots & \ddots & \ddots & \ddots & \\
& \cdots & -\frac{1}{12h^2} & \frac{4}{3h^2} & -\frac{5}{2h^2} & \frac{4}{3h^2} & -\frac{1}{12h^2} \\
& & \cdots & 0 & 1/h^2 & -2/h^2 & 1/h^2 \\
& & \cdots & 0 & \frac{1}{2h} & -\frac{2}{h} & \frac{3}{2h}
\end{pmatrix}.$$
linear, and we will need to find ways to deal with that. The basic problem can be con-
densed into the fact that all of our linear algebra tools for solving systems of equations
apply to linear systems. Therefore, the general strategy for solving nonlinear systems
of equations is to create an iterative method in which we make an initial guess for the
solution u (let’s call it u0 ), and where we then seek to iteratively improve this guess by
solving a sequence of linear problems. The general strategy is identical to the one in the
Newton-Raphson method for root-finding: at each iteration we create a linear problem
that approximates the true problem, and we then solve that linear problem to find the
“new” guess for the solution.
Nonlinearities can enter the problem in two ways in Eqn. (8.10), either through
the source term q(u, x) or through the introduction of a solution-dependent coefficient
multiplying the differential operator. Let us deal with each case one at a time.
If we view our ODE-BVP (8.10) as a model for a heat conduction problem, then we
can easily imagine a source term of the form
$$q(u, x) = a(x) + b(x)\,u + c(x)\,u^4\,,$$
where a(x) could describe a fixed heat source (e.g., an electric heater with a given
current through it), b(x)u could describe convective cooling (e.g., the flow of cold air
at given velocity, with the heat loss proportional to the temperature difference), and
c(x)u4 could describe radiative cooling.
In our quest for a linearized equation, we can approximate the source term as
$$q(u, x) \approx a(x) + b(x)\,u + c(x)\left[u_{\rm base}^4 + 4u_{\rm base}^3\,(u - u_{\rm base})\right]\,,$$
where the term in square brackets is simply the Taylor expansion of $u^4$ around some
known solution ubase . In the context of an iterative method where the solution at
iteration counter n is known, we would then say that ubase is the known solution at
iteration n and that u is the unknown solution at iteration n + 1.
The full linearized ODE is then (at interior points, assuming the 2nd-order accurate
finite difference scheme for simplicity)
$$\frac{u_{j-1,n+1} - 2u_{j,n+1} + u_{j+1,n+1}}{h^2} = a_j + b_j u_{j,n+1} + c_j u_{j,n}^4 + 4c_j u_{j,n}^3\left(u_{j,n+1} - u_{j,n}\right)\,,$$
where uj,n is the solution at grid point j and iteration n. We would then move all terms
including $u_{j,n+1}$ to the left-hand-side to get
$$\frac{u_{j-1,n+1} - 2u_{j,n+1} + u_{j+1,n+1}}{h^2} - b_j u_{j,n+1} - 4c_j u_{j,n}^3 u_{j,n+1} = a_j + c_j u_{j,n}^4 - 4c_j u_{j,n}^3 u_{j,n}\,.$$
The final step is then to encode this in a matrix-vector form so we can use linear
algebra tools to solve the linear system of equations at each iteration. Close inspection
of the left-hand-side shows that the new matrix will be similar to the old matrix (which
encoded the differential operator) but modified along the diagonal with all the linear
terms from the linearized source term.
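To make this concrete, here is a sketch for the particular source $q = u^4$ (i.e., $a = b = 0$, $c = 1$, a radiative-cooling-type term chosen for this example) with $u(0) = 0$, $u(1) = 1$; the discretization and linearization follow the equations above:

```python
import numpy as np

N = 50
x = np.linspace(0.0, 1.0, N + 1)
h = x[1] - x[0]

u = x.copy()                               # initial guess satisfying both BCs
for _ in range(100):
    A = np.zeros((N + 1, N + 1))
    rhs = np.zeros(N + 1)
    A[0, 0] = 1.0; rhs[0] = 0.0            # Dirichlet: u(0) = 0
    A[N, N] = 1.0; rhs[N] = 1.0            # Dirichlet: u(1) = 1
    for j in range(1, N):
        A[j, j - 1:j + 2] = [1 / h**2, -2 / h**2, 1 / h**2]
        A[j, j] -= 4.0 * u[j]**3           # linear part of the linearized source
        rhs[j] = u[j]**4 - 4.0 * u[j]**4   # i.e. -3*u_n^4, moved to the right
    u_new = np.linalg.solve(A, rhs)
    converged = np.max(np.abs(u_new - u)) < 1e-12
    u = u_new
    if converged:
        break
```

At convergence the discrete nonlinear equation $(u_{j-1} - 2u_j + u_{j+1})/h^2 = u_j^4$ is satisfied at every interior point.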
We can also do the manipulations caused by the linearization of the source term directly in the matrix-vector form of the equation, in which the differential operator is encoded in a matrix and the point-wise products are expressed through diagonal matrices, where diag(v) is a diagonal matrix with the elements of the vector v on the diagonal.
The use of all these diagonal matrices may seem strange, but is necessitated by the fact
that the “point-wise” product $b_j u_j$ is not a vector-vector product – instead, it is
$$\mathrm{diag}(b)\,u = \begin{pmatrix} b_0 & 0 & \cdots & \\ 0 & b_1 & 0 & \cdots \\ \vdots & \ddots & \ddots & 0 \\ & \cdots & 0 & b_N \end{pmatrix} \begin{pmatrix} u_0 \\ u_1 \\ \vdots \\ u_N \end{pmatrix} = \begin{pmatrix} b_0 u_0 \\ b_1 u_1 \\ \vdots \\ b_N u_N \end{pmatrix}.$$
This is, of course, the same as we would have gotten by performing the linearization
first and then writing it in matrix-vector form.
Nonlinearities can also enter the problem through a solution-dependent coefficient
multiplying the differential operator. If we view our ODE-BVP (8.10) as a model for
a heat conduction problem, imagine having a heat conductivity d on the left-hand-side
that depends on the solution (temperature in our view of the problem) u. Our ODE in
matrix-vector form is then
$$\mathrm{diag}(d(u))\,A\,u = q\,.$$
In this case it is the left-hand-side that is nonlinear and requires attention. A true
linearization of the left-hand-side is
$$\mathrm{diag}(d(u))\,A\,u \approx \mathrm{diag}(d(u_{\rm base}))\,A\,u + \mathrm{diag}\!\left(\left.\frac{\partial d}{\partial u}\right|_{\rm base}(u - u_{\rm base})\right) A\,u_{\rm base}\,.$$
If the sensitivity in the coefficient (or “conductivity”) ∂d/∂u is easy to find, one could
implement this true linearization; however, in practice it is quite common to neglect the
second term and instead solve
$$\mathrm{diag}(d(u_n))\,A\,u_{n+1} = q$$
at each iteration.
The linearization technique described above does not guarantee that the iterative
solution process will converge. A different technique for stabilizing the convergence
process is the idea of “under-relaxation”, in essence the idea of slowing down the con-
vergence process to avoid over-shoots. Imagine that our linearized system is
\[
A^n u^{n+1} = q^n ,
\]
where Aⁿ and qⁿ now contain all the linearized terms. The iterative process would
then produce a sequence of solutions u¹, u², . . ., each of which is the result of solving the
linearized problem. If this sequence diverges (“blows up”), one can instead try to update
u as
\[
A^n v = q^n , \qquad u^{n+1} = \alpha v + (1 - \alpha) u^n ,
\]
where α ∈ (0, 1] is the under-relaxation parameter. In other words, one would solve the
linear problem to find v, but instead of taking this as the new estimate of the solution
un+1 , one takes only a small portion of v when updating the solution. This will slow
down the convergence, which is generally a bad thing – but it may be needed in cases
where the solution blows up.
Under-relaxation is arguably more art than science, in the sense that one must use
trial-and-error to choose the appropriate value for α. In practice, one should attempt to
devise a solution algorithm that does not require it, but if all else fails, under-relaxation
can be useful.
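A toy sketch of under-relaxation; the update map g below is contrived (it stands in for "solve the linearized problem") and is chosen so that the plain iteration with α = 1 diverges:

```python
def relaxed_iteration(g, u0, alpha, n_iter=60):
    """Under-relaxed fixed-point iteration: u^{n+1} = alpha*v + (1-alpha)*u^n,
    where v = g(u^n) plays the role of solving the linearized problem."""
    u = u0
    for _ in range(n_iter):
        v = g(u)
        u = alpha * v + (1.0 - alpha) * u
    return u

# contrived update map with |dg/du| = 1.3 > 1, so the plain iteration
# (alpha = 1) diverges; the fixed point is u* = 2.6/2.3
g = lambda u: 2.6 - 1.3 * u
u_star = 2.6 / 2.3
```

With α = 0.5 the effective update factor is |1 − 2.3·0.5| = 0.15 < 1, so the relaxed iteration converges even though the plain one blows up; this is exactly the slowing-down effect described above.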
Algorithm 8.1: the shooting method (for the ODE-BVP (8.10) with boundary
conditions u(0) = uL and u(1) = uR).
guess an initial value for u′L
while not converged,
solve the “initial value problem” with u(0) = uL and u′ (0) = u′L for u(x)
if u(1) ̸= uR , adjust u′L and try again
end while
The iterative process for finding the correct value of u′L is actually a root-finding process:
we are seeking the value of u′L that makes the misfit in the right boundary condition
f (u′L ) = u(1) − uR equal to zero. It is not trivial to find the derivative of this misfit,
and hence the Newton-Raphson root-finding method is generally not useful here, but we
could use either the bisection or the secant method with no problems. The only difference
compared to our prior uses of the root-finding methods is that each function evaluation
(in the context of the root-finding method) now involves solving an ODE initial value
problem.
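A minimal sketch of Algorithm 8.1 with RK4 for the inner initial value problem and the secant method for the root-finding; the test problem u″ = 6x with u(0) = 0, u(1) = 1 (exact solution u = x³, so the correct slope is u′(0) = 0) is our own illustrative choice, not from the text:

```python
def integrate_ivp(s, n=100):
    """RK4 integration of u'' = 6x on [0,1] with u(0)=0, u'(0)=s; returns u(1)."""
    h = 1.0 / n
    x, u, up = 0.0, 0.0, s

    def f(x, u, up):      # first-order system: (u, u')' = (u', 6x)
        return up, 6.0 * x

    for _ in range(n):
        k1u, k1p = f(x, u, up)
        k2u, k2p = f(x + h/2, u + h/2*k1u, up + h/2*k1p)
        k3u, k3p = f(x + h/2, u + h/2*k2u, up + h/2*k2p)
        k4u, k4p = f(x + h, u + h*k3u, up + h*k3p)
        u += h/6 * (k1u + 2*k2u + 2*k3u + k4u)
        up += h/6 * (k1p + 2*k2p + 2*k3p + k4p)
        x += h
    return u

def shoot(uR=1.0, s0=-1.0, s1=1.0, tol=1e-10, max_iter=50):
    """Secant iteration on the misfit f(s) = u(1; s) - uR."""
    f0, f1 = integrate_ivp(s0) - uR, integrate_ivp(s1) - uR
    for _ in range(max_iter):
        if abs(f1) < tol:
            break
        s0, s1 = s1, s1 - f1 * (s1 - s0) / (f1 - f0)
        f0, f1 = f1, integrate_ivp(s1) - uR
    return s1
```

Each secant "function evaluation" is a full initial-value solve, exactly as described above; for this linear test problem the misfit is affine in s, so the secant method converges in a single update.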
Chapter 9
Partial differential equations
Partial differential equations (PDEs) have derivatives with respect to more than one
variable. In this book we will cover only two-dimensional cases, like
\[
\frac{\partial u}{\partial t} - \alpha \frac{\partial^2 u}{\partial x^2} = 0 , \tag{9.1a}
\]
\[
\frac{\partial u}{\partial t} + c \frac{\partial u}{\partial x} = 0 , \tag{9.1b}
\]
\[
\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = q . \tag{9.1c}
\]
These three examples were chosen on purpose: they represent the simplest examples of
the three different categories of PDEs. As we will see in this chapter, each category of
PDE requires slightly different numerical methods.
Equation (9.1a) is a “parabolic” equation. From a physics point-of-view, it could
represent an unsteady diffusion process – a hopefully familiar example would be that
of unsteady heat conduction, with u representing temperature and α representing heat
conductivity (actually, heat diffusivity). This equation needs an initial condition, mean-
ing the specification of the solution at time t = 0 (i.e., u(x, 0) = uI (x)). It also needs
boundary conditions at both ends, say at x = 0 and x = 1. These boundary condi-
tions could be either of Dirichlet type with a specific value for the solution at each time
(e.g., u(0, t) = uL (t)) or of Neumann type with a specific value for the derivative of the
solution at each time (e.g., ∂u/∂x|x=1 = u′R (t)). If we interpret this equation as un-
steady heat conduction, then these boundary conditions would imply fixed temperature
or fixed heat flux.
Equation (9.1b) is a “hyperbolic” equation, generally called a first-order wave equa-
tion. It describes wave propagation in a single direction, with wave speed c. It requires
an initial condition and a boundary condition at only one of the two ends: if c > 0, the
waves propagate to the right, and a boundary condition is needed on the left side of the
domain but not on the right side.
Equation (9.1c) is an “elliptic” equation. Physically it could represent steady heat
conduction in two dimensions, with u being the temperature. It requires boundary con-
ditions at all sides of the domain, but no initial condition as there is no time dimension.
CHAPTER 9. PARTIAL DIFFERENTIAL EQUATIONS 80
However, if we look carefully we realize that this will not be trivial for the boundary points.
Specifically, how can we encode the boundary condition u0 = uL (t) in the context of
an ODE-IVP with a time derivative? We could at best encode du0 /dt = duL /dt, but
this would not guarantee the right value at the boundary (e.g., if u0 was too large,
there would be no mechanism for bringing it back down to the right value). Actually,
when thinking about this subtle issue of how to encode the boundary condition, we
at some point realize that the easiest solution is to simply not solve the boundary
points as ODE-IVPs. In other words, we should define our unknown solution vector as
u = (u1 , u2 , . . . , uN −1 )T , i.e., without the first and last points. If we do that, we can
write our spatially discretized problem as
\[
\frac{d}{dt}
\underbrace{\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-2} \\ u_{N-1} \end{pmatrix}}_{u}
= \frac{\alpha}{\Delta x^2}
\underbrace{\begin{pmatrix}
-2 & 1 & & & \\
1 & -2 & 1 & & \\
& \ddots & \ddots & \ddots & \\
& & 1 & -2 & 1 \\
& & & 1 & -1
\end{pmatrix}}_{A}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-2} \\ u_{N-1} \end{pmatrix}
+
\begin{pmatrix} \alpha\, u_L(t)/\Delta x^2 \\ 0 \\ \vdots \\ 0 \\ \alpha\, u_R'(t)/\Delta x \end{pmatrix} .
\tag{9.3}
\]
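A sketch of how this semi-discrete system could be assembled in code; the function and variable names, and the uniform grid on [0, 1], are our own choices:

```python
import numpy as np

def heat_system(N, alpha, uL, uRp):
    """Assemble A and source b for du/dt = A u + b:
    Dirichlet u(0) = uL, Neumann u'(1) = uRp, unknowns u_1 .. u_{N-1}."""
    dx = 1.0 / N
    n = N - 1
    A = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
         + np.diag(np.ones(n - 1), -1))
    A[-1, -1] = -1.0              # last row (1, -1): Neumann BC eliminated u_N
    A *= alpha / dx**2
    b = np.zeros(n)
    b[0] = alpha / dx**2 * uL     # Dirichlet BC enters as a source term
    b[-1] = alpha / dx * uRp      # Neumann BC enters as a source term
    return A, b
```

One sanity check: with u′R = 0 the steady state A u + b = 0 must be the constant solution u ≡ uL, since zero flux at the right end lets the left-boundary temperature fill the whole domain.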
Note that we used the Neumann BC on the right side to solve for uN (i.e., uN = uN−1 + ∆x u′R(t)), which when explicitly entered into the equation for grid point j = N − 1 yields
\[
\frac{du_{N-1}}{dt} = \frac{\alpha}{\Delta x^2}\left( u_{N-2} - u_{N-1} \right) + \frac{\alpha}{\Delta x}\, u_R'(t) ,
\]
i.e., the last row of (9.3).
To estimate the eigenvalues of A, we assume (for the purposes of the analysis) a periodic domain, write the solution as a backward Fourier transform (Eqn. (4.1) in section 4.1), and get from Eqn. (9.2)
\[
\sum_l \frac{d\hat u_l}{dt}\, e^{\imath 2\pi l j/N}
= \frac{\alpha}{\Delta x^2} \sum_l \hat u_l \left[ e^{\imath 2\pi l (j-1)/N} - 2 e^{\imath 2\pi l j/N} + e^{\imath 2\pi l (j+1)/N} \right]
\]
\[
= \frac{\alpha}{\Delta x^2} \sum_l \hat u_l\, e^{\imath 2\pi l j/N} \left[ e^{-\imath 2\pi l/N} - 2 + e^{\imath 2\pi l/N} \right]
= \frac{\alpha}{\Delta x^2} \sum_l \hat u_l\, e^{\imath 2\pi l j/N} \left[ 2\cos(2\pi l/N) - 2 \right] .
\]
We now use the fact that 2πl/N = kl ∆x in the argument of the cos function since
this will aid our intuitive understanding later. Orthogonality of the Fourier modes then
implies that (see section 4.1)
\[
\frac{d\hat u_l}{dt} = -\frac{\alpha}{\Delta x^2} \left[ 2 - 2\cos(k_l \Delta x) \right] \hat u_l .
\]
The factor multiplying ûl on the right-hand-side is actually the eigenvalue (one for each
l) of the matrix A (and the eigenvectors are the Fourier modes exp(ı2πjl/N )). We thus
see that the eigenvalues of A for this problem are negative real values, and that they
are ≤ 4α/∆x2 in magnitude. Going back to the ODE-IVP solvers, we then realize that
explicit Euler would work well for this problem, and that stability requires 4α∆t/∆x2 ≤
2. Actually, since our analysis ignored the boundaries and assumed periodicity, we
should view this stability bound as approximate.
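These eigenvalue claims are easy to check numerically; the sketch below uses a Dirichlet–Dirichlet variant of the matrix for simplicity (the values of N and α are arbitrary):

```python
import numpy as np

N, alpha = 32, 1.0
dx = 1.0 / N
n = N - 1
# A from the second-order central scheme with Dirichlet BCs at both ends
A = (alpha / dx**2) * (np.diag(-2.0 * np.ones(n))
                       + np.diag(np.ones(n - 1), 1)
                       + np.diag(np.ones(n - 1), -1))
lam = np.linalg.eigvals(A)
rho = np.max(np.abs(lam))   # spectral radius, expected just below 4*alpha/dx**2
```

The computed eigenvalues are real and negative, with spectral radius just below 4α/∆x², consistent with the Fourier estimate above.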
The maximum (in magnitude) eigenvalue of a matrix is called the spectral radius,
equal to 4α/∆x2 for this problem. The spectral radius must necessarily have units of
inverse time, and the α/∆x2 part is entirely due to the nature of the PDE. The factor of
4, on the other hand, comes from our choice of the second-order accurate finite difference
scheme; had we chosen a more accurate scheme, this factor would have been larger.
In practice we would probably use a non-uniform grid, and the diffusivity α might
not be constant. In that case we could still apply the approximate analysis using Fourier
modes to find the spectral radius, and we would need to take the largest value of α/∆x2
to estimate the spectral radius; it would, of course, be even more approximate than
before.
We again follow the same “method of lines” process of first discretizing in space
and then using an ODE-IVP solver to integrate the system in time. Let us again use
a uniform spatial grid xj = ∆x j, j = 0, 1, . . . , N , and let us use the second-order
accurate central difference scheme for the first derivative. We need to solve the PDE at
all points except j = 0 where we instead use the boundary condition: in contrast to the
parabolic case, we must now solve the PDE at the last grid point j = N since we have
no boundary condition there. At this last point we can’t apply the central difference
scheme, and therefore must use a left-biased scheme instead. We then get the equation
\[
\frac{du_j}{dt} = -c\, \frac{u_{j+1} - u_{j-1}}{2\Delta x} , \quad j = 1, 2, \ldots, N-1 ,
\qquad
\frac{du_j}{dt} = -c\, \frac{u_j - u_{j-1}}{\Delta x} , \quad j = N .
\tag{9.4}
\]
Following the discussion of the parabolic problem, it is easiest to define our vector of
unknowns as u = (u1 , u2 , . . . , uN )T , i.e., to contain the solution at all grid points except
for where we have a boundary condition (so this vector has one additional element
compared to the parabolic case). We then get the system of ODE-IVPs
\[
\frac{d}{dt}
\underbrace{\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \end{pmatrix}}_{u}
= -\frac{c}{\Delta x}
\underbrace{\begin{pmatrix}
0 & 0.5 & & & \\
-0.5 & 0 & 0.5 & & \\
& \ddots & \ddots & \ddots & \\
& & -0.5 & 0 & 0.5 \\
& & & -1 & 1
\end{pmatrix}}_{A}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \end{pmatrix}
+ \frac{c}{\Delta x}
\begin{pmatrix} 0.5\, u_L(t) \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix} .
\tag{9.5}
\]
Just like for the parabolic problem, the boundary condition has become (after dis-
cretization in space) a source term that forces the solution near the boundary, which
the interior dynamics (described by the PDE) then propagates into the domain.
We again face the same problem of having to estimate the eigenvalues of the A matrix
in order to first choose a suitable ODE-IVP solver and then decide on the largest stable
time step ∆t. Exactly as before, this is best accomplished in an approximate manner,
by assuming (just for the sake of this estimation) that we have a periodic domain for
which we can apply the discrete Fourier transform. Orthogonality of the Fourier modes
then leads to, for the central difference scheme,
\[
\frac{d\hat u_l}{dt} = -\frac{c}{\Delta x}\, \frac{e^{\imath k_l \Delta x} - e^{-\imath k_l \Delta x}}{2}\, \hat u_l
= -\imath\, \frac{c}{\Delta x} \sin(k_l \Delta x)\, \hat u_l .
\]
Note that the factor multiplying ûl (i.e., the eigenvalue for the periodic domain case)
is now purely imaginary, with a maximum magnitude of c/∆x. This is very impor-
tant information, because it tells us that explicit Euler will blow up for this problem!
Actually, the same is true for two-stage Runge-Kutta, as both of those schemes are un-
stable for purely imaginary eigenvalues. In contrast, three- and four-stage Runge-Kutta
methods do contain parts of the imaginary axis in their stability regions, and they are
therefore very suitable to this hyperbolic problem. If we use RK4, say, we would need
c∆t/∆x ≲ 2.8 (where the stability region intersects the imaginary axis), which we can
use to choose a stable time step.
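The purely imaginary character of the central-difference eigenvalues can be verified directly on a periodic domain (a sketch; the values of N and c are arbitrary):

```python
import numpy as np

N, c = 64, 1.0
dx = 1.0 / N
# periodic central-difference advection matrix:
#   du_j/dt = -c (u_{j+1} - u_{j-1}) / (2 dx)
M = np.zeros((N, N))
for j in range(N):
    M[j, (j + 1) % N] = -c / (2.0 * dx)
    M[j, (j - 1) % N] = +c / (2.0 * dx)
lam = np.linalg.eigvals(M)
```

The matrix is skew-symmetric, so its eigenvalues lie on the imaginary axis with maximum magnitude c/∆x, which is what forces us towards RK3/RK4 rather than explicit Euler.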
An alternative way to achieve a stable solution of this hyperbolic problem is to use
a left-biased finite difference scheme. Imagine that we used
\[
\frac{du_j}{dt} = -c\, \frac{u_j - u_{j-1}}{\Delta x}
\]
throughout the domain. The stability analysis would then end up with
\[
\frac{d\hat u_l}{dt} = -\frac{c}{\Delta x} \left[ 1 - e^{-\imath k_l \Delta x} \right] \hat u_l
= -\frac{c}{\Delta x} \left[ 1 - \cos(k_l \Delta x) + \imath \sin(k_l \Delta x) \right] \hat u_l .
\]
The eigenvalues (the factor multiplying ûl on the right-hand-side) now have a negative
real part! If one plots them for the range of allowable k∆x ∈ [−π, π] one finds that the
eigenvalues lie on a circle in the negative half-plane. Provided that c∆t/∆x ≤ 1, all
eigenvalues actually fall inside the stability region of the explicit Euler scheme, which
implies that we could now use this method.
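The explicit Euler claim can be checked from the amplification factor |1 + ∆t λ(k)| over the allowable wavenumbers (a sketch; the grid spacing is arbitrary):

```python
import numpy as np

c, dx = 1.0, 0.01
kdx = np.linspace(-np.pi, np.pi, 2001)
# upwind eigenvalue factor on a periodic domain
lam = -(c / dx) * (1.0 - np.exp(-1j * kdx))

def max_amplification(nu):
    """Largest explicit-Euler amplification |1 + dt*lam| over all wavenumbers,
    for CFL number nu = c*dt/dx."""
    dt = nu * dx / c
    return np.max(np.abs(1.0 + dt * lam))
```

At the CFL limit c∆t/∆x = 1 the amplification factor is exactly 1 for every wavenumber; beyond it, the highest wavenumbers are amplified and the scheme diverges.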
This example illustrates something very important: the characteristics of a spatially
discretized hyperbolic PDE depend very strongly on the specific finite difference scheme
chosen to approximate the spatial derivative. While the specific choice of finite difference
scheme changes the multiplicative factor in the eigenvalues for parabolic problems, it
has a much more dramatic qualitative effect of changing the nature of the eigenvalues
for hyperbolic problems.
The idea of using a left-biased scheme for this problem is more generally called
“upwinding”. If we solve the same problem but with c < 0 instead, our left-biased
scheme would produce eigenvalues with positive real part, implying divergence; we would
then need to use a right-biased scheme instead to have a stable method. In other words,
we need to bias our finite difference scheme towards the “upstream” direction, or the
direction from which the waves are traveling. We actually don’t need a fully upwind-biased
scheme to achieve this effect; we just need a bit of bias towards the
upstream. For example, consider the finite difference scheme
\[
\frac{du_j}{dt} = -c\, \frac{0.4 u_{j+1} + 0.2 u_j - 0.6 u_{j-1}}{\Delta x} ,
\]
which places a bit more weight on the upstream point versus the downstream point (0.6
versus 0.4). For this scheme, the stability analysis ends up with
\[
\frac{d\hat u_l}{dt} = -\frac{c}{\Delta x} \left[ 0.4 e^{\imath k_l \Delta x} + 0.2 - 0.6 e^{-\imath k_l \Delta x} \right] \hat u_l
\]
\[
= -\frac{c}{\Delta x} \left[ 0.4\cos(k_l \Delta x) + 0.4\imath\sin(k_l \Delta x) + 0.2 - 0.6\cos(k_l \Delta x) + 0.6\imath\sin(k_l \Delta x) \right] \hat u_l
\]
\[
= -\frac{c}{\Delta x} \left[ 0.2 - 0.2\cos(k_l \Delta x) + \imath \sin(k_l \Delta x) \right] \hat u_l .
\]
This also adds some negative real part to the eigenvalues, but with five times lower
magnitude. Explicit Euler would still be stable, but a more detailed analysis reveals
that it would require a five times smaller time step ∆t.
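This algebra, and the factor-of-five conclusion, can be checked numerically by comparing the biased scheme's eigenvalue factor to the fully upwind one (a sketch):

```python
import numpy as np

c, dx = 1.0, 0.01
kdx = np.linspace(-np.pi, np.pi, 2001)
# periodic-domain eigenvalue factors of the two schemes
lam_biased = -(c / dx) * (0.4 * np.exp(1j * kdx) + 0.2 - 0.6 * np.exp(-1j * kdx))
lam_upwind = -(c / dx) * (1.0 - np.exp(-1j * kdx))
# the simplified closed form derived above
lam_formula = -(c / dx) * (0.2 - 0.2 * np.cos(kdx) + 1j * np.sin(kdx))
```

The biased scheme's real part is exactly one fifth of the fully upwind scheme's at every wavenumber, which is why explicit Euler needs a correspondingly smaller time step.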