Applied Numerical Methods
Steve Moore
October 6, 2016
Contents
Preface
I Systems of Equations
1 Introduction
2 Direct Methods
3 Iterative Methods
20 OpenMP
21 MPI
22 OpenCL
V Applications
Bibliography
Nomenclature
Glossary
Preface
Differential equations are applied everywhere in science and engineering. They can
be used to describe many physical phenomena ranging from the smallest scales,
where quantum mechanics is required, through the everyday scales, where Newtonian and continuum mechanics apply, up to the largest scales where relativistic
mechanics is required. For the most part, we can only obtain analytical solutions
to the simplest of these differential equations and have to turn to numerical methods implemented in computer programs to obtain numerical solutions to the rest.
With the ever increasing power of computers and modeling software, this area is
continually growing and so an understanding of numerical methods and how they
are applied is of great importance.
The aim of this book is hence to provide a basic introduction to a range of numerical methods commonly used in practice and to show how they are applied in
solving ‘real world’ problems. The end goal is that one should have sufficient understanding to develop a parallel program, designed to run on a supercomputer, that
can solve a problem in electromagnetics, fluid, or solid mechanics, thermodynamics,
or molecular dynamics.
In order to reach this end goal the structure of this book has been designed
such that we gradually ‘build up’ our understanding of numerical methods and
programming ability. In Part I we will spend some time studying methods for solving
linear systems of equations, with a subsequent extension to nonlinear systems. The
reason for this is twofold; firstly it will give us a chance to create some simple
programs and ‘get a feel’ for the basic structure of scientific software. Secondly, the
solution of differential equations frequently reduces to finding the solution of a linear
system of equations, so we will need to know these methods in order to proceed. In
Part II we will spend some time studying methods for solving ordinary differential
equations. Now while these methods can be applied to solving many real world
problems directly, finding the solution of a partial differential equation frequently
reduces to finding the solution of a system of ordinary differential equations, so it is
important that we have this understanding in place before we move on. In Part III
we will spend some time studying methods for solving partial differential equations.
Again, these methods can be applied to solving many real world problems, but what
we want to be able to do is develop parallel programs, and before we can do this, we
have to have a solid understanding of these methods. In Part IV we will spend some
time studying some methods by which we can turn the serial programs developed
up until that point, into parallel programs that can be executed on a supercomputer.
Finally, once we have all of these tools in place, Part V will work through a number
of applications, outlining the relevant physics and deriving the governing differential
equations, then developing a parallel program to perform some sort of simulation.
The various parts of this book will cover quite a wide range of topics in the mathematical, physical, and computer sciences. It needs to be emphasized however that
this book is not designed to provide an in depth coverage of any of these topics,
or particular numerical methods. To elaborate on this point, this is not a book on
linear algebra, nor is it a book on partial differential equations, nor is it a book
on parallel computing or a book on continuum mechanics. In fact for most of the
chapters present in this book, one could find entire books (or in some cases a series
of books) dedicated to the subject. This book is not aiming to compete with other
texts; and in fact references will be given to some of the more detailed texts whenever appropriate. Each chapter will only provide the ‘bare minimum’ necessary to
get a computer program ‘up and running’. What this book will do however, is show
you how all of these areas are applied together and implemented in the form of a
working computer program. So perhaps the best way to think of this book is as a supplementary text that ties all of these other topics together, a bit like a ‘sampler
pack’ with an assortment of methods.
Since we will be developing computer programs we will have to choose a pro-
gramming language with which to do so. While the disadvantage is that we will be
restricting ourselves somewhat, the advantage is that we will get the most detailed
‘nuts and bolts’ understanding of how the methods really work. The question is
then which language do we choose? As it happens we will use two different lan-
guages, namely Matlab and C++. Now although it sounds like we’ll be doubling the
amount of effort required, there is a good reason for using both of these program-
ming languages. The reasons for using Matlab are that firstly, the programming
environment is relatively easy to use; and we can create a prototype program implementing a numerical method relatively quickly. The most important reason however,
is that the nature of the language means that we can often write code that closely
resembles the mathematics, which will therefore aid in our understanding of how it
is implemented. Having pointed out these desirable features, one might then ask
the question, why should we use C++? Again, we have some good reasons for using
this programming language too. For starters we can generally create C++ programs
that execute much faster and perform larger calculations than a Matlab program
and furthermore, when we come to developing parallel programs, the methods that
we will be using won’t work with Matlab, but can be integrated with C++ code.
Another good reason for choosing C++ is that it is a ‘lower level’ language that exposes us to some issues (memory allocation for example) that we generally don’t
have to worry about with Matlab. Now although this might seem like we’re just
making life more difficult by getting ‘closer to the computer’, it is important that
we have a reasonable understanding of how computers work. So that being said, for
some numerical methods we encounter we will create Matlab programs, others we
will create C++ programs, and some we will create both, so that we can compare the
features of each language. No matter how we do it, there’s no escaping the fact that
we will have to spend some time looking through code and while you may find this
a painful and daunting prospect, this is really the best way to get an understanding of how numerical methods work. As it happens however, these two languages
are syntactically quite similar in a number of ways, so translating between the two
should not be too difficult. An important point to always remember throughout
this book is that there are many ways in which a computer program can be written
to implement a given numerical method. Some ways are better than others, some
make no difference at all. We will be taking the approach of picking one way so that
we get the job done. On this point, we will also have to strike a bit of a balance
between writing code that is efficient and writing code that is easy to understand.
Often these two goals are mutually exclusive; and if one wants highly optimized
code, it may end up looking nothing like the numerical method that it’s supposed
to be implementing; even though it is. In our case, with the Matlab examples we
will be mostly concerned with ease of understanding and with the C++ examples we
will be more concerned with efficiency.
Given the outline of this book, the assumed prerequisite background knowledge
will include some basic linear algebra and calculus, plus some familiarity with Matlab
and C++. All of the programs that we will create will be designed such that they can
be executed with little to no input required from the user and as we implement the
numerical methods we will walk through the process explaining the various steps
and the reasons for doing so. For anyone already familiar with C++ you will notice
that our use of the language will include minimal use of the object oriented concepts
(which are in fact a key feature of the language). The reason for this is to not
complicate the implementation of a numerical method by ‘wrapping it up’ in object
oriented code; but we will instead create a small number of classes when it simplifies
a particular program. As a final point, a desired feature of this book is that one
can get the programs up and running without having to buy anything. For users of
Mac, Windows, or Linux, there are free C++ compilers out there which can turn the
C++ source codes provided with this book into executable programs. Matlab on the
other hand is a commercial product that one must pay for and so to this end all of
the Matlab source codes are designed such that they can be executed with Octave,
which is free.
The suggested approach for reading this book is of course, first and foremost
in the order it was written in. If however you already have some familiarity with
any one of the topics you can skip ahead and backtrack as necessary. At each step
throughout the book, references will be made to the previous ‘building blocks’ in
order to make this as easy as possible. Having now outlined the basic structure of
the book and the reason behind this choice of structure we are now in a position to
begin.
Figure 1: Some example applications in science and engineering which involve the
solution of differential equations.
Part I
Systems of Equations
Chapter 1
Introduction
In this part of the book we are going to investigate a number of different methods
for solving systems of algebraic equations. We will first look at a number of direct
methods, where the number of computational operations required to solve a system is
fixed and predetermined by the nature of the system. We will then look at a number
of iterative methods, which involve ‘guessing’ the solution and iteratively refining
that guess, in which case the number of operations is not fixed. Then, finally we will
look at how we extend these techniques to solving systems of nonlinear algebraic
equations. As we do this we will create some simple Matlab and C++ programs so
that we ‘get a feel’ for the common structure of scientific software. Before we begin
investigating these methods however we need to outline some terminology and make
some definitions that will be used throughout the book.
The most general form for a linear system of equations can be expressed as:
Aφ = b
or, written another way:
\sum_{n=1}^{N} a_{m,n} \, \phi_n = b_m
where we have:

A = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,N} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{M,1} & a_{M,2} & \cdots & a_{M,N} \end{bmatrix} \quad
\phi = \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_N \end{bmatrix} \quad
b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_M \end{bmatrix}
We will try to keep our notation as consistent as possible throughout the book. With that in mind, remember that we will use
the subscripts m and n to denote indices into matrices and vectors, that will have
dimensions M and N . Later on, as our programs become more complex and we
reserve M and N for the definition of different quantities, we will instead use Nrow and
Ncol to mean the same thing. Most of the time, the number of equations M will be
equal to the number of unknowns N so that we have a square matrix meaning that
we can find a unique solution for φ. In principle, we can solve our linear system as:
\phi = A^{-1} b
but, as we shall see, we pretty much never attempt to directly compute
the inverse of a matrix in practice, since it generally requires a prohibitive number
of computational operations to do so; instead we look for other ‘smarter’ ways
to solve the system. There will also be a few times throughout this book where
M > N , meaning that we have an over-determined system with more equations
than unknowns. We will cross this bridge when we come to it, but for now, let’s
just say that what we do in this case is to obtain a solution which is the ‘best fit’
for all of the unknowns. If M < N , meaning that we have an under-determined
system with more unknowns than equations then there will be no unique solution,
but fortunately the differential equations that we will encounter in this book will
always result in a square matrix with a unique solution.
Continuing with the definitions, a matrix is termed symmetric if AT = A (i.e.
am,n = an,m ) and skew symmetric if AT = −A (i.e. am,n = −an,m ). A matrix is
termed orthogonal if AT = A−1 and is termed Hermitian if A = AH . Here, the
superscript H denotes the conjugate transpose, which is equivalent to transposing
the matrix, then taking the complex conjugate of each entry (i.e. replacing i with
−i in any of the matrix entries which happen to be complex numbers). Finally, a
matrix is termed unitary if AAH = AH A = I, where I denotes the identity matrix.
Another very important classification is that we call a square matrix positive definite
if:
v^T A v > 0 \quad \forall\, v \neq 0
So for any N × 1 vector v not composed entirely of zeros, the above product will
yield a real number greater than zero. A more useful definition of a positive definite
matrix is one that has all eigenvalues greater than zero. The concept of positive
definiteness plays an important role in certain methods that we will cover later. We
call a matrix an M-matrix if:
a_{m,n} \leq 0 \quad \forall\, m \neq n

(A^{-1})_{m,n} \geq 0 \quad \forall\, m, n

and A is nonsingular. Here, all of the off-diagonal entries that are not zero are
negative. Another important property of a matrix that will be important later is its diagonal dominance. A matrix is termed diagonally dominant if, in every row, the entry on the main diagonal is greater than the sum of the absolute values of the off-diagonal entries in that row. The excess amount:
\min_{m} \left( a_{m,m} - \sum_{n \neq m} |a_{m,n}| \right)
is called the diagonal dominance of the matrix. This property will have important
implications when we come to investigating iterative methods for solving linear
systems.
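As a small concrete illustration (a sketch, not taken from any program developed later in the book), the diagonal dominance of a dense N × N matrix stored row by row in a 1D array could be computed in C++ as:

#include <cmath>

// Returns min over m of (a_mm - sum of |a_mn| for n != m) for a square
// N x N matrix stored row by row; a positive result indicates dominance.
double diagonalDominance(const double* A, int N)
{
    double dominance = 0.0;
    for(int m=0; m<N; m++)
    {
        double offDiagSum = 0.0;
        for(int n=0; n<N; n++)
        {
            if(n != m) offDiagSum += std::fabs(A[m*N + n]);
        }
        double excess = A[m*N + m] - offDiagSum;
        if(m == 0 || excess < dominance) dominance = excess;
    }
    return dominance;
}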
With vectors we can define a number of quantities called norms. These norms
are a single number and in a sense describe a vector. More formally, they are
functions that assign a strictly positive ‘length’ or ‘size’ to a given vector. Perhaps
the simplest of these is the one norm, defined as:
\|v\|_1 = \sum_{n=1}^{N} |v_n|
which is the sum of the absolute values of all of the entries in v. Another norm is
the p norm, defined as:
\|v\|_p = \left( \sum_{n=1}^{N} |v_n|^p \right)^{1/p}
For the case where p = 2, we would call this the two norm, which can be observed to be analogous to Pythagoras’ theorem in N dimensions. The final vector norm that we will consider is the infinity norm, defined as:

\|v\|_\infty = \max_{1 \leq n \leq N} |v_n|

which is the largest absolute value of any entry in v. Norms can also be defined for matrices. One important category is the induced (or operator) norm, which is defined in terms of a vector norm as:

\|A\|_p = \max_{v \neq 0} \frac{\|A v\|_p}{\|v\|_p}
If we let p = 1 then we get the one norm:
\|A\|_1 = \max_{1 \leq n \leq N} \sum_{m=1}^{M} |a_{m,n}|
which is the maximum absolute column sum (for each column we sum the absolute values of the entries and take the largest such sum).
If we let p → ∞ then we get the infinity norm:
\|A\|_\infty = \max_{1 \leq m \leq M} \sum_{n=1}^{N} |a_{m,n}|
which is the maximum absolute row sum (the same operation applied to the rows rather than the columns of the matrix). Another
category of matrix norms is known as an entrywise norm and we can again define a
p norm as:
\|A\|_p = \left( \sum_{m=1}^{M} \sum_{n=1}^{N} |a_{m,n}|^p \right)^{1/p}
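As a small concrete illustration (a sketch, not taken from any program in the book), the matrix one and infinity norms defined above could be computed for a dense M × N matrix stored row by row in a 1D array as:

#include <cmath>

// One norm: maximum absolute column sum; infinity norm: maximum absolute row sum.
void matrixNorms(const double* A, int M, int N, double& norm1, double& normInf)
{
    norm1 = 0.0;
    normInf = 0.0;
    for(int n=0; n<N; n++)              // column sums for the one norm
    {
        double colSum = 0.0;
        for(int m=0; m<M; m++) colSum += std::fabs(A[m*N + n]);
        if(colSum > norm1) norm1 = colSum;
    }
    for(int m=0; m<M; m++)              // row sums for the infinity norm
    {
        double rowSum = 0.0;
        for(int n=0; n<N; n++) rowSum += std::fabs(A[m*N + n]);
        if(rowSum > normInf) normInf = rowSum;
    }
}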
Similar to vectors, these definitions can be quite useful. In particular, we can define the condition number of a matrix as:

\kappa(A) = \|A\| \, \|A^{-1}\|
Figure 1.1: Some patterns found in matrices that are either defining the system
being solved, or used in the solution of the system. The dots indicate where an
entry in the matrix is nonzero and blank space illustrates where the entries are zero.
(a) illustrates a full matrix (b) illustrates an upper triangular matrix (c) illustrates a
lower triangular matrix (d) illustrates a diagonal matrix (e) illustrates a tridiagonal
matrix (f) illustrates a pentadiagonal matrix (g) illustrates a sparse matrix which
could be either symmetric or skew symmetric (h) illustrates a non-symmetric sparse
matrix and (i) illustrates a matrix with no definable structure.
Finally, for completeness it is worth mentioning that this concept extends to multi-
dimensional arrays where we could in Matlab have something like:
A = zeros(M,N,O,P,Q,R);
An important point to note is that in Matlab we can often ‘get away’ with not
explicitly allocating memory to store our array before we start adding entries into
it, but this comes at a price; namely that Matlab will constantly have to resize the
array, which will greatly reduce the execution speed of the program. In C++ we don’t
have the luxury of arrays being automatically resized for us and must always ensure
that we have allocated memory for our array. Memory allocation is a slightly more
complex issue in C++ than it is in Matlab, with many different ways of allocating
memory for a given array. The main considerations are the method of memory
allocation, whether or not the array is contiguous in memory, and how we want to
be able to access entries in the array. There are in fact two ways in which we can
allocate memory; the first of which is known as static allocation and is achieved for
a 1D array as:
float A[M];
Finally, for completeness it is worth mentioning that this concept extends to multi-
dimensional arrays where we could in C++ have something like:
float A[M][N][O][P][Q][R];
float A[M];
float A[M][N];
float A[M][N][O];
Figure 1.2: A schematic illustrating the creation of three arrays of increasing dimensionality using static memory allocation. The blue boxes indicate the bytes required to store a single floating point number (which will most likely be 4) which are contiguous in memory in an area known as the stack. Note that there is no dimensionality associated with the memory itself, just blocks with increasing addresses, so the 3 × 9 layout for the 3D array has no meaning other than to illustrate a contiguous block that will fit on the page.
Figure 1.2 illustrates the creation of these three different arrays and where the bytes
(that store the floating point numbers) are actually located in memory. The
important points to note are that for static memory allocation we are guaranteed
that the array entries will be contiguous in computer memory in an area known
as the stack . It is worth mentioning at this stage that these techniques are not
dependent on the data type at all and would be the same for integers (int , long
, unsigned int , etc) or double precision floating point numbers.
The second way is known as dynamic allocation which is achieved for a 1D array
as:
float* A = new float[M];
where we can access entries with square brackets just as we could for a statically
allocated 1D array as:
a_m = A[m];
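A 2D array can be allocated dynamically in a similar way; a minimal sketch of one common way of doing this, following the pointer-to-pointer approach described in the text below, is:

float** A = new float*[M];
for(int m=0; m<M; m++)
{
    A[m] = new float[N];    // one contiguous block of N floats per row
}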
where we can access entries with square brackets just as we could for a statically
allocated 2D array as:
a_mn = A[m][n];
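Similarly, a minimal sketch of one common way of dynamically allocating a 3D array is:

float*** A = new float**[M];
for(int m=0; m<M; m++)
{
    A[m] = new float*[N];
    for(int n=0; n<N; n++)
    {
        A[m][n] = new float[O];   // one block of O floats per (m, n) pair
    }
}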
where we can access entries with square brackets just as we could for a statically allocated 3D array as:
a_mno = A[m][n][o];
The dynamic memory allocation snippets look more complex than their static counterparts, so it’s worth explaining these, with the aid of Figure 1.3. Each time we use
the new operator in C++ we are ‘asking’ for a single contiguous block of memory to
store a particular type of data; and assuming that the operating system is able to
fulfill this request we are returned a pointer to where this block of memory begins
in an area of memory known as the heap. Now a pointer variable is much like any
other variable except that what it stores is a memory address of something, and in
the context of dynamic memory allocation, the pointer stores the base address where
the block of memory begins. As an analogy think of the computer’s memory as a
long street full of houses. Each house has its own letterbox, with a unique number
analogous to a memory address. The contents of the house are analogous to the data
that is stored at that memory location (it could be an integer or a floating point
number for instance).
Considering first the dynamic allocation of a 1D array, we create a variable of
type float * which is stored on the stack and is a ‘pointer to a float’ (i.e. storing
the address of a floating point number). For a 2D array it can be observed that the
variable is instead of type float ** which is a ‘pointer to a pointer to a float’. So
to store our 2D array we first allocate one 1D array on the heap (of type ‘pointer to
float’) and then in a for loop allocate multiple 1D arrays to store the actual floating
point numbers. When we use the notation A[m][n] the first square brackets ‘get’ the
address m entries along from the location that A points to, then the second square
brackets ‘get’ the value n entries along from that location. Finally, for a 3D array
this concept is extended even further such that the variable is of type float ***,
which is a ‘pointer to a pointer to a pointer to a float’ and we go through a sequence
of nested for loops to ultimately store the floating point numbers. As a final point
to note, the number of bytes required to store any sort of pointer is dependent on
the system itself (i.e. either 4 or 8 bytes depending on whether it’s a 32 or 64 bit system) and
is the same for a float *, float **, float ***, int *, int **, double *, etc. If
we were storing double precision floating point numbers instead of single precision
floating point numbers we would require 8 bytes per number rather than 4, but the
pointers would be the same either way.
Given the extra complication involved in allocating memory dynamically in C++
you might well be wondering why bother when we can allocate arrays statically
and avoid the for loops and also having to deallocate them. Well, two reasons;
number one, with static memory allocation we need to know how big the array will
be at ‘compile time’, i.e. when we turn our source code into an executable program.
Sometimes this is not a problem; other times we have no way of knowing how big
our array needs to be until the program is running. So in this case dynamic memory
allocation is our only option. The second reason is that the stack is quite a small area
of memory (generally somewhere around 8 MB) and we can generally allocate much
bigger blocks of memory dynamically. So as we solve bigger and bigger systems of
equations ‘say’ dynamic memory allocation becomes a more attractive option for
this reason too.
Sometimes it can be advantageous to have our arrays allocated contiguously in
memory. While not such an issue initially this will become more important as we
consider parallel programming. For a 1D array this will be the case, but for the 2D
and 3D arrays we can see that the actual bytes of memory storing the floating point
numbers are scattered throughout the memory space. In addition to just having
memory allocated contiguously there is also an overhead in allocating each separate
block of memory, so it will be more efficient to allocate one big block of memory as
opposed to many smaller blocks. One way of allocating a 2D array contiguously in
memory would be:
float* A = new float[M*N];
but now we can no longer access entries in the same way. In fact given the two
indices m and n we must ‘map’ these to a single index as:
a_mn = A[m*N + n];
or:
a_mn = *(A+m*N + n);
where the * is the dereferencing operator and the meaning of this statement is; go
to the memory location pointed to by A, count along m*N + n entries, and return the value stored at that location.
Figure 1.3: A schematic illustrating the creation of three arrays of increasing dimensionality using dynamic memory allocation. The colored boxes indicate the data type. Note that although each call to the new operator allocates a contiguous block of memory, the overall array itself is not completely contiguous in memory. There is no dimensionality associated with the memory itself, just blocks with increasing addresses, so the 3 × 9 layout for the 3D array has no meaning other than to illustrate a contiguous block that will fit on the page.
For a 3D array the equivalent way of allocating the memory would be:
float* A = new float[M*N*O];
where we can access entries as:
a_mno = *(A+m*N*O+n*O+o);
So although we’ve now allocated our memory contiguously, we’ve lost the ‘nice’
way of indexing into arrays that mimics the math. Is there a way in which we can
allocate our memory contiguously with a minimal number of calls to new and retain
the square brackets for our access method? As it turns out there is and the code to
do so for a 2D array looks like:
float** A = new float*[M];
A[0] = new float[M*N];
for(int m=1, mm=N; m<M; m++, mm+=N)
{
A[m] = &A[0][mm];
}
where we can access entries with square brackets just as we could for a statically
allocated 2D array as:
a_mn = A[m][n];
So in this case we have two allocations, we have contiguous memory allocated for
our float ’s and we can use the square brackets to index entries. The important
point is that the large block of size M × N is allocated and the pointer assigned in
the first entry of A. After that, rather than allocating new blocks for the different
entries in A we instead just assign the addresses from the one large block that we’ve
allocated (Figure 1.4). We can extend this concept to a 3D array as:
float*** A = new float**[M];
A[0] = new float*[M*N];
A[0][0] = new float [M*N*O];
for(int m=0, mm=0; m<M; m++, mm+=N)
{
A[m] = &A[0][mm];
for(int n=0, nn=(m*N*O+n*O); n<N; n++, nn+=O)
{
A[m][n] = &A[0][0][nn];
}
}
where we can access entries with square brackets just as we could for a statically
allocated 3D array as:
a_mno = A[m][n][o];
Figure 1.4: A schematic illustrating two different ways that a 2D array can be
allocated dynamically.
An important point that one must remember when using dynamic memory allocation is that it is up to the programmer to deallocate (or free up) the memory
that is allocated with the new operator. Now, if we have a simple program that
allocates some memory, performs some computations, then ends, all of the memory
is reclaimed by the operating system when the program terminates. So in some
cases it’s not a ‘big deal’ if we don’t free up the memory. It is however probably bad
practice to get into this habit, and in fact it can cause a problem known as memory
leaks. To illustrate this problem, imagine that we have a program that is constantly
allocating blocks of memory. If we don’t free up the memory then eventually there
will come a point where, as far as the operating system is concerned there won’t be
any more memory that it can allocate and so most likely our program will crash.
The important step to remember (which we will always do throughout this book) is
whenever we’re finished with an array, we deallocate the memory with the delete
operator, which is achieved for a 1D array as:
delete [] A;
It is important to remember here that we delete the memory in the reverse order that we allocated it. Otherwise, if we deleted the array of pointers first, we would no longer have access to the addresses of the individual blocks of float in order to delete them; a minimal sketch for the 2D arrays allocated earlier is shown below.
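For example (a sketch, assuming the pointer-per-row 2D array from earlier with M rows):

for(int m=0; m<M; m++)
{
    delete [] A[m];   // delete each row of floats first
}
delete [] A;          // then delete the array of row pointers

For the contiguous version allocated above, the equivalent is to call delete [] A[0]; first and then delete [] A;.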
As it happens, we can also free up memory in Matlab too, and we can achieve this for a vector or a matrix (or in fact any data object) as:
clear A;
However, we will only really be using this function in our Matlab programs to
remove all of the variables in the ‘workspace’ with the clear all; command, rather
than applying it to specific data objects. One issue that current Matlab and C++
programmers will be aware of, but others may not, is the fact that in addition to
having different syntax and ways to allocate and free memory, Matlab uses one
based indexing, whereas C++ uses zero based indexing. All this means is that in a
C++ array, the first entry is given the index 0, rather than 1. This is usually not
a big issue, we just have to remember when translating a program from Matlab to
C++ to ‘knock’ 1 off each array index that we make.
Another issue that is very important when developing programs implementing
numerical methods is round-off error , namely the limited precision to which comput-
ers can represent real numbers. Generally a single precision floating point number
(a float type in C++) uses four bytes to store a number, a double precision floating
point number (a double type in C++) uses eight bytes, and if we’re really concerned
we can use a long double in C++, which on most platforms provides extended or quadruple precision and uses up to 16 bytes. Now we would normally want our programs to produce as accurate a
result as possible and so the use of the most precise numbers possible may seem
appropriate. The trade off however is that this could require up to four times the
amount of memory. For small simple programs this will not really be a big issue, but
for large scale parallel programs memory can become important. Throughout this
book we will strike a balance between precision and memory and use the double
type for storing real numbers. One other common technique that could strike a rea-
sonable balance between accuracy and memory use is to use long double for ‘say’
variables that involve repeated operations such as summation, subtraction, multi-
plication, division, etc, but truncate and store the resulting array entries in a lower
precision.
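As a small illustration of round-off error (a sketch, not tied to any particular program in this book), repeatedly accumulating the value 0.1, which has no exact binary representation, shows the difference between single and double precision:

#include <cstdio>

int main()
{
    float  sumSingle = 0.0f;
    double sumDouble = 0.0;
    for(int i=0; i<1000000; i++)
    {
        sumSingle += 0.1f;   // rounding error accumulates with every addition
        sumDouble += 0.1;
    }
    // The exact result is 100000; the single precision sum drifts much further away
    printf("single precision sum = %f\n", sumSingle);
    printf("double precision sum = %f\n", sumDouble);
    return 0;
}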
In scientific software, the computer’s memory can often be a precious resource
and so storing a matrix in a 2D array as previously illustrated can be a huge waste of
memory if the matrix is in fact sparse. As an example, consider a system of equations with 10^6 unknowns (which is fairly conservative by today’s standards) and imagine that it is represented by a tridiagonal matrix. In this case the matrix would have 10^12 entries (i.e. one trillion entries), but would only contain approximately 3 × 10^6 nonzero entries (i.e. about three million entries). If we were storing this matrix as a full matrix of single precision floating point numbers (i.e. 4 bytes of memory per entry) then we would need nearly four terabytes of memory compared to only around eleven megabytes of memory if we were only storing the relevant data. Clearly this
is a huge potential savings. So, how do we go about storing only the nonzero entries
in a matrix? Well, in Matlab we can do this very easily by simply allocating our
matrix to be a sparse matrix as:
A = sparse(M,N);
Note that here we are using the term ‘matrix’ as opposed to a 2D array, because of
the context. Having declared our matrix in this manner we can then work with it
in the usual way, but there will be a substantial savings in the amount of memory
required, so ‘life is good’. Now if we want to do something similar in C++ we need
to ‘dig a little deeper’ into how a sparse matrix is actually stored. As it happens
there are a number of ways of doing this, but one common technique for doing so
is known as Compressed Row Storage (CRS). With this format no assumptions are
made about the sparsity structure of the matrix, and no unnecessary entries are
stored. The way that the matrix is stored is that we create three 1D arrays, one
containing floating point numbers (which stores the non-zero entries in the matrix),
ordered column by column, row by row; and two containing integers (which are used
to identify row and column indices). The best way to illustrate the concept is via
example, so consider the matrix:
A = \begin{bmatrix} 9 & 0 & 0 & 0 & 2 & 0 \\ 3 & 9 & 0 & 0 & 0 & 3 \\ 0 & 7 & 8 & 7 & 0 & 0 \\ 3 & 0 & 8 & 7 & 5 & 0 \\ 0 & 8 & 0 & 9 & 9 & 6 \\ 0 & 4 & 0 & 0 & 2 & 1 \end{bmatrix}
where we will denote the number of non-zero entries in the matrix by Nnz , which is
equal to 19 in this case. If we were to store this matrix using the CRS format then
we would have the arrays:
val ={9, 2, 3, 9, 3, 7, 8, 7, 3, 8, 7, 5, 8, 9, 9, 6, 4, 2, 1}
col ={1, 5, 1, 2, 6, 2, 3, 4, 1, 3, 4, 5, 2, 4, 5, 6, 2, 5, 6}
row={1, 3, 6, 9, 13, 17, 20}
It should be fairly obvious that val is storing all of the entries in order as we move
along the columns and then along the rows of A. The entries in col are storing
the column indices of the entries in A (assuming one based indexing). Finally the
entries in row are storing the index in col and val where the data for that par-
ticular row begins. So the data for row 1 begins in row(1) = 1, the data for row
2 begins in row(2) = 3, the data for row 3 begins in row(3) = 6, etc. Furthermore, to access all of the column data within a given row m, we work through val from row(m), up to but not including row(m + 1), which is a nice feature when it comes to performing matrix vector multiplication (a sketch of which is given at the end of this discussion). By convention the format defines row(M + 1) = Nnz + 1.
reduce the storage, but would have the trade-off of a more complicated algorithm
with a different pattern of data access. Furthermore, if A happened to have some
rows containing all zeros, then we would simply find that for any row m with no
nonzero entries, row(m) = row(m + 1), so when accessing entries, or ‘say’ perform-
ing a matrix vector multiplication, we can easily handle this scenario. So instead of
allocating memory to store M^2 = 36 entries, we are allocating 2Nnz + M + 1 = 45. So in this particular case we’ve illustrated an important point; namely that one only obtains a savings in memory using this format if Nnz < M(M − 1)/2 (i.e. if less than half of the matrix entries are nonzero). For this example, this is not the case,
but going back to the 10^6 × 10^6 tridiagonal matrix mentioned previously, in this case
we could store the matrix in around twenty seven megabytes. So it turns out that we
can’t store just the nonzero entries in the matrix, because without the structure of
the arrays themselves we wouldn’t know where the nonzero entries belong, but even
so, we’ve made a pretty big savings in terms of computer memory. This method
does have some disadvantages however, namely that we can’t access entries directly,
we have to go through two arrays to get to the data.
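To illustrate why the format is convenient, a minimal sketch of a matrix vector product y = Av using the three CRS arrays is given below. Note that this sketch assumes zero based indices stored in col and row, with row[M] = N_nz, whereas the listings above show one based values.

// y = A*v for an M x N matrix stored in CRS format.
void crsMatVec(const double* val, const int* col, const int* row,
               const double* v, double* y, int M)
{
    for(int m=0; m<M; m++)
    {
        double sum = 0.0;
        for(int k=row[m]; k<row[m+1]; k++)   // all nonzero entries in row m
        {
            sum += val[k]*v[col[k]];
        }
        y[m] = sum;
    }
}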
Another common technique for storing a sparse matrix is known as Compressed
Column Storage (CCS). This approach is similar to compressed row storage except
that we essentially ‘swap’ the indexing of rows and columns. If we were to store A
now using the CCS format then we would have the arrays:
val ={9, 3, 3, 9, 7, 8, 4, 8, 8, 7, 7, 9, 2, 5, 9, 2, 3, 6, 1}
row={1, 2, 4, 2, 3, 5, 6, 3, 4, 3, 4, 5, 1, 4, 5, 6, 2, 5, 6}
col ={1, 4, 8, 10, 13, 17, 20}
It should be fairly obvious that val is storing all of the entries in order as we move
along the rows and then along the columns of A. The entries in row are storing
the row indices of the entries in A (assuming one based indexing). Finally the
entries in col are storing the index in row and val where the data for that particular
column begins. So the data for column 1 begins in col(1) = 1, the data for column
2 begins in col(2) = 4, the data for column 3 begins in col(3) = 8, etc. Some
other formats include Block Compressed Row Storage, Compressed Diagonal Storage,
Jagged Diagonal Storage, and Skyline Storage [63].
Example 1.1:
In this example we are going to develop a C++ class for storing a sparse matrix
using the compressed row storage format. For those readers new to C++, this would
be a good time to ‘brush up’ on some of the object oriented features of the language.
As it happens we will be using this class later on in the book when we come to
implementing programs for solving partial differential equations. Given that we have
just been discussing sparse matrices and dynamic memory allocation this seems like
the most appropriate point to develop this class, but this example could be skipped
and revisited when necessary.
The key feature that we want out of our class at this point is the ability to
store a sparse matrix and to insert and access entries. In later parts of the book
we will extend the class; for example, incorporating an optimized matrix-vector
multiplication routine. So let’s begin by creating the ‘skeleton’ for our class:
class SparseMatrix
{
public:
SparseMatrix();
~SparseMatrix();
void initialize(int nrow, int nnzperrow);
void finalize();
inline
double& operator()(int m, int n);
protected:
void reallocate();
private:
double* val_;
int* col_;
int* row_;
int* nnzs_;
int N_row_;
int N_nz_rowmax_;
int N_nz_;
int N_allocated_;
};
Here we have named our class SparseMatrix and we have declared a number of
private member variables and a few public and protected member functions, which
we will gradually explain. It can be observed the first three member variables are
the pointers for the three arrays that will need to be dynamically allocated to store
the sparse matrix data itself (and it can be observed that we will be using double
precision floating point numbers to store the entries in the matrix), the fifth member
variable is the number of rows, the seventh the number of nonzero entries and the
eighth is a variable which stores the number of entries currently allocated for the val
and col arrays. The remaining member variables will be explained in due course.
As a quick note, it is quite common in the design of C++ classes to identify member
variables of a class in some way. In this case we are using a trailing underscore, but
another commonly used convention is a preceding m_ to indicate that a variable is a
‘member’ of a class (i.e. m_val, m_col, etc). At the end of the day it doesn’t really
matter either way, the most important thing is to be consistent, and so throughout
this book we will use the trailing underscores to indicate member variables.
We want to be able to have a sparse matrix that can handle the case where we
don’t know exactly how many nonzero entries will be stored in it. While this gives
us a lot more flexibility and will ultimately make our programs simpler, it means we
may have to resize the arrays if they fill up. So the approach we will take is to ‘guess’
an initial size to allocate the three arrays and as we run out of space we will increase the
value of this variable and reallocate more memory. Another design consideration is
that we can’t assume that we will add the entries, column by column, row by row,
as the CRS format requires. As such we will need to provide a mechanism for ‘filling
up’ (or ‘assembling’) the matrix that will allow us to insert entries in arbitrary order.
The one thing that we will however assume that we know is the number of rows in
the matrix.
Before we go about implementing the member functions of this class it is worth
illustrating how we will use it in a program:
int main(int argc, char** argv)
{
const int N_row = 6;
const int N_col = 6;
SparseMatrix A;
double B[N_row][N_col] = { { 9, 0, 0, 0, 2, 0},
{ 3, 9, 0, 0, 0, 3},
{ 0, 7, 8, 7, 0, 0},
{ 3, 0, 8, 7, 5, 0},
{ 0, 8, 0, 9, 9, 6},
{ 0, 4, 0, 0, 2, 1} };
A.initialize(N_row, 3);
for(int n=N_col-1; n>=0; n--)
{
for(int m=N_row-1; m>=0; m--)
{
if(B[m][n]!=0)
{
A(m,n) = B[m][n];
}
}
}
A.finalize();
return 0;
}
It can be observed in the above program that we create two matrices A and B, the former using our SparseMatrix class and the latter as a statically allocated 2D
array. In the program we first ‘initialize’ our sparse matrix, then work backwards
row by row, column by column through B inserting nonzero entries into A, then
‘finalize’ our matrix, then we’re done. The reason for going through B backwards
is to make sure that we’re inserting entries in an order different from the way that
the CRS scheme stores them, to make sure that the class is functioning correctly.
When it comes to assigning the values, we are using the square brackets to index
into B, but an overloaded C++ operator (namely the parentheses operator) to put
the value into A.
Moving along, the first thing our class needs is a constructor. This is the function
that is called when we create an instance (or object) of our class, with a statement
like:
SparseMatrix A;
The constructor itself simply sets everything to a safe, empty state:
SparseMatrix()
{
N_row_ = 0;
N_nz_ = 0;
N_nz_rowmax_ = 0;
N_allocated_ = 0;
val_ = NULL;
col_ = NULL;
row_ = NULL;
nnzs_ = NULL;
}
where we are setting all variables to zero or to be null pointers. This happens
because we have constructed our matrix with no useful input information. What
we are going to do instead is use a member function called initialize, taking as arguments the number of rows and an estimate for the number of nonzero entries per row. The code for this function is implemented as:
void initialize(int nrow, int nnzperrow)
{
N_row_ = nrow;
N_nz_ = 0;
N_nz_rowmax_ = nnzperrow;
N_allocated_ = N_row_*N_nz_rowmax_;
val_ = new double [N_allocated_];
col_ = new int [N_allocated_];
row_ = new int [N_row_+1];
nnzs_ = new int [N_row_];
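// (sketch of the remaining steps, based on the description in the text below:
//  zero the arrays with memset and set the initial, one based row positions)
memset(val_, 0, N_allocated_*sizeof(double));
memset(col_, 0, N_allocated_*sizeof(int));
memset(nnzs_, 0, N_row_*sizeof(int));
for(int m=0; m<N_row_+1; m++)
{
    row_[m] = m*N_nz_rowmax_ + 1;
}
}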
Here we assign the number of rows and the maximum number of nonzero entries
per row (N_nz_rowmax_) equal to our initial estimate. Note that the number of nonzero entries (N_nz_) is set equal to zero because this variable contains a count of the
current number of nonzero entries in the matrix and at the point of initialization
of the SparseMatrix object there are none. We then set our allocation size based
on an estimate of the number of rows multiplied by the number of nonzero entries
per row and allocate the arrays storing the sparse matrix. It is important to note
that because we know the number of rows in the matrix, we won’t have to reallocate
row as we assemble the matrix, only modify its contents. Following the memory
allocation we initialize all of the arrays with zero via the memset function and then
set the initial row positions. This last statement requires some elaboration because
it relates to how we choose to fill up the sparse matrix. Remember that with the
CRS format row(m) is the index in the col and val arrays where the data for row m
starts. Because we don’t know this information yet, one approach is to just assume
that each row will have exactly N_nz_rowmax_ entries per row and then in that case
we can then start to fill the matrix up. If some rows have more nonzero entries than
this we will have to reallocate more memory; if they have less then we will have
to ‘compress’ the matrix (which as you might’ve guessed will happen through the
finalize member function). In order to keep track of the actual number of nonzero
entries that any given row has, we will make use of the ‘temporary’ array nnzs that
will maintain a count for each row. After the matrix has been completely assembled
and compressed (so that it’s using the CRS format exactly) this array is redundant
and will be deleted. So after our initialize function has been called, given that
we estimated 3 nonzero entries per row, the arrays of the sparse matrix will look
like:
val ={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
col ={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
row ={1, 4, 7, 10, 13, 16}
nnzs={0, 0, 0, 0, 0, 0}
At this point we can start to look at how we actually insert entries into the sparse
matrix. We can in fact come up with a pretty nice way of doing this by making use
of a feature of the C++ language known as operator overloading; and we will overload
the parentheses operator as:
inline double& operator()(int m, int n);
So here we have declared another member function, but the way we perform the
function call is via the parentheses, meaning that if we create an object of our class:
SparseMatrix A;
then we could access entries using the same syntax that we would with Matlab, i.e:
double a_mn = A(m,n);
As the program above inserts entries from B (working backwards through the matrix), the arrays evolve as follows. After inserting B(6, 6) = 1 we have:
val ={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0}
col ={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0}
row ={1, 4, 7, 10, 13, 16}
nnzs={0, 0, 0, 0, 0, 1}
then after inserting B(5, 6) = 6:
val ={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 1, 0, 0}
col ={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 6, 0, 0}
row ={1, 4, 7, 10, 13, 16}
nnzs={0, 0, 0, 0, 1, 1}
then after inserting B(2, 6) = 3:
val ={0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 1, 0, 0}
col ={0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 6, 0, 0}
row ={1, 4, 7, 10, 13, 16}
nnzs={0, 1, 0, 0, 1, 1}
and after the first 13 nonzero entries have been inserted:
val ={2, 0, 0, 3, 0, 0, 7, 8, 0, 5, 7, 8, 6, 9, 9, 1, 2, 4}
col ={5, 0, 0, 6, 0, 0, 4, 3, 0, 5, 4, 3, 6, 5, 4, 6, 5, 2}
row ={1, 4, 7, 10, 13, 16}
nnzs={1, 1, 2, 3, 3, 3}
At this point the next entry to be inserted would be B(5, 2) = 8, but the problem
is that we don’t have any more free space between the data for rows 5 and 6. So
at this point we’ll need to reallocate our arrays. The approach that we’ll take is
not necessarily the smartest or most efficient, but makes the code relatively simple.
Essentially what we’ll do is just double our current guess for the maximum number
of nonzero entries per row and reallocate the arrays appropriately. In order to check
whether or not we have enough room to insert an entry, we’ll add an extra bit of
code to our implementation of the parenthesis operator:
double& operator()(int m, int n)
{
if(nnzs_[m]>=N_nz_rowmax_)
{
this->reallocate();
}
int k = row_[m];
bool foundEntry = false;
while(k<(row_[m]+nnzs_[m]) && !foundEntry)
{
...
}
where the three dots ... indicate the remainder of the function, which is
unchanged. The newly inserted if statement checks if our current count for the
number of nonzero entries for row m is equal to or greater than what we’ve set aside.
For the case of inserting B(5, 2) = 8 we have already inserted 3 value and so this
will be true. In this case we call the protected member function reallocate before
proceeding to insert the value. Because this is a protected member function, our
program won’t be able to access this function, only the sparse matrix object itself.
This is in-keeping with the encapsulation idea of object oriented programming, i.e.
our sparse matrix will be able to ‘take care of itself’ in a sense and reallocate memory
when it needs to. After implementing this function we won’t need to worry about
it at any other point in the programs that use it. As it happens this is exactly what
Matlab would be doing when we use the sparse function. The code implementing
this function will look like:
void reallocate()
{
N_nz_rowmax_ *= 2;
N_allocated_ = N_nz_rowmax_*N_row_;
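// (sketch of the steps performed at this point: allocate new arrays tempVal
//  and tempCol at the new size N_allocated_, copy the existing entries of each
//  row into their new, wider slots, and update row_ to the new starting positions)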
delete [] val_;
delete [] col_;
val_ = tempVal;
col_ = tempCol;
return;
}
After the reallocation the arrays look like:
val ={2, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 5, 7, 8, 0, 0, 0, 6, 9, 9, 0, 0, 0, 1, 2, 4, 0, 0, 0}
col ={5, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 5, 4, 3, 0, 0, 0, 6, 5, 4, 0, 0, 0, 6, 5, 2, 0, 0, 0}
row ={1, 7, 13, 19, 25, 31}
nnzs={1, 1, 2, 3, 3, 3}
then after inserting B(5, 2) = 8:
val ={2, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 5, 7, 8, 0, 0, 0, 6, 9, 9, 8, 0, 0, 1, 2, 4, 0, 0, 0}
col ={5, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 5, 4, 3, 0, 0, 0, 6, 5, 4, 2, 0, 0, 6, 5, 2, 0, 0, 0}
row ={1, 7, 13, 19, 25, 31}
nnzs={1, 1, 2, 3, 4, 3}
and after all of the remaining entries have been inserted:
val ={2, 9, 0, 0, 0, 0, 3, 9, 3, 0, 0, 0, 7, 8, 7, 0, 0, 0, 5, 7, 8, 3, 0, 0, 6, 9, 9, 8, 0, 0, 1, 2, 4, 0, 0, 0}
col ={5, 1, 0, 0, 0, 0, 6, 2, 1, 0, 0, 0, 4, 3, 2, 0, 0, 0, 5, 4, 3, 1, 0, 0, 6, 5, 4, 2, 0, 0, 6, 5, 2, 0, 0, 0}
row ={1, 7, 13, 19, 25, 31}
nnzs={2, 3, 3, 4, 4, 3}
At this point we have inserted all of the entries that we want to, so the final step
is to ‘tidy up’ the data so that it is in CRS format. The code implementing the
finalize member function will be somewhat similar to the reallocate function
in that it will create new temporary copies of the arrays to work with, but its main
job will be to sort the column indices for each row into ascending order before
compressing them down into the col array. This function will be the most complex
one present in the class, but let’s start by presenting the whole thing and then
explaining the various components:
void finalize()
{
int minCol = 0;
int insertPos = 0;
int index = 0;
double* tempVal = new double [N_nz_];
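// (sketch of the steps performed here: tempCol, tempRow, and a temporary
//  bool array isSorted are also allocated; nested for loops then work through
//  each row, repeatedly taking the unsorted entry with the smallest column
//  index and copying it into the next free position of tempVal and tempCol,
//  while tempRow is filled with the starting position of each row)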
delete [] val_;
delete [] col_;
delete [] row_;
delete [] nnzs_;
delete [] isSorted;
val_ = tempVal;
col_ = tempCol;
row_ = tempRow;
nnzs_ = NULL;
N_allocated_ = N_nz_;
return;
}
Similar to the reallocate function, we can see that the first steps consist of allo-
cating new arrays for val and col, but having assembled the matrix, we now know
the number of nonzero entries in the matrix, so we can in fact allocate the correct
size for these arrays. We are also going to allocate a temporary array isSorted
which stores a boolean flag defining whether or not a given entry has been sorted
and placed into the new arrays. The function of the nested for loops is to loop
over each row in the matrix and for each row, work through the entries that have
been inserted, find the one with the lowest column index, and then put it into the
new arrays. So, given the current state of the arrays:
val ={2, 9, 0, 0, 0, 0, 3, 9, 3, 0, 0, 0, 7, 8, 7, 0, 0, 0, 5, 7, 8, 3, 0, 0, 6, 9, 9, 8, 0, 0, 1, 2, 4, 0, 0, 0}
col ={5, 1, 0, 0, 0, 0, 6, 2, 1, 0, 0, 0, 4, 3, 2, 0, 0, 0, 5, 4, 3, 1, 0, 0, 6, 5, 4, 2, 0, 0, 6, 5, 2, 0, 0, 0}
row ={1, 7, 13, 19, 25, 31}
nnzs={2, 3, 3, 4, 4, 3}
As we search through the column data for row(1) between col entries 1 and 3 we
find that the lowest column index is 1 and it has not been previously sorted, so its
isSorted flag will be false and we can insert the value and column index into the
next available spot in the new val and col arrays, then set its flag to true . On
the second pass through the innermost for loop, we are scanning the same column
indices, but because the column index 1 has been sorted already, the lowest unsorted column index is now 5, so we add that value and column index into the next available spot in
the new val and col arrays, then set its flag to true . After sorting any given row,
we can then set the values in row appropriately. After repeating this process for all
rows the arrays will look like:
val ={9, 2, 3, 9, 3, 7, 8, 7, 3, 8, 7, 5, 8, 9, 9, 6, 4, 2, 1}
col ={1, 5, 1, 2, 6, 2, 3, 4, 1, 3, 4, 5, 2, 4, 5, 6, 2, 5, 6}
row={1, 3, 6, 9, 13, 17, 20}
Finally, the last thing we need to implement is a destructor. This is the function
that is called when we destroy an instance (or object) of our class. This can take
the form:
~SparseMatrix()
{
if(val_) delete [] val_;
if(col_) delete [] col_;
if(row_) delete [] row_;
if(nnzs_) delete [] nnzs_;
}
So the important point to note is that because we allocated our three arrays dynam-
ically we have to delete them ourselves and this should be done in the destructor.
Note that if our matrix has been finalized, nnzs will have already been deleted and
hence will be a NULL pointer at this point. As such it wouldn’t be deleted here
in the destructor.
The complete program is given in Example1_1.cpp.
Having now made all of the necessary definitions regarding vectors and matrices,
looking at how we actually go about storing them in both Matlab and C++, we are
now ready to start investigating some different methods for solving a linear system.
Those readers already familiar with Matlab will know that a linear system can be
solved quite simply by using the backslash operator as:
phi = A\b;
which instructs Matlab to find the solution of the linear system and it will in fact do
so by examining the structure of A and choosing the most appropriate method,
so ‘life is good’. Throughout this book, sometimes we will make use of the back-
slash operator and other times we won’t. If the focus of a particular program is
to demonstrate ‘say’ a numerical method for solving a partial differential equation
then we may develop the code for doing that and just use the backslash operator
to solve the resulting linear system of equations. There will be a few times however
when we will want to explicitly implement some of the numerical methods that we
are about to investigate. When it comes to implementing C++ programs we don’t
have the backslash operator, so we will need to know how to implement a method
to solve a linear system ourselves.
Throughout the next two chapters we will for the most part be applying our
numerical methods to solve the example system:
\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix}
As can be observed we have a full symmetric matrix here, which is small enough that
we can work through the computations ‘by hand’ to see how the different methods
work. This is also a pattern that we will try and stick to throughout the book;
namely that we will try and test our numerical methods on the same problems so
that we can highlight the differences between the different methods and also know
what solution to expect.
Chapter 2
Direct Methods
As was mentioned in the introduction we can classify methods for linear systems
of equations as either direct methods or iterative methods. Direct methods solve a
system of equations with a fixed number of operations, which will depend upon M ,
but will be known at the start of the method. In the absence of round-off error due
to the finite precision to which a computer can represent numbers, direct methods
would deliver an exact solution.
Example 2.1:
In this example we will develop a Matlab program to solve the example system:
\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix}
Before we start writing any code however, let’s work through and perform the Gaus-
sian elimination by hand. We first form the augmented matrix:
Ab = \left[ \begin{array}{ccc|c} 2 & 1 & -1 & 3 \\ 1 & 4 & 2 & -5 \\ -1 & 2 & 6 & 7 \end{array} \right]
Then we proceed by removing the entries in column 1, below the entry on the main
diagonal (i.e. a1,1 ). We achieve this by subtracting a2,1 /a1,1 × row 1 from row 2 and
a3,1 /a1,1 × row 1 from row 3, i.e:
Ab_2 \leftarrow Ab_2 - \tfrac{1}{2}\, Ab_1

Ab_3 \leftarrow Ab_3 - \tfrac{-1}{2}\, Ab_1
so that we get:
Ab = \left[ \begin{array}{ccc|c} 2 & 1 & -1 & 3 \\ 0 & 3.5 & 2.5 & -6.5 \\ 0 & 2.5 & 5.5 & 8.5 \end{array} \right]
We then remove the entries in column 2, below the main diagonal (i.e. a2,2 ). We do
this by subtracting a3,2 /a2,2 × row 2 from row 3, i.e:
Ab_3 \leftarrow Ab_3 - \tfrac{2.5}{3.5}\, Ab_2
such that we get:
Ab = \left[ \begin{array}{ccc|c} 2 & 1 & -1 & 3 \\ 0 & 3.5 & 2.5 & -6.5 \\ 0 & 0 & 3.7143 & 13.1429 \end{array} \right]
At this point we can perform the back substitution as:
\phi_3 = \frac{13.1429}{3.7143} = 3.5385

\phi_2 = \frac{-6.5 - 2.5 \times 3.5385}{3.5} = -4.3846

\phi_1 = \frac{3 - (-1) \times 3.5385 - 1 \times (-4.3846)}{2} = 5.4615
In order to create a Matlab program to perform the Gaussian elimination we can
first create an augmented matrix with the cat function (which concatenates arrays)
as:
Ab = cat(2, A, b);
Having done this we can then loop through the columns of Ab, and for each column,
loop through the rows of Ab
for n=1:N_col
for m=n+1:N_row
Ab(m,:) = Ab(m,:) - Ab(m,n)/Ab(n,n)*Ab(n,:);
end
end
It should be noted here that the colon operator implies ‘all’ of the elements in row m if
we write Ab(m,:), so we are performing a ‘vector’ operation, rather than operating
on individual entries in the array. At this point we can then perform the back
substitution as:
phi(N_row) = Ab(N_row,N_row+1)/Ab(N_row,N_row);
for m=N_row-1:-1:1
phi(m) = (Ab(m,N_row+1) - Ab(m, m+1:N_row)*phi(m+1:N_row))/Ab(m,m);
end
and our program is complete. If any of the elements on the main diagonal of A were
zero then we would run into the problem of division by zero when we add a multiple
of one row to another. If we allow for partial pivoting then we swap rows such that
the row with a zero on the main diagonal is swapped with the row below which has
the largest absolute value of am,n in column n. We could achieve this in Matlab by
checking for zero entries on the main diagonal and swapping any rows where this
occurs by modifying our initial code snippet as:
for n=1:N_col
if Ab(n,n)==0
[A_col_max k] = max(abs(Ab(n+1:N_row, n)));
k = k + n; % max returns an index relative to the sub-array, so offset it to get the actual row
temp = Ab(n,:);
Ab(n,:) = Ab(k,:);
Ab(k,:) = temp;
end
for m=n+1:N_row
Ab(m,:) = Ab(m,:) - Ab(m,n)/Ab(n,n)*Ab(n,:);
end
end
One drawback of this approach is that forming the augmented matrix requires storing a copy of A and b, which for large systems could be a problem too. The other option might be to simply overwrite the data in A and
b rather than store the augmented matrix and perform the operations there, but
this too may be undesirable.
2.2 LU Decomposition
LU decomposition (or LU factorization) is a similar process to Gaussian elimination
and is equivalent in terms of the elementary row operations. The basic idea behind
the method is that the matrix A can be decomposed (or factored) so that:
A = LU
where L is a lower triangular matrix with ones on the main diagonal and U is an
upper triangular matrix:
L = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ l_{2,1} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{M,1} & l_{M,2} & \cdots & 1 \end{bmatrix} \quad
U = \begin{bmatrix} u_{1,1} & u_{1,2} & \cdots & u_{1,N} \\ 0 & u_{2,2} & \cdots & u_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_{M,N} \end{bmatrix}

In this case A\phi = b becomes:

L U \phi = b
So we let ψ = U φ and first solve:
Lψ = b
Because L is a lower triangular matrix this equation can be solved efficiently by
forward substitution. To find φ we then solve:
Uφ = ψ
Because U is an upper triangular matrix this equation can be solved efficiently by
back substitution (as we did with Gaussian elimination).
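As an aside, a minimal C++ sketch of the forward substitution step (assuming L is unit lower triangular, i.e. ones on the main diagonal, and is stored as a dense 2D array) might look like:

// Solve L*psi = b by forward substitution, where L is unit lower triangular,
// so no division by the diagonal entries is required.
void forwardSubstitution(double** L, const double* b, double* psi, int N)
{
    for(int m=0; m<N; m++)
    {
        double sum = b[m];
        for(int n=0; n<m; n++)
        {
            sum -= L[m][n]*psi[n];   // subtract the already computed unknowns
        }
        psi[m] = sum;
    }
}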
Example 2.2:
In this example we will develop a Matlab program to solve the example system:
\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix}
Before we start writing any code however, let’s work through and perform the LU
decomposition by hand. Unlike Gaussian elimination we don’t have to form an
augmented matrix, however we proceed in exactly the same manner as Gaussian
elimination except that we keep a record of the elementary row operations performed
at the nth stage in a matrix Ln and place the results of these row operations in U .
Let’s first assign U = A and L = I.The the algorithm proceeds by removing the
entries in column 1, below the entry on the main diagonal (i.e. u1,1 ). We achieve
this by subtracting u2,1 /u1,1 × row 1 from row 2 and u3,1 /u1,1 × row 1 to row 3, i.e:
\[ U_2 \leftarrow U_2 - \frac{1}{2}U_1 \]
\[ U_3 \leftarrow U_3 - \frac{-1}{2}U_1 \]
such that we get:
\[ U = \begin{bmatrix} 2 & 1 & -1 \\ 0 & 3.5 & 2.5 \\ 0 & 2.5 & 5.5 \end{bmatrix} \qquad L_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \\ -0.5 & 0 & 1 \end{bmatrix} \]
We then remove the entry in column 2, below the main diagonal (i.e. u_{2,2}). We do this by subtracting u_{3,2}/u_{2,2} × row 2 from row 3, i.e.:
\[ U_3 \leftarrow U_3 - \frac{2.5}{3.5}U_2 \]
such that we get:
\[ U = \begin{bmatrix} 2 & 1 & -1 \\ 0 & 3.5 & 2.5 \\ 0 & 0 & 3.7143 \end{bmatrix} \qquad L_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0.7143 & 1 \end{bmatrix} \]
We then have:
\[ A = L_1 L_2 U \]
so that:
\[ L = L_1 L_2 \]
which in this particular example is:
\[ L = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \\ -0.5 & 0 & 1 \end{bmatrix}\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0.7143 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \\ -0.5 & 0.7143 & 1 \end{bmatrix} \]
At this point we can perform the forward substitution on Lψ = b as:
\[ \psi_1 = \frac{3}{1} = 3 \]
\[ \psi_2 = \frac{-5 - 0.5\times 3}{1} = -6.5 \]
\[ \psi_3 = \frac{7 - (-0.5)\times 3 - 0.7143\times(-6.5)}{1} = 13.1429 \]
We can then perform the back substitution on Uφ = ψ as:
\[ \phi_3 = \frac{13.1429}{3.7143} = 3.5385 \]
\[ \phi_2 = \frac{-6.5 - 2.5\times 3.5385}{3.5} = -4.3846 \]
\[ \phi_1 = \frac{3 - (-1)\times 3.5385 - 1\times(-4.3846)}{2.0} = 5.4615 \]
In order to create a Matlab program to perform the LU decomposition we will
have nested for loops similar to Example 2.1:
for n=1:N_col
for m=n+1:N_row
L(m,n) = U(m,n)/U(n,n);
U(m,:) = U(m,:) - U(m,n)/U(n,n)*U(n,:);
end
end
If any of the elements on the main diagonal of A were zero then we would run into
the problem of division by zero when we add a multiple of one row to another. If we
allow for partial pivoting then we swap rows such that the row with a zero on the
main diagonal is swapped with the row below which has the largest absolute value
of am,n in column n. We could achieve this in our Matlab program in exactly the
same manner that we did with Example 2.1.
The complete program is given in Example2_2.m.
Note that in both cases we have triangular matrices (lower and upper) which can
be solved directly using forward and backward substitution without using the Gaus-
sian elimination process (however we need this process or equivalent to compute the
LU decomposition itself). Thus the LU decomposition is computationally efficient
only when we have to solve a matrix equation multiple times for different b; it is
faster in this case to do an LU decomposition of the matrix A once and then solve
the triangular matrices for the different b, than to use Gaussian elimination each
time. As a final note before continuing on, we can perform an LU decomposition
using the Matlab function lu.
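As a hedged usage sketch (note that Matlab's lu applies partial pivoting, so it also returns a permutation matrix P that must be applied to b):
[L, U, P] = lu(A);
psi = L \ (P*b);   % forward substitution on the permuted right hand side
phi = U \ psi;     % back substitution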
2.3 Cholesky Decomposition
The Cholesky decomposition (or Cholesky factorization) is a similar process again, but it applies only when A is symmetric and positive definite. In this case the matrix can be factored so that:
\[ A = LL^T \]
where L is a lower triangular matrix. For a 3 × 3 matrix, for example:
\[ A = \begin{bmatrix} l_{1,1} & 0 & 0 \\ l_{2,1} & l_{2,2} & 0 \\ l_{3,1} & l_{3,2} & l_{3,3} \end{bmatrix}\begin{bmatrix} l_{1,1} & l_{2,1} & l_{3,1} \\ 0 & l_{2,2} & l_{3,2} \\ 0 & 0 & l_{3,3} \end{bmatrix} \]
In this case Aφ = b becomes:
\[ LL^T\phi = b \]
So we let ψ = L^Tφ and first solve:
\[ L\psi = b \]
Because L is a lower triangular matrix this equation can be solved efficiently by forward substitution. To find φ we then solve:
\[ L^T\phi = \psi \]
Because L^T is an upper triangular matrix this equation can be solved efficiently by back substitution (as we did with Gaussian elimination and LU decomposition). If we write out the matrix product in full we get:
\[ \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix} = \begin{bmatrix} l_{1,1}^2 & l_{1,1}l_{2,1} & l_{1,1}l_{3,1} \\ l_{2,1}l_{1,1} & l_{2,1}^2 + l_{2,2}^2 & l_{2,1}l_{3,1} + l_{2,2}l_{3,2} \\ l_{3,1}l_{1,1} & l_{3,1}l_{2,1} + l_{3,2}l_{2,2} & l_{3,1}^2 + l_{3,2}^2 + l_{3,3}^2 \end{bmatrix} \]
Then equating coefficients for each entry, we get the formulae:
\[ l_{m,m} = \sqrt{a_{m,m} - \sum_{o=1}^{m-1} l_{m,o}^2} \]
\[ l_{m,n} = \frac{1}{l_{n,n}}\left( a_{m,n} - \sum_{o=1}^{n-1} l_{m,o}\, l_{n,o} \right) \]
So we can compute the entry l_{m,n} provided we know the entries to its left and above it. A property of positive definite matrices is that the term inside the square root above is always positive and will result in a real number. We then proceed through the matrix column by column and calculate the entries.
Example 2.3:
In this example we will develop a Matlab program to solve the example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform the Cholesky decomposition by hand. To begin we work through the first column of A to compute the corresponding entries for L as:
\[ l_{1,1} = \sqrt{a_{1,1}} = \sqrt{2} = 1.4142 \]
\[ l_{2,1} = \frac{1}{l_{1,1}}(a_{2,1} - 0) = \frac{1}{1.4142}\times(1 - 0) = 0.7071 \]
\[ l_{3,1} = \frac{1}{l_{1,1}}(a_{3,1} - 0) = \frac{1}{1.4142}\times(-1 - 0) = -0.7071 \]
So that we have:
\[ L = \begin{bmatrix} 1.4142 & 0 & 0 \\ 0.7071 & 0 & 0 \\ -0.7071 & 0 & 0 \end{bmatrix} \]
We can then work through the second column of A to compute the corresponding
entries for L as:
\[ l_{2,2} = \sqrt{a_{2,2} - l_{2,1}^2} = \sqrt{4 - 0.7071^2} = 1.8708 \]
\[ l_{3,2} = \frac{1}{l_{2,2}}(a_{3,2} - l_{3,1}l_{2,1}) = \frac{1}{1.8708}\times\left(2 - (-0.7071)\times 0.7071\right) = 1.3363 \]
So that we have:
\[ L = \begin{bmatrix} 1.4142 & 0 & 0 \\ 0.7071 & 1.8708 & 0 \\ -0.7071 & 1.3363 & 0 \end{bmatrix} \]
Then finally we compute the last entry on the main diagonal as:
\[ l_{3,3} = \sqrt{a_{3,3} - l_{3,1}^2 - l_{3,2}^2} = \sqrt{6 - (-0.7071)^2 - 1.3363^2} = 1.9272 \]
So that we have:
\[ L = \begin{bmatrix} 1.4142 & 0 & 0 \\ 0.7071 & 1.8708 & 0 \\ -0.7071 & 1.3363 & 1.9272 \end{bmatrix} \]
At this point we can perform the forward substitution on Lψ = b as:
\[ \psi_1 = \frac{3}{1.4142} = 2.1213 \]
\[ \psi_2 = \frac{-5 - 0.7071\times 2.1213}{1.8708} = -3.4744 \]
\[ \psi_3 = \frac{7 - (-0.7071)\times 2.1213 - 1.3363\times(-3.4744)}{1.9272} = 6.8195 \]
We can then perform the back substitution on L^Tφ = ψ as:
\[ \phi_3 = \frac{6.8195}{1.9272} = 3.5385 \]
\[ \phi_2 = \frac{-3.4744 - 1.3363\times 3.5385}{1.8708} = -4.3846 \]
\[ \phi_1 = \frac{2.1213 - (-0.7071)\times 3.5385 - 0.7071\times(-4.3846)}{1.4142} = 5.4615 \]
In order to create a Matlab program to perform the Cholesky decomposition
we will have nested for loops similar to both Example 2.1 and 2.2 to compute the
entries in L:
for n=1:N_col
    % diagonal entry: subtract the squares of the entries to its left
    Sum = 0;
    for o=1:n-1
        Sum = Sum + L(n,o)^2;
    end
    L(n,n) = sqrt(A(n,n) - Sum);
    % entries below the diagonal in column n
    for m=n+1:N_row
        Sum = 0;
        for o=1:n-1
            Sum = Sum + L(m,o)*L(n,o);
        end
        L(m,n) = 1/L(n,n)*(A(m,n)-Sum);
    end
end
It can be observed that the Cholesky decomposition is very similar to the LU de-
composition, but an important difference is that partial pivoting will not be required
because the dominant coefficients will be on the main diagonal.
The complete program is given in Example2_3.m.
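For comparison, Matlab also provides a built-in chol function, which for a symmetric positive definite matrix returns an upper triangular factor. A hedged usage sketch that recovers the lower triangular L used above might be:
R = chol(A);    % upper triangular factor with R'*R = A
L = R';         % lower triangular factor, so that L*L' = A
psi = L \ b;    % forward substitution
phi = L' \ psi; % back substitution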
2.4 QR Decomposition
QR Decomposition (or QR factorization) is a technique similar to both LU and
Cholesky decompositions, that can be applied to any real, square matrix A. Here,
we write:
A = QR
where Q is an orthogonal matrix and R is an upper triangular matrix. In this case
Aφ = b becomes:
\[ QR\phi = b \]
So we let ψ = Rφ and first solve:
\[ Q\psi = b \]
Because Q is an orthogonal matrix, this can be done simply by computing ψ = Q^T b. To find φ we then solve:
\[ R\phi = \psi \]
by back substitution, since R is an upper triangular matrix.
What we are aiming to do with the Gram-Schmidt process is to find a new set of vectors q_1, q_2, ⋯, q_N, such that:
\[ A = \begin{bmatrix} a_1 & a_2 & \cdots & a_N \end{bmatrix} = \begin{bmatrix} q_1 & q_2 & \cdots & q_N \end{bmatrix}\begin{bmatrix} a_1\cdot q_1 & a_2\cdot q_1 & \cdots & a_N\cdot q_1 \\ 0 & a_2\cdot q_2 & \cdots & a_N\cdot q_2 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_N\cdot q_N \end{bmatrix} \]
Furthermore, the vectors will be normalized such that ||qn ||2 = 1, so that they form
an orthonormal set (i.e. they are unit vectors). This will give us the orthogonal
and upper triangular matrices that the QR decomposition requires. One important
point to bear in mind here is that the number of components in each of the vectors
will be equal to the number of rows in the matrix A. The process works by initially
defining an intermediate vector un and then successively computing qn as:
\[ u_1 = a_1 \qquad q_1 = \frac{u_1}{\|u_1\|_2} \]
\[ u_2 = a_2 - \frac{a_2^T q_1}{q_1^T q_1}q_1 \qquad q_2 = \frac{u_2}{\|u_2\|_2} \]
\[ u_3 = a_3 - \frac{a_3^T q_1}{q_1^T q_1}q_1 - \frac{a_3^T q_2}{q_2^T q_2}q_2 \qquad q_3 = \frac{u_3}{\|u_3\|_2} \]
\[ \vdots \]
\[ u_n = a_n - \sum_{m=1}^{n-1}\frac{a_n^T q_m}{q_m^T q_m}q_m \qquad q_n = \frac{u_n}{\|u_n\|_2} \qquad (2.1) \]
Each term being subtracted in this sum is a projection of the form:
\[ \mathrm{proj}(a, q) = \frac{a^T q}{q^T q}\, q \]
which means projecting a onto q and so a more intuitive way to think of the algo-
rithm is constructing a new orthogonal vector qn by projecting an onto the previous
q1 , · · · , qn−1 vectors (Figure 2.1) and then normalizing it. So while the vectors an
will not necessarily be orthogonal, the vectors qn will be.
Figure 2.1: The Gram-Schmidt process illustrating the computation of the second
orthonormal vector q2 . Having set u1 equal to a1 and normalizing it to obtain q1 , we
compute q2 by projecting a2 onto q1 and then subtracting this projection (i.e. adding the negative of it) from a2, giving us u2. Finally we normalize u2 to get q2. While
this example is given in 2D for simplicity, the method extends to any number of
dimensions.
Example 2.4:
In this example we will develop a Matlab program to solve the example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform the QR decomposition by hand. The algorithm begins by setting:
\[ u_1 = a_1 = \begin{bmatrix} 2 \\ 1 \\ -1 \end{bmatrix} \]
\[ q_1 = \frac{u_1}{\|u_1\|_2} = \frac{\{2, 1, -1\}^T}{(2^2 + 1^2 + (-1)^2)^{\frac{1}{2}}} = \begin{bmatrix} 0.8165 \\ 0.4082 \\ -0.4082 \end{bmatrix} \]
and computing:
\[ R_{1,1} = a_1\cdot q_1 = \begin{bmatrix} 2 & 1 & -1 \end{bmatrix}\begin{bmatrix} 0.8165 \\ 0.4082 \\ -0.4082 \end{bmatrix} = 2.4495 \]
So then we have:
\[ Q = \begin{bmatrix} 0.8165 & 0 & 0 \\ 0.4082 & 0 & 0 \\ -0.4082 & 0 & 0 \end{bmatrix} \qquad R = \begin{bmatrix} 2.4495 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \]
We then compute the second intermediate and orthonormal vectors as:
\[ u_2 = a_2 - \frac{a_2^T q_1}{q_1^T q_1}q_1 = \begin{bmatrix} 1 \\ 4 \\ 2 \end{bmatrix} - (1.6330)\begin{bmatrix} 0.8165 \\ 0.4082 \\ -0.4082 \end{bmatrix} = \begin{bmatrix} -0.3333 \\ 3.3333 \\ 2.6667 \end{bmatrix} \]
\[ q_2 = \frac{u_2}{\|u_2\|_2} = \begin{bmatrix} -0.0778 \\ 0.7785 \\ 0.6228 \end{bmatrix} \]
and computing:
\[ R_{1,2} = a_2\cdot q_1 = \begin{bmatrix} 1 & 4 & 2 \end{bmatrix}\begin{bmatrix} 0.8165 \\ 0.4082 \\ -0.4082 \end{bmatrix} = 1.6330 \]
\[ R_{2,2} = a_2\cdot q_2 = \begin{bmatrix} 1 & 4 & 2 \end{bmatrix}\begin{bmatrix} -0.0778 \\ 0.7785 \\ 0.6228 \end{bmatrix} = 4.2817 \]
So then we have:
\[ Q = \begin{bmatrix} 0.8165 & -0.0778 & 0 \\ 0.4082 & 0.7785 & 0 \\ -0.4082 & 0.6228 & 0 \end{bmatrix} \qquad R = \begin{bmatrix} 2.4495 & 1.6330 & 0 \\ 0 & 4.2817 & 0 \\ 0 & 0 & 0 \end{bmatrix} \]
At this point we can check that q_1 · q_2 = 0, verifying that these two vectors are orthogonal. We then proceed to compute the third orthonormal vector as:
\[ u_3 = a_3 - \frac{a_3^T q_1}{q_1^T q_1}q_1 - \frac{a_3^T q_2}{q_2^T q_2}q_2 = \begin{bmatrix} -1 \\ 2 \\ 6 \end{bmatrix} - (-2.4495)\begin{bmatrix} 0.8165 \\ 0.4082 \\ -0.4082 \end{bmatrix} - (5.3716)\begin{bmatrix} -0.0778 \\ 0.7785 \\ 0.6228 \end{bmatrix} = \begin{bmatrix} 1.4182 \\ -1.1818 \\ 1.6545 \end{bmatrix} \]
\[ q_3 = \frac{u_3}{\|u_3\|_2} = \begin{bmatrix} 0.5721 \\ -0.4767 \\ 0.6674 \end{bmatrix} \]
and computing:
\[ R_{1,3} = a_3\cdot q_1 = \begin{bmatrix} -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 0.8165 \\ 0.4082 \\ -0.4082 \end{bmatrix} = -2.4495 \]
\[ R_{2,3} = a_3\cdot q_2 = \begin{bmatrix} -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} -0.0778 \\ 0.7785 \\ 0.6228 \end{bmatrix} = 5.3716 \]
\[ R_{3,3} = a_3\cdot q_3 = \begin{bmatrix} -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 0.5721 \\ -0.4767 \\ 0.6674 \end{bmatrix} = 2.4790 \]
So then we have:
\[ Q = \begin{bmatrix} 0.8165 & -0.0778 & 0.5721 \\ 0.4082 & 0.7785 & -0.4767 \\ -0.4082 & 0.6228 & 0.6674 \end{bmatrix} \qquad R = \begin{bmatrix} 2.4495 & 1.6330 & -2.4495 \\ 0 & 4.2817 & 5.3716 \\ 0 & 0 & 2.4790 \end{bmatrix} \]
Again, we can note that:
\[ q_1 \cdot q_3 = \begin{bmatrix} 0.8165 & 0.4082 & -0.4082 \end{bmatrix}\begin{bmatrix} 0.5721 \\ -0.4767 \\ 0.6674 \end{bmatrix} = 0.0 \]
\[ q_2 \cdot q_3 = \begin{bmatrix} -0.0778 & 0.7785 & 0.6228 \end{bmatrix}\begin{bmatrix} 0.5721 \\ -0.4767 \\ 0.6674 \end{bmatrix} = 0.0 \]
verifying that all three vectors are orthogonal. At this point we can perform the matrix-vector multiplication as:
\[ \psi = Q^T b = \begin{bmatrix} 0.8165 & 0.4082 & -0.4082 \\ -0.0778 & 0.7785 & 0.6228 \\ 0.5721 & -0.4767 & 0.6674 \end{bmatrix}\begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} = \begin{bmatrix} -2.4495 \\ 0.2335 \\ 8.7719 \end{bmatrix} \]
We can then perform the back substitution on Rφ = ψ as:
\[ \phi_3 = \frac{8.7719}{2.4790} = 3.5385 \]
\[ \phi_2 = \frac{0.2335 - 5.3716\times 3.5385}{4.2817} = -4.3846 \]
\[ \phi_1 = \frac{-2.4495 - (-2.4495)\times 3.5385 - 1.6330\times(-4.3846)}{2.4495} = 5.4615 \]
In order to create a Matlab program to perform the QR decomposition we can
first note that since the qn vectors have a magnitude of one, we can in fact remove
the qT q computation from the denominator of the projection operator in order to
simplify the computations (because we know that this will be equal to one). With
that in mind we will then have nested for loops similar to Examples 2.1, 2.2,
and 2.3 to compute the entries in Q and R:
Q = A;
for n=1:N_col
    for m=1:n-1
        R(m,n) = Q(:,m)' * Q(:,n);
        Q(:,n) = Q(:,n) - R(m,n)*Q(:,m);
    end
    R(n,n) = norm(Q(:,n));
    Q(:,n) = Q(:,n)/R(n,n);
end
where it can be observed that we don’t actually need to create variables to store
un , but rather just store these computations in the Q array. Furthermore, it can be
observed that by initializing Q = A we don’t use A in the factorization steps. At
this point we can trivially compute ψ as:
psi = Q'*b;
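Since R is upper triangular, the remaining system Rφ = ψ can then be solved by back substitution, just as in the earlier examples. A minimal sketch (assuming the Q, R, and psi arrays computed above) might be:
phi(N_row) = psi(N_row)/R(N_row,N_row);
for m=N_row-1:-1:1
    phi(m) = (psi(m) - R(m,m+1:N_row)*phi(m+1:N_row))/R(m,m);
end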
2.5 Tridiagonal Matrix Algorithm
The tridiagonal matrix algorithm (TDMA, also known as the Thomas algorithm) is a simplified form of Gaussian elimination that can be applied to tridiagonal systems of equations, which arise when we have a similar equation for each unknown of the form:
cn φn−1 + an φn + dn φn+1 = bn
and as we shall see in Part III, these types of systems can often arise when solving
partial differential equations. Now, although we could use any of the methods that
we’ve studied thus far to solve the system in Equation 2.2 we can exploit the nature
of the matrix in order to solve the system in substantially fewer computations. The
basic idea behind the method is that we perform Gaussian Elimination in the usual
way. If we were to form the augmented matrix:
\[ Ab = \begin{bmatrix} a_1 & d_1 & & & 0 & b_1 \\ c_2 & a_2 & d_2 & & & b_2 \\ & c_3 & a_3 & \ddots & & b_3 \\ & & \ddots & \ddots & d_{M-1} & \vdots \\ 0 & & & c_M & a_M & b_M \end{bmatrix} \]
Then the algorithm proceeds by modifying the second row as:
\[ Ab_2 \leftarrow Ab_2 - \frac{c_2}{a_1}Ab_1 \]
and in general, working through the rows in turn, the matrix coefficients are modified according to:
\[ \hat{c}_n = 0 \]
\[ \hat{a}_1 = a_1 \qquad \hat{a}_n = a_n\hat{a}_{n-1} - \hat{d}_{n-1}c_n \]
\[ \hat{d}_1 = d_1 \qquad \hat{d}_n = d_n\hat{a}_{n-1} \]
\[ \hat{b}_1 = b_1 \qquad \hat{b}_n = b_n\hat{a}_{n-1} - \hat{b}_{n-1}c_n \]
where the terms with the ˆ indicate the modified matrix coefficients as each row is
processed. As long as we know that all of the entries on the main diagonal will be
nonzero then we can divide out the ân terms to get:
\[ c'_n = 0 \]
\[ a'_n = 1 \]
\[ d'_1 = \frac{d_1}{a_1} \qquad d'_n = \frac{d_n}{a_n - d'_{n-1}c_n} \]
\[ b'_1 = \frac{b_1}{a_1} \qquad b'_n = \frac{b_n - b'_{n-1}c_n}{a_n - d'_{n-1}c_n} \]
So we can loop through the rows and compute modified coefficients as:
\[ d'_n = \begin{cases} \dfrac{d_n}{a_n} & n = 1 \\[2mm] \dfrac{d_n}{a_n - d'_{n-1}c_n} & n = 2, 3, \ldots, N-1 \end{cases} \]
and:
\[ b'_n = \begin{cases} \dfrac{b_n}{a_n} & n = 1 \\[2mm] \dfrac{b_n - b'_{n-1}c_n}{a_n - d'_{n-1}c_n} & n = 2, 3, \ldots, N \end{cases} \]
which is the forward sweep. The solution is then obtained by back substitution:
\[ \phi_N = b'_N \]
\[ \phi_n = b'_n - d'_n\phi_{n+1} \qquad n = N-1, N-2, \cdots, 1 \]
For such systems, the solution can be obtained in $O(M)$ operations instead of the $O(M^3)$ that we observed was required by Gaussian elimination. To illustrate these savings in computational cost by way of an example, a system of equations involving say $10^6$ unknowns (which is a realistically sized system for practical problems) is going to require $O(10^6)$ operations via the TDMA method, whereas full Gaussian elimination (not taking advantage of the tridiagonal nature) would require $O(10^{18})$ operations!
Example 2.5:
In this example we will develop a Matlab program to solve the example system:
\[ \begin{bmatrix} -2 & 1 & 0 & 0 & 0 \\ 1 & -2 & 1 & 0 & 0 \\ 0 & 1 & -2 & 1 & 0 \\ 0 & 0 & 1 & -2 & 1 \\ 0 & 0 & 0 & 1 & -2 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \phi_4 \\ \phi_5 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \]
As can be observed, we can’t use the example system that we’d been using in other
examples up until this point, since it wasn’t a tridiagonal matrix. Before we start
writing any code however, let’s work through and perform the TDMA by hand. The algorithm begins by computing the modified coefficients for row 1 as:
\[ d'_1 = \frac{d_1}{a_1} = \frac{1}{-2} = -0.5 \]
\[ b'_1 = \frac{b_1}{a_1} = \frac{0}{-2} = 0.0 \]
We then work through the rows and compute the modified coefficients for row 2 as:
\[ d'_2 = \frac{d_2}{a_2 - d'_1 c_2} = \frac{1}{-2 - (-0.5)(1)} = -0.6667 \qquad b'_2 = \frac{b_2 - b'_1 c_2}{a_2 - d'_1 c_2} = \frac{0 - 0(1)}{-2 - (-0.5)(1)} = 0.0 \]
and likewise for rows 3 and 4:
\[ d'_3 = \frac{d_3}{a_3 - d'_2 c_3} = \frac{1}{-2 - (-0.6667)(1)} = -0.7500 \qquad b'_3 = \frac{b_3 - b'_2 c_3}{a_3 - d'_2 c_3} = \frac{0 - 0(1)}{-2 - (-0.6667)(1)} = 0.0 \]
\[ d'_4 = \frac{d_4}{a_4 - d'_3 c_4} = \frac{1}{-2 - (-0.7500)(1)} = -0.8000 \qquad b'_4 = \frac{b_4 - b'_3 c_4}{a_4 - d'_3 c_4} = \frac{0 - 0(1)}{-2 - (-0.7500)(1)} = 0.0 \]
Finally, for the last row we only need the modified right hand side:
\[ b'_5 = \frac{b_5 - b'_4 c_5}{a_5 - d'_4 c_5} = \frac{1 - 0(1)}{-2 - (-0.8000)(1)} = -0.8333 \]
The back substitution then gives:
\[ \phi_5 = b'_5 = -0.8333 \]
\[ \phi_4 = b'_4 - d'_4\phi_5 = 0 - (-0.8000)(-0.8333) = -0.6667 \]
\[ \phi_3 = b'_3 - d'_3\phi_4 = 0 - (-0.7500)(-0.6667) = -0.5000 \]
\[ \phi_2 = b'_2 - d'_2\phi_3 = 0 - (-0.6667)(-0.5000) = -0.3333 \]
\[ \phi_1 = b'_1 - d'_1\phi_2 = 0 - (-0.5000)(-0.3333) = -0.1667 \]
In order to create a Matlab program to perform the TDMA we can first compute the modified coefficients for row 1, and then work through the remaining rows computing their modified coefficients as:
for n=2:N_col-1
dprime(n) = d(n) /(a(n) - c(n-1) * dprime(n-1));
bprime(n) = (b(n) - bprime(n-1)*c(n-1))/(a(n) - c(n-1) * dprime(n-1));
end
bprime(N_col) = (b(N_col) - bprime(N_col-1)*c(N_col-1))/(a(N_col) - c(N_col-1) * dprime(N_col-1));
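The solution can then be recovered with the back substitution step; a minimal sketch (assuming the dprime and bprime arrays computed above, with phi preallocated as a column vector) might be:
phi(N_col) = bprime(N_col);
for n=N_col-1:-1:1
    phi(n) = bprime(n) - dprime(n)*phi(n+1);
end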
3 Iterative Methods
Having now investigated some direct methods for solving a linear system of equa-
tions, it is time to investigate some iterative methods. It can be observed that for
all of the direct methods encountered in Chapter 2, we had a fixed number of oper-
ations, which was dependent on the size of the matrix, but after which we had the
exact solution (ignoring round off errors of course). Despite this advantageous prop-
erty of direct methods, the large number of operations required means that they are
not generally used for solving large systems, especially when the matrices are sparse
(as is generally the case when solving a partial differential equation). Furthermore,
an observation that we can make regarding ‘most’ of the direct methods is that they
require factorizing (or decomposing) A into a product of two matrices, both of which
we have to store. For the small example system that we were applying our methods
to, this feature doesn’t really matter; but for larger systems it could be a bit of a
pain.
Iterative methods work by specifying an initial ‘guess’ for the solution and suc-
cessively refining that guess until it is within some user specified tolerance to the
exact solution. We cannot generally know in advance how many iterations it will
take to converge on the correct solution. We can further classify iterative methods
as stationary or nonstationary and we will investigate them in this order. Station-
ary methods are older, simpler to understand and implement, but usually not as
effective. Non-stationary methods are a more recent development; their analysis is
usually harder to understand, but they can be highly effective. As we shall see later
on in the book, iterative methods work very nicely with sparse matrices, meaning
that we can get fast efficient solvers with minimal storage required. It is for this
reason that when we come to implementing programs to solve partial differential
equations, we will always be using iterative methods. Before we begin this investi-
gation, we need to make some definitions and introduce some concepts relevant to
all iterative methods. Throughout this book we will use the superscript k to denote
the current iteration of our numerical method. In order to avoid confusion with an
exponent, we will always state when a quantity is to be understood as being raised
to a power, if the mathematics is ambiguous. At any stage within our iterative method we can define the error as:
\[ e^k = \phi^k - \phi \qquad (3.1) \]
So we can see that the error is a column vector that indicates how far our current approximation of the solution φ^k is from the exact solution φ. Now in practice, this
measure is not of great use because in order to calculate it we need to know the
exact solution, which is what we are trying to find in the first place! A more useful
definition is that of the residual :
rk = b − Aφk (3.2)
If we rearrange Equation 3.1 in terms of φk and substitute into Equation 3.2 we get:
rk = −Aek
So while the error indicates how far we are from the correct solution, the residual
indicates how far we are from the correct value of b. Moreover we can think of the
residual as being the error transformed by A into the same ‘space’ as b. We could
always calculate the error from the residual, but this would involve computing the
inverse of A and if we were going to do that, then we wouldn’t need an iterative
method in the first place. Now we can calculate the residual at each iteration and
use this as our criterion for judging when our method has converged on a solution.
Since the residual itself is a column vector, we generally use one of the vector norms defined in Chapter 1 such that we have a single scalar value to compare to some user specified tolerance. We could then either continue iterating until the residual norm has reduced to below this tolerance (e.g. $\|r^k\|_\infty < 0.001$) or we could continue iterating until the residual norm has reduced to below some proportion of its initial value when we began the iterations (e.g. $\|r^k\|_\infty < 0.001\,\|r^0\|_\infty$). Now while the
choice of tolerance is arbitrary, a good rule of thumb is to choose half the machine
precision. Thus if the precision of calculations is about 16 digits one may choose
the tolerance to be 10−8 .
3.1 Jacobi Method
With the Jacobi method we split the matrix A into the sum of a diagonal matrix D and a remainder matrix R containing the off-diagonal entries, i.e. A = D + R where:
\[ D = \begin{bmatrix} a_{1,1} & 0 & \cdots & 0 \\ 0 & a_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{M,N} \end{bmatrix} \qquad R = \begin{bmatrix} 0 & a_{1,2} & \cdots & a_{1,N} \\ a_{2,1} & 0 & \cdots & a_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{M,1} & a_{M,2} & \cdots & 0 \end{bmatrix} \]
With this splitting Aφ = b becomes:
(D + R) φ = b
and therefore:
Dφ = b − Rφ
φ = D−1 (b − Rφ)
Now the inverse of a diagonal matrix is trivially just another diagonal matrix where
each entry on the main diagonal is the reciprocal. With the Jacobi method we use
this as the basis for an iteration:
\[ \phi^{k+1} = D^{-1}\left( b - R\phi^k \right) \]
That is, we can update our solution based on the current ‘guess’. In terms of the entries in the matrices, we get:
\[ \phi_m^{k+1} = \frac{1}{a_{m,m}}\left( b_m - \sum_{n\neq m} a_{m,n}\phi_n^k \right) \]
This method is guaranteed to converge when A is diagonally dominant and will of course ‘blow up’ if any a_{m,m} = 0. In other situations, it may or may not converge. We need some initial guess for φ^0, which ideally should be as close as possible to the solution.
Example 3.1:
In this example we will develop a Matlab program to solve the example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform a few
iterations of the Jacobi method by hand. To begin the algorithm, let’s provide the
initial guess of φ0 = {1, 1, 1}T . We will also assume that we are using the infinity
norm as our measure of convergence, and therefore, initially we have:
\[ r^0 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
with $\|r^0\|_\infty = 12$. We then work through the rows of A to update φ^1 as:
\[ \phi_1^1 = \frac{1}{a_{1,1}}\left( b_1 - a_{1,2}\phi_2^0 - a_{1,3}\phi_3^0 \right) = \frac{1}{2}\left( 3 - 1\times 1 - (-1)\times 1 \right) = 1.5 \]
\[ \phi_2^1 = \frac{1}{a_{2,2}}\left( b_2 - a_{2,1}\phi_1^0 - a_{2,3}\phi_3^0 \right) = \frac{1}{4}\left( -5 - 1\times 1 - 2\times 1 \right) = -2.0 \]
\[ \phi_3^1 = \frac{1}{a_{3,3}}\left( b_3 - a_{3,1}\phi_1^0 - a_{3,2}\phi_2^0 \right) = \frac{1}{6}\left( 7 - (-1)\times 1 - 2\times 1 \right) = 1.0 \]
with $\|r^1\|_\infty = 6.5$. We then repeat the procedure and update φ^2 as:
\[ \phi_1^2 = \frac{1}{a_{1,1}}\left( b_1 - a_{1,2}\phi_2^1 - a_{1,3}\phi_3^1 \right) = \frac{1}{2}\left( 3 - 1\times(-2.0) - (-1)\times 1.0 \right) = 3.0000 \]
\[ \phi_2^2 = \frac{1}{a_{2,2}}\left( b_2 - a_{2,1}\phi_1^1 - a_{2,3}\phi_3^1 \right) = \frac{1}{4}\left( -5 - 1\times 1.5 - 2\times 1.0 \right) = -2.1250 \]
\[ \phi_3^2 = \frac{1}{a_{3,3}}\left( b_3 - a_{3,1}\phi_1^1 - a_{3,2}\phi_2^1 \right) = \frac{1}{6}\left( 7 - (-1)\times 1.5 - 2\times(-2.0) \right) = 2.0833 \]
with $\|r^2\|_\infty = 3.6667$. As can be observed, each successive iteration brings the solution closer to the exact solution, with the residual norm decreasing in turn. With a tolerance of $10^{-8}$ the algorithm converges after 58 iterations. Now, in order to create an algorithm, it’s worth noting that we only need the previous iteration to compute the next one. So there’s little point in storing every value of φ^k; instead what
we can do is have two vectors φold and φ. We initialize φold as φ0 and update to
find φ. We then assign φold = φ and repeat.
In order to create a Matlab program to perform the Jacobi method we will write
the method in a slightly different manner to the way we implemented the direct
methods in Chapter 2. Since we don’t know how many iterations our method will
take, we will use a while loop to continue iterating until we have converged on a
solution. As it happens we will stick to this form throughout the book and use
while loops to indicate iteration where the number of steps is not fixed. It is worth
pointing out however that one can achieve an iterative loop with the for loop
construct; furthermore, we could also have achieved all of the looping over rows
and columns in the examples of Chapter 2 with while loops. So this convention is
mainly just to aid in understanding. With this in mind, our program will look like:
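A sketch of this program, consistent with the description that follows (and assuming that phi, phi_old, r_norm, the tolerance, and a maximum iteration count N_k have already been initialized), is:
while r_norm>tolerance && k<N_k
    for m=1:N_row
        A_mphi = 0;
        for n=1:N_col
            if(n ~= m)
                A_mphi = A_mphi + A(m,n)*phi_old(n);   % sum of the off-diagonal terms
            end
        end
        phi(m) = (b(m) - A_mphi)/A(m,m);
    end
    phi_old = phi;          % copy the new solution into the 'old' array
    r = b - A*phi;          % updated residual
    r_norm = max(abs(r));   % infinity norm
    k = k+1;                % increment the iteration counter
end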
So within our iterative while loop we loop over all of the rows in A, then for
each row, loop over all of the columns in A and assemble the sum A_mphi which is
essentially the mth row in A multiplied by the column vector phi_old. An important
point to note is that we could also achieve this with the code A(m,:)*phi_old, but
this would include the A(m, m), which we do not want included in our summation.
Once the summation is complete, we can update φm . The last four lines within
the while loop consist of copying the data stored in phi into phi_old, computing
the updated residual and the infinity norm, then finally incrementing the iteration
counter. Another important point worth noting is that programmatically it makes
sense to have some user specified upper limit to the number of iterations just to
make sure that we don’t ever get stuck in an infinite loop if the method fails to
converge. For this reason the iterative while loop will continue until either the
solution converges, or until k reaches the maximum number of iterations.
The complete program is given in Example3_1.m.
3.2 Gauss-Seidel Method
With the Gauss-Seidel method we instead split the matrix A into the sum of a lower triangular matrix L (which includes the main diagonal) and a strictly upper triangular matrix U, i.e. A = L + U where:
\[ L = \begin{bmatrix} a_{1,1} & 0 & \cdots & 0 \\ a_{2,1} & a_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{M,1} & a_{M,2} & \cdots & a_{M,N} \end{bmatrix} \qquad U = \begin{bmatrix} 0 & a_{1,2} & \cdots & a_{1,N} \\ 0 & 0 & \cdots & a_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \]
With this splitting Aφ = b becomes:
(L + U ) φ = b
and therefore:
Lφ = b − U φ
φ = L−1 (b − U φ)
So the basic idea behind the Gauss-Seidel method is that we use this as the basis
for an iteration:
\[ \phi^{k+1} = L^{-1}\left( b - U\phi^k \right) \]
However, by taking advantage of the triangular form of L, the elements of φ^{k+1} can be computed sequentially using forward substitution. In terms of the entries in the matrices, this is:
\[ \phi_m^{k+1} = \frac{1}{a_{m,m}}\left( b_m - \sum_{n > m} a_{m,n}\phi_n^k - \sum_{n < m} a_{m,n}\phi_n^{k+1} \right) \]
Note, this method is actually the same as the Jacobi method, except that we take
our φm values to be from the current iteration k + 1, whenever these are available.
This means that we are always using the most up to date information available to
us, which hopefully leads to faster convergence. Similar to the Jacobi method, we
need diagonal dominance and no zero entries on the main diagonal in order for this
method to be able to converge.
Example 3.2:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform a few
iterations of the Gauss-Seidel method by hand. To begin the algorithm, let’s provide
the initial guess of φ0 = {1, 1, 1}T . We will also assume that we are using the infinity
norm as our measure of convergence, and therefore, initially we have:
\[ r^0 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
with $\|r^0\|_\infty = 12$. We then work through the rows of A to update φ^1 as:
\[ \phi_1^1 = \frac{1}{a_{1,1}}\left( b_1 - a_{1,2}\phi_2^0 - a_{1,3}\phi_3^0 \right) = \frac{1}{2}\left( 3 - 1\times 1 - (-1)\times 1 \right) = 1.5000 \]
\[ \phi_2^1 = \frac{1}{a_{2,2}}\left( b_2 - a_{2,1}\phi_1^1 - a_{2,3}\phi_3^0 \right) = \frac{1}{4}\left( -5 - 1\times 1.5 - 2\times 1 \right) = -2.1250 \]
\[ \phi_3^1 = \frac{1}{a_{3,3}}\left( b_3 - a_{3,1}\phi_1^1 - a_{3,2}\phi_2^1 \right) = \frac{1}{6}\left( 7 - (-1)\times 1.5 - 2\times(-2.1250) \right) = 2.1250 \]
and then update the residual as:
\[ r^1 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1.5000 \\ -2.1250 \\ 2.1250 \end{bmatrix} = \begin{bmatrix} 4.2500 \\ -2.2500 \\ 0.0000 \end{bmatrix} \]
with $\|r^1\|_\infty = 4.2500$. So already after one iteration we have a different solution compared to the Jacobi method. We then repeat the procedure and update φ^2 as:
\[ \phi_1^2 = \frac{1}{a_{1,1}}\left( b_1 - a_{1,2}\phi_2^1 - a_{1,3}\phi_3^1 \right) = \frac{1}{2}\left( 3 - 1\times(-2.1250) - (-1)\times 2.1250 \right) = 3.6250 \]
\[ \phi_2^2 = \frac{1}{a_{2,2}}\left( b_2 - a_{2,1}\phi_1^2 - a_{2,3}\phi_3^1 \right) = \frac{1}{4}\left( -5 - 1\times 3.6250 - 2\times 2.1250 \right) = -3.2188 \]
\[ \phi_3^2 = \frac{1}{a_{3,3}}\left( b_3 - a_{3,1}\phi_1^2 - a_{3,2}\phi_2^2 \right) = \frac{1}{6}\left( 7 - (-1)\times 3.6250 - 2\times(-3.2188) \right) = 2.8438 \]
and then update the residual as:
\[ r^2 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 3.6250 \\ -3.2188 \\ 2.8438 \end{bmatrix} = \begin{bmatrix} 1.8126 \\ -1.4374 \\ -0.0002 \end{bmatrix} \]
with $\|r^2\|_\infty = 1.8126$. So after 2 iterations we can see that the residual norm is
about half of what it was in the case of the Jacobi method and with a tolerance of
10−8 the algorithm converges in 30 iterations.
In order to create a Matlab program to perform the Gauss-Seidel method we
can note that the immediate use of updated φk+1 means that instead of needing
the ‘old’ and ‘new’ arrays, we can just have the one array and update in place. We can do this because we know that when we come to update, say, $\phi_3^{k+1}$, we will have already updated $\phi_1^{k+1}$ and $\phi_2^{k+1}$, but if there were entries $\phi_4 \cdots \phi_M$, they would still contain data from iteration k. So we just use the values as if we didn’t care which iteration they were from and ‘all will be OK’. So we could create an algorithm for
the Gauss-Seidel method in Matlab as:
while r_norm>tolerance && k<N_k
for m=1:N_row
SumA_mnphi_n = 0;
for n=1:N_col
if(n ~= m)
SumA_mnphi_n = SumA_mnphi_n + A(m,n)*phi(n);
end
end
phi(m) = (b(m) - SumA_mnphi_n)/A(m,m);
end
r = b - A*phi;
r_norm = max(abs(r));
k = k+1;
end
It can be observed that this algorithm is almost exactly the same as the implementation of the Jacobi method, except that there is no longer any need for the phi_old array. In order to
create a C++ program we will use essentially the same programming constructs and
so the ‘bulk’ of the algorithm will look like:
while(r_norm>tolerance && k<N_k)
{
for(m=0; m<N_row; m++)
{
A_mphi = 0.0;
for(n=0; n<N_col; n++)
{
if(n!=m)
{
A_mphi += A[m][n]*phi[n];
}
}
phi[m] = (b[m] - A_mphi)/A[m][m];
}
...
k++;
}
where the major differences are simply the different syntax of the for and while
loops. The only part of the algorithm that is significantly different is the computa-
tion of the residual and the infinity norm. With C++, we can’t perform matrix-vector
multiplications or additions and subtractions in the same way that we do in Matlab.
So in order to achieve the equivalent computations we will have:
r_norm = 0.0;
Here we have two nested for loops, working through the rows and then columns
of A, multiplying the mth row of A by the current solution. As each entry in the
residual vector is computed, it is compared with the current value for the infinity
norm and if greater, the infinity norm is reassigned. So in this way, the infinity
norm is computed ‘on the fly’ as we compute the residual vector itself, rather than
afterwards.
The complete programs are given in Example3_2.m and Example3_2.cpp.
3.3 Successive Over-Relaxation Method
With the successive over-relaxation (SOR) method we split the matrix A into the sum of a diagonal matrix D, a strictly lower triangular matrix L, and a strictly upper triangular matrix U, i.e. A = D + L + U where:
\[ D = \begin{bmatrix} a_{1,1} & 0 & \cdots & 0 \\ 0 & a_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{M,N} \end{bmatrix} \quad L = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ a_{2,1} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{M,1} & a_{M,2} & \cdots & 0 \end{bmatrix} \quad U = \begin{bmatrix} 0 & a_{1,2} & \cdots & a_{1,N} \\ 0 & 0 & \cdots & a_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \]
With this splitting Aφ = b becomes:
(D + L + U ) φ = b
We then multiply both sides by a constant ω > 1, which we will call an over-
relaxation factor as:
ω (D + L + U ) φ = ωb
which we can rearrange as:
\[ (D + \omega L)\phi = \omega b - \left( \omega U + (\omega - 1)D \right)\phi \]
\[ \phi = (D + \omega L)^{-1}\left( \omega b - \left( \omega U + (\omega - 1)D \right)\phi \right) \]
The reason for incorporating ω is so that we can make larger changes to our solution at each iteration so that we ‘hopefully’ converge on a solution faster (i.e. in fewer iterations), because we are able to make a larger update. So the basic idea behind the successive over-relaxation method is that we use this as the basis for an iteration:
\[ \phi^{k+1} = (D + \omega L)^{-1}\left( \omega b - \left( \omega U + (\omega - 1)D \right)\phi^k \right) \]
The choice of relaxation factor is not necessarily easy, and depends upon the prop-
erties of the A. For symmetric, positive definite matrices it can be proven that
0 < ω < 2 will lead to convergence, but we are generally interested in faster conver-
gence rather than just convergence.
Example 3.3:
In this example we will develop a Matlab program to solve the example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform a few
iterations of the successive over-relaxation method by hand. To begin the algorithm,
let’s provide the initial guess of φ0 = {1, 1, 1}T and we’ll choose ω = 1.2. We will
also assume that we are using the infinity norm as our measure of convergence, and
therefore, initially we have:
\[ r^0 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
with $\|r^0\|_\infty = 12$. We then work through the rows of A to update φ^1 as:
\[ \phi_1^1 = (1-\omega)\phi_1^0 + \frac{\omega}{a_{1,1}}\left( b_1 - a_{1,2}\phi_2^0 - a_{1,3}\phi_3^0 \right) = (-0.2)(1) + \frac{1.2}{2}\left( 3 - 1\times 1 - (-1)\times 1 \right) = 1.6000 \]
\[ \phi_2^1 = (1-\omega)\phi_2^0 + \frac{\omega}{a_{2,2}}\left( b_2 - a_{2,1}\phi_1^1 - a_{2,3}\phi_3^0 \right) = (-0.2)(1) + \frac{1.2}{4}\left( -5 - 1\times 1.6 - 2\times 1 \right) = -2.7800 \]
\[ \phi_3^1 = (1-\omega)\phi_3^0 + \frac{\omega}{a_{3,3}}\left( b_3 - a_{3,1}\phi_1^1 - a_{3,2}\phi_2^1 \right) = (-0.2)(1) + \frac{1.2}{6}\left( 7 - (-1)\times 1.6 - 2\times(-2.7800) \right) = 2.6320 \]
with $\|r^1\|_\infty = 5.2120$. We then repeat the procedure and update φ^2 as:
\[ \phi_1^2 = (1-\omega)\phi_1^1 + \frac{\omega}{a_{1,1}}\left( b_1 - a_{1,2}\phi_2^1 - a_{1,3}\phi_3^1 \right) = (-0.2)(1.6000) + \frac{1.2}{2}\left( 3 - 1\times(-2.7800) - (-1)\times 2.6320 \right) = 4.7272 \]
\[ \phi_2^2 = (1-\omega)\phi_2^1 + \frac{\omega}{a_{2,2}}\left( b_2 - a_{2,1}\phi_1^2 - a_{2,3}\phi_3^1 \right) = (-0.2)(-2.7800) + \frac{1.2}{4}\left( -5 - 1\times 4.7272 - 2\times 2.6320 \right) = -3.9414 \]
\[ \phi_3^2 = (1-\omega)\phi_3^1 + \frac{\omega}{a_{3,3}}\left( b_3 - a_{3,1}\phi_1^2 - a_{3,2}\phi_2^2 \right) = (-0.2)(2.6320) + \frac{1.2}{6}\left( 7 - (-1)\times 4.7272 - 2\times(-3.9414) \right) = 3.3956 \]
and then update the residual as:
\[ r^2 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 4.7272 \\ -3.9414 \\ 3.3956 \end{bmatrix} = \begin{bmatrix} 0.8825 \\ -0.7529 \\ -0.7636 \end{bmatrix} \]
with $\|r^2\|_\infty = 0.8825$. So after 2 iterations we can see that the residual norm is less
than half of what it was in the case of the Gauss-Seidel method and with a tolerance
of 10−8 the algorithm converges in 17 iterations.
In order to create a Matlab program to perform the successive over-relaxation
method we can essentially use the same algorithm that we developed to implement
the Gauss-Seidel method, simply changing the line of code where the new values for
φk+1 are updated:
while r_norm>tolerance && k<N_k
for m=1:N_row
SumA_mnphi_n = 0;
for n=1:N_col
if(n ~= m)
SumA_mnphi_n = SumA_mnphi_n + A(m,n)*phi(n);
end
end
phi(m) = (1-omega)*phi(m) + omega*(b(m) - SumA_mnphi_n)/A(m,m);
end
r = b - A*phi;
r_norm = max(abs(r));
k = k+1;
end
The complete program is given in Example3_3.m.
Having now investigated some stationary iterative methods, we will turn our
attention to nonstationary methods. With the Jacobi, Gauss-Seidel, and successive
over-relaxation methods, the observation can be made that in performing an itera-
tion, the computations involve the current or old estimates of φ and the entries in
A and b, which are constant at each iteration (and in fact do not change through-
out the entire computation). With nonstationary methods on the other hand, the
computations involve information that does change at each iteration.
3.4 Steepest Descent Method
The first nonstationary method that we will consider is the method of steepest descent, which is based on the idea of the quadratic form:
\[ f(\phi) = \frac{1}{2}\phi^T A\phi - b^T\phi + c \]
Figure 3.1: Quadratic form for an example system involving two equations.
As can be observed the quadratic form f is a scalar, quadratic function of the vector
φ and c is a scalar. If φ happened to be a 2 × 1 column vector, then we could plot
the quadratic form as shown in Figure 3.1. It can be observed that f takes the
shape of a paraboloid and you may well wonder why this is the case, or if this is
always the case? The answer is that it will always be a paraboloid if A is positive
definite. If A is not positive definite then we end up with other surfaces such as
inverted paraboloids or hyperboloids and the method will not work. So bearing in
mind that our method is restricted to positive definite systems we can proceed to
computing the gradient of f :
\[ \frac{\partial f}{\partial\phi} = \left\{ \frac{\partial f}{\partial\phi_1}, \frac{\partial f}{\partial\phi_2}, \cdots, \frac{\partial f}{\partial\phi_M} \right\} \]
So while the quadratic form is a scalar valued function, the gradient of the quadratic form is a vector-valued function and can be written as:
\[ \frac{\partial f}{\partial\phi} = \frac{1}{2}A^T\phi + \frac{1}{2}A\phi - b \]
and if A is symmetric, this reduces to:
\[ \frac{\partial f}{\partial\phi} = A\phi - b \qquad (3.4) \]
Remembering that we can find the minimum of a function by setting its derivative to
zero; the minimum of the quadratic form occurs when Aφ = b. The important point
here is that the solution to our linear system of equations occurs at the minimum
of the quadratic form. With this in mind (and noting that we are now restricting
ourselves to symmetric positive definite systems) the basic idea behind the method
of steepest descent can be thought of as specifying an initial guess for φ (which will
define a point on the surface) and iteratively stepping our way down the surface
until we reach the bottom. At this point we will have found the solution. Now
earlier we made the definition:
rk = b − Aφk
where rk is a vector that indicates how far away we are from the correct value of
b. More importantly however, by examination of Equation 3.4 we can infer that
the residual is actually the negative of the gradient of f , so we can think of the
residual as the direction of steepest descent. Given an initial point on the surface,
the algorithm works by taking a step in the direction of steepest descent:
\[ \phi^{k+1} = \phi^k + \alpha r^k \]
where α is the length of the step that we take. A useful analogy is to imagine that we are trying to walk down to the bottom of a valley; wherever we happen to be standing there will be a direction of steepest descent and we will work our way
down by walking in that direction for a while, then reexamining the new direction
of steepest descent and walking in that direction for a while, and so on. Now, if
we could ‘see’ the bottom then of course we would head in that direction (and our
problem would of course already be solved), but to continue with the analogy, let’s
further assume that it’s a foggy day, so we can’t see the bottom of the valley, or
anything much around us. All we have is the direction of steepest descent based
on where we are currently standing. When we take a step we are committed to
walking a distance α in that direction and what we would like is to choose α such
that we make our way as far down the valley as possible, but of course not walk so
far that we begin making our way up the other side of the valley. Put another way
we want to choose α to minimize f along our direction of steepest descent. From
basic calculus, α minimizes f when the directional derivative $\frac{d}{d\alpha}f(\phi^{k+1})$ is equal to zero. By the chain rule we can write:
\[ \frac{d\phi^{k+1}}{d\alpha} = r^k \]
and that:
\[ \frac{\partial f(\phi^{k+1})}{\partial\phi^{k+1}} = -r^{k+1} \]
So, by setting the directional derivative to zero, we find that α should minimize f when $r^k$ and $r^{k+1}$ are orthogonal:
\[ (r^{k+1})^T r^k = 0 \]
\[ \left( b - A\phi^{k+1} \right)^T r^k = 0 \]
\[ \left( b - A(\phi^k + \alpha r^k) \right)^T r^k = 0 \]
\[ \left( b - A\phi^k \right)^T r^k - \alpha (Ar^k)^T r^k = 0 \]
\[ \left( b - A\phi^k \right)^T r^k = \alpha (Ar^k)^T r^k \]
\[ (r^k)^T r^k = \alpha (r^k)^T (Ar^k) \]
\[ \alpha = \frac{(r^k)^T r^k}{(r^k)^T A r^k} \]
At this point our method of steepest descent is complete.
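Putting these pieces together, a minimal Matlab sketch of the resulting iteration (assuming that phi, r, r_norm, the tolerance, and a maximum iteration count N_k have been initialized) might look like:
while r_norm>tolerance && k<N_k
    alpha = (r'*r)/(r'*A*r);   % step length along the direction of steepest descent
    phi = phi + alpha*r;       % take the step
    r = b - A*phi;             % recompute the residual
    r_norm = norm(r);          % two norm for the convergence check
    k = k+1;
end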
Example 3.4:
In this example we will develop a Matlab program to solve the example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform a few
iterations of the steepest descent method by hand. To begin the algorithm, let’s
provide the initial guess of φ0 = {1, 1, 1}T . This time however we’ll use the two
norm as our measure of convergence, and therefore, initially we have:
\[ r^0 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
with $\|r^0\|_2 = (1^2 + (-12)^2 + 0^2)^{\frac{1}{2}} = 12.0416$. We then compute α as:
\[ \alpha = \frac{(r^0)^T r^0}{(r^0)^T A r^0} = \frac{\begin{bmatrix} 1 & -12 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix}}{\begin{bmatrix} 1 & -12 & 0 \end{bmatrix}\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix}} = 0.2617 \]
and update φ^1 as:
\[ \phi^1 = \phi^0 + \alpha r^0 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} + 0.2617\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} = \begin{bmatrix} 1.2617 \\ -2.1404 \\ 1.0000 \end{bmatrix} \]
with $\|r^1\|_2 = (3.6174^2 + 0.3015^2 + 6.5433^2)^{\frac{1}{2}} = 7.4827$. We then repeat the procedure
and compute α as:
\[ \alpha = \frac{(r^1)^T r^1}{(r^1)^T A r^1} = \frac{\begin{bmatrix} 3.6173 & 0.3014 & 6.5433 \end{bmatrix}\begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix}}{\begin{bmatrix} 3.6173 & 0.3014 & 6.5433 \end{bmatrix}\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix}} = 0.2275 \]
and update φ^2 as:
\[ \phi^2 = \phi^1 + \alpha r^1 = \begin{bmatrix} 1.2617 \\ -2.1404 \\ 1.0000 \end{bmatrix} + 0.2275\begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix} = \begin{bmatrix} 2.0846 \\ -2.0718 \\ 2.4886 \end{bmatrix} \]
So it can be observed that each iteration involves computing a new value for α
based on the old residuals, updating φ, and computing the new residual. Here we
are using the Matlab function norm to compute the two norm. This function can
in fact compute other norms such as the infinity norm (so we could have used it in
previous examples if we’d wished) by passing in a second argument to the function,
but defaults to computing the two norm.
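For instance, the infinity norm used in the earlier examples could have been computed with a call along the lines of:
r_norm = norm(r, Inf);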
The complete program is given in Example3_4.m.
Figure 3.2: A comparison of the iterations of the steepest descent and conjugate
gradient methods for the example system involving two equations, the quadratic
form of which is shown in Figure 3.1
3.5 Conjugate Gradient Method
The conjugate gradient method uses a similar idea to the method of steepest descent to solve a linear system, but removes one of its major disadvantages, namely
that steepest descent often finds itself taking steps in the same direction as previous
steps (Figure 3.2(a)). A much better approach would be to select a set of search
directions {d0 , d1 , d2 , · · · dM } and take exactly one step in each direction. Further-
more each step we take will be of the right length to obtain the correct value of one
part of the solution φm , then after M iterations we will have found the minimum of
f (i.e. solved Aφ = b). Using the previous analogy of trying to walk down to the
bottom of the valley, the method of steepest descent is analogous to walking down
the hill in a ‘zig zag’ pattern that ‘say’ a skier might take. The idea behind the
conjugate gradient method on the other hand is analogous to ‘heading south, by the
correct amount’, then ‘heading east by the correct amount’. Now obviously from
Figures 3.2(a) and 3.2(b) we our basing our analogy on a system with 2 unknowns,
but the concept extends to higher dimensions (it just becomes harder to visualize).
So, our update for φk+1 is given by:
φk+1 = φk + αdk
which is similar to the steepest descent method, but here we are stepping along
our search direction, rather than along the direction of the residual. We will also
choose α to minimize f along our search direction as:
\[ (r^{k+1})^T d^k = 0 \]
\[ \left( b - A\phi^{k+1} \right)^T d^k = 0 \]
\[ \left( b - A(\phi^k + \alpha d^k) \right)^T d^k = 0 \]
\[ \left( b - A\phi^k \right)^T d^k - \alpha (Ad^k)^T d^k = 0 \]
\[ \left( b - A\phi^k \right)^T d^k = \alpha (Ad^k)^T d^k \]
\[ (r^k)^T d^k = \alpha (d^k)^T A d^k \]
\[ \alpha = \frac{(d^k)^T r^k}{(d^k)^T A d^k} \qquad (3.6) \]
So the big question now is how we go about constructing our search directions.
Referring back to Figure 3.2(b), it can be observed that with each step we are
updating the error vector as:
ek+1 = ek + αdk
and using our definition of the residual vector rk = −Aek , this implies that at each
step the new residual is being updated as:
\[ r^{k+1} = -Ae^{k+1} = -A(e^k + \alpha d^k) = r^k - \alpha A d^k \qquad (3.7) \]
More importantly, the new error vector is orthogonal to the previous search direction:
(dk )T ek+1 = 0
Of course this fact is of no practical use to us in terms of constructing the search
directions, because if we ever happened to know the error vector at any iteration, the
problem would immediately be solved. What we can do however is choose that the
search directions be A-orthogonal to one another, rather than orthogonal. This is a
key feature of the conjugate gradient method and implies that rather than choosing:
(dk )T dk+1 = 0
we have:
(dk )T Adk+1 = 0
This also has the effect of making the error vector A-orthogonal to the previous
search directions, rather than orthogonal:
(dk )T Aek+1 = 0
Using our definition of the residual vector, we see that this choice means that the
new residual will be orthogonal to the previous search direction:
−(dk )T rk+1 = 0
So once we take a step in a search direction, we need never step in that direction
again; the error vector is evermore A-orthogonal to all of the old search directions
and hence the residual is evermore orthogonal to all of the old search directions.
At this point, all that is missing is a way to construct a set of A-orthogonal
search directions. As it happens, we’ve already come across just such a technique in
Chapter 2; namely the Gram-Schmidt process, which we used in performing the QR
decomposition. Here however, instead of using the columns of A to construct the set
of orthonormal vectors (which formed the columns of Q), we will use the residual vectors r^k to construct a set of A-orthogonal vectors. As such, the Gram-Schmidt process to compute a new search direction is:
\[ d^{k+1} = r^{k+1} - \sum_{n=0}^{k}\frac{(r^{k+1})^T A d^n}{(d^n)^T A d^n}d^n \qquad (3.8) \]
where it can be observed that the projection operator is different compared to Equa-
tion 2.1 since we are constructing A-orthogonal search vectors rather than orthogonal
vectors. Also, because we are not interested in an orthonormal set of vectors, we
don’t need to worry about the normalization step as we did with QR decomposition.
Now, because the search vectors are built from the residuals, and because each
residual is orthogonal to the previous search direction, each residual is also orthog-
onal to the previous residuals, hence a result that we will find useful shortly is
that:
(rk )T (rn ) = 0 k 6= n
Furthermore, if we pre-multiply Equation 3.8 by $(r^{k+1})^T$ we get another useful result that:
\[ (r^{k+1})^T d^{k+1} = (r^{k+1})^T r^{k+1} \qquad (3.9) \]
since the residual is orthogonal to all of the previous search directions. Writing Equation 3.8 more compactly, the new search direction is:
\[ d^{k+1} = r^{k+1} + \sum_{n=0}^{k}\beta_{k,n}d^n \]
where:
\[ \beta_{k,n} = -\frac{(r^{k+1})^T A d^n}{(d^n)^T A d^n} \qquad (3.10) \]
One feature of the Gram-Schmidt process is that we need to store each search
direction in order to evaluate each βk,n term and hence compute a new search di-
rection. This is an undesirable feature of the method because we don’t want to be
storing all of these vectors in memory, especially for a large system of equations.
Even if A happens to be a sparse matrix where we may be able to achieve substan-
tial savings by using a sparse matrix storage format, this wouldn’t help with the
fact that we would still need to store all of the old search directions, and this would
make the Conjugate Gradient method less attractive as an iterative method. As it
happens however, we can actually eliminate the need to store all of the old search
directions and to show how this is possible, we begin by taking Equation 3.7 and
pre-multiply by (rk+1 )T to get:
\[ (r^{k+1})^T A d^n = \frac{1}{\alpha^n}\left( (r^{k+1})^T r^n - (r^{k+1})^T r^{n+1} \right) \qquad (3.11) \]
Substituting into Equation 3.10 we get:
\[ \beta_{k,n} = -\frac{(r^{k+1})^T r^n - (r^{k+1})^T r^{n+1}}{\alpha^n (d^n)^T A d^n} \qquad (3.12) \]
But because all of our residuals are orthogonal, the only instance where Equation 3.12 is nonzero is when n = k (i.e. for the latest search direction), where it is given by:
\[ \beta_{k,k} = \frac{(r^{k+1})^T r^{k+1}}{\alpha^k (d^k)^T A d^k} \qquad (3.13) \]
All other values of n < k will result in inner products of residuals from previous iterations, which we know are orthogonal, and hence the numerator will be zero. Now, finally we can substitute Equation 3.9 into Equation 3.6 so that we can evaluate α at any iteration as:
\[ \alpha = \frac{(r^k)^T r^k}{(d^k)^T A d^k} \qquad (3.14) \]
and finally we can substitute Equation 3.14 into Equation 3.13 so that we can
evaluate β at any iteration as:
\[ \beta = \frac{(r^{k+1})^T r^{k+1}}{(r^k)^T r^k} \]
So at this point our conjugate gradient method is complete and will be similar to
that of steepest descent apart from the fact that after updating the residual, we will
need to update β and then compute a new search direction, before beginning the
next iteration.
Example 3.5:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform a few
iterations of the conjugate gradient method by hand. To begin the algorithm, let’s
provide the initial guess of φ0 = {1, 1, 1}T . As with the steepest descent method
we will use the two norm as our measure of convergence, and therefore, initially we
have:
\[ r^0 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
with $\|r^0\|_2 = (1^2 + (-12)^2 + 0^2)^{\frac{1}{2}} = 12.0416$. Furthermore, our initial search direction will be equal to the initial residual d^0 = {1, −12, 0}^T. We then compute α as:
\[ \alpha = \frac{(r^0)^T r^0}{(d^0)^T A d^0} = \frac{\begin{bmatrix} 1 & -12 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix}}{\begin{bmatrix} 1 & -12 & 0 \end{bmatrix}\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix}} = 0.2617 \]
We then update φ^1 = φ^0 + αd^0 = {1.2617, −2.1404, 1.0000}^T and the residual as:
\[ r^1 = r^0 - \alpha A d^0 = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} - 0.2617\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} = \begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix} \]
with $\|r^1\|_2 = (3.6173^2 + 0.3014^2 + 6.5433^2)^{\frac{1}{2}} = 7.4827$. We then compute the projection onto
the previous residual:
\[ \beta = \frac{(r^1)^T r^1}{(r^0)^T r^0} = \frac{\begin{bmatrix} 3.6173 & 0.3014 & 6.5433 \end{bmatrix}\begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix}}{\begin{bmatrix} 1 & -12 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix}} = 0.3861 \]
and then compute the new search direction:
\[ d^1 = r^1 + \beta d^0 = \begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix} + 0.3861\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} = \begin{bmatrix} 4.0035 \\ -4.3323 \\ 6.5433 \end{bmatrix} \]
We then repeat the procedure for the second iteration, computing α = 0.3423, updating φ^2 = φ^1 + αd^1, and then updating the residual as:
\[ r^2 = r^1 - \alpha A d^1 = \begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix} - 0.3423\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 4.0035 \\ -4.3323 \\ 6.5433 \end{bmatrix} = \begin{bmatrix} 4.5994 \\ 0.3833 \\ -2.5603 \end{bmatrix} \]
with $\|r^2\|_2 = (4.5994^2 + 0.3833^2 + (-2.5603)^2)^{\frac{1}{2}} = 5.2780$, compute the projection
onto the previous residual:
\[ \beta = \frac{(r^2)^T r^2}{(r^1)^T r^1} = \frac{\begin{bmatrix} 4.5994 & 0.3833 & -2.5603 \end{bmatrix}\begin{bmatrix} 4.5994 \\ 0.3833 \\ -2.5603 \end{bmatrix}}{\begin{bmatrix} 3.6173 & 0.3014 & 6.5433 \end{bmatrix}\begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix}} = 0.4975 \]
and then compute the new search direction:
\[ d^2 = r^2 + \beta d^1 = \begin{bmatrix} 4.5994 \\ 0.3833 \\ -2.5603 \end{bmatrix} + 0.4975\begin{bmatrix} 4.0035 \\ -4.3323 \\ 6.5433 \end{bmatrix} = \begin{bmatrix} 6.5912 \\ -1.7720 \\ 0.6951 \end{bmatrix} \]
We then repeat the procedure for the third and final iteration and compute α as:
\[ \alpha = \frac{(r^2)^T r^2}{(d^2)^T A d^2} = \frac{\begin{bmatrix} 4.5994 & 0.3833 & -2.5603 \end{bmatrix}\begin{bmatrix} 4.5994 \\ 0.3833 \\ -2.5603 \end{bmatrix}}{\begin{bmatrix} 6.5912 & -1.7720 & 0.6951 \end{bmatrix}\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 6.5912 \\ -1.7720 \\ 0.6951 \end{bmatrix}} = 0.4292 \]
The final update φ^3 = φ^2 + αd^2 = {5.4615, −4.3846, 3.5385}^T is then the exact solution computed with all of the other methods studied. In
contrast to steepest descent which took 78 iterations to get to the same solution,
the conjugate gradient method only took 3; an obvious advantage! In order to
create a Matlab program to perform the conjugate gradient algorithm we will have
something very similar to the steepest descent code, the major difference being that
we will have the extra step of computing β at each iteration, plus we will have to
store both an ‘old’ and a ‘new’ residual vector. That being said, we can still create
quite a concise algorithm:
while r_norm>tolerance && k<N_k
alpha = (r_old’*r_old)./(d’*A*d);
phi = phi + alpha*d;
r = r_old - alpha*A*d;
beta = (r’*r)/(r_old’*r_old);
d = r + beta*d;
r_old = r;
r_norm = norm(r);
k = k+1;
end
while(r_norm>tolerance && k<N_k)
{
    // compute the matrix-vector product Ad = A*d
    for(m=0; m<N_row; m++)
    {
        Ad[m] = 0.0;
        for(n=0; n<N_col; n++)
        {
            Ad[m] += A[m][n]*d[n];
        }
    }
    // compute the inner product d'*A*d
    dTAd = 0.0;
    for(m=0; m<N_row; m++)
    {
        dTAd += d[m]*Ad[m];
    }
    alpha = r_oldTr_old/dTAd;
for(m=0; m<N_row; m++)
{
phi[m] += alpha*d[m];
}
for(m=0; m<N_row; m++)
{
r[m] = r_old[m] - alpha*Ad[m];
}
rTr = 0.0;
for(m=0; m<N_row; m++)
{
rTr += r[m]*r[m];
}
beta = rTr/r_oldTr_old;
for(m=0; m<N_row; m++)
{
d[m] = r[m] + beta*d[m];
}
for(m=0; m<N_row; m++)
{
r_old[m] = r[m];
}
r_oldTr_old = rTr;
r_norm = sqrt(rTr);
k++;
}
Here it can be observed that the major difference compared to the equivalent Matlab
code is that we have to perform the matrix-vector multiplications to compute dT Ad
via nested for loops and the vector addition and subtraction within single for
loops. Another interesting point to note is that when we take the inner product
of the new residual vector rT r (stored in the variable rTr) we have in fact almost
computed the two norm, since it is defined as the square root of the sum of the
squares of each term in r. We can therefore compute the two norm by simply taking
the square root of rTr, which saves on the amount of computation compared to
evaluating the residual from r = b − Aφ.
The complete programs are given in Example3_5.m and Example3_5.cpp.
A few issues are worth mentioning. Firstly, although in the example presented here a system with 3 unknowns required 3 iterations, you might think that this is more similar to a direct method, where we know how many computations we'll need to perform in order to obtain a solution. For larger systems however, we will
typically require fewer iterations than the number of unknowns for the residual norm
to drop to an acceptable level. For example, for a system with 106 unknowns we
might be able to get a converged solution in only a few hundred iterations. We
are however guaranteed to have an exact solution after 106 iterations, but we’d
never want to perform that many in practice. The second issue worth pointing
out is that there are a number of variations of this basic method, allowing for its
application to asymmetric systems and adding in preconditioning (i.e. multiplying
A by another matrix in order to produce a modified system with a better condition
number). For the interested reader an excellent reference for more detailed aspects of
conjugate gradient methods can be found in the book by Shewchuk [74]. In Matlab
the functions bicg, pcg (plus others) provide implementations of these variations.
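As a hedged illustration (the argument list shown here is an assumption to be checked against the Matlab documentation), a call to pcg might look like:
phi = pcg(A, b, 1e-8, 1000);   % solve A*phi = b to a tolerance of 1e-8, with at most 1000 iterations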
3.6 Generalized Minimal Residual Method
The generalized minimal residual (GMRES) method is another Krylov subspace method, but unlike the conjugate gradient method it can be applied to general (i.e. non-symmetric) matrices. Before outlining the method we need to introduce a few concepts. Given a collection of vectors v_1, v_2, ⋯, v_N, a linear combination is an expression of the form:
\[ a_1 v_1 + a_2 v_2 + \cdots + a_N v_N \]
where the coefficients an are scalar quantities and we want to be able to express
other vectors in terms of linear combinations of this collection. We then define the
span to be the set of all linear combinations of the collection of vectors, which is
always a subspace of RD . Take for example the three dimensional Euclidean space
R3 , which we commonly deal with. If we had the collection of vectors v1 = {2, 5, 3},
v2 = {1, 1, 1}, then the span of this collection would be the subspace of R3 consisting
of all linear combinations of these vectors which in fact defines a plane in R3 with a
normal vector n = v1 × v2 . So with these two vectors we can only describe another
vector within this plane. If we now consider the collection of vectors v1 = {1, 0, 0},
v2 = {0, 1, 0}, v3 = {0, 0, 1} then the span of this collection would be all of R3
because every vector can be written as a linear combination of these vectors. For
this reason we can say that this collection of vectors also forms a basis for this space.
So in a sense the span indicates how much of the space is available to us.
Returning now to the Krylov subspace, we can see that the collection of vectors
are formed by repeatedly multiplying the residual vector r by the matrix A. Fur-
thermore, it can be observed that at each iteration another vector is added to the
collection. If we ‘stack’ all of these column vectors together then we can form the
Krylov matrix as:
\[ K_k = \begin{bmatrix} r^0 & Ar^0 & A^2 r^0 & \cdots & A^{k-1}r^0 \end{bmatrix} \]
and we can then write our iterative update in terms of this matrix as:
\[ \phi^k = \phi^0 + K_k\alpha \qquad (3.15) \]
where α is a column vector of coefficients.
Using the column vectors of the Krylov matrix can lead to some problems how-
ever; and this is related to the fact that there’s no guarantee that these vectors will
be orthogonal. So what we then need to do is compute a set of vectors which form
a basis for this space. So instead we will write our iterative update as:
φk = φ0 + Qα (3.16)
where Q will be the matrix of orthonormal vectors which form a basis for the Krylov
subspace. Another important and related concept in the GMRES method is the
Hessenberg factorization of a matrix:
A = QHQH
where we have:
\[ Q = \begin{bmatrix} q_1 & q_2 & \cdots & q_N \end{bmatrix} \qquad H = \begin{bmatrix} h_{1,1} & h_{1,2} & h_{1,3} & \cdots & h_{1,N} \\ h_{2,1} & h_{2,2} & h_{2,3} & \cdots & h_{2,N} \\ 0 & h_{3,2} & h_{3,3} & \cdots & h_{3,N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & h_{M,N-1} & h_{M,N} \end{bmatrix} \]
Here, both Q and H are M × N square matrices and we are writing the unitary
matrix as being made up of the orthogonal column vectors qn . So any matrix A can
be written as a product of a unitary matrix Q and a Hessenberg matrix H and we
can see that in order to compute the update to our solution we will need to perform
this factorization in order to compute Q. If we multiply both sides by Q we get:
AQ = QHQH Q = QH
Since for a unitary matrix QH Q = I. Now, let’s say we only consider a part of this
system:
3.6. GENERALIZED MINIMAL RESIDUAL METHOD 81
\[ Q_n = \begin{bmatrix} q_1 & q_2 & \cdots & q_n \end{bmatrix} \qquad Q_{n+1} = \begin{bmatrix} q_1 & q_2 & \cdots & q_n & q_{n+1} \end{bmatrix} \]
\[ H_n = \begin{bmatrix} h_{1,1} & h_{1,2} & h_{1,3} & \cdots & h_{1,n} \\ h_{2,1} & h_{2,2} & h_{2,3} & \cdots & h_{2,n} \\ 0 & h_{3,2} & h_{3,3} & \cdots & h_{3,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & h_{n,n-1} & h_{n,n} \\ 0 & 0 & 0 & 0 & h_{n+1,n} \end{bmatrix} \]
so that the partial factorization becomes:
\[ AQ_n = Q_{n+1}H_n \qquad (3.17) \]
where we can solve for the column vector qn+1 and obtain a recursive equation as:
\[ q_{n+1} = \frac{Aq_n - \sum_{m=1}^{n} h_{m,n}q_m}{h_{n+1,n}} \qquad (3.18) \]
It can be observed that the method in Equation 3.18 is quite similar to the Gram-
Schmidt process that we encountered in the QR decomposition and conjugate gra-
dient methods. As it happens we are in fact going to use a modified form of the
Gram-Schmidt process known as Arnoldi iteration in order to perform the Hessen-
berg factorization of Equation 3.17. In comparison then to the process outlined in
Equation 2.1 we do something very similar here:
\[ u_1 = r^0 \qquad q_1 = \frac{u_1}{\|u_1\|_2} \]
\[ u_2 = Aq_1 - \frac{q_1^T A q_1}{q_1^T q_1}q_1 \qquad q_2 = \frac{u_2}{\|u_2\|_2} \]
\[ u_3 = Aq_2 - \frac{q_1^T A q_2}{q_1^T q_1}q_1 - \frac{q_2^T A q_2}{q_2^T q_2}q_2 \qquad q_3 = \frac{u_3}{\|u_3\|_2} \]
\[ \vdots \]
\[ u_n = Aq_{n-1} - \sum_{m=1}^{n-1}\frac{q_m^T A q_{n-1}}{q_m^T q_m}q_m \qquad q_n = \frac{u_n}{\|u_n\|_2} \qquad (3.19) \]
where it can be observed that our first orthonormal vector is based on the initial
residual and furthermore, rather than constructing the sequence of orthonormal
vectors based on the columns of A as we did with the QR decomposition, here we
are multiplying the previous vector by A, which is how we obtain the collection of
vectors in the Krylov matrix. Finally we can note that again, since the qn vectors
are orthonormal, the denominator of the summation term in Equation 3.19 will
always be equal to 1 and can be ignored. So now by comparing terms in Equations
3.18 and 3.19 we can infer that the terms on and above the main diagonal in H are given by:
\[ h_{m,n} = q_m^T A q_n \]
while the entries on the first subdiagonal are given by $h_{n+1,n} = \|u_{n+1}\|_2$.
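As a rough sketch (not the book's own implementation), one step of this process could be written in Matlab as follows, assuming that the first n orthonormal vectors are stored as the columns Q(:,1:n) and that the modified Gram-Schmidt form of the update is used:
u = A*Q(:,n);                  % multiply the previous vector by A
for m=1:n
    H(m,n) = Q(:,m)'*u;        % entries on and above the main diagonal
    u = u - H(m,n)*Q(:,m);     % subtract the projection onto each previous vector
end
H(n+1,n) = norm(u);            % subdiagonal entry
Q(:,n+1) = u/H(n+1,n);         % normalize to obtain the next orthonormal vector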
Then, making use of the partial Hessenberg factorization in Equation 3.17, we get:
\[ d = Q_{n+1}^H r^0 = \begin{bmatrix} q_1^H \\ q_2^H \\ \vdots \\ q_{n+1}^H \end{bmatrix} r^0 = \begin{bmatrix} q_1^H r^0 \\ q_2^H r^0 \\ \vdots \\ q_{n+1}^H r^0 \end{bmatrix} \]
Now since the first column vector q1 is based on the initial residual; and since all of
the subsequent qn vectors are orthogonal to q1 , they will also be orthogonal to r0
and hence d is in fact:
\[ d = \begin{bmatrix} \|r^0\|_2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \]
Now from inspection of Equation 3.21 we can see that the residual will be a
minimum when H_nα = d. But the important thing to note here is that H_n is an (n + 1) × n matrix and so we have more equations than unknowns; i.e. an
over-determined system of equations. Now, if we could somehow remove the entries
below the main diagonal of Hn we would be left with an upper triangular matrix
(with the bottom row of Hn being all zeros) and we could then efficiently solve the
system by back substitution. As it happens there is a method to achieve this, which
is known as a Givens rotation. The idea with the technique is that we can create a
Givens matrix of the form:
\[ G(m, n, \theta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & -s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix} \]
which we can see is basically the identity matrix with the entries c = cos(θ) and
s = sin(θ) added in at a particular location (although we don’t really bother to
calculate the rotation angle θ). When we multiply our matrix by the Givens matrix,
only rows m and n will be affected and we will ‘zero’ the entry m, n in our matrix.
So in our particular case, assuming that we want to zero the entry Hn+1,n , we can
most simply compute the c and s terms by:
\[ a = \sqrt{H_{n,n}^2 + H_{n+1,n}^2} \qquad c = \frac{H_{n,n}}{a} \qquad s = \frac{H_{n+1,n}}{a} \]
So we will need to create a sequence of Givens rotation matrices to ‘zero’ each entry
below the main diagonal (i.e. Gn , ...G2 G1 Hn ) before we can solve for α and of course
we must apply this same sequence of matrices to the vector d. At this point we can
then compute the entries in α using back substitution and then finally update φk
by evaluating Equation 3.21.
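As a rough sketch (assuming that H and d are stored as full arrays and that n indexes the subdiagonal entry to be zeroed), a single Givens rotation could be constructed and applied in Matlab as:
a = sqrt(H(n,n)^2 + H(n+1,n)^2);
c = H(n,n)/a;
s = H(n+1,n)/a;
G = eye(size(H,1));              % start from the identity matrix
G(n,n) = c;     G(n,n+1) = s;    % insert the rotation entries
G(n+1,n) = -s;  G(n+1,n+1) = c;
H = G*H;                         % zeros the entry H(n+1,n)
d = G*d;                         % apply the same rotation to the right hand side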
Example 3.6:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform a few
iterations of the GMRES method by hand. To begin the algorithm, let’s provide
the initial guess of φ^0 = {1, 1, 1}^T. As with the steepest descent and conjugate
gradient methods we will use the two norm as our measure of convergence, and
therefore, initially we have:
\[ r^0 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
with $\|r^0\|_2 = (1^2 + (-12)^2 + 0^2)^{\frac{1}{2}} = 12.0416$ and we can set d_1 = 12.0416. We can
then set:
\[ u_1 = r^0 = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
\[ q_1 = \frac{u_1}{\|u_1\|_2} = \frac{\{1, -12, 0\}^T}{(1^2 + (-12)^2 + 0^2)^{\frac{1}{2}}} = \begin{bmatrix} 0.0830 \\ -0.9965 \\ 0.0000 \end{bmatrix} \]
and compute:
\[ H_{1,1} = q_1^T A q_1 = \begin{bmatrix} 0.0830 & -0.9965 & 0.0000 \end{bmatrix}\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 0.0830 \\ -0.9965 \\ 0.0000 \end{bmatrix} = 3.8207 \]
So then we have:
Q = \begin{bmatrix} 0.0830 & 0 & 0 & 0 \\ -0.9965 & 0 & 0 & 0 \\ 0.0000 & 0 & 0 & 0 \end{bmatrix} \qquad H = \begin{bmatrix} 3.8207 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
We then proceed to compute the second orthonormal vector as:
u_2 = Aq_1 - H_{1,1}q_1 = \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix} \begin{bmatrix} 0.0830 \\ -0.9965 \\ 0.0000 \end{bmatrix} - 3.8207 \begin{bmatrix} 0.0830 \\ -0.9965 \\ 0.0000 \end{bmatrix} = \begin{bmatrix} -1.1477 \\ -0.0956 \\ -2.0761 \end{bmatrix}

q_2 = \frac{u_2}{\|u_2\|_2} = \begin{bmatrix} -0.4834 \\ -0.0403 \\ -0.8745 \end{bmatrix}
H_{2,1} = \|u_2\|_2 = \sqrt{(-1.1477)^2 + (-0.0956)^2 + (-2.0761)^2} = 2.3742

H_{1,2} = q_1^T A q_2 = \begin{bmatrix} 0.0830 & -0.9965 & 0.0000 \end{bmatrix} A \begin{bmatrix} -0.4834 \\ -0.0403 \\ -0.8745 \end{bmatrix} = 2.3742

H_{2,2} = q_2^T A q_2 = \begin{bmatrix} -0.4834 & -0.0403 & -0.8745 \end{bmatrix} A \begin{bmatrix} -0.4834 \\ -0.0403 \\ -0.8745 \end{bmatrix} = 4.3963
So then we have:
Q = \begin{bmatrix} 0.0830 & -0.4834 & 0 & 0 \\ -0.9965 & -0.0403 & 0 & 0 \\ 0.0000 & -0.8745 & 0 & 0 \end{bmatrix} \qquad H = \begin{bmatrix} 3.8207 & 2.3742 & 0 \\ 2.3742 & 4.3963 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
We then proceed to compute the third orthonormal vector as:
u_3 = Aq_2 - H_{1,2}q_1 - H_{2,2}q_2 = \begin{bmatrix} 1.7955 \\ 0.1496 \\ -0.9995 \end{bmatrix} \qquad q_3 = \frac{u_3}{\|u_3\|_2} = \begin{bmatrix} 0.8714 \\ 0.0726 \\ -0.4851 \end{bmatrix}

H_{3,2} = \|u_3\|_2 = \sqrt{1.7955^2 + 0.1496^2 + (-0.9995)^2} = 2.0603

H_{1,3} = q_1^T A q_3 = \begin{bmatrix} 0.0830 & -0.9965 & 0.0000 \end{bmatrix} A \begin{bmatrix} 0.8714 \\ 0.0726 \\ -0.4851 \end{bmatrix} = 0.0000

H_{2,3} = q_2^T A q_3 = \begin{bmatrix} -0.4834 & -0.0403 & -0.8745 \end{bmatrix} A \begin{bmatrix} 0.8714 \\ 0.0726 \\ -0.4851 \end{bmatrix} = 2.0603

H_{3,3} = q_3^T A q_3 = \begin{bmatrix} 0.8714 & 0.0726 & -0.4851 \end{bmatrix} A \begin{bmatrix} 0.8714 \\ 0.0726 \\ -0.4851 \end{bmatrix} = 3.7830
So then we have:
Q = \begin{bmatrix} 0.0830 & -0.4834 & 0.8714 & 0 \\ -0.9965 & -0.0403 & 0.0726 & 0 \\ 0.0000 & -0.8745 & -0.4851 & 0 \end{bmatrix} \qquad H = \begin{bmatrix} 3.8207 & 2.3742 & 0.0000 \\ 2.3742 & 4.3963 & 2.0603 \\ 0 & 2.0603 & 3.7830 \\ 0 & 0 & 0 \end{bmatrix}
H_{4,3} = \|u_4\|_2 = \sqrt{(-0.0888)^2 + 0.1554^2 + 0.0000^2} \times 10^{-14} = 0.0000
So then we have:
Q = \begin{bmatrix} 0.0830 & -0.4834 & 0.8714 & -0.4961 \\ -0.9965 & -0.0403 & 0.0726 & 0.8682 \\ 0.0000 & -0.8745 & -0.4851 & 0.0000 \end{bmatrix} \qquad H = \begin{bmatrix} 3.8207 & 2.3742 & 0.0000 \\ 2.3742 & 4.3963 & 2.0603 \\ 0 & 2.0603 & 3.7830 \\ 0 & 0 & 0.0000 \end{bmatrix}
At this point we have our upper Hessenberg matrix and in order to perform the
least squares minimization we need to ‘zero’ the entries below the main diagonal.
Starting with entry H2,1 , we compute:
a_1 = \sqrt{(H_{1,1})^2 + (H_{2,1})^2} = \sqrt{3.8207^2 + 2.3742^2} = 4.4983

c_1 = \frac{H_{1,1}}{a_1} = \frac{3.8207}{4.4983} = 0.8494

s_1 = \frac{H_{2,1}}{a_1} = \frac{2.3742}{4.4983} = 0.5278
And then multiply H and d by the first Givens rotation matrix to get:
H^1 = G_1 H = \begin{bmatrix} 0.8494 & 0.5278 & 0 & 0 \\ -0.5278 & 0.8494 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 3.8207 & 2.3742 & 0.0000 \\ 2.3742 & 4.3963 & 2.0603 \\ 0 & 2.0603 & 3.7830 \\ 0 & 0 & 0.0000 \end{bmatrix} = \begin{bmatrix} 4.4983 & 4.3369 & 1.0874 \\ 0 & 2.4810 & 1.7500 \\ 0 & 2.0603 & 3.7830 \\ 0 & 0 & 0.0000 \end{bmatrix}
and:
d^1 = G_1 d = \begin{bmatrix} 0.8494 & 0.5278 & 0 & 0 \\ -0.5278 & 0.8494 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 12.0416 \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 10.2277 \\ -6.3556 \\ 0 \\ 0 \end{bmatrix}
We then compute:
a_2 = \sqrt{(H^1_{2,2})^2 + (H^1_{3,2})^2} = \sqrt{2.4810^2 + 2.0603^2} = 3.2249

c_2 = \frac{H^1_{2,2}}{a_2} = \frac{2.4810}{3.2249} = 0.7693

s_2 = \frac{H^1_{3,2}}{a_2} = \frac{2.0603}{3.2249} = 0.6389
And then multiply H 1 and d1 by the second Givens rotation matrix to get:
H^2 = G_2 H^1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0.7693 & 0.6389 & 0 \\ 0 & -0.6389 & 0.7693 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 4.4983 & 4.3369 & 1.0874 \\ 0 & 2.4810 & 1.7500 \\ 0 & 2.0603 & 3.7830 \\ 0 & 0 & 0.0000 \end{bmatrix} = \begin{bmatrix} 4.4983 & 4.3369 & 1.0874 \\ 0 & 3.2249 & 3.7631 \\ 0 & 0 & 1.7923 \\ 0 & 0 & 0 \end{bmatrix}
and:
d^2 = G_2 d^1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0.7693 & 0.6389 & 0 \\ 0 & -0.6389 & 0.7693 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 10.2277 \\ -6.3556 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 10.2277 \\ -4.8894 \\ 4.0604 \\ 0 \end{bmatrix}
Now for this particular example, H_{4,3} is already zero and so we don't need to perform another Givens rotation. At this point we can then perform the back substitution on H^2 α = d^2 as:
\alpha_3 = \frac{4.0604}{1.7923} = 2.2655

\alpha_2 = \frac{-4.8894 - 3.7631 \times 2.2655}{3.2249} = -4.1597

\alpha_1 = \frac{10.2277 - 1.0874 \times 2.2655 - 4.3369 \times (-4.1597)}{4.4983} = 5.7365
\phi^1_1 = 1.0000 + \begin{bmatrix} 0.0830 & -0.4834 & 0.8714 \end{bmatrix} \begin{bmatrix} 5.7365 \\ -4.1597 \\ 2.2655 \end{bmatrix} = 5.4615

\phi^1_2 = 1.0000 + \begin{bmatrix} -0.9965 & -0.0403 & 0.0726 \end{bmatrix} \begin{bmatrix} 5.7365 \\ -4.1597 \\ 2.2655 \end{bmatrix} = -4.3846

\phi^1_3 = 1.0000 + \begin{bmatrix} 0.0000 & -0.8745 & -0.4851 \end{bmatrix} \begin{bmatrix} 5.7365 \\ -4.1597 \\ 2.2655 \end{bmatrix} = 3.5385
This is the exact solution computed with all of the other methods studied. As with the conjugate gradient method, for a 3 × 3 system the solution will be computed in 3 iterations. For larger systems however we will find that we can usually achieve convergence in a number of iterations that is much smaller than the size of the system. Another important point is that, similar to the direct methods that we studied in Chapter 2, GMRES requires that we store Q_{n+1} and H_n. While we could potentially store an upper Hessenberg matrix in a way such that we don't need to store the zeros below its first subdiagonal (note that using the compressed row storage format wouldn't help here, because the matrix is more than half full and the additional row and column data would in fact require more storage), Q_{n+1} will be completely full and so there is no opportunity for saving on storage there.
For very large systems then, the GMRES method can require a prohibitive amount
of storage. What is usually done to circumvent this problem is to have ‘restarts’;
that is we will pick a size n < M and apply the method as usual, then start the whole
thing again from scratch, the only difference being that we will have a better initial
guess for φ0 . In order to create a Matlab program to perform the GMRES algorithm
we will therefore include an iterative while loop which will apply the restarts. Since
we are always dealing with square matrices, we need to know M , but the variable N
is redundant, so let’s therefore change our interpretation of this variable to mean the
size of Qn+1 and Hn that we are prepared to store. Then, initially within our while
loop we will begin by assigning q1 and d1 and then perform the Arnoldi iteration
as:
u_norm = norm(u);
H(n+1,n) = u_norm;
Q(:,n+1) = u/u_norm;
end
...
end
At this point, the next step is to apply the Givens rotations to Hn and d in order to
bring the upper Hessenberg matrix into an upper triangular form, which is achieved
as:
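A minimal sketch of this step is given below, assuming the variables H (the (n+1) × n upper Hessenberg matrix), d, and N from the discussion above; the corresponding code in the complete program may differ in its details:

for n = 1:N
    % Compute the Givens rotation that zeros the subdiagonal entry H(n+1,n)
    a = sqrt(H(n,n)^2 + H(n+1,n)^2);
    c = H(n,n)/a;
    s = H(n+1,n)/a;
    % Only rows n and n+1 of H are affected by the rotation
    temp     =  c*H(n,:) + s*H(n+1,:);
    H(n+1,:) = -s*H(n,:) + c*H(n+1,:);
    H(n,:)   =  temp;
    % Apply the same rotation to d
    temp     =  c*d(n) + s*d(n+1);
    d(n+1)   = -s*d(n) + c*d(n+1);
    d(n)     =  temp;
end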
Turning now to the C++ implementation, we can note that Hn contains zeros below its first subdiagonal and so we could save on storage that way. In fact what we are going to do is store Hn as two 1D arrays; one
storing the upper triangular part and one storing the part below the main diagonal.
For the upper triangular part, imagine taking all of the nonzero entries of each row,
and ‘butting’ them up so that they form a single 1D array. In this case the array
will be of size N (N + 1)/2. The catch with this idea however is that if we’re storing
this part of Hn in a 1D array we won’t be able to access entries with the [m][n]
indexing anymore. In fact what we will do is ‘map’ from this indexing to a single
linear index, which will be achieved by:
\mathrm{index}(m, n) = \frac{m(2N - m - 1)}{2} + n

remembering that this is for 'zero-based' indexing (for example, with N = 4 the entry (m, n) = (1, 2) maps to index 1(2 \times 4 - 1 - 1)/2 + 2 = 5). For the part below the main
diagonal, our 1D array will simply be of size N and will require no special mapping
to access elements. This approach is a good example illustrating the separation of
the mathematical notion of a matrix, from the way it is stored in computer memory
(i.e. here we’re storing a matrix, just not with a 2D array). Our C++ program will
follow the same procedure as the Matlab implementation:
while(r_norm>tolerance && k<N_k)
{
    for(int m=0; m<N_row; m++)
    {
        Q[m][0] = r[m]/r_norm;
    }
    for(n=0; n<N_col+1; n++)
    {
        d[n] = 0.0;
    }
    d[0] = r_norm;
    for(n=0; n<N_col; n++)
    {
        // Compute u = A*q_n
        for(int m=0; m<N_row; m++)
        {
            AQn = 0.0;
            for(int o=0; o<N_row; o++)
            {
                AQn += A[m][o]*Q[o][n];
            }
            u[m] = AQn;
        }
        // Compute the entries on and above the main diagonal of H_n
        for(int m=0; m<=n; m++)
        {
            uTQm = 0.0;
            for(int o=0; o<N_row; o++)
            {
                uTQm += u[o]*Q[o][m];
            }
            index1 = (2*N_col-m-1)*(m)/2+n;
            H_u[index1] = uTQm;
Where the major difference is the additional for loops required for performing the
matrix-vector multiplication and vector inner products. As can be observed in this
code ‘snippet’ we are storing the upper triangular part of Hn in a 1D array H_u
and the part below the main diagonal in the 1D array H_l. At this point we have
the completely assembled Qn+1 and Hn and of course the next step is to apply the
sequence of Givens rotation matrices to Hn and d in order to bring the system into
upper triangular form.
The more observant readers may have noticed that when a matrix is multiplied by a Givens rotation matrix, only two rows are modified by the matrix-matrix multiplication. This gives us another opportunity to save on storage and to reduce the number of computations required to perform the multiplication. For each multiplication by a Givens rotation matrix, only rows n and n + 1 are affected and we can compute the entries in the o-th column of these two rows as:
H^1_{n,o} = cH_{n,o} + sH_{n+1,o}

H^1_{n+1,o} = -sH_{n,o} + cH_{n+1,o}
{
    ...
    for(n=0; n<N_col; n++)
    {
        // Compute the Givens rotation that zeros the subdiagonal entry H(n+1,n)
        index1 = n*(2*N_col-n-1)/2+n;
        a = sqrt(H_u[index1]*H_u[index1] + H_l[n]*H_l[n]);
        c = H_u[index1]/a;
        s = H_l[n]     /a;
        H_u[index1] = c*H_u[index1] + s*H_l[n];
        H_l[n] = 0.0;
        // Only rows n and n+1 are affected by the rotation
        for(int o=n+1; o<N_col; o++)
        {
            index1 = n    *(2*N_col-n-1)/2+o;
            index2 = (n+1)*(2*N_col-n-2)/2+o;
            tempH_u     =  c*H_u[index1] + s*H_u[index2];
            H_u[index2] = -s*H_u[index1] + c*H_u[index2];
            H_u[index1] = tempH_u;
        }
        // Apply the same rotation to the vector d
        tempd  =  c*d[n] + s*d[n+1];
        d[n+1] = -s*d[n] + c*d[n+1];
        d[n]   =  tempd;
    }
    ...
}
At this point all that remains is to perform the back substitution for α and then to update φ^k. This will be done in pretty much the same way as in the Matlab implementation, the only significant difference being the indexing into the upper triangular part of Hn and the extra work involved in performing the matrix-vector multiplications:
while(r_norm>tolerance && k<N_k)
{
    ...
    // Back substitution for alpha
    index1 = (N_col-1)*(2*N_col-N_col)/2+(N_col-1);
    alpha[N_col-1] = d[N_col-1]/H_u[index1];
    for(int m=N_col-2; m>=0; m--)
    {
        Halpha = 0.0;
        for(int o=m+1; o<N_col; o++)
        {
            index1 = m*(2*N_col-m-1)/2+o;
            Halpha += H_u[index1]*alpha[o];
        }
        index1 = m*(2*N_col-m-1)/2+m;
        alpha[m] = (d[m] - Halpha) / H_u[index1];
    }
    // Update the solution phi = phi + Q_n*alpha
    for(int m=0; m<N_row; m++)
    {
        for(int o=0; o<N_col; o++)
        {
            phi[m] += Q[m][o]*alpha[o];
        }
    }
    // Recompute the residual and its norm
    r_norm = 0.0;
    for(int m=0; m<N_row; m++)
    {
        Aphim = 0.0;
        for(int o=0; o<N_row; o++)
        {
            Aphim += A[m][o]*phi[o];
        }
        r[m] = b[m] - Aphim;
        r_norm += r[m]*r[m];
    }
    r_norm = sqrt(r_norm);
    k += N_col;
}
One of the important advantages of GMRES is that it can be used to solve a system with any real square coefficient matrix, and this is of practical importance in a number of applications where the numerical methods result in asymmetric matrices that couldn't be solved by, say, the conjugate gradient method (although they could be solved by one of its variants). Another desirable feature of GMRES is that the residual norm never increases from one iteration to the next, as opposed to, say, the steepest descent or conjugate gradient methods where the residual norm may increase every now and then. As a
final note before continuing on, we can compute the solution to a linear system with
the GMRES method using the Matlab function gmres.
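For the 3 × 3 example system above, such a call might look something like the following sketch (the restart length, tolerance, and maximum number of outer iterations shown are just illustrative values):

A = [2 1 -1; 1 4 2; -1 2 6];
b = [3; -5; 7];
% gmres(A, b, restart, tol, maxit)
phi = gmres(A, b, 3, 1e-8, 10);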
Chapter 4

Nonlinear and Coupled Systems

Up until this point we have been dealing with linear systems of equations, which we have written in the form:

Aφ = b

If we extend to the more general nonlinear case then we will write our system as:
f (φ) = 0 (4.1)
f(\phi + \Delta\phi) = f(\phi) + \frac{\Delta\phi}{1!}\left.\frac{df}{d\phi}\right|_{\phi} + \frac{\Delta\phi^2}{2!}\left.\frac{d^2f}{d\phi^2}\right|_{\phi} + O(\Delta\phi^3)

If we extend this to a vector of functions:

f(\phi + \Delta\phi) = f(\phi) + \frac{\Delta\phi}{1!}J + \frac{\Delta\phi^2}{2!}H + O(\Delta\phi^3)
where:

J = \begin{bmatrix} \frac{\partial f_1}{\partial \phi_1} & \frac{\partial f_1}{\partial \phi_2} & \cdots & \frac{\partial f_1}{\partial \phi_M} \\ \frac{\partial f_2}{\partial \phi_1} & \frac{\partial f_2}{\partial \phi_2} & \cdots & \frac{\partial f_2}{\partial \phi_M} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_M}{\partial \phi_1} & \frac{\partial f_M}{\partial \phi_2} & \cdots & \frac{\partial f_M}{\partial \phi_M} \end{bmatrix}

is known as the Jacobian matrix and H, the corresponding array of second derivatives, is known as the Hessian. If we 'ignore' the second order terms and higher, and require that f(φ + ∆φ) = 0, we get the approximate relation:
∆φ = −J −1 f
So to update our solution, we are effectively solving the linear system:
J∆φ = −f
after which we update our estimate of the solution as:

φ^{k+1} = φ^k + ∆φ
When we were dealing with a linear system, we defined the residual as rk =
b − Aφk . When we’re dealing with a nonlinear system then the residual vector will
in fact be the vector of functions evaluated at the current guess for φ:
rk = f (φk )
And since we need to compute this in order to perform the outer iterations, this
makes computing the residual even easier than in the case of a linear system.
Example 4.1:
In this example we will develop a Matlab program to solve the example nonlinear
system:
f(\phi) = \begin{bmatrix} \phi_1 + \phi_2 - \phi_1\phi_2 + 2 \\ \phi_1 e^{-\phi_2} - 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
Clearly this system is nonlinear due to the φ1 φ2 and the φ1 e−φ2 terms and so we
can’t write it in the form Aφ = b. Before we start writing any code however, let’s
work through and perform a few iterations of Newton’s method by hand. To begin
the algorithm, let’s provide the initial guess of of φ0 = {0, 0}T . We will also assume
that we are using the infinity norm as our measure of convergence, and therefore,
initially we have:
r^0 = f(\phi^0) = \begin{bmatrix} 0 + 0 - 0 \times 0 + 2 \\ 0 \times e^{-0} - 1 \end{bmatrix} = \begin{bmatrix} 2 \\ -1 \end{bmatrix}
with \|r^0\|_\infty = 2. We will also define the Jacobian as:

J = \begin{bmatrix} \frac{\partial f_1}{\partial \phi_1} & \frac{\partial f_1}{\partial \phi_2} \\ \frac{\partial f_2}{\partial \phi_1} & \frac{\partial f_2}{\partial \phi_2} \end{bmatrix} = \begin{bmatrix} 1 - \phi_2 & 1 - \phi_1 \\ e^{-\phi_2} & -\phi_1 e^{-\phi_2} \end{bmatrix}
We can then solve for ∆φ as:

\Delta\phi = -J^{-1}f = -\begin{bmatrix} 1-0 & 1-0 \\ e^{0} & -0e^{0} \end{bmatrix}^{-1} \begin{bmatrix} 0+0-0\times 0+2 \\ 0\times e^{-0}-1 \end{bmatrix} = -\begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{-1} \begin{bmatrix} 2 \\ -1 \end{bmatrix} = \begin{bmatrix} 1 \\ -3 \end{bmatrix}

and update \phi^1 = \phi^0 + \Delta\phi = \{1, -3\}^T. Evaluating the residual at this new estimate gives r^1 = f(\phi^1) = \{3.0000, 19.0855\}^T with \|r^1\|_\infty = 19.0855. We then reevaluate the Jacobian and solve for \Delta\phi as:

\Delta\phi = -J^{-1}f = -\begin{bmatrix} 1-(-3) & 1-1 \\ e^{-(-3)} & -1\times e^{-(-3)} \end{bmatrix}^{-1} \begin{bmatrix} 3.0000 \\ 19.0855 \end{bmatrix} = -\begin{bmatrix} 4.0000 & 0 \\ 20.0855 & -20.0855 \end{bmatrix}^{-1} \begin{bmatrix} 3.0000 \\ 19.0855 \end{bmatrix} = \begin{bmatrix} -0.7500 \\ 0.2002 \end{bmatrix}

and update \phi^2 = \phi^1 + \Delta\phi = \{0.2500, -2.7998\}^T, giving:

r^2 = f(\phi^2) = \begin{bmatrix} 0.2500 + (-2.7998) - 0.2500\times(-2.7998) + 2 \\ 0.2500\times e^{-(-2.7998)} - 1 \end{bmatrix} = \begin{bmatrix} 0.1502 \\ 3.1103 \end{bmatrix}
with \|r^2\|_\infty = 3.1103. We again reevaluate the Jacobian and solve for \Delta\phi as:
\Delta\phi = -J^{-1}f = -\begin{bmatrix} 1-(-2.7998) & 1-0.2500 \\ e^{-(-2.7998)} & -0.2500\times e^{-(-2.7998)} \end{bmatrix}^{-1} \begin{bmatrix} 0.1502 \\ 3.1103 \end{bmatrix} = -\begin{bmatrix} 3.7998 & 0.7500 \\ 16.4411 & -4.1103 \end{bmatrix}^{-1} \begin{bmatrix} 0.1502 \\ 3.1103 \end{bmatrix} = \begin{bmatrix} -0.1055 \\ 0.3345 \end{bmatrix}

and update \phi^3 = \phi^2 + \Delta\phi = \{0.1445, -2.4653\}^T, giving:

r^3 = f(\phi^3) = \begin{bmatrix} 0.1445 + (-2.4653) - 0.1445\times(-2.4653) + 2 \\ 0.1445\times e^{-(-2.4653)} - 1 \end{bmatrix} = \begin{bmatrix} 0.0353 \\ 0.6997 \end{bmatrix}

with \|r^3\|_\infty = 0.6997. Repeating this procedure for 7 more iterations we finally
converge on the solution of φ = {0.0978, −2.3251}T . In order to create a Matlab
program to perform Newton’s method we will first note that since the purpose of
the example is to see how we solve a nonlinear system, we won’t pay any attention
to the solution of the linearized system and will therefore simply use the backslash
operator. Of course in principle almost any of the methods presented in this part of
the book could be used in its place. As such, the entire program will take the form:
phi = zeros(2,1);
f = [phi(1)+phi(2)-phi(1)*phi(2)+2; phi(1)*exp(-phi(2))-1];
k = 0;
r = f;
r_norm = max(abs(r));
tolerance = 1e-8;
N_k = 1e3;
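The remainder of the program, the iterative loop itself, might then be sketched as follows using the variables defined above (the complete listing in Example4_1.m may differ in its details):

while r_norm > tolerance && k < N_k
    % Evaluate the Jacobian at the current guess
    J = [1-phi(2)       1-phi(1);
         exp(-phi(2))  -phi(1)*exp(-phi(2))];
    % Solve the linearised system and update the solution
    Delta_phi = -J\f;
    phi = phi + Delta_phi;
    % Re-evaluate the residual and its norm
    f = [phi(1)+phi(2)-phi(1)*phi(2)+2; phi(1)*exp(-phi(2))-1];
    r = f;
    r_norm = max(abs(r));
    k = k + 1;
end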
An observation that can be made is that the above program is a little ‘verbose’
in that we have the column vectors f and r, both of which store the same thing.
Furthermore, we could just update as phi=phi-J\f. Since the point of this example
however was to demonstrate the method, and also, memory isn’t a concern for such
a small system, we don’t really care. The important point however is that you can
‘spot’ some of these inefficiencies so that when writing your own programs you can
make them more concise and efficient.
The complete program is given in Example4_1.m.
As a final note before continuing on, we can compute the solution to a nonlinear
system using the Matlab function fsolve. In such a case we present the specific
system to it as a function handle, which is how we avoid having the nonlinear system
‘hard coded’ into the program as was done in Example 4.1.
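For the system in Example 4.1 this might look something like the following sketch (fsolve is provided by Matlab's Optimization Toolbox; the zero vector is just the same initial guess used in the example):

% Present the nonlinear system to fsolve as a function handle
f = @(phi) [phi(1) + phi(2) - phi(1)*phi(2) + 2;
            phi(1)*exp(-phi(2)) - 1];
phi = fsolve(f, [0; 0]);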
Recall that for a single nonlinear equation, Newton's method takes the form:

\phi^{k+1} = \phi^k - \frac{f(\phi^k)}{f'(\phi^k)}

where f'(\phi^k) is equivalent to J and hence f'(\phi^k)^{-1} is equivalent to J^{-1} in this case.
The basic idea behind Broyden’s method is that we provide an initial estimate of the
Jacobian at the start of the algorithm (along with our initial guess for φ) and then
update that estimate as the iterations proceed, rather than recomputing it from scratch. To see where the update comes from, recall the Secant method for a single equation:

\phi^{k+1} = \phi^k - f(\phi^k)\frac{\phi^k - \phi^{k-1}}{f(\phi^k) - f(\phi^{k-1})}

where here it can be observed that we've essentially replaced the derivative in Newton's method by a 'difference expression':

f'(\phi^k) \approx \frac{f(\phi^k) - f(\phi^{k-1})}{\phi^k - \phi^{k-1}}
Broyden’s method is essentially a generalization of the Secant method in multiple
dimensions, where we replace the first derivative by a ‘difference expression’, which
can be written as:
J^k = J^{k-1} + \frac{\Delta f^k - J^{k-1}\Delta\phi^k}{\|\Delta\phi^k\|_2^2}(\Delta\phi^k)^T
and then proceeds similar to Newton’s Method:
φk+1 = φk − ∆φk
Now, another thing we could do with the method is rather than store an approx-
imation to the Jacobian, store an approximation to the inverse of the Jacobian and
perform our updates on that instead. This would mean that we wouldn’t have to
solve a linear system at each outer iteration. In order to achieve this we can make
use of the Sherman-Morrison formula [49]:

(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1}u}

which expresses the inverse of a matrix modified by a rank-one update in terms of the inverse of the original matrix, so that we never have to explicitly invert the new matrix. Now if we store an approximation for the inverse of the Jacobian, then we can update using the Sherman-Morrison formula as:

(J^k)^{-1} = (J^{k-1})^{-1} + \frac{\Delta\phi^k - (J^{k-1})^{-1}\Delta f^k}{(\Delta\phi^k)^T (J^{k-1})^{-1}\Delta f^k}(\Delta\phi^k)^T (J^{k-1})^{-1}
Example 4.2:
In this example we will develop a Matlab program to solve the example nonlinear
system:
f(\phi) = \begin{bmatrix} \phi_1 + \phi_2 - \phi_1\phi_2 + 2 \\ \phi_1 e^{-\phi_2} - 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
Before we start writing any code however, let's work through and perform a few iterations of Broyden's method by hand. In this example we are going to store an approximation to the inverse of the Jacobian. To begin the algorithm we will provide the initial guess of φ^0 = {1, −1}^T. Now, normally our initial guesses have been quite simple, whereas here we are making use of the knowledge of the exact solution from Example 4.1 to provide an initial guess slightly closer to it. The reason for this is that since we will be 'guessing' the Jacobian, there's a good chance that our method won't converge if the initial guess for the solution and the initial guess for the inverse of the Jacobian are both 'way off'. So this approach will allow us to use quite a simple approximation for (J^0)^{-1} (which is one of the intended learning outcomes for this example), while still having an algorithm which converges. We will also assume that we are using the infinity norm
as our measure of convergence, and therefore, initially we have:
r^0 = f(\phi^0) = \begin{bmatrix} 1 + (-1) - 1\times(-1) + 2 \\ 1\times e^{-(-1)} - 1 \end{bmatrix} = \begin{bmatrix} 3.0000 \\ 1.7183 \end{bmatrix}

with \|r^0\|_\infty = 3.0000. We will also define our approximation for the inverse of the Jacobian as:

(J^0)^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
We can then compute \Delta\phi as:

\Delta\phi = -(J^0)^{-1}f = -\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 3.0000 \\ 1.7183 \end{bmatrix} = \begin{bmatrix} -3.0000 \\ -1.7183 \end{bmatrix}
with \|\Delta\phi\|_2 = \sqrt{(-3.0000)^2 + (-1.7183)^2} = 3.4572 and update \phi^1 = \phi^0 + \Delta\phi = \{-2.0000, -2.7183\}^T, giving:

r^1 = f(\phi^1) = \begin{bmatrix} (-2.0000) + (-2.7183) - (-2.0000)\times(-2.7183) + 2 \\ (-2.0000)\times e^{-(-2.7183)} - 1 \end{bmatrix} = \begin{bmatrix} -8.1548 \\ -31.3085 \end{bmatrix}
(J^1)^{-1} = (J^0)^{-1} + \frac{\Delta\phi - (J^0)^{-1}\Delta f}{\Delta\phi^T (J^0)^{-1}\Delta f}\,\Delta\phi^T (J^0)^{-1}

= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} + \frac{\begin{bmatrix} -3.0000 \\ -1.7183 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} -11.1548 \\ -33.0268 \end{bmatrix}}{\begin{bmatrix} -3.0000 & -1.7183 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} -11.1548 \\ -33.0268 \end{bmatrix}}\begin{bmatrix} -3.0000 & -1.7183 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}

= \begin{bmatrix} 0.7288 & -0.1553 \\ -1.0411 & 0.4037 \end{bmatrix}
\Delta\phi = -(J^1)^{-1}f = -\begin{bmatrix} 0.7288 & -0.1553 \\ -1.0411 & 0.4037 \end{bmatrix}\begin{bmatrix} -8.1548 \\ -31.3085 \end{bmatrix} = \begin{bmatrix} 1.0804 \\ 4.1481 \end{bmatrix}
with \|\Delta\phi\|_2 = \sqrt{1.0804^2 + 4.1481^2} = 4.2865 and update \phi^2 = \phi^1 + \Delta\phi = \{-0.9196, 1.4298\}^T, giving:

r^2 = f(\phi^2) = \begin{bmatrix} (-0.9196) + 1.4298 - (-0.9196)\times 1.4298 + 2 \\ (-0.9196)\times e^{-1.4298} - 1 \end{bmatrix} = \begin{bmatrix} 3.8250 \\ -1.2201 \end{bmatrix}
(J^2)^{-1} = (J^1)^{-1} + \frac{\Delta\phi - (J^1)^{-1}\Delta f}{\Delta\phi^T (J^1)^{-1}\Delta f}\,\Delta\phi^T (J^1)^{-1}

= \begin{bmatrix} 0.7288 & -0.1553 \\ -1.0411 & 0.4037 \end{bmatrix} + \frac{\begin{bmatrix} 1.0804 \\ 4.1481 \end{bmatrix} - \begin{bmatrix} 0.7288 & -0.1553 \\ -1.0411 & 0.4037 \end{bmatrix}\begin{bmatrix} 11.9799 \\ 30.0884 \end{bmatrix}}{\begin{bmatrix} 1.0804 & 4.1481 \end{bmatrix}\begin{bmatrix} 0.7288 & -0.1553 \\ -1.0411 & 0.4037 \end{bmatrix}\begin{bmatrix} 11.9799 \\ 30.0884 \end{bmatrix}}\begin{bmatrix} 1.0804 & 4.1481 \end{bmatrix}\begin{bmatrix} 0.7288 & -0.1553 \\ -1.0411 & 0.4037 \end{bmatrix}

= \begin{bmatrix} 4.2006 & -1.6366 \\ -6.2593 & 2.6301 \end{bmatrix}
\Delta\phi = -(J^2)^{-1}f = -\begin{bmatrix} 4.2006 & -1.6366 \\ -6.2593 & 2.6301 \end{bmatrix}\begin{bmatrix} 3.8250 \\ -1.2201 \end{bmatrix} = \begin{bmatrix} -18.0642 \\ 27.1511 \end{bmatrix}
with \|\Delta\phi\|_2 = \sqrt{(-18.0642)^2 + 27.1511^2} = 32.6613 and update \phi^3 = \phi^2 + \Delta\phi = \{-18.9838, 28.5809\}^T.
An observation that can be made with this method is that the residuals tend to 'jump around' initially before finally converging rapidly on the solution. Of course, the better our initial guesses for the solution and the inverse of the Jacobian, the fewer iterations are likely to be required to converge on a solution. Furthermore, if these guesses are too poor we may not converge at all.
The complete program is given in Example4_2.m.
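A minimal sketch of the kind of iteration such a program performs is given below, storing the approximate inverse Jacobian in a variable J_inv (a name chosen here just for illustration); the complete listing may of course differ in its details:

phi = [1; -1];                 % initial guess for the solution
J_inv = eye(2);                % initial guess for the inverse of the Jacobian
f = [phi(1)+phi(2)-phi(1)*phi(2)+2; phi(1)*exp(-phi(2))-1];
k = 0; tolerance = 1e-8; N_k = 1e3;
while max(abs(f)) > tolerance && k < N_k
    Delta_phi = -J_inv*f;
    phi = phi + Delta_phi;
    f_new = [phi(1)+phi(2)-phi(1)*phi(2)+2; phi(1)*exp(-phi(2))-1];
    Delta_f = f_new - f;
    % Sherman-Morrison update of the approximate inverse Jacobian
    J_inv = J_inv + ((Delta_phi - J_inv*Delta_f)/(Delta_phi'*J_inv*Delta_f))*(Delta_phi'*J_inv);
    f = f_new;
    k = k + 1;
end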
Having now looked at a couple of methods for solving nonlinear systems we will
finish up this part of the book by introducing a number of concepts that will become
important later on.
Often, we will be able to write our resulting nonlinear system of equations in the
form:
A(φ)φ = b
where we are emphasizing that the matrix A contains entries that involve the de-
pendent variable φ in some way. So obviously this is not a linear system; but one
approach to solving a system of this form is known as Fixed point iteration. The
idea here is that we evaluate the terms in A using the previous estimate of φ, so
that we have:
A(\phi^k)\phi^{k+1} = b

so that at each outer iteration we assemble A using the previous estimate \phi^k and then solve a linear system for \phi^{k+1}; this approach is also referred to as Picard iteration.
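As a sketch of what this looks like in practice (assembleA is just a placeholder name here for whatever routine builds the coefficient matrix from the current estimate, and phi_0, b, N_k, and tolerance are assumed to be defined):

phi = phi_0;                        % initial guess
for k = 1:N_k
    A = assembleA(phi);             % evaluate the coefficients using the previous estimate
    phi_new = A\b;                  % solve the resulting linear system
    if max(abs(phi_new - phi)) < tolerance
        phi = phi_new;
        break
    end
    phi = phi_new;
end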
Often we will also encounter problems involving two coupled systems of the form:

A_1(\phi, \psi)\phi = b_1

A_2(\phi, \psi)\psi = b_2
where φ and ψ are both column vectors containing the solution to their respective
system of equations, but the important point is that the coefficient matrices may
depend upon both vectors. In fact the systems could perhaps be more generally
written in the form f (φ, ψ) = 0, but if instead the systems happened to be linear,
then we could trivially combine them into one large system:
A \begin{bmatrix} \phi \\ \psi \end{bmatrix} = b
which we would call a coupled (or simultaneous) solution. Another approach we might take however, is to hold one of the vectors constant and solve for the other, then switch. This would be known as a segregated solution. This also ties in quite nicely with the Picard iteration approach, in that for our segregated solution we can 'assemble' the matrices using data from a previous iteration:
A1 (φk , ψ k )φk+1 = b1
A2 (φk+1 ,ψ k )ψ k+1 = b2
where it can be observed that we solve the first system using the previous estimates φ^k and ψ^k, but as soon as we have solved for φ^{k+1} we use that information to assemble A_2 before solving for ψ^{k+1}. One important difference between the coupled and segregated solvers is that if we have one 'giant' matrix A, then this may require significantly more memory to store, whereas with a segregated solver we may be able to reuse the memory used to store A_1 to also store A_2, since we only need one at a time and we will be reassembling them anyway. Furthermore, sometimes A_1 = A_2 and
faster convergence with a coupled solution, but also more computational expense.
Sometimes if the equations are particularly ‘nasty’ and nonlinear and tightly coupled,
then a segregated solution may be preferable as well.
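A segregated solution might then be sketched as follows (assembleA1 and assembleA2 are placeholder names for whatever routines build the two coefficient matrices, and b1, b2, and N_k are assumed to be defined):

for k = 1:N_k
    A1 = assembleA1(phi, psi);      % assemble using the previous estimates
    phi = A1\b1;                    % solve for phi holding psi fixed
    A2 = assembleA2(phi, psi);      % reassemble using the updated phi
    psi = A2\b2;                    % solve for psi
end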
These concepts will become clearer as we present the different governing differ-
ential equations in Part V and explain how they are converted into multiple systems
of equations. The important point to bear in mind at this stage is simply to note
that often we will be solving coupled systems of equations and occasionally they
will be nonlinear. It would then make sense to revisit this Chapter to revise these
concepts and ‘crystalize’ them in your mind.
To briefly summarize and recap on this part of the book, we have introduced some
relevant terminology and made some definitions relating to systems of equations that
we will encounter later on. We then investigated a number of direct methods, where
the number of operations required to solve a system is fixed, depending only upon the
size of the system, but we generally have to store multiple matrices to do so and as
a result we tend not to use these methods so much in practice. We then investigated
a number of iterative methods, where the number of operations required to solve a
system is unknown, but we generally don’t have to store as much information and
furthermore, the methods can be integrated with the sparse matrices that arise when
solving differential equations and as a result these methods are more commonly used
in practice. Finally we investigated some methods for extending these techniques
for solving nonlinear equations and how we deal with multiple systems together. In
addition to this we have gained some experience with both Matlab and C++ and seen the basic structure of both types of program, which will be continued throughout this book. We have seen that the programs involve lots of for and while loops, and that generally we can make our Matlab programs a bit more concise because of the matrix and vector operations that are available to us. Furthermore, a number of the methods are directly available within Matlab. With C++ on the other hand, linear solvers aren't a part of the language itself, but there are a number of libraries, such as BLAS [3] and LAPACK [26], which can be integrated into one's program to provide this functionality.
As a final point, it is worth mentioning that we have only briefly touched on the details of all of these methods; for the interested reader, more detailed treatments of conjugate gradient and related methods can be found in the books by Lindfield [65], Press [73], and Barrett [63].
Part II
Chapter 5
Introduction
Figure 5.1: A schematic illustrating the idea behind a numerical solution to an ordi-
nary differential equation (ODE). The blue line illustrates the continuous analytical
solution to the ODE in the domain of interest t ∈ [tmin , tmax ] and the blue dots
illustrate the numerical solution at two time steps (denoted by the superscript l)
within this domain.
In this part of the book we are going to investigate a number of different methods for solving Ordinary Differential Equations (ODEs). Before we begin to
study these numerical methods however we need to outline some terminology and
make some definitions. The most general form of an ODE can be given as:
f\left(\frac{d^N\phi}{dt^N}, \cdots, \frac{d^3\phi}{dt^3}, \frac{d^2\phi}{dt^2}, \frac{d\phi}{dt}, \phi, t\right) = 0
Here we are using the variable φ to denote our dependent variable, t to denote our
independent variable, and f represents some arbitrary function of the dependent
variable and its derivatives, the independent variable, or both. We will generally
think of t as representing the ordinate of time, but it should be noted that for
this general problem t could represent any independent variable that we can take
derivatives with respect to. For example, ODEs involving derivatives with respect
to an independent variable representing a spatial ordinate x are also commonplace
in many engineering problems. In terms of notation, it is also common to express
derivatives with respect to time as φ̇, φ̈, etc, and with respect to a spatial ordinate
as φ0 , φ00 , etc. We will also make use of this notation from time to time throughout
the book (mainly just to fit a derivative expression in the body of a paragraph,
which would stretch the line spacing if using the fractional notation). We will
always assume the existence and uniqueness of the solution and also that f (φ, t)
has continuous partial derivatives with respect to φ and t of as high an order as
necessary. Now, the idea behind solving an ODE is to find φ(t), but because we are
restricting ourselves to numerical solutions we will only ever obtain a solution that
is a collection of φ at discrete points in time, denoted φ(tl ) = φl (Figure 5.1). To
obtain the numerical solution to our ODE we will always be obtaining it within a
domain which we can denote as tmin ≤ t ≤ tmax , or as t ∈ [tmin , tmax ]. So this is just
the range of the independent variable which we are interested in the solution of our
ODE. Often tmin will be zero, but it doesn’t have to be.
One of the important aspects we must consider is the order of the ODE, which
we can deduce from the highest derivative present in the equation. The order will
have important implications in terms of how much information we have to specify
to obtain a unique solution. A generic first order ODE takes the form:
\frac{d\phi}{dt} = f(\phi, t) \qquad (5.1)
Another important aspect we must consider is whether or not we are solving a
single ODE or a system of ODEs. Throughout our study we will actually most often
be interested in studying systems of ODEs, in which case φ will be thought of as
representing a column vector of dependent variables φ = {φ1 , φ2 , ...φN }T and f will
hence represent a column vector of functions. If f happens to not explicitly involve
the independent variable t and is only a function of φ then this is known as an
autonomous system. As we will see, the numerical methods that we will investigate
are only designed to solve first order ODEs. The solution of a higher order ODE
however can be achieved if we ‘break it up’ into a system of first order ODEs. For
instance, to solve the third order ODE:
\frac{d^3\phi}{dt^3} = f(\phi, t)
the approach that one takes is to create a system of three first order ODEs by letting
φ1 = φ, φ2 = φ̇, and φ3 = φ̈ such that:
\frac{d\phi_1}{dt} = \phi_2

\frac{d\phi_2}{dt} = \phi_3

\frac{d\phi_3}{dt} = f(\phi_1, t)
which of course extends to higher order ODEs. So a single equation of order N is
equivalent to a system of N first order equations as long as the highest derivative
can be isolated. Another important aspect that we must consider is whether or not
we are solving a linear or nonlinear ODE (or system of ODEs). If our problem is
linear, then it can be represented in the form:
a(t)\frac{d^N\phi}{dt^N} + \ldots + b(t)\frac{d^2\phi}{dt^2} + c(t)\frac{d\phi}{dt} + d(t)\phi + e(t) = 0
and hence, when written as a system of first order ODEs:

φ̇ = Kφ + s   (5.2)

where K is an N × N matrix and s an N × 1 column vector, the entries of which
may be functions of t, but not of φ. This form leads to another important aspect
that we must consider which is whether or not the ODE is homogeneous or inho-
mogeneous. If s is zero then our ODE is homogeneous and if it is nonzero then it
is inhomogeneous. It should be noted that while we introduced homogeneity via a
linear ODE, a nonlinear ODE can be either homogeneous or inhomogeneous as well.
Finally, and perhaps most importantly, one of the aspects of our problem we
must consider is whether we are solving an initial value problem or a boundary value
problem. The amount of information that we will need to specify to obtain a solution
will depend upon the order of the ODE. For a first order ODE we must specify the
solution at a moment in time, most commonly at tmin as φ(tmin ) = φmin , which is
our initial condition. If we have, say, a second or third order ODE then we would additionally need to specify initial conditions for φ̇ and φ̈ respectively. Turning to the solution of
a system of ODEs we find that we need to specify an initial condition for each entry
in the vector φ, such as φmin = {φ1,min , φ2,min , ...φN,min }T . If all of these pieces of
information are specified at tmin then we are solving an initial value problem (IVP).
It is also possible that we may not always have the relevant information at tmin but
may have some of the information at tmax . In this case we are solving a boundary
value problem (BVP) and we would call the value φ(tmax ) = φmax our boundary
condition. We will see how we solve these problems in a later chapter, but for now
the point to take away from this discussion is that we need to specify a condition
116 CHAPTER 5. INTRODUCTION
(somewhere in the domain) for each derivative in our ODE, or each element in the
column vector of ODEs, however you find it easiest to visualize.
Having now introduced a number of concepts relating to the classification of ODEs, let's take a moment to look at some common example ODEs and classify them. Beginning with the Bernoulli differential equation:

\frac{d\phi}{dt} + b\phi = c\phi^n
We can see that this is a first order, nonlinear, homogeneous ODE. Another well
known ODE is the Euler differential equation:
t^2\frac{d^2\phi}{dt^2} + at\frac{d\phi}{dt} + b\phi = c
which we can see is a second order, linear, inhomogeneous ODE. Another well known
ODE is the van der Pol equation:
\frac{d^2\phi}{dt^2} - a\left(1 - \phi^2\right)\frac{d\phi}{dt} + \phi = 0
which is a second order, nonlinear, homogeneous, ODE.
Throughout the next two chapters we will for the most part be applying our
numerical methods to solve two example systems, the first of which is:
\frac{d^2\phi}{dt^2} = -4\phi
which we can see is a second order, linear, homogeneous ODE. The second example
system is:
\frac{d\phi_1}{dt} = \phi_2\phi_3

\frac{d\phi_2}{dt} = -\phi_1\phi_3

\frac{d\phi_3}{dt} = -0.5\phi_1\phi_2
which we can see is a system of 3 first order, nonlinear, homogeneous ODEs. By
testing our numerical methods on the same problem we should hopefully highlight
the differences between the different numerical methods and also know what solution
to expect.
We are now almost in a position to begin studying the families of numerical
methods. Bearing in mind that our numerical solutions will only ever be approxima-
tions to the true solution however, we need to first define some numerical concepts.
We will simply state them here and then explain them at a later stage. For each
numerical method we will investigate the accuracy of the method and the stability
of the method.
Chapter 6
Euler Methods
Perhaps the simplest methods that we can use to solve an initial value problem are
Euler methods, so they present an excellent starting point. The basic idea behind
the method is that we first consider the Taylor Series Expansion about φ(tl ):
\phi(t^{l+1}) = \phi(t^l) + \frac{\Delta t}{1!}\left.\frac{d\phi}{dt}\right|_{t^l} + \frac{\Delta t^2}{2!}\left.\frac{d^2\phi}{dt^2}\right|_{t^l} + \frac{\Delta t^3}{3!}\left.\frac{d^3\phi}{dt^3}\right|_{t^l} + \ldots \qquad (6.1)
Which we could also write in a more compact form as:
\phi(t^{l+1}) = \sum_{n=0}^{\infty}\frac{(t^{l+1} - t^l)^n}{n!}\left.\frac{d^n\phi}{dt^n}\right|_{t^l}
Letting tl+1 − tl = ∆t, which we will call the time step size, and substituting our
equation for a generic first order ODE given in Equation 5.1 into Equation 6.1 we
get:
\phi(t^{l+1}) = \phi(t^l) + \Delta t f(\phi, t) + \frac{\Delta t^2}{2!}\left.\frac{d^2\phi}{dt^2}\right|_{t^l} + \frac{\Delta t^3}{3!}\left.\frac{d^3\phi}{dt^3}\right|_{t^l} + \ldots \qquad (6.2)
With Euler methods we neglect the second order terms and higher such that we
have the approximate expression:
\phi^{l+1} = \phi^l + \Delta t f(\phi^l, t^l) \qquad (6.3)

With the explicit Euler method we interpret this as
computing the values of φ at the new time steps based solely on the values from the
previous time steps (Figure 6.1). Since we will always know the values of φ from
the previous time steps we have an explicit expression which we can evaluate to
compute the new values.
Figure 6.1: A schematic illustrating one step in the explicit Euler method. The green
arrow illustrates the gradient f (φl , tl ) which is used to step the solution forward. The
pink line illustrates computed step to φl+1 . The blue line illustrates the analytical
solution. Also illustrated is the local truncation error, which is the difference between
the analytical and numerical solutions at time step l + 1.
In order to consider the accuracy of the method, we need to consider two types of error; namely truncation error and round-off error. As mentioned in Part I, round-off error arises due to the fact that computers can only represent numbers to a finite precision. Truncation error on the other hand arises due to the terms in Equation
6.2 that were ignored. Assuming that we knew the exact solution of our ODE at
a given time step tl+1 (and ignoring round-off error for the moment) we define the
local truncation error as:
e^{l+1}_{local} = \phi(t^{l+1}) - \phi^{l+1} = \phi(t^{l+1}) - \phi^l - \Delta t f(\phi^l, t^l)
which is defining the incremental error introduced into the solution as we take a
step from φl to φl+1 . Going back to the Taylor series expansion in Equation 6.1, we
can see that the error introduced into the solution is of the order O(∆t²), thus
the explicit Euler method has a local truncation error of O(∆t2 ). The idea here is
that if we reduce ∆t by a factor of two, then the error should decrease by a factor of
four. The global truncation error is the accumulation of the local truncation error
after a number of time steps. Assuming perfect knowledge of the exact solution at
the initial time step we can define the global truncation error after Nt time steps as:
e^{N_t}_{global} = \phi(t^{N_t}) - \phi^{N_t} = \phi(t^{N_t}) - \left[\phi^0 + \Delta t f(\phi^0, t^0) + \Delta t f(\phi^1, t^1) + \ldots + \Delta t f(\phi^{N_t-1}, t^{N_t-1})\right]
Now, if each step incurs an error of O(∆t2 ) and the errors are simply cumulative
(a fairly conservative assumption) and we have to take O(∆t−1 ) steps to cover the
domain, then the net truncation error is O(∆t). In other words, the error associated
with integrating an ODE using Euler methods is directly proportional to the time
step size. So Euler methods are termed a first-order accurate method because the
global truncation error associated with integrating over a finite domain scales like
O(∆t). More generally, a numerical method is conventionally called an N th order
method if its local truncation error per step is O(∆tN +1 ).
Note that truncation error would be incurred even if computers performed floating-
point arithmetic operations to infinite accuracy. Unfortunately, computers do not
perform such operations to infinite accuracy. In fact, a computer is only capable of
storing a floating-point number to a fixed number of decimal places. At large time
step sizes the error is dominated by truncation error, whereas round-off error dom-
inates at small time step sizes. So we in fact reach a point where further reducing
the time step size will actually increase the error in the solution.
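As a rough numerical illustration of the first order behaviour (a quick check one could run, not part of the example programs), halving ∆t should roughly halve the error at the end of the domain:

% Crude check of first order accuracy: explicit Euler applied to
% dphi/dt = -phi with phi(0) = 1, whose exact solution is exp(-t)
for Delta_t = [0.1 0.05 0.025]
    phi = 1.0;
    for l = 1:round(1/Delta_t)
        phi = phi + Delta_t*(-phi);
    end
    fprintf('Delta_t = %5.3f, error at t = 1 is %e\n', Delta_t, abs(phi - exp(-1)));
end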
Example 6.1:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
\frac{d\phi}{dt} = 1 - \phi
in the domain t ∈ [0, 10], with initial condition φ(0) = 0. We will experiment with
the time step sizes ∆t = 0.1, ∆t = 0.5, ∆t = 1.0, and ∆t = 2.0 and compare
the numerical solution with the exact (or analytical) solution φ(t) = 1 − e−t . The
intended learning outcomes for this example will be to observe the application of
the explicit Euler method to solve an ODE and how we write data to an output file
in C++.
In order to begin, we apply the explicit Euler method to the ODE by substituting
in for f (φl , tl ) into Equation 6.3 as:
\phi^{l+1} = \phi^l + \Delta t\left(1 - \phi^l\right)
Then in order to actually compute the solution at each time step we place this equa-
tion inside a time marching loop where we will compute the solution at successive
time steps as:
for l=1:N_t-1
phi(l+1) = phi(l) + Delta_t *(1 - phi(l));
end
in Matlab, and:
for(int l=0; l<N_t-1; l++)
{
phi[l+1] = phi[l] + Delta_t *(1.0 - phi[l]);
}
in C++. These two code snippets are essentially all that is required to implement an
explicit Euler method for solving this simple ODE. In order to output the solution
data to a file we will assume that we have completed the simulation and all of the
data has been computed. In this case we will first declare an instance of an fstream
class:
fstream file;
Then we will open the file, loop through the array and write out the data, then close
the file:
file.open("Example6_1.data", ios::out);
for(int l=0; l<N_t; l++)
{
file << phi[l] << "\t" << 1.0-exp(-t[l]) << endl;
}
file.close();
where the first argument in the open member function is our desired name for the
file and the second is a flag indicating in this case that we wish to write out the data.
For this particular example our output file will be a readable text file containing
two columns; the first containing the numerical solution and the second column
(separated by a tab "\t") containing the exact solution. So each pass through the
for loop writes out one row of data.
The complete programs are given in Example6_1.m and Example6_1.cpp. The
solution is shown in Figures 6.2(a) - 6.2(d). Note that the numerical solution gets
closer to the analytical solution for smaller values of ∆t. This is what we would
expect since we know that the error in the solution is proportional to the time step
size.
Figure 6.2: Solution to the ODE in Example 6.1 with (a) ∆t = 0.1 (b) ∆t = 0.5 (c)
∆t = 1.0 (d) ∆t = 2.0. The green lines show the analytical solution and the blue
dotted lines show the numerical solution.
In order to investigate the stability of the explicit Euler method we will consider the model problem:

\frac{d\phi}{dt} = \lambda\phi \qquad (6.4)
Figure 6.3: Some qualitative solutions to the model problem. When λ is real the
solution will show exponential growth or decay depending on whether it is positive
or negative respectively. When it is imaginary the solution will be oscillatory like
a sinusoidal function, and if λ has both real and imaginary components then the
solution will exhibit sinusoidal growth or decay.
where λ is a constant which can be a complex number. Figure 6.3 illustrates some
solutions to the model problem depending upon the value of λ. In most engineering
problems, the real part of λ is negative. This means that the solution to the ODE will
typically decay with time. You may well ask why we use this model problem for our
analysis and why we let λ be complex (especially since most of the problems that we
will be solving involve real numbers as opposed to complex numbers). The answer
to the first question is that many problems fall into this category, or as we shall see,
can be made to fall into this category. Although the model problem is homogeneous,
we find that the inhomogeneous parts of the ODE do not really affect the stability
analysis, so it is therefore ‘no big deal’ to ignore them. Furthermore, for complex
nonlinear problems we will most likely not be able to perform a stability analysis, so
we learn what we can from studying the model problem. We can however sometimes
‘linearize’ a nonlinear problem such that our stability analysis performed on the
model problem still applies. In regards to the second question, in many practical
problems, the ODEs produce oscillatory type solutions, which won’t appear unless
we allow λ to be a complex number.
Applying the explicit Euler method to Equation 6.4 gives:
φl+1 = (1 + λ∆t) φl
Thus the solution at any time step l can be written as:
φl = φ0 (1 + λ∆t)l
= φ0 (1 + (λRe + iλIm )∆t)l
= φ0 σ l
where λRe and λIm are the real and imaginary parts of λ and σ is known as the
amplification factor . An important point on notation is that here φ0 implies the
solution at time step 0, but σ l implies that the amplification factor is raised to the
power of l. This will be the case for all similar expressions throughout this part of
the book. If |σ| ≤ 1 then the solution will decay with time. The opposite is true if
|σ| > 1. Hence, in order to ensure the stability of the numerical method we want
|σ| ≤ 1, and therefore:

(1 + \lambda_{Re}\Delta t)^2 + (\lambda_{Im}\Delta t)^2 \le 1

This is the equation of a circle of radius 1, centered at (−1, 0), and the inequality implies that λ∆t must lie inside the circle in order for the method to be stable. This
plot is called the stability diagram and is shown in Figure 6.4. Within this region
we say that the method is absolutely stable and outside of this region we say that it
is unstable.
Figure 6.4: The stability diagram for the explicit Euler method.
If we consider first the case where λ is real and negative (i.e. λ = −λRe ), the
model problem becomes:
\frac{d\phi}{dt} = -\lambda_{Re}\phi
For illustrative purposes, let’s use φ(t = 0) = φ0 . The exact solution to this problem
is:
φ(t) = φ0 e−λRe t
then in order for the numerical method to be stable we have:
|1 − λRe ∆t| ≤ 1
where using the rule of inequalities (|a| ≤ b implying −b ≤ a ≤ b) means that:
−1 ≤ 1 − λRe ∆t ≤ 1
−2 ≤ −λRe ∆t ≤ 0
and dividing through by −λ_{Re} and noting another rule of inequalities (that dividing by a negative number involves reversing the inequality) we get:

0 \le \Delta t \le \frac{2}{\lambda_{Re}} \qquad (6.6)
So we have found the maximum time step size that we can use and still get a stable
solution. It is worth mentioning that if λRe is positive, then the solution will exhibit
an exponential increase with time. This does not mean that the numerical method
however is not working, just that the stability analysis doesn’t really apply. We are
more concerned with cases where we know that the solution should decay with time
(or at least not grow exponentially), finding the regions in the stability diagram
where this will not happen, and staying out of them!
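As a quick numerical illustration (not part of the worked examples), consider the explicit Euler method applied to dφ/dt = −φ, for which λ_{Re} = 1 and the limit in Equation 6.6 is ∆t ≤ 2:

% Explicit Euler applied to dphi/dt = -phi with phi(0) = 1, using a time
% step just inside and just outside the stability limit Delta_t <= 2
for Delta_t = [1.9 2.1]
    phi = 1.0;
    for l = 1:round(50/Delta_t)
        phi = phi + Delta_t*(-phi);
    end
    fprintf('Delta_t = %3.1f gives phi(50) = %e\n', Delta_t, phi);
end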
Consider now the case where λ is purely imaginary (i.e. λ = iλ_{Im}), where i is the imaginary unit \sqrt{-1}. The model problem becomes:

\frac{d\phi}{dt} = i\lambda_{Im}\phi
For illustrative purposes, let’s use φ(t = 0) = φ0 . The exact solution to this problem
is:
φ(t) = φ0 eiλIm t
By considering Euler’s formula eiθ = cos θ + i sin θ (not to be confused with Euler
methods that are the focus of this chapter) we can see that the exact solution will be
oscillatory in nature. Because the stability region is only tangent to the imaginary
axis, the explicit Euler method is always unstable (irrespective of the time step size)
for purely imaginary λ. So we know that the amplitude will grow with time as the
value of λ is not within the stability region of Figure 6.4 (i.e. it lies somewhere on
the vertical axis).
If we are dealing with a system of ODEs, then the concepts of stability analysis
still apply. To see how let’s consider a linear system:
φ̇ = Kφ (6.7)
where K is a constant N × N matrix. We will assume that K is diagonalizable [10], meaning that it has a complete set of N linearly independent eigenvectors ξ_n satisfying Kξ_n = λ_nξ_n for n = 1, 2, ...N (i.e. a standard eigenvalue problem). We
can represent the matrix of Eigenvectors as:
Ξ = [ξ1 , ξ2 , . . . ξN ]
where essentially we’re taking each ξn (which is a column vector) and ‘stacking’
them together column by column to form a matrix. We can then represent the
matrix of eigenvalues as:
\Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_N \end{bmatrix} = \mathrm{diag}(\lambda_1, \lambda_2, ...\lambda_N)
and so our matrix K can be written as:
K = ΞΛΞ−1
or rearranging in terms of Λ as:
Λ = Ξ−1 KΞ
Now if we let ψ(t) = Ξ^{-1}φ(t), then multiplying Equation 6.7 by Ξ^{-1} and noting that ΞΞ^{-1} = Ξ^{-1}Ξ = I gives the equivalent equations:

\frac{d\psi_n}{dt} = \lambda_n\psi_n
which is the model problem outlined in Equation 6.4 and we can translate the
stability criteria for a system of ODEs from that for a single ODE. So, applying the
explicit Euler method to the model problem gives:
φl+1 = (I + ∆tK) φl
which by the previous transformations can be written as:
ψ l+1 = (I + ∆tΛ) ψ l
This decouples into N independent equations, one for each component of ψ. These
take the form:
\psi_n^{l+1} = (1 + \Delta t\lambda_n)\psi_n^l

so that the stability of each mode is governed by the scalar model problem considered above, with the most restrictive time step limit being:

\Delta t \le \frac{2}{\max(\lambda_n)}
If the range of magnitudes of the eigenvalues is large:

\frac{\max(\lambda_n)}{\min(\lambda_n)} \gg 1
and the solution is desired over a large span of the independent variable t, then the
system is known as a stiff system. Stiff systems arise in physical situations with
widely varying rates of responses and can result in extremely small time step sizes
in order to satisfy the stability criteria. By ‘varying rates of response’ we mean that
maybe one ψn is changing rapidly while another one may change relatively slowly.
This same idea will apply more generally to other methods such that a method
is stable if and only if each λn ∆t is within the stability region of the numerical
method for every eigenvalue of the matrix K. An important point to realize when
considering stability analysis is that we derive all our stability regions by considering
a model problem, but of course most of the systems of ODEs that we will want to
solve in practice will not be so simple. If we have a linear system as was just outlined
then the idea is that we try and make our problem ‘look’ like the model problem
so that we can apply our stability analyses to it. If we are dealing with non-linear
systems then of course things are a lot more difficult.
Example 6.2:
In this example we will develop a Matlab program to solve the example system:
\frac{d^2\phi}{dt^2} + \frac{d\phi}{dt} + 4\phi = 0 \qquad (6.8)
in the domain t ∈ [0, 10], with initial conditions φ(0) = 1 and φ̇(0) = 0. We will
plot the stability region for the explicit Euler method and experiment with the time
step sizes ∆t = 0.1 and ∆t = 0.5. The intended learning outcome for this example
will be to see how we solve a second order ODE, to observe how the time step size
affects where the solution will sit in the stability region of the explicit Euler method
and what happens to the solution when we are outside of that region.
In order to begin, we will need to break this second order ODE into a system of
first order ODEs. We do so by making the definition:
\phi_1 = \phi \qquad \phi_2 = \frac{d\phi_1}{dt}
\frac{d\phi_1}{dt} = \phi_2

\frac{d\phi_2}{dt} = -4\phi_1 - \phi_2

so that the system can be written as:

\dot{\phi} = K\phi

where \phi = \{\phi_1, \phi_2\}^T and K is the matrix:

K = \begin{bmatrix} 0 & 1 \\ -4 & -1 \end{bmatrix}
We will actually store the solution as a 2D array in Matlab as:
phi = zeros(N_e, N_t)
Applying the explicit Euler method to this system then gives:

\phi^{l+1} = \phi^l + \Delta tK\phi^l = (I + \Delta tK)\phi^l
For this system we can calculate the diagonal matrix of eigenvalues and eigenvectors
in Matlab with the eig function as:
[Xi Lambda] = eig(K);
which will give us two matrices, one containing the eigenvalues on the main diag-
onal and the other containing the eigenvectors. For this particular K we get the
eigenvalues:
\Lambda = \begin{bmatrix} -0.5000 + 1.9365i & 0 \\ 0 & -0.5000 - 1.9365i \end{bmatrix}
Since these two eigenvalues have negative real parts as well as imaginary parts, we
should be able to fit inside the stability region of the explicit Euler method for some
choice of ∆t. Recall the amplification factor for the explicit Euler method:

\sigma = 1 + \lambda\Delta t

In order to plot the stability region, we can create a grid of points λ∆t covering a region of the complex plane, compute the magnitude of the complex number σ at each x, y point, and then extract a contour of |σ| = 1.
This can be easily accomplished in Matlab using the meshgrid, sqrt and contourf functions respectively as:
[X, Y] = meshgrid(-2:0.1:2, -2:0.1:2);
Z = X + i*Y;
sigma = sqrt((1 + real(Z)).^2 + imag(Z).^2);
contourf(X, Y, sigma, [1 1]);
plot(real(diag(Lambda)*Delta_t), imag(diag(Lambda)*Delta_t));
While we can derive an expression for the minimum time step size allowed in order
for all of the eigenvalues to be inside the stability region of the explicit Euler method,
a more interesting way for this example will be to experiment with some different
choices of ∆t and see how it affects the solution.
The complete program is given in Example6_2.m. As such, two different solutions
are shown in Figures 6.5(a) - 6.5(d) for ∆t = 0.1 and ∆t = 0.5, showing where the
λm ∆t terms lie on the complex plane in each case. As can be observed, when the
λm ∆t values are inside the stability region the solution decays with time (as it
should), whereas when the λm ∆t values are outside the stability region the solution
‘blows up’.
Returning to the case where λ is purely imaginary, the amplification factor of the explicit Euler method can be written in polar form as:

\sigma = 1 + i\lambda_{Im}\Delta t = Ze^{i\theta}

where:

Z = \sqrt{1 + (\lambda_{Im}\Delta t)^2} \qquad (6.9)

represents the magnitude and:

\theta = \tan^{-1}\left(\frac{\lambda_{Im}\Delta t}{1}\right)
represents the angle. These definitions are useful in order to perform an error
analysis. Let’s now compare the exact solution at time t = l∆t:
Figure 6.5: Solution to the ODE system in Example 6.2 illustrating (a) the λ∆t
terms in relation to the stability region of the explicit Euler method and (b) the
solution with ∆t = 0.1 (c) the λ∆t terms in relation to the stability region of the
explicit Euler method and (d) the solution with ∆t = 0.5.
Figure 6.6: The amplification factor σ shown in the complex plane, with real and imaginary parts σRe and σIm, magnitude Z, and angle θ.
\phi(l\Delta t) = \phi^0 e^{i\lambda_{Im}l\Delta t}

with the numerical solution:

\phi^l = \phi^0\sigma^l = \phi^0 Z^l e^{il\theta}
Dividing the two equations and rearranging in terms of the numerical solution we get:

\phi^l = \phi(l\Delta t)\,Z^l e^{il(\theta - \lambda_{Im}\Delta t)}

so that Z^l represents the error in amplitude and l(\theta - \lambda_{Im}\Delta t) the error in phase accumulated after l time steps. Making use of the Taylor series expansion:

\tan^{-1}(\alpha) = \alpha - \frac{\alpha^3}{3} + \frac{\alpha^5}{5} - \frac{\alpha^7}{7} + \ldots

we can then rewrite the phase error per time step as:

\theta - \lambda_{Im}\Delta t = \tan^{-1}(\lambda_{Im}\Delta t) - \lambda_{Im}\Delta t \approx -\frac{(\lambda_{Im}\Delta t)^3}{3}
So for the explicit Euler method we have now considered the accuracy in terms of both phase and amplitude, and investigated the stability region of the method. We will be doing
this for all of the numerical methods throughout the rest of this part of the book,
so that we can learn more about the advantages, disadvantages of each method.
Example 6.3:
In this example we will develop a Matlab program to solve the example system:
\frac{d^2\phi}{dt^2} = -4\phi \qquad (6.11)
in the domain t ∈ [0, 10], with initial conditions φ(0) = 1 and φ̇(0) = 0 and
compare the numerical solution with the exact solution φ(t) = cos(2t) and hence
φ̇(t) = −2 sin(2t). The intended learning outcomes for this example will be to see
how the use of an implicit method means solving a system of equations at every
time step.
In order to begin, we will need to break this second order ODE into a system of
first order ODEs. We do so by making the definition:
\phi_1 = \phi \qquad \phi_2 = \frac{d\phi_1}{dt}
Figure 6.7: A schematic illustrating one step in the implicit Euler method. The
green arrow illustrates the gradient f (φl+1 , tl+1 ) which is used to step the solution
forward. The pink line illustrates computed step to φl+1 . The blue line illustrates
the analytical solution. Also illustrated is the local truncation error, which is the
difference between the analytical and numerical solutions at time step l + 1.
\frac{d\phi_1}{dt} = \phi_2

\frac{d\phi_2}{dt} = -4\phi_1

so that the system can be written as:

\dot{\phi} = K\phi

where \phi = \{\phi_1, \phi_2\}^T and K is the matrix:

K = \begin{bmatrix} 0 & 1 \\ -4 & 0 \end{bmatrix}
We will actually store the solution as a 2D array in Matlab as:
phi = zeros(N_e, N_t)
Applying the implicit Euler method to this system gives:

\phi^{l+1} = \phi^l + \Delta tK\phi^{l+1}

which can be rearranged as:

(I - \Delta tK)\phi^{l+1} = \phi^l \qquad (6.12)

which we can recognize as a linear system of the form:

A\phi^{l+1} = b
where A = (I − ∆tK) and b = φl . So the important point is that by using an
implicit method to solve our system of ODEs we have to solve a system of equations
at every time step in order to compute the solution φl+1 . Now, because we had a
linear ODE, we get a linear system of equations to solve. As we will see shortly, when
we have a nonlinear system of ODEs and we use an implicit method, we have to solve
a nonlinear system of equations. Another point worthy of mention is that we could
in principle use any of the methods from Part I to solve the linear system at each
time step. Furthermore, because the system is such a small one, a direct method
such as Gaussian Elimination or LU Decomposition might be a good choice. Here
however, because the focus of this example is on how we apply the implicit Euler
method to solving a system of ODEs, we will simply use the backslash operator,
such that the code solving our system will be given as:
for l=1:N_t-1
phi(:, l+1) = (I - Delta_t*K) \ phi(:, l);
end
Figure 6.8: Solution to the ODE system in Example 6.3 with ∆t = 0.05 showing (a)
φ1 = φ(t) and (b) φ2 = φ̇(t). The green lines show the exact solution and the blue
dotted lines show the numerical solution.
Example 6.4:
In this example we will develop a Matlab program to solve the example system:
\frac{d\phi_1}{dt} = \phi_2\phi_3

\frac{d\phi_2}{dt} = -\phi_1\phi_3

\frac{d\phi_3}{dt} = -0.5\phi_1\phi_2 \qquad (6.13)
in the domain t ∈ [0, 10], with initial conditions φ1 (0) = 0, φ2 (0) = 1, φ3 (0) = 1.
The intended learning outcomes for this example will be to see how the application
of an implicit method to a nonlinear system of ODEs requires iteratively solving a
linear system of equations.
In order to begin, we apply the implicit Euler method by substituting for f (φ, t)
into Equation 6.10 to get:

\phi_1^{l+1} = \phi_1^l + \Delta t\,\phi_2^{l+1}\phi_3^{l+1}

\phi_2^{l+1} = \phi_2^l - \Delta t\,\phi_1^{l+1}\phi_3^{l+1}

\phi_3^{l+1} = \phi_3^l - 0.5\Delta t\,\phi_1^{l+1}\phi_2^{l+1} \qquad (6.14)
It can be observed that unlike Example 6.3 this system is nonlinear and consequently we won't be able to put the system in Equation 6.14 into the form Aφ^{l+1} = b. In this case we will have to use an iterative method to solve the nonlinear system and as such we will use Newton's method, which we studied in Chapter 4. To do so we put the system into the form f(φ^{l+1}) = 0 as:
f = \begin{bmatrix} \phi_1^{l+1} - \phi_1^l - \Delta t\,\phi_2^{l+1}\phi_3^{l+1} \\ \phi_2^{l+1} - \phi_2^l + \Delta t\,\phi_1^{l+1}\phi_3^{l+1} \\ \phi_3^{l+1} - \phi_3^l + 0.5\Delta t\,\phi_1^{l+1}\phi_2^{l+1} \end{bmatrix} = 0
and then we compute the Jacobian matrix by working out all of the derivative
expressions. For this example we get:
J = \frac{\partial f_m}{\partial\phi_n^{l+1}} = \begin{bmatrix} 1 & -\Delta t\phi_3^{l+1} & -\Delta t\phi_2^{l+1} \\ \Delta t\phi_3^{l+1} & 1 & \Delta t\phi_1^{l+1} \\ 0.5\Delta t\phi_2^{l+1} & 0.5\Delta t\phi_1^{l+1} & 1 \end{bmatrix}
We can then iteratively solve the linear system of equations at each time step:

J\Delta\phi = -f

and improve the values for \phi^{l+1} as:

\phi^{l+1,k+1} = \phi^{l+1,k} + \Delta\phi
where k denotes the iteration, until we converge on a solution for the time step l + 1.
As with all iterative techniques we need to ‘guess’ the solution for φl+1 before we
begin the iterative process. One approach would be to simply take φl+1,k=0 = 0,
which is a perfectly reasonable guess. A far more efficient approach however would
be to use the converged solution to the previous time step (i.e. φl+1,k=0 = φl ). In
terms of implementing the program in Matlab, we combine the time marching loop with an inner iterative loop as:
for l = 1:N_t-1
    t(l+1) = t(l) + Delta_t;
    phi(:,l+1) = phi(:,l);
    f = [ phi(1,l+1) - phi(1,l) - Delta_t*phi(2,l+1)*phi(3,l+1);
          phi(2,l+1) - phi(2,l) + Delta_t*phi(1,l+1)*phi(3,l+1);
          phi(3,l+1) - phi(3,l) + 0.5*Delta_t*phi(1,l+1)*phi(2,l+1)];
    r_norm = max(abs(f));
    k = 0;
    while r_norm>tolerance && k<maxIterations
        f = [ phi(1,l+1) - phi(1,l) - Delta_t*phi(2,l+1)*phi(3,l+1);
              phi(2,l+1) - phi(2,l) + Delta_t*phi(1,l+1)*phi(3,l+1);
              phi(3,l+1) - phi(3,l) + 0.5*Delta_t*phi(1,l+1)*phi(2,l+1)];
        J = [ 1                      -Delta_t*phi(3,l+1)     -Delta_t*phi(2,l+1);
              Delta_t*phi(3,l+1)      1                       Delta_t*phi(1,l+1);
              0.5*Delta_t*phi(2,l+1)  0.5*Delta_t*phi(1,l+1)  1 ];
        Delta_phi = -J\f;
        phi(:,l+1) = phi(:,l+1) + Delta_phi;
        r_norm = max(abs(f));
        k = k + 1;
    end
end
Where a point to note is that we are using the infinity norm as the measure of convergence.
The complete program is given in Example6_4.m. Figure 6.9 shows the numerical solution to the system in Equation 6.13. As can be observed, the amplitude of the oscillations is decreasing in time, which is incorrect, since the true solution should exhibit no change in the amplitude of the oscillations.
Figure 6.9: Solution to the ODE system in Example 6.4 with ∆t = 0.1.
Applying the implicit Euler method to the model problem in Equation 6.4 gives:

\phi^{l+1} = \phi^l + \lambda\Delta t\phi^{l+1}

which after rearranging gives:

\phi^{l+1} = \frac{\phi^l}{1 - \lambda\Delta t}
Thus the solution at any time step l can be written as:
φl = φ0 σ l
where our amplification factor this time is:
\sigma = \frac{1}{1 - \lambda\Delta t}
As before, in order to ensure the stability of the numerical method we require |σ| ≤ 1, therefore:

|\sigma| = \frac{1}{\sqrt{(1 - \lambda_{Re}\Delta t)^2 + (\lambda_{Im}\Delta t)^2}} \le 1

or, rearranging:

(1 - \lambda_{Re}\Delta t)^2 + (\lambda_{Im}\Delta t)^2 \ge 1

This is the equation of a circle of radius 1 centered at (1, 0), and the inequality implies that the method is only unstable when λ∆t lies inside this circle, as illustrated in Figure 6.10. Considering again the case where λ is purely imaginary, we can write the amplification factor as the ratio σ = σ_1/σ_2, with σ_1 = 1 and σ_2 = 1 − iλ_{Im}∆t, such that:

Z_1 = |\sigma_1| = 1

Z_2 = |\sigma_2| = \sqrt{1 + (\lambda_{Im}\Delta t)^2}

and:
Figure 6.10: The stability diagram for the implicit Euler method.
\theta_1 = \tan^{-1}\left(\frac{0}{1}\right) = 0

\theta_2 = \tan^{-1}\left(-\frac{\lambda_{Im}\Delta t}{1}\right)

The overall magnitude and angle of the amplification factor are then Z = Z_1/Z_2 and \theta = \theta_1 - \theta_2, giving:

Z = \frac{1}{\sqrt{1 + (\lambda_{Im}\Delta t)^2}} \qquad (6.15)

and:

\theta = \tan^{-1}\left(\frac{\lambda_{Im}\Delta t}{1}\right)
Comparing the exact solution at time t = l∆t:

\phi(l\Delta t) = \phi^0 e^{i\lambda_{Im}l\Delta t}

with the numerical solution:

\phi^l = \phi^0\sigma^l = \phi^0 Z^l e^{il\theta}

and dividing the two equations we get:

\phi^l = \phi(l\Delta t)\,Z^l e^{il(\theta - \lambda_{Im}\Delta t)}

Since Z < 1 for the implicit Euler method, the numerical solution is damped in amplitude as well as shifted in phase relative to the exact solution, as illustrated in Figure 6.11.
Figure 6.11: The solution to the model problem after a number of cycles, when λ = i,
for (a) the explicit and (b) the implicit Euler methods. The green curves illustrate
the analytical solution and the blue curves illustrate the numerical solution, in
particular highlighting the phase and amplitude error.
Chapter 7
Crank-Nicolson Methods
Figure 7.1: A schematic illustrating one step in the Crank-Nicolson method. The
green arrows illustrate the gradients f (φl , tl ) and f (φl+1 , tl+1 ) which are both used
to step the solution forward. The pink line illustrates the computed step to φ^{l+1}. The
blue line illustrates the analytical solution. Also illustrated is the local truncation
error, which is the difference between the analytical and numerical solutions at time
step l + 1.
In Chapter 6 we studied the Euler methods, which are pretty much the simplest
way in which we can solve an ODE numerically. These methods were only first order
accurate however and so an appropriate question would be, how can we improve on
that accuracy? In fact all of the numerical methods for solving ODEs that we will
study from here on in will be an improvement on the Euler methods and the one
such method that will be the focus of this chapter is the Crank-Nicolson method.
The basic idea behind the method is to compute φ(tl+1 ) by integration:
φ(t^{l+1}) = φ(t^l) + ∫_{t^l}^{t^{l+1}} f(φ, t) dt    (7.1)
Approximating this integral with the trapezoidal rule, using the gradient at both ends of the time step, gives the (implicit) Crank-Nicolson method:
φ^{l+1} = φ^l + (∆t/2) [ f(φ^l, t^l) + f(φ^{l+1}, t^{l+1}) ]
Example 7.1:
In this example we will develop a Matlab program to solve the example system:
d²φ/dt² = −4φ    (7.3)
in the domain t ∈ [0, 10], with initial conditions φ(0) = 1 and φ̇(0) = 0 and compare
the numerical solution with the exact solution φ(t) = cos(2t) and hence φ̇(t) =
−2 sin(2t). The intended learning outcomes for this example will be to simply
observe the application of the Crank-Nicolson method to solve a system of ODEs.
In order to begin, as we did in Example 6.3, we will need to break this second
order ODE into a system of first order ODEs. We do so by making the definition:
φ_1 = φ
φ_2 = dφ_1/dt
which gives the system:
dφ_1/dt = φ_2
dφ_2/dt = −4φ_1
or, in matrix form:
φ̇ = Kφ
where φ = {φ_1, φ_2}^T and K is the matrix:
K = [  0   1 ;
      −4   0 ]
We will actually store the solution as a 2D array in Matlab as:
phi = zeros(N_e, N_t)
Applying the Crank-Nicolson method to this system gives:
φ^{l+1} = φ^l + (∆t/2) ( Kφ^l + Kφ^{l+1} )
which can be rearranged as:
( I − (∆t/2) K ) φ^{l+1} = ( I + (∆t/2) K ) φ^l
or:
Aφ^{l+1} = b
where A = (I − ∆t/2K) and b = (I + ∆t/2K)φl . So as with the implicit Euler
method we have to solve a system of equations at every time step in order to compute
the solution φ^{l+1}. Now, because we had a linear ODE, we get a linear system of
equations to solve. Since we are more interested in the application of the Crank-
Nicolson method than in how we solve this system, we will again simply use the
backslash operator to solve the resulting linear system and hence the code for solving
this system of ODEs is:
for l = 1:N_t-1
phi(:,l+1) = ((I - Delta_t/2*K)\(I + Delta_t/2*K))*phi(:,l);
end
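As a side note on the design of this loop: because K and ∆t are constant, the matrices in the linear system do not change between time steps. A minimal variation (not part of Example7_1.m; the factorization step is an assumption about one way this could be sped up) is to form and factorize A once, outside the loop:

A = I - Delta_t/2*K;            % constant left hand side matrix
B = I + Delta_t/2*K;            % constant matrix forming the right hand side
[L_A, U_A] = lu(A);             % factorize A once
for l = 1:N_t-1
    phi(:,l+1) = U_A\(L_A\(B*phi(:,l)));   % reuse the factorization every step
end

For a 2 × 2 system the saving is negligible, but the same idea becomes important once the much larger systems arising from partial differential equations are encountered.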
Figure 7.2: Solution to the ODE system in Example 7.1 with ∆t = 0.05 showing (a)
φ1 = φ(t) and (b) φ2 = φ̇(t). The green lines show the exact solution and the blue
dotted lines show the numerical solution.
To analyse the stability of the Crank-Nicolson method we apply it to the model problem φ̇ = λφ:
φ^{l+1} = φ^l + (∆t/2)( λφ^l + λφ^{l+1} )
       = [ (1 + ∆tλ/2) / (1 − ∆tλ/2) ] φ^l
so that the solution at any time step l can be written as:
φ^l = φ^0 [ (1 + ∆tλ/2) / (1 − ∆tλ/2) ]^l
    = φ^0 [ (1 + (∆t/2)λ_Re + i(∆t/2)λ_Im) / (1 − (∆t/2)λ_Re − i(∆t/2)λ_Im) ]^l
    = φ^0 σ^l
where our amplification factor this time is:
σ = (1 + ∆tλ/2) / (1 − ∆tλ/2)
Then in order for the numerical method to be stable we have |σ| ≤ 1, therefore:
|σ| = √( (1 + (∆t/2)λ_Re)² + ((∆t/2)λ_Im)² ) / √( (1 − (∆t/2)λ_Re)² + ((∆t/2)λ_Im)² ) ≤ 1
or, rearranging and simplifying:
√( (1 + (∆t/2)λ_Re)² + ((∆t/2)λ_Im)² ) ≤ √( (1 − (∆t/2)λ_Re)² + ((∆t/2)λ_Im)² )
1 + ∆tλ_Re + (∆t²/4)λ_Re² + (∆t²/4)λ_Im² ≤ 1 − ∆tλ_Re + (∆t²/4)λ_Re² + (∆t²/4)λ_Im²
2λ_Re ∆t ≤ 0
λ_Re ∆t ≤ 0
Therefore the stability region of the Crank-Nicolson method is the entire left hand
plane on the stability plot (Figure 7.3), and the method is stable for any choice of
∆t (i.e. unconditionally stable), provided λRe is negative.
In order to perform the error analysis we are going to consider the case where λ
is imaginary. We want to get σ into polar form and the best way to do this is:
σ = (1 + i∆tλ_Im/2) / (1 − i∆tλ_Im/2) = σ_1/σ_2 = (Z_1 e^{iθ_1}) / (Z_2 e^{iθ_2}) = (Z_1/Z_2) e^{i(θ_1 − θ_2)} = Z e^{iθ}
In this case:
Z_1 = |σ_1| = √( 1 + (λ_Im ∆t/2)² )
Z_2 = |σ_2| = √( 1 + (λ_Im ∆t/2)² )    (7.4)
Figure 7.3: The stability diagram for the implicit Crank-Nicolson method.
and:
θ_1 = tan⁻¹( λ_Im ∆t / 2 )
θ_2 = tan⁻¹( −λ_Im ∆t / 2 )    (7.5)
and dividing the two equations as we did with the Euler methods:
Z = Z_1/Z_2 = 1
θ = θ_1 − θ_2 = 2 tan⁻¹( λ_Im ∆t / 2 )
we can see that there is no amplitude error associated with the Crank-Nicolson
method and that the phase error is:
θ − λ_Im ∆t = 2 tan⁻¹( λ_Im ∆t / 2 ) − λ_Im ∆t
            = 2 [ (λ_Im ∆t/2) − (λ_Im ∆t/2)³/3 + (λ_Im ∆t/2)⁵/5 − (λ_Im ∆t/2)⁷/7 + . . . ] − λ_Im ∆t
            = −(λ_Im ∆t)³/12 + (λ_Im ∆t)⁵/80 + . . .
which is of order O((λ_Im ∆t)³). So compared to the Euler methods the phase error here
is of the same order, and because this term is negative we would expect a phase lead
in the solution; the leading term, however, has 12 in the denominator instead of 3 and so will
be smaller. To highlight this result, Figure 7.4 illustrates the solution to the model
problem after a number of time steps for the implicit Crank-Nicolson method. As
can be observed, the numerical solution is ahead of the exact solution, but has not
developed any error in the amplitude.
Figure 7.4: The solution to the model problem with the Crank-Nicolson Method
after a number of cycles, when λ = i. The green curve illustrates the analytical so-
lution and the blue curve illustrates the numerical solution, in particular highlighting
the phase error and that there is no amplitude error.
φl+1 − φl = O(∆t)
So substituting into Equation 7.7 gives:
f(φ^{l+1}, t^{l+1}) = f(φ^l, t^{l+1}) + (φ^{l+1} − φ^l) (∂f/∂φ)(φ^l, t^{l+1}) + O(∆t²)    (7.8)
Substituting Equation 7.8 into Equation 7.6 gives:
φ^{l+1} = φ^l + (∆t/2) [ f(φ^l, t^{l+1}) + (φ^{l+1} − φ^l)(∂f/∂φ)(φ^l, t^{l+1}) + f(φ^l, t^l) ]
So, it is now possible to obtain an explicit expression for φ^{l+1}:
φ^{l+1} = φ^l + (∆t/2) [ f(φ^l, t^{l+1}) + f(φ^l, t^l) ] / [ 1 − (∆t/2)(∂f/∂φ)(φ^l, t^{l+1}) ]    (7.9)
This method has good stability characteristics but suffers from the problem that you
have to find the derivative of f with respect to φ. This may not always be possible,
or may just be a pain to do. Now the implicit Crank-Nicolson method is the more
common of these two, so from now on we will refer to the implicit Crank-Nicolson
method simply as the Crank-Nicolson method, realizing that it is implicit in nature.
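To make Equation 7.9 concrete, here is a minimal sketch applying the explicit Crank-Nicolson update to the model problem f(φ, t) = λφ, for which ∂f/∂φ = λ; the values of λ, ∆t, and N_t are illustrative assumptions:

lambda  = -1.0;                  % illustrative decay rate
Delta_t = 0.1;                   % illustrative time step size
N_t     = 101;
phi     = zeros(1, N_t);
phi(1)  = 1.0;                   % initial condition
for l = 1:N_t-1
    f_l    = lambda*phi(l);      % f(phi^l, t^l), equal to f(phi^l, t^{l+1}) for this autonomous problem
    dfdphi = lambda;             % df/dphi for the model problem
    phi(l+1) = phi(l) + Delta_t/2*(f_l + f_l)/(1 - Delta_t/2*dfdphi);   % Equation 7.9
end

No system of equations is solved here; the price is that ∂f/∂φ must be available, which is exactly the drawback noted above.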
Chapter 8
Leapfrog Methods
Figure 8.1: A schematic illustrating one step in the Leapfrog Method. The green
arrow illustrates the gradient f (φl , tl ) which is used to step the solution forward. The
pink line illustrates the computed step to φ^{l+1}. The blue line illustrates the analytical
solution. Also illustrated is the local truncation error, which is the difference between
the analytical and numerical solutions at time step l + 1.
φ(t^{l+1}) = φ(t^l) + (∆t/1!) dφ/dt|_{t^l} + (∆t²/2!) d²φ/dt²|_{t^l} + (∆t³/3!) d³φ/dt³|_{t^l} + . . .
φ(t^{l−1}) = φ(t^l) − (∆t/1!) dφ/dt|_{t^l} + (∆t²/2!) d²φ/dt²|_{t^l} − (∆t³/3!) d³φ/dt³|_{t^l} + . . .
Substituting into Equation 8.1, along with the relation that f (φl , tl ) is the first
derivative we get:
φ(t^l) + (∆t/1!) dφ/dt|_{t^l} + (∆t²/2!) d²φ/dt²|_{t^l} + (∆t³/3!) d³φ/dt³|_{t^l} + . . .
= φ(t^l) − (∆t/1!) dφ/dt|_{t^l} + (∆t²/2!) d²φ/dt²|_{t^l} − (∆t³/3!) d³φ/dt³|_{t^l} + . . . + 2 (∆t/1!) dφ/dt|_{t^l}
So it can be observed that we can cancel everything up to the terms involving the
third order derivatives and so we can infer from this that the Leapfrog method will
have a local truncation error of order ∆t3 and hence a global truncation error of
order ∆t2 .
Now, in practice, the way in which we would generally apply a Leapfrog method
is to a system of two ODEs. Then, as we integrate forward in time, we would
stagger the computations, meaning that for one of the dependent variables we would
compute values at time steps l, l +1, l +2, . . . as usual, but for the other, we compute
values at time steps l + 1/2, l + 3/2, l + 5/2, . . .. It is not very often in this book that we will
use specific variables in the development of the numerical method, but in this case,
this approach is warranted, since it will help clarify the way the algorithm works.
The ‘classic’ use of a Leapfrog method, is integrating equations of motion where we
have:
dv/dt = a
dr/dt = v
where the variables r, v , a denote position, velocity, and acceleration of a particle
respectively. The way in which we would apply the Leapfrog method to solve this
system would be to write:
v^{l+1/2} = v^{l−1/2} + ∆t a^l    (8.2)
r^{l+1} = r^l + ∆t v^{l+1/2}    (8.3)
So we can see that the computations of r and v are staggered, or ‘leaping over’ one
another. Now, since we will be starting our time marching from an integer time
step, we will need to compute the velocity half a time step ahead, and the simplest
way in which we could do this would be as:
v^{l+1/2} = v^l + (∆t/2) a^l    (8.4)
which is just an explicit Euler method stepping forward half a time step. We can
also use this result to write the Leapfrog method in a form that will only involve
computations at integer time steps. If we substitute Equation 8.4 into Equation 8.3
we get:
r^{l+1} = r^l + ∆t ( v^l + (∆t/2) a^l )
        = r^l + ∆t v^l + (∆t²/2) a^l
We can also then shift Equation 8.2 forward half a time step (i.e. adding 1/2 to each
time step index) and then approximate the resulting a^{l+1/2} as the mean of a^l and a^{l+1}
(which is actually what we did with the Crank-Nicolson method in Chapter 7) so
that we get the discrete system:
r^{l+1} = r^l + ∆t v^l + (∆t²/2) a^l
v^{l+1} = v^l + (∆t/2) ( a^l + a^{l+1} )
Although these equations look quite different to Equations 8.3 and 8.2, it is still
the same numerical method, with the positions and velocities now both defined at integer time steps.
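As a minimal sketch of the staggered form in Equations 8.2 to 8.4 (assuming, purely for illustration, that the acceleration can be evaluated from the position as a(r) = −4r, so that the results are comparable with Example 8.1):

Delta_t = 0.05;  N_t = 201;
r      = zeros(1, N_t);
r(1)   = 1.0;                              % initial position
v_half = 0 + Delta_t/2*(-4*r(1));          % Equation 8.4: velocity half a step ahead
for l = 1:N_t-1
    r(l+1) = r(l) + Delta_t*v_half;        % Equation 8.3
    v_half = v_half + Delta_t*(-4*r(l+1)); % Equation 8.2, shifted to the next half step
end

The velocity is never stored at integer time steps here; if values of v at integer steps are required, the integer-step form derived above can be used instead.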
Example 8.1:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
d²φ/dt² = −4φ
in the domain t ∈ [0, 10], with initial conditions φ(0) = 1 and φ̇(0) = 0 and
compare the numerical solution with the exact solution φ(t) = cos(2t) and hence
φ̇(t) = −2 sin(2t). The intended learning outcomes for this example will be to sim-
ply observe the application of the Leapfrog method to solve a system of ODEs.
As we did in Examples 6.3 and 7.1, we will begin by breaking this second order
ODE into a system of first order ODEs by making the definition:
φ_1 = φ
φ_2 = dφ_1/dt
which, applied to Equation 6.11, gives the 2 × 2 system:
dφ_1/dt = φ_2
dφ_2/dt = −4φ_1
We can then apply the Leapfrog method at integer time steps, to get:
φ_1^{l+1} = φ_1^l + ∆t φ_2^l + (∆t²/2)(−4φ_1^l)
φ_2^{l+1} = φ_2^l + (∆t/2)( −4φ_1^l + (−4φ_1^{l+1}) )
which are explicit expressions that we can implement in our code as:
for l=1:N_t-1
t(l+1) = t(l) + Delta_t;
phi(1,l+1) = phi(1,l) + Delta_t*phi(2,l) - 2*Delta_t^2*phi(1,l);
phi(2,l+1) = phi(2,l) - 2*Delta_t*(phi(1,l) + phi(1,l+1));
end
in Matlab, and:
for(int l=0; l<N_t-1; l++)
{
t[l+1] = t[l] + Delta_t;
phi[0][l+1] = phi[0][l] + Delta_t*phi[1][l] - 2*Delta_t*Delta_t*phi[0][l];
phi[1][l+1] = phi[1][l] - 2*Delta_t*(phi[0][l] + phi[0][l+1]);
}
in C++. These two code snippets are essentially all that is required to implement
a Leapfrog method for solving this simple ODE. Note that the major differences are
the indexing between Matlab and C++, computing ∆t² slightly differently
in C++, and the limits of the for loop.
The complete program is given in Example8_1.m. Figures 8.2(a) - 8.2(b) illustrate
the solution and it can be observed that for the time step size ∆t = 0.05 the solution
appears to be of a similar accuracy to the Crank-Nicolson solution in Example 7.1
(Figures 7.2(a) - 7.2(b)) and more accurate than the solution using the implicit
Euler method in Example 6.3 (Figures 6.8(a) - 6.8(b)).
Figure 8.2: Solution to the ODE system in Example 8.1 with ∆t = 0.05 showing (a)
φ1 = φ(t) and (b) φ2 = φ̇(t). The green lines show the exact solution and the blue
dotted lines show the numerical solution.
Applying the Leapfrog method to the model problem φ̇ = λφ gives:
φ^{l+1} = φ^{l−1} + 2λ∆t φ^l
This equation is different to those for the Euler and Crank-Nicolson methods that
we have seen thus far, since it contains multiple time steps. In order to compute
the amplification factor we assume φ^l = φ^0 σ^l as we had with the previous methods.
Substituting in we get:
σ² − 2λ∆t σ − 1 = 0
so we have a polynomial in terms of σ to solve. It is in fact a key characteristic of
multistep methods that we have multiple roots for the amplification factor and in
this case we have two. Using the Quadratic Formula, the solution can be shown to
be:
σ_{1,2} = λ∆t ± √( λ²∆t² + 1 )
Then, using a Power Series Expansion[54] for the term in the square root:
√(1 + α) = 1 + (1/2)α − (1/8)α² + (1/16)α³ − (5/128)α⁴ + . . .
we can approximate the roots in terms of powers of λ∆t to get:
σ_1 = λ∆t + √( λ²∆t² + 1 ) = λ∆t + 1 + (1/2)λ²∆t² − (1/8)λ⁴∆t⁴ + . . .
σ_2 = λ∆t − √( λ²∆t² + 1 ) = λ∆t − 1 − (1/2)λ²∆t² + (1/8)λ⁴∆t⁴ + . . .
If we consider the limit as ∆t → 0 and forget about the terms containing ∆t2 and
higher we obtain the asymptotic forms:
σ1 ≈ 1 + λ∆t
σ2 ≈ −1 + λ∆t
and so, considering σ_1 and using our relation φ^l = φ^0 σ^l, we see that φ^l ≈ φ^0 e^{λt}, which is the analytical
solution of the model problem. If we now consider σ_2, then applying the same logic,
this root does not correspond to the solution of the model problem and for this
reason it is known as the spurious root. So in order to determine the stability of the
Leapfrog method we need to analyze each root separately for different values of λ.
If λ > 0 then it is possible that |σ2 | < 1 and so the spurious solution decays,
whereas it must be the case that |σ1 | > 1 and so the true solution grows, hence the
method will be unstable in this case. If on the other hand λ < 0 then it is possible
that |σ_1| < 1, but it must be the case that |σ_2| > 1 and so the spurious solution will
grow and make the method unstable. So we have seen that either way, one of the
roots is going to make the method unstable. In fact the only way in which we can get
a stable solution is if λ_Re = 0, and then in order for |σ| ≤ 1 we must have −i ≤ λ∆t ≤ i,
meaning that the region of stability of the Leapfrog Method is confined to a line
along the imaginary axis (Figure 8.3). This might seem a bit disappointing given
the comparison to the stability regions of the implicit Euler and Crank-Nicolson
methods that we came across in Chapters 6 and 7, but the Leapfrog method is still
useful if we are solving problems that have purely oscillatory solutions. As we shall
see in Part V this does occur in practice.
Figure 8.3: The stability diagram for the Leapfrog method (axes λ_Re ∆t and λ_Im ∆t), with the stable region confined to the segment of the imaginary axis between −i and +i; everywhere else is unstable.
Considering, as before, the case where λ is purely imaginary, the first root becomes:
σ_1 = iλ_Im ∆t + √( (iλ_Im ∆t)² + 1 )
    = Z e^{iθ}
where:
Z = √( ( √(1 − (λ_Im ∆t)²) )² + (λ_Im ∆t)² )
  = √( 1 − (∆tλ_Im)² + (∆tλ_Im)² )
  = 1
So we can see that as with the Crank-Nicolson method there is no amplitude error
associated with the Leapfrog method. In order to compute the phase error we will
approximate the first amplification factor with the power series expansion up to
order O((λ∆t)²). In this case we can write:
σ_1 ≈ 1 − (λ_Im ∆t)²/2 + iλ_Im ∆t
and:
θ = tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) )
Here we will use the power series expansion for the tan−1 function as we did previ-
ously, but we will also use the power series expansion:
1/(1 − α) = 1 + α + α² + α³ + α⁴ + . . .
tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) ) = tan⁻¹( λ_Im ∆t ( 1 / (1 − (λ_Im ∆t)²/2) ) )
= tan⁻¹( λ_Im ∆t ( 1 + (λ_Im ∆t)²/2 + ((λ_Im ∆t)²/2)² + ((λ_Im ∆t)²/2)³ + . . . ) )
= tan⁻¹( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )
= ( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )
  − (1/3)( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )³
  + (1/5)( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )⁵ + . . .
≈ λ_Im ∆t + (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . .
Now, using the same approach that we did for the Euler and Crank-Nicolson meth-
ods, we can show that the phase error is given by:
θ − λ_Im ∆t = tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) ) − λ_Im ∆t
            = λ_Im ∆t + (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . . − λ_Im ∆t
            = (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . .
So it can be observed that the phase error associated with the Leapfrog method is
of the same order as for the Euler and Crank-Nicolson methods and having 6 in
the denominator means that the phase error should be less than that introduced by
the Euler methods, but more than that introduced by the Crank-Nicolson method.
This time however the leading order term is positive, so we would expect a phase
lag in the solution (meaning that the oscillations in the numerical solution will be
behind the exact solution). Figure 8.4 illustrates the solution to the model problem
after a number of time steps for the Leapfrog method. As can be observed, the
numerical solution is behind the exact solution, but has not developed any error in
the amplitude.
It is worth finishing this study of the Leapfrog method by comparing it to the
Crank-Nicolson method. Both have second order accuracy, no amplitude error, and
similar phase error magnitudes. The stability region of the Leapfrog Method is
Figure 8.4: The solution to the model problem with the Leapfrog Method after a
number of cycles, when λ = i. The green curve illustrates the analytical solution
and the blue curve illustrates the numerical solution, in particular highlighting the
phase error and that there is no amplitude error.
restricted to the imaginary axis, meaning that it is only stable for purely oscillatory
problems, so you might wonder why we wouldn’t just always choose the Crank-
Nicolson method, since the stability region was the entire left hand plane. One very
important practical reason is that the Leapfrog Method is fully explicit, unlike the
Crank-Nicolson method where we must solve a system of equations at each time step
and so for problems where solving a system would be too computationally intensive
the Leapfrog method could be a good choice.
Chapter 9
Adams-Bashforth Methods
Figure 9.1: A schematic illustrating one step in the second order Adams-Bashforth
method. The green arrows illustrate the gradients f (φl−1 , tl−1 ) and f (φl , tl ) which
are both used to step the solution forward. The pink line illustrates the computed step
to φl+1 . The blue line illustrates the analytical solution.
In Chapter 8 we studied the Leapfrog method, which used the solution from
two previous time steps in order to improve the accuracy. Another possibility is to
use the gradient from previous time steps, which is in fact what we will do with
the Adams-Bashforth methods, which are the focus of this chapter (Figure 9.1).
This family of methods also falls into the category of multistep methods and having
now been exposed to one implementation of a multistep method with the Leapfrog
method, it is worth generalizing our notion of multistep methods, where the most
general form can be written as:
Σ_{n=0}^{N} a_n φ^{l+1−n} = ∆t Σ_{n=0}^{N} b_n f(φ^{l+1−n}, t^{l+1−n})
Here, on the left hand side we have the solution itself at time steps φl+1−n ranging
from the new time step φl+1 back n time steps and on the right hand side we have the
gradient evaluated at time steps f (φl+1−n , tl+1−n ) ranging from the new time step
f (φl+1 , tl+1 ) back n time steps. These terms are then weighted by coefficients an
and bn and in developing our particular numerical method we will have to determine
these values. For the development of our Adams-Bashforth methods we will choose
a0 = 1 and b0 = 0. Because of this second choice we will have an explicit method,
because we won’t have f (φl+1 , tl+1 ) on the right hand side (as we did say with the
Crank-Nicolson method). So we will derive two explicit variations of the method
with different orders of accuracy, but it should be noted that other variations of the
method exist, including implicit methods.
A useful starting point for deriving the Adams-Bashforth methods is to return
to the idea that we compute φl+1 by integration (as we did with the Crank-Nicolson
method):
φ^{l+1} = φ^l + ∫_{t^l}^{t^{l+1}} f(φ, t) dt
This time however, instead of using the trapezoidal rule to approximate the integra-
tion, we will instead replace f (φ, t) with an interpolation polynomial [34] p(t) which
we will be able to integrate. This gives approximations φl+1 of φ(tl+1 ) and φl of
φ(tl ):
φ^{l+1} = φ^l + ∫_{t^l}^{t^{l+1}} p(t) dt    (9.1)
Different choices for p(t) will produce the specific variations of the method. Before
we define what this interpolation polynomial will look like we will define the notation:
f^0 ≡ f(φ^0, t^0)
f^1 ≡ f(φ^1, t^1)
⋮
f^l ≡ f(φ^l, t^l)
to make the derivation a little less verbose. That being said, imagine now that we
have the set of data points:
(t^0, f^0), (t^1, f^1), . . . , (t^{N−1}, f^{N−1})
(i.e. at each time step we have evaluated f). Two important points to note at this
stage are that we are starting from t0 and our set of data points is moving forward
in time. We shall see later on that we can generalize both of these assumptions so
that the starting point is arbitrary and we can count backwards in time. With this
in mind we can then say that the Newton form of the interpolation polynomial takes
the form:
p_{N−1}(t) = a_0 + a_1 η_1(t) + a_2 η_2(t) + . . . + a_{N−1} η_{N−1}(t)
where:
η_n(t) = Π_{m=0}^{n−1} (t − t^m) = (t − t^0) × (t − t^1) × . . . × (t − t^{n−1})
Now, because by definition our interpolation polynomial passes through all of the
data points, we have pN −1 (t0 ) = f 0 and using this result we can observe that all but
the first terms will be zero (because they will include the term (t0 − t0 )) and hence:
a0 = f 0
Following this line of reasoning we also have pN −1 (t1 ) = f 1 and can observe that in
this case all but the first two terms will be zero and hence:
f^1 = a_0 + a_1 (t^1 − t^0)
    = f^0 + a_1 (t^1 − t^0)
a_1 = (f^1 − f^0) / (t^1 − t^0)
Continuing this approach one more time (and after a bit of algebra) we get:
a_2 = [ (f^2 − f^1)/(t^2 − t^1) − (f^1 − f^0)/(t^1 − t^0) ] / (t^2 − t^0)
Now, another way to express these coefficients of our interpolation polynomial is:
a_n = [f^0, f^1, . . . , f^n]
Here, the notation [, ] defines what are known as the Divided Differences[14] where:
[f^0] = f^0
[f^0, f^1] = (f^1 − f^0)/(t^1 − t^0)
[f^0, f^1, f^2] = ( [f^1, f^2] − [f^0, f^1] ) / (t^2 − t^0)
              = ( (f^2 − f^1)/(t^2 − t^1) − (f^1 − f^0)/(t^1 − t^0) ) / (t^2 − t^0)
              = (f^2 − f^1)/((t^2 − t^1)(t^2 − t^0)) − (f^1 − f^0)/((t^1 − t^0)(t^2 − t^0))
[f^0, f^1, f^2, f^3] = ( [f^1, f^2, f^3] − [f^0, f^1, f^2] ) / (t^3 − t^0)
are known as Forward Divided Differences and:
[f^0] = f^0
[f^1, f^0] = (f^1 − f^0)/(t^1 − t^0)
[f^2, f^1, f^0] = ( [f^2, f^1] − [f^1, f^0] ) / (t^2 − t^0)
              = ( (f^2 − f^1)/(t^2 − t^1) − (f^1 − f^0)/(t^1 − t^0) ) / (t^2 − t^0)
              = (f^2 − f^1)/((t^2 − t^1)(t^2 − t^0)) − (f^1 − f^0)/((t^1 − t^0)(t^2 − t^0))
[f^3, f^2, f^1, f^0] = ( [f^3, f^2, f^1] − [f^2, f^1, f^0] ) / (t^3 − t^0)
are known as Backward Divided Differences. Now, since with our Adams-Bashforth
method we have t1 − t0 = t2 − t1 = . . . = ∆t (i.e. a constant time step size), we can
simplify these Backward Divided Differences as:
[f^0] = f^0
[f^1, f^0] = (f^1 − f^0)/∆t
[f^2, f^1, f^0] = ( [f^2, f^1] − [f^1, f^0] ) / (2∆t) = (f^2 − 2f^1 + f^0)/(2∆t²)
[f^3, f^2, f^1, f^0] = ( [f^3, f^2, f^1] − [f^2, f^1, f^0] ) / (3∆t) = (f^3 − 3f^2 + 3f^1 − f^0)/(6∆t³)
Furthermore, we can say that t = t0 + l∆t and simplify the form of the interpolation
polynomial to be:
which is known as the Newton Backward Divided Difference Formula and is the form
we will use to represent our interpolation polynomial. It is at this point that we
choose the order of the polynomial and hence how many terms in Equation 9.2 we
wish to keep. Now, you may well wonder how this polynomial is still a function of
time, since we don’t see t appearing anywhere in Equation 9.2. The answer is that
because we have the relation t = t0 + l∆t and our polynomial is a function of l then
in that sense it is a function of t.
p1 (t) = [f 1 ] + [f 1 , f 0 ]l∆t
We must now integrate this polynomial between t^l and t^{l+1}, but to simplify the in-
tegration we will use the limits t^0 and t^1 and then note that because the definition
of these data points is arbitrary the result applies for t^l to t^{l+1}. Now, because our
interpolation polynomial is a function of l we will use the relation t = t0 + l∆t and
perform a change of variables. In this case we can write:
dt = ∆tdl
and so
∫_{t^0}^{t^1} p_1(t) dt = ∫_0^1 p_1(l) ∆t dl
∫_0^1 p_1 ∆t dl = ∆t ∫_0^1 ( [f^1] + [f^1, f^0] l∆t ) dl
              = ∆t [ [f^1] l + [f^1, f^0] ∆t l²/2 ]_0^1
              = ∆t ( [f^1] + [f^1, f^0] ∆t/2 )    (9.3)
If we now substitute for the simplified backward divided difference expressions into
Equation 9.3 we get:
∫_0^1 p_1 ∆t dl = ∆t ( [f^1] + [f^1, f^0] ∆t/2 )
              = ∆t ( f^1 + ((f^1 − f^0)/∆t)(∆t/2) )
              = ∆t ( (3/2) f^1 − (1/2) f^0 )
But since this integral expression is valid between any two consecutive time steps,
we can replace the 0 and 1 superscripts with l − 1 and l so that substituting into
Equation 9.1 we obtain the second order Adams-Bashforth method:
φ^{l+1} = φ^l + ∆t ( (3/2) f^l − (1/2) f^{l−1} )    (9.4)
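As a minimal sketch of how Equation 9.4 might be used in practice (assuming a right hand side function f(phi) and arrays phi, t, and f_store of the appropriate sizes; an explicit Euler step is used for the very first step because f^{l−1} does not exist at l = 1):

for l = 1:N_t-1
    t(l+1) = t(l) + Delta_t;
    f_store(:,l) = f(phi(:,l));           % evaluate and store the gradient at step l
    if l > 1
        % second order Adams-Bashforth step (Equation 9.4)
        phi(:,l+1) = phi(:,l) + Delta_t/2*(3*f_store(:,l) - f_store(:,l-1));
    else
        % explicit Euler step to get the method started
        phi(:,l+1) = phi(:,l) + Delta_t*f_store(:,l);
    end
end

Exactly the same structure, with more stored gradients and a longer startup phase, is used for the fourth order method in Example 9.1 below.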
For the fourth order method we follow exactly the same approach as we did for the second order method
and as such can jump straight into the integration:
∫_0^1 p_3 ∆t dl = ∆t ∫_0^1 ( f^3 + [f^3, f^2] l∆t + [f^3, f^2, f^1] l(l + 1)∆t² + [f^3, f^2, f^1, f^0] l(l + 1)(l + 2)∆t³ ) dl
= ∆t [ [f^3] l + [f^3, f^2] ∆t l²/2 + [f^3, f^2, f^1] ∆t² ( l³/3 + l²/2 ) + [f^3, f^2, f^1, f^0] ∆t³ ( l⁴/4 + l³ + l² ) ]_0^1
= ∆t ( [f^3] + [f^3, f^2] ∆t/2 + [f^3, f^2, f^1] 5∆t²/6 + [f^3, f^2, f^1, f^0] 9∆t³/4 )    (9.5)
If we now again substitute for the simplified backward divided difference expressions
into Equation 9.5 we get:
∫_0^1 p_3 ∆t dl = ∆t ( [f^3] + [f^3, f^2] ∆t/2 + [f^3, f^2, f^1] 5∆t²/6 + [f^3, f^2, f^1, f^0] 9∆t³/4 )
= ∆t ( f^3 + ((f^3 − f^2)/∆t)(∆t/2) + ((f^3 − 2f^2 + f^1)/(2∆t²))(5∆t²/6) + ((f^3 − 3f^2 + 3f^1 − f^0)/(6∆t³))(9∆t³/4) )
= ∆t ( (55/24) f^3 − (59/24) f^2 + (37/24) f^1 − (9/24) f^0 )
But since this integral expression is valid between any four consecutive time steps,
we can replace the 0 to 3 superscripts with l − 3 to l, so that substituting into
Equation 9.1 we obtain the fourth order Adams-Bashforth method:
φ^{l+1} = φ^l + (∆t/24) ( 55 f^l − 59 f^{l−1} + 37 f^{l−2} − 9 f^{l−3} )    (9.6)
Example 9.1:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
dφ_1/dt = φ_2 φ_3
dφ_2/dt = −φ_1 φ_3
dφ_3/dt = −0.5 φ_1 φ_2    (9.7)
in the domain t ∈ [0, 10], with initial conditions φ1 (0) = 0, φ2 (0) = 1, φ3 (0) = 1. The
intended learning outcome for this example will be to simply observe the application
of the fourth order Adams-Bashforth method to solve a system of ODEs.
One of the important features of the Adams-Bashforth methods is that we need
to keep track of the right hand side of our system of ODEs at multiple time steps.
There are two ways in which we could do this; either we store the gradient at multiple
time steps, or we reevaluate it at each time step. We will use the former approach
and as such will create an array f which will store the right hand side throughout
the simulation. So our algorithm might look something like:
for l = 1:N_t-1
t(l+1) = t(l) + Delta_t;
f(:,l) = [ phi(2,l)*phi(3,l); - phi(1,l)*phi(3,l); -0.5*phi(1,l)*phi(2,l) ];
phi(:,l+1) = phi(:,l) + Delta_t/24*(55*f(:,l) - 59*f(:,l-1) + 37*f(:,l-2) - 9*f(:,l-3));
end
in Matlab and:
for(l=0; l<N_t-1; l++)
{
t[l+1] = t[l] + Delta_t;
f[l][0] = phi[l][1]*phi[l][2];
f[l][1] = - phi[l][0]*phi[l][2];
f[l][2] = -0.5*phi[l][0]*phi[l][1];
for(e=0; e<N_e; e++)
{
phi[l+1][e] = phi[l][e] ...
+ Delta_t/24*(55*f[l][e] - 59*f[l-1][e] + 37*f[l-2][e] - 9*f[l-3][e]);
}
}
in C++. One very important point to consider with multistep methods is what to
do on the first time step, and for a method of order N , what to do on the first N
time steps. The reason being that with a fourth order Adams-Bashforth method,
at time step l = 1, the algorithm requires values of f going back l − 3 time steps,
which is obviously a negative value. A common way to deal with this problem is
to use a different method until there are enough previous data points to be able to
use the method. For example we could use an explicit Euler method for the first 4
time steps, or perhaps just the first time step, then a second order Adams-Bashforth
method for the next two time steps. This is in fact what we will do and so our overall
algorithm will look like:
for l = 1:N_t-1
t(l+1) = t(l) + Delta_t;
f(:,l) = [ phi(2,l)*phi(3,l); - phi(1,l)*phi(3,l); -0.5*phi(1,l)*phi(2,l) ];
if l>3
phi(:,l+1) = phi(:,l) + Delta_t/24*(55*f(:,l) - 59*f(:,l-1) + 37*f(:,l-2) - 9*f(:,l-3));
elseif l>1
phi(:,l+1) = phi(:,l) + Delta_t/2 *(3 *f(:,l) - f(:,l-1));
else
phi(:,l+1) = phi(:,l) + Delta_t * f(:,l);
end
end
in Matlab and:
for(l=0; l<N_t-1; l++)
{
t[l+1] = t[l] + Delta_t;
f[l][0] = phi[l][1]*phi[l][2];
f[l][1] = - phi[l][0]*phi[l][2];
f[l][2] = -0.5*phi[l][0]*phi[l][1];
if (l>3)
{
for(e=0; e<N_e; e++)
{
phi[l+1][e] = phi[l][e] ...
+ Delta_t/24*(55*f[l][e] - 59*f[l-1][e] + 37*f[l-2][e] - 9*f[l-3][e]);
}
}
else if (l>1)
{
for(e=0; e<N_e; e++)
{
phi[l+1][e] = phi[l][e] ...
+ Delta_t/2 *(3 *f[l][e] - f[l-1][e]);
}
}
else
{
for(e=0; e<N_e; e++)
{
phi[l+1][e] = phi[l][e] + Delta_t * f[l][e];
}
}
}
in C++.
The complete programs are given in Example9_1.m and Example9_1.cpp. Figure
9.2 shows the numerical solution to Equation 9.7 obtained using the fourth order
Adams-Bashforth method. It can be observed that in comparison to the solution
obtained using the implicit Euler method in Example 6.4 (Figure 6.9) the Adams-
Bashforth solution is more accurate, since the amplitude of the oscillations remains
constant throughout the simulation.
Figure 9.2: Solution to the ODE system in Example 9.1 with ∆t = 0.1.
In order to analyze the stability of the second order Adams-Bashforth method we apply it to the model problem and assume φ^l = φ^0 σ^l as before, which gives the amplification factors:
σ = ( 1 + (3/2)λ∆t ± √( 1 + λ∆t + (9/4)(λ∆t)² ) ) / 2
Using a power series expansion [54] to approximate the term in the square root:
√( 1 + λ∆t + (9/4)(λ∆t)² ) = 1 + (1/2)( λ∆t + (9/4)(λ∆t)² ) − (1/8)( λ∆t + (9/4)(λ∆t)² )² + . . .
σ_1 = 1 + λ∆t + (1/2)(λ∆t)² + . . .
σ_2 = (1/2)λ∆t − (1/2)(λ∆t)² + . . .
σ = ( e^{Niθ} − e^{(N−1)iθ} ) / Σ_{n=0}^{N} b_n e^{i(N−n)θ}
σ = ( e^{2iθ} − e^{iθ} ) / ( (3/2) e^{2iθ} − (1/2) e^{iθ} )    (9.8)
we can then evaluate Equation 9.8 in the range 0 ≤ θ ≤ 2π and plot σ as a function
of θ. Figure 9.3 illustrates the stability region obtained using this approach for
the first four Adams-Bashforth methods and it should be noted that the region of
stability lies inside of the boundary. It can be observed that the stability regions
get smaller with increasing order accuracy and cross the real axis increasingly closer
to the origin. The second order Adams-Bashforth method is also only tangent to
the imaginary axis and thus, strictly speaking, it is unstable for purely imaginary λ, but it turns
out that the instability is very mild.
In order to perform the error analysis we as always consider the case where λ is
purely imaginary and get the amplification factor into polar form as:
Figure 9.3: The stability diagrams for the Adams-Bashforth methods of various
orders, including the first order (black line), second order (red line), third order
(green line), and fourth order (blue line).
σ = 1 + iλ_Im ∆t + (iλ_Im ∆t)²/2
  = Z e^{iθ}
where:
Z = √( ( 1 − (λ_Im ∆t)²/2 )² + (∆tλ_Im)² )
  = √( 1 + (λ_Im ∆t)⁴/4 )
and:
θ = tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) )
Using the same approach as before, we can show that the phase error is given by:
θ − λ_Im ∆t = tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) ) − λ_Im ∆t
            = λ_Im ∆t + (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . . − λ_Im ∆t
            = (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . .
So it can be observed that the phase error associated with the second order Adams-
Bashforth method is the same as for the Leapfrog method (at least approximately)
but it also has an amplitude error associated with it. Figures 9.4(a) and 9.4(b)
illustrate the solution to the model problem after a number of time steps for the
second and fourth order Adams-Bashforth methods respectively. As can be observed,
for the second order method the numerical solution is behind the exact solution and
has developed an observable amplitude error, whereas for the fourth order method
this error is not as pronounced.
Figure 9.4: The solution to the model problem with the Adams-Bashforth Method
after a number of cycles, when λ = i for (a) the second order and (b) the fourth order
Adams-Bashforth Methods. The green curves illustrate the analytical solution and
the blue curves illustrate the numerical solution, in particular highlighting the phase
and amplitude error.
Chapter 10
Runge-Kutta methods
Figure 10.1: A schematic illustrating one step in the second order Runge-Kutta
method. The green arrows illustrate the gradients f (φl , tl ) and f (φl + k1 ∆t, tl + ∆t)
which are both used to step the solution forward. The pink line illustrates the computed
step to φl+1 . The blue line illustrates the analytical solution.
In contrast to the multistep methods studied in the last two chapters, Runge-
Kutta methods fall under the category of multistage methods and are probably the
most popular methods for solving initial value problems. However, many variations of
the Runge-Kutta methods exist of varying orders of accuracy. The basic idea behind
the method is that the order of accuracy can be increased if one supplies additional
information about the function f . Runge-Kutta methods introduce multiple stages
between tl and tl+1 , and evaluate f at these intermediate stages (Figure 10.1). The
additional function evaluations, of course, result in higher cost per time step, but
the accuracy is increased, and as it turns out, better stability properties are also
obtained. While there are implicit Runge-Kutta methods, the variations that we will
be dealing with (which are the more common variation) are fully explicit methods.
In general, Runge-Kutta methods approximate the solution to Equation 5.1 as:
φ^{l+1} = φ^l + ∆t g    (10.1)
where g is a weighted combination of gradient estimates:
g = a_1 k_1 + a_2 k_2 + a_3 k_3 + a_4 k_4 + · · · + a_N k_N    (10.2)
where:
k1 = f (φl , tl )
k2 = f (φl + q11 ∆tk1 , tl + p1 ∆t)
k3 = f (φl + q21 ∆tk1 + q22 ∆tk2 , tl + p2 ∆t)
k4 = f (φl + q31 ∆tk1 + q32 ∆tk2 + q33 ∆tk3 , tl + p3 ∆t)
..
.
kN = f (φl + qN −1,1 ∆tk1 + qN −1,2 ∆tk2 + · · · + qN −1,N −1 ∆tkN −1 , tl + pN −1 ∆t)
These k values are actually estimates of the gradient at the different intermediate
points. For N = 1, we get the first order Runge-Kutta method, which is in fact
equivalent to the explicit Euler method presented in Chapter 6.
We are going to derive two methods of different orders. In order to do this, we
are going to need to make use of a couple of formulae. The first is the Taylor series
expansion:
φ(t^{l+1}) = φ(t^l) + (∆t/1!) dφ/dt|_{t^l} + (∆t²/2!) d²φ/dt²|_{t^l} + O(∆t³)
The second is the total derivative of f with respect to t:
Df/Dt = (∂f/∂t)(dt/dt) + (∂f/∂φ)(dφ/dt)
     = ∂f/∂t + f ∂f/∂φ
which takes into account that a change in t can cause a change in φ. Substituting
this into the Taylor series expansion we get:
φ(t^{l+1}) = φ(t^l) + (∆t/1!) f(φ^l, t^l) + (∆t²/2!) Df/Dt|_{t^l} + O(∆t³)    (10.3)
We will also make use of the two-dimensional Taylor series expansion:
f(x + ∆x, y + ∆y) = f(x, y) + ∆x ∂f/∂x + ∆y ∂f/∂y + O(∆²)
Note that the use of variables x and y in place of φ and t was deliberate and done so
in order to make the derivations more clear when the step sizes are different for the
various stages of a Runge-Kutta method. Ultimately our derivations will come down
to developing two equations using these tools and equating coefficients to produce
a specific Runge-Kutta method.
g = a1 k1 + a2 k2
k1 = f (φl , tl )
k2 = f (φl + ∆tk1 , tl + ∆t)
At this point we then apply our two-dimensional Taylor series expansion for k_2,
noting that x ≡ φ, ∆x ≡ ∆t k_1 = ∆t f(φ^l, t^l), y ≡ t, and ∆y ≡ ∆t:
k_2 = f(φ^l, t^l) + ∆t k_1 ∂f/∂φ|_{t^l} + ∆t ∂f/∂t|_{t^l} + O(∆t²)
    = f(φ^l, t^l) + ∆t f ∂f/∂φ|_{t^l} + ∆t ∂f/∂t|_{t^l} + O(∆t²)
    = f(φ^l, t^l) + ∆t Df/Dt|_{t^l} + O(∆t²)
Then substituting into Equation 10.1 gives:
φ^{l+1} = φ^l + ∆t ( a_1 k_1 + a_2 k_2 )
       = φ^l + ∆t [ a_1 f(φ^l, t^l) + a_2 ( f(φ^l, t^l) + ∆t Df/Dt|_{t^l} ) ]
       = φ^l + ∆t (a_1 + a_2) f(φ^l, t^l) + a_2 ∆t² Df/Dt|_{t^l}    (10.4)
If we compare Equation 10.4 to the Taylor series expansion back in Equation 10.3
up to second order:
φ(t^{l+1}) = φ(t^l) + (∆t/1!) f(φ^l, t^l) + (∆t²/2!) Df/Dt|_{t^l} + O(∆t³)
and equate coefficients, we get the system of equations:
a_2 = 1/2
a_1 + a_2 = 1
One possible choice, generally referred to as the second order Runge-Kutta method, is therefore:
φ^{l+1} = φ^l + ∆t ( (1/2) k_1 + (1/2) k_2 )    (10.5)
where:
k_1 = f(φ^l, t^l)
k_2 = f(φ^l + ∆t k_1, t^l + ∆t)
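As a minimal sketch (assuming a right hand side function f(phi, t) and arrays phi, t, together with values of Delta_t and N_t set up as in the earlier examples), the time marching loop for this method is simply:

for l = 1:N_t-1
    t(l+1) = t(l) + Delta_t;
    k1 = f(phi(:,l), t(l));                         % gradient at the current point
    k2 = f(phi(:,l) + Delta_t*k1, t(l) + Delta_t);  % gradient at the predicted end point
    phi(:,l+1) = phi(:,l) + Delta_t*(k1/2 + k2/2);  % Equation 10.5
end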
For the fourth order Runge-Kutta method we take four stages and write:
g = a_1 k_1 + a_2 k_2 + a_3 k_3 + a_4 k_4
and furthermore choose q_11 = q_22 = 1/2, q_33 = 1, q_21 = q_31 = q_32 = 0, p_1 = p_2 = 1/2,
and p_3 = 1 so that we get:
k_1 = f(φ^l, t^l)
k_2 = f(φ^l + (∆t/2) k_1, t^l + ∆t/2)
k_3 = f(φ^l + (∆t/2) k_2, t^l + ∆t/2)
k_4 = f(φ^l + ∆t k_3, t^l + ∆t)
At this point we then apply our two-dimensional Taylor series expansion for k_2,
noting that x ≡ φ, ∆x ≡ (∆t/2) k_1 = (∆t/2) f(φ^l, t^l), y ≡ t, and ∆y ≡ ∆t/2:
k_2 = f(φ^l + (∆t/2) k_1, t^l + ∆t/2)
    = f(φ^l, t^l) + (∆t/2) k_1 ∂f/∂φ|_{t^l} + (∆t/2) ∂f/∂t|_{t^l} + O(∆t²)
    = f(φ^l, t^l) + (∆t/2) f ∂f/∂φ|_{t^l} + (∆t/2) ∂f/∂t|_{t^l} + O(∆t²)
    = f(φ^l, t^l) + (∆t/2) Df/Dt|_{t^l} + O(∆t²)
k_3 = f(φ^l + (∆t/2) k_2, t^l + ∆t/2)
    = f(φ^l, t^l) + (∆t/2) (D/Dt)[ f(φ^l, t^l) + (∆t/2) Df/Dt ]_{t^l} + O(∆t²)
k_4 = f(φ^l + ∆t k_3, t^l + ∆t)
    = f(φ^l, t^l) + ∆t (D/Dt)[ f(φ^l, t^l) + (∆t/2) (D/Dt)( f(φ^l, t^l) + (∆t/2) Df/Dt ) ]_{t^l} + O(∆t²)
φ^{l+1} = φ^l + ∆t ( a_1 k_1 + a_2 k_2 + a_3 k_3 + a_4 k_4 )
= φ^l + ∆t [ a_1 f(φ^l, t^l)
           + a_2 ( f(φ^l, t^l) + (∆t/2) Df/Dt|_{t^l} )
           + a_3 ( f(φ^l, t^l) + (∆t/2) (D/Dt)( f(φ^l, t^l) + (∆t/2) Df/Dt )|_{t^l} )
           + a_4 ( f(φ^l, t^l) + ∆t (D/Dt)( f(φ^l, t^l) + (∆t/2) (D/Dt)( f(φ^l, t^l) + (∆t/2) Df/Dt ) )|_{t^l} ) ]
= φ^l + (a_1 + a_2 + a_3 + a_4) ∆t f(φ^l, t^l) + ( (1/2) a_2 + (1/2) a_3 + a_4 ) ∆t² Df/Dt|_{t^l}
      + ( (1/4) a_3 + (1/2) a_4 ) ∆t³ D²f/Dt²|_{t^l} + (1/4) a_4 ∆t⁴ D³f/Dt³|_{t^l} + O(∆t⁵)    (10.6)
Comparing Equation 10.6 with the Taylor series expansion and equating coefficients gives:
a_1 + a_2 + a_3 + a_4 = 1
(1/2) a_2 + (1/2) a_3 + a_4 = 1/2
(1/4) a_3 + (1/2) a_4 = 1/6
(1/4) a_4 = 1/24
One possible choice, generally referred to as the fourth order Runge-Kutta method, is therefore:
φ^{l+1} = φ^l + ∆t ( (1/6) k_1 + (1/3) k_2 + (1/3) k_3 + (1/6) k_4 )    (10.7)
where:
k_1 = f(φ^l, t^l)
k_2 = f(φ^l + (∆t/2) k_1, t^l + ∆t/2)
k_3 = f(φ^l + (∆t/2) k_2, t^l + ∆t/2)
k_4 = f(φ^l + ∆t k_3, t^l + ∆t)
Example 10.1:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
dφ_1/dt = φ_2 φ_3
dφ_2/dt = −φ_1 φ_3
dφ_3/dt = −0.5 φ_1 φ_2    (10.8)
in the domain t ∈ [0, 10], with initial conditions φ1 (0) = 0, φ2 (0) = 1, φ3 (0) = 1.
The intended learning outcomes for this example will be to observe the application
of the fourth order Runge-Kutta method to solve a system of ODEs and to see how
to implement a function call for evaluating the right hand side of the system in both
Matlab and C++.
In order to begin we are first going to develop a method by which to evaluate the
k1 , k2 , k3 , k4 values (which will be 3 × 1 column vectors since we are dealing with
a system) at each time step and we will do so by creating a function f which we
can call repeatedly throughout the simulation, changing only the input arguments.
Doing so will keep the code shorter, more elegant, and easier to understand; always
a good thing. The function will encapsulate the right hand side of Equation 10.8.
We can do this in Matlab as:
function k = f(phi)
% right hand side of Equation 10.8 (the system is autonomous, so there is
% no explicit dependence on t)
k = zeros(size(phi));
k(1) = phi(2)*phi(3);
k(2) = -phi(1)*phi(3);
k(3) = -0.5*phi(1)*phi(2);
end
A similar function can be written in C++, but to call it within our time marching
loop we would need to do something like:
for(e=0; e<N_e; e++)
{
tempPhi[e] = phi[l][e] + Delta_t/2*k1[e];
}
f(k2, tempPhi);
The reason being that we can’t simply add arrays together ‘on the fly’ as we can in
Matlab. We can then explicitly update the system at each time step by:
φ^{l+1} = φ^l + ∆t ( (1/6) k_1 + (1/3) k_2 + (1/3) k_3 + (1/6) k_4 )
which would look something like:
for l = 1:N_t-1
t(l+1) = t(l) + Delta_t;
k1 = f(phi(:,l));
k2 = f(phi(:,l) + Delta_t/2*k1);
k3 = f(phi(:,l) + Delta_t/2*k2);
k4 = f(phi(:,l) + Delta_t *k3);
phi(:,l+1) = phi(:,l) + Delta_t *(k1/6 + k2/3 + k3/3 + k4/6);
end
in Matlab and:
for(l=0; l<N_t-1; l++)
{
t[l+1] = t[l] + Delta_t;
f(k1, phi[l]);
for(e=0; e<N_e; e++)
{
tempPhi[e] = phi[l][e] + Delta_t/2*k1[e];
}
f(k2, tempPhi);
for(e=0; e<N_e; e++)
{
tempPhi[e] = phi[l][e] + Delta_t/2*k2[e];
}
f(k3, tempPhi);
for(e=0; e<N_e; e++)
{
tempPhi[e] = phi[l][e] + Delta_t *k3[e];
}
f(k4, tempPhi);
for(e=0; e<N_e; e++)
{
phi[l+1][e] = phi[l][e] + Delta_t*(k1[e]/6 + k2[e]/3 + k3[e]/3 + k4[e]/6);
}
}
Figure 10.2: Solution to the ODE system in Example 10.1 with ∆t = 0.1.
In order to perform a stability analysis for the second order Runge-Kutta method
we begin by applying Equation 10.5 to the model equation of Equation 6.4 giving:
φ^{l+1} = φ^l + (∆t/2)( k_1 + k_2 )
       = φ^l + (∆t/2)( f(t^l, φ^l) + f(t^l + ∆t, φ^l + k_1 ∆t) )
       = φ^l + (∆t/2)( λφ^l + λ(φ^l + (λφ^l)∆t) )
       = φ^l + (∆t/2)( λφ^l + (λ + ∆tλ²)φ^l )
       = φ^l ( 1 + λ∆t + (λ∆t)²/2 )
so that:
φ^l = φ^0 ( 1 + λ∆t + (λ∆t)²/2 )^l
    = φ^0 σ^l    (10.9)
Considering, as before, the case where λ is purely imaginary:
σ = 1 + iλ_Im ∆t + (iλ_Im ∆t)²/2
  = Z e^{iθ}
where:
Figure 10.3: The stability diagrams for the Runge-Kutta methods of various orders,
including the first order (black line), second order (red line), third order (green line),
and fourth order (blue line).
Z = √( ( 1 − (λ_Im ∆t)²/2 )² + (∆tλ_Im)² )
  = √( 1 + (λ_Im ∆t)⁴/4 )
and:
θ = tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) )
Here we will use the power series expansion for the tan−1 function as we did previ-
ously, but we will also use the power series expansion:
1/(1 − α) = 1 + α + α² + α³ + α⁴ + . . .
such that the phase error can be given as:
tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) ) = tan⁻¹( λ_Im ∆t ( 1 / (1 − (λ_Im ∆t)²/2) ) )
= tan⁻¹( λ_Im ∆t ( 1 + (λ_Im ∆t)²/2 + ((λ_Im ∆t)²/2)² + ((λ_Im ∆t)²/2)³ + . . . ) )
= tan⁻¹( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )
= ( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )
  − (1/3)( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )³
  + (1/5)( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )⁵ + . . .
≈ λ_Im ∆t + (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . .
Using the same approach that we did for the Euler and Crank-Nicolson methods,
we can show that the phase error is given by:
θ − λ_Im ∆t = tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) ) − λ_Im ∆t
            = λ_Im ∆t + (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . . − λ_Im ∆t
            = (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . .
So it can be observed that the phase error associated with the second order Runge-
Kutta method is of the same order as for the Leapfrog method, with a leading
order term proportional to (λ_Im ∆t)³ which will result in a phase lag. Figures 10.4(a)
and 10.4(b) illustrate the solution to the model problem after a number of time
steps for the second and fourth order Runge-Kutta methods respectively. As can be
observed, for the second order method the numerical solution is behind the exact
solution and has developed an observable amplitude error, whereas the fourth order
method has not.
Figure 10.4: The solution to the model problem with the Runge-Kutta Method after
a number of cycles, when λ = i for (a) the second order and (b) the fourth order
Runge-Kutta Methods. The green curves illustrate the analytical solution and the
blue curves illustrate the numerical solution, in particular highlighting the phase
and amplitude error.
numerical method under control. In order to see how this can be done, recall that
the N th order Runge-Kutta scheme can be written as:
φ^{l+1} = φ^l + ∆t ψ(φ^l, t^l)
where ψ denotes the weighted combination of the k values, so that the local truncation error is:
e^{l+1}_local(∆t) = ( φ(t^{l+1}) − φ(t^l) )/∆t − ψ(φ(t^l), t^l)
                = ( φ(t^{l+1}) − φ^l )/∆t − ψ(φ(t^l), t^l)
                = ( φ(t^{l+1}) − ( φ^l + ∆t ψ(φ(t^l), t^l) ) )/∆t
Note that e^{l+1}_local(∆t) is O(∆t^N). Since the Runge-Kutta method requires that the
numerical value of ψ = g, we can continue the above as:
e^{l+1}_local(∆t) = (1/∆t)( φ(t^{l+1}) − φ^{l+1} )
                 = (1/∆t)( φ(t^{l+1}) − φ̂^{l+1} + φ̂^{l+1} − φ^{l+1} )
                 = ê^{l+1}_local(∆t) + (1/∆t)( φ̂^{l+1} − φ^{l+1} )
where φ̂^{l+1} denotes a higher order estimate of the solution. Recall that e^{l+1}_local(∆t) is O(∆t^N) and ê^{l+1}_local(∆t) is O(∆t^{N+1}). Thus, if ∆t is small,
e^{l+1}_local(∆t) can be simply approximated as:
e^{l+1}_local(∆t) ≈ (1/∆t)( φ̂^{l+1} − φ^{l+1} )    (10.16)
Let's now see how this information can be used to control the local truncation error.
Since e^{l+1}_local(∆t) is O(∆t^N), we can write:
e^{l+1}_local(∆t) = α∆t^N
If we increase or decrease the time step ∆t by a factor of β, then the local truncation
error would be:
e^{l+1}_local(β∆t) = α(β∆t)^N = β^N α∆t^N = β^N e^{l+1}_local(∆t) ≈ (β^N/∆t)( φ̂^{l+1} − φ^{l+1} )
using Equation 10.16. Thus if we want to bound the local truncation error to a small
value ε, then:
β ≤ ( ε∆t / |φ̂^{l+1} − φ^{l+1}| )^{1/N}
In practice, one usually sets:
β = ( ε∆t / ( 2|φ̂^{l+1} − φ^{l+1}| ) )^{1/N}
One popular method to implement the above algorithm is called the Runge-
Kutta-Fehlberg method. In this method φl+1 and φ̂l+1 are approximated as:
φ^{l+1} = φ^l + ∆t ( (25/216) k_1 + (1408/2565) k_3 + (2197/4104) k_4 − (1/5) k_5 )    (10.17)
φ̂^{l+1} = φ̂^l + ∆t ( (16/135) k_1 + (6656/12825) k_3 + (28561/56430) k_4 − (9/50) k_5 + (2/55) k_6 )    (10.18)
where:
k_1 = f( φ^l, t^l )
k_2 = f( φ^l + (∆t/4) k_1, t^l + ∆t/4 )
k_3 = f( φ^l + (3∆t/32) k_1 + (9∆t/32) k_2, t^l + 3∆t/8 )
k_4 = f( φ^l + (1932∆t/2197) k_1 − (7200∆t/2197) k_2 + (7296∆t/2197) k_3, t^l + 12∆t/13 )
k_5 = f( φ^l + (439∆t/216) k_1 − 8∆t k_2 + (3680∆t/513) k_3 − (845∆t/4104) k_4, t^l + ∆t )
k_6 = f( φ^l − (8∆t/27) k_1 + 2∆t k_2 − (3544∆t/2565) k_3 + (1859∆t/4104) k_4 − (11∆t/40) k_5, t^l + ∆t/2 )
It can be shown that the global error associated with φl+1 is O(∆t4 ) and the global
error associated with φ̂l+1 is O(∆t5 ). So N = 4 and β is calculated as:
β = 0.84 ( ε∆t / |φ̂^{l+1} − φ^{l+1}| )^{1/4}    (10.19)
Recall that the error at time level l +1 is approximated as |φ̂l+1 −φl+1 | and assuming
that there is no error at time level l, i.e. φ(tl ) ≈ φl ≈ φ̂l . So using Equations 10.17
and 10.18, the error at time level l + 1 is approximated as:
e = |φ̂^{l+1} − φ^{l+1}|
  = | (1/360) k_1 − (128/4275) k_3 − (2197/75240) k_4 + (1/50) k_5 + (2/55) k_6 |
It should be noted that as with the other Runge-Kutta methods, the Runge-
Kutta-Fehlberg method can be extended to a system of equations by treating the k
terms as vectors. In this case however, our error will also be a vector and we must
use the maximum error in order to determine how to adjust the time step size.
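As a minimal sketch (assuming the k values are column vectors, as in Example 10.1), the scalar error used to adjust ∆t would then be computed as:

% error estimate for a system: take the worst component (the infinity norm)
e_vec = 1/360*k1 - 128/4275*k3 - 2197/75240*k4 + 1/50*k5 + 2/55*k6;
e     = max(abs(e_vec));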
Example 10.2:
In this example we will develop a Matlab program to solve the example system:
dφ/dt = ( −4π/(t + 1)² ) cos( 4π/(t + 1) )    (10.20)
in the domain t ∈ [0, 10], with initial condition φ(0) = 0, and compare the numer-
ical solution with the analytical solution φ(t) = sin( 4π/(t + 1) ). The intended learning
outcome for this example will be to simply observe the application of the Runge-
Kutta-Fehlberg method to solve an ODE.
Similar to the fourth order algorithm developed in Example 10.1 we will develop
a function f which we will call repeatedly as the algorithm progresses. This will
take the form:
function val = f(phi, t)
val = -4*pi/(t+1)^2*cos(4*pi/(t+1));
end
A major difference compared to other numerical methods studied thus far is that if
we have a variable time step size ∆t then we don’t know exactly how many time steps
we’ll need to march through the temporal domain. As such we’ll use a while loop
instead of the usual for loop for marching through time, which will look something
like:
while ~finished
k1 = f(phi(l), t(l));
k2 = f(phi(l) + 1/4*k1*Delta_t, t(l) + 1/4*Delta_t);
k3 = f(phi(l) + 3/32 *k1*Delta_t + 9/32*k2*Delta_t, t(l) + 3/8*Delta_t);
...
e = abs(1/360*k1 - 128/4275*k3 - 2197/75240*k4 + 1/50*k5 + 2/55*k6);
if e < tolerance
t(l+1) = t(l) + Delta_t;
phi(l+1) = phi(l) + Delta_t*(25/216*k1 + 1408/2565*k3 + 2197/4104*k4 - 1/5*k5);
l = l+1;
end
beta = 0.84*(tolerance/e)^(1/4);
...
end
where some of the k values have been omitted so that the code snippet can fit onto
the page. So at each iteration through the while loop we are computing the φl+1 ,
but if the error in the computed value is too large then we will adjust the time step
size and try again. Only when the error is less than the tolerance that we specify,
do we update φ^{l+1} and move on. In order to adjust ∆t, we could do something like:
while ~finished
...
beta = 0.84*(tolerance/e)^(1/4);
if beta < 0.1
Delta_t = 0.1*Delta_t
elseif beta > 4.0
Delta_t = 4.0*Delta_t;
else
Delta_t = beta*Delta_t;
end
if Delta_t > Delta_t_max
Delta_t = Delta_t_max;
end
if t(l) >= t_max
finished = 1;
elseif t(l)+Delta_t>t_max
Delta_t = t_max-t(l);
elseif Delta_t < Delta_t_min
finished = 1;
end
end
Figure 10.5: Solution to the ODE system in Example 10.2 showing (a) the numer-
ical and analytical solutions in blue and green respectively (b) the variation in ∆t
throughout the course of the simulation.
df/dt = (∂f/∂t)(dt/dt) + (∂f/∂φ)(dφ/dt)
      = ∂f/∂t + f ∂f/∂φ
φ(t^{l+1}) = φ(t^l)
+ ∆t f(φ^l, t^l)
+ (∆t²/2) [ ∂f/∂t + f ∂f/∂φ ]_{t^l}
+ (∆t³/6) [ ∂/∂t( ∂f/∂t + f ∂f/∂φ ) + f ∂/∂φ( ∂f/∂t + f ∂f/∂φ ) ]_{t^l}
+ (∆t⁴/24) [ ∂/∂t( ∂/∂t( ∂f/∂t + f ∂f/∂φ ) + f ∂/∂φ( ∂f/∂t + f ∂f/∂φ ) ) + f ∂/∂φ( ∂/∂t( ∂f/∂t + f ∂f/∂φ ) + f ∂/∂φ( ∂f/∂t + f ∂f/∂φ ) ) ]_{t^l} + . . .
φl+1 = φl + ∆t (a1 k1 + a2 k2 )
where:
k1 = f (φl , tl )
k2 = f (φl + ∆tk1 , tl + ∆t)
Since the two dimensional Taylor series expansion of f(φ + ∆φ, t + ∆t) can be written
as:
f(φ + ∆φ, t + ∆t) = f(φ, t) + ∆φ ∂f/∂φ + ∆t ∂f/∂t + O(∆²)
we have:
φ^{l+1} = φ^l + ∆t ( a_1 k_1 + a_2 k_2 )
       = φ^l + ∆t [ a_1 f(φ^l, t^l) + a_2 ( f(φ^l, t^l) + ∆t f ∂f/∂φ + ∆t ∂f/∂t )|_{t^l} ]
       = φ^l + ∆t (a_1 + a_2) f(φ^l, t^l) + ∆t² a_2 ( f ∂f/∂φ + ∂f/∂t )|_{t^l}
Comparing with the coefficients from the first Taylor series expansion we get a_2 = 1/2, a_1 = 1/2.
φl+1 = φl + ∆t (a1 k1 + a2 k2 + a3 k3 + a4 k4 )
where:
k_1 = f(φ^l, t^l)
k_2 = f(φ^l + (∆t/2) k_1, t^l + ∆t/2)
k_3 = f(φ^l + (∆t/2) k_2, t^l + ∆t/2)
k_4 = f(φ^l + ∆t k_3, t^l + ∆t)
We follow the same approach as before, first substituting the two dimensional Taylor series expansion for k_2:
k_2 = f(φ^l, t^l)
+ (1/1!)(∆t/2) [ k_1 ∂f/∂φ + ∂f/∂t ]_{t^l}
+ (1/2!)(∆t/2)² [ k_1² ∂²f/∂φ² + 2k_1 ∂²f/∂φ∂t + ∂²f/∂t² ]_{t^l}
+ (1/3!)(∆t/2)³ [ k_1³ ∂³f/∂φ³ + 3k_1² ∂³f/∂φ²∂t + 3k_1 ∂³f/∂φ∂t² + ∂³f/∂t³ ]_{t^l}
+ (1/4!)(∆t/2)⁴ [ k_1⁴ ∂⁴f/∂φ⁴ + 4k_1³ ∂⁴f/∂φ³∂t + 6k_1² ∂⁴f/∂φ²∂t² + 4k_1 ∂⁴f/∂φ∂t³ + ∂⁴f/∂t⁴ ]_{t^l}
+ O(∆t)⁵
k_3 = f(φ^l, t^l)
+ (1/1!)(∆t/2) [ k_2 ∂f/∂φ + ∂f/∂t ]_{t^l}
+ (1/2!)(∆t/2)² [ k_2² ∂²f/∂φ² + 2k_2 ∂²f/∂φ∂t + ∂²f/∂t² ]_{t^l}
+ (1/3!)(∆t/2)³ [ k_2³ ∂³f/∂φ³ + 3k_2² ∂³f/∂φ²∂t + 3k_2 ∂³f/∂φ∂t² + ∂³f/∂t³ ]_{t^l}
+ (1/4!)(∆t/2)⁴ [ k_2⁴ ∂⁴f/∂φ⁴ + 4k_2³ ∂⁴f/∂φ³∂t + 6k_2² ∂⁴f/∂φ²∂t² + 4k_2 ∂⁴f/∂φ∂t³ + ∂⁴f/∂t⁴ ]_{t^l}
+ O(∆t)⁵
k_4 = f(φ^l, t^l)
+ (1/1!)(∆t) [ k_3 ∂f/∂φ + ∂f/∂t ]_{t^l}
+ (1/2!)(∆t)² [ k_3² ∂²f/∂φ² + 2k_3 ∂²f/∂φ∂t + ∂²f/∂t² ]_{t^l}
+ (1/3!)(∆t)³ [ k_3³ ∂³f/∂φ³ + 3k_3² ∂³f/∂φ²∂t + 3k_3 ∂³f/∂φ∂t² + ∂³f/∂t³ ]_{t^l}
+ (1/4!)(∆t)⁴ [ k_3⁴ ∂⁴f/∂φ⁴ + 4k_3³ ∂⁴f/∂φ³∂t + 6k_3² ∂⁴f/∂φ²∂t² + 4k_3 ∂⁴f/∂φ∂t³ + ∂⁴f/∂t⁴ ]_{t^l}
+ O(∆t)⁵
Chapter 11
Shooting Methods
The methods introduced thus far can only be used to solve initial value problems,
meaning that all the information you are given is at time t = tmin , and you are asked
to predict the solution up to a later point in time, say t = tmax . What if you are
given some information at t = tmin and some of the information at t = tmax ? These
kinds of problems are called Boundary Value Problems. There are two techniques for
solving boundary value problems, Shooting Methods, which use standard methods for
initial value problems such as Euler methods, Runge-Kutta methods, etc, and Direct
Methods, which are based on straightforward discretisation of the derivatives in the
differential equation and solving the resulting system of algebraic equations. As it
happens we will cover direct methods in the context of solving partial differential
equations later in the course, so for now we will only focus on how we implement
a shooting method.
The basic idea behind a shooting method is to guess the missing information at
t = t_min in such a way that the required conditions at t = t_max are satisfied. To illustrate by way of
example consider a system of 3 ODEs (where the generalisation to a system of M
ODE’s is straightforward):
dφ_1/dt = f_1(φ_1, φ_2, φ_3)
dφ_2/dt = f_2(φ_1, φ_2, φ_3)
dφ_3/dt = f_3(φ_1, φ_2, φ_3)    (11.1)
given the following conditions φ1 (tmin ) = φ1,min , φ2 (tmin ) = φ2,min , and φ3 (tmax ) =
φ3,max . You are asked to find φ1 (t), φ2 (t), and φ3 (t). It is important to note that
φ3 (tmin ) has not been defined, so for the purpose of the following discussion, let
φ3 (tmin ) = α. The idea behind the shooting methods is that because φ3 (tmin ) is not
defined, we are free to choose any value for α that will give us φ3 (tmax ) = φ3,max .
But we do not know beforehand what value of α will achieve this. So we need to iterate
through different values of α until the computed value of φ_3(t_max) equals φ_3,max.
Using any numerical method for solving ODEs, Equation 11.1 can be solved with
the following initial conditions φ1 (tmin ) = φ1,min , φ2 (tmin ) = φ2,min , and φ3 (tmin ) =
α. Let's say we take N_t steps of size ∆t to approach t = t_max and get the approximate
value of φ_{3,k=1}^{N_t}, where k denotes an iteration. The value of φ_{3,k=1}^{N_t} is dependent on
the value of α. Since we are only guessing the value of α, it is very likely that
φ_{3,k=1}^{N_t} − φ_{3,max} ≠ 0. For the following analysis, it will be convenient to define the
function:
g(α) = φ_{3,k}^{N_t} − φ_{3,max}
In order for the numerical solution to satisfy the original boundary conditions, we
must ensure that:
g(α) = 0
Thus this becomes a root finding problem, i.e. we have to iteratively find the value
of α such that g(α) = 0. Each value α will give you a numerical solution. Only
the numerical solution computed with the value of α that ensures that g(α) = 0 is
the correct solution to the original problem. Since the problem has been recast as a
root finding problem, the Secant formula is usually used to provide a better guess
value of α:
α_{k+1} = α_k − ( α_k − α_{k−1} ) g(α_k) / ( g(α_k) − g(α_{k−1}) )    (11.2)
Example 11.1:
Using a shooting method with an explicit Euler method for the time integration
and the secant formula, write a program in Matlab to solve the second order ODE:
d²φ/dt² = −2    (11.3)
in the domain t ∈ [0, 1], with boundary conditions φ(0) = 0 and φ(1) = 0. The in-
tended learning outcome for this example will be to simply observe the application
of a shooting method to solve an ODE with boundary values specified.
In order to begin, we will need to break this second order ODE into a system of
first order ODEs:
dφ_1/dt = φ_2
dφ_2/dt = −2    (11.4)
We are given that φ_1(0) = φ(0) = 0. However we do not know the value of φ_2(0) =
φ̇(0); instead we are given the condition at the other end of the domain, i.e. φ_1(1) = 0. So we are free to choose φ_2(0) = α.
The numerical solution computed using different values of α will give different
values of φ_1(t_max). Also note that while we are using the explicit Euler method to
compute the numerical solution, any method could be chosen. We would like to
pick only the numerical solution that gives us φ_1^{N_t} = 0. Let's define a function:
g(α) = φ_1^{N_t}(α) − φ_1(1)
So our task now is to find the value of α such that g(α) = 0. Let's just guess α_0 = 0
and α_1 = 2. For these values of α, the numerical solution of Equation 11.4 can be
computed, giving g(α_0) = −0.9500 and g(α_1) = 1.0500. One can then use Equation
11.2 to compute α2 . This process is then repeated until g(αk ) ≈ 0. The Matlab
algorithm will look something like:
while abs(g(k)) > tolerance
alpha(k+1) = alpha(k) - (alpha(k) - alpha(k-1))*g(k)/(g(k) - g(k-1));
g(k+1) = solve(alpha(k+1));
k = k + 1;
end
where we have defined a function solve to perform the time marching at each
iteration, given the initial condition for φ_2.
The complete program is given in Example11_1.m. Figures 11.1(a) and 11.1(b)
show the numerical solution to Equation 11.3 obtained using the shooting method.
It can be observed that the α_k are chosen according to Equation 11.2, forcing the
numerical solution of φ at t = 1 towards the required boundary value. Note that the numerical solutions
computed using α_5 and α_6 are not distinguishable on the scale of the diagram.
Figure 11.1: Solution to the ODE system in Example 11.1 with ∆t = 0.01 showing
(a) the solution for φ1 (t) at each iteration computed using the explicit Euler method
(b) the converged solution for the system.
Part III
Chapter 12
Introduction
In this part of the book we are going to investigate a number of different
families of numerical methods for solving Partial Differential Equations (PDEs).
These contrast to ordinary differential equations in that the dependent variable
is some function of multiple independent variables (i.e. a multivariate function),
rather than just one. To elaborate on this point, the ODEs that we studied in
Part II involved derivatives of one or more dependent variables with respect to time.
The partial differential equations encountered in many common applications may
however involve derivatives with respect to time, and/or derivatives with respect to
one or more spatial dimensions. The most general form of a PDE can be given as:
f( ∂²φ/∂x² , ∂²φ/∂x∂t , ∂²φ/∂t² , ∂φ/∂x , ∂φ/∂t , φ, x, t ) = 0
where for brevity, we have only shown two independent variables x and t, and
derivatives up to second order, but of course both can be extended to any number.
As with ODEs we will again use φ to denote our dependent variable and t to denote
the ordinate of time; but now, in addition we are using x to denote a spatial ordinate.
It will often be helpful if we try and think of φ as representing a field of some
sort. Although the focus in this part of the book is more on the numerics than
the physics of what the PDE is describing, it will become apparent in Part V that
when applied to a particular problem, φ will represent some physical quantity such
as a velocity field, a temperature field, a stress or strain field, perhaps even an
electromagnetic field. The point is that it’s representing some quantity that varies
continuously over a region of space and time. The ‘continuously’ bit of the last statement is quite important because it's the continuity of these fields which allows us to apply calculus in the first place. An important point to note is that the examples just mentioned include a scalar field, a vector field, and a
tensor field. To elaborate on this idea, a quantity like temperature can be described
by a single number; we’re all used to this from years of checking the weather. Of
course we can’t really describe the temperature variation throughout the atmosphere
by a single number because the temperature varies over a city, or a country, over the
entire planet in fact. The important point is that it’s a continuous function of space
and time and so there are infinitely many of these points that we could arbitrarily
choose; but having chosen a point in space and a moment in time, we only need one
number to define a temperature. In contrast, a quantity like velocity needs a few
numbers to describe it since velocity is a vector valued quantity. Continuing with the
weather analogy, we could imagine picking an arbitrary point in space and measuring
the wind velocity at that point. In our 3D world we would therefore assign three
numbers at this point describing the velocity vector components in each direction
at any moment in time. Tensor fields are then an extension of this idea, where we
require more numbers to describe a physical quantity at a location in space. We
will see some examples of tensor fields in Part V, where for a stress or a strain field we need nine numbers to define the state at a point in space. As it turns out, we can define a tensor by its rank, and the number of components that it requires is D^rank, where D is the number of spatial dimensions. Furthermore, we can say that scalars and vectors are in fact subsets of the more general class of tensors, i.e. a scalar is a rank 0 tensor requiring 3^0 = 1 number to describe it, a vector is a rank 1 tensor requiring 3^1 = 3 numbers to describe it, and a rank 2 tensor requires 3^2 = 9 numbers to describe it.
(Figure: schematic of a spatial domain Ω with boundary Γ, showing a point x and infinitesimal elements dΩ and dΓ.)
The discrete values of φ at a grid point (xi, yj, zk) and time level tl we will denote as φ(xi, yj, zk, tl) = φ^l_{i,j,k}. As with ODEs, our PDEs will be solved
within a domain, and as with ODEs the temporal domain will be specified as t ∈
[tmin , tmax ]. In terms of the spatial domain however, things are a little more complex.
With some of the numerical methods we will study, we can specify the domain in a
similar way, namely x ∈ [xmin , xmax ], but in other numerical methods this will not
be possible, since the shape of a spatial domain can be arbitrarily complex. For the
latter cases we will use Ω to denote our spatial domain and Γ its boundary (the term
∂Ω is also sometimes used to denote a boundary). We will define our domain in
Euclidean space RD (which is a rather complicated way of saying that our domain is
D dimensional and is defined in terms of real numbers) and so we can say Ω ⊂ RD
(i.e. our domain is a subset of the D dimensional space of real numbers). The
advantage of using this notation is that no matter how many spatial dimensions
our problem is defined in, the notation doesn’t change. In R3 (i.e. 3D space) the
domain has a volume with boundary surfaces, in R2 (i.e. 2D space) the domain has
an area with boundary edges, and in R1 the domain has a length with boundary
points. At this stage we don’t need to elaborate much further; the meaning will
become apparent when we study the particular numerical methods. For the most
part we will only be solving problems in up to two spatial dimensions, plus time.
The reason for this restriction is that solving PDEs in R2 will reveal the relevant
complexity necessary for understanding how ‘real world’ problems are solved, but
also mean that we don’t get ‘bogged down’ in too much math and computation. All
of the techniques that we will cover extend readily to higher spatial dimensions.
As we did with our study of ODEs, before we get into studying the numerical
methods, we need to outline some basic concepts and definitions. One of the impor-
tant aspects we must consider is the order of a PDE, which as with ODEs is simply
the order of the highest derivative present in the equation. As with ODEs the order
of a PDE will have important implications in terms of how much information we
have to specify to obtain a solution. In this book we will only be considering PDEs
up to second order. Another important aspect we must consider is whether or not
we are solving a single PDE or a system of PDEs. Throughout this part of the book we will only investigate the numerical solution of a single PDE, but
in Part V we will then extend these ideas to coupled systems of PDEs. Another
important aspect that we must consider is whether or not we are solving a linear or
nonlinear PDE. If a second order PDE is linear, then it can be represented in the
form:
a(x, t) ∂²φ/∂x² + b(x, t) ∂²φ/∂x∂t + c(x, t) ∂²φ/∂t² + d(x, t) ∂φ/∂x + e(x, t) ∂φ/∂t + f(x, t)φ = g(x, t)    (12.1)
If any of the coefficients a, b, c, d, e, f, g happen to also be functions of φ then the
PDE is described as quasi-linear . Now with PDEs it is the highest order terms that
determine the properties of the solutions. The collection of the highest order terms
is called the principal part and for the PDE presented in Equation 12.1 is defined
as:
a(x, t) ∂²φ/∂x² + b(x, t) ∂²φ/∂x∂t + c(x, t) ∂²φ/∂t²
To generalize on this idea to three spatial dimensions, plus time, let x1 = x, x2 = y,
x3 = z, x4 = t. Then we can write out our second order linear PDE as:
Σ_{m=1}^{4} Σ_{n=1}^{4} A_{m,n} ∂²φ/∂xm∂xn + Σ_{m=1}^{4} b_m ∂φ/∂xm + f(φ, x1, x2, · · · , x4) = 0
Here A is a matrix containing the coefficients of the second order derivatives defin-
ing the principal part, b is a vector containing the coefficients of the first order
derivatives, and f is some function defining the remainder of the PDE. We can then
classify the PDE in this form by examining the eigenvalues of the matrix A, where:
• if λ1, · · · , λD are nonzero and have the same sign (i.e. the matrix A is positive definite) then the equation is termed Elliptic.
• if one or more of λ1, · · · , λD are zero then the equation is termed Parabolic.
• if λ1, · · · , λD are nonzero and all except one have the same sign then the equation is termed Hyperbolic.
• if λ1, · · · , λD are nonzero and at least two are positive and at least two are negative then the equation is termed Ultrahyperbolic.
The meaning and implications of this classification will become apparent throughout
the course since we will actually solve PDEs of each type, but for now we just need
to bear in mind that the classification informs us as to what the solution might look
like, and what information we need to specify to get a solution. As it happens elliptic
equations tend to describe steady state equilibrium problems, parabolic equations
tend to describe transient diffusion type problems, and hyperbolic equations tend
to describe transient problems exhibiting a wave type motion. An important point
to note is that because the terms in Am,n could be functions of xm it is possible that
the eigenvalues will in fact be functions of xm (not just single numbers) and that
the classification may change throughout the domain. A ‘classic’ example of this is
say, compressible fluid flow which changes from subsonic to supersonic as it flows
over the wing of an aircraft. A final point on the matter is that all first order PDEs
are classified as hyperbolic.
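As a small illustrative sketch (not from the text), this classification can also be checked numerically by forming the principal part matrix A and examining its eigenvalues. The matrix shown here corresponds to the 1D heat equation, with x1 = x and x2 = t, so only A(1,1) is nonzero:

% minimal sketch: classify a second order PDE from the eigenvalues of its
% principal part matrix A (here the 1D heat equation, so A(1,1) = 1 only)
A = [1 0; 0 0];
lambda = eig(A);
if all(lambda ~= 0) && (all(lambda > 0) || all(lambda < 0))
    disp('Elliptic');
elseif any(lambda == 0)
    disp('Parabolic');
elseif sum(lambda > 0) == 1 || sum(lambda < 0) == 1
    disp('Hyperbolic');
else
    disp('Ultrahyperbolic');
end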
As with ODEs, another important aspect to consider is whether or not the PDE
is homogeneous or inhomogeneous. If g(xm ) is zero then it is the former, and if
non-zero then it is the latter.
Finally and perhaps most importantly, one of the important aspects of our prob-
lem we must consider is what types and how many boundary conditions we need
to specify for our problem to be well posed. When it comes to PDEs there are
three main types of boundary conditions that we can specify on the boundary of our
domain Γ. The first is known as a Dirichlet boundary condition and is defined as:

φ(x) = f(x) ∀ x ∈ ΓDirichlet

The idea here is that the value of φ itself is set equal to some specified function for all points x on the part of the boundary where we are choosing to specify the Dirichlet boundary condition. The second type of boundary condition is known as a Neumann boundary condition and is defined as:

∂φ/∂n = f(x) ∀ x ∈ ΓNeumann
where n is a unit normal vector pointing away from the boundary. The idea here
is that the gradient of φ, normal to the boundary (i.e. pointing directly away from
the boundary), is equal to some specified function for all points x on the boundary
where we are choosing to specify the Neumann boundary condition. We could also
denote this by ∂n φ = f (x). The third type of boundary condition is known as a
Robin boundary condition and is defined as:
aφ(x) + b ∂φ/∂n = f(x) ∀ x ∈ ΓRobin
where we can see that it is essentially a combination of a Dirichlet and a Neumann
boundary condition. We will refer to Mixed boundary conditions when we have
a problem where we specify one type of boundary condition on some part of the
domain, and a different type on another part. For instance we will frequently be
specifying both Dirichlet and Neumann conditions over different parts of a boundary
when we solve certain PDEs. A final type of boundary condition is known as a
Periodic boundary condition. In this case we effectively let the domain ‘loop’ back
around on itself so really, there is in fact no boundary anymore. It is then the
case that the boundary values at either end of the domain are equal. So we’ve
covered the different types of boundary conditions, but we also need to think about initial
conditions. In the solution of PDEs we often have what is known as a Cauchy
problem, which is essentially an extension of the idea of an initial value problem.
When we have only a first order derivative with respect to time, then we will need
to provide the initial condition φ(x, tmin ) = f (x), but if we have a second order
derivative with respect to time then we must also provide ∂t φ(x, tmin ) = g(x). The
Cauchy problem implies specifying the PDE with the appropriate number of initial
conditions in order to obtain a unique solution.
It is now time to introduce some common operators in vector calculus and we
will introduce these using two types of notation, namely vector notation and tensor
notation. Both of these forms describe the same thing, but often, working with one
type of notation will make life easier compared to the other. The first operator that
we will introduce is known as grad (or del or nabla) and is defined as:
∇ ≡ ∂xi ,  i = 1, 2, . . . , D    (12.2)
  ≡ ( ∂/∂x )    in R1
  ≡ ( ∂/∂x , ∂/∂y )    in R2
  ≡ ( ∂/∂x , ∂/∂y , ∂/∂z )    in R3
where the use of the ∇ symbol is vector notation, while the ∂xi notation is tensor
notation. This operator describes the gradient of a field. Now, if this operator was
applied to a scalar field, the result would be a vector field. For example, if we had
φ(x, y, z, t), then the result of this operation would be:
∇φ = ∂xi φ = ( ∂φ/∂x , ∂φ/∂y , ∂φ/∂z )
If however, this operator was applied to a vector field the result would be a dyadic.
For example, if we had v(x, y, z, t), with components {vx , vy , vz } then the result of
this operation would be:
∇v = ∂xi vj , a second rank tensor (dyadic) containing all of the partial derivatives ∂vj /∂xi .
The next operator that we will introduce is known as the divergence and is defined as:

∇· ≡ ∂xi
   ≡ ∂/∂x    in R1
   ≡ ∂/∂x + ∂/∂y    in R2
   ≡ ∂/∂x + ∂/∂y + ∂/∂z    in R3
which is also written as div(). This operator describes the ‘source’ of a field at a
given point. Now, if this operator was applied to a vector field, the result would be
a scalar field. For example, if we had v(x, y, z, t), with components {vx , vy , vz } then the result of this operation would be:

∇ · v = ∂xi vi = ∂vx/∂x + ∂vy/∂y + ∂vz/∂z

If however this operator was applied to a second rank tensor field, such as a stress field σ(x, y, z, t), the result would be a vector field:

∇ · σ = ∂xj σij = ( ∂σxx/∂x + ∂σxy/∂y + ∂σxz/∂z , ∂σyx/∂x + ∂σyy/∂y + ∂σyz/∂z , ∂σzx/∂x + ∂σzy/∂y + ∂σzz/∂z )
An important point to note here is that in terms of the tensor notation we are using
Einstein summation notation[13], where any repeated indices (j in this case) imply
that we substitute in all possible values for the index and add them all together.
Any index appearing only once in an expression is hence called a free index.
Another common operator is known as the curl and is defined as:
∇ × v = ( ∂vz/∂y − ∂vy/∂z , ∂vx/∂z − ∂vz/∂x , ∂vy/∂x − ∂vx/∂y )    in R3    (12.3)
which operates on a vector field and produces another vector field. It is also written
as curl(). This operator is only defined in 3D and describes a vector field’s rotation
at a given point.
Finally we will define the Laplacian as:
∇² ≡ ∂xi xi    (12.4)
   = ∇ · ∇
   ≡ ∂²/∂x²    in R1
   ≡ ∂²/∂x² + ∂²/∂y²    in R2
   ≡ ∂²/∂x² + ∂²/∂y² + ∂²/∂z²    in R3
which is also written as ∆. Now, if this operator was applied to a scalar field, the
result would be another scalar field. For example, if we had φ(x, y, z, t), then the
result of this operation would be:
∇²φ = ∂xi xi φ = ∂²φ/∂x² + ∂²φ/∂y² + ∂²φ/∂z²
If however, this operator was applied to a vector field, the result would be another
vector field. For example, if we had v(x, y, z, t), with components {vx , vy , vz } then the result of this operation would be:

∇²v = ∂xj xj vi = ( ∂²vx/∂x² + ∂²vx/∂y² + ∂²vx/∂z² , ∂²vy/∂x² + ∂²vy/∂y² + ∂²vy/∂z² , ∂²vz/∂x² + ∂²vz/∂y² + ∂²vz/∂z² )
The reason that we’ve made these definitions is that they are commonly used
in the description of a number of important PDEs. So, having now introduced a
number of concepts and definitions relating to the classification of PDEs, let’s take
a moment to look at some common example PDEs and classify them. Beginning
with the Poisson equation:
∂²φ/∂x² + ∂²φ/∂y² + ∂²φ/∂z² = ψ    or    ∇²φ = ψ    or    ∂xi xi φ = ψ    (12.5)
We can see that this is a second order, linear, inhomogeneous PDE in RD . Because
the eigenvalues of the principal part are λ1 = 1, · · · , λD = 1 this equation is elliptic.
It describes a number of equilibrium problems (as observed by the fact that there’s
no time derivatives present). If ψ = 0 then the equation is known as Laplace’s
equation. Another ‘classic’ PDE is the Heat equation:
∂φ/∂t = ∂²φ/∂x² + ∂²φ/∂y² + ∂²φ/∂z²    or    φ̇ = ∇²φ    or    ∂t φ = ∂xi xi φ    (12.6)
which we can see is a second order, linear, homogeneous PDE in RD . Because the
eigenvalues of the principal part are λ1 = 1, · · · , λD = 1, λt = 0 this equation is
parabolic. It describes transient diffusion processes. Another ‘classic’ PDE is the
Wave equation, defined as:
∂φ/∂t = ∂φ/∂x + ∂φ/∂y + ∂φ/∂z    or    φ̇ = ∇φ    or    ∂t φ = ∂xi φ    (12.7)
which is a first order, linear, homogeneous PDE in RD . This version is also known
as the first order wave equation, or the one-way wave equation and being first order
it is hyperbolic. The wave equation is perhaps more commonly defined as:
∂²φ/∂t² = ∂²φ/∂x² + ∂²φ/∂y² + ∂²φ/∂z²    or    φ̈ = ∇²φ    or    ∂tt φ = ∂xi xi φ    (12.8)
which we can see is a second order, linear, homogeneous PDE in RD. Because the eigenvalues of the principal part are λ1 = 1, · · · , λD = 1, λt = −1 this equation is hyperbolic, and it describes transient wave type motion. Another common set of PDEs are the equations of motion for a linear elastic solid, which in 3D can be written as:

ρ ∂²ux/∂t² = (λ + µ)( ∂²ux/∂x² + ∂²uy/∂x∂y + ∂²uz/∂x∂z ) + µ( ∂²ux/∂x² + ∂²ux/∂y² + ∂²ux/∂z² ) + ρgx
ρ ∂²uy/∂t² = (λ + µ)( ∂²ux/∂y∂x + ∂²uy/∂y² + ∂²uz/∂y∂z ) + µ( ∂²uy/∂x² + ∂²uy/∂y² + ∂²uy/∂z² ) + ρgy
ρ ∂²uz/∂t² = (λ + µ)( ∂²ux/∂z∂x + ∂²uy/∂z∂y + ∂²uz/∂z² ) + µ( ∂²uz/∂x² + ∂²uz/∂y² + ∂²uz/∂z² ) + ρgz
which we can see is a second order, linear, inhomogeneous PDE in RD . Because the
eigenvalues of the principal part of the second equation are λ1 = λ+2µ, · · · , λD = λ+
2µ, λt = −ρ this equation is hyperbolic. The dependent variable is the displacement
field u, which is a vector field, and so this equation can be thought of as a system
of D equations, one for each displacement component. In this case ρ is the mass
density and µ and λ are the Lamé Parameters which define the elastic properties
of a solid. Another common set of PDEs are the Navier-Stokes equations in fluid
mechanics (which come from the principles of conservation of mass and momentum):

ρ̇ + ∇ · (ρv) = 0    or    ∂t ρ + ∂xi (ρvi) = 0
ρv̇ + v · ∇(ρv) = µ∇²v − ∇p + ρg    or    ∂t (ρvi) + vj ∂xj (ρvi) = µ ∂xj xj vi − ∂xi p + ρgi
which we can see is a system of two PDEs, the first of which is a first order, linear,
homogeneous PDE in RD , the second of which is a second order, nonlinear, inho-
mogeneous PDE in RD . Because the eigenvalues of the principal part of the second
equation are λ1 = µ, · · · , λD = µ, λt = 0 this equation is parabolic. The dependent
variable in the first equation is the fluid mass density ρ, which is a scalar field. The
dependent variable in the second equation is the fluid velocity field v, which is a
vector field, and in fact it can be thought of as a system of D equations, one for
each velocity component. In this case µ describes the fluid viscosity and p is the
pressure field.
Another common PDE is the Energy equation in thermodynamics (which comes from the principle of conservation of energy):
ρC ∂T/∂t = ∇ · (k∇T ) + Q
which is a second order, linear PDE in RD . Because the eigenvalues of the
principal part are λ1 = k, · · · , λD = k, λt = 0 this equation is parabolic. The
dependent variable is the temperature field T , which is a scalar field. In this case ρ,
C, k, and Q are the mass density, specific heat capacity, thermal conductivity, and
heat generation respectively. Another common set of PDEs are Maxwell’s equations:
ε ∂E/∂t = ∇ × H − J
µ ∂H/∂t = −∇ × E − M
ε ∇ · E = ρf
µ ∇ · H = 0
which is a system of four, first order, linear PDEs, one of which is homogeneous. Here, E and H are the electric and magnetic field vectors respectively, ρf , J, and M are the charge, current density, and magnetization fields respectively, and ε and µ are the electric permittivity and magnetic permeability of a material respectively. These equations are hyperbolic and describe how electric charges and electric currents act as sources for the electric and magnetic fields. Further, they describe how a time
varying electric field generates a time varying magnetic field and vice versa.
Finally we shall end with the Schrödinger equation:
iℏ ∂Ψ/∂t + (ℏ²/2m) ∇²Ψ = V (x)Ψ
which is a second order, linear, homogeneous PDE in RD . Here Ψ is the wavefunction
of the system, V is the potential, m is the mass, and ~ is the reduced Planck constant.
It is used in quantum mechanics and describes how the quantum state of a physical
system changes in time.
Compared to the ODEs that we studied and solved in Part II we can now see
that PDEs are quite a bit more complicated. As a result, unlike ODEs, where we
can essentially develop ‘canned algorithms’ that we can subsequently apply to any
system of ODEs, PDEs are much more complex and it is much harder to have a
‘one method fits all equations’ type approach. Similar to Parts I and II, we are going to define an example system to which we can apply our numerical methods, so for the remainder of this part of the book we will be solving the generic scalar transport equation:

∂φ/∂t + ∇ · (vφ) = µ∇²φ + ψ    (12.11)

in its various forms, meaning that sometimes we will set certain terms to zero, so
that the problem will be either hyperbolic, parabolic, or elliptic. As it stands at the
moment, this equation is second order, linear, inhomogeneous, and the eigenvalues of
the principal part are λ1 = µ, · · · , λD = µ, λt = 0, so this equation is parabolic. As
its name implies this equation describes the transport of a scalar quantity and can
be applied to solving numerous problems by simply replacing φ with some specific
variable such as density, velocity, temperature, etc. It is hence a good candidate for
the development of our numerical methods.
To elaborate briefly on the various terms in the equation, the first term on the
left hand side of Equation 12.11 is the derivative of φ with respect to time (and
has the same meaning as in Part II except that we use ∂ instead of d to denote
the differential, since φ is now a multivariate function), and as such we call this the
unsteady term as it allows for the variation of φ with time.
The second term involves the vector field v and is commonly called the convective
term. As its name implies the convective term describes how the variable φ is moved
or ‘convected’ through a spatial domain by the presence of the velocity field. As an
analogy think of a drop of dye being injected into a flowing stream of water and
imagine that we are using the scalar transport equation to compute the concentration
field of dye as a function of space and time. The concentration field will change as
dye is carried along with the flow, and this is exactly what the convective term
represents mathematically.
The first term on the right hand side of Equation 12.11 contains the variable
µ and the Laplacian of φ and is commonly called the diffusive term. As its name
implies the diffusive term describes the ‘diffusion’ of φ throughout the computational
domain. Returning to the drop of dye in water analogy; if the water was instead
still, then over time the drop would spread out and the water would change color.
The diffusion of the dye is what this term represents mathematically.
The last term on the right hand side of Equation 12.11, ψ, is known as the source term and is a generic way of including any additional or problem specific terms into
this generic PDE. It should also be apparent that this term will define whether the
PDE is homogeneous or inhomogeneous. The source term may be some function of
φ or a function of x, a constant, or it may just be zero. The point is that we just
don’t put too much effort into specifying the form of the function at this stage. As
an example of where this term can be used, in fluid or solid mechanics problems
where the effects of gravity are included, the gravitational force would be added
into the source term as a constant. As a second example, consider a thermodynamics
problem describing heat transfer in a solid which is producing its own heat (either
via a chemical reaction or electrical current). In either case φ would be the materials
temperature T and the function describing the source of heat would be placed into
ψ.
If the coefficients in the convective and diffusive terms (i.e. v and µ) are constant, then we can use the vector identity:

∇ · (ab) = a · ∇b + b(∇ · a)

and note that the second term on the right hand side will be zero. As such, we arrive at a simplified form of the generic scalar transport equation:

∂φ/∂t + v · ∇φ = µ∇²φ + ψ
Expanding out the operators in full for all of the terms, for a 3D problem we could
write the scalar transport equation as:
∂φ/∂t + vx ∂φ/∂x + vy ∂φ/∂y + vz ∂φ/∂z = µ ∂²φ/∂x² + µ ∂²φ/∂y² + µ ∂²φ/∂z² + ψ
Now, as we will soon see, the solution of a PDE generally involves applying some
numerical method to the terms involving the spatial derivatives and reducing the
PDE to a system of ODEs. Since we will be focusing on a linear PDE, we will be
able to express this system as:
M φ̇ = Kφ + s (12.12)
which we can see is essentially the same as Equation 5.2. Often, M is termed the
mass matrix , K the stiffness matrix , and s the load vector (or source vector). These
names derive from the application in solid mechanics where ‘say’ the K matrix was
representing the stiffness of a material, but we will continue their use throughout
this book. Turning a PDE into a system of ODEs is known as a semi-discretization
or the Method of Lines[28] and we can generally then use any of the numerical
methods from Part II to perform the time integration. It should be noted however
that it is possible to discretize PDEs in both space and time at the same time, which
would be a full discretization. If we happen to use an explicit method for the time
integration then we will have something like:
M (φ^{l+1} − φ^{l}) / ∆t = K φ^{l} + s
If on the other hand, we happen to use an implicit method for the time integration
then we will have something like:
M (φ^{l+1} − φ^{l}) / ∆t = K φ^{l+1} + s
meaning that we will have to solve a system of equations at every time step like:
Aφl+1 = b
where:
A = M − ∆tK
b = M φl + ∆ts
So, it can be observed that often, solving a PDE reduces to solving a system of
ODEs, which in turn reduces to solving a system of algebraic equations. If we
don’t have a temporal derivative term in our PDE however, then applying a spatial
discretization would lead directly to a system of algebraic equations. It should
be noted that compared to some of the example problems from Parts I and II,
the important difference is that the size of the system of equations is defined by the
spatial discretization and will typically be much larger (e.g. we were solving systems
of size 3 × 3, but systems of say 10^6 × 10^6 are not uncommon).
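To make this concrete, the following is a minimal Matlab sketch (illustrative only, assuming that the matrices M and K, the vector s, an initial solution phi, and the step sizes Delta_t and N_t have already been produced by some spatial discretization) of implicit time marching:

% minimal sketch of implicit Euler time marching for M*phi_dot = K*phi + s,
% assuming M, K, s, phi (initial condition), Delta_t and N_t already exist
A = M - Delta_t*K;              % constant system matrix for a linear PDE
for l = 1:N_t-1
    b   = M*phi + Delta_t*s;    % right hand side built from the old solution
    phi = A\b;                  % solve A*phi^{l+1} = b at every time step
end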
As it happens, most of the concepts that were introduced in Part II relating to
accuracy, stability, consistency, convergence, etc, still apply when studying PDEs.
As we cover the different numerical methods for performing the spatial discretization,
we will define whether or not they use local or global approximations. Essentially
this defines whether or not the solution at any given point within the computational
domain is related to just a few of its immediate neighbors, or to all other points
within the domain. When we studied ODEs we could generally state the order
of accuracy of a method, or for some particular families (such as Runge-Kutta or
Adams-Bashforth methods) there were varying orders of accuracy available to us.
With the methods for solving PDEs that we will study, we have a similar scenario,
namely that we can obtain varying orders of accuracy for the spatial discretization.
Figure 12.2: Schematics of (a) a regular structured grid and (b) an unstructured
grid composed of triangles.
Before studying these methods however, we must introduce the two types of grid commonly used to discretize a spatial domain, namely structured and unstructured grids. Here we are not so much concerned with how such grids are generated (a topic that could be an entire book by itself), but rather with the nature of the resulting
grids. As a quick aside before beginning our study, it is worth mentioning that
in practice the term mesh is commonly used synonymously with grid, so one will
commonly hear references to structured and unstructured meshes.
Structured grids are perhaps the simplest way in which we can break up a region
of space into discrete pieces. As can be seen in Figure 12.2(a) a square region of
space (our computational domain) has been broken up into a number of smaller
regularly spaced pieces. While shown in 2D for simplicity, the extension to 3D is
obvious where we would have a cube which is broken up into a regular array and
with spacings ∆x, ∆y, and ∆z in the x, y, z directions respectively. Now depending
on the numerical method that we will be applying to the structured grid, we can
either think of this as a collection of discrete points spaced ∆x, ∆y, and ∆z apart, or
as a collection of cells (or elements) with volume ∆x × ∆y × ∆z.
In order to specify a particular value of φ(x, y, z, t) all that is needed is to assign indices i, j, k for x, y, z respectively, and then a particular point can be located by φ(i∆x, j∆y, k∆z, t). If we denote the number of grid points in each dimension by Nx , Ny , and Nz , then in order to store the field φ at any given moment in time (i.e. for one time step) we could store the array:
phi = zeros(N_x, N_y, N_z);
in Matlab, and:
double phi[N_x][N_y][N_z];
in C++. To locate any of these values in space, all that we then need to store in addition are the three grid spacings ∆x, ∆y, and ∆z. An important point to note is that we don't have to store our
field data in an array like this, but if the nature of our grid is analogous to a 3D
array we may as well take advantage of that. The only thing that really matters is
that for each scalar value of φ that we are storing we know where to locate it in 3D
space.
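As a small sketch of this idea (with assumed values), the mapping from array indices back to physical coordinates on a structured grid is simply:

% minimal sketch: on a structured grid the coordinates of the value stored
% at phi(i,j,k) follow directly from the indices and the grid spacings
Delta_x = 0.1; Delta_y = 0.1; Delta_z = 0.1;
N_x = 11; N_y = 11; N_z = 11;
phi = zeros(N_x, N_y, N_z);
i = 3; j = 5; k = 7;
x = (i-1)*Delta_x;   % assuming the first grid point sits at the origin
y = (j-1)*Delta_y;
z = (k-1)*Delta_z;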
Figure 12.3: Some typical unstructured grid primitive cell types in 3D.
The obvious limitation to structured grids is that the vast majority of problems
of interest can’t be represented by a regular ‘block’ like this. Unstructured grids on
the other hand break up a region of space into a smaller number of primitives which
we will call either cells or elements depending upon the numerical method (Figure
12.2(b)). While triangles were used in this case there are many more possibilities
and Figure 12.3 illustrates some primitive types in 3D. Now while Figure 12.2(b)
shows an unstructured triangular grid equivalent of a square domain, the real power
in utilizing unstructured grids lies in the fact that we can solve PDEs in much
more complex domains. Figure 12.4(a) is one such example of a complex geometry
and illustrates a portion of an unstructured grid around an aircraft. In addition to
observing that the triangles are able to tessellate the space around the fuselage, we
can also observe how the triangles can vary greatly in size from place to place within
the grid, thereby allowing greater resolution and accuracy in the solution wherever it
is needed. Figure 12.4(b) presents another example of the use of unstructured grids
applied to a spring. It can be observed that the same domain has been ‘meshed’ using two different types of primitives, namely tetrahedral and hexahedral cells. This flexibility is a common feature of unstructured grids; depending upon the numerical method employed to solve the PDE within the grid, one may even ‘mix and match’
any number of different primitives within the one grid. The important thing is that
the entire spatial domain be tessellated into contiguous, non overlapping cells or
elements, expressed mathematically as:
∪_{c=1}^{Nc} Ωc = Ω
That is, the union of all the Nc cells Ωc tessellating the domain, is the domain Ω
itself.
Figure 12.4: Two examples of unstructured grids of complex geometries (a) illus-
trates a tetrahedral grid of the region around the fuselage of an aircraft (b) illustrates
both a hexahedral and tetrahedral grid of a spring type structure.
The first data structure that we need in order to define an unstructured grid is a 2D array of point coordinates:

P = [ x1 y1 z1 ; x2 y2 z2 ; x3 y3 z3 ; x4 y4 z4 ; . . . ; xNp yNp zNp ]
which is of size Np × ND where Np is the number of points defining the grid and ND
is the dimensionality of the domain. These points represent the vertices of the cells
or elements. Another data structure which can be defined in conjunction with the
points is a 2D array of edges:
E = [ P1 P2 ; P2 P3 ; P3 P1 ; P1 PNp ; . . . ; P4 P19 ]
which is of size Ne × 2 where Ne is the number of edges in the grid. Here each
edge is a row in the edge array and is defined by two indices which each identify a
row in the points array (e.g. edge 1 is defined by point 1 and point 2, which have
the coordinates x1 , y1 , z1 and x2 , y2 , z2 ). One could also define an edge by explicitly
storing the coordinates of the end points of each edge, but if there are many edges
which share a given point (e.g. there are around five or six edges using each point in
Figure 12.2(b)) then this data structure becomes somewhat inefficient because we
will be storing the same coordinate point many times.
Building on this data structure we may then store a 2D array of faces:
F = [ E1 E2 E3 ; E3 E1 E4 ; E1 E5 E6 ENe ; E8 E9 E7 E5 E5 ; . . . ; E9 E4 E6 ]
or
F = [ P1 P2 P3 ; P3 P1 P4 ; P1 P5 P6 PNp ; P8 P9 P7 P5 P6 ; . . . ; P9 P4 P6 ]
which will have Nf rows and the number of columns will depend upon the nature
of the face (e.g. triangular, quadrilateral, polyhedral). Now, the faces could be defined
by indices identifying a row in the edge array or by indices identifying a row in
the points array. So we are saying that we could either define a face by its edges
or by its vertices. Generally there will be some ordering in the sequence of the
edges or points defining a face. The more common approach is to store the edges
in an anticlockwise order around the face. The varying number of edges or points
within each row are present to emphasize the point that we could have triangular,
quadrilateral, pentagonal, etc, faces in our unstructured grid.
Building up further we define a list of cells:
C = [ F1 F2 F3 F5 ; F9 F5 F6 F7 ; F11 F12 F13 FNf ; F31 F49 F52 F7 F22 ; . . . ; F9 F35 F44 F14 ]
or
C = [ E1 E2 E3 E5 ; E9 E5 E6 E7 ; E11 E12 E13 ENe ; E31 E49 E52 E7 E22 ; . . . ; E9 E35 E44 E14 ]
or
C = [ P1 P2 P3 P5 ; P9 P5 P6 P7 ; P11 P12 P13 PNp ; P31 P49 P52 P7 P22 ; . . . ; P9 P35 P44 P14 ]
which will have Nc rows and the number of columns will depend upon the nature of
the cell (e.g. tetrahedral, hexahedral, polyhedral). That is we could define a cell by
either its faces, edges, or vertices, and in either case the indices in each row of the cell
array identify a geometrical entity in one of the other arrays. Similar to the
faces, cells can be of essentially arbitrary shape (Figure 12.3) and hence defined by
a variable number of faces (e.g. 4 faces for a tetrahedral cell, 5 faces for a pyramid,
6 faces for a hexahedral cell). Obviously the definition of cells presented here only
applies to the discretization of a 3D domain. If we are looking at a 2D domain
then the faces play the same role as the cells in 3D. Depending on the terminology
however, you may find the faces referred to as cells in a 2D discretization and this
is the approach we will use in this book. So in 3D our cells will have a volume
associated with them, in 2D they will have an area associated with them, and in
1D they will have a length associated with them.
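As a small illustrative sketch (not taken from the text), the points and cells arrays for a tiny 2D grid of two triangles covering the unit square could be built as:

% minimal sketch: a 2D unstructured grid of two triangular cells, with the
% cells array storing row indices into the points array for each vertex
Points = [0 0;    % P1
          1 0;    % P2
          1 1;    % P3
          0 1];   % P4
Cells  = [1 2 3;  % cell 1 defined by vertices P1, P2, P3
          1 3 4]; % cell 2 defined by vertices P1, P3, P4
XY = Points(Cells(2,:), :);   % vertex coordinates of cell 2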
So far we have described how each individual cell is defined. In order to com-
pletely define an unstructured grid however we may need to go even further and also
store some connectivity information, describing how particular geometrical entities
are connected to one another. This is analogous to a given grid point in a structured
grid φi,j,k being able to locate its neighbors φi±1,j±1,k±1 , but in this case we need to
explicitly store the connectivity information. Some possibilities here are that for a
given point we might store the indices of all other points to which it is connected to
via an edge (Figure 12.5(a)):
[ P3 P4 P5 P5 P6 ; P19 P23 P74 P6 PNp ; P4 P5 P8 P31 ; P9 P5 P7 P31 P42 ; . . . ; P7 P74 P6 P2 P2 ]
meaning that point 2 is connected to points 19, 23, 74, 6, and 2 etc. We may for a
given edge define the connectivity by storing the indices of the two faces which use
the edge (Figure 12.5(b)):
[ F1 F2 ; F5 F6 ; F11 FNf ; F31 F49 ; . . . ; F9 F12 ]
We can extend this concept further to faces and define the connectivity for a given
face by storing the indices of the two cells which share the face:
[ C1 C2 ; C5 C6 ; C11 CNf ; C31 C49 ; . . . ; C9 C35 ]
where we are saying here that face 2 is shared by cells 4 and 5 etc. It should be
noted that here a given face can be shared by only two cells, which is true for the
most common grid structures. On this point this data structure is sometimes termed
a ‘neighbor-owner’ array. The reason for this is that although both cells share the
face, one of these will be designated as being the owner cell for the face (say the first
cell index in the list) while the second will be designated as the neighbor cell for the
face. We will elaborate on this point at a later stage but to hint at the reason why
this is important, the discretization is going to involve the face normal vectors of
a cell which are defined as pointing outward, away from the cell. Obviously if two
cells are sharing the face then the normal vector will be pointing out of one face, but
in to the other. If we defined the normal as pointing out of the owner and into the
neighbor cell then this will help us keep our discretization consistent. Finally, we
could define the connectivity for a given cell by storing the indices of its neighboring
cells (Figure 12.5(c)):
[ C2 C3 C4 C5 ; C5 C5 C7 C9 C11 C23 ; C11 C12 C4 C5 ; C31 C49 C56 C44 C2 ; . . . ; C9 C35 C19 C11 ]
where we are saying here that cell 1 neighbors cells 2, 3, 4, and 5.
Figure 12.5: Schematics of different types of grid connectivity (a) illustrates a given
node storing indices of its neighbouring nodes (b) illustrates a given face storing the
indices of the two neighbor and owner cells which share it (c) illustrates a given cell
storing the indices of the cells which neighbor it.
Putting all of this together, for ‘say’ a tetrahedral grid we might allocate:

Points = zeros(N_p, 3);
Faces  = zeros(N_f, 3);
Cells  = zeros(N_c, 4);
phi    = zeros(N_p, 1);

in Matlab, and:
double Points[N_p][3];
double Faces [N_f][3];
double Cells [N_c][4];
double phi [N_p];
in C++. Here a tetrahedral cell is defined by 4 triangular faces, which are in turn
defined by 3 points which have 3 x, y, z components. Furthermore it is assumed
that we are defining the discrete values of φ at the vertices of the cells, hence why
the length of the φ array matches the length of the points array. As we shall see,
for the Finite Element method in Chapter 15, this will be the case, but for the
Finite Volume method in Chapter 14 we instead define the discrete values of φ at
the centroids of the cells, hence the length of the φ array would match the length of the
cells array.
So, to re-emphasize the point, with a structured grid, in order to locate any
field variable in 3D space we only needed to store three numbers, ∆x, ∆y, and
∆z; then, based on the i, j, k index of the variable within its array, we can
assign its coordinates. For an unstructured grid on the other hand we need to store
between three to four arrays in order to do the same thing; and these arrays may
have thousands or millions of rows in them. This is of course in addition to the array
storing the actual field data. But remember the extra computer memory required
to store an unstructured grid is the ‘price we pay’ in order to be able to deal with
complex spatial domains and hence solve ‘real world’ problems.
Another important issue concerning unstructured grids lies in how we define its
boundaries. With a structured grid, these will trivially be the elements in the edges
of the field array. For example, accessing all of the boundary values on the lower
x, y plane of a 3D domain could be achieved via:
phi_b = phi(:,:,1);
in Matlab, and:
for(i=0; i<N_x; i++)
{
for(j=0; j<N_y; j++)
{
phi_b = phi[i][j][0];
}
}
in C++. With an unstructured grid on the other hand, the boundary information must be stored explicitly, and Figure 12.6 illustrates two different ways of defining boundary conditions. In Figure 12.6(a) the numerical
method will dictate that the field variables are defined at the cell vertices and so
the boundary conditions are applied at the vertices on the boundary of the grid. In
Figure 12.6(b) however, the numerical method will dictate that the field variables
are defined at the cell centroids and so the boundary conditions are applied on
the faces on the boundary of the grid. So with the former, we need to specify
which points are on a particular boundary and with the latter we need to specify
which faces are on a particular boundary. A useful way to minimize the amount of
information required to locate all of these boundary points/faces is to assume that
in their respective arrays, all of the interior points/faces come first, followed by the
boundary points/faces. Furthermore, we assume that the boundary points/faces are
all grouped together. In this case an ‘elegant’ way to define the boundaries is with
a structure or a class like:
Boundaries = struct('name', {}, 'type', {}, 'N', {}, 'start', {}, 'value', {});
in Matlab, or:
class Boundary
{
public:
Boundary(){ }
string name_;
string type_;
int N_;
int start_;
double value_;
};
Boundary Boundaries[N_b];
in C++. Now, if we couldn't assume that the
boundary points/faces were organized this way then our boundary struct or class
might not be so compact. If for instance the boundary points/faces were randomly
scattered throughout their arrays then we would need to explicitly store an array
for each boundary defining the indices of which points/faces are a part of it:
Boundaries = struct('name', {}, 'type', {}, 'N', {}, 'indices', {}, 'value', {});
in Matlab, or:
class Boundary
{
public:
Boundary(){ }
string name_;
string type_;
int N_;
int* indices_;
double value_;
};
Boundary Boundaries[N_b];
where indices is a 1D array storing the indices of either the points or faces in the
grid. We can then simply loop over all of the boundaries in our struct or class as:
for b = 1:N_b
    if strcmp(Boundaries(b).type, 'Dirichlet')   % only apply Dirichlet values here
        for n = 1:Boundaries(b).N
            phi(Boundaries(b).indices(n)) = Boundaries(b).value;
        end
    end
end
This introduction to structured and unstructured grids has only touched very
briefly on what is quite a large field. For the interested reader, some additional
information can be found in [45], [59], [71], and [70].
We are now in a position to begin studying the various families of numerical methods. Part of the reason for introducing all of these concepts initially is so that when we come to applying different numerical methods to the solution of the various
forms of the generic scalar transport equation, we will have an understanding of
what the PDE is describing, what the solution might look like, what boundary and
initial conditions are required, so that we have a well posed problem, and how we go
about defining the required data structures computationally. As a final point, it is
worth mentioning that most of the concepts introduced here will have more meaning,
once we have actually studied a number of numerical methods and applied them to
specific PDEs. As such, it is recommended that you re-read this introduction at the
end of this part of the book.
Chapter 13
Finite Difference Methods
Just as Euler methods were the simplest methods that we can use to solve ODEs,
Finite Difference methods are perhaps the simplest methods that we can use to
solve PDEs, so they present an excellent starting point. The basic idea behind
the method is that we replace the derivative terms in a PDE with approximately
equivalent difference quotients, often called stencils. The difference quotients are
linear combinations of the field values at neighboring grid points. Finite Difference
methods are generally applied to structured grids where we have a regular and
equally spaced array of grid points covering our spatial domain and are a local
method in that the solution at any given point only involves the solution at a few
neighboring points.
In order to derive the finite difference quotients for a derivative term of order n,
we apply the general finite difference formula:
dⁿφ/dxⁿ |xi = (1/∆xⁿ) ( Σ_{m=1}^{Nm} a−m φi−m + a0 φi + Σ_{m=1}^{Nm} a+m φi+m )    (13.1)
where the values φi±m denote the field values at the neighboring points m and a±m are weighting coefficients, whose values will depend on how accurate we want the difference quotient to be.
The starting point for determining these coefficients is to expand the field values at the neighboring points as Taylor series about the point xi:

φi+1 = φi + ∆x dφ/dx|xi + (∆x)²/2! d²φ/dx²|xi + (∆x)³/3! d³φ/dx³|xi + O(∆x⁴)
φi−1 = φi − ∆x dφ/dx|xi + (∆x)²/2! d²φ/dx²|xi − (∆x)³/3! d³φ/dx³|xi + O(∆x⁴)
Now repeating this procedure for all 1 ≤ m ≤ Nm and substituting the resulting
expressions for φi±m into the right hand side of Equation 13.1 we get:
dφ/dx|xi = (1/∆x) [ a−Nm ( φi − Nm∆x dφ/dx|xi + (Nm∆x)²/2! d²φ/dx²|xi − (Nm∆x)³/3! d³φ/dx³|xi + . . . )
                  + . . . + a0 φi + . . .
                  + a+Nm ( φi + Nm∆x dφ/dx|xi + (Nm∆x)²/2! d²φ/dx²|xi + (Nm∆x)³/3! d³φ/dx³|xi + . . . ) ]

Collecting together the coefficients of φ and of each derivative, this can be written as:

dφ/dx|xi = (1/∆x) [ ( a−Nm + . . . + a−1 + a0 + a+1 + . . . + a+Nm ) φ(xi)
                  + ( a−Nm(−Nm) + . . . + a−1(−1) + a+1(1) + . . . + a+Nm(Nm) ) ∆x dφ/dx|xi
                  + ( a−Nm(−Nm)²/2! + . . . + a−1(−1)²/2! + a+1(1)²/2! + . . . + a+Nm(Nm)²/2! ) ∆x² d²φ/dx²|xi
                  + ( a−Nm(−Nm)³/3! + . . . + a−1(−1)³/3! + a+1(1)³/3! + . . . + a+Nm(Nm)³/3! ) ∆x³ d³φ/dx³|xi
                  + . . . ]
In order for the right hand side to actually equal the first derivative, the collected coefficient multiplying φ(xi) must equal zero, the coefficient multiplying dφ/dx|xi must equal one, and as many as possible of the coefficients multiplying the higher derivatives should also equal zero. Let's first consider the case where Nm = 1, for which these conditions give the system of equations:

[ 1    1   1   ] [ a−1 ]   [ 0 ]
[ −1   0   1   ] [ a0  ] = [ 1 ]
[ 1/2  0   1/2 ] [ a+1 ]   [ 0 ]
We can in fact solve this system and find that the coefficients are:
a−1 = −1/2
a0 = 0
a+1 = 1/2
and hence the corresponding formula for the first derivative of φ is:
dφ/dx|xi = (1/2∆x) (φi+1 − φi−1)    (13.4)
This is called the second order central difference for the first derivative. So we can
see that to approximate the derivative at point xi we will need to know the field
values at one point either side xi±1 . Sometimes it’s more convenient to have a finite
difference quotient which only requires knowing field values on one side or the other
of point xi . In this case what we can do is ‘choose’ for the coefficients on one side to
be zero. For example, if we only want to involve points xi+1 then we could choose
a−1 = 0, so that we would have the reduced system:
[ 1  1 ] [ a0  ]   [ 0 ]
[ 0  1 ] [ a+1 ] = [ 1 ]
where the solution is trivially:
a0 =−1
a+1 = 1
and hence the corresponding formula for the first derivative of φ is:
dφ/dx|xi = (1/∆x) (φi+1 − φi)    (13.5)
This is called the first order forward difference for the first derivative. Alternatively, if we only want to involve points xi−1 then we could choose a+1 = 0, so that we
would have the reduced system:
[ 1   1 ] [ a−1 ]   [ 0 ]
[ −1  0 ] [ a0  ] = [ 1 ]
where the solution is trivially:
a0 = 1
a−1 =−1
and hence the corresponding formula for the first derivative of φ is:
dφ/dx|xi = (1/∆x) (φi − φi−1)    (13.6)
This is called the first order backward difference for the first derivative.
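As a quick illustrative check (a sketch that is not part of the text), we can apply the forward and central difference quotients to a known function and watch how the error behaves as ∆x is reduced:

% minimal sketch: compare the first order forward difference (Equation 13.5)
% and second order central difference (Equation 13.4) for phi(x) = sin(x)
x = 1.0;
exact = cos(x);
for Delta_x = [0.1 0.05 0.025]
    forward = (sin(x+Delta_x) - sin(x))/Delta_x;
    central = (sin(x+Delta_x) - sin(x-Delta_x))/(2*Delta_x);
    fprintf('dx=%6.3f  forward error=%9.2e  central error=%9.2e\n', ...
            Delta_x, abs(forward-exact), abs(central-exact));
end
% halving Delta_x roughly halves the forward error but quarters the central error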
Let’s now consider the case where Nm = 2. The general finite difference formula
can then be written as:
dφ/dx|xi = (1/∆x) (a−2 φi−2 + a−1 φi−1 + a0 φi + a+1 φi+1 + a+2 φi+2)
and the system of equations that we need to satisfy in order to approximate the derivative can be written as:

[ 1     1     1   1     1   ] [ a−2 ]   [ 0 ]
[ −2    −1    0   1     2   ] [ a−1 ]   [ 1 ]
[ 2     1/2   0   1/2   2   ] [ a0  ] = [ 0 ]
[ −4/3  −1/6  0   1/6   4/3 ] [ a+1 ]   [ 0 ]
[ 2/3   1/24  0   1/24  2/3 ] [ a+2 ]   [ 0 ]

We can in fact solve this system and find that the coefficients are:
a−2 = 1/12
a−1 = −2/3
a0 = 0
a+1 = 2/3
a+2 = −1/12
and hence the corresponding formula for the first derivative of φ is:
dφ/dx|xi = (1/12∆x) (φi−2 − 8φi−1 + 8φi+1 − φi+2)    (13.7)
This is called the fourth order central difference for the first derivative. Again, if
we only wanted to derive forward or backward differences we could explicitly choose
a−2 = 0, a−1 = 0 or a+2 = 0, a+1 = 0 and solve the reduced systems. The important point to note is that we can create finite difference quotients of essentially any order we choose, either forward, backward, or central.
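The coefficient systems above can also be built and solved numerically. The following is a minimal sketch (not the book's code) for the Nm = 2 central difference of the first derivative, where each row of the matrix enforces the collected coefficient of one derivative order:

% minimal sketch: solve for the finite difference coefficients of the first
% derivative using the five points x_{i-2} ... x_{i+2}
offsets = -2:2;
n = numel(offsets);
A = zeros(n, n);
for row = 1:n
    A(row, :) = offsets.^(row-1) / factorial(row-1);   % coefficient of each derivative order
end
rhs = zeros(n, 1);
rhs(2) = 1;                    % we want the first derivative term to survive
a = A\rhs                      % returns [1/12, -2/3, 0, 2/3, -1/12]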
In order to derive finite difference quotients for the second derivative, we again apply the general finite difference formula, which now takes the form:

d²φ/dx² |xi = (1/∆x²) ( Σ_{m=1}^{Nm} a−m φi−m + a0 φi + Σ_{m=1}^{Nm} a+m φi+m )    (13.8)

As with the first derivative formula from before, expanding the right hand side in terms of Taylor series and collecting coefficients of the derivatives gives:
[ 1        · · ·   1      1   1      · · ·   1       ] [ a−Nm ]   [ 0 ]
[ −Nm      · · ·   −1     0   1      · · ·   Nm      ] [  ...  ]   [ 0 ]
[ Nm²/2!   · · ·   1/2!   0   1/2!   · · ·   Nm²/2!  ] [ a−1  ]   [ 1 ]
[ −Nm³/3!  · · ·   −1/3!  0   1/3!   · · ·   Nm³/3!  ] [ a0   ] = [ 0 ]
[  ...       ...    ...    ...  ...     ...     ...    ] [ a+1  ]   [ ... ]
                                                        [  ...  ]
                                                        [ a+Nm ]
where the only difference compared to the system of equations for the first derivative is that the third element in the vector of known values is equal to 1, instead of the second. As was the case with the first derivative, the more equations you satisfy, the more accurate your finite difference approximation will be. As we did for the
first derivative, let’s consider the case where Nm = 1. The general finite difference
formula can then be written as:
d²φ/dx² |xi = (1/∆x²) (a−1 φi−1 + a0 φi + a+1 φi+1)
The system of equations that we need to satisfy in order to approximate the derivative can be written as:

[ 1    1   1   ] [ a−1 ]   [ 0 ]
[ −1   0   1   ] [ a0  ] = [ 0 ]
[ 1/2  0   1/2 ] [ a+1 ]   [ 1 ]

Solving, we find that the coefficients are:

a−1 = 1
a0 = −2
a+1 = 1
and hence the corresponding formula for the second derivative of φ is:
d²φ/dx² |xi = (1/∆x²) (φi−1 − 2φi + φi+1)    (13.9)
This is called the second order central difference for the second derivative. As with the first derivative, if we only wanted forward or backward differences we could explicitly choose the coefficients on one side to be zero and solve the resulting reduced systems.
Although our derivations thus far have all been in terms of x, it should be obvious
that in multiple dimensions all of the analyses apply and we simply replace the
independent variable x, y, z and the index i, j, k appropriately. Putting everything
together, if we were to use second order central differences ‘say’ to approximate the
derivative terms in our generic scalar transport equation, in 3D we would get:
dφi,j,k/dt + (vx/2∆x) (φi+1,j,k − φi−1,j,k)
           + (vy/2∆y) (φi,j+1,k − φi,j−1,k)
           + (vz/2∆z) (φi,j,k+1 − φi,j,k−1)
  = (µ/∆x²) (φi−1,j,k − 2φi,j,k + φi+1,j,k)
  + (µ/∆y²) (φi,j−1,k − 2φi,j,k + φi,j+1,k)
  + (µ/∆z²) (φi,j,k−1 − 2φi,j,k + φi,j,k+1)
  + ψi,j,k    (13.10)
Defining an equation like this at every grid point and collecting the results together, we can again write the system in the form:

M φ̇ = Kφ + s    (13.11)
where φ is a column vector containing the values at each i, j, k grid point and M is the
identity matrix in this case. So it can be observed that by applying the Finite
difference method, we have applied a spatial discretization and we have reduced our
PDE to a system of ODEs. Now we have φi,j,k (t) and at this point we can apply
one of the numerical methods from Part II to perform the time integration.
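As a small sketch of what such a spatial discretization produces (illustrative values only), the matrix K for a purely diffusive 1D problem, built from the second order central difference of Equation 13.9, is tridiagonal:

% minimal sketch: assemble K for the 1D diffusion term using Equation 13.9
N_x = 5; Delta_x = 0.1; mu = 1.0;
e = ones(N_x, 1);
K = mu/Delta_x^2 * spdiags([e -2*e e], -1:1, N_x, N_x);
full(K)   % tridiagonal matrix with -2*mu/Delta_x^2 on the diagonal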
Before we move on to applying this method to solving some specific forms of
the generic scalar transport equation, it is worth devoting a little time to the issue
of how we actually impose boundary conditions. Now, when we have a Dirichlet
boundary condition to impose, we are saying that we know the value of φ on the
boundary and so really what this means is that those values should not be a part of
the column vector φ in Equation 13.11. Instead, these values are incorporated into
the column vector s. With a Neumann boundary condition on the other hand, we
are saying that we know the gradient of φ on the boundary, but the actual value of
φ on the boundary is still unknown. As such we can say that those values should be
a part of the column vector φ, in Equation 13.11, but the discrete equation for that
grid point will be modified by the Neumann boundary condition. To illustrate by
way of example, consider the discretized generic scalar transport equation in 1D:
dφi/dt + (vx/2∆x) (φi+1 − φi−1) = (µ/∆x²) (φi−1 − 2φi + φi+1) + ψi    (13.12)
and suppose that at grid point xi we are also applying the Neumann boundary
condition:
∂φi/∂x = f
What we would generally do here is to approximate the derivative boundary condi-
tion by a finite difference quotient, ‘say’ a second order central difference. In that
case we would have:
(φi+1 − φi−1) / 2∆x = f    (13.13)
and so, using Equation 13.13 to eliminate φi+1 (i.e. φi+1 = φi−1 + 2∆xf ), we get:

dφi/dt + (vx/2∆x) (φi−1 + 2∆xf − φi−1) = (µ/∆x²) (φi−1 − 2φi + φi−1 + 2∆xf ) + ψi

dφi/dt + vx f = (2µ/∆x²) (φi−1 − φi + ∆xf ) + ψi
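In terms of the matrices of Equation 13.11, the modified boundary equation above simply changes one row of K and one entry of s. A minimal sketch (with assumed variable names) might look like:

% minimal sketch: impose the Neumann condition dphi/dx = f at grid point i = N_x
% for the 1D system phi_dot = K*phi + s, using the modified equation above
i = N_x;
K(i,:)   = 0;
K(i,i-1) =  2*mu/Delta_x^2;
K(i,i)   = -2*mu/Delta_x^2;
s(i)     = 2*mu*f/Delta_x - v_x*f + psi(i);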
Example 13.1:
In this example we will develop both a Matlab and a C++ program to solve the
1D first order wave equation:
∂φ/∂t + v ∂φ/∂x = 0    (13.15)
in the domain x ∈ [0, 10], t ∈ [0, 10], with boundary condition φ(0, t) = 1, initial condition φ(x, 0) = e^(−5(x−3)²) + 1, and v = 1.0, and compare the numerical solution with the exact solution:

φ(x, t) = e^(−5(x−vt−3)²) + 1

using the error:

e = φ^l_i − φ(xi , tl )
as our measure of convergence. For the spatial discretization we will use the Finite
Difference method with second order central differences for the first derivative and
for the temporal discretization we will use the fourth order Runge-Kutta method.
The intended learning outcomes for this example will be to ‘get a feel’ for applying
the Finite Difference method and observing the solution of a hyperbolic PDE. Also,
because we are using an explicit method to perform the time integration, we will
have stability constraints and we will investigate this by solving the PDE with some
different spatial step sizes ∆x and temporal step sizes ∆t. Finally, we will look
at how we can replace the Dirichlet boundary condition with a periodic boundary
condition.
So, there are a lot of things to learn in this example. To begin, let’s first confirm
in our minds that we have a well posed problem. Our PDE has two derivative terms
in it and so this translates into requiring two pieces of information in order to obtain
a unique solution, one boundary condition for the spatial derivative and one initial
condition for the temporal derivative. Since we were given both of these, then we
can say that our problem will be well posed. The reason for laboring this point is
that if the solution of a PDE is attempted with too many or too few boundary or
initial conditions it will be ‘doomed’ from the start. So it is always important to
consider these issues before writing any code.
Assuming now that our spatial domain has been broken up into Nx grid points
with spatial step size ∆x, then we can replace the spatial derivative with a second
order central difference and define an ODE at each interior grid point as:
dφ2/dt = −(v/2∆x) (φ3 − φ1)
dφ3/dt = −(v/2∆x) (φ4 − φ2)
dφ4/dt = −(v/2∆x) (φ5 − φ3)
  ...
dφNx−1/dt = −(v/2∆x) (φNx − φNx−2)
For the grid point Nx we can’t use a second order central difference however because
grid point xNx +1 is outside of the spatial domain. What we can do to remedy this
problem however is to use a first order backward difference for the spatial derivative
at this point:
dφNx/dt = −(v/∆x) (φNx − φNx−1)
You might wonder, does it matter if we use a second order accurate spatial discretiza-
tion for most of the points, but then a first order accurate spatial discretization at
this one point? The answer is technically yes, but in terms of both the accuracy and
stability of the solution, it doesn’t make too much difference. We could always use a
second order backward difference if we were concerned about this however. Moving
along, these ODEs can be written in the form:
M φ̇ = Kφ + s
where we have:
     [ φ2  ]   [   0      −v/2∆x     0       · · ·     0     ] [ φ2  ]   [ vφ1/2∆x ]
     [ φ3  ]   [ v/2∆x      0      −v/2∆x    · · ·     0     ] [ φ3  ]   [    0    ]
d/dt [ φ4  ] = [   0      v/2∆x       0      · · ·     0     ] [ φ4  ] + [    0    ]
     [ ... ]   [  ...       ...        ...       ...   −v/2∆x ] [ ... ]   [   ...   ]
     [ φNx ]   [   0        0         0       v/∆x   −v/∆x   ] [ φNx ]   [    0    ]

where the mass matrix M multiplying the left hand side is simply the identity matrix.
It can be observed that both M and K are sparse matrices and because M = I, we
could in fact rewrite our system more simply as:
φ̇ = Kφ + s = f (φ) (13.16)
Now, using the approach we took in implementing the fourth order Runge-Kutta method
in Example 10.1, we will define a function f to evaluate the right hand side of
Equation 13.16 at the various stages of the method. In our Matlab code, this will
take the form:
function k = f(phi)
k = zeros(N_x,1);
for i=2:N_x-1;
k(i) = -v/(2*Delta_x)*(phi(i+1) - phi(i-1));
end
k(N_x) = -v/( Delta_x)*(phi(N_x) - phi(N_x-1));
end
At this point it is worth addressing some practical issues regarding the storage of
phi and the imposition of the Dirichlet boundary condition. Although our system
of ODEs doesn’t include the Dirichlet point φ1 as an unknown, from a programming
point of view it makes the most sense to allocate one array to store all of the discrete
solution, including this point. Because we are stepping forward in time by ‘looping’
over all of the grid points we can simply set the value of the solution at this point
and not include it in the update. The remainder of the algorithm is just the basic
fourth order Runge-Kutta code from Example 10.1:
for l=1:N_t-1
k1 = f(phi(:,l));
k2 = f(phi(:,l) + Delta_t/2*k1);
k3 = f(phi(:,l) + Delta_t/2*k2);
k4 = f(phi(:,l) + Delta_t *k3);
phi(:,l+1) = phi(:,l) + Delta_t *(k1/6 + k2/3 + k3/3 + k4/6);
end
In our C++ program we have the limitation that we can’t add arrays together
‘on the fly’ like we can in Matlab with statements like phi(:,l)+Delta t*k3, so
instead we are going to have to introduce a ‘temporary’ array to store this data. In
fact we will dynamically allocate a total of six arrays to store our field data and the
various stages of the Runge-Kutta method:
double* tempPhi = new double [N_x];
double* k1 = new double [N_x];
double* k2 = new double [N_x];
double* k3 = new double [N_x];
double* k4 = new double [N_x];
double* phi = new double [N_x];
...
In contrast to our Matlab program where our array phi stores the solution for each
grid point at every time step, our C++ program is only going to store the solution
for one time step. This is a more common approach to take in ‘real world’ programs
since it requires far less memory, but means that if we want to keep this solution
for post processing, we must write it to an output file. We’ll get to that shortly, but
for now, moving along, our function f, evaluating the right hand side, will take the
form:
void f(double* k, double* phi)
{
    for(int i=1; i<N_x-1; i++)
    {
        k[i] = -v/(2.0*Delta_x)*(phi[i+1] - phi[i-1]);   // interior points: central difference
    }
    k[N_x-1] = -v/Delta_x*(phi[N_x-1] - phi[N_x-2]);     // last point: backward difference
}

with the remainder of the program following the same structure as the Matlab code. An important point to note is that in our Matlab time marching loop we
are updating phi(:,l+1), where the colon operator implies all of the grid points
at time step l + 1 (including the boundary point φ1 ). The reason that this is not a
problem is because in our function f we didn’t evaluate any of the k values at that
grid point (i.e. we looped through grid points 2 to Nx ). So as long as k(1) is zero
for each k1 , k2 , k3 , k4 then when we update phi(:,l+1) its value won’t change.
So, assuming that phi(1,1) was initialized to 1.0, we will have the correct result.
If we were worried about it, we could always modify our update statement to be
phi(2:N_x,l+1), or add in the line phi(1,l+1)=1.0 inside the time marching loop
to ensure that this is always the case. So it can be observed that there are a few
different ways we can make sure that we impose the boundary condition correctly.
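As an aside, a minimal sketch of what the corresponding C++ time marching loop could look like, using the tempPhi array introduced earlier to form the intermediate stage arguments, is shown below. This is an illustrative sketch under those assumptions, not the listing from Example13_1.cpp.

// Hedged sketch: fourth order Runge-Kutta time marching in C++, storing only
// the current time step in phi and using tempPhi for the stage arguments.
for(int l=0; l<N_t-1; l++)
{
    f(k1, phi);
    for(int i=0; i<N_x; i++) tempPhi[i] = phi[i] + Delta_t/2*k1[i];
    f(k2, tempPhi);
    for(int i=0; i<N_x; i++) tempPhi[i] = phi[i] + Delta_t/2*k2[i];
    f(k3, tempPhi);
    for(int i=0; i<N_x; i++) tempPhi[i] = phi[i] + Delta_t  *k3[i];
    f(k4, tempPhi);
    for(int i=0; i<N_x; i++)
        phi[i] += Delta_t*(k1[i]/6 + k2[i]/3 + k3[i]/3 + k4[i]/6);
    phi[0] = 1.0;   // re-impose the Dirichlet value (unchanged anyway since k[0] = 0)
}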
So, although at this point we have a working algorithm, we have to consider the
fact that the fourth order Runge-Kutta method is an explicit method and hence
we need to make sure that while performing the time integration we are inside the
stability region. As we will see, this places restrictions on both ∆t and ∆x. For
this reason, we will in fact create the matrix K in our program, but it should be
understood that this is only to determine its eigenvalues, not because it is part of
our numerical method. Due to the diagonal nature of K we can create it in Matlab
quite easily by using the diag function, which can create a matrix with terms on,
above, or below the main diagonal:
K = -v/(2*Delta_x) .* (-1.*diag(ones(N_x-1, 1), -1) + diag(ones(N_x-1, 1), 1));
Then, we can compute the eigenvalues as we did in Example 6.2 and plot them
relative to the stability region of the fourth order Runge-Kutta method:
[Xi Lambda] = eig(K);
[X, Y] = meshgrid(-4:0.1:4, -4:0.1:4);
Z = X + i*Y;
sigma = abs(1 + Z + (Z.^2)/2 + (Z.^3)/6 + (Z.^4)/24);
contourf(X, Y, sigma, [1 1]);
hold on;
plot(real(diag(Lambda))*Delta_t, imag(diag(Lambda))*Delta_t, 'x');
From the structure of K it can be shown that:
\[
\lambda_m \Delta t \propto \frac{v\Delta t}{\Delta x}
\]
so we can see that the eigenvalues of K depend upon the velocity and the spatial
and temporal step sizes. We make the important definition:
\[
CFL = \frac{v\Delta t}{\Delta x}
\]
where CFL is known as the Courant-Friedrichs-Lewy number. This parameter
is useful in determining the stability of explicit methods, but it should be noted
that its definition changes depending on the dimensionality of the problem and the
derivatives present in the PDE. The more important point however is that we can
observe that if we decrease ∆x (i.e. add in more grid points) to try and get a more
accurate solution (noting that the error associated with the second order central
difference quotient is proportional to ∆x²) then we find that the CFL number
will increase, reducing the stability of the solution. We therefore also need
to reduce ∆t to maintain stability. So we can’t just decrease ∆x to try and get a
more accurate solution, without reducing ∆t at the same time.
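A simple way to keep an eye on this in practice is to report the CFL number for the chosen step sizes before time marching begins. The snippet below is a hedged sketch assuming v, Delta_t, and Delta_x are already defined in the script; it is not part of the original program.

% Hedged sketch: report the CFL number for the chosen step sizes.
CFL = v*Delta_t/Delta_x;
fprintf('CFL number = %f\n', CFL);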
The complete programs are given in Example13_1.m and Example13_1.cpp. Fig-
ures 13.1(a) - 13.1(b) illustrate the location of the λm ∆t terms for two different com-
binations of ∆x and ∆t, the first with ∆x = 0.05 and ∆t = 0.02, and the second
with ∆x = 0.02 and ∆t = 0.10. In the first combination, all the terms are located
within the stability region, and in the second they are not. The corresponding effect
on the solution is shown in Figures 13.2(a) - 13.2(b). It is easily observed that for
the second combination, the simulation ‘blows up’ after just a couple of time steps,
whereas for a stable solution we see the ‘bell shaped’ initial condition is simply
shifted (or convected) along through the computational domain. This is in essence
what the convective term describes in a PDE. Another observation that can be made
is that all of the eigenvalues of K are purely imaginary, which is a characteristic of
the discretization of the convective term. When we include ‘say’ the diffusive term
in the generic scalar transport equation we find that the discretization will result
in the eigenvalues having real components too. The important point to take away
from this example is that if we are using an explicit method for the time integra-
tion, we need to be careful when choosing our spatial and temporal step sizes. To
illustrate the convergence of the solution, Table 13.1 presents the infinity norm for
a range of spatial and temporal step sizes (maintaining stability of course). As can
be observed, the finer the grid and the smaller the time step size, the lower the error
in the solution (which is of course what we could expect).
Table 13.1: The convergence of the solution, illustrating the infinity norm for a
range of spatial and temporal step sizes.
∆x ∆t ||e||∞
1.000 1.000 0.760540
0.500 0.500 0.706663
0.100 0.100 0.387008
0.050 0.050 0.138828
0.010 0.010 0.005468
0.005 0.005 0.001361
0.001 0.001 0.000054
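As an illustration of how an entry of Table 13.1 might be computed, the sketch below compares the final numerical solution against a hypothetical function phiExact(x, t) returning the analytical solution; both the function name and its existence are assumptions, not part of the original program.

% Hedged sketch: infinity norm of the error at the final time step.
x = (0:N_x-1)'*Delta_x;
e = phi(:, N_t) - phiExact(x, (N_t-1)*Delta_t);
e_inf = norm(e, inf);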
The final thing we are going to do in this example is experiment with how we
impose a periodic boundary condition instead of a Dirichlet boundary condition.
First of all however, an important question we should ask is, does it matter which
Figure 13.1: Location of the λm ∆t terms within the stability region of the fourth
order Runge-Kutta method for the PDE in Example 13.1 for (a) ∆x = 0.05 and
∆t = 0.02 (b) ∆x = 0.02 and ∆t = 0.10. It should be noted that each λm ∆t is
marked as a cross in the complex plane, but the large number of these terms makes
them appear as a solid strip. It can be observed that all of the λm ∆t terms are
purely imaginary.
end of the domain we apply this condition at? The answer to this is, yes it does.
Because v was positive the wave will move in the direction of increasing x. If we
were to try and impose the Dirichlet boundary condition φ(10, t) = 0 then we would
run in to problems as the wave approaches that boundary. The key result is that we
impose the Dirichlet boundary condition on the boundary that the wave is moving
away from, so then in order to impose it at φ(x = 10, t) we would need to make v
negative.
Having said that, we can now look at instead imposing a periodic boundary
condition, which, remember, essentially closes the domain back up on itself so that
there is no boundary. Thinking of our 1D spatial domain in this example as a bead
necklace pulled taut (where each bead represents a grid point), then applying
periodic boundary conditions is analogous to doing up the necklace. The way in
which this modifies our system is only at grid points x1 and xNx . This time,
we can now use second order central differences to get:
Figure 13.2: The solutions to the PDE in Example 13.1 illustrating the solution at
(a) l = 0 and l = 200 for the combination ∆x = 0.05 and ∆t = 0.02 (b) the solution
at l = 0 and l = 13 for the combination ∆x = 0.02 and ∆t = 0.10.
\[
\begin{aligned}
\frac{d\phi_1}{dt} &= -\frac{v}{2\Delta x}\left(\phi_2 - \phi_{N_x}\right)\\
\frac{d\phi_2}{dt} &= -\frac{v}{2\Delta x}\left(\phi_3 - \phi_1\right)\\
\frac{d\phi_3}{dt} &= -\frac{v}{2\Delta x}\left(\phi_4 - \phi_2\right)\\
&\;\;\vdots\\
\frac{d\phi_{N_x-1}}{dt} &= -\frac{v}{2\Delta x}\left(\phi_{N_x} - \phi_{N_x-2}\right)\\
\frac{d\phi_{N_x}}{dt} &= -\frac{v}{2\Delta x}\left(\phi_1 - \phi_{N_x-1}\right)
\end{aligned}
\]
So the field value to the left of φ1 now becomes φNx and the field value to the
right of φNx becomes φ1 . We can in fact implement this in our program very easily
by modifying the function f as:
function k = f(phi)
k = zeros(N_x,1);
k(1) = -v/(2*Delta_x)*(phi(2) - phi(N_x));
for i=2:N_x-1;
k(i) = -v/(2*Delta_x)*(phi(i+1) - phi(i-1));
end
k(N_x) = -v/(2*Delta_x)*(phi(1) - phi(N_x-1));
end
Figures 13.3(a) - 13.3(b) illustrate the solution at a number of time steps for the
case where ∆x = 0.05 and ∆t = 0.02. It can be observed that the bell shaped
curve now exits the domain through the right hand side and immediately re-enters
through the left hand side. An interesting observation that can be made is that the
longer the simulation is run for, the more the bell shaped curve is distorted. This is
the effect of numerical error introduced by the spatial and temporal discretizations
becoming apparent.
Example 13.2:
In this example we will develop both a Matlab and a C++ program to solve the
2D Poisson equation:
∇2 φ + ψ = 0 (13.17)
in the domain x ∈ [0, 1], y ∈ [0, 1], with boundary conditions φ(0, y) = 1, φ(1, y) = 1,
φ(x, 0) = 1, φ(x, 1) = 1, and ψ = 10, and compare the numerical solution with the
exact solution
\[
\phi(x, y) = \sum_{p=1}^{\infty}\sum_{q=1}^{\infty}
\frac{40\left(1-(-1)^p\right)\left(1-(-1)^q\right)}{(p^2+q^2)\,p\,q\,\pi^4}
\sin(p\pi x)\sin(q\pi y) + 1
\]
where we will define the error function
e = φi,j − φ(xi , yj )
and use the infinity norm
Figure 13.3: The solutions to the PDE in Example 13.1 for the combination ∆x =
0.05 and ∆t = 0.02 illustrating the solution at (a) l = 1, (b) l = 350, (c) l = 500
and (d) l = 700.
as our measure of convergence. For the spatial discretization we will use the Finite
Difference method with second order central differences for the second derivatives
and to solve the resulting system of algebraic equations we will use the Gauss-Seidel
method, with the two-norm as our measure of convergence. The intended learning
outcomes for this example are to ‘get a feel’ for applying the Finite Difference
method, to observe the solution of an elliptic PDE, and to investigate the
application of iterative methods to solve the resulting system of equations.
To begin, let’s first confirm in our minds that we have a well posed problem. Our
PDE has two second order derivative terms in it and so this translates into requiring
four pieces of information in order to obtain a unique solution, two boundary con-
ditions for each spatial derivative. Put another way, we need a boundary condition
over every part of the boundary. Since we were given all of these, then we can say
that our problem will be well posed.
Assuming now that our spatial domain has been broken up into Nx data points
in x and Ny data points in y, then we can replace the spatial derivatives with a
second order central difference and define an algebraic equation at each interior
grid point as:
\[
\frac{\phi_{i-1,j} - 2\phi_{i,j} + \phi_{i+1,j}}{\Delta x^2}
+ \frac{\phi_{i,j-1} - 2\phi_{i,j} + \phi_{i,j+1}}{\Delta y^2}
+ \psi_{i,j} = 0
\qquad (13.18)
\]
For the purposes of the example let’s assume that ∆x = ∆y = ∆xy. We can then
rearrange the discrete equation to get:
Kφ = s
For the simple (and unrealistically coarse) case of, say, a 5 × 5 grid, this system would take the form:
\[
\begin{bmatrix}
 4 & -1 &  0 & -1 &  0 &  0 &  0 &  0 &  0\\
-1 &  4 & -1 &  0 & -1 &  0 &  0 &  0 &  0\\
 0 & -1 &  4 &  0 &  0 & -1 &  0 &  0 &  0\\
-1 &  0 &  0 &  4 & -1 &  0 & -1 &  0 &  0\\
 0 & -1 &  0 & -1 &  4 & -1 &  0 & -1 &  0\\
 0 &  0 & -1 &  0 & -1 &  4 &  0 &  0 & -1\\
 0 &  0 &  0 & -1 &  0 &  0 &  4 & -1 &  0\\
 0 &  0 &  0 &  0 & -1 &  0 & -1 &  4 & -1\\
 0 &  0 &  0 &  0 &  0 & -1 &  0 & -1 &  4
\end{bmatrix}
\begin{bmatrix}
\phi_{2,2}\\ \phi_{3,2}\\ \phi_{4,2}\\ \phi_{2,3}\\ \phi_{3,3}\\ \phi_{4,3}\\ \phi_{2,4}\\ \phi_{3,4}\\ \phi_{4,4}
\end{bmatrix}
=
\begin{bmatrix}
\Delta xy^2\,\psi_{2,2} + \phi_{2,1} + \phi_{1,2}\\
\Delta xy^2\,\psi_{3,2} + \phi_{3,1}\\
\Delta xy^2\,\psi_{4,2} + \phi_{4,1} + \phi_{5,2}\\
\Delta xy^2\,\psi_{2,3} + \phi_{1,3}\\
\Delta xy^2\,\psi_{3,3}\\
\Delta xy^2\,\psi_{4,3} + \phi_{5,3}\\
\Delta xy^2\,\psi_{2,4} + \phi_{1,4} + \phi_{2,5}\\
\Delta xy^2\,\psi_{3,4} + \phi_{3,5}\\
\Delta xy^2\,\psi_{4,4} + \phi_{5,4} + \phi_{4,5}
\end{bmatrix}
\qquad (13.19)
\]
Some important observations that we can make here are that similar to Example
13.1 we are dealing with a sparse matrix, but in this case the structure of the matrix
is not as regular. This is most obvious when we look at the terms either side of the
main diagonal, where there is a pattern of alternating 0 and −1, and this occurs
when two sequential finite difference equations have different j indices in the grid.
While we could type out the matrix manually, this is a fairly naive approach to take,
since it will become very time consuming and more error prone as we increase the
resolution of the grid, and it will need to be retyped for every change in Nx and
Ny . What we would like is an algorithm that will allow us to simply input Nx and
Ny and take care of the corresponding size of the arrays for us. We could use the
Matlab function gallery(‘poisson’,N) to construct the matrix, but we are going
to take another approach.
An important point to note is that the vector of unknowns φ is of length (Nx −
2)(Ny − 2) and K is of size (Nx − 2)(Ny − 2) × (Nx − 2)(Ny − 2). If we increase
the number of grid points in each dimension from 5 to a more reasonable resolution
of say Nx = 100 and Ny = 100 then φ would be of the order of 10, 000, but more
importantly K would be of the order of 10, 000×10, 000 i.e. it will be storing around
100,000,000 numbers! Since most of these numbers are zero, explicitly storing the
matrix would be very inefficient. Furthermore, the matrix is actually only storing
two different entries, −1 on the off diagonal entries and 4 on the diagonal entries.
Finding the inverse of a matrix of such a size using one of the direct methods from
Chapter 2 would in general be too computationally intensive. In practice, a much
more common approach is to use an iterative method and for this example we will
be using the Gauss-Seidel method to solve this linear system.
As we will see shortly, we can implement the Gauss-Seidel method in a way
that means that we can actually completely remove the need to explicitly store the
stiffness matrix. To see how this is possible, recall the iterative formula for the
Gauss-Seidel method:
\[
\phi_m^{k+1} = \frac{1}{K_{m,m}}\left(s_m - \sum_{n>m} K_{m,n}\,\phi_n^{k} - \sum_{n<m} K_{m,n}\,\phi_n^{k+1}\right)
\qquad (13.20)
\]
From Example 3.2 we can recall that the algorithm will involve an inner
for loop over each φm to update its value, and an outer while loop to iteratively
repeat this procedure until convergence has been reached. It should also be noted
that specifying the row index m will involve specifying some combination of i and
j indices in our 2D computational grid and in fact we can explicitly evaluate m for
each grid point as:
m = (j − 2)(Nx − 2) + (i − 1)
For example, if i = 2 and j = 2 we get m = 1 (i.e. the first element in φ). From
examination of Equation 13.19 we can see that every Km,m = 4, Km,n = −1, and
every ∆xy 2 ψi,j will also be the same since ψ is a constant throughout the domain
in this example. So although the structure of the matrix isn’t ‘that’ regular, the
elements in the matrix do follow a particular pattern. You might think that it would
be a bit inefficient then to explicitly store an entire 2D array when essentially we
only need to store the two numbers (4 and −1); and in fact you’d be right! Because
of this feature of the stiffness matrix, we can write out the Gauss-Seidel iteration
for each grid point (i.e. each φm ) as:
\[
\phi_{i,j}^{k+1} = \frac{1}{4}\left(\Delta xy^2\,\psi_{i,j} + \phi_{i-1,j}^{k+1} + \phi_{i+1,j}^{k} + \phi_{i,j-1}^{k+1} + \phi_{i,j+1}^{k}\right)
\]
Notice how this is nothing more than a rearrangement of Equation 13.18, but is
also implementing the Gauss-Seidel iteration in Equation 13.20. Furthermore the
stiffness matrix K has gone completely (i.e. we’re not explicitly storing it). We can
now solve the system by iterating over the i, j indices with two nested for loops
(rather than the m indices with one for loop), to update our solution in Matlab as:
while r_norm>tolerance && k<N_k
for i=2:N_x-1
for j=2:N_y-1
phi(i,j) = (Delta_xy^2*psi + phi(i-1,j) + phi(i+1,j) ...
+ phi(i,j-1) + phi(i,j+1))/4;
end
end
for i=2:N_x-1
for j=2:N_y-1
r(i,j) = Delta_xy^2*psi + phi(i-1,j) + phi(i+1,j) ...
+ phi(i,j-1) + phi(i,j+1) - 4*phi(i,j);
end
end
r_norm = sqrt(sum(sum(r.^2)));
k = k + 1;
end
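The loop above assumes that the grid, source term, and iteration controls have already been set up before it runs. A minimal initialization sketch is given below; the particular values are illustrative assumptions, not taken from the original program.

% Hedged sketch of the set-up the Gauss-Seidel loop relies on (values are
% illustrative assumptions only); this would appear before the while loop.
N_x = 101;  N_y = 101;
Delta_xy  = 1/(N_x - 1);   % grid spacing for the unit square domain
psi       = 10;            % constant source term
tolerance = 1e-6;          % convergence tolerance on the residual two-norm
N_k       = 20000;         % maximum number of Gauss-Seidel iterations
phi    = ones(N_x, N_y);   % Dirichlet value of 1 on every boundary
r      = zeros(N_x, N_y);  % residual array
r_norm = 1;                % force at least one iteration
k      = 0;                % iteration counter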
Some important points to note here are that firstly our for loops are
indexing from 2:N_x-1 and 2:N_y-1 because we are not defining a Finite Difference
equation at the Dirichlet boundary points. Secondly, it can be observed that we
are in fact storing our residual as a 2D array as well. This again emphasizes the
difference between conceptually thinking of a residual vector as a column vector, (the
same length as the number of unknown variables in our system) but computationally
storing it in a different manner. Because we have a structured grid, it makes sense
to store φ as a 2D array, and it then makes sense to store r in the same way. In the
above code snippet, when we compute the residual we essentially loop over every
grid point and evaluate:
\[
r_{i,j}^{k+1} = \Delta xy^2\,\psi_{i,j} + \phi_{i-1,j}^{k+1} + \phi_{i+1,j}^{k+1} + \phi_{i,j-1}^{k+1} + \phi_{i,j+1}^{k+1} - 4\phi_{i,j}^{k+1}
\]
which is the component-wise equivalent of evaluating:
\[
r^{k+1} = s - K\phi^{k+1}
\]
By then squaring each ri,j term, adding them all up, and taking the square root,
we then have the two norm. In our C++ program we will dynamically allocate the
field and residual data as 2D arrays so that we are not limited by the stack size,
can use [i][j] indexing, have our arrays contiguous in memory, and minimize the
number of separate memory allocations.
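The original allocation listing is not reproduced here; a minimal sketch of one common approach that satisfies those requirements (a single contiguous block of doubles per field plus an array of row pointers) is shown below. The variable names are assumptions rather than the book's listing.

// Hedged sketch: one contiguous block of N_x*N_y doubles per field, plus an
// array of row pointers so that phi[i][j] indexing works.
double*  phiData = new double [N_x*N_y];
double** phi     = new double*[N_x];
double*  rData   = new double [N_x*N_y];
double** r       = new double*[N_x];
for(int i=0; i<N_x; i++)
{
    phi[i] = &phiData[i*N_y];
    r[i]   = &rData[i*N_y];
}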
Our algorithm implementing the Gauss-Seidel method will then look like:
while(r_norm>tolerance && k<N_k)
{
for(i=1; i<N_x-1; i++)
{
for(j=1; j<N_y-1; j++)
{
phi[i][j] = (Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
+ phi[i][j-1] + phi[i+1][j]) / 4;
}
}
r_norm = 0.0;
for(i=1; i<N_x-1; i++)
{
for(j=1; j<N_y-1; j++)
{
r[i][j] = Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
+ phi[i][j-1] + phi[i+1][j] - 4*phi[i][j];
r_norm += r[i][j]*r[i][j];
}
}
r_norm = sqrt(r_norm);
k++;
}
Table 13.2: The convergence of the solution, illustrating the infinity norm for a
range of spatial step sizes.
∆x ∆y ||e||∞
0.50 0.50 0.111714
0.10 0.10 0.005730
0.05 0.05 0.001448
0.01 0.01 0.000184
Figure 13.4: The solutions to the PDE in Example 13.2 illustrating the solution at
iterations (a) k = 1, (b) k = 100, (c) k = 1, 000, and (d) k = 5, 000 with Nx = 65
and Ny = 65.
Example 13.3:
In this example we will develop a Matlab program to solve the 2D generic scalar
transport equation:
φ̇ + v · ∇φ = µ∇2 φ + ψ (13.21)
in the domain x ∈ [0, 1], y ∈ [0, 1], t ∈ [0, 2], with boundary conditions φ(0, y) = 1,
φ(x, 0) = 1, ∂x φ(1, y) = 0, ∂y φ(x, 1) = 0, initial condition φ(x, y, 0) = e^{−50(x−0.3)²} +
1, and v = {0.5, 0.5}, µ = 0.01, and ψ = 0.2. For the spatial discretization we will
use the Finite Difference method with second order central differences for the first
and second derivatives and for the temporal discretization we will use the implicit
Euler method. The intended learning outcomes for this example will be to observe
the solution of a parabolic PDE and to investigate the concept of ‘assembling’ the
matrix defining our system of equations. As we will see later in the book, this idea
will be used with many other numerical methods.
To begin, let’s first confirm in our minds that we have a well posed problem.
Our PDE has two second order spatial derivative terms and one first order temporal
derivative in it and so this translates into requiring five pieces of information in or-
der to obtain a unique solution, two boundary conditions for each spatial derivative
plus one initial condition. Remember that because it is the higher order spatial
derivatives that determine the nature of the PDE, the first derivatives in the con-
vective term aren’t really important here. Since we were given all of these bits of
information, then we can say that our problem will be well posed.
Assuming now that our spatial domain has been broken up into Nx data points
in x and Ny data points in y, then we can apply our spatial discretization; that is
the Finite Difference method, meaning that we replace the spatial derivatives with
a second order central difference and define an ODE at each interior grid point as:
\[
\frac{d\phi_{i,j}}{dt}
+ \frac{v_x}{2\Delta x}\left(\phi_{i+1,j}-\phi_{i-1,j}\right)
+ \frac{v_y}{2\Delta y}\left(\phi_{i,j+1}-\phi_{i,j-1}\right)
= \frac{\mu}{\Delta x^2}\left(\phi_{i-1,j}-2\phi_{i,j}+\phi_{i+1,j}\right)
+ \frac{\mu}{\Delta y^2}\left(\phi_{i,j-1}-2\phi_{i,j}+\phi_{i,j+1}\right)
+ \psi
\]
Collecting coefficients of each grid point we get:
\[
\frac{d\phi_{i,j}}{dt}
= \left(\frac{\mu}{\Delta y^2} + \frac{v_y}{2\Delta y}\right)\phi_{i,j-1}
+ \left(\frac{\mu}{\Delta x^2} + \frac{v_x}{2\Delta x}\right)\phi_{i-1,j}
- \left(\frac{2\mu}{\Delta x^2} + \frac{2\mu}{\Delta y^2}\right)\phi_{i,j}
+ \left(\frac{\mu}{\Delta y^2} - \frac{v_y}{2\Delta y}\right)\phi_{i,j+1}
+ \left(\frac{\mu}{\Delta x^2} - \frac{v_x}{2\Delta x}\right)\phi_{i+1,j}
+ \psi
\]
This system of ODEs can again be written in matrix form as:
M φ̇ = Kφ + s
where again M is the identity matrix. Then, applying the implicit Euler method we
get:
\[
M\,\frac{\phi^{l+1}-\phi^{l}}{\Delta t} = K\phi^{l+1} + s
\]
which can be rearranged to:
\[
A\phi^{l+1} = b
\]
meaning that we will have to solve a system of equations at every time step, where:
\[
A = M - \Delta t K, \qquad b = M\phi^{l} + \Delta t\, s
\]
So the only real ‘trick’ in this example is how we define the matrices. In contrast to
the previous two examples, this time we are actually going to explicitly store the mass
and stiffness matrices, M and K. Because these matrices will be quite sparse, we
are going to use the sparse function in Matlab to allocate memory for them as:
N_p = N_x * N_y;
K = sparse(N_p, N_p);
M = sparse(eye(N_p));
where Np is the number of grid points. For the stiffness matrix we are simply allo-
cating memory in this code snippet, whereas for the mass matrix we are creating an
Np ×Np identity matrix and then converting it from a full to a sparse representation.
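As a side note (this is general Matlab advice rather than something done in the book's program): the sparse identity can be built directly with speye(N_p), and inserting entries one at a time into a matrix created with sparse(N_p, N_p) can become slow for large systems, because the stored nonzeros have to be reshuffled as the matrix fills in. A common alternative is to accumulate row, column, and value triplets during assembly and then build K with a single call to sparse, for example:

% Hedged sketch of triplet-based assembly (an alternative, not the book's code).
rows = []; cols = []; vals = [];
% inside the assembly loops, instead of K(m,n) = K(m,n) + coeff, one would do:
%   rows(end+1) = m;  cols(end+1) = n;  vals(end+1) = coeff;
K = sparse(rows, cols, vals, N_p, N_p);   % duplicate (m,n) entries are summed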
To continue now with how we actually add all of the required information into these
matrices, we will in fact loop over every grid point and add the coefficients of the
algebraic equation defined at that point to the overall system of equations. Let’s use
the indices m and n to denote a row and column within K. As we did in Example
13.2, for any given grid point index i, j we can ‘map’ to the corresponding row index,
this time as:
m = (j − 1)Nx + i
Now for each i, j grid point there is a connection with the neighboring i ± 1, j and
i, j ± 1 grid points, and we define the corresponding column indices in K as
n = (j − 1)Nx + (i ± 1) and n = (j ± 1 − 1)Nx + i respectively.
This means that in our loop over every grid point we will be adding in to the system:
\[
\begin{aligned}
K_{m,m} &= -\left(\frac{2\mu}{\Delta x^2} + \frac{2\mu}{\Delta y^2}\right) &&\text{grid point } i,j\\
K_{m,n} &= \frac{\mu}{\Delta y^2} + \frac{v_y}{2\Delta y} &&\text{grid point } i,j-1\\
K_{m,n} &= \frac{\mu}{\Delta x^2} + \frac{v_x}{2\Delta x} &&\text{grid point } i-1,j\\
K_{m,n} &= \frac{\mu}{\Delta y^2} - \frac{v_y}{2\Delta y} &&\text{grid point } i,j+1\\
K_{m,n} &= \frac{\mu}{\Delta x^2} - \frac{v_x}{2\Delta x} &&\text{grid point } i+1,j\\
s_m &= \psi &&\text{grid point } i,j
\end{aligned}
\]
We will in fact create a function called assemble which we will call at the beginning
of our simulation, before entering the time marching loop. Similar functions will
be duplicated when we come to solving this same problem using other numerical
methods such as the Finite Volume or Finite Element methods, in order to highlight
the similarities and differences between the numerical methods. In this particular
case, our function could be implemented as:
function [M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, N_x, N_y, ...
N_p, Delta_x, Delta_y)
...
for j=2:N_y-1
for i=2:N_x-1
m = (j -1)*N_x+(i -1)+1;
K(m, m) = K(m, m) - 2*mu/(Delta_x^2) - 2*mu/(Delta_y^2);
n = (j-1-1)*N_x + i;
K(m, n) = K(m, n) + mu/(Delta_y^2) + v(2)/(2*Delta_y);
n = (j-1 )*N_x + i-1;
K(m, n) = K(m, n) + mu/(Delta_x^2) + v(1)/(2*Delta_x);
n = (j-1 )*N_x + i+1;
K(m, n) = K(m, n) + mu/(Delta_x^2) - v(1)/(2*Delta_x);
n = (j+1-1)*N_x + i;
K(m, n) = K(m, n) + mu/(Delta_y^2) - v(2)/(2*Delta_y);
s(m) = psi(i,j);
end
end
...
end
The only issue that we have not considered yet is what to do for the boundary grid
points, because as it stands this code snippet is only looping over the interior grid
points and adding their contributions to M and K. To match up with the dimension
of our matrices, we are going to store our field in a 2D array as:
phi = zeros(N_p, N_t);
so our vector of unknowns will actually include the Dirichlet boundary points, even
though, technically they are not unknown variables. Let’s first look at how we
handle the Neumann boundary points. Returning to the idea mentioned earlier,
the Neumann boundary condition will give us an extra equation that we can use to
modify the discrete equation at that boundary point. If we think about the x = 1
boundary we have:
\[
\frac{\partial \phi_{N_x,j}}{\partial x} = \nabla\phi_b = 0
\]
If we again approximate this derivative with a second order central difference we
get:
\[
\frac{d\phi_{N_x,j}}{dt}
= \left(\frac{\mu}{\Delta y^2} + \frac{v_y}{2\Delta y}\right)\phi_{N_x,j-1}
+ \frac{2\mu}{\Delta x^2}\,\phi_{N_x-1,j}
- \left(\frac{2\mu}{\Delta x^2} + \frac{2\mu}{\Delta y^2}\right)\phi_{N_x,j}
+ \left(\frac{\mu}{\Delta y^2} - \frac{v_y}{2\Delta y}\right)\phi_{N_x,j+1}
+ \left(\frac{\mu}{\Delta x^2} - \frac{v_x}{2\Delta x}\right)2\Delta x\,\nabla\phi_b
+ \psi
\]
where the coefficients of φNx +1,j have been added to the coefficients of φNx −1,j and the
term involving ∇φb will be incorporated into the load vector along with ψ. We can
analogously do the same thing at the y = 1 boundary, adding the coefficients
of φi,Ny +1 to those of φi,Ny −1 :
\[
\frac{d\phi_{i,N_y}}{dt}
= \frac{2\mu}{\Delta y^2}\,\phi_{i,N_y-1}
+ \left(\frac{\mu}{\Delta x^2} + \frac{v_x}{2\Delta x}\right)\phi_{i-1,N_y}
- \left(\frac{2\mu}{\Delta x^2} + \frac{2\mu}{\Delta y^2}\right)\phi_{i,N_y}
+ \left(\frac{\mu}{\Delta x^2} - \frac{v_x}{2\Delta x}\right)\phi_{i+1,N_y}
+ \left(\frac{\mu}{\Delta y^2} - \frac{v_y}{2\Delta y}\right)2\Delta y\,\nabla\phi_b
+ \psi
\]
At the corner grid point (Nx , Ny ), where the two Neumann boundaries meet, both modifications are combined:
\[
\frac{d\phi_{N_x,N_y}}{dt}
= \frac{2\mu}{\Delta y^2}\,\phi_{N_x,N_y-1}
+ \frac{2\mu}{\Delta x^2}\,\phi_{N_x-1,N_y}
- \left(\frac{2\mu}{\Delta x^2} + \frac{2\mu}{\Delta y^2}\right)\phi_{N_x,N_y}
+ \left(\frac{\mu}{\Delta x^2} - \frac{v_x}{2\Delta x}\right)2\Delta x\,\nabla\phi_b
+ \left(\frac{\mu}{\Delta y^2} - \frac{v_y}{2\Delta y}\right)2\Delta y\,\nabla\phi_b
+ \psi
\]
So, to add the contributions of the Neumann boundary points to our system of
equations, we can append some code to our assemble function to loop through
all of the Neumann boundary points and add their contributions to K and s as:
function [M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, N_x, N_y, ...
N_p, Delta_x, Delta_y)
...
gradphi_b = 0;
...
for i=2:N_x-1
j = N_y;
m = (j -1)*N_x+(i -1)+1;
K(m, m) = K(m, m) - 2*mu/(Delta_x^2) - 2*mu/(Delta_y^2);
n = (j-1-1)*N_x + i;
K(m, n) = K(m, n) + 2*mu/(Delta_y^2);
n = (j-1 )*N_x + i-1;
K(m, n) = K(m, n) + mu/(Delta_x^2) + v(1)/(2*Delta_x);
n = (j-1 )*N_x + i+1;
K(m, n) = K(m, n) + mu/(Delta_x^2) - v(1)/(2*Delta_x);
s(m) = s(m) + (mu/(Delta_y^2) - v(2)/(2*Delta_y))*(2*Delta_y*gradphi_b) + psi;
end
for j=2:N_y-1
i = N_x;
m = (j -1)*N_x+(i -1)+1;
K(m, m) = K(m, m) - 2*mu/(Delta_x^2) - 2*mu/(Delta_y^2);
n = (j-1-1)*N_x + i;
K(m, n) = K(m, n) + mu/(Delta_y^2) + v(2)/(2*Delta_y);
n = (j-1 )*N_x + i-1;
K(m, n) = K(m, n) + 2*mu/(Delta_x^2);
n = (j+1-1)*N_x + i;
K(m, n) = K(m, n) + mu/(Delta_y^2) - v(2)/(2*Delta_y);
s(m) = s(m) + (mu/(Delta_x^2) - v(1)/(2*Delta_x))*(2*Delta_x*gradphi_b) + psi;
end
i = N_x;
j = N_y;
m = (j -1)*N_x+(i -1)+1;
K(m, m) = K(m, m) - 2*mu/(Delta_x^2) - 2*mu/(Delta_y^2);
n = (j-1-1)*N_x + i;
K(m, n) = K(m, n) + 2*mu/(Delta_y^2);
n = (j-1 )*N_x + i-1;
K(m, n) = K(m, n) + 2*mu/(Delta_x^2);
s(m) = s(m) + (mu/(Delta_x^2) - v(1)/(2*Delta_x))*(2*Delta_x*gradphi_b) ...
            + (mu/(Delta_y^2) - v(2)/(2*Delta_y))*(2*Delta_y*gradphi_b) + psi;
...
end
Finally, for the Dirichlet boundary points, we will not add any contributions to K,
or s, but will instead set the values for those points in the array phi. We can do
this for every time step quite elegantly in Matlab and will do so by assigning the
following two lines of code in our assemble function:
function [M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, N_x, N_y, ...
N_p, Delta_x, Delta_y)
...
phi_b = 1;
...
phi(1:N_x:N_p, :) = phi_b;
phi(1:N_x, :) = phi_b;
end
At this point the system is completely assembled and Figure 13.5 illustrates a portion
of the stiffness matrix.
If we conceptually partition the unknowns into the ‘free’ points (interior and Neumann
boundary points, denoted by the subscript f) and the ‘fixed’ Dirichlet points (subscript c),
then at each time step we only need to solve the free part of the system, with the known
Dirichlet contributions moved to the right hand side:
\[
A_{ff}\,\phi_f^{l+1} = b_f - A_{fc}\,\phi_c^{l+1}
\]
and of course if the Dirichlet boundary conditions are imposing a value of 0 then
the second term on the right hand side will drop out as well. This idea is known as
partitioning a matrix and is an idea that we will use again in Chapter 15 when we
study the Finite Element method. While this idea is nice conceptually, the Dirichlet
boundary points will be ‘scattered’ throughout phi, not located contiguously at the
end. What we can do however is create an explicit list of all of the indices of all of
the Dirichlet boundary points:
Fixed = [1:N_x, 1:N_x:N_p];
and then compute the F ree points as the difference between all of the indices in the
range 1:N_p and the Dirichlet indices with the Matlab function setdiff as:
Free = setdiff(1:N_p, Fixed);
so that the linear solve shown below is only performed for the interior and Neumann
boundary points; as long as the Dirichlet boundary points are initialized to 1 in phi, we
will have imposed the boundary conditions correctly. It is worth mentioning for the
sake of completeness that since we ultimately end up solving Aφl+1 = b at every
time step, and since the rows of A corresponding to the Dirichlet boundary points
will be all zeros while the entries in b will in fact contain the Dirichlet boundary
values, then another way to incorporate the effect of the Dirichlet boundary points
is to put a 1 on the main diagonal for any row corresponding to a Dirichlet boundary
point before performing the linear solve. Although this will enforce the boundary
conditions, a disadvantage is that it can result in A becoming badly scaled if the
other entries are much greater or less than 1. Now that we have shown how we
assemble our system of equations and solve for them at each time step, we are now
in a position to write out the core of the program as:
[M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, N_x, N_y, ...
                                       N_p, Delta_x, Delta_y);
A = M - Delta_t*K;
for l=1:N_t-1
b = M*phi(:,l) + Delta_t*s;
phi(Free,l+1) = A(Free,Free)\(b(Free) - A(Free,Fixed)*phi(Fixed,l+1));
end
Figure 13.6: The solutions to the PDE in Example 13.3 illustrating the solution at
(a) t = 0, (b) t = 0.5, (c) t = 1.0 and (d) t = 1.5.
13.3 Von Neumann Stability Analysis
Having now seen a number of examples using the finite difference method to solve
a PDE it is time to devote some attention to analyzing the stability of a simulation
a little more closely. So far we have seen that our PDE can be broken down into a
system of ODEs of the form:
M φ̇ = Kφ + s
where M = I and depending upon the nature of the problem s may be zero, so it is
not too restrictive to consider problems of the form:
φ̇ = Kφ
Now, in Example 13.1 we saw that we could determine whether the numerical
method was going to be stable by finding the eigenvalues of K and observing where
each λm ∆t lay in the complex plane, relative to the stability region of the time
marching method that we were using for the simulation. This technique is known as
Matrix Stability Analysis and strictly speaking, that is all we need to do in order to
determine whether or not the simulation will be stable. However, for most practical
problems, the size of the system of equations that we are solving means that it’s not
feasible to compute the eigenvalues of K. So we need a better approach. What we
would like to be able to do is to know if our simulation will be stable or not, or put
another way, given a grid spacing, what time step size do we need in order to ensure
that the simulation will be stable when using a conditionally stable time marching
method. We will look at two such techniques for doing just this.
To illustrate the first of these techniques, the von Neumann stability analysis, consider the 1D diffusion equation:
\[
\frac{\partial \phi}{\partial t} = \mu\frac{\partial^{2} \phi}{\partial x^{2}}
\]
If we were to use a second order central difference to discretize the diffusive term,
we would get the semi-discretization:
\[
\frac{d\phi_i}{dt} = \frac{\mu}{\Delta x^2}\left(\phi_{i+1} - 2\phi_i + \phi_{i-1}\right)
\]
Furthermore, if we were to then use the explicit Euler method for the time marching,
we would get the full discretization:
\[
\phi_i^{l+1} = \phi_i^{l} + \frac{\mu\Delta t}{\Delta x^2}\left(\phi_{i+1}^{l} - 2\phi_i^{l} + \phi_{i-1}^{l}\right)
\qquad (13.23)
\]
The basic idea behind the von Neumann stability analysis is that we assume a
solution of the form:
\[
\phi_i^{l} = \sigma^{l} e^{ikx_i} \qquad (13.24)
\]
where k is a wavenumber and σ is an amplification factor per time step. Substituting
this form into Equation 13.23 gives:
\[
\sigma^{l+1} e^{ikx_i} = \sigma^{l} e^{ikx_i}
+ \frac{\mu\Delta t}{\Delta x^2}\left(\sigma^{l} e^{ikx_{i+1}} - 2\sigma^{l} e^{ikx_i} + \sigma^{l} e^{ikx_{i-1}}\right)
\]
Noting that xi+1 = xi + ∆x and xi−1 = xi − ∆x we can then get:
\[
\begin{aligned}
\sigma &= 1 + \frac{\mu\Delta t}{\Delta x^2}\left(\cos(k\Delta x) + i\sin(k\Delta x) - 2 + \cos(k\Delta x) - i\sin(k\Delta x)\right)\\
&= 1 + \frac{2\mu\Delta t}{\Delta x^2}\left(\cos(k\Delta x) - 1\right)
\end{aligned}
\]
Now in Part II we learned that for stability, we must have |σ| ≤ 1, otherwise σ l will
grow unbounded, and so:
\[
\left|\,1 + \frac{2\mu\Delta t}{\Delta x^2}\left(\cos(k\Delta x) - 1\right)\right| \le 1
\]
In other words, we must have:
\[
-1 \le 1 + \frac{2\mu\Delta t}{\Delta x^2}\left(\cos(k\Delta x) - 1\right) \le 1
\]
The right hand inequality is always satisfied since (cos(k∆x) − 1) is always less than
or equal to zero. The left hand inequality can then be recast as:
\[
\frac{2\mu\Delta t}{\Delta x^2}\left(\cos(k\Delta x) - 1\right) \ge -2
\]
or:
\[
\Delta t \le \frac{\Delta x^2}{\mu\left(1 - \cos(k\Delta x)\right)}
\]
The worst (or most restrictive) case occurs when cos(k∆x) = −1. Thus the maxi-
mum time step size that we can choose in order for the simulation to remain stable
is:
\[
\Delta t \le \frac{\Delta x^2}{2\mu}
\]
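As a concrete illustration (using hypothetical values rather than ones taken from an example in this chapter), if µ = 0.01 and ∆x = 0.05, this bound gives ∆t ≤ 0.05²/(2 × 0.01) = 0.125.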
The von Neumann stability analysis works whenever the space dependent terms
are eliminated after substituting the periodic form of the solution given in Equation
13.24. For example, if µ was not constant but some function of x say, then the
von Neumann analysis would not in general work. In this case σ would have to
be a function of x and the simple solution we obtained would no longer be valid.
The same problem would arise if a non-uniformly spaced spatial grid were used. Of
course in these cases the matrix stability analysis would still work.
13.4 Modified Wavenumber Analysis
The second technique, known as modified wavenumber analysis, can again be illustrated with the 1D diffusion equation:
\[
\frac{\partial \phi}{\partial t} = \mu\frac{\partial^{2} \phi}{\partial x^{2}}
\]
The basic idea behind the procedure is that we assume a separable solution of the
form:
\[
\phi(x, t) = \psi(t)\,e^{ikx} \qquad (13.25)
\]
Substituting this form into the diffusion equation and performing the spatial
differentiation analytically gives:
\[
\frac{d\psi}{dt} = -\mu k^{2}\psi \qquad (13.26)
\]
which we could in fact solve analytically. In the assumed form of the solution, k is
known as the wavenumber. In practice, instead of using the analytical differentiation
that led to Equation 13.26, we use a finite difference method to approximate the
spatial derivative. For example, using a second order central difference, we have:
\[
\frac{d\phi_i}{dt} = \frac{\mu}{\Delta x^2}\left(\phi_{i+1} - 2\phi_i + \phi_{i-1}\right)
\]
If we now substitute in our assumed solution φi = ψ(t)e^{ikx_i} then we get:
\[
e^{ikx_i}\frac{d\psi}{dt} = \frac{\mu}{\Delta x^2}\left(\psi e^{ikx_{i+1}} - 2\psi e^{ikx_i} + \psi e^{ikx_{i-1}}\right)
\]
As with the von Neumann stability analysis we can note that xi+1 = xi + ∆x and
xi−1 = xi − ∆x and get:
\[
e^{ikx_i}\frac{d\psi}{dt} = \frac{\mu}{\Delta x^2}\left(\psi e^{ikx_i}e^{ik\Delta x} - 2\psi e^{ikx_i} + \psi e^{ikx_i}e^{-ik\Delta x}\right)
\]
and then dividing both sides by eikxi we get:
\[
\frac{d\psi}{dt} = \frac{\mu}{\Delta x^2}\left(\psi e^{ik\Delta x} - 2\psi + \psi e^{-ik\Delta x}\right)
\]
Then, making use of Euler’s formula we get:
\[
\begin{aligned}
\frac{d\psi}{dt} &= \frac{\mu}{\Delta x^2}\left(\cos(k\Delta x) + i\sin(k\Delta x) - 2 + \cos(k\Delta x) - i\sin(k\Delta x)\right)\psi\\
&= \frac{2\mu}{\Delta x^2}\left(\cos(k\Delta x) - 1\right)\psi\\
&= -\frac{2\mu}{\Delta x^2}\left(1 - \cos(k\Delta x)\right)\psi
\end{aligned}
\]
or put another way:
\[
\frac{d\psi}{dt} = -\mu k^{*2}\psi \qquad (13.27)
\]
where:
\[
k^{*} = \sqrt{\frac{2}{\Delta x^2}\left(1 - \cos(k\Delta x)\right)}
\]
By analogy to Equation 13.26, k ∗ is called the modified wavenumber. The important
point here is that in comparing Equations 13.26 and 13.27 we can see that the use
of the central difference means that we are now solving a different ODE. Therefore our
simulation will most definitely give us results that differ from the analytical solution.
Another key observation is that Equation 13.27 fits the form of the model ordinary
differential equation in Equation 6.4 with λ = −µk ∗2 . In Part II we investigated
the stability properties of various numerical methods for ODEs with respect to the
model initial value problem in Equation 6.4. Now, using the modified wavenumber
analysis, we can readily obtain the stability properties of any of those time marching
methods when applied to a PDE. All we have to do is replace λ with −µk ∗2 in our
stability analysis. The application of any other finite difference quotient (instead
of the second order central difference used here) will also lead to the same form
as Equation 13.27, but with a different modified wavenumber. In fact each finite
difference quotient has a distinct modified wavenumber associated with it.
If we were to use the explicit Euler method to solve Equation 13.27 then because
the modified wavenumbers are all real we can use the result in Equation 6.6:
\[
\Delta t \le \frac{2}{\left|\lambda_{Re}\right|}
\]
to get:
\[
\Delta t \le \frac{2}{\dfrac{2\mu}{\Delta x^2}\left(1 - \cos(k\Delta x)\right)}
\]
The ‘worst case scenario’ (i.e. the maximum limitation on the time step size) occurs
when cos(k∆x) = −1, leading to:
\[
\Delta t \le \frac{\Delta x^2}{2\mu}
\]
which is exactly the same as that obtained with the von Neumann analysis. The
advantage of the modified wavenumber analysis however, is that the stability lim-
its for different time marching methods applied to the same equation are readily
obtained. For example, if instead of the explicit Euler method we had used the fourth
order Runge-Kutta method, whose stability region extends to approximately −2.79 along
the negative real axis, the same reasoning gives the (less restrictive) limit of approximately
∆t ≤ 2.79∆x²/(4µ).
This analysis is also not restricted to the second order central difference. Writing a
general finite difference quotient for the n-th derivative of φ at grid point i in the form:
\[
\frac{\partial^n \phi_i}{\partial x^n} \approx \frac{1}{\Delta x^n}\left(\sum_{m=1}^{M} a_{-N_m}\,\phi_{i-m} + a_0\,\phi_i + \sum_{m=1}^{M} a_{+N_m}\,\phi_{i+m}\right) \qquad (13.28)
\]
We can then substitute in our assumed form of the solution (Equation 13.25) to get:
\[
\frac{\partial^n \phi_i}{\partial x^n} = \frac{1}{\Delta x^n}\left(\sum_{m=1}^{M} a_{-N_m}\,\psi e^{ikx_{i-m}} + a_0\,\psi e^{ikx_i} + \sum_{m=1}^{M} a_{+N_m}\,\psi e^{ikx_{i+m}}\right)
\]
We can then make use of the fact that xi+m = xi + Nm ∆x and xi−m = xi − Nm ∆x:
\[
\frac{\partial^n \phi_i}{\partial x^n} = \frac{1}{\Delta x^n}\left(\sum_{m=1}^{M} a_{-N_m}\,\psi e^{ikx_i}e^{-ikN_m\Delta x} + a_0\,\psi e^{ikx_i} + \sum_{m=1}^{M} a_{+N_m}\,\psi e^{ikx_i}e^{ikN_m\Delta x}\right)
\]
and note that when this partial derivative expression is substituted into a PDE we
will be able to divide both sides by eikxi to get:
\[
\begin{aligned}
\frac{\partial^n \phi_i}{\partial x^n} &= \frac{1}{\Delta x^n}\left(\sum_{m=1}^{M} a_{-N_m}\,\psi e^{-ikN_m\Delta x} + a_0\,\psi + \sum_{m=1}^{M} a_{+N_m}\,\psi e^{ikN_m\Delta x}\right)\\
&= \frac{1}{\Delta x^n}\left(\sum_{m=1}^{M} a_{-N_m}\,e^{-ikN_m\Delta x} + a_0 + \sum_{m=1}^{M} a_{+N_m}\,e^{ikN_m\Delta x}\right)\psi\\
&= k^{*n}\,\psi
\end{aligned}
\]
Let’s now consider some specific finite difference quotients for the first derivative
of a function (i.e. n = 1) and calculate the modified wavenumbers. If we consider
first the forward difference we know that M = 1, a−1 = 0, a0 = −1, and a+1 = 1
and so substituting these coefficients into Equations 13.28 we get:
\[
\begin{aligned}
k^{*} &= \frac{1}{\Delta x}\left(0\times e^{-ik\Delta x} - 1 + 1\times e^{ik\Delta x}\right)\\
&= \frac{1}{\Delta x}\left(\cos(k\Delta x) - 1 + i\sin(k\Delta x)\right)
\end{aligned}
\]
So we can see that in this case, the modified wavenumber is complex. If we consider
the backward difference we know that M = 1, a−1 = −1, a0 = 1, and a+1 = 0.
Substituting these coefficients into Equations 13.28 we get:
\[
\begin{aligned}
k^{*} &= \frac{1}{\Delta x}\left(-1\times e^{-ik\Delta x} + 1 + 0\times e^{ik\Delta x}\right)\\
&= \frac{1}{\Delta x}\left(1 - \cos(k\Delta x) + i\sin(k\Delta x)\right)
\end{aligned}
\]
If we consider the second order central difference for the first derivative we know
that M = 1, a−1 = −1/2, a0 = 0 and a+1 = 1/2. Substituting these coefficients into
Equations 13.28 we get:
\[
\begin{aligned}
k^{*} &= \frac{1}{\Delta x}\left(-\tfrac{1}{2}e^{-ik\Delta x} + 0 + \tfrac{1}{2}e^{ik\Delta x}\right)\\
&= \frac{1}{\Delta x}\left(-\tfrac{1}{2}\cos(k\Delta x) + \tfrac{i}{2}\sin(k\Delta x) + \tfrac{1}{2}\cos(k\Delta x) + \tfrac{i}{2}\sin(k\Delta x)\right)\\
&= \frac{1}{\Delta x}\,i\sin(k\Delta x)
\end{aligned}
\]
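As an illustration of how curves like those in Figure 13.7 can be generated, a minimal Matlab sketch for the second order central difference is given below; this is an assumed sketch, not the program used to produce the figure.

% Hedged sketch: plot the magnitude of the (purely imaginary) modified
% wavenumber of the 2nd order central difference against k*Delta_x.
kDx = linspace(0, pi, 200);   % values of k*Delta_x
kStarDx = sin(kDx);           % |k* Delta_x| for the 2nd order central difference
plot(kDx, kDx, 'k--', kDx, kStarDx, 'b-');
xlabel('k\Deltax'); ylabel('k^*\Deltax');
legend('exact', '2nd order central difference', 'Location', 'northwest');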
Figure 13.7: A plot of the modified wavenumbers versus wavenumbers for some central
differencing methods. Note that the first order forward and backward methods have
the same wavenumber as the second order central difference method.
Figure 13.7 illustrates plots of (k∗∆x)n versus (k∆x)n for three different central
difference quotients defined for a first order derivative. The exact curve illustrates the
analytical solution where there is no error introduced by the use of a finite difference
method. For the other methods, we can observe that at lower values of (k∆x)n
all methods are reasonably close to the exact curve, but as (k∆x)n increases they
deviate away from the exact. Furthermore we can observe that the higher order
methods remain better approximations at larger values of (k∆x)n . The important
point here is that when applying a finite difference method to solve a PDE, ∆x
must be chosen (in conjunction with the particular finite difference method) such
that (k∆x)n is small enough that a reasonable approximation of the spatial
derivatives is achieved.
To elaborate on this point, let’s examine the effects of different finite difference
methods by considering the 1D first order wave equation of Example 13.1:
\[
\frac{\partial \phi}{\partial t} + v\frac{\partial \phi}{\partial x} = 0
\]
Using the second order central difference for the spatial derivative and substituting the
assumed form of the solution, we obtain:
\[
\frac{d\psi}{dt} = -\frac{v}{\Delta x}\left(i\sin(k\Delta x)\right)\psi
\]
If on the other hand we were seeking an analytical solution, we would have:
\[
\frac{d\psi}{dt} = -vik\,\psi
\]
Comparing these two solutions, we have vk∗ in place of vk, and we can then define
the effective propagation speed as:
\[
v^{*} = v\,\frac{\sin(k\Delta x)}{k\Delta x}
\]
For low frequency waves k∗/k ≈ 1 and so v∗ ≈ v. For higher frequency waves, however,
sin(k∆x)/(k∆x) < 1, so v∗ < v and these components of the solution propagate more
slowly than they should; this numerical dispersion is another of the errors introduced by
the spatial discretization.
Chapter 14
Finite Volume Methods
We saw in Chapter 13 that the Finite Difference method required a regular structured
grid (Figure 12.2(a)) over which to discretize the PDE, and so the continuous
scalar field variable φ is then approximated at a collection of grid points. Finite
Volume methods on the other hand break up a domain into a collection of cells
(or volumes) (Figure 12.2(b)) and then a spatial discretization of the original PDE
is performed over each cell. The method is still a local method in that the spatial
discretization at a point will only involve a few immediate neighbors, but Finite Vol-
ume methods offer perhaps the greatest flexibility in terms of the types and mixture
of grids which may be used to represent a computational domain. Furthermore, an
important point to note is that Finite Volume methods tend to be either cell based
or node based, depending upon whether the field variable is stored at the centroid
or the vertices of the cells defining the grid respectively. In this book we will only
concern ourselves with cell based Finite Volume methods (because this is the more
common variation) and we will assume that our spatial domain has been broken up
into a finite sized tessellation of non-overlapping triangles, which completely cover
the domain (i.e. we have a suitable unstructured grid). To begin our derivation, it
will help to consider the arbitrarily shaped spatial domain depicted in Figure 12.1.
Within this domain we will be solving the generic scalar transport equation:
φ̇ + v · ∇φ = µ∇2 φ + ψ
The basic idea behind the Finite Volume method is that we integrate this PDE over
the spatial domain as:
\[
\int_{\Omega}\left(\dot{\phi} + v\cdot\nabla\phi\right) d\Omega
= \int_{\Omega}\left(\mu\nabla^{2}\phi + \psi\right) d\Omega \qquad (14.1)
\]
Now while you may be worried at this point as to how one actually evaluates such
a nasty looking integral, the reality is that we don’t actually try to perform the
integration directly. Instead at this point we make use of a very important theorem
known as Gauss’s Divergence Theorem [12]:
\[
\int_{\Omega}\nabla\cdot f\, d\Omega = \int_{\Gamma} f\cdot d\Gamma
\]
where as was outlined in Chapter 12, the quantity ∇ · f is the divergence of the
vector field f . With reference to the domain depicted in Figure 12.1, the way to
think about what this theorem is saying is to imagine the arbitrarily shaped region
of RD space broken up into little pieces of size dΩ. We are then saying that the
sum of the scalar quantity that is the divergence of f in each of these pieces is equal
to the net flux of f out of the domain. This flux can be envisioned by considering
the piece on the boundary given by dΓ (noting that this is a vector quantity and is
normal to the boundary) and summing up the dot products between f and dΓ over
the boundary.
Following this theorem, some corollaries are:
\[
\int_{\Omega}\left(f\cdot\nabla g + g\,\nabla\cdot f\right) d\Omega = \int_{\Gamma} f g\cdot d\Gamma \qquad (14.2)
\]
\[
\int_{\Omega}\nabla^{2} f\, d\Omega = \int_{\Gamma}\nabla f\cdot d\Gamma \qquad (14.3)
\]
\[
\int_{\Omega}\nabla f\, d\Omega = \int_{\Gamma} f\, d\Gamma \qquad (14.4)
\]
where in Equation 14.2 f denotes a vector field and g a scalar field, while in Equations 14.3 and 14.4 f denotes a scalar field. The reason
for introducing this theorem is because we are going to replace some of the domain
integrals with boundary integrals in Equation 14.1. Now although we haven’t stated
it explicitly, the vector field v that we have been using in the convective term of
our generic scalar transport equation is what is known as a solenoidal vector field
[51], meaning that it is divergence free (i.e. ∇ · v = 0). This assumption was made
somewhere along the line during the derivation of the scalar transport equation (for
instance in fluid mechanics when assuming incompressible flow), and since we are
more interested in solving it, rather than deriving its form for different physical
scalar quantities, we will just ‘go with it’ and accept this assumption. As it happens
however, a more general expression for the convective term in the generic scalar
transport equation is ∇ · (vφ) (compared to v · ∇φ). Given the solenoidal nature of
v we can use Equation 14.2 and cancel out the second term on the left hand side to
get:
\[
\int_{\Omega} v\cdot\nabla\phi\, d\Omega = \int_{\Gamma} v\phi\cdot d\Gamma
\]
Alternatively we could have simply brought v inside the derivative in the convective
term (i.e. ∇ · (vφ)) and used the original form of the divergence theorem, but this
still requires the assumption of v being solenoidal. If we now apply Equation 14.3
to the diffusive term we get:
\[
\int_{\Omega}\nabla^{2}\phi\, d\Omega = \int_{\Gamma}\nabla\phi\cdot d\Gamma
\]
The next key step in the derivation of the Finite Volume method is that we think
of our domain as being an arbitrary cell within the mesh and therefore the ‘domain’
means either the length, area, or volume of the cell (depending on whether or not
we are thinking in 1D, 2D, or 3D respectively) and the boundary will imply the
points, edges, or faces that define the cell (again, depending on whether or not we
are thinking in 1D, 2D, or 3D respectively). We then assume that φ and ψ are
constant throughout the cell and can be pulled out of the domain integrals. We
also replace the boundary integrals with summations over the faces comprising the
boundary, hence we end up with the discrete form:
\[
\dot{\phi}\,\Omega + \sum_{f}^{N_f} v_f\phi_f\cdot\Gamma_f = \sum_{f}^{N_f} \mu_f\nabla\phi_f\cdot\Gamma_f + \psi\,\Omega
\]
where the subscript f denotes a face on the boundary of the domain of which Nf
is the total number, and the values φf are defined at the face centroid. It can be
observed that the domain integrals have been replaced by a multiplication by the
domain. This means a multiplication by ‘say’ the area of the domain in 2D or
the volume of the domain in 3D. This discrete integral form of the generic scalar
transport equation will subsequently be applied to every cell in the grid (much like
a finite difference equation is applied to every point in the grid) and can be written
as:
\[
\Omega_c\,\dot{\phi}_c + \sum_{f}^{N_f} v_f\phi_f\cdot\Gamma_f = \sum_{f}^{N_f} \mu_f\nabla\phi_f\cdot\Gamma_f + \Omega_c\,\psi_c
\]
so that we can end up with a coupled system of equations which can be written in
the form:
M φ̇ = Kφ + s
where φ is now a vector defining values at the cell centroid of each cell in the grid.
Now, the final piece remaining before we can assemble the matrices and solve the
system is how we evaluate φ and ∇φ at the face centroids. The way we do this is to
relate (or interpolate) the face centroid values from the cell centroid values of the two
cells which share a given face. This means that ultimately our system of equations
will only involve cell centroid values of φ, which is our solution to the PDE. Just as
we could essentially employ any order of finite difference expression to approximate
the derivatives in the generic scalar transport equation, we can similarly with the
Finite Volume method employ a variety of approximations for φf and ∇φf . In
order to proceed with the derivation we shall use the simplest approximations for
now. Any given face in the interior of the grid (i.e. forgetting about the faces that
comprise the boundary of the grid) will be shared by two adjacent cells. We can
arbitrarily assign one of the cells to be the owner of the face and one to be the
neighbor of the face and while this assignment is arbitrary, the important point is
that the face area vector Γf is defined positive pointing out of the owner cell and
in to the neighbor cell (Figure 14.1).
Figure 14.1: A schematic illustrating two cells in an unstructured grid sharing a face.
The vector Γf is defined as normal to its face, directed from the owner cell to the
neighbor cell. The length vector ∆ζ is defined between the centroids of the owner
and neighbor cells. It can be observed that generally these two vectors will not be
parallel, except for the case where the grid is composed of equilateral triangles.
If we first consider the convective term, one of the simplest ways in which we
can evaluate φf is to use a central difference approximation:
\[
\phi_f = \frac{\phi_n + \phi_o}{2}
\]
which is simply the mean of φc for the neighbor and owner cells. As a quick note,
the use of upwinding methods, analogous to the forward difference in Equation 13.5
are also commonly used. If we now consider the diffusive term, one of the simplest
ways in which we can evaluate the gradient at the face is via:
\[
\nabla\phi_f\cdot\Gamma_f = \frac{\Gamma_f}{\Delta\zeta}\left(\phi_n - \phi_o\right)
\]
where ∆ζ is the distance between the cell centroids of the two cells sharing the face
f (Figure 14.1) and Γf is the magnitude of the vector Γf . This approximation is
essentially treating the gradient in terms of ‘rise over run’. An important assumption
that we have made is that the line ∆ζ and the vector Γf are parallel. This is a pretty
bad assumption that we will address and improve later. We can now write out the
matrices for our semi-discretization as:
\[
M = \sum_{c=1}^{N_c}\Omega_c \qquad (14.5)
\]
\[
K = \sum_{f=1}^{N_f}\left(\pm\frac{\mu_f\Gamma_f}{\Delta\zeta} \pm \frac{v_f\cdot\Gamma_f}{2}\right) \qquad (14.6)
\]
\[
s = \sum_{c=1}^{N_c}\Omega_c\,\psi_c \qquad (14.7)
\]
where Nc is the number of cells in the grid. The ± signs in K come about because
these terms will contribute differently to the owner and neighbor cells of the face.
The last issue before we can apply the Finite Volume method in an example
is how we impose boundary conditions. Since the boundary of our computational
domain is comprised of faces, it is hence on these faces where we are prescribing
our boundary conditions. An interesting comparison can be made with the Finite
Difference method, where we stored our solution at grid points, and the boundary
conditions were also defined at grid points. Here however, our solution is defined at
cell centroids, but our boundaries are defined at face centroids, so one implication
of this is that we won’t have to worry about matrix partitioning. For a Dirichlet
boundary condition we hence have:
φf = φb
This boundary condition will contribute to both K and s, and to see how, consider
the way the boundary face contributes to the convective and diffusive terms:
\[
\frac{\mu_b\Gamma_b}{\Delta\zeta}\left(\phi_b - \phi_o\right) - v_b\cdot\Gamma_b\,\phi_b
\]
where the coefficient of φo will be added into K and the remainder (which is
all known) will be added into s. It is important to note that ∆ζ in this case would
be defined as the distance between the cell centroid and the face centroid of the
boundary face. For a Neumann boundary condition we have:
∇φf = ∇φb
This boundary condition will also contribute to both K and s, and to see how, again
consider the way the boundary face contributes to the convective and diffusive terms:
\[
\mu_b\Gamma_b\,\nabla\phi_b - v_b\cdot\Gamma_b\left(\phi_o + \Delta\zeta\,\nabla\phi_b\right)
\]
Again the coefficient of φo will be added into K and the remainder (which is all
known) will be added into s.
Example 14.1:
In this example we will develop a Matlab program to solve the 2D generic scalar
transport equation:
φ̇ + v · ∇φ = µ∇2 φ + ψ (14.8)
in the domain x ∈ [0, 1], y ∈ [0, 1], t ∈ [0, 2], with boundary conditions φ(0, y) = 1,
φ(x, 0) = 1, ∂x φ(1, y) = 0, ∂y φ(x, 1) = 0, initial condition φ(x, y, 0) = e^{−50(x−0.3)²} +
1, and v = {0.5, 0.5}, µ = 0.01, and ψ = 0.2. For the spatial discretization we
will use the Finite Volume method with triangular elements and for the temporal
discretization we will use the implicit Euler method. The intended learning outcomes
for this example will be to ‘get a feel’ for applying the Finite Volume method to solve a
PDE and to investigate the concept of ‘assembling’ the matrix defining our system
of equations by looping over the faces in our grid.
To begin we are going to need to make some definitions regarding the data
structures that we are going to use to solve the problem. Life was comparatively
easy when we were using Finite Difference methods to solve PDEs because we could
define our computational domain as something like:
phi = zeros(N_x, N_y, N_t);
and navigate through the grid with i, j, l counters, and that was all we needed
to define the geometry and the connectivity. The ‘price that we pay’ for being
able to deal with complex spatial domains is that we must explicitly store all of
the geometry and connectivity information defining the unstructured grid. As such
we will be storing four arrays for this problem; an array called Points which is an
Np × 2 array storing the x, y coordinates of the points defining the grid, an array
called Faces which is an Nf × 2 array storing the two indices of the points defining
a face in the grid, an array called Cells which is an Nc × 3 array storing the three
indices of the points defining a cell in the grid, and an array called NeighborOwner
which is an Nf × 2 array defining the indices of the owner and the neighbor cells of
each face in the grid. We are going to assume that the Faces array is ordered such
that all of the internal faces (of which there are Nif ) are contiguous and come first
in the array, followed by the boundary faces, where the faces of each boundary are
contiguous in the array (i.e. all the faces corresponding to the lower x boundary
come first, then all the edges corresponding to the upper y boundary second, the
upper x boundary third, and the lower y boundary last). In order to prescribe our
boundary conditions we are going to make use of a structure to store all of the
information that we need:
Boundaries = struct(...
    'start', {2873, 2901, 2929, 2957},...
    'N',     {28, 28, 28, 28},...
    'type',  {'neumann', 'neumann', 'dirichlet', 'dirichlet'},...
    'value', {0, 0, 1, 1});
where the field start will store the index of the first of the contiguous faces compris-
ing a given boundary, N is the total number of faces defining the boundary, type is
a string variable telling us what type of boundary condition to apply, and value is
the value of either the Dirichlet or Neumann condition. Since we are assuming that
all of this data is ‘at the ready’ it will all be defined in a function called readGrid,
which we will call at the start of our program.
At this point we can apply the spatial discretization that is the Finite Volume
method to our PDE and we know that we will have a semi-discrete system of the
form:
dφ
M = Kφ + s
dt
where:
\[
M = \sum_{c=1}^{N_c}\Omega_c, \qquad
K = \sum_{f=1}^{N_f}\left(\pm\frac{\mu_f\Gamma_f}{\Delta\zeta} \pm \frac{v_f\cdot\Gamma_f}{2}\right), \qquad
s = \sum_{c=1}^{N_c}\Omega_c\,\psi_c
\]
Then, applying the implicit Euler method for the temporal discretization we get:
\[
M\,\frac{\phi^{l+1}-\phi^{l}}{\Delta t} = K\phi^{l+1} + s
\]
which can be rearranged to:
Aφl+1 = b
where:
A = M − ∆tK
b = M φl + ∆ts
As with the Finite Difference method applied in Example 13.3 the core part of
the method is the ‘assembling’ of the matrices defining the system of equations. To
assemble these matrices we are going to need to evaluate Ωc and Γf for the cells and
faces respectively. Now in our 2D example Ωc is the area of each triangle, which we
can evaluate from the coordinates (x1 , y1 ), (x2 , y2 ), (x3 , y3 ) of its three vertices as:
\[
\Omega_c = \tfrac{1}{2}\left|x_1(y_2 - y_3) + x_2(y_3 - y_1) + x_3(y_1 - y_2)\right| \qquad (14.9)
\]
while the face area vector Γf , which in 2D is normal to the face and has a magnitude equal
to the face length, can be evaluated from the two points defining the face as:
\[
\Gamma_f = -\{y_1 - y_2,\; x_2 - x_1\} \qquad (14.10)
\]
and we can implement this in our Matlab program via the code:
for f=1:N_f
x = Points(Faces(f, :), 1);
y = Points(Faces(f, :), 2);
Gamma(f,:) = -[y(1)-y(2), x(2)-x(1)];
end
where again xp , yp are the x and y coordinates of the two points defining the edge.
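The cell areas Ωc, which are needed below when building the mass matrix and source vector, can be computed in an entirely analogous loop; the sketch below is an assumption consistent with the Points and Cells arrays described earlier, not the book's listing.

% Hedged sketch: area of each triangular cell from its three vertices.
Omega = zeros(N_c, 1);
for c=1:N_c
    x = Points(Cells(c, :), 1);
    y = Points(Cells(c, :), 2);
    Omega(c) = 0.5*abs(x(1)*(y(2)-y(3)) + x(2)*(y(3)-y(1)) + x(3)*(y(1)-y(2)));
end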
We will also need the coordinates of both the cell centroids and the face centroids
in order to assemble K. The cell centroids can be calculated via:
\[
\mathrm{Centroid}_c = \left(\frac{x_1 + x_2 + x_3}{3},\; \frac{y_1 + y_2 + y_3}{3}\right)
\]
and implemented via the code:
for c=1:N_c
x = Points(Cells(c, :), 1);
y = Points(Cells(c, :), 2);
cellCentroids(c,:) = [sum(x), sum(y)]/3;
end
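The face centroids are also needed later, when the boundary conditions are applied (the faceCentroids array); a minimal sketch, again an assumption rather than the book's listing, simply takes the midpoint of the two points defining each face:

% Hedged sketch: face centroids as the midpoints of each face's two points.
faceCentroids = zeros(N_f, 2);
for f=1:N_f
    x = Points(Faces(f, :), 1);
    y = Points(Faces(f, :), 2);
    faceCentroids(f,:) = [sum(x), sum(y)]/2;
end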
[Figure 14.2: a small portion of the unstructured grid near a boundary, labeling the cells c777, c778, c779, c811, and c812, interior faces such as f1219 and f1266, and boundary faces such as f2956, together with the prescribed boundary values φb.]
If we denote the convective and diffusive coefficients of a face f by Cf and Df (the quantities computed as C_f and D_f in the code below), then the contribution of an interior face such as f1221 to the ODEs of the two cells sharing it (here cells 811 and 779) can be written, partially complete, as:
\[
\begin{aligned}
\Omega_{811}\frac{d\phi_{811}}{dt} &= \left(+C_{f1221} - D_{f1221}\right)\phi_{811} + \left(+C_{f1221} + D_{f1221}\right)\phi_{779} + \ldots\\
\Omega_{779}\frac{d\phi_{779}}{dt} &= \left(-C_{f1221} + D_{f1221}\right)\phi_{811} + \left(-C_{f1221} - D_{f1221}\right)\phi_{779} + \ldots
\end{aligned}
\]
More generally, we could write this for any interior face in the grid as:
\[
\begin{aligned}
\Omega_{n}\frac{d\phi_{n}}{dt} &= \left(+C_{f} - D_{f}\right)\phi_{n} + \left(+C_{f} + D_{f}\right)\phi_{o} + \ldots\\
\Omega_{o}\frac{d\phi_{o}}{dt} &= \left(-C_{f} + D_{f}\right)\phi_{n} + \left(-C_{f} - D_{f}\right)\phi_{o} + \ldots
\end{aligned}
\]
Now in order to actually implement this in our Matlab code our assemble func-
tion is going to involve ‘looping’ over all of the interior faces of the mesh and adding
contributions to K and the algorithm will look something like:
function [M, K, s, phi] = assemble(M, K, s, phi, Points, Faces, Cells, ...
NeighborOwner, Boundaries, N_p, N_if, ...
N_bf, N_f, N_c, N_b);
...
M = sparse(diag(Omega));
s = psi.*Omega;
for f=1:N_if
o = NeighborOwner(f, 1);
n = NeighborOwner(f, 2);
Delta_zi = norm(cellCentroids(n,:)-cellCentroids(o,:));
C_f = dot(v, Gamma(f, :))/2;
D_f = mu*norm(Gamma(f, :))/Delta_zi;
K(o, o) = K(o, o) - C_f - D_f;
K(o, n) = K(o, n) - C_f + D_f;
K(n, n) = K(n, n) + C_f - D_f;
K(n, o) = K(n, o) + C_f + D_f;
end
...
Now, in order to apply the boundary conditions we can loop over every boundary in
our structure, then loop over every face within the boundary, and add the contribu-
tions of the boundary conditions to K and s. Let’s look at how the boundary faces
2901 and 2956 affect the ODEs for the cells 779 and 811. We could again write the
‘partially complete’ ODEs as:
\[
\begin{aligned}
\Omega_{811}\frac{d\phi_{811}}{dt} &= \frac{\mu\Gamma_{2956}}{\Delta\zeta_{2956}}\left(\phi_b - \phi_{811}\right) - v\cdot\Gamma_{2956}\,\phi_b + \ldots\\
\Omega_{779}\frac{d\phi_{779}}{dt} &= \mu\Gamma_{2901}\,\nabla\phi_b - v\cdot\Gamma_{2901}\left(\phi_{779} + \Delta\zeta_{2901}\,\nabla\phi_b\right) + \ldots
\end{aligned}
\qquad (14.13)
\]
If we again substitute the definitions in Equation 14.12 into 14.13 and collect coef-
ficients we get:
\[
\begin{aligned}
\Omega_{811}\frac{d\phi_{811}}{dt} &= -D_{f2956}\,\phi_{811} + \left(D_{f2956} - C_{f2956}\right)\phi_b + \ldots\\
\Omega_{779}\frac{d\phi_{779}}{dt} &= -C_{f2901}\,\phi_{779} + \left(D_{f2901} - C_{f2901}\right)\Delta\zeta_{2901}\,\nabla\phi_b + \ldots
\end{aligned}
\]
More generally, we could write this for any boundary face in the grid as:
\[
\begin{aligned}
\Omega_{o}\frac{d\phi_{o}}{dt} &= -D_{f}\,\phi_{o} + \left(D_{f} - C_{f}\right)\phi_b + \ldots\\
\Omega_{o}\frac{d\phi_{o}}{dt} &= -C_{f}\,\phi_{o} + \left(D_{f} - C_{f}\right)\Delta\zeta\,\nabla\phi_b + \ldots
\end{aligned}
\qquad (14.14)
\]
Remember that a boundary face only has an owner cell. An important point
to note is that the first terms on the right hand side of Equation 14.14 will be
added to K, while the second terms on the right hand side will be added to s. In
order to implement this in our Matlab code we will add in some more code to our
assemble function, ‘looping’ over all of the boundary faces of the mesh and adding
contributions to K and s and the algorithm will look something like:
function [M, K, s, phi] = assemble(M, K, s, phi, Points, Faces, Cells, ...
NeighborOwner, Boundaries, N_p, N_if, ...
N_bf, N_f, N_c, N_b);
...
for b=1:N_b
for f=Boundaries(b).start:Boundaries(b).start+Boundaries(b).N-1
o = NeighborOwner(f,1);
C_f = dot(v, Gamma(f,:));
Delta_zi = norm(faceCentroids(f,:)-cellCentroids(o,:));
D_f = mu*norm(Gamma(f, :), 2)/Delta_zi;
if strcmp(Boundaries(b).type,'dirichlet')
K(o, o) = K(o, o) - D_f;
s(o) = s(o) + (D_f - C_f) * Boundaries(b).value;
elseif strcmp(Boundaries(b).type,'neumann')
K(o, o) = K(o, o) - C_f;
s(o) = s(o) + (D_f - C_f)*Delta_zi* Boundaries(b).value;
end
end
end
end
Figure 14.3: The pattern of the assembled stiffness matrix K using the Finite Volume
method.
At this point, the system is completely assembled (Figure 14.3). It can be observed
that in comparison to the stiffness matrix for the system in Example 13.3, this
matrix is much less ordered, although it is symmetric.
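Sparsity patterns of this kind can be inspected directly from the assembled matrix; a one-line sketch (not part of the book's listings) is:

spy(K);   % plot the sparsity pattern of the assembled stiffness matrix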
In order to implement this in our Matlab program the core of the algorithm will
look like:
assemble();
A = M - Delta_t*K;
for l=1:N_t-1
b = M*phi(:,l) + Delta_t*s;
phi(:,l+1) = A\b;
end
The complete program is given in Example14_1.m. Figures 14.4(a) - 14.4(d)
illustrate the solution at a number of different time steps. It can be observed that
as with Example 13.3, as time progresses the bell shaped surface (which is the initial
condition) moves through the domain (due to the convective term) and spreads out
(due to the diffusive term), and the domain as a whole rises (due to the source term).
283
2.0 2.0
1.8 1.8
1.6 1.6
φ
φ
1.4 1.4
1.2 1.2
1.0 1.0
0.0 1.0 0.0 1.0
0.2 0.8 0.2 0.8
0.4 0.6 0.4 0.6
0.6 0.4 0.6 0.4
0.8 0.2 0.8 0.2
1.0 0.0 1.0 0.0
y y
x x
(a) (b)
2.0 2.0
1.8 1.8
1.6 1.6
φ
1.4 1.4
1.2 1.2
1.0 1.0
0.0 1.0 0.0 1.0
0.2 0.8 0.2 0.8
0.4 0.6 0.4 0.6
0.6 0.4 0.6 0.4
0.8 0.2 0.8 0.2
1.0 0.0 1.0 0.0
y y
x x
(c) (d)
Figure 14.4: The solutions to the PDE in Example 14.1 illustrating the solution at
(a) t = 0, (b) t = 0.5, (c) t = 1.0 and (d) t = 1.5.
As was mentioned earlier, the approximation that we used for the ∇φf is not a
particularly good one, the reason being that in general ∆ζ and Γf are not parallel.
What this implies is that we don’t accurately capture the diffusion across the face
of each cell, reducing the accuracy of the method. What is commonly done is to
calculate the gradient as:
\[ \nabla\phi_f\cdot\Gamma_f = \frac{\Gamma_f\cdot\Gamma_f}{\Gamma_f\cdot e_\zeta}\left(\phi_n - \phi_o\right) + S \]
where eζ is the unit vector in the direction between the two cell centroids sharing the face f. The first term is known as the primary diffusion while the term S is known as the secondary diffusion. The primary diffusion accounts for the part of the diffusion aligned with the unit vector eζ, while the secondary diffusion accounts for the diffusion along the length of the face. Now the secondary diffusion also depends upon φo and φn, but it is usually treated explicitly, meaning that it is assumed known, and it contributes to s, not K.
In practice an iterative method is usually used to solve the system of equations at
each time step, partly due to the fact that the generic scalar transport equation can
often be nonlinear (take for example the solution of the Navier-Stokes equations),
but also because the resulting systems can be too large for direct methods. The
point being that in practice, incorporating the secondary diffusion explicitly is less
of a problem, but of course it does require explicitly calculating the face gradients
at each iteration. This was the reason that we ignored the secondary diffusion in
Example 14.1.
A final point worth mentioning is that there are a variety of higher order methods
available for the discretization of the convective and diffusive terms. While we used
the simplest (and lowest order) methods to simplify the introduction of the Finite
Volume method, in practice higher order methods would generally be used. For the
interested reader, two excellent references for more detailed aspects of the Finite
Volume method can be found in the books by Versteeg [75] and Murthy [70].
Chapter 15

Finite Element Methods
Similar to Finite Volume methods, Finite Element methods present a way by which
we can solve a PDE in a spatial domain with a much more complex geometrical
structure compared to Finite Difference methods. The domain is broken up into a
collection of cells and then a spatial discretization is performed over each cell.
A key difference in terminology however, is that the cells are usually called elements
and the vertices of the elements are usually called nodes. Furthermore, the field
variable φ is defined at the nodes of an element, in contrast to a cell based Finite
Volume method where it was defined at the cell centroid. The elements used in 1D
are simply lines, in 2D they are commonly triangles or quadrilaterals, and in 3D,
tetrahedra or hexahedra (Figure 15.1). Just as there are higher order interpolation
methods available to the Finite Difference and Finite Volume methods, with the
Finite Element method we can use higher order elements to obtain a more accurate
solution. Furthermore, the Finite Element method is also a local method in that
the discretization within an element will only involve its own nodes.
Figure 15.1: Some common elements. In 1D, lines, in 2D, triangles and quadrilat-
erals, and in 3D, tetrahedra and hexahedra. Each row depicts the linear, quadratic,
and cubic versions of the element.
The theory of the Finite Element method is found in variational calculus and
there are generally two procedures one would normally use to solve a PDE using
the Finite Element method, known as the Rayleigh-Ritz and Galerkin methods
(both of which are subsets of the method of weighted residuals [62]). Similar to
the derivation of our Finite Volume method, we must choose a specific technique
and as such will only concern ourselves with the Galerkin method. It should be
understood that the method of weighted residuals is a mathematical technique in its own right and is merely employed as a part of the Finite Element method,
analogous to being one of many ‘ingredients’. As with the Finite Volume method
we will assume that our spatial domain has been broken up into a tessellation of a
finite number of non-overlapping elements, which completely cover the domain (i.e.
we have a suitable unstructured grid). Before proceeding to discretize the generic
scalar transport equation however, it is worth taking a brief detour to investigate
the method of weighted residuals. To illustrate by way of example, let’s consider
the ODE from Example 6.1
\[ \frac{d\phi}{dx} = 1 - \phi \]
in the domain x ∈ [0, 10], with boundary condition φ(0) = 0. In this case we know
that the solution will be φ(x) = 1 − e−x , but we’ll assume a trial solution of the
form:
\[ \phi(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_N x^N \quad (15.1) \]
or more generally, we could write our trial solution as:
\[ \phi(x) = \sum_{n=0}^{N} a_n\,p_n(x) \]
where the a_n terms are coefficients that need to be determined and the p_n terms are the powers of x, known as basis functions or trial functions. Applying the boundary condition we find that a_0 = 0 and hence:
\[ \phi = a_1 x + a_2 x^2 + \ldots + a_N x^N \]
If we only consider terms up to second order in our trial solution, then we can
substitute our trial solution into the original ODE and perform the differentiation
to get:
\[ \frac{d\phi}{dx} + \phi - 1 \approx \left(a_1 + 2a_2 x\right) + \left(a_1 x + a_2 x^2\right) - 1 = a_1\left(1 + x\right) + a_2\left(2x + x^2\right) - 1 \]
Now let’s assume that our trial solution for φ doesn’t satisfy our original ODE
exactly. In this case we may define a residual as:
\[ r := \frac{d\phi}{dx} + \phi - 1 = a_1\left(1 + x\right) + a_2\left(2x + x^2\right) - 1 \quad (15.2) \]
which will be exactly zero if the trial solution satisfies the ODE, and non-zero oth-
erwise. It is important to note that the meaning of the word ‘residual’ is different
compared to the context of an iterative method for solving a linear system of equa-
tions. In that sense we were talking about a column vector where each entry is a measure of the error for each variable, whereas here we can think of the residual as a continuous function of space. In general we cannot force the residual to vanish everywhere, no matter how many terms we include in our trial solution. The idea of the method of weighted
residuals however is that we can multiply the residual by a weighting function and
force the integral of the weighted expression over the domain to vanish, i.e:
\[ \int_\Omega W(x)\,r(x)\,d\Omega = 0 \quad (15.3) \]
In the Galerkin method the weighting functions are chosen to be the trial (basis) functions themselves, so for our example we require:
\[ \int_0^{10} p_n\,r\,dx = 0 \]
Substituting in the residual and the weighting functions and performing the
integration we get:
\[ \int_0^{10} x\left[a_1(1+x) + a_2(2x + x^2) - 1\right]dx = \left[a_1\left(\frac{x^2}{2} + \frac{x^3}{3}\right) + a_2\left(\frac{2x^3}{3} + \frac{x^4}{4}\right) - \frac{x^2}{2}\right]_0^{10} = 383a_1 + 3167a_2 - 50 \]
and:
\[ \int_0^{10} x^2\left[a_1(1+x) + a_2(2x + x^2) - 1\right]dx = \left[a_1\left(\frac{x^3}{3} + \frac{x^4}{4}\right) + a_2\left(\frac{2x^4}{4} + \frac{x^5}{5}\right) - \frac{x^3}{3}\right]_0^{10} = 2833a_1 + 25000a_2 - 333 \]
Setting both weighted residuals to zero and solving the resulting 2 × 2 system for the coefficients gives the two-term approximation:
\[ \phi = 0.3182x - 0.0227x^2 \]
Figure 15.2: Galerkin Method of Weighted residual solutions to the ODE in Example
6.1.
Figure 15.2 illustrates the solution to this ODE comparing the exact solution to
weighted residual solutions employing 1 to 4 terms in the trial solution. It can
be observed that the bigger our set of weighting functions, the more accurate our
solution, but the bigger the system of equations that we will have to solve.
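To make the procedure concrete, the following is a minimal Matlab sketch (not one of the book's listings) that assembles and solves the Galerkin weighted residual system for this ODE with N terms in the trial solution; the integrals are evaluated analytically using ∫₀^L x^m dx = L^(m+1)/(m+1):

N = 2;                      % number of terms in the trial solution
L = 10;                     % length of the domain
A = zeros(N, N);
b = zeros(N, 1);
for m = 1:N
    for n = 1:N
        % int_0^L x^m (n*x^(n-1) + x^n) dx, from r = sum a_n (p_n' + p_n) - 1
        A(m, n) = n*L^(m+n)/(m+n) + L^(m+n+1)/(m+n+1);
    end
    b(m) = L^(m+1)/(m+1);   % int_0^L x^m dx
end
a = A\b;                    % for N = 2 this gives a = [0.3182; -0.0227]

Increasing N enlarges the linear system but improves the approximation, exactly as observed in Figure 15.2.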
Having now seen a simple example of the Galerkin method of weighted residuals
the next important step in the derivation of our Finite Element method is how we
express our field variable φ and its derivatives. Thinking of φ most generally as
being a continuous function of space and time, we will write the assumed solution
in the form:
\[ \phi(x, t) = \sum_{n=1}^{N_n} \eta_n(x)\,\phi_n(t) \quad (15.5) \]
where the ηn terms are known as shape functions, which are functions of spatial
location only, while the nodal values of φ are functions of time only. We will look at
how shape functions are derived for some of the element types in Figure 15.1 shortly,
but for now, let’s just say that they come about from defining a trial solution similar
to that in Equation 15.1 and manipulating it such that the power series solution is
expressed in terms of the nodal values. We will use the notation Nn to denote
the number of nodes in an element (e.g. 2 for a linear line element, 3 for a linear
triangular element, 4 for a linear tetrahedral element), and Np as normal to mean
the total number of nodes (or points) in the grid, which should obviously be much
larger.
Having made this assumption for our assumed solution we can express the tem-
poral derivative of our field as:
\[ \frac{\partial\phi}{\partial t} = \sum_{n=1}^{N_n} \frac{\partial}{\partial t}\left(\eta_n\phi_n\right) = \sum_{n=1}^{N_n} \left(\frac{\partial\eta_n}{\partial t}\phi_n + \eta_n\frac{\partial\phi_n}{\partial t}\right) = \sum_{n=1}^{N_n} \eta_n\frac{\partial\phi_n}{\partial t} \]
since the shape functions do not depend on time. Similarly, the first spatial derivative becomes:
\[ \frac{\partial\phi}{\partial x} = \sum_{n=1}^{N_n} \frac{\partial}{\partial x}\left(\eta_n\phi_n\right) = \sum_{n=1}^{N_n} \left(\frac{\partial\eta_n}{\partial x}\phi_n + \eta_n\frac{\partial\phi_n}{\partial x}\right) = \sum_{n=1}^{N_n} \frac{\partial\eta_n}{\partial x}\phi_n \]
since the nodal values do not depend on space, or more generally:
\[ \partial_{x_i}\phi = \sum_{n=1}^{N_n} \partial_{x_i}\eta_n\,\phi_n \]
The second spatial derivative becomes:
\[ \frac{\partial^2\phi}{\partial x^2} = \sum_{n=1}^{N_n} \frac{\partial^2}{\partial x^2}\left(\eta_n\phi_n\right) = \sum_{n=1}^{N_n} \frac{\partial}{\partial x}\left(\frac{\partial\eta_n}{\partial x}\phi_n\right) = \sum_{n=1}^{N_n} \left(\frac{\partial^2\eta_n}{\partial x^2}\phi_n + \frac{\partial\eta_n}{\partial x}\frac{\partial\phi_n}{\partial x}\right) = \sum_{n=1}^{N_n} \frac{\partial^2\eta_n}{\partial x^2}\phi_n \]
or more generally:
\[ \partial_{x_i x_j}\phi = \sum_{n=1}^{N_n} \partial_{x_i x_j}\eta_n\,\phi_n \]
although, as we will see shortly, despite the elegance of this last result we won’t be
evaluating the second derivatives this way in practice.
If we now apply the Galerkin method of weighted residuals to the scalar transport
equation then we have:
\[ \int_\Omega W\left(\dot{\phi} + v\cdot\nabla\phi - \mu\nabla^2\phi - \psi\right)d\Omega = 0 \quad (15.6) \]
At this point it can be observed that the integration procedure is quite similar to
that performed with the Finite Volume method and in fact would be equivalent if the
weighting function W were equal to 1. In our case however we will be substituting
in the shape functions and also our assumed form of the solution from Equation
15.5, and we will consider the domain of integration to be the domain of an element
itself such that the weighted residual expression can be rewritten as:
\[ \int_{\Omega_e} \left(\eta_p\eta_q\dot{\phi}_q + \eta_p v\cdot\nabla\eta_q\,\phi_q - \mu\eta_p\nabla^2\eta_q\,\phi_q - \eta_p\psi\right)d\Omega = 0 \quad (15.7) \]
where an important point to note is that we are using Einstein summation notation,
implying that:
ηq φq = η1 φ1 + η2 φ2 + . . . + ηNn φNn
and:
\[ \eta_p\eta_q\phi_q = \begin{bmatrix} \eta_1\eta_1 & \eta_1\eta_2 & \ldots & \eta_1\eta_{N_n} \\ \eta_2\eta_1 & \eta_2\eta_2 & \ldots & \eta_2\eta_{N_n} \\ \vdots & \vdots & \ddots & \vdots \\ \eta_{N_n}\eta_1 & \eta_{N_n}\eta_2 & \ldots & \eta_{N_n}\eta_{N_n} \end{bmatrix}\begin{Bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{N_n} \end{Bmatrix} \]
The important point here is that applying the Galerkin method of weighted residuals
this way means that for every element we get a ‘sub’ system of equations local to each
element and in terms of the local nodal values 1, 2, . . . , Nn , that must be assembled
into a global system of equations with global indices 1, 2, . . . , Np. The overall global
system of equations to solve our generic scalar transport equation is then given by:
\[ \sum_{e=1}^{N_e} \int_{\Omega_e} \left(\eta_p\eta_q\dot{\phi}_q + \eta_p v\cdot\nabla\eta_q\,\phi_q - \mu\eta_p\nabla^2\eta_q\,\phi_q - \eta_p\psi\right)d\Omega = 0 \]
where Ne is the number of elements in the grid. We can then rewrite the system of
equations in the form:
M φ̇ = Kφ + s (15.8)
where φ is now a column vector defining values at the nodes, and:
\[ M = \sum_{e=1}^{N_e} \int_{\Omega_e} \eta_p\eta_q\,d\Omega \quad (15.9) \]
\[ K = \sum_{e=1}^{N_e} \left( \int_{\Omega_e} \mu\eta_p\nabla^2\eta_q\,d\Omega - \int_{\Omega_e} \eta_p v\cdot\nabla\eta_q\,d\Omega \right) \quad (15.10) \]
\[ s = \sum_{e=1}^{N_e} \int_{\Omega_e} \eta_p\psi_p\,d\Omega \quad (15.11) \]
We are now in a position to look into the details of how we derive shape functions
for a particular element. We will start with the simplest element possible, namely
the linear line element, depicted in the upper left corner of Figure 15.1, which is
perhaps the simplest element that we can use in 1D. When we say a ‘linear’ element
we are making the approximation that the solution will vary as a linear function
and as such the trial function may be written as:
φ(x) = a0 + a1 x
which can be observed is exactly the same as the trial solution defined in Equation
15.1 but considering only the first two terms. Now, an important point is that
because we are defining the solution of our PDE at the nodes of the elements in our
mesh, if we evaluate this trial solution at say node 1 then we have:
φ(x1 ) = φ1 = a0 + a1 x1
which we could write as:
φ1 = p1 a
where:
a = {a0 , a1 }T
and:
p1 = {1, x1 }
If we then evaluate the trial solution at the other point we get:
φ(x2 ) = φ2 = a0 + a1 x2
which we can write in matrix form as:
\[ \begin{Bmatrix} \phi_1 \\ \phi_2 \end{Bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \end{bmatrix}\begin{Bmatrix} a_0 \\ a_1 \end{Bmatrix} = \begin{bmatrix} p_1 \\ p_2 \end{bmatrix}a = Ca \]
where C will be a square 2 × 2 matrix and we can solve for the unknown parameters
by computing a = C −1 φ. Doing so, we find that:
\[ a_0 = \frac{1}{L_e}\left(x_2\phi_1 - x_1\phi_2\right) \]
\[ a_1 = \frac{1}{L_e}\left(-\phi_1 + \phi_2\right) \]
where Le is the length of the element and is defined in terms of its nodal coordinates
as:
Le = x2 − x1
If we now substitute these coefficients back into our trial solution we get:
\[ \phi(x) = \frac{1}{L_e}\left(x_2\phi_1 - x_1\phi_2\right) + \frac{1}{L_e}\left(-\phi_1 + \phi_2\right)x \]
where we can factor out the nodal values and rewrite the solution in the form:
\[ \phi(x, t) = \sum_{n=1}^{N_n} \eta_n(x)\,\phi_n(t) = \eta_1\phi_1 + \eta_2\phi_2 \quad (15.12) \]
where the ηn terms are the shape functions and for a linear 1D element are given
by:
\[ \eta_1(x) = \frac{1}{L_e}\left(x_2 - x\right) \]
\[ \eta_2(x) = \frac{1}{L_e}\left(x - x_1\right) \]
We can then trivially express the derivatives of the shape functions as:
\[ \frac{\partial\eta_1(x)}{\partial x} = -\frac{1}{L_e} \]
\[ \frac{\partial\eta_2(x)}{\partial x} = \frac{1}{L_e} \quad (15.13) \]
So what we’ve done here is define a trial solution within our element as some contin-
uous function of space, then we’ve used the fact that our trial solution must assume
the values φn at the nodes in order to obtain an expression for the field variable that
varies continuously within the element, but is defined in terms of the nodal values,
not the ai coefficients. As it happens these shape functions are known as piecewise
continuous functions in that they are only defined within their ‘own’ element and
are all zero in any other element in the grid. If we know the values of φn at the
nodes of the element then we can use the shape functions to compute the value of φ
at any point within the element. Now obviously we don’t know the values of φn at
the nodes a priori; the whole point of solving a PDE is to find them. But as with
the Finite Difference and Finite Volume methods studied thus far, this expression
will allow us to assemble a system that we can use to solve for these unknown nodal
values.
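As a concrete illustration of this interpolation, a small sketch with assumed nodal coordinates and values (not one of the book's listings):

x1 = 0.2; x2 = 0.5;          % assumed nodal coordinates of one element
phi1 = 1.3; phi2 = 1.7;      % assumed nodal values
L_e = x2 - x1;               % element length
x = 0.35;                    % a point within the element
eta1 = (x2 - x)/L_e;         % linear line element shape functions
eta2 = (x - x1)/L_e;
phi = eta1*phi1 + eta2*phi2; % interpolated value (the midpoint here, 1.5)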
The final step in our Finite Element method is to perform the integration of the
shape functions (or their derivatives) over the domain of each element, in order to
evaluate the terms in Equations 15.9-15.11. Although the integration is not too
difficult in this case we will make use of the integration formulae defined for a linear
line element:
\[ \int_{\Omega_e} \eta_p^a\,\eta_q^b\,d\Omega = \frac{a!\,b!\,\Omega_e}{\left(a + b + 1\right)!} \quad (15.14) \]
where Ωe ≡ Le in 1D. Let’s first consider the integration of the shape functions as
required in the mass matrix:
\[ M^e_{p,q} = \int_{\Omega_e} \eta_p\eta_q\,d\Omega \]
Remembering that p and q are in the range of 1 to 2 for the linear line element,
what we end up with is a local or ‘sub’ matrix M^e, which for our linear line element will be a 2 × 2 matrix. In order to evaluate each term in the matrix we
simply input the values of p and q to the integration formula in Equation 15.14. For
the case where p and q are equal (i.e. for elements on the main diagonal) we get:
\[ M^e_{p,p} = \int_{\Omega_e} \eta_p\eta_p\,d\Omega = \int_{\Omega_e} \eta_p^2\,\eta_q^0\,d\Omega = \frac{2!\,0!\,\Omega_e}{\left(2 + 0 + 1\right)!} = \frac{2\Omega_e}{6} \]
For the case where p and q are not equal (i.e. for elements off the main diagonal)
we get:
\[ M^e_{p,q} = \int_{\Omega_e} \eta_p\eta_q\,d\Omega = \int_{\Omega_e} \eta_p^1\,\eta_q^1\,d\Omega = \frac{1!\,1!\,\Omega_e}{\left(1 + 1 + 1\right)!} = \frac{\Omega_e}{6} \]
So we can write a single expression for any element in our local mass matrix as:
\[ M^e_{p,q} = \frac{\left(1 + \delta_{pq}\right)\Omega_e}{6} \]
which is simple enough that we could write out the whole thing as:
\[ M^e = \frac{\Omega_e}{6}\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \quad (15.15) \]
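A one-line way to construct this local matrix from the Kronecker delta expression above (a small sketch, not one of the book's listings, with an assumed element length):

Omega_e = 0.1;                   % assumed element length
M_e = (1 + eye(2))*Omega_e/6;    % eye(2) supplies the delta_pq term, giving [2 1; 1 2]*Omega_e/6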
Considering the contribution of the convective term to the stiffness matrix we have:
\[ K^e_{p,q} = -\int_{\Omega_e} \eta_p\,v\cdot\nabla\eta_q\,d\Omega = -v\nabla\eta_q\int_{\Omega_e} \eta_p\,d\Omega = -v\nabla\eta_q\int_{\Omega_e} \eta_p^1\,\eta_q^0\,d\Omega = -v\nabla\eta_q\,\frac{1!\,0!\,\Omega_e}{\left(1 + 0 + 1\right)!} = -\frac{v\nabla\eta_q\,\Omega_e}{2} \]
which is simple enough that we could write out the whole thing as:
\[ K^e = \begin{bmatrix} \frac{v}{2} & -\frac{v}{2} \\ \frac{v}{2} & -\frac{v}{2} \end{bmatrix} \quad (15.16) \]
Considering now the contribution of the source term to the load vector we have:
\[ s^e_p = \int_{\Omega_e} \eta_p\psi_p\,d\Omega = \psi_p\int_{\Omega_e} \eta_p\,d\Omega = \psi_p\,\frac{1!\,0!\,\Omega_e}{\left(1 + 0 + 1\right)!} = \frac{\psi_p\,\Omega_e}{2} \]
and if we can assume that ψ is the same at each node we can write our contribution
to the local load vector as:
\[ s^e = \frac{\psi\,\Omega_e}{2}\begin{Bmatrix} 1 \\ 1 \end{Bmatrix} \quad (15.17) \]
Now, you may have noticed that we have not looked at the contribution of the
diffusive term. There is a good reason for this, namely that it involves the second
derivatives of the shape functions, which for a linear element will be zero. So what
we would find is that the contribution to the stiffness matrix would be zero, which
is obviously not correct. One option would be to use a higher order element, but as
we shall soon see, there is a second option.
Example 15.1:
In this example we will develop a Matlab program to solve the 1D first order
wave equation:
\[ \frac{\partial\phi}{\partial t} + v\frac{\partial\phi}{\partial x} = 0 \quad (15.18) \]
in the domain x ∈ [0, 10], t ∈ [0, 10], with boundary condition φ(0, t) = 1, initial condition φ(x, 0) = e^{−5(x−3)^2} + 1, v = 1.0, and compare the numerical solution with the exact solution
\[ \phi(x, t) = e^{-5\left(x - vt - 3\right)^2} + 1 \]
where we will define the error function e_i = φ_i^l − φ(x_i, t_l) and use the infinity norm ||e||_∞ to quantify the error in the numerical solution. As always, we should first confirm that we have a well posed problem; if our problem has too many or too few boundary or initial conditions it will be ‘doomed’ from the start.
So it is always important to consider these issues before writing any code.
Similar to the equivalent Finite Difference case in Example 13.1 we will assume
that our spatial domain has been broken up into Ne elements and so the number of
grid points will be Nx = Ne + 1; furthermore, the spatial step size is Ωe ≡ Le ≡ ∆x.
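Before assembling the element matrices, the grid, time stepping parameters, and initial and boundary conditions have to be set up; a minimal sketch of that setup (the particular values of N_e and ∆t are assumptions for illustration, chosen to match the stable combination discussed later) is:

N_e = 200;                          % number of elements (assumed)
N_x = N_e + 1;                      % number of grid points
Delta_x = 10/N_e;                   % Omega_e = L_e = Delta_x
x = (0:Delta_x:10)';                % node coordinates
Delta_t = 0.02;                     % time step size (assumed)
N_t = round(10/Delta_t) + 1;        % number of time levels
t = (0:Delta_t:10);                 % time levels
v = 1.0;
phi = zeros(N_x, N_t);
phi(:,1) = exp(-5*(x - 3).^2) + 1;  % initial condition
phi(1,:) = 1;                       % Dirichlet boundary condition phi(0,t) = 1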
Applying our method for each element we can then write:
\[ \begin{bmatrix} \frac{L}{3} & \frac{L}{6} \\ \frac{L}{6} & \frac{L}{3} \end{bmatrix}\frac{d}{dt}\begin{Bmatrix} \phi_1 \\ \phi_2 \end{Bmatrix} = \begin{bmatrix} \frac{v}{2} & -\frac{v}{2} \\ \frac{v}{2} & -\frac{v}{2} \end{bmatrix}\begin{Bmatrix} \phi_1 \\ \phi_2 \end{Bmatrix} \]
\[ \begin{bmatrix} \frac{L}{3} & \frac{L}{6} \\ \frac{L}{6} & \frac{L}{3} \end{bmatrix}\frac{d}{dt}\begin{Bmatrix} \phi_2 \\ \phi_3 \end{Bmatrix} = \begin{bmatrix} \frac{v}{2} & -\frac{v}{2} \\ \frac{v}{2} & -\frac{v}{2} \end{bmatrix}\begin{Bmatrix} \phi_2 \\ \phi_3 \end{Bmatrix} \]
\[ \vdots \]
\[ \begin{bmatrix} \frac{L}{3} & \frac{L}{6} \\ \frac{L}{6} & \frac{L}{3} \end{bmatrix}\frac{d}{dt}\begin{Bmatrix} \phi_{N_x-1} \\ \phi_{N_x} \end{Bmatrix} = \begin{bmatrix} \frac{v}{2} & -\frac{v}{2} \\ \frac{v}{2} & -\frac{v}{2} \end{bmatrix}\begin{Bmatrix} \phi_{N_x-1} \\ \phi_{N_x} \end{Bmatrix} \]
and combining these systems for all of the elements we will get the global system
of equations:
M φ̇ = Kφ + s
where we have:
\[ \begin{bmatrix} \frac{L}{3} & \frac{L}{6} & 0 & \cdots & 0 \\ \frac{L}{6} & \frac{2L}{3} & \frac{L}{6} & \cdots & 0 \\ 0 & \frac{L}{6} & \frac{2L}{3} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \frac{L}{6} \\ 0 & 0 & 0 & \frac{L}{6} & \frac{L}{3} \end{bmatrix}\frac{d}{dt}\begin{Bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \vdots \\ \phi_{N_x} \end{Bmatrix} = \begin{bmatrix} \frac{v}{2} & -\frac{v}{2} & 0 & \cdots & 0 \\ \frac{v}{2} & 0 & -\frac{v}{2} & \cdots & 0 \\ 0 & \frac{v}{2} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & -\frac{v}{2} \\ 0 & 0 & 0 & \frac{v}{2} & -\frac{v}{2} \end{bmatrix}\begin{Bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \vdots \\ \phi_{N_x} \end{Bmatrix} \]
In comparison to Example 13.1 it can be observed that we are including the Dirichlet
boundary node φ1 in the column vector of unknowns and also the mass matrix is
not equal to the identity matrix with the Finite Element method. It can also be
observed that the 2 × 2 ’elemental’ mass and stiffness matrices are ‘stamped’ in
place in the global mass and stiffness matrices and overlapping entries are added
up, which is a feature of the Galerkin method of weighted residuals. We can rewrite
our system more simply as:
\[ \frac{d\phi}{dt} = M^{-1}K\phi = f(\phi) \quad (15.19) \]
So we will compute and store the matrix M −1 K and use this in our function f to
evaluate the right hand side. The Matlab code to achieve this will look like:
M = sparse(N_x, N_x);
K = sparse(N_x, N_x);
for p=1:N_x-1
M(p:p+1, p:p+1) = M(p:p+1, p:p+1) + Delta_x/6 *[2, 1; 1, 2];
K(p:p+1, p:p+1) = K(p:p+1, p:p+1) + v/2 *[1,-1; 1, -1];
end
MinvK = full(M\K);
So an interesting observation that can be made is that even though we are using
an explicit time marching scheme, we still have to solve a linear system M −1 K
(albeit only once). This is one feature of the Finite Element method that is quite
different to the Finite Difference and Finite Volume methods, where we only had
entries on the main diagonal of the mass matrix. Now, using the approach we took in
implementing the fourth order Runge-Kutta method in Example 10.1, we will define
a function f to evaluate the right hand side of Equation 13.16 at the various stages
of the method. In our Matlab code, this will take the form:
function k = f(phi)
k = MinvK*phi;
k(1) = 0;
end
At this point the remainder of the algorithm is just the basic fourth order Runge-
Kutta code from Example 10.1:
for l=1:N_t-1
k1 = f(phi(:,l));
k2 = f(phi(:,l) + Delta_t/2*k1);
k3 = f(phi(:,l) + Delta_t/2*k2);
k4 = f(phi(:,l) + Delta_t *k3);
phi(:,l+1) = phi(:,l) + Delta_t *(k1/6 + k2/3 + k3/3 + k4/6);
end
As we did in Example 13.1 we will make sure that each of k1(1), k2(1), k3(1), and k4(1) is always zero such that, as long as our Dirichlet boundary condition φ(0, t) = 1 has been set, our boundary condition will have been imposed correctly.
In order to check the whether or not our solution will be stable for a particular
number of elements and time step size we can compute the eigenvalues as we did
in Example 13.1 and plot them relative to the stability region of the fourth order
Runge-Kutta method:
[Xi Lambda] = eig(MinvK);
[X, Y] = meshgrid(-4:0.1:4, -4:0.1:4);
Z = X + i*Y;
sigma = abs(1 + Z + (Z.^2)/2 + (Z.^3)/6 + (Z.^4)/24);
contourf(X, Y, sigma, [1 1]);
plot(real(diag(Lambda)*Delta_t), imag(diag(Lambda))*Delta_t);
Figure 15.3 illustrates the location of the λm ∆t terms for the combinations ∆x = 0.05, ∆t = 0.02 and ∆x = 0.02, ∆t = 0.10. In the first combination, all the terms are located within the stability
region, and in the second they are not. The corresponding effect on the solution
is shown in Figures 15.4(a) - 15.4(b). It is easily observed that for the second
combination, the simulation ‘blows up’ after just a couple of time steps, whereas
for a stable solution we see the ‘bell shaped’ initial condition is simply shifted (or
convected) along through the computational domain. To illustrate the convergence
of the solution, Table 15.1 presents the infinity norm for a range of spatial and
temporal step sizes (maintaining stability of course). As can be observed, the finer
the grid and the smaller the time step size, the lower the error in the solution (which
is of course what we could expect).
Table 15.1: The convergence of the solution, illustrating the infinity norm for a
range of spatial and temporal step sizes.
∆x ∆t ||e||∞
1.000000 1.000000 0.672490
0.500000 0.500000 0.444394
0.100000 0.100000 0.022728
0.050000 0.050000 0.003280
0.010000 0.010000 0.000148
0.005000 0.005000 0.000037
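A minimal sketch (not one of the book's listings) of how the infinity norm of the error can be computed from the stored solution, assuming x holds the node coordinates and t the time levels:

e = zeros(N_x, N_t);
for l = 1:N_t
    % error against the exact solution at time level l
    e(:,l) = phi(:,l) - (exp(-5*(x - v*t(l) - 3).^2) + 1);
end
errorInf = norm(e(:), inf);   % infinity norm over all nodes and time levels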
Having now seen a fairly simple example of the Finite Element method in ac-
tion, we are going to tackle the issue of dealing with second order terms in a PDE,
approximated with linear shape functions. A result of assuming a linear variation of
φ within an element is that while the nodal values are equal at element boundaries
(known as C 0 continuity), unfortunately the first derivatives are not equal (C 1 conti-
nuity) and hence the second derivatives do not exist at all (Figure 15.5). You might
think that we should just abandon the use of such simple elements and use higher
order elements (Figure 15.1), but to require the second order spatial derivatives to
exist everywhere is too restrictive. Fortunately there is a solution, and this involves
removing the second derivative from the weighted residual expression for the scalar
transport equation. To see how this is done, let’s define the residual function for
the generic scalar transport equation and apply the Galerkin method of weighted
residuals as we did previously:
\[ \int_\Omega W\left(\dot{\phi} + v\cdot\nabla\phi - \mu\nabla^2\phi - \psi\right)d\Omega = 0 \]
Figure 15.3: Location of the λm ∆t terms within the stability region of the fourth
order Runge-Kutta method for the PDE in Example 15.1 for (a) ∆x = 0.05 and
∆t = 0.02 (b) ∆x = 0.02 and ∆t = 0.10. It should be noted that each λm ∆t is
marked as a cross in the complex plane, but the large number of these terms makes
them appear as a solid strip. It can be observed that all of the λm ∆t terms are
purely imaginary.
Figure 15.4: The solutions to the PDE in Example 15.1 illustrating the solution at (a) l = 0 and (b) l = 200 for the combination ∆x = 0.05 and ∆t = 0.02, and at (c) l = 0 and (d) l = 8 for the combination ∆x = 0.02 and ∆t = 0.10.
Figure 15.5: The piecewise linear approximation of φ, its first derivative ∂xφ, and its second derivative ∂xxφ, plotted against x.
In order to proceed we note the generic product rule formula for differentiation:
\[ \frac{\partial}{\partial x}\left(fg\right) = \frac{\partial f}{\partial x}g + \frac{\partial g}{\partial x}f \]
which in higher spatial dimensions can be applied to two scalar fields as:
\[ \nabla\left(fg\right) = \left(\nabla f\right)g + \left(\nabla g\right)f \]
or, using Einstein summation notation, as:
\[ \frac{\partial}{\partial x_i}\left(fg\right) = \frac{\partial f}{\partial x_i}g + \frac{\partial g}{\partial x_i}f \]
If we are dealing with a scalar field and a vector field, then the equivalent generic formula is:
\[ \nabla\cdot\left(fg\right) = \nabla f\cdot g + \left(\nabla\cdot g\right)f \]
or, again using Einstein summation notation, as:
\[ \frac{\partial}{\partial x_i}\left(fg_i\right) = \frac{\partial f}{\partial x_i}g_i + \frac{\partial g_i}{\partial x_i}f \]
Substituting into this formula f = W for our scalar field and g = ∇φ for our vector field we get:
\[ \nabla\cdot\left(W\nabla\phi\right) = \nabla W\cdot\nabla\phi + W\nabla\cdot\nabla\phi = \nabla W\cdot\nabla\phi + W\nabla^2\phi \]
which can be rearranged as:
\[ W\nabla^2\phi = \nabla\cdot\left(W\nabla\phi\right) - \nabla W\cdot\nabla\phi \quad (15.20) \]
Now, as we did in the Finite Volume method, we can again make use of the Diver-
gence Theorem (Equation 14.3) and apply it to the first term on the right hand side
of Equation 15.20 to get:
\[ \int_\Omega W\mu\nabla^2\phi\,d\Omega = \int_\Gamma \mu W\nabla\phi\cdot d\Gamma - \int_\Omega \mu\nabla W\cdot\nabla\phi\,d\Omega \]
Substituting this back into the weighted residual expression we get:
\[ \int_\Omega W\dot{\phi}\,d\Omega + \int_\Omega Wv\cdot\nabla\phi\,d\Omega + \int_\Omega \mu\nabla W\cdot\nabla\phi\,d\Omega - \int_\Omega W\psi\,d\Omega = \int_\Gamma \mu W\nabla\phi\cdot d\Gamma \]
or more compactly:
\[ \int_\Omega \left(W\dot{\phi} + Wv\cdot\nabla\phi + \mu\nabla W\cdot\nabla\phi - W\psi\right)d\Omega = \int_\Gamma \mu W\nabla\phi\cdot d\Gamma \quad (15.21) \]
This form of the problem is known as the weak form since it contains only the first
derivative of the solution φ(x, t), whereas the original form of the problem (Equation
15.6) is known as the strong form and contains the second derivative. The differenti-
ation requirement on the function φ(x, t) has been weakened, hence the name ‘weak
form’. It should be noted that no approximations have been made yet, i.e. nothing
has been lost in the formulation. However, piecewise linear approximations are now
possible because we don’t need to worry about the second derivatives of the shape
functions. Another point to note is that in terms of the application of boundary
conditions, the term on the right hand side of Equation 15.21 is the integral of the
derivative of φ over the boundary, which can be thought of as integrating a Neu-
mann boundary condition over the boundary of the domain. So, in contrast to the
Finite Difference and Finite Volume methods, the Neumann boundary conditions
are automatically incorporated into the integral form of the PDE. For this reason,
Neumann boundary conditions are often called natural boundary conditions, since
they ‘naturally pop up’ in the weighted residual expression. Following from this,
Dirichlet boundary conditions are then often called essential boundary conditions.
If we now substitute in for the weighting function, substitute in our assumed
form of the solution from Equation 15.5, and consider the domain of integration to
be the domain of an element itself, the weighted residual expression can be rewritten
as:
\[ \int_{\Omega_e} \left(\eta_p\eta_q\dot{\phi}_q + \eta_p v\cdot\nabla\eta_q\,\phi_q + \mu\nabla\eta_p\cdot\nabla\eta_q\,\phi_q - \eta_p\psi\right)d\Omega = \int_{\Gamma_e} \mu\eta_p\nabla\phi\cdot d\Gamma \quad (15.22) \]
So while p and q were in the range 1 to 2 for the linear line element, they are in the range 1 to 3 for our linear triangular element. The overall system of equations
to solve our generic scalar transport equation is then given by:
\[ \sum_{e=1}^{N_e} \int_{\Omega_e} \left(\eta_p\eta_q\dot{\phi}_q + \eta_p v\cdot\nabla\eta_q\,\phi_q + \mu\nabla\eta_p\cdot\nabla\eta_q\,\phi_q - \eta_p\psi\right)d\Omega = \sum_{e=1}^{N_e} \int_{\Gamma_e} \mu\eta_p\nabla\phi\cdot d\Gamma \]
Figure 15.6: The linear triangular element, with nodal coordinates (x1, y1), (x2, y2), and (x3, y3).
where Ne is the number of elements in the grid. We can then rewrite the system of
equations in the form:
M φ̇ = Kφ + s (15.23)
where φ is now a column vector defining values at the nodes, and:
\[ M = \sum_{e=1}^{N_e} \int_{\Omega_e} \eta_p\eta_q\,d\Omega \quad (15.24) \]
\[ K = -\sum_{e=1}^{N_e} \left( \int_{\Omega_e} \mu\nabla\eta_p\cdot\nabla\eta_q\,d\Omega + \int_{\Omega_e} \eta_p v\cdot\nabla\eta_q\,d\Omega \right) \quad (15.25) \]
\[ s = \sum_{e=1}^{N_e} \left( \int_{\Omega_e} \eta_p\psi_p\,d\Omega + \int_{\Gamma_e} \mu\eta_p\nabla\phi\cdot d\Gamma \right) \quad (15.26) \]
We will now consider one more spatial dimension and derive the shape functions
for the linear triangular element, perhaps the simplest element that we can use in
2D. Considering the linear triangular element depicted in Figure 15.6 with the x
and y coordinates of its 3 nodes labeled, we begin our derivation by again assuming
a trial solution for the field variable. Just as we did for the linear line element we
will assume a linear trial function, which takes the form:
φ(x, y) = a0 + a1 x + a2 y
where a0 , a1 , and a2 are scalar coefficients. This is essentially the 2D equivalent of
the trial solution to our linear line element. If we apply this trial solution at say
node 1 then we have:
φ(x1 , y1 ) = φ1 = a0 + a1 x1 + a2 y1
which we could write as:
φ1 = p1 a
where:
a = {a0 , a1 , a2 }T
and:
p1 = {1, x1 , y1 }
If we then apply the trial solution at the other two points we get:
φ(x2 , y2 ) = φ2 = a0 + a1 x2 + a2 y2
φ(x3 , y3 ) = φ3 = a0 + a1 x3 + a2 y3
which we can write in matrix form as:
\[ \begin{Bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{Bmatrix} = \begin{bmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{bmatrix}\begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} = Ca \]
where C will be a square 3 × 3 matrix and we can solve for the unknown parameters
by computing a = C −1 φ. Doing so, we find that:
\[ a_0 = \frac{1}{2A_e}\left[\left(x_2 y_3 - x_3 y_2\right)\phi_1 + \left(x_3 y_1 - x_1 y_3\right)\phi_2 + \left(x_1 y_2 - x_2 y_1\right)\phi_3\right] \]
\[ a_1 = \frac{1}{2A_e}\left[\left(y_2 - y_3\right)\phi_1 + \left(y_3 - y_1\right)\phi_2 + \left(y_1 - y_2\right)\phi_3\right] \]
\[ a_2 = \frac{1}{2A_e}\left[\left(x_3 - x_2\right)\phi_1 + \left(x_1 - x_3\right)\phi_2 + \left(x_2 - x_1\right)\phi_3\right] \]
where Ae is the area of the element and is defined in terms of its nodal coordinates as:
\[ A_e = \frac{1}{2}\left[\left(x_2 y_3 - x_3 y_2\right) + \left(x_3 y_1 - x_1 y_3\right) + \left(x_1 y_2 - x_2 y_1\right)\right] \]
If we now substitute these coefficients back into our trial solution we can factor out
the nodal values and rewrite the solution in the form:
\[ \phi(x, y) = \sum_{n=1}^{N_n} \eta_n(x, y)\,\phi_n = \eta_1\phi_1 + \eta_2\phi_2 + \eta_3\phi_3 \quad (15.27) \]
where the ηn terms are the shape functions and for the linear triangular element are
given by:
\[ \eta_1(x, y) = \frac{1}{2A_e}\left[\left(x_2 y_3 - x_3 y_2\right) + \left(y_2 - y_3\right)x + \left(x_3 - x_2\right)y\right] \]
\[ \eta_2(x, y) = \frac{1}{2A_e}\left[\left(x_3 y_1 - x_1 y_3\right) + \left(y_3 - y_1\right)x + \left(x_1 - x_3\right)y\right] \]
\[ \eta_3(x, y) = \frac{1}{2A_e}\left[\left(x_1 y_2 - x_2 y_1\right) + \left(y_1 - y_2\right)x + \left(x_2 - x_1\right)y\right] \quad (15.28) \]
which may be more compactly written as:
\[ \eta_n(x, y) = \frac{1}{2A_e}\begin{vmatrix} 1 & x & y \\ 1 & x_{n+1} & y_{n+1} \\ 1 & x_{n+2} & y_{n+2} \end{vmatrix} \quad (15.29) \]
where the node subscripts are understood to cycle through 1, 2, 3.
So while ηn is a linear scalar function of x and y, its derivatives ∂xηn = (y_{n+1} − y_{n+2})/(2Ae) and ∂yηn = (x_{n+2} − x_{n+1})/(2Ae) are constants, so that over the three nodes of an element they form vectors which store constant values for any given element.
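A small sketch (not one of the book's listings) that evaluates these shape functions at a point inside an element with assumed nodal coordinates, and checks that they sum to one:

xn = [0.0; 1.0; 0.0];          % assumed nodal x coordinates
yn = [0.0; 0.0; 1.0];          % assumed nodal y coordinates
A_e = 0.5*((xn(2)*yn(3) - xn(3)*yn(2)) + (xn(3)*yn(1) - xn(1)*yn(3)) ...
         + (xn(1)*yn(2) - xn(2)*yn(1)));     % element area
x = 0.25; y = 0.25;            % a point within the element
eta = zeros(3,1);
for n = 1:3
    n1 = mod(n, 3) + 1;        % cyclic node indices n+1 and n+2
    n2 = mod(n+1, 3) + 1;
    eta(n) = ((xn(n1)*yn(n2) - xn(n2)*yn(n1)) ...
            + (yn(n1) - yn(n2))*x + (xn(n2) - xn(n1))*y)/(2*A_e);
end
% here eta = [0.5; 0.25; 0.25] and sum(eta) = 1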
The final step in our Finite Element method is to perform the integration of the
shape functions (or their derivatives) over the domain of each element. While this
may seem like a daunting task we will make use of two integration formulae defined
for the linear triangular element:
\[ \int_{\Omega_e} \eta_p^a\,\eta_q^b\,\eta_r^c\,d\Omega = \frac{a!\,b!\,c!\,2\Omega_e}{\left(a + b + c + 2\right)!} \quad (15.32) \]
and:
\[ \int_{\Gamma_e} \eta_p^a\,\eta_q^b\,d\Gamma = \frac{a!\,b!\,\Gamma_e}{\left(a + b + 1\right)!} \quad (15.33) \]
Where for the 2D case Ωe and Γe represent the area and an edge length of the
triangle. It is worth mentioning at this point that the use of these formulae greatly
simplifies matters. When other element types are used however, then no such for-
mulae exist and the integration of the shape functions over an element may utilize
a numerical method such as quadrature [21] and isoparametric elements. Let’s first
consider the integration of the shape functions as required in the mass matrix:
\[ M^e_{p,q} = \int_{\Omega_e} \eta_p\eta_q\,d\Omega \]
Remembering that p and q are in the range of 1 to 3 for the linear triangular element,
what we end up with is a local or ‘sub’ matrix M e , which for our linear triangular
element will be a 3 × 3 matrix. In order to evaluate each term in the matrix we
simply input the values of p and q to the integration formula in Equation 15.32. For
the case where p and q are equal (i.e. for elements on the main diagonal) we get:
\[ M^e_{p,p} = \int_{\Omega_e} \eta_p\eta_p\,d\Omega = \int_{\Omega_e} \eta_p^2\,\eta_q^0\,\eta_r^0\,d\Omega = \frac{2!\,0!\,0!\,2\Omega_e}{\left(2 + 0 + 0 + 2\right)!} = \frac{2\Omega_e}{12} \]
For the case where p and q are not equal (i.e. for elements off the main diagonal)
we get:
\[ M^e_{p,q} = \int_{\Omega_e} \eta_p\eta_q\,d\Omega = \int_{\Omega_e} \eta_p^1\,\eta_q^1\,\eta_r^0\,d\Omega = \frac{1!\,1!\,0!\,2\Omega_e}{\left(1 + 1 + 0 + 2\right)!} = \frac{\Omega_e}{12} \]
So we can write a single expression for any element in our local mass matrix as:
\[ M^e_{p,q} = \frac{\left(1 + \delta_{pq}\right)\Omega_e}{12} \]
which is simple enough that we could write out the whole thing as:
\[ M^e = \frac{\Omega_e}{12}\begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix} \quad (15.34) \]
Considering the contribution of the convective term to the stiffness matrix we have:
\[ \int_{\Omega_e} \eta_p\,v\cdot\nabla\eta_q\,d\Omega = v\cdot\frac{1}{2\Omega_e}\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix}\int_{\Omega_e} \eta_p\,d\Omega = v\cdot\frac{1}{2\Omega_e}\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix}\frac{1!\,0!\,0!\,2\Omega_e}{\left(1 + 0 + 0 + 2\right)!} = v\cdot\frac{1}{2\Omega_e}\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix}\frac{\Omega_e}{3} \]
where we are writing shape function derivative terms in the form of a column vector
for compactness; the important thing is that the dot product between the velocity
vector and the vector defining the derivative of the shape functions produces a scalar value. Considering now the contribution of the diffusive term to the stiffness matrix
we have:
\[ \int_{\Omega_e} \mu\nabla\eta_p\cdot\nabla\eta_q\,d\Omega = \int_{\Omega_e} \mu\,\frac{1}{2\Omega_e}\begin{Bmatrix} y_{p+1} - y_{p+2} \\ x_{p+2} - x_{p+1} \end{Bmatrix}\cdot\frac{1}{2\Omega_e}\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix}d\Omega \]
It can be observed that in this case every term inside the integral is a constant, so
the integration is in fact trivial:
\[ \int_{\Omega_e} \mu\nabla\eta_p\cdot\nabla\eta_q\,d\Omega = \frac{\mu}{4\Omega_e^2}\begin{Bmatrix} y_{p+1} - y_{p+2} \\ x_{p+2} - x_{p+1} \end{Bmatrix}\cdot\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix}\Omega_e \]
Again, the important point to note is that the dot product between these two vectors
produces a scalar value. So we can write our stiffness matrix more compactly as:
\[ K^e_{p,q} = -v\cdot\frac{1}{6}\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix} - \frac{\mu}{4\Omega_e}\begin{Bmatrix} y_{p+1} - y_{p+2} \\ x_{p+2} - x_{p+1} \end{Bmatrix}\cdot\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix} \quad (15.35) \]
which we could write out in the form of a local stiffness matrix K e , but the terms
are so long that it would hardly fit on the page. Considering now the contribution
of the source term to the load vector we have:
\[ s^e_p = \int_{\Omega_e} \eta_p\psi_p\,d\Omega = \psi_p\int_{\Omega_e} \eta_p\,d\Omega = \psi_p\,\frac{1!\,0!\,0!\,2\Omega_e}{\left(1 + 0 + 0 + 2\right)!} = \frac{\psi_p\,\Omega_e}{3} \]
and if we can assume that ψ is the same at each node we can write our contribution
to the local load vector as:
\[ s^e = \frac{\psi\,\Omega_e}{3}\begin{Bmatrix} 1 \\ 1 \\ 1 \end{Bmatrix} \quad (15.36) \]
An important point to note regarding the boundary term in the load vector is that we will only perform this integral over the boundary of the element (which in our case translates into an edge of a triangle) if that edge lies on a boundary of the domain:
\[ s^f_p = \int_{\Gamma_e} \mu\eta_p\nabla\phi\cdot d\Gamma = \mu\nabla\phi\int_{\Gamma_e} \eta_p\,d\Gamma = \mu\nabla\phi\,\frac{1!\,0!\,\Gamma_e}{\left(1 + 0 + 1\right)!} = \frac{\mu\nabla\phi\,\Gamma_e}{2} \quad (15.37) \]
So we can write our load vector as:
\[ s^f = \frac{\mu\nabla\phi\,\Gamma_e}{2}\begin{Bmatrix} 1 \\ 1 \end{Bmatrix} \]
At this point we now have all of the machinery in place to assemble a system of
equations, but let’s just take a moment to recap on what we’ve done. Depending on
our choice of element we will end up with local, 3 × 3, mass and stiffness matrices and a 3 × 1 load vector. To evaluate the terms in a local matrix we ‘loop’ over the p, q indices and compute the M^e_{p,q} and K^e_{p,q} terms, and for the load vector we ‘loop’ over the p indices and compute the s^e_p terms. So each of these terms is just a single
number that we ‘place’ in the local matrix. Once we have complete local mass and
stiffness matrices and a complete local load vector we have to add these terms into
the global mass and stiffness matrices, so the p, q indices of a particular node in
an element have to be ‘mapped’ to the global indices of that node within the grid
(Figure 15.7).
Figure 15.7: The local indexing of the nodes within an element and their corresponding global indexing within the grid (for example, element e5 is defined by the points p218, p220, and p1028).

Example 15.2:
In this example we will develop a Matlab program to solve the 2D Poisson equation:
∇2 φ + ψ = 0 (15.38)
in the domain x ∈ [0, 1], y ∈ [0, 1], with boundary conditions φ(0, y) = 1, φ(1, y) = 1,
φ(x, 0) = 1, φ(x, 1) = 1, and ψ = 10. To apply our spatial discretization we will use
the Finite Element method with linear triangular elements. The intended learning
outcomes for this example will be ‘get a feel’ for applying the Finite Element method
in 2D and observing how we can loop over the elements in our grid in order to
‘assemble’ the matrix defining our system of equations.
To begin, let’s first confirm in our minds that we have a well posed problem. Our
PDE has two second order derivative terms in it and so this translates into requir-
ing four pieces of information in order to obtain a unique solution, two boundary
conditions for each spatial derivative. Since we were given all of these, then we can
say that our problem will be well posed.
As we did with the Finite Volume method in Example 14.1 we will assume that
the unstructured grid defining the domain is already defined and will be returned
through a function called readGrid. As such we will be storing three arrays for
this problem; an array called Points which is an Np × 2 array storing the x, y
coordinates of the points defining the grid, an array called Faces which is an Nf × 2
array storing the two indices of the points defining a face in the grid, and an array
called Elements which is an Ne × 3 array storing the three indices of the points
defining an element in the grid. In contrast to the Finite Volume method example
we are not going to make any assumptions about the ordering of the points or the
faces in their respective arrays. In order to prescribe our boundary conditions we
are again going to make use of a structure to store all of the information that we
need:
Boundaries = struct('name', {}, 'type', {}, 'N', {}, 'indices', {}, 'value', {});
So because we haven’t made any assumptions about the ordering of the points, we
cannot simply use a ‘start’ index and the number of indices to define where these
might be located in the Points array. Instead we are explicitly storing the indices
of each point on a given boundary.
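For illustration, a hypothetical entry in this structure (the indices and count are assumed values, not taken from the actual grid) might look like:

Boundaries(1).name    = 'bottom';
Boundaries(1).type    = 'dirichlet';
Boundaries(1).N       = 3;
Boundaries(1).indices = [12 47 63];   % point indices lying on this boundary
Boundaries(1).value   = 1.0;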
At this point we can apply the spatial discretization that is the Finite Element
method to our PDE and we know that we will have a discrete system of the form:
Kφ = s
where:
\[ K = \sum_{e=1}^{N_e} \int_{\Omega_e} \mu\nabla\eta_p\cdot\nabla\eta_q\,d\Omega \]
\[ s = \sum_{e=1}^{N_e} \left( \int_{\Omega_e} \eta_p\psi_p\,d\Omega + \int_{\Gamma_e} \mu\eta_p\nabla\phi\cdot d\Gamma \right) \]
but because there are no Neumann boundaries in our problem the second term in
the load vector involving the integral of ∇φ over a boundary will be zero for every
element.
As with the Finite Difference method applied in Example 13.3 and the Finite
Volume method applied in Example 14.1 the core part of the method is the ‘assem-
bling’ of the matrices defining the system of equations. To assemble these matrices
we are going to need to evaluate Ωe and Γf for the elements and faces respectively.
Now in our 2D example Ωe is the area of each triangle, which we can evaluate from the nodal coordinates as:
\[ \Omega_e = \frac{1}{2}\left[\left(x_2 y_3 - x_3 y_2\right) + \left(x_3 y_1 - x_1 y_3\right) + \left(x_1 y_2 - x_2 y_1\right)\right] \]
To help evaluate the local stiffness matrices we will also define the array:
\[ \nabla\eta = \frac{1}{2\Omega_e}\begin{bmatrix} y_2 - y_3 & y_3 - y_1 & y_1 - y_2 \\ x_3 - x_2 & x_1 - x_3 & x_2 - x_1 \end{bmatrix} \quad (15.41) \]
which is simply storing the spatial derivatives of the shape functions of the element.
If we were to evaluate the first term in the local stiffness matrix, then by evaluating
Equation 15.35 we would have:
\[ K^e_{1,1} = \frac{\mu}{4\Omega_e}\begin{Bmatrix} y_2 - y_3 \\ x_3 - x_2 \end{Bmatrix}\cdot\begin{Bmatrix} y_2 - y_3 \\ x_3 - x_2 \end{Bmatrix} \quad (15.42) \]
and from examination of the layout of the array in Equation 15.41 we can see that this is equivalent to:
\[ K^e_{1,1} = \mu\begin{Bmatrix} \nabla\eta_{1,1} \\ \nabla\eta_{2,1} \end{Bmatrix}\cdot\begin{Bmatrix} \nabla\eta_{1,1} \\ \nabla\eta_{2,1} \end{Bmatrix}\Omega_e \quad (15.43) \]
In fact we can apply this expression to any p, q entry in the local stiffness matrix as:
\[ K^e_{p,q} = \mu\begin{Bmatrix} \nabla\eta_{1,p} \\ \nabla\eta_{2,p} \end{Bmatrix}\cdot\begin{Bmatrix} \nabla\eta_{1,q} \\ \nabla\eta_{2,q} \end{Bmatrix}\Omega_e \quad (15.44) \]
So what we need to do is loop over the p, q indices of each element with nested
for loops and evaluate the terms in K e . Now in order to actually implement this
in our Matlab code our assemble function is going to involve ‘looping’ over all of
the elements in the grid, then looping over all of the nodes of each element. The
algorithm will look something like:
function [K, s, phi, Free, Fixed] = assemble(K, s, phi, Points, Faces, ...
Elements, Boundaries, N_p, N_f, N_e, N_b);
...
s_e = [1; 1; 1];
for e=1:N_e
Nodes = Elements(e,:);
x = Points(Nodes,1);
y = Points(Nodes,2);
gradEta = [y(2)-y(3), y(3)-y(1), y(1)-y(2);
x(3)-x(2), x(1)-x(3), x(2)-x(1)]/(2*Omega(e));  % shape function derivatives, Equation 15.41
for p=1:3
m = Nodes(p);
gradEta_p = [gradEta(1,p), gradEta(2,p)];
for q=1:3
n = Nodes(q);
gradEta_q = [gradEta(1,q), gradEta(2,q)];
K(m,n) = K(m,n) + mu*dot(gradEta_p,gradEta_q)*Omega(e);
end
s(m) = s(m) + s_e(p)*psi*Omega(e)/3;
end
end
...
end
Where an important point to note is that the array Nodes is a 1 × 3 array defining
the global indices of the 3 nodes defining any given element e. So with reference to
Figure 15.7, when e = 5, Nodes = {218, 220, 1028}. So when we come to adding in the contribution of the local stiffness matrix and the local load vector to the
global K and s, we can access and assign values to these positions quite easily in
Matlab with the notation K(m,n), etc.
Now, in order to apply the boundary conditions we can loop over every boundary
in our structure, and loop over every point in that boundary and assign the value
into the array φ for every time step. We can do this by adding in another for loop
over the boundaries within our assemble function as:
function [K, s, phi, Free, Fixed] = assemble(K, s, phi, Points, Faces, ...
Elements, Boundaries, N_p, N_f, N_e, N_b);
...
Fixed = [];
for b=1:N_b
for p=1:Boundaries(b).N
m = Boundaries(b).indices(p);
phi(m) = Boundaries(b).value;
end
Fixed = [Fixed; Boundaries(b).indices'];
end
Free = setdiff(1:N_p, Fixed);
end
where it can be observed that as we loop over the Dirichlet boundaries we are adding
their indices to the array Fixed. Furthermore, we are then creating the array Free
using the setdiff function that will give us a list of the interior indices. At this
point, the system is completely assembled (Figure 15.8). It can be observed that
similar to the stiffness matrix for the system in Example 14.1, this matrix is sparse
and symmetric.
Figure 15.8: The pattern of the assembled stiffness matrix K using the Finite Ele-
ment method.
In terms of the Dirichlet boundary points, what we are going to do here is follow
the same idea that was used in Example 13.3 where we ‘partitioned’ the final system
of equations to solve a reduced system corresponding only to the interior points, so
that the rows corresponding to the Dirichlet boundary points won’t be included.
Conceptually we can think of our system Kφ = s as:
\[ \begin{bmatrix} K_{Free,Free} & K_{Free,Fixed} \\ K_{Fixed,Free} & K_{Fixed,Fixed} \end{bmatrix}\begin{Bmatrix} \phi_{Free} \\ \phi_{Fixed} \end{Bmatrix} = \begin{Bmatrix} s_{Free} \\ s_{Fixed} \end{Bmatrix} \]
Where the subscript Free is indicating all of the interior points where we don’t actually know the values in φ and the subscript Fixed is indicating all of the boundary
points where we do. Conceptually, we could then in fact just solve:
\[ K_{Free,Free}\,\phi_{Free} = s_{Free} - K_{Free,Fixed}\,\phi_{Fixed} \]
which will only compute the solution for the interior points and so as long as the
Dirichlet boundary points are initialized correctly in phi, then we will have imposed
the boundary conditions correctly. Now that we have shown how we assemble our
system of equations, we are now in a position to write out the core of the program
as:
[K, s, phi, Free, Fixed] = assemble(K, s, phi, Points, Faces, ...
Elements, Boundaries, N_p, N_f, N_e, N_b);
phi(Free) = K(Free,Free)\(s(Free) - K(Free,Fixed)*phi(Fixed));
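Once the solve has completed, the solution over the unstructured grid can be visualized; a short sketch (not part of the book's listings, assuming the Points, Elements, and phi arrays defined above) is:

trisurf(Elements, Points(:,1), Points(:,2), phi);   % surface plot of phi over the triangulation
xlabel('x'); ylabel('y'); zlabel('\phi');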
Example 15.3:
In this example we will develop both a Matlab and a C++ program to solve the
2D generic scalar transport equation:
φ̇ + v · ∇φ = µ∇2 φ + ψ (15.45)
Figure 15.9: The solutions to the PDE in Example 15.2, illustrating the solution (a) immediately after assembly of the global system of equations, and (b) after the system has been solved.
in the domain x ∈ [0, 1], y ∈ [0, 1], t ∈ [0, 2], with boundary conditions φ(0, y) = 1, φ(x, 0) = 1, ∂x φ(1, y) = 0, ∂y φ(x, 1) = 0, initial condition φ(x, y, 0) = e^{−50(x−0.3)^2} + 1, and v = {0.5, 0.5}, µ = 0.01, and ψ = 0.2. For the spatial discretization we will
use the Finite Element method with linear triangular elements, for the temporal
discretization we will use the implicit Euler method, and in our C++ program we will
solve the resulting linear system with the Conjugate Gradient method. The intended
learning outcome for this example will be to simply observe the application of the
Finite Element method to solve a time dependent PDE and to see how we can use
the SparseMatrix class in a C++ program.
To begin, as we did with in Example 15.2 we will assume that the unstructured
grid defining the domain is already defined and will be returned through a function
called readGrid in our Matlab program and read in our C++ program. As such we
will be storing three arrays for this problem; an array called Points which is an
Np × 2 array storing the x, y coordinates of the points defining the grid, an array
called Faces which is an Nf × 2 array storing the two indices of the points defining
a face in the grid, and an array called Elements which is an Ne × 3 array storing
the three indices of the points defining an element in the grid. As we did previously,
we are not going to make any assumptions about the ordering of the points or the
faces in their respective arrays. In order to prescribe our boundary conditions we
are again going to make use of a structure in our Matlab program to store all of the
information that we need:
Boundaries = struct('name', {}, 'type', {}, 'N', {}, 'indices', {}, 'value', {});
So, for the φ(x, 0) = 1 boundary, for example, we would have the entry:
Boundaries(1).name = 'bottom';
Boundaries(1).type = 'dirichlet';
Boundaries(1).N = 28;
Boundaries(1).indices = [5 6 7 8 9 10 11 12 13 14 15 16 17 18 ... 31];
Boundaries(1).value = 1.00000;
and for the ∂x φ(1, y) = 0 boundary, for example, we would have the entry:
Boundaries(2).name = 'right';
Boundaries(2).type = 'neumann';
Boundaries(2).N = 28;
Boundaries(2).indices = [29 30 31 32 33 34 35 36 37 38 39 40 41 42 ... 56];
Boundaries(2).value = 0.00000;
In contrast to Example 15.2 we now have Neumann boundaries present in our prob-
lem, which illustrates an important point. If we are dealing with a Dirichlet bound-
ary then indices indicates to which points in the Points array the given boundary
condition needs to be applied. If however, we are dealing with a Neumann bound-
ary, then indices indicates to which faces in the Faces array the integration of the
Neumann boundary condition needs to be performed. So depending on the type we
will interpret the indices differently and do different things with them when we
come to assembling our global system of equations. Using a struct in a Matlab
program is a good way to group and store the different ‘bits’ of data that define a
320 CHAPTER 15. FINITE ELEMENT METHODS
given boundary. In C++, this is the perfect job for a class and so we will define one
called Boundary which will take the form:
class Boundary
{
public:
Boundary()
{ }
string name_;
string type_;
int N_;
int* indices_;
double value_;
};
As can be observed, this is a fairly simple class, containing the same fields as the
Matlab struct and when we create our array Boundaries, we will be allocating
memory to store N_b of these boundary objects. So, let’s begin by implementing
our function to read in the unstructured grid in our C++ program. The contents of
the text file is going to be pretty much exactly the same as the data that was in the
readGrid function in the Matlab code and will look something like:
N_p 1093
N_f 112
N_e 2072
N_b 4
Points
0.00000 0.00000
1.00000 0.00000
...
0.33743 0.05103
Faces
0 4
4 5
...
111 0
Elements
494 778 113
495 779 114
...
383 1057 1029
Boundaries
bottom
dirichlet
28
1 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ...
1.00000
right
neumann
28
28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 ...
0.00000
321
top
neumann
28
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 ...
0.00000
left
dirichlet
29
0 3 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 ...
1.00000
This is a fairly common approach to structuring a file defining a grid, where, at the
top, the first things we define are the numbers of points, faces, elements, boundaries,
etc, so that our program can dynamically allocate the required memory and will also
‘know’ how many of each entity to read in. Now, if we were developing a very general
purpose program, we might add in more information regarding the dimensionality
of the grid, the number of points defining a face, the element types and hence the
number of points defining an element. We’re going to keep things fairly simple with
our program however and make the assumptions that we are dealing with a 2D grid,
with linear triangular elements. We will also assume that the name of the file to
read in will be passed to the function read, and so with that in mind the function
will begin with:
void read(char* filename, double**& Points, int**& Faces, int**& Elements, ...
Boundary*& Boundaries, int& N_p, int& N_f, int& N_e, int& N_b)
{
fstream file;
string temp;
file.open(filename);
where we are opening up the file, reading in the data and assigning it to the integer
variables N_p...N_b. Knowing the amount of data contained in the file we can then allocate the arrays to store this, and it can be observed that we are allocating the 2D arrays using two calls to the new operator, such that our data will be contiguous in memory. An important point to note is that in the text file, we are not interested in the text N_p...N_b, but rather the numbers alongside them. As such we create the string called temp where we will store this text, which allows us to work through the file, overwriting it each time. Now, we could just store the numbers N_p...N_b
in the file, but this little bit of text defining the meaning of each number makes the
file a bit more ‘human readable’ (i.e. we have a bit more of an idea as to what the
numbers mean). If our input file was written in binary as opposed to ascii (meaning
it’s not supposed to be human readable), then perhaps we wouldn’t bother with
this.
The next part of the function involves looping over the number of points, faces,
elements, boundaries, and reading in the data:
void read(char* filename, double**& Points, int**& Faces, int**& Elements, ...
Boundary*& Boundaries, int& N_p, int& N_f, int& N_e, int& N_b)
{
...
file >> temp;
for(int p=0; p<N_p; p++)
{
file >> Points[p][0] >> Points[p][1];
}
file >> temp;
for(int f=0; f<N_f; f++)
{
file >> Faces[f][0] >> Faces[f][1];
}
file >> temp;
for(int e=0; e<N_e; e++)
{
file >> Elements[e][0] >> Elements[e][1] >> Elements[e][2];
}
file >> temp;
for(int b=0; b<N_b; b++)
{
file >> Boundaries[b].name_ >> Boundaries[b].type_ >> Boundaries[b].N_;
Boundaries[b].indices_ = new int [Boundaries[b].N_];
for(int n=0; n<Boundaries[b].N_; n++)
{
file >> Boundaries[b].indices_[n];
323
}
file >> Boundaries[b].value_;
}
file.close();
return;
}
At this point we can apply the spatial discretization that is the Finite Element
method to our PDE and we know that we will have a semi-discrete system of the
form:
M φ̇ = Kφ + s
where:
\[ M = \sum_{e=1}^{N_e} \int_{\Omega_e} \eta_p\eta_q\,d\Omega \]
\[ K = -\sum_{e=1}^{N_e} \left( \int_{\Omega_e} \mu\nabla\eta_p\cdot\nabla\eta_q\,d\Omega + \int_{\Omega_e} \eta_p v\cdot\nabla\eta_q\,d\Omega \right) \]
\[ s = \sum_{e=1}^{N_e} \left( \int_{\Omega_e} \eta_p\psi_p\,d\Omega + \int_{\Gamma_e} \mu\eta_p\nabla\phi\cdot d\Gamma \right) \]
Applying the implicit Euler method for the temporal discretization we have:
\[ M\frac{\phi^{l+1} - \phi^l}{\Delta t} = K\phi^{l+1} + s \]
which can be rearranged to:
\[ \left(M - \Delta t\,K\right)\phi^{l+1} = M\phi^l + \Delta t\,s \]
So again, as with the Finite Difference and Finite Volume methods, we have reduced
our problem to the solution of:
Aφl+1 = b
where:
A = M − ∆tK
b = M φl + ∆ts
As with the Finite Difference method applied in Example 13.3, the Finite Volume
method applied in Example 14.1, and the Finite Element method applied in Example
15.2, the core part of the method is the ‘assembling’ of the matrices defining the
system of equations. To assemble these matrices we are going to need to evaluate
Ωe and Γf for the elements and faces respectively. Now, just as in Example 15.2, Ωe
is the area of each triangle, which we can evaluate as:
\[ \Omega_e = \frac{1}{2}\left[\left(x_2 y_3 - x_3 y_2\right) + \left(x_3 y_1 - x_1 y_3\right) + \left(x_1 y_2 - x_2 y_1\right)\right] \]
remembering that the solution within each element is defined in terms of the nodal solutions of the three points that define it.
nodes of this element and evaluate the terms in the local mass and stiffness matrices
and the load vector. Now we have seen from Equation 15.34 that the local mass
matrix is fairly simple, containing 2 on the main diagonal and 1 everywhere else.
The only thing that changes between elements is the area of the element Ωe to which
the local mass matrix is multiplied by. The same is true for the contribution of the
source term to the local load vector and we can evaluate Equation 15.36 for each
element fairly simply. The local stiffness matrix K e is a bit more complicated, so
let’s look at how we create that. To help with this we will define the array:
\[ \nabla\eta = \frac{1}{2\Omega_e}\begin{bmatrix} y_2 - y_3 & y_3 - y_1 & y_1 - y_2 \\ x_3 - x_2 & x_1 - x_3 & x_2 - x_1 \end{bmatrix} \quad (15.46) \]
which is simply storing the spatial derivatives of the shape functions of the element.
If we were to evaluate the first term in the local stiffness matrix, then by evaluating
Equation 15.35 we would have:
\[ K^e_{1,1} = -v\cdot\frac{1}{6}\begin{Bmatrix} y_2 - y_3 \\ x_3 - x_2 \end{Bmatrix} - \frac{\mu}{4\Omega_e}\begin{Bmatrix} y_2 - y_3 \\ x_3 - x_2 \end{Bmatrix}\cdot\begin{Bmatrix} y_2 - y_3 \\ x_3 - x_2 \end{Bmatrix} \quad (15.47) \]
and from examination of the layout of the array in Equation 15.46 we can see that this is equivalent to:
\[ K^e_{1,1} = -v\cdot\frac{1}{3}\begin{Bmatrix} \nabla\eta_{1,1} \\ \nabla\eta_{2,1} \end{Bmatrix}\Omega_e - \mu\begin{Bmatrix} \nabla\eta_{1,1} \\ \nabla\eta_{2,1} \end{Bmatrix}\cdot\begin{Bmatrix} \nabla\eta_{1,1} \\ \nabla\eta_{2,1} \end{Bmatrix}\Omega_e \quad (15.48) \]
In fact the layout is such that we can apply this expression to any p, q entry in the local stiffness matrix as:
\[ K^e_{p,q} = -v\cdot\frac{1}{3}\begin{Bmatrix} \nabla\eta_{1,q} \\ \nabla\eta_{2,q} \end{Bmatrix}\Omega_e - \mu\begin{Bmatrix} \nabla\eta_{1,p} \\ \nabla\eta_{2,p} \end{Bmatrix}\cdot\begin{Bmatrix} \nabla\eta_{1,q} \\ \nabla\eta_{2,q} \end{Bmatrix}\Omega_e \quad (15.49) \]
So what we need to do is loop over the p, q indices of each element with nested
for loops and evaluate the terms in K e . Now in order to actually implement this
in our Matlab code our assemble function is going to involve ‘looping’ over all of
the elements in the grid, then looping over all of the nodes of each element. The
algorithm will look something like:
function [M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, Points, Faces, Elements, ...
Boundaries, N_p, N_f, N_e, N_b)
...
M_e = [2, 1, 1;
1, 2, 1;
1, 1, 2];
s_e = [1; 1; 1];
for e=1:N_e
Nodes = Elements(e,:);
x = Points(Nodes,1);
y = Points(Nodes,2);
gradEta = [y(2)-y(3), y(3)-y(1), y(1)-y(2);
x(3)-x(2), x(1)-x(3), x(2)-x(1)]/(2*Omega(e));  % shape function derivatives, Equation 15.46
for p=1:3
m = Nodes(p);
gradEta_p = [gradEta(1,p), gradEta(2,p)];
for q=1:3
n = Nodes(q);
gradEta_q = [gradEta(1,q), gradEta(2,q)];
M(m,n) = M(m,n) + M_e(p,q) *Omega(e)/12;
K(m,n) = K(m,n) - dot(v,gradEta_q)*Omega(e)/3 ...
- mu*dot(gradEta_p,gradEta_q)*Omega(e);
end
s(m) = s(m) + s_e(p)*psi*Omega(e)/ 3;
end
end
...
end
Where an important point to note is that the array Nodes is a 1 × 3 array defining
the global indices of the 3 nodes defining any given element e. So with reference to
Figure 15.7, when e = 5, Nodes = {218, 220, 1028}. So when we come to adding in
the contribution of the local mass and stiffness matrices and the local load vector
to the global M, K, and s, we can access and assign values to these positions quite
easily in Matlab with the notation K(m,n), etc.
Now, in order to apply the boundary conditions we can loop over every boundary
in our structure, and if the boundary is a Neumann boundary, we can loop over
every face in that boundary, evaluate the integral term in Equation 15.37 and add
the contribution to entries in the load vector corresponding to the two nodes that
define that face. If the boundary is a Dirichlet boundary we can loop over every
point in that boundary and assign the value into the array φ for every time step. We
can do this by adding in another for loop over the boundaries within our assemble
function as:
function [M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, Points, Faces, Elements, ...
Boundaries, N_p, N_f, N_e, N_b)
...
Fixed = [];
for b=1:N_b
if strcmp(Boundaries(b).type, ’neumann’)
for f=1:Boundaries(b).N;
Nodes = Faces(Boundaries(b).indices(f),:);
for p=1:2
m = Nodes(p);
s(m) = s(m) + mu*Boundaries(b).value*Gamma(Boundaries(b).indices(f))/2;
end
end
elseif strcmp(Boundaries(b).type, ’dirichlet’)
for p=1:Boundaries(b).N
m = Boundaries(b).indices(p);
phi(m,:) = Boundaries(b).value;
end
Fixed = [Fixed; Boundaries(b).indices'];
end
end
Free = setdiff(1:N_p, Fixed);
end
where it can be observed that as we loop over the Dirichlet boundaries we are
adding their indices to the array Fixed. Furthermore, we are then creating the array
Free using the setdiff function that will give us a list of the interior and Neumann
boundary indices. At this point, the system is completely assembled (Figure 15.10).
It can be observed that similar to the stiffness matrix for the system in Example 14.1,
this matrix is sparse and symmetric. As it happens, the mass matrix will have the
same pattern as the stiffness matrix depicted in Figure 15.10, illustrating a difference
between the Finite Difference, Finite Volume, and Finite Element methods applied
to solving the same problem. For the Finite Difference method, M was the identity
matrix, for the Finite Volume method, M was a diagonal matrix with the area of
each cell on the main diagonal, and for the Finite Element method, M involves the
area of each element, but in a more complex manner.
Figure 15.10: The pattern of the assembled stiffness matrix K using the Finite
Element method.
In our C++ program, we will be using our SparseMatrix class from Example
1.1 to store the mass and stiffness matrices, and A. Remembering that we need to
initialize these objects, assemble the coefficients with the overloaded () operator,
then finalize them, the algorithm will look something like:
void assemble(SparseMatrix& M, SparseMatrix& K, double* s, double* phi, ...
int* Free, int* Fixed, double** Points, int** Faces, int** Elements, ...
Boundary* Boundaries, int& N_p, int& N_f, int& N_e, int& N_b)
{
...
double M_e[3][3] = {{2.0, 1.0, 1.0}, {1.0, 2.0, 1.0}, {1.0, 1.0, 2.0}};
double s_e[3] = {1.0, 1.0, 1.0};
...
M.initialize(N_p, 10);
K.initialize(N_p, 10);
for(int e=0; e<N_e; e++)
{
for(int p=0; p<3; p++)
{
Nodes[p]= Elements[e][p];
x[p] = Points[Nodes[p]][0];
y[p] = Points[Nodes[p]][1];
}
gradEta[0][0] = (y[1]-y[2])/(2*Omega[e]);
gradEta[0][1] = (y[2]-y[0])/(2*Omega[e]);
gradEta[0][2] = (y[0]-y[1])/(2*Omega[e]);
gradEta[1][0] = (x[2]-x[1])/(2*Omega[e]);
gradEta[1][1] = (x[0]-x[2])/(2*Omega[e]);
gradEta[1][2] = (x[1]-x[0])/(2*Omega[e]);
for(int p=0; p<3; p++)
{
m = Nodes[p];
gradEta_p[0] = gradEta[0][p];
gradEta_p[1] = gradEta[1][p];
for(int q=0; q<3; q++)
{
n = Nodes[q];
gradEta_q[0]= gradEta[0][q];
gradEta_q[1]= gradEta[1][q];
M(m,n) += M_e[p][q]*Omega[e]/12;
K(m,n) -= ((v[0]*gradEta_q[0]+v[1]*gradEta_q[1])/6
+ mu*(gradEta_p[0]*gradEta_q[0]+gradEta_p[1]*gradEta_q[1])*Omega[e]);
}
s[m] += s_e[p]*psi*Omega[e]/3;
}
}
for(int b=0; b<N_b; b++)
{
if (Boundaries[b].type_=="neumann")
{
for(int f=0; f<Boundaries[b].N_; f++)
{
for(int p=0; p<2; p++)
{
Nodes[p] = Faces[Boundaries[b].indices_[f]][p];
m = Nodes[p];
s[m] += mu*Boundaries[b].value_*Gamma[f]/2;
}
}
}
else if (Boundaries[b].type_=="dirichlet")
{
for(int p=0; p<Boundaries[b].N_; p++)
{
m = Boundaries[b].indices_[p];
phi[m] = Boundaries[b].value_;
Free[m] = false;
Fixed[m]= true;
}
}
}
K.finalize();
M.finalize();
...
}
Marking the Dirichlet nodes as Fixed in this way means that the solver will only
compute the solution for the free points, and so as long as the Dirichlet boundary
points are initialized correctly in phi, then we will have imposed
the boundary conditions correctly. Now that we have shown how we assemble our
system of equations, we are now in a position to write out the core of the program
as:
[M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, Points, Faces, Elements, ...
Boundaries, N_p, N_f, N_e, N_b);
A = M - Delta_t*K;
for l=1:N_t-1
b = M*phi(:,l) + Delta_t*s;
phi(Free,l+1) = A(Free,Free)\(b(Free) - A(Free,Fixed)*phi(Fixed,l+1));
end
Now, in our Matlab code we could simply write A=M-Delta_t*K in order to define
the matrix A, but obviously we can’t ‘get away’ with such a concise piece of code
in C++. In order to make the necessary computation as simple as possible the first
member function that we are going to add will overload the = operator so that we
can have the line of code A=M; in our program and the resulting function call will
copy all of the data stored in the val_, col_, row_ arrays, etc. As such the member
function will look like:
void SparseMatrix::operator= (const SparseMatrix& A)
{
if(val_) delete [] val_;
if(col_) delete [] col_;
if(row_) delete [] row_;
if(nnzs_) delete [] nnzs_;
N_row_ = A.N_row_;
N_nz_ = A.N_nz_;
N_nz_rowmax_ = A.N_nz_rowmax_;
N_allocated_ = A.N_allocated_;
val_ = new double [N_allocated_];
col_ = new int [N_allocated_];
row_ = new int [N_row_+1];
memcpy(val_, A.val_, N_nz_ *sizeof(double));
memcpy(col_, A.col_, N_nz_ *sizeof(int));
memcpy(row_, A.row_, (N_row_+1) *sizeof(int));
}
It can be observed here that our SparseMatrix object will receive a constant refer-
ence to another SparseMatrix object as its input argument, delete its own arrays if
they have been allocated, allocate memory to store all of the data, then finally copy
it. An important assumption that we have made here is that the matrix that we
are copying has been finalized, such that we don’t have to worry about copying the
nnzs array that was used temporarily in the assembly process. If we were develop-
ing a more general purpose class , then we would have to develop more complex
functions to deal with these situations. The second member function that we will
add will then subtract one matrix from another, but will allow us to multiply this
matrix by a constant before subtracting the elements:
void SparseMatrix::subtract(double u, SparseMatrix& A)
{
for(int k=0; k<N_nz_; k++)
{
val_[k] -= (u*A.val_[k]);
}
return;
}
As can be observed this is actually a fairly simple function because all we have to
do is loop over every element in the val arrays and subtract one from the other.
There are however, a couple of points worthy of mention. The first is that we
are making the assumption in this function that the two matrices have the same
nonzero entries. For our particular problem this will be the case because the mass
and stiffness matrices do have the same pattern. If we were attempting to develop
a class that was more general purpose and could work with matrices that have
different patterns, then this function would obviously have to be a bit more complex
and check the row and column indices etc and insert new entries if they weren’t
already there. A second point is that this function is quite specific to our program
in that we are subtracting a constant multiplied by another matrix. Again, if we
were developing a more general purpose class , then we might create many more
of these member functions, for example:
void SparseMatrix::subtract(SparseMatrix& A);
void SparseMatrix::add(double u, SparseMatrix& A);
void SparseMatrix::add(SparseMatrix& A);
plus any other operations that we think could be useful. Following the use of the =
and subtract functions (i.e. A = M; followed by A.subtract(Delta_t, K);, mirroring
the Matlab statement A = M - Delta_t*K) we will have correctly assembled the matrix A. The
final member function that we will add will perform a matrix vector multiplication,
taking as an input a 1D array defining the vector that the matrix should be mul-
tiplied by, and a vector that is the output of the matrix vector multiplication. We
can achieve quite an efficient implementation of this algorithm because we only have
to loop over the nonzero entries in the val array and as such the member function
will look like:
void SparseMatrix::multiply(double* u, double* v)
{
for(int m=0; m<N_row_; m++)
{
u[m] = 0.0;
for(int k=row_[m]; k<row_[m+1]; k++)
{
u[m] += val_[k]*v[col_[k]];
}
}
return;
}
where it can be observed that we loop over each row in the matrix, then loop over
all of the non zero columns and add the corresponding term into the output vector
u. An important assumption that we are making here is that both u and v are of
the correct size. Another issue to consider is the case where we only want to use
the free indices in the matrix vector multiplication. As such we will create another
version of this function which will take two additional arguments defining the rows
and columns of the matrix that we want to use in any matrix vector product. It is
in this way that we can use the Free and Fixed indices to partition the matrix in
the implementation of our Conjugate Gradient method. The code for this member
function will look like:
void SparseMatrix::multiply(double* u, double* v, int* includerows, int* includecols)
{
for(int m=0; m<N_row_; m++)
{
u[m] = 0.0;
if(includerows[m])
{
for(int k=row_[m]; k<row_[m+1]; k++)
{
if(includecols[col_[k]])
{
u[m] += val_[k]*v[col_[k]];
}
}
}
}
return;
}
With these member functions now defined we can look at the time marching loop
in our program. After the first multiply statement, the 1D array b will contain the
matrix vector product M φl , then after the for loop M φl + ∆ts. With this array
evaluated for any given time step, we call the solve function, which will look like:
void solve(SparseMatrix& A, double* phi, double* b, int* Free, int* Fixed)
{
...
A.multiply(Aphi, phi, Free, Free);
for(m=0; m<N_row; m++)
{
if(Free[m])
{
r_old[m] = b[m] - Aphi[m];
d[m] = r_old[m];
r_oldTr_old+= r_old[m]*r_old[m];
}
}
r_norm = sqrt(r_oldTr_old);
while(r_norm>tolerance && k<maxIterations)
{
dTAd = 0.0;
A.multiply(Ad, d, Free, Free);
for(m=0; m<N_row; m++)
{
if(Free[m])
{
dTAd += d[m]*Ad[m];
}
}
alpha = r_oldTr_old/dTAd;
for(m=0; m<N_row; m++)
{
if(Free[m])
{
phi[m] += alpha*d[m];
}
}
for(m=0; m<N_row; m++)
{
if(Free[m])
{
r[m] = r_old[m] - alpha*Ad[m];
}
}
rTr = 0.0;
for(m=0; m<N_row; m++)
{
if(Free[m])
{
rTr += r[m]*r[m];
}
}
beta = rTr/r_oldTr_old;
for(m=0; m<N_row; m++)
{
if(Free[m])
{
d[m] = r[m] + beta*d[m];
}
}
for(m=0; m<N_row; m++)
{
if(Free[m])
{
r_old[m] = r[m];
}
}
r_oldTr_old = rTr;
r_norm = sqrt(rTr);
k++;
}
return;
}
where the only differences compared to Example 3.5 are that, firstly, we use the
multiply member function of the SparseMatrix class to compute any matrix-vector
products and, secondly, we only use the free indices in computing these.
The complete program is given in Example15_3.cpp. Figures 15.11(a) - 15.11(d)
illustrate the solution at a number of different time steps. It can be observed that as
time progresses the bell shaped surface (which is the initial condition) moves through
the domain (due to the convective term) and spreads out (due to the diffusive term),
and the domain as a whole rises (due to the source term).
This example has been the most complex one that we’ve encountered thus far,
combining numerical methods for solving a PDE, system of ODEs, and the linear
equations at each time step that result from the full discretization. It can be ob-
served that we ‘tweak’ the basic methods (or at least the codes implementing them)
in various ways such that these components ‘stitch’ together nicely. It should be
remembered however that we are striking a balance in terms of efficiency and ease
of understanding.
With other types of elements, the same ideas that we’ve just used still apply,
it’s just that the form of the shape functions may look different. For example, if we
were to derive shape functions in 3D for a linear tetrahedron say, we would start
with the trial solution:
\[
\phi(x, y, z) = a_0 + a_1 x + a_2 y + a_3 z
\]
Figure 15.11: The solutions to the PDE in Example 15.3 illustrating the solution at
(a) t = 0, (b) t = 0.5, (c) t = 1.0 and (d) t = 1.5.
336 CHAPTER 15. FINITE ELEMENT METHODS
and using the same procedure as we did for the linear triangle we would end up with
the shape functions:
\[
\eta_1(x, y, z) = \frac{1}{6V_e}\Big[\, x_2(y_3 z_4 - y_4 z_3) + x_3(y_4 z_2 - y_2 z_4) + x_4(y_2 z_3 - y_3 z_2)
+ \big((y_4 - y_2)(z_3 - z_2) - (y_3 - y_2)(z_4 - z_2)\big)x
+ \big((x_3 - x_2)(z_4 - z_2) - (x_4 - x_2)(z_3 - z_2)\big)y
+ \big((x_4 - x_2)(y_3 - y_2) - (x_3 - x_2)(y_4 - y_2)\big)z \,\Big]
\]
\[
\eta_2(x, y, z) = \frac{1}{6V_e}\Big[\, x_1(y_4 z_3 - y_3 z_4) + x_3(y_1 z_4 - y_4 z_1) + x_4(y_3 z_1 - y_1 z_3)
+ \big((y_3 - y_1)(z_4 - z_3) - (y_3 - y_4)(z_1 - z_3)\big)x
+ \big((x_4 - x_3)(z_3 - z_1) - (x_1 - x_3)(z_3 - z_4)\big)y
+ \big((x_3 - x_1)(y_4 - y_3) - (x_3 - x_4)(y_1 - y_3)\big)z \,\Big]
\]
\[
\eta_3(x, y, z) = \frac{1}{6V_e}\Big[\, x_1(y_2 z_4 - y_4 z_2) + x_2(y_4 z_1 - y_1 z_4) + x_4(y_1 z_2 - y_2 z_1)
+ \big((y_2 - y_4)(z_1 - z_4) - (y_1 - y_4)(z_2 - z_4)\big)x
+ \big((x_1 - x_4)(z_2 - z_4) - (x_2 - x_4)(z_1 - z_4)\big)y
+ \big((x_2 - x_4)(y_1 - y_4) - (x_1 - x_4)(y_2 - y_4)\big)z \,\Big]
\]
\[
\eta_4(x, y, z) = \frac{1}{6V_e}\Big[\, x_1(y_3 z_2 - y_2 z_3) + x_2(y_1 z_3 - y_3 z_1) + x_3(y_2 z_1 - y_1 z_2)
+ \big((y_1 - y_3)(z_2 - z_1) - (y_1 - y_2)(z_3 - z_1)\big)x
+ \big((x_2 - x_1)(z_1 - z_3) - (x_3 - x_1)(z_1 - z_2)\big)y
+ \big((x_1 - x_3)(y_2 - y_1) - (x_1 - x_2)(y_3 - y_1)\big)z \,\Big] \qquad (15.50)
\]
where Ve is the volume of the element and is defined in terms of its nodal coordinates
as:
\[
6V_e = \begin{vmatrix} 1 & x_1 & y_1 & z_1 \\ 1 & x_2 & y_2 & z_2 \\ 1 & x_3 & y_3 & z_3 \\ 1 & x_4 & y_4 & z_4 \end{vmatrix}
\]
So we can see here that the shape functions for the linear tetrahedron are similar
in form to those for the linear triangle, but have a few more terms in them and
unsurprisingly involve the volume of the element rather than the area of the element.
Analogously, for the linear tetrahedron we could express the derivatives of the shape
functions as:
\[
\begin{aligned}
\frac{\partial \eta_1}{\partial x} &= \frac{1}{6V_e}\big((y_4 - y_2)(z_3 - z_2) - (y_3 - y_2)(z_4 - z_2)\big) &
\frac{\partial \eta_1}{\partial y} &= \frac{1}{6V_e}\big((x_3 - x_2)(z_4 - z_2) - (x_4 - x_2)(z_3 - z_2)\big) \\
\frac{\partial \eta_1}{\partial z} &= \frac{1}{6V_e}\big((x_4 - x_2)(y_3 - y_2) - (x_3 - x_2)(y_4 - y_2)\big) &
\frac{\partial \eta_2}{\partial x} &= \frac{1}{6V_e}\big((y_3 - y_1)(z_4 - z_3) - (y_3 - y_4)(z_1 - z_3)\big) \\
\frac{\partial \eta_2}{\partial y} &= \frac{1}{6V_e}\big((x_4 - x_3)(z_3 - z_1) - (x_1 - x_3)(z_3 - z_4)\big) &
\frac{\partial \eta_2}{\partial z} &= \frac{1}{6V_e}\big((x_3 - x_1)(y_4 - y_3) - (x_3 - x_4)(y_1 - y_3)\big) \\
\frac{\partial \eta_3}{\partial x} &= \frac{1}{6V_e}\big((y_2 - y_4)(z_1 - z_4) - (y_1 - y_4)(z_2 - z_4)\big) &
\frac{\partial \eta_3}{\partial y} &= \frac{1}{6V_e}\big((x_1 - x_4)(z_2 - z_4) - (x_2 - x_4)(z_1 - z_4)\big) \\
\frac{\partial \eta_3}{\partial z} &= \frac{1}{6V_e}\big((x_2 - x_4)(y_1 - y_4) - (x_1 - x_4)(y_2 - y_4)\big) &
\frac{\partial \eta_4}{\partial x} &= \frac{1}{6V_e}\big((y_1 - y_3)(z_2 - z_1) - (y_1 - y_2)(z_3 - z_1)\big) \\
\frac{\partial \eta_4}{\partial y} &= \frac{1}{6V_e}\big((x_2 - x_1)(z_1 - z_3) - (x_3 - x_1)(z_1 - z_2)\big) &
\frac{\partial \eta_4}{\partial z} &= \frac{1}{6V_e}\big((x_1 - x_3)(y_2 - y_1) - (x_1 - x_2)(y_3 - y_1)\big)
\end{aligned} \qquad (15.51)
\]
Furthermore, we have the integration formulae defined for the linear tetrahedron:
\[
\int_{\Omega_e} \eta_p^a\, \eta_q^b\, \eta_r^c\, \eta_s^d\; d\Omega = \frac{a!\, b!\, c!\, d!\; 6\Omega_e}{(a + b + c + d + 3)!} \qquad (15.52)
\]
and:
\[
\int_{\Gamma_e} \eta_p^a\, \eta_q^b\, \eta_r^c\; d\Gamma = \frac{a!\, b!\, c!\; 2\Gamma_e}{(a + b + c + 2)!} \qquad (15.53)
\]
where for the 3D case Ωe and Γe represent the volume and face area of a tetrahedron,
respectively.
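As a quick worked example of Equation 15.52, the entries of the consistent mass matrix for the linear tetrahedron follow immediately:
\[
\int_{\Omega_e} \eta_p\, \eta_q\; d\Omega = \frac{1!\,1!\;6\Omega_e}{(1+1+3)!} = \frac{\Omega_e}{20} \quad (p \neq q), \qquad
\int_{\Omega_e} \eta_p^2\; d\Omega = \frac{2!\;6\Omega_e}{(2+3)!} = \frac{\Omega_e}{10},
\]
the three-dimensional analogues of the Ω_e/12 and Ω_e/6 entries of M_e for the linear triangle.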
As was mentioned earlier we have restricted ourselves to only investigating a very
small portion of the ‘wider world’ of the Finite Element method. The main reason
for this restriction is that it is easier to investigate one way in which we can develop
an algorithm, so that there is a basic understanding in place, and then progress to
studying other aspects at a later stage. For the interested reader some excellent
references for more detailed aspects of Finite Element methods can be found in the
books by [78], [76], [77].
Chapter 16
Spectral Methods
Having now progressed from the simple Finite Difference methods for solving a PDE on a
regular structured grid, to the more complex Finite Volume and Finite Element
methods on an unstructured grid, we are now going to return to using a regular
structured grid for our spatial discretization. One feature of these methods that
sets them apart from others that we have studied thus far is that Spectral methods
can be categorized as global methods, whereas the first three were local methods.
To elaborate on this idea, with the Finite Difference, Finite Volume, and Finite
Element methods, the solution at a particular location in the grid only depended
upon the solution at a few of its immediate neighbors. With the Finite Difference
method for example, the solution at a grid point i, j depends upon the solution at
i, j ± m grid points, and this gave us the coupled system of ODEs. Similarly with
the Finite Volume method, the solution in a cell depends only upon the solution
in neighboring cells. Spectral methods by contrast are formulated in such a way
that the solution at grid point i, j depends upon the solution throughout the entire
computational domain. Another feature of Spectral methods which sets them apart
is that our discretization procedure involves changing the basis of the representation
of the data. What we mean here is that for the Finite Difference, Finite Volume,
and Finite Element methods, our system of ODEs was still in terms of φ, which is
a function of space and time. With Spectral methods however, as we shall see, our
system of ODEs is going to involve functions of complex frequency space and time.
Similar to Finite Element methods, a key step in the formulation of the method is
via the method of weighted residuals, and it is possible to utilize either the Galerkin,
Collocation, or Tau, methods. Furthermore, just as one has the option to use many
different shape functions in a Finite Element method, Spectral methods can be
formulated using either a Fourier Series [20], Chebyshev polynomials [8], or Legendre
polynomials [27]. The use of the former usually implies that the PDE has periodic
boundary conditions over the entire boundary of the domain, while the latter two
can be applied to more general Dirichlet and Neumann boundary conditions. The
point being made here is that as with the Finite Element formulation developed in
Chapter 15, we are presenting one way in which a Spectral method can be developed,
and as such will restrict ourselves to periodic problems, using a Fourier series and
the Galerkin method of weighted residuals.
Perhaps the simplest place to begin the derivation of the method is to recall the
Fourier series for an arbitrary function, defined in the domain x ∈ [0, Lx ]:
\[
\phi(x) = \frac{a_0}{2} + \sum_{p=1}^{\infty}\left( a_p \cos\frac{2\pi p x}{L_x} + b_p \sin\frac{2\pi p x}{L_x} \right)
\]
\[
a_0 = \frac{2}{L_x}\int_0^{L_x} \phi(x)\, dx \qquad (16.1)
\]
\[
a_p = \frac{2}{L_x}\int_0^{L_x} \phi(x) \cos\frac{2\pi p x}{L_x}\, dx \qquad (16.2)
\]
\[
b_p = \frac{2}{L_x}\int_0^{L_x} \phi(x) \sin\frac{2\pi p x}{L_x}\, dx \qquad (16.3)
\]
So we are saying here that the function φ(x) can be represented by an infinite number
of sin and cos functions of different frequencies and amplitudes. For our purposes
it is going to be easier to work with the complex form of the Fourier series, and to
get to this form we make use of Euler’s formula:
\[
\cos(\theta) = \frac{e^{i\theta} + e^{-i\theta}}{2}, \qquad \sin(\theta) = \frac{e^{i\theta} - e^{-i\theta}}{2i}
\]
so that we get:
\[
\phi(x) = \frac{a_0}{2} + \frac{1}{2}\sum_{p=1}^{\infty}(a_p - i b_p)\, e^{\frac{2\pi i p x}{L_x}} + \frac{1}{2}\sum_{p=1}^{\infty}(a_p + i b_p)\, e^{-\frac{2\pi i p x}{L_x}}
\]
In order to only have one exponential term in the expansion we change the dummy
index in the first summation by setting p = −p, thus:
\[
\phi(x) = \frac{a_0}{2} + \frac{1}{2}\sum_{p=-1}^{-\infty}(a_{-p} - i b_{-p})\, e^{-\frac{2\pi i p x}{L_x}} + \frac{1}{2}\sum_{p=1}^{\infty}(a_p + i b_p)\, e^{-\frac{2\pi i p x}{L_x}}
\]
From the definition of ap and bp in Equations 16.2 and 16.3 respectively we have the
property that a−p = ap and b−p = −bp . So if we define:
\[
\Phi_0 = \frac{a_0}{2}, \qquad \Phi_p = \frac{a_p + i b_p}{2}
\]
we get the complex form of the Fourier series:
\[
\phi(x) = \sum_{p=-\infty}^{+\infty} \Phi_p\, e^{-\frac{2\pi i p x}{L_x}}
\]
where:
\[
\Phi_p = \frac{1}{L_x}\int_0^{L_x} \phi(x)\left( \cos\frac{2\pi p x}{L_x} + i \sin\frac{2\pi p x}{L_x} \right) dx
\]
or, again making use of Euler’s formula:
\[
\Phi_p = \frac{1}{L_x}\int_0^{L_x} \phi(x)\, e^{\frac{2\pi i p x}{L_x}}\, dx
\]
So the Φp terms are complex numbers that represent the amplitude and phase
of the different sinusoidal components of the input φ.
Figure 16.1: (a) An arbitrary periodic function φ defined on a 1D spatial domain and (b)
the complex coefficients of the discrete Fourier transform of φ.
Now because we are interested in numerical methods in this course, our spatial
domain is going to be comprised of a finite number of discrete points, and in fact our
domain will be a regularly spaced structured grid as we had for the Finite Difference
methods. If we assume that our grid in 1D is composed of Nx points, then we change
our approximation to be:
\[
\phi(x, t) \approx \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \Phi_p(t)\, e^{\frac{2\pi i p x}{L_x}} \qquad (16.4)
\]
which is the discrete Fourier series representation of φ, the coefficients Φp(t) being given by the forward discrete Fourier transform. So the basic idea behind our
Spectral method is that the field variable φ(x, t) can be represented by a discrete
Fourier series (Figures 16.1(a) - 16.1(b)). So the important approximation that
we have made is that instead of using an infinite number of sin and cos terms to
reconstruct φ we are using Nx . An important issue that we need to raise at this
point is that of the Nyquist criterion [36], which states that we can only resolve
frequencies up to half the number of sample points used. So in actual fact, although
our discrete Fourier transform gives us Nx terms, we are only really getting Nx /2
useful frequencies from our discrete Fourier transform. To understand this, Figure
16.2(a) illustrates the real part of Φ for the function depicted in Figure 16.1(a).
We can see here that the Φp coefficients are mirrored about the y axis, so while
our discrete Fourier transform gave us Nx coefficients, half of them are redundant.
So in actual fact we could quite correctly reconstruct our waveform using just the
coefficients ranging from 0 < p ≤ Nx /2, if we divide by Nx /2 instead of Nx in
Equation 16.4. The only little trick is that we don’t do this for the case where
p = 0. It is worth pointing out that this normalization by 1/Nx is merely convention
and could be placed in either the forward or the inverse discrete Fourier transform,
or both could have a normalization of $1/\sqrt{N_x}$. The only requirement is that the
product of the two be 1/Nx .
A final point concerning the discrete Fourier transform worth mentioning at this
time is that it is also quite common for the discrete Fourier transform to be denoted:
\[
\Phi_p(t) = \sum_{n=0}^{N_x - 1} \phi_n(t)\, e^{-\frac{2\pi i p n}{N_x}}
\]
where the limits on the summation are different, but the same number of
points are involved. As can be observed in Figure 16.2(b) however, we are still
getting the same Fourier coefficients, just in a different order.
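In Matlab, for example, the built-in fft returns the coefficients in this second ordering, and the fftshift function (used later in Example 16.2) reorders them into the centered ordering of Figure 16.2(a). A minimal sketch, assuming phi is a hypothetical vector of sampled values:
PHI = fftshift(fft(phi)); % forward transform, reordered into the centered ordering
phi2 = ifft(ifftshift(PHI)); % undo the reordering before inverting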
Having made this substitution for our field variable φ we can now compute the
derivatives in the same way that we did with the Finite Element method:
Figure 16.2: The real part of the discrete Fourier transform when the limits are
taken from (a) [−Nx /2 + 1, Nx /2] and (b) [0, Nx − 1].
\[
\frac{\partial \phi}{\partial t} = \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \frac{\partial \Phi_p}{\partial t}\, e^{\frac{2\pi i p}{L_x}x}
\]
\[
\frac{\partial \phi}{\partial x} = \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \frac{2\pi i p}{L_x}\, \Phi_p\, e^{\frac{2\pi i p}{L_x}x}
\]
\[
\frac{\partial^2 \phi}{\partial x^2} = \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \left( \frac{2\pi i p}{L_x} \right)^2 \Phi_p\, e^{\frac{2\pi i p}{L_x}x}
\]
with similar expressions in y and z. For the source term we assume that like φ, it
can be represented as a Fourier series:
\[
\psi = \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \Psi_p(t)\, e^{\frac{2\pi i p}{L_x}x}
\]
where as with Φp , Ψp represent the complex coefficients. Let’s now consider the
generic scalar transport equation defined in only one spatial dimension and substi-
tute these expressions for the derivatives. Doing so we get:
\[
\frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \frac{d\Phi_p(t)}{dt}\, e^{\frac{2\pi i p}{L_x}x}
+ \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} v\, \frac{2\pi i p}{L_x}\, \Phi_p(t)\, e^{\frac{2\pi i p}{L_x}x}
= \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \mu \left( \frac{2\pi i p}{L_x} \right)^2 \Phi_p(t)\, e^{\frac{2\pi i p}{L_x}x}
+ \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \Psi_p(t)\, e^{\frac{2\pi i p}{L_x}x}
\]
The next step in the formulation of our Spectral method is to apply the Galerkin
method of weighted residuals, but before we do so it will be worth taking a brief
detour to extend our notion of basis functions. With the Finite Element method we
defined our solution as:
\[
\phi(x, t) = \sum_{p=1}^{N_p} \eta_p(x)\, \phi_p(t)
\]
where the ηp terms were the basis functions and the φp terms were the coefficients
(which happened to be the values of φ at the nodal points in the grid). It can be
observed that this is quite similar to our solution in Equation 16.4, where the $e^{\frac{2\pi i p}{L_x}x}$
terms are the basis functions and the $\Phi_p(t)$ are the coefficients. Now if we think
about what a basis actually means, it’s a set of independent or orthogonal vectors
with which we can describe something. The most intuitive example is a Cartesian
basis where we have the unit vectors î, ĵ, k̂, and any vector quantity can be described
in terms of coefficients in that basis (e.g. v = vx î + vy ĵ + vz k̂). If we switched to
a different basis, such as spherical coordinates for example, v itself would not change,
only the coefficients in the expansion in that basis. One of the properties of an
and between the same component is one. Returning once again to the Cartesian
basis we have î · ĵ=0, î · k̂ = 0, but î · î = 1. If we let e1 = {1, 0, 0}, e2 = {0, 1, 0} ,
e3 = {0, 0, 1} then we could write this inner product relationship more formally as:
\[
\langle \mathbf{e}_p, \mathbf{e}_q \rangle = \sum_{n=1}^{3} e_{p,n}\, e_{q,n} = \delta_{pq}
\]
n=1
where the notation <, > is denoting an inner product, the indices p and q denote
different components of the basis, and δp,q is known as the Kronecker delta function,
which is defined as 1 if p = q and 0 otherwise. This is known as the orthogonality
condition. Now we can fairly easily extend this concept to basis vectors which have
more than three components. In fact this relation just defined works for vectors
with any number of components. In the limit where we have an infinite dimensional
vector, then we actually have a function. To elaborate on this point, think about
some function which has been evaluated at N points. This is analogous to an N
dimensional vector. So in the limit a function is an infinite dimensional vector. In
this case we write our inner product relationship as:
\[
\langle \eta_p, \eta_q \rangle = \int_{\Omega} \eta_p\, \eta_q\, d\Omega = \Omega\, \delta_{pq}
\]
where the summation of the components of the two basis vectors is replaced by an
integral of the basis functions. If we allow for the scenario where we have complex
functions (i.e. the function gives us complex numbers) then we write our inner
product relationship as:
\[
\langle \eta_p, \eta_q \rangle = \int_{\Omega} \eta_p\, \eta_q^{*}\, d\Omega = \Omega\, \delta_{pq}
\]
where the $^{*}$ denotes the complex conjugate. Now although it is a bit more difficult
to visualize, we can think of and treat each of the $e^{\frac{2\pi i p}{L_x}x}$ basis functions as we do
the Cartesian basis functions. In this case the complex conjugate is written $e^{-\frac{2\pi i p}{L_x}x}$
(since the complex conjugate is found by simply replacing i with −i). We can
now return to applying our method of weighted residuals, noting that we will be
using the complex conjugate of the basis functions as our weighting functions. Hence
substituting for W and our residual r and integrating over the domain we get:
\[
\int_{\Omega} e^{-\frac{2\pi i q}{L_x}x} \left[ \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \left( \frac{d\Phi_p}{dt} + v\, \frac{2\pi i p}{L_x}\, \Phi_p - \mu \left( \frac{2\pi i p}{L_x} \right)^2 \Phi_p - \Psi_p \right) e^{\frac{2\pi i p}{L_x}x} \right] d\Omega = 0
\]
Noting that the bracketed coefficients are constant within the domain of integration (they do not depend on x), we can take them outside the integral to get:
\[
\frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \left( \frac{d\Phi_p}{dt} + v\, \frac{2\pi i p}{L_x}\, \Phi_p - \mu \left( \frac{2\pi i p}{L_x} \right)^2 \Phi_p - \Psi_p \right) \int_{\Omega} e^{-\frac{2\pi i q}{L_x}x}\, e^{\frac{2\pi i p}{L_x}x}\, d\Omega = 0
\]
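The remaining integral is just the orthogonality condition for the Fourier basis functions discussed above (a one-line statement of the step being used here):
\[
\int_{\Omega} e^{-\frac{2\pi i q}{L_x}x}\, e^{\frac{2\pi i p}{L_x}x}\, d\Omega = \Omega\, \delta_{pq},
\]
so only the term with p = q survives.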
Now ultimately what this means is that each p component is independent of the
others, so we can drop the summation terms and factor out the $1/N_x$ and $\Omega$ terms to
arrive at:
\[
\frac{d\Phi}{dt} = \mu \left( \frac{2\pi i p}{L_x} \right)^2 \Phi - v\, \frac{2\pi i p}{L_x}\, \Phi + \Psi
\]
which can be put in the form:
\[
M\dot{\Phi} = K\Phi + s
\]
where Φ is a vector of the coefficients in the discrete Fourier transform of φ and
as with the Finite Difference method, M is the identity matrix. One important
point to note here however is that because our system of equations is uncoupled, K
will only have entries on the main diagonal. It is an interesting feature of this global
method that we don't end up with a coupled system of ODEs as we have done with
the previous methods. The coupling however comes about from the computation
of the discrete Fourier transform where each coefficient in the transform involves a
summation over all of the components in the grid.
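Reading the scalar equation above entry by entry, M is the identity, the p-th diagonal entry of K is simply
\[
K_{pp} = \mu \left( \frac{2\pi i p}{L_x} \right)^2 - v\, \frac{2\pi i p}{L_x},
\]
and the p-th entry of s is Ψp.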
If we want to extend our Spectral method to higher spatial dimensions the only
difference is that we have to use either the 2D inverse Fourier transform:
\[
\phi(x, y, t) = \frac{1}{N_x N_y} \sum_{p=-N_x/2+1}^{N_x/2} \sum_{q=-N_y/2+1}^{N_y/2} \Phi(p, q, t)\, e^{\frac{2\pi i p}{L_x}x + \frac{2\pi i q}{L_y}y}
\]
or, in three dimensions, the 3D inverse Fourier transform:
\[
\phi(x, y, z, t) = \frac{1}{N_x N_y N_z} \sum_{p=-N_x/2+1}^{N_x/2} \sum_{q=-N_y/2+1}^{N_y/2} \sum_{r=-N_z/2+1}^{N_z/2} \Phi(p, q, r, t)\, e^{\frac{2\pi i p}{L_x}x + \frac{2\pi i q}{L_y}y + \frac{2\pi i r}{L_z}z}
\]
We can then proceed to solve our system of ODEs using any of the methods
covered in Part II, the only difference is that we must compute the appropriate
discrete Fourier transform of the initial condition defined in the spatial domain in
order to begin the time marching. Finally in terms of the imposition of boundary
conditions we find that the necessary assumption of periodic boundaries means that
we do not have to do anything to impose the boundary conditions on our system as
they are automatically incorporated.
Example 16.1:
In this example we will develop both a Matlab and a C++ program to solve the
1D first order wave equation:
\[
\frac{\partial \phi}{\partial t} + v\, \frac{\partial \phi}{\partial x} = 0 \qquad (16.5)
\]
∂t ∂x
in the domain x ∈ [0, 10], t ∈ [0, 10], with boundary condition φ(0, t) = 0, initial
condition $\phi(x, 0) = e^{-5(x-3)^2}$, v = 1.0, and compare the numerical solution with the
exact solution $\phi(x, t) = e^{-5(x-vt-3)^2}$. For the spatial discretization we will use the
Spectral method and for the temporal discretization we will use the fourth order
Runge-Kutta method. The intended learning outcomes for this example will be to
‘get a feel’ for applying a Spectral method and to see how we compute the discrete
Fourier transform.
Assuming now that our spatial domain has been broken up into Nx grid points
with spatial step size ∆x, then we can immediately define an ODE at each grid
point as:
\[
\frac{d\Phi}{dt} = -\frac{2\pi i p}{L_x}\, v\, \Phi
\]
which fits the form:
Φ̇ = KΦ
and is similar to Example 13.1. Here however, K will only have terms on the main
diagonal, so there’s no real point in storing it explicitly. This again emphasizes the
point that a matrix is often something that is conceptual, not something that is
necessarily defined and stored in the program. The most efficient way to store a
diagonal matrix would be as a 1D array, but in fact we don’t even need to do that
in this example. The next step is the application of the Runge-Kutta method. As
we did with Examples 10.1 and 13.1 we will define a function f which we can call
repeatedly as we make our k vectors at each time step and this will take the form:
function k = f(PHI)
p = (floor(-N_x/2+1):floor( N_x/2))';
k = (-v*(2*pi*i.*p./L_x)).*PHI;
end
where an important point to note is that in this code snippet, i is the imaginary
unit, not an index as it was in the Finite Difference method examples. At this point
the remainder of the algorithm is just the basic fourth order Runge-Kutta code from
Example 10.1:
for l=1:N_t-1
k1 = f(PHI(:,l));
k2 = f(PHI(:,l) + Delta_t/2*k1);
k3 = f(PHI(:,l) + Delta_t/2*k2);
k4 = f(PHI(:,l) + Delta_t *k3);
PHI(:,l+1) = PHI(:,l) + Delta_t *(k1/6 + k2/3 + k3/3 + k4/6);
end
The only remaining piece of the program is the calculation of the discrete Fourier
transform of the initial condition to get the time marching loop started. In this
example we will write our own function to perform both the forward and inverse
transforms. These will take the form:
function PHI = computeDFT(phi)
PHI = zeros(N_x,1);
p = transpose(floor(-N_x/2+1):floor( N_x/2));
n = transpose(floor(-N_x/2+1):floor( N_x/2));
for m=1:N_x
PHI(m) = sum(phi .* exp(-2*pi*i.*n*p(m)/N_x));
end
end
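for the forward discrete Fourier transform and, as a sketch of the corresponding inverse (assuming it simply reverses the sign of the exponent and applies the 1/N_x normalization discussed earlier, and matching the computeInverseDFT call used in the time marching loop below):
function phi = computeInverseDFT(PHI)
phi = zeros(N_x,1);
p = transpose(floor(-N_x/2+1):floor( N_x/2));
n = transpose(floor(-N_x/2+1):floor( N_x/2));
for m=1:N_x
% reconstruct the m-th grid value from all of the Fourier coefficients
phi(m) = sum(PHI .* exp( 2*pi*i.*p*n(m)/N_x))/N_x;
end
end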
for the inverse Fourier transform. So we begin the simulation by computing the
discrete Fourier transform of the initial condition, then perform our time integration
on the Fourier coefficients PHI. When we are interested in the real solution, we
compute the inverse Fourier transform so that we have φ at any point in time. In
our Matlab program this will look something like:
PHI(:,1) = computeDFT(phi(:,1));
for l=1:N_t-1
k1 = f(PHI(:,l));
k2 = f(PHI(:,l) + Delta_t/2*k1);
k3 = f(PHI(:,l) + Delta_t/2*k2);
k4 = f(PHI(:,l) + Delta_t *k3);
PHI(:,l+1) = PHI(:,l) + Delta_t *(k1/6 + k2/3 + k3/3 + k4/6);
phi(:,l+1) = computeInverseDFT(PHI(:,l+1));
end
One important point to note is that the solution we get from the inverse Fourier
transform will still involve complex numbers, however, all of the imaginary compo-
nents are zero, so we can simply take the real part of the inverse Fourier transform
without losing any information. It is easily observed that the ‘bell shaped’ initial
condition is simply shifted along through the computational domain, which is what
we could expect from this PDE.
The complete program is given in Example2_6.m. Figures 16.3(a) - 16.3(d) illus-
trate the solution at a number of time steps for the case where ∆x = 0.05 and
∆t = 0.02. A final point to note is that the use of Matlab for this problem makes
life very easy when dealing with complex numbers since we don’t have to modify our
code at all. This is not the case if we were to use C++ however, where the core language
has no built-in complex arithmetic (although the standard library does provide a complex type). This is not to say that it
can't be done, but one would need to develop (or find existing) functionality
for handling this, and perhaps defining a class to define a complex number, storing
its real and imaginary components and overloading some of the complex arithmetic
operations.
Figure 16.3: The solutions to the PDE in Example 16.1 for the combination ∆x =
0.05 and ∆t = 0.02 illustrating the solution at (a) l = 1, (b) l = 200, (c) l = 340
and (d) l = 500.
Example 16.2:
In this example we will develop a Matlab program to solve the 2D generic scalar
transport equation:
\[
\dot{\phi} + \mathbf{v} \cdot \nabla\phi = \mu \nabla^2 \phi + \psi \qquad (16.6)
\]
in the domain x ∈ [0, 1], y ∈ [0, 1], t ∈ [0, 2], with boundary conditions φ(0, y) = 0,
φ(x, 0) = 0, ∂x φ(1, y) = 0, ∂y φ(x, 1) = 0, initial condition $\phi(x, y, 0) = e^{-50(x-0.3)^2}$,
and v = {0.5, 0.5}, µ = 0.01, and ψ = 0.2. For the spatial discretization we will use
the Spectral method and for the temporal discretization we will use the implicit Euler
method. The intended learning outcome for this example will be to observe the
application of a Spectral method to solve a multidimensional PDE and to see how
we can use the Matlab routines to compute the forward and inverse discrete Fourier
transforms.
Assuming now that our spatial domain has been broken up into Nx data points
in x and Ny data points in y, then we can immediately define an ODE at each
interior grid point as:
\[
\frac{d\Phi}{dt} = \mu \left( \frac{2\pi i p}{L_x} \right)^2 \Phi + \mu \left( \frac{2\pi i q}{L_y} \right)^2 \Phi - u\, \frac{2\pi i p}{L_x}\, \Phi - v\, \frac{2\pi i q}{L_y}\, \Phi + \Psi
\]
which again fits the form:
\[
M\dot{\Phi} = K\Phi + s
\]
similar to Example 16.1. Here however, K will only have terms on the main
diagonal, so there’s definitely no point in storing the whole matrix. Furthermore,
because our discrete scalar field will be laid out in a 2D array as:
phi = zeros(N_x,N_y,N_t);
PHI = zeros(N_x,N_y,N_t);
we could in fact store the main diagonal elements in this way. So we will be storing
K as a 2D array in this example, but the 2D part has to do with the 2D nature of
the grid, not implying equations and unknowns. Each element in the 2D array K
represents an element on the main diagonal of the matrix K. That being said, we
will define our assemble function as:
function assemble()
[p q] = meshgrid(floor(-N_x/2+1):floor( N_x/2), floor(-N_y/2+1):floor( N_y/2));
K = (-v(1)*(2*pi*i.*p/L_x) - v(2)*(2*pi*i.*q/L_y) ...
+ mu*(2*pi*i.*p/L_x).^2 + mu*(2*pi*i.*q/L_y).^2);
s = PSI;
end
The interesting point here is that we’ve created a 2D array of p and q indices with
the meshgrid function, and then we are doing ‘element by element’ operations (as
can be observed by the .* and .^ operators). So what we end up with is an array
of K elements for each point in the grid. It is important to note that in this case
we shouldn’t really think of K as being a stiffness matrix. If it were then it would
be of size (Nx × Ny ) × (Nx × Ny ) with entries only on the main diagonal. Here we
have an array of size Nx × Ny with entries in every location in the array. This is
obviously more efficient in terms of storage. While we could use a sparse matrix in
Matlab to help alleviate this storage, there’s really no point since the equations are
completely decoupled.
Now applying the implicit Euler method to our system we get:
\[
M\, \frac{\Phi^{l+1} - \Phi^{l}}{\Delta t} = K \Phi^{l+1} + s
\]
which, noting that M is here the identity and K is diagonal, can be rearranged into the pointwise update
\[
\Phi^{l+1} = \frac{\Phi^{l} + \Delta t\, s}{1 - \Delta t\, K}
\]
where it can be observed that here we are not solving a system of equations (as
we were in other examples solving the generic scalar transport equation), rather we
are updating Φ point by point. It is just a nice feature of the Matlab language
that allows us to write the operation in one line of code (as opposed to placing it
inside two nested for loops as we would have to in a C++ implementation). The
only remaining piece of the program is the calculation of the 2D discrete Fourier
transform of the initial condition to get the time marching loop started. In this
example we will use the Matlab function fft2 to do this (we use the function fft in
1D). It should be noted that the main reason for using the Matlab function rather
than writing our own is that being a built-in Matlab function it can do it much
faster than any function we could write. An important point however is that the
Matlab function returns indices in the range [0, N − 1] rather than [−N/2 + 1, N/2],
so in order for the location of the Fourier coefficients to be consistent with what we
need for our method to work, we use the fftshift function to reorder the output of
fft. So the computation of the forward discrete Fourier transform for our problem
is:
PHI(:,:,1) = fftshift(fft2(phi(:,:,1)));
assemble();
for l=1:N_t-1
PHI(:,:,l+1) = (PHI(:,:,l) + Delta_t*s)./(1-Delta_t*K);
phi(:,:,l+1) = ifft2(ifftshift(PHI(:,:,l+1)));
end
Having now seen two examples using a Spectral method to solve a PDE it is
worth ending our investigation with some remarks on the accuracy and applicability
of the method. A primary benefit of Spectral methods over alternative approaches,
such as Finite Difference, Finite Volume, and Finite Element methods, is accuracy: Spectral
discretizations of PDEs based on Fourier series (as well as Chebyshev polynomials, etc.)
provide very low error approximations and in many cases these approaches can
be exponentially convergent, meaning that for a length N expansion, the difference
between the analytical and the numerical solution can be $O(1/N^N)$. Second, since
the numerical accuracy of Spectral methods is so high the number of grid points re-
quired to achieve the desired precision can be very low, thus a Spectral method may
require less computer memory than say a Finite Difference method. One important
point to note however is that the PDE must exhibit smooth variation throughout
the computational domain, otherwise this convergence is lost. Spectral methods are
hence not applicable to ‘say’ transonic fluid flow developing shock waves, or any
other PDE that would have discontinuities. For the interested reader some excellent
references for more detailed aspects of Spectral methods can be found in the books
by Hesthaven [68], Peyret [72], and Kopriva [69].
Figure 16.4: The solutions to the PDE in Example 16.2 illustrating the solution at
(a) t = 0, (b) t = 0.5, (c) t = 1.0 and (d) t = 1.5.
Chapter 17
Spectral Element Methods
Chapter 18
Meshfree Methods
Part IV
Parallel Computing
Chapter 19
Introduction
Having now covered a number of different numerical methods for solving systems
of algebraic, ordinary, and partial differential equations, and creating programs to
implement these methods, we are in a position to look at how we can develop parallel
versions of these programs to solve much bigger computational problems and solve
them faster. As such, we are going to study some different application programming
interfaces (APIs) for designing parallel programs. Up until this point, all of the
programs that we have developed have been written for serial computation. We can
take this to mean a series of instructions executed one after another on some form of
processor [43]. By contrast, parallel computation is the simultaneous use of multiple
compute resources to solve a problem. Now, the term 'processor' is deliberately a bit vague;
we use it because there are many different hardware designs, and correspondingly different
categorizations that could be applied to the term 'parallel computation'.
The techniques that we will study, namely OpenMP and MPI are designed such
that they can be used for designing programs to run on computing platforms ranging
from your laptop or workstation, up to the largest supercomputing facilities in the
world. Our interest will lie in the application of these techniques to solving PDEs,
but it should be remembered that they are far more general than this one application
and can be applied to parallelizing many different types of computational problems.
Furthermore, while we will be incorporating these techniques into programs writ-
ten in the C++ programming language, it is worth pointing out that both techniques
are commonly incorporated into programs written in other programming languages,
such as C and Fortran, for example.
Before we proceed to cover the details of OpenMP and MPI however, it will be
worth reviewing some basic concepts to provide a context for how they fit into the
‘wider world’ of High Performance Computing (HPC) [23]. As with the development
of our numerical methods, this will only be a brief introduction to some of the most
relevant concepts. Some excellent references can be found in [67] and [24] however.
One ‘classic’ method that can be used for the classification of computer architec-
tures is known as Flynn’s taxonomy [19]. Here, a computer architecture is classified
along the two independent dimensions of instruction and data and each dimension
Figure 19.1: Flynn's taxonomy: SISD (Single Instruction, Single Data), SIMD (Single Instruction, Multiple Data), MISD (Multiple Instruction, Single Data), and MIMD (Multiple Instruction, Multiple Data).
can have one of two possible states: single or multiple (Figure 19.1). A single in-
struction, single data architecture is one where only one instruction stream is being
acted on by the processing unit at any one clock cycle and only one data stream
is being used as input. Many early computers such as old generation mainframes
followed this architecture, but even today, modern laptops and workstations can fall
into this category. A single instruction, multiple data architecture is one where a
number of processing units execute the same instruction at any one clock cycle, but
each processing unit can operate on a different data element. Early ‘vector’ processor
computers used this approach, but more recently graphics processing units found in
most modern laptops or workstations can fall into this category. A multiple instruc-
tion, single data architecture is one where a single data stream is fed into multiple
processing units, each of which can execute independent instruction streams. There
are very few actual examples of this style, but fault-tolerant computers executing
the same instructions redundantly in order to detect and mask errors would be an
example. Finally, a multiple instruction, multiple data architecture is one where
a number of processing units can be executing different instruction streams, which
may be operating on different data streams. This is currently the most common type
of computer and includes modern laptops and workstations, ‘clusters’ of networked
computers, and most current supercomputers.
If we now start to consider the specific types of hardware to which the term
‘processor’ can apply, we find that there are many different designs out there and
it makes sense to think of a ‘spectrum’ defined by how specialised the circuitry in
a processor is for a particular task, which will have implications in terms of its
performance, price, and ‘ease’ of programming.
At the least specialised end of this spectrum we have the central processing unit
(CPU) [7], which is found in modern laptops, workstations, and supercomputers.
A CPU is the hardware that carries out the instructions of a computer program
and performs the basic arithmetical, logical, and input/output operations of the
system. Because the average computer these days has to be able to perform all
sorts of tasks, CPUs must be quite general purpose and hence a tradeoff is that
performance is reduced in favour of flexibility. These days, most processors of this
type contain multiple cores[33] which can be thought of as multiple CPUs within
the one ‘component’ or ‘chip’. These cores can operate independently, executing
different instructions on different data streams.
Moving up the spectrum, an example of a slightly more specialised design is the
graphics processing unit (GPU) [22], which is also found in modern laptops, work-
stations, and supercomputers. A GPU is the hardware that acts as a ‘co-processor’
performing the very intensive calculations required by modern day computer graph-
ics. GPUs are hence more specialised compared to CPUs in that more of the transis-
tors on the die comprising the chip, are dedicated to ‘number crunching’, compared
to a CPU which needs to perform more ‘administrative’ tasks. A fairly recent trend
has been the advent of APIs such as CUDA [9] and OpenCL [38], which allow users
to perform more general purpose calculations and hence the term general purpose
computing on graphics processing units (GPGPU) has emerged. As with CPUs,
GPUs are in abundance and will be used together in most platforms.
Moving up the spectrum again, an example of a more specialised design is the digital
signal processor (DSP) [11]. The architecture of a DSP is optimized specifically for
digital signal processing, such as signals from audio or video sensors. Most also
support some of the same features as an applications processor or microcontroller, since
signal processing is rarely the only task of a system.
Moving up the spectrum a little further, an example of an even more specialised
design is the field-programmable gate array (FPGA) [18]. Similar to GPUs, an
FPGA can act as a co-processor, but the architecture is such that the circuitry can
be rewritten for a particular task. This is in contrast to CPUs and GPUs, where
the circuitry is fixed and it is only the instructions comprising the program that
can change. In this case, the program that one would want to run on an FPGA
is actually defined in the hardware, but is reconfigurable at run time. Similar to
GPUs there are APIs that are available for programming them such as VHDL [60] or
Mitrion-C [30] for example.
At the most specialised end of the spectrum we have application specific inte-
grated circuits (ASICs) [2]. Because these processors are by definition designed for a
specific computational problem, they will generally give much greater performance
than ‘say’ a CPU would for the same task. Because of the extreme costs associated
with the masks required for the X-ray lithography to create an ASIC, they are not
feasible for most parallel computing applications. One good example of an exception
to this however is the RIKEN MDGRAPE-3 supercomputer built for the purpose of
performing molecular dynamics simulations, which uses specialised MDGRAPE-3
chips [46].
A Von Neumann architecture [61] is a design of computer in terms of a processing
unit (consisting of an arithmetic logic unit and processor registers, a control unit
containing an instruction register and program counter), memory to store both a
program's instructions and its data, plus input and output mechanisms including external
mass storage, and networking (Figure 19.2). The meaning of the term has evolved to
mean a stored-program computer in which an instruction fetch and a data operation
cannot occur at the same time because they share a common bus. The vast majority
of modern computers are based on this basic design and so we will now explore some
variations of this basic design and issues to consider when describing an HPC system.
Figure 19.2: A schematic of the Von Neumann architecture: a memory subsystem holding both program and data, and an input/output subsystem providing storage and networking.
An important issue that we must consider when we talk about parallel computing
architectures is where the computers’ memory is physically located in relation to its
CPUs and there are two broad categories, namely shared memory and distributed
memory architectures. Shared memory architectures (Figure 19.3(a)) in general
provide the ability for all of the CPUs to access all of the memory in a global ‘ad-
dress space’. This means that multiple CPUs can operate independently but share
the same memory resources. From a programming point of view this architecture
means that code for ‘say’ accessing and manipulating entries in an array will not
require modification from a serial program because all of the allocated memory is
accessible. A further classification can be made with shared memory architectures,
namely uniform memory access (UMA) and non-uniform memory access (NUMA)
architectures [35], which relate to the memory access times. UMA architectures
are generally identical processors with equal memory access times, and this is
commonly found in symmetric multiprocessor (SMP) machines [53] (Figure 19.4).
Examples range from modern dual core laptops up to larger systems containing on
the order of tens of CPUs.

Figure 19.3: Schematics illustrating the concepts of (a) a shared memory parallel
programming model and (b) a distributed memory parallel programming model.

NUMA architectures are often based on multiple SMP
machines connected together via a bus interconnect [4]. While each CPU can access
all of the memory, the access times of its local memory will be faster than those
across the interconnect. As a final classification cache-coherent non-uniform mem-
ory access (ccNUMA) architectures keep one consistent image of the high speed
memory known as cache [6] (Figure 19.5). The term ‘cache coherent’ refers to the
fact that for all CPUs any variable that is to be used must have a consistent value.
Therefore, it must be assured that the caches that provide these variables are also
consistent. Since the appearance of shared memory systems with multiple proces-
sors the cache coherency phenomenon also manifests itself within processors with
multiple cores; first and second level cache belong to a particular core and therefore
when another core needs data that does not reside in its own cache it has to retrieve it
via the complete memory hierarchy of the processor chip. This is typically orders of
magnitude slower than when it can be fetched from its local cache. A disadvantage
with shared memory architectures is that it becomes increasingly more difficult and
hence expensive to ‘scale up’ the design to include more CPUs with more memory,
so these architectures seem to have limited scalability. An advantage with these ar-
chitectures is that the global address space can simplify the parallel program design.
Distributed memory architectures (Figure 19.3(b)) in general require a com-
munication network to connect individual nodes. Each CPU has its own address
space (Figure 19.6) and so when non-local data is required by a given CPU it is the
responsibility of the programmer to explicitly define how the data is to be communi-
cated. From a programming point of view this architecture means that code for say
accessing and manipulating entries in an array will require modification from a se-
rial program because not all of the allocated memory is accessible. Examples range
from clusters of workstations connected via ethernet [16] up to massively parallel
processor (MPP) systems containing hundreds of thousands of CPUs and special-
ized network interconnects and topologies. A disadvantage of distributed memory
architectures is that the explicit communication of data can complicate the parallel
program design. An advantage with these architectures is that it is easier to scale
up the design, adding more CPUs and memory to the system.
Modern HPC systems can make use of various combinations of shared and dis-
tributed memory architectures (known as hybrid distributed-shared memory archi-
tectures), and utilizing different types of processors. One example design is a dis-
tributed memory multi-processor, where each compute node is a processor containing
multiple cores and shared memory, but which connect to one another over a network
interconnect. These interconnects can be arranged in various topologies, such as the star and tree networks of Figures 19.9(a) and 19.9(b); another common design is the 3D torus network (Figure 19.9(c)), where compute nodes are connected in a lattice structure. The
term ‘torus’ comes from the fact that compute nodes at the sides of the lattice are
connected to the compute nodes at the opposite side, so the whole structure can be
thought to ‘wrap around on itself’. Communication between nearest neighbor nodes
will only require one ‘hop’ along the torus, but communication between nodes that
are farther away will require multiple hops. While a 3D torus is a common design,
this concept can be extended to higher dimensions such as a 5D torus. Essentially
what this means is that every ‘say’ N th node along the lattice will have additional
direct connections to other nodes further away on the lattice.
Figure 19.9: Schematics illustrating the concepts of (a) a star network, (b) a tree
network, and (c) a 3D torus network.
Moving now from hardware to software, another issue worth considering is the
differences between processes and threads, since two of the techniques for paral-
lelization that we will investigate are based around their creation. A process is an
instance of a computer program that is being executed [42]. It contains the pro-
gram code and its current activity. Processes consist of (or ‘own’) a portion of the
computer's memory, which will contain such things as the program's executable code,
process-specific data, a call stack, and a heap to hold intermediate computation data
generated during run time. A thread is the smallest unit of processing that can be
scheduled by an operating system [55]. It generally results from a computer pro-
gram ‘breaking’ into two or more concurrently running tasks. The implementation
of threads and processes differs from one operating system to another, but in most
cases, a thread is contained inside a process. Multiple threads can exist within the
same process and share resources such as memory, while different processes do not
share these resources. In particular, the threads of a process share its executable
code and its context.
Having now introduced some concepts relating to parallel computing it is worth
taking some time to think about how we now 'scale up' our simulations. Quite
simply, we could define the speedup of a program as the ratio of the amount of time
a serial simulation takes to the amount of time the corresponding parallel simulation
takes. For example a simulation which takes 100s when run in serial, but only 20s
when run on some given number of CPUs, would mean that we have a speedup
of 5. Ideally, the more CPUs we use, the smaller the amount of time the calculation
should take and hence the greater the speedup. One very well known law in parallel
computing is Amdahl's law, which can be used to predict the maximum theoretical
speedup attainable by a parallel program [1]. We need to know the fraction of the
code which is actually able to be run in parallel (which as we will see in the examples
to come is often less than 1) and will define it as P . Obviously if P = 0 then we will
get no speedup and if P = 1 then we can theoretically get an infinite speedup. When
the computation is divided among N processors, then we can define the speedup as:
\[
\mathrm{speedup} = \frac{1}{(1 - P) + \frac{P}{N}}
\]
The important result is that as N → ∞ the speedup tends to 1/(1 − P), so if we can only
parallelize say 95% of the program, then we will never be able to get more than a 20 times
speedup, no matter how many processors we use (i.e. how big N is). To emphasize this point,
there are certain parts of computer programs (for example file I/O) that are often
quite difficult to parallelize. More fundamentally however, there are often algorithms
which by their very nature cannot be parallelized (computing a Fibonacci series for
example). Fortunately for us, with the solution of ODEs and PDEs there are at
least ‘some’ parts of the computation that can be parallelized.
Taking a simulation of a fixed size and using successively more CPUs to perform
the computation is what is known as strong scaling. Essentially we are taking a
problem of a fixed size and investigating how the solution time varies with the
number of CPUs. The greater the number of CPUs, the smaller the amount of
computation each one has to perform and hence the faster it will be able to perform
its computation (hopefully). Another metric however which may be quite applicable
is weak scaling. In this case we keep a fixed problem size per processor and
look at how the solution time varies when we use successively more CPUs to perform
the computation. Now, the choice of which metric depends upon the problem. One
of the reasons for using supercomputers is that we want to be able to perform
computations faster and faster and so sometimes we want to scale out our problem
to get it to run faster. One of the other reasons for using supercomputers however
is that we want to be able to run bigger simulations that we weren’t able to do on
the limited memory spaces of smaller machines and so we make our problem bigger
and bigger as we use more CPUs. There is no ‘correct’ metric and the point of
this discussion is just to very briefly introduce some of the important concepts used
when designing and running large scale programs. Usually with a parallel code we
will want both bigger and faster at the same time!
Figure 19.10: A schematic illustrating the model for connection and job submission
on a high performance computing facility.
As was mentioned previously, the APIs that we are going to study can be run
either on your laptop or workstation, or on a supercomputer. While the program
code itself generally does not need to change too much, the method by which one
executes a program generally will. In fact, the way in which one ‘interacts’ and
uses an HPC system is often quite different to the way in which one would use their
own personal computer. Most often a user will connect their workstation to a login
node over the internet via a protocol such as SSH [48] (Figure 19.10). From this login
node a user can interact with the shared file system, creating, editing, deleting files,
etc, and compiling their programs. Rather than executing their programs directly
however, an HPC facility will make use of some form of job scheduling software.
While there are many different packages available, a common feature is that rather
than trying to execute your program in parallel directly, one generally submits a job script that specifies the resources the job requires and the commands to be executed.
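A minimal sketch of such a script for the Torque/PBS scheduler, which we will call myJobScript.pbs, is given below; the project account, resource requests, file names, and email address are placeholders chosen to match the description that follows.
#!/bin/sh
#PBS -A VR0084
#PBS -l procs=16
#PBS -l walltime=01:30:00
#PBS -N myJob
#PBS -o myJob.out
#PBS -e myJob.err
#PBS -M JohnSmith@gmail.com
./myProgram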
Here, the lines beginning with #PBS are defining information for the job scheduling
software. Working through the various lines we are specifying a ‘project account’
(-A), the number of processors to use (-l procs=), the wall time (which is the
amount of time the simulation is expected to take -l walltime=01:30:00). Fur-
thermore, we are giving the job a name (-N), specifying the name of the output file
to put any program output that would go to stdout (-o), specifying the name of
the output file to put any program output that would go to stderr (i.e. when some-
thing goes wrong with the program -e), and specifying an email address for the job
scheduling software to notify us when the job starts, finishes, or crashes! (-M). Finally
the last line tells the job scheduler to run the executable myProgram. The user can
then submit the job to the system with a command such as qsub myJobScript.pbs
(of course the software must be able to find the job script, the executable, and any
other relevant files). The job scheduler will then place the job in a queue (which can
be viewed with a command such as showq or qstat) and execute the commands
in the file when the required system resources become available. Finally a job can
be removed by the command qdel followed by the job ID that would be displayed
in the queue.
Another widely used job scheduler is SLURM [50], for which an example job script (which we will call myJobScript.sbatch) could be written:
#!/bin/sh
#SBATCH --account=VR0084
#SBATCH --nodes=1
#SBATCH --time=01:30:00
#SBATCH --job-name=myJob
#SBATCH --output=myJob.out
#SBATCH --error=myJob.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=JohnSmith@gmail.com
srun -n 16 --nodes 1 --ntasks-per-node 16 myProgram
Here, the lines beginning with #SBATCH are defining information for the job scheduling software, and the meaning of these statements is exactly the same as those explained for the previous job script. The user can then submit the job to the system with a command such as sbatch myJobScript.sbatch and the job scheduler will then place the job in a queue (which can be viewed with a command such as squeue)
and execute the commands in the file when the required system resources become
available. Finally a job can be removed by the command scancel followed by the
job ID that would be displayed in the queue.
As a final point, it is worth mentioning that most of the concepts introduced here
will have more meaning, once we have actually studied the APIs and applied them
to specific PDEs. As such, it is recommended that you re-read this introduction
at the end of this part of the book. It is important to realize that there are many
different ways in which we can classify computing architectures and programming
models, and there are always exceptions to these categorizations. Furthermore,
while some categories encompass architectures and programming models which are
still in use today, others are obsolete (although they may find their way back into
future designs). The key thing is that you realize this discussion has just presented a rough guide, not strict rules.
Chapter 20
OpenMP
20.1 Concepts
[Figure 20.1: a schematic of the fork/join model, in which a program (MyProgram) alternates between serial regions executed by Thread 0 and parallel regions executed by a team of threads (Threads 0 to 3), with synchronization before each join; time runs vertically.]
The first API that we will investigate is the Open Multi-Processing API, which
is an implementation of multithreading; a shared memory method of parallelization
whereby a master thread forks a specified number of slave threads and a task is
divided among them. The threads then run concurrently, with the operating sys-
tem allocating threads to different cores. We create a parallel program with OpenMP
by finding regions of our code that can be carried out in parallel (for example for
loops) and adding preprocessor directives into the code [5]. Then, when the compiler
turns our source code into machine executable code, the preprocessor will first use
these directives to modify the code such that the regions of the code will be executed
with multiple threads. Each thread has a unique integer ID associated with it and
the master thread has an ID of 0. After the execution of the parallel section of the
code, the threads join back into the master thread, which continues onward to the
end of the program (Figure 20.1). The preprocessor directives begin with the state-
ment #pragma omp and then include combinations of directives and clauses. Before
we examine some of the more relevant directives and clauses it will be useful to
demonstrate by way of an example. Consider a very simple computation carried out in a for loop, along the lines of the following sketch (which is consistent with the discussion below):
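#pragma omp parallel for shared(a, b, c) private(i)
for(i=0; i<N; i++)    // N, i and the arrays a, b and c are assumed to have been declared earlier
{
    a[i] = b[i] + c[i];
}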
With the parallel directive we are instructing the compiler to create a team of
threads. Then with the for directive we are instructing the compiler that the
computations performed within the loop are to be shared among the team of threads.
The private clause tells the compiler that each thread should have its own unique
instance of the counter i (which makes sense if the threads are to compute different
parts of the loop), but the shared clause tells the compiler that the threads are
using the one instance of each of the arrays. An important point to bear in mind
with these types of constructs is that each thread must be able to compute its part
of the for loop without changing the results. This is fine for the above code snippet
because b[i] and c[i] are independent, but if we were computing something along
the lines of a[i]=a[i-1]+a[i-2] for example, then it’s possible that a thread might
need entries in a that haven’t been computed at the moment that it needs them, in
which case it will use whatever happens to be occupying that location in memory.
An OpenMP directive applies to the succeeding structured block of code. In this
case it is the for loop that defines the structured block and after completing the
loop the threads will join back into the master thread, but if we wanted to do
more things within our parallel region, we could, we just have to enclose all of the
statements within a pair of curly braces { }.
Now the codes that we develop for solving ODEs and PDEs tend to be full of for
and while loops and so there is great scope for taking our algorithm and adding in
some preprocessor directives to speed up the computation of the solution. Although
it may seem like the development of a parallel OpenMP program might be trivial,
there are some issues that we will need to consider and many subtle errors that can
become apparent at run time. One issue that we need to be aware of is the situation
of a race condition [44], which can occur when separate threads access and modify
the same memory location at the same time, leading to unpredictable values for that variable. From a performance point of view, one issue that we need to be aware
of is how often the team of threads is created and destroyed. Since there is overhead
in creating a team of threads, we want to minimize the number of times this is done,
or put another way, maximize the amount of work that we can get done in parallel
when we create a team of threads. Consider for example the nested for loop:
for(i=0; i<10000; i++)
{
for(j=0; j<4; j++)
{
a[i][j] = b[i][j] + c[i][j];
}
}
If we were to parallelize the code by placing the preprocessor directive around the
inner loop:
for(i=0; i<10000; i++)
{
#pragma omp parallel for default(shared) private(j)
for(j=0; j<4; j++)
{
a[i][j] = b[i][j] + c[i][j];
}
}
then we will be creating and destroying our team of threads 10,000 times. If we were to, say, split the computation among four threads, then each thread will only
be responsible for the computation of 1 element in a given row of the array a. So we
would expect in this case that the benefit of parallelizing the code in this manner
would be minimal and in fact it is just as likely that the computation could take
longer compared to the serial case because of the overhead associated with the cre-
ation and destruction of the threads outweighing the gain. A better approach would
be to place the parallel directive around the outer loop:
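// A sketch of the same computation with the team of threads created around the outer loop
#pragma omp parallel for default(shared) private(i, j)
for(i=0; i<10000; i++)
{
    for(j=0; j<4; j++)
    {
        a[i][j] = b[i][j] + c[i][j];
    }
}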
Then the team of threads is created and destroyed only once and, splitting the computation among four threads as before, each thread will be responsible for the computation of 2,500 rows in the array, which is much more likely to give us an increase in
performance of the parallel code compared to the serial code. A final issue to
consider is that we want to try and minimize the amount of time that threads spend
waiting for one another to catch up. At the end of parallel for loops there is an
implied barrier where all of the threads have to reach the end of their share of the
loop before joining back to the master thread. In cases where we have multiple loops
or other computations within a parallel region we may encounter regions where only
one thread is doing some work (for example in writing data to an output file), while
the other threads are waiting. If we have explicitly placed synchronization points
into the code with the barrier directive then we may also cause the threads to waste
time waiting for one another to catch up. So while the incorporation of barriers
can be a useful tool for debugging your code and making sure that it is working
properly, their presence should be minimized in the final version of the code.
20.2 Directives
The OpenMP directives tend to instruct the compiler as to ‘what’ to do in a parallel
region. Some of the important directives are:
• #pragma omp parallel
Forms a team of threads and starts parallel execution.
• #pragma omp for
Specifies that the iterations of the for loop will be distributed among and
executed by the encountering team of threads.
• #pragma omp sections
Assigns consecutive but independent code blocks to different threads. This
could be useful in cases where we want different threads to undertake com-
pletely different tasks.
• #pragma omp single
Specifies that the associated structured block is executed by only one of the
threads in the team (not necessarily the master thread).
• #pragma omp master
Similar to a single directive, but the code block will be executed by the
master thread only, with no barrier implied at the end.
• #pragma omp critical
Restricts execution of the associated structured block to a single thread at a
time.
• #pragma omp barrier
Specifies an explicit barrier at the point at which the directive appears.
20.3 Clauses
The OpenMP clauses are placed together with the directives in the code, in order to
specify ‘how’ a piece of code is to be parallelized. Not all of the clauses are valid on
all of the directives however. Some of the important OpenMP clauses are:
• default(shared|none)
Controls the default data-sharing attributes of variables that are referenced in
a parallel construct.
• shared(variable-list)
Declares one or more list items to be shared by threads generated by a parallel
or task construct.
• private(variable-list)
Declares one or more list items to be private to a thread.
• firstprivate(variable-list)
Declares one or more list items to be private to a thread, and initialises each
of them with the value that the corresponding original item has when the
construct is encountered.
• lastprivate(variable-list)
Declares one or more list items to be private to an implicit thread, and causes
the corresponding original item to be updated after the end of the region.
• reduction(operator:list)
Declares accumulation into the list items using the indicated associative operator. Accumulation occurs into a private copy for each list item, which is then combined with the original item (a short sketch is given after this list).
• num_threads(integer-expression)
Declares the number of threads to be used when encountering a parallel region.
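As an illustration of the reduction clause, a minimal sketch (assuming a double array a of length N has been declared and filled) is:
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<N; i++)
{
    sum += a[i]*a[i];   // each thread accumulates into a private copy of sum,
}                       // and the copies are combined when the loop completes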
20.4 Runtime Library Routines
The OpenMP runtime library also provides a number of routines that can be called from within a program. Some of the more important ones are:
• int omp_get_num_threads(void);
Returns the number of threads in the current team.
• int omp_get_max_threads(void);
Returns maximum number of threads that ‘could’ be used to form a new team
using a parallel construct without a num_threads clause.
• int omp_get_thread_num(void);
Returns the ID of the encountering thread where ID ranges from zero to the
size of the team minus 1.
• int omp_get_num_procs(void);
Returns the number of processors available to the program.
• int omp_in_parallel(void);
Returns true if the call to the routine is enclosed by an active parallel region;
otherwise, it returns false.
• double omp_get_wtime(void);
Returns elapsed wall clock time in seconds.
In order to actually create a program using OpenMP, the compiler must have
OpenMP support built in to it. So although the program may need to link to the
OpenMP runtime library in order to use some of the routines it provides, this is not
enough to build the program by itself. Most modern compilers have OpenMP sup-
port, for example the g++ compiler has support since version 4.2 and Microsoft’s
Visual C++ compiler has support since the 2005 edition. In order to use the OpenMP
functionality, a C++ program must include the header file omp.h and will most likely
need to link to the runtime library. Once the program has been compiled it can
be executed in the same manner as any other program. The number of threads
that will be used in the parallel regions can generally be set in one of three man-
ners, either explicitly by using the omp_set_num_threads routine, by using the
num_threads clause with a parallel directive, or by setting the environment variable OMP_NUM_THREADS [15]; a short sketch of these three options is given at the end of this paragraph. It is important to realize that we will only be able to
parallelize portions of the code which are inherently parallelizable. One ‘classic’ ex-
ample of this is a time marching for loop. While it would be trivial to try and place
a #pragma omp parallel for statement around a time marching loop, this will of
course not work because stepping forward in time requires the solution from the
previous step. If a time marching loop was split among ‘say’ two threads, then one
would start from the initial condition, but the other would begin half way through
the simulation and both would march forward together. The second thread would
require the solution from the previous time step, but this won’t have been computed
yet by the first thread. The lesson to take away from this is that when parallelizing with OpenMP one should think about what data a thread will need and whether it will exist when it needs it.
One final point worth mentioning is that although one has the freedom to specify
any number of threads to be used for the parallel regions there will generally not be
any advantage to specifying more than the number of cores present on the system
that the code is executing on and in fact it is more likely that the code would
actually slow down as a result of doing so. An exception to this rule however is the
scenario where the cores support multiple hardware threads, in which case we could (ideally) expect to see a performance benefit by choosing the number of threads for parallel
execution to be up to the number of cores multiplied by the number of hardware
threads supported per core.
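As a brief sketch of the three ways of setting the number of threads mentioned above (four threads are used purely for illustration):
omp_set_num_threads(4);               // 1. library routine, called before the parallel region

#pragma omp parallel num_threads(4)   // 2. clause on the parallel directive
{
    // ...
}

// 3. environment variable, set in the shell before the program is run:
//    export OMP_NUM_THREADS=4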
Example 20.1:
In this example we will develop a ‘Hello World’ example program in C++, par-
allelized with OpenMP to illustrate the creation of a parallel region and print to the
screen the thread IDs. The intended learning outcomes for this example will be to
‘get a feel’ for the structure of a program that includes OpenMP code.
In order to begin, let’s just dive right in and look at the complete program:
#include <iostream>
#include <omp.h>
using namespace std;

int main(int argc, char** argv)
{
    int myID = omp_get_thread_num();
    int N_Threads = omp_get_num_threads();
    cout << "Hello world from Thread " << myID << " of " << N_Threads << endl;
    // Create a team of threads; N_Threads is shared while each thread has its own myID
    #pragma omp parallel default(shared) private(myID)
    {
        myID = omp_get_thread_num();
        #pragma omp master
        N_Threads = omp_get_num_threads();    // only the master thread updates the count
        #pragma omp barrier
        #pragma omp critical
        cout << "Hello world from Thread " << myID << " of " << N_Threads << endl;
    }
    return 0;
}
As the program begins we will use two OpenMP library routines to get the thread ID
and the total number of threads and print to the terminal a message saying “Hello
world” with the thread's ID. Initially, when the program begins there will only be 1
thread, so we would expect that this message would read:
Hello world from Thread 0 of 1
Following the first cout statement however we will create a parallel region, letting
each thread share the N_Threads variable, but having its own private copy of myID.
Inside the parallel region each thread will call the omp_get_thread_num routine
to get its ID, but only the master thread will get the total number of threads.
Then each thread will print the same message as before, but this time it might look
something like:
Hello world from Thread 0 of 4
Hello world from Thread 1 of 4
Hello world from Thread 2 of 4
Hello world from Thread 3 of 4
An important point to note is the use of the critical directive which will let each
thread take its turn printing out its message. The complete program is given in
Example20_1.cpp.
Example 20.2:
In this example we will develop a C++ program to solve the 1D first order wave
equation:
\[ \frac{\partial \phi}{\partial t} + v \frac{\partial \phi}{\partial x} = 0 \tag{20.1} \]
in the domain x ∈ [0, 10], t ∈ [0, 10], with boundary condition φ(0, t) = 1, initial condition $\phi(x, 0) = e^{-5(x-3)^2} + 1$, and v = 1.0. For the spatial discretization we will
use the Finite Difference method with second order central differences for the first
derivative and for the temporal discretization we will use the fourth order Runge-
Kutta method. Furthermore, our program will be parallelized with OpenMP. The
intended learning outcomes for this example will be to simply observe how we add
in OpenMP code to an existing C++ program for parallel execution.
In order to begin we will take the C++ program that was developed in Example
13.1 and add in the OpenMP code. The issues to consider here are where to create
the parallel regions, which loops can we parallelize, and which parts should be exe-
cuted by only one thread. Our current algorithm is essentially comprised of a time
marching loop and then inside that, multiple for loops to compute the k values at
the various stages of the Runge-Kutta method. We will create the parallel region
outside the time marching loop as follows (in sketch form, with the loop counter kept private to each thread):
#pragma omp parallel default(shared) private(l)
{
for(l=0; l<N_t-1; l++)
{
f(k1, phi);
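The listing then continues through the remaining Runge-Kutta stages; at the end of each time step the solution is written to file by a single member of the team, along the lines of the following sketch (the arguments to write here are placeholders):
// ... evaluate k2, k3 and k4, and update phi for this time step ...
#pragma omp single
write(phi, l);    // only one thread in the team performs the file output
}
}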
It can be observed that the write function is to be called by only one thread. This
is an important point that requires some elaboration. Often, file I/O is the slowest
part of a program and it would be fantastic if we could speed it up by using multiple
threads. Unfortunately, file I/O is something that we can’t really parallelize with
OpenMP. So we know from Amdahl’s law that this will limit the maximum speedup
that we could expect, but there’s not really too much we can do about it.
The complete program is given in Example20_2.cpp with a Matlab script for
viewing the output of the program given in Example20_2Postprocessing.m. Fig-
ures 20.2(a) and 20.2(b) illustrate the solution at two different moments in time for
the case where ∆x=0.05 and ∆t=0.02. Some example strong scaling runs for a more
‘substantial’ case where ∆x=0.001 and ∆t=0.001 are:
Number of Threads Execution Time (s)
1 12.680660
2 8.103922
4 5.425282
8 4.675502
Figure 20.2: The solutions to the PDE in Example 20.2 illustrating the solution at
(a) l = 0 and (b) l = 200 for the combination ∆x = 0.05 and ∆t = 0.02.
Example 20.3:
In this example we will develop a C++ program to solve the 2D Poisson equation:
\[ \nabla^2 \phi + \psi = 0 \tag{20.2} \]
in the domain x ∈ [0, 1], y ∈ [0, 1], with boundary conditions φ(0, y) = 1, φ(1, y) = 1,
φ(x, 0) = 1, φ(x, 1) = 1, and ψ = 10. For the spatial discretization we will use
the Finite Difference method with second order central differences for the second
derivatives, and to solve the resulting system of algebraic equations we will use the Gauss-Seidel method, with the two-norm as our measure of convergence. Furthermore, our program will be parallelized with OpenMP.
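The parallelization itself follows the pattern of the previous example; a minimal sketch of the work-sharing construct around the Gauss-Seidel sweep is given below, using the update formula from Example 13.2 and assuming an enclosing parallel region has been created around the iteration loop (the iteration counter k is as in the serial program).
#pragma omp for
for(int i=1; i<N_x-1; i++)          // sweep over the interior grid points only
{
    for(int j=1; j<N_y-1; j++)
    {
        phi[i][j] = (Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
                   + phi[i][j-1] + phi[i+1][j]) / 4;
    }
}
#pragma omp single
k++;                                // only one thread increments the iteration counter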
So at this point we have parallelized the Gauss-Seidel iteration. One very subtle
issue is that strictly speaking, this is not the Gauss-Seidel method as presented in
Chapter 3 because with that technique we ‘swept’ through the column vector of
unknowns updating at each iteration. Each time we did this, the row entries below
the current row would have already been updated, but entries above would not have.
When we have multiple threads updating rows then it will not be true that all rows
below will have been updated. This result doesn’t affect the ability of the solver to
converge, but it’s worth mentioning that it is now a subtly different method.
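The computation of the residual that precedes this point can be handled in a similar manner; a minimal sketch (the names r and r_norm are assumptions, and the residual is written here in the same discretized form as the update above) is:
#pragma omp single
r_norm = 0.0;
#pragma omp for reduction(+:r_norm)
for(int i=1; i<N_x-1; i++)
{
    for(int j=1; j<N_y-1; j++)
    {
        double r = Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
                 + phi[i][j-1] + phi[i+1][j] - 4*phi[i][j];
        r_norm += r*r;
    }
}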
So at the end of the computation of the residual, we instruct one of the threads
to compute the square root with a single directive such that we have the correct
value for the two norm. It can be observed that we have parallelized the outermost
for loop in the Gauss-Seidel iteration and that only one thread in the team will
increment the iteration counter k.
The complete program is given in Example20_3.cpp with a Matlab script for
viewing the output of the program given in Example20_3Postprocessing.m. Figure
20.3 presents the converged solution for the case where Nx = 65 and Ny = 65. Some
Figure 20.3: The converged solution to the PDE in Example 20.3 illustrating the
solution for Nx = 60 and Ny = 60.
example strong scaling runs for a more ‘substantial’ case where Nx = 1001 and
Ny = 1001 are:
Chapter 21
MPI
21.1 Concepts
The second API that we will investigate is the Message Passing Interface, which
actually defines a language standard rather than an implementation. As the name
suggests, the interface is based on the distributed memory parallel programming model, in which each process has its own private memory space and data is moved between processes by explicitly sending and receiving messages. The most basic form of communication is point-to-point communication between a pair of processes. Consider, for example, an exchange of the kind sketched at the end of this paragraph, which makes use of the integer
variable myRank, which will be assigned uniquely to each process when they begin
execution. Let’s assume that we have chosen to execute this code with four processes.
All of the processes will enter the if statement, but only one of the processes will
have a rank of 0 and hence enter the following structured block of code. As it does
it will send 10,000 floating point numbers from its copy of the array a to process 1. Meanwhile, process 1 will have skipped the first if statement and will have entered the second structured block of code and as such will be waiting to receive 10,000
floating point numbers into its copy of the array a from process 0. The remaining
processes with ranks 2 and 3 will not enter either structured block and will continue
onto the next portion of the program. The last two arguments of the function calls
above define the communicator and an optional argument tag which can be used
in some cases if we need to associate more information with a message. Returning
to the analogy of a room full of people, point-to-point communication is analogous
to one person talking to another, i.e. one person speaks, the other listens; and both
have to be performing their respective role in order for information to be transferred.
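A minimal sketch of the point-to-point exchange described above is given below; it assumes that the array a holds 10,000 floats on every process, that tag is an integer message tag, and that status is an MPI_Status variable.
if(myRank==0)
{
    MPI_Send(a, 10000, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
}
if(myRank==1)
{
    MPI_Recv(a, 10000, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &status);
}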
In contrast to point-to-point communication, collective communication is used
when we want to exchange information between multiple processes in the one func-
tion call. Some examples of this could be a broadcast, where we transfer data from
one process to all of the others (Figure 21.2(a)), a scatter, where we break up an
array from one process and distribute parts of it to other processes (Figure 21.2(b)),
a gather, where we do the exact opposite of a scatter and assemble data from multiple processes into an array on one process (Figure 21.2(c)), and an all to all, where we
take arrays from all processes and send parts of them around to all other processes
(Figure 21.2(d)). Another common use is to perform a reduction. Consider for
example the code:
MPI_Reduce(a, b, 10000, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
The meaning here is that we are ‘summing’ the 10,000 elements in a across all of the processes and assigning the sums to the array b on process 0. To elaborate on this point, if we had four processes, each having a copy of the array a = {1, 2, 3, 4}, then following the reduction process 0 would have the array
b= {4, 8, 12, 16}. Returning to the analogy of a room full of people, collective
communication is analogous to ‘say’ one person talking to everyone else, or even
everybody talking to everybody else at the same time!
As with OpenMP there are a number of issues we need to consider when devel-
oping MPI code, and one of the most important is the choice of using blocking or
non-blocking communication. When we use blocking communication the send and
receive routines will generally ‘hang’ until the message has been received. As we will
see in the examples this can lead to situations where processes either end up per-
forming their computations one after the other (which defeats the point of designing
a parallel program), or even worse to lock up completely. When we use non-blocking
communication the send and receive routines ‘post’ their data to be sent or received
and the program flow continues. This can allow for more efficient parallel execution
since we can overlap computation and communication at the same time (meaning
that a process can ‘post’ a send or receive and do some computation in the mean-
time when this data is in transit), but of course we have to then explicitly check
at some point in the program that all of the required data has been sent to all of
the necessary processes by the point where they need it for computation, implying
the use of a function such as MPI_Waitall to check for this. As a final point, one
needs to be aware that there is an overhead and latency associated with creating
and sending a message and so messages passing should be performed ‘wisely’. In
particular, sending many small messages repeatedly will generally be far less efficient than sending fewer, larger messages. Furthermore, global collective communication
tends to create more ‘contention’ in the network compared to point-to-point com-
munication because every process may be trying to communicate with every other
process at the same time. The types of communication utilized in a distributed
memory parallel program really depend upon the nature of the algorithm however,
and as we will see, both forms are typically used in a program.
21.2 Runtime Library Routines
Some of the important MPI routines are:
• int MPI_Finalize(void);
Terminates the MPI execution environment. Every MPI program should include
this function call somewhere near the end of the program and there should be
no MPI function calls after this one.
(a) (b)
(c) (d)
Figure 21.2: Some example MPI collective communications illustrating (a) a broad-
cast, (b) a scatter, (c) a gather, (d) an all to all.
• int MPI_Comm_size(MPI_Comm comm, int* size);
Determines the number of processes in the group associated with the communicator. We will pretty much always need to include this function call somewhere near the beginning of our programs so that we can store the total number of processes in the integer variable size. This will help us decide how much ‘work’ each process should be doing (e.g. how many grid points, or cells, or elements a process should be responsible for).
• double MPI_Wtime(void);
Returns elapsed time from some arbitrary point in the past, in seconds. This is
an optional function, but is often useful for determining how long a simulation
takes.
• int MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);
Performs a blocking send of count elements of type datatype, starting at the address buf, to the process with rank dest.
• int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status* status);
Performs a blocking receive of count elements of type datatype into the address buf from the process with rank source.
• int MPI_Waitall(int count, MPI_Request requests[], MPI_Status statuses[]);
Waits for all of the given non-blocking send and receive requests to complete.
One observation that can be made is that essentially all MPI functions involved
in transferring data expect pointers to contiguous blocks of memory, which ties in
nicely with the data structures that we have been using in our serial programs thus
far.
In contrast to OpenMP the compiler does not need to have any support for MPI,
rather it is just a process of linking the program to the appropriate libraries. As with
OpenMP there is a header file which must be included, namely mpi.h. Commonly
however, the installation of the MPI libraries will include the program mpicxx which
can be used instead of the normal compiler to create the executable. In fact mpicxx
is merely a ‘wrapper’ that generally calls the normal compiler, telling it where to look for the header file and the libraries, as well as which libraries to link to.
Another difference compared to OpenMP is that programs are not executed in the
same manner by which a serial program is. A program built to run in parallel
with MPI is executed using another program called mpirun (or mpiexec), which is
responsible for creating all of the processes on the different computers, or processors
(depending on the system). Most simply, a program can be run via the command:
mpirun -n 4 myProgram
where we are specifying four processes to run the program in parallel. The exact
syntax differs between systems, but commonly one might need to specify the ‘full’
path of the executable so that mpirun can find it in the computer's file system, and
there may be different ways in which input arguments to myProgram are specified.
Now as mentioned at the beginning of this section MPI defines a standard rather
than an implementation. As such there are a few common implementations, namely
OpenMPI [37], MPICH [32], and LAM [25]. As with running OpenMP programs one has
the freedom to specify any number of processes to be used for the execution of the
program, but there will generally not be any advantage to specifying more processes
than CPUs available in the system. It is also quite common that depending on the
decomposition of the problem, not just any number of processes can be used. Often
it may need to be a power of two or an even number, etc. Generally for our parallel
MPI programs to run efficiently we will want good load balancing, meaning that each
process has a similar (or ideally the same) amount of work to do, so that they can
complete their computations in the same amount of time. In the context of solving
PDEs this translates to breaking up a computational grid so that processes have
a similar number of grid points or elements. The worst scenario would be if one
process had significantly more work to do than the others as it would slow down the
entire calculation.
Example 21.1:
In this example we will develop a ‘Hello World’ example program in C++, paral-
lelized with MPI to illustrate the creation of multiple processes and send some data
between them using point-to-point communication. The intended learning outcomes
for this example will be to ‘get a feel’ for the structure of a program that includes
MPI code.
In order to begin, as with the ‘Hello World’ program in Example 20.1 let’s just
dive right in and look at the complete program:
#include <iostream>
#include <mpi.h>
using namespace std;

#define TAG 0   // message tag for the point-to-point messages (the value here is an assumption)

int main(int argc, char** argv)
{
int myID;
int N_Procs;
int dummy;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myID);
MPI_Comm_size(MPI_COMM_WORLD, &N_Procs);
if(myID==0)
{
dummy = 10;
cout << "Hello world from Process " << myID << " of " << N_Procs << endl;
for(int n=1; n<N_Procs; n++)
{
MPI_Send(&dummy, 1, MPI_INT, n, TAG, MPI_COMM_WORLD);
}
}
else
{
MPI_Recv(&dummy, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &status);
cout << "Hello world from Process " << myID << " of " << N_Procs << endl;
}
MPI_Finalize();
return 0;
}
As the program begins we call the MPI_Init function to initialize the MPI envi-
ronment, then we call MPI_Comm_rank and MPI_Comm_size to get the rank of each
process and the total number of processes respectively. We then enter an if - else
construct where the ‘root’ process (i.e. having a rank of 0) will enter the first struc-
tured block of code, print a message to the terminal saying “Hello world” with its
rank, and then perform a blocking send to each of the other processes in turn, send-
ing them a single integer number. In contrast the remaining processes will enter the
second structured block of code and perform a blocking receive, waiting to receive
a single integer number from the root process. Once they have received this value
they will print to the terminal a message saying “Hello world” with their rank, so
if we execute the program with the command:
mpirun -n 4 Example21_1
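then we would expect to see output along the lines of the following (the root process prints first, but the order of the remaining messages may vary from run to run):
Hello world from Process 0 of 4
Hello world from Process 1 of 4
Hello world from Process 2 of 4
Hello world from Process 3 of 4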
Example 21.2:
In this example we will develop a C++ program to solve the 1D first order wave
equation:
\[ \frac{\partial \phi}{\partial t} + v \frac{\partial \phi}{\partial x} = 0 \tag{21.1} \]
in the domain x ∈ [0, 10], t ∈ [0, 10], with boundary condition φ(0, t) = 1, initial condition $\phi(x, 0) = e^{-5(x-3)^2} + 1$, and v = 1.0. For the spatial discretization we
will use the Finite Difference method with second order central differences for the
first derivative and for the temporal discretization we will use the fourth order
Runge-Kutta method. Furthermore, our program will be parallelized with MPI. The
intended learning outcomes for this example will be to observe how we add in MPI
code to an existing C++ program for parallel execution, to investigate the use of both
Figure 21.3: A schematic of the partitioned 1D problem domain for the first order
wave equation. The light grey boxes show the interior finite difference grid points,
the dark grey the ghost points and the pink box the Dirichlet boundary point.
The arrows show the flow of information between the ghost points during the time
marching.
blocking and non-blocking communication, and to investigate the use of parallel file
IO to write out the simulation data.
In order to begin we will take the C++ program that was developed in Example
13.1 and add in the MPI code. The first issue to consider here is how we go about
breaking up the problem for a parallel computation. One of the common techniques
used is to assign each process a portion of the spatial domain (in our case a 1D
structured grid) and the processes will perform the time marching in parallel. In
this case we will retain the meaning of N_x to be the total number of grid points
in the domain, but now, rather than having N_x grid points, each process will have
‘say’ myN_x grid points, where we could compute the number of grid points for each
process via something like:
if(N_x%N_Procs)
{
myN_x = N_x/N_Procs + 1;
if(myID==N_Procs-1)
{
myN_x = N_x - myN_x*(N_Procs-1);
}
}
else
{
myN_x = N_x/N_Procs;
}
Remember that we want each process to have the same amount of work to do,
otherwise the execution speed of our parallel simulation will be impeded by the fact
that some processes will end up ‘waiting’ around for the others to catch up. The
way to interpret this code snippet is that if the number of grid points doesn't divide
evenly among the processes, then we add 1 to this division and ‘most’ processes will
be assigned this number. The process with the highest rank will however take the
remainder from the total number, minus what all of the other processes are assigned.
For example, take ‘say’ a grid of 201 points spread over 4 processes. The above ‘domain decomposition’ would assign grid portions of {51, 51, 51, 48} to processes {0, 1, 2, 3}
respectively. Of course if we were really worried about load balancing we should
probably make sure that the size of our grid and the number of processors used
for the simulation are ‘tuned’ or ‘matched’ in some way. The above code snippet
however illustrates a way in which we can maintain a bit of flexibility in the program.
When it comes to setting the initial condition we have to know the ‘global’ x coordinates of each process's portion of the grid so that we can evaluate the $\phi(x, 0) = e^{-5(x-3)^2} + 1$ term correctly. The way in which we will handle this is to define a variable called prevN_x which, for a given process, stores the total number of grid points possessed by processes with a lower rank than it. To evaluate this integer we will make use of the MPI_Exscan function as:
MPI_Exscan(&myN_x, &prevN_x, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
where we are saying sum up the integers in myN_x on every process below ‘me’ and place this number in prevN_x. So every process will have a different value for this
variable based on its rank. At this point we will then know where the grid points
for any given process are located in relation to the ‘global’ grid and can assign the
initial condition. In our program the code to implement this will look like:
for(i=1; i<myN_x+1; i++)
{
x = x_min + Delta_x*(prevN_x + i - 1);
phi[0][i] = exp(-5*pow(x-3, 2)) + 1;
}
So the important observation to make here is that when developing parallel pro-
grams with a distributed memory model, we often have these ‘complications’ that
require more code in order to get everything set up correctly. Of course if we’d just
assumed that the number of grid points would be perfectly divisible by the number
of processes then we wouldn’t have to worry about this, but it’s important to be
exposed to some of these issues.
Now, we know that because we are using a second order central finite difference
stencil, the solution at a particular grid point will depend upon the solution at its
two neighboring grid points and so the approach of distributing the grid poses a
problem when we consider the fact that at the boundaries of each process’s portion
of the grid, it will need data from grid points that exist in another process's
memory. What we can do here is to allocate a slightly bigger array of size myN_x+2
to store what are known as ghost points (Figure 21.3). Then at every time step each
process p will send its value of phi[1] to location phi[myN_x+1] on process p − 1 and its value of phi[myN_x] to location phi[0] on process p + 1. Furthermore, each
process p will receive process p − 1’s value for phi[myN_x], which it will store in
phi[0], and process p + 1’s value of phi[1], which it will store in phi[myN_x+1].
The only exceptions to this are processes 0 and N_Procs − 1, which contain the actual
boundaries of the grid. Having done this however, each process will be able to
evaluate the finite difference stencil and hence perform the time marching correctly.
As we march through time we will then need to exchange data between processes,
and in fact because we are using the fourth order Runge-Kutta method, we will need
to exchange ghost point data four times as we evaluate the k values on each process.
So in this case it makes sense to define a function that we can call repeatedly and
as such we will create the function exchange, which will take the form:
void exchange(double* phi, int myN_x, int myID, int N_Procs)
{
MPI_Status Status;
if(myID>0)
{
MPI_Send(&phi[1], 1, MPI_DOUBLE, myID-1, 1, MPI_COMM_WORLD);
}
if(myID<N_Procs-1)
{
MPI_Recv(&phi[myN_x+1], 1, MPI_DOUBLE, myID+1, 1, MPI_COMM_WORLD, &Status);
}
// Send the other ghost value to the higher-ranked neighbor and receive from the lower one
if(myID<N_Procs-1)
{
MPI_Send(&phi[myN_x], 1, MPI_DOUBLE, myID+1, 2, MPI_COMM_WORLD);
}
if(myID>0)
{
MPI_Recv(&phi[0], 1, MPI_DOUBLE, myID-1, 2, MPI_COMM_WORLD, &Status);
}
return;
}
The first input to this function will either be phi[l] for evaluating k1 and then
subsequently tempPhi for evaluating the remaining k values. As this function pro-
ceeds all processes except process 0 will enter the first if statement and try to send a ghost value to their lower neighbor process (e.g. process 3 will try to send to process 2). Only process 0 will initially be able to proceed to the second if statement and wait to receive its ghost point value from process 1. As soon as this has happened
process 1 is then ‘freed’ up to receive its ghost point value from process 2, and
so on, such that there is in fact a ‘wave’ of propagation of data being exchanged.
Once the processes have received their first ghost point value they will try and send
their other ghost point value to their higher neighbor process (e.g. process 3 will
try to send to process 4). Since the last process (with rank N_Procs-1) will not
enter the second if statement, it will be waiting to receive its ghost point
value from process N_Procs-2. As soon as this has happened process N_Procs-2 is
then ‘freed’ up to receive its ghost value from process N_Procs-3, and so on, such
that there is a second ‘wave’ of propagation of data being exchanged. This wave
of information transfer in effect creates a serial computation since all the processes
are sending in turn and this is not ideal. To implement the equivalent data transfer
using non-blocking communication we could instead define our exchange function
as:
void exchange(double* phi, int myN_x, int myID, int N_Procs)
{
MPI_Status statuses[4];
MPI_Request requests[4];
int N_r = 0;
if(myID>0)
{
MPI_Isend(&phi[1], 1, MPI_DOUBLE, myID-1, 1, MPI_COMM_WORLD, &requests[N_r++]);
MPI_Irecv(&phi[0], 1, MPI_DOUBLE, myID-1, 2, MPI_COMM_WORLD, &requests[N_r++]);
}
if(myID<N_Procs-1)
{
MPI_Irecv(&phi[myN_x+1], 1, MPI_DOUBLE, myID+1, 1, MPI_COMM_WORLD, &requests[N_r++]);
MPI_Isend(&phi[myN_x], 1, MPI_DOUBLE, myID+1, 2, MPI_COMM_WORLD, &requests[N_r++]);
}
MPI_Waitall(N_r, requests, statuses);
return;
}
It can be observed here that the structure is similar, but we are using the non-blocking send and receive routines, posting each transfer and then waiting for all of them to complete with MPI_Waitall before the function returns. The other place where the program changes is in the function f that evaluates the k values, which must now account for which portion of the grid a given process holds; in sketch form it looks something like:
void f(double* k, double* phi)
{
if(myID==0)
{
for(int i=2; i<myN_x+1; i++)
{
k[i] = -v/(2*Delta_x)*(phi[i+1] -phi[i-1]);
}
}
else if(myID==N_Procs-1)
{
for(int i=1; i<myN_x; i++)
{
k[i] = -v/(2*Delta_x)*(phi[i+1] -phi[i-1]);
}
k[myN_x] = -v/( Delta_x)*(phi[myN_x]-phi[myN_x-1]);
}
else
{
for(int i=1; i<myN_x+1; i++)
{
k[i] = -v/(2*Delta_x)*(phi[i+1] -phi[i-1]);
}
}
return;
}
Here, we can see that the k array will be evaluated in a slightly different manner depending upon the process's rank. Beginning with process 0, because it has the Dirichlet boundary point in its portion of the grid, it will start updating from the third element in the phi array (with index 2). For process N_Procs−1 however,
it contains the grid point at the other end of the domain where we must use a first
order backward difference in order to evaluate the spatial derivative at that point.
For all other processes, they simply loop over all of their grid points (indices 1 to
myN_x) and update them. In every case, it is assumed that the ghost point values
will have been exchanged and will be ‘waiting’ for use.
Figure 21.4: A schematic illustrating the single ascii output file and how each
process writes its portion of the grid to the file. Each square illustrates the number
of bytes required to store a single field value φi as a number of characters. The
format of the text file is such that each row corresponds to the solution over the
entire grid at a particular time step.
The one thing that we have not yet mentioned is how we will output the data
from our program. There are a few issues to consider here. When our grid has been
distributed into multiple memory spaces, how do we recombine all of that data; do
we combine it all to create one output file, or do we have each process output its
portion of the grid and then recombine the data with some other program? As it
happens, both of these approaches are used in practice and we will see both in action
shortly. Often, it is ‘easier’ if we only have one output file to post process and so
then the question is, what’s the best way to get all of the data together to write
out into one file? One approach is to have each process send its portion of the data
to one process (usually the root process with rank 0 ’say’). Process 0 could then
be responsible for opening an output file and writing all of the data to it. While
this idea sounds quite simple, the disadvantages with this approach are that firstly
it can mean a large amount of message passing, all to the one process (analogous to
having a large number of people all talking to the one person) and secondly, it might
not be possible for the memory space of process 0 to hold all of the field data for all
of the other processes. Indeed, part of the reason for running parallel computations
in the first place is so that we can run much larger calculations that could not fit
in the memory space of one individual processor. Another option is to use the MPI
file I/O functionality. The basic idea here is that each process can write to the one
file, but they will write to different parts of the file, so that once we are done, we
will have one complete file. We begin by declaring a variable of type MPI_File and
each process will open up this file:
MPI_File file;
MPI_File_open(MPI_COMM_WORLD, "Example21_2.data", MPI_MODE_CREATE | MPI_MODE_WRONLY,
MPI_INFO_NULL, &file);
where the important input arguments to the function call to open the file are the
file name, the communicator, and some additional flags defining how the file is to
be opened (in our case we are creating the file and writing to it only). Now we are
going to be writing out the data at each time step (as we did in Example 13.1) and
so we will define a function write that we will call at each time step to add the field
data to the output file. This is going to take a bit of work, so let’s in fact start by
looking at the complete function:
void write(MPI_File& file, double* phi, int l, int myN_x, int prevN_x, ...
int N_x, int myID, int N_Procs)
{
int N_CharPer_x = 7;
int N_BytesPer_x = N_CharPer_x * sizeof(char);
int N_BytesPer_l = N_x * N_BytesPer_x;
int Offset = prevN_x * N_BytesPer_x;
char buffer[myN_x*N_BytesPer_x];
for(int i=1; i<myN_x+1; i++)
{
if(i==myN_x && myID==N_Procs-1)
{
sprintf(buffer+(i-1)*N_BytesPer_x, "%+.3f\n", phi[i]);
}
else
{
sprintf(buffer+(i-1)*N_BytesPer_x, "%+.3f\t", phi[i]);
}
}
MPI_File_seek(file, l*N_BytesPer_l + Offset, MPI_SEEK_SET);
MPI_File_write(file, buffer, myN_x * N_CharPer_x, MPI_CHAR, MPI_STATUS_IGNORE);
return;
}
Generally, with large scale computations we would want to write our simulation
data in binary format (as opposed to ascii format) because it is faster to write and
requires less disk space. The disadvantage with binary data however, is that we
can’t open up the file in ‘say’ a text editor to look at it. Most of the time we would
have some post processing program available to read out binary output, but for our
purposes we are going to use Matlab to perform the post processing and this will
be easier to do if we have an ascii text file. That being said, the thing that our
write function has to do is to decide where in the file to start writing the field
data for each process. We can in fact think of the file in this sense like a single 1D
array and we have to compute which entry in the array corresponds to the start of
our grid. So in fact what we have to do is compute how many bytes into the file
each process should ‘skip over’ before it starts writing its data. Returning now to
the code snippet, showing the write function, the first thing we do is declare some
integer variables, the first defining the number of characters that we will use to store
a floating point number. In this example we are using 7 characters to represent a
number and so any given number in our output file will look like +0.027. Here
we can see that the + and the . count as characters, plus the tab character which
will separate out every string (but we will just see it as white space). So using 7
characters we will only actually be storing our field data to 3 decimal places.
Following this definition we can compute the number of bytes to store a field
value φi as the number of characters multiplied by the number of bytes per character
(which is 1 for ascii). Then we can compute the number of bytes to store the entire
grid at any one time step as the number of bytes per field value multiplied by the
overall number of grid points. Then in order to compute where in the file any given
process should ‘skip’ to, we will make use of the prevN_x variable that was defined
when setting the initial condition. As such we can compute an ‘offset’ for each
process that is the number of grid points owned by ranks below it multiplied by
the number of bytes per grid point (field value). With this information computed
we enter a for loop where we loop over every grid point in our portion of the
grid and convert the floating point numbers to a string with the sprintf function,
which we put into an array called buffer. Once this is done, all that we need
to do is skip to the right portion of the file and write the data. The skipping is
performed with the MPI_File_seek function and the number of bytes to skip is
l*N_BytesPer_l+Offset, i.e. we have to skip over the bytes for all of the time steps already written, plus the offset for our process within the current time step. Then, finally, we can write out the data
with the MPI_File_write function. The important argument to the function here
is MPI_CHAR which indicates that we are writing ascii text. Often, this would be
MPI_FLOAT or MPI_DOUBLE when writing binary data.
The complete program is given in Example21_2.cpp with a Matlab script for
viewing the output of the program given in Example21_2Postprocessing.m. Fig-
ures 21.5(a) and 21.5(b) illustrate the solution at two different moments in time for
the case where ∆x=0.05 and ∆t=0.02. Some example strong scaling runs for a more
‘substantial’ case where ∆x=0.001 and ∆t=0.001 are:
Number of Processes Execution Time (s)
1 2.650334
2 1.355908
4 0.661549
8 0.371658
Figure 21.5: The solutions to the PDE in Example 21.2 illustrating the solution at
(a) l = 0 and (b) l = 200 for the combination ∆x = 0.05 and ∆t = 0.02.
Example 21.3:
In this example we will develop a C++ program to solve the 2D Poisson equation:
\[ \nabla^2 \phi + \psi = 0 \tag{21.2} \]
in the domain x ∈ [0, 1], y ∈ [0, 1], with boundary conditions φ(0, y) = 1, φ(1, y) = 1,
φ(x, 0) = 1, φ(x, 1) = 1, and ψ = 10. For the spatial discretization we will use
the Finite Difference method with second order central differences for the second
derivatives and to solve the resulting system of algebraic equations we will use the
Gauss-Seidel method, with the two-norm as our measure of convergence. Further-
more, our program will be parallelized with MPI. The intended learning outcomes for
this example will be to observe how we add in MPI code to an existing C++ program
for parallel execution and to investigate creating new communicators and data types.
Figure 21.6: A schematic of the partitioned 2D problem domain for the Poisson
equation. The pink circles show the interior finite difference grid points, while the
blue show the boundary grid points. The grey boxes show the portions of the grid
that are mapped to each process (Note that we actually use a 4 × 4 grid in the
Example) and the yellow patches show where the ghost cells overlap (i.e. each
process is actually only storing a 2 × 2 portion of the grid, the remaining grid points
are ghost cells). Finally the green arrows illustrate the flow of information during
each of the four send and receive operations that are required to communicate the
ghost cell values throughout the Gauss-Seidel iterations.
In order to begin we will take the C++ program that was developed in Example
13.2 and add in the MPI code. The first issue to consider here is how we go about
breaking up the problem for a parallel computation. Compared to the 1D spatial
domain from Example 21.2 we have a couple of options. We could either break up
the square domain into ‘strips’ and give each process one of these strips, or we could
break it up into a ‘checkerboard’ type pattern and give each process one. We will
use the latter approach and break the domain up into smaller square pieces of equal
size. This means that we will need to use square numbers of processes (i.e. 4, 9,
16, 25, etc) and we can define an array dimensions which will store the numbers of
processes in x and y.
Similar to the 1D domain we will allocate the array slightly bigger to store
the ghost points and will exchange information between processes throughout the
simulation. Figures 21.6(a) and 21.6(b), illustrate an example grid, with its domain
decomposition and the ghost point data being exchanged between the process with
the middle of the grid and its neighbors. Because the layer of ghost points forms a
ring around the pieces of grid where we are defining the solution, it is often termed a
halo. For this example we will keep the number of grid points per process (which we
will denote myN_x and myN_y) constant, such that as we add more processes the grid
resolution will increase (i.e. ∆x and ∆y will decrease). As such we will calculate
the grid spacings as:
int myN_x = 20;
int N_x = myN_x*dimensions[X];
float Delta_x = (x_max-x_min)/(N_x-1);
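A minimal sketch of the set-up of this communicator is given below; the names isPeriodic, reorder, and myCoords, and the calls retrieving the rank, coordinates, and neighbor ranks, are written out here for illustration (N_D, dimensions, and myID are assumed to have been declared earlier in the program).
int isPeriodic[2] = {0, 0};   // the grid of processes is not periodic
int reorder = 1;              // the ranks may be reordered
int myCoords[2];
int leftNeighbor, rightNeighbor, bottomNeighbor, topNeighbor;
MPI_Comm Comm2D;
MPI_Cart_create(MPI_COMM_WORLD, N_D, dimensions, isPeriodic, reorder, &Comm2D);
MPI_Comm_rank(Comm2D, &myID);                    // rank within the new communicator
MPI_Cart_coords(Comm2D, myID, N_D, myCoords);    // coordinates of this process in the grid
MPI_Cart_shift(Comm2D, 0, 1, &leftNeighbor, &rightNeighbor);
MPI_Cart_shift(Comm2D, 1, 1, &bottomNeighbor, &topNeighbor);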
Here what we are doing is creating a new communicator called Comm2D with the
function MPI_Cart_create that we will use instead of MPI_COMM_WORLD. The ar-
guments to this function include the current communicator, the dimensions of the grid (of processes), of which there are N_D, whether or not this grid will be periodic
(which could be handy if our PDE had periodic boundary conditions ‘say’), and
whether or not the processes can be reordered (meaning whether or not the ranks
can be modified). Once we have created this new communicator each process can get
its new rank within the context of the 2D communicator and furthermore, we can
get the coordinates of each process (Figure 21.6(b)). One of the most useful features
however is the MPI_Cart_shift function, which will give us the ranks of the processes on either side of a given process. We will store these ranks in the
variables leftNeighbor, rightNeighbor, bottomNeighbor, topNeighbor, where it
should be noted that the actual values will of course be different on each process. An
important point worth mentioning is that processes on the ‘boundary’ of the grid will
not have all four neighbors (e.g. the lower left process in Figure 21.6(a) will only have
a rightNeighbor and a topNeighbor). In these cases the MPI_Cart_shift func-
tion will return MPI_PROC_NULL, and if we try and send to or receive from this rank,
the send and receive functions will simply ignore it.
As we exchange ghost point information throughout the iterations of our Gauss-
Seidel method, we will need to send to and receive from each of these neighbors
and while we could also accomplish this using either the standard blocking or non-
blocking sends and receives, a more elegant way is to use the MPI_Sendrecv function
that will do two things at the same time. For example, to exchange all of the ghost
point data between the left and right neighbors of a given process we could do:
MPI_Sendrecv( &(phi[1][1]), myN_y, MPI_DOUBLE, leftNeighbor, 0,
&(phi[myN_x+1][1]), myN_y, MPI_DOUBLE, rightNeighbor, 0,
Comm2D, &status);
MPI_Sendrecv( &(phi[myN_x][1]), myN_y, MPI_DOUBLE, rightNeighbor, 0,
&(phi[0][1]), myN_y, MPI_DOUBLE, leftNeighbor, 0,
Comm2D, &status);
Figure 21.7: A schematic illustrating the ghost point data to be sent to neighboring
processes and its layout in memory.
In the first function call we are sending myN_y entries to the process leftNeighbor
and at the same time receiving myN_y entries from the process rightNeighbor. With
reference to the middle process in Figure 21.6(b) the address of the array that we
will send from is phi[1][1], and the address of the array that we will receive
into is phi[myN_x+1][1]. So upon completion of these two function calls we have
exchanged all of the data between our left and right neighbors. Now, because of the
way we have allocated memory for phi, elements in the array along a column y will
be contiguous in memory, but along a row x will not be. This is a problem in terms of
sending and receiving data between our topNeighbor and bottomNeighbor because
the send and receive functions expect a pointer to a contiguous block of memory.
One option would be to create a new array (that will be a single contiguous block of
memory), loop over all of the x values that we need, copy them into the new array,
and send that array instead. A better way however is to again make use of some
MPI functionality, in particular to create a new data type, which we can implement
with the code:
MPI_Datatype strideType;
MPI_Type_vector(myN_x, 1, myN_y+2, MPI_DOUBLE, &strideType);
MPI_Type_commit(&strideType);
Here, strideType is our new data type, and it will be a vector that will hold
myN_x double precision floating point numbers. The third argument to the function
indicates the stride (in this case myN_y+2), meaning that when we try to send a
strideType (and give the function the pointer indicating where the block of memory
to be sent begins), every (myN_y+2)-th value in the array will become part of the vector.
The MPI_Type_commit function makes the new data type available for use. Having
done this, we can then send to our topNeighbor and bottomNeighbor as:
MPI_Sendrecv( &(phi[1][1]), 1, strideType, bottomNeighbor, 0,
&(phi[1][myN_y+1]), 1, strideType, topNeighbor, 0,
Comm2D, &status);
MPI_Sendrecv( &(phi[1][myN_y]), 1, strideType, topNeighbor, 0,
&(phi[1][0]), 1, strideType, bottomNeighbor, 0,
Comm2D, &status);
which is quite similar to data exchange between the left and right neighbors, ex-
cept that instead of sending myN_y double precision floating point numbers, we are
sending 1 strideType.
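For comparison, the manual packing alternative mentioned above (copying a non-contiguous row into a temporary array before sending) might look something like the following sketch, here for the exchange with the bottom and top neighbors; the names sendBuffer and recvBuffer are hypothetical:
double* sendBuffer = new double [myN_x];
double* recvBuffer = new double [myN_x];
for(i=1; i<myN_x+1; i++)
{
    sendBuffer[i-1] = phi[i][1];        // pack the bottom interior row
}
MPI_Sendrecv(sendBuffer, myN_x, MPI_DOUBLE, bottomNeighbor, 0,
             recvBuffer, myN_x, MPI_DOUBLE, topNeighbor, 0,
             Comm2D, &status);
for(i=1; i<myN_x+1; i++)
{
    phi[i][myN_y+1] = recvBuffer[i-1];  // unpack into the top ghost row
}
delete [] sendBuffer;
delete [] recvBuffer;
The MPI_Type_vector approach avoids this extra copying and keeps the exchange code symmetric with the left/right case.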
Now that we have our data exchange sorted out, the next little complication is
that some of the processes will contain the Dirichlet boundary grid points and so
when we perform our Gauss-Seidel iteration, we don’t want to include these points
in our update. With reference to Figure 21.6(a) we can see that process 0 (with
coordinates (0, 0)) should begin its update from the array entry phi[2][2] and end
at phi[myN_x][myN_y], whereas process 6 (with coordinates (2, 0)) should begin its
update from the array entry phi[1][2] and end at phi[myN_x-1][myN_y]. The
way in which we can incorporate these different starting and ending indices for the
different processes is to define the variables:
int myiStart = myCoords[X]==0 ? 2 : 1;
int myiEnd = myCoords[X]==N-1 ? myN_x : myN_x+1;
int myjStart = myCoords[Y]==0 ? 2 : 1;
int myjEnd = myCoords[Y]==N-1 ? myN_y : myN_y+1;
Here we are saying that if a given process's x coordinate is 0 (meaning that it will
therefore include a portion of the left Dirichlet boundary), then the starting index
will be 2, otherwise it will be 1. Similarly, if the given process's x coordinate is N − 1
(meaning that it will therefore include a portion of the right Dirichlet boundary),
then the ending index will be myN_x, otherwise it will be myN_x+1. We can then
perform our Gauss-Seidel iteration correctly for each process via the code:
for(i=myiStart; i<myiEnd; i++)
{
for(j=myjStart; j<myjEnd; j++)
{
phi[i][j] = (Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
+ phi[i][j-1] + phi[i+1][j]) / 4;
}
}
One important observation is that this will affect the load balancing slightly: some
processes will have slightly more work to do than others when performing the Gauss-Seidel
iteration. This could be improved by decomposing the data differently such that
each process has a similar number of interior points, but the tradeoff would be a
more complex code. As it happens, we will also need to do the same thing when it
comes to computing the residual, but we have an extra complication here. Because
each process is only operating on a portion of the grid, it will only be computing
a portion of the residual. Our iterative while loop will however need the ‘overall’
or ‘global’ residual norm. Because we are using the two norm as our measure of
convergence, what we will do is to have each process loop over its portion of the
grid and add up the $r_{i,j}^2$ terms into a variable called myr_norm as:
myr_norm = 0.0;
for(i=myiStart; i<myiEnd; i++)
{
for(j=myjStart; j<myjEnd; j++)
{
r[i][j] = Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
+ phi[i][j-1] + phi[i+1][j] - 4*phi[i][j];
myr_norm += r[i][j]*r[i][j];
}
}
MPI_Allreduce(&myr_norm, &r_norm, 1, MPI_DOUBLE, MPI_SUM, Comm2D);
r_norm = sqrt(r_norm);
Following this, we will perform a reduction operation with the MPI_Allreduce func-
tion to sum all of the individual myr_norm’s on each process and place the result
in the variable r_norm, which each process has a copy of. Then each process will
compute the square root of this value, which will give the correct global measure of
the two norm. This way the processes will always stay synchronized.
We can now write out our entire iterative loop as:
while(r_norm>tolerance && k<N_k)
{
MPI_Sendrecv( &(phi[1][1]), myN_y, MPI_DOUBLE, leftNeighbor, 0,
&(phi[myN_x+1][1]), myN_y, MPI_DOUBLE, rightNeighbor, 0,
Comm2D, &status);
MPI_Sendrecv( &(phi[myN_x][1]), myN_y, MPI_DOUBLE, rightNeighbor, 0,
&(phi[0][1]), myN_y, MPI_DOUBLE, leftNeighbor, 0,
Comm2D, &status);
MPI_Sendrecv( &(phi[1][1]), 1, strideType, bottomNeighbor, 0,
&(phi[1][myN_y+1]), 1, strideType, topNeighbor, 0,
Comm2D, &status);
MPI_Sendrecv( &(phi[1][myN_y]), 1, strideType, topNeighbor, 0,
&(phi[1][0]), 1, strideType, bottomNeighbor, 0,
Comm2D, &status);
for(i=myiStart; i<myiEnd; i++)
{
for(j=myjStart; j<myjEnd; j++)
{
phi[i][j] = (Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
+ phi[i][j-1] + phi[i+1][j]) / 4;
}
}
myr_norm = 0.0;
for(i=myiStart; i<myiEnd; i++)
{
for(j=myjStart; j<myjEnd; j++)
{
r[i][j] = Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
+ phi[i][j-1] + phi[i+1][j] - 4*phi[i][j];
myr_norm += r[i][j]*r[i][j];
}
}
MPI_Allreduce(&myr_norm, &r_norm, 1, MPI_DOUBLE, MPI_SUM, Comm2D);
r_norm = sqrt(r_norm);
k++;
}
where we can see that the basic steps within the while loop consist of first ex-
changing the ghost point data with neighboring processes, then performing a Gauss-Seidel
update, and finally computing the local and then the global residual norm.
The final issue we will consider here is writing out the data. In Example 21.2
we used the MPI file I/O functionality to create a single output file. Now, there’s no
reason why we couldn’t do the same thing here. The only difficulty is that it would
involve a bit more code to compute the correct locations in the file to write data
in order to have an output file that had the same structure as the grid. What we
will do in this case is use the second option that was mentioned in that example,
namely to have each process write out a file containing the data for its portion of
the grid. We will then leave it up to our Matlab post processing script to take care
of ‘reassembling’ the data. As such the code for writing out the data is fairly simple
and will take the form:
fstream file;
char myFileName[64];
...
sprintf(myFileName, "Example21_3_Process_%d_%d.data", myCoords[X], myCoords[Y]);
file.open(myFileName, ios::out);
for(i=1; i<myN_x+1; i++)
{
for(j=1; j<myN_y+1; j++)
{
file << phi[i][j] << "\t";
}
file << endl;
}
file.close();
...
where it can be observed that each process is simply writing out its portion of
the grid in a nested for loop. The only issue worth mentioning here is that each
process’s file will have a unique filename that will include the coordinates of the
process in the name. This will make it easier to ‘reassemble’ the grid in our post
processing script.
Figure 21.8: The converged solution to the PDE in Example 21.3 illustrating the
solution for N_x = 60 and N_y = 60.
Example 21.4:
In this example we will develop a C++ program to solve the 2D generic scalar
transport equation:
\dot{\phi} + \mathbf{v} \cdot \nabla \phi = \mu \nabla^2 \phi + \psi   (21.3)
in the domain x ∈ [0, 1], y ∈ [0, 1], t ∈ [0, 2], with boundary conditions φ(0, y) = 1,
φ(x, 0) = 1, ∂_x φ(1, y) = 0, ∂_y φ(x, 1) = 0, initial condition φ(x, y, 0) = e^{-50(x-0.3)^2} + 1,
and v = {0.5, 0.5}, µ = 0.01, and ψ = 0.2. For the spatial discretization we will
use the Finite Element method with linear triangular elements, for the temporal
discretization we will use the implicit Euler method, and we will solve the resulting
linear system with the Conjugate Gradient method. Furthermore, our program will
be parallelized with MPI. The intended learning outcomes for this example will be to
observe how we add in MPI code to an existing C++ program for parallel execution
and to investigate the decomposition of an unstructured grid.
With the parallel program developed in Example 21.3, the use of a structured
grid meant that it was relatively easy to break up a spatial domain and use the
concept of ghost points in order to exchange the discrete field data throughout the
solution process. With an unstructured grid, we could use this approach, but we
would need to store a list of which points in any given process's Points array are
its ghost points; or we may in fact use ghost elements instead. As it happens
we are not going to use ghost ‘anything’ in our parallel computation and will do
something entirely different. Our method is going to be very much tied to the use of
the conjugate gradient method for solving the resulting linear system of equations
at each time step.
The way in which we will break up our unstructured grid is to assign the elements
to each process (Figure 21.9(b)). As can be observed, we are going to choose nine
processes for our parallel computation, but of course in principle the code that
we develop could be applied to any number of processes. Now, the manner by
which we break up the computational domain is quite a complicated topic and is
beyond the scope of this example. For the interested reader however, the domain
was decomposed with the Metis [29] library.
Figure 21.9: A schematic of the partitioned 2D problem domain for the generic
scalar transport equation. The grey boxes show the portions of the grid that are
mapped to each process.
So, although the elements are uniquely
assigned to a process, certain points on the boundaries between processes will be
duplicated and we will have to handle this appropriately in our code. It can be
observed that this is a little bit different compared to the concept of ghost points,
because in that case, each process is responsible for a unique set of points where
the ghost points allowed for the finite difference quotients to be evaluated at the
boundaries of the process's portion of the grid. In this case, the solution at a point
on the boundary between two grids will in fact be computed by each process sharing
that point.
We will see how we handle this shortly, but for now, let’s first look at how
we will get this data into our parallel program. We are going to assume that the
unstructured grid has already been decomposed into a number of separate ascii
text files that include the intended process rank in their filenames, such as:
Example21_4.grid0
Example21_4.grid1
...
Example21_4.grid8
As a quick aside, it is important to bear in mind that quite commonly one would
deal with binary files for large scale computer simulations because they are typically
smaller in size and are quicker to read and write. We are using ascii text files
here because the human readability aids in the understanding of how the
numerical methods work. Each file will follow the same form as the ascii text
file defined in Example 15.3, the major difference now being the number of points,
faces, and elements in each text file. As before we will not make any assumptions
about the ordering of the points and faces in the file. The big difference here is that
for the points that are shared between processes we are going to define a new type
of boundary condition which we will call an interprocess boundary condition. We
shouldn't really think of this as being a fundamental type of boundary condition
like a Dirichlet or a Neumann boundary condition; it's really just a way in which we
keep track of which points are duplicated on other processes. It does, however, give us
a nice, elegant way in which to incorporate them into our existing file structure.
The contents of the file containing the portion of the unstructured grid for, say,
process 7 (Example21_4.grid7) will look something like:
N_p 133
N_f 20
N_e 223
N_b 4
Points
1.00000 0.00000
0.64286 0.00000
...
0.94855 0.33766
Faces
1 2
2 3
...
19 20
Elements
21 62 98
82 45 106
...
49 126 113
Boundaries
bottom
dirichlet
11
0 1 2 3 4 5 6 7 8 9 10
1.00000
right
neumann
10
10 11 12 13 14 15 16 17 18 19
0.00000
process7to6
interprocess
9
54 55 62 72 78 90 96 97 98
6
process7to8
interprocess
13
20 23 32 47 60 62 63 67 68 112 123 125 132
8
As can be observed in the Boundaries section of the file, there is a Dirichlet and a
Neumann boundary condition, defined in the same manner as in Example 15.3, plus
two interprocess boundary conditions. With Dirichlet and Neumann boundaries,
the indices define the points and the faces, respectively, to which the boundary
condition is applied. For an interprocess boundary the indices define the points that
are shared between one process and another, and the value that is given defines the
process with which the data is shared. The contents of the file containing the
portion of the unstructured grid for process 6 will look something like:
N_p 139
N_f 10
N_e 238
N_b 5
Points
0.28571 0.00000
0.32143 0.00000
...
0.33743 0.05103
Faces
0 1
1 2
...
9 10
Elements
47 21 48
13 44 57
...
71 109 138
Boundaries
bottom
dirichlet
11
0 1 2 3 4 5 6 7 8 9 10
1.00000
process6to4
interprocess
12
12 44 58 64 69 80 89 95 100 123 127 128
4
process6to5
interprocess
1
12
5
process6to7
interprocess
9
51 54 60 73 79 91 97 98 99
7
process6to8
interprocess
8
12 15 20 47 48 60 113 134
8
As can be observed, process 6 has an interprocess boundary, sharing nine points with
process 7, with indices 51 54 60 ... 99. These correspond with the interprocess
boundary for process 7, sharing nine points with process 6, with indices 54 55 62
... 98. An important point to note here is that the indices into each process's
Points array will be different because each process will have its own collection of
points, faces, and elements, with different ordering and numbering. As long as these
indices correspond to the same physical coordinates in each process's Points array
however, then this approach will work. So to emphasize this point, each process will
have a local numbering of its Points, Faces, and Elements arrays, but there will
also be an implicit global numbering. This is ‘almost’ all that is necessary in order
to handle the decomposed grid. Let’s just jump right in and look at the modified
read function, and then we'll discuss the one extra little complication that
we have to deal with. This function will look something like:
void read(char* filename, double**& Points, int**& Faces, int**& Elements, ...
Boundary*& Boundaries, int& myN_p, int& myN_f, int& myN_e, int& myN_b, ...
bool*& yourPoints, int myID)
{
fstream file;
string temp;
char myFileName[64];
int myMaxN_sp = 0;
int myMaxN_sb = 0;
int maxN_sp = 0;
int yourID = 0;
sprintf(myFileName, "%s%d", filename, myID);
file.open(myFileName);
file >> temp >> myN_p;
file >> temp >> myN_f;
file >> temp >> myN_e;
file >> temp >> myN_b;
Points = new double* [myN_p];
Faces = new int* [myN_f];
Elements = new int* [myN_e];
Boundaries = new Boundary [myN_b];
Points[0] = new double [myN_p*2];
Faces[0] = new int [myN_f*2];
Elements[0] = new int [myN_e*3];
yourPoints = new bool [myN_p];
...
}
A couple of points to note here are that firstly, each process will append its rank to
the string defining the file name, before it attempts to open that file. That way we
can run our program by passing in one argument (say Example21_4.grid) and the
code will append the rank so that the nine processes read in Example21_4.grid0,
Example21_4.grid1, . . ., Example21_4.grid8. Another thing to note is that we
are defining another 1D array called yourPoints, storing boolean values which record
whether this process or another process ‘owns’
them. This concept is analogous to the neighbour-owner concept applied in the
Finite Volume method in Chapter 14, where, although a face is shared between two
cells, we pick one of those cells to be the ‘owner’ of the face and one to be the
‘neighbour’ of the face. Here we will be assigning one of the two (or however many)
processes sharing the interprocess points to be the owner of them. We will see why
this is important shortly, but the approach that we will use to assign ownership
will be to give the shared points to the process with the highest rank; much as older
siblings always have more privileges than younger ones, the highest rank gets to own
the shared points. The final point to note is that we have declared
the variables myMaxN_sp, an integer storing the maximum number
of shared points on any given interprocess boundary, and myMaxN_sb, an
integer storing the number of interprocess boundaries that a given process has. These
will be used in order to allocate a buffer array to use in the MPI send and receive
routines. The rest of our read function will then look something like:
void read(char* filename, double**& Points, int**& Faces, int**& Elements, ...
Boundary*& Boundaries, int& myN_p, int& myN_f, int& myN_e, int& myN_b, ...
same as for the serial case in Example 15.3. So that was easy; but although the
serial code doesn’t need to be modified, we do need to realize that the systems of
equations on each process will in fact not be quite correct at this stage. To elaborate
on this point, consider the interprocess boundary between processes 2 and 5 depicted
in Figure 21.10. In particular we are going to focus on one shared point, which is
point 84 in process 2’s local numbering scheme and point 88 in process 5’s local
numbering scheme. When we think about how we assemble M, K, and s with the
Finite Element method, we loop over all of the elements and for each element loop
over all of the nodes that a given element uses, adding in terms to M, K, and s in the
rows and columns corresponding to those nodes. If we consider the two rows in K
and s for point 84 on process 2, then we can see that they will be missing the entries
from the three elements that are in process 5's portion of the grid. Conversely, the
two rows in K and s for point 88 on process 5 will be
missing the entries from the two elements that are in process 2's portion of the grid.
As a quick point to note, because M and K have the same pattern, we only need to
consider one to convey this issue.
Figure 21.10: A schematic illustrating the rows in the stiffness matrices and load
vectors corresponding to a point shared between two processes. The colored blocks
illustrate the contribution that each node makes in the system of equations for that
point and the grey blocks indicate a summation of terms from multiple elements.
So, when each process comes to solving its system of equations at each time
step, the equations for each shared point will be missing some terms and if we want
our algorithm to work, we need to share this data in order to make sure that the
shared point equations are the same on each process. So the question to ask is, do
we exchange the entries in these matrices? That does seem like a good way to do
things, but in fact we can get away with something even simpler. Again, this is very
much tied with our use of the conjugate gradient method. In order to build up to
how we make this work, let’s imagine that we have just completed the execution of
the assemble function and we have the incomplete M , K, and s. Similar to our MPI
programs in Examples 21.1 and 21.3 we will define a function called exchange that
we will call repeatedly throughout the program and we will pass this a 1D array,
representing a column vector of size Np × 1, which will look like:
void exchange(double* v, Boundary* Boundaries, int myN_b)
{
int yourID = 0;
int tag = 0;
MPI_Status status;
for(int b=0; b<myN_b; b++)
{
if(Boundaries[b].type_=="interprocess")
{
for(int p=0; p<Boundaries[b].N_; p++)
{
buffer[p] = v[Boundaries[b].indices_[p]];
}
yourID = static_cast<int> (Boundaries[b].value_);
MPI_Bsend(buffer, Boundaries[b].N_, MPI_DOUBLE, yourID, tag, MPI_COMM_WORLD);
}
}
for(int b=0; b<myN_b; b++)
{
if(Boundaries[b].type_=="interprocess")
{
yourID = static_cast<int> (Boundaries[b].value_);
MPI_Recv(buffer, Boundaries[b].N_, MPI_DOUBLE, yourID, tag, MPI_COMM_WORLD, &status);
for(int p=0; p<Boundaries[b].N_; p++)
{
v[Boundaries[b].indices_[p]] += buffer[p];
}
}
}
return;
}
In this function, the first thing we do is loop over all of the boundaries, and if the
boundary is an interprocess boundary we will loop over all of its points and copy
the data from the corresponding entries in the input vector v into the buffer
array that was defined in the read function. We will then send off this buffer to
the process with whom these points are shared. We will then again loop over all
of the boundaries and for the interprocess boundaries we will receive a buffer from
the processes with whom we are sharing the points with and add the data into the
corresponding rows in the input vector v. It is worth mentioning at this point that
when a process sends its data to the corresponding neighbor process, we are doing
so with a buffered send. The reason for this choice comes from the fact that if we
were to use a standard blocking send, then we could enter a ‘deadlock’ situation
where two processes are trying to send to each other at the same time and our
program will ‘hang’. An alternative would of course be to use non-blocking sends,
but in that case we could not use the one buffer array because we would not be
guaranteed that a process would have received the data stored in the buffer, before
we overwrite it with new data to send to another process. So an alternative here
is to use a buffered send where we can reuse the one buffer, but we must allocate
some additional memory for MPI to store this data until it can be exchanged. As
such, we will take a quick step back to our read function and add in the following
lines to the end:
void read(char* filename, double**& Points, int**& Faces, int**& Elements, ...
Boundary*& Boundaries, int& myN_p, int& myN_f, int& myN_e, int& myN_b, ...
bool*& yourPoints, int myID)
{
...
buffer = new double [maxN_sp];
bufferSize = (maxN_sp*sizeof(double)+MPI_BSEND_OVERHEAD)*myMaxN_sb;
MPI_Buffer_attach(new char[bufferSize] , bufferSize);
return;
}
Here we are allocating the buffer as before, but are now allocating a second array
to be used as the storage space. From examination of the bufferSize variable,
the term in the brackets defines enough bytes of memory to hold maxN_sp dou-
ble precision floating point numbers, plus the amount of memory required by the
buffered send routine (defined by MPI_BSEND_OVERHEAD). This number of bytes is
then multiplied by the number of interprocess boundaries, in order to make sure
that we have enough space to potentially store all of the interprocess data until it
has been exchanged.
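Although it is not shown here, a buffer attached with MPI_Buffer_attach would normally also be detached again before the program finishes (for example just before MPI_Finalize); a minimal sketch of that cleanup might be:
char* bsendBuffer;
int bsendBufferSize;
MPI_Buffer_detach(&bsendBuffer, &bsendBufferSize);   // blocks until all buffered sends have completed
delete [] bsendBuffer;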
Returning now to how we make sure that each process is solving the correct
system of equations, if in our code we have the two lines:
assemble(M, K, s, phi, Free, Fixed, Points, Faces, Elements, Boundaries, ...
myN_p, myN_f, myN_e, myN_b, myID);
exchange(s, Boundaries, myN_b);
then upon completion of the exchange function all processes will have the correct
entries in their load vectors corresponding to their shared points, because the func-
tion adds contributions to its own array from other processes. Now, because M
and K are also missing entries, we should do something similar, but in fact we will
leave the exchanging of this data for our solve function. As such, the core of our
algorithm will look like:
assemble(M, K, s, phi, Free, Fixed, Points, Faces, Elements, Boundaries, ...
myN_p, myN_f, myN_e, myN_b, myID);
A = M;
A.subtract(Delta_t, K);
A.multiply(AphiFixed, phi, Free, Fixed);
exchange(AphiFixed, Boundaries, myN_b);
exchange(s, Boundaries, myN_b);
...
for(int l=0; l<N_t; l++)
{
M.multiply(b, phi);
exchange(b, Boundaries, myN_b);
for(int m=0; m<myN_p; m++)
{
b[m] += Delta_t*s[m] - AphiFixed[m];
}
solve(A, phi, b, Free, Fixed, Boundaries, yourPoints, myN_b, myID);
write(file, phi, myN_p);
}
So, it can be observed that at this stage most of the code is the same as the serial
version of the algorithm with the exception of the exchanging of the vectors s and
b. We can now look at how we solve the global linear system of equations at each
time step with the conjugate gradient method. Let’s just dive right in and look at
the code:
void solve(SparseMatrix& A, double* phi, double* b, bool* Free, bool* Fixed, ...
Boundary* Boundaries, bool* yourPoints, int myN_b, int myID)
{
...
A.multiply(Aphi, phi, Free, Free);
exchange(Aphi, Boundaries, myN_b);
for(m=0; m<N_row; m++)
{
if(Free[m])
{
r_old[m] = b[m] - Aphi[m];
d[m] = r_old[m];
}
}
r_oldTr_old = innerProduct(r_old, r_old, Free, yourPoints, N_row);
r_norm = sqrt(r_oldTr_old);
while(r_norm>tolerance && k<N_k)
{
A.multiply(Ad, d, Free, Free);
exchange(Ad, Boundaries, myN_b);
dTAd = innerProduct(d, Ad, Free, yourPoints, N_row);
alpha = r_oldTr_old/dTAd;
for(m=0; m<N_row; m++)
{
if(Free[m])
{
phi[m] += alpha*d[m];
}
}
for(m=0; m<N_row; m++)
{
if(Free[m])
{
r[m] = r_old[m] - alpha*Ad[m];
}
}
rTr = innerProduct(r, r, Free, yourPoints, N_row);
beta = rTr/r_oldTr_old;
for(m=0; m<N_row; m++)
{
if(Free[m])
{
d[m] = r[m] + beta*d[m];
}
}
for(m=0; m<N_row; m++)
{
if(Free[m])
{
r_old[m] = r[m];
}
}
r_oldTr_old = rTr;
r_norm = sqrt(rTr);
k++;
}
return;
}
As can be observed we are using the exchange function a few times throughout the
duration of the algorithm. This in fact ties in quite nicely because the conjugate
gradient method involves the computation of the column vectors Aφ and Ad and we
know at this stage that A will not be quite correct because we haven’t exchanged the
data between processes. Rather than trying to exchange the entries in the matrix
itself, what we can do is exchange the vectors Aφ and Ad, which means less data to
exchange, and much simpler code. So this is in fact what we do. The only missing
piece of the puzzle now is the computation of the residuals and the search directions.
For this, we are going to define one more function called innerProduct, which will
be responsible for computing the correct global inner products that are used to
define the residual norm, α, and β. Before looking at the code for this, let’s just
take a moment to get a clear picture in our mind as to what’s going on. Because
each process has a portion of the grid (forgetting about the shared interprocess
points for a moment), then it will only have a portion of the global residual vector
(just as with the solution of the Poisson equation in Example 21.3). But we need
each process to have the correct global residual (or, more importantly, its two-norm)
in order to compute the same α and β. If we didn't, then we would find that
each process would use its own residual to compute the norm and then compute
different search directions and so the algorithm would be a complete mess! With
this in mind, the code for our innerProduct function takes as arguments two input
arrays (representing column vectors) and will look like:
double innerProduct(double* v1, double* v2, bool* Free, bool* myPoints, int N_row)
{
double myv1Tv2 = 0.0;
double v1Tv2 = 0.0;
for(int m=0; m<N_row; m++)
{
if(Free[m] && myPoints[m])
myv1Tv2 += v1[m]*v2[m]; // only points owned by this process contribute
}
MPI_Allreduce(&myv1Tv2, &v1Tv2, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
return v1Tv2;
}
So it can be observed that the for loop incrementing the inner product is fairly
straightforward, but we are using the myPoints array in order to decide whether or
not to include the values in the increment. If we didn’t do this, then the shared
points would be counted more than once in the computation of the residual vector
(and all of the other vectors). So this is why it was important to determine the
ownership of a shared point right back when reading in the grid on each process.
Finally, after each process computes its portion of the inner product we use an
MPI_Allreduce so that each process will have the correct global residual; each process
will then compute the same update for φ, the same new search directions, and so
on.
The complete program is given in Example21_4.cpp and Figures 21.11(a) -
21.11(d) illustrate the solution at a number of different time steps, with the por-
tions of the grid owned by each process shifted to highlight the distributed grid.
Figure 21.11: The solutions to the PDE in Example 21.4 illustrating the solution at
(a) t = 0, (b) t = 0.5, (c) t = 1.0 and (d) t = 1.5.
Now, just as one expects to have libraries to perform certain calculations such
as the cos or exp functions in the standard math library, there are libraries available
which can be used to solve large scale linear systems in parallel and thereby remove
a large portion of this ‘hassle’. One such example is the Portable, Extensible Toolkit
for Scientific Computation (PETSc) [41] which is designed for the parallel solution
of PDEs and abstracts (or hides) much of the details associated with distributing a
matrix and communicating data throughout a simulation. With a spectral method
on the other hand, the global nature means that to compute the discrete Fourier
transform at a point we need information throughout the entire domain, meaning
that we no longer have nearest neighbor communication and the concept of
ghost points no longer applies. It is therefore a bit more of a challenge to efficiently
compute a discrete Fourier transform in parallel when so much information needs to
be exchanged between processes. One example of a library which does this however
is the FFTW library [17].
As a final concluding remark it is worth emphasizing that as with OpenMP we
have only touched very briefly on the use of MPI and its application to solving PDEs.
There is much more functionality and many more performance issues that we have
not considered here, but can be found in either the full API specification [31] or [66].
Chapter 22
OpenCL
22.1 Concepts
The third API that we will investigate is the Open Computing Language API, which
is a framework for writing programs that execute across heterogeneous platforms
consisting of central processing units (CPUs), graphics processing units (GPUs),
digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other
processors or hardware accelerators. The goal is that programmers can develop
efficient, yet portable code that can use the OpenCL framework to detect devices and
compile portions of the code at run-time to execute on them. In order to provide
429
this infrastructure the setup of an OpenCL program can be more involved compared
to OpenMP and MPI, and will require an understanding of a few different conceptual
models that we will look at.
The platform version indicates the version of the OpenCL runtime supported.
This includes all of the APIs that the host can use to interact with the OpenCL
runtime, such as contexts, memory objects, devices, and command queues.
The device version is an indication of the device's capabilities, separate from the
runtime and compiler, as represented by the device info returned by clGetDeviceInfo.
Examples of attributes associated with the device version are resource limits and
extended functionality. The version returned corresponds to the highest version of
the OpenCL specification for which the device is conformant, but is not higher than
the platform version.
The language version for a device represents the OpenCL programming language
features a developer can assume are supported on a given device. The version
reported is the highest version of the language supported. OpenCL C is designed to
be backwards compatible, so a device is not required to support more than a single
language version to be considered conformant. If multiple language versions are
supported, the compiler defaults to using the highest language version supported
for the device. The language version is not higher than the platform version, but
may exceed the device version.
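As a concrete illustration, these three versions could be queried with the clGetPlatformInfo and clGetDeviceInfo functions as in the following sketch, assuming platformID and deviceID have already been obtained (as will be done in Example 22.1):
char versionString[256];
clGetPlatformInfo(platformID, CL_PLATFORM_VERSION, sizeof(versionString), versionString, NULL);
cout << "Platform version: " << versionString << endl;
clGetDeviceInfo(deviceID, CL_DEVICE_VERSION, sizeof(versionString), versionString, NULL);
cout << "Device version: " << versionString << endl;
clGetDeviceInfo(deviceID, CL_DEVICE_OPENCL_C_VERSION, sizeof(versionString), versionString, NULL);
cout << "Language version: " << versionString << endl;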
Figure: An example 2D NDRange of size G = [12, 12] divided into work-groups of
size W = [4, 4]; the labelled work-group has ID w = [1, 1].
Within the NDRange, the global ID [g_x, g_y] of a work-item is related to its work-group
ID [w_x, w_y], its local ID [l_x, l_y] within the work-group, the work-group size [W_x, W_y],
and the global offset [O_x, O_y] by:
\begin{bmatrix} g_x \\ g_y \end{bmatrix} =
\begin{bmatrix} w_x \times W_x + l_x + O_x \\ w_y \times W_y + l_y + O_y \end{bmatrix}   (22.1)
where it should be fairly obvious that the extension to 3D or the reduction to 1D
is trivial. The number of work-groups [M_x, M_y] can be computed as:
\begin{bmatrix} M_x \\ M_y \end{bmatrix} =
\begin{bmatrix} G_x / W_x \\ G_y / W_y \end{bmatrix}   (22.2)
and given a global ID and the work-group size, the work-group ID for a work-item
is computed as:
\begin{bmatrix} w_x \\ w_y \end{bmatrix} =
\begin{bmatrix} (g_x - l_x - O_x) / W_x \\ (g_y - l_y - O_y) / W_y \end{bmatrix}   (22.3)
A wide range of programming models can be mapped onto this execution model.
For example, one could imagine how algorithms using Finite Difference Methods in
1 − 3D could be executed on a device, where a work-item would be responsible for
some computation at a single grid point and the [i, j, k] grid index is mapped to a
[g_x, g_y, g_z] global index.
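As a small, hypothetical illustration of these relations, a kernel can recover all of the quantities appearing in equations (22.1) to (22.3) directly through the built-in work-item functions:
__kernel void whoAmI(__global int* groupIDs)
{
    const int gx = get_global_id(0);          // g_x
    const int lx = get_local_id(0);           // l_x
    const int wx = get_group_id(0);           // w_x
    const int Wx = get_local_size(0);         // W_x
    const int Ox = get_global_offset(0);      // O_x
    // By (22.1), gx == wx*Wx + lx + Ox, so in one dimension (22.3) is simply
    // wx == (gx - lx - Ox)/Wx.
    groupIDs[gx] = wx;
}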
From a practical point of view, every OpenCL program will need a number of
data structures (or ‘objects’ that are instantiations of these types):
• Devices: The collection of OpenCL devices to be used by the host, where each
device is represented by a cl_device_id object.
• Kernels: The OpenCL functions that run on OpenCL devices, where a kernel is
represented by a cl_kernel object.
• Program Objects: The program source and executable that implement the
kernels, where a program is represented by a cl_program object.
• Memory Objects: A set of memory objects visible to the host and the OpenCL
devices. Memory objects contain values that can be operated on by instances
of a kernel and are represented by cl_mem objects.
• Command Queues: The host creates this data structure to coordinate ex-
ecution of the kernels on the devices. The host places commands into the
command queue, which are then scheduled onto the devices within the con-
text. These objects are represented by a cl_command_queue.
• Contexts: Defined for a device and used in order to create programs,
command queues, and memory objects.
Depending on the particular application, there may be many of each of these objects,
but at a bare minimum, to have any OpenCL program that executes a kernel on a
device and returns the result of the kernel to host memory, there must be one of
each of these objects.
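To give a feel for how these objects relate to one another, the typical sequence of calls that creates one of each might look something like the following sketch (error checking is omitted and the variable names, kernel name, and buffer size are purely illustrative):
cl_platform_id platformID;
cl_device_id deviceID;
cl_int errorID;
const char* kernelSource = "/* kernel source code here */";
clGetPlatformIDs(1, &platformID, NULL);
clGetDeviceIDs(platformID, CL_DEVICE_TYPE_ALL, 1, &deviceID, NULL);
cl_context context = clCreateContext(NULL, 1, &deviceID, NULL, NULL, &errorID);
cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &errorID);
cl_program program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, &errorID);
clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "myKernel", &errorID);
cl_mem data_d = clCreateBuffer(context, CL_MEM_READ_WRITE, 1024*sizeof(double), NULL, &errorID);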
• Global Memory: This memory region permits read/write access to all work-
items in all work-groups. Work-items can read from or write to any element
of a memory object. Reads and writes to global memory may be cached
depending on the capabilities of the device.
• Constant Memory: A region of global memory that remains constant during
the execution of a kernel. The host allocates and initializes memory objects
placed into constant memory.
• Local Memory: A memory region local to a work-group. This memory region
can be used to allocate variables that are shared by all work-items in that
work-group. It may be implemented as dedicated regions of memory on the
OpenCL device. Alternatively, the local memory region may be mapped onto
sections of the global memory.
Figure: The OpenCL memory model, illustrating the private memory of the work-items
within each work-group, the global/constant memory of the compute device, and the
host memory of the host.
Deciding how to use these memory regions is one of the most important considera-
tions for designing efficient code, since these regions have different sizes and different
access speeds.
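To connect these regions to kernel code, each one has a corresponding address space qualifier that appears in a kernel's argument list; a small hypothetical sketch:
__kernel void scale(__global double* data,       // global memory: visible to all work-items
                    __constant double* coeffs,   // constant memory: read-only during execution
                    __local double* scratch)     // local memory: shared within one work-group
{
    const int i = get_global_id(0);              // i and l themselves live in private memory
    const int l = get_local_id(0);
    scratch[l] = coeffs[0]*data[i];
    barrier(CLK_LOCAL_MEM_FENCE);                // synchronize the work-group before scratch is reused
    data[i] = scratch[l];
}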
• cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index, size_t arg_size,
const void* arg_value);
Used to set the argument value for a specific argument of a kernel.
In contrast to OpenMP and similar to MPI, the compiler does not need to have
any support for OpenCL, rather it is just a process of linking the program to the
appropriate library. It must be the case however that the vendor of a particular
device (e.g. a GPU or a Xeon Phi) provides an OpenCL implementation. In order to
use the OpenCL functionality, a C++ program must include the header file cl.h (or
opencl.h on Apple systems) and will most likely need to link to the runtime library
with a compiler flag such as -lOpenCL (or -framework OpenCL on Apple systems).
Once the program has been compiled it can be executed in the same manner as
any other program. Generally, the size of the NDRange index space will be defined
by the problem (e.g. Finite Difference grid size, or size of a Matrix, etc), and the
maximum work-group size is a device specific parameter that may be queried using
the clGetDeviceInfo function. As such, the work-group size and the number of
work-groups is something that will most often be computed within the program,
which is conceptually a bit different to OpenMP and MPI.
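For a 1D problem of size N_x, that computation might look something like the following sketch (the strategy of rounding the global size up to a multiple of the work-group size is just one possible choice):
size_t maxWorkGroupSize;
clGetDeviceInfo(deviceID, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(size_t), &maxWorkGroupSize, NULL);
size_t localSize = maxWorkGroupSize;
size_t globalSize = ((N_x + localSize - 1)/localSize)*localSize;  // round N_x up to a multiple of localSize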
Example 22.1:
In this example we will develop a ‘Hello World’ example program with OpenCL to
illustrate the identification of platform and device IDs and use some of the runtime
library routines to obtain some useful information about them. The intended learn-
ing outcomes for this example will be to ‘get a feel’ for the structure of a program
that includes OpenCL code, so that in later examples where we are focussing on the
development of kernels, we will not have to elaborate on how platform and device
IDs are defined.
In order to begin, let’s just dive right in and look at the complete program:
int main(int argc, char** argv)
{
cl_uint N_Platforms;
cl_uint N_Devices;
cl_char dummy1[10240];
cl_uint dummy2;
size_t dummy3;
cl_ulong dummy4;
cl_device_fp_config dummy5;
clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
sizeof(dummy1), &dummy1, NULL);
cout << "\tName: " << dummy1 << endl;
clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
sizeof(cl_uint), &dummy2, NULL);
cout << "\tMax compute units: " << dummy2 << endl;
clGetDeviceInfo(devices[d], CL_DEVICE_MAX_WORK_GROUP_SIZE,
sizeof(size_t), &dummy3, NULL);
cout << "\tMax work group size: " << dummy3 << endl;
clGetDeviceInfo(devices[d], CL_DEVICE_MAX_MEM_ALLOC_SIZE,
sizeof(cl_ulong), &dummy4, NULL);
cout << "\tMax mem alloc size: " << dummy4 << endl;
clGetDeviceInfo(devices[d], CL_DEVICE_DOUBLE_FP_CONFIG,
sizeof(cl_device_fp_config), &dummy5, NULL);
cout << "\tDouble precision capability: " << dummy5 << endl;
}
}
return 0;
}
As the program begins we will use the clGetPlatformIDs function in order to query
the system for available platforms. In the first call, we pass in a NULL pointer for
the second argument and the address of the unsigned integer variable N_Platforms
as the third argument, which will be assigned the number of platforms found. We
can then create a static array of cl_platform_id objects and call the function a second
time, but this time passing in N_Platforms and the array as arguments, causing the
function to populate the cl_platform_id objects. We then loop over the platforms
found and for each one will print a message to the terminal saying “Hello world”
with the platform ID and call the function clGetPlatformInfo a few times, in order
to obtain some useful information that we will print to the screen.
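A sketch of that two call pattern is given below (the fixed array size of 10 is an arbitrary choice for illustration):
cl_platform_id platforms[10];
clGetPlatformIDs(0, NULL, &N_Platforms);          // first call: how many platforms are available?
clGetPlatformIDs(N_Platforms, platforms, NULL);   // second call: populate the platform IDs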
When we call clGetPlatformInfo, the first argument is the platform ID, the
second is an enumeration constant that defines what bit of platform information we
are after. The fourth argument is the address of the variable that we want to be
assigned with this information, and the third argument is the number of bytes that
this variable is represented by. For the platform information, everything that we
are interested in printing out is a string, so we can create a static char array called
dummy1 and pass this in. We can then do this three times to print out the platform
name, vendor, and OpenCL version supported.
Using a similar approach as we did to obtain the platform IDs we will use the
clGetDeviceIDs function in order to query the system for available devices. We call
the function once to get the number of devices, create a static array of cl_device_id
objects and then call the function again to populate the cl_device_id objects. We then
loop over the devices found and for each one will print a message to the terminal
saying “Hello world” with the device ID and call the function clGetDeviceInfo
a few times, in order to obtain some useful information that we will print to the
screen.
When we call clGetDeviceInfo the format is similar to when we were calling
clGetPlatformInfo, except that this time not all of the information is a string. For
this reason, we have a few dummy variables of different types (e.g. cl_uint, size_t)
and it is important that we use the right variable for the piece of information that
we are after (e.g. the maximum memory allocation size is represented by a cl_ulong,
an unsigned long int).
In contrast to other examples, the output of this program will depend on the
system on which it is run, but an example output would be something like:
Hello world from Platform 0 of 1
Name: Intel(R) OpenCL
Vendor: Intel(R) Corporation
Version: OpenCL 1.2 LINUX
Hello world from Device 0 of 2
Name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Max compute units: 1
Max work group size: 8192
Max mem alloc size: 67684369408
Double precision floating point configuration: 63
Hello world from Device 1 of 2
Name: Intel(R) Many Integrated Core Acceleration Card
Max compute units: 236
Max work group size: 8192
Max mem alloc size: 2017882112
Double precision floating point configuration: 63
It is important to note that these bits of device information will be used in order
to guide programs depending on where they are run. For example, we may
use CL_DEVICE_MAX_WORK_GROUP_SIZE to define what our work-group size will be
when we execute a kernel, or, if our code is using doubles, have the program abort
if CL_DEVICE_DOUBLE_FP_CONFIG is zero (implying that the device cannot support
double precision operations).
Example 22.2:
In this example we will develop a C++ program to solve the 1D first order wave
equation:
\frac{\partial \phi}{\partial t} + v \frac{\partial \phi}{\partial x} = 0   (22.4)
in the domain x ∈ [0, 10], t ∈ [0, 10], with boundary condition φ(0, t) = 1, initial
condition φ(x, 0) = e^{-5(x-3)^2} + 1, and v = 1.0. For the spatial discretization we will
use the Finite Difference method with second order central differences for the first
derivative and for the temporal discretization we will use the fourth order Runge-
Kutta method. Furthermore, our program will be parallelized with OpenCL. The
intended learning outcomes for this example will be to observe how we create an
OpenCL context, command queue, write and build a kernel, and allocate and copy
memory between the host and device.
In order to begin we will take the C++ program that was developed in Example
13.1 and add in the OpenCL code. The issues to consider here are that firstly, in order
to develop an OpenCL program, we will need to obtain a platform and device ID,
create a context from the device ID, create command queue from the context, build
a program from the context and our kernel source code, then create the kernels.
So, all of this is a fair amount of code, just to get started. The second issue is
to determine what parts of the overall code are parallelizable and can be put into
kernels. Then finally (and quite importantly), how do we avoid (or at least minimize)
the bottlenecks of moving data back and forth between host and device memory.
In order to avoid making our top level code too cluttered, we will create a function
called initialize() that will be responsible for identifying the platform and device,
creating the context, command queue, building the program and creating the kernel.
As you could imagine, although every OpenCL program needs these things in order
to function, there are many ways in which we could design the code to do this. So just
bear in mind that as a program becomes more complex, or requires more flexibility,
etc, one single function for initialization might not always be the optimal solution.
Nevertheless, in this case our function will look something like:
void initialize(cl_platform_id& platformID, cl_device_id& deviceID,
cl_context& context, cl_command_queue& commandQueue, cl_program& program,
cl_kernel& computeRHS, cl_kernel& computeTempPhi, cl_kernel& updatePhi, int D)
{
cl_uint N_Platforms;
cl_uint N_Devices;
const cl_uint N_Kernels = 3;
cl_int errorID;
const char* kernelSource[N_Kernels];
platformID = platforms[0];
kernelSource[0] = read("computeRHS.cl");
kernelSource[1] = read("computeTempPhi.cl");
kernelSource[2] = read("updatePhi.cl");
return;
}
As can be observed, the code for obtaining the platform and device IDs is similar
to Example 22.1, except that we are choosing the first platform (assuming that there
is one) and choosing the device based on an integer argument D to the function which
will select either devices[D] or, if that number is invalid, the last device. Following
this we create a context by calling the clCreateContext function, passing in our
chosen device ID, then we create a command queue with the clCreateCommandQueue
function, passing in the context and device ID. The next step involves creating a
cl_program object based on the source code for our kernels. Now, although we
haven’t actually discussed what kernels we are going to create, having seen Example
20.2 where we used multiple threads to evaluate the right hand side, compute the
variables in tempPhi and also to update phi at each time step, we will try the same
thing here, but create kernels to do so. More on this later however. For now we are
looking at how we create a program; to use the clCreateProgramWithSource
function, we need to pass in the number of kernels (three in our case) and a const
char** array defining the kernel code. So given that this function wants the code
input as an array of strings (similar to the way input arguments to a program are
defined in argv), there are actually a few options for doing so. One would be to
actually type the kernel code into an array:
const char* kernelSource[3] = {"type kernel 1 code", "kernel 2 here", "etc"};
but this gets a bit messy. A much better way is to code up a kernel in a separate
file (perhaps with a .cl extension) and create a function to read in the contents of
that ASCII text file and put it into a char array, which is exactly what we will do:
const char* read(const char* name)
{
ifstream file(name);
string sourceCode((istreambuf_iterator<char>(file)), istreambuf_iterator<char>());
size_t length = sourceCode.length();
char* buffer = new char [length+1];
sourceCode.copy(buffer, length);
buffer[length] = '\0';
return buffer;
}
Although we won’t dwell on this function too much (but rather just use it in our
future programs), all it is doing is opening the kernel source file, reading the contents
into a string, creating a new char array, copying the contents and appending the
null character to the end, then returning this array.
Once we have put this code into our kernelSource array we can then create a
cl_program object, passing in the context object, the kernelSource array and the
number of kernels. Then we can call the clBuildProgram function in order to
compile and link the kernel source code, and finally we can create our three cl_kernel
objects by calling the clCreateKernel function, passing in the program object and
the kernel name (which will need to match the name in the .cl file).
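Putting those steps together, the corresponding calls inside initialize might look something like the following sketch (error checking is omitted for brevity):
program = clCreateProgramWithSource(context, N_Kernels, kernelSource, NULL, &errorID);
errorID = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);
computeRHS     = clCreateKernel(program, "computeRHS", &errorID);
computeTempPhi = clCreateKernel(program, "computeTempPhi", &errorID);
updatePhi      = clCreateKernel(program, "updatePhi", &errorID);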
At this point we can now start to look at what our kernels might look like.
As was mentioned previously, in other examples where we have solved the 1D wave
equation, we have used for loops to visit every entry in our arrays and perform
some sort of operation (e.g. evaluating the right hand side or computing an entry
in a k array). To parallelize these sorts of algorithms we either divided up the for
loops across multiple threads with OpenMP, or spread the grid (which is essentially
the same thing as dividing up the for loops) across multiple processes with MPI.
One common approach for parallelization with OpenCL (but also with other similar
APIs such as NVIDIA CUDA) is to think of each processing element (work-item)
being responsible for a grid point. In this case we ‘do away’ with the for loops
and instead launch the kernel multiple times (one for each grid point) and use the
processing element’s ID to define which grid point to process. As such, our kernel
for evaluating the right hand side will look like:
__kernel void computeRHS(__global double* k, __global double* phi,
const double Delta_x, const double v, const int N_x)
{
const int i = get_global_id(0);
if(i<N_x)
{
...
}
return;
}
Similarly, our kernel for computing the intermediate values of φ passed into each
stage of the Runge-Kutta method will look something like:
__kernel void computeTempPhi(__global double* tempPhi, __global double* phi,
__global double* k, const double coeff, const int N_x)
{
const int i = get_global_id(0);
if(i<N_x)
{
tempPhi[i] = phi[i] + coeff*k[i];
}
return;
}
and our kernel for updating phi at each time step will look like:
__kernel void updatePhi(__global const double* k1, __global const double* k2,
__global const double* k3, __global const double* k4,
__global double* phi, const double Delta_t, const int N_x)
{
int i = get_global_id(0);
if(i<N_x)
{
phi[i] += Delta_t*(k1[i]/6 + k2[i]/3 + k3[i]/3 + k4[i]/6);
}
return;
}
The next thing is allocating memory on the device to store these arrays, then
copying the data across and setting kernel arguments. After that we will be pretty
much ‘good to go’ with our program. On the host, we have seen how to allocate
memory for our arrays using an approach like:
double* phi = new double [N_x];
To allocate a corresponding array on the device, we use the clCreateBuffer func-
tion as:
cl_mem phi_d = clCreateBuffer(context, CL_MEM_READ_WRITE, N_x*sizeof(double), NULL, &errorID);
Here the important things are that the array is associated with the context and in
this case the array is defined with CL_MEM_READ_WRITE meaning that its contents can
be read and written to (which is fairly obvious). Other options however, include
CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY, and CL_MEM_COPY_HOST_PTR if we want to
initialize the contents of a device array with a host array, in which case we pass in
a pointer to the host array as the fourth argument (instead of NULL as was done
above). Just as a quick note on terminology; sometimes for arrays representing
equivalent data on the host and device, people append _h and _d (or something along
those lines) to the arrays to make it obvious where the array is allocated. It is im-
portant to note that although it is in our host code that we use the clCreateBuffer
function, the actual memory is not accessible with a statement like phi_d[i] like it
would be for the array phi.
If we don’t initialize from a host array however we can use the clEnqueueWriteBuffer
function to copy the data across as:
clEnqueueWriteBuffer(commandQueue, phi_d, CL_TRUE, 0, N_x*sizeof(double), phi, 0, NULL, NULL);
which will copy the contents of phi to phi_d and if we want to do the reverse:
clEnqueueReadBuffer(commandQueue, phi_d, CL_TRUE, 0, N_x*sizeof(double), phi, 0, NULL, NULL);
to copy the contents of phi_d to phi. The only thing left is the setting of kernel
arguments and then the calling of kernels. To address the former, let's look at
how we call the function to evaluate the right hand side. As can be seen above, it
takes five arguments: __global double* k, __global double* phi, const double
Delta_x, const double v, and const int N_x. To set these arguments we will use
the clSetKernelArg function as:
clSetKernelArg(computeRHS, 0, sizeof(cl_mem), &k1_d);
clSetKernelArg(computeRHS, 1, sizeof(cl_mem), &phi_d);
clSetKernelArg(computeRHS, 2, sizeof(double), &Delta_x);
clSetKernelArg(computeRHS, 3, sizeof(double), &v);
clSetKernelArg(computeRHS, 4, sizeof(int), &N_x);
where it can be observed that the arguments to this function are the cl_kernel
object to which the arguments apply (computeRHS in this case), the position in the
argument list, the size of the argument and the argument itself. For the arrays it
can be observed that we are passing in the device arrays, but for the grid spacing,
velocity, etc, we are passing in the address of variables residing in host memory. A
couple of points to note here are that the RK4 method requires that we compute the
right hand side four times per timestep, each time passing in a different k array
to store the evaluation in. As can be seen above, we set kernel arguments of k1_d
and phi_d (which is for the first stage), but for later stages we would need to pass
in, say, k2_d and tempPhi_d for the second stage. The point is that we will need to
reset the kernel arguments before actually calling the kernel. For arguments which
don't change (such as N_x, v, etc) we don't need to reset those specific arguments.
Finally, in order to call a kernel, we can use the clEnqueueNDRangeKernel func-
tion as:
clEnqueueNDRangeKernel(commandQueue, computeRHS, 1, NULL, &G, NULL, 0, NULL, NULL);
Here, the important arguments are the command queue, the kernel, the dimension-
ality (i.e. 1 in this case) and the global work size G. For this simple example, if we
leave the sixth argument (which specifies the work-group size) as NULL, then the
OpenCL implementation will determine how to break the global work-items into
appropriate work-group instances.
Noting that the same approach is used for setting arguments, and executing
kernels we can omit the entire time marching loop (as it is now fairly lengthy) and
perhaps just look at the computation of the second stage of the RK4 method and
the update of φ. As such, the time marching loop would look like:
for(l=0; l<N_t-1; l++)
{
...
// k2
errorID = clSetKernelArg(computeTempPhi, 2, sizeof(cl_mem), &k1_d);
errorID |= clSetKernelArg(computeTempPhi, 3, sizeof(double), &Delta_tOn2);
errorID |= clEnqueueNDRangeKernel(commandQueue, computeTempPhi, 1, NULL, &G,
NULL, 0, NULL, NULL);
...
// Update phi
errorID = clEnqueueNDRangeKernel(commandQueue, updatePhi, 1, NULL, &G,
NULL, 0, NULL, NULL);
if(errorID!=CL_SUCCESS)
{
cerr << "Error updating phi" << endl;
exit(1);
}
errorID = clFinish(commandQueue);
Here we can see the execution of all three kernels, while only setting the arguments
that would change between executions (e.g. the k array before calling computeTempPhi
or computeRHS); the others can be set once outside of the time marching loop. Since
the arguments to updatePhi don't change at all (only the contents of the arrays
change from timestep to timestep), we can just call that kernel directly.
One last point worth mentioning is that in the code snippet above, the OpenCL
functions are returning a cl_int named errorID. This variable can be used to test
for the successful execution of a function call by checking whether or not the value of
the variable is equal to CL_SUCCESS, which is an enumerated integer equal to zero. All
of the other values that could be returned are negative integers that correspond to
specific problems (e.g. CL_DEVICE_NOT_FOUND=-1, CL_BUILD_PROGRAM_FAILURE=-11,
CL_INVALID_ARG_VALUE=-50 to name just a few of the 63 error codes). One 'good'
option would be to check every single function call and write out a specific error
message so that if things go wrong in the code, we can pinpoint the problem quickly
and easily. In this example however, for brevity we have applied a bitwise inclusive
or so that any non-zero error codes returned from a block of function calls will result
in errorID being a non-zero value.
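If one did want to go down the more verbose route, a small hypothetical helper function along the following lines could be used to wrap each call:
void checkError(cl_int errorID, const char* message)
{
    if(errorID!=CL_SUCCESS)
    {
        cerr << message << " (OpenCL error code " << errorID << ")" << endl;
        exit(1);
    }
}
which could then be called, for example, as checkError(clFinish(commandQueue), "Error finishing command queue");.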
The complete program is given in Example22_2.cpp with a Matlab script for
viewing the output of the program given in Example22_2Postprocessing.m. Fig-
ures 22.4(a) and 22.4(b) illustrate the solution at two different moments in time for
the case where ∆x=0.05 and ∆t=0.02.
Figure 22.4: The solutions to the PDE in Example 22.2 illustrating the solution at
(a) l = 0 and (b) l = 200 for the combination ∆x = 0.05 and ∆t = 0.02.
Example 22.3:
Part V
Applications
Bibliography
[66] W. Gropp, E. Lusk, and A. Skjellum. Using MPI - Portable Parallel Pro-
gramming with the Message-Passing Interface. The MIT Press, Cambridge,
Massachusetts, 1999.
[72] R. Peyret. Spectral methods for incompressible viscous flow. Springer, New
York, 2002.
[76] O. C. Zienkiewicz and R. L. Taylor. The finite element method for solid and
structural mechanics. Elsevier Butterworth-Heinemann, Oxford ; Burlington,
MA, 6th edition, 2005.