Applied Numerical Methods
Steve Moore
October 6, 2016
Contents
Preface
I Systems of Equations
1 Introduction
2 Direct Methods
3 Iterative Methods
20 OpenMP
21 MPI
22 OpenCL
V Applications
Bibliography
Nomenclature
Glossary
Preface
Differential equations are applied everywhere in science and engineering. They can
be used to describe many physical phenomena ranging from the smallest scales,
where quantum mechanics is required, through the everyday scales, where Newtonian and continuum mechanics apply, up to the largest scales where relativistic
mechanics is required. For the most part, we can only obtain analytical solutions
to the simplest of these differential equations and have to turn to numerical methods implemented in computer programs to obtain numerical solutions to the rest.
With the ever increasing power of computers and modeling software, this area is
continually growing and so an understanding of numerical methods and how they
are applied is of great importance.
The aim of this book is hence to provide a basic introduction to a range of numerical methods commonly used in practice and to show how they are applied in
solving ‘real world’ problems. The end goal is that one should have sufficient understanding to develop a parallel program, designed to run on a supercomputer, that
can solve a problem in electromagnetics, fluid, or solid mechanics, thermodynamics,
or molecular dynamics.
In order to reach this end goal the structure of this book has been designed
such that we gradually ‘build up’ our understanding of numerical methods and
programming ability. In Part I we will spend some time studying methods for solving
linear systems of equations, with a subsequent extension to nonlinear systems. The
reason for this is twofold; firstly it will give us a chance to create some simple
programs and ‘get a feel’ for the basic structure of scientific software. Secondly, the
solution of differential equations frequently reduces to finding the solution of a linear
system of equations, so we will need to know these methods in order to proceed. In
Part II we will spend some time studying methods for solving ordinary differential
equations. Now while these methods can be applied to solving many real world
problems directly, finding the solution of a partial differential equation frequently
reduces to finding the solution of a system of ordinary differential equations, so it is
important that we have this understanding in place before we move on. In Part III
we will spend some time studying methods for solving partial differential equations.
Again, these methods can be applied to solving many real world problems, but what
we want to be able to do is develop parallel programs, and before we can do this, we
have to have a solid understanding of these methods. In Part IV we will spend some
time studying some methods by which we can turn the serial programs developed
up until that point, into parallel programs that can be executed on a supercomputer.
Finally, once we have all of these tools in place, Part V will work through a number
of applications, outlining the relevant physics and deriving the governing differential
equations, then developing a parallel program to perform some sort of simulation.
The various parts of this book will cover quite a wide range of topics in the mathematical, physical, and computer sciences. It needs to be emphasized however that
this book is not designed to provide an in depth coverage of any of these topics,
or particular numerical methods. To elaborate on this point, this is not a book on
linear algebra, nor is it a book on partial differential equations, nor is it a book
on parallel computing or a book on continuum mechanics. In fact for most of the
chapters present in this book, one could find entire books (or in some cases a series
of books) dedicated to the subject. This book is not aiming to compete with other
texts; and in fact references will be given to some of the more detailed texts whenever appropriate. Each chapter will only provide the ‘bare minimum’ necessary to
get a computer program ‘up and running’. What this book will do however, is show
you how all of these areas are applied together and implemented in the form of a
working computer program. So perhaps the best way to think of this book is as a supplementary text that ties all of these other topics together, a bit like a ‘sampler
pack’ with an assortment of methods.
Since we will be developing computer programs we will have to choose a pro-
gramming language with which to do so. While the disadvantage is that we will be
restricting ourselves somewhat, the advantage is that we will get the most detailed
‘nuts and bolts’ understanding of how the methods really work. The question is
then which language do we choose? As it happens we will use two different lan-
guages, namely Matlab and C++. Now although it sounds like we’ll be doubling the
amount of effort required, there is a good reason for using both of these program-
ming languages. The reasons for using Matlab are that firstly, the programming
environment is relatively easy to use; and we can create a prototype program implementing a numerical method relatively quickly. The most important reason however,
is that the nature of the language means that we can often write code that closely
resembles the mathematics, which will therefore aid in our understanding of how it
is implemented. Having pointed out these desirable features, one might then ask
the question, why should we use C++? Again, we have some good reasons for using
this programming language too. For starters we can generally create C++ programs
that execute much faster and perform larger calculations than a Matlab program
and furthermore, when we come to developing parallel programs, the methods that
we will be using won’t work with Matlab, but can be integrated with C++ code.
Another good reason for choosing C++ is that it is a ‘lower level’ language that exposes us to some issues (memory allocation for example) that we generally don’t
have to worry about with Matlab. Now although this might seem like we’re just
making life more difficult by getting ‘closer to the computer’, it is important that
we have a reasonable understanding of how computers work. So that being said, for
some numerical methods we encounter we will create Matlab programs, others we
will create C++ programs, and some we will create both, so that we can compare the
features of each language. No matter how we do it, there’s no escaping the fact that
we will have to spend some time looking through code and while you may find this
a painful and daunting prospect, this is really the best way to get an understanding of how numerical methods work. As it happens however, these two languages
are syntactically quite similar in a number of ways, so translating between the two
should not be too difficult. An important point to always remember throughout
this book is that there are many ways in which a computer program can be written
to implement a given numerical method. Some ways are better than others, some
make no difference at all. We will be taking the approach of picking one way so that
we get the job done. On this point, we will also have to strike a bit of a balance
between writing code that is efficient and writing code that is easy to understand.
Often these two goals are mutually exclusive; and if one wants highly optimized
code, it may end up looking nothing like the numerical method that it’s supposed
to be implementing; even though it is. In our case, with the Matlab examples we
will be mostly concerned with ease of understanding and with the C++ examples we
will be more concerned with efficiency.
Given the outline of this book, the assumed prerequisite background knowledge
will include some basic linear algebra and calculus, plus some familiarity with Matlab
and C++. All of the programs that we will create will be designed such that they can
be executed with little to no input required from the user and as we implement the
numerical methods we will walk through the process explaining the various steps
and the reasons for doing so. For anyone already familiar with C++ you will notice
that our use of the language will include minimal use of the object oriented concepts
(which are in fact a key feature of the language). The reason for this is to not
complicate the implementation of a numerical method by ‘wrapping it up’ in object
oriented code; but we will instead create a small number of classes when it simplifies
a particular program. As a final point, a desired feature of this book is that one
can get the programs up and running without having to buy anything. For users of
Mac, Windows, or Linux, there are free C++ compilers out there which can turn the
C++ source codes provided with this book into executable programs. Matlab on the
other hand is a commercial product that one must pay for and so to this end all of
the Matlab source codes are designed such that they can be executed with Octave,
which is free.
The suggested approach for reading this book is of course, first and foremost
in the order it was written in. If however you already have some familiarity with
any one of the topics you can skip ahead and backtrack as necessary. At each step
throughout the book, references will be made to the previous ‘building blocks’ in
order to make this as easy as possible. Having now outlined the basic structure of
the book and the reason behind this choice of structure we are now in a position to
begin.
Figure 1: Some example applications in science and engineering which involve the
solution of differential equations.
Part I
Systems of Equations
Chapter 1
Introduction
In this part of the book we are going to investigate a number of different methods
for solving systems of algebraic equations. We will first look at a number of direct
methods, where the number of computational operations required to solve a system is
fixed and predetermined by the nature of the system. We will then look at a number
of iterative methods, which involve ‘guessing’ the solution and iteratively refining
that guess, in which case the number of operations is not fixed. Then, finally we will
look at how we extend these techniques to solving systems of nonlinear algebraic
equations. As we do this we will create some simple Matlab and C++ programs so
that we ‘get a feel’ for the common structure of scientific software. Before we begin
investigating these methods however we need to outline some terminology and make
some definitions that will be used throughout the book.
The most general form for a linear system of equations can be expressed as:
Aφ = b
or, written another way:
\sum_{n=1}^{N} a_{m,n} \, \phi_n = b_m
where we have:

A = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,N} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{M,1} & a_{M,2} & \cdots & a_{M,N} \end{bmatrix} \quad
\phi = \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_N \end{bmatrix} \quad
b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_M \end{bmatrix}
We will try to keep our notation as consistent as possible throughout the book. With that in mind, remember that we will use
the subscripts m and n to denote indices into matrices and vectors, that will have
dimensions M and N . Later on, as our programs become more complex and we
reserve M and N for the definition of different quantities, we will instead use Nrow and
Ncol to mean the same thing. Most of the time, the number of equations M will be
equal to the number of unknowns N so that we have a square matrix meaning that
we can find a unique solution for φ. In principle, we can solve our linear system as:
\phi = A^{-1} b
but, as we shall see, we pretty much never attempt to directly compute
the inverse of a matrix in practice, since it generally requires a prohibitive number
of computational operations to do so; instead we look for other ‘smarter’ ways
to solve the system. There will also be a few times throughout this book where
M > N , meaning that we have an over-determined system with more equations
than unknowns. We will cross this bridge when we come to it, but for now, let’s
just say that what we do in this case is to obtain a solution which is the ‘best fit’
for all of the unknowns. If M < N , meaning that we have an under-determined
system with more unknowns than equations then there will be no unique solution,
but fortunately the differential equations that we will encounter in this book will
always result in a square matrix with a unique solution.
Continuing with the definitions, a matrix is termed symmetric if AT = A (i.e.
am,n = an,m ) and skew symmetric if AT = −A (i.e. am,n = −an,m ). A matrix is
termed orthogonal if AT = A−1 and is termed Hermitian if A = AH . Here, the
superscript H denotes the conjugate transpose, which is equivalent to transposing
the matrix, then taking the complex conjugate of each entry (i.e. replacing i with
−i in any of the matrix entries which happen to be complex numbers). Finally, a
matrix is termed unitary if AAH = AH A = I, where I denotes the identity matrix.
Another very important classification is that we call a square matrix positive definite
if:
v^T A v > 0 \quad \forall\, v \neq 0
So for any N × 1 vector v not composed entirely of zeros, the above product will
yield a real number greater than zero. A more useful definition of a positive definite
matrix is one that has all eigenvalues greater than zero. The concept of positive
definiteness plays an important role in certain methods that we will cover later. We
call a matrix an M-matrix if:
a_{m,n} \leq 0 \quad \forall\, m \neq n

(A^{-1})_{m,n} \geq 0 \quad \forall\, m, n

and A is nonsingular. Here, all of the off-diagonal entries that are not zero are
negative. Another important property of a matrix that will be important later is its diagonal dominance. A matrix is termed diagonally dominant if, in every row, the entry on the main diagonal is greater than the sum of the absolute values of the off-diagonal entries in that row. The excess amount:
\min_{m} \left( a_{m,m} - \sum_{n \neq m} |a_{m,n}| \right)
is called the diagonal dominance of the matrix. This property will have important
implications when we come to investigating iterative methods for solving linear
systems.
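As a small concrete illustration (a sketch, not taken from any program developed later in the book), the diagonal dominance of a dense N × N matrix stored row by row in a 1D array could be computed in C++ as:

#include <cmath>

// Returns min over m of (a_mm - sum of |a_mn| for n != m) for a square
// N x N matrix stored row by row; a positive result indicates dominance.
double diagonalDominance(const double* A, int N)
{
    double dominance = 0.0;
    for(int m=0; m<N; m++)
    {
        double offDiagSum = 0.0;
        for(int n=0; n<N; n++)
        {
            if(n != m) offDiagSum += std::fabs(A[m*N + n]);
        }
        double excess = A[m*N + m] - offDiagSum;
        if(m == 0 || excess < dominance) dominance = excess;
    }
    return dominance;
}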
With vectors we can define a number of quantities called norms. These norms
are a single number and in a sense describe a vector. More formally, they are
functions that assign a strictly positive ‘length’ or ‘size’ to a given vector. Perhaps
the simplest of these is the one norm, defined as:
\|v\|_1 = \sum_{n=1}^{N} |v_n|
which is the sum of the absolute values of all of the entries in v. Another norm is
the p norm, defined as:
\|v\|_p = \left( \sum_{n=1}^{N} |v_n|^p \right)^{1/p}
For the case where p = 2, we would call this the two norm, which can be observed to be analogous to Pythagoras’ theorem in N dimensions. The final vector norm that we will consider is the infinity norm, defined as:

\|v\|_\infty = \max_{1 \leq n \leq N} |v_n|

which is the largest absolute value of any entry in v. Norms can also be defined for matrices. One important category is the induced (or operator) norm, which is defined in terms of a vector norm as:

\|A\|_p = \max_{v \neq 0} \frac{\|A v\|_p}{\|v\|_p}
If we let p = 1 then we get the one norm:
\|A\|_1 = \max_{1 \leq n \leq N} \sum_{m=1}^{M} |a_{m,n}|
which is the maximum absolute column sum (for each column we sum the absolute values of the entries and take the largest such sum).
If we let p → ∞ then we get the infinity norm:
\|A\|_\infty = \max_{1 \leq m \leq M} \sum_{n=1}^{N} |a_{m,n}|
which is the maximum absolute row sum (the same operation applied to the rows rather than the columns of the matrix). Another
category of matrix norms is known as an entrywise norm and we can again define a
p norm as:
\|A\|_p = \left( \sum_{m=1}^{M} \sum_{n=1}^{N} |a_{m,n}|^p \right)^{1/p}
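As a small concrete illustration (a sketch, not taken from any program in the book), the matrix one and infinity norms defined above could be computed for a dense M × N matrix stored row by row in a 1D array as:

#include <cmath>

// One norm: maximum absolute column sum; infinity norm: maximum absolute row sum.
void matrixNorms(const double* A, int M, int N, double& norm1, double& normInf)
{
    norm1 = 0.0;
    normInf = 0.0;
    for(int n=0; n<N; n++)              // column sums for the one norm
    {
        double colSum = 0.0;
        for(int m=0; m<M; m++) colSum += std::fabs(A[m*N + n]);
        if(colSum > norm1) norm1 = colSum;
    }
    for(int m=0; m<M; m++)              // row sums for the infinity norm
    {
        double rowSum = 0.0;
        for(int n=0; n<N; n++) rowSum += std::fabs(A[m*N + n]);
        if(rowSum > normInf) normInf = rowSum;
    }
}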
Similar to vectors, these definitions can be quite useful. In particular, we can define the condition number of a matrix as:

\kappa(A) = \|A\| \, \|A^{-1}\|
Figure 1.1: Some patterns found in matrices that are either defining the system
being solved, or used in the solution of the system. The dots indicate where an
entry in the matrix is nonzero and blank space illustrates where the entries are zero.
(a) illustrates a full matrix (b) illustrates an upper triangular matrix (c) illustrates a
lower triangular matrix (d) illustrates a diagonal matrix (e) illustrates a tridiagonal
matrix (f) illustrates a pentadiagonal matrix (g) illustrates a sparse matrix which
could be either symmetric or skew symmetric (h) illustrates a non-symmetric sparse
matrix and (i) illustrates a matrix with no definable structure.
Finally, for completeness it is worth mentioning that this concept extends to multi-
dimensional arrays where we could in Matlab have something like:
A = zeros(M,N,O,P,Q,R);
An important point to note is that in Matlab we can often ‘get away’ with not
explicitly allocating memory to store our array before we start adding entries into
it, but this comes at a price; namely that Matlab will constantly have to resize the
array, which will greatly reduce the execution speed of the program. In C++ we don’t
have the luxury of arrays being automatically resized for us and must always ensure
that we have allocated memory for our array. Memory allocation is a slightly more
complex issue in C++ than it is in Matlab, with many different ways of allocating
memory for a given array. The main considerations are the method of memory
allocation, whether or not the array is contiguous in memory, and how we want to
be able to access entries in the array. There are in fact two ways in which we can
allocate memory; the first of which is known as static allocation and is achieved for
a 1D array as:
float A[M];
Finally, for completeness it is worth mentioning that this concept extends to multi-
dimensional arrays where we could in C++ have something like:
float A[M][N][O][P][Q][R];
float A[M];
float A[M][N];
float A[M][N][O];
Figure 1.2: A schematic illustrating the creation of three arrays of increasing dimensionality using static memory allocation. The blue boxes indicate the bytes required to store a single floating point number (which will most likely be 4) which are contiguous in memory in an area known as the stack. Note that there is no dimensionality associated with the memory itself, just blocks with increasing addresses, so the 3 × 9 layout for the 3D array has no meaning other than to illustrate a contiguous block that will fit on the page.
Figure 1.2 illustrates the creation of these three different arrays and where the bytes
(that store the floating point numbers) are actually located in memory. The
important points to note are that for static memory allocation we are guaranteed
that the array entries will be contiguous in computer memory in an area known
as the stack . It is worth mentioning at this stage that these techniques are not
dependent on the data type at all and would be the same for integers (int , long
, unsigned int , etc) or double precision floating point numbers.
The second way is known as dynamic allocation which is achieved for a 1D array
as:
float* A = new float[M];
where we can access entries with square brackets just as we could for a statically
allocated 1D array as:
a_m = A[m];
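A 2D array can be allocated dynamically in a similar way; a minimal sketch of one common way of doing this, following the pointer-to-pointer approach described in the text below, is:

float** A = new float*[M];
for(int m=0; m<M; m++)
{
    A[m] = new float[N];    // one contiguous block of N floats per row
}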
where we can access entries with square brackets just as we could for a statically
allocated 2D array as:
a_mn = A[m][n];
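Similarly, a minimal sketch of one common way of dynamically allocating a 3D array is:

float*** A = new float**[M];
for(int m=0; m<M; m++)
{
    A[m] = new float*[N];
    for(int n=0; n<N; n++)
    {
        A[m][n] = new float[O];   // one block of O floats per (m, n) pair
    }
}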
where we can access entries with square brackets just as we could for a statically allocated 3D array as:
a_mno = A[m][n][o];
The dynamic memory allocation snippets look more complex than their static counterparts, so it’s worth explaining these, with the aid of Figure 1.3. Each time we use
the new operator in C++ we are ‘asking’ for a single contiguous block of memory to
store a particular type of data; and assuming that the operating system is able to
fulfill this request we are returned a pointer to where this block of memory begins
in an area of memory known as the heap. Now a pointer variable is much like any
other variable except that what it stores is a memory address of something, and in
the context of dynamic memory allocation, the pointer stores the base address where
the block of memory begins. As an analogy think of the computer’s memory as a
long street full of houses. Each house has its own letterbox, with a unique number
analogous to a memory address. The contents of the house are analogous to the data
that is stored at that memory location (it could be an integer or a floating point
number for instance).
Considering first the dynamic allocation of a 1D array, we create a variable of
type float * which is stored on the stack and is a ‘pointer to a float’ (i.e. storing
the address of a floating point number). For a 2D array it can be observed that the
variable is instead of type float ** which is a ‘pointer to a pointer to a float’. So
to store our 2D array we first allocate one 1D array on the heap (of type ‘pointer to
float’) and then in a for loop allocate multiple 1D arrays to store the actual floating
point numbers. When we use the notation A[m][n] the first square brackets ‘get’ the
address m entries along from the location that A points to, then the second square
brackets ‘get’ the value n entries along from that location. Finally, for a 3D array
this concept is extended even further such that the variable is of type float ***,
which is a ‘pointer to a pointer to a pointer to a float’ and we go through a sequence
of nested for loops to ultimately store the floating point numbers. As a final point
to note, the number of bytes required to store any sort of pointer is dependent on
the system itself (i.e. either 4 or 8 bytes depending on whether it’s a 32 or 64 bit system) and
is the same for a float *, float **, float ***, int *, int **, double *, etc. If
we were storing double precision floating point numbers instead of single precision
floating point numbers we would require 8 bytes per number rather than 4, but the
pointers would be the same either way.
Given the extra complication involved in allocating memory dynamically in C++
you might well be wondering why bother when we can allocate arrays statically
and avoid the for loops and also having to deallocate them. Well, two reasons;
number one, with static memory allocation we need to know how big the array will
be at ‘compile time’, i.e. when we turn our source code into an executable program.
Sometimes this is not a problem; other times we have no way of knowing how big
our array needs to be until the program is running. So in this case dynamic memory
allocation is our only option. The second reason is that the stack is quite a small area
of memory (generally somewhere around 8 MB) and we can generally allocate much
bigger blocks of memory dynamically. So as we solve bigger and bigger systems of
equations ‘say’ dynamic memory allocation becomes a more attractive option for
this reason too.
Sometimes it can be advantageous to have our arrays allocated contiguously in
memory. While not such an issue initially this will become more important as we
consider parallel programming. For a 1D array this will be the case, but for the 2D
and 3D arrays we can see that the actual bytes of memory storing the floating point
numbers are scattered throughout the memory space. In addition to just having
memory allocated contiguously there is also an overhead in allocating each separate
block of memory, so it will be more efficient to allocate one big block of memory as
opposed to many smaller blocks. One way of allocating a 2D array contiguously in
memory would be:
float* A = new float[M*N];
but now we can no longer access entries in the same way. In fact given the two
indices m and n we must ‘map’ these to a single index as:
a_mn = A[m*N + n];
or:
a_mn = *(A+m*N + n);
where the * is the dereferencing operator and the meaning of this statement is; go
to the memory location pointed to by A, count along m*N + n entries, and return the value stored at that location.
Figure 1.3: A schematic illustrating the creation of three arrays of increasing dimensionality using dynamic memory allocation. The colored boxes indicate the data type. Note that although each call to the new operator allocates a contiguous block of memory, the overall array itself is not completely contiguous in memory. There is no dimensionality associated with the memory itself, just blocks with increasing addresses, so the 3 × 9 layout for the 3D array has no meaning other than to illustrate a contiguous block that will fit on the page.
For a 3D array the equivalent way of allocating the memory would be:
float* A = new float[M*N*O];
where we can access entries as:
a_mno = *(A+m*N*O+n*O+o);
So although we’ve now allocated our memory contiguously, we’ve lost the ‘nice’
way of indexing into arrays that mimics the math. Is there a way in which we can
allocate our memory contiguously with a minimal number of calls to new and retain
the square brackets for our access method? As it turns out there is and the code to
do so for a 2D array looks like:
float** A = new float*[M];
A[0] = new float[M*N];
for(int m=1, mm=N; m<M; m++, mm+=N)
{
A[m] = &A[0][mm];
}
where we can access entries with square brackets just as we could for a statically
allocated 2D array as:
a_mn = A[m][n];
So in this case we have two allocations, we have contiguous memory allocated for
our float ’s and we can use the square brackets to index entries. The important
point is that the large block of size M × N is allocated and the pointer assigned in
the first entry of A. After that, rather than allocating new blocks for the different
entries in A we instead just assign the addresses from the one large block that we’ve
allocated (Figure 1.4). We can extend this concept to a 3D array as:
float*** A = new float**[M];
A[0] = new float*[M*N];
A[0][0] = new float [M*N*O];
for(int m=0, mm=0; m<M; m++, mm+=N)
{
A[m] = &A[0][mm];
for(int n=0, nn=(m*N*O+n*O); n<N; n++, nn+=O)
{
A[m][n] = &A[0][0][nn];
}
}
where we can access entries with square brackets just as we could for a statically
allocated 3D array as:
a_mno = A[m][n][o];
Figure 1.4: A schematic illustrating two different ways that a 2D array can be
allocated dynamically.
An important point that one must remember when using dynamic memory allocation is that it is up to the programmer to deallocate (or free up) the memory
that is allocated with the new operator. Now, if we have a simple program that
allocates some memory, performs some computations, then ends, all of the memory
is reclaimed by the operating system when the program terminates. So in some
cases it’s not a ‘big deal’ if we don’t free up the memory. It is however probably bad
practice to get into this habit, and in fact it can cause a problem known as memory
leaks. To illustrate this problem, imagine that we have a program that is constantly
allocating blocks of memory. If we don’t free up the memory then eventually there
will come a point where, as far as the operating system is concerned there won’t be
any more memory that it can allocate and so most likely our program will crash.
The important step to remember (which we will always do throughout this book) is
whenever we’re finished with an array, we deallocate the memory with the delete
operator, which is achieved for a 1D array as:
delete [] A;
It is important to remember here that we delete the memory in the reverse order that we allocated it. Otherwise, if we deleted the array of pointers first, we would no longer have access to the addresses of the individual blocks of float in order to delete them; a minimal sketch for the 2D arrays allocated earlier is shown below.
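For example (a sketch, assuming the pointer-per-row 2D array from earlier with M rows):

for(int m=0; m<M; m++)
{
    delete [] A[m];   // delete each row of floats first
}
delete [] A;          // then delete the array of row pointers

For the contiguous version allocated above, the equivalent is to call delete [] A[0]; first and then delete [] A;.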
As it happens, we can also free up memory in Matlab too, and we can achieve this for a vector or a matrix (or in fact any data object) as:
clear A;
However, we will only really be using this function in our Matlab programs to
remove all of the variables in the ‘workspace’ with the clear all; command, rather
than applying it to specific data objects. One issue that current Matlab and C++
programmers will be aware of, but others may not, is the fact that in addition to
having different syntax and ways to allocate and free memory, Matlab uses one
based indexing, whereas C++ uses zero based indexing. All this means is that in a
C++ array, the first entry is given the index 0, rather than 1. This is usually not
a big issue, we just have to remember when translating a program from Matlab to
C++ to ‘knock’ 1 off each array index that we make.
Another issue that is very important when developing programs implementing
numerical methods is round-off error , namely the limited precision to which comput-
ers can represent real numbers. Generally a single precision floating point number
(a float type in C++) uses four bytes to store a number, a double precision floating
point number (a double type in C++) uses eight bytes, and if we’re really concerned
we can use a long double in C++, which on most platforms provides extended or quadruple precision and uses up to 16 bytes. Now we would normally want our programs to produce as accurate a
result as possible and so the use of the most precise numbers possible may seem
appropriate. The trade off however is that this could require up to four times the
amount of memory. For small simple programs this will not really be a big issue, but
for large scale parallel programs memory can become important. Throughout this
book we will strike a balance between precision and memory and use the double
type for storing real numbers. One other common technique that could strike a rea-
sonable balance between accuracy and memory use is to use long double for ‘say’
variables that involve repeated operations such as summation, subtraction, multi-
plication, division, etc, but truncate and store the resulting array entries in a lower
precision.
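As a small illustration of round-off error (a sketch, not tied to any particular program in this book), repeatedly accumulating the value 0.1, which has no exact binary representation, shows the difference between single and double precision:

#include <cstdio>

int main()
{
    float  sumSingle = 0.0f;
    double sumDouble = 0.0;
    for(int i=0; i<1000000; i++)
    {
        sumSingle += 0.1f;   // rounding error accumulates with every addition
        sumDouble += 0.1;
    }
    // The exact result is 100000; the single precision sum drifts much further away
    printf("single precision sum = %f\n", sumSingle);
    printf("double precision sum = %f\n", sumDouble);
    return 0;
}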
In scientific software, the computer’s memory can often be a precious resource
and so storing a matrix in a 2D array as previously illustrated can be a huge waste of
memory if the matrix is in fact sparse. As an example, consider a system of equations with 10^6 unknowns (which is fairly conservative by today’s standards) and imagine that it is represented by a tridiagonal matrix. In this case the matrix would have 10^12 entries (i.e. one trillion entries), but would only contain approximately 3 × 10^6 nonzero entries (i.e. about three million entries). If we were storing this matrix as a full matrix of single precision floating point numbers (i.e. 4 bytes of memory per entry) then we would need nearly four terabytes of memory compared to only around eleven megabytes of memory if we were only storing the relevant data. Clearly this
is a huge potential savings. So, how do we go about storing only the nonzero entries
in a matrix? Well, in Matlab we can do this very easily by simply allocating our
matrix to be a sparse matrix as:
A = sparse(M,N);
Note that here we are using the term ‘matrix’ as opposed to a 2D array, because of
the context. Having declared our matrix in this manner we can then work with it
in the usual way, but there will be a substantial savings in the amount of memory
required, so ‘life is good’. Now if we want to do something similar in C++ we need
to ‘dig a little deeper’ into how a sparse matrix is actually stored. As it happens
there are a number of ways of doing this, but one common technique for doing so
is known as Compressed Row Storage (CRS). With this format no assumptions are
made about the sparsity structure of the matrix, and no unnecessary entries are
stored. The way that the matrix is stored is that we create three 1D arrays, one
containing floating point numbers (which stores the non-zero entries in the matrix),
ordered column by column, row by row; and two containing integers (which are used
to identify row and column indices). The best way to illustrate the concept is via
example, so consider the matrix:
A = \begin{bmatrix} 9 & 0 & 0 & 0 & 2 & 0 \\ 3 & 9 & 0 & 0 & 0 & 3 \\ 0 & 7 & 8 & 7 & 0 & 0 \\ 3 & 0 & 8 & 7 & 5 & 0 \\ 0 & 8 & 0 & 9 & 9 & 6 \\ 0 & 4 & 0 & 0 & 2 & 1 \end{bmatrix}
where we will denote the number of non-zero entries in the matrix by Nnz , which is
equal to 19 in this case. If we were to store this matrix using the CRS format then
we would have the arrays:
val ={9, 2, 3, 9, 3, 7, 8, 7, 3, 8, 7, 5, 8, 9, 9, 6, 4, 2, 1}
col ={1, 5, 1, 2, 6, 2, 3, 4, 1, 3, 4, 5, 2, 4, 5, 6, 2, 5, 6}
row={1, 3, 6, 9, 13, 17, 20}
It should be fairly obvious that val is storing all of the entries in order as we move
along the columns and then along the rows of A. The entries in col are storing
the column indices of the entries in A (assuming one based indexing). Finally the
entries in row are storing the index in col and val where the data for that par-
ticular row begins. So the data for row 1 begins in row(1) = 1, the data for row
2 begins in row(2) = 3, the data for row 3 begins in row(3) = 6, etc. Furthermore, to access all of the column data within a given row m, we work through val from row(m), up to but not including row(m + 1), which is a nice feature when it comes to performing matrix vector multiplication (a sketch of which is given at the end of this discussion). By convention the format defines row(M + 1) = Nnz + 1.
reduce the storage, but would have the trade-off of a more complicated algorithm
with a different pattern of data access. Furthermore, if A happened to have some
rows containing all zeros, then we would simply find that for any row m with no
nonzero entries, row(m) = row(m + 1), so when accessing entries, or ‘say’ perform-
ing a matrix vector multiplication, we can easily handle this scenario. So instead of
allocating memory to store M^2 = 36 entries, we are allocating 2Nnz + M + 1 = 45. So in this particular case we’ve illustrated an important point; namely that one only obtains a savings in memory using this format if Nnz < M(M − 1)/2 (i.e. if less than half of the matrix entries are nonzero). For this example, this is not the case,
but going back to the 10^6 × 10^6 tridiagonal matrix mentioned previously, in this case
we could store the matrix in around twenty seven megabytes. So it turns out that we
can’t store just the nonzero entries in the matrix, because without the structure of
the arrays themselves we wouldn’t know where the nonzero entries belong, but even
so, we’ve made a pretty big savings in terms of computer memory. This method
does have some disadvantages however, namely that we can’t access entries directly,
we have to go through two arrays to get to the data.
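To illustrate why the format is convenient, a minimal sketch of a matrix vector product y = Av using the three CRS arrays is given below. Note that this sketch assumes zero based indices stored in col and row, with row[M] = N_nz, whereas the listings above show one based values.

// y = A*v for an M x N matrix stored in CRS format.
void crsMatVec(const double* val, const int* col, const int* row,
               const double* v, double* y, int M)
{
    for(int m=0; m<M; m++)
    {
        double sum = 0.0;
        for(int k=row[m]; k<row[m+1]; k++)   // all nonzero entries in row m
        {
            sum += val[k]*v[col[k]];
        }
        y[m] = sum;
    }
}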
Another common technique for storing a sparse matrix is known as Compressed
Column Storage (CCS). This approach is similar to compressed row storage except
that we essentially ‘swap’ the indexing of rows and columns. If we were to store A
now using the CCS format then we would have the arrays:
val ={9, 3, 3, 9, 7, 8, 4, 8, 8, 7, 7, 9, 2, 5, 9, 2, 3, 6, 1}
row={1, 2, 4, 2, 3, 5, 6, 3, 4, 3, 4, 5, 1, 4, 5, 6, 2, 5, 6}
col ={1, 4, 8, 10, 13, 17, 20}
It should be fairly obvious that val is storing all of the entries in order as we move
along the rows and then along the columns of A. The entries in row are storing
the row indices of the entries in A (assuming one based indexing). Finally the
entries in col are storing the index in row and val where the data for that particular
column begins. So the data for column 1 begins in col(1) = 1, the data for column
2 begins in col(2) = 4, the data for column 3 begins in col(3) = 8, etc. Some
other formats include Block Compressed Row Storage, Compressed Diagonal Storage,
Jagged Diagonal Storage, and Skyline Storage [63].
Example 1.1:
In this example we are going to develop a C++ class for storing a sparse matrix
using the compressed row storage format. For those readers new to C++, this would
be a good time to ‘brush up’ on some of the object oriented features of the language.
As it happens we will be using this class later on in the book when we come to
implementing programs for solving partial differential equations. Given that we have
just been discussing sparse matrices and dynamic memory allocation this seems like
the most appropriate point to develop this class, but this example could be skipped
and revisited when necessary.
The key feature that we want out of our class at this point is the ability to
store a sparse matrix and to insert and access entries. In later parts of the book
we will extend the class; for example, incorporating an optimized matrix-vector
multiplication routine. So let’s begin by creating the ‘skeleton’ for our class:
class SparseMatrix
{
public:
SparseMatrix();
~SparseMatrix();
void initialize(int nrow, int nnzperrow);
void finalize();
inline
double& operator()(int m, int n);
protected:
void reallocate();
private:
double* val_;
int* col_;
int* row_;
int* nnzs_;
int N_row_;
int N_nz_rowmax_;
int N_nz_;
int N_allocated_;
};
Here we have named our class SparseMatrix and we have declared a number of
private member variables and a few public and protected member functions, which
we will gradually explain. It can be observed the first three member variables are
the pointers for the three arrays that will need to be dynamically allocated to store
the sparse matrix data itself (and it can be observed that we will be using double
precision floating point numbers to store the entries in the matrix), the fifth member
variable is the number of rows, the seventh the number of nonzero entries and the
eighth is a variable which stores the number of entries currently allocated for the val
and col arrays. The remaining member variables will be explained in due course.
As a quick note, it is quite common in the design of C++ classes to identify member
variables of a class in some way. In this case we are using a trailing underscore, but
another commonly used convention is a preceding m_ to indicate that a variable is a
‘member’ of a class (i.e. m_val, m_col, etc). At the end of the day it doesn’t really
matter either way, the most important thing is to be consistent, and so throughout
this book we will use the trailing underscores to indicate member variables.
We want to be able to have a sparse matrix that can handle the case where we
don’t know exactly how many nonzero entries will be stored in it. While this gives
us a lot more flexibility and will ultimately make our programs simpler, it means we
may have to resize the arrays if they fill up. So the approach we will take is to ‘guess’
an initial size to allocate the three arrays and as we run out of space we will increase the
value of this variable and reallocate more memory. Another design consideration is
that we can’t assume that we will add the entries, column by column, row by row,
as the CRS format requires. As such we will need to provide a mechanism for ‘filling
up’ (or ‘assembling’) the matrix that will allow us to insert entries in arbitrary order.
The one thing that we will however assume that we know is the number of rows in
the matrix.
Before we go about implementing the member functions of this class it is worth
illustrating how we will use it in a program:
int main(int argc, char** argv)
{
const int N_row = 6;
const int N_col = 6;
SparseMatrix A;
double B[N_row][N_col] = { { 9, 0, 0, 0, 2, 0},
{ 3, 9, 0, 0, 0, 3},
{ 0, 7, 8, 7, 0, 0},
{ 3, 0, 8, 7, 5, 0},
{ 0, 8, 0, 9, 9, 6},
{ 0, 4, 0, 0, 2, 1} };
A.initialize(N_row, 3);
for(int n=N_col-1; n>=0; n--)
{
for(int m=N_row-1; m>=0; m--)
{
if(B[m][n]!=0)
{
A(m,n) = B[m][n];
}
}
}
A.finalize();
return 0;
}
It can be observed in the above program that we create two matrices A and B, the former using our SparseMatrix class and the latter as a statically allocated 2D
array. In the program we first ‘initialize’ our sparse matrix, then work backwards
row by row, column by column through B inserting nonzero entries into A, then
‘finalize’ our matrix, then we’re done. The reason for going through B backwards
is to make sure that we’re inserting entries in an order different from the way that
the CRS scheme stores them, to make sure that the class is functioning correctly.
When it comes to assigning the values, we are using the square brackets to index
into B, but an overloaded C++ operator (namely the parentheses operator) to put
the value into A.
Moving along, the first thing our class needs is a constructor. This is the function
that is called when we create an instance (or object) of our class, with a statement
like:
SparseMatrix A;
The constructor itself simply sets everything to a safe, empty state:
SparseMatrix()
{
N_row_ = 0;
N_nz_ = 0;
N_nz_rowmax_ = 0;
N_allocated_ = 0;
val_ = NULL;
col_ = NULL;
row_ = NULL;
nnzs_ = NULL;
}
where we are setting all variables to zero or to be null pointers. This happens
because we have constructed our matrix with no useful input information. What
we are going to do instead is use a member function called initialize, taking as arguments the number of rows and an estimate for the number of nonzero entries per row. The code for this function is implemented as:
void initialize(int nrow, int nnzperrow)
{
N_row_ = nrow;
N_nz_ = 0;
N_nz_rowmax_ = nnzperrow;
N_allocated_ = N_row_*N_nz_rowmax_;
val_ = new double [N_allocated_];
col_ = new int [N_allocated_];
row_ = new int [N_row_+1];
nnzs_ = new int [N_row_];
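// (sketch of the remaining steps, based on the description in the text below:
//  zero the arrays with memset and set the initial, one based row positions)
memset(val_, 0, N_allocated_*sizeof(double));
memset(col_, 0, N_allocated_*sizeof(int));
memset(nnzs_, 0, N_row_*sizeof(int));
for(int m=0; m<N_row_+1; m++)
{
    row_[m] = m*N_nz_rowmax_ + 1;
}
}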
Here we assign the number of rows and the maximum number of nonzero entries
per row (N_nz_rowmax_) equal to our initial estimate. Note that the number of nonzero entries (N_nz_) is set equal to zero because this variable contains a count of the
current number of nonzero entries in the matrix and at the point of initialization
of the SparseMatrix object there are none. We then set our allocation size based
on an estimate of the number of rows multiplied by the number of nonzero entries
per row and allocate the arrays storing the sparse matrix. It is important to note
that because we know the number of rows in the matrix, we won’t have to reallocate
row as we assemble the matrix, only modify its contents. Following the memory
allocation we initialize all of the arrays with zero via the memset function and then
set the initial row positions. This last statement requires some elaboration because
it relates to how we choose to fill up the sparse matrix. Remember that with the
CRS format row(m) is the index in the col and val arrays where the data for row m
starts. Because we don’t know this information yet, one approach is to just assume
that each row will have exactly N_nz_rowmax_ entries per row and then in that case
we can then start to fill the matrix up. If some rows have more nonzero entries than
this we will have to reallocate more memory; if they have less then we will have
to ‘compress’ the matrix (which as you might’ve guessed will happen through the
finalize member function). In order to keep track of the actual number of nonzero
entries that any given row has, we will make use of the ‘temporary’ array nnzs that
will maintain a count for each row. After the matrix has been completely assembled
and compressed (so that it’s using the CRS format exactly) this array is redundant
and will be deleted. So after our initialize function has been called, given that
we estimated 3 nonzero entries per row, the arrays of the sparse matrix will look
like:
val ={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
col ={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
row ={1, 4, 7, 10, 13, 16}
nnzs={0, 0, 0, 0, 0, 0}
At this point we can start to look at how we actually insert entries into the sparse
matrix. We can in fact come up with a pretty nice way of doing this by making use
of a feature of the C++ language known as operator overloading; and we will overload
the parentheses operator as:
inline double& operator()(int m, int n);
So here we have declared another member function, but the way we perform the
function call is via the parentheses, meaning that if we create an object of our class:
SparseMatrix A;
then we could access entries using the same syntax that we would with Matlab, i.e:
double a_mn = A(m,n);
As the program above inserts entries from B (working backwards through the matrix), the arrays evolve as follows. After inserting B(6, 6) = 1 we have:
val ={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0}
col ={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0}
row ={1, 4, 7, 10, 13, 16}
nnzs={0, 0, 0, 0, 0, 1}
then after inserting B(5, 6) = 6:
val ={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 1, 0, 0}
col ={0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 6, 0, 0}
row ={1, 4, 7, 10, 13, 16}
nnzs={0, 0, 0, 0, 1, 1}
then after inserting B(2, 6) = 3:
val ={0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 1, 0, 0}
col ={0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 6, 0, 0}
row ={1, 4, 7, 10, 13, 16}
nnzs={0, 1, 0, 0, 1, 1}
and after the first 13 nonzero entries have been inserted:
val ={2, 0, 0, 3, 0, 0, 7, 8, 0, 5, 7, 8, 6, 9, 9, 1, 2, 4}
col ={5, 0, 0, 6, 0, 0, 4, 3, 0, 5, 4, 3, 6, 5, 4, 6, 5, 2}
row ={1, 4, 7, 10, 13, 16}
nnzs={1, 1, 2, 3, 3, 3}
At this point the next entry to be inserted would be B(5, 2) = 8, but the problem
is that we don’t have any more free space between the data for rows 5 and 6. So
at this point we’ll need to reallocate our arrays. The approach that we’ll take is
not necessarily the smartest or most efficient, but makes the code relatively simple.
Essentially what we’ll do is just double our current guess for the maximum number
of nonzero entries per row and reallocate the arrays appropriately. In order to check
whether or not we have enough room to insert an entry, we’ll add an extra bit of
code to our implementation of the parenthesis operator:
double& operator()(int m, int n)
{
if(nnzs_[m]>=N_nz_rowmax_)
{
this->reallocate();
}
int k = row_[m];
bool foundEntry = false;
while(k<(row_[m]+nnzs_[m]) && !foundEntry)
{
...
}
where the three dots ... indicate the remainder of the function, which is
unchanged. The newly inserted if statement checks if our current count for the
number of nonzero entries for row m is equal to or greater than what we’ve set aside.
For the case of inserting B(5, 2) = 8 we have already inserted 3 value and so this
will be true. In this case we call the protected member function reallocate before
proceeding to insert the value. Because this is a protected member function, our
program won’t be able to access this function, only the sparse matrix object itself.
This is in-keeping with the encapsulation idea of object oriented programming, i.e.
our sparse matrix will be able to ‘take care of itself’ in a sense and reallocate memory
when it needs to. After implementing this function we won’t need to worry about
it at any other point in the programs that use it. As it happens this is exactly what
Matlab would be doing when we use the sparse function. The code implementing
this function will look like:
void reallocate()
{
N_nz_rowmax_ *= 2;
N_allocated_ = N_nz_rowmax_*N_row_;
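// (sketch of the steps performed at this point: allocate new arrays tempVal
//  and tempCol at the new size N_allocated_, copy the existing entries of each
//  row into their new, wider slots, and update row_ to the new starting positions)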
delete [] val_;
delete [] col_;
val_ = tempVal;
col_ = tempCol;
return;
}
After the reallocation the arrays look like:
val ={2, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 5, 7, 8, 0, 0, 0, 6, 9, 9, 0, 0, 0, 1, 2, 4, 0, 0, 0}
col ={5, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 5, 4, 3, 0, 0, 0, 6, 5, 4, 0, 0, 0, 6, 5, 2, 0, 0, 0}
row ={1, 7, 13, 19, 25, 31}
nnzs={1, 1, 2, 3, 3, 3}
then after inserting B(5, 2) = 8:
val ={2, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 5, 7, 8, 0, 0, 0, 6, 9, 9, 8, 0, 0, 1, 2, 4, 0, 0, 0}
col ={5, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 5, 4, 3, 0, 0, 0, 6, 5, 4, 2, 0, 0, 6, 5, 2, 0, 0, 0}
row ={1, 7, 13, 19, 25, 31}
nnzs={1, 1, 2, 3, 4, 3}
and after all of the remaining entries have been inserted:
val ={2, 9, 0, 0, 0, 0, 3, 9, 3, 0, 0, 0, 7, 8, 7, 0, 0, 0, 5, 7, 8, 3, 0, 0, 6, 9, 9, 8, 0, 0, 1, 2, 4, 0, 0, 0}
col ={5, 1, 0, 0, 0, 0, 6, 2, 1, 0, 0, 0, 4, 3, 2, 0, 0, 0, 5, 4, 3, 1, 0, 0, 6, 5, 4, 2, 0, 0, 6, 5, 2, 0, 0, 0}
row ={1, 7, 13, 19, 25, 31}
nnzs={2, 3, 3, 4, 4, 3}
At this point we have inserted all of the entries that we want to, so the final step
is to ‘tidy up’ the data so that it is in CRS format. The code implementing the
finalize member function will be somewhat similar to the reallocate function
in that it will create new temporary copies of the arrays to work with, but its main
job will be to sort the column indices for each row into ascending order before
compressing them down into the col array. This function will be the most complex
one present in the class, but let’s start by presenting the whole thing and then
explaining the various components:
void finalize()
{
int minCol = 0;
int insertPos = 0;
int index = 0;
double* tempVal = new double [N_nz_];
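// (sketch of the steps performed here: tempCol, tempRow, and a temporary
//  bool array isSorted are also allocated; nested for loops then work through
//  each row, repeatedly taking the unsorted entry with the smallest column
//  index and copying it into the next free position of tempVal and tempCol,
//  while tempRow is filled with the starting position of each row)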
delete [] val_;
delete [] col_;
delete [] row_;
delete [] nnzs_;
delete [] isSorted;
val_ = tempVal;
col_ = tempCol;
row_ = tempRow;
nnzs_ = NULL;
N_allocated_ = N_nz_;
return;
}
Similar to the reallocate function, we can see that the first steps consist of allo-
cating new arrays for val and col, but having assembled the matrix, we now know
the number of nonzero entries in the matrix, so we can in fact allocate the correct
size for these arrays. We are also going to allocate a temporary array isSorted
which stores a boolean flag defining whether or not a given entry has been sorted
and placed into the new arrays. The function of the nested for loops is to loop
over each row in the matrix and for each row, work through the entries that have
been inserted, find the one with the lowest column index, and then put it into the
new arrays. So, given the current state of the arrays:
val ={2, 9, 0, 0, 0, 0, 3, 9, 3, 0, 0, 0, 7, 8, 7, 0, 0, 0, 5, 7, 8, 3, 0, 0, 6, 9, 9, 8, 0, 0, 1, 2, 4, 0, 0, 0}
col ={5, 1, 0, 0, 0, 0, 6, 2, 1, 0, 0, 0, 4, 3, 2, 0, 0, 0, 5, 4, 3, 1, 0, 0, 6, 5, 4, 2, 0, 0, 6, 5, 2, 0, 0, 0}
row ={1, 7, 13, 19, 25, 31}
nnzs={2, 3, 3, 4, 4, 3}
As we search through the column data for row(1) between col entries 1 and 3 we
find that the lowest column index is 1 and it has not been previously sorted, so its
isSorted flag will be false and we can insert the value and column index into the
next available spot in the new val and col arrays, then set its flag to true . On
the second pass through the innermost for loop, we are scanning the same column
indices, but because the column index 1 has been sorted already, the lowest unsorted column index is now 5, so we add that value and column index into the next available spot in
the new val and col arrays, then set its flag to true . After sorting any given row,
we can then set the values in row appropriately. After repeating this process for all
rows the arrays will look like:
val ={9, 2, 3, 9, 3, 7, 8, 7, 3, 8, 7, 5, 8, 9, 9, 6, 4, 2, 1}
col ={1, 5, 1, 2, 6, 2, 3, 4, 1, 3, 4, 5, 2, 4, 5, 6, 2, 5, 6}
row={1, 3, 6, 9, 13, 17, 20}
Finally, the last thing we need to implement is a destructor. This is the function
that is called when we destroy an instance (or object) of our class. This can take
the form:
~SparseMatrix()
{
if(val_) delete [] val_;
if(col_) delete [] col_;
if(row_) delete [] row_;
if(nnzs_) delete [] nnzs_;
}
So the important point to note is that because we allocated our three arrays dynam-
ically we have to delete them ourselves and this should be done in the destructor.
Note that if our matrix has been finalized, nnzs will have already been deleted and
hence will be a NULL pointer at this point. As such it wouldn’t be deleted here
in the destructor.
The complete program is given in Example1_1.cpp.
Having now made all of the necessary definitions regarding vectors and matrices,
looking at how we actually go about storing them in both Matlab and C++, we are
now ready to start investigating some different methods for solving a linear system.
Those readers already familiar with Matlab will know that a linear system can be
solved quite simply by using the backslash operator as:
phi = A\b;
which instructs Matlab to find the solution of the linear system and it will in fact do
so by examining the structure of A and choosing the most appropriate method,
so ‘life is good’. Throughout this book, sometimes we will make use of the back-
slash operator and other times we won’t. If the focus of a particular program is
to demonstrate ‘say’ a numerical method for solving a partial differential equation
then we may develop the code for doing that and just use the backslash operator
to solve the resulting linear system of equations. There will be a few times however
when we will want to explicitly implement some of the numerical methods that we
are about to investigate. When it comes to implementing C++ programs we don’t
have the backslash operator, so we will need to know how to implement a method
to solve a linear system ourselves.
Throughout the next two chapters we will for the most part be applying our
numerical methods to solve the example system:
\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix}
As can be observed we have a full symmetric matrix here, which is small enough that
we can work through the computations ‘by hand’ to see how the different methods
work. This is also a pattern that we will try and stick to throughout the book;
namely that we will try and test our numerical methods on the same problems so
that we can highlight the differences between the different methods and also know
what solution to expect.
Chapter 2
Direct Methods
As was mentioned in the introduction we can classify methods for linear systems
of equations as either direct methods or iterative methods. Direct methods solve a
system of equations with a fixed number of operations, which will depend upon M ,
but will be known at the start of the method. In the absence of round-off error due
to the finite precision to which a computer can represent numbers, direct methods
would deliver an exact solution.
Example 2.1:
In this example we will develop a Matlab program to solve the example system:
\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix}
Before we start writing any code however, let’s work through and perform the Gaus-
sian elimination by hand. We first form the augmented matrix:
Ab = \left[ \begin{array}{ccc|c} 2 & 1 & -1 & 3 \\ 1 & 4 & 2 & -5 \\ -1 & 2 & 6 & 7 \end{array} \right]
Then we proceed by removing the entries in column 1, below the entry on the main
diagonal (i.e. a1,1 ). We achieve this by subtracting a2,1 /a1,1 × row 1 from row 2 and
a3,1 /a1,1 × row 1 from row 3, i.e:
Ab_2 \leftarrow Ab_2 - \tfrac{1}{2}\, Ab_1

Ab_3 \leftarrow Ab_3 - \tfrac{-1}{2}\, Ab_1
so that we get:
Ab = \left[ \begin{array}{ccc|c} 2 & 1 & -1 & 3 \\ 0 & 3.5 & 2.5 & -6.5 \\ 0 & 2.5 & 5.5 & 8.5 \end{array} \right]
We then remove the entries in column 2, below the main diagonal (i.e. a2,2 ). We do
this by subtracting a3,2 /a2,2 × row 2 from row 3, i.e:
Ab_3 \leftarrow Ab_3 - \tfrac{2.5}{3.5}\, Ab_2
such that we get:
Ab = \left[ \begin{array}{ccc|c} 2 & 1 & -1 & 3 \\ 0 & 3.5 & 2.5 & -6.5 \\ 0 & 0 & 3.7143 & 13.1429 \end{array} \right]
At this point we can perform the back substitution as:
\phi_3 = \frac{13.1429}{3.7143} = 3.5385

\phi_2 = \frac{-6.5 - 2.5 \times 3.5385}{3.5} = -4.3846

\phi_1 = \frac{3 - (-1) \times 3.5385 - 1 \times (-4.3846)}{2} = 5.4615
In order to create a Matlab program to perform the Gaussian elimination we can
first create an augmented matrix with the cat function (which concatenates arrays)
as:
Ab = cat(2, A, b);
Having done this we can then loop through the columns of Ab, and for each column,
loop through the rows of Ab
for n=1:N_col
for m=n+1:N_row
Ab(m,:) = Ab(m,:) - Ab(m,n)/Ab(n,n)*Ab(n,:);
end
end
It should be noted here that the colon operator implies ‘all’ of the elements in row m if
we write Ab(m,:), so we are performing a ‘vector’ operation, rather than operating
on individual entries in the array. At this point we can then perform the back
substitution as:
phi(N_row) = Ab(N_row,N_row+1)/Ab(N_row,N_row);
for m=N_row-1:-1:1
phi(m) = (Ab(m,N_row+1) - Ab(m, m+1:N_row)*phi(m+1:N_row))/Ab(m,m);
end
and our program is complete. If any of the elements on the main diagonal of A were
zero then we would run into the problem of division by zero when we add a multiple
of one row to another. If we allow for partial pivoting then we swap rows such that
the row with a zero on the main diagonal is swapped with the row below which has
the largest absolute value of am,n in column n. We could achieve this in Matlab by
checking for zero entries on the main diagonal and swapping any rows where this
occurs by modifying our initial code snippet as:
for n=1:N_col
if Ab(n,n)==0
[A_col_max k] = max(abs(Ab(n+1:N_row, n)));
k = k + n; % max returns an index relative to the sub-array, so offset it to get the actual row
temp = Ab(n,:);
Ab(n,:) = Ab(k,:);
Ab(k,:) = temp;
end
for m=n+1:N_row
Ab(m,:) = Ab(m,:) - Ab(m,n)/Ab(n,n)*Ab(n,:);
end
end
One drawback of this approach is that forming the augmented matrix requires storing a copy of A and b, which for large systems could be a problem too. The other option might be to simply overwrite the data in A and
b rather than store the augmented matrix and perform the operations there, but
this too may be undesirable.
2.2 LU Decomposition
LU decomposition (or LU factorization) is a similar process to Gaussian elimination
and is equivalent in terms of the elementary row operations. The basic idea behind
the method is that the matrix A can be decomposed (or factored) so that:
A = LU
where L is a lower triangular matrix with ones on the main diagonal and U is an
upper triangular matrix:
L = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ l_{2,1} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{M,1} & l_{M,2} & \cdots & 1 \end{bmatrix} \quad
U = \begin{bmatrix} u_{1,1} & u_{1,2} & \cdots & u_{1,N} \\ 0 & u_{2,2} & \cdots & u_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_{M,N} \end{bmatrix}

In this case A\phi = b becomes:

L U \phi = b
So we let ψ = U φ and first solve:
Lψ = b
Because L is a lower triangular matrix this equation can be solved efficiently by
forward substitution. To find φ we then solve:
Uφ = ψ
Because U is an upper triangular matrix this equation can be solved efficiently by
back substitution (as we did with Gaussian elimination).
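As an aside, a minimal C++ sketch of the forward substitution step (assuming L is unit lower triangular, i.e. ones on the main diagonal, and is stored as a dense 2D array) might look like:

// Solve L*psi = b by forward substitution, where L is unit lower triangular,
// so no division by the diagonal entries is required.
void forwardSubstitution(double** L, const double* b, double* psi, int N)
{
    for(int m=0; m<N; m++)
    {
        double sum = b[m];
        for(int n=0; n<m; n++)
        {
            sum -= L[m][n]*psi[n];   // subtract the already computed unknowns
        }
        psi[m] = sum;
    }
}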
Example 2.2:
In this example we will develop a Matlab program to solve the example system:
\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix}
Before we start writing any code however, let’s work through and perform the LU
decomposition by hand. Unlike Gaussian elimination we don’t have to form an
augmented matrix, however we proceed in exactly the same manner as Gaussian
elimination except that we keep a record of the elementary row operations performed
at the nth stage in a matrix Ln and place the results of these row operations in U .
Let’s first assign U = A and L = I.The the algorithm proceeds by removing the
entries in column 1, below the entry on the main diagonal (i.e. u1,1 ). We achieve
this by subtracting u2,1 /u1,1 × row 1 from row 2 and u3,1 /u1,1 × row 1 to row 3, i.e:
\[ U_2 \leftarrow U_2 - \frac{1}{2}U_1 \]
\[ U_3 \leftarrow U_3 - \frac{-1}{2}U_1 \]
such that we get:
\[ U = \begin{bmatrix} 2 & 1 & -1 \\ 0 & 3.5 & 2.5 \\ 0 & 2.5 & 5.5 \end{bmatrix} \qquad L_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \\ -0.5 & 0 & 1 \end{bmatrix} \]
We then remove the entry in column 2, below the main diagonal (i.e. u_{2,2}). We do this by subtracting u_{3,2}/u_{2,2} × row 2 from row 3, i.e.:
\[ U_3 \leftarrow U_3 - \frac{2.5}{3.5}U_2 \]
such that we get:
\[ U = \begin{bmatrix} 2 & 1 & -1 \\ 0 & 3.5 & 2.5 \\ 0 & 0 & 3.7143 \end{bmatrix} \qquad L_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0.7143 & 1 \end{bmatrix} \]
We then have:
\[ A = L_1 L_2 U \]
so that:
\[ L = L_1 L_2 \]
which in this particular example is:
\[ L = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \\ -0.5 & 0 & 1 \end{bmatrix}\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0.7143 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \\ -0.5 & 0.7143 & 1 \end{bmatrix} \]
At this point we can perform the forward substitution on Lψ = b as:
\[ \psi_1 = \frac{3}{1} = 3 \]
\[ \psi_2 = \frac{-5 - 0.5\times 3}{1} = -6.5 \]
\[ \psi_3 = \frac{7 - (-0.5)\times 3 - 0.7143\times(-6.5)}{1} = 13.1429 \]
We can then perform the back substitution on Uφ = ψ as:
\[ \phi_3 = \frac{13.1429}{3.7143} = 3.5385 \]
\[ \phi_2 = \frac{-6.5 - 2.5\times 3.5385}{3.5} = -4.3846 \]
\[ \phi_1 = \frac{3 - (-1)\times 3.5385 - 1\times(-4.3846)}{2.0} = 5.4615 \]
In order to create a Matlab program to perform the LU decomposition we will
have nested for loops similar to Example 2.1:
for n=1:N_col
for m=n+1:N_row
L(m,n) = U(m,n)/U(n,n);
U(m,:) = U(m,:) - U(m,n)/U(n,n)*U(n,:);
end
end
If any of the elements on the main diagonal of A were zero then we would run into
the problem of division by zero when we add a multiple of one row to another. If we
allow for partial pivoting then we swap rows such that the row with a zero on the
main diagonal is swapped with the row below which has the largest absolute value
of am,n in column n. We could achieve this in our Matlab program in exactly the
same manner that we did with Example 2.1.
The complete program is given in Example2_2.m.
Note that in both cases we have triangular matrices (lower and upper) which can
be solved directly using forward and backward substitution without using the Gaus-
sian elimination process (however we need this process or equivalent to compute the
LU decomposition itself). Thus the LU decomposition is computationally efficient
only when we have to solve a matrix equation multiple times for different b; it is
faster in this case to do an LU decomposition of the matrix A once and then solve
the triangular matrices for the different b, than to use Gaussian elimination each
time. As a final note before continuing on, we can perform an LU decomposition
using the Matlab function lu.
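As a hedged usage sketch (note that Matlab's lu applies partial pivoting, so it also returns a permutation matrix P that must be applied to b):
[L, U, P] = lu(A);
psi = L \ (P*b);   % forward substitution on the permuted right hand side
phi = U \ psi;     % back substitution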
2.3 Cholesky Decomposition
The Cholesky decomposition (or Cholesky factorization) is a similar process again, but it applies only when A is symmetric and positive definite. In this case the matrix can be factored so that:
\[ A = LL^T \]
where L is a lower triangular matrix. For a 3 × 3 matrix, for example:
\[ A = \begin{bmatrix} l_{1,1} & 0 & 0 \\ l_{2,1} & l_{2,2} & 0 \\ l_{3,1} & l_{3,2} & l_{3,3} \end{bmatrix}\begin{bmatrix} l_{1,1} & l_{2,1} & l_{3,1} \\ 0 & l_{2,2} & l_{3,2} \\ 0 & 0 & l_{3,3} \end{bmatrix} \]
In this case Aφ = b becomes:
\[ LL^T\phi = b \]
So we let ψ = L^Tφ and first solve:
\[ L\psi = b \]
Because L is a lower triangular matrix this equation can be solved efficiently by forward substitution. To find φ we then solve:
\[ L^T\phi = \psi \]
Because L^T is an upper triangular matrix this equation can be solved efficiently by back substitution (as we did with Gaussian elimination and LU decomposition). If we write out the matrix product in full we get:
\[ \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix} = \begin{bmatrix} l_{1,1}^2 & l_{1,1}l_{2,1} & l_{1,1}l_{3,1} \\ l_{2,1}l_{1,1} & l_{2,1}^2 + l_{2,2}^2 & l_{2,1}l_{3,1} + l_{2,2}l_{3,2} \\ l_{3,1}l_{1,1} & l_{3,1}l_{2,1} + l_{3,2}l_{2,2} & l_{3,1}^2 + l_{3,2}^2 + l_{3,3}^2 \end{bmatrix} \]
Then equating coefficients for each entry, we get the formulae:
\[ l_{m,m} = \sqrt{a_{m,m} - \sum_{o=1}^{m-1} l_{m,o}^2} \]
\[ l_{m,n} = \frac{1}{l_{n,n}}\left( a_{m,n} - \sum_{o=1}^{n-1} l_{m,o}\, l_{n,o} \right) \]
So we can compute the entry l_{m,n} provided we know the entries to its left and above it. A property of positive definite matrices is that the term inside the square root above is always positive and will result in a real number. We then proceed through the matrix column by column and calculate the entries.
Example 2.3:
In this example we will develop a Matlab program to solve the example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform the Cholesky decomposition by hand. To begin we work through the first column of A to compute the corresponding entries for L as:
\[ l_{1,1} = \sqrt{a_{1,1}} = \sqrt{2} = 1.4142 \]
\[ l_{2,1} = \frac{1}{l_{1,1}}(a_{2,1} - 0) = \frac{1}{1.4142}\times(1 - 0) = 0.7071 \]
\[ l_{3,1} = \frac{1}{l_{1,1}}(a_{3,1} - 0) = \frac{1}{1.4142}\times(-1 - 0) = -0.7071 \]
So that we have:
\[ L = \begin{bmatrix} 1.4142 & 0 & 0 \\ 0.7071 & 0 & 0 \\ -0.7071 & 0 & 0 \end{bmatrix} \]
We can then work through the second column of A to compute the corresponding
entries for L as:
\[ l_{2,2} = \sqrt{a_{2,2} - l_{2,1}^2} = \sqrt{4 - 0.7071^2} = 1.8708 \]
\[ l_{3,2} = \frac{1}{l_{2,2}}(a_{3,2} - l_{3,1}l_{2,1}) = \frac{1}{1.8708}\times\left(2 - (-0.7071)\times 0.7071\right) = 1.3363 \]
So that we have:
\[ L = \begin{bmatrix} 1.4142 & 0 & 0 \\ 0.7071 & 1.8708 & 0 \\ -0.7071 & 1.3363 & 0 \end{bmatrix} \]
Then finally we compute the last entry on the main diagonal as:
\[ l_{3,3} = \sqrt{a_{3,3} - l_{3,1}^2 - l_{3,2}^2} = \sqrt{6 - (-0.7071)^2 - 1.3363^2} = 1.9272 \]
So that we have:
\[ L = \begin{bmatrix} 1.4142 & 0 & 0 \\ 0.7071 & 1.8708 & 0 \\ -0.7071 & 1.3363 & 1.9272 \end{bmatrix} \]
At this point we can perform the forward substitution on Lψ = b as:
\[ \psi_1 = \frac{3}{1.4142} = 2.1213 \]
\[ \psi_2 = \frac{-5 - 0.7071\times 2.1213}{1.8708} = -3.4744 \]
\[ \psi_3 = \frac{7 - (-0.7071)\times 2.1213 - 1.3363\times(-3.4744)}{1.9272} = 6.8195 \]
We can then perform the back substitution on L^Tφ = ψ as:
\[ \phi_3 = \frac{6.8195}{1.9272} = 3.5385 \]
\[ \phi_2 = \frac{-3.4744 - 1.3363\times 3.5385}{1.8708} = -4.3846 \]
\[ \phi_1 = \frac{2.1213 - (-0.7071)\times 3.5385 - 0.7071\times(-4.3846)}{1.4142} = 5.4615 \]
In order to create a Matlab program to perform the Cholesky decomposition
we will have nested for loops similar to both Example 2.1 and 2.2 to compute the
entries in L:
for n=1:N_col
    % diagonal entry: subtract the squares of the entries to its left
    Sum = 0;
    for o=1:n-1
        Sum = Sum + L(n,o)^2;
    end
    L(n,n) = sqrt(A(n,n) - Sum);
    % entries below the diagonal in column n
    for m=n+1:N_row
        Sum = 0;
        for o=1:n-1
            Sum = Sum + L(m,o)*L(n,o);
        end
        L(m,n) = 1/L(n,n)*(A(m,n)-Sum);
    end
end
It can be observed that the Cholesky decomposition is very similar to the LU de-
composition, but an important difference is that partial pivoting will not be required
because the dominant coefficients will be on the main diagonal.
The complete program is given in Example2_3.m.
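For comparison, Matlab also provides a built-in chol function, which for a symmetric positive definite matrix returns an upper triangular factor. A hedged usage sketch that recovers the lower triangular L used above might be:
R = chol(A);    % upper triangular factor with R'*R = A
L = R';         % lower triangular factor, so that L*L' = A
psi = L \ b;    % forward substitution
phi = L' \ psi; % back substitution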
2.4 QR Decomposition
QR Decomposition (or QR factorization) is a technique similar to both LU and
Cholesky decompositions, that can be applied to any real, square matrix A. Here,
we write:
A = QR
where Q is an orthogonal matrix and R is an upper triangular matrix. In this case
Aφ = b becomes:
\[ QR\phi = b \]
So we let ψ = Rφ and first solve:
\[ Q\psi = b \]
Because Q is an orthogonal matrix, this can be done simply by computing ψ = Q^T b. To find φ we then solve:
\[ R\phi = \psi \]
by back substitution, since R is an upper triangular matrix.
What we are aiming to do with the Gram-Schmidt process is to find a new set of vectors q_1, q_2, ⋯, q_N, such that:
\[ A = \begin{bmatrix} a_1 & a_2 & \cdots & a_N \end{bmatrix} = \begin{bmatrix} q_1 & q_2 & \cdots & q_N \end{bmatrix}\begin{bmatrix} a_1\cdot q_1 & a_2\cdot q_1 & \cdots & a_N\cdot q_1 \\ 0 & a_2\cdot q_2 & \cdots & a_N\cdot q_2 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_N\cdot q_N \end{bmatrix} \]
Furthermore, the vectors will be normalized such that ||qn ||2 = 1, so that they form
an orthonormal set (i.e. they are unit vectors). This will give us the orthogonal
and upper triangular matrices that the QR decomposition requires. One important
point to bear in mind here is that the number of components in each of the vectors
will be equal to the number of rows in the matrix A. The process works by initially
defining an intermediate vector un and then successively computing qn as:
\[ u_1 = a_1 \qquad q_1 = \frac{u_1}{\|u_1\|_2} \]
\[ u_2 = a_2 - \frac{a_2^T q_1}{q_1^T q_1}q_1 \qquad q_2 = \frac{u_2}{\|u_2\|_2} \]
\[ u_3 = a_3 - \frac{a_3^T q_1}{q_1^T q_1}q_1 - \frac{a_3^T q_2}{q_2^T q_2}q_2 \qquad q_3 = \frac{u_3}{\|u_3\|_2} \]
\[ \vdots \]
\[ u_n = a_n - \sum_{m=1}^{n-1}\frac{a_n^T q_m}{q_m^T q_m}q_m \qquad q_n = \frac{u_n}{\|u_n\|_2} \qquad (2.1) \]
Each term being subtracted in this sum is a projection of the form:
\[ \mathrm{proj}(a, q) = \frac{a^T q}{q^T q}\, q \]
which means projecting a onto q and so a more intuitive way to think of the algo-
rithm is constructing a new orthogonal vector qn by projecting an onto the previous
q1 , · · · , qn−1 vectors (Figure 2.1) and then normalizing it. So while the vectors an
will not necessarily be orthogonal, the vectors qn will be.
Figure 2.1: The Gram-Schmidt process illustrating the computation of the second
orthonormal vector q2 . Having set u1 equal to a1 and normalizing it to obtain q1 , we
compute q2 by projecting a2 onto q1 and then subtracting this projection (i.e. adding the negative of it) from a2, giving us u2. Finally we normalize u2 to get q2. While
this example is given in 2D for simplicity, the method extends to any number of
dimensions.
Example 2.4:
In this example we will develop a Matlab program to solve the example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform the QR decomposition by hand. The algorithm begins by setting:
\[ u_1 = a_1 = \begin{bmatrix} 2 \\ 1 \\ -1 \end{bmatrix} \]
\[ q_1 = \frac{u_1}{\|u_1\|_2} = \frac{\{2, 1, -1\}^T}{(2^2 + 1^2 + (-1)^2)^{\frac{1}{2}}} = \begin{bmatrix} 0.8165 \\ 0.4082 \\ -0.4082 \end{bmatrix} \]
and computing:
\[ R_{1,1} = a_1\cdot q_1 = \begin{bmatrix} 2 & 1 & -1 \end{bmatrix}\begin{bmatrix} 0.8165 \\ 0.4082 \\ -0.4082 \end{bmatrix} = 2.4495 \]
So then we have:
\[ Q = \begin{bmatrix} 0.8165 & 0 & 0 \\ 0.4082 & 0 & 0 \\ -0.4082 & 0 & 0 \end{bmatrix} \qquad R = \begin{bmatrix} 2.4495 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \]
We then compute the second intermediate and orthonormal vectors as:
\[ u_2 = a_2 - \frac{a_2^T q_1}{q_1^T q_1}q_1 = \begin{bmatrix} 1 \\ 4 \\ 2 \end{bmatrix} - (1.6330)\begin{bmatrix} 0.8165 \\ 0.4082 \\ -0.4082 \end{bmatrix} = \begin{bmatrix} -0.3333 \\ 3.3333 \\ 2.6667 \end{bmatrix} \]
\[ q_2 = \frac{u_2}{\|u_2\|_2} = \begin{bmatrix} -0.0778 \\ 0.7785 \\ 0.6228 \end{bmatrix} \]
and computing:
\[ R_{1,2} = a_2\cdot q_1 = \begin{bmatrix} 1 & 4 & 2 \end{bmatrix}\begin{bmatrix} 0.8165 \\ 0.4082 \\ -0.4082 \end{bmatrix} = 1.6330 \]
\[ R_{2,2} = a_2\cdot q_2 = \begin{bmatrix} 1 & 4 & 2 \end{bmatrix}\begin{bmatrix} -0.0778 \\ 0.7785 \\ 0.6228 \end{bmatrix} = 4.2817 \]
So then we have:
\[ Q = \begin{bmatrix} 0.8165 & -0.0778 & 0 \\ 0.4082 & 0.7785 & 0 \\ -0.4082 & 0.6228 & 0 \end{bmatrix} \qquad R = \begin{bmatrix} 2.4495 & 1.6330 & 0 \\ 0 & 4.2817 & 0 \\ 0 & 0 & 0 \end{bmatrix} \]
At this point we can check that q_1 · q_2 = 0, verifying that these two vectors are orthogonal. We then proceed to compute the third orthonormal vector as:
\[ u_3 = a_3 - \frac{a_3^T q_1}{q_1^T q_1}q_1 - \frac{a_3^T q_2}{q_2^T q_2}q_2 = \begin{bmatrix} -1 \\ 2 \\ 6 \end{bmatrix} - (-2.4495)\begin{bmatrix} 0.8165 \\ 0.4082 \\ -0.4082 \end{bmatrix} - (5.3716)\begin{bmatrix} -0.0778 \\ 0.7785 \\ 0.6228 \end{bmatrix} = \begin{bmatrix} 1.4182 \\ -1.1818 \\ 1.6545 \end{bmatrix} \]
\[ q_3 = \frac{u_3}{\|u_3\|_2} = \begin{bmatrix} 0.5721 \\ -0.4767 \\ 0.6674 \end{bmatrix} \]
and computing:
\[ R_{1,3} = a_3\cdot q_1 = \begin{bmatrix} -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 0.8165 \\ 0.4082 \\ -0.4082 \end{bmatrix} = -2.4495 \]
\[ R_{2,3} = a_3\cdot q_2 = \begin{bmatrix} -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} -0.0778 \\ 0.7785 \\ 0.6228 \end{bmatrix} = 5.3716 \]
\[ R_{3,3} = a_3\cdot q_3 = \begin{bmatrix} -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 0.5721 \\ -0.4767 \\ 0.6674 \end{bmatrix} = 2.4790 \]
So then we have:
\[ Q = \begin{bmatrix} 0.8165 & -0.0778 & 0.5721 \\ 0.4082 & 0.7785 & -0.4767 \\ -0.4082 & 0.6228 & 0.6674 \end{bmatrix} \qquad R = \begin{bmatrix} 2.4495 & 1.6330 & -2.4495 \\ 0 & 4.2817 & 5.3716 \\ 0 & 0 & 2.4790 \end{bmatrix} \]
Again, we can note that:
\[ q_1 \cdot q_3 = \begin{bmatrix} 0.8165 & 0.4082 & -0.4082 \end{bmatrix}\begin{bmatrix} 0.5721 \\ -0.4767 \\ 0.6674 \end{bmatrix} = 0.0 \]
\[ q_2 \cdot q_3 = \begin{bmatrix} -0.0778 & 0.7785 & 0.6228 \end{bmatrix}\begin{bmatrix} 0.5721 \\ -0.4767 \\ 0.6674 \end{bmatrix} = 0.0 \]
verifying that all three vectors are orthogonal. At this point we can perform the matrix-vector multiplication as:
\[ \psi = Q^T b = \begin{bmatrix} 0.8165 & 0.4082 & -0.4082 \\ -0.0778 & 0.7785 & 0.6228 \\ 0.5721 & -0.4767 & 0.6674 \end{bmatrix}\begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} = \begin{bmatrix} -2.4495 \\ 0.2335 \\ 8.7719 \end{bmatrix} \]
We can then perform the back substitution on Rφ = ψ as:
\[ \phi_3 = \frac{8.7719}{2.4790} = 3.5385 \]
\[ \phi_2 = \frac{0.2335 - 5.3716\times 3.5385}{4.2817} = -4.3846 \]
\[ \phi_1 = \frac{-2.4495 - (-2.4495)\times 3.5385 - 1.6330\times(-4.3846)}{2.4495} = 5.4615 \]
In order to create a Matlab program to perform the QR decomposition we can
first note that since the qn vectors have a magnitude of one, we can in fact remove
the qT q computation from the denominator of the projection operator in order to
simplify the computations (because we know that this will be equal to one). With
that in mind we will then have nested for loops similar to Examples 2.1, 2.2,
and 2.3 to compute the entries in Q and R:
Q = A;
for n=1:N_col
    for m=1:n-1
        R(m,n) = Q(:,m)' * Q(:,n);
        Q(:,n) = Q(:,n) - R(m,n)*Q(:,m);
    end
    R(n,n) = norm(Q(:,n));
    Q(:,n) = Q(:,n)/R(n,n);
end
where it can be observed that we don’t actually need to create variables to store
un , but rather just store these computations in the Q array. Furthermore, it can be
observed that by initializing Q = A we don’t use A in the factorization steps. At
this point we can trivially compute ψ as:
psi = Q'*b;
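Since R is upper triangular, the remaining system Rφ = ψ can then be solved by back substitution, just as in the earlier examples. A minimal sketch (assuming the Q, R, and psi arrays computed above) might be:
phi(N_row) = psi(N_row)/R(N_row,N_row);
for m=N_row-1:-1:1
    phi(m) = (psi(m) - R(m,m+1:N_row)*phi(m+1:N_row))/R(m,m);
end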
2.5 Tridiagonal Matrix Algorithm
The tridiagonal matrix algorithm (TDMA, also known as the Thomas algorithm) is a simplified form of Gaussian elimination that can be applied to tridiagonal systems of equations, which arise when we have a similar equation for each unknown of the form:
cn φn−1 + an φn + dn φn+1 = bn
and as we shall see in Part III, these types of systems can often arise when solving
partial differential equations. Now, although we could use any of the methods that
we’ve studied thus far to solve the system in Equation 2.2 we can exploit the nature
of the matrix in order to solve the system in substantially fewer computations. The
basic idea behind the method is that we perform Gaussian Elimination in the usual
way. If we were to form the augmented matrix:
\[ Ab = \begin{bmatrix} a_1 & d_1 & & & 0 & b_1 \\ c_2 & a_2 & d_2 & & & b_2 \\ & c_3 & a_3 & \ddots & & b_3 \\ & & \ddots & \ddots & d_{M-1} & \vdots \\ 0 & & & c_M & a_M & b_M \end{bmatrix} \]
Then the algorithm proceeds by modifying the second row as:
\[ Ab_2 \leftarrow Ab_2 - \frac{c_2}{a_1}Ab_1 \]
and in general, working through the rows in turn, the matrix coefficients are modified according to:
\[ \hat{c}_n = 0 \]
\[ \hat{a}_1 = a_1 \qquad \hat{a}_n = a_n\hat{a}_{n-1} - \hat{d}_{n-1}c_n \]
\[ \hat{d}_1 = d_1 \qquad \hat{d}_n = d_n\hat{a}_{n-1} \]
\[ \hat{b}_1 = b_1 \qquad \hat{b}_n = b_n\hat{a}_{n-1} - \hat{b}_{n-1}c_n \]
where the terms with the ˆ indicate the modified matrix coefficients as each row is
processed. As long as we know that all of the entries on the main diagonal will be
nonzero then we can divide out the ân terms to get:
\[ c'_n = 0 \]
\[ a'_n = 1 \]
\[ d'_1 = \frac{d_1}{a_1} \qquad d'_n = \frac{d_n}{a_n - d'_{n-1}c_n} \]
\[ b'_1 = \frac{b_1}{a_1} \qquad b'_n = \frac{b_n - b'_{n-1}c_n}{a_n - d'_{n-1}c_n} \]
So we can loop through the rows and compute modified coefficients as:
\[ d'_n = \begin{cases} \dfrac{d_n}{a_n} & n = 1 \\[2mm] \dfrac{d_n}{a_n - d'_{n-1}c_n} & n = 2, 3, \ldots, N-1 \end{cases} \]
and:
\[ b'_n = \begin{cases} \dfrac{b_n}{a_n} & n = 1 \\[2mm] \dfrac{b_n - b'_{n-1}c_n}{a_n - d'_{n-1}c_n} & n = 2, 3, \ldots, N \end{cases} \]
which is the forward sweep. The solution is then obtained by back substitution:
\[ \phi_N = b'_N \]
\[ \phi_n = b'_n - d'_n\phi_{n+1} \qquad n = N-1, N-2, \cdots, 1 \]
For such systems, the solution can be obtained in $O(M)$ operations instead of the $O(M^3)$ that we observed was required by Gaussian elimination. To illustrate these savings in computational cost by way of an example, a system of equations involving say $10^6$ unknowns (which is a realistically sized system for practical problems) is going to require $O(10^6)$ operations via the TDMA method, whereas full Gaussian elimination (not taking advantage of the tridiagonal nature) would require $O(10^{18})$ operations!
Example 2.5:
In this example we will develop a Matlab program to solve the example system:
\[ \begin{bmatrix} -2 & 1 & 0 & 0 & 0 \\ 1 & -2 & 1 & 0 & 0 \\ 0 & 1 & -2 & 1 & 0 \\ 0 & 0 & 1 & -2 & 1 \\ 0 & 0 & 0 & 1 & -2 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \phi_4 \\ \phi_5 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \]
As can be observed, we can’t use the example system that we’d been using in other
examples up until this point, since it wasn’t a tridiagonal matrix. Before we start
writing any code however, let’s work through and perform the TDMA by hand. The algorithm begins by computing the modified coefficients for row 1 as:
\[ d'_1 = \frac{d_1}{a_1} = \frac{1}{-2} = -0.5 \]
\[ b'_1 = \frac{b_1}{a_1} = \frac{0}{-2} = 0.0 \]
We then work through the rows and compute the modified coefficients for row 2 as:
\[ d'_2 = \frac{d_2}{a_2 - d'_1 c_2} = \frac{1}{-2 - (-0.5)(1)} = -0.6667 \qquad b'_2 = \frac{b_2 - b'_1 c_2}{a_2 - d'_1 c_2} = \frac{0 - 0(1)}{-2 - (-0.5)(1)} = 0.0 \]
and likewise for rows 3 and 4:
\[ d'_3 = \frac{d_3}{a_3 - d'_2 c_3} = \frac{1}{-2 - (-0.6667)(1)} = -0.7500 \qquad b'_3 = \frac{b_3 - b'_2 c_3}{a_3 - d'_2 c_3} = \frac{0 - 0(1)}{-2 - (-0.6667)(1)} = 0.0 \]
\[ d'_4 = \frac{d_4}{a_4 - d'_3 c_4} = \frac{1}{-2 - (-0.7500)(1)} = -0.8000 \qquad b'_4 = \frac{b_4 - b'_3 c_4}{a_4 - d'_3 c_4} = \frac{0 - 0(1)}{-2 - (-0.7500)(1)} = 0.0 \]
Finally, for the last row we only need the modified right hand side:
\[ b'_5 = \frac{b_5 - b'_4 c_5}{a_5 - d'_4 c_5} = \frac{1 - 0(1)}{-2 - (-0.8000)(1)} = -0.8333 \]
The back substitution then gives:
\[ \phi_5 = b'_5 = -0.8333 \]
\[ \phi_4 = b'_4 - d'_4\phi_5 = 0 - (-0.8000)(-0.8333) = -0.6667 \]
\[ \phi_3 = b'_3 - d'_3\phi_4 = 0 - (-0.7500)(-0.6667) = -0.5000 \]
\[ \phi_2 = b'_2 - d'_2\phi_3 = 0 - (-0.6667)(-0.5000) = -0.3333 \]
\[ \phi_1 = b'_1 - d'_1\phi_2 = 0 - (-0.5000)(-0.3333) = -0.1667 \]
In order to create a Matlab program to perform the TDMA we can first compute the modified coefficients for row 1, and then work through the remaining rows computing their modified coefficients as:
for n=2:N_col-1
dprime(n) = d(n) /(a(n) - c(n-1) * dprime(n-1));
bprime(n) = (b(n) - bprime(n-1)*c(n-1))/(a(n) - c(n-1) * dprime(n-1));
end
bprime(N_col) = (b(N_col) - bprime(N_col-1)*c(N_col-1))/(a(N_col) - c(N_col-1) * dprime(N_col-1));
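The solution can then be recovered with the back substitution step; a minimal sketch (assuming the dprime and bprime arrays computed above, with phi preallocated as a column vector) might be:
phi(N_col) = bprime(N_col);
for n=N_col-1:-1:1
    phi(n) = bprime(n) - dprime(n)*phi(n+1);
end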
3 Iterative Methods
Having now investigated some direct methods for solving a linear system of equa-
tions, it is time to investigate some iterative methods. It can be observed that for
all of the direct methods encountered in Chapter 2, we had a fixed number of oper-
ations, which was dependent on the size of the matrix, but after which we had the
exact solution (ignoring round off errors of course). Despite this advantageous prop-
erty of direct methods, the large number of operations required means that they are
not generally used for solving large systems, especially when the matrices are sparse
(as is generally the case when solving a partial differential equation). Furthermore,
an observation that we can make regarding ‘most’ of the direct methods is that they
require factorizing (or decomposing) A into a product of two matrices, both of which
we have to store. For the small example system that we were applying our methods
to, this feature doesn’t really matter; but for larger systems it could be a bit of a
pain.
Iterative methods work by specifying an initial ‘guess’ for the solution and suc-
cessively refining that guess until it is within some user specified tolerance to the
exact solution. We cannot generally know in advance how many iterations it will
take to converge on the correct solution. We can further classify iterative methods
as stationary or nonstationary and we will investigate them in this order. Station-
ary methods are older, simpler to understand and implement, but usually not as
effective. Non-stationary methods are a more recent development; their analysis is
usually harder to understand, but they can be highly effective. As we shall see later
on in the book, iterative methods work very nicely with sparse matrices, meaning
that we can get fast efficient solvers with minimal storage required. It is for this
reason that when we come to implementing programs to solve partial differential
equations, we will always be using iterative methods. Before we begin this investi-
gation, we need to make some definitions and introduce some concepts relevant to
all iterative methods. Throughout this book we will use the superscript k to denote
the current iteration of our numerical method. In order to avoid confusion with an
exponent, we will always state when a quantity is to be understood as being raised
to a power, if the mathematics is ambiguous. At any stage within our iterative method we can define the error as:
\[ e^k = \phi^k - \phi \qquad (3.1) \]
So we can see that the error is a column vector that indicates how far our current approximation of the solution φ^k is from the exact solution φ. Now in practice, this
measure is not of great use because in order to calculate it we need to know the
exact solution, which is what we are trying to find in the first place! A more useful
definition is that of the residual :
rk = b − Aφk (3.2)
If we rearrange Equation 3.1 in terms of φk and substitute into Equation 3.2 we get:
rk = −Aek
So while the error indicates how far we are from the correct solution, the residual
indicates how far we are from the correct value of b. Moreover we can think of the
residual as being the error transformed by A into the same ‘space’ as b. We could
always calculate the error from the residual, but this would involve computing the
inverse of A and if we were going to do that, then we wouldn’t need an iterative
method in the first place. Now we can calculate the residual at each iteration and
use this as our criterion for judging when our method has converged on a solution.
Since the residual itself is a column vector, we generally use one of the vector norms defined in Chapter 1 such that we have a single scalar value to compare to some user specified tolerance. We could then either continue iterating until the residual norm has reduced to below this tolerance (e.g. $\|r^k\|_\infty < 0.001$) or we could continue iterating until the residual norm has reduced to below some proportion of its initial value when we began the iterations (e.g. $\|r^k\|_\infty < 0.001\,\|r^0\|_\infty$). Now while the
choice of tolerance is arbitrary, a good rule of thumb is to choose half the machine
precision. Thus if the precision of calculations is about 16 digits one may choose
the tolerance to be 10−8 .
3.1 Jacobi Method
With the Jacobi method we split the matrix A into the sum of a diagonal matrix D and a remainder matrix R containing the off-diagonal entries, i.e. A = D + R where:
\[ D = \begin{bmatrix} a_{1,1} & 0 & \cdots & 0 \\ 0 & a_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{M,N} \end{bmatrix} \qquad R = \begin{bmatrix} 0 & a_{1,2} & \cdots & a_{1,N} \\ a_{2,1} & 0 & \cdots & a_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{M,1} & a_{M,2} & \cdots & 0 \end{bmatrix} \]
With this splitting Aφ = b becomes:
(D + R) φ = b
and therefore:
Dφ = b − Rφ
φ = D−1 (b − Rφ)
Now the inverse of a diagonal matrix is trivially just another diagonal matrix where
each entry on the main diagonal is the reciprocal. With the Jacobi method we use
this as the basis for an iteration:
\[ \phi^{k+1} = D^{-1}\left( b - R\phi^k \right) \]
That is, we can update our solution based on the current ‘guess’. In terms of the entries in the matrices, we get:
\[ \phi_m^{k+1} = \frac{1}{a_{m,m}}\left( b_m - \sum_{n\neq m} a_{m,n}\phi_n^k \right) \]
This method is guaranteed to converge when A is diagonally dominant and will of course ‘blow up’ if any a_{m,m} = 0. In other situations, it may or may not converge. We need some initial guess for φ^0, which ideally should be as close as possible to the solution.
Example 3.1:
In this example we will develop a Matlab program to solve the example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform a few
iterations of the Jacobi method by hand. To begin the algorithm, let’s provide the
initial guess of φ0 = {1, 1, 1}T . We will also assume that we are using the infinity
norm as our measure of convergence, and therefore, initially we have:
\[ r^0 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
with $\|r^0\|_\infty = 12$. We then work through the rows of A to update φ^1 as:
\[ \phi_1^1 = \frac{1}{a_{1,1}}\left( b_1 - a_{1,2}\phi_2^0 - a_{1,3}\phi_3^0 \right) = \frac{1}{2}\left( 3 - 1\times 1 - (-1)\times 1 \right) = 1.5 \]
\[ \phi_2^1 = \frac{1}{a_{2,2}}\left( b_2 - a_{2,1}\phi_1^0 - a_{2,3}\phi_3^0 \right) = \frac{1}{4}\left( -5 - 1\times 1 - 2\times 1 \right) = -2.0 \]
\[ \phi_3^1 = \frac{1}{a_{3,3}}\left( b_3 - a_{3,1}\phi_1^0 - a_{3,2}\phi_2^0 \right) = \frac{1}{6}\left( 7 - (-1)\times 1 - 2\times 1 \right) = 1.0 \]
with $\|r^1\|_\infty = 6.5$. We then repeat the procedure and update φ^2 as:
\[ \phi_1^2 = \frac{1}{a_{1,1}}\left( b_1 - a_{1,2}\phi_2^1 - a_{1,3}\phi_3^1 \right) = \frac{1}{2}\left( 3 - 1\times(-2.0) - (-1)\times 1.0 \right) = 3.0000 \]
\[ \phi_2^2 = \frac{1}{a_{2,2}}\left( b_2 - a_{2,1}\phi_1^1 - a_{2,3}\phi_3^1 \right) = \frac{1}{4}\left( -5 - 1\times 1.5 - 2\times 1.0 \right) = -2.1250 \]
\[ \phi_3^2 = \frac{1}{a_{3,3}}\left( b_3 - a_{3,1}\phi_1^1 - a_{3,2}\phi_2^1 \right) = \frac{1}{6}\left( 7 - (-1)\times 1.5 - 2\times(-2.0) \right) = 2.0833 \]
with $\|r^2\|_\infty = 3.6667$. As can be observed, each successive iteration brings the solution closer to the exact solution, with the residual norm decreasing in turn. With a tolerance of $10^{-8}$ the algorithm converges after 58 iterations. Now, in order to create an algorithm, it’s worth noting that we only need the previous iteration to compute the next one. So there’s little point in storing every value of φ^k; instead what
we can do is have two vectors φold and φ. We initialize φold as φ0 and update to
find φ. We then assign φold = φ and repeat.
In order to create a Matlab program to perform the Jacobi method we will write
the method in a slightly different manner to the way we implemented the direct
methods in Chapter 2. Since we don’t know how many iterations our method will
take, we will use a while loop to continue iterating until we have converged on a
solution. As it happens we will stick to this form throughout the book and use
while loops to indicate iteration where the number of steps is not fixed. It is worth
pointing out however that one can achieve an iterative loop with the for loop
construct; furthermore, we could also have achieved all of the looping over rows
and columns in the examples of Chapter 2 with while loops. So this convention is
mainly just to aid in understanding. With this in mind, our program will look like:
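A sketch of this program, consistent with the description that follows (and assuming that phi, phi_old, r_norm, the tolerance, and a maximum iteration count N_k have already been initialized), is:
while r_norm>tolerance && k<N_k
    for m=1:N_row
        A_mphi = 0;
        for n=1:N_col
            if(n ~= m)
                A_mphi = A_mphi + A(m,n)*phi_old(n);   % sum of the off-diagonal terms
            end
        end
        phi(m) = (b(m) - A_mphi)/A(m,m);
    end
    phi_old = phi;          % copy the new solution into the 'old' array
    r = b - A*phi;          % updated residual
    r_norm = max(abs(r));   % infinity norm
    k = k+1;                % increment the iteration counter
end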
So within our iterative while loop we loop over all of the rows in A, then for
each row, loop over all of the columns in A and assemble the sum A_mphi which is
essentially the mth row in A multiplied by the column vector phi_old. An important
point to note is that we could also achieve this with the code A(m,:)*phi_old, but
this would include the A(m, m), which we do not want included in our summation.
Once the summation is complete, we can update φm . The last four lines within
the while loop consist of copying the data stored in phi into phi_old, computing
the updated residual and the infinity norm, then finally incrementing the iteration
counter. Another important point worth noting is that programmatically it makes
sense to have some user specified upper limit to the number of iterations just to
make sure that we don’t ever get stuck in an infinite loop if the method fails to
converge. For this reason the iterative while loop will continue until either the
solution converges, or until k reaches the maximum number of iterations.
The complete program is given in Example3_1.m.
3.2 Gauss-Seidel Method
With the Gauss-Seidel method we instead split the matrix A into the sum of a lower triangular matrix L (which includes the main diagonal) and a strictly upper triangular matrix U, i.e. A = L + U where:
\[ L = \begin{bmatrix} a_{1,1} & 0 & \cdots & 0 \\ a_{2,1} & a_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{M,1} & a_{M,2} & \cdots & a_{M,N} \end{bmatrix} \qquad U = \begin{bmatrix} 0 & a_{1,2} & \cdots & a_{1,N} \\ 0 & 0 & \cdots & a_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \]
With this splitting Aφ = b becomes:
(L + U ) φ = b
and therefore:
Lφ = b − U φ
φ = L−1 (b − U φ)
So the basic idea behind the Gauss-Seidel method is that we use this as the basis
for an iteration:
\[ \phi^{k+1} = L^{-1}\left( b - U\phi^k \right) \]
However, by taking advantage of the triangular form of L, the elements of φ^{k+1} can be computed sequentially using forward substitution. In terms of the entries in the matrices, this is:
\[ \phi_m^{k+1} = \frac{1}{a_{m,m}}\left( b_m - \sum_{n > m} a_{m,n}\phi_n^k - \sum_{n < m} a_{m,n}\phi_n^{k+1} \right) \]
Note, this method is actually the same as the Jacobi method, except that we take
our φm values to be from the current iteration k + 1, whenever these are available.
This means that we are always using the most up to date information available to
us, which hopefully leads to faster convergence. Similar to the Jacobi method, we
need diagonal dominance and no zero entries on the main diagonal in order for this
method to be able to converge.
Example 3.2:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform a few
iterations of the Gauss-Seidel method by hand. To begin the algorithm, let’s provide
the initial guess of φ0 = {1, 1, 1}T . We will also assume that we are using the infinity
norm as our measure of convergence, and therefore, initially we have:
\[ r^0 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
with $\|r^0\|_\infty = 12$. We then work through the rows of A to update φ^1 as:
\[ \phi_1^1 = \frac{1}{a_{1,1}}\left( b_1 - a_{1,2}\phi_2^0 - a_{1,3}\phi_3^0 \right) = \frac{1}{2}\left( 3 - 1\times 1 - (-1)\times 1 \right) = 1.5000 \]
\[ \phi_2^1 = \frac{1}{a_{2,2}}\left( b_2 - a_{2,1}\phi_1^1 - a_{2,3}\phi_3^0 \right) = \frac{1}{4}\left( -5 - 1\times 1.5 - 2\times 1 \right) = -2.1250 \]
\[ \phi_3^1 = \frac{1}{a_{3,3}}\left( b_3 - a_{3,1}\phi_1^1 - a_{3,2}\phi_2^1 \right) = \frac{1}{6}\left( 7 - (-1)\times 1.5 - 2\times(-2.1250) \right) = 2.1250 \]
and then update the residual as:
\[ r^1 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1.5000 \\ -2.1250 \\ 2.1250 \end{bmatrix} = \begin{bmatrix} 4.2500 \\ -2.2500 \\ 0.0000 \end{bmatrix} \]
with $\|r^1\|_\infty = 4.2500$. So already after one iteration we have a different solution compared to the Jacobi method. We then repeat the procedure and update φ^2 as:
\[ \phi_1^2 = \frac{1}{a_{1,1}}\left( b_1 - a_{1,2}\phi_2^1 - a_{1,3}\phi_3^1 \right) = \frac{1}{2}\left( 3 - 1\times(-2.1250) - (-1)\times 2.1250 \right) = 3.6250 \]
\[ \phi_2^2 = \frac{1}{a_{2,2}}\left( b_2 - a_{2,1}\phi_1^2 - a_{2,3}\phi_3^1 \right) = \frac{1}{4}\left( -5 - 1\times 3.6250 - 2\times 2.1250 \right) = -3.2188 \]
\[ \phi_3^2 = \frac{1}{a_{3,3}}\left( b_3 - a_{3,1}\phi_1^2 - a_{3,2}\phi_2^2 \right) = \frac{1}{6}\left( 7 - (-1)\times 3.6250 - 2\times(-3.2188) \right) = 2.8438 \]
and then update the residual as:
\[ r^2 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 3.6250 \\ -3.2188 \\ 2.8438 \end{bmatrix} = \begin{bmatrix} 1.8126 \\ -1.4374 \\ -0.0002 \end{bmatrix} \]
with $\|r^2\|_\infty = 1.8126$. So after 2 iterations we can see that the residual norm is
about half of what it was in the case of the Jacobi method and with a tolerance of
10−8 the algorithm converges in 30 iterations.
In order to create a Matlab program to perform the Gauss-Seidel method we
can note that the immediate use of updated φk+1 means that instead of needing
the ‘old’ and ‘new’ arrays, we can just have the one array and update in place. We can do this because we know that when we come to update, say, $\phi_3^{k+1}$, we will have already updated $\phi_1^{k+1}$ and $\phi_2^{k+1}$, but if there were entries $\phi_4 \cdots \phi_M$, they would still contain data from iteration k. So we just use the values as if we didn’t care which iteration they were from and ‘all will be OK’. So we could create an algorithm for
the Gauss-Seidel method in Matlab as:
while r_norm>tolerance && k<N_k
for m=1:N_row
SumA_mnphi_n = 0;
for n=1:N_col
if(n ~= m)
SumA_mnphi_n = SumA_mnphi_n + A(m,n)*phi(n);
end
end
phi(m) = (b(m) - SumA_mnphi_n)/A(m,m);
end
r = b - A*phi;
r_norm = max(abs(r));
k = k+1;
end
It can be observed that this algorithm is almost exactly the same as the implementation of the Jacobi method, except that there is no longer any need for the phi_old array. In order to
create a C++ program we will use essentially the same programming constructs and
so the ‘bulk’ of the algorithm will look like:
while(r_norm>tolerance && k<N_k)
{
for(m=0; m<N_row; m++)
{
A_mphi = 0.0;
for(n=0; n<N_col; n++)
{
if(n!=m)
{
A_mphi += A[m][n]*phi[n];
}
}
phi[m] = (b[m] - A_mphi)/A[m][m];
}
...
k++;
}
where the major differences are simply the different syntax of the for and while
loops. The only part of the algorithm that is significantly different is the computa-
tion of the residual and the infinity norm. With C++, we can’t perform matrix-vector
multiplications or additions and subtractions in the same way that we do in Matlab.
So in order to achieve the equivalent computations we will have:
r_norm = 0.0;
Here we have two nested for loops, working through the rows and then columns
of A, multiplying the mth row of A by the current solution. As each entry in the
residual vector is computed, it is compared with the current value for the infinity
norm and if greater, the infinity norm is reassigned. So in this way, the infinity
norm is computed ‘on the fly’ as we compute the residual vector itself, rather than
afterwards.
The complete programs are given in Example3_2.m and Example3_2.cpp.
3.3 Successive Over-Relaxation Method
With the successive over-relaxation (SOR) method we split the matrix A into the sum of a diagonal matrix D, a strictly lower triangular matrix L, and a strictly upper triangular matrix U, i.e. A = D + L + U where:
\[ D = \begin{bmatrix} a_{1,1} & 0 & \cdots & 0 \\ 0 & a_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{M,N} \end{bmatrix} \quad L = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ a_{2,1} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{M,1} & a_{M,2} & \cdots & 0 \end{bmatrix} \quad U = \begin{bmatrix} 0 & a_{1,2} & \cdots & a_{1,N} \\ 0 & 0 & \cdots & a_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \]
With this splitting Aφ = b becomes:
(D + L + U ) φ = b
We then multiply both sides by a constant ω > 1, which we will call an over-
relaxation factor as:
ω (D + L + U ) φ = ωb
which we can rearrange as:
\[ (D + \omega L)\phi = \omega b - \left( \omega U + (\omega - 1)D \right)\phi \]
\[ \phi = (D + \omega L)^{-1}\left( \omega b - \left( \omega U + (\omega - 1)D \right)\phi \right) \]
The reason for incorporating ω is so that we can make larger changes to our solution at each iteration so that we ‘hopefully’ converge on a solution faster (i.e. in fewer iterations), because we are able to make a larger update. So the basic idea behind the successive over-relaxation method is that we use this as the basis for an iteration:
\[ \phi^{k+1} = (D + \omega L)^{-1}\left( \omega b - \left( \omega U + (\omega - 1)D \right)\phi^k \right) \]
The choice of relaxation factor is not necessarily easy, and depends upon the prop-
erties of the A. For symmetric, positive definite matrices it can be proven that
0 < ω < 2 will lead to convergence, but we are generally interested in faster conver-
gence rather than just convergence.
Example 3.3:
In this example we will develop a Matlab program to solve the example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform a few
iterations of the successive over-relaxation method by hand. To begin the algorithm,
let’s provide the initial guess of φ0 = {1, 1, 1}T and we’ll choose ω = 1.2. We will
also assume that we are using the infinity norm as our measure of convergence, and
therefore, initially we have:
\[ r^0 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
with $\|r^0\|_\infty = 12$. We then work through the rows of A to update φ^1 as:
\[ \phi_1^1 = (1-\omega)\phi_1^0 + \frac{\omega}{a_{1,1}}\left( b_1 - a_{1,2}\phi_2^0 - a_{1,3}\phi_3^0 \right) = (-0.2)(1) + \frac{1.2}{2}\left( 3 - 1\times 1 - (-1)\times 1 \right) = 1.6000 \]
\[ \phi_2^1 = (1-\omega)\phi_2^0 + \frac{\omega}{a_{2,2}}\left( b_2 - a_{2,1}\phi_1^1 - a_{2,3}\phi_3^0 \right) = (-0.2)(1) + \frac{1.2}{4}\left( -5 - 1\times 1.6 - 2\times 1 \right) = -2.7800 \]
\[ \phi_3^1 = (1-\omega)\phi_3^0 + \frac{\omega}{a_{3,3}}\left( b_3 - a_{3,1}\phi_1^1 - a_{3,2}\phi_2^1 \right) = (-0.2)(1) + \frac{1.2}{6}\left( 7 - (-1)\times 1.6 - 2\times(-2.7800) \right) = 2.6320 \]
with $\|r^1\|_\infty = 5.2120$. We then repeat the procedure and update φ^2 as:
\[ \phi_1^2 = (1-\omega)\phi_1^1 + \frac{\omega}{a_{1,1}}\left( b_1 - a_{1,2}\phi_2^1 - a_{1,3}\phi_3^1 \right) = (-0.2)(1.6000) + \frac{1.2}{2}\left( 3 - 1\times(-2.7800) - (-1)\times 2.6320 \right) = 4.7272 \]
\[ \phi_2^2 = (1-\omega)\phi_2^1 + \frac{\omega}{a_{2,2}}\left( b_2 - a_{2,1}\phi_1^2 - a_{2,3}\phi_3^1 \right) = (-0.2)(-2.7800) + \frac{1.2}{4}\left( -5 - 1\times 4.7272 - 2\times 2.6320 \right) = -3.9414 \]
\[ \phi_3^2 = (1-\omega)\phi_3^1 + \frac{\omega}{a_{3,3}}\left( b_3 - a_{3,1}\phi_1^2 - a_{3,2}\phi_2^2 \right) = (-0.2)(2.6320) + \frac{1.2}{6}\left( 7 - (-1)\times 4.7272 - 2\times(-3.9414) \right) = 3.3956 \]
and then update the residual as:
\[ r^2 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 4.7272 \\ -3.9414 \\ 3.3956 \end{bmatrix} = \begin{bmatrix} 0.8825 \\ -0.7529 \\ -0.7636 \end{bmatrix} \]
with $\|r^2\|_\infty = 0.8825$. So after 2 iterations we can see that the residual norm is less
than half of what it was in the case of the Gauss-Seidel method and with a tolerance
of 10−8 the algorithm converges in 17 iterations.
In order to create a Matlab program to perform the successive over-relaxation
method we can essentially use the same algorithm that we developed to implement
the Gauss-Seidel method, simply changing the line of code where the new values for
φk+1 are updated:
while r_norm>tolerance && k<N_k
for m=1:N_row
SumA_mnphi_n = 0;
for n=1:N_col
if(n ~= m)
SumA_mnphi_n = SumA_mnphi_n + A(m,n)*phi(n);
end
end
phi(m) = (1-omega)*phi(m) + omega*(b(m) - SumA_mnphi_n)/A(m,m);
end
r = b - A*phi;
r_norm = max(abs(r));
k = k+1;
end
The complete program is given in Example3_3.m.
Having now investigated some stationary iterative methods, we will turn our
attention to nonstationary methods. With the Jacobi, Gauss-Seidel, and successive
over-relaxation methods, the observation can be made that in performing an itera-
tion, the computations involve the current or old estimates of φ and the entries in
A and b, which are constant at each iteration (and in fact do not change through-
out the entire computation). With nonstationary methods on the other hand, the
computations involve information that does change at each iteration.
3.4 Steepest Descent Method
The first nonstationary method that we will consider is the method of steepest descent, which is based on the idea of the quadratic form:
\[ f(\phi) = \frac{1}{2}\phi^T A\phi - b^T\phi + c \]
Figure 3.1: Quadratic form for an example system involving two equations.
As can be observed the quadratic form f is a scalar, quadratic function of the vector
φ and c is a scalar. If φ happened to be a 2 × 1 column vector, then we could plot
the quadratic form as shown in Figure 3.1. It can be observed that f takes the
shape of a paraboloid and you may well wonder why this is the case, or if this is
always the case? The answer is that it will always be a paraboloid if A is positive
definite. If A is not positive definite then we end up with other surfaces such as
inverted paraboloids or hyperboloids and the method will not work. So bearing in
mind that our method is restricted to positive definite systems we can proceed to
computing the gradient of f :
\[ \frac{\partial f}{\partial\phi} = \left\{ \frac{\partial f}{\partial\phi_1}, \frac{\partial f}{\partial\phi_2}, \cdots, \frac{\partial f}{\partial\phi_M} \right\} \]
So while the quadratic form is a scalar valued function, the gradient of the quadratic form is a vector-valued function and can be written as:
\[ \frac{\partial f}{\partial\phi} = \frac{1}{2}A^T\phi + \frac{1}{2}A\phi - b \]
and if A is symmetric, this reduces to:
\[ \frac{\partial f}{\partial\phi} = A\phi - b \qquad (3.4) \]
Remembering that we can find the minimum of a function by setting its derivative to
zero; the minimum of the quadratic form occurs when Aφ = b. The important point
here is that the solution to our linear system of equations occurs at the minimum
of the quadratic form. With this in mind (and noting that we are now restricting
ourselves to symmetric positive definite systems) the basic idea behind the method
of steepest descent can be thought of as specifying an initial guess for φ (which will
define a point on the surface) and iteratively stepping our way down the surface
until we reach the bottom. At this point we will have found the solution. Now
earlier we made the definition:
rk = b − Aφk
where rk is a vector that indicates how far away we are from the correct value of
b. More importantly however, by examination of Equation 3.4 we can infer that
the residual is actually the negative of the gradient of f , so we can think of the
residual as the direction of steepest descent. Given an initial point on the surface,
the algorithm works by taking a step in the direction of steepest descent:
\[ \phi^{k+1} = \phi^k + \alpha r^k \]
where α is the length of the step that we take. A useful analogy is to imagine that we are trying to walk down to the bottom of a valley; wherever we happen to be standing there will be a direction of steepest descent and we will work our way
down by walking in that direction for a while, then reexamining the new direction
of steepest descent and walking in that direction for a while, and so on. Now, if
we could ‘see’ the bottom then of course we would head in that direction (and our
problem would of course already be solved), but to continue with the analogy, let’s
further assume that it’s a foggy day, so we can’t see the bottom of the valley, or
anything much around us. All we have is the direction of steepest descent based
on where we are currently standing. When we take a step we are committed to
walking a distance α in that direction and what we would like is to choose α such
that we make our way as far down the valley as possible, but of course not walk so
far that we begin making our way up the other side of the valley. Put another way
we want to choose α to minimize f along our direction of steepest descent. From
basic calculus, α minimizes f when the directional derivative $\frac{d}{d\alpha}f(\phi^{k+1})$ is equal to zero. By the chain rule we can write:
\[ \frac{d\phi^{k+1}}{d\alpha} = r^k \]
and that:
\[ \frac{\partial f(\phi^{k+1})}{\partial\phi^{k+1}} = -r^{k+1} \]
So, by setting the directional derivative to zero, we find that α should minimize f when $r^k$ and $r^{k+1}$ are orthogonal:
\[ (r^{k+1})^T r^k = 0 \]
\[ \left( b - A\phi^{k+1} \right)^T r^k = 0 \]
\[ \left( b - A(\phi^k + \alpha r^k) \right)^T r^k = 0 \]
\[ \left( b - A\phi^k \right)^T r^k - \alpha (Ar^k)^T r^k = 0 \]
\[ \left( b - A\phi^k \right)^T r^k = \alpha (Ar^k)^T r^k \]
\[ (r^k)^T r^k = \alpha (r^k)^T (Ar^k) \]
\[ \alpha = \frac{(r^k)^T r^k}{(r^k)^T A r^k} \]
At this point our method of steepest descent is complete.
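Putting these pieces together, a minimal Matlab sketch of the resulting iteration (assuming that phi, r, r_norm, the tolerance, and a maximum iteration count N_k have been initialized) might look like:
while r_norm>tolerance && k<N_k
    alpha = (r'*r)/(r'*A*r);   % step length along the direction of steepest descent
    phi = phi + alpha*r;       % take the step
    r = b - A*phi;             % recompute the residual
    r_norm = norm(r);          % two norm for the convergence check
    k = k+1;
end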
Example 3.4:
In this example we will develop a Matlab program to solve the example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform a few
iterations of the steepest descent method by hand. To begin the algorithm, let’s
provide the initial guess of φ0 = {1, 1, 1}T . This time however we’ll use the two
norm as our measure of convergence, and therefore, initially we have:
\[ r^0 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
with $\|r^0\|_2 = (1^2 + (-12)^2 + 0^2)^{\frac{1}{2}} = 12.0416$. We then compute α as:
\[ \alpha = \frac{(r^0)^T r^0}{(r^0)^T A r^0} = \frac{\begin{bmatrix} 1 & -12 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix}}{\begin{bmatrix} 1 & -12 & 0 \end{bmatrix}\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix}} = 0.2617 \]
and update φ^1 as:
\[ \phi^1 = \phi^0 + \alpha r^0 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} + 0.2617\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} = \begin{bmatrix} 1.2617 \\ -2.1404 \\ 1.0000 \end{bmatrix} \]
with $\|r^1\|_2 = (3.6174^2 + 0.3015^2 + 6.5433^2)^{\frac{1}{2}} = 7.4827$. We then repeat the procedure
and compute α as:
\[ \alpha = \frac{(r^1)^T r^1}{(r^1)^T A r^1} = \frac{\begin{bmatrix} 3.6173 & 0.3014 & 6.5433 \end{bmatrix}\begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix}}{\begin{bmatrix} 3.6173 & 0.3014 & 6.5433 \end{bmatrix}\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix}} = 0.2275 \]
and update φ^2 as:
\[ \phi^2 = \phi^1 + \alpha r^1 = \begin{bmatrix} 1.2617 \\ -2.1404 \\ 1.0000 \end{bmatrix} + 0.2275\begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix} = \begin{bmatrix} 2.0846 \\ -2.0718 \\ 2.4886 \end{bmatrix} \]
So it can be observed that each iteration involves computing a new value for α
based on the old residuals, updating φ, and computing the new residual. Here we
are using the Matlab function norm to compute the two norm. This function can
in fact compute other norms such as the infinity norm (so we could have used it in
previous examples if we’d wished) by passing in a second argument to the function,
but defaults to computing the two norm.
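For instance, the infinity norm used in the earlier examples could have been computed with a call along the lines of:
r_norm = norm(r, Inf);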
The complete program is given in Example3_4.m.
Figure 3.2: A comparison of the iterations of the steepest descent and conjugate
gradient methods for the example system involving two equations, the quadratic
form of which is shown in Figure 3.1
3.5 Conjugate Gradient Method
The conjugate gradient method uses a similar idea to the method of steepest descent to solve a linear system, but removes one of its major disadvantages, namely
that steepest descent often finds itself taking steps in the same direction as previous
steps (Figure 3.2(a)). A much better approach would be to select a set of search
directions {d0 , d1 , d2 , · · · dM } and take exactly one step in each direction. Further-
more each step we take will be of the right length to obtain the correct value of one
part of the solution φm , then after M iterations we will have found the minimum of
f (i.e. solved Aφ = b). Using the previous analogy of trying to walk down to the
bottom of the valley, the method of steepest descent is analogous to walking down
the hill in a ‘zig zag’ pattern that ‘say’ a skier might take. The idea behind the
conjugate gradient method on the other hand is analogous to ‘heading south, by the
correct amount’, then ‘heading east by the correct amount’. Now obviously from
Figures 3.2(a) and 3.2(b) we our basing our analogy on a system with 2 unknowns,
but the concept extends to higher dimensions (it just becomes harder to visualize).
So, our update for φk+1 is given by:
φk+1 = φk + αdk
which is similar to the steepest descent method, but here we are stepping along
our search direction, rather than along the direction of the residual. We will also
choose α to minimize f along our search direction as:
\[ (r^{k+1})^T d^k = 0 \]
\[ \left( b - A\phi^{k+1} \right)^T d^k = 0 \]
\[ \left( b - A(\phi^k + \alpha d^k) \right)^T d^k = 0 \]
\[ \left( b - A\phi^k \right)^T d^k - \alpha (Ad^k)^T d^k = 0 \]
\[ \left( b - A\phi^k \right)^T d^k = \alpha (Ad^k)^T d^k \]
\[ (r^k)^T d^k = \alpha (d^k)^T A d^k \]
\[ \alpha = \frac{(d^k)^T r^k}{(d^k)^T A d^k} \qquad (3.6) \]
So the big question now is how we go about constructing our search directions.
Referring back to Figure 3.2(b), it can be observed that with each step we are
updating the error vector as:
ek+1 = ek + αdk
and using our definition of the residual vector rk = −Aek , this implies that at each
step the new residual is being updated as:
\[ r^{k+1} = -Ae^{k+1} = -A(e^k + \alpha d^k) = r^k - \alpha A d^k \qquad (3.7) \]
More importantly, the new error vector is orthogonal to the previous search direction:
(dk )T ek+1 = 0
Of course this fact is of no practical use to us in terms of constructing the search
directions, because if we ever happened to know the error vector at any iteration, the
problem would immediately be solved. What we can do however is choose that the
search directions be A-orthogonal to one another, rather than orthogonal. This is a
key feature of the conjugate gradient method and implies that rather than choosing:
(dk )T dk+1 = 0
we have:
(dk )T Adk+1 = 0
This also has the effect of making the error vector A-orthogonal to the previous
search directions, rather than orthogonal:
(dk )T Aek+1 = 0
Using our definition of the residual vector, we see that this choice means that the
new residual will be orthogonal to the previous search direction:
−(dk )T rk+1 = 0
So once we take a step in a search direction, we need never step in that direction
again; the error vector is evermore A-orthogonal to all of the old search directions
and hence the residual is evermore orthogonal to all of the old search directions.
At this point, all that is missing is a way to construct a set of A-orthogonal
search directions. As it happens, we’ve already come across just such a technique in
Chapter 2; namely the Gram-Schmidt process, which we used in performing the QR
decomposition. Here however, instead of using the columns of A to construct the set
of orthonormal vectors (which formed the columns of Q), we will use the residual vectors r^k to construct a set of A-orthogonal vectors. As such, the Gram-Schmidt process to compute a new search direction is:
\[ d^{k+1} = r^{k+1} - \sum_{n=0}^{k}\frac{(r^{k+1})^T A d^n}{(d^n)^T A d^n}d^n \qquad (3.8) \]
where it can be observed that the projection operator is different compared to Equa-
tion 2.1 since we are constructing A-orthogonal search vectors rather than orthogonal
vectors. Also, because we are not interested in an orthonormal set of vectors, we
don’t need to worry about the normalization step as we did with QR decomposition.
Now, because the search vectors are built from the residuals, and because each
residual is orthogonal to the previous search direction, each residual is also orthog-
onal to the previous residuals, hence a result that we will find useful shortly is
that:
(rk )T (rn ) = 0 k 6= n
Furthermore, if we pre-multiply Equation 3.8 by $(r^{k+1})^T$ we get another useful result that:
\[ (r^{k+1})^T d^{k+1} = (r^{k+1})^T r^{k+1} \qquad (3.9) \]
since the residual is orthogonal to all of the previous search directions. Writing Equation 3.8 more compactly, the new search direction is:
\[ d^{k+1} = r^{k+1} + \sum_{n=0}^{k}\beta_{k,n}d^n \]
where:
\[ \beta_{k,n} = -\frac{(r^{k+1})^T A d^n}{(d^n)^T A d^n} \qquad (3.10) \]
One feature of the Gram-Schmidt process is that we need to store each search
direction in order to evaluate each βk,n term and hence compute a new search di-
rection. This is an undesirable feature of the method because we don’t want to be
storing all of these vectors in memory, especially for a large system of equations.
Even if A happens to be a sparse matrix where we may be able to achieve substan-
tial savings by using a sparse matrix storage format, this wouldn’t help with the
fact that we would still need to store all of the old search directions, and this would
make the Conjugate Gradient method less attractive as an iterative method. As it
happens however, we can actually eliminate the need to store all of the old search
directions and to show how this is possible, we begin by taking Equation 3.7 and
pre-multiply by (rk+1 )T to get:
\[ (r^{k+1})^T A d^n = \frac{1}{\alpha^n}\left( (r^{k+1})^T r^n - (r^{k+1})^T r^{n+1} \right) \qquad (3.11) \]
Substituting into Equation 3.10 we get:
\[ \beta_{k,n} = -\frac{(r^{k+1})^T r^n - (r^{k+1})^T r^{n+1}}{\alpha^n (d^n)^T A d^n} \qquad (3.12) \]
But because all of our residuals are orthogonal, the only instance where Equation 3.12 is nonzero is when n = k (i.e. for the latest search direction), where it is given by:
\[ \beta_{k,k} = \frac{(r^{k+1})^T r^{k+1}}{\alpha^k (d^k)^T A d^k} \qquad (3.13) \]
All other values of n < k will result in inner products of residuals from previous iterations, which we know are orthogonal, and hence the numerator will be zero. Now, finally we can substitute Equation 3.9 into Equation 3.6 so that we can evaluate α at any iteration as:
\[ \alpha = \frac{(r^k)^T r^k}{(d^k)^T A d^k} \qquad (3.14) \]
and finally we can substitute Equation 3.14 into Equation 3.13 so that we can
evaluate β at any iteration as:
\[ \beta = \frac{(r^{k+1})^T r^{k+1}}{(r^k)^T r^k} \]
So at this point our conjugate gradient method is complete and will be similar to
that of steepest descent apart from the fact that after updating the residual, we will
need to update β and then compute a new search direction, before beginning the
next iteration.
Example 3.5:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform a few
iterations of the conjugate gradient method by hand. To begin the algorithm, let’s
provide the initial guess of φ0 = {1, 1, 1}T . As with the steepest descent method
we will use the two norm as our measure of convergence, and therefore, initially we
have:
\[ r^0 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
with $\|r^0\|_2 = (1^2 + (-12)^2 + 0^2)^{\frac{1}{2}} = 12.0416$. Furthermore, our initial search direction will be equal to the initial residual d^0 = {1, −12, 0}^T. We then compute α as:
\[ \alpha = \frac{(r^0)^T r^0}{(d^0)^T A d^0} = \frac{\begin{bmatrix} 1 & -12 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix}}{\begin{bmatrix} 1 & -12 & 0 \end{bmatrix}\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix}} = 0.2617 \]
We then update φ^1 = φ^0 + αd^0 = {1.2617, −2.1404, 1.0000}^T and the residual as:
\[ r^1 = r^0 - \alpha A d^0 = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} - 0.2617\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} = \begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix} \]
with $\|r^1\|_2 = (3.6173^2 + 0.3014^2 + 6.5433^2)^{\frac{1}{2}} = 7.4827$. We then compute the projection onto
the previous residual:
\[ \beta = \frac{(r^1)^T r^1}{(r^0)^T r^0} = \frac{\begin{bmatrix} 3.6173 & 0.3014 & 6.5433 \end{bmatrix}\begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix}}{\begin{bmatrix} 1 & -12 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix}} = 0.3861 \]
and then compute the new search direction:
\[ d^1 = r^1 + \beta d^0 = \begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix} + 0.3861\begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} = \begin{bmatrix} 4.0035 \\ -4.3323 \\ 6.5433 \end{bmatrix} \]
We then repeat the procedure for the second iteration, computing α = 0.3423, updating φ^2 = φ^1 + αd^1, and then updating the residual as:
\[ r^2 = r^1 - \alpha A d^1 = \begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix} - 0.3423\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 4.0035 \\ -4.3323 \\ 6.5433 \end{bmatrix} = \begin{bmatrix} 4.5994 \\ 0.3833 \\ -2.5603 \end{bmatrix} \]
with $\|r^2\|_2 = (4.5994^2 + 0.3833^2 + (-2.5603)^2)^{\frac{1}{2}} = 5.2780$, compute the projection
onto the previous residual:
\[ \beta = \frac{(r^2)^T r^2}{(r^1)^T r^1} = \frac{\begin{bmatrix} 4.5994 & 0.3833 & -2.5603 \end{bmatrix}\begin{bmatrix} 4.5994 \\ 0.3833 \\ -2.5603 \end{bmatrix}}{\begin{bmatrix} 3.6173 & 0.3014 & 6.5433 \end{bmatrix}\begin{bmatrix} 3.6173 \\ 0.3014 \\ 6.5433 \end{bmatrix}} = 0.4975 \]
and then compute the new search direction:
\[ d^2 = r^2 + \beta d^1 = \begin{bmatrix} 4.5994 \\ 0.3833 \\ -2.5603 \end{bmatrix} + 0.4975\begin{bmatrix} 4.0035 \\ -4.3323 \\ 6.5433 \end{bmatrix} = \begin{bmatrix} 6.5912 \\ -1.7720 \\ 0.6951 \end{bmatrix} \]
We then repeat the procedure for the third and final iteration and compute α as:
\[ \alpha = \frac{(r^2)^T r^2}{(d^2)^T A d^2} = \frac{\begin{bmatrix} 4.5994 & 0.3833 & -2.5603 \end{bmatrix}\begin{bmatrix} 4.5994 \\ 0.3833 \\ -2.5603 \end{bmatrix}}{\begin{bmatrix} 6.5912 & -1.7720 & 0.6951 \end{bmatrix}\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 6.5912 \\ -1.7720 \\ 0.6951 \end{bmatrix}} = 0.4292 \]
The final update φ^3 = φ^2 + αd^2 = {5.4615, −4.3846, 3.5385}^T is then the exact solution computed with all of the other methods studied. In
contrast to steepest descent which took 78 iterations to get to the same solution,
the conjugate gradient method only took 3; an obvious advantage! In order to
create a Matlab program to perform the conjugate gradient algorithm we will have
something very similar to the steepest descent code, the major difference being that
we will have the extra step of computing β at each iteration, plus we will have to
store both an ‘old’ and a ‘new’ residual vector. That being said, we can still create
quite a concise algorithm:
while r_norm>tolerance && k<N_k
alpha = (r_old’*r_old)./(d’*A*d);
phi = phi + alpha*d;
r = r_old - alpha*A*d;
beta = (r’*r)/(r_old’*r_old);
d = r + beta*d;
r_old = r;
r_norm = norm(r);
k = k+1;
end
while(r_norm>tolerance && k<N_k)
{
    // compute the matrix-vector product Ad = A*d
    for(m=0; m<N_row; m++)
    {
        Ad[m] = 0.0;
        for(n=0; n<N_col; n++)
        {
            Ad[m] += A[m][n]*d[n];
        }
    }
    // compute the inner product d'*A*d
    dTAd = 0.0;
    for(m=0; m<N_row; m++)
    {
        dTAd += d[m]*Ad[m];
    }
    alpha = r_oldTr_old/dTAd;
for(m=0; m<N_row; m++)
{
phi[m] += alpha*d[m];
}
for(m=0; m<N_row; m++)
{
r[m] = r_old[m] - alpha*Ad[m];
}
rTr = 0.0;
for(m=0; m<N_row; m++)
{
rTr += r[m]*r[m];
}
beta = rTr/r_oldTr_old;
for(m=0; m<N_row; m++)
{
d[m] = r[m] + beta*d[m];
}
for(m=0; m<N_row; m++)
{
r_old[m] = r[m];
}
r_oldTr_old = rTr;
r_norm = sqrt(rTr);
k++;
}
Here it can be observed that the major difference compared to the equivalent Matlab
code is that we have to perform the matrix-vector multiplications to compute dT Ad
via nested for loops and the vector addition and subtraction within single for
loops. Another interesting point to note is that when we take the inner product
of the new residual vector rT r (stored in the variable rTr) we have in fact almost
computed the two norm, since it is defined as the square root of the sum of the
squares of each term in r. We can therefore compute the two norm by simply taking
the square root of rTr, which saves on the amount of computation compared to
evaluating the residual from r = b − Aφ.
The complete programs are given in Example3_5.m and Example3_5.cpp.
A few issues are worth mentioning. Firstly, although in the example presented here a system with 3 unknowns required 3 iterations, you might think that this is more similar to a direct method, where we know how many computations we'll need to perform in order to obtain a solution. For larger systems however, we will
typically require fewer iterations than the number of unknowns for the residual norm
to drop to an acceptable level. For example, for a system with 106 unknowns we
might be able to get a converged solution in only a few hundred iterations. We
are however guaranteed to have an exact solution after 106 iterations, but we’d
never want to perform that many in practice. The second issue worth pointing
out is that there are a number of variations of this basic method, allowing for its
application to asymmetric systems and adding in preconditioning (i.e. multiplying
A by another matrix in order to produce a modified system with a better condition
number). For the interested reader an excellent reference for more detailed aspects of
conjugate gradient methods can be found in the book by Shewchuk [74]. In Matlab
the functions bicg, pcg (plus others) provide implementations of these variations.
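As a hedged illustration (the argument list shown here is an assumption to be checked against the Matlab documentation), a call to pcg might look like:
phi = pcg(A, b, 1e-8, 1000);   % solve A*phi = b to a tolerance of 1e-8, with at most 1000 iterations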
3.6 Generalized Minimal Residual Method
The generalized minimal residual (GMRES) method is another Krylov subspace method, but unlike the conjugate gradient method it can be applied to general (i.e. non-symmetric) matrices. Before outlining the method we need to introduce a few concepts. Given a collection of vectors v_1, v_2, ⋯, v_N, a linear combination is an expression of the form:
\[ a_1 v_1 + a_2 v_2 + \cdots + a_N v_N \]
where the coefficients an are scalar quantities and we want to be able to express
other vectors in terms of linear combinations of this collection. We then define the
span to be the set of all linear combinations of the collection of vectors, which is
always a subspace of RD . Take for example the three dimensional Euclidean space
R3 , which we commonly deal with. If we had the collection of vectors v1 = {2, 5, 3},
v2 = {1, 1, 1}, then the span of this collection would be the subspace of R3 consisting
of all linear combinations of these vectors which in fact defines a plane in R3 with a
normal vector n = v1 × v2 . So with these two vectors we can only describe another
vector within this plane. If we now consider the collection of vectors v1 = {1, 0, 0},
v2 = {0, 1, 0}, v3 = {0, 0, 1} then the span of this collection would be all of R3
because every vector can be written as a linear combination of these vectors. For
this reason we can say that this collection of vectors also forms a basis for this space.
So in a sense the span indicates how much of the space is available to us.
Returning now to the Krylov subspace, we can see that the collection of vectors
are formed by repeatedly multiplying the residual vector r by the matrix A. Fur-
thermore, it can be observed that at each iteration another vector is added to the
collection. If we ‘stack’ all of these column vectors together then we can form the
Krylov matrix as:
\[ K_k = \begin{bmatrix} r^0 & Ar^0 & A^2 r^0 & \cdots & A^{k-1}r^0 \end{bmatrix} \]
and we can then write our iterative update in terms of this matrix as:
\[ \phi^k = \phi^0 + K_k\alpha \qquad (3.15) \]
where α is a column vector of coefficients.
Using the column vectors of the Krylov matrix can lead to some problems how-
ever; and this is related to the fact that there’s no guarantee that these vectors will
be orthogonal. So what we then need to do is compute a set of vectors which form
a basis for this space. So instead we will write our iterative update as:
φk = φ0 + Qα (3.16)
where Q will be the matrix of orthonormal vectors which form a basis for the Krylov
subspace. Another important and related concept in the GMRES method is the
Hessenberg factorization of a matrix:
A = QHQH
where we have:
\[ Q = \begin{bmatrix} q_1 & q_2 & \cdots & q_N \end{bmatrix} \qquad H = \begin{bmatrix} h_{1,1} & h_{1,2} & h_{1,3} & \cdots & h_{1,N} \\ h_{2,1} & h_{2,2} & h_{2,3} & \cdots & h_{2,N} \\ 0 & h_{3,2} & h_{3,3} & \cdots & h_{3,N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & h_{M,N-1} & h_{M,N} \end{bmatrix} \]
Here, both Q and H are M × N square matrices and we are writing the unitary
matrix as being made up of the orthogonal column vectors qn . So any matrix A can
be written as a product of a unitary matrix Q and a Hessenberg matrix H and we
can see that in order to compute the update to our solution we will need to perform
this factorization in order to compute Q. If we multiply both sides by Q we get:
AQ = QHQH Q = QH
Since for a unitary matrix QH Q = I. Now, let’s say we only consider a part of this
system:
3.6. GENERALIZED MINIMAL RESIDUAL METHOD 81
\[ Q_n = \begin{bmatrix} q_1 & q_2 & \cdots & q_n \end{bmatrix} \qquad Q_{n+1} = \begin{bmatrix} q_1 & q_2 & \cdots & q_n & q_{n+1} \end{bmatrix} \]
\[ H_n = \begin{bmatrix} h_{1,1} & h_{1,2} & h_{1,3} & \cdots & h_{1,n} \\ h_{2,1} & h_{2,2} & h_{2,3} & \cdots & h_{2,n} \\ 0 & h_{3,2} & h_{3,3} & \cdots & h_{3,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & h_{n,n-1} & h_{n,n} \\ 0 & 0 & 0 & 0 & h_{n+1,n} \end{bmatrix} \]
so that the partial factorization becomes:
\[ AQ_n = Q_{n+1}H_n \qquad (3.17) \]
where we can solve for the column vector qn+1 and obtain a recursive equation as:
\[ q_{n+1} = \frac{Aq_n - \sum_{m=1}^{n} h_{m,n}q_m}{h_{n+1,n}} \qquad (3.18) \]
It can be observed that the method in Equation 3.18 is quite similar to the Gram-
Schmidt process that we encountered in the QR decomposition and conjugate gra-
dient methods. As it happens we are in fact going to use a modified form of the
Gram-Schmidt process known as Arnoldi iteration in order to perform the Hessen-
berg factorization of Equation 3.17. In comparison then to the process outlined in
Equation 2.1 we do something very similar here:
\[ u_1 = r^0 \qquad q_1 = \frac{u_1}{\|u_1\|_2} \]
\[ u_2 = Aq_1 - \frac{q_1^T A q_1}{q_1^T q_1}q_1 \qquad q_2 = \frac{u_2}{\|u_2\|_2} \]
\[ u_3 = Aq_2 - \frac{q_1^T A q_2}{q_1^T q_1}q_1 - \frac{q_2^T A q_2}{q_2^T q_2}q_2 \qquad q_3 = \frac{u_3}{\|u_3\|_2} \]
\[ \vdots \]
\[ u_n = Aq_{n-1} - \sum_{m=1}^{n-1}\frac{q_m^T A q_{n-1}}{q_m^T q_m}q_m \qquad q_n = \frac{u_n}{\|u_n\|_2} \qquad (3.19) \]
where it can be observed that our first orthonormal vector is based on the initial
residual and furthermore, rather than constructing the sequence of orthonormal
vectors based on the columns of A as we did with the QR decomposition, here we
are multiplying the previous vector by A, which is how we obtain the collection of
vectors in the Krylov matrix. Finally we can note that again, since the qn vectors
are orthonormal, the denominator of the summation term in Equation 3.19 will
always be equal to 1 and can be ignored. So now by comparing terms in Equations
3.18 and 3.19 we can infer that the terms on and above the main diagonal in H are given by:
\[ h_{m,n} = q_m^T A q_n \]
while the entries on the first subdiagonal are given by $h_{n+1,n} = \|u_{n+1}\|_2$.
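As a rough sketch (not the book's own implementation), one step of this process could be written in Matlab as follows, assuming that the first n orthonormal vectors are stored as the columns Q(:,1:n) and that the modified Gram-Schmidt form of the update is used:
u = A*Q(:,n);                  % multiply the previous vector by A
for m=1:n
    H(m,n) = Q(:,m)'*u;        % entries on and above the main diagonal
    u = u - H(m,n)*Q(:,m);     % subtract the projection onto each previous vector
end
H(n+1,n) = norm(u);            % subdiagonal entry
Q(:,n+1) = u/H(n+1,n);         % normalize to obtain the next orthonormal vector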
Then, making use of the partial Hessenberg factorization in Equation 3.17, we get:
\[ d = Q_{n+1}^H r^0 = \begin{bmatrix} q_1^H \\ q_2^H \\ \vdots \\ q_{n+1}^H \end{bmatrix} r^0 = \begin{bmatrix} q_1^H r^0 \\ q_2^H r^0 \\ \vdots \\ q_{n+1}^H r^0 \end{bmatrix} \]
Now since the first column vector q1 is based on the initial residual; and since all of
the subsequent qn vectors are orthogonal to q1 , they will also be orthogonal to r0
and hence d is in fact:
\[ d = \begin{bmatrix} \|r^0\|_2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \]
Now from inspection of Equation 3.21 we can see that the residual will be a
minimum when H_nα = d. But the important thing to note here is that H_n is an (n + 1) × n matrix and so we have more equations than unknowns; i.e. an
over-determined system of equations. Now, if we could somehow remove the entries
below the main diagonal of Hn we would be left with an upper triangular matrix
(with the bottom row of Hn being all zeros) and we could then efficiently solve the
system by back substitution. As it happens there is a method to achieve this, which
is known as a Givens rotation. The idea with the technique is that we can create a
Givens matrix of the form:
\[ G(m, n, \theta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & -s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix} \]
which we can see is basically the identity matrix with the entries c = cos(θ) and
s = sin(θ) added in at a particular location (although we don’t really bother to
calculate the rotation angle θ). When we multiply our matrix by the Givens matrix,
only rows m and n will be affected and we will ‘zero’ the entry m, n in our matrix.
So in our particular case, assuming that we want to zero the entry Hn+1,n , we can
most simply compute the c and s terms by:
\[ a = \sqrt{H_{n,n}^2 + H_{n+1,n}^2} \qquad c = \frac{H_{n,n}}{a} \qquad s = \frac{H_{n+1,n}}{a} \]
So we will need to create a sequence of Givens rotation matrices to ‘zero’ each entry
below the main diagonal (i.e. Gn , ...G2 G1 Hn ) before we can solve for α and of course
we must apply this same sequence of matrices to the vector d. At this point we can
then compute the entries in α using back substitution and then finally update φk
by evaluating Equation 3.21.
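As a rough sketch (assuming that H and d are stored as full arrays and that n indexes the subdiagonal entry to be zeroed), a single Givens rotation could be constructed and applied in Matlab as:
a = sqrt(H(n,n)^2 + H(n+1,n)^2);
c = H(n,n)/a;
s = H(n+1,n)/a;
G = eye(size(H,1));              % start from the identity matrix
G(n,n) = c;     G(n,n+1) = s;    % insert the rotation entries
G(n+1,n) = -s;  G(n+1,n+1) = c;
H = G*H;                         % zeros the entry H(n+1,n)
d = G*d;                         % apply the same rotation to the right hand side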
Example 3.6:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
\[ \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} \]
Before we start writing any code however, let’s work through and perform a few
iterations of the GMRES method by hand. To begin the algorithm, let’s provide
the initial guess of φ^0 = {1, 1, 1}^T. As with the steepest descent and conjugate
gradient methods we will use the two norm as our measure of convergence, and
therefore, initially we have:
\[ r^0 = \begin{bmatrix} 3 \\ -5 \\ 7 \end{bmatrix} - \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
with $\|r^0\|_2 = (1^2 + (-12)^2 + 0^2)^{\frac{1}{2}} = 12.0416$ and we can set d_1 = 12.0416. We can
then set:
\[ u_1 = r^0 = \begin{bmatrix} 1 \\ -12 \\ 0 \end{bmatrix} \]
\[ q_1 = \frac{u_1}{\|u_1\|_2} = \frac{\{1, -12, 0\}^T}{(1^2 + (-12)^2 + 0^2)^{\frac{1}{2}}} = \begin{bmatrix} 0.0830 \\ -0.9965 \\ 0.0000 \end{bmatrix} \]
and compute:
\[ H_{1,1} = q_1^T A q_1 = \begin{bmatrix} 0.0830 & -0.9965 & 0.0000 \end{bmatrix}\begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix}\begin{bmatrix} 0.0830 \\ -0.9965 \\ 0.0000 \end{bmatrix} = 3.8207 \]
So then we have:
Q = \begin{bmatrix} 0.0830 & 0 & 0 & 0 \\ -0.9965 & 0 & 0 & 0 \\ 0.0000 & 0 & 0 & 0 \end{bmatrix} \qquad H = \begin{bmatrix} 3.8207 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
We then proceed to compute the second orthonormal vector as:
u_2 = Aq_1 - H_{1,1}q_1 = \begin{bmatrix} 2 & 1 & -1 \\ 1 & 4 & 2 \\ -1 & 2 & 6 \end{bmatrix} \begin{bmatrix} 0.0830 \\ -0.9965 \\ 0.0000 \end{bmatrix} - 3.8207 \begin{bmatrix} 0.0830 \\ -0.9965 \\ 0.0000 \end{bmatrix} = \begin{bmatrix} -1.1477 \\ -0.0956 \\ -2.0761 \end{bmatrix}

q_2 = \frac{u_2}{\|u_2\|_2} = \begin{bmatrix} -0.4834 \\ -0.0403 \\ -0.8745 \end{bmatrix}
H_{2,1} = \|u_2\|_2 = \sqrt{(-1.1477)^2 + (-0.0956)^2 + (-2.0761)^2} = 2.3742

H_{1,2} = q_1^T A q_2 = \begin{bmatrix} 0.0830 & -0.9965 & 0.0000 \end{bmatrix} A \begin{bmatrix} -0.4834 \\ -0.0403 \\ -0.8745 \end{bmatrix} = 2.3742

H_{2,2} = q_2^T A q_2 = \begin{bmatrix} -0.4834 & -0.0403 & -0.8745 \end{bmatrix} A \begin{bmatrix} -0.4834 \\ -0.0403 \\ -0.8745 \end{bmatrix} = 4.3963
So then we have:
Q = \begin{bmatrix} 0.0830 & -0.4834 & 0 & 0 \\ -0.9965 & -0.0403 & 0 & 0 \\ 0.0000 & -0.8745 & 0 & 0 \end{bmatrix} \qquad H = \begin{bmatrix} 3.8207 & 2.3742 & 0 \\ 2.3742 & 4.3963 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
We then proceed to compute the third orthonormal vector as:
u_3 = Aq_2 - H_{1,2}q_1 - H_{2,2}q_2 = \begin{bmatrix} 1.7955 \\ 0.1496 \\ -0.9995 \end{bmatrix} \qquad q_3 = \frac{u_3}{\|u_3\|_2} = \begin{bmatrix} 0.8714 \\ 0.0726 \\ -0.4851 \end{bmatrix}

H_{3,2} = \|u_3\|_2 = \sqrt{1.7955^2 + 0.1496^2 + (-0.9995)^2} = 2.0603

H_{1,3} = q_1^T A q_3 = \begin{bmatrix} 0.0830 & -0.9965 & 0.0000 \end{bmatrix} A \begin{bmatrix} 0.8714 \\ 0.0726 \\ -0.4851 \end{bmatrix} = 0.0000

H_{2,3} = q_2^T A q_3 = \begin{bmatrix} -0.4834 & -0.0403 & -0.8745 \end{bmatrix} A \begin{bmatrix} 0.8714 \\ 0.0726 \\ -0.4851 \end{bmatrix} = 2.0603

H_{3,3} = q_3^T A q_3 = \begin{bmatrix} 0.8714 & 0.0726 & -0.4851 \end{bmatrix} A \begin{bmatrix} 0.8714 \\ 0.0726 \\ -0.4851 \end{bmatrix} = 3.7830
So then we have:
Q = \begin{bmatrix} 0.0830 & -0.4834 & 0.8714 & 0 \\ -0.9965 & -0.0403 & 0.0726 & 0 \\ 0.0000 & -0.8745 & -0.4851 & 0 \end{bmatrix} \qquad H = \begin{bmatrix} 3.8207 & 2.3742 & 0.0000 \\ 2.3742 & 4.3963 & 2.0603 \\ 0 & 2.0603 & 3.7830 \\ 0 & 0 & 0 \end{bmatrix}
H_{4,3} = \|u_4\|_2 = \sqrt{(-0.0888)^2 + 0.1554^2 + 0.0000^2} \times 10^{-14} = 0.0000
So then we have:
Q = \begin{bmatrix} 0.0830 & -0.4834 & 0.8714 & -0.4961 \\ -0.9965 & -0.0403 & 0.0726 & 0.8682 \\ 0.0000 & -0.8745 & -0.4851 & 0.0000 \end{bmatrix} \qquad H = \begin{bmatrix} 3.8207 & 2.3742 & 0.0000 \\ 2.3742 & 4.3963 & 2.0603 \\ 0 & 2.0603 & 3.7830 \\ 0 & 0 & 0.0000 \end{bmatrix}
At this point we have our upper Hessenberg matrix and in order to perform the
least squares minimization we need to ‘zero’ the entries below the main diagonal.
Starting with entry H2,1 , we compute:
a_1 = \sqrt{(H_{1,1})^2 + (H_{2,1})^2} = \sqrt{3.8207^2 + 2.3742^2} = 4.4983

c_1 = \frac{H_{1,1}}{a_1} = \frac{3.8207}{4.4983} = 0.8494

s_1 = \frac{H_{2,1}}{a_1} = \frac{2.3742}{4.4983} = 0.5278
And then multiply H and d by the first Givens rotation matrix to get:
H^1 = G_1 H = \begin{bmatrix} 0.8494 & 0.5278 & 0 & 0 \\ -0.5278 & 0.8494 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 3.8207 & 2.3742 & 0.0000 \\ 2.3742 & 4.3963 & 2.0603 \\ 0 & 2.0603 & 3.7830 \\ 0 & 0 & 0.0000 \end{bmatrix} = \begin{bmatrix} 4.4983 & 4.3369 & 1.0874 \\ 0 & 2.4810 & 1.7500 \\ 0 & 2.0603 & 3.7830 \\ 0 & 0 & 0.0000 \end{bmatrix}
and:
d^1 = G_1 d = \begin{bmatrix} 0.8494 & 0.5278 & 0 & 0 \\ -0.5278 & 0.8494 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 12.0416 \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 10.2277 \\ -6.3556 \\ 0 \\ 0 \end{bmatrix}
We then compute:
a_2 = \sqrt{(H^1_{2,2})^2 + (H^1_{3,2})^2} = \sqrt{2.4810^2 + 2.0603^2} = 3.2249

c_2 = \frac{H^1_{2,2}}{a_2} = \frac{2.4810}{3.2249} = 0.7693

s_2 = \frac{H^1_{3,2}}{a_2} = \frac{2.0603}{3.2249} = 0.6389
And then multiply H 1 and d1 by the second Givens rotation matrix to get:
H^2 = G_2 H^1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0.7693 & 0.6389 & 0 \\ 0 & -0.6389 & 0.7693 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 4.4983 & 4.3369 & 1.0874 \\ 0 & 2.4810 & 1.7500 \\ 0 & 2.0603 & 3.7830 \\ 0 & 0 & 0.0000 \end{bmatrix} = \begin{bmatrix} 4.4983 & 4.3369 & 1.0874 \\ 0 & 3.2249 & 3.7631 \\ 0 & 0 & 1.7923 \\ 0 & 0 & 0 \end{bmatrix}
and:
d^2 = G_2 d^1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0.7693 & 0.6389 & 0 \\ 0 & -0.6389 & 0.7693 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 10.2277 \\ -6.3556 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 10.2277 \\ -4.8894 \\ 4.0604 \\ 0 \end{bmatrix}
Now for this particular example, H_{4,3} is already zero and so we don't need to perform another Givens rotation. At this point we can then perform the back substitution on H^2 α = d^2 as:
\alpha_3 = \frac{4.0604}{1.7923} = 2.2655

\alpha_2 = \frac{-4.8894 - 3.7631 \times 2.2655}{3.2249} = -4.1597

\alpha_1 = \frac{10.2277 - 1.0874 \times 2.2655 - 4.3369 \times (-4.1597)}{4.4983} = 5.7365
\phi^1_1 = 1.0000 + \begin{bmatrix} 0.0830 & -0.4834 & 0.8714 \end{bmatrix} \begin{bmatrix} 5.7365 \\ -4.1597 \\ 2.2655 \end{bmatrix} = 5.4615

\phi^1_2 = 1.0000 + \begin{bmatrix} -0.9965 & -0.0403 & 0.0726 \end{bmatrix} \begin{bmatrix} 5.7365 \\ -4.1597 \\ 2.2655 \end{bmatrix} = -4.3846

\phi^1_3 = 1.0000 + \begin{bmatrix} 0.0000 & -0.8745 & -0.4851 \end{bmatrix} \begin{bmatrix} 5.7365 \\ -4.1597 \\ 2.2655 \end{bmatrix} = 3.5385
This is the exact solution computed with all of the other methods studied. As with the conjugate gradient method, for a 3 × 3 system the solution will be computed in 3 iterations. For larger systems however we will find that we can usually achieve convergence in a number of iterations that is much smaller than the size of the system. Another important point is that, similar to the direct methods that we studied in Chapter 2, GMRES requires that we store Q_{n+1} and H_n. While we could potentially store an upper Hessenberg matrix in a way such that we don't need to store the zeros below its first subdiagonal (note that using the compressed row storage format wouldn't help here, because the matrix is more than half full and the additional row and column data would in fact require more storage), Q_{n+1} will be completely full and so there is no opportunity for saving on storage there.
For very large systems then, the GMRES method can require a prohibitive amount
of storage. What is usually done to circumvent this problem is to have ‘restarts’;
that is we will pick a size n < M and apply the method as usual, then start the whole
thing again from scratch, the only difference being that we will have a better initial
guess for φ0 . In order to create a Matlab program to perform the GMRES algorithm
we will therefore include an iterative while loop which will apply the restarts. Since
we are always dealing with square matrices, we need to know M , but the variable N
is redundant, so let’s therefore change our interpretation of this variable to mean the
size of Qn+1 and Hn that we are prepared to store. Then, initially within our while
loop we will begin by assigning q1 and d1 and then perform the Arnoldi iteration
as:
u_norm = norm(u);
H(n+1,n) = u_norm;
Q(:,n+1) = u/u_norm;
end
...
end
At this point, the next step is to apply the Givens rotations to Hn and d in order to
bring the upper Hessenberg matrix into an upper triangular form, which is achieved
as:
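A minimal sketch of this step is given below, assuming the variables H (the (n+1) × n upper Hessenberg matrix), d, and N from the discussion above; the corresponding code in the complete program may differ in its details:

for n = 1:N
    % Compute the Givens rotation that zeros the subdiagonal entry H(n+1,n)
    a = sqrt(H(n,n)^2 + H(n+1,n)^2);
    c = H(n,n)/a;
    s = H(n+1,n)/a;
    % Only rows n and n+1 of H are affected by the rotation
    temp     =  c*H(n,:) + s*H(n+1,:);
    H(n+1,:) = -s*H(n,:) + c*H(n+1,:);
    H(n,:)   =  temp;
    % Apply the same rotation to d
    temp     =  c*d(n) + s*d(n+1);
    d(n+1)   = -s*d(n) + c*d(n+1);
    d(n)     =  temp;
end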
Turning now to the C++ implementation, we can note that Hn contains zeros below its first subdiagonal and so we could save on storage that way. In fact what we are going to do is store Hn as two 1D arrays; one
storing the upper triangular part and one storing the part below the main diagonal.
For the upper triangular part, imagine taking all of the nonzero entries of each row,
and ‘butting’ them up so that they form a single 1D array. In this case the array
will be of size N (N + 1)/2. The catch with this idea however is that if we’re storing
this part of Hn in a 1D array we won’t be able to access entries with the [m][n]
indexing anymore. In fact what we will do is ‘map’ from this indexing to a single
linear index, which will be achieved by:
\mathrm{index}(m, n) = \frac{m(2N - m - 1)}{2} + n

remembering that this is for 'zero-based' indexing (for example, with N = 4 the entry (m, n) = (1, 2) maps to index 1(2 \times 4 - 1 - 1)/2 + 2 = 5). For the part below the main
diagonal, our 1D array will simply be of size N and will require no special mapping
to access elements. This approach is a good example illustrating the separation of
the mathematical notion of a matrix, from the way it is stored in computer memory
(i.e. here we’re storing a matrix, just not with a 2D array). Our C++ program will
follow the same procedure as the Matlab implementation:
while(r_norm>tolerance && k<N_k)
{
    for(int m=0; m<N_row; m++)
    {
        Q[m][0] = r[m]/r_norm;
    }
    for(n=0; n<N_col+1; n++)
    {
        d[n] = 0.0;
    }
    d[0] = r_norm;
    for(n=0; n<N_col; n++)
    {
        // Compute u = A*q_n
        for(int m=0; m<N_row; m++)
        {
            AQn = 0.0;
            for(int o=0; o<N_row; o++)
            {
                AQn += A[m][o]*Q[o][n];
            }
            u[m] = AQn;
        }
        // Compute the entries on and above the main diagonal of H_n
        for(int m=0; m<=n; m++)
        {
            uTQm = 0.0;
            for(int o=0; o<N_row; o++)
            {
                uTQm += u[o]*Q[o][m];
            }
            index1 = (2*N_col-m-1)*(m)/2+n;
            H_u[index1] = uTQm;
Where the major difference is the additional for loops required for performing the
matrix-vector multiplication and vector inner products. As can be observed in this
code ‘snippet’ we are storing the upper triangular part of Hn in a 1D array H_u
and the part below the main diagonal in the 1D array H_l. At this point we have
the completely assembled Qn+1 and Hn and of course the next step is to apply the
sequence of Givens rotation matrices to Hn and d in order to bring the system into
upper triangular form.
The more observant readers may have noticed that when a matrix is multiplied by a Givens rotation matrix, only two rows are modified by the matrix-matrix multiplication. This gives us another opportunity to save on storage and to reduce the number of computations required to perform the multiplication. For each multiplication by a Givens rotation matrix, only rows n and n + 1 are affected and we can compute the entries in the o-th column of these two rows as:
H^1_{n,o} = cH_{n,o} + sH_{n+1,o}

H^1_{n+1,o} = -sH_{n,o} + cH_{n+1,o}
{
    ...
    for(n=0; n<N_col; n++)
    {
        // Compute the Givens rotation that zeros the subdiagonal entry H(n+1,n)
        index1 = n*(2*N_col-n-1)/2+n;
        a = sqrt(H_u[index1]*H_u[index1] + H_l[n]*H_l[n]);
        c = H_u[index1]/a;
        s = H_l[n]     /a;
        H_u[index1] = c*H_u[index1] + s*H_l[n];
        H_l[n] = 0.0;
        // Only rows n and n+1 are affected by the rotation
        for(int o=n+1; o<N_col; o++)
        {
            index1 = n    *(2*N_col-n-1)/2+o;
            index2 = (n+1)*(2*N_col-n-2)/2+o;
            tempH_u     =  c*H_u[index1] + s*H_u[index2];
            H_u[index2] = -s*H_u[index1] + c*H_u[index2];
            H_u[index1] = tempH_u;
        }
        // Apply the same rotation to the vector d
        tempd  =  c*d[n] + s*d[n+1];
        d[n+1] = -s*d[n] + c*d[n+1];
        d[n]   =  tempd;
    }
    ...
}
At this point all that remains is to perform the back substitution for α and then to update φ^k. This will be done in pretty much the same way as in the Matlab implementation, the only significant difference being the indexing into the upper triangular part of Hn and the extra work involved in performing the matrix-vector multiplications:
while(r_norm>tolerance && k<N_k)
{
    ...
    // Back substitution for alpha
    index1 = (N_col-1)*(2*N_col-N_col)/2+(N_col-1);
    alpha[N_col-1] = d[N_col-1]/H_u[index1];
    for(int m=N_col-2; m>=0; m--)
    {
        Halpha = 0.0;
        for(int o=m+1; o<N_col; o++)
        {
            index1 = m*(2*N_col-m-1)/2+o;
            Halpha += H_u[index1]*alpha[o];
        }
        index1 = m*(2*N_col-m-1)/2+m;
        alpha[m] = (d[m] - Halpha) / H_u[index1];
    }
    // Update the solution phi = phi + Q_n*alpha
    for(int m=0; m<N_row; m++)
    {
        for(int o=0; o<N_col; o++)
        {
            phi[m] += Q[m][o]*alpha[o];
        }
    }
    // Recompute the residual and its norm
    r_norm = 0.0;
    for(int m=0; m<N_row; m++)
    {
        Aphim = 0.0;
        for(int o=0; o<N_row; o++)
        {
            Aphim += A[m][o]*phi[o];
        }
        r[m] = b[m] - Aphim;
        r_norm += r[m]*r[m];
    }
    r_norm = sqrt(r_norm);
    k += N_col;
}
One of the important advantages of GMRES is that it can be used to solve a system with any real square coefficient matrix, and this is of practical importance in a number of applications where the numerical methods result in asymmetric matrices that couldn't be solved by, say, the conjugate gradient method (although they could be solved by one of its variants). Another desirable feature of GMRES is that the residual norm never increases from one iteration to the next, as opposed to, say, the steepest descent or conjugate gradient methods where the residual norm may increase every now and then. As a
final note before continuing on, we can compute the solution to a linear system with
the GMRES method using the Matlab function gmres.
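For the 3 × 3 example system above, such a call might look something like the following sketch (the restart length, tolerance, and maximum number of outer iterations shown are just illustrative values):

A = [2 1 -1; 1 4 2; -1 2 6];
b = [3; -5; 7];
% gmres(A, b, restart, tol, maxit)
phi = gmres(A, b, 3, 1e-8, 10);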
Chapter 4

Nonlinear and Coupled Systems

Up until this point we have been dealing with linear systems of equations, which we have written in the form:

Aφ = b

If we extend to the more general nonlinear case then we will write our system as:
f (φ) = 0 (4.1)
f(\phi + \Delta\phi) = f(\phi) + \frac{\Delta\phi}{1!}\left.\frac{df}{d\phi}\right|_{\phi} + \frac{\Delta\phi^2}{2!}\left.\frac{d^2f}{d\phi^2}\right|_{\phi} + O(\Delta\phi^3)

If we extend this to a vector of functions:

f(\phi + \Delta\phi) = f(\phi) + \frac{\Delta\phi}{1!}J + \frac{\Delta\phi^2}{2!}H + O(\Delta\phi^3)
where:

J = \begin{bmatrix} \frac{\partial f_1}{\partial \phi_1} & \frac{\partial f_1}{\partial \phi_2} & \cdots & \frac{\partial f_1}{\partial \phi_M} \\ \frac{\partial f_2}{\partial \phi_1} & \frac{\partial f_2}{\partial \phi_2} & \cdots & \frac{\partial f_2}{\partial \phi_M} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_M}{\partial \phi_1} & \frac{\partial f_M}{\partial \phi_2} & \cdots & \frac{\partial f_M}{\partial \phi_M} \end{bmatrix}

is known as the Jacobian matrix and H, the corresponding array of second derivatives, is known as the Hessian. If we 'ignore' the second order terms and higher, and require that f(φ + ∆φ) = 0, we get the approximate relation:
∆φ = −J −1 f
So to update our solution, we are effectively solving the linear system:
J∆φ = −f
after which we update our estimate of the solution as:

φ^{k+1} = φ^k + ∆φ
When we were dealing with a linear system, we defined the residual as rk =
b − Aφk . When we’re dealing with a nonlinear system then the residual vector will
in fact be the vector of functions evaluated at the current guess for φ:
rk = f (φk )
And since we need to compute this in order to perform the outer iterations, this
makes computing the residual even easier than in the case of a linear system.
Example 4.1:
In this example we will develop a Matlab program to solve the example nonlinear
system:
f(\phi) = \begin{bmatrix} \phi_1 + \phi_2 - \phi_1\phi_2 + 2 \\ \phi_1 e^{-\phi_2} - 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
Clearly this system is nonlinear due to the φ1 φ2 and the φ1 e−φ2 terms and so we
can’t write it in the form Aφ = b. Before we start writing any code however, let’s
work through and perform a few iterations of Newton’s method by hand. To begin
the algorithm, let’s provide the initial guess of of φ0 = {0, 0}T . We will also assume
that we are using the infinity norm as our measure of convergence, and therefore,
initially we have:
r^0 = f(\phi^0) = \begin{bmatrix} 0 + 0 - 0 \times 0 + 2 \\ 0 \times e^{-0} - 1 \end{bmatrix} = \begin{bmatrix} 2 \\ -1 \end{bmatrix}
with \|r^0\|_\infty = 2. We will also define the Jacobian as:

J = \begin{bmatrix} \frac{\partial f_1}{\partial \phi_1} & \frac{\partial f_1}{\partial \phi_2} \\ \frac{\partial f_2}{\partial \phi_1} & \frac{\partial f_2}{\partial \phi_2} \end{bmatrix} = \begin{bmatrix} 1 - \phi_2 & 1 - \phi_1 \\ e^{-\phi_2} & -\phi_1 e^{-\phi_2} \end{bmatrix}
We can then solve for ∆φ as:

\Delta\phi = -J^{-1}f = -\begin{bmatrix} 1-0 & 1-0 \\ e^{0} & -0e^{0} \end{bmatrix}^{-1} \begin{bmatrix} 0+0-0\times 0+2 \\ 0\times e^{-0}-1 \end{bmatrix} = -\begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{-1} \begin{bmatrix} 2 \\ -1 \end{bmatrix} = \begin{bmatrix} 1 \\ -3 \end{bmatrix}

and update \phi^1 = \phi^0 + \Delta\phi = \{1, -3\}^T. Evaluating the residual at this new estimate gives r^1 = f(\phi^1) = \{3.0000, 19.0855\}^T with \|r^1\|_\infty = 19.0855. We then reevaluate the Jacobian and solve for \Delta\phi as:

\Delta\phi = -J^{-1}f = -\begin{bmatrix} 1-(-3) & 1-1 \\ e^{-(-3)} & -1\times e^{-(-3)} \end{bmatrix}^{-1} \begin{bmatrix} 3.0000 \\ 19.0855 \end{bmatrix} = -\begin{bmatrix} 4.0000 & 0 \\ 20.0855 & -20.0855 \end{bmatrix}^{-1} \begin{bmatrix} 3.0000 \\ 19.0855 \end{bmatrix} = \begin{bmatrix} -0.7500 \\ 0.2002 \end{bmatrix}

and update \phi^2 = \phi^1 + \Delta\phi = \{0.2500, -2.7998\}^T, giving:

r^2 = f(\phi^2) = \begin{bmatrix} 0.2500 + (-2.7998) - 0.2500\times(-2.7998) + 2 \\ 0.2500\times e^{-(-2.7998)} - 1 \end{bmatrix} = \begin{bmatrix} 0.1502 \\ 3.1103 \end{bmatrix}
with \|r^2\|_\infty = 3.1103. We again reevaluate the Jacobian and solve for \Delta\phi as:
\Delta\phi = -J^{-1}f = -\begin{bmatrix} 1-(-2.7998) & 1-0.2500 \\ e^{-(-2.7998)} & -0.2500\times e^{-(-2.7998)} \end{bmatrix}^{-1} \begin{bmatrix} 0.1502 \\ 3.1103 \end{bmatrix} = -\begin{bmatrix} 3.7998 & 0.7500 \\ 16.4411 & -4.1103 \end{bmatrix}^{-1} \begin{bmatrix} 0.1502 \\ 3.1103 \end{bmatrix} = \begin{bmatrix} -0.1055 \\ 0.3345 \end{bmatrix}

and update \phi^3 = \phi^2 + \Delta\phi = \{0.1445, -2.4653\}^T, giving:

r^3 = f(\phi^3) = \begin{bmatrix} 0.1445 + (-2.4653) - 0.1445\times(-2.4653) + 2 \\ 0.1445\times e^{-(-2.4653)} - 1 \end{bmatrix} = \begin{bmatrix} 0.0353 \\ 0.6997 \end{bmatrix}

with \|r^3\|_\infty = 0.6997. Repeating this procedure for 7 more iterations we finally
converge on the solution of φ = {0.0978, −2.3251}T . In order to create a Matlab
program to perform Newton’s method we will first note that since the purpose of
the example is to see how we solve a nonlinear system, we won’t pay any attention
to the solution of the linearized system and will therefore simply use the backslash
operator. Of course in principle almost any of the methods presented in this part of
the book could be used in its place. As such, the entire program will take the form:
phi = zeros(2,1);
f = [phi(1)+phi(2)-phi(1)*phi(2)+2; phi(1)*exp(-phi(2))-1];
k = 0;
r = f;
r_norm = max(abs(r));
tolerance = 1e-8;
N_k = 1e3;
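The remainder of the program, the iterative loop itself, might then be sketched as follows using the variables defined above (the complete listing in Example4_1.m may differ in its details):

while r_norm > tolerance && k < N_k
    % Evaluate the Jacobian at the current guess
    J = [1-phi(2)       1-phi(1);
         exp(-phi(2))  -phi(1)*exp(-phi(2))];
    % Solve the linearised system and update the solution
    Delta_phi = -J\f;
    phi = phi + Delta_phi;
    % Re-evaluate the residual and its norm
    f = [phi(1)+phi(2)-phi(1)*phi(2)+2; phi(1)*exp(-phi(2))-1];
    r = f;
    r_norm = max(abs(r));
    k = k + 1;
end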
An observation that can be made is that the above program is a little ‘verbose’
in that we have the column vectors f and r, both of which store the same thing.
Furthermore, we could just update as phi=phi-J\f. Since the point of this example
however was to demonstrate the method, and also, memory isn’t a concern for such
a small system, we don’t really care. The important point however is that you can
‘spot’ some of these inefficiencies so that when writing your own programs you can
make them more concise and efficient.
The complete program is given in Example4_1.m.
As a final note before continuing on, we can compute the solution to a nonlinear
system using the Matlab function fsolve. In such a case we present the specific
system to it as a function handle, which is how we avoid having the nonlinear system
‘hard coded’ into the program as was done in Example 4.1.
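For the system in Example 4.1 this might look something like the following sketch (fsolve is provided by Matlab's Optimization Toolbox; the zero vector is just the same initial guess used in the example):

% Present the nonlinear system to fsolve as a function handle
f = @(phi) [phi(1) + phi(2) - phi(1)*phi(2) + 2;
            phi(1)*exp(-phi(2)) - 1];
phi = fsolve(f, [0; 0]);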
Recall that for a single nonlinear equation, Newton's method takes the form:

\phi^{k+1} = \phi^k - \frac{f(\phi^k)}{f'(\phi^k)}

where f'(\phi^k) is equivalent to J and hence f'(\phi^k)^{-1} is equivalent to J^{-1} in this case.
The basic idea behind Broyden’s method is that we provide an initial estimate of the
Jacobian at the start of the algorithm (along with our initial guess for φ) and then
update that estimate as the iterations proceed, rather than recomputing it from scratch. To see where the update comes from, recall the Secant method for a single equation:

\phi^{k+1} = \phi^k - f(\phi^k)\frac{\phi^k - \phi^{k-1}}{f(\phi^k) - f(\phi^{k-1})}

where here it can be observed that we've essentially replaced the derivative in Newton's method by a 'difference expression':

f'(\phi^k) \approx \frac{f(\phi^k) - f(\phi^{k-1})}{\phi^k - \phi^{k-1}}
Broyden’s method is essentially a generalization of the Secant method in multiple
dimensions, where we replace the first derivative by a ‘difference expression’, which
can be written as:
J^k = J^{k-1} + \frac{\Delta f^k - J^{k-1}\Delta\phi^k}{\|\Delta\phi^k\|_2^2}(\Delta\phi^k)^T
and then proceeds similar to Newton’s Method:
φk+1 = φk − ∆φk
Now, another thing we could do with the method is rather than store an approx-
imation to the Jacobian, store an approximation to the inverse of the Jacobian and
perform our updates on that instead. This would mean that we wouldn’t have to
solve a linear system at each outer iteration. In order to achieve this we can make
use of the Sherman-Morrison formula [49]:

(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1}u}

which expresses the inverse of a matrix modified by a rank-one update in terms of the inverse of the original matrix, so that we never have to explicitly invert the new matrix. Now if we store an approximation for the inverse of the Jacobian, then we can update using the Sherman-Morrison formula as:

(J^k)^{-1} = (J^{k-1})^{-1} + \frac{\Delta\phi^k - (J^{k-1})^{-1}\Delta f^k}{(\Delta\phi^k)^T (J^{k-1})^{-1}\Delta f^k}(\Delta\phi^k)^T (J^{k-1})^{-1}
Example 4.2:
In this example we will develop a Matlab program to solve the example nonlinear
system:
f(\phi) = \begin{bmatrix} \phi_1 + \phi_2 - \phi_1\phi_2 + 2 \\ \phi_1 e^{-\phi_2} - 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
Before we start writing any code however, let's work through and perform a few iterations of Broyden's method by hand. In this example we are going to store an approximation to the inverse of the Jacobian. To begin the algorithm we will provide the initial guess of φ^0 = {1, −1}^T. Now, normally our initial guesses have been quite simple, whereas here we are making use of the knowledge of the exact solution from Example 4.1 to provide an initial guess slightly closer to it. The reason for this is that since we will be 'guessing' the Jacobian, there's a good chance that our method won't converge if the initial guess for the solution and the initial guess for the inverse of the Jacobian are both 'way off'. So this approach will allow us to use quite a simple approximation for (J^0)^{-1} (which is one of the intended learning outcomes for this example), while still having an algorithm which converges. We will also assume that we are using the infinity norm
as our measure of convergence, and therefore, initially we have:
r^0 = f(\phi^0) = \begin{bmatrix} 1 + (-1) - 1\times(-1) + 2 \\ 1\times e^{-(-1)} - 1 \end{bmatrix} = \begin{bmatrix} 3.0000 \\ 1.7183 \end{bmatrix}

with \|r^0\|_\infty = 3.0000. We will also define our approximation for the inverse of the Jacobian as:

(J^0)^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
We can then compute \Delta\phi as:

\Delta\phi = -(J^0)^{-1}f = -\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 3.0000 \\ 1.7183 \end{bmatrix} = \begin{bmatrix} -3.0000 \\ -1.7183 \end{bmatrix}
with \|\Delta\phi\|_2 = \sqrt{(-3.0000)^2 + (-1.7183)^2} = 3.4572 and update \phi^1 = \phi^0 + \Delta\phi = \{-2.0000, -2.7183\}^T, giving:

r^1 = f(\phi^1) = \begin{bmatrix} (-2.0000) + (-2.7183) - (-2.0000)\times(-2.7183) + 2 \\ (-2.0000)\times e^{-(-2.7183)} - 1 \end{bmatrix} = \begin{bmatrix} -8.1548 \\ -31.3085 \end{bmatrix}
(J^1)^{-1} = (J^0)^{-1} + \frac{\Delta\phi - (J^0)^{-1}\Delta f}{\Delta\phi^T (J^0)^{-1}\Delta f}\,\Delta\phi^T (J^0)^{-1}

= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} + \frac{\begin{bmatrix} -3.0000 \\ -1.7183 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} -11.1548 \\ -33.0268 \end{bmatrix}}{\begin{bmatrix} -3.0000 & -1.7183 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} -11.1548 \\ -33.0268 \end{bmatrix}}\begin{bmatrix} -3.0000 & -1.7183 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}

= \begin{bmatrix} 0.7288 & -0.1553 \\ -1.0411 & 0.4037 \end{bmatrix}
\Delta\phi = -(J^1)^{-1}f = -\begin{bmatrix} 0.7288 & -0.1553 \\ -1.0411 & 0.4037 \end{bmatrix}\begin{bmatrix} -8.1548 \\ -31.3085 \end{bmatrix} = \begin{bmatrix} 1.0804 \\ 4.1481 \end{bmatrix}
with \|\Delta\phi\|_2 = \sqrt{1.0804^2 + 4.1481^2} = 4.2865 and update \phi^2 = \phi^1 + \Delta\phi = \{-0.9196, 1.4298\}^T, giving:

r^2 = f(\phi^2) = \begin{bmatrix} (-0.9196) + 1.4298 - (-0.9196)\times 1.4298 + 2 \\ (-0.9196)\times e^{-1.4298} - 1 \end{bmatrix} = \begin{bmatrix} 3.8250 \\ -1.2201 \end{bmatrix}
(J^2)^{-1} = (J^1)^{-1} + \frac{\Delta\phi - (J^1)^{-1}\Delta f}{\Delta\phi^T (J^1)^{-1}\Delta f}\,\Delta\phi^T (J^1)^{-1}

= \begin{bmatrix} 0.7288 & -0.1553 \\ -1.0411 & 0.4037 \end{bmatrix} + \frac{\begin{bmatrix} 1.0804 \\ 4.1481 \end{bmatrix} - \begin{bmatrix} 0.7288 & -0.1553 \\ -1.0411 & 0.4037 \end{bmatrix}\begin{bmatrix} 11.9799 \\ 30.0884 \end{bmatrix}}{\begin{bmatrix} 1.0804 & 4.1481 \end{bmatrix}\begin{bmatrix} 0.7288 & -0.1553 \\ -1.0411 & 0.4037 \end{bmatrix}\begin{bmatrix} 11.9799 \\ 30.0884 \end{bmatrix}}\begin{bmatrix} 1.0804 & 4.1481 \end{bmatrix}\begin{bmatrix} 0.7288 & -0.1553 \\ -1.0411 & 0.4037 \end{bmatrix}

= \begin{bmatrix} 4.2006 & -1.6366 \\ -6.2593 & 2.6301 \end{bmatrix}
\Delta\phi = -(J^2)^{-1}f = -\begin{bmatrix} 4.2006 & -1.6366 \\ -6.2593 & 2.6301 \end{bmatrix}\begin{bmatrix} 3.8250 \\ -1.2201 \end{bmatrix} = \begin{bmatrix} -18.0642 \\ 27.1511 \end{bmatrix}
with \|\Delta\phi\|_2 = \sqrt{(-18.0642)^2 + 27.1511^2} = 32.6613 and update \phi^3 = \phi^2 + \Delta\phi = \{-18.9838, 28.5809\}^T.
An observation that can be made with this method is that the residuals tend to 'jump around' initially before finally converging rapidly on the solution. Of course, the better our initial guesses for the solution and the inverse of the Jacobian, the fewer iterations are likely to be required to converge on a solution. Furthermore, if these guesses are too poor we may not converge at all.
The complete program is given in Example4_2.m.
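A minimal sketch of the kind of iteration such a program performs is given below, storing the approximate inverse Jacobian in a variable J_inv (a name chosen here just for illustration); the complete listing may of course differ in its details:

phi = [1; -1];                 % initial guess for the solution
J_inv = eye(2);                % initial guess for the inverse of the Jacobian
f = [phi(1)+phi(2)-phi(1)*phi(2)+2; phi(1)*exp(-phi(2))-1];
k = 0; tolerance = 1e-8; N_k = 1e3;
while max(abs(f)) > tolerance && k < N_k
    Delta_phi = -J_inv*f;
    phi = phi + Delta_phi;
    f_new = [phi(1)+phi(2)-phi(1)*phi(2)+2; phi(1)*exp(-phi(2))-1];
    Delta_f = f_new - f;
    % Sherman-Morrison update of the approximate inverse Jacobian
    J_inv = J_inv + ((Delta_phi - J_inv*Delta_f)/(Delta_phi'*J_inv*Delta_f))*(Delta_phi'*J_inv);
    f = f_new;
    k = k + 1;
end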
Having now looked at a couple of methods for solving nonlinear systems we will
finish up this part of the book by introducing a number of concepts that will become
important later on.
Often, we will be able to write our resulting nonlinear system of equations in the
form:
A(φ)φ = b
where we are emphasizing that the matrix A contains entries that involve the de-
pendent variable φ in some way. So obviously this is not a linear system; but one
approach to solving a system of this form is known as Fixed point iteration. The
idea here is that we evaluate the terms in A using the previous estimate of φ, so
that we have:
A(\phi^k)\phi^{k+1} = b

so that at each outer iteration we assemble A using the previous estimate \phi^k and then solve a linear system for \phi^{k+1}; this approach is also referred to as Picard iteration.
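As a sketch of what this looks like in practice (assembleA is just a placeholder name here for whatever routine builds the coefficient matrix from the current estimate, and phi_0, b, N_k, and tolerance are assumed to be defined):

phi = phi_0;                        % initial guess
for k = 1:N_k
    A = assembleA(phi);             % evaluate the coefficients using the previous estimate
    phi_new = A\b;                  % solve the resulting linear system
    if max(abs(phi_new - phi)) < tolerance
        phi = phi_new;
        break
    end
    phi = phi_new;
end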
Often we will also encounter problems involving two coupled systems of the form:

A_1(\phi, \psi)\phi = b_1

A_2(\phi, \psi)\psi = b_2
where φ and ψ are both column vectors containing the solution to their respective
system of equations, but the important point is that the coefficient matrices may
depend upon both vectors. In fact the systems could perhaps be more generally
written in the form f (φ, ψ) = 0, but if instead the systems happened to be linear,
then we could trivially combine them into one large system:
A \begin{bmatrix} \phi \\ \psi \end{bmatrix} = b
which we would call a coupled (or simultaneous) solution. Another approach we might take however, is to hold one of the vectors constant and solve for the other, then switch. This would be known as a segregated solution. This also ties in quite nicely with the Picard iteration approach, in that for our segregated solution we can 'assemble' the matrices using data from a previous iteration:
A1 (φk , ψ k )φk+1 = b1
A2 (φk+1 ,ψ k )ψ k+1 = b2
where it can be observed that we solve the first system using the previous estimates φ^k and ψ^k, but as soon as we have solved for φ^{k+1} we use that information to assemble A_2 before solving for ψ^{k+1}. One important difference between the coupled and segregated solvers is that if we have one 'giant' matrix A, then this may require significantly more memory to store, whereas with a segregated solver we may be able to reuse the memory used to store A_1 to also store A_2, since we only need one at a time and we will be reassembling them anyway. Furthermore, sometimes A_1 = A_2 and
faster convergence with a coupled solution, but also more computational expense.
Sometimes if the equations are particularly ‘nasty’ and nonlinear and tightly coupled,
then a segregated solution may be preferable as well.
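A segregated solution might then be sketched as follows (assembleA1 and assembleA2 are placeholder names for whatever routines build the two coefficient matrices, and b1, b2, and N_k are assumed to be defined):

for k = 1:N_k
    A1 = assembleA1(phi, psi);      % assemble using the previous estimates
    phi = A1\b1;                    % solve for phi holding psi fixed
    A2 = assembleA2(phi, psi);      % reassemble using the updated phi
    psi = A2\b2;                    % solve for psi
end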
These concepts will become clearer as we present the different governing differ-
ential equations in Part V and explain how they are converted into multiple systems
of equations. The important point to bear in mind at this stage is simply to note
that often we will be solving coupled systems of equations and occasionally they
will be nonlinear. It would then make sense to revisit this Chapter to revise these
concepts and ‘crystalize’ them in your mind.
To briefly summarize and recap on this part of the book, we have introduced some
relevant terminology and made some definitions relating to systems of equations that
we will encounter later on. We then investigated a number of direct methods, where
the number of operations required to solve a system is fixed, depending only upon the
size of the system, but we generally have to store multiple matrices to do so and as
a result we tend not to use these methods so much in practice. We then investigated
a number of iterative methods, where the number of operations required to solve a
system is unknown, but we generally don’t have to store as much information and
furthermore, the methods can be integrated with the sparse matrices that arise when
solving differential equations and as a result these methods are more commonly used
in practice. Finally we investigated some methods for extending these techniques
for solving nonlinear equations and how we deal with multiple systems together. In
addition to this we have gained some experience with both Matlab and C++ and seen the basic structure of both types of program, which will be continued throughout this book. We have seen that the programs involve lots of for and while loops, and that generally we can make our Matlab programs a bit more concise because of the matrix and vector operations that are available to us. Furthermore, a number of the methods are directly available within Matlab. With C++ on the other hand, linear solvers aren't a part of the language itself, but there are a number of libraries, such as BLAS [3] and LAPACK [26], which can be integrated into one's program to provide this functionality.
As a final point, it is worth mentioning that we have only briefly touched on the details of all of these methods; for the interested reader, more detailed treatments of conjugate gradient and related methods can be found in the books by Lindfield [65], Press [73], and Barrett [63].
Part II
Chapter 5
Introduction
Figure 5.1: A schematic illustrating the idea behind a numerical solution to an ordi-
nary differential equation (ODE). The blue line illustrates the continuous analytical
solution to the ODE in the domain of interest t ∈ [tmin , tmax ] and the blue dots
illustrate the numerical solution at two time steps (denoted by the superscript l)
within this domain.
In this part of the book we are going to investigate a number of different methods for solving Ordinary Differential Equations (ODEs). Before we begin to
study these numerical methods however we need to outline some terminology and
make some definitions. The most general form of an ODE can be given as:
f\left(\frac{d^N\phi}{dt^N}, \cdots, \frac{d^3\phi}{dt^3}, \frac{d^2\phi}{dt^2}, \frac{d\phi}{dt}, \phi, t\right) = 0
Here we are using the variable φ to denote our dependent variable, t to denote our
independent variable, and f represents some arbitrary function of the dependent
variable and its derivatives, the independent variable, or both. We will generally
think of t as representing the ordinate of time, but it should be noted that for
this general problem t could represent any independent variable that we can take
derivatives with respect to. For example, ODEs involving derivatives with respect
to an independent variable representing a spatial ordinate x are also commonplace
in many engineering problems. In terms of notation, it is also common to express
derivatives with respect to time as φ̇, φ̈, etc, and with respect to a spatial ordinate
as φ0 , φ00 , etc. We will also make use of this notation from time to time throughout
the book (mainly just to fit a derivative expression in the body of a paragraph,
which would stretch the line spacing if using the fractional notation). We will
always assume the existence and uniqueness of the solution and also that f (φ, t)
has continuous partial derivatives with respect to φ and t of as high an order as
necessary. Now, the idea behind solving an ODE is to find φ(t), but because we are
restricting ourselves to numerical solutions we will only ever obtain a solution that
is a collection of φ at discrete points in time, denoted φ(tl ) = φl (Figure 5.1). To
obtain the numerical solution to our ODE we will always be obtaining it within a
domain which we can denote as tmin ≤ t ≤ tmax , or as t ∈ [tmin , tmax ]. So this is just
the range of the independent variable which we are interested in the solution of our
ODE. Often tmin will be zero, but it doesn’t have to be.
One of the important aspects we must consider is the order of the ODE, which
we can deduce from the highest derivative present in the equation. The order will
have important implications in terms of how much information we have to specify
to obtain a unique solution. A generic first order ODE takes the form:
\frac{d\phi}{dt} = f(\phi, t) \qquad (5.1)
Another important aspect we must consider is whether or not we are solving a
single ODE or a system of ODEs. Throughout our study we will actually most often
be interested in studying systems of ODEs, in which case φ will be thought of as
representing a column vector of dependent variables φ = {φ1 , φ2 , ...φN }T and f will
hence represent a column vector of functions. If f happens to not explicitly involve
the independent variable t and is only a function of φ then this is known as an
autonomous system. As we will see, the numerical methods that we will investigate
are only designed to solve first order ODEs. The solution of a higher order ODE
however can be achieved if we ‘break it up’ into a system of first order ODEs. For
instance, to solve the third order ODE:
\frac{d^3\phi}{dt^3} = f(\phi, t)
the approach that one takes is to create a system of three first order ODEs by letting
φ1 = φ, φ2 = φ̇, and φ3 = φ̈ such that:
\frac{d\phi_1}{dt} = \phi_2

\frac{d\phi_2}{dt} = \phi_3

\frac{d\phi_3}{dt} = f(\phi_1, t)
which of course extends to higher order ODEs. So a single equation of order N is
equivalent to a system of N first order equations as long as the highest derivative
can be isolated. Another important aspect that we must consider is whether or not
we are solving a linear or nonlinear ODE (or system of ODEs). If our problem is
linear, then it can be represented in the form:
a(t)\frac{d^N\phi}{dt^N} + \ldots + b(t)\frac{d^2\phi}{dt^2} + c(t)\frac{d\phi}{dt} + d(t)\phi + e(t) = 0
and hence, when written as a system of first order ODEs:

φ̇ = Kφ + s   (5.2)

where K is an N × N matrix and s an N × 1 column vector, the entries of which
may be functions of t, but not of φ. This form leads to another important aspect
that we must consider which is whether or not the ODE is homogeneous or inho-
mogeneous. If s is zero then our ODE is homogeneous and if it is nonzero then it
is inhomogeneous. It should be noted that while we introduced homogeneity via a
linear ODE, a nonlinear ODE can be either homogeneous or inhomogeneous as well.
Finally, and perhaps most importantly, one of the aspects of our problem we
must consider is whether we are solving an initial value problem or a boundary value
problem. The amount of information that we will need to specify to obtain a solution
will depend upon the order of the ODE. For a first order ODE we must specify the
solution at a moment in time, most commonly at tmin as φ(tmin ) = φmin , which is
our initial condition. If we have, say, a second or third order ODE then we would additionally need to specify initial conditions for φ̇ and φ̈ respectively. Turning to the solution of
a system of ODEs we find that we need to specify an initial condition for each entry
in the vector φ, such as φmin = {φ1,min , φ2,min , ...φN,min }T . If all of these pieces of
information are specified at tmin then we are solving an initial value problem (IVP).
It is also possible that we may not always have the relevant information at tmin but
may have some of the information at tmax . In this case we are solving a boundary
value problem (BVP) and we would call the value φ(tmax ) = φmax our boundary
condition. We will see how we solve these problems in a later chapter, but for now
the point to take away from this discussion is that we need to specify a condition
116 CHAPTER 5. INTRODUCTION
(somewhere in the domain) for each derivative in our ODE, or each element in the
column vector of ODEs, however you find it easiest to visualize.
Having now introduced a number of concepts relating to the classification of ODEs, let's take a moment to look at some common example ODEs and classify them. Beginning with the Bernoulli differential equation:

\frac{d\phi}{dt} + b\phi = c\phi^n
We can see that this is a first order, nonlinear, homogeneous ODE. Another well
known ODE is the Euler differential equation:
t^2\frac{d^2\phi}{dt^2} + at\frac{d\phi}{dt} + b\phi = c
which we can see is a second order, linear, inhomogeneous ODE. Another well known
ODE is the van der Pol equation:
\frac{d^2\phi}{dt^2} - a\left(1 - \phi^2\right)\frac{d\phi}{dt} + \phi = 0
which is a second order, nonlinear, homogeneous, ODE.
Throughout the next two chapters we will for the most part be applying our
numerical methods to solve two example systems, the first of which is:
\frac{d^2\phi}{dt^2} = -4\phi
which we can see is a second order, linear, homogeneous ODE. The second example
system is:
\frac{d\phi_1}{dt} = \phi_2\phi_3

\frac{d\phi_2}{dt} = -\phi_1\phi_3

\frac{d\phi_3}{dt} = -0.5\phi_1\phi_2
which we can see is a system of 3 first order, nonlinear, homogeneous ODEs. By
testing our numerical methods on the same problem we should hopefully highlight
the differences between the different numerical methods and also know what solution
to expect.
We are now almost in a position to begin studying the families of numerical
methods. Bearing in mind that our numerical solutions will only ever be approxima-
tions to the true solution however, we need to first define some numerical concepts.
We will simply state them here and then explain them at a later stage. For each
numerical method we will investigate the accuracy of the method and the stability
of the method.
Chapter 6
Euler Methods
Perhaps the simplest methods that we can use to solve an initial value problem are
Euler methods, so they present an excellent starting point. The basic idea behind
the method is that we first consider the Taylor Series Expansion about φ(tl ):
\phi(t^{l+1}) = \phi(t^l) + \frac{\Delta t}{1!}\left.\frac{d\phi}{dt}\right|_{t^l} + \frac{\Delta t^2}{2!}\left.\frac{d^2\phi}{dt^2}\right|_{t^l} + \frac{\Delta t^3}{3!}\left.\frac{d^3\phi}{dt^3}\right|_{t^l} + \ldots \qquad (6.1)
Which we could also write in a more compact form as:
\phi(t^{l+1}) = \sum_{n=0}^{\infty}\frac{(t^{l+1} - t^l)^n}{n!}\left.\frac{d^n\phi}{dt^n}\right|_{t^l}
Letting tl+1 − tl = ∆t, which we will call the time step size, and substituting our
equation for a generic first order ODE given in Equation 5.1 into Equation 6.1 we
get:
\phi(t^{l+1}) = \phi(t^l) + \Delta t f(\phi, t) + \frac{\Delta t^2}{2!}\left.\frac{d^2\phi}{dt^2}\right|_{t^l} + \frac{\Delta t^3}{3!}\left.\frac{d^3\phi}{dt^3}\right|_{t^l} + \ldots \qquad (6.2)
With Euler methods we neglect the second order terms and higher such that we
have the approximate expression:
\phi^{l+1} = \phi^l + \Delta t f(\phi^l, t^l) \qquad (6.3)

With the explicit Euler method we interpret this as
computing the values of φ at the new time steps based solely on the values from the
previous time steps (Figure 6.1). Since we will always know the values of φ from
the previous time steps we have an explicit expression which we can evaluate to
compute the new values.
Figure 6.1: A schematic illustrating one step in the explicit Euler method. The green
arrow illustrates the gradient f (φl , tl ) which is used to step the solution forward. The
pink line illustrates computed step to φl+1 . The blue line illustrates the analytical
solution. Also illustrated is the local truncation error, which is the difference between
the analytical and numerical solutions at time step l + 1.
In order to consider the accuracy of the method, we need to consider two types of error; namely truncation error and round-off error. As mentioned in Part I, round-off error arises due to the fact that computers can only represent numbers to a finite precision. Truncation error on the other hand arises due to the terms in Equation
6.2 that were ignored. Assuming that we knew the exact solution of our ODE at
a given time step tl+1 (and ignoring round-off error for the moment) we define the
local truncation error as:
e^{l+1}_{local} = \phi(t^{l+1}) - \phi^{l+1} = \phi(t^{l+1}) - \phi^l - \Delta t f(\phi^l, t^l)
which is defining the incremental error introduced into the solution as we take a
step from φl to φl+1 . Going back to the Taylor series expansion in Equation 6.1, we
can see that the error introduced into the solution is of the order O(∆t²), thus
the explicit Euler method has a local truncation error of O(∆t2 ). The idea here is
that if we reduce ∆t by a factor of two, then the error should decrease by a factor of
four. The global truncation error is the accumulation of the local truncation error
after a number of time steps. Assuming perfect knowledge of the exact solution at
the initial time step we can define the global truncation error after Nt time steps as:
e^{N_t}_{global} = \phi(t^{N_t}) - \phi^{N_t} = \phi(t^{N_t}) - \left[\phi^0 + \Delta t f(\phi^0, t^0) + \Delta t f(\phi^1, t^1) + \ldots + \Delta t f(\phi^{N_t-1}, t^{N_t-1})\right]
Now, if each step incurs an error of O(∆t2 ) and the errors are simply cumulative
(a fairly conservative assumption) and we have to take O(∆t−1 ) steps to cover the
domain, then the net truncation error is O(∆t). In other words, the error associated
with integrating an ODE using Euler methods is directly proportional to the time
step size. So Euler methods are termed a first-order accurate method because the
global truncation error associated with integrating over a finite domain scales like
O(∆t). More generally, a numerical method is conventionally called an N th order
method if its local truncation error per step is O(∆tN +1 ).
Note that truncation error would be incurred even if computers performed floating-
point arithmetic operations to infinite accuracy. Unfortunately, computers do not
perform such operations to infinite accuracy. In fact, a computer is only capable of
storing a floating-point number to a fixed number of decimal places. At large time
step sizes the error is dominated by truncation error, whereas round-off error dom-
inates at small time step sizes. So we in fact reach a point where further reducing
the time step size will actually increase the error in the solution.
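As a rough numerical illustration of the first order behaviour (a quick check one could run, not part of the example programs), halving ∆t should roughly halve the error at the end of the domain:

% Crude check of first order accuracy: explicit Euler applied to
% dphi/dt = -phi with phi(0) = 1, whose exact solution is exp(-t)
for Delta_t = [0.1 0.05 0.025]
    phi = 1.0;
    for l = 1:round(1/Delta_t)
        phi = phi + Delta_t*(-phi);
    end
    fprintf('Delta_t = %5.3f, error at t = 1 is %e\n', Delta_t, abs(phi - exp(-1)));
end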
Example 6.1:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
\frac{d\phi}{dt} = 1 - \phi
in the domain t ∈ [0, 10], with initial condition φ(0) = 0. We will experiment with
the time step sizes ∆t = 0.1, ∆t = 0.5, ∆t = 1.0, and ∆t = 2.0 and compare
the numerical solution with the exact (or analytical) solution φ(t) = 1 − e−t . The
intended learning outcomes for this example will be to observe the application of
the explicit Euler method to solve an ODE and how we write data to an output file
in C++.
In order to begin, we apply the explicit Euler method to the ODE by substituting
in for f (φl , tl ) into Equation 6.3 as:
\phi^{l+1} = \phi^l + \Delta t\left(1 - \phi^l\right)
Then in order to actually compute the solution at each time step we place this equa-
tion inside a time marching loop where we will compute the solution at successive
time steps as:
for l=1:N_t-1
phi(l+1) = phi(l) + Delta_t *(1 - phi(l));
end
in Matlab, and:
for(int l=0; l<N_t-1; l++)
{
phi[l+1] = phi[l] + Delta_t *(1.0 - phi[l]);
}
in C++. These two code snippets are essentially all that is required to implement an
explicit Euler method for solving this simple ODE. In order to output the solution
data to a file we will assume that we have completed the simulation and all of the
data has been computed. In this case we will first declare an instance of an fstream
class:
fstream file;
Then we will open the file, loop through the array and write out the data, then close
the file:
file.open("Example6_1.data", ios::out);
for(int l=0; l<N_t; l++)
{
file << phi[l] << "\t" << 1.0-exp(-t[l]) << endl;
}
file.close();
where the first argument in the open member function is our desired name for the
file and the second is a flag indicating in this case that we wish to write out the data.
For this particular example our output file will be a readable text file containing
two columns; the first containing the numerical solution and the second column
(separated by a tab "\t") containing the exact solution. So each pass through the
for loop writes out one row of data.
The complete programs are given in Example6_1.m and Example6_1.cpp. The
solution is shown in Figures 6.2(a) - 6.2(d). Note that the numerical solution gets
closer to the analytical solution for smaller values of ∆t. This is what we would
expect since we know that the error in the solution is proportional to the time step
size.
Figure 6.2: Solution to the ODE in Example 6.1 with (a) ∆t = 0.1 (b) ∆t = 0.5 (c)
∆t = 1.0 (d) ∆t = 2.0. The green lines show the analytical solution and the blue
dotted lines show the numerical solution.
In order to investigate the stability of the explicit Euler method we will consider the model problem:

\frac{d\phi}{dt} = \lambda\phi \qquad (6.4)
Figure 6.3: Some qualitative solutions to the model problem. When λ is real the
solution will show exponential growth or decay depending on whether it is positive
or negative respectively. When it is imaginary the solution will be oscillatory like
a sinusoidal function, and if λ has both real and imaginary components then the
solution will exhibit sinusoidal growth or decay.
where λ is a constant which can be a complex number. Figure 6.3 illustrates some
solutions to the model problem depending upon the value of λ. In most engineering
problems, the real part of λ is negative. This means that the solution to the ODE will
typically decay with time. You may well ask why we use this model problem for our
analysis and why we let λ be complex (especially since most of the problems that we
will be solving involve real numbers as opposed to complex numbers). The answer
to the first question is that many problems fall into this category, or as we shall see,
can be made to fall into this category. Although the model problem is homogeneous,
we find that the inhomogeneous parts of the ODE do not really affect the stability
analysis, so it is therefore ‘no big deal’ to ignore them. Furthermore, for complex
nonlinear problems we will most likely not be able to perform a stability analysis, so
we learn what we can from studying the model problem. We can however sometimes
‘linearize’ a nonlinear problem such that our stability analysis performed on the
model problem still applies. In regards to the second question, in many practical
problems, the ODEs produce oscillatory type solutions, which won’t appear unless
we allow λ to be a complex number.
Applying the explicit Euler method to Equation 6.4 gives:
φl+1 = (1 + λ∆t) φl
Thus the solution at any time step l can be written as:
φl = φ0 (1 + λ∆t)l
= φ0 (1 + (λRe + iλIm )∆t)l
= φ0 σ l
where λRe and λIm are the real and imaginary parts of λ and σ is known as the
amplification factor . An important point on notation is that here φ0 implies the
solution at time step 0, but σ l implies that the amplification factor is raised to the
power of l. This will be the case for all similar expressions throughout this part of
the book. If |σ| ≤ 1 then the solution will decay with time. The opposite is true if
|σ| > 1. Hence, in order to ensure the stability of the numerical method we want
|σ| ≤ 1, and therefore:

(1 + \lambda_{Re}\Delta t)^2 + (\lambda_{Im}\Delta t)^2 \le 1

This is the equation of a circle of radius 1, centered at (−1, 0), and the inequality implies that λ∆t must lie inside the circle in order for the method to be stable. This
plot is called the stability diagram and is shown in Figure 6.4. Within this region
we say that the method is absolutely stable and outside of this region we say that it
is unstable.
Figure 6.4: The stability diagram for the explicit Euler method.
If we consider first the case where λ is real and negative (i.e. λ = −λRe ), the
model problem becomes:
\frac{d\phi}{dt} = -\lambda_{Re}\phi
For illustrative purposes, let’s use φ(t = 0) = φ0 . The exact solution to this problem
is:
φ(t) = φ0 e−λRe t
then in order for the numerical method to be stable we have:
|1 − λRe ∆t| ≤ 1
where using the rule of inequalities (|a| ≤ b implying −b ≤ a ≤ b) means that:
−1 ≤ 1 − λRe ∆t ≤ 1
−2 ≤ −λRe ∆t ≤ 0
and dividing through by −λ_{Re} and noting another rule of inequalities (that dividing by a negative number involves reversing the inequality) we get:

0 \le \Delta t \le \frac{2}{\lambda_{Re}} \qquad (6.6)
So we have found the maximum time step size that we can use and still get a stable
solution. It is worth mentioning that if λRe is positive, then the solution will exhibit
an exponential increase with time. This does not mean that the numerical method
however is not working, just that the stability analysis doesn’t really apply. We are
more concerned with cases where we know that the solution should decay with time
(or at least not grow exponentially), finding the regions in the stability diagram
where this will not happen, and staying out of them!
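As a quick numerical illustration (not part of the worked examples), consider the explicit Euler method applied to dφ/dt = −φ, for which λ_{Re} = 1 and the limit in Equation 6.6 is ∆t ≤ 2:

% Explicit Euler applied to dphi/dt = -phi with phi(0) = 1, using a time
% step just inside and just outside the stability limit Delta_t <= 2
for Delta_t = [1.9 2.1]
    phi = 1.0;
    for l = 1:round(50/Delta_t)
        phi = phi + Delta_t*(-phi);
    end
    fprintf('Delta_t = %3.1f gives phi(50) = %e\n', Delta_t, phi);
end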
Consider now the case where λ is purely imaginary (i.e. λ = iλ_{Im}), where i is the imaginary unit \sqrt{-1}. The model problem becomes:

\frac{d\phi}{dt} = i\lambda_{Im}\phi
For illustrative purposes, let’s use φ(t = 0) = φ0 . The exact solution to this problem
is:
φ(t) = φ0 eiλIm t
By considering Euler’s formula eiθ = cos θ + i sin θ (not to be confused with Euler
methods that are the focus of this chapter) we can see that the exact solution will be
oscillatory in nature. Because the stability region is only tangent to the imaginary
axis, the explicit Euler method is always unstable (irrespective of the time step size)
for purely imaginary λ. So we know that the amplitude will grow with time as the
value of λ is not within the stability region of Figure 6.4 (i.e. it lies somewhere on
the vertical axis).
If we are dealing with a system of ODEs, then the concepts of stability analysis
still apply. To see how let’s consider a linear system:
φ̇ = Kφ (6.7)
where K is a constant N × N matrix. We will assume that K is diagonalizable [10], meaning that it has a complete set of N linearly independent eigenvectors ξ_n satisfying Kξ_n = λ_nξ_n for n = 1, 2, ...N (i.e. a standard eigenvalue problem). We
can represent the matrix of Eigenvectors as:
Ξ = [ξ1 , ξ2 , . . . ξN ]
where essentially we’re taking each ξn (which is a column vector) and ‘stacking’
them together column by column to form a matrix. We can then represent the
matrix of eigenvalues as:
\Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_N \end{bmatrix} = \mathrm{diag}(\lambda_1, \lambda_2, ...\lambda_N)
and so our matrix K can be written as:
K = ΞΛΞ−1
or rearranging in terms of Λ as:
Λ = Ξ−1 KΞ
Now if we let ψ(t) = Ξ^{-1}φ(t), then multiplying Equation 6.7 by Ξ^{-1} and noting that ΞΞ^{-1} = Ξ^{-1}Ξ = I gives the equivalent equations:

\frac{d\psi_n}{dt} = \lambda_n\psi_n
which is the model problem outlined in Equation 6.4 and we can translate the
stability criteria for a system of ODEs from that for a single ODE. So, applying the
explicit Euler method to the model problem gives:
φl+1 = (I + ∆tK) φl
which by the previous transformations can be written as:
ψ l+1 = (I + ∆tΛ) ψ l
This decouples into N independent equations, one for each component of ψ. These
take the form:
\psi_n^{l+1} = (1 + \Delta t\lambda_n)\psi_n^l

so that the stability of each mode is governed by the scalar model problem considered above, with the most restrictive time step limit being:

\Delta t \le \frac{2}{\max(\lambda_n)}
If the range of magnitudes of the eigenvalues is large:

\frac{\max(\lambda_n)}{\min(\lambda_n)} \gg 1
and the solution is desired over a large span of the independent variable t, then the
system is known as a stiff system. Stiff systems arise in physical situations with
widely varying rates of responses and can result in extremely small time step sizes
in order to satisfy the stability criteria. By ‘varying rates of response’ we mean that
maybe one ψn is changing rapidly while another one may change relatively slowly.
This same idea will apply more generally to other methods such that a method
is stable if and only if each λn ∆t is within the stability region of the numerical
method for every eigenvalue of the matrix K. An important point to realize when
considering stability analysis is that we derive all our stability regions by considering
a model problem, but of course most of the systems of ODEs that we will want to
solve in practice will not be so simple. If we have a linear system as was just outlined
then the idea is that we try and make our problem ‘look’ like the model problem
so that we can apply our stability analyses to it. If we are dealing with non-linear
systems then of course things are a lot more difficult.
Example 6.2:
In this example we will develop a Matlab program to solve the example system:
\frac{d^2\phi}{dt^2} + \frac{d\phi}{dt} + 4\phi = 0 \qquad (6.8)
in the domain t ∈ [0, 10], with initial conditions φ(0) = 1 and φ̇(0) = 0. We will
plot the stability region for the explicit Euler method and experiment with the time
step sizes ∆t = 0.1 and ∆t = 0.5. The intended learning outcome for this example
will be to see how we solve a second order ODE, to observe how the time step size
affects where the solution will sit in the stability region of the explicit Euler method
and what happens to the solution when we are outside of that region.
In order to begin, we will need to break this second order ODE into a system of
first order ODEs. We do so by making the definition:
\phi_1 = \phi \qquad \phi_2 = \frac{d\phi_1}{dt}
\frac{d\phi_1}{dt} = \phi_2

\frac{d\phi_2}{dt} = -4\phi_1 - \phi_2

so that the system can be written as:

\dot{\phi} = K\phi

where \phi = \{\phi_1, \phi_2\}^T and K is the matrix:

K = \begin{bmatrix} 0 & 1 \\ -4 & -1 \end{bmatrix}
We will actually store the solution as a 2D array in Matlab as:
phi = zeros(N_e, N_t)
Applying the explicit Euler method to this system then gives:

\phi^{l+1} = \phi^l + \Delta tK\phi^l = (I + \Delta tK)\phi^l
For this system we can calculate the diagonal matrix of eigenvalues and eigenvectors
in Matlab with the eig function as:
[Xi Lambda] = eig(K);
which will give us two matrices, one containing the eigenvalues on the main diag-
onal and the other containing the eigenvectors. For this particular K we get the
eigenvalues:
\Lambda = \begin{bmatrix} -0.5000 + 1.9365i & 0 \\ 0 & -0.5000 - 1.9365i \end{bmatrix}
Since these two eigenvalues have negative real parts as well as imaginary parts, we
should be able to fit inside the stability region of the explicit Euler method for some
choice of ∆t. Recall the amplification factor for the explicit Euler method:

\sigma = 1 + \lambda\Delta t

In order to plot the stability region, we can create a grid of points λ∆t covering a region of the complex plane, compute the magnitude of the complex number σ at each x, y point, and then extract a contour of |σ| = 1.
This can be easily accomplished in Matlab using the meshgrid, sqrt and contourf functions respectively as:
[X, Y] = meshgrid(-2:0.1:2, -2:0.1:2);
Z = X + i*Y;
sigma = sqrt((1 + real(Z)).^2 + imag(Z).^2);
contourf(X, Y, sigma, [1 1]);
plot(real(diag(Lambda)*Delta_t), imag(diag(Lambda)*Delta_t));
While we can derive an expression for the minimum time step size allowed in order
for all of the eigenvalues to be inside the stability region of the explicit Euler method,
a more interesting way for this example will be to experiment with some different
choices of ∆t and see how it affects the solution.
The complete program is given in Example6_2.m. As such, two different solutions
are shown in Figures 6.5(a) - 6.5(d) for ∆t = 0.1 and ∆t = 0.5, showing where the
λm ∆t terms lie on the complex plane in each case. As can be observed, when the
λm ∆t values are inside the stability region the solution decays with time (as it
should), whereas when the λm ∆t values are outside the stability region the solution
‘blows up’.
Returning to the case where λ is purely imaginary, the amplification factor of the explicit Euler method can be written in polar form as:

\sigma = 1 + i\lambda_{Im}\Delta t = Ze^{i\theta}

where:

Z = \sqrt{1 + (\lambda_{Im}\Delta t)^2} \qquad (6.9)

represents the magnitude and:

\theta = \tan^{-1}\left(\frac{\lambda_{Im}\Delta t}{1}\right)
represents the angle. These definitions are useful in order to perform an error
analysis. Let’s now compare the exact solution at time t = l∆t:
Figure 6.5: Solution to the ODE system in Example 6.2 illustrating (a) the λ∆t
terms in relation to the stability region of the explicit Euler method and (b) the
solution with ∆t = 0.1 (c) the λ∆t terms in relation to the stability region of the
explicit Euler method and (d) the solution with ∆t = 0.5.
Figure 6.6: The amplification factor σ shown in the complex plane, with real and imaginary parts σRe and σIm, magnitude Z, and angle θ.
\phi(l\Delta t) = \phi^0 e^{i\lambda_{Im}l\Delta t}

with the numerical solution:

\phi^l = \phi^0\sigma^l = \phi^0 Z^l e^{il\theta}
Dividing the two equations and rearranging in terms of the numerical solution we get:

\phi^l = \phi(l\Delta t)\,Z^l e^{il(\theta - \lambda_{Im}\Delta t)}

so that Z^l represents the error in amplitude and l(\theta - \lambda_{Im}\Delta t) the error in phase accumulated after l time steps. Making use of the Taylor series expansion:

\tan^{-1}(\alpha) = \alpha - \frac{\alpha^3}{3} + \frac{\alpha^5}{5} - \frac{\alpha^7}{7} + \ldots

we can then rewrite the phase error per time step as:

\theta - \lambda_{Im}\Delta t = \tan^{-1}(\lambda_{Im}\Delta t) - \lambda_{Im}\Delta t \approx -\frac{(\lambda_{Im}\Delta t)^3}{3}
So for the explicit Euler method we have now considered the accuracy in terms of both phase and amplitude, and investigated the stability region of the method. We will be doing
this for all of the numerical methods throughout the rest of this part of the book,
so that we can learn more about the advantages, disadvantages of each method.
Example 6.3:
In this example we will develop a Matlab program to solve the example system:
\frac{d^2\phi}{dt^2} = -4\phi \qquad (6.11)
in the domain t ∈ [0, 10], with initial conditions φ(0) = 1 and φ̇(0) = 0 and
compare the numerical solution with the exact solution φ(t) = cos(2t) and hence
φ̇(t) = −2 sin(2t). The intended learning outcomes for this example will be to see
how the use of an implicit method means solving a system of equations at every
time step.
In order to begin, we will need to break this second order ODE into a system of
first order ODEs. We do so by making the definition:
\phi_1 = \phi \qquad \phi_2 = \frac{d\phi_1}{dt}
Figure 6.7: A schematic illustrating one step in the implicit Euler method. The
green arrow illustrates the gradient f (φl+1 , tl+1 ) which is used to step the solution
forward. The pink line illustrates computed step to φl+1 . The blue line illustrates
the analytical solution. Also illustrated is the local truncation error, which is the
difference between the analytical and numerical solutions at time step l + 1.
\frac{d\phi_1}{dt} = \phi_2

\frac{d\phi_2}{dt} = -4\phi_1

so that the system can be written as:

\dot{\phi} = K\phi

where \phi = \{\phi_1, \phi_2\}^T and K is the matrix:

K = \begin{bmatrix} 0 & 1 \\ -4 & 0 \end{bmatrix}
We will actually store the solution as a 2D array in Matlab as:
phi = zeros(N_e, N_t)
Applying the implicit Euler method to this system gives:

\phi^{l+1} = \phi^l + \Delta tK\phi^{l+1}

which can be rearranged as:

(I - \Delta tK)\phi^{l+1} = \phi^l \qquad (6.12)

which we can recognize as a linear system of the form:

A\phi^{l+1} = b
where A = (I − ∆tK) and b = φl . So the important point is that by using an
implicit method to solve our system of ODEs we have to solve a system of equations
at every time step in order to compute the solution φl+1 . Now, because we had a
linear ODE, we get a linear system of equations to solve. As we will see shortly, when
we have a nonlinear system of ODEs and we use an implicit method, we have to solve
a nonlinear system of equations. Another point worthy of mention is that we could
in principle use any of the methods from Part I to solve the linear system at each
time step. Furthermore, because the system is such a small one, a direct method
such as Gaussian Elimination or LU Decomposition might be a good choice. Here
however, because the focus of this example is on how we apply the implicit Euler
method to solving a system of ODEs, we will simply use the backslash operator,
such that the code solving our system will be given as:
for l=1:N_t-1
phi(:, l+1) = (I - Delta_t*K) \ phi(:, l);
end
Figure 6.8: Solution to the ODE system in Example 6.3 with ∆t = 0.05 showing (a)
φ1 = φ(t) and (b) φ2 = φ̇(t). The green lines show the exact solution and the blue
dotted lines show the numerical solution.
Example 6.4:
In this example we will develop a Matlab program to solve the example system:
\frac{d\phi_1}{dt} = \phi_2\phi_3

\frac{d\phi_2}{dt} = -\phi_1\phi_3

\frac{d\phi_3}{dt} = -0.5\phi_1\phi_2 \qquad (6.13)
in the domain t ∈ [0, 10], with initial conditions φ1 (0) = 0, φ2 (0) = 1, φ3 (0) = 1.
The intended learning outcomes for this example will be to see how the application
of an implicit method to a nonlinear system of ODEs requires iteratively solving a
linear system of equations.
In order to begin, we apply the implicit Euler method by substituting for f (φ, t)
into Equation 6.10 to get:

\phi_1^{l+1} = \phi_1^l + \Delta t\,\phi_2^{l+1}\phi_3^{l+1}

\phi_2^{l+1} = \phi_2^l - \Delta t\,\phi_1^{l+1}\phi_3^{l+1}

\phi_3^{l+1} = \phi_3^l - 0.5\Delta t\,\phi_1^{l+1}\phi_2^{l+1} \qquad (6.14)
It can be observed that unlike Example 6.3 this system is nonlinear and consequently we won't be able to put the system in Equation 6.14 into the form Aφ^{l+1} = b. In this case we will have to use an iterative method to solve the nonlinear system and as such we will use Newton's method, which we studied in Chapter 4. To do so we put the system into the form f(φ^{l+1}) = 0 as:
f = \begin{bmatrix} \phi_1^{l+1} - \phi_1^l - \Delta t\,\phi_2^{l+1}\phi_3^{l+1} \\ \phi_2^{l+1} - \phi_2^l + \Delta t\,\phi_1^{l+1}\phi_3^{l+1} \\ \phi_3^{l+1} - \phi_3^l + 0.5\Delta t\,\phi_1^{l+1}\phi_2^{l+1} \end{bmatrix} = 0
and then we compute the Jacobian matrix by working out all of the derivative
expressions. For this example we get:
J = \frac{\partial f_m}{\partial\phi_n^{l+1}} = \begin{bmatrix} 1 & -\Delta t\phi_3^{l+1} & -\Delta t\phi_2^{l+1} \\ \Delta t\phi_3^{l+1} & 1 & \Delta t\phi_1^{l+1} \\ 0.5\Delta t\phi_2^{l+1} & 0.5\Delta t\phi_1^{l+1} & 1 \end{bmatrix}
We can then iteratively solve the linear system of equations at each time step:

J\Delta\phi = -f

and improve the values for \phi^{l+1} as:

\phi^{l+1,k+1} = \phi^{l+1,k} + \Delta\phi
where k denotes the iteration, until we converge on a solution for the time step l + 1.
As with all iterative techniques we need to ‘guess’ the solution for φl+1 before we
begin the iterative process. One approach would be to simply take φl+1,k=0 = 0,
which is a perfectly reasonable guess. A far more efficient approach however would
be to use the converged solution to the previous time step (i.e. φl+1,k=0 = φl ). In
terms of implementing the program in Matlab, we combine the time marching loop with an inner iterative loop as:
for l = 1:N_t-1
    t(l+1) = t(l) + Delta_t;
    phi(:,l+1) = phi(:,l);
    f = [ phi(1,l+1) - phi(1,l) - Delta_t*phi(2,l+1)*phi(3,l+1);
          phi(2,l+1) - phi(2,l) + Delta_t*phi(1,l+1)*phi(3,l+1);
          phi(3,l+1) - phi(3,l) + 0.5*Delta_t*phi(1,l+1)*phi(2,l+1)];
    r_norm = max(abs(f));
    k = 0;
    while r_norm>tolerance && k<maxIterations
        f = [ phi(1,l+1) - phi(1,l) - Delta_t*phi(2,l+1)*phi(3,l+1);
              phi(2,l+1) - phi(2,l) + Delta_t*phi(1,l+1)*phi(3,l+1);
              phi(3,l+1) - phi(3,l) + 0.5*Delta_t*phi(1,l+1)*phi(2,l+1)];
        J = [ 1                      -Delta_t*phi(3,l+1)     -Delta_t*phi(2,l+1);
              Delta_t*phi(3,l+1)      1                       Delta_t*phi(1,l+1);
              0.5*Delta_t*phi(2,l+1)  0.5*Delta_t*phi(1,l+1)  1 ];
        Delta_phi = -J\f;
        phi(:,l+1) = phi(:,l+1) + Delta_phi;
        r_norm = max(abs(f));
        k = k + 1;
    end
end
Where a point to note is that we are using the infinity norm as the measure of convergence.
The complete program is given in Example6_4.m. Figure 6.9 shows the numerical solution to the system in Equation 6.13. As can be observed, the amplitude of the oscillations is decreasing in time, which is incorrect, since the true solution should exhibit no change in the amplitude of the oscillations.
Figure 6.9: Solution to the ODE system in Example 6.4 with ∆t = 0.1.
Applying the implicit Euler method to the model problem in Equation 6.4 gives:

\phi^{l+1} = \phi^l + \lambda\Delta t\phi^{l+1}

which after rearranging gives:

\phi^{l+1} = \frac{\phi^l}{1 - \lambda\Delta t}
Thus the solution at any time step l can be written as:
φl = φ0 σ l
where our amplification factor this time is:
\sigma = \frac{1}{1 - \lambda\Delta t}
As before, in order to ensure the stability of the numerical method we require |σ| ≤ 1, therefore:

|\sigma| = \frac{1}{\sqrt{(1 - \lambda_{Re}\Delta t)^2 + (\lambda_{Im}\Delta t)^2}} \le 1

or, rearranging:

(1 - \lambda_{Re}\Delta t)^2 + (\lambda_{Im}\Delta t)^2 \ge 1

This is the equation of a circle of radius 1 centered at (1, 0), and the inequality implies that the method is only unstable when λ∆t lies inside this circle, as illustrated in Figure 6.10. Considering again the case where λ is purely imaginary, we can write the amplification factor as the ratio σ = σ_1/σ_2, with σ_1 = 1 and σ_2 = 1 − iλ_{Im}∆t, such that:

Z_1 = |\sigma_1| = 1

Z_2 = |\sigma_2| = \sqrt{1 + (\lambda_{Im}\Delta t)^2}

and:
Figure 6.10: The stability diagram for the implicit Euler method.
\theta_1 = \tan^{-1}\left(\frac{0}{1}\right) = 0

\theta_2 = \tan^{-1}\left(-\frac{\lambda_{Im}\Delta t}{1}\right)

The overall magnitude and angle of the amplification factor are then Z = Z_1/Z_2 and \theta = \theta_1 - \theta_2, giving:

Z = \frac{1}{\sqrt{1 + (\lambda_{Im}\Delta t)^2}} \qquad (6.15)

and:

\theta = \tan^{-1}\left(\frac{\lambda_{Im}\Delta t}{1}\right)
Comparing the exact solution at time t = l∆t:

\phi(l\Delta t) = \phi^0 e^{i\lambda_{Im}l\Delta t}

with the numerical solution:

\phi^l = \phi^0\sigma^l = \phi^0 Z^l e^{il\theta}

and dividing the two equations we get:

\phi^l = \phi(l\Delta t)\,Z^l e^{il(\theta - \lambda_{Im}\Delta t)}

Since Z < 1 for the implicit Euler method, the numerical solution is damped in amplitude as well as shifted in phase relative to the exact solution, as illustrated in Figure 6.11.
Figure 6.11: The solution to the model problem after a number of cycles, when λ = i,
for (a) the explicit and (b) the implicit Euler methods. The green curves illustrate
the analytical solution and the blue curves illustrate the numerical solution, in
particular highlighting the phase and amplitude error.
Chapter 7
Crank-Nicolson Methods
Figure 7.1: A schematic illustrating one step in the Crank-Nicolson method. The
green arrows illustrate the gradients f (φl , tl ) and f (φl+1 , tl+1 ) which are both used
to step the solution forward. The pink line illustrates the computed step to φ^{l+1}. The
blue line illustrates the analytical solution. Also illustrated is the local truncation
error, which is the difference between the analytical and numerical solutions at time
step l + 1.
In Chapter 6 we studied the Euler methods, which are pretty much the simplest
way in which we can solve an ODE numerically. These methods were only first order
accurate however and so an appropriate question would be, how can we improve on
that accuracy? In fact all of the numerical methods for solving ODEs that we will
study from here on in will be an improvement on the Euler methods and the one
such method that will be the focus of this chapter is the Crank-Nicolson method.
The basic idea behind the method is to compute φ(tl+1 ) by integration:
φ(t^{l+1}) = φ(t^l) + ∫_{t^l}^{t^{l+1}} f(φ, t) dt    (7.1)
Approximating this integral with the trapezoidal rule, using the gradient at both ends of the time step, gives the (implicit) Crank-Nicolson method:
φ^{l+1} = φ^l + (∆t/2) [ f(φ^l, t^l) + f(φ^{l+1}, t^{l+1}) ]
Example 7.1:
In this example we will develop a Matlab program to solve the example system:
d²φ/dt² = −4φ    (7.3)
in the domain t ∈ [0, 10], with initial conditions φ(0) = 1 and φ̇(0) = 0 and compare
the numerical solution with the exact solution φ(t) = cos(2t) and hence φ̇(t) =
−2 sin(2t). The intended learning outcomes for this example will be to simply
observe the application of the Crank-Nicolson method to solve a system of ODEs.
In order to begin, as we did in Example 6.3, we will need to break this second
order ODE into a system of first order ODEs. We do so by making the definition:
φ_1 = φ
φ_2 = dφ_1/dt
which gives the system:
dφ_1/dt = φ_2
dφ_2/dt = −4φ_1
or, in matrix form:
φ̇ = Kφ
where φ = {φ_1, φ_2}^T and K is the matrix:
K = [  0   1 ;
      −4   0 ]
We will actually store the solution as a 2D array in Matlab as:
phi = zeros(N_e, N_t)
Applying the Crank-Nicolson method to this system gives:
φ^{l+1} = φ^l + (∆t/2) ( Kφ^l + Kφ^{l+1} )
which can be rearranged as:
( I − (∆t/2) K ) φ^{l+1} = ( I + (∆t/2) K ) φ^l
or:
Aφ^{l+1} = b
where A = (I − ∆t/2K) and b = (I + ∆t/2K)φl . So as with the implicit Euler
method we have to solve a system of equations at every time step in order to compute
the solution φ^{l+1}. Now, because we had a linear ODE, we get a linear system of
equations to solve. Since we are more interested in the application of the Crank-
Nicolson method than in how we solve this system, we will again simply use the
backslash operator to solve the resulting linear system and hence the code for solving
this system of ODEs is:
for l = 1:N_t-1
phi(:,l+1) = ((I - Delta_t/2*K)\(I + Delta_t/2*K))*phi(:,l);
end
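As a side note on the design of this loop: because K and ∆t are constant, the matrices in the linear system do not change between time steps. A minimal variation (not part of Example7_1.m; the factorization step is an assumption about one way this could be sped up) is to form and factorize A once, outside the loop:

A = I - Delta_t/2*K;            % constant left hand side matrix
B = I + Delta_t/2*K;            % constant matrix forming the right hand side
[L_A, U_A] = lu(A);             % factorize A once
for l = 1:N_t-1
    phi(:,l+1) = U_A\(L_A\(B*phi(:,l)));   % reuse the factorization every step
end

For a 2 × 2 system the saving is negligible, but the same idea becomes important once the much larger systems arising from partial differential equations are encountered.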
Figure 7.2: Solution to the ODE system in Example 7.1 with ∆t = 0.05 showing (a)
φ1 = φ(t) and (b) φ2 = φ̇(t). The green lines show the exact solution and the blue
dotted lines show the numerical solution.
To analyse the stability of the Crank-Nicolson method we apply it to the model problem φ̇ = λφ:
φ^{l+1} = φ^l + (∆t/2)( λφ^l + λφ^{l+1} )
       = [ (1 + ∆tλ/2) / (1 − ∆tλ/2) ] φ^l
so that the solution at any time step l can be written as:
φ^l = φ^0 [ (1 + ∆tλ/2) / (1 − ∆tλ/2) ]^l
    = φ^0 [ (1 + (∆t/2)λ_Re + i(∆t/2)λ_Im) / (1 − (∆t/2)λ_Re − i(∆t/2)λ_Im) ]^l
    = φ^0 σ^l
where our amplification factor this time is:
σ = (1 + ∆tλ/2) / (1 − ∆tλ/2)
Then in order for the numerical method to be stable we have |σ| ≤ 1, therefore:
|σ| = √( (1 + (∆t/2)λ_Re)² + ((∆t/2)λ_Im)² ) / √( (1 − (∆t/2)λ_Re)² + ((∆t/2)λ_Im)² ) ≤ 1
or, rearranging and simplifying:
√( (1 + (∆t/2)λ_Re)² + ((∆t/2)λ_Im)² ) ≤ √( (1 − (∆t/2)λ_Re)² + ((∆t/2)λ_Im)² )
1 + ∆tλ_Re + (∆t²/4)λ_Re² + (∆t²/4)λ_Im² ≤ 1 − ∆tλ_Re + (∆t²/4)λ_Re² + (∆t²/4)λ_Im²
2λ_Re ∆t ≤ 0
λ_Re ∆t ≤ 0
Therefore the stability region of the Crank-Nicolson method is the entire left hand
plane on the stability plot (Figure 7.3), and the method is stable for any choice of
∆t (i.e. unconditionally stable), provided λRe is negative.
In order to perform the error analysis we are going to consider the case where λ
is imaginary. We want to get σ into polar form and the best way to do this is:
σ = (1 + i∆tλ_Im/2) / (1 − i∆tλ_Im/2) = σ_1/σ_2 = (Z_1 e^{iθ_1}) / (Z_2 e^{iθ_2}) = (Z_1/Z_2) e^{i(θ_1 − θ_2)} = Z e^{iθ}
In this case:
Z_1 = |σ_1| = √( 1 + (λ_Im ∆t/2)² )
Z_2 = |σ_2| = √( 1 + (λ_Im ∆t/2)² )    (7.4)
Figure 7.3: The stability diagram for the implicit Crank-Nicolson method.
and:
θ_1 = tan⁻¹( λ_Im ∆t / 2 )
θ_2 = tan⁻¹( −λ_Im ∆t / 2 )    (7.5)
and dividing the two equations as we did with the Euler methods:
Z = Z_1/Z_2 = 1
θ = θ_1 − θ_2 = 2 tan⁻¹( λ_Im ∆t / 2 )
we can see that there is no amplitude error associated with the Crank-Nicolson
method and that the phase error is:
θ − λ_Im ∆t = 2 tan⁻¹( λ_Im ∆t / 2 ) − λ_Im ∆t
            = 2 [ (λ_Im ∆t/2) − (λ_Im ∆t/2)³/3 + (λ_Im ∆t/2)⁵/5 − (λ_Im ∆t/2)⁷/7 + . . . ] − λ_Im ∆t
            = −(λ_Im ∆t)³/12 + (λ_Im ∆t)⁵/80 + . . .
which is of order O((λ_Im ∆t)³). So compared to the Euler methods the phase error here
is of the same order, and because this term is negative we would expect a phase lead
in the solution; the leading term, however, has 12 in the denominator instead of 3 and so will
be smaller. To highlight this result, Figure 7.4 illustrates the solution to the model
problem after a number of time steps for the implicit Crank-Nicolson method. As
can be observed, the numerical solution is ahead of the exact solution, but has not
developed any error in the amplitude.
Figure 7.4: The solution to the model problem with the Crank-Nicolson Method
after a number of cycles, when λ = i. The green curve illustrates the analytical so-
lution and the blue curve illustrates the numerical solution, in particular highlighting
the phase error and that there is no amplitude error.
φl+1 − φl = O(∆t)
So substituting into Equation 7.7 gives:
f(φ^{l+1}, t^{l+1}) = f(φ^l, t^{l+1}) + (φ^{l+1} − φ^l) (∂f/∂φ)(φ^l, t^{l+1}) + O(∆t²)    (7.8)
Substituting Equation 7.8 into Equation 7.6 gives:
φ^{l+1} = φ^l + (∆t/2) [ f(φ^l, t^{l+1}) + (φ^{l+1} − φ^l)(∂f/∂φ)(φ^l, t^{l+1}) + f(φ^l, t^l) ]
So, it is now possible to obtain an explicit expression for φ^{l+1}:
φ^{l+1} = φ^l + (∆t/2) [ f(φ^l, t^{l+1}) + f(φ^l, t^l) ] / [ 1 − (∆t/2)(∂f/∂φ)(φ^l, t^{l+1}) ]    (7.9)
This method has good stability characteristics but suffers from the problem that you
have to find the derivative of f with respect to φ. This may not always be possible,
or may just be a pain to do. Now the implicit Crank-Nicolson method is the more
common of these two, so from now on we will refer to the implicit Crank-Nicolson
method simply as the Crank-Nicolson method, realizing that it is implicit in nature.
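To make Equation 7.9 concrete, here is a minimal sketch applying the explicit Crank-Nicolson update to the model problem f(φ, t) = λφ, for which ∂f/∂φ = λ; the values of λ, ∆t, and N_t are illustrative assumptions:

lambda  = -1.0;                  % illustrative decay rate
Delta_t = 0.1;                   % illustrative time step size
N_t     = 101;
phi     = zeros(1, N_t);
phi(1)  = 1.0;                   % initial condition
for l = 1:N_t-1
    f_l    = lambda*phi(l);      % f(phi^l, t^l), equal to f(phi^l, t^{l+1}) for this autonomous problem
    dfdphi = lambda;             % df/dphi for the model problem
    phi(l+1) = phi(l) + Delta_t/2*(f_l + f_l)/(1 - Delta_t/2*dfdphi);   % Equation 7.9
end

No system of equations is solved here; the price is that ∂f/∂φ must be available, which is exactly the drawback noted above.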
Chapter 8
Leapfrog Methods
Figure 8.1: A schematic illustrating one step in the Leapfrog Method. The green
arrow illustrates the gradient f (φl , tl ) which is used to step the solution forward. The
pink line illustrates the computed step to φ^{l+1}. The blue line illustrates the analytical
solution. Also illustrated is the local truncation error, which is the difference between
the analytical and numerical solutions at time step l + 1.
φ(t^{l+1}) = φ(t^l) + (∆t/1!) dφ/dt|_{t^l} + (∆t²/2!) d²φ/dt²|_{t^l} + (∆t³/3!) d³φ/dt³|_{t^l} + . . .
φ(t^{l−1}) = φ(t^l) − (∆t/1!) dφ/dt|_{t^l} + (∆t²/2!) d²φ/dt²|_{t^l} − (∆t³/3!) d³φ/dt³|_{t^l} + . . .
Substituting into Equation 8.1, along with the relation that f (φl , tl ) is the first
derivative we get:
φ(t^l) + (∆t/1!) dφ/dt|_{t^l} + (∆t²/2!) d²φ/dt²|_{t^l} + (∆t³/3!) d³φ/dt³|_{t^l} + . . .
= φ(t^l) − (∆t/1!) dφ/dt|_{t^l} + (∆t²/2!) d²φ/dt²|_{t^l} − (∆t³/3!) d³φ/dt³|_{t^l} + . . . + 2 (∆t/1!) dφ/dt|_{t^l}
So it can be observed that we can cancel everything up to the terms involving the
third order derivatives and so we can infer from this that the Leapfrog method will
have a local truncation error of order ∆t3 and hence a global truncation error of
order ∆t2 .
Now, in practice, the way in which we would generally apply a Leapfrog method
is to a system of two ODEs. Then, as we integrate forward in time, we would
stagger the computations, meaning that for one of the dependent variables we would
compute values at time steps l, l +1, l +2, . . . as usual, but for the other, we compute
values at time steps l + 1/2, l + 3/2, l + 5/2, . . .. It is not very often in this book that we will
use specific variables in the development of the numerical method, but in this case,
this approach is warranted, since it will help clarify the way the algorithm works.
The ‘classic’ use of a Leapfrog method, is integrating equations of motion where we
have:
dv/dt = a
dr/dt = v
where the variables r, v , a denote position, velocity, and acceleration of a particle
respectively. The way in which we would apply the Leapfrog method to solve this
system would be to write:
v^{l+1/2} = v^{l−1/2} + ∆t a^l    (8.2)
r^{l+1} = r^l + ∆t v^{l+1/2}    (8.3)
So we can see that the computations of r and v are staggered, or ‘leaping over’ one
another. Now, since we will be starting our time marching from an integer time
step, we will need to compute the velocity half a time step ahead, and the simplest
way in which we could do this would be as:
v^{l+1/2} = v^l + (∆t/2) a^l    (8.4)
which is just an explicit Euler method stepping forward half a time step. We can
also use this result to write the Leapfrog method in a form that will only involve
computations at integer time steps. If we substitute Equation 8.4 into Equation 8.3
we get:
r^{l+1} = r^l + ∆t ( v^l + (∆t/2) a^l )
        = r^l + ∆t v^l + (∆t²/2) a^l
We can also then shift Equation 8.2 forward half a time step (i.e. adding 1/2 to each
time step index) and then approximate the resulting a^{l+1/2} as the mean of a^l and a^{l+1}
(which is actually what we did with the Crank-Nicolson method in Chapter 7) so
that we get the discrete system:
r^{l+1} = r^l + ∆t v^l + (∆t²/2) a^l
v^{l+1} = v^l + (∆t/2) ( a^l + a^{l+1} )
Although these equations look quite different to Equations 8.3 and 8.2, it is still
the same numerical method, with the positions and velocities now both defined at integer time steps.
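As a minimal sketch of the staggered form in Equations 8.2 to 8.4 (assuming, purely for illustration, that the acceleration can be evaluated from the position as a(r) = −4r, so that the results are comparable with Example 8.1):

Delta_t = 0.05;  N_t = 201;
r      = zeros(1, N_t);
r(1)   = 1.0;                              % initial position
v_half = 0 + Delta_t/2*(-4*r(1));          % Equation 8.4: velocity half a step ahead
for l = 1:N_t-1
    r(l+1) = r(l) + Delta_t*v_half;        % Equation 8.3
    v_half = v_half + Delta_t*(-4*r(l+1)); % Equation 8.2, shifted to the next half step
end

The velocity is never stored at integer time steps here; if values of v at integer steps are required, the integer-step form derived above can be used instead.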
Example 8.1:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
d²φ/dt² = −4φ
in the domain t ∈ [0, 10], with initial conditions φ(0) = 1 and φ̇(0) = 0 and
compare the numerical solution with the exact solution φ(t) = cos(2t) and hence
φ̇(t) = −2 sin(2t). The intended learning outcomes for this example will be to sim-
ply observe the application of the Leapfrog method to solve a system of ODEs.
As we did in Examples 6.3 and 7.1, we will begin by breaking this second order
ODE into a system of first order ODEs by making the definition:
φ_1 = φ
φ_2 = dφ_1/dt
which, applied to Equation 6.11, gives the 2 × 2 system:
dφ_1/dt = φ_2
dφ_2/dt = −4φ_1
We can then apply the Leapfrog method at integer time steps, to get:
φ_1^{l+1} = φ_1^l + ∆t φ_2^l + (∆t²/2)(−4φ_1^l)
φ_2^{l+1} = φ_2^l + (∆t/2)( −4φ_1^l + (−4φ_1^{l+1}) )
which are explicit expressions that we can implement in our code as:
for l=1:N_t-1
t(l+1) = t(l) + Delta_t;
phi(1,l+1) = phi(1,l) + Delta_t*phi(2,l) - 2*Delta_t^2*phi(1,l);
phi(2,l+1) = phi(2,l) - 2*Delta_t*(phi(1,l) + phi(1,l+1));
end
in Matlab, and:
for(int l=0; l<N_t-1; l++)
{
t[l+1] = t[l] + Delta_t;
phi[0][l+1] = phi[0][l] + Delta_t*phi[1][l] - 2*Delta_t*Delta_t*phi[0][l];
phi[1][l+1] = phi[1][l] - 2*Delta_t*(phi[0][l] + phi[0][l+1]);
}
in C++. These two code snippets are essentially all that is required to implement
a Leapfrog method for solving this simple ODE. Note that the major differences are
the indexing between Matlab and C++, computing ∆t² slightly differently
in C++, and the limits of the for loop.
The complete program is given in Example8_1.m. Figures 8.2(a) - 8.2(b) illustrate
the solution and it can be observed that for the time step size ∆t = 0.05 the solution
appears to be of a similar accuracy to the Crank-Nicolson solution in Example 7.1
(Figures 7.2(a) - 7.2(b)) and more accurate than the solution using the implicit
Euler method in Example 6.3 (Figures 6.8(a) - 6.8(b)).
Figure 8.2: Solution to the ODE system in Example 8.1 with ∆t = 0.05 showing (a)
φ1 = φ(t) and (b) φ2 = φ̇(t). The green lines show the exact solution and the blue
dotted lines show the numerical solution.
Applying the Leapfrog method to the model problem φ̇ = λφ gives:
φ^{l+1} = φ^{l−1} + 2λ∆t φ^l
This equation is different to those for the Euler and Crank-Nicolson methods that
we have seen thus far, since it contains multiple time steps. In order to compute
the amplification factor we assume φ^l = φ^0 σ^l as we had with the previous methods.
Substituting in we get:
σ² − 2λ∆t σ − 1 = 0
so we have a polynomial in terms of σ to solve. It is in fact a key characteristic of
multistep methods that we have multiple roots for the amplification factor and in
this case we have two. Using the Quadratic Formula, the solution can be shown to
be:
σ_{1,2} = λ∆t ± √( λ²∆t² + 1 )
Then, using a Power Series Expansion[54] for the term in the square root:
√(1 + α) = 1 + (1/2)α − (1/8)α² + (1/16)α³ − (5/128)α⁴ + . . .
we can approximate the roots in terms of powers of λ∆t to get:
σ_1 = λ∆t + √( λ²∆t² + 1 ) = λ∆t + 1 + (1/2)λ²∆t² − (1/8)λ⁴∆t⁴ + . . .
σ_2 = λ∆t − √( λ²∆t² + 1 ) = λ∆t − 1 − (1/2)λ²∆t² + (1/8)λ⁴∆t⁴ + . . .
If we consider the limit as ∆t → 0 and forget about the terms containing ∆t2 and
higher we obtain the asymptotic forms:
σ1 ≈ 1 + λ∆t
σ2 ≈ −1 + λ∆t
and so, considering σ_1 and using our relation φ^l = φ^0 σ^l, we see that φ^l ≈ φ^0 e^{λt}, which is the analytical
solution of the model problem. If we now consider σ_2, then applying the same logic,
this root does not correspond to the solution of the model problem and for this
reason it is known as the spurious root. So in order to determine the stability of the
Leapfrog method we need to analyze each root separately for different values of λ.
If λ > 0 then it is possible that |σ2 | < 1 and so the spurious solution decays,
whereas it must be the case that |σ1 | > 1 and so the true solution grows, hence the
method will be unstable in this case. If on the other hand λ < 0 then it is possible
that |σ_1| < 1, but it must be the case that |σ_2| > 1 and so the spurious solution will
grow and make the method unstable. So we have seen that either way, one of the
roots is going to make the method unstable. In fact the only way in which we can get
a stable solution is if λ_Re = 0, and then in order for |σ| ≤ 1 we must have −i ≤ λ∆t ≤ i,
meaning that the region of stability of the Leapfrog Method is confined to a line
along the imaginary axis (Figure 8.3). This might seem a bit disappointing given
the comparison to the stability regions of the implicit Euler and Crank-Nicolson
methods that we came across in Chapters 6 and 7, but the Leapfrog method is still
useful if we are solving problems that have purely oscillatory solutions. As we shall
see in Part V this does occur in practice.
Figure 8.3: The stability diagram for the Leapfrog method (axes λ_Re ∆t and λ_Im ∆t), with the stable region confined to the segment of the imaginary axis between −i and +i; everywhere else is unstable.
Considering, as before, the case where λ is purely imaginary, the first root becomes:
σ_1 = iλ_Im ∆t + √( (iλ_Im ∆t)² + 1 )
    = Z e^{iθ}
where:
Z = √( ( √(1 − (λ_Im ∆t)²) )² + (λ_Im ∆t)² )
  = √( 1 − (∆tλ_Im)² + (∆tλ_Im)² )
  = 1
So we can see that as with the Crank-Nicolson method there is no amplitude error
associated with the Leapfrog method. In order to compute the phase error we will
approximate the first amplification factor with the power series expansion up to
order O((λ∆t)²). In this case we can write:
σ_1 ≈ 1 − (λ_Im ∆t)²/2 + iλ_Im ∆t
and:
θ = tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) )
Here we will use the power series expansion for the tan−1 function as we did previ-
ously, but we will also use the power series expansion:
1/(1 − α) = 1 + α + α² + α³ + α⁴ + . . .
tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) ) = tan⁻¹( λ_Im ∆t ( 1 / (1 − (λ_Im ∆t)²/2) ) )
= tan⁻¹( λ_Im ∆t ( 1 + (λ_Im ∆t)²/2 + ((λ_Im ∆t)²/2)² + ((λ_Im ∆t)²/2)³ + . . . ) )
= tan⁻¹( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )
= ( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )
  − (1/3)( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )³
  + (1/5)( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )⁵ + . . .
≈ λ_Im ∆t + (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . .
Now, using the same approach that we did for the Euler and Crank-Nicolson meth-
ods, we can show that the phase error is given by:
θ − λ_Im ∆t = tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) ) − λ_Im ∆t
            = λ_Im ∆t + (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . . − λ_Im ∆t
            = (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . .
So it can be observed that the phase error associated with the Leapfrog method is
of the same order as for the Euler and Crank-Nicolson methods and having 6 in
the denominator means that the phase error should be less than that introduced by
the Euler methods, but more than that introduced by the Crank-Nicolson method.
This time however the leading order term is positive, so we would expect a phase
lag in the solution (meaning that the oscillations in the numerical solution will be
behind the exact solution). Figure 8.4 illustrates the solution to the model problem
after a number of time steps for the Leapfrog method. As can be observed, the
numerical solution is behind the exact solution, but has not developed any error in
the amplitude.
It is worth finishing this study of the Leapfrog method by comparing it to the
Crank-Nicolson method. Both have second order accuracy, no amplitude error, and
similar phase error magnitudes. The stability region of the Leapfrog Method is
Figure 8.4: The solution to the model problem with the Leapfrog Method after a
number of cycles, when λ = i. The green curve illustrates the analytical solution
and the blue curve illustrates the numerical solution, in particular highlighting the
phase error and that there is no amplitude error.
restricted to the imaginary axis, meaning that it is only stable for purely oscillatory
problems, so you might wonder why we wouldn’t just always choose the Crank-
Nicolson method, since the stability region was the entire left hand plane. One very
important practical reason is that the Leapfrog Method is fully explicit, unlike the
Crank-Nicolson method where we must solve a system of equations at each time step
and so for problems where solving a system would be too computationally intensive
the Leapfrog method could be a good choice.
Chapter 9
Adams-Bashforth Methods
Figure 9.1: A schematic illustrating one step in the second order Adams-Bashforth
method. The green arrows illustrate the gradients f (φl−1 , tl−1 ) and f (φl , tl ) which
are both used to step the solution forward. The pink line illustrates the computed step
to φl+1 . The blue line illustrates the analytical solution.
In Chapter 8 we studied the Leapfrog method, which used the solution from
two previous time steps in order to improve the accuracy. Another possibility is to
use the gradient from previous time steps, which is in fact what we will do with
the Adams-Bashforth methods, which are the focus of this chapter (Figure 9.1).
This family of methods also falls into the category of multistep methods and having
now been exposed to one implementation of a multistep method with the Leapfrog
method, it is worth generalizing our notion of multistep methods, where the most
general form can be written as:
Σ_{n=0}^{N} a_n φ^{l+1−n} = ∆t Σ_{n=0}^{N} b_n f(φ^{l+1−n}, t^{l+1−n})
Here, on the left hand side we have the solution itself at time steps φl+1−n ranging
from the new time step φl+1 back n time steps and on the right hand side we have the
gradient evaluated at time steps f (φl+1−n , tl+1−n ) ranging from the new time step
f (φl+1 , tl+1 ) back n time steps. These terms are then weighted by coefficients an
and bn and in developing our particular numerical method we will have to determine
these values. For the development of our Adams-Bashforth methods we will choose
a0 = 1 and b0 = 0. Because of this second choice we will have an explicit method,
because we won’t have f (φl+1 , tl+1 ) on the right hand side (as we did say with the
Crank-Nicolson method). So we will derive two explicit variations of the method
with different orders of accuracy, but it should be noted that other variations of the
method exist, including implicit methods.
A useful starting point for deriving the Adams-Bashforth methods is to return
to the idea that we compute φl+1 by integration (as we did with the Crank-Nicolson
method):
φ^{l+1} = φ^l + ∫_{t^l}^{t^{l+1}} f(φ, t) dt
This time however, instead of using the trapezoidal rule to approximate the integra-
tion, we will instead replace f (φ, t) with an interpolation polynomial [34] p(t) which
we will be able to integrate. This gives approximations φl+1 of φ(tl+1 ) and φl of
φ(tl ):
φ^{l+1} = φ^l + ∫_{t^l}^{t^{l+1}} p(t) dt    (9.1)
Different choices for p(t) will produce the specific variations of the method. Before
we define what this interpolation polynomial will look like we will define the notation:
f^0 ≡ f(φ^0, t^0)
f^1 ≡ f(φ^1, t^1)
⋮
f^l ≡ f(φ^l, t^l)
to make the derivation a little less verbose. That being said, imagine now that we
have the set of data points:
(t^0, f^0), (t^1, f^1), . . . , (t^{N−1}, f^{N−1})
(i.e. at each time step we have evaluated f). Two important points to note at this
stage are that we are starting from t0 and our set of data points is moving forward
in time. We shall see later on that we can generalize both of these assumptions so
that the starting point is arbitrary and we can count backwards in time. With this
in mind we can then say that the Newton form of the interpolation polynomial takes
the form:
p_{N−1}(t) = a_0 + a_1 η_1(t) + a_2 η_2(t) + . . . + a_{N−1} η_{N−1}(t)
where:
η_n(t) = Π_{m=0}^{n−1} (t − t^m) = (t − t^0) × (t − t^1) × . . . × (t − t^{n−1})
Now, because by definition our interpolation polynomial passes through all of the
data points, we have pN −1 (t0 ) = f 0 and using this result we can observe that all but
the first terms will be zero (because they will include the term (t0 − t0 )) and hence:
a0 = f 0
Following this line of reasoning we also have pN −1 (t1 ) = f 1 and can observe that in
this case all but the first two terms will be zero and hence:
f^1 = a_0 + a_1 (t^1 − t^0)
    = f^0 + a_1 (t^1 − t^0)
a_1 = (f^1 − f^0) / (t^1 − t^0)
Continuing this approach one more time (and after a bit of algebra) we get:
a_2 = [ (f^2 − f^1)/(t^2 − t^1) − (f^1 − f^0)/(t^1 − t^0) ] / (t^2 − t^0)
Now, another way to express these coefficients of our interpolation polynomial is:
a_n = [f^0, f^1, . . . , f^n]
Here, the notation [, ] defines what are known as the Divided Differences[14] where:
[f^0] = f^0
[f^0, f^1] = (f^1 − f^0)/(t^1 − t^0)
[f^0, f^1, f^2] = ( [f^1, f^2] − [f^0, f^1] ) / (t^2 − t^0)
              = ( (f^2 − f^1)/(t^2 − t^1) − (f^1 − f^0)/(t^1 − t^0) ) / (t^2 − t^0)
              = (f^2 − f^1)/((t^2 − t^1)(t^2 − t^0)) − (f^1 − f^0)/((t^1 − t^0)(t^2 − t^0))
[f^0, f^1, f^2, f^3] = ( [f^1, f^2, f^3] − [f^0, f^1, f^2] ) / (t^3 − t^0)
are known as Forward Divided Differences and:
[f^0] = f^0
[f^1, f^0] = (f^1 − f^0)/(t^1 − t^0)
[f^2, f^1, f^0] = ( [f^2, f^1] − [f^1, f^0] ) / (t^2 − t^0)
              = ( (f^2 − f^1)/(t^2 − t^1) − (f^1 − f^0)/(t^1 − t^0) ) / (t^2 − t^0)
              = (f^2 − f^1)/((t^2 − t^1)(t^2 − t^0)) − (f^1 − f^0)/((t^1 − t^0)(t^2 − t^0))
[f^3, f^2, f^1, f^0] = ( [f^3, f^2, f^1] − [f^2, f^1, f^0] ) / (t^3 − t^0)
are known as Backward Divided Differences. Now, since with our Adams-Bashforth
method we have t1 − t0 = t2 − t1 = . . . = ∆t (i.e. a constant time step size), we can
simplify these Backward Divided Differences as:
[f^0] = f^0
[f^1, f^0] = (f^1 − f^0)/∆t
[f^2, f^1, f^0] = ( [f^2, f^1] − [f^1, f^0] ) / (2∆t) = (f^2 − 2f^1 + f^0)/(2∆t²)
[f^3, f^2, f^1, f^0] = ( [f^3, f^2, f^1] − [f^2, f^1, f^0] ) / (3∆t) = (f^3 − 3f^2 + 3f^1 − f^0)/(6∆t³)
Furthermore, we can say that t = t0 + l∆t and simplify the form of the interpolation
polynomial to be:
which is known as the Newton Backward Divided Difference Formula and is the form
we will use to represent our interpolation polynomial. It is at this point that we
choose the order of the polynomial and hence how many terms in Equation 9.2 we
wish to keep. Now, you may well wonder how this polynomial is still a function of
time, since we don’t see t appearing anywhere in Equation 9.2. The answer is that
because we have the relation t = t0 + l∆t and our polynomial is a function of l then
in that sense it is a function of t.
p1 (t) = [f 1 ] + [f 1 , f 0 ]l∆t
We must now integrate this polynomial between t^l and t^{l+1}, but to simplify the in-
tegration we will use the limits t^0 and t^1 and then note that because the definition
of these data points is arbitrary the result applies for t^l to t^{l+1}. Now, because our
interpolation polynomial is a function of l we will use the relation t = t0 + l∆t and
perform a change of variables. In this case we can write:
dt = ∆tdl
and so
∫_{t^0}^{t^1} p_1(t) dt = ∫_0^1 p_1(l) ∆t dl
∫_0^1 p_1 ∆t dl = ∆t ∫_0^1 ( [f^1] + [f^1, f^0] l∆t ) dl
              = ∆t [ [f^1] l + [f^1, f^0] ∆t l²/2 ]_0^1
              = ∆t ( [f^1] + [f^1, f^0] ∆t/2 )    (9.3)
If we now substitute for the simplified backward divided difference expressions into
Equation 9.3 we get:
∫_0^1 p_1 ∆t dl = ∆t ( [f^1] + [f^1, f^0] ∆t/2 )
              = ∆t ( f^1 + ((f^1 − f^0)/∆t)(∆t/2) )
              = ∆t ( (3/2) f^1 − (1/2) f^0 )
But since this integral expression is valid between any two consecutive time steps,
we can replace the 0 and 1 superscripts with l − 1 and l so that substituting into
Equation 9.1 we obtain the second order Adams-Bashforth method:
φ^{l+1} = φ^l + ∆t ( (3/2) f^l − (1/2) f^{l−1} )    (9.4)
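As a minimal sketch of how Equation 9.4 might be used in practice (assuming a right hand side function f(phi) and arrays phi, t, and f_store of the appropriate sizes; an explicit Euler step is used for the very first step because f^{l−1} does not exist at l = 1):

for l = 1:N_t-1
    t(l+1) = t(l) + Delta_t;
    f_store(:,l) = f(phi(:,l));           % evaluate and store the gradient at step l
    if l > 1
        % second order Adams-Bashforth step (Equation 9.4)
        phi(:,l+1) = phi(:,l) + Delta_t/2*(3*f_store(:,l) - f_store(:,l-1));
    else
        % explicit Euler step to get the method started
        phi(:,l+1) = phi(:,l) + Delta_t*f_store(:,l);
    end
end

Exactly the same structure, with more stored gradients and a longer startup phase, is used for the fourth order method in Example 9.1 below.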
For the fourth order method we follow exactly the same approach as we did for the second order method
and as such can jump straight into the integration:
∫_0^1 p_3 ∆t dl = ∆t ∫_0^1 ( f^3 + [f^3, f^2] l∆t + [f^3, f^2, f^1] l(l + 1)∆t² + [f^3, f^2, f^1, f^0] l(l + 1)(l + 2)∆t³ ) dl
= ∆t [ [f^3] l + [f^3, f^2] ∆t l²/2 + [f^3, f^2, f^1] ∆t² ( l³/3 + l²/2 ) + [f^3, f^2, f^1, f^0] ∆t³ ( l⁴/4 + l³ + l² ) ]_0^1
= ∆t ( [f^3] + [f^3, f^2] ∆t/2 + [f^3, f^2, f^1] 5∆t²/6 + [f^3, f^2, f^1, f^0] 9∆t³/4 )    (9.5)
If we now again substitute for the simplified backward divided difference expressions
into Equation 9.5 we get:
∫_0^1 p_3 ∆t dl = ∆t ( [f^3] + [f^3, f^2] ∆t/2 + [f^3, f^2, f^1] 5∆t²/6 + [f^3, f^2, f^1, f^0] 9∆t³/4 )
= ∆t ( f^3 + ((f^3 − f^2)/∆t)(∆t/2) + ((f^3 − 2f^2 + f^1)/(2∆t²))(5∆t²/6) + ((f^3 − 3f^2 + 3f^1 − f^0)/(6∆t³))(9∆t³/4) )
= ∆t ( (55/24) f^3 − (59/24) f^2 + (37/24) f^1 − (9/24) f^0 )
But since this integral expression is valid between any four consecutive time steps,
we can replace the 0 to 3 superscripts with l − 3 to l, so that substituting into
Equation 9.1 we obtain the fourth order Adams-Bashforth method:
φ^{l+1} = φ^l + (∆t/24) ( 55 f^l − 59 f^{l−1} + 37 f^{l−2} − 9 f^{l−3} )    (9.6)
Example 9.1:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
dφ_1/dt = φ_2 φ_3
dφ_2/dt = −φ_1 φ_3
dφ_3/dt = −0.5 φ_1 φ_2    (9.7)
in the domain t ∈ [0, 10], with initial conditions φ1 (0) = 0, φ2 (0) = 1, φ3 (0) = 1. The
intended learning outcome for this example will be to simply observe the application
of the fourth order Adams-Bashforth method to solve a system of ODEs.
One of the important features of the Adams-Bashforth methods is that we need
to keep track of the right hand side of our system of ODEs at multiple time steps.
There are two ways in which we could do this; either we store the gradient at multiple
time steps, or we reevaluate it at each time step. We will use the former approach
and as such will create an array f which will store the right hand side throughout
the simulation. So our algorithm might look something like:
for l = 1:N_t-1
t(l+1) = t(l) + Delta_t;
f(:,l) = [ phi(2,l)*phi(3,l); - phi(1,l)*phi(3,l); -0.5*phi(1,l)*phi(2,l) ];
phi(:,l+1) = phi(:,l) + Delta_t/24*(55*f(:,l) - 59*f(:,l-1) + 37*f(:,l-2) - 9*f(:,l-3));
end
in Matlab and:
for(l=0; l<N_t-1; l++)
{
t[l+1] = t[l] + Delta_t;
f[l][0] = phi[l][1]*phi[l][2];
f[l][1] = - phi[l][0]*phi[l][2];
f[l][2] = -0.5*phi[l][0]*phi[l][1];
for(e=0; e<N_e; e++)
{
phi[l+1][e] = phi[l][e] ...
+ Delta_t/24*(55*f[l][e] - 59*f[l-1][e] + 37*f[l-2][e] - 9*f[l-3][e]);
}
}
in C++. One very important point to consider with multistep methods is what to
do on the first time step, and for a method of order N , what to do on the first N
time steps. The reason being that with a fourth order Adams-Bashforth method,
at time step l = 1, the algorithm requires values of f going back l − 3 time steps,
which is obviously a negative value. A common way to deal with this problem is
to use a different method until there are enough previous data points to be able to
use the method. For example we could use an explicit Euler method for the first 4
time steps, or perhaps just the first time step, then a second order Adams-Bashforth
method for the next two time steps. This is in fact what we will do and so our overall
algorithm will look like:
for l = 1:N_t-1
t(l+1) = t(l) + Delta_t;
f(:,l) = [ phi(2,l)*phi(3,l); - phi(1,l)*phi(3,l); -0.5*phi(1,l)*phi(2,l) ];
if l>3
phi(:,l+1) = phi(:,l) + Delta_t/24*(55*f(:,l) - 59*f(:,l-1) + 37*f(:,l-2) - 9*f(:,l-3));
elseif l>1
phi(:,l+1) = phi(:,l) + Delta_t/2 *(3 *f(:,l) - f(:,l-1));
else
phi(:,l+1) = phi(:,l) + Delta_t * f(:,l);
end
end
in Matlab and:
for(l=0; l<N_t-1; l++)
{
t[l+1] = t[l] + Delta_t;
f[l][0] = phi[l][1]*phi[l][2];
f[l][1] = - phi[l][0]*phi[l][2];
f[l][2] = -0.5*phi[l][0]*phi[l][1];
if (l>3)
{
for(e=0; e<N_e; e++)
{
phi[l+1][e] = phi[l][e] ...
+ Delta_t/24*(55*f[l][e] - 59*f[l-1][e] + 37*f[l-2][e] - 9*f[l-3][e]);
}
}
else if (l>1)
{
for(e=0; e<N_e; e++)
{
phi[l+1][e] = phi[l][e] ...
+ Delta_t/2 *(3 *f[l][e] - f[l-1][e]);
}
}
else
{
for(e=0; e<N_e; e++)
{
phi[l+1][e] = phi[l][e] + Delta_t * f[l][e];
}
}
}
in C++.
The complete programs are given in Example9_1.m and Example9_1.cpp. Figure
9.2 shows the numerical solution to Equation 9.7 obtained using the fourth order
Adams-Bashforth method. It can be observed that in comparison to the solution
obtained using the implicit Euler method in Example 6.4 (Figure 6.9) the Adams-
Bashforth solution is more accurate, since the amplitude of the oscillations remains
constant throughout the simulation.
Figure 9.2: Solution to the ODE system in Example 9.1 with ∆t = 0.1.
In order to analyze the stability of the second order Adams-Bashforth method we apply it to the model problem and assume φ^l = φ^0 σ^l as before, which gives the amplification factors:
σ = ( 1 + (3/2)λ∆t ± √( 1 + λ∆t + (9/4)(λ∆t)² ) ) / 2
Using a power series expansion [54] to approximate the term in the square root:
√( 1 + λ∆t + (9/4)(λ∆t)² ) = 1 + (1/2)( λ∆t + (9/4)(λ∆t)² ) − (1/8)( λ∆t + (9/4)(λ∆t)² )² + . . .
σ_1 = 1 + λ∆t + (1/2)(λ∆t)² + . . .
σ_2 = (1/2)λ∆t − (1/2)(λ∆t)² + . . .
σ = ( e^{Niθ} − e^{(N−1)iθ} ) / Σ_{n=0}^{N} b_n e^{i(N−n)θ}
σ = ( e^{2iθ} − e^{iθ} ) / ( (3/2) e^{2iθ} − (1/2) e^{iθ} )    (9.8)
we can then evaluate Equation 9.8 in the range 0 ≤ θ ≤ 2π and plot σ as a function
of θ. Figure 9.3 illustrates the stability region obtained using this approach for
the first four Adams-Bashforth methods and it should be noted that the region of
stability lies inside of the boundary. It can be observed that the stability regions
get smaller with increasing order accuracy and cross the real axis increasingly closer
to the origin. The second order Adams-Bashforth method is also only tangent to
the imaginary axis and thus, strictly speaking, it is unstable for purely imaginary λ, but it turns
out that the instability is very mild.
In order to perform the error analysis we as always consider the case where λ is
purely imaginary and get the amplification factor into polar form as:
Figure 9.3: The stability diagrams for the Adams-Bashforth methods of various
orders, including the first order (black line), second order (red line), third order
(green line), and fourth order (blue line).
σ = 1 + iλ_Im ∆t + (iλ_Im ∆t)²/2
  = Z e^{iθ}
where:
Z = √( ( 1 − (λ_Im ∆t)²/2 )² + (∆tλ_Im)² )
  = √( 1 + (λ_Im ∆t)⁴/4 )
and:
θ = tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) )
Using the same approach as before, we can show that the phase error is given by:
θ − λ_Im ∆t = tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) ) − λ_Im ∆t
            = λ_Im ∆t + (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . . − λ_Im ∆t
            = (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . .
So it can be observed that the phase error associated with the second order Adams-
Bashforth method is the same as for the Leapfrog method (at least approximately)
but it also has an amplitude error associated with it. Figures 9.4(a) and 9.4(b)
illustrate the solution to the model problem after a number of time steps for the
second and fourth order Adams-Bashforth methods respectively. As can be observed,
for the second order method the numerical solution is behind the exact solution and
has developed an observable amplitude error, whereas for the fourth order method
this error is not as pronounced.
Figure 9.4: The solution to the model problem with the Adams-Bashforth Method
after a number of cycles, when λ = i for (a) the second order and (b) the fourth order
Adams-Bashforth Methods. The green curves illustrate the analytical solution and
the blue curves illustrate the numerical solution, in particular highlighting the phase
and amplitude error.
Chapter 10
Runge-Kutta methods
Figure 10.1: A schematic illustrating one step in the second order Runge-Kutta
method. The green arrows illustrate the gradients f (φl , tl ) and f (φl + k1 ∆t, tl + ∆t)
which are both used to step the solution forward. The pink line illustrates the computed
step to φl+1 . The blue line illustrates the analytical solution.
In contrast to the multistep methods studied in the last two chapters, Runge-
Kutta methods fall under the category of multistage methods and are probably the
most popular methods for solving initial value problems. However, many variations of
the Runge-Kutta methods exist of varying orders of accuracy. The basic idea behind
the method is that the order of accuracy can be increased if one supplies additional
information about the function f . Runge-Kutta methods introduce multiple stages
between tl and tl+1 , and evaluate f at these intermediate stages (Figure 10.1). The
additional function evaluations, of course, result in higher cost per time step, but
the accuracy is increased, and as it turns out, better stability properties are also
obtained. While there are implicit Runge-Kutta methods, the variations that we will
be dealing with (which are the more common variation) are fully explicit methods.
In general, Runge-Kutta methods approximate the solution to Equation 5.1 as:
φ^{l+1} = φ^l + ∆t g    (10.1)
where g is a weighted combination of gradient estimates:
g = a_1 k_1 + a_2 k_2 + a_3 k_3 + a_4 k_4 + · · · + a_N k_N    (10.2)
where:
k1 = f (φl , tl )
k2 = f (φl + q11 ∆tk1 , tl + p1 ∆t)
k3 = f (φl + q21 ∆tk1 + q22 ∆tk2 , tl + p2 ∆t)
k4 = f (φl + q31 ∆tk1 + q32 ∆tk2 + q33 ∆tk3 , tl + p3 ∆t)
..
.
kN = f (φl + qN −1,1 ∆tk1 + qN −1,2 ∆tk2 + · · · + qN −1,N −1 ∆tkN −1 , tl + pN −1 ∆t)
These k values are actually estimates of the gradient at the different intermediate
points. For N = 1, we get the first order Runge-Kutta method, which is in fact
equivalent to the explicit Euler method presented in Chapter 6.
We are going to derive two methods of different orders. In order to do this, we
are going to need to make use of a couple of formulae. The first is the Taylor series
expansion:
φ(t^{l+1}) = φ(t^l) + (∆t/1!) dφ/dt|_{t^l} + (∆t²/2!) d²φ/dt²|_{t^l} + O(∆t³)
The second is the total derivative of f with respect to t:
Df/Dt = (∂f/∂t)(dt/dt) + (∂f/∂φ)(dφ/dt)
     = ∂f/∂t + f ∂f/∂φ
which takes into account that a change in t can cause a change in φ. Substituting
this into the Taylor series expansion we get:
φ(t^{l+1}) = φ(t^l) + (∆t/1!) f(φ^l, t^l) + (∆t²/2!) Df/Dt|_{t^l} + O(∆t³)    (10.3)
We will also make use of the two-dimensional Taylor series expansion:
f(x + ∆x, y + ∆y) = f(x, y) + ∆x ∂f/∂x + ∆y ∂f/∂y + O(∆²)
Note that the use of variables x and y in place of φ and t was deliberate and done so
in order to make the derivations more clear when the step sizes are different for the
various stages of a Runge-Kutta method. Ultimately our derivations will come down
to developing two equations using these tools and equating coefficients to produce
a specific Runge-Kutta method.
g = a1 k1 + a2 k2
k1 = f (φl , tl )
k2 = f (φl + ∆tk1 , tl + ∆t)
At this point we then apply our two-dimensional Taylor series expansion for k_2,
noting that x ≡ φ, ∆x ≡ ∆t k_1 = ∆t f(φ^l, t^l), y ≡ t, and ∆y ≡ ∆t:
k_2 = f(φ^l, t^l) + ∆t k_1 ∂f/∂φ|_{t^l} + ∆t ∂f/∂t|_{t^l} + O(∆t²)
    = f(φ^l, t^l) + ∆t f ∂f/∂φ|_{t^l} + ∆t ∂f/∂t|_{t^l} + O(∆t²)
    = f(φ^l, t^l) + ∆t Df/Dt|_{t^l} + O(∆t²)
Then substituting into Equation 10.1 gives:
φ^{l+1} = φ^l + ∆t ( a_1 k_1 + a_2 k_2 )
       = φ^l + ∆t [ a_1 f(φ^l, t^l) + a_2 ( f(φ^l, t^l) + ∆t Df/Dt|_{t^l} ) ]
       = φ^l + ∆t (a_1 + a_2) f(φ^l, t^l) + a_2 ∆t² Df/Dt|_{t^l}    (10.4)
If we compare Equation 10.4 to the Taylor series expansion back in Equation 10.3
up to second order:
φ(t^{l+1}) = φ(t^l) + (∆t/1!) f(φ^l, t^l) + (∆t²/2!) Df/Dt|_{t^l} + O(∆t³)
and equate coefficients, we get the system of equations:
a_2 = 1/2
a_1 + a_2 = 1
One possible choice, generally referred to as the second order Runge-Kutta method, is therefore:
φ^{l+1} = φ^l + ∆t ( (1/2) k_1 + (1/2) k_2 )    (10.5)
where:
k_1 = f(φ^l, t^l)
k_2 = f(φ^l + ∆t k_1, t^l + ∆t)
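As a minimal sketch (assuming a right hand side function f(phi, t) and arrays phi, t, together with values of Delta_t and N_t set up as in the earlier examples), the time marching loop for this method is simply:

for l = 1:N_t-1
    t(l+1) = t(l) + Delta_t;
    k1 = f(phi(:,l), t(l));                         % gradient at the current point
    k2 = f(phi(:,l) + Delta_t*k1, t(l) + Delta_t);  % gradient at the predicted end point
    phi(:,l+1) = phi(:,l) + Delta_t*(k1/2 + k2/2);  % Equation 10.5
end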
For the fourth order Runge-Kutta method we take four stages and write:
g = a_1 k_1 + a_2 k_2 + a_3 k_3 + a_4 k_4
and furthermore choose q_11 = q_22 = 1/2, q_33 = 1, q_21 = q_31 = q_32 = 0, p_1 = p_2 = 1/2,
and p_3 = 1 so that we get:
k_1 = f(φ^l, t^l)
k_2 = f(φ^l + (∆t/2) k_1, t^l + ∆t/2)
k_3 = f(φ^l + (∆t/2) k_2, t^l + ∆t/2)
k_4 = f(φ^l + ∆t k_3, t^l + ∆t)
At this point we then apply our two-dimensional Taylor series expansion for k_2,
noting that x ≡ φ, ∆x ≡ (∆t/2) k_1 = (∆t/2) f(φ^l, t^l), y ≡ t, and ∆y ≡ ∆t/2:
k_2 = f(φ^l + (∆t/2) k_1, t^l + ∆t/2)
    = f(φ^l, t^l) + (∆t/2) k_1 ∂f/∂φ|_{t^l} + (∆t/2) ∂f/∂t|_{t^l} + O(∆t²)
    = f(φ^l, t^l) + (∆t/2) f ∂f/∂φ|_{t^l} + (∆t/2) ∂f/∂t|_{t^l} + O(∆t²)
    = f(φ^l, t^l) + (∆t/2) Df/Dt|_{t^l} + O(∆t²)
k_3 = f(φ^l + (∆t/2) k_2, t^l + ∆t/2)
    = f(φ^l, t^l) + (∆t/2) (D/Dt)[ f(φ^l, t^l) + (∆t/2) Df/Dt ]_{t^l} + O(∆t²)
k_4 = f(φ^l + ∆t k_3, t^l + ∆t)
    = f(φ^l, t^l) + ∆t (D/Dt)[ f(φ^l, t^l) + (∆t/2) (D/Dt)( f(φ^l, t^l) + (∆t/2) Df/Dt ) ]_{t^l} + O(∆t²)
φ^{l+1} = φ^l + ∆t ( a_1 k_1 + a_2 k_2 + a_3 k_3 + a_4 k_4 )
= φ^l + ∆t [ a_1 f(φ^l, t^l)
           + a_2 ( f(φ^l, t^l) + (∆t/2) Df/Dt|_{t^l} )
           + a_3 ( f(φ^l, t^l) + (∆t/2) (D/Dt)( f(φ^l, t^l) + (∆t/2) Df/Dt )|_{t^l} )
           + a_4 ( f(φ^l, t^l) + ∆t (D/Dt)( f(φ^l, t^l) + (∆t/2) (D/Dt)( f(φ^l, t^l) + (∆t/2) Df/Dt ) )|_{t^l} ) ]
= φ^l + (a_1 + a_2 + a_3 + a_4) ∆t f(φ^l, t^l) + ( (1/2) a_2 + (1/2) a_3 + a_4 ) ∆t² Df/Dt|_{t^l}
      + ( (1/4) a_3 + (1/2) a_4 ) ∆t³ D²f/Dt²|_{t^l} + (1/4) a_4 ∆t⁴ D³f/Dt³|_{t^l} + O(∆t⁵)    (10.6)
Comparing Equation 10.6 with the Taylor series expansion and equating coefficients gives:
a_1 + a_2 + a_3 + a_4 = 1
(1/2) a_2 + (1/2) a_3 + a_4 = 1/2
(1/4) a_3 + (1/2) a_4 = 1/6
(1/4) a_4 = 1/24
One possible choice, generally referred to as the fourth order Runge-Kutta method, is therefore:
φ^{l+1} = φ^l + ∆t ( (1/6) k_1 + (1/3) k_2 + (1/3) k_3 + (1/6) k_4 )    (10.7)
where:
k_1 = f(φ^l, t^l)
k_2 = f(φ^l + (∆t/2) k_1, t^l + ∆t/2)
k_3 = f(φ^l + (∆t/2) k_2, t^l + ∆t/2)
k_4 = f(φ^l + ∆t k_3, t^l + ∆t)
Example 10.1:
In this example we will develop both a Matlab and a C++ program to solve the
example system:
dφ_1/dt = φ_2 φ_3
dφ_2/dt = −φ_1 φ_3
dφ_3/dt = −0.5 φ_1 φ_2    (10.8)
in the domain t ∈ [0, 10], with initial conditions φ1 (0) = 0, φ2 (0) = 1, φ3 (0) = 1.
The intended learning outcomes for this example will be to observe the application
of the fourth order Runge-Kutta method to solve a system of ODEs and to see how
to implement a function call for evaluating the right hand side of the system in both
Matlab and C++.
In order to begin we are first going to develop a method by which to evaluate the
k1 , k2 , k3 , k4 values (which will be 3 × 1 column vectors since we are dealing with
a system) at each time step and we will do so by creating a function f which we
can call repeatedly throughout the simulation, changing only the input arguments.
Doing so will keep the code shorter, more elegant, and easier to understand; always
a good thing. The function will encapsulate the right hand side of Equation 10.8.
We can do this in Matlab as:
function k = f(phi)
% right hand side of Equation 10.8 (the system is autonomous, so there is
% no explicit dependence on t)
k = zeros(size(phi));
k(1) = phi(2)*phi(3);
k(2) = -phi(1)*phi(3);
k(3) = -0.5*phi(1)*phi(2);
end
A similar function can be written in C++, but to call it within our time marching
loop we would need to do something like:
for(e=0; e<N_e; e++)
{
tempPhi[e] = phi[l][e] + Delta_t/2*k1[e];
}
f(k2, tempPhi);
The reason being that we can’t simply add arrays together ‘on the fly’ as we can in
Matlab. We can then explicitly update the system at each time step by:
φ^{l+1} = φ^l + ∆t ( (1/6) k_1 + (1/3) k_2 + (1/3) k_3 + (1/6) k_4 )
which would look something like:
for l = 1:N_t-1
t(l+1) = t(l) + Delta_t;
k1 = f(phi(:,l));
k2 = f(phi(:,l) + Delta_t/2*k1);
k3 = f(phi(:,l) + Delta_t/2*k2);
k4 = f(phi(:,l) + Delta_t *k3);
phi(:,l+1) = phi(:,l) + Delta_t *(k1/6 + k2/3 + k3/3 + k4/6);
end
in Matlab and:
for(l=0; l<N_t-1; l++)
{
t[l+1] = t[l] + Delta_t;
f(k1, phi[l]);
for(e=0; e<N_e; e++)
{
tempPhi[e] = phi[l][e] + Delta_t/2*k1[e];
}
f(k2, tempPhi);
for(e=0; e<N_e; e++)
{
tempPhi[e] = phi[l][e] + Delta_t/2*k2[e];
}
f(k3, tempPhi);
for(e=0; e<N_e; e++)
{
tempPhi[e] = phi[l][e] + Delta_t *k3[e];
}
f(k4, tempPhi);
for(e=0; e<N_e; e++)
{
phi[l+1][e] = phi[l][e] + Delta_t*(k1[e]/6 + k2[e]/3 + k3[e]/3 + k4[e]/6);
}
}
Figure 10.2: Solution to the ODE system in Example 10.1 with ∆t = 0.1.
In order to perform a stability analysis for the second order Runge-Kutta method
we begin by applying Equation 10.5 to the model equation of Equation 6.4 giving:
φ^{l+1} = φ^l + (∆t/2)( k_1 + k_2 )
       = φ^l + (∆t/2)( f(t^l, φ^l) + f(t^l + ∆t, φ^l + k_1 ∆t) )
       = φ^l + (∆t/2)( λφ^l + λ(φ^l + (λφ^l)∆t) )
       = φ^l + (∆t/2)( λφ^l + (λ + ∆tλ²)φ^l )
       = φ^l ( 1 + λ∆t + (λ∆t)²/2 )
so that:
φ^l = φ^0 ( 1 + λ∆t + (λ∆t)²/2 )^l
    = φ^0 σ^l    (10.9)
Considering, as before, the case where λ is purely imaginary:
σ = 1 + iλ_Im ∆t + (iλ_Im ∆t)²/2
  = Z e^{iθ}
where:
Figure 10.3: The stability diagrams for the Runge-Kutta methods of various orders,
including the first order (black line), second order (red line), third order (green line),
and fourth order (blue line).
Z = √( ( 1 − (λ_Im ∆t)²/2 )² + (∆tλ_Im)² )
  = √( 1 + (λ_Im ∆t)⁴/4 )
and:
θ = tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) )
Here we will use the power series expansion for the tan−1 function as we did previ-
ously, but we will also use the power series expansion:
1/(1 − α) = 1 + α + α² + α³ + α⁴ + . . .
such that the phase error can be given as:
tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) ) = tan⁻¹( λ_Im ∆t ( 1 / (1 − (λ_Im ∆t)²/2) ) )
= tan⁻¹( λ_Im ∆t ( 1 + (λ_Im ∆t)²/2 + ((λ_Im ∆t)²/2)² + ((λ_Im ∆t)²/2)³ + . . . ) )
= tan⁻¹( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )
= ( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )
  − (1/3)( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )³
  + (1/5)( λ_Im ∆t + (λ_Im ∆t)³/2 + (λ_Im ∆t)⁵/4 + (λ_Im ∆t)⁷/8 + . . . )⁵ + . . .
≈ λ_Im ∆t + (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . .
Using the same approach that we did for the Euler and Crank-Nicolson methods,
we can show that the phase error is given by:
θ − λ_Im ∆t = tan⁻¹( λ_Im ∆t / (1 − (λ_Im ∆t)²/2) ) − λ_Im ∆t
            = λ_Im ∆t + (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . . − λ_Im ∆t
            = (λ_Im ∆t)³/6 − (λ_Im ∆t)⁵/20 − (λ_Im ∆t)⁷/56 + . . .
So it can be observed that the phase error associated with the second order Runge-
Kutta method is of the same order as for the Leapfrog method, with a leading
order term proportional to (λ_Im ∆t)³ which will result in a phase lag. Figures 10.4(a)
and 10.4(b) illustrate the solution to the model problem after a number of time
steps for the second and fourth order Runge-Kutta methods respectively. As can be
observed, for the second order method the numerical solution is behind the exact
solution and has developed an observable amplitude error, whereas the fourth order
method has not.
Figure 10.4: The solution to the model problem with the Runge-Kutta Method after
a number of cycles, when λ = i for (a) the second order and (b) the fourth order
Runge-Kutta Methods. The green curves illustrate the analytical solution and the
blue curves illustrate the numerical solution, in particular highlighting the phase
and amplitude error.
numerical method under control. In order to see how this can be done, recall that
the N th order Runge-Kutta scheme can be written as:
φ^{l+1} = φ^l + ∆t ψ(φ^l, t^l)
where ψ denotes the weighted combination of the k values, so that the local truncation error is:
e^{l+1}_local(∆t) = ( φ(t^{l+1}) − φ(t^l) )/∆t − ψ(φ(t^l), t^l)
                = ( φ(t^{l+1}) − φ^l )/∆t − ψ(φ(t^l), t^l)
                = ( φ(t^{l+1}) − ( φ^l + ∆t ψ(φ(t^l), t^l) ) )/∆t
Note that e^{l+1}_local(∆t) is O(∆t^N). Since the Runge-Kutta method requires that the
numerical value of ψ = g, we can continue the above as:
e^{l+1}_local(∆t) = (1/∆t)( φ(t^{l+1}) − φ^{l+1} )
                 = (1/∆t)( φ(t^{l+1}) − φ̂^{l+1} + φ̂^{l+1} − φ^{l+1} )
                 = ê^{l+1}_local(∆t) + (1/∆t)( φ̂^{l+1} − φ^{l+1} )
where φ̂^{l+1} denotes a higher order estimate of the solution. Recall that e^{l+1}_local(∆t) is O(∆t^N) and ê^{l+1}_local(∆t) is O(∆t^{N+1}). Thus, if ∆t is small,
e^{l+1}_local(∆t) can be simply approximated as:
e^{l+1}_local(∆t) ≈ (1/∆t)( φ̂^{l+1} − φ^{l+1} )    (10.16)
Let's now see how this information can be used to control the local truncation error.
Since e^{l+1}_local(∆t) is O(∆t^N), we can write:
e^{l+1}_local(∆t) = α∆t^N
If we increase or decrease the time step ∆t by a factor of β, then the local truncation
error would be:
e^{l+1}_local(β∆t) = α(β∆t)^N = β^N α∆t^N = β^N e^{l+1}_local(∆t) ≈ (β^N/∆t)( φ̂^{l+1} − φ^{l+1} )
using Equation 10.16. Thus if we want to bound the local truncation error to a small
value ε, then:
β ≤ ( ε∆t / |φ̂^{l+1} − φ^{l+1}| )^{1/N}
In practice, one usually sets:
β = ( ε∆t / ( 2|φ̂^{l+1} − φ^{l+1}| ) )^{1/N}
One popular method to implement the above algorithm is called the Runge-
Kutta-Fehlberg method. In this method φl+1 and φ̂l+1 are approximated as:
φ^{l+1} = φ^l + ∆t ( (25/216) k_1 + (1408/2565) k_3 + (2197/4104) k_4 − (1/5) k_5 )    (10.17)
φ̂^{l+1} = φ̂^l + ∆t ( (16/135) k_1 + (6656/12825) k_3 + (28561/56430) k_4 − (9/50) k_5 + (2/55) k_6 )    (10.18)
where:
k_1 = f( φ^l, t^l )
k_2 = f( φ^l + (∆t/4) k_1, t^l + ∆t/4 )
k_3 = f( φ^l + (3∆t/32) k_1 + (9∆t/32) k_2, t^l + 3∆t/8 )
k_4 = f( φ^l + (1932∆t/2197) k_1 − (7200∆t/2197) k_2 + (7296∆t/2197) k_3, t^l + 12∆t/13 )
k_5 = f( φ^l + (439∆t/216) k_1 − 8∆t k_2 + (3680∆t/513) k_3 − (845∆t/4104) k_4, t^l + ∆t )
k_6 = f( φ^l − (8∆t/27) k_1 + 2∆t k_2 − (3544∆t/2565) k_3 + (1859∆t/4104) k_4 − (11∆t/40) k_5, t^l + ∆t/2 )
It can be shown that the global error associated with φl+1 is O(∆t4 ) and the global
error associated with φ̂l+1 is O(∆t5 ). So N = 4 and β is calculated as:
β = 0.84 ( ε∆t / |φ̂^{l+1} − φ^{l+1}| )^{1/4}    (10.19)
Recall that the error at time level l +1 is approximated as |φ̂l+1 −φl+1 | and assuming
that there is no error at time level l, i.e. φ(tl ) ≈ φl ≈ φ̂l . So using Equations 10.17
and 10.18, the error at time level l + 1 is approximated as:
e = |φ̂^{l+1} − φ^{l+1}|
  = | (1/360) k_1 − (128/4275) k_3 − (2197/75240) k_4 + (1/50) k_5 + (2/55) k_6 |
It should be noted that as with the other Runge-Kutta methods, the Runge-
Kutta-Fehlberg method can be extended to a system of equations by treating the k
terms as vectors. In this case however, our error will also be a vector and we must
use the maximum error in order to determine how to adjust the time step size.
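As a minimal sketch (assuming the k values are column vectors, as in Example 10.1), the scalar error used to adjust ∆t would then be computed as:

% error estimate for a system: take the worst component (the infinity norm)
e_vec = 1/360*k1 - 128/4275*k3 - 2197/75240*k4 + 1/50*k5 + 2/55*k6;
e     = max(abs(e_vec));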
Example 10.2:
In this example we will develop a Matlab program to solve the example system:
dφ/dt = ( −4π/(t + 1)² ) cos( 4π/(t + 1) )    (10.20)
in the domain t ∈ [0, 10], with initial condition φ(0) = 0, and compare the numer-
ical solution with the analytical solution φ(t) = sin( 4π/(t + 1) ). The intended learning
outcome for this example will be to simply observe the application of the Runge-
Kutta-Fehlberg method to solve an ODE.
Similar to the fourth order algorithm developed in Example 10.1 we will develop
a function f which we will call repeatedly as the algorithm progresses. This will
take the form:
function val = f(phi, t)
val = -4*pi/(t+1)^2*cos(4*pi/(t+1));
end
A major difference compared to other numerical methods studied thus far is that if
we have a variable time step size ∆t then we don’t know exactly how many time steps
we’ll need to march through the temporal domain. As such we’ll use a while loop
instead of the usual for loop for marching through time, which will look something
like:
while ~finished
k1 = f(phi(l), t(l));
k2 = f(phi(l) + 1/4*k1*Delta_t, t(l) + 1/4*Delta_t);
k3 = f(phi(l) + 3/32 *k1*Delta_t + 9/32*k2*Delta_t, t(l) + 3/8*Delta_t);
...
e = abs(1/360*k1 - 128/4275*k3 - 2197/75240*k4 + 1/50*k5 + 2/55*k6);
if e < tolerance
t(l+1) = t(l) + Delta_t;
phi(l+1) = phi(l) + Delta_t*(25/216*k1 + 1408/2565*k3 + 2197/4104*k4 - 1/5*k5);
l = l+1;
end
beta = 0.84*(tolerance/e)^(1/4);
...
end
where some of the k values have been omitted so that the code snippet can fit onto
the page. So at each iteration through the while loop we are computing the φl+1 ,
but if the error in the computed value is too large then we will adjust the time step
size and try again. Only when the error is less than the tolerance that we specify,
do we update φ^{l+1} and move on. In order to adjust ∆t, we could do something like:
while ~finished
...
beta = 0.84*(tolerance/e)^(1/4);
if beta < 0.1
Delta_t = 0.1*Delta_t
elseif beta > 4.0
Delta_t = 4.0*Delta_t;
else
Delta_t = beta*Delta_t;
end
if Delta_t > Delta_t_max
Delta_t = Delta_t_max;
end
if t(l) >= t_max
finished = 1;
elseif t(l)+Delta_t>t_max
Delta_t = t_max-t(l);
elseif Delta_t < Delta_t_min
finished = 1;
end
end
Figure 10.5: Solution to the ODE system in Example 10.2 showing (a) the numer-
ical and analytical solutions in blue and green respectively (b) the variation in ∆t
throughout the course of the simulation.
df/dt = (∂f/∂t)(dt/dt) + (∂f/∂φ)(dφ/dt)
      = ∂f/∂t + f ∂f/∂φ
φ(t^{l+1}) = φ(t^l)
+ ∆t f(φ^l, t^l)
+ (∆t²/2) [ ∂f/∂t + f ∂f/∂φ ]_{t^l}
+ (∆t³/6) [ ∂/∂t( ∂f/∂t + f ∂f/∂φ ) + f ∂/∂φ( ∂f/∂t + f ∂f/∂φ ) ]_{t^l}
+ (∆t⁴/24) [ ∂/∂t( ∂/∂t( ∂f/∂t + f ∂f/∂φ ) + f ∂/∂φ( ∂f/∂t + f ∂f/∂φ ) ) + f ∂/∂φ( ∂/∂t( ∂f/∂t + f ∂f/∂φ ) + f ∂/∂φ( ∂f/∂t + f ∂f/∂φ ) ) ]_{t^l} + . . .
φl+1 = φl + ∆t (a1 k1 + a2 k2 )
where:
k1 = f (φl , tl )
k2 = f (φl + ∆tk1 , tl + ∆t)
Since the two dimensional Taylor series expansion of f(φ + ∆φ, t + ∆t) can be written
as:
f(φ + ∆φ, t + ∆t) = f(φ, t) + ∆φ ∂f/∂φ + ∆t ∂f/∂t + O(∆²)
we have:
φ^{l+1} = φ^l + ∆t ( a_1 k_1 + a_2 k_2 )
       = φ^l + ∆t [ a_1 f(φ^l, t^l) + a_2 ( f(φ^l, t^l) + ∆t f ∂f/∂φ + ∆t ∂f/∂t )|_{t^l} ]
       = φ^l + ∆t (a_1 + a_2) f(φ^l, t^l) + ∆t² a_2 ( f ∂f/∂φ + ∂f/∂t )|_{t^l}
Comparing with the coefficients from the first Taylor series expansion we get a_2 = 1/2, a_1 = 1/2.
φl+1 = φl + ∆t (a1 k1 + a2 k2 + a3 k3 + a4 k4 )
where:
k_1 = f(φ^l, t^l)
k_2 = f(φ^l + (∆t/2) k_1, t^l + ∆t/2)
k_3 = f(φ^l + (∆t/2) k_2, t^l + ∆t/2)
k_4 = f(φ^l + ∆t k_3, t^l + ∆t)
We follow the same approach as before, first substituting the two dimensional Taylor series expansion for k_2:
k_2 = f(φ^l, t^l)
+ (1/1!)(∆t/2) [ k_1 ∂f/∂φ + ∂f/∂t ]_{t^l}
+ (1/2!)(∆t/2)² [ k_1² ∂²f/∂φ² + 2k_1 ∂²f/∂φ∂t + ∂²f/∂t² ]_{t^l}
+ (1/3!)(∆t/2)³ [ k_1³ ∂³f/∂φ³ + 3k_1² ∂³f/∂φ²∂t + 3k_1 ∂³f/∂φ∂t² + ∂³f/∂t³ ]_{t^l}
+ (1/4!)(∆t/2)⁴ [ k_1⁴ ∂⁴f/∂φ⁴ + 4k_1³ ∂⁴f/∂φ³∂t + 6k_1² ∂⁴f/∂φ²∂t² + 4k_1 ∂⁴f/∂φ∂t³ + ∂⁴f/∂t⁴ ]_{t^l}
+ O(∆t)⁵
k_3 = f(φ^l, t^l)
+ (1/1!)(∆t/2) [ k_2 ∂f/∂φ + ∂f/∂t ]_{t^l}
+ (1/2!)(∆t/2)² [ k_2² ∂²f/∂φ² + 2k_2 ∂²f/∂φ∂t + ∂²f/∂t² ]_{t^l}
+ (1/3!)(∆t/2)³ [ k_2³ ∂³f/∂φ³ + 3k_2² ∂³f/∂φ²∂t + 3k_2 ∂³f/∂φ∂t² + ∂³f/∂t³ ]_{t^l}
+ (1/4!)(∆t/2)⁴ [ k_2⁴ ∂⁴f/∂φ⁴ + 4k_2³ ∂⁴f/∂φ³∂t + 6k_2² ∂⁴f/∂φ²∂t² + 4k_2 ∂⁴f/∂φ∂t³ + ∂⁴f/∂t⁴ ]_{t^l}
+ O(∆t)⁵
k_4 = f(φ^l, t^l)
+ (1/1!)(∆t) [ k_3 ∂f/∂φ + ∂f/∂t ]_{t^l}
+ (1/2!)(∆t)² [ k_3² ∂²f/∂φ² + 2k_3 ∂²f/∂φ∂t + ∂²f/∂t² ]_{t^l}
+ (1/3!)(∆t)³ [ k_3³ ∂³f/∂φ³ + 3k_3² ∂³f/∂φ²∂t + 3k_3 ∂³f/∂φ∂t² + ∂³f/∂t³ ]_{t^l}
+ (1/4!)(∆t)⁴ [ k_3⁴ ∂⁴f/∂φ⁴ + 4k_3³ ∂⁴f/∂φ³∂t + 6k_3² ∂⁴f/∂φ²∂t² + 4k_3 ∂⁴f/∂φ∂t³ + ∂⁴f/∂t⁴ ]_{t^l}
+ O(∆t)⁵
Chapter 11
Shooting Methods
The methods introduced thus far can only be used to solve initial value problems,
meaning that all the information you are given is at time t = tmin , and you are asked
to predict the solution up to a later point in time, say t = tmax . What if you are
given some information at t = tmin and some of the information at t = tmax ? These
kinds of problems are called Boundary Value Problems. There are two techniques for
solving boundary value problems, Shooting Methods, which use standard methods for
initial value problems such as Euler methods, Runge-Kutta methods, etc, and Direct
Methods, which are based on straightforward discretisation of the derivatives in the
differential equation and solving the resulting system of algebraic equations. As it
happens we will cover direct methods in the context of solving partial differential
equations later in the course, so for now we will only focus on how we implement
a shooting method.
The basic idea behind a shooting method is to guess the missing information at
t = t_min in such a way that the required conditions at t = t_max are satisfied. To illustrate by way of
example consider a system of 3 ODEs (where the generalisation to a system of M
ODE’s is straightforward):
dφ_1/dt = f_1(φ_1, φ_2, φ_3)
dφ_2/dt = f_2(φ_1, φ_2, φ_3)
dφ_3/dt = f_3(φ_1, φ_2, φ_3)    (11.1)
given the following conditions φ1 (tmin ) = φ1,min , φ2 (tmin ) = φ2,min , and φ3 (tmax ) =
φ3,max . You are asked to find φ1 (t), φ2 (t), and φ3 (t). It is important to note that
φ3 (tmin ) has not been defined, so for the purpose of the following discussion, let
φ3 (tmin ) = α. The idea behind the shooting methods is that because φ3 (tmin ) is not
defined, we are free to choose any value for α that will give us φ3 (tmax ) = φ3,max .
But we do not know beforehand what value of α will achieve this. So we need to iterate
through different values of α until the computed value of φ_3(t_max) equals φ_3,max.
Using any numerical method for solving ODEs, Equation 11.1 can be solved with
the following initial conditions φ1 (tmin ) = φ1,min , φ2 (tmin ) = φ2,min , and φ3 (tmin ) =
α. Let's say we take N_t steps of size ∆t to approach t = t_max and get the approximate
value of φ_{3,k=1}^{N_t}, where k denotes an iteration. The value of φ_{3,k=1}^{N_t} is dependent on
the value of α. Since we are only guessing the value of α, it is very likely that
φ_{3,k=1}^{N_t} − φ_{3,max} ≠ 0. For the following analysis, it will be convenient to define the
function:
g(α) = φ_{3,k}^{N_t} − φ_{3,max}
In order for the numerical solution to satisfy the original boundary conditions, we
must ensure that:
g(α) = 0
Thus this becomes a root finding problem, i.e. we have to iteratively find the value
of α such that g(α) = 0. Each value α will give you a numerical solution. Only
the numerical solution computed with the value of α that ensures that g(α) = 0 is
the correct solution to the original problem. Since the problem has been recast as a
root finding problem, the Secant formula is usually used to provide a better guess
value of α:
α_{k+1} = α_k − ( α_k − α_{k−1} ) g(α_k) / ( g(α_k) − g(α_{k−1}) )    (11.2)
Example 11.1:
Using a shooting method with an explicit Euler method for the time integration
and the secant formula, write a program in Matlab to solve the second order ODE:
d²φ/dt² = −2    (11.3)
in the domain t ∈ [0, 1], with boundary conditions φ(0) = 0 and φ(1) = 0. The in-
tended learning outcome for this example will be to simply observe the application
of a shooting method to solve an ODE with boundary values specified.
In order to begin, we will need to break this second order ODE into a system of
first order ODEs:
dφ_1/dt = φ_2
dφ_2/dt = −2    (11.4)
We are given that φ_1(0) = φ(0) = 0. However we do not know the value of φ_2(0) =
φ̇(0); instead we are given the condition at the other end of the domain, i.e. φ_1(1) = 0. So we are free to choose φ_2(0) = α.
The numerical solution computed using different values of α will give different
values of φ_1(t_max). Also note that while we are using the explicit Euler method to
compute the numerical solution, any method could be chosen. We would like to
pick only the numerical solution that gives us φ_1^{N_t} = 0. Let's define a function:
g(α) = φ_1^{N_t}(α) − φ_1(1)
So our task now is to find the value of α such that g(α) = 0. Let's just guess α_0 = 0
and α_1 = 2. For these values of α, the numerical solution of Equation 11.4 can be
computed, giving g(α_0) = −0.9500 and g(α_1) = 1.0500. One can then use Equation
11.2 to compute α2 . This process is then repeated until g(αk ) ≈ 0. The Matlab
algorithm will look something like:
while abs(g(k)) > tolerance
alpha(k+1) = alpha(k) - (alpha(k) - alpha(k-1))*g(k)/(g(k) - g(k-1));
g(k+1) = solve(alpha(k+1));
k = k + 1;
end
where we have defined a function solve to perform the time marching at each
iteration, given the initial condition for φ_2.
The complete program is given in Example11_1.m. Figures 11.1(a) and 11.1(b)
show the numerical solution to Equation 11.3 obtained using the shooting method.
It can be observed that the α_k are chosen according to Equation 11.2, forcing the
numerical solution of φ at t = 1 towards the required boundary value. Note that the numerical solutions
computed using α_5 and α_6 are not distinguishable on the scale of the diagram.
Figure 11.1: Solution to the ODE system in Example 11.1 with ∆t = 0.01 showing
(a) the solution for φ1 (t) at each iteration computed using the explicit Euler method
(b) the converged solution for the system.
Part III
Chapter 12
Introduction
In this part of the book we are going to investigate a number of different
families of numerical methods for solving Partial Differential Equations (PDEs).
These contrast to ordinary differential equations in that the dependent variable
is some function of multiple independent variables (i.e. a multivariate function),
rather than just one. To elaborate on this point, the ODEs that we studied in
Part II involved derivatives of one or more dependent variables with respect to time.
The partial differential equations encountered in many common applications may
however involve derivatives with respect to time, and/or derivatives with respect to
one or more spatial dimensions. The most general form of a PDE can be given as:
f( ∂²φ/∂x² , ∂²φ/∂x∂t , ∂²φ/∂t² , ∂φ/∂x , ∂φ/∂t , φ, x, t ) = 0
where for brevity, we have only shown two independent variables x and t, and
derivatives up to second order, but of course both can be extended to any number.
As with ODEs we will again use φ to denote our dependent variable and t to denote
the ordinate of time; but now, in addition we are using x to denote a spatial ordinate.
It will often be helpful if we try and think of φ as representing a field of some
sort. Although the focus in this part of the book is more on the numerics than
the physics of what the PDE is describing, it will become apparent in Part V that
when applied to a particular problem, φ will represent some physical quantity such
as a velocity field, a temperature field, a stress or strain field, perhaps even an
electromagnetic field. The point is that it’s representing some quantity that varies
continuously over a region of space and time. The ‘continuously’ bit of the last statement is quite important because it's the continuity of these fields which allows us to apply calculus in the first place. An important point to note is that the examples just mentioned include a scalar field, a vector field, and a
tensor field. To elaborate on this idea, a quantity like temperature can be described
by a single number; we’re all used to this from years of checking the weather. Of
course we can’t really describe the temperature variation throughout the atmosphere
by a single number because the temperature varies over a city, or a country, over the
entire planet in fact. The important point is that it’s a continuous function of space
and time and so there are infinitely many of these points that we could arbitrarily
choose; but having chosen a point in space and a moment in time, we only need one
number to define a temperature. In contrast, a quantity like velocity needs a few
numbers to describe it since velocity is a vector valued quantity. Continuing with the
weather analogy, we could imagine picking an arbitrary point in space and measuring
the wind velocity at that point. In our 3D world we would therefore assign three
numbers at this point describing the velocity vector components in each direction
at any moment in time. Tensor fields are then an extension of this idea, where we
require more numbers to describe a physical quantity at a location in space. We
will see some examples of tensor fields in Part V, where for a stress or a strain field we need nine numbers to define the state at a point in space. As it turns out, we can define a tensor by its rank, and the number of components that it requires is D^rank, where D is the number of spatial dimensions. Furthermore, we can say that scalars and vectors are in fact subsets of the more general class of tensors, i.e. a scalar is a rank 0 tensor requiring 3^0 = 1 number to describe it, a vector is a rank 1 tensor requiring 3^1 = 3 numbers to describe it, and a rank 2 tensor requires 3^2 = 9 numbers to describe it.
(Figure: schematic of a spatial domain Ω with boundary Γ, showing a point x and infinitesimal elements dΩ and dΓ.)
The discrete values of φ at a grid point (xi, yj, zk) and time level tl we will denote as φ(xi, yj, zk, tl) = φ^l_{i,j,k}. As with ODEs, our PDEs will be solved
within a domain, and as with ODEs the temporal domain will be specified as t ∈
[tmin , tmax ]. In terms of the spatial domain however, things are a little more complex.
With some of the numerical methods we will study, we can specify the domain in a
similar way, namely x ∈ [xmin , xmax ], but in other numerical methods this will not
be possible, since the shape of a spatial domain can be arbitrarily complex. For the
latter cases we will use Ω to denote our spatial domain and Γ its boundary (the term
∂Ω is also sometimes used to denote a boundary). We will define our domain in
Euclidean space RD (which is a rather complicated way of saying that our domain is
D dimensional and is defined in terms of real numbers) and so we can say Ω ⊂ RD
(i.e. our domain is a subset of the D dimensional space of real numbers). The
advantage of using this notation is that no matter how many spatial dimensions
our problem is defined in, the notation doesn’t change. In R3 (i.e. 3D space) the
domain has a volume with boundary surfaces, in R2 (i.e. 2D space) the domain has
an area with boundary edges, and in R1 the domain has a length with boundary
points. At this stage we don’t need to elaborate much further; the meaning will
become apparent when we study the particular numerical methods. For the most
part we will only be solving problems in up to two spatial dimensions, plus time.
The reason for this restriction is that solving PDEs in R2 will reveal the relevant
complexity necessary for understanding how ‘real world’ problems are solved, but
also mean that we don’t get ‘bogged down’ in too much math and computation. All
of the techniques that we will cover extend readily to higher spatial dimensions.
As we did with our study of ODEs, before we get into studying the numerical
methods, we need to outline some basic concepts and definitions. One of the impor-
tant aspects we must consider is the order of a PDE, which as with ODEs is simply
the order of the highest derivative present in the equation. As with ODEs the order
of a PDE will have important implications in terms of how much information we
have to specify to obtain a solution. In this book we will only be considering PDEs
up to second order. Another important aspect we must consider is whether or not
we are solving a single PDE or a system of PDEs. Throughout this part of the book we will only investigate the numerical solution of a single PDE, but
in Part V we will then extend these ideas to coupled systems of PDEs. Another
important aspect that we must consider is whether or not we are solving a linear or
nonlinear PDE. If a second order PDE is linear, then it can be represented in the
form:
a(x, t) ∂²φ/∂x² + b(x, t) ∂²φ/∂x∂t + c(x, t) ∂²φ/∂t² + d(x, t) ∂φ/∂x + e(x, t) ∂φ/∂t + f(x, t)φ = g(x, t)    (12.1)
If any of the coefficients a, b, c, d, e, f, g happen to also be functions of φ then the
PDE is described as quasi-linear . Now with PDEs it is the highest order terms that
determine the properties of the solutions. The collection of the highest order terms
is called the principal part and for the PDE presented in Equation 12.1 is defined
as:
a(x, t) ∂²φ/∂x² + b(x, t) ∂²φ/∂x∂t + c(x, t) ∂²φ/∂t²
To generalize on this idea to three spatial dimensions, plus time, let x1 = x, x2 = y,
x3 = z, x4 = t. Then we can write out our second order linear PDE as:
Σ_{m=1}^{4} Σ_{n=1}^{4} A_{m,n} ∂²φ/∂xm∂xn + Σ_{m=1}^{4} b_m ∂φ/∂xm + f(φ, x1, x2, · · · , x4) = 0
Here A is a matrix containing the coefficients of the second order derivatives defin-
ing the principal part, b is a vector containing the coefficients of the first order
derivatives, and f is some function defining the remainder of the PDE. We can then
classify the PDE in this form by examining the eigenvalues of the matrix A, where:
• if λ1, · · · , λD are nonzero and have the same sign (i.e. the matrix A is positive definite) then the equation is termed Elliptic.
• if one or more of λ1, · · · , λD are zero then the equation is termed Parabolic.
• if λ1, · · · , λD are nonzero and all except one have the same sign then the equation is termed Hyperbolic.
• if λ1, · · · , λD are nonzero and at least two are positive and at least two are negative then the equation is termed Ultrahyperbolic.
The meaning and implications of this classification will become apparent throughout
the course since we will actually solve PDEs of each type, but for now we just need
to bear in mind that the classification informs us as to what the solution might look
like, and what information we need to specify to get a solution. As it happens elliptic
equations tend to describe steady state equilibrium problems, parabolic equations
tend to describe transient diffusion type problems, and hyperbolic equations tend
to describe transient problems exhibiting a wave type motion. An important point
to note is that because the terms in Am,n could be functions of xm it is possible that
the eigenvalues will in fact be functions of xm (not just single numbers) and that
the classification may change throughout the domain. A ‘classic’ example of this is
say, compressible fluid flow which changes from subsonic to supersonic as it flows
over the wing of an aircraft. A final point on the matter is that all first order PDEs
are classified as hyperbolic.
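As a small illustrative sketch (not from the text), this classification can also be checked numerically by forming the principal part matrix A and examining its eigenvalues. The matrix shown here corresponds to the 1D heat equation, with x1 = x and x2 = t, so only A(1,1) is nonzero:

% minimal sketch: classify a second order PDE from the eigenvalues of its
% principal part matrix A (here the 1D heat equation, so A(1,1) = 1 only)
A = [1 0; 0 0];
lambda = eig(A);
if all(lambda ~= 0) && (all(lambda > 0) || all(lambda < 0))
    disp('Elliptic');
elseif any(lambda == 0)
    disp('Parabolic');
elseif sum(lambda > 0) == 1 || sum(lambda < 0) == 1
    disp('Hyperbolic');
else
    disp('Ultrahyperbolic');
end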
As with ODEs, another important aspect to consider is whether or not the PDE
is homogeneous or inhomogeneous. If g(xm ) is zero then it is the former, and if
non-zero then it is the latter.
Finally and perhaps most importantly, one of the important aspects of our prob-
lem we must consider is what types and how many boundary conditions we need
to specify for our problem to be well posed. When it comes to PDEs there are
three main types of boundary conditions that we can specify on the boundary of our
domain Γ. The first is known as a Dirichlet boundary condition and is defined as:

φ(x) = f(x) ∀ x ∈ ΓDirichlet

The idea here is that the value of φ itself is set equal to some specified function for all points x on the part of the boundary where we are choosing to specify the Dirichlet boundary condition. The second type of boundary condition is known as a Neumann boundary condition and is defined as:

∂φ/∂n = f(x) ∀ x ∈ ΓNeumann
where n is a unit normal vector pointing away from the boundary. The idea here
is that the gradient of φ, normal to the boundary (i.e. pointing directly away from
the boundary), is equal to some specified function for all points x on the boundary
where we are choosing to specify the Neumann boundary condition. We could also
denote this by ∂n φ = f (x). The third type of boundary condition is known as a
Robin boundary condition and is defined as:
aφ(x) + b ∂φ/∂n = f(x) ∀ x ∈ ΓRobin
where we can see that it is essentially a combination of a Dirichlet and a Neumann
boundary condition. We will refer to Mixed boundary conditions when we have
a problem where we specify one type of boundary condition on some part of the
domain, and a different type on another part. For instance we will frequently be
specifying both Dirichlet and Neumann conditions over different parts of a boundary
when we solve certain PDEs. A final type of boundary condition is known as a
Periodic boundary condition. In this case we effectively let the domain ‘loop’ back
around on itself so really, there is in fact no boundary anymore. It is then the
case that the boundary values at either end of the domain are equal. So we’ve
covered the different types of boundary conditions, but we also need to think about initial
conditions. In the solution of PDEs we often have what is known as a Cauchy
problem, which is essentially an extension of the idea of an initial value problem.
When we have only a first order derivative with respect to time, then we will need
to provide the initial condition φ(x, tmin ) = f (x), but if we have a second order
derivative with respect to time then we must also provide ∂t φ(x, tmin ) = g(x). The
Cauchy problem implies specifying the PDE with the appropriate number of initial
conditions in order to obtain a unique solution.
It is now time to introduce some common operators in vector calculus and we
will introduce these using two types of notation, namely vector notation and tensor
notation. Both of these forms describe the same thing, but often, working with one
type of notation will make life easier compared to the other. The first operator that
we will introduce is known as grad (or del or nabla) and is defined as:
∇ ≡ ∂xi ,  i = 1, 2, . . . , D    (12.2)
  ≡ ( ∂/∂x )    in R1
  ≡ ( ∂/∂x , ∂/∂y )    in R2
  ≡ ( ∂/∂x , ∂/∂y , ∂/∂z )    in R3
where the use of the ∇ symbol is vector notation, while the ∂xi notation is tensor
notation. This operator describes the gradient of a field. Now, if this operator was
applied to a scalar field, the result would be a vector field. For example, if we had
φ(x, y, z, t), then the result of this operation would be:
∇φ = ∂xi φ = ( ∂φ/∂x , ∂φ/∂y , ∂φ/∂z )
If however, this operator was applied to a vector field the result would be a dyadic.
For example, if we had v(x, y, z, t), with components {vx , vy , vz } then the result of
this operation would be:
∇v = ∂xi vj , a second rank tensor (dyadic) containing all of the partial derivatives ∂vj /∂xi .
The next operator that we will introduce is known as the divergence and is defined as:

∇· ≡ ∂xi
   ≡ ∂/∂x    in R1
   ≡ ∂/∂x + ∂/∂y    in R2
   ≡ ∂/∂x + ∂/∂y + ∂/∂z    in R3
which is also written as div(). This operator describes the ‘source’ of a field at a
given point. Now, if this operator was applied to a vector field, the result would be
a scalar field. For example, if we had v(x, y, z, t), with components {vx , vy , vz } then the result of this operation would be:

∇ · v = ∂xi vi = ∂vx/∂x + ∂vy/∂y + ∂vz/∂z

If however this operator was applied to a second rank tensor field, such as a stress field σ(x, y, z, t), the result would be a vector field:

∇ · σ = ∂xj σij = ( ∂σxx/∂x + ∂σxy/∂y + ∂σxz/∂z , ∂σyx/∂x + ∂σyy/∂y + ∂σyz/∂z , ∂σzx/∂x + ∂σzy/∂y + ∂σzz/∂z )
An important point to note here is that in terms of the tensor notation we are using
Einstein summation notation[13], where any repeated indices (j in this case) imply
that we substitute in all possible values for the index and add them all together.
Any index appearing only once in an expression is hence called a free index.
Another common operator is known as the curl and is defined as:
∇ × v = ( ∂vz/∂y − ∂vy/∂z , ∂vx/∂z − ∂vz/∂x , ∂vy/∂x − ∂vx/∂y )    in R3    (12.3)
which operates on a vector field and produces another vector field. It is also written
as curl(). This operator is only defined in 3D and describes a vector field’s rotation
at a given point.
Finally we will define the Laplacian as:
∇² ≡ ∂xi xi    (12.4)
   = ∇ · ∇
   ≡ ∂²/∂x²    in R1
   ≡ ∂²/∂x² + ∂²/∂y²    in R2
   ≡ ∂²/∂x² + ∂²/∂y² + ∂²/∂z²    in R3
which is also written as ∆. Now, if this operator was applied to a scalar field, the
result would be another scalar field. For example, if we had φ(x, y, z, t), then the
result of this operation would be:
∇²φ = ∂xi xi φ = ∂²φ/∂x² + ∂²φ/∂y² + ∂²φ/∂z²
If however, this operator was applied to a vector field, the result would be another
vector field. For example, if we had v(x, y, z, t), with components {vx , vy , vz } then the result of this operation would be:

∇²v = ∂xj xj vi = ( ∂²vx/∂x² + ∂²vx/∂y² + ∂²vx/∂z² , ∂²vy/∂x² + ∂²vy/∂y² + ∂²vy/∂z² , ∂²vz/∂x² + ∂²vz/∂y² + ∂²vz/∂z² )
The reason that we’ve made these definitions is that they are commonly used
in the description of a number of important PDEs. So, having now introduced a
number of concepts and definitions relating to the classification of PDEs, let’s take
a moment to look at some common example PDEs and classify them. Beginning
with the Poisson equation:
∂²φ/∂x² + ∂²φ/∂y² + ∂²φ/∂z² = ψ    or    ∇²φ = ψ    or    ∂xi xi φ = ψ    (12.5)
We can see that this is a second order, linear, inhomogeneous PDE in RD . Because
the eigenvalues of the principal part are λ1 = 1, · · · , λD = 1 this equation is elliptic.
It describes a number of equilibrium problems (as observed by the fact that there’s
no time derivatives present). If ψ = 0 then the equation is known as Laplace’s
equation. Another ‘classic’ PDE is the Heat equation:
∂φ/∂t = ∂²φ/∂x² + ∂²φ/∂y² + ∂²φ/∂z²    or    φ̇ = ∇²φ    or    ∂t φ = ∂xi xi φ    (12.6)
which we can see is a second order, linear, homogeneous PDE in RD . Because the
eigenvalues of the principal part are λ1 = 1, · · · , λD = 1, λt = 0 this equation is
parabolic. It describes transient diffusion processes. Another ‘classic’ PDE is the
Wave equation, defined as:
∂φ/∂t = ∂φ/∂x + ∂φ/∂y + ∂φ/∂z    or    φ̇ = ∇φ    or    ∂t φ = ∂xi φ    (12.7)
which is a first order, linear, homogeneous PDE in RD . This version is also known
as the first order wave equation, or the one-way wave equation and being first order
it is hyperbolic. The wave equation is perhaps more commonly defined as:
∂²φ/∂t² = ∂²φ/∂x² + ∂²φ/∂y² + ∂²φ/∂z²    or    φ̈ = ∇²φ    or    ∂tt φ = ∂xi xi φ    (12.8)
which we can see is a second order, linear, homogeneous PDE in RD. Because the eigenvalues of the principal part are λ1 = 1, · · · , λD = 1, λt = −1 this equation is hyperbolic, and it describes transient wave type motion. Another common set of PDEs are the equations of motion for a linear elastic solid, which in 3D can be written as:

ρ ∂²ux/∂t² = (λ + µ)( ∂²ux/∂x² + ∂²uy/∂x∂y + ∂²uz/∂x∂z ) + µ( ∂²ux/∂x² + ∂²ux/∂y² + ∂²ux/∂z² ) + ρgx
ρ ∂²uy/∂t² = (λ + µ)( ∂²ux/∂y∂x + ∂²uy/∂y² + ∂²uz/∂y∂z ) + µ( ∂²uy/∂x² + ∂²uy/∂y² + ∂²uy/∂z² ) + ρgy
ρ ∂²uz/∂t² = (λ + µ)( ∂²ux/∂z∂x + ∂²uy/∂z∂y + ∂²uz/∂z² ) + µ( ∂²uz/∂x² + ∂²uz/∂y² + ∂²uz/∂z² ) + ρgz
which we can see is a second order, linear, inhomogeneous PDE in RD . Because the
eigenvalues of the principal part of the second equation are λ1 = λ+2µ, · · · , λD = λ+
2µ, λt = −ρ this equation is hyperbolic. The dependent variable is the displacement
field u, which is a vector field, and so this equation can be thought of as a system
of D equations, one for each displacement component. In this case ρ is the mass
density and µ and λ are the Lamé Parameters which define the elastic properties
of a solid. Another common set of PDEs are the Navier-Stokes equations in fluid
mechanics (which come from the principles of conservation of mass and momentum):

ρ̇ + ∇ · (ρv) = 0    or    ∂t ρ + ∂xi (ρvi) = 0
ρv̇ + v · ∇(ρv) = µ∇²v − ∇p + ρg    or    ∂t (ρvi) + vj ∂xj (ρvi) = µ ∂xj xj vi − ∂xi p + ρgi
which we can see is a system of two PDEs, the first of which is a first order, linear,
homogeneous PDE in RD , the second of which is a second order, nonlinear, inho-
mogeneous PDE in RD . Because the eigenvalues of the principal part of the second
equation are λ1 = µ, · · · , λD = µ, λt = 0 this equation is parabolic. The dependent
variable in the first equation is the fluid mass density ρ, which is a scalar field. The
dependent variable in the second equation is the fluid velocity field v, which is a
vector field, and in fact it can be thought of as a system of D equations, one for
each velocity component. In this case µ describes the fluid viscosity and p is the
pressure field.
Another common PDE is the Energy equation in thermodynamics (which comes from the principle of conservation of energy):
ρC ∂T/∂t = ∇ · (k∇T ) + Q
which is a second order, linear PDE in RD . Because the eigenvalues of the
principal part are λ1 = k, · · · , λD = k, λt = 0 this equation is parabolic. The
dependent variable is the temperature field T , which is a scalar field. In this case ρ,
C, k, and Q are the mass density, specific heat capacity, thermal conductivity, and
heat generation respectively. Another common set of PDEs are Maxwell’s equations:
ε ∂E/∂t = ∇ × H − J
µ ∂H/∂t = −∇ × E − M
ε ∇ · E = ρf
µ ∇ · H = 0
which is a system of four, first order, linear PDEs, one of which is homogeneous. Here, E and H are the electric and magnetic field vectors respectively, ρf , J, and M are the charge, current density, and magnetization fields respectively, and ε and µ are the electric permittivity and magnetic permeability of a material respectively. These equations are hyperbolic and describe how electric charges and electric currents act as sources for the electric and magnetic fields. Further, they describe how a time
varying electric field generates a time varying magnetic field and vice versa.
Finally we shall end with the Schrödinger equation:
iℏ ∂Ψ/∂t + (ℏ²/2m) ∇²Ψ = V (x)Ψ
which is a second order, linear, homogeneous PDE in RD . Here Ψ is the wavefunction
of the system, V is the potential, m is the mass, and ~ is the reduced Planck constant.
It is used in quantum mechanics and describes how the quantum state of a physical
system changes in time.
Compared to the ODEs that we studied and solved in Part II we can now see
that PDEs are quite a bit more complicated. As a result, unlike ODEs, where we
can essentially develop ‘canned algorithms’ that we can subsequently apply to any
system of ODEs, PDEs are much more complex and it is much harder to have a
‘one method fits all equations’ type approach. Similar to Parts I and II, we are going to define an example system to which we can apply our numerical methods, so for the remainder of this part of the book we will be solving the generic scalar transport equation:

∂φ/∂t + ∇ · (vφ) = µ∇²φ + ψ    (12.11)

in its various forms, meaning that sometimes we will set certain terms to zero, so
that the problem will be either hyperbolic, parabolic, or elliptic. As it stands at the
moment, this equation is second order, linear, inhomogeneous, and the eigenvalues of
the principal part are λ1 = µ, · · · , λD = µ, λt = 0, so this equation is parabolic. As
its name implies this equation describes the transport of a scalar quantity and can
be applied to solving numerous problems by simply replacing φ with some specific
variable such as density, velocity, temperature, etc. It is hence a good candidate for
the development of our numerical methods.
To elaborate briefly on the various terms in the equation, the first term on the
left hand side of Equation 12.11 is the derivative of φ with respect to time (and
has the same meaning as in Part II except that we use ∂ instead of d to denote
the differential, since φ is now a multivariate function), and as such we call this the
unsteady term as it allows for the variation of φ with time.
The second term involves the vector field v and is commonly called the convective
term. As its name implies the convective term describes how the variable φ is moved
or ‘convected’ through a spatial domain by the presence of the velocity field. As an
analogy think of a drop of dye being injected into a flowing stream of water and
imagine that we are using the scalar transport equation to compute the concentration
field of dye as a function of space and time. The concentration field will change as
dye is carried along with the flow, and this is exactly what the convective term
represents mathematically.
The first term on the right hand side of Equation 12.11 contains the variable
µ and the Laplacian of φ and is commonly called the diffusive term. As its name
implies the diffusive term describes the ‘diffusion’ of φ throughout the computational
domain. Returning to the drop of dye in water analogy; if the water was instead
still, then over time the drop would spread out and the water would change color.
The diffusion of the dye is what this term represents mathematically.
The last term on the right hand side of Equation 12.11, ψ, is known as the source term and is a generic way of including any additional or problem specific terms into
this generic PDE. It should also be apparent that this term will define whether the
PDE is homogeneous or inhomogeneous. The source term may be some function of
φ or a function of x, a constant, or it may just be zero. The point is that we just
don’t put too much effort into specifying the form of the function at this stage. As
an example of where this term can be used, in fluid or solid mechanics problems
where the effects of gravity are included, the gravitational force would be added
into the source term as a constant. As a second example, consider a thermodynamics
problem describing heat transfer in a solid which is producing its own heat (either
via a chemical reaction or electrical current). In either case φ would be the materials
temperature T and the function describing the source of heat would be placed into
ψ.
If the coefficients in the convective and diffusive terms (i.e. v and µ) are constant, then we can use the vector identity:

∇ · (ab) = a · ∇b + b(∇ · a)

and note that the second term on the right hand side will be zero. As such, we arrive at a simplified form of the generic scalar transport equation:

∂φ/∂t + v · ∇φ = µ∇²φ + ψ
Expanding out the operators in full for all of the terms, for a 3D problem we could
write the scalar transport equation as:
∂φ/∂t + vx ∂φ/∂x + vy ∂φ/∂y + vz ∂φ/∂z = µ ∂²φ/∂x² + µ ∂²φ/∂y² + µ ∂²φ/∂z² + ψ
Now, as we will soon see, the solution of a PDE generally involves applying some
numerical method to the terms involving the spatial derivatives and reducing the
PDE to a system of ODEs. Since we will be focusing on a linear PDE, we will be
able to express this system as:
M φ̇ = Kφ + s (12.12)
which we can see is essentially the same as Equation 5.2. Often, M is termed the
mass matrix , K the stiffness matrix , and s the load vector (or source vector). These
names derive from the application in solid mechanics where ‘say’ the K matrix was
representing the stiffness of a material, but we will continue their use throughout
this book. Turning a PDE into a system of ODEs is known as a semi-discretization
or the Method of Lines[28] and we can generally then use any of the numerical
methods from Part II to perform the time integration. It should be noted however
that it is possible to discretize PDEs in both space and time at the same time, which
would be a full discretization. If we happen to use an explicit method for the time
integration then we will have something like:
M (φ^{l+1} − φ^{l}) / ∆t = K φ^{l} + s
If on the other hand, we happen to use an implicit method for the time integration
then we will have something like:
M (φ^{l+1} − φ^{l}) / ∆t = K φ^{l+1} + s
meaning that we will have to solve a system of equations at every time step like:
Aφl+1 = b
where:
A = M − ∆tK
b = M φl + ∆ts
So, it can be observed that often, solving a PDE reduces to solving a system of
ODEs, which in turn reduces to solving a system of algebraic equations. If we
don’t have a temporal derivative term in our PDE however, then applying a spatial
discretization would lead directly to a system of algebraic equations. It should
be noted that compared to some of the example problems from Parts I and II,
the important difference is that the size of the system of equations is defined by the
spatial discretization and will typically be much larger (e.g. we were solving systems
of size 3 × 3, but systems of say 10^6 × 10^6 are not uncommon).
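To make this concrete, the following is a minimal Matlab sketch (illustrative only, assuming that the matrices M and K, the vector s, an initial solution phi, and the step sizes Delta_t and N_t have already been produced by some spatial discretization) of implicit time marching:

% minimal sketch of implicit Euler time marching for M*phi_dot = K*phi + s,
% assuming M, K, s, phi (initial condition), Delta_t and N_t already exist
A = M - Delta_t*K;              % constant system matrix for a linear PDE
for l = 1:N_t-1
    b   = M*phi + Delta_t*s;    % right hand side built from the old solution
    phi = A\b;                  % solve A*phi^{l+1} = b at every time step
end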
As it happens, most of the concepts that were introduced in Part II relating to
accuracy, stability, consistency, convergence, etc, still apply when studying PDEs.
As we cover the different numerical methods for performing the spatial discretization,
we will define whether or not they use local or global approximations. Essentially
this defines whether or not the solution at any given point within the computational
domain is related to just a few of its immediate neighbors, or to all other points
within the domain. When we studied ODEs we could generally state the order
of accuracy of a method, or for some particular families (such as Runge-Kutta or
Adams-Bashforth methods) there were varying orders of accuracy available to us.
With the methods for solving PDEs that we will study, we have a similar scenario,
namely that we can obtain varying orders of accuracy for the spatial discretization.
Figure 12.2: Schematics of (a) a regular structured grid and (b) an unstructured
grid composed of triangles.
Before studying these methods however, we must introduce the two types of grid commonly used to discretize a spatial domain, namely structured and unstructured grids. Here we are not so much concerned with how such grids are generated (a topic that could be an entire book by itself), but rather with the nature of the resulting
grids. As a quick aside before beginning our study, it is worth mentioning that
in practice the term mesh is commonly used synonymously with grid, so one will
commonly hear references to structured and unstructured meshes.
Structured grids are perhaps the simplest way in which we can break up a region
of space into discrete pieces. As can be seen in Figure 12.2(a) a square region of
space (our computational domain) has been broken up into a number of smaller
regularly spaced pieces. While shown in 2D for simplicity, the extension to 3D is
obvious where we would have a cube which is broken up into a regular array and
with spacings ∆x, ∆y, and ∆z in the x, y, z directions respectively. Now depending
on the numerical method that we will be applying to the structured grid, we can
either think of this as a collection of discrete points spaced ∆x, ∆y, and ∆z apart, or
as a collection of cells (or elements) with volume ∆x × ∆y × ∆z.
In order to specify a particular value of φ(x, y, z, t) all that is needed is to assign indices i, j, k for x, y, z respectively, and then a particular point can be located by φ(i∆x, j∆y, k∆z, t). If we denote the number of grid points in each dimension by Nx , Ny , and Nz , then in order to store the field φ at any given moment in time (i.e. for one time step) we could store the array:
phi = zeros(N_x, N_y, N_z);
in Matlab, and:
double phi[N_x][N_y][N_z];
in C++. To locate any of these values in space, all that we then need to store in addition are the three grid spacings ∆x, ∆y, and ∆z. An important point to note is that we don't have to store our
field data in an array like this, but if the nature of our grid is analogous to a 3D
array we may as well take advantage of that. The only thing that really matters is
that for each scalar value of φ that we are storing we know where to locate it in 3D
space.
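As a small sketch of this idea (with assumed values), the mapping from array indices back to physical coordinates on a structured grid is simply:

% minimal sketch: on a structured grid the coordinates of the value stored
% at phi(i,j,k) follow directly from the indices and the grid spacings
Delta_x = 0.1; Delta_y = 0.1; Delta_z = 0.1;
N_x = 11; N_y = 11; N_z = 11;
phi = zeros(N_x, N_y, N_z);
i = 3; j = 5; k = 7;
x = (i-1)*Delta_x;   % assuming the first grid point sits at the origin
y = (j-1)*Delta_y;
z = (k-1)*Delta_z;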
Figure 12.3: Some typical unstructured grid primitive cell types in 3D.
The obvious limitation to structured grids is that the vast majority of problems
of interest can’t be represented by a regular ‘block’ like this. Unstructured grids on
the other hand break up a region of space into a smaller number of primitives which
we will call either cells or elements depending upon the numerical method (Figure
12.2(b)). While triangles were used in this case there are many more possibilities
and Figure 12.3 illustrates some primitive types in 3D. Now while Figure 12.2(b)
shows an unstructured triangular grid equivalent of a square domain, the real power
in utilizing unstructured grids lies in the fact that we can solve PDEs in much
more complex domains. Figure 12.4(a) is one such example of a complex geometry
and illustrates a portion of an unstructured grid around an aircraft. In addition to
observing that the triangles are able to tessellate the space around the fuselage, we
can also observe how the triangles can vary greatly in size from place to place within
the grid, thereby allowing greater resolution and accuracy in the solution wherever it
is needed. Figure 12.4(b) presents another example of the use of unstructured grids
applied to a spring. It can be observed that the same domain has been ‘meshed’ using two different types of primitives, namely tetrahedral and hexahedral cells. This flexibility is a common feature of unstructured grids; depending upon the numerical method employed to solve the PDE within the grid, one may even ‘mix and match’
any number of different primitives within the one grid. The important thing is that
the entire spatial domain be tessellated into contiguous, non overlapping cells or
elements, expressed mathematically as:
∪_{c=1}^{Nc} Ωc = Ω
That is, the union of all the Nc cells Ωc tessellating the domain, is the domain Ω
itself.
Figure 12.4: Two examples of unstructured grids of complex geometries (a) illus-
trates a tetrahedral grid of the region around the fuselage of an aircraft (b) illustrates
both a hexahedral and tetrahedral grid of a spring type structure.
The first data structure that we need in order to define an unstructured grid is a 2D array of point coordinates:

P = [ x1 y1 z1 ; x2 y2 z2 ; x3 y3 z3 ; x4 y4 z4 ; . . . ; xNp yNp zNp ]
which is of size Np × ND where Np is the number of points defining the grid and ND
is the dimensionality of the domain. These points represent the vertices of the cells
or elements. Another data structure which can be defined in conjunction with the
points is a 2D array of edges:
E = [ P1 P2 ; P2 P3 ; P3 P1 ; P1 PNp ; . . . ; P4 P19 ]
which is of size Ne × 2 where Ne is the number of edges in the grid. Here each
edge is a row in the edge array and is defined by two indices which each identify a
row in the points array (e.g. edge 1 is defined by point 1 and point 2, which have
the coordinates x1 , y1 , z1 and x2 , y2 , z2 ). One could also define an edge by explicitly
storing the coordinates of the end points of each edge, but if there are many edges
which share a given point (e.g. there are around five or six edges using each point in
Figure 12.2(b)) then this data structure becomes somewhat inefficient because we
will be storing the same coordinate point many times.
Building on this data structure we may then store a 2D array of faces:
F = [ E1 E2 E3 ; E3 E1 E4 ; E1 E5 E6 ENe ; E8 E9 E7 E5 E5 ; . . . ; E9 E4 E6 ]
or
F = [ P1 P2 P3 ; P3 P1 P4 ; P1 P5 P6 PNp ; P8 P9 P7 P5 P6 ; . . . ; P9 P4 P6 ]
which will have Nf rows and the number of columns will depend upon the nature
of the face (e.g. triangular, quadrilateral, polyhedral). Now, the faces could be defined
by indices identifying a row in the edge array or by indices identifying a row in
the points array. So we are saying that we could either define a face by its edges
or by its vertices. Generally there will be some ordering in the sequence of the
edges or points defining a face. The more common approach is to store the edges
in an anticlockwise order around the face. The varying number of edges or points
within each row are present to emphasize the point that we could have triangular,
quadrilateral, pentagonal, etc, faces in our unstructured grid.
Building up further we define a list of cells:
C = [ F1 F2 F3 F5 ; F9 F5 F6 F7 ; F11 F12 F13 FNf ; F31 F49 F52 F7 F22 ; . . . ; F9 F35 F44 F14 ]
or
C = [ E1 E2 E3 E5 ; E9 E5 E6 E7 ; E11 E12 E13 ENe ; E31 E49 E52 E7 E22 ; . . . ; E9 E35 E44 E14 ]
or
C = [ P1 P2 P3 P5 ; P9 P5 P6 P7 ; P11 P12 P13 PNp ; P31 P49 P52 P7 P22 ; . . . ; P9 P35 P44 P14 ]
which will have Nc rows and the number of columns will depend upon the nature of
the cell (e.g. tetrahedral, hexahedral, polyhedral). That is we could define a cell by
either its faces, edges, or vertices, and in either case the indices in each row of the cell
array identify a geometrical entity in one of the other arrays. Similar to the
faces, cells can be of essentially arbitrary shape (Figure 12.3) and hence defined by
a variable number of faces (e.g. 4 faces for a tetrahedral cell, 5 faces for a pyramid,
6 faces for a hexahedral cell). Obviously the definition of cells presented here only
applies to the discretization of a 3D domain. If we are looking at a 2D domain
then the faces play the same role as the cells in 3D. Depending on the terminology
however, you may find the faces referred to as cells in a 2D discretization and this
is the approach we will use in this book. So in 3D our cells will have a volume
associated with them, in 2D they will have an area associated with them, and in
1D they will have a length associated with them.
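As a small illustrative sketch (not taken from the text), the points and cells arrays for a tiny 2D grid of two triangles covering the unit square could be built as:

% minimal sketch: a 2D unstructured grid of two triangular cells, with the
% cells array storing row indices into the points array for each vertex
Points = [0 0;    % P1
          1 0;    % P2
          1 1;    % P3
          0 1];   % P4
Cells  = [1 2 3;  % cell 1 defined by vertices P1, P2, P3
          1 3 4]; % cell 2 defined by vertices P1, P3, P4
XY = Points(Cells(2,:), :);   % vertex coordinates of cell 2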
So far we have described how each individual cell is defined. In order to com-
pletely define an unstructured grid however we may need to go even further and also
store some connectivity information, describing how particular geometrical entities
are connected to one another. This is analogous to a given grid point in a structured
grid φi,j,k being able to locate its neighbors φi±1,j±1,k±1 , but in this case we need to
explicitly store the connectivity information. Some possibilities here are that for a
given point we might store the indices of all other points to which it is connected to
via an edge (Figure 12.5(a)):
[ P3 P4 P5 P5 P6 ; P19 P23 P74 P6 PNp ; P4 P5 P8 P31 ; P9 P5 P7 P31 P42 ; . . . ; P7 P74 P6 P2 P2 ]
meaning that point 2 is connected to points 19, 23, 74, 6, and 2 etc. We may for a
given edge define the connectivity by storing the indices of the two faces which use
the edge (Figure 12.5(b)):
[ F1 F2 ; F5 F6 ; F11 FNf ; F31 F49 ; . . . ; F9 F12 ]
We can extend this concept further to faces and define the connectivity for a given
face by storing the indices of the two cells which share the face:
[ C1 C2 ; C5 C6 ; C11 CNf ; C31 C49 ; . . . ; C9 C35 ]
where we are saying here that face 2 is shared by cells 4 and 5 etc. It should be
noted that here a given face can be shared by only two cells, which is true for the
most common grid structures. On this point this data structure is sometimes termed
a ‘neighbor-owner’ array. The reason for this is that although both cells share the
face, one of these will be designated as being the owner cell for the face (say the first
cell index in the list) while the second will be designated as the neighbor cell for the
face. We will elaborate on this point at a later stage but to hint at the reason why
this is important, the discretization is going to involve the face normal vectors of
a cell which are defined as pointing outward, away from the cell. Obviously if two
cells are sharing the face then the normal vector will be pointing out of one face, but
in to the other. If we defined the normal as pointing out of the owner and into the
neighbor cell then this will help us keep our discretization consistent. Finally, we
could define the connectivity for a given cell by storing the indices of its neighboring
cells (Figure 12.5(c)):
[ C2 C3 C4 C5 ; C5 C5 C7 C9 C11 C23 ; C11 C12 C4 C5 ; C31 C49 C56 C44 C2 ; . . . ; C9 C35 C19 C11 ]
where we are saying here that cell 1 neighbors cells 2, 3, 4, and 5.
Figure 12.5: Schematics of different types of grid connectivity (a) illustrates a given
node storing indices of its neighbouring nodes (b) illustrates a given face storing the
indices of the two neighbor and owner cells which share it (c) illustrates a given cell
storing the indices of the cells which neighbor it.
Putting all of this together, for ‘say’ a tetrahedral grid we might allocate:

Points = zeros(N_p, 3);
Faces  = zeros(N_f, 3);
Cells  = zeros(N_c, 4);
phi    = zeros(N_p, 1);

in Matlab, and:
double Points[N_p][3];
double Faces [N_f][3];
double Cells [N_c][4];
double phi [N_p];
in C++. Here a tetrahedral cell is defined by 4 triangular faces, which are in turn
defined by 3 points which have 3 x, y, z components. Furthermore it is assumed
that we are defining the discrete values of φ at the vertices of the cells, hence why
the length of the φ array matches the length of the points array. As we shall see,
for the Finite Element method in Chapter 15, this will be the case, but for the
Finite Volume method in Chapter 14 we instead define the discrete values of φ at
the centroids of the cells, hence the length of the φ array would match the length of the
cells array.
So, to re-emphasize the point, with a structured grid, in order to locate any
field variable in 3D space we only needed to store three numbers, ∆x, ∆y, and
∆z; then, based on the i, j, k index of the variable within its array, we can
assign its coordinates. For an unstructured grid on the other hand we need to store
between three to four arrays in order to do the same thing; and these arrays may
have thousands or millions of rows in them. This is of course in addition to the array
storing the actual field data. But remember the extra computer memory required
to store an unstructured grid is the ‘price we pay’ in order to be able to deal with
complex spatial domains and hence solve ‘real world’ problems.
Another important issue concerning unstructured grids lies in how we define its
boundaries. With a structured grid, these will trivially be the elements in the edges
of the field array. For example, accessing all of the boundary values on the lower
x, y plane of a 3D domain could be achieved via:
phi_b = phi(:,:,1);
in Matlab, and:
for(i=0; i<N_x; i++)
{
for(j=0; j<N_y; j++)
{
phi_b = phi[i][j][0];
}
}
in C++. With an unstructured grid on the other hand, the boundary information must be stored explicitly, and Figure 12.6 illustrates two different ways of defining boundary conditions. In Figure 12.6(a) the numerical
method will dictate that the field variables are defined at the cell vertices and so
the boundary conditions are applied at the vertices on the boundary of the grid. In
Figure 12.6(b) however, the numerical method will dictate that the field variables
are defined at the cell centroids and so the boundary conditions are applied on
the faces on the boundary of the grid. So with the former, we need to specify
which points are on a particular boundary and with the latter we need to specify
which faces are on a particular boundary. A useful way to minimize the amount of
information required to locate all of these boundary points/faces is to assume that
in their respective arrays, all of the interior points/faces come first, followed by the
boundary points/faces. Furthermore, we assume that the boundary points/faces are
all grouped together. In this case an ‘elegant’ way to define the boundaries is with
a structure or a class like:
Boundaries = struct('name', {}, 'type', {}, 'N', {}, 'start', {}, 'value', {});
in Matlab, or:
class Boundary
{
public:
Boundary(){ }
string name_;
string type_;
int N_;
int start_;
double value_;
};
Boundary Boundaries[N_b];
in C++. Now, if we couldn't assume that the
boundary points/faces were organized this way then our boundary struct or class
might not be so compact. If for instance the boundary points/faces were randomly
scattered throughout their arrays then we would need to explicitly store an array
for each boundary defining the indices of which points/faces are a part of it:
Boundaries = struct('name', {}, 'type', {}, 'N', {}, 'indices', {}, 'value', {});
in Matlab, or:
class Boundary
{
public:
Boundary(){ }
string name_;
string type_;
int N_;
int* indices_;
double value_;
};
Boundary Boundaries[N_b];
where indices is a 1D array storing the indices of either the points or faces in the
grid. We can then simply loop over all of the boundaries in our struct or class as:
for b = 1:N_b
    if strcmp(Boundaries(b).type, 'Dirichlet')   % only apply Dirichlet values here
        for n = 1:Boundaries(b).N
            phi(Boundaries(b).indices(n)) = Boundaries(b).value;
        end
    end
end
This introduction to structured and unstructured grids has only touched very
briefly on what is quite a large field. For the interested reader, some additional
information can be found in [45], [59], [71], and [70].
We are now in a position to begin studying the various families of numerical methods. Part of the reason for introducing all of these concepts initially is so that when we come to applying different numerical methods to the solution of the various
forms of the generic scalar transport equation, we will have an understanding of
what the PDE is describing, what the solution might look like, what boundary and
initial conditions are required, so that we have a well posed problem, and how we go
about defining the required data structures computationally. As a final point, it is
worth mentioning that most of the concepts introduced here will have more meaning,
once we have actually studied a number of numerical methods and applied them to
specific PDEs. As such, it is recommended that you re-read this introduction at the
end of this part of the book.
Chapter 13
Finite Difference Methods
Just as Euler methods were the simplest methods that we can use to solve ODEs,
Finite Difference methods are perhaps the simplest methods that we can use to
solve PDEs, so they present an excellent starting point. The basic idea behind
the method is that we replace the derivative terms in a PDE with approximately
equivalent difference quotients, often called stencils. The difference quotients are
linear combinations of the field values at neighboring grid points. Finite Difference
methods are generally applied to structured grids where we have a regular and
equally spaced array of grid points covering our spatial domain and are a local
method in that the solution at any given point only involves the solution at a few
neighboring points.
In order to derive the finite difference quotients for a derivative term of order n,
we apply the general finite difference formula:
dⁿφ/dxⁿ |xi = (1/∆xⁿ) ( Σ_{m=1}^{Nm} a−m φi−m + a0 φi + Σ_{m=1}^{Nm} a+m φi+m )    (13.1)
where the values φi±m denote the field values at the neighboring points m and a±m are weighting coefficients, whose values will depend on how accurate we want the difference quotient to be.
The starting point for determining these coefficients is to expand the field values at the neighboring points as Taylor series about the point xi:

φi+1 = φi + ∆x dφ/dx|xi + (∆x)²/2! d²φ/dx²|xi + (∆x)³/3! d³φ/dx³|xi + O(∆x⁴)
φi−1 = φi − ∆x dφ/dx|xi + (∆x)²/2! d²φ/dx²|xi − (∆x)³/3! d³φ/dx³|xi + O(∆x⁴)
Now repeating this procedure for all 1 ≤ m ≤ Nm and substituting the resulting
expressions for φi±m into the right hand side of Equation 13.1 we get:
dφ/dx|xi = (1/∆x) [ a−Nm ( φi − Nm∆x dφ/dx|xi + (Nm∆x)²/2! d²φ/dx²|xi − (Nm∆x)³/3! d³φ/dx³|xi + . . . )
                  + . . . + a0 φi + . . .
                  + a+Nm ( φi + Nm∆x dφ/dx|xi + (Nm∆x)²/2! d²φ/dx²|xi + (Nm∆x)³/3! d³φ/dx³|xi + . . . ) ]

Collecting together the coefficients of φ and of each derivative, this can be written as:

dφ/dx|xi = (1/∆x) [ ( a−Nm + . . . + a−1 + a0 + a+1 + . . . + a+Nm ) φ(xi)
                  + ( a−Nm(−Nm) + . . . + a−1(−1) + a+1(1) + . . . + a+Nm(Nm) ) ∆x dφ/dx|xi
                  + ( a−Nm(−Nm)²/2! + . . . + a−1(−1)²/2! + a+1(1)²/2! + . . . + a+Nm(Nm)²/2! ) ∆x² d²φ/dx²|xi
                  + ( a−Nm(−Nm)³/3! + . . . + a−1(−1)³/3! + a+1(1)³/3! + . . . + a+Nm(Nm)³/3! ) ∆x³ d³φ/dx³|xi
                  + . . . ]
In order for the right hand side to actually equal the first derivative, the collected coefficient multiplying φ(xi) must equal zero, the coefficient multiplying dφ/dx|xi must equal one, and as many as possible of the coefficients multiplying the higher derivatives should also equal zero. Let's first consider the case where Nm = 1, for which these conditions give the system of equations:

[ 1    1   1   ] [ a−1 ]   [ 0 ]
[ −1   0   1   ] [ a0  ] = [ 1 ]
[ 1/2  0   1/2 ] [ a+1 ]   [ 0 ]
We can in fact solve this system and find that the coefficients are:
a−1 = −1/2
a0 = 0
a+1 = 1/2
and hence the corresponding formula for the first derivative of φ is:
dφ/dx|xi = (1/2∆x) (φi+1 − φi−1)    (13.4)
This is called the second order central difference for the first derivative. So we can
see that to approximate the derivative at point xi we will need to know the field
values at one point either side xi±1 . Sometimes it’s more convenient to have a finite
difference quotient which only requires knowing field values on one side or the other
of point xi . In this case what we can do is ‘choose’ for the coefficients on one side to
be zero. For example, if we only want to involve points xi+1 then we could choose
a−1 = 0, so that we would have the reduced system:
[ 1  1 ] [ a0  ]   [ 0 ]
[ 0  1 ] [ a+1 ] = [ 1 ]
where the solution is trivially:
a0 =−1
a+1 = 1
and hence the corresponding formula for the first derivative of φ is:
dφ/dx|xi = (1/∆x) (φi+1 − φi)    (13.5)
This is called the first order forward difference for the first derivative. Alternatively, if we only want to involve points xi−1 then we could choose a+1 = 0, so that we
would have the reduced system:
[ 1   1 ] [ a−1 ]   [ 0 ]
[ −1  0 ] [ a0  ] = [ 1 ]
where the solution is trivially:
a0 = 1
a−1 =−1
and hence the corresponding formula for the first derivative of φ is:
dφ/dx|xi = (1/∆x) (φi − φi−1)    (13.6)
This is called the first order backward difference for the first derivative.
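As a quick illustrative check (a sketch that is not part of the text), we can apply the forward and central difference quotients to a known function and watch how the error behaves as ∆x is reduced:

% minimal sketch: compare the first order forward difference (Equation 13.5)
% and second order central difference (Equation 13.4) for phi(x) = sin(x)
x = 1.0;
exact = cos(x);
for Delta_x = [0.1 0.05 0.025]
    forward = (sin(x+Delta_x) - sin(x))/Delta_x;
    central = (sin(x+Delta_x) - sin(x-Delta_x))/(2*Delta_x);
    fprintf('dx=%6.3f  forward error=%9.2e  central error=%9.2e\n', ...
            Delta_x, abs(forward-exact), abs(central-exact));
end
% halving Delta_x roughly halves the forward error but quarters the central error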
Let’s now consider the case where Nm = 2. The general finite difference formula
can then be written as:
dφ/dx|xi = (1/∆x) (a−2 φi−2 + a−1 φi−1 + a0 φi + a+1 φi+1 + a+2 φi+2)
and the system of equations that we need to satisfy in order to approximate the derivative can be written as:

[ 1     1     1   1     1   ] [ a−2 ]   [ 0 ]
[ −2    −1    0   1     2   ] [ a−1 ]   [ 1 ]
[ 2     1/2   0   1/2   2   ] [ a0  ] = [ 0 ]
[ −4/3  −1/6  0   1/6   4/3 ] [ a+1 ]   [ 0 ]
[ 2/3   1/24  0   1/24  2/3 ] [ a+2 ]   [ 0 ]

We can in fact solve this system and find that the coefficients are:
a−2 = 1/12
a−1 = −2/3
a0 = 0
a+1 = 2/3
a+2 = −1/12
and hence the corresponding formula for the first derivative of φ is:
dφ/dx|xi = (1/12∆x) (φi−2 − 8φi−1 + 8φi+1 − φi+2)    (13.7)
This is called the fourth order central difference for the first derivative. Again, if
we only wanted to derive forward or backward differences we could explicitly choose
a−2 = 0, a−1 = 0 or a+2 = 0, a+1 = 0 and solve the reduced systems. The important point to note is that we can create finite difference quotients of essentially any order we choose, either forward, backward, or central.
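The coefficient systems above can also be built and solved numerically. The following is a minimal sketch (not the book's code) for the Nm = 2 central difference of the first derivative, where each row of the matrix enforces the collected coefficient of one derivative order:

% minimal sketch: solve for the finite difference coefficients of the first
% derivative using the five points x_{i-2} ... x_{i+2}
offsets = -2:2;
n = numel(offsets);
A = zeros(n, n);
for row = 1:n
    A(row, :) = offsets.^(row-1) / factorial(row-1);   % coefficient of each derivative order
end
rhs = zeros(n, 1);
rhs(2) = 1;                    % we want the first derivative term to survive
a = A\rhs                      % returns [1/12, -2/3, 0, 2/3, -1/12]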
In order to derive finite difference quotients for the second derivative, we again apply the general finite difference formula, which now takes the form:

d²φ/dx² |xi = (1/∆x²) ( Σ_{m=1}^{Nm} a−m φi−m + a0 φi + Σ_{m=1}^{Nm} a+m φi+m )    (13.8)

As with the first derivative formula from before, expanding the right hand side in terms of Taylor series and collecting coefficients of the derivatives gives:
[ 1        · · ·   1      1   1      · · ·   1       ] [ a−Nm ]   [ 0 ]
[ −Nm      · · ·   −1     0   1      · · ·   Nm      ] [  ...  ]   [ 0 ]
[ Nm²/2!   · · ·   1/2!   0   1/2!   · · ·   Nm²/2!  ] [ a−1  ]   [ 1 ]
[ −Nm³/3!  · · ·   −1/3!  0   1/3!   · · ·   Nm³/3!  ] [ a0   ] = [ 0 ]
[  ...       ...    ...    ...  ...     ...     ...    ] [ a+1  ]   [ ... ]
                                                        [  ...  ]
                                                        [ a+Nm ]
where the only difference compared to the system of equations for the first derivative is that the third element in the vector of known values is equal to 1, instead of the second. As was the case with the first derivative, the more equations you satisfy, the more accurate your finite difference approximation will be. As we did for the
first derivative, let’s consider the case where Nm = 1. The general finite difference
formula can then be written as:
d²φ/dx² |xi = (1/∆x²) (a−1 φi−1 + a0 φi + a+1 φi+1)
The system of equations that we need to satisfy in order to approximate the derivative can be written as:

[ 1    1   1   ] [ a−1 ]   [ 0 ]
[ −1   0   1   ] [ a0  ] = [ 0 ]
[ 1/2  0   1/2 ] [ a+1 ]   [ 1 ]

Solving, we find that the coefficients are:

a−1 = 1
a0 = −2
a+1 = 1
and hence the corresponding formula for the second derivative of φ is:
d²φ/dx² |xi = (1/∆x²) (φi−1 − 2φi + φi+1)    (13.9)
This is called the second order central difference for the second derivative. As with the first derivative, if we only wanted forward or backward differences we could explicitly choose the coefficients on one side to be zero and solve the resulting reduced systems.
Although our derivations thus far have all been in terms of x, it should be obvious
that in multiple dimensions all of the analyses apply and we simply replace the
independent variable x, y, z and the index i, j, k appropriately. Putting everything
together, if we were to use second order central differences ‘say’ to approximate the
derivative terms in our generic scalar transport equation, in 3D we would get:
dφi,j,k/dt + (vx/2∆x) (φi+1,j,k − φi−1,j,k)
           + (vy/2∆y) (φi,j+1,k − φi,j−1,k)
           + (vz/2∆z) (φi,j,k+1 − φi,j,k−1)
  = (µ/∆x²) (φi−1,j,k − 2φi,j,k + φi+1,j,k)
  + (µ/∆y²) (φi,j−1,k − 2φi,j,k + φi,j+1,k)
  + (µ/∆z²) (φi,j,k−1 − 2φi,j,k + φi,j,k+1)
  + ψi,j,k    (13.10)
Defining an equation like this at every grid point and collecting the results together, we can again write the system in the form:

M φ̇ = Kφ + s    (13.11)
where φ is a column vector containing the values at each i, j, k grid point and M is the
identity matrix in this case. So it can be observed that by applying the Finite
difference method, we have applied a spatial discretization and we have reduced our
PDE to a system of ODEs. Now we have φi,j,k (t) and at this point we can apply
one of the numerical methods from Part II to perform the time integration.
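As a small sketch of what such a spatial discretization produces (illustrative values only), the matrix K for a purely diffusive 1D problem, built from the second order central difference of Equation 13.9, is tridiagonal:

% minimal sketch: assemble K for the 1D diffusion term using Equation 13.9
N_x = 5; Delta_x = 0.1; mu = 1.0;
e = ones(N_x, 1);
K = mu/Delta_x^2 * spdiags([e -2*e e], -1:1, N_x, N_x);
full(K)   % tridiagonal matrix with -2*mu/Delta_x^2 on the diagonal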
Before we move on to applying this method to solving some specific forms of
the generic scalar transport equation, it is worth devoting a little time to the issue
of how we actually impose boundary conditions. Now, when we have a Dirichlet
boundary condition to impose, we are saying that we know the value of φ on the
boundary and so really what this means is that those values should not be a part of
the column vector φ in Equation 13.11. Instead, these values are incorporated into
the column vector s. With a Neumann boundary condition on the other hand, we
are saying that we know the gradient of φ on the boundary, but the actual value of
φ on the boundary is still unknown. As such we can say that those values should be
a part of the column vector φ, in Equation 13.11, but the discrete equation for that
grid point will be modified by the Neumann boundary condition. To illustrate by
way of example, consider the discretized generic scalar transport equation in 1D:
dφi/dt + (vx/2∆x) (φi+1 − φi−1) = (µ/∆x²) (φi−1 − 2φi + φi+1) + ψi    (13.12)
and suppose that at grid point xi we are also applying the Neumann boundary
condition:
∂φi/∂x = f
What we would generally do here is to approximate the derivative boundary condi-
tion by a finite difference quotient, ‘say’ a second order central difference. In that
case we would have:
(φi+1 − φi−1) / 2∆x = f    (13.13)
and so, using Equation 13.13 to eliminate φi+1 (i.e. φi+1 = φi−1 + 2∆xf ), we get:

dφi/dt + (vx/2∆x) (φi−1 + 2∆xf − φi−1) = (µ/∆x²) (φi−1 − 2φi + φi−1 + 2∆xf ) + ψi

dφi/dt + vx f = (2µ/∆x²) (φi−1 − φi + ∆xf ) + ψi
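In terms of the matrices of Equation 13.11, the modified boundary equation above simply changes one row of K and one entry of s. A minimal sketch (with assumed variable names) might look like:

% minimal sketch: impose the Neumann condition dphi/dx = f at grid point i = N_x
% for the 1D system phi_dot = K*phi + s, using the modified equation above
i = N_x;
K(i,:)   = 0;
K(i,i-1) =  2*mu/Delta_x^2;
K(i,i)   = -2*mu/Delta_x^2;
s(i)     = 2*mu*f/Delta_x - v_x*f + psi(i);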
Example 13.1:
In this example we will develop both a Matlab and a C++ program to solve the
1D first order wave equation:
∂φ/∂t + v ∂φ/∂x = 0    (13.15)
in the domain x ∈ [0, 10], t ∈ [0, 10], with boundary condition φ(0, t) = 1, initial condition φ(x, 0) = e^(−5(x−3)²) + 1, and v = 1.0, and compare the numerical solution with the exact solution:

φ(x, t) = e^(−5(x−vt−3)²) + 1

using the error:

e = φ^l_i − φ(xi , tl )
as our measure of convergence. For the spatial discretization we will use the Finite
Difference method with second order central differences for the first derivative and
for the temporal discretization we will use the fourth order Runge-Kutta method.
The intended learning outcomes for this example will be to ‘get a feel’ for applying
the Finite Difference method and observing the solution of a hyperbolic PDE. Also,
because we are using an explicit method to perform the time integration, we will
have stability constraints and we will investigate this by solving the PDE with some
different spatial step sizes ∆x and temporal step sizes ∆t. Finally, we will look
at how we can replace the Dirichlet boundary condition with a periodic boundary
condition.
So, there are a lot of things to learn in this example. To begin, let’s first confirm
in our minds that we have a well posed problem. Our PDE has two derivative terms
in it and so this translates into requiring two pieces of information in order to obtain
a unique solution, one boundary condition for the spatial derivative and one initial
condition for the temporal derivative. Since we were given both of these, then we
can say that our problem will be well posed. The reason for laboring this point is
that if the solution of a PDE is attempted with too many or too few boundary or
initial conditions it will be ‘doomed’ from the start. So it is always important to
consider these issues before writing any code.
Assuming now that our spatial domain has been broken up into Nx grid points
with spatial step size ∆x, then we can replace the spatial derivative with a second
order central difference and define an ODE at each interior grid point as:
dφ2/dt = −(v/2∆x) (φ3 − φ1)
dφ3/dt = −(v/2∆x) (φ4 − φ2)
dφ4/dt = −(v/2∆x) (φ5 − φ3)
  ...
dφNx−1/dt = −(v/2∆x) (φNx − φNx−2)
For the grid point Nx we can’t use a second order central difference however because
grid point xNx +1 is outside of the spatial domain. What we can do to remedy this
problem however is to use a first order backward difference for the spatial derivative
at this point:
dφNx/dt = −(v/∆x) (φNx − φNx−1)
You might wonder, does it matter if we use a second order accurate spatial discretiza-
tion for most of the points, but then a first order accurate spatial discretization at
this one point? The answer is technically yes, but in terms of both the accuracy and
stability of the solution, it doesn’t make too much difference. We could always use a
second order backward difference if we were concerned about this however. Moving
along, these ODEs can be written in the form:
M φ̇ = Kφ + s
where we have:
     [ φ2  ]   [   0      −v/2∆x     0       · · ·     0     ] [ φ2  ]   [ vφ1/2∆x ]
     [ φ3  ]   [ v/2∆x      0      −v/2∆x    · · ·     0     ] [ φ3  ]   [    0    ]
d/dt [ φ4  ] = [   0      v/2∆x       0      · · ·     0     ] [ φ4  ] + [    0    ]
     [ ... ]   [  ...       ...        ...       ...   −v/2∆x ] [ ... ]   [   ...   ]
     [ φNx ]   [   0        0         0       v/∆x   −v/∆x   ] [ φNx ]   [    0    ]

where the mass matrix M multiplying the left hand side is simply the identity matrix.
It can be observed that both M and K are sparse matrices and because M = I, we
could in fact rewrite our system more simply as:
φ̇ = Kφ + s = f (φ) (13.16)
Now, using the approach we took in implementing the fourth order Runge-Kutta method
in Example 10.1, we will define a function f to evaluate the right hand side of
Equation 13.16 at the various stages of the method. In our Matlab code, this will
take the form:
function k = f(phi)
k = zeros(N_x,1);
for i=2:N_x-1;
k(i) = -v/(2*Delta_x)*(phi(i+1) - phi(i-1));
end
k(N_x) = -v/( Delta_x)*(phi(N_x) - phi(N_x-1));
end
At this point it is worth addressing some practical issues regarding the storage of
phi and the imposition of the Dirichlet boundary condition. Although our system
of ODEs doesn’t include the Dirichlet point φ1 as an unknown, from a programming
point of view it makes the most sense to allocate one array to store all of the discrete
solution, including this point. Because we are stepping forward in time by ‘looping’
over all of the grid points we can simply set the value of the solution at this point
and not include it in the update. The remainder of the algorithm is just the basic
fourth order Runge-Kutta code from Example 10.1:
for l=1:N_t-1
k1 = f(phi(:,l));
k2 = f(phi(:,l) + Delta_t/2*k1);
k3 = f(phi(:,l) + Delta_t/2*k2);
k4 = f(phi(:,l) + Delta_t *k3);
phi(:,l+1) = phi(:,l) + Delta_t *(k1/6 + k2/3 + k3/3 + k4/6);
end
In our C++ program we have the limitation that we can’t add arrays together
‘on the fly’ like we can in Matlab with statements like phi(:,l)+Delta t*k3, so
instead we are going to have to introduce a ‘temporary’ array to store this data. In
fact we will dynamically allocate a total of six arrays to store our field data and the
various stages of the Runge-Kutta method:
double* tempPhi = new double [N_x];
double* k1 = new double [N_x];
double* k2 = new double [N_x];
double* k3 = new double [N_x];
double* k4 = new double [N_x];
double* phi = new double [N_x];
...
In contrast to our Matlab program where our array phi stores the solution for each
grid point at every time step, our C++ program is only going to store the solution
for one time step. This is a more common approach to take in ‘real world’ programs
since it requires far less memory, but means that if we want to keep this solution
for post processing, we must write it to an output file. We’ll get to that shortly, but
for now, moving along, our function f, evaluating the right hand side, will take the
form:
void f(double* k, double* phi)
{
    for(int i=1; i<N_x-1; i++)
    {
        k[i] = -v/(2.0*Delta_x)*(phi[i+1] - phi[i-1]);   // interior points: central difference
    }
    k[N_x-1] = -v/Delta_x*(phi[N_x-1] - phi[N_x-2]);     // last point: backward difference
}

with the remainder of the program following the same structure as the Matlab code. An important point to note is that in our Matlab time marching loop we
are updating phi(:,l+1), where the colon operator implies all of the grid points
at time step l + 1 (including the boundary point φ1 ). The reason that this is not a
problem is because in our function f we didn’t evaluate any of the k values at that
grid point (i.e. we looped through grid points 2 to Nx ). So as long as k(1) is zero
for each k1 , k2 , k3 , k4 then when we update phi(:,l+1) its value won’t change.
So, assuming that phi(1,1) was initialized to 1.0, we will have the correct result.
If we were worried about it, we could always modify our update statement to be
phi(2:N_x,l+1), or add in the line phi(1,l+1)=1.0 inside the time marching loop
to ensure that this is always the case. So it can be observed that there are a few
different ways we can make sure that we impose the boundary condition correctly.
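As an aside, a minimal sketch of what the corresponding C++ time marching loop could look like, using the tempPhi array introduced earlier to form the intermediate stage arguments, is shown below. This is an illustrative sketch under those assumptions, not the listing from Example13_1.cpp.

// Hedged sketch: fourth order Runge-Kutta time marching in C++, storing only
// the current time step in phi and using tempPhi for the stage arguments.
for(int l=0; l<N_t-1; l++)
{
    f(k1, phi);
    for(int i=0; i<N_x; i++) tempPhi[i] = phi[i] + Delta_t/2*k1[i];
    f(k2, tempPhi);
    for(int i=0; i<N_x; i++) tempPhi[i] = phi[i] + Delta_t/2*k2[i];
    f(k3, tempPhi);
    for(int i=0; i<N_x; i++) tempPhi[i] = phi[i] + Delta_t  *k3[i];
    f(k4, tempPhi);
    for(int i=0; i<N_x; i++)
        phi[i] += Delta_t*(k1[i]/6 + k2[i]/3 + k3[i]/3 + k4[i]/6);
    phi[0] = 1.0;   // re-impose the Dirichlet value (unchanged anyway since k[0] = 0)
}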
So, although at this point we have a working algorithm, we have to consider the
fact that the fourth order Runge-Kutta method is an explicit method and hence
we need to make sure that while performing the time integration we are inside the
stability region. As we will see, this places restrictions on both ∆t and ∆x. For
this reason, we will in fact create the matrix K in our program, but it should be
understood that this is only to determine its eigenvalues, not because it is part of
our numerical method. Due to the diagonal nature of K we can create it in Matlab
quite easily by using the diag function, which can create a matrix with terms on,
above, or below the main diagonal:
K = -v/(2*Delta_x) .* (-1.*diag(ones(N_x-1, 1), -1) + diag(ones(N_x-1, 1), 1));
Then, we can compute the eigenvalues as we did in Example 6.2 and plot them
relative to the stability region of the fourth order Runge-Kutta method:
[Xi Lambda] = eig(K);
[X, Y] = meshgrid(-4:0.1:4, -4:0.1:4);
Z = X + i*Y;
sigma = abs(1 + Z + (Z.^2)/2 + (Z.^3)/6 + (Z.^4)/24);
contourf(X, Y, sigma, [1 1]);
hold on;
plot(real(diag(Lambda))*Delta_t, imag(diag(Lambda))*Delta_t, 'x');
From the structure of K it can be shown that:
\[
\lambda_m \Delta t \propto \frac{v\Delta t}{\Delta x}
\]
so we can see that the eigenvalues of K depend upon the velocity and the spatial
and temporal step sizes. We make the important definition:
\[
CFL = \frac{v\Delta t}{\Delta x}
\]
where CFL is known as the Courant-Friedrichs-Lewy number. This parameter
is useful in determining the stability of explicit methods, but it should be noted
that its definition changes depending on the dimensionality of the problem and the
derivatives present in the PDE. The more important point however is that we can
observe that if we decrease ∆x (i.e. add in more grid points) to try and get a more
accurate solution (noting that the error associated with the second order central
difference quotient is proportional to ∆x²) then we find that the CFL number
will increase, reducing the stability of the solution. We therefore also need
to reduce ∆t to maintain stability. So we can’t just decrease ∆x to try and get a
more accurate solution, without reducing ∆t at the same time.
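A simple way to keep an eye on this in practice is to report the CFL number for the chosen step sizes before time marching begins. The snippet below is a hedged sketch assuming v, Delta_t, and Delta_x are already defined in the script; it is not part of the original program.

% Hedged sketch: report the CFL number for the chosen step sizes.
CFL = v*Delta_t/Delta_x;
fprintf('CFL number = %f\n', CFL);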
The complete programs are given in Example13_1.m and Example13_1.cpp. Fig-
ures 13.1(a) - 13.1(b) illustrate the location of the λm ∆t terms for two different com-
binations of ∆x and ∆t, the first with ∆x = 0.05 and ∆t = 0.02, and the second
with ∆x = 0.02 and ∆t = 0.10. In the first combination, all the terms are located
within the stability region, and in the second they are not. The corresponding effect
on the solution is shown in Figures 13.2(a) - 13.2(b). It is easily observed that for
the second combination, the simulation ‘blows up’ after just a couple of time steps,
whereas for a stable solution we see the ‘bell shaped’ initial condition is simply
shifted (or convected) along through the computational domain. This is in essence
what the convective term describes in a PDE. Another observation that can be made
is that all of the eigenvalues of K are purely imaginary, which is a characteristic of
the discretization of the convective term. When we include ‘say’ the diffusive term
in the generic scalar transport equation we find that the discretization will result
in the eigenvalues having real components too. The important point to take away
from this example is that if we are using an explicit method for the time integra-
tion, we need to be careful when choosing our spatial and temporal step sizes. To
illustrate the convergence of the solution, Table 13.1 presents the infinity norm for
a range of spatial and temporal step sizes (maintaining stability of course). As can
be observed, the finer the grid and the smaller the time step size, the lower the error
in the solution (which is of course what we could expect).
Table 13.1: The convergence of the solution, illustrating the infinity norm for a
range of spatial and temporal step sizes.
∆x ∆t ||e||∞
1.000 1.000 0.760540
0.500 0.500 0.706663
0.100 0.100 0.387008
0.050 0.050 0.138828
0.010 0.010 0.005468
0.005 0.005 0.001361
0.001 0.001 0.000054
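As an illustration of how an entry of Table 13.1 might be computed, the sketch below compares the final numerical solution against a hypothetical function phiExact(x, t) returning the analytical solution; both the function name and its existence are assumptions, not part of the original program.

% Hedged sketch: infinity norm of the error at the final time step.
x = (0:N_x-1)'*Delta_x;
e = phi(:, N_t) - phiExact(x, (N_t-1)*Delta_t);
e_inf = norm(e, inf);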
The final thing we are going to do in this example is experiment with how we
impose a periodic boundary condition instead of a Dirichlet boundary condition.
First of all however, an important question we should ask is, does it matter which
Figure 13.1: Location of the λm ∆t terms within the stability region of the fourth
order Runge-Kutta method for the PDE in Example 13.1 for (a) ∆x = 0.05 and
∆t = 0.02 (b) ∆x = 0.02 and ∆t = 0.10. It should be noted that each λm ∆t is
marked as a cross in the complex plane, but the large number of these terms makes
them appear as a solid strip. It can be observed that all of the λm ∆t terms are
purely imaginary.
end of the domain we apply this condition at? The answer to this is, yes it does.
Because v was positive the wave will move in the direction of increasing x. If we
were to try and impose the Dirichlet boundary condition φ(10, t) = 0 then we would
run in to problems as the wave approaches that boundary. The key result is that we
impose the Dirichlet boundary condition on the boundary that the wave is moving
away from, so then in order to impose it at φ(x = 10, t) we would need to make v
negative.
Having said that, we can now look at instead imposing a periodic boundary
condition, which, remember, essentially closes the domain back up on itself so that
there is no boundary. Thinking of our 1D spatial domain in this example as a bead
necklace pulled taut (where each bead represents a grid point), then applying
periodic boundary conditions is analogous to doing up the necklace. The way in
which this modifies our system is only at grid points x1 and xNx . This time,
we can now use second order central differences to get:
Figure 13.2: The solutions to the PDE in Example 13.1 illustrating the solution at
(a) l = 0 and l = 200 for the combination ∆x = 0.05 and ∆t = 0.02 (b) the solution
at l = 0 and l = 13 for the combination ∆x = 0.02 and ∆t = 0.10.
\[
\begin{aligned}
\frac{d\phi_1}{dt} &= -\frac{v}{2\Delta x}\left(\phi_2 - \phi_{N_x}\right)\\
\frac{d\phi_2}{dt} &= -\frac{v}{2\Delta x}\left(\phi_3 - \phi_1\right)\\
\frac{d\phi_3}{dt} &= -\frac{v}{2\Delta x}\left(\phi_4 - \phi_2\right)\\
&\;\;\vdots\\
\frac{d\phi_{N_x-1}}{dt} &= -\frac{v}{2\Delta x}\left(\phi_{N_x} - \phi_{N_x-2}\right)\\
\frac{d\phi_{N_x}}{dt} &= -\frac{v}{2\Delta x}\left(\phi_1 - \phi_{N_x-1}\right)
\end{aligned}
\]
So the field value to the left of φ1 now becomes φNx and the field value to the
right of φNx becomes φ1 . We can in fact implement this in our program very easily
by modifying the function f as:
function k = f(phi)
k = zeros(N_x,1);
k(1) = -v/(2*Delta_x)*(phi(2) - phi(N_x));
for i=2:N_x-1;
k(i) = -v/(2*Delta_x)*(phi(i+1) - phi(i-1));
end
k(N_x) = -v/(2*Delta_x)*(phi(1) - phi(N_x-1));
end
Figures 13.3(a) - 13.3(b) illustrate the solution at a number of time steps for the
case where ∆x = 0.05 and ∆t = 0.02. It can be observed that the bell shaped
curve now exits the domain through the right hand side and immediately re-enters
through the left hand side. An interesting observation that can be made is that the
longer the simulation is run for, the more the bell shaped curve is distorted. This is
the effect of numerical error introduced by the spatial and temporal discretizations
becoming apparent.
Example 13.2:
In this example we will develop both a Matlab and a C++ program to solve the
2D Poisson equation:
∇2 φ + ψ = 0 (13.17)
in the domain x ∈ [0, 1], y ∈ [0, 1], with boundary conditions φ(0, y) = 1, φ(1, y) = 1,
φ(x, 0) = 1, φ(x, 1) = 1, and ψ = 10, and compare the numerical solution with the
exact solution
\[
\phi(x, y) = \sum_{p=1}^{\infty}\sum_{q=1}^{\infty}
\frac{40\left(1-(-1)^p\right)\left(1-(-1)^q\right)}{(p^2+q^2)\,p\,q\,\pi^4}
\sin(p\pi x)\sin(q\pi y) + 1
\]
where we will define the error function
e = φi,j − φ(xi , yj )
and use the infinity norm
Figure 13.3: The solutions to the PDE in Example 13.1 for the combination ∆x =
0.05 and ∆t = 0.02 illustrating the solution at (a) l = 1, (b) l = 350, (c) l = 500
and (d) l = 700.
as our measure of convergence. For the spatial discretization we will use the Finite
Difference method with second order central differences for the second derivatives
and to solve the resulting system of algebraic equations we will use the Gauss-Seidel
method, with the two-norm as our measure of convergence. The intended learning
outcomes for this example are to ‘get a feel’ for applying the Finite Difference
method, to observe the solution of an elliptic PDE, and to investigate the
application of iterative methods to solve the resulting system of equations.
To begin, let’s first confirm in our minds that we have a well posed problem. Our
PDE has two second order derivative terms in it and so this translates into requiring
four pieces of information in order to obtain a unique solution, two boundary con-
ditions for each spatial derivative. Put another way, we need a boundary condition
over every part of the boundary. Since we were given all of these, then we can say
that our problem will be well posed.
Assuming now that our spatial domain has been broken up into Nx data points
in x and Ny data points in y, then we can replace the spatial derivatives with a
second order central difference and define an algebraic equation at each interior
grid point as:
\[
\frac{\phi_{i-1,j} - 2\phi_{i,j} + \phi_{i+1,j}}{\Delta x^2}
+ \frac{\phi_{i,j-1} - 2\phi_{i,j} + \phi_{i,j+1}}{\Delta y^2}
+ \psi_{i,j} = 0
\qquad (13.18)
\]
For the purposes of the example let’s assume that ∆x = ∆y = ∆xy. We can then
rearrange the discrete equation to get:
Kφ = s
For the simple (and unrealistically coarse) case of, say, a 5 × 5 grid, this system would take the form:
\[
\begin{bmatrix}
 4 & -1 &  0 & -1 &  0 &  0 &  0 &  0 &  0\\
-1 &  4 & -1 &  0 & -1 &  0 &  0 &  0 &  0\\
 0 & -1 &  4 &  0 &  0 & -1 &  0 &  0 &  0\\
-1 &  0 &  0 &  4 & -1 &  0 & -1 &  0 &  0\\
 0 & -1 &  0 & -1 &  4 & -1 &  0 & -1 &  0\\
 0 &  0 & -1 &  0 & -1 &  4 &  0 &  0 & -1\\
 0 &  0 &  0 & -1 &  0 &  0 &  4 & -1 &  0\\
 0 &  0 &  0 &  0 & -1 &  0 & -1 &  4 & -1\\
 0 &  0 &  0 &  0 &  0 & -1 &  0 & -1 &  4
\end{bmatrix}
\begin{bmatrix}
\phi_{2,2}\\ \phi_{3,2}\\ \phi_{4,2}\\ \phi_{2,3}\\ \phi_{3,3}\\ \phi_{4,3}\\ \phi_{2,4}\\ \phi_{3,4}\\ \phi_{4,4}
\end{bmatrix}
=
\begin{bmatrix}
\Delta xy^2\,\psi_{2,2} + \phi_{2,1} + \phi_{1,2}\\
\Delta xy^2\,\psi_{3,2} + \phi_{3,1}\\
\Delta xy^2\,\psi_{4,2} + \phi_{4,1} + \phi_{5,2}\\
\Delta xy^2\,\psi_{2,3} + \phi_{1,3}\\
\Delta xy^2\,\psi_{3,3}\\
\Delta xy^2\,\psi_{4,3} + \phi_{5,3}\\
\Delta xy^2\,\psi_{2,4} + \phi_{1,4} + \phi_{2,5}\\
\Delta xy^2\,\psi_{3,4} + \phi_{3,5}\\
\Delta xy^2\,\psi_{4,4} + \phi_{5,4} + \phi_{4,5}
\end{bmatrix}
\qquad (13.19)
\]
Some important observations that we can make here are that similar to Example
13.1 we are dealing with a sparse matrix, but in this case the structure of the matrix
is not as regular. This is most obvious when we look at the terms either side of the
main diagonal, where there is a pattern of alternating 0 and −1, and this occurs
when two sequential finite difference equations have different j indices in the grid.
While we could type out the matrix manually, this is a fairly naive approach to take,
since it will become very time consuming and more error prone as we increase the
resolution of the grid, and it will need to be retyped for every change in Nx and
Ny . What we would like is an algorithm that will allow us to simply input Nx and
Ny and take care of the corresponding size of the arrays for us. We could use the
Matlab function gallery(‘poisson’,N) to construct the matrix, but we are going
to take another approach.
An important point to note is that the vector of unknowns φ is of length (Nx −
2)(Ny − 2) and K is of size (Nx − 2)(Ny − 2) × (Nx − 2)(Ny − 2). If we increase
the number of grid points in each dimension from 5 to a more reasonable resolution
of say Nx = 100 and Ny = 100 then φ would be of the order of 10, 000, but more
importantly K would be of the order of 10, 000×10, 000 i.e. it will be storing around
100,000,000 numbers! Since most of these numbers are zero, explicitly storing the
matrix would be very inefficient. Furthermore, the matrix is actually only storing
two different entries, −1 on the off diagonal entries and 4 on the diagonal entries.
Finding the inverse of a matrix of such a size using one of the direct methods from
Chapter 2 would in general be too computationally intensive. In practice, a much
more common approach is to use an iterative method and for this example we will
be using the Gauss-Seidel method to solve this linear system.
As we will see shortly, we can implement the Gauss-Seidel method in a way
that means that we can actually completely remove the need to explicitly store the
stiffness matrix. To see how this is possible, recall the iterative formula for the
Gauss-Seidel method:
\[
\phi_m^{k+1} = \frac{1}{K_{m,m}}\left(s_m - \sum_{n>m} K_{m,n}\,\phi_n^{k} - \sum_{n<m} K_{m,n}\,\phi_n^{k+1}\right)
\qquad (13.20)
\]
From Example 3.2 we can recall that the algorithm will involve an inner
for loop over each φm to update its value, and an outer while loop to iteratively
repeat this procedure until convergence has been reached. It should also be noted
that specifying the row index m will involve specifying some combination of i and
j indices in our 2D computational grid and in fact we can explicitly evaluate m for
each grid point as:
m = (j − 2)(Nx − 2) + (i − 1)
For example, if i = 2 and j = 2 we get m = 1 (i.e. the first element in φ). From
examination of Equation 13.19 we can see that every Km,m = 4, Km,n = −1, and
every ∆xy 2 ψi,j will also be the same since ψ is a constant throughout the domain
in this example. So although the structure of the matrix isn’t ‘that’ regular, the
elements in the matrix do follow a particular pattern. You might think that it would
be a bit inefficient then to explicitly store an entire 2D array when essentially we
only need to store the two numbers (4 and −1); and in fact you’d be right! Because
of this feature of the stiffness matrix, we can write out the Gauss-Seidel iteration
for each grid point (i.e. each φm ) as:
\[
\phi_{i,j}^{k+1} = \frac{1}{4}\left(\Delta xy^2\,\psi_{i,j} + \phi_{i-1,j}^{k+1} + \phi_{i+1,j}^{k} + \phi_{i,j-1}^{k+1} + \phi_{i,j+1}^{k}\right)
\]
Notice how this is nothing more than a rearrangement of Equation 13.18, but is
also implementing the Gauss-Seidel iteration in Equation 13.20. Furthermore the
stiffness matrix K has gone completely (i.e. we’re not explicitly storing it). We can
now solve the system by iterating over the i, j indices with two nested for loops
(rather than the m indices with one for loop), to update our solution in Matlab as:
while r_norm>tolerance && k<N_k
for i=2:N_x-1
for j=2:N_y-1
phi(i,j) = (Delta_xy^2*psi + phi(i-1,j) + phi(i+1,j) ...
+ phi(i,j-1) + phi(i,j+1))/4;
end
end
for i=2:N_x-1
for j=2:N_y-1
r(i,j) = Delta_xy^2*psi + phi(i-1,j) + phi(i+1,j) ...
+ phi(i,j-1) + phi(i,j+1) - 4*phi(i,j);
end
end
r_norm = sqrt(sum(sum(r.^2)));
k = k + 1;
end
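The loop above assumes that the grid, source term, and iteration controls have already been set up before it runs. A minimal initialization sketch is given below; the particular values are illustrative assumptions, not taken from the original program.

% Hedged sketch of the set-up the Gauss-Seidel loop relies on (values are
% illustrative assumptions only); this would appear before the while loop.
N_x = 101;  N_y = 101;
Delta_xy  = 1/(N_x - 1);   % grid spacing for the unit square domain
psi       = 10;            % constant source term
tolerance = 1e-6;          % convergence tolerance on the residual two-norm
N_k       = 20000;         % maximum number of Gauss-Seidel iterations
phi    = ones(N_x, N_y);   % Dirichlet value of 1 on every boundary
r      = zeros(N_x, N_y);  % residual array
r_norm = 1;                % force at least one iteration
k      = 0;                % iteration counter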
Some important points to note here are that firstly our for loops are
indexing from 2:N_x-1 and 2:N_y-1 because we are not defining a Finite Difference
equation at the Dirichlet boundary points. Secondly, it can be observed that we
are in fact storing our residual as a 2D array as well. This again emphasizes the
difference between conceptually thinking of a residual vector as a column vector, (the
same length as the number of unknown variables in our system) but computationally
storing it in a different manner. Because we have a structured grid, it makes sense
to store φ as a 2D array, and it then makes sense to store r in the same way. In the
above code snippet, when we compute the residual we essentially loop over every
grid point and evaluate:
\[
r_{i,j}^{k+1} = \Delta xy^2\,\psi_{i,j} + \phi_{i-1,j}^{k+1} + \phi_{i+1,j}^{k+1} + \phi_{i,j-1}^{k+1} + \phi_{i,j+1}^{k+1} - 4\phi_{i,j}^{k+1}
\]
which is the component-wise equivalent of evaluating:
\[
r^{k+1} = s - K\phi^{k+1}
\]
By then squaring each ri,j term, adding them all up, and taking the square root,
we then have the two norm. In our C++ program we will dynamically allocate the
field and residual data as 2D arrays so that we are not limited by the stack size,
can use [i][j] indexing, have our arrays contiguous in memory, and minimize the
number of separate memory allocations.
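The original allocation listing is not reproduced here; a minimal sketch of one common approach that satisfies those requirements (a single contiguous block of doubles per field plus an array of row pointers) is shown below. The variable names are assumptions rather than the book's listing.

// Hedged sketch: one contiguous block of N_x*N_y doubles per field, plus an
// array of row pointers so that phi[i][j] indexing works.
double*  phiData = new double [N_x*N_y];
double** phi     = new double*[N_x];
double*  rData   = new double [N_x*N_y];
double** r       = new double*[N_x];
for(int i=0; i<N_x; i++)
{
    phi[i] = &phiData[i*N_y];
    r[i]   = &rData[i*N_y];
}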
Our algorithm implementing the Gauss-Seidel method will then look like:
while(r_norm>tolerance && k<N_k)
{
for(i=1; i<N_x-1; i++)
{
for(j=1; j<N_y-1; j++)
{
phi[i][j] = (Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
+ phi[i][j-1] + phi[i+1][j]) / 4;
}
}
r_norm = 0.0;
for(i=1; i<N_x-1; i++)
{
for(j=1; j<N_y-1; j++)
{
r[i][j] = Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
+ phi[i][j-1] + phi[i+1][j] - 4*phi[i][j];
r_norm += r[i][j]*r[i][j];
}
}
r_norm = sqrt(r_norm);
k++;
}
Table 13.2: The convergence of the solution, illustrating the infinity norm for a
range of spatial step sizes.
∆x ∆y ||e||∞
0.50 0.50 0.111714
0.10 0.10 0.005730
0.05 0.05 0.001448
0.01 0.01 0.000184
Figure 13.4: The solutions to the PDE in Example 13.2 illustrating the solution at
iterations (a) k = 1, (b) k = 100, (c) k = 1, 000, and (d) k = 5, 000 with Nx = 65
and Ny = 65.
Example 13.3:
In this example we will develop a Matlab program to solve the 2D generic scalar
transport equation:
φ̇ + v · ∇φ = µ∇2 φ + ψ (13.21)
in the domain x ∈ [0, 1], y ∈ [0, 1], t ∈ [0, 2], with boundary conditions φ(0, y) = 1,
φ(x, 0) = 1, ∂x φ(1, y) = 0, ∂y φ(x, 1) = 0, initial condition φ(x, y, 0) = e^{−50(x−0.3)²} +
1, and v = {0.5, 0.5}, µ = 0.01, and ψ = 0.2. For the spatial discretization we will
use the Finite Difference method with second order central differences for the first
and second derivatives and for the temporal discretization we will use the implicit
Euler method. The intended learning outcomes for this example will be to observe
the solution of a parabolic PDE and to investigate the concept of ‘assembling’ the
matrix defining our system of equations. As we will see later in the book, this idea
will be used with many other numerical methods.
To begin, let’s first confirm in our minds that we have a well posed problem.
Our PDE has two second order spatial derivative terms and one first order temporal
derivative in it and so this translates into requiring five pieces of information in or-
der to obtain a unique solution, two boundary conditions for each spatial derivative
plus one initial condition. Remember that because it is the higher order spatial
derivatives that determine the nature of the PDE, the first derivatives in the con-
vective term aren’t really important here. Since we were given all of these bits of
information, then we can say that our problem will be well posed.
Assuming now that our spatial domain has been broken up into Nx data points
in x and Ny data points in y, then we can apply our spatial discretization; that is
the Finite Difference method, meaning that we replace the spatial derivatives with
a second order central difference and define an ODE at each interior grid point as:
\[
\frac{d\phi_{i,j}}{dt}
+ \frac{v_x}{2\Delta x}\left(\phi_{i+1,j}-\phi_{i-1,j}\right)
+ \frac{v_y}{2\Delta y}\left(\phi_{i,j+1}-\phi_{i,j-1}\right)
= \frac{\mu}{\Delta x^2}\left(\phi_{i-1,j}-2\phi_{i,j}+\phi_{i+1,j}\right)
+ \frac{\mu}{\Delta y^2}\left(\phi_{i,j-1}-2\phi_{i,j}+\phi_{i,j+1}\right)
+ \psi
\]
Collecting coefficients of each grid point we get:
\[
\frac{d\phi_{i,j}}{dt}
= \left(\frac{\mu}{\Delta y^2} + \frac{v_y}{2\Delta y}\right)\phi_{i,j-1}
+ \left(\frac{\mu}{\Delta x^2} + \frac{v_x}{2\Delta x}\right)\phi_{i-1,j}
- \left(\frac{2\mu}{\Delta x^2} + \frac{2\mu}{\Delta y^2}\right)\phi_{i,j}
+ \left(\frac{\mu}{\Delta y^2} - \frac{v_y}{2\Delta y}\right)\phi_{i,j+1}
+ \left(\frac{\mu}{\Delta x^2} - \frac{v_x}{2\Delta x}\right)\phi_{i+1,j}
+ \psi
\]
This system of ODEs can again be written in matrix form as:
M φ̇ = Kφ + s
where again M is the identity matrix. Then, applying the implicit Euler method we
get:
\[
M\,\frac{\phi^{l+1}-\phi^{l}}{\Delta t} = K\phi^{l+1} + s
\]
which can be rearranged to:
\[
A\phi^{l+1} = b
\]
meaning that we will have to solve a system of equations at every time step, where:
\[
A = M - \Delta t K, \qquad b = M\phi^{l} + \Delta t\, s
\]
So the only real ‘trick’ in this example is how we define the matrices. In contrast to
the previous two examples, this time we are actually going to explicitly store the mass
and stiffness matrices, M and K. Because these matrices will be quite sparse, we
are going to use the sparse function in Matlab to allocate memory for them as:
N_p = N_x * N_y;
K = sparse(N_p, N_p);
M = sparse(eye(N_p));
where Np is the number of grid points. For the stiffness matrix we are simply allo-
cating memory in this code snippet, whereas for the mass matrix we are creating an
Np ×Np identity matrix and then converting it from a full to a sparse representation.
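As a side note (this is general Matlab advice rather than something done in the book's program): the sparse identity can be built directly with speye(N_p), and inserting entries one at a time into a matrix created with sparse(N_p, N_p) can become slow for large systems, because the stored nonzeros have to be reshuffled as the matrix fills in. A common alternative is to accumulate row, column, and value triplets during assembly and then build K with a single call to sparse, for example:

% Hedged sketch of triplet-based assembly (an alternative, not the book's code).
rows = []; cols = []; vals = [];
% inside the assembly loops, instead of K(m,n) = K(m,n) + coeff, one would do:
%   rows(end+1) = m;  cols(end+1) = n;  vals(end+1) = coeff;
K = sparse(rows, cols, vals, N_p, N_p);   % duplicate (m,n) entries are summed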
To continue now with how we actually add all of the required information into these
matrices, we will in fact loop over every grid point and add the coefficients of the
algebraic equation defined at that point to the overall system of equations. Let’s use
the indices m and n to denote a row and column within K. As we did in Example
13.2, for any given grid point index i, j we can ‘map’ to the corresponding row index,
this time as:
m = (j − 1)Nx + i
Now for each i, j grid point there is a connection with the neighboring i ± 1, j and
i, j ± 1 grid points, and we define the corresponding column indices in K as
n = (j − 1)Nx + (i ± 1) and n = (j ± 1 − 1)Nx + i respectively.
This means that in our loop over every grid point we will be adding in to the system:
\[
\begin{aligned}
K_{m,m} &= -\left(\frac{2\mu}{\Delta x^2} + \frac{2\mu}{\Delta y^2}\right) &&\text{grid point } i,j\\
K_{m,n} &= \frac{\mu}{\Delta y^2} + \frac{v_y}{2\Delta y} &&\text{grid point } i,j-1\\
K_{m,n} &= \frac{\mu}{\Delta x^2} + \frac{v_x}{2\Delta x} &&\text{grid point } i-1,j\\
K_{m,n} &= \frac{\mu}{\Delta y^2} - \frac{v_y}{2\Delta y} &&\text{grid point } i,j+1\\
K_{m,n} &= \frac{\mu}{\Delta x^2} - \frac{v_x}{2\Delta x} &&\text{grid point } i+1,j\\
s_m &= \psi &&\text{grid point } i,j
\end{aligned}
\]
We will in fact create a function called assemble which we will call at the beginning
of our simulation, before entering the time marching loop. Similar functions will
be duplicated when we come to solving this same problem using other numerical
methods such as the Finite Volume or Finite Element methods, in order to highlight
the similarities and differences between the numerical methods. In this particular
case, our function could be implemented as:
function [M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, N_x, N_y, ...
N_p, Delta_x, Delta_y)
...
for j=2:N_y-1
for i=2:N_x-1
m = (j -1)*N_x+(i -1)+1;
K(m, m) = K(m, m) - 2*mu/(Delta_x^2) - 2*mu/(Delta_y^2);
n = (j-1-1)*N_x + i;
K(m, n) = K(m, n) + mu/(Delta_y^2) + v(2)/(2*Delta_y);
n = (j-1 )*N_x + i-1;
K(m, n) = K(m, n) + mu/(Delta_x^2) + v(1)/(2*Delta_x);
n = (j-1 )*N_x + i+1;
K(m, n) = K(m, n) + mu/(Delta_x^2) - v(1)/(2*Delta_x);
n = (j+1-1)*N_x + i;
K(m, n) = K(m, n) + mu/(Delta_y^2) - v(2)/(2*Delta_y);
s(m) = psi(i,j);
end
end
...
end
The only issue that we have not considered yet is what to do for the boundary grid
points, because as it stands this code snippet is only looping over the interior grid
points and adding their contributions to M and K. To match up with the dimension
of our matrices, we are going to store our field in a 2D array as:
phi = zeros(N_p, N_t);
so our vector of unknowns will actually include the Dirichlet boundary points, even
though, technically they are not unknown variables. Let’s first look at how we
handle the Neumann boundary points. Returning to the idea mentioned earlier,
the Neumann boundary condition will give us an extra equation that we can use to
modify the discrete equation at that boundary point. If we think about the x = 1
boundary we have:
\[
\frac{\partial \phi_{N_x,j}}{\partial x} = \nabla\phi_b = 0
\]
If we again approximate this derivative with a second order central difference we
get:
\[
\frac{d\phi_{N_x,j}}{dt}
= \left(\frac{\mu}{\Delta y^2} + \frac{v_y}{2\Delta y}\right)\phi_{N_x,j-1}
+ \frac{2\mu}{\Delta x^2}\,\phi_{N_x-1,j}
- \left(\frac{2\mu}{\Delta x^2} + \frac{2\mu}{\Delta y^2}\right)\phi_{N_x,j}
+ \left(\frac{\mu}{\Delta y^2} - \frac{v_y}{2\Delta y}\right)\phi_{N_x,j+1}
+ \left(\frac{\mu}{\Delta x^2} - \frac{v_x}{2\Delta x}\right)2\Delta x\,\nabla\phi_b
+ \psi
\]
where the coefficients of φNx +1,j have been added to the coefficients of φNx −1,j and the
term involving ∇φb will be incorporated into the load vector along with ψ. We can
analogously do the same thing at the y = 1 boundary, adding the coefficients
of φi,Ny +1 to those of φi,Ny −1 :
\[
\frac{d\phi_{i,N_y}}{dt}
= \frac{2\mu}{\Delta y^2}\,\phi_{i,N_y-1}
+ \left(\frac{\mu}{\Delta x^2} + \frac{v_x}{2\Delta x}\right)\phi_{i-1,N_y}
- \left(\frac{2\mu}{\Delta x^2} + \frac{2\mu}{\Delta y^2}\right)\phi_{i,N_y}
+ \left(\frac{\mu}{\Delta x^2} - \frac{v_x}{2\Delta x}\right)\phi_{i+1,N_y}
+ \left(\frac{\mu}{\Delta y^2} - \frac{v_y}{2\Delta y}\right)2\Delta y\,\nabla\phi_b
+ \psi
\]
At the corner grid point (Nx , Ny ), where the two Neumann boundaries meet, both modifications are combined:
\[
\frac{d\phi_{N_x,N_y}}{dt}
= \frac{2\mu}{\Delta y^2}\,\phi_{N_x,N_y-1}
+ \frac{2\mu}{\Delta x^2}\,\phi_{N_x-1,N_y}
- \left(\frac{2\mu}{\Delta x^2} + \frac{2\mu}{\Delta y^2}\right)\phi_{N_x,N_y}
+ \left(\frac{\mu}{\Delta x^2} - \frac{v_x}{2\Delta x}\right)2\Delta x\,\nabla\phi_b
+ \left(\frac{\mu}{\Delta y^2} - \frac{v_y}{2\Delta y}\right)2\Delta y\,\nabla\phi_b
+ \psi
\]
So, to add the contributions of the Neumann boundary points to our system of
equations, we can append some code to our assemble function to loop through
all of the Neumann boundary points and add their contributions to K and s as:
function [M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, N_x, N_y, ...
N_p, Delta_x, Delta_y)
...
gradphi_b = 0;
...
for i=2:N_x-1
j = N_y;
m = (j -1)*N_x+(i -1)+1;
K(m, m) = K(m, m) - 2*mu/(Delta_x^2) - 2*mu/(Delta_y^2);
n = (j-1-1)*N_x + i;
K(m, n) = K(m, n) + 2*mu/(Delta_y^2);
n = (j-1 )*N_x + i-1;
K(m, n) = K(m, n) + mu/(Delta_x^2) + v(1)/(2*Delta_x);
n = (j-1 )*N_x + i+1;
K(m, n) = K(m, n) + mu/(Delta_x^2) - v(1)/(2*Delta_x);
s(m) = s(m) + (mu/(Delta_y^2) - v(2)/(2*Delta_y))*(2*Delta_y*gradphi_b) + psi;
end
for j=2:N_y-1
i = N_x;
m = (j -1)*N_x+(i -1)+1;
K(m, m) = K(m, m) - 2*mu/(Delta_x^2) - 2*mu/(Delta_y^2);
n = (j-1-1)*N_x + i;
K(m, n) = K(m, n) + mu/(Delta_y^2) + v(2)/(2*Delta_y);
n = (j-1 )*N_x + i-1;
K(m, n) = K(m, n) + 2*mu/(Delta_x^2);
n = (j+1-1)*N_x + i;
K(m, n) = K(m, n) + mu/(Delta_y^2) - v(2)/(2*Delta_y);
s(m) = s(m) + (mu/(Delta_x^2) - v(1)/(2*Delta_x))*(2*Delta_x*gradphi_b) + psi;
end
i = N_x;
j = N_y;
m = (j -1)*N_x+(i -1)+1;
K(m, m) = K(m, m) - 2*mu/(Delta_x^2) - 2*mu/(Delta_y^2);
n = (j-1-1)*N_x + i;
K(m, n) = K(m, n) + 2*mu/(Delta_y^2);
n = (j-1 )*N_x + i-1;
K(m, n) = K(m, n) + 2*mu/(Delta_x^2);
s(m) = s(m) + (mu/(Delta_x^2) - v(1)/(2*Delta_x))*(2*Delta_x*gradphi_b) ...
            + (mu/(Delta_y^2) - v(2)/(2*Delta_y))*(2*Delta_y*gradphi_b) + psi;
...
end
Finally, for the Dirichlet boundary points, we will not add any contributions to K,
or s, but will instead set the values for those points in the array phi. We can do
this for every time step quite elegantly in Matlab and will do so by assigning the
following two lines of code in our assemble function:
function [M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, N_x, N_y, ...
N_p, Delta_x, Delta_y)
...
phi_b = 1;
...
phi(1:N_x:N_p, :) = phi_b;
phi(1:N_x, :) = phi_b;
end
At this point the system is completely assembled and Figure 13.5 illustrates a portion
of the stiffness matrix.
If we conceptually partition the unknowns into the ‘free’ points (interior and Neumann
boundary points, denoted by the subscript f) and the ‘fixed’ Dirichlet points (subscript c),
then at each time step we only need to solve the free part of the system, with the known
Dirichlet contributions moved to the right hand side:
\[
A_{ff}\,\phi_f^{l+1} = b_f - A_{fc}\,\phi_c^{l+1}
\]
and of course if the Dirichlet boundary conditions are imposing a value of 0 then
the second term on the right hand side will drop out as well. This idea is known as
partitioning a matrix and is an idea that we will use again in Chapter 15 when we
study the Finite Element method. While this idea is nice conceptually, the Dirichlet
boundary points will be ‘scattered’ throughout phi, not located contiguously at the
end. What we can do however is create an explicit list of all of the indices of all of
the Dirichlet boundary points:
Fixed = [1:N_x, 1:N_x:N_p];
and then compute the F ree points as the difference between all of the indices in the
range 1:N_p and the Dirichlet indices with the Matlab function setdiff as:
Free = setdiff(1:N_p, Fixed);
so that the linear solve shown below is only performed for the interior and Neumann
boundary points; as long as the Dirichlet boundary points are initialized to 1 in phi, we
will have imposed the boundary conditions correctly. It is worth mentioning for the
sake of completeness that since we ultimately end up solving Aφl+1 = b at every
time step, and since the rows of A corresponding to the Dirichlet boundary points
will be all zeros while the entries in b will in fact contain the Dirichlet boundary
values, then another way to incorporate the effect of the Dirichlet boundary points
is to put a 1 on the main diagonal for any row corresponding to a Dirichlet boundary
point before performing the linear solve. Although this will enforce the boundary
conditions, a disadvantage is that it can result in A becoming badly scaled if the
other entries are much greater or less than 1. Now that we have shown how we
assemble our system of equations and solve for them at each time step, we are now
in a position to write out the core of the program as:
[M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, N_x, N_y, ...
                                       N_p, Delta_x, Delta_y);
A = M - Delta_t*K;
for l=1:N_t-1
b = M*phi(:,l) + Delta_t*s;
phi(Free,l+1) = A(Free,Free)\(b(Free) - A(Free,Fixed)*phi(Fixed,l+1));
end
Figure 13.6: The solutions to the PDE in Example 13.3 illustrating the solution at
(a) t = 0, (b) t = 0.5, (c) t = 1.0 and (d) t = 1.5.
13.3 Von Neumann Stability Analysis
Having now seen a number of examples using the finite difference method to solve
a PDE it is time to devote some attention to analyzing the stability of a simulation
a little more closely. So far we have seen that our PDE can be broken down into a
system of ODEs of the form:
M φ̇ = Kφ + s
where M = I and depending upon the nature of the problem s may be zero, so it is
not too restrictive to consider problems of the form:
φ̇ = Kφ
Now, in Example 13.1 we saw that we could determine whether the numerical
method was going to be stable by finding the eigenvalues of K and observing where
each λm ∆t lay in the complex plane, relative to the stability region of the time
marching method that we were using for the simulation. This technique is known as
Matrix Stability Analysis and strictly speaking, that is all we need to do in order to
determine whether or not the simulation will be stable. However, for most practical
problems, the size of the system of equations that we are solving means that it’s not
feasible to compute the eigenvalues of K. So we need a better approach. What we
would like to be able to do is to know if our simulation will be stable or not, or put
another way, given a grid spacing, what time step size do we need in order to ensure
that the simulation will be stable when using a conditionally stable time marching
method. We will look at two such techniques for doing just this.
To illustrate the first of these techniques, the von Neumann stability analysis, consider the 1D diffusion equation:
\[
\frac{\partial \phi}{\partial t} = \mu\frac{\partial^{2} \phi}{\partial x^{2}}
\]
If we were to use a second order central difference to discretize the diffusive term,
we would get the semi-discretization:
\[
\frac{d\phi_i}{dt} = \frac{\mu}{\Delta x^2}\left(\phi_{i+1} - 2\phi_i + \phi_{i-1}\right)
\]
Furthermore, if we were to then use the explicit Euler method for the time marching,
we would get the full discretization:
\[
\phi_i^{l+1} = \phi_i^{l} + \frac{\mu\Delta t}{\Delta x^2}\left(\phi_{i+1}^{l} - 2\phi_i^{l} + \phi_{i-1}^{l}\right)
\qquad (13.23)
\]
The basic idea behind the von Neumann stability analysis is that we assume a
solution of the form:
\[
\phi_i^{l} = \sigma^{l} e^{ikx_i} \qquad (13.24)
\]
where k is a wavenumber and σ is an amplification factor per time step. Substituting
this form into Equation 13.23 gives:
\[
\sigma^{l+1} e^{ikx_i} = \sigma^{l} e^{ikx_i}
+ \frac{\mu\Delta t}{\Delta x^2}\left(\sigma^{l} e^{ikx_{i+1}} - 2\sigma^{l} e^{ikx_i} + \sigma^{l} e^{ikx_{i-1}}\right)
\]
Noting that xi+1 = xi + ∆x and xi−1 = xi − ∆x we can then get:
\[
\begin{aligned}
\sigma &= 1 + \frac{\mu\Delta t}{\Delta x^2}\left(\cos(k\Delta x) + i\sin(k\Delta x) - 2 + \cos(k\Delta x) - i\sin(k\Delta x)\right)\\
&= 1 + \frac{2\mu\Delta t}{\Delta x^2}\left(\cos(k\Delta x) - 1\right)
\end{aligned}
\]
Now in Part II we learned that for stability, we must have |σ| ≤ 1, otherwise σ l will
grow unbounded, and so:
\[
\left|\,1 + \frac{2\mu\Delta t}{\Delta x^2}\left(\cos(k\Delta x) - 1\right)\right| \le 1
\]
In other words, we must have:
\[
-1 \le 1 + \frac{2\mu\Delta t}{\Delta x^2}\left(\cos(k\Delta x) - 1\right) \le 1
\]
The right hand inequality is always satisfied since (cos(k∆x) − 1) is always less than
or equal to zero. The left hand inequality can then be recast as:
\[
\frac{2\mu\Delta t}{\Delta x^2}\left(\cos(k\Delta x) - 1\right) \ge -2
\]
or:
\[
\Delta t \le \frac{\Delta x^2}{\mu\left(1 - \cos(k\Delta x)\right)}
\]
The worst (or most restrictive) case occurs when cos(k∆x) = −1. Thus the maxi-
mum time step size that we can choose in order for the simulation to remain stable
is:
\[
\Delta t \le \frac{\Delta x^2}{2\mu}
\]
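As a concrete illustration (using hypothetical values rather than ones taken from an example in this chapter), if µ = 0.01 and ∆x = 0.05, this bound gives ∆t ≤ 0.05²/(2 × 0.01) = 0.125.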
The von Neumann stability analysis works whenever the space dependent terms
are eliminated after substituting the periodic form of the solution given in Equation
13.24. For example, if µ was not constant but some function of x say, then the
von Neumann analysis would not in general work. In this case σ would have to
be a function of x and the simple solution we obtained would no longer be valid.
The same problem would arise if a non-uniformly spaced spatial grid were used. Of
course in these cases the matrix stability analysis would still work.
13.4 Modified Wavenumber Analysis
The second technique, known as modified wavenumber analysis, can again be illustrated with the 1D diffusion equation:
\[
\frac{\partial \phi}{\partial t} = \mu\frac{\partial^{2} \phi}{\partial x^{2}}
\]
The basic idea behind the procedure is that we assume a separable solution of the
form:
\[
\phi(x, t) = \psi(t)\,e^{ikx} \qquad (13.25)
\]
Substituting this form into the diffusion equation and performing the spatial
differentiation analytically gives:
\[
\frac{d\psi}{dt} = -\mu k^{2}\psi \qquad (13.26)
\]
which we could in fact solve analytically. In the assumed form of the solution, k is
known as the wavenumber. In practice, instead of using the analytical differentiation
that led to Equation 13.26, we use a finite difference method to approximate the
spatial derivative. For example, using a second order central difference, we have:
\[
\frac{d\phi_i}{dt} = \frac{\mu}{\Delta x^2}\left(\phi_{i+1} - 2\phi_i + \phi_{i-1}\right)
\]
If we now substitute in our assumed solution φi = ψ(t)e^{ikx_i} then we get:
\[
e^{ikx_i}\frac{d\psi}{dt} = \frac{\mu}{\Delta x^2}\left(\psi e^{ikx_{i+1}} - 2\psi e^{ikx_i} + \psi e^{ikx_{i-1}}\right)
\]
As with the von Neumann stability analysis we can note that xi+1 = xi + ∆x and
xi−1 = xi − ∆x and get:
\[
e^{ikx_i}\frac{d\psi}{dt} = \frac{\mu}{\Delta x^2}\left(\psi e^{ikx_i}e^{ik\Delta x} - 2\psi e^{ikx_i} + \psi e^{ikx_i}e^{-ik\Delta x}\right)
\]
and then dividing both sides by eikxi we get:
\[
\frac{d\psi}{dt} = \frac{\mu}{\Delta x^2}\left(\psi e^{ik\Delta x} - 2\psi + \psi e^{-ik\Delta x}\right)
\]
Then, making use of Euler’s formula we get:
\[
\begin{aligned}
\frac{d\psi}{dt} &= \frac{\mu}{\Delta x^2}\left(\cos(k\Delta x) + i\sin(k\Delta x) - 2 + \cos(k\Delta x) - i\sin(k\Delta x)\right)\psi\\
&= \frac{2\mu}{\Delta x^2}\left(\cos(k\Delta x) - 1\right)\psi\\
&= -\frac{2\mu}{\Delta x^2}\left(1 - \cos(k\Delta x)\right)\psi
\end{aligned}
\]
or put another way:
\[
\frac{d\psi}{dt} = -\mu k^{*2}\psi \qquad (13.27)
\]
where:
\[
k^{*} = \sqrt{\frac{2}{\Delta x^2}\left(1 - \cos(k\Delta x)\right)}
\]
By analogy to Equation 13.26, k ∗ is called the modified wavenumber. The important
point here is that in comparing Equations 13.26 and 13.27 we can see that the use
of the central difference means that we are now solving a different ODE. Therefore our
simulation will most definitely give us results that differ from the analytical solution.
Another key observation is that Equation 13.27 fits the form of the model ordinary
differential equation in Equation 6.4 with λ = −µk ∗2 . In Part II we investigated
the stability properties of various numerical methods for ODEs with respect to the
model initial value problem in Equation 6.4. Now, using the modified wavenumber
analysis, we can readily obtain the stability properties of any of those time marching
methods when applied to a PDE. All we have to do is replace λ with −µk ∗2 in our
stability analysis. The application of any other finite difference quotient (instead
of the second order central difference used here) will also lead to the same form
as Equation 13.27, but with a different modified wavenumber. In fact each finite
difference quotient has a distinct modified wavenumber associated with it.
If we were to use the explicit Euler method to solve Equation 13.27 then because
the modified wavenumbers are all real we can use the result in Equation 6.6:
\[
\Delta t \le \frac{2}{\left|\lambda_{Re}\right|}
\]
to get:
\[
\Delta t \le \frac{2}{\dfrac{2\mu}{\Delta x^2}\left(1 - \cos(k\Delta x)\right)}
\]
The ‘worst case scenario’ (i.e. the maximum limitation on the time step size) occurs
when cos(k∆x) = −1, leading to:
\[
\Delta t \le \frac{\Delta x^2}{2\mu}
\]
which is exactly the same as that obtained with the von Neumann analysis. The
advantage of the modified wavenumber analysis however, is that the stability lim-
its for different time marching methods applied to the same equation are readily
obtained. For example, if instead of the explicit Euler method we had used the fourth
order Runge-Kutta method, whose stability region extends to approximately −2.79 along
the negative real axis, the same reasoning gives the (less restrictive) limit of approximately
∆t ≤ 2.79∆x²/(4µ).
This analysis is also not restricted to the second order central difference. Writing a
general finite difference quotient for the n-th derivative of φ at grid point i in the form:
\[
\frac{\partial^n \phi_i}{\partial x^n} \approx \frac{1}{\Delta x^n}\left(\sum_{m=1}^{M} a_{-N_m}\,\phi_{i-m} + a_0\,\phi_i + \sum_{m=1}^{M} a_{+N_m}\,\phi_{i+m}\right) \qquad (13.28)
\]
We can then substitute in our assumed form of the solution (Equation 13.25) to get:
\[
\frac{\partial^n \phi_i}{\partial x^n} = \frac{1}{\Delta x^n}\left(\sum_{m=1}^{M} a_{-N_m}\,\psi e^{ikx_{i-m}} + a_0\,\psi e^{ikx_i} + \sum_{m=1}^{M} a_{+N_m}\,\psi e^{ikx_{i+m}}\right)
\]
We can then make use of the fact that xi+m = xi + Nm ∆x and xi−m = xi − Nm ∆x:
\[
\frac{\partial^n \phi_i}{\partial x^n} = \frac{1}{\Delta x^n}\left(\sum_{m=1}^{M} a_{-N_m}\,\psi e^{ikx_i}e^{-ikN_m\Delta x} + a_0\,\psi e^{ikx_i} + \sum_{m=1}^{M} a_{+N_m}\,\psi e^{ikx_i}e^{ikN_m\Delta x}\right)
\]
and note that when this partial derivative expression is substituted into a PDE we
will be able to divide both sides by eikxi to get:
\[
\begin{aligned}
\frac{\partial^n \phi_i}{\partial x^n} &= \frac{1}{\Delta x^n}\left(\sum_{m=1}^{M} a_{-N_m}\,\psi e^{-ikN_m\Delta x} + a_0\,\psi + \sum_{m=1}^{M} a_{+N_m}\,\psi e^{ikN_m\Delta x}\right)\\
&= \frac{1}{\Delta x^n}\left(\sum_{m=1}^{M} a_{-N_m}\,e^{-ikN_m\Delta x} + a_0 + \sum_{m=1}^{M} a_{+N_m}\,e^{ikN_m\Delta x}\right)\psi\\
&= k^{*n}\,\psi
\end{aligned}
\]
Let’s now consider some specific finite difference quotients for the first derivative
of a function (i.e. n = 1) and calculate the modified wavenumbers. If we consider
first the forward difference we know that M = 1, a−1 = 0, a0 = −1, and a+1 = 1
and so substituting these coefficients into Equations 13.28 we get:
\[
\begin{aligned}
k^{*} &= \frac{1}{\Delta x}\left(0\times e^{-ik\Delta x} - 1 + 1\times e^{ik\Delta x}\right)\\
&= \frac{1}{\Delta x}\left(\cos(k\Delta x) - 1 + i\sin(k\Delta x)\right)
\end{aligned}
\]
So we can see that in this case, the modified wavenumber is complex. If we consider
the backward difference we know that M = 1, a−1 = −1, a0 = 1, and a+1 = 0.
Substituting these coefficients into Equations 13.28 we get:
\[
\begin{aligned}
k^{*} &= \frac{1}{\Delta x}\left(-1\times e^{-ik\Delta x} + 1 + 0\times e^{ik\Delta x}\right)\\
&= \frac{1}{\Delta x}\left(1 - \cos(k\Delta x) + i\sin(k\Delta x)\right)
\end{aligned}
\]
If we consider the second order central difference for the first derivative we know
that M = 1, a−1 = −1/2, a0 = 0 and a+1 = 1/2. Substituting these coefficients into
Equations 13.28 we get:
\[
\begin{aligned}
k^{*} &= \frac{1}{\Delta x}\left(-\tfrac{1}{2}e^{-ik\Delta x} + 0 + \tfrac{1}{2}e^{ik\Delta x}\right)\\
&= \frac{1}{\Delta x}\left(-\tfrac{1}{2}\cos(k\Delta x) + \tfrac{i}{2}\sin(k\Delta x) + \tfrac{1}{2}\cos(k\Delta x) + \tfrac{i}{2}\sin(k\Delta x)\right)\\
&= \frac{1}{\Delta x}\,i\sin(k\Delta x)
\end{aligned}
\]
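As an illustration of how curves like those in Figure 13.7 can be generated, a minimal Matlab sketch for the second order central difference is given below; this is an assumed sketch, not the program used to produce the figure.

% Hedged sketch: plot the magnitude of the (purely imaginary) modified
% wavenumber of the 2nd order central difference against k*Delta_x.
kDx = linspace(0, pi, 200);   % values of k*Delta_x
kStarDx = sin(kDx);           % |k* Delta_x| for the 2nd order central difference
plot(kDx, kDx, 'k--', kDx, kStarDx, 'b-');
xlabel('k\Deltax'); ylabel('k^*\Deltax');
legend('exact', '2nd order central difference', 'Location', 'northwest');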
Figure 13.7: A plot of the modified wavenumbers versus wavenumbers for some central
differencing methods. Note that the first order forward and backward methods have
the same wavenumber as the second order central difference method.
Figure 13.7 illustrates plots of (k∗∆x)n versus (k∆x)n for three different central
difference quotients defined for a first order derivative. The exact curve illustrates the
analytical solution where there is no error introduced by the use of a finite difference
method. For the other methods, we can observe that at lower values of (k∆x)n
all methods are reasonably close to the exact curve, but as (k∆x)n increases they
deviate away from the exact. Furthermore we can observe that the higher order
methods remain better approximations at larger values of (k∆x)n . The important
point here is that when applying a finite difference method to solve a PDE, ∆x
must be chosen (in conjunction with the particular finite difference method) such
that (k∆x)n is small enough that a reasonable approximation of the spatial
derivatives is achieved.
To elaborate on this point, let’s examine the effects of different finite difference
methods by considering the 1D first order wave equation of Example 13.1:
\[
\frac{\partial \phi}{\partial t} + v\frac{\partial \phi}{\partial x} = 0
\]
Using the second order central difference for the spatial derivative and substituting the
assumed form of the solution, we obtain:
\[
\frac{d\psi}{dt} = -\frac{v}{\Delta x}\left(i\sin(k\Delta x)\right)\psi
\]
If on the other hand we were seeking an analytical solution, we would have:
\[
\frac{d\psi}{dt} = -vik\,\psi
\]
Comparing these two solutions, we have vk∗ in place of vk, and we can then define
the effective propagation speed as:
\[
v^{*} = v\,\frac{\sin(k\Delta x)}{k\Delta x}
\]
For low frequency waves k∗/k ≈ 1 and so v∗ ≈ v. For higher frequency waves, however,
sin(k∆x)/(k∆x) < 1, so v∗ < v and these components of the solution propagate more
slowly than they should; this numerical dispersion is another of the errors introduced by
the spatial discretization.
Chapter 14
Finite Volume Methods
We saw in Chapter 13 that the Finite Difference method required a regular structured
grid (Figure 12.2(a)) over which to discretize the PDE, and so the continuous
scalar field variable φ is then approximated at a collection of grid points. Finite
Volume methods on the other hand break up a domain into a collection of cells
(or volumes) (Figure 12.2(b)) and then a spatial discretization of the original PDE
is performed over each cell. The method is still a local method in that the spatial
discretization at a point will only involve a few immediate neighbors, but Finite Vol-
ume methods offer perhaps the greatest flexibility in terms of the types and mixture
of grids which may be used to represent a computational domain. Furthermore, an
important point to note is that Finite Volume methods tend to be either cell based
or node based, depending upon whether the field variable is stored at the centroid
or the vertices of the cells defining the grid respectively. In this book we will only
concern ourselves with cell based Finite Volume methods (because this is the more
common variation) and we will assume that our spatial domain has been broken up
into a finite sized tessellation of non-overlapping triangles, which completely cover
the domain (i.e. we have a suitable unstructured grid). To begin our derivation, it
will help to consider the arbitrarily shaped spatial domain depicted in Figure 12.1.
Within this domain we will be solving the generic scalar transport equation:
φ̇ + v · ∇φ = µ∇2 φ + ψ
The basic idea behind the Finite Volume method is that we integrate this PDE over
the spatial domain as:
\[
\int_{\Omega}\left(\dot{\phi} + v\cdot\nabla\phi\right) d\Omega
= \int_{\Omega}\left(\mu\nabla^{2}\phi + \psi\right) d\Omega \qquad (14.1)
\]
Now while you may be worried at this point as to how one actually evaluates such
a nasty looking integral, the reality is that we don’t actually try to perform the
integration directly. Instead at this point we make use of a very important theorem
known as Gauss’s Divergence Theorem [12]:
\[
\int_{\Omega}\nabla\cdot f\, d\Omega = \int_{\Gamma} f\cdot d\Gamma
\]
where as was outlined in Chapter 12, the quantity ∇ · f is the divergence of the
vector field f . With reference to the domain depicted in Figure 12.1, the way to
think about what this theorem is saying is to imagine the arbitrarily shaped region
of RD space broken up into little pieces of size dΩ. We are then saying that the
sum of the scalar quantity that is the divergence of f in each of these pieces is equal
to the net flux of f out of the domain. This flux can be envisioned by considering
the piece on the boundary given by dΓ (noting that this is a vector quantity and is
normal to the boundary) and summing up the dot products between f and dΓ over
the boundary.
Following this theorem, some corollaries are:
\[
\int_{\Omega}\left(f\cdot\nabla g + g\,\nabla\cdot f\right) d\Omega = \int_{\Gamma} f g\cdot d\Gamma \qquad (14.2)
\]
\[
\int_{\Omega}\nabla^{2} f\, d\Omega = \int_{\Gamma}\nabla f\cdot d\Gamma \qquad (14.3)
\]
\[
\int_{\Omega}\nabla f\, d\Omega = \int_{\Gamma} f\, d\Gamma \qquad (14.4)
\]
where in Equation 14.2 f denotes a vector field and g a scalar field, while in Equations 14.3 and 14.4 f denotes a scalar field. The reason
for introducing this theorem is because we are going to replace some of the domain
integrals with boundary integrals in Equation 14.1. Now although we haven’t stated
it explicitly, the vector field v that we have been using in the convective term of
our generic scalar transport equation is what is known as a solenoidal vector field
[51], meaning that it is divergence free (i.e. ∇ · v = 0). This assumption was made
somewhere along the line during the derivation of the scalar transport equation (for
instance in fluid mechanics when assuming incompressible flow), and since we are
more interested in solving it, rather than deriving its form for different physical
scalar quantities, we will just ‘go with it’ and accept this assumption. As it happens
however, a more general expression for the convective term in the generic scalar
transport equation is ∇ · (vφ) (compared to v · ∇φ). Given the solenoidal nature of
v we can use Equation 14.2 and cancel out the second term on the left hand side to
get:
\[
\int_{\Omega} v\cdot\nabla\phi\, d\Omega = \int_{\Gamma} v\phi\cdot d\Gamma
\]
Alternatively we could have simply brought v inside the derivative in the convective
term (i.e. ∇ · (vφ)) and used the original form of the divergence theorem, but this
still requires the assumption of v being solenoidal. If we now apply Equation 14.3
to the diffusive term we get:
\[
\int_{\Omega}\nabla^{2}\phi\, d\Omega = \int_{\Gamma}\nabla\phi\cdot d\Gamma
\]
The next key step in the derivation of the Finite Volume method is that we think
of our domain as being an arbitrary cell within the mesh and therefore the ‘domain’
means either the length, area, or volume of the cell (depending on whether or not
we are thinking in 1D, 2D, or 3D respectively) and the boundary will imply the
points, edges, or faces that define the cell (again, depending on whether or not we
are thinking in 1D, 2D, or 3D respectively). We then assume that φ and ψ are
constant throughout the cell and can be pulled out of the domain integrals. We
also replace the boundary integrals with summations over the faces comprising the
boundary, hence we end up with the discrete form:
\[
\dot{\phi}\,\Omega + \sum_{f}^{N_f} v_f\phi_f\cdot\Gamma_f = \sum_{f}^{N_f} \mu_f\nabla\phi_f\cdot\Gamma_f + \psi\,\Omega
\]
where the subscript f denotes a face on the boundary of the domain of which Nf
is the total number, and the values φf are defined at the face centroid. It can be
observed that the domain integrals have been replaced by a multiplication by the
domain. This means a multiplication by ‘say’ the area of the domain in 2D or
the volume of the domain in 3D. This discrete integral form of the generic scalar
transport equation will subsequently be applied to every cell in the grid (much like
a finite difference equation is applied to every point in the grid) and can be written
as:
\[
\Omega_c\,\dot{\phi}_c + \sum_{f}^{N_f} v_f\phi_f\cdot\Gamma_f = \sum_{f}^{N_f} \mu_f\nabla\phi_f\cdot\Gamma_f + \Omega_c\,\psi_c
\]
so that we can end up with a coupled system of equations which can be written in
the form:
M φ̇ = Kφ + s
where φ is now a vector defining values at the cell centroid of each cell in the grid.
Now, the final piece remaining before we can assemble the matrices and solve the
system is how we evaluate φ and ∇φ at the face centroids. The way we do this is to
relate (or interpolate) the face centroid values from the cell centroid values of the two
cells which share a given face. This means that ultimately our system of equations
will only involve cell centroid values of φ, which is our solution to the PDE. Just as
we could essentially employ any order of finite difference expression to approximate
the derivatives in the generic scalar transport equation, we can similarly with the
Finite Volume method employ a variety of approximations for φf and ∇φf . In
order to proceed with the derivation we shall use the simplest approximations for
now. Any given face in the interior of the grid (i.e. forgetting about the faces that
comprise the boundary of the grid) will be shared by two adjacent cells. We can
arbitrarily assign one of the cells to be the owner of the face and one to be the
neighbor of the face and while this assignment is arbitrary, the important point is
that the face area vector Γf is defined positive pointing out of the owner cell and
in to the neighbor cell (Figure 14.1).
Figure 14.1: A schematic illustrating two cells in an unstructured grid sharing a face.
The vector Γf is defined as normal to its face, directed from the owner cell to the
neighbor cell. The length vector ∆ζ is defined between the centroids of the owner
and neighbor cells. It can be observed that generally these two vectors will not be
parallel, except for the case where the grid is composed of equilateral triangles.
If we first consider the convective term, one of the simplest ways in which we
can evaluate φf is to use a central difference approximation:
\[
\phi_f = \frac{\phi_n + \phi_o}{2}
\]
which is simply the mean of φc for the neighbor and owner cells. As a quick note,
the use of upwinding methods, analogous to the forward difference in Equation 13.5
are also commonly used. If we now consider the diffusive term, one of the simplest
ways in which we can evaluate the gradient at the face is via:
\[
\nabla\phi_f\cdot\Gamma_f = \frac{\Gamma_f}{\Delta\zeta}\left(\phi_n - \phi_o\right)
\]
where ∆ζ is the distance between the cell centroids of the two cells sharing the face
f (Figure 14.1) and Γf is the magnitude of the vector Γf . This approximation is
essentially treating the gradient in terms of ‘rise over run’. An important assumption
that we have made is that the line ∆ζ and the vector Γf are parallel. This is a pretty
bad assumption that we will address and improve later. We can now write out the
matrices for our semi-discretization as:
\[
M = \sum_{c=1}^{N_c}\Omega_c \qquad (14.5)
\]
\[
K = \sum_{f=1}^{N_f}\left(\pm\frac{\mu_f\Gamma_f}{\Delta\zeta} \pm \frac{v_f\cdot\Gamma_f}{2}\right) \qquad (14.6)
\]
\[
s = \sum_{c=1}^{N_c}\Omega_c\,\psi_c \qquad (14.7)
\]
where Nc is the number of cells in the grid. The ± signs in K come about because
these terms will contribute differently to the owner and neighbor cells of the face.
The last issue before we can apply the Finite Volume method in an example
is how we impose boundary conditions. Since the boundary of our computational
domain is comprised of faces, it is hence on these faces where we are prescribing
our boundary conditions. An interesting comparison can be made with the Finite
Difference method, where we stored our solution at grid points, and the boundary
conditions were also defined at grid points. Here however, our solution is defined at
cell centroids, but our boundaries are defined at face centroids, so one implication
of this is that we won’t have to worry about matrix partitioning. For a Dirichlet
boundary condition we hence have:
φf = φb
This boundary condition will contribute to both K and s, and to see how, consider
the way the boundary face contributes to the convective and diffusive terms:
\[
\frac{\mu_b\Gamma_b}{\Delta\zeta}\left(\phi_b - \phi_o\right) - v_b\cdot\Gamma_b\,\phi_b
\]
where the coefficient of φo will be added into K and the remainder (which is
all known) will be added into s. It is important to note that ∆ζ in this case would
be defined as the distance between the cell centroid and the face centroid of the
boundary face. For a Neumann boundary condition we have:
∇φf = ∇φb
This boundary condition will also contribute to both K and s, and to see how, again
consider the way the boundary face contributes to the convective and diffusive terms:
\[
\mu_b\Gamma_b\,\nabla\phi_b - v_b\cdot\Gamma_b\left(\phi_o + \Delta\zeta\,\nabla\phi_b\right)
\]
Again the coefficient of φo will be added into K and the remainder (which is all
known) will be added into s.
Example 14.1:
In this example we will develop a Matlab program to solve the 2D generic scalar
transport equation:
φ̇ + v · ∇φ = µ∇2 φ + ψ (14.8)
in the domain x ∈ [0, 1], y ∈ [0, 1], t ∈ [0, 2], with boundary conditions φ(0, y) = 1,
φ(x, 0) = 1, ∂x φ(1, y) = 0, ∂y φ(x, 1) = 0, initial condition φ(x, y, 0) = e^{−50(x−0.3)²} +
1, and v = {0.5, 0.5}, µ = 0.01, and ψ = 0.2. For the spatial discretization we
will use the Finite Volume method with triangular elements and for the temporal
discretization we will use the implicit Euler method. The intended learning outcomes
for this example will be to ‘get a feel’ for applying the Finite Volume method to solve a
PDE and to investigate the concept of ‘assembling’ the matrix defining our system
of equations by looping over the faces in our grid.
To begin we are going to need to make some definitions regarding the data
structures that we are going to use to solve the problem. Life was comparatively
easy when we were using Finite Difference methods to solve PDEs because we could
define our computational domain as something like:
phi = zeros(N_x, N_y, N_t);
and navigate through the grid with i, j, l counters, and that was all we needed
to define the geometry and the connectivity. The ‘price that we pay’ for being
able to deal with complex spatial domains is that we must explicitly store all of
the geometry and connectivity information defining the unstructured grid. As such
we will be storing four arrays for this problem; an array called Points which is an
Np × 2 array storing the x, y coordinates of the points defining the grid, an array
called Faces which is an Nf × 2 array storing the two indices of the points defining
a face in the grid, an array called Cells which is an Nc × 3 array storing the three
indices of the points defining a cell in the grid, and an array called NeighborOwner
which is an Nf × 2 array defining the indices of the owner and the neighbor cells of
each face in the grid. We are going to assume that the Faces array is ordered such
that all of the internal faces (of which there are Nif ) are contiguous and come first
in the array, followed by the boundary faces, where the faces of each boundary are
contiguous in the array (i.e. all the faces corresponding to the lower x boundary
come first, then all the edges corresponding to the upper y boundary second, the
upper x boundary third, and the lower y boundary last). In order to prescribe our
boundary conditions we are going to make use of a structure to store all of the
information that we need:
Boundaries = struct(...
    'start', {2873, 2901, 2929, 2957},...
    'N',     {28, 28, 28, 28},...
    'type',  {'neumann', 'neumann', 'dirichlet', 'dirichlet'},...
    'value', {0, 0, 1, 1});
where the field start will store the index of the first of the contiguous faces compris-
ing a given boundary, N is the total number of faces defining the boundary, type is
a string variable telling us what type of boundary condition to apply, and value is
the value of either the Dirichlet or Neumann condition. Since we are assuming that
all of this data is ‘at the ready’ it will all be defined in a function called readGrid,
which we will call at the start of our program.
At this point we can apply the spatial discretization that is the Finite Volume
method to our PDE and we know that we will have a semi-discrete system of the
form:
dφ
M = Kφ + s
dt
where:
\[
M = \sum_{c=1}^{N_c}\Omega_c, \qquad
K = \sum_{f=1}^{N_f}\left(\pm\frac{\mu_f\Gamma_f}{\Delta\zeta} \pm \frac{v_f\cdot\Gamma_f}{2}\right), \qquad
s = \sum_{c=1}^{N_c}\Omega_c\,\psi_c
\]
Then, applying the implicit Euler method for the temporal discretization we get:
\[
M\,\frac{\phi^{l+1}-\phi^{l}}{\Delta t} = K\phi^{l+1} + s
\]
which can be rearranged to:
Aφl+1 = b
where:
A = M − ∆tK
b = M φl + ∆ts
As with the Finite Difference method applied in Example 13.3 the core part of
the method is the ‘assembling’ of the matrices defining the system of equations. To
assemble these matrices we are going to need to evaluate Ωc and Γf for the cells and
faces respectively. Now in our 2D example Ωc is the area of each triangle, which we
can evaluate from the coordinates (x1 , y1 ), (x2 , y2 ), (x3 , y3 ) of its three vertices as:
\[
\Omega_c = \tfrac{1}{2}\left|x_1(y_2 - y_3) + x_2(y_3 - y_1) + x_3(y_1 - y_2)\right| \qquad (14.9)
\]
while the face area vector Γf , which in 2D is normal to the face and has a magnitude equal
to the face length, can be evaluated from the two points defining the face as:
\[
\Gamma_f = -\{y_1 - y_2,\; x_2 - x_1\} \qquad (14.10)
\]
and we can implement this in our Matlab program via the code:
for f=1:N_f
x = Points(Faces(f, :), 1);
y = Points(Faces(f, :), 2);
Gamma(f,:) = -[y(1)-y(2), x(2)-x(1)];
end
where again xp , yp are the x and y coordinates of the two points defining the edge.
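The cell areas Ωc, which are needed below when building the mass matrix and source vector, can be computed in an entirely analogous loop; the sketch below is an assumption consistent with the Points and Cells arrays described earlier, not the book's listing.

% Hedged sketch: area of each triangular cell from its three vertices.
Omega = zeros(N_c, 1);
for c=1:N_c
    x = Points(Cells(c, :), 1);
    y = Points(Cells(c, :), 2);
    Omega(c) = 0.5*abs(x(1)*(y(2)-y(3)) + x(2)*(y(3)-y(1)) + x(3)*(y(1)-y(2)));
end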
We will also need the coordinates of both the cell centroids and the face centroids
in order to assemble K. The cell centroids can be calculated via:
\[
\mathrm{Centroid}_c = \left(\frac{x_1 + x_2 + x_3}{3},\; \frac{y_1 + y_2 + y_3}{3}\right)
\]
and implemented via the code:
for c=1:N_c
x = Points(Cells(c, :), 1);
y = Points(Cells(c, :), 2);
cellCentroids(c,:) = [sum(x), sum(y)]/3;
end
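The face centroids are also needed later, when the boundary conditions are applied (the faceCentroids array); a minimal sketch, again an assumption rather than the book's listing, simply takes the midpoint of the two points defining each face:

% Hedged sketch: face centroids as the midpoints of each face's two points.
faceCentroids = zeros(N_f, 2);
for f=1:N_f
    x = Points(Faces(f, :), 1);
    y = Points(Faces(f, :), 2);
    faceCentroids(f,:) = [sum(x), sum(y)]/2;
end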
[Figure 14.2: a small portion of the unstructured grid near a boundary, labeling the cells c777, c778, c779, c811, and c812, interior faces such as f1219 and f1266, and boundary faces such as f2956, together with the prescribed boundary values φb.]
If we denote the convective and diffusive coefficients of a face f by Cf and Df (the quantities computed as C_f and D_f in the code below), then the contribution of an interior face such as f1221 to the ODEs of the two cells sharing it (here cells 811 and 779) can be written, partially complete, as:
\[
\begin{aligned}
\Omega_{811}\frac{d\phi_{811}}{dt} &= \left(+C_{f1221} - D_{f1221}\right)\phi_{811} + \left(+C_{f1221} + D_{f1221}\right)\phi_{779} + \ldots\\
\Omega_{779}\frac{d\phi_{779}}{dt} &= \left(-C_{f1221} + D_{f1221}\right)\phi_{811} + \left(-C_{f1221} - D_{f1221}\right)\phi_{779} + \ldots
\end{aligned}
\]
More generally, we could write this for any interior face in the grid as:
\[
\begin{aligned}
\Omega_{n}\frac{d\phi_{n}}{dt} &= \left(+C_{f} - D_{f}\right)\phi_{n} + \left(+C_{f} + D_{f}\right)\phi_{o} + \ldots\\
\Omega_{o}\frac{d\phi_{o}}{dt} &= \left(-C_{f} + D_{f}\right)\phi_{n} + \left(-C_{f} - D_{f}\right)\phi_{o} + \ldots
\end{aligned}
\]
Now in order to actually implement this in our Matlab code our assemble func-
tion is going to involve ‘looping’ over all of the interior faces of the mesh and adding
contributions to K and the algorithm will look something like:
function [M, K, s, phi] = assemble(M, K, s, phi, Points, Faces, Cells, ...
NeighborOwner, Boundaries, N_p, N_if, ...
N_bf, N_f, N_c, N_b);
...
M = sparse(diag(Omega));
s = psi.*Omega;
for f=1:N_if
o = NeighborOwner(f, 1);
n = NeighborOwner(f, 2);
Delta_zi = norm(cellCentroids(n,:)-cellCentroids(o,:));
C_f = dot(v, Gamma(f, :))/2;
D_f = mu*norm(Gamma(f, :))/Delta_zi;
K(o, o) = K(o, o) - C_f - D_f;
K(o, n) = K(o, n) - C_f + D_f;
K(n, n) = K(n, n) + C_f - D_f;
K(n, o) = K(n, o) + C_f + D_f;
end
...
Now, in order to apply the boundary conditions we can loop over every boundary in
our structure, then loop over every face within the boundary, and add the contribu-
tions of the boundary conditions to K and s. Let’s look at how the boundary faces
2901 and 2956 affect the ODEs for the cells 779 and 811. We could again write the
‘partially complete’ ODEs as:
\[
\begin{aligned}
\Omega_{811}\frac{d\phi_{811}}{dt} &= \frac{\mu\Gamma_{2956}}{\Delta\zeta_{2956}}\left(\phi_b - \phi_{811}\right) - v\cdot\Gamma_{2956}\,\phi_b + \ldots\\
\Omega_{779}\frac{d\phi_{779}}{dt} &= \mu\Gamma_{2901}\,\nabla\phi_b - v\cdot\Gamma_{2901}\left(\phi_{779} + \Delta\zeta_{2901}\,\nabla\phi_b\right) + \ldots
\end{aligned}
\qquad (14.13)
\]
If we again substitute the definitions in Equation 14.12 into 14.13 and collect coef-
ficients we get:
\[
\begin{aligned}
\Omega_{811}\frac{d\phi_{811}}{dt} &= -D_{f2956}\,\phi_{811} + \left(D_{f2956} - C_{f2956}\right)\phi_b + \ldots\\
\Omega_{779}\frac{d\phi_{779}}{dt} &= -C_{f2901}\,\phi_{779} + \left(D_{f2901} - C_{f2901}\right)\Delta\zeta_{2901}\,\nabla\phi_b + \ldots
\end{aligned}
\]
More generally, we could write this for any boundary face in the grid as:
\[
\begin{aligned}
\Omega_{o}\frac{d\phi_{o}}{dt} &= -D_{f}\,\phi_{o} + \left(D_{f} - C_{f}\right)\phi_b + \ldots\\
\Omega_{o}\frac{d\phi_{o}}{dt} &= -C_{f}\,\phi_{o} + \left(D_{f} - C_{f}\right)\Delta\zeta\,\nabla\phi_b + \ldots
\end{aligned}
\qquad (14.14)
\]
Remember that a boundary face only has an owner cell. An important point
to note is that the first terms on the right hand side of Equation 14.14 will be
added to K, while the second terms on the right hand side will be added to s. In
order to implement this in our Matlab code we will add in some more code to our
assemble function, ‘looping’ over all of the boundary faces of the mesh and adding
contributions to K and s and the algorithm will look something like:
function [M, K, s, phi] = assemble(M, K, s, phi, Points, Faces, Cells, ...
NeighborOwner, Boundaries, N_p, N_if, ...
N_bf, N_f, N_c, N_b);
...
for b=1:N_b
for f=Boundaries(b).start:Boundaries(b).start+Boundaries(b).N-1
o = NeighborOwner(f,1);
C_f = dot(v, Gamma(f,:));
Delta_zi = norm(faceCentroids(f,:)-cellCentroids(o,:));
D_f = mu*norm(Gamma(f, :), 2)/Delta_zi;
if strcmp(Boundaries(b).type,'dirichlet')
K(o, o) = K(o, o) - D_f;
s(o) = s(o) + (D_f - C_f) * Boundaries(b).value;
elseif strcmp(Boundaries(b).type,'neumann')
K(o, o) = K(o, o) - C_f;
s(o) = s(o) + (D_f - C_f)*Delta_zi* Boundaries(b).value;
end
end
end
end
Figure 14.3: The pattern of the assembled stiffness matrix K using the Finite Volume
method.
At this point, the system is completely assembled (Figure 14.3). It can be observed
that in comparison to the stiffness matrix for the system in Example 13.3, this
matrix is much less ordered, although it is symmetric.
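Sparsity patterns of this kind can be inspected directly from the assembled matrix; a one-line sketch (not part of the book's listings) is:

spy(K);   % plot the sparsity pattern of the assembled stiffness matrix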
In order to implement this in our Matlab program the core of the algorithm will
look like:
assemble();
A = M - Delta_t*K;
for l=1:N_t-1
b = M*phi(:,l) + Delta_t*s;
phi(:,l+1) = A\b;
end
The complete program is given in Example14_1.m. Figures 14.4(a) - 14.4(d)
illustrate the solution at a number of different time steps. It can be observed that
as with Example 13.3, as time progresses the bell shaped surface (which is the initial
condition) moves through the domain (due to the convective term) and spreads out
(due to the diffusive term), and the domain as a whole rises (due to the source term).
283
2.0 2.0
1.8 1.8
1.6 1.6
φ
φ
1.4 1.4
1.2 1.2
1.0 1.0
0.0 1.0 0.0 1.0
0.2 0.8 0.2 0.8
0.4 0.6 0.4 0.6
0.6 0.4 0.6 0.4
0.8 0.2 0.8 0.2
1.0 0.0 1.0 0.0
y y
x x
(a) (b)
2.0 2.0
1.8 1.8
1.6 1.6
φ
1.4 1.4
1.2 1.2
1.0 1.0
0.0 1.0 0.0 1.0
0.2 0.8 0.2 0.8
0.4 0.6 0.4 0.6
0.6 0.4 0.6 0.4
0.8 0.2 0.8 0.2
1.0 0.0 1.0 0.0
y y
x x
(c) (d)
Figure 14.4: The solutions to the PDE in Example 14.1 illustrating the solution at
(a) t = 0, (b) t = 0.5, (c) t = 1.0 and (d) t = 1.5.
As was mentioned earlier, the approximation that we used for the ∇φf is not a
particularly good one, the reason being that in general ∆ζ and Γf are not parallel.
What this implies is that we don’t accurately capture the diffusion across the face
of each cell, reducing the accuracy of the method. What is commonly done is to
calculate the gradient as:
\[ \nabla\phi_f\cdot\Gamma_f = \frac{\Gamma_f\cdot\Gamma_f}{\Gamma_f\cdot e_\zeta}\left(\phi_n - \phi_o\right) + S \]
where eζ is the unit vector in the direction between the two cell centroids sharing the face f. The first term is known as the primary diffusion while the term S is known as the secondary diffusion. The primary diffusion accounts for the part of the diffusion aligned with the unit vector eζ, while the secondary diffusion accounts for the diffusion along the length of the face. Now the secondary diffusion also depends upon φo and φn, but it is usually treated explicitly, meaning that it is assumed known, and it contributes to s, not K.
In practice an iterative method is usually used to solve the system of equations at
each time step, partly due to the fact that the generic scalar transport equation can
often be nonlinear (take for example the solution of the Navier-Stokes equations),
but also because the resulting systems can be too large for direct methods. The
point being that in practice, incorporating the secondary diffusion explicitly is less
of a problem, but of course it does require explicitly calculating the face gradients
at each iteration. This was the reason that we ignored the secondary diffusion in
Example 14.1.
A final point worth mentioning is that there are a variety of higher order methods
available for the discretization of the convective and diffusive terms. While we used
the simplest (and lowest order) methods to simplify the introduction of the Finite
Volume method, in practice higher order methods would generally be used. For the
interested reader, two excellent references for more detailed aspects of the Finite
Volume method can be found in the books by Versteeg [75] and Murthy [70].
Chapter 15

Finite Element Methods
Similar to Finite Volume methods, Finite Element methods present a way by which
we can solve a PDE in a spatial domain with a much more complex geometrical
structure compared to Finite Difference methods. The domain is broken up into a
collection of cells and then a spatial discretization is performed over each cell.
A key difference in terminology however, is that the cells are usually called elements
and the vertices of the elements are usually called nodes. Furthermore, the field
variable φ is defined at the nodes of an element, in contrast to a cell based Finite
Volume method where it was defined at the cell centroid. The elements used in 1D
are simply lines, in 2D they are commonly triangles or quadrilaterals, and in 3D,
tetrahedra or hexahedra (Figure 15.1). Just as there are higher order interpolation
methods available to the Finite Difference and Finite Volume methods, with the
Finite Element method we can use higher order elements to obtain a more accurate
solution. Furthermore, the Finite Element method is also a local method in that
the discretization within an element will only involve its own nodes.
Figure 15.1: Some common elements. In 1D, lines, in 2D, triangles and quadrilat-
erals, and in 3D, tetrahedra and hexahedra. Each row depicts the linear, quadratic,
and cubic versions of the element.
The theory of the Finite Element method is found in variational calculus and
there are generally two procedures one would normally use to solve a PDE using
the Finite Element method, known as the Rayleigh-Ritz and Galerkin methods
(both of which are subsets of the method of weighted residuals [62]). Similar to
the derivation of our Finite Volume method, we must choose a specific technique
and as such will only concern ourselves with the Galerkin method. It should be
understood that the method of weighted residuals is a mathematical technique in its own right and is merely employed as a part of the Finite Element method,
analogous to being one of many ‘ingredients’. As with the Finite Volume method
we will assume that our spatial domain has been broken up into a tessellation of a
finite number of non-overlapping elements, which completely cover the domain (i.e.
we have a suitable unstructured grid). Before proceeding to discretize the generic
scalar transport equation however, it is worth taking a brief detour to investigate
the method of weighted residuals. To illustrate by way of example, let’s consider
the ODE from Example 6.1
\[ \frac{d\phi}{dx} = 1 - \phi \]
in the domain x ∈ [0, 10], with boundary condition φ(0) = 0. In this case we know
that the solution will be φ(x) = 1 − e−x , but we’ll assume a trial solution of the
form:
\[ \phi(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_N x^N \quad (15.1) \]
or more generally, we could write our trial solution as:
\[ \phi(x) = \sum_{n=0}^{N} a_n\,p_n(x) \]
where the a_n terms are coefficients that need to be determined and the p_n terms are the powers of x, known as basis functions or trial functions. Applying the boundary condition we find that a_0 = 0 and hence:
\[ \phi = a_1 x + a_2 x^2 + \ldots + a_N x^N \]
If we only consider terms up to second order in our trial solution, then we can
substitute our trial solution into the original ODE and perform the differentiation
to get:
\[ \frac{d\phi}{dx} + \phi - 1 \approx \left(a_1 + 2a_2 x\right) + \left(a_1 x + a_2 x^2\right) - 1 = a_1\left(1 + x\right) + a_2\left(2x + x^2\right) - 1 \]
Now let’s assume that our trial solution for φ doesn’t satisfy our original ODE
exactly. In this case we may define a residual as:
\[ r := \frac{d\phi}{dx} + \phi - 1 = a_1\left(1 + x\right) + a_2\left(2x + x^2\right) - 1 \quad (15.2) \]
which will be exactly zero if the trial solution satisfies the ODE, and non-zero oth-
erwise. It is important to note that the meaning of the word ‘residual’ is different
compared to the context of an iterative method for solving a linear system of equa-
tions. In that sense we were talking about a column vector where each entry is a measure of the error for each variable, whereas here we can think of the residual as a continuous function of space. In general we cannot force the residual to vanish everywhere, no matter how many terms we include in our trial solution. The idea of the method of weighted
residuals however is that we can multiply the residual by a weighting function and
force the integral of the weighted expression over the domain to vanish, i.e:
\[ \int_\Omega W(x)\,r(x)\,d\Omega = 0 \quad (15.3) \]
In the Galerkin method the weighting functions are chosen to be the trial (basis) functions themselves, so for our example we require:
\[ \int_0^{10} p_n\,r\,dx = 0 \]
Substituting in the residual and the weighting functions and performing the
integration we get:
\[ \int_0^{10} x\left[a_1(1+x) + a_2(2x + x^2) - 1\right]dx = \left[a_1\left(\frac{x^2}{2} + \frac{x^3}{3}\right) + a_2\left(\frac{2x^3}{3} + \frac{x^4}{4}\right) - \frac{x^2}{2}\right]_0^{10} = 383a_1 + 3167a_2 - 50 \]
and:
\[ \int_0^{10} x^2\left[a_1(1+x) + a_2(2x + x^2) - 1\right]dx = \left[a_1\left(\frac{x^3}{3} + \frac{x^4}{4}\right) + a_2\left(\frac{2x^4}{4} + \frac{x^5}{5}\right) - \frac{x^3}{3}\right]_0^{10} = 2833a_1 + 25000a_2 - 333 \]
Setting both weighted residuals to zero and solving the resulting 2 × 2 system for the coefficients gives the two-term approximation:
\[ \phi = 0.3182x - 0.0227x^2 \]
Figure 15.2: Galerkin Method of Weighted residual solutions to the ODE in Example
6.1.
Figure 15.2 illustrates the solution to this ODE comparing the exact solution to
weighted residual solutions employing 1 to 4 terms in the trial solution. It can
be observed that the bigger our set of weighting functions, the more accurate our
solution, but the bigger the system of equations that we will have to solve.
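To make the procedure concrete, the following is a minimal Matlab sketch (not one of the book's listings) that assembles and solves the Galerkin weighted residual system for this ODE with N terms in the trial solution; the integrals are evaluated analytically using ∫₀^L x^m dx = L^(m+1)/(m+1):

N = 2;                      % number of terms in the trial solution
L = 10;                     % length of the domain
A = zeros(N, N);
b = zeros(N, 1);
for m = 1:N
    for n = 1:N
        % int_0^L x^m (n*x^(n-1) + x^n) dx, from r = sum a_n (p_n' + p_n) - 1
        A(m, n) = n*L^(m+n)/(m+n) + L^(m+n+1)/(m+n+1);
    end
    b(m) = L^(m+1)/(m+1);   % int_0^L x^m dx
end
a = A\b;                    % for N = 2 this gives a = [0.3182; -0.0227]

Increasing N enlarges the linear system but improves the approximation, exactly as observed in Figure 15.2.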
Having now seen a simple example of the Galerkin method of weighted residuals
the next important step in the derivation of our Finite Element method is how we
express our field variable φ and its derivatives. Thinking of φ most generally as
being a continuous function of space and time, we will write the assumed solution
in the form:
\[ \phi(x, t) = \sum_{n=1}^{N_n} \eta_n(x)\,\phi_n(t) \quad (15.5) \]
where the ηn terms are known as shape functions, which are functions of spatial
location only, while the nodal values of φ are functions of time only. We will look at
how shape functions are derived for some of the element types in Figure 15.1 shortly,
but for now, let’s just say that they come about from defining a trial solution similar
to that in Equation 15.1 and manipulating it such that the power series solution is
expressed in terms of the nodal values. We will use the notation Nn to denote
the number of nodes in an element (e.g. 2 for a linear line element, 3 for a linear
triangular element, 4 for a linear tetrahedral element), and Np as normal to mean
the total number of nodes (or points) in the grid, which should obviously be much
larger.
Having made this assumption for our assumed solution we can express the tem-
poral derivative of our field as:
\[ \frac{\partial\phi}{\partial t} = \sum_{n=1}^{N_n} \frac{\partial}{\partial t}\left(\eta_n\phi_n\right) = \sum_{n=1}^{N_n} \left(\frac{\partial\eta_n}{\partial t}\phi_n + \eta_n\frac{\partial\phi_n}{\partial t}\right) = \sum_{n=1}^{N_n} \eta_n\frac{\partial\phi_n}{\partial t} \]
since the shape functions do not depend on time. Similarly, the first spatial derivative becomes:
\[ \frac{\partial\phi}{\partial x} = \sum_{n=1}^{N_n} \frac{\partial}{\partial x}\left(\eta_n\phi_n\right) = \sum_{n=1}^{N_n} \left(\frac{\partial\eta_n}{\partial x}\phi_n + \eta_n\frac{\partial\phi_n}{\partial x}\right) = \sum_{n=1}^{N_n} \frac{\partial\eta_n}{\partial x}\phi_n \]
since the nodal values do not depend on space, or more generally:
\[ \partial_{x_i}\phi = \sum_{n=1}^{N_n} \partial_{x_i}\eta_n\,\phi_n \]
The second spatial derivative becomes:
\[ \frac{\partial^2\phi}{\partial x^2} = \sum_{n=1}^{N_n} \frac{\partial^2}{\partial x^2}\left(\eta_n\phi_n\right) = \sum_{n=1}^{N_n} \frac{\partial}{\partial x}\left(\frac{\partial\eta_n}{\partial x}\phi_n\right) = \sum_{n=1}^{N_n} \left(\frac{\partial^2\eta_n}{\partial x^2}\phi_n + \frac{\partial\eta_n}{\partial x}\frac{\partial\phi_n}{\partial x}\right) = \sum_{n=1}^{N_n} \frac{\partial^2\eta_n}{\partial x^2}\phi_n \]
or more generally:
\[ \partial_{x_i x_j}\phi = \sum_{n=1}^{N_n} \partial_{x_i x_j}\eta_n\,\phi_n \]
although, as we will see shortly, despite the elegance of this last result we won’t be
evaluating the second derivatives this way in practice.
If we now apply the Galerkin method of weighted residuals to the scalar transport
equation then we have:
\[ \int_\Omega W\left(\dot{\phi} + v\cdot\nabla\phi - \mu\nabla^2\phi - \psi\right)d\Omega = 0 \quad (15.6) \]
At this point it can be observed that the integration procedure is quite similar to
that performed with the Finite Volume method and in fact would be equivalent if the
weighting function W were equal to 1. In our case however we will be substituting
in the shape functions and also our assumed form of the solution from Equation
15.5, and we will consider the domain of integration to be the domain of an element
itself such that the weighted residual expression can be rewritten as:
\[ \int_{\Omega_e} \left(\eta_p\eta_q\dot{\phi}_q + \eta_p v\cdot\nabla\eta_q\,\phi_q - \mu\eta_p\nabla^2\eta_q\,\phi_q - \eta_p\psi\right)d\Omega = 0 \quad (15.7) \]
where an important point to note is that we are using Einstein summation notation,
implying that:
ηq φq = η1 φ1 + η2 φ2 + . . . + ηNn φNn
and:
\[ \eta_p\eta_q\phi_q = \begin{bmatrix} \eta_1\eta_1 & \eta_1\eta_2 & \ldots & \eta_1\eta_{N_n} \\ \eta_2\eta_1 & \eta_2\eta_2 & \ldots & \eta_2\eta_{N_n} \\ \vdots & \vdots & \ddots & \vdots \\ \eta_{N_n}\eta_1 & \eta_{N_n}\eta_2 & \ldots & \eta_{N_n}\eta_{N_n} \end{bmatrix}\begin{Bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{N_n} \end{Bmatrix} \]
The important point here is that applying the Galerkin method of weighted residuals
this way means that for every element we get a ‘sub’ system of equations local to each
element and in terms of the local nodal values 1, 2, . . . , Nn , that must be assembled
into a global system of equations with global indices 1, 2, . . . , Np. The overall global
system of equations to solve our generic scalar transport equation is then given by:
\[ \sum_{e=1}^{N_e} \int_{\Omega_e} \left(\eta_p\eta_q\dot{\phi}_q + \eta_p v\cdot\nabla\eta_q\,\phi_q - \mu\eta_p\nabla^2\eta_q\,\phi_q - \eta_p\psi\right)d\Omega = 0 \]
where Ne is the number of elements in the grid. We can then rewrite the system of
equations in the form:
M φ̇ = Kφ + s (15.8)
where φ is now a column vector defining values at the nodes, and:
\[ M = \sum_{e=1}^{N_e} \int_{\Omega_e} \eta_p\eta_q\,d\Omega \quad (15.9) \]
\[ K = \sum_{e=1}^{N_e} \left( \int_{\Omega_e} \mu\eta_p\nabla^2\eta_q\,d\Omega - \int_{\Omega_e} \eta_p v\cdot\nabla\eta_q\,d\Omega \right) \quad (15.10) \]
\[ s = \sum_{e=1}^{N_e} \int_{\Omega_e} \eta_p\psi_p\,d\Omega \quad (15.11) \]
We are now in a position to look into the details of how we derive shape functions
for a particular element. We will start with the simplest element possible, namely
the linear line element, depicted in the upper left corner of Figure 15.1, which is
perhaps the simplest element that we can use in 1D. When we say a ‘linear’ element
we are making the approximation that the solution will vary as a linear function
and as such the trial function may be written as:
φ(x) = a0 + a1 x
which can be observed is exactly the same as the trial solution defined in Equation
15.1 but considering only the first two terms. Now, an important point is that
because we are defining the solution of our PDE at the nodes of the elements in our
mesh, if we evaluate this trial solution at say node 1 then we have:
φ(x1 ) = φ1 = a0 + a1 x1
which we could write as:
φ1 = p1 a
where:
a = {a0 , a1 }T
and:
p1 = {1, x1 }
If we then evaluate the trial solution at the other point we get:
φ(x2 ) = φ2 = a0 + a1 x2
which we can write in matrix form as:
\[ \begin{Bmatrix} \phi_1 \\ \phi_2 \end{Bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \end{bmatrix}\begin{Bmatrix} a_0 \\ a_1 \end{Bmatrix} = \begin{bmatrix} p_1 \\ p_2 \end{bmatrix}a = Ca \]
where C will be a square 2 × 2 matrix and we can solve for the unknown parameters
by computing a = C −1 φ. Doing so, we find that:
\[ a_0 = \frac{1}{L_e}\left(x_2\phi_1 - x_1\phi_2\right) \]
\[ a_1 = \frac{1}{L_e}\left(-\phi_1 + \phi_2\right) \]
where Le is the length of the element and is defined in terms of its nodal coordinates
as:
Le = x2 − x1
If we now substitute these coefficients back into our trial solution we get:
\[ \phi(x) = \frac{1}{L_e}\left(x_2\phi_1 - x_1\phi_2\right) + \frac{1}{L_e}\left(-\phi_1 + \phi_2\right)x \]
where we can factor out the nodal values and rewrite the solution in the form:
\[ \phi(x, t) = \sum_{n=1}^{N_n} \eta_n(x)\,\phi_n(t) = \eta_1\phi_1 + \eta_2\phi_2 \quad (15.12) \]
where the ηn terms are the shape functions and for a linear 1D element are given
by:
\[ \eta_1(x) = \frac{1}{L_e}\left(x_2 - x\right) \]
\[ \eta_2(x) = \frac{1}{L_e}\left(x - x_1\right) \]
We can then trivially express the derivatives of the shape functions as:
\[ \frac{\partial\eta_1(x)}{\partial x} = -\frac{1}{L_e} \]
\[ \frac{\partial\eta_2(x)}{\partial x} = \frac{1}{L_e} \quad (15.13) \]
So what we’ve done here is define a trial solution within our element as some contin-
uous function of space, then we’ve used the fact that our trial solution must assume
the values φn at the nodes in order to obtain an expression for the field variable that
varies continuously within the element, but is defined in terms of the nodal values,
not the ai coefficients. As it happens these shape functions are known as piecewise
continuous functions in that they are only defined within their ‘own’ element and
are all zero in any other element in the grid. If we know the values of φn at the
nodes of the element then we can use the shape functions to compute the value of φ
at any point within the element. Now obviously we don’t know the values of φn at
the nodes a priori; the whole point of solving a PDE is to find them. But as with
the Finite Difference and Finite Volume methods studied thus far, this expression
will allow us to assemble a system that we can use to solve for these unknown nodal
values.
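As a concrete illustration of this interpolation, a small sketch with assumed nodal coordinates and values (not one of the book's listings):

x1 = 0.2; x2 = 0.5;          % assumed nodal coordinates of one element
phi1 = 1.3; phi2 = 1.7;      % assumed nodal values
L_e = x2 - x1;               % element length
x = 0.35;                    % a point within the element
eta1 = (x2 - x)/L_e;         % linear line element shape functions
eta2 = (x - x1)/L_e;
phi = eta1*phi1 + eta2*phi2; % interpolated value (the midpoint here, 1.5)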
The final step in our Finite Element method is to perform the integration of the
shape functions (or their derivatives) over the domain of each element, in order to
evaluate the terms in Equations 15.9-15.11. Although the integration is not too
difficult in this case we will make use of the integration formulae defined for a linear
line element:
\[ \int_{\Omega_e} \eta_p^a\,\eta_q^b\,d\Omega = \frac{a!\,b!\,\Omega_e}{\left(a + b + 1\right)!} \quad (15.14) \]
where Ωe ≡ Le in 1D. Let’s first consider the integration of the shape functions as
required in the mass matrix:
\[ M^e_{p,q} = \int_{\Omega_e} \eta_p\eta_q\,d\Omega \]
Remembering that p and q are in the range of 1 to 2 for the linear line element,
what we end up with is a local or ‘sub’ matrix M^e, which for our linear line element will be a 2 × 2 matrix. In order to evaluate each term in the matrix we
simply input the values of p and q to the integration formula in Equation 15.14. For
the case where p and q are equal (i.e. for elements on the main diagonal) we get:
\[ M^e_{p,p} = \int_{\Omega_e} \eta_p\eta_p\,d\Omega = \int_{\Omega_e} \eta_p^2\,\eta_q^0\,d\Omega = \frac{2!\,0!\,\Omega_e}{\left(2 + 0 + 1\right)!} = \frac{2\Omega_e}{6} \]
For the case where p and q are not equal (i.e. for elements off the main diagonal)
we get:
\[ M^e_{p,q} = \int_{\Omega_e} \eta_p\eta_q\,d\Omega = \int_{\Omega_e} \eta_p^1\,\eta_q^1\,d\Omega = \frac{1!\,1!\,\Omega_e}{\left(1 + 1 + 1\right)!} = \frac{\Omega_e}{6} \]
So we can write a single expression for any element in our local mass matrix as:
\[ M^e_{p,q} = \frac{\left(1 + \delta_{pq}\right)\Omega_e}{6} \]
which is simple enough that we could write out the whole thing as:
\[ M^e = \frac{\Omega_e}{6}\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \quad (15.15) \]
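A one-line way to construct this local matrix from the Kronecker delta expression above (a small sketch, not one of the book's listings, with an assumed element length):

Omega_e = 0.1;                   % assumed element length
M_e = (1 + eye(2))*Omega_e/6;    % eye(2) supplies the delta_pq term, giving [2 1; 1 2]*Omega_e/6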
Considering the contribution of the convective term to the stiffness matrix we have:
\[ K^e_{p,q} = -\int_{\Omega_e} \eta_p\,v\cdot\nabla\eta_q\,d\Omega = -v\nabla\eta_q\int_{\Omega_e} \eta_p\,d\Omega = -v\nabla\eta_q\int_{\Omega_e} \eta_p^1\,\eta_q^0\,d\Omega = -v\nabla\eta_q\,\frac{1!\,0!\,\Omega_e}{\left(1 + 0 + 1\right)!} = -\frac{v\nabla\eta_q\,\Omega_e}{2} \]
which is simple enough that we could write out the whole thing as:
\[ K^e = \begin{bmatrix} \frac{v}{2} & -\frac{v}{2} \\ \frac{v}{2} & -\frac{v}{2} \end{bmatrix} \quad (15.16) \]
Considering now the contribution of the source term to the load vector we have:
\[ s^e_p = \int_{\Omega_e} \eta_p\psi_p\,d\Omega = \psi_p\int_{\Omega_e} \eta_p\,d\Omega = \psi_p\,\frac{1!\,0!\,\Omega_e}{\left(1 + 0 + 1\right)!} = \frac{\psi_p\,\Omega_e}{2} \]
and if we can assume that ψ is the same at each node we can write our contribution
to the local load vector as:
\[ s^e = \frac{\psi\,\Omega_e}{2}\begin{Bmatrix} 1 \\ 1 \end{Bmatrix} \quad (15.17) \]
Now, you may have noticed that we have not looked at the contribution of the
diffusive term. There is a good reason for this, namely that it involves the second
derivatives of the shape functions, which for a linear element will be zero. So what
we would find is that the contribution to the stiffness matrix would be zero, which
is obviously not correct. One option would be to use a higher order element, but as
we shall soon see, there is a second option.
Example 15.1:
In this example we will develop a Matlab program to solve the 1D first order
wave equation:
\[ \frac{\partial\phi}{\partial t} + v\frac{\partial\phi}{\partial x} = 0 \quad (15.18) \]
in the domain x ∈ [0, 10], t ∈ [0, 10], with boundary condition φ(0, t) = 1, initial condition φ(x, 0) = e^{−5(x−3)^2} + 1, v = 1.0, and compare the numerical solution with the exact solution
\[ \phi(x, t) = e^{-5\left(x - vt - 3\right)^2} + 1 \]
where we will define the error function e_i = φ_i^l − φ(x_i, t_l) and use the infinity norm ||e||_∞ to quantify the error in the numerical solution. As always, we should first confirm that we have a well posed problem; if our problem has too many or too few boundary or initial conditions it will be ‘doomed’ from the start.
So it is always important to consider these issues before writing any code.
Similar to the equivalent Finite Difference case in Example 13.1 we will assume
that our spatial domain has been broken up into Ne elements and so the number of
grid points will be Nx = Ne + 1; furthermore, the spatial step size is Ωe ≡ Le ≡ ∆x.
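Before assembling the element matrices, the grid, time stepping parameters, and initial and boundary conditions have to be set up; a minimal sketch of that setup (the particular values of N_e and ∆t are assumptions for illustration, chosen to match the stable combination discussed later) is:

N_e = 200;                          % number of elements (assumed)
N_x = N_e + 1;                      % number of grid points
Delta_x = 10/N_e;                   % Omega_e = L_e = Delta_x
x = (0:Delta_x:10)';                % node coordinates
Delta_t = 0.02;                     % time step size (assumed)
N_t = round(10/Delta_t) + 1;        % number of time levels
t = (0:Delta_t:10);                 % time levels
v = 1.0;
phi = zeros(N_x, N_t);
phi(:,1) = exp(-5*(x - 3).^2) + 1;  % initial condition
phi(1,:) = 1;                       % Dirichlet boundary condition phi(0,t) = 1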
Applying our method for each element we can then write:
\[ \begin{bmatrix} \frac{L}{3} & \frac{L}{6} \\ \frac{L}{6} & \frac{L}{3} \end{bmatrix}\frac{d}{dt}\begin{Bmatrix} \phi_1 \\ \phi_2 \end{Bmatrix} = \begin{bmatrix} \frac{v}{2} & -\frac{v}{2} \\ \frac{v}{2} & -\frac{v}{2} \end{bmatrix}\begin{Bmatrix} \phi_1 \\ \phi_2 \end{Bmatrix} \]
\[ \begin{bmatrix} \frac{L}{3} & \frac{L}{6} \\ \frac{L}{6} & \frac{L}{3} \end{bmatrix}\frac{d}{dt}\begin{Bmatrix} \phi_2 \\ \phi_3 \end{Bmatrix} = \begin{bmatrix} \frac{v}{2} & -\frac{v}{2} \\ \frac{v}{2} & -\frac{v}{2} \end{bmatrix}\begin{Bmatrix} \phi_2 \\ \phi_3 \end{Bmatrix} \]
\[ \vdots \]
\[ \begin{bmatrix} \frac{L}{3} & \frac{L}{6} \\ \frac{L}{6} & \frac{L}{3} \end{bmatrix}\frac{d}{dt}\begin{Bmatrix} \phi_{N_x-1} \\ \phi_{N_x} \end{Bmatrix} = \begin{bmatrix} \frac{v}{2} & -\frac{v}{2} \\ \frac{v}{2} & -\frac{v}{2} \end{bmatrix}\begin{Bmatrix} \phi_{N_x-1} \\ \phi_{N_x} \end{Bmatrix} \]
and combining these systems for all of the elements we will get the global system
of equations:
M φ̇ = Kφ + s
where we have:
\[ \begin{bmatrix} \frac{L}{3} & \frac{L}{6} & 0 & \cdots & 0 \\ \frac{L}{6} & \frac{2L}{3} & \frac{L}{6} & \cdots & 0 \\ 0 & \frac{L}{6} & \frac{2L}{3} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \frac{L}{6} \\ 0 & 0 & 0 & \frac{L}{6} & \frac{L}{3} \end{bmatrix}\frac{d}{dt}\begin{Bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \vdots \\ \phi_{N_x} \end{Bmatrix} = \begin{bmatrix} \frac{v}{2} & -\frac{v}{2} & 0 & \cdots & 0 \\ \frac{v}{2} & 0 & -\frac{v}{2} & \cdots & 0 \\ 0 & \frac{v}{2} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & -\frac{v}{2} \\ 0 & 0 & 0 & \frac{v}{2} & -\frac{v}{2} \end{bmatrix}\begin{Bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \vdots \\ \phi_{N_x} \end{Bmatrix} \]
In comparison to Example 13.1 it can be observed that we are including the Dirichlet
boundary node φ1 in the column vector of unknowns and also the mass matrix is
not equal to the identity matrix with the Finite Element method. It can also be
observed that the 2 × 2 ’elemental’ mass and stiffness matrices are ‘stamped’ in
place in the global mass and stiffness matrices and overlapping entries are added
up, which is a feature of the Galerkin method of weighted residuals. We can rewrite
our system more simply as:
\[ \frac{d\phi}{dt} = M^{-1}K\phi = f(\phi) \quad (15.19) \]
So we will compute and store the matrix M −1 K and use this in our function f to
evaluate the right hand side. The Matlab code to achieve this will look like:
M = sparse(N_x, N_x);
K = sparse(N_x, N_x);
for p=1:N_x-1
M(p:p+1, p:p+1) = M(p:p+1, p:p+1) + Delta_x/6 *[2, 1; 1, 2];
K(p:p+1, p:p+1) = K(p:p+1, p:p+1) + v/2 *[1,-1; 1, -1];
end
MinvK = full(M\K);
So an interesting observation that can be made is that even though we are using
an explicit time marching scheme, we still have to solve a linear system M −1 K
(albeit only once). This is one feature of the Finite Element method that is quite
different to the Finite Difference and Finite Volume methods, where we only had
entries on the main diagonal of the mass matrix. Now, using the approach we took in
implementing the fourth order Runge-Kutta method in Example 10.1, we will define
a function f to evaluate the right hand side of Equation 13.16 at the various stages
of the method. In our Matlab code, this will take the form:
function k = f(phi)
k = MinvK*phi;
k(1) = 0;
end
At this point the remainder of the algorithm is just the basic fourth order Runge-
Kutta code from Example 10.1:
for l=1:N_t-1
k1 = f(phi(:,l));
k2 = f(phi(:,l) + Delta_t/2*k1);
k3 = f(phi(:,l) + Delta_t/2*k2);
k4 = f(phi(:,l) + Delta_t *k3);
phi(:,l+1) = phi(:,l) + Delta_t *(k1/6 + k2/3 + k3/3 + k4/6);
end
As we did in Example 13.1 we will make sure that each of k1(1), k2(1), k3(1), and k4(1) is always zero such that, as long as our Dirichlet boundary condition φ(0, t) = 1 has been set, our boundary condition will have been imposed correctly.
In order to check the whether or not our solution will be stable for a particular
number of elements and time step size we can compute the eigenvalues as we did
in Example 13.1 and plot them relative to the stability region of the fourth order
Runge-Kutta method:
[Xi Lambda] = eig(MinvK);
[X, Y] = meshgrid(-4:0.1:4, -4:0.1:4);
Z = X + i*Y;
sigma = abs(1 + Z + (Z.^2)/2 + (Z.^3)/6 + (Z.^4)/24);
contourf(X, Y, sigma, [1 1]);
plot(real(diag(Lambda)*Delta_t), imag(diag(Lambda))*Delta_t);
Figure 15.3 illustrates the location of the λm ∆t terms for the combinations ∆x = 0.05, ∆t = 0.02 and ∆x = 0.02, ∆t = 0.10. In the first combination, all the terms are located within the stability
region, and in the second they are not. The corresponding effect on the solution
is shown in Figures 15.4(a) - 15.4(b). It is easily observed that for the second
combination, the simulation ‘blows up’ after just a couple of time steps, whereas
for a stable solution we see the ‘bell shaped’ initial condition is simply shifted (or
convected) along through the computational domain. To illustrate the convergence
of the solution, Table 15.1 presents the infinity norm for a range of spatial and
temporal step sizes (maintaining stability of course). As can be observed, the finer
the grid and the smaller the time step size, the lower the error in the solution (which
is of course what we could expect).
Table 15.1: The convergence of the solution, illustrating the infinity norm for a
range of spatial and temporal step sizes.
∆x ∆t ||e||∞
1.000000 1.000000 0.672490
0.500000 0.500000 0.444394
0.100000 0.100000 0.022728
0.050000 0.050000 0.003280
0.010000 0.010000 0.000148
0.005000 0.005000 0.000037
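A minimal sketch (not one of the book's listings) of how the infinity norm of the error can be computed from the stored solution, assuming x holds the node coordinates and t the time levels:

e = zeros(N_x, N_t);
for l = 1:N_t
    % error against the exact solution at time level l
    e(:,l) = phi(:,l) - (exp(-5*(x - v*t(l) - 3).^2) + 1);
end
errorInf = norm(e(:), inf);   % infinity norm over all nodes and time levels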
Having now seen a fairly simple example of the Finite Element method in ac-
tion, we are going to tackle the issue of dealing with second order terms in a PDE,
approximated with linear shape functions. A result of assuming a linear variation of
φ within an element is that while the nodal values are equal at element boundaries
(known as C 0 continuity), unfortunately the first derivatives are not equal (C 1 conti-
nuity) and hence the second derivatives do not exist at all (Figure 15.5). You might
think that we should just abandon the use of such simple elements and use higher
order elements (Figure 15.1), but to require the second order spatial derivatives to
exist everywhere is too restrictive. Fortunately there is a solution, and this involves
removing the second derivative from the weighted residual expression for the scalar
transport equation. To see how this is done, let’s define the residual function for
the generic scalar transport equation and apply the Galerkin method of weighted
residuals as we did previously:
\[ \int_\Omega W\left(\dot{\phi} + v\cdot\nabla\phi - \mu\nabla^2\phi - \psi\right)d\Omega = 0 \]
Figure 15.3: Location of the λm ∆t terms within the stability region of the fourth
order Runge-Kutta method for the PDE in Example 15.1 for (a) ∆x = 0.05 and
∆t = 0.02 (b) ∆x = 0.02 and ∆t = 0.10. It should be noted that each λm ∆t is
marked as a cross in the complex plane, but the large number of these terms makes
them appear as a solid strip. It can be observed that all of the λm ∆t terms are
purely imaginary.
Figure 15.4: The solutions to the PDE in Example 15.1 illustrating the solution at (a) l = 0 and (b) l = 200 for the combination ∆x = 0.05 and ∆t = 0.02, and at (c) l = 0 and (d) l = 8 for the combination ∆x = 0.02 and ∆t = 0.10.
Figure 15.5: The piecewise linear approximation of φ, its first derivative ∂xφ, and its second derivative ∂xxφ, plotted against x.
In order to proceed we note the generic product rule formula for differentiation:
\[ \frac{\partial}{\partial x}\left(fg\right) = \frac{\partial f}{\partial x}g + \frac{\partial g}{\partial x}f \]
which in higher spatial dimensions can be applied to two scalar fields as:
\[ \nabla\left(fg\right) = \left(\nabla f\right)g + \left(\nabla g\right)f \]
or, using Einstein summation notation, as:
\[ \frac{\partial}{\partial x_i}\left(fg\right) = \frac{\partial f}{\partial x_i}g + \frac{\partial g}{\partial x_i}f \]
If we are dealing with a scalar field and a vector field, then the equivalent generic formula is:
\[ \nabla\cdot\left(fg\right) = \nabla f\cdot g + \left(\nabla\cdot g\right)f \]
or, again using Einstein summation notation, as:
\[ \frac{\partial}{\partial x_i}\left(fg_i\right) = \frac{\partial f}{\partial x_i}g_i + \frac{\partial g_i}{\partial x_i}f \]
Substituting into this formula f = W for our scalar field and g = ∇φ for our vector field we get:
\[ \nabla\cdot\left(W\nabla\phi\right) = \nabla W\cdot\nabla\phi + W\nabla\cdot\nabla\phi = \nabla W\cdot\nabla\phi + W\nabla^2\phi \]
which can be rearranged as:
\[ W\nabla^2\phi = \nabla\cdot\left(W\nabla\phi\right) - \nabla W\cdot\nabla\phi \quad (15.20) \]
Now, as we did in the Finite Volume method, we can again make use of the Diver-
gence Theorem (Equation 14.3) and apply it to the first term on the right hand side
of Equation 15.20 to get:
\[ \int_\Omega W\mu\nabla^2\phi\,d\Omega = \int_\Gamma \mu W\nabla\phi\cdot d\Gamma - \int_\Omega \mu\nabla W\cdot\nabla\phi\,d\Omega \]
Substituting this back into the weighted residual expression we get:
\[ \int_\Omega W\dot{\phi}\,d\Omega + \int_\Omega Wv\cdot\nabla\phi\,d\Omega + \int_\Omega \mu\nabla W\cdot\nabla\phi\,d\Omega - \int_\Omega W\psi\,d\Omega = \int_\Gamma \mu W\nabla\phi\cdot d\Gamma \]
or more compactly:
\[ \int_\Omega \left(W\dot{\phi} + Wv\cdot\nabla\phi + \mu\nabla W\cdot\nabla\phi - W\psi\right)d\Omega = \int_\Gamma \mu W\nabla\phi\cdot d\Gamma \quad (15.21) \]
This form of the problem is known as the weak form since it contains only the first
derivative of the solution φ(x, t), whereas the original form of the problem (Equation
15.6) is known as the strong form and contains the second derivative. The differenti-
ation requirement on the function φ(x, t) has been weakened, hence the name ‘weak
form’. It should be noted that no approximations have been made yet, i.e. nothing
has been lost in the formulation. However, piecewise linear approximations are now
possible because we don’t need to worry about the second derivatives of the shape
functions. Another point to note is that in terms of the application of boundary
conditions, the term on the right hand side of Equation 15.21 is the integral of the
derivative of φ over the boundary, which can be thought of as integrating a Neu-
mann boundary condition over the boundary of the domain. So, in contrast to the
Finite Difference and Finite Volume methods, the Neumann boundary conditions
are automatically incorporated into the integral form of the PDE. For this reason,
Neumann boundary conditions are often called natural boundary conditions, since
they ‘naturally pop up’ in the weighted residual expression. Following from this,
Dirichlet boundary conditions are then often called essential boundary conditions.
If we now substitute in for the weighting function, substitute in our assumed
form of the solution from Equation 15.5, and consider the domain of integration to
be the domain of an element itself, the weighted residual expression can be rewritten
as:
\[ \int_{\Omega_e} \left(\eta_p\eta_q\dot{\phi}_q + \eta_p v\cdot\nabla\eta_q\,\phi_q + \mu\nabla\eta_p\cdot\nabla\eta_q\,\phi_q - \eta_p\psi\right)d\Omega = \int_{\Gamma_e} \mu\eta_p\nabla\phi\cdot d\Gamma \quad (15.22) \]
So while p and q were in the range 1 to 2 for the linear line element, they are in the range 1 to 3 for our linear triangular element. The overall system of equations
to solve our generic scalar transport equation is then given by:
\[ \sum_{e=1}^{N_e} \int_{\Omega_e} \left(\eta_p\eta_q\dot{\phi}_q + \eta_p v\cdot\nabla\eta_q\,\phi_q + \mu\nabla\eta_p\cdot\nabla\eta_q\,\phi_q - \eta_p\psi\right)d\Omega = \sum_{e=1}^{N_e} \int_{\Gamma_e} \mu\eta_p\nabla\phi\cdot d\Gamma \]
Figure 15.6: The linear triangular element, with nodal coordinates (x1, y1), (x2, y2), and (x3, y3).
where Ne is the number of elements in the grid. We can then rewrite the system of
equations in the form:
M φ̇ = Kφ + s (15.23)
where φ is now a column vector defining values at the nodes, and:
\[ M = \sum_{e=1}^{N_e} \int_{\Omega_e} \eta_p\eta_q\,d\Omega \quad (15.24) \]
\[ K = -\sum_{e=1}^{N_e} \left( \int_{\Omega_e} \mu\nabla\eta_p\cdot\nabla\eta_q\,d\Omega + \int_{\Omega_e} \eta_p v\cdot\nabla\eta_q\,d\Omega \right) \quad (15.25) \]
\[ s = \sum_{e=1}^{N_e} \left( \int_{\Omega_e} \eta_p\psi_p\,d\Omega + \int_{\Gamma_e} \mu\eta_p\nabla\phi\cdot d\Gamma \right) \quad (15.26) \]
We will now consider one more spatial dimension and derive the shape functions
for the linear triangular element, perhaps the simplest element that we can use in
2D. Considering the linear triangular element depicted in Figure 15.6 with the x
and y coordinates of its 3 nodes labeled, we begin our derivation by again assuming
a trial solution for the field variable. Just as we did for the linear line element we
will assume a linear trial function, which takes the form:
φ(x, y) = a0 + a1 x + a2 y
where a0 , a1 , and a2 are scalar coefficients. This is essentially the 2D equivalent of
the trial solution to our linear line element. If we apply this trial solution at say
node 1 then we have:
φ(x1 , y1 ) = φ1 = a0 + a1 x1 + a2 y1
which we could write as:
φ1 = p1 a
where:
a = {a0 , a1 , a2 }T
and:
p1 = {1, x1 , y1 }
If we then apply the trial solution at the other two points we get:
φ(x2 , y2 ) = φ2 = a0 + a1 x2 + a2 y2
φ(x3 , y3 ) = φ3 = a0 + a1 x3 + a2 y3
which we can write in matrix form as:
\[ \begin{Bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{Bmatrix} = \begin{bmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{bmatrix}\begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} = Ca \]
where C will be a square 3 × 3 matrix and we can solve for the unknown parameters
by computing a = C −1 φ. Doing so, we find that:
\[ a_0 = \frac{1}{2A_e}\left[\left(x_2 y_3 - x_3 y_2\right)\phi_1 + \left(x_3 y_1 - x_1 y_3\right)\phi_2 + \left(x_1 y_2 - x_2 y_1\right)\phi_3\right] \]
\[ a_1 = \frac{1}{2A_e}\left[\left(y_2 - y_3\right)\phi_1 + \left(y_3 - y_1\right)\phi_2 + \left(y_1 - y_2\right)\phi_3\right] \]
\[ a_2 = \frac{1}{2A_e}\left[\left(x_3 - x_2\right)\phi_1 + \left(x_1 - x_3\right)\phi_2 + \left(x_2 - x_1\right)\phi_3\right] \]
where Ae is the area of the element and is defined in terms of its nodal coordinates as:
\[ A_e = \frac{1}{2}\left[\left(x_2 y_3 - x_3 y_2\right) + \left(x_3 y_1 - x_1 y_3\right) + \left(x_1 y_2 - x_2 y_1\right)\right] \]
If we now substitute these coefficients back into our trial solution we can factor out
the nodal values and rewrite the solution in the form:
\[ \phi(x, y) = \sum_{n=1}^{N_n} \eta_n(x, y)\,\phi_n = \eta_1\phi_1 + \eta_2\phi_2 + \eta_3\phi_3 \quad (15.27) \]
where the ηn terms are the shape functions and for the linear triangular element are
given by:
\[ \eta_1(x, y) = \frac{1}{2A_e}\left[\left(x_2 y_3 - x_3 y_2\right) + \left(y_2 - y_3\right)x + \left(x_3 - x_2\right)y\right] \]
\[ \eta_2(x, y) = \frac{1}{2A_e}\left[\left(x_3 y_1 - x_1 y_3\right) + \left(y_3 - y_1\right)x + \left(x_1 - x_3\right)y\right] \]
\[ \eta_3(x, y) = \frac{1}{2A_e}\left[\left(x_1 y_2 - x_2 y_1\right) + \left(y_1 - y_2\right)x + \left(x_2 - x_1\right)y\right] \quad (15.28) \]
which may be more compactly written as:
\[ \eta_n(x, y) = \frac{1}{2A_e}\begin{vmatrix} 1 & x & y \\ 1 & x_{n+1} & y_{n+1} \\ 1 & x_{n+2} & y_{n+2} \end{vmatrix} \quad (15.29) \]
where the node subscripts are understood to cycle through 1, 2, 3.
So while ηn is a linear scalar function of x and y, its derivatives ∂xηn = (y_{n+1} − y_{n+2})/(2Ae) and ∂yηn = (x_{n+2} − x_{n+1})/(2Ae) are constants, so that over the three nodes of an element they form vectors which store constant values for any given element.
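A small sketch (not one of the book's listings) that evaluates these shape functions at a point inside an element with assumed nodal coordinates, and checks that they sum to one:

xn = [0.0; 1.0; 0.0];          % assumed nodal x coordinates
yn = [0.0; 0.0; 1.0];          % assumed nodal y coordinates
A_e = 0.5*((xn(2)*yn(3) - xn(3)*yn(2)) + (xn(3)*yn(1) - xn(1)*yn(3)) ...
         + (xn(1)*yn(2) - xn(2)*yn(1)));     % element area
x = 0.25; y = 0.25;            % a point within the element
eta = zeros(3,1);
for n = 1:3
    n1 = mod(n, 3) + 1;        % cyclic node indices n+1 and n+2
    n2 = mod(n+1, 3) + 1;
    eta(n) = ((xn(n1)*yn(n2) - xn(n2)*yn(n1)) ...
            + (yn(n1) - yn(n2))*x + (xn(n2) - xn(n1))*y)/(2*A_e);
end
% here eta = [0.5; 0.25; 0.25] and sum(eta) = 1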
The final step in our Finite Element method is to perform the integration of the
shape functions (or their derivatives) over the domain of each element. While this
may seem like a daunting task we will make use of two integration formulae defined
for the linear triangular element:
\[ \int_{\Omega_e} \eta_p^a\,\eta_q^b\,\eta_r^c\,d\Omega = \frac{a!\,b!\,c!\,2\Omega_e}{\left(a + b + c + 2\right)!} \quad (15.32) \]
and:
\[ \int_{\Gamma_e} \eta_p^a\,\eta_q^b\,d\Gamma = \frac{a!\,b!\,\Gamma_e}{\left(a + b + 1\right)!} \quad (15.33) \]
Where for the 2D case Ωe and Γe represent the area and an edge length of the
triangle. It is worth mentioning at this point that the use of these formulae greatly
simplifies matters. When other element types are used however, then no such for-
mulae exist and the integration of the shape functions over an element may utilize
a numerical method such as quadrature [21] and isoparametric elements. Let’s first
consider the integration of the shape functions as required in the mass matrix:
\[ M^e_{p,q} = \int_{\Omega_e} \eta_p\eta_q\,d\Omega \]
Remembering that p and q are in the range of 1 to 3 for the linear triangular element,
what we end up with is a local or ‘sub’ matrix M e , which for our linear triangular
element will be a 3 × 3 matrix. In order to evaluate each term in the matrix we
simply input the values of p and q to the integration formula in Equation 15.32. For
the case where p and q are equal (i.e. for elements on the main diagonal) we get:
\[ M^e_{p,p} = \int_{\Omega_e} \eta_p\eta_p\,d\Omega = \int_{\Omega_e} \eta_p^2\,\eta_q^0\,\eta_r^0\,d\Omega = \frac{2!\,0!\,0!\,2\Omega_e}{\left(2 + 0 + 0 + 2\right)!} = \frac{2\Omega_e}{12} \]
For the case where p and q are not equal (i.e. for elements off the main diagonal)
we get:
\[ M^e_{p,q} = \int_{\Omega_e} \eta_p\eta_q\,d\Omega = \int_{\Omega_e} \eta_p^1\,\eta_q^1\,\eta_r^0\,d\Omega = \frac{1!\,1!\,0!\,2\Omega_e}{\left(1 + 1 + 0 + 2\right)!} = \frac{\Omega_e}{12} \]
So we can write a single expression for any element in our local mass matrix as:
\[ M^e_{p,q} = \frac{\left(1 + \delta_{pq}\right)\Omega_e}{12} \]
which is simple enough that we could write out the whole thing as:
\[ M^e = \frac{\Omega_e}{12}\begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix} \quad (15.34) \]
Considering the contribution of the convective term to the stiffness matrix we have:
\[ \int_{\Omega_e} \eta_p\,v\cdot\nabla\eta_q\,d\Omega = v\cdot\frac{1}{2\Omega_e}\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix}\int_{\Omega_e} \eta_p\,d\Omega = v\cdot\frac{1}{2\Omega_e}\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix}\frac{1!\,0!\,0!\,2\Omega_e}{\left(1 + 0 + 0 + 2\right)!} = v\cdot\frac{1}{2\Omega_e}\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix}\frac{\Omega_e}{3} \]
where we are writing shape function derivative terms in the form of a column vector
for compactness; the important thing is that the dot product between the velocity
vector and the vector defining the derivative of the shape functions produces a scalar value. Considering now the contribution of the diffusive term to the stiffness matrix
we have:
\[ \int_{\Omega_e} \mu\nabla\eta_p\cdot\nabla\eta_q\,d\Omega = \int_{\Omega_e} \mu\,\frac{1}{2\Omega_e}\begin{Bmatrix} y_{p+1} - y_{p+2} \\ x_{p+2} - x_{p+1} \end{Bmatrix}\cdot\frac{1}{2\Omega_e}\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix}d\Omega \]
It can be observed that in this case every term inside the integral is a constant, so
the integration is in fact trivial:
\[ \int_{\Omega_e} \mu\nabla\eta_p\cdot\nabla\eta_q\,d\Omega = \frac{\mu}{4\Omega_e^2}\begin{Bmatrix} y_{p+1} - y_{p+2} \\ x_{p+2} - x_{p+1} \end{Bmatrix}\cdot\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix}\Omega_e \]
Again, the important point to note is that the dot product between these two vectors
produces a scalar value. So we can write our stiffness matrix more compactly as:
\[ K^e_{p,q} = -v\cdot\frac{1}{6}\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix} - \frac{\mu}{4\Omega_e}\begin{Bmatrix} y_{p+1} - y_{p+2} \\ x_{p+2} - x_{p+1} \end{Bmatrix}\cdot\begin{Bmatrix} y_{q+1} - y_{q+2} \\ x_{q+2} - x_{q+1} \end{Bmatrix} \quad (15.35) \]
which we could write out in the form of a local stiffness matrix K e , but the terms
are so long that it would hardly fit on the page. Considering now the contribution
of the source term to the load vector we have:
\[ s^e_p = \int_{\Omega_e} \eta_p\psi_p\,d\Omega = \psi_p\int_{\Omega_e} \eta_p\,d\Omega = \psi_p\,\frac{1!\,0!\,0!\,2\Omega_e}{\left(1 + 0 + 0 + 2\right)!} = \frac{\psi_p\,\Omega_e}{3} \]
and if we can assume that ψ is the same at each node we can write our contribution
to the local load vector as:
\[ s^e = \frac{\psi\,\Omega_e}{3}\begin{Bmatrix} 1 \\ 1 \\ 1 \end{Bmatrix} \quad (15.36) \]
An important point to note regarding the boundary term in the load vector is that we will only perform this integral over the boundary of the element (which in our case translates into an edge of a triangle) if that edge lies on a boundary of the domain:
\[ s^f_p = \int_{\Gamma_e} \mu\eta_p\nabla\phi\cdot d\Gamma = \mu\nabla\phi\int_{\Gamma_e} \eta_p\,d\Gamma = \mu\nabla\phi\,\frac{1!\,0!\,\Gamma_e}{\left(1 + 0 + 1\right)!} = \frac{\mu\nabla\phi\,\Gamma_e}{2} \quad (15.37) \]
So we can write our load vector as:
\[ s^f = \frac{\mu\nabla\phi\,\Gamma_e}{2}\begin{Bmatrix} 1 \\ 1 \end{Bmatrix} \]
At this point we now have all of the machinery in place to assemble a system of
equations, but let’s just take a moment to recap on what we’ve done. Depending on
our choice of element we will end up with local, 3 × 3, mass and stiffness matrices and a 3 × 1 load vector. To evaluate the terms in a local matrix we ‘loop’ over the p, q indices and compute the M^e_{p,q} and K^e_{p,q} terms, and for the load vector we ‘loop’ over the p indices and compute the s^e_p terms. So each of these terms is just a single
number that we ‘place’ in the local matrix. Once we have complete local mass and
stiffness matrices and a complete local load vector we have to add these terms into
the global mass and stiffness matrices, so the p, q indices of a particular node in
an element have to be ‘mapped’ to the global indices of that node within the grid
(Figure 15.7).
Figure 15.7: The local indexing of the nodes within an element and their corresponding global indexing within the grid (for example, element e5 is defined by the points p218, p220, and p1028).

Example 15.2:
In this example we will develop a Matlab program to solve the 2D Poisson equation:
∇2 φ + ψ = 0 (15.38)
in the domain x ∈ [0, 1], y ∈ [0, 1], with boundary conditions φ(0, y) = 1, φ(1, y) = 1,
φ(x, 0) = 1, φ(x, 1) = 1, and ψ = 10. To apply our spatial discretization we will use
the Finite Element method with linear triangular elements. The intended learning
outcomes for this example will be ‘get a feel’ for applying the Finite Element method
in 2D and observing how we can loop over the elements in our grid in order to
‘assemble’ the matrix defining our system of equations.
To begin, let’s first confirm in our minds that we have a well posed problem. Our
PDE has two second order derivative terms in it and so this translates into requir-
ing four pieces of information in order to obtain a unique solution, two boundary
conditions for each spatial derivative. Since we were given all of these, then we can
say that our problem will be well posed.
As we did with the Finite Volume method in Example 14.1 we will assume that
the unstructured grid defining the domain is already defined and will be returned
through a function called readGrid. As such we will be storing three arrays for
this problem; an array called Points which is an Np × 2 array storing the x, y
coordinates of the points defining the grid, an array called Faces which is an Nf × 2
array storing the two indices of the points defining a face in the grid, and an array
called Elements which is an Ne × 3 array storing the three indices of the points
defining an element in the grid. In contrast to the Finite Volume method example
we are not going to make any assumptions about the ordering of the points or the
faces in their respective arrays. In order to prescribe our boundary conditions we
are again going to make use of a structure to store all of the information that we
need:
Boundaries = struct('name', {}, 'type', {}, 'N', {}, 'indices', {}, 'value', {});
So because we haven’t made any assumptions about the ordering of the points, we
cannot simply use a ‘start’ index and the number of indices to define where these
might be located in the Points array. Instead we are explicitly storing the indices
of each point on a given boundary.
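For illustration, a hypothetical entry in this structure (the indices and count are assumed values, not taken from the actual grid) might look like:

Boundaries(1).name    = 'bottom';
Boundaries(1).type    = 'dirichlet';
Boundaries(1).N       = 3;
Boundaries(1).indices = [12 47 63];   % point indices lying on this boundary
Boundaries(1).value   = 1.0;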
At this point we can apply the spatial discretization that is the Finite Element
method to our PDE and we know that we will have a discrete system of the form:
Kφ = s
where:
\[ K = \sum_{e=1}^{N_e} \int_{\Omega_e} \mu\nabla\eta_p\cdot\nabla\eta_q\,d\Omega \]
\[ s = \sum_{e=1}^{N_e} \left( \int_{\Omega_e} \eta_p\psi_p\,d\Omega + \int_{\Gamma_e} \mu\eta_p\nabla\phi\cdot d\Gamma \right) \]
but because there are no Neumann boundaries in our problem the second term in
the load vector involving the integral of ∇φ over a boundary will be zero for every
element.
As with the Finite Difference method applied in Example 13.3 and the Finite
Volume method applied in Example 14.1 the core part of the method is the ‘assem-
bling’ of the matrices defining the system of equations. To assemble these matrices
we are going to need to evaluate Ωe and Γf for the elements and faces respectively.
Now in our 2D example Ωe is the area of each triangle, which we can evaluate from the nodal coordinates as:
\[ \Omega_e = \frac{1}{2}\left[\left(x_2 y_3 - x_3 y_2\right) + \left(x_3 y_1 - x_1 y_3\right) + \left(x_1 y_2 - x_2 y_1\right)\right] \]
To help evaluate the local stiffness matrices we will also define the array:
\[ \nabla\eta = \frac{1}{2\Omega_e}\begin{bmatrix} y_2 - y_3 & y_3 - y_1 & y_1 - y_2 \\ x_3 - x_2 & x_1 - x_3 & x_2 - x_1 \end{bmatrix} \quad (15.41) \]
which is simply storing the spatial derivatives of the shape functions of the element.
If we were to evaluate the first term in the local stiffness matrix, then by evaluating
Equation 15.35 we would have:
\[ K^e_{1,1} = \frac{\mu}{4\Omega_e}\begin{Bmatrix} y_2 - y_3 \\ x_3 - x_2 \end{Bmatrix}\cdot\begin{Bmatrix} y_2 - y_3 \\ x_3 - x_2 \end{Bmatrix} \quad (15.42) \]
and from examination of the layout of the array in Equation 15.41 we can see that this is equivalent to:
\[ K^e_{1,1} = \mu\begin{Bmatrix} \nabla\eta_{1,1} \\ \nabla\eta_{2,1} \end{Bmatrix}\cdot\begin{Bmatrix} \nabla\eta_{1,1} \\ \nabla\eta_{2,1} \end{Bmatrix}\Omega_e \quad (15.43) \]
In fact we can apply this expression to any p, q entry in the local stiffness matrix as:
\[ K^e_{p,q} = \mu\begin{Bmatrix} \nabla\eta_{1,p} \\ \nabla\eta_{2,p} \end{Bmatrix}\cdot\begin{Bmatrix} \nabla\eta_{1,q} \\ \nabla\eta_{2,q} \end{Bmatrix}\Omega_e \quad (15.44) \]
So what we need to do is loop over the p, q indices of each element with nested
for loops and evaluate the terms in K e . Now in order to actually implement this
in our Matlab code our assemble function is going to involve ‘looping’ over all of
the elements in the grid, then looping over all of the nodes of each element. The
algorithm will look something like:
function [K, s, phi, Free, Fixed] = assemble(K, s, phi, Points, Faces, ...
Elements, Boundaries, N_p, N_f, N_e, N_b);
...
s_e = [1; 1; 1];
for e=1:N_e
Nodes = Elements(e,:);
x = Points(Nodes,1);
y = Points(Nodes,2);
gradEta = [y(2)-y(3), y(3)-y(1), y(1)-y(2);
x(3)-x(2), x(1)-x(3), x(2)-x(1)]/(2*Omega(e));  % shape function derivatives, Equation 15.41
for p=1:3
m = Nodes(p);
gradEta_p = [gradEta(1,p), gradEta(2,p)];
for q=1:3
n = Nodes(q);
gradEta_q = [gradEta(1,q), gradEta(2,q)];
K(m,n) = K(m,n) + mu*dot(gradEta_p,gradEta_q)*Omega(e);
end
s(m) = s(m) + s_e(p)*psi*Omega(e)/3;
end
end
...
end
Where an important point to note is that the array Nodes is a 1 × 3 array defining
the global indices of the 3 nodes defining any given element e. So with reference to
Figure 15.7, when e = 5, Nodes = {218, 220, 1028}. So when we come to adding in the contribution of the local stiffness matrix and the local load vector to the
global K and s, we can access and assign values to these positions quite easily in
Matlab with the notation K(m,n), etc.
Now, in order to apply the boundary conditions we can loop over every boundary
in our structure, and loop over every point in that boundary and assign the value
into the array φ for every time step. We can do this by adding in another for loop
over the boundaries within our assemble function as:
function [K, s, phi, Free, Fixed] = assemble(K, s, phi, Points, Faces, ...
Elements, Boundaries, N_p, N_f, N_e, N_b);
...
Fixed = [];
for b=1:N_b
for p=1:Boundaries(b).N
m = Boundaries(b).indices(p);
phi(m) = Boundaries(b).value;
end
Fixed = [Fixed; Boundaries(b).indices'];
end
Free = setdiff(1:N_p, Fixed);
end
where it can be observed that as we loop over the Dirichlet boundaries we are adding
their indices to the array Fixed. Furthermore, we are then creating the array Free
using the setdiff function that will give us a list of the interior indices. At this
point, the system is completely assembled (Figure 15.8). It can be observed that
similar to the stiffness matrix for the system in Example 14.1, this matrix is sparse
and symmetric.
Figure 15.8: The pattern of the assembled stiffness matrix K using the Finite Ele-
ment method.
In terms of the Dirichlet boundary points, what we are going to do here is follow
the same idea that was used in Example 13.3 where we ‘partitioned’ the final system
of equations to solve a reduced system corresponding only to the interior points, so
that the rows corresponding to the Dirichlet boundary points won’t be included.
Conceptually we can think of our system Kφ = s as:
\[ \begin{bmatrix} K_{Free,Free} & K_{Free,Fixed} \\ K_{Fixed,Free} & K_{Fixed,Fixed} \end{bmatrix}\begin{Bmatrix} \phi_{Free} \\ \phi_{Fixed} \end{Bmatrix} = \begin{Bmatrix} s_{Free} \\ s_{Fixed} \end{Bmatrix} \]
Where the subscript Free is indicating all of the interior points where we don’t actually know the values in φ and the subscript Fixed is indicating all of the boundary
points where we do. Conceptually, we could then in fact just solve:
\[ K_{Free,Free}\,\phi_{Free} = s_{Free} - K_{Free,Fixed}\,\phi_{Fixed} \]
which will only compute the solution for the interior points and so as long as the
Dirichlet boundary points are initialized correctly in phi, then we will have imposed
the boundary conditions correctly. Now that we have shown how we assemble our
system of equations, we are now in a position to write out the core of the program
as:
[K, s, phi, Free, Fixed] = assemble(K, s, phi, Points, Faces, ...
Elements, Boundaries, N_p, N_f, N_e, N_b);
phi(Free) = K(Free,Free)\(s(Free) - K(Free,Fixed)*phi(Fixed));
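Once the solve has completed, the solution over the unstructured grid can be visualized; a short sketch (not part of the book's listings, assuming the Points, Elements, and phi arrays defined above) is:

trisurf(Elements, Points(:,1), Points(:,2), phi);   % surface plot of phi over the triangulation
xlabel('x'); ylabel('y'); zlabel('\phi');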
Example 15.3:
In this example we will develop both a Matlab and a C++ program to solve the
2D generic scalar transport equation:
φ̇ + v · ∇φ = µ∇2 φ + ψ (15.45)
Figure 15.9: The solutions to the PDE in Example 15.2, illustrating the solution (a) immediately after assembly of the global system of equations, and (b) after the system has been solved.
in the domain x ∈ [0, 1], y ∈ [0, 1], t ∈ [0, 2], with boundary conditions φ(0, y) = 1, φ(x, 0) = 1, ∂x φ(1, y) = 0, ∂y φ(x, 1) = 0, initial condition φ(x, y, 0) = e^{−50(x−0.3)^2} + 1, and v = {0.5, 0.5}, µ = 0.01, and ψ = 0.2. For the spatial discretization we will
use the Finite Element method with linear triangular elements, for the temporal
discretization we will use the implicit Euler method, and in our C++ program we will
solve the resulting linear system with the Conjugate Gradient method. The intended
learning outcome for this example will be to simply observe the application of the
Finite Element method to solve a time dependent PDE and to see how we can use
the SparseMatrix class in a C++ program.
To begin, as we did with in Example 15.2 we will assume that the unstructured
grid defining the domain is already defined and will be returned through a function
called readGrid in our Matlab program and read in our C++ program. As such we
will be storing three arrays for this problem; an array called Points which is an
Np × 2 array storing the x, y coordinates of the points defining the grid, an array
called Faces which is an Nf × 2 array storing the two indices of the points defining
a face in the grid, and an array called Elements which is an Ne × 3 array storing
the three indices of the points defining an element in the grid. As we did previously,
we are not going to make any assumptions about the ordering of the points or the
faces in their respective arrays. In order to prescribe our boundary conditions we
are again going to make use of a structure in our Matlab program to store all of the
information that we need:
Boundaries = struct('name', {}, 'type', {}, 'N', {}, 'indices', {}, 'value', {});
So, for the φ(x, 0) = 1 boundary, for example, we would have the entry:
Boundaries(1).name = 'bottom';
Boundaries(1).type = 'dirichlet';
Boundaries(1).N = 28;
Boundaries(1).indices = [5 6 7 8 9 10 11 12 13 14 15 16 17 18 ... 31];
Boundaries(1).value = 1.00000;
and for the ∂x φ(1, y) = 0 boundary, for example, we would have the entry:
Boundaries(2).name = 'right';
Boundaries(2).type = 'neumann';
Boundaries(2).N = 28;
Boundaries(2).indices = [29 30 31 32 33 34 35 36 37 38 39 40 41 42 ... 56];
Boundaries(2).value = 0.00000;
In contrast to Example 15.2 we now have Neumann boundaries present in our prob-
lem, which illustrates an important point. If we are dealing with a Dirichlet bound-
ary then indices indicates to which points in the Points array the given boundary
condition needs to be applied. If however, we are dealing with a Neumann bound-
ary, then indices indicates to which faces in the Faces array the integration of the
Neumann boundary condition needs to be performed. So depending on the type we
will interpret the indices differently and do different things with them when we
come to assembling our global system of equations. Using a struct in a Matlab
program is a good way to group and store the different ‘bits’ of data that define a
320 CHAPTER 15. FINITE ELEMENT METHODS
given boundary. In C++, this is the perfect job for a class and so we will define one
called Boundary which will take the form:
class Boundary
{
public:
Boundary()
{ }
string name_;
string type_;
int N_;
int* indices_;
double value_;
};
As can be observed, this is a fairly simple class, containing the same fields as the
Matlab struct and when we create our array Boundaries, we will be allocating
memory to store N_b of these boundary objects. So, let’s begin by implementing
our function to read in the unstructured grid in our C++ program. The contents of
the text file is going to be pretty much exactly the same as the data that was in the
readGrid function in the Matlab code and will look something like:
N_p 1093
N_f 112
N_e 2072
N_b 4
Points
0.00000 0.00000
1.00000 0.00000
...
0.33743 0.05103
Faces
0 4
4 5
...
111 0
Elements
494 778 113
495 779 114
...
383 1057 1029
Boundaries
bottom
dirichlet
28
1 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ...
1.00000
right
neumann
28
28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 ...
0.00000
321
top
neumann
28
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 ...
0.00000
left
dirichlet
29
0 3 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 ...
1.00000
This is a fairly common approach to structuring a file defining a grid, where, at the
top, the first things we define are the numbers of points, faces, elements, boundaries,
etc, so that our program can dynamically allocate the required memory and will also
‘know’ how many of each entity to read in. Now, if we were developing a very general
purpose program, we might add in more information regarding the dimensionality
of the grid, the number of points defining a face, the element types and hence the
number of points defining an element. We’re going to keep things fairly simple with
our program however and make the assumptions that we are dealing with a 2D grid,
with linear triangular elements. We will also assume that the name of the file to
read in will be passed to the function read, and so with that in mind the function
will begin with:
void read(char* filename, double**& Points, int**& Faces, int**& Elements, ...
Boundary*& Boundaries, int& N_p, int& N_f, int& N_e, int& N_b)
{
fstream file;
string temp;
file.open(filename);
where we are opening up the file, reading in the data and assigning it to the integer
variables N_p...N_b. Knowing the amount of data contained in the file we can then allocate the arrays to store this, and it can be observed that we are allocating the 2D arrays using two calls to the new operator, such that our data will be contiguous in memory. An important point to note is that in the text file, we are not interested in the text N_p...N_b, but rather the numbers alongside them. As such we create the string called temp where we will store this text, which allows us to work through the file, overwriting it each time. Now, we could just store the numbers N_p...N_b
in the file, but this little bit of text defining the meaning of each number makes the
file a bit more ‘human readable’ (i.e. we have a bit more of an idea as to what the
numbers mean). If our input file was written in binary as opposed to ascii (meaning
it’s not supposed to be human readable), then perhaps we wouldn’t bother with
this.
The next part of the function involves looping over the number of points, faces,
elements, boundaries, and reading in the data:
void read(char* filename, double**& Points, int**& Faces, int**& Elements, ...
Boundary*& Boundaries, int& N_p, int& N_f, int& N_e, int& N_b)
{
...
file >> temp;
for(int p=0; p<N_p; p++)
{
file >> Points[p][0] >> Points[p][1];
}
file >> temp;
for(int f=0; f<N_f; f++)
{
file >> Faces[f][0] >> Faces[f][1];
}
file >> temp;
for(int e=0; e<N_e; e++)
{
file >> Elements[e][0] >> Elements[e][1] >> Elements[e][2];
}
file >> temp;
for(int b=0; b<N_b; b++)
{
file >> Boundaries[b].name_ >> Boundaries[b].type_ >> Boundaries[b].N_;
Boundaries[b].indices_ = new int [Boundaries[b].N_];
for(int n=0; n<Boundaries[b].N_; n++)
{
file >> Boundaries[b].indices_[n];
323
}
file >> Boundaries[b].value_;
}
file.close();
return;
}
At this point we can apply the spatial discretization that is the Finite Element
method to our PDE and we know that we will have a semi-discrete system of the
form:
M φ̇ = Kφ + s
where:
\[ M = \sum_{e=1}^{N_e} \int_{\Omega_e} \eta_p\eta_q\,d\Omega \]
\[ K = -\sum_{e=1}^{N_e} \left( \int_{\Omega_e} \mu\nabla\eta_p\cdot\nabla\eta_q\,d\Omega + \int_{\Omega_e} \eta_p v\cdot\nabla\eta_q\,d\Omega \right) \]
\[ s = \sum_{e=1}^{N_e} \left( \int_{\Omega_e} \eta_p\psi_p\,d\Omega + \int_{\Gamma_e} \mu\eta_p\nabla\phi\cdot d\Gamma \right) \]
Applying the implicit Euler method for the temporal discretization we have:
\[ M\frac{\phi^{l+1} - \phi^l}{\Delta t} = K\phi^{l+1} + s \]
which can be rearranged to:
\[ \left(M - \Delta t\,K\right)\phi^{l+1} = M\phi^l + \Delta t\,s \]
So again, as with the Finite Difference and Finite Volume methods, we have reduced
our problem to the solution of:
Aφl+1 = b
where:
A = M − ∆tK
b = M φl + ∆ts
As with the Finite Difference method applied in Example 13.3, the Finite Volume
method applied in Example 14.1, and the Finite Element method applied in Example
15.2, the core part of the method is the ‘assembling’ of the matrices defining the
system of equations. To assemble these matrices we are going to need to evaluate
Ωe and Γf for the elements and faces respectively. Now, just as in Example 15.2, Ωe
is the area of each triangle, which we can evaluate as:
\[ \Omega_e = \frac{1}{2}\left[\left(x_2 y_3 - x_3 y_2\right) + \left(x_3 y_1 - x_1 y_3\right) + \left(x_1 y_2 - x_2 y_1\right)\right] \]
remembering that the solution within each element is defined in terms of the nodal solutions of the three points that define it.
nodes of this element and evaluate the terms in the local mass and stiffness matrices
and the load vector. Now we have seen from Equation 15.34 that the local mass
matrix is fairly simple, containing 2 on the main diagonal and 1 everywhere else.
The only thing that changes between elements is the area of the element Ωe to which
the local mass matrix is multiplied by. The same is true for the contribution of the
source term to the local load vector and we can evaluate Equation 15.36 for each
element fairly simply. The local stiffness matrix K e is a bit more complicated, so
let’s look at how we create that. To help with this we will define the array:
\[ \nabla\eta = \frac{1}{2\Omega_e}\begin{bmatrix} y_2 - y_3 & y_3 - y_1 & y_1 - y_2 \\ x_3 - x_2 & x_1 - x_3 & x_2 - x_1 \end{bmatrix} \quad (15.46) \]
which is simply storing the spatial derivatives of the shape functions of the element.
If we were to evaluate the first term in the local stiffness matrix, then by evaluating
Equation 15.35 we would have:
\[ K^e_{1,1} = -v\cdot\frac{1}{6}\begin{Bmatrix} y_2 - y_3 \\ x_3 - x_2 \end{Bmatrix} - \frac{\mu}{4\Omega_e}\begin{Bmatrix} y_2 - y_3 \\ x_3 - x_2 \end{Bmatrix}\cdot\begin{Bmatrix} y_2 - y_3 \\ x_3 - x_2 \end{Bmatrix} \quad (15.47) \]
and from examination of the layout of the array in Equation 15.46 we can see that this is equivalent to:
\[ K^e_{1,1} = -v\cdot\frac{1}{3}\begin{Bmatrix} \nabla\eta_{1,1} \\ \nabla\eta_{2,1} \end{Bmatrix}\Omega_e - \mu\begin{Bmatrix} \nabla\eta_{1,1} \\ \nabla\eta_{2,1} \end{Bmatrix}\cdot\begin{Bmatrix} \nabla\eta_{1,1} \\ \nabla\eta_{2,1} \end{Bmatrix}\Omega_e \quad (15.48) \]
In fact the layout is such that we can apply this expression to any p, q entry in the local stiffness matrix as:
\[ K^e_{p,q} = -v\cdot\frac{1}{3}\begin{Bmatrix} \nabla\eta_{1,q} \\ \nabla\eta_{2,q} \end{Bmatrix}\Omega_e - \mu\begin{Bmatrix} \nabla\eta_{1,p} \\ \nabla\eta_{2,p} \end{Bmatrix}\cdot\begin{Bmatrix} \nabla\eta_{1,q} \\ \nabla\eta_{2,q} \end{Bmatrix}\Omega_e \quad (15.49) \]
So what we need to do is loop over the p, q indices of each element with nested
for loops and evaluate the terms in K e . Now in order to actually implement this
in our Matlab code our assemble function is going to involve ‘looping’ over all of
the elements in the grid, then looping over all of the nodes of each element. The
algorithm will look something like:
function [M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, Points, Faces, Elements, ...
Boundaries, N_p, N_f, N_e, N_b)
...
M_e = [2, 1, 1;
1, 2, 1;
1, 1, 2];
s_e = [1; 1; 1];
for e=1:N_e
Nodes = Elements(e,:);
x = Points(Nodes,1);
y = Points(Nodes,2);
gradEta = [y(2)-y(3), y(3)-y(1), y(1)-y(2);
x(3)-x(2), x(1)-x(3), x(2)-x(1)]/(2*Omega(e));  % shape function derivatives, Equation 15.46
for p=1:3
m = Nodes(p);
gradEta_p = [gradEta(1,p), gradEta(2,p)];
for q=1:3
n = Nodes(q);
gradEta_q = [gradEta(1,q), gradEta(2,q)];
M(m,n) = M(m,n) + M_e(p,q) *Omega(e)/12;
K(m,n) = K(m,n) - dot(v,gradEta_q)*Omega(e)/3 ...
- mu*dot(gradEta_p,gradEta_q)*Omega(e);
end
s(m) = s(m) + s_e(p)*psi*Omega(e)/ 3;
end
end
...
end
Where an important point to note is that the array Nodes is a 1 × 3 array defining
the global indices of the 3 nodes defining any given element e. So with reference to
Figure 15.7, when e = 5, Nodes = {218, 220, 1028}. So when we come to adding in
the contribution of the local mass and stiffness matrices and the local load vector
to the global M, K, and s, we can access and assign values to these positions quite
easily in Matlab with the notation K(m,n), etc.
Now, in order to apply the boundary conditions we can loop over every boundary
in our structure, and if the boundary is a Neumann boundary, we can loop over
every face in that boundary, evaluate the integral term in Equation 15.37 and add
the contribution to entries in the load vector corresponding to the two nodes that
define that face. If the boundary is a Dirichlet boundary we can loop over every
point in that boundary and assign the value into the array φ for every time step. We
can do this by adding in another for loop over the boundaries within our assemble
function as:
function [M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, Points, Faces, Elements, ...
Boundaries, N_p, N_f, N_e, N_b)
...
Fixed = [];
for b=1:N_b
if strcmp(Boundaries(b).type, ’neumann’)
for f=1:Boundaries(b).N;
Nodes = Faces(Boundaries(b).indices(f),:);
for p=1:2
m = Nodes(p);
s(m) = s(m) + mu*Boundaries(b).value*Gamma(Boundaries(b).indices(f))/2;
end
end
elseif strcmp(Boundaries(b).type, ’dirichlet’)
for p=1:Boundaries(b).N
m = Boundaries(b).indices(p);
phi(m,:) = Boundaries(b).value;
end
Fixed = [Fixed; Boundaries(b).indices'];
end
end
Free = setdiff(1:N_p, Fixed);
end
where it can be observed that as we loop over the Dirichlet boundaries we are
adding their indices to the array Fixed. Furthermore, we are then creating the array
Free using the setdiff function that will give us a list of the interior and Neumann
boundary indices. At this point, the system is completely assembled (Figure 15.10).
It can be observed that similar to the stiffness matrix for the system in Example 14.1,
this matrix is sparse and symmetric. As it happens, the mass matrix will have the
same pattern as the stiffness matrix depicted in Figure 15.10, illustrating a difference
between the Finite Difference, Finite Volume, and Finite Element methods applied
to solving the same problem. For the Finite Difference method, M was the identity
matrix, for the Finite Volume method, M was a diagonal matrix with the area of
each cell on the main diagonal, and for the Finite Element method, M involves the
area of each element, but in a more complex manner.
Figure 15.10: The pattern of the assembled stiffness matrix K using the Finite
Element method.
In our C++ program, we will be using our SparseMatrix class from Example
1.1 to store the mass and stiffness matrices, and A. Remembering that we need to
initialize these objects, assemble the coefficients with the overloaded () operator,
then finalize them, the algorithm will look something like:
void assemble(SparseMatrix& M, SparseMatrix& K, double* s, double* phi, ...
int* Free, int* Fixed, double** Points, int** Faces, int** Elements, ...
Boundary* Boundaries, int& N_p, int& N_f, int& N_e, int& N_b)
{
...
double M_e[3][3] = {{2.0, 1.0, 1.0}, {1.0, 2.0, 1.0}, {1.0, 1.0, 2.0}};
double s_e[3] = {1.0, 1.0, 1.0};
...
M.initialize(N_p, 10);
K.initialize(N_p, 10);
for(int e=0; e<N_e; e++)
{
for(int p=0; p<3; p++)
{
Nodes[p]= Elements[e][p];
x[p] = Points[Nodes[p]][0];
y[p] = Points[Nodes[p]][1];
}
gradEta[0][0] = (y[1]-y[2])/(2*Omega[e]);
gradEta[0][1] = (y[2]-y[0])/(2*Omega[e]);
gradEta[0][2] = (y[0]-y[1])/(2*Omega[e]);
gradEta[1][0] = (x[2]-x[1])/(2*Omega[e]);
gradEta[1][1] = (x[0]-x[2])/(2*Omega[e]);
gradEta[1][2] = (x[1]-x[0])/(2*Omega[e]);
for(int p=0; p<3; p++)
{
m = Nodes[p];
gradEta_p[0] = gradEta[0][p];
gradEta_p[1] = gradEta[1][p];
for(int q=0; q<3; q++)
{
n = Nodes[q];
gradEta_q[0]= gradEta[0][q];
gradEta_q[1]= gradEta[1][q];
M(m,n) += M_e[p][q]*Omega[e]/12;
K(m,n) -= ((v[0]*gradEta_q[0]+v[1]*gradEta_q[1])/6
+ mu*(gradEta_p[0]*gradEta_q[0]+gradEta_p[1]*gradEta_q[1])*Omega[e]);
}
s[m] += s_e[p]*psi*Omega[e]/3;
}
}
for(int b=0; b<N_b; b++)
{
if (Boundaries[b].type_=="neumann")
{
for(int f=0; f<Boundaries[b].N_; f++)
{
for(int p=0; p<2; p++)
{
Nodes[p] = Faces[Boundaries[b].indices_[f]][p];
m = Nodes[p];
s[m] += mu*Boundaries[b].value_*Gamma[f]/2;
}
}
}
else if (Boundaries[b].type_=="dirichlet")
{
for(int p=0; p<Boundaries[b].N_; p++)
{
m = Boundaries[b].indices_[p];
phi[m] = Boundaries[b].value_;
Free[m] = false;
Fixed[m]= true;
}
}
}
K.finalize();
M.finalize();
...
}
Marking the Dirichlet nodes as Fixed in this way means that the solver will only
compute the solution for the free points, and so as long as the Dirichlet boundary
points are initialized correctly in phi, then we will have imposed
the boundary conditions correctly. Now that we have shown how we assemble our
system of equations, we are now in a position to write out the core of the program
as:
[M, K, s, phi, Free, Fixed] = assemble(M, K, s, phi, Points, Faces, Elements, ...
Boundaries, N_p, N_f, N_e, N_b);
A = M - Delta_t*K;
for l=1:N_t-1
b = M*phi(:,l) + Delta_t*s;
phi(Free,l+1) = A(Free,Free)\(b(Free) - A(Free,Fixed)*phi(Fixed,l+1));
end
Now, in our Matlab code we could simply write A=M-Delta_t*K in order to define
the matrix A, but obviously we can’t ‘get away’ with such a concise piece of code
in C++. In order to make the necessary computation as simple as possible the first
member function that we are going to add will overload the = operator so that we
can have the line of code A=M; in our program and the resulting function call will
copy all of the data stored in the val_, col_, row_ arrays, etc. As such the member
function will look like:
void SparseMatrix::operator= (const SparseMatrix& A)
{
if(val_) delete [] val_;
if(col_) delete [] col_;
if(row_) delete [] row_;
if(nnzs_) delete [] nnzs_;
N_row_ = A.N_row_;
N_nz_ = A.N_nz_;
N_nz_rowmax_ = A.N_nz_rowmax_;
N_allocated_ = A.N_allocated_;
val_ = new double [N_allocated_];
col_ = new int [N_allocated_];
row_ = new int [N_row_+1];
memcpy(val_, A.val_, N_nz_ *sizeof(double));
memcpy(col_, A.col_, N_nz_ *sizeof(int));
memcpy(row_, A.row_, (N_row_+1) *sizeof(int));
}
It can be observed here that our SparseMatrix object will receive a constant refer-
ence to another SparseMatrix object as its input argument, delete its own arrays if
they have been allocated, allocate memory to store all of the data, then finally copy
it. An important assumption that we have made here is that the matrix that we
are copying has been finalized, such that we don’t have to worry about copying the
nnzs array that was used temporarily in the assembly process. If we were develop-
ing a more general purpose class , then we would have to develop more complex
functions to deal with these situations. The second member function that we will
add will then subtract one matrix from another, but will allow us to multiply this
matrix by a constant before subtracting the elements:
void SparseMatrix::subtract(double u, SparseMatrix& A)
{
for(int k=0; k<N_nz_; k++)
{
val_[k] -= (u*A.val_[k]);
}
return;
}
As can be observed this is actually a fairly simple function because all we have to
do is loop over every element in the val arrays and subtract one from the other.
There are however, a couple of points worthy of mention. The first is that we
are making the assumption in this function that the two matrices have the same
nonzero entries. For our particular problem this will be the case because the mass
and stiffness matrices do have the same pattern. If we were attempting to develop
a class that was more general purpose and could work with matrices that have
different patterns, then this function would obviously have to be a bit more complex
and check the row and column indices etc and insert new entries if they weren’t
already there. A second point is that this function is quite specific to our program
in that we are subtracting a constant multiplied by another matrix. Again, if we
were developing a more general purpose class , then we might create many more
of these member functions, for example:
void SparseMatrix::subtract(SparseMatrix& A);
void SparseMatrix::add(double u, SparseMatrix& A);
void SparseMatrix::add(SparseMatrix& A);
plus any other operations that we think could be useful. Following the use of the =
and subtract functions (i.e. A = M; followed by A.subtract(Delta_t, K);, mirroring
the Matlab statement A = M - Delta_t*K) we will have correctly assembled the matrix A. The
final member function that we will add will perform a matrix vector multiplication,
taking as an input a 1D array defining the vector that the matrix should be mul-
tiplied by, and a vector that is the output of the matrix vector multiplication. We
can achieve quite an efficient implementation of this algorithm because we only have
to loop over the nonzero entries in the val array and as such the member function
will look like:
void SparseMatrix::multiply(double* u, double* v)
{
for(int m=0; m<N_row_; m++)
{
u[m] = 0.0;
for(int k=row_[m]; k<row_[m+1]; k++)
{
u[m] += val_[k]*v[col_[k]];
}
}
return;
}
where it can be observed that we loop over each row in the matrix, then loop over
all of the non zero columns and add the corresponding term into the output vector
u. An important assumption that we are making here is that both u and v are of
the correct size. Another issue to consider is the case where we only want to use
the free indices in the matrix vector multiplication. As such we will create another
version of this function which will take two additional arguments defining the rows
and columns of the matrix that we want to use in any matrix vector product. It is
in this way that we can use the Free and Fixed indices to partition the matrix in
the implementation of our Conjugate Gradient method. The code for this member
function will look like:
void SparseMatrix::multiply(double* u, double* v, int* includerows, int* includecols)
{
for(int m=0; m<N_row_; m++)
{
u[m] = 0.0;
if(includerows[m])
{
for(int k=row_[m]; k<row_[m+1]; k++)
{
if(includecols[col_[k]])
{
u[m] += val_[k]*v[col_[k]];
}
}
}
}
return;
}
With these member functions now defined we can look at the time marching loop
in our program. After the first multiply statement, the 1D array b will contain the
matrix vector product M φl , then after the for loop M φl + ∆ts. With this array
evaluated for any given time step, we call the solve function, which will look like:
void solve(SparseMatrix& A, double* phi, double* b, int* Free, int* Fixed)
{
...
A.multiply(Aphi, phi, Free, Free);
for(m=0; m<N_row; m++)
{
if(Free[m])
{
r_old[m] = b[m] - Aphi[m];
d[m] = r_old[m];
r_oldTr_old+= r_old[m]*r_old[m];
}
}
r_norm = sqrt(r_oldTr_old);
while(r_norm>tolerance && k<maxIterations)
{
dTAd = 0.0;
A.multiply(Ad, d, Free, Free);
for(m=0; m<N_row; m++)
{
if(Free[m])
{
dTAd += d[m]*Ad[m];
}
}
alpha = r_oldTr_old/dTAd;
for(m=0; m<N_row; m++)
{
if(Free[m])
{
phi[m] += alpha*d[m];
}
}
for(m=0; m<N_row; m++)
{
if(Free[m])
{
r[m] = r_old[m] - alpha*Ad[m];
}
}
rTr = 0.0;
for(m=0; m<N_row; m++)
{
if(Free[m])
{
rTr += r[m]*r[m];
}
}
beta = rTr/r_oldTr_old;
for(m=0; m<N_row; m++)
{
if(Free[m])
{
d[m] = r[m] + beta*d[m];
}
}
for(m=0; m<N_row; m++)
{
if(Free[m])
{
r_old[m] = r[m];
}
}
r_oldTr_old = rTr;
r_norm = sqrt(rTr);
k++;
}
return;
}
where the only differences compared to Example 3.5 are that, firstly, we use the
multiply member function of the SparseMatrix class to compute any matrix-vector
products and, secondly, we only use the free indices in computing these.
The complete program is given in Example15_3.cpp. Figures 15.11(a) - 15.11(d)
illustrate the solution at a number of different time steps. It can be observed that as
time progresses the bell shaped surface (which is the initial condition) moves through
the domain (due to the convective term) and spreads out (due to the diffusive term),
and the domain as a whole rises (due to the source term).
This example has been the most complex one that we’ve encountered thus far,
combining numerical methods for solving a PDE, system of ODEs, and the linear
equations at each time step that result from the full discretization. It can be ob-
served that we ‘tweak’ the basic methods (or at least the codes implementing them)
in various ways such that these components ‘stitch’ together nicely. It should be
remembered however that we are striking a balance in terms of efficiency and ease
of understanding.
With other types of elements, the same ideas that we’ve just used still apply,
it’s just that the form of the shape functions may look different. For example, if we
were to derive shape functions in 3D for a linear tetrahedron say, we would start
with the trial solution:
\[
\phi(x, y, z) = a_0 + a_1 x + a_2 y + a_3 z
\]
Figure 15.11: The solutions to the PDE in Example 15.3 illustrating the solution at
(a) t = 0, (b) t = 0.5, (c) t = 1.0 and (d) t = 1.5.
336 CHAPTER 15. FINITE ELEMENT METHODS
and using the same procedure as we did for the linear triangle we would end up with
the shape functions:
\[
\eta_1(x, y, z) = \frac{1}{6V_e}\Big[\, x_2(y_3 z_4 - y_4 z_3) + x_3(y_4 z_2 - y_2 z_4) + x_4(y_2 z_3 - y_3 z_2)
+ \big((y_4 - y_2)(z_3 - z_2) - (y_3 - y_2)(z_4 - z_2)\big)x
+ \big((x_3 - x_2)(z_4 - z_2) - (x_4 - x_2)(z_3 - z_2)\big)y
+ \big((x_4 - x_2)(y_3 - y_2) - (x_3 - x_2)(y_4 - y_2)\big)z \,\Big]
\]
\[
\eta_2(x, y, z) = \frac{1}{6V_e}\Big[\, x_1(y_4 z_3 - y_3 z_4) + x_3(y_1 z_4 - y_4 z_1) + x_4(y_3 z_1 - y_1 z_3)
+ \big((y_3 - y_1)(z_4 - z_3) - (y_3 - y_4)(z_1 - z_3)\big)x
+ \big((x_4 - x_3)(z_3 - z_1) - (x_1 - x_3)(z_3 - z_4)\big)y
+ \big((x_3 - x_1)(y_4 - y_3) - (x_3 - x_4)(y_1 - y_3)\big)z \,\Big]
\]
\[
\eta_3(x, y, z) = \frac{1}{6V_e}\Big[\, x_1(y_2 z_4 - y_4 z_2) + x_2(y_4 z_1 - y_1 z_4) + x_4(y_1 z_2 - y_2 z_1)
+ \big((y_2 - y_4)(z_1 - z_4) - (y_1 - y_4)(z_2 - z_4)\big)x
+ \big((x_1 - x_4)(z_2 - z_4) - (x_2 - x_4)(z_1 - z_4)\big)y
+ \big((x_2 - x_4)(y_1 - y_4) - (x_1 - x_4)(y_2 - y_4)\big)z \,\Big]
\]
\[
\eta_4(x, y, z) = \frac{1}{6V_e}\Big[\, x_1(y_3 z_2 - y_2 z_3) + x_2(y_1 z_3 - y_3 z_1) + x_3(y_2 z_1 - y_1 z_2)
+ \big((y_1 - y_3)(z_2 - z_1) - (y_1 - y_2)(z_3 - z_1)\big)x
+ \big((x_2 - x_1)(z_1 - z_3) - (x_3 - x_1)(z_1 - z_2)\big)y
+ \big((x_1 - x_3)(y_2 - y_1) - (x_1 - x_2)(y_3 - y_1)\big)z \,\Big] \qquad (15.50)
\]
where Ve is the volume of the element and is defined in terms of its nodal coordinates
as:
\[
6V_e = \begin{vmatrix} 1 & x_1 & y_1 & z_1 \\ 1 & x_2 & y_2 & z_2 \\ 1 & x_3 & y_3 & z_3 \\ 1 & x_4 & y_4 & z_4 \end{vmatrix}
\]
So we can see here that the shape functions for the linear tetrahedron are similar
in form to those for the linear triangle, but have a few more terms in them and
unsurprisingly involve the volume of the element rather than the area of the element.
Analogously, for the linear tetrahedron we could express the derivatives of the shape
functions as:
\[
\begin{aligned}
\frac{\partial \eta_1}{\partial x} &= \frac{1}{6V_e}\big((y_4 - y_2)(z_3 - z_2) - (y_3 - y_2)(z_4 - z_2)\big) &
\frac{\partial \eta_1}{\partial y} &= \frac{1}{6V_e}\big((x_3 - x_2)(z_4 - z_2) - (x_4 - x_2)(z_3 - z_2)\big) \\
\frac{\partial \eta_1}{\partial z} &= \frac{1}{6V_e}\big((x_4 - x_2)(y_3 - y_2) - (x_3 - x_2)(y_4 - y_2)\big) &
\frac{\partial \eta_2}{\partial x} &= \frac{1}{6V_e}\big((y_3 - y_1)(z_4 - z_3) - (y_3 - y_4)(z_1 - z_3)\big) \\
\frac{\partial \eta_2}{\partial y} &= \frac{1}{6V_e}\big((x_4 - x_3)(z_3 - z_1) - (x_1 - x_3)(z_3 - z_4)\big) &
\frac{\partial \eta_2}{\partial z} &= \frac{1}{6V_e}\big((x_3 - x_1)(y_4 - y_3) - (x_3 - x_4)(y_1 - y_3)\big) \\
\frac{\partial \eta_3}{\partial x} &= \frac{1}{6V_e}\big((y_2 - y_4)(z_1 - z_4) - (y_1 - y_4)(z_2 - z_4)\big) &
\frac{\partial \eta_3}{\partial y} &= \frac{1}{6V_e}\big((x_1 - x_4)(z_2 - z_4) - (x_2 - x_4)(z_1 - z_4)\big) \\
\frac{\partial \eta_3}{\partial z} &= \frac{1}{6V_e}\big((x_2 - x_4)(y_1 - y_4) - (x_1 - x_4)(y_2 - y_4)\big) &
\frac{\partial \eta_4}{\partial x} &= \frac{1}{6V_e}\big((y_1 - y_3)(z_2 - z_1) - (y_1 - y_2)(z_3 - z_1)\big) \\
\frac{\partial \eta_4}{\partial y} &= \frac{1}{6V_e}\big((x_2 - x_1)(z_1 - z_3) - (x_3 - x_1)(z_1 - z_2)\big) &
\frac{\partial \eta_4}{\partial z} &= \frac{1}{6V_e}\big((x_1 - x_3)(y_2 - y_1) - (x_1 - x_2)(y_3 - y_1)\big)
\end{aligned} \qquad (15.51)
\]
Furthermore, we have the integration formulae defined for the linear tetrahedron:
\[
\int_{\Omega_e} \eta_p^a\, \eta_q^b\, \eta_r^c\, \eta_s^d\; d\Omega = \frac{a!\, b!\, c!\, d!\; 6\Omega_e}{(a + b + c + d + 3)!} \qquad (15.52)
\]
and:
\[
\int_{\Gamma_e} \eta_p^a\, \eta_q^b\, \eta_r^c\; d\Gamma = \frac{a!\, b!\, c!\; 2\Gamma_e}{(a + b + c + 2)!} \qquad (15.53)
\]
where for the 3D case Ωe and Γe represent the volume and face area of a tetrahedron,
respectively.
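As a quick worked example of Equation 15.52, the entries of the consistent mass matrix for the linear tetrahedron follow immediately:
\[
\int_{\Omega_e} \eta_p\, \eta_q\; d\Omega = \frac{1!\,1!\;6\Omega_e}{(1+1+3)!} = \frac{\Omega_e}{20} \quad (p \neq q), \qquad
\int_{\Omega_e} \eta_p^2\; d\Omega = \frac{2!\;6\Omega_e}{(2+3)!} = \frac{\Omega_e}{10},
\]
the three-dimensional analogues of the Ω_e/12 and Ω_e/6 entries of M_e for the linear triangle.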
As was mentioned earlier we have restricted ourselves to only investigating a very
small portion of the ‘wider world’ of the Finite Element method. The main reason
for this restriction is that it is easier to investigate one way in which we can develop
an algorithm, so that there is a basic understanding in place, and then progress to
studying other aspects at a later stage. For the interested reader some excellent
references for more detailed aspects of Finite Element methods can be found in the
books by [78], [76], [77].
Chapter 16
Spectral Methods
Having now progressed from the simple Finite Difference methods for solving a PDE on a
regular structured grid, to the more complex Finite Volume and Finite Element
methods on an unstructured grid, we are now going to return to using a regular
structured grid for our spatial discretization. One feature of these methods that
sets them apart from others that we have studied thus far is that Spectral methods
can be categorized as global methods, whereas the first three were local methods.
To elaborate on this idea, with the Finite Difference, Finite Volume, and Finite
Element methods, the solution at a particular location in the grid only depended
upon the solution at a few of its immediate neighbors. With the Finite Difference
method for example, the solution at a grid point i, j depends upon the solution at
i, j ± m grid points, and this gave us the coupled system of ODEs. Similarly with
the Finite Volume method, the solution in a cell depends only upon the solution
in neighboring cells. Spectral methods by contrast are formulated in such a way
that the solution at grid point i, j depends upon the solution throughout the entire
computational domain. Another feature of Spectral methods which sets them apart
is that our discretization procedure involves changing the basis of the representation
of the data. What we mean here is that for the Finite Difference, Finite Volume,
and Finite Element methods, our system of ODEs was still in terms of φ, which is
a function of space and time. With Spectral methods however, as we shall see, our
system of ODEs is going to involve functions of complex frequency space and time.
Similar to Finite Element methods, a key step in the formulation of the method is
via the method of weighted residuals, and it is possible to utilize either the Galerkin,
Collocation, or Tau, methods. Furthermore, just as one has the option to use many
different shape functions in a Finite Element method, Spectral methods can be
formulated using either a Fourier Series [20], Chebyshev polynomials [8], or Legendre
polynomials [27]. The use of the former usually implies that the PDE has periodic
boundary conditions over the entire boundary of the domain, while the latter two
can be applied to more general Dirichlet and Neumann boundary conditions. The
point being made here is that as with the Finite Element formulation developed in
Chapter 15, we are presenting one way in which a Spectral method can be developed,
and as such will restrict ourselves to periodic problems, using a Fourier series and
the Galerkin method of weighted residuals.
Perhaps the simplest place to begin the derivation of the method is to recall the
Fourier series for an arbitrary function, defined in the domain x ∈ [0, Lx ]:
\[
\phi(x) = \frac{a_0}{2} + \sum_{p=1}^{\infty}\left( a_p \cos\frac{2\pi p x}{L_x} + b_p \sin\frac{2\pi p x}{L_x} \right)
\]
\[
a_0 = \frac{2}{L_x}\int_0^{L_x} \phi(x)\, dx \qquad (16.1)
\]
\[
a_p = \frac{2}{L_x}\int_0^{L_x} \phi(x) \cos\frac{2\pi p x}{L_x}\, dx \qquad (16.2)
\]
\[
b_p = \frac{2}{L_x}\int_0^{L_x} \phi(x) \sin\frac{2\pi p x}{L_x}\, dx \qquad (16.3)
\]
So we are saying here that the function φ(x) can be represented by an infinite number
of sin and cos functions of different frequencies and amplitudes. For our purposes
it is going to be easier to work with the complex form of the Fourier series, and to
get to this form we make use of Euler’s formula:
\[
\cos(\theta) = \frac{e^{i\theta} + e^{-i\theta}}{2}, \qquad \sin(\theta) = \frac{e^{i\theta} - e^{-i\theta}}{2i}
\]
so that we get:
\[
\phi(x) = \frac{a_0}{2} + \frac{1}{2}\sum_{p=1}^{\infty}(a_p - i b_p)\, e^{\frac{2\pi i p x}{L_x}} + \frac{1}{2}\sum_{p=1}^{\infty}(a_p + i b_p)\, e^{-\frac{2\pi i p x}{L_x}}
\]
In order to only have one exponential term in the expansion we change the dummy
index in the first summation by setting p = −p, thus:
\[
\phi(x) = \frac{a_0}{2} + \frac{1}{2}\sum_{p=-1}^{-\infty}(a_{-p} - i b_{-p})\, e^{-\frac{2\pi i p x}{L_x}} + \frac{1}{2}\sum_{p=1}^{\infty}(a_p + i b_p)\, e^{-\frac{2\pi i p x}{L_x}}
\]
From the definition of ap and bp in Equations 16.2 and 16.3 respectively we have the
property that a−p = ap and b−p = −bp . So if we define:
\[
\Phi_0 = \frac{a_0}{2}, \qquad \Phi_p = \frac{a_p + i b_p}{2}
\]
we get the complex form of the Fourier series:
\[
\phi(x) = \sum_{p=-\infty}^{+\infty} \Phi_p\, e^{-\frac{2\pi i p x}{L_x}}
\]
where:
\[
\Phi_p = \frac{1}{L_x}\int_0^{L_x} \phi(x)\left( \cos\frac{2\pi p x}{L_x} + i \sin\frac{2\pi p x}{L_x} \right) dx
\]
or, again making use of Euler’s formula:
\[
\Phi_p = \frac{1}{L_x}\int_0^{L_x} \phi(x)\, e^{\frac{2\pi i p x}{L_x}}\, dx
\]
So the Φp terms are complex numbers that represent the amplitude and phase
of the different sinusoidal components of the input φ.
Figure 16.1: (a) An arbitrary periodic function φ defined on a 1D spatial domain and (b)
the complex coefficients of the discrete Fourier transform of φ.
Now because we are interested in numerical methods in this course, our spatial
domain is going to be comprised of a finite number of discrete points, and in fact our
domain will be a regularly spaced structured grid as we had for the Finite Difference
methods. If we assume that our grid in 1D is composed of Nx points, then we change
our approximation to be:
\[
\phi(x, t) \approx \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \Phi_p(t)\, e^{\frac{2\pi i p x}{L_x}} \qquad (16.4)
\]
which is the discrete Fourier series representation of φ, the coefficients Φp(t) being given by the forward discrete Fourier transform. So the basic idea behind our
Spectral method is that the field variable φ(x, t) can be represented by a discrete
Fourier series (Figures 16.1(a) - 16.1(b)). So the important approximation that
we have made is that instead of using an infinite number of sin and cos terms to
reconstruct φ we are using Nx . An important issue that we need to raise at this
point is that of the Nyquist criterion [36], which states that we can only resolve
frequencies up to half the number of sample points used. So in actual fact, although
our discrete Fourier transform gives us Nx terms, we are only really getting Nx /2
useful frequencies from our discrete Fourier transform. To understand this, Figure
16.2(a) illustrates the real part of Φ for the function depicted in Figure 16.1(a).
We can see here that the Φp coefficients are mirrored about the y axis, so while
our discrete Fourier transform gave us Nx coefficients, half of them are redundant.
So in actual fact we could quite correctly reconstruct our waveform using just the
coefficients ranging from 0 < p ≤ Nx /2, if we divide by Nx /2 instead of Nx in
Equation 16.4. The only little trick is that we don’t do this for the case where
p = 0. It is worth pointing out that this normalization by 1/Nx is merely convention
and could be placed in either the forward or the inverse discrete Fourier transform,
or both could have a normalization of $1/\sqrt{N_x}$. The only requirement is that the
product of the two be 1/Nx .
A final point concerning the discrete Fourier transform worth mentioning at this
time is that it is also quite common for the discrete Fourier transform to be denoted:
\[
\Phi_p(t) = \sum_{n=0}^{N_x - 1} \phi_n(t)\, e^{-\frac{2\pi i p n}{N_x}}
\]
where the limits on the summation are different, but the same number of
points are involved. As can be observed in Figure 16.2(b) however, we are still
getting the same Fourier coefficients, just in a different order.
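In Matlab, for example, the built-in fft returns the coefficients in this second ordering, and the fftshift function (used later in Example 16.2) reorders them into the centered ordering of Figure 16.2(a). A minimal sketch, assuming phi is a hypothetical vector of sampled values:
PHI = fftshift(fft(phi)); % forward transform, reordered into the centered ordering
phi2 = ifft(ifftshift(PHI)); % undo the reordering before inverting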
Having made this substitution for our field variable φ we can now compute the
derivatives in the same way that we did with the Finite Element method:
Figure 16.2: The real part of the discrete Fourier transform when the limits are
taken from (a) [−Nx /2 + 1, Nx /2] and (b) [0, Nx − 1].
\[
\frac{\partial \phi}{\partial t} = \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \frac{\partial \Phi_p}{\partial t}\, e^{\frac{2\pi i p}{L_x}x}
\]
\[
\frac{\partial \phi}{\partial x} = \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \frac{2\pi i p}{L_x}\, \Phi_p\, e^{\frac{2\pi i p}{L_x}x}
\]
\[
\frac{\partial^2 \phi}{\partial x^2} = \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \left( \frac{2\pi i p}{L_x} \right)^2 \Phi_p\, e^{\frac{2\pi i p}{L_x}x}
\]
with similar expressions in y and z. For the source term we assume that like φ, it
can be represented as a Fourier series:
\[
\psi = \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \Psi_p(t)\, e^{\frac{2\pi i p}{L_x}x}
\]
where as with Φp , Ψp represent the complex coefficients. Let’s now consider the
generic scalar transport equation defined in only one spatial dimension and substi-
tute these expressions for the derivatives. Doing so we get:
\[
\frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \frac{d\Phi_p(t)}{dt}\, e^{\frac{2\pi i p}{L_x}x}
+ \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} v\, \frac{2\pi i p}{L_x}\, \Phi_p(t)\, e^{\frac{2\pi i p}{L_x}x}
= \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \mu \left( \frac{2\pi i p}{L_x} \right)^2 \Phi_p(t)\, e^{\frac{2\pi i p}{L_x}x}
+ \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \Psi_p(t)\, e^{\frac{2\pi i p}{L_x}x}
\]
The next step in the formulation of our Spectral method is to apply the Galerkin
method of weighted residuals, but before we do so it will be worth taking a brief
detour to extend our notion of basis functions. With the Finite Element method we
defined our solution as:
\[
\phi(x, t) = \sum_{p=1}^{N_p} \eta_p(x)\, \phi_p(t)
\]
where the ηp terms were the basis functions and the φp terms were the coefficients
(which happened to be the values of φ at the nodal points in the grid). It can be
observed that this is quite similar to our solution in Equation 16.4, where the $e^{\frac{2\pi i p}{L_x}x}$
terms are the basis functions and the $\Phi_p(t)$ are the coefficients. Now if we think
about what a basis actually means, it’s a set of independent or orthogonal vectors
with which we can describe something. The most intuitive example is a Cartesian
basis where we have the unit vectors î, ĵ, k̂, and any vector quantity can be described
in terms of coefficients in that basis (e.g. v = vx î + vy ĵ + vz k̂). If we switched to
a different basis, such as spherical coordinates for example, v itself would not change,
only the coefficients in the expansion in that basis. One of the properties of an
and between the same component is one. Returning once again to the Cartesian
basis we have î · ĵ=0, î · k̂ = 0, but î · î = 1. If we let e1 = {1, 0, 0}, e2 = {0, 1, 0} ,
e3 = {0, 0, 1} then we could write this inner product relationship more formally as:
\[
\langle \mathbf{e}_p, \mathbf{e}_q \rangle = \sum_{n=1}^{3} e_{p,n}\, e_{q,n} = \delta_{pq}
\]
n=1
where the notation <, > is denoting an inner product, the indices p and q denote
different components of the basis, and δp,q is known as the Kronecker delta function,
which is defined as 1 if p = q and 0 otherwise. This is known as the orthogonality
condition. Now we can fairly easily extend this concept to basis vectors which have
more than three components. In fact this relation just defined works for vectors
with any number of components. In the limit where we have an infinite dimensional
vector, then we actually have a function. To elaborate on this point, think about
some function which has been evaluated at N points. This is analogous to an N
dimensional vector. So in the limit a function is an infinite dimensional vector. In
this case we write our inner product relationship as:
\[
\langle \eta_p, \eta_q \rangle = \int_{\Omega} \eta_p\, \eta_q\, d\Omega = \Omega\, \delta_{pq}
\]
where the summation of the components of the two basis vectors is replaced by an
integral of the basis functions. If we allow for the scenario where we have complex
functions (i.e. the function gives us complex numbers) then we write our inner
product relationship as:
\[
\langle \eta_p, \eta_q \rangle = \int_{\Omega} \eta_p\, \eta_q^{*}\, d\Omega = \Omega\, \delta_{pq}
\]
where the $^{*}$ denotes the complex conjugate. Now although it is a bit more difficult
to visualize, we can think of and treat each of the $e^{\frac{2\pi i p}{L_x}x}$ basis functions as we do
the Cartesian basis functions. In this case the complex conjugate is written $e^{-\frac{2\pi i p}{L_x}x}$
(since the complex conjugate is found by simply replacing i with −i). We can
now return to applying our method of weighted residuals, noting that we will be
using the complex conjugate of the basis functions as our weighting functions. Hence
substituting for W and our residual r and integrating over the domain we get:
\[
\int_{\Omega} e^{-\frac{2\pi i q}{L_x}x} \left[ \frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \left( \frac{d\Phi_p}{dt} + v\, \frac{2\pi i p}{L_x}\, \Phi_p - \mu \left( \frac{2\pi i p}{L_x} \right)^2 \Phi_p - \Psi_p \right) e^{\frac{2\pi i p}{L_x}x} \right] d\Omega = 0
\]
Noting that the bracketed coefficients are constant within the domain of integration (they do not depend on x), we can take them outside the integral to get:
\[
\frac{1}{N_x} \sum_{p=-N_x/2+1}^{N_x/2} \left( \frac{d\Phi_p}{dt} + v\, \frac{2\pi i p}{L_x}\, \Phi_p - \mu \left( \frac{2\pi i p}{L_x} \right)^2 \Phi_p - \Psi_p \right) \int_{\Omega} e^{-\frac{2\pi i q}{L_x}x}\, e^{\frac{2\pi i p}{L_x}x}\, d\Omega = 0
\]
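The remaining integral is just the orthogonality condition for the Fourier basis functions discussed above (a one-line statement of the step being used here):
\[
\int_{\Omega} e^{-\frac{2\pi i q}{L_x}x}\, e^{\frac{2\pi i p}{L_x}x}\, d\Omega = \Omega\, \delta_{pq},
\]
so only the term with p = q survives.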
Now ultimately what this means is that each p component is independent of the
others, so we can drop the summation terms and factor out the $1/N_x$ and $\Omega$ terms to
arrive at:
\[
\frac{d\Phi}{dt} = \mu \left( \frac{2\pi i p}{L_x} \right)^2 \Phi - v\, \frac{2\pi i p}{L_x}\, \Phi + \Psi
\]
which can be put in the form:
\[
M\dot{\Phi} = K\Phi + s
\]
where Φ is a vector of the coefficients in the discrete Fourier transform of φ and
as with the Finite Difference method, M is the identity matrix. One important
point to note here however is that because our system of equations is uncoupled, K
will only have entries on the main diagonal. It is an interesting feature of this global
method that we don't end up with a coupled system of ODEs as we have done with
the previous methods. The coupling however comes about from the computation
of the discrete Fourier transform where each coefficient in the transform involves a
summation over all of the components in the grid.
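Reading the scalar equation above entry by entry, M is the identity, the p-th diagonal entry of K is simply
\[
K_{pp} = \mu \left( \frac{2\pi i p}{L_x} \right)^2 - v\, \frac{2\pi i p}{L_x},
\]
and the p-th entry of s is Ψp.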
If we want to extend our Spectral method to higher spatial dimensions the only
difference is that we have to use either the 2D inverse Fourier transform:
\[
\phi(x, y, t) = \frac{1}{N_x N_y} \sum_{p=-N_x/2+1}^{N_x/2} \sum_{q=-N_y/2+1}^{N_y/2} \Phi(p, q, t)\, e^{\frac{2\pi i p}{L_x}x + \frac{2\pi i q}{L_y}y}
\]
or, in three dimensions, the 3D inverse Fourier transform:
\[
\phi(x, y, z, t) = \frac{1}{N_x N_y N_z} \sum_{p=-N_x/2+1}^{N_x/2} \sum_{q=-N_y/2+1}^{N_y/2} \sum_{r=-N_z/2+1}^{N_z/2} \Phi(p, q, r, t)\, e^{\frac{2\pi i p}{L_x}x + \frac{2\pi i q}{L_y}y + \frac{2\pi i r}{L_z}z}
\]
We can then proceed to solve our system of ODEs using any of the methods
covered in Part II, the only difference is that we must compute the appropriate
discrete Fourier transform of the initial condition defined in the spatial domain in
order to begin the time marching. Finally in terms of the imposition of boundary
conditions we find that the necessary assumption of periodic boundaries means that
we do not have to do anything to impose the boundary conditions on our system as
they are automatically incorporated.
Example 16.1:
In this example we will develop both a Matlab and a C++ program to solve the
1D first order wave equation:
\[
\frac{\partial \phi}{\partial t} + v\, \frac{\partial \phi}{\partial x} = 0 \qquad (16.5)
\]
∂t ∂x
in the domain x ∈ [0, 10], t ∈ [0, 10], with boundary condition φ(0, t) = 0, initial
condition $\phi(x, 0) = e^{-5(x-3)^2}$, v = 1.0, and compare the numerical solution with the
exact solution $\phi(x, t) = e^{-5(x-vt-3)^2}$. For the spatial discretization we will use the
Spectral method and for the temporal discretization we will use the fourth order
Runge-Kutta method. The intended learning outcomes for this example will be to
‘get a feel’ for applying a Spectral method and to see how we compute the discrete
Fourier transform.
Assuming now that our spatial domain has been broken up into Nx grid points
with spatial step size ∆x, then we can immediately define an ODE at each grid
point as:
\[
\frac{d\Phi}{dt} = -\frac{2\pi i p}{L_x}\, v\, \Phi
\]
which fits the form:
Φ̇ = KΦ
and is similar to Example 13.1. Here however, K will only have terms on the main
diagonal, so there’s no real point in storing it explicitly. This again emphasizes the
point that a matrix is often something that is conceptual, not something that is
necessarily defined and stored in the program. The most efficient way to store a
diagonal matrix would be as a 1D array, but in fact we don’t even need to do that
in this example. The next step is the application of the Runge-Kutta method. As
we did with Examples 10.1 and 13.1 we will define a function f which we can call
repeatedly as we make our k vectors at each time step and this will take the form:
function k = f(PHI)
p = (floor(-N_x/2+1):floor( N_x/2))';
k = (-v*(2*pi*i.*p./L_x)).*PHI;
end
where an important point to note is that in this code snippet, i is the imaginary
unit, not an index as it was in the Finite Difference method examples. At this point
the remainder of the algorithm is just the basic fourth order Runge-Kutta code from
Example 10.1:
for l=1:N_t-1
k1 = f(PHI(:,l));
k2 = f(PHI(:,l) + Delta_t/2*k1);
k3 = f(PHI(:,l) + Delta_t/2*k2);
k4 = f(PHI(:,l) + Delta_t *k3);
PHI(:,l+1) = PHI(:,l) + Delta_t *(k1/6 + k2/3 + k3/3 + k4/6);
end
The only remaining piece of the program is the calculation of the discrete Fourier
transform of the initial condition to get the time marching loop started. In this
example we will write our own function to perform both the forward and inverse
transforms. These will take the form:
function PHI = computeDFT(phi)
PHI = zeros(N_x,1);
p = transpose(floor(-N_x/2+1):floor( N_x/2));
n = transpose(floor(-N_x/2+1):floor( N_x/2));
for m=1:N_x
PHI(m) = sum(phi .* exp(-2*pi*i.*n*p(m)/N_x));
end
end
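for the forward discrete Fourier transform and, as a sketch of the corresponding inverse (assuming it simply reverses the sign of the exponent and applies the 1/N_x normalization discussed earlier, and matching the computeInverseDFT call used in the time marching loop below):
function phi = computeInverseDFT(PHI)
phi = zeros(N_x,1);
p = transpose(floor(-N_x/2+1):floor( N_x/2));
n = transpose(floor(-N_x/2+1):floor( N_x/2));
for m=1:N_x
% reconstruct the m-th grid value from all of the Fourier coefficients
phi(m) = sum(PHI .* exp( 2*pi*i.*p*n(m)/N_x))/N_x;
end
end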
for the inverse Fourier transform. So we begin the simulation by computing the
discrete Fourier transform of the initial condition, then perform our time integration
on the Fourier coefficients PHI. When we are interested in the real solution, we
compute the inverse Fourier transform so that we have φ at any point in time. In
our Matlab program this will look something like:
PHI(:,1) = computeDFT(phi(:,1));
for l=1:N_t-1
k1 = f(PHI(:,l));
k2 = f(PHI(:,l) + Delta_t/2*k1);
k3 = f(PHI(:,l) + Delta_t/2*k2);
k4 = f(PHI(:,l) + Delta_t *k3);
PHI(:,l+1) = PHI(:,l) + Delta_t *(k1/6 + k2/3 + k3/3 + k4/6);
phi(:,l+1) = computeInverseDFT(PHI(:,l+1));
end
One important point to note is that the solution we get from the inverse Fourier
transform will still involve complex numbers, however, all of the imaginary compo-
nents are zero, so we can simply take the real part of the inverse Fourier transform
without losing any information. It is easily observed that the ‘bell shaped’ initial
condition is simply shifted along through the computational domain, which is what
we could expect from this PDE.
The complete program is given in Example2_6.m. Figures 16.3(a) - 16.3(d) illus-
trate the solution at a number of time steps for the case where ∆x = 0.05 and
∆t = 0.02. A final point to note is that the use of Matlab for this problem makes
life very easy when dealing with complex numbers since we don’t have to modify our
code at all. This is not the case if we were to use C++ however, where the core language
has no built-in complex arithmetic (although the standard library does provide a complex type). This is not to say that it
can't be done, but one would need to develop (or find existing) functionality
for handling this, and perhaps defining a class to define a complex number, storing
its real and imaginary components and overloading some of the complex arithmetic
operations.
Figure 16.3: The solutions to the PDE in Example 16.1 for the combination ∆x =
0.05 and ∆t = 0.02 illustrating the solution at (a) l = 1, (b) l = 200, (c) l = 340
and (d) l = 500.
Example 16.2:
In this example we will develop a Matlab program to solve the 2D generic scalar
transport equation:
\[
\dot{\phi} + \mathbf{v} \cdot \nabla\phi = \mu \nabla^2 \phi + \psi \qquad (16.6)
\]
in the domain x ∈ [0, 1], y ∈ [0, 1], t ∈ [0, 2], with boundary conditions φ(0, y) = 0,
φ(x, 0) = 0, ∂x φ(1, y) = 0, ∂y φ(x, 1) = 0, initial condition $\phi(x, y, 0) = e^{-50(x-0.3)^2}$,
and v = {0.5, 0.5}, µ = 0.01, and ψ = 0.2. For the spatial discretization we will use
the Spectral method and for the temporal discretization we will use the implicit Euler
method. The intended learning outcome for this example will be to observe the
application of a Spectral method to solve a multidimensional PDE and to see how
we can use the Matlab routines to compute the forward and inverse discrete Fourier
transforms.
Assuming now that our spatial domain has been broken up into Nx data points
in x and Ny data points in y, then we can immediately define an ODE at each
interior grid point as:
\[
\frac{d\Phi}{dt} = \mu \left( \frac{2\pi i p}{L_x} \right)^2 \Phi + \mu \left( \frac{2\pi i q}{L_y} \right)^2 \Phi - u\, \frac{2\pi i p}{L_x}\, \Phi - v\, \frac{2\pi i q}{L_y}\, \Phi + \Psi
\]
which again fits the form:
\[
M\dot{\Phi} = K\Phi + s
\]
similar to Example 16.1. Here however, K will only have terms on the main
diagonal, so there’s definitely no point in storing the whole matrix. Furthermore,
because our discrete scalar field will be laid out in a 2D array as:
phi = zeros(N_x,N_y,N_t);
PHI = zeros(N_x,N_y,N_t);
we could in fact store the main diagonal elements in this way. So we will be storing
K as a 2D array in this example, but the 2D part has to do with the 2D nature of
the grid, not implying equations and unknowns. Each element in the 2D array K
represents an element on the main diagonal of the matrix K. That being said, we
will define our assemble function as:
function assemble()
[p q] = meshgrid(floor(-N_x/2+1):floor( N_x/2), floor(-N_y/2+1):floor( N_y/2));
K = (-v(1)*(2*pi*i.*p/L_x) - v(2)*(2*pi*i.*q/L_y) ...
+ mu*(2*pi*i.*p/L_x).^2 + mu*(2*pi*i.*q/L_y).^2);
s = PSI;
end
The interesting point here is that we’ve created a 2D array of p and q indices with
the meshgrid function, and then we are doing ‘element by element’ operations (as
can be observed by the .* and .^ operators). So what we end up with is an array
of K elements for each point in the grid. It is important to note that in this case
we shouldn’t really think of K as being a stiffness matrix. If it were then it would
be of size (Nx × Ny ) × (Nx × Ny ) with entries only on the main diagonal. Here we
have an array of size Nx × Ny with entries in every location in the array. This is
obviously more efficient in terms of storage. While we could use a sparse matrix in
Matlab to help alleviate this storage, there’s really no point since the equations are
completely decoupled.
Now applying the implicit Euler method to our system we get:
\[
M\, \frac{\Phi^{l+1} - \Phi^{l}}{\Delta t} = K \Phi^{l+1} + s
\]
which, noting that M is here the identity and K is diagonal, can be rearranged into the pointwise update
\[
\Phi^{l+1} = \frac{\Phi^{l} + \Delta t\, s}{1 - \Delta t\, K}
\]
where it can be observed that here we are not solving a system of equations (as
we were in other examples solving the generic scalar transport equation), rather we
are updating Φ point by point. It is just a nice feature of the Matlab language
that allows us to write the operation in one line of code (as opposed to placing it
inside two nested for loops as we would have to in a C++ implementation). The
only remaining piece of the program is the calculation of the 2D discrete Fourier
transform of the initial condition to get the time marching loop started. In this
example we will use the Matlab function fft2 to do this (we use the function fft in
1D). It should be noted that the main reason for using the Matlab function rather
than writing our own is that being a built-in Matlab function it can do it much
faster than any function we could write. An important point however is that the
Matlab function returns indices in the range [0, N − 1] rather than [−N/2 + 1, N/2],
so in order for the location of the Fourier coefficients to be consistent with what we
need for our method to work, we use the fftshift function to reorder the output of
fft. So the computation of the forward discrete Fourier transform for our problem
is:
PHI(:,:,1) = fftshift(fft2(phi(:,:,1)));
assemble();
for l=1:N_t-1
PHI(:,:,l+1) = (PHI(:,:,l) + Delta_t*s)./(1-Delta_t*K);
phi(:,:,l+1) = ifft2(ifftshift(PHI(:,:,l+1)));
end
Having now seen two examples using a Spectral method to solve a PDE it is
worth ending our investigation with some remarks on the accuracy and applicability
of the method. A primary benefit of Spectral methods over alternative approaches,
such as Finite Difference, Finite Volume, and Finite Element methods, is accuracy: Spectral
discretizations of PDEs based on Fourier series (as well as Chebyshev polynomials, etc.)
provide very low error approximations and in many cases these approaches can
be exponentially convergent, meaning that for a length N expansion, the difference
between the analytical and the numerical solution can be $O(1/N^N)$. Second, since
the numerical accuracy of Spectral methods is so high the number of grid points re-
quired to achieve the desired precision can be very low, thus a Spectral method may
require less computer memory than say a Finite Difference method. One important
point to note however is that the PDE must exhibit smooth variation throughout
the computational domain, otherwise this convergence is lost. Spectral methods are
hence not applicable to ‘say’ transonic fluid flow developing shock waves, or any
other PDE that would have discontinuities. For the interested reader some excellent
references for more detailed aspects of Spectral methods can be found in the books
by Hesthaven [68], Peyret [72], and Kopriva [69].
Figure 16.4: The solutions to the PDE in Example 16.2 illustrating the solution at
(a) t = 0, (b) t = 0.5, (c) t = 1.0 and (d) t = 1.5.
Chapter 17
Spectral Element Methods
Chapter 18
Meshfree Methods
Part IV
Parallel Computing
Chapter 19
Introduction
Having now covered a number of different numerical methods for solving systems
of algebraic, ordinary, and partial differential equations, and creating programs to
implement these methods, we are in a position to look at how we can develop parallel
versions of these programs to solve much bigger computational problems and solve
them faster. As such, we are going to study some different application programming
interfaces (APIs) for designing parallel programs. Up until this point, all of the
programs that we have developed have been written for serial computation. We can
take this to mean a series of instructions executed one after another on some form of
processor [43]. By contrast, parallel computation is the simultaneous use of multiple
compute resources to solve a problem. Now, the term 'processor' is deliberately a bit vague;
we use it because there are many different hardware designs, and correspondingly different
categorizations that could be applied to the term 'parallel computation'.
The techniques that we will study, namely OpenMP and MPI are designed such
that they can be used for designing programs to run on computing platforms ranging
from your laptop or workstation, up to the largest supercomputing facilities in the
world. Our interest will lie in the application of these techniques to solving PDEs,
but it should be remembered that they are far more general than this one application
and can be applied to parallelizing many different types of computational problems.
Furthermore, while we will be incorporating these techniques into programs writ-
ten in the C++ programming language, it is worth pointing out that both techniques
are commonly incorporated into programs written in other programming languages,
such as C and Fortran, for example.
Before we proceed to cover the details of OpenMP and MPI however, it will be
worth reviewing some basic concepts to provide a context for how they fit into the
‘wider world’ of High Performance Computing (HPC) [23]. As with the development
of our numerical methods, this will only be a brief introduction to some of the most
relevant concepts. Some excellent references can be found in [67] and [24] however.
One ‘classic’ method that can be used for the classification of computer architec-
tures is known as Flynn’s taxonomy [19]. Here, a computer architecture is classified
along the two independent dimensions of instruction and data and each dimension
Figure 19.1: Flynn's taxonomy: SISD (Single Instruction, Single Data), SIMD (Single Instruction, Multiple Data), MISD (Multiple Instruction, Single Data), and MIMD (Multiple Instruction, Multiple Data).
can have one of two possible states: single or multiple (Figure 19.1). A single in-
struction, single data architecture is one where only one instruction stream is being
acted on by the processing unit at any one clock cycle and only one data stream
is being used as input. Many early computers such as old generation mainframes
followed this architecture, but even today, modern laptops and workstations can fall
into this category. A single instruction, multiple data architecture is one where a
number of processing units execute the same instruction at any one clock cycle, but
each processing unit can operate on a different data element. Early ‘vector’ processor
computers used this approach, but more recently graphics processing units found in
most modern laptops or workstations can fall into this category. A multiple instruc-
tion, single data architecture is one where a single data stream is fed into multiple
processing units, each of which can execute independent instruction streams. There
are very few actual examples of this style, but fault-tolerant computers executing
the same instructions redundantly in order to detect and mask errors would be an
example. Finally, a multiple instruction, multiple data architecture is one where
a number of processing units can be executing different instruction streams, which
may be operating on different data streams. This is currently the most common type
of computer and includes modern laptops and workstations, ‘clusters’ of networked
computers, and most current supercomputers.
If we now start to consider the specific types of hardware to which the term
‘processor’ can apply, we find that there are many different designs out there and
it makes sense to think of a ‘spectrum’ defined by how specialised the circuitry in
a processor is for a particular task, which will have implications in terms of its
performance, price, and ‘ease’ of programming.
At the least specialised end of this spectrum we have the central processing unit
(CPU) [7], which is found in modern laptops, workstations, and supercomputers.
A CPU is the hardware that carries out the instructions of a computer program
and performs the basic arithmetical, logical, and input/output operations of the
system. Because the average computer these days has to be able to perform all
sorts of tasks, CPUs must be quite general purpose and hence a tradeoff is that
performance is reduced in favour of flexibility. These days, most processors of this
type contain multiple cores[33] which can be thought of as multiple CPUs within
the one ‘component’ or ‘chip’. These cores can operate independently, executing
different instructions on different data streams.
Moving up the spectrum, an example of a slightly more specialised design is the
graphics processing unit (GPU) [22], which is also found in modern laptops, work-
stations, and supercomputers. A GPU is the hardware that acts as a ‘co-processor’
performing the very intensive calculations required by modern day computer graph-
ics. GPUs are hence more specialised compared to CPUs in that more of the transis-
tors on the die comprising the chip, are dedicated to ‘number crunching’, compared
to a CPU which needs to perform more ‘administrative’ tasks. A fairly recent trend
has been the advent of APIs such as CUDA [9] and OpenCL [38], which allow users
to perform more general purpose calculations and hence the term general purpose
computing on graphics processing units (GPGPU) has emerged. As with CPUs,
GPUs are in abundance and will be used together in most platforms.
Moving up the spectrum again, an example of a more specialised design is the digital
signal processor (DSP) [11]. The architecture of a DSP is optimized specifically for
digital signal processing, such as signals from audio or video sensors. Most also
support some of the same features as an applications processor or microcontroller, since
signal processing is rarely the only task of a system.
Moving up the spectrum a little further, an example of an even more specialised
design is the field-programmable gate array (FPGA) [18]. Similar to GPUs, an
FPGA can act as a co-processor, but the architecture is such that the circuitry can
be rewritten for a particular task. This is in contrast to CPUs and GPUs, where
the circuitry is fixed and it is only the instructions comprising the program that
can change. In this case, the program that one would want to run on an FPGA
is actually defined in the hardware, but is reconfigurable at run time. Similar to
GPUs there are APIs that are available for programming them such as VHDL [60] or
Mitrion-C [30] for example.
At the most specialised end of the spectrum we have application specific inte-
grated circuits (ASICs) [2]. Because these processors are by definition designed for a
specific computational problem, they will generally give much greater performance
than ‘say’ a CPU would for the same task. Because of the extreme costs associated
with the masks required for the X-ray lithography to create an ASIC, they are not
feasible for most parallel computing applications. One good example of an exception
to this however is the RIKEN MDGRAPE-3 supercomputer built for the purpose of
performing molecular dynamics simulations, which uses specialised MDGRAPE-3
chips [46].
A Von Neumann architecture [61] is a design of computer in terms of a processing
unit (consisting of an arithmetic logic unit and processor registers, a control unit
containing an instruction register and program counter), memory to store both a
program's instructions and its data, plus input and output mechanisms including external
mass storage, and networking (Figure 19.2). The meaning of the term has evolved to
mean a stored-program computer in which an instruction fetch and a data operation
cannot occur at the same time because they share a common bus. The vast majority
of modern computers are based on this basic design and so we will now explore some
variations of this basic design and issues to consider when describing an HPC system.
Figure 19.2: A schematic of the Von Neumann architecture: a memory subsystem holding both program and data, and an input/output subsystem providing storage and networking.
An important issue that we must consider when we talk about parallel computing
architectures is where the computers’ memory is physically located in relation to its
CPUs and there are two broad categories, namely shared memory and distributed
memory architectures. Shared memory architectures (Figure 19.3(a)) in general
provide the ability for all of the CPUs to access all of the memory in a global ‘ad-
dress space’. This means that multiple CPUs can operate independently but share
the same memory resources. From a programming point of view this architecture
means that code for ‘say’ accessing and manipulating entries in an array will not
require modification from a serial program because all of the allocated memory is
accessible. A further classification can be made with shared memory architectures,
namely uniform memory access (UMA) and non-uniform memory access (NUMA)
architectures [35], which relate to the memory access times. UMA architectures
are generally identical processors with equal memory access times, and this is
commonly found in symmetric multiprocessor (SMP) machines [53] (Figure 19.4).
Examples range from modern dual core laptops up to larger systems containing on
the order of tens of CPUs.

Figure 19.3: Schematics illustrating the concepts of (a) a shared memory parallel
programming model and (b) a distributed memory parallel programming model.

NUMA architectures are often based on multiple SMP
machines connected together via a bus interconnect [4]. While each CPU can access
all of the memory, the access times of its local memory will be faster than those
across the interconnect. As a final classification cache-coherent non-uniform mem-
ory access (ccNUMA) architectures keep one consistent image of the high speed
memory known as cache [6] (Figure 19.5). The term ‘cache coherent’ refers to the
fact that for all CPUs any variable that is to be used must have a consistent value.
Therefore, it must be assured that the caches that provide these variables are also
consistent. Since the appearance of shared memory systems with multiple proces-
sors the cache coherency phenomenon also manifests itself within processors with
multiple cores; first and second level cache belong to a particular core and therefore
when another core needs data that does not reside in its own cache it has to retrieve it
via the complete memory hierarchy of the processor chip. This is typically orders of
magnitude slower than when it can be fetched from its local cache. A disadvantage
with shared memory architectures is that it becomes increasingly more difficult and
hence expensive to ‘scale up’ the design to include more CPUs with more memory,
so these architectures seem to have limited scalability. An advantage with these ar-
chitectures is that the global address space can simplify the parallel program design.
Distributed memory architectures (Figure 19.3(b)) in general require a com-
munication network to connect individual nodes. Each CPU has its own address
space (Figure 19.6) and so when non-local data is required by a given CPU it is the
responsibility of the programmer to explicitly define how the data is to be communi-
cated. From a programming point of view this architecture means that code for say
accessing and manipulating entries in an array will require modification from a se-
rial program because not all of the allocated memory is accessible. Examples range
from clusters of workstations connected via ethernet [16] up to massively parallel
processor (MPP) systems containing hundreds of thousands of CPUs and special-
ized network interconnects and topologies. A disadvantage of distributed memory
architectures is that the explicit communication of data can complicate the parallel
program design. An advantage with these architectures is that it is easier to scale
up the design, adding more CPUs and memory to the system.
Modern HPC systems can make use of various combinations of shared and dis-
tributed memory architectures (known as hybrid distributed-shared memory archi-
tectures), and utilizing different types of processors. One example design is a dis-
tributed memory multi-processor, where each compute node is a processor containing
multiple cores and shared memory, but which connect to one another over a network
interconnect. These interconnects can be arranged in various topologies, such as the star and tree networks of Figures 19.9(a) and 19.9(b); another common design is the 3D torus network (Figure 19.9(c)), where compute nodes are connected in a lattice structure. The
term ‘torus’ comes from the fact that compute nodes at the sides of the lattice are
connected to the compute nodes at the opposite side, so the whole structure can be
thought to ‘wrap around on itself’. Communication between nearest neighbor nodes
will only require one ‘hop’ along the torus, but communication between nodes that
are farther away will require multiple hops. While a 3D torus is a common design,
this concept can be extended to higher dimensions such as a 5D torus. Essentially
what this means is that every ‘say’ N th node along the lattice will have additional
direct connections to other nodes further away on the lattice.
Figure 19.9: Schematics illustrating the concepts of (a) a star network, (b) a tree
network, and (c) a 3D torus network.
Moving now from hardware to software, another issue worth considering is the
differences between processes and threads, since two of the techniques for paral-
lelization that we will investigate are based around their creation. A process is an
instance of a computer program that is being executed [42]. It contains the pro-
gram code and its current activity. Processes consist of (or ‘own’) a portion of the
computer's memory, which will contain such things as the program's executable code,
process-specific data, a call stack, and a heap to hold intermediate computation data
generated during run time. A thread is the smallest unit of processing that can be
scheduled by an operating system [55]. It generally results from a computer pro-
gram ‘breaking’ into two or more concurrently running tasks. The implementation
of threads and processes differs from one operating system to another, but in most
cases, a thread is contained inside a process. Multiple threads can exist within the
same process and share resources such as memory, while different processes do not
share these resources. In particular, the threads of a process share its executable
code and its context.
Having now introduced some concepts relating to parallel computing it is worth
taking some time to think about how we now 'scale up' our simulations. Quite
simply, we could define the speedup of a program as the ratio of the amount of time
a serial simulation takes to the amount of time the corresponding parallel simulation
takes. For example a simulation which takes 100s when run in serial, but only 20s
when run on some given number of CPUs, would mean that we have a speedup
of 5. Ideally, the more CPUs we use, the smaller the amount of time the calculation
should take and hence the greater the speedup. One very well known law in parallel
computing is Amdahl's law, which can be used to predict the maximum theoretical
speedup attainable by a parallel program [1]. We need to know the fraction of the
code which is actually able to be run in parallel (which as we will see in the examples
to come is often less than 1) and will define it as P . Obviously if P = 0 then we will
get no speedup and if P = 1 then we can theoretically get an infinite speedup. When
the computation is divided among N processors, then we can define the speedup as:
\[
\mathrm{speedup} = \frac{1}{(1 - P) + \frac{P}{N}}
\]
The important result is that as N → ∞ the speedup tends to 1/(1 − P), so if we can only
parallelize say 95% of the program, then we will never be able to get more than a 20 times
speedup, no matter how many processors we use (i.e. how big N is). To emphasize this point,
there are certain parts of computer programs (for example file I/O) that are often
quite difficult to parallelize. More fundamentally however, there are often algorithms
which by their very nature cannot be parallelized (computing a Fibonacci series for
example). Fortunately for us, with the solution of ODEs and PDEs there are at
least ‘some’ parts of the computation that can be parallelized.
Taking a simulation of a fixed size and using successively more CPUs to perform
the computation is what is known as strong scaling. Essentially we are taking a
problem of a fixed size and investigating how the solution time varies with the
number of CPUs. The greater the number of CPUs, the smaller the amount of
computation each one has to perform and hence the faster it will be able to perform
its computation (hopefully). Another metric however which may be quite applicable
is weak scaling. In this case we keep a fixed problem size per processor and
look at how the solution time varies when we use successively more CPUs to perform
the computation. Now, the choice of which metric depends upon the problem. One
of the reasons for using supercomputers is that we want to be able to perform
computations faster and faster and so sometimes we want to scale out our problem
to get it to run faster. One of the other reasons for using supercomputers however
is that we want to be able to run bigger simulations that we weren’t able to do on
the limited memory spaces of smaller machines and so we make our problem bigger
and bigger as we use more CPUs. There is no ‘correct’ metric and the point of
this discussion is just to very briefly introduce some of the important concepts used
when designing and running large scale programs. Usually with a parallel code we
will want both bigger and faster at the same time!
Figure 19.10: A schematic illustrating the model for connection and job submission
on a high performance computing facility.
As was mentioned previously, the APIs that we are going to study can be run
either on your laptop or workstation, or on a supercomputer. While the program
code itself generally does not need to change too much, the method by which one
executes a program generally will. In fact, the way in which one ‘interacts’ and
uses an HPC system is often quite different to the way in which one would use their
own personal computer. Most often a user will connect their workstation to a login
node over the internet via a protocol such as SSH [48] (Figure 19.10). From this login
node a user can interact with the shared file system, creating, editing, deleting files,
etc, and compiling their programs. Rather than executing their programs directly
however, an HPC facility will make use of some form of job scheduling software.
While there are many different packages available, a common feature is that rather
than trying to execute your program in parallel directly, one generally submits a job script that specifies the resources the job requires and the commands to be executed.
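A minimal sketch of such a script for the Torque/PBS scheduler, which we will call myJobScript.pbs, is given below; the project account, resource requests, file names, and email address are placeholders chosen to match the description that follows.
#!/bin/sh
#PBS -A VR0084
#PBS -l procs=16
#PBS -l walltime=01:30:00
#PBS -N myJob
#PBS -o myJob.out
#PBS -e myJob.err
#PBS -M JohnSmith@gmail.com
./myProgram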
Here, the lines beginning with #PBS are defining information for the job scheduling
software. Working through the various lines we are specifying a ‘project account’
(-A), the number of processors to use (-l procs=), the wall time (which is the
amount of time the simulation is expected to take -l walltime=01:30:00). Fur-
thermore, we are giving the job a name (-N), specifying the name of the output file
to put any program output that would go to stdout (-o), specifying the name of
the output file to put any program output that would go to stderr (i.e. when some-
thing goes wrong with the program -e), and specifying an email address for the job
scheduling software to notify us when the job starts, finishes, or crashes! (-M). Finally
the last line tells the job scheduler to run the executable myProgram. The user can
then submit the job to the system with a command such as qsub myJobScript.pbs
(of course the software must be able to find the job script, the executable, and any
other relevant files). The job scheduler will then place the job in a queue (which can
be viewed with a command such as showq or qstat) and execute the commands
in the file when the required system resources become available. Finally a job can
be removed by the command qdel followed by the job ID that would be displayed
in the queue.
Another widely used job scheduler is SLURM [50], for which an example job script (which we will call myJobScript.sbatch) could be written:
#!/bin/sh
#SBATCH --account=VR0084
#SBATCH --nodes=1
#SBATCH --time=01:30:00
#SBATCH --job-name=myJob
#SBATCH --output=myJob.out
#SBATCH --error=myJob.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=JohnSmith@gmail.com
srun -n 16 --nodes 1 --ntasks-per-node 16 myProgram
Here, the lines beginning with #SBATCH are defining information for the job scheduling software, and the meaning of these statements is exactly the same as those explained for the previous job script. The user can then submit the job to the system with a command such as sbatch myJobScript.sbatch and the job scheduler will then place the job in a queue (which can be viewed with a command such as squeue)
and execute the commands in the file when the required system resources become
available. Finally a job can be removed by the command scancel followed by the
job ID that would be displayed in the queue.
As a final point, it is worth mentioning that most of the concepts introduced here
will have more meaning, once we have actually studied the APIs and applied them
to specific PDEs. As such, it is recommended that you re-read this introduction
at the end of this part of the book. It is important to realize that there are many
different ways in which we can classify computing architectures and programming
models, and there are always exceptions to these categorizations. Furthermore,
while some categories encompass architectures and programming models which are
still in use today, others are obsolete (although they may find their way back into
future designs). The key thing is that you realize this discussion has just presented a rough guide, not strict rules.
Chapter 20
OpenMP
20.1 Concepts
[Figure 20.1: a schematic of the fork/join model, in which a program (MyProgram) alternates between serial regions executed by Thread 0 and parallel regions executed by a team of threads (Threads 0 to 3), with synchronization before each join; time runs vertically.]
The first API that we will investigate is the Open Multi-Processing API, which
is an implementation of multithreading; a shared memory method of parallelization
whereby a master thread forks a specified number of slave threads and a task is
divided among them. The threads then run concurrently, with the operating sys-
tem allocating threads to different cores. We create a parallel program with OpenMP
by finding regions of our code that can be carried out in parallel (for example for
loops) and adding preprocessor directives into the code [5]. Then, when the compiler
turns our source code into machine executable code, the preprocessor will first use
these directives to modify the code such that the regions of the code will be executed
with multiple threads. Each thread has a unique integer ID associated with it and
the master thread has an ID of 0. After the execution of the parallel section of the
code, the threads join back into the master thread, which continues onward to the
end of the program (Figure 20.1). The preprocessor directives begin with the state-
ment #pragma omp and then include combinations of directives and clauses. Before
we examine some of the more relevant directives and clauses it will be useful to
demonstrate by way of an example. Consider a very simple computation carried out in a for loop, along the lines of the following sketch (which is consistent with the discussion below):
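#pragma omp parallel for shared(a, b, c) private(i)
for(i=0; i<N; i++)    // N, i and the arrays a, b and c are assumed to have been declared earlier
{
    a[i] = b[i] + c[i];
}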
With the parallel directive we are instructing the compiler to create a team of
threads. Then with the for directive we are instructing the compiler that the
computations performed within the loop are to be shared among the team of threads.
The private clause tells the compiler that each thread should have its own unique
instance of the counter i (which makes sense if the threads are to compute different
parts of the loop), but the shared clause tells the compiler that the threads are
using the one instance of each of the arrays. An important point to bear in mind
with these types of constructs is that each thread must be able to compute its part
of the for loop without changing the results. This is fine for the above code snippet
because b[i] and c[i] are independent, but if we were computing something along
the lines of a[i]=a[i-1]+a[i-2] for example, then it’s possible that a thread might
need entries in a that haven’t been computed at the moment that it needs them, in
which case it will use whatever happens to be occupying that location in memory.
An OpenMP directive applies to the succeeding structured block of code. In this
case it is the for loop that defines the structured block and after completing the
loop the threads will join back into the master thread, but if we wanted to do
more things within our parallel region, we could, we just have to enclose all of the
statements within a pair of curly braces { }.
Now the codes that we develop for solving ODEs and PDEs tend to be full of for
and while loops and so there is great scope for taking our algorithm and adding in
some preprocessor directives to speed up the computation of the solution. Although
it may seem like the development of a parallel OpenMP program might be trivial,
there are some issues that we will need to consider and many subtle errors that can
become apparent at run time. One issue that we need to be aware of is the situation
of a race condition [44], which can occur when separate threads access and modify
the same memory location at the same time, leading to unpredictable values for that variable. From a performance point of view, one issue that we need to be aware
of is how often the team of threads is created and destroyed. Since there is overhead
in creating a team of threads, we want to minimize the number of times this is done,
or put another way, maximize the amount of work that we can get done in parallel
when we create a team of threads. Consider for example the nested for loop:
for(i=0; i<10000; i++)
{
for(j=0; j<4; j++)
{
a[i][j] = b[i][j] + c[i][j];
}
}
If we were to parallelize the code by placing the preprocessor directive around the
inner loop:
for(i=0; i<10000; i++)
{
#pragma omp parallel for default(shared) private(j)
for(j=0; j<4; j++)
{
a[i][j] = b[i][j] + c[i][j];
}
}
then we will be creating and destroying our team of threads 10,000 times. If we were to, say, split the computation among four threads, then each thread will only
be responsible for the computation of 1 element in a given row of the array a. So we
would expect in this case that the benefit of parallelizing the code in this manner
would be minimal and in fact it is just as likely that the computation could take
longer compared to the serial case because of the overhead associated with the cre-
ation and destruction of the threads outweighing the gain. A better approach would
be to place the parallel directive around the outer loop:
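// A sketch of the same computation with the team of threads created around the outer loop
#pragma omp parallel for default(shared) private(i, j)
for(i=0; i<10000; i++)
{
    for(j=0; j<4; j++)
    {
        a[i][j] = b[i][j] + c[i][j];
    }
}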
Then the team of threads is created and destroyed only once and, splitting the computation among four threads as before, each thread will be responsible for the computation of 2,500 rows in the array, which is much more likely to give us an increase in
performance of the parallel code compared to the serial code. A final issue to
consider is that we want to try and minimize the amount of time that threads spend
waiting for one another to catch up. At the end of parallel for loops there is an
implied barrier where all of the threads have to reach the end of their share of the
loop before joining back to the master thread. In cases where we have multiple loops
or other computations within a parallel region we may encounter regions where only
one thread is doing some work (for example in writing data to an output file), while
the other threads are waiting. If we have explicitly placed synchronization points
into the code with the barrier directive then we may also cause the threads to waste
time waiting for one another to catch up. So while the incorporation of barriers
can be a useful tool for debugging your code and making sure that it is working
properly, their presence should be minimized in the final version of the code.
20.2 Directives
The OpenMP directives tend to instruct the compiler as to ‘what’ to do in a parallel
region. Some of the important directives are:
• #pragma omp parallel
Forms a team of threads and starts parallel execution.
• #pragma omp for
Specifies that the iterations of the for loop will be distributed among and
executed by the encountering team of threads.
• #pragma omp sections
Assigns consecutive but independent code blocks to different threads. This
could be useful in cases where we want different threads to undertake com-
pletely different tasks.
• #pragma omp single
Specifies that the associated structured block is executed by only one of the
threads in the team (not necessarily the master thread).
• #pragma omp master
Similar to a single directive, but the code block will be executed by the
master thread only, with no barrier implied at the end.
• #pragma omp critical
Restricts execution of the associated structured block to a single thread at a
time.
• #pragma omp barrier
Specifies an explicit barrier at the point at which the directive appears.
20.3 Clauses
The OpenMP clauses are placed together with the directives in the code, in order to
specify ‘how’ a piece of code is to be parallelized. Not all of the clauses are valid on
all of the directives however. Some of the important OpenMP clauses are:
• default(shared|none)
Controls the default data-sharing attributes of variables that are referenced in
a parallel construct.
• shared(variable-list)
Declares one or more list items to be shared by threads generated by a parallel
or task construct.
• private(variable-list)
Declares one or more list items to be private to a thread.
• firstprivate(variable-list)
Declares one or more list items to be private to a thread, and initialises each
of them with the value that the corresponding original item has when the
construct is encountered.
• lastprivate(variable-list)
Declares one or more list items to be private to an implicit thread, and causes
the corresponding original item to be updated after the end of the region.
• reduction(operator:list)
Declares accumulation into the list items using the indicated associative operator. Accumulation occurs into a private copy for each list item, which is then combined with the original item (a short sketch is given after this list).
• num_threads(integer-expression)
Declares the number of threads to be used when encountering a parallel region.
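As an illustration of the reduction clause, a minimal sketch (assuming a double array a of length N has been declared and filled) is:
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<N; i++)
{
    sum += a[i]*a[i];   // each thread accumulates into a private copy of sum,
}                       // and the copies are combined when the loop completes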
20.4 Runtime Library Routines
The OpenMP runtime library also provides a number of routines that can be called from within a program. Some of the more important ones are:
• int omp_get_num_threads(void);
Returns the number of threads in the current team.
• int omp_get_max_threads(void);
Returns maximum number of threads that ‘could’ be used to form a new team
using a parallel construct without a num_threads clause.
• int omp_get_thread_num(void);
Returns the ID of the encountering thread where ID ranges from zero to the
size of the team minus 1.
• int omp_get_num_procs(void);
Returns the number of processors available to the program.
• int omp_in_parallel(void);
Returns true if the call to the routine is enclosed by an active parallel region;
otherwise, it returns false.
• double omp_get_wtime(void);
Returns elapsed wall clock time in seconds.
In order to actually create a program using OpenMP, the compiler must have
OpenMP support built in to it. So although the program may need to link to the
OpenMP runtime library in order to use some of the routines it provides, this is not
enough to build the program by itself. Most modern compilers have OpenMP sup-
port, for example the g++ compiler has support since version 4.2 and Microsoft’s
Visual C++ compiler has support since the 2005 edition. In order to use the OpenMP
functionality, a C++ program must include the header file omp.h and will most likely
need to link to the runtime library. Once the program has been compiled it can
be executed in the same manner as any other program. The number of threads
that will be used in the parallel regions can generally be set in one of three man-
ners, either explicitly by using the omp_set_num_threads routine, by using the
num_threads clause with a parallel directive, or by setting the environment variable OMP_NUM_THREADS [15]; a short sketch of these three options is given at the end of this paragraph. It is important to realize that we will only be able to
parallelize portions of the code which are inherently parallelizable. One ‘classic’ ex-
ample of this is a time marching for loop. While it would be trivial to try and place
a #pragma omp parallel for statement around a time marching loop, this will of
course not work because stepping forward in time requires the solution from the
previous step. If a time marching loop was split among ‘say’ two threads, then one
would start from the initial condition, but the other would begin half way through
the simulation and both would march forward together. The second thread would
require the solution from the previous time step, but this won’t have been computed
yet by the first thread. The lesson to take away from this is that when parallelizing with OpenMP one should think about what data a thread will need and whether it will exist when it needs it.
One final point worth mentioning is that although one has the freedom to specify
any number of threads to be used for the parallel regions there will generally not be
any advantage to specifying more than the number of cores present on the system
that the code is executing on and in fact it is more likely that the code would
actually slow down as a result of doing so. An exception to this rule however is the
scenario where the cores support multiple hardware threads, in which case we could (ideally) expect to see a performance benefit by choosing the number of threads for parallel
execution to be up to the number of cores multiplied by the number of hardware
threads supported per core.
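As a brief sketch of the three ways of setting the number of threads mentioned above (four threads are used purely for illustration):
omp_set_num_threads(4);               // 1. library routine, called before the parallel region

#pragma omp parallel num_threads(4)   // 2. clause on the parallel directive
{
    // ...
}

// 3. environment variable, set in the shell before the program is run:
//    export OMP_NUM_THREADS=4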
Example 20.1:
In this example we will develop a ‘Hello World’ example program in C++, par-
allelized with OpenMP to illustrate the creation of a parallel region and print to the
screen the thread IDs. The intended learning outcomes for this example will be to
‘get a feel’ for the structure of a program that includes OpenMP code.
In order to begin, let’s just dive right in and look at the complete program:
#include <iostream>
#include <omp.h>
using namespace std;

int main(int argc, char** argv)
{
    int myID = omp_get_thread_num();
    int N_Threads = omp_get_num_threads();
    cout << "Hello world from Thread " << myID << " of " << N_Threads << endl;
    // Create a team of threads; N_Threads is shared while each thread has its own myID
    #pragma omp parallel default(shared) private(myID)
    {
        myID = omp_get_thread_num();
        #pragma omp master
        N_Threads = omp_get_num_threads();    // only the master thread updates the count
        #pragma omp barrier
        #pragma omp critical
        cout << "Hello world from Thread " << myID << " of " << N_Threads << endl;
    }
    return 0;
}
As the program begins we will use two OpenMP library routines to get the thread ID
and the total number of threads and print to the terminal a message saying “Hello
world” with the thread's ID. Initially, when the program begins there will only be 1
thread, so we would expect that this message would read:
Hello world from Thread 0 of 1
Following the first cout statement however we will create a parallel region, letting
each thread share the N_Threads variable, but having its own private copy of myID.
Inside the parallel region each thread will call the omp_get_thread_num routine
to get its ID, but only the master thread will get the total number of threads.
Then each thread will print the same message as before, but this time it might look
something like:
Hello world from Thread 0 of 4
Hello world from Thread 1 of 4
Hello world from Thread 2 of 4
Hello world from Thread 3 of 4
An important point to note is the use of the critical directive which will let each
thread take its turn printing out its message. The complete program is given in
Example20_1.cpp.
Example 20.2:
In this example we will develop a C++ program to solve the 1D first order wave
equation:
\[ \frac{\partial \phi}{\partial t} + v \frac{\partial \phi}{\partial x} = 0 \tag{20.1} \]
in the domain x ∈ [0, 10], t ∈ [0, 10], with boundary condition φ(0, t) = 1, initial condition $\phi(x, 0) = e^{-5(x-3)^2} + 1$, and v = 1.0. For the spatial discretization we will
use the Finite Difference method with second order central differences for the first
derivative and for the temporal discretization we will use the fourth order Runge-
Kutta method. Furthermore, our program will be parallelized with OpenMP. The
intended learning outcomes for this example will be to simply observe how we add
in OpenMP code to an existing C++ program for parallel execution.
In order to begin we will take the C++ program that was developed in Example
13.1 and add in the OpenMP code. The issues to consider here are where to create
the parallel regions, which loops can we parallelize, and which parts should be exe-
cuted by only one thread. Our current algorithm is essentially comprised of a time
marching loop and then inside that, multiple for loops to compute the k values at
the various stages of the Runge-Kutta method. We will create the parallel region
outside the time marching loop as follows (in sketch form, with the loop counter kept private to each thread):
#pragma omp parallel default(shared) private(l)
{
for(l=0; l<N_t-1; l++)
{
f(k1, phi);
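The listing then continues through the remaining Runge-Kutta stages; at the end of each time step the solution is written to file by a single member of the team, along the lines of the following sketch (the arguments to write here are placeholders):
// ... evaluate k2, k3 and k4, and update phi for this time step ...
#pragma omp single
write(phi, l);    // only one thread in the team performs the file output
}
}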
It can be observed that the write function is to be called by only one thread. This
is an important point that requires some elaboration. Often, file I/O is the slowest
part of a program and it would be fantastic if we could speed it up by using multiple
threads. Unfortunately, file I/O is something that we can’t really parallelize with
OpenMP. So we know from Amdahl’s law that this will limit the maximum speedup
that we could expect, but there’s not really too much we can do about it.
The complete program is given in Example20_2.cpp with a Matlab script for
viewing the output of the program given in Example20_2Postprocessing.m. Fig-
ures 20.2(a) and 20.2(b) illustrate the solution at two different moments in time for
the case where ∆x=0.05 and ∆t=0.02. Some example strong scaling runs for a more
‘substantial’ case where ∆x=0.001 and ∆t=0.001 are:
Number of Threads Execution Time (s)
1 12.680660
2 8.103922
4 5.425282
8 4.675502
Figure 20.2: The solutions to the PDE in Example 20.2 illustrating the solution at
(a) l = 0 and (b) l = 200 for the combination ∆x = 0.05 and ∆t = 0.02.
Example 20.3:
In this example we will develop a C++ program to solve the 2D Poisson equation:
\[ \nabla^2 \phi + \psi = 0 \tag{20.2} \]
in the domain x ∈ [0, 1], y ∈ [0, 1], with boundary conditions φ(0, y) = 1, φ(1, y) = 1,
φ(x, 0) = 1, φ(x, 1) = 1, and ψ = 10. For the spatial discretization we will use
the Finite Difference method with second order central differences for the second
derivatives, and to solve the resulting system of algebraic equations we will use the Gauss-Seidel method, with the two-norm as our measure of convergence. Furthermore, our program will be parallelized with OpenMP.
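The parallelization itself follows the pattern of the previous example; a minimal sketch of the work-sharing construct around the Gauss-Seidel sweep is given below, using the update formula from Example 13.2 and assuming an enclosing parallel region has been created around the iteration loop (the iteration counter k is as in the serial program).
#pragma omp for
for(int i=1; i<N_x-1; i++)          // sweep over the interior grid points only
{
    for(int j=1; j<N_y-1; j++)
    {
        phi[i][j] = (Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
                   + phi[i][j-1] + phi[i+1][j]) / 4;
    }
}
#pragma omp single
k++;                                // only one thread increments the iteration counter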
So at this point we have parallelized the Gauss-Seidel iteration. One very subtle
issue is that strictly speaking, this is not the Gauss-Seidel method as presented in
Chapter 3 because with that technique we ‘swept’ through the column vector of
unknowns updating at each iteration. Each time we did this, the row entries below
the current row would have already been updated, but entries above would not have.
When we have multiple threads updating rows then it will not be true that all rows
below will have been updated. This result doesn’t affect the ability of the solver to
converge, but it’s worth mentioning that it is now a subtly different method.
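The computation of the residual that precedes this point can be handled in a similar manner; a minimal sketch (the names r and r_norm are assumptions, and the residual is written here in the same discretized form as the update above) is:
#pragma omp single
r_norm = 0.0;
#pragma omp for reduction(+:r_norm)
for(int i=1; i<N_x-1; i++)
{
    for(int j=1; j<N_y-1; j++)
    {
        double r = Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
                 + phi[i][j-1] + phi[i+1][j] - 4*phi[i][j];
        r_norm += r*r;
    }
}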
So at the end of the computation of the residual, we instruct one of the threads
to compute the square root with a single directive such that we have the correct
value for the two norm. It can be observed that we have parallelized the outermost
for loop in the Gauss-Seidel iteration and that only one thread in the team will
increment the iteration counter k.
The complete program is given in Example20_3.cpp with a Matlab script for
viewing the output of the program given in Example20_3Postprocessing.m. Figure
20.3 presents the converged solution for the case where Nx = 65 and Ny = 65. Some
Figure 20.3: The converged solution to the PDE in Example 20.3 illustrating the
solution for Nx = 60 and Ny = 60.
example strong scaling runs for a more ‘substantial’ case where Nx = 1001 and
Ny = 1001 are:
Chapter 21
MPI
21.1 Concepts
The second API that we will investigate is the Message Passing Interface, which
actually defines a language standard rather than an implementation. As the name
suggests, the interface is based on the distributed memory parallel programming model, in which each process has its own private memory space and data is moved between processes by explicitly sending and receiving messages. The most basic form of communication is point-to-point communication between a pair of processes. Consider, for example, an exchange of the kind sketched at the end of this paragraph, which makes use of the integer
variable myRank, which will be assigned uniquely to each process when they begin
execution. Let’s assume that we have chosen to execute this code with four processes.
All of the processes will enter the if statement, but only one of the processes will
have a rank of 0 and hence enter the following structured block of code. As it does
it will send 10,000 floating point numbers from its copy of the array a to process 1. Meanwhile, process 1 will have skipped the first if statement and will have entered the second structured block of code and as such will be waiting to receive 10,000
floating point numbers into its copy of the array a from process 0. The remaining
processes with ranks 2 and 3 will not enter either structured block and will continue
onto the next portion of the program. The last two arguments of the function calls
above define the communicator and an optional argument tag which can be used
in some cases if we need to associate more information with a message. Returning
to the analogy of a room full of people, point-to-point communication is analogous
to one person talking to another, i.e. one person speaks, the other listens; and both
have to be performing their respective role in order for information to be transferred.
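A minimal sketch of the point-to-point exchange described above is given below; it assumes that the array a holds 10,000 floats on every process, that tag is an integer message tag, and that status is an MPI_Status variable.
if(myRank==0)
{
    MPI_Send(a, 10000, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
}
if(myRank==1)
{
    MPI_Recv(a, 10000, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &status);
}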
In contrast to point-to-point communication, collective communication is used
when we want to exchange information between multiple processes in the one func-
tion call. Some examples of this could be a broadcast, where we transfer data from
one process to all of the others (Figure 21.2(a)), a scatter, where we break up an
array from one process and distribute parts of it to other processes (Figure 21.2(b)),
a gather, where we do the exact opposite of a scatter and assemble data from multiple processes into an array on one process (Figure 21.2(c)), and an all to all, where we
take arrays from all processes and send parts of them around to all other processes
(Figure 21.2(d)). Another common use is to perform a reduction. Consider for
example the code:
MPI_Reduce(a, b, 10000, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
The meaning here is that we are ‘summing’ the 10,000 elements in a across all of the processes and assigning the sums to the array b on process 0. To elaborate on this point, if we had four processes, each having a copy of the array a = {1, 2, 3, 4}, then following the reduction process 0 would have the array
b= {4, 8, 12, 16}. Returning to the analogy of a room full of people, collective
communication is analogous to ‘say’ one person talking to everyone else, or even
everybody talking to everybody else at the same time!
As with OpenMP there are a number of issues we need to consider when devel-
oping MPI code, and one of the most important is the choice of using blocking or
non-blocking communication. When we use blocking communication the send and
receive routines will generally ‘hang’ until the message has been received. As we will
see in the examples this can lead to situations where processes either end up per-
forming their computations one after the other (which defeats the point of designing
a parallel program), or even worse to lock up completely. When we use non-blocking
communication the send and receive routines ‘post’ their data to be sent or received
and the program flow continues. This can allow for more efficient parallel execution
since we can overlap computation and communication at the same time (meaning
that a process can ‘post’ a send or receive and do some computation in the mean-
time when this data is in transit), but of course we have to then explicitly check
at some point in the program that all of the required data has been sent to all of
the necessary processes by the point where they need it for computation, implying
the use of a function such as MPI_Waitall to check for this. As a final point, one
needs to be aware that there is an overhead and latency associated with creating
and sending a message and so messages passing should be performed ‘wisely’. In
particular, sending many small messages repeatedly will generally be far less efficient than sending fewer, larger messages. Furthermore, global collective communication
tends to create more ‘contention’ in the network compared to point-to-point com-
munication because every process may be trying to communicate with every other
process at the same time. The types of communication utilized in a distributed
memory parallel program really depend upon the nature of the algorithm however,
and as we will see, both forms are typically used in a program.
21.2 Runtime Library Routines
Some of the important MPI routines are:
• int MPI_Finalize(void);
Terminates the MPI execution environment. Every MPI program should include
this function call somewhere near the end of the program and there should be
no MPI function calls after this one.
(a) (b)
(c) (d)
Figure 21.2: Some example MPI collective communications illustrating (a) a broad-
cast, (b) a scatter, (c) a gather, (d) an all to all.
• int MPI_Comm_size(MPI_Comm comm, int* size);
Determines the number of processes in the group associated with the communicator. We will pretty much always need to include this function call somewhere near the beginning of our programs so that we can store the total number of processes in the integer variable size. This will help us decide how much ‘work’ each process should be doing (e.g. how many grid points, or cells, or elements a process should be responsible for).
• double MPI_Wtime(void);
Returns elapsed time from some arbitrary point in the past, in seconds. This is
an optional function, but is often useful for determining how long a simulation
takes.
• int MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);
Performs a blocking send of count elements of type datatype, starting at the address buf, to the process with rank dest.
• int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status* status);
Performs a blocking receive of count elements of type datatype into the address buf from the process with rank source.
• int MPI_Waitall(int count, MPI_Request requests[], MPI_Status statuses[]);
Waits for all of the given non-blocking send and receive requests to complete.
One observation that can be made is that essentially all MPI functions involved
in transferring data expect pointers to contiguous blocks of memory, which ties in
nicely with the data structures that we have been using in our serial programs thus
far.
In contrast to OpenMP the compiler does not need to have any support for MPI,
rather it is just a process of linking the program to the appropriate libraries. As with
OpenMP there is a header file which must be included, namely mpi.h. Commonly
however, the installation of the MPI libraries will include the program mpicxx which
can be used instead of the normal compiler to create the executable. In fact mpicxx
is merely a ‘wrapper’ that generally calls the normal compiler, telling it where to look for the header file and the libraries, as well as which libraries to link to.
Another difference compared to OpenMP is that programs are not executed in the
same manner by which a serial program is. A program built to run in parallel
with MPI is executed using another program called mpirun (or mpiexec), which is
responsible for creating all of the processes on the different computers, or processors
(depending on the system). Most simply, a program can be run via the command:
mpirun -n 4 myProgram
where we are specifying four processes to run the program in parallel. The exact
syntax differs between systems, but commonly one might need to specify the ‘full’
path of the executable so that mpirun can find it in the computer's file system, and
there may be different ways in which input arguments to myProgram are specified.
Now as mentioned at the beginning of this section MPI defines a standard rather
than an implementation. As such there are a few common implementations, namely
OpenMPI [37], MPICH [32], and LAM [25]. As with running OpenMP programs one has
the freedom to specify any number of processes to be used for the execution of the
program, but there will generally not be any advantage to specifying more processes
than CPUs available in the system. It is also quite common that depending on the
decomposition of the problem, not just any number of processes can be used. Often
it may need to be a power of two or an even number, etc. Generally for our parallel
MPI programs to run efficiently we will want good load balancing, meaning that each
process has a similar (or ideally the same) amount of work to do, so that they can
complete their computations in the same amount of time. In the context of solving
PDEs this translates to breaking up a computational grid so that processes have
a similar number of grid points or elements. The worst scenario would be if one
process had significantly more work to do than the others as it would slow down the
entire calculation.
Example 21.1:
In this example we will develop a ‘Hello World’ example program in C++, paral-
lelized with MPI to illustrate the creation of multiple processes and send some data
between them using point-to-point communication. The intended learning outcomes
for this example will be to ‘get a feel’ for the structure of a program that includes
MPI code.
In order to begin, as with the ‘Hello World’ program in Example 20.1 let’s just
dive right in and look at the complete program:
#include <iostream>
#include <mpi.h>
using namespace std;

#define TAG 0   // message tag for the point-to-point messages (the value here is an assumption)

int main(int argc, char** argv)
{
int myID;
int N_Procs;
int dummy;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myID);
MPI_Comm_size(MPI_COMM_WORLD, &N_Procs);
if(myID==0)
{
dummy = 10;
cout << "Hello world from Process " << myID << " of " << N_Procs << endl;
for(int n=1; n<N_Procs; n++)
{
MPI_Send(&dummy, 1, MPI_INT, n, TAG, MPI_COMM_WORLD);
}
}
else
{
MPI_Recv(&dummy, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &status);
cout << "Hello world from Process " << myID << " of " << N_Procs << endl;
}
MPI_Finalize();
return 0;
}
As the program begins we call the MPI_Init function to initialize the MPI envi-
ronment, then we call MPI_Comm_rank and MPI_Comm_size to get the rank of each
process and the total number of processes respectively. We then enter an if - else
construct where the ‘root’ process (i.e. having a rank of 0) will enter the first struc-
tured block of code, print a message to the terminal saying “Hello world” with its
rank, and then perform a blocking send to each of the other processes in turn, send-
ing them a single integer number. In contrast the remaining processes will enter the
second structured block of code and perform a blocking receive, waiting to receive
a single integer number from the root process. Once they have received this value
they will print to the terminal a message saying “Hello world” with their rank, so
if we execute the program with the command:
mpirun -n 4 Example21_1
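then we would expect to see output along the lines of the following (the root process prints first, but the order of the remaining messages may vary from run to run):
Hello world from Process 0 of 4
Hello world from Process 1 of 4
Hello world from Process 2 of 4
Hello world from Process 3 of 4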
Example 21.2:
In this example we will develop a C++ program to solve the 1D first order wave
equation:
\[ \frac{\partial \phi}{\partial t} + v \frac{\partial \phi}{\partial x} = 0 \tag{21.1} \]
in the domain x ∈ [0, 10], t ∈ [0, 10], with boundary condition φ(0, t) = 1, initial condition $\phi(x, 0) = e^{-5(x-3)^2} + 1$, and v = 1.0. For the spatial discretization we
will use the Finite Difference method with second order central differences for the
first derivative and for the temporal discretization we will use the fourth order
Runge-Kutta method. Furthermore, our program will be parallelized with MPI. The
intended learning outcomes for this example will be to observe how we add in MPI
code to an existing C++ program for parallel execution, to investigate the use of both
Figure 21.3: A schematic of the partitioned 1D problem domain for the first order
wave equation. The light grey boxes show the interior finite difference grid points,
the dark grey the ghost points and the pink box the Dirichlet boundary point.
The arrows show the flow of information between the ghost points during the time
marching.
blocking and non-blocking communication, and to investigate the use of parallel file
IO to write out the simulation data.
In order to begin we will take the C++ program that was developed in Example
13.1 and add in the MPI code. The first issue to consider here is how we go about
breaking up the problem for a parallel computation. One of the common techniques
used is to assign each process a portion of the spatial domain (in our case a 1D
structured grid) and the processes will perform the time marching in parallel. In
this case we will retain the meaning of N_x to be the total number of grid points
in the domain, but now, rather than having N_x grid points, each process will have
‘say’ myN_x grid points, where we could compute the number of grid points for each
process via something like:
if(N_x%N_Procs)
{
myN_x = N_x/N_Procs + 1;
if(myID==N_Procs-1)
{
myN_x = N_x - myN_x*(N_Procs-1);
}
}
else
{
myN_x = N_x/N_Procs;
}
Remember that we want each process to have the same amount of work to do,
otherwise the execution speed of our parallel simulation will be impeded by the fact
that some processes will end up ‘waiting’ around for the others to catch up. The
way to interpret this code snippet is that if the number of grid points doesn't divide
evenly among the processes, then we add 1 to this division and ‘most’ processes will
be assigned this number. The process with the highest rank will however take the
remainder from the total number, minus what all of the other processes are assigned.
For example, take ‘say’ a grid of 201 points spread over 4 processes. The above ‘domain decomposition’ would assign grid portions of {51, 51, 51, 48} to processes {0, 1, 2, 3}
respectively. Of course if we were really worried about load balancing we should
probably make sure that the size of our grid and the number of processors used
for the simulation are ‘tuned’ or ‘matched’ in some way. The above code snippet
however illustrates a way in which we can maintain a bit of flexibility in the program.
When it comes to setting the initial condition we have to know the ‘global’ x coordinates of each process's portion of the grid so that we can evaluate the $\phi(x, 0) = e^{-5(x-3)^2} + 1$ term correctly. The way in which we will handle this is to define a variable called prevN_x which, for a given process, stores the total number of grid points possessed by processes with a lower rank than it. To evaluate this integer we will make use of the MPI_Exscan function as:
MPI_Exscan(&myN_x, &prevN_x, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
where we are saying sum up the integers in myN_x on every process below ‘me’ and place this number in prevN_x. So every process will have a different value for this
variable based on its rank. At this point we will then know where the grid points
for any given process are located in relation to the ‘global’ grid and can assign the
initial condition. In our program the code to implement this will look like:
for(i=1; i<myN_x+1; i++)
{
x = x_min + Delta_x*(prevN_x + i - 1);
phi[0][i] = exp(-5*pow(x-3, 2)) + 1;
}
So the important observation to make here is that when developing parallel pro-
grams with a distributed memory model, we often have these ‘complications’ that
require more code in order to get everything set up correctly. Of course if we’d just
assumed that the number of grid points would be perfectly divisible by the number
of processes then we wouldn’t have to worry about this, but it’s important to be
exposed to some of these issues.
Now, we know that because we are using a second order central finite difference
stencil, the solution at a particular grid point will depend upon the solution at its
two neighboring grid points and so the approach of distributing the grid poses a
problem when we consider the fact that at the boundaries of each process’s portion
of the grid, it will need data from grid points that exist in another process's
memory. What we can do here is to allocate a slightly bigger array of size myN_x+2
to store what are known as ghost points (Figure 21.3). Then at every time step each
process p will send its value of phi[1] to location phi[myN_x+1] on process p − 1 and its value of phi[myN_x] to location phi[0] on process p + 1. Furthermore, each
process p will receive process p − 1’s value for phi[myN_x], which it will store in
phi[0], and process p + 1’s value of phi[1], which it will store in phi[myN_x+1].
The only exceptions to this are processes 0 and N_Procs − 1, which contain the actual
boundaries of the grid. Having done this however, each process will be able to
evaluate the finite difference stencil and hence perform the time marching correctly.
As we march through time we will then need to exchange data between processes,
and in fact because we are using the fourth order Runge-Kutta method, we will need
to exchange ghost point data four times as we evaluate the k values on each process.
So in this case it makes sense to define a function that we can call repeatedly and
as such we will create the function exchange, which will take the form:
void exchange(double* phi, int myN_x, int myID, int N_Procs)
{
MPI_Status Status;
if(myID>0)
{
MPI_Send(&phi[1], 1, MPI_DOUBLE, myID-1, 1, MPI_COMM_WORLD);
}
if(myID<N_Procs-1)
{
MPI_Recv(&phi[myN_x+1], 1, MPI_DOUBLE, myID+1, 1, MPI_COMM_WORLD, &Status);
}
// Send the other ghost value to the higher-ranked neighbor and receive from the lower one
if(myID<N_Procs-1)
{
MPI_Send(&phi[myN_x], 1, MPI_DOUBLE, myID+1, 2, MPI_COMM_WORLD);
}
if(myID>0)
{
MPI_Recv(&phi[0], 1, MPI_DOUBLE, myID-1, 2, MPI_COMM_WORLD, &Status);
}
return;
}
The first input to this function will either be phi[l] for evaluating k1 and then
subsequently tempPhi for evaluating the remaining k values. As this function pro-
ceeds all processes except process 0 will enter the first if statement and try to send a ghost value to their lower neighbor process (e.g. process 3 will try to send to process 2). Only process 0 will initially be able to proceed to the second if statement and wait to receive its ghost point value from process 1. As soon as this has happened
process 1 is then ‘freed’ up to receive its ghost point value from process 2, and
so on, such that there is in fact a ‘wave’ of propagation of data being exchanged.
Once the processes have received their first ghost point value they will try and send
their other ghost point value to their higher neighbor process (e.g. process 3 will
try to send to process 4). Since the last process (with rank N_Procs-1) will not
enter the second if statement, it will be waiting to receive its ghost point
value from process N_Procs-2. As soon as this has happened process N_Procs-2 is
then ‘freed’ up to receive its ghost value from process N_Procs-3, and so on, such
that there is a second ‘wave’ of propagation of data being exchanged. This wave
of information transfer in effect creates a serial computation since all the processes
are sending in turn and this is not ideal. To implement the equivalent data transfer
using non-blocking communication we could instead define our exchange function
as:
void exchange(double* phi, int myN_x, int myID, int N_Procs)
{
MPI_Status statuses[4];
MPI_Request requests[4];
int N_r = 0;
if(myID>0)
{
MPI_Isend(&phi[1], 1, MPI_DOUBLE, myID-1, 1, MPI_COMM_WORLD, &requests[N_r++]);
MPI_Irecv(&phi[0], 1, MPI_DOUBLE, myID-1, 2, MPI_COMM_WORLD, &requests[N_r++]);
}
if(myID<N_Procs-1)
{
MPI_Irecv(&phi[myN_x+1], 1, MPI_DOUBLE, myID+1, 1, MPI_COMM_WORLD, &requests[N_r++]);
MPI_Isend(&phi[myN_x], 1, MPI_DOUBLE, myID+1, 2, MPI_COMM_WORLD, &requests[N_r++]);
}
MPI_Waitall(N_r, requests, statuses);
return;
}
It can be observed here that the structure is similar, but we are using the non-blocking send and receive routines, posting each transfer and then waiting for all of them to complete with MPI_Waitall before the function returns. The other place where the program changes is in the function f that evaluates the k values, which must now account for which portion of the grid a given process holds; in sketch form it looks something like:
void f(double* k, double* phi)
{
if(myID==0)
{
for(int i=2; i<myN_x+1; i++)
{
k[i] = -v/(2*Delta_x)*(phi[i+1] -phi[i-1]);
}
}
else if(myID==N_Procs-1)
{
for(int i=1; i<myN_x; i++)
{
k[i] = -v/(2*Delta_x)*(phi[i+1] -phi[i-1]);
}
k[myN_x] = -v/( Delta_x)*(phi[myN_x]-phi[myN_x-1]);
}
else
{
for(int i=1; i<myN_x+1; i++)
{
k[i] = -v/(2*Delta_x)*(phi[i+1] -phi[i-1]);
}
}
return;
}
Here, we can see that the k array will be evaluated in a slightly different manner depending upon the process's rank. Beginning with process 0, because it has the Dirichlet boundary point in its portion of the grid, it will start updating from the third element in the phi array (with index 2). For process N_Procs−1 however,
it contains the grid point at the other end of the domain where we must use a first
order backward difference in order to evaluate the spatial derivative at that point.
For all other processes, they simply loop over all of their grid points (indices 1 to
myN_x) and update them. In every case, it is assumed that the ghost point values
will have been exchanged and will be ‘waiting’ for use.
Figure 21.4: A schematic illustrating the single ascii output file and how each
process writes its portion of the grid to the file. Each square illustrates the number
of bytes required to store a single field value φi as a number of characters. The
format of the text file is such that each row corresponds to the solution over the
entire grid at a particular time step.
The one thing that we have not yet mentioned is how we will output the data
from our program. There are a few issues to consider here. When our grid has been
distributed into multiple memory spaces, how do we recombine all of that data; do
we combine it all to create one output file, or do we have each process output its
portion of the grid and then recombine the data with some other program? As it
happens, both of these approaches are used in practice and we will see both in action
shortly. Often, it is ‘easier’ if we only have one output file to post process and so
then the question is, what’s the best way to get all of the data together to write
out into one file? One approach is to have each process send its portion of the data
to one process (usually the root process with rank 0 ’say’). Process 0 could then
be responsible for opening an output file and writing all of the data to it. While
this idea sounds quite simple, the disadvantages with this approach are that firstly
it can mean a large amount of message passing, all to the one process (analogous to
having a large number of people all talking to the one person) and secondly, it might
not be possible for the memory space of process 0 to hold all of the field data for all
of the other processes. Indeed, part of the reason for running parallel computations
in the first place is so that we can run much larger calculations that could not fit
in the memory space of one individual processor. Another option is to use the MPI
file I/O functionality. The basic idea here is that each process can write to the one
file, but they will write to different parts of the file, so that once we are done, we
will have one complete file. We begin by declaring a variable of type MPI_File and
each process will open up this file:
MPI_File file;
MPI_File_open(MPI_COMM_WORLD, "Example21_2.data", MPI_MODE_CREATE | MPI_MODE_WRONLY,
MPI_INFO_NULL, &file);
where the important input arguments to the function call to open the file are the
file name, the communicator, and some additional flags defining how the file is to
be opened (in our case we are creating the file and writing to it only). Now we are
going to be writing out the data at each time step (as we did in Example 13.1) and
so we will define a function write that we will call at each time step to add the field
data to the output file. This is going to take a bit of work, so let’s in fact start by
looking at the complete function:
void write(MPI_File& file, double* phi, int l, int myN_x, int prevN_x, ...
int N_x, int myID, int N_Procs)
{
int N_CharPer_x = 7;
int N_BytesPer_x = N_CharPer_x * sizeof(char);
int N_BytesPer_l = N_x * N_BytesPer_x;
int Offset = prevN_x * N_BytesPer_x;
char buffer[myN_x*N_BytesPer_x];
for(int i=1; i<myN_x+1; i++)
{
if(i==myN_x && myID==N_Procs-1)
{
sprintf(buffer+(i-1)*N_BytesPer_x, "%+.3f\n", phi[i]);
}
else
{
sprintf(buffer+(i-1)*N_BytesPer_x, "%+.3f\t", phi[i]);
}
}
MPI_File_seek(file, l*N_BytesPer_l + Offset, MPI_SEEK_SET);
MPI_File_write(file, buffer, myN_x * N_CharPer_x, MPI_CHAR, MPI_STATUS_IGNORE);
return;
}
Generally, with large scale computations we would want to write our simulation
data in binary format (as opposed to ascii format) because it is faster to write and
requires less disk space. The disadvantage with binary data however, is that we
can’t open up the file in ‘say’ a text editor to look at it. Most of the time we would
have some post processing program available to read out binary output, but for our
purposes we are going to use Matlab to perform the post processing and this will
be easier to do if we have an ascii text file. That being said, the thing that our
write function has to do is to decide where in the file to start writing the field
data for each process. We can in fact think of the file in this sense like a single 1D
array and we have to compute which entry in the array corresponds to the start of
our grid. So in fact what we have to do is compute how many bytes into the file
each process should ‘skip over’ before it starts writing its data. Returning now to
the code snippet, showing the write function, the first thing we do is declare some
integer variables, the first defining the number of characters that we will use to store
a floating point number. In this example we are using 7 characters to represent a
number and so any given number in our output file will look like +0.027. Here
we can see that the + and the . count as characters, plus the tab character which
will separate out every string (but we will just see it as white space). So using 7
characters we will only actually be storing our field data to 3 decimal places.
Following this definition we can compute the number of bytes to store a field
value φi as the number of characters multiplied by the number of bytes per character
(which is 1 for ascii). Then we can compute the number of bytes to store the entire
grid at any one time step as the number of bytes per field value multiplied by the
overall number of grid points. Then in order to compute where in the file any given
process should ‘skip’ to, we will make use of the prevN_x variable that was defined
when setting the initial condition. As such we can compute an ‘offset’ for each
process that is the number of grid points owned by ranks below it multiplied by
the number of bytes per grid point (field value). With this information computed
we enter a for loop where we loop over every grid point in our portion of the
grid and convert the floating point numbers to a string with the sprintf function,
which we put into an array called buffer. Once this is done, all that we need
to do is skip to the right portion of the file and write the data. The skipping is
performed with the MPI_File_seek function and the number of bytes to skip is
l*N_BytesPer_l+Offset, i.e. we have to skip over the bytes for all of the time steps already written, plus the offset for our process within the current time step. Then, finally, we can write out the data
with the MPI_File_write function. The important argument to the function here
is MPI_CHAR which indicates that we are writing ascii text. Often, this would be
MPI_FLOAT or MPI_DOUBLE when writing binary data.
The complete program is given in Example21_2.cpp with a Matlab script for
viewing the output of the program given in Example21_2Postprocessing.m. Fig-
ures 21.5(a) and 21.5(b) illustrate the solution at two different moments in time for
the case where ∆x=0.05 and ∆t=0.02. Some example strong scaling runs for a more
‘substantial’ case where ∆x=0.001 and ∆t=0.001 are:
Number of Processes Execution Time (s)
1 2.650334
2 1.355908
4 0.661549
8 0.371658
Figure 21.5: The solutions to the PDE in Example 21.2 illustrating the solution at
(a) l = 0 and (b) l = 200 for the combination ∆x = 0.05 and ∆t = 0.02.
Example 21.3:
In this example we will develop a C++ program to solve the 2D Poisson equation:
\[ \nabla^2 \phi + \psi = 0 \tag{21.2} \]
in the domain x ∈ [0, 1], y ∈ [0, 1], with boundary conditions φ(0, y) = 1, φ(1, y) = 1,
φ(x, 0) = 1, φ(x, 1) = 1, and ψ = 10. For the spatial discretization we will use
the Finite Difference method with second order central differences for the second
derivatives and to solve the resulting system of algebraic equations we will use the
Gauss-Seidel method, with the two-norm as our measure of convergence. Further-
more, our program will be parallelized with MPI. The intended learning outcomes for
this example will be to observe how we add in MPI code to an existing C++ program
for parallel execution and to investigate creating new communicators and data types.
Figure 21.6: A schematic of the partitioned 2D problem domain for the Poisson
equation. The pink circles show the interior finite difference grid points, while the
blue show the boundary grid points. The grey boxes show the portions of the grid
that are mapped to each process (Note that we actually use a 4 × 4 grid in the
Example) and the yellow patches show where the ghost cells overlap (i.e. each
process is actually only storing a 2 × 2 portion of the grid, the remaining grid points
are ghost cells). Finally the green arrows illustrate the flow of information during
each of the four send and receive operations that are required to communicate the
ghost cell values throughout the Gauss-Seidel iterations.
In order to begin we will take the C++ program that was developed in Example
13.2 and add in the MPI code. The first issue to consider here is how we go about
breaking up the problem for a parallel computation. Compared to the 1D spatial
domain from Example 21.2 we have a couple of options. We could either break up
the square domain into ‘strips’ and give each process one of these strips, or we could
break it up into a ‘checkerboard’ type pattern and give each process one. We will
use the latter approach and break the domain up into smaller square pieces of equal
size. This means that we will need to use square numbers of processes (i.e. 4, 9,
16, 25, etc) and we can define an array dimensions which will store the numbers of
processes in x and y.
Similar to the 1D domain we will allocate the array slightly bigger to store
the ghost points and will exchange information between processes throughout the
simulation. Figures 21.6(a) and 21.6(b), illustrate an example grid, with its domain
decomposition and the ghost point data being exchanged between the process with
the middle of the grid and its neighbors. Because the layer of ghost points forms a
ring around the pieces of grid where we are defining the solution, it is often termed a
halo. For this example we will keep the number of grid points per process (which we
will denote myN_x and myN_y) constant, such that as we add more processes the grid
resolution will increase (i.e. ∆x and ∆y will decrease). As such we will calculate
the grid spacings as:
int myN_x = 20;
int N_x = myN_x*dimensions[X];
float Delta_x = (x_max-x_min)/(N_x-1);
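A minimal sketch of the set-up of this communicator is given below; the names isPeriodic, reorder, and myCoords, and the calls retrieving the rank, coordinates, and neighbor ranks, are written out here for illustration (N_D, dimensions, and myID are assumed to have been declared earlier in the program).
int isPeriodic[2] = {0, 0};   // the grid of processes is not periodic
int reorder = 1;              // the ranks may be reordered
int myCoords[2];
int leftNeighbor, rightNeighbor, bottomNeighbor, topNeighbor;
MPI_Comm Comm2D;
MPI_Cart_create(MPI_COMM_WORLD, N_D, dimensions, isPeriodic, reorder, &Comm2D);
MPI_Comm_rank(Comm2D, &myID);                    // rank within the new communicator
MPI_Cart_coords(Comm2D, myID, N_D, myCoords);    // coordinates of this process in the grid
MPI_Cart_shift(Comm2D, 0, 1, &leftNeighbor, &rightNeighbor);
MPI_Cart_shift(Comm2D, 1, 1, &bottomNeighbor, &topNeighbor);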
Here what we are doing is creating a new communicator called Comm2D with the
function MPI_Cart_create that we will use instead of MPI_COMM_WORLD. The ar-
guments to this function include the current communicator, the dimensions of the grid (of processes), of which there are N_D, whether or not this grid will be periodic
(which could be handy if our PDE had periodic boundary conditions ‘say’), and
whether or not the processes can be reordered (meaning whether or not the ranks
can be modified). Once we have created this new communicator each process can get
its new rank within the context of the 2D communicator and furthermore, we can
get the coordinates of each process (Figure 21.6(b)). One of the most useful features
however is the MPI_Cart_shift function, which will give us the ranks of the processes on either side of a given process. We will store these ranks in the
variables leftNeighbor, rightNeighbor, bottomNeighbor, topNeighbor, where it
should be noted that the actual values will of course be different on each process. An
important point worth mentioning is that processes on the ‘boundary’ of the grid will
not have all four neighbors (e.g. the lower left process in Figure 21.6(a) will only have
a rightNeighbor and a topNeighbor). In these cases the MPI_Cart_shift func-
tion will return MPI_PROC_NULL, and if we try and send to or receive from this rank,
the send and receive functions will simply ignore it.
As we exchange ghost point information throughout the iterations of our Gauss-
Seidel method, we will need to send to and receive from each of these neighbors
and while we could also accomplish this using either the standard blocking or non-
blocking sends and receives, a more elegant way is to use the MPI_Sendrecv function
that will do two things at the same time. For example, to exchange all of the ghost
point data between the left and right neighbors of a given process we could do:
MPI_Sendrecv( &(phi[1][1]), myN_y, MPI_DOUBLE, leftNeighbor, 0,
&(phi[myN_x+1][1]), myN_y, MPI_DOUBLE, rightNeighbor, 0,
Comm2D, &status);
MPI_Sendrecv( &(phi[myN_x][1]), myN_y, MPI_DOUBLE, rightNeighbor, 0,
&(phi[0][1]), myN_y, MPI_DOUBLE, leftNeighbor, 0,
Comm2D, &status);
Figure 21.7: A schematic illustrating the ghost point data to be sent to neighboring
processes and its layout in memory.
In the first function call we are sending myN_y entries to the process leftNeighbor
and at the same time receiving myN_y entries from the process rightNeighbor. With
reference to the middle process in Figure 21.6(b) the address of the array that we
will send from is phi[1][1], and the address of the array that we will receive
into is phi[myN_x+1][1]. So upon completion of these two function calls we have
exchanged all of the data between our left and right neighbors. Now, because of the
way we have allocated memory for phi, elements in the array along a column y will
be contiguous in memory, but along a row x will not be. This is a problem in terms of
sending and receiving data between our topNeighbor and bottomNeighbor because
the send and receive functions expect a pointer to a contiguous block of memory.
One option would be to create a new array (that will be a single contiguous block of
memory), loop over all of the x values that we need, copy them into the new array,
and send that array instead. A better way however is to again make use of some
MPI functionality, in particular to create a new data type, which we can implement
with the code:
MPI_Datatype strideType;
MPI_Type_vector(myN_x, 1, myN_y+2, MPI_DOUBLE, &strideType);
MPI_Type_commit(&strideType);
Here, strideType is our new data type, and it will be a vector that will hold
myN_x double precision floating point numbers. The third argument to the function
indicates the stride (in this case myN_y+2), meaning that when we try to send a
strideType (and give the function the pointer indicating where the block of memory
to be sent begins), every (myN_y+2)-th value in the array will become part of the vector.
The MPI_Type_commit function makes the new data type available for use. Having
done this, we can then send to our topNeighbor and bottomNeighbor as:
MPI_Sendrecv( &(phi[1][1]), 1, strideType, bottomNeighbor, 0,
&(phi[1][myN_y+1]), 1, strideType, topNeighbor, 0,
Comm2D, &status);
MPI_Sendrecv( &(phi[1][myN_y]), 1, strideType, topNeighbor, 0,
&(phi[1][0]), 1, strideType, bottomNeighbor, 0,
Comm2D, &status);
which is quite similar to data exchange between the left and right neighbors, ex-
cept that instead of sending myN_y double precision floating point numbers, we are
sending 1 strideType.
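For comparison, the manual packing alternative mentioned above (copying a non-contiguous row into a temporary array before sending) might look something like the following sketch, here for the exchange with the bottom and top neighbors; the names sendBuffer and recvBuffer are hypothetical:
double* sendBuffer = new double [myN_x];
double* recvBuffer = new double [myN_x];
for(i=1; i<myN_x+1; i++)
{
    sendBuffer[i-1] = phi[i][1];        // pack the bottom interior row
}
MPI_Sendrecv(sendBuffer, myN_x, MPI_DOUBLE, bottomNeighbor, 0,
             recvBuffer, myN_x, MPI_DOUBLE, topNeighbor, 0,
             Comm2D, &status);
for(i=1; i<myN_x+1; i++)
{
    phi[i][myN_y+1] = recvBuffer[i-1];  // unpack into the top ghost row
}
delete [] sendBuffer;
delete [] recvBuffer;
The MPI_Type_vector approach avoids this extra copying and keeps the exchange code symmetric with the left/right case.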
Now that we have our data exchange sorted out, the next little complication is
that some of the processes will contain the Dirichlet boundary grid points and so
when we perform our Gauss-Seidel iteration, we don’t want to include these points
in our update. With reference to Figure 21.6(a) we can see that process 0 (with
coordinates (0, 0)) should begin its update from the array entry phi[2][2] and end
at phi[myN_x][myN_y], whereas process 6 (with coordinates (2, 0)) should begin its
update from the array entry phi[1][2] and end at phi[myN_x-1][myN_y]. The
way in which we can incorporate these different starting and ending indices for the
different processes is to define the variables:
int myiStart = myCoords[X]==0 ? 2 : 1;
int myiEnd = myCoords[X]==N-1 ? myN_x : myN_x+1;
int myjStart = myCoords[Y]==0 ? 2 : 1;
int myjEnd = myCoords[Y]==N-1 ? myN_y : myN_y+1;
Here we are saying that if a given process's x coordinate is 0 (meaning that it will
therefore include a portion of the left Dirichlet boundary), then the starting index
will be 2, otherwise it will be 1. Similarly, if the given process's x coordinate is N − 1
(meaning that it will therefore include a portion of the right Dirichlet boundary),
then the ending index will be myN_x, otherwise it will be myN_x+1. We can then
perform our Gauss-Seidel iteration correctly for each process via the code:
for(i=myiStart; i<myiEnd; i++)
{
for(j=myjStart; j<myjEnd; j++)
{
phi[i][j] = (Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
+ phi[i][j-1] + phi[i+1][j]) / 4;
}
}
One important observation is that this will affect the load balancing slightly: some
processes will have slightly more work to do than others when performing the Gauss-Seidel
iteration. This could be improved by decomposing the data differently such that
each process has a similar number of interior points, but the tradeoff would be a
more complex code. As it happens, we will also need to do the same thing when it
comes to computing the residual, but we have an extra complication here. Because
each process is only operating on a portion of the grid, it will only be computing
a portion of the residual. Our iterative while loop will however need the ‘overall’
or ‘global’ residual norm. Because we are using the two norm as our measure of
convergence, what we will do is to have each process loop over its portion of the
grid and add up the $r_{i,j}^2$ terms into a variable called myr_norm as:
myr_norm = 0.0;
for(i=myiStart; i<myiEnd; i++)
{
for(j=myjStart; j<myjEnd; j++)
{
r[i][j] = Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
+ phi[i][j-1] + phi[i+1][j] - 4*phi[i][j];
myr_norm += r[i][j]*r[i][j];
}
}
MPI_Allreduce(&myr_norm, &r_norm, 1, MPI_DOUBLE, MPI_SUM, Comm2D);
r_norm = sqrt(r_norm);
Following this, we will perform a reduction operation with the MPI_Allreduce func-
tion to sum all of the individual myr_norm’s on each process and place the result
in the variable r_norm, which each process has a copy of. Then each process will
compute the square root of this value, which will give the correct global measure of
the two norm. This way the processes will always stay synchronized.
We can now write out our entire iterative loop as:
while(r_norm>tolerance && k<N_k)
{
MPI_Sendrecv( &(phi[1][1]), myN_y, MPI_DOUBLE, leftNeighbor, 0,
&(phi[myN_x+1][1]), myN_y, MPI_DOUBLE, rightNeighbor, 0,
Comm2D, &status);
MPI_Sendrecv( &(phi[myN_x][1]), myN_y, MPI_DOUBLE, rightNeighbor, 0,
&(phi[0][1]), myN_y, MPI_DOUBLE, leftNeighbor, 0,
Comm2D, &status);
MPI_Sendrecv( &(phi[1][1]), 1, strideType, bottomNeighbor, 0,
&(phi[1][myN_y+1]), 1, strideType, topNeighbor, 0,
Comm2D, &status);
MPI_Sendrecv( &(phi[1][myN_y]), 1, strideType, topNeighbor, 0,
&(phi[1][0]), 1, strideType, bottomNeighbor, 0,
Comm2D, &status);
for(i=myiStart; i<myiEnd; i++)
{
for(j=myjStart; j<myjEnd; j++)
{
phi[i][j] = (Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
+ phi[i][j-1] + phi[i+1][j]) / 4;
}
}
myr_norm = 0.0;
for(i=myiStart; i<myiEnd; i++)
{
for(j=myjStart; j<myjEnd; j++)
{
r[i][j] = Delta_xy*Delta_xy*psi + phi[i-1][j] + phi[i][j+1]
+ phi[i][j-1] + phi[i+1][j] - 4*phi[i][j];
myr_norm += r[i][j]*r[i][j];
}
}
MPI_Allreduce(&myr_norm, &r_norm, 1, MPI_DOUBLE, MPI_SUM, Comm2D);
r_norm = sqrt(r_norm);
k++;
}
where we can see that the basic steps within the while loop consist of first ex-
changing the ghost point data with neighboring processes, then performing a Gauss-Seidel
update, and finally computing the local and then the global residual norm.
The final issue we will consider here is writing out the data. In Example 21.2
we used the MPI file I/O functionality to create a single output file. Now, there’s no
reason why we couldn’t do the same thing here. The only difficulty is that it would
involve a bit more code to compute the correct locations in the file to write data
in order to have an output file that had the same structure as the grid. What we
will do in this case is use the second option that was mentioned in that example,
namely to have each process write out a file containing the data for its portion of
the grid. We will then leave it up to our Matlab post processing script to take care
of ‘reassembling’ the data. As such the code for writing out the data is fairly simple
and will take the form:
fstream file;
char myFileName[64];
...
sprintf(myFileName, "Example21_3_Process_%d_%d.data", myCoords[X], myCoords[Y]);
file.open(myFileName, ios::out);
for(i=1; i<myN_x+1; i++)
{
for(j=1; j<myN_y+1; j++)
{
file << phi[i][j] << "\t";
}
file << endl;
}
file.close();
...
where it can be observed that each process is simply writing out its portion of
the grid in a nested for loop. The only issue worth mentioning here is that each
process’s file will have a unique filename that will include the coordinates of the
process in the name. This will make it easier to ‘reassemble’ the grid in our post
processing script.
Figure 21.8: The converged solution to the PDE in Example 21.3 illustrating the
solution for N_x = 60 and N_y = 60.
Example 21.4:
In this example we will develop a C++ program to solve the 2D generic scalar
transport equation:
\dot{\phi} + \mathbf{v} \cdot \nabla \phi = \mu \nabla^2 \phi + \psi   (21.3)
in the domain x ∈ [0, 1], y ∈ [0, 1], t ∈ [0, 2], with boundary conditions φ(0, y) = 1,
φ(x, 0) = 1, ∂_x φ(1, y) = 0, ∂_y φ(x, 1) = 0, initial condition φ(x, y, 0) = e^{-50(x-0.3)^2} + 1,
and v = {0.5, 0.5}, µ = 0.01, and ψ = 0.2. For the spatial discretization we will
use the Finite Element method with linear triangular elements, for the temporal
discretization we will use the implicit Euler method, and we will solve the resulting
linear system with the Conjugate Gradient method. Furthermore, our program will
be parallelized with MPI. The intended learning outcomes for this example will be to
observe how we add in MPI code to an existing C++ program for parallel execution
and to investigate the decomposition of an unstructured grid.
With the parallel program developed in Example 21.3, the use of a structured
grid meant that it was relatively easy to break up a spatial domain and use the
concept of ghost points in order to exchange the discrete field data throughout the
solution process. With an unstructured grid, we could use this approach, but we
would need to store a list of which points in any given process's Points array are
its ghost points; or we may in fact use ghost elements instead. As it happens
we are not going to use ghost ‘anything’ in our parallel computation and will do
something entirely different. Our method is going to be very much tied to the use of
the conjugate gradient method for solving the resulting linear system of equations
at each time step.
The way in which we will break up our unstructured grid is to assign the elements
to each process (Figure 21.9(b)). As can be observed, we are going to choose nine
processes for our parallel computation, but of course in principle the code that
we develop could be applied to any number of processes. Now, the manner by
which we break up the computational domain is quite a complicated topic and is
beyond the scope of this example. For the interested reader however, the domain
was decomposed with the Metis [29] library.
Figure 21.9: A schematic of the partitioned 2D problem domain for the generic
scalar transport equation. The grey boxes show the portions of the grid that are
mapped to each process.
So, although the elements are uniquely
assigned to a process, certain points on the boundaries between processes will be
duplicated and we will have to handle this appropriately in our code. It can be
observed that this is a little bit different compared to the concept of ghost points,
because in that case, each process is responsible for a unique set of points where
the ghost points allowed for the finite difference quotients to be evaluated at the
boundaries of the process's portion of the grid. In this case, the solution at a point
on the boundary between two grids will in fact be computed by each process sharing
that point.
We will see how we handle this shortly, but for now, let’s first look at how
we will get this data into our parallel program. We are going to assume that the
unstructured grid has already been decomposed into a number of separate ascii
text files that include the intended process rank in their filenames, such as:
Example21_4.grid0
Example21_4.grid1
...
Example21_4.grid8
As a quick aside, it is important to bear in mind that quite commonly one would
deal with binary files for large scale computer simulations because they are typically
smaller in size and are quicker to read and write. We are using ascii text files
here because the human readability aids in the understanding of how the
numerical methods work. Each file will follow the same form as the ascii text
file defined in Example 15.3, the major difference now being the number of points,
faces, and elements in each text file. As before we will not make any assumptions
about the ordering of the points and faces in the file. The big difference here is that
for the points that are shared between processes we are going to define a new type
of boundary condition which we will call an interprocess boundary condition. We
shouldn't really think of this as being a fundamental type of boundary condition
like a Dirichlet or a Neumann boundary condition; it's really just a way in which we
keep track of which points are duplicated on other processes. It does, however, give us
a nice, elegant way in which to incorporate them into our existing file structure.
The contents of the file containing the portion of the unstructured grid for, say,
process 7 (Example21_4.grid7) will look something like:
N_p 133
N_f 20
N_e 223
N_b 4
Points
1.00000 0.00000
0.64286 0.00000
...
0.94855 0.33766
Faces
1 2
2 3
...
19 20
Elements
21 62 98
82 45 106
...
49 126 113
Boundaries
bottom
dirichlet
11
0 1 2 3 4 5 6 7 8 9 10
1.00000
right
neumann
10
10 11 12 13 14 15 16 17 18 19
0.00000
process7to6
interprocess
9
54 55 62 72 78 90 96 97 98
6
process7to8
interprocess
13
20 23 32 47 60 62 63 67 68 112 123 125 132
8
As can be observed in the Boundaries section of the file, there is a Dirichlet and a
Neumann boundary condition, defined in the same manner as in Example 15.3, plus
two interprocess boundary conditions. With Dirichlet and Neumann boundaries,
the indices define the points and the faces, respectively, to which the boundary
condition is applied. For an interprocess boundary the indices define the points that
are shared between one process and another, and the value that is given defines the
process with which the data is shared. The contents of the file containing the
portion of the unstructured grid for process 6 will look something like:
N_p 139
N_f 10
N_e 238
N_b 5
Points
0.28571 0.00000
0.32143 0.00000
...
0.33743 0.05103
Faces
0 1
1 2
...
9 10
Elements
47 21 48
13 44 57
...
71 109 138
Boundaries
bottom
dirichlet
11
0 1 2 3 4 5 6 7 8 9 10
1.00000
process6to4
interprocess
12
12 44 58 64 69 80 89 95 100 123 127 128
4
process6to5
interprocess
1
12
5
process6to7
interprocess
9
51 54 60 73 79 91 97 98 99
7
process6to8
interprocess
8
12 15 20 47 48 60 113 134
8
As can be observed, process 6 has an interprocess boundary, sharing nine points with
process 7, with indices 51 54 60 ... 99. These correspond with the interprocess
boundary for process 7, sharing nine points with process 6, with indices 54 55 62
... 98. An important point to note here is that the indices into each process's
Points array will be different because each process will have its own collection of
points, faces, and elements, with different ordering and numbering. As long as these
indices correspond to the same physical coordinates in each process's Points array
however, then this approach will work. So to emphasize this point, each process will
have a local numbering of its Points, Faces, and Elements arrays, but there will
also be an implicit global numbering. This is ‘almost’ all that is necessary in order
to handle the decomposed grid. Let’s just jump right in and look at the modified
read function, and then we'll discuss the one extra little complication that
we have to deal with. This function will look something like:
void read(char* filename, double**& Points, int**& Faces, int**& Elements, ...
Boundary*& Boundaries, int& myN_p, int& myN_f, int& myN_e, int& myN_b, ...
bool*& yourPoints, int myID)
{
fstream file;
string temp;
char myFileName[64];
int myMaxN_sp = 0;
int myMaxN_sb = 0;
int maxN_sp = 0;
int yourID = 0;
sprintf(myFileName, "%s%d", filename, myID);
file.open(myFileName);
file >> temp >> myN_p;
file >> temp >> myN_f;
file >> temp >> myN_e;
file >> temp >> myN_b;
Points = new double* [myN_p];
Faces = new int* [myN_f];
Elements = new int* [myN_e];
Boundaries = new Boundary [myN_b];
Points[0] = new double [myN_p*2];
Faces[0] = new int [myN_f*2];
Elements[0] = new int [myN_e*3];
yourPoints = new bool [myN_p];
...
}
A couple of points to note here are that firstly, each process will append its rank to
the string defining the file name, before it attempts to open that file. That way we
can run our program by passing in one argument (say Example21_4.grid) and the
code will append the rank so that the nine processes read in Example21_4.grid0,
Example21_4.grid1, . . ., Example21_4.grid8. Another thing to note is that we
are defining another 1D array called yourPoints, storing boolean values which record
whether this process or another process ‘owns’
them. This concept is analogous to the neighbour-owner concept applied in the
Finite Volume method in Chapter 14, where, although a face is shared between two
cells, we pick one of those cells to be the ‘owner’ of the face and one to be the
‘neighbour’ of the face. Here we will be assigning one of the two (or however many)
processes sharing the interprocess points to be the owner of them. We will see why
this is important shortly, but the approach that we will use to assign ownership
will be to give the shared points to the process with the highest rank; much as older
siblings always have more privileges than younger ones, the highest rank gets to own
the shared points. The final point to note is that we have declared
the variables myMaxN_sp, an integer storing the maximum number
of shared points on any given interprocess boundary, and myMaxN_sb, an
integer storing the number of interprocess boundaries that a given process has. These
will be used in order to allocate a buffer array to use in the MPI send and receive
routines. The rest of our read function will then look something like:
void read(char* filename, double**& Points, int**& Faces, int**& Elements, ...
Boundary*& Boundaries, int& myN_p, int& myN_f, int& myN_e, int& myN_b, ...
same as for the serial case in Example 15.3. So that was easy; but although the
serial code doesn’t need to be modified, we do need to realize that the systems of
equations on each process will in fact not be quite correct at this stage. To elaborate
on this point, consider the interprocess boundary between processes 2 and 5 depicted
in Figure 21.10. In particular we are going to focus on one shared point, which is
point 84 in process 2’s local numbering scheme and point 88 in process 5’s local
numbering scheme. When we think about how we assemble M, K, and s with the
Finite Element method, we loop over all of the elements and for each element loop
over all of the nodes that a given element uses, adding in terms to M, K, and s in the
rows and columns corresponding to those nodes. If we consider the two rows in K
and s for point 84 on process 2, then we can see that they will be missing the entries
from the three elements that are in process 5's portion of the grid. Conversely, the
two rows in K and s for point 88 on process 5 will be
missing the entries from the two elements that are in process 2's portion of the grid.
As a quick point to note, because M and K have the same pattern, we only need to
consider one to convey this issue.
Figure 21.10: A schematic illustrating the rows in the stiffness matrices and load
vectors corresponding to a point shared between two processes. The colored blocks
illustrate the contribution that each node makes in the system of equations for that
point and the grey blocks indicate a summation of terms from multiple elements.
So, when each process comes to solving its system of equations at each time
step, the equations for each shared point will be missing some terms and if we want
our algorithm to work, we need to share this data in order to make sure that the
shared point equations are the same on each process. So the question to ask is, do
we exchange the entries in these matrices? That does seem like a good way to do
things, but in fact we can get away with something even simpler. Again, this is very
much tied with our use of the conjugate gradient method. In order to build up to
how we make this work, let’s imagine that we have just completed the execution of
the assemble function and we have the incomplete M , K, and s. Similar to our MPI
programs in Examples 21.1 and 21.3 we will define a function called exchange that
we will call repeatedly throughout the program and we will pass this a 1D array,
representing a column vector of size Np × 1, which will look like:
void exchange(double* v, Boundary* Boundaries, int myN_b)
{
int yourID = 0;
int tag = 0;
MPI_Status status;
for(int b=0; b<myN_b; b++)
{
if(Boundaries[b].type_=="interprocess")
{
for(int p=0; p<Boundaries[b].N_; p++)
{
buffer[p] = v[Boundaries[b].indices_[p]];
}
yourID = static_cast<int> (Boundaries[b].value_);
MPI_Bsend(buffer, Boundaries[b].N_, MPI_DOUBLE, yourID, tag, MPI_COMM_WORLD);
}
}
for(int b=0; b<myN_b; b++)
{
if(Boundaries[b].type_=="interprocess")
{
yourID = static_cast<int> (Boundaries[b].value_);
MPI_Recv(buffer, Boundaries[b].N_, MPI_DOUBLE, yourID, tag, MPI_COMM_WORLD, &status);
for(int p=0; p<Boundaries[b].N_; p++)
{
v[Boundaries[b].indices_[p]] += buffer[p];
}
}
}
return;
}
In this function, the first thing we do is loop over all of the boundaries, and if the
boundary is an interprocess boundary we will loop over all of its points and copy
the data from the corresponding entries in the input vector v into the buffer
array that was defined in the read function. We will then send off this buffer to
the process with whom these points are shared. We will then again loop over all
of the boundaries and for the interprocess boundaries we will receive a buffer from
the processes with whom we are sharing the points with and add the data into the
corresponding rows in the input vector v. It is worth mentioning at this point that
when a process sends its data to the corresponding neighbor process, we are doing
so with a buffered send. The reason for this choice comes from the fact that if we
were to use a standard blocking send, then we could enter a ‘deadlock’ situation
where two processes are trying to send to each other at the same time and our
program will ‘hang’. An alternative would of course be to use non-blocking sends,
but in that case we could not use the one buffer array because we would not be
guaranteed that a process would have received the data stored in the buffer, before
we overwrite it with new data to send to another process. So an alternative here
is to use a buffered send where we can reuse the one buffer, but we must allocate
some additional memory for MPI to store this data until it can be exchanged. As
such, we will take a quick step back to our read function and add in the following
lines to the end:
void read(char* filename, double**& Points, int**& Faces, int**& Elements, ...
Boundary*& Boundaries, int& myN_p, int& myN_f, int& myN_e, int& myN_b, ...
bool*& yourPoints, int myID)
{
...
buffer = new double [maxN_sp];
bufferSize = (maxN_sp*sizeof(double)+MPI_BSEND_OVERHEAD)*myMaxN_sb;
MPI_Buffer_attach(new char[bufferSize] , bufferSize);
return;
}
Here we are allocating the buffer as before, but are now allocating a second array
to be used as the storage space. From examination of the bufferSize variable,
the term in the brackets defines enough bytes of memory to hold maxN_sp dou-
ble precision floating point numbers, plus the amount of memory required by the
buffered send routine (defined by MPI_BSEND_OVERHEAD). This number of bytes is
then multiplied by the number of interprocess boundaries, in order to make sure
that we have enough space to potentially store all of the interprocess data until it
has been exchanged.
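Although it is not shown here, a buffer attached with MPI_Buffer_attach would normally also be detached again before the program finishes (for example just before MPI_Finalize); a minimal sketch of that cleanup might be:
char* bsendBuffer;
int bsendBufferSize;
MPI_Buffer_detach(&bsendBuffer, &bsendBufferSize);   // blocks until all buffered sends have completed
delete [] bsendBuffer;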
Returning now to how we make sure that each process is solving the correct
system of equations, if in our code we have the two lines:
assemble(M, K, s, phi, Free, Fixed, Points, Faces, Elements, Boundaries, ...
myN_p, myN_f, myN_e, myN_b, myID);
exchange(s, Boundaries, myN_b);
then upon completion of the exchange function all processes will have the correct
entries in their load vectors corresponding to their shared points, because the func-
tion adds contributions to its own array from other processes. Now, because M
and K are also missing entries, we should do something similar, but in fact we will
leave the exchanging of this data for our solve function. As such, the core of our
algorithm will look like:
assemble(M, K, s, phi, Free, Fixed, Points, Faces, Elements, Boundaries, ...
myN_p, myN_f, myN_e, myN_b, myID);
A = M;
A.subtract(Delta_t, K);
A.multiply(AphiFixed, phi, Free, Fixed);
exchange(AphiFixed, Boundaries, myN_b);
exchange(s, Boundaries, myN_b);
...
for(int l=0; l<N_t; l++)
{
M.multiply(b, phi);
exchange(b, Boundaries, myN_b);
for(int m=0; m<myN_p; m++)
{
b[m] += Delta_t*s[m] - AphiFixed[m];
}
solve(A, phi, b, Free, Fixed, Boundaries, yourPoints, myN_b, myID);
write(file, phi, myN_p);
}
So, it can be observed that at this stage most of the code is the same as the serial
version of the algorithm with the exception of the exchanging of the vectors s and
b. We can now look at how we solve the global linear system of equations at each
time step with the conjugate gradient method. Let’s just dive right in and look at
the code:
void solve(SparseMatrix& A, double* phi, double* b, bool* Free, bool* Fixed, ...
Boundary* Boundaries, bool* yourPoints, int myN_b, int myID)
{
...
A.multiply(Aphi, phi, Free, Free);
exchange(Aphi, Boundaries, myN_b);
for(m=0; m<N_row; m++)
{
if(Free[m])
{
r_old[m] = b[m] - Aphi[m];
d[m] = r_old[m];
}
}
r_oldTr_old = innerProduct(r_old, r_old, Free, yourPoints, N_row);
r_norm = sqrt(r_oldTr_old);
while(r_norm>tolerance && k<N_k)
{
A.multiply(Ad, d, Free, Free);
exchange(Ad, Boundaries, myN_b);
dTAd = innerProduct(d, Ad, Free, yourPoints, N_row);
alpha = r_oldTr_old/dTAd;
for(m=0; m<N_row; m++)
{
if(Free[m])
{
phi[m] += alpha*d[m];
}
}
for(m=0; m<N_row; m++)
{
if(Free[m])
{
r[m] = r_old[m] - alpha*Ad[m];
}
}
rTr = innerProduct(r, r, Free, yourPoints, N_row);
beta = rTr/r_oldTr_old;
for(m=0; m<N_row; m++)
{
if(Free[m])
{
d[m] = r[m] + beta*d[m];
}
}
for(m=0; m<N_row; m++)
{
if(Free[m])
{
r_old[m] = r[m];
}
}
r_oldTr_old = rTr;
r_norm = sqrt(rTr);
k++;
}
return;
}
As can be observed we are using the exchange function a few times throughout the
duration of the algorithm. This in fact ties in quite nicely because the conjugate
gradient method involves the computation of the column vectors Aφ and Ad and we
know at this stage that A will not be quite correct because we haven’t exchanged the
data between processes. Rather than trying to exchange the entries in the matrix
itself, what we can do is exchange the vectors Aφ and Ad, which means less data to
exchange, and much simpler code. So this is in fact what we do. The only missing
piece of the puzzle now is the computation of the residuals and the search directions.
For this, we are going to define one more function called innerProduct, which will
be responsible for computing the correct global inner products that are used to
define the residual norm, α, and β. Before looking at the code for this, let’s just
take a moment to get a clear picture in our mind as to what’s going on. Because
each process has a portion of the grid (forgetting about the shared interprocess
points for a moment), then it will only have a portion of the global residual vector
(just as with the solution of the Poisson equation in Example 21.3). But we need
each process to have the correct global residual (or, more importantly, its two-norm)
in order to compute the same α and β. If we didn't, then we would find that
each process would use its own residual to compute the norm and then compute
different search directions and so the algorithm would be a complete mess! With
this in mind, the code for our innerProduct function takes as arguments two input
arrays (representing column vectors) and will look like:
double innerProduct(double* v1, double* v2, bool* Free, bool* myPoints, int N_row)
{
double myv1Tv2 = 0.0;
double v1Tv2 = 0.0;
for(int m=0; m<N_row; m++)
{
if(Free[m] && myPoints[m])
myv1Tv2 += v1[m]*v2[m]; // only points owned by this process contribute
}
MPI_Allreduce(&myv1Tv2, &v1Tv2, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
return v1Tv2;
}
So it can be observed that the for loop incrementing the inner product is fairly
straightforward, but we are using the myPoints array in order to decide whether or
not to include the values in the increment. If we didn’t do this, then the shared
points would be counted more than once in the computation of the residual vector
(and all of the other vectors). So this is why it was important to determine the
ownership of a shared point right back when reading in the grid on each process.
Finally, after each process computes its portion of the inner product we use an
MPI_Allreduce so that each process will have the correct global residual; each process
will then compute the same update for φ, the same new search directions, and so
on.
The complete program is given in Example21_4.cpp and Figures 21.11(a) -
21.11(d) illustrate the solution at a number of different time steps, with the por-
tions of the grid owned by each process shifted to highlight the distributed grid.
Figure 21.11: The solutions to the PDE in Example 21.4 illustrating the solution at
(a) t = 0, (b) t = 0.5, (c) t = 1.0 and (d) t = 1.5.
Now, just as one expects to have libraries to perform certain calculations such
as the cos or exp functions in the standard math library, there are libraries available
which can be used to solve large scale linear systems in parallel and thereby remove
a large portion of this ‘hassle’. One such example is the Portable, Extensible Toolkit
for Scientific Computation (PETSc) [41] which is designed for the parallel solution
of PDEs and abstracts (or hides) much of the details associated with distributing a
matrix and communicating data throughout a simulation. With a spectral method
on the other hand, the global nature means that to compute the discrete Fourier
transform at a point we need information throughout the entire domain, meaning
that we no longer have nearest neighbor communication and the concept of
ghost points no longer applies. It is therefore a bit more of a challenge to efficiently
compute a discrete Fourier transform in parallel when so much information needs to
be exchanged between processes. One example of a library which does this however
is the FFTW library [17].
As a final concluding remark it is worth emphasizing that as with OpenMP we
have only touched very briefly on the use of MPI and its application to solving PDEs.
There is much more functionality and many more performance issues that we have
not considered here, but can be found in either the full API specification [31] or [66].
Chapter 22
OpenCL
22.1 Concepts
The third API that we will investigate is the Open Computing Language API, which
is a framework for writing programs that execute across heterogeneous platforms
consisting of central processing units (CPUs), graphics processing units (GPUs),
digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other
processors or hardware accelerators. The goal is that programmers can develop
efficient, yet portable code that can use the OpenCL framework to detect devices and
compile portions of the code at run-time to execute on them. In order to provide
429
this infrastructure the setup of an OpenCL program can be more involved compared
to OpenMP and MPI, and will require an understanding of a few different conceptual
models that we will look at.
The platform version indicates the version of the OpenCL runtime supported.
This includes all of the APIs that the host can use to interact with the OpenCL
runtime, such as contexts, memory objects, devices, and command queues.
The device version is an indication of the device's capabilities, separate from the
runtime and compiler, as represented by the device info returned by clGetDeviceInfo.
Examples of attributes associated with the device version are resource limits and
extended functionality. The version returned corresponds to the highest version of
the OpenCL specification for which the device is conformant, but is not higher than
the platform version.
The language version for a device represents the OpenCL programming language
features a developer can assume are supported on a given device. The version
reported is the highest version of the language supported. OpenCL C is designed to
be backwards compatible, so a device is not required to support more than a single
language version to be considered conformant. If multiple language versions are
supported, the compiler defaults to using the highest language version supported
for the device. The language version is not higher than the platform version, but
may exceed the device version.
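As a concrete illustration, these three versions could be queried with the clGetPlatformInfo and clGetDeviceInfo functions as in the following sketch, assuming platformID and deviceID have already been obtained (as will be done in Example 22.1):
char versionString[256];
clGetPlatformInfo(platformID, CL_PLATFORM_VERSION, sizeof(versionString), versionString, NULL);
cout << "Platform version: " << versionString << endl;
clGetDeviceInfo(deviceID, CL_DEVICE_VERSION, sizeof(versionString), versionString, NULL);
cout << "Device version: " << versionString << endl;
clGetDeviceInfo(deviceID, CL_DEVICE_OPENCL_C_VERSION, sizeof(versionString), versionString, NULL);
cout << "Language version: " << versionString << endl;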
Figure: An example 2D NDRange of size G = [12, 12] divided into work-groups of
size W = [4, 4]; the labelled work-group has ID w = [1, 1].
Within the NDRange, the global ID [g_x, g_y] of a work-item is related to its work-group
ID [w_x, w_y], its local ID [l_x, l_y] within the work-group, the work-group size [W_x, W_y],
and the global offset [O_x, O_y] by:
\begin{bmatrix} g_x \\ g_y \end{bmatrix} =
\begin{bmatrix} w_x \times W_x + l_x + O_x \\ w_y \times W_y + l_y + O_y \end{bmatrix}   (22.1)
where it should be fairly obvious that the extension to 3D or the reduction to 1D
is trivial. The number of work-groups [M_x, M_y] can be computed as:
\begin{bmatrix} M_x \\ M_y \end{bmatrix} =
\begin{bmatrix} G_x / W_x \\ G_y / W_y \end{bmatrix}   (22.2)
and given a global ID and the work-group size, the work-group ID for a work-item
is computed as:
\begin{bmatrix} w_x \\ w_y \end{bmatrix} =
\begin{bmatrix} (g_x - l_x - O_x) / W_x \\ (g_y - l_y - O_y) / W_y \end{bmatrix}   (22.3)
A wide range of programming models can be mapped onto this execution model.
For example, one could imagine how algorithms using Finite Difference Methods in
1 − 3D could be executed on a device, where a work-item would be responsible for
some computation at a single grid point and the [i, j, k] grid index is mapped to a
[g_x, g_y, g_z] global index.
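As a small, hypothetical illustration of these relations, a kernel can recover all of the quantities appearing in equations (22.1) to (22.3) directly through the built-in work-item functions:
__kernel void whoAmI(__global int* groupIDs)
{
    const int gx = get_global_id(0);          // g_x
    const int lx = get_local_id(0);           // l_x
    const int wx = get_group_id(0);           // w_x
    const int Wx = get_local_size(0);         // W_x
    const int Ox = get_global_offset(0);      // O_x
    // By (22.1), gx == wx*Wx + lx + Ox, so in one dimension (22.3) is simply
    // wx == (gx - lx - Ox)/Wx.
    groupIDs[gx] = wx;
}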
From a practical point of view, every OpenCL program will need a number of
data structures (or ‘objects’ that are instantiations of these types):
• Devices: The collection of OpenCL devices to be used by the host, where each
device is represented by a cl_device_id object.
• Kernels: The OpenCL functions that run on OpenCL devices, where a kernel is
represented by a cl_kernel object.
• Program Objects: The program source and executable that implement the
kernels, where a program is represented by a cl_program object.
• Memory Objects: A set of memory objects visible to the host and the OpenCL
devices. Memory objects contain values that can be operated on by instances
of a kernel and are represented by cl_mem objects.
• Command Queues: The host creates this data structure to coordinate ex-
ecution of the kernels on the devices. The host places commands into the
command queue, which are then scheduled onto the devices within the con-
text. These objects are represented by a cl_command_queue.
• Contexts: Defined for a device and used in order to create programs,
command queues, and memory objects.
Depending on the particular application, there may be many of each of these objects,
but at a bare minimum, to have any OpenCL program that executes a kernel on a
device and returns the result of the kernel to host memory, there must be one of
each of these objects.
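To give a feel for how these objects relate to one another, the typical sequence of calls that creates one of each might look something like the following sketch (error checking is omitted and the variable names, kernel name, and buffer size are purely illustrative):
cl_platform_id platformID;
cl_device_id deviceID;
cl_int errorID;
const char* kernelSource = "/* kernel source code here */";
clGetPlatformIDs(1, &platformID, NULL);
clGetDeviceIDs(platformID, CL_DEVICE_TYPE_ALL, 1, &deviceID, NULL);
cl_context context = clCreateContext(NULL, 1, &deviceID, NULL, NULL, &errorID);
cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &errorID);
cl_program program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, &errorID);
clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "myKernel", &errorID);
cl_mem data_d = clCreateBuffer(context, CL_MEM_READ_WRITE, 1024*sizeof(double), NULL, &errorID);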
• Global Memory: This memory region permits read/write access to all work-
items in all work-groups. Work-items can read from or write to any element
of a memory object. Reads and writes to global memory may be cached
depending on the capabilities of the device.
• Constant Memory: A region of global memory that remains constant during
the execution of a kernel. The host allocates and initializes memory objects
placed into constant memory.
• Local Memory: A memory region local to a work-group. This memory region
can be used to allocate variables that are shared by all work-items in that
work-group. It may be implemented as dedicated regions of memory on the
OpenCL device. Alternatively, the local memory region may be mapped onto
sections of the global memory.
Figure: The OpenCL memory model, illustrating the private memory of the work-items
within each work-group, the global/constant memory of the compute device, and the
host memory of the host.
Deciding how to use these memory regions is one of the most important considera-
tions for designing efficient code, since these regions have different sizes and different
access speeds.
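To connect these regions to kernel code, each one has a corresponding address space qualifier that appears in a kernel's argument list; a small hypothetical sketch:
__kernel void scale(__global double* data,       // global memory: visible to all work-items
                    __constant double* coeffs,   // constant memory: read-only during execution
                    __local double* scratch)     // local memory: shared within one work-group
{
    const int i = get_global_id(0);              // i and l themselves live in private memory
    const int l = get_local_id(0);
    scratch[l] = coeffs[0]*data[i];
    barrier(CLK_LOCAL_MEM_FENCE);                // synchronize the work-group before scratch is reused
    data[i] = scratch[l];
}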
• cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index, size_t arg_size,
const void* arg_value);
Used to set the argument value for a specific argument of a kernel.
In contrast to OpenMP and similar to MPI, the compiler does not need to have
any support for OpenCL, rather it is just a process of linking the program to the
appropriate library. It must be the case however that the vendor of a particular
device (e.g. a GPU or a Xeon Phi) provides an OpenCL implementation. In order to
use the OpenCL functionality, a C++ program must include the header file cl.h (or
opencl.h on Apple systems) and will most likely need to link to the runtime library
with a compiler flag such as -lOpenCL (or -framework OpenCL on Apple systems).
Once the program has been compiled it can be executed in the same manner as
any other program. Generally, the size of the NDRange index space will be defined
by the problem (e.g. Finite Difference grid size, or size of a Matrix, etc), and the
maximum work-group size is a device specific parameter that may be queried using
the clGetDeviceInfo function. As such, the work-group size and the number of
work-groups is something that will most often be computed within the program,
which is conceptually a bit different to OpenMP and MPI.
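For a 1D problem of size N_x, that computation might look something like the following sketch (the strategy of rounding the global size up to a multiple of the work-group size is just one possible choice):
size_t maxWorkGroupSize;
clGetDeviceInfo(deviceID, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(size_t), &maxWorkGroupSize, NULL);
size_t localSize = maxWorkGroupSize;
size_t globalSize = ((N_x + localSize - 1)/localSize)*localSize;  // round N_x up to a multiple of localSize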
Example 22.1:
In this example we will develop a ‘Hello World’ example program with OpenCL to
illustrate the identification of platform and device IDs and use some of the runtime
library routines to obtain some useful information about them. The intended learn-
ing outcomes for this example will be to ‘get a feel’ for the structure of a program
that includes OpenCL code, so that in later examples where we are focussing on the
development of kernels, we will not have to elaborate on how platform and device
IDs are defined.
In order to begin, let’s just dive right in and look at the complete program:
int main(int argc, char** argv)
{
cl_uint N_Platforms;
cl_uint N_Devices;
cl_char dummy1[10240];
cl_uint dummy2;
size_t dummy3;
cl_ulong dummy4;
cl_device_fp_config dummy5;
clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
sizeof(dummy1), &dummy1, NULL);
cout << "\tName: " << dummy1 << endl;
clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS,
sizeof(cl_uint), &dummy2, NULL);
cout << "\tMax compute units: " << dummy2 << endl;
clGetDeviceInfo(devices[d], CL_DEVICE_MAX_WORK_GROUP_SIZE,
sizeof(size_t), &dummy3, NULL);
cout << "\tMax work group size: " << dummy3 << endl;
clGetDeviceInfo(devices[d], CL_DEVICE_MAX_MEM_ALLOC_SIZE,
sizeof(cl_ulong), &dummy4, NULL);
cout << "\tMax mem alloc size: " << dummy4 << endl;
clGetDeviceInfo(devices[d], CL_DEVICE_DOUBLE_FP_CONFIG,
sizeof(cl_device_fp_config), &dummy5, NULL);
cout << "\tDouble precision capability: " << dummy5 << endl;
}
}
return 0;
}
As the program begins we will use the clGetPlatformIDs function in order to query
the system for available platforms. In the first call, we pass in a NULL pointer for
the second argument and the address of the unsigned integer variable N_Platforms
as the third argument, which will be assigned the number of platforms found. We
can then create a static array of cl_platform_id objects and call the function a second
time, but this time passing in N_Platforms and the array as arguments, causing the
function to populate the cl_platform_id objects. We then loop over the platforms
found and for each one will print a message to the terminal saying “Hello world”
with the platform ID and call the function clGetPlatformInfo a few times, in order
to obtain some useful information that we will print to the screen.
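A sketch of that two call pattern is given below (the fixed array size of 10 is an arbitrary choice for illustration):
cl_platform_id platforms[10];
clGetPlatformIDs(0, NULL, &N_Platforms);          // first call: how many platforms are available?
clGetPlatformIDs(N_Platforms, platforms, NULL);   // second call: populate the platform IDs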
When we call clGetPlatformInfo, the first argument is the platform ID, the
second is an enumeration constant that defines what bit of platform information we
are after. The fourth argument is the address of the variable that we want to be
assigned with this information, and the third argument is the number of bytes that
this variable is represented by. For the platform information, everything that we
are interested in printing out is a string, so we can create a static char array called
dummy1 and pass this in. We can then do this three times to print out the platform
name, vendor, and OpenCL version supported.
Using a similar approach as we did to obtain the platform IDs we will use the
clGetDeviceIDs function in order to query the system for available devices. We call
the function once to get the number of devices, create a static array of cl_device_id
objects and then call the function again to populate the cl_device_id objects. We then
loop over the devices found and for each one will print a message to the terminal
saying “Hello world” with the device ID and call the function clGetDeviceInfo
a few times, in order to obtain some useful information that we will print to the
screen.
When we call clGetDeviceInfo the format is similar to when we were calling
clGetPlatformInfo, except that this time not all of the information is a string. For
this reason, we have a few dummy variables of different types (e.g. cl_uint, size_t)
and it is important that we use the right variable for the piece of information that
we are after (e.g. the maximum memory allocation size is represented by a cl_ulong,
an unsigned long int).
In contrast to other examples, the output of this program will depend on the
system on which it is run, but an example output would be something like:
Hello world from Platform 0 of 1
Name: Intel(R) OpenCL
Vendor: Intel(R) Corporation
Version: OpenCL 1.2 LINUX
Hello world from Device 0 of 2
Name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Max compute units: 1
Max work group size: 8192
Max mem alloc size: 67684369408
Double precision floating point configuration: 63
Hello world from Device 1 of 2
Name: Intel(R) Many Integrated Core Acceleration Card
Max compute units: 236
Max work group size: 8192
Max mem alloc size: 2017882112
Double precision floating point configuration: 63
It is important to note that these bits of device information will be used in order
to guide programs depending on where they are run. For example, we may
use CL_DEVICE_MAX_WORK_GROUP_SIZE to define what our work-group size will be
when we execute a kernel, or, if our code is using doubles, have the program abort
if CL_DEVICE_DOUBLE_FP_CONFIG is zero (implying that the device cannot support
double precision operations).
Example 22.2:
In this example we will develop a C++ program to solve the 1D first order wave
equation:
\frac{\partial \phi}{\partial t} + v \frac{\partial \phi}{\partial x} = 0   (22.4)
in the domain x ∈ [0, 10], t ∈ [0, 10], with boundary condition φ(0, t) = 1, initial
condition φ(x, 0) = e^{-5(x-3)^2} + 1, and v = 1.0. For the spatial discretization we will
use the Finite Difference method with second order central differences for the first
derivative and for the temporal discretization we will use the fourth order Runge-
Kutta method. Furthermore, our program will be parallelized with OpenCL. The
intended learning outcomes for this example will be to observe how we create an
OpenCL context, command queue, write and build a kernel, and allocate and copy
memory between the host and device.
In order to begin we will take the C++ program that was developed in Example
13.1 and add in the OpenCL code. The issues to consider here are that firstly, in order
to develop an OpenCL program, we will need to obtain a platform and device ID,
create a context from the device ID, create command queue from the context, build
a program from the context and our kernel source code, then create the kernels.
So, all of this is a fair amount of code, just to get started. The second issue is
to determine what parts of the overall code are parallelizable and can be put into
kernels. Then finally (and quite importantly), how do we avoid (or at least minimize)
the bottlenecks of moving data back and forth between host and device memory.
In order to avoid making our top level code too cluttered, we will create a function
called initialize() that will be responsible for identifying the platform and device,
creating the context, command queue, building the program and creating the kernel.
As you could imagine, although every OpenCL program needs these things in order
to function, there are many ways in which we could design the code to do this. So just
bear in mind that as a program becomes more complex, or requires more flexibility,
etc, one single function for initialization might not always be the optimal solution.
Nevertheless, in this case our function will look something like:
void initialize(cl_platform_id& platformID, cl_device_id& deviceID,
cl_context& context, cl_command_queue& commandQueue, cl_program& program,
cl_kernel& computeRHS, cl_kernel& computeTempPhi, cl_kernel& updatePhi, int D)
{
cl_uint N_Platforms;
cl_uint N_Devices;
const cl_uint N_Kernels = 3;
cl_int errorID;
const char* kernelSource[N_Kernels];
platformID = platforms[0];
kernelSource[0] = read("computeRHS.cl");
kernelSource[1] = read("computeTempPhi.cl");
kernelSource[2] = read("updatePhi.cl");
return;
}
As can be observed, the code for obtaining the platform and device IDs is similar
to Example 22.1, except that we are choosing the first platform (assuming that there
is one) and choosing the device based on an integer argument D to the function which
will select either devices[D] or, if that number is invalid, the last device. Following
this we create a context by calling the clCreateContext function, passing in our
chosen device ID, then we create a command queue with the clCreateCommandQueue
function, passing in the context and device ID. The next step involves creating a
cl_program object based on the source code for our kernels. Now, although we
haven’t actually discussed what kernels we are going to create, having seen Example
20.2 where we used multiple threads to evaluate the right hand side, compute the
variables in tempPhi and also to update phi at each time step, we will try the same
thing here, but create kernels to do so. More on this later however. For now we are
looking at how we create a program; to use the clCreateProgramWithSource
function, we need to pass in the number of kernels (three in our case) and a const
char** array defining the kernel code. So given that this function wants the code
input as an array of strings (similar to the way input arguments to a program are
defined in argv), there are actually a few options for doing so. One would be to
actually type the kernel code into an array:
const char* kernelSource[3] = {"type kernel 1 code", "kernel 2 here", "etc"};
but this gets a bit messy. A much better way is to code up a kernel in a separate
file (perhaps with a .cl extension) and create a function to read in the contents of
that ASCII text file and put it into a char array, which is exactly what we will do:
const char* read(const char* name)
{
ifstream file(name);
string sourceCode((istreambuf_iterator<char>(file)), istreambuf_iterator<char>());
size_t length = sourceCode.length();
char* buffer = new char [length+1];
sourceCode.copy(buffer, length);
buffer[length] = '\0';
return buffer;
}
Although we won’t dwell on this function too much (but rather just use it in our
future programs), all it is doing is opening the kernel source file, reading the contents
into a string, creating a new char array, copying the contents and appending the
null character to the end, then returning this array.
Once we have put this code into our kernelSource array we can then create a
cl_program object, passing in the context object, the kernelSource array and the
number of kernels. Then we can call the clBuildProgram function in order to
compile and link the kernel source code, and finally we can create our three cl_kernel
objects by calling the clCreateKernel function, passing in the program object and
the kernel name (which will need to match the name in the .cl file).
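Putting those steps together, the corresponding calls inside initialize might look something like the following sketch (error checking is omitted for brevity):
program = clCreateProgramWithSource(context, N_Kernels, kernelSource, NULL, &errorID);
errorID = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);
computeRHS     = clCreateKernel(program, "computeRHS", &errorID);
computeTempPhi = clCreateKernel(program, "computeTempPhi", &errorID);
updatePhi      = clCreateKernel(program, "updatePhi", &errorID);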
At this point we can now start to look at what our kernels might look like.
As was mentioned previously, in other examples where we have solved the 1D wave
equation, we have used for loops to visit every entry in our arrays and perform
some sort of operation (e.g. evaluating the right hand side or computing an entry
in a k array). To parallelize these sorts of algorithms we either divided up the for
loops across multiple threads with OpenMP, or spread the grid (which is essentially
the same thing as dividing up the for loops) across multiple processes with MPI.
One common approach for parallelization with OpenCL (but also with other similar
APIs such as NVIDIA CUDA) is to think of each processing element (work-item)
being responsible for a grid point. In this case we ‘do away’ with the for loops
and instead launch the kernel multiple times (one for each grid point) and use the
processing element’s ID to define which grid point to process. As such, our kernel
for evaluating the right hand side will look like:
__kernel void computeRHS(__global double* k, __global double* phi,
const double Delta_x, const double v, const int N_x)
{
const int i = get_global_id(0);
if(i<N_x)
{
...
}
return;
}
Similarly, our kernel for computing the intermediate values of φ passed into each
stage of the Runge-Kutta method will look something like:
__kernel void computeTempPhi(__global double* tempPhi, __global double* phi,
__global double* k, const double coeff, const int N_x)
{
const int i = get_global_id(0);
if(i<N_x)
{
tempPhi[i] = phi[i] + coeff*k[i];
}
return;
}
and our kernel for updating phi at each time step will look like:
__kernel void updatePhi(__global const double* k1, __global const double* k2,
__global const double* k3, __global const double* k4,
__global double* phi, const double Delta_t, const int N_x)
{
int i = get_global_id(0);
if(i<N_x)
{
phi[i] += Delta_t*(k1[i]/6 + k2[i]/3 + k3[i]/3 + k4[i]/6);
}
return;
}
The next thing is allocating memory on the device to store these arrays, then
copying the data across and setting kernel arguments. After that we will be pretty
much ‘good to go’ with our program. On the host, we have seen how to allocate
memory for our arrays using an approach like:
double* phi = new double [N_x];
To allocate a corresponding array on the device, we use the clCreateBuffer func-
tion as:
cl_mem phi_d = clCreateBuffer(context, CL_MEM_READ_WRITE, N_x*sizeof(double), NULL, &errorID);
Here the important things are that the array is associated with the context and in
this case the array is defined with CL_MEM_READ_WRITE meaning that its contents can
be read and written to (which is fairly obvious). Other options however, include
CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY, and CL_MEM_COPY_HOST_PTR if we want to
initialize the contents of a device array with a host array, in which case we pass in
a pointer to the host array as the fourth argument (instead of NULL as was done
above). Just as a quick note on terminology; sometimes for arrays representing
equivalent data on the host and device, people append _h and _d (or something along
those lines) to the arrays to make it obvious where the array is allocated. It is im-
portant to note that although it is in our host code that we use the clCreateBuffer
function, the actual memory is not accessible with a statement like phi_d[i] like it
would be for the array phi.
If we don’t initialize from a host array however we can use the clEnqueueWriteBuffer
function to copy the data across as:
clEnqueueWriteBuffer(commandQueue, phi_d, CL_TRUE, 0, N_x*sizeof(double), phi, 0, NULL, NULL);
which will copy the contents of phi to phi_d and if we want to do the reverse:
clEnqueueReadBuffer(commandQueue, phi_d, CL_TRUE, 0, N_x*sizeof(double), phi, 0, NULL, NULL);
to copy the contents of phi_d to phi. The only thing left is the setting of kernel
arguments and then the calling of kernels. To address the former, let's look at
how we call the function to evaluate the right hand side. As can be seen above, it
takes five arguments: __global double* k, __global double* phi, const double
Delta_x, const double v, and const int N_x. To set these arguments we will use
the clSetKernelArg function as:
clSetKernelArg(computeRHS, 0, sizeof(cl_mem), &k1_d);
clSetKernelArg(computeRHS, 1, sizeof(cl_mem), &phi_d);
clSetKernelArg(computeRHS, 2, sizeof(double), &Delta_x);
clSetKernelArg(computeRHS, 3, sizeof(double), &v);
clSetKernelArg(computeRHS, 4, sizeof(int), &N_x);
where it can be observed that the arguments to this function are the cl_kernel
object to which the arguments apply (computeRHS in this case), the position in the
argument list, the size of the argument and the argument itself. For the arrays it
can be observed that we are passing in the device arrays, but for the grid spacing,
velocity, etc, we are passing in the address of variables residing in host memory. A
couple of points to note here are that the RK4 method requires that we compute the
right hand side four times per timestep, each time passing in a different k array
to store the evaluation in. As can be seen above, we set kernel arguments of k1_d
and phi_d (which is for the first stage), but for later stages we would need to pass
in, say, k2_d and tempPhi_d for the second stage. The point is that we will need to
reset the kernel arguments before actually calling the kernel. For arguments which
don't change (such as N_x, v, etc) we don't need to reset those specific arguments.
Finally, in order to call a kernel, we can use the clEnqueueNDRangeKernel func-
tion as:
clEnqueueNDRangeKernel(commandQueue, computeRHS, 1, NULL, &G, NULL, 0, NULL, NULL);
Here, the important arguments are the command queue, the kernel, the dimension-
ality (i.e. 1 in this case) and the global work size G. For this simple example, if we
leave the sixth argument (which specifies the work-group size) as NULL, then the
OpenCL implementation will determine how to break the global work-items into
appropriate work-group instances.
Noting that the same approach is used for setting arguments, and executing
kernels we can omit the entire time marching loop (as it is now fairly lengthy) and
perhaps just look at the computation of the second stage of the RK4 method and
the update of φ. As such, the time marching loop would look like:
for(l=0; l<N_t-1; l++)
{
...
// k2
errorID = clSetKernelArg(computeTempPhi, 2, sizeof(cl_mem), &k1_d);
errorID |= clSetKernelArg(computeTempPhi, 3, sizeof(double), &Delta_tOn2);
errorID |= clEnqueueNDRangeKernel(commandQueue, computeTempPhi, 1, NULL, &G,
NULL, 0, NULL, NULL);
...
// Update phi
errorID = clEnqueueNDRangeKernel(commandQueue, updatePhi, 1, NULL, &G,
NULL, 0, NULL, NULL);
if(errorID!=CL_SUCCESS)
{
cerr << "Error updating phi" << endl;
exit(1);
}
errorID = clFinish(commandQueue);
Here we can see the execution of all three kernels, while only setting the arguments
that would change between executions (e.g. the k array before calling computeTempPhi
or computeRHS); the others can be set once outside of the time marching loop. Since
the arguments to updatePhi don't change at all (only the contents of the arrays
change from timestep to timestep), we can just call that kernel directly.
One last point worth mentioning is that in the code snippet above, the OpenCL
functions are returning a cl_int named errorID. This variable can be used to test
for the successful execution of a function call by checking whether or not the value of
the variable is equal to CL_SUCCESS, which is an enumerated integer equal to zero. All
of the other values that could be returned are negative integers that correspond to
specific problems (e.g. CL_DEVICE_NOT_FOUND=-1, CL_BUILD_PROGRAM_FAILURE=-11,
CL_INVALID_ARG_VALUE=-50 to name just a few of the 63 error codes). One 'good'
option would be to check every single function call and write out a specific error
message so that if things go wrong in the code, we can pinpoint the problem quickly
and easily. In this example however, for brevity we have applied a bitwise inclusive
or so that any non-zero error codes returned from a block of function calls will result
in errorID being a non-zero value.
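If one did want to go down the more verbose route, a small hypothetical helper function along the following lines could be used to wrap each call:
void checkError(cl_int errorID, const char* message)
{
    if(errorID!=CL_SUCCESS)
    {
        cerr << message << " (OpenCL error code " << errorID << ")" << endl;
        exit(1);
    }
}
which could then be called, for example, as checkError(clFinish(commandQueue), "Error finishing command queue");.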
The complete program is given in Example22_2.cpp with a Matlab script for
viewing the output of the program given in Example22_2Postprocessing.m. Fig-
ures 22.4(a) and 22.4(b) illustrate the solution at two different moments in time for
the case where ∆x=0.05 and ∆t=0.02.
Figure 22.4: The solutions to the PDE in Example 22.2 illustrating the solution at
(a) l = 0 and (b) l = 200 for the combination ∆x = 0.05 and ∆t = 0.02.
Example 22.3:
Part V
Applications
Bibliography
[66] W. Gropp, E. Lusk, and A. Skjellum. Using MPI - Portable Parallel Pro-
gramming with the Message-Passing Interface. The MIT Press, Cambridge,
Massachusetts, 1999.
[72] R. Peyret. Spectral methods for incompressible viscous flow. Springer, New
York, 2002.
[76] O. C. Zienkiewicz and R. L. Taylor. The finite element method for solid and
structural mechanics. Elsevier Butterworth-Heinemann, Oxford ; Burlington,
MA, 6th edition, 2005.