William G. Faris
Program in Applied Mathematics
University of Arizona
Fall 1992
Contents
1 Nonlinear equations
1.1 Introduction
1.2 Bisection
1.3 Iteration
1.3.1 First order convergence
1.3.2 Second order convergence
1.4 Some C notations
1.4.1 Introduction
1.4.2 Types
1.4.3 Declarations
1.4.4 Expressions
1.4.5 Statements
1.4.6 Function definitions
2 Linear Systems
2.1 Shears
2.2 Reflections
2.3 Vectors and matrices in C
2.3.1 Pointers in C
2.3.2 Pointer Expressions
3 Eigenvalues
3.1 Introduction
3.2 Similarity
3.3 Orthogonal similarity
3.3.1 Symmetric matrices
3.3.2 Singular values
3.3.3 The Schur decomposition
3.4 Vector and matrix norms
3.4.1 Vector norms
4 Nonlinear systems
4.1 Introduction
4.2 Degree
4.2.1 Brouwer fixed point theorem
4.3 Iteration
4.3.1 First order convergence
4.3.2 Second order convergence
4.4 Power series
4.5 The spectral radius
4.6 Linear algebra review
4.7 Error analysis
4.7.1 Approximation error and roundoff error
4.7.2 Amplification of absolute error
4.7.3 Amplification of relative error
4.8 Numerical differentiation
5 Ordinary Differential Equations
5.5 Systems
5.5.1 Introduction
5.5.2 Linear constant coefficient equations
5.5.3 Stiff systems
5.5.4 Autonomous Systems
5.5.5 Limit cycles
6 Fourier transforms
6.1 Groups
6.2 Integers mod N
6.3 The circle
6.4 The integers
6.5 The reals
6.6 Translation Invariant Operators
6.7 Subgroups
6.8 The sampling theorem
6.9 FFT
Chapter 1
Nonlinear equations
1.1 Introduction
This chapter deals with solving equations of the form f (x) = 0, where f is
a continuous function.
The usual way in which we apply the notion of continuity is through
sequences. If g is a continuous function, and cn is a sequence such that
cn → c as n → ∞, then g(cn ) → g(c) as n → ∞.
Here is some terminology that we shall use. A number x is said to be
positive if x ≥ 0. It is strictly positive if x > 0. (Thus we avoid the mind-
numbing term “non-negative.”) A sequence an is said to be increasing if
an ≤ an+1 for all n. It is said to be strictly increasing if an < an+1 for
all n. There is a similar definition for what it means for a function to be
increasing or strictly increasing. (This avoids the clumsy locution “non-
decreasing.”)
Assume that a sequence an is increasing and bounded above by some
c < ∞, so that an ≤ c for all n. Then it is always true that there is an
a ≤ c such that an → a as n → ∞.
1.2 Bisection
The bisection method is a simple and useful way of solving equations. It is
a constructive implementation of the proof of the following theorem. This
result is a form of the intermediate value theorem.
The construction here is the while loop, which is perhaps the most
fundamental technique in programming.
Here is an alternative version in which the iterations are controlled by
a counter n.
/* bisection */
#include <stdio.h>
#include <math.h>

typedef double real;   /* the notes introduce a type "real"; float would also do */

real quadratic(real);
void bisect(int, real, real, real (*)(real));
void fetch(int *, real *, real *);   /* fetch must receive pointers in order to set its arguments */
void display(int, real, real);

int main()
{
int nsteps;
real a, b;
real (*f)(real);
f = quadratic;
fetch(&nsteps, &a, &b);
display(nsteps, a, b);
bisect(nsteps, a, b, f);
return 0;
}
This program begins with declarations of a new type real and of four
functions bisect, quadratic, fetch, and display. The main program
uses these functions to accomplish its purpose; it returns the integer value
0 only to proclaim its satisfaction with its success.
One also needs to define the function quadratic.
This particular function has roots that are square roots of two. We shall
not go into the dismal issues of input and output involved with fetch and
display.
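The listings for quadratic and the counter-controlled bisect are not shown above; the following is a minimal sketch of what they might look like, consistent with the declarations in the main program (the exact bodies are assumptions, not the original code).

real quadratic(real x)
{
    return x * x - 2.0;     /* roots are the square roots of two */
}

void bisect(int nsteps, real a, real b, real (*f)(real))
{
    int n;
    real c;

    n = 0;
    while (n < nsteps)
    {
        c = (a + b) / 2.0;          /* midpoint of the bracket */
        if (f(a) * f(c) <= 0.0)     /* sign change in [a, c] */
            b = c;
        else                        /* otherwise the sign change is in [c, b] */
            a = c;
        n = n + 1;
        display(n, a, b);           /* show the current bracket */
    }
}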
Another interesting question is that of uniqueness. If g is strictly increasing on [a, b], then there is at most one solution of g(x) = 0.
The easiest way to check that g is strictly increasing on [a, b] is to check that g′(x) > 0 on (a, b). Then for a ≤ p < q ≤ b we have by the mean value theorem that g(q) − g(p) = g′(c)(q − p) > 0 for some c with p < c < q. Thus p < q implies g(p) < g(q).
One can use a similar idea to find maxima and minima. Let g be a
continuous function on [a, b]. Then there is always a point r at which g
assumes its maximum.
Assume that g is unimodal, that is, that there exists an r such that g is strictly increasing on [a, r] and strictly decreasing on [r, b]. The computational problem is to locate the point r at which the maximum is assumed.
The trisection method accomplishes this task. Divide [a, b] into three
equal intervals with end points a < p < q < b. If g(p) ≤ g(q), then r must
be in the smaller interval [p, b]. Similarly, if g(p) ≥ g(q), then r must be
in the smaller interval [a, q]. The method is to repeat this process until a
sufficiently small interval is obtained.
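A minimal sketch of the trisection method just described, assuming the real type from the bisection program and a tolerance-controlled loop (the function name is not from the original notes):

real trisect(real a, real b, real (*g)(real), real tol)
{
    real p, q;

    while (b - a > tol)
    {
        p = a + (b - a) / 3.0;
        q = b - (b - a) / 3.0;
        if (g(p) <= g(q))           /* maximum lies in [p, b] */
            a = p;
        else                        /* g(p) >= g(q): maximum lies in [a, q] */
            b = q;
    }
    return (a + b) / 2.0;           /* approximate location of the maximum */
}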
Projects
1. Write a bisection program to find the square roots of two. Find them.
3. Use the program to solve tan x = x with π/2 < x < 3π/2.
Problems
3. Show that tan x = x has a solution with π/2 < x < 3π/2.
7. How many decimal places of accuracy does one gain at each bisection?
1.3 Iteration
1.3.1 First order convergence
Recall the intermediate value theorem: If f is a continuous function on the
interval [a, b] and f (a) ≤ 0 and f (b) ≥ 0, then there is a solution of f (x) = 0
in this interval.
This has an easy consequence: the fixed point theorem. If g is a continuous function on [a, b] and g(a) ≥ a and g(b) ≤ b, then there is a solution of g(x) = x in the interval.
Another approach to numerical root-finding is iteration. Assume that
g is a continuous function. We seek a fixed point r with g(r) = r. We can
attempt to find it by starting with an x0 and forming a sequence of iterates
using xn+1 = g(xn ).
Theorem 1.3.1 Let g be continuous and let x_n be a sequence such that x_{n+1} = g(x_n). If x_n → r as n → ∞, then g(r) = r.
This theorem shows that we need a way of getting sequences to converge.
One such method is to use increasing or decreasing sequences.
Theorem 1.3.2 Let g be a continuous function on [r, b] such that g(x) ≤ x for all x in the interval. Let g(r) = r and assume that g′(x) ≥ 0 for r < x < b. Start with x_0 in the interval. Then the iterates defined by x_{n+1} = g(x_n) converge to a fixed point.
Proof: By the mean value theorem, for each x in the interval there is a c with g(x) − r = g(x) − g(r) = g′(c)(x − r). It follows that r ≤ g(x) ≤ x for r ≤ x ≤ b. In other words, the iterations decrease and are bounded below by r. □
Another approach is to have a bound on the derivative.
Theorem 1.3.3 Assume that g is continuous on [a, b] and that g(a) ≥ a and g(b) ≤ b. Assume also that |g′(x)| ≤ K < 1 for all x in the interval. Let x_0 be in the interval and iterate using x_{n+1} = g(x_n). Then the iterates converge to a fixed point. Furthermore, this fixed point is unique.
Proof: Let r be a fixed point in the interval. By the mean value theorem, for each x there is a c with g(x) − r = g(x) − g(r) = g′(c)(x − r), and so |g(x) − r| = |g′(c)||x − r| ≤ K|x − r|. In other words each iteration replacing x by g(x) brings us closer to r. □
We say that r is a stable fixed point if |g′(r)| < 1. We expect convergence
when the iterations are started near a stable fixed point.
If we want to use this to solve f(x) = 0, we can try to take g(x) = x − kf(x) for some suitable k. If k is chosen so that g′(x) = 1 − kf′(x) is small for x near r, then there should be a good chance of convergence.
It is not difficult to program fixed point iteration. Here is a version that
displays all the iterates.
void iterate(int nsteps, real x, real (*g)(real) )
{
int n;
n = 0 ;
while( n < nsteps)
{
x = g(x) ;
n = n + 1 ;
display(n, x) ;
}
}
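As a usage sketch (not from the original notes), one could call iterate with the fixed point function g(x) = x − k(x² − 2) suggested above; with k near 1/(2√2) the iteration converges quickly to √2. This assumes the declarations of real, display, and iterate given earlier.

real g(real x)
{
    real k = 0.35;                    /* roughly 1/f'(r) = 1/(2 sqrt(2)) */
    return x - k * (x * x - 2.0);     /* g(x) = x - k f(x) with f(x) = x^2 - 2 */
}

int main()
{
    iterate(20, 1.0, g);              /* the displayed iterates should approach sqrt(2) */
    return 0;
}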
2. Implement the damped Newton’s method. Use this to find the largest root of sin x − x² = 0. Describe what happens if you start the iteration with .46.
Problems
1. Let g(x) = (2/3)x + (7/3)(1/x²). Show that for every initial point
x0 above the fixed point the iterations converge to the fixed point.
What happens for initial points x0 > 0 below the fixed point?
3. Prove the fixed point theorem from the intermediate value theorem.
5. Perhaps one would prefer something that one could compute numerically. Find the limit of (x_{n+1} − x_n)/(x_n − x_{n−1}) as n → ∞.
One may also define a procedure by having the side effect of changing the values
of variables.
The working part of a function definition is formed of statements, which
are commands to perform some action, usually changing the values of variables. The calculations are performed by evaluating expressions written
in terms of constants, variables, and functions to obtain values of various
types.
1.4.2 Types
Arithmetic
Now we go to the notation used to write a C program. The basic types
include arithmetic types such as:
char
int
float
double
These represent character, integer, floating point, and double precision
floating point values.
There is also a void type that has no values.
A variable of a certain type associates to each machine state a value of
this type. In a computer implementation a variable is realized by a location
in computer memory large enough to hold a value of the appropriate type.
Example: One might declare n to be an integer variable and x to be a
float variable. In one machine state n might have the value 77 and x might
have the value 3.41.
Function
Another kind of data object is a function. The type of a function depends
on the types of the arguments and on the type of the value. The type of the
value is written first, followed by a list of types of the arguments enclosed
in parentheses.
Example: float (int) is the type of a function of an integer argument
returning float. There might be a function convert of this type defined in
the program.
Example: float ( float (*)(float), float, int) is the type of a
function of three arguments of types float (*)(float), float, and int
returning float. The function iterate defined below is of this type.
A function is a constant object given by a function definition. A function
is realized by the code in computer memory that defines the function.
Pointer to function
There are no variable functions, but there can be variables of type pointer
to function. The values of such a variable indicate which of the function
definitions is to be used.
Example: float (*)(int) is the type of a pointer to a function from
int to float. There could be a variable f of this type. In some machine
state it could point to the function convert.
The computer implementation of pointer to function values is as addresses of memory locations where the functions are stored.
A function is difficult to manipulate directly. Therefore in a C expression
the value of a function is not the actual function, but the pointer associated
with the function. This process is known as pointer conversion.
Example: It is legal to make the assignment f = convert.
1.4.3 Declarations
A declaration is a specification of variables or functions and of their types.
A declaration consists of a value type and a list of declarators and is
terminated by a semicolon. These declarators associate identifiers with the
corresponding types.
Example: float x, y ; declares the variables x and y to be of type
float.
Example: float (*g)(float) ; declares g as a pointer to function
from float to float.
Example: float iterate( float (*)(float), float, int) ; declares a function iterate of three arguments of types float (*)(float), float, and int returning float.
1.4.4 Expressions
Variables
An expression of a certain type associates to each machine state a value of
this type.
Primary expressions are the expressions that have the highest precedence. Constants and variables are primary expressions. An arbitrary
expression can be converted to a primary expression by enclosing it in
parentheses.
Usually the value of the variable is the data contained in the variable.
However the value of a function is the pointer that corresponds to the
function.
Function calls
A function call is an expression formed from a pointer to function expression
and an argument list of expressions. Its value is obtained by finding the
pointer value, evaluating the arguments and copying their values, and using
the function corresponding to the pointer value to calculate the result.
Example: Assume that g is a function pointer that has a pointer to
some function as its value. Then this function uses the value of x to obtain
a value for the function call g(x).
Example: The function iterate is defined with the heading float iterate( float (*g)(float), float x, int n ). A function call iterate(square, z, 3) uses the value of iterate, which is a function pointer, and the values of square, z, and 3, which are function pointer, float, and integer. The values of the arguments square, z, and 3 are copied to the parameters g, x, and n. The computation described in the function definition is carried out, and a float is returned as the value of the function call.
Casts
A data type may be changed by a cast operator. This is indicated by
enclosing the type name in parentheses.
Example: 7 / 2 evaluates to 3 while (float)7 / 2 evaluates to 3.5.
Relational expressions are formed by the inequalities < and <= and > and >=.
Equality expressions are formed by the equality and negated equality operators == and !=.
Logical AND expressions are formed by &&.
Logical OR expressions are formed by ||.
Assignments
Another kind of expression is the assignment expression. This is of the
form variable = expression. It takes the expression on the right, evaluates
it, and assigns the value to the variable on the left (and to the assignment
expression). This changes the machine state.
An assignment is read variable “becomes” expression.
Warning: This should be distinguished from an equality expression of the
form expression == expression. This is read expression “equals” expression.
Example: i = 0
Example: i = i + 1
Example: x = g(x)
Example: h = iterate, where h is a function pointer variable and
iterate is a function constant.
1.4.5 Statements
Expression statements
A statement is a command to perform some action changing the machine
state.
Among the most important are statements formed from expressions
(such as assignment expressions) of the form expression ;
Example: i = 0 ;
Example: i = i + 1 ;
Example: x = g(x) ;
Example: h = iterate ;, where h is a function pointer variable and
iterate is a function constant.
In the compound statement part of a function definition the statement
return expression ; stops execution of the function and returns the value
of the expression.
Control statements
There are several ways of building up new statements from old ones. The
most important are the following.
main()
{
float z, w ;
z = 2.0 ;
w = iterate(square, z, 3) ;
}
Chapter 2
Linear Systems
2.1 Shears
In this section we make a few remarks about the geometric significance of
Gaussian elimination.
We begin with some notation. Let z be a vector and w be another vector. We think of these as column vectors. The inner product of w and z is w^T z and is a scalar. The outer product of z and w is zw^T, and this is a matrix.
Assume that w^T z = 0. A shear is a matrix M of the form I + zw^T. It is easy to check that the inverse of M is another shear given by I − zw^T.
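As a one-line check of this inverse formula, using only the assumption w^T z = 0:

(I + zw^T)(I − zw^T) = I + zw^T − zw^T − z(w^T z)w^T = I.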
The idea of Gaussian elimination is to bring vectors to a simpler form
by using shears. In particular one would like to make the vectors have
many zero components. The vectors of concern are the column vectors of
a matrix.
Here is the algorithm. We want to solve Ax = b. If we can decompose
A = LU , where L is lower triangular and U is upper triangular, then we
are done. All that is required is to solve Ly = b and then solve U x = y.
In order to find the LU decomposition of A, one can begin by setting L to be the identity matrix and U to be the original matrix A. At each stage of the algorithm one replaces L by LM^{-1} and U by MU, where M is a suitably chosen shear matrix.
The choice of M at the jth stage is the following. We take M = I + z e_j^T,
where e_j is the jth unit basis vector in the standard basis. We take z to have non-zero coordinates z_i only for index values i > j. Then M and M^{-1} are lower triangular matrices.
The goal is to try to choose M so that U will eventually become an upper triangular matrix. Let us apply M to the jth column u_j of the current U. Then we want to make M u_j equal to zero for indices larger than j. That is, one must make u_{ij} + z_i u_{jj} = 0 for i > j. Clearly this can be done, provided that the diagonal element u_{jj} ≠ 0.
This algorithm with shear transformations only works if all of the diagonal elements turn out to be non-zero. This is somewhat more restrictive than merely requiring that the matrix A be non-singular.
Here is a program that implements the algorithm.
/* lusolve */
#include <stdio.h>
#include <stdlib.h>

typedef double real;        /* the notes use a type "real" */
typedef real *vector;
typedef vector *matrix;

vector vec(int);
matrix mat(int,int);
void fetchdim(int*);
void fetchvec(vector, int);
void fetchmat(matrix, int, int);
void displayvec(vector, int);
void displaymat(matrix, int, int);
void triangle(matrix, matrix, int);
void solveltr(matrix, vector, vector, int);
void solveutr(matrix, vector, vector, int);

int main()
{
int n;
vector b, x, y;
matrix a, l;
fetchdim(&n);
a = mat(n,n);
b = vec(n);
x = vec(n);
y = vec(n);
l = mat(n,n);
fetchmat(a,n,n);
displaymat(a,n,n);
fetchvec(b,n);
displayvec(b,n);
triangle(a,l,n);
displaymat(a,n,n);
displaymat(l,n,n);
solveltr(l,b,y,n);
displayvec(y,n);
solveutr(a,y,x,n);
displayvec(x,n);
return 0;
}
vector vec(int n)
{
vector x;
x = (vector) calloc(n+1, sizeof(real) );
return x ;
}
matrix mat(int m, int n)
{
matrix a;
int i;
a = (matrix) calloc(m+1, sizeof(vector) );
for (i = 1; i <= m ; i = i + 1)
a[i] = vec(n);
return a ;
}
The actual work in producing the upper triangular matrix is done by the
following procedure. The matrix a is supposed to become upper triangular
while the matrix l remains lower triangular.
The column procedure computes the proper shear and stores it in the
lower triangular matrix l.
void column(matrix a, matrix l, int j, int n)   /* header reconstructed; the name is an assumption */
{
int i;
for( i = j; i <= n; i = i+1)
l[i][j] = a[i][j] / a[j][j];
}
The shear procedure applies the shear to bring a closer to upper triangular form.
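The listing of the shear procedure itself does not appear above; a minimal sketch consistent with the column procedure might be the following (the name and signature are assumptions).

void shear(matrix a, matrix l, int j, int n)
{
    int i, k;

    /* apply M = I - z e_j^T: subtract l[i][j] times row j from row i, for i > j */
    for (i = j + 1; i <= n; i = i + 1)
        for (k = j; k <= n; k = k + 1)
            a[i][k] = a[i][k] - l[i][j] * a[j][k];
}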
2.2 Reflections
Gaussian elimination with LU decomposition is not the only technique for
solving equations. The QR method is also worth consideration.
The goal is to write an arbitrary matrix A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix. Recall that an orthogonal matrix is a matrix Q with Q^T Q = I.
Thus to solve Ax = b, one can take y = Q^T b and solve Rx = y.
We can define an inner product of vectors x and y by x · y = x^T y. We say that x and y are perpendicular or orthogonal if x · y = 0.
The Euclidean length (or norm) of a vector x is |x| = √(x · x). A unit vector u is a vector with length one: |u| = 1.
A reflection P is a linear transformation of the form P = I − 2uu^T, where u is a unit vector. The action of a reflection on a vector perpendicular to u is to leave it alone. However a vector x = cu parallel to u is sent to its negative.
It is easy to check that a reflection is an orthogonal matrix. Furthermore, if P is a reflection, then P² = I, so P is its own inverse.
Consider the problem of finding a reflection that sends a given non-zero vector a to a multiple of another given unit vector b. Since a reflection preserves lengths, the other vector must be ±|a|b.
Take u = cw, where w = a ± |a|b, and where c is chosen to make u a unit vector. Then c²w · w = 1. It is easy to check that w · w = 2w · a. Furthermore, Pa = a − 2c²(w · a)w = a − w = ∓|a|b, so the reflection sends a to the desired multiple of b.
The select procedure does that calculation to determine the unit vector
that is appropriate to the given column.
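The listing of select is likewise missing here; the following sketch shows the calculation it would have to perform for column j, using the vector and matrix conventions of the lusolve program and math.h for sqrt (the name, signature, and sign convention are assumptions; entries of u above index j are left at the zero that calloc provides, and the column is assumed not already zero).

void select(matrix a, vector u, int j, int n)
{
    int i;
    real norm, c;

    norm = 0.0;                              /* length of the column from row j down */
    for (i = j; i <= n; i = i + 1)
        norm = norm + a[i][j] * a[i][j];
    norm = sqrt(norm);

    for (i = j; i <= n; i = i + 1)           /* w = a +/- |a| e_j */
        u[i] = a[i][j];
    if (a[j][j] >= 0.0)                      /* pick the sign that avoids cancellation */
        u[j] = u[j] + norm;
    else
        u[j] = u[j] - norm;

    c = 0.0;                                 /* normalize so that c^2 w.w = 1 */
    for (i = j; i <= n; i = i + 1)
        c = c + u[i] * u[i];
    c = sqrt(c);
    for (i = j; i <= n; i = i + 1)
        u[i] = u[i] / c;
}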
In order to use this to solve an equation one must apply the same
reflections to the right hand side of the equation. Finally, one must solve
the resulting triangular system.
Projects
Pointer arithmetic
Let T be a type that is not a function type. For an integer i and a pointer value p we have another pointer value p+i. This is the pointer value associated with the ith variable of this type past the variable associated with the pointer value p.
Incrementing the pointer to T value by i corresponds to incrementing
the address by i times the size of a T value.
The fact that variables of type T may correspond to a linearly ordered set of pointer to T values makes C useful for models where a linear structure is important.
When a pointer value p is incremented by the integer amount i, then
p+i is a new pointer value. We use p[i] as a synonym for *(p+i). This is
the variable pointed to by p+i.
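A small illustration (not from the notes) of the equivalence of p[i] and *(p+i):

#include <stdio.h>

int main()
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float *p = a;                    /* p points to the variable a[0] */

    /* p + 2 points to a[2], so p[2] and *(p + 2) name the same variable */
    printf("%f %f\n", *(p + 2), p[2]);
    return 0;
}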
Function calls
In C function calls it is always a value that is passed. If one wants to
give a function access to a variable, one must pass the value of the pointer
corresponding to the variable.
Example: A procedure to fetch a number from input is defined with the heading void fetch(float *p). A call fetch(&x) copies the argument, which is the pointer value corresponding to x, onto the parameter, which is the pointer variable p. Then *p and x are the same float variable, so an assignment to *p can change the value of x.
Example: A procedure to multiply a scalar x times a vector given by w
and put the result back in the same vector is defined with the heading void
mult(float x, float *w). Then a call mult(a,v) copies the values of the
arguments a and v onto the parameters x and w. Then v and w are pointers
with the same value, and so v[i] and w[i] are the same float variables.
Therefore an assignment statement w[i] = x * w[i] in the body of the
procedure has the effect of changing the value of v[i].
Memory allocation
There is a cleared memory allocation function named calloc that is very
useful in working with pointers. It returns a pointer value corresponding
to the first of a specified number of variables of a specified type.
The calloc function does not work with the actual type, but with the
size of the type. In an implementation each data type (other than function)
has a size. The size of a data type may be recovered by the sizeof( )
operator. Thus sizeof( float ) and sizeof( float * ) give numbers that represent the amount of memory needed to store a float and the amount of memory needed to store a pointer to float.
The function call calloc( n, sizeof( float ) ) returns a pointer
to void corresponding to the first of n possible float variables. The cast
operator (float *) converts this to a pointer to float. If a pointer variable
v has been declared with float *v ; then
v = (float *) calloc( n, sizeof (float) ) ;
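The same idea extends to a matrix stored as an array of row pointers, in the spirit of the vec and mat functions used earlier; this is a sketch, not the original code.

#include <stdlib.h>

float **makemat(int m, int n)
{
    float **a;
    int i;

    a = (float **) calloc(m, sizeof(float *));      /* m row pointers */
    for (i = 0; i < m; i = i + 1)
        a[i] = (float *) calloc(n, sizeof(float));  /* each row holds n floats */
    return a;
}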
Chapter 3
Eigenvalues
3.1 Introduction
A square matrix can be analyzed in terms of its eigenvectors and eigenvalues. In this chapter we review this theory and approach the problem of numerically computing eigenvalues.
If A is a square matrix, x is a vector not equal to the zero vector, and λ is a number, then the equation
Ax = λx (3.1)
says that x is an eigenvector of A with eigenvalue λ.
3.2 Similarity
If we have an n by n matrix A and a basis consisting of n linearly independent vectors, then we may form another matrix S whose columns consist of the vectors in the basis. Let Â be the matrix of A in the new basis. Then AS = SÂ. In other words, Â = S^{-1}AS is similar to A.
Similar matrices tend to have similar geometric properties. They always
have the same eigenvalues. They also have the same determinant and trace.
(Similar matrices are not always identical in their geometrical properties;
similarity can distort length and angle.)
We would like to pick the basis to display the geometry. The way to do
this is to use eigenvectors as basis vectors, whenever possible.
If the dimension n of the space of vectors is odd, then a matrix always
has at least one real eigenvalue. If the dimension is even, then there may
be no real eigenvalues. (Example: a rotation in the plane.) Thus it is often
helpful to allow complex eigenvalues and eigenvectors. In that case the
typical matrix will have a basis of eigenvectors.
If we can take the basis vectors to be eigenvectors, then the matrix Â
in this new basis is diagonal.
There are exceptional cases where the eigenvectors do not form a basis.
(Example: a shear.) Even in these exceptional cases there will always be
a new basis in which the matrix is triangular. The eigenvalues will appear
(perhaps repeated) along the diagonal of the triangular matrix, and the
determinant and trace will be the product and sum of these eigenvalues.
We now want to look more closely at the situation when a matrix has
a basis of eigenvectors.
We say that a collection of vectors is linearly dependent if one of the
vectors can be expressed as a linear combination of the others. Otherwise
the collection is said to be linearly independent.
If X is the matrix whose columns are the eigenvectors and Λ is the diagonal matrix of the corresponding eigenvalues, then
X^{-1}AX = Λ. (3.3)
It is worth thinking a bit more about the meaning of the complex eigenvalues. It is clear that if A is a real matrix, then the eigenvalues that are not real occur in complex conjugate pairs. The reason is simply that the complex conjugate of the equation Ax = λx is Ax̄ = λ̄x̄. If λ is not real, then we have a pair λ ≠ λ̄ of complex conjugate eigenvalues.
We may write λ = a + ib and x = u + iv. Then the equation Ax = λx
becomes the two real equations Au = au − bv and Av = bu + av. The
vectors u and v are no longer eigenvectors, but they can be used as part of
a real basis. In this case instead of two complex conjugate diagonal entries
one obtains a two by two matrix that is a multiple of a rotation matrix.
Thus geometrically a typical real matrix is constructed from stretches,
shrinks, and reversals (from the real eigenvalues) and from stretches, shrinks,
and rotations (from the conjugate pair non-real eigenvalues).
Problems
2. Find the spectral representation for the matrix of the previous problem.
4. Give an example of two matrices with the same eigenvalues that are
not similar.
Theorem 3.3.1 For a symmetric real matrix A the eigenvalues are all real, and there is always a basis of eigenvectors. Furthermore, these eigenvectors may be taken to form an orthonormal basis. With this choice the matrix Q is orthogonal, and Â = Q^{-1}AQ is diagonal.
Of course one could also look at AA^T and its square root, and this would be different in general. We shall see, however, that these matrices are always orthogonal similar, so in particular the eigenvalues are the same. To this end, we use the following polar decomposition.
Proposition 3.3.1 Let A be a real square matrix. Then A = Q√(A^T A), where Q is orthogonal.
This amounts to writing the matrix as the product of a part that has absolute value one with a part that represents its absolute value. Of course here the absolute value one part is an orthogonal matrix and the absolute value part is a symmetric matrix.
Here is how this can be done. We can decompose the space into the orthogonal sum of the range of A^T and the nullspace of A. This is the same as the orthogonal sum of the range of √(A^T A) and the nullspace of √(A^T A). The range of A^T is the part where the absolute value is non-zero. On this part the unit size part is determined; we must define Q on x = √(A^T A) y in the range in such a way as to have Qx = Q√(A^T A) y = Ay. Then |Qx| = |Ay| = |x|, so Q sends the range of A^T to the range of A and preserves lengths on this part of the space. However on the nullspace of A the unit size part is arbitrary. But we can also decompose the space into the orthogonal sum of the range of A and the nullspace of A^T. Since the nullspaces of A and A^T have the same dimension, we can define Q on the nullspace of A to be an arbitrary orthogonal transformation that takes it to the nullspace of A^T.
We see from A = Q√(A^T A) that AA^T = Q A^T A Q^T. Thus AA^T is similar to A^T A by the orthogonal matrix Q. The two possible notions of absolute value are geometrically equivalent, and the two possible notions of singular value coincide.
Theorem 3.3.2 Let A be a real matrix with only real eigenvalues. Then A is orthogonal similar to an upper triangular matrix U:
Q^{-1}AQ = U. (3.6)
The ∞ and 1 norms are related by ‖A‖_∞ = ‖A^T‖_1. For the 2-norm we have the important relation ‖A‖_2 = ‖A^T‖_2.
There is a very useful interpolation bound relating the 2-norm to the other norms.
Proposition 3.4.1
‖A‖_2 ≤ √(‖A‖_1 ‖A‖_∞). (3.15)
Why are these norms useful? Maybe the main reason is that ‖A‖_[∞] = ‖A‖_2, and so
‖A‖_2 ≤ ‖A‖_[2]. (3.20)
This gives a useful upper bound that complements the interpolation bound.
for every n = 1, 2, 3, . . ..
1. Evaluate each of the six matrix norms for the two-by-two matrix
whose first row is 0, −3 and whose second row is −1, 2.
3.5 Stability
3.5.1 Inverses
We next look at the stability of the inverse under perturbation. The fundamental result is the following.
Proposition 3.5.1 Assume that the matrix A has an inverse A^{-1}. Let E be another matrix. Assume that E is small relative to A in the sense that ‖E‖ < 1/‖A^{-1}‖. Let Â = A − E. Then Â has an inverse Â^{-1}, and
3.5.2 Iteration
Sometimes one wants to solve the equation Ax = b by iteration. A natural
choice of fixed point function is
Here C can be an arbitrary non-singular matrix, and the fixed point will
be a solution. However for convergence we would like C to be a reasonable
guess of or approximation to A^{-1}.
When this is satisfied we may write
then A is invertible.
The intervals about b_{ii} in the corollary are known as Gershgorin’s disks.
Problems
1. Assume Ax = b. Assume that there is a computed solution x̂ = x − e, where e is an error vector. Let Ax̂ = b̂, and define the residual vector r by b̂ = b − r. Show that |e|/|x| ≤ cond(A)|r|/|b|.
2. Assume Ax = b. Assume that there is an error in the matrix, so that the matrix used for the computation is Â = A − E. Take the computed solution as x̂ defined by Âx̂ = b. Show that |e|/|x̂| ≤ cond(A)‖E‖/‖A‖.
3. Find the Gershgorin disks for the three-by-three matrix whose first
row is 1, 2, −1, whose second row is 2, 7, 0, and whose third row is
−1, 0, −5.
When k is large, the term c_1 λ_1^k x_1 is so much larger than the other terms that A^k u is a good approximation to a multiple of x_1.
[We can write this another way in terms of the spectral representation. Let u be a non-zero vector such that y_1^T u ≠ 0. Then
A^k u = λ_1^k x_1 y_1^T u + Σ_{i≠1} λ_i^k x_i y_i^T u. (3.32)
When k is large the first term will be much larger than the other terms. Therefore A^k u will be approximately λ_1^k times a multiple of the eigenvector x_1.]
In practice we take u to be some convenient vector, such as the first
coordinate basis vector, and we just hope that the condition is satisfied.
We compute Ak u by successive multiplication of the matrix A times the
previous vector. In order to extract the eigenvalue we can compute the
result for k + 1 and for k and divide the vectors component by component.
Each quotient should be close to λ_1.
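A minimal sketch of this procedure, using the vector and matrix conventions of Chapter 2 (the function name and the rescaling by the first component are assumptions; rescaling simply avoids overflow and does not change the quotients):

real power(matrix a, int n, int nsteps)
{
    vector u, v;
    int i, j, k;
    real lambda;

    u = vec(n);
    v = vec(n);
    for (i = 1; i <= n; i = i + 1)
        u[i] = 0.0;
    u[1] = 1.0;                          /* first coordinate basis vector */

    lambda = 0.0;
    for (k = 1; k <= nsteps; k = k + 1)
    {
        for (i = 1; i <= n; i = i + 1)   /* v = A u */
        {
            v[i] = 0.0;
            for (j = 1; j <= n; j = j + 1)
                v[i] = v[i] + a[i][j] * u[j];
        }
        lambda = v[1] / u[1];            /* componentwise quotient, here in component 1 */
        for (i = 1; i <= n; i = i + 1)
            u[i] = v[i] / v[1];          /* rescale; assumes v[1] is not zero */
    }
    return lambda;
}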
Problems
1. Take the matrix whose rows are 0, −3 and −1, 2. Apply the matrix four times to the starting vector. How close is this to an eigenvector?
2. Consider the power method for finding eigenvalues of a real matrix. Describe what happens when the matrix is symmetric and the eigenvalue of largest absolute value has multiplicity two.
3. Also describe what happens when the matrix is not symmetric and
the eigenvalues of largest absolute value are a complex conjugate pair.
3. Change the first 1 in the last row to a 2, and find the eigenvalues of
the resulting non-symmetric matrix.
To a good approximation, the first r terms of this sum are much larger than the remaining terms. Thus to a good approximation the A^k e_i for 1 ≤ i ≤ r are just linear combinations of the first r eigenvectors.
We may replace the A^k e_i by linear combinations that are orthonormal. This is what is accomplished by the QR decomposition. The first r columns of the orthogonal factor then form an orthonormal basis whose span is approximately the span of the first r eigenvectors.
U_{k+1} = Q̃_k^{-1} A Q̃_k, (3.35)
1. Take the matrix whose rows are 0, −3 and −1, 2. Take the eigenvector
corresponding to the largest eigenvalue. Find an orthogonal vector
and form an orthogonal basis with these two vectors. Use the matrix
with this basis to perform a similarity transformation of the original
matrix. How close is the result to an upper triangular matrix?
2. Take the matrix whose rows are 0, −3 and −1, 2. Apply the matrix
four times to the starting vector. Find an orthogonal vector and form
an orthogonal basis with these two vectors. Use the matrix with this
basis to perform a similarity transformation of the original matrix.
How close is the result to an upper triangular matrix?
3.9 QR method
The famous QR method is just another variant on the power method for
subspaces of the last section. However it eliminates the calculational diffi-
culties.
Here is the algorithm. We want to find approximate the Schur decom-
position of the matrix A.
Start with U_1 = A. Then iterate as follows. Having defined U_k, write
U_k = Q_k R_k, (3.36)
U_{k+1} = R_k Q_k. (3.37)
(Note the reverse order.) Then for large k the matrix U_{k+1} should be a good approximation to the upper triangular matrix in the Schur decomposition.
Why does this work?
First note that U_{k+1} = R_k Q_k = Q_k^{-1} U_k Q_k, so U_{k+1} is orthogonal similar to U_k.
Let Q̃_k = Q_1 · · · Q_k and R̃_k = R_k · · · R_1. Then it is easy to see that
U_{k+1} = Q̃_k^{-1} A Q̃_k. (3.38)
In other words, the Q̃_k that sets up the similarity of U_{k+1} with A is the same Q̃_k that arises from the QR decomposition of the power A^k. But we have seen that this should give an approximation to the Schur decomposition of A. Thus the U_{k+1} should be approximately upper triangular.
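For completeness, the identity behind this last remark is A^k = Q̃_k R̃_k, which follows by induction on k: from (3.38) applied at stage k − 1 and the factorization U_k = Q_k R_k,

A Q̃_{k−1} = Q̃_{k−1} U_k = Q̃_{k−1} Q_k R_k = Q̃_k R_k,

so that

A^k = A (Q̃_{k−1} R̃_{k−1}) = (Q̃_k R_k) R̃_{k−1} = Q̃_k R̃_k.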
Projects
1. Implement the QR method for finding eigenvalues.
2. Use the program to find the eigenvalues of the symmetric matrix with
rows 1, 1, 0, 1 and 1, 4, 1, 1 and 0, 1, 9, 5 and 1, 1, 5, 16.
3. Change the last 1 in the first row to a 3, and find the eigenvalues of
the resulting non-symmetric matrix.
reflection vectors used in the decomposition are each vectors that have only
two non-zero components. The arithmetic is much reduced.
Chapter 4
Nonlinear systems
4.1 Introduction
This chapter deals with solving equations of the form f(x) = 0, where f is a continuous function from R^n to R^n. It also treats questions of roundoff error and its amplification in the course of a numerical calculation.
In much of what we do the derivative f′ of such a function f will play an essential role. This is defined in such a way that
f(x + h) − f(x) = f′(x)h + r, (4.1)
where the remainder is of higher than first order in the vector h. Thus the derivative f′(x) is a matrix. If we write this in variables with y = f(x), then the derivative formula is
∆y ≈ f′(x)∆x. (4.2)
If we write these relations in components, we get
f_i(x + h) = f_i(x) + Σ_{j=1}^n (∂f_i(x)/∂x_j) h_j + r_i. (4.3)
Thus the derivative matrix is the matrix of partial derivatives. Using variables one writes
∆y_i ≈ Σ_{j=1}^n (∂y_i/∂x_j) ∆x_j. (4.4)
This says that the new derivative matrix is obtained from the original matrix by multiplying on each side by matrices representing the effect of the coordinate transformations.
A variant is when we think of the function as a transformation from a
space to the same space. In that case we may write x̂ = g(x) and think of
x as the coordinates of the original point and x̂ as the coordinates of the
new point.
In this case there is only the change from x to z coordinates, so the
change of variable formula becomes
∂ẑ_i/∂z_j = Σ_k Σ_r (∂ẑ_i/∂x̂_k)(∂x̂_k/∂x_r)(∂x_r/∂z_j). (4.7)
We shall see in the problems that there is a special situation when this
change is a familiar operation of linear algebra.
When we think of a function f as a vector field, then we think of x as
being a point in some space and y = f (x) as being the components of a
vector attached to the point x.
Let us look at the effect of a change of coordinates on the vector field
itself. We change from x to z coordinates. Let us call the new components
of the vector field ȳ. Then if we look at a curve tangent to the vector field,
we see that along the curve
ȳ_i = dz_i/dt = Σ_k (∂z_i/∂x_k)(dx_k/dt) = Σ_k (∂z_i/∂x_k) y_k. (4.8)
How about the partial derivatives of the vector field? Here the situation
is ugly. A routine computation gives
This does not even look like matrix multiplication. We shall see in the
problems that there is a special situation where this difficulty does not
occur and where we get instead some nice linear algebra.
Problems
4.2 Degree
The intermediate value theorem was a fundamental result in solving equations in one dimension.
this theorem for systems. There is such an analog; one version of it is the
following topological degree theorem.
We say that a vector y is opposite to another vector x if there exists
c ≥ 0 with y = −cx.
4.3 Iteration
4.3.1 First order convergence
Another approach to numerical root-finding is iteration. Assume that g is
a continuous function. We seek a fixed point r with g(r) = r. We can
1. Newton’s method for systems has the disadvantage that one must compute many partial derivatives. Steffensen’s method provides an alternative. The method is to iterate with g(x) = x − w, where w is the solution of J(x)w = f(x). For Newton’s method J(x) = f′(x), but for Steffensen’s method we approximate the matrix of partial derivatives by a matrix of difference quotients. Thus the i, j entry of J(x) is (f_i(x + h_j e_j) − f_i(x))/h_j, where h_j = α_j(f(x)). Here α is a function that vanishes at zero. Thus as f(x) approaches zero, these difference quotients automatically approach the partial derivatives.
There are various possible choices for the function α. One popular choice is the identity α_j(z) = z_j, so that h_j = f_j(x). The disadvantage of this choice is that h_j can be zero away from the root.
Another method is to take each component of α to be the length, so that α_j(z) = |z| and h_j = |f(x)|. This choice of α is not differentiable at the origin, but in this case this is not a problem.
Perhaps an even better method is to take α_j(z) to be the minimum of |z| and some small number, such as 0.01. This will make the difference matrix somewhat resemble the derivative matrix even far from the solution.
The project is to write a program for solving a system of non-linear
equations by Steffensen’s method. Try out the program on a simple
system for which you know the solution.
x³ − 3xy² − 6z³ + 18zw² − 1 = 0
3x²y − y³ − 18z²w + 6w³ = 0
xz − yw − 1 = 0
yz + xw = 0
Find a solution near the point where (x, y, z, w) is (0.6, 1.1, 0.4, −0.7).
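A minimal sketch of the difference-quotient matrix described in the Steffensen project above, using the vector and matrix conventions of Chapter 2 and math.h for sqrt (all names and signatures here are assumptions; f is taken to fill a vector fx with the value f(x)):

void steffensen_matrix(void (*f)(vector, vector, int), vector x, matrix jmat, int n)
{
    int i, k;
    real h;
    vector fx, fxh;

    fx = vec(n);
    fxh = vec(n);
    f(x, fx, n);

    /* step size h = min(|f(x)|, 0.01), one of the choices discussed above;
       near the root h becomes small, so a real program should guard against h == 0 */
    h = 0.0;
    for (i = 1; i <= n; i = i + 1)
        h = h + fx[i] * fx[i];
    h = sqrt(h);
    if (h > 0.01)
        h = 0.01;

    for (k = 1; k <= n; k = k + 1)
    {
        x[k] = x[k] + h;                 /* perturb the k-th coordinate */
        f(x, fxh, n);
        x[k] = x[k] - h;                 /* restore it */
        for (i = 1; i <= n; i = i + 1)
            jmat[i][k] = (fxh[i] - fx[i]) / h;
    }
}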
Problems
6. Show that g′(r) = 0 and that g″(r) = (2m′(r) − f″(r))/f′(r).
10. This leads to the problem of finding when a matrix A has the property that A^{-1} has positive entries. Show that if A = D − N, where D is diagonal and N is off-diagonal and D ≥ 0 and N ≥ 0 and ‖D^{-1}N‖_∞ < 1, then A^{-1} has only positive entries.
11. Show that if in this situation A^{-1} exists but we have only ‖D^{-1}N‖_∞ ≤ 1, then the same conclusion holds.
2. Show that if A^{-1} exists and ‖(EA^{-1})^k‖ < 1 for some k, then (A − E)^{-1} exists.
Theorem 4.5.1 The norms of powers A^n and the spectral radius of A are related by
lim_{n→∞} ‖A^n‖^{1/n} = ρ(A). (4.15)
Proof:
It is easy to see that for all n it is true that ρ(A) ≤ ‖A^n‖^{1/n}. The problem is to show that for large n the right hand side is not much larger than the left hand side.
The first thing to do is to check that |z| < 1/ρ(A) implies that (I − zA)^{-1} exists. (Otherwise 1/z would be an eigenvalue of A outside of the spectral radius.)
The essential observation is that (I − zA)^{-1} is an analytic function of z for |z| < 1/ρ(A). It follows that the power series expansion converges in this disk. Thus for |z| with |z| < 1/ρ(A) there is a constant c with ‖(zA)^n‖ = |z|^n ‖A^n‖ ≤ c.
We have shown that for every r < 1/ρ(A) there is a c with ‖A^n‖^{1/n} ≤ c^{1/n}/r. Take 1/r to be larger than but very close to ρ(A). Take n so large that c^{1/n}/r is still close to ρ(A). Then ‖A^n‖^{1/n} must be larger than but close to ρ(A). □
Let us look at this proof in more detail. The essential point is the
convergence of the power series. Why must this happen? It is a miracle of
complex variable: the Cauchy integral formula reduces the convergence of
an arbitrary power series inside its radius of convergence to the convergence
of a geometric series.
Look at the Cauchy integral formula
(I − wA)^{-1} = (1/(2πi)) ∫ (I − zA)^{-1} 1/(z − w) dz, (4.16)
where w is inside the circle of integration |z| = r and r < 1/ρ(A). We may expand in a geometric series in powers of w/z. From this we see that the coefficients of the expansion in powers of w are
A^n = (1/(2πi)) ∫ (I − zA)^{-1} 1/z^{n+1} dz. (4.17)
This proves that ‖A^n‖ ≤ c/r^n, where
c = (1/(2πr)) ∫ ‖(I − zA)^{-1}‖ d|z| (4.18)
over |z| = r.
Notice that as r approaches 1/ρ(A) the bound c will become larger, due
to the contribution to the integral from the singularity at z = 1/λ.
Problems
1. Show that it is false in general that ρ(A + B) ≤ ρ(A) + ρ(B). Hint:
Find 2 by 2 matrices for which the right hand side is zero.
The largest eigenvalue of the matrix is 1/a > 1, so for large k this has very
large norm. Errors (from earlier stages of taking the power) are very much
amplified!
In this sense it falls under the scope of the analysis of the previous section.
Ultimately, however, the result is that the amplification of relative error
is given by a matrix with entries (z_k/y_i)(∂y_i/∂z_k). We want this to be a small matrix. We have no control over the y_i, since this is simply the output value. But the algorithm controls the size of the intermediate number z_k.
3. Methods using Taylor series can have problems with roundoff error. Consider the problem of finding y = e^{−x} for large x. Here are two possible approaches: sum the Taylor series for e^{−x} directly, or compute e^{x} and take the reciprocal. Which is preferable numerically?
4.8 Numerical differentiation
1. Take f(x) = √x and consider the problem of computing the difference quotient (f(x + h) − f(x))/h for small h. Discuss numerical stability of various algorithms.
2. Take f (x) = 1/x and consider the problem of computing the difference
quotient (f (x + h) − f (x))/h for small h. Discuss numerical stability
of various algorithms.
3. Consider the problem of computing the derivative f′(x). One may compute either (f(x + h) − f(x))/h, or one may compute (f(x + h) − f(x − h))/(2h). Compare these from the point of view of approximation error. (The relevant Taylor expansions are recorded after problem 4 below.)
4. Say that one takes Steffensen’s method with h = f(x)² instead of h = f(x). What is the situation with numerical stability?
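For problem 3, the relevant Taylor expansions (standard facts, not from the notes) are

(f(x + h) − f(x))/h = f′(x) + (1/2) f″(x) h + O(h²),
(f(x + h) − f(x − h))/(2h) = f′(x) + (1/6) f‴(x) h² + O(h⁴),

so the one-sided quotient has approximation error of order h while the centered quotient has error of order h².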
Chapter 5
Ordinary Differential Equations
5.1 Introduction
This chapter is on the numerical solution of ordinary differential equations.
There is no attempt to cover a broad spectrum of methods. In fact, we
stick with the simplest methods to implement, the Runge-Kutta methods.
Our main purpose is to point out that there are two different problems
with the approximation of ordinary differential equations. The first is to
get an accurate representation of the solution for moderate time intervals
by using a small enough step size and an accurate enough approximation
method. The second is to get the right asymptotic behavior of the solution
for large time.
3. Solve this numerically with the trapezoid second order Runge-Kutta method and compare.
4. Compare the Euler and trapezoid second order Runge-Kutta methods with the (left endpoint) Riemann sum and trapezoid rule methods for numerical integration. (A sketch of both one-step formulas follows these problems.)
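Here is a minimal sketch (not from the notes) of the two one-step formulas compared in the problems above, for dy/dt = f(t, y); the function names are assumptions.

double euler_step(double (*f)(double, double), double t, double y, double h)
{
    return y + h * f(t, y);                  /* slope at the left endpoint */
}

double trapezoid_rk2_step(double (*f)(double, double), double t, double y, double h)
{
    double k1 = f(t, y);                     /* slope at the left endpoint */
    double k2 = f(t + h, y + h * k1);        /* Euler-predicted slope at the right endpoint */
    return y + h * (k1 + k2) / 2.0;          /* average the two slopes */
}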
dy/dt = a(t)y + s(t) (5.8)
The trick is to let u(t) be a solution of the corresponding homogeneous
equation and try y = c(t)u(t). Then it is easy to solve for c(t) by integration
of dc(t)/dt = s(t)/u(t).
dy/dt = f(y). (5.9)
Projects
5.3.3 Existence
We want to explore several questions. When do solutions exist? When are
they uniquely specified by the initial condition? How does one approximate
them numerically?
5.3.4 Uniqueness
Assume in addition that g has continuous derivatives. Then the solution
with the given initial condition is unique. This fact is usually proved using
a fixed point iteration method.
Uniqueness can fail when g is continuous but when g(t, y) has infinite
slope as a function of y.
Problems
1. Plot the function g(y) = sign(y)√|y|. Prove that it is continuous.
2. Plot its derivative and prove that it is not continuous.
3. Solve the differential equation
dy/dt = sign(y)√|y| (5.13)
with the initial condition y = 0 when t = 0. Find all solutions for
t ≥ 0.
4. Substitute the solutions back into the equation and check that they are in fact solutions.
5. Sketch the vector field in phase space (with dx/dt = 1).
6. Consider the backward Euler's method for this example. What ambiguity is there in the numerical solution?
dy/dt = f(t, y) (5.18)
with initial condition y = y_0 at t = 0. For convenience we denote the solution of the equation at a point where t = a by y(a).
The general method is to find a function φ(t, y, h) and compute y_{n+1} = y_n + hφ(t_n, y_n, h). Assume that φ satisfies a Lipschitz condition |φ(t, y, h) − φ(t, z, h)| ≤ L|y − z| for some constant L. This would follow from a bound on the y partial derivative of φ. We call this the slope bound.
Assume also that we have a bound T_{n+1} ≤ Kh^{p+1}. We call this the local truncation error bound.
Theorem 5.4.1 Assume that the one-step numerical method for solving the ordinary differential equation dy/dt = f(t, y) satisfies the slope bound and the local truncation error bound. Then the error satisfies the global truncation error bound
|ε_n| ≤ Kh^p (e^{Lt_n} − 1)/L. (5.23)
This bound is a worst case analysis, and the error may not be nearly
as large as the bound. But there are cases when it can be this bad. Notice
that for fixed time tn the bound gets better and better as h → 0. In fact,
when the order p is large, the improvement is dramatic as h becomes very
small.
On the other hand, for fixed h, even very small, this bound becomes
very large as tn → ∞.
Obviously, it is often desirable to take p to be large. It is possible
to classify the Runge-Kutta methods of order p, at least when p is not too
large. The usual situation is to use a method of order 2 for rough calculation
and a method of order three or four for more accurate calculation.
We begin by classifying the explicit Runge-Kutta methods of order 2. We take (in one standard parameterization)
φ(t, y, h) = (1 − a)f(t, y) + a f(t + ch, y + ch f(t, y)).
We see that every such method is consistent, and hence of order one. The condition that the method be of order 2 works out to be that ac = 1/2.
There is a similar classification of methods of order 3 and 4. In each
case there is a two-parameter family of Runge-Kutta methods. By far the
most commonly used method is a particular method of order 4. (There are
methods of higher order, but they tend to be cumbersome.)
Problems
It is natural to require that the method satisfy the condition that f(y) = 0 implies φ(y, h) = 0.
This is an iteration with the iteration function g(y) = y + hφ(y, h). Under the above requirement the equilibrium point r is a fixed point with g(r) = r. Such a fixed point r is stable if g′(r) = 1 + hφ′(r, h) is strictly less than one in absolute value.
If f′(y) < 0 for y near r, then for h small one might expect that φ′(r, h) < 0 and hence g′(r) < 1. Furthermore, for h small enough we would have g′(r) > −1. So a stable equilibrium of the equation should imply a stable fixed point of the numerical scheme, at least for small values of h.
How small? We want g′(r) = 1 + hφ′(r, h) > −1, or hφ′(r, h) > −2. Thus a rough criterion would be hf′(r) > −2.
In some problems, as we shall see, there are two time scales. One time
scale suggests taking a comparatively large value of time step h for the
integration. The other time scale is determined by the reciprocal of the
magnitude of f′(r), and this can be very small. If these scales are so
different that the criterion is violated, then the problem is said to be stiff.
(We shall give a more precise definition later.)
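For example, with Euler's method φ(y, h) = f(y), so for the model equation dy/dt = −a(y − r) the criterion hf′(r) > −2 becomes h < 2/a. With a = 10000 (the value used in the problems below) this forces h < 0.0002, even though the slowly varying function r(t) could be tracked comfortably with a much larger step.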
A class of problems where stiffness is involved is for equations of the form
dy/dt = f(t, y), (5.27)
where there is a function y = r(t) with f (t, r(t)) = 0 and with ∂f (t, r(t))/∂y <
0 and very negative. The dependence of r(t) on t is slow compared with
the fast decay of the solution y to y = r(t). Thus one might want to take
a moderate sized time step h suitable for tracking y = r(t). However this
would be too large for tracking the decay in detail, which we might not
even care to do. Thus we need a numerical method that gives the decay
without worrying about the details of how it happens.
If the two partial derivatives are both strictly negative, then this condition
is guaranteed for all h > 0, no matter how large, by the inequality
This says that the implicit method must have at least as much dependence
on the future as on the past. Thus a stiff problem requires implicitness.
One could start this iteration with y_n. The fixed point of this function is the desired y_{n+1}. The trouble with this method is that s′(z) = hφ_2(t_n, y_n, z, h), and for a stiff problem this is very large. So the iteration will presumably not converge.
How about Newton's method? This is iteration with
t(z) = z − (z − y_n − hφ(t, y_n, z, h)) / (1 − hφ_2(t, y_n, z, h)). (5.35)
This should work, but the irritation is that one must compute a partial
derivative.
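As an illustration (not from the notes), here is a sketch of one backward Euler step computed by the Newton iteration just described, for a scalar equation dy/dt = f(t, y); the name fy for the partial derivative ∂f/∂y is an assumption.

#include <math.h>

double backward_euler_step(double (*f)(double, double),
                           double (*fy)(double, double),
                           double t, double y, double h)
{
    double z, r, rp, znew;
    int k;

    z = y;                                   /* start the Newton iteration at y_n */
    for (k = 0; k < 20; k = k + 1)
    {
        r = z - y - h * f(t + h, z);         /* residual of z = y_n + h f(t_{n+1}, z) */
        rp = 1.0 - h * fy(t + h, z);         /* derivative of the residual in z */
        znew = z - r / rp;                   /* Newton update */
        if (fabs(znew - z) < 1.0e-10)
            return znew;
        z = znew;
    }
    return z;
}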
Problems
1. Consider a problem of the form dy/dt = −a(y − r(t)) where a > 0
is large but r(t) is slowly varying. Find the explicit solution of the
initial value problem when y = y0 at t = 0. Find the limit of the
solution at time t as a → ∞.
2. Take a = 10000 and r(t) = sin t. Use y0 = 1. The problem is to
experiment with explicit methods with different step sizes. Solve the
problem on the interval from t = 0 to t = π. Use Euler’s method with
step sizes h = .01, .001, .0001, .00001. Describe your computational
experience and relate it to theory.
3. The next problem is to do the same experiment with stable implicit methods. Use the backward Euler's method with step sizes h = .01, .001.
4. Consider the general implicit method of the form
7. Are there second order methods for stiff problems that are stable?
Discuss.
8. We have required that every zero of f (y) also be a zero of φ(y, h).
When is this satisfied for Runge-Kutta methods?
9. Consider a Taylor method of the form φ(y, h) = f(y) + (1/2)f′(y)f(y)h. When is the requirement satisfied for such Taylor methods?
5.5 Systems
5.5.1 Introduction
We now turn to systems of ordinary differential equations. For simplicity,
we concentrate on autonomous systems consisting of two equations. The
general form is
dx/dt = f(x, y) (5.37)
dy/dt = g(x, y). (5.38)
Notice that this includes as a special case the equation dy/dt = g(t, y). This may be written as a system by writing it as
dx/dt = 1 (5.39)
dy/dt = g(x, y). (5.40)
In general, if we have a system in n variables with explicit time dependence, then we may use the same trick to get an autonomous system in n + 1 variables.
We may think of an autonomous system as being given by a vector
field. In the case we are considering this is a vector field in the plane with
components f(x, y) and g(x, y). If we change coordinates in the plane, then we change these components as well.
In general, the matrix of partial derivatives of this vector field transforms
in a complicated way under change of coordinates in the plane. However at
a zero of the vector field the matrix undergoes a similarity transformation.
Hence linear algebra is relevant!
In particular, two eigenvalues (with negative real parts) of the linearization at a stable fixed point determine two rates of approach to equilibrium.
In the case when these rates are very different we have a stiff problem.
5.5.2 Linear constant coefficient equations
Consider the linear system
dx/dt = ax + by (5.41)
dy/dt = cx + dy. (5.42)
We try a trial solution of the form
x = v e^{λt} (5.43)
y = w e^{λt}. (5.44)
Substituting gives the eigenvalue equations
av + bw = λv (5.45)
cv + dw = λw. (5.46)
In matrix notation the system is
dx/dt = Ax. (5.47)
The trial solution is
x = v e^{λt}. (5.48)
The eigenvalue equation is
Av = λv. (5.49)
This has a non-zero solution only when det(λI − A) = 0.
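For the two-by-two system above this determinant condition reads

det(λI − A) = (λ − a)(λ − d) − bc = λ² − (a + d)λ + (ad − bc) = 0,

whose discriminant (a + d)² − 4(ad − bc) = (a − d)² + 4bc determines which of the cases below occurs.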
When the two eigenvalues are both positive or both negative, the equilibrium is called a node. When one eigenvalue is positive and one is negative, it is called a saddle. An attractive node corresponds to an overdamped oscillator.
Oscillation
The second case is complex conjugate unequal eigenvalues λ = α + iω and λ̄ = α − iω with α = (a + d)/2 and ω > 0. This takes place when (a − d)² + 4bc < 0. There are two independent complex conjugate solutions. These are expressed in terms of e^{λt} = e^{αt}e^{iωt} and e^{λ̄t} = e^{αt}e^{−iωt}. Their real and imaginary parts are independent real solutions. These are expressed in terms of e^{αt}cos(ωt) and e^{αt}sin(ωt).
In matrix notation we have complex eigenvectors u ± iv and the solutions are
x = (c_1 ± ic_2)e^{αt}e^{±iωt}(u ± iv). (5.51)
Taking the real part gives
x = c_1 e^{αt}(cos(ωt)u − sin(ωt)v) − c_2 e^{αt}(sin(ωt)u + cos(ωt)v). (5.52)
If we write c_1 ± ic_2 = ce^{±iθ}, these take the alternate forms
x = ce^{αt}e^{±i(ωt+θ)}(u ± iv) (5.53)
and
x = ce^{αt}(cos(ωt + θ)u − sin(ωt + θ)v). (5.54)
From this we see that the solution is characterized by an amplitude c and a phase θ. When the two conjugate eigenvalues are pure imaginary, the equilibrium is called a center. When the two conjugate eigenvalues have a non-zero real part, it is called a spiral (or a focus). A center corresponds to an undamped oscillator. An attractive spiral corresponds to an underdamped oscillator.
Shearing
The remaining case is when there is only one eigenvalue λ = (a + d)/2. This takes place when (a − d)² + 4bc = 0. In this case we need to try a solution of the form
x = p e^{λt} + v t e^{λt} (5.55)
y = q e^{λt} + w t e^{λt}. (5.56)
We obtain the same eigenvalue equation together with the equation
ap + bq = λp + v (5.57)
cp + dq = λq + w. (5.58)
In practice we do not need to solve for the eigenvector: we merely take p, q determined by the initial conditions and use the last equation to solve for v, w.
In matrix notation this becomes
x = p e^{λt} + v t e^{λt} (5.59)
with
Ap = λp + v. (5.60)
Inhomogeneous equations
The general linear constant coefficient equation is
dx/dt = Ax + r. (5.61)
When A is non-singular we may rewrite this as
dx/dt = A(x − s), (5.62)
where s = −A^{−1} r is constant. Thus x = s is a particular solution. The general solution is the sum of this particular solution with the general solution of the homogeneous equation.
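As a one-dimensional illustration (not in the text), for dx/dt = ax + r with a ≠ 0 the particular solution is s = −r/a, and the general solution is x(t) = −r/a + (x(0) + r/a) e^{at}.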
Problems
2. Find the solution of this equation with the initial condition x = 1 and
y = 3 when t = 0.
3. Sketch the vector field in the above problem. Sketch the given solution
in the x, y phase space. Experiment to find a solution that passes
very close to the origin, and sketch it.
6. Find the solution of this equation with the initial condition x = 5 and
y = 4 when t = 0.
7. Sketch the vector field in the above problem. Find the orbit of the
given solution in phase space. Also plot x versus t and y versus t.
8. A frictionless spring has mass m > 0 and spring constant k > 0. Its
displacement and velocity x and y satisfy
dx/dt = y
m dy/dt = −kx.
Describe the motion.
9. A spring has mass m > 0 and spring constant k > 0 and friction
constant f > 0. Its displacement and velocity x and y satisfy
dx/dt = y
m dy/dt = −kx − f y.
Describe the motion in the case f^2 − 4k < 0 (underdamped).
10. Take m = 1 and k = 1 and f = 0.1. Sketch the vector field and the
solution in the phase plane. Also sketch x as a function of t.
11. In the preceding problem, describe the motion in the case f^2 − 4k > 0 (overdamped). Is it possible for the oscillator displacement x to overshoot the origin? If so, how many times?
12. An object has mass m > 0 and its displacement and velocity x and y
satisfy
dx/dt = y
m dy/dt = 0.
Describe the motion.
13. Solve the above equation with many initial conditions with x = 0 and with varying values of y. Run the solutions with these initial conditions for a short time interval. Why can this be described as “shear”?
Problems
dx/dt = v (5.63)
m dv/dt = −kx − cv, (5.64)
where m > 0 is the mass, k > 0 is the spring constant, and c > 0 is the
friction constant. We will be interested in the highly damped situa-
tions, when m is small relative to k and c. Take k and c each 10 times
the size of m. Find the eigenvalues and find approximate numerical
expressions for them. Find approximate numerical expressions for the
eigenvectors. Describe the corresponding solutions.
Consider an autonomous system
dx/dt = f(x, y) (5.65)
dy/dt = g(x, y). (5.66)
An equilibrium point is a solution of f (r, s) = 0 and g(r, s) = 0. For
each equilibrium point we have a solution x = r and y = s.
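As a computational aside (a minimal sketch, not taken from the text), an orbit of such a system can be traced with the forward Euler method; the functions f and g below, a damped spring, are placeholders to be replaced by the system of interest.

#include <stdio.h>

/* Right-hand side of the autonomous system dx/dt = f(x,y), dy/dt = g(x,y).
   Here: a damped spring, chosen only as a placeholder. */
double f(double x, double y) { return y; }
double g(double x, double y) { return -x - 0.5 * y; }

int main(void)
{
    double x = 1.0, y = 0.0;            /* initial condition          */
    double dt = 0.01;                   /* time step                  */
    int n;

    for (n = 1; n <= 2000; n++) {       /* integrate from t = 0 to 20 */
        double xnew = x + f(x, y) * dt; /* forward Euler step         */
        double ynew = y + g(x, y) * dt;
        x = xnew;
        y = ynew;
        printf("%g %g %g\n", n * dt, x, y);
    }
    return 0;
}

Smaller steps dt give a more faithful orbit, at the price of more work.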
Near an equilibrium point
The most interesting equilibrium takes place when the natural predator
growth rate cx − d with x = a/m at the prey carrying capacity is positive.
This says that the predator can live off the land.
Problems
5. For the pendulum problem, describe the nature of the two equilibria
when there is friction.
Chapter 6
Fourier transforms
6.1 Groups
We want to consider several variants of the Fourier transform at once. The
unifying idea is that the Fourier transform deals with complex functions
defined on commutative groups. (Recall that a commutative group is a set
with operations of addition and subtraction that satisfy the usual proper-
ties.) Here are the groups that we shall consider.
The first is the group of all real numbers.
The second is the group of all integer multiples n∆x of a fixed real number ∆x > 0. This is a subgroup of the real numbers, since the sum or difference of any two such multiples is again of the same form.
The third is the group of all real numbers mod L, where L > 0 is fixed.
This is the circle group, where the circle has circumference L. Two real
numbers determine the same element if they differ by an integer multiple
of L. Thus the circle group is a quotient group of the real numbers.
The final group is the group of all integer multiples n∆x mod L = N ∆x.
This is a subgroup of the circle group. It is also a quotient group of the
integer group. It is finite with precisely N elements.
These formulas may be obtained from the finite case. Take the sum
over k to run from −N ∆k/2 = −π/∆x to N ∆k/2 = π/∆x, counting the
end point at most once. Then let N → ∞ and ∆x → 0 keeping L = N ∆x
fixed.
These formulas may be obtained from the finite case by taking the sum
on x to range from −N ∆x/2 to N ∆x/2 (counting end points only once)
and then letting N → ∞ with fixed ∆x.
These formulas may be obtained from the circle case by integrating x from −L/2 to L/2 and letting L → ∞, or from the integer case by integrating k from −π/∆x to π/∆x and letting ∆x → 0.
The notation has been chosen to suggest that x is position and k is wave
number (spatial frequency). It is also common to find another notation in
which t is time and ω is (temporal) frequency. For the record, here are the
formulas in the other notation.
The Fourier transform is
f̂(ω) = ∫_{−∞}^{∞} e^{−iωt} f(t) dt. (6.13)
6.7 Subgroups
We now want to consider a more complicated situation. Let G be a group.
Let H be a discrete subgroup. Thus for some ∆y > 0 the group H consists
of the multiples n∆y of ∆y. We think of this subgroup as consisting of
uniformly spaced sampling points. Let Q be the quotient group, where we
identify multiples of ∆y with zero.
The group G has a dual group Ĝ. The elements of Ĝ that are multiples
of 2π/∆y form the dual group Q̂, which is a subgroup of Ĝ. The quotient
group, where we identify multiples of 2π/∆y with zero, turns out to be Ĥ.
These dual groups may all be thought of as consisting of angular frequencies.
We can summarize this situation in diagrams
H −→ G −→ Q (6.24)
and
Q̂ −→ Ĝ −→ Ĥ. (6.25)
The arrow between two groups means that elements of one group uniquely
determine elements of the next group. Furthermore, an element of the
group G or Ĝ that is determined by an element of the group on the left
itself determines the element 0 of the group on the right.
The first main example is when G is the reals, H is the subgroup of
integer multiples of ∆y, and Q is the circle of circumference ∆y. Then Ĝ is
the reals (considered as angular frequencies), Q̂ is the subgroup of multiples
of ∆r = 2π/∆y, and Ĥ is the circle of circumference ∆r.
The other main example is when G is the circle of circumference L =
N ∆y, H is the subgroup of order N consisting of integer multiples of ∆y,
and Q is the circle of circumference ∆y. Then Ĝ is the integers spaced by
∆k = 2π/L, Q̂ is the subgroup of multiples of ∆r = 2π/∆y = N ∆k, and
Ĥ is the group of order N consisting of multiples of ∆k mod N . In this
example integrals over Ĝ and Ĥ are replaced by sums.
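As a numerical illustration (the numbers are chosen here only for concreteness), take N = 8 and ∆y = 0.5. Then L = N∆y = 4, ∆k = 2π/L = π/2, and ∆r = 2π/∆y = 4π = N∆k.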
We begin with f defined on G. Its Fourier transform is
∫_G e^{−ikx} f(x) dx = f̂(k). (6.26)
Thus f_H(x) has contributions from all frequency bands, but with a confusing exponential factor in front.
However, note that when y is in H, then f_H(y) = f(y). Thus f_H interpolates f at the sampling points.
Another way of writing f_H(x) is as
f_H(x) = ∫_{Ĥ} e^{ikx} [ Σ_{y∈H} e^{−iky} f(y) ∆y ] dk/(2π) = Σ_{y∈H} K_H(x − y) f(y) ∆y, (6.36)
where
K_H(x) = ∫_{Ĥ} e^{ikx} dk/(2π). (6.37)
This formula expresses f_H(x) directly in terms of the values f(y) at the sampling points y.
Now assume in addition that the original Fourier transform f̂ is band-limited, that is, it vanishes outside of Ĥ. In that case it is easy to see that f_H(x) = f(x) for all x in G. This is the sampling theorem: A band-limited function is so smooth that it is determined by its values on the sampling points.
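In the first main example (a worked evaluation, not in the text) we may identify Ĥ with the interval from −π/∆y to π/∆y. Then (6.37) gives
K_H(x) = (1/(2π)) ∫_{−π/∆y}^{π/∆y} e^{ikx} dk = sin(πx/∆y)/(πx),
so that K_H(x − y)∆y = sin(π(x − y)/∆y)/(π(x − y)/∆y). The interpolation formula (6.36) is then the classical expansion of a band-limited function in shifted sinc functions centered at the sampling points.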
6.9 FFT
It is clear that the obvious implementation of the Fourier transform on the
cyclic group of order N amounts to multiplying a matrix times a vector and
hence has order N 2 operations. The Fast Fourier Transform is another way
of doing the computation that only requires order N log2 N operations.
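For concreteness, here is a minimal C sketch of the direct order N^2 computation (not from the text; it uses C99 complex arithmetic, and the function name dft is invented).

#include <complex.h>
#include <math.h>

/* Direct discrete Fourier transform on the cyclic group of order N:
   F[k] = sum over n of exp(-2 pi i k n / N) f[n].  Order N^2 work. */
void dft(int N, const double complex f[], double complex F[])
{
    const double pi = acos(-1.0);
    int k, n;

    for (k = 0; k < N; k++) {
        F[k] = 0.0;
        for (n = 0; n < N; n++)
            F[k] += cexp(-2.0 * pi * I * k * n / N) * f[n];
    }
}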
Again the setup is a group G and a subgroup H. Again we take Q to
be the quotient group. We have
f̂(k) = ∫_Q e^{−ikx} [ Σ_{y∈H} e^{−iky} f(x + y) ∆y ] dx/∆y. (6.38)
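Here is a minimal C sketch (not from the text) of one standard radix-2 arrangement of this idea, taking H to be the even-index subgroup so that Q has the two cosets represented by 0 and 1. It assumes N is a power of two and uses C99 complex arithmetic; the recursion and memory handling are my own choices, not the text's.

#include <complex.h>
#include <math.h>
#include <stdlib.h>

/* Recursive radix-2 FFT: split f into its values on the even and odd
   cosets, transform each half, and recombine.  Assumes N is a power of 2. */
void fft(int N, const double complex f[], double complex F[])
{
    if (N == 1) { F[0] = f[0]; return; }

    int half = N / 2, n, k;
    double complex *even = malloc(half * sizeof *even);
    double complex *odd  = malloc(half * sizeof *odd);
    double complex *E    = malloc(half * sizeof *E);
    double complex *O    = malloc(half * sizeof *O);

    for (n = 0; n < half; n++) {          /* split into the two cosets      */
        even[n] = f[2 * n];
        odd[n]  = f[2 * n + 1];
    }
    fft(half, even, E);                   /* two transforms of length N/2   */
    fft(half, odd, O);

    for (k = 0; k < half; k++) {          /* recombine with twiddle factors */
        double complex w = cexp(-2.0 * acos(-1.0) * I * k / N);
        F[k]        = E[k] + w * O[k];
        F[k + half] = E[k] - w * O[k];
    }
    free(even); free(odd); free(E); free(O);
}

Each level of the recursion does order N work and there are log2 N levels, which is the source of the N log2 N operation count.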