Ma580 Book

Download as pdf or txt
Download as pdf or txt
You are on page 1of 101

MA580: Numerical Analysis: I

Zhilin Li
2 Z. Li
Chapter 1

Introduction

• Why is this course important (motivations)? What is the role of this class in the
problem solving process using mathematics and computers?

• Model problems and relations with course materials.

• Errors (definition and how to avoid them)

In Fig.1.1, we show a flow chart of a problem solving process. In this class, we will focus
on numerical solutions using computers, especially the problems in linear algebra. Thus
this course can also be called ”Numerical Linear Algebra”.

1.1 A Model problem: Data fitting and interpolation

Give data
t0 t1 ··· tm
(1.1.1)
y0 y1 ··· ym
The data can be taken from a function y(t), or a set of observed data. We want to find a
simple function ya (t) to approximate y(t) so that we can predict y(t) everywhere.

Approach I: Polynomial approximation.

Let
y(t) ≈ a0 + a1 t + a2 t2 + · · · + an tn . (1.1.2)
We need to find the coefficients a0 , a1 , a2 , · · · , an so that we can have an analytic expression.
n = 0 : constant approximation
n = 1 : linear regression
We should choose the coefficients in such a way that
n = 2 : quadratic regression
··· ···

3
4 Z. Li

Real Problem Physical Laws Mathematical/physical


/other approach Models

Analytic/Exact
Solution Techniques
Approximated

Use Computers

Interpret Solution Applications

visualization Products
Experiments
Prediction
Better Models

Figure 1.1: A flow chart of a problem solving process.


Numerical Analysis: I 5

they can match the data at the sample points1 . Thus we have

t = t0 : 

 a0 + a1 t0 + a2 t20 + · · · + an tn0 = y0
a0 + a1 t1 + a2 t21 + · · · + an tn1 = y1

t = t1 : 
(1.1.3)
··· 

 ·········
a0 + a1 tm + a2 t2m + · · · + an tnm = ym

t = tm : 

This is a linear system of equations for the unknown coefficients ai , i = 0, · · · , n. In the


matrix-vector form, it is
    
1 t0 t20 ··· tn0 a0 y0
1 t1 t21 ··· tn1 a1 y1
    
    
 .. .. .. .. ..  .. = .. . (1.1.4)
. . . . . . .
    
    
1 tm t2m · · · tnm an ym

We can simply write it as Ax = b, where A is an (m + 1) × (n + 1) matrix, and x =


h iT h iT
a0 a1 · · · an is an n by 1 column vector, and y0 y1 · · · ym is an (m+1)× 1
column vector. We distinguish the following cases assuming that ti are distinct (ti 6= tj ).

• m > n, that is, we have more equations than the unknowns. The system is over-
determined and we can only find the the best solution, for example the least squares
solution. Such a problem is a curve-fitting problems. When n = 1, it is also called
linear regression.

• m = n, we have the same number of equations and unknowns. There is a unique


solution to the linear system of equations. Such a problem is called an interpolation
because ya (t) will pass all the selected data.

• m < n, we have fewer equations than the unknowns. The system is under-determined
and we can find infinite number of the solutions. Often we prefer the SVD solution
which has the least length among all the solutions.

Note that the coefficient matrix is dense (not many zero entries) in this application.

1.2 A non-linear model

Not all the functions are polynomials, if we want to approximate y(t) by

ya (t) ≈ αeβt (1.2.1)


1
Not always possible.
6 Z. Li

for example, where α and β are unknowns, we would have





 αeβt0 = y0
 αeβt1 = y

1


 ······
 αeβtm = y

m

We get a non-linear system of equations for α and β.

1.3 Finite difference method and linear system of equations

The modern computers are designed in part to solve practical problems such as weather
forecast, computing the lift and drag of airplanes, missiles, space shuttle etc. In these
application, often partial differential equations, particularly the Navier-Stokes equations
are used to model the physical phenomena. How to solve three dimensional Navier-Stokes
equations is a still challenge today.
To solve the Navier-Stokes equations, often finite difference or finite element methods are
used. In a finite difference method, the partial derivatives are replaced by finite difference,
which is a combination of the function values. Such a process is called finite difference
discretization. After the discretization, the system of partial differential equations because a
system of algebraic system of equations either linear or non-linear. We use a one-dimensional
example here to illustrate the idea. Consider the two-point boundary value problem

d2 u(x)
= f (x), 0<x<1 (1.3.1)
dx2
u(0) = u(1) = 0. (1.3.2)

In a finite difference method, we try to an approximate solution of u(x) at a finite number


of points (not everywhere). The procedure is as follows:

• Generate a grid. For example, we can select n equally spaced points between 0 and
1 to find the approximate solution of u(x). The the spacing between two points is
h = 1/n, and these points are xi = ih, i = 0, 1, · · · , n with x0 = 0 and xn = 1. We
look for an approximate solution of u(x) at x1 , x1 , · · · , xn−1 . Note that we know
u(0) = u(x0 ) = 0 and u(1) = u(xn ) = 0 already from the boundary condition.

• Replace the derivative by a finite difference formula. It can be proved that

u(x − h) − 2u(x) + u(x + h) d2 u(x)


lim = . (1.3.3)
h→0 h2 dx2
Or
u(x − h) − 2u(x) + u(x + h) d2 u(x) h2 d4 u(x)
= + (1.3.4)
h2 dx2 12 dx4
Numerical Analysis: I 7

At every grid points xi , i = 0, 1, · · · , n, we use the above formula ignoring the high
order terms to get

u(x1 − h) − 2u(x1 ) + u(x1 + h) d2 u(x1 )


≈ = f (x1 )
h2 dx21
u(x2 − h) − 2u(x2 ) + u(x2 + h) d2 u(x2 )
≈ = f (x2 )
h2 dx22
······

u(xi − h) − 2u(xi ) + u(xi + h) d2 u(xi )


≈ = f (xi )
h2 dx2i
······
u(xn−1 − h) − 2u(xn−1 ) + u(xn−1 + h) d2 u(xn−1 )
≈ = f (xn−1 )
h2 dx2n−1

If we replace the ’≈’ with the ’=’ sign, and replace u(xi ) which we do not know with
Ui which is the solution to the linear system of equations:
0 − 2U1 + U2
= f (x1 )
h2
U1 − 2U2 + U3 )
= f (x2 )
h2
······

Ui−1 − 2Ui + Ui+1


= f (xi )
h2
······
Un−2 − 2Un−1 + 0
= f (xn−1 )
h2

This system of equations Ax = b can be written as the matrix and vector form:
    
− h22 1
h2
U1 f (x1 ) − uha2
    
 1
 h2 − h22 1    
h 2   U 2   f (x2 ) 
    
    
1

 h2
− h22 h12   U3  
   f (x3 ) 

  =  (1.3.5)
.. .. ..  .   ..
  ..  
 

 . . .    . 

    
1
− h22 1
  Un−2   f (xn−2 )
    
 h2 h2 
    
1
h2
− h22 Un−1 f (xn−1 ) − hu2b
8 Z. Li

with u0 = 0 and ub = 02 .

• Solve the system of equations to get the approximate solution at each grid point.

• Implement and debug the computer code. Run the program to get the output. Ana-
lyze the results (tables, plots etc.).

• Error analysis: method error using finite difference to approximate derivative; machine
error (round-off errors).

We will discuss how to use computer to solve the system of equations. Note that the
coefficient matrix has special structure: tri-diagonal and most entries are zero. In two space
dimensions, a useful partial differential equation is the Poisson equation

uxx + uyy = f (1.3.6)

For a rectangular grid and uniform mesh, the finite differential equation at a grid point
(xi , yj ) is  
Ui−1,j + Ui+1,j Ui,j−1 + Ui,j+1 2 2
+ − + Uij = fij (1.3.7)
(hx )2 (hy )2 (hx )2 (hy )2
For a general n by n grid, we will have
   
B I −4 1
   
I B I  1 −4 1
   
1   
A= 2 , B= .
   
h  .. .. .. .. .. ..
 . . . 


 . . . 

   
I B 1 −4

Note that the size of A is (n − 1)2 × (n − 1)2 . If we take n = 101, we will have one
million unknowns. A very large number that most laptop today may not be able to handle
if we store all the entries. For such a system of equations, an iterative method or sparse
matrix techniques are preferred. We will get back to this later.

1.4 Continuous and discrete eigenvalue problems

Consider the differential equation

u00 + λu = 0, 0 < x < 1, u(0) = 0, u(1) = 0. (1.4.1)

The solution is not unique. One particular solution is

u(x) = sin(kπx) (1.4.2)


2
For non-homogeneous boundary condition, we can plug in the value u(0) and u(1) in the place of u0 = 0
and ub .
Numerical Analysis: I 9

corresponding to the eigenvalue λk = (kπ)2 for k = 1, 2, · · · . The significance of the eigen-


functions and eigenvalues is the basis of the Fourier series and theory. For example, for
non-homogeneous differential equation,

u00 = f (x), 0 < x < 1, u(0) = 0, u(1) = 0. (1.4.3)

The solution can be expressed as



X
u(x) = αk sin(kπ), (1.4.4)
k=1

with R1
1 0 f (x) sin(kπx)dx
αk = − R1 2 . (1.4.5)
(kπ)2
0 sin (kπx)
See an ordinary differential equation text book for the details. More general eigenvalue of
this type would be

−(pu0 )0 + qu + λu = 0, p(x) ≥ p0 > 0, q(x) ≥ 0 (1.4.6)

with a more general boundary condition.


The associate numerical method is the spectral method. For example, if we apply the
finite difference method as discussed earlier, we would have
Ui−1 − 2Ui + Ui+1
+ λUi = 0, , i = 1, 2, c . . . n − 1. (1.4.7)
h2
In the matrix vector form, it is Ax = (−λ) x, which is an algebraic eigenvalue problem. The
matrix A is given in (3.0.1).

1.5 Objectives of the class

• Learn how to solve these problems (mainly linear algebra) efficiently and reliably on
computers.

• Method and software selection: what to use for what problems.

• Related theory and analysis.

• Preparation for other courses such as MA780, MA584, MA587 etc.

1.6 A computer number system

We want to use computers to solve mathematical problems and we should know the number
system in a computer.
10 Z. Li

A primitive computer system is only part of a real number system. A number in a


computer system is represented by

fraction exponential
xc = ±. d1 d2 · · · dn β s , 0 ≤ di ≤ β − 1, −Smax ≤ s ≤ Smax (1.6.1)
sign mantissa base

Below are some examples of such numbers

−0.111125 (binary, β = 2), −0.3142100 , −0.0314100 (1.6.2)

The choice of base number is

β=2 binary primitive


β=8 octal used for transition
β = 10 decimal custom & convenient
β = 16 hexadecimal save storage

The float number is denoted by f l(x). We can see that the expression of a floating
number is not unique. To get a unique expression, it is often designed that d1 6= 0 if xc is
a non-zero number. Such a floating number is called a normalized floating number. The
number zero is expressed as 0.00 · · · 0β 0 . Note that one bite is used to represent the sign in
the exponential.
Often there are two number systems in a programming language for a particular com-
puter, single precision corresponding to 32 bits; and double precision for 64 bits.
In a 32 bites computer number system, we have

sign exponential fraction


1 8 23

In a 64 bites computer number system, we have

sign exponential fraction


1 11 52

1.6.1 Properties of a computer number system

• It is a subset of the real number system with finite number of floating numbers. For
a 32-bit system, the total numbers is roughly 2β n (Smax − Smin + 1) − 1.

• Even if x and y are in the computer number system, their operations, for example,
f l(xy), can be out of the the computer number system.
Numerical Analysis: I 11

• It has the maximum and minimum numbers; and maximum and non-zero minimum
magnitude. For a 32-bit system, the largest and smallest numbers can be calculated
from the following:

the largest exponential: = 20 + 21 + · · · + 26 = 27 − 1 = 127

the largest fraction = 0.1111 · · · 1 = 1 − 2−23

the largest positive number = 2127 (1 − 2−23 ) = 1.7014116 × 1038 .

The smallest number is then −1.7014116 × 1038 . The smallest positive number (or
smallest magnitude) is

0.000 · · · 1 × 2−127 = 2−127−23 = 7.0064923 × 10−46 (1.6.3)

while the smallest normalized magnitude is

0.100 · · · 0 × 2−127 = 2−128 = 2.9387359 × 10−39 . (1.6.4)

Overflow and underflow

If a computer system encounter a number whose magnitude is larger than the largest
floating number of the computer system, it is called OVERFLOW. This often happens
when a number is divided by zero, for example, we want to compute s/a but a
is undefined, or evaluate a function outside of the definition, for example, log(−5).
Computers often returns symbol such as N AN , inf , or simply stops the running
process. This can also happen when a number is divided by a very small number.
Often an overflow indicates a bug in the coding and should be avoided.

If a computer system encounter a number whose magnitude is smaller than the small-
est positive floating number of the computer system, it is called underflow. Often,
the computer system can set this number as zero and there is no harm to the running
process.

• The numbers in a computer number system is not evenly spaces. It is more clustered
around the origin and get sparser far away.

While a computer system is only a subset of the real number system, often it is good
enough if we know how to use it. If a single precision system is not adequate, we can use
the double precision system.
12 Z. Li

1.7 Round-off errors and floating point arithmetics

Since computer number system is only a subset of a real number system, errors (called
round-off errors) are inevitable when we solver problems using computers. The question
that we need to ask is how the errors affect the final results and how to minimize the
negative impact of errors.

Input errors

When we input a number into a computer, it is likely to have some errors. For example,
the number π can be represented exact in a computer number system. Thus a floating
number of expression of π denoted as f l(π) is different from π. The first question is, how
a computer system approximate π. The default is the round-off approach. Let us take
the cecimal system as an example. Let x be a real number that is in the range of the
computer number system in terms of the magnitude, and we express it as a normalized
floating number
x = 0.d1 d2 · · · dn dn+1 · · · × 10b , d1 6= 0. (1.7.1)
The floating number if the computer system using the round-off approach is

 0.d1 d2 · · · dn × 10b , if dn+1 ≤ 4,
f l(x) = (1.7.2)
 0.d d · · · (d + 1) × 10b , if d
1 2 n n+1 ≥ 5

1.7.1 Definition of errors

The absolue error is defined as the difference between the true value and the approxi-
mated,
absolute error = true - approximated. (1.7.3)
Thus the error for f l(x) is

absolution error of f l(x) = x − f l(x). (1.7.4)

The absolution error is often simply called the error.


Absolution error may not reflect the reality. One picks up 995 correct answers from
1000 problems certainly is better than the one that picks up 95 correct answers from 100
problems although both of the errors are 5. A more realistic error measurement is the
relative error which is defined as
absolution error
relative error = , (1.7.5)
true value
for example, if x 6= 0, then
x − f l(x)
relative error of f l(x) = . (1.7.6)
x
Numerical Analysis: I 13

Obviously, for different x, the error x − f l(x) and the relative error are different. How
do we then characterize the round-off errors? We seek the upper bounds, or the worst case,
which should apply for all x’s.
For the round-off approach, we have

 0.00 · · · 0dn+1 · · · × 10b , if dn+1 ≤ 4,
|x − f l(x)| =
 0.00 · · · 0(1 − d b
n+1 ) · · · × 10 , if dn+1 ≥ 5

1
≤ 0.00 · · · 05 × 10b = 10b−n ,
2
which only depends on the magnitude of x. The relative error is
1 1
|x − f l(x)| 10b−n 10b−n 1 def ine
≤ 2 ≤ 2 b
= 10−n+1 ==  = machine precision (1.7.7)
|x| |x| 0.1 × 10 2

Note that the upper bound of the relative error for the round-off approach is independent
of x, only depends on the computer number system. This upper bound is called the machine
precision, or machine epsilon, which indicates the best accuracy that we can expect using
the computer number system.
In general we have
|x − f l(x)| 1
≤ β −n+1 (1.7.8)
|x| 2
for any base β. For a single precision computer number system (32 bits) we have3
1
 = 2−23+1 = 2−23 = 1.192093 × 10−7 . (1.7.9)
2
For a 64-bits number system (double precision), we have
1
 = 2−52+1 = 2−52 = 2.220446 × 10−16 . (1.7.10)
2
Relative error is closely associate with the concept of the significant digits. In general,
if a relative error is of order 10−5 , for example, it is likely the result has 5 significant digits.
An approximate number can be regarded as a perturbation of the true vales according
to the following theorem.

Theorem 1.7.1 If x ∈ R, then f l(x) = x(1 + δ), |δ| ≤  ia the relative error.

There are other ways to input a number into a computer system. Two other approaches
are rounding, and chopping, in which we have

f l(x) = 0.d1 d2 · · · dn × 10b (1.7.11)


3
Accoding to the definition, the  ≈ 1.2 × 10−7 which is the same as Kahaner, Moler, and Nash’s book,
but it twice as larger as the result given in Demmel’s book, which we think it is wrong.
14 Z. Li

for chopping and


f l(x) = 0.d1 d2 · · · (dn + 1) × 10b . (1.7.12)

The errors bounds are twice as much as the round-off approach.

1.7.2 Error analysis of computer arithmetics

The primitive computer arithmetic only include addition, subtraction, multiplication, di-
vision, and logical operations. Logic operations do not generate errors. But the basic
arithmetic operations will introduce errors. The error bounds are given in the following
theorem.

Theorem 1.7.2 If a and b are two floating numbers in a computer number system, f l(a◦b)
is in the range of the computer number system, then

f l(a ◦ b) = (a ◦ b) (1 + δ) , ◦: + − × ÷ (1.7.13)

where
|δ| = |δ(a, b)| ≤ , (1.7.14)

we also have
√ √
f l( a) = a (1 + δ) . (1.7.15)

Note that δ is the relative error if (a ◦ b) 6= 0 of the operations and is bounded by the
machine precision. This is because

f l(a ◦ b) − (a ◦ b) = δ(a ◦ b) absolution error


f l(a ◦ b) − (a ◦ b)
= δ, |δ| ≤ .
(a ◦ b)

We conclude that the arithmetic operations within a computer number system give the
’best’ results that we can possible get. Does it mean that we do not need to worry about
round-off errors at all? Of course not!

1.7.3 Round-off error analysis and how to avoid round-off errors

Now we assume that x and y are two real numbers. When we input them into a computer,
we will have errors. First we consider the multiplications and divisions

f l(x ◦ y)) = f l (f l(x) ◦ f l(y)) = f l (x(1 + x ) ◦ y(1 + y )) , |x | ≤ , |y | ≤ ,

= (x(1 + x ) ◦ y(1 + y )) (1 + x◦y ), |x◦y | ≤ .

Note that x , y , and x◦y are different numbers although they have the same upper bound!
We distinguish several different cases
Numerical Analysis: I 15

• Multiplication/division (◦ = × or ÷), take the multiplication as an example, we have

f l(xy)) = x(1 + x )y((1 + y )(1 + xy ) = xy 1 + x + y + xy + O(2 )




= xy(1 + δ), |δ| ≤ 3.

Often we ignore the high order term (h.o.t) since they one much smaller (for single
precision, we have 10−7 versus 10−14 ). Thus delta is the relative error as we mentioned
before. The error bound is understandable with the fact of 3 with two input errors
and one from the multiplication. The absolute error is −xyδ which is bounded by
3|xy|. The same bounds hold for the division too if the divisor is not zero. Thus the
errors from the multiplications/divisions are not big concerns here. But we should
avoid dividing by small numbers if possible.

• Now we consideration a subtraction ◦ ( an addition can be treated as a subtraction


since a + b = a − (−b) or vise versa.). Now we have

f l(x − y)) = (x(1 + x ) − y((1 + y )) (1 + xy )

= x − y + xx − yy + (x − y)xy + O(2 ).

The absolution error is

(x − y) − f l(x − y) = −xx + yy − (x − y)xy

|(x − y) − f l(x − y)| = |x| + |y|y + |x − y|

which does not seem to be too bad. But the relative error may be unbounded because

|(x − y) − f l(x − y)| xx − yy − (x − y)xy
= + O(2 )
|x − y| x−y

|xx − yy |
≤ + .
|x − y|

In general, x 6= x even though they are very small and have the same upper bound.
Thus the relative error can be arbitrarily large if x and y are very close! That means
the addition/subtraction can lead the loss of accuracy, or significant digits. It is
also called catastrophic cancellation as illustrate in the following example

0.31343639 − 0.31343637 = 0.00000002.

If the last two digits of the two numbers are wrong (likely in many circumstance),
then there is no significant digit left in the result. In this example, the absolute error
is till small, but the relative error is very large!
16 Z. Li

Round-off error analysis summary

• Use formulas f l(x) = x(1 + 1 ), f l(x ◦ y) = (x ◦ y)(1 + 2 ) etc.

• Expand and collect terms.

• Ignore high order terms.

1.7.4 An example of accuracy loss

Assume we want to solve a quadratic equation ax2 + bx + c = 0 on a computer, how do we


do it? First of all, we need to write down the algorithm mathematically before coding it.
From Vieta’s roots relations, x1 + x2 = −b/a, x1 x2 = c/a, we know that there are at least
three methods.
√ √
−b + b2 − 4ac −b − b2 − 4ac
• Algorithm 1: x1 = , x2 = ,
2a 2a

−b + b2 − 4ac c
• Algorithm 2: x1 = , x2 = ,
2a ax1

−b + − b2 − 4ac c
• Algorithm 3: x1 = , x2 = .
2a ax1

Mathematically, there are all equivalent (thus they are all call consistent). But occasionally,
they may give very different results especially if c is very small. When we put select the
algorithm to run on computer, we should choose Algorithm 2 if b ≤ 0 and Algorithm 3 if
b ≥ 0, why? This can be done using if · · · then conditional expression using any computer
language.
Let us check with the simple case a = 1, b = 2, x2 + 2x + e = 0. When e is very small,
we have

−2 − 4 − 4e √
x1 = = −1 − 1 − e ≈ −2
2

−2 + 4 − 4e √ e
x2 = = −1 + 1 − e = √ ≈ −0.5e
2 −1 − 1 − e
The last equality was obtained by rationalize to the denomenator.
Below is a Matlab code to illustrate four different algorithms:

function [y1,y2,y3,y4,y5,y6,y7,y8] = quad_err(e)

y1 = (-2 + sqrt(4 - 4*e))/2; y2 = (-2 -sqrt(4 - 4*e))/2;


y3 = (-2 + sqrt(4 - 4*e))/2; y4 = e/y3;
y5 = (-2 -sqrt(4 - 4*e))/2; y6 = e/y5;
Numerical Analysis: I 17

y7 = -4*e/(-2 -sqrt(4 - 4*e))/2; y8 = e/y7;

From input various e, we can see how the accuracy gets lost. In general, when we have
e = 210−k , we will lose about k significant digits.

1.7.5 How to avoid loss of accuracy?

• Use different formula, for example,


p 4ac
−b + b2 − 4ac = − √ if b > 0.
b + b2 − 4ac

• Use Taylor expansion, for examples,

x2 x4 x2
 
1 − cos x = 1 − 1 − + − ··· ≈ .
2 4! 2
f 00 (x) f 000 (x)
f (x + h) − f (x) = hf 0 (x) + h2 + h3 + ···
2 3!

• Another rule of thumb for summations f l( ni=1 )xi . We should add those numbers
P

with small magnitude first to avoid ”large numbers eat small numbers”.

1.8 Some basic algorithms and Matlab codes


Pn
• Sum i=1 xi :

s=0; % initialize
for i=1:n
s = s + a(i); % A common mistake is forget the s here!
end

Qn
• Product i=1 xi :

s=1; % initialize
for i=1:n
s = s * a(i); % A common mistake is forget the s here!
end
18 Z. Li

Example: Matrix-vector multiplication y = Ax

In Matlab, we can simply use y = A ∗ x. Or we can use the component form so that we
can easily convert the code to other computer languages. We can put the following into a
Matlab .m file, say, test Ax.m with the following contents:

n=100; A=rand(n,n); x=rand(n,1); % Generate a set of data


for i=1:n;
y(i) = 0; % initialize
for j=1:n
y(i) = y(i) + A(i,j)*x(j); % Use ’;’ to compress the outputs.
end
end

We wish to develop efficient algorithms (fast, less storage, accurate, and easy to pro-
gram). Note that Matlab is case sensitive and the index of arrays should be positive integers
(can not be zero).

1.8.1 Hornet’s algorithm for computing a polynomial

Consider a polynomial pn (x) = an xn + an−1 xn−1 + · · · + a1 x + a0 . How do we evaluate its


value at a point x. We can store the coefficients first in an array, say, a(1), a(2), · · · a(n + 1)
since we can not use a(0) in Matlab, then we can evaluate p( x) using

p = a(1);
for i=1:n
p = p + a(i+1)*x^(i+1);
end

The total number of operations are about O(n2 /2) multiplications, and n additions. Howere,
from the following observations

p3 (x) = a3 x3 + a2 x2 + a1 x + a0

= x(a3 x2 + a2 x + a1 ) + a0

= x(x(a3 x + a2 ) + a1 ) + a0

we can form the Hornet’s algorithm (pseudo-code)

p = a(n)
for i=n-1,-1,0
p = x*p + a(i)
endfor
Numerical Analysis: I 19

which only requires n multiplications and additions!

1.9 Exercises

1. Let x = 0.d1 d2 · · · dn dn+1 · · · β b , d1 6= 0, 0 ≤ di ≤ β − 1. We can use the chopping


method to express x as a floating number

f lc (x) = 0.d1 d2 · · · dn β b

Find upper bounds of the absolute and relative errors of f lc (x) approximating x.
Compare the results with the results obtained from the rounding-off approach.

2. Let F be a computer number system of 64 bits. Find the following

(a) The largest and smallest number.


(b) The smallest normalized positive number.
(c) The smallest positive number.
(d) Give examples of underflow and overflow.
(e) The machine precision.
(f) Find upper bounds of the absolute and relative errors of f lc (x) approximating x
using the rounding approach.

Note that the specifics may differ slightly with different computers and compilers.

3. Assume we use a computer to evaluate the following expressions

(a) p = xyz, (b) s = x + y + z,

where x, y, and z are real numbers. Find upper bounds of absolute and relative errors.
Assume all the numbers involved are in the range of the computer number system.
Analyze the error bounds.
(HINT: You can set x1 = f l(x), y1 = f l(y), z1 = f l(z), p1 = f l(x1 y1 ), pc = f l(p1 z1 )
is the computed product of x, y, and z. (Note: Pay attention to the upper bounds
and absolute values, e.g., δ5 ≤ 5 is wrong, it should be |δ5 | ≤ 5.)

4. Design an algorithm (in pseudo-code form) to evaluate the following

(a) log(1 + x)/x in the interval [−0.5, 0.5].



(b) b − b2 − δ, where b and δ are two parameters with b2 − δ ≥ 0.
(c) ∇φ(x)/|∇φ(x)| where φ(x, y) is a scalar function of x and y.
20 Z. Li

You need to consider all possible scenarios.

5. Which of the following two formulas in computing π is better?


 
1 1 1 1
π = 4 1 − + − + + ···
3 5 7 9
0.53 3(0.5)5 3 · 5(0.5)7
 
π = 6 0.5 + + + + ··· .
2·3 2·4·5 2·4·6·7
How many terms should be chosen such that the error is less than 10−6 ? You can
write a short Matlab code to compare. Consider both accuracy and speed.

6. We can use the following three formulas to approximate the first derivative of a func-
tion f (x) at x0 .
f (x0 + h) − f (x0 )
f 0 (x0 ) ≈
h
f (x0 + h) − f (x0 − h)
f 0 (x0 ) ≈
2h
f (x0 ) − f (x0 − h)
f 0 (x0 ) ≈
h
When we use computers to find an approximation of a derivative (used in finite dif-
ference (FD) method, optimization, and many areas, we need to balance the errors
from the algorithm (truncation error) and round-off errors (from computers).

(a) Which formula is the most accurate in theory? Hint: Find the absolute error
using the Taylor expansion at x = x0 : f (x0 ±h) = f (x0 )±f 0 (x0 )h+f 00 (x0 )h2 /2±
f 000 (x0 )h3 /6 + O(h4 ).
(b) Write a program to compute the derivative with
• f (x) = x2 , x0 = 1.8.
• f (x) = ex sin x, x0 = 0.55.
Plot the errors versus h using log-log plot with labels and legends if necessary. In
the plot, h should range from 0.1 to the order of machine constant (10−16 ) with
h being cut by half each time (i.e., h = 0.1, h = 0.1/2, h = 0.1/22 , h = 0.1/23 ,
· · · , until h ≤ 10−16 .)
Hint: You need to find the true derivative (analytic) values in order to compute
and plot the errors.

Tabulate the absolute and relative errors corresponding to h = 0.1, 0.1/2, 0.1/4,
0.1/8, and 0.1/16 (that is, difference choices of h compared with that used in
the plots). The ratio (should be around 2 or 4) is defined as the quotient of two
consecutive errors. Analyze and explain your plots and tables. What is the
best h for each case with and without round-off errors?
Numerical Analysis: I 21

1/h error (a) ratio error (b) ratio error (c) ratio
10 – – –
20
40
80
160

The ratio is defined as, for example

|error for n = 10|


ratio = .
|error for n = 20|

7. Mini-project: Find the relation between relative errors and significant digits.
22 Z. Li
Chapter 2

Vector and matrix norms

Vector and matrix norms are generalization of the function of absolute value for single
variable. There are at least two motivations to use them.

• Error estimates and analysis. How do we measure f l(x)−x if x is a vector or f l(A)−A


when A is a matrix?

• Convergence analysis for iterative methods? How do we know a vector sequence xk


converges or not.

2.1 Definition of a vector norm

Give a Rn space

Rn = {x, x = [x1 , x2 , · · · , xn ]T }

A vector norm is defined as a multi-variable function f (x) of its components x1 , x2 , · · · , xn


satisfying

1. f (x) ≥ 0 for ∀x ∈ Rn , and f (x) = 0 if and only (⇔) if x = 0.

2. f (αx) = |α|f (x).

3. f (x + y) ≤ f (x) + f (y). It is called the triangle inequality.

If a function f (x) satisfies (1)-(3), we use a special notation f (x) = kxk and call this
function a norm in Rn .

An example of a vector norm

Let
f (x) = max {|xi |}
1≤i≤n

23
24 Z. Li

. Is f (x) a vector norm?


• It is obvious that f (x) ≥ 0. If x = 0, then all xi = 0, so max{|xi |} = 0, that is
f (x) = 0. On the other hand, if f (x) = max{|xi |} = 0. That is, the largest magnitude is
zero, which means all the components have to be zero. We conclude x = 0.

f (αx) = max {|αxi |} = max {|α| |xi |} = |α| max {|xi |} = |α|f (x)
1≤i≤n 1≤i≤n 1≤i≤n

f (x + y) = max {|xi + yi |} = max {|xi | + |yi |}


1≤i≤n 1≤i≤n
≤ max {|xi |} + max {|yi |} = f (x) + f (y)
1≤i≤n 1≤i≤n

Therefore f (x is a vector and it is called the infinity norm, or maximum norm. It is


denoted as f (x = kxk∞ .

Are the following functions vector norms?

• f (x) = max {|xi |} + 5. No, since f (0) 6= 0.


1≤i≤n

max1≤i≤n {|xi |}
• f (x) = .No, since f (αx) 6= |α|f (x), and f (αx) has no definition for
x21
those vectors whose first component is zero.
( n )1/2
X
• f (x) = x2i . Yes, it is called 2-norm, or Euclidian norm, it is denoted as
i=1
f (x) = kxk2 . The sketch of the proof is given below.

Sketch of the proof for 2-norm

Proof: (1) and (2) are obvious. The triangle inequality is


( n )1/2 ( n )1/2 ( n )1/2
X X X
2 2 2 2
(xi + yi ) ≤ xi + yi or
i=1 i=1 i=1

n n n
( n )1/2 ( n )1/2
X X X X X
(x2i + yi2 ) ≤ x2i + yi2 +2 x2i yi2 or
i=1 i=1 i=1 i=1 i=1

n
( n )1/2 ( n )1/2
X X X
2 xi yi ≤ x2i yi2 .
i=1 i=1 i=1
Numerical Analysis: I 25

The last inequality is the Cauchy-Schwarz inequality. To prove this inequality, we consider
a special quadratic function
n
X
g(λ) = (xi − λyi )2 ≥ 0
i=1
n
X n
X n
X
= x2i − 2λ x i yi + λ 2
yi2
i=1 i=1 i=1
n
X n
X n
X
2
= c + bλ + aλ , a= yi2 , b = −2 xi yi , c= x2i .
i=1 i=1 i=1

The function g(λ) is a non-negative function, and the quadratic equation g(λ) = 0 has at
most one real root or no roots. Therefore the discriminant should satisfy b2 − 4ac ≤ 0, that
is
n
!2 n n
X X X
4 x i yi − 4 yi2 x2i ≤ 0.
i=1 i=1 i=1

This is equivalent to
n v n n
X u X X
u 2
x i yi ≤ yi x2i .
t


i=1 i=1 i=1

This concludes
n

n
(
n
)1/2 ( n )1/2
X X X X
2 2
x i yi ≤ x i yi ≤ xi yi .


i=1 i=1 i=1 i=1

There are different Cauchy-Schwarz inequality in different space, for example, L2 space,
Sobolev space, Hilbert space etc. The proof process is similar. A special case is yi = 1 for
which we have
( )1/2 ( n )1/2 v
n n u n
X X
2
X
2 √ uX
xi ≤ xi 1 = nt x2i .



i=1 i=1 i=1 i=1

2.1.1 1-norm and Lp norms


n
X Pn
The function f (x) = |xi | is a vector. It is called 1-norm: kxk = i=1 |xi |.
i=1

An example:
!
−5
Let x = , find kxkp , p = 1, 2, ∞.
1
26 Z. Li

The solution is kxk1 = 6, kxk∞ = 5, kxk2 = 26. Note that, we have kxk∞ ≤ kxk2 ≤
kxk1 . This is true for any x.
In general, we can define the p-norm for p ≥ 1,
n
!1/p
X
kxkp = |xi |p .
i=1

2.1.2 Some properties of vector norms

• There are infinity number of vector norms (not unique).

• All vector norms are equivalent in finite dimensional space. That is, give any two
vector norms, say kxkα and kxkβ , there are two constant Cαβ and cαβ such that

cαβ kxkα ≤ kxkβ ≤ Cαβ kxkα

Note that, from the above inequality, we immediately have


1 1
kxkβ ≤ kxkα ≤ kxkβ .
Cαβ cαβ

Note that the constants are independent of x but may depend on the dimension (n).
Thus the one, two, and infinity norm of a vector norms are all equivalent. What are the
smallest (C) and the largest c constants?

Theorem 2.1.1
||x||1 √ √ √
≤ ||x||∞ ≤ ||x||2 ≤ n ||x||∞ ≤ n ||x||2 ≤ n ||x||1 .
n

As an illustration, we prove

||x||2 ≤ ||x||1 ≤ n ||x||2

Proof:
n
!2 n
X X X
||x||21 = |xi | = |xi |2 + 2 |xi ||xj |
i=1 i=1 i<j
X
= ||x||22 + 2 |xi ||xj | ≥ |x||22 .
i<j

On the other hand, from the Cauchy-Schwarz inequality, we have already known that
v
n u n
X √ u X √
||x||1 = |xi | · 1 ≤ n t x2i = n||x||2 .
i=1 i=1
Numerical Analysis: I 27

2.1.3 Connection between the inner product and ||x||2

The inner product of two vectors x ∈ Rn , y ∈ Rn is defined as


n
X
(x, y) = xi yi .
i=1

Particularly, if y = x, we have
n
X n
X
(x, y) = xi xi = x2i = ||x||2 .
i=1 i=1

From Cauchy-Schwarz inequality, we also have



n
(
n
)1/2 ( n )1/2
X X X
|(x, y)| = x i yi ≤ x2i yi2 = ||x||2 ||y||2


i=1 i=1 i=1

2.2 Matrix norms

There are two definitions of a matrix norms. The first one is to use the same definition as
a vector norm.
Definition: A matrix norm is a multi-variable function of its entries that satisfy the
following relations:

1. f (A) ≥ 0 for ∀A ∈ Rm×n , and f (A) = 0 if and only (⇔) if A = 0.

2. f (αA) = |α|f (A).

3. f (A + B) ≤ f (A) + f (B). It is called the triangle inequality.

If a function f (A) satisfies (1)-(3), we use a special notation f (x) = kAk and call this
function a norm in Rm×n .
Or alternatively, we can treat a matrix as a long vector, then use the definition of the
vector norm. For an example, if A ∈ Rm×n = {aij }, if we treat the matrix as a long
vector, either row-wise or column-wise, the 2-norm, now it is called Frobenius norm, of
the matrix is v
um X n
uX
kAkF = t a2ij . (2.2.1)
i=1 j=1

For an n by n identity matrix, we have kIk = n instead of kIk = 1 we may have expected.
Since matrices are often used along with vectors, e.g. Ax = b, Ax = λx, it is naturally
to define a matrix norm from a vector norm.
28 Z. Li

2.2.1 Associated (induced, subordinate) matrix norms

Theorem 2.2.1 Given a vector norm k · k in Rn space, the function of the matrix function
in A ∈ Rm×n space defined below
kAxk kAxk
f (A) = sup = max = max kAxk (2.2.2)
x6=0 kxk x6=0 kxk kxk=1

is a matrix norm in Rm×n space. It is called the associated (or induced, or subordinate)
matrix norm.
Proof:

• It is obvious that f (A) ≥ 0. If A = 0, then Ax = 0 for any x, so kAxk = 0, thus


f (A) = 0. On the other hand, if f (A) = 0, then we must have A = 0. Otherwise,
there would be one entry of A is non-zero, say aij 6= 0. We would conclude the vector
Aej , which the j-th column of A, is a non-zero vector since one of component is aij .
Therefore, kAej k =6 0, and f (A) ≥ kAej k/kej k ≥ 0 which contradicts the fact that
f (A) = 0.
kαAxk kAxk kAxk
• f (αA) = max = max |α| = |α| max = |α|f (A).
x6=0 kxk x6=0 kxk x6=0 kxk

• For any x, from the property of a vector norm, we have k(A + B)xk = kAx + Bxk ≤
kAxk + kBxk. Thus we have
 
k(A + B)xk kAxk kBxk
max ≤ max +
x6=0 kxk x6=0 kxk kxk
kAxk kBxk
≤ max + max
x6=0 kxk x6=0 kxk

≤ f (A) + f (B)

2.2.2 Properties of associated matrix norms

from the definition of associated matrix norms, we can conclude the following important
properties.
kIxk
• kIk = 1. This is obvious since max = 1.
x6=0 kxk
• kAxk ≤ kAk kxk for any x. It is obviously true if x = 0.
Proof: If x 6= 0, then we have
kAyk kAxk
kAk = max ≥ .
y6=0 kyk kxk
Multiplying kxk to both sides, we get kAk kxk ≥ |Axk.
Numerical Analysis: I 29

• kABk ≤ kAk kBk for any A and B that AB exists.


Proof: According to the definition, we have
kABxk kAkkBxk kBxk
max ≤ max ≤ kAk max = kAk kBk.
x6=0 kxk x6=0 kxk x6=0 kxk

2.2.3 Some commonly used matrix norms

For any vector norm, there is an associate matrix norm. Since we know how to evaluate
kxkp , p = 1, 2, ∞, we should know how to evaluate kAkp as well. It is not practical to use
the definition all the time.

Theorem 2.2.2
 
Xn n
X n
X n
X  n
X
kAk∞ = max |a1j |, |a2j |, · · · , |aij |, · · · , |amj | = max |aij | (2.2.3)
  i
j=1 j=1 j=1 j=1 j=1
( n n n n
) n
X X X X X
kAk1 = max |ai1 |, |ai2 |, · · · , |aij |, · · · , |ain | = max |aij | (2.2.4)
j
i=1 i=1 i=1 i=1 i=1

In other words, we add the magnitude of element in each rows, then we selected the largest
one, which is the infinity norm of the matrix A.
An example: Let
 
−5 0 7 12
A =  1 3 −1  5
 

0 0 1 1

6 3 9

Then kAk∞ = 12, kAk1 = 12.


Proof for kAk∞ k norm. Let
n
X
M = max |aij |.
i
j=1

The proof has two parts: (a), show that kAk∞ k ≤ M ; (b), find a specific x∗ , kx∗ k∞ = 1,
such that kAxk∞ k ≥ M . For any x, kxk∞ ≤ 1, we have
 P 
n
a 1j
 Pj=1
n 
j=1 a2j 
 
..

  n n n
 .  X X X
Ax =  Pn
 , |x j | ≤ 1, | aij xj | ≤ |a ij ||x j | ≤ |aij | ≤ M.
j=1 aij 

 j=1 j=1 j=1
 .. 

 . 

Pn
j=1 mj a
30 Z. Li

Now we conduct the second step. The largest sum of the magnitude has to be one (or
more) particular rows, say i∗ -th row. We choose
 
sign(ai∗ ,1 ) 
 sign(ai∗ ,2 ) 
   1
 x>0

x =  .. ,
 sign(x) = 0 x=0 , x sign(x) = |x|.
. 
−1 x < 0
  
sign(ai∗ ,n )
Thus we get
 
×
 .. 

 . 

 Xn  n
aij x∗j
X
kAx∗ k∞
 
= k
  k∞ =
 |aij | = M,
 j=1  j=1
 .. 
.
 
 
×
where × means some number. That completes the proof.

Theorem 2.2.3
q q q q 
kAk2 = max H H H H
λ1 (A A), λ2 (A A), · · · , λi (A A), · · · λn (A A)
q
= max λi (AH A),
i

where λi (AH A), i = 1, 2, · · · , n, are eigenvalues of AH A, AH is the conjugate transpose


of A.

The proof is left as an exercise. Note that even if A is a complex matrix, AH A or AAH
are symmetric semi-positive definite matrix. A symmetric semi-positive definite matrix B
(B = B H = {bij }) satisfies the following:

• For any x, xH Bx ≥ 0. Note that if B is a real matrix, then xH Bx =


P
ij bij xi xj .

• All eigenvalues of B are non-negative.

An example: Consider the following matrix:


 
1 −1
A= .
0 4

We have kAk∞ = 5, kAk1 = 4. To get kAk2 . We have


    
1 0 1 −1 1 −1
AT A =   = .
−1 4 0 4 −1 17
Numerical Analysis: I 31

Thus det(λI −T A) = (λ − 1)(λ − 17) − 1 = 0, orpλ2 − 18λ + 16 = 0 whose roots are


√ √
λ = (18 ± 182 − 64)/2. We conclude that kAk2 = 9 + 65 ≈ 4.1306. We can see that
kAk2 is more difficult to get. However, if A is symmetric, that is A = AH , then we can
show that

kAk2 = max{|λi (A)|}, if A is symmetric.


i

Furthermore, if det(A) 6= 0, that is A is invertible, then


1
kA−1 k2 = , if A is symmetric and det(A) 6= 0.
min{|λi (A)|}

Remark 2.2.1 Different norms have different applications. The 2-norm is the Euclidean
distance in one, two, and three dimensions and is differentiable which is important for many
optimization, extreme value problems. The infinity norm 1-norm are non-differentiable but
maybe important for some optimization problems to preserve important quantities such as
sharp edges, corners etc. Note also that the matrix norms kAk∞ and kAk∞ are easy to
compute while kAk2 is not. Finally, there is a subtle difference between kxk∞ and L∞ . The
later one is often referred to the integral norms Lp .

2.3 Exercises

1. Find ||x||p , p = 1, 2, ∞ for the following vectors

(a) x = (3, −4, 0, −3/2)T .


(b) x = (sin k, cos k, 2k )T for a fixed positive integer k.
(c) x = (4/(k + 1), −2/k 2 , k 2 e−k )T for a fixed positive integer k.

2. Plot or sketch the set of kxkp = 1 in two-dimensions, p = 1, 2, ∞. This example helps


us to understand geometric meanings of vector norms.

3. (a): Find ||A||p , p = 1, 2, ∞ for the following matrices:


   
1 2 −2 1
 ;  ;
0 −3 1 −2

( n )
X
(b): Assume that A ∈ Rn,n . Show that ||A||1 = max |aij | .
0≤j≤n
i=1

4. (a) Show that kxk∞ is equivalent to kxk2 . That is to find constants C and c such that
c ≤ kxk∞ ≤ kxk2 ≤ Ckxk∞ . Note that you need to determine such constants
that the equalities are true for some particular x.
32 Z. Li

(b) Show that kQxk2 = kxk2 f Q is an orthogonal matrix (QH Q = I, QQH = I).
(c) Show that kABk ≤ kAkkBk for any natural matrix norm, and kQAk2 = kAk2 .

5. Let kxk be a vector norm, A be a symmetric positive definite matrix. (a): Show that
kAxk is also a vector norm. (b): It is known that we can factorize the matrix A as
A = B H B, find kAxk2 in terms of B and x. (c): If A = D is a diagonal matrix with
all positive diagonals, find the expression of kDxk2 .
Chapter 3

Solving a linear system of


equations–Direct method

In this chapter, we discuss some most used direct methods for solving a linear system of
equations of the following form

Ax = b, A ∈ Rn×n , b ∈ Rn , det(A) 6= 0. (3.0.1)

The condition of det(A) 6= 0 has the following equivalent statements

• The linear system of equations Ax = b has a unique solution.

• A is invertible, that is, A−1 exists.

Almost all the direct methods are based on Gaussian elimination. A direct method is
a method that returns the exact solution in finite number of operations with exact compu-
tation (no round-off errors present). Such a method is often suitable for small to modest,
dense matrices.
The main idea of Gaussian elimination algorithm is based on the following observation.
Consider the following (upper) triangular system of equations
    
a11 a12 · · · ··· a1n x1 b1
a22 · · · ··· a2n x2 b2
    
    

.. .. ..  ..   .. 
. . . . = . . (3.0.2)
    
 

.. ..  ..   .. 
. . . .
    
    
ann xn bn

From the structure of the system of equations, we can

• From the last equation: ann xn = bn , we get xn = bn /ann .

33
34 Z. Li

• From the last but second equation: an−1,n−1 xn−1 + an−1,n xn = bn−1 , we get xn−1 =
(bn−1 − an−1,n xn )/an−1,n−1 . Note that, we need the result xn from previous step.

• In general (the idea of induction), assume we have computed xn , xn−1 , · · · , xi+1 , from
the i-th equation, aii xi + ai,i+1 xi + · · · + ain xn = bi , we get
 
n
X
xi = bi − aij xj  /aii , i = n, n − 1, · · · , 1. (3.0.3)
j=i+1

Equation is called the backward substitution. A psuedo-code is given below:

for i = n, −1, 1
 
n
X
xi = bi − aij xj  /aii
j=i+1
endfor

Below is a matlab function of the backward substitution.

function [x] = backward(n,A,b)


for i=n:-1:1
x(i) = b(i);
for j=i+1:n
x(i) = x(i) - a(i,j)*x(j);
end
x(i) = x(i)/a(i,j);
end

In one step, we can count the number of operations. In i-th step, there are (n −
i) multiplications and one division; and (n − i) subtractions. Thus the total number of
multiplications and divisions is

n(n + 1) n2
1 + 2 + ··· + n = = + O(n).
2 2

The total number of addtions/subtractions is

n(n − 1) n2
1 + 2 + ··· + n − 1 = = + O(n).
2 2

The total cost is only about one matrix-vector multiplications which is considered to be
very fast.
Numerical Analysis: I 35

3.0.1 Derivation of the Gaussian elimination algorithm

The main idea of Gaussian elimination (GE) is to use a row transforms to reduce they
system to an upper triangular one while keep the solution unchanged. For this purpose, we
can apply the Gaussian
 elimination
 to the coefficient matrix or to the augmented matrix
..
which is defined as A . b , that is, the matrix is enlarged by a column.
First, we use a 4 by 4 matrix to illustrate the idea. We use the number to indicate the
number of the times that the entries have been changed, and the sequence of changes.

.. ..
   
 1 1 1 1 . 1   1 1 1 1 . 1 
.. ..
   
   
 1 1 1 1 . 1   0 2 2 2 . 2 
 =⇒   =⇒
   
. ..

1 .. 1 
   
 1 1 1  0 2 2 2 . 2 
   
.. ..
   
1 1 1 1 . 1 0 2 2 2 . 2
.. ..
   
 1 1 1 1 . 1   1 1 1 1 . 2 
.. ..
   
   
 0 2 2 2 . 2   0 2 2 2 . 2 
=⇒   =⇒ 
   
. ..

3 3 .. 3 
   
 0 0  0 0 3 3 . 3 
   
.. ..
   
0 0 3 3 . 3 0 0 0 4 . 4

We can see that we need (n − 1) step to reach this goal.


In order to do these steps, we multiply a sequence of simple matricides to the augmented
matrix (or original one):

 
..
Ln−1 Ln−2 · · · L2 L1 A.b . (3.0.4)

We need to derive the recursive relations so that we can implement the algorithm. The
general procedure is to derive the first step, maybe second step if necessary..., the general
step to see if we have a complete algorithm. For convenience, we denote the right hand side
36 Z. Li
 
T ..
as b = [a1,n+1 , a2,n+1 , · · · , an,n+1 ], we set L1 A.b

 
1 0 0 ··· 0 
..

  a
  11 a12 ··· ··· a1n . a1,n+1
 .. .. ..

 −l
21 1 . . 0 
 

  a21 a22 ··· ··· a2n . a2,n+1 
 .. .. .. ..

 .. 
 −l
 31 0 1 . 0 
 . . ··· ··· ··· . . 

 . 

 .   .. .. .. .. 
 .. .. .. .. . ··· ··· ··· . . 
. . . 0   
  ..
  a
n1 an2 ··· ··· ann . an,n+1
−ln1 0 ··· 0 1
..
 
 a11 a12 · · · ··· a1n . a1,n+1
.. (2)

 0 a(2) (2)
 
22 ··· ··· a2n . a2,n+1 
 .. ..
 
= .

 . 

 .. (2) .. 
 . aij . 
 
..
0 .

(2)
We need to derive the formulas for li1 and aij . We multiply the 2-nd row of L1 to the
1-st column of [A|b] = A(1) to get

(1)
(1) (1) a21
−l21 a11 + a21 = 0, =⇒ l21 = (1)
.
a11

We multiply the 3-rd row of L1 to the 1-st column of [A|b] = A(1) to get

(1)
(1) (1) a31
−l31 a11 + a31 = 0, =⇒ l31 = (1)
.
a11

In general, we multiply the i-th row of L1 to the 1-st column of [A|b] = A(1) to get

(1)
(1) (1) ai1
−li1 a11 + ai1 = 0, =⇒ li1 = (1)
, i = 2, 3, · · · , n.
a11

So we have general formula for L1 .


(2)
Now we consider the general formulae for aij , i = 2, 3, · · · , n, j = 2, 3, · · · , n, n + 1. We
multiply the i-th row of L1 to the j-st column of [A|b] = A(1) to get

(1)
(1) (1) (2) (2) (1) ai1 (1)
−li1 a1j + aij = aij , aij = aij − a
(1) 1j
a11
i = 2, 3, · · · , n, j = 2, 3, · · · , n, n + 1.
Numerical Analysis: I 37

Let us do one more step: L2 L1 A(1) = L2 A(2) = A(3) ,


 
1 0 0 ··· 0 
.. (1)

  a(1) a(1) ··· ··· a1n . a1,n+1
 .. ..   11 12
.. (2)

 0 1 . . 0 
 (2) (2) 

  0 a22 ··· ··· a2n . a2,n+1 
 ..  .

.. .. ..


 0 −l32 1 . 0 
 .. . ··· ··· ··· . .


  . .. .. ..

 .. .. .. ..   ..

. ··· ··· ··· . .


 . . . . 0 
 

(2) (2) .. (2)
  0 an2 ··· ··· ann . an,n+1
0 −ln2 · · · 0 1
.. (1)
 
(1) (1) (1)
 a11 a12 · · · ··· a1n . a1,n+1
.. (2)

 0 a(2) (2)
 
22 ··· ··· a2n . a2,n+1 
 .. ..
 
= .

 . 

 .. (3) .. 
 . aij . 
 
..
0 .

Since the formulas only depend on the indexes, we just need to replace the formulas
using the substitutions: 1 =⇒ 2 and 2 =⇒ 3 to get the following formulas:

(2)
ai2
li2 = (2)
, i = 3, · · · , n.
a22
(2)
(3) (2) ai2 (2)
aij = aij − a
(2) 2j
a22
i = 3, 4, · · · , n, j = 3, 4, · · · , n, n + 1.

Since there are (n − 1) steps, in k-step, we should have

(k)
aik
li,k+1 = (k)
, i = k + 1, · · · , n.
akk
(k)
(k+1) (k) aik (k)
aij = aij − a
(k) kj
akk
i = k + 1, · · · , n, j = k + 1, · · · , n, n + 1.

Important observations of the Gaussian elimination process:

(2) (1)
• After we get aij , we do not need aij anymore. We can overwrite to save the storage.

• We do not need to store zeros below the diagonals. Often we store lij .
38 Z. Li

Below is the pseuo-code using the overwrite

for k = 1, n − 1
for i = k + 1, n
aik
aik :=
akk
for j = k + 1, n
aij := aij − aik akj
end
end
end

3.0.2 What can we get after the Gaussian elimination?

• An upper triangular system


..
 
 a11 a12 · · · ··· a1n . a1,n+1
..

 
 a22 · · · ··· a2n . a2,n+1 
.. .. ..
 
 .. .. 
(3.0.5)
 . . . . . 
 
 .. .. .. .. 
 . . . . 
 
..
ann . an,n+1

which can be solved using the backward substitution.

• A matrix factorization of A = LU , where L is a unit lower triangular matrix and U


is an upper triangular matrix.
 
1 0 0 ··· 0  
a a · · · · · · a
 
 .. ..  11 12 1n
 a
 21 1 . . 0  a22 · · · · · · a2n 

 
 .. 
.. .. .. 
A= . . .  = LU (3.0.6)

 a31 a32 1 . 0 

 
.. .. 
 . . . 

 .. .. .. .. 
. . .

0 
ann

 
an1 an2 · · · an,n−1 1

• The determinant of A is

det(A) = a11 a22 · · · ann = det(U ). (3.0.7)

This is obvious since det(Li ) = 1 and the property of det(AB) = det(A)det(B).

• The GE method breaks down if one of the diagonals is zero!


Numerical Analysis: I 39

Sketch of the proof of LU decomposition from GE process.


From

Ln−1 Ln−2 · · · L2 L1 A = U,

we have

A = (Ln−1 Ln−2 · · · L2 L1 )−1 U = L−1 −1 −1 −1


1 L2 · · · Ln−2 Ln−1 U.

It is easy to check that


   
1 0 0 ··· 0 1 0 0 ··· 0
   
 .. ..  
a21 .. .. 
 l
 21 1 . . 0  
  a11 1 . . 0 

   
L−1
 ..  
a31 .. 
1 =  l31 0 1 . =
0  0 1 . 0  (3.0.8)
 
a11 
   
 . . ..
.. ... .. .. .. ..
  
 .. . 0   . . . . 0 
   
   
an1
ln1 0 · · · 0 1 a11 0 ··· 0 1
In other words, we just need to change the sign of the non-zero column below the diagonal.
It is also easy to show that
 
1 0 0 ··· 0
 
 .. .. 
 l
 21 1 . . 0 

 
L−1 −1
 .. 
1 L2 =  l31 l32 1 . 0  (3.0.9)


 
 . .
.. ... .. 
 .. . 0 
 
 
ln1 ln2 · · · 0 1
In other words, we can simply add the non-zero columns together. Note that this is not true
for L−1 −1
2 L2 . Repeat the process leads to the desired LU decomposition.

3.0.3 Operation counts for the Gaussian elimination algorithm

As part of analysis, we need to know the cost of an algorithm, or the total number of
operations needed to complete the algorithm.
For the GE algorithm, we count the number of operations in the k-th step, then add
them together. In k-th step of GE algorithm, we need to compute
aik
ai,k+1 := , i = k + 1, · · · , n.
akk
aij =: aij − aik akj

i = k + 1, · · · , n, j = k + 1, · · · , n, n + 1.
40 Z. Li

We need (n − k) divisions for computing aik /akk and (n − k)(n − k + 1) multiplications


for updating the modified elements. Thus the total number of multiplications/divisions1
is (n − k)(n − k + 2) for k = 1, 2, · · · n − 1. Thus we have the table for the number of
multiplications/divisions

k=1 (n − 1)(n + 1)
k=2 (n − 2)n
.. ..
. .
n−1 1·3

The total is
n−1 n−1 n−1
X X X (n − 1)n(2n − 1) n3
k(k + 2) = k2 + 2k = + n(n − 1) = O( )
6 3
k=1 k=1 k=1

Note that the first term is actually the cost if we apply the GE process to the matrix A,
the second part if the same transform applied to the right hand side which is equivalent
to the forward substitution. We often emphasize the order of operations which is O(n3 /3).
The constant coefficient (1/3) is also important here.
The number of additions/substractions is almost the same. When n = 1000, the total
number of operations is roughly 2 · 109 /3 (about a billion), which is a quite large number.
It is true that for large and dense matrix, the GE algorithm is not very fast!

3.0.4 Partial pivoting strategy to avoid break-down and large errors

When a11 = 0, the GE algorithm breaks down even if A is non-singular. For example, we
consider Ax = b where
! !
0 1  1
A= , det(A) = −1 6= 0, A1 =
1 0 1 0
The Gaussian elimination fails for the first A, for the second one, in general we will have
 
ai1
f l aij − a1j
a11
 
ai1
= aij − a1j (1 + δ1 )(1 + δ2 ) (1 + δ3 )
a11
ai1 ai1
= aij − a1j − a1j δ3 + · · ·
a11 a11
We can see that if |a11 | is very small, the round-off error will be amplified. The element a11
is called the pivot element.
1
Usually we put the multiplication and divisions into one category; and additions/subtractions into
another category. An operation in the first category often take slightly longer time than that in the second
category
Numerical Analysis: I 41

Partial column pivoting strategy: Before the Gaussian elimination, we exchanges


the row of the augmented matrix (or the original matrix) such that the switched pivot
element has the largest magnitude among all of the elements in the column. Below are the
steps of the 1-st Gaussian elimination with column partial pivoting algorithm,

• Choose al1 such that

|al1 | ≥ |ai1 |, i = 1, 2, · · · , n.

This can be done using

l = 1; pivot= abs(a(1,1));
for i=2:n
if( abs(a(i,1)) ) > pivot
pivot = abs(a(i,1));
l = i;
end
end

• Exchanges the l-th row with the 1-st row, alj ←→ a1j , j = 1, 2, · · · , n, n + 1. This can
be done using

for j=1:n+1
tmp = a(1,j);
a(1,j) = a(l,j);
a(1,j) = tmp
end

You should be able to see how to change the variables.

• Carry out the Gaussian elimination as usual.

Below are some example:


 
1 2 3
 
 −1 4 5  No pivoting is necessary
 
 
1 −1 0
   
1 2 3 3 −1 0
   
 −2 4 5  −7 →  −2 4 5 
   
   
3 −1 0 1 2 3
42 Z. Li

3.0.5 The matrix expression of partial column pivoting and elementary


permutation matrix

The partial column pivoting algorithm can be expressed as row transform as well.
   
1 1
 ..   .. 

 . 


 . 

   

 1 


 0 1 

I=
 .. 
−→
 ..  = Pij

.   .
   

 1 


 1 0 

 ..   .. 

 . 


 . 

1 1
An elementary permutation matrix Pij is the matrix obtained by exchanging the i-th and
j-th rows of the identity matrix, for example
 
0 0 1
 
P13 =  0 1 0  .
 
 
1 0 0
If we multiply Pij to a Matrix A from the left, the new matrix is the one obtained by
exchanging the i-th and j-th rows of the matrix A, for example
    
0 0 1 1 2 3 3 −1 0
    
 0 1 0   −2 4 5   −2 4 5  .
=
    
    
1 0 0 3 −1 0 1 2 3
Similarly, if we multiply Pij to a Matrix A from the left, the new matrix is the one obtained
by exchanging the i-th and j-th columns of the matrix A.

3.0.6 Properties of an elementary permutation matrix Pij

• Pij Pij = I, and Pij−1 = Pij .

• PijT = Pij = Pij−1 , that is, Pij is an orthogonal matrix, PijT Pij = Pij PijT = I.

• det(Pij ) = 1.

3.0.7 P A = LU decomposition

The Gaussian elimination with partial column pivoting algorithm can be written as the
following matrix form
Ln−1 Pn−2 Ln−2 · · · L2 P2 L1 P1 A = U, (3.0.10)
Numerical Analysis: I 43

Can we put all Li ’s and Pi ’s together to get a P A = LU decomposition by moving Pi ’s to


the right and Li ’s to the left? Yes, we can with some modification. We need to exchanges
rows of Pi ’s. We can write

Ln−1 Pn−2 Ln−2 · · · L2 L


e 1 P2 P1 A = U,

Ln−1 Pn−2 Ln−2 · · · L


e2 L
e 1 P3 P2 P1 A = U,
e

··· ··· ···

Below is a demonstration how this can be done for P24 L1 = L


e 1 P24 .
    
1 0 0 0 1 0 0 0 1 0 0 0
    
 1
  −3 1 0 0   1 0 0 1 
 0 0 0 1    

P24 L1 = 


 2
=
  2


 0 0 1 0  3 0 1 0   3 0 1 0 
    
0 1 0 0 1 0 0 1 − 13 0 0 1
  
1 0 0 0 1 0 0 0
  

 1 1 0 0 
 0 0 0 1 
 
=  =L
e 1 P24 .
 2  
3 0 1 0  0 0 1 0 
  

− 13 0 0 1 0 1 0 0

We can see that when we move Pij to the right passing through Lk , we just need to exchanges
the two rows in the non-zero column below the diagonal. Finally we will have P A = LU .
The determinant of A can be computed using the following

det(P A) = det(L)det(U ) = u11 u22 · · · unn , or det(A) = (−1)m u11 u22 · · · unn ,

where m is the total number of row exchanges. Below is a pseudo-code of the process
for k = 1, n − 1, (n − 1)-th elimination
ap = |akk |
ip = k
for i = k + 1, n
if |aik | > |ap | then
ip = i
ap = |aik |
end
end

for j = k, n + 1 (or n)
at = ak,j
44 Z. Li

akj = aip,j
aip,j = at
end

begin Gaussian elimination process


·········
Finally, we have the P A = LU decomposition. Below is an example how this process
can be used. This example can be used in the debugging process.

3.0.8 Solving Ax = b using the P A = LU decomposition

Once we have the P A = LU decomposition, we can use the decomposition to solve the
linear system of equation as follows:

1. Step 1: Form b
e = P b, that is, exchange rows of b. This is because from Ax = b, we
get P Ax = P b.

2. Use forward substitution to solve Ly = P b.


e This is because we have P Ax = LU x =
P b.

3. Use the backward substitution to solve U x = y to get the solution.


Numerical Analysis: I 45

3.0.9 An example of Gaussian elimination

See the scanned pdf file.


46 Z. Li

3.1 Other pivoting strategies

• Complete pivoting. At the first step, we choose |a11 | := max |aij |. In this approach, we
i,j
have to exchange both rows and columns which makes the programming more difficult.
It may also destroy some matrix structures for certain matrices. The improvement in
the accuracy is marginal compared with partial column pivoting.

• Scaled column pivoting. If the matrix A has very different magnitude in rows, this
approach is strongly recommended. At first step, we choose

|ak1 | |ai1 |
= max (3.1.1)
sk 1≤i≤n si

n
X
where si = |aij |.
j=1

3.2 Error Analysis

When we input vectors or matrices into a computer number system, we introduce
round-off errors that satisfy the following relations:

    fl(b_i) = b_i (1 + δ_i),  |δ_i| ≤ ε,      fl(b) = b + E_b,   ||E_b||_p ≤ ε ||b||_p,   p = 1, 2, ∞,

    fl(a_{ij}) = a_{ij} (1 + e_{ij}),  |e_{ij}| ≤ ε,      fl(A) = A + E_A,   ||E_A||_p ≤ ε ||A||_p,   p = 1, ∞.

In general, from the equivalence of norms, we have

    ||E_b|| ≤ C_1 ε ||b||,      ||E_A|| ≤ C_2 ε ||A||,                              (3.2.1)

for any vector and matrix norms, where C_1, C_2 are two constants that depend on n.
Thus, even before we solve the linear system of equations, we are already solving a different problem
(A + E_A)x = b + E_b due to the input errors. The question is how these errors affect the
result. This is summarized in the following theorem:

Theorem 3.2.1 If ||A^{-1} E_A|| < 1 (or ||A^{-1}|| ||E_A|| < 1, a stronger condition), define x_e =
A^{-1} b and δx = (A + E_A)^{-1}(b + E_b) − A^{-1} b; then

    ||δx|| / ||x_e|| ≤ [ ||A|| ||A^{-1}|| / ( 1 − ||A|| ||A^{-1}|| ||E_A||/||A|| ) ] ( ||E_A||/||A|| + ||E_b||/||b|| ),      (3.2.2)

or the following if we ignore the high order terms:

    ||δx|| / ||x_e|| ≤ ||A|| ||A^{-1}|| ( ||E_A||/||A|| + ||E_b||/||b|| ).                  (3.2.3)

We can see that ||A|| ||A^{-1}|| is an important amplifying factor in the error estimate.
It is called the condition number of the matrix A,

    cond(A) = ||A|| ||A^{-1}||.                                         (3.2.4)

For the 2-norm, it is also denoted as κ(A) = cond_2(A) = ||A||_2 ||A^{-1}||_2.


To prove the main theorem, we need Banach's lemma.

Lemma 3.2.1 If ||E|| < 1, then I + E is invertible, and

    ||(I + E)^{-1}|| ≤ 1 / (1 − ||E||).                                 (3.2.5)

Proof: Consider the matrix series

    B = I − E + E² − E³ + · · · = Σ_{k=0}^{∞} (−1)^k E^k.

Its partial sum

    B_n = I − E + E² − E³ + · · · + (−1)^n E^n = Σ_{k=0}^{n} (−1)^k E^k

satisfies

    ||B_n|| ≤ ||I|| + ||−E|| + ||E²|| + · · · + ||(−1)^n E^n||
           ≤ 1 + ||E|| + ||E||² + · · · + ||E||^n = (1 − ||E||^{n+1}) / (1 − ||E||) ≤ 1 / (1 − ||E||).

Let B = lim_{n→∞} B_n; then B is the inverse of I + E because

    (I + E) B_n = (I + E)( I − E + E² − · · · + (−1)^n E^n ) = I + (−1)^n E^{n+1} → I.

Furthermore,

    ||(I + E)^{-1}|| = ||B|| ≤ 1 + ||E|| + ||E||² + · · · + ||E||^n + · · · = 1 / (1 − ||E||).
Now we prove the main error theorem.

    ||δx|| = ||(A + E_A)^{-1}(b + E_b) − A^{-1} b||
           = ||(A + E_A)^{-1}( b + E_b − (A + E_A) A^{-1} b )||
           = ||[ A (I + A^{-1} E_A) ]^{-1} ( b + E_b − b − E_A x_e )||
           = ||(I + A^{-1} E_A)^{-1} A^{-1} ( E_b − E_A x_e )||
           ≤ ||A^{-1}|| ||(I + A^{-1} E_A)^{-1}|| ( ||E_b|| + ||E_A|| ||x_e|| )
           ≤ ||A^{-1}|| ( ||E_b|| + ||E_A|| ||x_e|| ) / ( 1 − ||A^{-1} E_A|| ).

Notice that

    ||A^{-1} E_A|| ≤ ||A^{-1}|| ||E_A|| = ||A|| ||A^{-1}|| ( ||E_A|| / ||A|| ).

Thus, dividing by ||x_e||, we continue to get

    ||δx|| / ||x_e|| ≤ [ ||A|| ||A^{-1}|| / ( 1 − ||A|| ||A^{-1}|| ||E_A||/||A|| ) ] ( ||E_b|| / (||A|| ||x_e||) + ||E_A|| / ||A|| ).

Since b = A x_e, we have ||b|| ≤ ||A|| ||x_e||, so ||E_b|| / (||A|| ||x_e||) ≤ ||E_b|| / ||b||, and we arrive at

    ||δx|| / ||x_e|| ≤ [ cond(A) / ( 1 − cond(A) ||E_A||/||A|| ) ] ( ||E_b|| / ||b|| + ||E_A|| / ||A|| ).

If we ignore high order terms in the above inequality, we can go one step further:

    ||δx|| / ||x_e|| ≤ cond(A) ( ||E_b||/||b|| + ||E_A||/||A|| ) ( 1 + cond(A) ||E_A||/||A|| + O( (cond(A) ||E_A||/||A||)² ) + · · · )
                    ≈ cond(A) ( ||E_b|| / ||b|| + ||E_A|| / ||A|| ).

This completes the proof.

Remark 3.2.1

• The relative errors in the data (either (both) A or (and) b) are amplified by the factor
of the condition number cond(A).

• Usually the upper bound is over-estimated.

• However, the upper bound is attainable.

• The condition number cond(A) has nothing to do with any algorithm. It only depends
  on the matrix itself. However, if cond(A) is very large relative to the machine
  precision, then no matter what algorithm we use, in general we cannot expect a good
  result. Such a matrix is called ill-conditioned, and the problem is simply called ill-conditioned.
  For an ill-conditioned system of linear equations, a small perturbation in the data (A
  and b) can cause a large change in the solution.

• If cond(A) is small or modest in reference to the machine precision, the problem is


then called well-conditioned.
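A small MATLAB illustration of the remark (the particular matrices and the size of the
perturbation are arbitrary choices, not from the notes): the same tiny perturbation of b
produces very different relative errors for a well-conditioned matrix and for the
ill-conditioned Hilbert matrix.

    n  = 10;
    A1 = eye(n) + 0.1*diag(ones(n-1,1),1);   % cond(A1) is modest
    A2 = hilb(n);                            % the Hilbert matrix is ill-conditioned
    b  = ones(n,1);   db = 1e-10*randn(n,1); % a tiny perturbation of the data
    for C = {A1, A2}
        M  = C{1};
        xe = M\b;   xp = M\(b + db);
        fprintf('cond = %9.2e,  relative error = %9.2e\n', cond(M), norm(xp-xe)/norm(xe));
    end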

3.3 Wilkinson’s Backward round-off error analysis

There are various errors during a problem solving process, for example, modelling errors,
input error (f l(A)), algorithm error (truncation errors), and round-off errors. We have as-
sumed to start with a mathematical problem and wish to use computer to solve it, therefore
we will not discuss the modelling errors here.
Using the Gaussian elimination method with partial pivoting, there is no truncation (formula)
error (that is why it is called a direct method), so we only need to consider round-off
errors. Round-off error analysis is often complicated; J. H. Wilkinson made it simpler.
His idea is that the round-off errors can be regarded as a perturbation of the
original data; for example, fl(x + y) = ( x(1 + δ_x) + y(1 + δ_y) )(1 + δ_3) = x(1 + δ_4) + y(1 + δ_5) =
x̃ + ỹ. In other words, the computed result is the exact sum of two perturbed numbers of
the original data.
For Gaussian elimination method for solving Ax = b, we have the following theorem due
to J. H. Wilkinson.

Theorem 3.3.1 The computed solution of the Gaussian elimination algorithm on a com-
puter for solving Ax = b is the exact solution of the following system of linear equations

    (A + E)x = b,                                                   (3.3.1)

where ||E|| ≤ C g(n) ε ||A||. The function g(n) is called the growth factor and is defined by

    g(n) = max_k  max_{1≤i,j≤n} |a_{ij}^{(k)}|  /  max_{1≤i,j≤n} |a_{ij}^{(1)}|.              (3.3.2)

The growth factor satisfies g(n) ≥ 1. We also can prove the following:

• For the Gaussian elimination algorithm without pivoting, g(n) can be arbitrarily large,
  indicating that the method may break down.

• For the Gaussian elimination algorithm with partial column pivoting, g(n) ≤ 2^{n−1}. This
  bound is attainable.

Proof: For the Gaussian elimination algorithm with partial column pivoting, we have

    a_{ij}^{(k+1)} = a_{ij}^{(k)} − ( a_{ik}^{(k)} / a_{kk}^{(k)} ) a_{kj}^{(k)}.

Since |a_{ik}^{(k)} / a_{kk}^{(k)}| ≤ 1, we conclude that

    |a_{ij}^{(k+1)}| ≤ max_{ij} |a_{ij}^{(k)}| + max_{ij} |a_{ij}^{(k)}| = 2 max_{ij} |a_{ij}^{(k)}| ≤ 2 × 2 max_{ij} |a_{ij}^{(k−1)}| ≤ · · · ≤ 2^k max_{ij} |a_{ij}^{(1)}|.

This shows that g(n) ≤ 2^{n−1}. The following example shows that the bound is
attainable:

    [  1                   1 ]          [ 1              1       ]
    [ −1   1               1 ]          [    1           2       ]
    [ −1  −1   1           1 ]    ⟹    [       1        2²      ]
    [  ⋮            ⋱      ⋮ ]          [          ⋱     ⋮       ]
    [ −1  −1  · · ·  −1    1 ]          [                2^{n−1} ]

However, the matrix above is a specific one, the general conjecture is that for most
reasonable matrices, g(n) ∼ n.
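A quick numerical check of the example (a sketch, not from the notes): build this matrix for a
moderate n, factor it, and observe that the last entry of U equals 2^{n−1}, so the growth factor
bound is attained.

    n = 10;
    A = eye(n) - tril(ones(n),-1);   % 1 on the diagonal, -1 below it
    A(:,n) = 1;                      % last column of ones
    [~, U] = lu(A);                  % no row exchanges are actually needed here
    disp([U(n,n), 2^(n-1)])          % both are 512 for n = 10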

3.3.1 Factors that affected the accuracy of computed solutions

For most computational problems, the relative error of the computed solution x_c approx-
imating the true solution x_e satisfies a relation of the form

    ||x_c − x_e|| / ||x_e|| ≤ cond(problem) · g(algorithm) · ε.                    (3.3.3)

That is, three factors affect the accuracy

• The computer used for solving the problem characterized by the machine precision.

• The condition number of the problem. For a linear system of equations Ax = b, it is


cond(A).

• The algorithm used to solve the problem characterized by the growth factor g.

3.4 Residual vector and error estimates

In the error estimate, ||A^{-1}|| is involved, but it is difficult and expensive to compute
A^{-1}. Do we have a better way to estimate how accurate an approximation is? The answer
is the residual vector.

Definition 3.4.1 Given a system of linear equations Ax = b and an approximation xa , the


residual of xa is defined as
r(xa ) = b − Axa . (3.4.1)

If det(A) ≠ 0 and r(x_a) = 0, then x_a is the true solution. We can use ||r(x_a)|| to measure
how close x_a is to the true solution x_e = A^{-1} b. Note that r(x_a) is computable since
it only needs a matrix-vector multiplication and does not need A^{-1}.

Example: Let

    A = [ 1  −1  0 ]         [ 0 ]           [ 1 ]
        [ 2   0  1 ],    b = [ 1 ],    x_a = [ 1 ].
        [ 3   0  2 ]         [ 0 ]           [ 1 ]

Then the residual vector of x_a is

    r(x_a) = b − A x_a = [ 0;  −2;  −5 ].

How far is the residual from the relative error? The answer is given in the following
theorem:

Theorem 3.4.1

    ||r(x_a)|| / ( ||A|| ||x_a|| )  ≤  ||x_e − x_a|| / ||x_a||  ≤  ||A^{-1}|| ||r(x_a)|| / ||x_a||.        (3.4.2)

In other words, the upper and lower bounds differ by the factor ||A|| ||A^{-1}|| = cond(A); if we
normalize the matrix so that ||A|| = 1, the gap is about the condition number of A.
Note that when A is a symmetric positive definite matrix, the residual vector b − Ax is the
negative gradient of the function f(x) = (1/2) x^T A x − b^T x. It is the search direction of the
steepest descent method in optimization and an important basic concept in the popular conjugate
gradient (CG) method.
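A small sketch (not from the notes) comparing the computable bounds of Theorem 3.4.1 with the
true relative error; the matrix, the perturbation size, and the use of inv(A) are only for the
demonstration.

    n  = 50;  A = randn(n);  xe = ones(n,1);  b = A*xe;
    xa = xe + 1e-6*randn(n,1);                % a perturbed approximate solution
    r  = b - A*xa;                            % the residual is computable
    lower  = norm(r)/(norm(A)*norm(xa));      % lower bound of ||xe - xa||/||xa||
    upper  = norm(inv(A))*norm(r)/norm(xa);   % upper bound
    actual = norm(xe - xa)/norm(xa);
    fprintf('%9.2e <= %9.2e <= %9.2e\n', lower, actual, upper);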

3.5 The direct LU decomposition and Gaussian elimination


algorithm for special matrices

For some situations and various considerations, we may not need to have pivoting process in
the Gaussian elimination algorithm. This can be done using the direct LU decomposition.

3.5.1 The direct LU decomposition

Assuming that we do not do the pivoting, then we can have the direct A = LU decomposi-
tion, that is we can have closed formulas for the entries of L and U , where L is a unit lower
triangular matrix and U is an upper triangular matrix2 .

• Tridiagonal or banded matrices


2
Or we can have U to be a unit upper triangular matrix and L is a lower triangular one. The formulas
and algorithm are slightly different.

• Column diagonally dominant matrices

• Symmetric positive definite matrices

3.5.2 Direct LU decomposition

The direct LU decomposition of a matrix A without pivoting is equivalent to the Gaussian
elimination process. It is valuable mainly for theoretical purposes and for deriving other
algorithms, for example the incomplete LU decomposition. In the direct LU
decomposition, we get the entries of L and U directly according to a certain order. We can
write

    A = LU = [ 1                          ] [ u11  u12  · · ·  · · ·  u1n ]
             [ l21  1                     ] [      u22  · · ·  · · ·  u2n ]
             [ l31  l32  1                ] [            ⋱            ⋮  ]
             [  ⋮    ⋮       ⋱            ] [                  ⋱      ⋮  ]
             [ ln1  ln2  · · ·  ln,n−1  1 ] [                        unn ]

To derive the formulas for the entries of L and U, the order is very important. The order is as
follows.

• We get the first row of U by multiplying the first row of L with any column of U:

      a_{1j} = 1 · u_{1j}   ⟹   u_{1j} = a_{1j},   j = 1, 2, · · · , n.

  We have the first row of U.

• We multiply the i-th row of L with the first column of U:

      a_{i1} = l_{i1} u_{11}   ⟹   l_{i1} = a_{i1} / u_{11},   i = 2, · · · , n.

  We have the first column of L.

• Assume we have obtained the first (k − 1) rows of U and the first (k − 1) columns of L;
  we derive the formula for the k-th row of U and the k-th column of L. If we multiply the
  k-th row of L with the j-th (j ≥ k) column of U, we get

      a_{kj} = Σ_{i=1}^{k−1} l_{ki} u_{ij} + u_{kj}   ⟹   u_{kj} = a_{kj} − Σ_{i=1}^{k−1} l_{ki} u_{ij},   j = k, · · · , n.

  After we get the k-th row of U, we can get the k-th column of L by multiplying the i-th row
  of L with the k-th column of U:

      a_{ik} = Σ_{j=1}^{k−1} l_{ij} u_{jk} + l_{ik} u_{kk}   ⟹   l_{ik} = ( a_{ik} − Σ_{j=1}^{k−1} l_{ij} u_{jk} ) / u_{kk},   i = k + 1, · · · , n.
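A hedged MATLAB sketch of the direct (Doolittle) LU decomposition derived above, assuming no
pivoting is needed (for example, A is column diagonally dominant or SPD); the function name
direct_lu is illustrative.

    function [L, U] = direct_lu(A)
    n = size(A,1);  L = eye(n);  U = zeros(n);
    for k = 1:n
        for j = k:n                                      % k-th row of U
            U(k,j) = A(k,j) - L(k,1:k-1)*U(1:k-1,j);
        end
        for i = k+1:n                                    % k-th column of L
            L(i,k) = (A(i,k) - L(i,1:k-1)*U(1:k-1,k))/U(k,k);
        end
    end
    end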

3.6 Tridiagonal system of equations

The coefficient matrix A of a tridiagonal system of equations has the following form:

    [ d1  β1                    ]
    [ α2  d2  β2                ]
    [     α3  d3  β3            ]
    [         ⋱   ⋱   ⋱        ]
    [                 αn  dn    ]

One particular application is the system of linear equations derived from the finite
difference method. Another application is the alternating directional implicit (ADI) for
solving two or three dimensional partial differential equations (PDE) using dimension by
dimension approach.
If we apply the direct LU decomposition to the tridiagonal matrix, we get a very
simple decomposition:

    [ d1  β1              ]   [ 1                ] [ d'1  β1               ]
    [ α2  d2  β2          ]   [ α'2  1           ] [      d'2  β2          ]
    [     α3  d3  β3      ] = [      α'3  1      ] [           d'3  β3     ]
    [         ⋱  ⋱   ⋱   ]   [           ⋱   ⋱ ] [                ⋱   ⋱ ]
    [             αn  dn  ]   [           α'n  1 ] [                   d'n ]

In other words, only two diagonals need to be changed.
From the matrix-matrix multiplication and the order of the direct LU decomposition, it is easy
to derive

    d'_1 = d_1,
    α'_i = α_i / d'_{i−1}           from  α'_i d'_{i−1} = α_i,
    d'_i = d_i − α'_i β_{i−1}       from  α'_i β_{i−1} + d'_i = d_i,        i = 2, · · · , n.

This decomposition is called the Crout factorization in some references.


Using overwriting, we can have the following pseudo-code:

    for i = 2, n
        α_i := α_i / d_{i−1}
        d_i := d_i − α_i β_{i−1}
    end

Once we have the LU factorization, we can easily derive the formulas for the forward and
backward substitutions for solving Ax = b. The forward substitution is

    y_1 = b_1,
    for i = 2, n
        y_i = b_i − α_i y_{i−1},
    end

The backward substitution is

    x_n = y_n / d_n,
    for i = n − 1, −1, 1
        x_i = ( y_i − β_i x_{i+1} ) / d_i,
    end

The entire process (Crout decomposition, forward and backward substitutions) requires O(5n)
multiplications/divisions. The solution process is sometimes called the chasing method for solv-
ing a tridiagonal system of equations.
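A minimal MATLAB sketch of the chasing method above; the input vectors follow the notation of
this section (alpha(2:n) sub-diagonal, d(1:n) diagonal, beta(1:n-1) super-diagonal), and the
function name tridiag_solve is made up.

    function x = tridiag_solve(alpha, d, beta, b)
    n = length(d);
    for i = 2:n                        % Crout factorization, overwriting alpha and d
        alpha(i) = alpha(i)/d(i-1);
        d(i) = d(i) - alpha(i)*beta(i-1);
    end
    y = b;                             % forward substitution
    for i = 2:n
        y(i) = b(i) - alpha(i)*y(i-1);
    end
    x = zeros(n,1);                    % backward substitution
    x(n) = y(n)/d(n);
    for i = n-1:-1:1
        x(i) = (y(i) - beta(i)*x(i+1))/d(i);
    end
    end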

3.6.1 Strictly column diagonally dominant matrices

If a matrix A = {a_{ij}} satisfies

    Σ_{i=1, i≠j}^{n} |a_{ij}| < |a_{jj}|,   j = 1, 2, · · · , n,                       (3.6.1)

then A is called strictly column diagonally dominant. For example, the first matrix below is
strictly column diagonally dominant while the second one is not:

    [  5   0   1 ]        [  5   0   1 ]
    [ −1   2  −3 ],       [ −2   4   1 ].
    [  3  −1   7 ]        [  4   0   5 ]

If a matrix A is strictly column diagonally dominant, then it is obvious that no pivoting is
necessary in the first step of the Gaussian elimination process because

    | a_{i1} / a_{11} |  ≤  Σ_{i=2}^{n} |a_{i1}| / |a_{11}|  <  1.

The question is: what happens at the next step, and thereafter? The following theorem answers
this question.

Theorem 3.6.1 Let A be a strictly column diagonally dominant matrix. After one step of
Gaussian elimination,

    L_1 A = [ a11   *  ]
            [  0    A1 ],

A1 is still a strictly column diagonally dominant matrix.

Proof: We need to show, for each column j = 2, · · · , n, that Σ_{i=2, i≠j}^{n} |a_{ij}^{(2)}| < |a_{jj}^{(2)}|, where
a_{ij}^{(2)} = a_{ij}^{(1)} − ( a_{i1}^{(1)} / a_{11}^{(1)} ) a_{1j}^{(1)}. Indeed,

    Σ_{i=2, i≠j} |a_{ij}^{(2)}| ≤ Σ_{i=2, i≠j} |a_{ij}^{(1)}| + ( |a_{1j}^{(1)}| / |a_{11}^{(1)}| ) Σ_{i=2, i≠j} |a_{i1}^{(1)}|

                              < ( |a_{jj}^{(1)}| − |a_{1j}^{(1)}| ) + ( |a_{1j}^{(1)}| / |a_{11}^{(1)}| )( |a_{11}^{(1)}| − |a_{j1}^{(1)}| )

                              = |a_{jj}^{(1)}| − ( |a_{1j}^{(1)}| / |a_{11}^{(1)}| ) |a_{j1}^{(1)}|

                              ≤ | a_{jj}^{(1)} − ( a_{1j}^{(1)} / a_{11}^{(1)} ) a_{j1}^{(1)} | = |a_{jj}^{(2)}|.

Note that in the second line we have used the strictly column diagonally dominant condition
twice: once for the j-th column, which gives the negative term −|a_{1j}^{(1)}|, and once for the first
column, which gives the other negative term −|a_{j1}^{(1)}|. This completes the proof.

3.6.2 Symmetric positive definite matrices and Cholesky decomposition

A matrix A is called symmetric positive definite (SPD) if the following conditions are met:

1. A = A^H, i.e., a_{ij} = ā_{ji} (for real matrices, A = A^T).

2. For any x ≠ 0, x^H A x > 0.

Note that the second condition has the following equivalent statements, which also give
ways to judge whether a matrix is SPD or not:

• λ_i(A) > 0, i = 1, 2, · · · , n, that is, all the eigenvalues of A are real and positive.

• All the determinants of the leading principal sub-matrices are positive.

A leading principal sub-matrix A_k is the matrix composed of the intersections of the first k
rows and columns of the original matrix A, for example

    A1 = { a11 },    A2 = [ a11  a12 ],    A3 = [ a11  a12  a13 ],    · · · ,    An = A.
                          [ a21  a22 ]          [ a21  a22  a23 ]
                                                [ a31  a32  a33 ]

If A is SPD, then so are all the A_k's.
Note also that if A is SPD, then a_{ii} > 0, i = 1, 2, · · · , n, that is, all the diagonal entries are
positive. This is because if we take x = e_i, then x^H A x = a_{ii} > 0.
Examples: Are the following matrices symmetric positive definite?

    [ −4   1   1 ]        [  2  −1   0 ]
    [  1   0   1 ];       [ −1   2  −1 ].
    [  1   1   8 ]        [  0  −1   2 ]

The first one is not, since a11 < 0 (also a22 = 0). For the second one, we have A = A^T and
det(A1) = 2 > 0, det(A2) = 4 − 1 = 3 > 0, and det(A3) = det(A) = 8 − 2 − 2 = 4 > 0, so A
is SPD.

3.6.3 Cholesky Decomposition A = LLT

We prefer to use the A = LL^T decomposition instead of the LU decomposition to take
advantage of the symmetry and positive definiteness. The decomposition is an analogue of the
square root of a positive number. We can save half of the cost and half of the storage compared
with the standard Gaussian elimination process applied to a general matrix. Note that for SPD
matrices no pivoting is necessary: if ||A|| ∼ 1, the growth factor is well under control, so no
pivoting is needed.
Now we derive the formulas for the A = LL^T decomposition; the process is similar to the
direct LU decomposition. We can write

    A = LL^T = [ l11                          ] [ l11  l21  · · ·  · · ·  ln1 ]
               [ l21  l22                     ] [      l22  · · ·  · · ·  ln2 ]
               [ l31  l32  l33                ] [            ⋱            ⋮  ]
               [  ⋮    ⋮        ⋱             ] [                  ⋱      ⋮  ]
               [ ln1  ln2  · · ·  ln,n−1  lnn ] [                        lnn ]

To derive the formulas for the entries of L, the order is again important. The order is as
follows.

• We get the (1,1) entry of L from

      a11 = l11 l11   ⟹   l11 = √(a11).

• We get the rest of the first column of L by multiplying the first row of L with the j-th
  column of L^T:

      a_{1j} = l_{11} l_{j1}   ⟹   l_{j1} = a_{1j} / l_{11},   j = 2, · · · , n.

  We have the first column of L.

• Assume we have obtained the first (k − 1) columns of L; we derive the formula for the k-th
  column of L. If we multiply the k-th row of L with the k-th column of L^T, we get

      a_{kk} = Σ_{j=1}^{k−1} l_{kj}² + l_{kk}²   ⟹   l_{kk} = √( a_{kk} − Σ_{j=1}^{k−1} l_{kj}² ).

  After we get the first element of the k-th column of L, we can get the rest of the k-th column
  of L by multiplying the i-th row of L with the k-th column of L^T:

      a_{ik} = Σ_{j=1}^{k−1} l_{ij} l_{kj} + l_{ik} l_{kk}   ⟹   l_{ik} = ( a_{ik} − Σ_{j=1}^{k−1} l_{ij} l_{kj} ) / l_{kk},   i = k + 1, · · · , n.

Pseudo-code of A = LL^T:

    for k = 1, n
        l_{kk} = sqrt( a_{kk} − Σ_{j=1}^{k−1} l_{kj}² )
        for i = k + 1, n
            l_{ik} = ( a_{ik} − Σ_{j=1}^{k−1} l_{ij} l_{kj} ) / l_{kk}
        end
    end
Use the Cholesky decomposition (A = LLT ) to solve Ax = b.
From Ax = b we get LLT x = b. The forward and backward substitutions are the
following:

1. Solve Ly = b.

2. Solve LT x = y.

The number of multiplications/divisions needed for the Cholesky decomposition is O(n3 /6).
The storage needed is O(n2 /2). So we just need half the storage and half the computations.
The only disadvantage is that we need to evaluate square roots. A slightly different version
is to get the A = LDLT decomposition which may work for any symmetric matrix assuming
there is no break down (not guaranteed if A is not an SPD).
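A short MATLAB sketch (not from the notes) of solving Ax = b through a Cholesky factor;
MATLAB's chol returns an upper triangular R with A = R'R, so L = R'. The small SPD matrix is an
arbitrary example.

    A = [4 2 0; 2 5 1; 0 1 3];          % a small SPD matrix
    b = [2; 1; 3];
    R = chol(A);                        % fails with an error if A is not SPD
    y = R'\b;                           % forward substitution,  L*y = b
    x = R\y;                            % backward substitution, L'*x = y
    disp(norm(A*x - b))                 % the residual should be at round-off level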

3.6.4 Software issues

• In Matlab, we can use x = A\b for solving Ax = b; [l, u] = lu(A) for LU decompo-
sition; chol(A) for Cholesky decomposition; det(A) for determinant of A; cond(A, 2),
for example, to find the condition number of A in 2-norm.

• Free subroutines in Fortran, C, and C++ on netlib (www.netlib.org), including Linpack
  and Lapack (collections of linear algebra routines), BLAS (basic linear algebra
  subroutines), slatec, sparse, and many others, in both single and double precision.

• There are many books on the programming of different methods. A popular one is
Numerical Recipes in (Fortran, C, Pascal, ...).

3.7 Exercises

1. Given
   
       A = [ 0  1  2  3 ]         [ 6 ]
           [ 3  0  1  2 ],    b = [ 6 ]
           [ 2  3  0  1 ]         [ 6 ]
           [ 1  2  3  0 ]         [ 6 ]

(a) Use Gaussian elimination with the partial pivoting to find the matrix decom-
position P A = LU . This is a paper problem and you are asked to use exact
calculations (use fractions if necessary).
(b) Find the determinant of the matrix A.
(c) Use the factorization to solve Ax = b.

2. Consider solving AX = B for X, A ∈ Rn,n , B ∈ Rn,m . There are two obvious


algorithms. The first one is to get A = P LU using Gaussian elimination, and then
to solve for each column of X by forward and backward substitution. The second
algorithm is to compute A−1 using Gaussian elimination and then to multiply A−1 B

to get X. Count the number of operations by each algorithm and determine which
one is faster.

3. (a): Let x_e be the solution of Ax = b assuming that det(A) ≠ 0, and let x̄ be the solution of
   Ax = b + δb. Show that

       ||x_e − x̄|| / ||x_e||  ≤  cond(A) ||δb|| / ||b||.

   Hint: ||b|| = ||A A^{-1} b|| ≤ ||A|| ||A^{-1} b||.

   (b): Given an invertible matrix A, let E be a matrix such that ||A^{-1} E|| < α < 1. Show
   that A + E is invertible and ||(A + E)^{-1}|| ≤ ||A^{-1}|| / (1 − α).

4. Given an n-by-n non-singular matrix A, how do you efficiently solve the following
problems using Gaussian elimination with partial column pivoting?

(a) Solve Ak x = b, where k is a positive integer.


   (b) Compute α = c^T A^{-1} b, where c and b are two vectors.

   You should (1) describe your algorithm; (2) present a pseudo-code; (3) find the
   required operation count.

5. Often a small determinant is a sign of an ill-conditioned matrix, but this exercise is
   an exception. Given the following matrix A: (a) find ||A||_∞ and det(A); (b) use
   Gaussian elimination to find the A = LU decomposition; (c) use your result to
   find cond_∞(A). Hint: find the inverse of U^T to get U^{-1}.

       A = [  1                   1 ]
           [ −1   1               1 ]
           [ −1  −1   1           1 ]
           [  ⋮            ⋱      ⋮ ]
           [ −1  −1  · · ·  −1    1 ]

6. Check whether the following matrices are:

• Strictly column diagonally dominant.


• Symmetric positive definite.

Justify your conclusion. What is the significance of knowing these special matrices
to the Gaussian related algorithms? Answer this question by considering issues of

accuracy, speed, and the storage.


 
       [ −5   2   1   0 ]       [  2  −1   0 ]       [  α   β ]
       [  2   7  −1  −1 ],      [ −1   2  −1 ],      [ −1   2 ].
       [  1  −1   5   1 ]       [  0  −1   2 ]
       [  0  −1   1   4 ]

   Find the Cholesky decomposition A = LL^T or A = LDL^T for the middle matrix.
   Note that, for the third matrix, you need to determine the range of α and β.

7. Given a matrix A and a vector b


   
1 1
100 0 0 100
   
A= 0 2−1  , b= 1 
   
   
0 −1 2 1
(a) Find the condition number of A in 2-norm.
h iT h iT
(b) Given x1 = 1 1 1 and x2 = 1.1 1 0.9 , find the residual, and the
norm of the residual, for both x1 and x2 in 2-norm.
(c) Is one of x1 and x2 the exact solution of the system Ax = b?
(d) If x1 or x2 is not the solution, use the error estimate discussed in class (relation
between the residual and the relative error) to give an estimate error bound for
the relative error.
(e) Find the actual relative error for x1 and x2 as approximations to Ax = b and
compare with the error bound that you just have got. How much is the difference?

8. (Programming Part) Given a sequence of data

(x1 , y1 ), (x2 , y2 ), · · · , (xm , ym ), (xm+1 , ym+1 ),

write a program to interpolate the data using the following model

y(x) = a0 + a1 x + · · · + am−1 xm−1 + am xm .

(a) Derive the linear system of equations for the interpolation problem.
(b) Let xi = (i − 1)h, i = 1, 2, · · · , m + 1, h = 1/m, yi = sin πxi , write a computer
code using the Gaussian elimination with column partial pivoting to solve the
problem. Test your code with m = 4, 8, 16, 32, 64 and plot the error |y(x)−sin πx|
with 100 or more points between 0 and 1, that is, predict the function at more
points in addition to the sample points. For example, you can set h1 = 1/100;
x1 = 0 : h1 : 1, y1(i) = a0 + a1 x1(i) + · · · + am−1 (x1(i))m−1 + am (x1(i))m ,
y2(i) = sin(πx1(i)), plot(x1, y1 − y2).

(c) Record the CPU time (in Matlab type help cputime) for m = 50, 100, 150, 200, · · · , 350, 400.
Plot the CPU time versus m. Then use the Matlab function polyfit z = polyf it(m, cputime(m), 3)
to find a cubic fitting of the CPU time versus m. Write down the polynomial
and analyze your result. Does it look like a cubic function?

9. (Programming Part) Let A be a symmetric positive definite matrix.

(a) Derive the algorithms for A = LDLT decomposition, where L is a unit lower
triangular matrix, and D is a diagonal matrix.
(b) Write a Matlab code (or other language if you prefer) to do the factorization and
solve the linear system of equations Ax = b using the factorization. Hint: the
process is the following:

Ly = b, y is the unknown,
Dz = y, z is the unknown,
T
L x = z, x is the unknown, which is the solution.

Construct at least one example that you know the exact solution to validate your
code.

10. Extra Credit: Choose one from the following (Note: please do not ask the instructor
about the solution since it is extra credit):
    (a) Let A ∈ R^{n×n}. Show that ||A||_2 = √( max_{1≤i≤n} λ_i(A^T A) ) and
        ||A^{-1}||_2 = 1 / √( min_{1≤i≤n} λ_i(A^T A) ), where λ_i(A^T A), i = 1, 2, · · · , n, are the
        eigenvalues of A^T A. Show further that cond_2(A) = σ_max / σ_min, where σ_max, σ_min
        are the largest and smallest nonzero singular values of A.

(b) Show that if A is a symmetric positive definite matrix, then after one step of
Gaussian elimination (without pivoting), then reduced matrix A1 in
 
a11 ∗
A =⇒  
0 A1

must be symmetric positive definite. Therefore no pivoting is necessary.


Chapter 4

Iterative methods for solving linear


system of equations

The Gaussian elimination method for solving Ax = b is quite efficient if the size of A is small
to medium (in reference the available computers) and dense matrices (most of entries of the
matrix are non-zero numbers). But for several reasons, sometimes an iterative method may
be more efficient as discussed below.

• For sparse matrices, the Gaussian elimination method may destroy the structure of
the matrix and cause ’fill-in’s, see for example,

   
       [ 1  1  1  1  1  1 ]          [ 1   1   1   1   1   1 ]
       [ 2  1  0  0  0  0 ]          [ 0  −1  −2  −2  −2  −2 ]
       [ 3  0  1  0  0  0 ]    ⟹    [ 0  −3  −2  −3  −3  −3 ]
       [ 4  0  0  1  0  0 ]          [ 0  −4  −4  −3  −4  −4 ]
       [ 5  0  0  0  1  0 ]          [ 0  −5  −5  −5  −4  −5 ]
       [ 6  0  0  0  0  1 ]          [ 0  −6  −6  −6  −6  −5 ]

Obviously, the case discussed above can be generalized to a general n by n matrix


with the same structure.

• Large sparse matrices for which we may not be able to store all the entries of the
matrix. Below we show an example in two dimensions.


4.1 The central finite difference method with five point sten-
cil for Poisson equation.

Consider the Poisson equation

uxx + uyy = f (x, y), (x, y) ∈ Ω = (a, b) × (c, d), (4.1.1)

u(x, y)|∂Ω = u0 (x, y), Dirichlet BC. (4.1.2)

If f ∈ L2 (Ω), then the solution exists and it is unique. Analytic solution is rarely available.
Now we discuss how to use the finite difference equation to solve the Poisson equation.

• Step 1: Generate a grid. A uniform Cartesian grid can be used:

b−a
xi = a + ihx , i = 0, 1, 2, · · · m, hx = , (4.1.3)
m
d−c
yj = c + jhy , j = 0, 1, 2, · · · n, hy = . (4.1.4)
n
We want to find an approximate solution U_{ij} to the exact solution u(x_i, y_j) at all the grid
points (x_i, y_j) where u is unknown. So there are (m − 1)(n − 1) unknowns for the
Dirichlet boundary condition.

• Step 2: Substitute the partial derivatives with a finite difference formula in terms of
the function values at grid points to get.

u(xi−1 , yj ) − 2u(xi , yj ) + u(xi+1 , yj ) u(xi , yj−1 ) − 2u(xi , yj ) + u(xi , yj+1 )


+
(hx )2 (hy )2
= fij + Tij , i = 1, · · · m − 1, j = 1, · · · n − 1,

where fij = f (xi , yj ). The local truncation error satisfies

       T_{ij} ∼ (h_x² / 12) ∂⁴u/∂x⁴ + (h_y² / 12) ∂⁴u/∂y⁴.                       (4.1.5)

Define

h = max{ hx , hy } (4.1.6)

The finite difference discretization is consistent if

       lim_{h→0} ||T|| = 0.                                                      (4.1.7)

Therefore the discretization is consistent and second order accurate.



If we remove the error term in the equation above, and replace the exact solution
u(xi , yj ) with the approximate solution Uij which is the solution of the linear system
of equations
 
Ui−1,j + Ui+1,j Ui,j−1 + Ui,j+1 2 2
+ − + Uij = fij (4.1.8)
(hx )2 (hy )2 (hx )2 (hy )2
The finite difference scheme at a grid point (xi , yj ) involves five grid points, east,
north, west, south, and the center. The center is called the master grid point.

• Solve the linear system of equations to get an approximate solution at grid points
(how?).

• Error analysis, implementation, visualization etc.

4.1.1 Matrix-vector form of the finite difference equations.

Generally, if one wants to use a direct method such as Gaussian elimination method or sparse
matrix techniques, then one needs to find out the matrix structure. If one uses an iterative
method, such as the Jacobi, Gauss-Seidel, or SOR(ω) method, then it may not be necessary to
form the matrix-vector system explicitly.
In the matrix vector form AU = F, the unknown is a one dimensional array. For the two
dimensional Poisson equations, the unknowns Uij are a two dimensional array. Therefore we
need to order it to get a one dimensional array. We also need to order the finite difference
equations. It is common practice that we use the same ordering for the equations and for
the unknowns.
There are two commonly used ordering. One is called the natural ordering that fits
sequential computers. The other one is called the red and black ordering that fits parallel
computers.

7 8 9 4 9 5

4 5 6 7 3 8

1 2 3 1 6 2

Figure 4.1: The natural ordering (left) and the red-black ordering (right).

The natural row ordering.

In the natural row ordering, we order the unknowns/equations row-wise, therefore the k-th
equation corresponding to (i, j) with the following relation

k = i + (m − 1)(j − 1), i = 1, 2, · · · , m − 1, j = 1, 2, · · · , n − 1. (4.1.9)

We use the following example to verify the matrix-vector form of the finite difference
equations.
Assume that hx = hy = h, m = n = 4, so we will have nine equations and nine
unknowns. The coefficient matrix is 9 by 9! To write down the matrix-vector form, we use
a one-dimensional array x to express the unknown Uij .

x1 = U11 , x2 = U21 , x3 = U31 , x4 = U12 , x5 = U22 ,


(4.1.10)
x6 = U32 , x7 = U13 , x8 = U23 , x9 = U33 .
If we order the equations the same way as we order the unknowns, then the nine equations
from the standard central finite difference scheme using the five point stencil are
    (1/h²)(−4x1 + x2 + x4) = f11 − (u01 + u10)/h²,
    (1/h²)(x1 − 4x2 + x3 + x5) = f21 − u20/h²,
    (1/h²)(x2 − 4x3 + x6) = f31 − (u30 + u41)/h²,
    (1/h²)(x1 − 4x4 + x5 + x7) = f12 − u02/h²,
    (1/h²)(x2 + x4 − 4x5 + x6 + x8) = f22,
    (1/h²)(x3 + x5 − 4x6 + x9) = f32 − u42/h²,
    (1/h²)(x4 − 4x7 + x8) = f13 − (u03 + u14)/h²,
    (1/h²)(x5 + x7 − 4x8 + x9) = f23 − u24/h²,
    (1/h²)(x6 + x8 − 4x9) = f33 − (u34 + u43)/h².

Now we can write down the coefficient matrix easily. It is block tridiagonal and has the
following form:

    A = (1/h²) [ B  I  0 ]
               [ I  B  I ]                                                       (4.1.11)
               [ 0  I  B ]

where I is the 3 × 3 identity matrix and

    B = [ −4   1   0 ]
        [  1  −4   1 ].
        [  0   1  −4 ]

For a general n by n grid, we have

    A = (1/h²) [ B  I             ]            B = [ −4   1             ]
               [ I  B  I          ],               [  1  −4   1         ].
               [    ⋱  ⋱   ⋱     ]                [      ⋱   ⋱   ⋱    ]
               [          I   B   ]                [           1   −4   ]
Note that −A is a symmetric positive definite matrix and it is weakly diagonally dominant.
Therefore A is non-singular and there is a unique solution.
The matrix-vector form is useful to understand the structure of the linear system of
equations, and it may be necessary if a direct method (such as Gaussian elimination) or
sparse matrix techniques are used for solving the system. However, it is more convenient
sometimes to use the two parameters system (i, j), especially if an iterative method is used
to solve the system. It is more intuitive and useful to visualize the data using two index
system.
The eigenvalues and eigenvectors of A can be indexed by two parameters p and k corre-
sponding to wave numbers in the x and y directions. The (p, k)-th eigenvector u^{p,k} has n²
components for the n² by n² matrix of the form above:

    u^{p,k}_{ij} = sin(pπih) sin(kπjh),   i, j = 1, 2, · · · , n,                (4.1.12)

for p, k = 1, 2, · · · , n. The corresponding eigenvalues are

    λ^{p,k} = (2/h²) [ (cos(pπh) − 1) + (cos(kπh) − 1) ].                        (4.1.13)

The least dominant eigenvalue (the smallest in magnitude) is

    λ^{1,1} = −2π² + O(h²).                                                      (4.1.14)

The dominant eigenvalue (the largest in magnitude) satisfies

    λ^{n/2,n/2} ∼ −4/h².                                                         (4.1.15)

Therefore we have the following estimates:

    ||A||_2 ∼ max_{p,k} |λ^{p,k}| ∼ 4/h²,    ||A^{-1}||_2 = 1 / min_{p,k} |λ^{p,k}| ∼ 1/(2π²),

    cond_2(A) = ||A||_2 ||A^{-1}||_2 ∼ 2/(π²h²) = O(n²).                         (4.1.16)

Since the condition number is large, we should use double precision to reduce the effect of
round-off errors.
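A sketch (not from the notes) that assembles the matrix above with Kronecker products for a
general n and checks that the condition number grows like O(n²); the construction with kron and
spdiags is one of several possible choices.

    n = 8;  h = 1/(n+1);                               % hx = hy = h
    I = speye(n);
    T = spdiags(ones(n,1)*[1 -4 1], -1:1, n, n);       % the block B
    S = spdiags(ones(n,1)*[1  0 1], -1:1, n, n);       % couples neighboring blocks
    A = (kron(I,T) + kron(S,I))/h^2;                   % natural row ordering
    fprintf('n = %d, cond2(A) = %9.3e, 1/h^2 = %9.3e\n', n, cond(full(A)), 1/h^2);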

4.2 Basic iterative methods for solving linear system of equa-


tions

The idea of iterative methods is to start with an initial guess and then improve the solution
iteratively. The first step is to rewrite the original equation f(x) = 0 in an equivalent form
x = g(x); then we can form the iteration x^{(k+1)} = g(x^{(k)}). For example, finding ∛5 is
equivalent to solving the equation x³ − 5 = 0. This equation can be rewritten as x = 5/x² or
x = x − (x³ − 5)/(3x²). For the second form, the iteration

    x^{(k+1)} = x^{(k)} − ( (x^{(k)})³ − 5 ) / ( 3 (x^{(k)})² ),   k = 0, 1, · · ·

is Newton's iterative method. The mathematical theory behind this is fixed point theory.
For a linear system of equations Ax = b, we hope to re-write it as an equivalent form
x = Rx + c so that we can form an iteration x(k+1) = Rx(k) + c given an initial guess x0 .
We want to choose R and c such that lim_{k→∞} x^{(k)} = x_e = A^{-1} b. A common method is
called the splitting approach in which we re-write the matrix A as

A = M − K, det(M ) 6= 0. (4.2.1)

Then Ax = b can be written as (M − K)x = b, or M x = Kx + b, or x = M −1 Kx + M −1 b,


or x = Rx + c, where R = M −1 K is called the iteration matrix and c = M −1 b is a constant
vector. The iterative process is then given an initial guess x(0) , we can get a sequence of
{x(k) } according to
x(k+1) = Rx(k) + c (4.2.2)

We first discuss three basic iterative methods for solving Ax = b. To derive the three
methods, we write the matrix A as A = D − L − U, where

    D = diag(a11, a22, · · · , ann),

    L = − [  0                            ]        U = − [ 0  a12  · · ·  · · ·  a1n ]
          [ a21   0                       ],             [    0    · · ·  · · ·  a2n ]
          [ a31  a32   0                  ]              [         ⋱             ⋮  ]
          [  ⋮    ⋮        ⋱              ]              [               ⋱       ⋮  ]
          [ an1  an2  · · ·  an,n−1   0   ]              [                       0   ]

that is, D is the diagonal part of A, −L is the strictly lower triangular part, and −U is the
strictly upper triangular part.

4.3 The Jacobi iterative method: Solve for the diagonals

The matrix vector form of the The Jacobi iterative method can de derived as follows:

(D − L − U )x = b

Dx = (L + U )x + b

x = D−1 (L + U )x + D−1 b

x(k+1) = D−1 (L + U )x(k) + D−1 b.

The component form can be written as

    x_i^{(k+1)} = ( b_i − Σ_{j=1, j≠i}^{n} a_{ij} x_j^{(k)} ) / a_{ii},   i = 1, 2, · · · , n.        (4.3.1)

The component form is useful for implementation while the matrix-vector form is good
for convergence analysis.
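A hedged MATLAB sketch of the component form (4.3.1); the function name my_jacobi and the
stopping test on consecutive iterates are illustrative choices.

    function [x, k] = my_jacobi(A, b, x0, tol, kmax)
    n = length(b);  x = x0;  k = 0;
    while k < kmax
        xold = x;
        for i = 1:n
            x(i) = (b(i) - A(i,[1:i-1, i+1:n])*xold([1:i-1, i+1:n]))/A(i,i);
        end
        k = k + 1;
        if norm(x - xold) <= tol, break; end
    end
    end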

4.4 The Gauss-Seidel iterative method: Use the most up-


dated

In the Jacobi iterative method, when we compute x_2^{(k+1)}, we have already computed x_1^{(k+1)}.
Assuming that x_1^{(k+1)} is a better approximation than x_1^{(k)}, why not use x_1^{(k+1)} instead of
x_1^{(k)} when we update x_2^{(k+1)}? With this idea, we get a new iterative method, the
Gauss-Seidel iterative method for solving Ax = b. The component form is

    x_i^{(k+1)} = ( b_i − Σ_{j=1}^{i−1} a_{ij} x_j^{(k+1)} − Σ_{j=i+1}^{n} a_{ij} x_j^{(k)} ) / a_{ii},   i = 1, 2, · · · , n.       (4.4.1)

To derive the matrix-vector form of the Gauss-Seidel iterative method, we rewrite the
component form above in the form ( )^{(k+1)} = ( )^{(k)} + ( ). The component form above
is equivalent to

    Σ_{j=1}^{i−1} a_{ij} x_j^{(k+1)} + a_{ii} x_i^{(k+1)} = b_i − Σ_{j=i+1}^{n} a_{ij} x_j^{(k)},   i = 1, 2, · · · , n,       (4.4.2)

which is the component form of the following system of equations:

    (D − L) x^{(k+1)} = U x^{(k)} + b,    or    x^{(k+1)} = (D − L)^{-1} U x^{(k)} + (D − L)^{-1} b.

Thus the iteration matrix of the Gauss-Seidel iterative method is (D − L)^{-1} U, and the
constant vector is c = (D − L)^{-1} b.

4.4.1 Implementation details

An iterative method is in principle an infinite process; on a computer we have to stop it
after finitely many steps. One or several of the following stopping criteria are used:

• ||x^{(k+1)} − x^{(k)}|| ≤ tol.

• ||x^{(k+1)} − x^{(k)}|| / ||x^{(k)}|| ≤ tol.

• kr(x(k) )k ≤ tol.

• k ≥ kmax ,

where tol and kmax are two given parameters.

4.4.2 Pseudo-code of the Gauss-Seidel iterative method:

function [x,k] = my_gs(n,a,b,x0,tol)
error = 1e5; x = x0; k = 0;
while error > tol
  for i=1:n
    x(i) = b(i);
    for j=1:n
      if j ~= i
        x(i) = x(i) - a(i,j)*x(j);
      end
    end
    x(i) = x(i)/a(i,i);      % divide by the diagonal once, after the inner loop
  end
  error = norm(x-x0);        % default is the 2-norm
  x0 = x;  k = k+1;          % replace the old value, advance the counter
end                          % end while

4.4.3 The Gauss-Seidel iterative method for 2-point boundary value prob-
lem

Following Section 1.3, we have

    ( U_{i−1} − 2U_i + U_{i+1} ) / h² = f(x_i),   i = 1, 2, · · · , n − 1.

If we use the same ordering for the equations and unknowns, then the diagonals are always
−2/h², and the Jacobi iteration is simply

    U_i^{(k+1)} = ( U_{i−1}^{(k)} + U_{i+1}^{(k)} ) / 2 − (h²/2) f(x_i),   i = 1, 2, · · · , n − 1.

No matrix is formed (only matrix-vector operations are needed) and no ordering is neces-
sary (assuming that the equations and unknowns have the same ordering). For the Gauss-
Seidel iteration, going from the k-th to the (k + 1)-th iterate we can start with U^{(k+1)} = U^{(k)}
and sweep

    U_i^{(k+1)} = ( U_{i−1}^{(k+1)} + U_{i+1}^{(k+1)} ) / 2 − (h²/2) f(x_i),   i = 1, 2, · · · , n − 1.

If U_{i±1}^{(k+1)} has not yet been updated, it still holds the value from the k-th iteration;
otherwise it holds the most updated one.

4.4.4 The Gauss-Seidel iterative method for the finite difference method
for Poisson equation

For the Poisson equation uxx + uyy = f(x, y), if we use the standard 5-point central finite
difference scheme

    ( U_{i−1,j} + U_{i+1,j} − 4U_{ij} + U_{i,j−1} + U_{i,j+1} ) / h² = f(x_i, y_j)

and the same ordering for the equations and unknowns, then the Jacobi iteration is

    U_{ij}^{(k+1)} = ( U_{i−1,j}^{(k)} + U_{i+1,j}^{(k)} + U_{i,j−1}^{(k)} + U_{i,j+1}^{(k)} ) / 4 − (h²/4) f(x_i, y_j),
          i = 1, 2, · · · , n − 1,   j = 1, 2, · · · , n − 1,

if the solution is prescribed along the boundary (Dirichlet BC). Again, no matrix is needed and
no ordering is necessary. We do not even need to transform the two-dimensional array into a
one-dimensional one, so the implementation is rather simple; a Gauss-Seidel sweep is obtained by
simply overwriting U_{ij} in place, as sketched below.
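Here is a sketch (not from the notes) of one Gauss-Seidel sweep for the 5-point scheme. U is
assumed to be an (n+1) by (n+1) array already holding the boundary values, F holds f(x_i, y_j),
and the interior values are overwritten in place, so the most updated neighbors are used
automatically.

    for j = 2:n
        for i = 2:n
            U(i,j) = (U(i-1,j) + U(i+1,j) + U(i,j-1) + U(i,j+1) - h^2*F(i,j))/4;
        end
    end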

4.5 The successive over-relaxation (SOR(ω)) iterative method

The Jacobi and Gauss-Seidel methods can be quite slow. The SOR(ω) iterative method is an
acceleration obtained by choosing an appropriate parameter ω. The SOR(ω) iteration is

    x^{(k+1)} = (1 − ω) x^{(k)} + ω x̃^{(k+1)}_{GS}.                                     (4.5.1)

The component form is

    x_i^{(k+1)} = (1 − ω) x_i^{(k)} + ω ( b_i − Σ_{j=1}^{i−1} a_{ij} x_j^{(k+1)} − Σ_{j=i+1}^{n} a_{ij} x_j^{(k)} ) / a_{ii}.      (4.5.2)

Note that it is incorrect to compute the complete Gauss-Seidel result first and then do the
linear interpolation.

The idea of the SOR method is to combine x^{(k)} and x̃^{(k+1)}_{GS} to get a better approximation.
When ω ≤ 1, the new point (1 − ω)x^{(k)} + ω x̃^{(k+1)}_{GS} lies between x^{(k)} and x̃^{(k+1)}_{GS}, so it is
called interpolation and the iterative method is called under-relaxation. When ω > 1, the new
point lies outside the segment between x^{(k)} and x̃^{(k+1)}_{GS}, so it is called extrapolation and
the iterative method is called over-relaxation. Since the approach is applied at every iteration,
it is called the successive over-relaxation (SOR) method.
To derive the matrix-vector form of the SOR(ω) method, we write its component form as

    a_{ii} x_i^{(k+1)} + ω Σ_{j=1}^{i−1} a_{ij} x_j^{(k+1)} = a_{ii}(1 − ω) x_i^{(k)} + ω b_i − ω Σ_{j=i+1}^{n} a_{ij} x_j^{(k)}.      (4.5.3)

This is equivalent to

    (D − ωL) x^{(k+1)} = ( (1 − ω)D + ωU ) x^{(k)} + ω b.

Thus the iteration matrix and constant vector of the SOR(ω) method are

    R_{SOR}(ω) = (D − ωL)^{-1} ( (1 − ω)D + ωU ),      c_{SOR} = ω (D − ωL)^{-1} b.        (4.5.4)

Theorem 4.5.1 A necessary condition for SOR(ω) method to converge is 0 < ω < 2.

4.6 Convergence of basic iteration methods x(k+1) = Rx(k) + c

Using an iterative method, we will get a vector sequence of {x(k) } and we know how to tell
whether it is convergent or not. However, for an iterative method, we need to consider all
possible initial guesses and constant vector c.
If the vector sequence of {x(k) } converges to x∗ , then by taking limit on both sides of
the iterative scheme, we have
x∗ = Rx∗ + c. (4.6.1)

The above equality is called the consistency condition.

Definition 4.6.1 The iteration methods x(k+1) = Rx(k) + c is convergent if for any initial
guess x(0) and constant vector c, the vector sequence of {x(k) } converges to the solution of
the system of equations x∗ = Rx∗ + c.

Now we discuss a few sufficient conditions that guarantee convergence of a basic iterative
method.

Theorem 4.6.1 If there is an associated matrix norm such that kRk < 1, then the iteration
method x(k+1) = Rx(k) + c converges.

Proof: Let e^{(k)} = x^{(k)} − x^*. From the iterative method x^{(k+1)} = Rx^{(k)} + c and the
consistency condition x^* = Rx^* + c, we have

    e^{(k+1)} = R e^{(k)},

    0 ≤ ||e^{(k+1)}|| = ||R e^{(k)}|| ≤ ||R|| ||e^{(k)}|| ≤ ||R||² ||e^{(k−1)}|| ≤ · · · ≤ ||R||^{k+1} ||e^{(0)}||.

Thus we conclude that lim_{k→∞} ||e^{(k)}|| = 0, or equivalently, lim_{k→∞} x^{(k)} = x^*.
Example: Given

    R = [  0.9   0.05 ]
        [ −0.8  −0.1  ],

does the iterative method x^{(k+1)} = Rx^{(k)} + c converge?
We easily get ||R||_1 = 1.7, which leads to no conclusion, and ||R||_∞ = 0.95 < 1, which leads
to the conclusion that the iterative method converges.

4.6.1 Convergence speed

In the theorem above, the k-th error depends on the initial error, which we do not know. The
following error estimate does not need the initial error.

Theorem 4.6.2 If there is an associated matrix norm such that ||R|| < 1, then we have the
following error estimate for the iteration x^{(k+1)} = Rx^{(k)} + c:

    ||e^{(k)}|| ≤ ( ||R||^k / (1 − ||R||) ) ||x^{(1)} − x^{(0)}||.                       (4.6.2)

Proof: From the iterative method x^{(k+1)} = Rx^{(k)} + c we also have x^{(k)} = Rx^{(k−1)} + c.
Subtracting the two, we get

    x^{(k+1)} − x^{(k)} = R( x^{(k)} − x^{(k−1)} ) = R²( x^{(k−1)} − x^{(k−2)} ) = · · · = R^k ( x^{(1)} − x^{(0)} ).

Since

    x^{(k+1)} − x^{(k)} = x^{(k+1)} − x^* + x^* − x^{(k)} = e^{(k+1)} − e^{(k)} = (R − I) e^{(k)},

combining the two equalities above we get

    −(I − R) e^{(k)} = R^k ( x^{(1)} − x^{(0)} ).

This leads to

    ||e^{(k)}|| = ||(I − R)^{-1} R^k ( x^{(1)} − x^{(0)} )||.

Finally, from Banach's lemma, we have

    ||e^{(k)}|| ≤ ( ||R||^k / (1 − ||R||) ) ||x^{(1)} − x^{(0)}||.

4.6.2 Other sufficient conditions using the original matrix A

Theorem 4.6.3 If A is strictly row diagonally dominant matrix, then both Jacobi and
Gauss-Seidel methods converge. The Gauss-Seidel method converges faster in the sense that

kRGS k∞ ≤ kRJ k∞ < 1

Proof: The proof of the first part is easy. For the Jacobi method, we have R_J =
D^{-1}(L + U); thus

    ||R_J||_∞ = max_i Σ_{j=1, j≠i}^{n} |a_{ij}| / |a_{ii}| < 1.

The proof for the Gauss-Seidel method is not trivial and long, we refer the readers to
the book [J. W. Demmel] on page 287-288.
For general matrices, it is unclear whether the Jacobi or Gauss-Seidel method converges
faster even if they both converge.

Theorem 4.6.4 If A is a symmetric positive definite (SPD) matrix, then the SOR(ω)
method converges for 0 < ω < 2.

Again we refer the readers to the book [J. W. Demmel] on page 290-291.

Theorem 4.6.5 If A is a weakly row diagonally dominant matrix,

    Σ_{j=1, j≠i}^{n} |a_{ij}| ≤ |a_{ii}|,   i = 1, 2, · · · , n,

with at least one strict inequality, and A is irreducible, that is, there is no permutation matrix
P such that

    P A P^T = [ A11  A12 ]
              [  0   A22 ],

then both Jacobi and Gauss-Seidel methods converge. The Gauss-Seidel method converges
faster in the sense that

kRGS k∞ ≤ kRJ k∞ < 1.

Lemma 4.6.1 A is irreducible if the graph of the matrix is strongly connected.



4.6.3 A sufficient and necessary condition

The spectral radius of a matrix A is defined as

    ρ(A) = max_i |λ_i(A)|.                                                       (4.6.3)

Note that ρ(A) ≤ ||A|| for any associated (induced) matrix norm. The proof is quite simple.
Let λ_{i*} be the eigenvalue of A such that ρ(A) = |λ_{i*}| and let x^* ≠ 0 be a corresponding
eigenvector; then Ax^* = λ_{i*} x^*. Thus |λ_{i*}| ||x^*|| ≤ ||A|| ||x^*||. Since ||x^*|| ≠ 0, we get
ρ(A) ≤ ||A||.

Theorem 4.6.6 An iteration method x(k+1) = Rx(k) + c converges for arbitrary x(0) and
c if and only if ρ(R) < 1.

Proof: Part A: If the iterative method converges, then ρ(R) < 1. This can be shown by
contradiction. Assume that ρ(R) > 1, and let Rx^* = λ_{i*} x^* with ρ(R) = |λ_{i*}| > 1 and
||x^*|| ≠ 0. If we set x^{(0)} = x^* and c = 0, then we have x^{(k+1)} = λ_{i*}^{k+1} x^*, which does
not have a limit since |λ_{i*}|^{k+1} → ∞. The case ρ(R) = 1 is left as an exercise.
Proof: Part B: If ρ(R) < 1, then the iterative method converges for arbitrary x^{(0)} and c.
The key in the proof is to find a matrix norm such that ||R|| < 1.
From linear algebra (Jordan's theorem) we know that any square matrix R is similar to a
Jordan canonical form, that is, there is a nonsingular matrix S such that

    S^{-1} R S = diag( J_1, J_2, · · · , J_p ),        J_i = [ λ_i  1             ]
                                                             [      λ_i  1        ]
                                                             [           ⋱    ⋱  ]
                                                             [                 1  ]
                                                             [               λ_i  ].

Note that ||S^{-1} R S||_∞ = ρ(R) < 1 if all Jordan blocks are 1 by 1; otherwise
||S^{-1} R S||_∞ ≤ ρ(R) + 1. Since ρ(R) < 1, we can find ε > 0 such that ρ(R) + ε < 1, say
ε = (1 − ρ(R))/2. Consider a particular Jordan block J_i and assume it is a k by k matrix. Let

    D_i(ε) = diag( 1, ε, ε², · · · , ε^{k−1} ),   so that   D_i(ε)^{-1} J_i D_i(ε) = [ λ_i  ε            ]
                                                                                      [      λ_i  ε       ]
                                                                                      [           ⋱    ⋱ ]
                                                                                      [                 ε ]
                                                                                      [               λ_i ].

Using these diagonal matrices, we form

    D(ε) = diag( D_1(ε), D_2(ε), · · · , D_p(ε) ),

so that ||D(ε)^{-1} S^{-1} R S D(ε)||_∞ ≤ ρ(R) + ε < 1.
Finally we define a new matrix norm by

    ||R||_{new} = ||D(ε)^{-1} S^{-1} R S D(ε)||_∞.                                (4.6.4)

It is easy to show that this definition is indeed an associated matrix norm; since ||R||_{new} ≤
ρ(R) + ε < 1, we conclude that the iterative method converges for any initial guess and vector c.

4.7 Discussion for the Poisson equations, what is the best ω?

For the finite difference methods for the Poisson equation in 1D and 2D, we can find the
eigenvalues of the coefficient matrix, which also lead to the eigenvalues of the iteration
matrices (Jacobi, Gauss-Seidel, SOR(ω)). For the 1D model problem u''(x) = f(x) with the
Dirichlet boundary condition (u(0) and u(1) prescribed), the coefficient matrix is

    A_{FD} = (1/h²) [ −2   1              ]            A = [ −2   1              ]
                    [  1  −2   1          ],               [  1  −2   1          ].
                    [      ⋱   ⋱   ⋱     ]                [      ⋱   ⋱   ⋱     ]
                    [           1   −2    ]                [           1   −2    ]

The matrix is weakly row diagonally dominant (not strictly) and irreducible; −A is symmetric
positive definite, i.e., A is symmetric negative definite.

Theorem 4.7.1 For the above n by n matrix A, we have

    ρ(R_J) = max_{1≤i≤n} | 1 + λ_i(A)/2 |,                                       (4.7.1)

where λ_i(A), i = 1, 2, · · · , n, are the eigenvalues of the matrix A and R_J is the iteration
matrix of the Jacobi method.

Proof: We know that R_J = D^{-1}(L + U) and D = −2I. Let λ be an eigenvalue of R_J;
then det(λI − R_J) = 0, or det(λI − D^{-1}(L + U)) = 0, or det(D^{-1}(λD − (L + U))) = 0,
or det(D^{-1}) det(λD − (L + U)) = 0. Since det(D^{-1}) ≠ 0, we have det(λD − (L + U)) = 0, or
det((λ − 1)D + D − (L + U)) = 0, or det((λ − 1)D + A) = 0, or det((1 − λ)D − A) = 0, or
det(−2(1 − λ)I − A) = 0 since D = −2I. Thus −2(1 − λ) is an eigenvalue of A, or

    λ_i(R_J) = 1 + λ_i(A)/2,   i = 1, 2, · · · , n.

From this relation, the theorem follows right away.
The following lemma gives the eigenvalues of a tri-diagonal matrix.

Lemma 4.7.1 Let A be the following n by n matrix:

    A = [ α   β            ]
        [ β   α   β        ]
        [     ⋱   ⋱   ⋱   ].
        [          β   α   ]

The eigenvalues of A are

    λ_k = α + 2β cos( kπ/(n + 1) ),   k = 1, 2, · · · , n,

and the eigenvector corresponding to λ_k is

    χ_{k,j} = sin( kjπ/(n + 1) ),   j = 1, 2, · · · , n.

Proof: It is easy to check that A χ_k = λ_k χ_k.
When α = −2, β = 1, we have

    λ_k = −2 + 2 cos( kπ/(n + 1) ),   k = 1, 2, · · · , n.

Thus the spectral radius of the Jacobi iteration matrix for the 1D Poisson problem is

    ρ(R_J) = max_{1≤k≤n} | 1 + λ_k(A)/2 | = max_{1≤k≤n} | 1 − ( 1 − cos(kπ/(n + 1)) ) |
           = max_{1≤k≤n} | cos( kπ/(n + 1) ) | = cos( π/(n + 1) ) ∼ 1 − (1/2) ( π/(n + 1) )².

We can see that as n gets larger, the spectral radius gets closer to one, indicating slower
convergence.
Once we know the spectral radius, we also know roughly the number of iterations needed to
reach a desired accuracy. For example, if we wish to have roughly six significant digits, we
should require ρ(R)^k ≤ 10^{-6}, that is, k ≥ −6 / log_{10}( ρ(R) ) = 6 / ( −log_{10} ρ(R) ).

4.7.1 Finite difference method for the Poisson equation in two dimen-
sions

In two space dimensions we have parallel results. The eigenvalues of the matrix A = h² A_{FD},
an N by N matrix with N = n², are

    λ_{i,j} = − ( 4 − 2 ( cos( iπ/(n + 1) ) + cos( jπ/(n + 1) ) ) ),   i, j = 1, 2, · · · , n.       (4.7.2)

The diagonals of the matrix are −4 and we have

    λ_{i,j}(R_J) = 1 + λ_{i,j}(A)/4,   i, j = 1, 2, · · · , n.

Thus

    ρ(R_J) = max_{1≤i,j≤n} | 1 + λ_{i,j}(A)/4 | = max_{1≤i,j≤n} | (1/2)( cos( iπ/(n + 1) ) + cos( jπ/(n + 1) ) ) |
           = cos( π/(n + 1) ) ∼ 1 − (1/2) ( π/(n + 1) )².

We see that the results in 1D and 2D are pretty much the same. To derive the best ω for the
SOR(ω) method, we need the eigenvalue relation between the original matrix and the iteration
matrix. Note that R_{SOR} = (D − ωL)^{-1}((1 − ω)D + ωU).

Theorem 4.7.2 The optimal ω for the SOR(ω) method applied to the system of equations derived
from the finite difference method for the Poisson equation is

    ω_{opt} = 2 / ( 1 + √(1 − ρ(R_J)²) ) = 2 / ( 1 + √(1 − cos²(π/(n+1))) )
            = 2 / ( 1 + sin(π/(n+1)) ) ∼ 2 / ( 1 + π/(n+1) ).                            (4.7.3)
Below is the sketch of the proof.

• Step 1. Show the eigenvalue relation between RJ and RSOR (ω)


r
1 ω 2 λ2J
λSOR (ω) = 1 − ω + ω 2 λ2J ± ωλJ 1−ω+ (4.7.4)
2 4

• Step 2. Find the extreme value (minimum) of above as a function of λJ which leads
to the optimal ω.

We refer the readers to the book [J. W. Demmel] on page 292-293 for the proof.

Remark 4.7.1

• The spectral radius of R_{SOR}(ω) is

      ρ(R_{SOR}) = { 1 − ω + (1/2) ω² ρ(R_J)² + ω ρ(R_J) √( 1 − ω + ω² ρ(R_J)²/4 ),   0 < ω ≤ ω_{opt},
                  { ω − 1,                                                            ω_{opt} < ω < 2.

  If we plot ρ(R_{SOR}) against ω, we see a quadratic-looking curve for 0 < ω < ω_{opt} that
  flattens as ω approaches ω_{opt}, which means the method is not very sensitive in the
  neighborhood of ω_{opt}, while the second piece is a linear function. Thus we would rather
  choose ω a little larger than a little smaller.

• The optimal ω above is only for the Poisson equation, not for other elliptic problems.
  However, it gives a good indication of the best ω when diffusion is dominant relative to the
  mesh size. A small numerical experiment illustrating the effect of ω is sketched below.
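A small experiment (not from the notes) illustrating the two remarks for the 1D model problem:
iteration counts of SOR(ω) for a few values of ω, including the ω_opt formula above; the
tolerance and problem size are arbitrary choices.

    n = 50;  h = 1/(n+1);
    A = spdiags(ones(n,1)*[1 -2 1], -1:1, n, n);   % 1D model matrix (unscaled)
    b = h^2*ones(n,1);
    wopt = 2/(1 + sin(pi*h));
    for w = [1.0 1.5 wopt 1.95]
        x = zeros(n,1);  k = 0;
        while norm(b - A*x) > 1e-8 && k < 20000
            for i = 1:n                            % one SOR(w) sweep
                xgs  = (b(i) - A(i,1:i-1)*x(1:i-1) - A(i,i+1:n)*x(i+1:n))/A(i,i);
                x(i) = (1-w)*x(i) + w*xgs;
            end
            k = k + 1;
        end
        fprintf('omega = %5.3f, iterations = %d\n', w, k);
    end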

4.8 Exercises

1. Let A be a symmetric positive definite matrix, so it has the Cholesky decomposition
   A = LL^T. Show that (a): 0 < l_{kk} ≤ √(a_{kk}), k = 1, 2, · · · , n. (b): From (a),
   derive max_{1≤j≤i≤n} |l_{ij}| ≤ √( max_{1≤i,j≤n} |a_{ij}| ). That is, the Cholesky
   decomposition is a stable algorithm. (c): Do the Jacobi, Gauss-Seidel, and SOR(ω) iterative
   methods converge?

2. Consider the Poisson equation

uxx + uyy = xy, (x, y) ∈ Ω


u(x, y)|∂Ω = 0,

where Ω is the unit square. Using the finite difference method, we can get a linear
system of equations

Ui−1,j + Ui+1,j + Ui,j−1 + Ui,j+1 − 4Uij


= f (xi , yj ), 1 ≤ i, j ≤ 3, (4.8.1)
h2

where h = 1/4, xi = ih, yj = jh, i, j = 0, 1, 2, 3, 4, and Uij is an approximation of


u(xi , yj ). Write down the coefficient matrix and the right hand side using the red-black
orderings given in the right diagram below. What is the dimension of the coefficient
matrix? How many nonzero entries and how many zeros? Generalize your results
to general case when 0 ≤ i, j ≤ n and h = 1/n. Write down the component form
of the SOR(ω) iterative method. Does the SOR(ω) iterative method depend on the
ordering? From your analysis, explain whether you prefer to use Gaussian elimination
method or an iterative method.

7 8 9 4 9 5

4 5 6 7 3 8

1 2 3 1 6 2

3. Given the following linear system of equations:

3x1 − x2 + x3 = 3
2x2 + x3 = 2
−x2 + 2x3 = 2

(a) With x(0) = [1, −1, 1]T , find the first iteration of the Jacobi, Gauss-Seidel, and
SOR (ω = 1.5) methods.
(b) Write down the Jacobi, Gauss-Seidel, and Seidel iteration matrices RJ , RGS ,
and SOR(ω).
(c) Do the Jacobi and Gauss-Seidel iterative methods converge? Why?

4. Explain when we want to use iterative methods to solve linear system of equations
Ax = b instead of direct methods.

Also if ||R|| = 1/10, then the iterative method x(k+1) = R x(k) +c converges to the so-
lution x∗ ,
x∗ = Rx∗ +c. How many iterations are required so that ||x(k) −x∗ || ≤ 10−6 ? Suppose
||x(0) − x∗ || = O(1).

5. Judge whether the iterative method x(k+1) = R x(k) + c converges or not.


 
                 [ e^{-1}   −e        −1       −1        −10        ]
                 [   0    sin(π/4)    10⁴      −1         −1        ]          [ 0.9    0      0    ]
   (a) :   R  =  [   0       0       −0.1      −1          1        ],  (b) :  [  0    0.3   −0.7   ]
                 [   0       0         0    1 − e^{-2}     −1       ]          [  0    0.69  0.2999 ]
                 [   0       0         0        0     1 − sin(απ)   ]

6. Determine the convergence of the Jacobi and Gauss-Seidel method applied to the
system of equations Ax = b, where
 
                 [ 0.9   0   0 ]                 [ 3  −1   0   0  · · ·  · · ·  0 ]
   (a) :   A  =  [  0    1   2 ],   (b) :   A =  [ 2   3  −1   0  · · ·  · · ·  0 ]
                 [  0   −2   1 ]                 [ 0   2   3  −1  · · ·  · · ·  0 ]
                                                 [ · · ·     ⋱   ⋱   ⋱           ]
                                                 [ 0  · · ·  · · ·  0   2   3  −1 ]
                                                 [ 0  · · ·  · · ·  · · ·  0  2  3 ]

7. Modify the Matlab code poisson drive.m and poisson sor.m to solve the following
diffusion and convection equation:

uxx + uyy + aux − buy = f (x, y), 0 ≤ x, y ≤ 1,

Assume that solution at the boundary x = 0, x = 1, y = 0, y = 1 are given (Dirichlet


boundary conditions). The central-upwinding finite difference scheme is
    ( U_{i−1,j} + U_{i+1,j} + U_{i,j−1} + U_{i,j+1} − 4U_{ij} ) / h²  +  a ( U_{i+1,j} − U_{ij} ) / h  −  b ( U_{i,j} − U_{i,j−1} ) / h  =  f_{ij}.
(a) Assume the exact solution is u(x, y) = e2y sin(πx), find f (x, y).

(b) Use the u(x, y) above for the boundary condition and the f (x, y) above for the
partial differential equation. Let a = 1, b = 2, and a = 100, b = 2, solve the
problem with n = 20, 40, 80, and n = 160. Try ω = 1, the best ω for the
Poisson equation discussed in the class, the optimal ω by testing, for example
ω = 1.9, 1.8, · · · , 1.
(c) Tabulate the error, the number of iterations for n = 20, 40, 80, and n = 160 with
your tested optimal ω, compare the number of iterations with the Gauss-Seidel
method.
(d) Plot the solution and the error for n = 40 with your tested optimal ω. Label
your plots as well.

8. Perform Gauss elimination with partial column pivoting on the matrix:


 
       A = [ 1  1   1 ]
           [ 1  3  −3 ].
           [ 2  4   2 ]

(a) What is the LU factorization of P A (where P A = LU )?



(b) Compute the direct factorization A = LU .


(c) Approximately how many operations are required in (a) and (b).
(d) Compute the determinant of A from (a) and (b).
   (e) Use your factorization results from (a) and (b) to solve Ax = b, where b^T = [ 2  −4  6 ].
(f) Find kAkp , condp (A) for p = 1, 2, ∞.

9. When we solve a linear system Ax = b (det(A) ≠ 0) on a computer with machine epsilon ε
   using Gaussian elimination with partial column pivoting (GEPP), the final computed solution,
   say x_c, will be different from the true solution x_e = A^{-1} b.
   (a): Give an error estimate of the relative error. (b): Give an upper bound of the growth
   factor g(n). (c): Let the residual of x_c be r(x_c) = b − A x_c; show that

       ||x_e − x_c|| / ||x_c|| ≤ ||A^{-1}|| ||r(x_c)|| / ||x_c||.
10. List at least three kinds of matrices for which pivoting strategies may not be necessary
and explain why.

11. Suppose
     
         [ 1  0 0 0 0 0 ]          [ 1 0  0   0 0 0 ]          [ 1 0 0 0 0 0 ]
         [−4  1 0 0 0 0 ]          [ 0 1  0   0 0 0 ]          [ 0 1 0 0 0 0 ]
    L1 = [ 3  0 1 0 0 0 ],    L3 = [ 0 0  1   0 0 0 ],    P  = [ 0 0 0 0 1 0 ].
         [ 6  0 0 1 0 0 ]          [ 0 0 1/2  1 0 0 ]          [ 0 0 0 1 0 0 ]
         [−2  0 0 0 1 0 ]          [ 0 0 −1   0 1 0 ]          [ 0 0 1 0 0 0 ]
         [ 1  0 0 0 0 1 ]          [ 0 0 1/5  0 0 1 ]          [ 0 0 0 0 0 1 ]
(a) Can L1 or L3 be a Gauss transformation matrix with partial pivoting? Why?
(b) Compute L−1 −1 −1 −1
1 , L3 , L1 L3 , and L1 L3 .
(c) Compute P −1 , P T , P 2 , P L3 , and P L3 P .
(d) Find the condition number of each matrix.

12. For the following matrices


 
  4 1 0 1  
3 −1 α   1 γ −2
   1 α −1 1   
 
A =  −1 β 1/2  , B =  , C = 
 β 2 γ ,
  
 
   0 −1 β γ   
1 1/2 γ   −2 5 4
1 1 0 −2
can you choose the parameters so that the matrix is

(a) symmetric;
(b) strictly row diagonally dominant;
(c) symmetric positive definite?

13. Suppose A = LDLT , where L is a unit lower triangular matrix.

(a) Is A symmetric?
(b) When is A a symmetric positive definite matrix?
(c) Show that such a decomposition is possible if and only if the determinants of the
principal leading sub-matrices Ak of A are all non-zero for k = 1, 2, · · · n − 1.
(d) What are the orders of operations (multiplication/division, addition/subtraction)
needed for such decomposition?
(e) Can you get A = LLT factorization from A = LDLT if A is a S.P.D? How?

14. Derive A = LU decomposition, where U is a unit upper triangular matrix. That


is to derive the recursive relation for

lij , i = j, j + 1, · · · , n,
uij , j = i + 1, · · · , n,

(a) Write a pseudo code for your algorithm.


(b) How many operations (multiplications/divisions and addition/subtractions) are
required in your algorithm.
(c) Outline how to use such a decomposition to solve Ax = b and compute the
determinant of A.

15. Give a vector of x̄, and Ax = b. Derive the relation between the residual kb − Ax̄k
and the error kx̄ − A−1 bk.

16. For the following model matrices, what kind of matrix-factorization would you like to
use for solving the linear system of equations? Analyze your choices (operation count,
storage, pivoting etc).
 
    [ 3  −1   0   0   0 ]         [ 0.01   3   0  −4 ]
    [−1   3  −1   0   0 ]         [  1     2   1   2 ]
    [ 0  −1   3  −1   0 ],        [ −1     0   3  −2 ].
    [ 0   0  −1   3  −1 ]         [  5    −2   3   6 ]
    [ 0   0   0  −1   3 ]

17. Let A^{(2)} be the matrix obtained after one step of Gaussian elimination applied to a matrix
    A, that is,

        a_{ij}^{(2)} = a_{ij} − ( a_{i1}/a_{11} ) a_{1j}.

    (a) Show that

        max_{ij} |a_{ij}^{(2)}| ≤ 2 max_{ij} |a_{ij}|                            (4.8.2)

    if partial pivoting is used.

    (b) Show that without pivoting, (4.8.2) is still true if A is column diagonally dominant.

18. If we apply the Gauss-Seidel iteration backwards, that is, updating x_n, x_{n−1}, · · · , x_1,
    we get another Gauss-Seidel iterative method. Derive the matrix and vector form of such an
    iterative method. If we apply this iterative method and the (forward) Gauss-Seidel method
    alternately, the combined method is called the symmetric Gauss-Seidel method. The related
    SOR(ω) is called symmetric SOR(ω), or SSOR(ω). Write a pseudo-code.

19. (a): Show that if λ is an eigenvalue of A, then λ^k is an eigenvalue of A^k for any integer
    k (assuming that A^{-1} exists if k < 0). (b): Show that if A = A^T, then

        cond_2(A) = max_i |λ_i(A)| / min_i |λ_i(A)|.
20. If λ is an eigenvalue of a matrix A,

(a) show that |λ| ≤ ||A|| and so ρ(A) ≤ ||A||;


(b) Show that if there is a matrix norm such that ||R|| < 1, then the stationary
iterative method converges.
(c) In the SOR(ω) method, can we take ω = −0.5, 1.8, 3.4?

21. Given a stationary iterative method x_{k+1} = R x_k + c, show that (a): if there is one
matrix norm such that ||R|| < 1, then the iterative method converges. (b): If the
spectral radius ρ(R) > 1, then the iterative method diverges. (c): Show that if the
spectral radius ρ(R) < 1, then the iterative method converges.

22. Check the convergence of the Jacobi, Gauss-Seidel, and SOR(ω) methods if (a): A is a column
diagonally dominant matrix (strictly, weakly); (b): A is a symmetric positive (or negative)
definite matrix.

23. Consider the linear system of equations

    ( U_{i-1,j} + U_{i+1,j} + U_{i,j-1} + U_{i,j+1} − 4 U_{ij} ) / h^2 = f_{ij},   0 < i, j < n,

with zero boundary conditions (i = 0, or j = 0, or i = n, or j = n). (a): What is the
size of the linear system of equations if it is written as AU = F? (b): What are the
orders of ||A||_p and cond_2(A) in terms of n? (c): Show that

    ρ(R_J) = max_{1≤k≤n} ( 1 + λ_k(A)/4 ).

(d): * Show that the eigenvalues of A are

    λ_{i,j} = − ( 4 − 2 ( cos(iπ/n) + cos(jπ/n) ) ),   i, j = 1, 2, · · ·, n − 1.

(e): How many iterations do we need, in general, if we use the Jacobi, Gauss-Seidel, and
SOR(ω) methods, in terms of n?

24. Consider the linear system of equations (also consider the one- and three-dimensional cases)

    U_{i-1,j} + U_{i+1,j} + U_{i,j-1} + U_{i,j+1} − 4 U_{ij} + α U_{ij} = f_{ij},   0 < i, j < n − 1.        (4.8.3)

(a) Write down the Jacobi and SOR(ω) methods.


(b) What is the order of the number of iterations for the convergence when ω = 1,
and the optimal ω in terms of n when α = 0?
(c) When α = 0, write down the matrix-vector form of the linear system of equations
using the natural ordering; how about the red/black ordering? What is the relation
between the two coefficient matrices? Does SOR(ω) converge? For which ω, and
why?

25. Given the following linear system of equations:

3x1 − x2 + x3 = 3
2x2 + x3 = 2
−x2 + 2x3 = 2

(a) With x^{(0)} = [1, −1, 1]^T, find the first and second iterations of the Jacobi, Gauss-
Seidel, and SOR (ω = 1.5) methods.
(b) Write down the Jacobi and Gauss-Seidel iteration matrices RJ and RGS .
(c) Do the Jacobi and Gauss-Seidel iterative methods converge?

26. Extra credit. Do some research to explain the behavior using cond hw.m and pos-
sible ways of improving it.

Selected solutions

1 The coefficient matrix of the red-black ordering is

        [ D_1   B  ]
    A = [          ] ,     D_i = −(4/h^2) I,   i = 1, 2,
        [ B^T  D_2 ]

  where B is a sparse matrix in which each row has at most 4 non-zero entries. The size of
  the matrix is (n − 1)^2 by (n − 1)^2 with O(5 n^2) non-zero entries. The SOR(ω) update at
  the k-th iteration can be written as

    u_{ij}^{k+1} = (1 − ω) u_{ij}^{k} + (ω/4) ( u_{i-1,j}^{k+1} + u_{i+1,j}^{k+1} + u_{i,j-1}^{k+1} + u_{i,j+1}^{k+1} − h^2 f_{ij} ),   i, j = 1, 2, · · ·, n − 1,

  where a neighboring value with superscript k+1 is understood as u^{k} if it has not yet been
  updated in the current sweep (u_{ij}^{k+1} = u_{ij}^{k} until it is updated). The iterative method
  does not depend on the ordering of the equations and unknowns, but does depend on the
  indices i and j.

2 (a) The results of one iteration are

    x_J = [ 1/3, 1/2, 1/2 ]^T ,    x_GS = [ 1/3, 1/2, 5/4 ]^T ,    x_SOR(1.5) = [ 0, 5/4, 31/16 ]^T .


  (b) and (c) The iteration matrices are

          [ 0  1/3  -1/3 ]           [ 0  1/3  -1/3 ]
    R_J = [ 0   0   -1/2 ] ,  R_GS = [ 0   0   -1/2 ] .
          [ 0  1/2    0  ]           [ 0   0   -1/4 ]

  Since ||R_J||_∞ = 2/3 < 1 and ||R_GS||_∞ = 2/3 < 1, both iterative methods converge.
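  These numbers can be checked directly. Below is a small NumPy sketch (an illustration added to these notes, not part of the original solution); it builds the splitting A = D − L − U for the system of Problem 25, forms R_J and R_GS, and performs one step of each method starting from x^{(0)} = [1, −1, 1]^T.

    import numpy as np

    A = np.array([[3., -1., 1.],
                  [0.,  2., 1.],
                  [0., -1., 2.]])
    b = np.array([3., 2., 2.])

    D = np.diag(np.diag(A))
    L = -np.tril(A, -1)                 # strictly lower part (A = D - L - U convention)
    U = -np.triu(A, 1)                  # strictly upper part

    R_J  = np.linalg.inv(D) @ (L + U)   # Jacobi iteration matrix
    R_GS = np.linalg.inv(D - L) @ U     # Gauss-Seidel iteration matrix

    x0 = np.array([1., -1., 1.])
    x_J  = np.linalg.solve(D, b + (L + U) @ x0)       # one Jacobi step
    x_GS = np.linalg.solve(D - L, b + U @ x0)         # one Gauss-Seidel step
    w = 1.5                                           # one SOR(1.5) step
    x_SOR = np.linalg.solve(D - w * L, w * b + ((1 - w) * D + w * U) @ x0)

    print(x_J, x_GS, x_SOR)             # [1/3 1/2 1/2], [1/3 1/2 5/4], [0 5/4 31/16]
    print(np.linalg.norm(R_J, np.inf), np.linalg.norm(R_GS, np.inf))   # both 2/3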

3 For the first matrix, the eigenvalues of R are its diagonal entries. Notice that |a_{ii}| < 1 for
  i = 1, 2, 3, 4. We just need to check |a_{55}| = |1 − sin(απ)|. Note that 0 < sin x ≤ 1 if
  0 < x < π, and sin x is periodic with period 2π. Thus, if 2k < α < 2k + 1 for some integer k,
  the iterative method converges.
  For the second matrix we have ||R||_1 = 0.9999 < 1, so the iterative method converges.

4 (a) The iteration matrices are

          [ 0  0   0 ]            [ 0  0   0 ]
    R_J = [ 0  0  -2 ] ,  R_GS =  [ 0  0  -2 ] .
          [ 0  2   0 ]            [ 0  0  -4 ]

  Since ρ(R_J) = 2 > 1 and ρ(R_GS) = 4 > 1, both iterative methods diverge.
(b) The matrix is weakly diagonally dominant and irreducible. Both Jacobi and
Gauss-Seidel iterative methods converge.
Chapter 5

Computing algebraic eigenvalues and eigenvectors

5.1 Preliminary

Given a square matrix A ∈ R^{n×n}, if we can find a number λ ∈ C and x ≠ 0 such that
Ax = λx, then λ is called an eigenvalue of A and x is called an eigenvector corresponding to
λ. Note that if Ax = λx, then A(cx) = λ(cx) for any non-zero constant c; in other words, an
eigenvector is only determined up to a constant multiple. Often we prefer to use an eigenvector
with unit length (||x|| = 1). We call (λ, x) an eigen-pair if Ax = λx (x ≠ 0).

For an eigen-pair (λ, x), we have Ax − λx = 0. This means that (λI − A)x = 0 has
non-zero (hence non-unique) solutions, which indicates that λI − A is singular, that is, det(λI − A) = 0.
Thus λ must be a root of the characteristic polynomial of the matrix A, det(λI − A) =
λ^n + a_{n−1} λ^{n−1} + · · · + a_1 λ + a_0.

There are n eigenvalues (counting multiplicity) for an n by n square matrix. The eigenvalues can be
real or complex numbers and can be repeated roots. If the matrix is real, then complex eigenvalues
come in conjugate pairs; that is, if λ = a + bi is an eigenvalue, then λ̄ = a − bi is also an eigenvalue.
If A is a real and symmetric matrix, then all the eigenvalues are real numbers.

Eigenvectors corresponding to different eigenvalues are linearly independent. If
an eigenvalue λ* has multiplicity p, which means that the characteristic polynomial has the
factor (λ − λ*)^p but not the factor (λ − λ*)^{p+1}, then the number of linearly independent eigenvectors
corresponding to λ* is less than or equal to p; recall some examples from class. If an n by n
square matrix A has n linearly independent eigenvectors, then A is diagonalizable; that is,
there is a nonsingular matrix S such that S^{−1} A S = D, where D = diag(λ_1, λ_2, · · ·, λ_n) is
a diagonal matrix.
For convenience of discussion, we will use the following notation. We arrange the eigenvalues
of a matrix A according to

    |λ_1| ≥ |λ_2| ≥ · · · ≥ |λ_i| ≥ · · · ≥ |λ_n|.

Thus ρ(A) = |λ_1|; λ_1 is called the dominant eigenvalue (there can be more than one), while
λ_n is called the least dominant eigenvalue.
There are many applications of eigenvalue problems. Below are a few of them:

• Frequencies of vibration, resonance, etc.

• Spectral radius and convergence of iterative methods

• Discrete form of continuous eigenvalue problems, for example, the Sturm-Liouville
  problem

      −(p(x) u′(x))′ + q(x) u(x) = λ u(x),   0 < x < 1,
      u(0) = 0,   u(1) = 0.

After applying a finite difference or finite element method, we would have Ax = λx.
The solutions are the basis for the Fourier series expansion.

• Stability analysis of dynamic systems, or numerical methods.

5.2 The power method

The idea of the power method is as follows: starting from a non-zero vector x^{(0)} ≠ 0 (an approximation
to an eigenvector), form the iteration

    x^{(k+1)} = A x^{(k)} = A^2 x^{(k−1)} = · · · = A^{k+1} x^{(0)}.

Then, under some conditions, we can extract eigen-pair information from this sequence.
Under the assumption that λ_1 satisfies |λ_1| > |λ_2|, which is an essential condition, we
can show that x^{(k)} ∼ C x_1 (a multiple of the dominant eigenvector) and x_p^{(k+1)} / x_p^{(k)} ∼ λ_1
for large k, where |x_p^{(k+1)}| = ||x^{(k+1)}||.
Sketch of the proof:
While feasible in theory, the idea is not practical in computation because x^{(k)} → 0 if
ρ(A) < 1 and ||x^{(k)}|| → ∞ if ρ(A) > 1. The solution is to rescale the vector sequence, which
leads to the following power method.

Given x^{(0)} ≠ 0, form the following iteration:

    for k = 1 until convergence
        y^{(k+1)} = A x^{(k)}
        x^{(k+1)} = y^{(k+1)} / ||y^{(k+1)}||_2
        μ_{k+1} = (x^{(k+1)})^T A x^{(k+1)}
    end

We can use the following stopping criteria: |μ_{k+1} − μ_k| < tol, or ||y^{(k+1)} − y^{(k)}|| < tol, or
both.
Under some conditions, the sequence of pairs (μ_k, x^{(k)}) converges to the eigen-pair
corresponding to the dominant eigenvalue.
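A minimal NumPy sketch of this scaled power iteration (the test matrix, starting vector, and tolerance below are arbitrary choices made for illustration):

    import numpy as np

    def power_method(A, x0, tol=1e-10, maxit=1000):
        """2-norm scaled power method: returns (mu, x) approximating the dominant eigen-pair."""
        x = x0 / np.linalg.norm(x0)
        mu_old = np.inf
        for _ in range(maxit):
            y = A @ x                        # y^(k+1) = A x^(k)
            x = y / np.linalg.norm(y)        # rescale to unit 2-norm
            mu = x @ A @ x                   # mu_{k+1} = (x^(k+1))^T A x^(k+1)
            if abs(mu - mu_old) < tol:
                break
            mu_old = mu
        return mu, x

    A = np.array([[-5., -1., 0.], [-1., 2., -0.5], [0., -1., 8.]])
    mu, x = power_method(A, np.ones(3))
    print(mu)        # approaches the dominant eigenvalue of A (near 8 for this matrix)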
For simplicity of the proof, we assume that A has a complete set of eigenvectors v_1, v_2, · · ·, v_n,
with ||v_i|| = 1 and A v_i = λ_i v_i.

Theorem 5.2.1 Assume that |λ_1| > |λ_2| ≥ |λ_3| ≥ · · · ≥ |λ_n|, and x^{(0)} = Σ_{i=1}^{n} α_i v_i with
α_1 ≠ 0. Then the pair (μ_k, x^{(k)}) of the power method converges to the eigen-pair corresponding
to the dominant eigenvalue.

Proof: Note that

    y^{(k)} = A x^{(k−1)} = A y^{(k−1)} / ||y^{(k−1)}||_2 = γ_k A y^{(k−1)}
            = γ_k γ_{k−1} A^2 y^{(k−2)} = · · · = γ_k γ_{k−1} · · · γ_1 A^{k−1} y^{(1)} = C_k A^k x^{(0)},

where γ_k, γ_{k−1}, · · ·, γ_1 and C_k are some constants. Since x^{(k)} is parallel to y^{(k)} and has unit
length in the 2-norm, we must have

    x^{(k)} = y^{(k)} / ||y^{(k)}||_2.
On the other hand, we have

    A^k x^{(0)} = λ_1^k ( α_1 v_1 + (λ_2/λ_1)^k α_2 v_2 + · · · + (λ_n/λ_1)^k α_n v_n ).

Thus we have

    lim_{k→∞} x^{(k)} = lim_{k→∞} A^k x^{(0)} / ||A^k x^{(0)}||_2 = α_1 v_1 / ||α_1 v_1||_2 = ±v_1,

and

    lim_{k→∞} μ_k = lim_{k→∞} (x^{(k)})^T A x^{(k)} = λ_1.

5.2.1 The power method using the infinity norm

If we use a different scaling, we get a different power method. Using the infinity norm, we first
introduce the x_p notation for a vector x. Given a vector x, x_p is the first component
such that |x_p| = ||x||_∞, and p is its index. For example, if x = [2, −1, −5, 5, −5]^T,
then x_p = −5 with p = 3.
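One natural way to implement this variant (a sketch under the convention just described; the helper names below are ours, not from the notes) is to replace the 2-norm scaling by division by the component x_p:

    import numpy as np

    def first_max_component(x):
        """Return (x_p, p): the first component with |x_p| = ||x||_inf and its (0-based) index."""
        p = int(np.argmax(np.abs(x)))        # argmax returns the first index attaining the maximum
        return x[p], p

    def power_method_inf(A, x0, nsteps=100):
        x = np.array(x0, dtype=float)
        mu = 0.0
        for _ in range(nsteps):
            y = A @ x
            mu, _ = first_max_component(y)   # the scaling factor approaches lambda_1
            x = y / mu                       # now ||x||_inf = 1
        return mu, x

    print(first_max_component(np.array([2., -1., -5., 5., -5.])))   # (-5.0, 2), i.e. p = 3 in 1-based indexing
    A = np.array([[4., 1.], [2., 3.]])
    print(power_method_inf(A, np.array([1., 1.]))[0])                # dominant eigenvalue 5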

5.3 The inverse power method for the least dominant eigenvalue

If A is invertible, then the 1/λ_i are the eigenvalues of A^{−1}, and 1/λ_n is the dominant
eigenvalue of A^{−1}. The following inverse power method, however, finds the least dominant
eigenvalue of A directly, without the intermediate step of forming A^{−1}.
Given x^{(0)} ≠ 0, form the following iteration:

    for k = 1 until convergence
        solve  A y^{(k+1)} = x^{(k)}
        x^{(k+1)} = y^{(k+1)} / ||y^{(k+1)}||_2
        μ_{k+1} = (x^{(k+1)})^T A x^{(k+1)}
    end

With similar conditions (|λ_n| < |λ_{n−1}| ≤ · · · ≤ |λ_1|, the essential condition), one can
prove that

    lim_{k→∞} μ_k = λ_n,    lim_{k→∞} x^{(k)} = v_n,   with ||v_n||_2 = 1.

Note that at each iteration we need to solve a linear system of equations with the
same coefficient matrix. This is the most expensive part of the algorithm. An efficient
implementation is to compute the matrix factorization once, outside the loop. Let PA = LU;
then the algorithm can be written as follows.

Given x^{(0)} ≠ 0, form the following iteration:

    for k = 1 until convergence
        solve  L z^{(k+1)} = P x^{(k)}
        solve  U y^{(k+1)} = z^{(k+1)}
        x^{(k+1)} = y^{(k+1)} / ||y^{(k+1)}||_2
        μ_{k+1} = (x^{(k+1)})^T A x^{(k+1)}
    end
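A sketch of this LU-based inverse power method, using SciPy's lu_factor/lu_solve for the factorization P A = LU (one assumed implementation choice; any LU routine with pivoting would do):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def inverse_power(A, x0, tol=1e-10, maxit=500):
        """Inverse power method; the factorization of A is computed once, outside the loop."""
        lu, piv = lu_factor(A)               # P A = L U, computed only once
        x = x0 / np.linalg.norm(x0)
        mu_old = np.inf
        for _ in range(maxit):
            y = lu_solve((lu, piv), x)       # solves A y^(k+1) = x^(k) using the stored factors
            x = y / np.linalg.norm(y)
            mu = x @ A @ x
            if abs(mu - mu_old) < tol:
                break
            mu_old = mu
        return mu, x

    A = np.array([[-5., -1., 0.], [-1., 2., -0.5], [0., -1., 8.]])
    mu, x = inverse_power(A, np.ones(3))
    print(mu)      # approaches the least dominant (smallest magnitude) eigenvalue, near 2 here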

5.4 Gershgorin theorem and the shifted inverse power method

If we know a good approximation σ to an eigenvalue λ_p such that

    |λ_p − σ| < min_{1≤i≤n, i≠p} |λ_i − σ|,

that is, λ_p − σ is the least dominant eigenvalue of A − σI, then we can use the following shifted
inverse power method to find the eigenvalue λ_p and its corresponding eigenvector.
Given x^{(0)} ≠ 0, form the following iteration:

    for k = 1 until convergence
        solve  (A − σI) y^{(k+1)} = x^{(k)}
        x^{(k+1)} = y^{(k+1)} / ||y^{(k+1)}||_2
        μ_{k+1} = (x^{(k+1)})^T A x^{(k+1)}
    end

Thus, if we can find a good approximation of any eigenvalue, we can use the shifted
inverse power method to compute it. Now the question is how to roughly locate the
eigenvalues of a matrix A. The Gershgorin theorem provides some useful hints.

Definition 5.4.1 Given a matrix A ∈ C^{n×n}, the circle (all points within the circle in the
complex plane)

    |λ − a_{ii}| ≤ Σ_{j=1, j≠i}^{n} |a_{ij}|                                  (5.4.1)

is called the i-th Gershgorin circle.



Theorem 5.4.1 Gershgorin Theorem.

1. Any eigenvalue of A has to lie in one of the Gershgorin circles.

2. The union of k Gershgorin circles that does not intersect the other n − k circles
   contains precisely k eigenvalues of A.

Proof: For any eigen-pair (λ, x), Ax = λx. Consider the p-th component of x such that
|x_p| = ||x||_∞. We have

    Σ_{j=1}^{n} a_{pj} x_j = λ x_p,

or

    (λ − a_{pp}) x_p = Σ_{j=1, j≠p}^{n} a_{pj} x_j.

From the expression above we get

    |λ − a_{pp}| ≤ Σ_{j=1, j≠p}^{n} |a_{pj}| |x_j / x_p| ≤ Σ_{j=1, j≠p}^{n} |a_{pj}|,    since |x_j / x_p| ≤ 1.

Thus λ is in the p-th Gershgorin circle, and the first part of the theorem is proved.
The proof of the second part is based on a continuation argument: the roots of a polynomial
are continuous functions of the coefficients of the polynomial. The theorem is obviously
true for a diagonal matrix. As the radii of the Gershgorin circles increase continuously
when we change the off-diagonal entries from 0 to a_{ij}, the eigenvalues move within the
union of the Gershgorin circles but cannot cross into the disjoint ones.
Example: Let A be the following matrix:

        [ -5  -1    0  ]
    A = [ -1   2  -1/2 ] .
        [  0  -1    8  ]
Use the Gershgorin theorem to roughly locate the eigenvalues.
The three Gershgorin circles are

• R1 : |z + 5| ≤ 1.

• R2 : |z − 2| ≤ 1.5.

• R3 : |z − 8| ≤ 1.

They do not intersect each other, so each circle contains exactly one eigenvalue. Since the matrix is
real and complex eigenvalues must come in conjugate pairs, we conclude that all the eigenvalues
are real. Thus we get

• the dominant eigenvalue satisfies 7 ≤ λ1 ≤ 9,

• the least dominant eigenvalue satisfies 0.5 ≤ λ3 ≤ 3.5,

• the middle eigenvalue satisfies −6 ≤ λ2 ≤ −4.

If we wish to find the middle eigenvalue λ_2, we should choose σ = −5. Even for the
dominant eigenvalue, we would get faster convergence if we shift the matrix by taking
σ = 8 and then apply the shifted inverse power method.
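The circles and the effect of the shift can be checked numerically. The following sketch (an added illustration using the matrix above) computes the Gershgorin centers and radii and then runs the shifted inverse power method with σ = −5 to pick out the eigenvalue in the left circle:

    import numpy as np

    A = np.array([[-5., -1., 0.], [-1., 2., -0.5], [0., -1., 8.]])

    centers = np.diag(A)
    radii = np.sum(np.abs(A), axis=1) - np.abs(centers)    # off-diagonal row sums
    print(list(zip(centers, radii)))                       # (-5, 1), (2, 1.5), (8, 1)

    sigma = -5.0                                           # shift taken from the circle around -5
    B = A - sigma * np.eye(3)
    x = np.ones(3)
    for _ in range(50):                                    # shifted inverse power iteration
        y = np.linalg.solve(B, x)
        x = y / np.linalg.norm(y)
    print(x @ A @ x)                                       # approximates the eigenvalue in [-6, -4]
    print(np.sort(np.linalg.eigvals(A)))                   # all three eigenvalues, for comparison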

5.5 How to find a few or all eigenvalues?

If we know an eigen-pair (λ_1, x_1) of a matrix A, we can use the deflation method to reduce
A to a matrix of dimension one lower whose eigenvalues are the remaining eigenvalues of A.
The process is as follows. Assume that ||x_1||_2 = 1; we can expand x_1 to form an
orthonormal basis of R^n: {x_1, x_2, · · ·, x_n} with x_i^T x_j = δ_{ij}. Note that δ_{ij} = 0 if i ≠ j and
δ_{ii} = 1. Let Q = [x_1, x_2, · · ·, x_n]; then Q is an orthogonal matrix (Q^T Q = Q Q^T = I). We
get

    Q^T A Q = Q^T [A x_1, A x_2, · · ·, A x_n]
                                                   [ λ_1   *  ]
            = Q^T [λ_1 x_1, A x_2, · · ·, A x_n] = [          ]
                                                   [  0   A_1 ]

Thus the eigenvalues of A_1, together with λ_1, are those of A, but A_1 is of dimension one
lower compared with the original matrix A. The deflation method is only used if we wish to
find a few eigenvalues; a sketch of one deflation step is given below.
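A hedged sketch of one deflation step follows; here the orthogonal matrix Q is built from a Householder reflector whose first column is ±x_1 (a convenient construction, not the only way to complete x_1 to an orthonormal basis).

    import numpy as np

    def deflate(A, lam1, x1):
        """Given an eigen-pair (lam1, x1) of A with ||x1||_2 = 1, return the deflated matrix A1,
        one dimension smaller, whose eigenvalues are the remaining eigenvalues of A."""
        assert np.allclose(A @ x1, lam1 * x1)
        n = len(x1)
        e1 = np.zeros(n); e1[0] = 1.0
        alpha = -1.0 if x1[0] >= 0 else 1.0                  # sign chosen to avoid cancellation
        v = x1 - alpha * e1
        P = np.eye(n) - 2.0 * np.outer(v, v) / (v @ v)       # Householder reflector: P x1 = alpha e1
        B = P @ A @ P                                        # orthogonal similarity: B[0,0] = lam1, B[1:,0] = 0
        return B[1:, 1:]

    A = np.array([[2., 1., 0.], [1., 3., 1.], [0., 1., 4.]])
    lam, V = np.linalg.eigh(A)
    A1 = deflate(A, lam[-1], V[:, -1])
    print(np.sort(np.linalg.eigvals(A1)), lam[:2])           # the remaining two eigenvalues of A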
To find all eigenvalues, the QR method for eigenvalues is often used. The idea of the
QR method is first to reduce the matrix to a simple form (often an upper Hessenberg matrix
or a tridiagonal matrix) using a similarity transformation S^{−1} A S so that the eigenvalues are
unchanged. Since the inverse of an orthogonal matrix is its transpose, S is often chosen to be
an orthogonal matrix. Orthogonal matrices also have better stability properties than other
matrices, since ||Qx||_2 = ||x||_2 and ||QA||_2 = ||A||_2.

Definition 5.5.1 Given a unit vector w, ||w||_2 = 1, the Householder matrix is defined as

    P = I − 2 w w^T.                                    (5.5.1)

It is simple to check that P = P^T = P^{−1}. Such a matrix is sometimes also called a
reflection matrix or reflection transformation.

Theorem 5.5.1 If ||x||_2 = ||y||_2, then there is a Householder matrix P such that P x = y.

Proof: Assume that P x = y; then we have

    (I − 2 w w^T) x = y,    i.e.    x − y = 2 w (w^T x).

Note that 2(w^T x) is a number; thus w is parallel to x − y. Since w is also a unit vector
in the 2-norm, we conclude that w = (x − y)/||x − y||_2, and then it is a simple manipulation
to show that P x = y.
Example: Find a Householder matrix P such that

        [ 3 ]   [ α ]
    P   [ 0 ] = [ 0 ] .
        [ 4 ]   [ 0 ]

Note that in this example we need to find both α and P. Since an orthogonal transformation
does not change the 2-norm, we should have

    3^2 + 4^2 = α^2,   =>   α = ±5,

                                              [ 3 − α ]
    w = (x − y)/||x − y||_2 = (1/||x − y||_2) [ 0 − 0 ] .
                                              [ 4 − 0 ]

To avoid possible cancellation, we should choose the opposite sign, that is, α = −5, and then

    w = [8, 0, 4]^T / sqrt(80).
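A quick numerical check of this construction (a sketch; the helper name householder_to_e1 is ours):

    import numpy as np

    def householder_to_e1(x):
        """Return (P, alpha) with P = I - 2 w w^T such that P x = [alpha, 0, ..., 0]^T."""
        alpha = -np.linalg.norm(x) if x[0] >= 0 else np.linalg.norm(x)   # sign opposite to x[0]
        y = np.zeros_like(x); y[0] = alpha
        w = (x - y) / np.linalg.norm(x - y)                  # w parallel to x - y, unit 2-norm
        P = np.eye(len(x)) - 2.0 * np.outer(w, w)
        return P, alpha

    x = np.array([3., 0., 4.])
    P, alpha = householder_to_e1(x)
    print(alpha)        # -5
    print(P @ x)        # approximately [-5, 0, 0]
    print(P @ P.T)      # the identity: P is symmetric and orthogonal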

5.5.1 The QR decomposition of a matrix A

Start from A_0 = A.

    for k = 0, 1, · · · until convergence
        A_k = Q_k R_k
        A_{k+1} = R_k Q_k
    end
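A bare-bones sketch of this iteration with NumPy's built-in QR factorization (illustration only; a practical implementation would first reduce A to Hessenberg form and use shifts, as described below):

    import numpy as np

    def qr_iteration(A, nsteps=200):
        Ak = np.array(A, dtype=float)
        for _ in range(nsteps):
            Q, R = np.linalg.qr(Ak)     # A_k = Q_k R_k
            Ak = R @ Q                  # A_{k+1} = R_k Q_k = Q_k^T A_k Q_k
        return Ak

    A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
    Ak = qr_iteration(A)
    print(np.diag(Ak))                      # approaches the eigenvalues of A
    print(np.sort(np.linalg.eigvals(A)))    # for comparison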

Theorem 5.5.2 If A ∈ R^{n×n} and |λ_1| ≥ |λ_2| ≥ · · · ≥ |λ_n|, then

• A_{i+1} ∼ A_i ∼ A, that is, all A_k have the same eigenvalues.

•
                           [ R_{11}  R_{12}  · · ·  R_{1p} ]
    lim  A_k  =  R_A   =   [         R_{22}  · · ·   · · · ]
    k→∞                    [                  ...    ...   ]
                           [                        R_{pp} ]

where R_A is a block upper triangular matrix whose diagonal blocks R_{ii} are either 1 by 1 or
2 by 2 matrices (the latter corresponding to complex eigenvalues in conjugate pairs).

Proof of the first part: A_{k+1} = R_k Q_k = Q_k^T A_k Q_k.

Shifted QR method:
Start from A_0 = A.

    for k = 0, 1, · · · until convergence
        A_k − σ_k I = Q_k R_k
        A_{k+1} = R_k Q_k + σ_k I
    end
Stopping criterion: max_{3≤i≤n, 1≤j≤i−2} |a_{ij}| < tol.

Double shifted QR method: If σ is chosen as a complex number, then we should use
the double shifted QR method.
Start from A_0 = A.

    for k = 0, 1, · · · until convergence
        A_k − σ_k I = Q_k R_k
        A_{k+1} = R_k Q_k + σ_k I
        A_{k+1} − σ̄_k I = Q_{k+1} R_{k+1}
        A_{k+2} = R_{k+1} Q_{k+1} + σ̄_k I
    end
The QR method for finding all eigenvalues is quite expensive. To reduce the computational
cost, we often first use a similarity transformation via Householder matrices to reduce the
original matrix to an upper Hessenberg matrix (tridiagonal if the matrix is symmetric), and
then apply the QR method. This requires n − 2 steps, P_{n−2} P_{n−3} · · · P_2 P_1 A P_1 P_2 · · · P_{n−3} P_{n−2},
where, for example,

           [ 1   0  ]                  [ a_{21} ]   [ α ]
    P_1 =  [        ] ,   with   P̄_1   [ a_{31} ] = [ 0 ] .
           [ 0  P̄_1 ]                  [  ...   ]   [...]
                                       [ a_{n1} ]   [ 0 ]

The reason to choose P_1 so that it keeps the first row of A unchanged when we multiply by P_1
from the left is to ensure that the first column of P_1 A is unchanged when we multiply by P_1
from the right, so that the zeros just created remain.

5.6 Exercises

1. Let w ∈ R^n, A = w^T w, B = w w^T.

(a) Find the size of the matrices A and B.


(b) Find kAk2 and kBk2 .
(c) Find the rank of A and B.
(d) Find the condition number of A and B if it is meaningful.
(e) Find all eigenvalues and corresponding eigenvectors of A and B.
Chapter 6

Least squares and SVD solutions

In this chapter, we discuss numerical methods for solving arbitrary linear systems of equations.

6.1 Least squares solutions

Consider Ax = b, where A ∈ R^{m×n}, m ≥ n, and rank(A) = n. One motivation is the curve
fitting example in which we have many observed data points but seek a simple polynomial to
fit the data. Below is an example:

    [  2 ]       [ 0 ]
    [  1 ]  x =  [ 0 ]
    [  4 ]       [ 4 ]
    [ -1 ]       [ 2 ]

From the first two equations, the solution should be x = 0; from the third equation, the
solution should be x = 1; from the last equation, the solution should be x = −2. In other
words, we cannot find a single x that satisfies all the equations. The system of equations is
called over-determined. In general, there is no classical solution to an over-determined
system of equations. We need to find the 'best' solution.
By the best solution we mean one that minimizes the error in some norm. Since the
residual is computable, one approach is to minimize the 2-norm of the residual over all possible
choices: the solution x* is the 'best solution' in the 2-norm if it satisfies

    ||b − A x*||_2 = min_{x∈R^n} ||b − A x||_2.                              (6.1.1)

If we use a different norm rather than the 2-norm, it leads to different algorithms and different
applications, but the 2-norm is the one used most often and is the simplest.


Note that an equivalent definition is the following:

    ||b − A x*||_2^2 = min_{x∈R^n} ||b − A x||_2^2.                          (6.1.2)

The above definition gives a way to compute the 'best solution' as the global minimum
of a multi-variable function of the components x_1, x_2, · · ·, x_n. This can be seen from the
following:

    φ(x) = ||b − Ax||_2^2 = (b − Ax)^T (b − Ax) = b^T b − b^T A x − x^T A^T b + x^T A^T A x.

Noting that b^T A x = x^T A^T b, one can easily get the gradient of φ(x), which is

    ∇φ(x) = 2 ( A^T A x − A^T b ).


Since the columns of A are linearly independent (rank(A) = n), we have x^T A^T A x =
||Ax||_2^2 > 0 for any non-zero x, so A^T A is symmetric positive definite. The only critical
point of φ(x) is the solution of the following normal equation,

    A^T A x = A^T b,                                                          (6.1.3)

whose unique solution is the least squares solution, which minimizes the 2-norm of the
residual over R^n.
The normal equation approach not only provides a numerical method, but also shows
that the least squares solution is unique under the condition rank(A) = n, that is, when the
columns of A are linearly independent.
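A small sketch of the normal equation approach for a linear fit (the data below are made up for illustration):

    import numpy as np

    # Fit y ~ a0 + a1*t in the least squares sense: the columns of A are [1, t].
    t = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([1.1, 1.9, 3.2, 3.9])
    A = np.column_stack([np.ones_like(t), t])

    x_normal = np.linalg.solve(A.T @ A, A.T @ y)       # normal equations A^T A x = A^T b
    x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)    # library least squares, for comparison
    print(x_normal, x_lstsq)                           # the two agree for this well-conditioned problem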
A serious problem with the normal equation approach is that the resulting system can be
ill-conditioned. Note that if m = n, then cond_2(A^T A) = (cond_2(A))^2. A more accurate
method is the QR method for the least squares solution. Consider the following least squares
problem:

    [ R ]       [ b_1 ]
    [   ] x  =  [     ] .
    [ 0 ]       [ b_2 ]
We have

    ||b − A x||_2^2 = ||b_1 − R x||_2^2 + ||b_2||_2^2

and

    min_{x∈R^n} ||b − A x||_2^2 = min_{x∈R^n} ( ||b_1 − R x||_2^2 + ||b_2||_2^2 ) = ||b_2||_2^2

when b_1 − R x = 0, or x = R^{−1} b_1. In particular, if R is an upper triangular matrix, we can
use backward substitution to solve the triangular system of equations efficiently.
In the QR method for least squares, the idea is to reduce the original problem to
the problem above using orthogonal transformations, in particular Householder transformations,
which keep the least squares solution unchanged. As in the QR process, we can
apply a sequence of Householder matrices that reduce A to upper triangular form,

    P_n P_{n−1} · · · P_1 A = [ R ]
                              [ 0 ]

or Q A = [R^T  0]^T. Then from Ax = b we get Q A x = Q b, or

    [ R ]       [ b̃_1 ]
    [   ] x  =  [      ] .
    [ 0 ]       [ b̃_2 ]

In practice, we can apply the Householder matrices directly to the augmented matrix [A | b],
as in the Gaussian elimination method.
An example:
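As an illustrative sketch (not the lecture's original worked example), the following solves the small over-determined system from the beginning of this chapter with a QR factorization and compares it with the normal equation solution:

    import numpy as np

    A = np.array([[2.], [1.], [4.], [-1.]])     # the over-determined system from Section 6.1
    b = np.array([0., 0., 4., 2.])

    Q, R = np.linalg.qr(A)                      # reduced QR: A = Q R, R upper triangular
    x_qr = np.linalg.solve(R, Q.T @ b)          # solve R x = Q^T b (back substitution)

    x_ne = np.linalg.solve(A.T @ A, A.T @ b)    # normal equation solution, for comparison
    print(x_qr, x_ne)                           # both give x = 7/11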

6.2 Singular value decomposition (SVD) and the SVD solution of Ax = b

Singular value decomposition can be used to solve Ax = b for arbitrary A and b.

Theorem 6.2.1 Given any matrix A ∈ C^{m×n}, there are two unitary matrices U ∈ C^{m×m},
U U^H = U^H U = I, and V ∈ C^{n×n}, V V^H = V^H V = I, such that

                                 [ σ_1                     ]
                                 [      σ_2                ]
    A = U Σ V^H,   where   Σ  =  [           ...           ]                 (6.2.1)
                                 [                σ_p      ]
                                 [                       0 ]  (m × n)

σ_1, σ_2, · · ·, σ_p > 0 are called the singular values of A. Note that they are positive numbers
and p = rank(A). Furthermore,
• σ_i = sqrt( λ_i(A^H A) ) = sqrt( λ_i(A A^H) ), the square roots of the non-zero eigenvalues of A^H A or A A^H.

• ||A||_2 = max_{1≤i≤p} σ_i(A).

Note that we often arrange the singular values so that σ_1 ≥ σ_2 ≥ · · · ≥ σ_p > 0. The proof
also gives a way to construct such a decomposition (it is constructive).
Proof: Let σ_1 = ||A||_2; then there is an x_1, ||x_1||_2 = 1, such that A^H A x_1 = σ_1^2 x_1. Let
y_1 = A x_1 / σ_1; we have ||y_1||_2 = ||A x_1||_2 / σ_1 = 1.

Next we expand x_1 to form an orthonormal basis of C^n, giving a unitary matrix V,

    V = [x_1, x_2, · · ·, x_n] = [x_1, V_1],    V^H V = V V^H = I.

We also expand y_1 to form an orthonormal basis of C^m, giving a unitary matrix U,

    U = [y_1, y_2, · · ·, y_m] = [y_1, U_1],    U^H U = U U^H = I.

Then we have

                [ y_1^H ]                      [ σ_1        0        ]
    U^H A V  =  [       ] [ A x_1    A V_1 ] = [                     ]
                [ U_1^H ]                      [  0    U_1^H A V_1   ]

This is because y_1^H A x_1 = x_1^H A^H A x_1 / σ_1 = x_1^H x_1 σ_1^2 / σ_1 = σ_1; U_1^H A x_1 = σ_1 U_1^H y_1 = 0;
and y_1^H A V_1 = (A^H y_1)^H V_1 = σ_1 x_1^H V_1 = 0. Thus, by the mathematical induction
principle, we can continue this process to get the SVD decomposition.

Pseudo-inverse of a matrix A

From the SVD decomposition of a matrix A, we can formally invert the non-zero singular values
to get its pseudo-inverse,

                                       [ 1/σ_1                     ]
                                       [        1/σ_2              ]
    A^+ = V Σ^+ U^H,   where   Σ^+  =  [               ...         ]                (6.2.2)
                                       [                    1/σ_p  ]
                                       [                         0 ]  (n × m)

In particular, if m = n and det(A) ≠ 0, then A^+ = A^{−1}. The pseudo-inverse matrix A^+ of
a matrix A has the following properties:

• A A^+ A = A; note that A^+ A ≠ I in general.

• A^+ A A^+ = A^+.

• A^+ A = (A^+ A)^H. If rank(A) = n, then A^+ = (A^H A)^{−1} A^H.

• A A^+ = (A A^+)^H. If rank(A) = m, then A^+ = A^H (A A^H)^{−1}.

Solving Ax = b for an arbitrary matrix A

The SVD solution of Ax = b is simply x* = A^+ b. It has the following properties:



• If m = n and det(A) ≠ 0, then x* = A^+ b = A^{−1} b.

• If rank(A) = n, then ||A x* − b||_2 = min_{x∈R^n} ||A x − b||_2, that is, x* is the least
  squares solution.

• If there is more than one classical (or least squares) solution, then x* is the one with
  minimal 2-norm, that is,

      ||x*||_2 = min { ||x||_2 : ||A x − b||_2 = ||A x* − b||_2 },   where   ||A x* − b||_2 = min_{x∈R^n} ||A x − b||_2.
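A small NumPy sketch of the SVD solution (illustration only; the rank-deficient matrix below is an arbitrary choice):

    import numpy as np

    A = np.array([[1., 2., 3.], [2., 4., 6.]])      # a rank-one 2 x 3 matrix
    b = np.array([1., 2.])

    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    tol = max(A.shape) * np.finfo(float).eps * s[0]
    s_inv = np.array([1.0 / si if si > tol else 0.0 for si in s])   # invert only the non-zero singular values
    x_svd = Vh.T @ (s_inv * (U.T @ b))               # x* = V Sigma^+ U^H b
    print(x_svd)                                     # the minimum 2-norm solution
    print(np.linalg.pinv(A) @ b)                     # the same, using the built-in pseudo-inverse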
