Ma580 Book
Zhilin Li
Chapter 1
Introduction
• Why is this course important (motivations)? What is the role of this class in the
problem solving process using mathematics and computers?
In Fig.1.1, we show a flow chart of a problem solving process. In this class, we will focus
on numerical solutions using computers, especially the problems in linear algebra. Thus
this course can also be called "Numerical Linear Algebra".
Given data
\[
\begin{array}{cccc}
t_0 & t_1 & \cdots & t_m \\
y_0 & y_1 & \cdots & y_m
\end{array}
\tag{1.1.1}
\]
The data can be taken from a function y(t), or from a set of observed data. We want to find a
simple function y_a(t) to approximate y(t) so that we can predict y(t) everywhere.
Let
\[
y(t) \approx a_0 + a_1 t + a_2 t^2 + \cdots + a_n t^n. \tag{1.1.2}
\]
We need to find the coefficients a0 , a1 , a2 , · · · , an so that we can have an analytic expression.
n = 0: constant approximation
n = 1: linear regression
n = 2: quadratic regression
··· ···
[Figure 1.1: flow chart of the problem solving process: models, solution techniques (analytic/exact, or approximated using computers), visualization, products, experiments, prediction, and better models.]
We should choose the coefficients in such a way that they can match the data at the sample points. Thus we have
\[
\begin{aligned}
t = t_0: &\quad a_0 + a_1 t_0 + a_2 t_0^2 + \cdots + a_n t_0^n = y_0 \\
t = t_1: &\quad a_0 + a_1 t_1 + a_2 t_1^2 + \cdots + a_n t_1^n = y_1 \\
 &\quad \cdots\cdots \\
t = t_m: &\quad a_0 + a_1 t_m + a_2 t_m^2 + \cdots + a_n t_m^n = y_m
\end{aligned}
\tag{1.1.3}
\]
• m > n, that is, we have more equations than unknowns. The system is over-determined
and we can only find the best solution in some sense, for example the least squares
solution. Such a problem is a curve-fitting problem. When n = 1, it is also called
linear regression.
• m < n, that is, we have fewer equations than unknowns. The system is under-determined
and there are infinitely many solutions. Often we prefer the SVD solution,
which has the least length among all the solutions (see the sketch below).
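As a hedged illustration (not part of the original text), both cases can be handled in Matlab with the backslash operator and the pseudo-inverse; the sample data and the degree n below are assumptions chosen only for the example.

t = linspace(0, 1, 11)';          % hypothetical sample points t_0, ..., t_m
y = sin(pi*t);                    % hypothetical data y_0, ..., y_m
n = 3;                            % chosen polynomial degree (here m > n)
V = ones(length(t), n+1);         % coefficient matrix of system (1.1.3)
for k = 1:n
    V(:, k+1) = t.^k;             % column k+1 holds t^k at the sample points
end
a_ls = V \ y;                     % m > n: least squares solution
a_mn = pinv(V)*y;                 % pseudo-inverse: minimum-length (SVD) solution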
Note that the coefficient matrix is dense (not many zero entries) in this application.
Modern computers are designed in part to solve practical problems such as weather
forecasting and computing the lift and drag of airplanes, missiles, space shuttles, etc. In these
applications, partial differential equations, particularly the Navier-Stokes equations,
are often used to model the physical phenomena. How to solve the three dimensional Navier-Stokes
equations is still a challenge today.
To solve the Navier-Stokes equations, finite difference or finite element methods are often
used. In a finite difference method, the partial derivatives are replaced by finite differences,
which are combinations of function values. Such a process is called a finite difference
discretization. After the discretization, the system of partial differential equations becomes a
system of algebraic equations, either linear or non-linear. We use a one-dimensional
example here to illustrate the idea. Consider the two-point boundary value problem
\[
\frac{d^2 u(x)}{dx^2} = f(x), \quad 0 < x < 1, \tag{1.3.1}
\]
\[
u(0) = u(1) = 0. \tag{1.3.2}
\]
• Generate a grid. For example, we can select n equally spaced points between 0 and
1 to find the approximate solution of u(x). The spacing between two points is
h = 1/n, and these points are x_i = ih, i = 0, 1, ..., n, with x_0 = 0 and x_n = 1. We
look for an approximate solution of u(x) at x_1, x_2, ..., x_{n-1}. Note that we know
u(0) = u(x_0) = 0 and u(1) = u(x_n) = 0 already from the boundary condition.
At every grid point x_i, i = 1, 2, ..., n-1, we use the above formula, ignoring the high
order terms. If we replace the '≈' with the '=' sign, and replace the unknown u(x_i) with
U_i, we obtain the linear system of equations:
\[
\begin{aligned}
\frac{0 - 2U_1 + U_2}{h^2} &= f(x_1) \\
\frac{U_1 - 2U_2 + U_3}{h^2} &= f(x_2) \\
&\cdots\cdots
\end{aligned}
\]
This system of equations can be written in the matrix-vector form Ax = b:
\[
\begin{pmatrix}
-\frac{2}{h^2} & \frac{1}{h^2} & & & \\
\frac{1}{h^2} & -\frac{2}{h^2} & \frac{1}{h^2} & & \\
 & \frac{1}{h^2} & -\frac{2}{h^2} & \frac{1}{h^2} & \\
 & & \ddots & \ddots & \ddots \\
 & & & \frac{1}{h^2} & -\frac{2}{h^2}
\end{pmatrix}
\begin{pmatrix}
U_1 \\ U_2 \\ U_3 \\ \vdots \\ U_{n-2} \\ U_{n-1}
\end{pmatrix}
=
\begin{pmatrix}
f(x_1) - u_a/h^2 \\ f(x_2) \\ f(x_3) \\ \vdots \\ f(x_{n-2}) \\ f(x_{n-1}) - u_b/h^2
\end{pmatrix}
\tag{1.3.5}
\]
with u_a = u(0) = 0 and u_b = u(1) = 0.
• Solve the system of equations to get the approximate solution at each grid point.
• Implement and debug the computer code. Run the program to get the output. Ana-
lyze the results (tables, plots etc.).
• Error analysis: the method error (from using finite differences to approximate derivatives) and the machine error (round-off errors).
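To make the steps above concrete, here is a minimal Matlab sketch (an illustration, not part of the original text); the right-hand side f below is an assumption chosen so that the exact solution u = sin(pi x) is known.

% Sketch: finite difference solution of u'' = f on (0,1), u(0) = u(1) = 0.
n = 100;  h = 1/n;
x = (1:n-1)'*h;                      % interior grid points x_1, ..., x_{n-1}
f = -pi^2*sin(pi*x);                 % hypothetical f; exact solution u = sin(pi*x)
A = (diag(-2*ones(n-1,1)) + diag(ones(n-2,1),1) + diag(ones(n-2,1),-1))/h^2;
U = A \ f;                           % solve the tridiagonal system A*U = f
err = max(abs(U - sin(pi*x)))        % should be O(h^2)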
We will discuss how to use a computer to solve this system of equations. Note that the
coefficient matrix has a special structure: it is tri-diagonal and most entries are zero. In two space
dimensions, a useful partial differential equation is the Poisson equation u_xx + u_yy = f(x, y).
For a rectangular grid and a uniform mesh, the finite difference equation at a grid point
(x_i, y_j) is
\[
\frac{U_{i-1,j} + U_{i+1,j}}{(h_x)^2} + \frac{U_{i,j-1} + U_{i,j+1}}{(h_y)^2} - \left(\frac{2}{(h_x)^2} + \frac{2}{(h_y)^2}\right)U_{ij} = f_{ij} \tag{1.3.7}
\]
For a general n by n grid, we will have
\[
A = \frac{1}{h^2}
\begin{pmatrix}
B & I & & \\
I & B & I & \\
 & \ddots & \ddots & \ddots \\
 & & I & B
\end{pmatrix},
\qquad
B =
\begin{pmatrix}
-4 & 1 & & \\
1 & -4 & 1 & \\
 & \ddots & \ddots & \ddots \\
 & & 1 & -4
\end{pmatrix}.
\]
Note that the size of A is (n-1)^2 × (n-1)^2. If we take n = 1001, we will have one
million unknowns, a very large number, and most laptops today cannot store all the entries
of such a matrix. For such a system of equations, an iterative method or sparse
matrix techniques are preferred. We will get back to this later.
with
\[
\alpha_k = -\frac{\int_0^1 f(x)\sin(k\pi x)\,dx}{(k\pi)^2 \int_0^1 \sin^2(k\pi x)\,dx}. \tag{1.4.5}
\]
See an ordinary differential equation textbook for the details. A more general eigenvalue problem of
this type would be
• Learn how to solve these problems (mainly linear algebra) efficiently and reliably on
computers.
We want to use computers to solve mathematical problems and we should know the number
system in a computer.
\[
x_c = \pm . d_1 d_2 \cdots d_n \times \beta^s, \quad 0 \le d_i \le \beta - 1, \quad -S_{\max} \le s \le S_{\max}, \tag{1.6.1}
\]
where ± is the sign, .d_1 d_2 ··· d_n is the mantissa (fraction), β is the base, and s is the exponent.
The floating point representation of x is denoted by fl(x). We can see that the expression of a floating
point number is not unique. To get a unique expression, it is often required that d_1 ≠ 0 if x_c is
a non-zero number. Such a floating point number is called a normalized floating point number. The
number zero is expressed as 0.00···0 × β^0. Note that one bit is used to represent the sign of
the exponent.
Often there are two number systems in a programming language for a particular computer:
single precision, corresponding to 32 bits, and double precision, for 64 bits.
In a 32-bit computer number system, we have
• It is a subset of the real number system with a finite number of floating point numbers. For
a 32-bit system, the total number is roughly 2β^n(S_max − S_min + 1) − 1.
• Even if x and y are in the computer number system, the result of an operation on them, for example
fl(xy), can be outside the computer number system.
• It has maximum and minimum numbers, and maximum and non-zero minimum
magnitudes. For a 32-bit system, the largest and smallest numbers can be calculated
from the following:
The smallest number is then −1.7014116 × 10^38. The smallest positive number (or
smallest magnitude) is
If a computer system encounters a number whose magnitude is larger than the largest
floating point number of the system, it is called OVERFLOW. This often happens
when a number is divided by zero, for example, when we want to compute s/a but a is zero
so s/a is undefined, or when a function is evaluated outside of its domain, for example, log(−5).
Computers often return a symbol such as NaN or Inf, or simply stop the running
process. This can also happen when a number is divided by a very small number.
Often an overflow indicates a bug in the code and should be avoided.
If a computer system encounters a number whose magnitude is smaller than the smallest
positive floating point number of the system, it is called UNDERFLOW. Often,
the computer system sets this number to zero, and there is usually no harm to the running
process.
• The numbers in a computer number system are not evenly spaced. They are more clustered
around the origin and become sparser away from it.
While a computer number system is only a subset of the real number system, it is often good
enough if we know how to use it. If a single precision system is not adequate, we can use
the double precision system.
Since a computer number system is only a subset of the real number system, errors (called
round-off errors) are inevitable when we solve problems using computers. The question
that we need to ask is how the errors affect the final results and how to minimize their
negative impact.
Input errors
When we input a number into a computer, it is likely to have some errors. For example,
the number π can be represented exact in a computer number system. Thus a floating
number of expression of π denoted as f l(π) is different from π. The first question is, how
a computer system approximate π. The default is the round-off approach. Let us take
the cecimal system as an example. Let x be a real number that is in the range of the
computer number system in terms of the magnitude, and we express it as a normalized
floating number
\[
x = 0.d_1 d_2 \cdots d_n d_{n+1} \cdots \times 10^b, \quad d_1 \ne 0. \tag{1.7.1}
\]
The floating point number in the computer system using the round-off approach is
\[
fl(x) =
\begin{cases}
0.d_1 d_2 \cdots d_n \times 10^b, & \text{if } d_{n+1} \le 4, \\
0.d_1 d_2 \cdots (d_n + 1) \times 10^b, & \text{if } d_{n+1} \ge 5.
\end{cases}
\tag{1.7.2}
\]
The absolute error is defined as the difference between the true value and the approximation,
absolute error = true − approximated.   (1.7.3)
Thus the error for f l(x) is
Obviously, for different x, the error x − f l(x) and the relative error are different. How
do we then characterize the round-off errors? We seek the upper bounds, or the worst case,
which should apply for all x’s.
For the round-off approach, we have
\[
|x - fl(x)| =
\begin{cases}
0.0\cdots0\,d_{n+1}\cdots \times 10^b, & \text{if } d_{n+1} \le 4, \\
(1 - 0.d_{n+1}d_{n+2}\cdots)\times 10^{b-n}, & \text{if } d_{n+1} \ge 5,
\end{cases}
\qquad \le \; 0.0\cdots 0\,5 \times 10^b = \frac{1}{2}\,10^{b-n},
\]
which only depends on the magnitude of x. The relative error is
\[
\frac{|x - fl(x)|}{|x|} \le \frac{\frac{1}{2}10^{b-n}}{|x|} \le \frac{\frac{1}{2}10^{b-n}}{0.1\times 10^b} = \frac{1}{2}\,10^{-n+1} \;\overset{\text{define}}{=}\; \epsilon = \text{machine precision}. \tag{1.7.7}
\]
Note that the upper bound of the relative error for the round-off approach is independent
of x; it depends only on the computer number system. This upper bound is called the machine
precision, or machine epsilon, and it indicates the best accuracy that we can expect when using
the computer number system.
In general we have
\[
\frac{|x - fl(x)|}{|x|} \le \frac{1}{2}\beta^{-n+1} \tag{1.7.8}
\]
for any base β. For a single precision computer number system (32 bits) we have
\[
\epsilon = \frac{1}{2}\,2^{-23+1} = 2^{-23} = 1.192093\times 10^{-7}. \tag{1.7.9}
\]
For a 64-bit number system (double precision), we have
\[
\epsilon = \frac{1}{2}\,2^{-52+1} = 2^{-52} = 2.220446\times 10^{-16}. \tag{1.7.10}
\]
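These values can be checked directly in Matlab; the short sketch below (an illustration, not part of the original text) prints the double and single precision machine epsilons.

eps_double = eps             % 2^(-52), about 2.2204e-16
eps_single = eps('single')   % 2^(-23), about 1.1921e-07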
Relative error is closely associated with the concept of significant digits. In general,
if a relative error is of order 10^{-5}, for example, it is likely that the result has 5 significant digits.
An approximate number can be regarded as a perturbation of the true value according
to the following theorem.
Theorem 1.7.1 If x ∈ R, then fl(x) = x(1 + δ), |δ| ≤ ε, where δ is the relative error.
There are other ways to input a number into a computer system. Two other approaches
are rounding, and chopping, in which we have
The primitive computer arithmetic only includes addition, subtraction, multiplication, division,
and logical operations. Logical operations do not generate errors, but the basic
arithmetic operations will introduce errors. The error bounds are given in the following
theorem.
Theorem 1.7.2 If a and b are two floating point numbers in a computer number system and fl(a ∘ b)
is in the range of the computer number system, then
\[
fl(a \circ b) = (a \circ b)(1 + \delta), \quad \circ: + \; - \; \times \; \div, \tag{1.7.13}
\]
where
\[
|\delta| = |\delta(a,b)| \le \epsilon. \tag{1.7.14}
\]
We also have
\[
fl(\sqrt{a}) = \sqrt{a}\,(1 + \delta). \tag{1.7.15}
\]
Note that δ is the relative error of the operation if (a ∘ b) ≠ 0, and it is bounded by the
machine precision. This is because
We conclude that the arithmetic operations within a computer number system give the
'best' results that we can possibly get. Does this mean that we do not need to worry about
round-off errors at all? Of course not!
Now we assume that x and y are two real numbers. When we input them into a computer,
we will have errors, say fl(x) = x(1 + ε_x) and fl(y) = y(1 + ε_y). First we consider multiplications and divisions.
Note that ε_x, ε_y, and ε_{x∘y} are different numbers although they have the same upper bound!
We distinguish several different cases.
Often we ignore the high order terms (h.o.t.) since they are much smaller (for single
precision, 10^{-7} versus 10^{-14}). Thus δ is the relative error, as mentioned
before. The error bound is understandable: the factor 3ε accounts for the two input errors
and one error from the multiplication. The absolute error is −xyδ, which is bounded by
3ε|xy|. The same bounds hold for division as well, provided the divisor is not zero. Thus the
errors from multiplications and divisions are not a big concern here, but we should
avoid dividing by small numbers if possible.
which does not seem to be too bad. But the relative error may be unbounded because
\[
\frac{|(x-y) - fl(x-y)|}{|x-y|} = \frac{|x\epsilon_x - y\epsilon_y - (x-y)\epsilon_{x-y}|}{|x-y|} + O(\epsilon^2)
\le \frac{|x\epsilon_x - y\epsilon_y|}{|x-y|} + \epsilon.
\]
In general, ε_x ≠ ε_y even though they are both very small and have the same upper bound.
Thus the relative error can be arbitrarily large if x and y are very close! That means
addition/subtraction can lead to a loss of accuracy, or of significant digits. It is
also called catastrophic cancellation, as illustrated in the following example.
If the last two digits of the two numbers are wrong (likely in many circumstances),
then there is no significant digit left in the result. In this example, the absolute error
is still small, but the relative error is very large!
Mathematically, they are all equivalent (thus they are all called consistent). But occasionally,
they may give very different results, especially if c is very small. When we select the
algorithm to run on a computer, we should choose Algorithm 2 if b ≤ 0 and Algorithm 3 if
b ≥ 0. Why? This can be done using an if-then conditional statement in any computer
language.
Let us check the simple case a = 1, b = 2, c = e, that is, x^2 + 2x + e = 0. When e is very small,
we have
\[
x_1 = \frac{-2 - \sqrt{4 - 4e}}{2} = -1 - \sqrt{1-e} \approx -2,
\]
\[
x_2 = \frac{-2 + \sqrt{4 - 4e}}{2} = -1 + \sqrt{1-e} = \frac{e}{-1 - \sqrt{1-e}} \approx -0.5\,e.
\]
The last equality was obtained by rationalization (multiplying and dividing by the conjugate).
Below is a Matlab code to illustrate four different algorithms:
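The original listing is not reproduced in this copy; the following minimal sketch (an assumption of what such a comparison might look like) contrasts only the straightforward formula with the rationalized one for the small root x_2.

% Sketch: loss of significant digits in solving x^2 + 2x + e = 0 for small e.
e = 1e-10;
x2_naive = (-2 + sqrt(4 - 4*e))/2;     % suffers from catastrophic cancellation
x2_rat   = e/(-1 - sqrt(1 - e));       % rationalized form, no cancellation
rel_diff = abs(x2_naive - x2_rat)/abs(x2_rat)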
By trying various values of e, we can see how the accuracy gets lost. In general, when we have
e = 2 × 10^{-k}, we will lose about k significant digits.
\[
1 - \cos x = 1 - \left(1 - \frac{x^2}{2} + \frac{x^4}{4!} - \cdots\right) \approx \frac{x^2}{2}.
\]
\[
f(x+h) - f(x) = h f'(x) + h^2 \frac{f''(x)}{2} + h^3 \frac{f'''(x)}{3!} + \cdots
\]
• Another rule of thumb for summations fl(Σ_{i=1}^n x_i): we should add those numbers
with small magnitude first to avoid "large numbers eat small numbers".
s=0;               % initialize
for i=1:n
    s = s + a(i);  % a common mistake is to forget the s here!
end
• Product ∏_{i=1}^n x_i:

s=1;               % initialize
for i=1:n
    s = s * a(i);  % a common mistake is to forget the s here!
end
In Matlab, we can simply use y = A*x. Or we can use the component form so that we
can easily convert the code to other computer languages. We can put the following into a
Matlab .m file, say, test_Ax.m, with the following contents:
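The contents of that file are not shown in this copy; a minimal sketch of what a component-form test_Ax.m might contain (an assumption, not the original listing) is:

% test_Ax.m: component form of y = A*x (hypothetical reconstruction).
n = 5;
A = rand(n);  x = rand(n,1);        % example data
y = zeros(n,1);                     % initialize the result
for i = 1:n
    for j = 1:n
        y(i) = y(i) + A(i,j)*x(j);  % accumulate the i-th component
    end
end
max(abs(y - A*x))                   % compare with Matlab's built-in product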
We wish to develop efficient algorithms (fast, low storage, accurate, and easy to program).
Note that Matlab is case sensitive and the indices of arrays should be positive integers
(they cannot be zero).
p = a(1);                 % a(1) stores a_0, a(i+1) stores a_i
for i=1:n
    p = p + a(i+1)*x^i;   % add the term a_i * x^i
end
The total number of operations is about O(n^2/2) multiplications and n additions. However,
from the following observations
\[
\begin{aligned}
p_3(x) &= a_3 x^3 + a_2 x^2 + a_1 x + a_0 \\
&= x(a_3 x^2 + a_2 x + a_1) + a_0 \\
&= x(x(a_3 x + a_2) + a_1) + a_0
\end{aligned}
\]
p = a(n+1);               % a(n+1) stores a_n
for i = n:-1:1
    p = x*p + a(i);       % Horner's rule: nested multiplication
end
1.9 Exercises
\[
fl_c(x) = 0.d_1 d_2 \cdots d_n \times \beta^b
\]
Find upper bounds of the absolute and relative errors of f lc (x) approximating x.
Compare the results with the results obtained from the rounding-off approach.
Note that the specifics may differ slightly with different computers and compilers.
where x, y, and z are real numbers. Find upper bounds of absolute and relative errors.
Assume all the numbers involved are in the range of the computer number system.
Analyze the error bounds.
(HINT: You can set x_1 = fl(x), y_1 = fl(y), z_1 = fl(z), p_1 = fl(x_1 y_1); then p_c = fl(p_1 z_1)
is the computed product of x, y, and z. Note: pay attention to the upper bounds
and absolute values, e.g., δ_5 ≤ 5ε is wrong, it should be |δ_5| ≤ 5ε.)
6. We can use the following three formulas to approximate the first derivative of a func-
tion f (x) at x0 .
\[
f'(x_0) \approx \frac{f(x_0+h) - f(x_0)}{h}, \qquad
f'(x_0) \approx \frac{f(x_0+h) - f(x_0-h)}{2h}, \qquad
f'(x_0) \approx \frac{f(x_0) - f(x_0-h)}{h}.
\]
When we use computers to find an approximation of a derivative (used in the finite difference
(FD) method, optimization, and many other areas), we need to balance the error
from the algorithm (truncation error) and the round-off errors (from computers).
(a) Which formula is the most accurate in theory? Hint: Find the absolute error
using the Taylor expansion at x = x_0: f(x_0 ± h) = f(x_0) ± f'(x_0)h + f''(x_0)h^2/2 ±
f'''(x_0)h^3/6 + O(h^4).
(b) Write a program to compute the derivative with
• f (x) = x2 , x0 = 1.8.
• f (x) = ex sin x, x0 = 0.55.
Plot the errors versus h using a log-log plot with labels and legends if necessary. In
the plot, h should range from 0.1 to the order of the machine constant (10^{-16}), with
h being cut in half each time (i.e., h = 0.1, h = 0.1/2, h = 0.1/2^2, h = 0.1/2^3,
..., until h ≤ 10^{-16}).
Hint: You need to find the true derivative (analytic) values in order to compute
and plot the errors.
Tabulate the absolute and relative errors corresponding to h = 0.1, 0.1/2, 0.1/4,
0.1/8, and 0.1/16 (that is, different choices of h compared with those used in
the plots). The ratio (should be around 2 or 4) is defined as the quotient of two
consecutive errors. Analyze and explain your plots and tables. What is the
best h for each case with and without round-off errors?
1/h    error (a)   ratio   error (b)   ratio   error (c)   ratio
10                 --                  --                  --
20
40
80
160
7. Mini-project: Find the relation between relative errors and significant digits.
Chapter 2
Vector and matrix norms are generalizations of the absolute value function for a single
variable. There are at least two motivations to use them.
Given the space R^n,
R^n = { x : x = [x_1, x_2, ..., x_n]^T }.
If a function f (x) satisfies (1)-(3), we use a special notation f (x) = kxk and call this
function a norm in Rn .
Let
\[
f(x) = \max_{1\le i\le n}\{|x_i|\}.
\]
\[
f(\alpha x) = \max_{1\le i\le n}\{|\alpha x_i|\} = \max_{1\le i\le n}\{|\alpha|\,|x_i|\} = |\alpha|\max_{1\le i\le n}\{|x_i|\} = |\alpha| f(x).
\]
• f(x) = \(\dfrac{\max_{1\le i\le n}\{|x_i|\}}{x_1^2}\). No, since f(αx) ≠ |α| f(x), and f(x) is not defined for
those vectors whose first component is zero.
• f(x) = \(\left(\sum_{i=1}^n x_i^2\right)^{1/2}\). Yes, it is called the 2-norm, or Euclidean norm, and it is denoted by
f(x) = ||x||_2. A sketch of the proof is given below.
\[
\sum_{i=1}^n (x_i + y_i)^2 \le \sum_{i=1}^n x_i^2 + \sum_{i=1}^n y_i^2 + 2\left\{\sum_{i=1}^n x_i^2\right\}^{1/2}\left\{\sum_{i=1}^n y_i^2\right\}^{1/2}, \quad\text{or}\quad
\sum_{i=1}^n x_i y_i \le \left\{\sum_{i=1}^n x_i^2\right\}^{1/2}\left\{\sum_{i=1}^n y_i^2\right\}^{1/2}.
\]
The last inequality is the Cauchy-Schwarz inequality. To prove this inequality, we consider
a special quadratic function
\[
\begin{aligned}
g(\lambda) &= \sum_{i=1}^n (x_i - \lambda y_i)^2 \ge 0 \\
&= \sum_{i=1}^n x_i^2 - 2\lambda \sum_{i=1}^n x_i y_i + \lambda^2 \sum_{i=1}^n y_i^2 \\
&= c + b\lambda + a\lambda^2, \qquad a = \sum_{i=1}^n y_i^2, \quad b = -2\sum_{i=1}^n x_i y_i, \quad c = \sum_{i=1}^n x_i^2.
\end{aligned}
\]
The function g(λ) is a non-negative function, and the quadratic equation g(λ) = 0 has at
most one real root or no roots. Therefore the discriminant should satisfy b^2 − 4ac ≤ 0, that
is
\[
4\left(\sum_{i=1}^n x_i y_i\right)^2 - 4\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2 \le 0.
\]
This is equivalent to
\[
\sum_{i=1}^n x_i y_i \le \sqrt{\sum_{i=1}^n y_i^2}\,\sqrt{\sum_{i=1}^n x_i^2}.
\]
This concludes
\[
\sum_{i=1}^n x_i y_i \le \left|\sum_{i=1}^n x_i y_i\right| \le \left\{\sum_{i=1}^n x_i^2\right\}^{1/2}\left\{\sum_{i=1}^n y_i^2\right\}^{1/2}.
\]
There are different Cauchy-Schwarz inequalities in different spaces, for example, the L^2 space,
Sobolev spaces, Hilbert spaces, etc. The proof process is similar. A special case is y_i = 1, for
which we have
\[
\sum_{i=1}^n x_i \le \left\{\sum_{i=1}^n x_i^2\right\}^{1/2}\left\{\sum_{i=1}^n 1\right\}^{1/2} = \sqrt{n}\,\sqrt{\sum_{i=1}^n x_i^2}.
\]
An example: Let x = \(\begin{pmatrix}-5\\ 1\end{pmatrix}\); find ||x||_p for p = 1, 2, ∞.
The solution is ||x||_1 = 6, ||x||_∞ = 5, ||x||_2 = √26. Note that we have ||x||_∞ ≤ ||x||_2 ≤
||x||_1. This is true for any x.
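As a quick check (an illustration, not part of the original text), these values can be computed in Matlab with the built-in norm function.

x = [-5; 1];
[norm(x,1), norm(x,2), norm(x,Inf)]   % returns 6, sqrt(26), and 5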
In general, we can define the p-norm for p ≥ 1,
\[
\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}.
\]
• All vector norms are equivalent in a finite dimensional space. That is, given any two
vector norms, say ||x||_α and ||x||_β, there are two constants C_{αβ} and c_{αβ} such that
c_{αβ}||x||_β ≤ ||x||_α ≤ C_{αβ}||x||_β.
Note that the constants are independent of x but may depend on the dimension n.
Thus the one, two, and infinity norms of a vector are all equivalent. What are the
smallest (C) and largest (c) such constants?
Theorem 2.1.1
\[
\frac{\|x\|_1}{n} \le \|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty \le \sqrt{n}\,\|x\|_2 \le \sqrt{n}\,\|x\|_1.
\]
As an illustration, we prove
\[
\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2.
\]
Proof:
\[
\|x\|_1^2 = \left(\sum_{i=1}^n |x_i|\right)^2 = \sum_{i=1}^n |x_i|^2 + 2\sum_{i<j} |x_i||x_j| = \|x\|_2^2 + 2\sum_{i<j}|x_i||x_j| \ge \|x\|_2^2.
\]
On the other hand, from the Cauchy-Schwarz inequality, we already know that
\[
\|x\|_1 = \sum_{i=1}^n |x_i|\cdot 1 \le \sqrt{n}\,\sqrt{\sum_{i=1}^n x_i^2} = \sqrt{n}\,\|x\|_2.
\]
In particular, if y = x, we have
\[
(x,y) = \sum_{i=1}^n x_i x_i = \sum_{i=1}^n x_i^2 = \|x\|_2^2.
\]
There are two definitions of matrix norms. The first one uses the same definition as
a vector norm.
Definition: A matrix norm is a multi-variable function of its entries that satisfies the
following relations:
If a function f(A) satisfies (1)-(3), we use the special notation f(A) = ||A|| and call this
function a norm on R^{m×n}.
Alternatively, we can treat a matrix as a long vector and then use the definition of a
vector norm. For example, if A ∈ R^{m×n} = {a_ij} is treated as a long
vector, either row-wise or column-wise, the 2-norm, now called the Frobenius norm, of
the matrix is
\[
\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n a_{ij}^2}. \tag{2.2.1}
\]
For an n by n identity matrix, we then have ||I||_F = √n instead of the ||I|| = 1 we may have expected.
Since matrices are often used together with vectors, e.g. Ax = b, Ax = λx, it is natural
to define a matrix norm from a vector norm.
Theorem 2.2.1 Given a vector norm ||·|| in the space R^n, the matrix function on A ∈ R^{m×n} defined below
\[
f(A) = \sup_{x\ne 0}\frac{\|Ax\|}{\|x\|} = \max_{x\ne 0}\frac{\|Ax\|}{\|x\|} = \max_{\|x\|=1}\|Ax\| \tag{2.2.2}
\]
is a matrix norm in the space R^{m×n}. It is called the associated (or induced, or subordinate)
matrix norm.
Proof:
• For any x, from the properties of a vector norm, we have ||(A + B)x|| = ||Ax + Bx|| ≤
||Ax|| + ||Bx||. Thus we have
\[
\max_{x\ne 0}\frac{\|(A+B)x\|}{\|x\|} \le \max_{x\ne 0}\left(\frac{\|Ax\|}{\|x\|} + \frac{\|Bx\|}{\|x\|}\right) \le \max_{x\ne 0}\frac{\|Ax\|}{\|x\|} + \max_{x\ne 0}\frac{\|Bx\|}{\|x\|} \le f(A) + f(B).
\]
From the definition of the associated matrix norm, we can conclude the following important
properties.
• ||I|| = 1. This is obvious since \(\max_{x\ne 0}\|Ix\|/\|x\| = 1\).
• ||Ax|| ≤ ||A|| ||x|| for any x. It is obviously true if x = 0.
Proof: If x ≠ 0, then we have
\[
\|A\| = \max_{y\ne 0}\frac{\|Ay\|}{\|y\|} \ge \frac{\|Ax\|}{\|x\|}.
\]
Multiplying both sides by ||x||, we get ||A|| ||x|| ≥ ||Ax||.
For any vector norm, there is an associated matrix norm. Since we know how to evaluate
||x||_p for p = 1, 2, ∞, we should know how to evaluate ||A||_p as well; it is not practical to use
the definition all the time.
Theorem 2.2.2
\[
\|A\|_\infty = \max\left\{\sum_{j=1}^n |a_{1j}|, \sum_{j=1}^n |a_{2j}|, \cdots, \sum_{j=1}^n |a_{ij}|, \cdots, \sum_{j=1}^n |a_{mj}|\right\} = \max_i \sum_{j=1}^n |a_{ij}|, \tag{2.2.3}
\]
\[
\|A\|_1 = \max\left\{\sum_{i=1}^m |a_{i1}|, \sum_{i=1}^m |a_{i2}|, \cdots, \sum_{i=1}^m |a_{ij}|, \cdots, \sum_{i=1}^m |a_{in}|\right\} = \max_j \sum_{i=1}^m |a_{ij}|. \tag{2.2.4}
\]
In other words, we add the magnitudes of the elements in each row, and then we select the largest
of these sums, which is the infinity norm of the matrix A.
An example: Let
\[
A = \begin{pmatrix} -5 & 0 & 7 \\ 1 & 3 & -1 \\ 0 & 0 & 1 \end{pmatrix}.
\]
The row sums of the magnitudes are 12, 5, and 1, and the column sums are 6, 3, and 9, so ||A||_∞ = 12 and ||A||_1 = 9.
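As a quick check (an illustration, not part of the original text, and assuming the matrix as reconstructed above), the built-in Matlab norms agree with the row and column sums.

A = [-5 0 7; 1 3 -1; 0 0 1];
[norm(A,Inf), norm(A,1)]     % maximum row sum 12, maximum column sum 9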
The proof has two parts: (a) show that ||A||_∞ ≤ M; (b) find a specific x*, ||x*||_∞ = 1,
such that ||Ax*||_∞ ≥ M. For any x with ||x||_∞ ≤ 1, we have
\[
Ax = \begin{pmatrix} \sum_{j=1}^n a_{1j}x_j \\ \sum_{j=1}^n a_{2j}x_j \\ \vdots \\ \sum_{j=1}^n a_{ij}x_j \\ \vdots \\ \sum_{j=1}^n a_{mj}x_j \end{pmatrix},
\qquad |x_j|\le 1, \qquad
\left|\sum_{j=1}^n a_{ij}x_j\right| \le \sum_{j=1}^n |a_{ij}||x_j| \le \sum_{j=1}^n |a_{ij}| \le M.
\]
Now we carry out the second step. The largest sum of the magnitudes is attained at one (or
more) particular row, say the i*-th row. We choose
\[
x^* = \begin{pmatrix} \mathrm{sign}(a_{i^*,1}) \\ \mathrm{sign}(a_{i^*,2}) \\ \vdots \\ \mathrm{sign}(a_{i^*,n}) \end{pmatrix},
\qquad
\mathrm{sign}(x) = \begin{cases} 1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0 \end{cases},
\qquad x\,\mathrm{sign}(x) = |x|.
\]
Thus we get
\[
\|Ax^*\|_\infty = \left\|\begin{pmatrix} \times \\ \vdots \\ \sum_{j=1}^n a_{i^*j}x_j^* \\ \vdots \\ \times \end{pmatrix}\right\|_\infty = \sum_{j=1}^n |a_{i^*j}| = M,
\]
where × means some number. That completes the proof.
Theorem 2.2.3
\[
\|A\|_2 = \max\left\{\sqrt{\lambda_1(A^HA)}, \sqrt{\lambda_2(A^HA)}, \cdots, \sqrt{\lambda_i(A^HA)}, \cdots, \sqrt{\lambda_n(A^HA)}\right\} = \max_i \sqrt{\lambda_i(A^HA)},
\]
The proof is left as an exercise. Note that even if A is a complex matrix, A^H A and A A^H
are Hermitian semi-positive definite matrices. A symmetric (Hermitian) semi-positive definite matrix B
(B = B^H = {b_ij}) satisfies the following:
Remark 2.2.1 Different norms have different applications. The 2-norm is the Euclidean
distance in one, two, and three dimensions and is differentiable, which is important for many
optimization and extreme value problems. The infinity norm and the 1-norm are non-differentiable but
may be important for some optimization problems in order to preserve important quantities such as
sharp edges, corners, etc. Note also that the matrix norms ||A||_∞ and ||A||_1 are easy to
compute while ||A||_2 is not. Finally, there is a subtle difference between ||x||_∞ and L^∞; the
latter often refers to the integral norms L^p.
2.3 Exercises
(b): Assume that A ∈ R^{n,n}. Show that \(\|A\|_1 = \max_{1\le j\le n}\left\{\sum_{i=1}^n |a_{ij}|\right\}\).
4. (a) Show that kxk∞ is equivalent to kxk2 . That is to find constants C and c such that
c ≤ kxk∞ ≤ kxk2 ≤ Ckxk∞ . Note that you need to determine such constants
that the equalities are true for some particular x.
(b) Show that ||Qx||_2 = ||x||_2 if Q is an orthogonal matrix (Q^H Q = I, Q Q^H = I).
(c) Show that ||AB|| ≤ ||A|| ||B|| for any natural (associated) matrix norm, and that ||QA||_2 = ||A||_2.
5. Let kxk be a vector norm, A be a symmetric positive definite matrix. (a): Show that
kAxk is also a vector norm. (b): It is known that we can factorize the matrix A as
A = B H B, find kAxk2 in terms of B and x. (c): If A = D is a diagonal matrix with
all positive diagonals, find the expression of kDxk2 .
Chapter 3
In this chapter, we discuss some of the most commonly used direct methods for solving a linear system of
equations of the following form
Almost all direct methods are based on Gaussian elimination. A direct method is
a method that returns the exact solution in a finite number of operations under exact computation
(no round-off errors present). Such a method is often suitable for small to modest-sized
dense matrices.
The main idea of Gaussian elimination algorithm is based on the following observation.
Consider the following (upper) triangular system of equations
\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & \cdots & a_{1n} \\
 & a_{22} & \cdots & \cdots & a_{2n} \\
 & & \ddots & & \vdots \\
 & & & \ddots & \vdots \\
 & & & & a_{nn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ \vdots \\ b_n \end{pmatrix}. \tag{3.0.2}
\]
• From the second to last equation, a_{n-1,n-1} x_{n-1} + a_{n-1,n} x_n = b_{n-1}, we get x_{n-1} =
(b_{n-1} - a_{n-1,n} x_n)/a_{n-1,n-1}. Note that we need the result x_n from the previous step.
• In general (the idea of induction), assume we have computed x_n, x_{n-1}, ..., x_{i+1}; from
the i-th equation, a_{ii} x_i + a_{i,i+1} x_{i+1} + ··· + a_{in} x_n = b_i, we get
\[
x_i = \left(b_i - \sum_{j=i+1}^n a_{ij}x_j\right)\Big/ a_{ii}, \qquad i = n, n-1, \cdots, 1. \tag{3.0.3}
\]
for i = n, -1, 1
    x_i = \left(b_i - \sum_{j=i+1}^n a_{ij}x_j\right)/a_{ii}
endfor
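A minimal Matlab sketch of this backward substitution (an illustration, not the book's listing):

function x = back_sub(U, b)
% Backward substitution for an upper triangular system U*x = b.
n = length(b);  x = zeros(n,1);
for i = n:-1:1
    x(i) = (b(i) - U(i,i+1:n)*x(i+1:n)) / U(i,i);
end
end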
Now we count the number of operations. In the i-th step, there are (n - i)
multiplications and one division, and (n - i) subtractions. Thus the total number of
multiplications and divisions is
\[
1 + 2 + \cdots + n = \frac{n(n+1)}{2} = \frac{n^2}{2} + O(n),
\]
and the number of additions/subtractions is
\[
1 + 2 + \cdots + (n-1) = \frac{n(n-1)}{2} = \frac{n^2}{2} + O(n).
\]
The total cost is only about that of one matrix-vector multiplication, which is considered to be
very fast.
The main idea of Gaussian elimination (GE) is to use row transforms to reduce the
system to an upper triangular one while keeping the solution unchanged. For this purpose, we
can apply the Gaussian elimination to the coefficient matrix or to the augmented matrix,
which is defined as [A | b], that is, the matrix enlarged by one column.
First, we use a 4 by 4 matrix to illustrate the idea. We use the number to indicate the
number of the times that the entries have been changed, and the sequence of changes.
\[
\left(\begin{array}{cccc|c}
1 & 1 & 1 & 1 & 1\\
1 & 1 & 1 & 1 & 1\\
1 & 1 & 1 & 1 & 1\\
1 & 1 & 1 & 1 & 1
\end{array}\right)
\Longrightarrow
\left(\begin{array}{cccc|c}
1 & 1 & 1 & 1 & 1\\
0 & 2 & 2 & 2 & 2\\
0 & 2 & 2 & 2 & 2\\
0 & 2 & 2 & 2 & 2
\end{array}\right)
\Longrightarrow
\left(\begin{array}{cccc|c}
1 & 1 & 1 & 1 & 1\\
0 & 2 & 2 & 2 & 2\\
0 & 0 & 3 & 3 & 3\\
0 & 0 & 3 & 3 & 3
\end{array}\right)
\Longrightarrow
\left(\begin{array}{cccc|c}
1 & 1 & 1 & 1 & 1\\
0 & 2 & 2 & 2 & 2\\
0 & 0 & 3 & 3 & 3\\
0 & 0 & 0 & 4 & 4
\end{array}\right)
\]
\[
L_{n-1} L_{n-2} \cdots L_2 L_1 \left[A \,\big|\, b\right]. \tag{3.0.4}
\]
We need to derive the recursive relations so that we can implement the algorithm. The
general procedure is to derive the first step, maybe second step if necessary..., the general
step to see if we have a complete algorithm. For convenience, we denote the right hand side
as b = [a_{1,n+1}, a_{2,n+1}, ..., a_{n,n+1}]^T, and we set L_1 [A | b] =
\[
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0\\
-l_{21} & 1 & & & 0\\
-l_{31} & 0 & 1 & & 0\\
\vdots & \vdots & & \ddots & \vdots\\
-l_{n1} & 0 & \cdots & 0 & 1
\end{pmatrix}
\left(\begin{array}{ccccc|c}
a_{11} & a_{12} & \cdots & \cdots & a_{1n} & a_{1,n+1}\\
a_{21} & a_{22} & \cdots & \cdots & a_{2n} & a_{2,n+1}\\
\vdots & \vdots & & & \vdots & \vdots\\
a_{n1} & a_{n2} & \cdots & \cdots & a_{nn} & a_{n,n+1}
\end{array}\right)
=
\left(\begin{array}{ccccc|c}
a_{11} & a_{12} & \cdots & \cdots & a_{1n} & a_{1,n+1}\\
0 & a_{22}^{(2)} & \cdots & \cdots & a_{2n}^{(2)} & a_{2,n+1}^{(2)}\\
\vdots & & & a_{ij}^{(2)} & & \vdots\\
0 & & & & & \vdots
\end{array}\right).
\]
We need to derive the formulas for li1 and aij . We multiply the 2-nd row of L1 to the
1-st column of [A|b] = A(1) to get
(1)
(1) (1) a21
−l21 a11 + a21 = 0, =⇒ l21 = (1)
.
a11
We multiply the 3-rd row of L1 to the 1-st column of [A|b] = A(1) to get
(1)
(1) (1) a31
−l31 a11 + a31 = 0, =⇒ l31 = (1)
.
a11
In general, we multiply the i-th row of L1 to the 1-st column of [A|b] = A(1) to get
(1)
(1) (1) ai1
−li1 a11 + ai1 = 0, =⇒ li1 = (1)
, i = 2, 3, · · · , n.
a11
(1)
(1) (1) (2) (2) (1) ai1 (1)
−li1 a1j + aij = aij , aij = aij − a
(1) 1j
a11
i = 2, 3, · · · , n, j = 2, 3, · · · , n, n + 1.
Since the formulas only depend on the indices, we just need to apply the substitutions
1 ⟹ 2 and 2 ⟹ 3 to get the following formulas:
\[
l_{i2} = \frac{a_{i2}^{(2)}}{a_{22}^{(2)}}, \qquad i = 3, \cdots, n,
\]
\[
a_{ij}^{(3)} = a_{ij}^{(2)} - \frac{a_{i2}^{(2)}}{a_{22}^{(2)}}\,a_{2j}^{(2)}, \qquad i = 3, 4, \cdots, n, \quad j = 3, 4, \cdots, n, n+1.
\]
In general, at the k-th step,
\[
l_{ik} = \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}, \qquad i = k+1, \cdots, n,
\]
\[
a_{ij}^{(k+1)} = a_{ij}^{(k)} - \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}\,a_{kj}^{(k)}, \qquad i = k+1, \cdots, n, \quad j = k+1, \cdots, n, n+1.
\]
• After we get a_{ij}^{(2)}, we do not need a_{ij}^{(1)} anymore. We can overwrite it to save storage.
• We do not need to store the zeros below the diagonal. Often we store l_{ij} there instead.
for k = 1, n-1
    for i = k+1, n
        a_{ik} := a_{ik}/a_{kk}
        for j = k+1, n
            a_{ij} := a_{ij} - a_{ik} a_{kj}
        end
    end
end
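The pseudo-code above translates almost line by line into Matlab; a minimal sketch (an illustration, assuming no zero pivots are encountered):

function A = ge_lu(A)
% Gaussian elimination without pivoting: overwrite A with the multipliers
% l_{ik} (below the diagonal) and with U (on and above the diagonal).
n = size(A,1);
for k = 1:n-1
    A(k+1:n,k) = A(k+1:n,k)/A(k,k);                          % multipliers
    A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)*A(k,k+1:n); % update the submatrix
end
end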
• The determinant of A: since L_{n-1} L_{n-2} ··· L_2 L_1 A = U and each L_k has determinant one,
we have det(A) = det(U) = u_{11} u_{22} ··· u_{nn}.
As part of analysis, we need to know the cost of an algorithm, or the total number of
operations needed to complete the algorithm.
For the GE algorithm, we count the number of operations in the k-th step and then add
them together. In the k-th step of the GE algorithm, we need to compute
\[
a_{ik} := \frac{a_{ik}}{a_{kk}}, \qquad i = k+1, \cdots, n,
\]
\[
a_{ij} := a_{ij} - a_{ik} a_{kj}, \qquad i = k+1, \cdots, n, \quad j = k+1, \cdots, n, n+1.
\]
k = 1:      (n-1)(n+1) operations
k = 2:      (n-2)n operations
...
k = n-1:    1 · 3 operations
The total is
\[
\sum_{k=1}^{n-1} k(k+2) = \sum_{k=1}^{n-1} k^2 + \sum_{k=1}^{n-1} 2k = \frac{(n-1)n(2n-1)}{6} + n(n-1) = O\!\left(\frac{n^3}{3}\right).
\]
Note that the first term is the cost of applying the GE process to the matrix A;
the second part is the cost of applying the same transform to the right hand side, which is equivalent
to the forward substitution. We often emphasize the order of operations, which is O(n^3/3);
the constant coefficient (1/3) is also important here.
The number of additions/subtractions is almost the same. When n = 1000, the total
number of operations is roughly 2 × 10^9/3 (about a billion), which is quite a large number.
It is true that for large dense matrices the GE algorithm is not very fast!
When a_{11} = 0, the GE algorithm breaks down even if A is non-singular. For example, we
consider Ax = b where
\[
A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \det(A) = -1 \ne 0, \qquad
A_1 = \begin{pmatrix} \epsilon & 1 \\ 1 & 0 \end{pmatrix}.
\]
The Gaussian elimination fails for the first A; for the second one, in general we will have
\[
fl\!\left(a_{ij} - \frac{a_{i1}}{a_{11}}a_{1j}\right)
= \left(a_{ij} - \frac{a_{i1}}{a_{11}}a_{1j}(1+\delta_1)(1+\delta_2)\right)(1+\delta_3)
= a_{ij} - \frac{a_{i1}}{a_{11}}a_{1j} - \frac{a_{i1}}{a_{11}}a_{1j}\,\delta_3 + \cdots
\]
We can see that if |a11 | is very small, the round-off error will be amplified. The element a11
is called the pivot element.
(Footnote: Usually we put the multiplications and divisions into one category, and the additions/subtractions into
another category. An operation in the first category often takes slightly longer than one in the second
category.)
At the first step, we choose the pivot row l such that |a_{l1}| ≥ |a_{i1}|, i = 1, 2, ..., n.
l = 1; pivot = abs(a(1,1));      % current pivot candidate
for i=2:n
    if abs(a(i,1)) > pivot       % a larger magnitude found in column 1
        pivot = abs(a(i,1));
        l = i;
    end
end
• Exchange the l-th row with the 1st row, a_{lj} ⟷ a_{1j}, j = 1, 2, ..., n, n+1. This can
be done using

for j=1:n+1
    tmp = a(1,j);
    a(1,j) = a(l,j);
    a(l,j) = tmp;     % note: a(l,j), not a(1,j)
end
The partial column pivoting algorithm can be expressed as row transform as well.
\[
I = \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & & \ddots & & \\
 & & & & 1 & \\
 & & & & & \ddots
\end{pmatrix}
\;\longrightarrow\;
P_{ij} = \begin{pmatrix}
1 & & & & & \\
 & \ddots & & & & \\
 & & 0 & \cdots & 1 & \\
 & & \vdots & \ddots & \vdots & \\
 & & 1 & \cdots & 0 & \\
 & & & & & \ddots
\end{pmatrix}
\]
An elementary permutation matrix Pij is the matrix obtained by exchanging the i-th and
j-th rows of the identity matrix, for example
\[
P_{13} = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}.
\]
If we multiply a matrix A by P_{ij} from the left, the new matrix is the one obtained by
exchanging the i-th and j-th rows of A, for example
\[
\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ -2 & 4 & 5 \\ 3 & -1 & 0 \end{pmatrix}
=
\begin{pmatrix} 3 & -1 & 0 \\ -2 & 4 & 5 \\ 1 & 2 & 3 \end{pmatrix}.
\]
Similarly, if we multiply a matrix A by P_{ij} from the right, the new matrix is the one obtained
by exchanging the i-th and j-th columns of A.
• PijT = Pij = Pij−1 , that is, Pij is an orthogonal matrix, PijT Pij = Pij PijT = I.
• det(P_{ij}) = -1.
3.0.7 P A = LU decomposition
The Gaussian elimination with partial column pivoting algorithm can be written as the
following matrix form
Ln−1 Pn−2 Ln−2 · · · L2 P2 L1 P1 A = U, (3.0.10)
We can see that when we move P_{ij} to the right, passing through L_k, we just need to exchange
the corresponding two entries in the non-zero column below the diagonal. Finally we will have P A = LU.
The determinant of A can be computed using
\[
\det(PA) = \det(L)\det(U) = u_{11}u_{22}\cdots u_{nn}, \qquad\text{or}\qquad \det(A) = (-1)^m u_{11}u_{22}\cdots u_{nn},
\]
where m is the total number of row exchanges. Below is a pseudo-code of the process:
for k = 1, n-1                  % the k-th elimination step
    ap = |a_{kk}|
    ip = k
    for i = k+1, n               % find the pivot row
        if |a_{ik}| > ap then
            ip = i
            ap = |a_{ik}|
        end
    end
    for j = k, n+1 (or n)        % exchange rows k and ip
        at = a_{k,j}
        a_{k,j} = a_{ip,j}
        a_{ip,j} = at
    end
Once we have the P A = LU decomposition, we can use it to solve the linear system of
equations as follows:
1. Step 1: Form b̃ = P b, that is, exchange the rows of b. This is because from Ax = b we
get P Ax = P b.
2. Step 2: Solve Ly = b̃ using forward substitution.
3. Step 3: Solve Ux = y using backward substitution.
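In Matlab the three steps can be carried out with the built-in lu function; a minimal sketch (an illustration only, with hypothetical data):

A = [0 1; 1 0];  b = [2; 3];   % hypothetical example
[L, U, P] = lu(A);             % P*A = L*U with partial pivoting
y = L \ (P*b);                 % forward substitution
x = U \ y;                     % backward substitution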
• Complete pivoting. At the first step, we choose |a_{11}| := \max_{i,j}|a_{ij}|. In this approach, we
have to exchange both rows and columns, which makes the programming more difficult.
It may also destroy the structure of certain matrices. The improvement in
accuracy is marginal compared with partial column pivoting.
• Scaled column pivoting. If the rows of the matrix A have very different magnitudes, this
approach is strongly recommended. At the first step, we choose k such that
\[
\frac{|a_{k1}|}{s_k} = \max_{1\le i\le n}\frac{|a_{i1}|}{s_i}, \qquad s_i = \sum_{j=1}^n |a_{ij}|. \tag{3.1.1}
\]
When we input vectors or matrices into a computer number system, we are going to have
round-off errors; the errors satisfy the following relations
for any vector and matrix norms, where C_1, C_2 are two constants depending on n.
Even before we solve the linear system of equations, we are already solving a different problem,
(A + E_A)x = b + E_b, due to the input errors. The question is how these errors affect the
results. This is summarized in the following theorem:
Theorem 3.2.1 If ||A^{-1} E_A|| < 1 (or ||A^{-1}|| ||E_A|| < 1, a stronger condition), define x_e = A^{-1} b,
δx = (A + E_A)^{-1}(b + E_b) - A^{-1} b; then
\[
\frac{\|\delta x\|}{\|x_e\|} \le \frac{\|A\|\,\|A^{-1}\|}{1 - \|A\|\,\|A^{-1}\|\dfrac{\|E_A\|}{\|A\|}}\left(\frac{\|E_A\|}{\|A\|} + \frac{\|E_b\|}{\|b\|}\right). \tag{3.2.2}
\]
We can see that ||A|| ||A^{-1}|| is an important amplifying factor in the error estimate.
It is called the condition number of the matrix A,
Its partial sum
\[
B_n = I - E + E^2 - E^3 + \cdots + (-1)^n E^n = \sum_{k=0}^n (-1)^k E^k
\]
satisfies
Furthermore,
\[
\|(I+E)^{-1}\| \le \|I\| + \|E\| + \|E\|^2 + \cdots + \|E\|^n + \cdots = \frac{1}{1 - \|E\|}.
\]
Now we prove the main error theorem.
\[
\|\delta x\| = \|(A+E_A)^{-1}(b+E_b) - A^{-1}b\|
\le \|(A+E_A)^{-1}\big(b + E_b - (A+E_A)A^{-1}b\big)\|.
\]
Notice that
\[
\|A^{-1}E_A\| \le \|A^{-1}\|\,\|E_A\| \le \|A\|\,\|A^{-1}\|\,\frac{\|E_A\|}{\|A\|}.
\]
Therefore,
\[
\frac{\|\delta x\|}{\|x_e\|} \le \frac{\|A\|\,\|A^{-1}\|}{1 - \|A\|\,\|A^{-1}\|\dfrac{\|E_A\|}{\|A\|}}\left(\frac{\|E_b\|}{\|A\|\,\|x_e\|} + \frac{\|E_A\|}{\|A\|}\right).
\]
If we ignore high order terms in the above inequality, we can go one step further:
\[
\begin{aligned}
\frac{\|\delta x\|}{\|x_e\|} &\le \mathrm{cond}(A)\left(\frac{\|E_b\|}{\|b\|} + \frac{\|E_A\|}{\|A\|}\right)\left(1 + \mathrm{cond}(A)\frac{\|E_A\|}{\|A\|} + O\!\left(\Big(\mathrm{cond}(A)\frac{\|E_A\|}{\|A\|}\Big)^2\right) + \cdots\right) \\
&\approx \mathrm{cond}(A)\left(\frac{\|E_b\|}{\|b\|} + \frac{\|E_A\|}{\|A\|}\right).
\end{aligned}
\]
Remark 3.2.1
• The relative errors in the data (either or both of A and b) are amplified by the factor
of the condition number cond(A).
• The condition number cond(A) has nothing to do with any algorithm; it depends only
on the matrix itself. However, if cond(A) is very large in reference to the machine
precision, then no matter what algorithm we use, in general we cannot expect a good
result. Such a matrix is called ill-conditioned, and the problem is said to be ill-conditioned.
For an ill-conditioned system of linear equations, a small perturbation in the data (A
and b) will cause a large change in the solution.
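A classical illustration of an ill-conditioned matrix (not from the original text) is the Hilbert matrix, whose condition number grows very rapidly with its size; the small sketch below prints a few of these condition numbers.

% Condition numbers of Hilbert matrices, a standard ill-conditioned example.
for n = [4 8 12]
    fprintf('n = %2d, cond_2 = %.3e\n', n, cond(hilb(n), 2));
end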
There are various errors during a problem solving process, for example, modelling errors,
input errors (fl(A)), algorithm errors (truncation errors), and round-off errors. We have
assumed that we start with a mathematical problem and wish to use a computer to solve it, so
we will not discuss the modelling errors here.
Using the Gaussian elimination method with partial pivoting, there is no formula (truncation) error;
that is why it is called a direct method. So we only need to consider round-off errors.
Round-off error analysis is often complicated; J. H. Wilkinson made it a little
simpler. His technique is simple: the round-off errors can be regarded as a perturbation to the
original data, for example, fl(x + y) = (x(1 + δ_x) + y(1 + δ_y))(1 + δ_3) = x(1 + δ_4) + y(1 + δ_5) =
x̃ + ỹ. In other words, the computed result is the exact sum of two perturbed numbers of
the original data.
For the Gaussian elimination method for solving Ax = b, we have the following theorem due
to J. H. Wilkinson.
Theorem 3.3.1 The computed solution of the Gaussian elimination algorithm on a computer
for solving Ax = b is the exact solution of the following system of linear equations
\[
(A + E)x = b, \tag{3.3.1}
\]
where ||E|| ≤ C g(n) ε ||A||. The function g(n) is called the growth factor, defined below:
\[
g(n) = \frac{\displaystyle\max_k \max_{1\le i,j\le n} |a_{ij}^{(k)}|}{\displaystyle\max_{1\le i,j\le n} |a_{ij}^{(1)}|}. \tag{3.3.2}
\]
• For the Gaussian elimination algorithm without pivoting, g(n) = ∞, indicating that the method
may break down.
• For the Gaussian elimination algorithm with partial column pivoting, g(n) = 2^{n-1}. This
bound is attainable.
Proof: For the Gaussian elimination algorithm with partial column pivoting, we have
\[
a_{ij}^{(k+1)} = a_{ij}^{(k)} - \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}\,a_{kj}^{(k)}.
\]
Since |a_{ik}^{(k)}/a_{kk}^{(k)}| ≤ 1, we conclude that
\[
|a_{ij}^{(k+1)}| \le \max_{ij}|a_{ij}^{(k)}| + \max_{ij}|a_{ij}^{(k)}| \le 2\max_{ij}|a_{ij}^{(k)}| \le 2\times 2\max_{ij}|a_{ij}^{(k-1)}| \le \cdots \le 2^{k}\max_{ij}|a_{ij}^{(1)}|.
\]
This shows that g(n) ≤ 2^{n-1}. The following example shows that such a bound is
attainable:
\[
\begin{pmatrix}
1 & & & & 1\\
-1 & 1 & & & 1\\
-1 & -1 & 1 & & 1\\
\vdots & & \ddots & \ddots & \vdots\\
-1 & -1 & \cdots & -1 & 1
\end{pmatrix}
\dashrightarrow
\begin{pmatrix}
1 & & & & 1\\
 & 1 & & & 2\\
 & & 1 & & 2^2\\
 & & & \ddots & \vdots\\
 & & & & 2^{n-1}
\end{pmatrix}.
\]
However, the matrix above is a specific one, the general conjecture is that for most
reasonable matrices, g(n) ∼ n.
For most computational problems, the relative error of the computed solution x_c approximating
the true solution x_e satisfies the following relation
\[
\frac{\|x_c - x_e\|}{\|x_e\|} \le \mathrm{cond(problem)}\; g\mathrm{(algorithm)}\; \epsilon. \tag{3.3.3}
\]
That is, three factors affect the accuracy:
• The conditioning of the problem itself, characterized by cond(problem).
• The computer used for solving the problem, characterized by the machine precision ε.
• The algorithm used to solve the problem, characterized by the growth factor g.
In the error estimate, ||A^{-1}|| is involved. But we know it is difficult and expensive to compute
A^{-1}. Do we have a better way to estimate how accurate an approximation is? The answer
is the residual vector r(x_a) = b - A x_a.
If det(A) ≠ 0 and r(x_a) = 0, then x_a is the true solution. We can use ||r(x_a)|| to measure
how close x_a is to the true solution x_e = A^{-1} b. Note that r(x_a) is called computable since
it needs only a matrix-vector multiplication and does not need A^{-1}.
Example: Let
\[
A = \begin{pmatrix} 1 & -1 & 0 \\ 2 & 0 & 1 \\ 3 & 0 & 2 \end{pmatrix}, \qquad
b = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \qquad
x_a = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.
\]
How far is the residual from the relative error? The answer is given in the following
theorem:
Theorem 3.4.1
\[
\frac{\|r(x_a)\|}{\|A\|\,\|x_a\|} \le \frac{\|x_e - x_a\|}{\|x_a\|} \le \|A^{-1}\|\,\frac{\|r(x_a)\|}{\|x_a\|}. \tag{3.4.2}
\]
In other words, if we normalize the matrix A such that ||A|| = 1, then the gap between the two bounds is
about the condition number of A.
Note that the residual vector r(x) = b - Ax is the negative gradient of the function f(x) = ½ x^T A x - b^T x when
A is a symmetric positive definite matrix. It is the search direction of the steepest descent
method in optimization and an important basic concept in the popular conjugate gradient (CG)
method.
In some situations and for various considerations, we may not want the pivoting process in
the Gaussian elimination algorithm. This can be done using the direct LU decomposition.
Assuming that we do not pivot, we can have the direct A = LU decomposition,
that is, closed formulas for the entries of L and U, where L is a unit lower
triangular matrix and U is an upper triangular matrix.
\[
A = \begin{pmatrix}
1 & & & & \\
l_{21} & 1 & & & \\
l_{31} & l_{32} & 1 & & \\
\vdots & & \ddots & \ddots & \\
l_{n1} & l_{n2} & \cdots & l_{n,n-1} & 1
\end{pmatrix}
\begin{pmatrix}
u_{11} & u_{12} & \cdots & \cdots & u_{1n} \\
 & u_{22} & \cdots & \cdots & u_{2n} \\
 & & \ddots & & \vdots \\
 & & & \ddots & \vdots \\
 & & & & u_{nn}
\end{pmatrix}.
\]
To derive the formulas for the entries of L and U, the order is very important. The
order is as follows.
• We can get the first row of U by multiplying the first row of L with the columns of U:
u_{1j} = a_{1j}, j = 1, 2, ..., n.
• After we get the k-th row of U, we can get the k-th column of L by multiplying the
i-th row of L with the k-th column of U:
\[
a_{ik} = \sum_{j=1}^{k-1} l_{ij}u_{jk} + l_{ik}u_{kk} \quad\Longrightarrow\quad
l_{ik} = \frac{a_{ik} - \sum_{j=1}^{k-1} l_{ij}u_{jk}}{u_{kk}}, \qquad i = k+1, \cdots, n.
\]
One particular application is the system of linear equations derived from the finite
difference method. Another application is the alternating directional implicit (ADI) for
solving two or three dimensional partial differential equations (PDE) using dimension by
dimension approach.
If we use the direct LU decomposition of the tridiagonal matrix, we get a very
simple decomposition:
\[
\begin{pmatrix}
d_1 & \beta_1 & & & \\
\alpha_2 & d_2 & \beta_2 & & \\
 & \alpha_3 & d_3 & \beta_3 & \\
 & & \ddots & \ddots & \ddots \\
 & & & \alpha_n & d_n
\end{pmatrix}
=
\begin{pmatrix}
1 & & & & \\
\alpha_2' & 1 & & & \\
 & \alpha_3' & 1 & & \\
 & & \ddots & \ddots & \\
 & & & \alpha_n' & 1
\end{pmatrix}
\begin{pmatrix}
d_1' & \beta_1 & & & \\
 & d_2' & \beta_2 & & \\
 & & d_3' & \beta_3 & \\
 & & & \ddots & \ddots \\
 & & & & d_n'
\end{pmatrix}.
\]
Comparing entries gives the recursion d_1' = d_1, and for i = 2, ..., n, α_i' = α_i/d_{i-1}', d_i' = d_i - α_i' β_{i-1}.
Once we have the LU factorization, we can easily derive the formulas for the forward and
backward substitutions for solving Ax = b.
The forward substitution is
y_1 = b_1,
for i = 2, n
    y_i = b_i - α_i' y_{i-1},
end
The entire process (Crout decomposition, forward and backward substitutions) requires about O(5n)
multiplications/divisions. The solution process is sometimes called the chasing method for solving
a tridiagonal system of equations.
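A minimal Matlab sketch of the chasing method (an illustration built on the factorization above; alpha, d, beta hold the sub-, main, and super-diagonals):

function x = tridiag_solve(alpha, d, beta, b)
% Chasing method for a tridiagonal system: alpha(2:n) sub-diagonal,
% d(1:n) diagonal, beta(1:n-1) super-diagonal, b right-hand side.
n = length(d);  dp = zeros(n,1);  y = zeros(n,1);  x = zeros(n,1);
dp(1) = d(1);  y(1) = b(1);
for i = 2:n                            % factorization and forward substitution
    ap    = alpha(i)/dp(i-1);
    dp(i) = d(i) - ap*beta(i-1);
    y(i)  = b(i) - ap*y(i-1);
end
x(n) = y(n)/dp(n);
for i = n-1:-1:1                       % backward substitution
    x(i) = (y(i) - beta(i)*x(i+1))/dp(i);
end
end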
For example, the matrix on the left below is strictly column diagonally dominant while the
one on the right is not:
\[
\begin{pmatrix} 5 & 0 & 1 \\ -1 & 2 & -3 \\ 3 & -1 & 7 \end{pmatrix},
\qquad
\begin{pmatrix} 5 & 0 & 1 \\ -2 & 4 & 1 \\ 4 & 0 & 5 \end{pmatrix}.
\]
The question is: what happens at the next step, and thereafter? The following theorem answers
this question:
Theorem 3.6.1 Let A be a strictly column diagonally dominant matrix. After one step
of Gaussian elimination
\[
L_1 A = \begin{pmatrix} a_{11} & * \\ 0 & A_1 \end{pmatrix},
\]
the reduced matrix A_1 is again strictly column diagonally dominant.
\[
\begin{aligned}
\sum_{i=2, i\ne j}^n |a_{ij}^{(2)}|
&\le \sum_{i=2, i\ne j}^n |a_{ij}^{(1)}| + \sum_{i=2, i\ne j}^n \left|\frac{a_{i1}^{(1)}}{a_{11}^{(1)}}\right| |a_{1j}^{(1)}| \\
&< \left(|a_{jj}^{(1)}| - |a_{1j}^{(1)}|\right) + \frac{|a_{1j}^{(1)}|}{|a_{11}^{(1)}|}\left(|a_{11}^{(1)}| - |a_{j1}^{(1)}|\right) \\
&= |a_{jj}^{(1)}| - \frac{|a_{1j}^{(1)}|}{|a_{11}^{(1)}|}\,|a_{j1}^{(1)}| \\
&\le \left|a_{jj}^{(1)} - \frac{a_{1j}^{(1)}}{a_{11}^{(1)}}\,a_{j1}^{(1)}\right| = |a_{jj}^{(2)}|.
\end{aligned}
\]
Note that in passing to the second line we have used the strictly column diagonally dominant condition
twice: first for the j-th column, which gives the negative term -|a_{1j}^{(1)}|, and then for the first
column, which gives the other negative term -|a_{j1}^{(1)}|. This completes the proof.
A matrix is called symmetric positive definite (SPD) if the following conditions are met:
1. A = A^H, or a_{ij} = ā_{ji};
2. x^H A x > 0 for any x ≠ 0.
Note that the second condition has the following equivalent statements, which also give
some ways to judge whether a matrix is SPD or not.
• λ_i(A) > 0, i = 1, 2, ..., n, that is, all the eigenvalues of A ∈ R^{n,n} are real and positive.
A principal sub-matrix A_k is the matrix formed from the intersections of the first k rows
and columns of the original matrix A, for example
\[
A_1 = \{a_{11}\}, \quad
A_2 = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad
A_3 = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}, \quad \cdots, \quad A_n = A.
\]
The first one is not since a11 < 0, or a22 = 0. For the second one, we have A = AT and
det(A1 ) = 2 > 0, det(A2 ) = 4 − 1 = 3 > 0, and det(A3 ) = det(A) = 8 − 2 − 2 = 4 > 0, so A
is an SPD.
To derive the formulas for the entries of L, the order is very important. The order is as
follows.
• We can get the rest of the first column of L by multiplying the first row of L with the
columns of L^T:
\[
a_{1j} = l_{11} l_{j1} \quad\Longrightarrow\quad l_{j1} = \frac{a_{1j}}{l_{11}}, \qquad j = 2, \cdots, n.
\]
We now have the first column of L.
• Assume we have obtained the first (k-1) columns of L; we derive the formula for the
k-th column of L. If we multiply the k-th row of L with the k-th column of L^T, we get
\[
a_{kk} = \sum_{j=1}^{k-1} l_{kj}^2 + l_{kk}^2 \quad\Longrightarrow\quad
l_{kk} = \sqrt{a_{kk} - \sum_{j=1}^{k-1} l_{kj}^2}.
\]
After we get the first element of the k-th column of L, we can get the rest of the k-th
column of L by multiplying the i-th row of L with the k-th column of L^T:
\[
a_{ik} = \sum_{j=1}^{k-1} l_{ij}l_{kj} + l_{ik}l_{kk} \quad\Longrightarrow\quad
l_{ik} = \frac{a_{ik} - \sum_{j=1}^{k-1} l_{ij}l_{kj}}{l_{kk}}, \qquad i = k+1, \cdots, n.
\]
Pseudo-code of A = LL^T:
for k = 1, n
    l_{kk} = \sqrt{a_{kk} - \sum_{j=1}^{k-1} l_{kj}^2}
    for i = k+1, n
        l_{ik} = \left(a_{ik} - \sum_{j=1}^{k-1} l_{ij}l_{kj}\right)/l_{kk}
    end
end
Use the Cholesky decomposition (A = LLT ) to solve Ax = b.
From Ax = b we get LLT x = b. The forward and backward substitutions are the
following:
1. Solve Ly = b.
2. Solve LT x = y.
The number of multiplications/divisions needed for the Cholesky decomposition is O(n^3/6).
The storage needed is O(n^2/2). So we need only half the storage and half the computations.
The only disadvantage is that we need to evaluate square roots. A slightly different version
is the A = LDL^T decomposition, which may work for any symmetric matrix assuming
there is no breakdown (not guaranteed if A is not SPD).
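In Matlab the factorization and the two triangular solves can be written compactly; a minimal sketch (an illustration only, with a small hypothetical SPD matrix):

A = [2 1 0; 1 2 1; 0 1 2];  b = [1; 2; 3];   % hypothetical SPD example
R = chol(A);            % Matlab returns an upper triangular R with A = R'*R
y = R' \ b;             % forward substitution:  R'*y = b
x = R \ y;              % backward substitution: R*x = y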
• In Matlab, we can use x = A\b for solving Ax = b; [l, u] = lu(A) for LU decompo-
sition; chol(A) for Cholesky decomposition; det(A) for determinant of A; cond(A, 2),
for example, to find the condition number of A in 2-norm.
• There are many books on the programming of different methods. A popular one is
Numerical Recipes in (Fortran, C, Pascal, ...).
3.7 Exercises
1. Given
\[
A = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 3 & 0 & 1 & 2 \\ 2 & 3 & 0 & 1 \\ 1 & 2 & 3 & 0 \end{pmatrix},
\qquad
b = \begin{pmatrix} 6 \\ 6 \\ 6 \\ 6 \end{pmatrix},
\]
(a) Use Gaussian elimination with the partial pivoting to find the matrix decom-
position P A = LU . This is a paper problem and you are asked to use exact
calculations (use fractions if necessary).
(b) Find the determinant of the matrix A.
(c) Use the factorization to solve Ax = b.
to get X. Count the number of operations by each algorithm and determine which
one is faster.
4. Given an n-by-n non-singular matrix A, how do you efficiently solve the following
problems using Gaussian elimination with partial column pivoting?
You should (1) describe your algorithm; (2) present a pseudo-code; (3) find out the
required operation count.
Justify your conclusion. What is the significance of knowing these special matrices
to the Gaussian related algorithms? Answer this question by considering issues of
Find the Cholesky decomposition A = LLT or A = LDLT for the middle matrix.
Note that, you need to determine the range of α and β.
(a) Derive the linear system of equations for the interpolation problem.
(b) Let xi = (i − 1)h, i = 1, 2, · · · , m + 1, h = 1/m, yi = sin πxi , write a computer
code using the Gaussian elimination with column partial pivoting to solve the
problem. Test your code with m = 4, 8, 16, 32, 64 and plot the error |y(x)−sin πx|
with 100 or more points between 0 and 1, that is, predict the function at more
points in addition to the sample points. For example, you can set h1 = 1/100;
x1 = 0 : h1 : 1, y1(i) = a0 + a1 x1(i) + · · · + am−1 (x1(i))m−1 + am (x1(i))m ,
y2(i) = sin(πx1(i)), plot(x1, y1 − y2).
(c) Record the CPU time (in Matlab, type help cputime) for m = 50, 100, 150, 200, ..., 350, 400.
Plot the CPU time versus m. Then use the Matlab function polyfit, z = polyfit(m, cputime(m), 3),
to find a cubic fit of the CPU time versus m. Write down the polynomial
and analyze your result. Does it look like a cubic function?
(a) Derive the algorithms for A = LDLT decomposition, where L is a unit lower
triangular matrix, and D is a diagonal matrix.
(b) Write a Matlab code (or other language if you prefer) to do the factorization and
solve the linear system of equations Ax = b using the factorization. Hint: the
process is the following:
Ly = b, y is the unknown,
Dz = y, z is the unknown,
T
L x = z, x is the unknown, which is the solution.
Construct at least one example that you know the exact solution to validate your
code.
10. Extra Credit: Choose one from the following (Note: please do not ask the instructor
about the solution since it is extra credit):
(a) Let A ∈ R^{n×n}. Show that
\[
\|A\|_2 = \max_{1\le i\le n}\sqrt{\lambda_i(A^TA)}
\qquad\text{and}\qquad
\|A^{-1}\|_2 = \frac{1}{\min_{1\le i\le n}\sqrt{\lambda_i(A^TA)}},
\]
where λ_i(A^T A), i = 1, 2, ..., n, are the eigenvalues of A^T A. Show further that
cond_2(A) = σ_max/σ_min, where σ_max, σ_min are the largest and smallest nonzero singular
values of A.
(b) Show that if A is a symmetric positive definite matrix, then after one step of
Gaussian elimination (without pivoting), the reduced matrix A_1 in
\[
A \Longrightarrow \begin{pmatrix} a_{11} & * \\ 0 & A_1 \end{pmatrix}
\]
is also symmetric positive definite.
The Gaussian elimination method for solving Ax = b is quite efficient if the size of A is small
to medium (in reference to the available computers) and the matrix is dense (most of the entries of the
matrix are non-zero numbers). But for several reasons, sometimes an iterative method may
be more efficient, as discussed below.
• For sparse matrices, the Gaussian elimination method may destroy the structure of
the matrix and cause 'fill-in', see for example
\[
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1\\
2 & 1 & 0 & 0 & 0 & 0\\
3 & 0 & 1 & 0 & 0 & 0\\
4 & 0 & 0 & 1 & 0 & 0\\
5 & 0 & 0 & 0 & 1 & 0\\
6 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
\Longrightarrow
\begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1\\
0 & -1 & -2 & -2 & -2 & -2\\
0 & -3 & -2 & -3 & -3 & -3\\
0 & -4 & -4 & -3 & -4 & -4\\
0 & -5 & -5 & -5 & -4 & -5\\
0 & -6 & -6 & -6 & -6 & -5
\end{pmatrix}.
\]
• Large sparse matrices for which we may not be able to store all the entries of the
matrix. Below we show an example in two dimensions.
4.1 The central finite difference method with five point sten-
cil for Poisson equation.
If f ∈ L^2(Ω), then the solution exists and is unique. An analytic solution is rarely available,
so we now discuss how to use the finite difference method to solve the Poisson equation.
\[
x_i = a + i h_x, \quad i = 0, 1, 2, \cdots, m, \quad h_x = \frac{b-a}{m}, \tag{4.1.3}
\]
\[
y_j = c + j h_y, \quad j = 0, 1, 2, \cdots, n, \quad h_y = \frac{d-c}{n}. \tag{4.1.4}
\]
We want to find an approximate solution U_ij to the exact solution u(x_i, y_j) at all the grid
points (x_i, y_j) where u(x_i, y_j) is unknown. So there are (m-1)(n-1) unknowns for a
Dirichlet boundary condition.
• Step 2: Substitute the partial derivatives with a finite difference formula in terms of
the function values at the grid points to get
\[
T_{ij} \sim \frac{(h_x)^2}{12}\frac{\partial^4 u}{\partial x^4} + \frac{(h_y)^2}{12}\frac{\partial^4 u}{\partial y^4}. \tag{4.1.5}
\]
Define
\[
h = \max\{h_x, h_y\}. \tag{4.1.6}
\]
If we remove the error term in the equation above, and replace the exact solution
u(xi , yj ) with the approximate solution Uij which is the solution of the linear system
of equations
\[
\frac{U_{i-1,j} + U_{i+1,j}}{(h_x)^2} + \frac{U_{i,j-1} + U_{i,j+1}}{(h_y)^2} - \left(\frac{2}{(h_x)^2} + \frac{2}{(h_y)^2}\right)U_{ij} = f_{ij}. \tag{4.1.8}
\]
The finite difference scheme at a grid point (xi , yj ) involves five grid points, east,
north, west, south, and the center. The center is called the master grid point.
• Solve the linear system of equations to get an approximate solution at grid points
(how?).
Generally, if one wants to use a direct method such as the Gaussian elimination method or sparse
matrix techniques, then one needs to find out the matrix structure. If one uses an iterative
method, such as the Jacobi, Gauss-Seidel, or SOR(ω) methods, then it may not be necessary to
form the matrix and the vector explicitly.
In the matrix vector form AU = F, the unknown is a one dimensional array. For the two
dimensional Poisson equations, the unknowns Uij are a two dimensional array. Therefore we
need to order it to get a one dimensional array. We also need to order the finite difference
equations. It is common practice that we use the same ordering for the equations and for
the unknowns.
There are two commonly used orderings. One is called the natural ordering, which fits
sequential computers. The other is called the red-black ordering, which fits parallel
computers.
    7 8 9            4 9 5
    4 5 6            7 3 8
    1 2 3            1 6 2
Figure 4.1: The natural ordering (left) and the red-black ordering (right).
In the natural row ordering, we order the unknowns/equations row-wise; therefore the k-th
equation corresponds to the grid index (i, j) through the following relation
We use the following example to verify the matrix-vector form of the finite difference
equations.
Assume that hx = hy = h, m = n = 4, so we will have nine equations and nine
unknowns. The coefficient matrix is 9 by 9! To write down the matrix-vector form, we use
a one-dimensional array x to express the unknown Uij .
The idea of iterative methods is to start with an initial guess, then improve the solution
iteratively. The first step is to re-write the original equation f(x) = 0 in an equivalent form
x = g(x); then we can form the iteration x^{(k+1)} = g(x^{(k)}). For example, finding the cube root of 5 is
equivalent to solving the equation x^3 - 5 = 0. This equation can be written as x = 5/x^2 or as
x = x - (x^3 - 5)/(3x^2). For the second one, the iteration
\[
x^{(k+1)} = x^{(k)} - \frac{(x^{(k)})^3 - 5}{3(x^{(k)})^2}, \qquad k = 0, 1, \cdots
\]
is called Newton's iterative method. The mathematical theory behind this is fixed
point theory.
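A minimal Matlab sketch of this Newton iteration (an illustration only):

x = 2;                            % initial guess
for k = 1:8
    x = x - (x^3 - 5)/(3*x^2);    % Newton update for x^3 - 5 = 0
end
[x, 5^(1/3)]                      % compare with the built-in power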
For a linear system of equations Ax = b, we hope to re-write it in an equivalent form
x = Rx + c so that we can form the iteration x^{(k+1)} = Rx^{(k)} + c given an initial guess x^{(0)}.
We want to choose R and c such that lim_{k→∞} x^{(k)} = x_e = A^{-1}b. A common method is
the splitting approach, in which we re-write the matrix A as
A = M - K,   det(M) ≠ 0.   (4.2.1)
We first discuss three basic iterative methods for solving Ax = b. To derive the three
methods, we re-write the matrix A as
\[
A = D - L - U =
\begin{pmatrix}
a_{11} & & & \\
 & a_{22} & & \\
 & & \ddots & \\
 & & & a_{nn}
\end{pmatrix}
-
\begin{pmatrix}
0 & & & \\
-a_{21} & 0 & & \\
\vdots & \ddots & \ddots & \\
-a_{n1} & \cdots & -a_{n,n-1} & 0
\end{pmatrix}
-
\begin{pmatrix}
0 & -a_{12} & \cdots & -a_{1n} \\
 & 0 & \cdots & -a_{2n} \\
 & & \ddots & \vdots \\
 & & & 0
\end{pmatrix}.
\]
The matrix-vector form of the Jacobi iterative method can be derived as follows:
(D - L - U)x = b,
Dx = (L + U)x + b,
x = D^{-1}(L + U)x + D^{-1}b.
The component form is useful for implementation while the matrix-vector form is good
for convergence analysis.
In the Jacobi iterative method, when we compute x_2^{(k+1)}, we have already computed x_1^{(k+1)}.
Assuming that x_1^{(k+1)} is a better approximation than x_1^{(k)}, why not use x_1^{(k+1)}
instead of x_1^{(k)} when we update x_2^{(k+1)}? With this idea, we get a new iterative method, the
Gauss-Seidel iterative method, for solving Ax = b. The component form is
\[
x_i^{(k+1)} = \left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)}\right)\Big/ a_{ii}, \qquad i = 1, 2, \cdots, n. \tag{4.4.1}
\]
To derive the matrix-vector form of the Gauss-Seidel iterative method, we write the
component form above in the form ( )^{(k+1)} = ( )^{(k)} + ( ). The component form above
is equivalent to
\[
\sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} + a_{ii}x_i^{(k+1)} = b_i - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)}, \qquad i = 1, 2, \cdots, n. \tag{4.4.2}
\]
Thus the iteration matrix of the Gauss-Seidel iterative method is (D − L)−1 U , and the
constant vector is c = (D − L)−1 b.
An iterative method is an infinite process. However, when we implement it on a computer,
we have to stop it in finite time. One or several of the following stopping criteria are used:
• ||r(x^{(k)})|| ≤ tol;
• k ≥ k_max.
function [x,k] = my_gs(n,a,b,x0,tol)
error = 1e5;  x = x0;  k = 0;
while error > tol
    for i=1:n
        x(i) = b(i);
        for j=1:n
            if j ~= i
                x(i) = x(i) - a(i,j)*x(j);  % uses the newest available x(j)
            end
        end
        x(i) = x(i)/a(i,i);                 % divide once, after the inner loop
    end
    error = norm(x-x0);       % default is the 2-norm
    x0 = x;  k = k+1;         % save the old value, update the counter
end  % end while
4.4.3 The Gauss-Seidel iterative method for 2-point boundary value prob-
lem
4.4.4 The Gauss-Seidel iterative method for the finite difference method
for Poisson equation
For the Poisson equation u_xx + u_yy = f(x, y), if we use the standard 5-point central finite
difference scheme
\[
\frac{U_{i-1,j} + U_{i+1,j} - 4U_{ij} + U_{i,j-1} + U_{i,j+1}}{h^2} = f(x_i, y_j)
\]
and the same ordering for the equations and unknowns, then the Jacobi iteration is
\[
U_{ij}^{(k+1)} = \frac{U_{i-1,j}^{(k)} + U_{i+1,j}^{(k)} + U_{i,j-1}^{(k)} + U_{i,j+1}^{(k)}}{4} - \frac{h^2}{4}\, f(x_i, y_j),
\qquad i = 1, 2, \cdots, n-1, \quad j = 1, 2, \cdots, n-1,
\]
if the solution is prescribed along the boundary (Dirichlet BC). Again, no matrix is needed,
no ordering is necessary. We do not need to transform the two dimensional array to a one
dimensional one. The implementation is rather simple.
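A minimal Matlab sketch of such a sweep (an illustration only; a homogeneous Dirichlet boundary and a hypothetical right-hand side are assumed):

% Jacobi sweeps for the 5-point scheme on a uniform interior grid.
n = 32;  h = 1/(n+1);
[X, Y] = meshgrid(h*(0:n+1), h*(0:n+1));
F = -2*pi^2*sin(pi*X).*sin(pi*Y);        % hypothetical f(x,y)
U = zeros(n+2);                          % includes the zero boundary values
for k = 1:500                            % fixed number of sweeps for simplicity
    Unew = U;
    Unew(2:n+1,2:n+1) = ( U(1:n,2:n+1) + U(3:n+2,2:n+1) ...
                        + U(2:n+1,1:n) + U(2:n+1,3:n+2) ...
                        - h^2*F(2:n+1,2:n+1) )/4;
    U = Unew;
end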
The Jacobi and Gauss-Seidel methods can be quite slow. The SOR(ω) iterative method
is an acceleration method obtained by choosing an appropriate parameter ω. The SOR(ω) iterative
method is
\[
x^{(k+1)} = (1-\omega)\,x^{(k)} + \omega\, \tilde{x}^{(k+1)}_{GS}. \tag{4.5.1}
\]
Note that it is incorrect to compute the full Gauss-Seidel result first and then do the linear
interpolation; the combination is carried out component by component.
This is equivalent to
Thus the iteration matrix and constant vector of the SOR(ω) method are
Theorem 4.5.1 A necessary condition for SOR(ω) method to converge is 0 < ω < 2.
Using an iterative method, we will get a vector sequence {x^{(k)}}, and we know how to tell
whether such a sequence converges or not. However, for an iterative method, we need to consider all
possible initial guesses and constant vectors c.
If the vector sequence of {x(k) } converges to x∗ , then by taking limit on both sides of
the iterative scheme, we have
x∗ = Rx∗ + c. (4.6.1)
Definition 4.6.1 The iterative method x^{(k+1)} = Rx^{(k)} + c is convergent if, for any initial guess x^{(0)} and constant vector c, the vector sequence {x^{(k)}} converges to the solution x* of the system of equations x* = Rx* + c.
Now we discuss a few sufficient conditions that guarantee convergence of a basic iterative
method.
Theorem 4.6.1 If there is an associated matrix norm such that kRk < 1, then the iteration
method x(k+1) = Rx(k) + c converges.
Proof: Let e^{(k)} = x^{(k)} − x*. From the iterative method x^{(k+1)} = Rx^{(k)} + c and the consistency condition x* = Rx* + c, we have
e^{(k+1)} = R e^{(k)},
hence ‖e^{(k)}‖ ≤ ‖R‖ ‖e^{(k−1)}‖ ≤ · · · ≤ ‖R‖^k ‖e^{(0)}‖ → 0 as k → ∞, since ‖R‖ < 1.
In the theorem above, the k-th error depends on the initial one that we do not know. The
following error estimate does not need the initial error.
Theorem 4.6.2 If there is an associated matrix norm such that ‖R‖ < 1, we have the following error estimate for the iteration method x^{(k+1)} = Rx^{(k)} + c:
‖e^{(k)}‖ ≤ ( ‖R‖^k / (1 − ‖R‖) ) ‖x^{(1)} − x^{(0)}‖.   (4.6.2)
Proof: From the iterative method x^{(k+1)} = Rx^{(k)} + c, we also have x^{(k)} = Rx^{(k−1)} + c. Subtracting the two, we get
x^{(k+1)} − x^{(k)} = R ( x^{(k)} − x^{(k−1)} ) = · · · = R^k ( x^{(1)} − x^{(0)} ).
Since
x* − x^{(k)} = Σ_{j=k}^{∞} ( x^{(j+1)} − x^{(j)} ) = Σ_{j=k}^{∞} R^j ( x^{(1)} − x^{(0)} ) = (I − R)^{-1} R^k ( x^{(1)} − x^{(0)} ),
this leads to
‖e^{(k)}‖ = ‖(I − R)^{-1} R^k ( x^{(1)} − x^{(0)} )‖ ≤ ( ‖R‖^k / (1 − ‖R‖) ) ‖x^{(1)} − x^{(0)}‖.
Theorem 4.6.3 If A is a strictly row diagonally dominant matrix, then both the Jacobi and Gauss-Seidel methods converge. The Gauss-Seidel method converges faster in the sense that ρ(R_{GS}) ≤ ρ(R_J) < 1.
Proof: The proof of the first part is easy. For the Jacobi method, we have R_J = D^{-1}(L + U), thus
‖R_J‖_∞ = max_i Σ_{j=1, j≠i}^{n} |a_{ij}| / |a_{ii}| < 1.
The proof for the Gauss-Seidel method is nontrivial and long; we refer the reader to the book [J. W. Demmel], pages 287-288.
For general matrices, it is unclear whether the Jacobi or Gauss-Seidel method converges
faster even if they both converge.
Theorem 4.6.4 If A is a symmetric positive definite (SPD) matrix, then the SOR(ω)
method converges for 0 < ω < 2.
Again we refer the readers to the book [J. W. Demmel] on page 290-291.
Theorem 4.6.5 Suppose A is weakly row diagonally dominant, that is,
Σ_{j=1, j≠i}^{n} |a_{ij}| ≤ |a_{ii}|,   i = 1, 2, · · · , n,
with at least one strict inequality, and A is irreducible, that is, there is no permutation matrix P such that
P^T A P = [ A_{11}  A_{12} ;  0  A_{22} ].
Then both the Jacobi and Gauss-Seidel methods converge. The Gauss-Seidel method converges faster in the sense that ρ(R_{GS}) ≤ ρ(R_J) < 1.
Note that ρ(A) ≤ ‖A‖ for any associated matrix norm. The proof is quite simple. Let λ_{i*} be an eigenvalue of A such that ρ(A) = |λ_{i*}| and let x* ≠ 0 be a corresponding eigenvector, so that Ax* = λ_{i*} x*. Then |λ_{i*}| ‖x*‖ = ‖Ax*‖ ≤ ‖A‖ ‖x*‖. Since ‖x*‖ ≠ 0, we get ρ(A) ≤ ‖A‖.
Theorem 4.6.6 An iteration method x(k+1) = Rx(k) + c converges for arbitrary x(0) and
c if and only if ρ(R) < 1.
Proof: Part A: If the iterative method converges, then ρ(R) < 1. This can be shown by contradiction. Assume that ρ(R) > 1, and let Rx* = λ_{i*} x* with ρ(R) = |λ_{i*}| > 1 and x* ≠ 0. If we set x^{(0)} = x* and c = 0, then x^{(k+1)} = λ_{i*}^{k+1} x*, which does not have a limit since |λ_{i*}|^{k+1} → ∞. The case ρ(R) = 1 is left as an exercise.
Proof: Part B: If ρ(R) < 1, then the iterative method converges for arbitrary x^{(0)} and c. The key to the proof is to find a matrix norm such that ‖R‖ < 1.
From linear algebra (Jordan's theorem) we know that any square matrix R is similar to a Jordan canonical form; that is, there is a nonsingular matrix S such that
S^{-1} R S = diag( J_1, J_2, · · · , J_p ),
where each Jordan block J_i has the eigenvalue λ_i on its diagonal and 1 on its superdiagonal.
Note that ‖S^{-1} R S‖_∞ = ρ(R) < 1 if all Jordan blocks are 1 by 1 matrices; otherwise ‖S^{-1} R S‖_∞ ≤ ρ(R) + 1. Since ρ(R) < 1, we can find ε > 0 such that ρ(R) + ε < 1, say ε = (1 − ρ(R))/2. Consider a particular Jordan block J_i and assume it is a k by k matrix. Let
D_i(ε) = diag( 1, ε, ε², · · · , ε^{k−1} );
then D_i(ε)^{-1} J_i D_i(ε) is the same block with the 1's on the superdiagonal replaced by ε. Let D(ε) = diag( D_1(ε), · · · , D_p(ε) ) and define the new norm ‖B‖_{new} = ‖D(ε)^{-1} S^{-1} B S D(ε)‖_∞.
It can easily be shown that the definition above is indeed an associated matrix norm and that ‖R‖_{new} ≤ ρ(R) + ε < 1; we conclude that the iterative method converges for any initial guess and vector c.
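In practice the condition ρ(R) < 1 is easy to check numerically for small examples. The MATLAB sketch below (illustrative only; the example matrix is made up) forms R_J and R_GS from a given matrix A and computes their spectral radii.

A  = [3 -1 1; 0 2 1; 0 -1 2];   % example matrix (strictly row diagonally dominant)
D  = diag(diag(A));
L  = -tril(A,-1);               % strictly lower part, sign convention A = D - L - U
U  = -triu(A, 1);               % strictly upper part
RJ  = D\(L+U);                  % Jacobi iteration matrix
RGS = (D-L)\U;                  % Gauss-Seidel iteration matrix
rhoJ  = max(abs(eig(RJ)));      % spectral radius of R_J
rhoGS = max(abs(eig(RGS)));     % both methods converge iff these are < 1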
For the finite difference methods for the Poisson equation in 1D and 2D, we can find the eigenvalues of the coefficient matrix, which also leads to the eigenvalues of the iteration matrices (Jacobi, Gauss-Seidel, SOR(ω)). For the 1D model problem u''(x) = f(x) with the Dirichlet boundary condition (u(0) and u(1) are prescribed), the coefficient matrix is
A_FD = (1/h²) tridiag(1, −2, 1),   A = h² A_FD = tridiag(1, −2, 1),
the tridiagonal matrices with −2 on the diagonal and 1 on the sub- and super-diagonals.
The matrix is weakly row diagonally dominant (not strictly) and irreducible; −A is
symmetric positive definite; or A is symmetric negative definite.
The eigenvalues of A = tridiag(1, −2, 1) are
λ_i(A) = −2 ( 1 − cos( iπ/(n+1) ) ),   i = 1, 2, · · · , n,
and since R_J = I + A/2 for this matrix, λ_i(R_J) = 1 + λ_i(A)/2, so that ρ(R_J) = cos( π/(n+1) ), where λ_i(A), i = 1, 2, · · · , n, are the eigenvalues of the matrix A and R_J is the iteration matrix of the Jacobi method.
We can see that as n gets larger, the spectral radius gets closer to unity, indicating slower convergence.
Once we know the spectral radius, we also know roughly the number of iterations needed to reach the desired accuracy. For example, if we wish to have roughly six significant digits, then we should require ρ(R)^k ≤ 10^{-6}, that is, k ≥ −6 / log₁₀( ρ(R) ).
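As an illustration (the grid size here is a made-up choice, and the numbers follow from the formulas above rather than from the text), the estimate is a one-line computation in MATLAB:

n   = 100;                      % illustrative number of grid intervals
rho = cos(pi/(n+1));            % Jacobi spectral radius for the 1D model problem
k   = ceil(-6/log10(rho));      % iterations for roughly six digits; about 2.9e4 here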
4.7.1 Finite difference method for the Poisson equation in two dimensions
In two space dimensions, we have parallel results. The eigenvalues of the N by N matrix A = h² A_FD, where N = n², are
λ_{i,j} = −( 4 − 2 ( cos( iπ/(n+1) ) + cos( jπ/(n+1) ) ) ),   i, j = 1, 2, · · · , n.   (4.7.2)
The eigenvalues of the Jacobi iteration matrix are
λ_{i,j}(R_J) = 1 + λ_{i,j}(A)/4,   i, j = 1, 2, · · · , n.
Thus
ρ(R_J) = max_{1≤i,j≤n} | 1 + λ_{i,j}(A)/4 | = max_{1≤i,j≤n} (1/2) | cos( iπ/(n+1) ) + cos( jπ/(n+1) ) |
       = cos( π/(n+1) ) ∼ 1 − (1/2) ( π/(n+1) )².
We see that the results in 1D and 2D are pretty much the same. To derive the best
ω for the SOR(ω) method, we need to derive the eigenvalue relation between the original
matrix and iteration matrix. Note that RSOR = (D − ωL)−1 ((1 − ω)D + ωU ).
Theorem 4.7.2 The optimal ω for the SOR(ω) method for the system of equations derived from the finite difference method for the Poisson equation is
ω_opt = 2 / ( 1 + √(1 − (ρ(R_J))²) ) = 2 / ( 1 + √(1 − cos²( π/(n+1) )) ) = 2 / ( 1 + sin( π/(n+1) ) ) ∼ 2 / ( 1 + π/(n+1) ).   (4.7.3)
• Step 2. Find the extreme value (minimum) of the above as a function of λ_J, which leads to the optimal ω.
We refer the readers to the book [J. W. Demmel] on page 292-293 for the proof.
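For a given grid the optimal parameter is a one-line computation; the MATLAB fragment below (illustrative, with a made-up grid size) evaluates the formula above:

n         = 80;                        % illustrative grid parameter
rhoJ      = cos(pi/(n+1));             % Jacobi spectral radius for the model problem
omega_opt = 2/(1 + sqrt(1 - rhoJ^2));  % optimal SOR parameter, roughly 2/(1 + pi/(n+1))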
Remark 4.7.1
• If we plot ρ(R_SOR) against ω, we see a quadratic-like curve for 0 < ω < ω_opt that flattens as ω gets closer to ω_opt, which means ρ(R_SOR) is less sensitive in the neighborhood of ω_opt; the second piece, for ω_opt < ω < 2, is a linear function of ω. Thus, we would rather choose ω larger than smaller.
• The optimal ω above is derived for the Poisson equation and does not carry over directly to other elliptic problems. However, it gives a good indication of the best ω if the diffusion is dominant relative to the mesh size.
4.8 Exercises
where Ω is the unit square. Using the finite difference method, we can get a linear
system of equations
7 8 9 4 9 5
4 5 6 7 3 8
1 2 3 1 6 2
3x1 − x2 + x3 = 3
2x2 + x3 = 2
−x2 + 2x3 = 2
(a) With x(0) = [1, −1, 1]T , find the first iteration of the Jacobi, Gauss-Seidel, and
SOR (ω = 1.5) methods.
(b) Write down the Jacobi, Gauss-Seidel, and SOR(ω) iteration matrices R_J, R_GS, and R_SOR(ω).
(c) Do the Jacobi and Gauss-Seidel iterative methods converge? Why?
4. Explain when we want to use iterative methods to solve linear system of equations
Ax = b instead of direct methods.
Also, if ||R|| = 1/10, then the iterative method x^{(k+1)} = R x^{(k)} + c converges to the solution x* of x* = Rx* + c. How many iterations are required so that ||x^{(k)} − x*|| ≤ 10^{-6}? Suppose ||x^{(0)} − x*|| = O(1).
6. Determine the convergence of the Jacobi and Gauss-Seidel methods applied to the system of equations Ax = b, where
(a):  A = [ 0.9  0  0 ;  0  1  2 ;  0  −2  1 ],
(b):  A is the n by n tridiagonal matrix with 3 on the diagonal, −1 on the superdiagonal, and 2 on the subdiagonal:
      [ 3 −1 0 0 · · · 0 ; 2 3 −1 0 · · · 0 ; 0 2 3 −1 · · · 0 ; · · · ; 0 · · · 0 2 3 −1 ; 0 · · · · · · 0 2 3 ].
7. Modify the Matlab codes poisson_drive.m and poisson_sor.m to solve the following diffusion and convection equation:
(b) Use the u(x, y) above for the boundary condition and the f (x, y) above for the
partial differential equation. Let a = 1, b = 2, and a = 100, b = 2, solve the
problem with n = 20, 40, 80, and n = 160. Try ω = 1, the best ω for the
Poisson equation discussed in the class, the optimal ω by testing, for example
ω = 1.9, 1.8, · · · , 1.
(c) Tabulate the error, the number of iterations for n = 20, 40, 80, and n = 160 with
your tested optimal ω, compare the number of iterations with the Gauss-Seidel
method.
(d) Plot the solution and the error for n = 40 with your tested optimal ω. Label
your plots as well.
11. Suppose
L1 = [ 1 0 0 0 0 0 ;  −4 1 0 0 0 0 ;  3 0 1 0 0 0 ;  6 0 0 1 0 0 ;  −2 0 0 0 1 0 ;  1 0 0 0 0 1 ],
L3 = [ 1 0 0 0 0 0 ;  0 1 0 0 0 0 ;  0 0 1 0 0 0 ;  0 0 1/2 1 0 0 ;  0 0 −1 0 1 0 ;  0 0 1/5 0 0 1 ],
P  = [ 1 0 0 0 0 0 ;  0 1 0 0 0 0 ;  0 0 0 0 1 0 ;  0 0 0 1 0 0 ;  0 0 1 0 0 0 ;  0 0 0 0 0 1 ].
(a) Can L1 or L3 be a Gauss transformation matrix with partial pivoting? Why?
(b) Compute L1^{-1}, L3^{-1}, L1 L3, and L1^{-1} L3^{-1}.
(c) Compute P^{-1}, P^T, P², P L3, and P L3 P.
(d) Find the condition number of each matrix.
(a) symmetric;
(b) strictly row diagonally dominant;
(c) symmetric positive definite?
(a) Is A symmetric?
(b) When is A a symmetric positive definite matrix?
(c) Show that such a decomposition is possible if and only if the determinants of the
principal leading sub-matrices Ak of A are all non-zero for k = 1, 2, · · · n − 1.
(d) What are the orders of operations (multiplication/division, addition/subtraction)
needed for such decomposition?
(e) Can you get A = LLT factorization from A = LDLT if A is a S.P.D? How?
lij , i = j, j + 1, · · · , n,
uij , j = i + 1, · · · , n,
15. Given a vector x̄ and the system Ax = b, derive the relation between the residual ‖b − A x̄‖ and the error ‖x̄ − A^{-1} b‖.
16. For the following model matrices, what kind of matrix-factorization would you like to
use for solving the linear system of equations? Analyze your choices (operation count,
storage, pivoting etc).
[ 3 −1 0 0 0 ;  −1 3 −1 0 0 ;  0 −1 3 −1 0 ;  0 0 −1 3 −1 ;  0 0 0 −1 3 ],   [ 0.01 3 0 −4 ;  1 2 1 2 ;  −1 0 3 −2 ;  5 −2 3 6 ].
17. Let A^{(2)} be the matrix obtained after one step of Gauss elimination applied to a matrix A, that is,
a_{ij}^{(2)} = a_{ij} − ( a_{i1} / a_{11} ) a_{1j}.
(a) Show that
max_{ij} | a_{ij}^{(2)} | ≤ 2 max_{ij} | a_{ij} |.   (4.8.2)
21. Given a stationary iterative method x^{(k+1)} = Rx^{(k)} + c, show that (a): if there is one matrix norm such that ‖R‖ < 1, then the iterative method converges; (b): if the spectral radius ρ(R) > 1, then the iterative method diverges; (c): if the spectral radius ρ(R) < 1, then the iterative method converges.
22. Check the convergence of the Jacobi, Gauss-Seidel, and SOR(ω) methods if (a): A is a column diagonally dominant matrix (strictly, weakly); (b): A is a symmetric positive (or negative) definite matrix.
(d): How many iterations do we need in general if we use Jacobi, Gauss-Seidel, and
SOR(ω) in terms of n?
24. Consider the linear system of equations (also consider one, three dimensions)
Ui−1,j + Ui+1,j + Ui,j−1 + Ui,j+1 − 4Uij + αUij = fij , 0 < i, j < n − 1. (4.8.3)
3x1 − x2 + x3 = 3
2x2 + x3 = 2
−x2 + 2x3 = 2
(a) With x(0) = [1, −1, 1]T , find the first and second iteration of the Jacobi, Gauss-
Seidel, and SOR (ω = 1.5) methods.
(b) Write down the Jacobi and Gauss-Seidel iteration matrices RJ and RGS .
(c) Do the Jacobi and Gauss-Seidel iterative methods converge?
26. Extra credit. Do some research to explain the behavior observed with cond_hw.m and possible ways of improving it.
Selected solutions
3. For the first matrix, the eigenvalues of R are its diagonal entries. Notice that |a_ii| < 1 for i = 1, 2, 3, 4, so we just need to check |a_55| = |1 − sin(απ)|. Note that 0 < sin x ≤ 1 if 0 < x < π and that sin x is periodic with period 2π. Thus, if 2k < α < 2k + 1 for some integer k, then the iterative method converges.
For the second matrix we have ‖R‖₁ = 0.9999 < 1, so the iterative method converges.
5.1 Preliminary
Given a square matrix A ∈ R^{n,n}, if we can find a number λ ∈ C and x ≠ 0 such that Ax = λx, then λ is called an eigenvalue of A and x is called an eigenvector corresponding to λ. Note that if Ax = λx, then A(cx) = λ(cx) for any non-zero constant c; in other words, an eigenvector is only determined up to a constant multiple. Often we prefer to use an eigenvector with unit length (‖x‖ = 1). We call (λ, x) an eigen-pair if Ax = λx (x ≠ 0).
For an eigen-pair (λ, x), we have Ax − λx = 0. This means that (λI − A)x = 0 has non-zero (non-unique) solutions, which indicates that λI − A is singular, or det(λI − A) = 0. Thus λ must be a root of the characteristic polynomial of the matrix A, det(λI − A) = λⁿ + a_{n−1}λ^{n−1} + · · · + a₁λ + a₀.
There are n eigenvalues for an n by n square matrix. The eigenvalues can be real or complex, and they can be repeated. If the matrix is real, then the complex eigenvalues occur in conjugate pairs; that is, if λ = a + bi is an eigenvalue, then λ̄ = a − bi is also an eigenvalue. If A is a real symmetric matrix, then all the eigenvalues are real numbers.
Eigenvectors corresponding to different eigenvalues are linearly independent. If an eigenvalue λ* has multiplicity p, which means that the characteristic polynomial has the factor (λ − λ*)^p but not the factor (λ − λ*)^{p+1}, then the number of linearly independent eigenvectors corresponding to λ* is less than or equal to p; recall some examples in class. If an n by n square matrix A has n linearly independent eigenvectors, then A is diagonalizable, that is, there is a nonsingular matrix S such that S^{-1}AS = D, where D = diag(λ₁, λ₂, · · · , λₙ) is a diagonal matrix.
For convenience of discussion, we will use the following notation. We arrange the eigenvalues in decreasing order of magnitude, |λ₁| ≥ |λ₂| ≥ · · · ≥ |λₙ|.
Thus ρ(A) = |λ1 |, λ1 is called the dominant eigenvalue (can be more than one), while
λn is called the least dominant eigenvalue.
There are many applications of eigenvalue problems. Below are a few of them.
• Eigenvalue problems of differential equations, such as −u''(x) = λ u(x), 0 < x < 1, with
u(0) = 0,  u(1) = 0.
After applying a finite difference or finite element method, we would have Ax = λx. The solutions are the basis for the Fourier series expansion.
5.2 The power method
The idea of the power method is as follows: starting from a non-zero vector x^{(0)} ≠ 0 (an approximation to an eigenvector), form the iteration
x^{(k+1)} = A x^{(k)},   k = 0, 1, 2, · · · .
Then, under some conditions, we can extract eigen-pair information from the sequence!
With the assumption that λ₁ satisfies |λ₁| > |λ₂|, which is an essential condition, we can show that x^{(k)} ∼ C λ₁^k v₁ (a multiple of the dominant eigenvector) and that x_p^{(k+1)} / x_p^{(k)} ∼ λ₁ for large k, where |x_p^{(k+1)}| = ‖x^{(k+1)}‖_∞.
Sketch of the proof:
While feasible in theory, the idea is not practical in computation because x(k) −→ 0 if
ρ(A) < 1 and x(k) −→ ∞ if ρ(A) > 1. The solution is to rescale the vector sequence, which
leads to the following power method.
Given x^{(0)} ≠ 0, for k = 0, 1, 2, · · · until convergence:
y^{(k+1)} = A x^{(k)}
x^{(k+1)} = y^{(k+1)} / ‖y^{(k+1)}‖₂
end
We can use the following stopping criteria: |µk+1 − µk | < tol or ky(k+1) − y(k) k < tol or
both.
Under some conditions, the sequence of pairs (µ_k, x^{(k)}) converges to the eigen-pair corresponding to the dominant eigenvalue.
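A compact MATLAB version of the scaled power method might look like the following sketch (illustrative only; the Rayleigh-quotient estimate mu used here is one common choice for the eigenvalue approximation and is an assumption, since the definition of µ_k is not reproduced above).

function [mu, x, k] = power_method(A, x0, tol, kmax)
% Power method with 2-norm scaling; returns approximations to the dominant eigen-pair.
x = x0/norm(x0);  mu = 0;  k = 0;
while k < kmax
    y     = A*x;              % y^(k+1) = A x^(k)
    munew = x'*y;             % Rayleigh quotient, approximates lambda_1
    x     = y/norm(y);        % rescale to unit 2-norm
    k     = k + 1;
    if abs(munew - mu) < tol, mu = munew; break; end
    mu = munew;
end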
For simplicity of the proof, we assume that A has a complete set of eigenvectors v₁, v₂, · · · , vₙ, with ‖v_i‖ = 1 and A v_i = λ_i v_i.
Theorem 5.2.1 Assume that |λ₁| > |λ₂| ≥ |λ₃| ≥ · · · ≥ |λₙ|, and x^{(0)} = Σ_{i=1}^{n} α_i v_i with α₁ ≠ 0. Then the pair (µ_k, x^{(k)}) of the power method converges to the eigen-pair corresponding to the dominant eigenvalue.
Proof: From the algorithm, x^{(k)} = A^k x^{(0)} / ( γ_k γ_{k−1} · · · γ₁ ) = C_k A^k x^{(0)}, where the γ_j = ‖y^{(j)}‖₂ and C_k are some constants. Since x^{(k)} is parallel to y^{(k)} and has unit length in the 2-norm, we must have
x^{(k)} = y^{(k)} / ‖y^{(k)}‖₂ = ± A^k x^{(0)} / ‖A^k x^{(0)}‖₂.
On the other hand, we know that
A^k x^{(0)} = λ₁^k ( α₁ v₁ + α₂ (λ₂/λ₁)^k v₂ + · · · + αₙ (λₙ/λ₁)^k vₙ ).
Thus we have
lim_{k→∞} x^{(k)} = lim_{k→∞} ± A^k x^{(0)} / ‖A^k x^{(0)}‖₂ = ± α₁ v₁ / ‖α₁ v₁‖₂ = ± v₁,
and correspondingly µ_k → λ₁.
If we use a different scaling, we get a different power method. Using the infinity norm, we first introduce the x_p notation for a vector x: given a vector x, x_p is the first component such that |x_p| = ‖x‖_∞, and p is its index. For example, if x = [2, −1, −5, 5, −5]^T, then x_p = −5 with p = 3.
5.3 The inverse power method for the least dominant eigenvalue
If A is invertible, then the 1/λ_i are the eigenvalues of A^{-1}, and 1/λₙ is the dominant eigenvalue of A^{-1}. However, the following inverse power method finds the least dominant eigenvalue of A without the need to form A^{-1} explicitly.
Given x^{(0)} ≠ 0, form the following iteration: for k = 0, 1, 2, · · · until convergence,
A y^{(k+1)} = x^{(k)}   (that is, y^{(k+1)} = A^{-1} x^{(k)})
x^{(k+1)} = y^{(k+1)} / ‖y^{(k+1)}‖₂
end
With similar conditions (|λₙ| < |λ_{n−1}| ≤ · · · ≤ |λ₁|, the essential condition), one can prove that the sequence converges to the eigen-pair corresponding to the least dominant eigenvalue λₙ.
Note that, at each iteration, we need to solve a linear system of equations with the same coefficient matrix. This is the most expensive part of the algorithm. An efficient implementation is to compute the matrix decomposition outside the loop. Let PA = LU; then the algorithm can be written as follows.
Given x^{(0)} ≠ 0 and PA = LU, for k = 0, 1, 2, · · · until convergence,
L z^{(k+1)} = P x^{(k)}
U y^{(k+1)} = z^{(k+1)}
x^{(k+1)} = y^{(k+1)} / ‖y^{(k+1)}‖₂
end
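A possible MATLAB realization (a sketch only, not code from the text) factors the matrix once and then performs only triangular solves inside the loop.

function [x, k] = inv_power(A, x0, tol, kmax)
% Inverse power method; the LU factorization is computed once, outside the loop.
[L, U, P] = lu(A);            % P*A = L*U, done only once
x = x0/norm(x0);  k = 0;
while k < kmax
    z = L\(P*x);              % forward substitution
    y = U\z;                  % backward substitution, y = A^{-1} x
    xnew = y/norm(y);
    k = k + 1;
    if norm(xnew - x) < tol, x = xnew; break; end   % sketch; sign may alternate for a negative eigenvalue
    x = xnew;
end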
If σ is a good approximation to an eigenvalue λ_p of A, so that |λ_p − σ| < |λ_i − σ| for all i ≠ p, then λ_p − σ is the least dominant eigenvalue of A − σI. We can use the following shifted inverse power method to find the eigenvalue λ_p and its corresponding eigenvector. Given x^{(0)} ≠ 0, form the following iteration:
for k = 0, 1, 2, · · · until convergence,
(A − σI) y^{(k+1)} = x^{(k)}
x^{(k+1)} = y^{(k+1)} / ‖y^{(k+1)}‖₂
end
Thus if we can find a good approximation to any eigenvalue, we can use the shifted inverse power method to compute it. Now the question is how we can roughly locate the eigenvalues of a matrix A. The Gershgorin theorem provides some useful hints.
Definition 5.4.1 Given a matrix A ∈ C^{n×n}, the circle (all points within the circle in the complex plane)
|λ − a_{ii}| ≤ Σ_{j=1, j≠i}^{n} |a_{ij}|   (5.4.1)
is called the i-th Gershgorin circle (disk) of A.
1. Every eigenvalue of A lies in at least one of the Gershgorin circles.
2. The union of k Gershgorin circles which do not intersect with the other n − k circles contains precisely k eigenvalues of A.
Proof: For any eigen-pair (λ, x), Ax = λx. Consider the p-th component of x such that |x_p| = ‖x‖_∞. We have
Σ_{j=1}^{n} a_{pj} x_j = λ x_p,
or
(λ − a_{pp}) x_p = Σ_{j=1, j≠p}^{n} a_{pj} x_j.
Taking absolute values and using |x_j| ≤ |x_p|, we get
|λ − a_{pp}| ≤ Σ_{j=1, j≠p}^{n} |a_{pj}| |x_j| / |x_p| ≤ Σ_{j=1, j≠p}^{n} |a_{pj}|.
Thus λ is in the p-th Gershgorin circle and the first part of the theorem is complete.
The proof of the second part is based on a continuation argument: the roots of a polynomial are continuous functions of the coefficients of the polynomial. The theorem is obviously true for a diagonal matrix. As the off-diagonal entries are increased continuously from 0 to a_{ij}, the radii of the Gershgorin circles increase continuously, and the eigenvalues move within the union of the Gershgorin circles but cannot cross into the disjoint ones.
Example: Let A be the following matrix:
A = [ −5  −1   0 ;  −1   2  −1/2 ;  0  −1   8 ].
Use the Gershgorin theorem to roughly locate the eigenvalues.
The three Gershgorin circles are
• R1 : |z + 5| ≤ 1.
• R2 : |z − 2| ≤ 1.5.
• R3 : |z − 8| ≤ 1.
They do not intersect with each other, so each circle contains exactly one eigenvalue. Since the matrix is real and complex eigenvalues must occur in conjugate pairs, we conclude that all the eigenvalues are real. Thus we get
7 ≤ λ₁ ≤ 9,   −6 ≤ λ₂ ≤ −4,   0.5 ≤ λ₃ ≤ 3.5.
If we wish to find the middle eigenvalue λ2 we should choose σ = −5. Even for the
dominant eigenvalue, we would get faster convergence if we shift the matrix by taking
σ = 8 and then apply the shifted power method.
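As a quick illustration (not from the text), one could verify these locations numerically; the following MATLAB sketch applies a shifted inverse power iteration with σ = −5 to the example matrix above.

A = [-5 -1 0; -1 2 -0.5; 0 -1 8];
sigma = -5;                          % shift chosen from the Gershgorin circle R1
x = [1; 1; 1];  x = x/norm(x);
for k = 1:50
    y = (A - sigma*eye(3))\x;        % solve the shifted system
    x = y/norm(y);
end
lambda2 = x'*A*x                     % Rayleigh quotient: the eigenvalue inside R1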
If we know an eigen-pair (λ₁, x₁) of a matrix A, we can use the deflation method to reduce A to a matrix of one dimension lower whose eigenvalues are the remaining eigenvalues of A. The process is as follows. Assume that ‖x₁‖₂ = 1; we can expand x₁ to an orthonormal basis of Rⁿ, {x₁, x₂, · · · , xₙ}, with x_i^T x_j = δ_{ij}. Note that δ_{ij} = 0 if i ≠ j and δ_{ii} = 1. Let Q = [x₁, x₂, · · · , xₙ]; then Q is an orthogonal matrix (Q^T Q = QQ^T = I). We can get
Q^T A Q = [ λ₁  b^T ;  0  A₁ ]
for some vector b and an (n−1) by (n−1) matrix A₁. Thus the eigenvalues of A₁ are also eigenvalues of A, but A₁ is of one dimension lower compared with the original matrix A. The deflation method is only used if we wish to find a few eigenvalues.
To find all eigenvalues, the QR method is often used. The idea of the QR method is first to reduce the matrix to a simpler form (often an upper Hessenberg matrix or a tridiagonal matrix) using a similarity transformation S^{-1}AS so that the eigenvalues are unchanged. Since the inverse of an orthogonal matrix is its transpose, S is often chosen to be an orthogonal matrix. Orthogonal matrices also have better stability than other matrices since ‖Qx‖₂ = ‖x‖₂ and ‖QA‖₂ = ‖A‖₂.
Definition 5.5.1 Given a unit vector w, ‖w‖₂ = 1, the Householder matrix is defined as
P = I − 2ww^T.   (5.5.1)
Theorem 5.5.1 If ‖x‖₂ = ‖y‖₂, then there is a Householder matrix P such that Px = y.
Proof: We require
(I − 2ww^T) x = y,   that is,   x − y = 2w (w^T x).
Note that 2(w^T x) is a number; thus w is parallel to x − y. Since w is also a unit vector in the 2-norm, we conclude that w = (x − y)/‖x − y‖₂, and it is then a simple manipulation to show that Px = y.
Example: Find a Householder matrix P such that
P [3, 0, 4]^T = [α, 0, 0]^T.
Note that in this example, we need to find both α and P. Since an orthogonal transformation does not change the 2-norm, we should have
√(3² + 4²) = |α|   ⟹   α = ±5,
and
w = (x − y)/‖x − y‖₂ = (1/‖x − y‖₂) [ 3 − α,  0 − 0,  4 − 0 ]^T.
To avoid possible cancellation, we should choose the opposite sign, that is, α = −5, and then
w = [8, 0, 4]^T / √80.
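One can verify this numerically; the short MATLAB sketch below (illustrative only) builds the Householder matrix from w and applies it to x.

x = [3; 0; 4];
w = [8; 0; 4]/sqrt(80);        % unit vector parallel to x - y with alpha = -5
P = eye(3) - 2*(w*w');         % Householder matrix
Px = P*x                       % gives [-5; 0; 0] up to rounding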
The basic QR method:
Start from A₀ = A.
for k = 0, 1, · · · until convergence
    A_k = Q_k R_k
    A_{k+1} = R_k Q_k
end
• lim_{k→∞} A_k = R_A, the block upper triangular matrix
R_A = [ R₁₁  R₁₂  · · ·  R₁p ;   R₂₂  · · · ;   ⋱ ;   R_pp ].
Shifted QR method:
Start from A₀ = A.
for k = 0, 1, · · · until convergence
    A_k − σ_k I = Q_k R_k
    A_{k+1} = R_k Q_k + σ_k I
end
Stopping criterion:  max_{3≤i≤n, 1≤j≤i−2} |a_{ij}| < tol.
for example. The reason to choose P₁ so that it keeps the first row of A unchanged when we multiply P₁ from the left is to ensure that the first column of P₁A stays unchanged when we then multiply by P₁ from the right, so that those zeros will remain.
5.6 Exercises
1. Let w ∈ Rn , A = wT w, B = wwT .
In this chapter, we discuss numerical methods for solving arbitrary linear systems of equations.
From the first two equations, the solution should be x = 0; from the third equation, the solution should be x = 1; and from the last equation, the solution should be x = −2. In other words, we cannot find a single x that satisfies all the equations. Such a system of equations is called over-determined. In general, there is no classical solution to an over-determined system of equations, and we need to find the 'best' solution.
By the best solution we mean one that minimizes the error in some norm. Since the residual is computable, one approach is to minimize the 2-norm of the residual over all possible choices of x. The solution x* is the 'best solution' in the 2-norm if it satisfies
‖b − A x*‖₂ = min_{x ∈ Rⁿ} ‖b − A x‖₂.
If we use a different norm rather than the 2-norm, it will lead to a different algorithm and different applications, but the 2-norm is the one used the most and it is the simplest.
The above definition gives a way to compute the 'best solution' as the global minimum of a multi-variable function of the components x₁, x₂, · · · , xₙ. This can be seen from the following. Let
φ(x) = ‖Ax − b‖₂² = (Ax − b)^T (Ax − b) = x^T A^T A x − 2 b^T A x + b^T b,
where we have used b^T Ax = x^T A^T b. One can then easily compute the gradient of φ(x), which is
∇φ(x) = 2 ( A^T A x − A^T b ).
Setting ∇φ(x) = 0 leads to the normal equations
A^T A x = A^T b   (6.1.3)
whose unique solution is the least squares solution, which minimizes the 2-norm of the residual over all x ∈ Rⁿ.
The normal equation approach not only provides a numerical method, but also shows that the least squares solution is unique under the condition rank(A) = n, that is, the columns of A are linearly independent.
A serious problem with the normal equation approach is the possibly ill-conditioned system: note that if m = n, then cond₂(A^T A) = (cond₂(A))². A more accurate method is the QR approach for the least squares solution. Consider first a least squares problem of the special form
[ R ;  0 ] x = [ b₁ ;  b₂ ],
where R is an n by n upper triangular matrix. We have
‖ [R; 0] x − [b₁; b₂] ‖₂² = ‖R x − b₁‖₂² + ‖b₂‖₂²,
and, assuming R is nonsingular, the least squares solution is x = R^{-1} b₁ with residual ‖b₂‖₂.
Multiplying Ax = b by an orthogonal matrix does not change the 2-norm of the residual and hence keeps the least squares solution unchanged. From the process of the QR algorithm, we can apply a sequence of Householder matrices that reduce A to an upper triangular form
P_n P_{n−1} · · · P₁ A = [ R ;  0 ],
or QA = [ R ;  0 ] with Q = P_n P_{n−1} · · · P₁. Then from Ax = b, we get QAx = Qb, or
[ R ;  0 ] x = [ b̃₁ ;  b̃₂ ].
In practice, we can apply the Householder matrices directly to the augmented matrix [A | b], as in the Gaussian elimination method.
An example:
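(The worked example itself is not reproduced here.) As a generic illustration with made-up data, the following MATLAB sketch compares the normal-equation solution with QR-based least squares solutions for a simple line fit.

% Least squares fit of a line y ~ a0 + a1*t to four data points (illustrative data).
t = [0; 1; 2; 3];
y = [0; 0; 1; -2];
A = [ones(4,1), t];
x_normal = (A'*A)\(A'*y);      % normal equations
[Q, R]   = qr(A, 0);           % economy-size QR factorization
x_qr     = R\(Q'*y);           % QR least squares solution
x_bs     = A\y;                % MATLAB's built-in least squares solve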
Theorem 6.2.1 Given any matrix A ∈ C^{m×n}, there are two orthogonal matrices U ∈ C^{m×m}, UU^H = U^H U = I, and V ∈ C^{n×n}, VV^H = V^H V = I, such that
A = U Σ V^H,   where   Σ = diag(σ₁, σ₂, · · · , σ_p, 0, · · · , 0) ∈ R^{m×n}.   (6.2.1)
σ₁, σ₂, · · · , σ_p > 0 are called the singular values of A. Note that they are positive numbers and p = rank(A). Furthermore,
• σ_i = √(λ_i(A^H A)) = √(λ_i(A A^H)), the square roots of the non-zero eigenvalues of A^H A or A A^H.
Note that we often arrange the singular values so that σ₁ ≥ σ₂ ≥ · · · ≥ σ_p > 0. The proof below also gives a way to construct such a decomposition (it is constructive).
Proof: Let σ₁ = ‖A‖₂; then there is an x₁, ‖x₁‖₂ = 1, such that A^H A x₁ = σ₁² x₁. Let y₁ = A x₁/σ₁; we have ‖y₁‖₂ = ‖A x₁‖₂/σ₁ = 1.
Expand x₁ to an orthonormal basis to form V = [x₁, x₂, · · · , xₙ] = [x₁, V₁] with V^H V = V V^H = I, and similarly expand y₁ to form U = [y₁, U₁] with U^H U = U U^H = I. Then we have
U^H A V = [ y₁^H ;  U₁^H ] A [ x₁,  V₁ ] = [ σ₁  0 ;  0  U₁^H A V₁ ].
This is because y₁^H A x₁ = x₁^H A^H A x₁ / σ₁ = x₁^H x₁ σ₁² / σ₁ = σ₁; U₁^H A x₁ = σ₁ U₁^H y₁ = 0; and y₁^H A V₁ = (x₁^H A^H A / σ₁) V₁ = σ₁ x₁^H V₁ = 0. Thus, by the mathematical induction principle, we can continue this process on U₁^H A V₁ to get the SVD decomposition.
Pseudo-inverse of a matrix A
From the SVD decomposition of a matrix A, we can literally find the ’inverse’ of the matrix
to get its pseudo-inverse
A⁺ = V Σ⁺ U^H,   where   Σ⁺ = diag( 1/σ₁, 1/σ₂, · · · , 1/σ_p, 0, · · · , 0 ) ∈ R^{n×m}.   (6.2.2)
• A+ AA+ = A+ .
• If rank(A) = n, then x* = A⁺ b satisfies ‖A x* − b‖₂ = min_{x ∈ Rⁿ} ‖A x − b‖₂; that is, x* is the least squares solution.
• If there is more than one classical solution, then x* = A⁺ b is the one with minimal 2-norm, that is, ‖x*‖₂ = min { ‖x‖₂ : Ax = b }.
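A small MATLAB illustration (a sketch with made-up data, not from the text) of how the pseudo-inverse gives the minimum 2-norm solution of an under-determined system:

% Under-determined system: 1 equation, 3 unknowns; infinitely many solutions.
A = [1 2 2];
b = 3;
x_svd = pinv(A)*b;         % minimum 2-norm solution via the pseudo-inverse
% Any solution has the form x_svd + n with n in the null space of A;
% among all of them, x_svd has the smallest 2-norm.
norm(A*x_svd - b)          % essentially zero: x_svd solves the system exactly here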