Numerical Linear Algebra: Course Material Networkmaths Graduate Programme Maynooth 2010
Volker Mehrmann,
TU Berlin,
August 3, 2010
Literature
The following books are also useful to complement the material of these notes:
Chapter 0
Introduction
The main topics of Numerical Linear Algebra are the solution of different classes of eigenvalue
problems and linear systems.
For the eigenvalue problem we discuss different classes.
(a) The standard eigenvalue problem: For a real or complex matrix $A \in \mathbb{C}^{n,n}$, determine
$x \in \mathbb{C}^n$, $\lambda \in \mathbb{C}$, such that
\[ Ax = \lambda x. \]
The standard eigenvalue problem is a special case of the generalized eigenvalue problem:
For real or complex matrices $A, E \in \mathbb{C}^{m,n}$, determine $x \in \mathbb{C}^n$, $\lambda \in \mathbb{C}$, such that
\[ Ax = \lambda E x. \]
In many applications the coefficient matrices have extra properties such as being real and
symmetric or complex Hermitian.
For linear systems
\[ Ax = b, \quad x \in \mathbb{C}^n, \ b \in \mathbb{C}^m, \]
with $A \in \mathbb{C}^{m,n}$ we again may have extra properties for the coefficient matrices.
We will concentrate in this course on the numerical solution of standard and generalized
eigenvalue problems and the solution of linear systems. We will briefly review some of the
standard techniques for small scale problems and put an emphasis on large scale problems.
Applications: Eigenvalue problems arise in
the analysis of the spectra and energy levels of atoms and molecules (quantum mechan-
ics);
model reduction techniques, where a large scale model is reduced to a small scale model
by leaving out weakly important parts;
Linear systems arise in almost any area of science and engineering such as
(a) frequency response analysis for excited structures and vehicles;
(b) finite element methods or finite difference methods for ordinary and partial differential
equations;
We will distinguish between small and medium class problems, where the full matrices fit into
main memory (today these have sizes $n = 10^2$ to $10^5$), and large sparse problems, where the
coefficient matrices are stored in sparse formats and have sizes $n = 10^6$ and larger. We will
mainly discuss the case of complex matrices. Many results hold equally well in the real case,
but often the presentation becomes more clumsy. We will point out when the real case is
substantially different.
We will discuss the following algorithms.
        A small                          A large
EVP     QR-Algorithm, QZ-Algorithm       Lanczos, Arnoldi, Jacobi-Davidson
LS                                       CG, GMRES
Chapter 1
Matrix theory
1.1 Basics
1.1.1 Eigenvalues and Eigenvectors
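Let $A, E \in \mathbb{C}^{n,n}$, then $v \in \mathbb{C}^n \setminus \{0\}$ and $\lambda \in \mathbb{C}$ that satisfy
\[ Av = \lambda E v \]
are called eigenvector and eigenvalue of the pair $(E, A)$. In the special case that $E$ is the
$n \times n$ identity matrix $I_n$ ($I$) we have eigenvalues and eigenvectors of the standard eigenvalue
problem.
The sets
\[ \sigma(A) := \{\lambda \in \mathbb{C} \mid \lambda \text{ eigenvalue of } A\}, \]
\[ \sigma(E, A) := \{\lambda \in \mathbb{C} \mid \lambda \text{ eigenvalue of } (E, A)\} \]
\[ \|A\|_p := \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p} \]
is the matrix p-norm, $p \in \mathbb{N} \cup \{\infty\}$, and for invertible matrices $A$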
Special cases:
(a) $p = 1$; the column-sum norm:
\[ \|A\|_1 = \max_j \sum_{i=1}^m |a_{ij}| \]
(b) $p = \infty$; the row-sum norm:
with $A^H = \bar{A}^T$.
Convention:
(a) U is isometric;
(c) $\langle Ux, Uy \rangle = \langle x, y \rangle$ for all $x, y \in \mathbb{C}^k$ ($\langle \cdot, \cdot \rangle$: standard real or complex scalar product);
(e) $U U^H = I_n$;
(f) $U^{-1} = U^H$;
\[ \|U\| = 1 = \|U^{-1}\| = \kappa(U). \]
1.1.4 Subspaces
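Definition 3 A space $\mathcal{U} \subseteq \mathbb{C}^n$ is called subspace, if for all $x, y \in \mathcal{U}$, $\alpha \in \mathbb{C}$ we have
\[ x + y \in \mathcal{U}, \quad \alpha x \in \mathcal{U}. \]
Theorem 4 Let $\mathcal{U} \subseteq \mathbb{C}^n$ be a subspace with basis $(x_1, \dots, x_m)$ and $X = [x_1, \dots, x_m]$, i.e.
$\mathrm{Rank}(X) = m$.
(a) Then $\mathcal{U} = \mathcal{R}(X) := \{Xy \mid y \in \mathbb{C}^m\}$ (Range or column space of $X$).
(b) Let $Y \in \mathbb{C}^{n,m}$ with $\mathrm{Rank}(Y) = m$, then
\[ \mathcal{R}(X) = \mathcal{R}(Y) \iff X = YB, \quad B \in \mathbb{C}^{m,m}. \]
In particular then $B$ is invertible and
\[ XB^{-1} = Y. \]
(c) The Gram-Schmidt method for $(x_1, \dots, x_m)$ delivers an orthonormal basis $(q_1, \dots, q_m)$
of $\mathcal{U}$ with
$x \in \mathcal{U} \implies Ax \in \mathcal{U}$ for all $x \in \mathbb{C}^n$.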
Theorem 6 Let $A \in \mathbb{C}^{n,n}$, $X \in \mathbb{C}^{n,k}$ and $\mathcal{U} = \mathcal{R}(X)$. Then the following are equivalent:
(a) $\mathcal{U}$ is $A$-invariant;
(b) There exists $B \in \mathbb{C}^{k,k}$, such that:
\[ AX = XB. \]
Furthermore, in this case for $\lambda \in \mathbb{C}$ and $v \in \mathbb{C}^k$:
\[ Bv = \lambda v \implies AXv = \lambda Xv, \]
i.e., every eigenvalue of $B$ is also an eigenvalue of $A$.
Remark 7 If $A, X, B$ satisfy $AX = XB$ and if $X$ has only one column $x$, then $B$ is a scalar $\lambda$
and we obtain the eigenvalue equation
\[ Ax = \lambda x, \]
i.e., $X$ can be viewed as a generalization of the concept of eigenvector.
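The relation $AX = XB$ can be checked numerically. The following is a minimal sketch using NumPy; the test matrix, the random seed, and the names `S`, `X`, `B` are illustrative assumptions and not part of the notes. It builds a matrix with known eigenvalues, takes two eigenvectors as the columns of $X$, recovers $B$, and compares the eigenvalues of $B$ with those of $A$.

```python
import numpy as np

# Illustrative test data: a matrix with known eigenvalues 1, ..., 5.
rng = np.random.default_rng(0)
S = rng.standard_normal((5, 5))
A = S @ np.diag([1.0, 2.0, 3.0, 4.0, 5.0]) @ np.linalg.inv(S)

# The first two columns of S span the invariant subspace for the eigenvalues 1 and 2.
X = S[:, :2]

# Solve A X = X B for B; the least-squares solution is exact because R(X) is A-invariant.
B = np.linalg.lstsq(X, A @ X, rcond=None)[0]

residual = np.linalg.norm(A @ X - X @ B)
eigs_B = np.sort(np.linalg.eigvals(B).real)
print(residual, eigs_B)
```

The eigenvalues of the small matrix `B` reproduce the two eigenvalues of `A` that belong to the chosen subspace.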
1.2 Matrix decompositions
1.2.1 Schur decomposition
Theorem 8 (Schur, 1909)
Let $A \in \mathbb{C}^{n,n}$. Then there exists $U \in \mathbb{C}^{n,n}$ unitary such that
\[ T := U^H A U \]
is upper triangular.
Remark 9 In the Schur decomposition $U$ can be chosen such that the eigenvalues of $A$ appear
in arbitrary order on the diagonal.
\[ T = Q^T A Q \]
is quasi-upper triangular.
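Proof: The proof is similar to that of the Theorem of Schur: If $A$ has a real eigenvector
then we can proceed with the induction as in the complex case. Otherwise for a complex
eigenvector $v = v_1 + i v_2$ to the complex eigenvalue $\lambda = \lambda_1 + i \lambda_2$, with $v_1, v_2 \in \mathbb{R}^n$ and
$\lambda_1, \lambda_2 \in \mathbb{R}$, $\lambda_2 \neq 0$ we have from
\[ A(v_1 + i v_2) = (\lambda_1 + i \lambda_2)(v_1 + i v_2) \]
that
\[ \begin{aligned} Av_1 &= \lambda_1 v_1 - \lambda_2 v_2 \\ Av_2 &= \lambda_1 v_2 + \lambda_2 v_1 \end{aligned} \quad \implies \quad A [v_1 \ v_2] = [v_1 \ v_2] \begin{bmatrix} \lambda_1 & \lambda_2 \\ -\lambda_2 & \lambda_1 \end{bmatrix}. \]
Hence $\mathrm{Span}\{v_1, v_2\}$ is an $A$-invariant subspace. Let $(q_1, q_2)$ be an orthonormal basis of
$\mathrm{Span}\{v_1, v_2\}$ and let $Q = [q_1, q_2, q_3, \dots, q_n]$ be orthogonal, then
\[ Q^H A Q = \begin{bmatrix} \begin{matrix} \lambda_1 & \lambda_2 \\ -\lambda_2 & \lambda_1 \end{matrix} & A_{12} \\ 0 & A_{22} \end{bmatrix}. \]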
Example 13 Hermitian matrices, skew-Hermitian matrices and unitary matrices are normal.
then $A_{12} = 0$.
Proof: Exercise. □
Corollary 15 If $A \in \mathbb{C}^{n,n}$ is normal, then there exists $U \in \mathbb{C}^{n,n}$ unitary, such that $U^H A U$
is diagonal.
Proof: Exercise. □
Theorem 16 (Generalized Schur form) Let $A, E \in \mathbb{C}^{n,n}$ be such that the pair $(E, A)$ is
regular, i.e., $\det(\lambda E - A) \neq 0$ for some $\lambda \in \mathbb{C}$. Then there exist $U, V \in \mathbb{C}^{n,n}$ unitary such that
\[ S := U^H E V, \quad T := U^H A V \]
are upper triangular.
Proof: Let $(E_k)$ be a sequence of nonsingular matrices that converges to $E$. For every $k$ let
\[ Q_k^H A E_k^{-1} Q_k = T_k \]
be a Schur decomposition of $A E_k^{-1}$ and let $Z_k^H (E_k^{-1} Q_k) = S_k^{-1}$ be a QR decomposition. Then
both $Q_k^H A Z_k = T_k S_k$ and $Q_k^H E_k Z_k = S_k$ are upper triangular.
Using the Bolzano-Weierstraß theorem it follows that the bounded sequence $(Q_k, Z_k)$ has a
converging subsequence with a limit $(Q, Z)$, where $Q, Z$ are unitary. Then $Q^H A Z = T$ and
$Q^H E Z = S$ are upper triangular. □
1.2.2 The singular value decomposition (SVD)
Theorem 17 (Singular value decomposition, SVD)
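Let $A \in \mathbb{C}^{m,n}$ with $\mathrm{Rank}(A) = r$. Then there exist unitary matrices $U \in \mathbb{C}^{m,m}$ and $V \in \mathbb{C}^{n,n}$
such that
\[ A = U \Sigma V^H, \quad \Sigma = \begin{bmatrix} \begin{matrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{matrix} & 0 \\ 0 & 0 \end{bmatrix} \in \mathbb{C}^{m,n}. \]
Furthermore, $\sigma_1 = \|A\|_2$ and $\sigma_1, \dots, \sigma_r$ are uniquely determined.
Proof: Exercise. □
\[ A^H A = V \Sigma^H U^H U \Sigma V^H = V \Sigma^H \Sigma V^H = V \Sigma^2 V^H \]
and
\[ A A^H = U \Sigma V^H V \Sigma^H U^H = U \Sigma \Sigma^H U^H = U \Sigma^2 U^H, \]
i.e. $\sigma_1^2, \dots, \sigma_r^2$ are the nonzero eigenvalues of $AA^H$ and $A^H A$, respectively.
(b) Since $AV = U\Sigma$ one has $\mathrm{Kernel}(A) = \mathrm{Span}\{v_{r+1}, \dots, v_n\}$ and $\mathrm{Image}(A) = \mathcal{R}(A) =
\mathrm{Span}\{u_1, \dots, u_r\}$.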
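The statements above can be illustrated numerically. The following is a small sketch with NumPy; the random test matrix and the variable names are assumptions made for illustration. It checks the factorization $A = U \Sigma V^H$, the relation between the singular values and the eigenvalues of $A^H A$, and $\sigma_1 = \|A\|_2$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4)) + 1j * rng.standard_normal((6, 4))

# Full SVD: U is 6x6, Vh = V^H is 4x4, s holds the singular values in descending order.
U, s, Vh = np.linalg.svd(A)
Sigma = np.zeros((6, 4))
Sigma[:4, :4] = np.diag(s)

recon_err = np.linalg.norm(A - U @ Sigma @ Vh)

# Eigenvalues of the Hermitian matrix A^H A, sorted in descending order.
eig_AhA = np.sort(np.linalg.eigvalsh(A.conj().T @ A))[::-1]

spec_norm = np.linalg.norm(A, 2)
print(recon_err, eig_AhA - s**2, spec_norm - s[0])
```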
is the best rank approximation to A in the sense that
where $\sigma_{r+1} := 0$.
Then $\langle x_1, y_1 \rangle$ is real, $\theta_1 = \arccos \langle x_1, y_1 \rangle$ is called first canonical angle and $x_1, y_1$ are
called first canonical vectors.
(b) Suppose that we have determined $j - 1$ canonical angles and vectors, i.e.,
\[ x_1, \dots, x_{j-1} \in \mathcal{U}, \quad y_1, \dots, y_{j-1} \in \mathcal{V} \]
\[ \theta_j := \arccos \langle x_j, y_j \rangle \]
is the $j$-th canonical angle, and $x_j, y_j$ are $j$-th canonical vectors. Proceeding inductively
we obtain $k$ canonical angles $0 \le \theta_1 \le \dots \le \theta_k \le \frac{\pi}{2}$ and orthonormal bases $(x_1, \dots, x_k)$
and $(y_1, \dots, y_k)$ of $\mathcal{U}, \mathcal{V}$, respectively.
Lemma 20 For $i, j = 1, \dots, k$ and $i \neq j$ the canonical vectors satisfy $\langle x_i, y_j \rangle = 0$.
Proof: Exercise. □
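\[ \mathcal{R}(P) = \mathcal{U}, \quad \mathcal{R}(Q) = \mathcal{V}. \]
\[ P^H Q = U \Sigma V^H \]
with the diagonal matrix
\[ \Sigma = \underbrace{U^H P^H}_{X^H} \underbrace{Q V}_{Y} \]
(b)
\[ d(\mathcal{U}, \mathcal{V}) := \max_{x \in \mathcal{U}, \ \|x\| = 1} d(x, \mathcal{V}) \]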
Proof: See Stewart/Sun. Matrix perturbation theory. Boston, 1990. □
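Lemma 24 Let
\[ \mathcal{U} = \mathcal{R}\left( \begin{bmatrix} I_m \\ 0 \end{bmatrix} \right) \quad \text{and} \quad \tilde{\mathcal{U}} = \mathcal{R}\left( \begin{bmatrix} I_m \\ X \end{bmatrix} \right), \]
with $X \in \mathbb{C}^{(n-m) \times m}$ be $m$-dimensional subspaces of $\mathbb{C}^n$ and let $\theta_1, \dots, \theta_m$ be the canonical
angles between $\mathcal{U}$ and $\tilde{\mathcal{U}}$. Then
\[ \tan \theta_1, \dots, \tan \theta_m \]
are the singular values of $X$, in particular then $\|X\| = \tan \theta_m$.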
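Lemma 24 can be checked numerically. The sketch below uses NumPy and relies on the standard fact, assumed here since the corresponding statement is not reproduced above, that the cosines of the canonical angles are the singular values of $P^H Q$ for orthonormal bases $P$, $Q$ of the two subspaces. The dimensions and the random matrix `X` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 7, 3
X = rng.standard_normal((n - m, m))

# Orthonormal bases of U = R([I; 0]) and of the second subspace R([I; X]).
P = np.vstack([np.eye(m), np.zeros((n - m, m))])
Q, _ = np.linalg.qr(np.vstack([np.eye(m), X]))

# Assumption: cosines of the canonical angles = singular values of P^H Q.
cosines = np.clip(np.linalg.svd(P.conj().T @ Q, compute_uv=False), 0.0, 1.0)
angles = np.sort(np.arccos(cosines))                  # theta_1 <= ... <= theta_m

sv_X = np.sort(np.linalg.svd(X, compute_uv=False))    # singular values of X, ascending
print(np.tan(angles), sv_X)
```

The tangents of the computed angles agree with the singular values of `X`, and the largest one equals the spectral norm of `X`.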
Chapter 2
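Situation: $A \in \mathbb{C}^{n,n}$, where $n$ is small enough so that the matrix $A$ can be fully stored and
that we can manipulate the whole matrix by similarity transformations.
\[ q = c_1 v_1 + \dots + c_n v_n. \]
\[ Aq = c_1 \lambda_1 v_1 + \dots + c_n \lambda_n v_n, \]
\[ A^k q = c_1 \lambda_1^k v_1 + \dots + c_n \lambda_n^k v_n. \]
\[ \left\| \frac{1}{\lambda_1^k} A^k q - c_1 v_1 \right\| \le |c_2| \left| \frac{\lambda_2}{\lambda_1} \right|^k \|v_2\| + \dots + |c_n| \left| \frac{\lambda_n}{\lambda_1} \right|^k \|v_n\| \le \left| \frac{\lambda_2}{\lambda_1} \right|^k \left( |c_2| \|v_2\| + \dots + |c_n| \|v_n\| \right) \to 0, \]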
Definition 25 A sequence $(x_k)$ converges linearly to $x$, if there exists $r$ with $0 < r < 1$ such
that
\[ \lim_{k \to \infty} \frac{\|x_{k+1} - x\|}{\|x_k - x\|} = r. \]
\[ \lim_{k \to \infty} \frac{\|x_{k+1} - x\|}{\|x_k - x\|^m} = c \neq 0. \]
In practice we do not know $1/\lambda_1^k$, thus we normalize differently and divide by the largest (in
modulus) component of $A^k q$.
Algorithm: (Power method)
Computes the dominant eigenvalue $\lambda_1$ and the associated eigenvector $v_1$.
The power method can also be used for large scale problems where only matrix-vector multiplication
is available. It computes only one eigenvalue and eigenvector and is for example
used in the Google page rank method. By the presented analysis we have proved the following
theorem.
(b) Forming the full products $Aq_k$ costs $2n^2$ flops and the scaling $O(n)$ flops. Hence $m$
iterations will cost $2n^2 m$ flops.
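Since the algorithm steps are not reproduced above, the following is a minimal sketch of the power method as described in the text (normalization by the component of largest modulus), written with NumPy. The function name `power_method`, the fixed iteration count, and the 2x2 test matrix are illustrative assumptions.

```python
import numpy as np

def power_method(A, q0, num_iter=200):
    """Power iteration; each iterate is divided by its largest-modulus component."""
    q = np.asarray(q0, dtype=float)
    lam = 0.0
    for _ in range(num_iter):
        z = A @ q                          # one matrix-vector product per step
        lam = z[np.argmax(np.abs(z))]      # component of largest modulus
        q = z / lam                        # scaled so that this component equals 1
    return lam, q

A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, v = power_method(A, np.array([1.0, 1.0]))
print(lam, v)
```

For this test matrix the dominant eigenvalue is $(5 + \sqrt{5})/2$, and the scaling factor `lam` converges to it because the normalized iterate has a component equal to 1 in the same position.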
2.2 Shift-and-Invert and Rayleigh-Quotient-Iteration
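Observations: Let $A \in \mathbb{C}^{n,n}$ and $(\lambda, v) \in \mathbb{C} \times \mathbb{C}^n$ with $Av = \lambda v$. Then
(a) $A^{-1} v = \lambda^{-1} v$ for $A$ invertible, and
(b) $(A - \varrho I) v = (\lambda - \varrho) v$ for all $\varrho \in \mathbb{C}$.
If $\lambda_1, \dots, \lambda_n$ are again the eigenvalues of $A$ with $|\lambda_1| \ge \dots \ge |\lambda_n|$, then we can perform the
following iterations.
Inverse Iteration This is the power method applied to $A^{-1}$. If $|\lambda_n| < |\lambda_{n-1}|$, then the
inverse iteration converges to an eigenvector to $\lambda_n$ with convergence rate $\left| \frac{\lambda_n}{\lambda_{n-1}} \right|$ (which is
small if $|\lambda_n| \ll |\lambda_{n-1}|$).
Shift and Invert Power Method This is the power method applied to $(A - \varrho I)^{-1}$. Let
$\lambda_j, \lambda_k$ be the eigenvalues that are closest to $\varrho$, and suppose that $|\lambda_j - \varrho| < |\lambda_k - \varrho|$. Then the
power method for $(A - \varrho I)^{-1}$ converges to an eigenvector associated with $\lambda_j$ with rate
\[ \left| \frac{\lambda_j - \varrho}{\lambda_k - \varrho} \right|. \]
The idea to determine a good eigenvalue approximation (shift) from a given eigenvector
approximation $w$ is to minimize the residual $\|Aw - \mu w\|$. Consider the over-determined linear
system
\[ w \mu = Aw \]
with the $n \times 1$ matrix $w$, the unknown vector $\mu \in \mathbb{C}$ and the right hand side $Aw$. We can use the
normal equations to solve $\|Aw - \mu w\| = \min!$, i.e., we use
\[ w^H w \, \mu = w^H A w \quad \text{respectively} \quad \mu = \frac{w^H A w}{w^H w}. \]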
Definition 30 Let $A \in \mathbb{C}^{n,n}$ and $w \in \mathbb{C}^n \setminus \{0\}$. Then
\[ r(w) := \frac{w^H A w}{w^H w} \]
is called the Rayleigh-quotient of $w$ with respect to $A$.
The following theorem gives an estimate for the distance of the Rayleigh-quotient from an
eigenvalue.
This suggests an iteration that alternates between computing an approximate eigenvector and,
from this, a Rayleigh-quotient, i.e., the following algorithm:
Algorithm: Rayleigh-Quotient-Iteration (RQI)
This algorithm computes an eigenvalue/eigenvector pair $(\lambda, v) \in \mathbb{C} \times \mathbb{C}^n$ of the matrix
$A \in \mathbb{C}^{n,n}$.
Remark 32 (a) It is difficult to analyze the convergence of this algorithm but one observes
practically that it almost always converges. The convergence rate is typically quadratic.
For Hermitian matrices A = AH there is more analysis and one can even show cubic
convergence.
(b) The Rayleigh-quotient iteration can also be applied to very large matrices provided that
a linear system solver is available (we get back to this later).
(c) Costs: $O(n^3)$ flops per step if the linear system is solved with full Gaussian elimination.
The costs are $O(n^2)$ for Hessenberg matrices (see Chapter 2.4.2) and they can be even
smaller for banded or other sparse matrices.
(d) $A - \lambda_{k-1} I$ is almost singular, i.e., if $\lambda_{k-1}$ is close to an eigenvalue, then linear systems
with $A - \lambda_{k-1} I$ are generally ill-conditioned. But if we use a backward stable method
then we get a small backward error, i.e.,
\[ (A + \Delta A - \lambda_{k-1} I) x = q_{k-1} \]
with $\|\Delta A\|$ small. Thus we can expect good results only if the eigenvalue/eigenvector
computation is well conditioned.
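The steps of the Rayleigh-quotient iteration are not reproduced above, so the following NumPy sketch is an assumption about their usual form: solve a shifted system with the current Rayleigh-quotient as shift, normalize, and update the shift. The function name, the stopping tolerance, and the symmetric test matrix are illustrative.

```python
import numpy as np

def rayleigh_quotient_iteration(A, q0, num_iter=10, tol=1e-12):
    """RQI sketch: shift-and-invert step with the Rayleigh-quotient as shift."""
    n = A.shape[0]
    q = q0 / np.linalg.norm(q0)
    lam = np.vdot(q, A @ q)                      # r(q) for a unit vector q
    for _ in range(num_iter):
        if np.linalg.norm(A @ q - lam * q) < tol:
            break                                # residual small: accept the pair
        try:
            z = np.linalg.solve(A - lam * np.eye(n), q)
        except np.linalg.LinAlgError:
            break                                # shift hit an eigenvalue exactly
        q = z / np.linalg.norm(z)
        lam = np.vdot(q, A @ q)
    return lam, q

A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]])
lam, v = rayleigh_quotient_iteration(A, np.array([1.0, 1.0, 1.0]))
print(lam, np.linalg.norm(A @ v - lam * v))
```

Each step costs one dense solve, which matches the $O(n^3)$ cost per step stated in (c); for Hessenberg or banded matrices a structured solver would replace `np.linalg.solve`.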
Conditioning of eigenvalues: if $\lambda$ is a simple eigenvalue of $A$, i.e., the algebraic multiplicity
is 1 and if $v, w$ are normalized right and left eigenvectors, i.e.,
\[ Av = \lambda v, \quad w^H A = \lambda w^H, \quad \|v\| = 1 = \|w\|, \]
then for small perturbations $\Delta A$ we have in first approximation that $A + \Delta A$ has an eigenvalue
$\lambda + \Delta\lambda$ with
\[ |\Delta\lambda| \lesssim \frac{1}{|w^H v|} \|\Delta A\|. \]
We then have that $1/|w^H v|$ is a condition number for simple eigenvalues. For normal matrices
we have $v = w$ and thus $|w^H v| = 1$. Normal matrices thus have well-conditioned eigenvalues.
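The condition number $1/|w^H v|$ can be computed for small examples. The sketch below uses NumPy and assumes a diagonalizable matrix, so that the left eigenvectors can be read off from the inverse of the right eigenvector matrix; the function name and both test matrices are illustrative.

```python
import numpy as np

def eigenvalue_condition_numbers(A):
    """Return the eigenvalues and 1/|w^H v| for unit right/left eigenvectors."""
    lam, V = np.linalg.eig(A)                 # columns of V: unit right eigenvectors
    W = np.linalg.inv(V).conj().T             # columns of W: left eigenvectors
    W = W / np.linalg.norm(W, axis=0)         # normalize the left eigenvectors
    s = np.abs(np.sum(W.conj() * V, axis=0))  # |w_i^H v_i| for each eigenvalue
    return lam, 1.0 / s

A_normal = np.array([[2.0, 1.0], [1.0, 2.0]])        # symmetric, hence normal
A_nonnormal = np.array([[1.0, 1000.0], [0.0, 2.0]])  # strongly non-normal

_, cond_normal = eigenvalue_condition_numbers(A_normal)
_, cond_nonnormal = eigenvalue_condition_numbers(A_nonnormal)
print(cond_normal, cond_nonnormal)
```

For the symmetric matrix both condition numbers are 1, while for the triangular matrix with the large off-diagonal entry they are of order $10^3$.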
In general, we expect $\mathcal{R}(W_k)$ to converge to the invariant subspace $\mathcal{U}$ associated with the $m$
eigenvalues $\lambda_1, \dots, \lambda_m$. This iteration is called subspace iteration.
\[ |\lambda_1| \ge \dots \ge |\lambda_m| > |\lambda_{m+1}| \ge \dots \ge |\lambda_n|. \]
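Proof: We prove the theorem for the case that $A$ is diagonalizable. We perform a similarity
transformation
\[ A_{new} = S^{-1} A_{old} S = \begin{bmatrix} A_1 & 0 \\ 0 & A_2 \end{bmatrix}, \]
with $A_1 = \mathrm{diag}(\lambda_1, \dots, \lambda_m)$ and $A_2 = \mathrm{diag}(\lambda_{m+1}, \dots, \lambda_n)$. Then $A_1$ is nonsingular, since
$|\lambda_1| \ge \dots \ge |\lambda_m| > 0$. Set
\[ \mathcal{U}_{new} = S^{-1} \mathcal{U}_{old} = \mathcal{R}\left( \begin{bmatrix} I_m \\ 0 \end{bmatrix} \right) \]
and
\[ \mathcal{V}_{new} = S^{-1} \mathcal{V}_{old} = \mathcal{R}\left( \begin{bmatrix} 0 \\ I_{n-m} \end{bmatrix} \right). \]
Furthermore, let
\[ W_{new} = S^{-1} W_{old} = \begin{bmatrix} Z_1 \\ Z_2 \end{bmatrix} \]
for some $Z_1 \in \mathbb{C}^{m,m}$ and $Z_2 \in \mathbb{C}^{n-m,m}$. Then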
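and thus,
\[ \mathcal{R}(W_k) = \mathcal{R}\left( \begin{bmatrix} I_m \\ X_k \end{bmatrix} \right). \]
It remains to show that
\[ d(\mathcal{R}(W_k), \mathcal{U}) \to 0. \]
Let $\theta_m^{(k)}$ be the largest canonical angle between $\mathcal{R}(W_k)$ and $\mathcal{U}$. Then
\[ d(\mathcal{R}(W_k), \mathcal{U}) = \sin \theta_m^{(k)} \le \tan \theta_m^{(k)} = \|X_k\| \le \|A_2^k\| \, \|X_0\| \, \|A_1^{-k}\| = |\lambda_{m+1}^k| \, \|X_0\| \, |\lambda_m^{-k}|, \]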
Remark 34 For $W_0 = [w_1, \dots, w_m]$ we have
\[ A^k W_0 = [A^k w_1, \dots, A^k w_m], \]
i.e., we perform the iteration not only for $W_0$ but simultaneously also for all $W_0^{(j)} = [w_1, \dots, w_j]$,
since
\[ A^k W_0^{(j)} = [A^k w_1, \dots, A^k w_j]. \]
to the invariant subspace associated with $\lambda_1, \dots, \lambda_j$ for all $j = 1, \dots, m$. For this reason one
often speaks of simultaneous subspace iteration.
Practice: Unfortunately, in finite precision arithmetic rounding errors lead to linear dependence
in $\mathcal{R}(W_k)$ already after few iterations.
The basic idea to cope with this problem is to orthonormalize the columns in every step.
isometric and
\[ \mathrm{Span}\{A^{k-1} w_1, \dots, A^{k-1} w_j\} = \mathrm{Span}\{q_1^{(k-1)}, \dots, q_j^{(k-1)}\}, \quad j = 1, \dots, m. \]
and moreover
\[ \mathrm{Span}\{A^k w_1, \dots, A^k w_j\} = \mathrm{Span}\{q_1^{(k)}, \dots, q_j^{(k)}\}, \quad j = 1, \dots, m. \]
(a) Start: Choose $Q_0 \in \mathbb{C}^{n,m}$ isometric.
Remark 35 Theoretically the convergence behavior of the unitary subspace iteration is as for
the subspace iteration but the described problems in finite precision arithmetic do not arise.
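A minimal sketch of the unitary subspace iteration with NumPy follows. It assumes the step $A Q_{k-1} = Q_k R_k$ (a QR decomposition in every iteration); the function name, the iteration count, and the symmetric test matrix with prescribed eigenvalues are illustrative choices.

```python
import numpy as np

def unitary_subspace_iteration(A, Q0, num_iter=300):
    """Repeat A Q_{k-1} = Q_k R_k, re-orthonormalizing the basis in every step."""
    Q = Q0
    for _ in range(num_iter):
        Q, _ = np.linalg.qr(A @ Q)
    return Q

rng = np.random.default_rng(3)
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = V @ np.diag([10.0, 8.0, 3.0, 2.0, 1.0]) @ V.T   # symmetric, known eigenvalues

Q0, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # isometric start matrix
Q = unitary_subspace_iteration(A, Q0)

ritz = np.sort(np.linalg.eigvalsh(Q.T @ A @ Q))      # eigenvalues of the 2x2 projection
inv_res = np.linalg.norm(A @ Q - Q @ (Q.T @ A @ Q))  # invariance residual
print(ritz, inv_res)
```

With $m = 2$ the range of `Q` approaches the invariant subspace for the eigenvalues 10 and 8; the gap ratio $3/8$ governs the speed.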
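\[ A_k = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \]
(with diagonal blocks of sizes $m$ and $n - m$) the block $A_{21}$ converges to 0 for $k \to \infty$. Since this happens for all $m$ simultaneously, it
follows that $A_k$ converges to a block-upper triangular matrix.
Another question is whether we can directly move from $A_{k-1}$ to $A_k$? To see this, observe that
\[ A_{k-1} = Q_{k-1}^{-1} A Q_{k-1}, \]
\[ A_k = Q_k^{-1} A Q_k \]
and hence
\[ A_k = \underbrace{Q_k^{-1} Q_{k-1}}_{= U_k^{-1}} A_{k-1} \underbrace{Q_{k-1}^{-1} Q_k}_{=: U_k} = U_k^{-1} A_{k-1} U_k. \]
Thus we can reformulate the $k$-th step of the unitary subspace iteration
\[ A Q_{k-1} = Q_k R_k \]
as
\[ A_{k-1} = Q_{k-1}^{-1} A Q_{k-1} = Q_{k-1}^{-1} Q_k R_k = U_k R_k. \]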
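This is a QR decomposition of $A_{k-1}$ and we have
and
\[ A_k = \begin{bmatrix} A_{11}^{(k)} & A_{12}^{(k)} \\ A_{21}^{(k)} & A_{22}^{(k)} \end{bmatrix} \]
(with diagonal blocks of sizes $m$ and $n - m$), then for every $\varrho$ with $\left| \frac{\lambda_{m+1}}{\lambda_m} \right| < \varrho < 1$ there exists a constant $c$ such that
\[ \|A_{21}^{(k)}\| \le c \varrho^k. \]
where
\[ Q_k = [q_1^{(k)}, \dots, q_n^{(k)}] \]
In the special case that $A$ is Hermitian, the sequence $A_k$ converges to a diagonal matrix.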
Remark 37 In the presented form the algorithm has two major disadvantages:
A way to address the two problems is the Hessenberg reduction and the use of shifts.
Householder transformations
The QR decomposition and other unitary transformations can be realized via Householder
transformations
\[ P = I - \frac{2}{v^H v} v v^H \]
for $v \in \mathbb{C}^n \setminus \{0\}$. Householder transformations are Hermitian and unitary. (Exercise) Multiplication
with Householder transformations is geometrically a reflection of a vector $x \in \mathbb{C}^n$
at the hyperplane $\mathrm{Span}\{v\}^{\perp}$. (Exercise.)
A typical task: Reflect $x \in \mathbb{C}^n \setminus \{0\}$ to a multiple of the first unit vector, i.e., determine $v$
and from this $P$ such that
\[ P x = \pm \|x\| e_1. \]
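For real vectors this task has a well-known solution, sketched below with NumPy. The choice $v = x + \mathrm{sign}(x_1) \|x\| e_1$, which avoids cancellation, and the function name are assumptions for illustration; the complex case needs a phase factor instead of a sign.

```python
import numpy as np

def householder_vector(x):
    """Return v with (I - 2 v v^T / v^T v) x = -sign(x_1) ||x|| e_1 for real x."""
    v = np.array(x, dtype=float)
    sign = 1.0 if v[0] >= 0 else -1.0
    v[0] += sign * np.linalg.norm(x)     # same sign as x_1, so no cancellation
    return v

x = np.array([3.0, 4.0, 0.0])
v = householder_vector(x)
P = np.eye(3) - 2.0 * np.outer(v, v) / (v @ v)
Px = P @ x
print(Px)   # a multiple of e_1 with modulus ||x|| = 5
```

In practice $P$ is never formed explicitly; one applies $Px = x - \frac{2 v^H x}{v^H v} v$ directly, which costs $O(n)$ flops per vector.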
Another tool for unitary operations are Givens rotations
\[ G_{i,j}(c, s) = \begin{bmatrix} I & & & & \\ & c & & s & \\ & & I & & \\ & -\bar{s} & & \bar{c} & \\ & & & & I \end{bmatrix}, \]
where $|c|^2 + |s|^2 = 1$ and where the matrix differs from an identity only in positions $(i, i), (i, j), (j, i), (j, j)$.
Multiplication of a matrix with a Givens rotation allows to zero an element in any position.
E.g. choose
\[ G_{1,2}(c, s) = \begin{bmatrix} c & s \\ -\bar{s} & \bar{c} \end{bmatrix} \]
(a) For a Hessenberg matrix $H \in \mathbb{C}^{n,n}$ the QR decomposition can be performed in $O(n^2)$
flops using Givens rotations.
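A sketch of this Hessenberg QR decomposition for real matrices, using NumPy, is given below. Each of the $n - 1$ rotations touches only two rows of $R$ and two columns of $Q$, which gives the $O(n^2)$ total. The function name and the random test matrix are illustrative assumptions.

```python
import numpy as np

def hessenberg_qr(H):
    """QR decomposition of a real upper Hessenberg matrix by n-1 Givens rotations."""
    n = H.shape[0]
    R = np.array(H, dtype=float)
    Q = np.eye(n)
    for i in range(n - 1):
        a, b = R[i, i], R[i + 1, i]
        r = np.hypot(a, b)
        if r == 0.0:
            continue                               # nothing to eliminate
        c, s = a / r, b / r
        G = np.array([[c, s], [-s, c]])            # maps (a, b) to (r, 0)
        R[i:i + 2, i:] = G @ R[i:i + 2, i:]        # O(n) work on two rows
        Q[:, i:i + 2] = Q[:, i:i + 2] @ G.T        # accumulate Q = G_1^T ... G_{n-1}^T
    return Q, R

rng = np.random.default_rng(5)
H = np.triu(rng.standard_normal((5, 5)), -1)       # upper Hessenberg test matrix
Q, R = hessenberg_qr(H)
print(np.linalg.norm(Q @ R - H))
```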
\[ q_i = c_i u_i \]
for $c_i \in \mathbb{C}$ with $|c_i| = 1$ and $|h_{i,i-1}| = |g_{i,i-1}|$ for $i = 2, \dots, n$, i.e., $Q$ is determined already
essentially uniquely by $q_1$.
Proof: Exercise. □
2.4.3 The Francis QR Algorithm with Shifts
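Deflation: Let $H \in \mathbb{C}^{n,n}$ be in Hessenberg form. If $H$ is not unreduced, i.e., if $h_{m+1,m} = 0$
for some $m$, then
\[ H = \begin{bmatrix} H_{11} & H_{12} \\ 0 & H_{22} \end{bmatrix} \]
(with diagonal blocks of sizes $m$ and $n - m$), i.e., we can split our problem into two subproblems $H_{11}, H_{22}$.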
Algorithm (QR Algorithm with Hessenberg reduction and shifts)
Given: $A \in \mathbb{C}^{n,n}$:
(a) Compute $U_0$ unitary such that
\[ H_0 := U_0^H A U_0 \]
is in Hessenberg form. We may assume that $H_0$ is unreduced, otherwise we can deflate
right away.
(b) Iterate for $k = 1, 2, \dots$ until deflation happens, i.e.,
\[ h_{m+1,m}^{(k)} = O(\mathrm{eps}) (|h_{m,m}| + |h_{m+1,m+1}|) \]
for some $m$ and the machine precision eps.
(i) Choose shift $\mu_k \in \mathbb{C}$.
(ii) Compute a QR decomposition $H_{k-1} - \mu_k I = Q_k R_k$ of $H_{k-1} - \mu_k I$.
(iii) Form $H_k = R_k Q_k + \mu_k I$.
Remark 40 (a) Steps (ii) and (iii) of this algorithm correspond to a QR iteration step for
$H_0 - \mu_k I$.
(b) The sub-diagonal entry $h_{m+1,m}^{(k)}$ in $H_k$ converges with rate $\left| \frac{\lambda_{m+1} - \mu_k}{\lambda_m - \mu_k} \right|$ to 0.
(c) If $h_{m+1,m}^{(k)} = 0$ or $h_{m+1,m}^{(k)} = O(\mathrm{eps})$, then we have deflation and we can continue with
smaller problems.
(d) If $\mu_k$ is an eigenvalue then deflation happens immediately after one step.
Shift strategies:
(a) Rayleigh-quotient shift: For the special case that $A$ is Hermitian, the sequence $A_k$
converges to a diagonal matrix. Then $q_n^{(k)}$ is a good approximation to an eigenvector
and a good approximation to the eigenvalue is the Rayleigh-quotient
\[ r(q_n^{(k)}) = (q_n^{(k)})^H A q_n^{(k)} \]
which is just the $n$-th diagonal entry $a_{n,n}^{(k)}$ of $Q_k^H A Q_k$.
Heuristic: We expect in general $a_{n,n}^{(k)}$ to be a good approximation to an eigenvalue and
therefore may choose
\[ \mu_k = a_{n,n}^{(k)}. \]
With this choice $h_{n,n-1}^{(k)}$ typically converges quadratically to 0.
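The shifted QR step with the Rayleigh-quotient shift can be sketched as follows with NumPy. This is a simplified illustration, not the full algorithm: it only watches the last sub-diagonal entry, uses a dense QR decomposition in each step, and stops at the first deflation. The function name, tolerance, and the symmetric tridiagonal test matrix are assumptions.

```python
import numpy as np

def qr_rayleigh_shift(H, tol=1e-14, max_iter=100):
    """Shifted QR steps with mu_k = h_nn until h_{n,n-1} is negligible."""
    H = np.array(H, dtype=float)
    n = H.shape[0]
    I = np.eye(n)
    for k in range(max_iter):
        # Deflation criterion from step (b) for m = n - 1.
        if abs(H[n - 1, n - 2]) <= tol * (abs(H[n - 2, n - 2]) + abs(H[n - 1, n - 1])):
            return H, k
        mu = H[n - 1, n - 1]                 # Rayleigh-quotient shift
        Q, R = np.linalg.qr(H - mu * I)      # step (ii)
        H = R @ Q + mu * I                   # step (iii)
    return H, max_iter

T = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 3.0, 1.0, 0.0],
              [0.0, 1.0, 4.0, 1.0],
              [0.0, 0.0, 1.0, 5.0]])
H_final, steps = qr_rayleigh_shift(T)
print(steps, H_final[-1, -1], abs(H_final[-1, -2]))
```

After deflation the entry `H_final[-1, -1]` is an eigenvalue approximation, and the iteration would continue on the leading $(n-1) \times (n-1)$ block.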
(b) Wilkinson-shift: Problems with the Rayleigh-quotient shift arise when the matrix is
real and has nonreal eigenvalues, e.g., for
\[ A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}. \]
\[ R_0 Q_0 = I A_0 = A_0, \]
i.e., the algorithm stagnates. To avoid such situations, for $A \in \mathbb{C}^{n,n}$ in the $k$-th step,
one considers the submatrix $B$ in
\[ A_k = \begin{bmatrix} * & * \\ * & B \end{bmatrix} \]
(with diagonal blocks of sizes $n - 2$ and $2$) and chooses the eigenvalue $\mu_k$ of $B$ as shift that is nearest to $a_{nn}^{(k)}$.
(d) On the average 2-3 iterations are needed until a $1 \times 1$ or $2 \times 2$ block deflates.
\[ \begin{aligned} H - \mu_1 I &= Q_1 R_1 \\ H_1 &= R_1 Q_1 + \mu_1 I \\ &\ \vdots \\ H_{l-1} - \mu_l I &= Q_l R_l \\ H_l &= R_l Q_l + \mu_l I. \end{aligned} \]
Then
\[ H_l = Q_l^H Q_l R_l Q_l + \mu_l Q_l^H Q_l = Q_l^H (Q_l R_l + \mu_l I) Q_l = Q_l^H H_{l-1} Q_l \]
\[ H_l = Q_l^H \cdots Q_1^H H \underbrace{Q_1 \cdots Q_l}_{=: Q} = Q^H H Q. \]
This opens the question whether we can compute $Q$ directly without carrying out $l$ QR-iterations.
Lemma 41 $M := (H - \mu_l I) \cdots (H - \mu_1 I) = Q_1 \cdots Q_l \underbrace{R_l \cdots R_1}_{=: R} = QR$.
\[ (H - \mu_j I) \cdots (H - \mu_1 I) = Q_1 \cdots Q_j R_j \cdots R_1, \quad j = 1, \dots, l. \]
$j = 1$: This is just the first step of the QR algorithm.
$j - 1 \to j$:
\[ \begin{aligned} Q_1 \cdots Q_j R_j \cdots R_1 &= Q_1 \cdots Q_{j-1} (H_{j-1} - \mu_j I) R_{j-1} \cdots R_1 \\ &= Q_1 \cdots Q_{j-1} \left( Q_{j-1}^H \cdots Q_1^H H Q_1 \cdots Q_{j-1} - \mu_j I \right) R_{j-1} \cdots R_1 \\ &= (H - \mu_j I) Q_1 \cdots Q_{j-1} R_{j-1} \cdots R_1 \\ &\overset{\text{I.A.}}{=} (H - \mu_j I)(H - \mu_{j-1} I) \cdots (H - \mu_1 I). \end{aligned} \]
This leads to the idea to compute $M$ and then the Householder QR decomposition of $M$, i.e.,
$M = QR$, and to set
\[ \tilde{H} = Q^H H Q = H_l. \]
This means that one just needs one QR decomposition instead of $l$ QR decompositions in
each QR step. On the other hand we would have to compute $M$, i.e., $l - 1$ matrix-matrix
multiplications. But this can be avoided by computing $\tilde{H}$ directly from $H$ using the implicit
Q Theorem.
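Lemma 41 can be verified numerically for $l = 2$. The NumPy sketch below performs two explicit shifted QR steps, accumulates $Q = Q_1 Q_2$ and $R = R_2 R_1$, and compares $QR$ with the product $M = (H - \mu_2 I)(H - \mu_1 I)$. The random Hessenberg matrix and the two real shifts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
H = np.triu(rng.standard_normal((n, n)), -1)   # upper Hessenberg test matrix
shifts = [0.3, -0.7]                           # mu_1, mu_2 (arbitrary values)
I = np.eye(n)

Hk, Q_total, R_total = H.copy(), I.copy(), I.copy()
for mu in shifts:
    Q, R = np.linalg.qr(Hk - mu * I)           # H_{j-1} - mu_j I = Q_j R_j
    Hk = R @ Q + mu * I                        # H_j = R_j Q_j + mu_j I
    Q_total = Q_total @ Q                      # Q_1 ... Q_j
    R_total = R @ R_total                      # R_j ... R_1

M = (H - shifts[1] * I) @ (H - shifts[0] * I)  # (H - mu_2 I)(H - mu_1 I)
print(np.linalg.norm(Q_total @ R_total - M),
      np.linalg.norm(Q_total.T @ H @ Q_total - Hk))
```

Both differences are at the level of rounding errors, which confirms $M = QR$ and $H_l = Q^H H Q$ for this example.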
Implicit shift-strategy:
(a) Compute
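\[ M e_1 = (H - \mu_l I) \cdots (H - \mu_1 I) e_1, \]
the first column of $M$. Then the first $l + 1$ entries are in general nonzero. If $l$ is not too
large, then this costs only $O(1)$ flops.
\[ P_0 = \begin{bmatrix} * & 0 \\ 0 & I \end{bmatrix} \]
(with diagonal blocks of sizes $l + 1$ and $n - l - 1$)
\[ P_0 H P_0 = \begin{bmatrix} * & * \\ 0 & \hat{H} \end{bmatrix} \]
(with diagonal blocks of sizes $l + 2$ and $n - l - 2$)
(c) Determine Householder matrices $P_1, \dots, P_{n-2}$ to restore the Hessenberg form. This is
called bulge chasing, since we chase the bulge down the diagonal. This yields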
(d) $P_0$ has the same first column as $Q$. As in the first step for $P_0$ we have $P_k e_1 = e_1$, then
also
\[ P_0 P_1 \cdots P_{n-2} \]
has the same first column as $P_0$ and $Q$, respectively. With the implicit Q Theorem then
also $Q$ and $P_0 \cdots P_{n-2}$ and therefore also $\tilde{H}$ and $Q^H H Q$ are essentially equal, thus we
have computed $\tilde{H}$ and $H_l$ directly from $H$.
Algorithm (Francis QR algorithm with implicit double-shift strategy): (Francis
1961)
Given $A \in \mathbb{C}^{n,n}$:
Remark 42 (a) The empirical costs for the computation of all eigenvalues of $A$ are approx.
$10n^3$ flops. If also the transformation matrix $Q$ is needed then this leads to approximately
$25n^3$ flops.
(c) The method works also for real problems in real arithmetic, since the double shift can
be chosen as complex conjugate pairs.
For a given regular pair $(E, A)$ with $E, A \in \mathbb{C}^{n,n}$ the algorithm computes unitary matrices
$U_0, V_0$ such that
\[ U_0^H E V_0 = S_0, \quad U_0^H A V_0 = T_0, \]
with $S_0$ upper triangular and $T_0$ upper Hessenberg.
Use the QR decomposition to compute $U$ such that $U^H E$ is upper triangular and set $A =
U^H A = [a_{ij}]$, $E = U^H E = [e_{ij}]$.
For $j = 1, \dots, n - 2$
  for $i = n, n - 1, \dots, j + 2$
    Determine a $2 \times 2$ Householder or Givens matrix $P$ such that
    \[ P \begin{bmatrix} a_{i-1,j} \\ a_{ij} \end{bmatrix} = \begin{bmatrix} * \\ 0 \end{bmatrix} \]
  end
This algorithm costs about $5n^3$ flops plus extra $\frac{23}{6} n^3$ if $U$, $V$ are desired.
If $A$ is not unreduced then we can again deflate the problem into subproblems, i.e., as
\[ \lambda E - A = \lambda \begin{bmatrix} E_{11} & E_{12} \\ 0 & E_{22} \end{bmatrix} - \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix} \]
with $E_{22}$ strictly upper triangular and $A_{22}$ nonsingular upper triangular by the regularity
of $(E, A)$.
In the top pair $(E_{11}, A_{11})$ then $E_{11}$ is invertible and in principle we can apply the implicit
QR algorithm to $E_{11}^{-1} A_{11}$. This leads to the QZ algorithm of Moler and Stewart from 1973
(Exercise).
Chapter 3
Situation: Given a matrix $A \in \mathbb{C}^{n,n}$ with $n$ very large (e.g., $n \approx 10^6, 10^7, \dots$) and $A$ sparse,
i.e., $A$ has only very few nonzero elements. A sparse matrix $A \in \mathbb{C}^{m,n}$ is described by 6
parameters.
The programming environment MATLAB for example has the data structure sparse.
(a) For $x \in \mathbb{C}^n$ we can compute the product $Ax$, this often works with $O(n)$ flops instead
of $O(n^2)$.
(b) We cannot apply standard similarity transformations, since the transformed matrix in
general is not sparse any more.
Often even $A$ is not given but just a subroutine that computes the product $Ax$ for a given
$x \in \mathbb{C}^n$, i.e.,
\[ x \longrightarrow \boxed{\text{black box}} \longrightarrow Ax. \]
This is then the only possibility to obtain information about the matrix $A$.
For a given function $u(x, y)$, we get on a two-dimensional grid in each grid point $(i, j)$
approximations $u_{ij} = u(x_i, y_j)$ for $i = 1, \dots, m$ and $j = 1, \dots, n$. Here the approximation
to $v = -\Delta u$ in the grid point $(i, j)$ obtained via the 5-point difference star satisfies
         -1
          |
    -1 -- 4 -- -1
          |
         -1
\[ v_{ij} = 4u_{ij} - u_{i,j+1} - u_{i,j-1} - u_{i+1,j} - u_{i-1,j}, \]
together with further boundary conditions. This we can write as matrix equation.
for i=1,...,m
for j=1,...,n
v[i,j]=4*u[i,j]-u[i,j+1]-u[i,j-1]-u[i+1,j]-u[i-1,j]
end
end
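The loop above can be turned into a matrix-free "black box" for the product $Ax$. The following NumPy sketch assumes zero values outside the grid (homogeneous Dirichlet boundary conditions), which is one possible reading of the "further boundary conditions"; the function name and the comparison matrix built from Kronecker products are illustrative.

```python
import numpy as np

def apply_five_point(u):
    """v[i,j] = 4u[i,j] - u[i,j+1] - u[i,j-1] - u[i+1,j] - u[i-1,j],
    with values outside the grid taken as zero (assumed boundary condition)."""
    padded = np.pad(u, 1)                      # zero padding models the boundary
    return (4.0 * u
            - padded[1:-1, 2:] - padded[1:-1, :-2]
            - padded[2:, 1:-1] - padded[:-2, 1:-1])

# Compare with the explicitly assembled matrix on a small 3 x 4 grid.
m, n = 3, 4
T = lambda k: 2.0 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)
A = np.kron(np.eye(m), T(n)) + np.kron(T(m), np.eye(n))

u = np.random.default_rng(6).standard_normal((m, n))
v = apply_five_point(u)
print(np.linalg.norm(v.ravel() - A @ u.ravel()))
```

The function needs $O(mn)$ operations and no matrix storage, whereas the assembled matrix `A` has $(mn)^2$ entries.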
\[ [a_{-n}, \dots, a_{-1}, a_0, a_1, \dots, a_n] \]
and it suffices to store this vector and to write a subroutine to compute the product $Ax$.
Then the question arises, how to compute some eigenvalues and eigenvectors of $A$? In general
we are only interested in a few eigenvalues, e.g., the ones that are largest or smallest in
modulus. (We would certainly not be able to store all eigenvectors because that would need
$O(n^2)$ storage.)
1. Idea: Unitary subspace iteration
(a) Start with $m \ll n$ orthonormal vectors $q_1^{(0)}, \dots, q_m^{(0)}$ and set $Q_0 = [q_1^{(0)}, \dots, q_m^{(0)}]$.
(b) Compute $A Q_{k-1} = \tilde{Q}_k$, hence $m$ matrix-vector products. For sparse matrices this costs
$O(mn)$ flops instead of $O(mn^2)$ for general matrices.
(c) Compute a QR decomposition:
\[ \tilde{Q}_k = Q_k R_k, \]
where $Q_k \in \mathbb{C}^{n,m}$ is isometric and $R_k \in \mathbb{C}^{m,m}$ is upper triangular.
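2. Idea: Petrov-Galerkin projection methods: Make use of the fact that for $(\lambda, v) \in \mathbb{C} \times \mathbb{C}^n \setminus
\{0\}$ we have
\[ (\lambda, v) \text{ is an eigenvalue/eigenvector pair} \iff Av - \lambda v = 0 \iff Av - \lambda v \perp \mathbb{C}^n. \]
(b) Choose a second subspace $\mathcal{L} \subseteq \mathbb{C}^n$, the test space. Then determine $(\mu, v) \in \mathbb{C} \times \mathcal{K}$ with
\[ Av - \mu v \perp \mathcal{L}. \]
Often one uses $\mathcal{L} = \mathcal{K}$, this is called Galerkin projection. Hope: Some of the $(\mu, v)$ are
good approximations to eigenvalue/eigenvector pairs of $A$.
As motivation we consider the special case that $A \in \mathbb{R}^{n,n}$ is symmetric with eigenvalues
$\lambda_1 \ge \dots \ge \lambda_n$. Then
\[ \lambda_1 = \max_{x \neq 0} r(x) \quad \text{and} \quad \lambda_n = \min_{x \neq 0} r(x), \quad \text{where } r(x) = \frac{x^T A x}{x^T x} \quad \text{(Exercise)} \]
\[ M_k := \max_{x \in \mathcal{R}(Q_k) \setminus \{0\}} r(x) = \max_{y \neq 0} \frac{y^T Q_k^T A Q_k y}{y^T Q_k^T Q_k y}, \]
and analogously
\[ m_k := \min_{x \in \mathcal{R}(Q_k) \setminus \{0\}} r(x). \]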
It is clear that $r(x)$ grows fastest in the direction of the gradient
\[ \nabla r(x) = \frac{2}{x^T x} \left( Ax - r(x) x \right) \quad \text{(Exercise)}. \]
Since $\nabla r(x) \in \mathrm{Span}\{x, Ax\}$, the conditions 3.1 and 3.2 are satisfied if
\[ \vdots \]
\[ \mathrm{Span}\{q_1, \dots, q_{k+1}\} = \mathrm{Span}\{q_1, Aq_1, A^2 q_1, \dots, A^k q_1\}. \]
(b)
\[ \mathcal{K}_l(A, x) := \mathcal{R}(K_l(A, x)) = \mathrm{Span}\{x, Ax, A^2 x, \dots, A^{l-1} x\} \]
We have just observed that for symmetric matrices $A \in \mathbb{R}^{n,n}$ already after few matrix-vector
multiplications, Krylov spaces yield good approximations to eigenvalues $\lambda_1$ and $\lambda_n$, i.e., eigenvalues
at the exterior of the spectrum. We expect that a similar property holds for general
matrices $A \in \mathbb{C}^{n,n}$. Heuristic: Krylov spaces are good search spaces!
In the following we present a few properties of Krylov spaces. In particular, we construct the
relationship to minimal polynomials and Hessenberg matrices.
\[ 0 = p(A) x = A^m x + \alpha_{m-1} A^{m-1} x + \dots + \alpha_1 A x + \alpha_0 x. \]
Lemma 45 Let $A \in \mathbb{C}^{n,n}$, $x \in \mathbb{C}^n$ and let $\nu$ be the degree of the minimal polynomial of $x$
with respect to $A$. Then
Proof: Exercise. □
Lemma 46 Let $A \in \mathbb{C}^{n,n}$ and $g_1 \in \mathbb{C}^n$ be such that $g_1, Ag_1, \dots, A^{m-1} g_1$ are linearly independent.
Suppose that $g_2, \dots, g_n$ are such that $G = [g_1, g_2, \dots, g_n]$ is nonsingular and let
$B = G^{-1} A G = [b_{ij}]$. Then the following are equivalent
Proof: Exercise. □
In the transformation with Householder matrices we construct the structure (triangular/Hessenberg)
by applying orthogonal transformations to the whole matrix, in the Gram-Schmidt/Arnoldi
method we construct the structure (triangular/Hessenberg) column by column without ever
transforming the matrix itself.
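Thus,
\[ q_{k+1} = \frac{1}{h_{k+1,k}} \left( A q_k - \sum_{i=1}^{k} h_{ik} q_i \right). \]
Due to the orthonormality of the $q_i$ we get
2) For $k = 1, 2, \dots, n - 1$
(a) $\tilde{q}_{k+1} := A q_k - \sum_{i=1}^{k} h_{ik} q_i$, $\quad h_{ik} = q_i^H A q_k$.
(b) $h_{k+1,k} := \|\tilde{q}_{k+1}\|$.
(c) $q_{k+1} = \frac{1}{h_{k+1,k}} \tilde{q}_{k+1}$.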
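The steps above can be sketched as follows with NumPy. This version is an illustration under two assumptions: the start is $q_1 = x / \|x\|$, and the coefficients $h_{ik}$ are computed in modified Gram-Schmidt fashion (from the partially orthogonalized vector), which is mathematically equivalent to $h_{ik} = q_i^H A q_k$. The function name and the random test data are hypothetical.

```python
import numpy as np

def arnoldi(A, x, m):
    """m Arnoldi steps; returns Q (n x (m+1)) and the (m+1) x m Hessenberg matrix H."""
    n = x.shape[0]
    Q = np.zeros((n, m + 1), dtype=complex)
    H = np.zeros((m + 1, m), dtype=complex)
    Q[:, 0] = x / np.linalg.norm(x)            # assumed start: q_1 = x / ||x||
    for k in range(m):
        w = A @ Q[:, k]
        for i in range(k + 1):                 # modified Gram-Schmidt sweep
            H[i, k] = np.vdot(Q[:, i], w)
            w = w - H[i, k] * Q[:, i]
        H[k + 1, k] = np.linalg.norm(w)
        if H[k + 1, k] == 0.0:                 # breakdown: invariant subspace found
            return Q[:, :k + 1], H[:k + 1, :k + 1]
        Q[:, k + 1] = w / H[k + 1, k]
    return Q, H

rng = np.random.default_rng(7)
A = rng.standard_normal((8, 8))
m = 4
Q, H = arnoldi(A, rng.standard_normal(8), m)
print(np.linalg.norm(A @ Q[:, :m] - Q @ H))    # column-wise recurrence A q_k = sum_j h_jk q_j
```

Only products with `A` are needed, so `A` could equally be a matrix-free operator; the cost of the inner loop grows with `k`, since every new vector is orthogonalized against all previous ones.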
Remark 47 (a) The algorithm stops if $h_{m+1,m} = 0$ for some $m$ (good breakdown). Then
\[ A q_k = \sum_{j=1}^{k+1} h_{jk} q_j \]
for $k = 1, \dots, m - 1$ and
\[ A q_m = \sum_{j=1}^{m} h_{jm} q_j. \]
Thus,
\[ A [q_1, \dots, q_m] = [q_1, \dots, q_m] \underbrace{\begin{bmatrix} h_{11} & \dots & h_{1,m-1} & h_{1m} \\ h_{21} & \ddots & & \vdots \\ & \ddots & \ddots & \vdots \\ 0 & & h_{m,m-1} & h_{mm} \end{bmatrix}}_{=: H_m}, \]
This is called the Arnoldi relation. Due to the orthonormality of the $q_i$ we also have $Q_m^H A Q_m =
H_m$.
Consequence: Due to the relation between Hessenberg matrices and Krylov spaces it follows
that
\[ \mathrm{Span}\{q_1, \dots, q_l\} = \mathcal{K}_l(A, x) \]
for $l = 1, \dots, m + 1$. Thus the Arnoldi algorithm computes orthonormal bases of Krylov
spaces.
Application: The Arnoldi algorithm as projection method for $A \in \mathbb{C}^{n,n}$, $n$ large.
1) For a given start vector $x \neq 0$ compute the vectors $q_1, q_2, \dots$ with the Arnoldi method.
is more and more expensive in each further step (concerning flops and storage).
This yields an orthonormal basis of the Krylov spaces:
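4) If $h_{m+1,m} \neq 0$, then choose $\mathcal{K}_m(A, x) = \mathcal{R}(Q_m)$ as search and test space for a projection
method, i.e., determine $\mu \in \mathbb{C}$ and $v \in \mathcal{R}(Q_m) \setminus \{0\}$ with
\[ Av - \mu v \perp \mathcal{R}(Q_m). \]
Hope: Since $\mathcal{R}(Q_m)$ is a Krylov space, some of the $(\mu, v)$ are good approximations to eigenvalue/eigenvector
pairs of $A$.
\[ v \in \mathcal{K} \setminus \{0\} \quad \text{and} \quad Av - \mu v \perp \mathcal{K}. \]
\[ Q_m^H A Q_m z = \mu z \iff Av - \mu v \perp \mathcal{R}(Q_m). \]
Proof:
\[ Q_m^H A Q_m z = \mu z = \mu Q_m^H Q_m z \iff Q_m^H (Av - \mu v) = 0 \iff Av - \mu v \perp \mathcal{R}(Q_m). \]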
Remark 50 We can compute the eigenvalues of $H_m$ by the Francis QR algorithm and here
we can exploit that $H_m$ is already in Hessenberg form. To compute the eigenvectors, we carry
out one step of inverse iteration with a computed eigenvalue $\mu$ of $H_m$ as shift, i.e., we choose
a start vector $w_0 \in \mathbb{C}^m$, $\|w_0\| = 1$, solve
\[ (H_m - \mu I_m) \tilde{w}_1 = w_0 \]
for $\tilde{w}_1$ and set $w_1 = \frac{\tilde{w}_1}{\|\tilde{w}_1\|}$. Since $\mu$ is already a good eigenvalue approximation, i.e., a good
shift, one step is usually enough.
To check whether a Ritz pair is a good approximation, we can compute the residual. A
small residual means a small backward error, i.e., if $Av - \mu v$ is small and the eigenvalue is
well-conditioned, then $(\mu, v)$ is a good approximation to an eigenvalue/eigenvector pair of $A$.
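Theorem 51 Let $A, Q_m, H_m, h_{m+1,m}$ be the results of $m$ steps of the Arnoldi algorithm.
Furthermore, let $z = [z_1, \dots, z_m]^T \in \mathbb{C}^m$ be an eigenvector of $H_m$ associated with $\mu \in \mathbb{C}$.
Then $(\mu, v)$, with $v = Q_m z$, is a Ritz pair of $A$ with respect to $\mathcal{R}(Q_m)$ and
\[ \|Av - \mu v\| = |h_{m+1,m}| \, |z_m|. \]
Proof: Using the Arnoldi relation, it follows that
\[ \begin{aligned} Av - \mu v &= A Q_m z - \mu Q_m z \\ &= \left( Q_m H_m + h_{m+1,m} q_{m+1} e_m^T \right) z - \mu Q_m z \\ &= Q_m \underbrace{(H_m z - \mu z)}_{= 0} + h_{m+1,m} z_m q_{m+1} \end{aligned} \]
\[ \|Av - \mu v\| = |h_{m+1,m}| \, |z_m|. \]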
Remark 52 (a) For the computation of the residual $Av - \mu v$ we do not need to determine
the Ritz vector $v$ explicitly.
(b) After some iterations in finite precision arithmetic the orthonormality of the $q_i$ deteriorates.
This happens in particular when $|h_{m+1,m}|$ is small. Then the Ritz values
deteriorate as well. This can be fixed by re-orthonormalization using the modified Gram-Schmidt
method for $q_1, \dots, q_m$.
(c) We know how to detect good approximations but we are not sure that they occur.
\[ \begin{aligned} A q_1 &= \alpha_1 q_1 + \beta_1 q_2 \\ A q_k &= \beta_{k-1} q_{k-1} + \alpha_k q_k + \beta_k q_{k+1}, \quad k = 2, \dots, n - 1 \\ A q_n &= \beta_{n-1} q_{n-1} + \alpha_n q_n \end{aligned} \]
This is called a 3-term recursion. Since $q_1, \dots, q_n$ are orthonormal we have, furthermore, that
\[ \alpha_k = q_k^H A q_k. \]
1) Start: Choose $x \neq 0$. Then set $q_0 := 0$, $q_1 := \frac{x}{\|x\|}$ and $\beta_0 := 0$.
2) For k = 1, 2, . . . , m
Remark 53 (a) The Lanczos algorithm is essentially the Arnoldi algorithm for A = AH .
(b) As in the Arnoldi algorithm, after m steps we have (due to Lemma 46) that
(d) Due to the 3-term recursion we do not need to store more than three vectors $q_i$, thus
we can choose $m$ much bigger than in the Arnoldi algorithm. However, if we do not
store the $q_i$ then we cannot re-orthonormalize. Then it is necessary to detect the spurious
eigenvalues and remove them. (Cullum-Willoughby method 1979.)
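The loop body of the Lanczos algorithm is not reproduced above; the following NumPy sketch shows the usual form of the 3-term recursion for a real symmetric matrix and is an assumption in that sense. For checking purposes it stores all vectors $q_i$, although the recursion itself only needs the last two. The function name and the diagonal test matrix are illustrative.

```python
import numpy as np

def lanczos(A, x, m):
    """m Lanczos steps for real symmetric A; returns Q, alpha (diagonal), beta (off-diagonal)."""
    n = x.shape[0]
    Q = np.zeros((n, m))
    alpha = np.zeros(m)
    beta = np.zeros(m)
    q_prev, b_prev = np.zeros(n), 0.0           # q_0 := 0, beta_0 := 0
    q = x / np.linalg.norm(x)                   # q_1 := x / ||x||
    for k in range(m):
        Q[:, k] = q
        w = A @ q - b_prev * q_prev             # subtract beta_{k-1} q_{k-1}
        alpha[k] = q @ w                        # alpha_k = q_k^H A q_k
        w = w - alpha[k] * q
        beta[k] = np.linalg.norm(w)
        if beta[k] == 0.0:                      # invariant subspace found
            return Q[:, :k + 1], alpha[:k + 1], beta[:k]
        q_prev, q, b_prev = q, w / beta[k], beta[k]
    return Q, alpha, beta[:m - 1]

A = np.diag(np.arange(1.0, 31.0))               # symmetric test matrix, eigenvalues 1..30
x = np.random.default_rng(8).standard_normal(30)
Q, alpha, beta = lanczos(A, x, 6)
T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
print(np.linalg.norm(Q.T @ A @ Q - T))
```

For a small number of steps the computed vectors are still orthonormal to working precision; for larger `m` the loss of orthogonality mentioned in (d) becomes visible in `Q.T @ Q`.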
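Idea: Transform $A$ to tridiagonal form and allow the transformation matrix to become non-unitary,
i.e., we want to determine
\[ X^{-1} A X = T \]
with $T$ tridiagonal, or equivalently
\[ AX = X \begin{bmatrix} \alpha_1 & \gamma_1 & & 0 \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \gamma_{n-1} \\ 0 & & \beta_{n-1} & \alpha_n \end{bmatrix} \]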
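If we write $X$ as
\[ X = [x_1, \dots, x_n], \]
then we obtain
\[ A x_k = \gamma_{k-1} x_{k-1} + \alpha_k x_k + \beta_k x_{k+1} \]
for $k = 1, \dots, n - 1$ and with $\gamma_0 := 0$, $x_0 := 0$. Furthermore,
\[ T^H = X^H A^H X^{-H} = Y^{-1} A^H Y \]
with $Y := X^{-H}$. Then clearly $Y^H X = I$, i.e., $y_j^H x_i = \delta_{ij}$ with
\[ Y = [y_1, \dots, y_n]. \]
Such families of vectors $(x_1, \dots, x_n)$ and $(y_1, \dots, y_n)$ are called bi-orthogonal. We again compare
columns in $A^H Y = Y T^H$. This means that
\[ A^H y_k = \bar{\beta}_{k-1} y_{k-1} + \bar{\alpha}_k y_k + \bar{\gamma}_k y_{k+1} \]
for $k = 1, \dots, n - 1$ and $\beta_0 := 0$, $y_0 := 0$. Since $Y^H X = I$, it follows that
\[ \alpha_k = y_k^H A x_k. \]
Moreover,
\[ \beta_k x_{k+1} = A x_k - \gamma_{k-1} x_{k-1} - \alpha_k x_k =: r_k, \]
and
\[ \bar{\gamma}_k y_{k+1} = A^H y_k - \bar{\beta}_{k-1} y_{k-1} - \bar{\alpha}_k y_k =: s_k. \]
Furthermore, we have
\[ 1 = y_{k+1}^H x_{k+1} = \frac{1}{\gamma_k \beta_k} s_k^H r_k \]
for all $k \ge 1$. But there is still freedom in the computation of $\beta_k, \gamma_k$. We could in principle
use one of the variants
\[ \beta_k = \|r_k\|, \quad \gamma_k = \|s_k\|, \quad \beta_k = \gamma_k, \dots \]
\[ \beta_k = \|r_k\|, \]
\[ \gamma_k = \frac{1}{\beta_k} s_k^H r_k, \]
\[ x_{k+1} = \frac{1}{\beta_k} r_k, \]
\[ y_{k+1} = \frac{1}{\bar{\gamma}_k} s_k. \]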
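The recursion can be sketched for real matrices, where all conjugates drop out, using NumPy. The normalization $\beta_k = \|r_k\|$, $\gamma_k = s_k^T r_k / \beta_k$ follows the last four formulas above; the function name, the breakdown handling, and the nearly symmetric test matrix (chosen so that $s_k^T r_k$ stays safely away from zero) are illustrative assumptions.

```python
import numpy as np

def two_sided_lanczos(A, x1, y1, m):
    """m steps of the nonsymmetric Lanczos process for a real matrix A."""
    n = A.shape[0]
    X = np.zeros((n, m + 1))
    Y = np.zeros((n, m + 1))
    alpha, beta, gamma = np.zeros(m), np.zeros(m), np.zeros(m)
    X[:, 0] = x1
    Y[:, 0] = y1 / (y1 @ x1)                       # enforce y_1^T x_1 = 1
    x_prev, y_prev, b_prev, g_prev = np.zeros(n), np.zeros(n), 0.0, 0.0
    for k in range(m):
        x, y = X[:, k], Y[:, k]
        alpha[k] = y @ (A @ x)
        r = A @ x - g_prev * x_prev - alpha[k] * x
        s = A.T @ y - b_prev * y_prev - alpha[k] * y
        beta[k] = np.linalg.norm(r)
        if beta[k] == 0.0 or s @ r == 0.0:
            raise RuntimeError("breakdown in step %d" % (k + 1))
        gamma[k] = (s @ r) / beta[k]
        X[:, k + 1] = r / beta[k]
        Y[:, k + 1] = s / gamma[k]
        x_prev, y_prev, b_prev, g_prev = x, y, beta[k], gamma[k]
    return X, Y, alpha, beta, gamma

rng = np.random.default_rng(9)
A = np.diag(np.arange(1.0, 9.0)) + 0.1 * rng.standard_normal((8, 8))
x1 = rng.standard_normal(8)
x1 /= np.linalg.norm(x1)
m = 4
X, Y, alpha, beta, gamma = two_sided_lanczos(A, x1, x1.copy(), m)
Xm, Ym = X[:, :m], Y[:, :m]
T = np.diag(alpha) + np.diag(beta[:m - 1], -1) + np.diag(gamma[:m - 1], 1)
print(np.linalg.norm(Ym.T @ Xm - np.eye(m)), np.linalg.norm(Ym.T @ A @ Xm - T))
```

The two printed norms measure the bi-orthogonality $Y_m^H X_m = I$ and the projected tridiagonal matrix $Y_m^H A X_m = T_m$; both are small here, but the process can break down or lose bi-orthogonality for less benign data.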
Remark 54 1) After $m$ steps, if we do not have a breakdown (since we cannot divide by
0), we have
\[ A \underbrace{[x_1, \dots, x_m]}_{=: X_m} = [x_1, \dots, x_m, x_{m+1}] \begin{bmatrix} \alpha_1 & \gamma_1 & & 0 \\ \beta_1 & \ddots & \ddots & \\ & \ddots & \ddots & \gamma_{m-1} \\ & & \beta_{m-1} & \alpha_m \\ 0 & & & \beta_m \end{bmatrix} \]
If we denote the submatrix consisting of the first $m$ rows with $T_m$, then we have
and analogously
\[ A^H Y_m = Y_m T_m^H + \bar{\gamma}_m y_{m+1} e_m^T. \]
2) Due to the relationship with Krylov spaces and Hessenberg matrices, we have
and
\[ \mathrm{Span}\{y_1, \dots, y_l\} = \mathcal{K}_l(A^H, y_1), \]
$l = 1, \dots, m$.
and this is equivalent to
\[ Y_m^H (A X_m z - \mu X_m z) = 0. \]
We thus have $Y_m^H (Av - \mu v) = 0$ or
\[ Av - \mu v \perp \mathcal{R}(Y_m). \]
Such a pair $(\mu, v)$ is often called Petrov pair. Here we use $\mathcal{R}(X_m) = \mathcal{K}_m(A, x_1)$ as
search space and $\mathcal{R}(Y_m) = \mathcal{K}_m(A^H, y_1)$ as test space.
(a) In the look-ahead-variant one weakens the bi-orthonormality to avoid the serious
breakdowns (QMR algorithm).
(b) The bi-orthonormality deteriorates due to round-off errors in finite precision arithmetic.
\[ \mu \in \mathbb{C}, \quad v \in \mathcal{K} \setminus \{0\}, \]
such that
\[ Av - \mu v \perp \mathcal{L}. \]
In the Arnoldi and the symmetric Lanczos algorithm we choose $\mathcal{L} = \mathcal{K}$, in the nonsymmetric
Lanczos method $\mathcal{L} = \mathcal{K}_m(A^H, y_1)$.
Then the obvious question is whether in the computed pairs $(\mu, v)$ there are good approximations
to eigenvalue/eigenvector pairs?
\[ w = \alpha_0 x + \alpha_1 A x + \dots + \alpha_{m-1} A^{m-1} x = p(A) x, \]
The convergence is typically very good for eigenvalues in the outer part of the spectrum and very slow for eigenvalues in the interior. Quantitative results using Lemma 55 are based on optimal polynomial approximations, but a complete convergence analysis in all cases is an open problem.
3.6 The Implicitly Restarted Arnoldi Algorithm
The Arnoldi algorithm becomes more and more expensive the more iterations one performs, and the alternative, the nonsymmetric Lanczos algorithm, is unstable. Can we resolve this problem of the Arnoldi algorithm?
Ideas:
1) Use restarts: After $m$ steps of the Arnoldi algorithm with start vector $x$ choose $p \in \Pi_{m-1}$ with
\[ |p(\lambda_i)| = \begin{cases} \text{large} & \text{for desired } \lambda_i, \\ \text{small} & \text{for undesired } \lambda_i. \end{cases} \]
Then choose $p(A)x$ as new start vector and restart the Arnoldi algorithm. Since the start vector has been enlarged in the components in the direction of the desired eigenvectors, we expect that the Krylov spaces contain, in particular, good approximations to the desired eigenvectors.
The disadvantage is that, since we start with a single new vector, we lose a lot of already obtained information. It would be ideal to choose the new start vector $p(A)x$ optimally in the sense that we keep as much information as possible, i.e., as many approximations to desired eigenvalues as possible. This is again an open problem.
2) To preserve more information, we proceed as follows. If $k$ eigenvalues are desired then we run the Arnoldi algorithm for $m = k + l$ steps. Then keep $k$ vectors and throw away $l$ vectors. This leads to the implicitly restarted Arnoldi algorithm, in which $l$ QR steps with shifts $\nu_1, \dots, \nu_l$ are applied to $H_m$:
    $H^{(1)} = H_m$
    For $j = 1, \dots, l$,
        $H^{(j)} - \nu_j I = U_j R_j$ (QR decomposition)
        $H^{(j+1)} = R_j U_j + \nu_j I$
    end
    $\hat H_m = H^{(l+1)}$
    $U = U_1 \cdots U_l$
Then $\hat H_m = U^H H_m U$ and furthermore every $U_i$ has the form $U_i = G^{(i)}_{12} \cdots G^{(i)}_{m-1,m}$ with Givens rotation matrices $G^{(i)}_{12}, \dots, G^{(i)}_{m-1,m}$. Every $U_i$ is a Hessenberg matrix, thus $U$ is a product of $l$ Hessenberg matrices and hence a band matrix with lower bandwidth $l$.
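The shifted QR steps can be illustrated with a small experiment. The sketch below is plain Python and assumes real data, shifts that are not eigenvalues (so each QR factor is nonsingular), and a modified Gram-Schmidt QR factorization instead of Givens rotations. The helper names and the test matrix are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def qr_gram_schmidt(M):
    """QR factorization of a square nonsingular matrix (modified Gram-Schmidt)."""
    n = len(M)
    cols = [[M[i][j] for i in range(n)] for j in range(n)]
    Q, R = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i, q in enumerate(Q):
            R[i][j] = sum(a * b for a, b in zip(q, v))
            v = [a - R[i][j] * b for a, b in zip(v, q)]
        R[j][j] = math.sqrt(sum(a * a for a in v))
        Q.append([a / R[j][j] for a in v])
    # Q holds columns; return it as a row-major matrix
    return [[Q[j][i] for j in range(n)] for i in range(n)], R

def shifted_qr_steps(H, shifts):
    """Apply one shifted QR step per shift; return (H_hat, U)."""
    n = len(H)
    U = [[float(i == j) for j in range(n)] for i in range(n)]
    for nu in shifts:
        shifted = [[H[i][j] - nu * (i == j) for j in range(n)] for i in range(n)]
        Uj, Rj = qr_gram_schmidt(shifted)      # H^(j) - nu_j I = U_j R_j
        RU = matmul(Rj, Uj)
        H = [[RU[i][j] + nu * (i == j) for j in range(n)] for i in range(n)]
        U = matmul(U, Uj)
    return H, U

H0 = [[4.0, 1.0, 2.0, 0.0],
      [1.0, 3.0, 1.0, 1.0],
      [0.0, 2.0, 2.0, 1.0],
      [0.0, 0.0, 1.0, 1.0]]
Hh, U = shifted_qr_steps(H0, [0.5, -1.0])        # l = 2 shifts
Ut = [list(c) for c in zip(*U)]
check = matmul(Ut, matmul(H0, U))                # should equal Hh
```

Here `Hh` stays upper Hessenberg, `check` agrees with `Hh` up to rounding, and the entry `U[3][0]` vanishes up to rounding, in line with the lower bandwidth $l = 2$.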
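4) We have $A Q_m = Q_m H_m + f_{m+1} e_m^T$ with $f_{m+1} = h_{m+1,m} q_{m+1}$. Hence,
\[ A Q_m U = \underbrace{Q_m U}_{=:\hat Q_m} \underbrace{U^H H_m U}_{=\hat H_m} + f_{m+1} \underbrace{e_m^T U}_{=:u_m^T}, \]
with
\[ u_m^T = [\underbrace{0, \dots, 0}_{k-1}, \alpha, \underbrace{*, \dots, *}_{l}]. \]
Taking only the first $k$ columns of this relation gives
\[ A \hat Q_k = \hat Q_k \hat H_k + \hat f_{k+1} e_k^T. \]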
5) Then we perform $l$ further Arnoldi steps and begin again with 1) until sufficiently many eigenvalues are found.
Lemma 56 Let $p(t) = (t - \nu_1) \cdots (t - \nu_l)$. Then, with the notation and assumptions above,
\[ p(A) Q_m = Q_m U R + F_m, \]
where $F_m = [\,0 \;\; \hat F_m\,]$ and $R = R_l \cdots R_1$.
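Proof: We have already shown that
\[ p(A) Q_m = Q_m p(H_m) + F_m. \]
Induction step $l - 1 \to l$: We have
\begin{align*}
p(A) Q_m &= (A - \nu_1 I)(A - \nu_2 I) \cdots (A - \nu_l I) Q_m \\
&\overset{\text{I.H.}}{=} (A - \nu_1 I) \bigl[ Q_m (H_m - \nu_2 I) \cdots (H_m - \nu_l I) + F_{m-1} \bigr] \\
&= \bigl[ Q_m (H_m - \nu_1 I) + h_{m+1,m} q_{m+1} e_m^T \bigr] (H_m - \nu_2 I) \cdots (H_m - \nu_l I) + (A - \nu_1 I) F_{m-1} \\
&= Q_m p(H_m) + \underbrace{h_{m+1,m} q_{m+1} e_m^T (H_m - \nu_2 I) \cdots (H_m - \nu_l I) + \bigl[\, 0 \;\; (A - \nu_1 I) \hat F_{m-1} \,\bigr]}_{=:F_m},
\end{align*}
where I.H. denotes the induction hypothesis. Here $(H_m - \nu_2 I) \cdots (H_m - \nu_l I)$ has bandwidth $l - 1$, so that $e_m^T (H_m - \nu_2 I) \cdots (H_m - \nu_l I) = [0, \dots, 0, *, \dots, *]$ with $l$ trailing entries $*$, and in the last bracket the zero block has $m - l + 1$ columns and the second block $l - 1$ columns. Thus $F_m = [\,0 \;\; \hat F_m\,]$ has the desired form. □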
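Proof:
1) One has
\[ p(A) Q_m = Q_m U R + F_m = \hat Q_m R + F_m = \hat Q_m \begin{bmatrix} r_{11} & \dots & r_{1m} \\ & \ddots & \vdots \\ 0 & & r_{mm} \end{bmatrix} + [\,0 \;\; \hat F_m\,]. \]
Comparing the first columns gives
\[ p(A) q_1 = \hat q_1 r_{11}. \]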
Therefore, choose $\alpha = 1/r_{11}$. Here $r_{11} \neq 0$, otherwise we would have $p(A) q_1 = 0$, i.e., the minimal polynomial of $q_1$ with respect to $A$ would have degree $\le l$. But then the Krylov space $\mathcal{K}_l(A, q_1)$ is invariant, i.e., the Arnoldi algorithm would have had a good breakdown after at most $l$ steps, which contradicts the fact that we have done $m > l$ steps without breakdown.
2) We could proceed as in 1) but we use the following approach: Since we know already that
\[ \mathcal{R}(Q_j) = \mathcal{K}_j(A, q_1), \quad j = 1, \dots, m \]
and
\[ \mathcal{R}(\hat Q_j) = \mathcal{K}_j(A, \hat q_1), \quad j = 1, \dots, m \]
it follows that
\[ \mathcal{R}(\hat Q_j) = \operatorname{Span}\{p(A) q_1, A p(A) q_1, \dots, A^{j-1} p(A) q_1\} = p(A) \mathcal{R}(Q_j). \]
3.7 The Jacobi-Davidson Algorithm
If only a specific pair $(\lambda, u)$ of a large sparse matrix $A \in \mathbb{C}^{n,n}$ is desired (e.g., the eigenvalue largest in modulus or the one nearest to $\nu \in \mathbb{C}$), then alternative methods can be considered.
1. Idea. Apply the Arnoldi algorithm with shift-and-invert, i.e., apply the Arnoldi algorithm to $(A - \nu I)^{-1}$. This has the disadvantage that per step we have to solve a linear system with $A - \nu I$. In practice, this has to be done very accurately, otherwise the convergence deteriorates. If a sparse LR decomposition of $A - \nu I$ can be determined, then this is acceptable; otherwise this is problematic.
This leads to the following task: For $q_1 \in \mathbb{C}^n$, $\|q_1\| = 1$, construct an isometric matrix
\[ Q_m = [q_1, \dots, q_m], \]
such that $(\lambda, u)$ is quickly approximated by Ritz pairs of $A$ with respect to $\mathcal{R}(Q_m)$. When $q_1, \dots, q_k$ are constructed, then we compute Ritz pairs of $A$ with respect to $\mathcal{R}(Q_k)$. Let $(\mu_k, v_k)$ be the Ritz pair with $\|v_k\| = 1$ that approximates $(\lambda, u)$ best.
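Ansatz:
\[ u = v_k + x, \quad \text{with } x \perp v_k, \qquad \lambda = \mu_k + \eta. \]
Here the unknown $u$ is scaled so that $(u - v_k) \perp v_k$. This can be achieved by determining an approximation to $x$ and computing $q_{k+1}$ from $x$ by orthonormalizing against $q_1, \dots, q_k$.
Computation of $x$. Consider
\[ P = I - v_k v_k^H. \]
This is the orthogonal projection onto $\operatorname{Span}(v_k)^\perp$. Furthermore, let
\[ r_k := A v_k - \mu_k v_k \]
be the residual of $(\mu_k, v_k)$ with respect to $A$. Then
\[ P v_k = 0, \quad P x = x, \quad P r_k = r_k, \]
since $(\mu_k, v_k)$ is a Ritz pair, and hence $r_k \perp \mathcal{R}(Q_k) \ni v_k$, i.e., $r_k \perp v_k$. Moreover,
\begin{align*}
& A u = \lambda u \\
\Longrightarrow\quad & A (v_k + x) = \lambda (v_k + x) \\
\Longrightarrow\quad & (A - \lambda I) x = -(A - \lambda I) v_k = -(A - \mu_k I) v_k + \eta v_k = -r_k + \eta v_k \\
\Longrightarrow\quad & P (A - \lambda I) x = -P r_k = -r_k \\
\Longrightarrow\quad & P (A - \lambda I) P x = -r_k \quad \text{and} \quad x \perp v_k.
\end{align*}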
Unfortunately we cannot solve the last equation, since we do not know $\lambda$. Therefore, we replace $\lambda$ by the Ritz value $\mu_k$,
\[ P (A - \mu_k I) P x = -r_k, \quad x \perp v_k. \tag{3.3} \]
Interpretation: Since $P x = x$ it follows from the Jacobi correction equation (3.3) that, for a suitable scalar $\alpha$,
\begin{align*}
(A - \mu_k I) x &= -r_k + \alpha v_k, \\
x &= -(A - \mu_k I)^{-1} r_k + \alpha (A - \mu_k I)^{-1} v_k \\
&= -v_k + \alpha (A - \mu_k I)^{-1} v_k.
\end{align*}
Since $v_k$ is already in the search space, we have extended $\mathcal{R}(Q_k)$ by the scaled vector $(A - \mu_k I)^{-1} v_k$. Since
\[ v_k^H (A v_k - \mu_k v_k) = 0, \quad \text{i.e.,} \quad \mu_k = \frac{v_k^H A v_k}{v_k^H v_k}, \]
this corresponds to the vector that one obtains by applying one step of the Rayleigh quotient iteration to $v_k$, which converges cubically for $A = A^H$ and at least quadratically otherwise.
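To make the connection to the Rayleigh quotient iteration concrete, here is a minimal sketch for a real $2 \times 2$ matrix, where the shifted system $(A - \mu I) y = v$ is solved by Cramer's rule. The function name, the stopping tolerance and the example matrix are assumptions made for illustration only.

```python
import math

def rayleigh_quotient_iteration(A, v, steps=10, tol=1e-12):
    """Rayleigh quotient iteration for a real 2x2 matrix A = [[a, b], [c, d]]."""
    (a, b), (c, d) = A
    nrm = math.hypot(v[0], v[1])
    v = [v[0] / nrm, v[1] / nrm]
    mu = 0.0
    for _ in range(steps):
        Av = [a * v[0] + b * v[1], c * v[0] + d * v[1]]
        mu = v[0] * Av[0] + v[1] * Av[1]          # Rayleigh quotient, ||v|| = 1
        res = math.hypot(Av[0] - mu * v[0], Av[1] - mu * v[1])
        det = (a - mu) * (d - mu) - b * c
        if res < tol or det == 0.0:               # converged or singular shift
            break
        # Solve (A - mu I) y = v by Cramer's rule, then normalize
        y = [((d - mu) * v[0] - b * v[1]) / det,
             ((a - mu) * v[1] - c * v[0]) / det]
        nrm = math.hypot(y[0], y[1])
        v = [y[0] / nrm, y[1] / nrm]
    return mu, v

mu, v = rayleigh_quotient_iteration([[2.0, 1.0], [1.0, 3.0]], [1.0, 0.0])
```

For this symmetric example the returned `mu` satisfies the characteristic equation $(2 - \mu)(3 - \mu) - 1 = 0$ up to rounding after a handful of steps.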
(a) $P (A - \mu_k I) P \approx -I$, hence $x = r_k$. This is formally equivalent to the Arnoldi algorithm (Exercise).
(b) $P (A - \mu_k I) P \approx (D - \mu_k I)$ where $D$ is the diagonal part of $A$. (This is the Davidson method in quantum chemistry.) It works very well for diagonally dominant matrices.
(c) Solve the correction equation with iterative methods (see next chapter).
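\[ A \begin{bmatrix} 1 \\ z \end{bmatrix} = \begin{bmatrix} \alpha & c^T \\ b & F \end{bmatrix} \begin{bmatrix} 1 \\ z \end{bmatrix} = \lambda \begin{bmatrix} 1 \\ z \end{bmatrix}. \]
This is equivalent to
\[ \lambda = \alpha + c^T z, \qquad (F - \lambda I) z = -b. \]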
3) We can compare the Jacobi-Davidson method and the Arnoldi algorithm with shift-and-invert for the computation of a desired eigenvalue/eigenvector pair. Jacobi-Davidson consists of the solution of the Jacobi correction equation $P (A - \mu_k I) P x = -r_k$ and the computation of the new search direction. Then we must determine the new eigenvalue of $M_k$ by applying the operator $A$ in the step $W_k = A Q_k$. We can solve the Jacobi correction equation approximately, since it is only used for the search direction. The spectral information of the operator $A$ is injected in the step $W_k = A Q_k$, where $M_k$ and its eigenvalues are computed. When we solve the Jacobi correction equation only approximately, the convergence may be slowed down.
(a) In the Arnoldi algorithm both steps, the computation of the search direction and the application of the operator, are combined in the equation
\[ \tilde q_{k+1} = (A - \nu I)^{-1} q_k - \sum_{i=1}^{k} h_{ik} q_i. \]
The solution of the linear system has to be done very accurately. If this is not done accurately enough, then the information about the operator is not injected well enough. Furthermore, restarts, locking and purging are easier in the Jacobi-Davidson method, since we do not need to preserve an Arnoldi relation.
(b) On the other hand the Jacobi-Davidson algorithm approximates only one pair at a time while the Arnoldi method determines several eigenvalues at the same time.
3.8 Large Scale Generalized Eigenvalue Problems
For large scale generalized eigenvalue problems $\lambda E x = A x$ with regular pairs $(E, A)$ we can apply all the methods from the previous sections whenever the solution of linear systems $(\lambda_0 E - A) x = b$ is feasible for chosen shifts $\lambda_0$.
Set $\mu = (\lambda_0 - \lambda)^{-1}$ and replace $\lambda E x = A x$ by $\mu x = (\lambda_0 E - A)^{-1} E x =: \tilde A x$ and apply the chosen algorithm to the matrix $\tilde A$. Whenever a matrix-vector multiplication with $\tilde A$ is needed we have to carry out a matrix-vector multiplication with $E$ and solve a linear system with $(\lambda_0 E - A)$.
If an eigenvalue/eigenvector pair $(\mu, x)$ of $\tilde A$ has been computed then it corresponds to an eigenvalue $\lambda = \lambda_0 - \mu^{-1}$ of the original problem with the same eigenvector, using the spectral transformation $\mu = (\lambda_0 - \lambda)^{-1}$. Using this relation we can make any eigenvalue an exterior eigenvalue by choosing an appropriate $\lambda_0$.
This spectral transformation can also be used in the case $E = I$ to achieve fast convergence near any chosen shift.
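As a small illustration of the spectral transformation, the following sketch applies the power method to $(\lambda_0 E - A)^{-1} E$ for a $2 \times 2$ pencil and maps the dominant eigenvalue $\mu$ back via $\lambda = \lambda_0 - \mu^{-1}$, which follows from $(\lambda_0 E - A)x = (\lambda_0 - \lambda)Ex$. The power method stands in for the Krylov methods of the previous sections. The helper names, the pencil and the shift are illustrative assumptions.

```python
import math

def solve2(M, rhs):
    """Solve a 2x2 linear system by Cramer's rule."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]

def shift_invert_power(E, A, lam0, steps=60):
    """Power method on (lam0*E - A)^{-1} E; returns (lambda, x)."""
    K = [[lam0 * E[i][j] - A[i][j] for j in range(2)] for i in range(2)]
    x = [1.0, 1.0]
    mu = 0.0
    for _ in range(steps):
        nrm = math.hypot(x[0], x[1])
        x = [x[0] / nrm, x[1] / nrm]
        Ex = [E[0][0] * x[0] + E[0][1] * x[1], E[1][0] * x[0] + E[1][1] * x[1]]
        y = solve2(K, Ex)                   # y = (lam0*E - A)^{-1} E x
        mu = x[0] * y[0] + x[1] * y[1]      # estimate of the dominant mu
        x = y
    return lam0 - 1.0 / mu, x               # back-transformation

E = [[2.0, 0.0], [0.0, 1.0]]
A = [[4.0, 1.0], [2.0, 3.0]]
lam, x = shift_invert_power(E, A, 1.5)
```

The eigenvalues of this pencil are the roots of $\lambda^2 - 5\lambda + 5 = 0$. With the shift $\lambda_0 = 1.5$ the iteration picks out the root $(5 - \sqrt{5})/2$ closest to the shift.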
Chapter 4
4.1 Splitting-Methods
The basic idea of splitting methods is to split the matrix $A$ into two summands, $A = M - N$, and to transform the linear system into a fixed-point equation
\[ M x = N x + b. \]
This leads to the iteration
\[ M x^{(k+1)} = N x^{(k)} + b. \]
(b) The convergence is linear with convergence rate $\rho(M^{-1} N)$. This is typically too slow compared with other methods, so that splitting methods are today mostly used only in the context of preconditioning.
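Example 63 We split $A$ as
\[ A = \underbrace{\begin{bmatrix} 0 & \cdots & 0 \\ * & \ddots & \vdots \\ * & * & 0 \end{bmatrix}}_{=:L} + \underbrace{\begin{bmatrix} * & & 0 \\ & \ddots & \\ 0 & & * \end{bmatrix}}_{=:D} + \underbrace{\begin{bmatrix} 0 & * & * \\ \vdots & \ddots & * \\ 0 & \cdots & 0 \end{bmatrix}}_{=:R}. \]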
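One classical instance of such a splitting, chosen here purely for illustration, is $M = D$, $N = -(L + R)$ (the Jacobi iteration). The sketch below assumes nonzero diagonal entries and uses a small diagonally dominant example for which the iteration converges.

```python
def jacobi(A, b, x0, steps):
    """Splitting iteration M x_new = N x_old + b with M = D, N = -(L + R)."""
    n = len(b)
    x = x0[:]
    for _ in range(steps):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = jacobi(A, b, [0.0, 0.0], 60)   # exact solution is (1/11, 7/11)
```

The error shrinks only by a constant factor per step, which reflects the linear convergence mentioned above.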
4.2 The Conjugate Gradient Method (CG)
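Special case: $A \in \mathbb{R}^{n,n}$ symmetric and positive definite, $b \in \mathbb{R}^n$.
\[ -\nabla \varphi(x) = b - A x. \]
\[ r = b - A x \]
For $r \neq 0$ we have that $\varphi(x + \alpha r) < \varphi(x)$ for some $\alpha > 0$. Thus we can decrease $\varphi$ by choosing the parameter $\alpha$ so that $\varphi(x + \alpha r)$ is minimal.
\[ \alpha = \frac{r^T r}{r^T A r}. \]
Proof: Exercise. □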
1) Start: Choose $x_0 \in \mathbb{R}^n$.
(c) $x_k = x_{k-1} + \alpha_k r_{k-1}$.
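The update $x_k = x_{k-1} + \alpha_k r_{k-1}$ with $\alpha_k = r_{k-1}^T r_{k-1} / (r_{k-1}^T A r_{k-1})$ can be sketched as follows. This is a minimal plain-Python illustration for a small symmetric positive definite matrix. The function name, the stopping tolerance and the test data are assumptions.

```python
def steepest_descent(A, b, x0, steps=200, tol=1e-14):
    """Steepest descent for symmetric positive definite A."""
    n = len(b)
    x = x0[:]
    for _ in range(steps):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        rr = sum(ri * ri for ri in r)
        if rr < tol * tol:                 # residual is (numerically) zero
            break
        Ar = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        alpha = rr / sum(r[i] * Ar[i] for i in range(n))  # exact line search
        x = [x[i] + alpha * r[i] for i in range(n)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = steepest_descent(A, b, [0.0, 0.0])     # exact solution is (1/11, 7/11)
```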
We thus have global convergence for all start vectors. But the method has many disadvantages.
(b) Furthermore, even if $\varphi$ becomes small very quickly, this is not automatically true for the residual.
1) We minimize only in one search direction $r_k$, but we have many more directions than one (namely $r_0, \dots, r_k$).
The first two conditions R1) and R2) together guarantee convergence in at most $n$ steps in exact arithmetic, since we minimize over the whole space $\mathbb{R}^n$.
To compute $p_{k+1}$ and $x_{k+1}$, suppose that the search directions $p_1, \dots, p_k \in \mathbb{R}^n$ and $x_k$ with $\varphi(x_k) = \min_{x \in \mathcal{R}_k} \varphi(x)$ are already computed, and then determine $p_{k+1}$ and $x_{k+1}$ with $\varphi(x_{k+1}) = \min_{x \in \mathcal{R}_{k+1}} \varphi(x)$, such that all three conditions R1)-R3) are satisfied.
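for $y \in \mathbb{R}^k$, $\alpha \in \mathbb{R}$. Then we determine the parameters $y$ and $\alpha$. We have
\begin{align*}
\varphi(x_{k+1}) &= \tfrac{1}{2} (x_0 + P_k y + \alpha p_{k+1})^T A (x_0 + P_k y + \alpha p_{k+1}) - (x_0 + P_k y + \alpha p_{k+1})^T b \\
&= \varphi(x_0 + P_k y) + \alpha p_{k+1}^T A (x_0 + P_k y) - \alpha p_{k+1}^T b + \tfrac{1}{2} \alpha^2 p_{k+1}^T A p_{k+1} \\
&= \underbrace{\varphi(x_0 + P_k y)}_{\text{only } y} + \alpha p_{k+1}^T A P_k y + \underbrace{\tfrac{1}{2} \alpha^2 p_{k+1}^T A p_{k+1} - \alpha p_{k+1}^T r_0}_{\text{only } \alpha}.
\end{align*}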
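If the mixed term were not there, then we could minimize separately over the two variables. Thus we choose $p_{k+1}$ so that
\[ p_{k+1}^T A P_k = 0 \]
and obtain
\[ \min_{x \in \mathcal{R}_{k+1}} \varphi(x) = \underbrace{\min_{y \in \mathbb{R}^k} \varphi(x_0 + P_k y)}_{\text{solution } y = y_k} + \underbrace{\min_{\alpha \in \mathbb{R}} \Bigl( \tfrac{1}{2} \alpha^2 p_{k+1}^T A p_{k+1} - \alpha p_{k+1}^T r_0 \Bigr)}_{\text{solution } \alpha_{k+1} = \frac{p_{k+1}^T r_0}{p_{k+1}^T A p_{k+1}}}. \]
The second minimization is just a scalar minimization and is solved by $\alpha_{k+1} = \dfrac{p_{k+1}^T r_0}{p_{k+1}^T A p_{k+1}}$. Thus we have satisfied conditions R2) and R3).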
Then
\[ p_i^T A p_j = 0, \quad i \neq j, \quad i, j = 1, \dots, k, \]
i.e., $p_1, \dots, p_k$ are orthogonal with respect to the scalar product
\[ \langle x, y \rangle_A := y^T A x. \]
Then the question arises whether we can always find $A$-conjugate search directions.
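Lemma 67 If $r_k = b - A x_k \neq 0$, then there exists $p_{k+1} \in \operatorname{Span}\{A p_1, \dots, A p_k\}^\perp$ with $p_{k+1}^T r_k \neq 0$.
Proof: For $k = 0$ this is clear (choose e.g. $p_1 = r_0$). For $k \ge 1$ and $r_k \neq 0$ it follows that
\[ A^{-1} b \notin \mathcal{R}_k = x_0 + \operatorname{Span}\{p_1, \dots, p_k\}, \]
since $A^{-1} b$ is the unique minimum, which however is not reached yet, since $r_k \neq 0$. Therefore,
\[ b \notin A x_0 + \operatorname{Span}\{A p_1, \dots, A p_k\} \]
or
\[ r_0 = b - A x_0 \notin \operatorname{Span}\{A p_1, \dots, A p_k\}. \]
Thus there exists $p_{k+1} \in \operatorname{Span}\{A p_1, \dots, A p_k\}^\perp$ with $p_{k+1}^T r_0 \neq 0$. Since $x_k \in x_0 + \operatorname{Span}\{p_1, \dots, p_k\}$, we have $r_k = b - A x_k \in r_0 + \operatorname{Span}\{A p_1, \dots, A p_k\}$ and thus $p_{k+1}^T r_k = p_{k+1}^T r_0 \neq 0$. □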
Remark 68 From the proof of Lemma 67 we have the following observation. Since $p^T r_k = p^T r_0$ for $p \in \operatorname{Span}\{A p_1, \dots, A p_k\}^\perp$, we have, in particular, that $p_{k+1}^T r_k = p_{k+1}^T r_0$, and thus
\[ \alpha_{k+1} = \frac{p_{k+1}^T r_0}{p_{k+1}^T A p_{k+1}} = \frac{p_{k+1}^T r_k}{p_{k+1}^T A p_{k+1}}. \]
We then can finally show that also the first requirement R1) is satisfied.
Proof: The matrix $P_k^T A P_k = \operatorname{diag}(p_1^T A p_1, \dots, p_k^T A p_k)$ is invertible, since $A$ is positive definite. Thus $P_k$ has full rank, i.e., the columns $p_1, \dots, p_k$ are linearly independent. □
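1) Start: Choose $x_0 \in \mathbb{R}^n$.
(a) $r_k = b - A x_k$,
(b) If $r_k = 0$ then stop and use $x_k = A^{-1} b$ as solution. Otherwise choose $p_{k+1} \in \operatorname{Span}\{A p_1, \dots, A p_k\}^\perp$ with $p_{k+1}^T r_k \neq 0$ and compute
\[ \alpha_{k+1} = \frac{p_{k+1}^T r_k}{p_{k+1}^T A p_{k+1}}. \]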
4.2.3 The Conjugate Gradient Algorithm, CG
We have seen that the choice of $A$-conjugate search directions has many advantages (an easy computation of $x_{k+1}$ from $x_k$ and a guaranteed convergence in at most $n$ steps in exact arithmetic). On the other hand we would like to keep the advantage of steepest descent that the function $\varphi$ decreases maximally in the direction of the negative gradient, i.e., this is heuristically a good search direction. The idea is then to use the freedom in $p_{k+1}$ to choose that $p_{k+1}$ which is nearest to $r_k$, the direction of the negative gradient, i.e., to choose $p_{k+1}$ so that
\[ \|p_{k+1} - r_k\| = \min_{p \in \operatorname{Span}\{A p_1, \dots, A p_k\}^\perp} \|p - r_k\|. \tag{4.1} \]
At first sight this looks strange, since we wanted to choose directions that allow an easy solution of the optimization problem and here we introduce another optimization problem. We will see now that this optimization problem is easy to solve since it will turn out that $p_{k+1}$ is just a linear combination of $p_k$ and $r_k$.
In the following, under the same assumptions as before, we choose the $A$-conjugate search directions to minimize (4.1) for $k = 0, \dots, m$. Let $P_k = [p_1, \dots, p_k]$ and show then that $p_{k+1} \in \operatorname{Span}\{p_k, r_k\}$.
3) $r_{k+1} \perp r_j$ for $j = 0, \dots, k$,
Proof:
1) Since $x_{k+1} = x_k + \alpha_{k+1} p_{k+1}$, it follows that
\[ r_{k+1} = b - A x_{k+1} = r_k - \alpha_{k+1} A p_{k+1}. \]
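2) Applying 1) inductively we obtain
\[ \operatorname{Span}\{A p_1, \dots, A p_k\} \subseteq \operatorname{Span}\{r_0, \dots, r_k\}, \quad k = 1, \dots, m. \]
We have already shown that for all $k = 0, \dots, m$ we have
\[ p_{k+1} = r_k - A P_k z_k \in \operatorname{Span}\{r_0, \dots, r_k\}. \]
Thus, we have
\[ \operatorname{Span}\{p_1, \dots, p_{k+1}\} \subseteq \operatorname{Span}\{r_0, \dots, r_k\} \]
for $k = 0, \dots, m$. Moreover, with 1) it follows that
\[ r_{k+1} \in \operatorname{Span}\{r_k, A p_{k+1}\} \subseteq \operatorname{Span}\{r_k, A r_0, \dots, A r_k\} \]
for $k = 0, \dots, m$. Therefore,
\begin{align*}
r_1 &\in \operatorname{Span}\{r_0, A r_0\}, \\
r_2 &\in \operatorname{Span}\{r_0, A r_0, A r_1\} \subseteq \operatorname{Span}\{r_0, A r_0, A^2 r_0\}, \\
&\;\;\vdots
\end{align*}
By induction we then have finally
\[ \operatorname{Span}\{p_1, \dots, p_{k+1}\} \subseteq \operatorname{Span}\{r_0, \dots, r_k\} \subseteq \mathcal{K}_{k+1}(A, r_0). \]
Equality follows by a dimension argument.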
3) We show that $P_k^T r_k = 0$, i.e., $p_1, \dots, p_k \perp r_k$ for all $k = 1, \dots, m$. By 2) we then also have $r_0, \dots, r_{k-1} \perp r_k$ as desired. We have $x_k = x_0 + P_k y_k$, where $y_k$ minimizes the function
\begin{align*}
\varphi(x_0 + P_k y) &= \tfrac{1}{2} (x_0 + P_k y)^T A (x_0 + P_k y) - (x_0 + P_k y)^T b \\
&= \varphi(x_0) + y^T P_k^T (A x_0 - b) + \tfrac{1}{2} y^T P_k^T A P_k y.
\end{align*}
The gradient of $y \mapsto \varphi(x_0 + P_k y)$ therefore vanishes for $y = y_k$, i.e.,
\[ P_k^T A P_k y_k + P_k^T (A x_0 - b) = 0. \]
This is equivalent to $0 = P_k^T (b - A x_0 - A P_k y_k) = P_k^T (b - A x_k) = P_k^T r_k$.
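4) If $k = 1$, then by 2) it follows that $p_2 \in \operatorname{Span}\{r_0, r_1\}$. Since $p_1 = r_0$ we then have $p_2 \in \operatorname{Span}\{p_1, r_1\}$. For $k > 1$ we partition $z_k$ from Lemma 70 as
\[ z_k = \begin{bmatrix} w \\ \mu \end{bmatrix}, \quad w \in \mathbb{R}^{k-1}, \ \mu \in \mathbb{R}. \]
By 1) we have $r_k = r_{k-1} - \alpha_k A p_k$, and from Lemma 70 we then obtain that
\begin{align*}
p_{k+1} &= r_k - A P_k z_k \\
&= r_k - A P_{k-1} w - \mu A p_k \\
&= r_k - A P_{k-1} w + \frac{\mu}{\alpha_k} (r_k - r_{k-1}) \\
&= \Bigl( 1 + \frac{\mu}{\alpha_k} \Bigr) r_k + s_k,
\end{align*}
where
\begin{align*}
s_k = -\frac{\mu}{\alpha_k} r_{k-1} - A P_{k-1} w &\in \operatorname{Span}\{r_{k-1}, A P_{k-1} w\} \\
&\subseteq \operatorname{Span}\{r_{k-1}, A p_1, \dots, A p_{k-1}\} \\
&\subseteq \operatorname{Span}\{r_0, \dots, r_{k-1}\}.
\end{align*}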
\[ p_{k+1} = r_k + \beta_k p_k. \]
\[ \beta_k = -\frac{p_k^T A r_k}{p_k^T A p_k}. \]
Hence $p_{k+1}$ can be constructed directly from $p_k$ and $r_k$ without solving a minimization problem.
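Putting the pieces together, a conjugate gradient iteration built from the formulas of this section ($p_1 = r_0$, $\alpha_{k+1} = p_{k+1}^T r_k / (p_{k+1}^T A p_{k+1})$, $r_{k+1} = r_k - \alpha_{k+1} A p_{k+1}$, $p_{k+1} = r_k + \beta_k p_k$ with $\beta_k = -p_k^T A r_k / (p_k^T A p_k)$) might be sketched as follows. This is a plain-Python illustration and not necessarily the exact formulation of the algorithm in the notes. Function names, tolerance and test data are assumptions.

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(A, b, x0):
    """Conjugate gradients for symmetric positive definite A (at most n steps)."""
    n = len(b)
    x = x0[:]
    Ax = matvec(A, x)
    r = [b[i] - Ax[i] for i in range(n)]
    p = r[:]                                   # p_1 = r_0
    residuals = [r]
    for _ in range(n):
        if dot(r, r) ** 0.5 < 1e-14:           # residual already zero
            break
        Ap = matvec(A, p)
        pAp = dot(p, Ap)
        alpha = dot(p, r) / pAp                # step length along p
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        residuals.append(r)
        beta = -dot(p, matvec(A, r)) / pAp     # enforces A-conjugacy
        p = [r[i] + beta * p[i] for i in range(n)]
    return x, residuals

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, residuals = cg(A, b, [0.0, 0.0, 0.0])
final_res = max(abs(bi - yi) for bi, yi in zip(b, matvec(A, x)))
```

After $n = 3$ steps the residual vanishes up to rounding, and the stored residuals are mutually orthogonal, as stated in 3) above.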
Remark 73 There are many theoretical results behind this simple algorithm, for example the convergence after at most $n$ steps in exact arithmetic, since the CG-algorithm is a special case of the algorithm of $A$-conjugate search directions. The iterate $x_k$ satisfies
\[ \varphi(x_k) = \min_{x \in \mathcal{R}_k} \varphi(x), \]
For this reason one calls the CG-algorithm a Krylov space method.
Corollary 76 Let $\tilde\Pi_k := \{p \in \Pi_k \mid p(0) = 1\}$. With the notation and assumptions of Theorem 75 (in particular $r_{k-1} \neq 0$), there exists a unique polynomial $p_k \in \tilde\Pi_k$ with
\[ x_k = x_0 + p_{k-1}(A) r_0 \]
Thus, we have
Then the uniqueness of $p_k$ and the first equality in (4.2) follow from Theorem 75. To prove the inequality in (4.2), let $(v_1, \dots, v_n)$ be an orthonormal basis of eigenvectors of $A$ to the eigenvalues $\lambda_1, \dots, \lambda_n$. Furthermore let $p \in \Pi_k$ and
\[ e_0 = c_1 v_1 + \dots + c_n v_n \quad \text{with } c_1, \dots, c_n \in \mathbb{R}. \]
Then,
\[ p(A) e_0 = c_1 p(\lambda_1) v_1 + \dots + c_n p(\lambda_n) v_n. \]
By the orthogonality of the $v_i$ we obtain
\[ \|e_0\|_A^2 = e_0^T A e_0 = \sum_{i=1}^{n} c_i^2 \lambda_i \]
and
\[ \|p(A) e_0\|_A^2 = \sum_{i=1}^{n} c_i^2 p(\lambda_i)^2 \lambda_i \le \max_{\lambda \in \sigma(A)} p(\lambda)^2 \sum_{i=1}^{n} c_i^2 \lambda_i. \]
Remark 77 1) From Corollary 76 we conclude that the CG algorithm converges fast if $A$ has an appropriate spectrum, i.e., one for which there exist polynomials $p$ with $p(0) = 1$ and small degree such that $|p(\lambda)|$ is small for all $\lambda \in \sigma(A)$. This is e.g. the case if
\[ M^{-1} A x = M^{-1} b \]
\[ C^{-1} A C^{-T} C^T x = C^{-1} b \]
Using the equations $p_1 = r_0$ and $p_i = r_{i-1} + \beta_i p_{i-1}$ for $i = 2, \dots, n$ (see Section 4.2.3) we obtain
\[ R_k = P_k B_k. \]
Then the matrix $R_k^T A R_k$ is tridiagonal, since
\[ R_k^T A R_k = B_k^T P_k^T A P_k B_k = B_k^T \begin{bmatrix} p_1^T A p_1 & & 0 \\ & \ddots & \\ 0 & & p_k^T A p_k \end{bmatrix} B_k. \]
Furthermore, we know from Theorem 75 that $r_0, \dots, r_{k-1}$ are orthogonal and span a Krylov space, i.e., $\dfrac{r_0}{\|r_0\|}, \dots, \dfrac{r_{k-1}}{\|r_{k-1}\|}$ is an orthonormal basis of $\mathcal{K}_k(A, r_0)$.
This leads to an interesting conclusion. If $q_1 := \dfrac{r_0}{\|r_0\|}$ and if $q_1, \dots, q_k$ are the vectors generated by the Lanczos algorithm, then by the implicit Q Theorem
\[ q_j = \pm \frac{r_{j-1}}{\|r_{j-1}\|}, \quad j = 1, \dots, k. \]
Thus, the tridiagonal matrix generated by the Lanczos algorithm is (except for signs and the normalization of the columns of $R_k$) the matrix $R_k^T A R_k$, i.e.,
\[ \text{CG} \;\widehat{=}\; \text{Lanczos} \]
Application: In the course of the CG algorithm we can generate the tridiagonal matrix $R_k^T A R_k$ and obtain information about extremal eigenvalues of $A$ and the condition number $\kappa_2(A) = \dfrac{\lambda_{\max}}{\lambda_{\min}}$.
In Section 4.2.4 we have noticed that certain affine Krylov spaces are good search spaces. This suggests using a Krylov space method again. In the CG algorithm we have used that the solution $x = A^{-1} b$ is the unique minimum of $\varphi(x) = \tfrac{1}{2} x^T A x - x^T b$. This, however, holds in general only if $A \in \mathbb{R}^{n,n}$ is symmetric positive definite.
This means that we have to solve in each step the least-squares problem
\[ \|b - A x\| = \min!, \quad x \in x_0 + \mathcal{K}_k(A, r_0). \]
In Section 4.2.5 we have seen that the CG algorithm corresponds to the Lanczos algorithm. We expect that in the general case
\[ \text{GMRES} \;\widehat{=}\; \text{Arnoldi} \]
After $k$ steps of the Arnoldi algorithm (without breakdown) we have the Arnoldi relation
\[ A Q_k = Q_{k+1} H_{k+1,k} \]
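with $Q_k = [q_1, \dots, q_k]$, $Q_{k+1} = [Q_k, q_{k+1}]$ isometric and
\[ H_{k+1,k} = \begin{bmatrix}
h_{11} & \dots & \dots & h_{1k} \\
h_{21} & \ddots & & \vdots \\
0 & \ddots & \ddots & \vdots \\
\vdots & \ddots & h_{k,k-1} & h_{kk} \\
0 & \dots & 0 & h_{k+1,k}
\end{bmatrix} \in \mathbb{C}^{k+1,k}. \]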
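If $q_1 = \dfrac{r_0}{\|r_0\|}$, then $\operatorname{Span}\{q_1, \dots, q_k\} = \mathcal{K}_k(A, r_0)$. Let $x \in x_0 + \mathcal{K}_k(A, r_0)$, i.e., $x = x_0 + Q_k y$ for a $y \in \mathbb{C}^k$. Then
\begin{align*}
\|b - A x\| &= \|b - A (x_0 + Q_k y)\| \\
&= \|r_0 - A Q_k y\| \\
&= \|r_0 - Q_{k+1} H_{k+1,k} y\| \\
&= \|Q_{k+1}^H r_0 - H_{k+1,k} y\| \quad \text{since } Q_{k+1} \text{ is isometric,} \\
&= \bigl\| \|r_0\| e_1 - H_{k+1,k} y \bigr\| \quad \text{since } q_2, \dots, q_{k+1} \perp q_1 = \frac{r_0}{\|r_0\|}. \tag{4.4}
\end{align*}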
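Reminder. For the solution of least-squares problems $\|c - M y\| = \min!$ with $M \in \mathbb{C}^{n,k}$, $k \le n$, we may
1) compute the QR decomposition of $M$,
\[ M = Q R, \quad Q \in \mathbb{C}^{n,n} \text{ unitary}, \quad R = \begin{bmatrix} R_1 \\ 0 \end{bmatrix}. \]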
Algorithm (GMRES)
For $A \in \mathbb{C}^{n,n}$ invertible, $b \in \mathbb{C}^n$, and a starting vector $x_0 \in \mathbb{C}^n$, the algorithm computes the solution $x = A^{-1} b$ of $A x = b$.
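1) Start: $r_0 = b - A x_0$, $h_{10} = \|r_0\|$.
a) $q_k = \dfrac{r_{k-1}}{h_{k,k-1}}$,
b) $r_k = A q_k - \sum_{j=1}^{k} h_{jk} q_j$ with $h_{jk} = q_j^H A q_k$,
c) $h_{k+1,k} = \|r_k\|$,
d) Determine $y_k$ such that $\bigl\| \|r_0\| e_1 - H_{k+1,k} y_k \bigr\|$ is minimal,
e) $x_k = x_0 + Q_k y_k$.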
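The steps a) to e) can be sketched in plain Python for a small real system. The version below is an illustration under several assumptions: it runs a fixed number $m$ of Arnoldi steps and solves the least-squares problem only once at the end, it computes the $h_{jk}$ by modified Gram-Schmidt (which agrees with $q_j^H A q_k$ in exact arithmetic), it solves the small least-squares problem by Givens rotations, and it assumes that no breakdown occurs before step $m$. The function name and test data are hypothetical.

```python
import math

def gmres(A, b, x0, m):
    """m steps of GMRES for a real n x n system (no restarts, no breakdown)."""
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(s * t for s, t in zip(u, v))
    Ax0 = matvec(x0)
    r = [b[i] - Ax0[i] for i in range(n)]
    beta = math.sqrt(dot(r, r))                 # h_{10} = ||r_0||
    Q, H = [], [[0.0] * m for _ in range(m + 1)]
    h_sub = beta
    for k in range(m):
        Q.append([ri / h_sub for ri in r])      # a) q_k = r_{k-1} / h_{k,k-1}
        r = matvec(Q[k])                        # b) orthogonalize A q_k
        for j in range(k + 1):
            H[j][k] = dot(Q[j], r)
            r = [r[i] - H[j][k] * Q[j][i] for i in range(n)]
        h_sub = math.sqrt(dot(r, r))            # c) h_{k+1,k} = ||r_k||
        H[k + 1][k] = h_sub
    # d) minimize || beta*e_1 - H y || by Givens rotations
    g = [beta] + [0.0] * m
    for j in range(m):
        d = math.hypot(H[j][j], H[j + 1][j])
        cs, sn = H[j][j] / d, H[j + 1][j] / d
        for col in range(j, m):
            t1, t2 = H[j][col], H[j + 1][col]
            H[j][col], H[j + 1][col] = cs * t1 + sn * t2, -sn * t1 + cs * t2
        g[j], g[j + 1] = cs * g[j] + sn * g[j + 1], -sn * g[j] + cs * g[j + 1]
    y = [0.0] * m
    for i in reversed(range(m)):                # back substitution
        y[i] = (g[i] - sum(H[i][j] * y[j] for j in range(i + 1, m))) / H[i][i]
    # e) x_m = x_0 + Q_m y_m
    return [x0[i] + sum(Q[k][i] * y[k] for k in range(m)) for i in range(n)]

A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 2.0, 4.0]]
b = [1.0, 1.0, 0.0]
x = gmres(A, b, [0.0, 0.0, 0.0], 3)
res = max(abs(b[i] - sum(A[i][j] * x[j] for j in range(3))) for i in range(3))
```

With $m = n = 3$ the Krylov space is the whole space in this example, so the computed `x` solves the system up to rounding.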
Remark 78 Like the CG algorithm, GMRES can also be analyzed via polynomial approximation in $\tilde\Pi_k = \{p \in \Pi_k \mid p(0) = 1\}$.
\[ x = x_0 + p(A) r_0 \quad \text{for } p \in \Pi_{k-1}, \]
Determine $p \in \tilde\Pi_k$ such that $\|p(A) r_0\|$ is minimal.
If $p_k \in \tilde\Pi_k$ is such that $r_k = p_k(A) r_0$, then
\[ \|r_k\| \le \|p(A) r_0\| \]
for all $p \in \tilde\Pi_k$.
and, furthermore,
\[ \|p(\Lambda)\| = \max_{\lambda \in \sigma(A)} |p(\lambda)|, \]
2) $\kappa(V)$ is small, i.e., if $A$ is not too far from a normal matrix (since for a normal matrix $V$ can be chosen unitary, i.e., with condition number 1).
Remark 80 Convergence acceleration can again be achieved via preconditioning, i.e., instead of $A x = b$ we solve $M^{-1} A x = M^{-1} b$, where $M y = c$ is easy to solve and $M$ is chosen such that the spectrum is appropriate.
Remark 81 Other methods for the solution of $A x = b$, $A$ invertible with $A \neq A^H$:

                  $A x = \lambda x$          $A x = b$
  $A = A^H$       Lanczos                    CG
  $A \neq A^H$    Arnoldi                    GMRES
                  (nonsymmetric) Lanczos     BiCG

But none of the methods is really efficient without preconditioning. Finding a good preconditioner depends very much on the problem, and usually it has to be chosen based on knowledge about the background of the problem.