Notes On Some Methods For Solving Linear Systems

Dianne P. O'Leary, 1983 and 1999 and 2007

September 25, 2007
When the matrix $A$ is symmetric and positive definite, we have a whole new class of algorithms for solving $Ax = b$.
1 The Steepest Descent Algorithm
Recall from calculus that the gradient, $\nabla f(x)$, is the direction in which the function $f$ is most rapidly increasing, and $-\nabla f(x)$ is the direction of steepest descent. Thus, if we want to minimize $f$, we might think of taking a guess at $x$, evaluating the gradient, and taking a step in the opposite direction until the function stops decreasing. Then we can repeat the process. This gives the following algorithm.
1. Pick $x_0$.
2. For $k = 0, 1, \ldots$,
   (a) Evaluate $p_k = -\nabla f(x_k) = r_k$.
   (b) Let $x_{k+1} = x_k + \alpha_k p_k$, where $\alpha_k$ is the minimizer of $\min_\alpha f(x_k + \alpha p_k)$.
   End For.
To visualize the algorithm, picture an elliptical valley surrounded by mountains. Level surfaces of the terrain are shown in Figure 1, as they might appear on a topographical map. If a person is at point $x_0$ in the fog and wants to reach the pit of the valley, she might follow an algorithm of picking the direction of steepest descent, following the straight path until it starts to rise, and then picking the new steepest descent direction. In that case, she follows the zigzag path indicated in the figure. (See how relevant numerical analysis can be in real life?)
We can find an analytic formula for $\alpha_k$. For fixed $x_k$ and $p_k$,
$$f(x_k + \alpha p_k) = \frac{1}{2}(x_k + \alpha p_k)^T A (x_k + \alpha p_k) - (x_k + \alpha p_k)^T b$$
$$= \frac{1}{2}\alpha^2\, p_k^T A p_k + \alpha\, p_k^T A x_k - \alpha\, p_k^T b + \text{constant} .$$
The minimum of $f$ with respect to $\alpha$ occurs when the derivative is zero:
$$p_k^T A x_k + \alpha\, p_k^T A p_k - p_k^T b = 0 \qquad (2)$$
so
$$\alpha = -\frac{p_k^T (A x_k - b)}{p_k^T A p_k} = \frac{p_k^T r_k}{p_k^T A p_k} \qquad (3)$$
So, to perform the minimization along a line, we set
$$\alpha_k = \frac{p_k^T r_k}{p_k^T A p_k} = \frac{r_k^T r_k}{p_k^T A p_k} .$$
(See the appendix for the proof of equivalence of the two expressions for $\alpha$.)
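As a concrete illustration (not part of the original notes), here is a minimal NumPy sketch of the steepest descent algorithm with this exact line search; the matrix A, right-hand side b, and stopping tolerance are made-up assumptions for the example.

import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    # Minimize f(x) = 0.5 x^T A x - x^T b for symmetric positive definite A.
    x = x0.copy()
    r = b - A @ x                        # r_k = -grad f(x_k)
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        p = r                            # steepest descent direction p_k = r_k
        alpha = (r @ r) / (p @ (A @ p))  # alpha_k = r_k^T r_k / p_k^T A p_k
        x = x + alpha * p
        r = r - alpha * (A @ p)          # equivalent to recomputing b - A x
    return x

# Hypothetical example data:
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = steepest_descent(A, b, np.zeros(2))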
Let
$$E(x) = \frac{1}{2}(x - x^*)^T A (x - x^*) .$$
This function also is minimized when $x = x^*$, and it can be shown that the steepest descent iterates satisfy
$$E(x_k) \le \left(\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}\right)^{2k} E(x_0) ,$$
where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of $A$. (Try to interpret this result in terms of the condition number of $A$ in the 2-norm, the ratio of the largest to smallest eigenvalue. Which matrices will show fast convergence?)
Figure 1: Level curves (contour plot) for a quadratic function of two variables, with the path of the steepest descent algorithm marked on it. After 20 iterations, the error has been reduced by a factor of $10^5$. Conjugate gradients would step from the initial iterate to the next, and then to the minimizer.
2 The Conjugate Direction Algorithm
As we can see, the steepest descent algorithm is often far too slow. We will now
develop an algorithm that only takes n steps. It is based on a very simple idea.
Suppose we had $n$ linearly independent vectors $p_k$, $k = 0, 1, \ldots, n-1$, with the property
$$p_k^T A p_j = 0 , \quad k \ne j .$$
(If $A = I$, this is just orthogonality. For a general symmetric $A$, it is called $A$-conjugacy.) Since there are $n$ vectors, and they are linearly independent, they form a basis, and we can express any vector as a linear combination of them; for example,
$$x^* - x_0 = \sum_{j=0}^{n-1} \alpha_j p_j .$$
Let's multiply each side of this equation by $p_k^T A$ for each $k$. On the left hand side we have
$$p_k^T A (x^* - x_0) = p_k^T (b - A x_0) = p_k^T r_0 ,$$
and on the right we have
$$p_k^T A \sum_{j=0}^{n-1} \alpha_j p_j = \alpha_k\, p_k^T A p_k .$$
Therefore,
$$p_k^T r_0 = \alpha_k\, p_k^T A p_k \quad \text{and} \quad \alpha_k = \frac{p_k^T r_0}{p_k^T A p_k} .$$
So we have a new algorithm for solving $Ax = b$:
1. Pick $x_0$ and $A$-conjugate directions $p_k$, $k = 0, 1, \ldots, n-1$.
2. For $k = 0, 1, \ldots, n-1$,
   (a) Set $\alpha_k = \dfrac{p_k^T r_0}{p_k^T A p_k}$.
   (b) Let $x_{k+1} = x_k + \alpha_k p_k$.
   End For.
Then $x_n = x^*$, since the steps $\alpha_k p_k$ add up exactly the expansion of $x^* - x_0$ in the basis of directions.

Where do the $A$-conjugate directions come from? Given linearly independent vectors $v_0, v_1, \ldots, v_{n-1}$, we can construct them by a Gram-Schmidt process in the $A$-inner product:
1. Let $p_0 = v_0$.
2. For $k = 0, 1, \ldots, n-2$,
   $$p_{k+1} = v_{k+1} - \sum_{j=0}^{k} \frac{p_j^T A v_{k+1}}{p_j^T A p_j}\, p_j$$
   End For.
It is more numerically stable to implement this last equation iteratively, substituting $p_{k+1}$ for $v_{k+1}$ after $j = 0$ (Modified Gram-Schmidt algorithm); a code sketch follows the listing:
1. Let $p_{k+1} = v_{k+1}$.
2. For $j = 0, 1, \ldots, k$,
   $$p_{k+1} = p_{k+1} - \frac{p_j^T A p_{k+1}}{p_j^T A p_j}\, p_j$$
   End For.
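To make the construction concrete, here is a sketch (not from the notes) that builds A-conjugate directions by the modified Gram-Schmidt process above and then runs the conjugate direction algorithm; the function names, the matrix A, the right-hand side b, and the choice of coordinate vectors for the v_k are all illustrative assumptions.

import numpy as np

def a_conjugate_directions(A, V):
    # Modified Gram-Schmidt in the A-inner product: the columns of P satisfy
    # p_k^T A p_j = 0 for k != j, built from the linearly independent columns of V.
    n = V.shape[1]
    P = np.zeros_like(V, dtype=float)
    for k in range(n):
        p = V[:, k].astype(float).copy()
        for j in range(k):
            pj = P[:, j]
            p = p - (pj @ (A @ p)) / (pj @ (A @ pj)) * pj   # project out p_j
        P[:, k] = p
    return P

def conjugate_direction_solve(A, b, x0, P):
    # x_{k+1} = x_k + alpha_k p_k with alpha_k = p_k^T r_0 / p_k^T A p_k.
    x = x0.copy()
    r0 = b - A @ x0
    for k in range(P.shape[1]):
        p = P[:, k]
        alpha = (p @ r0) / (p @ (A @ p))
        x = x + alpha * p
    return x                              # after n steps, x is the solution

# Hypothetical example data, with the v_k taken to be the coordinate vectors:
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
P = a_conjugate_directions(A, np.eye(2))
x = conjugate_direction_solve(A, b, np.zeros(2), P)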
3 The Conjugate Gradient Algorithm
The conjugate gradient algorithm is a special case of the conjugate direction algorithm. In this case, we intertwine the calculation of the new $x$ vector and the new $p$ vector. In fact, the set of linearly independent vectors $v_k$ we use in the Gram-Schmidt process is just the set of residuals $r_k$. The algorithm is as follows:
1. Let $x_0$ be an initial guess.
   Let $r_0 = b - A x_0$ and $p_0 = r_0$.
2. For $k = 0, 1, 2, \ldots,$ until convergence,
   (a) Compute the search parameter $\alpha_k$ and the new iterate and residual:
       $$\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k} , \qquad x_{k+1} = x_k + \alpha_k p_k , \qquad r_{k+1} = r_k - \alpha_k A p_k ,$$
   (b) Compute the new search direction $p_{k+1}$ by Gram-Schmidt on $r_{k+1}$ and the previous $p$ vectors to make $p_{k+1}$ $A$-conjugate to the previous directions.
   End For.
Note that the first step is a steepest descent step, and that in Figure 1, the sequence of points is $x_0$, $x_1$, and $x^*$.
In this form, the algorithm is a lengthy process, particularly the Gram-Schmidt phase. We can shortcut in two places, though. In the current form we need two matrix multiplications per iteration: $A p_k$ for $\alpha_k$ and $A x_{k+1}$ for $r_{k+1}$. But note that
$$r_{k+1} = b - A x_{k+1} = b - A(x_k + \alpha_k p_k) = r_k - \alpha_k A p_k ,$$
so we actually need only one matrix multiplication.
The second shortcut is really surprising. It turns out that
$$p_j^T A r_{k+1} = 0 , \quad j < k ,$$
so the Gram-Schmidt formula (with $v_{k+1}$ replaced by $r_{k+1}$) reduces to
$$p_{k+1} = r_{k+1} - \frac{p_k^T A r_{k+1}}{p_k^T A p_k}\, p_k ,$$
which is very little work!
So here is the practical form of the conjugate gradient algorithm.
1. Let $x_0$ be an initial guess.
   Let $r_0 = b - A x_0$ and $p_0 = r_0$.
2. For $k = 0, 1, 2, \ldots,$ until convergence,
   (a) Compute the search parameter $\alpha_k$ and the new iterate and residual:
       $$\alpha_k = \frac{p_k^T r_k}{p_k^T A p_k} \quad \left(\text{or, equivalently, } \frac{r_k^T r_k}{p_k^T A p_k}\right) ,$$
       $$x_{k+1} = x_k + \alpha_k p_k , \qquad r_{k+1} = r_k - \alpha_k A p_k ,$$
   (b) Compute the new search direction:
       $$\beta_k = -\frac{p_k^T A r_{k+1}}{p_k^T A p_k} \quad \left(\text{or, equivalently, } \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}\right) ,$$
       $$p_{k+1} = r_{k+1} + \beta_k p_k ,$$
   End For.
And after $K \le n$ steps, the algorithm terminates with $r_K = 0$ and $x_K = x^*$. The number $K$ is bounded above by the number of distinct eigenvalues of $A$. Not only does this algorithm terminate in a finite number of steps, a definite advantage over steepest descent, but its error on each step has a better bound:
$$E(x_k) \le \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{2k} E(x_0) ,$$
where $\kappa = \lambda_{\max}/\lambda_{\min}$. So, even as an iterative method, without running a full $K$ steps, conjugate gradients converges faster.
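Here is a minimal sketch (not from the notes) of the practical algorithm in NumPy; the stopping test on the residual norm and the example data are assumptions, and in exact arithmetic the loop would terminate within n iterations as described above.

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rTr = r @ r
    for _ in range(max_iter if max_iter is not None else len(b)):
        if np.sqrt(rTr) < tol:
            break
        Ap = A @ p                 # the single matrix-vector product per iteration
        alpha = rTr / (p @ Ap)     # alpha_k = r_k^T r_k / p_k^T A p_k
        x = x + alpha * p
        r = r - alpha * Ap         # r_{k+1} = r_k - alpha_k A p_k
        rTr_new = r @ r
        beta = rTr_new / rTr       # beta_k = r_{k+1}^T r_{k+1} / r_k^T r_k
        p = r + beta * p           # p_{k+1} = r_{k+1} + beta_k p_k
        rTr = rTr_new
    return x

# Hypothetical example data:
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b, np.zeros(2))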
4 Preconditioned Conjugate Gradients
Consider the problem
$$M^{-1/2} A M^{-1/2} \hat{x} = M^{-1/2} b ,$$
where $M$ is symmetric positive definite. Then $x = M^{-1/2} \hat{x}$ solves our original problem $Ax = b$. Applying conjugate gradients to this transformed problem (with hats marking the transformed quantities) gives:
1. Let $\hat{x}_0$ be an initial guess.
   Let $\hat{r}_0 = M^{-1/2} b - M^{-1/2} A M^{-1/2} \hat{x}_0$ and $\hat{p}_0 = \hat{r}_0$.
2. For $k = 0, 1, 2, \ldots,$ until convergence,
   (a) Compute the search parameter $\hat{\alpha}_k$ and the new iterate and residual:
       $$\hat{\alpha}_k = \frac{\hat{r}_k^T \hat{r}_k}{\hat{p}_k^T M^{-1/2} A M^{-1/2} \hat{p}_k} , \qquad \hat{x}_{k+1} = \hat{x}_k + \hat{\alpha}_k \hat{p}_k , \qquad \hat{r}_{k+1} = \hat{r}_k - \hat{\alpha}_k M^{-1/2} A M^{-1/2} \hat{p}_k ,$$
   (b) Compute the new search direction:
       $$\hat{\beta}_k = \frac{\hat{r}_{k+1}^T \hat{r}_{k+1}}{\hat{r}_k^T \hat{r}_k} , \qquad \hat{p}_{k+1} = \hat{r}_{k+1} + \hat{\beta}_k \hat{p}_k ,$$
   End For.
Now let's return to the original coordinate system. Let $M^{1/2} \hat{r} = r$, $x = M^{-1/2} \hat{x}$, and $p = M^{-1/2} \hat{p}$. Then the algorithm becomes the following (a code sketch appears after the listing):
1. Let $x_0$ be an initial guess.
   Let $r_0 = b - A x_0$ and $p_0 = M^{-1} r_0$.
2. For $k = 0, 1, 2, \ldots,$ until convergence,
   (a) Compute the search parameter $\alpha_k$ and the new iterate and residual:
       $$\alpha_k = \frac{r_k^T M^{-1} r_k}{p_k^T A p_k} , \qquad x_{k+1} = x_k + \alpha_k p_k , \qquad r_{k+1} = r_k - \alpha_k A p_k ,$$
   (b) Compute the new search direction:
       $$\beta_k = \frac{r_{k+1}^T M^{-1} r_{k+1}}{r_k^T M^{-1} r_k} , \qquad p_{k+1} = M^{-1} r_{k+1} + \beta_k p_k ,$$
   End For.
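The following sketch (not from the notes) implements the preconditioned algorithm above, with the preconditioner entering only through a user-supplied routine apply_Minv(v) that returns M^{-1} v; the function and parameter names and the Jacobi choice M = diag(A) in the example are assumptions for illustration.

import numpy as np

def preconditioned_cg(A, b, x0, apply_Minv, tol=1e-10, max_iter=None):
    x = x0.copy()
    r = b - A @ x
    z = apply_Minv(r)              # z_k = M^{-1} r_k
    p = z.copy()
    rTz = r @ z
    for _ in range(max_iter if max_iter is not None else len(b)):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = rTz / (p @ Ap)     # alpha_k = r_k^T M^{-1} r_k / p_k^T A p_k
        x = x + alpha * p
        r = r - alpha * Ap
        z = apply_Minv(r)
        rTz_new = r @ z
        beta = rTz_new / rTz       # beta_k = r_{k+1}^T M^{-1} r_{k+1} / r_k^T M^{-1} r_k
        p = z + beta * p           # p_{k+1} = M^{-1} r_{k+1} + beta_k p_k
        rTz = rTz_new
    return x

# Hypothetical example with the Jacobi (diagonal) preconditioner M = diag(A):
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = preconditioned_cg(A, b, np.zeros(3), lambda v: v / np.diag(A))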
We choose the symmetric positive definite matrix $M$ so that $M^{-1/2} A M^{-1/2}$ has better eigenvalue properties, and so that it is easy to apply the operator $M^{-1}$.
For fast iterations, we want to be able to apply $M^{-1}$ very quickly.
To make the number of iterations small, we want $M^{-1}$ to be an approximate inverse of $A$.
Some common choices of the preconditioning matrix M:
M = the diagonal of A.
M = a banded piece of A.
M = an incomplete factorization of A, leaving out inconvenient elements.
M = a related matrix; e.g., if A is a discretization of a differential operator,
M might be a discretization of a related operator that is easier to solve.
M might be the matrix from our favorite stationary iterative method
(SIM).
That last choice could use a little explanation. Consider your favorite stationary iterative method (Jacobi, Gauss-Seidel, SOR, etc.). It can be derived by taking the equation $Ax = b$, splitting $A$ into two pieces $A = M - N$, and writing $Mx = Nx + b$. The iteration then becomes
$$M x_{k+1} = N x_k + b$$
or
$$x_{k+1} = M^{-1} N x_k + M^{-1} b .$$
Manipulating this a bit, we get
$$\begin{aligned} x_{k+1} &= x_k + (M^{-1} N - I) x_k + M^{-1} b \\ &= x_k + M^{-1}(N - M) x_k + M^{-1} b \\ &= x_k + M^{-1}(b - A x_k) \\ &= x_k + M^{-1} r_k . \end{aligned}$$
The matrix M that determines the multiple of the residual that we add on to
x becomes the conjugate gradient preconditioner.
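For comparison (not from the notes), here is a small sketch of the stationary iteration in the form just derived, x_{k+1} = x_k + M^{-1} r_k; the Jacobi splitting M = diag(A) used in the example is an illustrative assumption (it is also symmetric positive definite, so the same apply_Minv routine could be passed to the preconditioned conjugate gradient sketch above).

import numpy as np

def stationary_iteration(A, b, x0, apply_Minv, num_sweeps=50):
    # One sweep of the splitting method A = M - N is x <- x + M^{-1}(b - A x).
    x = x0.copy()
    for _ in range(num_sweeps):
        r = b - A @ x
        x = x + apply_Minv(r)      # x_{k+1} = x_k + M^{-1} r_k
    return x

# Hypothetical example with the Jacobi splitting M = diag(A), N = M - A:
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = stationary_iteration(A, b, np.zeros(2), lambda v: v / np.diag(A))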
5 Appendix: Algebra of Conjugate Gradients
In this appendix, we establish the Krylov subspace property of conjugate gradients and the equivalence of the alternate formulas for $\alpha$ and $\beta$.
Let $p_0 = r_0 = b - A x_0$. Then we have already established the following four relations:
$$r_{k+1} = r_k - \alpha_k A p_k , \qquad (4)$$
$$p_{k+1} = r_{k+1} + \beta_k p_k , \qquad (5)$$
$$\alpha_k = \frac{r_k^T p_k}{p_k^T A p_k} , \qquad (6)$$
$$\beta_k = -\frac{r_{k+1}^T A p_k}{p_k^T A p_k} . \qquad (7)$$
In this appendix we establish nine more.
The next two relations lead us to the alternate formula for $\alpha$. First,
$$p_k^T r_{k+1} = 0 \qquad (8)$$
since
$$p_k^T r_{k+1} = p_k^T r_k - \alpha_k p_k^T A p_k \quad \text{by (4)}$$
$$= 0 \quad \text{by (6)} .$$
Next,
$$r_k^T r_k = r_k^T p_k \qquad (9)$$
since it is true for $k = 0$, and if we assume it true for $k$ then
$$r_{k+1}^T p_{k+1} = r_{k+1}^T r_{k+1} + \beta_k r_{k+1}^T p_k \quad \text{by (5)}$$
$$= r_{k+1}^T r_{k+1} \quad \text{by (8)} .$$
Therefore,
$$\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k} .$$
Now we aim for the alternate formula for $\beta$. We have that
$$p_{k+1}^T A p_k = 0 \qquad (10)$$
since
$$p_{k+1}^T A p_k = r_{k+1}^T A p_k + \beta_k p_k^T A p_k \quad \text{by (5)}$$
$$= 0 \quad \text{by (7)} .$$
The next two relations,
$$r_k^T p_j = 0 , \quad k > j , \qquad (11)$$
$$p_k^T A p_j = 0 , \quad k \ne j , \qquad (12)$$
are established together. For $k, j = 0, 1$, they are true by (8) and (10). Assume that they are true for indices less than or equal to $k$. Then by (4),
that they are true for indices less than or equal to k. Then by (4),
r
k+1
T
p
j
= r
T
k
p
j
k
p
T
k
Ap
j
= 0 , (13)
where the last equality follows from the induction hypothesis if j < k and from
(8) if j = k. Therefore,
$$p_{k+1}^T A p_j = r_{k+1}^T A p_j + \beta_k p_k^T A p_j \quad \text{by (5)}$$
$$= r_{k+1}^T \frac{r_j - r_{j+1}}{\alpha_j} + \beta_k p_k^T A p_j \quad \text{by (4)}$$
$$= r_{k+1}^T \frac{\beta_j p_j - p_{j+1} + p_j - \beta_{j-1} p_{j-1}}{\alpha_j} + \beta_k p_k^T A p_j \quad \text{by (5)}$$
$$= 0 \text{ if } j < k \text{, by (13) and the induction hypothesis,}$$
$$= 0 \text{ if } j = k \text{, by (10)} .$$
The next relation that we need is
$$r_k^T r_j = 0 , \quad k \ne j . \qquad (14)$$
We can assume that $k > j$. Now, if $j = 0$, $r_k^T r_j = r_k^T p_0 = 0$ by (11). If $j > 0$, then
$$r_k^T r_j = r_k^T p_j - \beta_{j-1} r_k^T p_{j-1} \quad \text{by (5)}$$
$$= 0 \quad \text{by (11)} ,$$
and this establishes (14). Now we work with $\beta$:
$$\beta_k = -\frac{r_{k+1}^T A p_k}{p_k^T A p_k} \quad \text{by (7)}$$
$$= -\frac{r_{k+1}^T (r_k - r_{k+1})}{\alpha_k\, p_k^T A p_k} \quad \text{by (4)}$$
$$= -\frac{r_{k+1}^T (r_k - r_{k+1})}{r_k^T p_k} \quad \text{by (6)}$$
$$= +\frac{r_{k+1}^T r_{k+1}}{r_k^T p_k} \quad \text{by (14)} .$$
Therefore, by (9),
$$\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k} . \qquad (15)$$
Finally, we note that if sp denotes the subspace spanned by a set of vectors, then
$$\text{sp}\{p_0, p_1, \ldots, p_k\} = \text{sp}\{r_0, A r_0, \ldots, A^k r_0\} = \text{sp}\{r_0, r_1, \ldots, r_k\} \qquad (16)$$
since $p_{k+1} \in \text{sp}\{r_{k+1}, p_k\}$ by (5) and $r_{k+1} \in \text{sp}\{r_k, A p_k\}$ by (4). This shows that conjugate gradients is a Krylov subspace method. In fact, it is characterized by the fact that $x_k$ minimizes $E(x)$ over all vectors $x$ with $x - x_0 \in \text{sp}\{r_0, A r_0, \ldots, A^{k-1} r_0\}$.
6 References
The original paper on conjugate gradients:
M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving
Linear Systems, J. Res. Natl. Bur. Standards 49 (1952) pp. 409-436.
A clear exposition of the algorithm (without preconditioning):
David G. Luenberger, Linear and Nonlinear Programming, Addison Wesley, 2nd
edition (1984).
These notes parallel Luenberger's development in many ways.