Applied Numerical Linear Algebra. Lecture 5
1 / 52
Improving the Accuracy of a Solution
2 / 52
Improving the Accuracy of a Solution
Iterative refinement repeats the following steps until x_{i+1} is accurate enough:

    r = A x_i − b
    solve A d = r for d
    x_{i+1} = x_i − d
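As an illustration only (a minimal NumPy/SciPy sketch, not the LAPACK implementation), the factorization and the solves for d below are done in single precision, while the residual r is accumulated in double precision, as the theorem below assumes:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def iterative_refinement(A, b, steps=5):
        A32 = A.astype(np.float32)                 # working-precision copy of A
        lu, piv = lu_factor(A32)                   # one O(n^3) factorization
        x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
        for _ in range(steps):
            r = A @ x - b                          # residual in double precision
            d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
            x = x - d                              # x_{i+1} = x_i - d
        return x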
3 / 52
THEOREM 2.7. Suppose that r is computed in double precision and
k(A) · ε < c ≡ 1/(3n³g + 1) < 1, where n is the dimension of A and
g is the pivot growth factor. Then repeated iterative refinement
converges with

    ||x_i − A^{-1}b||_∞ / ||A^{-1}b||_∞ = O(ε).
Note that the condition number does not appear in the final error
bound. This means that we compute the answer accurately
independent of the condition number, provided that k(A)ε is
sufficiently less than 1. (In practice, c is too conservative an upper
bound, and the algorithm often succeeds even when k(A)ε is
greater than c.)
4 / 52
Sketch of Proof.
Here we denote || · ||_∞ simply by || · ||. Our goal is to show that

    ||x_{i+1} − x|| ≤ (k(A)·ε/c) · ||x_i − x|| ≡ ζ · ||x_i − x||.
By assumption, ζ < 1, so this inequality implies that the error ||x_{i+1} − x||
decreases monotonically to zero. (In practice it will not decrease all the
way to zero because of rounding error in the assignment x_{i+1} = x_i − d,
which we are ignoring.)
We begin by estimating the error in the computed residual r. We get
r = fl(A x_i − b) = A x_i − b + f, where
|f| ≤ nε²(|A| · |x_i| + |b|) + ε|A x_i − b| ≈ ε|A x_i − b|. The ε² term comes
from the double precision computation of r, and the ε term comes from
rounding the double precision result back to single precision. Since
ε² ≪ ε, we will neglect the ε² term in the bound on |f|.
Next we get (A + δA)d = r, where from bound (2.11) we know that
||δA|| ≤ γ · ε · ||A||, where γ = 3n³g, although this is usually much too
large. As mentioned earlier, we simplify matters by assuming
x_{i+1} = x_i − d exactly.
5 / 52
Continuing to ignore all ε² terms, we get

    d = (A + δA)^{-1} r
      = (A + δA)^{-1}(A x_i − b + f)
      ≈ (A^{-1} − A^{-1} δA A^{-1})(A x_i − b + f)
      ≈ A^{-1}(A x_i − b) − A^{-1} δA A^{-1}(A x_i − b) + A^{-1} f
      = (x_i − x) − A^{-1} δA (x_i − x) + A^{-1} f.
6 / 52
Therefore x_{i+1} − x = x_i − d − x = A^{-1} δA (x_i − x) − A^{-1} f, and so

    ||x_{i+1} − x|| ≤ ||A^{-1}|| · (||δA|| + ε·||A||) · ||x_i − x||
                   ≤ ||A^{-1}|| · ||A|| · ε (γ + 1) · ||x_i − x||,

so if
ζ = ||A^{-1}|| · ||A|| · ε(γ + 1) = k(A)ε/c < 1,
then we have convergence.
7 / 52
Single Precision Iterative Refinement
8 / 52
For a proof, see
N. J. Higham. Accuracy and Stability of Numerical Algorithms.
SIAM, Philadelphia, PA, 1996.
M. Arioli, J. Demmel, and I. S. Duff. Solving sparse linear systems
with sparse backward error. SIAM J. Matrix Anal. Appl.,
10:165-190, 1989.
R. D. Skeel. Scaling for numerical stability in Gaussian elimination.
Journal of the ACM, 26:494-526, 1979.
R. D. Skeel. Iterative refinement implies numerical stability for
Gaussian elimination. Math. Comp., 35:817-832, 1980.
R. D. Skeel. Effect of equilibration on residual size for partial
pivoting. SIAM J. Numer. Anal., 18:449-454, 1981.
Single precision iterative refinement and the error bound (2.14) are
implemented in LAPACK routines like sgesvx.
9 / 52
Equilibration
11 / 52
These shared-memory parallel computers, for which LAPACK is designed,
include the Sun SPARCcenter 2000 [SPARCcenter 2000
architecture and implementation. Sun Microsystems, Inc., November
1993. Technical White Paper.];
SGI Power Challenge [SGI Power Challenge. Technical Report, Silicon
Graphics, 1995.];
DEC AlphaServer 8400 [D. M. Fenwick, D. J. Foley, W. B. Gist, S. R.
VanDoren, and D. Wissel. The AlphaServer 8000 series: High-end server
platform development. Digital Technical Journal, 7:43-65, 1995.];
and Cray C90/J90 [The Cray C90 series.
http://www.cray.com/PUBLIC/productinfo/C90/. Cray Research, Inc.;
The Cray J90 series. http://www.cray.com/PUBLIC/product-info/J90/.
Cray Research, Inc.].
12 / 52
ScaLAPACK is suitable for distributed-memory parallel computers, such
as the
IBM SP-2 [The IBM SP-2.
http://www.rs6000.ibm.com/software/sp products/sp2.html. IBM.],
Intel Paragon [The Intel Paragon,
http://www.ssd.intel.com/homepage.html. Intel.];
Cray T3 series [The Cray T3E series.
http://www.cray.com/PUBLIC/product-info/T3E/. Cray Research, Inc.];
networks of workstations [A. Anderson, D. Culler, D. Patterson, and the
NOW Team. A case for networks of workstations: NOW. IEEE Micro,
15(1):54-64, February 1995].
13 / 52
These libraries are available on NETLIB, including comprehensive
manuals [E. Anderson, et al., LAPACK Users’ Guide (2nd edition). SIAM,
Philadelphia, 1995; L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J.
Demmel, et al., ScaLAPACK Users’ Guide. Software, Environments, and
Tools 4. SIAM, Philadelphia, PA, 1997].
LAPACK was originally motivated by the poor performance of its
predecessors LINPACK and EISPACK (also available on NETLIB) on
some high-performance machines. For example, consider the table below,
which presents the speed in Mflops of LINPACK’s Cholesky routine spofa
on a Cray YMP, a supercomputer of the late 1980s. Cholesky is a variant
of Gaussian elimination suitable for symmetric positive definite matrices.
It is very similar to Algorithm 2.2. The table also includes the speed of
several other linear algebra operations. The Cray YMP is a parallel
computer with up to 8 processors that can be used simultaneously, so we
include one column of data for 1 processor and another column where all
8 processors are used.
14 / 52
Speed in Mflops on the Cray YMP:

                                        1 Proc.   8 Proc.
    Maximum speed                         330      2640
    Matrix-matrix multiply (n = 500)      312      2425
    Matrix-vector multiply (n = 500)      311      2285
    Solve TX = B (n = 500)                309      2398
    Solve Tx = b (n = 500)                272       584
    LINPACK (Cholesky, n = 500)            72        72
    LAPACK (Cholesky, n = 500)            290      1414
    LAPACK (Cholesky, n = 1000)           301      2115
The top line, the maximum speed of the machine, is an upper bound on
the numbers that follow. The basic linear algebra operations on the next
four lines have been measured using subroutines especially designed for
high speed on the Cray YMP. They all get reasonably close to the
maximum possible speed, except for solving Tx = b, a single triangular
system of linear equations, which does not use 8 processors effectively.
Solving TX = B refers to solving triangular systems with many
right-hand sides (B is a square matrix). These numbers are for large
matrices and vectors (n = 500).
15 / 52
Basic Linear Algebra Subroutines (BLAS)
16 / 52
In other words, a library of subroutines for matrix-matrix multiplication,
matrix-vector multiplication, and other similar operations is available with
a standard Fortran or C interface on high performance machines (and
many others), but underneath they have been optimized for each
machine. Our goal is to take advantage of these optimized BLAS by
reorganizing algorithms like Cholesky so that they call the BLAS to
perform most of their work.
Table 2.1 counts the number of memory references and floating point
operations performed by three related BLAS. For example, the number of
memory references needed to implement the saxpy operation in line 1 of
the table is 3n + 1, because we need to read n values of xi , n values of yi ,
and 1 value of α from slow memory to registers, and then write n values
of y_i back to slow memory. The last column gives the ratio q of flops to
memory references (keeping only its highest-order term in n).
17 / 52
The significance of q is that it tells us roughly how many flops we can
perform per memory reference, i.e., how much useful work we can do
compared to the time spent moving data, and hence how fast the algorithm
can potentially run. For example, suppose that an algorithm performs f
floating point operations, each of which takes t_arith seconds, and m
memory references, each of which takes t_mem seconds. Then the total
running time is

    f · t_arith + m · t_mem = f · t_arith · (1 + (m/f) · (t_mem/t_arith))
                            = f · t_arith · (1 + (1/q) · (t_mem/t_arith)),
assuming that the arithmetic and memory references are not performed in
parallel. Therefore, the larger the value of q, the closer the running time
is to the best possible running time f · tarith , which is how long the
algorithm would take if all data were in registers. This means that
algorithms with larger q values are better building blocks for other
algorithms.
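As a rough illustration of this model (the timings t_arith and t_mem below are made-up numbers chosen only for the example, not measurements), the following sketch evaluates f·t_arith·(1 + (1/q)·t_mem/t_arith) for the three q values of Table 2.1:

    # Sketch of the running-time model f*t_arith*(1 + (1/q)*(t_mem/t_arith)).
    # The two machine parameters below are illustrative assumptions, not measurements.
    t_arith = 1.0e-9        # seconds per flop (assumed)
    t_mem   = 20.0e-9       # seconds per memory reference (assumed)

    def model_time(f, q):
        """Predicted time for f flops with flop-to-memory-reference ratio q."""
        return f * t_arith * (1.0 + (1.0 / q) * (t_mem / t_arith))

    n = 1000
    for name, f, q in [("saxpy (BLAS1)", 2 * n, 2 / 3),
                       ("matrix-vector (BLAS2)", 2 * n**2, 2),
                       ("matrix-matrix (BLAS3)", 2 * n**3, n / 2)]:
        print(f"{name:24s} predicted {model_time(f, q):.2e} s "
              f"vs. ideal {f * t_arith:.2e} s")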
18 / 52
Table 2.1 reflects a hierarchy of operations: Operations such as saxpy
perform O(n) flops on vectors and offer the worst q values; these are
called Level 1 BLAS, or BLAS1 [C. Lawson, R. Hanson, D. Kincaid, and
F. Krogh. Basic Linear Algebra Subprograms for Fortran usage. ACM
Trans. Math. Software, 5:308-323, 1979], and include inner products,
multiplying a vector by a scalar, and other simple operations.
Operations such as matrix-vector multiplication perform O(n²) flops on
matrices and vectors and offer slightly better q values; these are called
Level 2 BLAS, or BLAS2 [J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. Algorithm
656: An extended set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Software, 14:18-32,
1988; J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN Basic Linear
Algebra Subprograms. ACM Trans. Math. Software, 14:1-17, 1988], and include solving triangular systems of
equations and rank-1 updates of matrices (A + xy^T, with x and y column vectors). Operations such as matrix-matrix
multiplication perform O(n³) flops on pairs of matrices and offer the best q values; these are called Level 3 BLAS,
or BLAS3 [J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. Algorithm 679: A set of Level 3 Basic Linear
Algebra Subprograms. ACM Trans. Math. Software, 16:18-28, 1990; J. Dongarra, J. Du Croz, I. Duff, and S.
Hammarling. A set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Software, 16:1-17, 1990],
and include solving triangular systems of equations with many right-hand sides.
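For concreteness, the three levels correspond to the following NumPy operations (an illustration only, not part of the reference BLAS interface; NumPy dispatches the Level 2 and Level 3 cases to an optimized BLAS such as OpenBLAS or MKL when one is available):

    import numpy as np

    n = 500
    alpha = 2.0
    x, y = np.random.rand(n), np.random.rand(n)
    A, B, C = (np.random.rand(n, n) for _ in range(3))

    y = alpha * x + y      # BLAS1 (saxpy-like):  O(n)   flops on O(n)   data
    y = A @ x + y          # BLAS2 (gemv-like):   O(n^2) flops on O(n^2) data
    C = A @ B + C          # BLAS3 (gemm-like):   O(n^3) flops on O(n^2) data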
19 / 52
Table 2.1. Counting floating point operations and memory references for
the BLAS. f is the number of floating point operations, and m is the
number of memory references.
    Operation            Definition                                           f      m         q = f/m
    saxpy (BLAS1)        y = α·x + y, i.e. y_i = α·x_i + y_i,                 2n     3n + 1    2/3
                         i = 1, ..., n
    Matrix-vector mult   y = A·x + y, i.e. y_i = Σ_{j=1}^{n} a_ij·x_j + y_i,  2n²    n² + 3n   2
    (BLAS2)              i = 1, ..., n
    Matrix-matrix mult   C = A·B + C, i.e. c_ij = Σ_{k=1}^{n} a_ik·b_kj + c_ij,  2n³    4n²    n/2
    (BLAS3)              i, j = 1, ..., n
20 / 52
How to Optimize Matrix Multiplication
21 / 52
The simplest matrix-multiplication algorithm that one might try consists
of three nested loops, which we have annotated to indicate the data
movements.
ALGORITHM 2.6. Unblocked matrix multiplication (annotated to
indicate memory activity):
for i = 1 to n
{ Read row i of A into fast memory}
for j = 1 to n
{ Read Cij into fast memory}
{ Read column j of B into fast memory}
for k = 1 to n
Cij = Cij + Aik · Bkj
end for
{ Write Cij back to slow memory}
end for
end for
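A direct (and deliberately slow) Python rendering of Algorithm 2.6, kept here only for comparison with the blocked sketch that follows Algorithm 2.7 below; in practice one would call an optimized BLAS instead:

    import numpy as np

    def matmul_unblocked(A, B, C):
        """C = C + A*B with three nested loops (Algorithm 2.6)."""
        n = A.shape[0]
        for i in range(n):
            for j in range(n):
                s = C[i, j]
                for k in range(n):
                    s += A[i, k] * B[k, j]
                C[i, j] = s
        return C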
22 / 52
Here is the detailed count of memory references: n³ for reading B n
times (once for each value of i); n² for reading A one row at a time and
keeping it in fast memory until it is no longer needed; and 2n² for reading
one entry of C at a time, keeping it in fast memory until it is completely
computed, and then moving it back to slow memory. This comes to
n³ + 3n² memory moves, so q = 2n³/(n³ + 3n²) ≈ 2, which is no better
than the Level 2 BLAS and far from the maximum possible n/2 (see
Table 2.1). If M ≪ n, so that we cannot keep a full row of A in fast
memory, q further decreases to 1, since the algorithm reduces to a
sequence of inner products, which are Level 1 BLAS. Every
permutation of the three loops on i, j, and k yields another algorithm
with about the same q.
Our preferred algorithm uses blocking, where C is broken into an N × N
block matrix with n/N × n/N blocks C^{ij}, and A and B are partitioned
the same way (for example, with N = 4 each matrix becomes a 4 × 4 array of blocks).
23 / 52
ALGORITHM 2.7. Blocked matrix multiplication (annotated to
indicate memory activity):
for i = 1 to N
for j = 1 to N
{ Read C ij into fast memory}
for k = 1 to N
{ Read Aik into fast memory}
{ Read B kj into fast memory}
C ij = C ij + Aik · B kj
end for
{ Write C ij back to slow memory}
end for
end for
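A sketch of Algorithm 2.7 in NumPy, with the block size chosen so that three blocks fit in a fast memory of M words (M is an assumed parameter here, not something the library reports); the innermost update is left to NumPy so that it maps onto an optimized BLAS3 call:

    import numpy as np

    def matmul_blocked(A, B, C, M=3 * 64**2):
        """C = C + A*B by blocks, assuming 3 blocks of size (n/N)^2 fit in M words."""
        n = A.shape[0]
        blk = max(1, int(np.sqrt(M / 3)))          # block dimension n/N
        for i in range(0, n, blk):
            for j in range(0, n, blk):
                Cij = C[i:i+blk, j:j+blk]          # a view: updates land in C
                for k in range(0, n, blk):
                    Cij += A[i:i+blk, k:k+blk] @ B[k:k+blk, j:j+blk]
        return C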
24 / 52
The number of memory references
Our memory reference count is as follows: 2n² for reading and writing
each block of C once; Nn² for reading A N times (each n/N-by-n/N
submatrix A^{ik} is read once for every value of j, i.e. N times); and
Nn² for reading B N times (each submatrix B^{kj} is read once for every
value of i). This gives a total of (2N + 2)n² ≈ 2Nn² memory references.
So we want to choose N as small as possible to minimize the number of
memory references. But N is subject to the constraint M ≥ 3(n/N)²,
which says that one block each from A, B, and C must fit in fast
memory simultaneously.
This yields N ≈ n·√(3/M), and so q ≈ (2n³)/(2Nn²) = n/N ≈ √(M/3), which is much better
than the previous algorithm.
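A small worked example of this estimate (the fast-memory size M below is an assumption chosen only for illustration):

    import math

    n = 4096                              # matrix dimension
    M = 256 * 1024                        # assumed fast-memory capacity in words
    N = math.ceil(n * math.sqrt(3 / M))   # number of blocks per dimension
    q = math.sqrt(M / 3)                  # flops per memory reference, roughly n/N
    print(N, round(q))                    # N = 14, q ≈ 296, vs. q ≈ 2 unblocked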
25 / 52
The number of memory references
26 / 52
The number of memory references
27 / 52
Both of the above matrix-matrix multiplication algorithms
perform 2n³ arithmetic operations.
It turns out that there are other implementations of
matrix-matrix multiplication that use far fewer operations.
One of them is Strassen’s method [A. Aho, J. Hopcroft, and J. Ullman. The
Design and Analysis of Computer Algorithms.
Addison-Wesley, Reading, MA, 1974].
ALGORITHM 2.8. Strassen’s matrix multiplication algorithm:
C = Strassen(A, B, n)
/* Return C = A ∗ B, where A and B are n-by-n;
Assume n is a power of 2 */
if n = 1
return C = A ∗ B /* scalar multiplication */
else
Partition A = [ A11, A12 ; A21, A22 ] and B = [ B11, B12 ; B21, B22 ],
where the subblocks Aij and Bij are n/2-by-n/2
P1 = Strassen( A12 − A22 , B21 + B22 , n/2 )
P2 = Strassen( A11 + A22 , B11 + B22 , n/2 )
P3 = Strassen( A11 − A21 , B11 + B12 , n/2 )
P4 = Strassen( A11 + A12 , B22 , n/2 )
P5 = Strassen( A11 , B12 − B22 , n/2 )
P6 = Strassen( A22 , B21 − B11 , n/2 )
P7 = Strassen( A21 + A22 , B11 , n/2 )
C11 = P1 + P2 − P4 + P6
C12 = P4 + P5
C21 = P6 + P7
C22 = P2 − P3 + P5 − P7
return C = [ C11, C12 ; C21, C22 ]
end if
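A compact NumPy sketch of Algorithm 2.8 (recursing down to a small crossover size rather than to 1-by-1 blocks, which any practical implementation would do; the crossover value is an arbitrary illustrative choice):

    import numpy as np

    def strassen(A, B, crossover=64):
        """C = A*B by Algorithm 2.8; assumes n is a power of 2."""
        n = A.shape[0]
        if n <= crossover:
            return A @ B                       # ordinary multiplication below the crossover
        m = n // 2
        A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
        B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
        P1 = strassen(A12 - A22, B21 + B22, crossover)
        P2 = strassen(A11 + A22, B11 + B22, crossover)
        P3 = strassen(A11 - A21, B11 + B12, crossover)
        P4 = strassen(A11 + A12, B22, crossover)
        P5 = strassen(A11, B12 - B22, crossover)
        P6 = strassen(A22, B21 - B11, crossover)
        P7 = strassen(A21 + A22, B11, crossover)
        C = np.empty_like(A)
        C[:m, :m] = P1 + P2 - P4 + P6          # C11
        C[:m, m:] = P4 + P5                    # C12
        C[m:, :m] = P6 + P7                    # C21
        C[m:, m:] = P2 - P3 + P5 - P7          # C22
        return C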
29 / 52
Complexity of Strassen’s algorithm
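A sketch of the standard operation count: each call performs 7 recursive multiplications of size n/2 plus O(n²) work for the block additions, so the count T(n) satisfies

    T(n) = 7·T(n/2) + O(n²)   ⟹   T(n) = O(n^{log₂ 7}) ≈ O(n^{2.81}),

compared with 2n³ for the conventional algorithms above.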
30 / 52
Special Linear Systems
31 / 52
2.7.1. Real Symmetric Positive Definite Matrices
32 / 52
Proof.
1. If X is nonsingular, then A is s.p.d. if and only if X T AX is s.p.d.
X nonsingular implies Xx ≠ 0 for all x ≠ 0, so x^T(X^T A X)x = (Xx)^T A (Xx) > 0 for
all x ≠ 0. So A s.p.d. implies X^T A X is s.p.d. Use X^{-1} in place of X to deduce
the other implication.
33 / 52
2. If A is s.p.d. and H is any principal submatrix of
A (H = A(j : k, j : k) for some 1 ≤ j ≤ k ≤ n), then H is s.p.d.
Suppose first that H = A(1 : m, 1 : m). Then given any m-vector
y, the n-vector x = [y^T, 0]^T satisfies y^T H y = x^T A x. So if
x^T A x > 0 for all nonzero x, then y^T H y > 0 for all nonzero y, and
so H is s.p.d. If H does not lie in the upper left corner of A, let P
be a permutation so that H does lie in the upper left corner of
P T AP and apply Part 1.
34 / 52
3. A is s.p.d. if and only if A = AT and all its eigenvalues are
positive.
Let X be the real orthogonal eigenvector matrix of A, so that
X^T A X = Λ is the diagonal matrix of the real eigenvalues λ_i. Since
x^T Λ x = Σ_i λ_i x_i², Λ is s.p.d. if and only if each λ_i > 0. Now
apply Part 1.
35 / 52
4. If A is s.p.d., then all aii > 0, and maxij |aij | = maxi aii > 0.
Let e_i be the ith column of the identity matrix. Then
e_i^T A e_i = a_ii > 0 for all i. If |a_kl| = max_ij |a_ij| but k ≠ l, choose
x = e_k − sign(a_kl) e_l. Then x^T A x = a_kk + a_ll − 2|a_kl| ≤ 0,
contradicting positive-definiteness.
36 / 52
5. A is s.p.d. if and only if there is a unique lower triangular
nonsingular matrix L, with positive diagonal entries, such that
A = LLT . A = LLT is called the Cholesky factorization of A, and L
is called the Cholesky factor of A.
Suppose A = LL^T with L nonsingular. Then
x^T A x = (x^T L)(L^T x) = ||L^T x||_2² > 0 for all x ≠ 0, so A is s.p.d. If
A is s.p.d., we show that L exists by induction on the dimension n.
If we choose each l_ii > 0, our construction will determine L
uniquely. If n = 1, choose l_11 = √a_11, which exists since a_11 > 0.
As with Gaussian elimination, it suffices to understand the block
2-by-2 case.
37 / 52
Write

    A = [ a11, A12 ; A12^T, A22 ]
      = [ √a11, 0 ; A12^T/√a11, I ] · [ 1, 0 ; 0, Ã22 ] · [ √a11, A12/√a11 ; 0, I ]
      = [ a11, A12 ; A12^T, Ã22 + A12^T·A12/a11 ],

so Ã22 = A22 − A12^T·A12/a11, and this (n − 1)-by-(n − 1) matrix is symmetric.
38 / 52
By Part 1 above, [ 1, 0 ; 0, Ã22 ] is s.p.d., so by Part 2 Ã22 is s.p.d.
Thus by induction there exists an L̃ such that Ã22 = L̃·L̃^T, and

    A = [ √a11, 0 ; A12^T/√a11, I ] · [ 1, 0 ; 0, L̃·L̃^T ] · [ √a11, A12/√a11 ; 0, I ]
      = [ √a11, 0 ; A12^T/√a11, L̃ ] · [ √a11, A12/√a11 ; 0, L̃^T ]  ≡  L·L^T.
39 / 52
We may rewrite this induction as the following algorithm.
ALGORITHM 2.11. Cholesky algorithm:
for j = 1 to n
l_jj = (a_jj − Σ_{k=1}^{j−1} l_jk²)^{1/2}
for i = j + 1 to n
l_ij = (a_ij − Σ_{k=1}^{j−1} l_ik·l_jk) / l_jj
end for
end for
If A is not positive definite, then (in exact arithmetic) this
algorithm will fail by attempting to compute the square root of a
negative number or by dividing by zero; this is the cheapest way to
test if a symmetric matrix is positive definite.
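A direct NumPy transcription of Algorithm 2.11 (an unoptimized sketch; in practice one would call the LAPACK routines spotrf/sposv):

    import numpy as np

    def cholesky(A):
        """Return lower triangular L with A = L L^T (Algorithm 2.11).
        Raises ValueError if A is not symmetric positive definite."""
        n = A.shape[0]
        L = np.zeros_like(A, dtype=float)
        for j in range(n):
            d = A[j, j] - np.dot(L[j, :j], L[j, :j])
            if d <= 0:
                raise ValueError("matrix is not positive definite")
            L[j, j] = np.sqrt(d)
            for i in range(j + 1, n):
                L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
        return L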
40 / 52
The number of flops in Cholesky algorithm
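A sketch of the standard count: computing l_jj costs about 2j flops, and each of the n − j entries l_ij below it costs about 2j flops as well, so the total is approximately

    Σ_{j=1}^{n} 2j·(n − j) ≈ n³/3

flops plus n square roots, about half the cost of Gaussian elimination (2n³/3).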
41 / 52
Pivoting is not necessary for Cholesky to be numerically stable
(equivalently, we could also say any diagonal pivot order is
numerically stable). We show this as follows. The same analysis as
for Gaussian elimination in section 2.4.2 shows that the computed
solution x̂ satisfies (A + δA)x̂ = b with |δA| ≤ 3nε·|L| · |L^T|. But
by the Cauchy–Schwarz inequality and Part 4 of Proposition 2.2,

    (|L| · |L^T|)_ij = Σ_k |l_ik| · |l_jk|
                     ≤ (Σ_k l_ik²)^{1/2} · (Σ_k l_jk²)^{1/2}
                     = √a_ii · √a_jj
                     ≤ max_i a_ii = max_ij |a_ij|,

so every entry of |L| · |L^T| is bounded by the largest entry of |A|, and hence
||δA|| is small compared with ||A||.
42 / 52
Symmetric Indefinite Matrices
The question of whether we can still save half the time and half the
space when solving a symmetric but indefinite (neither positive definite
nor negative definite) linear system naturally arises. It turns out to be
possible, but a more complicated pivoting scheme and factorization is
required. If A is nonsingular, one can show that there exists a
permutation P, a unit lower triangular matrix L, and a block diagonal
matrix D with 1-by-1 and 2-by-2 blocks such that PAP^T = LDL^T.
To see why 2-by-2 blocks are needed in D, consider the matrix [ 0, 1 ; 1, 0 ].
This factorization can be computed stably, saving about half the work
and space compared to standard Gaussian elimination. The name of the
LAPACK subroutine which does this operation is ssysv. The algorithm is
described in [J. Bunch and L. Kaufman. Some stable methods for
calculating inertia and solving symmetric linear systems. Math. Comp.,
31:163-179, 1977].
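SciPy exposes a symmetric indefinite factorization of this kind as scipy.linalg.ldl; the example below is only an illustration (the exact return convention should be checked against the SciPy documentation) and shows the 2-by-2 pivot block forced by the matrix [ 0, 1 ; 1, 0 ]:

    import numpy as np
    from scipy.linalg import ldl

    A = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
    L, D, perm = ldl(A)                    # symmetric indefinite LDL^T factorization
    print(D)                               # one 2-by-2 block: no 1-by-1 pivot works here
    print(np.allclose(L @ D @ L.T, A))     # True: permutation is folded into L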
43 / 52
Band Matrices
Band matrices arise often in practice and are useful to recognize because
their L and U factors are also ”essentially banded”, making them cheaper
to compute and store. We consider LU factorization without pivoting
and show that L and U are banded in the usual sense, with the same
band widths as A.
44 / 52
PROPOSITION 2.3. Let A be banded with lower bandwidth bL and
upper bandwidth bU . Let A = LU be computed without pivoting. Then
L has lower bandwidth bL and U has upper bandwidth bU . L and U can
be computed in about 2n · bU · bL arithmetic operations when bU and bL
are small compared to n. The space needed is about n·(bL + bU + 1). The full
cost of solving Ax = b is 2n·bU·bL + 2n·bU + 2n·bL.
PROPOSITION 2.4. Let A be banded with lower bandwidth bL and
upper bandwidth bU . Then after Gaussian elimination with partial
pivoting, U is banded with upper bandwidth at most bL + bU , and L is
”essentially banded” with lower bandwidth bL . This means that L has at
most bL + 1 nonzeros in each column and so can be stored in the same
space as a band matrix with lower bandwidth bL .
Gaussian elimination and Cholesky for band matrices are available in
LAPACK routines like sgbsv and spbsv.
Band matrices often arise from discretizing physical problems with
nearest neighbor interactions on a mesh (provided the unknowns are
ordered rowwise or columnwise).
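For illustration, SciPy's scipy.linalg.solve_banded solves a general band system given the matrix in diagonal-ordered storage; this is a sketch for a tridiagonal example (bL = bU = 1), and the storage convention should be checked against the SciPy documentation:

    import numpy as np
    from scipy.linalg import solve_banded

    n = 6
    main = 2.0 * np.ones(n)
    off = -1.0 * np.ones(n - 1)

    # Diagonal-ordered form for (bL, bU) = (1, 1): row 0 = superdiagonal (padded in front),
    # row 1 = main diagonal, row 2 = subdiagonal (padded at the end).
    ab = np.zeros((3, n))
    ab[0, 1:] = off
    ab[1, :] = main
    ab[2, :-1] = off

    b = np.ones(n)
    x = solve_banded((1, 1), ab, b)        # O(n * bL * bU) work instead of O(n^3)

    A_dense = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    print(np.allclose(A_dense @ x, b))     # True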
45 / 52
Example: ODE
Consider the two-point boundary value problem
y''(x) − p(x)·y'(x) − q(x)·y(x) = r(x) on 0 < x < 1, with boundary
conditions y(0) = α and y(1) = β, where we assume q(x) ≥ q > 0 for some
constant q. We discretize on the uniform mesh x_i = i·h, h = 1/(N + 1),
i = 0, 1, ..., N + 1.
46 / 52
We need to derive equations to solve for our desired
approximations y_i ≈ y(x_i), where y_0 = α and y_{N+1} = β. To derive
these equations, we approximate the derivatives y'(x_i) and y''(x_i) by the
following centered finite difference approximations:

    y'(x_i) ≈ (y_{i+1} − y_{i−1}) / (2h),
    y''(x_i) ≈ (y_{i+1} − 2y_i + y_{i−1}) / h².

Inserting these approximations into the differential equation yields

    (y_{i+1} − 2y_i + y_{i−1}) / h² − p_i·(y_{i+1} − y_{i−1}) / (2h) − q_i·y_i = r_i,   1 ≤ i ≤ N,

where p_i = p(x_i), q_i = q(x_i), and r_i = r(x_i).
47 / 52
Rewriting this as a linear system we get Ay = b, where

    y = (y_1, y_2, ..., y_N)^T

and

    b = −(h²/2)·(r_1, r_2, ..., r_N)^T + ( (1/2 + (h/4)·p_1)·α, 0, ..., 0, (1/2 − (h/4)·p_N)·β )^T,
48 / 52
and A is the N-by-N tridiagonal matrix with diagonal entries a_i,
subdiagonal entries −b_i (i = 2, ..., N), and superdiagonal entries −c_i
(i = 1, ..., N − 1), where

    a_i = 1 + (h²/2)·q_i,
    b_i = (1/2)·(1 + (h/2)·p_i),
    c_i = (1/2)·(1 − (h/2)·p_i).
Note that ai > 0, and also bi > 0 and ci > 0 if h is small enough.
This is a nonsymmetric tridiagonal system to solve for y . We will show
how to change it to a symmetric positive definite tridiagonal system, so
that we may use band Cholesky to solve it.
49 / 52
Choose D = diag( 1, √(c_1/b_2), √(c_1·c_2/(b_2·b_3)), ..., √((c_1·c_2 ··· c_{N−1})/(b_2·b_3 ··· b_N)) ).
Then we may change Ay = b to (DAD^{−1})(Dy) = Db, or Ãỹ = b̃, where
Ã = DAD^{−1} is the symmetric tridiagonal matrix with diagonal entries
a_1, ..., a_N and off-diagonal entries −√(c_i·b_{i+1}), i = 1, ..., N − 1.
50 / 52
Gershgorin’s Theorem
Every eigenvalue λ of a matrix B lies in one of the disks centered at the
diagonal entries b_kk: if Bx = λx with x scaled so that |x_k| = ||x||_∞ = 1, then

    |λ − b_kk| = | Σ_{j≠k} b_kj·x_j | ≤ Σ_{j≠k} |b_kj·x_j| ≤ Σ_{j≠k} |b_kj|.
51 / 52
Example: ODE (continuation)
Applying Gershgorin’s theorem to row i of the tridiagonal matrix A gives
|λ − a_i| ≤ |b_i| + |c_i|, and

    |b_i| + |c_i| = (1/2)·(1 + (h/2)·p_i) + (1/2)·(1 − (h/2)·p_i) = 1 < 1 + (h²/2)·q ≤ 1 + (h²/2)·q_i = a_i

(for h small enough that 1 ± (h/2)·p_i > 0). Hence every eigenvalue of A, and
therefore of the similar matrix Ã, satisfies λ ≥ a_i − 1 ≥ (h²/2)·q > 0, so Ã is
symmetric positive definite and band Cholesky can be applied.
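Putting the whole example together, here is a sketch that discretizes a sample problem (the choices p(x) = 1, q(x) = 1, r(x) = −1, α = β = 0 are arbitrary illustrations), symmetrizes the tridiagonal system with D, and solves it with banded Cholesky via scipy.linalg.solveh_banded (the banded storage convention is as I understand it from the SciPy documentation):

    import numpy as np
    from scipy.linalg import solveh_banded

    # Sample data for y'' - p y' - q y = r on (0,1), y(0) = alpha, y(1) = beta.
    N, alpha, beta = 99, 0.0, 0.0
    h = 1.0 / (N + 1)
    x = (np.arange(N) + 1) * h
    p, q, r = np.ones(N), np.ones(N), -np.ones(N)    # illustrative coefficient choices

    a = 1.0 + 0.5 * h**2 * q                         # diagonal entries a_i
    bcoef = 0.5 * (1.0 + 0.5 * h * p)                # subdiagonal coefficients b_i
    c = 0.5 * (1.0 - 0.5 * h * p)                    # superdiagonal coefficients c_i

    rhs = -0.5 * h**2 * r
    rhs[0] += bcoef[0] * alpha                       # boundary term (1/2 + h p_1/4) alpha
    rhs[-1] += c[-1] * beta                          # boundary term (1/2 - h p_N/4) beta

    # Symmetrize: D A D^{-1} is tridiagonal with off-diagonal -sqrt(c_i b_{i+1}).
    d = np.cumprod(np.concatenate(([1.0], np.sqrt(c[:-1] / bcoef[1:]))))
    offdiag = -np.sqrt(c[:-1] * bcoef[1:])

    ab = np.zeros((2, N))                            # upper diagonal-ordered band storage
    ab[0, 1:] = offdiag
    ab[1, :] = a

    y_tilde = solveh_banded(ab, d * rhs)             # solve (D A D^{-1}) (D y) = D b
    y = y_tilde / d                                  # recover y = D^{-1} (D y)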
52 / 52