
Applied Numerical Linear Algebra.

Lecture 5

1 / 52
Improving the Accuracy of a Solution

We have just seen that the error in solving Ax = b may be as large as k(A)ε. If this error is too large, what can we do? One possibility is to rerun the entire computation in higher precision, but this may be quite expensive in time and space. Fortunately, as long as k(A) is not too large, there are much cheaper methods available for getting a more accurate solution.

2 / 52
Improving the Accuracy of a Solution

To solve any equation f(x) = 0, we can try to use Newton's method to improve an approximate solution xi to get xi+1 = xi − f(xi)/f′(xi). Applying this to f(x) = Ax − b yields one step of iterative refinement:

r = Axi − b
solve Ad = r for d
xi +1 = xi − d

If we could compute r = Axi − b exactly and solve Ad = r exactly, we would be done in one step, which is what we expect from Newton applied to a linear problem. Roundoff error prevents this immediate convergence. The algorithm is interesting and of use precisely when A is so ill-conditioned that solving Ad = r (and Ax0 = b) is rather inaccurate.
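
As an illustration, here is a minimal NumPy/SciPy sketch of the refinement loop above (not the LAPACK implementation): A is factored once, and each step computes the residual in double precision before rounding it back to single, as Theorem 2.7 on the next slide assumes. The function name and the fixed number of steps are arbitrary choices for this sketch.

import numpy as np
import scipy.linalg as la   # assumed available, used only for the LU factorization

def iterative_refinement(A, b, n_steps=5):
    # Factor A once in working (single) precision and reuse the factors.
    A32 = A.astype(np.float32)
    lu, piv = la.lu_factor(A32)
    x = la.lu_solve((lu, piv), b.astype(np.float32))
    for _ in range(n_steps):
        # r = A x_i - b, accumulated in double precision, then rounded to single.
        r = (A.astype(np.float64) @ x.astype(np.float64)
             - b.astype(np.float64)).astype(np.float32)
        d = la.lu_solve((lu, piv), r)   # solve A d = r with the same factors
        x = x - d                       # x_{i+1} = x_i - d
    return x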

3 / 52
THEOREM 2.7. Suppose that r is computed in double precision and k(A) · ε < c ≡ 1/(3n³g + 1) < 1, where n is the dimension of A and g is the pivot growth factor. Then repeated iterative refinement converges with

||xi − A⁻¹b||∞ / ||A⁻¹b||∞ = O(ε).
Note that the condition number does not appear in the final error
bound. This means that we compute the answer accurately
independent of the condition number, provided that k(A)ε is
sufficiently less than 1. (In practice, c is too conservative an upper
bound, and the algorithm often succeeds even when k(A)ε is
greater than c.)

4 / 52
Sketch of Proof.
Here we denote || · ||∞ by || · ||. Our goal is to show that

||xi+1 − x|| ≤ (k(A)ε/c) · ||xi − x|| ≡ ζ||xi − x||.

By assumption, ζ < 1, so this inequality implies that the error ||xi +1 − x||
decreases monotonically to zero. (In practice it will not decrease all the
way to zero because of rounding error in the assignment xi +1 = xi − d,
which we are ignoring.)
We begin by estimating the error in the computed residual r. We get r = fl(Axi − b) = Axi − b + f, where |f| ≤ nε²(|A| · |xi| + |b|) + ε|Axi − b| ≈ ε|Axi − b|. The ε² term comes from the double precision computation of r, and the ε term comes from rounding the double precision result back to single precision. Since ε² ≪ ε, we will neglect the ε² term in the bound on |f|.
Next we get (A + δA)d = r, where from bound (2.11) we know that ||δA|| ≤ γ · ε · ||A||, where γ = 3n³g, although this is usually much too large. As mentioned earlier, we simplify matters by assuming xi+1 = xi − d exactly.
5 / 52
Continuing to ignore all ε² terms, we get

d = (A + δA)⁻¹r = (I + A⁻¹δA)⁻¹A⁻¹r
  = (I + A⁻¹δA)⁻¹A⁻¹(Axi − b + f)
  = (I + A⁻¹δA)⁻¹(xi − x + A⁻¹f)
  ≈ (I − A⁻¹δA)(xi − x + A⁻¹f)
  ≈ xi − x − A⁻¹δA(xi − x) + A⁻¹f.

6 / 52
Therefore xi+1 − x = xi − d − x = A⁻¹δA(xi − x) − A⁻¹f, and so

||xi+1 − x|| ≤ ||A⁻¹δA(xi − x)|| + ||A⁻¹f||
            ≤ ||A⁻¹|| · ||δA|| · ||xi − x|| + ||A⁻¹|| · ε · ||Axi − b||
            ≤ ||A⁻¹|| · ||δA|| · ||xi − x|| + ||A⁻¹|| · ε · ||A(xi − x)||
            ≤ ||A⁻¹|| · γε · ||A|| · ||xi − x|| + ||A⁻¹|| · ||A|| · ε · ||xi − x||
            = ||A⁻¹|| · ||A|| · ε · (γ + 1) · ||xi − x||,

so if
ζ = ||A−1 || · ||A|| · ε(γ + 1) = k(A)ε/c < 1,
then we have convergence. 

7 / 52
Single Precision Iterative Refinement

THEOREM 2.8. Suppose that r is computed in single precision and

||A⁻¹||∞ · ||A||∞ · ( maxi (|A| · |x|)i / mini (|A| · |x|)i ) · ε < 1.

Then one step of iterative refinement yields x1 such that
(A + δA)x1 = b + δb with |δaij| = O(ε)|aij| and |δbi| = O(ε)|bi|. In
other words, the componentwise relative backward error is as small as
possible. For example, this means that if A and b are sparse, then δA and
δb have the same sparsity structures as A and b, respectively.
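
The quantity Theorem 2.8 controls is the componentwise relative backward error. Below is a small sketch of the standard Oettli-Prager/Skeel measure, max_i |b − A·x̂|_i / (|A|·|x̂| + |b|)_i, which can be evaluated after a refinement step to check that it is O(ε); the function name and the zero-denominator guard are choices made for this sketch.

import numpy as np

def componentwise_backward_error(A, b, xhat):
    # omega(xhat) = max_i |b - A xhat|_i / (|A| |xhat| + |b|)_i
    r = b - A @ xhat
    denom = np.abs(A) @ np.abs(xhat) + np.abs(b)
    # Guard against 0/0 for this illustration; a zero denominator forces a zero residual.
    denom = np.where(denom == 0.0, 1.0, denom)
    return np.max(np.abs(r) / denom)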

8 / 52
For a proof, see
N. J. Higham. Accuracy and Stability of Numerical Algorithms.
SIAM, Philadelphia, PA, 1996.
M. Arioli, J. Demmel, and I. S. Duff. Solving sparse linear systems
with sparse backward error. SIAM J. Matrix Anal. Appl.,
10:165-190, 1989.
R. D. Skeel. Scaling for numerical stability in Gaussian elimination.
Journal of the ACM, 26:494-526, 1979.
R. D. Skeel. Iterative refinement implies numerical stability for
Gaussian elimination. Math. Comp., 35:817-832, 1980.
R. D. Skeel. Effect of equilibration on residual size for partial
pivoting. SIAM J. Numer. Anal., 18:449-454, 1981.
Single precision iterative refinement and the error bound (2.14) are
implemented in LAPACK routines like sgesvx.

9 / 52
Equilibration

There is one more common technique for improving the error in solving a linear system: equilibration. This refers to choosing an appropriate diagonal matrix D and solving DAx = Db instead of Ax = b. D is chosen to try to make the condition number of DA smaller than that of A. For instance, when the rows of A have widely varying norms, choosing dii to be the reciprocal of the two-norm of row i of A gives DA rows of unit norm and can reduce the condition number dramatically (in an extreme example, from about 10¹⁴ to about 1). It is possible to show that choosing D this way reduces the condition number of DA to within a factor of n of its smallest possible value for any diagonal D [A. Van Der Sluis. Condition numbers and equilibration of matrices. Numer. Math., 14:14-23, 1969]. In practice we may also choose two diagonal matrices Drow and Dcol and solve (Drow A Dcol)x̄ = Drow b, x = Dcol x̄.
The techniques of iterative refinement and equilibration are
implemented in the LAPACK subroutines like sgerfs and sgeequ,
respectively. These are in turn used by driver routines like sgesvx.
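
A hedged NumPy sketch of the row scaling described above (this is not LAPACK's sgeequ, which has its own scaling rules; the helper name is ours):

import numpy as np

def solve_row_equilibrated(A, b):
    d = 1.0 / np.linalg.norm(A, axis=1)   # d_ii = 1 / ||row i of A||_2
    DA = d[:, None] * A                   # D A
    Db = d * b                            # D b
    # np.linalg.cond(DA) can be compared with np.linalg.cond(A) to see the effect.
    return np.linalg.solve(DA, Db)        # same solution x as A x = b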
10 / 52
Blocking Algorithms for Higher Performance

Changing the order of the three nested loops in the implementation of Gaussian elimination in Algorithm 2.2 could change the execution speed by orders of magnitude, depending on the computer and the problem being solved. In this section we will explore why this is the case and describe some carefully written linear algebra software which takes these matters into account.
These implementations use so-called block algorithms, because
they operate on square or rectangular subblocks of matrices in
their innermost loops rather than on entire rows or columns. These
codes are available in public-domain software libraries such as
LAPACK (in Fortran, at NETLIB/lapack) and ScaLAPACK (at
NETLIB/scalapack). LAPACK (and its versions in other
languages) are suitable for PCs, workstations, vector computers,
and shared-memory parallel computers.

11 / 52
These include the Sun SPARC-center 2000 [SPARCcenter 2000
architecture and implementation. Sun Microsystems, Inc., November
1993. Technical White Paper.];
SGI Power Challenge [SGI Power Challenge. Technical Report, Silicon
Graphics, 1995.];
DEC AlphaServer 8400 [D. M. Fenwick, D. J. Foley, W. B. Gist, S. R.
VanDoren, and D. Wissel. The AlphaServer 8000 series: High-end server
platform development. Digital Technical Journal, 7:43-65, 1995.];
and Cray C90/J90 [The Cray C90 series.
http://www.cray.com/PUBLIC/productinfo/C90/. Cray Research, Inc.;
The Cray J90 series. http://www.cray.com/PUBLIC/product-info/J90/.
Cray Research, Inc.].

12 / 52
ScaLAPACK is suitable for distributed-memory parallel computers, such
as the
IBM SP-2 [The IBM SP-2.
http://www.rs6000.ibm.com/software/sp products/sp2.html. IBM.],
Intel Paragon [The Intel Paragon,
http://www.ssd.intel.com/homepage.html. Intel.];
Cray T3 series [The Cray T3E series.
http://www.cray.com/PUBLIC/product-info/T3E/. Cray Research, Inc.];
networks of workstations [A. Anderson, D. Culler, D. Patterson, and the NOW Team. A case for networks of workstations: NOW. IEEE Micro, 15(1):54-64, February 1995].

13 / 52
These libraries are available on NETLIB, including comprehensive
manuals [E. Anderson, et al., LAPACK Users’ Guide (2nd edition). SIAM,
Philadelphia, 1995; L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J.
Demmel, et al., ScaLAPACK Users’ Guide. Software, Environments, and
Tools 4. SIAM, Philadelphia, PA, 1997].
LAPACK was originally motivated by the poor performance of its
predecessors LINPACK and EISPACK (also available on NETLIB) on
some high-performance machines. For example, consider the table below,
which presents the speed in Mflops of LINPACK’s Cholesky routine spofa
on a Cray YMP, a supercomputer of the late 1980s. Cholesky is a variant
of Gaussian elimination suitable for symmetric positive definite matrices.
It is very similar to Algorithm 2.2. The table also includes the speed of
several other linear algebra operations. The Cray YMP is a parallel
computer with up to 8 processors that can be used simultaneously, so we
include one column of data for 1 processor and another column where all
8 processors are used.

14 / 52
                                    1 Proc.   8 Proc.
Maximum speed                         330      2640
Matrix-matrix multiply (n = 500)      312      2425
Matrix-vector multiply (n = 500)      311      2285
Solve TX = B (n = 500)                309      2398
Solve Tx = b (n = 500)                272       584
LINPACK (Cholesky, n = 500)            72        72
LAPACK (Cholesky, n = 500)            290      1414
LAPACK (Cholesky, n = 1000)           301      2115
The top line, the maximum speed of the machine, is an upper bound on
the numbers that follow. The basic linear algebra operations on the next
four lines have been measured using subroutines especially designed for
high speed on the Cray YMP. They all get reasonably close to the
maximum possible speed, except for solving Tx = b, a single triangular
system of linear equations, which does not use 8 processors effectively.
Solving TX = B refers to solving triangular systems with many
right-hand sides (B is a square matrix). These numbers are for large
matrices and vectors (n = 500).

15 / 52
Basic Linear Algebra Subroutines (BLAS)

Since it is not cost-effective to write a special version of every routine like Cholesky for every new computer, we need a more systematic approach. Since operations like matrix-matrix multiplication are so common, computer manufacturers have standardized them as the Basic Linear Algebra Subroutines, or BLAS, and optimized them for their machines.
C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic Linear Algebra
Subprograms for Fortran usage. ACM Trans. Math. Software, 5:308-323,
1979.
J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended
set of FORTRAN Basic Linear Algebra Subroutines. ACM Trans. Math.
Software, 14:1-17, 1988.
J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A set of Level 3
Basic Linear Algebra Subprograms. ACM Trans. Math. Software,
16:1-17, 1990.

16 / 52
In other words, a library of subroutines for matrix-matrix multiplication,
matrix-vector multiplication, and other similar operations is available with
a standard Fortran or C interface on high performance machines (and
many others), but underneath they have been optimized for each
machine. Our goal is to take advantage of these optimized BLAS by
reorganizing algorithms like Cholesky so that they call the BLAS to
perform most of their work.
Table 2.1 counts the number of memory references and floating point operations performed by three related BLAS. For example, the number of
memory references needed to implement the saxpy operation in line 1 of
the table is 3n + 1, because we need to read n values of xi , n values of yi ,
and 1 value of α from slow memory to registers, and then write n values
of yi back to slow memory. The last column gives the ratio q of flops to
memory references (its highest-order term in n only).

17 / 52
The significance of q is that it tells us roughly how many flops we can perform per memory reference, i.e., how much useful work we can do compared to the time spent moving data, and hence how fast the algorithm can potentially run. For example, suppose that an algorithm performs f floating point operations, each of which takes t_arith seconds, and m memory references, each of which takes t_mem seconds. Then the total running time is as large as

f · t_arith + m · t_mem = f · t_arith · (1 + (m/f) · (t_mem/t_arith)) = f · t_arith · (1 + (1/q) · (t_mem/t_arith)),

assuming that the arithmetic and memory references are not performed in parallel. Therefore, the larger the value of q, the closer the running time is to the best possible running time f · t_arith, which is how long the algorithm would take if all data were in registers. This means that algorithms with larger q values are better building blocks for other algorithms.
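
To make the model concrete, the snippet below evaluates the slowdown factor 1 + (1/q)·(t_mem/t_arith) for the three BLAS levels of Table 2.1, using an assumed ratio t_mem/t_arith = 10 that is purely illustrative and not a measured value:

ratio = 10.0   # assumed t_mem / t_arith, for illustration only
for name, q in [("BLAS1 (saxpy)", 2.0 / 3.0),
                ("BLAS2 (mat-vec)", 2.0),
                ("BLAS3 (mat-mat, n = 500)", 250.0)]:
    slowdown = 1.0 + ratio / q
    print(f"{name:25s} q = {q:7.2f}  predicted time ~ {slowdown:5.2f} * f*t_arith")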

18 / 52
Table 2.1 reflects a hierarchy of operations. Operations such as saxpy perform O(n) flops on vectors and offer the worst q values; these are called Level 1 BLAS, or BLAS1 [C. Lawson, R. Hanson, D. Kincaid, and F. Krogh. Basic Linear Algebra Subprograms for Fortran usage. ACM Trans. Math. Software, 5:308-323, 1979], and include inner products, multiplying a scalar times a vector, and other simple operations.
Operations such as matrix-vector multiplication perform O(n²) flops on matrices and vectors and offer slightly better q values; these are called Level 2 BLAS, or BLAS2 [J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. Algorithm 656: An extended set of FORTRAN Basic Linear Algebra Subroutines. ACM Trans. Math. Software, 14:18-32, 1988; J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of FORTRAN Basic Linear Algebra Subroutines. ACM Trans. Math. Software, 14:1-17, 1988], and include solving triangular systems of equations and rank-1 updates of matrices (A + xyᵀ, with x and y column vectors).
Operations such as matrix-matrix multiplication perform O(n³) flops on pairs of matrices and offer the best q values; these are called Level 3 BLAS, or BLAS3 [J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. Algorithm 679: A set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Software, 16:18-28, 1990; J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. Math. Software, 16:1-17, 1990], and include solving triangular systems of equations with many right-hand sides.

19 / 52
Table 2.1. Counting floating point operations and memory references for
the BLAS. f is the number of floating point operations, and m is the
number of memory references.
Operation                    Definition                                        f      m         q = f/m
saxpy (BLAS1)                y = α·x + y, i.e. yi = α·xi + yi,                  2n     3n + 1    2/3
                             i = 1, …, n
Matrix-vector mult (BLAS2)   y = A·x + y, i.e. yi = Σ_{j=1}^{n} aij·xj + yi,    2n²    n² + 3n   2
                             i = 1, …, n
Matrix-matrix mult (BLAS3)   C = A·B + C, i.e. cij = Σ_{k=1}^{n} aik·bkj + cij, 2n³    4n²       n/2
                             i, j = 1, …, n

The directory NETLIB/blas includes documentation and (unoptimized) implementations of all the BLAS. For a quick summary of all the BLAS, see NETLIB/blas/blasqr.ps.

20 / 52
How to Optimize Matrix Multiplication

Let us examine in detail how to implement matrix multiplication C = A · B + C to minimize the number of memory moves and so
optimize its performance. We will see that the performance is sensitive to
the implementation details. To simplify our discussion, we will use the
following machine model. We assume that matrices are stored
columnwise, as in Fortran. (It is easy to modify the examples below if
matrices are stored rowwise as in C.) We assume that there are two levels
of memory hierarchy, fast and slow, where the slow memory is large
enough to contain the three n × n matrices A, B, and C , but the fast
memory contains only M words where 2n < M ≪ n2 ; this means that the
fast memory is large enough to hold two matrix columns or rows but not
a whole matrix. We further assume that the data movement is under
programmer control. (In practice, data movement may be done
automatically by hardware, such as the cache controller. Nonetheless, the
basic optimization scheme remains the same.)

21 / 52
The simplest matrix-multiplication algorithm that one might try consists
of three nested loops, which we have annotated to indicate the data
movements.
ALGORITHM 2.6. Unblocked matrix multiplication (annotated to
indicate memory activity):
for i = 1 to n
    { Read row i of A into fast memory }
    for j = 1 to n
        { Read Cij into fast memory }
        { Read column j of B into fast memory }
        for k = 1 to n
            Cij = Cij + Aik · Bkj
        end for
        { Write Cij back to slow memory }
    end for
end for

22 / 52
Here is the detailed count of memory references: n³ for reading B n times (once for each value of i); n² for reading A one row at a time and keeping it in fast memory until it is no longer needed; and 2n² for reading one entry of C at a time, keeping it in fast memory until it is completely computed, and then moving it back to slow memory. This comes to n³ + 3n² memory moves, or q = 2n³/(n³ + 3n²) ≈ 2, which is no better
than the Level 2 BLAS and far from the maximum possible n/2 (see
Table 2.1). If M ≪ n, so that we cannot keep a full row of A in fast
memory, q further decreases to 1, since the algorithm reduces to a
sequence of inner products, which are Level 1 BLAS. For every
permutation of the three loops on i, j, and k, one gets another algorithm
with q about the same.
Our preferred algorithm uses blocking, where C is broken into an N × N block matrix with n/N × n/N blocks Cij, and A and B are similarly partitioned.

23 / 52
ALGORITHM 2.7. Blocked matrix multiplication (annotated to
indicate memory activity):
for i = 1 to N
    for j = 1 to N
        { Read Cij into fast memory }
        for k = 1 to N
            { Read Aik into fast memory }
            { Read Bkj into fast memory }
            Cij = Cij + Aik · Bkj
        end for
        { Write Cij back to slow memory }
    end for
end for
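
A NumPy sketch of Algorithm 2.7 (assuming, for simplicity, that N divides n; the block products stand in for the innermost loops, and NumPy views make the write-back implicit):

import numpy as np

def blocked_matmul(A, B, C, N):
    n = A.shape[0]
    bs = n // N                                      # block size n/N; assume N divides n
    for i in range(N):
        for j in range(N):
            Cij = C[i*bs:(i+1)*bs, j*bs:(j+1)*bs]    # "read Cij"; this is a view into C
            for k in range(N):
                Aik = A[i*bs:(i+1)*bs, k*bs:(k+1)*bs]
                Bkj = B[k*bs:(k+1)*bs, j*bs:(j+1)*bs]
                Cij += Aik @ Bkj                     # Cij = Cij + Aik * Bkj
            # Cij is a view, so the update is already written back into C
    return C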

24 / 52
The number of memory references

Our memory reference count is as follows: 2n² for reading and writing each block of C once, N·n² for reading A N times (each n/N-by-n/N submatrix Aik is read N times), and N·n² for reading B N times (each n/N-by-n/N submatrix Bkj is read N times), for a total of (2N + 2)n² ≈ 2Nn² memory references.
So we want to choose N as small as possible to minimize the number of memory references. But N is subject to the constraint M ≥ 3(n/N)², which means that one block each from A, B, and C must fit in fast memory simultaneously.
This yields N ≈ n·√(3/M), and so q ≈ 2n³/(2Nn²) = n/N ≈ √(M/3), which is much better than the previous algorithm.

25 / 52
The number of memory references

In particular q grows independently of n as M grows, which means that we expect the algorithm to be fast for any matrix size n and to go faster if the fast memory size M is increased. These are both attractive properties.
In fact, it can be shown that Algorithm 2.7 is asymptotically optimal [X. Hong and H. T. Kung. I/O complexity: The red blue pebble game. In Proc. of the 13th Symposium on the Theory of Computing, ACM, New York, 1981]. In other words, no reorganization of matrix-matrix multiplication (that performs the same 2n³ arithmetic operations) can have a q larger than O(√M).

26 / 52
The number of memory references

On the other hand, this brief analysis ignores a number of practical issues:
1 A real code will have to deal with nonsquare matrices, for
which the optimal block sizes may not be square.
2 The cache and register structure of a machine will strongly
affect the best shapes of submatrices.
3 There may be special hardware instructions that perform both
a multiplication and an addition in one cycle. It may also be
possible to execute several multiply-add operations
simultaneously if they do not interfere.

27 / 52
Both the above matrix-matrix multiplication algorithms perform 2n³ arithmetic operations. It turns out that there are other implementations of matrix-matrix multiplication that use far fewer operations.
Strassen's method [A. Aho, J. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974] was the first of these algorithms to be discovered and is the simplest to explain.
This algorithm multiplies matrices recursively by dividing them into 2 × 2 block matrices and multiplying the subblocks using seven matrix multiplications (recursively) and 18 matrix additions of half the size; this leads to an asymptotic complexity of n^(log₂ 7) ≈ n^2.81 instead of n³.

28 / 52
ALGORITHM 2.8. Strassen’s matrix multiplication algorithm:
C = Strassen(A, B, n)
/* Return C = A ∗ B, where A and B are n-by-n; assume n is a power of 2 */
if n = 1
    return C = A ∗ B   /* scalar multiplication */
else
    Partition A = [ A11  A12 ; A21  A22 ] and B = [ B11  B12 ; B21  B22 ],
        where the subblocks Aij and Bij are n/2-by-n/2
    P1 = Strassen( A12 − A22 , B21 + B22 , n/2 )
    P2 = Strassen( A11 + A22 , B11 + B22 , n/2 )
    P3 = Strassen( A11 − A21 , B11 + B12 , n/2 )
    P4 = Strassen( A11 + A12 , B22 , n/2 )
    P5 = Strassen( A11 , B12 − B22 , n/2 )
    P6 = Strassen( A22 , B21 − B11 , n/2 )
    P7 = Strassen( A21 + A22 , B11 , n/2 )
    C11 = P1 + P2 − P4 + P6
    C12 = P4 + P5
    C21 = P6 + P7
    C22 = P2 − P3 + P5 − P7
    return C = [ C11  C12 ; C21  C22 ]
end if
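
A direct NumPy transcription of Algorithm 2.8, as a sketch only: it assumes n is a power of 2 and recurses all the way to 1-by-1 blocks, whereas a practical code would switch to ordinary multiplication below some cutoff.

import numpy as np

def strassen(A, B):
    n = A.shape[0]
    if n == 1:
        return A * B                                   # scalar multiplication
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    P1 = strassen(A12 - A22, B21 + B22)
    P2 = strassen(A11 + A22, B11 + B22)
    P3 = strassen(A11 - A21, B11 + B12)
    P4 = strassen(A11 + A12, B22)
    P5 = strassen(A11, B12 - B22)
    P6 = strassen(A22, B21 - B11)
    P7 = strassen(A21 + A22, B11)
    C11 = P1 + P2 - P4 + P6
    C12 = P4 + P5
    C21 = P6 + P7
    C22 = P2 - P3 + P5 - P7
    return np.block([[C11, C12], [C21, C22]])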

29 / 52
Complexity of Strassen’s algorithm

It is straightforward to confirm by induction that this algorithm multiplies matrices correctly.
To show that its complexity is O(n^(log₂ 7)), we let T(n) be the number of additions, subtractions, and multiplications performed by the algorithm. Since the algorithm performs 7 recursive calls on matrices of size n/2, and 18 additions of n/2-by-n/2 matrices, we can write down the recurrence T(n) = 7T(n/2) + 18(n/2)². Changing variables from n to m = log₂ n, we get a new recurrence T̄(m) = 7T̄(m − 1) + 18(2^(m−1))², where T̄(m) = T(2^m). We can confirm that this linear recurrence for T̄ has a solution T̄(m) = O(7^m) = O(n^(log₂ 7)).
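
A quick numerical check of this recurrence (with the arbitrary base case T(1) = 1): the ratio T(n)/n^(log₂ 7) settles toward a constant, which is what O(n^(log₂ 7)) predicts.

import math

def T(n):
    # T(n) = 7 T(n/2) + 18 (n/2)^2, with T(1) = 1 as an illustrative base case
    return 1 if n == 1 else 7 * T(n // 2) + 18 * (n // 2) ** 2

for n in [2**k for k in range(1, 11)]:
    print(n, T(n) / n ** math.log2(7))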

30 / 52
Special Linear Systems

It is important to exploit any special structure of the matrix to increase the speed of solution and decrease storage. We will consider only real matrices:
s.p.d. matrices,
symmetric indefinite matrices,
band matrices,
general sparse matrices,
dense matrices depending on fewer than n² independent parameters.

31 / 52
2.7.1. Real Symmetric Positive Definite Matrices

Recall that a real matrix A is s.p.d. if and only if A = Aᵀ and xᵀAx > 0 for all x ≠ 0. In this section we will show how to solve Ax = b in half the time and half the space of Gaussian elimination when A is s.p.d.
PROPOSITION 2.2.
1. If X is nonsingular, then A is s.p.d. if and only if X T AX is s.p.d.
2. If A is s.p.d. and H is any principal submatrix of A(H = A(j : k, j : k)
for some j ≤ k), then H is s.p.d.
3. A is s.p.d. if and only if A = AT and all its eigenvalues are positive.
4. If A is s.p.d., then all aii > 0, and maxij |aij | = maxi aii > 0.
5. A is s.p.d. if and only if there is a unique lower triangular nonsingular
matrix L, with positive diagonal entries, such that A = LLT . A = LLT is
called the Cholesky factorization of A, and L is called the Cholesky factor
of A.

32 / 52
Proof.
1. If X is nonsingular, then A is s.p.d. if and only if XᵀAX is s.p.d.
X nonsingular implies Xx ≠ 0 for all x ≠ 0, so xᵀXᵀAXx > 0 for all x ≠ 0. So A s.p.d. implies XᵀAX is s.p.d. Use X⁻¹ to deduce the other implication.

33 / 52
2. If A is s.p.d. and H is any principal submatrix of
A(H = A(j : k, j : k) for some j ≤ k), then H is s.p.d.
Suppose first that H = A(1 : m, 1 : m). Then given any m-vector
y , the n-vector x = [y T , O]T satisfies y T Hy = x T Ax. So if
x T Ax > 0 for all nonzero x, then y T Hy > 0 for all nonzero y , and
so H is s.p.d. If H does not lie in the upper left corner of A, let P
be a permutation so that H does lie in the upper left corner of
P T AP and apply Part 1.

34 / 52
3. A is s.p.d. if and only if A = AT and all its eigenvalues are
positive.
Let X be the real, orthogonal eigenvector matrix of A so that XᵀAX = Λ is the diagonal matrix of real eigenvalues λi. Since xᵀΛx = Σi λi·xi², Λ is s.p.d. if and only if each λi > 0. Now apply Part 1.

35 / 52
4. If A is s.p.d., then all aii > 0, and maxij |aij| = maxi aii > 0.
Let ei be the i-th column of the identity matrix. Then eiᵀAei = aii > 0 for all i. If |akl| = maxij |aij| but k ≠ l, choose x = ek − sign(akl)·el. Then xᵀAx = akk + all − 2|akl| ≤ 0, contradicting positive-definiteness.

36 / 52
5. A is s.p.d. if and only if there is a unique lower triangular
nonsingular matrix L, with positive diagonal entries, such that
A = LLT . A = LLT is called the Cholesky factorization of A, and L
is called the Cholesky factor of A.
Suppose A = LLᵀ with L nonsingular. Then xᵀAx = (xᵀL)(Lᵀx) = ||Lᵀx||₂² > 0 for all x ≠ 0, so A is s.p.d. If A is s.p.d., we show that L exists by induction on the dimension n. If we choose each lii > 0, our construction will determine L uniquely. If n = 1, choose l11 = √a11, which exists since a11 > 0.
As with Gaussian elimination, it suffices to understand the block
2-by-2 case.

37 / 52
Write

A = [ a11    A12
      A12ᵀ   A22 ]

  = [ √a11        0 ]   [ 1   0   ]   [ √a11   A12/√a11 ]
    [ A12ᵀ/√a11   I ] · [ 0   Ã22 ] · [ 0      I        ]

  = [ a11    A12
      A12ᵀ   Ã22 + A12ᵀA12/a11 ],

so the (n − 1)-by-(n − 1) matrix Ã22 = A22 − A12ᵀA12/a11 is symmetric.

38 / 52
 
By Part 1 above, [ 1  0 ; 0  Ã22 ] is s.p.d., so by Part 2 Ã22 is s.p.d. Thus by induction there exists an L̃ such that Ã22 = L̃L̃ᵀ, and

A = [ √a11        0 ]   [ 1   0    ]   [ √a11   A12/√a11 ]
    [ A12ᵀ/√a11   I ] · [ 0   L̃L̃ᵀ ] · [ 0      I        ]

  = [ √a11        0 ]   [ √a11   A12/√a11 ]
    [ A12ᵀ/√a11   L̃ ] · [ 0      L̃ᵀ      ]   ≡ LLᵀ.  □

39 / 52
We may rewrite this induction as the following algorithm.
ALGORITHM 2.11. Cholesky algorithm:
for j = 1 to n
    ljj = ( ajj − Σ_{k=1}^{j−1} ljk² )^(1/2)
    for i = j + 1 to n
        lij = ( aij − Σ_{k=1}^{j−1} lik·ljk ) / ljj
    end for
end for
If A is not positive definite, then (in exact arithmetic) this
algorithm will fail by attempting to compute the square root of a
negative number or by dividing by zero; this is the cheapest way to
test if a symmetric matrix is positive definite.
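
A NumPy sketch of Algorithm 2.11; the error raised on a nonpositive pivot is exactly the cheap positive-definiteness test mentioned above. (LAPACK's spotrf is the real, blocked implementation.)

import numpy as np

def cholesky(A):
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        s = A[j, j] - np.dot(L[j, :j], L[j, :j])
        if s <= 0.0:
            # square root of a nonpositive number: A is not positive definite
            raise ValueError("matrix is not positive definite")
        L[j, j] = np.sqrt(s)
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
    return L   # A = L @ L.T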

40 / 52
The number of flops in Cholesky algorithm

As with Gaussian elimination, L can overwrite the lower half of A. Only the lower half of A is referred to by the algorithm, so in fact only n(n + 1)/2 storage is needed instead of n². The number of flops is

Σ_{j=1}^{n} ( 2j + Σ_{i=j+1}^{n} 2j ) = (1/3)n³ + O(n²),

or just half the flops of Gaussian elimination. Just as with Gaussian elimination, Cholesky may be reorganized to perform most of its floating point operations using Level 3 BLAS; see LAPACK routine spotrf.

41 / 52
Pivoting is not necessary for Cholesky to be numerically stable
(equivalently, we could also say any diagonal pivot order is
numerically stable). We show this as follows. The same analysis as
for Gaussian elimination in section 2.4.2 shows that the computed
solution x̂ satisfies (A + δA)x̂ = b with |δA| ≤ 3nε|L| · |LT |. But
by the Cauchy-Schwarz inequality and Part 4 of Proposition 2.2

(|L| · |Lᵀ|)ij = Σk |lik| · |ljk| ≤ ( Σk lik² )^(1/2) · ( Σk ljk² )^(1/2) = √aii · √ajj ≤ maxi aii,

so || |L| · |Lᵀ| ||∞ ≤ n||A||∞ and ||δA||∞ ≤ 3n²ε||A||∞.

42 / 52
Symmetric Indefinite Matrices

The question of whether we can still save half the time and half the
space when solving a symmetric but indefinite (neither positive definite
nor negative definite) linear system naturally arises. It turns out to be
possible, but a more complicated pivoting scheme and factorization is
required. If A is nonsingular, one can show that there exists a permutation P, a unit lower triangular matrix L, and a block diagonal matrix D with 1-by-1 and 2-by-2 blocks such that PAPᵀ = LDLᵀ. To see why 2-by-2 blocks are needed in D, consider the matrix [ 0  1 ; 1  0 ].
This factorization can be computed stably, saving about half the work
and space compared to standard Gaussian elimination. The name of the
LAPACK subroutine which does this operation is ssysv. The algorithm is
described in [J. Bunch and L. Kaufman. Some stable methods for
calculating inertia and solving symmetric linear systems. Math. Comp.,
31:163-179, 1977].
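
As an illustration, assuming SciPy is available, scipy.linalg.ldl computes a factorization of this kind; applied to the 2-by-2 example above it returns a single 2-by-2 block in D.

import numpy as np
from scipy.linalg import ldl

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
L, D, perm = ldl(A, lower=True)    # P A P^T = L D L^T with 1x1/2x2 blocks in D
print(D)                           # here D is a single 2-by-2 block
print(np.allclose(L @ D @ L.T, A)) # the factors reproduce A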

43 / 52
Band Matrices

A matrix A is called a band matrix with lower bandwidth bL and upper bandwidth bU if aij = 0 whenever i > j + bL or i < j − bU:

        [ a11       ···   a1,bU+1                          0        ]
        [  ⋮         ⋱            ⋱                                 ]
    A = [ abL+1,1           ⋱           ⋱                           ]
        [             ⋱           ⋱          ⋱        an−bU,n       ]
        [ 0               an,n−bL        ···            an,n        ]

Band matrices arise often in practice and are useful to recognize because
their L and U factors are also ”essentially banded”, making them cheaper
to compute and store. We consider LU factorization without pivoting
and show that L and U are banded in the usual sense, with the same
band widths as A.

44 / 52
PROPOSITION 2.3. Let A be banded with lower bandwidth bL and
upper bandwidth bU . Let A = LU be computed without pivoting. Then
L has lower bandwidth bL and U has upper bandwidth bU. L and U can be computed in about 2n·bU·bL arithmetic operations when bU and bL are small compared to n. The space needed is about n(bL + bU + 1). The full cost of solving Ax = b is about 2n·bU·bL + 2n·bU + 2n·bL.
PROPOSITION 2.4. Let A be banded with lower bandwidth bL and
upper bandwidth bU . Then after Gaussian elimination with partial
pivoting, U is banded with upper bandwidth at most bL + bU , and L is
”essentially banded” with lower bandwidth bL . This means that L has at
most bL + 1 nonzeros in each column and so can be stored in the same
space as a band matrix with lower bandwidth bL .
Gaussian elimination and Cholesky for band matrices are available in LAPACK routines like sgbsv and spbsv.
Band matrices often arise from discretizing physical problems with
nearest neighbor interactions on a mesh (provided the unknowns are
ordered rowwise or columnwise).
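
For the simplest banded case, bL = bU = 1 (tridiagonal), band LU without pivoting reduces to the classic three-term recurrence below; this is only a sketch (LAPACK's sgtsv is the production tridiagonal solver), and the helper name and argument layout are our own.

import numpy as np

def solve_tridiagonal(l, d, u, b):
    # l: subdiagonal (length n-1), d: diagonal (length n), u: superdiagonal (length n-1)
    l = np.asarray(l, dtype=float)
    u = np.asarray(u, dtype=float)
    d = np.array(d, dtype=float)      # copied, will be overwritten by elimination
    b = np.array(b, dtype=float)
    n = len(d)
    for i in range(1, n):             # forward elimination (LU without pivoting)
        m = l[i - 1] / d[i - 1]
        d[i] -= m * u[i - 1]
        b[i] -= m * b[i - 1]
    x = np.empty(n)
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):    # back substitution
        x[i] = (b[i] - u[i] * x[i + 1]) / d[i]
    return x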

45 / 52
Example: ODE

EXAMPLE 2.8. Consider the ordinary differential equation (ODE) y″(x) − p(x)·y′(x) − q(x)·y(x) = r(x) on the interval [a, b] with boundary conditions y(a) = α, y(b) = β. We also assume q(x) ≥ q > 0. This equation may be used to model the heat flow in a long, thin rod, for example. To solve the differential equation numerically, we discretize it by seeking its solution only at the evenly spaced mesh points xi = a + ih, i = 0, …, N + 1, where h = (b − a)/(N + 1) is the mesh spacing. Define pi = p(xi), ri = r(xi), and qi = q(xi).

46 / 52
We need to derive equations to solve for our desired
approximations yi ≈ y (xi ), where y0 = α and yN+1 = β. To derive
these equations, we approximate the derivative y ′(xi ) by the
following finite difference approximation:

y′(xi) ≈ (yi+1 − yi−1) / (2h).

(Note that as h gets smaller, the right-hand side approximates y′(xi) more and more accurately.) We can similarly approximate the second derivative by

y″(xi) ≈ (yi+1 − 2yi + yi−1) / h².

Inserting these approximations into the differential equation yields

(yi+1 − 2yi + yi−1)/h² − pi · (yi+1 − yi−1)/(2h) − qi · yi = ri,   1 ≤ i ≤ N.

47 / 52
Rewriting this as a linear system we get Ay = b, where

    y = [ y1, …, yN ]ᵀ,
    b = −(h²/2) · [ r1, …, rN ]ᵀ + [ (1/2 + (h/4)·p1)·α, 0, …, 0, (1/2 − (h/4)·pN)·β ]ᵀ,

48 / 52
and

         [  a1   −c1                          ]
         [ −b2    a2   −c2                    ]        ai = 1 + (h²/2)·qi,
    A =  [        ⋱     ⋱      ⋱              ]        bi = (1/2)·(1 + (h/2)·pi),
         [              −bN−1   aN−1   −cN−1  ]        ci = (1/2)·(1 − (h/2)·pi).
         [                      −bN     aN    ]

Note that ai > 0, and also bi > 0 and ci > 0 if h is small enough.
This is a nonsymmetric tridiagonal system to solve for y . We will show
how to change it to a symmetric positive definite tridiagonal system, so
that we may use band Cholesky to solve it.
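
A sketch that assembles and solves this tridiagonal system. The coefficient functions p, q, r, the interval, and the boundary values below are made-up illustrative data, not part of the example in the lecture.

import numpy as np

a_end, b_end = 0.0, 1.0                        # interval [a, b] (assumed)
alpha, beta = 0.0, 0.0                         # boundary values (assumed)
p = lambda x: 0.0 * x                          # assumed p(x) = 0
q = lambda x: 1.0 + 0.0 * x                    # assumed q(x) = 1 >= q > 0
r = lambda x: -1.0 + 0.0 * x                   # assumed r(x) = -1

N = 99
h = (b_end - a_end) / (N + 1)
x = a_end + h * np.arange(1, N + 1)            # interior mesh points x_1, ..., x_N
ai = 1.0 + (h**2 / 2.0) * q(x)                 # diagonal entries a_i
bi = 0.5 * (1.0 + (h / 2.0) * p(x))            # -b_i goes on the subdiagonal
ci = 0.5 * (1.0 - (h / 2.0) * p(x))            # -c_i goes on the superdiagonal

A = np.diag(ai) + np.diag(-ci[:-1], 1) + np.diag(-bi[1:], -1)
b = -(h**2 / 2.0) * r(x)
b[0] += bi[0] * alpha                          # (1/2 + h p_1 / 4) * alpha
b[-1] += ci[-1] * beta                         # (1/2 - h p_N / 4) * beta
y = np.linalg.solve(A, b)                      # in practice: band Cholesky on the symmetrized system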

49 / 52
Choose D = diag( 1, √(c1/b2), √(c1c2/(b2b3)), …, √(c1c2···cN−1/(b2b3···bN)) ). Then we may change Ay = b to (DAD⁻¹)(Dy) = Db, or Ãỹ = b̃, where

         [  a1          −√(c1b2)                                        ]
         [ −√(c1b2)      a2         −√(c2b3)                            ]
    Ã =  [              −√(c2b3)     ⋱            ⋱                     ]
         [                            ⋱            ⋱      −√(cN−1·bN)   ]
         [                                   −√(cN−1·bN)       aN       ]

It is easy to see that à is symmetric, and it has the same eigenvalues as


A because A and à = DAD −1 are similar. We will use the next theorem
to show it is also positive definite.

50 / 52
Gershgorin’s Theorem

THEOREM 2.9 (Gershgorin). Let B be an arbitrary matrix. Then the eigenvalues λ of B are located in the union of the n disks

|λ − bkk| ≤ Σ_{j≠k} |bkj|,   k = 1, …, n.

Proof. Given λ and x ≠ 0 such that Bx = λx, let 1 = ||x||∞ = xk by scaling x if necessary. Then Σ_{j=1}^{n} bkj·xj = λ·xk = λ, so λ − bkk = Σ_{j≠k} bkj·xj, implying

|λ − bkk| ≤ Σ_{j≠k} |bkj·xj| ≤ Σ_{j≠k} |bkj|.  □
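
A small sketch that computes the Gershgorin disk centers and radii of a matrix row by row; applied to the tridiagonal A of Example 2.8 it reproduces the bound used on the next slide (the function name is ours).

import numpy as np

def gershgorin_disks(B):
    centers = np.diag(B)
    radii = np.sum(np.abs(B), axis=1) - np.abs(centers)   # sum_{j != k} |b_kj|
    return centers, radii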

51 / 52
Example: ODE (continuation)

If h is so small that for all i, |(h/2)·pi| < 1, then

|bi| + |ci| = (1/2)·(1 + (h/2)·pi) + (1/2)·(1 − (h/2)·pi) = 1 < 1 + (h²/2)·q ≤ 1 + (h²/2)·qi = ai.

Therefore all eigenvalues of A lie inside the disks centered at ai = 1 + h²qi/2 ≥ 1 + h²q/2 with radius 1; in particular, they must all have positive real parts.
Since Ã is symmetric, its eigenvalues are real; since they are the same as the eigenvalues of A, they are positive, so Ã is positive definite. Its smallest eigenvalue is bounded below by qh²/2.
Thus the system can be solved by Cholesky. The LAPACK subroutine for solving a symmetric positive definite tridiagonal system is sptsv.

52 / 52
