Outline of Next 2 Lectures: Matrix Computations: Direct Methods I
Outline of Next 2 Lectures
• Parallel algorithms for these matrix operations
with complexity estimates
• Existing parallel linear algebra subroutines and
libraries (PBLAS, ScaLAPACK, ATLAS, etc.)
• Similar discussions for these operations
performed on sparse matrices
Motivation: Dense Linear Algebra
• In such cases, we may be solving for only a subset of the problem each time we solve the equation Ax = b, since A can be written in the block form

      Ax = ( A_I  A_J  A_K ) x = b

  with some factorization error
• It is true that these matrices are typically banded, so the cost of a full factorization is not always necessary. Bear with us.
Example: Deriving the Heat Equation
[Figure: a bar from 0 to 1, with interior grid points x-h, x, x+h]
Consider a simple problem
• A bar of uniform material, insulated except at the ends
• Let u(x,t) be the temperature at position x at time t
• Heat travels from x-h to x (and from x to x+h) at a rate proportional to the temperature difference, e.g. (u(x-h,t) - u(x,t))/h
• Balancing the flow in and out of x and letting h → 0 gives the heat equation du/dt = C * d²u/dx²
Implicit Solution
• As with many (stiff) ODEs, need an implicit method
• This turns into solving the following equation
      (I + (z/2)*T) * u[:,i+1] = (I - (z/2)*T) * u[:,i]
• Here I is the identity matrix and T is the tridiagonal matrix

          [  2 -1          ]
          [ -1  2 -1       ]
      T = [    -1  2 -1    ]
          [       -1  2 -1 ]
          [          -1  2 ]

[Figure: graph of the 1D grid and its three-point "stencil" (-1, 2, -1)]
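• A minimal NumPy sketch of one such implicit step (dense solve for clarity; the step size z and grid size below are illustrative assumptions, not from the lecture):

      import numpy as np

      def implicit_step(u, z):
          """One step of (I + (z/2)*T) u_new = (I - (z/2)*T) u_old,
          with T the tridiagonal (-1, 2, -1) matrix above."""
          n = len(u)
          T = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
          A = np.eye(n) + (z/2) * T
          B = np.eye(n) - (z/2) * T
          return np.linalg.solve(A, B @ u)   # a banded solver would be cheaper

      u = np.zeros(9); u[4] = 1.0            # heat spike in the middle of the bar
      for _ in range(10):
          u = implicit_step(u, z=0.5)        # stable even for large steps z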
2D Implicit Method
• Similar to the 1D case, but the matrix T is now
[Figure: graph of the 2D grid and its five-point "stencil" (4 at the center, -1 at the four neighbors)]

          [  4 -1  .  -1  .  .   .  .  . ]
          [ -1  4 -1   . -1  .   .  .  . ]
          [  . -1  4   .  . -1   .  .  . ]
          [ -1  .  .   4 -1  .  -1  .  . ]
      T = [  . -1  .  -1  4 -1   . -1  . ]
          [  .  . -1   . -1  4   .  . -1 ]
          [  .  .  .  -1  .  .   4 -1  . ]
          [  .  .  .   . -1  .  -1  4 -1 ]
          [  .  .  .   .  . -1   . -1  4 ]

      (shown for a 3 x 3 grid; dots denote zeros)
• Multiplying by this matrix (as in the explicit case) is simply a nearest-neighbor computation on the 2D grid (see the sketch below)
• To solve this system, there are several techniques
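• A sketch of that nearest-neighbor multiply, assuming zero values outside the grid boundary; the final assertion checks it against the explicit 9x9 matrix for a 3x3 grid:

      import numpy as np

      def apply_T_2d(U):
          """Apply the 2D matrix T to a grid U without forming T:
          4*(center value) minus the four nearest neighbors."""
          V = 4.0 * U
          V[1:, :]  -= U[:-1, :]   # neighbor above
          V[:-1, :] -= U[1:, :]    # neighbor below
          V[:, 1:]  -= U[:, :-1]   # neighbor to the left
          V[:, :-1] -= U[:, 1:]    # neighbor to the right
          return V

      U = np.random.rand(3, 3)
      T1 = 2*np.eye(3) - np.eye(3, k=1) - np.eye(3, k=-1)
      T9 = np.kron(np.eye(3), T1) + np.kron(T1, np.eye(3))
      assert np.allclose(apply_T_2d(U).ravel(), T9 @ U.ravel())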
Building Blocks in Linear Algebra
• BLAS (Basic Linear Algebra Subprograms) was created/defined in 1979 by Lawson et al.
• BLAS intends to modularize problems in linear
algebra by identifying typical operations present in
complex algorithms in linear algebra, and defining
a standard interface to them
• This way, hardware vendors can optimize their
own version of BLAS and allow users’ programs
to run efficiently with simple recompilation
• Optimized BLAS implementations are usually
hand-tuned (and coded in assembly language)
Building Blocks in Linear Algebra
• BLAS advantages:
– Robustness: BLAS routines are programmed with
robustness in mind. Various exit conditions can be
diagnosed from the routines themselves, overflow is
predicted, and general pivoting algorithms are
implemented
– Portability: the calling API is fixed; hardware vendors
optimize behind-the-scenes
– Readability: since BLAS routine names are widely recognized, one knows exactly what a program is doing by reading the source code; the calls are self-documenting
BLAS Level 1 Routines
• Typical operations (vector-vector, O(n) work):

      y    ← α x + y                                    (axpy)
      x    ← α x                                        (scal)
      dot  ← xᵀ y                                       (dot)
      asum ← ‖re(x)‖₁ + ‖im(x)‖₁                        (asum)
      nrm2 ← ‖x‖₂                                       (nrm2)
      amax ← first k such that |re(x_k)| + |im(x_k)| = ‖x‖∞   (i_amax)
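• These map onto the double-precision Level 1 routines; a sketch using SciPy's BLAS wrappers (values chosen for illustration):

      import numpy as np
      from scipy.linalg import blas

      x = np.array([3.0, -4.0])
      y = np.array([1.0,  1.0])

      y  = blas.daxpy(x, y, a=2.0)    # y <- alpha*x + y   -> [7., -7.]
      x2 = blas.dscal(0.5, x.copy())  # x <- alpha*x       -> [1.5, -2.]
      d  = blas.ddot(x, y)            # dot <- x^T y       -> 49.0
      s  = blas.dasum(x)              # asum: |3| + |-4|   -> 7.0
      r  = blas.dnrm2(x)              # nrm2: ||x||_2      -> 5.0
      k  = blas.idamax(x)             # index of largest |x_k| (0-based in SciPy)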
BLAS Level 2 Routines
• Level 1 BLAS routines do not have enough granularity to achieve high performance: registers must be reused because memory accesses are expensive on current chip architectures
• Optimization at least at the level of matrix-vector operations is necessary; Level 1 prevents this by hiding the surrounding loop structure from the compiler
• Level 2 BLAS includes these kinds of operations
which typically involve O(m n) operations, where
the matrices involved have size m x n
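• For instance, the Level 2 matrix-vector product gemv is available through SciPy's BLAS wrappers (a sketch; the data are made up):

      import numpy as np
      from scipy.linalg import blas

      A = np.asfortranarray([[1.0, 2.0],
                             [3.0, 4.0]])   # Fortran layout suits the BLAS
      x = np.array([1.0, 1.0])
      y = np.array([1.0, 0.0])

      # y <- alpha*A*x + beta*y in a single Level 2 call: O(m*n) work
      y = blas.dgemv(alpha=1.0, a=A, x=x, beta=1.0, y=y)
      print(y)                               # [4., 7.]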
BLAS Level 2 Routines
• Additional operations for banded, Hermitian,
triangular, etc. matrices are also available (look at
“man blas” on junior)
• Efficiency of implementations can be increased this way, but there is a drawback on cache-based architectures, which want to reuse cached data as much as possible: matrix-vector operations perform only O(1) work per matrix element, so little reuse is possible
• Level 3 BLAS addresses this problem
BLAS Level 3 Routines
• Typical operations involve matrix-matrix products
      C = α A B + β C
      C = α A Aᵀ + β C
      B = α T B
      B = α T⁻¹ B
• as well as rank-k updates and solutions of systems involving triangular matrices
• Better performance is achieved because O(n³) operations are performed on only O(n²) data, so each cached entry can be reused many times
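• As a concrete illustration, the Level 3 routines gemm and trsm can be called directly through SciPy (a sketch; sizes are arbitrary):

      import numpy as np
      from scipy.linalg import blas

      n = 200
      A, B = np.random.rand(n, n), np.random.rand(n, n)
      C = np.zeros((n, n))

      # C <- alpha*A*B + beta*C: O(n^3) flops on O(n^2) data, so each
      # entry brought into cache can be reused about n times
      C = blas.dgemm(alpha=1.0, a=A, b=B, beta=0.0, c=C)

      # B <- alpha * T^{-1} * B for an upper triangular T (trsm)
      T = np.triu(np.random.rand(n, n)) + n * np.eye(n)
      X = blas.dtrsm(alpha=1.0, a=T, b=B)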
[Figure: performance vs. matrix size: BLAS 3 approaches machine peak, while BLAS 2 and BLAS 1 fall well below]
Matrix Problem Solution, Ax=b
• The main steps in the solution process are
– Fill: computing the matrix elements of A
– Factor: factoring the dense matrix A
– Solve: solving for one or more right-hand sides b (see the sketch below)
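• A sketch of this factor-once, solve-many pattern with SciPy (the matrix entries here are made up for illustration):

      import numpy as np
      from scipy.linalg import lu_factor, lu_solve

      n = 100
      A = np.random.rand(n, n) + n * np.eye(n)   # "fill": build the matrix
      lu, piv = lu_factor(A)                     # "factor": O(n^3), paid once

      for b in (np.ones(n), np.arange(n, dtype=float)):
          x = lu_solve((lu, piv), b)             # "solve": O(n^2) per rhs b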
Refine GE Algorithm (1)
• Initial Version

      … for each column i,
      … zero it out below the diagonal by adding multiples of row i to later rows
      for i = 1 to n-1
          … for each row j below row i
          for j = i+1 to n
              … add a multiple of row i to row j
              for k = i to n
                  A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
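• A direct Python translation of this version (a sketch, 0-based indexing). One subtlety: the quotient A(j,i)/A(i,i) must be read before the k = i update overwrites A(j,i), which the refinement on the next slide makes explicit:

      import numpy as np

      def ge_initial(A):
          """Gaussian elimination, translated from the pseudocode above."""
          n = A.shape[0]
          for i in range(n - 1):           # for each column i
              for j in range(i + 1, n):    # for each row j below row i
                  m = A[j, i] / A[i, i]    # read the multiplier first
                  for k in range(i, n):    # add a multiple of row i to row j
                      A[j, k] -= m * A[i, k]
          return A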
Refine GE Algorithm (3)
• Last version

      for i = 1 to n-1
          for j = i+1 to n
              m = A(j,i)/A(i,i)
              for k = i+1 to n
                  A(j,k) = A(j,k) - m * A(i,k)

• The same update written with matrix/vector slices:

      for i = 1 to n-1
          A(i+1:n,i) = A(i+1:n,i) / A(i,i)
          A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
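• A minimal NumPy sketch of the sliced version (illustrative only):

      import numpy as np

      def ge_vectorized(A):
          """Each step is one vector scaling (BLAS 1) and one
          rank-1 update of the trailing submatrix (BLAS 2)."""
          n = A.shape[0]
          for i in range(n - 1):
              A[i+1:, i] /= A[i, i]                               # multipliers
              A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update
          return A

      A = np.array([[2.0, 1.0], [1.0, 3.0]])
      ge_vectorized(A)   # A is now [[2, 1], [0.5, 2.5]]:
                         # U on/above the diagonal, multipliers below it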
What GE really computes

      for i = 1 to n-1
          A(i+1:n,i) = A(i+1:n,i) / A(i,i)
          A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)

• The strictly lower triangle of the final A holds the multipliers M; with L = I + M and U the upper triangle of the final A, GE has computed the factorization A = L*U
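• This claim is easy to check numerically; a sketch (diagonal dominance is assumed so that no pivoting is needed):

      import numpy as np

      n = 5
      A0 = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant
      A = A0.copy()
      for i in range(n - 1):                      # the loop from this slide
          A[i+1:, i] /= A[i, i]
          A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])

      L = np.tril(A, -1) + np.eye(n)   # multipliers + unit diagonal
      U = np.triu(A)                   # upper triangle of the overwritten A
      assert np.allclose(L @ U, A0)    # GE has indeed factored A = L*U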
[Figure: performance vs. matrix size: BLAS 3 approaches machine peak, while BLAS 2 and BLAS 1 fall well below]
Pivoting in Gaussian Elimination
° A = [ 0 1 ] fails completely (divide by the zero pivot A(1,1)), even though A is "easy"
      [ 1 0 ]
° A tiny pivot is almost as bad: with A(1,1) = 1e-4 and A(2,1) = 1, the result of LU decomposition is

      L = [ 1           0 ]  =  [ 1    0 ]   … no roundoff error yet
          [ fl(1/1e-4)  1 ]     [ 1e4  1 ]

° Algorithm "forgets" the (2,2) entry: it gets the same L and U for all |A(2,2)| < 5
° Numerical instability
° Computed solution x totally inaccurate
° Cure: pivot (swap rows of A) so the entries of L and U stay bounded
      for i = 1 to n-1
          find and record k where |A(k,i)| = max{i <= j <= n} |A(j,i)|
              … i.e. the largest entry in the rest of column i
          if |A(k,i)| = 0
              exit with a warning that A is singular, or nearly so
          elseif k != i
              swap rows i and k of A
          end if
          A(i+1:n,i) = A(i+1:n,i) / A(i,i)   … each quotient lies in [-1,1]
          A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
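° A runnable sketch of this partial-pivoting loop (NumPy, 0-based; the example matrix is the one from the top of the slide):

      import numpy as np

      def ge_partial_pivoting(A):
          """GE with partial pivoting: swap the largest remaining entry
          of column i into the pivot position before eliminating."""
          A = A.copy()
          n = A.shape[0]
          piv = np.arange(n)                        # records the row swaps
          for i in range(n - 1):
              k = i + np.argmax(np.abs(A[i:, i]))   # largest |A(k,i)|, k >= i
              if A[k, i] == 0:
                  raise ValueError("A is singular, or nearly so")
              if k != i:
                  A[[i, k]] = A[[k, i]]             # swap rows i and k of A
                  piv[[i, k]] = piv[[k, i]]
              A[i+1:, i] /= A[i, i]                 # each multiplier in [-1, 1]
              A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
          return A, piv

      # The matrix that defeated plain GE now factors without difficulty
      LU, piv = ge_partial_pivoting(np.array([[0.0, 1.0], [1.0, 0.0]]))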