
Matrix Computations: Direct Methods I

April 30, 2014


Lecture 10

Outline of Next 2 Lectures


•  Motivation for parallel solution of linear
algebra problems using direct methods
•  Brief discussion of existing sequential methods
for most relevant operations:
–  Gaussian Elimination / Matrix Factorization
–  Matrix-vector and matrix-matrix multiplication
–  Eigenvalue / eigenvector calculation

Outline of Next 2 Lectures
•  Parallel algorithms for these matrix operations
with complexity estimates
•  Existing parallel linear algebra subroutines and
libraries (PBLAS, ScaLAPACK, ATLAS, etc)
•  Similar discussions for these operations
performed on sparse matrices

Motivation: Dense Linear Algebra


•  Most problems in computational physics
can be reduced to the form
Ax = b
•  This is true whether the original problems
are linear or non-linear (after appropriate
linearization), whether they are 1D, 2D, or 3D,
and whether or not an approximate factorization
has been performed

Motivation: Dense Linear Algebra
•  In such cases, we may be solving for a
subset of the problem every time we solve
the equation A x = b, since A can be written
as
Ax = ( A_I A_J A_K ) x = b
with some factorization error
•  It is true that these matrices are typically
banded, and therefore, the cost of full
factorization is not necessary. Bear with us.

Motivation: Continuous Variables, Continuous Parameters

Examples of such systems include

•  Heat flow: Temperature(position, time)


•  Diffusion: Concentration(position, time)
•  Electrostatic or Gravitational Potential:
Potential(position)
•  Fluid flow: Velocity,Pressure,Density(position,time)
•  Quantum mechanics: Wave-function(position,time)
•  Elasticity: Stress,Strain(position,time)

Example: Deriving the Heat Equation

(diagram: a bar extending from 0 to 1, with interior grid points x-h, x, x+h)
Consider a simple problem
•  A bar of uniform material, insulated except at ends
•  Let u(x,t) be the temperature at position x at time t
•  Heat travels from x-h to x+h at a rate proportional to:

   d u(x,t)/dt = C * [ (u(x-h,t) - u(x,t))/h  -  (u(x,t) - u(x+h,t))/h ] / h

•  As h → 0, we get the heat equation:

   d u(x,t)/dt = C * d^2 u(x,t)/dx^2

Implicit Solution
•  As with many (stiff) ODEs, need an implicit method
•  This turns into solving the following equation
(I + (z/2)*T) * u[:,i+1] = (I - (z/2)*T) * u[:,i]
•  Here I is the identity matrix and T is the tridiagonal matrix

        2  -1
       -1   2  -1
           -1   2  -1
               -1   2  -1
                   -1   2

   (figure: graph of the 1D grid and its “stencil” [-1 2 -1])
•  I.e., essentially solving Poisson’s equation in 1D
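A minimal numpy sketch of one such implicit step (the name implicit_heat_step is illustrative and not from the lecture; z stands for the scaled time step C*dt/h^2 implied by the discretization, and the dense solve is only for clarity, since a real code would exploit the tridiagonal structure):

import numpy as np

def implicit_heat_step(u, z):
    # Illustrative sketch: solve (I + (z/2)*T) u_new = (I - (z/2)*T) u_old,
    # where T is the tridiagonal [-1 2 -1] matrix shown above.
    n = u.size
    T = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    I = np.eye(n)
    rhs = (I - (z/2)*T) @ u
    return np.linalg.solve(I + (z/2)*T, rhs)

# advance an initial temperature profile by one time step
u0 = np.sin(np.linspace(0.0, np.pi, 50))
u1 = implicit_heat_step(u0, z=0.1)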

2D Implicit Method
•  Similar to the 1D case, but the matrix T is now
   4  -1          -1
  -1   4  -1          -1
      -1   4              -1
  -1           4  -1          -1
      -1      -1   4  -1          -1
          -1      -1   4              -1
              -1           4  -1
                  -1      -1   4  -1
                      -1      -1   4

   (figure: graph of the 2D grid and its 5-point “stencil”)
•  Multiplying by this matrix (as in the explicit case) is simply
nearest neighbor computation on 2D grid
•  To solve this system, there are several techniques
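As an aside, one standard way to assemble this 2D matrix is as a Kronecker sum of the 1D tridiagonal matrix; the scipy sketch below is illustrative and is not part of the lecture:

import numpy as np
from scipy.sparse import diags, identity, kron

def poisson_2d(k):
    # Illustrative sketch: build the k^2 x k^2 matrix T for a k x k grid
    # from the 1D tridiagonal [-1 2 -1] matrix via a Kronecker sum.
    T1 = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(k, k))
    I = identity(k)
    return kron(I, T1) + kron(T1, I)

T = poisson_2d(3)          # the 9 x 9 matrix shown above
print(T.toarray())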

Algorithms for 2D Poisson Equation with N unknowns

Algorithm       Serial        PRAM             Memory       #Procs

Dense LU        N^3           N                N^2          N^2
Band LU         N^2           N                N^(3/2)      N
Jacobi          N^2           N                N            N
Explicit Inv.   N^2           log N            N^2          N^2
Conj. Grad.     N^(3/2)       N^(1/2) log N    N            N
RB SOR          N^(3/2)       N^(1/2)          N            N
Sparse LU       N^(3/2)       N^(1/2)          N log N      N
FFT             N log N       log N            N            N
Multigrid       N             log^2 N          N            N
Lower bound     N             log N            N

Building Blocks in Linear Algebra
•  BLAS (Basic Linear Algebra Subprograms)
created / defined in 1979 by Lawson et al.
•  BLAS intends to modularize problems in linear
algebra by identifying typical operations present in
complex algorithms in linear algebra, and defining
a standard interface to them
•  This way, hardware vendors can optimize their
own version of BLAS and allow users’ programs
to run efficiently with simple recompilation
•  Optimized BLAS implementations are usually
hand-tuned (and coded in assembly language)

Building Blocks in Linear Algebra


•  BLAS routines have to be simple enough that high
levels of optimization can be obtained
•  BLAS routines have to be general enough so that
complex algorithms can be constructed as
sequences of calls to these basic routines
•  Others (LINPACK, LAPACK, EISPACK, etc.)
have followed suit and have tried to do a similar
job for a variety of linear algebra problems

Building Blocks in Linear Algebra
•  BLAS advantages:
–  Robustness: BLAS routines are programmed with
robustness in mind. Various exit conditions can be
diagnosed from the routines themselves, overflow is
predicted, and general pivoting algorithms are
implemented
–  Portability: the calling API is fixed; hardware vendors
optimize behind-the-scenes
–  Readability: since the names of BLAS routines are
fairly common, one knows exactly what a program is
doing by reading the source code; auto-documentation.

BLAS Level 1 Routines


•  Perform low level functions (typically operations
between vectors like dot products, sums, etc.)
•  4 or 5 letter names preceded by s, d, c, z to
indicate the precision type. For example, DAXPY is
the double-precision update y ← αx + y: add to one
vector a scalar multiple of another
•  Typical operations are O(n), where n is the length
of the vectors being operated on
•  The low ratio of floating-point operations to
memory loads and stores prevents these routines from
reaching high Mflop rates on most computers

BLAS Level 1 Routines
•  Typical operations

y ← α x + y
x ← α x
dot ← x^T y
asum ← ||re(x)||_1 + ||im(x)||_1
nrm2 ← ||x||_2
amax ← first k such that |re(x_k)| + |im(x_k)| = ||x||_∞
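For orientation, the same operations written with numpy (a sketch only; the corresponding BLAS routines are xAXPY, xSCAL, xDOT, xASUM, xNRM2 and IxAMAX):

import numpy as np

x = np.array([3.0, -1.0, 4.0])
y = np.array([1.0, 5.0, -9.0])
alpha = 2.0

y = alpha * x + y                                  # axpy:  y <- alpha*x + y
x = alpha * x                                      # scal:  x <- alpha*x
dot = x @ y                                        # dot:   x^T y
asum = np.sum(np.abs(x.real) + np.abs(x.imag))     # asum
nrm2 = np.linalg.norm(x)                           # nrm2:  ||x||_2
amax = np.argmax(np.abs(x.real) + np.abs(x.imag))  # amax:  index of the largest entry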

BLAS Level 1 Routines


•  One of the most basic operations, a matrix-vector
multiply, can be done as a sequence of n SAXPY
operations, but then the result vector is written to
memory and read back at every step, even though it
could have stayed in fast memory for the whole
computation.
•  BLAS Level 2 routines add functionality to help
out in this situation
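A small sketch of the first point (the name matvec_by_axpy is made up for illustration): computing y = A x column by column with axpy reads and rewrites the whole result vector n times, which is exactly the memory traffic a Level 2 routine avoids.

import numpy as np

def matvec_by_axpy(A, x):
    # Illustrative sketch: y = A x as a sequence of n axpy operations,
    # one per column of A; each step re-touches the entire vector y.
    m, n = A.shape
    y = np.zeros(m)
    for j in range(n):
        y = x[j] * A[:, j] + y        # axpy with alpha = x[j]
    return y

A = np.random.rand(4, 3)
x = np.random.rand(3)
assert np.allclose(matvec_by_axpy(A, x), A @ x)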

BLAS Level 2 Routines
•  Level 1 BLAS routines do not have enough
granularity to achieve high performance: reuse of
registers must occur because of high cost of
memory accesses and limitations in current chip
architectures
•  Optimization at least at the level of matrix-vector
operations is necessary. Level 1 disallows this by
hiding details from the compiler
•  Level 2 BLAS includes these kinds of operations
which typically involve O(m n) operations, where
the matrices involved have size m x n

BLAS Level 2 Routines


•  Typical operations involve:
y = α A x + β y
y = α A^T x + β y
y = T x
y = T^T x
x = T^(-1) x
•  as well as rank-1 and rank-2 updates to a matrix
(optimization).
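In numpy/scipy terms these look as follows (an illustrative sketch; the corresponding BLAS 2 routines are xGEMV, xTRMV and xTRSV, and the variable names here are made up):

import numpy as np
from scipy.linalg import solve_triangular

A = np.random.rand(4, 3)
T = np.triu(np.random.rand(4, 4)) + 4.0 * np.eye(4)   # nonsingular upper triangular
x, v = np.random.rand(3), np.random.rand(4)
y, w = np.random.rand(4), np.random.rand(3)
alpha, beta = 2.0, 0.5

y = alpha * A @ x + beta * y          # gemv:  y = alpha*A*x + beta*y
w = alpha * A.T @ v + beta * w        # gemv:  y = alpha*A^T*x + beta*y
z = T @ v                             # trmv:  y = T*x
zt = T.T @ v                          # trmv:  y = T^T*x
v = solve_triangular(T, v)            # trsv:  x = T^(-1)*x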

BLAS Level 2 Routines
•  Additional operations for banded, Hermitian,
triangular, etc. matrices are also available (look at
“man blas” on junior)
•  Efficiency of implementations can be increased in
this way, but there are drawbacks for cache-based
architectures which still want to reuse memory as
much as possible.
•  Level 3 BLAS addresses this problem

BLAS Level 3 Routines


•  Sometimes it is preferable to decompose matrices
into blocks to perform various operations on a
matrix-matrix basis
•  Data reuse is enhanced in this way
•  Typically obtain O(n^3) operations with O(n^2)
data references (similar to granularity surface-to-
volume effect discussed earlier)
•  Two opportunities for parallelism:
–  operations on distinct blocks may be done in parallel
–  operations within a block may have loop-level
parallelism
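A minimal sketch of the blocking idea (written in plain numpy for readability; an optimized BLAS 3 kernel such as DGEMM does the same thing with carefully tuned block sizes and low-level code):

import numpy as np

def blocked_matmul(A, B, bs=64):
    # Illustrative sketch: tiled product C = A*B.  Each bs x bs block update
    # does O(bs^3) flops on O(bs^2) data, the surface-to-volume effect above.
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)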

BLAS Level 3 Routines
•  Typical operations involve matrix-matrix products
C = α A B + β C
C = α A A^T + β C
B = α T B
B = α T^(-1) B
•  as well as rank-k updates and solutions of systems
involving triangular matrices
•  Better performance is achieved

BLAS Level 3 Routines

(figure: performance plot comparing BLAS 1, BLAS 2, and BLAS 3 routines against the machine peak)

Matrix Problem Solution, Ax=b
•  The main steps in the solution process are
–  Fill: computing the matrix elements of A
–  Factor: factoring the dense matrix A
–  Solve: solving for one or more right-hand sides, b

Task          Work      Parallelism        Parallel Speed

Fill          O(n^2)    embarrassing       low
Factor        O(n^3)    moderately diff.   very high
Solve         O(n^2)    moderately diff.   high
Field Calc.   O(n)      embarrassing       high

Review of Gaussian Elimination (GE) for solving Ax=b
•  Add multiples of each row to later rows to make A upper
triangular
•  Solve resulting triangular system Ux = c by substitution
… for each column i
… zero it out below the diagonal by adding multiples of row i to later rows
for i = 1 to n-1
… for each row j below row i
for j = i+1 to n
… add a multiple of row i to row j
for k = i to n
A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

Refine GE Algorithm (1)
•  Initial Version
… for each column i
… zero it out below the diagonal by adding multiples of row i to later rows
for i = 1 to n-1
… for each row j below row i
for j = i+1 to n
… add a multiple of row i to row j
for k = i to n
A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

•  Remove computation of constant A(j,i)/A(i,i)


from inner loop
for i = 1 to n-1
for j = i+1 to n
m = A(j,i)/A(i,i)
for k = i to n
A(j,k) = A(j,k) - m * A(i,k)

Refine GE Algorithm (2)


•  Last version
for i = 1 to n-1
for j = i+1 to n
m = A(j,i)/A(i,i)
for k = i to n
A(j,k) = A(j,k) - m * A(i,k)

•  Don’t compute what we already know:


zeros below diagonal in column i
for i = 1 to n-1
for j = i+1 to n
m = A(j,i)/A(i,i)
for k = i+1 to n
A(j,k) = A(j,k) - m * A(i,k)

Refine GE Algorithm (3)
•  Last version
for i = 1 to n-1
for j = i+1 to n
m = A(j,i)/A(i,i)
for k = i+1 to n
A(j,k) = A(j,k) - m * A(i,k)

•  Store multipliers m below diagonal in zeroed


entries for later use
for i = 1 to n-1
for j = i+1 to n
A(j,i) = A(j,i)/A(i,i)
for k = i+1 to n
A(j,k) = A(j,k) - A(j,i) * A(i,k)

Refine GE Algorithm (4)


•  Last version
for i = 1 to n-1
for j = i+1 to n
A(j,i) = A(j,i)/A(i,i)
for k = i+1 to n
A(j,k) = A(j,k) - A(j,i) * A(i,k)

•  Express using matrix operations (BLAS)

for i = 1 to n-1
A(i+1:n,i) = A(i+1:n,i) / A(i,i)
A(i+1:n,i+1:n) = A(i+1:n , i+1:n )
- A(i+1:n , i) * A(i , i+1:n)
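A runnable numpy transcription of this final form (a sketch with no pivoting; the function name lu_no_pivot is made up for illustration):

import numpy as np

def lu_no_pivot(A):
    # Illustrative sketch of the slice-based loop above: multipliers
    # overwrite the strictly lower triangle, U overwrites the upper triangle.
    A = A.astype(float)
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] = A[i+1:, i] / A[i, i]                    # scale a vector (BLAS 1)
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])    # rank-1 update (BLAS 2)
    return A

A = np.array([[4.0, 3.0], [6.0, 3.0]])
LU = lu_no_pivot(A)
L = np.tril(LU, -1) + np.eye(2)
U = np.triu(LU)
assert np.allclose(L @ U, A)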

What GE really computes
for i = 1 to n-1
A(i+1:n,i) = A(i+1:n,i) / A(i,i)
A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) - A(i+1:n , i) * A(i , i+1:n)

•  Call the strictly lower triangular matrix of multipliers M, and


let L = I+M
•  Call the upper triangle of the final matrix U
•  Lemma (LU Factorization): If the above algorithm
terminates (does not divide by zero) then A = L*U
•  Solving A*x=b using GE
–  Factorize A = L*U using GE (cost = 2/3 n^3 flops)
–  Solve L*y = b for y, using substitution (cost = n^2 flops)
–  Solve U*x = y for x, using substitution (cost = n^2 flops)
•  Thus A*x = (L*U)*x = L*(U*x) = L*y = b as desired
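These three steps can be sketched in a few lines of numpy/scipy (the name ge_solve is illustrative; a production code would call LAPACK's pivoted routines instead):

import numpy as np
from scipy.linalg import solve_triangular

def ge_solve(A, b):
    # Illustrative sketch, no pivoting:
    #   factor A = L*U  (about 2/3 n^3 flops),
    #   solve  L*y = b  by forward substitution  (n^2 flops),
    #   solve  U*x = y  by back substitution     (n^2 flops).
    LU = A.astype(float)
    n = LU.shape[0]
    for i in range(n - 1):
        LU[i+1:, i] /= LU[i, i]
        LU[i+1:, i+1:] -= np.outer(LU[i+1:, i], LU[i, i+1:])
    L = np.tril(LU, -1) + np.eye(n)
    U = np.triu(LU)
    y = solve_triangular(L, b, lower=True)
    x = solve_triangular(U, y, lower=False)
    return x

A = np.array([[4.0, 3.0], [6.0, 3.0]])
b = np.array([10.0, 12.0])
assert np.allclose(A @ ge_solve(A, b), b)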

Problems with basic GE algorithm


•  What if some A(i,i) is zero? Or very small?
–  Result may not exist, or be “unstable”, so need to pivot
•  Current computation all BLAS 1 or BLAS 2, but we know
that BLAS 3 (matrix multiply) is fastest
for i = 1 to n-1
A(i+1:n,i) = A(i+1:n,i) / A(i,i) … BLAS 1 (scale a vector)
A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) … BLAS 2 (rank-1 update)
- A(i+1:n , i) * A(i , i+1:n)

(figure: performance plot comparing BLAS 1, BLAS 2, and BLAS 3 routines against the machine peak)

Pivoting in Gaussian Elimination
°  A = [ 0  1 ]  fails completely, even though A is “easy”
       [ 1  0 ]

°  Illustrate problems in 3-decimal-digit arithmetic:

   A = [ 1e-4  1 ]  and  b = [ 1 ] ;  the correct answer to 3 places is x = [ 1 ]
       [ 1     1 ]           [ 2 ]                                          [ 1 ]

°  Result of LU decomposition is

   L = [ 1            0 ]  =  [ 1    0 ]        … no roundoff error yet
       [ fl(1/1e-4)   1 ]     [ 1e4  1 ]

   U = [ 1e-4  1            ]  =  [ 1e-4   1   ]    … error in 4th decimal place
       [ 0     fl(1-1e4*1)  ]     [ 0    -1e4  ]

   Check: A = L*U = [ 1e-4  1 ]                  … (2,2) entry entirely wrong
                    [ 1     0 ]

°  The algorithm “forgets” the (2,2) entry: it computes the same L and U for all |A(2,2)| < 5
°  Numerical instability
°  Computed solution x is totally inaccurate
°  Cure: pivot (swap rows of A) so that the entries of L and U stay bounded

Gaussian Elimination with Partial Pivoting (GEPP)
°  Partial Pivoting: swap rows so that each multiplier
|L(i,j)| = |A(j,i)/A(i,i)| <= 1

for i = 1 to n-1
find and record k where |A(k,i)| = max{i <= j <= n} |A(j,i)|
… i.e. largest entry in rest of column i
if |A(k,i)| = 0
exit with a warning that A is singular, or nearly so
elseif k != i
swap rows i and k of A
end if
A(i+1:n,i) = A(i+1:n,i) / A(i,i) … each quotient lies in [-1,1]
A(i+1:n,i+1:n) = A(i+1:n , i+1:n ) - A(i+1:n , i) * A(i , i+1:n)

°  Lemma: This algorithm computes A = P*L*U, where P is a


permutation matrix
°  Since each entry of |L(i,j)| <= 1, this algorithm is considered
numerically stable
°  For details see LAPACK code at www.netlib.org/lapack/single/sgetf2 and
Dongarra’s book
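For completeness, a runnable numpy transcription of the GEPP pseudocode above (an illustrative sketch; LAPACK's actual unblocked routine is sgetf2, and the name gepp here is made up):

import numpy as np

def gepp(A):
    # Illustrative sketch of GE with partial pivoting.  Returns a row
    # permutation p and factors L, U such that A[p] = L*U with |L[i,j]| <= 1.
    A = A.astype(float)
    n = A.shape[0]
    p = np.arange(n)
    for i in range(n - 1):
        k = i + np.argmax(np.abs(A[i:, i]))      # largest entry in rest of column i
        if A[k, i] == 0:
            raise ValueError("A is singular, or nearly so")
        if k != i:
            A[[i, k]] = A[[k, i]]                # swap rows i and k of A
            p[[i, k]] = p[[k, i]]
        A[i+1:, i] /= A[i, i]                    # each multiplier lies in [-1, 1]
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    return p, np.tril(A, -1) + np.eye(n), np.triu(A)

A = np.array([[1e-4, 1.0], [1.0, 1.0]])
p, L, U = gepp(A)
assert np.allclose(A[p], L @ U)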
