
Linear Least Squares Problems

3.1. Introduction

Given an m-by-n matrix A and an m-by-1 vector b, the linear least squares problem is to find an n-by-1 vector x minimizing ||Ax - b||_2. If m = n and A is nonsingular, the answer is simply x = A^{-1} b. But if m > n, so that we have more equations than unknowns, the problem is called overdetermined, and generally no x satisfies Ax = b exactly. One occasionally encounters the underdetermined problem, where m < n, but we will concentrate on the more common overdetermined case.
This chapter is organized as follows. The rest of this introduction describes three applications of least squares problems: curve fitting, statistical modeling of noisy data, and geodetic modeling. Section 3.2 discusses three standard ways to solve the least squares problem: the normal equations, the QR decomposition, and the singular value decomposition (SVD). We will frequently use the SVD as a tool in later chapters, so we derive several of its properties (although algorithms for the SVD are left to Chapter 5). Section 3.3 discusses perturbation theory for least squares problems, and section 3.4 discusses the implementation details and roundoff error analysis of our main method, QR decomposition. The roundoff analysis applies to many algorithms using orthogonal matrices, including many algorithms for eigenvalues and the SVD in Chapters 4 and 5. Section 3.5 discusses the particularly ill-conditioned situation of rank-deficient least squares problems and how to solve them accurately. Section 3.7 and the questions at the end of the chapter give pointers to other kinds of least squares problems and to software for sparse problems.

EXAMPLE 3.1. A typical application of least squares is curve fitting. Suppose that we have m pairs of numbers (y_1, b_1), ..., (y_m, b_m) and that we want to find the "best" cubic polynomial fit to b_i as a function of y_i. This means finding polynomial coefficients x_1, ..., x_4 so that the polynomial p(y) = sum_{j=1}^{4} x_j y^{j-1} minimizes the residual r_i = p(y_i) - b_i for i = 1 to m. We can also write this as


minimizing

\[
r = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{bmatrix}
  = \begin{bmatrix} p(y_1) - b_1 \\ p(y_2) - b_2 \\ \vdots \\ p(y_m) - b_m \end{bmatrix}
  = \begin{bmatrix} 1 & y_1 & y_1^2 & y_1^3 \\ 1 & y_2 & y_2^2 & y_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & y_m & y_m^2 & y_m^3 \end{bmatrix}
    \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
  - \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}
  \equiv A \cdot x - b,
\]
where r and b are m-by-1, A is m-by-4, and x is 4-by-1. To minimize r, we could choose any norm, such as ||r||_inf, ||r||_1, or ||r||_2. The last one, which corresponds to minimizing the sum of the squared residuals sum_{i=1}^m r_i^2, is a linear least squares problem.
Figure 3.1 shows an example, where we fit polynomials of increasing degree to the smooth function b = sin(πy/5) + y/5 at the 23 points y = -5, -4.5, -4, ..., 5.5, 6. The left side of Figure 3.1 plots the data points as circles, along with four different approximating polynomials of degrees 1, 3, 6, and 19. The right side of Figure 3.1 plots the residual norm ||r||_2 versus degree for degrees from 1 to 20. Note that as the degree increases from 1 to 17, the residual norm decreases. We expect this behavior, since increasing the polynomial degree should let us fit the data better. But when we reach degree 18, the residual norm suddenly increases dramatically. We can see how erratic the plot of the degree 19 polynomial is on the left (the blue line). This is due to ill-conditioning, as we will later see.

Fig. 3.1. Polynomial fit to curve b = sin(πy/5) + y/5, and residual norms.
Typically, one does polynomial fitting only with relatively low degree polynomials, avoiding ill-conditioning [61]. Polynomial fitting is available as the function polyfit in Matlab.
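As an illustration (this is our sketch, not the code used to produce Figure 3.1), the experiment above can be reproduced in a few lines of Matlab using polyfit and polyval:

    % Sketch of the curve-fitting experiment (illustrative only).
    y = (-5:0.5:6)';                 % the 23 sample points
    b = sin(pi*y/5) + y/5;           % smooth function to be fit
    resnorm = zeros(20,1);
    for deg = 1:20
        p = polyfit(y, b, deg);                  % least squares polynomial fit
        resnorm(deg) = norm(polyval(p, y) - b);  % residual norm ||r||_2
    end
    semilogy(1:20, resnorm, 'o-'), xlabel('degree'), ylabel('||r||_2')

For degrees near 18 to 20, Matlab typically warns that the polynomial is badly conditioned, which is exactly the phenomenon discussed above.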
Here is an alternative to polynomial fitting. More generally, one has a set of independent functions f_1(y), ..., f_n(y) from R^k to R and a set of points (y_1, b_1), ..., (y_m, b_m) with y_i in R^k and b_i in R, and one wishes to find a best fit to these points of the form b = sum_{j=1}^n x_j f_j(y). In other words, one wants to choose x = [x_1, ..., x_n]^T to minimize the residuals r_i = sum_{j=1}^n x_j f_j(y_i) - b_i for 1 <= i <= m. Letting A_{ij} = f_j(y_i), we can write this as r = Ax - b, where A is m-by-n, x is n-by-1, and b and r are m-by-1. A good choice of basis functions f_j(y) can lead to better fits and less ill-conditioned systems than using polynomials [33, 84, 168]. ⋄

EXAMPLE 3.2. In statistical modeling, one often wishes to estimate certain parameters x_j based on some observations, where the observations are contaminated by noise. For example, suppose that one wishes to predict the college grade point average (GPA) (b) of freshman applicants based on their

high school GPA (a_1) and two Scholastic Aptitude Test scores, verbal (a_2) and quantitative (a_3), as part of the college admissions process. Based on past data from admitted freshmen, one can construct a linear model of the form b = sum_{j=1}^3 a_j x_j. The observations are a_{i1}, a_{i2}, a_{i3}, and b_i, one set for each of the m students in the database. Thus, one wants to minimize

\[
r = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{bmatrix}
  = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ \vdots & \vdots & \vdots \\ a_{m1} & a_{m2} & a_{m3} \end{bmatrix}
    \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
  - \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}
  \equiv A \cdot x - b,
\]

which we can do as a least squares problem.


Here is a statistical justification for least squares, which is called linear regression by statisticians: assume that the a_{ij} are known exactly, so that only b has noise in it, and that the noise in each b_i is independent and normally distributed with mean 0 and the same standard deviation σ. Let x be the solution of the least squares problem and x_T be the true value of the parameters. Then x is called a maximum-likelihood estimate of x_T, and the error x - x_T is normally distributed, with zero mean in each component and covariance matrix σ^2 (A^T A)^{-1}. We will see the matrix (A^T A)^{-1} again below when we solve the least squares problem using the normal equations. For more details on the connection to statistics, see, for example, [33, 259]. (The standard notation in statistics differs from that of linear algebra: statisticians write Xβ = y instead of Ax = b.) ⋄

EXAMPLE 3.3. The least squares problem was first posed and formulated by Gauss to solve a practical problem for the German government. There are important economic and legal reasons to know exactly where the boundaries lie between plots of land owned by different people. Surveyors would go out and try to establish these boundaries, measuring certain angles and distances

and then triangulating from known landmarks. As time passed, it became necessary to improve the accuracy to which the locations of the landmarks were known. So the surveyors of the day went out and remeasured many angles and distances between landmarks, and it fell to Gauss to figure out how to take these more accurate measurements and update the government database of locations. For this he invented least squares, as we will explain shortly [33].
The problem that Gauss solved did not go away and must be periodically
revisited. In 1974 the US National Geodetic Survey undertook to update the
US geodetic database, which consisted of about 700,000 points. The motiva-
tions had grown to include supplying accurate enough data for civil engineers
and regional planners to plan construction projects and for geophysicists to
study the motion of tectonic plates in the earth's crust (which can move up to
5 cm per year). The corresponding least squares problem was the largest ever
solved at the time: about 2.5 million equations in 400,000 unknowns. It was
also very sparse, which made it tractable on the computers available in 1978,
when the computation was done [164].
Now we briefly discuss the formulation of this problem. It is actually nonlinear and is solved by approximating it by a sequence of linear ones, each of which is a linear least squares problem. The database consists of a list of points (landmarks), each labeled by location: latitude, longitude, and possibly elevation. For simplicity of exposition, we assume that the Earth is flat and suppose that each point i is labeled by linear coordinates z_i = (x_i, y_i)^T. For each point we wish to compute a correction δz_i = (δx_i, δy_i)^T so that the corrected location z_i' = (x_i', y_i')^T = z_i + δz_i more nearly matches the new, more accurate measurements. These measurements include both distances between selected pairs of points and angles between the line segments from point i to j and from i to k (see Figure 3.2). To see how to turn these new measurements into constraints, consider the triangle in Figure 3.2. The corners are labeled by their (corrected) locations, and the angles θ and edge lengths L are also shown. From this data, it is easy to write down constraints based on simple trigonometric identities. For example, an accurate measurement of θ_i leads to the constraint

\[
\cos^2 \theta_i = \frac{\left[ (z_j' - z_i')^T (z_k' - z_i') \right]^2}{\left( (z_j' - z_i')^T (z_j' - z_i') \right) \cdot \left( (z_k' - z_i')^T (z_k' - z_i') \right)},
\]

where we have expressed cos θ_i in terms of dot products of certain sides of the triangle.


If we assume that δz_i is small compared to z_i, then we can linearize this constraint as follows: multiply through by the denominator of the fraction, multiply out all the terms to get a quartic polynomial in all the "δ-variables" (like δx_i), and throw away all terms containing more than one δ-variable as a factor. This yields an equation in which all δ-variables appear linearly. If we collect all these linear constraints from all the new angle and distance measurements together, we get an overdetermined linear system of
equations for all the δ-variables. We wish to find the smallest corrections, i.e., the smallest values of δx_i, etc., that most nearly satisfy these constraints. This is a least squares problem. ⋄

Fig. 3.2. Constraints in updating a geodetic database.

Later, after we introduce more machinery, we will also show how image
compression can be interpreted as a least squares problem (see Example 3.4).

3.2. Matrix Factorizations That Solve the Linear Least Squares Problem

The linear least squares problem has several explicit solutions that we now discuss:

1. normal equations,

2. QR decomposition,

3. SVD,

4. transformation to a linear system (see Question 3.3).

The first method is the fastest but least accurate; it is adequate when the
condition number is small. The second method is the standard one and costs
up to twice as much as the first method. The third method is of most use on an
ill-conditioned problem, i.e., when A is not of full rank; it is several times more
expensive again. The last method lets us do iterative refinement to improve
the solution when the problem is ill-conditioned. All methods but the third
can be adapted to deal efficiently with sparse matrices [33]. We will discuss
each solution in turn. We assume initially for methods 1 and 2 that A has full
column rank n.


3.2.1. Normal Equations



To derive the normal equations, we look for the x where the gradient of ||Ax - b||_2^2 = (Ax - b)^T (Ax - b) vanishes. So we want

\[
0 = \lim_{e \to 0} \frac{(A(x+e) - b)^T (A(x+e) - b) - (Ax - b)^T (Ax - b)}{\|e\|_2}
  = \lim_{e \to 0} \frac{2 e^T (A^T A x - A^T b) + e^T A^T A e}{\|e\|_2}.
\]

The second term satisfies |e^T A^T A e| / ||e||_2 <= ||A||_2^2 ||e||_2^2 / ||e||_2 = ||A||_2^2 ||e||_2, which approaches 0 as e goes to 0, so the factor A^T A x - A^T b in the first term must also be zero, or A^T A x = A^T b. This is a system of n linear equations in n unknowns, the normal equations.
Why is x = (A^T A)^{-1} A^T b the minimizer of ||Ax - b||_2^2? We can note that the Hessian A^T A is positive definite, which means that the function is strictly convex and any critical point is a global minimum. Or we can complete the square by writing x' = x + e and simplifying

\[
(Ax' - b)^T (Ax' - b) = (Ae + Ax - b)^T (Ae + Ax - b)
 = (Ae)^T (Ae) + (Ax - b)^T (Ax - b) + 2 (Ae)^T (Ax - b)
 = \|Ae\|_2^2 + \|Ax - b\|_2^2 + 2 e^T (A^T A x - A^T b)
 = \|Ae\|_2^2 + \|Ax - b\|_2^2.
\]

This is clearly minimized by e = 0. This is just the Pythagorean theorem, since the residual r = Ax - b is orthogonal to the space spanned by the columns of A, i.e., 0 = A^T r = A^T A x - A^T b, as illustrated below (the plane shown is the span of the column vectors of A, so that Ax, Ae, and Ax' = A(x + e) all lie in the plane):


Since A^T A is symmetric and positive definite, we can use the Cholesky decomposition to solve the normal equations. The total cost of computing A^T A, A^T b, and the Cholesky decomposition is n^2 m + n^3/3 + O(n^2) flops. Since m >= n, the n^2 m cost of forming A^T A dominates the cost.
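As an illustration (ours, not from the text), the normal equations approach takes only a few lines of Matlab; chol returns an upper triangular factor R with A^T A = R^T R, and the function name lls_normal_eqns is made up here:

    % Sketch: solve min ||Ax - b||_2 via the normal equations
    % (assumes A has full column rank and a modest condition number).
    function x = lls_normal_eqns(A, b)
        C = A' * A;            % n-by-n, symmetric positive definite
        d = A' * b;
        R = chol(C);           % C = R'*R with R upper triangular
        x = R \ (R' \ d);      % two triangular solves
    end

As discussed in section 3.3, this is adequate only when the condition number of A is modest; for ill-conditioned A the QR or SVD approaches below are preferred.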

3.2.2. QR Decomposition

THEOREM 3.1. QR decomposition. Let A be m-by-n with m >= n. Suppose that A has full column rank. Then there exist a unique m-by-n orthogonal matrix Q (Q^T Q = I_n) and a unique n-by-n upper triangular matrix R with positive diagonal entries r_{ii} > 0 such that A = QR.

Proof. We give two proofs of this theorem. First, this theorem is a restatement of the Gram–Schmidt orthogonalization process [139]. If we apply Gram–Schmidt to the columns a_i of A = [a_1, a_2, ..., a_n] from left to right, we get a sequence of orthonormal vectors q_1 through q_n spanning the same space; these orthonormal vectors are the columns of Q. Gram–Schmidt also computes coefficients r_{ji} = q_j^T a_i expressing each column a_i as a linear combination of q_1 through q_i: a_i = sum_{j=1}^i r_{ji} q_j. The r_{ji} are just the entries of R.

ALGORITHM 3.1. The classical Gram–Schmidt (CGS) and modified Gram–Schmidt (MGS) Algorithms for factoring A = QR:

for i = 1 to n   /* compute ith columns of Q and R */
    q_i = a_i
    for j = 1 to i - 1   /* subtract component in q_j direction from a_i */
        r_{ji} = q_j^T a_i   (CGS)
        r_{ji} = q_j^T q_i   (MGS)
        q_i = q_i - r_{ji} q_j
    end for
    r_{ii} = ||q_i||_2
    if r_{ii} = 0   /* a_i is linearly dependent on a_1, ..., a_{i-1} */
        quit
    end if
    q_i = q_i / r_{ii}
end for

We leave it as an exercise to show that the two formulas for r_{ji} in the algorithm are mathematically equivalent (see Question 3.1). If A has full column rank, r_{ii} will not be zero. The following figure illustrates Gram–Schmidt when A is 2-by-2:


The second proof of this theorem will use Algorithm 3.2, which we present in section 3.4.1. □

Unfortunately, CGS is numerically unstable in floating point arithmetic when the columns of A are nearly linearly dependent. MGS is more stable and will be used in algorithms later in this book, but it may still result in Q being far from orthogonal (||Q^T Q - I|| being far larger than ε) when A is ill-conditioned [31, 32, 33, 149]. Algorithm 3.2 in section 3.4.1 is a stable alternative algorithm for factoring A = QR. See Question 3.2.
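For concreteness, here is a compact Matlab sketch of MGS (the function name mgs_qr and the error message are ours):

    % Modified Gram-Schmidt QR of an m-by-n matrix A with full column rank.
    function [Q, R] = mgs_qr(A)
        [m, n] = size(A);
        Q = A; R = zeros(n);
        for i = 1:n
            for j = 1:i-1
                R(j,i) = Q(:,j)' * Q(:,i);       % MGS: use the updated q_i
                Q(:,i) = Q(:,i) - R(j,i) * Q(:,j);
            end
            R(i,i) = norm(Q(:,i));
            if R(i,i) == 0, error('columns are linearly dependent'); end
            Q(:,i) = Q(:,i) / R(i,i);
        end
    end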
We will derive the formula for the x that minimizes ||Ax - b||_2 using the decomposition A = QR in three slightly different ways. First, we can always choose m - n more orthonormal vectors Q̃ so that [Q, Q̃] is a square orthogonal matrix (for example, we can choose any m - n more independent vectors X̃ that we want and then apply Algorithm 3.1 to the m-by-m nonsingular matrix [Q, X̃]). Then

\[
\|Ax - b\|_2^2 = \|[Q, \tilde{Q}]^T (Ax - b)\|_2^2 \quad \text{by part 4 of Lemma 1.7}
= \left\| \begin{bmatrix} Q^T \\ \tilde{Q}^T \end{bmatrix} (QRx - b) \right\|_2^2
= \left\| \begin{bmatrix} R \\ 0_{(m-n)\times n} \end{bmatrix} x - \begin{bmatrix} Q^T b \\ \tilde{Q}^T b \end{bmatrix} \right\|_2^2
= \|Rx - Q^T b\|_2^2 + \|\tilde{Q}^T b\|_2^2
\geq \|\tilde{Q}^T b\|_2^2 .
\]

We can solve Rx - Q^T b = 0 for x, since A and R have the same rank, n, and so R is nonsingular. Then x = R^{-1} Q^T b, and the minimum value of ||Ax - b||_2 is ||Q̃^T b||_2.
Here is a second, slightly different derivation that does not use the matrix Q̃. Rewrite Ax - b as

\[
Ax - b = QRx - b = QRx - (QQ^T + I - QQ^T) b = Q(Rx - Q^T b) - (I - QQ^T) b .
\]

Note that the vectors Q(Rx - Q^T b) and (I - QQ^T) b are orthogonal, because (Q(Rx - Q^T b))^T ((I - QQ^T) b) = (Rx - Q^T b)^T [Q^T (I - QQ^T)] b = (Rx - Q^T b)^T [0] b = 0. Therefore, by the Pythagorean theorem,

\[
\|Ax - b\|_2^2 = \|Q(Rx - Q^T b)\|_2^2 + \|(I - QQ^T) b\|_2^2
             = \|Rx - Q^T b\|_2^2 + \|(I - QQ^T) b\|_2^2 ,
\]

where we have used part 4 of Lemma 1.7 in the form ||Qy||_2 = ||y||_2. This sum of squares is minimized when the first term is zero, i.e., x = R^{-1} Q^T b.
Finally, here is a third derivation that starts from the normal equations solution:

\[
x = (A^T A)^{-1} A^T b = (R^T Q^T Q R)^{-1} R^T Q^T b = (R^T R)^{-1} R^T Q^T b
  = R^{-1} R^{-T} R^T Q^T b = R^{-1} Q^T b .
\]

Later we will show that the cost of this decomposition and subsequent least squares solution is 2n^2 m - (2/3)n^3, about twice the cost of the normal equations if m >> n and about the same if m = n.
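In Matlab this solution can be computed with the economy-size QR factorization; the following sketch (ours) is illustrative:

    % Sketch: solve min ||Ax - b||_2 via the thin QR factorization.
    function x = lls_qr(A, b)
        [Q, R] = qr(A, 0);     % A = Q*R with Q m-by-n, R n-by-n upper triangular
        x = R \ (Q' * b);      % x = R^{-1} Q^T b
    end

For a full-rank overdetermined A, the backslash operator A\b solves the same problem.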

3.2.3. Singular Value Decomposition

The SVD is a very important decomposition which is used for many purposes other than solving least squares problems.

THEOREM 3.2. SVD. Let A be an arbitrary m-by-n matrix with m >= n. Then we can write A = UΣV^T, where U is m-by-n and satisfies U^T U = I, V is n-by-n and satisfies V^T V = I, and Σ = diag(σ_1, ..., σ_n), where σ_1 >= ... >= σ_n >= 0. The columns u_1, ..., u_n of U are called left singular vectors. The columns v_1, ..., v_n of V are called right singular vectors. The σ_i are called singular values. (If m < n, the SVD is defined by considering A^T.)

A geometric restatement of this theorem is as follows. Given any m-by-n matrix A, think of it as mapping a vector x in R^n to a vector y = Ax in R^m. Then we can choose one orthogonal coordinate system for R^n (where the unit axes are the columns of V) and another orthogonal coordinate system for R^m (where the unit axes are the columns of U) such that A is diagonal (Σ), i.e., A maps a vector x = sum_{i=1}^n β_i v_i to y = Ax = sum_{i=1}^n σ_i β_i u_i. In other words, any matrix is diagonal, provided that we pick appropriate orthogonal coordinate systems for its domain and range.

Proof of Theorem 3.2. We use induction on m and n: we assume that the SVD exists for (m - 1)-by-(n - 1) matrices and prove it for m-by-n. We assume A ≠ 0; otherwise we can take Σ = 0 and let U and V be arbitrary orthogonal matrices.

The basic step occurs when n = 1 (since m >= n). We write A = UΣV^T with U = A/||A||_2, Σ = ||A||_2, and V = 1.

For the induction step, choose v so that ||v||_2 = 1 and ||A||_2 = ||Av||_2 > 0. Such a v exists by the definition of ||A||_2 = max_{||v||_2 = 1} ||Av||_2. Let u = Av/||Av||_2, which is a unit vector. Choose Ũ and Ṽ so that U = [u, Ũ] is an m-by-m orthogonal matrix and V = [v, Ṽ] is an n-by-n orthogonal matrix. Now write

\[
U^T A V = \begin{bmatrix} u^T \\ \tilde{U}^T \end{bmatrix} \cdot A \cdot [v, \tilde{V}]
        = \begin{bmatrix} u^T A v & u^T A \tilde{V} \\ \tilde{U}^T A v & \tilde{U}^T A \tilde{V} \end{bmatrix}.
\]

Then

\[
u^T A v = \frac{(Av)^T (Av)}{\|Av\|_2} = \frac{\|Av\|_2^2}{\|Av\|_2} = \|Av\|_2 = \|A\|_2 \equiv \sigma,
\]

and Ũ^T A v = Ũ^T u ||Av||_2 = 0. We claim u^T A Ṽ = 0 too, because otherwise σ = ||A||_2 = ||U^T A V||_2 >= ||[1, 0, ..., 0] U^T A V||_2 = ||[σ, u^T A Ṽ]||_2 > σ, a contradiction. (We have used part 7 of Lemma 1.7.)

So U^T A V = [σ, 0; 0, Ũ^T A Ṽ] = [σ, 0; 0, Ã]. We may now apply the induction hypothesis to Ã to get Ã = U_1 Σ_1 V_1^T, where U_1 is (m - 1)-by-(n - 1), Σ_1 is (n - 1)-by-(n - 1), and V_1 is (n - 1)-by-(n - 1). So

\[
U^T A V = \begin{bmatrix} \sigma & 0 \\ 0 & U_1 \Sigma_1 V_1^T \end{bmatrix}
        = \begin{bmatrix} 1 & 0 \\ 0 & U_1 \end{bmatrix}
          \begin{bmatrix} \sigma & 0 \\ 0 & \Sigma_1 \end{bmatrix}
          \begin{bmatrix} 1 & 0 \\ 0 & V_1 \end{bmatrix}^T
\]

or

\[
A = \left( U \begin{bmatrix} 1 & 0 \\ 0 & U_1 \end{bmatrix} \right)
    \begin{bmatrix} \sigma & 0 \\ 0 & \Sigma_1 \end{bmatrix}
    \left( V \begin{bmatrix} 1 & 0 \\ 0 & V_1 \end{bmatrix} \right)^T,
\]

which is our desired decomposition. □
The SVD has a large number of important algebraic and geometric properties, the most important of which we state here.

THEOREM 3.3. Let A = UΣV^T be the SVD of the m-by-n matrix A, where m >= n. (There are analogous results for m < n.)

1. Suppose that A is symmetric, with eigenvalues λ_i and orthonormal eigenvectors u_i. In other words, A = UΛU^T is an eigendecomposition of A, with Λ = diag(λ_1, ..., λ_n), U = [u_1, ..., u_n], and UU^T = I. Then an SVD of A is A = UΣV^T, where σ_i = |λ_i| and v_i = sign(λ_i) u_i, where sign(0) = 1.


2. The eigenvalues of the symmetric matrix A^T A are σ_i^2. The right singular vectors v_i are corresponding orthonormal eigenvectors.

3. The eigenvalues of the symmetric matrix AA^T are σ_i^2 and m - n zeroes. The left singular vectors u_i are corresponding orthonormal eigenvectors for the eigenvalues σ_i^2. One can take any m - n other orthogonal vectors as eigenvectors for the eigenvalue 0.

4. Let H = [0, A^T; A, 0], where A is square and A = UΣV^T is the SVD of A. Let Σ = diag(σ_1, ..., σ_n), U = [u_1, ..., u_n], and V = [v_1, ..., v_n]. Then the 2n eigenvalues of H are ±σ_i, with corresponding unit eigenvectors (1/√2) [v_i; ±u_i].

5. If A has full rank, the solution of min_x ||Ax - b||_2 is x = VΣ^{-1}U^T b.

6. ||A||_2 = σ_1. If A is square and nonsingular, then ||A^{-1}||_2 = 1/σ_n and ||A||_2 · ||A^{-1}||_2 = σ_1/σ_n.

7. Suppose σ_1 >= ... >= σ_r > σ_{r+1} = ... = σ_n = 0. Then the rank of A is r. The null space of A, i.e., the subspace of vectors v such that Av = 0, is the space spanned by columns r + 1 through n of V: span(v_{r+1}, ..., v_n). The range space of A, the subspace of vectors of the form Aw for all w, is the space spanned by columns 1 through r of U: span(u_1, ..., u_r).

8. Let S^{n-1} be the unit sphere in R^n: S^{n-1} = {x in R^n : ||x||_2 = 1}. Let A · S^{n-1} be the image of S^{n-1} under A: A · S^{n-1} = {Ax : x in R^n and ||x||_2 = 1}. Then A · S^{n-1} is an ellipsoid centered at the origin of R^m, with principal axes σ_i u_i.

9. Write V = [v_1, v_2, ..., v_n] and U = [u_1, u_2, ..., u_n], so A = UΣV^T = sum_{i=1}^n σ_i u_i v_i^T (a sum of rank-1 matrices). Then a matrix of rank k < n closest to A (measured with || · ||_2) is A_k = sum_{i=1}^k σ_i u_i v_i^T, and ||A - A_k||_2 = σ_{k+1}. We may also write A_k = UΣ_k V^T, where Σ_k = diag(σ_1, ..., σ_k, 0, ..., 0).
Proof.

1. This is true by the definition of the SVD.

2. A^T A = VΣU^T UΣV^T = VΣ^2 V^T. This is an eigendecomposition of A^T A, with the columns of V the eigenvectors and the diagonal entries of Σ^2 the eigenvalues.

3. Choose an m-by-(m - n) matrix Ũ so that [U, Ũ] is square and orthogonal. Then write

\[
AA^T = U\Sigma V^T V \Sigma U^T = U \Sigma^2 U^T
     = [U, \tilde{U}] \begin{bmatrix} \Sigma^2 & 0 \\ 0 & 0 \end{bmatrix} [U, \tilde{U}]^T .
\]

This is an eigendecomposition of AA^T.

4. See Question 3.14.

5. ||Ax - b||_2 = ||UΣV^T x - b||_2. Since A has full rank, so does Σ, and thus Σ is invertible. Now let [U, Ũ] be square and orthogonal as above, so that

\[
\|U\Sigma V^T x - b\|_2^2
= \left\| \begin{bmatrix} U^T \\ \tilde{U}^T \end{bmatrix} (U\Sigma V^T x - b) \right\|_2^2
= \left\| \begin{bmatrix} \Sigma V^T x - U^T b \\ -\tilde{U}^T b \end{bmatrix} \right\|_2^2
= \|\Sigma V^T x - U^T b\|_2^2 + \|\tilde{U}^T b\|_2^2 .
\]

This is minimized by making the first term zero, i.e., x = VΣ^{-1}U^T b.

6. It is clear from its definition that the two-norm of a diagonal matrix is the largest absolute entry on its diagonal. Thus, by part 3 of Lemma 1.7, ||A||_2 = ||U^T A V||_2 = ||Σ||_2 = σ_1 and ||A^{-1}||_2 = ||V^T A^{-1} U||_2 = ||Σ^{-1}||_2 = 1/σ_n.

7. Again choose an m-by-(m - n) matrix Ũ so that the m-by-m matrix Û = [U, Ũ] is orthogonal. Since Û and V are nonsingular, A and Û^T A V = [Σ; 0_{(m-n)×n}] ≡ Σ̂ have the same rank, namely r, by our assumption about Σ. Also, v is in the null space of A if and only if V^T v is in the null space of Û^T A V = Σ̂, since Av = 0 if and only if Û^T A V (V^T v) = 0. But the null space of Σ̂ is clearly spanned by columns r + 1 through n of the n-by-n identity matrix I_n, so the null space of A is spanned by V times these columns, i.e., v_{r+1} through v_n. A similar argument shows that the range space of A is the same as Û times the range space of Û^T A V = Σ̂, i.e., Û times the first r columns of I_m, or u_1 through u_r.

8. We "build" the set A • Sn - ' by multiplying by one factor of A = UEV T


at a time. The figure below illustrates what happens when

A = [ 3 1
13J

2 -1 / 2 -2 -1 / 2 4 0 2-1/2 -2 -1 / 2 T
_[ 2-1/2 2 -1/2 ][ 0 2 [ 2-1/2 2-1/2

UEV T .

Assume for simplicity that A is square and nonsingular. Since V is orthogonal and so maps unit vectors to other unit vectors, V^T · S^{n-1} = S^{n-1}. Next, since v is in S^{n-1} if and only if ||v||_2 = 1, w is in Σ · S^{n-1} if and only if ||Σ^{-1} w||_2 = 1, or sum_{i=1}^n (w_i/σ_i)^2 = 1. This defines an ellipsoid with principal axes σ_i e_i, where e_i is the ith column of the identity matrix. Finally, multiplying each w = Σv by U just rotates the ellipsoid so that each e_i becomes u_i, the ith column of U.
(Figure: the unit circle S = S^1 and its successive images V^T · S, Σ · V^T · S, and U · Σ · V^T · S = A · S.)

9. A_k has rank k by construction, and

\[
\|A - A_k\|_2 = \left\| \sum_{i=k+1}^n \sigma_i u_i v_i^T \right\|_2
= \left\| U \, \mathrm{diag}(0, \ldots, 0, \sigma_{k+1}, \ldots, \sigma_n) \, V^T \right\|_2
= \sigma_{k+1} .
\]

It remains to show that there is no closer rank k matrix to A. Let B be any rank k matrix, so its null space has dimension n - k. The space spanned by {v_1, ..., v_{k+1}} has dimension k + 1. Since the sum of their dimensions is (n - k) + (k + 1) > n, these two spaces must overlap. Let h be a unit vector in their intersection. Then

\[
\|A - B\|_2^2 \geq \|(A - B)h\|_2^2 = \|Ah\|_2^2 = \|U\Sigma V^T h\|_2^2
= \|\Sigma (V^T h)\|_2^2 \geq \sigma_{k+1}^2 \|V^T h\|_2^2 = \sigma_{k+1}^2 . \quad \Box
\]

EXAMPLE 3.4. We illustrate the last part of Theorem 3.3 by using it for image compression. In particular, we will illustrate it with low-rank approximations of a clown. An m-by-n image is just an m-by-n matrix, where entry (i, j) is interpreted as the brightness of pixel (i, j). In other words, matrix entries ranging from 0 to 1 (say) are interpreted as pixels ranging from black (=0) through various shades of gray to white (=1). (Colors also are possible.) Rather than storing or transmitting all m · n matrix entries to represent the image, we often prefer to compress the image by storing many fewer numbers, from which we can still approximately reconstruct the original image. We may use part 9 of Theorem 3.3 to do this, as we now illustrate.

Consider the image in Figure 3.3(a). This 320-by-200 pixel image corresponds to a 320-by-200 matrix A. Let A = UΣV^T be the SVD of A. Part 9 of Theorem 3.3 tells us that A_k = sum_{i=1}^k σ_i u_i v_i^T is the best rank-k approximation of A, in the sense of minimizing ||A - A_k||_2 = σ_{k+1}. Note that it only takes m · k + n · k = (m + n) · k words to store u_1 through u_k and σ_1 v_1 through σ_k v_k, from which we can reconstruct A_k. In contrast, it takes m · n words to store A (or A_k explicitly), which is much larger when k is small. So we will use A_k as our compressed image, stored using (m + n) · k words. The other images in Figure 3.3 show these approximations for various values of k, along with the relative errors σ_{k+1}/σ_1 and compression ratios (m + n) · k/(m · n) = 520k/64000 ≈ k/123.

k     Relative error = σ_{k+1}/σ_1     Compression ratio = 520k/64000
3     .155                             .024
10    .077                             .081
20    .040                             .163

These images were produced by the following Matlab commands (the clown and other images are available in Matlab among the visualization demonstration files; check your local installation for their location), where k is the desired rank:

    load clown.mat; [U,S,V] = svd(X); colormap('gray');
    image(U(:,1:k)*S(1:k,1:k)*V(:,1:k)');

There are also many other, cheaper image-compression techniques available than the SVD [189, 152]. ⋄

Later we will see that the cost of solving a least squares problem with the SVD is about the same as with QR when m >> n, and about 4n^2 m - (4/3)n^3 + O(n^2) for smaller m. A precise comparison of the costs of QR and the SVD also depends on the machine being used. See section 3.6 for details.

DEFINITION 3.1. Suppose that A is m-by-n with m > n and has full rank, with A = QR and A = UΣV^T being A's QR decomposition and SVD, respectively. Then

A^+ ≡ (A^T A)^{-1} A^T = R^{-1} Q^T = V Σ^{-1} U^T

is called the (Moore–Penrose) pseudoinverse of A. If m < n, then A^+ ≡ A^T (AA^T)^{-1}.
Fig. 3.3. Image compression using the SVD. (a) Original image. (b) Rank k = 3 approximation. (c) Rank k = 10 approximation. (d) Rank k = 20 approximation.

The pseudoinverse lets us write the solution of the full-rank, overdetermined least squares problem as simply x = A^+ b. If A is square and nonsingular, this formula reduces to x = A^{-1} b, as expected. The pseudoinverse of A is computed as pinv(A) in Matlab. When A is not full rank, the Moore–Penrose pseudoinverse is given by Definition 3.2 in section 3.5.
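For illustration, here is a sketch (ours) of the SVD-based solution for a full-rank A, together with the equivalent call to pinv:

    % Sketch: solve min ||Ax - b||_2 via the thin SVD (full-rank A assumed).
    function x = lls_svd(A, b)
        [U, S, V] = svd(A, 0);        % economy-size SVD: U m-by-n, S and V n-by-n
        x = V * (S \ (U' * b));       % x = V * Sigma^{-1} * U^T * b
    end

    % Equivalently, for full-rank A:  x = pinv(A) * b;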

3.3. Perturbation Theory for the Least Squares Problem

When A is not square, we define its condition number with respect to the 2-norm to be κ_2(A) ≡ σ_max(A)/σ_min(A). This reduces to the usual condition number when A is square. The next theorem justifies this definition.

THEOREM 3.4. Suppose that A is m-by-n with m >= n and has full rank. Suppose that x minimizes ||Ax - b||_2. Let r = Ax - b be the residual. Let x̃ minimize ||(A + δA)x̃ - (b + δb)||_2. Assume

\[
\varepsilon \equiv \max\left( \frac{\|\delta A\|_2}{\|A\|_2}, \frac{\|\delta b\|_2}{\|b\|_2} \right) < \frac{1}{\kappa_2(A)} = \frac{\sigma_{\min}(A)}{\sigma_{\max}(A)} .
\]

Then

\[
\frac{\|\tilde{x} - x\|_2}{\|x\|_2} \leq \varepsilon \cdot \left\{ \frac{2\,\kappa_2(A)}{\cos\theta} + \tan\theta \cdot \kappa_2^2(A) \right\} + O(\varepsilon^2) \equiv \varepsilon \cdot \kappa_{LS} + O(\varepsilon^2),
\]

where sin θ = ||r||_2 / ||b||_2. In other words, θ is the angle between the vectors b and Ax and measures whether the residual norm ||r||_2 is large (near ||b||_2) or small (near 0). κ_LS is the condition number for the least squares problem.

Sketch of Proof. Expand x̃ = ((A + δA)^T (A + δA))^{-1} (A + δA)^T (b + δb) in powers of δA and δb, and throw away all but the linear terms in δA and δb. □

We have assumed that ε · κ_2(A) < 1 for the same reason as in the derivation of bound (2.4) for the perturbed solution of the square linear system Ax = b: it guarantees that A + δA has full rank, so that x̃ is uniquely determined.

We may interpret this bound as follows. If θ is 0 or very small, then the residual is small and the effective condition number is about 2κ_2(A), much like ordinary linear equation solving. If θ is not small but not close to π/2, the residual is moderately large, and then the effective condition number can be much larger: κ_2^2(A). If θ is close to π/2, so the true solution is nearly zero, then the effective condition number becomes unbounded even if κ_2(A) is small. These three cases are illustrated below. The right-hand picture makes it easy to see why the condition number is infinite when θ = π/2: in this case the solution x = 0, and almost any arbitrarily small change in A or b will yield a nonzero solution x̃, an "infinitely" large relative change.

(Figure: the three cases, labeled κ_LS ≈ 2κ_2(A), κ_LS = O(κ_2^2(A)), and κ_LS = ∞.)


An alternative form for the bound in Theorem 3.4 that eliminates the O(ε^2) term is as follows [258, 149] (here r̃ is the perturbed residual r̃ = (A + δA)x̃ - (b + δb)):

\[
\frac{\|\tilde{x} - x\|_2}{\|x\|_2} \leq \frac{\varepsilon\,\kappa_2(A)}{1 - \varepsilon\,\kappa_2(A)} \left( 2 + (\kappa_2(A) + 1)\,\frac{\|r\|_2}{\|A\|_2 \|x\|_2} \right),
\qquad
\frac{\|\tilde{r} - r\|_2}{\|b\|_2} \leq (1 + 2\,\kappa_2(A))\,\varepsilon .
\]

We will see that, properly implemented, both the QR decomposition and the SVD are numerically stable; i.e., they yield a solution x̃ minimizing ||(A + δA)x̃ - (b + δb)||_2 with

\[
\max\left( \frac{\|\delta A\|_2}{\|A\|_2}, \frac{\|\delta b\|_2}{\|b\|_2} \right) = O(\varepsilon) .
\]

We may combine this with the above perturbation bounds to get error bounds for the solution of the least squares problem, much as we did for linear equation solving.
The normal equations are not as accurate. Since they involve solving (A^T A)x = A^T b, the accuracy depends on the condition number κ_2(A^T A) = κ_2^2(A). Thus the error is always bounded by κ_2^2(A)ε, never just κ_2(A)ε. Therefore we expect that the normal equations can lose twice as many digits of accuracy as methods based on the QR decomposition and SVD.

Furthermore, solving the normal equations is not necessarily stable; i.e., the computed solution x̃ does not generally minimize ||(A + δA)x̃ - (b + δb)||_2 for small δA and δb. Still, when the condition number is small, we expect the normal equations to be about as accurate as the QR decomposition or SVD. Since the normal equations are the fastest way to solve the least squares problem, they are the method of choice when the matrix is well-conditioned.

We return to the problem of solving very ill-conditioned least squares problems in section 3.5.
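The loss of accuracy is easy to observe numerically. The following Matlab sketch (ours, not from the text) compares the normal equations with QR on an ill-conditioned polynomial fitting matrix; the exact numbers vary with the machine, but the normal equations typically lose roughly twice as many digits as QR:

    % Compare normal equations vs. QR on an ill-conditioned least squares problem.
    m = 100; n = 10;
    t = linspace(0, 1, m)';
    A = t .^ (0:n-1);                 % Vandermonde-like matrix, kappa_2(A) is large
    x_true = ones(n, 1);
    b = A * x_true;
    x_ne = (A' * A) \ (A' * b);       % normal equations
    [Q, R] = qr(A, 0);
    x_qr = R \ (Q' * b);              % QR-based solution
    fprintf('kappa_2(A)             = %.2e\n', cond(A));
    fprintf('normal equations error = %.2e\n', norm(x_ne - x_true)/norm(x_true));
    fprintf('QR error               = %.2e\n', norm(x_qr - x_true)/norm(x_true));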

3.4. Orthogonal Matrices

As we said in section 3.2.2, Gram–Schmidt orthogonalization (Algorithm 3.1) may not compute an orthogonal matrix Q when the vectors being orthogonalized are nearly linearly dependent, so we cannot use it to compute the QR decomposition stably.

Instead, we base our algorithms on certain easily computable orthogonal matrices called Householder reflections and Givens rotations, which we can choose to introduce zeros into the vectors that they multiply. Later we will show that any algorithm that uses these orthogonal matrices to introduce zeros is automatically stable. This error analysis will apply to our algorithms for the QR decomposition as well as to many SVD and eigenvalue algorithms in Chapters 4 and 5.

Despite the possibility of a nonorthogonal Q, the MGS algorithm has important uses in numerical linear algebra. (There is little use for its less stable version, CGS.) These uses include finding eigenvectors of symmetric tridiagonal matrices using bisection and inverse iteration (section 5.3.4) and the Arnoldi and Lanczos algorithms for reducing a matrix to certain "condensed" forms (sections 6.6.1, 6.6.6, and 7.4). The Arnoldi and Lanczos algorithms are used as the basis of algorithms for solving sparse linear systems and finding eigenvalues of sparse matrices. MGS can also be modified to solve the least squares problem stably, but Q may still be far from orthogonal [33].

3.4.1. Householder Transformations

A Householder transformation (or reflection) is a matrix of the form P = I - 2uu^T, where ||u||_2 = 1. It is easy to see that P = P^T and PP^T = (I - 2uu^T)(I - 2uu^T) = I - 4uu^T + 4uu^T uu^T = I, so P is a symmetric, orthogonal matrix. It is called a reflection because Px is the reflection of x in the plane through 0 perpendicular to u.

Given a vector x, it is easy to find a Householder reflection P = I - 2uu^T to zero out all but the first entry of x: Px = [c, 0, ..., 0]^T = c · e_1. We do this as follows. Write Px = x - 2u(u^T x) = c · e_1, so that u = (1/(2 u^T x)) (x - c e_1); i.e., u is a linear combination of x and e_1. Since ||x||_2 = ||Px||_2 = |c|, u must be parallel to the vector ũ = x ± ||x||_2 e_1, and so u = ũ/||ũ||_2. One can verify that either choice of sign yields a u satisfying Px = c e_1, as long as ũ ≠ 0. We will use ũ = x + sign(x_1) ||x||_2 e_1, since this means that there is no cancellation in


computing the first component of ũ. In summary, we get

\[
\tilde{u} = \begin{bmatrix} x_1 + \mathrm{sign}(x_1) \cdot \|x\|_2 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
\quad \text{with} \quad u = \frac{\tilde{u}}{\|\tilde{u}\|_2} .
\]

We write this as u = House(x). (In practice, we can store ũ instead of u to save the work of computing u, and use the formula P = I - (2/||ũ||_2^2) ũ ũ^T instead of P = I - 2uu^T.)
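A minimal Matlab sketch of House (the function name house and the handling of sign(0) are our choices) is:

    % Sketch of u = House(x): compute u with ||u||_2 = 1 such that
    % (I - 2*u*u') * x = [c; 0; ...; 0] for some scalar c.
    function u = house(x)
        s = sign(x(1));
        if s == 0, s = 1; end              % treat sign(0) as +1
        u = x;
        u(1) = x(1) + s * norm(x);         % no cancellation in the first entry
        u = u / norm(u);
    end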

EXAMPLE 3.5. We show how to compute the QR decomposition of a 5-by-4 matrix A using Householder transformations. This example will make the pattern for general m-by-n matrices evident. In the matrices below, P_i is a 5-by-5 orthogonal matrix, x denotes a generic nonzero entry, and o denotes a zero entry.

1. Choose P_1 so that

\[
A_1 = P_1 A = \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & x & x & x \\ o & x & x & x \\ o & x & x & x \end{bmatrix}.
\]

2. Choose P_2 = [1, 0; 0, P_2'] so that

\[
A_2 = P_2 A_1 = \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & x & x \\ o & o & x & x \end{bmatrix}.
\]

3. Choose P_3 = [I_2, 0; 0, P_3'] so that

\[
A_3 = P_3 A_2 = \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & o & x \\ o & o & o & x \end{bmatrix}.
\]

4. Choose P_4 = [I_3, 0; 0, P_4'] so that

\[
A_4 = P_4 A_3 = \begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & o & x \\ o & o & o & o \end{bmatrix}.
\]

Here, we have chosen a Householder matrix P_i' to zero out the subdiagonal entries in column i; this does not disturb the zeros already introduced in previous columns.

Let us call the final 5-by-4 upper triangular matrix R̃ ≡ A_4. Then A = P_1 P_2 P_3 P_4 R̃ = QR, where Q is the first four columns of P_1 P_2 P_3 P_4 (since all P_i are symmetric) and R is the first four rows of R̃. ⋄

Here is the general algorithm for QR decomposition using Householder transformations.

ALGORITHM 3.2. QR factorization using Householder reflections:

for i = 1 to min(m - 1, n)
    u_i = House(A(i : m, i))
    P_i' = I - 2 u_i u_i^T
    A(i : m, i : n) = P_i' A(i : m, i : n)
end for

Here are some more implementation details. We never need to form P_i' explicitly but just multiply

(I - 2 u_i u_i^T) · A(i : m, i : n) = A(i : m, i : n) - 2 u_i (u_i^T · A(i : m, i : n)),

which costs less. To store P_i', we need only u_i, or ũ_i and ||ũ_i||. These can be stored in column i of A; in fact it need not be changed! Thus QR can be "overwritten" on A, where Q is stored in factored form as the product of the P_i', and P_i' is stored as u_i below the diagonal in column i of A. (We need an extra array of length n for the top entry of u_i, since the diagonal entry is occupied by R_{ii}.)

Recall that to solve the least squares problem min_x ||Ax - b||_2 using A = QR, we need to compute Q^T b. This is done as follows: Q^T b = P_n P_{n-1} ··· P_1 b, so we need only keep multiplying b by P_1, P_2, ..., P_n:

for i = 1 to n
    γ = -2 · u_i^T · b(i : m)
    b(i : m) = b(i : m) + γ u_i
end for

The cost is n dot products γ = -2 · u_i^T b(i : m) and n "saxpys" b(i : m) + γ u_i. The cost of computing A = QR this way is 2n^2 m - (2/3)n^3, and the subsequent cost of solving the least squares problem given QR is just an additional O(mn).
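Putting the pieces together, here is an unblocked, illustrative Matlab sketch of the Householder QR least squares solve (it applies each reflection to b as it goes rather than storing Q; the function name is ours, and it calls the house sketch above):

    % Illustrative Householder QR least squares solve: min ||Ax - b||_2,
    % with A m-by-n, m >= n, full column rank, and b an m-by-1 column vector.
    function x = lls_householder(A, b)
        [m, n] = size(A);
        for i = 1:min(m-1, n)
            u = house(A(i:m, i));                            % house() as above
            A(i:m, i:n) = A(i:m, i:n) - 2*u*(u' * A(i:m, i:n));
            b(i:m)      = b(i:m)      - 2*u*(u' * b(i:m));   % accumulate Q^T b
        end
        R = triu(A(1:n, 1:n));
        x = R \ b(1:n);                                      % x = R^{-1} Q^T b
    end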
The LAPACK routine for solving the least squares problem using QR is sgels. Just as Gaussian elimination can be reorganized to use matrix-matrix multiplication and other Level 3 BLAS (see section 2.6), the same can be done for the QR decomposition; see Question 3.17. In Matlab, if the m-by-n matrix A has more rows than columns and b is m-by-1, A\b solves the least squares problem. The QR decomposition itself is also available via [Q,R] = qr(A).

3.4.2. Givens Rotations

A Givens rotation

\[
R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\]

rotates any vector x in R^2 counterclockwise by θ:

(Figure: x and R(θ)x, rotated counterclockwise by θ.)

We also need to define the Givens rotation by θ in coordinates i and j:

\[
R(i, j, \theta) \equiv
\begin{bmatrix}
1 &        &            &        &             &        &   \\
  & \ddots &            &        &             &        &   \\
  &        & \cos\theta & \cdots & -\sin\theta &        &   \\
  &        & \vdots     & \ddots & \vdots      &        &   \\
  &        & \sin\theta & \cdots & \cos\theta  &        &   \\
  &        &            &        &             & \ddots &   \\
  &        &            &        &             &        & 1
\end{bmatrix},
\]

where the cosines and sines lie in rows and columns i and j. Given x, i, and j, we can zero out x_j by choosing cos θ and sin θ so that

\[
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x_i \\ x_j \end{bmatrix}
= \begin{bmatrix} \sqrt{x_i^2 + x_j^2} \\ 0 \end{bmatrix},
\]

or cos θ = x_i / √(x_i^2 + x_j^2) and sin θ = -x_j / √(x_i^2 + x_j^2).

The QR algorithm using Givens rotations is analogous to using Householder reflections, but when zeroing out column i, we zero it out one entry at a time (bottom to top, say).
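Here is a small Matlab sketch (ours) that computes and applies a single Givens rotation to zero out x(j) against x(i):

    % Zero out x(j) by rotating in coordinates (i, j).
    function [x, c, s] = apply_givens(x, i, j)
        r = hypot(x(i), x(j));            % sqrt(x(i)^2 + x(j)^2), computed safely
        if r == 0, c = 1; s = 0; return; end
        c = x(i) / r;
        s = -x(j) / r;
        xi = c*x(i) - s*x(j);             % = r
        xj = s*x(i) + c*x(j);             % = 0
        x(i) = xi;  x(j) = xj;
    end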

EXAMPLE 3.6. We illustrate two intermediate steps in computing the QR decomposition of a 5-by-4 matrix using Givens rotations. To progress from

\[
\begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & x & x \\ o & o & x & x \end{bmatrix}
\quad \text{to} \quad
\begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & o & x \\ o & o & o & x \end{bmatrix}
\]

we multiply

\[
\begin{bmatrix} 1 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & c & -s \\ & & & s & c \end{bmatrix}
\begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & x & x \\ o & o & x & x \end{bmatrix}
=
\begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & x & x \\ o & o & o & x \end{bmatrix}
\]

and

\[
\begin{bmatrix} 1 & & & & \\ & 1 & & & \\ & & c' & -s' & \\ & & s' & c' & \\ & & & & 1 \end{bmatrix}
\begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & x & x \\ o & o & o & x \end{bmatrix}
=
\begin{bmatrix} x & x & x & x \\ o & x & x & x \\ o & o & x & x \\ o & o & o & x \\ o & o & o & x \end{bmatrix}. \quad \diamond
\]

The cost of the QR decomposition using Givens rotations is twice the cost of using Householder reflections. We will need Givens rotations for other applications later.

Here are some implementation details. Just as we overwrote A with Q and R when using Householder reflections, we can do the same with Givens rotations. We use the same trick, storing the information describing the transformation in the entries zeroed out. Since a Givens rotation zeros out just one entry, we must store the information about the rotation there. We do this as follows. Let s = sin θ and c = cos θ. If |s| < |c|, store s · sign(c), and otherwise store sign(s)/c. To recover s and c from the stored value (call it ρ) we do the following: if |ρ| < 1, then s = ρ and c = √(1 - s^2); otherwise c = 1/ρ and s = √(1 - c^2). The reason we do not just store s and compute c = √(1 - s^2) is that when s is close to 1, c would be inaccurately reconstructed. Note also that we may recover either s and c or -s and -c; this is adequate in practice.

There is also a way to apply a sequence of Givens rotations while performing fewer floating point operations than described above. These are called fast Givens rotations [7, 8, 33]. Since they are still slower than Householder reflections for the purposes of computing the QR factorization, we will not consider them further.

3.4.3. Roundoff Error Analysis for Orthogonal Matrices

This analysis proves backward stability for the QR decomposition and for many of the algorithms for eigenvalues and singular values that we will discuss.

LEMMA 3.1. Let P be an exact Householder (or Givens) transformation, and P̃ be its floating point approximation. Then

fl(P̃A) = P(A + E),   ||E||_2 = O(ε) · ||A||_2,

and

fl(AP̃) = (A + F)P,   ||F||_2 = O(ε) · ||A||_2.

Sketch of Proof. Apply the usual formula fl(a ⊙ b) = (a ⊙ b)(1 + δ) to the formulas for computing and applying P̃. See Question 3.16. □

In words, this says that applying a single orthogonal matrix is backward stable.

THEOREM 3.5. Consider applying a sequence of orthogonal transformations to A_0. Then the computed product is an exact orthogonal transformation of A_0 + δA, where ||δA||_2 = O(ε)||A||_2. In other words, the entire computation is backward stable:

fl(P̃_j P̃_{j-1} ··· P̃_1 A_0 Q̃_1 Q̃_2 ··· Q̃_j) = P_j ··· P_1 (A_0 + E) Q_1 ··· Q_j

with ||E||_2 = j · O(ε) · ||A||_2. Here, as in Lemma 3.1, the P̃_i and Q̃_i are floating point orthogonal matrices and the P_i and Q_i are exact orthogonal matrices.

Proof. Let P̄_j ≡ P_j ··· P_1 and Q̄_j ≡ Q_1 ··· Q_j. We wish to show that fl(P̃_j A_{j-1} Q̃_j) = P̄_j (A + E_j) Q̄_j for some ||E_j||_2 = j · O(ε) · ||A||_2. We use Lemma 3.1 recursively. The result is vacuously true for j = 0. Now assume that the result is true for j - 1. Then we compute

\[
\begin{aligned}
B \equiv \mathrm{fl}(\tilde{P}_j A_{j-1})
 &= P_j (A_{j-1} + E') && \text{by Lemma 3.1} \\
 &= P_j (\bar{P}_{j-1}(A + E_{j-1})\bar{Q}_{j-1} + E') && \text{by induction} \\
 &= \bar{P}_j (A + E_{j-1} + \bar{P}_{j-1}^T E' \bar{Q}_{j-1}^T) \bar{Q}_{j-1} \\
 &\equiv \bar{P}_j (A + E'') \bar{Q}_{j-1},
\end{aligned}
\]

where

\[
\|E''\|_2 = \|E_{j-1} + \bar{P}_{j-1}^T E' \bar{Q}_{j-1}^T\|_2 \leq \|E_{j-1}\|_2 + \|\bar{P}_{j-1}^T E' \bar{Q}_{j-1}^T\|_2
          = \|E_{j-1}\|_2 + \|E'\|_2 = j \cdot O(\varepsilon) \cdot \|A\|_2,
\]

since ||E_{j-1}||_2 = (j - 1) · O(ε) · ||A||_2 and ||E'||_2 = O(ε) · ||A||_2. Postmultiplication by Q̃_j is handled in the same way. □

3.4.4. Why Orthogonal Matrices?

Let us consider how the error would grow if we were to multiply by a sequence of nonorthogonal matrices in Theorem 3.5 instead of orthogonal matrices. Let X be the exact nonorthogonal transformation and X̃ be its floating point approximation. Then the usual floating point error analysis of matrix multiplication tells us that

fl(X̃A) = XA + E = X(A + X^{-1}E) ≡ X(A + F),

where ||E||_2 <= O(ε)||X||_2 · ||A||_2 and so ||F||_2 <= ||X^{-1}||_2 · ||E||_2 <= O(ε) · κ_2(X) · ||A||_2.

So the error is magnified by the condition number κ_2(X) >= 1. In a larger product X_k ··· X_1 A Y_1 ··· Y_k the error would be magnified by the product of the κ_2(X_i) · κ_2(Y_i). This factor is minimized if and only if all X_i and Y_i are orthogonal (or scalar multiples of orthogonal matrices), in which case the factor is one.

3.5. Rank-Deficient Least Squares Problems

So far we have assumed that A has full rank when minimizing ||Ax - b||_2. What happens when A is rank deficient or "close" to rank deficient? Such problems arise in practice in many ways, such as extracting signals from noisy data, solution of some integral equations, digital image restoration, computing inverse Laplace transforms, and so on [141, 142]. These problems are very ill-conditioned, so we will need to impose extra conditions on their solutions to make them well-conditioned. Making an ill-conditioned problem well-conditioned by imposing extra conditions on the solution is called regularization and is also done in other fields of numerical analysis when ill-conditioned problems arise.

For example, the next proposition shows that if A is exactly rank deficient, then the least squares solution is not even unique.

PROPOSITION 3.1. Let A be m-by-n with m >= n and rank A = r < n. Then there is an (n - r)-dimensional set of vectors x that minimize ||Ax - b||_2.

Proof. Let Az = 0. Then if x minimizes ||Ax - b||_2, so does x + z. □

Because of roundoff in the entries of A, or roundoff during the computation, it is most often the case that A will have one or more very small computed singular values, rather than some exactly zero singular values. The next proposition shows that in this case, the unique solution is likely to be very large and is certainly very sensitive to error in the right-hand side b (see also Theorem 3.4).

PROPOSITION 3.2. Let σ_min = σ_min(A) be the smallest singular value of A. Assume σ_min > 0. Then

1. if x minimizes ||Ax - b||_2, then ||x||_2 >= |u_n^T b| / σ_min, where u_n is the last column of U in A = UΣV^T.

2. changing b to b + δb can change x to x + δx, where ||δx||_2 is as large as ||δb||_2 / σ_min.

In other words, if A is nearly rank deficient (σ_min is small), then the solution x is ill-conditioned and possibly very large.

Proof. For part 1, x = A^+ b = VΣ^{-1}U^T b, so ||x||_2 = ||Σ^{-1}U^T b||_2 >= |(Σ^{-1}U^T b)_n| = |u_n^T b| / σ_min. For part 2, choose δb parallel to u_n. □

We begin our discussion of regularization by showing how to regularize an exactly rank-deficient least squares problem: Suppose A is m-by-n with rank r < n. Within the (n - r)-dimensional solution space, we will look for the unique solution of smallest norm. This solution is characterized by the following proposition.


PROPOSITION 3.3. When A is exactly singular, the x that minimize ||Ax - b||_2 can be characterized as follows. Let A = UΣV^T have rank r < n, and write the SVD of A as

\[
A = [U_1, U_2] \begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{bmatrix} [V_1, V_2]^T = U_1 \Sigma_1 V_1^T, \qquad (3.1)
\]

where Σ_1 is r-by-r and nonsingular and U_1 and V_1 have r columns. Let σ ≡ σ_min(Σ_1), the smallest nonzero singular value of A. Then

1. all solutions x can be written x = V_1 Σ_1^{-1} U_1^T b + V_2 z, with z an arbitrary vector.

2. the solution x has minimal norm ||x||_2 precisely when z = 0, in which case x = V_1 Σ_1^{-1} U_1^T b and ||x||_2 <= ||b||_2 / σ.

3. changing b to b + δb can change the minimal norm solution x by at most ||δb||_2 / σ.

In other words, the norm and condition number of the unique minimal norm solution x depend on the smallest nonzero singular value of A.

Proof. Choose Ũ so that [U, Ũ] = [U_1, U_2, Ũ] is an m-by-m orthogonal matrix. Then

\[
\|Ax - b\|_2^2 = \|[U_1, U_2, \tilde{U}]^T (Ax - b)\|_2^2
= \left\| \begin{bmatrix} U_1^T \\ U_2^T \\ \tilde{U}^T \end{bmatrix} (U_1 \Sigma_1 V_1^T x - b) \right\|_2^2
= \left\| \begin{bmatrix} \Sigma_1 V_1^T x - U_1^T b \\ -U_2^T b \\ -\tilde{U}^T b \end{bmatrix} \right\|_2^2
= \|\Sigma_1 V_1^T x - U_1^T b\|_2^2 + \|U_2^T b\|_2^2 + \|\tilde{U}^T b\|_2^2 .
\]

1. ||Ax - b||_2 is minimized when Σ_1 V_1^T x = U_1^T b, i.e., x = V_1 Σ_1^{-1} U_1^T b + V_2 z, since V_1^T V_2 z = 0 for all z.

2. Since the columns of V_1 and V_2 are mutually orthogonal, the Pythagorean theorem implies that ||x||_2^2 = ||V_1 Σ_1^{-1} U_1^T b||_2^2 + ||V_2 z||_2^2, and this is minimized by z = 0.

3. Changing b by δb changes x by at most ||V_1 Σ_1^{-1} U_1^T δb||_2 <= ||Σ_1^{-1}||_2 ||δb||_2 = ||δb||_2 / σ. □

Proposition 3.3 tells us that the minimum norm solution x is unique and
may be well-conditioned if the smallest nonzero singular value is not too small.
This is key to a practical algorithm, discussed in the next section.

EXAMPLE 3.7. Suppose that we are doing medical research on the effect of a certain drug on blood sugar level. We collect data from each patient (numbered from i = 1 to m) by recording his or her initial blood sugar level (a_{i,1}), final blood sugar level (b_i), the amount of drug administered (a_{i,2}), and other medical quantities, including body weights on each day of a week-long treatment (a_{i,3} through a_{i,9}). In total, there are n < m medical quantities measured for each patient. Our goal is to predict b_i given a_{i,1} through a_{i,n}, and we formulate this as the least squares problem min_x ||Ax - b||_2. We plan to use x to predict the final blood sugar level b_j of future patient j by computing the dot product sum_{k=1}^n a_{jk} x_k.

Since people's weight generally does not change significantly from day to day, it is likely that columns 3 through 9 of matrix A, which contain the weights, are very similar. For the sake of argument, suppose that columns 3 and 4 are identical (which may be the case if the weights are rounded to the nearest pound). This means that matrix A is rank deficient and that x_0 = [0, 0, 1, -1, 0, ..., 0]^T is a right null vector of A. So if x is a (minimum norm) solution of the least squares problem min_x ||Ax - b||_2, then x + βx_0 is also a (nonminimum norm) solution for any scalar β, including, say, β = 0 and β = 10^6. Is there any reason to prefer one value of β over another? The value 10^6 is clearly not a good one, since future patient j, who gains one pound between days 1 and 2, will have that difference of one pound multiplied by 10^6 in the predictor sum_k a_{jk} x_k of final blood sugar level. It is much more reasonable to choose β = 0, corresponding to the minimum norm solution x. ⋄

For further justification of using the minimum norm solution for rank-deficient problems, see [141, 142].

When A is square and nonsingular, the unique solution of Ax = b is of course x = A^{-1} b. If A has more rows than columns and is possibly rank deficient, the unique minimum norm least squares solution may be similarly written x = A^+ b, where the Moore–Penrose pseudoinverse A^+ is defined as follows.

DEFINITION 3.2. (Moore–Penrose pseudoinverse A^+ for possibly rank-deficient A.) Let A = UΣV^T = U_1 Σ_1 V_1^T as in equation (3.1). Then A^+ ≡ V_1 Σ_1^{-1} U_1^T. This is also written A^+ = V Σ^+ U^T, where

\[
\Sigma^+ = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{bmatrix}^+ = \begin{bmatrix} \Sigma_1^{-1} & 0 \\ 0 & 0 \end{bmatrix}.
\]

So the solution of the least squares problem is always x = A^+ b, and when A is rank deficient, x has minimum norm.


3.5.1. Solving Rank-Deficient Least Squares Problems Using the SVD

Our goal is to compute the minimum norm solution x, despite roundoff. In the last section, we saw that the minimal norm solution was unique and had a condition number depending on the smallest nonzero singular value. Therefore, computing the minimum norm solution requires knowing the smallest nonzero singular value and hence also the rank of A. The main difficulty is that the rank of a matrix changes discontinuously as a function of the matrix.

For example, the 2-by-2 matrix A = diag(1, 0) is exactly singular, and its smallest nonzero singular value is σ = 1. As described in Proposition 3.3, the minimum norm least squares solution to min_x ||Ax - b||_2 with b = [1, 1]^T is x = [1, 0]^T, with condition number 1/σ = 1. But if we make an arbitrarily tiny perturbation to get Ã = diag(1, ε), then σ drops to ε and x = [1, 1/ε]^T becomes enormous, as does its condition number 1/ε. In general, roundoff will make such tiny perturbations, of magnitude O(ε)||A||_2. As we just saw, this can increase the condition number from 1/σ to 1/ε.
We deal with this discontinuity algorithmically as follows. In general each computed singular value σ̂_i satisfies |σ̂_i - σ_i| <= O(ε)||A||_2. This is a consequence of backward stability: the computed SVD will be the exact SVD of a slightly different matrix, Â = Û Σ̂ V̂^T = A + δA, with ||δA|| = O(ε) · ||A||. (This is discussed in detail in Chapter 5.) This means that any σ̂_i <= O(ε)||A||_2 can be treated as zero, because roundoff makes it indistinguishable from 0. In the above 2-by-2 example, this means we would set the ε in Ã to zero before solving the least squares problem. This would raise the smallest nonzero singular value from ε to 1 and correspondingly decrease the condition number from 1/ε to 1/σ = 1.

More generally, let tol be a user-supplied measure of uncertainty in the data A. Roundoff implies that tol >= ε · ||A||, but it may be larger, depending on the source of the data in A. Now set σ̄_i = σ̂_i if σ̂_i > tol, and σ̄_i = 0 otherwise. Let Σ̄ = diag(σ̄_i). We call Û Σ̄ V̂^T the truncated SVD of A, because we have set singular values smaller than tol to zero. Now we solve the least squares problem using the truncated SVD instead of the original SVD. This is justified since ||Û Σ̄ V̂^T - Û Σ̂ V̂^T||_2 = ||Û (Σ̄ - Σ̂) V̂^T||_2 <= tol; i.e., the change in A caused by changing each σ̂_i to σ̄_i is less than the user's inherent uncertainty in the data.
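A Matlab sketch of this truncated SVD procedure (the function name and the default choice of tol are ours) is:

    % Minimum norm least squares solution via the truncated SVD.
    % Singular values at or below tol are treated as exactly zero.
    function x = lls_truncated_svd(A, b, tol)
        [U, S, V] = svd(A, 0);
        s = diag(S);
        if nargin < 3, tol = eps(class(A)) * norm(A); end   % default uncertainty
        r = sum(s > tol);                                   % numerical rank
        x = V(:, 1:r) * ((U(:, 1:r)' * b) ./ s(1:r));       % x = V1 * Sigma1^{-1} * U1^T * b
    end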
The motivation for using Σ̄ instead of Σ̂ is that, of all matrices within distance tol of Σ̂, Σ̄ maximizes the smallest nonzero singular value σ̄. In other words, it minimizes both the norm of the minimum norm least squares solution x and its condition number. The picture below illustrates the geometric relationships among the input matrix A, Â = Û Σ̂ V̂^T, and Ā = Û Σ̄ V̂^T, where we think of each matrix as a point in Euclidean space R^{m·n}. In this space, the rank-deficient matrices form a surface, as shown below:

(Figure: A, Â, and Ā near the surface of rank-deficient matrices.)

EXAMPLE 3.8. We illustrate the above procedure on two 20-by-10 rank-deficient matrices A_1 (of rank r_1 = 5) and A_2 (of rank r_2 = 7). We write the SVD of either A_1 or A_2 as A_i = U_i Σ_i V_i^T, where the common dimension of U_i, Σ_i, and V_i is the rank r_i of A_i; this is the same notation as in Proposition 3.3. The r_i nonzero singular values of A_i (the singular values of Σ_i) are shown as red x's in Figure 3.4 (for A_1) and Figure 3.5 (for A_2). Note that A_1 in Figure 3.4 has five large nonzero singular values (all slightly exceeding 1 and so plotted on top of one another, at the right edge of the graph), whereas the seven nonzero singular values of A_2 in Figure 3.5 range down to 1.2 · 10^{-9}, just larger than tol.
We then choose an r_i-dimensional vector x̂_i and let x_i = V_i x̂_i and b_i = A_i x_i = U_i Σ_i x̂_i, so x_i is the exact minimum norm solution minimizing ||A_i x − b_i||_2. Then we consider a sequence of perturbed problems A_i + δA, where the perturbation δA is chosen randomly to have a range of norms, and solve the least squares problems min_{y_i} ||(A_i + δA)y_i − b_i||_2 using the truncated least squares procedure with tol = 10^{-9}. The blue lines in Figures 3.4 and 3.5 plot the computed rank of A_i + δA (the number of computed singular values exceeding tol = 10^{-9}) versus ||δA||_2 (in the top graphs), and the error ||y_i − x_i||_2/||x_i||_2 (in the bottom graphs). The Matlab code for producing these figures is in HOMEPAGE/Matlab/RankDeficient.m.
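The loop below is a much-simplified Matlab sketch in the same spirit; it is not RankDeficient.m itself, and the way the rank-5 matrix, the perturbation sizes, and the random perturbations are chosen here is purely illustrative.

    m = 20; n = 10; r = 5; tol = 1e-9;
    [U, ~] = qr(randn(m, r), 0);           % random orthonormal columns
    [V, ~] = qr(randn(n, r), 0);
    S = diag(1 + 0.1*rand(r, 1));          % nonzero singular values slightly exceeding 1
    A = U*S*V';                            % rank-r matrix, like A_1
    xhat = randn(r, 1);
    x = V*xhat;  b = A*x;                  % exact minimum norm solution and right-hand side
    for normdA = 10.^(-15:2:-3)
        dA = randn(m, n);  dA = normdA*dA/norm(dA);       % ||dA||_2 = normdA
        [UU, SS, VV] = svd(A + dA, 0);
        s = diag(SS);  k = sum(s > tol);                  % computed rank
        y = VV(:, 1:k) * ((UU(:, 1:k)'*b) ./ s(1:k));     % truncated SVD solution
        fprintf('||dA|| = %8.1e  rank = %2d  error = %8.1e\n', normdA, k, norm(y - x)/norm(x));
    end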
The simplest case is in Figure 3.4, so we consider it first. A_1 + δA will have five singular values near or slightly exceeding 1 and the other five equal to ||δA||_2 or less. For ||δA||_2 < tol, the computed rank of A_1 + δA stays the same as that of A_1, namely, 5. The error also increases slowly from near machine epsilon (about 10^{-16}) to about 10^{-10} near ||δA||_2 = tol, and then both the rank and the error jump, to 10 and 1, respectively, for larger ||δA||_2. This is consistent with our analysis in Proposition 3.3, which says that the condition number is the reciprocal of the smallest nonzero singular value, i.e., the smallest singular value exceeding tol. For ||δA||_2 < tol, this smallest nonzero singular value is near to, or slightly exceeds, 1. Therefore Proposition 3.3 predicts an error of ||δA||_2/O(1) = O(||δA||_2). This well-conditioned situation is confirmed by the small error plotted to the left of ||δA||_2 = tol in the bottom graph of Figure 3.4.

Fig. 3.4. Graph of truncated least squares solution of min_{y_1} ||(A_1 + δA)y_1 − b_1||_2, using tol = 10^{-9}. The singular values of A_1 are shown as red x's. The norm ||δA||_2 is the horizontal axis. The top graph plots the rank of A_1 + δA, i.e., the number of singular values exceeding tol. The bottom graph plots ||y_1 − x_1||_2/||x_1||_2, where x_1 is the solution with δA = 0.

On the other hand, when ||δA||_2 > tol, the smallest nonzero singular value is O(||δA||_2), which is quite small, causing the error to jump to ||δA||_2/O(||δA||_2) = O(1), as shown to the right of ||δA||_2 = tol in the bottom graph of Figure 3.4.
In Figure 3.5, the nonzero singular values of A_2 are also shown as red x's; the smallest one, 1.2 · 10^{-9}, is just larger than tol. So the predicted error when ||δA||_2 < tol is ||δA||_2/10^{-9}, which grows to O(1) when ||δA||_2 = tol. This is confirmed by the bottom graph in Figure 3.5. ◊

3.5.2. Solving Rank-Deficient Least Squares Problems Using QR with Pivoting
A cheaper but sometimes less accurate alternative to the SVD is QR with pivoting. In exact arithmetic, if A had rank r < n and its first r columns were independent, then its QR decomposition would look like

    A = Q R = Q [ R11  R12 ]
                [  0    0  ]
                [  0    0  ]

where R11 is r-by-r and nonsingular and R12 is r-by-(n − r). With roundoff, we might hope to compute

    R = [ R11  R12 ]
        [  0   R22 ]
        [  0    0  ]

with ||R22||_2 very small, on the order of ε||A||_2.

Fig. 3.5. Graph of truncated least squares solution of min_{y_2} ||(A_2 + δA)y_2 − b_2||_2, using tol = 10^{-9}. The singular values of A_2 are shown as red x's. The norm ||δA||_2 is the horizontal axis. The top graph plots the rank of A_2 + δA, i.e., the number of singular values exceeding tol. The bottom graph plots ||y_2 − x_2||_2/||x_2||_2, where x_2 is the solution with δA = 0.

In this case we could just set R22 = 0 and minimize ||Ax − b||_2 as follows: let [Q, Q̃] be square and orthogonal, so that

    ||Ax − b||_2^2 = ||[Q, Q̃]^T (Ax − b)||_2^2 = ||[Rx − Q^T b; −Q̃^T b]||_2^2 = ||Rx − Q^T b||_2^2 + ||Q̃^T b||_2^2.

Write Q = [Q1, Q2] and x = [x1; x2] conformally with R = [R11, R12; 0, 0], so that

    ||Ax − b||_2^2 = ||R11 x1 + R12 x2 − Q1^T b||_2^2 + ||Q2^T b||_2^2 + ||Q̃^T b||_2^2

is minimized by choosing x = [R11^{-1}(Q1^T b − R12 x2); x2] for any x2. Note that the choice x2 = 0 does not necessarily minimize ||x||_2, but it is a reasonable choice, especially if R11 is well-conditioned and R11^{-1} R12 is small.
Unfortunately, this method is not reliable, since R may be nearly rank deficient even though no R22 is small. For example, the n-by-n bidiagonal matrix

    A = [ 1/2   1                 ]
        [      1/2   1            ]
        [            .    .       ]
        [                 .    1  ]
        [                     1/2 ]

has σ_min(A) ≈ 2^{−n}, but A = Q · R with Q = I and R = A, and no R22 is small.

To deal with this failure to recognize rank deficiency, we may do QR with


column pivoting. This means that we factorize AP = QR, P being a permuta-
tion matrix. The idea is that at step i (which ranges from 1 to n, the number
of columns) we select from the unfinished part of A (columns i to n and rows
i to m) the column of largest norm and exchange it with the ith column. We
then proceed to compute the usual Householder transformation to zero out
column i in entries i + 1 to m. This pivoting strategy attempts to keep Rll as
well-conditioned as possible and R22 as small as possible.
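Here is a minimal Matlab sketch of this approach, using Matlab's built-in column-pivoted QR and the choice x2 = 0 discussed above. The function name qr_pivot_ls and the use of a user-supplied tolerance tol to decide which diagonal entries of R to treat as zero are our own illustrative choices.

    function x = qr_pivot_ls(A, b, tol)
    % Least squares solution of min_x ||A*x - b||_2 for (nearly) rank-deficient A,
    % via QR with column pivoting, setting R22 = 0 and x2 = 0.
    n = size(A, 2);
    [Q, R, P] = qr(A);                      % A*P = Q*R, |diag(R)| nonincreasing
    r = sum(abs(diag(R)) > tol);            % estimated rank from the diagonal of R
    x1 = R(1:r, 1:r) \ (Q(:, 1:r)' * b);    % solve R11*x1 = Q1'*b
    x = P * [x1; zeros(n - r, 1)];          % undo the column permutation
    end

This mirrors the truncated SVD procedure of section 3.5.1 but requires only a single QR factorization.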

EXAMPLE 3.9. If we compute the QR decomposition with column pivoting of the last example (1/2 on the diagonal and 1 on the superdiagonal) with n = 11, we get R_{11,11} = 4.23 · 10^{-4}, a reasonable approximation to σ_min(A) ≈ 3.66 · 10^{-4}. Note that R_{nn} ≥ σ_min(A), since σ_min(A) is the norm of the smallest perturbation that can lower the rank, and setting R_{nn} to 0 lowers the rank. ◊
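The few Matlab lines below set up this matrix and compare the trailing diagonal entry of R, with and without column pivoting, against σ_min(A). They are only a sketch of how one might reproduce the example; the values 4.23 · 10^{-4} and 3.66 · 10^{-4} quoted above are those reported in the text.

    n = 11;
    A = diag(0.5*ones(n, 1)) + diag(ones(n - 1, 1), 1);  % 1/2 on diagonal, 1 on superdiagonal
    [~, R0] = qr(A);           % no pivoting: Q = I and R0 = A, so R0(n,n) = 1/2
    [~, R, ~] = qr(A);         % three outputs request column pivoting
    disp([abs(R0(n, n)), abs(R(n, n)), min(svd(A))])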

One can show only that R_{nn} may overestimate σ_min(A) by a factor as large as roughly 2^n, but usually R_{nn} is a reasonable approximation to σ_min(A). The worst case, however, is as bad as worst-case pivot growth in GEPP.
More sophisticated pivoting schemes than QR with column pivoting, called
rank-revealing QR algorithms, have been a subject of much recent study. Rank-
revealing QR algorithms that detect rank more reliably and sometimes also
faster than QR with column pivoting have been developed [28, 30, 48, 50, 109,
126, 128, 150, 196, 236]. We discuss them further in the next section.
QR decomposition with column pivoting is available as subroutine sgeqpf
in LAPACK. LAPACK also has several similar factorizations available: RQ
(sgerqf), LQ (sgelqf), and QL (sgeqlf). Future LAPACK releases will
contain improved versions of QR.

3.6. Performance Comparison of Methods for Solving Least Squares Problems

What is the fastest algorithm for solving dense least squares problems? As
discussed in section 3.2, solving the normal equations is fastest, followed by
QR and the SVD. If A is quite well-conditioned, then the normal equations are
about as accurate as the other methods, so even though the normal equations
are not numerically stable, they may be used as well. When A is not well-
conditioned but far from rank deficient, we should use QR.
Since the design of fast algorithms for rank-deficient least squares problems
is a current research area, it is difficult to recommend a single algorithm to use.
We summarize a recent study [206] that compared the performance of several algorithms against the fastest stable algorithm for the non-rank-deficient case: QR without pivoting, implemented using Householder transformations as described in section 3.4.1, with memory hierarchy optimizations
described in Question 3.17. These comparisons were made in double precision arithmetic on an IBM RS6000/590. Included in the comparison were the
rank-revealing QR algorithms mentioned in section 3.5.2 and various imple-
mentations of the SVD (see section 5.4). Matrices of various sizes and with
various singular value distributions were tested. We present results for two
singular value distributions:
Type 1: random matrices, where each entry is uniformly distributed from −1 to 1;

Type 2: matrices with singular values distributed geometrically from 1 to ε (in other words, the ith singular value is γ^i, where γ is chosen so that γ^n = ε); a sketch of how such test matrices can be generated appears after this list.
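The Matlab lines below are a hedged sketch of how Type 1 and Type 2 test matrices of this kind might be generated; the matrix size and the particular way of producing random orthogonal factors are our own illustrative choices.

    n = 100;
    A1 = 2*rand(n) - 1;                  % Type 1: entries uniform on [-1, 1]
    gamma = eps^(1/n);                   % eps is machine epsilon, so gamma^n = eps
    [U, ~] = qr(randn(n));               % random orthogonal factors
    [V, ~] = qr(randn(n));
    A2 = U * diag(gamma.^(1:n)) * V';    % Type 2: ith singular value is gamma^i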
Type 1 matrices are generally well-conditioned, and Type 2 matrices are
rank-deficient. We tested small square matrices (n = m = 20) and large
square matrices (m = n = 1600). We tested square matrices because if m is
sufficiently greater than n in the m-by-n matrix A, it is cheaper to do a QR
decomposition as a "preprocessing step" and then perform rank-revealing QR
or the SVD on R. (This is done in LAPACK.) If m ≫ n, then the initial QR decomposition dominates the cost of the subsequent operations on the n-by-n matrix R, and all the algorithms cost about the same.
The fastest version of rank-revealing QR was that of [30, 196]. On Type
1 matrices, this algorithm ranged from 3.2 times slower than QR without
pivoting for n = m = 20 to just 1.1 times slower for n = m = 1600. On Type 2
matrices, it ranged from 2.3 times slower (for n = m = 20) to 1.2 times slower
(for n = m = 1600). In contrast, the current LAPACK algorithm, dgeqpf,
was 2 times to 2.5 times slower for both matrix types.
The fastest version of the SVD was the one in [58], although one based on
divide-and-conquer (see section 5.3.3) was about equally fast for n = m = 1600.
(The one based on divide-and-conquer also used much less memory.) For Type
1 matrices, the SVD algorithm was 7.8 times slower (for n = m = 20) to 3.3
times slower (for n = m = 1600). For Type 2 matrices, the SVD algorithm was
3.5 times slower (for n = m = 20) to 3.0 times slower (for n = m = 1600). In
contrast, the current LAPACK algorithm, dgelss, ranged from 4 times slower
(for Type 2 matrices with n = m = 20) to 97 times slower (for Type 1 matrices
with n = m = 1600). This enormous slowdown is apparently due to memory
hierarchy effects.
Thus, we see that there is a tradeoff between reliability and speed in solv-
ing rank-deficient least squares problems: QR without pivoting is fastest but
least reliable, the SVD is slowest but most reliable, and rank-revealing QR
is in between. If m ≫ n, all algorithms cost about the same. The choice of algorithm depends on the relative importance of speed and reliability to the user.
Future LAPACK releases will contain improved versions of both rank-
revealing QR and SVD algorithms for the least squares problem.


3.7. References and Other Topics for Chapter 3



The best recent reference on least squares problems is [33], which also discusses
variations on the basic problem discussed here (such as constrained, weighted,
and updating least squares), different ways to regularize rank-deficient prob-
lems, and software for sparse least squares problems. See also chapter 5 of
[121] and [168]. Perturbation theory and error bounds for the least squares
solution are discussed in detail in [149]. Rank-revealing QR decompositions
are discussed in [28, 30, 48, 50, 126, 150, 196, 206, 236]. In particular, these
papers examine the tradeoff between cost and accuracy in rank determination,
and in [206] there is a comprehensive performance comparison of the available
methods for rank-deficient least squares problems.

3.8. Questions for Chapter 3


QUESTION 3.1. (Easy) Show that the two variations of Algorithm 3.1, CGS and MGS, are mathematically equivalent by showing that the two formulas for r_ji yield the same results in exact arithmetic.

QUESTION 3.2. (Easy) This question will illustrate the difference in nu-
merical stability among three algorithms for computing the QR factoriza-
tion of a matrix: Householder QR (Algorithm 3.2), CGS (Algorithm 3.1),
and MGS (Algorithm 3.1). Obtain the Matlab program QRStability.m from
HOMEPAGE/Matlab/QRStability.m. This program generates random matri-
ces with user-specified dimensions m and n and condition number cnd, computes
their QR decomposition using the three algorithms, and measures the accuracy
of the results. It does this with the residual ||A − Q·R||/||A||, which should be around machine epsilon ε for a stable algorithm, and the orthogonality of Q, measured by ||Q^T·Q − I||, which should also be around ε. Run this program for small matrix dimensions (such as m = 6 and n = 4), modest numbers of random matrices (samples = 20), and condition numbers ranging from cnd = 1 up to cnd = 10^15. Describe what you see. Which algorithms are more stable than others? See if you can describe how large ||Q^T·Q − I|| can be as a function of the choice of algorithm, cnd, and ε.

QUESTION 3.3. (Medium; Hard) Let A be m-by-n, m > n, and have full rank.

1. (Medium) Show that the linear system

       [ I    A ] [ r ]   [ b ]
       [ A^T  0 ] [ x ] = [ 0 ]

   has a solution in which x minimizes ||Ax − b||_2. One reason for this formulation is that we can apply iterative refinement to this linear system if we want a more accurate answer (see section 2.5).

2. (Medium) What is the condition number of the coefficient matrix in terms


of the singular values of A? Hint: Use the SVD of A.

3. (Medium) Give an explicit expression for the inverse of the coefficient



matrix, as a block 2-by-2 matrix. Hint: Use 2-by-2 block Gaussian elim-
ination. Where have we previously seen the (2,1) block entry?

4. (Hard) Show how to use the QR decomposition of A to implement an


iterative refinement algorithm to improve the accuracy of x.

QUESTION 3.4. (Medium) Weighted least squares: If some components of Ax − b are more important than others, we can weight them with a scale factor d_i and solve the weighted least squares problem min ||D(Ax − b)||_2 instead, where D has diagonal entries d_i. More generally, recall that if C is symmetric positive definite, then ||x||_C ≡ (x^T C x)^{1/2} is a norm, and we can consider minimizing ||Ax − b||_C. Derive the normal equations for this problem, as well as the formulation corresponding to the previous question.

QUESTION 3.5. (Medium; Z. Bai) Let A ∈ R^{n×n} be positive definite. Two vectors u_1 and u_2 are called A-orthogonal if u_1^T A u_2 = 0. If U ∈ R^{n×r} and U^T A U = I, then the columns of U are said to be A-orthonormal. Show that every subspace has an A-orthonormal basis.

QUESTION 3.6. (Easy; Z. Bai) Let A have the form

    A = [ R ]
        [ S ]

where R is n-by-n and upper triangular, and S is m-by-n and dense. Describe an algorithm using Householder transformations for reducing A to upper triangular form. Your algorithm should not "fill in" the zeros in R and thus should require fewer operations than would Algorithm 3.2 applied to A.

QUESTION 3.7. (Medium; Z. Bai) If A = R + uv^T, where R is an upper triangular matrix, and u and v are column vectors, describe an efficient algorithm to compute the QR decomposition of A. Hint: Using Givens rotations, your algorithm should take O(n^2) operations. In contrast, Algorithm 3.2 would take O(n^3) operations.

QUESTION 3.8. (Medium; Z. Bai) Let x ∈ R^n and let P be a Householder matrix such that Px = ±||x||_2 e_1. Let G_{1,2}, ..., G_{n−1,n} be Givens rotations, and let Q = G_{1,2} ··· G_{n−1,n}. Suppose Qx = ±||x||_2 e_1. Must P equal Q? (You need to give a proof or a counterexample.)

QUESTION 3.9. (Easy; Z. Bai) Let A be m-by-n, with SVD A = UΣV^T. Compute the SVDs of the following matrices in terms of U, Σ, and V:

1. (A^T A)^{−1}

2. (A^T A)^{−1} A^T

3. A(A^T A)^{−1}

4. A(A^T A)^{−1} A^T

QUESTION 3.10. (Medium; R. Schreiber) Let A_k be a best rank-k approximation of the matrix A, as defined in Part 9 of Theorem 3.3. Let σ_i be the ith singular value of A. Show that A_k is unique if σ_k > σ_{k+1}.

QUESTION 3.11. (Easy; Z. Bai) Let A be m-by-n. Show that X = A^+ (the Moore–Penrose pseudoinverse) minimizes ||AX − I||_F over all n-by-m matrices X. What is the value of this minimum?

QUESTION 3.12. (Medium; Z. Bai) Let A, B, and C be matrices with dimensions such that the product A^T C B^T is well defined. Let X be the set of matrices X minimizing ||AXB − C||_F, and let X_0 be the unique member of X minimizing ||X||_F. Show that X_0 = A^+ C B^+. Hint: Use the SVDs of A and B.

QUESTION 3.13. (Medium; Z. Bai) Show that the Moore–Penrose pseudoinverse of A satisfies the following identities:

    AA^+A = A,
    A^+AA^+ = A^+,
    A^+A = (A^+A)^T,
    AA^+ = (AA^+)^T.

QUESTION 3.14. (Medium) Prove part 4 of Theorem 3.3: Let

    H = [ 0    A^T ]
        [ A     0  ],

where A is square and A = UΣV^T is its SVD. Let Σ = diag(σ_1, ..., σ_n), U = [u_1, ..., u_n], and V = [v_1, ..., v_n]. Prove that the 2n eigenvalues of H are ±σ_i, with corresponding unit eigenvectors (1/√2)·[v_i; ±u_i]. Extend to the case of rectangular A.

QUESTION 3.15. (Medium) Let A be m-by-n, m < n, and of full rank. Then min_x ||Ax − b||_2 is called an underdetermined least squares problem. Show that the set of solutions is an (n − m)-dimensional set. Show how to compute the unique minimum norm solution using appropriately modified normal equations, QR decomposition, and SVD.

QUESTION 3.16. (Medium) Prove Lemma 3.1.



QUESTION 3.17. (Hard) In section 2.6.3, we showed how to reorganize Gaussian elimination to perform Level 2 BLAS and Level 3 BLAS operations at each step in order to exploit the higher speed of these operations. In this problem, we will show how to apply a sequence of Householder transformations using Level 2 and Level 3 BLAS.

1. Let u_1, ..., u_b be a sequence of vectors of dimension n, where ||u_i||_2 = 1 and the first i − 1 components of u_i are zero. Let P = P_b · P_{b−1} ··· P_1, where P_i = I − 2 u_i u_i^T is a Householder transformation. Show that there is a b-by-b lower triangular matrix T such that P = I − U T U^T, where U = [u_1, ..., u_b]. In particular, provide an algorithm for computing the entries of T. This identity shows that we can replace multiplication by b Householder transformations P_1 through P_b by three matrix multiplications by U, T, and U^T (plus the cost of computing T).

2. Let House(x) be a function of the vector x that returns a unit vector u such that (I − 2uu^T)x = ||x||_2 e_1; we showed how to implement House(x) in section 3.4. Then Algorithm 3.2 for computing the QR decomposition of the m-by-n matrix A may be written as

       for i = 1 to min(m − 1, n)
          u_i = House(A(i : m, i))
          P_i = I − 2 u_i u_i^T
          A(i : m, i : n) = P_i · A(i : m, i : n)
       end for

Show how to implement this in terms of the Level 2 BLAS in an efficient


way (in particular, matrix-vector multiplications and rank-1 updates).
What is the floating point operation count? (Just the high-order terms
in n and m are enough.) It is sufficient to write a short program in the
same notation as above (although trying it in Matlab and comparing
with Matlab's own QR factorization are a good way to make sure that
you are right!).

3. Using the results of step (1), show how to implement QR decomposition


in terms of Level 3 BLAS. What is the operation count? This technique is
used to accelerate the QR decomposition, just as we accelerated Gaussian
elimination in section 2.6. It is used in the LAPACK routine sgeqrf.

QUESTION 3.18. (Medium) It is often of interest to solve constrained least squares problems, where the solution x must satisfy a linear or nonlinear constraint in addition to minimizing ||Ax − b||_2. We consider one such problem here. Suppose that we want to choose x to minimize ||Ax − b||_2 subject to the linear constraint Cx = d. Suppose also that A is m-by-n, C is p-by-n, and C has full rank. We also assume that p ≤ n (so Cx = d is guaranteed to be consistent) and n ≤ m + p (so the system is not underdetermined). Show that there is a unique solution under the assumption that the stacked matrix [A; C] has full column rank. Show how to compute x using two QR decompositions, some matrix-vector multiplications, and the solution of some triangular systems of equations. Hint: Look at LAPACK routine sgglse and its description in the LAPACK manual [10] (NETLIB/lapack/lug/lapack_lug.html).

QUESTION 3.19. (Hard; Programming) Write a program (in Matlab or any other language) to update a geodetic database using least squares, as described in Example 3.3. Take as input a set of "landmarks," their approximate coordinates (x_i, y_i), and a set of new angle and distance measurements. The output should be corrections (δx_i, δy_i) for each landmark, an error bound for the corrections, and a picture (triangulation) of the old and new landmarks.

QUESTION 3.20. (Hard) Prove Theorem 3.4.

QUESTION 3.21. (Medium) Redo Example 3.1, using a rank-deficient least


squares technique from section 3.5.1. Does this improve the accuracy of the
high-degree approximating polynomials?
