Linear Least Squares Problems
3.1. Introduction
Given an m-by-n matrix A and an m-by-1 vector b, the linear least squares
problem is to find an n-by-1 vector x minimizing ||Ax − b||_2. If m = n and
A is nonsingular, the answer is simply x = A^{-1} b. But if m > n so that we
have more equations than unknowns, the problem is called overdetermined,
and generally no x satisfies Ax = b exactly. One occasionally encounters the
underdetermined problem, where m < n, but we will concentrate on the more
common overdetermined case.
This chapter is organized as follows. The rest of this introduction describes
three applications of least squares problems, to curve fitting, to statistical
modeling of noisy data, and to geodetic modeling. Section 3.2 discusses three stan-
dard ways to solve the least squares problem: the normal equations, the QR
decomposition, and the singular value decomposition (SVD). We will frequently
use the SVD as a tool in later chapters, so we derive several of its properties
(although algorithms for the SVD are left to Chapter 5). Section 3.3 discusses
perturbation theory for least squares problems, and section 3.4 discusses the
implementation details and roundoff error analysis of our main method, QR
decomposition. The roundoff analysis applies to many algorithms using or-
thogonal matrices, including many algorithms for eigenvalues and the SVD in
Chapters 4 and 5. Section 3.5 discusses the particularly ill-conditioned situa-
tion of rank-deficient least squares problems and how to solve them accurately.
Section 3.7 and the questions at the end of the chapter give pointers to other
kinds of least squares problems and to software for sparse problems.
minimizing

    r ≡ [ r_1 ]   [ p(y_1) − b_1 ]   [ 1  y_1  y_1^2  y_1^3 ]   [ x_1 ]   [ b_1 ]
        [ r_2 ] = [ p(y_2) − b_2 ] = [ 1  y_2  y_2^2  y_2^3 ] · [ x_2 ] − [ b_2 ]
        [  ⋮  ]   [      ⋮       ]   [ ⋮    ⋮     ⋮      ⋮  ]   [ x_3 ]   [  ⋮  ]
        [ r_m ]   [ p(y_m) − b_m ]   [ 1  y_m  y_m^2  y_m^3 ]   [ x_4 ]   [ b_m ]

      ≡ A·x − b,
where r and b are m-by-1, A is m-by-4, and x is 4-by-1. To minimize r,
we could choose any norm, such as ||r||_∞, ||r||_1, or ||r||_2. The last one, which
corresponds to minimizing the sum of the squared residuals Σ_{i=1}^m r_i^2, is a linear
least squares problem.
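For concreteness, here is a minimal Matlab sketch of this degree-3 fit; the sample points and data values anticipate the Figure 3.1 example described next and stand in for whatever data one actually has:

    m = 23;
    y = linspace(-5, 6, m)';            % sample points (stand-in data)
    b = sin(pi*y/5) + y/5;              % values to be fit (stand-in data)
    A = [ones(m,1), y, y.^2, y.^3];     % the m-by-4 matrix above
    x = A \ b;                          % least squares solution minimizing ||A*x - b||_2
    r = A*x - b;                        % residual vector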
Figure 3.1 shows an example, where we fit polynomials of increasing degree
to the smooth function b = sin(πy/5) + y/5 at the 23 points y = −5, −4.5, −4,
... , 5.5, 6. The left side of Figure 3.1 plots the data points as circles, and four
different approximating polynomials of degrees 1, 3, 6, and 19. The right side
of Figure 3.1 plots the residual norm ||r||_2 versus degree for degrees from 1 to
20. Note that as the degree increases from 1 to 17, the residual norm decreases.
We expect this behavior, since increasing the polynomial degree should let us
fit the data better.
But when we reach degree 18, the residual norm suddenly increases dra-
matically. We can see how erratic the plot of the degree 19 polynomial is on
the left (the blue line). This is due to ill-conditioning, as we will later see.
Typically, one does polynomial fitting only with relatively low degree poly-
nomials, avoiding ill-conditioning [61]. Polynomial fitting is available as the
function polyfit in Matlab.
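A rough Matlab sketch of the experiment behind Figure 3.1 (not the original script):

    y = (-5:0.5:6)';                    % the 23 sample points of Figure 3.1
    b = sin(pi*y/5) + y/5;
    resnorm = zeros(20, 1);
    for d = 1:20
        p = polyfit(y, b, d);           % degree-d least squares fit (Matlab warns for high degrees)
        resnorm(d) = norm(polyval(p, y) - b);
    end
    semilogy(1:20, resnorm, 'o-'), xlabel('degree'), ylabel('residual norm')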
Here is an alternative to polynomial fitting. More generally, one has a set
of independent functions f_1(y), ..., f_n(y) from R^k to R and a set of points
(y_1, b_1), ..., (y_m, b_m) with y_i ∈ R^k and b_i ∈ R, and one wishes to find a best
fit to these points of the form b = Σ_{j=1}^n x_j f_j(y). In other words, one wants
to choose x = [x_1, ..., x_n]^T to minimize the residuals r_i = Σ_{j=1}^n x_j f_j(y_i) − b_i
for 1 ≤ i ≤ m. Letting a_{ij} = f_j(y_i), we can write this as r = Ax − b, where
A is m-by-n, x is n-by-1, and b and r are m-by-1. A good choice of basis
functions f_j(y) can lead to better fits and less ill-conditioned systems than
using polynomials [33, 84, 168]. o
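A sketch of the same construction with a hypothetical choice of basis functions f_j (any functions of y could be substituted):

    y = (-5:0.5:6)';  b = sin(pi*y/5) + y/5;              % data as in Figure 3.1
    f = {@(y) ones(size(y)), @(y) sin(pi*y/5), @(y) y};   % hypothetical basis functions f_1, f_2, f_3
    A = zeros(length(y), numel(f));
    for j = 1:numel(f)
        A(:, j) = f{j}(y);                                % a(i,j) = f_j(y(i))
    end
    x = A \ b                                             % best fit b ~ sum_j x(j)*f_j(y)

With this particular choice the fit is essentially exact, since b lies in the span of the basis functions.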
Fig. 3.1. Polynomial fit to curve b = sin(πy/5) + y/5 and residual norms.
high school GPA (a_1) and two Scholastic Aptitude Test scores, verbal (a_2) and
quantitative (a_3), as part of the college admissions process. Based on past
data from admitted freshmen one can construct a linear model of the form
b = Σ_{j=1}^3 a_j x_j. The observations are a_{i1}, a_{i2}, a_{i3}, and b_i, one set for each of
the m students in the database. Thus, one wants to minimize
Σ_{i=1}^m (Σ_{j=1}^3 a_{ij} x_j − b_i)^2, which is a linear least squares problem.
EXAMPLE 3.3. The least squares problem was first posed and formulated by
Gauss to solve a practical problem for the German government. There are
important economic and legal reasons to know exactly where the boundaries
lie between plots of land owned by different people. Surveyors would go out
and try to establish these boundaries, measuring certain angles and distances
The standard notation in statistics differs from linear algebra: statisticians write Xβ = y
instead of Ax = b.
[Figure: surveyed positions z'_i = (x'_i, y'_i), z'_j = (x'_j, y'_j), and z'_k = (x'_k, y'_k).]
equations for all the δ-variables. We wish to find the smallest corrections, i.e.,
the smallest values of δx_i, etc., that most nearly satisfy these constraints. This
is a least squares problem. o
Later, after we introduce more machinery, we will also show how image
compression can be interpreted as a least squares problem (see Example 3.4).
1. normal equations,
2. QR decomposition,
3. SVD,

4. transforming the least squares problem into a linear system of equations (see Question 3.3).
The first method is the fastest but least accurate; it is adequate when the
condition number is small. The second method is the standard one and costs
up to twice as much as the first method. The third method is of most use on an
ill-conditioned problem, i.e., when A is not of full rank; it is several times more
expensive again. The last method lets us do iterative refinement to improve
the solution when the problem is ill-conditioned. All methods but the third
can be adapted to deal efficiently with sparse matrices [33]. We will discuss
each solution in turn. We assume initially for methods 1 and 2 that A has full
column rank n.
106 Applied Numerical Linear Algebra
To derive the normal equations, we look for the x where the gradient of
||Ax − b||_2^2 = (Ax − b)^T (Ax − b) vanishes. Differentiating with respect to x gives
2A^T A x − 2A^T b = 0, so x must satisfy the normal equations

    A^T A x = A^T b.
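A minimal Matlab sketch of this approach using a Cholesky factorization (hypothetical data; A is assumed to have full column rank):

    A = randn(10, 4);  b = randn(10, 1);   % hypothetical data
    C = A' * A;  d = A' * b;               % form the normal equations C*x = d
    R = chol(C);                           % Cholesky factorization C = R'*R
    x = R \ (R' \ d);                      % two triangular solves give the solution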
3.2.2. QR Decomposition
Proof. We give two proofs of this theorem. First, this theorem is a restatement
of the Gram–Schmidt orthogonalization process [139]. If we apply Gram–
Schmidt to the columns a_i of A = [a_1, a_2, ..., a_n] from left to right, we get
a sequence of orthonormal vectors q_1 through q_n spanning the same space;
these orthogonal vectors are the columns of Q. Gram–Schmidt also computes
coefficients r_{ji} = q_j^T a_i expressing each column a_i as a linear combination of q_1
through q_i: a_i = Σ_{j=1}^i r_{ji} q_j. The r_{ji} are just the entries of R.
We leave it as an exercise to show that the two formulas for r_{ji} in the algo-
rithm are mathematically equivalent (see Question 3.1). If A has full column
rank, r_{ii} will not be zero. The following figure illustrates Gram–Schmidt when
A is 2-by-2:
The second proof of this theorem will use Algorithm 3.2, which we present
in section 3.4.1. ❑
Unfortunately, CGS is numerically unstable in floating point arithmetic
when the columns of A are nearly linearly dependent. MGS is more stable and
will be used in algorithms later in this book, but may still result in Q being far
from orthogonal (||Q^T Q − I|| being far larger than ε) when A is ill-conditioned
[31, 32, 33, 149]. Algorithm 3.2 in section 3.4.1 is a stable alternative algorithm
for factoring A = QR. See Question 3.2.
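As a rough Matlab sketch of MGS (the CGS variant differs only in computing each inner product against the original column A(:, i) instead of the updated vector q), assuming A has full column rank:

    A = randn(8, 4);                      % hypothetical matrix with full column rank
    [m, n] = size(A);
    Q = zeros(m, n);  R = zeros(n, n);
    for i = 1:n
        q = A(:, i);
        for j = 1:i-1
            R(j, i) = Q(:, j)' * q;       % MGS: project against the updated vector q
            q = q - R(j, i) * Q(:, j);
        end
        R(i, i) = norm(q);
        Q(:, i) = q / R(i, i);
    end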
We will derive the formula for the x that minimizes ||Ax − b||_2 using the
decomposition A = QR in three slightly different ways. First, we can always
choose m − n more orthonormal vectors Q̃ so that [Q, Q̃] is a square orthogonal
matrix (for example, we can choose any m − n more independent vectors X̃
that we want and then apply Algorithm 3.1 to the m-by-m nonsingular matrix
[Q, X̃]). Then
    ||Ax − b||_2^2 = ||[Q, Q̃]^T (Ax − b)||_2^2            (by part 4 of Lemma 1.7)
                   = ||[Q, Q̃]^T (QRx − b)||_2^2
                   = || [ Rx − Q^T b ; −Q̃^T b ] ||_2^2     (since Q^T Q = I and Q̃^T Q = 0)
                   = ||Rx − Q^T b||_2^2 + ||Q̃^T b||_2^2.
We can solve Rx − Q^T b = 0 for x, since A and R have the same rank,
n, and so R is nonsingular. Then x = R^{-1} Q^T b, and the minimum value of
||Ax − b||_2 is ||Q̃^T b||_2.
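In Matlab this solution can be computed with an economy-size QR factorization; a sketch with hypothetical data:

    A = randn(10, 4);  b = randn(10, 1);   % hypothetical overdetermined system
    [Q, R] = qr(A, 0);                     % economy size: Q is m-by-n, R is n-by-n
    x = R \ (Q' * b);                      % x = R^{-1} * Q^T * b
    norm(A*x - b)                          % the minimal residual norm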
Here is a second, slightly different derivation that does not use the matrix
Q̃. Rewrite Ax − b as

    Ax − b = QRx − b = Q(Rx − Q^T b) + (QQ^T b − b) = Q(Rx − Q^T b) − (I − QQ^T)b.
Note that the vectors Q(Rx − Q^T b) and (I − QQ^T)b are orthogonal, be-
cause (Q(Rx − Q^T b))^T ((I − QQ^T)b) = (Rx − Q^T b)^T [Q^T (I − QQ^T)] b = (Rx −
Q^T b)^T [0] b = 0. Therefore, by the Pythagorean theorem,

    ||Ax − b||_2^2 = ||Q(Rx − Q^T b)||_2^2 + ||(I − QQ^T)b||_2^2
                   = ||Rx − Q^T b||_2^2 + ||(I − QQ^T)b||_2^2,

where we have used part 4 of Lemma 1.7 in the form ||Qy||_2 = ||y||_2. This sum
of squares is minimized when the first term is zero, i.e., x = R^{-1} Q^T b.
Finally, here is a third derivation that starts from the normal equations
solution:

    x = (A^T A)^{-1} A^T b = (R^T Q^T Q R)^{-1} R^T Q^T b = (R^T R)^{-1} R^T Q^T b
      = R^{-1} R^{-T} R^T Q^T b = R^{-1} Q^T b.
Later we will show that the cost of this decomposition and subsequent least
squares solution is 2n^2 m − 2n^3/3, about twice the cost of the normal equations
if m ≫ n and about the same if m = n.
THEOREM 3.2. SVD. Let A be an arbitrary m-by-n matrix with m ≥ n. Then
we can write A = UΣV^T, where U is m-by-n and satisfies U^T U = I, V is n-by-
n and satisfies V^T V = I, and Σ = diag(σ_1, ..., σ_n), where σ_1 ≥ ··· ≥ σ_n ≥ 0.
The columns u_1, ..., u_n of U are called left singular vectors. The columns
v_1, ..., v_n of V are called right singular vectors. The σ_i are called singular
values. (If m < n, the SVD is defined by considering A^T.)
Then we can choose one orthogonal coordinate system for R^n (where the unit
axes are the columns of V) and another orthogonal coordinate system for R^m
(where the unit axes are the columns of U) such that A is diagonal (Σ), i.e., A
maps a vector x = Σ_{i=1}^n β_i v_i to y = Ax = Σ_{i=1}^n σ_i β_i u_i. In other words, any
matrix is diagonal, provided that we pick appropriate orthogonal coordinate
systems for its domain and range.
Proof of Theorem 3.2. We use induction on m and n: we assume that the SVD
exists for (m − 1)-by-(n − 1) matrices and prove it for m-by-n matrices. We assume
A ≠ 0; otherwise we can take Σ = 0 and let U and V be arbitrary orthogonal matrices.

The basic step occurs when n = 1 (recall m ≥ n), where we write A = UΣV^T with
U = A/||A||_2, Σ = ||A||_2, and V = 1.

For the induction step, choose v so that ||v||_2 = 1 and ||A||_2 = ||Av||_2 > 0. Let
u = Av/||Av||_2, a unit vector, and choose Ũ and Ṽ so that U = [u, Ũ] is an m-by-m
orthogonal matrix and V = [v, Ṽ] is an n-by-n orthogonal matrix. Now write

    U^T A V = [ u^T ; Ũ^T ] · A · [ v, Ṽ ] = [ u^T A v ,  u^T A Ṽ ; Ũ^T A v ,  Ũ^T A Ṽ ].

Then

    u^T A v = (Av)^T (Av) / ||Av||_2 = ||Av||_2 = ||A||_2 ≡ σ,

and Ũ^T A v = Ũ^T u · ||Av||_2 = 0. We claim u^T A Ṽ = 0 too, because otherwise
σ = ||A||_2 = ||U^T A V||_2 ≥ ||[1, 0, ..., 0] · U^T A V||_2 = ||[σ, u^T A Ṽ]||_2 > σ, a contra-
diction. (We have used part 7 of Lemma 1.7.)

So U^T A V = [ σ, 0 ; 0, Ũ^T A Ṽ ] = [ σ, 0 ; 0, Ã ]. We may now apply the induction
hypothesis to Ã to get Ã = U_1 Σ_1 V_1^T, where U_1 is (m − 1)-by-(n − 1), Σ_1 is
(n − 1)-by-(n − 1), and V_1 is (n − 1)-by-(n − 1). So

    U^T A V = [ σ, 0 ; 0, U_1 Σ_1 V_1^T ] = [ 1, 0 ; 0, U_1 ] · [ σ, 0 ; 0, Σ_1 ] · [ 1, 0 ; 0, V_1 ]^T

or

    A = ( U · [ 1, 0 ; 0, U_1 ] ) · [ σ, 0 ; 0, Σ_1 ] · ( V · [ 1, 0 ; 0, V_1 ] )^T,

which is our desired decomposition. ❑
The SVD has a large number of important algebraic and geometric prop-
erties, the most important of which we state here.
THEOREM 3.3. Let A = UΣV^T be the SVD of the m-by-n matrix A, where
m ≥ n. (There are analogous results for m < n.)
2. The eigenvalues of the symmetric matrix A^T A are σ_i^2. The right singular
vectors v_i are corresponding orthonormal eigenvectors. (Similarly, if A is
square, the eigenvalues of the symmetric matrix [0, A^T ; A, 0] are ±σ_i, with
orthonormal eigenvectors (1/√2)·[v_i ; ±u_i].)
5. If A has full rank, the solution of min_x ||Ax − b||_2 is x = V Σ^{-1} U^T b.
6. ||A||_2 = σ_1. If A is square and nonsingular, then ||A||_2 · ||A^{-1}||_2 = σ_1/σ_n.
7. Suppose σ_1 ≥ ··· ≥ σ_r > σ_{r+1} = ··· = σ_n = 0. Then the rank of A is r.
The null space of A, i.e., the subspace of vectors v such that Av = 0, is
the space spanned by columns r + 1 through n of V: span(v_{r+1}, ..., v_n).
The range space of A, the subspace of vectors of the form Aw for all w,
is the space spanned by columns 1 through r of U: span(u_1, ..., u_r).
Proof. For part 2, A^T A = V Σ U^T U Σ V^T = V Σ^2 V^T, an eigendecomposition of
A^T A; in the same way, A A^T = U Σ V^T V Σ U^T = U Σ^2 U^T. This is an
eigendecomposition of A A^T.
5. ||Ax − b||_2^2 = ||U Σ V^T x − b||_2^2. Since A has full rank, so does Σ, and thus
Σ is invertible. Now let [U, Ũ] be square and orthogonal as above, so

    ||U Σ V^T x − b||_2^2 = || [ U^T ; Ũ^T ] (U Σ V^T x − b) ||_2^2
                          = || [ Σ V^T x − U^T b ; −Ũ^T b ] ||_2^2
                          = ||Σ V^T x − U^T b||_2^2 + ||Ũ^T b||_2^2.

This is minimized by making the first term zero, i.e., x = V Σ^{-1} U^T b.
    A = [ 3  1 ]  =  [ 2^{-1/2}  −2^{-1/2} ] · [ 4  0 ] · [ 2^{-1/2}  −2^{-1/2} ]^T  =  U Σ V^T.
        [ 1  3 ]     [ 2^{-1/2}   2^{-1/2} ]   [ 0  2 ]   [ 2^{-1/2}   2^{-1/2} ]
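A sketch verifying this factorization numerically and applying part 5 of Theorem 3.3 (the right-hand side b is hypothetical; the computed U and V may differ from the factors above by sign choices):

    A = [3 1; 1 3];
    [U, S, V] = svd(A)           % singular values 4 and 2
    b = [1; 2];                  % hypothetical right-hand side
    x = V * (S \ (U' * b))       % x = V*Sigma^{-1}*U^T*b, as in part 5 of Theorem 3.3
    A \ b                        % agrees, since A is square and nonsingular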
principal axes σ_i e_i, where e_i is the ith column of the identity matrix.
[Fig. 3.2: four panels showing the unit circle S and its images V'*S, Sigma*V'*S, and U*Sigma*V'*S.]
A_k has rank k by construction, and

    ||A − A_k||_2 = || Σ_{i=k+1}^n σ_i u_i v_i^T ||_2
                  = || U · diag(0, ..., 0, σ_{k+1}, ..., σ_n) · V^T ||_2 = σ_{k+1}.
It remains to show that there is no closer rank k matrix to A. Let B
be any rank k matrix, so its null space has dimension n − k. The space
spanned by {v_1, ..., v_{k+1}} has dimension k + 1. Since the sum of their
dimensions is (n − k) + (k + 1) > n, these two spaces must overlap. Let
h be a unit vector in their intersection. Then

    ||A − B||_2 ≥ ||(A − B)h||_2 = ||Ah||_2 ≥ σ_{k+1},

since Bh = 0 and h lies in the span of v_1, ..., v_{k+1}. ❑
EXAMPLE 3.4. We illustrate the last part of Theorem 3.3 by using it for image
compression. In particular, we will illustrate it with low-rank approximations
of a matrix whose entry a_{ij} is interpreted as the brightness of pixel (i, j).
In other words, matrix entries rang-
ing from 0 to 1 (say) are interpreted as pixels ranging from black (=0) through
various shades of gray to white (=1). (Colors also are possible.) Rather than
storing or transmitting all m • n matrix entries to represent the image, we often
prefer to compress the image by storing many fewer numbers, from which we
can still approximately reconstruct the original image. We may use Part 9 of
Theorem 3.3 to do this, as we now illustrate.
Consider the image in Figure 3.3(a). This 320-by-200 pixel image corre-
sponds to a 320-by-200 matrix A. Let A = UΣV^T be the SVD of A. Part
9 of Theorem 3.3 tells us that A_k = Σ_{i=1}^k σ_i u_i v_i^T is the best rank-k approx-
imation of A, in the sense of minimizing ||A − A_k||_2 = σ_{k+1}. Note that it
only takes m·k + n·k = (m + n)·k words to store u_1 through u_k and
σ_1 v_1 through σ_k v_k, from which we can reconstruct A_k. In contrast, it takes
m·n words to store A (or A_k explicitly), which is much larger when k is
small. So we will use A_k as our compressed image, stored using (m + n)·k
words. The other images in Figure 3.3 show these approximations for vari-
ous values of k, along with the relative errors σ_{k+1}/σ_1 and compression ratios
(m + n)·k/(m·n) = 520·k/64000 ≈ k/123.
These images were produced by the following commands (the clown and
other images are available in Matlab among the visualization demonstration
files; check your local installation for location):
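A Matlab sketch along these lines (assuming the clown demo image mentioned above is available in your installation; these are not necessarily the book's exact commands):

    load clown                       % provides the image matrix X and colormap map
    A = X;
    [U, S, V] = svd(A);
    k = 20;                          % rank of the approximation
    Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';   % best rank-k approximation A_k
    image(Ak), colormap(map)         % display the compressed image
    relerr = S(k+1,k+1) / S(1,1)     % relative error sigma_{k+1}/sigma_1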
Later we will see that the cost of solving a least squares problem with the
SVD is about the same as with QR when m ≫ n, and about 4n^2 m − 3n^3 +
O(n^2) for smaller m. A precise comparison of the costs of QR and the SVD
also depends on the machine being used. See section 3.6 for details.
DEFINITION 3.1. Suppose that A is m-by-n with m > n and has full rank, with
A = QR = UΣV^T being A's QR decomposition and SVD, respectively. Then

    A^+ ≡ (A^T A)^{-1} A^T = R^{-1} Q^T = V Σ^{-1} U^T

is called the (Moore–Penrose) pseudoinverse of A. If m < n, then A^+ ≡
A^T (A A^T)^{-1}.
Fig. 3.3. Image compression using the SVD. (a) Original image. (b) Rank k = 3 approximation.
Fig. 3.3. Continued. (c) Rank k = 10 approximation. (d) Rank k = 20 approximation.
The pseudoinverse lets us write the solution of the full-rank overdeter-
mined least squares problem as simply x = A^+ b. If A is square and full rank,
this formula reduces to x = A^{-1} b, as expected. The pseudoinverse of A is
computed as pinv(A) in Matlab. When A is not full rank, the Moore–Penrose
pseudoinverse is given by Definition 3.2 in section 3.5.
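A sketch checking that the three formulas for A^+ agree and that x = A^+ b solves the least squares problem (hypothetical full-rank data):

    A = randn(8, 3);  b = randn(8, 1);   % hypothetical data
    [Q, R] = qr(A, 0);
    [U, S, V] = svd(A, 0);
    Aplus = V * (S \ U');                % V * Sigma^{-1} * U^T
    norm(Aplus - R \ Q')                 % same as R^{-1} * Q^T, up to roundoff
    norm(Aplus - pinv(A))                % and same as Matlab's pinv
    norm(Aplus*b - A \ b)                % x = A^+ * b is the least squares solution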
When A is not square, we define its condition number with respect to the
2-norm to be κ_2(A) ≡ σ_max(A)/σ_min(A). This reduces to the usual condition
number when A is square.
THEOREM 3.4. Suppose that A is m-by-n with m ≥ n and has full rank. Sup-
pose that x minimizes ||Ax − b||_2. Let r = Ax − b be the residual. Let x̃ minimize
||(A + δA)x̃ − (b + δb)||_2. Assume ε ≡ max(||δA||_2/||A||_2, ||δb||_2/||b||_2) <
1/κ_2(A) = σ_min(A)/σ_max(A). Then

    ||x̃ − x||_2 / ||x||_2 ≤ ε · { 2·κ_2(A)/cos θ + tan θ · κ_2^2(A) } + O(ε^2) ≡ ε·κ_LS + O(ε^2),

where sin θ = ||r||_2 / ||b||_2. In other words, θ is the angle between the vectors b and
Ax and measures whether the residual norm ||r||_2 is large (near ||b||_2) or small
(near 0). κ_LS is the condition number for the least squares problem.
The QR-based algorithm of section 3.4.1 is backward stable: the computed solution x̃
exactly minimizes ||(A + δA)x̃ − (b + δb)||_2 for some δA and δb with norms O(ε)·||A||_2
and O(ε)·||b||_2, and the corresponding residual satisfies

    ||r̃ − r||_2 / ||r||_2 ≤ ε (1 + 2κ_2(A)).
We may combine this with the above perturbation bounds to get error bounds
for the solution of the least squares problem, much as we did for linear equation
solving.
The normal equations are not as accurate. Since they involve solving
(A^T A)x = A^T b, the accuracy depends on the condition number κ_2(A^T A) =
κ_2^2(A). Thus the error is always bounded by κ_2^2(A)·ε, never just κ_2(A)·ε. There-
fore we expect that the normal equations can lose twice as many digits of
accuracy as methods based on the QR decomposition and SVD.
Furthermore, solving the normal equations is not necessarily stable; i.e.,
the computed solution x̃ does not generally minimize ||(A + δA)x̃ − (b + δb)||_2
for small δA and δb. Still, when the condition number is small, we expect
the normal equations to be about as accurate as the QR decomposition or
SVD. Since the normal equations are the fastest way to solve the least squares
problem, they are the method of choice when the matrix is well-conditioned.
We return to the problem of solving very ill-conditioned least squares prob-
lems in section 3.5.
As noted in section 3.2.2, Gram–Schmidt orthogonalization is not guaranteed to compute the QR
decomposition stably.
Instead, we base our algorithms on certain easily computable orthogonal
matrices called Householder relections and Givens rotations, which we can
choose to introduce zeros into vectors that they multiply. Later we will show
that any algorithm that uses these orthogonal matrices to introduce zeros
is automatically stable. This error analysis will apply to our algorithms for
the QR decomposition as well as many SVD and eigenvalue algorithms in
Chapters 4 and 5.
Despite the possibility of nonorthogonal Q, the MGS algorithm has im-
portant uses in numerical linear algebra. (There is little use for its less stable
version, CGS.) These uses include finding eigenvectors of symmetric tridiagonal
matrices using bisection and inverse iteration (section 5.3.4) and the Arnoldi
and Lanczos algorithms for reducing a matrix to certain "condensed" forms
(sections 6.6.1, 6.6.6, and 7.4). Arnoldi and Lanczos algorithms are used as
the basis of algorithms for solving sparse linear systems and finding eigenval-
ues of sparse matrices. MGS can also be modified to solve the least squares
problem stably, but Q may still be far from orthogonal [33].
          [ x_1 + sign(x_1)·||x||_2 ]
    ũ  =  [ x_2                     ] ,        u = ũ / ||ũ||_2 .
          [  ⋮                      ]
          [ x_n                     ]
1. Choose P_1 so that A_1 ≡ P_1 A has the pattern

       [ x x x x ]
       [ 0 x x x ]
       [ 0 x x x ]
       [ 0 x x x ]
       [ 0 x x x ] .

2. Choose P_2 = [ 1, 0 ; 0, P̃_2 ] so that A_2 ≡ P_2 A_1 has the pattern

       [ x x x x ]
       [ 0 x x x ]
       [ 0 0 x x ]
       [ 0 0 x x ]
       [ 0 0 x x ] .

3. Choose P_3 = [ I_2, 0 ; 0, P̃_3 ] so that A_3 ≡ P_3 A_2 has the pattern

       [ x x x x ]
       [ 0 x x x ]
       [ 0 0 x x ]
       [ 0 0 0 x ]
       [ 0 0 0 x ] .

4. Choose P_4 = [ I_3, 0 ; 0, P̃_4 ] so that A_4 ≡ P_4 A_3 has the pattern

       [ x x x x ]
       [ 0 x x x ]
       [ 0 0 x x ]
       [ 0 0 0 x ]
       [ 0 0 0 0 ] .
Here, we have chosen a Householder matrix P̃_i to zero out the subdiago-
nal entries in column i; this does not disturb the zeros already introduced in
previous columns.
Let us call the final 5-by-4 upper triangular matrix R̃ ≡ A_4. Then A =
P_1 P_2 P_3 P_4 R̃ = QR, where Q is the first four columns of
P_1 P_2 P_3 P_4 (since all P_i are symmetric) and R is the first four rows of R̃. o
ALGORITHM 3.2. QR decomposition using Householder transformations.
for i = 1 to min(m − 1, n)
    u_i = House(A(i : m, i))
    P_i = I − 2 u_i u_i^T
    A(i : m, i : n) = P_i · A(i : m, i : n)
end for
    (I − 2 u_i u_i^T) · A(i : m, i : n) = A(i : m, i : n) − 2 u_i (u_i^T · A(i : m, i : n)),

which costs less. To store P_i, we need only u_i, or ũ_i and ||ũ_i||_2. These can
be stored in column i of A; in fact it need not be changed! Thus QR can be
"overwritten" on A, where Q is stored in factored form P_1 ··· P_n, and P_i is
stored as ũ_i below the diagonal in column i of A. (We need an extra array of
length n for the top entry of ũ_i, since the diagonal entry is occupied by R_ii.)
Recall that to solve the least squares problem min ||Ax − b||_2 using A = QR,
we need to compute Q^T b. This is done as follows: Q^T b = P_n P_{n−1} ··· P_1 b, so
we need only keep multiplying b by P_1, P_2, ..., P_n.
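A compact Matlab sketch of Algorithm 3.2 combined with this accumulation of Q^T b (a simplified illustration with hypothetical data, not the book's implementation; the storage scheme described above is omitted for clarity):

    A = randn(8, 4);  b = randn(8, 1);       % hypothetical data, m = 8, n = 4
    [m, n] = size(A);
    for i = 1:min(m-1, n)
        x = A(i:m, i);
        s = sign(x(1));  if s == 0, s = 1; end
        v = x;  v(1) = x(1) + s*norm(x);     % Householder vector House(x), before normalization
        if norm(v) > 0, v = v / norm(v); end
        A(i:m, i:n) = A(i:m, i:n) - 2*v*(v' * A(i:m, i:n));   % apply P_i = I - 2*v*v'
        b(i:m) = b(i:m) - 2*v*(v' * b(i:m));                  % accumulate Q^T * b as we go
    end
    x_ls = triu(A(1:n, 1:n)) \ b(1:n);       % solve R*x = (Q^T b)(1:n): the least squares solution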
A Givens rotation R(i, j, θ) rotates a vector x counterclockwise by the angle θ in
the (i, j) coordinate plane. It agrees with the identity matrix except in rows and
columns i and j:

    R(i, j, θ) = [ 1                            ]
                 [   ⋱                          ]
                 [     cos θ   ···   −sin θ     ]   (row i)
                 [       ⋮      ⋱      ⋮        ]
                 [     sin θ   ···    cos θ     ]   (row j)
                 [                          ⋱   ]
                 [                            1 ] .
Given x, i, and j, we can zero out x_j by choosing cos θ and sin θ so that

    [ cos θ   −sin θ ] [ x_i ]   [ sqrt(x_i^2 + x_j^2) ]
    [ sin θ    cos θ ] [ x_j ] = [          0          ] ,

i.e., cos θ = x_i / sqrt(x_i^2 + x_j^2) and sin θ = −x_j / sqrt(x_i^2 + x_j^2).
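A tiny Matlab sketch of this computation on hypothetical values x_i = 3 and x_j = 4:

    xi = 3;  xj = 4;                 % hypothetical entries x_i and x_j
    r = hypot(xi, xj);               % sqrt(xi^2 + xj^2), computed without overflow
    c = xi / r;  s = -xj / r;        % cos(theta) and sin(theta) from the formula above
    [c -s; s c] * [xi; xj]           % returns [5; 0]: the second entry has been zeroed out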
To compute a QR decomposition, we can apply Givens rotations to zero out the
subdiagonal entries of A one at a time. For example, to reduce

    [ x x x x ]        [ x x x x ]
    [ 0 x x x ]        [ 0 x x x ]
    [ 0 0 x x ]   to   [ 0 0 x x ]
    [ 0 0 x x ]        [ 0 0 0 x ]
    [ 0 0 x x ]        [ 0 0 0 x ] ,

we multiply

    [ 1              ]   [ x x x x ]   [ x x x x ]
    [   1            ]   [ 0 x x x ]   [ 0 x x x ]
    [     1          ] · [ 0 0 x x ] = [ 0 0 x x ]
    [        c   −s  ]   [ 0 0 x x ]   [ 0 0 x x ]
    [        s    c  ]   [ 0 0 x x ]   [ 0 0 0 x ]

and

    [ 1              ]   [ x x x x ]   [ x x x x ]
    [   1            ]   [ 0 x x x ]   [ 0 x x x ]
    [     c'   −s'   ] · [ 0 0 x x ] = [ 0 0 x x ]
    [     s'    c'   ]   [ 0 0 x x ]   [ 0 0 0 x ]
    [              1 ]   [ 0 0 0 x ]   [ 0 0 0 x ] .
The cost of the QR decomposition using Givens rotations is twice the cost
of using Householder reflections. We will need Givens rotations for other ap-
plications later.
Here are some implementation details. Just as we overwrote A with Q
and R when using Householder reflections, we can do the same with Givens
rotations. We use the same trick, storing the information describing the trans-
formation in the entries zeroed out. Since a Givens rotation zeros out just one
entry, we must store the information about the rotation there. We do this as
follows. Let s = sin θ and c = cos θ. If |s| < |c|, store s·sign(c), and other-
wise store sign(s)/c. To recover s and c from the stored value (call it ρ), we do
the following: if |ρ| < 1, then s = ρ and c = sqrt(1 − ρ^2); otherwise c = 1/ρ
and s = sqrt(1 − c^2).
This analysis proves backward stability for the QR decomposition and for many
of the algorithms for eigenvalues and singular values that we will discuss.
and

    fl(P_j ··· P_1 A_0 Q_1 Q_2 ··· Q_j) = P_j ··· P_1 (A_0 + E) Q_1 ··· Q_j .
    B ≡ fl(P_j A_{j−1})
      = P_j (A_{j−1} + E')                                      by Lemma 3.1
      = P_j (P_{j−1} (A + E_{j−1}) Q_{j−1} + E')                 by induction
      = P_j P_{j−1} (A + E_{j−1} + P_{j−1}^T E' Q_{j−1}^T) Q_{j−1}
      ≡ P_j P_{j−1} (A + E_j) Q_{j−1},

where E_j ≡ E_{j−1} + P_{j−1}^T E' Q_{j−1}^T satisfies ||E_j||_2 ≤ ||E_{j−1}||_2 + ||E'||_2,
since multiplying by orthogonal matrices does not change the 2-norm.
where ||E||_2 ≤ O(ε)·||X||_2·||A||_2 and so ||F||_2 ≤ ||X^{-1}||_2·||E||_2 ≤ O(ε)·κ_2(X)·||A||_2.
So the error ||E||_2 is magnified by the condition number κ_2(X) ≥ 1. In a
larger product X_k ··· X_1 A Y_1 ··· Y_k the error would be magnified by Π_i κ_2(X_i) ·
κ_2(Y_i). This factor is minimized if and only if all X_i and Y_i are orthogonal (or
scalar multiples of orthogonal matrices), in which case the factor is one.
So far we have assumed that A has full rank when minimizing ||Ax − b||_2.
What happens when A is rank deficient or "close" to rank deficient? Such
problems arise in practice in many ways, such as extracting signals from noisy
data, solution of some integral equations, digital image restoration, comput-
ing inverse Laplace transforms, and so on [141, 142]. These problems are
very ill-conditioned, so we will need to impose extra conditions on their so-
lutions to make them well-conditioned. Making an ill-conditioned problem
well-conditioned by imposing extra conditions on the solution is called regular-
ization and is also done in other fields of numerical analysis when ill-conditioned
problems arise.
For example, the next proposition shows that if A is exactly rank deficient,
then the least squares solution is not even unique.
PROPOSITION 3.1. Let A be m-by-n with m > n and rank A = r < n. Then
there is an (n − r)-dimensional set of vectors x that minimize ||Ax − b||_2.
PROPOSITION 3.2. Let σ_min = σ_min(A), the smallest singular value of A. As-
sume σ_min > 0. Then
can be characterized as follows. Let A = UΣV^T have rank r < n, and write
the SVD of A as

    A = [U_1, U_2] · [ Σ_1, 0 ; 0, 0 ] · [V_1, V_2]^T = U_1 Σ_1 V_1^T,                  (3.1)

where Σ_1 is r-by-r and nonsingular and U_1 and V_1 have r columns. Let σ ≡
σ_min(Σ_1), the smallest nonzero singular value of A. Then
In other words, the norm and condition number of the unique minimal norm
solution x depend on the smallest nonzero singular value of A.
    ||Ax − b||_2^2 = || [U_1, U_2, Ũ]^T (U_1 Σ_1 V_1^T x − b) ||_2^2
                   = || [ Σ_1 V_1^T x − U_1^T b ; −U_2^T b ; −Ũ^T b ] ||_2^2
                   = ||Σ_1 V_1^T x − U_1^T b||_2^2 + ||U_2^T b||_2^2 + ||Ũ^T b||_2^2,

where Ũ is chosen so that [U_1, U_2, Ũ] is square and orthogonal.
1. ||Ax − b||_2^2 is minimized when Σ_1 V_1^T x = U_1^T b, or x = V_1 Σ_1^{-1} U_1^T b + V_2 z,
since V_1^T V_2 z = 0 for all z.
Proposition 3.3 tells us that the minimum norm solution x is unique and
may be well-conditioned if the smallest nonzero singular value is not too small.
This is key to a practical algorithm, discussed in the next section.
certain drug on blood sugar level. We collect data from each patient (numbered
from i = 1 to m) by recording his or her initial blood sugar level (a_{i,1}), final
blood sugar level (b_i), the amount of drug administered (a_{i,2}), and other med-
ical quantities, including body weights on each day of a week-long treatment
(a_{i,3} through a_{i,9}). In total, there are n < m medical quantities measured for
each patient. Our goal is to predict b_i given a_{i,1} through a_{i,n}, and we formulate
this as the least squares problem min_x ||Ax − b||_2. We plan to use x to predict
the final blood sugar level b_j of a future patient j by computing the dot product
Σ_{k=1}^n a_{jk} x_k.
Since people's weight generally does not change significantly from day to
day, it is likely that columns 3 through 9 of matrix A, which contain the
weights, are very similar. For the sake of argument, suppose that columns
3 and 4 are identical (which may be the case if the weights are rounded to
the nearest pound). This means that matrix A is rank deficient and that
x_0 = [0, 0, 1, −1, 0, ..., 0]^T is a right null vector of A. So if x is a (minimum
norm) solution of the least squares problem min_x ||Ax − b||_2, then x + βx_0 is
also a (nonminimum norm) solution for any scalar β, including, say, β = 0 and
β = 10^6. Is there any reason to prefer one value of β over another? The value
10^6 is clearly not a good one, since future patient j, who gains one pound
between days 1 and 2, will have that difference of one pound multiplied by
10^6 in the predictor Σ_k a_{jk} x_k of final blood sugar level. It is much more
reasonable to choose β = 0, corresponding to the minimum norm solution x.
o
For further justification of using the minimum norm solution for rank-
deficient problems, see [141, 142].
When A is square and nonsingular, the unique solution of Ax = b is of
course x = A^{-1} b. If A has more rows than columns and is possibly rank
deficient, the unique minimum norm least squares solution may be similarly
written x = A^+ b, where the Moore–Penrose pseudoinverse A^+ is defined as
follows.
SVD
For example, the 2-by-2 matrix A = diag(1, 0) is exactly singular, and its
smallest nonzero singular value is σ = 1. As described in Proposition 3.3, the
minimum norm least squares solution to min_x ||Ax − b||_2 with b = [1, 1]^T is
x = [1, 0]^T, with condition number 1/σ = 1. But if we make an arbitrarily
tiny perturbation to get A = diag(1, ε), then σ drops to ε and x = [1, 1/ε]^T
becomes enormous, as does its condition number 1/ε. In general, roundoff will
make such tiny perturbations, of magnitude O(ε)·||A||_2. As we just saw, this
can increase the condition number from 1/σ to 1/ε.
of backward stability: the computed SVD will be the exact SVD of a slightly
different matrix A + δA, with ||δA|| = O(ε)·||A||. (This is
discussed in detail in Chapter 5.) This means that any computed singular value
at or below O(ε)·||A||_2 can be treated as zero, because roundoff makes it
indistinguishable from 0. In the above 2-by-2 example, this means we would set
the ε in A to zero before solving the least squares problem. This would raise the
smallest nonzero singular value from ε to 1 and correspondingly decrease the
condition number from 1/ε to 1/σ = 1.
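A Matlab sketch of this truncation strategy; the matrix is made exactly rank deficient in the spirit of the blood-sugar example above (identical columns), and the tolerance is one common convention, not the only possibility:

    A = randn(10, 4);  A(:, 4) = A(:, 3);    % rank deficient: columns 3 and 4 identical
    b = randn(10, 1);                        % hypothetical data
    [U, S, V] = svd(A, 0);
    s = diag(S);
    tol = max(size(A)) * eps * s(1);         % one common threshold for "numerically zero"
    r = sum(s > tol);                        % numerical rank (3 here)
    x = V(:, 1:r) * ((U(:, 1:r)' * b) ./ s(1:r));   % minimum norm truncated-SVD solution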
Fig. 3.4. Graph of truncated least squares solution of min_{y_1} ||(A_1 + δA)y_1 − b_1||_2,
using tol = 10^{-9}. The singular values of A_1 are shown as red x's. The norm ||δA||_2
is the horizontal axis. The top graph plots the rank of A_1 + δA, i.e., the number of
singular values exceeding tol. The bottom graph plots ||y_1 − x_1||_2/||x_1||_2, where x_1 is
the solution with δA = 0.
singular value is O(||δA||_2), which is quite small, causing the error to jump to
||δA||_2/O(||δA||_2) = O(1), as shown to the right of ||δA||_2 = tol in the bottom
graph of Figure 3.4.
In Figure 3.5, the nonzero singular values of A_2 are also shown as red x's;
the smallest one, 1.2·10^{-9}, is just larger than tol. So the predicted error when
||δA||_2 ≤ tol is ||δA||_2/10^{-9}, which grows to O(1) when ||δA||_2 = tol. This is
confirmed by the bottom graph in Figure 3.5. o
Fig. 3.5. Graph of truncated least squares solution of min_{y_2} ||(A_2 + δA)y_2 − b_2||_2,
using tol = 10^{-9}. The singular values of A_2 are shown as red x's. The norm ||δA||_2
is the horizontal axis. The top graph plots the rank of A_2 + δA, i.e., the number of
singular values exceeding tol. The bottom graph plots ||y_2 − x_2||_2/||x_2||_2, where x_2 is
the solution with δA = 0.
with ||R_22||_2 very small, on the order of ε·||A||_2. In this case we could just set
R_22 = 0 and minimize ||Ax − b||_2 as follows: let [Q, Q̃] be square and orthogonal,
so that

    ||Ax − b||_2^2 = || [Q, Q̃]^T (Ax − b) ||_2^2 = || [ Rx − Q^T b ; −Q̃^T b ] ||_2^2
                   = ||Rx − Q^T b||_2^2 + ||Q̃^T b||_2^2.

Write Q = [Q_1, Q_2] and x = [ x_1 ; x_2 ] conformally with R = [ R_11, R_12 ; 0, 0 ], so
that

    ||Ax − b||_2^2 = ||R_11 x_1 + R_12 x_2 − Q_1^T b||_2^2 + ||Q_2^T b||_2^2 + ||Q̃^T b||_2^2

is minimized by choosing x_1 = R_11^{-1} (Q_1^T b − R_12 x_2) for any x_2. Note that the
choice x_2 = 0 does not necessarily minimize ||x||_2, but it is a reasonable choice,
especially if R_11 is well-conditioned and R_11^{-1} R_12 is small.
Unfortunately, this method is not reliable, since R may be nearly rank
deficient even if no R_22 is small. For example, the n-by-n bidiagonal matrix

    A = [ 1/2   1                  ]
        [      1/2   1             ]
        [            ⋱    ⋱        ]
        [                1/2   1   ]
        [                     1/2  ]

is nearly rank deficient when n is large, even though no trailing block R_22 of it has small norm.
What is the fastest algorithm for solving dense least squares problems? As
discussed in section 3.2, solving the normal equations is fastest, followed by
QR and the SVD. If A is quite well-conditioned, then the normal equations are
about as accurate as the other methods, so even though the normal equations
are not numerically stable, they may be used as well. When A is not well-
conditioned but far from rank deficient, we should use QR.
Since the design of fast algorithms for rank-deficient least squares problems
is a current research area, it is difficult to recommend a single algorithm to use.
We summarize a recent study [206] that compared the performance of several
algorithms, comparing them to the fastest stable algorithm for the non—rank-
deficient case: QR without pivoting, implemented using Householder trans-
formations as described in section 3.4.1, with memory hierarchy optimizations
The best recent reference on least squares problems is [33], which also discusses
variations on the basic problem discussed here (such as constrained, weighted,
and updating least squares), different ways to regularize rank-deficient prob-
lems, and software for sparse least squares problems. See also chapter 5 of
[121] and [168]. Perturbation theory and error bounds for the least squares
solution are discussed in detail in [149]. Rank-revealing QR decompositions
are discussed in [28, 30, 48, 50, 126, 150, 196, 206, 236]. In particular, these
papers examine the tradeoff between cost and accuracy in rank determination,
and in [206] there is a comprehensive performance comparison of the available
methods for rank-deficient least squares problems.
QUESTION 3.2. (Easy) This question will illustrate the difference in nu-
merical stability among three algorithms for computing the QR factoriza-
tion of a matrix: Householder QR (Algorithm 3.2), CGS (Algorithm 3.1),
and MGS (Algorithm 3.1). Obtain the Matlab program QRStability.m from
HOMEPAGE/Matlab/QRStability.m. This program generates random matri-
ces with user-specified dimensions m and n and condition number cnd, computes
their QR decomposition using the three algorithms, and measures the accuracy
of the results. It does this with the residual ||A − Q·R||/||A||, which should be
around machine epsilon ε for a stable algorithm, and the orthogonality of Q,
||Q^T·Q − I||, which should also be around ε. Run this program for small ma-
trix dimensions (such as m = 6 and n = 4), modest numbers of random matrices
(samples = 20), and condition numbers ranging from cnd = 1 up to cnd = 10^15.
Describe what you see. Which algorithms are more stable than others? See
if you can describe how large ||Q^T·Q − I|| can be as a function of choice of
algorithm, cnd, and ε.
QUESTION 3.3. (Medium; Hard) Let A be m-by-n, m > n, and have full rank.

1. (Medium) Show that

    [ I    A ] [ r ]   [ b ]
    [ A^T  0 ] [ x ] = [ 0 ]

has a solution where x minimizes ||Ax − b||_2. One reason for this formulation is that we can
apply iterative refinement to this linear system if we want a more accurate
answer (see section 2.5).
matrix, as a block 2-by-2 matrix. Hint: Use 2-by-2 block Gaussian elim-
ination. Where have we previously seen the (2,1) block entry?
    A = [ R ; S ],
where R is n-by-n and upper triangular, and S is m-by-n and dense. Describe
an algorithm using Householder transformations for reducing A to upper trian-
gular form. Your algorithm should not "fill in" the zeros in R and thus require
fewer operations than would Algorithm 3.2 applied to A.
1. (A^T A)^{-1}

2. (A^T A)^{-1} A^T

3. A (A^T A)^{-1}

4. A (A^T A)^{-1} A^T
    A A^+ A = A,
    A^+ A A^+ = A^+,
    A^+ A = (A^+ A)^T,
    A A^+ = (A A^+)^T.
QUESTION 3.15. (Medium) Let A be m-by-n, m < n, and of full rank. Then
min_x ||Ax − b||_2 is called an underdetermined least squares problem. Show that
the set of solutions is (n − m)-dimensional. Show how to compute the unique
minimum norm solution using appropriately modified normal equations, QR
decomposition, and SVD.
sian elimination to perform Level 2 BLAS and Level 3 BLAS at each step in
order to exploit the higher speed of these operations. In this problem, we will
show how to apply a sequence of Householder transformations using Level 2
and Level 3 BLAS.
for i = 1 to n
    u_i = House(A(i : m, i))
    P_i = I − 2 u_i u_i^T
    A(i : m, i : n) = P_i · A(i : m, i : n)
end for
that there is a unique solution under the assumption that [ ] has full column
rank. Show how to compute x using two QR decompositions and some matrix-
vector multiplications and solving some triangular systems of equations. Hint:
Look at LAPACK routine sgglse and its description in the LAPACK manual
[10] (NETLIB/lapack/lug/lapack_lug.html).