Chapter 11
Multiple Regression: Matrix Formulation
In this chapter we use matrices to write regression models. Properties of matrices are reviewed in
Appendix A. The economy of notation achieved through using matrices allows us to arrive at some
interesting new insights and to derive several of the important properties of regression analysis.
The expected value of the random vector is just the vector of expected values of the random variables. For the random variables write $E(y_i) = \mu_i$; then
$$E(Y) \equiv \begin{bmatrix} E(y_1) \\ E(y_2) \\ E(y_3) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix} \equiv \mu.$$
In other words, expectation of a random vector is performed elementwise. In fact, the expected
value of any random matrix (a matrix consisting of random variables) is the matrix made up of the
expected values of the elements in the random matrix. Thus if wi j , i = 1, 2, 3, j = 1, 2 is a collection
of random variables and we write
$$W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix},$$
then
$$E(W) \equiv \begin{bmatrix} E(w_{11}) & E(w_{12}) \\ E(w_{21}) & E(w_{22}) \\ E(w_{31}) & E(w_{32}) \end{bmatrix}.$$
We also need a concept for random vectors that is analogous to the variance of a random variable.
This is the covariance matrix, sometimes called the dispersion matrix, the variance matrix, or the
variance-covariance matrix. The covariance matrix is simply a matrix consisting of all the variances
and covariances associated with the vector Y . Write
$$\operatorname{Var}(y_i) = E[(y_i - \mu_i)^2] \equiv \sigma_{ii}$$
and
$$\operatorname{Cov}(y_i, y_j) = E[(y_i - \mu_i)(y_j - \mu_j)] \equiv \sigma_{ij}.$$
Two subscripts are used on $\sigma_{ii}$ to indicate that it is the variance of $y_i$ rather than writing $\operatorname{Var}(y_i) = \sigma_i^2$.
The covariance matrix of our 3 × 1 vector Y is
$$\operatorname{Cov}(Y) = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{bmatrix}.$$
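As a quick numerical illustration of these definitions (not from the text, whose own computing is done in Minitab), the following sketch simulates many draws of a 3 x 1 random vector and checks that the elementwise sample means and the sample covariance matrix behave the way $\mu$ and $\operatorname{Cov}(Y)$ should; numpy and the particular $\mu$ and covariance matrix are my own choices.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, 3.0])             # E(Y) = mu, elementwise
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 3.0]])
Sigma = A @ A.T                            # a valid covariance matrix

Y = rng.multivariate_normal(mu, Sigma, size=100_000)   # each row is a replicate of Y'

print(Y.mean(axis=0))            # approximates E(Y) = (mu_1, mu_2, mu_3)'
print(np.cov(Y, rowvar=False))   # approximates Cov(Y) = [sigma_ij]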
11.2 Matrix formulation of regression models

The simple linear regression (SLR) model is
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{11.2.1}$$
where $E(\varepsilon_i) = 0$, $\operatorname{Var}(\varepsilon_i) = \sigma^2$, and $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \ne j$. In matrix terms this can be written as
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.$$
Multiplying and adding the matrices on the right-hand side gives a vector whose $i$th element is $\beta_0 + \beta_1 x_i + \varepsilon_i$.
These two vectors are equal if and only if the corresponding elements are equal, which occurs if and
only if model (11.2.1) holds. The conditions on the εi s translate into matrix terms as
E(e) = 0
Cov(e) = σ 2 I
where I is the n × n identity matrix. By definition, the covariance matrix Cov(e) has the variances
of the εi s down the diagonal. The variance of each individual εi is σ 2 , so all the diagonal elements
of Cov(e) are σ 2 , just as in σ 2 I. The covariance matrix Cov(e) has the covariances of distinct εi s as
its off-diagonal elements. The covariances of distinct εi s are all 0, so all the off-diagonal elements
of Cov(e) are zero, just as in σ 2 I.
E XAMPLE 11.2.1. Height and weight data are given in Table 11.1 for 12 individuals. In matrix
terms, the SLR model for regressing weights (y) on heights (x) is
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 65 \\ 1 & 65 \\ 1 & 65 \\ 1 & 65 \\ 1 & 66 \\ 1 & 66 \\ 1 & 63 \\ 1 & 63 \\ 1 & 63 \\ 1 & 72 \\ 1 & 72 \\ 1 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.$$
The observed data for this example are
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 120 \\ 140 \\ 130 \\ 135 \\ 150 \\ 135 \\ 110 \\ 135 \\ 120 \\ 170 \\ 185 \\ 160 \end{bmatrix}.$$
We could equally well rearrange the order of the observations to write
$$\begin{bmatrix} y_7 \\ y_8 \\ y_9 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 63 \\ 1 & 63 \\ 1 & 63 \\ 1 & 65 \\ 1 & 65 \\ 1 & 65 \\ 1 & 65 \\ 1 & 66 \\ 1 & 66 \\ 1 & 72 \\ 1 & 72 \\ 1 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}$$
in which the xi values are ordered from smallest to largest. 2
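As an aside, here is a minimal numpy sketch (numpy is my choice of tool; the book's own computing is done in Minitab) that assembles $Y$ and $X$ for the height-weight data of Example 11.2.1 and shows that reordering the cases, as in the second display above, just permutes the rows of $Y$ and $X$ together.

import numpy as np

# Heights (x) and weights (y) for the 12 individuals of Example 11.2.1.
x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160], dtype=float)

# Model matrix X = [1, x].
X = np.column_stack([np.ones_like(x), x])
print(X.shape)   # (12, 2)

# Reordering rows of X and y together (smallest x first) leaves the model unchanged.
order = np.argsort(x, kind="stable")
X_sorted, y_sorted = X[order], y[order]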
The general linear model is
$$Y = X\beta + e, \qquad E(e) = 0, \qquad \operatorname{Cov}(e) = \sigma^2 I,$$
where $Y$ is an $n \times 1$ vector of observable random variables, $X$ is an $n \times p$ matrix of known constants, $\beta$ is a $p \times 1$ vector of unknown parameters, and $e$ is an $n \times 1$ vector of unobservable errors. In particular, the multiple regression model
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{11.2.2}$$
with
$$E(\varepsilon_i) = 0, \quad \operatorname{Var}(\varepsilon_i) = \sigma^2, \quad \operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0 \text{ for } i \ne j,$$
is a general linear model. In matrix terms this can be written as
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\ 1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{p-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix},$$
or, indicating the dimensions of the matrices,
$$Y_{n\times 1} = X_{n\times p}\,\beta_{p\times 1} + e_{n\times 1}.$$
Multiplying and adding the right-hand side gives
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_{p-1}x_{1,p-1} + \varepsilon_1 \\ \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_{p-1}x_{2,p-1} + \varepsilon_2 \\ \vdots \\ \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_{p-1}x_{n,p-1} + \varepsilon_n \end{bmatrix},$$
which holds if and only if (11.2.2) holds. The conditions on the εi s translate into
$$E(e) = 0, \qquad \operatorname{Cov}(e) = \sigma^2 I.$$
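For concreteness, a small helper along the following lines builds the $n \times p$ model matrix from the predictor columns; the function name and the tiny data set are hypothetical, and numpy is an assumed choice rather than anything used in the text.

import numpy as np

def model_matrix(predictors):
    """Stack a column of 1s with the predictor columns to get X (n x p)."""
    predictors = np.asarray(predictors, dtype=float)
    return np.column_stack([np.ones(predictors.shape[0]), predictors])

# Hypothetical predictors: n = 5 cases, p - 1 = 3 predictor variables.
Z = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 4.0, 2.5],
              [4.0, 3.0, 3.5],
              [5.0, 6.0, 4.5]])
X = model_matrix(Z)
print(X.shape)   # (5, 4), i.e., n x p with p = 4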
E XAMPLE 11.2.3. In Example 11.2.1 we illustrated the matrix form of a SLR using the data on
heights and weights. We now illustrate some of the models from Chapter 8 applied to these data.
The cubic model
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i \tag{11.2.3}$$
is
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 65 & 65^2 & 65^3 \\ 1 & 65 & 65^2 & 65^3 \\ 1 & 65 & 65^2 & 65^3 \\ 1 & 65 & 65^2 & 65^3 \\ 1 & 66 & 66^2 & 66^3 \\ 1 & 66 & 66^2 & 66^3 \\ 1 & 63 & 63^2 & 63^3 \\ 1 & 63 & 63^2 & 63^3 \\ 1 & 63 & 63^2 & 63^3 \\ 1 & 72 & 72^2 & 72^3 \\ 1 & 72 & 72^2 & 72^3 \\ 1 & 72 & 72^2 & 72^3 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.$$
Some of the numbers in $X$ are getting quite large, e.g., $65^3 = 274{,}625$. The model has better numerical properties if we compute $\bar x_\cdot = 66.41\overline{6}$ and replace model (11.2.3) with the equivalent model
$$y_i = \beta_{*0} + \beta_{*1}(x_i - \bar x_\cdot) + \beta_{*2}(x_i - \bar x_\cdot)^2 + \beta_{*3}(x_i - \bar x_\cdot)^3 + \varepsilon_i.$$
This third-degree polynomial is the largest polynomial that we can fit to these data: two points determine a line, three points determine a quadratic, and with only four distinct $x$ values in the data we cannot fit a polynomial of degree higher than three.
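A rough numerical illustration of the conditioning point, using numpy (my choice of tool): the condition number of the raw cubic model matrix is enormous, while the matrix built from $x - \bar x_\cdot$ is far better behaved.

import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)

# Raw cubic design matrix: columns 1, x, x^2, x^3.
X_raw = np.vander(x, N=4, increasing=True)

# Centered cubic design matrix: columns 1, (x - xbar), (x - xbar)^2, (x - xbar)^3.
X_cen = np.vander(x - x.mean(), N=4, increasing=True)

print(np.linalg.cond(X_raw))   # huge
print(np.linalg.cond(X_cen))   # orders of magnitude smaller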
Define $\tilde x = (x - 63)/9$, so that
$$(x_1, \ldots, x_{12}) = (65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72)$$
transforms to
$$(\tilde x_1, \ldots, \tilde x_{12}) = (2/9,\, 2/9,\, 2/9,\, 2/9,\, 1/3,\, 1/3,\, 0,\, 0,\, 0,\, 1,\, 1,\, 1).$$
In matrix terms, the Haar wavelet model of Chapter 8, which augments the simple linear regression with the indicator functions of $\tilde x < 0.5$ and of $\tilde x \ge 0.5$, becomes
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 65 & 1 & 0 \\ 1 & 65 & 1 & 0 \\ 1 & 65 & 1 & 0 \\ 1 & 65 & 1 & 0 \\ 1 & 66 & 1 & 0 \\ 1 & 66 & 1 & 0 \\ 1 & 63 & 1 & 0 \\ 1 & 63 & 1 & 0 \\ 1 & 63 & 1 & 0 \\ 1 & 72 & 0 & 1 \\ 1 & 72 & 0 & 1 \\ 1 & 72 & 0 & 1 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.$$
Notice that the last two columns of the X matrix add up to a column of 1s, like the first column.
This causes the rank of the 12 × 4 model matrix X to be only 3, so the model is not a regression
model. Dropping either of the last two columns (or the first column) does not change the model in
any meaningful way but makes the model a regression.
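A quick check of the rank statement, with numpy as an assumed tool: the 12 x 4 matrix with the two indicator columns has rank 3, and dropping the first column (or either of the last two) restores full column rank.

import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
ind_low  = (x < 67.5).astype(float)   # indicator of x-tilde < 0.5
ind_high = (x >= 67.5).astype(float)  # indicator of x-tilde >= 0.5

X = np.column_stack([np.ones_like(x), x, ind_low, ind_high])
print(np.linalg.matrix_rank(X))          # 3: the four columns are linearly dependent
print(np.linalg.matrix_rank(X[:, 1:]))   # 3: dropping the intercept column gives full column rank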
If we partition the SLR model into points below 65.5 and above 65.5, the matrix model becomes
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 0 & 0 & 1 & 66 \\ 0 & 0 & 1 & 66 \\ 1 & 63 & 0 & 0 \\ 1 & 63 & 0 & 0 \\ 1 & 63 & 0 & 0 \\ 0 & 0 & 1 & 72 \\ 0 & 0 & 1 & 72 \\ 0 & 0 & 1 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.$$
Alternatively, we could rewrite the model as
$$\begin{bmatrix} y_7 \\ y_8 \\ y_9 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 63 & 0 & 0 \\ 1 & 63 & 0 & 0 \\ 1 & 63 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 0 & 0 & 1 & 66 \\ 0 & 0 & 1 & 66 \\ 0 & 0 & 1 & 72 \\ 0 & 0 & 1 & 72 \\ 0 & 0 & 1 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.$$
This makes it a bit clearer that we are fitting a SLR to the points with small $x$ values and a separate SLR to the cases with large $x$ values. The pattern of 0s in the $X$ matrix ensures that the small $x$ values only involve the intercept and slope parameters $\beta_0$ and $\beta_1$ for the line on the first partition set and that the large $x$ values only involve the intercept and slope parameters $\beta_2$ and $\beta_3$ for the line on the second partition set.
Yet another way to write the two-lines model is
$$\begin{bmatrix} y_7 \\ y_8 \\ y_9 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 63 & 0 & 0 \\ 1 & 63 & 0 & 0 \\ 1 & 63 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 66 & 1 & 66 \\ 1 & 66 & 1 & 66 \\ 1 & 72 & 1 & 72 \\ 1 & 72 & 1 & 72 \\ 1 & 72 & 1 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \gamma_0 \\ \gamma_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.$$
Here we have changed the first two columns to make them agree with the SLR of Example 11.2.1.
However, notice that if we subtract the third column from the first column we get the first column
of the previous version. Similarly, if we subtract the fourth column from the second column we get
the second column of the previous version. This model has intercept and slope parameters β0 and
β1 for the first partition and intercept and slope parameters (β0 + γ0 ) and (β1 + γ1 ) for the second
partition.
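The equivalence of the two parameterizations can be checked numerically. In the following numpy sketch (numpy is my choice, not the book's), both design matrices are fit to the reordered weights and give identical fitted values, because their columns span the same space.

import numpy as np

# Weights reordered as in the text: y7, y8, y9, y1, ..., y6, y10, y11, y12.
x = np.array([63, 63, 63, 65, 65, 65, 65, 66, 66, 72, 72, 72], dtype=float)
y = np.array([110, 135, 120, 120, 140, 130, 135, 150, 135, 170, 185, 160], dtype=float)
low = (x < 65.5).astype(float)
high = 1.0 - low

# Separate-lines parameterization: (b0, b1) for small x, (b2, b3) for large x.
X1 = np.column_stack([low, low * x, high, high * x])
# Offset parameterization: (b0, b1) plus changes (g0, g1) on the large-x set.
X2 = np.column_stack([np.ones_like(x), x, high, high * x])

b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
b2 = np.linalg.lstsq(X2, y, rcond=None)[0]
print(np.allclose(X1 @ b1, X2 @ b2))   # True: identical fitted values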
Because of the particular structure of these data with 12 observations but only four distinct
values of x, except for the Haar wavelet model, all of these models are equivalent to one another and
all of them are equivalent to a model with the matrix formulation
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix},$$
in which the four columns of $X$ are the indicators of the four distinct height values 65, 66, 63, and 72.
The models are equivalent in that they all give the same fitted values, residuals, and degrees of
freedom for error. We will see in the next chapter that this last matrix model has the form of a
one-way analysis of variance model. 2
Other models to be discussed later such as analysis of variance and analysis of covariance mod-
els can also be written as general linear models.
11.3 Least squares estimation of regression parameters

The least squares estimates of the regression parameters are the values of $\beta_0, \beta_1, \ldots, \beta_{p-1}$ that minimize the sum of squared errors. For simple linear regression this sum is
$$\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2; \tag{11.3.1}$$
in matrix terms, for the general model the least squares estimate minimizes
$$(Y - X\beta)'(Y - X\beta). \tag{11.3.2}$$
The form in (11.3.2) is just the sum of the squares of the elements in the vector (Y − X β ). See also
Exercise 11.7.1.
We now give the general form for the least squares estimate of β in regression problems.
Proposition 11.3.1. If $r(X) = p$, then $\hat\beta = (X'X)^{-1}X'Y$ is the least squares estimate of $\beta$.
Proof. Write
$$(Y - X\beta)'(Y - X\beta) = \left[(Y - X\hat\beta) + (X\hat\beta - X\beta)\right]'\left[(Y - X\hat\beta) + (X\hat\beta - X\beta)\right]$$
$$= \left(Y - X\hat\beta\right)'\left(Y - X\hat\beta\right) + \left(Y - X\hat\beta\right)'\left(X\hat\beta - X\beta\right) + \left(X\hat\beta - X\beta\right)'\left(Y - X\hat\beta\right) + \left(X\hat\beta - X\beta\right)'\left(X\hat\beta - X\beta\right). \tag{11.3.3}$$
Consider one of the two cross-product terms. Since $\hat\beta = (X'X)^{-1}X'Y$,
$$\left(X\hat\beta - X\beta\right)'\left(Y - X\hat\beta\right) = \left(\hat\beta - \beta\right)'X'\left[I - X(X'X)^{-1}X'\right]Y,$$
but
$$X'\left[I - X(X'X)^{-1}X'\right] = X' - X'X(X'X)^{-1}X' = X' - X' = 0.$$
Thus
$$\left(X\hat\beta - X\beta\right)'\left(Y - X\hat\beta\right) = 0$$
and similarly
$$\left(Y - X\hat\beta\right)'\left(X\hat\beta - X\beta\right) = 0.$$
Eliminating the two middle terms in (11.3.3) gives
$$(Y - X\beta)'(Y - X\beta) = \left(Y - X\hat\beta\right)'\left(Y - X\hat\beta\right) + \left(X\hat\beta - X\beta\right)'\left(X\hat\beta - X\beta\right).$$
This form is easily minimized. The first term on the right-hand side does not depend on $\beta$, so the $\beta$ that minimizes $(Y - X\beta)'(Y - X\beta)$ is the $\beta$ that minimizes the second term, $\left(X\hat\beta - X\beta\right)'\left(X\hat\beta - X\beta\right)$. The second term is non-negative because it is the sum of squares of the elements in the vector $X\hat\beta - X\beta$, and it is minimized by making it zero. This is accomplished by choosing $\beta = \hat\beta$. 2
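As a sanity check on Proposition 11.3.1, here is a short numpy sketch (the tool is my choice, not the book's) that computes $\hat\beta = (X'X)^{-1}X'Y$ for the height-weight data and compares it with a standard least squares routine; in practice one would let the routine do the work rather than invert $X'X$ explicitly.

import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Proposition 11.3.1: beta-hat = (X'X)^{-1} X'Y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)

# Same answer from a numerically stabler routine.
print(np.linalg.lstsq(X, y, rcond=None)[0])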
E XAMPLE 11.3.2. For the simple linear regression model (11.2.1), the model matrix $X$ has a column of 1s and a column of the $x_i$s, so
$$X'X = \begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix}, \qquad X'Y = \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{bmatrix},$$
and, using $n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2 = n\sum_{i=1}^n (x_i - \bar x_\cdot)^2$,
$$(X'X)^{-1} = \frac{1}{n\sum_{i=1}^n (x_i - \bar x_\cdot)^2}\begin{bmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{bmatrix}.$$
Multiplying out $\hat\beta = (X'X)^{-1}X'Y$ gives
$$\hat\beta = \frac{1}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2}\begin{bmatrix} \bar y\sum_{i=1}^n x_i^2 - \bar x_\cdot\sum_{i=1}^n x_i y_i \\ \left(\sum_{i=1}^n x_i y_i\right) - n\bar x_\cdot\bar y \end{bmatrix}
= \frac{1}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2}\begin{bmatrix} \bar y\left(\sum_{i=1}^n x_i^2 - n\bar x_\cdot^2\right) - \bar x_\cdot\left(\sum_{i=1}^n x_i y_i - n\bar x_\cdot\bar y\right) \\ \hat\beta_1\sum_{i=1}^n (x_i - \bar x_\cdot)^2 \end{bmatrix}
= \begin{bmatrix} \bar y - \hat\beta_1\bar x_\cdot \\ \hat\beta_1 \end{bmatrix} = \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix}.$$
Alternatively, we can obtain the estimates from the equivalent centered model $y_i = \beta_{*0} + \beta_1(x_i - \bar x_\cdot) + \varepsilon_i$, i.e.,
$$Y = Z\beta_* + e,$$
where
$$Z = \begin{bmatrix} 1 & (x_1 - \bar x_\cdot) \\ 1 & (x_2 - \bar x_\cdot) \\ \vdots & \vdots \\ 1 & (x_n - \bar x_\cdot) \end{bmatrix} \qquad\text{and}\qquad \beta_* = \begin{bmatrix} \beta_{*0} \\ \beta_1 \end{bmatrix}.$$
We need to compute $\hat\beta_* = (Z'Z)^{-1}Z'Y$. Observe that
$$Z'Z = \begin{bmatrix} n & 0 \\ 0 & \sum_{i=1}^n (x_i - \bar x_\cdot)^2 \end{bmatrix}, \qquad (Z'Z)^{-1} = \begin{bmatrix} 1/n & 0 \\ 0 & 1\Big/\sum_{i=1}^n (x_i - \bar x_\cdot)^2 \end{bmatrix},$$
$$Z'Y = \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n (x_i - \bar x_\cdot)\,y_i \end{bmatrix},$$
and
$$\hat\beta_* = (Z'Z)^{-1}Z'Y = \begin{bmatrix} \bar y \\ \sum_{i=1}^n (x_i - \bar x_\cdot)\,y_i \Big/ \sum_{i=1}^n (x_i - \bar x_\cdot)^2 \end{bmatrix} = \begin{bmatrix} \hat\beta_{*0} \\ \hat\beta_1 \end{bmatrix}.$$
These are the usual estimates. 2
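The following numpy sketch (an illustration of mine, not the book's code) verifies that the centered-model formulas reproduce the usual estimates for the height-weight data: $\hat\beta_1 = \sum(x_i - \bar x_\cdot)y_i / \sum(x_i - \bar x_\cdot)^2$, $\hat\beta_{*0} = \bar y$, and $\hat\beta_0 = \bar y - \hat\beta_1\bar x_\cdot$.

import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160], dtype=float)

xc = x - x.mean()
beta1_hat = (xc @ y) / (xc @ xc)                  # slope from the centered (Z) formulation
beta_star0_hat = y.mean()                         # intercept of the centered model is ybar
beta0_hat = beta_star0_hat - beta1_hat * x.mean() # usual intercept

X = np.column_stack([np.ones_like(x), x])
print(beta_star0_hat, beta1_hat)
print(np.allclose(np.linalg.lstsq(X, y, rcond=None)[0], [beta0_hat, beta1_hat]))  # True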
Recall that least squares estimates have a number of other properties. If the errors are indepen-
dent with mean zero, constant variance, and are normally distributed, the least squares estimates are
maximum likelihood estimates and minimum variance unbiased estimates. If the errors are merely
uncorrelated with mean zero and constant variance, the least squares estimates are best (minimum
variance) linear unbiased estimates.
In multiple regression, simple algebraic expressions for the parameter estimates are not possible.
The only nice equations for the estimates are the matrix equations.
We now find expected values and covariance matrices for the data Y and the least squares esti-
mate β̂ . Two simple rules about expectations and covariance matrices can take one a long way in
the theory of regression. These are matrix analogues of Proposition 1.2.11. In fact, to prove these
matrix results, one really only needs Proposition 1.2.11, cf. Exercise 11.7.3.
Proposition 11.3.3. Let $A$ be a fixed $r \times n$ matrix, let $c$ be a fixed $r \times 1$ vector, and let $Y$ be an $n \times 1$ random vector. Then
1. $E(AY + c) = A\,E(Y) + c$,
2. $\operatorname{Cov}(AY + c) = A\operatorname{Cov}(Y)A'$.
Applying these results allows us to find the expected value and covariance matrix for Y in a linear
model. The linear model has Y = X β + e where X β is a fixed vector (even though β is unknown),
$E(e) = 0$, and $\operatorname{Cov}(e) = \sigma^2 I$. Applying the proposition gives
$$E(Y) = E(X\beta + e) = X\beta + E(e) = X\beta$$
and
$$\operatorname{Cov}(Y) = \operatorname{Cov}(X\beta + e) = \operatorname{Cov}(e) = \sigma^2 I.$$
We can also find the expected value and covariance matrix of the least squares estimate β̂ . In
particular, we show that β̂ is an unbiased estimate of β by showing
$$E\left(\hat\beta\right) = E\left[(X'X)^{-1}X'Y\right] = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'X\beta = \beta.$$
To find variances and standard errors we need $\operatorname{Cov}\left(\hat\beta\right)$. To obtain this matrix, we use the rules in
Proposition A.7.1. In particular, recall that the inverse of a symmetric matrix is symmetric and that
X ′ X is symmetric.
$$\operatorname{Cov}\left(\hat\beta\right) = \operatorname{Cov}\left[(X'X)^{-1}X'Y\right]
= \left[(X'X)^{-1}X'\right]\operatorname{Cov}(Y)\left[(X'X)^{-1}X'\right]'$$
$$= (X'X)^{-1}X'\operatorname{Cov}(Y)\,X(X'X)^{-1}
= \sigma^2(X'X)^{-1}X'X(X'X)^{-1}
= \sigma^2(X'X)^{-1}.$$
E XAMPLE 11.3.2 C ONTINUED . For simple linear regression the covariance matrix becomes
$$\operatorname{Cov}\left(\hat\beta\right) = \sigma^2(X'X)^{-1}
= \sigma^2\,\frac{1}{n\sum_{i=1}^n (x_i - \bar x_\cdot)^2}\begin{bmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{bmatrix}$$
$$= \sigma^2\,\frac{1}{n\sum_{i=1}^n (x_i - \bar x_\cdot)^2}\begin{bmatrix} \sum_{i=1}^n (x_i - \bar x_\cdot)^2 + n\bar x_\cdot^2 & -n\bar x_\cdot \\ -n\bar x_\cdot & n \end{bmatrix}$$
$$= \sigma^2\begin{bmatrix} \dfrac{1}{n} + \dfrac{\bar x_\cdot^2}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2} & \dfrac{-\bar x_\cdot}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2} \\[2ex] \dfrac{-\bar x_\cdot}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2} & \dfrac{1}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2} \end{bmatrix},$$
which agrees with results given earlier for simple linear regression.
Note that $Y'Y = \sum_{i=1}^n y_i^2$, $C = n\bar y^2 = \left(\sum_{i=1}^n y_i\right)^2\!/n$, and $\hat\beta'X'X\hat\beta = \hat\beta'X'Y$. The difference between the two tables is that the first includes a line for the intercept or grand mean, while in the second the total has been corrected for the grand mean.
The coefficient of determination can be computed as
$$R^2 = \frac{SSReg}{Y'Y - C}.$$
This is the ratio of the variability explained by the predictor variables to the total variability of the data. Note that $(Y'Y - C)/(n-1) = s_y^2$, the sample variance of the $y$s without adjusting for any structure except the existence of a possibly nonzero mean.
The same result can be obtained from β̂ ′ X ′ X β̂ −C but the algebra is more tedious. 2
To obtain tests and confidence regions we need to make additional distributional assumptions.
In particular, we assume that the yi s have independent normal distributions. Equivalently, we take
ε1 , . . . , εn indep. N(0, σ 2 ).
This shows that $\hat\beta_k$ is an unbiased estimate of $\beta_k$. Before obtaining the standard error of $\hat\beta_k$, it is necessary to identify its variance. The covariance matrix of $\hat\beta$ is $\sigma^2(X'X)^{-1}$, so the variance of $\hat\beta_k$ is the $(k+1)$st diagonal element of $\sigma^2(X'X)^{-1}$. The $(k+1)$st diagonal element is appropriate because the first diagonal element is the variance of $\hat\beta_0$, not $\hat\beta_1$. If we let $a_k$ be the $(k+1)$st diagonal element of $(X'X)^{-1}$ and estimate $\sigma^2$ with $MSE$, we get a standard error for $\hat\beta_k$ of
$$SE\left(\hat\beta_k\right) = \sqrt{MSE}\,\sqrt{a_k}.$$
Under the normality assumption, the appropriate reference distribution is
$$\frac{\hat\beta_k - \beta_k}{SE(\hat\beta_k)} \sim t(n-p).$$
Standard techniques now provide tests and confidence intervals. For example, a 95% confidence
interval for βk has endpoints
β̂k ± t(.975, n − p) SE(β̂k )
where t(.975, n − p) is the 97.5th percentile of a t distribution with n − p degrees of freedom.
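Putting the pieces together for the height-weight data, a rough numpy/scipy sketch (these libraries are my choice of tooling) computes $MSE$, the $a_k$s, the standard errors, and the 95% confidence intervals exactly as described above.

import numpy as np
from scipy import stats

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160], dtype=float)
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
mse = resid @ resid / (n - p)                 # estimate of sigma^2
se = np.sqrt(mse * np.diag(XtX_inv))          # SE(beta_k) = sqrt(MSE * a_k)

tcrit = stats.t.ppf(0.975, n - p)             # t(.975, n - p)
lower, upper = beta_hat - tcrit * se, beta_hat + tcrit * se
print(np.column_stack([beta_hat, se, lower, upper]))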
A $(1-\alpha)100\%$ simultaneous confidence region for $\beta_0, \beta_1, \ldots, \beta_{p-1}$ consists of all the $\beta$ vectors that satisfy
$$\frac{\left(\hat\beta - \beta\right)'X'X\left(\hat\beta - \beta\right)\big/p}{MSE} \le F(1-\alpha, p, n-p).$$
This region also determines joint $(1-\alpha)100\%$ confidence intervals for the individual $\beta_k$s with limits
$$\hat\beta_k \pm \sqrt{p\,F(1-\alpha, p, n-p)}\; SE(\hat\beta_k).$$
These intervals are an application of Scheffé’s method of multiple comparisons, cf., Section 13.3.
We can also use the Bonferroni method to obtain joint $(1-\alpha)100\%$ confidence intervals with limits
$$\hat\beta_k \pm t\left(1 - \frac{\alpha}{2p},\, n-p\right) SE(\hat\beta_k).$$
Finally, we consider estimation of the point on the surface that corresponds to a given set of
predictor variables and the prediction of a new observation with a given set of predictor variables.
Let the predictor variables be x1 , x2 , . . . , x p−1 . Combine these into the row vector
$$x' = (1, x_1, x_2, \ldots, x_{p-1}).$$
The point on the surface that we are trying to estimate is the parameter $x'\beta = \beta_0 + \sum_{j=1}^{p-1}\beta_j x_j$. The least squares estimate is $x'\hat\beta$, which can be thought of as a $1\times 1$ matrix. The variance of the estimate is
$$\operatorname{Var}\left(x'\hat\beta\right) = \operatorname{Cov}\left(x'\hat\beta\right) = x'\operatorname{Cov}\left(\hat\beta\right)x = \sigma^2\, x'(X'X)^{-1}x,$$
so the standard error is
$$SE\left(x'\hat\beta\right) = \sqrt{MSE}\,\sqrt{x'(X'X)^{-1}x} \equiv SE(\text{Surface}).$$
This is the standard error of the estimated regression surface. The appropriate reference distribution
is
$$\frac{x'\hat\beta - x'\beta}{SE\left(x'\hat\beta\right)} \sim t(n-p).$$
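A short numpy sketch of the surface estimate and its standard error, using the height-weight data and a hypothetical new height of 70 inches (the value 70 is mine, purely for illustration):

import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160], dtype=float)
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
mse = np.sum((y - X @ beta_hat) ** 2) / (n - p)

x_new = np.array([1.0, 70.0])                       # x' = (1, x1, ..., x_{p-1})
fit = x_new @ beta_hat                              # estimated point on the surface
se_surface = np.sqrt(mse * x_new @ XtX_inv @ x_new) # SE(Surface)
print(fit, se_surface)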
$$\hat e = Y - \hat Y = Y - X\hat\beta = Y - X(X'X)^{-1}X'Y = \left[I - X(X'X)^{-1}X'\right]Y = (I - M)Y,$$
where
$$M \equiv X(X'X)^{-1}X'.$$
M is called the perpendicular projection operator (matrix) onto C(X), the column space of X. M
is the key item in the analysis of the general linear model, cf. Christensen (2011). Note that M is
symmetric, i.e., M = M ′ , and idempotent, i.e., MM = M, so it is a perpendicular projection operator
as discussed in Appendix A. Using these facts, observe that
$$SSE = \sum_{i=1}^n \hat\varepsilon_i^2 = \hat e'\hat e = [(I-M)Y]'[(I-M)Y] = Y'(I - M' - M + M'M)Y = Y'(I-M)Y.$$
The last equality follows from $M = M'$ and $MM = M$. Using Proposition 11.3.3 again, the covariance matrix of the residual vector is $\operatorname{Cov}(\hat e) = (I-M)\operatorname{Cov}(Y)(I-M)' = \sigma^2(I-M)$. Typically, this covariance matrix is not diagonal, so the residuals are not uncorrelated.
The variance of a particular residual ε̂i is σ 2 times the ith diagonal element of (I − M). The ith
diagonal element of (I − M) is the ith diagonal element of I, 1, minus the ith diagonal element of
$M$, say, $m_{ii}$. Thus
$$\operatorname{Var}(\hat\varepsilon_i) = \sigma^2(1 - m_{ii}),$$
and the standard error of $\hat\varepsilon_i$ is
$$SE(\hat\varepsilon_i) = \sqrt{MSE\,(1 - m_{ii})}.$$
The $i$th standardized residual is defined as
$$r_i \equiv \frac{\hat\varepsilon_i}{\sqrt{MSE\,(1 - m_{ii})}}.$$
The leverage of the ith case is defined to be mii , the ith diagonal element of M. Some people
like to think of M as the ‘hat’ matrix because it transforms Y into Ŷ , i.e., Ŷ = X β̂ = MY . More
common than the name ‘hat matrix’ is the consequent use of the notation hi for the ith leverage.
This notation was used in Chapter 7 but the reader should realize that hi ≡ mii . In any case, the
leverage can be interpreted as a measure of how unusual xi′ is relative to the other rows of the X
matrix, cf. Christensen (2011, section 13.1).
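The quantities of this section are easy to compute directly. The numpy sketch below (my tooling choice) forms $M$, checks that it is symmetric and idempotent, and produces the leverages $m_{ii}$ and standardized residuals $r_i$ for the height-weight data.

import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160], dtype=float)
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

M = X @ np.linalg.inv(X.T @ X) @ X.T         # perpendicular projection ('hat') matrix
print(np.allclose(M, M.T), np.allclose(M @ M, M))   # symmetric and idempotent

e_hat = (np.eye(n) - M) @ y                  # residuals (I - M)Y
mse = e_hat @ e_hat / (n - p)                # SSE / dfE = Y'(I - M)Y / (n - p)
m_ii = np.diag(M)                            # leverages h_i = m_ii
r = e_hat / np.sqrt(mse * (1.0 - m_ii))      # standardized residuals
print(np.round(m_ii, 3))
print(np.round(r, 2))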
11.6 Principal components regression

Near redundancy among the predictor variables corresponds to the existence of a vector $d$ that is not too small, say with $d'd = 1$, having $Xd$ close to $0$. Principal components (PC) regression is
a method designed to identify near redundancies among the predictor variables. Having identified
near redundancies, they can be eliminated if we so choose. In Section 10.7 we mentioned that having
small collinearity requires more than having small correlations among all the predictor variables; it requires all partial correlations among the predictor variables to be small as well. For this reason,
eliminating near redundancies cannot always be accomplished by simply dropping well chosen
predictor variables from the model.
The basic idea of principal components is to find new variables that are linear combinations
of the x j s and that are best able to (linearly) predict the entire set of x j s, see Christensen (2001,
Chapter 3). Thus the first principal component variable is the one linear combination of the x j s
that is best able to predict all of the x j s. The second principal component variable is the linear
combination of the x j s that is best able to predict all the x j s among those linear combinations having
a sample correlation of 0 with the first principal component variable. The third principal component
variable is the best predictor that has sample correlations of 0 with the first two principal component
variables. The remaining principal components are defined similarly. With p − 1 predictor variables,
there are p − 1 principal component variables. The full collection of principal component variables
always predicts the full collection of x j s perfectly. The last few principal component variables are
least able to predict the original x j variables, so they are the least useful. They are also the aspects
of the predictor variables that are most redundant, see Christensen (2011, Section 14.5). The best
(linear) predictors used in defining principal components can be based on either the covariances
between the x j s or the correlations between the x j s. Unless the x j s are measured on the same scale
(with similarly sized measurements), it is generally best to use principal components defined using
the correlations.
For The Coleman Report data, a matrix of sample correlations between the x j s was given in
Example 9.7.1. Principal components are derived from the eigenvalues and eigenvectors of this
matrix, cf., Section A.8. (Alternatively, one could use eigenvalues and eigenvectors of the matrix
of sample covariances.) An eigenvector corresponding to the largest eigenvalue determines the first
principal component variable.
The eigenvalues are given in Table 11.2 along with proportions and cumulative proportions. The
proportions in Table 11.2 are simply the eigenvalues divided by the sum of the eigenvalues. The
cumulative proportions are the sum of the first group of eigenvalues divided by the sum of all the
eigenvalues. In this example, the sum of the eigenvalues is 5.
The sum of the eigenvalues must equal the sum of the diagonal elements of the original matrix.
The sum of the diagonal elements of a correlation matrix is the number of variables in the matrix.
The third eigenvalue in Table 11.2 is .4966. The proportion is .4966/5 = .099. The cumulative
proportion is (2.8368 + 1.3951 + .4966)/5 = .946. With an eigenvalue proportion of 9.9%, the third
principal component variable accounts for 9.9% of the variance associated with predicting the x j s.
Taken together, the first three principal components account for 94.6% of the variance associated
with predicting the x j s because the third cumulative eigenvalue proportion is .946.
For the school data, the principal component (PC) variables are determined by the coefficients
in Table 11.3. The first principal component variable is
$$PC1_i = -0.229\,\frac{(x_{i1} - \bar x_{\cdot 1})}{s_1} - 0.555\,\frac{(x_{i2} - \bar x_{\cdot 2})}{s_2} - 0.545\,\frac{(x_{i3} - \bar x_{\cdot 3})}{s_3} - 0.170\,\frac{(x_{i4} - \bar x_{\cdot 4})}{s_4} - 0.559\,\frac{(x_{i5} - \bar x_{\cdot 5})}{s_5} \tag{11.6.1}$$
for $i = 1, \ldots, 20$, where $s_1$ is the sample standard deviation of the $x_{i1}$s, etc. The columns of coeffi-
cients given in Table 11.3 are actually eigenvectors for the correlation matrix of the x j s. The PC1
coefficients are an eigenvector corresponding to the largest eigenvalue, the PC2 coefficients are an
eigenvector corresponding to the second largest eigenvalue, etc.
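Since the Coleman Report predictor values are not reproduced in this section, the following numpy sketch illustrates the mechanics on made-up data: eigenvalues and eigenvectors of the sample correlation matrix give the eigenvalue proportions and the PC coefficients, and the resulting PC variables are uncorrelated. Everything about the data here is hypothetical.

import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in for the predictor matrix (20 cases, 5 variables).
Z = rng.normal(size=(20, 5))
Z[:, 3] = 0.9 * Z[:, 0] + 0.1 * rng.normal(size=20)   # build in some near redundancy

R = np.corrcoef(Z, rowvar=False)              # 5 x 5 sample correlation matrix
eigval, eigvec = np.linalg.eigh(R)            # eigh returns ascending eigenvalues
order = np.argsort(eigval)[::-1]              # largest eigenvalue first
eigval, eigvec = eigval[order], eigvec[:, order]

print(eigval, eigval / eigval.sum())          # eigenvalues and proportions (sum is 5)

# PC variables: standardized predictors times the eigenvectors, as in (11.6.1).
Z_std = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)
PC = Z_std @ eigvec                           # column k holds the (k+1)st PC variable
print(np.round(np.corrcoef(PC, rowvar=False), 8))   # off-diagonals are 0: PCs uncorrelated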
We can now perform a regression on the new principal component variables. The table of coef-
ficients is given in Table 11.4. The analysis of variance is given in Table 11.5. The value of R2 is
.906. The analysis of variance table and R2 are identical to those for the original predictor variables
given in Section 9.1. The plot of standardized residuals versus predicted values from the principal
component regression is given in Figure 11.1. This is identical to the plot given in Figure 10.2 for
the original variables. All of the predicted values and all of the standardized residuals are identical.
Since Table 11.5 and Figure 11.1 are unchanged, any usefulness associated with principal
component regression must come from Table 11.4. The principal component variables display no
collinearity. Thus, contrary to the warnings given earlier about the effects of collinearity, we can
make final conclusions about the importance of variables directly from Table 11.4. We do not have
to worry about fitting one model after another or about which variables are included in which mod-
els. From examining Table 11.4, it is clear that the important variables are PC1, PC3, and PC4. We
can construct a reduced model with these three; the estimated regression surface is simply
$$\hat y = 35.0825 - 2.9419\,PC1 - 2.0457\,PC3 + 4.380\,PC4, \tag{11.6.2}$$
where we merely used the estimated regression coefficients from Table 11.4. Refitting the reduced
model is unnecessary because there is no collinearity.
Figure 11.1: Standardized residuals versus predicted values for principal component regression.
To get predictions for a new set of x j s, just compute the corresponding PC1, PC3, and PC4
variables using formulae similar to those in equation (11.6.1) and make the predictions using the
fitted model in equation (11.6.2). When using equations like (11.6.1) to obtain new values of the
principal component variables, continue to use the x̄· j s and s j s computed from only the original
observations.
As an alternative to this prediction procedure, we could use the definitions of the principal
component variables, e.g., equation (11.6.1), and substitute for PC1, PC3, and PC4 in equation
(11.6.2) to obtain estimated coefficients on the original x j variables.
$$\hat y = 35.0825 + [-2.9419,\ -2.0457,\ 4.380]\begin{bmatrix} PC1 \\ PC3 \\ PC4 \end{bmatrix}$$
$$= 35.0825 + [-2.9419,\ -2.0457,\ 4.380]
\begin{bmatrix} -0.229 & -0.555 & -0.545 & -0.170 & -0.559 \\ 0.723 & 0.051 & -0.106 & -0.680 & -0.037 \\ 0.018 & -0.334 & 0.823 & -0.110 & -0.445 \end{bmatrix}
\begin{bmatrix} (x_1 - \bar x_{\cdot 1})/s_1 \\ (x_2 - \bar x_{\cdot 2})/s_2 \\ (x_3 - \bar x_{\cdot 3})/s_3 \\ (x_4 - \bar x_{\cdot 4})/s_4 \\ (x_5 - \bar x_{\cdot 5})/s_5 \end{bmatrix}$$
$$= 35.0825 + [-0.72651,\ 0.06550,\ 5.42492,\ 1.40940,\ -0.22889]
\begin{bmatrix} (x_1 - 2.731)/0.454 \\ (x_2 - 40.91)/25.90 \\ (x_3 - 3.14)/9.63 \\ (x_4 - 25.069)/1.314 \\ (x_5 - 6.255)/0.654 \end{bmatrix}.$$
Obviously this can be simplified into a form $\hat y = \tilde\beta_0 + \tilde\beta_1 x_1 + \tilde\beta_2 x_2 + \tilde\beta_3 x_3 + \tilde\beta_4 x_4 + \tilde\beta_5 x_5$, which
in turn simplifies the process of making predictions and provides new estimated regression coeffi-
cients for the x j s that correspond to the fitted principal component model. These PC regression esti-
mates of the original β j s can be compared to the least squares estimates. Many computer programs
for performing PC regression report these estimates of the β j s and their corresponding standard
errors.
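The back-substitution above is just a matrix product. The following numpy sketch redoes it with the PC coefficients, eigenvector rows, means, and standard deviations quoted in the text, recovering the coefficients on the standardized $x_j$s and (under my relabeling of the constant as $\tilde\beta_0$) the intercept and slopes on the original scale.

import numpy as np

# Reduced-model PC coefficients and the PC1, PC3, PC4 coefficient rows quoted above.
gamma = np.array([-2.9419, -2.0457, 4.380])
V_sel = np.array([[-0.229, -0.555, -0.545, -0.170, -0.559],   # PC1 coefficients
                  [ 0.723,  0.051, -0.106, -0.680, -0.037],   # PC3 coefficients
                  [ 0.018, -0.334,  0.823, -0.110, -0.445]])  # PC4 coefficients
xbar = np.array([2.731, 40.91, 3.14, 25.069, 6.255])
s    = np.array([0.454, 25.90, 9.63, 1.314, 0.654])

coef_std = gamma @ V_sel                 # coefficients on the standardized x_j
beta_tilde = coef_std / s                # slopes on the original x_j
beta0_tilde = 35.0825 - beta_tilde @ xbar
print(np.round(coef_std, 5))             # matches [-0.72651, 0.06550, 5.42492, 1.40940, -0.22889]
print(beta0_tilde, beta_tilde)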
It was mentioned earlier that collinearity tends to increase the variance of regression coeffi-
cients. The fact that the later principal component variables are more nearly redundant is reflected
in Table 11.4 by the fact that the standard errors for their estimated regression coefficients increase
(excluding the intercept).
One rationale for using PC regression is that you just don’t believe in using nearly redundant
variables. The exact nature of such variables can be changed radically by small errors in the x j s. For
this reason, one might choose to ignore PC5 because of its small eigenvalue proportion, regardless
of any importance it may display in Table 11.4. If the t statistic for PC5 appeared to be significant,
it could be written off as a chance occurrence or, perhaps more to the point, as something that is un-
likely to be reproducible. If you don’t believe in redundant variables, i.e., if you don’t believe that they
are themselves reproducible, any predictive ability due to such variables will not be reproducible
either.
When considering PC5, the case is pretty clear. PC5 accounts for only about 1.5% of the vari-
ability involved in predicting the x j s. It is a very poorly defined aspect of the predictor variables
x j and, anyway, it is not a significant predictor of y. The case is less clear when considering PC4.
This variable has a significant effect for explaining y, but it accounts for only 4% of the variability
in predicting the x j s, so PC4 is reasonably redundant within the x j s. If this variable is measuring
some reproducible aspect of the original x j data, it should be included in the regression. If it is not
reproducible, it should not be included. From examining the PC4 coefficients in Table 11.3, we see
that PC4 is roughly the average of the percent white collar fathers x2 and the mothers’ education
x5 contrasted with the socio-economic variable x3. (Actually, this comparison is between the variables after they have been adjusted for their means and standard deviations as in equation (11.6.1).)
If PC4 strikes the investigator as a meaningful, reproducible variable, it should be included in the
regression.
In our discussion, we have used PC regression both to eliminate questionable aspects of the
predictor variables and as a method for selecting a reduced model. We dropped PC5 primarily
because it was poorly defined. We dropped PC2 solely because it was not a significant predictor.
Some people might argue against this second use of PC regression and choose to take a model based
on PC1, PC2, PC3, and possibly PC4.
On occasion, PC regression is based on the sample covariance matrix of the x j s rather than the
sample correlation matrix. Again, eigenvalues and eigenvectors are used, but in using relationships
like equation (11.6.1), the s j s are deleted. The eigenvalues and eigenvectors for the covariance ma-
trix typically differ from those for the correlation matrix. The relationship between estimated prin-
cipal component regression coefficients and original least squares regression coefficient estimates
is somewhat simpler when using the covariance matrix.
It should be noted that PC regression is just as sensitive to violations of the assumptions as reg-
ular multiple regression. Outliers and high leverage points can be very influential in determining
the results of the procedure. Tests and confidence intervals rely on the independence, homoscedas-
ticity, and normality assumptions. Recall that in the full principal components regression model,
the residuals and predicted values are identical to those from the regression on the original predic-
tor variables. Moreover, highly influential points in the original predictor variables typically have a
large influence on the coefficients in the principal component variables.
Minitab commands
Minitab commands for the principal components regression analysis are given below. The basic
command is ‘pca.’ The ‘scores’ subcommand places the principal component variables into columns
c12 through c16. The ‘coef’ subcommand places the eigenvectors into columns c22 through c26. If
one wishes to define principal components using the covariances rather than the correlations, simply
include a pca subcommand with the word ‘covariance.’
MTB > pca c2-c6;
SUBC> scores c12-c16;
SUBC> coef c22-c26.
MTB > regress c8 on 5 c12-c16 c17 c18
MTB > plot c17 c18
11.7 Exercises
E XERCISE 11.7.1. Show that the form (11.3.2) simplifies to the form (11.3.1) for simple linear
regression.
E XERCISE 11.7.3. Use Proposition 1.2.11 to show that E(AY + c) = A E(Y ) + c and Cov(AY +
c) = ACov(Y )A′ .
E XERCISE 11.7.5. Do a principal components regression for the Younger data from Exer-
cise 9.12.1.
E XERCISE 11.7.6. Do a principal components regression for the Prater data from Exer-
cise 9.12.3.
E XERCISE 11.7.7. Do a principal components regression for the Chapman data of Exer-
cise 9.12.4.
E XERCISE 11.7.8. Do a principal components regression for the pollution data of Exer-
cise 9.12.5.
E XERCISE 11.7.9. Do a principal components regression for the body fat data of Exer-
cise 9.12.6.