Chapter 11
Multiple Regression: Matrix Formulation
In this chapter we use matrices to write regression models. Properties of matrices are reviewed in
Appendix A. The economy of notation achieved through using matrices allows us to arrive at some
interesting new insights and to derive several of the important properties of regression analysis.
The expected value of the random vector is just the vector of expected values of the random variables. For the random variables write $E(y_i) = \mu_i$; then
$$E(Y) \equiv \begin{bmatrix} E(y_1) \\ E(y_2) \\ E(y_3) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix} \equiv \mu.$$
In other words, expectation of a random vector is performed elementwise. In fact, the expected
value of any random matrix (a matrix consisting of random variables) is the matrix made up of the
expected values of the elements in the random matrix. Thus if wi j , i = 1, 2, 3, j = 1, 2 is a collection
of random variables and we write
$$W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix},$$
then
$$E(W) \equiv \begin{bmatrix} E(w_{11}) & E(w_{12}) \\ E(w_{21}) & E(w_{22}) \\ E(w_{31}) & E(w_{32}) \end{bmatrix}.$$
We also need a concept for random vectors that is analogous to the variance of a random variable.
This is the covariance matrix, sometimes called the dispersion matrix, the variance matrix, or the
variance-covariance matrix. The covariance matrix is simply a matrix consisting of all the variances
and covariances associated with the vector Y . Write
$$\operatorname{Var}(y_i) = E[(y_i - \mu_i)^2] \equiv \sigma_{ii}$$
and
$$\operatorname{Cov}(y_i, y_j) = E[(y_i - \mu_i)(y_j - \mu_j)] \equiv \sigma_{ij}.$$
Two subscripts are used on $\sigma_{ii}$ to indicate that it is the variance of $y_i$ rather than writing $\operatorname{Var}(y_i) = \sigma_i^2$.
The covariance matrix of our 3 × 1 vector Y is
$$\operatorname{Cov}(Y) = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{bmatrix}.$$
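As a quick numerical illustration of these definitions (not from the text, whose own computing is done in Minitab), the following sketch simulates many draws of a 3 x 1 random vector and checks that the elementwise sample means and the sample covariance matrix behave the way $\mu$ and $\operatorname{Cov}(Y)$ should; numpy and the particular $\mu$ and covariance matrix are my own choices.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, 3.0])             # E(Y) = mu, elementwise
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 3.0]])
Sigma = A @ A.T                            # a valid covariance matrix

Y = rng.multivariate_normal(mu, Sigma, size=100_000)   # each row is a replicate of Y'

print(Y.mean(axis=0))            # approximates E(Y) = (mu_1, mu_2, mu_3)'
print(np.cov(Y, rowvar=False))   # approximates Cov(Y) = [sigma_ij]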
11.2 Matrix formulation of regression models

The simple linear regression (SLR) model is
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{11.2.1}$$
where $E(\varepsilon_i) = 0$, $\operatorname{Var}(\varepsilon_i) = \sigma^2$, and $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \ne j$. In matrix terms this can be written as
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.$$
Multiplying and adding the matrices on the right-hand side gives a vector whose $i$th element is $\beta_0 + \beta_1 x_i + \varepsilon_i$.
These two vectors are equal if and only if the corresponding elements are equal, which occurs if and
only if model (11.2.1) holds. The conditions on the εi s translate into matrix terms as
E(e) = 0
Cov(e) = σ 2 I
where I is the n × n identity matrix. By definition, the covariance matrix Cov(e) has the variances
of the εi s down the diagonal. The variance of each individual εi is σ 2 , so all the diagonal elements
of Cov(e) are σ 2 , just as in σ 2 I. The covariance matrix Cov(e) has the covariances of distinct εi s as
its off-diagonal elements. The covariances of distinct εi s are all 0, so all the off-diagonal elements
of Cov(e) are zero, just as in σ 2 I.
E XAMPLE 11.2.1. Height and weight data are given in Table 11.1 for 12 individuals. In matrix
terms, the SLR model for regressing weights (y) on heights (x) is
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 65 \\ 1 & 65 \\ 1 & 65 \\ 1 & 65 \\ 1 & 66 \\ 1 & 66 \\ 1 & 63 \\ 1 & 63 \\ 1 & 63 \\ 1 & 72 \\ 1 & 72 \\ 1 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.$$
The observed data for this example are
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 120 \\ 140 \\ 130 \\ 135 \\ 150 \\ 135 \\ 110 \\ 135 \\ 120 \\ 170 \\ 185 \\ 160 \end{bmatrix}.$$
We could equally well rearrange the order of the observations to write
$$\begin{bmatrix} y_7 \\ y_8 \\ y_9 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 63 \\ 1 & 63 \\ 1 & 63 \\ 1 & 65 \\ 1 & 65 \\ 1 & 65 \\ 1 & 65 \\ 1 & 66 \\ 1 & 66 \\ 1 & 72 \\ 1 & 72 \\ 1 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}$$
in which the xi values are ordered from smallest to largest. 2
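As an aside, here is a minimal numpy sketch (numpy is my choice of tool; the book's own computing is done in Minitab) that assembles $Y$ and $X$ for the height-weight data of Example 11.2.1 and shows that reordering the cases, as in the second display above, just permutes the rows of $Y$ and $X$ together.

import numpy as np

# Heights (x) and weights (y) for the 12 individuals of Example 11.2.1.
x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160], dtype=float)

# Model matrix X = [1, x].
X = np.column_stack([np.ones_like(x), x])
print(X.shape)   # (12, 2)

# Reordering rows of X and y together (smallest x first) leaves the model unchanged.
order = np.argsort(x, kind="stable")
X_sorted, y_sorted = X[order], y[order]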
The general linear model is
$$Y = X\beta + e, \qquad E(e) = 0, \qquad \operatorname{Cov}(e) = \sigma^2 I,$$
where $Y$ is an $n \times 1$ vector of observable random variables, $X$ is an $n \times p$ matrix of known constants, $\beta$ is a $p \times 1$ vector of unknown parameters, and $e$ is an $n \times 1$ vector of unobservable errors. In particular, the multiple regression model
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{11.2.2}$$
with
$$E(\varepsilon_i) = 0, \quad \operatorname{Var}(\varepsilon_i) = \sigma^2, \quad \operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0 \text{ for } i \ne j,$$
is a general linear model. In matrix terms this can be written as
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\ 1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{p-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix},$$
or, indicating the dimensions of the matrices,
$$Y_{n\times 1} = X_{n\times p}\,\beta_{p\times 1} + e_{n\times 1}.$$
Multiplying and adding the right-hand side gives
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_{p-1}x_{1,p-1} + \varepsilon_1 \\ \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_{p-1}x_{2,p-1} + \varepsilon_2 \\ \vdots \\ \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_{p-1}x_{n,p-1} + \varepsilon_n \end{bmatrix},$$
which holds if and only if (11.2.2) holds. The conditions on the εi s translate into
$$E(e) = 0, \qquad \operatorname{Cov}(e) = \sigma^2 I.$$
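For concreteness, a small helper along the following lines builds the $n \times p$ model matrix from the predictor columns; the function name and the tiny data set are hypothetical, and numpy is an assumed choice rather than anything used in the text.

import numpy as np

def model_matrix(predictors):
    """Stack a column of 1s with the predictor columns to get X (n x p)."""
    predictors = np.asarray(predictors, dtype=float)
    return np.column_stack([np.ones(predictors.shape[0]), predictors])

# Hypothetical predictors: n = 5 cases, p - 1 = 3 predictor variables.
Z = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 4.0, 2.5],
              [4.0, 3.0, 3.5],
              [5.0, 6.0, 4.5]])
X = model_matrix(Z)
print(X.shape)   # (5, 4), i.e., n x p with p = 4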
E XAMPLE 11.2.3. In Example 11.2.1 we illustrated the matrix form of a SLR using the data on
heights and weights. We now illustrate some of the models from Chapter 8 applied to these data.
The cubic model
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i \tag{11.2.3}$$
is
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 65 & 65^2 & 65^3 \\ 1 & 65 & 65^2 & 65^3 \\ 1 & 65 & 65^2 & 65^3 \\ 1 & 65 & 65^2 & 65^3 \\ 1 & 66 & 66^2 & 66^3 \\ 1 & 66 & 66^2 & 66^3 \\ 1 & 63 & 63^2 & 63^3 \\ 1 & 63 & 63^2 & 63^3 \\ 1 & 63 & 63^2 & 63^3 \\ 1 & 72 & 72^2 & 72^3 \\ 1 & 72 & 72^2 & 72^3 \\ 1 & 72 & 72^2 & 72^3 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.$$
Some of the numbers in $X$ are getting quite large, e.g., $65^3 = 274{,}625$. The model has better numerical properties if we compute $\bar x_\cdot = 66.41\overline{6}$ and replace model (11.2.3) with the equivalent model
$$y_i = \beta_{*0} + \beta_{*1}(x_i - \bar x_\cdot) + \beta_{*2}(x_i - \bar x_\cdot)^2 + \beta_{*3}(x_i - \bar x_\cdot)^3 + \varepsilon_i.$$
This third-degree polynomial is the largest polynomial that we can fit to these data: two points determine a line, three points determine a quadratic, and with only four distinct $x$ values in the data we cannot fit a polynomial of degree higher than three.
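A rough numerical illustration of the conditioning point, using numpy (my choice of tool): the condition number of the raw cubic model matrix is enormous, while the matrix built from $x - \bar x_\cdot$ is far better behaved.

import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)

# Raw cubic design matrix: columns 1, x, x^2, x^3.
X_raw = np.vander(x, N=4, increasing=True)

# Centered cubic design matrix: columns 1, (x - xbar), (x - xbar)^2, (x - xbar)^3.
X_cen = np.vander(x - x.mean(), N=4, increasing=True)

print(np.linalg.cond(X_raw))   # huge
print(np.linalg.cond(X_cen))   # orders of magnitude smaller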
Define $\tilde x = (x - 63)/9$, so that
$$(x_1, \ldots, x_{12}) = (65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72)$$
transforms to
$$(\tilde x_1, \ldots, \tilde x_{12}) = (2/9,\, 2/9,\, 2/9,\, 2/9,\, 1/3,\, 1/3,\, 0,\, 0,\, 0,\, 1,\, 1,\, 1).$$
In matrix terms, the Haar wavelet model of Chapter 8, which augments the simple linear regression with the indicator functions of $\tilde x < 0.5$ and of $\tilde x \ge 0.5$, becomes
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 65 & 1 & 0 \\ 1 & 65 & 1 & 0 \\ 1 & 65 & 1 & 0 \\ 1 & 65 & 1 & 0 \\ 1 & 66 & 1 & 0 \\ 1 & 66 & 1 & 0 \\ 1 & 63 & 1 & 0 \\ 1 & 63 & 1 & 0 \\ 1 & 63 & 1 & 0 \\ 1 & 72 & 0 & 1 \\ 1 & 72 & 0 & 1 \\ 1 & 72 & 0 & 1 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.$$
Notice that the last two columns of the X matrix add up to a column of 1s, like the first column.
This causes the rank of the 12 × 4 model matrix X to be only 3, so the model is not a regression
model. Dropping either of the last two columns (or the first column) does not change the model in
any meaningful way but makes the model a regression.
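A quick check of the rank statement, with numpy as an assumed tool: the 12 x 4 matrix with the two indicator columns has rank 3, and dropping the first column (or either of the last two) restores full column rank.

import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
ind_low  = (x < 67.5).astype(float)   # indicator of x-tilde < 0.5
ind_high = (x >= 67.5).astype(float)  # indicator of x-tilde >= 0.5

X = np.column_stack([np.ones_like(x), x, ind_low, ind_high])
print(np.linalg.matrix_rank(X))          # 3: the four columns are linearly dependent
print(np.linalg.matrix_rank(X[:, 1:]))   # 3: dropping the intercept column gives full column rank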
If we partition the SLR model into points below 65.5 and above 65.5, the matrix model becomes
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 0 & 0 & 1 & 66 \\ 0 & 0 & 1 & 66 \\ 1 & 63 & 0 & 0 \\ 1 & 63 & 0 & 0 \\ 1 & 63 & 0 & 0 \\ 0 & 0 & 1 & 72 \\ 0 & 0 & 1 & 72 \\ 0 & 0 & 1 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.$$
Alternatively, we could rewrite the model as
$$\begin{bmatrix} y_7 \\ y_8 \\ y_9 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 63 & 0 & 0 \\ 1 & 63 & 0 & 0 \\ 1 & 63 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 0 & 0 & 1 & 66 \\ 0 & 0 & 1 & 66 \\ 0 & 0 & 1 & 72 \\ 0 & 0 & 1 & 72 \\ 0 & 0 & 1 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.$$
This makes it a bit clearer that we are fitting a SLR to the points with small $x$ values and a separate SLR to the cases with large $x$ values. The pattern of 0s in the $X$ matrix ensures that the small $x$ values only involve the intercept and slope parameters $\beta_0$ and $\beta_1$ for the line on the first partition set and that the large $x$ values only involve the intercept and slope parameters $\beta_2$ and $\beta_3$ for the line on the second partition set.
Yet another way to write the two-lines model is
$$\begin{bmatrix} y_7 \\ y_8 \\ y_9 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 63 & 0 & 0 \\ 1 & 63 & 0 & 0 \\ 1 & 63 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 65 & 0 & 0 \\ 1 & 66 & 1 & 66 \\ 1 & 66 & 1 & 66 \\ 1 & 72 & 1 & 72 \\ 1 & 72 & 1 & 72 \\ 1 & 72 & 1 & 72 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \gamma_0 \\ \gamma_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.$$
Here we have changed the first two columns to make them agree with the SLR of Example 11.2.1.
However, notice that if we subtract the third column from the first column we get the first column
of the previous version. Similarly, if we subtract the fourth column from the second column we get
the second column of the previous version. This model has intercept and slope parameters β0 and
β1 for the first partition and intercept and slope parameters (β0 + γ0 ) and (β1 + γ1 ) for the second
partition.
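The equivalence of the two parameterizations can be checked numerically. In the following numpy sketch (numpy is my choice, not the book's), both design matrices are fit to the reordered weights and give identical fitted values, because their columns span the same space.

import numpy as np

# Weights reordered as in the text: y7, y8, y9, y1, ..., y6, y10, y11, y12.
x = np.array([63, 63, 63, 65, 65, 65, 65, 66, 66, 72, 72, 72], dtype=float)
y = np.array([110, 135, 120, 120, 140, 130, 135, 150, 135, 170, 185, 160], dtype=float)
low = (x < 65.5).astype(float)
high = 1.0 - low

# Separate-lines parameterization: (b0, b1) for small x, (b2, b3) for large x.
X1 = np.column_stack([low, low * x, high, high * x])
# Offset parameterization: (b0, b1) plus changes (g0, g1) on the large-x set.
X2 = np.column_stack([np.ones_like(x), x, high, high * x])

b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
b2 = np.linalg.lstsq(X2, y, rcond=None)[0]
print(np.allclose(X1 @ b1, X2 @ b2))   # True: identical fitted values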
Because of the particular structure of these data with 12 observations but only four distinct
values of x, except for the Haar wavelet model, all of these models are equivalent to one another and
all of them are equivalent to a model with the matrix formulation
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix},$$
in which the four columns of $X$ are the indicators of the four distinct height values 65, 66, 63, and 72.
The models are equivalent in that they all give the same fitted values, residuals, and degrees of
freedom for error. We will see in the next chapter that this last matrix model has the form of a
one-way analysis of variance model. 2
Other models to be discussed later such as analysis of variance and analysis of covariance mod-
els can also be written as general linear models.
11.3 Least squares estimation of regression parameters

The least squares estimates of the regression parameters are the values of $\beta_0, \beta_1, \ldots, \beta_{p-1}$ that minimize the sum of squared errors. For simple linear regression this sum is
$$\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2; \tag{11.3.1}$$
in matrix terms, for the general model the least squares estimate minimizes
$$(Y - X\beta)'(Y - X\beta). \tag{11.3.2}$$
The form in (11.3.2) is just the sum of the squares of the elements in the vector (Y − X β ). See also
Exercise 11.7.1.
We now give the general form for the least squares estimate of β in regression problems.
Proposition 11.3.1. If $r(X) = p$, then $\hat\beta = (X'X)^{-1}X'Y$ is the least squares estimate of $\beta$.
Proof. Write
$$(Y - X\beta)'(Y - X\beta) = \left[(Y - X\hat\beta) + (X\hat\beta - X\beta)\right]'\left[(Y - X\hat\beta) + (X\hat\beta - X\beta)\right]$$
$$= \left(Y - X\hat\beta\right)'\left(Y - X\hat\beta\right) + \left(Y - X\hat\beta\right)'\left(X\hat\beta - X\beta\right) + \left(X\hat\beta - X\beta\right)'\left(Y - X\hat\beta\right) + \left(X\hat\beta - X\beta\right)'\left(X\hat\beta - X\beta\right). \tag{11.3.3}$$
Consider one of the two cross-product terms. Since $\hat\beta = (X'X)^{-1}X'Y$,
$$\left(X\hat\beta - X\beta\right)'\left(Y - X\hat\beta\right) = \left(\hat\beta - \beta\right)'X'\left[I - X(X'X)^{-1}X'\right]Y,$$
but
$$X'\left[I - X(X'X)^{-1}X'\right] = X' - X'X(X'X)^{-1}X' = X' - X' = 0.$$
Thus
$$\left(X\hat\beta - X\beta\right)'\left(Y - X\hat\beta\right) = 0$$
and similarly
$$\left(Y - X\hat\beta\right)'\left(X\hat\beta - X\beta\right) = 0.$$
Eliminating the two middle terms in (11.3.3) gives
$$(Y - X\beta)'(Y - X\beta) = \left(Y - X\hat\beta\right)'\left(Y - X\hat\beta\right) + \left(X\hat\beta - X\beta\right)'\left(X\hat\beta - X\beta\right).$$
This form is easily minimized. The first term on the right-hand side does not depend on $\beta$, so the $\beta$ that minimizes $(Y - X\beta)'(Y - X\beta)$ is the $\beta$ that minimizes the second term, $\left(X\hat\beta - X\beta\right)'\left(X\hat\beta - X\beta\right)$. The second term is non-negative because it is the sum of squares of the elements in the vector $X\hat\beta - X\beta$, and it is minimized by making it zero. This is accomplished by choosing $\beta = \hat\beta$. 2
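As a sanity check on Proposition 11.3.1, here is a short numpy sketch (the tool is my choice, not the book's) that computes $\hat\beta = (X'X)^{-1}X'Y$ for the height-weight data and compares it with a standard least squares routine; in practice one would let the routine do the work rather than invert $X'X$ explicitly.

import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160], dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Proposition 11.3.1: beta-hat = (X'X)^{-1} X'Y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)

# Same answer from a numerically stabler routine.
print(np.linalg.lstsq(X, y, rcond=None)[0])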
E XAMPLE 11.3.2. For the simple linear regression model (11.2.1), the model matrix $X$ has a column of 1s and a column of the $x_i$s, so
$$X'X = \begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix}, \qquad X'Y = \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{bmatrix},$$
and, using $n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2 = n\sum_{i=1}^n (x_i - \bar x_\cdot)^2$,
$$(X'X)^{-1} = \frac{1}{n\sum_{i=1}^n (x_i - \bar x_\cdot)^2}\begin{bmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{bmatrix}.$$
Multiplying out $\hat\beta = (X'X)^{-1}X'Y$ gives
$$\hat\beta = \frac{1}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2}\begin{bmatrix} \bar y\sum_{i=1}^n x_i^2 - \bar x_\cdot\sum_{i=1}^n x_i y_i \\ \left(\sum_{i=1}^n x_i y_i\right) - n\bar x_\cdot\bar y \end{bmatrix}
= \frac{1}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2}\begin{bmatrix} \bar y\left(\sum_{i=1}^n x_i^2 - n\bar x_\cdot^2\right) - \bar x_\cdot\left(\sum_{i=1}^n x_i y_i - n\bar x_\cdot\bar y\right) \\ \hat\beta_1\sum_{i=1}^n (x_i - \bar x_\cdot)^2 \end{bmatrix}
= \begin{bmatrix} \bar y - \hat\beta_1\bar x_\cdot \\ \hat\beta_1 \end{bmatrix} = \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix}.$$
Alternatively, we can obtain the estimates from the equivalent centered model $y_i = \beta_{*0} + \beta_1(x_i - \bar x_\cdot) + \varepsilon_i$, i.e.,
$$Y = Z\beta_* + e,$$
where
$$Z = \begin{bmatrix} 1 & (x_1 - \bar x_\cdot) \\ 1 & (x_2 - \bar x_\cdot) \\ \vdots & \vdots \\ 1 & (x_n - \bar x_\cdot) \end{bmatrix} \qquad\text{and}\qquad \beta_* = \begin{bmatrix} \beta_{*0} \\ \beta_1 \end{bmatrix}.$$
We need to compute $\hat\beta_* = (Z'Z)^{-1}Z'Y$. Observe that
$$Z'Z = \begin{bmatrix} n & 0 \\ 0 & \sum_{i=1}^n (x_i - \bar x_\cdot)^2 \end{bmatrix}, \qquad (Z'Z)^{-1} = \begin{bmatrix} 1/n & 0 \\ 0 & 1\Big/\sum_{i=1}^n (x_i - \bar x_\cdot)^2 \end{bmatrix},$$
$$Z'Y = \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n (x_i - \bar x_\cdot)\,y_i \end{bmatrix},$$
and
$$\hat\beta_* = (Z'Z)^{-1}Z'Y = \begin{bmatrix} \bar y \\ \sum_{i=1}^n (x_i - \bar x_\cdot)\,y_i \Big/ \sum_{i=1}^n (x_i - \bar x_\cdot)^2 \end{bmatrix} = \begin{bmatrix} \hat\beta_{*0} \\ \hat\beta_1 \end{bmatrix}.$$
These are the usual estimates. 2
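The following numpy sketch (an illustration of mine, not the book's code) verifies that the centered-model formulas reproduce the usual estimates for the height-weight data: $\hat\beta_1 = \sum(x_i - \bar x_\cdot)y_i / \sum(x_i - \bar x_\cdot)^2$, $\hat\beta_{*0} = \bar y$, and $\hat\beta_0 = \bar y - \hat\beta_1\bar x_\cdot$.

import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160], dtype=float)

xc = x - x.mean()
beta1_hat = (xc @ y) / (xc @ xc)                  # slope from the centered (Z) formulation
beta_star0_hat = y.mean()                         # intercept of the centered model is ybar
beta0_hat = beta_star0_hat - beta1_hat * x.mean() # usual intercept

X = np.column_stack([np.ones_like(x), x])
print(beta_star0_hat, beta1_hat)
print(np.allclose(np.linalg.lstsq(X, y, rcond=None)[0], [beta0_hat, beta1_hat]))  # True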
Recall that least squares estimates have a number of other properties. If the errors are indepen-
dent with mean zero, constant variance, and are normally distributed, the least squares estimates are
maximum likelihood estimates and minimum variance unbiased estimates. If the errors are merely
uncorrelated with mean zero and constant variance, the least squares estimates are best (minimum
variance) linear unbiased estimates.
In multiple regression, simple algebraic expressions for the parameter estimates are not possible.
The only nice equations for the estimates are the matrix equations.
We now find expected values and covariance matrices for the data Y and the least squares esti-
mate β̂ . Two simple rules about expectations and covariance matrices can take one a long way in
the theory of regression. These are matrix analogues of Proposition 1.2.11. In fact, to prove these
matrix results, one really only needs Proposition 1.2.11, cf. Exercise 11.7.3.
Proposition 11.3.3. Let $A$ be a fixed $r \times n$ matrix, let $c$ be a fixed $r \times 1$ vector, and let $Y$ be an $n \times 1$ random vector. Then
1. $E(AY + c) = A\,E(Y) + c$,
2. $\operatorname{Cov}(AY + c) = A\operatorname{Cov}(Y)A'$.
Applying these results allows us to find the expected value and covariance matrix for Y in a linear
model. The linear model has Y = X β + e where X β is a fixed vector (even though β is unknown),
$E(e) = 0$, and $\operatorname{Cov}(e) = \sigma^2 I$. Applying the proposition gives
$$E(Y) = E(X\beta + e) = X\beta + E(e) = X\beta$$
and
$$\operatorname{Cov}(Y) = \operatorname{Cov}(X\beta + e) = \operatorname{Cov}(e) = \sigma^2 I.$$
We can also find the expected value and covariance matrix of the least squares estimate β̂ . In
particular, we show that β̂ is an unbiased estimate of β by showing
$$E\left(\hat\beta\right) = E\left[(X'X)^{-1}X'Y\right] = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'X\beta = \beta.$$
To find variances and standard errors we need $\operatorname{Cov}\left(\hat\beta\right)$. To obtain this matrix, we use the rules in
Proposition A.7.1. In particular, recall that the inverse of a symmetric matrix is symmetric and that
X ′ X is symmetric.
$$\operatorname{Cov}\left(\hat\beta\right) = \operatorname{Cov}\left[(X'X)^{-1}X'Y\right]
= \left[(X'X)^{-1}X'\right]\operatorname{Cov}(Y)\left[(X'X)^{-1}X'\right]'$$
$$= (X'X)^{-1}X'\operatorname{Cov}(Y)\,X(X'X)^{-1}
= \sigma^2(X'X)^{-1}X'X(X'X)^{-1}
= \sigma^2(X'X)^{-1}.$$
E XAMPLE 11.3.2 C ONTINUED . For simple linear regression the covariance matrix becomes
$$\operatorname{Cov}\left(\hat\beta\right) = \sigma^2(X'X)^{-1}
= \sigma^2\,\frac{1}{n\sum_{i=1}^n (x_i - \bar x_\cdot)^2}\begin{bmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{bmatrix}$$
$$= \sigma^2\,\frac{1}{n\sum_{i=1}^n (x_i - \bar x_\cdot)^2}\begin{bmatrix} \sum_{i=1}^n (x_i - \bar x_\cdot)^2 + n\bar x_\cdot^2 & -n\bar x_\cdot \\ -n\bar x_\cdot & n \end{bmatrix}$$
$$= \sigma^2\begin{bmatrix} \dfrac{1}{n} + \dfrac{\bar x_\cdot^2}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2} & \dfrac{-\bar x_\cdot}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2} \\[2ex] \dfrac{-\bar x_\cdot}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2} & \dfrac{1}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2} \end{bmatrix},$$
which agrees with results given earlier for simple linear regression.
Note that $Y'Y = \sum_{i=1}^n y_i^2$, $C = n\bar y^2 = \left(\sum_{i=1}^n y_i\right)^2\!/n$, and $\hat\beta'X'X\hat\beta = \hat\beta'X'Y$. The difference between the two tables is that the first includes a line for the intercept or grand mean, while in the second the total has been corrected for the grand mean.
The coefficient of determination can be computed as
$$R^2 = \frac{SSReg}{Y'Y - C}.$$
This is the ratio of the variability explained by the predictor variables to the total variability of the data. Note that $(Y'Y - C)/(n-1) = s_y^2$, the sample variance of the $y$s without adjusting for any structure except the existence of a possibly nonzero mean.
The same result can be obtained from β̂ ′ X ′ X β̂ −C but the algebra is more tedious. 2
To obtain tests and confidence regions we need to make additional distributional assumptions.
In particular, we assume that the yi s have independent normal distributions. Equivalently, we take
ε1 , . . . , εn indep. N(0, σ 2 ).
This shows that $\hat\beta_k$ is an unbiased estimate of $\beta_k$. Before obtaining the standard error of $\hat\beta_k$, it is necessary to identify its variance. The covariance matrix of $\hat\beta$ is $\sigma^2(X'X)^{-1}$, so the variance of $\hat\beta_k$ is the $(k+1)$st diagonal element of $\sigma^2(X'X)^{-1}$. The $(k+1)$st diagonal element is appropriate because the first diagonal element is the variance of $\hat\beta_0$, not $\hat\beta_1$. If we let $a_k$ be the $(k+1)$st diagonal element of $(X'X)^{-1}$ and estimate $\sigma^2$ with $MSE$, we get a standard error for $\hat\beta_k$ of
$$SE\left(\hat\beta_k\right) = \sqrt{MSE}\,\sqrt{a_k}.$$
Under the normality assumption, the appropriate reference distribution is
$$\frac{\hat\beta_k - \beta_k}{SE(\hat\beta_k)} \sim t(n-p).$$
Standard techniques now provide tests and confidence intervals. For example, a 95% confidence
interval for βk has endpoints
β̂k ± t(.975, n − p) SE(β̂k )
where t(.975, n − p) is the 97.5th percentile of a t distribution with n − p degrees of freedom.
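Putting the pieces together for the height-weight data, a rough numpy/scipy sketch (these libraries are my choice of tooling) computes $MSE$, the $a_k$s, the standard errors, and the 95% confidence intervals exactly as described above.

import numpy as np
from scipy import stats

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160], dtype=float)
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
mse = resid @ resid / (n - p)                 # estimate of sigma^2
se = np.sqrt(mse * np.diag(XtX_inv))          # SE(beta_k) = sqrt(MSE * a_k)

tcrit = stats.t.ppf(0.975, n - p)             # t(.975, n - p)
lower, upper = beta_hat - tcrit * se, beta_hat + tcrit * se
print(np.column_stack([beta_hat, se, lower, upper]))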
A $(1-\alpha)100\%$ simultaneous confidence region for $\beta_0, \beta_1, \ldots, \beta_{p-1}$ consists of all the $\beta$ vectors that satisfy
$$\frac{\left(\hat\beta - \beta\right)'X'X\left(\hat\beta - \beta\right)\big/p}{MSE} \le F(1-\alpha, p, n-p).$$
This region also determines joint $(1-\alpha)100\%$ confidence intervals for the individual $\beta_k$s with limits
$$\hat\beta_k \pm \sqrt{p\,F(1-\alpha, p, n-p)}\; SE(\hat\beta_k).$$
These intervals are an application of Scheffé’s method of multiple comparisons, cf., Section 13.3.
We can also use the Bonferroni method to obtain joint $(1-\alpha)100\%$ confidence intervals with limits
$$\hat\beta_k \pm t\left(1 - \frac{\alpha}{2p},\, n-p\right) SE(\hat\beta_k).$$
Finally, we consider estimation of the point on the surface that corresponds to a given set of
predictor variables and the prediction of a new observation with a given set of predictor variables.
Let the predictor variables be x1 , x2 , . . . , x p−1 . Combine these into the row vector
$$x' = (1, x_1, x_2, \ldots, x_{p-1}).$$
The point on the surface that we are trying to estimate is the parameter $x'\beta = \beta_0 + \sum_{j=1}^{p-1}\beta_j x_j$. The least squares estimate is $x'\hat\beta$, which can be thought of as a $1\times 1$ matrix. The variance of the estimate is
$$\operatorname{Var}\left(x'\hat\beta\right) = \operatorname{Cov}\left(x'\hat\beta\right) = x'\operatorname{Cov}\left(\hat\beta\right)x = \sigma^2\, x'(X'X)^{-1}x,$$
so the standard error is
$$SE\left(x'\hat\beta\right) = \sqrt{MSE}\,\sqrt{x'(X'X)^{-1}x} \equiv SE(\text{Surface}).$$
This is the standard error of the estimated regression surface. The appropriate reference distribution
is
$$\frac{x'\hat\beta - x'\beta}{SE\left(x'\hat\beta\right)} \sim t(n-p).$$
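A short numpy sketch of the surface estimate and its standard error, using the height-weight data and a hypothetical new height of 70 inches (the value 70 is mine, purely for illustration):

import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160], dtype=float)
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
mse = np.sum((y - X @ beta_hat) ** 2) / (n - p)

x_new = np.array([1.0, 70.0])                       # x' = (1, x1, ..., x_{p-1})
fit = x_new @ beta_hat                              # estimated point on the surface
se_surface = np.sqrt(mse * x_new @ XtX_inv @ x_new) # SE(Surface)
print(fit, se_surface)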
$$\hat e = Y - \hat Y = Y - X\hat\beta = Y - X(X'X)^{-1}X'Y = \left[I - X(X'X)^{-1}X'\right]Y = (I - M)Y,$$
where
$$M \equiv X(X'X)^{-1}X'.$$
M is called the perpendicular projection operator (matrix) onto C(X), the column space of X. M
is the key item in the analysis of the general linear model, cf. Christensen (2011). Note that M is
symmetric, i.e., M = M ′ , and idempotent, i.e., MM = M, so it is a perpendicular projection operator
as discussed in Appendix A. Using these facts, observe that
$$SSE = \sum_{i=1}^n \hat\varepsilon_i^2 = \hat e'\hat e = [(I-M)Y]'[(I-M)Y] = Y'(I - M' - M + M'M)Y = Y'(I-M)Y.$$
The last equality follows from $M = M'$ and $MM = M$. Using Proposition 11.3.3 again, the covariance matrix of the residual vector is $\operatorname{Cov}(\hat e) = (I-M)\operatorname{Cov}(Y)(I-M)' = \sigma^2(I-M)$. Typically, this covariance matrix is not diagonal, so the residuals are not uncorrelated.
The variance of a particular residual ε̂i is σ 2 times the ith diagonal element of (I − M). The ith
diagonal element of (I − M) is the ith diagonal element of I, 1, minus the ith diagonal element of
$M$, say, $m_{ii}$. Thus
$$\operatorname{Var}(\hat\varepsilon_i) = \sigma^2(1 - m_{ii}),$$
and the standard error of $\hat\varepsilon_i$ is
$$SE(\hat\varepsilon_i) = \sqrt{MSE\,(1 - m_{ii})}.$$
The $i$th standardized residual is defined as
$$r_i \equiv \frac{\hat\varepsilon_i}{\sqrt{MSE\,(1 - m_{ii})}}.$$
The leverage of the ith case is defined to be mii , the ith diagonal element of M. Some people
like to think of M as the ‘hat’ matrix because it transforms Y into Ŷ , i.e., Ŷ = X β̂ = MY . More
common than the name ‘hat matrix’ is the consequent use of the notation hi for the ith leverage.
This notation was used in Chapter 7 but the reader should realize that hi ≡ mii . In any case, the
leverage can be interpreted as a measure of how unusual xi′ is relative to the other rows of the X
matrix, cf. Christensen (2011, section 13.1).
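The quantities of this section are easy to compute directly. The numpy sketch below (my tooling choice) forms $M$, checks that it is symmetric and idempotent, and produces the leverages $m_{ii}$ and standardized residuals $r_i$ for the height-weight data.

import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160], dtype=float)
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

M = X @ np.linalg.inv(X.T @ X) @ X.T         # perpendicular projection ('hat') matrix
print(np.allclose(M, M.T), np.allclose(M @ M, M))   # symmetric and idempotent

e_hat = (np.eye(n) - M) @ y                  # residuals (I - M)Y
mse = e_hat @ e_hat / (n - p)                # SSE / dfE = Y'(I - M)Y / (n - p)
m_ii = np.diag(M)                            # leverages h_i = m_ii
r = e_hat / np.sqrt(mse * (1.0 - m_ii))      # standardized residuals
print(np.round(m_ii, 3))
print(np.round(r, 2))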
11.6 Principal components regression

Near redundancy among the predictor variables corresponds to the existence of a vector $d$ that is not too small, say with $d'd = 1$, having $Xd$ close to $0$. Principal components (PC) regression is
a method designed to identify near redundancies among the predictor variables. Having identified
near redundancies, they can be eliminated if we so choose. In Section 10.7 we mentioned that having
small collinearity requires more than having small correlations among all the predictor variables; it requires all partial correlations among the predictor variables to be small as well. For this reason,
eliminating near redundancies cannot always be accomplished by simply dropping well chosen
predictor variables from the model.
The basic idea of principal components is to find new variables that are linear combinations
of the x j s and that are best able to (linearly) predict the entire set of x j s, see Christensen (2001,
Chapter 3). Thus the first principal component variable is the one linear combination of the x j s
that is best able to predict all of the x j s. The second principal component variable is the linear
combination of the x j s that is best able to predict all the x j s among those linear combinations having
a sample correlation of 0 with the first principal component variable. The third principal component
variable is the best predictor that has sample correlations of 0 with the first two principal component
variables. The remaining principal components are defined similarly. With p − 1 predictor variables,
there are p − 1 principal component variables. The full collection of principal component variables
always predicts the full collection of x j s perfectly. The last few principal component variables are
least able to predict the original x j variables, so they are the least useful. They are also the aspects
of the predictor variables that are most redundant, see Christensen (2011, Section 14.5). The best
(linear) predictors used in defining principal components can be based on either the covariances
between the x j s or the correlations between the x j s. Unless the x j s are measured on the same scale
(with similarly sized measurements), it is generally best to use principal components defined using
the correlations.
For The Coleman Report data, a matrix of sample correlations between the x j s was given in
Example 9.7.1. Principal components are derived from the eigenvalues and eigenvectors of this
matrix, cf., Section A.8. (Alternatively, one could use eigenvalues and eigenvectors of the matrix
of sample covariances.) An eigenvector corresponding to the largest eigenvalue determines the first
principal component variable.
The eigenvalues are given in Table 11.2 along with proportions and cumulative proportions. The
proportions in Table 11.2 are simply the eigenvalues divided by the sum of the eigenvalues. The
cumulative proportions are the sum of the first group of eigenvalues divided by the sum of all the
eigenvalues. In this example, the sum of the eigenvalues is 5.
The sum of the eigenvalues must equal the sum of the diagonal elements of the original matrix.
The sum of the diagonal elements of a correlation matrix is the number of variables in the matrix.
The third eigenvalue in Table 11.2 is .4966. The proportion is .4966/5 = .099. The cumulative
proportion is (2.8368 + 1.3951 + .4966)/5 = .946. With an eigenvalue proportion of 9.9%, the third
principal component variable accounts for 9.9% of the variance associated with predicting the x j s.
Taken together, the first three principal components account for 94.6% of the variance associated
with predicting the x j s because the third cumulative eigenvalue proportion is .946.
For the school data, the principal component (PC) variables are determined by the coefficients
in Table 11.3. The first principal component variable is
$$PC1_i = -0.229\,\frac{(x_{i1} - \bar x_{\cdot 1})}{s_1} - 0.555\,\frac{(x_{i2} - \bar x_{\cdot 2})}{s_2} - 0.545\,\frac{(x_{i3} - \bar x_{\cdot 3})}{s_3} - 0.170\,\frac{(x_{i4} - \bar x_{\cdot 4})}{s_4} - 0.559\,\frac{(x_{i5} - \bar x_{\cdot 5})}{s_5} \tag{11.6.1}$$
for $i = 1, \ldots, 20$, where $s_1$ is the sample standard deviation of the $x_{i1}$s, etc. The columns of coeffi-
cients given in Table 11.3 are actually eigenvectors for the correlation matrix of the x j s. The PC1
coefficients are an eigenvector corresponding to the largest eigenvalue, the PC2 coefficients are an
eigenvector corresponding to the second largest eigenvalue, etc.
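Since the Coleman Report predictor values are not reproduced in this section, the following numpy sketch illustrates the mechanics on made-up data: eigenvalues and eigenvectors of the sample correlation matrix give the eigenvalue proportions and the PC coefficients, and the resulting PC variables are uncorrelated. Everything about the data here is hypothetical.

import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in for the predictor matrix (20 cases, 5 variables).
Z = rng.normal(size=(20, 5))
Z[:, 3] = 0.9 * Z[:, 0] + 0.1 * rng.normal(size=20)   # build in some near redundancy

R = np.corrcoef(Z, rowvar=False)              # 5 x 5 sample correlation matrix
eigval, eigvec = np.linalg.eigh(R)            # eigh returns ascending eigenvalues
order = np.argsort(eigval)[::-1]              # largest eigenvalue first
eigval, eigvec = eigval[order], eigvec[:, order]

print(eigval, eigval / eigval.sum())          # eigenvalues and proportions (sum is 5)

# PC variables: standardized predictors times the eigenvectors, as in (11.6.1).
Z_std = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)
PC = Z_std @ eigvec                           # column k holds the (k+1)st PC variable
print(np.round(np.corrcoef(PC, rowvar=False), 8))   # off-diagonals are 0: PCs uncorrelated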
We can now perform a regression on the new principal component variables. The table of coef-
ficients is given in Table 11.4. The analysis of variance is given in Table 11.5. The value of R2 is
.906. The analysis of variance table and R2 are identical to those for the original predictor variables
given in Section 9.1. The plot of standardized residuals versus predicted values from the principal
component regression is given in Figure 11.1. This is identical to the plot given in Figure 10.2 for
the original variables. All of the predicted values and all of the standardized residuals are identical.
Since Table 11.5 and Figure 11.1 are unchanged, any usefulness associated with principal
component regression must come from Table 11.4. The principal component variables display no
collinearity. Thus, contrary to the warnings given earlier about the effects of collinearity, we can
make final conclusions about the importance of variables directly from Table 11.4. We do not have
to worry about fitting one model after another or about which variables are included in which mod-
els. From examining Table 11.4, it is clear that the important variables are PC1, PC3, and PC4. We
can construct a reduced model with these three; the estimated regression surface is simply
$$\hat y = 35.0825 - 2.9419\,PC1 - 2.0457\,PC3 + 4.380\,PC4, \tag{11.6.2}$$
where we merely used the estimated regression coefficients from Table 11.4. Refitting the reduced
model is unnecessary because there is no collinearity.
Figure 11.1: Standardized residuals versus predicted values for principal component regression.
To get predictions for a new set of x j s, just compute the corresponding PC1, PC3, and PC4
variables using formulae similar to those in equation (11.6.1) and make the predictions using the
fitted model in equation (11.6.2). When using equations like (11.6.1) to obtain new values of the
principal component variables, continue to use the x̄· j s and s j s computed from only the original
observations.
As an alternative to this prediction procedure, we could use the definitions of the principal
component variables, e.g., equation (11.6.1), and substitute for PC1, PC3, and PC4 in equation
(11.6.2) to obtain estimated coefficients on the original x j variables.
$$\hat y = 35.0825 + [-2.9419,\ -2.0457,\ 4.380]\begin{bmatrix} PC1 \\ PC3 \\ PC4 \end{bmatrix}$$
$$= 35.0825 + [-2.9419,\ -2.0457,\ 4.380]
\begin{bmatrix} -0.229 & -0.555 & -0.545 & -0.170 & -0.559 \\ 0.723 & 0.051 & -0.106 & -0.680 & -0.037 \\ 0.018 & -0.334 & 0.823 & -0.110 & -0.445 \end{bmatrix}
\begin{bmatrix} (x_1 - \bar x_{\cdot 1})/s_1 \\ (x_2 - \bar x_{\cdot 2})/s_2 \\ (x_3 - \bar x_{\cdot 3})/s_3 \\ (x_4 - \bar x_{\cdot 4})/s_4 \\ (x_5 - \bar x_{\cdot 5})/s_5 \end{bmatrix}$$
$$= 35.0825 + [-0.72651,\ 0.06550,\ 5.42492,\ 1.40940,\ -0.22889]
\begin{bmatrix} (x_1 - 2.731)/0.454 \\ (x_2 - 40.91)/25.90 \\ (x_3 - 3.14)/9.63 \\ (x_4 - 25.069)/1.314 \\ (x_5 - 6.255)/0.654 \end{bmatrix}.$$
Obviously this can be simplified into a form $\hat y = \tilde\beta_0 + \tilde\beta_1 x_1 + \tilde\beta_2 x_2 + \tilde\beta_3 x_3 + \tilde\beta_4 x_4 + \tilde\beta_5 x_5$, which
in turn simplifies the process of making predictions and provides new estimated regression coeffi-
cients for the x j s that correspond to the fitted principal component model. These PC regression esti-
mates of the original β j s can be compared to the least squares estimates. Many computer programs
for performing PC regression report these estimates of the β j s and their corresponding standard
errors.
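The back-substitution above is just a matrix product. The following numpy sketch redoes it with the PC coefficients, eigenvector rows, means, and standard deviations quoted in the text, recovering the coefficients on the standardized $x_j$s and (under my relabeling of the constant as $\tilde\beta_0$) the intercept and slopes on the original scale.

import numpy as np

# Reduced-model PC coefficients and the PC1, PC3, PC4 coefficient rows quoted above.
gamma = np.array([-2.9419, -2.0457, 4.380])
V_sel = np.array([[-0.229, -0.555, -0.545, -0.170, -0.559],   # PC1 coefficients
                  [ 0.723,  0.051, -0.106, -0.680, -0.037],   # PC3 coefficients
                  [ 0.018, -0.334,  0.823, -0.110, -0.445]])  # PC4 coefficients
xbar = np.array([2.731, 40.91, 3.14, 25.069, 6.255])
s    = np.array([0.454, 25.90, 9.63, 1.314, 0.654])

coef_std = gamma @ V_sel                 # coefficients on the standardized x_j
beta_tilde = coef_std / s                # slopes on the original x_j
beta0_tilde = 35.0825 - beta_tilde @ xbar
print(np.round(coef_std, 5))             # matches [-0.72651, 0.06550, 5.42492, 1.40940, -0.22889]
print(beta0_tilde, beta_tilde)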
It was mentioned earlier that collinearity tends to increase the variance of regression coeffi-
cients. The fact that the later principal component variables are more nearly redundant is reflected
in Table 11.4 by the fact that the standard errors for their estimated regression coefficients increase
(excluding the intercept).
One rationale for using PC regression is that you just don’t believe in using nearly redundant
variables. The exact nature of such variables can be changed radically by small errors in the x j s. For
this reason, one might choose to ignore PC5 because of its small eigenvalue proportion, regardless
of any importance it may display in Table 11.4. If the t statistic for PC5 appeared to be significant,
it could be written off as a chance occurrence or, perhaps more to the point, as something that is un-
likely to be reproducible. If you don’t believe in redundant variables, i.e., if you don’t believe that they
are themselves reproducible, any predictive ability due to such variables will not be reproducible
either.
When considering PC5, the case is pretty clear. PC5 accounts for only about 1.5% of the vari-
ability involved in predicting the x j s. It is a very poorly defined aspect of the predictor variables
x j and, anyway, it is not a significant predictor of y. The case is less clear when considering PC4.
This variable has a significant effect for explaining y, but it accounts for only 4% of the variability
in predicting the x j s, so PC4 is reasonably redundant within the x j s. If this variable is measuring
some reproducible aspect of the original x j data, it should be included in the regression. If it is not
reproducible, it should not be included. From examining the PC4 coefficients in Table 11.3, we see
that PC4 is roughly the average of the percent white collar fathers x2 and the mothers’ education
x5 contrasted with the socio-economic variable x3. (Actually, this comparison is between the variables after they have been adjusted for their means and standard deviations as in equation (11.6.1).)
If PC4 strikes the investigator as a meaningful, reproducible variable, it should be included in the
regression.
In our discussion, we have used PC regression both to eliminate questionable aspects of the
predictor variables and as a method for selecting a reduced model. We dropped PC5 primarily
because it was poorly defined. We dropped PC2 solely because it was not a significant predictor.
Some people might argue against this second use of PC regression and choose to take a model based
on PC1, PC2, PC3, and possibly PC4.
On occasion, PC regression is based on the sample covariance matrix of the x j s rather than the
sample correlation matrix. Again, eigenvalues and eigenvectors are used, but in using relationships
like equation (11.6.1), the s j s are deleted. The eigenvalues and eigenvectors for the covariance ma-
trix typically differ from those for the correlation matrix. The relationship between estimated prin-
cipal component regression coefficients and original least squares regression coefficient estimates
is somewhat simpler when using the covariance matrix.
It should be noted that PC regression is just as sensitive to violations of the assumptions as reg-
ular multiple regression. Outliers and high leverage points can be very influential in determining
the results of the procedure. Tests and confidence intervals rely on the independence, homoscedas-
ticity, and normality assumptions. Recall that in the full principal components regression model,
the residuals and predicted values are identical to those from the regression on the original predic-
tor variables. Moreover, highly influential points in the original predictor variables typically have a
large influence on the coefficients in the principal component variables.
Minitab commands
Minitab commands for the principal components regression analysis are given below. The basic
command is ‘pca.’ The ‘scores’ subcommand places the principal component variables into columns
c12 through c16. The ‘coef’ subcommand places the eigenvectors into columns c22 through c26. If
one wishes to define principal components using the covariances rather than the correlations, simply
include a pca subcommand with the word ‘covariance.’
MTB > pca c2-c6;
SUBC> scores c12-c16;
SUBC> coef c22-c26.
MTB > regress c8 on 5 c12-c16 c17 c18
MTB > plot c17 c18
11.7 Exercises
E XERCISE 11.7.1. Show that the form (11.3.2) simplifies to the form (11.3.1) for simple linear
regression.
E XERCISE 11.7.3. Use Proposition 1.2.11 to show that E(AY + c) = A E(Y ) + c and Cov(AY +
c) = ACov(Y )A′ .
E XERCISE 11.7.5. Do a principal components regression for the Younger data from Exer-
cise 9.12.1.
E XERCISE 11.7.6. Do a principal components regression for the Prater data from Exer-
cise 9.12.3.
E XERCISE 11.7.7. Do a principal components regression for the Chapman data of Exer-
cise 9.12.4.
E XERCISE 11.7.8. Do a principal components regression for the pollution data of Exer-
cise 9.12.5.
E XERCISE 11.7.9. Do a principal components regression for the body fat data of Exer-
cise 9.12.6.