Chapter 3
3. Multivariate Linear Regression
A regression model that involves more than one regressor variable is called a multiple linear
regression model.
3.1. Description of the model
The data consist of n observations on a dependent or response variable Y and p predictor or explanatory variables, $X_1, X_2, X_3, \ldots, X_p$. The observations are usually arranged in a table with one row $(y_i, x_{i1}, x_{i2}, \ldots, x_{ip})$ per observation. The relationship between Y and $X_1, X_2, X_3, \ldots, X_p$ is formulated as the linear model

$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (3.1)$

where $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are constants referred to as the model partial regression coefficients (or simply as the regression coefficients) and $\varepsilon$ is a random disturbance or error. It is assumed that, for any set of fixed values of $X_1, X_2, X_3, \ldots, X_p$ that fall within the range of the data, the linear equation (3.1) provides an acceptable approximation of the true relationship between Y and the X's (Y is approximately a linear function of the X's, and $\varepsilon$ measures the discrepancy in that approximation). In particular, $\varepsilon$ contains no systematic information for determining Y that is not already captured by the X's.
3.2. Least squares estimation
The least squares method finds the estimate of $\beta$ by minimizing the sum of squared residuals,
$S(\beta) = S(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon_1^2 + \varepsilon_2^2 + \cdots + \varepsilon_n^2 = \varepsilon^t \varepsilon = (Y - X\beta)^t (Y - X\beta),$
since $\varepsilon = Y - X\beta$. Expanding $S(\beta)$ gives

$S(\beta) = (Y - X\beta)^t (Y - X\beta) = (Y^t - \beta^t X^t)(Y - X\beta) = Y^t Y - Y^t X\beta - \beta^t X^t Y + \beta^t X^t X\beta = Y^t Y - 2\beta^t X^t Y + \beta^t X^t X\beta,$

since $\beta^t X^t Y = (\beta^t X^t Y)^t = Y^t X\beta$ is a real number (a 1×1 matrix equals its transpose).
Setting the derivative $\partial S(\beta)/\partial \beta = -2X^t Y + 2X^t X\beta$ equal to zero gives $X^t X \hat\beta = X^t Y$, so the least squares estimator is $\hat\beta = (X^t X)^{-1} X^t Y$.
The fitted values (in vector form):

$\hat{Y} = X\hat\beta = (\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_n)^t.$
The residuals (in vector form):

$\hat\varepsilon = Y - \hat{Y} = Y - X\hat\beta = (\hat\varepsilon_1, \hat\varepsilon_2, \ldots, \hat\varepsilon_n)^t.$
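In matrix form this is short to compute. The following is a minimal numerical sketch (assuming NumPy is available; the small data set here is made up purely for illustration):

```python
import numpy as np

# Illustrative (made-up) data: n = 5 observations, p = 2 predictors.
rng = np.random.default_rng(0)
n, p = 5, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column of ones = intercept
Y = rng.normal(size=n)

# Least squares estimate beta_hat = (X^t X)^{-1} X^t Y via the normal equations.
# np.linalg.solve is preferred to forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

Y_hat = X @ beta_hat            # fitted values
resid = Y - Y_hat               # residuals
print(beta_hat, resid.sum())    # with an intercept, the residuals sum to ~0
```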
Note: (i) $\frac{\partial(\beta^t a)}{\partial \beta} = \frac{\partial\left(\sum_{i=1}^{p+1} \beta_{i-1} a_i\right)}{\partial \beta} = a$, where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^t$ and $a = (a_1, a_2, \ldots, a_{p+1})^t$.

(ii) $\frac{\partial(\beta^t A \beta)}{\partial \beta} = \frac{\partial\left(\sum_{i=1}^{p+1} \sum_{j=1}^{p+1} \beta_{i-1} \beta_{j-1} a_{ij}\right)}{\partial \beta} = 2A\beta$, where $A$ is any $(p+1) \times (p+1)$ symmetric matrix.
Note: $(X^t X)^t = X^t (X^t)^t = X^t X$, so $X^t X$ is a symmetric matrix. Also, $\left[(X^t X)^{-1}\right]^t = \left[(X^t X)^t\right]^{-1} = (X^t X)^{-1}$, so $(X^t X)^{-1}$ is a symmetric matrix.
Note: $X^t X \hat\beta = X^t Y$ is called the normal equation.
Note: $\hat\varepsilon^t X = (Y^t - \hat\beta^t X^t) X = Y^t X - Y^t X (X^t X)^{-1} X^t X = Y^t X - Y^t X = 0.$
Therefore, if there is an intercept, then the first column of X is $(1, 1, \ldots, 1)^t$, and

$\hat\varepsilon^t X = (\hat\varepsilon_1, \hat\varepsilon_2, \ldots, \hat\varepsilon_n) \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} = 0 \implies \sum_{i=1}^{n} \hat\varepsilon_i = 0 \text{ and } \sum_{i=1}^{n} \hat\varepsilon_i x_{ij} = 0, \; j = 1, \ldots, p.$

Note: For the linear regression model without an intercept, $\sum_{i=1}^{n} \hat\varepsilon_i$ might not be equal to 0.
Note: $\hat{Y} = X\hat\beta = X(X^t X)^{-1} X^t Y = HY$, where $H = X(X^t X)^{-1} X^t$ is called the "hat" matrix (or projection matrix). Thus,
$\hat\varepsilon = Y - X\hat\beta = Y - HY = (I - H)Y.$
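As a sketch of these identities (again assuming NumPy, with made-up data), one can form H explicitly and check that it is symmetric and idempotent and that the residuals $(I - H)Y$ sum to zero when an intercept is present:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat (projection) matrix

Y_hat = H @ Y                          # fitted values Y_hat = HY
resid = (np.eye(n) - H) @ Y            # residuals (I - H)Y

assert np.allclose(H, H.T)             # H is symmetric
assert np.allclose(H @ H, H)           # H is idempotent (a projection)
assert np.isclose(resid.sum(), 0.0)    # residuals sum to zero (intercept present)
```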
Example 1:
Heller Company manufactures lawn mowers and related lawn equipment. The managers believe
the quantity of lawn mowers sold depends on the price of the mower and the price of a
competitor’s mower. We have the following data:
Competitor's Price (x_i1)   Heller's Price (x_i2)   Quantity Sold (y_i)
120 100 102
140 110 100
190 90 120
130 150 77
155 210 46
175 150 93
125 250 26
145 270 69
180 300 65
150 250 85
$\hat\beta = (X^t X)^{-1} X^t Y = \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \begin{pmatrix} 66.518 \\ 0.414 \\ -0.269 \end{pmatrix}.$
The fitted regression equation is $\hat{y} = 66.518 + 0.414 x_1 - 0.269 x_2$,
and the residuals are

$\hat\varepsilon = Y - \hat{Y} = (12.79,\ 5.21,\ -0.88,\ -2.86,\ -28.02,\ -5.49,\ -24.81,\ 15.31,\ 4.91,\ 23.84)^t.$
Suppose now we want to predict the quantity sold in a city where Heller prices its mower at $160 and the competitor prices its mower at $170. The predicted quantity sold is

$66.52 + 0.4139(170) - 0.26978(160) = 93.718.$
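The whole example can be reproduced in a few lines (a sketch assuming NumPy; the data are the ten rows above):

```python
import numpy as np

x1 = np.array([120, 140, 190, 130, 155, 175, 125, 145, 180, 150])  # competitor's price
x2 = np.array([100, 110,  90, 150, 210, 150, 250, 270, 300, 250])  # Heller's price
y  = np.array([102, 100, 120,  77,  46,  93,  26,  69,  65,  85])  # quantity sold

X = np.column_stack([np.ones(len(y)), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)               # approx. [66.52, 0.4139, -0.26978]
print(y - X @ beta_hat)       # the residual vector quoted above

x0 = np.array([1, 170, 160])  # competitor at $170, Heller at $160
print(x0 @ beta_hat)          # approx. 93.718
```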
3.3. Hypothesis Testing in Multiple Linear Regression
Once we have estimated the parameters in the model, we face two immediate questions:
1. What is the overall adequacy of the model?
2. Which specific regressors seem important?
Several hypothesis testing procedures prove useful for addressing these questions. The formal tests require that the random errors be independent and follow a normal distribution with mean $E(\varepsilon) = 0$ and variance $\mathrm{Var}(\varepsilon) = \sigma^2$.
3.3.1. Tests for the significance of the regression
The test for significance of regression is a test to determine whether there is a linear relationship between the response y and any of the explanatory variables $x_1, x_2, \ldots, x_p$. This procedure is often thought of as an overall or global test of model adequacy. The appropriate hypotheses are

• H0: $\beta_1 = \beta_2 = \cdots = \beta_p = 0$ (none of the X's is associated with Y)
• HA: $\beta_j \neq 0$ for at least one j

Rejection of this null hypothesis implies that at least one of the regressors $x_1, x_2, \ldots, x_p$ contributes significantly to the model. The test procedure is a generalization of the analysis of variance used in simple linear regression.
ANALYSIS OF VARIANCE (ANOVA)
The total variability in a regression analysis can be partitioned into a component explained by the regression, SSR, and a component due to unexplained error, SSE, i.e.,

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \qquad (\mathrm{SST} = \mathrm{SSE} + \mathrm{SSR}).$
Dividing each sum of squares by its degrees of freedom gives the mean sum of squares. The results of the overall significance test of a model are summarized in the Analysis of Variance (ANOVA) table as follows.
Source of variation   Sum of squares                          Df          Mean sum of squares       F
Regression            SSR = $\sum (\hat{y}_i - \bar{y})^2$    p           MSR = SSR / p             F = MSR / MSE
Residual              SSE = $\sum (y_i - \hat{y}_i)^2$        n - p - 1   MSE = SSE / (n - p - 1)
Total                 SST = $\sum (y_i - \bar{y})^2$          n - 1
The decision rule is to reject $H_0$ if $F > F_{\alpha}(p, n-p-1)$, where F follows an F distribution with p and n − p − 1 degrees of freedom and $\alpha$ is the significance level.
Example 2: Test the significance of the regression for the data given in Example 1, based on the following Minitab output.

The regression equation is
y = 66.518 + 0.414 x1 - 0.269 x2
Analysis of Variance
Source DF SS MS F P
Regression 2 4618.8 2309.4 6.58 0.025
Residual Error 7 2457.3 351.0
Total 9 7076.1
Source DF Seq SS
x1 1 715.5
x2 1 3903.3
Solution:
1. State the hypotheses: $H_0: \beta_1 = \beta_2 = 0$ (none of the X's is associated with Y) against $H_A: \beta_j \neq 0$ for at least one j.
2. Critical value: $F_{\alpha}(p, n-p-1) = F_{0.05}(2, 7) = 4.74$.
3. Test statistic: $F = \frac{\mathrm{MSR}}{\mathrm{MSE}} = \frac{2309.4}{351.0} = 6.58$.
4. Decision: since $F = 6.58 > 4.74$ (p-value = 0.025 < 0.05), we reject $H_0$; the regression is significant.
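The same test can be checked numerically (a sketch assuming SciPy; the sums of squares are taken from the Minitab ANOVA table above):

```python
from scipy import stats

SSR, SSE = 4618.8, 2457.3      # from the ANOVA table
n, p = 10, 2
MSR, MSE = SSR / p, SSE / (n - p - 1)
F = MSR / MSE                                  # approx. 6.58

F_crit = stats.f.ppf(1 - 0.05, p, n - p - 1)   # approx. 4.74
p_value = stats.f.sf(F, p, n - p - 1)          # approx. 0.025
print(F > F_crit, p_value)                     # True -> reject H0
```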
Since we have rejected the null hypothesis $H_0: \beta_1 = \beta_2 = 0$, we must identify which variables are significant. So we apply individual tests of the regression parameters, $H_0: \beta_j = 0$ against $H_1: \beta_j \neq 0$.
Solution:
1. State the hypothesis: $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$.
2. Critical value: $t_{\alpha/2}(n-p-1) = t_{0.025,7} = 2.365$.
3. Test statistic: $t = \frac{\hat\beta_1 - 0}{s_{\hat\beta_1}} = \frac{0.4139 - 0}{0.2604} = 1.59$.
4. Decision: since $|t| = 1.59 < 2.365$, we do not reject $H_0$; $x_1$ does not contribute significantly to the model given that $x_2$ is included.
To test the hypothesis $H_0: \beta_2 = 0$ against $H_1: \beta_2 \neq 0$ we follow the same steps.

Solution:
1. State the hypothesis: $H_0: \beta_2 = 0$ against $H_1: \beta_2 \neq 0$.
2. Critical value: $t_{\alpha/2}(n-p-1) = t_{0.025,7} = 2.365$.
3. Test statistic: $t = \frac{\hat\beta_2 - 0}{s_{\hat\beta_2}} = \frac{-0.26978 - 0}{0.08091} = -3.34$.
4. Decision: since $|t| = 3.34 > 2.365$, we reject $H_0$; $x_2$ contributes significantly to the model.
From the above example we can estimate the variances of the estimated regression parameters $\mathrm{var}(\hat\beta_0)$, $\mathrm{var}(\hat\beta_1)$ and $\mathrm{var}(\hat\beta_2)$ using the formula $\mathrm{cov}(\hat\beta) = \hat\sigma^2 (X^t X)^{-1}$, which is a 3×3 matrix whose jth diagonal element is the variance of $\hat\beta_j$ and whose (i, j)th off-diagonal element is the covariance of $\hat\beta_i$ and $\hat\beta_j$. Here $\sigma^2$ is estimated by the MSE: since the error terms are NID with mean zero and constant variance $\sigma^2$, $E(\mathrm{SSE}/(n-p-1)) = \sigma^2$, i.e. $\hat\sigma^2 = 351$. Therefore

$\mathrm{var}(\hat\beta) = \begin{pmatrix} \mathrm{var}(\hat\beta_0) \\ \mathrm{var}(\hat\beta_1) \\ \mathrm{var}(\hat\beta_2) \end{pmatrix} = \begin{pmatrix} 1753.5 \\ 0.067 \\ 0.006 \end{pmatrix}.$
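These quantities can be verified directly (a sketch assuming NumPy, re-using the Example 1 data):

```python
import numpy as np

x1 = np.array([120, 140, 190, 130, 155, 175, 125, 145, 180, 150])
x2 = np.array([100, 110,  90, 150, 210, 150, 250, 270, 300, 250])
y  = np.array([102, 100, 120,  77,  46,  93,  26,  69,  65,  85])
X  = np.column_stack([np.ones(len(y)), x1, x2])

n, p = len(y), 2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)        # MSE, approx. 351

cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)  # cov(beta_hat) = sigma^2 (X^t X)^{-1}
print(np.diag(cov_beta))                        # approx. [1753.5, 0.067, 0.006]
print(beta_hat[1:] / np.sqrt(np.diag(cov_beta))[1:])  # t statistics, approx. [1.59, -3.33]
```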
3.4. Confidence intervals on the regression coefficients
If the population regression errors $\varepsilon_i$ are normally distributed and the standard regression assumptions hold, the $100(1-\alpha)\%$ confidence interval for the regression coefficient $\beta_j$ is given by

$\hat\beta_j \pm t_{(n-p-1,\ \alpha/2)}\, s_{\hat\beta_j},$

where $s_{\hat\beta_j} = \sqrt{\hat\sigma^2 C_{jj}}$ is the standard error of $\hat\beta_j$ ($C_{jj}$ being the diagonal element of $(X^t X)^{-1}$ corresponding to $\beta_j$), and $\frac{\hat\beta_j - \beta_j}{s_{\hat\beta_j}}$ follows a Student's t distribution with n − p − 1 degrees of freedom.
Example: Find 95% confidence intervals for $\beta_1$ and $\beta_2$ in Example 1.

Solution: The point estimates of $\beta_1$ and $\beta_2$ are 0.4139 and −0.26978 respectively; the diagonal elements of $(X^t X)^{-1}$ corresponding to $\beta_1$ and $\beta_2$ are $C_{11} = 1.931437 \times 10^{-4}$ and $C_{22} = 1.864613 \times 10^{-5}$; and $\hat\sigma^2 = 351$. Therefore the 95% CIs on $\beta_1$ and $\beta_2$ are given by $\hat\beta_1 \pm t_{(\alpha/2,\ n-p-1)}\sqrt{\hat\sigma^2 C_{11}}$ and $\hat\beta_2 \pm t_{(\alpha/2,\ n-p-1)}\sqrt{\hat\sigma^2 C_{22}}$. From these formulas we get that the 95% CI for $\beta_1$ is (−0.202, 1.03) and the 95% CI for $\beta_2$ is (−0.46, −0.08).
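A sketch of the interval computation (assuming SciPy, and plugging in the estimates and standard errors quoted above):

```python
import numpy as np
from scipy import stats

beta_hat = np.array([0.4139, -0.26978])   # point estimates of beta_1 and beta_2
se       = np.array([0.2604, 0.08091])    # their standard errors
t_crit   = stats.t.ppf(1 - 0.05 / 2, 7)   # t_{0.025,7}, approx. 2.365

lower, upper = beta_hat - t_crit * se, beta_hat + t_crit * se
print(list(zip(lower, upper)))   # approx. (-0.202, 1.030) and (-0.461, -0.078)
```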
3.4.1. Confidence Interval Estimation of the Mean Response
We may construct a confidence interval on the mean response at a particular point, such as $x_{01}, x_{02}, \ldots, x_{0p}$. Define the vector $x_0$ as

$x_0 = (1, x_{01}, x_{02}, \ldots, x_{0p})^t.$

The fitted value at this point is $\hat{y}_0 = x_0^t \hat\beta$. This is an unbiased estimator of $E(y \mid x_0)$, since $E(\hat{y}_0) = E(x_0^t \hat\beta) = x_0^t \beta = E(y \mid x_0)$, and its standard error is $\mathrm{s.e.}(\hat{y}_0) = \sqrt{\hat\sigma^2\, x_0^t (X^t X)^{-1} x_0}$. Therefore a $100(1-\alpha)$ percent confidence interval on the mean response at the point $x_{01}, x_{02}, \ldots, x_{0p}$ is

$\hat{y}_0 \pm t_{\alpha/2}(n-p-1)\, \mathrm{s.e.}(\hat{y}_0).$
Example 3: Find the mean response for the quantity of mowers sold when the competitor's price is $200 and Heller's price is $170, and find a 95% CI for the mean response.

Solution: $\hat{y}_0 = x_0^t \hat\beta = (1, 200, 170) \begin{pmatrix} 66.52 \\ 0.4139 \\ -0.26978 \end{pmatrix} = 103.44$, with $\mathrm{s.e.}(\hat{y}_0) = \sqrt{\hat\sigma^2\, x_0^t (X^t X)^{-1} x_0} = \sqrt{351 \times 0.5749} = 14.21$. The 95% CI for the mean response is therefore $103.44 \pm 2.365 \times 14.21 = (69.82,\ 137.06)$.
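A sketch of the mean-response interval (assuming NumPy and SciPy, with the Example 1 data):

```python
import numpy as np
from scipy import stats

x1 = np.array([120, 140, 190, 130, 155, 175, 125, 145, 180, 150])
x2 = np.array([100, 110,  90, 150, 210, 150, 250, 270, 300, 250])
y  = np.array([102, 100, 120,  77,  46,  93,  26,  69,  65,  85])
X  = np.column_stack([np.ones(len(y)), x1, x2])

n, p = len(y), 2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - p - 1)

x0 = np.array([1, 200, 170])
y0_hat = x0 @ beta_hat                                        # approx. 103.44
se_mean = np.sqrt(sigma2 * x0 @ np.linalg.inv(X.T @ X) @ x0)  # approx. 14.21
t_crit = stats.t.ppf(0.975, n - p - 1)
print(y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)   # approx. (69.8, 137.1)
```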
3.5. Prediction of new observations
The point estimate of a future observation $y_0$ at the point $x_{01}, x_{02}, \ldots, x_{0p}$ is $\hat{y}_0 = x_0^t \hat\beta$. Therefore a $100(1-\alpha)$ percent prediction interval for this future observation is

$\hat{y}_0 \pm t_{\alpha/2}(n-p-1) \sqrt{\hat\sigma^2 \left(1 + x_0^t (X^t X)^{-1} x_0\right)}.$

This is the generalization of the prediction interval for a future observation in simple linear regression; the extra $\hat\sigma^2$ inside the square root accounts for the variability of the new observation's error term.
Example 4: Find the predicted value for the quantity of mowers sold when the competitor's price is $200 and Heller's price is $170, and find a 95% prediction interval for that future observation.

Solution: $\hat{y}_0 = x_0^t \hat\beta = (1, 200, 170) \begin{pmatrix} 66.52 \\ 0.4139 \\ -0.26978 \end{pmatrix} = 103.44$, and the 95% prediction interval for the future observation is $\hat{y}_0 \pm t_{\alpha/2}(n-p-1) \sqrt{\hat\sigma^2 (1 + x_0^t (X^t X)^{-1} x_0)} = 103.44 \pm 2.365 \sqrt{351 \times 1.5749} = 103.44 \pm 2.365 \times 23.51 = (47.84,\ 159.04)$.
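The only change from the mean-response interval is the extra $\hat\sigma^2$ inside the square root (a sketch assuming NumPy and SciPy):

```python
import numpy as np
from scipy import stats

x1 = np.array([120, 140, 190, 130, 155, 175, 125, 145, 180, 150])
x2 = np.array([100, 110,  90, 150, 210, 150, 250, 270, 300, 250])
y  = np.array([102, 100, 120,  77,  46,  93,  26,  69,  65,  85])
X  = np.column_stack([np.ones(len(y)), x1, x2])

n, p = len(y), 2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - p - 1)

x0 = np.array([1, 200, 170])
y0_hat = x0 @ beta_hat
se_pred = np.sqrt(sigma2 * (1 + x0 @ np.linalg.inv(X.T @ X) @ x0))  # approx. 23.5
t_crit = stats.t.ppf(0.975, n - p - 1)
print(y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)  # approx. (47.8, 159.0)
```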
3.6. Coefficient of determination R2
The total variability in a regression analysis can be partitioned into a component explained by the regression, SSR, and a component due to unexplained error, SSE:

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2.$

The coefficient of determination is defined as $R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}$, the proportion of the total variability in Y explained by the regression;
for two or more regression equations, $R^2$ provides a comparable measure of the goodness of fit of the equations.

The coefficient of determination $R^2$ for the data given in Example 1 is 65.3%. This implies that 65.3% of the variability in y is explained by the regressors.
There is a potential problem with using $R^2$ as an overall measure of the quality of a fitted equation. As additional independent variables are added to a multiple regression model, the explained sum of squares SSR will increase even if the additional independent variable is not an important predictor variable. Thus we might find that $R^2$ has increased spuriously after
one or more nonsignificant predictor variables have been added to the multiple regression model. In such a case the increased value of $R^2$ would be misleading. To avoid this problem, the adjusted coefficient of determination can be computed as follows.
The adjusted coefficient of determination, $\bar{R}^2$, is defined as

$\bar{R}^2 = 1 - \frac{\mathrm{SSE}/(n-p-1)}{\mathrm{SST}/(n-1)}.$
We use this measure to correct for the fact that irrelevant independent variables will still produce some small reduction in the error sum of squares. Thus, the adjusted $R^2$ provides a better comparison between multiple regression models with different numbers of independent variables. Here the difference between $R^2$ and $\bar{R}^2$ is not very large; however, if the regression model contained a number of independent variables that were not important conditional predictors, the difference would be substantial. The adjusted R-square for the data in Example 1 is 55.4%.
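Both measures follow directly from the sums of squares in the ANOVA table of Example 2 (plain Python, no libraries needed):

```python
SSR, SSE = 4618.8, 2457.3
SST = SSR + SSE                # 7076.1
n, p = 10, 2

R2 = SSR / SST                                      # approx. 0.653
R2_adj = 1 - (SSE / (n - p - 1)) / (SST / (n - 1))  # approx. 0.554
print(R2, R2_adj)
```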
The coefficient of multiple correlation, $R = r(\hat{y}, y) = \sqrt{R^2}$, is the correlation between the fitted values and the observed responses and is equal to the square root of the multiple coefficient of determination. We use R as another measure of the strength of the relationship between the dependent variable and the independent variables. Thus, it is comparable to the correlation between Y and X in simple regression.
3.7. Dummy variables
In the discussion of multiple regression up to this point, we have assumed that the independent variables $x_j$ have existed over a range and contained many different values. However, in the multiple regression assumptions the only restriction on the independent variables is that they are fixed values. Thus, we could have an independent variable that takes on only two values: $x_j = 0$ and $x_j = 1$. This structure is commonly defined as a dummy variable, and we will see that it provides a valuable tool for applying multiple regression to situations involving categorical variables. One important example is a linear function that shifts in response to some influence. Consider first a simple regression equation:
$Y = \beta_0 + \beta_1 X_1 + \varepsilon.$
Now suppose that we introduce a dummy variable, $X_2$, that has values 0 and 1, so that the resulting equation is

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon.$
When $X_2 = 0$ in this equation, the constant is $\beta_0$, but when $X_2 = 1$ the constant is $\beta_0 + \beta_2$. Thus we see that the dummy variable shifts the linear relationship between Y and $X_1$ by the value of the coefficient $\beta_2$. In this way we can represent the effect of shifts in our regression equation.
Dummy variables are also called indicator variables.
Example
The president of Investors Ltd. wants to determine whether there is any evidence of wage discrimination in the salaries of male and female financial analysts. Examining the data, he saw two subsets of salaries, and salaries for males appear to be uniformly higher across the years of experience.
This problem can be analyzed by estimating a multiple regression model of salary, Y, versus years of experience, $X_1$, with a second variable, $X_2$, that is coded as

0 - Female employees
1 - Male employees
The resulting multiple regression model,

$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2,$

can be analyzed using the procedures we have learned, noting that the coefficient $\beta_1$ is an estimate of the expected annual increase in salary per year of experience and $\beta_2$ is the shift in mean salary from female to male employees. If $\beta_2$ is positive, we have an indication that male salaries are uniformly higher.
Salary (y)   Years of experience (x1)   Gender (x2)
40650 7 0
46820 9 0
50149 10 0
59679 14 0
67360 17 0
51535 5 1
62289 7 1
72486 9 1
75022 10 1
93379 14 1
105979 17 1
We can see that $\hat\beta_1 = 4076.5$, indicating that the expected annual increase in salary is $4,076.5 per year of experience, and $\hat\beta_2 = 14684$, indicating that male salaries are, on average, $14,684 higher.
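A sketch of fitting the dummy-variable model to the rows listed above (assuming NumPy; if the original table had more rows than survived here, the printed estimates may differ from the quoted values):

```python
import numpy as np

salary = np.array([40650, 46820, 50149, 59679, 67360,
                   51535, 62289, 72486, 75022, 93379, 105979])
years  = np.array([7, 9, 10, 14, 17, 5, 7, 9, 10, 14, 17])
male   = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])   # dummy: 0 = female, 1 = male

X = np.column_stack([np.ones(len(salary)), years, male])
beta_hat = np.linalg.solve(X.T @ X, X.T @ salary)
print(beta_hat)   # [intercept, annual increase per year, male/female salary shift]
```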
Analyses such as these have been used successfully in a number of wage discrimination lawsuits.
As a result most companies perform an analysis similar to this to determine if there is any
evidence of salary discrimination.
Examples such as the previous one have wide application to a number of problems including the
following.
1. The relationship between the number of units sold and the price is likely to shift if a new
competitor moves into the market.
2. The relationship between aggregate consumption and aggregate disposable income may
shift in time of war or other major national event.
3. The relationship between total output and number of workers may shift as the result of
the introduction of new production technology.
4. The demand function for a product may shift because of a new advertising campaign or a
news release relating to the product.