TUTORIAL 7: Multiple Linear Regression

I. Multiple Regression
A regression with two or more explanatory variables is called a multiple regression. Multiple
linear regression is an extremely effective tool for answering statistical questions involving many
variables. The procedures PROC REG and PROC GLM can be used to perform regression in
SAS. In this tutorial we concentrate on using PROC REG. Much of the syntax is similar to that
used for fitting simple linear regression models; see Tutorial 6 for a review of this material.
A. PROC REG
PROC REG is the basic SAS procedure for performing regression analysis. The general form of
the PROC REG procedure is:
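A sketch of the typical layout (showing only the statements used in this tutorial; dataset, response and the other names are placeholders):

PROC REG DATA=dataset;
MODEL response = explanatory-variables / options;
PLOT yvariable*xvariable;
OUTPUT OUT=dataset keyword=names;
RUN;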
The MODEL statement is used to specify the response and explanatory variables to be used in the
regression model. For example, the statement:
MODEL y = x1 x2;
fits a multiple linear regression model with the variable y as the response variable and the
variables x1 and x2 as explanatory variables.
The fit of the model and the model assumptions can be checked graphically using the PLOT
statement. This statement can be used to make all the relevant plots needed for the regression
model. In the regression models we have discussed so far, it is assumed that the errors are
independent and normally distributed with mean 0 and variance σ². After performing regression,
it is necessary to check these assumptions by analyzing the residuals and studying a series of
residual plots. To plot the residuals against the explanatory variables use the statement:
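(here x1 and x2 stand for the explanatory variables named in the MODEL statement)

PLOT residual.*x1 residual.*x2;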
Note that residual. (the period is required) is the variable name for the residuals created by PROC
REG. To plot the residuals against the predicted values we would use the statement:
PLOT residual.*predicted.;
Note that predicted. (the period is again required) is the variable name for the predicted values
from the regression model.
The OUTPUT statement is used to produce a new data set containing the original data used in the
regression model, as well as the predicted values and residuals. This new data set can, in turn, be
used to produce further diagnostic plots and check the model fit. When using the OUTPUT
statement, there are a number of options that control the contents of the output data set. The statement:
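(one possible form, using the R= and P= keywords to name the residuals and predicted values)

OUTPUT OUT=outdata R=resid P=yhat;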
creates a new data set named outdata which contains the residuals and predicted values. The
residuals are given the name resid, and the predicted values the name yhat. The data set outdata
can then be used to further study the residuals.
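For instance, the residuals in outdata can be plotted against the predicted values with any plotting procedure; a minimal sketch using PROC SGPLOT:

PROC SGPLOT DATA=outdata;
SCATTER X=yhat Y=resid;
REFLINE 0 / AXIS=y;
RUN;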
Ex. Data was collected on 15 houses recently sold in a city. It consisted of the sales price (in $),
house size (in square feet), the number of bedrooms, the number of bathrooms, the lot size (in
square feet) and the annual real estate tax (in $).
The following program reads in the data and fits a multiple regression model with price as the
response variable and size and lot as the explanatory variables. It also produces residual plots of
the residuals against both explanatory variables as well as the predicted values.
DATA houses;
INPUT tax bedroom bath price size lot;
DATALINES;
590 2 1 50000 770 22100
1050 3 2 85000 1410 12000
20 3 1 22500 1060 3500
870 2 2 90000 1300 17500
1320 3 2 133000 1500 30000
1350 2 1 90500 820 25700
2790 3 2.5 260000 2130 25000
680 2 1 142500 1170 22000
1840 3 2 160000 1500 19000
3680 4 2 240000 2790 20000
1660 3 1 87000 1030 17500
1620 3 2 118600 1250 20000
3100 3 2 140000 1760 38000
2070 2 3 148000 1550 14000
650 3 1.5 65000 1450 12000
;
RUN;
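The following PROC REG step (a sketch matching the description above) fits the model and requests the residual plots:

PROC REG DATA=houses;
MODEL price = size lot;
PLOT residual.*size residual.*lot residual.*predicted.;
RUN;

Part of the resulting output is shown below.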
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 44825992653 22412996326 19.10 0.0002
Error 12 14082023347 1173501946
Corrected Total 14 58908016000
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 -61969 32257 -1.92 0.0788
size 1 97.65137 18.16474 5.38 0.0002
lot 1 2.22295 1.13918 1.95 0.0748
We also obtained three residual plots, which are not shown here. The output can be used to test a
variety of hypotheses about the model. For example, from the Parameter Estimates table we
see that the coefficient corresponding to size is significant (p-value=0.0002) when controlling for
lot size. However, the coefficient corresponding to lot size is not significant (p-value=0.0748)
when controlling for house size.
B. Partial F-Tests
Sometimes we are interested in simultaneously testing whether a certain subset of the coefficients
are all equal to 0 (e.g. β₃ = β₄ = 0). We can do this using a partial F-test. This test involves
comparing the SSE from a reduced model (excluding the parameters we are testing) with the SSE
from the full model (including all of the parameters).
We can perform a partial F-test in PROC REG by including a TEST statement. For example, the
statement
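(with var1 and var2 standing for the variables whose coefficients are being tested)

TEST var1=0, var2=0;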
tests the null hypothesis that the regression coefficients corresponding to var1 and var2 are both
equal to 0. However, note that any number of variables can be included in the TEST statement.
Suppose we include the variables bedroom, bath and size in our model and are interested in
testing whether the number of bedrooms and bathrooms are significant after taking size into
consideration. The following program performs the partial F-test:
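(a sketch consistent with the output shown below)

PROC REG DATA=houses;
MODEL price = bedroom bath size;
TEST bedroom=0, bath=0;
RUN;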
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 3 43908504107 14636168036 10.73 0.0013
Error 11 14999511893 1363591990
Corrected Total 14 58908016000
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 27923 56306 0.50 0.6297
bedroom 1 -35525 25037 -1.42 0.1836
bath 1 2269.34398 22209 0.10 0.9205
size 1 130.79392 36.20864 3.61 0.0041
Mean
Source DF Square F Value Pr > F
Numerator 2 1775487498 1.30 0.3108
Denominator 11 1363591990
The final table shows the results of the partial F-test. Since F=1.30 (p-value=0.3108) we
cannot reject the null hypothesis (β₂ = β₃ = 0). It appears that bedroom and bath do not contribute
significant information about the sales price once size has been taken into consideration.
C. Model Selection
Often we have data on a large number of explanatory variables and wish to construct a regression
model using some subset of them. The use of a subset will make the resulting model easier to
interpret and more manageable, especially if more data is to be collected in the future.
Unnecessary terms in the model may also yield less precise inference.
One approach to model selection is to consider all possible subsets of the pool of explanatory
variables and find the model that best fits the data according to some criterion. Different criteria
may be used to select the best model, such as adjusted R² or Mallows' Cp. These criteria assign
scores to each model and allow us to choose the model with the best score.
In SAS we can perform model selection using Mallows' Cp by including the SELECTION= option in
the MODEL statement. The statement:
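(with y, x1, x2 and x3 as placeholder variable names)

MODEL y = x1 x2 x3 / SELECTION=CP;

requests a search over all possible subsets of the explanatory variables using the Cp criterion.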
To use Mallows' Cp to determine which subset of the 5 possible explanatory variables best
models the data, we can use the following code:
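(a sketch using all five candidate explanatory variables from the houses data)

PROC REG DATA=houses;
MODEL price = tax bedroom bath size lot / SELECTION=CP;
RUN;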
(The output lists each candidate model's number of variables, C(p), R-Square and the variables included; the full listing is not shown here.)
From the output we see that the model with tax, bedroom and size minimizes the Cp criterion.
When the number of explanatory variables is large it is often not feasible to fit all possible
models. It is instead more efficient to use a search algorithm to find the best model. A number of
such search algorithms exist. They include: forward selection, backward elimination and stepwise
regression.
In SAS we can perform model selection using these algorithms by including the SELECTION= option
in the MODEL statement. The statement:
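(again with placeholder variable names)

MODEL y = x1 x2 x3 / SELECTION=FORWARD;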
fits the best model using forward selection. Other algorithms can be used by exchanging FORWARD
with either BACKWARD or STEPWISE in the MODEL statement.
D. Multicollinearity
In multiple regression one would like the explanatory variables to be highly correlated with the
response variable. However, it is not desirable for the explanatory variables to be correlated with
one another. Multicollinearity exists when two or more of the explanatory variables used in the
regression model are moderately or highly correlated. The presence of a high degree of multi-
collinearity among the explanatory variables can result in the following problems: (i) the standard
deviation of the regression coefficients may be disproportionately large, (ii) the coefficient
estimates are unstable, and (iii) the regression coefficients may not be interpretable.
A method for detecting the presence of multicollinearity is the variance inflation factor (VIF). A
large VIF (>10) is taken as an indication that multicollinearity may be influencing the estimates.
Including the option VIF in the MODEL statement prints the variance inflation factors for each of
the explanatory variables.
If we want to determine whether there is any multicollinearity present in our model we need to
add the VIF option to the MODEL statement as shown below:
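(a sketch, for the original model with size and lot as explanatory variables)

PROC REG DATA=houses;
MODEL price = size lot / VIF;
RUN;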
This gives rise to the same output as in the original regression of price on size and lot. The only difference is an additional
column in the Parameter Estimates section showing each explanatory variable's variance inflation factor.
Parameter Estimates
Parameter Standard Variance
Variable DF Estimate Error t Value Pr > |t| Inflation
Intercept 1 -61969 32257 -1.92 0.0788 0
size 1 97.65137 18.16474 5.38 0.0002 1.03891
lot 1 2.22295 1.13918 1.95 0.0748 1.03891
Note that both VIF values are small, so multicollinearity does not appear to be a problem in this
example.