Notes 12
What if the Regression Equation Contains “Wrong”
Predictors?
Before we learn about variable selection methods, we first need to understand the
consequences of a regression equation containing the “wrong” or “inappropriate”
variables.
There are four possible outcomes when formulating a regression model for a set of data:
1 The model is correctly specified: it contains all of the relevant predictors and no extraneous ones.
2 The model is underspecified: it is missing one or more important predictors, and so yields biased regression coefficients and biased predictions.
3 The model contains one or more extraneous variables.
4 The model is over-specified: it contains one or more redundant predictors.
1 A regression model contains one or more extraneous variables (outcome 3).
1. Extraneous variables are related neither to the response nor to any of the other
predictors.
2. The good news is that such a model yields unbiased regression coefficients and
an unbiased MSE.
3. The bad news is that the MSE has fewer degrees of freedom associated with it. As a
result, our confidence intervals tend to be wider and our hypothesis tests tend to
have lower power.
2 If a regression model is over-specified (outcome 4), the regression equation contains
one or more redundant predictor variables.
1. That is, part of the model is correct, but some of the predictors are redundant.
2. Regression models that are over-specified still yield unbiased regression coefficients and
an unbiased MSE.
3. However, as with including extraneous variables, we have made the model more
complicated and harder to understand than necessary.
Strategy
1 Know your goal, know your research question. Knowing how you plan to use
your regression model can assist greatly in the model building stage.
Example: let’s learn how the stepwise regression procedure works by considering a data
set that concerns the hardening of cement.
Researchers were interested in learning how the composition of the cement affected the
heat evolved during the hardening of the cement. They measured and recorded the
following data on 13 batches of cement:
[Scatterplot matrix of the response y (heat evolved) against the four predictors x1 , x2 , x3 , and x4 for the 13 batches of cement.]
Note: The number of predictors in this data set is not large. The stepwise procedure is typically
used on much larger data sets, for which it is not feasible to attempt to fit all of the possible
regression models.
Overview of Stepwise Regression using F -Tests to
Choose Variables
1 First, we start with no predictors in our “stepwise model.”
2 Then, at each step along the way we either enter or remove a predictor based on the general
(partial) F -tests – that is, the t-tests for the slope parameters – that are obtained.
3 We stop when no more predictors can be justifiably entered or removed from our stepwise
model, thereby leading us to a “final model.”
Now let’s start the procedure:
Step 1.
1 Fit each of the one-predictor models – that is, regress Y on x1 , regress Y on x2 ,. . ., and
regress Y on xp−1 .
2 Of those predictors whose t-test p-value is less than some α level, say, 0.15, the first
predictor put in the stepwise model is the predictor that has the smallest t-test p-value.
Step 2.
1 Suppose x1 had the smallest t-test p-value below α and therefore was deemed the “best”
single predictor arising from the first step.
2 Now, fit each of the two-predictor models that include x1 as a predictor – that is, regress Y
on x1 and x2 , regress Y on x1 and x3 ,...,and regress Y on x1 and xp−1 .
3 Of those predictors whose t-test p-value is less than α, the second predictor put in the
stepwise model is the predictor that has the smallest t-test p-value.
4 If no predictor has a t-test p-value less than α, stop. The model with the one predictor
obtained from the first step is your final model.
5 But, suppose instead that x2 was deemed the “best” second predictor and it is therefore
entered into the stepwise model.
6 Now, since x1 was the first predictor in the model, step back and see if entering x2 into the
stepwise model somehow affected the significance of the x1 predictor.
That is, check the t-test p-value for testing β1 = 0. If the t-test p-value for β1 = 0 has become
not significant, remove x1 from the stepwise model.
Step 3.
1 Suppose both x1 and x2 made it into the two-predictor stepwise model and remained
there.
2 Now, fit each of the three-predictor models that include x1 and x2 as predictors – that is,
regress Y on x1 , x2 , and x3 ; regress Y on x1 , x2 , and x4 ; . . .; and regress Y on x1 ,
x2 , and xp−1 .
3 Of those predictors whose t-test p-value is less than α, the third predictor put in the
stepwise model is the predictor that has the smallest t-test p-value.
4 If no predictor has a t-test p-value less than α, stop. The model containing the two predictors
obtained from the second step is your final model.
5 But, suppose instead that x3 was deemed the “best” third predictor and it is therefore
entered into the stepwise model.
6 Now, since x1 and x2 were the first predictors in the model, step back and see if entering x3
into the stepwise model somehow affected the significance of the x1 and x2 predictors.
If the t-test p-value for either β1 = 0 or β2 = 0 has become not significant, then remove that
predictor from the stepwise model.
Stopping the procedure. Continue the steps as described above until adding an additional
predictor does not yield a t-test p-value below α.
Back to the Cement Example
Note that in add1(), the general linear (partial) F -test is used. It can be checked that the p-value
for the F -test for each predictor is the same as the p-value for the t-test for each predictor
obtained by lm().
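The code for this first step does not survive in the notes, so here is a minimal sketch. The data below are the classic 13-batch Hald cement data, whose values are widely reproduced; the data frame name `cement` is chosen here for illustration, so verify both against your course materials:

```r
# Hald cement data: 13 batches; y = heat evolved during hardening,
# x1-x4 = measurements of the cement's composition
cement <- data.frame(
  x1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
  x2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  x3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
  x4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12),
  y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
         93.1, 115.9, 83.8, 113.3, 109.4))

# Step 1: start from the intercept-only model and try each one-predictor model
mod0 <- lm(y ~ 1, data = cement)
add1(mod0, scope = ~ x1 + x2 + x3 + x4, test = "F")
# x4 has the largest F-statistic (smallest p-value), so x4 enters first
```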
Step 2. choose the second predictor when x4 is in the model.
We regress Y on x4 and x1 , regress Y on x4 and x2 , and regress Y on x4 and x3 . Instead of
calling lm() three times, we use the add1() function:
1 The F -statistic (p-value) for x1 is the largest (smallest). As a result of the second step, we
enter x1 into our stepwise model.
2 The update() function adds predictors to or removes them from a fitted linear model. In this
example, the model is updated by including x4 ; the result is the same as regressing Y on x4
directly with lm().
Before we proceed to the next step, we need to check whether or not adding x1 in the model
affects the significance of x4 :
The p-value for x4 shows that adding x1 to the model does not affect the significance of x4 .
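The code for this step is also missing from the notes; the following is a sketch using the Hald cement data (redefined inline so the snippet runs on its own):

```r
# Hald cement data, repeated so this snippet runs standalone
cement <- data.frame(
  x1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
  x2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  x3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
  x4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12),
  y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
         93.1, 115.9, 83.8, 113.3, 109.4))

mod0 <- lm(y ~ 1, data = cement)
mod4 <- update(mod0, . ~ . + x4)   # same as lm(y ~ x4, data = cement)

# Step 2: try adding each remaining predictor to the model containing x4
add1(mod4, scope = ~ x1 + x2 + x3 + x4, test = "F")
# x1 has the largest F-statistic (smallest p-value), so x1 enters

# check that entering x1 did not knock out x4
summary(update(mod4, . ~ . + x1))  # x4 remains significant
```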
Step 3. choose the third predictor when x4 and x1 are in the model.
We regress Y on x4 , x1 , and x2 ; and we regress Y on x4 , x1 , and x3 , obtaining:
The predictor x2 has the smallest F -test p-value. Therefore, as a result of the third step, we enter
x2 into our stepwise model.
Recall that in a stepwise regression procedure, we don’t need to set the α level as strict as 0.05. It
could be larger.
Now, since x1 and x4 were the first two predictors in the model, we must step back and see if
entering x2 into the stepwise model affected the significance of the x1 and x4 predictors.
If we choose α = 0.15, then the p-value for x4 is larger than α. Therefore, we remove the predictor
x4 from the stepwise model, leaving us with the predictors x1 and x2 in our stepwise model.
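A sketch of this third step, again with the Hald cement data redefined inline so the snippet is self-contained:

```r
# Hald cement data, repeated so this snippet runs standalone
cement <- data.frame(
  x1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
  x2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  x3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
  x4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12),
  y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
         93.1, 115.9, 83.8, 113.3, 109.4))

mod41 <- lm(y ~ x4 + x1, data = cement)

# Step 3: try adding x2 or x3 to the model containing x4 and x1
add1(mod41, scope = ~ x1 + x2 + x3 + x4, test = "F")
# x2 has the smaller p-value of the two candidates, so x2 enters

mod412 <- update(mod41, . ~ . + x2)
summary(mod412)
# with alpha = 0.15, the p-value for x4 is now too large, so remove x4
mod12 <- update(mod412, . ~ . - x4)
```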
Step 4. choose the third predictor when x1 and x2 are in the model.
We regress Y on x1 , x2 , and x3 . Note that we don’t need to regress Y on x1 , x2 , and x4 , since we
just deleted x4 from the model containing x1 , x2 , and x4 .
Stop the stepwise regression procedure
According to its p-value, x3 is not eligible for entry into our stepwise model. That is, we stop our
stepwise regression procedure. Our final regression model, based on the stepwise procedure
using F -tests to choose predictors, contains only the predictors x1 and x2 .
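The stopping step and the final model can be sketched as follows (Hald cement data redefined inline so the snippet runs on its own):

```r
# Hald cement data, repeated so this snippet runs standalone
cement <- data.frame(
  x1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
  x2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  x3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
  x4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12),
  y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
         93.1, 115.9, 83.8, 113.3, 109.4))

mod12 <- lm(y ~ x1 + x2, data = cement)

# Step 4: the only remaining candidate is x3
add1(mod12, scope = ~ x1 + x2 + x3, test = "F")
# the p-value for x3 exceeds alpha = 0.15, so the procedure stops

summary(mod12)   # final stepwise model: y ~ x1 + x2
```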
Information Criteria (Other Criteria to Choose
Predictors)
1 For regression models, the information criteria combine information about the SSE, the number
of parameters p in the model, and the sample size n.
2 Notice that the only difference between AIC and BIC is the multiplier of p, the number of
parameters.
3 The BIC places a higher penalty ( log(n) > 2 when n > 7) on the number of parameters in
the model, so it will tend to reward more parsimonious (smaller) models.
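In R, the built-in AIC() and BIC() functions compute these criteria for fitted lm objects. A sketch with the Hald cement data (redefined inline so the snippet runs on its own):

```r
# Hald cement data, repeated so this snippet runs standalone
cement <- data.frame(
  x1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
  x2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  x3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
  x4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12),
  y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
         93.1, 115.9, 83.8, 113.3, 109.4))

mod12  <- lm(y ~ x1 + x2, data = cement)
mod124 <- lm(y ~ x1 + x2 + x4, data = cement)

AIC(mod12); AIC(mod124)
BIC(mod12); BIC(mod124)
# for a fixed data set, BIC - AIC = (log(n) - 2) * k for each model, where k
# counts the estimated parameters (regression coefficients plus the error variance),
# so BIC penalizes larger models more heavily whenever n > 7
```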
Cautions About Stepwise Regression
1 The final model is not guaranteed to be optimal in any specified sense.
2 The procedure yields a single final model, although there are often several equally
good models.
3 Stepwise regression does not take into account a researcher’s knowledge about the
predictors. It may be necessary to force the procedure to include important
predictors.
4 One should not over-interpret the order in which predictors are entered into the
model.
Best Subsets Regression
1 The general idea behind best subsets regression is that we select the subset of
predictors that does the best at meeting some well-defined objective criterion, such as
having the largest R2 value or the smallest MSE.
Step 1.
First, identify all of the possible regression models derived from all of the possible combinations of
the candidate predictors. Unfortunately, this can be a huge number of possible models.
1. Suppose we have one (1) candidate predictor – x1 . Then, there are two (2) possible
regression models we can consider:
1 the one (1) model with no predictors.
2 the one (1) model with the predictor x1 .
2. Suppose we have two (2) candidate predictors – x1 and x2 . Then, there are four (4) possible
regression models we can consider:
1 the one (1) model with no predictors.
2 the two (2) models with only one predictor each – the model with x1 alone; the
model with x2 alone.
3 the one (1) model with both predictors – the model with x1 and x2 .
In general, if there are p − 1 candidate predictors, then there are 2^(p−1) possible regression
models containing them. For example, 10 predictors yield 2^10 = 1024 possible regression
models.
Step 2.
1. From the possible models identified in the first step, determine the “best” model of each
size – the best one-predictor model, the best two-predictor model, and so on – according to
some criterion such as the R2 value.
2. By doing this, we cut down considerably the number of possible regression models to
consider!
Step 3.
1. Further evaluate and refine the handful of models identified in the last step. This might
entail performing residual analyses, transforming the predictors and/or response, adding
interaction terms, and so on.
2. Do this until you are satisfied that you have found a model that fits the data well and
meets the goals of your research question.
The different criteria quantify different aspects of the regression model, and therefore often
yield different choices for the best set of predictors.
We should use best subsets regression as a screening tool to reduce the large number of
possible regression models to just a handful of models that we can evaluate further before
arriving at one final model.
R2 Value
The R2 value can only stay the same or increase as predictors are added, so the model with
all of the candidate predictors always has the largest R2 value. Therefore, it makes no sense
to define the “best” model as the model with the largest R2 value.
1 We can instead use the R2 -values to find the point where adding more predictors is
not worthwhile, because it yields a very small increase in the R2 -value.
2 In other words, we look at the size of the increase in R2 from one model size to the next,
not just the R2 value itself.
2 “summary.mod$rsq” shows the R2 for each “best” model of a given size. Going from the
“best” one-predictor model to the “best” two-predictor model, the R2 value jumps from 67.5
to 97.9 (percent), which is the largest increase.
3 Based on the R2 -value criterion, the “best” model is the model with the two predictors x1 and
x2 .
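The object summary.mod in the text is presumably the summary of a leaps::regsubsets() fit. With only four candidate predictors, the same “best R2 of each size” values can be reproduced in base R by brute force, as a sketch (Hald cement data redefined inline so the snippet runs on its own):

```r
# Hald cement data, repeated so this snippet runs standalone
cement <- data.frame(
  x1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
  x2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  x3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
  x4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12),
  y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
         93.1, 115.9, 83.8, 113.3, 109.4))

# all 15 non-empty subsets of the four candidate predictors
subsets <- unlist(lapply(1:4, function(k)
  combn(c("x1", "x2", "x3", "x4"), k, simplify = FALSE)), recursive = FALSE)
r2   <- sapply(subsets, function(s)
  summary(lm(reformulate(s, "y"), data = cement))$r.squared)
size <- lengths(subsets)

round(100 * tapply(r2, size, max), 1)   # best R2 (percent) for each model size
# the jump from the best one-predictor model (x4, about 67.5) to the best
# two-predictor model (x1 and x2, about 97.9) is by far the largest increase
```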
Adjusted R2 Value and MSE
The MSE quantifies how far away our predicted responses are from our observed responses.
Therefore, according to the MSE criterion, the best regression model is the one with the
smallest MSE.
3 The adjusted R2 value increases only if the MSE decreases. That is, the adjusted R2 and
MSE criteria always yield the same “best” models.
Different criteria can lead us to the same “best” model. Based on the largest adjusted R2 value
and the smallest MSE criteria, the “best” model is the model with the three predictors x1 , x2 , and
x4 .
2 “summary.mod$rss” shows the SSE for each “best” model; we need to compute
MSE = SSE/(n − p) by hand.
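A sketch of both criteria in base R, computing the adjusted R2 and MSE for every subset of the four candidates (Hald cement data redefined inline so the snippet runs on its own):

```r
# Hald cement data, repeated so this snippet runs standalone
cement <- data.frame(
  x1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
  x2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  x3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
  x4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12),
  y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
         93.1, 115.9, 83.8, 113.3, 109.4))

subsets <- unlist(lapply(1:4, function(k)
  combn(c("x1", "x2", "x3", "x4"), k, simplify = FALSE)), recursive = FALSE)
fits  <- lapply(subsets, function(s) lm(reformulate(s, "y"), data = cement))
adjr2 <- sapply(fits, function(m) summary(m)$adj.r.squared)
mse   <- sapply(fits, function(m) deviance(m) / df.residual(m))  # MSE = SSE/(n - p)

subsets[[which.max(adjr2)]]  # x1, x2, x4: largest adjusted R2
subsets[[which.min(mse)]]    # the same model: the two criteria always agree
```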
Mallows’ Cp-statistic
1 Mallows’ Cp -statistic estimates the size of the bias that is introduced into the
predicted responses by an underspecified model.
Bias and Variation in Predicted Responses
The true value is the center of the target.
1 The predicted responses on the first target may be considered reliable, precise, or as
having negligible random error, but all of the responses miss the true value by a wide
margin (biased estimates).
2 The target on the right has more random error (large variance) in the predicted responses;
however, the results are valid, lacking systematic error (unbiased).
3 The middle target depicts our goal: predictions that are both reliable (small variance) and
valid (unbiased).
A Measure of the Total Variation in the Predicted
Responses Γp
1 To quantify the total variation in the predicted responses, we just sum the two components
σ²(Ŷi) (random sampling variation) and Bi² (prediction bias) over all n data points to obtain a
standardized measure of the total variation in the predicted responses Γp :

Γp = (1/σ²) { Σ_{i=1}^{n} σ²(Ŷi) + Σ_{i=1}^{n} Bi² }

2 If all the Bi are 0, then Γp achieves its smallest possible value, namely p, the number of
parameters, using the fact that Σ_{i=1}^{n} σ²(Ŷi) = pσ² :

Γp = (1/σ²) { Σ_{i=1}^{n} σ²(Ŷi) + 0 } = p

The best model is simply the model with the smallest value of Γp .
However, we don’t know the population quantities σ²(Ŷi) and σ², and hence we don’t know Γp .
We need a statistic to estimate Γp .
Mallows’ Cp as An Estimate of Γp
1 We estimate σ² by MSEall , the mean squared error obtained from fitting the model
containing all of the candidate predictors.
2 Estimating σ² by MSEall assumes that there are no biases in the full model with all of the
predictors, an assumption that may or may not be valid.
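Putting these two estimates together gives the standard formula for Mallows’ Cp for a subset model with p parameters (including the intercept), where SSEp is the error sum of squares of that subset model:

```latex
C_p = \frac{SSE_p}{MSE_{all}} - (n - 2p)
```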
1 Subset models with small Cp values have a small total (standardized) variance of prediction.
2 For the largest model containing all of the candidate predictors, Cp = p (always). Therefore,
you shouldn’t use Cp to evaluate the full model.
Using the Cp Criterion to Identify “Best” Models
1 Identify subsets of predictors for which the Cp value is near p (if possible).
2 The full model always yields Cp = p, so don’t select the full model based on Cp .
3 If all models, except the full model, yield a large Cp not near p, it suggests some important
predictor(s) are missing from the analysis.
In this case, we are well-advised to identify the predictors that are missing!
4 If a number of models have Cp near p, choose the model with the smallest Cp value, thereby
ensuring that the combination of the bias and the variance is at a minimum.
5 When more than one model has a small Cp value near p, in general, choose the
simpler model or the model that meets your research needs.
According to the Cp criterion, the model with x1 and x2 , and the model with x1 , x2 ,
and x4 are both valid models.
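This conclusion can be checked directly with the Cp formula above, as a sketch in base R (Hald cement data redefined inline so the snippet runs on its own):

```r
# Hald cement data, repeated so this snippet runs standalone
cement <- data.frame(
  x1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
  x2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  x3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
  x4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12),
  y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5,
         93.1, 115.9, 83.8, 113.3, 109.4))

n <- nrow(cement)
# MSE_all from the full model with all four candidate predictors (p = 5 parameters)
mse_all <- deviance(lm(y ~ x1 + x2 + x3 + x4, data = cement)) / (n - 5)

# Cp = SSE_p / MSE_all - (n - 2p)
cp <- function(form, p) deviance(lm(form, data = cement)) / mse_all - (n - 2 * p)

cp(y ~ x1 + x2, p = 3)            # about 2.7, close to p = 3
cp(y ~ x1 + x2 + x4, p = 4)       # about 3.0, close to p = 4
cp(y ~ x1 + x2 + x3 + x4, p = 5)  # exactly 5: the full model always gives Cp = p
```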
2 R examples