
Statistical Methods - 2

Notes 12
Further...

1 Introduction to model building and variable selection

2 R examples
What if the Regression Equation Contains “Wrong”
Predictors?

Before we learn about variable selection methods, we first need to understand the
consequences of a regression equation containing the “wrong” or “inappropriate”
variables.
There are four possible outcomes when formulating a regression model for a set of data:

1 The regression model is “correctly specified.”

2 The regression model is “underspecified.”

3 The regression model contains one or more “extraneous variables.”

4 The regression model is “over-specified.”


The Four Possible Outcomes

1 A regression model is correctly specified (outcome 1) if the regression equation contains all of the relevant predictors, including any necessary transformations and interaction terms.
1. That is, there are no missing, redundant or extraneous predictors in the model. This is
the best possible outcome.
2. A correctly specified regression model yields unbiased regression coefficients and
unbiased predictions of the response.

2 A regression model is underspecified (outcome 2) if the regression equation is missing one or more important predictor variables.
1. This situation is perhaps the worst outcome, because an underspecified model yields
biased regression coefficients and biased predictions of the response.
2. That is, in using the model, we would consistently underestimate or overestimate the
population slopes and the population means.
A Detour: Bias and Variation in Predicted Responses
The true value is the center of the target.

1 The predicted responses on the first target may be considered reliable and precise, having negligible random error, but they all miss the true value by a wide margin (biased estimates).

2 The target on the right has more random error (large variance) in the predicted responses; however, the results are valid, lacking systematic error (unbiased).

3 The middle target depicts our goal: observations that are both reliable (small variance) and
valid (unbiased).
1 A regression model contains one or more extraneous variables (outcome 3).
1. Extraneous variables are neither related to the response nor to any of the other
predictors.
2. The good news is that such a model does yield unbiased regression coefficients and
an unbiased MSE.
3. The bad news is that MSE has fewer degrees of freedom associated with it. When this
happens, our confidence intervals tend to be wider and our hypothesis tests tend to
have lower power.

2 A regression model is over-specified (outcome 4) if the regression equation contains
one or more redundant predictor variables.
1. That is, part of the model is correct, but we have predictors that are redundant.
2. Regression models that are over-specified yield unbiased regression coefficients and an unbiased MSE.
3. Also, as with including extraneous variables, we've made our model more complicated and harder to understand than necessary.
Strategy

1 Know your goal, know your research question. Knowing how you plan to use
your regression model can assist greatly in the model building stage.

2 Identify all of the possible candidate predictors.


1. Don't worry about interactions or the appropriate functional form – such as x² and log x – just yet.
2. Just make sure you identify all the possible important predictors.

3 Use variable selection procedures to find the middle ground between an underspecified model and a model with extraneous or redundant variables.
Two possible variable selection procedures are stepwise regression and best
subsets regression.

4 Fine-tune the model to get a correctly specified model.


Iterate back and forth between formulating different regression models and checking
the behavior of the residuals until you are satisfied with the model.
Stepwise Regression

Example: let’s learn how the stepwise regression procedure works by considering a data
set that concerns the hardening of cement.
Researchers were interested in learning how the composition of the cement affected the
heat evolved during the hardening of the cement. They measured and recorded the
following data on 13 batches of cement:

1 Predictor x1 : % of tricalcium aluminate

2 Predictor x2 : % of tricalcium silicate

3 Predictor x3 : % of tetracalcium alumino ferrite

4 Predictor x4 : % of dicalcium silicate

5 Response Y: heat evolved in calories during hardening of cement, on a per-gram basis
As usual, we can first take a look at the scatterplot matrix.

[Scatterplot matrix of the response y and the predictors x1, x2, x3, and x4]

Note: The number of predictors in this data set is not large. The stepwise procedure is typically
used on much larger data sets, for which it is not feasible to attempt to fit all of the possible
regression models.
Overview of Stepwise Regression Using F-Tests to Choose Variables
1 First, we start with no predictors in our “stepwise model.”

2 Then, at each step along the way, we either enter or remove a predictor based on the general (partial) F-tests – that is, the t-tests for the slope parameters – that are obtained.

3 We stop when no more predictors can be justifiably entered or removed from our stepwise
model, thereby leading us to a “final model.”
Now let’s start the procedure:
Step 1.

1 Fit each of the one-predictor models – that is, regress Y on x1, regress Y on x2, ..., and regress Y on x_{p−1}.

2 Of those predictors whose t-test p-value is less than some α level, say, 0.15, the first
predictor put in the stepwise model is the predictor that has the smallest t-test p-value.

3 If no predictor has a t-test p-value less than α, stop.


Note that in a stepwise regression procedure, α can be larger than 0.05, in order to make it easier
to enter or remove a predictor.
Step 2.

1 Suppose x1 had the smallest t-test p-value below α and therefore was deemed the "best" single predictor arising from the first step.

2 Now, fit each of the two-predictor models that include x1 as a predictor – that is, regress Y on x1 and x2, regress Y on x1 and x3, ..., and regress Y on x1 and x_{p−1}.

3 Of those predictors whose t-test p-value is less than α, the second predictor put in the
stepwise model is the predictor that has the smallest t-test p-value.

4 If no predictor has a t-test p-value less than α, stop. The model with the one predictor
obtained from the first step is your final model.

5 But, suppose instead that x2 was deemed the “best” second predictor and it is therefore
entered into the stepwise model.

6 Now, since x1 was the first predictor in the model, step back and see if entering x2 into the
stepwise model somehow affected the significance of the x1 predictor.
That is, check the t-test p-value for testing β1 = 0. If this p-value is no longer significant, remove x1 from the stepwise model.
Step 3.

1 Suppose both x1 and x2 made it into the two-predictor stepwise model and remained
there.

2 Now, fit each of the three-predictor models that include x1 and x2 as predictors – that is, regress Y on x1, x2, and x3; regress Y on x1, x2, and x4; ...; and regress Y on x1, x2, and x_{p−1}.

3 Of those predictors whose t-test p-value is less than α, the third predictor put in the
stepwise model is the predictor that has the smallest t-test p-value.

4 If no predictor has a t-test p-value less than α, stop. The model containing the two predictors
obtained from the second step is your final model.

5 But, suppose instead that x3 was deemed the “best” third predictor and it is therefore
entered into the stepwise model.

6 Now, since x1 and x2 were the first predictors in the model, step back and see if entering x3
into the stepwise model somehow affected the significance of the x1 and x2 predictors.
If the t-test p-value for either β1 = 0 or β2 = 0 is no longer significant, then remove that predictor from the stepwise model.

Stopping the procedure. Continue the steps as described above until adding an additional
predictor does not yield a t-test p-value below α.
Back to the Cement Example

Step 1. choose the first predictor


Using lm() four times to regress Y on each of the four predictors, we obtain:
The t-statistic for x4 is largest in absolute value and therefore the p-value for x4 is the smallest. As
a result of the first step, we enter x4 into our stepwise model.
In R, to conduct the first step, we can also use the add1() function, which is equivalent to but
easier than calling lm() four times.

Note that in add1(), the general linear (partial) F-test is used. It can be checked that the p-value for the F-test for each predictor is the same as the p-value for the t-test for each predictor obtained by lm().
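In R, a minimal sketch of this first step might look as follows (assuming the cement data are stored in a data frame called cement with columns y, x1, x2, x3, and x4; these object and column names are assumptions, not shown in the notes):

base.mod <- lm(y ~ 1, data = cement)                     # start with no predictors
add1(base.mod, scope = ~ x1 + x2 + x3 + x4, test = "F")  # partial F-test for each candidate
# x4 has the smallest p-value, so it is the first predictor to enter the stepwise model.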
Step 2. choose the second predictor when x4 is in the model.
We regress Y on x4 and x1, regress Y on x4 and x2, and regress Y on x4 and x3. Instead of calling lm() three times, we use the add1() function:

1 The F-statistic (p-value) for x1 is the largest (smallest). As a result of the second step, we enter x1 into our stepwise model.
2 The update() function adds a predictor to, or removes one from, a linear model. In this example, it means that the model is updated by including x4. It is the same as regressing Y on x4 with lm().
Before we proceed to the next step, we need to check whether or not adding x1 to the model affects the significance of x4:

The p-value for x4 shows that adding x1 to the model does not affect the significance of x4.
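A sketch of this second step in R, continuing with the assumed cement data frame and the base.mod object from the previous sketch:

mod.1 <- update(base.mod, . ~ . + x4)                    # stepwise model after Step 1: y ~ x4
add1(mod.1, scope = ~ x1 + x2 + x3 + x4, test = "F")     # x1 has the smallest p-value
mod.2 <- update(mod.1, . ~ . + x1)                       # enter x1: y ~ x4 + x1
summary(mod.2)   # step back: the t-test p-value for x4 is still significant, so x4 stays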
Step 3. choose the third predictor when x4 and x1 are in the model.
We regress Y on x4 , x1 , and x2 ; and we regress Y on x4 , x1 , and x3 , obtaining:

The predictor x2 has the smallest F-test p-value. Therefore, as a result of the third step, we enter x2 into our stepwise model.
Recall that in a stepwise regression procedure, we don't need to set the α level as strictly as 0.05; it could be larger.
Now, since x1 and x4 were the first two predictors in the model, we must step back and see if
entering x2 into the stepwise model affected the significance of the x1 and x4 predictors.

If we choose α = 0.15, then the p-value for x4 is larger than α. Therefore, we remove the predictor
x4 from the stepwise model, leaving us with the predictors x1 and x2 in our stepwise model.
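A sketch of the third step in R, continuing with the assumed objects from above:

add1(mod.2, scope = ~ x1 + x2 + x3 + x4, test = "F")     # x2 has the smallest p-value
mod.3 <- update(mod.2, . ~ . + x2)                       # enter x2: y ~ x4 + x1 + x2
summary(mod.3)   # step back: the p-value for x4 now exceeds alpha = 0.15
mod.4 <- update(mod.3, . ~ . - x4)                       # remove x4, leaving y ~ x1 + x2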
Step 4. choose the third predictor when x1 and x2 are in the model.
We regress Y on x1, x2, and x3. Note that we don't need to regress Y on x1, x2, and x4, since we just deleted x4 from the model containing x1, x2, and x4.
Stop the stepwise regression procedure
According to its p-value, x3 is not eligible for entry into our stepwise model. That is, we stop our
stepwise regression procedure. Our final regression model, based on the stepwise procedure using F-tests to choose predictors, contains only the predictors x1 and x2.
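The stopping step can be checked with one more call (a sketch, continuing the assumed objects):

add1(mod.4, scope = ~ x1 + x2 + x3, test = "F")   # x4 was just removed, so only x3 is considered
# x3's p-value is above alpha, so the procedure stops; the final model is y ~ x1 + x2.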
Information Criteria (Other Criteria to Choose
Predictors)

1. Akaike’s Information Criterion (AIC)

AIC = n log(SSE) − n log(n) + 2p

2. Bayesian Information Criterion (BIC)

BIC = n log(SSE) − n log(n) + p log(n)

1 Notice that the only difference between AIC and BIC is the multiplier of p, the number of
parameters.

2 The BIC places a higher penalty (log(n) > 2 when n > 7) on the number of parameters in the model, so it will tend to reward more parsimonious (smaller) models.

3 For regression models, the information criteria combine information about the SSE, number
of parameters p in the model, and the sample size n.

4 A small value, compared to values for other possible models, is good.
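To make the definitions concrete, here is a small sketch of computing AIC and BIC by hand for one candidate model, assuming the cement data frame used earlier:

fit <- lm(y ~ x1 + x2, data = cement)
n <- nrow(cement)
p <- length(coef(fit))                        # number of parameters, including the intercept
sse <- sum(resid(fit)^2)
n * log(sse) - n * log(n) + 2 * p             # AIC as defined above
n * log(sse) - n * log(n) + p * log(n)        # BIC as defined above
extractAIC(fit)               # for lm fits, extractAIC() uses the same AIC expression
extractAIC(fit, k = log(n))   # a per-parameter penalty of log(n) gives the BIC version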


We can use the step() function in R to conduct the stepwise regression procedure, instead of calling add1() or lm() multiple times. In this function, the default criterion for choosing predictors is AIC. For this cement example, we can get the following:
Note that if we use AIC to choose predictors, the stepwise regression procedure ends up with a model containing x1, x2, and x4, which is different from the model we obtain when we use F-tests to choose predictors.
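A sketch of the automated search with step(), assuming the same cement data frame:

step.mod <- step(lm(y ~ 1, data = cement),
                 scope = y ~ x1 + x2 + x3 + x4,
                 direction = "both")          # AIC is the default criterion
summary(step.mod)                             # with AIC, the search retains x1, x2, and x4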
Some Cautions about Stepwise Regression
Procedure

1 The final model is not guaranteed to be optimal in any specified sense.

2 The procedure yields a single final model, although there are often several equally
good models.

3 Stepwise regression does not take into account a researcher’s knowledge about the
predictors. It may be necessary to force the procedure to include important
predictors.

4 One should not over-interpret the order in which predictors are entered into the
model.
Best Subsets Regression

1 The general idea behind best subsets regression is that we select the subset of predictors that does the best job of meeting some well-defined objective criterion, such as having the largest R2 value or the smallest MSE.

2 In order to avoid an underspecified model, which yields biased estimators, a fundamental rule of the best subsets regression procedure is that the list of candidate predictor variables must include all of the variables that actually predict the response.
The Procedure of Best Subsets Regression

Step 1.
First, identify all of the possible regression models derived from all of the possible combinations of
the candidate predictors. Unfortunately, this can be a huge number of possible models.

1. Suppose we have one (1) candidate predictor – x1 . Then, there are two (2) possible
regression models we can consider:
1 the one (1) model with no predictors.
2 the one (1) model with the predictor x1 .

2. Suppose we have two (2) candidate predictors – x1 and x2. Then, there are four (4) possible regression models we can consider:
1 the one (1) model with no predictors.
2 the two (2) models with only one predictor each – the model with x1 alone; the model with x2 alone.
3 the one (1) model with both predictors – the model with x1 and x2.
In general, if there are p − 1 possible candidate predictors, then there are 2^{p−1} possible regression models containing the predictors. For example, 10 predictors yield 2^{10} = 1024 possible regression models.
Step 2.

1. From the possible models identified in the first step, determine


1 the one-predictor models that do the “best” at meeting some well-defined criteria

2 the two-predictor models that do the “best”

3 the three-predictor models that do the “best,” and so on.

2. Doing this cuts down considerably the number of possible regression models to consider!

Step 3.

1. Further evaluate and refine the handful of models identified in the last step. This might
entail performing residual analyses, transforming the predictors and/or response, adding
interaction terms, and so on.

2. Do this until you are satisfied that you have found a model that

1 meets the model conditions,

2 does a good job of summarizing the trend in the data,

3 and most importantly allows you to answer your research question.


What do You Think “Best” Means?

Some possible criteria:

1 the model with the largest R2

2 the model with the largest adjusted R2

3 the model with the smallest MSE (or square root of MSE)

4 Mallows' Cp-statistic. (We will learn it soon.)

The different criteria quantify different aspects of the regression model, and therefore often
yield different choices for the best set of predictors.

We should use best subsets regression as a screening tool to reduce the large number of
possible regression models to just a handful of models that we can evaluate further before
arriving at one final model.
R2 Value

The R2 value is defined as

R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}

and can only increase as more variables are added.

Therefore, it makes no sense to define the "best" model as the model with the largest R2-value.

1 We can instead use the R2-values to find the point where adding more predictors is not worthwhile, because it yields a very small increase in the R2-value.

2 In other words, we look at the size of the increase in R2, not just the magnitude of R2 itself.

3 The R2-value criterion is used most often in combination with the other criteria.


R Package leaps to Conduct Best Subsets Regression
Using the regsubsets() function in the leaps package, we can find the "best" model for each possible number of predictors. For the cement example:
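A minimal sketch in R (assuming the same cement data frame; the object name summary.mod matches the references below):

library(leaps)
mod <- regsubsets(y ~ x1 + x2 + x3 + x4, data = cement, nbest = 1)
summary.mod <- summary(mod)
summary.mod$which   # which predictors the "best" model of each size contains
summary.mod$rsq     # R-squared of each "best" model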

1 "summary.mod$which" shows which predictors are included in each "best" model. For example, the first row tells us that, among all possible models with only one predictor, the model with x4 is the "best".

2 "summary.mod$rsq" shows the R2 for each "best" model. It shows that, in going from the "best" one-predictor model to the "best" two-predictor model, the R2 value jumps from 67.5% to 97.9%, which is the largest increase.

3 Based on the R2-value criterion, the "best" model is the model with the two predictors x1 and x2.
Adjusted R2 Value and MSE

1 The adjusted R2 value, which is defined as

R_a^2 = 1 - \frac{n-1}{n-p}\,(1 - R^2) = 1 - \frac{n-1}{n-p}\cdot\frac{SSE}{SSTO} = 1 - \frac{n-1}{SSTO}\,MSE

makes us pay a penalty for adding more predictors to the model.


Therefore, we can just use the adjusted R2 value to choose the best model, which is the one
with the largest adjusted R2 value.

2 The mean squared error (MSE) is defined as

MSE = \frac{SSE}{n-p} = \frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-p}

and quantifies how far away our predicted responses are from our observed responses.
Therefore, according to the MSE criterion, the best regression model is the one with the smallest MSE.

3 The adjusted R2 value increases only if MSE decreases. That is, the adjusted R2 value and MSE criteria always yield the same "best" models.
These two criteria indeed lead us to the same "best" model here: based on the largest adjusted R2 value and the smallest MSE criteria, the "best" model is the model with the three predictors x1, x2, and x4.

1 “summary.mod$adjr2” shows the adjusted R2 for each “best” model.

2 "summary.mod$rss" shows the SSE for each "best" model; we need to compute MSE by hand as SSE/(n − p), as sketched below.
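A sketch of that computation, continuing with the summary.mod object and assumed cement data frame:

summary.mod$adjr2                          # adjusted R-squared for each "best" model
n <- nrow(cement)
p <- 2:5                                   # parameters: intercept plus 1 to 4 predictors
mse <- summary.mod$rss / (n - p)           # MSE = SSE / (n - p) for each "best" model
which.max(summary.mod$adjr2)               # both criteria point to the same model,
which.min(mse)                             # the three-predictor model with x1, x2, and x4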
Mallows’ Cp-statistic

General idea of the Mallows' Cp-statistic:

1 Recall that an underspecified model is a model in which important predictors are missing.

2 And, an underspecified model yields biased regression coefficients and biased predictions of the response.

3 Mallows' Cp-statistic estimates the size of the bias that is introduced into the predicted responses by having an underspecified model.
Bias and Variation in Predicted Responses
The true value is the center of the target.

1 The predicted responses on the first target may be considered reliable and precise, having negligible random error, but they all miss the true value by a wide margin (biased estimates).

2 The target on the right has more random error (large variance) in the predicted responses; however, the results are valid, lacking systematic error (unbiased).

3 The middle target depicts our goal: observations that are both reliable (small variance) and
valid (unbiased).
A Measure of the Total Variation in the Predicted
Responses Γp

1 To quantify the total variation in the predicted responses, we just sum the two components σ²_Ŷi (random sampling variation) and B_i² (prediction bias) over all n data points, obtaining a standardized measure of the total variation in the predicted responses, Γp:

\Gamma_p = \frac{1}{\sigma^2} \left\{ \sum_{i=1}^{n} \sigma^2_{\hat{Y}_i} + \sum_{i=1}^{n} B_i^2 \right\}

where B_i = E(\hat{Y}_i) - E(Y_i).

2 If all B_i are 0, then Γp achieves its smallest possible value, namely p, the number of parameters:

\Gamma_p = \frac{1}{\sigma^2} \left\{ \sum_{i=1}^{n} \sigma^2_{\hat{Y}_i} + 0 \right\} = p

The best model is simply the model with the smallest value of Γp.
However, we don't know the population quantities σ²_Ŷi and σ², and hence Γp. We need a statistic to estimate Γp.
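A short justification, not spelled out in the notes, of why the variance term sums to pσ²: it follows from the hat-matrix identity tr(H) = p under the usual constant-variance assumptions.

\hat{Y} = HY, \qquad H = X(X^{\top}X)^{-1}X^{\top}, \qquad
\sigma^{2}_{\hat{Y}_i} = \operatorname{Var}(\hat{Y}_i) = \sigma^{2} h_{ii},

\sum_{i=1}^{n} \sigma^{2}_{\hat{Y}_i}
 = \sigma^{2}\sum_{i=1}^{n} h_{ii}
 = \sigma^{2}\operatorname{tr}(H)
 = \sigma^{2} p
 \quad\Longrightarrow\quad
 \frac{1}{\sigma^{2}}\sum_{i=1}^{n}\sigma^{2}_{\hat{Y}_i} = p .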
Mallows' Cp as an Estimate of Γp

Mallows' Cp-statistic is an estimate of Γp. It is defined as

C_p = \frac{SSE_p}{MSE_{all}} - (n - 2p) = p + \frac{(MSE_p - MSE_{all})(n - p)}{MSE_{all}}

1 We estimate σ² by MSE_all, the mean squared error obtained from fitting the model containing all of the candidate predictors.

2 Estimating σ² by MSE_all assumes that there are no biases in the full model with all of the predictors, an assumption that may or may not be valid.

3 Cp = p for the full model because in that case MSE_p − MSE_all = 0.


Recalling that p denotes the number of parameters in the model:

1 Subset models with small Cp values have a small total (standardized) variance of prediction.

2 When the Cp value is

1 near p, the bias is small (next to none)

2 much greater than p, the bias is substantial

3 below p, it is due to sampling error; interpret as no bias

3 For the largest model containing all of the candidate predictors, Cp = p (always). Therefore, you shouldn't use Cp to evaluate the full model.
Using the Cp Criterion to Identify “Best” Models

Here’s a reasonable strategy for using Cp to identify “best” models:

1 Identify subsets of predictors for which the Cp value is near p (if possible).

2 The full model always yields Cp = p, so don’t select the full model based on Cp .

3 If all models, except the full model, yield a large Cp not near p, it suggests some important
predictor(s) are missing from the analysis.
In this case, we are well-advised to identify the predictors that are missing!

4 If a number of models have Cp near p, choose the model with the smallest Cp value, thereby ensuring that the combination of the bias and the variance is at a minimum.

5 When more than one model has a small Cp value near p, in general, choose the simpler model or the model that best meets your research needs.
According to the Cp criterion, the model with x1 and x2 , and the model with x1 , x2 ,
and x4 are both valid models.

"summary.mod$cp" shows the Cp criterion value for each "best" model, as sketched below.
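A sketch of reading off and hand-checking the Cp values, continuing the assumed objects from the best subsets example:

summary.mod$cp                              # Cp for the "best" model of each size
n <- nrow(cement)
mse.all <- summary.mod$rss[4] / (n - 5)     # full model: 4 predictors, so p = 5
p <- 2:5
summary.mod$rss / mse.all - (n - 2 * p)     # should match summary.mod$cp by the definition above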


Topics in Today’s Class

1 Introduction to model building and variable selection

2 R examples
