Regression Cookbook
Given a response variable $y$ and a set of regressors $(x_1, \ldots, x_k)$, the multiple regression model is used to model a linear relationship between the response and the explanatory variables (also known as regressors). Hence, the model has the following appearance:

$$ y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} + \varepsilon_i, \qquad i = 1, \ldots, n. $$
To enable valid inference the errors must fulfill the following design criteria:

D1. Zero mean: $E(\varepsilon_i) = 0$ for all $i$.

D2. Homoskedasticity: $\operatorname{var}(\varepsilon_i) = \sigma^2$ for all $i$.

D3. Mutually uncorrelated: $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for all $i \neq j$.

D4. Uncorrelated with the regressors: $\operatorname{cor}(\varepsilon_i, x_{ji}) = 0$ for $\varepsilon_i$ and each of $x_{1i}, \ldots, x_{ki}$.

D5. Normality: $\varepsilon_i$ i.i.d. $N(0, \sigma^2)$ for all $i$.

Moreover, the regressors cannot be perfectly collinear and the linear functional relationship should be a good specification.
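To make the setup concrete, here is a minimal sketch in Python (using numpy and statsmodels; the dataset, sample size and coefficients are invented purely for illustration) that simulates data satisfying D1-D5 and estimates the model by OLS:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    n = 200

    # Two hypothetical regressors, not perfectly collinear
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)

    # Errors satisfying D1-D5: i.i.d. N(0, sigma^2), independent of the regressors
    eps = rng.normal(scale=2.0, size=n)

    # Arbitrary "true" coefficients: beta_0 = 1, beta_1 = 0.5, beta_2 = -1.2
    y = 1.0 + 0.5 * x1 - 1.2 * x2 + eps

    # OLS with a constant term included
    X = sm.add_constant(np.column_stack([x1, x2]))
    model = sm.OLS(y, X).fit()
    print(model.summary())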
Regression Step-by-step
Our approach in this course is quite superficial; basically we are making sure that we do not go completely wrong. There is much more that could be done, but we put that aside for another course. Also, there are many situations that will require judgement calls, and this ability can only be learned through experience. In general a sensible regression analysis should at least comprise the following elements:
1. Problem statement: What is the research question?
2. What are the relevant variables for addressing the research question? Present the dependent variable and the regressors. What are the measurement scales? Which are the regressors of interest and which are control variables? Get some preliminary idea of the functional relationship by plotting the dependent variable against the regressors. Use scatterplots for plotting $y_i$ against interval/ratio- and ordinal-scaled regressors and box-type plots for nominal variables.
3. Based on your research theory and the preliminary analysis, set up an initial regression equation. This equation should be quite general/comprehensive, as the idea is to partly let the data decide which variables and functional form are appropriate.
4. Estimate the model using OLS. Before you even consider the significance of the individual regressors you must investigate whether the design criteria are satisfied. We now consider our approach to verifying the design criteria one by one.
D1 $E(\varepsilon_i) = 0$: Always satisfied with a constant term in the model. Intuitively, the constant term equals the fixed portion of the dependent variable that cannot be explained by the independent variables, whereas the error term equals the stochastic portion of the unexplained value of the response.
D2 $\operatorname{var}(\varepsilon_i) = \sigma^2$: One should make scatterplots of the residuals against the regressors. This can be quite lengthy, so a shortcut is just to plot the residuals or the squared residuals against the predicted values, $\hat{y}_i$ (see the diagnostics sketch after this list).

D4 $\operatorname{cor}(\varepsilon_i, x_{ji}) = 0$: Again one should make scatterplots of the residuals against the regressors or the predicted values, $\hat{y}_i$.
D5 $\varepsilon_i$ i.i.d. $N(0, \sigma^2)$: If we have more than 100 observations, then we rarely care. The CLT ensures that in large samples the coefficient estimates are approximately normally distributed, and this holds for almost any choice of distribution for $\varepsilon_i$. Still, it is very simple to make a histogram, P-P plots, etc. that compare the residual distribution to the normal distribution (also shown in the diagnostics sketch after this list). In small samples a so-called bootstrap approach will deliver valid and reliable inference, but this is also beyond the scope of this course. So in small samples where the normal assumption fails to apply, we simply state this and note that the conclusions are to be taken lightly.
5. Now, with a model that approximately satisfies the design criteria, we can progress to simplify the model. Our approach is to remove insignificant variables one by one. In general we take out the variable with the largest P-value until only significant variables are left in the model (see the simplification sketch after this list). Warning: this is in many cases not as simple as it sounds, mainly because of multicollinearity. In models with polynomial terms and/or interactions, looking at t-stats can be misleading. If removing a (by t-stats) seemingly insignificant regressor reduces the adjusted $R^2$, then we should reconsider, especially if the variables have a large VIF (above 5).
Say two variables with high VIFs seem insignificant by t-stats, and removing one makes the other significant and vice versa. This describes a situation where the variables are highly collinear and both variables are good predictors. In this case keeping either one, or a combination, would make sense. Just remember to account for this in the interpretation.
For example, in the wage case, if we regress wage on the level of experience and age, we might find what we just described. Then removing age does not mean that we should conclude that wage is not correlated with age, but that both wage and experience tend to increase with a person's age.
In essence we should check the design criteria whenever we change the model, that is, after every single reduction. It is, however, rarely the case that the removal of an insignificant variable will change the fit in terms of D1-D5. So it is only in borderline cases that we reconsider whether D1-D5 still hold.
6. For your final model you should give interpretations and implications of the coefficient estimates.
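As referenced in step 4, here is a minimal sketch of the residual diagnostics in Python, assuming a fitted statsmodels result named `model` (for instance the one from the simulation sketch above):

    import matplotlib.pyplot as plt
    import scipy.stats as stats

    resid = model.resid          # estimated errors
    fitted = model.fittedvalues  # predicted values y-hat

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))

    # D2/D4: residuals against predicted values; fanning suggests
    # heteroskedasticity, curvature suggests a misspecified functional form
    axes[0].scatter(fitted, resid, s=10)
    axes[0].axhline(0, color="red")
    axes[0].set(xlabel="predicted values", ylabel="residuals")

    # D5: residual histogram to eyeball against the normal distribution
    axes[1].hist(resid, bins=30, density=True)
    axes[1].set(xlabel="residuals")

    # D5: probability plot of the residuals against the normal distribution
    stats.probplot(resid, dist="norm", plot=axes[2])

    plt.tight_layout()
    plt.show()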
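And for the reduction in step 5, a sketch of the one-by-one removal of insignificant variables. The helper `backward_eliminate` is hypothetical (not a library function), and an automatic loop like this does not replace the judgement calls about adjusted $R^2$ and VIFs discussed above; its output should be inspected, not trusted blindly:

    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def backward_eliminate(y, X, alpha=0.05):
        """Repeatedly drop the regressor with the largest p-value above alpha.
        X is a DataFrame that already contains a constant column named 'const'."""
        X = X.copy()
        while True:
            fit = sm.OLS(y, X).fit()
            pvals = fit.pvalues.drop("const")  # never drop the constant term
            worst = pvals.idxmax()
            if pvals[worst] <= alpha:
                return fit
            # A large VIF warns that the t-test on this variable may mislead
            vif = variance_inflation_factor(X.values, X.columns.get_loc(worst))
            print(f"dropping {worst}: p = {pvals[worst]:.3f}, VIF = {vif:.1f}")
            X = X.drop(columns=worst)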
Example: House prices
As an example we consider an equation that describes the determination of house prices. Let's say we have access to some observations on the following variables:
price = house price
lotsize = size of the lot, in square feet
sqrft = square footage
bdrms = number of bedrooms
colonial = 1 if home is colonial style
You can see a snapshot of the data below:
It is somewhat backward to start from some variables that you have, so now we pretend that we
start from the top of our step-by-step list.
Step 1: Problem statement: What determines the traded price of a residential house?
Step 2: There are of course many more relevant variables than the five we have access to, but we will have to make do with these. The dependent variable is price, and the remaining four variables are the regressors. The variables price, lotsize, sqrft and bdrms are all ratio scaled (even though bdrms is discrete). colonial is a dummy variable, so it is nominal. In this example all the regressors are actually variables of interest. Below are plots of the dependent variable against the regressors. We do the plots both in levels and, for the cases where it makes sense, in logarithmic versions.
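A sketch of how such plots could be produced in Python, assuming the data snapshot is loaded in a pandas DataFrame named `df` (a name we invent here) with the five variables above:

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # Ratio-scaled regressors: scatterplots of price against each
    for ax, var in zip(axes.flat, ["lotsize", "sqrft", "bdrms"]):
        ax.scatter(df[var], df["price"], s=10)
        ax.set(xlabel=var, ylabel="price")

    # Nominal regressor: box-type plot of price by colonial (0/1)
    df.boxplot(column="price", by="colonial", ax=axes.flat[3])

    plt.tight_layout()
    plt.show()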
Step 3: Based on the plots and a wish to work with elasticities whenever possible, we could start with the following regression equation:

$$ \text{lprice} = \beta_0 + \beta_1\,\text{llotsize} + \beta_2\,\text{lsqrft} + \beta_3\,\text{bdrms} + \beta_4\,\text{colonial} + \varepsilon $$
It would actually make sense to expand this model with a quadratic term in at least bdrms. In theory, and from the preliminary plot, we should have the feeling that there are decreasing returns to the number of bedrooms one puts into a house,
$$ \text{lprice} = \beta_0 + \beta_1\,\text{llotsize} + \beta_2\,\text{lsqrft} + \beta_3\,\text{bdrms} + \beta_4\,\text{bdrms}^2 + \beta_5\,\text{colonial} + \varepsilon $$
Step 4: To estimate the model using OLS, get the output, and save the variables we want, we make the following selections:
The independent(s) comprise: llotsize, lsqrft, bdrms, bdrms2 and colonial.
This produces the following coefficients table, and much more, some of which we use below.
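The text above uses menu selections in the statistics package; for readers working in code, here is a sketch of the same estimation in Python with statsmodels, again assuming the hypothetical DataFrame `df` from above:

    import numpy as np
    import statsmodels.formula.api as smf

    # Construct the logged variables used in the Step 3 equation
    df["lprice"] = np.log(df["price"])
    df["llotsize"] = np.log(df["lotsize"])
    df["lsqrft"] = np.log(df["sqrft"])

    # I(bdrms**2) adds the quadratic bedroom term inside the formula
    fit = smf.ols("lprice ~ llotsize + lsqrft + bdrms + I(bdrms**2) + colonial",
                  data=df).fit()
    print(fit.summary())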
Before you even consider the significance of the individual regressors you must investigate whether the design criteria are satisfied. We now consider the design criteria that need attention.

$\operatorname{var}(\varepsilon_i) = \sigma^2$: First we plot the residuals or the squared residuals against the predicted values, $\hat{y}_i$.
If we ignore the two outlying observations (marked with red in both plots) then there is no sign of heteroskedasticity. Still, these outliers could affect the results, both in terms of the estimated coefficients and their precision. To further check for heteroskedasticity we compute the robust standard errors below.
The following output emerges:

In this table we are only interested in the standard errors (the coefficient estimates should be identical to those we got above; if they are not, there is a bug somewhere). Below we present them next to the ones from the OLS output:
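A sketch of this comparison in Python, reusing the `df` and `fit` from the estimation sketch above. HC1 is one common heteroskedasticity-robust choice; the text does not say which robust estimator its software uses, so this is an assumption:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Re-fit with a heteroskedasticity-robust covariance matrix; the point
    # estimates are unchanged, only the standard errors differ
    robust_fit = smf.ols("lprice ~ llotsize + lsqrft + bdrms + I(bdrms**2) + colonial",
                         data=df).fit(cov_type="HC1")

    print(pd.DataFrame({"coef": robust_fit.params,
                        "OLS se": fit.bse,
                        "robust se": robust_fit.bse}))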
There is virtually no difference, so we just use OLS as we progress.
$\operatorname{cor}(\varepsilon_i, x_{ji}) = 0$: Again one should make scatterplots of the residuals against the regressors or the predicted values, $\hat{y}_i$. We did that above, and going back, it is only the two outliers that catch the eye.
With outliers like these one should do the full analysis both with and without these observations, and compare the results. If the results differ, then we should probably trust the results that emerged after excluding the outliers. But we should also try to figure out what made these two observations special. In general, we often need more variables to do this, so the only feasible option is to exclude the outliers.
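A sketch of the with/without comparison, reusing `df` and `fit` from above. Picking the outliers as the two observations with the largest absolute residuals is an assumption made here for illustration; in practice one would identify them from the plots:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Flag the two observations with the largest absolute residuals,
    # re-estimate without them, and compare the coefficient estimates
    worst = fit.resid.abs().nlargest(2).index
    refit = smf.ols("lprice ~ llotsize + lsqrft + bdrms + I(bdrms**2) + colonial",
                    data=df.drop(index=worst)).fit()

    print(pd.concat({"all obs": fit.params,
                     "excl. outliers": refit.params}, axis=1))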
$\varepsilon_i$ i.i.d. $N(0, \sigma^2)$:
Step 5: Now, with a model that approximately satisfies the design criteria, we can progress to simplify the model. The only serious multicollinearity seems to be between the bedroom terms, which is of course so by construction.

First we remove the colonial variable and re-estimate the model, getting the following output:
Now, the question is whether we should remove the bedroom terms. This is a hard question to answer based on the p-values only. The best approach we have (with the tools we have learned in the course) is to estimate models with and without bdrms and/or bdrms². The adjusted $R^2$ for the model

$$ \text{lprice} = \beta_0 + \beta_1\,\text{llotsize} + \beta_2\,\text{lsqrft} + \beta_3\,\text{bdrms} + \beta_4\,\text{bdrms}^2 + \varepsilon \qquad (1) $$

is 0.637. Excluding the bedroom terms bdrms and bdrms², the adjusted $R^2$ for the model is 0.630, so we only lost about 1 percent of explanatory power. This is very little, so it seems correct to remove bdrms and bdrms².
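The 0.637 and 0.630 above come from the author's output; a sketch of how such a comparison could be computed, again with the hypothetical `df`:

    import statsmodels.formula.api as smf

    # Adjusted R^2 with and without the bedroom terms
    full = smf.ols("lprice ~ llotsize + lsqrft + bdrms + I(bdrms**2)", data=df).fit()
    reduced = smf.ols("lprice ~ llotsize + lsqrft", data=df).fit()

    print(f"adj. R^2 with bdrms and bdrms^2:    {full.rsquared_adj:.3f}")
    print(f"adj. R^2 without the bedroom terms: {reduced.rsquared_adj:.3f}")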