Regression Cookbook
Given a response variable $y$ and a set of regressors $(x_1, \ldots, x_k)$, the multiple regression model is used to model a linear relationship between the response and the explanatory variables (also known as regressors). Hence, the model has the following appearance:

$$ y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} + \varepsilon_i, \qquad i = 1, \ldots, n. $$
To enable valid inference the errors must fulfill the following design criteria:

D1. Zero mean: $E(\varepsilon_i) = 0$ for all $i$.

D2. Homoskedasticity: $\operatorname{var}(\varepsilon_i) = \sigma^2$ for all $i$.

D3. Mutually uncorrelated: $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for all $i \neq j$.

D4. Uncorrelated with the regressors: $\operatorname{cor}(\varepsilon_i, x_{ji}) = 0$ for $\varepsilon_i$ and each of $x_{1i}, \ldots, x_{ki}$.

D5. Normality: $\varepsilon_i$ i.i.d. $N(0, \sigma^2)$ for all $i$.

Moreover, the regressors cannot be perfectly collinear and the linear functional relationship should be a good specification.
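To make the setup concrete, here is a minimal sketch in Python (using numpy and statsmodels; the dataset, sample size and coefficients are invented purely for illustration) that simulates data satisfying D1-D5 and estimates the model by OLS:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)
    n = 200

    # Two hypothetical regressors, not perfectly collinear
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)

    # Errors satisfying D1-D5: i.i.d. N(0, sigma^2), independent of the regressors
    eps = rng.normal(scale=2.0, size=n)

    # Arbitrary "true" coefficients: beta_0 = 1, beta_1 = 0.5, beta_2 = -1.2
    y = 1.0 + 0.5 * x1 - 1.2 * x2 + eps

    # OLS with a constant term included
    X = sm.add_constant(np.column_stack([x1, x2]))
    model = sm.OLS(y, X).fit()
    print(model.summary())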
Regression Step-by-step
Our approach in this course is quite superficial; basically we are making sure that we do not go completely wrong. There is much more that could be done, but we put that aside for another course. Also, there are many situations that will require judgement calls, and this ability can only be learned through experience. In general a sensible regression analysis should at least comprise the following elements:
1. Problem statement: What is the research question?
2. What are the relevant variables for addressing the research question? Present the dependent variable and the regressors. What are the measurement scales? Which are the regressors of interest and which are control variables? Get some preliminary idea of the functional relationship by plotting the dependent variable against the regressors. Use scatterplots for plotting $y_i$ against interval/ratio- and ordinal-scaled regressors and box-type plots for nominal variables.
3. Based on your research theory and the preliminary analysis, set up an initial regression equation. This equation should be quite general/comprehensive, as the idea is to partly let the data decide which variables and functional form are appropriate.
4. Estimate the model using OLS. Before you even consider the significance of the individual regressors you must investigate whether the design criteria are satisfied. We now consider our approach to verifying the design criteria one by one.
D1 $E(\varepsilon_i) = 0$: Always satisfied with a constant term in the model. Intuitively, the constant term equals the fixed portion of the dependent variable that cannot be explained by the independent variables, whereas the error term equals the stochastic portion of the unexplained value of the response.
D2 $\operatorname{var}(\varepsilon_i) = \sigma^2$: One should make scatterplots of the residuals against the regressors. This can be quite lengthy, so a shortcut is just to plot the residuals or the squared residuals against the predicted values, $\hat{y}_i$ (see the diagnostics sketch after this list).

D4 $\operatorname{cor}(\varepsilon_i, x_{ji}) = 0$: Again one should make scatterplots of the residuals against the regressors or the predicted values, $\hat{y}_i$.
D5 $\varepsilon_i$ i.i.d. $N(0, \sigma^2)$: If we have more than 100 observations, then we rarely care. The CLT ensures that in large samples the coefficient estimates are approximately normally distributed, and this holds for almost any choice of distribution for $\varepsilon_i$. Still, it is very simple to make a histogram, P-P plots, etc. that compare the residual distribution to the normal distribution (also shown in the diagnostics sketch after this list). In small samples a so-called bootstrap approach will deliver valid and reliable inference, but this is also beyond the scope of this course. So in small samples where the normal assumption fails to apply, we simply state this and note that the conclusions are to be taken lightly.
5. Now, with a model that approximately satisfies the design criteria, we can progress to simplify the model. Our approach is to remove insignificant variables one by one. In general we take out the variable with the largest P-value until only significant variables are left in the model (see the simplification sketch after this list). Warning: this is in many cases not as simple as it sounds, mainly because of multicollinearity. In models with polynomial terms and/or interactions, looking at t-stats can be misleading. If removing a (by t-stats) seemingly insignificant regressor reduces the adjusted $R^2$, then we should reconsider, especially if the variables have a large VIF (above 5).
Say two variables with high VIFs seem insignificant by t-stats, and removing one makes the other significant and vice versa. This describes a situation where the variables are highly collinear and both variables are good predictors. In this case keeping either one, or a combination, would make sense. Just remember to account for this in the interpretation.
For example, in the wage case, if we regress wage on the level of experience and age, we might find what we just described. Then removing age does not mean that we should conclude that wage is not correlated with age, but that both wage and experience tend to increase with a person's age.
In essence we should check the design criteria whenever we change the model, that is, after every single reduction. It is, however, rarely the case that the removal of an insignificant variable will change the fit in terms of D1-D5. So it is only in borderline cases that we reconsider whether D1-D5 still hold.
6. For your final model you should give interpretations and implications of the coefficient estimates.
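As referenced in step 4, here is a minimal sketch of the residual diagnostics in Python, assuming a fitted statsmodels result named `model` (for instance the one from the simulation sketch above):

    import matplotlib.pyplot as plt
    import scipy.stats as stats

    resid = model.resid          # estimated errors
    fitted = model.fittedvalues  # predicted values y-hat

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))

    # D2/D4: residuals against predicted values; fanning suggests
    # heteroskedasticity, curvature suggests a misspecified functional form
    axes[0].scatter(fitted, resid, s=10)
    axes[0].axhline(0, color="red")
    axes[0].set(xlabel="predicted values", ylabel="residuals")

    # D5: residual histogram to eyeball against the normal distribution
    axes[1].hist(resid, bins=30, density=True)
    axes[1].set(xlabel="residuals")

    # D5: probability plot of the residuals against the normal distribution
    stats.probplot(resid, dist="norm", plot=axes[2])

    plt.tight_layout()
    plt.show()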
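And for the reduction in step 5, a sketch of the one-by-one removal of insignificant variables. The helper `backward_eliminate` is hypothetical (not a library function), and an automatic loop like this does not replace the judgement calls about adjusted $R^2$ and VIFs discussed above; its output should be inspected, not trusted blindly:

    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def backward_eliminate(y, X, alpha=0.05):
        """Repeatedly drop the regressor with the largest p-value above alpha.
        X is a DataFrame that already contains a constant column named 'const'."""
        X = X.copy()
        while True:
            fit = sm.OLS(y, X).fit()
            pvals = fit.pvalues.drop("const")  # never drop the constant term
            worst = pvals.idxmax()
            if pvals[worst] <= alpha:
                return fit
            # A large VIF warns that the t-test on this variable may mislead
            vif = variance_inflation_factor(X.values, X.columns.get_loc(worst))
            print(f"dropping {worst}: p = {pvals[worst]:.3f}, VIF = {vif:.1f}")
            X = X.drop(columns=worst)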
Example: House prices
As an example we consider an equation that describes the determination of house prices. Let's say we have access to some observations on the following variables:
price = house price
lotsize = size of the lot, in square feet
sqrft = square footage
bdrms = number of bedrooms
colonial = 1 if home is colonial style
You can see a snapshot of the data below:
It is somewhat backward to start from some variables that you have, so now we pretend that we
start from the top of our step-by-step list.
Step 1: Problem statement: What determines the traded price of a residential house?
Step 2: There are of course many more relevant variables than the five we have access to, but we will have to make do with these. The dependent variable is price, and the remaining four variables are the regressors. The variables price, lotsize, sqrft and bdrms are all ratio scaled (even though bdrms is discrete). colonial is a dummy variable, so it is nominal. In this example all the regressors are actually variables of interest. Below are plots of the dependent variable against the regressors. We do the plots both in levels and, for the cases where it makes sense, in logarithmic versions.
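A sketch of how such plots could be produced in Python, assuming the data snapshot is loaded in a pandas DataFrame named `df` (a name we invent here) with the five variables above:

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # Ratio-scaled regressors: scatterplots of price against each
    for ax, var in zip(axes.flat, ["lotsize", "sqrft", "bdrms"]):
        ax.scatter(df[var], df["price"], s=10)
        ax.set(xlabel=var, ylabel="price")

    # Nominal regressor: box-type plot of price by colonial (0/1)
    df.boxplot(column="price", by="colonial", ax=axes.flat[3])

    plt.tight_layout()
    plt.show()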
Step 3: Based on the plots and a wish to work with elasticities whenever possible, we could start with the following regression equation:

$$ \text{lprice} = \beta_0 + \beta_1\,\text{llotsize} + \beta_2\,\text{lsqrft} + \beta_3\,\text{bdrms} + \beta_4\,\text{colonial} + \varepsilon $$
It would actually make sense to expand this model with a quadratic term in at least bdrms. In theory, and from the preliminary plot, we should have the feeling that there are decreasing returns to the number of bedrooms one puts into a house,
$$ \text{lprice} = \beta_0 + \beta_1\,\text{llotsize} + \beta_2\,\text{lsqrft} + \beta_3\,\text{bdrms} + \beta_4\,\text{bdrms}^2 + \beta_5\,\text{colonial} + \varepsilon $$
Step 4: To estimate the model using OLS, get the output, and save the variables we want, we make the following selections:
The independent(s) comprise: llotsize, lsqrft, bdrms, bdrms2 and colonial.
This produces the following coefficients table, and much more, some of which we use below.
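The text above uses menu selections in the statistics package; for readers working in code, here is a sketch of the same estimation in Python with statsmodels, again assuming the hypothetical DataFrame `df` from above:

    import numpy as np
    import statsmodels.formula.api as smf

    # Construct the logged variables used in the Step 3 equation
    df["lprice"] = np.log(df["price"])
    df["llotsize"] = np.log(df["lotsize"])
    df["lsqrft"] = np.log(df["sqrft"])

    # I(bdrms**2) adds the quadratic bedroom term inside the formula
    fit = smf.ols("lprice ~ llotsize + lsqrft + bdrms + I(bdrms**2) + colonial",
                  data=df).fit()
    print(fit.summary())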
Before you even consider the significance of the individual regressors you must investigate whether the design criteria are satisfied. We now consider the design criteria that need attention.

$\operatorname{var}(\varepsilon_i) = \sigma^2$: First we plot the residuals or the squared residuals against the predicted values, $\hat{y}_i$.
If we ignore the two outlying observations (marked with red in both plots) then there is no sign of heteroskedasticity. Still, these outliers could affect the results, both in terms of the estimated coefficients and their precision. To further check for heteroskedasticity we compute the robust standard errors below.
The following output emerges:

In this table we are only interested in the standard errors (the coefficient estimates should be identical to those we got above; if they are not, there is a bug somewhere). Below we present them next to the ones from the OLS output:
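A sketch of this comparison in Python, reusing the `df` and `fit` from the estimation sketch above. HC1 is one common heteroskedasticity-robust choice; the text does not say which robust estimator its software uses, so this is an assumption:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Re-fit with a heteroskedasticity-robust covariance matrix; the point
    # estimates are unchanged, only the standard errors differ
    robust_fit = smf.ols("lprice ~ llotsize + lsqrft + bdrms + I(bdrms**2) + colonial",
                         data=df).fit(cov_type="HC1")

    print(pd.DataFrame({"coef": robust_fit.params,
                        "OLS se": fit.bse,
                        "robust se": robust_fit.bse}))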
There is virtually no difference, so we just use OLS as we progress.
$\operatorname{cor}(\varepsilon_i, x_{ji}) = 0$: Again one should make scatterplots of the residuals against the regressors or the predicted values, $\hat{y}_i$. We did that above, and going back, it is only the two outliers that catch the eye.
With outliers like these one should do the full analysis both with and without these observations, and compare the results. If the results differ, then we should probably trust the results that emerged after excluding the outliers. But we should also try to figure out what made these two observations special. In general, we often need more variables to do this, so the only feasible option is to exclude the outliers.
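A sketch of the with/without comparison, reusing `df` and `fit` from above. Picking the outliers as the two observations with the largest absolute residuals is an assumption made here for illustration; in practice one would identify them from the plots:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Flag the two observations with the largest absolute residuals,
    # re-estimate without them, and compare the coefficient estimates
    worst = fit.resid.abs().nlargest(2).index
    refit = smf.ols("lprice ~ llotsize + lsqrft + bdrms + I(bdrms**2) + colonial",
                    data=df.drop(index=worst)).fit()

    print(pd.concat({"all obs": fit.params,
                     "excl. outliers": refit.params}, axis=1))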
$\varepsilon_i$ i.i.d. $N(0, \sigma^2)$:
Step 5: Now, with a model that approximately satisfies the design criteria, we can progress to simplify the model. The only serious multicollinearity seems to be between the bedroom terms, which is of course so by construction.

First we remove the colonial variable and re-estimate the model, getting the following output:
Now, the question is whether we should remove the bedroom terms. This is a hard question to answer based on the p-values only. The best approach we have (with the tools we have learned in the course) is to estimate models with and without bdrms and/or bdrms². The adjusted $R^2$ for the model

$$ \text{lprice} = \beta_0 + \beta_1\,\text{llotsize} + \beta_2\,\text{lsqrft} + \beta_3\,\text{bdrms} + \beta_4\,\text{bdrms}^2 + \varepsilon \qquad (1) $$

is 0.637. Excluding the bedroom terms bdrms and bdrms², the adjusted $R^2$ for the model is 0.630, so we only lost about 1 percent of explanatory power. This is very little, so it seems correct to remove bdrms and bdrms².
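The 0.637 and 0.630 above come from the author's output; a sketch of how such a comparison could be computed, again with the hypothetical `df`:

    import statsmodels.formula.api as smf

    # Adjusted R^2 with and without the bedroom terms
    full = smf.ols("lprice ~ llotsize + lsqrft + bdrms + I(bdrms**2)", data=df).fit()
    reduced = smf.ols("lprice ~ llotsize + lsqrft", data=df).fit()

    print(f"adj. R^2 with bdrms and bdrms^2:    {full.rsquared_adj:.3f}")
    print(f"adj. R^2 without the bedroom terms: {reduced.rsquared_adj:.3f}")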