Mis-specifications of regression model
Introductory Econometrics
TABLE OF CONTENTS
Learning Objectives
1. Introduction
Practice Questions
Learning Objectives
This chapter presents the various types of specification errors, or specification bias, which can occur in a chosen regression model. It focuses mainly on two types of specification bias: omitted variable bias and irrelevant variable bias. It discusses the various consequences of these specification errors along with examples. Apart from this, you will learn the properties of a good model. The chapter concludes with practice questions which will help you test the concepts learned from it.
1. Introduction
There are various assumptions underlying the classical linear regression model. These
assumptions ensure that the model estimated captures reasonably the economic
phenomenon which is under study. Among these assumptions, one is that the model is
correctly specified that is the model correctly represents the phenomena or the economic
process under study. In other words, the model under consideration is able to fully or
accurately capture the reality of the economy. However, the regression model which is
chosen may not always be specified correctly due to many reasons. This is because it is
very difficult to search for the true regression model and hence we encounter specification
errors.
A correctly specified model is one which includes all the relevant variables that are important in explaining the economic relationship under study, does not include unnecessary variables, has the correct functional form, involves no errors of measurement in the dependent and/or explanatory variables, includes no outliers, and has no problem of simultaneity (simultaneous-equation bias). Specification errors, or specification bias, occur when we do not estimate the correct model and instead end up estimating some other model which may not reasonably capture the economic process being studied.
Model specification errors are of various types, but this chapter focuses on only two of them. Before discussing these types of specification errors, let us first discuss the attributes, or properties, of a good regression model.
There are various criteria for choosing a regression model which is economically sound. These criteria are needed in order to select the ‘best’ model among the available models. Some of these criteria are as follows-
(i) Parsimony: the model should be as simple as possible while still capturing the essential features of the relationship under study.
(ii) Identifiability: for a given set of data, the estimated parameters should take unique values.
(iii) Goodness of fit: the model should explain as much of the variation in the dependent variable as possible.
(iv) Theoretical consistency: the signs and magnitudes of the estimated coefficients should accord with the underlying economic theory.
(v) Predictive power: the model should predict well beyond the sample used for estimation.
These are some of the criteria for choosing the best model among the available models, and hence before formulating an econometric model we must keep them in mind. However, even if the best model is selected, there are chances that it might still suffer from specification errors, as discussed below.
There are various types of specification errors, or biases, which occur while choosing a particular econometric model. These errors occur when, instead of estimating the true model, we end up estimating some other model which is not correctly specified. Broadly speaking, we can divide the specification errors into the following:
(i) omission of a relevant variable (or variables),
(ii) inclusion of an irrelevant variable (or variables),
(iii) adoption of the wrong functional form, and
(iv) errors of measurement in the dependent and/or explanatory variables.
Suppose, for example, that the true model is
Yi = B1 + B2X2i + B3X3i + u1i ,
but we estimate
Yi = A1 + A2X2i + u2i .
In the estimated model we have excluded an important variable, X3, and hence we have committed a specification error in the form of the omission of a relevant variable. The error term u2i of this model will capture not only u1i but also the influence of the excluded variable on the dependent variable. That is,
u2i = u1i + B3X3i .
Similarly, if we estimate the true model in some other functional form (say, with ln Yi as the dependent variable when the true relationship is linear in Yi), then the regression model has the wrong functional form, and in this case the specification error takes the form of an incorrect functional form.
Lastly, if instead of the true regression model we estimate
Yi′ = A1 + A2X2i′ + A3X3i′ + u4i ,
where Yi′, X2i′ and X3i′ are proxies for the true variables Yi, X2i and X3i respectively,
then we commit a specification error in the form of errors-of-measurement bias, since these proxies may contain errors of measurement.
Now, we move to a detailed discussion of the two types of specification errors that are the focus of this chapter: the omission of relevant variables and the inclusion of irrelevant variables.
Suppose the true model is the three-variable regression model
Yi = B1 + B2X2i + B3X3i + u1i .
Instead of estimating this model, we end up estimating the following model, which may happen for many reasons:
Yi = A1 + A2X2i + u2i .
Here, we have used different notation for the regression coefficients in the estimated model in order to differentiate it from the true model. Clearly, the estimated model excludes the relevant variable X3 and is therefore a mis-specified regression model, since X3, as per the true model, explains a certain amount of the total variation in the dependent variable Yi. The consequences of committing this omitted variable bias are discussed below.
The following are the consequences of omitting relevant variables from the regression
model-
a) The OLS estimators of the estimated model are biased if the excluded variable X3 is correlated with the included variable X2. In other words,
E(a1) ≠ B1 and
E(a2) ≠ B2 ,
where E is the expected value and a1 and a2 are the respective OLS estimators of A1 and A2.
Let us prove this consequence-
a2 = Σx2i yi / Σx2i² ,
where x2i = X2i − X̄2 and yi = Yi − Ȳ denote deviations from the sample means. Substituting the value of Yi from the true model into this formula gives
a2 = B2 + B3 (Σx2i x3i / Σx2i²) + (Σx2i u1i / Σx2i²).
Taking expectations,
E(a2) = B2 + B3 (Σx2i x3i / Σx2i²) + E(Σx2i u1i / Σx2i²).
Since E(u1i) = 0 and the X's are non-stochastic (by the assumptions of the Classical Linear Regression Model), the last term on the R.H.S becomes zero and we get
E(a2) = B2 + B3 b2 ,
where b2 = Σx2i x3i / Σx2i² is the slope coefficient from the auxiliary regression of X3 on X2. Since b2 ≠ 0 and B3 ≠ 0, E(a2) ≠ B2 and hence a2 is a biased estimator of B2.
E(a1) = E(Ȳ − a2 X̄2)
= B1 + B2 X̄2 + B3 X̄3 − E(a2) X̄2
= B1 + B2 X̄2 + B3 X̄3 − (B2 + B3 b2) X̄2 [since E(a2) = B2 + B3 b2]
= B1 + B3 (X̄3 − b2 X̄2)
≠ B1 (since b2 ≠ 0 and B3 ≠ 0).
⟹ E(a1) ≠ B1 and hence a1 is a biased estimator of B1.
This shows that if b2 > 0 and B3 > 0, or b2 < 0 and B3 < 0, then a2 will overestimate the true value of B2 on average and hence there is an upward bias. On the other hand, if b2 > 0 and B3 < 0, or vice-versa, then a2 will underestimate the true value of B2 on average and so there is a downward bias. Also, if (X̄3 − b2 X̄2) > 0 then a1 will overestimate the true value of B1, and if (X̄3 − b2 X̄2) < 0 then it will underestimate the true value of B1. This is because X2 now captures not only its direct influence on Y but also the indirect influence of X3 on Y, as shown in Figure 1 below.
Figure 1: The direct influence of X2 on Y and the indirect influence of the omitted X3 on Y through its correlation with X2.
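The decomposition above has an exact finite-sample counterpart: in any given data set, the slope from the short regression equals the long-regression slope plus the long-regression coefficient of the omitted variable times the auxiliary slope b2. A minimal sketch in Python (NumPy only; all variable names and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x2 = rng.normal(size=n)
x3 = 0.5 * x2 + rng.normal(size=n)      # X3 correlated with X2, so b2 != 0
y = 1.0 + 2.0 * x2 + 3.0 * x3 + rng.normal(size=n)

# Long (true) model: regress Y on [1, X2, X3]
X_long = np.column_stack([np.ones(n), x2, x3])
b1_hat, b2_hat, b3_hat = np.linalg.lstsq(X_long, y, rcond=None)[0]

# Short (mis-specified) model: regress Y on [1, X2]
X_short = np.column_stack([np.ones(n), x2])
a1, a2 = np.linalg.lstsq(X_short, y, rcond=None)[0]

# Auxiliary regression of X3 on [1, X2]; its slope is b2 in the text's notation
b2_aux = np.linalg.lstsq(X_short, x3, rcond=None)[0][1]

# The identity a2 = B2_hat + B3_hat * b2 holds exactly in-sample
print(a2, b2_hat + b3_hat * b2_aux)
assert np.isclose(a2, b2_hat + b3_hat * b2_aux)
```

This in-sample identity is what drives the expectation result E(a2) = B2 + B3 b2: the short-regression slope mixes the direct effect of X2 with the part of X3's effect that travels through its correlation with X2.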
(b) If X2 and X3 are uncorrelated, that is b2 = 0, then E(a2) = B2. In other words, a2 will be an unbiased estimator of B2. However, E(a1) = B1 + B3 X̄3 ≠ B1 unless X̄3 = 0, implying that a1 still remains a biased estimator.
(c) In the case where b2 ≠ 0, a1 and a2, apart from being biased estimators, are also inconsistent: the bias does not disappear even in large samples. On the other hand, when b2 = 0, a2 is an unbiased and consistent estimator, but a1 remains biased and inconsistent.
(d) The variance of the error term, σ², is not correctly estimated in the estimated model. This is because the estimator of σ² is RSS/d.f. (the residual sum of squares divided by its degrees of freedom), and both the RSS and its degrees of freedom differ between the true and the estimated models. The RSS/d.f. of the true model is an unbiased estimator of σ², while that of the estimated model is a biased estimator of σ².
(e) Since the RSS/d.f. of the estimated model is a biased estimator of σ², the estimated variance of a2 will be a biased estimator of the variance of B̂2, the OLS estimator of the true B2. In other words,
E[Var(a2)] ≠ Var(B̂2).
The relevant expressions are as follows:
Var(a2) = σ²/Σx2i² and
E[Var(a2)] = Var(B̂2) + B3² Σx3i² / [(n − 2) Σx2i²].
And since the last term on the R.H.S is positive, it implies that, on average, Var(a2) will overestimate the variance of B̂2, and hence there will always be a positive bias associated with Var(a2).
(f) Consequently, the usual hypothesis-testing and confidence-interval procedures are unreliable and may give wrong conclusions about the statistical significance of the regression coefficients. The confidence intervals in this case will be wider, so there is a greater likelihood of failing to reject the null hypothesis that a coefficient is zero.
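The central consequence, E(a2) = B2 + B3 b2, can also be seen by simulation. The sketch below uses illustrative parameter values (B2 = 2, B3 = 3, and a population auxiliary slope of 0.5, so the short-regression slope should centre on 3.5 rather than 2); it repeatedly estimates the short regression and averages the slope estimates:

```python
import numpy as np

rng = np.random.default_rng(42)
B1, B2, B3 = 1.0, 2.0, 3.0   # illustrative true-model coefficients
n, reps = 100, 2000
b2_aux = 0.5                 # population slope of X3 on X2

a2_draws = []
for _ in range(reps):
    x2 = rng.normal(size=n)
    x3 = b2_aux * x2 + rng.normal(size=n)
    y = B1 + B2 * x2 + B3 * x3 + rng.normal(size=n)
    # Short regression of Y on [1, X2], omitting the relevant X3
    X = np.column_stack([np.ones(n), x2])
    a2_draws.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

mean_a2 = np.mean(a2_draws)
print(mean_a2)   # close to B2 + B3*b2_aux = 3.5, not to B2 = 2.0
assert abs(mean_a2 - (B2 + B3 * b2_aux)) < 0.1
```

Since b2 > 0 and B3 > 0 here, the simulation exhibits the upward bias described above.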
Considering the consequences of omitting important variables from the regression model, these consequences are rather serious, and hence once a regression model has been formulated on the basis of economic theory it is always better not to drop a relevant variable from it. Dropping such relevant variables from the model results in under-specifying, or under-fitting, the regression model.
As an illustration, consider the model
HGi = B1 + B2 HGMi + B3 CAi + ui ,
where HG = highest grade completed by the individual, HGM = highest grade completed by the individual's mother, and CA = a measure of the cognitive ability of the individual.
Now, we drop CA from the regression model, giving the following results-
ĤGi = 10.05 + 0.313 HGMi
se = (0.4147) (0.0348)
t = (24.23)* (9.00)*
Also, running a regression of CA on HGM gave a slope coefficient of 0.42. This means that there will be an upward bias associated with the coefficient of HGM, since b2 > 0 and B3 > 0. This is visible from the two regressions: the value of the coefficient on HGM is indeed larger in the latter regression. Also, the R² of 13.08% in the latter regression reflects the direct influence of HGM together with the indirect influence of CA on HG, and the remaining consequences of omitting an important variable (CA) from the regression model will follow.
The next type of specification error is including unnecessary explanatory variables in the regression model. This happens because sometimes the researcher may want to include all the available explanatory variables in the model (the so-called ‘kitchen sink approach’), even if some of these variables are not important from the viewpoint of the theoretical framework of the model. Another reason for including irrelevant variables is that sometimes the theoretical model itself is not known, and hence researchers end up including such explanatory variables in the regression model. Also, including more and more explanatory variables, even unnecessary ones, raises the unadjusted R² of the regression model; the adjusted R², however, will increase only when the absolute value of the t ratio of the added coefficient is greater than one. This may appear to improve the goodness of fit of the regression model. Thus, by including such irrelevant explanatory variables, we commit specification errors in the form of the inclusion of irrelevant variables, or irrelevant variable bias.
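The claim that adjusted R² rises exactly when the added coefficient has |t| > 1 can be checked numerically. The following sketch (illustrative simulated data; plain NumPy OLS) fits a model with and without a candidate regressor and compares the two criteria:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)                   # candidate (possibly irrelevant) regressor
y = 1.0 + 2.0 * x2 + rng.normal(size=n)   # true model does not contain X3

def fit(X, y):
    """OLS: return coefficients, residual sum of squares, adjusted R^2."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    rss = resid @ resid
    tss = ((y - y.mean()) ** 2).sum()
    n_, k = X.shape
    adj_r2 = 1 - (rss / (n_ - k)) / (tss / (n_ - 1))
    return b, rss, adj_r2

X_small = np.column_stack([np.ones(n), x2])
X_big = np.column_stack([np.ones(n), x2, x3])
_, _, adj_small = fit(X_small, y)
b_big, rss_big, adj_big = fit(X_big, y)

# t statistic of the added coefficient in the bigger model
sigma2_hat = rss_big / (n - 3)
cov = sigma2_hat * np.linalg.inv(X_big.T @ X_big)
t_x3 = b_big[2] / np.sqrt(cov[2, 2])

# Adjusted R^2 rises with the extra regressor exactly when |t| > 1
print(adj_big > adj_small, abs(t_x3) > 1)
assert (adj_big > adj_small) == (abs(t_x3) > 1)
```

The equivalence holds for any data set when a single regressor is added, which is why a coefficient with |t| slightly above 1 can raise adjusted R² without being statistically significant at conventional levels.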
Suppose the true model is
Yi = B1 + B2X2i + u1i ,
but instead of estimating this two-variable regression model, we estimate the following three-variable regression model:
Yi = C1 + C2X2i + C3X3i + u2i .
Hence, the latter model is the mis-specified model in the sense that it includes an unnecessary variable, X3, which is not there in the true model. And so, by including this variable we
have over-fitted our regression model. Again, there are consequences of committing
irrelevant variable bias which are discussed below.
The following are the consequences of including unnecessary variable, which is X3 in the
present scenario, in the regression model:
(a) Unlike in the case of ‘omitted variable bias’, the OLS estimators of the mis-specified model are still unbiased and hence also consistent. Symbolically,
E(c1) = B1, E(c2) = B2 and E(c3) = B3 = 0,
where E is the expected value and c1, c2 and c3 are the respective OLS estimators of B1, B2 and B3, and as per the true model B3 = 0.
To prove this, write the OLS estimator of the coefficient of X2 in the three-variable model (in deviation form) as
c2 = (Σx2i yi Σx3i² − Σx3i yi Σx2i x3i) / (Σx2i² Σx3i² − (Σx2i x3i)²).
Substituting yi = B2 x2i + (u1i − ū1) from the true model (in which B3 = 0), and noting that Σx2i ū1 = Σx3i ū1 = 0, gives
⟹ c2 = B2 + (Σx2i u1i Σx3i² − Σx3i u1i Σx2i x3i) / (Σx2i² Σx3i² − (Σx2i x3i)²).
Taking expectations,
E(c2) = B2 + [E(Σx2i u1i) Σx3i² − E(Σx3i u1i) Σx2i x3i] / (Σx2i² Σx3i² − (Σx2i x3i)²).
As already shown, E(Σx2i u1i) = E(Σx3i u1i) = 0,
⟹ E(c2) = B2 ⟹ c2 is an unbiased estimator of B2.
Similarly,
c3 = (Σx3i yi Σx2i² − Σx2i yi Σx2i x3i) / (Σx2i² Σx3i² − (Σx2i x3i)²).
Substituting yi = B2 x2i + (u1i − ū1), the terms in B2 cancel exactly, so that
⟹ c3 = (Σx3i u1i Σx2i² − Σx2i u1i Σx2i x3i) / (Σx2i² Σx3i² − (Σx2i x3i)²).
Taking expectations and again using E(Σx2i u1i) = E(Σx3i u1i) = 0, we get
E(c3) = 0 = B3 ⟹ c3 is an unbiased estimator of B3.
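A small Monte Carlo experiment illustrates this unbiasedness: even when the irrelevant X3 is correlated with X2, the estimates c2 and c3 centre on B2 and 0 respectively. The parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
B1, B2 = 1.0, 2.0            # true model: Y = B1 + B2*X2 + u, so B3 = 0
n, reps = 100, 2000

c2_draws, c3_draws = [], []
for _ in range(reps):
    x2 = rng.normal(size=n)
    x3 = 0.5 * x2 + rng.normal(size=n)   # irrelevant but correlated with X2
    y = B1 + B2 * x2 + rng.normal(size=n)
    # Over-fitted regression of Y on [1, X2, X3]
    X = np.column_stack([np.ones(n), x2, x3])
    _, c2, c3 = np.linalg.lstsq(X, y, rcond=None)[0]
    c2_draws.append(c2)
    c3_draws.append(c3)

print(np.mean(c2_draws), np.mean(c3_draws))   # near B2 = 2 and B3 = 0
assert abs(np.mean(c2_draws) - B2) < 0.05
assert abs(np.mean(c3_draws)) < 0.05
```

Contrast this with the omitted-variable case: over-fitting leaves the estimators centred on the truth, although, as shown next, at a cost in precision.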
(b) The variance of the error term, σ2, is correctly estimated in the mis-specified model.
(c) Thus, the usual hypothesis-testing and confidence interval approaches remain
reliable. That is, the usual t and F tests remain valid.
(d) However, the OLS estimators of the mis-specified model will have larger variances than the variances of the corresponding estimators in the true model. In other words, the
estimators are not efficient in general or they are no longer BLUE (these estimators
will be linear, unbiased estimators, LUE).
We can show this by comparing the variance of the estimator of B2 in the true model, Var(B̂2), with that of c2 in the mis-specified model, Var(c2):
Var(B̂2) = σ²/Σx2i² and
Var(c2) = σ²/[Σx2i² (1 − r²)],
where r is the coefficient of correlation between X2 and X3. Since r ≠ 0, we have 0 < r² < 1
⟹ Var(B̂2) < Var(c2).
Thus, although c2 is an unbiased estimator of B2, its variance is larger than that of the estimator of the true B2. Similarly, we have Var(B̂1) < Var(c1).
This shows that the inclusion of the irrelevant variable X3 led to larger variances of the OLS estimators, making them less precise.
(e) When X2 and X3 are correlated, it is impossible to determine the separate contribution of each explanatory variable to the R² of the regression model.
To summarize, including an irrelevant variable in the regression model:
(i) will lead to larger variances of the OLS estimators, making them inefficient and hence less precise,
(ii) will lead to a loss of degrees of freedom, and
(iii) may also lead to the problem of multicollinearity.
Hence, it is always better to keep only those explanatory variables in the regression model
which are relevant from the theoretical point of view and are known to influence directly the
dependent variable.
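The variance comparison above can be illustrated numerically. The sketch below (illustrative data; σ² treated as known) computes Var(c2) both from the formula σ²/[Σx2i²(1 − r²)] and directly from the σ²(X′X)⁻¹ expression, and compares it with the true-model variance σ²/Σx2i²:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + 0.6 * rng.normal(size=n)   # irrelevant regressor, correlated with X2
sigma2 = 1.0                               # Var(u), taken as known for the comparison

x2d = x2 - x2.mean()
x3d = x3 - x3.mean()
r2 = (x2d @ x3d) ** 2 / ((x2d @ x2d) * (x3d @ x3d))  # squared sample correlation

var_true = sigma2 / (x2d @ x2d)                  # Var(B2_hat) in the true model
var_over = sigma2 / ((x2d @ x2d) * (1 - r2))     # Var(c2) when X3 is also included

# Cross-check the formula against the sigma^2 * (X'X)^{-1} expression
X = np.column_stack([np.ones(n), x2, x3])
var_c2_matrix = sigma2 * np.linalg.inv(X.T @ X)[1, 1]

print(var_over / var_true)   # variance inflation factor 1/(1 - r^2) > 1
assert var_over > var_true
assert np.isclose(var_c2_matrix, var_over)
```

The ratio 1/(1 − r²) is the variance inflation factor: the more strongly the irrelevant variable is correlated with X2, the larger the loss of precision from including it.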
In this example, we have quarterly data on sales of fashion clothing (S), real personal disposable income (I) and a confidence index (CI) over the period 1986 to 1992. Since the data are quarterly time series, seasonal dummy variables are included in the model. Suppose the following is the true model:
St = B1 + B2 It + B3 CIt + B4 D2t + B5 D3t + B6 D4t + ut ,
where D2 = 1 for the 2nd quarter, 0 otherwise; D3 = 1 for the 3rd quarter, 0 otherwise; and D4 = 1 for the fourth quarter, 0 otherwise. This means that the first quarter is the base category in this example. The regression results of this model are as follows-
Ŝt = -152.92 + 1.598 It + 0.293 CIt + 15.045 D2t + 26.00 D3t + 60.87 D4t
The results show that all the coefficients are individually and jointly statistically significant.
Now, to this regression model we add the differential slope dummies and after running the
regression we get the following results-
Ŝt = -191.58 + 2.049 It + 0.280 CIt + 196.70 D2t + 123.13 D3t + 50.96 D4t – 1.11 D2t It
– 1.21 D3t It – 0.04 D4t It – 0.29 D2t CIt + 0.06 D3t CIt + 0.05 D4t CIt
As can be seen from the latter regression results, the added variables are not statistically significant and hence are unnecessary variables. Also, the standard errors of the coefficients are greater than they were in the true regression results, thereby reducing their precision. The R² is higher in the latter regression merely because of the addition of the variables. So, considering these results, we drop the unnecessary differential slope dummies and move back to our original true model.
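Although the original data set is not reproduced here, the mechanics of the two regressions can be sketched with simulated quarterly data (all numbers below are illustrative stand-ins, not the chapter's actual data): intercept dummies shift the level of sales by quarter, while the differential slope dummies D·I and D·CI let the income and confidence slopes differ by quarter:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 28                                       # 7 years of quarterly data, as in 1986-1992
quarter = np.tile([1, 2, 3, 4], T // 4)
income = 100 + rng.normal(size=T).cumsum()   # stand-in for real disposable income
ci = 100 + rng.normal(scale=5, size=T)       # stand-in for the confidence index
D2 = (quarter == 2).astype(float)
D3 = (quarter == 3).astype(float)
D4 = (quarter == 4).astype(float)

# Simulated sales generated from an intercept-dummy model (no differential slopes)
sales = (-150 + 1.6 * income + 0.3 * ci + 15 * D2 + 26 * D3 + 60 * D4
         + rng.normal(scale=5, size=T))

# True model: quarter dummies shift the intercept only
X_true = np.column_stack([np.ones(T), income, ci, D2, D3, D4])
beta_true = np.linalg.lstsq(X_true, sales, rcond=None)[0]

# Over-fitted model: differential slope dummies D*I and D*CI added
X_over = np.column_stack([X_true, D2 * income, D3 * income, D4 * income,
                          D2 * ci, D3 * ci, D4 * ci])
beta_over = np.linalg.lstsq(X_over, sales, rcond=None)[0]

rss_true = np.sum((sales - X_true @ beta_true) ** 2)
rss_over = np.sum((sales - X_over @ beta_over) ** 2)
print(beta_true.round(2))
print(rss_true >= rss_over)   # RSS never rises when regressors are added
```

The larger model always fits at least as well in-sample, which is precisely why fit alone cannot justify the extra slope dummies; their individual significance must be checked, as in the example above.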
So, we have discussed two types of specification errors along with their consequences.
There are other types of specification bias as mentioned earlier but their detailed discussion
is not a part of this chapter and is therefore left for the readers to explore.
PRACTICE QUESTIONS
Q1) Compare in detail the consequences of the omitted variable bias and the irrelevant variable bias. Which of the two has the more serious consequences?
Q2) ‘When we include unnecessary variable(s) in the regression model, the OLS estimators are biased as well as inconsistent.’ State briefly whether this statement is true or false.
Q3) Prove the following, given that there is no correlation between the omitted variable X3 and the included variable X2:
E[Var(a2)] = Var(B̂2) + B3² Σx3i² / [(n − 2) Σx2i²],
where B̂2 is the OLS estimator of B2 in the true model.
Suppose the true model is
Yt = B1 + B2 Xt + ut ,
where Y = consumption and X = income, but instead of using this model a researcher used the following model-
lnYt = B1 + B2 lnXt + ut
What type of specification error, if any, has been committed here?
Suppose the true model is
Yt = α1 + α2 X2t + u1t .
However, an irrelevant variable X3 is added and the following regression model was used-
Yt = λ1 + λ2 X2t + λ3 X3t + u2t
What are the consequences for the OLS estimators?
Q8) In the case of omitted variable bias, a trade-off is involved: a2 is a biased estimator but has a smaller variance than B̂2, which is an unbiased estimator of B2. How would you, then, choose between the two estimators?
Q9) Suppose we estimate the model
Yt = δ1 + δ2 X2t + u1t
instead of the true model
Yt = α1 + α2 X2t + α3 X3t + u2t ,
where it is known that X3 is a relevant variable. Would you then say that E(δ̂1) = α1 and E(δ̂2) = α2?
Q10) What are the reasons that result in a researcher estimating a mis-specified model?