
Mis-specifications of regression model

Introductory Econometrics

LESSON: Mis-Specifications Of Regression Model

Lesson Developer: Ms. Nupur Kataria

College/Department: Department of Economics, Kamala Nehru College, University of Delhi


TABLE OF CONTENTS

Learning Objectives

1. Introduction

2. Properties of a good model

3. The specification errors

4. Omission of relevant variable(s)/Under-fitting of a model

4.1 The consequences of ‘Omitted variable bias’

4.2 An example of ‘Omitted variable bias’

5. Inclusion of irrelevant variable(s)/Over-fitting of a model

5.1 The consequences of ‘Irrelevant variable bias’

5.2 An example of ‘Irrelevant variable bias’

Practice Questions

References

1. D. N. Gujarati and D. C. Porter, Essentials of Econometrics, 4th edition, McGraw Hill International Edition.
2. Jan Kmenta, Elements of Econometrics, Indian Reprint, Khosla Publishing House, 2008, a few pages for ‘Review of Statistics’.
3. Christopher Dougherty, Introduction to Econometrics, 4th edition, OUP, Indian edition.


Learning Objectives

This chapter presents the various types of specification errors, or specification bias, that can occur in a chosen regression model. It focuses mainly on two types of specification bias: omitted variable bias and irrelevant variable bias. It discusses the consequences of these specification errors along with examples. You will also learn the properties of a good model. The chapter concludes with practice questions that will help you test the concepts learned here.

1. Introduction

There are various assumptions underlying the classical linear regression model. These assumptions ensure that the estimated model reasonably captures the economic phenomenon under study. One of these assumptions is that the model is correctly specified, that is, it correctly represents the phenomenon or economic process being studied. In other words, the model under consideration is able to fully or accurately capture the underlying economic reality. However, the chosen regression model may not always be correctly specified: it is very difficult to find the true regression model, and hence we encounter specification errors.

A correctly specified model includes all the relevant variables important for explaining the economic relationship under study, does not include unnecessary variables, has the correct functional form, involves no errors of measurement in the dependent and/or explanatory variables, contains no outliers, and has no problem of simultaneous equations or simultaneity bias. Specification errors, or specification bias, occur when we do not estimate the correct model and instead end up estimating some other model which may fail to reasonably capture the economic process being studied.

Model specification errors are of various types, but this chapter focuses on only two of them. Before discussing these types of specification errors, let us first discuss the attributes, or properties, of a good regression model.

2. PROPERTIES OF A GOOD MODEL


There are various criteria for choosing a regression model which is economically sound. These criteria are required in order to select the ‘best’ model among those available. Some of these criteria are as follows:

a) A model should be kept as simple as possible; this is called the principle of parsimony, or Occam’s razor. A chosen model can never entirely capture the economic phenomenon, and hence some degree of simplification is required in formulating it.
b) A model should have parameter stability, that is, the values of the parameters should be stable. If the estimates of a given parameter vary widely, prediction or forecasting becomes difficult or unreliable.
c) The model should include all the relevant explanatory variables so that it is able to explain the phenomenon under study, and hence the chosen model is preferred over all other models.
d) The model should show high goodness of fit, that is, a high value of the adjusted R2.
e) The regressors should be exogenous, that is, the explanatory variables included in the model should be uncorrelated with the random error term ui.
f) The model must be consistent with the theoretical framework, even if one or more coefficients turn out to be insignificant or to have wrong signs.
g) The estimated residuals of the model must be white noise, i.e., random.
h) The predictions of the model should be as accurate as possible.

These are some of the criteria for choosing the best model among those available, and we must keep them in mind before formulating an econometric model. However, even if the best model is selected, it may still suffer from specification errors, as discussed below.

3. THE SPECIFICATION ERRORS

There are various types of specification errors, or bias, which occur while choosing a particular econometric model. These errors occur when, instead of estimating the true model, we end up estimating some other model which is not correctly specified. Broadly speaking, we can divide the specification errors into the following:

a) Omission of relevant variable(s),
b) Inclusion of irrelevant variable(s),
c) Incorrect functional form of the model, and
d) Errors of measurement bias.

Let us briefly discuss these types of specification errors or specification bias.

Suppose the following regression model is the true model-

Yi = B1 + B2X2i + B3X3i + u1i ,

but instead of estimating this model, we estimate the following model:

Yi = A1 + A2X2i + u2i ,

In the estimated model we have excluded an important variable, X3, and hence we have committed a specification error in the form of omission of a relevant variable. The error term u2i of this model captures not only u1i but also the influence of the excluded variable on the dependent variable. That is,

u2i = B3X3i + u1i .

Now suppose, the following model is estimated:

Yi = C1 + C2X2i + C3X3i + C4X4i + u3i ,

then we commit a specification error in the form of inclusion of an unnecessary variable, X4. The random error term u3i in the estimated model is now given by:

u3i = u1i − C4X4i

= u1i (since the coefficient of X4 in the true regression model is zero).

Next, if suppose the following model is estimated:

Yi = D1 + D2 lnX2i + D3 lnX3i + u4i ,

then the regression model has the wrong functional form, and in this case the specification error takes the form of an incorrect functional form.

Lastly, if we estimate the following regression model instead of the true regression model:

Yi’ = E1 + E2X2i’ + E3X3i’ + u5i

Where Y’, X2’ and X3’ are proxies for the true variables Y, X2 and X3 respectively.


Then we commit a specification error in the form of errors-of-measurement bias, since these proxies may contain errors of measurement.
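To see why measurement error matters, consider a small simulation (a sketch: the coefficient values, error variances and sample size below are illustrative assumptions, not taken from this chapter). When the regressor is observed only through a noisy proxy, the OLS slope is biased toward zero, the well-known attenuation bias:

```python
import numpy as np

rng = np.random.default_rng(3)
B2, n = 2.0, 100_000          # assumed true slope and sample size
X = rng.normal(0.0, 1.0, n)
Y = 1.0 + B2 * X + rng.normal(0.0, 1.0, n)

# Proxy X' observed with independent measurement error of variance 1
X_proxy = X + rng.normal(0.0, 1.0, n)

def ols_slope(x, y):
    # OLS slope of y on x (with intercept): cov(x, y) / var(x)
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

slope_true = ols_slope(X, Y)
slope_proxy = ols_slope(X_proxy, Y)
print(slope_true, slope_proxy)
```

With var(X) equal to the measurement-error variance, the proxy slope converges to B2 × var(X)/[var(X) + var(error)], i.e. to half the true slope here, rather than to B2 itself.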

Now we move to a detailed discussion of the two types of specification errors: omission of relevant variables and inclusion of irrelevant variables.

4. OMISSION OF RELEVANT VARIABLE(S)/UNDER-FITTING OF A MODEL

As already discussed, we commit a specification error in the form of omission of relevant variable(s) when we exclude one or more important explanatory variables from the regression model. This type of specification error is also known as omitted variable bias.

Suppose the following regression model is the true regression model:

Yi = B1 + B2X2i + B3X3i + u1i ,

This is a three-variable regression model, but instead of estimating it, we end up, for any of many possible reasons, estimating the following model:

Yi = A1 + A2X2i + u2i ,

Here we use different notation for the regression coefficients of the estimated model in order to distinguish it from the true model. Clearly, the estimated model omits the relevant variable X3 and is therefore a mis-specified regression model, since X3, as per the true model, explains a certain amount of the total variation in the dependent variable Yi. The consequences of committing omitted variable bias are discussed below.

4.1 THE CONSEQUENCES OF ‘OMITTED VARIABLE BIAS’

The following are the consequences of omitting relevant variables from the regression
model-

a) The OLS estimators of the estimated model are biased if the excluded variable X3 is correlated with the included variable X2. In other words,
E(a1) ≠ B1 and E(a2) ≠ B2,
where E denotes the expected value and a1 and a2 are the OLS estimators of A1 and A2 respectively.
Let us prove this consequence.


True model: Yi = B1 + B2X2i + B3X3i + u1i

Writing the true model in deviation-from-mean form gives

yi = B2x2i + B3x3i + (u1i − ū1),

where lower-case letters denote deviations from the respective sample means (e.g. yi = Yi − Ȳ) and ū1, the sample mean of the random error term, is not necessarily zero.

In the estimated model, the OLS estimator of A2 is given by the formula:

a2 = Σx2iyi / Σx2i²

Substituting the value of yi from the true model into the formula gives

a2 = B2 + B3(Σx2ix3i / Σx2i²) + Σx2i(u1i − ū1) / Σx2i².

Taking expected values on both sides gives

E(a2) = B2 + B3(Σx2ix3i / Σx2i²) + E[Σx2i(u1i − ū1)] / Σx2i².

Since the explanatory variables are assumed to be non-stochastic and

E(u1i − ū1) = E(u1i) − E(ū1) = 0 − 0 = 0

(by the assumptions of the Classical Linear Regression Model, where n is the sample size), the last term on the R.H.S. becomes zero and we get

E(a2) = B2 + B3(Σx2ix3i / Σx2i²).

If X2 and X3 are correlated and we run a regression of X3 on X2, then the non-stochastic Sample Regression Function (SRF) is given by

X̂3i = b1 + b2X2i, with b2 = Σx2ix3i / Σx2i².

Substituting b2 into E(a2) gives

E(a2) = B2 + B3b2 ≠ B2.

Since X2 and X3 are correlated, b2 ≠ 0. Thus, E(a2) = B2 only when B3 = 0, which is the case where X3 has no effect on Y and there is no mis-specification. However, the true model assumes B3 ≠ 0, and hence we have

E(a2) ≠ B2 ⟹ a2 is a biased estimator of B2.

Similarly,

a1 = Ȳ − a2X̄2.

Taking expected values on both sides gives

E(a1) = E(Ȳ) − E(a2)X̄2

= (B1 + B2X̄2 + B3X̄3) − (B2 + B3b2)X̄2 [since E(a2) = B2 + B3b2]

= B1 + B3X̄3 − B3b2X̄2

= B1 + B3(X̄3 − b2X̄2) ≠ B1 (since b2 ≠ 0 and B3 ≠ 0).

⟹ E(a1) ≠ B1, and hence a1 is a biased estimator of B1.

This shows that if b2 > 0 and B3 > 0, or b2 < 0 and B3 < 0, then a2 will overestimate the true value of B2 on average, and hence there is an upward bias. On the other hand, if b2 > 0 and B3 < 0, or vice versa, then a2 will underestimate the true value of B2 on average, so there is a downward bias. Also, if B3(X̄3 − b2X̄2) > 0 then a1 will overestimate the true value of B1, and if B3(X̄3 − b2X̄2) < 0 then it will underestimate the true value of B1. This is because X2 now picks up not only its direct influence on Y but also the indirect influence of X3 on Y, as shown in Figure 1 below.

Figure 1


(b) If X2 and X3 are uncorrelated, that is b2 = 0, then E(a2) = B2. In other words, a2 will be an unbiased estimator of B2. However, E(a1) = B1 + B3X̄3 ≠ B1 unless X̄3 = 0, implying that a1 still remains a biased estimator.

(c) In the case where b2 ≠ 0, a1 and a2, apart from being biased, are also inconsistent. This means that the bias does not become zero even in large samples. On the other hand, when b2 = 0, a2 is an unbiased and also a consistent estimator, but a1 remains a biased and inconsistent estimator.

(d) The variance of the error term, σ², is not correctly estimated in the estimated model. This is because the estimator of σ² is σ̂² = RSS/d.f. (residual sum of squares divided by its degrees of freedom), and both the RSS and its d.f. differ between the true and the estimated models. The σ̂² of the true model is an unbiased estimator of σ², while that of the estimated model is a biased estimator of σ².

(e) Since the σ̂² of the estimated model is a biased estimator of σ², Var(a2) will be a biased estimator of the variance of β̂2, the OLS estimator of B2 in the true model. In other words,

E[Var(a2)] ≠ Var(β̂2).

We can write the formulae for Var(a2) and Var(β̂2) as follows:

Var(a2) = σ² / Σx2i² and

Var(β̂2) = σ² / [Σx2i²(1 − r²)],

where r is the coefficient of correlation between X2 and X3; Var(a2) is a biased estimator while Var(β̂2) is an unbiased one. Since 0 < r² < 1,

⟹ Var(a2) < Var(β̂2).

Hence, there is a trade-off involved: a2 is a biased estimator but has a smaller variance than β̂2, which is an unbiased estimator of B2.

Also, it can be shown that when X2 and X3 are uncorrelated, that is b2 = 0,

E[Var(a2)] = Var(β̂2) + B3²Σx3i² / [(n − 2)Σx2i²] ≠ Var(β̂2),

and since the last term on the R.H.S. is positive, it implies that, on average, Var(a2) will overestimate the variance of β̂2, and hence there will always be a positive bias associated with Var(a2).


(f) As a result, the usual hypothesis-testing and confidence-interval procedures are not reliable and may give wrong conclusions about the statistical significance of the regression coefficients. The confidence intervals in this case will be wider, so there is a greater likelihood of failing to reject the null hypothesis that a coefficient is zero.

Considering the consequences of omitting important variables from the regression model, they are clearly serious. Hence, once a regression model has been formulated on the basis of economic theory, it is better not to drop a variable from it, because dropping relevant variables results in under-specifying, or under-fitting, the regression model.
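The result E(a2) = B2 + B3·b2 can be checked with a short Monte Carlo simulation. The sketch below uses illustrative values (B1 = 1, B2 = 2, B3 = 3 and a positively correlated X3, all assumptions for this demonstration) and averages the under-fitted OLS slope over many error draws while holding the regressors fixed, as the non-stochastic-regressor assumption requires:

```python
import numpy as np

rng = np.random.default_rng(0)
B1, B2, B3 = 1.0, 2.0, 3.0   # assumed true coefficients
n, reps = 200, 2000

# Fixed (non-stochastic) regressors, with X3 correlated with X2
X2 = rng.normal(0.0, 1.0, n)
X3 = 0.5 * X2 + rng.normal(0.0, 1.0, n)

x2 = X2 - X2.mean()
b2_aux = (x2 * (X3 - X3.mean())).sum() / (x2 ** 2).sum()  # slope of X3 on X2

a2_draws = []
for _ in range(reps):
    u = rng.normal(0.0, 1.0, n)
    Y = B1 + B2 * X2 + B3 * X3 + u
    # Under-fitted model: regress Y on X2 only
    a2 = (x2 * (Y - Y.mean())).sum() / (x2 ** 2).sum()
    a2_draws.append(a2)

print(np.mean(a2_draws))     # settles near B2 + B3*b2_aux, not near B2
print(B2 + B3 * b2_aux)
```

Since b2 > 0 and B3 > 0 here, the average of a2 settles well above the true B2, an upward bias, exactly as derived above.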

4.2 AN EXAMPLE OF ‘OMITTED VARIABLE BIAS’

The following example is adapted from “Christopher Dougherty, Introduction to Econometrics, 4th edition, OUP, Indian edition”, page 254.

Suppose the following model is the true model:

HGi = B1 + B2HGMi + B3CAi + ui

Where HG = Highest grade completed by the individual, HGM = Highest grade completed by
the individual’s mother and CA = measure of cognitive ability of the individual.

The results of this regression are as follows:

ĤGi = 5.42 + 0.1235 HGMi + 0.1328 CAi

se = (0.493) (0.0331) (0.0097)

t = (10.99)* (3.73)* (13.64)*

*Significant at 1 % level of significance. R2 = 0.3543, n = 540, F(2, 537) = 147.36, Prob > F = 0.0000, RMSE = 1.963.

Now we drop CA from the regression model, giving the following results:

ĤGi = 10.046 + 0.313 HGMi

se = (0.4147) (0.0348)

t= (24.23)* (9.00)*


*Significant at 1 % level of significance. R2 = 0.1308, n = 540, F(1, 538) = 80.93, Prob > F = 0.0000, RMSE = 2.2756.

Also, running a regression of CA on HGM gives a slope coefficient of 0.42. This means there will be an upward bias associated with the coefficient of HGM, since b2 > 0 and B3 > 0. This is visible from the two regressions: the coefficient of HGM is indeed larger in the latter regression. Also, R2 = 13.08 % in the latter regression reflects the direct influence of HGM together with the indirect influence of CA on HG, and the remaining consequences of omitting an important variable (here, CA) from the regression model will follow.
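The arithmetic behind this example rests on an identity that holds exactly within any sample: the short-regression coefficient equals the long-regression coefficient on HGM plus the coefficient on CA times the auxiliary slope of CA on HGM. The sketch below demonstrates the identity on synthetic data; the variable names mirror the example, but every number, including n = 540 and the coefficients, is a simulated assumption, not Dougherty's data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 540  # same sample size as the example; the data here are synthetic
HGM = rng.normal(12.0, 2.0, n)                 # mother's schooling (simulated)
CA = 0.4 * HGM + rng.normal(0.0, 5.0, n)       # cognitive ability, correlated with HGM
HG = 5.4 + 0.12 * HGM + 0.13 * CA + rng.normal(0.0, 2.0, n)

def slope(x, y):
    # OLS slope of y on x (with intercept)
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

# Long regression of HG on a constant, HGM and CA
X = np.column_stack([np.ones(n), HGM, CA])
b0, b_hgm, b_ca = np.linalg.lstsq(X, HG, rcond=None)[0]

a_short = slope(HGM, HG)   # short-regression coefficient on HGM
d_aux = slope(HGM, CA)     # auxiliary slope: omitted CA regressed on HGM

# The identity holds exactly in-sample, not just in expectation
print(a_short, b_hgm + b_ca * d_aux)
```

The two printed numbers agree to machine precision, which is why the direction of the bias can be read directly off the signs of the long-regression coefficient on CA and the auxiliary slope.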

5. INCLUSION OF IRRELEVANT VARIABLE(S)/OVER-FITTING OF A MODEL

The next type of specification error is the inclusion of unnecessary explanatory variables in the regression model. This happens because the researcher may sometimes want to include all available explanatory variables in the model (the so-called ‘kitchen sink approach’), even if some of these variables are not important from the viewpoint of the theoretical framework of the model. Another reason for including irrelevant variables is that sometimes the theoretical model itself is not known, and hence researchers end up including such explanatory variables in the regression model. Also, including more and more explanatory variables, even unnecessary ones, raises the unadjusted R2 of the regression model; the adjusted R2, however, increases only when the absolute value of the t-statistic of the added coefficient is greater than one. This apparent increase in the goodness of fit can tempt us to over-fit. Thus, by including such irrelevant explanatory variables, we commit a specification error in the form of inclusion of irrelevant variables, or irrelevant variable bias.

So, suppose the following regression model is the true model-

Yi = B1 + B2X2i + u1i ,

but instead of estimating this two-variable regression model, we estimate the following three-variable model:

Yi = C1 + C2X2i + C3X3i + u3i ,

Hence, the latter model is mis-specified in the sense that it contains an unnecessary variable, X3, which was not in the true model. By including this variable we have over-fitted our regression model. Again, there are consequences of committing irrelevant variable bias, which are discussed below.

5.1 THE CONSEQUENCES OF ‘IRRELEVANT VARIABLE BIAS’

The following are the consequences of including an unnecessary variable, which is X3 in the present scenario, in the regression model:

(a) Unlike in the case of omitted variable bias, the OLS estimators of the mis-specified model are still unbiased, and hence also consistent. Symbolically,

E(c1) = B1, E(c2) = B2 and E(c3) = B3 = 0,

where E denotes the expected value and c1, c2 and c3 are the OLS estimators of B1, B2 and B3 respectively, with B3 = 0 as per the true model.

Let us also prove this consequence.

True model: Yi = B1 + B2X2i + u1i

Averaging the true model over the sample gives Ȳ = B1 + B2X̄2 + ū1. Subtracting these two equations gives

(Yi − Ȳ) = B2(X2i − X̄2) + (u1i − ū1)
⟹ yi = B2x2i + (u1i − ū1),

where bars denote the respective sample means and ū1, the sample mean of the random error term, is not necessarily zero.

In the mis-specified model, the OLS estimator of C2 is given by the formula

c2 = [Σyix2i Σx3i² − Σyix3i Σx2ix3i] / [Σx2i² Σx3i² − (Σx2ix3i)²],

where x2i = X2i − X̄2, x3i = X3i − X̄3, and yi = Yi − Ȳ.

Now, substituting the value of yi into the formula for c2 gives

c2 = B2 + [Σ(u1i − ū1)x2i Σx3i² − Σ(u1i − ū1)x3i Σx2ix3i] / [Σx2i² Σx3i² − (Σx2ix3i)²].

Taking expected values on both sides, and noting that the regressors are non-stochastic while, as already shown, E(u1i − ū1) = 0, the second term vanishes and we get

E(c2) = B2

⟹ c2 is an unbiased estimator of the true parameter B2.

Similarly, we can show that c3 is an unbiased estimator of B3 (where B3 = 0) as follows:

c3 = [Σyix3i Σx2i² − Σyix2i Σx2ix3i] / [Σx2i² Σx3i² − (Σx2ix3i)²].

Substituting yi = B2x2i + (u1i − ū1), the terms in B2 cancel and we get

c3 = [Σ(u1i − ū1)x3i Σx2i² − Σ(u1i − ū1)x2i Σx2ix3i] / [Σx2i² Σx3i² − (Σx2ix3i)²].

Taking expected values on both sides, and again using E(u1i − ū1) = 0, we get

E(c3) = 0 = B3 ⟹ c3 is an unbiased estimator of B3.

The formula for the OLS estimator of C1 is given by

c1 = Ȳ − c2X̄2 − c3X̄3

⟹ E(c1) = E(Ȳ) − E(c2)X̄2 − E(c3)X̄3

= (B1 + B2X̄2) − B2X̄2 − 0

⟹ E(c1) = B1, thus c1 is an unbiased estimator of B1.

(b) The variance of the error term, σ2, is correctly estimated in the mis-specified model.

(c) Thus, the usual hypothesis-testing and confidence interval approaches remain
reliable. That is, the usual t and F tests remain valid.

(d) However, the OLS estimators of the mis-specified model have larger variances than the variances of the estimators of the B’s in the true model. In other words, the estimators are not efficient in general; they are no longer BLUE (they remain linear, unbiased estimators, LUE).

We can show this by comparing the variance of the estimator of B2 in the true model, Var(β̂2), with that of the estimator of C2 in the mis-specified model, Var(c2):

Var(β̂2) = σ² / Σx2i² and

Var(c2) = σ² / [Σx2i²(1 − r²)],

where r is the coefficient of correlation between X2 and X3. Since 0 < r² < 1,

⟹ Var(β̂2) < Var(c2).

Thus, although c2 is an unbiased estimator of B2, its variance is larger than that of the estimator of the true B2. Similarly, we have Var(β̂1) < Var(c1).

This shows that the inclusion of the irrelevant variable X3 leads to larger variances of the OLS estimators, making them less precise.
(e) It is impossible to determine the contribution of each explanatory variable to R2 in
the regression model.

These are the consequences of including unnecessary explanatory variables in the regression model. If we compare them with the consequences of omitted variable bias, we can clearly see that they are much less serious than those of omitting relevant variables. It might therefore seem that we should rather include unnecessary explanatory variables than exclude important ones. However, we must keep the following in mind in the case of inclusion of irrelevant variables:

(i) It will lead to larger variances of OLS estimators making them inefficient and
hence less precise,
(ii) It will lead to loss of degrees of freedom, and
(iii) It may also lead to the problem of multicollinearity.

Hence, it is always better to keep in the regression model only those explanatory variables which are relevant from the theoretical point of view and are known to directly influence the dependent variable.
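These consequences can be illustrated with a small simulation (a sketch with illustrative values: B2 = 2, an irrelevant X3 correlated with X2, and unit error variance, all assumptions for the demonstration). Across repeated error draws, the over-fitted slope estimate stays centred on B2 but is noticeably more variable than the correctly specified one:

```python
import numpy as np

rng = np.random.default_rng(2)
B1, B2 = 1.0, 2.0            # assumed true model: Y = B1 + B2*X2 + u
n, reps = 100, 3000

X2 = rng.normal(0.0, 1.0, n)
X3 = 0.8 * X2 + 0.6 * rng.normal(0.0, 1.0, n)  # irrelevant, but correlated with X2

X_true = np.column_stack([np.ones(n), X2])
X_over = np.column_stack([np.ones(n), X2, X3])

b2_draws, c2_draws = [], []
for _ in range(reps):
    u = rng.normal(0.0, 1.0, n)
    Y = B1 + B2 * X2 + u     # X3 plays no role in generating Y
    b2_draws.append(np.linalg.lstsq(X_true, Y, rcond=None)[0][1])
    c2_draws.append(np.linalg.lstsq(X_over, Y, rcond=None)[0][1])

print(np.mean(c2_draws))                   # still close to B2: unbiased
print(np.var(b2_draws), np.var(c2_draws))  # variance inflated by the extra regressor
```

With corr(X2, X3) around 0.8, the variance of c2 is inflated by roughly 1/(1 − r²) relative to the true-model estimator, which is the Var(c2) formula from consequence (d) at work.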

5.2 AN EXAMPLE OF ‘IRRELEVANT VARIABLE BIAS’


The following example is adapted from chapter 3 of “Damodar Gujarati, Econometrics by Example, Palgrave Macmillan”, Tables 3.13 and 3.15.

In this example, the data are on sales of fashion clothing (S), real personal disposable income (I) and a confidence index (CI) over the period 1986 to 1992. The data are quarterly, and since this is time-series data, seasonal dummy variables are included in the model. Suppose the following model is the true model:

St = B1 + B2It + B3CIt + B4D2t + B5D3t + B6D4t + ut

Where D2 = 1 for the second quarter and 0 otherwise, D3 = 1 for the third quarter and 0 otherwise, and D4 = 1 for the fourth quarter and 0 otherwise. This means that the first quarter is the base category in this example. The regression results of this model are as follows:

Ŝt = -152.92 + 1.598 It + 0.293 CIt + 15.045 D2t + 26.00 D3t + 60.87 D4t

se = (52.59) (0.37) (0.084) (4.31) (4.32) (4.42)

t = (-2.90)* (4.31)* (3.48)* (3.48)* (6.01)* (13.74)*

*Significant at 1 % level of significance. R2 = 0.9053, n = 28, F(5, 22) = 42.06, Prob > F = 0.0000.

The results show that all the coefficients are individually and jointly statistically significant.
Now, to this regression model we add the differential slope dummies, and after running the regression we get the following results:

Ŝt = -191.58 + 2.049 It + 0.280 CIt + 196.70 D2t + 123.13 D3t + 50.96 D4t – 1.11 D2t It

se = (107.98) (0.97) (0.15) (221.26) (163.43) (134.78) (1.40)

t = (-1.77)*** (2.56)** (1.79)*** (0.88) (0.75) (0.37) (-0.79)

- 1.21 D3t It – 0.04 D4t It – 0.29 D2t CIt + 0.06 D3t CIt + 0.05 D4t CIt

se = (1.13) (1.01) (0.381) (0.259) (0.201)

t = (-1.07) (-0.04) (-0.77) (0.251) (0.287)

***Significant at 10 % level of significance, **Significant at 5 % level of significance. R2 = 0.9293, n = 28, F(11, 16) = 19.121, Prob > F = 0.0000.


As can be seen from the latter regression results, the added variables are not statistically significant and hence are unnecessary. Also, the standard errors of the coefficients are larger than in the true regression, thereby reducing their precision. The value of R2 is higher in the latter regression simply because variables have been added. Considering these results, we drop the unnecessary differential slope dummies and return to our original true model.
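The R2 comparison can be pushed one step further: the adjusted R2, which penalizes extra regressors, actually falls when the differential slope dummies are added. The sketch below assumes the coefficient counts implied by the two reported regressions (6 and 12 estimated coefficients, including the intercept):

```python
def adj_r2(r2, n, k):
    # Adjusted R-squared; k counts all estimated coefficients, incl. the intercept
    return 1 - (1 - r2) * (n - 1) / (n - k)

n = 28
base = adj_r2(0.9053, n, 6)    # I, CI, D2, D3, D4 plus intercept
full = adj_r2(0.9293, n, 12)   # plus the six differential slope dummies
print(round(base, 4), round(full, 4))
```

So the higher R2 (0.9293 versus 0.9053) masks a drop in adjusted R2 (about 0.8807 versus 0.8838), another signal that the added variables are unnecessary.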

So, we have discussed two types of specification errors along with their consequences. There are other types of specification bias, as mentioned earlier, but their detailed discussion is not part of this chapter and is left for the reader to explore.


PRACTICE QUESTIONS

Q1) Compare in detail the consequences of omitted variable bias and irrelevant variable bias. Which of the two has more serious consequences?

Q2) ‘When we include unnecessary variable(s) in the regression model, the OLS estimators are biased as well as inconsistent.’ State briefly whether this statement is true or false.

Q3) Prove the following, given that there is no correlation between the omitted variable X3 and the included variable X2:

E[Var(a2)] = Var(β̂2) + B3²Σx3i² / [(n − 2)Σx2i²]

Q4) Suppose the true model is given as

Yt = B1 + B2 Xt + ut

Where Y = consumption and X = income. But instead of using this model, a researcher used the following model:

lnYt = B1 + B2 lnXt + ut

What type of specification error is committed in this case? Discuss.

Q5) Consider the following regression model,

Yt = B1 + B2 X2t + B3 X3t + B4 X4t + ut

Where Y = demand for Wagon R, X2 = price of Wagon R, X3 = per-capita income of the individual and X4 = price of Alto. Taking into account the attributes of a good model, can you say that the given regression model is good?

Q6) Which of the following options is correct?

The OLS estimators in the over-fitted model are

(i) Inconsistent estimators
(ii) Inefficient estimators
(iii) Biased estimators
(iv) All of the above

Q7) Suppose the true regression model is

Yt = α1 + α2 X2t + u1t

However, an irrelevant variable X3 is added and following regression model was used-

Yt = α1 + α2 X2t + α3 X3t + u2t

(i) How would the inclusion of X3 affect the OLS estimators?


(ii) What will happen to the values of R2 and adjusted R2 after the inclusion of X3?
(iii) What will you do to avoid this situation?

Q8) In the case of omitted variable bias, a trade-off is involved: a2 is a biased estimator but has a smaller variance than β̂2, which is an unbiased estimator of B2. How would you, then, choose between the two estimators?

Q9) If the following model is estimated-

Yt = δ1 + δ2 X2t + u1t

Instead of

Yt = α1 + α2 X2t + α3 X3t + u2t

Where it is known that X3 is a relevant variable, would you say that E(δ̂1) = α1 and E(δ̂2) = α2, where δ̂1 and δ̂2 are the OLS estimators of δ1 and δ2?

Q10) What are the reasons that may lead a researcher to estimate a mis-specified model?

