04 Violation of Assumptions
Multicollinearity
Multicollinearity is a phenomenon that may be observed in multiple linear regression. It results in:
1. High standard errors (small t-ratios) for the parameter estimates together with a high R².
2. Wide confidence intervals for the parameter estimates.
The variance of the OLS estimator of β_j is

Var(\hat{\beta}_j) = \frac{\sigma^2}{SST_j (1 - R_j^2)}

where R_j^2 is the squared multiple correlation coefficient from regressing X_j on the other elements of X, and

SST_j = \sum_i (X_{ij} - \bar{X}_j)^2, \quad \forall j = 2, 3, \ldots, k.

If X_j were uncorrelated with the other regressors (R_j^2 = 0), this variance would be \sigma^2 / SST_j. Based on this fact, we use the ratio of Var(\hat{\beta}_j) to \sigma^2 / SST_j to obtain a measure of the degree of multicollinearity in the model, which is known as the Variance Inflation Factor (VIF_j), i.e.

VIF_j = \frac{Var(\hat{\beta}_j)}{\sigma^2 / SST_j} = \frac{1}{1 - R_j^2} \ge 1, \quad \forall j = 2, 3, \ldots, k.
There is a positive relationship between R_j^2 and VIF_j, which is depicted in the following table:
R_j^2    VIF_j
0.1      1.11
0.2      1.25
0.5      2
0.8      5
0.9      10
0.95     20

where VIF_j = 1/(1 - R_j^2).
Detection of Multicollinearity
1. The estimates are unstable: even though R² is high, the individual parameter estimates are imprecise (statistically insignificant). In fact, if R_j^2 = 1 the parameter estimates are indeterminate.
2. Deletion/addition of an explanatory variable causes large changes in the estimates of the
remaining coefficients.
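In practice the VIFs can be obtained directly after an OLS regression. A minimal Stata sketch (consistent with the Stata commands used later in these notes; y, x2 and x3 are illustrative variable names):

regress y x2 x3
estat vif        // reports VIF_j = 1/(1 - R_j^2) for each regressor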
Suppose that Y = 2 + 3X1 + X2 and that X2 = 2X1 − 1. There is no disturbance term in the equation for Y, but that is not important. Suppose that we have the six observations shown below.
X1   X2   Y     ΔX1   ΔX2   ΔY
10   19   51     -     -     -
11   21   56     1     2     5
12   23   61     1     2     5
13   25   66     1     2     5
14   27   71     1     2     5
15   29   76     1     2     5
The three variables are plotted as line graphs below. Looking at the data, it is impossible to tell
whether the changes in Y are caused by changes in X1, by changes in X2, or jointly by changes in
both X1 and X2.
The OLS estimator of the coefficient on X1 is

\hat{\beta}_1 = \frac{Cov(X_1, Y)\,Var(X_2) - Cov(X_2, Y)\,Cov(X_1, X_2)}{Var(X_1)\,Var(X_2) - [Cov(X_1, X_2)]^2}

Substituting X_2 = 2X_1 - 1, so that Cov(X_2, Y) = 2Cov(X_1, Y), Var(X_2) = 4Var(X_1) and Cov(X_1, X_2) = 2Var(X_1),

\hat{\beta}_1 = \frac{4\,Cov(X_1, Y)\,Var(X_1) - 2\,Cov(X_1, Y)\cdot 2\,Var(X_1)}{4[Var(X_1)]^2 - [2\,Var(X_1)]^2} = \frac{0}{0}
It turns out that both the numerator and the denominator are equal to zero. The regression
coefficient is not defined.
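One way to see this in practice is to feed the six observations to the software. A hedged Stata sketch (variable names are illustrative); the regression output will note that one of the regressors is omitted because of perfect collinearity:

clear
input x1 x2 y
10 19 51
11 21 56
12 23 61
13 25 66
14 27 71
15 29 76
end
regress y x1 x2   // one of x1, x2 is dropped: they are perfectly collinear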
It is unusual for there to be an exact relationship among the explanatory variables in a regression. When this occurs, it is typically because there is a logical error in the specification.
Possible measures for alleviating multicollinearity
What can we do about this problem if it is encountered?
Note that:
1. Multicollinearity does not cause the regression coefficients to be biased.
2. The standard errors and t tests remain valid.
The problem is that the standard errors are larger than they would be in the absence of multicollinearity.
How can we reduce the variances?
We might be able to reduce them by bringing more variables into the model and thereby reducing the population variance of the disturbance term.
Recall: the population variance of \hat{\beta}_j is

Var(\hat{\beta}_j) = \frac{\sigma_u^2}{n\,Var(X_j)} \times \frac{1}{1 - R_j^2}
1) Reduce σ_u^2 by including further relevant variables in the model.
2) Increase the number of observations.
Time series: decrease the time interval (from yearly to quarterly etc.)
Cross section: Increase the sample size.
3) Increase Var(Xj).
4) Combine or drop the correlated variables.
a) If the correlated variables are similar conceptually, it may be reasonable to combine them into some overall index; or
b) drop some of the correlated variables, if they have insignificant coefficients.
Dropping variables is dangerous because some of the variables dropped may truly belong in the model, and their omission may cause omitted variable bias.
5) Empirical restriction
Use extraneous information, if available, concerning the coefficient of one of the variables.
For example, suppose that Y in

Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon

is the demand for a category of consumer expenditure, X2 is aggregate disposable personal income, and X3 is a price index for the category.
To fit a model of this type you would use time series data. If X2 and X3 are highly correlated,
which is often the case with time series variables, the problem of multicollinearity might be
eliminated in the following way.
Obtain data on income and expenditure on the category from a household survey and regress Y'
on X'. (The ' marks are to indicate that the data are household data, not aggregate data.)
Y' = \beta_1' + \beta_2' X_2' + \varepsilon'

This is a simple regression because there will be relatively little variation in the price paid by households in cross-section data.
Let b_2' denote the estimate of \beta_2' from this household regression. Now substitute b_2' for \beta_2 in the time series model: subtract b_2' X_2 from both sides of

Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon

and regress

Z = Y - b_2' X_2 on X_3.

This is a simple regression, so multicollinearity has been eliminated.
There are some problems with this technique.
1. The coefficients may be conceptually different in time series and cross-section
contexts.
2. Since we subtract the estimated income component b_2' X_2, not the true income component \beta_2 X_2, from Y when constructing Z, we have introduced an element of measurement error into the dependent variable.
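A sketch of the procedure in Stata is given below; the file names (household.dta, series.dta) and variable names (y_hh, x2_hh, y, x2, x3) are purely illustrative assumptions:

* Step 1: cross-section (household) regression of expenditure on income
use household.dta, clear
regress y_hh x2_hh
scalar b2_cs = _b[x2_hh]       // save the estimated income coefficient b2'

* Step 2: impose this estimate in the time series model
use series.dta, clear
generate z = y - b2_cs*x2      // Z = Y - b2'X2
regress z x3                   // simple regression of Z on the price index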
6. Theoretical restriction
Use a theoretical restriction, i.e., a hypothesized relationship among the parameters of the regression model.
It will be explained using an educational attainment model as an example. Suppose that we
hypothesize that highest grade obtained, Y, depends on Ability, X2 and highest education level
completed by the respondent's mother and father, X3 and X4, respectively.
Because X3 and X4 tend to be highly correlated, one of them may turn out to be insignificant; mother's education is generally held to be at least as important as, if not more important than, father's education for educational attainment, so such an outcome is unexpected.
Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon

If theory suggests the restriction \beta_3 = \beta_4, therefore

Y = \beta_1 + \beta_2 X_2 + \beta_3 (X_3 + X_4) + \varepsilon

and we regress Y on X_2 and the combined variable (X_3 + X_4).
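In Stata the restriction is imposed simply by regressing on the sum of the two collinear variables; a minimal sketch (variable names as in the example above):

generate x34 = x3 + x4    // combined parental education, imposing b3 = b4
regress y x2 x34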
Multiple Regression Analysis: Heteroskedasticity
What is Heteroskedasticity?
The assumption of homoskedasticity implies that, conditional on the X variables, the variance of the unobserved error u is constant.
If this is not true, that is, if the variance of u is different for different values of the X's, then the errors are heteroskedastic.
Example: in a savings equation, heteroskedasticity is present if the variance of the
unobserved factors affecting savings increases with income.
Homoskedasticity is needed to justify the usual t tests, F tests, and confidence intervals for OLS estimation of the linear regression model, even with large sample sizes.
Example of Heteroskedasticity
[Figure: conditional densities f(Y|X) around the regression line E(Y|X) = β0 + β1X at X1, X2 and X3; the spread of Y increases with X.]
Why Worry About Heteroskedasticity?
Consider again the multiple linear regression model:

Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k + u

Recall: \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k are unbiased under the first four Gauss-Markov assumptions.
The homoskedasticity assumption plays no role in showing that OLS is unbiased.
Heteroskedasticity therefore does not cause bias in the OLS estimators of the \beta_j.
Why introduce it as one of the Gauss-Markov assumptions?
Recall 1: without the homoskedasticity assumption, the usual estimators of Var(\hat{\beta}_j) are biased.
Given that OLS standard errors are based on these variances, they are invalid for constructing
confidence intervals and t statistics.
The OLS t and F statistics do not have t and F distributions in the presence of
heteroskedasticity.
The statistics we use to test hypotheses under the Gauss-Markov assumptions are therefore invalid in the presence of heteroskedasticity.
Recall 2: The Gauss-Markov theorem says that OLS is BLUE, and that result relies on the homoskedasticity assumption.
If Var(u|X) is not constant, OLS is no longer BLUE.
In addition, OLS is no longer asymptotically efficient in the class of estimators described in
Theorem 5.3.
It is possible to find estimators that are more efficient than OLS in the presence of heteroskedasticity (although this requires knowing the form of the heteroskedasticity).
Detecting Heteroskedasticity
Y      X      (Y−Ȳ)    (X−X̄)    (X−X̄)(Y−Ȳ)   (X−X̄)²   Ŷ      û      û²
19.9 22.3 -3.66 -2.95 10.78 8.70 20.90 -1.00 1.00
31.2 32.3 7.65 7.05 53.90 49.70 29.90 1.30 1.70
31.8 36.6 8.25 11.35 93.58 128.82 33.76 -1.96 3.85
12.1 12.1 -11.46 -13.15 150.63 172.92 11.73 0.37 0.14
40.7 42.3 17.15 17.05 292.32 290.70 38.89 1.81 3.28
6.1 6.2 -17.46 -19.05 332.52 362.90 6.42 -0.32 0.10
38.6 44.7 15.05 19.45 292.63 378.30 41.05 -2.45 5.99
25.5 26.1 1.95 0.85 1.65 0.72 24.32 1.18 1.39
10.3 10.3 -13.26 -14.95 198.16 223.50 10.11 0.19 0.04
38.8 40.2 15.25 14.95 227.91 223.50 37.00 1.80 3.24
8 8.1 -15.56 -17.15 266.77 294.12 8.13 -0.13 0.02
33.1 34.5 9.55 9.25 88.29 85.56 31.87 1.23 1.50
33.5 38 9.95 12.75 126.80 162.56 35.02 -1.52 2.31
13.1 14.1 -10.46 -11.15 116.57 124.32 13.53 -0.43 0.18
14.8 16.4 -8.76 -8.85 77.48 78.32 15.60 -0.80 0.63
21.6 24.1 -1.96 -1.15 2.25 1.32 22.52 -0.92 0.85
29.3 30.1 5.75 4.85 27.86 23.52 27.92 1.38 1.91
25 28.3 1.45 3.05 4.41 9.30 26.30 -1.30 1.68
17.9 18.2 -5.66 -7.05 39.87 49.70 17.21 0.69 0.47
19.8 20.1 -3.76 -5.15 19.34 26.52 18.92 0.88 0.77
Means: Ȳ = 23.56, X̄ = 25.25;  Σ(X−X̄)(Y−Ȳ) = 2423.73,  Σ(X−X̄)² = 2695.05,  Σû² = 31.07
b1 = 0.899327   (slope: Σ(X−X̄)(Y−Ȳ) / Σ(X−X̄)²)
b0 = 0.852005   (intercept: Ȳ − b1·X̄)
[Figure: scatter plot of Y against X, and plot of the residuals û against X; the spread of the residuals increases with X, suggesting heteroskedasticity.]
Detection of heteroskedasticity
One of the tests regresses the absolute value of the residuals on the square root of X:

|\hat{u}_i| = \alpha_1 + \alpha_2 \sqrt{X_i} + v_i

For our data in levels, all the tests except the RESET reject the null hypothesis of homoskedasticity.
For the log-linear model, all the tests except the RESET and the Goldfeld-Quandt test reject the hypothesis that the errors are homoskedastic.
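The common tests are available after an OLS regression in Stata. A hedged sketch (y and x are illustrative names; the last three lines reproduce the auxiliary regression of the absolute residuals on the square root of X given above):

regress y x
estat hettest              // Breusch-Pagan test for heteroskedasticity
estat imtest, white        // White's general test
predict uhat, residuals
generate absu = abs(uhat)
generate sqrtx = sqrt(x)
regress absu sqrtx         // regression of |uhat| on sqrt(x)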
A valid estimator of Var(\hat{\beta}_j) under heteroskedasticity is

\widehat{Var}(\hat{\beta}_j) = \frac{\sum_i \hat{r}_{ij}^2 \hat{u}_i^2}{SSR_j^2}

where \hat{r}_{ij} is the ith residual from regressing X_j on all the other independent variables, and SSR_j is the sum of squared residuals from this regression.
Now that we have a consistent estimate of the variance, the square root can be used as a
standard error for inference
Typically call these robust standard errors
Sometimes the estimated variance is corrected for degrees of freedom by multiplying by n/(n
– k – 1); as n → ∞ it’s all the same, though
Important to remember that these robust standard errors only have asymptotic justification –
with small sample sizes t statistics formed with robust standard errors will not have a
distribution close to the t, and inferences will not be correct
In Stata, robust standard errors are easily obtained using the robust option of reg
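For example, a minimal sketch (y, x1 and x2 are illustrative names):

regress y x1 x2, vce(robust)   // heteroskedasticity-robust (White) standard errors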
Robust LM Statistic
Run OLS on the restricted model and save the residuals û.
Regress each of the excluded variables on all of the included variables (q different regressions) and save each set of residuals r̂1, r̂2, ..., r̂q.
Regress a variable defined to be equal to 1 on the products r̂1û, r̂2û, ..., r̂qû, with no intercept.
The LM statistic is n – SSR1, where SSR1 is the sum of squared residuals from this final
regression
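A sketch of these steps in Stata for q = 2 excluded variables (x3 and x4; all names are illustrative):

regress y x1 x2                       // restricted model
predict uhat, residuals
regress x3 x1 x2
predict r1, residuals
regress x4 x1 x2
predict r2, residuals
generate r1u = r1*uhat
generate r2u = r2*uhat
generate one = 1
regress one r1u r2u, noconstant       // final regression with no intercept
display "robust LM = " e(N) - e(rss)  // n - SSR1, compare with chi-squared(2)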
Effects/consequences of heteroskedasticity
Heteroskedasticity implies that the variance of the error terms is not constant. Thus,

Var(u_i) = E[u_i^2] = \sigma_i^2
The main effects of this are:
1. The least squares estimates are still unbiased, but they are inefficient, and
2. The estimates of the variance are biased–thus the tests of significance based on them are
invalid.
Consider Y_i = \beta_0 + \beta_1 X_i + u_i, with Var(u_i) = \sigma_i^2.
The OLS estimator of the slope parameter is

\hat{\beta}_1 = \beta_1 + \frac{\sum_i x_i u_i}{\sum_i x_i^2}, \quad where \; x_i = X_i - \bar{X}.

Taking expectations,

E[\hat{\beta}_1] = \beta_1 + \frac{\sum_i x_i E[u_i]}{\sum_i x_i^2} = \beta_1

so the estimator remains unbiased. Its variance, however, is

Var(\hat{\beta}_1) = E[(\hat{\beta}_1 - \beta_1)^2] = \frac{1}{(\sum_i x_i^2)^2} E\left[\left(\sum_i x_i u_i\right)^2\right]
 = \frac{1}{(\sum_i x_i^2)^2} [x_1^2 E(u_1^2) + x_2^2 E(u_2^2) + \cdots + x_n^2 E(u_n^2)]
 = \frac{\sum_i x_i^2 \sigma_i^2}{(\sum_i x_i^2)^2}

(the cross-product terms vanish because the errors are uncorrelated).
If we know the functional form of the variance up to a multiplicative constant (which in practice we usually do not), we can make the necessary transformations and obtain efficient estimators of \beta, known as weighted least squares - we leave the details for graduate school.
Conclusion:
If the error terms are not homoskedastic
a) OLS estimators are unbiased
b) The OLS estimators are less efficient (i.e., have higher variance) than the WLS.
While it is always possible to estimate robust standard errors for the OLS estimates, if we know something about the specific form of the heteroskedasticity we can obtain estimators that are more efficient than OLS.
The basic idea is going to be to transform the model into one that has homoskedastic errors –
called weighted least squares
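For instance, if we are willing to assume that Var(u_i) is proportional to X_i (an assumption made here purely for illustration), WLS amounts to OLS with analytic weights 1/X_i in Stata:

regress y x [aweight = 1/x]   // WLS under the assumed variance function Var(u_i) proportional to x_i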
Autocorrelation
The term autocorrelation may be defined as correlation between members of a series of observations ordered in time (as in time series data) or in space (as in cross-sectional data).
Autocorrelation versus Serial Correlation: Similarities and differences.
Consider a two-variable linear regression model of the form:

Y_t = \beta_1 + \beta_2 X_t + \varepsilon_t, \quad t = 1, 2, \ldots, T

Autocorrelation is present in the errors if

E[\varepsilon_t \varepsilon_{t-s}] \ne 0 \; for \; s \ge 1, \quad that is, \quad Cov[\varepsilon_t, \varepsilon_{t'}] \ne 0 \; for \; t \ne t'.

The autocovariance at lag s is defined by

\gamma_s = E[\varepsilon_t \varepsilon_{t-s}] \quad for \; s = 0, \pm 1, \pm 2, \ldots

At lag zero, i.e., s = 0, we have \gamma_0 = E[\varepsilon_t \varepsilon_t] = E[\varepsilon_t^2] = \sigma^2, i.e., a constant variance.
The autocorrelation coefficient at lag s is defined by:

\rho_s = \gamma_s / \gamma_0

Note that \rho_0 = 1.
The γ's and the ρ's are symmetric in s (the lag) and do not depend on t (time), i.e., they only
depend on the lags.
The covariance matrix of the \varepsilon_t's is:

E[\varepsilon \varepsilon'] =
\begin{pmatrix}
\gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{T-1} \\
\gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{T-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\gamma_{T-1} & \gamma_{T-2} & \gamma_{T-3} & \cdots & \gamma_0
\end{pmatrix}

Since \rho_s = \gamma_s / \gamma_0 and letting \gamma_0 = \sigma^2, a constant, we can write the matrix in terms of the autocorrelations as:

E[\varepsilon \varepsilon'] = \sigma^2
\begin{pmatrix}
1 & \rho_1 & \rho_2 & \cdots & \rho_{T-1} \\
\rho_1 & 1 & \rho_1 & \cdots & \rho_{T-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho_{T-1} & \rho_{T-2} & \rho_{T-3} & \cdots & 1
\end{pmatrix}
Given that there is temporal dependence in the errors, how do we best model it? There are three main types of time series process that we could use: autoregressive (AR), moving average (MA), and mixed (ARMA) processes.
Example: monthly data are used to estimate a model that explains the demand for ice cream. The weather will be an important factor hidden in the error term \varepsilon_t: positive and negative residuals group together.
[Figure: scatter plot of ice cream consumption (cons) against time.]
use "C:\Users\6440\Documents\DocA\Econometrics\Econ352\201819\Econ2061\DataVerbeek\icecream.dta"
tsset time
twoway (scatter cons time)
reg cons price income
predict chat
twoway (scatter cons time) (line chat time)
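To see the clustering of positive and negative residuals described above, the same Stata session can be extended as follows (uhat is an illustrative name for the saved residuals):

predict uhat, residuals
twoway (line uhat time), yline(0)   // runs of same-signed residuals suggest autocorrelation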
This is a very general process and may be difficult to estimate. However, we can usually be fairly confident of capturing realistic dynamics in the residuals by considering a low-order AR process. Hence, we will confine our attention to the AR(1) model.
First Order Autocorrelation
Autocorrelation can take many forms; the most popular is the first-order autoregressive process. The model is

Y_t = \beta_1 + \beta_2 X_{2t} + \cdots + \beta_k X_{kt} + \varepsilon_t

with error terms following the relation

\varepsilon_t = \rho \varepsilon_{t-1} + u_t

where u_t is an error term with mean zero and constant variance \sigma_u^2, and no serial correlation. The parameters \rho and \sigma_u^2 are unknown and, along with \beta, we may wish to estimate them. Note that the statistical properties of u_t are the same as those assumed for \varepsilon_t in the standard case: thus if \rho = 0, \varepsilon_t = u_t and the standard Gauss-Markov conditions are satisfied.
To derive the covariance matrix of \varepsilon, we need to make an assumption about the distribution of the initial period error, \varepsilon_1. We assume that:
1. \varepsilon_1 has mean zero and the same variance as all the other \varepsilon_t's.
2. The process has been operating for a long period in the past and |\rho| < 1.
3. When |\rho| < 1 is satisfied, we say the first-order autoregressive process is stationary.
4. A stationary process is such that the mean, variances and covariances of \varepsilon_t do not change over time. Imposing stationarity, it easily follows from
E[\varepsilon_t] = \rho E[\varepsilon_{t-1}] + E[u_t] \;\Rightarrow\; E[\varepsilon_t] = 0

Moreover,

E[\varepsilon_t^2] = E[(\rho \varepsilon_{t-1} + u_t)^2] = \rho^2 E[\varepsilon_{t-1}^2] + \sigma_u^2

If we let \sigma_\varepsilon^2 = E[\varepsilon_t^2], we have

\sigma_\varepsilon^2 = \rho^2 \sigma_\varepsilon^2 + \sigma_u^2 \;\Rightarrow\; \sigma_\varepsilon^2 = \frac{\sigma_u^2}{1 - \rho^2}
The nondiagonal elements in the variance-covariance matrix of \varepsilon follow from

Cov[\varepsilon_t, \varepsilon_{t-1}] = E[(\rho \varepsilon_{t-1} + u_t)\varepsilon_{t-1}] = E[\rho \varepsilon_{t-1}^2 + u_t \varepsilon_{t-1}]
 = \rho E[\varepsilon_{t-1}^2] + E[u_t \varepsilon_{t-1}] = \rho \sigma_\varepsilon^2

The covariance between error terms two periods apart is

Cov[\varepsilon_t, \varepsilon_{t-2}] = E[(\rho \varepsilon_{t-1} + u_t)\varepsilon_{t-2}]
 = \rho E[\varepsilon_{t-1}\varepsilon_{t-2}] + E[u_t \varepsilon_{t-2}] = \rho \cdot \rho \sigma_\varepsilon^2 = \rho^2 \sigma_\varepsilon^2

and in general we have, for non-negative values of s,

Cov[\varepsilon_t, \varepsilon_{t-s}] = \rho^s \sigma_\varepsilon^2
Thus for 0 < |\rho| < 1 all elements in \varepsilon are mutually correlated, with a covariance that decreases as the distance s between them increases. The covariance matrix of \varepsilon is a full matrix (a matrix without zero elements). Given our model
Y_t = \beta_1 + \beta_2 X_{2t} + \cdots + \beta_k X_{kt} + \varepsilon_t

with \varepsilon_t = \rho \varepsilon_{t-1} + u_t, where u_t satisfies the Gauss-Markov conditions, a transformation that leads to an error term

u_t = \varepsilon_t - \rho \varepsilon_{t-1}

will generate homoskedastic, non-autocorrelated errors. To do so, take

Y_t = \beta_1 + \beta_2 X_{2t} + \cdots + \beta_k X_{kt} + \varepsilon_t,

lag it by one period and multiply by \rho to get

\rho Y_{t-1} = \rho \beta_1 + \rho \beta_2 X_{2,t-1} + \cdots + \rho \beta_k X_{k,t-1} + \rho \varepsilon_{t-1},

then subtract from Y_t to get

Y_t - \rho Y_{t-1} = \beta_1(1 - \rho) + \beta_2(X_{2t} - \rho X_{2,t-1}) + \cdots + \beta_k(X_{kt} - \rho X_{k,t-1}) + \varepsilon_t - \rho \varepsilon_{t-1}, i.e.,

Y_t - \rho Y_{t-1} = \beta_1(1 - \rho) + \beta_2(X_{2t} - \rho X_{2,t-1}) + \cdots + \beta_k(X_{kt} - \rho X_{k,t-1}) + u_t
Note: this transformation cannot be applied to the first observation (because Y_0 and X_0 are not observed). The information in this first observation is lost, and OLS on the transformed model produces only an approximate GLS estimator. With large T, the loss of a single observation will typically not have a large impact on the results.
We can rescue the first observation by noting that \varepsilon_1 is uncorrelated with all the u_t's, t = 2, \ldots, T. However, the variance of \varepsilon_1 (= \sigma_u^2/(1 - \rho^2)) is much larger than the variance of the transformed errors (u_2, \ldots, u_T), particularly when \rho is close to unity. To obtain homoskedastic and non-autocorrelated errors in a transformed model that includes the first observation, this first observation should be transformed by multiplying it by \sqrt{1 - \rho^2}, i.e.,

\sqrt{1 - \rho^2}\, Y_1 = \sqrt{1 - \rho^2}\, \beta_1 + \sqrt{1 - \rho^2}\, \beta_2 X_{21} + \cdots + \sqrt{1 - \rho^2}\, \beta_k X_{k1} + \sqrt{1 - \rho^2}\, \varepsilon_1

OLS applied to the complete set of transformed variables leads to BLUE estimators, known as the GLS estimator.
Early work (Cochrane and Orcutt, 1949) dropped the first (transformed) observation to estimate
β from the remaining T − 1 transformed observations.
The estimator that uses all transformed observations is sometimes called the Prais–Winsten
(1954) estimator.
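Both estimators are available in Stata through the prais command; continuing the ice cream example from above as an illustration:

prais cons price income          // Prais-Winsten: keeps the transformed first observation
prais cons price income, corc    // Cochrane-Orcutt: drops the first observation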
Unknown ρ
When \rho is unknown, it can be estimated from the OLS residuals \hat{\varepsilon}_t by regressing \hat{\varepsilon}_t on \hat{\varepsilon}_{t-1}:

\hat{\rho} = \frac{\sum_{t=2}^{T} \hat{\varepsilon}_t \hat{\varepsilon}_{t-1}}{\sum_{t=2}^{T} \hat{\varepsilon}_{t-1}^2}
This estimator is typically biased, but under weak regularity conditions it is a consistent estimator of \rho. Using \hat{\rho} instead of \rho to compute the feasible (estimated) GLS (EGLS) estimator, the BLUE property is no longer retained. Under the same conditions as before, however, the EGLS estimator is asymptotically equivalent to the GLS estimator. That is, for large sample sizes we can ignore the fact that \rho is estimated.
In the Cochrane-Orcutt procedure, which is implemented in many software packages, \rho and \beta are estimated iteratively until convergence.
Clearly, an R² close to zero in this auxiliary regression implies that lagged residuals do not explain current residuals, and a simple way to test \rho = 0 is by computing (T - 1)R², which is asymptotically \chi^2(1) under the null.
If the model of interest includes Y_{t-1} (or other explanatory variables that may be correlated with lagged error terms), these tests are still appropriate provided that the regressors X_t are included in the auxiliary regression.
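This auxiliary-regression (Breusch-Godfrey) test is available after an OLS regression in Stata; a sketch using the ice cream example from above:

regress cons price income
estat bgodfrey, lags(1)   // LM test of no first-order autocorrelation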
The Durbin–Watson Test
A popular test for first-order autocorrelation is the Durbin-Watson test (Durbin and Watson, 1950), whose small-sample distribution is derived under a restrictive set of conditions.
Two important assumptions:
a) the X's are deterministic (no Y_{t-1} in the model), and
b) the regression contains an intercept term.
The simplest and most commonly used model is one where the errors \varepsilon_t and \varepsilon_{t-1} have correlation \rho.
Think of testing hypotheses about \rho on the basis of \hat{\rho}, the correlation between the least squares residuals \hat{\varepsilon}_t and \hat{\varepsilon}_{t-1}. A commonly used statistic for this purpose (which is related to \hat{\rho}) is the Durbin-Watson (DW) statistic, denoted d. It is defined as

d = \frac{\sum_{t=2}^{T} (\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1})^2}{\sum_{t=1}^{T} \hat{\varepsilon}_t^2}

where \hat{\varepsilon}_t is the estimated residual for period t. We can write d as

d = \frac{\sum \hat{\varepsilon}_t^2 + \sum \hat{\varepsilon}_{t-1}^2 - 2\sum \hat{\varepsilon}_t \hat{\varepsilon}_{t-1}}{\sum \hat{\varepsilon}_t^2}

Since \sum \hat{\varepsilon}_t^2 and \sum \hat{\varepsilon}_{t-1}^2 are approximately equal if the sample is large, we have

d \approx \frac{2\sum \hat{\varepsilon}_t^2 - 2\sum \hat{\varepsilon}_t \hat{\varepsilon}_{t-1}}{\sum \hat{\varepsilon}_t^2}
 = 2 - 2\frac{\sum \hat{\varepsilon}_t \hat{\varepsilon}_{t-1}}{\sum \hat{\varepsilon}_t^2}
 = 2(1 - \hat{\rho})
There are tables to test the hypothesis of zero autocorrelation against the hypothesis of first-order
positive autocorrelation. (For negative autocorrelation we interchange dL and dU)
If d < dL, we reject the null hypothesis of no autocorrelation.
If d > dU we do not reject the null hypothesis.
If dL < d < dU the test is inconclusive.
The upper bound dU of the DW statistic is a good approximation to its distribution when the regressors change slowly; since economic time series typically change slowly, one can use dU as the correct significance point.
The significance points in the DW tables are tabulated for testing \rho = 0 against \rho > 0. If d > 2 and we wish to test the hypothesis \rho = 0 against \rho < 0, we consider 4 - d and refer to the tables as if we were testing for positive autocorrelation.
Although we have said that d \approx 2(1 - \hat{\rho}), this approximation is valid only in large samples.
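After tsset-ing the data and running the OLS regression, Stata reports the statistic directly; a sketch using the ice cream example:

regress cons price income
estat dwatson    // Durbin-Watson d, to be compared with the dL and dU bounds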
DYNAMIC MODELS
So far we have considered purely static models of the form:

Y_t = \beta_1 + \beta_2 X_t + \varepsilon_t
What implications does this have for the underlying behaviour of economic agents?
What is the effect on Y of a change in X?
If X increases by 1 unit in period t, then Y will increase by β2 units instantaneously. This is
unrealistic. In the real world adjustment does not take place immediately—there are lags
involved. These lags may be due to:
i. time to feed through the economy
ii. lack of information
iii. search cost
iv. adjustment costs
Suppose it takes three periods for a change in X to feed through into Y, given by a model of the form:

Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \varepsilon_t
The effect on Y of a one-time change in X is as follows:
Period Effect (Cumulative)
t β0
t+1 β0+ β1
t+2 β0+ β1+ β2
This type of model is known as a DISTRIBUTED LAG MODEL. In general we have a model of the form:

Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \cdots + \beta_k X_{t-k} + \varepsilon_t
    = \alpha + \sum_{i=0}^{k} \beta_i X_{t-i} + \varepsilon_t
The easiest way to estimate a Distributed Lag (DL) model is by OLS, but note that with k lags estimation is based on T - k observations, since the first k observations are lost when the lagged variables are constructed.
Example: t Xt Xt-1 Xt-2
1 12 - -
2 10 12 -
3 14 10 12
4 15 14 10
Suppose we have an infinite distributed lag model in a single explanatory variable of the following form:

Y_t = \alpha + \beta_0 X_t + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \cdots + \varepsilon_t
where the lag length is not defined, that is, we do not specify how far into the past we want to go. One alternative way of estimating such a model is to use the KOYCK approach to the distributed lag model.
If the β's are all of the same sign, Koyck assumes that they decline or decay geometrically as
follows:
\beta_i = \beta_0 \lambda^i, \quad i = 0, 1, 2, \ldots

where \lambda, such that 0 < \lambda < 1, is known as the rate of decline (or decay) of the distributed lag, and 1 - \lambda is known as the speed of adjustment.
By assuming \lambda < 1, we attach smaller weights to the distant \beta's than to the current ones, and we also ensure that the sum of the \beta's, which gives the long-run multiplier, is finite, namely:

\sum_{i=0}^{\infty} \beta_i = \frac{\beta_0}{1 - \lambda}
Using the Koyck assumption, the infinite distributed lag model can be written as:

Y_t = \alpha + \beta_0 X_t + \beta_0 \lambda X_{t-1} + \beta_0 \lambda^2 X_{t-2} + \cdots + \varepsilon_t

Still, this equation is not amenable to easy estimation. Now lag this equation by one period and multiply both sides of the resulting model by \lambda to obtain:

\lambda Y_{t-1} = \lambda \alpha + \beta_0 \lambda X_{t-1} + \beta_0 \lambda^2 X_{t-2} + \beta_0 \lambda^3 X_{t-3} + \cdots + \lambda \varepsilon_{t-1}

Now subtract this equation from the original equation to obtain:

Y_t - \lambda Y_{t-1} = \alpha(1 - \lambda) + \beta_0 X_t + (\varepsilon_t - \lambda \varepsilon_{t-1})

Or, rearranging,

Y_t = \alpha(1 - \lambda) + \beta_0 X_t + \lambda Y_{t-1} + u_t

where u_t = \varepsilon_t - \lambda \varepsilon_{t-1}, a moving average of \varepsilon_t and \varepsilon_{t-1}.
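Because the transformed equation contains only X_t and the lagged dependent variable, it can be estimated directly; a minimal Stata sketch (y and x are illustrative names, and the data are assumed to be tsset):

regress y x l.y   // the coefficient on l.y estimates lambda, that on x estimates beta0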
AUTOREGRESSIVE MODELS
These are models containing lagged dependent variables as regressors, for example:

Y_t = \alpha_0 + \alpha_1 Y_{t-1} + \beta X_t + \varepsilon_t

An AR(1) model may be written as:

Y_t = \alpha_0 + \alpha_1 Y_{t-1} + \varepsilon_t

which does not depend on X_t.
The use of OLS on autoregressive models is biased in finite samples, i.e., E[\hat{\alpha}_1] \ne \alpha_1, but this bias disappears as the sample size gets larger. That is, OLS is consistent.
NB. A special case of the AR(1) model is the random walk, obtained by setting \alpha_0 = 0 and \alpha_1 = 1, i.e.,

Y_t = Y_{t-1} + \varepsilon_t

If \alpha_0 \ne 0, we have a random walk with drift:

Y_t = \alpha_0 + Y_{t-1} + \varepsilon_t
These models are non-stationary models. The condition for stationarity is: |α1| < 1.
Why may we be interested in such dynamic models? What is the motivation or justification for
their use?
THE ADAPTIVE EXPECTATIONS MODEL
Consider the following model:
Consider the following model:

Y_t = \beta_0 + \beta_1 X_t^* + u_t

where X_t^* is the equilibrium/desired/long-run/expected/normal value of X_t.
Now X_t^* is not observable, but X_t is observable, so we need some way of relating X_t^* to X_t. We might suggest the following adjustment mechanism:

X_t^* - X_{t-1}^* = \gamma (X_t - X_{t-1}^*)
where \gamma is an adjustment parameter; typically we would expect 0 < \gamma \le 1, i.e., expectations are revised every period by a fraction \gamma of the difference between the current value X_t and the previous period's expectation.
How can we use this information? Solve for \gamma X_t in X_t^* - X_{t-1}^* = \gamma (X_t - X_{t-1}^*) to get

\gamma X_t = X_t^* - (1 - \gamma) X_{t-1}^*
Lag Y_t = \beta_0 + \beta_1 X_t^* + u_t by one period and multiply by (1 - \gamma) to get

(1 - \gamma) Y_{t-1} = (1 - \gamma)\beta_0 + (1 - \gamma)\beta_1 X_{t-1}^* + (1 - \gamma) u_{t-1}
Subtract this from Y_t = \beta_0 + \beta_1 X_t^* + u_t to get

Y_t - (1 - \gamma) Y_{t-1} = [1 - (1 - \gamma)]\beta_0 + \beta_1[X_t^* - (1 - \gamma) X_{t-1}^*] + u_t - (1 - \gamma) u_{t-1}
Y_t - (1 - \gamma) Y_{t-1} = \gamma \beta_0 + \beta_1[X_t^* - (1 - \gamma) X_{t-1}^*] + u_t - (1 - \gamma) u_{t-1}

But from the adjustment mechanism \gamma X_t = X_t^* - (1 - \gamma) X_{t-1}^*, so substituting gives

Y_t - (1 - \gamma) Y_{t-1} = \gamma \beta_0 + \gamma \beta_1 X_t + u_t - (1 - \gamma) u_{t-1}
Or,

Y_t = \gamma \beta_0 + \gamma \beta_1 X_t + (1 - \gamma) Y_{t-1} + \nu_t

where \nu_t = u_t - (1 - \gamma) u_{t-1}, which is a moving average error term depending on (1 - \gamma).
The Durbin-Watson statistic for testing first-order autocorrelation is no longer appropriate if the model contains a lagged dependent variable as a regressor (i.e., Y_{t-1}). In that case Durbin's h statistic can be used:
h = \hat{\rho}\, \sqrt{\frac{T}{1 - T \cdot \widehat{Var}(\hat{\alpha}_2)}}

or, using d \approx 2(1 - \hat{\rho}),

h = \left(1 - \frac{d}{2}\right) \sqrt{\frac{T}{1 - T \cdot \widehat{Var}(\hat{\alpha}_2)}}

where \hat{\alpha}_2 is the estimated coefficient of Y_{t-1}.
NB. This test statistic cannot be used if T \cdot \widehat{Var}(\hat{\alpha}_2) \ge 1. Alternatively, one can use the Breusch-Godfrey test for (higher-order) autocorrelation, also known as the Lagrange multiplier test.
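Both options are available in Stata after regressing on the lagged dependent variable; a sketch (y and x are illustrative names, data tsset):

regress y x l.y
estat durbinalt           // Durbin's alternative test, valid with a lagged dependent variable
estat bgodfrey, lags(1)   // Breusch-Godfrey LM test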