Econometrics Notes
2. Time series data - data for a single entity collected at multiple time periods, which can be
used to study the evolution of variables over time and to forecast future values of those
variables.
3. Panel data (longitudinal data) - data for multiple entities in which each entity is observed at
2 or more time periods, and it is used to learn about economic relationships from the
experiences of the many different entities in the data set and from the evolution over time of
the variables for each entity.
$\text{variance: } \sigma_x^2 = E\left[(x - \bar{x})^2\right] = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} \geq 0$
If the covariance is large (in absolute value), the two variables are strongly related; if it is close to 0, they are not closely related in a statistical sense. The covariance does not, however, allow us to make a judgement about how precisely the variables are related.
$\text{covariance}_{x,y} = \sigma_{x,y} = E\left[(x - \bar{x})(y - \bar{y})\right]$
The correlation between the variables is a standardized measure of how related the variables are. It
is always between -1 and 1, allowing for comparisons. The closer to 0, the less related the variables
are.
$\text{correlation}_{x,y} = \rho_{x,y} = \frac{\sigma_{x,y}}{\sigma_x\sigma_y} = \frac{COV(x,y)}{\sqrt{Var(x)Var(y)}}$
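As a quick check of these definitions, here is a minimal base-R sketch (simulated data, illustrative variable names) that computes the variance, covariance and correlation by hand and compares them with R's built-in cov() and cor(); note that the built-ins use the n - 1 divisor rather than n.

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

var_x  <- sum((x - mean(x))^2) / (length(x) - 1)               # sample variance
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1) # sample covariance
rho_xy <- cov_xy / (sqrt(var_x) * sd(y))                       # standardized measure

all.equal(cov_xy, cov(x, y))   # TRUE
all.equal(rho_xy, cor(x, y))   # TRUE
```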
$\text{Sum of squared mistakes: } \sum_{i=1}^{n}\left(Y_i - b_0 - b_1 X_i\right)^2$
The OLS regression line is $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, with
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{s_{XY}}{s_X^2}$
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}$
$\hat{u}_i = Y_i - \hat{Y}_i$
OLS estimator is unbiased and consistent.
Measures of Fit
𝑅2 and the standard error of the regression measure how well the OLS line fits the data:
𝑅2 is the fraction of sample variance of 𝑌 explained by 𝑋 , and it is written as the ratio of the
explained sum of squares (ESS) to the total sum of squares (TSS):
$R^2 = \frac{ESS}{TSS} = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$
Or
$R^2 = 1 - \frac{SSR}{TSS}, \quad SSR = \sum_{i=1}^{n}\hat{u}_i^2$
The standard error of the regression (SER) is an estimator of the standard deviation of the
regression error, 𝑢 . The SER is a measure of the spread of the observations around the regression
line, measured in the units of the dependent variable, Y.
$SER = s_{\hat{u}} = \sqrt{s_{\hat{u}}^2}, \quad s_{\hat{u}}^2 = \frac{SSR}{n-2}$
The divisor is 𝑛 2 because it corrects for a slight downward bias introduced because two
regression coefficients were estimated (degree of freedom correction).
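As an illustration of these measures of fit, a minimal R sketch (simulated data, illustrative names) that reproduces the R² and SER from the residuals of an lm() fit:

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

ssr <- sum(resid(fit)^2)           # sum of squared residuals
tss <- sum((y - mean(y))^2)        # total sum of squares
r2  <- 1 - ssr / tss               # equals summary(fit)$r.squared
ser <- sqrt(ssr / (n - 2))         # equals summary(fit)$sigma for a simple regression
```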
The Least Squares Assumptions
1. The conditional distribution of $u_i$, given $X_i$, has a mean of 0 - this means that the two variables have 0 covariance and are thus unrelated. It asserts that the other factors contained in $u_i$ are unrelated to $X_i$ in the sense that, given a value of $X_i$, the mean of the distribution of these other factors is 0. While some factors have a positive effect on the dependent variable and others a negative one, these effects average out.
$E(u_i|X_i) = 0$
$Corr(X_i, u_i) = 0$
Equivalent to assuming that the population regression line is the conditional mean of 𝑌
given 𝑋 .
3. Large outliers are unlikely - $X_i$ and $Y_i$ have nonzero finite fourth moments (finite kurtosis).
$0 < E(X_i^4) < \infty \quad and \quad 0 < E(Y_i^4) < \infty$
$E(\hat{\beta}_0) = \beta_0 \quad and \quad E(\hat{\beta}_1) = \beta_1$
Thus, 𝛽0 and 𝛽1 are unbiased estimators, and if the sample is large enough, their distribution is well
approximated by the bivariate normal distribution, implying that the marginal distributions are
normal in large samples. When the sample size is large, the OLS estimators will also be close to the
true population coefficients with a high probability, because the variances of the estimators decrease
to 0 as n increases.
$\sum_{i=1}^{N}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0$
$\leftrightarrow$
$\sum_{i=1}^{N} Y_i - \sum_{i=1}^{N}\hat{\beta}_0 - \hat{\beta}_1\sum_{i=1}^{N} X_i = 0$
However, $\sum_{i=1}^{N}\hat{\beta}_0$ is the sum of a term that takes exactly the same value for every observation, and it can therefore be written as $N\hat{\beta}_0$ (because of the constant-multiplied-by-a-variable rule discussed in the beginning of the lecture). The same goes for the $\hat{\beta}_1$ term:
$\sum_{i=1}^{N} Y_i - N\hat{\beta}_0 - \hat{\beta}_1\sum_{i=1}^{N} X_i = 0$
We isolate $\hat{\beta}_0$:
$N\hat{\beta}_0 = \sum_{i=1}^{N} Y_i - \hat{\beta}_1\sum_{i=1}^{N} X_i$
$\leftrightarrow$
$\hat{\beta}_0 = \frac{\sum_{i=1}^{N} Y_i}{N} - \hat{\beta}_1\frac{\sum_{i=1}^{N} X_i}{N}$
$\leftrightarrow$
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}$
Now we have derived the equation for 𝛽0 . It is given by the average of Y less 𝛽1 times the average
of X.
In order to find 𝛽1 ,we take the first derivative of RSS with respect to 𝛽1 :
$\frac{d\sum_{i=1}^{n}\hat{u}_i^2}{d\hat{\beta}_1} = 0$
DO THIS YOURSELF
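A sketch of that omitted step (standard OLS algebra, not taken from the notes): differentiating the RSS with respect to $\hat{\beta}_1$, setting the derivative to 0, and substituting $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}$ gives
$\frac{\partial}{\partial\hat{\beta}_1}\sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2 = -2\sum_{i=1}^{n} X_i\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0$
$\Rightarrow \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$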
Another way of deriving $\hat{\beta}_1$ is by realizing that the second assumption also means that $Cov(X, u) = 0$. We can rewrite $u$ as:
$Y = \beta_0 + \beta_1 X + u \rightarrow u = Y - \beta_0 - \beta_1 X$
We can now substitute this in the covariance:
$Cov(X, u) = 0$
$\leftrightarrow$
$Cov(X, Y - \beta_0 - \beta_1 X) = 0$
This is rewritten as:
$Cov(X, Y) - Cov(X, \beta_0) - Cov(X, \beta_1 X) = 0$
We can reduce this further by realizing that the second term is the covariance of a constant and a variable, meaning that it must be 0, and that in the third term the constant $\beta_1$ can be moved outside, leaving the covariance of X with itself, which equals the variance of X. Thus:
$Cov(X, Y) - 0 - \beta_1 Var(X) = 0$
We isolate 𝛽1 :
$\hat{\beta}_1 = \frac{Cov(X, Y)}{Var(X)}$
We have now derived the equation for 𝛽1 .
We can substitute this into the equation for 𝛽0 :
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}$
$\leftrightarrow$
$\hat{\beta}_0 = \bar{Y} - \frac{Cov(X, Y)}{Var(X)}\bar{X}$
From this, there are a few important relationships:
$if\ Cov(X, Y) > 0,\ then\ \hat{\beta}_1 > 0$
$if\ Cov(X, Y) < 0,\ then\ \hat{\beta}_1 < 0$
The covariance (and correlation) is not enough to determine the relationship between 2 variables,
because as shown by the equation for beta 1, the variance also needs to be included. If we do not
include the variance, we can see the direction of the relationship, but we cannot quantify it.
$Var(X) > 0$
There needs to be variability in X for us to get important results.
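A minimal R check (simulated data) that the covariance/variance formula just derived reproduces the lm() estimates:

```r
set.seed(2)
x <- rnorm(100)
y <- 3 - 2 * x + rnorm(100)

b1 <- cov(x, y) / var(x)        # slope from Cov(X,Y)/Var(X)
b0 <- mean(y) - b1 * mean(x)    # intercept
coef(lm(y ~ x))                 # same values as c(b0, b1)
```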
Properties of OLS
1. The average of the residuals is 0:
$\hat{u}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$
$\frac{1}{n}\sum_{i=1}^{n}\hat{u}_i = 0$
2. $Cov(x, \hat{u}) = 0$
3. $E(\hat{\beta}_1) = \beta_1$. OLS is unbiased, i.e. the expected value of the estimator equals the true population value.
The variance of 𝛽1 :
$Var(\hat{\beta}_1) = \frac{\sigma_u^2}{n\,S_X^2} = \frac{Var(u)}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$
Thus, a few things determine the variance of the beta: (1) the variance of the error term, and (2) the variance of X (the higher the variance of X, the richer the data, and the lower the variance of the beta).
$\hat{\sigma}_{\hat{\beta}_1}^2 = \frac{1}{n}\cdot\frac{\frac{1}{n-2}\sum_{i=1}^{n}(X_i - \bar{X})^2\hat{u}_i^2}{\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\right]^2}$
(2) Compute the t-statistic
The t-statistics has the form:
$t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}}$
e.g.
$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$
(3) Compute the p-value, which is the smallest significance level at which the null hypothesis could
be rejected based on the test statistic actually observed. That is, the p-value is the probability of
obtaining a statistic, by random sampling variation, at least as different from the null hypothesis as
is the statistic observed, assuming that the null hypothesis is correct:
$p\text{-}value = 2\Phi\left(-|t^{act}|\right)$
Where $t^{act}$ is the t-statistic computed in (2), and $\Phi$ is the cumulative standard normal distribution tabulated in Appendix Table 1. Thus:
$p\text{-}value = \Pr_{H_0}\left(|t| > |t^{act}|\right)$
A p-value of less than 5% provides evidence against the 𝐻0 , as this shows that the probability of
obtaining a value of 𝛽1 at least as far from the null as that actually observed is less than 5%.
Alternatively, one can simply compare the t-statistic to the critical value appropriate for the test
with the desired significance level. For example, a two-sided test with a 5% significance level
would reject the null hypothesis if $|t^{act}| > 1.96$.
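A minimal R sketch (simulated data, homoskedasticity-only SEs for simplicity) of these steps - t-statistic, two-sided p-value, and the 95% confidence interval discussed in section 5.2 below:

```r
set.seed(3)
x <- rnorm(100)
y <- 1 + 0.4 * x + rnorm(100)
fit <- lm(y ~ x)

b1  <- coef(fit)["x"]
se1 <- coef(summary(fit))["x", "Std. Error"]
t_stat <- (b1 - 0) / se1                      # H0: beta1 = 0
p_val  <- 2 * pnorm(-abs(t_stat))             # large-sample normal approximation
ci_95  <- c(b1 - 1.96 * se1, b1 + 1.96 * se1) # 95% confidence interval
```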
We can also perform a one-sided test, where the alternative hypothesis, $H_1$, only allows $\beta_1$ to be either larger or smaller than the hypothesized value. In this example, the hypotheses are:
$H_0: \beta_1 = \beta_{1,0}$
$H_1: \beta_1 < \beta_{1,0}$
The only difference between a one- and a two-sided test is how we interpret the t-statistic. In a one-sided test, $H_0$ is rejected against this one-sided alternative for large negative, but not large positive, values of the t-statistic. Thus, instead of rejecting $H_0$ if $|t^{act}| > 1.96$ as in the two-sided test, we now reject if $t^{act} < -1.64$. That is:
$p\text{-}value = \Phi(t^{act})$
Where we use the left-tail probability if testing for 𝛽1 being smaller than 𝛽1,0 . If testing for it being
greater, we use the right-tail probability.
5.2 Confidence Intervals for a Regression Coefficient
A 95% confidence interval is the set of values that cannot be rejected using a two-sided hypothesis
test with a 5% significance level, and an interval that has a 95% probability of containing the true
value of the coefficient. A hypothesis test with a 5% significance level will, by definition, reject the
true value of a coefficient in only 5% of all possible samples; that is, in 95% of all possible samples,
the true value of a coefficient will not be rejected. Because the 95% confidence interval (as defined
in the first definition) is the set of all values of the coefficient that are not rejected at the 5%
significance level, it follows that the true value of the coefficient will be contained in the confidence
interval in 95% of all possible samples.
Continuing the example with $\beta_1$, the 95% confidence interval is
$\hat{\beta}_1 \pm 1.96\,SE(\hat{\beta}_1)$
This confidence interval can be used to create a 95% confidence interval for the predicted effect of a general change in x. If x changes by $\Delta x$, the predicted change in Y is $\beta_1\Delta x$, and the 95% confidence interval for this predicted effect is $\left(\hat{\beta}_1 \pm 1.96\,SE(\hat{\beta}_1)\right)\Delta x$.
A hypothesis test can be performed in the same way as described earlier. Also, a confidence interval
can be constructed, which would provide the interval for the difference between the 2 population
means.
5.4 Heteroskedasticity and Homoskedasticity
The error term 𝑢 is homoskedastic if the variance of the conditional distribution of 𝑢 given 𝑋 ,
$Var(u_i|X_i = x)$, is constant for $i = 1, \dots, n$ and in particular does not depend on $x$. Otherwise, the
error term is heteroskedastic. In simpler terms, if the distribution of 𝑢 changes shape, i.e. the spread
increases, for different values of 𝑥, 𝑢 is heteroskedastic. If not, it is homoskedastic.
Implications of homoskedasticity:
- The OLS estimator is unbiased, consistent and asymptotically normal whether the errors are
homoskedastic or heteroskedastic.
- The OLS estimators 𝛽0 and 𝛽1 are efficient among all estimators that are linear in 𝑌1 , … . , 𝑌
and are unbiased, conditional on 𝑋1 , … , 𝑋 . Gauss-Markov theorem.
- Homoskedasticity-only SE:
$\tilde{\sigma}_{\hat{\beta}_1}^2 = \frac{s_{\hat{u}}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$
If the regressor is binary, the pooled-variance equation for the difference in means can be used. However, the general equations provided earlier are heteroskedasticity-robust standard errors, and they can be used in both cases.
*** The rest of the chapter was optional, read if brought up in class.
$TSS = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y}\right)^2$
$TSS = \sum_{i=1}^{n}\hat{u}_i^2 + 2\sum_{i=1}^{n}\hat{u}_i(\hat{Y}_i - \bar{Y}) + \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$
$TSS = \sum_{i=1}^{n}\hat{u}_i^2 + \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$
The first term is SSR and the second term is the explained sum of squares (ESS):
$TSS = SSR + ESS$
We divide both sides by TSS:
$\frac{TSS}{TSS} = \frac{SSR}{TSS} + \frac{ESS}{TSS}$
This is simplified, knowing that $R^2 = \frac{ESS}{TSS}$:
$1 = \frac{SSR}{TSS} + R^2$
$\leftrightarrow$
$R^2 = 1 - \frac{SSR}{TSS}$
Because both of these ratios lie between 0 and 1, a property of $R^2$ is:
$0 \leq R^2 \leq 1$
The squared correlation tells us how well our data fit the regression, but a larger $R^2$ is not necessarily better - it is just one indicator. The satisfactory level of $R^2$ depends on many factors; e.g. with micro data 0.3 is often very satisfactory, while for macro data $R^2$ is expected to be above 0.9 to be satisfactory.
Hypothesis Testing
Type I error: Reject 𝐻0 when it is true
Type II error: Not rejecting 𝐻0 when it is false
The test statistic for the mean can also be written as:
$t = \frac{\bar{Y} - \mu_{Y,0}}{SD(Y)/\sqrt{n}}$
1. The standard error of the regression (SER):
$SER = s_{\hat{u}} = \sqrt{\frac{SSR}{n-k-1}}$
2. 𝑅 2 - The fraction of the sample variance of 𝑌 explained by (or predicted by) the regressors:
$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}$
𝑅 2 increases whenever a regressor is added (as SSR decreases), unless the estimated
coefficient on the added regressor is exactly 0.
3. 𝐴𝑑𝑗𝑢𝑠𝑡𝑒𝑑 𝑅2 - Corrects 𝑅2 for increasing when adding a variable that does not improve the
fit of the model.
$\bar{R}^2 = 1 - \frac{n-1}{n-k-1}\cdot\frac{SSR}{TSS} = 1 - \frac{s_{\hat{u}}^2}{s_Y^2}$
Here, $\frac{SSR}{TSS}$ is inflated by the factor $\frac{n-1}{n-k-1} > 1$, meaning that the adjusted $R^2$ is always lower than $R^2$. Adding a regressor to the model has 2 opposite effects - SSR decreases, but the factor increases.
The adjusted $R^2$ can be negative when the added regressors reduce the sum of squared residuals by such a small amount that this reduction fails to offset the increase in the factor.
Determining whether to include a variable or not should not be made based on maximizing 𝑅2 or
𝑅2 - it should be based on whether including the variable allows us to better estimate the causal
effect of interest.
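A small R check (simulated data) of the adjusted R² formula above against the value reported by summary():

```r
set.seed(4)
n <- 100; k <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + rnorm(n)                  # x2 is irrelevant by construction
fit <- lm(y ~ x1 + x2)

ssr <- sum(resid(fit)^2); tss <- sum((y - mean(y))^2)
adj_r2 <- 1 - (n - 1) / (n - k - 1) * ssr / tss
all.equal(unname(adj_r2), summary(fit)$adj.r.squared)   # TRUE
```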
6.5 The Least Squares Assumptions in Multiple Regression
There are 4 OLS assumptions:
1. The conditional distribution of $u_i$ given $X_{1i}, X_{2i}, \dots, X_{ki}$ has a mean of 0.
Dummy variable trap: If there are 𝐺 binary variables and each observation falls into 1 and only 1, if
there is an intercept in the regression, and if all 𝐺 binary variables are included as regressors, then
the regression will fail because of perfect multicollinearity.
We solve this problem by excluding one of the regressors, i.e. only 𝐺 1 of the 𝐺 binary variables
are included - in this case, the coefficients on the included binary variables represent the
incremental effect of being in that category relative to the case of the omitted category, holding
constant the other regressors. We can also include all 𝐺 regressors and exclude the intercept.
This equation is heteroskedasticity-robust - as the sample size increases, the difference between this
and the homoskedasticity-only 𝐹-statistic vanishes.
For more than 2, we obtain the critical values for the $F$-statistic from the tables of the $F_{q,\infty}$ distribution in Appendix Table 4 for the appropriate value of $q$ and the desired significance level.
$p\text{-}value = \Pr\left(F_{q,\infty} > F^{act}\right)$
If the number of restrictions is 1, then the 𝐹-statistic is the square of the 𝑡-statistic.
If the 𝐹-statistic exceeds the critical value, we reject 𝐻0 .
(1) test the restriction directly (some software have commands designed to do this)
(2) Transform the regression - rewriting the regression such that the restriction in 𝐻0 turns into a
restriction on a single regression coefficient - example of this on page 276.
4. A high 𝑅2 or 𝑅2 does not necessarily mean that you have the most appropriate set of
regressors, nor does a low 𝑅2 or 𝑅2 necessarily mean that you have an inappropriate set of
regressors.
Choose the scale that makes the most sense in terms of interpretation and easiness to read.
Notes from Lecture
Regression without a Constant
A regression without a constant is where we assume that the constant is equal to 0:
$Y = \beta_0 + \beta_1 X + u$
$\leftrightarrow$
$Y = 0 + \beta_1 X + u$
This graphically means that the regression line always goes through the point (0;0).
Never used in economic research - technically used for CAPM, as we assume that 𝛼 is always 0 in
this model.
When regressing without a constant, $\hat{\beta}_1$ is still found using the same equation as before, $\hat{\beta}_1 = \frac{Cov(X,Y)}{Var(X)}$. However, the variance is now calculated as:
$Var(\hat{\beta}_1) = \frac{\sum\hat{u}_i^2/(n-1)}{\sum x_i^2}$
The $R^2$ has to be adjusted, because we are no longer looking at a normal simple regression (the residual now equals $\hat{u} = Y - \hat{\beta}X$):
$Raw\ R^2 = \frac{\left(\sum X_i Y_i\right)^2}{\sum X_i^2\sum Y_i^2}$
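In R, a regression through the origin is obtained by dropping the intercept from the formula; a minimal sketch with simulated data:

```r
set.seed(5)
x <- rnorm(100, mean = 5)
y <- 2 * x + rnorm(100)

fit0 <- lm(y ~ x - 1)        # "- 1" (or "+ 0") removes the constant
coef(fit0)
summary(fit0)$r.squared      # uncentered ("raw") R^2, as discussed above
```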
$Y = \beta_0 + \beta_1 X_1 \rightarrow Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
What we have to do to estimate the coefficients is the same as under simple regression. The general
equation is derived from:
$cov(X_1, u) = cov(X_1, Y - \beta_0 - \beta_1 X_1 - \beta_2 X_2)$
$\leftrightarrow$
$cov(X_1, Y) - cov(X_1, \beta_0) - cov(X_1, \beta_1 X_1) - cov(X_1, \beta_2 X_2)$
$\leftrightarrow$
$cov(X_1, Y) - \beta_1 Var(X_1) - \beta_2 cov(X_1, X_2)$
$\leftrightarrow$
$cov(X_1, u) = cov(X_1, Y) - \beta_1 Var(X_1) - \beta_2 cov(X_1, X_2) = 0$
Thus, to ensure that this equation equals 0, we get that the beta must be:
$\hat{\beta}_1 = \frac{cov(X_1, Y)}{Var(X_1)} - \hat{\beta}_2\frac{cov(X_1, X_2)}{Var(X_1)}$
Thus, when an extra regressor is added, the coefficient is no longer the same as under a simple regression - the second term is added to reflect the importance of the second variable. This suggests that the multiple-regression beta differs from the simple-regression beta - however, this is not always true.
There are 2 cases where the multiple-regression $\hat{\beta}_1$ equals the simple-regression $\hat{\beta}_1$ (see the sketch after this list):
1. When $\hat{\beta}_2 = 0$ - because then the second term of the equation is 0, meaning the second variable is not important and should not be included in the model (the simple regression model is better).
2. When $cov(X_1, X_2) = 0$ - this also makes the second term of the equation for $\hat{\beta}_1$ equal 0. It means that $X_1$ and $X_2$ are orthogonal, i.e. they may each explain something, but they are not related to each other. Again, the simple regression would be better.
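A minimal R sketch (simulated, correlated regressors) verifying the relation derived above between the multiple-regression $\hat{\beta}_1$, the covariances, and $\hat{\beta}_2$:

```r
set.seed(6)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)            # correlated with x1
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
b2  <- coef(fit)["x2"]
b1_check <- (cov(x1, y) - b2 * cov(x1, x2)) / var(x1)
c(coef(fit)["x1"], b1_check)         # equal up to numerical precision
```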
If you forget to include 𝑋2 in your regression, the consequences depend on other characteristics:
$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_r X^r + u$
The shape of the polynomial regression depends on how many times 𝑋 is added.
If we have a quadratic model $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + u$ where the data is concave, we expect $\beta_1 > 0$ and $\beta_2 < 0$, as $\beta_2$ has to mitigate the positive effect of $\beta_1$.
The effect of $X$ is found by taking the derivative $\frac{dY}{dX} = \beta_1 + 2\beta_2 X$ (because this shows the effect of one extra unit of X on Y).
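A minimal R sketch (simulated concave data) of the quadratic specification and the marginal effect evaluated at the sample mean of X:

```r
set.seed(7)
x <- runif(200, 0, 10)
y <- 4 + 3 * x - 0.2 * x^2 + rnorm(200)

fit <- lm(y ~ x + I(x^2))
b   <- coef(fit)
effect_at_mean <- b["x"] + 2 * b["I(x^2)"] * mean(x)   # dY/dX = beta1 + 2*beta2*X
```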
If you want to include an extra term in your model, it has to be consistent with theory or common sense, and you have to be able to explain why you include it beyond it just increasing $R^2$.
How to Redefine Variables
Redefining a variable means making a transformation of the variable, which is done for 3 reasons:
- You want to change the size of the variable due to different units of measurements
- You want to change the structure of the variable to make it more understandable
- You want to facilitate interpretations because the value of the coefficient is too small/large
If we have a simple regression model with one regressor:
$Y = \beta_0 + \beta_1 X + u$
Where $\beta_1$ is the slope coefficient, $Y$ is the price of a house in DKK, and $X$ is the income of individuals in DKK.
If the readers of the journal article are American, we want to rescale the variable from being
measured in DKK to USD. This is done by multiplying the initial variable with the rescaling
constant, 𝜔, which in this case is the exchange rate:
$Y^* = \omega_1 Y$
$and/or$
$X^* = \omega_2 X$
We assume that the constant cannot be equal to 0. In general, 𝜔1 does not have to be equal to 𝜔2 , it
depends on the transformation we wish to perform.
By doing this, we get a new regression model:
$Y^* = \beta_0^* + \beta_1^* X^* + u^*$
Where
$\beta_1^* = \frac{Cov(X^*, Y^*)}{Var(X^*)}$
$\leftrightarrow$
$\beta_1^* = \frac{Cov(\omega_2 X, \omega_1 Y)}{Var(\omega_2 X)}$
$\leftrightarrow$
$\beta_1^* = \frac{\omega_1\omega_2\,Cov(X, Y)}{\omega_2^2\,Var(X)}$
$\leftrightarrow$
$\beta_1^* = \frac{\omega_1}{\omega_2}\cdot\frac{Cov(X, Y)}{Var(X)}$
$\leftrightarrow$
$\beta_1^* = \frac{\omega_1}{\omega_2}\beta_1$
We also have that
$\beta_0^* = \omega_1\beta_0$
Thus, the relationship between the variances of the coefficients of the two models is
$Var(\beta_1^*) = \left(\frac{\omega_1}{\omega_2}\right)^2 Var(\beta_1)$
The t-statistic is the same, meaning that the significance is the same:
$t^* = t$
Changing the scale of the Y variable will lead to a corresponding change in the scale of the
coefficients and standard errors, so no change in the significance or interpretation.
Changing the scale of one 𝑋 variable will lead to a change in the scale of that coefficient and
standard error, so again, no change in the significance or interpretation.
We can also redefine a variable by standardizing it both on the left or right hand-side
$Y^* = \frac{Y - \bar{Y}}{SD(Y)}$
$X^* = \frac{X - \bar{X}}{SD(X)}$
The old variable is thus demeaned (the average is subtracted) and divided by its standard deviation.
The new regression is then:
$Y^* = \beta X^* + u$
The constant is missing because, when we standardize all the variables, the constant regressor (the vector of 1s minus its average) becomes 0, so we are running a regression without a constant. The interpretation of $\beta$ in this model is therefore the effect, in standard deviations of the dependent variable $Y$, of a 1-standard-deviation change in $X$.
For example, if the beta of age is -0.340, a 1-SD increase in age is associated with a 0.340-SD decrease in income.
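A minimal R sketch (simulated data with made-up units) of the standardized regression; scale() demeans and divides by the standard deviation:

```r
set.seed(8)
x <- rnorm(200, mean = 40, sd = 10)            # e.g. age
y <- 50000 - 800 * x + rnorm(200, sd = 5000)   # e.g. income

y_star <- as.numeric(scale(y))
x_star <- as.numeric(scale(x))
coef(lm(y_star ~ x_star - 1))   # change in SD of Y per 1-SD change in X
```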
F Statistics and Multiple Linear Restrictions
In terms of multiple regression analysis, we have a multiple linear restriction when talking about the
F-test. With simple linear regression, we only tested whether one variable was equal to 0. With
multiple regression, we test:
$H_0: \beta_1 = 0, \beta_2 = 0, \dots, \beta_k = 0$
$H_1: At\ least\ one\ \beta_j \neq 0$
We test this by taking a statistical index called F:
$F \equiv \frac{ESS/k}{SSR/(n-k-1)}$
where $k$ is the number of regressors (excluding the intercept).
If F is larger than the critical value found in the F table (with numerator degrees of freedom $k$ and denominator degrees of freedom $n-k-1$), then we reject $H_0$.
If we divide both terms in the F equation by TSS, we get:
$F \equiv \frac{R^2/k}{(1 - R^2)/(n-k-1)}$
We can use this to see the marginal contribution of a variable to evaluate whether it should be
included in the regression model or not:
$F \equiv \frac{\left(ESS_{k_1} - ESS_{k_0}\right)/(k_1 - k_0)}{SSR_{k_1}/(n - k_1 - 1)}$
Where $k_0$ and $k_1$ are the number of regressors in the restricted and the extended model, respectively.
If $F > F_{(k_1 - k_0),\,(n - k_1 - 1)}$, we reject $H_0$.
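A minimal R sketch (simulated data) of the joint F-test via the R² form above, checked against anova() on the restricted vs. unrestricted models:

```r
set.seed(9)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.3 * x1 + 0.3 * x2 + rnorm(n)

full <- lm(y ~ x1 + x2)
r2 <- summary(full)$r.squared
k  <- 2
F_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
anova(lm(y ~ 1), full)          # reports the same F and p-value
```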
$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + u_i$
Mechanically, we can construct the 2 nd regressor by generating a new variable that equals the square
of the 1st variable. Then the quadratic model is simply a multiple regression model with 2
regressors. We can estimate the coefficient using OLS methods.
We can test the hypothesis that the relationship between 𝑋 and 𝑌 is linear against the alternative that
is it nonlinear by testing the following hypothesis:
$H_0: \beta_2 = 0$
$H_1: \beta_2 \neq 0$
Since the quadratic model before would perfectly describe a linear relationship if 𝛽2 0, as the
second regressor would then be absent, 𝐻0 is true if the relationship is linear. We use the 𝑡-statistic
to test this, and if this exceeds the 5% critical value of the test (1.96), we reject 𝐻0 and conclude
that the quadratic model is a better fit.
The effect on 𝑌 of a change in 𝑋1 , ∆𝑋1 , holding 𝑋2 , . . . , 𝑋 constant, is the difference in the expected
value of 𝑌 when the independent variables take on the values 𝑋1 𝛥𝑋1 , 𝑋2 , . . . , 𝑋 and the expected
value of 𝑌 when the independent variables take on the values 𝑋1 , 𝑋2 , . . . , 𝑋 .
The standard error of the estimator of the effect on 𝑌 of changing another variable is:
$SE(\Delta\hat{Y}) = \frac{|\Delta\hat{Y}|}{\sqrt{F}}$
The 𝐹-statistic here is the one computed when testing the hypothesis that is described on page 311.
- Use the $t$-statistic to test the hypothesis that the coefficient on $X^r$ is 0. If you reject this hypothesis, then $X^r$ belongs in the regression, so you use the polynomial of degree $r$
- If you do not reject $\beta_r = 0$ in step 2, eliminate $X^r$ from the regression and estimate a polynomial regression of degree $r - 1$. Test whether the coefficient on $X^{r-1}$ is 0. If you reject, use the polynomial of degree $r - 1$
- If you do not reject $\beta_{r-1} = 0$ in step 3, continue this procedure until the coefficient on the highest power in your polynomial is statistically significant
(2) Logarithms
Logarithms convert changes in variables into percentage changes.
The exponential function of $x$ is $e^x$ ($e \approx 2.71828$). The natural logarithm is the inverse of the exponential function, i.e. it is the function for which $x = \ln(e^x)$. The logarithm is defined only for positive values of $x$, and its slope is $\frac{1}{x}$.
$Y_i = \beta_0 + \beta_1\ln(X_i) + u_i$
$\ln(Y_i) = \beta_0 + \beta_1 X_i + u_i$
$\ln(Y_i) = \beta_0 + \beta_1\ln(X_i) + u_i$
** if 𝑌 is transformed (ln 𝑌 , you cannot compute the appropriate predicted value of 𝑌 by taking
the exponential function of the model, as this value is biased. The solution used in the book is to
simply not transform the predicted values of the logarithm of 𝑌 to their original units.
One of the regressors can be expressed as a linear combination of another regressor, e.g.:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u$
$X_2 = \alpha_0 + \alpha_1 X_1$
$Y = \beta_0 + \beta_1 X_1 + \beta_2(\alpha_0 + \alpha_1 X_1) + u$
$\leftrightarrow$
$Y = (\beta_0 + \beta_2\alpha_0) + (\beta_1 + \beta_2\alpha_1)X_1 + u$
The problem is then that software will give us one coefficient for $(\beta_1 + \beta_2\alpha_1)$, which does not allow us to disentangle the values of $\beta_1$ and $\beta_2$.
- Depending on the software, we cannot even run the regression, if the variables
display perfect multicollinearity. R will tell us that the problem is there and will drop
one of the regressors when running the model.
Multicollinearity is related to the structure of the data. The theory tells us we should
include both variables, but because of the way the data is structured, we cannot do so, or
it would lead to biased variables.
2. Imperfect multicollinearity
Occurs when two regressors are highly linearly correlated, but the correlation is not exactly
1. Thus, we add an error term to perfect multicollinearity:
$X_2 = \alpha_0 + \alpha_1 X_1 + v$
Where 𝑣 is an error term, which captures the imperfection of the linear combination. In the
example with work experience from before, the error could for example be that it sometimes
is 𝑎𝑔𝑒 5 𝑒𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 𝑦𝑒𝑎𝑟𝑠 .
$Y = \beta_0 + \beta_1 X_1 + \beta_2(\alpha_0 + \alpha_1 X_1 + v) + u$
$\leftrightarrow$
$Y = (\beta_0 + \beta_2\alpha_0) + (\beta_1 + \beta_2\alpha_1)X_1 + \beta_2 v + u$
Using this regression, we can technically derive an estimate of 𝛽1 and 𝛽2 , given that we
know the value of the error term, 𝑣. However, we do not know this, so in practice we have
the same problem as under perfect multicollinearity, where the two betas cannot be
disentangled from each other.
The problem with imperfect multicollinearity shows up in the significance of the regression coefficients, because the variance of a coefficient is
$Var(\hat{\beta}_j) = \frac{\sigma^2}{TSS_j\left(1 - \rho_j^2\right)}$
Where $TSS_j$ is the total sum of squares of regressor $j$, and $\rho_j$ is the correlation of regressor $j$ with the other regressor(s). As $\rho_j$ approaches 1, this variance blows up, so the standard errors become large and the t-statistics small.
Imperfect multicollinearity also creates a problem because the resulting regression is very sensitive to outliers. The inclusion or exclusion of such observations will have a very large effect on the betas.
Battista wants us to check for multicollinearity, which can be done by for example checking
for a very low t-statistic at the same time as a very high 𝑅2 , by looking at the correlation
between variables, or by employing the variance inflation factor (VIF) (can be done in R).
The auxiliary regression approach uses a regression that does not estimate the model of primary interest but is instead used to compute a test statistic, such as the test statistics for heteroskedasticity and serial correlation. For example, if we have the model:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + u$
We can use the auxiliary regression to see the relationship of other variables to the variable
of interest, in this case 𝑋3 :
$X_3 = \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2 + \varepsilon$
$Y = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 + \alpha_3\hat{\varepsilon} + z$
Doing this, we are taking out 𝑋3 but adding the residual contribution of 𝑋3 not related to 𝑋1
and 𝑋2 .
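A minimal R sketch (simulated, highly collinear regressors) of the auxiliary-regression idea applied to multicollinearity: regress one regressor on the others and compute the variance inflation factor VIF_j = 1/(1 - R_j^2) by hand (packages such as car provide a vif() function, if available):

```r
set.seed(10)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)   # almost a linear combination of x1
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)

aux   <- lm(x2 ~ x1 + x3)             # auxiliary regression for x2
r2_j  <- summary(aux)$r.squared
vif_2 <- 1 / (1 - r2_j)               # large values signal multicollinearity
```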
Thus, when we have multicollinearity, we can either do nothing, find a theoretical reason why one of the variables should not be included, combine cross-sectional and time series data, drop a variable (at the risk of specification bias), transform the variables, or try to find additional/new data.
Nonlinear vs Linear Regression
When running an OLS regression, we need to have a linear relationship. If the data is suggesting a
non-linear representation, we can consider looking at the polynomials. Otherwise, another
alternative to check for a nonlinear relationship is to use the natural logarithm to transform the
variable.
Properties of ln:
- The logarithm is scale invariant: if we take a variable and transform it using the natural logarithm, the ranking of the values will always be the same (no reshuffling).
- If we take the logarithm of a variable $\leq 0$, the value does not exist, which means that if we have negative or zero values in the model, we will have missing values in the regression. If we have a lot of these values in our vector, we can consider using $\ln(1 + x)$, but we have to be careful, because this can change the results as well.
- When transforming using ln, the data is less skewed, there is less distortion in the
distribution of data, and the effect of outliers is lower.
There are 3 different cases that differ from the simple regression model (level-level):
(1) Level-log
$Y = \beta_0 + \beta_1\ln(X_1) + u$
$\frac{dY}{dX_1} = \beta_1\frac{1}{X_1} \quad or \quad \Delta Y = \beta_1\frac{\Delta X_1}{X_1}$
Thus, the interpretation of 𝛽1 is the effect on 𝑌 if we increase 𝑋1 by 100%. Not used a lot in
economics.
(2) Log-level
$\ln(Y) = \beta_0 + \beta_1 X_1 + u$
$\frac{d\ln(Y)}{dY} = \frac{1}{Y} \quad or \quad \frac{\Delta Y}{Y} = \beta_1\Delta X$
The interpretation is thus the percentage change in Y for a one-unit increase in X.
(3) Log-log
$\ln(Y) = \beta_0 + \beta_1\ln(X_1) + u$
$\frac{d\ln(Y)}{dX_1} = \frac{\beta_1}{X_1}$
$\leftrightarrow$
$\beta_1 = \frac{dY/Y}{dX/X} = elasticity$
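A minimal R sketch (simulated data) of the log-log specification; the estimated slope is the elasticity:

```r
set.seed(11)
x <- runif(200, 1, 100)
y <- exp(0.5 + 1.2 * log(x) + rnorm(200, sd = 0.1))

fit_ll <- lm(log(y) ~ log(x))
coef(fit_ll)["log(x)"]   # close to 1.2: a 1% rise in X raises Y by about 1.2%
```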
Reciprocal models are an alternative way of rescaling by including the reciprocal of your input as
the predictor:
$Y = \beta_0 + \beta_1\frac{1}{X_1} + u$
with $X_1 \neq 0$. This yields a non-linear relation between X and Y. If the coefficient is positive, then as X increases, Y decreases on average. If the coefficient is negative, then as X increases, Y increases on average.
$\hat{\beta}_1 = \frac{\beta_1\sum(X_i - \bar{X})^2 + \sum(X_i - \bar{X})u_i}{\sum(X_i - \bar{X})^2}$
$\leftrightarrow$
$\hat{\beta}_1 = \beta_1 + \frac{\sum(X_i - \bar{X})u_i}{\sum(X_i - \bar{X})^2}$
OLS is BLUE - under the OLS assumptions, the expected value of the second term is 0, and therefore we have that:
$E(\hat{\beta}_1) = \beta_1$
The variance of 𝛽1 :
$Var(\hat{\beta}_1) = Var\left(\beta_1 + \frac{\sum(X_i - \bar{X})u_i}{\sum(X_i - \bar{X})^2}\right)$
$\leftrightarrow$
$Var(\hat{\beta}_1) = \frac{\sum(X_i - \bar{X})^2 Var(u_i)}{\left[\sum(X_i - \bar{X})^2\right]^2}$
Homoscedasticity and Heteroscedasticity
One of the assumptions of OLS states that $Var(u_i) = \sigma^2$, meaning that if the assumption holds, the variance of $\hat{\beta}_1$ is:
$Var(\hat{\beta}_1) = \frac{\sigma^2\sum(X_i - \bar{X})^2}{\left[\sum(X_i - \bar{X})^2\right]^2}$
$\leftrightarrow$
$Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum(X_i - \bar{X})^2} = \frac{\sigma^2}{n\,s_X^2}$
When this assumption holds, we thus have homoskedasticity, meaning that the variance of the error
is identical for all the observations. Homoskedasticity can be written as:
$Var(u_i|X_i = x) = Var(u) = \sigma^2$
If the assumption does not hold, we have heteroscedasticity and have to use
$Var(\hat{\beta}_1) = \frac{\sum(X_i - \bar{X})^2\hat{u}_i^2}{\left[\sum(X_i - \bar{X})^2\right]^2}$
This is a problem because $Var(u_i|X_i = x) \neq Var(u) = \sigma^2$ means that the variance of the error is related to the observation (the error variance is not the same for all $i$), which makes the usual homoskedasticity-only variance estimator for OLS biased.
From simply looking at the data points, one can sometimes tell if there is homo- or
heteroscedasticity.
Homoscedastic:
Heteroscedastic:
If the data is heteroscedastic, we can use the White Robust SE, which corrects for degrees of
freedom, i.e.:
$Var(\hat{\beta}_1) = \frac{n}{n-k-1}\cdot\frac{\sum(X_i - \bar{X})^2\hat{u}_i^2}{\left[\sum(X_i - \bar{X})^2\right]^2}$
The degrees of freedom become less important the larger the 𝑛.
Usually, no one tests for heteroscedasticity, since people assume that data has heteroscedasticity
and they correct the SE.
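A minimal R sketch (simulated heteroscedastic data) computing the White/HC1 robust variance of the slope by hand, following the formula above; the sandwich and lmtest packages (vcovHC, coeftest), if available, give the same numbers:

```r
set.seed(12)
n <- 200
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.5 * x)   # error spread grows with x

fit <- lm(y ~ x)
u   <- resid(fit)
num <- sum((x - mean(x))^2 * u^2)
den <- sum((x - mean(x))^2)^2
var_b1_hc1 <- n / (n - 2) * num / den       # n/(n-k-1) correction with k = 1
se_b1_hc1  <- sqrt(var_b1_hc1)
```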
Matrix Notation
Using matrix notation:
$Y = X\beta + u$
$\hat{\beta} = (X'X)^{-1}X'Y$
$Var(\hat{\beta}|X) = (X'X)^{-1}X'\,Var(u|X)\,X(X'X)^{-1} = \sigma^2(X'X)^{-1}$ (the last equality under homoskedasticity)
For OLS, we minimize $\sum\hat{u}_i^2$; for GLS, we minimize a weighted sum of squared residuals, $\sum\left(\hat{u}_i/\sigma_i\right)^2$, giving observations with a smaller error variance more weight.
In large samples, the efficiency gain from GLS is likely to be minimal. Accurately estimating the
variance-covariance matrix in the first stage is empirically challenging and often yields incorrect
estimates. We can get correct estimates of standard errors using White Standard errors, making the
use of GLS for the correction of heteroscedasticity largely unnecessary in practice.
Autocorrelation
One of the OLS assumptions states that the covariance and correlations between different
disturbances are all 0:
$Cov(u_t, u_s) = 0 \quad for\ all\ t \neq s$
This assumption states that the disturbances $u_t$ and $u_s$ are independently distributed, i.e. there is no serial dependence.
If the assumption is no longer valid, then the disturbances are not pairwise independent but pairwise
autocorrelated. This means that an error occurring at period 𝑡 may be carried over to the next period
𝑡 1. Autocorrelation is most likely to occur in time series data. In cross-sectional data we can
change the arrangement of the data without altering the results.
Lecture 8 - Binary Dependent Variable
Chapter 11 - Regression with a Binary Dependent Variable
When running a regression with a binary dependent variable, we interpret the regression as
modeling the probability that the dependent variable equals 1. Thus, for a binary variable,
$E(Y|X_1, \dots, X_k) = \Pr(Y = 1|X_1, \dots, X_k)$.
The linear multiple regression model applied to a binary dependent variable is called the linear probability model (LPM), because it corresponds to the probability that the dependent variable equals 1 given $X$.
The coefficient 𝛽1 on a regressor 𝑋 is the change in the probability that 𝑌 1 associated with a unit
change in 𝑋.
Hypotheses concerning several coefficients can be tested using 𝐹-statistic, and confidence intervals
can be formed as the estimated coefficient $\pm$ 1.96 SEs.
The errors of the linear probability model are always heteroskedastic, so we must ALWAYS use
heteroskedasticity-robust SEs for inference.
$R^2$ is not a meaningful measure of fit for a model where the dependent variable is binary, unless the regressors are also binary.
________________________________________________________________________________
Example - Denial of Bank Loans and Race
The following model is estimated:
Shortcoming of the LPM: Because probabilities cannot exceed 1, the effect on the probability that
𝑌 1 of a given change in 𝑋 must be nonlinear. However, in LPM, the effect of a change is
constant, which leads to predicted probabilities that can drop below 0 or exceed 1.
2) Logit
$\Pr(Y = 1|X_1, X_2, \dots, X_k) = F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)$
Where 𝐹 is the cumulative standard logistic distribution function.
As with Probit, the logit coefficients are best interpreted by computing predicted probabilities and differences in predicted probabilities. The coefficients of the logit model can be estimated by maximum likelihood, so again, the $t$-statistic and confidence interval are applied as usual.
Sum-up: The LPM is easiest to use and to interpret, but it cannot capture the nonlinear nature of the
true population regression function. Probit and Logit regressions model this nonlinearity in the
probabilities, but their regression coefficients are more difficult to interpret.
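A minimal R sketch (simulated binary outcome) comparing the LPM, probit and logit fits; as the sum-up says, the nonlinear models are interpreted through predicted probabilities:

```r
set.seed(13)
n <- 500
x <- rnorm(n)
y <- as.numeric(runif(n) < pnorm(-0.5 + x))   # binary dependent variable

lpm    <- lm(y ~ x)
probit <- glm(y ~ x, family = binomial(link = "probit"))
logit  <- glm(y ~ x, family = binomial(link = "logit"))

# predicted probability that Y = 1 at x = 1 under each model
predict(lpm,    newdata = data.frame(x = 1))
predict(probit, newdata = data.frame(x = 1), type = "response")
predict(logit,  newdata = data.frame(x = 1), type = "response")
```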
$if\ d = 0:\ Y = \beta_0 + \beta_1 X_1 + u$
$if\ d = 1:\ Y = (\beta_0 + \delta_0) + \beta_1 X_1 + u$
Thus, if $d = 1$, the slope remains the same, but the intercept increases by the coefficient $\delta_0$. Thus, $\delta_0$ tells us the effect of a certain characteristic on the dependent variable. The group that has $d = 0$ is called the base/control group, and the difference between the two groups is equal to $\delta_0$.
We cannot add a variable that displays perfect multicollinearity with a dummy variable in the same model, e.g. having $d = female$ (with coefficient $\delta_0$) and $X_1 = male$, because then the male variable would display perfect multicollinearity with the intercept.
The interaction of dummies is also of interest. The joint effect of dummy variables in a model can
be considered using an extended version of the model:
$Y = \beta_0 + \delta_1 d_1 + \beta_1 X_1 + \delta_2 d_2 + u$
$Extended:\ Y = \beta_0 + \delta_1 d_1 + \beta_1 X_1 + \delta_2 d_2 + \delta_3 d_1 d_2 + \varepsilon$
Where the interaction is measured by 𝑑1 𝑑2 . The coefficient 𝛿3 then shows the joint effect of the two
dummies on Y.
If we want to just study the effect of one of the dummies, we can take the first derivative:
$\frac{dY}{dd_2} = \delta_2 + \delta_3 d_1$
This shows that when studying the effect of 1 dummy variable, we have to take into consideration
the joint effect it has with other dummy variables.
If our goal is to study the relevance of the group, we need to know whether making this distinction
is important. In order to test the relevance, we can utilize the Chow-test:
Consider the regression and with two groups defined by the variable 𝑑:
$Y = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + u$
We can split the regression into two subgroups:
$(1)\ if\ d = 0:\ Y = \alpha_0 + \alpha_1 X_{i,1} + \alpha_2 X_{i,2} + e$
$(2)\ if\ d = 1:\ Y = \gamma_0 + \gamma_1 X_{i,1} + \gamma_2 X_{i,2} + e$
We can then run a test, using the residual sum of squares from the 3 different regressions.
Steps:
1. Run the 3 regressions and compute RSS, RSS1, and RSS2
2. Compute $F = \frac{\left[RSS - (RSS_1 + RSS_2)\right]/(k+1)}{(RSS_1 + RSS_2)/\left(n - 2(k+1)\right)}$, where $k + 1$ is the number of coefficients estimated in each subgroup regression
When estimating something with the LPM, the $\hat{\beta}_1$ derived is going to be the effect that the variable we want to study, X, has on the probability that the dependent variable is equal to 1.
Negative:
- The dependent variable is not normally distributed, meaning that the error term $u$ is not normally distributed either, so the usual small-sample inference does not apply.
- Errors are heteroscedastic, as we expect the variation in Y to differ a lot across values of X. The usual OLS standard errors are then biased, unless we use White SEs.
(2) Probit
When we assume that OLS is not a good strategy, we are thinking that maybe the probability that
our dependent variable is equal to 1 given the X cannot be run using the usual OLS regression, and
we must assume a particular type of function:
$\Pr(Y = 1|X) = \Phi(\beta_0 + \beta_1 X)$
Meaning that we have to consider a function which is not linear, and therefore we need to implement a different type of technique to derive the betas. In order to compute it, we use the maximum likelihood (ML) technique, which chooses the parameter values that maximize the likelihood that the data can explain the model (that the right-hand side can explain the left-hand side of the regression).
Probit is thus:
$\Pr(Y = 1|X) = \Phi(\beta_0 + \beta_1 X)$
Where $\Phi$ is the cumulative standard normal distribution. The Probit model has the following desirable properties:
- $\Pr(Y = 1|X)$ is increasing in $X$ for $\beta_1 > 0$ (same as LPM)
10.2 Panel Data with Two Time Periods: "Before and After" Comparisons
If data for each entity are obtained for $T = 2$ time periods, we can compare values of the dependent variable in the 2nd period to values in the 1st period - by focusing on these changes in the dependent variable, we "hold constant" the unobserved factors that differ from 1 entity to another but do not change over time within the entity. This entity fixed variable is denoted $Z_i$:
$Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_2 Z_i + u_{it}$
Because 𝑍 does not change over time, it will not produce any change in 𝑌 between the time
periods. Thus, the influence of 𝑍 can be eliminated by analyzing the change in 𝑌 between the 2
periods:
$Y_{i1} = \beta_0 + \beta_1 X_{i1} + \beta_2 Z_i + u_{i1}$
$Y_{i2} = \beta_0 + \beta_1 X_{i2} + \beta_2 Z_i + u_{i2}$
Subtracting these regressions from each other:
$Y_{i2} - Y_{i1} = \beta_1(X_{i2} - X_{i1}) + (u_{i2} - u_{i1})$
Analyzing changes in 𝑌 and 𝑋 has the effect of controlling for variables that are constant over time,
thereby eliminating this source of OVB.
This "before and after" analysis works when $T = 2$. If $T > 2$, we need to use the method of fixed effects regression.
10.5 The Fixed Effects Regression Assumptions and Standard Errors for Fixed Effects Regression
In panel data, the regression error can be correlated over time within an entity - this does not
introduce bias into the fixed effects estimator, but it affects the variance of the fixed effects
estimator and therefore, it affects how one computes standard errors.
Thus, we use clustered standard errors, which are robust to both heteroskedasticity and to
correlation over time within an entity.
When there are many entities (when 𝑛 is large), hypothesis tests and confidence intervals can be
computed using the usual large-sample normal and 𝐹 critical values.
The cross-sectional counterpart of Assumption 2 holds that each observation is independent, which
arises under simple random sampling. In contrast, Assumption 2 for panel data holds that the
variables are independent across entities but makes no such restriction within an entity.
If 𝑋 is correlated over time for a given entity, then 𝑋 is said to be autocorrelated - what happens
in one time period tends to be correlated with what happens in the next time period. Such omitted
factors, which persist over multiple years, produce autocorrelated regression errors. Not all omitted
factors will produce autocorrelation in 𝑢 ; if the factor for a given entity is independently
distributed from 1 year to another, then this component of the error term would be serially
uncorrelated.
If the regression errors are autocorrelated, then the heteroskedasticity-robust standard error formula
for cross-section is not valid. Instead, we have to use heteroskedasticity- and autocorrelation-robust
(HAC) SEs, e.g. clustered SEs. Equation given in appendix 10.2.
Sum-up: In panel data, variables are typically autocorrelated, that is, correlated over time within an
entity. Standard errors need to allow both for this autocorrelation and for potential
heteroskedasticity, and one way to do so is to use clustered standard errors.
$Z_j$ is also not observable for entity $j$, but we can make a very strong assumption that $i$ and $j$ are identical in $Z$. The difference between $i$ and $j$ is then $Y_i - Y_j = \beta_1(X_i - X_j) + (u_i - u_j)$. Then we can simply run an OLS regression and get a correct estimate for $\beta_1$.
The problem with this approach is finding two entities that are identical: (1) it is very likely that $i \neq j$ in terms of $Z$, and (2) self-selection (e.g. choosing twins in medical studies) can lead to the model only being able to explain that particular case (e.g. family structure), i.e. a lack of generalizability.
$Y_{i,t+1} = \beta_0 + \beta_1 X_{i,t+1} + \beta_2 Z_i + u_{i,t+1}$
With a second time period added, the strong assumption lies in time not playing any role in $Z$: $Z_{i,t} = Z_{i,t+1} = Z_i$. Again, we take the difference between the 2 models:
$Y_{i,t+1} - Y_{i,t} = \beta_1(X_{i,t+1} - X_{i,t}) + (u_{i,t+1} - u_{i,t})$
The unobservable variable, 𝑍, again disappears, allowing us to estimate 𝛽1 , which is now BLUE.
There are 2 issues with this approach:
1. We have to convince people that 𝑍 is constant over time, i.e. 𝑍 , 𝑍, 1 𝑍 . This is done
by collecting information collected by other scientists.
Since 𝛽2 𝑍 is constant over time, it is considered a constant, and the model can be rewritten:
$Y_{i,t} = (\beta_0 + \beta_2 Z_i) + \beta_1 X_{i,t} + u_{i,t}$
$\leftrightarrow$
$Y_{i,t} = \alpha_i + \beta_1 X_{i,t} + u_{i,t}$
Where $\alpha_i = \beta_0 + \beta_2 Z_i$ is the fixed effect. Thus, when we have more than 2 time periods, the dependent variable for each entity and time period can be written as a function of $X_{i,t}$ plus a set of dummies which are equal to 1 if the entity is $i$ and 0 otherwise.
The graphical presentation of fixed effects regression is parallel shifts of the function depending on
the 𝛼:
Again, we have a BLUE estimate for 𝛽1 , as including the dummy is able to explain completely all
of the equation.
Be careful when considering this regression, as it is prone to the dummy trap (perfect multicollinearity) - we have to use $n - 1$ dummies. The dummy that is excluded is called the base (or reference) category, i.e. the coefficient on an included dummy is the value compared to the value of the base, i.e. the value of being entity $i$ instead of the base.
$Y_{i,t} - \bar{Y}_i = \beta_1(X_{i,t} - \bar{X}_i) + (u_{i,t} - \bar{u}_i)$
Both approaches give the same 𝛽 and 𝑡-statistic. However, when 𝑛 is very large, the fixed effects
regression is not very convenient for software, as it will have to estimate a very high number of
dummies.
A regression as the one above with both entity fixed and time fixed effects can be written as:
$Y_{i,t} = \beta_1 X_{i,t} + \alpha_i + \lambda_t + u_{i,t}$
Note that 𝛽0 is excluded as it is a part of 𝛼 . Because of the dummy trap, the number of dummies
estimated for 𝛼 is 𝑛 1, and for 𝜆 it is 𝑇 1.
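A minimal R sketch (simulated panel) of entity and time fixed effects estimated with n - 1 and T - 1 dummies via factor(); dedicated packages (e.g. plm or fixest), if available, do the same more efficiently when n is large:

```r
set.seed(14)
n_id <- 50; n_t <- 6
dat <- expand.grid(id = 1:n_id, t = 1:n_t)
alpha  <- rnorm(n_id)[dat$id]                 # entity fixed effects
lambda <- rnorm(n_t)[dat$t]                   # time fixed effects
dat$x <- rnorm(n_id * n_t) + alpha
dat$y <- 2 * dat$x + alpha + lambda + rnorm(n_id * n_t)

fe <- lm(y ~ x + factor(id) + factor(t), data = dat)
coef(fe)["x"]    # close to the true slope of 2
```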
Clustered SE: The OLS fixed effect estimator $\hat{\beta}$ is unbiased, consistent, and asymptotically normally distributed. However, the usual OLS standard errors (both homoscedasticity-only and heteroscedasticity-robust) will in general be wrong because they assume that the $u_{i,t}$ are serially uncorrelated. In practice, the OLS SEs often understate the true sampling uncertainty - if $u_{i,t}$ is
correlated over time, you do not have as much information (as much random variation) as you
would if 𝑢 , were uncorrelated. This problem is solved by using clustered standard errors:
$Clustered\ SE\ of\ \bar{Y} = \sqrt{\frac{s_{cluster}^2}{n}}$
Where
$s_{cluster}^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(\bar{Y}_i - \bar{Y}\right)^2$
Lecture 10 - Instrumental Variable Regression
Chapter 12 - Instrumental Variables Regression
If 𝑋 and 𝑢 are correlated, the OLS estimator is inconsistent. This correlation can be due to various
explanations, e.g. OVB, errors in variables (measurement errors in the regressors), and
simultaneous causality. Whatever the source, the effect on 𝑌 of a unit change in 𝑋 can be estimated
using the instrumental variables estimator if there is a valid instrumental variable.
The population regression model relating the dependent variable 𝑌 and regressor 𝑋 is:
$Y_i = \beta_0 + \beta_1 X_i + u_i$
If $X_i$ and $u_i$ are correlated, the OLS estimator is biased, and therefore IV estimation uses an instrumental variable $Z_i$ to isolate that part of $X_i$ that is uncorrelated with $u_i$.
Variables correlated with the error term are called endogenous variables, while variables that are
uncorrelated with 𝑢 are exogenous variables.
The 2 conditions for a valid instrument:
1. Instrument relevance: $corr(Z_i, X_i) \neq 0$
2. Instrument exogeneity: $corr(Z_i, u_i) = 0$
For an instrument to be relevant, its variation is related to the variation in 𝑋 . If in addition the
instrument is exogenous, then that part of the variation of 𝑋 captured by the instrumental variable is
exogenous. Thus, an instrument that is relevant and exogenous can capture movements in 𝑋 that
are exogenous, which means that we can estimate the population coefficient 𝛽1 .
If $Z$ satisfies the conditions of instrument relevance and exogeneity, $\beta_1$ can be estimated using an IV estimator called two stage least squares (2SLS), which is calculated in 2 stages. 1st stage - decompose $X$ into 2 components: a problematic component that may be correlated with $u$ and a problem-free component that is uncorrelated with it. 2nd stage - use the problem-free component to estimate $\beta_1$.
To decompose 𝑋 , we use a population regression linking 𝑋 and 𝑍:
$X_i = \pi_0 + \pi_1 Z_i + v_i$
$\pi_0 + \pi_1 Z_i$ is the part of $X_i$ that can be predicted by $Z_i$, and since $Z_i$ is exogenous, this component is uncorrelated with $u_i$. $v_i$ is the problematic part that is correlated with $u_i$. Thus, we disregard $v_i$ and use OLS to obtain the first-stage fitted values $\hat{X}_i = \hat{\pi}_0 + \hat{\pi}_1 Z_i$. We then regress $Y$ on $\hat{X}$ using OLS, and the resulting estimators from this 2nd-stage regression are the 2SLS estimators $\hat{\beta}_0^{2SLS}$ and $\hat{\beta}_1^{2SLS}$.
If the sample is large, the 2SLS estimator is consistent and normally distributed. When there is a
single 𝑋 and a single instrument 𝑍, the 2SLS estimator is:
$\hat{\beta}_1^{2SLS} = \frac{s_{ZY}}{s_{ZX}} = \frac{\widehat{cov}(Z_i, Y_i)}{\widehat{cov}(Z_i, X_i)}$
The 2SLS estimator is consistent because the sample covariance is a consistent estimator of the population covariance. The sampling distribution of this estimator is approximately $N(\beta_1, \sigma_{\hat{\beta}_1^{2SLS}}^2)$, where:
$\sigma_{\hat{\beta}_1^{2SLS}}^2 = \frac{1}{n}\cdot\frac{var\left[(Z_i - \mu_Z)u_i\right]}{\left[cov(Z_i, X_i)\right]^2}$
Because 𝛽12 is normally distributed in large samples, hypothesis tests about 𝛽1 can be performed
by computing the 𝑡-statistic, and a 95% confidence interval is given by the usual equation.
If 𝑊 is an effective control variable, then including it makes the instrument uncorrelated with 𝑢,
making the 2SLS estimator of the coefficient on 𝑋 consistent. If 𝑊 is correlated with 𝑢 then the
2SLS coefficient on 𝑊 is subject to OVB and does not have a causal interpretation.
When there is a single endogenous regressor, 𝑋, and some additional included exogenous variables,
the model of interest is:
$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \dots + \beta_{1+r} W_{ri} + u_i$
Where $X_i$ might be correlated with $u_i$ but $W_{1i}, \dots, W_{ri}$ are not. The population 1st-stage 2SLS regression, which relates $X$ to the exogenous variables $W$ and $Z$ and is also called the reduced form equation for $X$, is:
$X_i = \pi_0 + \pi_1 Z_{1i} + \dots + \pi_m Z_{mi} + \pi_{m+1} W_{1i} + \dots + \pi_{m+r} W_{ri} + v_i$
The unknown coefficients are estimated by OLS. In the 2nd stage, we regress $Y$ on $X$, $W_1, \dots, W_r$ using OLS after replacing $X$ by its predicted values from the regression in the 1st stage.
If we are dealing with a multiple regression model instead, each of the endogenous regressors
requires its own 1st-stage regression.
When there is one included endogenous variable but multiple instruments, the condition for
instrument relevance is that at least one 𝑍 is useful for predicting 𝑋 , given 𝑊. When we have
multiple endogenous variables, we must rule out perfect multicollinearity in the 2 nd -stage
population regression.
- The 2 conditions for a valid instrument (relevance and exogeneity) must hold
Under the IV regression assumptions, the TSLS estimator is consistent and normally distributed in
large samples. Thus, the general procedures for statistical inference in regression models extend to
2SLS regression. However, we have to realize that the 2nd-stage OLS SEs are wrong, as they do not adjust for the use of the predicted values of the included endogenous variables (R does this adjustment).
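A minimal R sketch (simulated data with one endogenous regressor and one instrument) of 2SLS done in two explicit stages; the second-stage lm() coefficients are the 2SLS estimates, but, as noted above, its standard errors are not, which is why routines such as ivreg in the AER package (if available) are used in practice:

```r
set.seed(15)
n <- 1000
z <- rnorm(n)                         # instrument
u <- rnorm(n)
x <- 0.8 * z + 0.5 * u + rnorm(n)     # endogenous: correlated with u
y <- 1 + 2 * x + u

first  <- lm(x ~ z)                   # 1st stage
second <- lm(y ~ fitted(first))       # 2nd stage on predicted X
coef(second)[2]                       # close to 2 (plain OLS of y on x is biased)
cov(z, y) / cov(z, x)                 # identical IV estimate in the just-identified case
```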
12.3 Checking Instrument Validity
(1) Instrument Relevance
The more relevant the instruments, the more variation in 𝑋 is explained by the instruments, and the
more information is available for use in IV regression. This leads to a more accurate estimator. In
addition, the more relevant the instruments, the better is the normal approximation to the sampling
distribution of the 2SLS estimator and its 𝑡-statistic.
Weak instruments mean that the normal distribution provides a poor approximation to the sampling
distribution even if the sampling size is large, meaning that the 2SLS is no longer reliable.
If we identify weak instruments, there are a few options for how to handle it:
- If we have a small number of strong instruments and many weak ones, we can
discard the weakest instruments and use the most relevant subset for 2SLS analysis.
This can make SEs increase, but do not worry, the initial SEs were not meaningful
anyway
- If coefficients are exactly identified, we cannot discard the weak instruments. The
solution is to either (1) identify additional, stronger instruments or (2) proceed with
the weak instruments but employ a different method than 2SLS
2. Measurement error - when the variable is not correctly measured, creating noise in the data. The error is often random. Suppose we include the variable $X_i$ in our model, but the true variable is actually $X_i^*$; then $X_i = X_i^* + v_i$, where $v_i$ is the measurement error. This leads to $X_i$ and $v_i$ being correlated, causing our results to be biased. Can be fixed using IV.
If the correlation between the instrumental variable and the error term is 0, we also have that
the covariance is 0:
$corr(Z_i, u_i) = 0$
$\leftrightarrow$
$cov(Z_i, u_i) = 0$
$cov(Z_i, Y_i - \beta_0 - \beta_1 X_i) = 0$
$\leftrightarrow$
$cov(Z_i, Y_i) - cov(Z_i, \beta_0) - \beta_1 cov(Z_i, X_i) = 0$
$\leftrightarrow$
$cov(Z_i, Y_i) - \beta_1 cov(Z_i, X_i) = 0$
Only one particular 𝛽1 will be able to satisfy this. Therefore, we have that:
$\beta_1 = \frac{cov(Z_i, Y_i)}{cov(Z_i, X_i)}$
In matrix notation:
$\hat{\beta}^{IV} = (Z'X)^{-1}Z'Y$
$X_i = \delta_0 + \delta_1 Z_i + \delta_2 w_i + v_i$
If 𝑍 is relevant, we should expect that 𝛿1 is statistically different from 0, which means that
this instrument is important in explaining 𝑋 . From this OLS regression, we then compute
the prediction of 𝑋 , i.e. 𝑋 .
$Y_i = \alpha + \beta_1\hat{X}_i + \beta_2 w_i + \varepsilon_i$
This is the model we started out with except we are including 𝑋 instead of 𝑋 . The
estimation of 𝛽1 is then the estimation of 𝛽1 .
Warning: If we run this regression using 2SLS by hand or using software, we are going to
find that 𝛽1 is going to be identical but 𝑆𝐸 𝛽1 is going to differ. Software provides the
correct SE as it adjusts it while hand-calculations do not.
- Do not forget the economic theory, this also needs to support our argument and
model.
3. Compute the $F$-test and check the $J$-statistic, $J = mF \sim \chi_{m-k}^2$: the larger the $J$, the smaller the p-value and the stronger the evidence against the null hypothesis that all the instruments are exogenous (the over-identification test).
The number of instruments needs to be at least as large as the number of potentially endogenous variables.
The difference estimator is the difference in the sample averages for the treatment and control
groups, which is computed by regressing the outcome variable 𝑌 on a binary treatment indicator 𝑋:
$Y_i = \beta_0 + \beta_1 X_i + u_i$
The efficiency of the difference estimator can often be improved by including some control
variables 𝑊 in the regression. The differences estimator with additional regressors is:
$Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_{1i} + \dots + \beta_{1+r} W_{ri} + u_i$
If 𝑊 helps explain the variation in 𝑌, then including 𝑊 reduces the SE of the regression and often
also of the coefficient 𝛽1 . For 𝛽1 to be unbiased, the control variables 𝑊 must be such that 𝑢
satisfies conditional mean independence, i.e. $E(u_i|X_i, W_i) = E(u_i|W_i)$ - this holds when the $W$ are pretreatment individual characteristics (such as gender) and $X_i$ is randomly assigned.
The coefficient on 𝑊 does not have a causal interpretation.
Randomization based on covariates: Randomization in which the probability of assignment to the
treatment group depends on one or more observable variables 𝑊. Ensures that OLS estimator is
unbiased - as 𝑋 is assigned randomly based on 𝑊 .
$\hat{\beta}_1^{DID} = \Delta\bar{Y}^{treatment} - \Delta\bar{Y}^{control}$
If the quasi-experiment yields a variable $Z_i$ that influences receipt of treatment, if data are available both on $Z_i$ and on the treatment actually received ($X_i$), and if $Z_i$ is "as if" randomly assigned, then $Z_i$ is a valid instrument for $X_i$ and the coefficients can be estimated using 2SLS.
3. The heterogeneity in the treatment effect and the heterogeneity in the effect of the instrument are uncorrelated, i.e. $\beta_{1i}$ and $\pi_{1i}$ are random but $cov(\beta_{1i}, \pi_{1i}) = 0$
Notes from Lecture
Experiments: consciously made or designed in order to understand some events in nature (test a
hypothesis)
Quasi-Experiments: resemble experiments in the sense that the same type of hypothesis is tested as in experiments, but quasi-experiments make use of historical data instead to estimate the importance of certain variables
Program Evaluation: experiment on a smaller scale where we want to study a very small aspect of a
particular event that happened in society, most often policy evaluation
Randomization is very important in experiments. A certain portion of the population should receive a treatment, X. As a dummy variable, the treatment group has X = 1. Due to this randomization, we do not have to control for any other characteristics than the treatment X, and therefore, our regression model will simply be:
$Y = \beta_0 + \beta_1 X + u$
At the end of the experiment, we are going to get that if the individual is treated, the outcome will
be:
$Y^{treated} = \beta_0 + \beta_1$
For the control group:
$Y^{control} = \beta_0$
We can then find the effectiveness of the treatment by taking the difference between the two:
$Y^{treated} - Y^{control} = \beta_1$
$X = \gamma_0 + \gamma_1 w_1 + \dots + \gamma_r w_r + u$
and then run an F-test to see if all the coefficients are 0 (randomization), because if they are, no characteristic can explain whether you are selected for treatment or not.
Partial solution if randomization is not perfect (reject F-test) is to run the regression
including the characteristic we also have to control for, e.g.:
$Y = \beta_0 + \beta_1 X + \beta_2 w_1 + u$
(2) Failure to follow treatment protocol: Individuals do not follow the instructions in
the protocol, which leads to a problem of selection, causing results to be biased.
(4) Change of behavior of participants: Telling individuals that they are in the
treatment group might lead to them changing their behavior. The solution is to not
tell individuals which group they are in.
- External threats
Differences-in-Differences (DID)
DID is an approach often taken to quasi-experiments. In contrast to experiments, where we are only concerned with X and Y, quasi-experiments include an extra variable, Z, which is an event. We
need to distinguish between 2 dimensions: (1) intertemporal - before and after the event, and (2)
whether an individual is in the treatment or the control group. We thus need to consider 4
regressions:
(1) Before event, treatment group:
$Y^{T,before} = \beta_0 + \beta_2 w^{T,before} + \beta_3\gamma^{T,before} + u^{T,before}$
Where $w$ is an observable characteristic, and $\gamma$ is unobservable.
(2) After event, treatment group:
$Y^{T,after} = \beta_0 + \beta_1 X + \beta_2 w^{T,after} + \beta_3\gamma^{T,after} + u^{T,after}$
Here we have added the treatment coefficient.
(3) Before event, control group:
$Y^{C,before} = \beta_0 + \beta_2 w^{C,before} + \beta_3\gamma^{C,before} + u^{C,before}$
(and similarly (4) for the control group after the event). Taking the difference over time within a group, e.g. for the treatment group:
$Y^{T,after} - Y^{T,before} = \beta_1 X + \beta_2\left(w^{after} - w^{before}\right) + \beta_3\left(\gamma^{after} - \gamma^{before}\right) + \left(u^{after} - u^{before}\right)$
Everything but the gammas can be observed, so if we assume that the change in the unobservables over time is identical for the treated and the control group, the gammas cancel when we difference across the two groups, and we can write:
$\Delta\Delta Y = \left(Y^{T,after} - Y^{T,before}\right) - \left(Y^{C,after} - Y^{C,before}\right) = \beta_0 + \beta_1 X + u$
2. Prediction: $\hat{Y} = \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k$
$MSPE = \sigma_u^2 + E\left[(\hat{\beta}_1 - \beta_1)X_1 + \dots + (\hat{\beta}_k - \beta_k)X_k\right]^2$
The principle of shrinkage: if we run an OLS regression and, instead of taking $\hat{\beta}$, we use the James-Stein beta, which is $\hat{\beta}$ multiplied by a constant c ($\hat{\beta}^{JS} = c\hat{\beta}$, $0 < c < 1$), this makes $\hat{\beta}^{JS}$ smaller, and this deflation is needed for big datasets to correct for the MSPE increasing in the number of regressors $k$.
As c gets smaller, the squared bias of the estimator increases but the variance decreases. This produces a bias-variance tradeoff: if $k$ is large, the benefit of smaller variance can outweigh the cost of larger bias, for the right choice of c, thus reducing the MSPE.
Split-sample estimation of MSPE:
1. Estimate the model using half the estimation sample
2. Use the estimated model to predict Y for the other half of the data, called the reserve (test) sample, and calculate the prediction errors
3. Estimate the MSPE using the prediction errors for the test sample
$MSPE = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$
The Ridge Regression
The ridge regression estimator shrinks the estimate towards 0 by penalizing large squared values of
the coefficients. The ridge regression estimator minimizes the penalized sum of squared residuals:
$S^{Ridge}(b;\lambda) = \sum_{i=1}^{n}\left(Y_i - b_1 X_{1i} - \dots - b_k X_{ki}\right)^2 + \lambda\sum_{j=1}^{k} b_j^2$
Thus, the penalized sum of squared residuals is minimized at a smaller value of b than is the
unpenalized SSR.
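A minimal base-R sketch of ridge shrinkage via the closed form (X'X + lambda*I)^{-1}X'y on centered, standardized simulated data; lambda = 0 reproduces OLS and larger lambda shrinks the coefficients toward 0 (an illustration only, not the cross-validated procedure used in practice):

```r
set.seed(16)
n <- 100; k <- 5
X <- scale(matrix(rnorm(n * k), n, k))
y <- X %*% c(2, -1, 0, 0, 0) + rnorm(n)
y <- y - mean(y)

ridge <- function(lambda) solve(t(X) %*% X + lambda * diag(k)) %*% t(X) %*% y
b_ols   <- ridge(0)      # identical to OLS on the centered data
b_ridge <- ridge(50)     # visibly shrunk toward 0
```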
The Lasso
The Lasso estimator shrinks the estimate towards 0 by penalizing large absolute values of the
coefficients. The Lasso regression estimator minimizes the penalized sum of squared residuals:
$S^{Lasso}(b;\lambda) = \sum_{i=1}^{n}\left(Y_i - b_1 X_{1i} - \dots - b_k X_{ki}\right)^2 + \lambda\sum_{j=1}^{k}|b_j|$
This looks like the Ridge estimator, but it turns out to have very different properties: it is not invariant to linear transformations. The Lasso estimator works especially well when in reality many of the predictors are irrelevant, as it sets many of the $\beta$'s exactly equal to 0.
Principal Components
Ridge and Lasso reduce the MSPE by shrinking (biasing) the estimated coefficients toward 0 and, in the case of Lasso, by eliminating many of the regressors entirely.
The principal components regression instead collapses the very many predictors into a much smaller number $p \ll k$ of linear combinations of the predictors. These linear combinations, called the principal components of X, are computed so that they capture as much of the variation in the original $X$'s as possible. Because the number $p$ of principal components is small, OLS can be used, with the principal components as new regressors.
Principal components can be thought of as data compression, so that the compressed data have
fewer regressors with as little information loss as possible.
$\widehat{cov}(Y_t, Y_{t-j}) = \frac{1}{T}\sum_{t=j+1}^{T}\left(Y_t - \bar{Y}_{j+1:T}\right)\left(Y_{t-j} - \bar{Y}_{1:T-j}\right)$
$\hat{\rho}_j = \frac{\widehat{cov}(Y_t, Y_{t-j})}{\widehat{var}(Y_t)}$
Where $\bar{Y}_{j+1:T}$ is the sample average of $Y_t$ computed using the observations $t = j+1, \dots, T$.
14.3 Autoregressions
AR relates a time series variable to its past values. The first-order autoregression (AR(1)) is computed by regressing the next period's value on the current value, using data from multiple periods. It is called a first-order autoregression because it is a regression of the series onto its own lag and only 1 lag is used:
$AR(1): Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t$
OLS is used to estimate the coefficients.
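A minimal R sketch (simulated series) of an AR(1) estimated by OLS, plus the implied one-step-ahead forecast:

```r
set.seed(17)
T_obs <- 200
y <- numeric(T_obs)
for (t in 2:T_obs) y[t] <- 1 + 0.6 * y[t - 1] + rnorm(1)

y_t   <- y[2:T_obs]
y_lag <- y[1:(T_obs - 1)]
ar1   <- lm(y_t ~ y_lag)
coef(ar1)                                          # intercept near 1, slope near 0.6
forecast_next <- sum(coef(ar1) * c(1, y[T_obs]))   # forecast of Y_{T+1}
```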
Forecast error: the mistake made by the forecast, i.e. the difference between the value of $Y_{T+1}$ that actually occurred and its forecasted value based on $Y_T$:
$Forecast\ error = Y_{T+1} - \hat{Y}_{T+1|T}$
** the forecast is not an OLS predicted value, and the forecast error is not an OLS residual.
Forecasts are made for out-of-sample observations, whereas OLS predicted values refer to in-sample observations.
Root mean squared forecast error (RMSFE): measure of the size of the forecast error, the magnitude
of a typical mistake made using a forecasting model. It contains 2 sources of error: (1) the error
arising because future values of 𝑢 are unknown and (2) the error in estimating the coefficients 𝛽0
and 𝛽1 .
$RMSFE = \sqrt{E\left[\left(Y_{T+1} - \hat{Y}_{T+1|T}\right)^2\right]}$
If the 1st source of error is much larger than the second, which can happen if the sample size is very large, then $RMSFE \approx \sqrt{var(u_t)}$, i.e. the standard deviation of the error term.
The assumption that the conditional expectation of 𝑢 is 0 given past values of 𝑌 has 2 important
implications:
1. The best forecast of 𝑌 1 based on its entire history depends on only the most recent 𝑝
values.
14.4 Time Series Regression with Additional Predictors and the Autoregressive Distributed Lag
Model
When other variables and their lags are added to an AR, the result is autoregressive distributed lag
(ADL) model.
ADL: lagged values of the dependent variable are included as regressors, but the regression also
includes multiple lags of an additional predictor.
The ADL with 𝑝 lags of 𝑌 and 𝑞 lags of 𝑋 is denoted 𝐴𝐷𝐿 𝑝, 𝑞 :
$Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \delta_1 X_{t-1} + \delta_2 X_{t-2} + \dots + \delta_q X_{t-q} + u_t$
The idea that historical relationships can be generalized to the future, the principle which
forecasting relies on, is formalized by the concept of stationarity.
Stationarity: the probability distribution of the time series variable does not change over time, i.e.
the joint distribution of $(Y_{s+1}, Y_{s+2}, \ldots, Y_{s+T})$ does not depend on $s$, regardless of the value of $T$.
Otherwise, $Y_t$ is said to be non-stationary.
The bottom part of the image states the assumptions of the model.
Granger Causality tests: The Granger causality statistic is the 𝐹-statistic that tests the hypothesis
that the coefficients on all the values of one of the variables in the time series regression are 0. This
implies that these regressors have no predictive content for 𝑌 beyond that contained in the other
regressors. Rejecting 𝐻0 in a Granger test means that the past values of the variables appear to
contain information that is useful for forecasting $Y_t$, beyond that contained in past values of $Y_t$.
RMSFE for a time series regression with multiple predictors can also be written as:
$RMSFE = \sqrt{\sigma_u^2 + var\left[(\hat{\beta}_0 - \beta_0) + (\hat{\beta}_1 - \beta_1)Y_T + (\hat{\delta}_1 - \delta_1)X_T\right]}$
- Bayes Information Criterion (BIC): Also called Schwarz information criterion (SIC)
$BIC(p) = \ln\left(\frac{SSR(p)}{T}\right) + (p+1)\frac{\ln T}{T}$
For a regression with $K$ coefficients:
$BIC(K) = \ln\left(\frac{SSR(K)}{T}\right) + K\frac{\ln T}{T}$
- Akaike Information Criterion (AIC):
$AIC(p) = \ln\left(\frac{SSR(p)}{T}\right) + (p+1)\frac{2}{T}$
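A minimal sketch of lag-length selection by computing $BIC(p)$ and $AIC(p)$ by hand from AR($p$) residuals, following the formulas above; the simulated series and the maximum lag considered are assumptions.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(3)
T = 400
y = np.zeros(T)
for t in range(2, T):                       # true process is an AR(2)
    y[t] = 0.5 * y[t - 1] + 0.2 * y[t - 2] + rng.normal()

def bic(ssr, T_eff, p):
    return np.log(ssr / T_eff) + (p + 1) * np.log(T_eff) / T_eff

def aic(ssr, T_eff, p):
    return np.log(ssr / T_eff) + (p + 1) * 2 / T_eff

for p in range(1, 7):
    res = AutoReg(y, lags=p, trend="c").fit()
    ssr = np.sum(res.resid ** 2)
    T_eff = res.resid.shape[0]              # observations actually used in the regression
    print(p, round(bic(ssr, T_eff, p), 3), round(aic(ssr, T_eff, p), 3))
```

The lag length with the smallest criterion value is chosen; the BIC typically selects a smaller model than the AIC.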
Trend: A persistent long-term movement of a variable over time. A time series variable fluctuates
around its trend. There are 2 types of trends in time series data:
1. Deterministic - a nonrandom function of time, e.g. a trend that is linear in time
2. Stochastic - random and varies over time
Problems caused by stochastic trends, which can lead to OLS estimators and $t$-statistics having
non-normal distributions even in large samples:
- The estimator of the autoregressive coefficient in AR(1) is biased toward 0 if its true value is
1, as the asymptotic distribution of $\hat{\beta}_1$ is shifted toward 0: $E(\hat{\beta}_1) \approx 1 - 5.3/T$. Thus, the model
performs worse than a random walk model, which imposes $\beta_1 = 1$.
- The $t$-statistic on a regressor with a stochastic trend can have a nonnormal distribution even
in large samples. The distribution of the statistic is not readily tabulated - it is only possible to
tabulate it in the special case of an AR with a unit root.
- Spurious regression: 2 series that are independent will with high probability misleadingly
appear to be related if they both have stochastic trends. A special case where certain
regression-based methods are still reliable is when the series are cointegrated, i.e. they
contain a common stochastic trend.
An informal way to detect stochastic trends is to look at the first autocorrelation coefficient. In large
samples, a small first autocorrelation coefficient combined with a time series plot that has no
apparent trend suggests that the series does not have a trend.
A formal test for stochastic trends is the Dickey-Fuller test. The starting point of this test is the AR
model. In an AR(1) model, we know that if 𝛽1 1, then 𝑌 is nonstationary and contains a
stochastic trend. Thus, for AR(1) the hypotheses for the DF test in the model $Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t$ are:
$H_0: \beta_1 = 1$
$H_1: \beta_1 < 1$
The alternative hypothesis is that the series is stationary. The test is most easily implemented by
estimating a modified version of the AR model, constructed by subtracting $Y_{t-1}$ from both sides:
$H_0: \delta = 0$
$H_1: \delta < 0$
In the model $\Delta Y_t = \beta_0 + \delta Y_{t-1} + u_t$, where $\delta = \beta_1 - 1$.
The $t$-statistic used is called the Dickey-Fuller statistic, which is computed using non-robust SEs.
The Dickey-Fuller test in the AR($p$) model is presented in Key Concept 14.8 on page 605.
A commonly used alternative hypothesis is 𝐻1 : the series is stationary around a deterministic trend.
Using a hypothesis like this should be motivated by economic theory.
Under the 𝐻0 of a unit root, the augmented Dickey-Fuller (ADF) statistic does not have a normal
distribution. The critical values for the ADF test are given in table 14.4. The ADF test is one-sided.
The best way to handle a trend in a series is to transform it so it no longer has a trend. If the series
has a unit root, then the first difference of the series does not have a trend.
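A minimal sketch of the (augmented) Dickey-Fuller test with statsmodels, applied to a simulated random walk and to its first difference; the series and lag-selection choices are illustrative assumptions.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
y = np.cumsum(rng.normal(size=500))       # random walk: has a unit root

# ADF regression with an intercept ('c'); lag length chosen by AIC
stat, pvalue, usedlag, nobs, crit, _ = adfuller(y, regression="c", autolag="AIC")
print("levels:           ADF stat =", round(stat, 2), "p-value =", round(pvalue, 3))

dy = np.diff(y)                           # first differencing removes the stochastic trend
stat, pvalue, *_ = adfuller(dy, regression="c", autolag="AIC")
print("first difference: ADF stat =", round(stat, 2), "p-value =", round(pvalue, 3))
```

The test on the levels should fail to reject the unit root, while the test on the first difference should reject it.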
14.7 Non-Stationarity II: Breaks
Break: When the population regression function changes over the course of the sample. Can arise
either from a discrete change in the population regression coefficients at a distinct date or from a
gradual evolution of the coefficients over a longer period of time.
The issue with breaks is that OLS regressions estimate relationships that hold on average - if there
is a break, then 2 periods are combined, meaning that the average is not really true for either period,
leading to poor forecasts.
Chow test: If you suspect that there is a break in the series, you can test for a break at a known date.
The 𝐻0 of no break can be tested using a binary variable interaction regression - consider an
ADL(1,1) model where we let 𝜏 denote the hypothesized break date and 𝐷 𝜏 a binary variable that
equals 0 before the break date and 1 after:
$Y_t = \beta_0 + \beta_1 Y_{t-1} + \delta_1 X_{t-1} + \gamma_0 D_t(\tau) + \gamma_1 [D_t(\tau) \times Y_{t-1}] + \gamma_2 [D_t(\tau) \times X_{t-1}] + u_t$
If there is no break, then the regression is the same over both parts of the sample, so the binary
variable does not enter the equation. Thus:
$H_0: \gamma_0 = \gamma_1 = \gamma_2 = 0$
$H_1: \text{at least one } \gamma \neq 0$
This is tested using the 𝐹-statistic, as it is a joint hypothesis.
Quandt likelihood ratio (QLR): The Chow test can be modified to handle testing for unknown break
dates by testing for breaks at all possible dates 𝜏 in between 𝜏0 and 𝜏1 and then using the largest
resulting 𝐹-statistic to test for a break at an unknown date.
The QLR statistic is the largest of many $F$-statistics, and therefore its distribution is not the same as
that of an individual $F$-statistic, meaning that the critical values must be obtained from a distribution
other than the $F$ distribution: $QLR = \max[F(\tau_0), F(\tau_0 + 1), \ldots, F(\tau_1)]$. The distribution of the QLR statistic
depends on $\tau_0/T$ and $\tau_1/T$ - the candidate break dates cannot be too close
to the end or beginning of the sample, so we use trimming and only compute the $F$-statistic for break
dates in the central part of the sample - with 15% trimming we only look at the middle 70%.
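A minimal sketch of the QLR idea, assuming a simple AR(1) with a level break: for each candidate break date in the trimmed middle of the sample, estimate the interacted regression by OLS and record the Chow $F$-statistic, then take the maximum. Critical values would come from the QLR tables, not the $F$ distribution.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
T = 200
y = np.zeros(T)
for t in range(1, T):                              # AR(1) with an intercept break at t = 120
    intercept = 0.0 if t < 120 else 2.0
    y[t] = intercept + 0.5 * y[t - 1] + rng.normal()

y_t, y_lag = y[1:], y[:-1]
n = y_t.shape[0]
f_stats = []
candidates = list(range(int(0.15 * n), int(0.85 * n)))  # 15% trimming
for tau in candidates:
    d = (np.arange(n) >= tau).astype(float)             # break dummy D_t(tau)
    X = sm.add_constant(np.column_stack([y_lag, d, d * y_lag]))
    res = sm.OLS(y_t, X).fit()
    # Chow F-statistic: joint test that the dummy and its interaction are zero
    f_val = np.squeeze(res.f_test("x2 = 0, x3 = 0").fvalue)
    f_stats.append(float(f_val))

qlr = max(f_stats)
print("QLR statistic:", round(qlr, 2),
      "at candidate date", candidates[int(np.argmax(f_stats))])
```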
2. Strict exogeneity - $u_t$ has mean 0 given all past, present, and future values of $X_t$. Under
strict exogeneity, the OLS estimators are no longer the most efficient estimators.
- Large outliers are unlikely; $Y_t$ and $X_t$ have more than 8 nonzero, finite moments
In the distributed lag model, $u_t$ can be autocorrelated, i.e. it can be correlated with its lagged values,
because the omitted factors included in $u_t$ can be serially correlated. This does not affect the
consistency of OLS, nor does it introduce bias. However, we must use heteroskedasticity- and
autocorrelation-consistent (HAC) standard errors to avoid misleading statistical inferences.
Dynamic multipliers: The coefficients on $X_t$ and its lags are the dynamic multipliers, which relate $X$
to $Y$. The effect of a unit change in $X$ on $Y$ after $h$ periods, which is $\beta_{h+1}$ in the distributed lag
model, is called the $h$-period dynamic multiplier. $\beta_2$ is the one-period dynamic multiplier. The
zero-period (or contemporaneous) dynamic multiplier, also called the impact effect, is $\beta_1$,
the effect on $Y$ of a change in $X$ in the same period. The SEs of the dynamic multipliers are the HAC SEs
of the OLS coefficients.
Cumulative dynamic multipliers: The cumulative sum of the dynamic multipliers. The ℎ-period
cumulative dynamic multiplier is the cumulative effect of a unit change in 𝑋 on 𝑌 over the next ℎ
periods. The sum of all individual dynamic multipliers is the cumulative long-run effect on 𝑌 of a
change in 𝑋, and we call it the long-run cumulative dynamic multiplier. Can be estimated by
running:
$Y_t = \delta_0 + \delta_1 \Delta X_t + \delta_2 \Delta X_{t-1} + \delta_3 \Delta X_{t-2} + \cdots + \delta_r \Delta X_{t-r+1} + \delta_{r+1} X_{t-r} + u_t$
Where the coefficients $\delta_j$ are in fact the cumulative dynamic multipliers. The last coefficient, $\delta_{r+1}$, is the
long-run cumulative dynamic multiplier. Advantage: the HAC SEs of the coefficients are the HAC SEs of the cumulative dynamic
multipliers.
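A minimal sketch of this direct approach, under assumed simulated data with two dynamic multipliers: regress $Y_t$ on $\Delta X_t$ and $X_{t-1}$ and read off the cumulative multipliers and their Newey-West (HAC) standard errors.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
T = 500
x = rng.normal(size=T)
u = rng.normal(size=T)
# DL model with beta1 = 1.0 (impact effect) and beta2 = 0.5 (one-period multiplier)
y = 0.2 + 1.0 * x + 0.5 * np.concatenate([[0.0], x[:-1]]) + u

dx = np.diff(x)                  # Delta X_t
x_lag = x[:-1]                   # X_{t-1}
y_t = y[1:]

X = sm.add_constant(np.column_stack([dx, x_lag]))
res = sm.OLS(y_t, X).fit(cov_type="HAC", cov_kwds={"maxlags": 6})
print(res.params)                # delta_1 ~ beta1, delta_2 ~ beta1 + beta2 (cumulative multipliers)
print(res.bse)                   # HAC SEs of the cumulative multipliers
```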
The variance is thus the variance of the OLS estimator when errors are uncorrelated multiplied by a
correction factor that arises from the autocorrelation of errors.
If $v_t$ is IID - as assumed for cross-sectional data - then $var(\bar{v}) = \sigma_v^2/T$. However, if $u_t$ and $X_t$ are
not independently distributed over time, $var(\bar{v}) = (\sigma_v^2/T)\,f_T$, where
$f_T = 1 + 2\sum_{j=1}^{T-1}\left(\frac{T-j}{T}\right)\rho_j$ and $\rho_j = corr(v_t, v_{t-j})$.
In large samples, $f_T$ tends to the limit $1 + 2\sum_{j=1}^{\infty}\rho_j$, i.e. the expression without the fraction $\frac{T-j}{T}$.
The estimator of 𝑓 has to balance between using too many sample autocorrelations and using too
few, because both cases would make the estimator inconsistent. The number of autocorrelations to
include depends on the sample size $T$. A rule of thumb for the truncation parameter is $m = 0.75\,T^{1/3}$, rounded to an
integer - the exact equation for this is found on page 651.
Where $\tilde{Y}_t = Y_t - \phi_1 Y_{t-1}$ and $\tilde{X}_t = X_t - \phi_1 X_{t-1}$.
Since the models are the same, the conditions for their estimation are the same. One of the
conditions is that the zero conditional mean assumption holds for general values of $\phi_1$, which is
equivalent to $E(u_t | X_{t+1}, X_t, X_{t-1}, \ldots) = 0$. This is implied by $X_t$ being strictly exogenous, but it is
not implied by $X_t$ being (past and present) exogenous.
HAC SEs are not needed when the ADL coefficients are estimated by OLS. The estimated
coefficients are not themselves estimates of the dynamic multipliers - a way to compute these is to
express the estimated regression function as a function of current and past values of 𝑋 , i.e. by
eliminating 𝑌 from the estimated regression function:
The example above carries over to the general distributed lag model with multiple lags and an
AR($p$) error term. The general DL model with $r$ lags and an AR($p$) error term is:
$Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + \cdots + \beta_{r+1} X_{t-r} + u_t$
$u_t = \phi_1 u_{t-1} + \phi_2 u_{t-2} + \cdots + \phi_p u_{t-p} + \tilde{u}_t$
Where the $\beta$'s are the dynamic multipliers and the $\phi$'s are the autoregressive coefficients of the error term.
Under the AR($p$) model for the errors, $\tilde{u}_t$ is serially uncorrelated.
The model can equivalently be written as an ADL with $q = r + p$ lags of $X_t$, or in quasi-differenced form:
$\tilde{Y}_t = \alpha_0 + \beta_1 \tilde{X}_t + \beta_2 \tilde{X}_{t-1} + \cdots + \beta_{r+1} \tilde{X}_{t-r} + \tilde{u}_t$
Where $\tilde{Y}_t = Y_t - \phi_1 Y_{t-1} - \cdots - \phi_p Y_{t-p}$ and $\tilde{X}_t = X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p}$.
Again, the dynamic multipliers can be estimated by (feasible) GLS. This entails OLS estimation of
the coefficients of the quasi-differenced specification. The GLS estimator is asymptotically BLUE.
Otherwise, it can be estimated using OLS on the ADL specification - can provide a compact or
parsimonious summary of a long and complex DL.
The advantage of the GLS estimator is that, for a given lag length r in the distributed lag model, the
GLS estimator of the distributed lag coefficients is more efficient than the ADL estimator, at least
in large samples. In practice, then, the advantage of using the ADL approach arises because the
ADL specification can permit estimating fewer parameters than are estimated by GLS.
The first difference of a series, $\Delta Y_t$, is its change between periods $t-1$ and $t$, i.e. $\Delta Y_t = Y_t - Y_{t-1}$.
A series exhibiting autocorrelation is related to its own past values. For time series, we use the
autocovariance:
$\gamma_0 = cov(Y_t, Y_t) = var(Y_t) = E\left[(Y_t - E(Y_t))^2\right]$
$\gamma_1 = cov(Y_t, Y_{t-1}) = E\left[(Y_t - E(Y_t))(Y_{t-1} - E(Y_{t-1}))\right]$
$\rho_1 = \frac{cov(Y_t, Y_{t-1})}{var(Y_t)}$
$\rho_j = \frac{\gamma_j}{\gamma_0}$
Thus, the properties of the autocorrelation function are:
- $\rho_0 = 1$
- $\rho_{-j} = \rho_j$
- $-1 \leq \rho_j \leq 1$
- $\rho_j = 0$ if $Y_t$ is not serially correlated
The autocorrelation thus does not depend on time but on the lag. The most recent lag is the most
relevant in predicting next period's value of $Y$.
Autoregressive AR Processes
The AR(1) - autoregressive model of order 1 - is:
$Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t$
In the AR(1), $Y_t$ depends on the 1st lag of its own past values. We estimate this autoregressive model
just as we would a normal regression, but we need an extra condition.
$\hat{Y}_{T+1|T} = \hat{\beta}_0 + \hat{\beta}_1 Y_T$
The forecast error:
$e_{T+1} = Y_{T+1} - \hat{Y}_{T+1|T}$
The 95% forecast interval:
$\hat{Y}_{T+1|T} \pm 1.96 \cdot SE\left(Y_{T+1} - \hat{Y}_{T+1|T}\right)$
The AR($p$) forecast:
$\hat{Y}_{T+1|T} = \hat{\beta}_0 + \hat{\beta}_1 Y_T + \hat{\beta}_2 Y_{T-1} + \cdots + \hat{\beta}_p Y_{T-p+1}$
The more parameters we estimate, the riskier the forecast - the more likely it is that we make an
estimation error. Parsimony principle - use just enough lags to construct the best forecast, as larger
models will have larger errors in estimation of the coefficients, strongly affecting RMSFE and
forecast uncertainty.
Autoregressive Distributed Lag Model
Sometimes, we may want to consider outside predictors. For example, the Phillips curve states that
the unemployment rate is negatively related to changes in the inflation rate, so we may want to use the
unemployment rate as a predictor. This leads to the ADL model, ADL($p, q$):
$Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \cdots + \beta_p Y_{t-p} + \delta_1 X_{t-1} + \delta_2 X_{t-2} + \cdots + \delta_q X_{t-q} + u_t$
- Adjusted $R^2$: If $T$ is large and $p$ is small (often the case for financial data), the adjusted $R^2$
will not penalize the addition of extra lag terms enough, which can lead to choosing a
model that is too large
- Akaike Information Criterion (AIC):
$AIC(p) = \ln\left(\frac{SSR(p)}{T}\right) + (p+1)\frac{2}{T}$
The first term measures how close the model is to the actual data and does not
increase as you add more lags.
- Bayes Information Criterion (BIC):
$BIC(p) = \ln\left(\frac{SSR(p)}{T}\right) + (p+1)\frac{\ln T}{T}$
The BIC is similar to the AIC, but it adds a higher penalty as you add more lags. A
lower score is better.
In general, the model chosen by 𝐹/𝑡 statistics will be greater than or equal to that chosen by the
AIC, which will be greater than or equal to that chosen by the BIC.
Lecture 14 - Stationary/Nonstationary
Notes from Lecture
A time series is stationary if its probability distribution does not change over time, i.e. $Y_t$ is
stationary if the joint distribution of $(Y_{s+1}, Y_{s+2}, \ldots, Y_{s+T})$ does not depend on $s$, meaning that we
can quantify historical relationships and generalize them into the future. Variables can also be jointly
stationary.
Covariance stationary - weak stationarity
A series $Y_t$ is covariance stationary if it has a constant mean, a constant variance, and if the
covariance depends only on the time difference, i.e. $\gamma_k = cov(y_t, y_{t-k})$ only depends on $k$ (the lag) and not on
$t$ (time), so there are no changes in the autocorrelation structure.
If data is not stationary, we take the first difference of the data - can be done before or after log
transformation. If the data is still not stationary, we take further differences.
Lag Operator
The lag operator:
$L y_t = y_{t-1}, \quad L^j y_t = y_{t-j}$
The lag operator and multiplication are commutative, i.e. if we have to take the lag of a time series
multiplied by a constant, the lag operator only applies to the variable that depends on time:
$L(a y_t) = a L y_t = a y_{t-1}$
The AR($p$) model can then be written as:
$Y_t = \beta_0 + \beta_1 L Y_t + \cdots + \beta_p L^p Y_t + u_t$
$\Leftrightarrow$
$(1 - \beta_1 L - \cdots - \beta_p L^p) Y_t = \beta_0 + u_t$
Thus, the AR($p$) model is $Y_t$ multiplied by a polynomial of order $p$ in the lag operator. The model is stationary if the
roots of the characteristic polynomial $1 - \beta_1 z - \cdots - \beta_p z^p = 0$ lie outside the unit circle. It is
convenient to work with the inverse roots, as these will be inside the unit circle if the AR($p$) is stationary.
An easy way to tell if an AR(1) is stationary is to look at the coefficient on 𝑌 1 - if it is less than 1
in absolute value, the model is stationary.
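A minimal sketch of checking AR stationarity from the characteristic polynomial, using numpy to find the roots of $1 - \beta_1 z - \cdots - \beta_p z^p$; the AR(2) coefficients are assumed for illustration.

```python
import numpy as np

beta = [0.5, 0.3]                          # assumed AR(2) coefficients beta_1, beta_2
# characteristic polynomial 1 - beta_1 z - beta_2 z^2, coefficients in increasing powers of z
poly = np.array([1.0] + [-b for b in beta])
roots = np.polynomial.polynomial.polyroots(poly)

print("roots:", roots)
print("moduli:", np.abs(roots))            # all > 1  => stationary
print("inverse roots:", np.abs(1 / roots)) # all < 1  => inside the unit circle, same condition
```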
If the coefficients on the lags are not very close to 1, e.g. 0.4, then the error term will cause a lot of
randomness, which can be seen in the graph being less smooth.
If an AR($p$) model has a root that equals 1, the series is said to have a unit root. We can do a formal
test for a unit root by regressing $Y_t$ on $Y_{t-1}$ and then using the standard $t$-test for testing $\beta = 1$. The
unit root test is thus:
$H_0: \beta = 1$
$H_1: \beta < 1$
The null hypothesis represents a unit root or stochastic trend, and 𝐻1 is a stationary time series.
** remember the $t$-statistic is for $H_0: \beta = 0$, while we want to test $\beta = 1$. We circumvent this
problem by rewriting the regression in terms of the change:
$Y_t = \beta Y_{t-1} + \varepsilon_t$
$\Leftrightarrow$
$Y_t - Y_{t-1} = \beta Y_{t-1} - Y_{t-1} + \varepsilon_t$
$\Leftrightarrow$
$Y_t - Y_{t-1} = (\beta - 1) Y_{t-1} + \varepsilon_t$
$\Leftrightarrow$
$\Delta Y_t = \delta Y_{t-1} + \varepsilon_t$
Where $\delta = \beta - 1$, and we then test the null hypothesis that $\delta = 0$:
$H_0: \delta = 0$
The test-statistic follows the Dickey-Fuller distribution, not the normal t-distribution.
Augmented Dickey-Fuller (ADF) Test for Unit Root
Allows for larger autoregressive process and for unit root to be tested in any of the lags.
We should first test for a unit root in the presence of a trend. If the trend is not significant, we drop
it and re-test. The intercept should be kept whether or not it is significant, as this ensures that the
residuals sum to 0.
Unit root tests such as the ADF have low power,
which means that the test will not always reject the null hypothesis even when it is false.
Trends
A trend is a persistent long-term movement of a variable over time.
Deterministic trend: non-random function of time
Stochastic trend: random and varies over time
A series exhibiting a stochastic trend may have long periods of increases or decreases (e.g. monthly
stock prices). Thus, we often model time series using stochastic trends in economics and finance.
Non-stationarity can also be due to structural breaks, which arise from a change in the population
regression coefficients at a distinct time or from a gradual evolution of the coefficients. For
example, the breakdown of Bretton Woods produced a break in the USD/GBP exchange rate.
Breaks can cause problems with OLS regressions, as OLS estimates a relationship that holds on
average between the two periods. We can use the Chow test to test for a break date and then
continue the second model from after the break date. The test works by creating a dummy variable
$D_t(\tau) = 1$ if $t > \tau$ and 0 otherwise, where $\tau$ is the break date. We then test $H_0: \gamma_0 = \gamma_1 = \gamma_2 = 0$ (no
break) using an $F$-test.
We can use the Quandt likelihood ratio (QLR) statistic to test for an unknown break date. Let $F(\tau)$
be the Chow test statistic testing the hypothesis of no break at date $\tau$. We then calculate the $F$-statistic
from the Chow test for a range of different values of $\tau$, $0.15T \leq \tau \leq 0.85T$. The QLR is the
largest of these $F$-statistics. The QLR statistic does not follow one of our usual distributions,
meaning that we must use the simulated critical values from the table below:
Once we detect a structural break, we have evidence of 2 regimes. Only the most recent one is
relevant for a forecast, meaning that we should not let the first regime contaminate the data from the
second period. A break also indicates that the full sample is non-stationary, and using only the most
recent regime restores stationarity.
Cut the data at the point where the F statistic is last above the critical value.
Lecture 15 - ARIMA
Notes from Lecture
Auto-Regressive Integrated Moving-Average (ARIMA) is used for modeling univariate time series.
They have 3 parts: 𝐴𝑅𝐼𝑀𝐴 𝑝, 𝑑, 𝑞 - 𝑝 is the autoregressive part (the number of lags), 𝑑 is the
integrated part (the number of unit roots - the number of differences necessary to make the time
series stationary), and 𝑞 is the number of lags in the moving average part.
The unit root process can be written as:
$Y_t = Y_0 + u_t + u_{t-1} + \cdots + u_2 + u_1$
Where the shocks never die out. There is no beta in the model, because under a unit root, $\beta = 1$. There
is no correlation between the shocks $u_t$, i.e. they are IID.
If the shocks are purely random, then the future (the state we find ourselves in now as well) is also
purely random. Every shock is equally likely to impact 𝑌 .
The variance of the unit root process is:
$Var(Y_t) = Var(Y_0) + Var(u_t + u_{t-1} + \cdots + u_2 + u_1) = t\sigma_u^2$
Because the shocks are independent over time, and because from the starting point 0 we cannot predict
which shocks will arrive in which periods, there is no linear dependency between the periods' shocks,
and we therefore do not include any covariance terms.
The variance grows without bounds - as time goes on, variance becomes larger, as variance is
simply time, 𝑡, multiplied by the variance of each shock.
Moving Average (MA) models are specified somewhat similarly. The MA process of order $q$ is:
$MA(q): \quad Y_t = c_0 + u_t + \theta_1 u_{t-1} + \cdots + \theta_q u_{t-q}$
Where the influence of the shocks is limited to the past $q$ periods. The biggest effect comes from the
contemporaneous shock, and every other shock is multiplied by its coefficient $\theta_j$, which is between 0
and 1 in absolute value.
If we have MA(1), then we only care about the current shock and 1 before that (first lag shock). The
other shocks have coefficients equal to 0. This ensures stationarity in MA models - MA models are
always covariance-stationary.
We can use the lag operator:
$Y_t = c_0 + (1 + \theta_1 L) u_t$
MA models must be invertible, i.e. they can be converted into an infinite AR process. Otherwise,
shocks do not die out, and the model is not good for forecasting. MA($q$) is invertible if the roots of
$1 + \theta_1 z + \cdots + \theta_q z^q = 0$ lie outside the unit circle.
Example - MA(1)
$Y_t = u_t + \theta_1 u_{t-1}$
$\Leftrightarrow$
$Var(Y_t) = Var(u_t) + Var(\theta_1 u_{t-1}) + 2Cov(u_t, \theta_1 u_{t-1})$
$Var(Y_t) = \sigma_u^2 + \theta_1^2 \sigma_u^2$
$\Leftrightarrow$
$Var(Y_t) = \sigma_u^2\left(1 + \theta_1^2\right)$
The covariance term drops out because it is equal to 0, due to $u_t$ being independent over $t$. This is
$\gamma_0$ in the autocovariance function. The first autocovariance of $Y_t$, $\gamma_1$, will, if we assume the MA(1) process
has mean 0, be:
$\gamma_1 = E\left[(Y_t - E(Y_t))(Y_{t-1} - E(Y_{t-1}))\right]$
$\Leftrightarrow$
$\gamma_1 = E\left[(u_t + \theta_1 u_{t-1})(u_{t-1} + \theta_1 u_{t-2})\right]$
$\Leftrightarrow$
$\gamma_1 = E[u_t u_{t-1}] + \theta_1 E[u_{t-1}^2] + \theta_1 E[u_t u_{t-2}] + \theta_1^2 E[u_{t-1} u_{t-2}]$
$\Leftrightarrow$
$\gamma_1 = \theta_1 \sigma_u^2$
Thus, for an MA(q) process, the 𝑞th spike in the ACF will be significant and subsequent spikes will
be 0.
When choosing a model, it is important that the last lag is statistically significant, i.e. an MA(4) needs
to have a significant 4th lag. Lag 2 can be insignificant.
For AR processes, the autocorrelation function will slowly die out.
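A minimal sketch contrasting the two patterns: simulate an MA(1) and an AR(1) with statsmodels' ArmaProcess and compare their sample ACFs - the MA(1) autocorrelation should cut off after lag 1, while the AR(1) autocorrelation dies out slowly. The parameter values are assumptions.

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.stattools import acf

T = 2000
# ArmaProcess uses lag-polynomial conventions: AR poly (1, -phi, ...), MA poly (1, theta, ...)
ma1 = ArmaProcess(ar=[1], ma=[1, 0.6])     # Y_t = u_t + 0.6 u_{t-1}
ar1 = ArmaProcess(ar=[1, -0.6], ma=[1])    # Y_t = 0.6 Y_{t-1} + u_t

y_ma = ma1.generate_sample(nsample=T)
y_ar = ar1.generate_sample(nsample=T)

# MA(1): ACF cuts off after lag 1 (theoretical lag-1 value 0.6/(1+0.6^2) ~ 0.44)
print("MA(1) ACF:", np.round(acf(y_ma, nlags=4), 2))
# AR(1): ACF decays geometrically (0.6, 0.36, 0.22, ...)
print("AR(1) ACF:", np.round(acf(y_ar, nlags=4), 2))
```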
$ARMA(p, q): \quad Y_t = \beta_0 + \sum_{i=1}^{p}\beta_i Y_{t-i} + u_t + \sum_{j=1}^{q}\theta_j u_{t-j}$
We can identify ARMA using the ACF and PACF, and AIC and BIC.
ARIMA Modeling
Steps:
1. Conduct ADF tests and difference the series until it is stationary - this will give us $d$
2. Identify the AR and MA orders $p$ and $q$ (e.g. using the ACF, PACF, and AIC/BIC) and estimate the model
3. Check the residuals to make sure they are white noise. If the residuals are not white noise,
go back to step (2); a sketch of this workflow follows the list
The assumptions of VAR are the same as for time series regression, applied to each separate
equation. The coefficients are estimated by OLS - the estimators are consistent and have a joint
normal distribution in large samples. Statistical inference proceeds in the normal manner.
For hypothesis testing, it is possible to test joint hypotheses that involve restrictions across multiple
equations. For example, in the model above, we can test whether the correct lag length is $p$ or $p-1$
by testing whether the coefficients on $Y_{t-p}$ and $X_{t-p}$ are 0 in both equations:
$H_0: \beta_{1p} = \beta_{2p} = \gamma_{1p} = \gamma_{2p} = 0$
Iterated multiperiod forecasts: A forecasting model is used to make a forecast 1 period ahead ($T+1$) using data through
period $T$. This model is then used to make a new 1-period forecast ($T+2$) using data
through period $T$ and the forecasted value for $T+1$. This process then iterates until the
forecast is made for the desired horizon $h$.
Thus, a 2-step-ahead forecast of 1 variable depends on the forecasts of all variables in the
VAR in period $T+1$. Generally, to compute multiperiod iterated VAR forecasts $h$ periods
ahead, it is necessary to compute forecasts of all variables for all intervening periods
between $T$ and $T+h$.
Direct multiperiod forecasts: Forecasts made by using a single regression in which the dependent variable is the multiperiod-
ahead variable to be forecasted and the regressors are the predictor variables. In a direct $h$-
period-ahead forecasting regression, all predictors are lagged $h$ periods to produce the $h$-
period-ahead forecast.
For direct forecasts, the error term is serially correlated, and we therefore use HAC SEs. The
longer the horizon, the higher the degree of serial correlation.
The iterated method is the recommended procedure for 2 reasons: (1) one-period-ahead regressions
estimate coefficients more efficiently than multiperiod-ahead regressions, and (2) iterated forecasts
tend to have time paths that are less erratic across horizons since they are produced using the same
model as opposed to direct forecasts, which use a new model for every horizon.
A direct forecast is, however, preferable if you suspect that the one-period-ahead model (the AR or
VAR) is not specified correctly, or if your forecast contains many predictors (estimation error).
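A minimal sketch of iterated multiperiod VAR forecasts with statsmodels; the two simulated series and the lag order are assumptions. The `forecast` call iterates forward, producing forecasts of both variables for every intervening period up to the horizon $h$.

```python
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(9)
T, h = 300, 4
data = np.zeros((T, 2))                       # columns: Y and X
for t in range(1, T):
    data[t, 0] = 0.5 * data[t - 1, 0] + 0.2 * data[t - 1, 1] + rng.normal()
    data[t, 1] = 0.3 * data[t - 1, 1] + rng.normal()

res = VAR(data).fit(2)                        # fit a VAR(2) by OLS, equation by equation
fc = res.forecast(data[-res.k_ar:], steps=h)  # iterated h-period-ahead forecasts
print(fc)                                     # row j = forecast of (Y, X) for period T + j + 1
```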
For tests, a higher power is positive - it means that the test is more likely to reject 𝐻0 against the
alternative when the alternative is true.
The ADF was the first test developed to test for a unit root. Another test for unit roots is the DF-
GLS test, and this test has higher power. The DF-GLS test is computed in 2 steps. For a DF-GLS
test with $H_0$: $Y_t$ has a random walk trend and $H_1$: $Y_t$ is stationary around a linear time trend, the steps are:
(1) The intercept and trend are estimated by GLS, an estimation which is performed by computing 3
new variables $V_t$, $X_{1t}$ and $X_{2t}$, where $V_t = Y_t - \alpha^* Y_{t-1}$, $X_{1t} = 1 - \alpha^*$, and $X_{2t} = t - \alpha^*(t-1)$,
with $\alpha^* = 1 - 13.5/T$. Then $V_t$ is regressed against $X_{1t}$ and $X_{2t}$, i.e. OLS is used to estimate:
$V_t = \delta_0 X_{1t} + \delta_1 X_{2t} + e_t$
Using the observations $t = 1, \ldots, T$. The model has no intercept. The OLS estimators are then used
to compute the detrended series $Y_t^d = Y_t - (\hat{\delta}_0 + \hat{\delta}_1 t)$.
(2) The DF test is used to test for a unit root in $Y_t^d$ (this DF regression includes neither an intercept nor a time
trend). The number of lags in the regression is determined either by expert knowledge or by an
information criterion.
If instead $H_1$: $Y_t$ is stationary with a mean that might be nonzero but without a time trend, the
preceding steps are modified; in particular, $\alpha^* = 1 - 7/T$ and $X_{2t}$ is omitted from the regression,
so that the demeaned series is $Y_t^d = Y_t - \hat{\delta}_0$.
The critical values for the DF-GLS test are:
If the DF-GLS test statistic (the 𝑡-statistic on 𝑌 1 in the regression in the 2nd step) is less than the
critical value, then 𝐻0 is rejected.
** Unit root test statistics have nonnormal distributions, i.e. even in large samples the distribution is not normal
when the regressors are nonstationary. The nonnormal distribution of the unit root test statistic is a
consequence of this nonstationarity.
16.4 Cointegration
Cointegration: When 2 or more series have the same stochastic trend in common. In this case,
regression analysis can reveal long-term relationships among time series variables, but some new
methods are needed. Mathematically, suppose $X_t$ and $Y_t$ are $I(1)$; if, for some
coefficient $\theta$ (the cointegrating coefficient), $Y_t - \theta X_t$ is integrated of order 0, then $X_t$ and $Y_t$ are said to
be cointegrated. Computing the difference $Y_t - \theta X_t$ eliminates the common stochastic trend.
If $X_t$ and $Y_t$ are cointegrated, we do not have to take first differences to eliminate the stochastic
trend. Instead, we can compute the difference $Y_t - \theta X_t$ - this term is stationary, meaning that it can
also be used in regression analysis using a VAR. The VAR has to be augmented by including $Y_{t-1} - \theta X_{t-1}$
as an additional regressor:
$\Delta Y_t = \beta_{10} + \beta_{11}\Delta Y_{t-1} + \cdots + \beta_{1p}\Delta Y_{t-p} + \gamma_{11}\Delta X_{t-1} + \cdots + \gamma_{1p}\Delta X_{t-p} + \alpha_1\left(Y_{t-1} - \theta X_{t-1}\right) + u_{1t}$
The term 𝑌 𝜃𝑋 is also called the error correction term, and the combined model of the equations
above is the vector error correction model (VECM). In VECM, past values of the error correction
term help to predict future values of ∆𝑌 and/or ∆𝑋 .
There are 3 ways to check whether 2 variables are cointegrated, all of which should be used in
practice:
1. Use expert knowledge and economic theory
2. Graph the series and see whether they appear to have a common stochastic trend
3. Perform statistical tests for cointegration, e.g. test for a unit root in the residual $z_t = Y_t - \hat{\theta}X_t$ from the cointegrating regression (a sketch of such a test follows the list)
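For the third check, here is a minimal sketch of a residual-based (Engle-Granger style) cointegration test using statsmodels' `coint`; the two simulated $I(1)$ series sharing a common random-walk trend are assumptions.

```python
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(10)
T = 500
trend = np.cumsum(rng.normal(size=T))         # common stochastic trend
x = trend + rng.normal(size=T)
y = 2.0 * trend + rng.normal(size=T)          # cointegrated with x (theta ~ 2)

stat, pvalue, crit = coint(y, x)              # H0: no cointegration
print("EG test statistic:", round(stat, 2), "p-value:", round(pvalue, 3))
```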
The OLS estimator of the coefficient in the cointegrating regression, while consistent, has a
nonnormal distribution when $X_t$ and $Y_t$ are cointegrated. Therefore, inferences based on its $t$-
statistic can be misleading, and other estimators of the cointegrating coefficient have been
developed, e.g. the dynamic OLS (DOLS) estimator. DOLS is efficient in large samples, and
statistical inference about $\theta$ and the $\delta$'s based on HAC SEs is valid, because the estimator has a
normal distribution in large samples.
Cointegration tests can improperly reject 𝐻0 more frequently than they should, and frequently they
improperly fail to reject 𝐻0 .
If we extend to more variables, the number of cointegrating coefficients increases (it is always one less
than the number of variables).
If 2 or more variables are cointegrated, then the error correction term can help forecast these
variables and, possibly, other related variables. However, even closely related series can have
different trends for subtle reasons. If variables that are not cointegrated are incorrectly modeled
using a VECM, then the error correction term will be 𝐼 1 - this then introduces a trend into the
forecast that can result in a poor out-of-sample forecast performance.
16.5 Volatility Clustering and Autoregressive Conditional Heteroskedasticity
Volatility clustering: when a time series has some periods of low volatility and some periods of
high volatility.
The variance of an asset price is a measure of the risk of owning the asset: the larger the variance of
daily stock price changes, the more an investor stands to gain or lose on a typical day. In addition,
the value of some financial derivatives, e.g. options, depends on the variance of the underlying
asset.
Forecasting variances makes it possible to have accurate forecast intervals. If the variance of the
forecast error is constant, then an approximate forecast confidence interval can be constructed as the
forecast +/- a multiple of the SER. If, on the other hand, the variance of the forecast error changes over time,
then the width of the forecast interval should change over time as well.
Volatility clustering implies that the error exhibits time-varying heteroskedasticity.
ARCH and GARCH are 2 models of volatility clustering. ARCH is analogous to the distributed lag
model, and GARCH is analogous to an ADL model.
(1) ARCH
The ARCH model of order $p$ is:
$ARCH(p): \quad \sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \cdots + \alpha_p u_{t-p}^2$
If the $\alpha$'s are positive, then if recent squared errors have been large, the ARCH model predicts that the
current squared error will be large in magnitude, in the sense that its variance $\sigma_t^2$ is large. Above, we
have described ARCH for the ADL(1,1) model, but it can be applied to the error variance of any
time series regression model with an error that has a conditional mean of 0.
(2) GARCH
Extends the ARCH model to let 𝜎 2 depend on its own lags as well as lags of the squared error:
$GARCH(p, q): \quad \sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \cdots + \alpha_p u_{t-p}^2 + \phi_1\sigma_{t-1}^2 + \cdots + \phi_q\sigma_{t-q}^2$
As GARCH is similar to the ADL model, it can provide a more parsimonious model of the dynamic multipliers
than ARCH, i.e. it can capture slowly changing variances with fewer parameters than the ARCH
model due to its incorporation of lags of $\sigma_t^2$.
The ARCH and GARCH coefficients are estimated by maximum likelihood and are normally distributed in large samples, so in large samples $t$-
statistics have standard normal distributions, and confidence intervals can be constructed as the
maximum likelihood estimate +/- 1.96 SEs.
Assumptions:
- 𝑋 and 𝑌 are jointly stationary - no structural breaks (if structural breaks do exist, we
can estimate the model in different subperiods)
- Past and present exogeneity: $X_t$ must be uncorrelated with the error term, i.e. $X_t$
must be exogenous: $E(u_t | X_t, X_{t-1}, \ldots, X_2, X_1) = 0$. All coefficients in the
Distributed Lag model then constitute all the non-zero dynamic effects
- Large outliers are unlikely (a number of moments in the data need to be stable so that
they are not twisted)
Some add to the 2nd assumption that the error term has a mean 0 given all past, present, AND future
values of 𝑋 , meaning that we can use more efficient estimators than OLS in our estimation. This is
called strict exogeneity. Very strong assumption, rarely realistic.
Strict exogeneity implies past and present exogeneity, but past and present exogeneity does not
imply strict exogeneity.
Estimation
$Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + \cdots + \beta_{r+1} X_{t-r} + u_t$
Under the Distributed Lag model assumptions, OLS yields consistent estimators of 𝛽′𝑠 - of the
dynamic multipliers - that is, if the assumptions hold, the coefficients are not biased.
If there is no autocorrelation in the error terms, then the usual OLS standard errors are also valid. If
there is autocorrelation, the usual OLS standard errors are not consistent, meaning that we must use
autocorrelation-consistent standard errors to obtain valid inference. For this purpose, we can use
heteroskedasticity and autocorrelation consistent (HAC) standard errors (Newey-West standard errors);
a sketch follows below.
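A minimal sketch of OLS with Newey-West (HAC) standard errors in statsmodels, under assumed simulated data with autocorrelated errors; the truncation parameter follows the $0.75\,T^{1/3}$ rule of thumb mentioned earlier.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
T = 400
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):                      # AR(1) errors: serially correlated
    u[t] = 0.6 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u

m = int(np.ceil(0.75 * T ** (1 / 3)))      # rule-of-thumb truncation parameter
X = sm.add_constant(x)
res = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": m})
print(res.summary().tables[1])             # coefficients with Newey-West SEs
```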
When estimating AR, ARMA and ADL models, we do not need to use HAC SEs. In these models,
the error terms are serially uncorrelated if we have included enough lags of $Y_t$, as this ensures that
the error term cannot be predicted using past values of $Y_t$, and hence past values of $u_t$.
The $h$-period dynamic multiplier is the effect of a unit change in $X_t$ on $Y_t$ after $h$ periods. The $h$-
period cumulative dynamic multiplier is $\beta_1 + \beta_2 + \cdots + \beta_{h+1}$.
The long-run dynamic multiplier is the sum of all the individual dynamic multipliers.
The cumulative dynamic multipliers can be directly estimated by modifying the regression as
follows:
$Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + u_t$
$Y_t = \beta_0 + \beta_1 X_t - \beta_1 X_{t-1} + \beta_1 X_{t-1} + \beta_2 X_{t-1} + u_t$
$Y_t = \beta_0 + \beta_1\left(X_t - X_{t-1}\right) + \left(\beta_1 + \beta_2\right) X_{t-1} + u_t$
$Y_t = \beta_0 + \beta_1 \Delta X_t + \left(\beta_1 + \beta_2\right) X_{t-1} + u_t$
$Y_t = \delta_0 + \delta_1 \Delta X_t + \delta_2 X_{t-1} + u_t$
Where $\delta_0 = \beta_0$, $\delta_1 = \beta_1$, and $\delta_2 = \beta_1 + \beta_2$.
The HAC SEs on $\delta_1$ and $\delta_2$ are the SEs for the 2 cumulative multipliers.
In the general model with $r$ lags, $\delta_{r+1}$ is the long-run cumulative dynamic multiplier.
Granger Causality
Tests whether $X_t$ is useful in predicting $Y_t$ - different from actual causality, it is more a test of predictive content.
** Granger causality does NOT mean that $X_t$ is exogenous.
Forecasting with a VAR uses the chain algorithm, i.e. we construct the $T+1$ forecast with the help of lagged
variables that are known in period $T$, and then, armed with the $T+1$ forecasted values, we construct the
$T+2$ forecast.
The impulse response function helps us learn about the dynamic properties of vector
autoregressions of interest to forecasters. The impact is driven by the shock.
Let $r_t$ denote the return on the asset at time $t$. Return series often have low or no serial
autocorrelation (there is some dependence, though). We let $r_t$ follow a simple ARMA model and
then study the volatility in the residuals:
$r_t = \mu_t + u_t$
The ARMA($p, q$) process for the mean is then:
$\mu_t = \mu_0 + \sum_{i=1}^{p}\phi_i r_{t-i} + \sum_{j=1}^{q}\theta_j u_{t-j}$
Where $u_t = \sigma_t \varepsilon_t$. We have that $\varepsilon_t \sim WN(0, 1)$ (i.e. a white noise process with mean 0 and unit
variance), and $\sigma_t^2 = var(u_t | u_{t-1}, u_{t-2}, \ldots)$ is the conditional variance of the error (shock) given its
past. The error thus becomes heteroskedastic; the variance changes over time.
Autoregressive Conditional Heteroskedasticity (ARCH)
The ARCH model is:
$ARCH(p): \quad \sigma_t^2 = \omega + \alpha_1 u_{t-1}^2 + \cdots + \alpha_p u_{t-p}^2$
** $u_t = \sigma_t \varepsilon_t$.
𝑝 in ARCH can be different from 𝑝 in ARMA 𝑝, 𝑞 .
We will typically assume that the distribution of $\varepsilon_t$ is normal, because it is what drives the shocks - $\varepsilon_t$ is a
sequence of IID random variables with mean 0 and variance 1 - but its possible distributions include the
standard normal, the standardized Student-t, and their skewed counterparts.
A property of ARCH is that volatility does not diverge to infinity (it varies within some fixed
range), i.e. volatility is stationary.
$ARCH(1): \quad \sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2$
Since $u_t$ is stationary, we have that $var(u_t) = \alpha_0 + \alpha_1 var(u_{t-1})$. Then the unconditional variance
is:
$var(u_t) = \frac{\alpha_0}{1 - \alpha_1}$
Due to variance clustering, we build a volatility model in the following steps (a sketch of the estimation step follows the GARCH discussion below):
1. Specify a mean equation for the return series (e.g. an ARMA model) to remove linear dependence
2. Use the residuals of the mean equation to investigate the ARCH effects (ACF and PACF of
squared residuals) to see if the volatility is changing over time
3. Specify a volatility model if ARCH effects are present, and perform a joint estimation of the
mean and volatility equations
Generalized Autoregressive Conditional Heteroskedasticity (GARCH)
Since ARCH gives us the past values of the conditional variance, and we know that there is
autocorrelation between them, GARCH includes these past values in the model:
$GARCH(p, q): \quad \sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \cdots + \alpha_p u_{t-p}^2 + \lambda_1\sigma_{t-1}^2 + \cdots + \lambda_q\sigma_{t-q}^2$
Often, GARCH is more appropriate than high order ARCH models. Typically, we only need a low
order GARCH model - we should stick to GARCH(1,1), GARCH(2,1) and GARCH(1,2).
One measure of the persistence (a high level of state dependency, creating smooth volatility clusters)
in the variance is the sum of the coefficients on $u_{t-1}^2$ and $\sigma_{t-1}^2$ in the GARCH model. If this sum is large, e.g. 0.96,
it indicates that changes in the conditional variance are persistent.
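A minimal sketch of estimating a GARCH(1,1) with the `arch` package (assuming it is installed); the simulated return series is an assumption, and the mean equation here is just a constant.

```python
import numpy as np
from arch import arch_model

rng = np.random.default_rng(12)
T = 2000
# Simulate a GARCH(1,1): sigma_t^2 = 0.1 + 0.1*u_{t-1}^2 + 0.85*sigma_{t-1}^2
u = np.zeros(T)
sigma2 = np.full(T, 0.1 / (1 - 0.1 - 0.85))
for t in range(1, T):
    sigma2[t] = 0.1 + 0.1 * u[t - 1] ** 2 + 0.85 * sigma2[t - 1]
    u[t] = np.sqrt(sigma2[t]) * rng.normal()

# Constant mean + GARCH(1,1) variance, estimated by maximum likelihood
am = arch_model(u, mean="Constant", vol="GARCH", p=1, q=1, dist="normal")
res = am.fit(disp="off")
print(res.params)                                 # omega, alpha[1], beta[1]
print("persistence:", res.params["alpha[1]"] + res.params["beta[1]"])
```

The estimated persistence (the sum of the ARCH and GARCH coefficients) should be close to the assumed 0.95.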
The fitted volatility bands of the estimated conditional SDs track the observed heteroskedasticity in
the series of monthly returns. This is useful for quantifying the time-varying volatility and the
resulting risk for investors holding stocks. Furthermore, the GARCH model may also be used to
produce forecast intervals whose widths depend on the volatility of the most recent periods - here
the bands show this.
5-A
6-B
7 - B: $E(Y) = 2.4 + 0.6E(Y) - 0.07E(Y) + 0.1E(Y) \Rightarrow (1 - 0.6 + 0.07 - 0.1)E(Y) = 2.4 \Rightarrow E(Y) = \frac{2.4}{0.37} \approx 6.5$
10 - C
Assignment
Do not correct SEs or control for contemporaneous values of the variables in ADL, because we
assume that autocorrelation is extracted from the model when we use AR, ARMA, or ADL. This
correction becomes important if we use DL.
Remember to plot the variables to determine whether to control for a trend or a drift in the DF unit root
test.