Econometrics Notes


Applied Econometrics - Notes

Lecture 1 - Introduction to Econometrics


Chapter 1 - Economic Questions and Data
Econometrics is the science and art of using economic theory and statistical techniques to analyze
economic data.
1.1 Economic Questions We Examine
This book is concerned with 4 questions, which will be discussed in general terms, and we will
cover the econometric approach to answering them:
1. Does reducing class size improve elementary school education?
2. Is there racial discrimination in the market for home loans?
3. How much do cigarette taxes reduce smoking?
4. By how much will U.S. GDP grow next year?
1.2 Causal Effects and Idealized Experiments
The first 3 questions concern causal relationships among variables. Causality means that a specific
action leads to a specific, measurable consequence/outcome. The causal effect can be measured by
conducting an experiment; for example, one can conduct a randomized controlled experiment,
where one has a control group and a treatment group, and systematic relationships between variables
are eliminated by randomization.
Causal effect: the effect on an outcome of a given action or treatment as measured in an ideal
randomized controlled experiment, where the only systematic difference between the control and
treatment group is the treatment itself.
However, experiments are relatively rare in econometrics, because they are often unethical,
impossible to execute satisfactorily, or prohibitively expensive. They still serve as the theoretical
benchmark.
The 4th question concerns forecasting, which does not always involve causal relationships.
1.3 Data: Sources and Types
Data come from either experiments or nonexperimental observations of the world. Experimental
data come from experiments designed to evaluate a treatment or policy or to investigate a causal
effect. However, as real-world experiments with humans are difficult to control and administer and
can be expensive and unethical, most economic data are obtained by observing real-world behavior.
Observational data are obtained outside an experimental setting (e.g. surveys, administrative
records, etc.). The use of observational data makes it more difficult to draw conclusions about causal effects,
because randomization is lacking, i.e. it is difficult to sort out the effect of the treatment from other
relevant factors.
Data sets come in 3 main types:
1. Cross-sectional data - data on different entities for a single period of time, allowing us to
learn about relationships among variables by studying differences across people, firms, or
other entities during a single period of time.

2. Time series data - data for a single entity collected at multiple time periods, which can be
used to study the evolution of variables over time and to forecast future values of those
variables.

3. Panel data (longitudinal data) - data for multiple entities in which each entity is observed at
2 or more time periods, and it is used to learn about economic relationships from the
experiences of the many different entities in the data set and from the evolution over time of
the variables for each entity.

Notes from Lecture


www.econometrics-with-r.org/index.html
Econometrics is the field of economics in which statistical methods are developed and applied to
estimate economic relationships. Typically, we are concerned with whether a change in one
variable, X, causes a change in another variable, Y.
Classical econometric modeling: economic theory → mathematical model of the theory → econometric model of the theory → data → estimation of the econometric model → hypothesis testing → forecasting or prediction → using the model for control or policy purposes.
Always check the quality of your dataset before initiating the analysis. Outliers are unusual in the
context of the model, but it is a subjective term. Computers cannot appropriately determine that
observations are outliers, we have to do that ourselves. Models may be merely inadequate; the
outlier can be a consequence of the specification, i.e. we create outliers ourselves - we cannot
remove points from our data if the observation is true.
Correlation does not imply causation.
Definitions
mean: x̄ = (1/n) Σᵢ xᵢ
Variance tells us how much the variable deviates from the mean. The larger the variance, the larger
the fluctuations. Cannot be negative due to the square.

variance: σ²ₓ = E[(x − x̄)²] = (1/n) Σᵢ (xᵢ − x̄)² ≥ 0
If the covariance is large in absolute value, the two variables are related; if it is close to 0, they are not
very related from a statistical point of view. The covariance does not allow us to make a judgement about how
precisely the variables are related.

covariance: σₓ,ᵧ = E[(x − x̄)(y − ȳ)]

The correlation between the variables is a standardized measure of how related the variables are. It
is always between -1 and 1, allowing for comparisons. The closer to 0, the less related the variables
are.
correlation: ρₓ,ᵧ = σₓ,ᵧ / (σₓ σᵧ) = Cov(x, y) / √(Var(x) Var(y))
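A minimal R sketch of these definitions on simulated data (the numbers and variable names are made up for illustration). Note that R's built-in var(), cov() and cor() divide by n − 1 rather than n, which does not change the interpretation.

# Sample mean, variance, covariance and correlation computed by hand
# and compared with R's built-in functions.
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
y <- 0.5 * x + rnorm(100)

n      <- length(x)
mean_x <- sum(x) / n                                   # x-bar
var_x  <- sum((x - mean_x)^2) / (n - 1)                # never negative
cov_xy <- sum((x - mean_x) * (y - mean(y))) / (n - 1)
cor_xy <- cov_xy / sqrt(var_x * var(y))                # always between -1 and 1

c(mean_x, var_x, cov_xy, cor_xy)
c(mean(x), var(x), cov(x, y), cor(x, y))               # built-ins give the same values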

In simple linear regression models, we study the following equation:


Y = β₀ + β₁X + u
Where X is the independent variable, also called the right-hand side variable, explanatory variable, regressor,
covariate, or control variable.

Lecture 2 - Regression Analysis: Basic Ideas and Estimations


Chapter 4 - Linear Regression with One Regressor
The general equation for a linear regression is:
Yᵢ = β₀ + β₁Xᵢ + uᵢ
Where 𝛽0 is the intercept, 𝛽1 is the slope (the change in Y associated with a unit change in X), and
𝑢 is the error term, which incorporates all of the factors responsible for the difference between the
observed and predicted value. Y is the dependent variable, and X is the independent
variable/regressor.
Estimating the Coefficients of the Linear Regression Model
When creating the regression line, the most common approach is the ordinary least squares (OLS)
estimator, which chooses the regression coefficients so that the estimated regression line is as close
as possible to the observed data, i.e. total squared estimation mistakes are minimized.

Sum of squared mistakes: Σᵢ₌₁ⁿ (Yᵢ − b₀ − b₁Xᵢ)²

The OLS regression line can be written as:

Ŷᵢ = β̂₀ + β̂₁Xᵢ

β̂₁ = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = s_XY / s²_X

β̂₀ = Ȳ − β̂₁X̄

ûᵢ = Yᵢ − Ŷᵢ
OLS estimator is unbiased and consistent.
Measures of Fit
𝑅2 and the standard error of the regression measure how well the OLS line fits the data:
𝑅2 is the fraction of sample variance of 𝑌 explained by 𝑋 , and it is written as the ratio of the
explained sum of squares (ESS) to the total sum of squares (TSS):
R² = ESS / TSS = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² / Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²

Or, using the sum of squared residuals, SSR = Σᵢ₌₁ⁿ ûᵢ²:

R² = 1 − SSR / TSS

The standard error of the regression (SER) is an estimator of the standard deviation of the
regression error, 𝑢 . The SER is a measure of the spread of the observations around the regression
line, measured in the units of the dependent variable, Y.

SER = s_û = √(s²_û), where s²_û = SSR / (n − 2)
The divisor is 𝑛 2 because it corrects for a slight downward bias introduced because two
regression coefficients were estimated (degree of freedom correction).
The Least Squares Assumptions
1. The conditional distribution of uᵢ, given Xᵢ, has a mean of 0 - this means that the two variables
have 0 covariance and are thus unrelated. It asserts that the other factors contained in uᵢ are
unrelated to Xᵢ in the sense that, given a value of Xᵢ, the mean of the distribution of these
other factors is 0. This means that while some factors have a positive effect on the
dependent variable, others have a negative effect, and these effects average out.

E(uᵢ | Xᵢ) = 0
Corr(Xᵢ, uᵢ) = 0

Equivalent to assuming that the population regression line is the conditional mean of 𝑌
given 𝑋 .

2. (Xᵢ, Yᵢ), i = 1, …, n, are independently and identically distributed across observations - this


is the case when observations are drawn by simple random sampling from a single large
population.
NB: time series data violates this assumption

3. Large outliers are unlikely - 𝑋 and 𝑌 have nonzero finite fourth moments (finite kurtosis).

0 < E(Xᵢ⁴) < ∞ and 0 < E(Yᵢ⁴) < ∞

Justifies large-sample approximations to the distributions of the OLS test statistics.


Sampling Distributions of the OLS Estimators
The sampling distributions of the OLS estimators are approximately normal in large samples due to
the central limit theorem.
E(Ȳ) = μ_Y

E(β̂₀) = β₀ and E(β̂₁) = β₁

Thus, 𝛽0 and 𝛽1 are unbiased estimators, and if the sample is large enough, their distribution is well
approximated by the bivariate normal distribution, implying that the marginal distributions are
normal in large samples. When the sample size is large, the OLS estimators will also be close to the
true population coefficients with a high probability, because the variances of the estimators decrease
to 0 as n increases.

Notes from Lecture


Suppose we have 2 vectors, x and y, that contain different sets of values. If y is a constant, then the
covariance between x and y is 0, Cov(x, y) = 0, as there is no relationship at all.
If we take Cov(βx, y) when x and y are both variable, then Cov(βx, y) = β·Cov(x, y), meaning that if
we have a constant multiplied by a variable, we can isolate that term and multiply the covariance of
the variables by the constant.
The covariance of a variable with itself, e.g. Cov(x, x), is equal to the variance of that variable, so
Cov(x, x) = Var(x).
Linear Regression
Y = β₀ + β₁X + u
Where 𝛽1 is the slope, 𝛽0 is the intercept, and 𝑢 is an error term.
Assumptions:
- On average, there will be no error, i.e. E(u) = 0.
- If the assumption above holds, then X and u should not be related at all, E(u|X) = 0, because
otherwise we would be doing something conceptually wrong.
- Thus, the expected error given the value of X is assumed to be 0, which means we can also
assume that E(Y|X) = E(β₀ + β₁X) = β₀ + β₁X.
We have just confirmed our own model.
OLS is the approach that minimizes Σᵢ₌₁ⁿ ûᵢ² with respect to β₀ and β₁, which can be done by rewriting
our error term to something that contains the betas: min Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)². We know that Ŷᵢ can be
replaced, and therefore, it can also be written as: min Σᵢ₌₁ⁿ (Yᵢ − β₀ − β₁Xᵢ)². This is the residual
sum of squares, RSS.
To derive β̂₀, we remember that:

d(Σᵢ ûᵢ²) / dβ̂₀ = 0

Thus, we take the first derivative:

d(Σᵢ ûᵢ²) / dβ̂₀ = −2 Σᵢ (Yᵢ − β̂₀ − β̂₁Xᵢ) = 0

The −2 can be deleted, because 0 / (−2) = 0:

Σᵢ (Yᵢ − β̂₀ − β̂₁Xᵢ) = 0

This is rewritten as:

Σᵢ Yᵢ − Σᵢ β̂₀ − β̂₁ Σᵢ Xᵢ = 0

However, Σᵢ β̂₀ is the sum of the same constant over all N observations, and it can therefore be
written as N·β̂₀ (because of the constant-multiplied-by-a-variable rule discussed in the beginning of
the lecture). The same goes for the β̂₁ term:

Σᵢ Yᵢ − N·β̂₀ − β̂₁ Σᵢ Xᵢ = 0

We isolate β̂₀:

N·β̂₀ = Σᵢ Yᵢ − β̂₁ Σᵢ Xᵢ

We divide everything by N to simplify:

β̂₀ = (Σᵢ Yᵢ)/N − β̂₁ (Σᵢ Xᵢ)/N

β̂₀ = Ȳ − β̂₁X̄

Now we have derived the equation for β̂₀. It is given by the average of Y less β̂₁ times the average
of X.
In order to find 𝛽1 ,we take the first derivative of RSS with respect to 𝛽1 :
d(Σᵢ ûᵢ²) / dβ̂₁ = 0
DO THIS YOURSELF

Another way of deriving β̂₁ is by realizing that the second assumption also means that Cov(X, u) = 0.
We can rewrite u as:
Y = β₀ + β₁X + u → u = Y − β₀ − β₁X
We can now substitute this in the covariance:
Cov(X, u) = Cov(X, Y − β₀ − β₁X) = 0
This is rewritten as:
Cov(X, Y) − Cov(X, β₀) − Cov(X, β₁X) = 0
We can reduce this further by realizing that the second term is the covariance of a constant and a
variable, meaning that is must be 0, and the third term is a constant multiplied by a variable,
therefore, we can move 𝛽1 outside, leaving the third term to be the covariance of x,x, which equals
the variance of the same variable. Thus:
Cov(X, Y) − 0 − β₁·Var(X) = 0
We isolate 𝛽1 :
β̂₁ = Cov(X, Y) / Var(X)
We have now derived the equation for 𝛽1 .
We can substitute this into the equation for 𝛽0 :

β̂₀ = Ȳ − β̂₁X̄

β̂₀ = Ȳ − [Cov(X, Y) / Var(X)]·X̄
From this, there are a few important relationships:
if Cov(X, Y) > 0, then β₁ > 0
if Cov(X, Y) < 0, then β₁ < 0
The covariance (and correlation) is not enough to determine the relationship between 2 variables,
because as shown by the equation for beta 1, the variance also needs to be included. If we do not
include the variance, we can see the direction of the relationship, but we cannot quantify it.
Var(X) > 0
There needs to be variability in X for us to get important results.
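A quick numerical check in R (simulated data, my own example) that the formulas just derived reproduce what lm() computes:

# beta1-hat = Cov(X,Y)/Var(X) and beta0-hat = Y-bar - beta1-hat * X-bar
set.seed(2)
X <- rnorm(200)
Y <- 1 + 2 * X + rnorm(200)

beta1_hat <- cov(X, Y) / var(X)
beta0_hat <- mean(Y) - beta1_hat * mean(X)

c(beta0_hat, beta1_hat)
coef(lm(Y ~ X))          # same two numbers as the by-hand formulas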
Properties of OLS
1. Average of ûᵢ = 0
ûᵢ = Yᵢ − β̂₀ − β̂₁Xᵢ
E(û) = 0
2. Cov(x, û) = 0

3. E(β̂₁) = β₁. OLS is unbiased, i.e. the expected value of the estimator is what we expect to
see from the theory.

The variance of 𝛽1 :
Var(β̂₁) = σ²_u / Σᵢ (Xᵢ − X̄)² = Var(u) / Σᵢ (Xᵢ − X̄)²

Thus, a few things determine the variance of the beta: (1) the variance of the error, i.e. the variation
of the error, and (2) the variance of X (the higher the variance of X, the larger is the richness of
data, meaning that beta will have a lower variance).

If we make a lot of errors, then our estimate of 𝛽1 will be less precise.


σ̂²_u = (1/(n − 2)) Σᵢ ûᵢ² = SSR / (n − 2)

Lecture 3 - Hypothesis Tests, Confidence Intervals, and Inference


Chapter 5 - Regression with a Single Regressor
Hypothesis testing is concerned with testing whether a null hypothesis about a regression coefficient can be rejected.
If testing a hypothesis about the true slope of the regression, β₁, the hypotheses can be written as:
H₀: β₁ = β₁,₀
H₁: β₁ ≠ β₁,₀

Where 𝛽1,0 is the value of 𝛽1 under the null hypothesis, 𝐻0 .

The approach to two-sided hypothesis testing has 3 steps:

(1) compute the standard error of the estimator, e.g. 𝑆𝐸 𝛽1 .


SE(β̂₁) = √(σ̂²_β̂₁)

σ̂²_β̂₁ = (1/n) · [ (1/(n − 2)) Σᵢ (Xᵢ − X̄)² ûᵢ² ] / [ (1/n) Σᵢ (Xᵢ − X̄)² ]²
(2) Compute the t-statistic
The t-statistic has the form:

t = (estimator − hypothesized value) / standard error of the estimator

e.g.

t = (β̂₁ − β₁,₀) / SE(β̂₁)
(3) Compute the p-value, which is the smallest significance level at which the null hypothesis could
be rejected based on the test statistic actually observed. That is, the p-value is the probability of
obtaining a statistic, by random sampling variation, at least as different from the null hypothesis as
is the statistic observed, assuming that the null hypothesis is correct:
p-value = 2Φ(−|t^act|)
Where t^act is the t-statistic computed in (2), and Φ is the cumulative standard normal distribution
tabulated in Appendix Table 1. Thus:
p-value = Pr_H₀(|t| > |t^act|)

A p-value of less than 5% provides evidence against the 𝐻0 , as this shows that the probability of
obtaining a value of 𝛽1 at least as far from the null as that actually observed is less than 5%.
Alternatively, one can simply compare the t-statistic to the critical value appropriate for the test
with the desired significance level. For example, a two-sided test with a 5% significance level
would reject the null hypothesis if |t^act| > 1.96.

We can also perform a one-sided test, where the alternative hypothesis, H₁, only allows for β₁ to be
either larger or smaller than β₁,₀. In this example, the hypotheses will be:
H₀: β₁ = β₁,₀
H₁: β₁ < β₁,₀

The only difference between a one- and two-sided test is how we interpret the t-statistic. In a one-
sided test, H₀ is rejected against this one-sided alternative for large negative, but not large positive,
values of the t-statistic. Thus, instead of rejecting H₀ if |t^act| > 1.96 as in the two-sided test, we
now reject if t^act < −1.64. That is:
p-value = Φ(t^act)
Where we use the left-tail probability if testing for 𝛽1 being smaller than 𝛽1,0 . If testing for it being
greater, we use the right-tail probability.
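A small R sketch of the 3 steps on simulated data. For simplicity it uses the homoskedasticity-only standard error reported by summary(); with real data one would normally use a heteroskedasticity-robust SE instead.

set.seed(3)
x <- rnorm(100)
y <- 0.3 * x + rnorm(100)
fit <- summary(lm(y ~ x))

b1    <- fit$coefficients["x", "Estimate"]
se_b1 <- fit$coefficients["x", "Std. Error"]      # step (1): standard error
t_act <- (b1 - 0) / se_b1                         # step (2): t-statistic for H0: beta1 = 0
p_val <- 2 * pnorm(-abs(t_act))                   # step (3): two-sided large-sample p-value
c(t_act, p_val, abs(t_act) > 1.96)                # reject at the 5% level if |t| > 1.96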
5.2 Confidence Intervals for a Regression Coefficient
A 95% confidence interval is the set of values that cannot be rejected using a two-sided hypothesis
test with a 5% significance level, and an interval that has a 95% probability of containing the true
value of the coefficient. A hypothesis test with a 5% significance level will, by definition, reject the
true value of a coefficient in only 5% of all possible samples; that is, in 95% of all possible samples,
the true value of a coefficient will not be rejected. Because the 95% confidence interval (as defined
in the first definition) is the set of all values of the coefficient that are not rejected at the 5%
significance level, it follows that the true value of the coefficient will be contained in the confidence
interval in 95% of all possible samples.
Continuing the example with 𝛽1 :

95% confidence interval for β₁ = [β̂₁ − 1.96·SE(β̂₁), β̂₁ + 1.96·SE(β̂₁)]

This confidence interval can be used to create a 95% confidence interval for the predicted effect of
a general change in X. If X changes by Δx, then the predicted change in Y is β̂₁Δx, and the 95%
confidence interval for the predicted effect of Δx is (β̂₁ ± 1.96·SE(β̂₁))·Δx.

5.3 Regression When X is a Binary Variable


Regression analysis can also be used when the regressor is binary, i.e. when it takes on only two
values, 0 or 1. A binary variable is also called an indicator or dummy variable. While the mechanics
are the same, the interpretation of 𝛽1 is different, as regression with a binary variable is equivalent
to performing a differences of means analysis (section 3.4).
Example: Variable 𝐷 is binary and depends on the student-teacher ratio:

Dᵢ = 1 if the student-teacher ratio in the i-th district < 20
Dᵢ = 0 if the student-teacher ratio in the i-th district ≥ 20

The population regression model with Dᵢ is then:
Yᵢ = β₀ + β₁Dᵢ + uᵢ
As Dᵢ is not continuous, it is not useful to think of β₁ as a slope, because there is no line; it is
simply the coefficient on Dᵢ. It is simplest to interpret the two betas by looking at the 2 cases
separately:
if Dᵢ = 0:  Yᵢ = β₀ + uᵢ
Thus, when Dᵢ = 0, β₀ is the population mean value of test scores when the student-teacher ratio is
high.
if Dᵢ = 1:  Yᵢ = β₀ + β₁ + uᵢ
So, when the student-teacher ratio is low, β₀ + β₁ is the population mean value of test scores. This
also means that β₁ is the difference between the conditional expectation of Yᵢ when Dᵢ = 1 and
when Dᵢ = 0, or β₁ = E(Yᵢ | Dᵢ = 1) − E(Yᵢ | Dᵢ = 0).

A hypothesis test can be performed in the same way as described earlier. Also, a confidence interval
can be constructed, which would provide the interval for the difference between the 2 population
means.
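A short R illustration (simulated test-score data with hypothetical numbers; D = 1 stands for the low-ratio group here) that regression on a binary variable reproduces the difference in group means:

set.seed(4)
D <- rbinom(500, 1, 0.5)
Y <- 650 + 7 * D + rnorm(500, sd = 20)

coef(lm(Y ~ D))                                        # intercept = mean for D = 0, slope = difference
c(mean(Y[D == 0]), mean(Y[D == 1]) - mean(Y[D == 0]))  # the same two numbers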
5.4 Heteroskedasticity and Homoskedasticity
The error term 𝑢 is homoskedastic if the variance of the conditional distribution of 𝑢 given 𝑋 ,
𝑉𝑎𝑟 𝑢 |𝑋 𝑥 , is constant for 𝑖 1, … , 𝑛 and in particular does not depend on 𝑥. Otherwise, the
error term is heteroskedastic. In simpler terms, if the distribution of 𝑢 changes shape, i.e. the spread
increases, for different values of 𝑥, 𝑢 is heteroskedastic. If not, it is homoskedastic.

Implications of homoskedasticity:
- The OLS estimator is unbiased, consistent and asymptotically normal whether the errors are
homoskedastic or heteroskedastic.

- The OLS estimators 𝛽0 and 𝛽1 are efficient among all estimators that are linear in 𝑌1 , … . , 𝑌
and are unbiased, conditional on 𝑋1 , … , 𝑋 . Gauss-Markov theorem.

- Homoskedasticity-only SE:

σ̂²_β̂₁ = s²_û / Σᵢ (Xᵢ − X̄)²

If the regressor is binary, then the pooled variance formula for the difference in means can be
used. However, the general formulas provided earlier are heteroskedasticity-robust standard
errors, so they can be used in both cases.

5.5 The Theoretical Foundations of OLS


Gauss-Markov theorem: Under a set of conditions (the Gauss-Markov conditions), if the OLS
assumptions hold and the error is homoskedastic, then the OLS estimator has the smallest variance,
conditional on X₁, …, Xₙ, among all estimators in the class of linear conditionally unbiased
estimators. Thus, the OLS estimator is the best linear conditionally unbiased estimator (BLUE).

*** The rest of the chapter was optional, read if brought up in class.

Notes from Lecture


(a + b)² = a² + 2ab + b²
When determining the goodness of fit, the idea is to compare the observed data to the fitted values, i.e.
Yᵢ = Ŷᵢ + ûᵢ. The total sum of squares (TSS) is equal to the squared deviation of Y from its average:

TSS = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²

We want this to resemble Yᵢ = Ŷᵢ + ûᵢ more:

TSS = Σᵢ (Ŷᵢ + ûᵢ − Ȳ)²

TSS = Σᵢ ûᵢ² + 2 Σᵢ ûᵢ(Ŷᵢ − Ȳ) + Σᵢ (Ŷᵢ − Ȳ)²

As ûᵢ is assumed to have mean 0 and not to be related to Ŷᵢ, this can be simplified:

TSS = Σᵢ ûᵢ² + Σᵢ (Ŷᵢ − Ȳ)²

The first term is SSR and the second term is the explained sum of squares (ESS):
TSS = SSR + ESS
We divide both sides by TSS:
TSS/TSS = SSR/TSS + ESS/TSS = 1
This is simplified, knowing that R² = ESS/TSS:

1 − R² = SSR/TSS

R² = 1 − SSR/TSS

Because both of these expressions hold, a property of R² is:
0 ≤ R² ≤ 1
The squared correlation tells us how well our data fit the regression, but a larger R² is not
necessarily better - it is just one indicator. The satisfactory level of the squared correlation depends on a
lot of factors, e.g. with micro data 0.3 is often very satisfactory, while for macro data, R² is expected
to be above 0.9 for it to be satisfactory.
Hypothesis Testing
Type I error: Reject 𝐻0 when it is true
Type II error: Not rejecting 𝐻0 when it is false
The test statistic for the mean can also be written as:

t = (Ȳ − μ_Y,0) / (SD(Y)/√n)

Lecture 4 - Multiple Regression Analysis I


Chapter 6 - Linear Regression with Multiple Regressors
Omitted variable bias (OVB): occurs when 2 conditions are met - (1) when the omitted variable is
correlated with the included regressor and (2) when the omitted variable is a determinant of the
dependent variable.
The first condition must hold because this is what makes the regressor pick up the effect of the
omitted variable.
OVB means that the first OLS assumption, E(uᵢ | Xᵢ) = 0, does not hold. uᵢ in a linear regression
with a single regressor represents all factors (other than Xᵢ) that are determinants of Yᵢ. If 1 of these
factors is correlated with Xᵢ, this means that uᵢ (which contains the factor) is correlated with Xᵢ.
Thus, OVB makes the OLS estimator biased.
Whether the bias is large or small depends on the correlation 𝜌 between the regressor and the
error term - the larger the correlation is in absolute value, the larger the bias. The direction of the
bias depends on whether the correlation is positive or negative.
6.2 The Multiple Regression Model
A multiple regression model allows us to estimate the effect on 𝑌 of changing 1 variable 𝑋1
while holding the other regressors 𝑋2 , 𝑋3 and so forth) constant. The population regression
function is then:
E(Yᵢ | X₁ᵢ = x₁, X₂ᵢ = x₂) = β₀ + β₁x₁ + β₂x₂
β₁ is the partial effect on Y of X₁, i.e. it can here be interpreted as the effect of a unit change in X₁,
holding X₂ constant/controlling for X₂:

β₁ = ΔY/ΔX₁, holding X₂ constant
One or more of the independent variables in multiple regression models are sometimes referred to
as control variables.
𝛽0 can be referred to as the constant term.
For a multiple regression model, the error term uᵢ is homoskedastic if the variance of the
conditional distribution of uᵢ given the regressors is constant and does not depend on the values of
the regressors, i.e. var(uᵢ | X₁ᵢ, …, Xₖᵢ) is constant for i = 1, …, n. Otherwise, the error term is
heteroskedastic.
6.3 The OLS Estimator in Multiple Regression
The OLS regression line is a straight line constructed using the OLS estimators.
6.4 Measures of Fit in Multiple Regression
3 measures of fit:
1. Standard error of regression- Estimates the SD of 𝑢 , i.e. it is a measure of the spread of the
distribution of 𝑌 around the regression line:

SER = s_û = √(s²_û) = √(SSR / (n − k − 1))

The denominator is a degrees of freedom adjustment, which is negligible when 𝑛 is large.

2. 𝑅 2 - The fraction of the sample variance of 𝑌 explained by (or predicted by) the regressors:

R² = ESS/TSS = 1 − SSR/TSS

𝑅 2 increases whenever a regressor is added (as SSR decreases), unless the estimated
coefficient on the added regressor is exactly 0.

3. 𝐴𝑑𝑗𝑢𝑠𝑡𝑒𝑑 𝑅2 - Corrects 𝑅2 for increasing when adding a variable that does not improve the
fit of the model.

R̄² = 1 − [(n − 1)/(n − k − 1)]·(SSR/TSS) = 1 − s²_û/s²_Y

The factor (n − 1)/(n − k − 1) is always greater than 1, meaning that the adjusted R² is always lower than
R². Adding a regressor to the model has 2 opposite effects - SSR decreases, but the factor
increases.

The adjusted 𝑅2 can be negative when the regressors reduce the sum of squared residuals by
such a small amount that this reduction fails to offset the factor.
Determining whether to include a variable or not should not be made based on maximizing 𝑅2 or
𝑅2 - it should be based on whether including the variable allows us to better estimate the causal
effect of interest.
6.5 The Least Squares Assumptions in Multiple Regression
There are 4 OLS assumptions:
1. The conditional distribution of 𝑢 given 𝑋1 , 𝑋2 , … , 𝑋 has a mean of 0.

2. (X₁ᵢ, X₂ᵢ, …, Xₖᵢ, Yᵢ), i = 1, …, n, are IID - independently and identically distributed random
variables. This only holds if data are collected by simple random sampling.
3. Large outliers are unlikely - the OLS estimators can be sensitive to large outliers. Therefore,
we assume that 𝑋1 , … , 𝑋 and 𝑌 have nonzero finite fourth moments, i.e. the dependent
variable and regressors have finite kurtosis.

4. No perfect multicollinearity - perfect multicollinearity is when one of the regressors is a


perfect linear function of the other regressors, and this is a problem, because we cannot keep
other variables constant if they are dependent on the one we are changing.

6.6 The Distribution of the OLS Estimators in Multiple Regression


In large samples, the distribution of the OLS estimators in multiple regression is approximately
jointly normal.
In general, the OLS estimators are correlated - this correlation arises from the correlation between
the regressors.
6.7 Multicollinearity
Imperfect multicollinearity arises when one of the regressors is very highly correlated but not
perfectly correlated with the other regressors. Unlike perfect multicollinearity, imperfect
multicollinearity does not prevent estimation of the regression, nor does it imply a logical problem
with the choice of regressors. However, it does mean that one or more regression coefficients will
be estimated imprecisely, i.e. they will have a larger sampling variance.
When the regression includes an intercept, one of the regressors that can be implicated in
perfect multicollinearity is the constant regressor X₀ᵢ, which equals 1 for all observations.
Perfect multicollinearity is a statement about the data set you have on hand.

Dummy variable trap: If there are 𝐺 binary variables and each observation falls into 1 and only 1, if
there is an intercept in the regression, and if all 𝐺 binary variables are included as regressors, then
the regression will fail because of perfect multicollinearity.

We solve this problem by excluding one of the regressors, i.e. only 𝐺 1 of the 𝐺 binary variables
are included - in this case, the coefficients on the included binary variables represent the
incremental effect of being in that category relative to the case of the omitted category, holding
constant the other regressors. We can also include all 𝐺 regressors and exclude the intercept.
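A small R sketch of the dummy variable trap with G = 3 simulated categories. R detects the perfect multicollinearity and reports NA for one of the dummies rather than refusing to run:

set.seed(5)
g  <- factor(sample(c("A", "B", "C"), 300, replace = TRUE))
y  <- 2 + 1 * (g == "B") + 3 * (g == "C") + rnorm(300)
dA <- as.numeric(g == "A")
dB <- as.numeric(g == "B")
dC <- as.numeric(g == "C")

coef(lm(y ~ dA + dB + dC))       # all G dummies + intercept: one coefficient is NA (the trap)
coef(lm(y ~ dB + dC))            # G - 1 dummies: effects relative to the omitted category A
coef(lm(y ~ 0 + dA + dB + dC))   # alternative: all G dummies but no intercept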

Chapter 7 - Hypothesis Tests and Confidence Intervals in Multiple Regression


Under the OLS assumption, the law of large numbers implies that sample averages converge to their
population counterparts.
7.1 Hypothesis Tests and Confidence Intervals for a Single Coefficient
If conducting a hypothesis test for a single coefficient being equal to 0 (i.e. it has no effect on Y),
the two-sided hypotheses would be
H₀: βⱼ = 0
H₁: βⱼ ≠ 0

The 2nd step is to calculate the t-statistic, t = (β̂ⱼ − 0)/SE(β̂ⱼ), and the 3rd step is to compute the p-value of
the test or to compare the t-statistic to the critical value corresponding to the desired level of
significance. The p-value is the smallest significance level at which we can reject H₀; thus, if the p-
value is less than 5%, the null hypothesis is rejected at the 5% significance level.
A confidence interval can be constructed:

95% confidence interval for βⱼ = β̂ⱼ ± 1.96·SE(β̂ⱼ)

7.2 Tests of Joint Hypotheses


When conducting a joint hypotheses test, we use the F-statistic. Joint hypothesis is a hypothesis that
imposes 2 or more restrictions on the regression coefficients. For example:
H₀: βⱼ = 0 and βₘ = 0
H₁: at least one of βⱼ, βₘ ≠ 0
Testing each of these restrictions individually using the t-statistic is unreliable: even if the t-statistics
were independent, the probability of rejecting H₀ when it is true would be too large if done separately (for 2
coefficients it is 1 − 0.95² = 9.75%).
The F-statistic used for joint hypothesis testing combines the t-statistics adjusting for the correlation
between them. If our test has 2 restrictions, i.e. it tests for two coefficients, the F-statistic is:
F = (1/2) · (t₁² + t₂² − 2ρ̂_t₁,t₂·t₁·t₂) / (1 − ρ̂²_t₁,t₂)

This equation is heteroskedasticity-robust - as the sample size increases, the difference between this
and the homoskedasticity-only 𝐹-statistic vanishes.
For more than 2 restrictions, we obtain the critical values for the F-statistic from the tables of the F_q,∞
distribution in Appendix Table 4 for the appropriate value of q and the desired significance level.
p-value = Pr(F_q,∞ > F^act)

If the number of restrictions is 1, then the 𝐹-statistic is the square of the 𝑡-statistic.
If the 𝐹-statistic exceeds the critical value, we reject 𝐻0 .

7.3 Testing Single Restrictions Involving Multiple Coefficients


Example:
H₀: β₁ = β₂
H₁: β₁ ≠ β₂
While this hypothesis only has 1 restriction, it does involve multiple coefficients, and therefore, we
need to modify our methods using 1 of 2 approaches:

(1) test the restriction directly (some software have commands designed to do this)
(2) Transform the regression - rewriting the regression such that the restriction in 𝐻0 turns into a
restriction on a single regression coefficient - example of this on page 276.

7.4 Confidence Sets for Multiple Coefficients


A 95% confidence set for 2 or more coefficients is a set that contains the true population values of
these coefficients in 95% of randomly drawn samples.
When there are 2 coefficients, the resulting confidence sets are ellipses.
7.5 Model Specification for Multiple Regression


The OLS estimators of the coefficients in multiple regression will have OVB if an omitted
determinant of 𝑌 is correlated with at least one of the regressors. Then at least 1 of the regressors is
correlated with 𝑢 , which means the model violates the 1st OLS assumption.
Control variable: regressor which is included to hold constant factors that, if neglected, could lead
the estimated causal effect of interest to suffer from OVB.
The 1st OLS assumption treats regressors symmetrically - if we change this assumption to be
explicit about the distinction between a variable of interest and control variables, then the OLS
estimator of the effect of interest is unbiased, but the OLS estimator on control variables are biased
and do not have a causal interpretation.
This assumption is called conditional mean independence - it requires that the conditional
expectation of 𝑢 given 𝑋1 and 𝑋2 does not depend on 𝑋1 although it can depend on 𝑋2 :
E(uᵢ | X₁ᵢ, X₂ᵢ) = E(uᵢ | X₂ᵢ)

There are 4 pitfalls to guard against when using R² or adjusted R²:


1. An increase in R² or adjusted R² does not necessarily mean that an added variable is statistically
significant. To ensure this, we need to perform a hypothesis test using the t-statistic.
2. A high R² or adjusted R² does not mean that the regressors are a true cause of the dependent
variable.

3. A high R² or adjusted R² does not mean that there is no OVB.

4. A high R² or adjusted R² does not necessarily mean that you have the most appropriate set of
regressors, nor does a low R² or adjusted R² necessarily mean that you have an inappropriate set of
regressors.

Choose the scale that makes the most sense in terms of interpretation and ease of reading.
Notes from Lecture
Regression without a Constant
A regression without a constant is one where we assume that the constant is equal to 0:
Y = β₀ + β₁X + u

Ŷ = 0 + β̂₁X
This graphically means that the regression line always goes through the point (0, 0).
Never used in economic research - technically used for CAPM, as we assume that 𝛼 is always 0 in
this model.
When regressing without a constant, β̂₁ is still found using the same equation as before,
β̂₁ = Cov(X, Y)/Var(X). However, the variance is now calculated as:

Var(β̂₁) = [Σᵢ ûᵢ² / (n − 1)] / Σᵢ xᵢ²

We divide the numerator by the degrees of freedom (n − 1).

The R² has to be adjusted, because we are no longer looking at a normal simple regression (the
error term now equals û = Y − β̂X):

Raw R² = (Σᵢ XᵢYᵢ)² / (Σᵢ Xᵢ² · Σᵢ Yᵢ²)

𝑅2 can be negative in this case.


Multiple Regression
We need more than 1 variable to explain 𝑌. Thus, the regression should look like:
Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₖXₖ + u
The 𝛽′𝑠 are called partial regression coefficients, and they show the effect of their variable keeping
the other variables constant.
Assumptions:
- Linear in parameters
- Fixed X values, or X values independent of the error term: cov(X₂, u) = cov(X₃, u) = ⋯ = cov(Xₖ, u) = 0
- Zero mean value of the disturbance: E(u | X₂, X₃, …, Xₖ) = 0
- Homoscedasticity or constant variance of u
- No autocorrelation between disturbances: cov(uᵢ, uⱼ) = 0 for i ≠ j
- The number of observations 𝑛 must be greater than the number of parameters to be
estimated
- There must be variation in the values of the 𝑋 variables (exception: you may have 1 constant
term, so you might think of 1 𝑋 variable equal to 1 for all observations)
- No exact collinearity between the 𝑋 variables
- No specification bias (all the variables should be the ones that explain the model, do not
include or exclude anything)
Similarities with Simple Regression
Suppose we go from a simple regression to a multiple regression with 2 variables:

Y = β₀ + β₁X₁  →  Y = β₀ + β₁X₁ + β₂X₂
What we have to do to estimate the coefficients is the same as under simple regression. The general
equation is derived from:
cov(X₁, u) = cov(X₁, Y − β₀ − β₁X₁ − β₂X₂)

cov(X₁, Y) − cov(X₁, β₀) − cov(X₁, β₁X₁) − cov(X₁, β₂X₂)

cov(X₁, Y) − β₁Var(X₁) − β₂cov(X₁, X₂)

cov(X₁, u) = cov(X₁, Y) − β₁Var(X₁) − β₂cov(X₁, X₂) = 0
Thus, to ensure that this equation equals 0, we get that the beta must be:

β̂₁ = cov(X₁, Y)/Var(X₁) − β̂₂·cov(X₁, X₂)/Var(X₁)

Thus, when an extra regressor is added, the coefficient is no longer the same as under a simple
regression - the second term is added to reflect the importance of the second variable. This suggests
that the betas of the two models are different from each other - however, this is not always true.
There are 2 cases where the multiple-regression β̂₁ equals the simple-regression slope:

1. When β₂ = 0 - because then the second term of the equation is 0, meaning the second variable is not
important and should not be included in the model (a simple regression model is better).

2. When cov(X₁, X₂) = 0 - this also makes the second term of the equation for β̂₁ equal 0. It
means that X₁ and X₂ are orthogonal, i.e. they are explaining something, but they are not
related to each other. Again, the simple regression would be better.

If you forget to include X₂ in your regression, the consequences depend on other characteristics
(writing β̃₁ for the simple-regression slope; a small simulation after this list illustrates the first case):

- If cov(X₁, X₂) > 0 and β₂ > 0, then β̃₁ > β₁, which is a positive bias. This
means that we would overestimate β₁ in a simple regression.

- If cov(X₁, X₂) > 0 and β₂ < 0, then β̃₁ < β₁, which is a negative bias. We thus
underestimate β₁ in a simple regression.

- If cov(X₁, X₂) < 0 and β₂ > 0, then β̃₁ < β₁, negative bias

- If cov(X₁, X₂) < 0 and β₂ < 0, then β̃₁ > β₁, positive bias
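A small simulation in R (my own illustration, not from the lecture) of the first case: X₁ and X₂ are positively correlated and β₂ > 0, so omitting X₂ biases the simple-regression slope upward.

set.seed(6)
n  <- 10000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)                # Cov(X1, X2) > 0
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)     # beta1 = 2, beta2 = 3

coef(lm(y ~ x1))["x1"]        # about 2 + 3 * 0.8 = 4.4: positive bias (overestimated)
coef(lm(y ~ x1 + x2))["x1"]   # about 2: the multiple regression recovers beta1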

The interpretation of the coefficient βⱼ:

βⱼ = dY/dXⱼ, or ΔY = βⱼ·ΔXⱼ
Thus, the coefficient is the impact of 𝑋 on Y holding all the other variables constant.
The goodness of fit for a multiple regression is the adjusted R², which is also written R̄²:

R̄² = 1 − [SSR/(n − k − 1)] / [TSS/(n − 1)]
Where 𝑘 is the number of variables in the regression. Holding all constant, if 𝑘 increases, 𝑅2 will
decrease. Thus, it penalizes the model for including more variables.
Always be careful when using the adjusted R²: it can be better to include a variable even though it
decreases R̄². Turn to the theory for guidance on what to do.
Comparisons of 𝑅 2 should only be done having the same dependent variable, Y. It will give us no
information if we change the structure of the dependent variable.
Lecture 5 - Multiple Regression Analysis II
Notes from Lecture
The squared correlation:

R² = ESS/TSS

This is a proxy of how much the model can explain divided by the total sum of squares, i.e. what is
in the dependent variable. It will be between 0 and 1 (it can be negative if there is no constant).

SSR = (1 − R²)·TSS
Polynomial Regressions
A non-linear model has a non-constant slope. For OLS, we assume that all variables are linear. If
we want to have a linear model that takes into account non-constant slopes, the best way is to add
the same variable with a different exponential:

Y = β₀ + β₁X + β₂X² + ⋯ + βᵣXʳ + u
The shape of the polynomial regression depends on how many times 𝑋 is added.
If we have a quadratic model Y = β₀ + β₁X + β₂X² + u where the data is concave, we expect
β₁ > 0 and β₂ < 0, as β₂ has to mitigate the positive effect of β₁.

The effect of X is found by taking the derivative dY/dX = β₁ + 2β₂X (because this shows the effect
of one extra unit of X on Y).
If you want to include an extra term in your model, it has to be consistent with theory or
common sense, and you have to be able to explain why you include it beyond it just increasing R².
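A short R sketch of a quadratic (concave) specification on simulated data, using I(x^2) to create the squared regressor and then evaluating the non-constant marginal effect β₁ + 2β₂X:

set.seed(7)
x <- runif(300, 0, 10)
y <- 1 + 2 * x - 0.15 * x^2 + rnorm(300)   # concave: beta1 > 0, beta2 < 0

fit <- lm(y ~ x + I(x^2))
b   <- coef(fit)
b["x"] + 2 * b["I(x^2)"] * 2    # estimated effect of one extra unit of X at X = 2
b["x"] + 2 * b["I(x^2)"] * 8    # smaller (here negative) effect at X = 8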
How to Redefine Variables
Redefining a variable means making a transformation of the variable, which is done for 3 reasons:
- You want to change the size of the variable due to different units of measurements
- You want to change the structure of the variable to make it more understandable
- You want to facilitate interpretations because the value of the coefficient is too small/large
If we have a simple regression model with one regressor:
Y = β₀ + β₁X + u

Where β₁ = Cov(X, Y)/Var(X), Y is the price of a house in DKK, and X is the income of individuals in DKK.

If the readers of the journal article are American, we want to rescale the variable from being
measured in DKK to USD. This is done by multiplying the initial variable with the rescaling
constant, 𝜔, which in this case is the exchange rate:
Y* = ω₁·Y
and/or
X* = ω₂·X
We assume that the constant cannot be equal to 0. In general, 𝜔1 does not have to be equal to 𝜔2 , it
depends on the transformation we wish to perform.
By doing this, we get a new regression model:
Y* = β₀* + β₁*X* + u*
Where
β₁* = Cov(X*, Y*) / Var(X*)

β₁* = Cov(ω₂X, ω₁Y) / Var(ω₂X)

β₁* = ω₁ω₂·Cov(X, Y) / (ω₂²·Var(X))

β₁* = (ω₁/ω₂)·Cov(X, Y)/Var(X)

β₁* = (ω₁/ω₂)·β₁
We also have that

β₀* = ω₁·β₀
Thus, we have the relationship between the coefficients of the two models.

If we pre-multiply the dependent variable Y by the constant ω₁, β₁ is going to be multiplied by that
amount as well. If we instead pre-multiply the independent variable by the constant ω₂, β₁ will be
divided by that constant. Thus, if we rescale the dependent variable by multiplying it with a constant,
the beta is also multiplied by this constant; if we rescale the independent variable by multiplying it
with a constant, the beta is divided by this constant.

Var(β̂₁*) = (ω₁/ω₂)²·Var(β̂₁)
The t-statistic is the same, meaning that the significance is the same:
t* = t
Changing the scale of the Y variable will lead to a corresponding change in the scale of the
coefficients and standard errors, so no change in the significance or interpretation.
Changing the scale of one 𝑋 variable will lead to a change in the scale of that coefficient and
standard error, so again, no change in the significance or interpretation.

We can also redefine a variable by standardizing it, on either the left or the right hand-side:
Y* = (Y − Ȳ) / SD(Y)
X* = (X − X̄) / SD(X)
The old variable is thus demeaned and divided by its standard deviation.
The new regression is then:
Y* = βX* + u
The constant is missing, because when we standardize all the variables on the right hand-side, the
constant (a vector of 1s minus its average) becomes 0, so we are running a regression without a
constant. The interpretation of β in this model is therefore how much a change of 1 SD in X affects
the dependent variable Y, measured in standard deviations of Y.
For example, if the beta of age is -0.340, a 1 SD increase in age is associated with a 0.340 SD
decrease in income.
F Statistics and Multiple Linear Restrictions
In terms of multiple regression analysis, we have a multiple linear restriction when talking about the
F-test. With simple linear regression, we only tested whether one variable was equal to 0. With
multiple regression, we test:
H₀: β₁ = 0, β₂ = 0, …, βₖ = 0
H₁: at least one βⱼ ≠ 0
We test this by taking a statistical index called F:

F ≡ [ESS/(k − 1)] / [SSR/(n − k − 1)]
If F is larger than the value found in the F table (with reference to its degrees of freedom (k-1) and
the number of observations minus the degrees of freedom), then we reject 𝐻0 ,
If we divide both terms in the F equation by TSS, we get:

F ≡ [R²/(k − 1)] / [(1 − R²)/(n − k − 1)]
We can use this to see the marginal contribution of a variable to evaluate whether it should be
included in the regression model or not:
F ≡ [(ESS_new − ESS_old)/(k_new − k_old)] / [SSR_new/(n − k_new − 1)]

If F > F_(k_new − k_old), (n − k_new − 1), we reject H₀.
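A minimal R sketch of a joint test on simulated data: the restricted model drops the regressors under H₀, and anova() computes the (homoskedasticity-only) F-statistic for the restrictions:

set.seed(8)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

unrestricted <- lm(y ~ x1 + x2 + x3)
restricted   <- lm(y ~ x1)
anova(restricted, unrestricted)    # F-test of the joint restriction beta2 = beta3 = 0
summary(unrestricted)$fstatistic   # overall F: all slope coefficients equal to 0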

Lecture 6 - Nonlinear Regression and Other Issues Related to OLS


Chapter 8 - Nonlinear Regression Functions
A non-linear function is a function with a slope that is not constant, i.e. the slope depends on the
value of 𝑋.
A quadratic population regression with one regressor is modelling 𝑌 as a function of 𝑋 and the
square of 𝑋:

Y = β₀ + β₁X + β₂X² + u
Mechanically, we can construct the 2nd regressor by generating a new variable that equals the square
of the 1st variable. Then the quadratic model is simply a multiple regression model with 2
regressors. We can estimate the coefficients using OLS methods.
We can test the hypothesis that the relationship between X and Y is linear against the alternative that
it is nonlinear by testing the following hypotheses:
H₀: β₂ = 0
H₁: β₂ ≠ 0
Since the quadratic model before would perfectly describe a linear relationship if 𝛽2 0, as the
second regressor would then be absent, 𝐻0 is true if the relationship is linear. We use the 𝑡-statistic
to test this, and if this exceeds the 5% critical value of the test (1.96), we reject 𝐻0 and conclude
that the quadratic model is a better fit.

The effect on 𝑌 of a change in 𝑋1 , ∆𝑋1 , holding 𝑋2 , . . . , 𝑋 constant, is the difference in the expected
value of 𝑌 when the independent variables take on the values 𝑋1 𝛥𝑋1 , 𝑋2 , . . . , 𝑋 and the expected
value of 𝑌 when the independent variables take on the values 𝑋1 , 𝑋2 , . . . , 𝑋 .
The standard error of the estimator of the effect on 𝑌 of changing another variable is:
SE(ΔŶ) = |ΔŶ| / √F
The 𝐹-statistic here is the one computed when testing the hypothesis that is described on page 311.

The general approach to modeling nonlinear regressions is:


1. Identify a possible nonlinear relationship

2. Specify a nonlinear function and estimate its parameters by OLS

3. Determine whether the nonlinear model improves upon a linear model

4. Plot the estimated nonlinear regression function

5. Estimate the effect on 𝑌 of a change in 𝑋

8.2 Nonlinear Functions of a Single Independent Variable


(1) Polynomials
Y = β₀ + β₁X + β₂X² + ⋯ + βᵣXʳ + u
Where r denotes the highest power of X that is included in the regression. When r = 2 the model is
quadratic, and when r = 3 the regression is cubic. In a polynomial regression, the regressors are
powers of the same independent variable, and the coefficients can be estimated by OLS
regression.
The test that the population regression is linear is that the coefficients after 𝛽1 are equal to 0,
because the model is linear if the quadratic and higher-degree terms do not enter the function:
H₀: β₂ = β₃ = ⋯ = βᵣ = 0
H₁: at least one βⱼ ≠ 0, for j = 2, …, r

As this is a joint test with q = r − 1 restrictions, we use the F-statistic.


When determining which degree polynomial to use, the answer is to include enough terms to model
the nonlinear regression function adequately but no more - we should drop the higher-degree terms
that have a coefficient of 0. We find these using sequential hypothesis testing:
- Pick a maximum value of r and estimate the polynomial regression for that r (usually 2-4)

- Use the t-statistic to test the hypothesis that the coefficient on Xʳ is 0. If you reject this
hypothesis, then Xʳ belongs in the regression, so you use the polynomial of degree r
- If you do not reject βᵣ = 0 in step 2, eliminate Xʳ from the regression and estimate a
polynomial regression of degree r − 1. Test whether the coefficient on Xʳ⁻¹ is 0. If you
reject, use the polynomial of degree r − 1

- If you do not reject βᵣ₋₁ = 0 in step 3, continue this procedure until the coefficient on the
highest power in your polynomial is statistically different from 0

(2) Logarithms
Logarithms convert changes in variables into percentage changes.
The exponential function of x is eˣ (e ≈ 2.71828). The natural logarithm is the inverse of the
exponential function, i.e. it is the function for which x = ln(eˣ). The logarithm is defined only for
positive values of x, and its slope is 1/x.

When Δx is small, the difference between ln(x + Δx) and ln(x) is approximately the percentage
change in x divided by 100:

ln(x + Δx) − ln(x) ≈ Δx/x

There are 3 cases where logarithms might be used:


1. Linear-log model - X is transformed by taking its logarithm but Y is not.

Y = β₀ + β₁·ln(X) + u

In this case, a 1% change in X is associated with a change in Y of 0.01β₁. To
compute the estimators, first create a new variable ln(X) and then run a regression as
normal.

2. Log-linear model - Y is transformed to its logarithm but X is not.

ln(Y) = β₀ + β₁X + u

A 1-unit change in X is associated with a 100·β₁ percent change in Y, i.e. when X
changes by 1 unit, ln(Y) changes by β₁.

3. Log-log model - Both X and Y are transformed to their logarithms.

ln(Y) = β₀ + β₁·ln(X) + u

A 1% change in X is associated with a β₁% change in Y, i.e. β₁ is the elasticity of Y with
respect to X: β₁ = %ΔY / %ΔX. Remember that a 1% change in X is ΔX = 0.01X.
While 𝑅2 can be used to compare log-linear and log-log model, it cannot be used to compare the
linear-log model and the log-log model because their dependent variables are different.

** if Y is transformed (ln Y), you cannot compute the appropriate predicted value of Y by simply taking
the exponential function of the fitted values, as this value is biased. The solution used in the book is to
simply not transform the predicted values of the logarithm of Y back to their original units.
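A minimal R sketch of the three specifications on simulated data (X > 0, so the logarithm exists); the transformed variables can be created directly inside the formula:

set.seed(9)
x <- runif(500, 1, 10)
y <- exp(0.5 + 0.3 * log(x) + rnorm(500, sd = 0.1))   # true elasticity 0.3

coef(lm(y ~ log(x)))        # linear-log: a 1% change in X changes Y by about 0.01 * beta1
coef(lm(log(y) ~ x))        # log-linear: 100 * beta1 = % change in Y per unit change in X
coef(lm(log(y) ~ log(x)))   # log-log: beta1 is the elasticity (about 0.3 here)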

8.3 Interactions Between Independent Variables


Interactions between variables occur when the effect on 𝑌 of a change in 1 independent variable
depends on the value of another independent variable.

(1) Interactions between 2 binary variables


If 2 binary variables interact, we can modify the specification of the model so that
the interaction is allowed by introducing another regressor, the product of the 2 binary variables.
For example, if we observe an interaction between D₁ and D₂, we modify the model in the
following way:
Y = β₀ + β₁D₁ + β₂D₂ + u

Y = β₀ + β₁D₁ + β₂D₂ + β₃(D₁ × D₂) + u
The added product term is called the interaction term. The coefficient β₃ is the difference in the
effect of D₁ on Y between observations with D₂ = 1 and observations with D₂ = 0.
A method for interpreting coefficients in regressions with binary variables is to first compute the
expected values of Y for each possible case described by the set of binary variables. Next compare
these expected values. Each coefficient can then be expressed either as an expected value or as the
difference between two or more expected values.

(2) Interactions between 1 binary and 1 continuous variable


The population regression line relating 𝑌 and the continuous variable 𝑋 can depend on the binary
variable 𝐷 in 3 different ways, depending on how the interaction term is added.
Since the specifications are all versions of a regular multiple regression model, the coefficients can
be estimated by OLS once the interaction variable Xᵢ × Dᵢ is created.
The 3 different ways are summarized in the graph below - note that the last one is rarely used:
(3) Interactions between 2 continuous variables
The interaction between X₁ and X₂ can be modeled as:
Y = β₀ + β₁X₁ + β₂X₂ + β₃(X₁ × X₂) + u
With this model, the effect on Y of a change in X₁, holding X₂ constant, is:

ΔY/ΔX₁ = β₁ + β₃X₂
The coefficient on the interaction term, β₃, is the effect of a unit increase in X₁ and X₂ above and
beyond the sum of the effects of a unit increase in X₁ alone and a unit increase in X₂ alone.
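A short R sketch of an interaction between two continuous regressors on simulated data; the formula y ~ x1 * x2 expands to the main effects plus the interaction term, and the effect of X₁ is evaluated as β₁ + β₃X₂:

set.seed(10)
x1 <- rnorm(400)
x2 <- rnorm(400)
y  <- 1 + 2 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rnorm(400)

fit <- lm(y ~ x1 * x2)          # regressors: x1, x2 and x1:x2
b   <- coef(fit)
b["x1"] + b["x1:x2"] * 0        # effect of a unit change in x1 when x2 = 0
b["x1"] + b["x1:x2"] * 2        # larger effect when x2 = 2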

Notes from Lecture


Math refresh:

d ln(x)/dx = 1/x
Multicollinearity
Can be divided into perfect and imperfect multicollinearity:
1. Perfect multicollinearity

One of the regressors can be expressed as a linear combination of another regressor, e.g.:

Y = β₀ + β₁X₁ + β₂X₂ + u

Will have perfect multicollinearity if

X₂ = α₀ + α₁X₁

For example, work experience can be approximated by taking age − 6 − years of education. If we run a
regression including age, education and work experience approximated as stated before, we will run
into problems:

- If we substitute the linear relationship into the model, we get:

Y = β₀ + β₁X₁ + β₂(α₀ + α₁X₁) + u

Y = (β₀ + β₂α₀) + (β₁ + β₂α₁)X₁ + u

The problem is then that the software will give us one coefficient for (β₁ + β₂α₁),
which does not allow us to disentangle the values of β₁ and β₂.

- Depending on the software, we cannot even run the regression, if the variables
display perfect multicollinearity. R will tell us that the problem is there and will drop
one of the regressors when running the model.

Multicollinearity is related to the structure of the data. The theory tells us we should
include both variables, but because of the way the data are structured, we cannot do so, or
it would lead to biased estimates.

2. Imperfect multicollinearity

Occurs when two regressors are highly linearly correlated, but the correlation is not exactly
1. Thus, we add an error term to perfect multicollinearity:

X₂ = α₀ + α₁X₁ + v

Where v is an error term, which captures the imperfection of the linear combination. In the
example with work experience from before, the error could for example be that experience sometimes
is age − 5 − years of education.

If we substitute the linear relationship into the regression model, we have:

Y = β₀ + β₁X₁ + β₂(α₀ + α₁X₁ + v) + u

Y = (β₀ + β₂α₀) + (β₁ + β₂α₁)X₁ + β₂v + u

Using this regression, we can technically derive an estimate of 𝛽1 and 𝛽2 , given that we
know the value of the error term, 𝑣. However, we do not know this, so in practice we have
the same problem as under perfect multicollinearity, where the two betas cannot be
disentangled from each other.

We can run an OLS regression with imperfect multicollinearity. If we do so without
knowing that it has imperfect multicollinearity, the OLS is still unbiased, as we do not
violate any assumptions. Furthermore, the OLS still has the lowest variance.

The problem with imperfect multicollinearity comes with the significance level of the
regression - if we assume that:

Var(β̂ⱼ) = σ² / [TSSⱼ·(1 − ρⱼ²)]

Where TSSⱼ is the total sum of squares of Xⱼ, and ρⱼ is the correlation of Xⱼ with the other
regressor(s).

When we have imperfect multicollinearity, 𝜌 → 1, meaning that the denominator goes


towards 0, the implication of which is that the variance of 𝛽 will be extremely high. The
problem of the variance increasing just because of this effect is that the standard error of 𝛽
also increases, meaning that the t statistic will decrease, widening the confidence interval.
Thus, we are able to say less about our data and model.

Imperfect multicollinearity is also creating a problem because the regression created is very
sensitive with respect to outliers. The inclusion or exclusion of these variables will have a
very large effect on the betas.

Battista wants us to check for multicollinearity, which can be done by for example checking
for a very low t-statistic at the same time as a very high 𝑅2 , by looking at the correlation
between variables, or by employing the variance inflation factor (VIF) (can be done in R).

An auxiliary regression is any regression that does not estimate the model of primary interest but is
used to compute a test statistic, such as the test statistics for heteroskedasticity and serial
correlation. For example, if we have the model:

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + u

We can use the auxiliary regression to see the relationship of other variables to the variable
of interest, in this case 𝑋3 :

X₃ = γ₀ + γ₁X₁ + γ₂X₂ + ε

If 𝑋3 is very correlated with 𝑋1 and 𝑋2 , we also expect 𝛾1 and 𝛾2 to be extremely


statistically significantly different from 0. The error of the regression is important because it
is the effect of 𝑋3 depurated by (netting out the effects of) 𝑋1 and 𝑋2 . This allows us to run a
new regression:

Y = α₀ + α₁X₁ + α₂X₂ + α₃ε̂ + z

Doing this, we are taking out 𝑋3 but adding the residual contribution of 𝑋3 not related to 𝑋1
and 𝑋2 .

Thus, when we have multicollinearity, we can either do nothing, try to find a theoretical
reason as to why we should not include the variable, combine cross-sectional and time series
data, drop a variable (at the risk of specification bias), transform the variable, or try to find
additional/new data.
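A small R sketch of these checks on two simulated, highly correlated regressors: the pairwise correlation, the auxiliary-regression R², and the variance inflation factor computed from it (car::vif() gives the same number if the car package is installed; that call is only shown as a comment).

set.seed(11)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)              # imperfect multicollinearity
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)

cor(x1, x2)                                # close to 1
aux_r2 <- summary(lm(x2 ~ x1))$r.squared   # auxiliary regression of x2 on x1
1 / (1 - aux_r2)                           # variance inflation factor, by hand
# car::vif(lm(y ~ x1 + x2))                # packaged version, if car is available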
Nonlinear vs Linear Regression
When running an OLS regression, we need to have a linear relationship. If the data is suggesting a
non-linear representation, we can consider looking at the polynomials. Otherwise, another
alternative to check for a nonlinear relationship is to use the natural logarithm to transform the
variable.

Properties of ln:
- The logarithm is scale invariant: if we take a variable and transform it
using the natural logarithm, the ranking of the values will always be the same (no
reshuffling).

- If we take the logarithm of a value ≤ 0, the result does not exist, which means that
if we have negative or zero values in the model, we will have missing values in the
regression. If we have a lot of these values in our vector, we can consider using
ln(1 + x), but we have to be careful, because this can change the results as well.

- When transforming using ln, the data is less skewed, there is less distortion in the
distribution of data, and the effect of outliers is lower.
There are 3 different cases that differ from the simple regression model (level-level):
(1) Level-log
Y = β₀ + β₁·ln(X₁) + u

dY/dX₁ = β₁·(1/X₁), or ΔY = β₁·(ΔX₁/X₁)

Thus, the interpretation of β₁ is the effect on Y if we increase X₁ by 100%. Not used a lot in
economics.
(2) Log-level
ln(Y) = β₀ + β₁X₁ + u

d ln(Y)/dX₁ = β₁, and since d ln(Y) = dY/Y, we have ΔY/Y = β₁·ΔX

The interpretation is thus the percentage change in Y if there is an increase of 1 unit in X.

(3) Log-log
ln(Y) = β₀ + β₁·ln(X₁) + u

d ln(Y)/dX₁ = β₁/X₁

β₁ = (dY/Y) / (dX/X) = elasticity

Reciprocal models are an alternative way of rescaling by including the reciprocal of your input as
the predictor:
Y = β₀ + β₁·(1/X₁) + u,  X₁ ≠ 0

This yields a non-linear relation between X and Y. If the coefficient is positive, as X
increases, Y decreases on average. If the coefficient is negative, as X increases, Y increases on
average.

Lecture 7 - Other Issues Related to OLS


Notes from Lecture
If we take the covariance of a variable with respect to a constant, it will equal 0, e.g. Cov(X, β₀) = 0.
The covariance of a variable with respect to itself is equal to the variance of that variable, e.g.
Cov(X, X) = Var(X).

We can rewrite the equation for β̂₁:

β̂₁ = Cov(X, Y) / Var(X)

β̂₁ = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ (Xᵢ − X̄)²

β̂₁ = Σᵢ (Xᵢ − X̄)Yᵢ / Σᵢ (Xᵢ − X̄)²

β̂₁ = Σᵢ (Xᵢ − X̄)(β₀ + β₁Xᵢ + uᵢ) / Σᵢ (Xᵢ − X̄)²

β̂₁ = [β₀ Σᵢ (Xᵢ − X̄) + β₁ Σᵢ (Xᵢ − X̄)Xᵢ + Σᵢ (Xᵢ − X̄)uᵢ] / Σᵢ (Xᵢ − X̄)²

β̂₁ = [β₁ Σᵢ (Xᵢ − X̄)² + Σᵢ (Xᵢ − X̄)uᵢ] / Σᵢ (Xᵢ − X̄)²

β̂₁ = β₁ + Σᵢ (Xᵢ − X̄)uᵢ / Σᵢ (Xᵢ − X̄)²
OLS is BLUE - under the OLS assumptions, E(uᵢ | Xᵢ) = 0, and therefore we have that:

E(β̂₁) = β₁

The variance of β̂₁:

Var(β̂₁) = Var(β₁ + Σᵢ (Xᵢ − X̄)uᵢ / Σᵢ (Xᵢ − X̄)²)

Var(β̂₁) = Σᵢ (Xᵢ − X̄)²·Var(uᵢ) / [Σᵢ (Xᵢ − X̄)²]²
Homoscedasticity and Heteroscedasticity
One of the assumptions of OLS states that Var(uᵢ) = σ², meaning that if the assumption holds, the
variance of β̂₁ is:

Var(β̂₁) = Σᵢ (Xᵢ − X̄)²·σ² / [Σᵢ (Xᵢ − X̄)²]²

Var(β̂₁) = σ² / Σᵢ (Xᵢ − X̄)² = σ² / (n·s²_X)

When this assumption holds, we thus have homoskedasticity, meaning that the variance of the error
is identical for all the observations. Homoskedasticity can be written as:
Var(uᵢ | Xᵢ = x) = Var(uᵢ) = σ²

If the assumption does not hold, we have heteroscedasticity and have to use
Var(β̂₁) = Σᵢ (Xᵢ − X̄)²·Var(uᵢ) / [Σᵢ (Xᵢ − X̄)²]².
This is a problem because Var(uᵢ | Xᵢ = x) ≠ Var(uᵢ) = σ² means that the variance of the error is
related to the observation (the error variance is not the same for all i), which makes the usual
homoskedasticity-only standard errors of the OLS estimator biased.

From simply looking at the data points, one can sometimes tell if there is homo- or
heteroscedasticity.

Homoscedastic:
Heteroscedastic:

Breusch-Pagan test for heteroscedasticity:


1. Run the OLS regression as usual, Y = β₀ + β₁X₁ + ⋯ + βₖXₖ + u

2. Regress the squared residuals on the original right hand-side variables, û² = γ₀ + γ₁X₁ + ⋯ + γₖXₖ + r.
If the error is independent of the X's, i.e. if the regression is homoscedastic, then the
coefficients γ should have no power in explaining the squared error.

3. Test for no heteroscedasticity, H₀: γ₁ = ⋯ = γₖ = 0

White test for heteroscedasticity:

1. Run the OLS regression as usual, Y = β₀ + β₁X₁ + β₂X₂ + u

2. Regress the squared residuals on the original right hand-side variables, their squares, and their
cross products, û² = γ₀ + γ₁X₁ + γ₂X₂ + γ₃X₁² + γ₄X₂² + γ₅X₁X₂ + r. If the error is independent
of the X's, i.e. if the regression is homoscedastic, then the coefficients γ should have no power
in explaining the squared error.

3. Test for no heteroscedasticity, H₀: γ₁ = ⋯ = γ₅ = 0
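A by-hand R sketch of the Breusch-Pagan idea on simulated heteroscedastic data. One common way to test the joint null in step 3 is the LM statistic n·R² from the auxiliary regression (lmtest::bptest() is a packaged equivalent, assuming that package is installed):

set.seed(12)
n <- 500
x <- runif(n, 1, 5)
y <- 1 + 2 * x + rnorm(n, sd = x)      # error spread grows with x: heteroscedastic

u2  <- resid(lm(y ~ x))^2              # squared residuals from the usual OLS fit
aux <- summary(lm(u2 ~ x))             # step 2: auxiliary regression
lm_stat <- n * aux$r.squared           # LM statistic, approx. chi-squared with k = 1 df
pchisq(lm_stat, df = 1, lower.tail = FALSE)   # small p-value: reject homoscedasticity
# lmtest::bptest(lm(y ~ x))            # packaged version, if lmtest is available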

If the data is heteroscedastic, we can use the White robust SE, which corrects for degrees of
freedom, i.e.:

Var(β̂₁) = [n/(n − k − 1)] · Σᵢ (Xᵢ − X̄)²·ûᵢ² / [Σᵢ (Xᵢ − X̄)²]²
The degrees of freedom become less important the larger the 𝑛.

Usually, no one tests for heteroscedasticity, since people assume that data has heteroscedasticity
and they correct the SE.
Matrix Notation
Using matrix notation:
Y = Xβ + u

β̂ = (X′X)⁻¹X′Y

Var(β̂ | X) = (X′X)⁻¹X′ΩX(X′X)⁻¹, where Ω is the variance matrix of the errors (under homoskedasticity
Ω = σ²I and this reduces to σ²(X′X)⁻¹)

Thus, the variance of β̂ under heteroscedasticity is expected to depend on X.


Generalized Least Squares (GLS)
GLS is an alternative way to correct for heteroscedasticity or autocorrelation. It is a 2-step technique:
(1) estimate the normal regression and get an estimate of the residuals
(2) re-estimate the original regression, using the weights 1/√σ̂ᵢ², where σ̂ᵢ² is the estimate of the
error variance for observation i. Both the independent and the dependent variables are multiplied by
this weight before rerunning the regression.

For OLS, we minimize Σᵢ ûᵢ²; for GLS, we minimize Σᵢ ûᵢ²/σ̂ᵢ².

In large samples, the efficiency gain from GLS is likely to be minimal. Accurately estimating the
variance-covariance matrix in the first stage is empirically challenging and often yields incorrect
estimates. We can get correct estimates of standard errors using White Standard errors, making the
use of GLS for the correction of heteroscedasticity largely unnecessary in practice.
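A minimal R sketch of the two-step feasible GLS/WLS idea described above, on simulated data; note that lm()'s weights argument minimizes the weighted sum of squared residuals, so weights of 1/σ̂i² correspond to multiplying the variables by 1/√σ̂i²:

set.seed(1)
d <- data.frame(X = runif(300, 1, 5))
d$Y <- 1 + 2 * d$X + rnorm(300, sd = d$X)                # error SD proportional to X
step1 <- lm(Y ~ X, data = d)                             # (1) usual OLS, keep residuals
aux   <- lm(log(residuals(step1)^2) ~ log(X), data = d)  # model the error variance
d$sig2_hat <- exp(fitted(aux))
step2 <- lm(Y ~ X, data = d, weights = 1 / sig2_hat)     # (2) re-estimate with weights
summary(step2)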
Autocorrelation
One of the OLS assumptions states that the covariances and correlations between different disturbances are all 0:

Cov(ut, us) = 0 for all t ≠ s

This assumption states that the disturbances ut and us are independently distributed, i.e. there is no serial dependence.
If the assumption is not valid, then the disturbances are not pairwise independent but pairwise autocorrelated. This means that an error occurring at period t may be carried over to the next period t + 1. Autocorrelation is most likely to occur in time series data. In cross-sectional data we can change the arrangement of the data without altering the results.
Lecture 8 - Binary Dependent Variable
Chapter 11 - Regression with a Binary Dependent Variable
When running a regression with a binary dependent variable, we interpret the regression as modeling the probability that the dependent variable equals 1. Thus, for a binary variable,
E(Y | X1, …, Xk) = Pr(Y = 1 | X1, …, Xk).
The linear multiple regression model applied to a binary dependent variable is called the linear probability model (LPM), because it corresponds to the probability that the dependent variable equals 1 given X.
The coefficient β1 on a regressor X is the change in the probability that Y = 1 associated with a unit change in X.
Hypotheses concerning several coefficients can be tested using the F-statistic, and confidence intervals can be formed as ±1.96 SEs.
R² is not a meaningful measure of fit when the dependent variable is binary, unless the regressors are also binary.
________________________________________________________________________________
Example - Denial of Bank Loans and Race
The following linear probability model is estimated, with loan denial as the dependent variable and the payment-to-income ratio and a black/white indicator as regressors (the estimated equation is omitted).
The coefficient of 0.177 means that a black applicant has a 17.7 percentage point higher probability of having their loan application denied than a white applicant, holding constant their payment-to-income ratio.
________________________________________________________________________________

Shortcoming of the LPM: Because probabilities cannot exceed 1, the effect on the probability that Y = 1 of a given change in X must be nonlinear. However, in the LPM, the effect of a change is constant, which leads to predicted probabilities that can drop below 0 or exceed 1.

11.2 Probit and Logit Regression


Probit and Logit regressions are nonlinear models specifically designed for binary dependent
variables, correcting the shortcoming of LPM by forcing the predicted values to be between 0 and 1.
1) Probit
With a single regressor:

Pr(Y = 1 | X) = Φ(β0 + β1X)

Where Φ is the cumulative standard normal distribution function (Appendix Table 1). In the model, the term β0 + β1X plays the role of z in the cumulative standard normal distribution; we look up the probability of being to the left of that value in the normal distribution.
β1 is the change in the z-value associated with a unit change in X - if β1 is positive, an increase in X increases the z-value and thus increases the probability that Y = 1, and vice versa.

With multiple regressors:


Pr(Y = 1 | X1, X2) = Φ(β0 + β1X1 + β2X2)
Here, the model is extended by adding regressors to compute the 𝓏-value.
The coefficients are often estimated using the method of maximum likelihood, which produces
efficient (minimum variance) estimators that are also consistent and normally distributed in large
samples, meaning that the 𝑡-statistics and confidence intervals for the coefficients can be computed
in the normal way.

2) Logit
Pr(Y = 1 | X1, X2, …, Xk) = F(β0 + β1X1 + β2X2 + ⋯ + βkXk)

Where F is the cumulative standard logistic distribution function.
As with Probit, the logit coefficients are best interpreted by computing predicted probabilities and differences in predicted probabilities. The coefficients of the logit model can be estimated by maximum likelihood, so again, the t-statistics and confidence intervals are used as usual.

Sum-up: The LPM is easiest to use and to interpret, but it cannot capture the nonlinear nature of the
true population regression function. Probit and Logit regressions model this nonlinearity in the
probabilities, but their regression coefficients are more difficult to interpret.

11.3 Estimation and Inference in the Logit and Probit Models


** Skip this part, but read it if you need information on maximum likelihood estimation (MLE) and nonlinear least squares estimation.

Notes from Lecture


Binary Data
Binary data can take on 2 values, 0 and 1 (other values can be considered, as variables can be
rescaled, so it does not make a difference).
Dummy Variables
We denote dummy variables with 𝑑.
OLS model with a dummy variable:

Y = β0 + δ0d + β1X1 + u

The sample can be split into 2 groups, d = 0 and d = 1, which gives us 2 different models:

if d = 0: Y = β0 + β1X1 + u
if d = 1: Y = (β0 + δ0) + β1X1 + u

Thus, if d = 1, the slope remains the same, but the intercept increases by the coefficient δ0. Thus, δ0 tells us the effect of a certain characteristic on the dependent variable. The group with d = 0 is called the base/control group, and the difference between the two groups is equal to δ0.
We cannot add a variable that displays perfect multicollinearity with a dummy variable in the same model, e.g. having d = female and X1 = male, because the female and male dummies sum to 1 and would therefore be perfectly collinear with the intercept.

The interaction of dummies is also of interest. The joint effect of dummy variables in a model can
be considered using an extended version of the model:
Y = β0 + δ1d1 + β1X1 + δ2d2 + u
Extended: Y = β0 + δ1d1 + β1X1 + δ2d2 + δ3d1d2 + ε
Where the interaction is measured by d1d2. The coefficient δ3 then shows the joint effect of the two dummies on Y.
If we want to study the effect of just one of the dummies, we can take the first derivative:

dY/dd2 = δ2 + δ3d1

This shows that when studying the effect of 1 dummy variable, we have to take into consideration
the joint effect it has with other dummy variables.
If our goal is to study the relevance of the group, we need to know whether making this distinction
is important. In order to test the relevance, we can utilize the Chow-test:
Consider the regression with two groups defined by the variable d:

Y = β0 + β1Xi,1 + β2Xi,2 + u

We can split the regression into two subgroups:
(1) if d = 0: Y = α0 + α1Xi,1 + α2Xi,2 + e
(2) if d = 1: Y = γ0 + γ1Xi,1 + γ2Xi,2 + e

We can then run a test using the residual sum of squares from the 3 different regressions.
Steps:
1. Run the 3 regressions and compute RSS, RSS1, and RSS2

2. Compute F = [(RSS − (RSS1 + RSS2))/(k + 1)] / [(RSS1 + RSS2)/(n − 2(k + 1))], where k is the number of regressors (here k = 2)

3. Test for irrelevance of the grouping: H0: α0 = γ0, α1 = γ1, α2 = γ2. Reject if the computed F is higher than the critical value from the table.
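A minimal R sketch of these Chow-test steps on simulated data with a grouping dummy d and k = 2 regressors:

set.seed(1)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), d = rep(0:1, each = n / 2))
df$y <- 1 + 0.5 * df$x1 + 0.5 * df$x2 + 0.8 * df$d * df$x1 + rnorm(n)
rss  <- function(m) sum(residuals(m)^2)
RSS  <- rss(lm(y ~ x1 + x2, data = df))                  # pooled
RSS1 <- rss(lm(y ~ x1 + x2, data = subset(df, d == 0)))  # group d = 0
RSS2 <- rss(lm(y ~ x1 + x2, data = subset(df, d == 1)))  # group d = 1
k <- 2
F_chow <- ((RSS - (RSS1 + RSS2)) / (k + 1)) / ((RSS1 + RSS2) / (n - 2 * (k + 1)))
F_chow
qf(0.95, df1 = k + 1, df2 = n - 2 * (k + 1))             # 5% critical value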

Dummy variables can be used to study economics of discrimination, e.g.


ln(wage) = β0 + δ0·Female + β1·Edu + u
δ0 = E[ln(wage) | Female = 1, Edu] − E[ln(wage) | Female = 0, Edu]

The dummy variable trap is an example of perfect multicollinearity.


Dummy Variable as Dependent Variable (Y)
(1) Linear Probability Model
When we have a dummy as the dependent variable, we are considering an OLS estimate called the linear probability model (LPM).
Positive:
- Interpretation of results:

E(Y | X) = 1·Pr(Y = 1 | X) + 0·Pr(Y = 0 | X)

E(Y | X) = Pr(Y = 1 | X), so the marginal effect of X on this probability is β1

When estimating something with the LPM, the estimated β1 is the effect of the variable we want to study, X, on the probability that the dependent variable equals 1.

Negative:
- The dependent variable is binary, so the error term u is not normally distributed. This does not bias OLS, but it invalidates exact small-sample inference.
- Errors are heteroscedastic by construction, since the variance of Y given X changes with X. The usual OLS standard errors are then wrong unless we use White (robust) SEs.

- The prediction of the dependent variable, Ŷ, can fall outside [0; 1].


Despite its negative sides, the LPM is very popular in econometric papers. Battista recommends combining it with Probit or Logit: show both sets of estimates and check for consistency between them.
When running the LPM, we need to use robust standard errors, because models with a binary dependent variable will by default be heteroscedastic.

(2) Probit
When we assume that OLS is not a good strategy, we are thinking that the probability that our dependent variable equals 1 given the X's cannot be modeled with the usual linear OLS regression, and we must assume a particular type of function:

Pr(Y = 1 | X) = Φ(β0 + β1X)

Meaning that we have to consider a function which is not linear, and therefore, we need a different technique to derive the betas. We use maximum likelihood (ML): write down the likelihood of the observed data given the parameters, take the first derivative of the (log-)likelihood, and choose the parameter values that maximize the likelihood that the right-hand side explains the left-hand side of the regression.
Probit is thus:

Pr(Y = 1 | X) = Φ(β0 + β1X)

Where Φ is the cumulative standard normal distribution function. The Probit model has the following desirable properties:
- Pr(Y = 1 | X) is increasing in X for all β1 > 0 (same as LPM)

- 0 ≤ Pr(Y = 1 | X) ≤ 1 for all X (different from LPM)


In terms of interpretation, we can consider the model Pr(Y = 1 | X) = Φ(β0 + β1X) and let β0 = −2 and β1 = 3. At X = 0.4 we then get:

Pr(Y = 1 | X = 0.4) = Φ(−2 + 3·0.4) = Φ(−0.8)

This is the area under the standard normal density to the left of z = −0.8, which is 0.2119 or 21.19%. This illustrates the role of β1: if it is positive, an increase in X increases the z-value and thus increases the probability that Y = 1. Although the effect of X on z is linear, the effect of X on the probability is non-linear. Thus, in practice, the easiest way to interpret the regression coefficients is to compute the change in the predicted probabilities.
(3) Logit
Similar to Probit models but uses the logistic cumulative distribution function:

F(β0 + β1X1 + ⋯ + βkXk) = 1 / (1 + e^−(β0 + β1X1 + ⋯ + βkXk))

Often yields results similar to Probit regression models. Which one to use depends on what we are studying (check which technique is the most used within the field you are studying).
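A minimal R sketch (simulated data) that estimates the LPM with robust SEs, a Probit, and a Logit on the same binary outcome and compares average marginal effects; it assumes the lmtest and sandwich packages are installed:

library(lmtest); library(sandwich)
set.seed(1)
d <- data.frame(X = rnorm(1000))
d$Y <- as.numeric(runif(1000) < pnorm(-0.5 + 1 * d$X))   # data generated from a probit
lpm    <- lm(Y ~ X, data = d)
coeftest(lpm, vcov = vcovHC(lpm, type = "HC1"))          # LPM with robust SEs
probit <- glm(Y ~ X, family = binomial(link = "probit"), data = d)
logit  <- glm(Y ~ X, family = binomial(link = "logit"),  data = d)
c(LPM        = unname(coef(lpm)["X"]),
  probit_AME = mean(dnorm(predict(probit)) * coef(probit)["X"]),
  logit_AME  = mean(dlogis(predict(logit)) * coef(logit)["X"]))   # comparable effects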

Lecture 9 - Panel Data


Chapter 10 - Regression with Panel Data
Panel data/longitudinal data: data for n different entities observed at T different time periods. Y_it denotes the variable Y observed for the ith of n entities in the tth of T periods.
A balanced panel has all its observations, i.e. the variables are observed for each entity for each
time period. An unbalanced panel has some missing data for at least one time period for at least 1
entity.
If unobserved factors remain constant over time for a given entity, we can hold these factors
constant even though we cannot measure them by using OLS regression with fixed effects.

10.2 Panel Data with Two Time Periods: "Before and After" Comparisons
If data for each entity are obtained for T = 2 time periods, we can compare values of the dependent variable in the 2nd period to values in the 1st period - by focusing on these changes in the dependent variable, we "hold constant" the unobserved factors that differ from 1 entity to another but do not change over time within the entity. This entity-fixed variable is denoted Z_i:

Y_it = β0 + β1X_it + β2Z_i + u_it

Because Z_i does not change over time, it will not produce any change in Y between the time periods. Thus, the influence of Z_i can be eliminated by analyzing the change in Y between the 2 periods:

Y_i1 = β0 + β1X_i1 + β2Z_i + u_i1
Y_i2 = β0 + β1X_i2 + β2Z_i + u_i2

Subtracting these regressions from each other:

Y_i2 − Y_i1 = β1(X_i2 − X_i1) + (u_i2 − u_i1)

Analyzing changes in Y and X has the effect of controlling for variables that are constant over time, thereby eliminating this source of OVB.
This "before and after" comparison works when T = 2. If T > 2, we need to use the method of fixed effects regression.
method of fixed effects regression.

10.3 Fixed Effects Regression


Fixed effects regression is a method for controlling for OVB in panel data when the omitted
variables vary across entities but do not change over time. The model will have 𝑛 different
intercepts, which can be represented by a set of binary (or indicator) variables, as these absorb the
influences of all omitted variables that differ from 1 entity to the next but are constant over time.
Y_it = β0 + β1X_it + β2Z_i + u_it

Where Z_i is an unobserved variable that varies from 1 entity to the next but does not change over time. We want to estimate β1, the effect on Y of X holding constant the unobserved entity characteristics Z. Because Z_i is constant over time, it is a constant that is added to the intercept and thus changes it. We can thus rewrite the model as:

Y_it = α_i + β1X_it + u_it

Where α_i = β0 + β2Z_i. The slope is thus the same for all entities; only the intercept changes - the intercept can be called the effect of being in entity i. The α_i are thus the entity fixed effects.
To develop the fixed effects regression model using binary variables, let D1_i be a binary variable that equals 1 when i = 1 and 0 otherwise, let D2_i be 1 when i = 2 and 0 otherwise, and so on. To avoid the dummy variable trap - perfect multicollinearity occurring when including all n binary variables plus a common intercept - we arbitrarily omit the binary variable D1_i:

Y_it = β0 + β1X_it + γ2D2_i + γ3D3_i + ⋯ + γnDn_i + u_it

Here, the slope is again the same for all entities, while the intercept differs - in this model, the intercept is β0 + γ_i for i ≥ 2.
This regression has k + n regressors: the k X's, the n − 1 binary variables, and the intercept.

When T = 2, there are 3 ways to estimate β1 by OLS: (1) the "before and after" specification, (2) the "binary variables" specification, and (3) the "entity-demeaned variables" specification. The OLS estimators produced are identical.

10.4 Regression with Time Fixed Effects


Time fixed effects control for variables that are constant across entities but evolve over time:

Y_it = β0 + β1X_it + β2S_t + u_it

The model will have T different intercepts, and the time fixed effects regression model is:

Y_it = λ_t + β1X_it + u_it

The intercept λ_t can be thought of as the effect on Y of time period t.
The time fixed effects regression model can also be represented using T − 1 binary indicators:

Y_it = β0 + β1X_it + δ2B2_t + ⋯ + δT BT_t + u_it

The intercept is included, and the first binary variable B1_t is omitted to prevent perfect multicollinearity (dummy variable trap).
We can also create a model that combines the effects. The entity and time fixed effects regression model is:

Y_it = α_i + λ_t + β1X_it + u_it

Where α_i is the entity fixed effect and λ_t is the time fixed effect. We can also represent this model using n − 1 entity binary indicators and T − 1 time binary indicators, along with an intercept:

Y_it = β0 + β1X_it + γ2D2_i + ⋯ + γnDn_i + δ2B2_t + ⋯ + δT BT_t + u_it

This model eliminates OVB arising both from unobserved variables that are constant over time and from unobserved variables that are constant across entities.

10.5 The Fixed Effects Regression Assumptions and Standard Errors for Fixed Effects Regression
In panel data, the regression error can be correlated over time within an entity - this does not
introduce bias into the fixed effects estimator, but it affects the variance of the fixed effects
estimator and therefore, it affects how one computes standard errors.
Thus, we use clustered standard errors, which are robust to both heteroskedasticity and to
correlation over time within an entity.
When there are many entities (when 𝑛 is large), hypothesis tests and confidence intervals can be
computed using the usual large-sample normal and 𝐹 critical values.

Fixed effects regression assumptions:
1. The error u_it has conditional mean 0: E(u_it | X_i1, …, X_iT, α_i) = 0.
2. (X_i1, …, X_iT, u_i1, …, u_iT), i = 1, …, n, are i.i.d. draws from their joint distribution.
3. Large outliers are unlikely (finite fourth moments).
4. There is no perfect multicollinearity.
The cross-sectional counterpart of Assumption 2 holds that each observation is independent, which
arises under simple random sampling. In contrast, Assumption 2 for panel data holds that the
variables are independent across entities but makes no such restriction within an entity.
If X_it is correlated over time for a given entity, then X_it is said to be autocorrelated - what happens in one time period tends to be correlated with what happens in the next time period. Such omitted
factors, which persist over multiple years, produce autocorrelated regression errors. Not all omitted
factors will produce autocorrelation in 𝑢 ; if the factor for a given entity is independently
distributed from 1 year to another, then this component of the error term would be serially
uncorrelated.
If the regression errors are autocorrelated, then the heteroskedasticity-robust standard error formula
for cross-section is not valid. Instead, we have to use heteroskedasticity- and autocorrelation-robust
(HAC) SEs, e.g. clustered SEs. Equation given in appendix 10.2.
Sum-up: In panel data, variables are typically autocorrelated, that is, correlated over time within an
entity. Standard errors need to allow both for this autocorrelation and for potential
heteroskedasticity, and one way to do so is to use clustered standard errors.

Notes from Lecture


Panel data: multidimensional data that contains observations on different cross-sectional entities i (entity, i = 1, …, N) over time t (time, t = 1, …, T), i.e. cross-sectional time series data.
Balanced panel is when we do not have a missing observation.
Unbalanced panel is when there is a missing observation.
The problem of omitted variables occurs when a regression model leaves out 1 or more relevant
variables, which can lead to the model attributing the effect of the omitted variables to that/those
included.
We can add an extra variable:

Y_i = β0 + β1X_i + β2Z_i + u_i

If Z_i is not observable for i, then we can consider the same model for entity j:

Y_j = β0 + β1X_j + β2Z_j + u_j

Z_j is also not observable for entity j, but we can make a very strong assumption that i and j are identical in Z. The difference between i and j is then Y_i − Y_j = β1(X_i − X_j) + (u_i − u_j). Then we can simply run an OLS regression and get a correct estimate for β1.
The problem with this approach is finding two entities that are identical: (1) it is very likely that i and j differ in terms of Z, and (2) self-selection (e.g. choosing twins in medical studies) can lead to the model only being able to explain that particular case (e.g. family structure), i.e. a lack of generalizability.

Panel Data with 2 Periods


The best entity that is identical to i is entity i itself, and therefore, we focus on adding the time component instead. With 2 periods, we have:

Y_i,t = β0 + β1X_i,t + β2Z_i + u_i,t
Y_i,t+1 = β0 + β1X_i,t+1 + β2Z_i + u_i,t+1

With a second time period added, the strong assumption lies in time not playing any role in Z (Z_i,t = Z_i,t+1 = Z_i). Again, we take the difference between the 2 models:

Y_i,t+1 − Y_i,t = β1(X_i,t+1 − X_i,t) + (u_i,t+1 − u_i,t)

The unobservable variable, Z, again disappears, allowing us to estimate β1, which is now BLUE.
There are 2 issues with this approach:
1. We have to convince people that Z is constant over time, i.e. Z_i,t = Z_i,t+1 = Z_i. This is done by drawing on evidence collected by other researchers.

2. We cannot say anything about Z itself.

Fixed Effects Regressions


When there are more than 2 time periods, T > 2, the regression model will be:

Y_i,t = β0 + β1X_i,t + β2Z_i + u_i,t

Since β2Z_i is constant over time, it is treated as a constant, and the model can be rewritten:

Y_i,t = (β0 + β2Z_i) + β1X_i,t + u_i,t

Y_i,t = α_i + β1X_i,t + u_i,t

Where α_i = β0 + β2Z_i is the fixed effect. Thus, when we have more than 2 time periods, the dependent variable for the entity and time components can be written as a function of X_i,t plus a set of dummies which are equal to 1 if the entity is i and 0 otherwise.
The graphical presentation of fixed effects regression is parallel shifts of the regression line depending on α_i.

Again, we have a BLUE estimate for β1, as the entity dummies absorb everything in the equation that is constant within an entity.
Be careful when considering this regression, as it is prone to the dummy trap (perfect multicollinearity) - we have to include only n − 1 dummies. The dummy that is excluded is called the base (reference) category, i.e. the coefficient on an included dummy is the value of being entity i relative to the base.

An alternative is the entity-demeaned ("within") regression, which takes the average over time for each entity and then takes the difference between the original panel regression and this averaged regression:

Y_i,t − Ȳ_i = β1(X_i,t − X̄_i) + (u_i,t − ū_i)

Both approaches give the same β̂ and t-statistic. However, when n is very large, the dummy-variable fixed effects regression is not very convenient for software, as it has to estimate a very large number of dummies.

Regression with Time Fixed Effects


If there is a variable which has the same effect on all entities but differs over time, we can add it to the model:

Y_i,t = β0 + β1X_i,t + β2Z_i + β3S_t + u_i,t

A regression like the one above, with both entity fixed and time fixed effects, can be written as:

Y_i,t = β1X_i,t + α_i + λ_t + u_i,t

Note that β0 is excluded as it is part of α_i. Because of the dummy trap, the number of dummies estimated for α_i is n − 1, and for λ_t it is T − 1.

Clustered SE: The OLS fixed effects estimator β̂ is unbiased, consistent, and asymptotically normally distributed. However, the usual OLS standard errors (both homoscedasticity-only and heteroscedasticity-robust) will in general be wrong because they assume that u_i,t are serially uncorrelated. In practice, the OLS SEs often understate the true sampling uncertainty - if u_i,t is correlated over time, you do not have as much information (as much random variation) as you would if u_i,t were uncorrelated. This problem is solved by using clustered standard errors, e.g. for the entity-averaged outcome:

Clustered SE of Ȳ = √(s²_Ȳi / n)

Where

s²_Ȳi = [1/(n − 1)] Σi (Ȳ_i − Ȳ)²

i.e. the n entity averages Ȳ_i are treated as the independent observations.
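A minimal R sketch of an entity- and time-fixed-effects regression with entity-clustered standard errors, using dummy variables and sandwich::vcovCL on simulated panel data (packages lmtest and sandwich assumed installed):

library(lmtest); library(sandwich)
set.seed(1)
n <- 50; nT <- 6
p <- data.frame(id = rep(1:n, each = nT), t = rep(1:nT, times = n))
alpha  <- rnorm(n)[p$id]                       # entity fixed effects
lambda <- rnorm(nT)[p$t]                       # time fixed effects
p$x <- rnorm(n * nT) + alpha                   # regressor correlated with the entity effect
p$y <- 2 * p$x + alpha + lambda + rnorm(n * nT)
fe <- lm(y ~ x + factor(id) + factor(t), data = p)      # n-1 entity and T-1 time dummies
coeftest(fe, vcov = vcovCL(fe, cluster = p$id))["x", ]  # clustered-by-entity SE for beta1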
Lecture 10 - Instrumental Variable Regression
Chapter 12 - Instrumental Variables Regression
If 𝑋 and 𝑢 are correlated, the OLS estimator is inconsistent. This correlation can be due to various
explanations, e.g. OVB, errors in variables (measurement errors in the regressors), and
simultaneous causality. Whatever the source, the effect on 𝑌 of a unit change in 𝑋 can be estimated
using the instrumental variables estimator if there is a valid instrumental variable.
The population regression model relating the dependent variable Y and regressor X is:

Y_i = β0 + β1X_i + u_i

If X and u are correlated, the OLS estimator is biased, and therefore, IV estimation uses an instrumental variable, Z_i, to isolate the part of X_i that is uncorrelated with u_i.
Variables correlated with the error term are called endogenous variables, while variables that are uncorrelated with u are exogenous variables.
The 2 conditions for a valid instrument:
1. Instrument relevance: corr(Z_i, X_i) ≠ 0

2. Instrument exogeneity: corr(Z_i, u_i) = 0

For an instrument to be relevant, its variation is related to the variation in 𝑋 . If in addition the
instrument is exogenous, then that part of the variation of 𝑋 captured by the instrumental variable is
exogenous. Thus, an instrument that is relevant and exogenous can capture movements in 𝑋 that
are exogenous, which means that we can estimate the population coefficient 𝛽1 .

If Z satisfies the conditions of instrument relevance and exogeneity, β1 can be estimated using an IV estimator called two stage least squares (2SLS), which is calculated in 2 stages. 1st stage - decompose X into 2 components, a problematic component that may be correlated with u and a problem-free component that is uncorrelated. 2nd stage - use the problem-free component to estimate β1.
To decompose X, we use a population regression linking X and Z:

X_i = π0 + π1Z_i + v_i

π0 + π1Z_i is the part of X_i that can be predicted by Z_i, and since Z_i is exogenous, this component is uncorrelated with u_i. v_i is the problematic part that is correlated with u_i. Thus, we disregard v_i and use OLS to obtain the predicted values X̂_i = π̂0 + π̂1Z_i. We then regress Y_i on X̂_i using OLS, and the resulting estimators from this 2nd-stage regression are the 2SLS estimators β̂0_2SLS and β̂1_2SLS.

If the sample is large, the 2SLS estimator is consistent and normally distributed. When there is a single X and a single instrument Z, the 2SLS estimator is:

β̂1_2SLS = s_ZY / s_ZX

i.e. the ratio of the sample covariance between Z and Y to the sample covariance between Z and X. The 2SLS estimator is consistent because the sample covariance is a consistent estimator of the population covariance. The sampling distribution of this estimator is approximately N(β1, σ²_β̂1), where:

σ²_β̂1 = (1/n) · var[(Z_i − μ_Z)u_i] / [cov(Z_i, X_i)]²
Because 𝛽12 is normally distributed in large samples, hypothesis tests about 𝛽1 can be performed
by computing the 𝑡-statistic, and a 95% confidence interval is given by the usual equation.

12.2 The General IV Regression Model


The general IV regression has 4 types of variables: (1) the dependent variable, 𝑌, (2) problematic
endogenous regressors which are correlated with the error term, 𝑋, (3) additional regressors called
included exogenous variables denoted W, and (4) instrumental variables denoted Z.
For IV regression to be possible, the model must contain at least as many instrumental variables 𝑍s
as endogenous regressors 𝑋𝑠.
- If the number of instruments m equals the number of endogenous regressors k, i.e. m = k, the regression coefficients are said to be exactly identified

- If m > k, the regression coefficients are overidentified

- If m < k, the regression coefficients are under-identified


Under-identification means that we cannot use the IV regression model.

If 𝑊 is an effective control variable, then including it makes the instrument uncorrelated with 𝑢,
making the 2SLS estimator of the coefficient on 𝑋 consistent. If 𝑊 is correlated with 𝑢 then the
2SLS coefficient on 𝑊 is subject to OVB and does not have a causal interpretation.
When there is a single endogenous regressor, X, and some additional included exogenous variables, the model of interest is:

Y_i = β0 + β1X_i + β2W_1i + ⋯ + β_(1+r)W_ri + u_i

Where X_i might be correlated with u_i but W_1i, …, W_ri are not. The population 1st-stage 2SLS regression, which relates X to the exogenous variables W and Z and is also called the reduced form equation for X, is:

X_i = π0 + π1Z_1i + ⋯ + π_mZ_mi + π_(m+1)W_1i + ⋯ + π_(m+r)W_ri + v_i

The unknown coefficients are estimated by OLS. In the 2nd stage, we regress Y on X̂, W_1, …, W_r using OLS after replacing X by its predicted values from the 1st-stage regression.
If we are dealing with a multiple regression model instead, each of the endogenous regressors
requires its own 1st-stage regression.

When there is one included endogenous variable but multiple instruments, the condition for
instrument relevance is that at least one 𝑍 is useful for predicting 𝑋 , given 𝑊. When we have
multiple endogenous variables, we must rule out perfect multicollinearity in the 2nd-stage
population regression.

The IV regression assumptions:


- E(u_i | W_1i, …, W_ri) = 0

- (X_1i, …, X_ki, W_1i, …, W_ri, Z_1i, …, Z_mi, Y_i) are IID draws from their joint distribution

- Large outliers are unlikely: the X's, W's, Z's, and Y have nonzero finite fourth moments

- The 2 conditions for a valid instrument (relevance and exogeneity) must hold
Under the IV regression assumptions, the 2SLS estimator is consistent and normally distributed in large samples. Thus, the general procedures for statistical inference in regression models extend to 2SLS regression. However, we have to realize that the 2nd-stage OLS SEs are wrong, as they do not adjust for the use of the predicted values of the included endogenous variables (R does this adjustment automatically).
12.3 Checking Instrument Validity
(1) Instrument Relevance
The more relevant the instruments, the more variation in 𝑋 is explained by the instruments, and the
more information is available for use in IV regression. This leads to a more accurate estimator. In
addition, the more relevant the instruments, the better is the normal approximation to the sampling
distribution of the 2SLS estimator and its 𝑡-statistic.
Weak instruments mean that the normal distribution provides a poor approximation to the sampling distribution even if the sample size is large, meaning that 2SLS is no longer reliable.

If we identify weak instruments, there are a few options for how to handle it:
- If we have a small number of strong instruments and many weak ones, we can
discard the weakest instruments and use the most relevant subset for 2SLS analysis.
This can make SEs increase, but do not worry, the initial SEs were not meaningful
anyway

- If coefficients are exactly identified, we cannot discard the weak instruments. The
solution is to either (1) identify additional, stronger instruments or (2) proceed with
the weak instruments but employ a different method than 2SLS

(2) Instrument Exogeneity


If the instruments are not exogenous, the 2SLS estimator converges in probability to something
other than the population coefficient.
Assessing whether the instruments are exogenous necessarily requires making an expert judgment
based on personal knowledge of the application. If, however, there are more instruments than
endogenous regressors, then there is a statistical tool that can be helpful in this process: the so-
called test of overidentifying restrictions.
12.5 Where Do Valid Instruments Come From?
There are 2 approaches to finding instruments that are both relevant and exogenous:
1. Econometric - use economic theory to suggest instruments. The drawback is that economic
theories are abstractions that often do not take into account the nuances and details
necessary for analyzing a data set.

2. Statistical - look for exogenous sources of variation in 𝑋 arising from a random


phenomenon that induces shifts in the endogenous regressor. The drawback is that the
approach requires a lot of knowledge of the problem being studied.

Notes from Lecture


Problems with OLS Estimates and Instrumental Variable Approach
The main assumption of OLS regressions is that the expected value of the error given the X's should be equal to 0 (E(u | X) = 0). We can face 3 different kinds of problems that lead to this assumption not being satisfied:
1. Omitted variable bias - excluding a variable because it cannot be observed or we do not have
the data will lead to its effect being ascribed to the variables that are included in the model,
making them biased. Can be fixed using fixed effects regressions.

2. Measurement error - when the variable is not correctly measured, creating noise in the data. The error is often random. Suppose we include the variable X_i in our model, but the true variable is actually X_i*; then X_i = X_i* + v_i, where v_i is the measurement error. This leads to the measured X_i being correlated with the regression error, causing our results to be biased. Can be fixed using IV.

3. Simultaneity/reverse causality - Simultaneity is where the explanatory variable is jointly determined with the dependent variable, i.e. X causes Y, but Y also causes X. It is not possible to say which way the causality goes, e.g. prices and quantities of goods demanded. To fix this, we need to find an instrument which is related to price (corr(X_i, Z_i) ≠ 0) but does not have a direct impact on demand (corr(Z_i, u_i) = 0), as this allows us to keep the demand curve constant and only shift the supply curve, which lets us trace out the demand curve. The instrument must be relevant (correlation with X not equal to 0) and exogenous (it should not be part of the error term of the model).

If the correlation between the instrumental variable and the error term is 0, we also have that the covariance is 0:

corr(Z_i, u_i) = 0

cov(Z_i, u_i) = 0

cov(Z_i, Y_i − β0 − β1X_i) = 0

cov(Z_i, Y_i) − cov(Z_i, β0) − β1·cov(Z_i, X_i) = 0

cov(Z_i, Y_i) − β1·cov(Z_i, X_i) = 0

Only one particular β1 will be able to satisfy this. Therefore, we have that:

β1 = cov(Z_i, Y_i) / cov(Z_i, X_i)

In matrix notation:

β1 = (Z′X)⁻¹Z′Y

2-Stage Least Squares


Our regression is:

Y_i = β0 + β1X_i + β2w_i + u_1i

Where w_i denotes other controls. If we have another variable, Z_i, which is related to X_i but excluded from the model, then we can use 2SLS.
2 steps:
1. Run an OLS regression with X as the dependent variable to instrument it:

X_i = δ0 + δ1Z_i + δ2w_i + v_i

If Z is relevant, we should expect δ1 to be statistically different from 0, which means that the instrument is important in explaining X_i. From this OLS regression, we then compute the prediction of X_i, i.e. X̂_i.

2. Estimate the following regression:

Y_i = α + β1X̂_i + β2w_i + ε_i

This is the model we started out with, except we include X̂_i instead of X_i. The estimate of β1 from this regression is then the 2SLS estimate of β1.

Warning: If we run this regression using 2SLS by hand or using software, we are going to
find that 𝛽1 is going to be identical but 𝑆𝐸 𝛽1 is going to differ. Software provides the
correct SE as it adjusts it while hand-calculations do not.
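A minimal R sketch of these 2 steps on simulated data, compared with AER::ivreg (package assumed installed), which reports the correctly adjusted SEs:

library(AER)
set.seed(1)
n <- 1000
w <- rnorm(n); z <- rnorm(n); v <- rnorm(n); u <- rnorm(n)
x <- 1 + 0.8 * z + 0.5 * w + v + 0.5 * u      # endogenous: x is correlated with u
y <- 2 + 1.5 * x + 0.3 * w + u
step1 <- lm(x ~ z + w)                         # 1st stage
xhat  <- fitted(step1)
step2 <- lm(y ~ xhat + w)                      # 2nd stage: right coefficient, wrong SE
iv    <- ivreg(y ~ x + w | z + w)              # 2SLS with correct SEs
cbind(by_hand = coef(step2)["xhat"], ivreg = coef(iv)["x"])
summary(iv, diagnostics = TRUE)                # includes a weak-instrument F test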

Important things to check:


- What is cov(X, Z)? If it is approximately 0 and the variables are not strongly related (making Z not very relevant), then we get very biased and bad estimates. The way to check this is to take the regression from the 1st step of 2SLS and compute the F-test on the instrument(s). The rule of thumb is that if F > 10, the instrument is relevant, as it tells us that δ1 is different from 0. If not, the instrument is weak.

- If we have m instruments, we can theoretically end up in 3 different situations (k = number of regressors we want to instrument):

(1) exact identification: m = k - this is what we have already done; we do not have to check for anything else

(2) over-identification: m > k - test for over-identification

(3) under-identification: m < k - we cannot have this case, otherwise the IV estimator does not exist.

- Do not forget the economic theory, this also needs to support our argument and
model.

Test for over-identification:

1. Compute the 2SLS residuals û_i = Y_i − Ŷ_i_2SLS (software will generate these)

2. Run a regression of û_2SLS on the Z's and w

3. Compute the F-test on the instruments and the J-statistic, J = mF ~ χ²_(m−k). The larger the J, the stronger the evidence against the null hypothesis that all instruments are exogenous.

The number of instruments needs to be at least as large as the number of potentially endogenous variables.

Lecture 11 - Experiments and Quasi-Experiment


Chapter 13 - Experiments and Quasi-Experiments
Potential outcome: the outcome for an individual under a potential treatment.
Causal effect: difference in the potential outcome if the treatment is received and the potential
outcome if it is not. This effect can differ from 1 individual to the next, and we therefore focus on
the average causal effect.
The causal effect is estimated by using an ideal randomized controlled experiment. The average
causal effect on Y of the treatment is the difference between the conditional expectations, E(Y_i | X_i = 1) − E(Y_i | X_i = 0).

The difference estimator is the difference in the sample averages for the treatment and control
groups, which is computed by regressing the outcome variable 𝑌 on a binary treatment indicator 𝑋:
Y_i = β0 + β1X_i + u_i
The efficiency of the difference estimator can often be improved by including some control
variables 𝑊 in the regression. The differences estimator with additional regressors is:
Y_i = β0 + β1X_i + β2W_1i + ⋯ + β_(1+r)W_ri + u_i

If W helps explain the variation in Y, then including W reduces the SE of the regression and often also the SE of the coefficient β̂1. For β̂1 to be unbiased, the control variables W must be such that u_i satisfies conditional mean independence, i.e. E(u_i | X_i, W_i) = E(u_i | W_i) - this holds when the W's are pretreatment individual characteristics (such as gender) and X is randomly assigned.
The coefficient on W does not have a causal interpretation.
Randomization based on covariates: randomization in which the probability of assignment to the treatment group depends on one or more observable variables W. This ensures that the OLS estimator is unbiased, as X is assigned randomly given W.

13.2 Threats to Validity of Experiments


A statistical study is internally valid if the statistical inferences about causal effects are valid for the
population being studied; it is externally valid if its inferences and conclusions can be generalized
from the population and setting studied to other populations and settings.
Threats to the internal validity of randomized controlled experiments include failure to randomize,
failure to follow the treatment protocol, attrition, experimental effects, and small sample sizes.
Threats to external validity compromise the ability to generalize the results of the study to other
populations and settings. Two such threats are when the experimental sample is not representative
of the population of interest and when the treatment being studied is not representative of the
treatment that would be implemented more broadly.
When determining whether the estimated effect of X is large or small in a practical sense, there are
2 ways to do so:
1. Translating the estimated changes in 𝑌 into units of SDs of 𝑌 so that the estimates are
comparable across 𝑋 .

2. By comparing the estimated effect of 1 coefficient to other coefficients.

13.4 Quasi Experiments


Quasi-experiment (natural experiment): experiment where randomness is introduced by variations in individual circumstances that make it appear as if the treatment is randomly assigned. There are 2 types of quasi-experiments: (1) whether an individual receives treatment is viewed as if it is randomly determined, in which case the causal effect can be estimated by OLS using the treatment X_it as a regressor; and (2) the "as if" randomness only partially determines the treatment, in which case the causal effect can be estimated by IV regression, with the source of the "as if" random variation being the IV.
If the treatment in a quasi-experiment is "as if" randomly assigned, conditional on some observable variables W, then the treatment effect can be estimated using the differences regression. Some differences might remain, though, and therefore we use the differences-in-differences estimator, which is the average change in Y for those in the treatment group minus the average change in Y for those in the control group:

β̂1_DiD = ΔȲ_treatment − ΔȲ_control

If the treatment is randomly assigned, then β̂1_DiD is an unbiased and consistent estimator of the causal effect. It can also be estimated as β1 in the following regression:

ΔY_i = β0 + β1X_i + u_i
By focusing on the change in 𝑌 over the course of the experiment, the DiD estimator removes the
influence of initial values of 𝑌 that vary between the treatment and control groups.

If the quasi-experiment yields a variable Z_i that influences receipt of treatment, if data are available both on Z_i and on the treatment actually received (X_i), and if Z_i is "as if" randomly assigned, then Z_i is a valid instrument for X_i and the coefficients can be estimated using 2SLS.

Sharp regression discontinuity designs: receipt of treatment is entirely determined by whether W exceeds a given threshold w0. There will be a jump in Y at the threshold, and the jump equals the average treatment effect for the subpopulation with W_i = w0 (which might be a useful approximation to the average treatment effect in the population).
Fuzzy regression discontinuity designs: here, crossing the threshold influences receipt of the treatment but is not its sole determinant. In a design like this, X_i will be correlated with u_i. If, however, any special effect of crossing the threshold operates solely by increasing the probability of treatment - that is, if the direct effect of crossing the threshold is captured by the linear term in W - then an instrumental variables approach is available. Specifically, the binary variable Z_i, which indicates crossing the threshold, influences receipt of treatment but is uncorrelated with u_i, so it is a valid instrument for X_i.

13.5 Potential Problems with Quasi-Experiments


Generally, the issues of quasi-experiments resemble those of experiments, there are only minor
modifications.

13.6 Experimental and Quasi-Experimental Estimates in Heterogeneous Populations


If we have a heterogeneous population, there is unobserved variation in the causal effect, meaning that the ith individual now has its own causal effect, β_1i, which is the difference in that individual's potential outcomes if the treatment is or is not received. This leads to a model that looks like:

Y_i = β0 + β_1iX_i + u_i

If X_i is randomly assigned, the OLS estimator is consistent for the average causal effect in the population:

plim β̂1 = cov(β0 + β_1iX_i + u_i, X_i) / σ²_X = E(β_1i)
However, this is generally not true for the IV estimator. If X_i is partially influenced by Z_i, then the IV estimator using the instrument Z_i estimates a weighted average of the causal effects, where those for whom the instrument is most influential receive the most weight. If there is heterogeneity in the effect of Z_i on X_i, the 1st-stage 2SLS equation is modified to allow the effect on X_i of a change in Z_i to vary from individual to individual:

X_i = π0 + π_1iZ_i + v_i

Where the coefficients vary from individual to individual. The probability limit of the 2nd-stage coefficient then becomes:

plim β̂1_2SLS = E(β_1iπ_1i) / E(π_1i)
Thus, the 2SLS estimator is a consistent estimator of a weighted average of the individual causal
effects, where the individuals who receive the most weight are those for whom the instrument is
most influential. This weighted average causal effect estimated by 2SLS is called the local average
treatment effect (LATE).
There are 3 cases where LATE equals the average treatment effect:
1. The treatment effect is the same for all individuals, i.e. β_1i = β1

2. The instrument affects each individual equally, i.e. π_1i = π1

3. The heterogeneity in the treatment effect and the heterogeneity in the effect of the instrument are uncorrelated, i.e. β_1i and π_1i are random but cov(β_1i, π_1i) = 0
Notes from Lecture
Experiments: consciously made or designed in order to understand some events in nature (test a
hypothesis)
Quasi-Experiments: resemble experiments in the sense that the same type of hypothesis is tested as in experiments, but quasi-experiments make use of historical data instead to estimate the importance of certain variables
Program Evaluation: experiment on a smaller scale where we want to study a very small aspect of a particular event that happened in society, most often policy evaluation

Randomization is very important in experiments. A certain portion of the population should receive the treatment, X. As a dummy variable, the treatment group has X = 1. Due to this randomization, we do not have to control for any characteristics other than the treatment X, and therefore, our regression model will simply be:

Y_i = β0 + β1X_i + u_i
At the end of the experiment, if the individual is treated, the expected outcome will be:

Ȳ_treated = β0 + β1

For the control group:

Ȳ_control = β0

We can then find the effectiveness of the treatment by taking the difference between the two:

Ȳ_treated − Ȳ_control = β1

If β̂1 is significant, it means that the treatment works.

Problems with experiments:


- Internal threats

(1) Failure to randomize: When selecting individuals, we do not get a random distribution, which means that the treatment is not the only variable we have to control for. We can test this by running a regression of the treatment on a set of characteristics of the individuals,

X_i = γ0 + γ1w_1 + ⋯ + γ_r w_r + u_i

and then running an F-test to see whether all the coefficients are 0 (randomization), because if they are, no characteristic can explain whether you are selected for treatment or not (see the R sketch after this list). A partial solution if randomization is not perfect (the F-test rejects) is to run the regression including the characteristic we also have to control for, e.g.:

Y_i = β0 + β1X_i + β2w_1 + u_i

This is not preferable, as we do not get as solid a conclusion.

(2) Failure to follow treatment protocol: Individuals do not follow the instructions in
the protocol, which leads to a problem of selection, causing results to be biased.

(3) Attrition: Individuals drop out of sample.

(4) Change of behavior of participants: Telling individuals that they are in the
treatment group might lead to them changing their behavior. The solution is to not
tell individuals which group they are in.

(5) Small sample: Not sure results can be extended

- External threats

(1) Non-representative people

(2) Non-representative treatment

(3) General equilibrium problem
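A minimal R sketch of the randomization check mentioned under internal threat (1): regress the treatment dummy on pre-treatment characteristics and test that all slopes are zero (simulated data, made-up variable names):

set.seed(1)
n <- 500
d <- data.frame(w1 = rnorm(n), w2 = rnorm(n))
d$X <- rbinom(n, 1, 0.5)                  # treatment assigned independently of w1, w2
check <- lm(X ~ w1 + w2, data = d)
summary(check)$fstatistic                 # joint F on w1 and w2; should be insignificant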

Differences-in-Differences (DID)
DID is an approach often taken to quasi-experiments. In contrast to experiments, where we are only
concerned with X and Y, quasi-experiments include an extra variable, Z, which is an event. We
need to distinguish between 2 dimensions: (1) intertemporal - before and after the event, and (2)
whether an individual is in the treatment or the control group. We thus need to consider 4
regressions:
(1) Before the event, treatment group:

Y_i_T,before = β0 + β2w_i_T,before + β3γ_i_T,before + u_i

Where w is an observable characteristic and γ is unobservable.
(2) After the event, treatment group:

Y_i_T,after = β0 + β1X_i + β2w_i_T,after + β3γ_i_T,after + u_i

Here we have added the treatment term.
(3) Before the event, control group:

Y_i_C,before = β0 + β2w_i_C,before + β3γ_i_C,before + u_i

(4) After the event, control group:

Y_i_C,after = β0 + β2w_i_C,after + β3γ_i_C,after + u_i

Since the control group is not treated, the regression does not change.

To assess the impact of the treatment, we determine:

Y_i_T,after − Y_i_T,before = β1X_i + β2(w_i_T,after − w_i_T,before) + β3(γ_i_T,after − γ_i_T,before) + Δu_i_T

Y_i_C,after − Y_i_C,before = β2(w_i_C,after − w_i_C,before) + β3(γ_i_C,after − γ_i_C,before) + Δu_i_C

Everything but the gammas can be observed, so if we assume that the change in the unobservables over time is identical for treated and control, we can define ΔY_i = Y_i_after − Y_i_before and run the regression:

ΔY_i = β0 + β1X_i + u_i
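A minimal R sketch of the DID estimator on simulated two-period data; the coefficient on the treated-by-post interaction in the pooled regression equals the regression of the change in Y on the treatment dummy:

set.seed(1)
n <- 400
treat <- rep(0:1, each = n / 2)
y_before <- 1 + 0.5 * treat + rnorm(n)              # groups may start at different levels
y_after  <- y_before + 0.3 + 2 * treat + rnorm(n)   # common trend 0.3, treatment effect 2
dd <- data.frame(y = c(y_before, y_after),
                 treat = rep(treat, 2),
                 post  = rep(0:1, each = n))
coef(lm(y ~ treat * post, data = dd))["treat:post"]  # DID estimate of the treatment effect
coef(lm(I(y_after - y_before) ~ treat))["treat"]     # same estimate from the change in Y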

Lecture 12 - Big Data


Notes from Lecture
At least 1 of these 3 characteristics must be satisfied for a dataset to be qualified as big data:
1. millions of observations
2. thousands of variables
3. Non-standard data
To obtain millions of observations, we often have to make use of registered data, e.g. Statistik Danmark, scraping data from websites, or data from firms.
Thousands of variables can be created in datasets by including interactions and polynomials, or by including categorical variables. Make sure they are relevant.
Non-standard data can be text or pictures, which are then translated into numbers, e.g. facial analysis.
Many Predictor Problem
With big data, there can be a computational problem: the software computes β̂ = (X′X)⁻¹X′Y (or, with a single regressor, the covariance of X and Y divided by the variance of X), which gets very demanding when we have a lot of data.
There is also a problem of interpretation: with big data it is not feasible to hand-pick which variables matter via hypothesis tests. Instead, we must ask the software to tell us whether a variable is important or not. Thus, with big data, we cannot claim anything about causality.

Estimation sample: a subset of n observations

Out-of-sample (OOS) observations: the remaining observations (the total number of observations minus n)
The regression is then:

Y_i = β1X_1i + β2X_2i + ⋯ + βkX_ki + u_i

When performing a regression using big data, we standardize the variables to ensure a clear interpretation of the coefficients.
We want to minimize the prediction error. The mean squared prediction error (MSPE) is equal to:

MSPE = E[(Y_OOS − Ŷ(X_OOS))²]

We want to make the MSPE as small as possible (the best achievable value is that of the oracle forecast). We proceed by taking the following steps:
1. Model: Y_i = β1X_1i + β2X_2i + ⋯ + βkX_ki + u_i

2. Prediction: Ŷ = β̂1X_1 + β̂2X_2 + ⋯ + β̂kX_k

3. Take the difference between the 2 to find the prediction error:

Y − Ŷ = u − [(β̂1 − β1)X_1 + (β̂2 − β2)X_2 + ⋯ + (β̂k − βk)X_k]

The Principle of Shrinkage


The u_OOS cannot be changed even in an oracle forecast, and E[(u_OOS)²] = σ²_u. This leads to an MSPE of:

MSPE = σ²_u + E[((β̂1 − β1)X_1_OOS + ⋯ + (β̂k − βk)X_k_OOS)²]

This cannot be made small simply by running an OLS regression, as MSPE_OLS ≈ (1 + k/n)σ²_u, meaning that as k/n increases, the MSPE increases.

The principle of shrinkage: we run an OLS regression, but instead of using β̂ we use the James-Stein estimator β̂_JS = c·β̂, with 0 < c < 1. This makes β̂_JS smaller, and this deflation is needed for big datasets to correct for the MSPE increasing in k/n.

As c gets smaller, the squared bias of the estimator increases but the variance decreases. This produces a bias-variance tradeoff: if k is large, the benefit of the smaller variance can outweigh the cost of the larger bias for the right choice of c, thus reducing the MSPE.
Split-sample estimation of the MSPE:
1. Estimate the model using half of the estimation sample
2. Use the estimated model to predict Y for the other half of the data, called the test (reserve) sample, and calculate the prediction errors
3. Estimate the MSPE using the prediction errors for the test sample:

MSPE_split-sample = (1/n_test) Σ (Y_i − Ŷ_i)²
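A minimal R sketch of the split-sample MSPE estimate on simulated data with many regressors:

set.seed(1)
n <- 400; k <- 50
X <- matrix(rnorm(n * k), n, k)
y <- drop(X %*% c(rep(1, 5), rep(0, k - 5))) + rnorm(n)  # only 5 predictors matter
half  <- sample(n, n / 2)
train <- data.frame(y = y[half],  X[half, ])
test  <- data.frame(y = y[-half], X[-half, ])
fit   <- lm(y ~ ., data = train)              # estimate on half of the sample
pred  <- predict(fit, newdata = test)         # predict the reserved (test) half
mean((test$y - pred)^2)                       # split-sample estimate of the MSPE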
The Ridge Regression
The ridge regression estimator shrinks the estimates towards 0 by penalizing large squared values of the coefficients. The ridge regression estimator minimizes the penalized sum of squared residuals:

S_Ridge(b; λ) = Σ_i (Y_i − b_1X_1i − ⋯ − b_kX_ki)² + λ Σ_j b_j²

Where the penalty parameter λ is chosen from a set of candidate values (e.g. using split-sample estimates of the MSPE).

Thus, the penalized sum of squared residuals is minimized at a smaller value of b than the unpenalized SSR.
The Lasso
The Lasso estimator shrinks the estimates towards 0 by penalizing large absolute values of the coefficients. The Lasso estimator minimizes the penalized sum of squared residuals:

S_Lasso(b; λ) = Σ_i (Y_i − b_1X_1i − ⋯ − b_kX_ki)² + λ Σ_j |b_j|

This looks like the Ridge estimator, but it turns out to have very different properties: it is not invariant to linear transformations. The Lasso estimator works especially well when in reality many of the predictors are irrelevant, as it sets many of the β̂'s exactly equal to 0.
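A minimal R sketch of ridge and Lasso using the glmnet package (assumed installed), reusing the simulated many-predictor data idea above; glmnet standardizes the predictors and cv.glmnet picks the penalty by cross-validation:

library(glmnet)
set.seed(1)
n <- 400; k <- 50
X <- matrix(rnorm(n * k), n, k)
y <- drop(X %*% c(rep(1, 5), rep(0, k - 5))) + rnorm(n)
ridge <- cv.glmnet(X, y, alpha = 0)      # alpha = 0: ridge penalty on squared coefficients
lasso <- cv.glmnet(X, y, alpha = 1)      # alpha = 1: Lasso penalty on absolute values
sum(coef(ridge, s = "lambda.min") != 0)  # ridge shrinks but keeps all coefficients
sum(coef(lasso, s = "lambda.min") != 0)  # Lasso sets many coefficients exactly to 0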
Principal Components
Ridge and Lasso reduce the MSPE by shrinking (biasing) the estimated coefficients towards 0 and, in the case of Lasso, by eliminating many of the regressors entirely.
Principal components regression instead collapses the very many predictors into a much smaller number p ≪ k of linear combinations of the predictors. These linear combinations, called the principal components of X, are computed so that they capture as much of the variation in the original X's as possible. Because the number p of principal components is small, OLS can be used, with the principal components as the new regressors.
Principal components can be thought of as data compression, so that the compressed data have fewer regressors with as little information loss as possible.
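A minimal R sketch of principal components regression in base R: compute the principal components of X with prcomp and run OLS on the first p of them:

set.seed(1)
n <- 400; k <- 50
X <- matrix(rnorm(n * k), n, k)
y <- drop(X %*% c(rep(1, 5), rep(0, k - 5))) + rnorm(n)
pc <- prcomp(X, scale. = TRUE)           # principal components of the standardized X's
p  <- 10                                 # keep a small number p << k of components
pcr <- lm(y ~ pc$x[, 1:p])               # OLS with the components as regressors
summary(pcr)$r.squared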

Lecture 13 - Time Series Introduction


Chapter 14 - Introduction to Time Series Regression and Forecasting
The value of Y in the previous period is called its first lagged value or its first lag, and it is denoted Y_t−1. The change in the value of Y between period t − 1 and period t is Y_t − Y_t−1, and it is called the first difference of the variable Y_t.
Time series data is often analyzed after computing their logarithms or the changes in their
logarithms, particularly because many economic series exhibit growth that is potentially
exponential, implying that the logarithm of the time series grows linearly.
Autocorrelation: also called serial correlation, it is the correlation of a series with its own lagged values. The 1st autocorrelation coefficient is the correlation between Y_t and Y_t−1, the 2nd autocorrelation is the correlation between Y_t and Y_t−2, and so on.
The jth population autocovariances and autocorrelations can be estimated by their sample counterparts:

cov-hat(Y_t, Y_t−j) = (1/T) Σ_(t=j+1)^T (Y_t − Ȳ_(j+1):T)(Y_t−j − Ȳ_1:(T−j))

ρ̂_j = cov-hat(Y_t, Y_t−j) / var-hat(Y_t)

Where Ȳ_(j+1):T denotes the sample average of Y_t computed using the observations t = j + 1, …, T.

14.3 Autoregressions
AR models relate a time series variable to its past values. The first-order autoregression (AR(1)) is estimated by regressing the next period's value on the current value using data from multiple periods. It is called a first-order autoregression because it is a regression of the series onto its own lag, and only 1 lag is used:

AR(1): Y_t = β0 + β1Y_t−1 + u_t

OLS is used to estimate the coefficients.
Forecast error: the mistake made by the forecast, i.e. the difference between the value of Y_T+1 that actually occurred and its forecasted value based on Y_T:

Forecast error = Y_T+1 − Ŷ_T+1|T

** The forecast is not an OLS predicted value, and the forecast error is not an OLS residual. Forecasts pertain to out-of-sample observations, whereas OLS predicted values and residuals pertain to in-sample (within-sample) observations.

Root mean squared forecast error (RMSFE): a measure of the size of the forecast error, i.e. the magnitude of a typical mistake made using a forecasting model. It contains 2 sources of error: (1) the error arising because future values of u_t are unknown and (2) the error in estimating the coefficients β0 and β1.

RMSFE = √E[(Y_T+1 − Ŷ_T+1|T)²]

If the 1st source of error is much larger than the second, which can happen if the sample size is very large, then RMSFE ≈ √var(u_t), i.e. the SD of the error term.

The assumption that the conditional expectation of 𝑢 is 0 given past values of 𝑌 has 2 important
implications:
1. The best forecast of 𝑌 1 based on its entire history depends on only the most recent 𝑝
values.

2. The errors 𝑢 are serially uncorrelated.

14.4 Time Series Regression with Additional Predictors and the Autoregressive Distributed Lag
Model
When other variables and their lags are added to an AR, the result is autoregressive distributed lag
(ADL) model.
ADL: lagged values of the dependent variable are included as regressors, but the regression also
includes multiple lags of an additional predictor.
The ADL with p lags of Y and q lags of X is denoted ADL(p, q):

Y_t = β0 + β1Y_t−1 + β2Y_t−2 + ⋯ + βpY_t−p + δ1X_t−1 + δ2X_t−2 + ⋯ + δqX_t−q + u_t

In general, forecasts can be improved by using multiple predictors.

The idea that historical relationships can be generalized to the future, the principle which
forecasting relies on, is formalized by the concept of stationarity.
Stationarity: the probability distribution of the time series variable does not change over time, i.e. the joint distribution of (Y_s+1, Y_s+2, …, Y_s+T) does not depend on s, regardless of the value of T. Otherwise, Y_t is said to be non-stationary.

Granger causality tests: The Granger causality statistic is the F-statistic that tests the hypothesis that the coefficients on all the lags of one of the variables in the time series regression are 0. This would imply that these regressors have no predictive content for Y_t beyond that contained in the other regressors. Rejecting H0 in a Granger test means that the past values of that variable appear to contain information that is useful for forecasting Y_t beyond that contained in past values of Y_t.

The RMSFE for a time series regression with multiple predictors can also be written as:

RMSFE = √(σ²_u + var[(β̂0 − β0) + (β̂1 − β1)Y_T + (δ̂1 − δ1)X_T])

The RMSFE can be used to construct a forecast interval. The interval is given by Ŷ_T+1|T ± 1.96·SE(Y_T+1 − Ŷ_T+1|T), where the SE is an estimator of the RMSFE. However, because of the uncertainty about future events, the forecast intervals are sometimes so wide that they have limited use in decision making - thus, some make the intervals narrower by decreasing the confidence level.
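A minimal R sketch of estimating an AR(1) by OLS on a simulated series and forming a one-step-ahead forecast with a rough 95% forecast interval (the SER is used as a simple stand-in for the RMSFE):

set.seed(1)
Tobs <- 200
y <- numeric(Tobs)
for (t in 2:Tobs) y[t] <- 1 + 0.6 * y[t - 1] + rnorm(1)
ar1 <- lm(y[2:Tobs] ~ y[1:(Tobs - 1)])                # AR(1) by OLS
b   <- coef(ar1)
fc  <- unname(b[1] + b[2] * y[Tobs])                  # forecast of Y_(T+1) given Y_T
ser <- summary(ar1)$sigma
c(forecast = fc, lower = fc - 1.96 * ser, upper = fc + 1.96 * ser)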

14.5 Lag Length Selection Using Information Criteria


Choosing the order 𝑝 of an AR requires balancing the marginal benefit of including more lags
against the marginal cost of additional estimation uncertainty. There are multiple approaches on
how to balance these 2 things:
- The 𝐹-statistic approach: start with a model with too many lags and perform
hypothesis tests on the final lag, then drop that lag if it is not significant at the 5%
level. Drawback: Ends up producing a model that is too large sometimes due to the
5% probability of rejecting incorrectly.

- Bayes Information Criterion (BIC): also called the Schwarz information criterion (SIC)

BIC(p) = ln[SSR(p)/T] + (p + 1)·ln(T)/T

If a regression has K coefficients (a regression with multiple predictors), then the BIC is computed in the following way:

BIC(K) = ln[SSR(K)/T] + K·ln(T)/T

The BIC estimator of p, p̂, is the value that minimizes BIC(p). Given the equation, we can see that the BIC helps decide precisely how large the increase in R² must be to justify including the additional lag - due to the link between the SSR and R².

- Akaike Information Criterion (AIC): similar to the BIC, but with a smaller penalty term

AIC(p) = ln[SSR(p)/T] + (p + 1)·2/T

Again there is a trade-off between fit (the SSR term) and the penalty for additional lags. The AIC estimator of p is inconsistent because the 2nd term is too small - even in large samples, the AIC tends to overestimate p. It is still used often, as some worry that the BIC gives a model with too few lags.
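A minimal R sketch of lag-length selection for an AR(p) using the BIC and AIC formulas above, on a simulated AR(2) series (lags built with embed()):

set.seed(1)
y <- arima.sim(list(ar = c(0.5, 0.3)), n = 500)
ic <- sapply(1:6, function(p) {
  Z   <- embed(y, p + 1)                   # column 1 is Y_t, columns 2..p+1 are its lags
  fit <- lm(Z[, 1] ~ Z[, -1])
  ssr <- sum(residuals(fit)^2)
  Tp  <- nrow(Z)
  c(BIC = log(ssr / Tp) + (p + 1) * log(Tp) / Tp,
    AIC = log(ssr / Tp) + (p + 1) * 2 / Tp)
})
which.min(ic["BIC", ])   # p chosen by the BIC
which.min(ic["AIC", ])   # p chosen by the AIC (tends to be larger)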

14.6 Non-Stationarity I: Trends


The problem caused by non-stationarity and its solution depends on the nature of that non-
stationarity. The 2 most important types of non-stationarity are trends and breaks.

Trend: A persistent long-term movement of a variable over time. A time series variable fluctuates
around its trend. There are 2 types of trends in time series data:
1. Deterministic - nonrandom function of time, e.g. a trend that is linear in time

2. Stochastic - random and varies over time


Often, we model trends as stochastic. The simplest model of a stochastic trend is the random walk model. Y_t is said to follow a random walk if the change in Y_t is IID:

Y_t = Y_t−1 + u_t

Where u_t is IID. The basic idea is that the value of the series tomorrow is its value today plus an unpredictable change. Because the change is unpredictable, the best forecast of the value tomorrow is the value today. The random walk can have a tendency to move in a certain direction, which we then refer to as a random walk with drift.
The variance of a random walk increases over time, so the distribution of 𝑌 must also change over
time, which is what makes it non-stationary - because the variance of 𝑌 depends on 𝑡, its
distribution depends on 𝑡, i.e. it is non-stationary.
The variance of a random walk increases without bounds; thus, the population autocorrelations are
not defined. However, its sample autocorrelations tend to be very close to 1 - the 𝑗th autocorrelation
of a random walk converges to 1 in probability.
The random walk model is thus a special case of the AR(1) model in which β1 = 1. If |β1| < 1 and u_t is stationary, then the joint distribution of Y_t and its lags does not depend on t, so Y_t is stationary.
If an AR(p) has a root that equals 1, the series is said to have a unit autoregressive root (a unit root). If Y_t has a unit root, then it contains a stochastic trend.

Problems caused by stochastic trends leading to OLS estimators and 𝑡-statistics perhaps having
non-normal distributions even in large samples:
- The estimator of the autoregressive coefficient in the AR(1) is biased toward 0 if its true value is
1, as the asymptotic distribution of β̂1 is shifted toward 0: E(β̂1) ≈ 1 − 5.3/T. As a forecasting
model, the estimated AR(1) can thus perform worse than a random walk model that simply
imposes β1 = 1.

- The t-statistic on a regressor with a stochastic trend can have a nonnormal distribution even
in large samples. This distribution is not readily tabulated in general; one case in which it
can be tabulated is an AR with a unit root (the Dickey-Fuller distribution).

- Spurious regression: 2 series that are independent will with high probability misleadingly
appear to be related if they both have stochastic trends. A special case where certain
regression-based methods are still reliable is when the series are cointegrated, i.e. they
contain a common stochastic trend.

An informal way to detect stochastic trends is to look at the first autocorrelation coefficient. In large
samples, a small first autocorrelation coefficient combined with a time series plot that has no
apparent trend suggests that the series does not have a trend.
A formal test for stochastic trends is the Dickey-Fuller test. The starting point of this test is the AR
model. In an AR(1) model, we know that if β1 = 1, then Y_t is nonstationary and contains a
stochastic trend. Thus, for the AR(1) model Y_t = β0 + β1·Y_{t-1} + u_t, the Dickey-Fuller hypotheses are:
H0: β1 = 1
H1: β1 < 1
The alternative hypothesis is that the series is stationary. The test is most easily implemented by
estimating a modified version of the AR model constructed by subtracting Y_{t-1} from both sides:
H0: δ = 0
H1: δ < 0
in the model ΔY_t = β0 + δ·Y_{t-1} + u_t, where δ = β1 − 1.
The t-statistic used is called the Dickey-Fuller statistic, which is computed using non-robust SEs.
The Dickey-Fuller test in the AR(p) model is presented in Key Concept 14.8 on page 605.

A commonly used alternative hypothesis is 𝐻1 : the series is stationary around a deterministic trend.
Using a hypothesis like this should be motivated by economic theory.

Under the 𝐻0 of a unit root, the augmented Dickey-Fuller (ADF) statistic does not have a normal
distribution. The critical values for the ADF test are given in table 14.4. The ADF test is one-sided.

The best way to handle a trend in a series is to transform it so it no longer has a trend. If the series
has a unit root, then the first difference of the series does not have a trend.
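In R, a unit root can be tested with, for example, adf.test from the tseries package (a sketch, assuming y is the series of interest; the lag length is chosen automatically unless k is set):

library(tseries)
adf.test(y)          # H0: unit root; a small p-value rejects in favour of stationarity
adf.test(diff(y))    # first-differencing usually removes a single unit root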
14.7 Non-Stationarity II: Breaks
Break: When the population regression function changes over the course of the sample. Can arise
either from a discrete change in the population regression coefficients at a distinct date or from a
gradual evolution of the coefficients over a longer period of time.
The issue with breaks is that OLS regressions estimate relationships that hold on average - if there
is a break, then 2 periods are combined, meaning that the average is not really true for either period,
leading to poor forecasts.

Chow test: If you suspect that there is a break in the series, you can test for a break at a known date.
The 𝐻0 of no break can be tested using a binary variable interaction regression - consider an
ADL(1,1) model where we let τ denote the hypothesized break date and D_t(τ) a binary variable that
equals 0 before the break date and 1 after:
Y_t = β0 + β1·Y_{t-1} + δ1·X_{t-1} + γ0·D_t(τ) + γ1·[D_t(τ)·Y_{t-1}] + γ2·[D_t(τ)·X_{t-1}] + u_t
If there is no break, then the regression is the same over both parts of the sample, so the binary
variable does not enter the equation. Thus:
H0: γ0 = γ1 = γ2 = 0
H1: at least one γ ≠ 0
This is tested using the F-statistic, as it is a joint hypothesis.
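A sketch of the Chow test in R for this ADL(1,1), assuming vectors y and x and a hypothesized break date tau (an index into the sample); the variable names are illustrative:

n   <- length(y)
dat <- data.frame(yt   = y[2:n],
                  ylag = y[1:(n - 1)],
                  xlag = x[1:(n - 1)],
                  D    = as.numeric(2:n >= tau))   # 0 before the break date, 1 from it onward
unrestricted <- lm(yt ~ ylag + xlag + D + D:ylag + D:xlag, data = dat)
restricted   <- lm(yt ~ ylag + xlag, data = dat)
anova(restricted, unrestricted)                    # F-test of H0: gamma0 = gamma1 = gamma2 = 0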

Quandt likelihood ratio (QLR): The Chow test can be modified to handle testing for unknown break
dates by testing for breaks at all possible dates τ between τ0 and τ1 and then using the largest
resulting F-statistic to test for a break at an unknown date.
The QLR statistic is the largest of many F-statistics, and therefore its distribution is not the same as
that of an individual F-statistic, meaning that the critical values must be obtained from a distribution
other than the F distribution. The distribution of the QLR statistic depends on τ0/T and τ1/T - the
candidate break dates cannot be too close to the beginning or end of the sample, so we use trimming and
only compute the F-statistic for break dates in the central part of the sample - with 15% trimming we
only look at the middle 70%.
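The QLR statistic can then be computed by looping the same F-test over all candidate break dates in the central 70% of the sample (a sketch reusing the data frame dat from the Chow-test code above; qlr is a hypothetical helper):

qlr <- function(dat) {
  n    <- nrow(dat)
  taus <- floor(0.15 * n):ceiling(0.85 * n)        # 15% trimming on each side
  f    <- sapply(taus, function(tau) {
    dat$D <- as.numeric(seq_len(n) >= tau)
    unres <- lm(yt ~ ylag + xlag + D + D:ylag + D:xlag, data = dat)
    res   <- lm(yt ~ ylag + xlag, data = dat)
    anova(res, unres)$F[2]                         # Chow F-statistic at this date
  })
  list(QLR = max(f), break_date = taus[which.max(f)])  # compare QLR with simulated critical values
}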

Chapter 15 - Estimation of Dynamic Causal Effects


Dynamic causal effect: the time path of the effect on the outcome of interest of the treatment.
The model used to estimate dynamic causal effects needs to incorporate lags, which is done by
expressing 𝑌 as a distributed lag of current and 𝑟 past values of 𝑋 . We call this model the
distributed lag model relating 𝑋 and 𝑟 of its lags to 𝑌 :
Y_t = β0 + β1·X_t + β2·X_{t-1} + β3·X_{t-2} + ⋯ + β_{r+1}·X_{t-r} + u_t
The coefficient on X_{t-h} is the effect of a unit change in X on Y after h periods. The dynamic causal
effect is the effect of a change in X_t on Y_t, Y_{t+1}, Y_{t+2}, and so forth; that is, it is the sequence of
causal effects on current and future values of Y.
This formulation of dynamic causal effects in time series data as the expected outcome of an
experiment in which different treatment levels are repeatedly applied to the same subject has 2
implications for empirical attempts to measure the dynamic causal effect with observational time
series data. The 1st implication is that the dynamic causal effect should not change over the sample
on which we have data. The 2nd implication is that 𝑋 must be uncorrelated with the error term, i.e. 𝑋
must be exogenous.
We use 2 concepts of exogeneity:
1. Past and present exogeneity - 𝑢 has a conditional mean of 0 given current and all past
values of 𝑋 . This implies that the more distant causal effects - all the causal effects beyond
lag 𝑟 - are 0, meaning that the 𝑟 distributed lag coefficients constitute all non-0 dynamic
causal effects.

2. Strict exogeneity - 𝑢 has mean 0 given all past, present, and future values of 𝑋 . When we
have strict exogeneity, the OLS estimators are no longer the most efficient.

15.3 Estimation of Dynamic Causal Effects with Exogeneous Regressors


If 𝑋 is exogenous, then its dynamic causal effect on 𝑌 can be estimated by OLS estimation of the
distributed lag regression:
Y_t = β0 + β1·X_t + β2·X_{t-1} + ⋯ + β_{r+1}·X_{t-r} + u_t
The distributed lag model assumptions:
- X_t is exogenous, i.e. E(u_t | X_t, X_{t-1}, X_{t-2}, …) = 0. This implies that the r + 1 distributed lag
coefficients constitute all of the nonzero dynamic causal effects

- The random variables Y_t and X_t have a stationary distribution, and (Y_t, X_t) and (Y_{t-j}, X_{t-j})

become independent as j gets larger

- Large outliers are unlikely; 𝑌 and 𝑋 have more than 8 nonzero, finite moments

- There is no perfect multicollinearity

In the distributed lag model, u_t can be autocorrelated, i.e. it can be correlated with its lagged values,
because the omitted factors included in u_t can be serially correlated. This does not affect the
consistency of OLS, nor does it introduce bias. However, the usual OLS standard errors are no longer
valid, so we must use heteroskedasticity- and autocorrelation-consistent (HAC) standard errors to
avoid misleading statistical inferences.

Dynamic multipliers: The coefficients on X_t and its lags are the dynamic multipliers, which relate X
to Y. The effect of a unit change in X on Y after h periods, which is β_{h+1} in the distributed lag
model, is called the h-period dynamic multiplier; β2 is the one-period dynamic multiplier. The
zero-period (or contemporaneous) dynamic multiplier is also called the impact effect, and it is β1,
the effect on Y of a change in X in the same period. The SEs of the dynamic multipliers are the HAC SEs
of the OLS coefficients.
Cumulative dynamic multipliers: The cumulative sum of the dynamic multipliers. The h-period
cumulative dynamic multiplier is the cumulative effect of a unit change in X on Y over the next h
periods. The sum of all individual dynamic multipliers is the cumulative long-run effect on Y of a
change in X, and we call it the long-run cumulative dynamic multiplier. The cumulative multipliers can
be estimated directly by running:
Y_t = δ0 + δ1·ΔX_t + δ2·ΔX_{t-1} + δ3·ΔX_{t-2} + ⋯ + δ_r·ΔX_{t-r+1} + δ_{r+1}·X_{t-r} + u_t
where the coefficients δ are in fact the cumulative dynamic multipliers, and the last coefficient is the
long-run cumulative dynamic multiplier. Advantage: the HAC SEs of the δ's are the HAC SEs of the
cumulative dynamic multipliers.
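A sketch of the cumulative-multiplier regression in R for r = 2, with Newey-West SEs (assuming vectors y and x and the packages sandwich and lmtest):

library(sandwich); library(lmtest)
n   <- length(y)
dat <- data.frame(yt    = y[3:n],
                  dx    = x[3:n] - x[2:(n - 1)],          # ΔX_t
                  dxlag = x[2:(n - 1)] - x[1:(n - 2)],    # ΔX_{t-1}
                  xlag2 = x[1:(n - 2)])                   # X_{t-2}
fit <- lm(yt ~ dx + dxlag + xlag2, data = dat)
coeftest(fit, vcov = NeweyWest(fit))   # the δ̂'s are cumulative multipliers; xlag2 carries the long-run multiplier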

15.4 Heteroskedasticity- and Autocorrelation Consistent Standard Errors


When u_t is autocorrelated, the variance of β̂1 can be written as the product of 2 terms. Writing
v_t = (X_t − μ_X)·u_t and v̄ for its sample average, we have

var(β̂1) = var( v̄ / σ_X² ) = var(v̄) / (σ_X²)²

The variance is thus the variance of the OLS estimator when the errors are uncorrelated, multiplied by a
correction factor that arises from the autocorrelation of the errors.
If v_t is IID - as assumed for cross-sectional data - then var(v̄) = σ_v²/T. However, if u_t and X_t are
not independently distributed over time, then var(v̄) = (σ_v²/T)·f_T, where
f_T = 1 + 2·Σ_{j=1}^{T-1} ((T − j)/T)·ρ_j and ρ_j = corr(v_t, v_{t-j}); in large samples, f_T tends to the limit
1 + 2·Σ_{j=1}^{∞} ρ_j (the expression without the (T − j)/T fraction).

Combining these equations, the variance of β̂1 when v_t is autocorrelated is:

var(β̂1) = (1/T) · [ σ_v² / (σ_X²)² ] · f_T

where f_T is the adjustment for the serial correlation.


The HAC estimator of the variance (the Newey-West variance estimator) can then be written as:
σ̂²_{β̂1,HAC} = σ̂²_{β̂1} · f̂_T
The estimator f̂_T has to balance using too many sample autocorrelations against using too few,
because both extremes make the estimator unreliable. The number of autocorrelations to
include depends on the sample size T; a rule of thumb for the truncation parameter is
m = 0.75·T^(1/3), rounded to an integer - the exact equation is found on page 651.
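The rule of thumb and its use might look as follows (a sketch; fit is an lm object such as the one above):

library(sandwich); library(lmtest)                             # as in the previous sketch
m <- ceiling(0.75 * nobs(fit)^(1/3))                           # Newey-West truncation rule of thumb
coeftest(fit, vcov = NeweyWest(fit, lag = m, prewhite = FALSE))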

15.5 Estimation of Dynamic Causal Effects with Strictly Exogenous Regressors


When 𝑋 is strictly exogenous, 2 alternative estimators of dynamic causal effects are available: (1)
estimating ADL and calculating the dynamic multipliers from the ADL coefficients - has fewer
estimated coefficients, reducing estimation error, and (2) generalized least squares (GLS) - GLS has
a smaller variance.
________________________________________________________________________________
Example - DL with AR(1) Errors
The causal effect on Y of a change in X lasts for only 2 periods. This leads to the following model:
Y_t = β0 + β1·X_t + β2·X_{t-1} + u_t
The error term is serially correlated, so we must use HAC SEs. If we do not want to do this and X_t
is strictly exogenous, then we can adopt an AR model for the serial correlation in u_t, from which we
can then derive estimators that can be more efficient than the OLS estimators in the DL
model. To do so, we use the following AR(1) model for the error term:
u_t = φ1·u_{t-1} + ũ_t
where φ1 is the autoregressive parameter, ũ_t is serially uncorrelated, and no intercept is needed
because E(u_t) = 0. This suggests that we can write the DL model with a serially correlated error as
an ADL model with a serially uncorrelated error:
Y_t = α0 + φ1·Y_{t-1} + δ0·X_t + δ1·X_{t-1} + δ2·X_{t-2} + ũ_t
where α0 = β0·(1 − φ1), δ0 = β1, δ1 = β2 − φ1·β1, and δ2 = −φ1·β2. The betas here are the ones
from the initial DL model, and φ1 is the autoregressive coefficient from the AR(1) model.

This ADL model can also be expressed as a quasi-difference model:

Ỹ_t = α0 + β1·X̃_t + β2·X̃_{t-1} + ũ_t

where Ỹ_t = Y_t − φ1·Y_{t-1} and X̃_t = X_t − φ1·X_{t-1}.

Since the models are the same, the conditions for their estimation are the same. One of the
conditions is that the zero conditional mean assumption, for it to hold for general values of φ1, is
equivalent to E(ũ_t | X_{t+1}, X_t, X_{t-1}, …) = 0. This is implied by X_t being strictly exogenous, but it is
not implied by X_t being (past and present) exogenous.

(1) OLS estimation of the ADL model

HAC SEs are not needed when the ADL coefficients are estimated by OLS. The estimated
coefficients are not themselves estimates of the dynamic multipliers - a way to compute these is to
express the estimated regression function as a function of current and past values of X_t, i.e. by
eliminating Y_{t-1} from the estimated regression function:

If α0 = β0·(1 − φ1), δ0 = β1, δ1 = β2 − φ1·β1, and δ2 = −φ1·β2 held exactly for the estimated
coefficients, then the dynamic multipliers beyond the 2nd would all be 0. However, this is generally
not the case.

(2) GLS estimation of the ADL model


Ỹ_t = α0 + β1·X̃_t + β2·X̃_{t-1} + ũ_t
We assume initially that φ1 is known; the resulting GLS estimator is called infeasible because φ1 is
unknown in practice. It can be made feasible by replacing φ1 with an estimator, which yields the
feasible GLS estimator. Specifically, the feasible GLS estimators of β1 and β2 are the OLS estimators
of β1 and β2 in the model above, computed by regressing Ỹ_t on X̃_t and X̃_{t-1} (with the quasi-
differences formed using φ̂1).
Explanation of (iterated) Cochrane-Orcutt method on page 658.
The model can also be estimated by nonlinear least squares (NLLS), which minimizes the sum of squared
mistakes made by the estimated regression function, recognizing that the regression function is a
nonlinear function of the parameters being estimated. In general this requires sophisticated algorithms for
minimizing nonlinear functions of unknown parameters (though not for the DL model with AR(1) errors,
where the iterated Cochrane-Orcutt GLS estimator is the NLLS estimator of the ADL coefficients).
GLS estimators are efficient when 𝑋 is strictly exogenous and the transformed errors 𝑢 are
homoskedastic. The loss of information from estimating 𝜙1 is negligible when 𝑇 is large, as the
feasible and infeasible GLS estimators have the same variance in large samples. Thus, if 𝑋 is
strictly exogenous, then GLS is more efficient than the OLS estimator of the DL coefficients.
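A minimal sketch in R of the feasible GLS (quasi-difference) idea for this DL(2) example, assuming strictly exogenous x and vectors y and x; this is a single Cochrane-Orcutt style step, whereas the iterated version would re-estimate φ1 until convergence:

n    <- length(y)
ols  <- lm(y[3:n] ~ x[3:n] + x[2:(n - 1)])        # DL model: Y_t on X_t and X_{t-1}
u    <- residuals(ols)
phi1 <- coef(lm(u[-1] ~ 0 + u[-length(u)]))[1]    # AR(1) coefficient of the DL residuals
# quasi-difference the data and re-estimate by OLS = feasible GLS
yq   <- y[3:n]       - phi1 * y[2:(n - 1)]
xq0  <- x[3:n]       - phi1 * x[2:(n - 1)]
xq1  <- x[2:(n - 1)] - phi1 * x[1:(n - 2)]
fgls <- lm(yq ~ xq0 + xq1)                        # slope coefficients estimate β1 and β2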
________________________________________________________________________________

The example above carries over to the general distributed lag model with multiple lags and an
AR(p) error term. The general DL model with r lags and an AR(p) error term is:
Y_t = β0 + β1·X_t + β2·X_{t-1} + ⋯ + β_{r+1}·X_{t-r} + u_t
u_t = φ1·u_{t-1} + φ2·u_{t-2} + ⋯ + φ_p·u_{t-p} + ũ_t

where the β's are the dynamic multipliers and the φ's are the autoregressive coefficients of the error term.
Under the AR(p) model for the errors, ũ_t is serially uncorrelated.

Y_t can be written in the ADL form as well:

Y_t = α0 + φ1·Y_{t-1} + ⋯ + φ_p·Y_{t-p} + δ0·X_t + δ1·X_{t-1} + ⋯ + δ_q·X_{t-q} + ũ_t

where q = r + p.

In terms of quasi-differences, the model is:

Ỹ_t = α0 + β1·X̃_t + β2·X̃_{t-1} + ⋯ + β_{r+1}·X̃_{t-r} + ũ_t
where Ỹ_t = Y_t − φ1·Y_{t-1} − ⋯ − φ_p·Y_{t-p} and X̃_t = X_t − φ1·X_{t-1} − ⋯ − φ_p·X_{t-p}.

Again, the dynamic multipliers can be estimated by (feasible) GLS. This entails OLS estimation of
the coefficients of the quasi-differenced specification. The GLS estimator is asymptotically BLUE.
Otherwise, it can be estimated using OLS on the ADL specification - can provide a compact or
parsimonious summary of a long and complex DL.
The advantage of the GLS estimator is that, for a given lag length r in the distributed lag model, the
GLS estimator of the distributed lag coefficients is more efficient than the ADL estimator, at least
in large samples. In practice, then, the advantage of using the ADL approach arises because the
ADL specification can permit estimating fewer parameters than are estimated by GLS.

15.6 Orange Juice Prices and Cold Weather


When conducting an empirical analysis, it is important to check whether these results are sensitive
to changes in the details of the analysis. This is done by checking 3 aspects: (1) sensitivity to the
computation of the HAC SEs, (2) an alternative specification that investigates potential OVB, and
(3) an analysis of the stability over time of the estimated multipliers.

15.7 Is Exogeneity Plausible?


The interpretation of the coefficients in a DL regression as causal dynamic effects hinges on the
assumption that 𝑋 is exogenous. With time series data, a particularly important concern is that there
could be simultaneous causality, which results in endogenous regressors. If this is the case, the ADL
or GLS methods are not appropriate.

Notes from Lecture


Time series data are collected on the same observational unit at multiple time periods.
The first thing to do with any time series is to plot it, as the graph gives a lot of information, e.g. the
dynamic development of the data.
Y_t is the time series at time t, t = 1, …, T. We use it for forecasting, estimation of dynamic causal
effects, and modeling risk.
The first lag of Y_t is Y_{t-1}.

The first difference of a series, ΔY_t, is its change between periods t − 1 and t, i.e. ΔY_t = Y_t − Y_{t-1}.

The first difference of the logarithm of Y_t is Δln(Y_t) = ln(Y_t) − ln(Y_{t-1}).

The percentage change of a time series Y between periods t − 1 and t is approximately
100·Δln(Y_t), where the approximation is most accurate when the percentage change is small.

A series exhibiting autocorrelation is related to its own past values. For time series, we use the
autocovariance:
γ0 = cov(Y_t, Y_t) = var(Y_t) = E[(Y_t − E(Y_t))²]
γ1 = cov(Y_t, Y_{t-1}) = E[(Y_t − E(Y_t))·(Y_{t-1} − E(Y_{t-1}))]

For time series data, the correlation coefficient is:

ρ1 = cov(Y_t, Y_{t-1}) / sqrt( var(Y_t)·var(Y_{t-1}) )

Assumption (stationarity): var(Y_t) = var(Y_{t-1}) = ⋯ = var(Y_{t-j}), so

ρ1 = cov(Y_t, Y_{t-1}) / var(Y_t)

and in general

ρ_j = γ_j / γ0
Thus, the properties of the autocorrelation function are:

- ρ0 = 1
- ρ_{-j} = ρ_j
- −1 ≤ ρ_j ≤ 1
- ρ_j = 0 if Y_t is not serially correlated

The autocorrelation thus does not depend on time but on the lag. The most recent lag is the most
relevant in predicting next year's value of Y.

Autoregressive AR Processes
The AR(1) - autoregressive model of order 1 - is:
Y_t = β0 + β1·Y_{t-1} + u_t
In the AR(1), 𝑌 depends on 1st lag of its own past values. We estimate this autoregressive model
just as we would a normal regression, but we need an extra condition.

For forecasts, we distinguish between in-sample predicted values and out-of-sample forecasts (made
for periods beyond the estimation data).
Y_{T+1|T} is the forecast of Y_{T+1} based on Y_T, Y_{T-1}, etc. using the population (unknown) coefficients;
a hat (Ŷ_{T+1|T}) denotes the forecast based on the estimated coefficients.
For the AR(1), the one-step-ahead point forecast into the future is:

Ŷ_{T+1|T} = β̂0 + β̂1·Y_T

The forecast error:

e_{T+1} = Y_{T+1} − Ŷ_{T+1|T}

The root mean squared forecast error (RMSFE):

RMSFE = sqrt( E[ (Y_{T+1} − Ŷ_{T+1|T})² ] )

If the error in the estimated coefficients is small (as it is in large samples), then the RMSFE is estimated
by the standard error of the regression.
We can construct a 95% interval forecast:

Ŷ_{T+1|T} ± 1.96·SE(Y_{T+1} − Ŷ_{T+1|T})

AR(p) is just like AR(1) but it has p lags instead:

Y_t = β0 + β1·Y_{t-1} + β2·Y_{t-2} + ⋯ + β_p·Y_{t-p} + u_t

The point forecast is constructed in a similar way:

Ŷ_{T+1|T} = β̂0 + β̂1·Y_T + β̂2·Y_{T-1} + ⋯ + β̂_p·Y_{T-p+1}

The more parameters we estimate, the riskier the forecast - the more likely it is that we make an
estimation error. Parsimony principle - use just enough lags to construct the best forecast, as larger
models will have larger errors in estimation of the coefficients, strongly affecting RMSFE and
forecast uncertainty.
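A sketch in R of estimating an AR(p) and producing a one-step-ahead forecast with a 95% interval (assuming a numeric series y; the order p = 2 is illustrative):

fit <- arima(y, order = c(2, 0, 0))        # AR(2) with intercept, estimated by ML
fc  <- predict(fit, n.ahead = 1)           # one-step-ahead point forecast
fc$pred + c(-1.96, 1.96) * fc$se           # approximate 95% forecast interval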
Autoregressive Distributed Lag Model
Sometimes, we may want to consider outside predictors. For example, the Phillips curve states that
the unemployment rate is negatively related to changes in the inflation rate, so we may want to use the
unemployment rate as an additional predictor. This leads to the ADL(p, q) model:
Y_t = β0 + β1·Y_{t-1} + β2·Y_{t-2} + ⋯ + β_p·Y_{t-p} + δ1·X_{t-1} + δ2·X_{t-2} + ⋯ + δ_q·X_{t-q} + u_t

where we assume that E(u_t | Y_{t-1}, Y_{t-2}, …, X_{t-1}, X_{t-2}, …) = 0, meaning that the residuals must have
no autocorrelation.
Granger Causality Test
The test of the joint hypothesis that none of the 𝑋 is a useful predictor above and beyond lagged
values of 𝑌 is called a Granger Causality Test:
Y_t = β0 + β1·Y_{t-1} + ⋯ + β_p·Y_{t-p} + γ1·X_{t-1} + ⋯ + γ_q·X_{t-q} + u_t

The null hypothesis is:

H0: γ1 = ⋯ = γ_q = 0
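In R, this joint test is available as grangertest in the lmtest package (a sketch; y, x and the lag order of 4 are placeholders):

library(lmtest)
grangertest(y ~ x, order = 4)   # F-test of H0: lags of x do not help predict y beyond lags of y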
We determine the order of an autoregression by comparing:
- The F/t statistic: Choose the order of an AR(p) based on whether or not the
coefficients are statistically significant at, say, the 5% level. Often leads to choosing
a model that is too large (5% of the time).

- 𝑅2 : Always increasing in the number of lags

- Adjusted 𝑅2 : If T is large and p is small (often for financial data), the adjusted 𝑅2
will not penalize the addition of extra lag terms enough, can lead to choosing a
model that is too large

- The Akaike Information Criterion (AIC):

AIC(p) = ln[SSR(p)/T] + (p + 1)·2/T

The first term measures how close the model is to the actual data and does not
increase as you add more lags.

- The Bayes Information Criterion (BIC):

BIC(p) = ln[SSR(p)/T] + (p + 1)·ln(T)/T

The BIC is similar to the AIC, but it adds a higher penalty as you add more lags. A
lower score is better.

In general, the model chosen by 𝐹/𝑡 statistics will be greater than or equal to that chosen by the
AIC, which will be greater than or equal to that chosen by the BIC.

Lecture 14 - Stationary/Nonstationary
Notes from Lecture
A time series is stationary if its probability distribution does not change over time, i.e. Y_t is
stationary if the joint distribution of (Y_{s+1}, Y_{s+2}, …, Y_{s+T}) does not depend on s, meaning that we
can quantify historical relationships and generalize them into the future. Variables can also be jointly
stationary.
Covariance stationarity - weak stationarity
A series Y_t is covariance stationary if it has a constant mean, a constant variance, and if the
autocovariance depends only on the time difference, i.e. γ_k = cov(y_t, y_{t-k}) depends only on k (the lag)
and not on t (time) (no changes in the autocorrelation structure).

If data is not stationary, we take the first difference of the data - can be done before or after log
transformation. If the data is still not stationary, we take further differences.

Lag Operator
The lag operator:
L·y_t = y_{t-1}  (and more generally L^k·y_t = y_{t-k})
The lag operator and multiplication are commutative, i.e. if we have to take the lag of a time series
multiplied by a constant, the lag operator only applies to the variable that depends on time:
L(a·y_t) = a·L·y_t = a·y_{t-1}

The lag operator is furthermore distributive over addition:

L(y_t + x_t) = L·y_t + L·x_t = y_{t-1} + x_{t-1}

Applying the lag operator to the AR(p) model:

AR(p): Y_t = β0 + β1·Y_{t-1} + ⋯ + β_p·Y_{t-p} + u_t

Y_t = β0 + β1·L·Y_t + ⋯ + β_p·L^p·Y_t + u_t

(1 − β1·L − ⋯ − β_p·L^p)·Y_t = β0 + u_t

Thus, the AR(p) model is Y_t multiplied by a polynomial of order p in the lag operator. The model is
stationary if the roots of the characteristic polynomial 1 − β1·z − ⋯ − β_p·z^p = 0 lie outside the unit
circle. It is convenient to use the inverse roots, as these will be inside the unit circle if the AR(p) is
stationary.
An easy way to tell if an AR(1) is stationary is to look at the coefficient on 𝑌 1 - if it is less than 1
in absolute value, the model is stationary.
If the coefficients on the lags are not very close to 1, e.g. 0.4, then the error term will cause a lot of
randomness, which can be seen in the graph being less smooth.

If an AR(p) model has a root that equals 1, the series is said to have a unit root. We can do a formal
test for a unit root by regressing Y_t on Y_{t-1} and then using the standard t-test to test β = 1. The
unit root test is thus:
H0: β = 1
H1: β < 1
The null hypothesis represents a unit root or stochastic trend, and H1 represents a stationary time series.
** remember the t-statistic reported by the regression is for H0: β = 0, while we want to test β = 1. We
circumvent this problem by rewriting the regression in terms of the change:
Y_t = β·Y_{t-1} + ε_t

Y_t − Y_{t-1} = β·Y_{t-1} − Y_{t-1} + ε_t

Y_t − Y_{t-1} = (β − 1)·Y_{t-1} + ε_t

ΔY_t = δ·Y_{t-1} + ε_t

where δ = β − 1, and we then test the null hypothesis that δ = 0:
H0: δ = 0
The test statistic follows the Dickey-Fuller distribution, not the normal t-distribution.
Augmented Dickey-Fuller (ADF) Test for Unit Root
Allows for larger autoregressive process and for unit root to be tested in any of the lags.

We should first test for a unit root in the presence of a trend. If the trend is not significant, we drop
it and re-test. The intercept should be kept whether or not it is significant, as this ensures that the
residuals sum to 0.
In general, unit root tests have low power, which means that the test will not always reject the null
hypothesis even when it is false.
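A sketch of this testing sequence with ur.df from the urca package (assuming y is the series; lag length selected by BIC):

library(urca)
summary(ur.df(y, type = "trend", selectlags = "BIC"))   # ADF regression with intercept and trend
summary(ur.df(y, type = "drift", selectlags = "BIC"))   # if the trend is insignificant, re-test with intercept only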

Trends
A trend is a persistent long-term movement of a variable over time.
Deterministic trend: non-random function of time
Stochastic trend: random and varies over time
A series exhibiting a stochastic trend may have long periods of increases or decreases (e.g. monthly
stock prices). Thus, we often model time series using stochastic trends in economics and finance.

Non-stationarity can also be due to structural breaks, which arise from a change in the population
regression coefficients at a distinct time or from a gradual evolution of the coefficients. For
example, the breakdown of Bretton Woods produced a break in the USD/GBP exchange rate.
Breaks can cause problems with OLS regressions, as OLS estimates a relationship that holds on
average across the two periods. We can use the Chow test to test for a known break date and then
continue modeling from after the break date. The test works by creating a dummy variable
D_t(τ) = 1 if t ≥ τ and 0 otherwise, where τ is the break date. We then test H0: γ0 = γ1 = ⋯ = 0 (no
break) using an F-test.
We can use the Quandt likelihood ratio (QLR) statistic to test for an unknown break date. Let F(τ)
be the Chow test statistic for the hypothesis of no break at date τ. Then calculate the F-statistic
from the Chow test for a range of values of τ, 0.15T ≤ τ ≤ 0.85T. The QLR is then the
largest of these F-statistics. The QLR statistic does not follow one of our usual distributions,
meaning that we must use simulated critical values (tabulated in the textbook).

Once we detect a structural break, we have evidence of 2 regimes. Only the most recent one is
relevant for forecasting, meaning that we should not let the first regime contaminate the data from the
second period. A break also indicates that the data are non-stationary; using only one regime restores
stationarity. In practice, cut the data at the point where the F-statistic is last above the critical value.
Lecture 15 - ARIMA
Notes from Lecture
Auto-Regressive Integrated Moving-Average (ARIMA) is used for modeling univariate time series.
They have 3 parts: 𝐴𝑅𝐼𝑀𝐴 𝑝, 𝑑, 𝑞 - 𝑝 is the autoregressive part (the number of lags), 𝑑 is the
integrated part (the number of unit roots - the number of differences necessary to make the time
series stationary), and 𝑞 is the number of lags in the moving average part.
The unit root process can be written as:
Y_t = Y_0 + u_t + u_{t-1} + ⋯ + u_2 + u_1

where the shocks never die out. There is no beta in the model because, under a unit root, β = 1. There
is no correlation between the shocks u, i.e. they are IID.
If the shocks are purely random, then the future (as well as the state we find ourselves in now) is also
purely random. Every shock is equally likely to impact Y_t.
The variance of the unit root process is:
Var(Y_t) = Var(Y_0) + Var(u_t + u_{t-1} + ⋯ + u_2 + u_1) = t·σ_u²

Because the shocks are independent over time and, from the starting point 0, we cannot predict
which shocks will arrive in which periods, there is no linear dependency between the shocks, and we
therefore do not include any covariance terms.
The variance grows without bound - as time goes on, the variance becomes larger, as it is
simply time, t, multiplied by the variance of each shock.

Moving Average (MA) models are specified somewhat similarly. The MA process of order q is:
MA(q): Y_t = c0 + u_t + θ1·u_{t-1} + ⋯ + θ_q·u_{t-q}

where the influence of the shocks is limited to the past q periods. The biggest effect is the
contemporaneous shock, and every other shock is multiplied by its coefficient θ_j, which is between 0
and 1 in absolute value.
If we have MA(1), then we only care about the current shock and 1 before that (first lag shock). The
other shocks have coefficients equal to 0. This ensures stationarity in MA models - MA models are
always covariance-stationary.
We can use the lag operator (for the MA(1) case):
Y_t = c0 + (1 + θ1·L)·u_t

MA models must be invertible, i.e. it must be possible to convert them into an infinite AR process. Otherwise,
shocks do not die out, and the model is not good for forecasting. An MA(q) is invertible if the roots of
1 + θ1·z + ⋯ + θ_q·z^q = 0 lie outside the unit circle.

Example - MA(1)
Y_t = u_t + θ1·u_{t-1}

where u_t ~ IID(0, σ²). The variance of Y_t is then:

Var(Y_t) = Var(u_t + θ1·u_{t-1})
         = Var(u_t) + Var(θ1·u_{t-1}) + 2·Cov(u_t, θ1·u_{t-1})
         = σ² + θ1²·σ²
         = σ²·(1 + θ1²)

The covariance term drops out because it is equal to 0, due to u_t being independent over t. This is
γ0 in the autocovariance function. The first autocovariance of Y_t, γ1, will, if we assume the MA(1)
process has mean 0, be:

γ1 = E[(Y_t − E(Y_t))·(Y_{t-1} − E(Y_{t-1}))]
   = E[(u_t + θ1·u_{t-1})·(u_{t-1} + θ1·u_{t-2})]
   = E[u_t·u_{t-1}] + θ1·E[u_{t-1}²] + θ1·E[u_t·u_{t-2}] + θ1²·E[u_{t-1}·u_{t-2}]
   = θ1·σ²

Because u_t is independent over time, E[u_t·u_{t-j}] = 0 for j ≠ 0. The second autocovariance of the MA(1)
is equal to 0:
γ2 = 0
Summing up, we have:
γ0 = (1 + θ1²)·σ²
γ1 = θ1·σ²
γ_j = 0 for j ≥ 2

Then we see that:

ρ0 = γ0/γ0 = 1
ρ1 = γ1/γ0 = θ1/(1 + θ1²)
ρ_j = γ_j/γ0 = 0 for j ≥ 2
Thus, for an MA(q) process, the 𝑞th spike in the ACF will be significant and subsequent spikes will
be 0.

When choosing a model, it is important that the last lag is statistically significant, i.e. MA(4) needs
to have the 4th lag being significant. Lag 2 can be insignificant.
For AR processes, the autocorrelation function will slowly die out.

Partial Autocorrelation Function (PACF):


- The partial autocorrelation at lag j is the autocorrelation between Y_t and Y_{t-j} after
controlling for Y_{t-1}, …, Y_{t-j+1}.
- The PACF at lag j can be found by estimating an AR(j) and looking at the coefficient on
Y_{t-j}.
- If the jth spike is positive, it suggests a positive relationship between Y_t and Y_{t-j}.
- The pth spike in the PACF will be non-zero for an AR(p). All subsequent spikes will be
0.

ARMA(p, q) models combine AR and MA models.

ARMA(p, q): Y_t = β0 + β1·Y_{t-1} + ⋯ + β_p·Y_{t-p} + u_t + θ1·u_{t-1} + ⋯ + θ_q·u_{t-q}

We can identify ARMA using the ACF and PACF, and AIC and BIC.

In R, we can use the auto.arima function (forecast package) - we have to state the selection
criterion, e.g. ic = "aic". We can restrict the search to p and q only (an ARMA model) by fixing d, e.g.
d = 1, or d = 0 if the data are already stationary.
** Can probably also use the "step" function if wanting to use AIC as the criterion anyway.

ARIMA Modeling
Steps:
1. Conduct ADF tests and difference the series until it is stationary - this will give us 𝑑

2. Look at ACF, PACF, AIC and BIC to get 𝑝 and 𝑞

3. Check the residuals to make sure they are white noise. If the residuals are not white noise,
go back to step (2)
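A sketch of this workflow in R using the forecast and tseries packages (the series y, the choice d = 1, and the lag used in Box.test are all illustrative):

library(forecast); library(tseries)
adf.test(y)                                              # step 1: unit root test; difference until stationary
fit <- auto.arima(y, d = 1, ic = "bic")                  # step 2: let the information criterion pick p and q
Box.test(residuals(fit), lag = 10, type = "Ljung-Box")   # step 3: residuals should be white noise
forecast(fit, h = 8)                                     # forecast once the model passes the checks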

Lecture 16 - Estimation of Dynamic Causal Effects


Chapter 16 - Additional Topics in Time Series Regression
Vector autoregression (VAR): A set of 𝑘 time series regressions in which the regressors are lagged
values of all 𝑘 series. A VAR extends the univariate autoregression to a vector (list) of time series
variables - if the number of lags in each of the equations is the same and equal to 𝑝, the system of
equations is called a VAR 𝑝 .
For 2 variables, Y_t and X_t, the VAR(p) consists of the 2 equations:
Y_t = β10 + β11·Y_{t-1} + ⋯ + β1p·Y_{t-p} + γ11·X_{t-1} + ⋯ + γ1p·X_{t-p} + u_{1t}

X_t = β20 + β21·Y_{t-1} + ⋯ + β2p·Y_{t-p} + γ21·X_{t-1} + ⋯ + γ2p·X_{t-p} + u_{2t}

The assumptions of VAR are the same as for time series regression, applied to each separate
equation. The coefficients are estimated by OLS - the estimators are consistent and have a joint
normal distribution in large samples. Statistical inference proceeds in the normal manner.
For hypothesis testing, it is possible to test joint hypotheses that involve restrictions across multiple
equations. For example, in the model above, we can test whether the correct lag length is p or p − 1
by testing whether the coefficients on Y_{t-p} and X_{t-p} are 0:

H0: β1p = β2p = γ1p = γ2p = 0

H1: at least one of these coefficients ≠ 0


Because the estimated coefficients have a jointly normal distribution in large samples, the 𝐹-
statistic is used.
The number of coefficients in each equation of a VAR is proportional to the number of variables in
the VAR, e.g. a VAR with 5 variables and 4 lags will have 21 coefficients (4 × 5 + intercept) in
each of its 5 equations, meaning that there is a total of 105 coefficients. This high number of
estimators increases the amount of estimation error entering a forecast, which is why we want to
keep the number of variables in a VAR small - we also want to ensure that the variables are related,
as they can then help predict each other. Otherwise, we are adding an unrelated variable which
simply introduces estimation error without adding predictive content.
The lag length in a VAR can thus be determined by using 𝐹-tests as described before, or by using
information criteria such as BIC - choose the lag length that minimizes BIC.
Structural VAR modeling is when VARs are used to model the underlying structure of the
economy. Requires very specific assumptions derived from economic theory and institutional
knowledge.
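A sketch in R using the vars package (dat is assumed to be a multivariate time series or data frame containing the k variables):

library(vars)
VARselect(dat, lag.max = 8, type = "const")   # information criteria over candidate lag lengths
fit <- VAR(dat, p = 2, type = "const")        # each equation estimated by OLS
predict(fit, n.ahead = 4)                     # iterated multiperiod forecasts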

16.2 Multiperiod Forecasts


There are 2 approaches to multiperiod forecasts:
1. Iterated Multiperiod Forecasts

A forecasting model is used to make a forecast 1 period ahead (T + 1) using data through
period T. The model is then used to make a new 1-period-ahead forecast (T + 2) using data
through period T plus the forecasted value for T + 1. This process iterates until the
forecast is made for the desired horizon h.

Thus, a 2-step ahead forecast of 1 variable depends on the forecast of all variables in the
VAR in period 𝑇 1. Generally, to compute multiperiod iterated VAR forecasts ℎ periods
ahead, it is necessary to compute forecasts of all variables for all intervening periods
between 𝑇 and 𝑇 ℎ.

2. Direct Multiperiod Forecasts

Forecasts by using a single regression in which the dependent variable is the multiperiod-
ahead variable to be forecasted and the regressors are the predictor variables. In a direct ℎ-
period-ahead forecasting regression, all predictors are lagged ℎ periods to produce the ℎ-
period-ahead forecast.

For direct forecasts, the error term is serially correlated, and we therefore use HAC SEs. The
longer the horizon, the higher the degree of serial correlation.

The iterated method is the recommended procedure for 2 reasons: (1) one-period-ahead regressions
estimate coefficients more efficiently than multiperiod-ahead regressions, and (2) iterated forecasts
tend to have time paths that are less erratic across horizons since they are produced using the same
model as opposed to direct forecasts, which use a new model for every horizon.
A direct forecast is, however, preferable if you suspect that the one-period-ahead model (the AR or
VAR) is not specified correctly, or if your forecast contains many predictors (estimation error).

16.3 Orders of Integration and the DF-GLS Unit Root Test


If a series has a random walk trend, it has an autoregressive root equal to 1. A random walk, in which the
trend at date t equals the trend at date t − 1 plus a random error term, is said to be integrated of order
one, I(1):
Y_t = β0 + Y_{t-1} + u_t
Some series have trends that are smoother than implied by this model, and therefore, a different
model is needed. One such model makes the first difference of the trend follow a random walk
(integrated of order two 𝐼 2 ):
ΔY_t = β0 + ΔY_{t-1} + u_t
If Y_t follows this equation, then ΔY_t − ΔY_{t-1} (the second difference, Δ²Y_t) is stationary. If a series
has a trend of this form, then the first difference of the series has an autoregressive root of 1.
A series that does not have a stochastic trend and is stationary is said to be integrated of order 0
𝐼 0 .
Thus, the order of integration refers to the number of times a series has to be differenced for it to be
stationary.
To sum up, if 𝑌 is 𝐼 2 then ∆𝑌 is 𝐼 1 , so ∆𝑌 has a root 1. If, however, 𝑌 is 𝐼 1 , then ∆𝑌 is
stationary. We can use hypothesis testing to find the correct integrated order by testing whether ∆𝑌
has a unit root:
𝐻0 : 𝑌 𝑖𝑠 𝐼 2
𝐻1 : 𝑌 𝑖𝑠 𝐼 1

For tests, higher power is desirable - it means that the test is more likely to reject H0 against the
alternative when the alternative is true.
The ADF was the first test developed to test for a unit root. Another test for unit roots is the DF-
GLS test, and this test has higher power. The DF-GLS test is computed in 2 steps. For a DF-GLS
test with 𝐻0 𝑌 ℎ𝑎𝑠 𝑎 𝑟𝑎𝑛𝑑𝑜𝑚 𝑤𝑎𝑙𝑘 𝑡𝑟𝑒𝑛𝑑 and 𝐻1
𝑌 𝑖𝑠 𝑠𝑡𝑎𝑡𝑖𝑜𝑛𝑎𝑟𝑦 𝑎𝑟𝑜𝑢𝑛𝑑 𝑎 𝑙𝑖𝑛𝑒𝑎𝑟 𝑡𝑖𝑚𝑒 𝑡𝑟𝑒𝑛𝑑, the steps are:

(1) The intercept and trend are estimated by GLS. This estimation is performed by computing 3
new variables, V_t, X_{1t} and X_{2t}, where V_t = Y_t − α*·Y_{t-1}, X_{1t} = 1 − α*, and X_{2t} = t − α*·(t − 1),
with α* = 1 − 13.5/T. Then V_t is regressed against X_{1t} and X_{2t}, i.e. OLS is used to estimate:

V_t = δ0·X_{1t} + δ1·X_{2t} + e_t

using the observations t = 1, …, T. The model has no intercept. The OLS estimators are then used
to compute the detrended series Y_t^d = Y_t − (δ̂0 + δ̂1·t).

(2) The DF test is used to test for a unit root in Y_t^d (this DF regression includes neither an intercept nor a
time trend). The number of lags in the regression is determined either by expert knowledge or by an
information criterion.
If H1 is that Y_t is stationary with a (possibly nonzero) mean but without a time trend, the
preceding steps are modified; in particular α* = 1 − 7/T, and X_{2t} is omitted from the regression,
so that the demeaned series is Y_t^d = Y_t − δ̂0.
The critical values for the DF-GLS test are tabulated in the textbook.
If the DF-GLS test statistic (the t-statistic on Y^d_{t-1} in the regression in the 2nd step) is less than the
critical value, then H0 is rejected.
** unit root test statistics have nonnormal distributions, i.e. even in large samples the distribution is not
normal when the regressors are nonstationary. The nonnormal distribution of the unit root test statistic is a
consequence of the nonstationarity.

16.4 Cointegration
Cointegration: When 2 or more series have the same stochastic trend in common. In this case,
regression analysis can reveal long-term relationships among time series variables, but some new
methods are needed. Mathematically, suppose X_t and Y_t are I(1); then, if for some
coefficient θ (the cointegrating coefficient) Y_t − θ·X_t is integrated of order 0, X_t and Y_t are said to
be cointegrated. Computing the difference Y_t − θ·X_t eliminates the common stochastic trend.
If X_t and Y_t are cointegrated, we do not have to take first differences to eliminate the stochastic
trend. Instead, we can compute the difference Y_t − θ·X_t - this term is stationary, meaning that it can
also be used in regression analysis with a VAR. The VAR has to be augmented by including
Y_{t-1} − θ·X_{t-1} as an additional regressor:
ΔY_t = β10 + β11·ΔY_{t-1} + ⋯ + β1p·ΔY_{t-p} + γ11·ΔX_{t-1} + ⋯ + γ1p·ΔX_{t-p} + α1·(Y_{t-1} − θ·X_{t-1}) + u_{1t}

ΔX_t = β20 + β21·ΔY_{t-1} + ⋯ + β2p·ΔY_{t-p} + γ21·ΔX_{t-1} + ⋯ + γ2p·ΔX_{t-p} + α2·(Y_{t-1} − θ·X_{t-1}) + u_{2t}

The term Y_t − θ·X_t is also called the error correction term, and the combined model of the equations
above is the vector error correction model (VECM). In the VECM, past values of the error correction
term help to predict future values of ΔY_t and/or ΔX_t.

There are 3 ways to check whether 2 variables are cointegrated, all of which should be used in
practice:
1. Use expert knowledge and economic theory

2. Graph the series and see whether they appear to have a common stochastic trend

3. Perform statistical tests for cointegration


Step 3 uses the unit root testing procedures, extended to test for cointegration:
H0: Y_t − θ·X_t has a unit root
If H0 is rejected, then Y_t and X_t are cointegrated and we can use a VECM. The details of the test
depend on whether θ is known or not. If known, we can simply construct the series z_t = Y_t − θ·X_t
and test H0: z_t has a unit root. If unknown, θ must first be estimated by OLS in the regression
Y_t = α + θ·X_t + z_t, before conducting a DF t-test (with an intercept but no time trend) for a unit
root in the residual from this regression, ẑ_t. This is called the Engle-Granger Augmented Dickey-
Fuller test (EG-ADF).
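A sketch of the EG-ADF mechanics in R when θ is unknown (y and x assumed I(1); note that the critical values for this residual-based test differ from the usual ADF table, so the p-values printed by a standard ADF routine are not strictly valid here):

step1 <- lm(y ~ x)                       # cointegrating regression Y_t = α + θ·X_t + z_t
zhat  <- residuals(step1)
library(urca)
summary(ur.df(zhat, type = "drift", selectlags = "BIC"))   # DF t-test on ẑ_t; compare with EG-ADF critical values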

The OLS estimator of the coefficient in the cointegrating regression, while consistent, has a
nonnormal distribution when X_t and Y_t are cointegrated. Therefore, inferences based on its t-
statistic can be misleading, and other estimators of the cointegrating coefficient have been
developed, e.g. the dynamic OLS (DOLS) estimator. DOLS is efficient in large samples, and
statistical inference about θ and the δ's based on HAC SEs is valid, because the DOLS estimator has a
normal distribution in large samples.
Cointegration tests can improperly reject H0 more frequently than they should, and frequently they
improperly fail to reject H0.

If we extend to more variables, the number of cointegrating coefficients increases (it is always one less
than the number of variables).
If 2 or more variables are cointegrated, then the error correction term can help forecast these
variables and, possibly, other related variables. However, even closely related series can have
different trends for subtle reasons. If variables that are not cointegrated are incorrectly modeled
using a VECM, then the error correction term will be 𝐼 1 - this then introduces a trend into the
forecast that can result in a poor out-of-sample forecast performance.
16.5 Volatility Clustering and Autoregressive Conditional Heteroskedasticity
Volatility clustering: when a time series has some periods of low volatility and some periods of
high volatility.
The variance of an asset price is a measure of the risk of owning the asset: the larger the variance of
daily stock price changes, the more an investor stands to gain or lose on a typical day. In addition,
the value of some financial derivatives, e.g. options, depends on the variance of the underlying
asset.
Forecasting variances makes it possible to have accurate forecast intervals. If the variance of the
forecast error is constant, then an approximate forecast confidence interval can be constructed as the
forecast +/- a multiple of the SER. If, on the other hand, the variance of the forecast error varies over
time, the width of the forecast interval should change over time as well.
Volatility clustering implies that the error exhibits time-varying heteroskedasticity.

ARCH and GARCH are 2 models of volatility clustering. ARCH is analogous to the distributed lag
model, and GARCH is analogous to an ADL model.
(1) ARCH
The ARCH model of order p is:
ARCH(p): σ_t² = α0 + α1·u²_{t-1} + α2·u²_{t-2} + ⋯ + α_p·u²_{t-p}

If the α's are positive, then if recent squared errors have been large, the ARCH model predicts that the
current squared error will be large in magnitude, in the sense that its variance σ_t² is large. Above, we
have described ARCH for the error of the ADL(1,1) model, but it can be applied to the error variance of any
time series regression model with an error that has a conditional mean of 0.

(2) GARCH
Extends the ARCH model to let σ_t² depend on its own lags as well as lags of the squared error:
GARCH(p, q): σ_t² = α0 + α1·u²_{t-1} + ⋯ + α_p·u²_{t-p} + φ1·σ²_{t-1} + ⋯ + φ_q·σ²_{t-q}

As GARCH is similar to the ADL model, it can provide a more parsimonious model of dynamic multipliers
than ARCH, i.e. it can capture slowly changing variances with fewer parameters than the ARCH
model due to its incorporation of lags of σ_t².

The ARCH and GARCH coefficients are estimated by maximum likelihood. The estimators are normally
distributed in large samples, so in large samples t-statistics have standard normal distributions, and
confidence intervals can be constructed as the maximum likelihood estimate ± 1.96 SEs.

Notes from Lecture


Dynamic causal effect: Estimating the effect of current and past changes in X on Y.
Ideally, we would like to test the effects of a treatment on a test group and a control group.
However, it is often difficult to split the control group from the treatment group when doing
economic research. Instead, we use time series data - i.e. a randomized controlled experiments
consists of the same subjects being given different treatments at different points in time - the
subjects are both the control and the treatment group.
If the subjects are drawn from the same distribution, i.e. if 𝑌 , 𝑋 are stationary, then the dynamic
causal effect can be deduced using OLS regression of 𝑌 on lagged values of 𝑋 . This estimator is
called the distributed lag estimator.
The question is how long lasting the effect of 𝑋 is, how many lags are relevant. Thus, the order
does not matter.
The effect of X_t on the present Y_t is measured by β1. The effect of X_t on Y_{t+1} (tomorrow) is
measured by β2, and so on. Thus, the OLS regression allows us to predict how Y will change and be
affected by current changes in X_t.
The distributed lag model:
Y_t = β0 + β1·X_t + β2·X_{t-1} + ⋯ + β_{r+1}·X_{t-r} + u_t

Assumptions:
- 𝑋 and 𝑌 are jointly stationary - no structural breaks (if structural breaks do exist, we
can estimate the model in different subperiods)

- Past and present exogeneity: X_t must be uncorrelated with the error term, i.e. X_t
must be exogenous: E(u_t | X_t, X_{t-1}, …, X_2, X_1) = 0. The coefficients in the
distributed lag model then constitute all the non-zero dynamic effects

- Large outliers are unlikely (a number of moments in the data need to be stable so that
they are not twisted)

- There is no perfect multicollinearity

Some add to the 2nd assumption that the error term has a mean 0 given all past, present, AND future
values of 𝑋 , meaning that we can use more efficient estimators than OLS in our estimation. This is
called strict exogeneity. Very strong assumption, rarely realistic.
Strict exogeneity implies past and present exogeneity, but past and present exogeneity does not
imply strict exogeneity.
Estimation
Y_t = β0 + β1·X_t + β2·X_{t-1} + ⋯ + β_{r+1}·X_{t-r} + u_t

Under the distributed lag model assumptions, OLS yields consistent estimators of the β's - of the
dynamic multipliers - that is, if the assumptions hold, the coefficients are not biased.
If there is no autocorrelation in the error terms, then the usual OLS standard errors are also valid. If
there is autocorrelation, the OLS standard errors are not consistent, meaning that we must use
autocorrelation-consistent standard errors. For this purpose, we can use heteroskedasticity- and
autocorrelation-consistent (HAC) standard errors (Newey-West standard errors).
When estimating AR, ARMA and ADL models, we do not need to use HAC SEs. In these models,
the error terms are serially uncorrelated if we have included enough lags of Y_t - as this ensures that
the error term cannot be predicted using past values of Y (and hence past values of u).

The h-period dynamic multiplier is the effect of a unit change in X_t on Y_{t+h}. The h-
period cumulative dynamic multiplier is β1 + β2 + ⋯ + β_{h+1}.
The long-run cumulative dynamic multiplier is the sum of all the individual dynamic multipliers.
The cumulative dynamic multipliers can be directly estimated by modifying the regression as
follows (shown here for r = 1):
Y_t = β0 + β1·X_t + β2·X_{t-1} + u_t
Y_t = β0 + β1·X_t − β1·X_{t-1} + β1·X_{t-1} + β2·X_{t-1} + u_t
Y_t = β0 + β1·(X_t − X_{t-1}) + (β1 + β2)·X_{t-1} + u_t
Y_t = β0 + β1·ΔX_t + (β1 + β2)·X_{t-1} + u_t
Y_t = δ0 + δ1·ΔX_t + δ2·X_{t-1} + u_t
where δ0 = β0, δ1 = β1, and δ2 = β1 + β2.
The HAC SEs on δ̂1 and δ̂2 are the SEs for the 2 cumulative multipliers.
In the general model, δ_{r+1} is the long-run cumulative dynamic multiplier.

Granger Causality
Tests whether 𝑋 is useful in predicting 𝑌 - different from actual causality, more of a correlation test.
** 𝑋 is NOT exogenous.

Lecture 17 - Vector Autoregressions


Notes from Lecture
If we wish to forecast multiple economic variables, we can develop a single model that is capable of
doing this, which also makes the forecasts mutually consistent.
The approach is called vector autoregression (VAR), and it extends the univariate autoregressions to
multiple time series, i.e. we extend the univariate regression to a vector of time series variables.
If we have 2 time series variables, Y_t and X_t, the VAR(1) is:
Y_t = β10 + β11·Y_{t-1} + β12·X_{t-1} + u_{1t}
X_t = β20 + β21·Y_{t-1} + β22·X_{t-1} + u_{2t}
These can then be combined in the vector model:
Z_t = β0 + β1·Z_{t-1} + u_t
A VAR with 𝑘 time-series variables consists of 𝑘 equations, one for each variable. The number of
lags in each of the equations, p, is the same, so the system can be called a VAR(p) - the example
above is a VAR(1).
We select the order of the VAR by using an information criterion, e.g. selecting the model that minimizes
the AIC or BIC. The criteria also take into account that we do not want to extend the lags too far and
create a model that is too large, as this makes the forecast too risky (parsimony principle).
In R, a VAR can be estimated with the VAR function (vars package).
The number of coefficients that needs to be estimated is k·(k·p + 1), i.e. it is proportional to the
number of variables in the VAR. We want to keep this number low.
We want to make sure that the variables are related to each other through theory or empirical
evidence, and the cross-correlation function can help with this. It is the analog of ACF for the
multivariate case, i.e. it is the correlation between a variable and lags of another variable.
The assumptions for a VAR are the same as the ones for the ADL model, applied to each equation. Besides
these assumptions, we allow for the possibility that the disturbances in the VAR are correlated across
equations, meaning that when one equation is shocked, the others will typically be shocked as well.
When modeling the equations together, we use more data than when modeling them separately.

Forecasting with a VAR uses the chain (iterated) algorithm, i.e. we construct the T + 1 forecast with the
help of lagged variables that are known in period T, and then, armed with the T + 1 forecasted values,
we construct the T + 2 forecast.
The impulse response function helps us learn about the dynamic properties of vector
autoregressions that are of interest to forecasters. The impact traced out is driven by a shock.

Lecture 18 - ARCH and GARCH


Notes from Lecture
The volatility of many financial and macro variables changes over time, and it is not usually
directly observable.
Volatility clustering: a series with some periods of low volatility and some periods of high
volatility. Ensures that the variance of returns can be forecasted despite the returns themselves
being difficult to forecast.

Let r_t denote the return on the asset at time t. Return series often have low or no serial
autocorrelation (though there is some dependence). We let r_t follow a simple ARMA model and
then study the volatility of the residuals:
r_t = μ_t + u_t
The ARMA(p, q) process for the mean is then:

μ_t = μ0 + Σ_{i=1}^{p} φ_i·r_{t-i} + Σ_{j=1}^{q} θ_j·u_{t-j}

where u_t = σ_t·ε_t. We have that ε_t ~ WN(0, 1) (i.e. a white noise process with mean 0 and unit
variance), and σ_t² = var(u_t | u_{t-1}, u_{t-2}, …) is the conditional variance of the error (shock) given its
past. The error thus becomes heteroskedastic - the variance changes over time.
Autoregressive Conditional Heteroskedasticity (ARCH)
The ARCH model is:
ARCH(p): σ_t² = ω + α1·u²_{t-1} + ⋯ + α_p·u²_{t-p}

** where u_t = σ_t·ε_t.
𝑝 in ARCH can be different from 𝑝 in ARMA 𝑝, 𝑞 .
We will assume that the distribution of 𝜀 is normal, because it is what drives the shocks - 𝜀 is a
sequence of IID random variables with mean 0 and variance 1 - its possible distributions are
standard normal, standardized student-t, and their skewed counterparts.

A property of ARCH is that volatility does not diverge to infinity (it varies within some fixed
range), i.e. volatility is stationary.
ARCH(1): σ_t² = α0 + α1·u²_{t-1}

In this model, E(u_t) = 0 and var(u_t) = E(u_t²) = α0 + α1·var(u_{t-1}).

Since u_t is stationary, var(u_t) = var(u_{t-1}), so var(u_t) = α0 + α1·var(u_t). Then the unconditional
variance is:

var(u_t) = α0 / (1 − α1)
Due to variance clustering.

Modeling ARCH has 3 steps:


1. Specify a mean equation (e.g. ARMA) to remove any linear dependence

2. Use the residuals of the mean equation to investigate the ARCH effects (ACF and PACF of
squared residuals) to see if the volatility is changing over time

3. Specify a volatility model if ARCH effects are present, and perform a joint estimation of the
mean and volatility equations
Generalized Autoregressive Conditional Heteroskedasticity (GARCH)
Since the ARCH recursion involves past values of the conditional variance, and these are autocorrelated,
GARCH also includes past values of the conditional variance in the model:
GARCH(p, q): σ_t² = α0 + α1·u²_{t-1} + ⋯ + α_p·u²_{t-p} + λ1·σ²_{t-1} + ⋯ + λ_q·σ²_{t-q}

Often, GARCH is more appropriate than a high-order ARCH model. Typically, we only need a low-
order GARCH model - we should stick to GARCH(1,1), GARCH(2,1) and GARCH(1,2).
One measure of the persistence (a high level of state dependency, creating smooth volatility clusters)
in the variance is the sum of the coefficients on u²_{t-1} and σ²_{t-1} in the GARCH(1,1) model. If this sum
is large, e.g. 0.96, it indicates that changes in the conditional variance are persistent.

The fitted volatility bands of the estimated conditional SDs track the observed heteroskedasticity in
the series of monthly returns. This is useful for quantifying the time-varying volatility and the
resulting risk for investors holding stocks. Furthermore, the GARCH model may also be used to
produce forecast intervals whose widths depend on the volatility of the most recent periods - here
the bands show this.
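A sketch of fitting an AR(1)-GARCH(1,1) to a return series r with the rugarch package (the package choice, specification and horizon are assumptions, not from the lecture):

library(rugarch)
spec <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
                   mean.model     = list(armaOrder = c(1, 0)),
                   distribution.model = "norm")
fit <- ugarchfit(spec, data = r)
sigma(fit)                           # fitted conditional SDs (the volatility bands)
ugarchforecast(fit, n.ahead = 10)    # volatility forecasts for time-varying interval widths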

Time Series Practice Test


1-B
2-D
3-A
1.2
4-A 0.4
2.6 , | 2.6| | 3.12|

5-A
6-B
7 - B: E(Y) = 2.4 + 0.6·E(Y) − 0.07·E(Y) + 0.1·E(Y) → (1 − 0.6 + 0.07 − 0.1)·E(Y) = 2.4 →
E(Y) = 2.4/0.37 ≈ 6.5

8 - C (it is invertible if the roots are outside the unit circle)


9 - C: 1 − 0.2·z = 0 → z = 1/0.2 = 5

10 - C

Assignment
Do not correct SEs or control for contemporaneous values of the variables in ADL, because we
assume that autocorrelation is extracted from the model when we use AR, ARMA, or ADL. This
correction becomes important if we use DL.
Remember to plot the variables to determine whether to control for a trend or a drift in DF unit root
test.
