Lecture 2: Orthogonal Projections
Prof. Lamarche
Consider the linear regression model,

    yi = x′iβ + ui,    i = 1, . . . , n,    (1)

where yi is the response variable for subject i and xi is a vector of p independent variables
that includes an intercept. The model can be written in a more convenient matrix notation
as

    y = Xβ + u,    (2)

where y and u are n×1 vectors, X is the n×p matrix with rows x′i, and β is a p×1 vector.
It follows that

    ∑_{i=1}^n xi x′i = X′X,    ∑_{i=1}^n xi yi = X′y.
1 Estimation
We can estimate the parameter β using the Method of Moments or alternatively the tradi-
tional method called Ordinary Least Squares (OLS). The OLS estimator minimizes the sum of
squared residuals,

    min_β (y − Xβ)′(y − Xβ) = min_β {y′y − β′X′y − y′Xβ + β′X′Xβ}
                            = min_β {y′y − 2β′X′y + β′X′Xβ}.

The first order condition is

    −2X′y + 2X′Xβ = 0,
which implies that X ′ (y − Xβ) = 0, and consequently, under Assumptions A1, A2 and A3:
β̂ = (X ′ X)−1 X ′ y. (3)
The OLS estimator β̂ gives the fitted vector Xβ̂ that is closest to y, in terms of the Euclidean
norm, among all vectors that lie in the space spanned by the columns of X.
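As a quick numerical illustration, the following is a minimal sketch (Python with numpy; the simulated data and coefficient values are purely illustrative and not part of the lecture) that computes β̂ from the normal equations:

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design matrix with an intercept
beta = np.array([1.0, 2.0, -0.5])                                # illustrative true coefficients
y = X @ beta + rng.normal(size=n)                                # y = X beta + u

# beta_hat = (X'X)^{-1} X'y, computed by solving the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)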
INSERT FIGURE
We can also obtain the variance of the estimator under homoskedasticity and independent
errors (Assumptions A4 and A5). Note that,

    V(β̂|X) = V((X′X)⁻¹X′y|X) = V((X′X)⁻¹X′(Xβ + u)|X)
            = (X′X)⁻¹X′V(u|X)X(X′X)⁻¹
            = (X′X)⁻¹X′σu²In X(X′X)⁻¹
            = σu²(X′X)⁻¹,

under the assumption that the variance of the error term is “spherical”, i.e., V(u|X) = σu²I.
The variance depends on σu², which is unknown and needs to be estimated. As in Lecture 1,
it is estimated using û′û/(n − p).
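In the same spirit, a minimal self-contained sketch of the variance estimate σ̂u²(X′X)⁻¹ (again with simulated, illustrative data):

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # OLS
u_hat = y - X @ beta_hat                          # residuals
sigma2_hat = u_hat @ u_hat / (n - p)              # sigma_u^2 estimated by u'u/(n - p)
V_hat = sigma2_hat * np.linalg.inv(X.T @ X)       # estimated V(beta_hat | X) = sigma^2 (X'X)^{-1}
se = np.sqrt(np.diag(V_hat))                      # standard errors
print(se)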
Note also that the homoskedasticity assumption simplifies the expression of the variance
of OLS.
Moreover, the estimator is unbiased, since by the law of iterated expectations,

    E(β̂) = E(E(β̂|X)) = β.
The OLS estimator is unbiased and linear, which means that we can write β̂ as a linear
function of the response variable y, e.g.,

    β̂ = Ay,    with A = (X′X)⁻¹X′.

Consider now an alternative linear estimator,

    β̃ = (A + B)y,

where B is a p×n matrix that may depend on X but not on y. Since E(β̃|X) = (A + B)Xβ = β + BXβ,
for β̃ to be unbiased it must be the case that BX = 0, and then E(β̃) = β. Also, the variance of the
estimator β̃ is,

    V(β̃|X) = (A + B)V(y|X)(A + B)′ = σu²(A + B)(A + B)′ = σu²((X′X)⁻¹ + BB′),

because AB′ = (X′X)⁻¹X′B′ = 0 when BX = 0.
The last equation suggests that the variance of the estimator β̃ is larger than the variance
of the estimator β̂. This can be stated in the following important theorem.

Theorem (Gauss–Markov). Under Assumptions A1–A5, the OLS estimator β̂ is the best linear
unbiased estimator of β: for any other linear unbiased estimator β̃, the difference V(β̃|X) − V(β̂|X)
is positive semi-definite.
Note that the theorem is for conditional variances. Can we claim that V (β̃) ≥ V (β̂) also?
We need a bit more work. Consider first the variance decomposition

    V(β̂) = E(V(β̂|X)) + V(E(β̂|X)).

Note that the second term is equal to zero because the estimator is unbiased, and therefore,
V(E(β̂|X)) = V(β) = 0. Thus,
    V(β̂) = E(V(β̂|X)) = σu²E((X′X)⁻¹).

Similarly,

    V(β̃) = E(V(β̃|X)) = σu²[E((X′X)⁻¹) + E(BB′)].
Therefore, we conclude that
V (β̃) ≥ V (β̂),
because E(BB ′ ) is positive semi-definite.
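A small simulation can illustrate the result: for an arbitrary matrix B with BX = 0, the difference V(β̃|X) − V(β̂|X) should be positive semi-definite. A minimal sketch (Python with numpy; all quantities are simulated and illustrative):

import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 200, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
P = X @ np.linalg.solve(X.T @ X, X.T)          # projection onto col(X)
M = np.eye(n) - P
A = np.linalg.solve(X.T @ X, X.T)              # OLS weights: beta_hat = A y
B = rng.normal(size=(p, n)) @ M                # any such B satisfies B X = 0

V_ols   = sigma**2 * np.linalg.inv(X.T @ X)    # V(beta_hat | X)
V_tilde = sigma**2 * ((A + B) @ (A + B).T)     # V(beta_tilde | X)
eigs = np.linalg.eigvalsh(V_tilde - V_ols)     # all eigenvalues should be (numerically) non-negative
print(eigs.min())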
3 Projections
Define the projection matrix
P = X(X ′ X)−1 X ′ .
The ith diagonal element of P is called the leverage of observation i and is defined as hi = x′i(X′X)⁻¹xi.
The matrix P is symmetric and idempotent, and its trace is equal to the number of inde-
pendent variables. Symmetry implies,

    P′ = (X(X′X)⁻¹X′)′ = X(X′X)⁻¹X′ = P.

Idempotency implies,

    PP = X(X′X)⁻¹X′X(X′X)⁻¹X′ = X(X′X)⁻¹X′ = P.

Lastly,

    tr(P) = tr(X(X′X)⁻¹X′) = tr(X′X(X′X)⁻¹) = tr(Ip) = p.
The vector of fitted values,

    ŷ = Xβ̂,

represents the orthogonal projection of y onto the column space of X. Note that,
ŷ = X β̂ = X(X ′ X)−1 X ′ y = P y
where the matrix P is the projection matrix. On the other hand, the residual vector
û = y − ŷ = (I − P )y = M y,
where M := I − P , which is a matrix that produces ‘residuals’. Note that we can also write
the vector of residuals as
û = M y = M (Xβ + u) = M Xβ + M u = M u
because P X = X. This implies that the residuals from the regression and the independent
variables are orthogonal,
X ′ û = X ′ M u = X ′ (I − P )u = (X ′ − X ′ P )u = 0.
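The properties of P and M are easy to verify numerically. A minimal sketch with simulated data (illustrative only):

import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto the column space of X
M = np.eye(n) - P                       # residual-maker matrix

print(np.allclose(P, P.T))              # symmetry
print(np.allclose(P @ P, P))            # idempotency
print(np.isclose(np.trace(P), p))       # tr(P) = p
print(np.allclose(P @ y, X @ np.linalg.lstsq(X, y, rcond=None)[0]))  # P y = X beta_hat
print(np.allclose(X.T @ (M @ y), 0))    # X' u_hat = 0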
Remark 1. The basic geometry of OLS can be seen in Figure 1. Let y and X = [X1 ⋮ X2] be
vectors, and define the space spanned by the columns of X as:

    S(X) := {z ∈ Rⁿ | z = Xc, c ∈ Rᵖ}.
The OLS fit ŷ = Xβ̂ is the closest vector to y in the space spanned by the columns of X.
Example 1. To illustrate the basic geometry of OLS, we present a simple example (Figure
2). Let y = (1, 3)′ and X = (1, 1)′ be vectors, and consider the regression model yi = βXi +ui .
The OLS estimator is

    β̂ = arg min_β {(y1 − β)² + (y2 − β)²},

and the associated first order condition,

    −2(y1 − β) − 2(y2 − β) = 0  ⇒  β̂ = ȳ = (1/2)∑_{i=1}^2 yi = 2.
Consider now the regression model,

    y = Xβ + Zγ + u,    (5)

and the corresponding estimator β̂. Moreover, consider the alternative regression in which
ỹ = Mz y, the residuals from regressing y on Z, is regressed on X̃ = Mz X, the residuals from
regressing each column of X on Z; denote its OLS estimator by β̃. The Frisch–Waugh–Lovell
(FWL) theorem states that β̂ = β̃ and that the residual vectors of the two regressions coincide.
Proof: Standard OLS formulae in model (5) give the following normal equations,

    X′y = X′Xβ̂ + X′Zγ̂,
    Z′y = Z′Xβ̂ + Z′Zγ̂.

Solving for γ̂ in the second equation, γ̂ = (Z′Z)⁻¹Z′(y − Xβ̂), and replacing in the first equation,
we obtain

    X′y = X′Xβ̂ + X′Z(Z′Z)⁻¹Z′(y − Xβ̂),

suggesting that

    X′(I − Z(Z′Z)⁻¹Z′)y = X′(I − Z(Z′Z)⁻¹Z′)Xβ̂.

Therefore,
β̂ = (X ′ Mz X)−1 X ′ Mz y = (X̃ ′ X̃)−1 X̃ ′ ỹ = β̃
In addition to the point estimates, the residuals are identical. To see this consider,
y = ŷ + û = X β̂ + Z γ̂ + û
and multiply by Mz ,
Mz y = Mz X β̂ + Mz Z γ̂ + Mz û ⇒ ỹ = X̃ β̃ + Mz û,
but Mz û = û because Pz û = 0.
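A minimal numerical check of the FWL result (Python with numpy; the data-generating process is invented for illustration):

import numpy as np

rng = np.random.default_rng(3)
n = 200
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # controls (with intercept)
X = rng.normal(size=(n, 1)) + Z[:, [1]]                      # correlated with Z
y = X[:, 0] + Z @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

# Full regression of y on [X Z]
W = np.column_stack([X, Z])
coef_full = np.linalg.lstsq(W, y, rcond=None)[0]

# Partialled-out (FWL) regression: regress Mz y on Mz X
Mz = np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
X_t, y_t = Mz @ X, Mz @ y
coef_fwl = np.linalg.lstsq(X_t, y_t, rcond=None)[0]

print(np.allclose(coef_full[0], coef_fwl[0]))                   # same beta_hat
print(np.allclose(y - W @ coef_full, y_t - X_t @ coef_fwl))     # same residuals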
Another way of seeing this result is to consider the following argument. Recall that
y = ŷ + û = P y + M y = X β̂ + Z γ̂ + M y (7)
Note that M is the projection matrix based on [X ⋮ Z] that produces residuals. Applying partitioned
formulae, one can show that M = I − PXZ = I − (PZ + MZ X(X′MZ X)⁻¹X′MZ).
If we multiply (7) by MZ , we have
MZ y = MZ X β̂ + MZ Z γ̂ + MZ M y (8)
Therefore, we need to show that MZ M = M or, in words, that the last terms in equations (7)
and (8) coincide. Note that
MZ M = (I − PZ )(I − PXZ )
= I − PZ − PXZ + PZ PXZ
= I − PZ − PXZ + PZ (PZ + MZ X(X ′ MZ X)−1 X ′ MZ )
= I − PZ − PXZ + PZ PZ + PZ MZ X(X ′ MZ X)−1 X ′ MZ
= I − PZ − PXZ + PZ + 0
= I − PXZ = M,

which completes the argument.
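The partitioned-projection formula and the identity MZ M = M can also be checked numerically. A small sketch with simulated matrices (illustrative only):

import numpy as np

rng = np.random.default_rng(4)
n = 100
X = rng.normal(size=(n, 2))
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])

def proj(A):
    # orthogonal projection onto the column space of A
    return A @ np.linalg.solve(A.T @ A, A.T)

Pz, Pxz = proj(Z), proj(np.column_stack([X, Z]))
Mz, M = np.eye(n) - Pz, np.eye(n) - Pxz

# P_XZ = P_Z + M_Z X (X' M_Z X)^{-1} X' M_Z
Pxz_alt = Pz + Mz @ X @ np.linalg.solve(X.T @ Mz @ X, X.T @ Mz)
print(np.allclose(Pxz, Pxz_alt))
print(np.allclose(Mz @ M, M))   # M_Z M = M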
Example 2. Consider the simple linear regression model,

    yi = γ + βxi + ui,

where in this case, X = xi and Z = 1. Before turning to the application of the FWL theorem,
we can use the method of moments to derive the slope estimator. Our model assumptions
were

    E(ui) = E(yi − γ − βxi) = 0  and  E(xi ui) = E(xi (yi − γ − βxi)) = 0.

These two equations are restrictions imposed on the data, that is, on the joint probability distri-
bution of x and y. There are two unknowns and two equations, which suggests that we can
solve for γ and β. The sample counterparts of these equations are
    (1/n) ∑_{i=1}^n (yi − γ − βxi) = 0,    (9)

    (1/n) ∑_{i=1}^n xi (yi − γ − βxi) = 0.    (10)
Solving (9) for γ gives γ̂ = ȳ − β̂x̄, and substituting into (10) and rearranging yields

    ∑_{i=1}^n (xi − x̄)(yi − ȳ) = β̂ ∑_{i=1}^n (xi − x̄)².
Following the FWL theorem, we first have to obtain the residuals in the regression,
xi = α + ϵi ⇒ x̃i = xi − α̂ = xi − x̄
Similarly,
yi = δ + vi ⇒ ỹi = yi − δ̂ = yi − ȳ
Finally, we have to regress the residuals of the response variable on the residuals of the
variable of interest,
ỹi = πx̃i + wi
giving,

    π̂ = ∑ x̃i ỹi / ∑ x̃i² = ∑_{i=1}^n (xi − x̄)(yi − ȳ) / ∑_{i=1}^n (xi − x̄)² = β̂.    (12)
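A quick numerical check of this special case, regressing the demeaned y on the demeaned x (simulated data, illustrative only):

import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Full regression with an intercept
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0][1]

# FWL with Z = 1: regress demeaned y on demeaned x
pi_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(np.isclose(beta_hat, pi_hat))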
Let's check now that the residuals in the two equations are equal, so the standard errors
of β̂ and π̂ are equal. Again, by definition,

    yi = γ̂ + β̂xi + ûi,    (13)

and

    ỹi = π̂x̃i + ŵi,    (14)

and therefore,

    ŷi = γ̂ + β̂xi = (ȳ − β̂x̄) + β̂xi = ȳ + π̂x̃i,

using the definition of the intercept γ̂. (Recall that we discussed this in Lecture 1.) Notice
then that the fitted values agree once y is demeaned, and therefore, the residual in equation (13)
is equal to the residual in equation (14): ûi = yi − ŷi = (yi − ȳ) − π̂x̃i = ỹi − π̂x̃i = ŵi.
[Figure: scatter plots of the log of average hourly earnings and of the partialled-out variable ỹ (y tilde); the displayed slope estimates are 0.0827 and 0.0904.]
Example 3. Suppose we want to estimate the returns to education considering the following
model,
lwi = β0 + β1 x1i + β2 x2i + β3 x3i + ui (15)
where lwi is the log of wages of the ith worker, and x1i, x2i, and x3i are the level of education,
the years of experience, and the years with the current employer of the ith worker. The variable of
interest is education; therefore, we can rewrite the model using our previous notation,

    y = Xβ + Zγ + u,

where X contains education and Z contains the intercept, experience, and tenure, or, for subject i,

    yi = x′iβ + z′iγ + ui.

Consider also the simple projection

    yi = x′ib + vi.

Note that the last equation gives a projection of y on xi only. In this case, we have:
    b = (E xi x′i)⁻¹(E xi yi) = (E xi x′i)⁻¹(E xi (x′iβ + z′iγ + ui))
      = (E xi x′i)⁻¹(E xi x′i)β + (E xi x′i)⁻¹(E xi z′i)γ + (E xi x′i)⁻¹(E xi ui)
      = β + (E xi x′i)⁻¹(E xi z′i)γ,

which implies that the projection in the last equation does not identify the coefficient of interest β
unless E(xi z′i) = 0 (using E(xi ui) = 0).
Let’s discuss now the trade-off we often face in applied econometrics when we do not
observe some of the independent variables. Suppose now that the true model is,

    y = Xβ + Zγ + u,    (16)

but Z is not observed and we instead estimate the “short” regression of y on X only,

    y = Xβ + v,    (17)

with OLS estimator β̂s = (X′X)⁻¹X′y. This estimator could be biased if γ ̸= 0 and the observed
independent variables X and the omitted variables Z are not orthogonal, since
    E(β̂s|X, Z) = β + E((X′X)⁻¹X′(Zγ + u)|X, Z) = β + Γγ,

with Γ = (X′X)⁻¹X′Z. Note that Γ is the matrix of coefficients from a regression of Z on X.
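A small simulation sketch of the omitted-variable formula E(β̂s|X, Z) = β + Γγ (Python with numpy; the parameter values 2.0, 1.5 and 0.8 are invented for illustration):

import numpy as np

rng = np.random.default_rng(6)
n = 5000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)          # x and z are correlated
y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(size=n)

X_short = np.column_stack([np.ones(n), x])
X_long  = np.column_stack([np.ones(n), x, z])

b_short = np.linalg.lstsq(X_short, y, rcond=None)[0][1]   # omits z: biased for beta
b_long  = np.linalg.lstsq(X_long,  y, rcond=None)[0][1]   # includes z

# Gamma: slope from regressing z on x; the bias of the short regression is roughly Gamma * gamma
Gamma = np.linalg.lstsq(X_short, z, rcond=None)[0][1]
print(b_short, b_long, 2.0 + Gamma * 1.5)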
Example 4. We start studying the bias effect from omitting a relevant variable in the
regression model. Consider that the true population model is,
    y = β0 + β1x1 + β2x2 + u,    (18)

but suppose we estimate instead the model,

    y = β0 + β1x1 + u,    (19)
which excludes the relevant variable x2 . Suppose we are interested in the slope parameter
β1 . In this case,
    E(β̃1) = β1 + β2 ∑_{i=1}^n (xi1 − x̄1)xi2 / ∑_{i=1}^n (xi1 − x̄1)²
           = β1 + β2 α̃1.
What is the interpretation of α̃1 ? Note that if we have a simple regression model of the form,
x2 = α0 + α1 x1 + ϵ,
and we obtain the OLS estimator for the slope parameter α1 , this is equal to
    α̃1 = ∑_{i=1}^n (xi1 − x̄1)(xi2 − x̄2) / ∑_{i=1}^n (xi1 − x̄1)².
The bias is zero if either β2 = 0, which means that the independent variable x2 does not
have an effect on the dependent variable y, or α̃1 = 0, which means that the independent
variables x1 and x2 are uncorrelated. If β2 ≈ 0 or α̃1 ≈ 0, the size of the bias would be small,
and therefore it may not be a cause of concern.
Sometimes it is possible to infer the sign of the bias. If the sign of the unknown parameter
β2 is equal to the sign of the correlation between the independent variables x1 and x2 , the
bias is positive. In this case, we say that β̃1 is upward biased. On the other hand, if the sign
of β2 is different from the sign of the correlation between the independent variables, the bias
is negative. It is said that in this case, the estimator β̃1 is downward biased.
Example 5. Consider that y is log of wages, x1 denotes years of education, and x2 is ability
or ‘spunk’ of some workers. We may believe that β2 is positive and different from zero
because there is evidence suggesting that more talented people receive higher wages. On
the other hand, we suspect that Corr(x1, x2) ̸= 0. More talented people tend to spend more
years in school, so we infer that the relationship between education and ability is positive.
Consequently, the bias may be positive, so the OLS estimate of β1 in the short regression is
likely to overstate the returns to education.
Although β̂s is biased, it is efficient compared to the estimator of β1 from the long model,
β̂l = (X ′ Mz X)−1 (X ′ Mz y)
To see this, first write the estimator from the short model as,

    β̂s = (X′X)⁻¹X′y = (X′X)⁻¹X′(ŷ + û),

where ŷ and û are the vectors of fitted values and residuals from the long model. Thus,
γ̂ = (Z ′ Mx Z)−1 (Z ′ Mx y)
= γ + (Z ′ Mx Z)−1 (Z ′ Mx u).
    β̂l = (X′Mz X)⁻¹(X′Mz y)
        = β + (X′Mz X)⁻¹(X′Mz u),

because X′Mz Z = 0.
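A small simulation sketch of this bias–variance trade-off: the short-regression slope is biased when γ ̸= 0 but has a smaller sampling variance than the long-regression slope (all parameter values are invented for illustration):

import numpy as np

rng = np.random.default_rng(7)
n, reps, beta, gamma = 100, 2000, 2.0, 0.5
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)          # x and the omitted z are correlated
b_short, b_long = np.empty(reps), np.empty(reps)

for r in range(reps):
    u = rng.normal(size=n)
    y = beta * x + gamma * z + u
    b_short[r] = np.linalg.lstsq(np.column_stack([np.ones(n), x]),    y, rcond=None)[0][1]
    b_long[r]  = np.linalg.lstsq(np.column_stack([np.ones(n), x, z]), y, rcond=None)[0][1]

# b_short is biased (its mean is away from beta) but has the smaller sampling variance
print(b_short.mean(), b_long.mean(), b_short.var(), b_long.var())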