Lecture 2

Econometría Avanzada, Maestría en Economía - FCE - UNLP Prof.


Orthogonal Projections
We start assuming a linear model,

$$
\underbrace{y_i}_{1\times 1} = \underbrace{x_i'}_{1\times p}\,\underbrace{\beta}_{p\times 1} + \underbrace{u_i}_{1\times 1} \quad \text{for } i = 1, 2, \ldots, n, \tag{1}
$$

where $y_i$ is the response variable for subject $i$ and $x_i$ is a vector of $p$ independent variables that includes an intercept. The model can be written in more convenient matrix notation as

$$
\underbrace{y}_{n\times 1} = \underbrace{X}_{n\times p}\,\underbrace{\beta}_{p\times 1} + \underbrace{u}_{n\times 1} \tag{2}
$$

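where $y$ is now defined as an $n \times 1$ vector, the matrix $X$ is $n \times p$, the vector of parameters $\beta$ is $p \times 1$, and the random vector $u$ is $n \times 1$. In other words,

$$
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}; \quad
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\
1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1}
\end{pmatrix}; \quad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}; \quad
u = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}.
$$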

It follows that

$$
\sum_{i=1}^{n} x_i x_i' = X'X, \qquad \sum_{i=1}^{n} x_i y_i = X'y.
$$

1 Estimation
We can estimate the parameter β using the Method of Moments or alternatively the tradi-
tional method called Ordinary Least Squares (OLS). The derivation of the OLS estimator is

based on minimizing the sum of squared errors

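$$
\|u\|^2 = \|y - X\beta\|^2,
$$

where $\|\cdot\|$ is the Euclidean norm defined as $(z'z)^{1/2} = \left(\sum_i z_i^2\right)^{1/2}$ for a generic vector $z$. Therefore, the objective function is

$$
\begin{aligned}
\min_{\beta} \|y - X\beta\|^2 &= \min_{\beta}\,(y - X\beta)'(y - X\beta) = \min_{\beta}\,\{(y' - \beta'X')(y - X\beta)\} \\
&= \min_{\beta}\,\{y'y - \beta'X'y - y'X\beta + \beta'X'X\beta\} \\
&= \min_{\beta}\,\{y'y - 2\beta'X'y + \beta'X'X\beta\}
\end{aligned}
$$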

because $\beta'X'y = y'X\beta$. The first order condition is

$$
-2X'y + 2X'X\beta = 0,
$$

which implies that $X'(y - X\beta) = 0$, and consequently, under Assumptions A1, A2 and A3:

$$
\hat\beta = (X'X)^{-1}X'y. \tag{3}
$$

Alternatively, we define the estimator as

$$
\hat\beta = \arg\min_{\beta} \|y - X\beta\|^2. \tag{4}
$$

The OLS estimator $\hat\beta$ gives the vector $X\hat\beta$ that is closest to $y$, in terms of the Euclidean norm, among all vectors that lie in the space spanned by the columns of $X$.
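The two definitions of the estimator can be compared numerically. The following is a minimal sketch on simulated data; the sample size, the parameter vector `beta_true` and the seed are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the sketch is reproducible
n, p = 200, 3                            # illustrative sample size and number of regressors

# Design matrix with an intercept column, as in equation (2)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])   # hypothetical parameter vector
u = rng.normal(size=n)
y = X @ beta_true + u

# Equation (3): solve the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equation (4): minimise ||y - Xb||^2 with a least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# First order condition: X'(y - X beta_hat) should be numerically zero
foc = X.T @ (y - X @ beta_hat)
print(beta_hat, beta_lstsq, foc)
```

Solving the linear system rather than inverting $X'X$ explicitly is a numerical design choice; both routes give the same estimate up to rounding error.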


We can also obtain the variance of the estimator under homoskedasticity and independent errors (Assumptions A4 and A5). Note that

$$
\begin{aligned}
V(\hat\beta|X) &= V\left((X'X)^{-1}X'y \,|\, X\right) = V\left((X'X)^{-1}X'(X\beta + u) \,|\, X\right) \\
&= (X'X)^{-1}X'\,V(u|X)\,X(X'X)^{-1} \\
&= (X'X)^{-1}X'\,\sigma_u^2 I_n\,X(X'X)^{-1} \\
&= \sigma_u^2 (X'X)^{-1},
\end{aligned}
$$

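under the assumption that the variance of the error term is "spherical", i.e. $V(u|X) = \sigma_u^2 I_n$. The variance depends on $\sigma_u^2$, which is unknown and needs to be estimated. As in Lecture 1, it is estimated using $\hat u'\hat u/(n - p)$, where $\hat u = y - X\hat\beta$ is the vector of residuals.
Note also that the homoskedasticity assumption simplifies the expression of the variance of OLS: without it, the term $X'V(u|X)X$ does not reduce to $\sigma_u^2 X'X$.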
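As an illustration of the variance formula, the sketch below estimates $\sigma_u^2$ with $\hat u'\hat u/(n-p)$ and builds the estimated covariance matrix. The data are simulated, and the value `sigma_u = 2.0` and the coefficients are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
sigma_u = 2.0                                   # hypothetical error standard deviation
y = X @ np.array([1.0, 0.5, -1.0]) + sigma_u * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat                        # residual vector

s2 = (u_hat @ u_hat) / (n - p)                  # estimate of sigma_u^2
V_hat = s2 * XtX_inv                            # estimate of sigma_u^2 (X'X)^{-1}
std_err = np.sqrt(np.diag(V_hat))               # standard error of each coefficient
print(s2, std_err)
```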

2 Finite Sample Properties

The OLS estimator β̂ is unbiased when the independent variables are independent of the
error term (Assumption A2),

$$
\begin{aligned}
E(\hat\beta|X) &= E\left((X'X)^{-1}X'y \,|\, X\right) \\
&= \beta + E\left((X'X)^{-1}X'u \,|\, X\right) \\
&= \beta + (X'X)^{-1}X'E(u|X) \\
&= \beta.
\end{aligned}
$$

By the law of iterated expectations, we also have that

E(β̂) = E(E(β̂|X)) = β

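The OLS estimator is unbiased and linear, which means that we can write $\hat\beta$ as a linear function of the response variable $y$, i.e.

$$
\hat\beta = Ay,
$$

where $A = (X'X)^{-1}X'$. Now, consider another linear estimator,

$$
\tilde\beta = (A + B)y,
$$

where $B$ is a $p \times n$ matrix that may depend on $X$. Since $AX = I$, we have $E(\tilde\beta|X) = (A + B)X\beta = \beta + BX\beta$, so $\tilde\beta$ is unbiased for every $\beta$ only if $BX = 0$. Under this condition $\tilde\beta - \beta = (A + B)u$, and the variance of the estimator $\tilde\beta$ is

$$
\begin{aligned}
V(\tilde\beta|X) &= E\left[(\tilde\beta - \beta)(\tilde\beta - \beta)' \,|\, X\right] \\
&= (A + B)\,E(uu'|X)\,(A + B)' \\
&= \sigma^2 (AA' + AB' + BA' + BB') \\
&= \sigma^2 (X'X)^{-1} + \sigma^2 BB' \\
&= V(\hat\beta|X) + \sigma^2 BB',
\end{aligned}
$$

where the cross terms vanish because $AB' = (X'X)^{-1}(BX)' = 0$.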

The last equation shows that the variance of the estimator $\tilde\beta$ is at least as large as the variance of the estimator $\hat\beta$, because $BB'$ is positive semi-definite. This can be stated in the following important theorem.

Theorem 1. (Gauss-Markov) If $E(u|X) = 0$ and $E(uu'|X) = \sigma^2 I$ in a linear regression model, then the OLS estimator is efficient among linear unbiased estimators, i.e. $V(\tilde\beta|X) \geq V(\hat\beta|X)$, or equivalently,

$$
V(\tilde\beta|X) - V(\hat\beta|X) \text{ is positive semi-definite.}
$$
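The theorem can be checked numerically for one particular alternative estimator. In the sketch below, the matrix `B` is built as `C @ M` with an arbitrary `C`, which is just one convenient (assumed) way of guaranteeing $BX = 0$; the value of `sigma2` is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
sigma2 = 1.5                                    # hypothetical error variance

A = np.linalg.inv(X.T @ X) @ X.T                # OLS weights: beta_hat = A y
M = np.eye(n) - X @ A                           # M = I - X(X'X)^{-1}X', so M X = 0

# Any B of the form C M satisfies B X = 0; C is an arbitrary p x n matrix
C = rng.normal(size=(p, n))
B = C @ M

V_ols = sigma2 * (A @ A.T)                      # equals sigma2 (X'X)^{-1}
V_alt = sigma2 * ((A + B) @ (A + B).T)          # variance of the alternative estimator

diff = V_alt - V_ols                            # should equal sigma2 * B B'
eigvals = np.linalg.eigvalsh((diff + diff.T) / 2)
print(eigvals)                                  # all eigenvalues should be non-negative
```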

Note that the theorem is for conditional variances. Can we claim that $V(\tilde\beta) \geq V(\hat\beta)$ also? We need a bit more work. Consider first the variance decomposition

$$
V(\hat\beta) = E(V(\hat\beta|X)) + V(E(\hat\beta|X)).
$$

The second term is equal to zero because the estimator is conditionally unbiased, so that $V(E(\hat\beta|X)) = V(\beta) = 0$. Thus,

$$
V(\hat\beta) = E(V(\hat\beta|X)) = \sigma^2 E\left((X'X)^{-1}\right).
$$

Similarly,

$$
V(\tilde\beta) = E(V(\tilde\beta|X)) = \sigma^2 \left[ E\left((X'X)^{-1}\right) + E(BB') \right].
$$

Therefore, we conclude that

$$
V(\tilde\beta) \geq V(\hat\beta),
$$

because $E(BB')$ is positive semi-definite.

3 Projections
Define the projection matrix

$$
P = X(X'X)^{-1}X'.
$$

The $i$th diagonal element of $P$ is called the leverage of observation $i$ and is defined as $h_i = x_i'(X'X)^{-1}x_i$. The matrix $P$ is symmetric and idempotent, and its trace is equal to the number of independent variables, $p$. Symmetry follows from

$$
P' = (X(X'X)^{-1}X')' = (X')'((X'X)^{-1})'X' = X(X'X)^{-1}X' = P.
$$

Idempotency follows from

$$
PP = X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = P.
$$

Finally, using the cyclic property of the trace,

$$
\operatorname{tr}(P) = \operatorname{tr}(X(X'X)^{-1}X') = \operatorname{tr}(X'X(X'X)^{-1}) = \operatorname{tr}(I_p) = p.
$$
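These three properties are easy to verify on a simulated design matrix. The sketch below is only illustrative; the dimensions `n = 30` and `p = 4` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

P = X @ np.linalg.inv(X.T @ X) @ X.T    # projection matrix
h = np.diag(P)                          # leverages h_i = x_i'(X'X)^{-1} x_i

is_symmetric = np.allclose(P, P.T)      # P' = P
is_idempotent = np.allclose(P @ P, P)   # PP = P
trace_P = np.trace(P)                   # should equal p up to rounding
print(is_symmetric, is_idempotent, trace_P, h.sum())
```

Because the leverages are the diagonal of $P$, their sum is the trace, so they add up to $p$ as well.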

The vector of fitted values, defined as

ŷ = X β̂,

represents the orthogonal projection of $y$ onto the column space of $X$. Note that

$$
\hat y = X\hat\beta = X(X'X)^{-1}X'y = Py,
$$

where the matrix $P$ is the projection matrix. On the other hand, the residual vector is

$$
\hat u = y - \hat y = (I - P)y = My,
$$

where $M := I - P$ is the matrix that produces the residuals. Note that we can also write the vector of residuals as

$$
\hat u = My = M(X\beta + u) = MX\beta + Mu = Mu,
$$

because $PX = X$ and therefore $MX = 0$. This implies that the residuals from the regression and the independent variables are orthogonal,

$$
X'\hat u = X'Mu = X'(I - P)u = (X' - X'P)u = 0.
$$

Figure 1: OLS fitting and projections

Remark 1. The basic geometry of OLS can be seen in Figure 1. Let $y$ be a vector and $X = [X_1 \,\vdots\, X_2]$ a matrix with columns $X_1$ and $X_2$, and define the space spanned by the columns of $X$ as:

$$
S(X) := \{z \in \mathbb{R}^n \mid z = Xc,\ c \in \mathbb{R}^p\}.
$$

The OLS fit gives the closest vector to $y$ in the space spanned by the columns of $X$.

Figure 2: OLS fitting with two observations

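Example 1. To illustrate the basic geometry of OLS, we present a simple example (Figure 2). Let $y = (1, 3)'$ and $X = (1, 1)'$ be vectors, and consider the regression model $y_i = \beta X_i + u_i$. The OLS estimator is

$$
\hat\beta = \arg\min_{\beta}\,\{(y_1 - \beta)^2 + (y_2 - \beta)^2\},
$$

and the associated first order condition is

$$
-2(y_1 - \hat\beta) - 2(y_2 - \hat\beta) = 0 \;\Rightarrow\; \hat\beta = \bar y = \frac{1}{2}\sum_{i=1}^{2} y_i = 2.
$$

Therefore, we obtain $\hat y = \hat\beta X = (2, 2)'$ and $\hat u = y - \hat y = (-1, 1)'$ with $X'\hat u = 0$. As the figure makes clear, it is also interesting to obtain the length of the following vectors using the Euclidean norm:

$$
\|y\| = \sqrt{1^2 + 3^2} = \sqrt{10}, \qquad \|\hat y\| = \sqrt{2^2 + 2^2} = \sqrt{8},
$$

$$
\|\hat u\| = \|y - \hat y\| = \sqrt{(y_1 - 2)^2 + (y_2 - 2)^2} = \sqrt{2}.
$$

These lengths satisfy the Pythagorean relation $\|y\|^2 = \|\hat y\|^2 + \|\hat u\|^2$, since $10 = 8 + 2$.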
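The numbers in this example can be reproduced with a few lines of code. This is a small sketch that uses only the vectors given in the example; the variable names are illustrative.

```python
import numpy as np

y = np.array([1.0, 3.0])
X = np.array([1.0, 1.0])

beta_hat = (X @ y) / (X @ X)      # scalar version of (X'X)^{-1} X'y
y_hat = beta_hat * X              # fitted values
u_hat = y - y_hat                 # residuals

# Squared Euclidean norms of y, y_hat and u_hat
sq_norms = (y @ y, y_hat @ y_hat, u_hat @ u_hat)
print(beta_hat, y_hat, u_hat, X @ u_hat, sq_norms)
```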

4 Partitioned Fit (Frisch-Waugh-Lovell Theorem)

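Consider a partitioned regression model,

$$
y = X\beta + Z\gamma + u, \tag{5}
$$

and the corresponding estimator $\hat\beta$. Moreover, consider the alternative regression model,

$$
\tilde y = \tilde X\beta + \tilde u, \tag{6}
$$

with the corresponding estimator $\tilde\beta$, where $\tilde w = M_Z w$ for a generic vector or matrix $w$, and $M_Z = I - P_Z = I - Z(Z'Z)^{-1}Z'$ is the matrix that produces residuals from a regression on $Z$.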

Theorem 2. The estimators β̂ and β̃ are numerically identical.

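Proof: Standard OLS formulae in model (5) give the following normal equations,

$$
\begin{aligned}
X'y &= X'X\hat\beta + X'Z\hat\gamma, \\
Z'y &= Z'X\hat\beta + Z'Z\hat\gamma.
\end{aligned}
$$

Solving for $\hat\gamma$ in the second equation, $\hat\gamma = (Z'Z)^{-1}Z'y - (Z'Z)^{-1}Z'X\hat\beta$, and replacing in the first equation, we obtain

$$
X'y = X'X\hat\beta + X'Z\left((Z'Z)^{-1}Z'y - (Z'Z)^{-1}Z'X\hat\beta\right),
$$

which implies that

$$
\begin{aligned}
X'(I - Z(Z'Z)^{-1}Z')y &= X'(I - Z(Z'Z)^{-1}Z')X\hat\beta, \\
\hat\beta &= (X'M_Z X)^{-1}X'M_Z y = (\tilde X'\tilde X)^{-1}\tilde X'\tilde y = \tilde\beta,
\end{aligned}
$$

where the second equality in the last line uses the fact that $M_Z = I - Z(Z'Z)^{-1}Z'$ is symmetric and idempotent.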

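In addition to the point estimates, the residuals are identical. To see this, consider

$$
y = \hat y + \hat u = X\hat\beta + Z\hat\gamma + \hat u
$$

and multiply by $M_Z$,

$$
M_Z y = M_Z X\hat\beta + M_Z Z\hat\gamma + M_Z\hat u \;\Rightarrow\; \tilde y = \tilde X\tilde\beta + M_Z\hat u,
$$

since $M_Z Z = 0$ and $\hat\beta = \tilde\beta$. Moreover, $M_Z\hat u = \hat u$ because $P_Z\hat u = 0$, which follows from $Z'\hat u = 0$.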
Another way of seeing this result is to consider the following argument. Recall that

$$
y = \hat y + \hat u = Py + My = X\hat\beta + Z\hat\gamma + My. \tag{7}
$$

Note that $M$ is a projection matrix constructed from $[X \,\vdots\, Z]$. Applying partitioned formulae, one can show that $M = I - P_{XZ} = I - (P_Z + M_Z X(X'M_Z X)^{-1}X'M_Z)$.
If we multiply (7) by $M_Z$, we have

$$
M_Z y = M_Z X\hat\beta + M_Z Z\hat\gamma + M_Z My. \tag{8}
$$

Therefore, we need to show that $M_Z M = M$ or, in words, that the last terms in equations (7) and (8) are equal. Note that

$$
\begin{aligned}
M_Z M &= (I - P_Z)(I - P_{XZ}) \\
&= I - P_Z - P_{XZ} + P_Z P_{XZ} \\
&= I - P_Z - P_{XZ} + P_Z\left(P_Z + M_Z X(X'M_Z X)^{-1}X'M_Z\right) \\
&= I - P_Z - P_{XZ} + P_Z P_Z + P_Z M_Z X(X'M_Z X)^{-1}X'M_Z \\
&= I - P_Z - P_{XZ} + P_Z + 0 \\
&= I - P_{XZ} = M,
\end{aligned}
$$

where the fifth line uses $P_Z P_Z = P_Z$ and $P_Z M_Z = 0$. This gives the result, since $My = Mu = \hat u$.

Note too that in the case of $Z'X = 0$, the "off-diagonal" elements of the inverse matrix in the partitioned formula are zero, so that $M = I - P_{XZ} = I - (P_Z + P_X)$. Again, we have that $P_Z(P_Z + P_X) = P_Z$, because $P_Z P_X = 0$ when $Z'X = 0$, and the result continues to hold.
The theorem implies that the OLS estimates of the slope parameter β can be obtained
as follows:

1. Regress y on Z and obtain the residuals ỹ

2. Regress X on Z and obtain the residuals X̃
3. Regress ỹ on X̃, and obtain the OLS estimate β̂.
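The three steps above can be sketched in code and compared with the full regression of $y$ on $[X, Z]$. The data below are simulated, and the helper `ols` and all coefficient values are assumptions made for illustration only.

```python
import numpy as np

def ols(W, v):
    """Return OLS coefficients from regressing v on the columns of W."""
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    return coef

rng = np.random.default_rng(4)
n = 300
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # controls, including an intercept
X = 0.5 * Z[:, [1]] + rng.normal(size=(n, 1))                # regressor of interest, correlated with Z
y = X @ np.array([2.0]) + Z @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

# Full regression of y on [X, Z]; the first coefficient is beta_hat
W = np.column_stack([X, Z])
full = ols(W, y)
beta_full = full[:1]
resid_full = y - W @ full

# Steps 1 and 2: residuals of y and of X after regressing each on Z
y_tilde = y - Z @ ols(Z, y)
X_tilde = X - Z @ ols(Z, X)

# Step 3: regress y_tilde on X_tilde
beta_fwl = ols(X_tilde, y_tilde)
resid_fwl = y_tilde - X_tilde @ beta_fwl
print(beta_full, beta_fwl)
```

Both the slope estimate and the residual vector should agree across the two routes up to rounding error, in line with Theorem 2.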
Example 2. Consider a simple bivariate model,

$$
y_i = \gamma + \beta x_i + u_i,
$$

where in this case, $X = x_i$ and $Z = 1$. Before turning to the application of the FWL theorem, we can use the method of moments to derive the slope estimator. Our model assumptions are

$$
E(u) = E(y - \gamma - \beta x) = 0,
$$

$$
Cov(x, u) = E(xu) - E(x)E(u) = E(xu) = E(x(y - \gamma - \beta x)) = 0.
$$

These two equations are restrictions imposed on the joint probability distribution of $x$ and $y$. There are two unknowns and two equations, which suggests that we can solve for $\gamma$ and $\beta$. The sample counterparts of these equations are

$$
\frac{1}{n}\sum_{i=1}^{n} (y_i - \gamma - \beta x_i) = 0, \tag{9}
$$

$$
\frac{1}{n}\sum_{i=1}^{n} x_i (y_i - \gamma - \beta x_i) = 0. \tag{10}
$$

From equation (9), we obtain $\hat\gamma = \bar y - \hat\beta\bar x$, and replacing in (10), we obtain

$$
\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y) = \hat\beta \sum_{i=1}^{n} (x_i - \bar x)^2,
$$

which implies that the estimator for the parameter $\beta$ is

$$
\hat\beta = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}. \tag{11}
$$

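Following the FWL theorem, we first have to obtain the residuals from the regressions of $x_i$ and $y_i$ on a constant,

$$
x_i = \alpha + \epsilon_i \;\Rightarrow\; \tilde x_i = x_i - \hat\alpha = x_i - \bar x,
$$

$$
y_i = \delta + v_i \;\Rightarrow\; \tilde y_i = y_i - \hat\delta = y_i - \bar y.
$$

Finally, we have to regress the residuals of the response variable on the residuals of the variable of interest,

$$
\tilde y_i = \pi\tilde x_i + w_i,
$$

giving

$$
\hat\pi = \frac{\sum_{i=1}^{n} \tilde x_i \tilde y_i}{\sum_{i=1}^{n} \tilde x_i^2} = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \hat\beta. \tag{12}
$$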

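Let's check now that the residuals in the two equations are equal, so that the standard errors of $\hat\beta$ and $\hat\pi$ are equal once the same degrees-of-freedom correction is used. Again, by definition,

$$
y_i = \hat\gamma + \hat\beta x_i + \hat u_i, \tag{13}
$$

$$
\tilde y_i = \hat\pi\tilde x_i + \hat w_i. \tag{14}
$$

We can write the last equation as

$$
y_i - \bar y = \hat\pi(x_i - \bar x) + \hat w_i = \hat\beta(x_i - \bar x) + \hat w_i,
$$

and therefore,

$$
\begin{aligned}
y_i &= (\bar y - \hat\beta\bar x) + \hat\beta x_i + \hat w_i \\
&= \hat\gamma + \hat\beta x_i + \hat w_i,
\end{aligned}
$$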

using the definition of the intercept $\hat\gamma$ (recall that we discussed this in Lecture 1). Notice then that the fitted values are equal, and therefore, the residual in equation (13) is equal to the residual in equation (14).

Figure 3: Partial Residual Plot. Panels: log of average hourly earnings against years of education, against experience, and against experience squared; y tilde against X tilde.

Example 3. Suppose we want to estimate the returns to education using the following model:

$$
lw_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + u_i, \tag{15}
$$

where $lw_i$ is the log of wages of the $i$th worker, and $x_{1i}$, $x_{2i}$, and $x_{3i}$ are the level of education, years of experience, and years with the current employer of the $i$th worker. The variable of interest is education, therefore we can rewrite the model using our previous notation:

$$
y = Z\gamma + X\beta + u,
$$

where $y = [lw]$, $Z = [x_2, x_3]$ and $X = [x_1]$.

We use a data set from Wooldridge (2002) to estimate the previous model. The data set contains observations for 500 individuals (Figure 3 shows the bivariate relation between the dependent and independent variables). The last plot shows the slope obtained using the partial residual plot. The return to education is 9 percent with a standard error of 0.0075.

5 Misspecification and Trade-offs

Under the classical assumptions the OLS estimator is unbiased, i.e. $E(\hat\beta) = \beta$. However, the estimator may be biased if we omit a variable in the regression model. Adding variables to the model to 'correct' for the bias is not costless, because the variance of the estimators from the 'long' version of the model (i.e., the model that includes the additional explanatory variable) may be higher than the variance of the estimators from the 'short' version of the model (i.e., the model that does not include the explanatory variable).
Under the assumptions discussed so far, consider:

$$
\begin{aligned}
y_i &= x_i'\beta + z_i'\gamma + u_i, \\
y_i &= x_i'b + v_i.
\end{aligned}
$$

Note that the second equation gives a projection of $y_i$ on $x_i$. In this case, we have:

$$
\begin{aligned}
b &= (E x_i x_i')^{-1}(E x_i y_i) = (E x_i x_i')^{-1}\left(E x_i (x_i'\beta + z_i'\gamma + u_i)\right) \\
&= (E x_i x_i')^{-1}(E x_i x_i')\beta + (E x_i x_i')^{-1}(E x_i z_i')\gamma + (E x_i x_i')^{-1}(E x_i u_i) \\
&= \beta + (E x_i x_i')^{-1}(E x_i z_i')\gamma,
\end{aligned}
$$

where the last equality uses $E x_i u_i = 0$. This implies that the projection in the second equation does not identify the coefficient of interest $\beta$ unless $E x_i z_i' = 0$ or $\gamma = 0$.
Let's discuss now the trade-off we often face in applied econometrics when we do not observe some of the independent variables. Suppose now that the true model is

$$
y = X\beta + Z\gamma + u, \tag{16}
$$

but instead we estimate the model

$$
y = X\beta + \epsilon. \tag{17}
$$

The OLS estimator of the parameter of interest $\beta$,

$$
\hat\beta_s = \arg\min_{\beta} \|y - X\beta\|^2 = (X'X)^{-1}X'y,
$$

is biased unless either $\gamma = 0$ or the variables $X$ and $Z$ are orthogonal ($X'Z = 0$). Treating $X$ and $Z$ as fixed,

$$
E(\hat\beta_s) = \beta + E\left((X'X)^{-1}X'(Z\gamma + u)\right) = \beta + \Gamma\gamma,
$$

with $\Gamma = (X'X)^{-1}X'Z$. Note that $\Gamma$ is the matrix of OLS coefficients from a regression of $Z$ on $X$.
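The bias formula can be illustrated with a small Monte Carlo experiment in which $X$ and $Z$ are held fixed and only the errors are redrawn. This is a sketch under assumed values: the coefficients `beta` and `gamma`, the correlation between `x` and `z`, and the number of replications are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 2000
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)             # omitted variable, correlated with x
X = np.column_stack([np.ones(n), x])         # regressors kept in the short model
Z = z.reshape(-1, 1)                         # regressor left out of the short model
beta = np.array([1.0, 2.0])                  # hypothetical true coefficients on X
gamma = np.array([1.5])                      # hypothetical true coefficient on Z

A = np.linalg.inv(X.T @ X) @ X.T
Gamma = A @ Z                                # coefficients from regressing Z on X

# Hold X and Z fixed and redraw only the error term in each replication
draws = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + Z @ gamma + rng.normal(size=n)
    draws[r] = A @ y                         # short-model estimate of beta

mc_mean = draws.mean(axis=0)                 # approximates E(beta_hat_s)
predicted = beta + Gamma @ gamma             # beta + Gamma * gamma
print(mc_mean, predicted)
```

The Monte Carlo average should be close to $\beta + \Gamma\gamma$ rather than to $\beta$, with the gap between the two shrinking as the number of replications grows.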

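Example 4. We start by studying the bias from omitting a relevant variable in the regression model. Suppose that the true population model is

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u, \tag{18}
$$

but we obtain OLS estimators from a model like

$$
y = \beta_0 + \beta_1 x_1 + u, \tag{19}
$$

which excludes the relevant variable $x_2$. Suppose we are interested in the slope parameter $\beta_1$. In this case,

$$
E(\tilde\beta_1) = \beta_1 + \beta_2 \frac{\sum_{i=1}^{n} (x_{i1} - \bar x_1) x_{i2}}{\sum_{i=1}^{n} (x_{i1} - \bar x_1)^2} = \beta_1 + \beta_2\tilde\alpha_1.
$$

What is the interpretation of $\tilde\alpha_1$? Note that if we have a simple regression model of the form

$$
x_2 = \alpha_0 + \alpha_1 x_1 + \epsilon,
$$

and we obtain the OLS estimator for the slope parameter $\alpha_1$, this is equal to

$$
\tilde\alpha_1 = \frac{\sum_{i=1}^{n} (x_{i1} - \bar x_1)(x_{i2} - \bar x_2)}{\sum_{i=1}^{n} (x_{i1} - \bar x_1)^2},
$$

where the numerator equals $\sum_{i=1}^{n} (x_{i1} - \bar x_1) x_{i2}$ because $\sum_{i=1}^{n} (x_{i1} - \bar x_1) = 0$. Therefore, the omitted variable bias is

$$
E(\tilde\beta_1) - \beta_1 = \beta_2\tilde\alpha_1. \tag{20}
$$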

The bias is zero if either $\beta_2 = 0$, which means that the independent variable $x_2$ does not have an effect on the dependent variable $y$, or $\tilde\alpha_1 = 0$, which means that the independent variables $x_1$ and $x_2$ are uncorrelated in the sample. If $\beta_2 \approx 0$ or $\tilde\alpha_1 \approx 0$, the size of the bias would be small, and therefore it may not be a cause of concern.
Sometimes it is possible to infer the sign of the bias. If the sign of the unknown parameter $\beta_2$ is equal to the sign of the correlation between the independent variables $x_1$ and $x_2$, the bias is positive. In this case, we say that $\tilde\beta_1$ is upward biased. On the other hand, if the sign of $\beta_2$ is different from the sign of the correlation between the independent variables, the bias is negative. In this case, we say that the estimator $\tilde\beta_1$ is downward biased.
Example 5. Suppose that $y$ is the log of wages, $x_1$ denotes years of education, and $x_2$ is the ability or 'spunk' of workers. We may believe that $\beta_2$ is positive and different from zero because there is evidence suggesting that more talented people receive higher wages. On the other hand, we suspect that $Corr(x_1, x_2) \neq 0$. Talented people tend to spend more years in school, so we infer that the relationship between education and ability is positive. Consequently the bias may be positive, so it may be the case that the OLS estimate of $\beta_1$ overestimates the effect of education on wages.
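Although $\hat\beta_s$ is biased, its variance is no larger than that of the estimator of $\beta$ from the long model,

$$
\hat\beta_l = (X'M_Z X)^{-1}(X'M_Z y).
$$

To see this, first write the estimator from the short model as

$$
\hat\beta_s = (X'X)^{-1}X'(\hat y_l + \hat u),
$$

where $\hat y_l = X\hat\beta_l + Z\hat\gamma$ and $\hat u$ are the vectors of fitted values and residuals from the long model. Since $X'\hat u = 0$,

$$
\hat\beta_s = \hat\beta_l + \Gamma\hat\gamma.
$$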
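The relation between the short and long estimators is an exact in-sample identity, so it can be verified on any data set. The sketch below uses simulated data; the coefficients and the correlation between the regressors are assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = (0.6 * X[:, 1] + rng.normal(size=n)).reshape(-1, 1)
y = X @ np.array([1.0, 2.0]) + Z @ np.array([-1.0]) + rng.normal(size=n)

k = X.shape[1]

# Long model: regress y on [X, Z] and split the coefficient vector
W = np.column_stack([X, Z])
coef_long = np.linalg.solve(W.T @ W, W.T @ y)
beta_l, gamma_hat = coef_long[:k], coef_long[k:]

# Short model: regress y on X only
beta_s = np.linalg.solve(X.T @ X, X.T @ y)

# Gamma: coefficients from regressing Z on X
Gamma = np.linalg.solve(X.T @ X, X.T @ Z)

print(beta_s, beta_l + Gamma @ gamma_hat)    # the two vectors should coincide
```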

Thus the variance of the estimator in the long model is

$$
V(\hat\beta_l) = V(\hat\beta_s - \Gamma\hat\gamma) = V(\hat\beta_s) + \Gamma V(\hat\gamma)\Gamma', \tag{21}
$$

which implies that

$$
V(\hat\beta_l) \geq V(\hat\beta_s),
$$

since the quadratic term $\Gamma V(\hat\gamma)\Gamma'$ is positive semi-definite. The result in equation (21) holds if the covariance term, which is missing from the expression, is zero. We verify the conditions under which equation (21) holds in the following theorem:
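Theorem 3. $Cov(\hat\beta_s, \Gamma\hat\gamma) = 0$.

Proof: First, recall that

$$
\begin{aligned}
\hat\gamma &= (Z'M_X Z)^{-1}(Z'M_X y) \\
&= \gamma + (Z'M_X Z)^{-1}(Z'M_X u),
\end{aligned}
$$

where $M_X = I - X(X'X)^{-1}X'$. Also, recall that

$$
\begin{aligned}
\hat\beta_s &= (X'X)^{-1}X'y \\
&= \beta + \Gamma\gamma + (X'X)^{-1}X'u.
\end{aligned}
$$

By definition, we have that

$$
\begin{aligned}
Cov(\hat\beta_s, \Gamma\hat\gamma) &= E\left\{\left(\hat\beta_s - E(\hat\beta_s)\right)\left(\hat\gamma - E(\hat\gamma)\right)'\Gamma'\right\} \\
&= E\left\{\left(\hat\beta_s - E(\hat\beta_s)\right)(\hat\gamma - \gamma)'\Gamma'\right\} \\
&= E\left\{(X'X)^{-1}X'u\left((Z'M_X Z)^{-1}Z'M_X u\right)'\Gamma'\right\},
\end{aligned}
$$

by replacing by definitions. The last expression can be written as

$$
\begin{aligned}
Cov(\hat\beta_s, \Gamma\hat\gamma) &= E\left((X'X)^{-1}X'uu'M_X Z(Z'M_X Z)^{-1}\Gamma'\right) \\
&= \sigma^2 (X'X)^{-1}X'M_X Z(Z'M_X Z)^{-1}\Gamma'
\end{aligned}
$$

because $V(u) = \sigma^2 I$. Note that $M_X X = 0$, and hence $X'M_X = 0$; therefore, the covariance is equal to zero under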



