Week 2: CEF & OLS
MFE 402
Dan Yavorsky
Topics for Today
• Conditional Expectation
• General CEF Model
• Linear CEF Model
• Best Linear Predictor
• Least Squares Estimator
Conditional Expectation
Joint Distribution & Density Functions
• For vectors of discrete random variables, the joint probability mass function is

f_{X,Y}(x, y) = ℙ[X = x, Y = y]

• For vectors of continuous random variables, the joint probability density function is

f_{X,Y}(x, y) = ∂²/∂x∂y F_{X,Y}(x, y)
Example: Joint Distributions
Discrete: joint pmf of (X, Y)

       Y=1     Y=2     Y=3
X=1    1/60    4/60    9/60
X=2    2/60    6/60    12/60
X=3    3/60    8/60    3/60
X=4    4/60    2/60    6/60

Continuous: [Figure: joint density of log dollars per hour and labor market experience (years)]
Marginal (Univariate) Distribution & Density Functions
The marginal univariate cumulative distribution function (or just marginal distribution) of X is
F_X(x) = ℙ[X ≤ x]
       = ℙ[X ≤ x, Y ≤ ∞]
       = lim_{y→∞} F_{X,Y}(x, y)
       = ∫_{-∞}^{∞} ∫_{-∞}^{x} f_{X,Y}(u, v) du dv
The marginal univariate probability density function (or just marginal density) of X is
f_X(x) = d/dx F_X(x)
       = ∫_{-∞}^{∞} f_{X,Y}(x, y) dy
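For the discrete example, a marginal pmf is just a row or column sum of the joint table. A minimal R sketch (the joint matrix re-creates the table from the earlier slide):

joint <- matrix(c(1, 4, 9,
                  2, 6, 12,
                  3, 8, 3,
                  4, 2, 6) / 60,
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("X=", 1:4), paste0("Y=", 1:3)))

rowSums(joint)   # marginal pmf of X: sum over y for each x
colSums(joint)   # marginal pmf of Y: 10/60, 20/60, 30/60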
Example: Marginal Distributions
Discrete: [Figure: marginal pmf of X and marginal pmf of Y from the joint table]

Continuous: [Figure: marginal density of years of experience]
Conditional Distribution & Density Functions
Think of this as the distribution of Y for the subpopulation with a specific value of X
f_{Y|X}(y|x) = ∂/∂y F_{Y|X}(y|x)
             = f_{X,Y}(x, y) / f_X(x)
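In the discrete example, conditioning on X = x rescales row x of the joint table by the marginal ℙ[X = x]. A small self-contained R sketch:

joint <- matrix(c(1, 4, 9, 2, 6, 12, 3, 8, 3, 4, 2, 6) / 60,
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("X=", 1:4), paste0("Y=", 1:3)))

cond <- joint / rowSums(joint)   # f(y|x): divide each row by the marginal f(x)
cond["X=1", ]                    # 1/14, 4/14, 9/14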
Example: Conditional Distributions
[Figure: conditional distributions f(y|x) for the discrete and continuous examples]

Conditional Expectation Function (CEF)

The conditional expectation of Y given X = x is

𝔼[Y|X = x] = ∫_{-∞}^{∞} y f_{Y|X}(y|x) dy
           = (1 / f_X(x)) ∫_{-∞}^{∞} y f_{X,Y}(x, y) dy

It is a function of x:

m(x) = 𝔼[Y|X = x]
Example: CEF
Discrete: joint pmf with row sums f(x) and column sums f(y)

       Y=1     Y=2     Y=3     f(x)
X=1    1/60    4/60    9/60    14/60
X=2    2/60    6/60    12/60   20/60
X=3    3/60    8/60    3/60    14/60
X=4    4/60    2/60    6/60    12/60
f(y)   10/60   20/60   30/60

𝔼[Y|X = 1] = 1(1/14) + 2(4/14) + 3(9/14) = 2.571

Continuous: [Figure: CEF 𝔼[Y|X] of log dollars per hour against labor market experience (years)]
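A quick R check of the 𝔼[Y|X = 1] computation, using the conditional pmf implied by the joint table:

joint <- matrix(c(1, 4, 9, 2, 6, 12, 3, 8, 3, 4, 2, 6) / 60,
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("X=", 1:4), paste0("Y=", 1:3)))

y    <- 1:3
cond <- joint / rowSums(joint)    # f(y|x), one row per x
cef  <- as.vector(cond %*% y)     # m(x) = E[Y|X = x] for x = 1..4
cef[1]                            # 2.571429, matching the slide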
Law of Iterated Expectations
𝔼 [𝔼[Y|X]] = 𝔼[Y]
That is, for each value x we have 𝔼[Y|X = x], and then we take the probability-weighted
average across the x’s:

𝔼[𝔼[Y|X]] = ∫_{ℝ^k} 𝔼[Y|X = x] f_X(x) dx
          = ∫_{ℝ^k} ∫_{ℝ} y f_{Y|X}(y|x) f_X(x) dy dx
          = ∫_{ℝ^k} ∫_{ℝ} y f_{X,Y}(x, y) dy dx
          = 𝔼[Y]
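A numerical check of the law of iterated expectations on the discrete example:

joint <- matrix(c(1, 4, 9, 2, 6, 12, 3, 8, 3, 4, 2, 6) / 60,
                nrow = 4, byrow = TRUE)

y   <- 1:3
fx  <- rowSums(joint)                  # marginal pmf of X
cef <- as.vector((joint / fx) %*% y)   # E[Y|X = x] for each x
sum(cef * fx)                          # E[E[Y|X]] = 2.3333
sum(colSums(joint) * y)                # E[Y] computed directly: the same number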
CEF Model
CEF Error
The CEF error e (some textbooks use 𝜀) is defined as the difference between Y and the CEF
evaluated at X:
e = Y - m(X)
Y = m(X) + e
Notice: e is derived from F_{X,Y}(X, Y), so its properties follow from this construction.
Some properties:
• 𝔼[e|X] = 0
• 𝔼[e] = 0
• 𝔼[Xe] = 0
The CEF m(X) is the best (in a minimum mean-squared-error sense) predictor of Y.
One way to measure the (ex ante and non-stochastic) magnitude of the prediction error is with
the mean-squared error function. For any predictor g(X):

𝔼[(Y - g(X))²] = 𝔼[(m(X) + e - g(X))²]
              = 𝔼[e²] + 2𝔼[e(m(X) - g(X))] + 𝔼[(m(X) - g(X))²]
              = 𝔼[e²] + 𝔼[(m(X) - g(X))²]        (the cross term is zero since 𝔼[e h(X)] = 0)
              ≥ 𝔼[e²]
              = 𝔼[(Y - m(X))²]
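A small simulation illustrating the point, assuming a data-generating process y = x² + e so that the CEF is m(x) = x²; any other predictor, such as the best linear one, has a larger mean-squared error:

set.seed(42)
n <- 1e5
x <- runif(n, -2, 2)
y <- x^2 + rnorm(n)              # CEF is m(x) = x^2, error variance is 1

mean((y - x^2)^2)                # MSE of the CEF: about 1
g <- lm(y ~ x)                   # best linear predictor, for comparison
mean((y - fitted(g))^2)          # strictly larger MSE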
Regression Derivative
One way to interpret the CEF is in terms of how marginal changes in the regressors x translate
into changes in the conditional expectation of the response variable Y. For example, changing x₁:

∇₁m(x) = ∂/∂x₁ m(x) = ∂/∂x₁ m(x₁, x₂, …, x_k)
Notice:
1. The effect of each variable is calculated holding the other variables constant.
• This is not “all else” but rather “all else in the model”
2. The effect of each variable is on the conditional expectation of Y, not necessarily the
actual value of Y.
• The regression derivative is the change in the actual value of Y only if the error e is
unaffected by the change in the regressor
Linear CEF
m(x) = 𝛽₀ + x₁𝛽₁ + x₂𝛽₂ + … + x_k𝛽_k

We often use shorthand notation, defining the vectors x (notice the leading 1) and 𝛽 as

x = (1, x₁, x₂, …, x_k)′   and   𝛽 = (𝛽₀, 𝛽₁, 𝛽₂, …, 𝛽_k)′

so that we can write the CEF as simply

m(x) = x′𝛽
Linear CEF Model
Y = X′ 𝛽 + e
𝔼[e|X] = 0
One of the most appealing features of the Linear CEF Model is that the coefficients are the
regression derivatives:
∇m(x) = [∇₁m(x), ∇₂m(x), …, ∇_k m(x)]′ = [∂m(x)/∂x₁, ∂m(x)/∂x₂, …, ∂m(x)/∂x_k]′ = (𝛽₁, 𝛽₂, …, 𝛽_k)′ = 𝛽

Therefore, the coefficients have simple and natural interpretations as the marginal effects of
changing one variable, holding the others constant.
Linear CEF with Non-Linear Effects
The linear CEF is less restrictive than it first appears. Take the following CEF as an example:
m(x) = 𝛽₀ + x₁𝛽₁ + x₂𝛽₂ + x₂²𝛽₃

Here, m(x) is non-linear in x₂, but we can define x₃ = x₂² to re-write the CEF as

m(x) = 𝛽₀ + x₁𝛽₁ + x₂𝛽₂ + x₃𝛽₃

This creates a linear CEF, which is sufficient for most econometric purposes (i.e., estimation and
inference for the 𝛽 parameters). The one major exception is the analysis of regression
derivatives, which should be defined with respect to the “original” variables:

∂m(x)/∂x₁ = 𝛽₁
∂m(x)/∂x₂ = 𝛽₂ + 2x₂𝛽₃
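A sketch in R on simulated data (variable names are illustrative): fit the quadratic term with I(), then evaluate the regression derivative 𝛽₂ + 2x₂𝛽₃ at a chosen value of x₂:

set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 1.0 * x2 + 0.25 * x2^2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + I(x2^2))
b   <- coef(fit)

# Derivative with respect to x2, evaluated at x2 = 2: beta2 + 2 * 2 * beta3
unname(b["x2"] + 2 * 2 * b["I(x2^2)"])   # close to -1 + 1 = 0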
Example: Linear vs Non-Linear Effects
[Figure: two panels plotting the CEF against X₁ and X₂, one with linear effects and one with a non-linear effect in X₂]
Linear CEF with Dummy Variables
If all regressors (the x’s) take a finite set of values, the CEF can be written as a linear function
of regressors.
One binary variable example:
• Suppose X represents binary gender with X = 0 for males and X = 1 for females.
• Let 𝔼[Y|X = 0] = 𝜇0 and 𝔼[Y|X = 1] = 𝜇1 and define 𝛽0 = 𝜇0 and 𝛽1 = 𝜇1 - 𝜇0
• Then m(x) = 𝛽0 + 𝛽1 x1
[Figure: CEF of wages on experience for men and for women; 𝔼[Y|X] of log dollars per hour against labor market experience (years)]
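A minimal R sketch of the binary example above (simulated data, illustrative names): regressing Y on the dummy recovers 𝛽₀ = 𝜇₀ and 𝛽₁ = 𝜇₁ − 𝜇₀:

set.seed(2)
n <- 1e4
x <- rbinom(n, 1, 0.5)                     # binary regressor
y <- ifelse(x == 0, 2.0, 2.5) + rnorm(n)   # mu0 = 2.0, mu1 = 2.5

coef(lm(y ~ x))      # intercept ~ mu0 = 2.0, slope ~ mu1 - mu0 = 0.5
tapply(y, x, mean)   # the two group means, for comparison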
Best Linear Predictor
When the CEF is Linear

Suppose m(X) = X′𝛽, so that 𝔼[e|X] = 0. Then, by iterated expectations,

𝔼[Xe] = 0
𝔼[X(Y - X′𝛽)] = 0
𝔼[XY] - 𝔼[XX′]𝛽 = 0
⇒ 𝛽 = (𝔼[XX′])⁻¹ 𝔼[XY]

The same coefficient minimizes the expected squared prediction error

S(𝛽) = 𝔼[(Y - X′𝛽)²]

and, as a quadratic function of 𝛽, the expected squared approximation error to the CEF:

d(𝛽) = 𝔼[(m(X) - X′𝛽)²]
     = 𝔼[m(X)²] - 2𝛽′𝔼[Xm(X)] + 𝛽′𝔼[XX′]𝛽

Since 𝔼[Xm(X)] = 𝔼[XY] (iterated expectations), the minimizer 𝛽 = arg min_{b∈ℝ^k} d(b) is the
same linear projection coefficient.
Three Reasons

Define the population linear regression function as X′𝛽 with 𝛽 = (𝔼[XX′])⁻¹ 𝔼[XY]. Three reasons
this coefficient is the object of interest:

1. If the CEF is linear, it equals X′𝛽.
2. X′𝛽 is the best linear predictor of Y.
3. X′𝛽 is the best linear approximation to the CEF m(X).
Ordinary Least Squares
Estimator: 𝛽̂_OLS
Samples
We’ve been discussing the best linear predictor of Y given X for a pair of random variables
(Y, X) ∈ ℝ × ℝ^k and called this predictor the linear projection model. We are now interested in
estimating the parameters of this model, in particular, the projection coefficient

𝛽 = (𝔼[XX′])⁻¹ 𝔼[XY]

We can estimate 𝛽 from samples that include joint measurements of (Y, X).

Notice that the variables {(Y₁, X₁), …, (Yᵢ, Xᵢ), …, (Yₙ, Xₙ)} are identically distributed: they are
draws from a common distribution F_{X,Y}(x, y).
Moment Estimators
The Law(s) of Large Numbers show, with varying mathematical approaches, that a sample average
(say, X̄) approaches the corresponding expectation (say, 𝔼[X]) as the sample size grows.

For example:

• Suppose 𝜇 = 𝔼[Y]. Estimate 𝜇 with 𝜇̂ = (1/n) ∑_{i=1}^{n} Yᵢ
• Suppose 𝜇 = 𝔼[h(Y)]. Estimate 𝜇 with 𝜇̂ = (1/n) ∑_{i=1}^{n} h(Yᵢ)
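A one-line illustration in R, assuming Y ~ N(0, 1) so that 𝔼[Y] = 0 and 𝔼[Y²] = 1:

set.seed(3)
y <- rnorm(1e5)
mean(y)     # moment estimator of E[Y], close to 0
mean(y^2)   # moment estimator of E[h(Y)] with h(y) = y^2, close to 1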
OLS Estimator for Linear CEF
Suppose the conditional expectation function is linear in X. We can use moment estimators of
the expectations:

Q_XX = 𝔼[XX′]  ⇒  Q̂_XX = (1/n) ∑_{i=1}^{n} XᵢXᵢ′
Q_XY = 𝔼[XY]  ⇒  Q̂_XY = (1/n) ∑_{i=1}^{n} XᵢYᵢ

and so

𝛽 = (Q_XX)⁻¹ Q_XY = (𝔼[XX′])⁻¹ 𝔼[XY]

⇒  𝛽̂ = (Q̂_XX)⁻¹ Q̂_XY = ((1/n) ∑_{i=1}^{n} XᵢXᵢ′)⁻¹ ((1/n) ∑_{i=1}^{n} XᵢYᵢ)
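A minimal R sketch of the moment-estimator route on simulated data (names and the true 𝛽 are illustrative):

set.seed(4)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))   # n x k regressor matrix, first column = 1
y <- X %*% c(1, 2, -1) + rnorm(n)   # true beta = (1, 2, -1)

Qxx_hat <- crossprod(X) / n         # (1/n) sum_i X_i X_i'
Qxy_hat <- crossprod(X, y) / n      # (1/n) sum_i X_i Y_i
solve(Qxx_hat, Qxy_hat)             # beta-hat, close to (1, 2, -1)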
OLS Estimator as Best Linear Approximation
While the conditional expectation m(X) = 𝔼[Y|X] is the best predictor of Y among all functions of X, its
functional form is typically unknown, and is thought to be linear only in special cases (e.g., a fully saturated
model of categorical variables).

Consequently, in most cases, it is more realistic to view the linear specification m(X) = X′𝛽 as an approximation.

Recall, the linear projection coefficient 𝛽 is defined as the minimizer of the expected squared error S(𝛽):

S(𝛽) = 𝔼[(Y - X′𝛽)²]

Its sample analogue is

Ŝ(𝛽) = (1/n) ∑_{i=1}^{n} (Yᵢ - Xᵢ′𝛽)²

In the simple regression case with one regressor and an intercept:

S(𝛽₀, 𝛽₁) = 𝔼[(Y - 𝛽₀ - 𝛽₁X)²]  ⇒  Ŝ(𝛽₀, 𝛽₁) = ∑_{i=1}^{n} (Yᵢ - 𝛽₀ - 𝛽₁Xᵢ)²

Setting the derivatives ∂Ŝ/∂𝛽₀ and ∂Ŝ/∂𝛽₁ to zero and solving gives the sample analogues

𝛽̂₁ = Cov(Xᵢ, Yᵢ)/Var(Xᵢ)   and   𝛽̂₀ = Ȳ - 𝛽̂₁X̄
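A quick R check that the lm slope matches the moment ratio (R's cov and var both use n − 1 denominators, so the ratio is exact):

set.seed(5)
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200)

cov(x, y) / var(x)        # beta1-hat via sample moments
coef(lm(y ~ x))[["x"]]    # identical slope from lm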
Solving for the OLS Estimator with Multiple Regressors
Now consider the general case with k regressors (including X₁ = 1): Y = X′𝛽 + e

S(𝛽) = 𝔼[(Y - X′𝛽)²]

Ŝ(𝛽) = ∑_{i=1}^{n} (Yᵢ - Xᵢ′𝛽)²
     = ∑_{i=1}^{n} (Yᵢ - Xᵢ′𝛽)′(Yᵢ - Xᵢ′𝛽)
     = ∑_{i=1}^{n} Yᵢ² - 2𝛽′ ∑_{i=1}^{n} XᵢYᵢ + 𝛽′ (∑_{i=1}^{n} XᵢXᵢ′) 𝛽

∂Ŝ(𝛽̂)/∂𝛽 = -2 ∑_{i=1}^{n} XᵢYᵢ + 2 ∑_{i=1}^{n} XᵢXᵢ′ 𝛽̂ = 0

⇒  𝛽̂ = (∑_{i=1}^{n} XᵢXᵢ′)⁻¹ (∑_{i=1}^{n} XᵢYᵢ)
Model in Matrix Notation
It is notationally and computationally convenient to write the model and statistics in matrix notation.
Define
Y = (Y₁, Y₂, …, Yₙ)′ (n×1),   X = (X₁′, X₂′, …, Xₙ′)′ (n×k, with row i equal to Xᵢ′),   e = (e₁, e₂, …, eₙ)′ (n×1)

so the model is Y = X𝛽 + e, and the sums above become matrix products:

∑_{i=1}^{n} XᵢXᵢ′ = X′X
∑_{i=1}^{n} XᵢYᵢ = X′Y

𝛽̂ = (X′X)⁻¹ X′Y
Unbiasedness of the OLS Estimator

The estimator 𝛽̂ is a function of the random sample, which means that it has a distribution; we
call this the “sampling distribution.”

If the mean of the sampling distribution is centered over the value we seek to estimate, then
the estimator is unbiased.
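A standard one-line derivation (assuming the linear CEF model, so that 𝔼[Y|X] = X𝛽) showing that 𝛽̂ is conditionally unbiased:

𝔼[𝛽̂|X] = (X′X)⁻¹X′ 𝔼[Y|X] = (X′X)⁻¹X′X𝛽 = 𝛽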
Computing the OLS Estimator in R
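The code that produced the output below is not fully preserved in these notes; a hypothetical reconstruction (assuming a wage dataset with vectors wage and exper and a subsample index sam; the response may also have been a pre-computed log-wage variable):

# Hypothetical reconstruction; 'wage', 'exper', and 'sam' are assumed objects
lmcoefs <- coef(lm(log(wage[sam]) ~ exper[sam]))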
print(lmcoefs)
(Intercept)  exper[sam]
2.876515044  0.004776039

[Figure: scatter plot of log wage against years of experience with the fitted OLS line]
Computing the OLS Estimator in R “by hand”
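The original code for this slide is not preserved in these notes; a minimal sketch of the matrix formula 𝛽̂ = (X′X)⁻¹X′Y, using the same assumed wage, exper, and sam objects as above:

X <- cbind(1, exper[sam])        # n x 2 design matrix: intercept column plus experience
Y <- log(wage[sam])              # assumed log-wage response

betahat <- solve(t(X) %*% X) %*% (t(X) %*% Y)   # (X'X)^{-1} X'Y
betahat                                         # matches coef(lm(...)) from the previous slide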
Next Time
• Residuals
• Projections
• R Squared
• CEF Error Variance
• Variance of the OLS Estimator
• Homoskedasticity
• Heteroskedasticity