
Week 2: CEF & OLS

MFE 402

Dan Yavorsky

Topics for Today

• Conditional Expectation
• General CEF Model
• Linear CEF Model
• Best Linear Predictor
• Least Squares Estimator

Conditional Expectation
Joint Distribution & Density Functions

We begin by generalizing from a univariate random variable to a vector of random variables (aka a multivariate random vector), using the bivariate case to illustrate:

• The joint (cumulative) distribution function

FX,Y (x, y) = ℙX,Y [X ≤ x, Y ≤ y]

• For vectors of discrete random variables, the joint probability mass function

pX,Y (x, y) = ℙ[X = x, Y = y]

• For vectors of continuous random variables, the joint probability density function

fX,Y(x, y) = ∂²/(∂x ∂y) FX,Y(x, y)
Example: Joint Distributions

Discrete example (joint pmf):

        Y=1    Y=2    Y=3
X=1    1/60   4/60   9/60
X=2    2/60   6/60  12/60
X=3    3/60   8/60   3/60
X=4    4/60   2/60   6/60

Continuous example: [figure omitted: joint density of log dollars per hour and labor market experience (years)]
Marginal (Univariate) Distribution & Density Functions

The marginal univariate cumulative distribution function (or just marginal distribution) of X is

FX(x) = ℙ[X ≤ x]
      = ℙ[X ≤ x, Y ≤ ∞]
      = lim_{y→∞} FX,Y(x, y)
      = ∫_{-∞}^{∞} ∫_{-∞}^{x} fX,Y(u, v) du dv

The marginal univariate probability density function (or just marginal density) of X is
fX(x) = d/dx FX(x)
      = ∫_{-∞}^{∞} fX,Y(x, y) dy
Example: Marginal Distributions

Discrete example (joint pmf with marginals):

         Y=1    Y=2    Y=3    fX(x)
X=1     1/60   4/60   9/60   14/60
X=2     2/60   6/60  12/60   20/60
X=3     3/60   8/60   3/60   14/60
X=4     4/60   2/60   6/60   12/60
fY(y)  10/60  20/60  30/60

Continuous example: [figures omitted: marginal density of log wages and marginal density of years of experience]
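As a quick supplementary sketch (not from the slides), the discrete joint pmf above can be stored as an R matrix (here called pxy, an arbitrary name) and the marginals recovered by summing across rows and columns:

# joint pmf from the discrete example: rows are X = 1,...,4, columns are Y = 1,2,3
pxy <- matrix(c(1, 4, 9,
                2, 6, 12,
                3, 8, 3,
                4, 2, 6) / 60,
              nrow = 4, byrow = TRUE)
rowSums(pxy)   # marginal pmf of X: 14/60, 20/60, 14/60, 12/60
colSums(pxy)   # marginal pmf of Y: 10/60, 20/60, 30/60
sum(pxy)       # total probability: 1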
Conditional Distribution & Density Functions

If X has a discrete distribution:

• The conditional distribution function of Y given X = x is

FY|X (y|x) = ℙ[Y ≤ y|X = x]

Think of this as the distribution of Y for the subpopulation with a specific value of X

• The conditional density function of Y given X = x is

fY|X(y|x) = ∂/∂y FY|X(y|x)
          = fX,Y(x, y) / fX(x)
Example: Conditional Distributions

Discrete example (joint pmf with marginals):

         Y=1    Y=2    Y=3    fX(x)
X=1     1/60   4/60   9/60   14/60
X=2     2/60   6/60  12/60   20/60
X=3     3/60   8/60   3/60   14/60
X=4     4/60   2/60   6/60   12/60
fY(y)  10/60  20/60  30/60

Continuous example: [figure omitted: conditional densities of log dollars per hour given X = 5, X = 10, and X = 25 years of experience]
fY|X (y = 1|x = 2) = (2/60)/(20/60) = 0.1
fY|X (y = 2|x = 2) = (6/60)/(20/60) = 0.3
fY|X (y = 3|x = 2) = (12/60)/(20/60) = 0.6
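A one-line check of the conditional-density formula on the same discrete example (pxy is the joint-pmf matrix defined as in the earlier sketch):

pxy <- matrix(c(1,4,9, 2,6,12, 3,8,3, 4,2,6) / 60, nrow = 4, byrow = TRUE)
pxy[2, ] / sum(pxy[2, ])   # fY|X(y | x = 2) = fX,Y(2, y) / fX(2): 0.1 0.3 0.6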
Conditional Expectation Function

The Conditional Expectation Function (CEF) of Y given X is:


𝔼[Y|X = x] = ∫_{-∞}^{∞} y fY|X(y|x) dy
           = (1/fX(x)) ∫_{-∞}^{∞} y fX,Y(x, y) dy

It is a function of x:

m(x) = 𝔼[Y|X = x]
Example: CEF

Discrete example (joint pmf with marginals):

         Y=1    Y=2    Y=3    fX(x)
X=1     1/60   4/60   9/60   14/60
X=2     2/60   6/60  12/60   20/60
X=3     3/60   8/60   3/60   14/60
X=4     4/60   2/60   6/60   12/60
fY(y)  10/60  20/60  30/60

𝔼[Y|X = 1] = 1(1/14) + 2(4/14) + 3(9/14) = 2.571
𝔼[Y|X = 2] = 1(2/20) + 2(6/20) + 3(12/20) = 2.500
𝔼[Y|X = 3] = 1(3/14) + 2(8/14) + 3(3/14) = 2.000
𝔼[Y|X = 4] = 1(4/12) + 2(2/12) + 3(6/12) = 2.167

Continuous example: [figure omitted: E[Y|X], log dollars per hour as a function of labor market experience (years)]
Law of Iterated Expectations

The expectation of the conditional expectation is the unconditional expectation.

A simple special case:

𝔼 [𝔼[Y|X]] = 𝔼[Y]

That is, for each value x, we have 𝔼[Y|X = x], and then we take the probability-weighted
average across the x’s:

𝔼[𝔼[Y|X]] = ∫_{ℝᵏ} 𝔼[Y|X = x] fX(x) dx
          = ∫_{ℝᵏ} (∫_ℝ y fY|X(y|x) dy) fX(x) dx
          = ∫_{ℝᵏ} ∫_ℝ y fX,Y(x, y) dy dx
          = 𝔼[Y]
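A small numerical check of the law of iterated expectations using the discrete example (a supplementary sketch; pxy is the joint-pmf matrix from earlier):

pxy <- matrix(c(1,4,9, 2,6,12, 3,8,3, 4,2,6) / 60, nrow = 4, byrow = TRUE)
y   <- 1:3                      # support of Y
fx  <- rowSums(pxy)             # marginal pmf of X
cef <- drop(pxy %*% y) / fx     # E[Y|X = x] for x = 1,...,4: 2.571 2.500 2.000 2.167
sum(cef * fx)                   # E[ E[Y|X] ] = 2.333...
sum(colSums(pxy) * y)           # E[Y]        = 2.333..., the same number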
CEF Model
CEF Error

The CEF error e (some textbooks use 𝜀) is defined as the difference between Y and the CEF
evaluated at X:

e = Y - m(X)

By construction, this yields the “breakdown” formula

Y = m(X) + e

Notice: e is derived from FX,Y(x, y), so its properties follow from this construction.

Some properties:

• 𝔼[e|X] = 0 (i.e., e is “mean independent” of X, but not necessarily independent of X)
• 𝔼[e] = 0
• 𝔼[h(X) e] = 0 for any function h(X) such that the expectation exists
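These properties can be verified cell-by-cell on the discrete example; the sketch below (supplementary, with h(X) = X² chosen arbitrarily) computes e = Y - m(X) on each (x, y) cell and checks that the probability-weighted sums are zero:

pxy  <- matrix(c(1,4,9, 2,6,12, 3,8,3, 4,2,6) / 60, nrow = 4, byrow = TRUE)
x    <- 1:4; y <- 1:3
cef  <- drop(pxy %*% y) / rowSums(pxy)        # m(x) = E[Y|X = x]
ymat <- matrix(y, 4, 3, byrow = TRUE)         # Y value in each cell
e    <- ymat - matrix(cef, 4, 3)              # e = Y - m(X) in each cell
rowSums(pxy * e) / rowSums(pxy)               # E[e | X = x] = 0 for every x
sum(pxy * e)                                  # E[e] = 0
sum(pxy * e * x^2)                            # E[h(X) e] = 0 with h(X) = X^2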
CEF as the Best Predictor

The CEF m(X) is the best (in a minimum mean-squared error sense) predictor of Y.

Suppose you have X and want to predict Y: Ŷ = g(X)

One way to measure the (ex ante and non-stochastic) magnitude of the prediction error is with
the mean-squared error function:
𝔼[(Y - g(X))²] = 𝔼[(m(X) + e - g(X))²]
               = 𝔼[e²] + 2𝔼[e(m(X) - g(X))] + 𝔼[(m(X) - g(X))²]
               = 𝔼[e²] + 𝔼[(m(X) - g(X))²]
               ≥ 𝔼[e²]
               = 𝔼[(Y - m(X))²]

The cross term vanishes because 𝔼[h(X) e] = 0 with h(X) = m(X) - g(X).
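The inequality can also be checked numerically on the discrete example (a supplementary sketch; the alternative predictor 2.2 + 0.1x is arbitrary):

pxy  <- matrix(c(1,4,9, 2,6,12, 3,8,3, 4,2,6) / 60, nrow = 4, byrow = TRUE)
x    <- 1:4; y <- 1:3
ymat <- matrix(y, 4, 3, byrow = TRUE)
cef  <- drop(pxy %*% y) / rowSums(pxy)                 # m(x)
mse  <- function(pred) sum(pxy * (ymat - pred)^2)      # E[(Y - g(X))^2]; pred is a 4x3 matrix of g(x) values
mse(matrix(cef, 4, 3))              # using m(X): the smallest attainable MSE
mse(matrix(sum(pxy * ymat), 4, 3))  # using the constant E[Y]: larger
mse(matrix(2.2 + 0.1 * x, 4, 3))    # an arbitrary function of X: larger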
Regression Derivative

One way to interpret the CEF is in terms of how marginal changes in the regressors x change the conditional expectation of the response variable Y. For example, changing x1:

∇1 m(x) = (∂/∂x1) m(x) = (∂/∂x1) m(x1, x2, …, xk)

Notice:

1. The effect of each variable is calculated holding the other variables constant.
• This is not “all else” but rather “all else in the model”

2. The effect of each variable is on the conditional expectation of Y, not necessarily the
actual value of Y.
• The regression derivative is the change in the actual value of Y only if the error e is
unaffected by the change in the regressor
Linear CEF
Linear CEF

An important special case is when the CEF is linear in X:

m(x) = 𝛽0 + x1 𝛽1 + x2 𝛽2 + … + xk 𝛽k

We often use shorthand notation, defining the vectors x (note the leading 1) and 𝛽 as

x = (1, x1, x2, …, xk)′  and  𝛽 = (𝛽0, 𝛽1, 𝛽2, …, 𝛽k)′

so that we can write the CEF as simply

m(x) = x′𝛽
Linear CEF Model

Y = X′ 𝛽 + e

𝔼[e|X] = 0

One of the most appealing features of the Linear CEF Model is that the coefficients are the
regression derivatives:

∇m(x) = (∇1 m(x), ∇2 m(x), …, ∇k m(x))′ = ((∂/∂x1) m(x), (∂/∂x2) m(x), …, (∂/∂xk) m(x))′ = (𝛽1, 𝛽2, …, 𝛽k)′ = 𝛽

Therefore, the coefficients have simple and natural interpretations as the marginal effects of changing one variable, holding the others constant.
Linear CEF with Non-Linear Effects

The linear CEF is less restrictive than it first appears. Take the following CEF as an example:

m(x) = 𝛽0 + x1 𝛽1 + x2 𝛽2 + x2² 𝛽3

Here, m(x) is non-linear in x2, but we can define x3 = x2² to re-write the CEF as

m(x) = 𝛽0 + x1 𝛽1 + x2 𝛽2 + x3 𝛽3

This creates a linear CEF, which is sufficient for most econometric purposes (i.e., estimation and
inference of the 𝛽 parameters). The one major exception is with the analysis of regression
derivatives, which should be defined with respect to the “original” variables:
(∂/∂x1) m(x) = 𝛽1
(∂/∂x2) m(x) = 𝛽2 + 2x2 𝛽3
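A small illustrative sketch (simulated data with made-up coefficient values): in R one can include the squared regressor with I(x2^2), and the regression derivative with respect to x2 then varies with x2:

set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 2 * x2 - 0.3 * x2^2 + rnorm(n)   # CEF is linear in (x1, x2, x2^2)

fit <- lm(y ~ x1 + x2 + I(x2^2))
b   <- coef(fit)
b["x2"] + 2 * 1.0 * b["I(x2^2)"]   # estimated d m(x)/d x2, evaluated at x2 = 1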
Example: Linear vs Non-Linear Effects

[Figures omitted: linear regression surface and non-linear regression surface of Y over (X1, X2)]
Linear CEF with Dummy Variables

If all regressors (the x’s) take a finite set of values, the CEF can be written as a linear function
of regressors.
One binary variable example:
• Suppose X represents binary gender with X = 0 for males and X = 1 for females.
• Let 𝔼[Y|X = 0] = 𝜇0 and 𝔼[Y|X = 1] = 𝜇1 and define 𝛽0 = 𝜇0 and 𝛽1 = 𝜇1 - 𝜇0
• Then m(x) = 𝛽0 + 𝛽1 x1

Two binary variables example:


• Suppose X1 represents binary gender (X1 = 1 is female) and X2 represents marital status (X2 = 1 is married)
• Let m(x) = 𝛽0 + 𝛽1 x1 + 𝛽2 x2 + 𝛽3 x1 x2
• Then:
• 𝔼[Y|X1 = 0, X2 = 0] = 𝛽0
• 𝔼[Y|X1 = 1, X2 = 0] = 𝛽0 + 𝛽1
• 𝔼[Y|X1 = 0, X2 = 1] = 𝛽0 + 𝛽2
• 𝔼[Y|X1 = 1, X2 = 1] = 𝛽0 + 𝛽1 + 𝛽2 + 𝛽3
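In the two-binary-variable case the model is fully saturated, so the fitted coefficients reproduce the four cell means exactly. A supplementary sketch with simulated data (coefficient values made up):

set.seed(2)
n  <- 2000
x1 <- rbinom(n, 1, 0.5)                         # e.g., 1 = female
x2 <- rbinom(n, 1, 0.5)                         # e.g., 1 = married
y  <- 10 + 2 * x1 + 3 * x2 + 1.5 * x1 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x1:x2)
sum(coef(fit))                # beta0 + beta1 + beta2 + beta3
mean(y[x1 == 1 & x2 == 1])    # sample mean of the (X1 = 1, X2 = 1) cell: identical (saturated model)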
Linear CEF with Continuous and Dummy Variables

[Figures omitted: CEF of wages on experience for men and CEF of wages on experience for women; E[Y|X] in log dollars per hour plotted against labor market experience (years) for each group]
Best Linear Predictor
When the CEF is Linear

Suppose the CEF is a linear function of x: m(X) = X′ 𝛽

Recall that 𝔼[Xe] = 0.

Then:
𝔼[Xe] = 0
𝔼[X(Y - X′𝛽)] = 0
𝔼[XY] - 𝔼[XX′]𝛽 = 0
𝛽 = (𝔼[XX′])⁻¹ 𝔼[XY]

We’ve assumed a couple of technical, mathematical properties:

• The means, variances, and covariance between X and Y are finite
• 𝔼[XX′] is positive definite

BHE uses the notation 𝛽 = QXX⁻¹ QXY
Best Linear Predictor

A linear predictor for Y is a function X′ 𝛽 for some 𝛽 ∈ ℝk .

The mean-squared prediction error is

S(𝛽) = 𝔼[(Y - X′𝛽)²]

As a quadratic function of 𝛽 :

S(𝛽) = 𝔼[Y²] - 2𝛽′𝔼[XY] + 𝛽′𝔼[XX′]𝛽

Take a first-order condition and solve for 𝛽 :


(∂/∂𝛽) S(𝛽) = -2𝔼[XY] + 2𝔼[XX′]𝛽 = 0

⇒ 𝛽 = (𝔼[XX′])⁻¹ 𝔼[XY]
The minimizer 𝛽 = arg min_{b∈ℝᵏ} S(b) is called the linear projection coefficient
Best Linear Approximation

A linear approximation to the CEF m(X) is a function X′ 𝛽 for some 𝛽 ∈ ℝk .

The mean-squared approximation error is

d(𝛽) = 𝔼[(m(X) - X′𝛽)²]
     = 𝔼[m(X)²] - 2𝛽′𝔼[X m(X)] + 𝛽′𝔼[XX′]𝛽

Take a first-order condition and solve for 𝛽:

(∂/∂𝛽) d(𝛽) = -2𝔼[X m(X)] + 2𝔼[XX′]𝛽 = 0
⇒ 𝛽 = (𝔼[XX′])⁻¹ 𝔼[X m(X)]
     = (𝔼[XX′])⁻¹ 𝔼[XY]

The last equality holds because 𝔼[XY] = 𝔼[X m(X)] by the law of iterated expectations.

The minimizer 𝛽 = arg min_{b∈ℝᵏ} d(b) is the same linear projection coefficient
Three Reasons

Define the population linear regression function as X′𝛽 with 𝛽 = (𝔼[XX′])⁻¹ 𝔼[XY]

1. When the CEF is linear, the population regression function is the CEF itself
2. The population regression function is the best linear predictor of Y given X
3. The population regression function is the best linear approximation to 𝔼[Y|X] (i.e., to m(X))
Ordinary Least Squares Estimator: 𝛽̂OLS
Samples

We’ve been discussing the best linear predictor of Y given X for a pair of random variables
(Y, X) ∈ ℝ × ℝk and called this predictor the linear projection model. We are now interested in
estimating the parameters of this model, in particular, the projection coefficient

𝛽 = (𝔼[XX′])⁻¹ 𝔼[XY]

We can estimate 𝛽 from samples which include joint measurements of (Y, X).

Notice that

• the random variables are Y and X


• the observations in the sample are Yi and Xi
• n is the sample size, such that the dataset is {(Yi , Xi ) ∶ i = 1, … , n}

The variables {(Y1, X1), …, (Yi, Xi), …, (Yn, Xn)} are identically distributed; they are draws from a common distribution FX,Y(x, y).
Moment Estimators

The Law(s) of Large Numbers show, with varying mathematical approaches, that the sample average (say, X̄) approaches the expectation (say, 𝔼[X]) as the sample size grows.

We use this idea to develop “analog” or “plug-in” or “moment” estimators.

For example,

• Suppose 𝜇 = 𝔼[Y]
  • Estimate 𝜇 with 𝜇̂ = (1/n) ∑_{i=1}^n Yi
• Suppose 𝜇 = 𝔼[h(Y)]
  • Estimate 𝜇 with 𝜇̂ = (1/n) ∑_{i=1}^n h(Yi)

There’s an entire method of econometrics devoted to the idea of Generalized Method of


Moments estimation.
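A minimal sketch of the plug-in idea (simulated data; the function h is chosen arbitrarily):

set.seed(3)
yobs <- rexp(10000, rate = 2)     # simulated sample; true E[Y] = 0.5
mean(yobs)                        # plug-in estimate of mu = E[Y]
mean(log(1 + yobs))               # plug-in estimate of mu = E[h(Y)] with h(y) = log(1 + y)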

OLS Estimator for Linear CEF

Suppose the conditional expectation function is linear in X. We can use moment estimators of
the expectations:

QXX = 𝔼[XX′]  ⇒  Q̂XX = (1/n) ∑_{i=1}^n Xi Xi′
QXY = 𝔼[XY]  ⇒  Q̂XY = (1/n) ∑_{i=1}^n Xi Yi

and so

𝛽 = QXX⁻¹ QXY = (𝔼[XX′])⁻¹ 𝔼[XY]

⇒ 𝛽̂ = Q̂XX⁻¹ Q̂XY = ((1/n) ∑_{i=1}^n Xi Xi′)⁻¹ ((1/n) ∑_{i=1}^n Xi Yi)
OLS Estimator as Best Linear Approximation
While the conditional expectation m(X) = 𝔼[Y|X] is the best predictor of Y among all functions of X, its functional form is typically unknown and is thought to be linear only in special cases (e.g., a fully saturated model of categorical variables).

Consequently, in most cases, it is more realistic to view the linear specification m(X) = X′ 𝛽 as an approximation.

Recall, the linear projection coefficient 𝛽 is defined as the minimizer of the expected squared error S(𝛽):

S(𝛽) = 𝔼[(Y - X′𝛽)²]

The moment estimator of S(𝛽) is the sample average:

Ŝ(𝛽) = (1/n) ∑_{i=1}^n (Yi - Xi′𝛽)²

𝛽̂ = arg min Ŝ(𝛽) is the ordinary least squares estimator because

• it minimizes the sum of squared errors


• it solves k equations with k unknowns using “ordinary” techniques
Solving for the OLS Estimator with One Regressor
Consider the case where k = 1, so there is a scalar regressor X and two coefficients: Y = 𝛽0 + 𝛽1 X + e

S(𝛽0, 𝛽1) = 𝔼[(Y - 𝛽0 - 𝛽1 X)²]  ⇒  Ŝ(𝛽0, 𝛽1) = ∑_{i=1}^n (Yi - 𝛽0 - 𝛽1 Xi)²

First order condition for 𝛽0:

(∂/∂𝛽0) Ŝ(𝛽̂0, 𝛽̂1) = -2 ∑_{i=1}^n (Yi - 𝛽̂0 - 𝛽̂1 Xi) = 0

Dividing by -2n gives Ȳ - 𝛽̂0 - 𝛽̂1 X̄ = 0, so 𝛽̂0 = Ȳ - 𝛽̂1 X̄

First order condition for 𝛽1:

(∂/∂𝛽1) Ŝ(𝛽̂0, 𝛽̂1) = -2 ∑_{i=1}^n (Yi - 𝛽̂0 - 𝛽̂1 Xi) Xi = 0

Substituting 𝛽̂0 = Ȳ - 𝛽̂1 X̄ and dividing by n:

((1/n) ∑_{i=1}^n Yi Xi - Ȳ X̄) - 𝛽̂1 ((1/n) ∑_{i=1}^n Xi² - X̄²) = 0

⇒ 𝛽̂1 = Cov(Xi, Yi) / Var(Xi)   (the sample covariance divided by the sample variance)
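A quick numerical check of the closed-form solution (simulated data; note that R’s cov() and var() both use the n - 1 denominator, so the ratio matches the formula above):

set.seed(4)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

b1 <- cov(x, y) / var(x)           # slope from the covariance/variance formula
b0 <- mean(y) - b1 * mean(x)       # intercept from the first FOC
c(b0, b1)
coef(lm(y ~ x))                    # the same two numbers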
Solving for the OLS Estimator with Multiple Regressors
Now consider the general case with k regressors (including X1 = 1): Y = X′𝛽 + e

S(𝛽) = 𝔼[(Y - X′𝛽)²]

Ŝ(𝛽) = ∑_{i=1}^n (Yi - Xi′𝛽)² = ∑_{i=1}^n (Yi - Xi′𝛽)(Yi - Xi′𝛽) = ∑_{i=1}^n Yi² - 2𝛽′ ∑_{i=1}^n Xi Yi + 𝛽′ (∑_{i=1}^n Xi Xi′) 𝛽

The k first order conditions:

(∂/∂𝛽) Ŝ(𝛽̂) = -2 ∑_{i=1}^n Xi Yi + 2 ∑_{i=1}^n Xi Xi′ 𝛽̂ = 0

⇒ 𝛽̂ = (∑_{i=1}^n Xi Xi′)⁻¹ (∑_{i=1}^n Xi Yi)

BHE uses the notation 𝛽̂ = Q̂XX⁻¹ Q̂XY
Model in Matrix Notation

It is notationally and computationally convenient to write the model and statistics in matrix notation.

The n linear equations Yi = X′i 𝛽 + ei make a system of n equations, which we stack:


Y1 = X1′𝛽 + e1
Y2 = X2′𝛽 + e2
⋮
Yn = Xn′𝛽 + en

Define the n×1 vector Y = (Y1, Y2, …, Yn)′, the n×k matrix X whose i-th row is Xi′, and the n×1 vector e = (e1, e2, …, en)′.

Then the system of n equations can be written compactly as Y = X𝛽 + e
Notational Simplifications in Matrix Form

With the matrix notation…

…sample sums can be written as

∑_{i=1}^n Xi Xi′ = X′X

∑_{i=1}^n Xi Yi = X′Y

…and the least squares estimator is

𝛽̂ = (X′X)⁻¹ X′Y
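A tiny sketch (simulated X and Y) confirming that the stacked-matrix products equal the sums of per-observation terms:

set.seed(6)
X <- cbind(1, rnorm(5), rnorm(5))   # n = 5, k = 3 (first column is the intercept)
Y <- rnorm(5)

XtX <- matrix(0, 3, 3); XtY <- numeric(3)
for (i in 1:5) {
  XtX <- XtX + X[i, ] %*% t(X[i, ])  # accumulate Xi Xi'
  XtY <- XtY + X[i, ] * Y[i]         # accumulate Xi Yi
}
max(abs(XtX - t(X) %*% X))   # 0
max(abs(XtY - t(X) %*% Y))   # 0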
Unbiasedness of the OLS Estimator

𝛽̂ is a function of the random variables X and Y, and so it is itself a random variable.

This means that it has a distribution, which we call the “sampling distribution.”

If the mean of the sampling distribution is centered over the value we seek to estimate, then
the estimator is unbiased.

𝔼[𝛽̂|X] = 𝔼[(X′X)⁻¹X′Y | X]
        = 𝔼[(X′X)⁻¹X′(X𝛽 + e) | X]
        = (X′X)⁻¹X′X𝛽 + (X′X)⁻¹X′𝔼[e|X]
        = 𝛽 + 0

⇒ 𝛽̂ is an unbiased estimator for 𝛽 (conditional on X, and hence unconditionally by the Law of Iterated Expectations).
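A small Monte Carlo sketch of this property (simulated design and made-up true coefficients): averaging 𝛽̂ over many repeated samples should recover 𝛽.

set.seed(5)
beta <- c(1, 2)                          # true coefficients (intercept, slope)
betahats <- replicate(2000, {
  x <- cbind(1, rnorm(100))              # fresh sample of n = 100 each replication
  y <- x %*% beta + rnorm(100)
  drop(solve(t(x) %*% x) %*% (t(x) %*% y))
})
rowMeans(betahats)                       # approximately c(1, 2)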
Computing the OLS Estimator in R

dat   <- read.table("support/cps09mar.txt")
exper <- dat[,1] - dat[,4] - 6
lwage <- log( dat[,5]/(dat[,6]*dat[,7]) )
sam   <- dat[,11]==4 & dat[,12]==7 & dat[,2]==0

lmcoefs <- coef(lm(lwage[sam] ~ exper[sam]))
print(lmcoefs)

(Intercept)  exper[sam]
2.876515044 0.004776039

plot(x=exper[sam], y=lwage[sam], pch=20,
     col="dodgerblue4", ylab="Log Wage",
     xlab="Years of Experience", main="")
abline(a=lmcoefs[1], b=lmcoefs[2],
       col="firebrick", lwd=2)

[Figure omitted: scatterplot of log wage against years of experience with the fitted regression line]
Computing the OLS Estimator in R “by hand”

y <- lwage[sam]
x <- cbind(1, exper[sam])

xx <- t(x) %*% x
xy <- t(x) %*% y
betahat <- solve(xx) %*% xy
print(betahat)

            [,1]
[1,] 2.876515044
[2,] 0.004776039

ssq <- function(beta, x, y) {
  sum((y - x %*% beta)^2)
}
out <- optim(par=c(0,0), fn=ssq,
             x=x, y=y,
             control=list(reltol=1e-12))
optimcoefs <- out$par
print(optimcoefs)

[1] 2.876516802 0.004776023
Next Time

• Residuals
• Projections
• R Squared
• CEF Error Variance
• Variance of the OLS Estimator
• Homoskedasticity
• Heteroskedasticity

