Week 2: CEF & OLS
MFE 402
Dan Yavorsky
Topics for Today
• Conditional Expectation
• General CEF Model
• Linear CEF Model
• Best Linear Predictor
• Least Squares Estimator
Conditional Expectation
Joint Distribution & Density Functions
• For vectors of discrete random variables, the joint probability mass function is

f_{X,Y}(x, y) = ℙ[X = x, Y = y]

• For vectors of continuous random variables, the joint probability density function is

f_{X,Y}(x, y) = ∂²/∂x∂y F_{X,Y}(x, y)
Example: Joint Distributions
Discrete: joint pmf of (X, Y)

       Y=1     Y=2     Y=3
X=1    1/60    4/60    9/60
X=2    2/60    6/60    12/60
X=3    3/60    8/60    3/60
X=4    4/60    2/60    6/60

Continuous: [Figure: joint density of log dollars per hour and labor market experience (years)]
Marginal (Univariate) Distribution & Density Functions
The marginal univariate cumulative distribution function (or just marginal distribution) of X is
F_X(x) = ℙ[X ≤ x]
       = ℙ[X ≤ x, Y ≤ ∞]
       = lim_{y→∞} F_{X,Y}(x, y)
       = ∫_{-∞}^{∞} ∫_{-∞}^{x} f_{X,Y}(u, v) du dv
The marginal univariate probability density function (or just marginal density) of X is
f_X(x) = d/dx F_X(x)
       = ∫_{-∞}^{∞} f_{X,Y}(x, y) dy
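For the discrete example, a marginal pmf is just a row or column sum of the joint table. A minimal R sketch (the joint matrix re-creates the table from the earlier slide):

joint <- matrix(c(1, 4, 9,
                  2, 6, 12,
                  3, 8, 3,
                  4, 2, 6) / 60,
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("X=", 1:4), paste0("Y=", 1:3)))

rowSums(joint)   # marginal pmf of X: sum over y for each x
colSums(joint)   # marginal pmf of Y: 10/60, 20/60, 30/60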
Example: Marginal Distributions
Discrete: [Figure: marginal pmf of X and marginal pmf of Y from the joint table]

Continuous: [Figure: marginal density of years of experience]
Conditional Distribution & Density Functions
Think of this as the distribution of Y for the subpopulation with a specific value of X
f_{Y|X}(y|x) = ∂/∂y F_{Y|X}(y|x)
             = f_{X,Y}(x, y) / f_X(x)
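In the discrete example, conditioning on X = x rescales row x of the joint table by the marginal ℙ[X = x]. A small self-contained R sketch:

joint <- matrix(c(1, 4, 9, 2, 6, 12, 3, 8, 3, 4, 2, 6) / 60,
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("X=", 1:4), paste0("Y=", 1:3)))

cond <- joint / rowSums(joint)   # f(y|x): divide each row by the marginal f(x)
cond["X=1", ]                    # 1/14, 4/14, 9/14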
Example: Conditional Distributions
[Figure: conditional distributions f(y|x) for the discrete and continuous examples]

Conditional Expectation Function (CEF)

The conditional expectation of Y given X = x is

𝔼[Y|X = x] = ∫_{-∞}^{∞} y f_{Y|X}(y|x) dy
           = (1 / f_X(x)) ∫_{-∞}^{∞} y f_{X,Y}(x, y) dy

It is a function of x:

m(x) = 𝔼[Y|X = x]
Example: CEF
Discrete: joint pmf with row sums f(x) and column sums f(y)

       Y=1     Y=2     Y=3     f(x)
X=1    1/60    4/60    9/60    14/60
X=2    2/60    6/60    12/60   20/60
X=3    3/60    8/60    3/60    14/60
X=4    4/60    2/60    6/60    12/60
f(y)   10/60   20/60   30/60

𝔼[Y|X = 1] = 1(1/14) + 2(4/14) + 3(9/14) = 2.571

Continuous: [Figure: CEF 𝔼[Y|X] of log dollars per hour against labor market experience (years)]
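A quick R check of the 𝔼[Y|X = 1] computation, using the conditional pmf implied by the joint table:

joint <- matrix(c(1, 4, 9, 2, 6, 12, 3, 8, 3, 4, 2, 6) / 60,
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("X=", 1:4), paste0("Y=", 1:3)))

y    <- 1:3
cond <- joint / rowSums(joint)    # f(y|x), one row per x
cef  <- as.vector(cond %*% y)     # m(x) = E[Y|X = x] for x = 1..4
cef[1]                            # 2.571429, matching the slide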
Law of Iterated Expectations
𝔼 [𝔼[Y|X]] = 𝔼[Y]
That is, for each value x we have 𝔼[Y|X = x], and then we take the probability-weighted
average across the x’s:

𝔼[𝔼[Y|X]] = ∫_{ℝ^k} 𝔼[Y|X = x] f_X(x) dx
          = ∫_{ℝ^k} ∫_{ℝ} y f_{Y|X}(y|x) f_X(x) dy dx
          = ∫_{ℝ^k} ∫_{ℝ} y f_{X,Y}(x, y) dy dx
          = 𝔼[Y]
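A numerical check of the law of iterated expectations on the discrete example:

joint <- matrix(c(1, 4, 9, 2, 6, 12, 3, 8, 3, 4, 2, 6) / 60,
                nrow = 4, byrow = TRUE)

y   <- 1:3
fx  <- rowSums(joint)                  # marginal pmf of X
cef <- as.vector((joint / fx) %*% y)   # E[Y|X = x] for each x
sum(cef * fx)                          # E[E[Y|X]] = 2.3333
sum(colSums(joint) * y)                # E[Y] computed directly: the same number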
CEF Model
CEF Error
The CEF error e (some textbooks use 𝜀) is defined as the difference between Y and the CEF
evaluated at X:
e = Y - m(X)
Y = m(X) + e
Notice: e is derived from F_{X,Y}(X, Y), so its properties follow from this construction.
Some properties:
• 𝔼[e|X] = 0
• 𝔼[e] = 0
• 𝔼[Xe] = 0
The CEF m(X) is the best (in a minimum mean-squared-error sense) predictor of Y.
One way to measure the (ex ante and non-stochastic) magnitude of the prediction error is with
the mean-squared error function. For any predictor g(X):

𝔼[(Y - g(X))²] = 𝔼[(m(X) + e - g(X))²]
              = 𝔼[e²] + 2𝔼[e(m(X) - g(X))] + 𝔼[(m(X) - g(X))²]
              = 𝔼[e²] + 𝔼[(m(X) - g(X))²]        (the cross term is zero since 𝔼[e h(X)] = 0)
              ≥ 𝔼[e²]
              = 𝔼[(Y - m(X))²]
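A small simulation illustrating the point, assuming a data-generating process y = x² + e so that the CEF is m(x) = x²; any other predictor, such as the best linear one, has a larger mean-squared error:

set.seed(42)
n <- 1e5
x <- runif(n, -2, 2)
y <- x^2 + rnorm(n)              # CEF is m(x) = x^2, error variance is 1

mean((y - x^2)^2)                # MSE of the CEF: about 1
g <- lm(y ~ x)                   # best linear predictor, for comparison
mean((y - fitted(g))^2)          # strictly larger MSE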
Regression Derivative
One way to interpret the CEF is in terms of how marginal changes in the regressors x translate
into changes in the conditional expectation of the response variable Y. For example, changing x₁:

∇₁m(x) = ∂/∂x₁ m(x) = ∂/∂x₁ m(x₁, x₂, …, x_k)
Notice:
1. The effect of each variable is calculated holding the other variables constant.
• This is not “all else” but rather “all else in the model”
2. The effect of each variable is on the conditional expectation of Y, not necessarily the
actual value of Y.
• The regression derivative is the change in the actual value of Y only if the error e is
unaffected by the change in the regressor
Linear CEF
m(x) = 𝛽₀ + x₁𝛽₁ + x₂𝛽₂ + … + x_k𝛽_k

We often use shorthand notation, defining the vectors x (notice the leading 1) and 𝛽 as

x = (1, x₁, x₂, …, x_k)′   and   𝛽 = (𝛽₀, 𝛽₁, 𝛽₂, …, 𝛽_k)′

so that we can write the CEF as simply

m(x) = x′𝛽
Linear CEF Model
Y = X′ 𝛽 + e
𝔼[e|X] = 0
One of the most appealing features of the Linear CEF Model is that the coefficients are the
regression derivatives:
∇m(x) = [∇₁m(x), ∇₂m(x), …, ∇_k m(x)]′ = [∂m(x)/∂x₁, ∂m(x)/∂x₂, …, ∂m(x)/∂x_k]′ = (𝛽₁, 𝛽₂, …, 𝛽_k)′ = 𝛽

Therefore, the coefficients have simple and natural interpretations as the marginal effects of
changing one variable, holding the others constant.
Linear CEF with Non-Linear Effects
The linear CEF is less restrictive than it first appears. Take the following CEF as an example:
m(x) = 𝛽₀ + x₁𝛽₁ + x₂𝛽₂ + x₂²𝛽₃

Here, m(x) is non-linear in x₂, but we can define x₃ = x₂² to re-write the CEF as

m(x) = 𝛽₀ + x₁𝛽₁ + x₂𝛽₂ + x₃𝛽₃

This creates a linear CEF, which is sufficient for most econometric purposes (i.e., estimation and
inference for the 𝛽 parameters). The one major exception is the analysis of regression
derivatives, which should be defined with respect to the “original” variables:

∂m(x)/∂x₁ = 𝛽₁
∂m(x)/∂x₂ = 𝛽₂ + 2x₂𝛽₃
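A sketch in R on simulated data (variable names are illustrative): fit the quadratic term with I(), then evaluate the regression derivative 𝛽₂ + 2x₂𝛽₃ at a chosen value of x₂:

set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 1.0 * x2 + 0.25 * x2^2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + I(x2^2))
b   <- coef(fit)

# Derivative with respect to x2, evaluated at x2 = 2: beta2 + 2 * 2 * beta3
unname(b["x2"] + 2 * 2 * b["I(x2^2)"])   # close to -1 + 1 = 0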
Example: Linear vs Non-Linear Effects
[Figure: two panels plotting the CEF against X₁ and X₂, one with linear effects and one with a non-linear effect in X₂]
Linear CEF with Dummy Variables
If all regressors (the x’s) take a finite set of values, the CEF can be written as a linear function
of regressors.
One binary variable example:
• Suppose X represents binary gender with X = 0 for males and X = 1 for females.
• Let 𝔼[Y|X = 0] = 𝜇0 and 𝔼[Y|X = 1] = 𝜇1 and define 𝛽0 = 𝜇0 and 𝛽1 = 𝜇1 - 𝜇0
• Then m(x) = 𝛽0 + 𝛽1 x1
[Figure: CEF of wages on experience for men and for women; 𝔼[Y|X] of log dollars per hour against labor market experience (years)]
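A minimal R sketch of the binary example above (simulated data, illustrative names): regressing Y on the dummy recovers 𝛽₀ = 𝜇₀ and 𝛽₁ = 𝜇₁ − 𝜇₀:

set.seed(2)
n <- 1e4
x <- rbinom(n, 1, 0.5)                     # binary regressor
y <- ifelse(x == 0, 2.0, 2.5) + rnorm(n)   # mu0 = 2.0, mu1 = 2.5

coef(lm(y ~ x))      # intercept ~ mu0 = 2.0, slope ~ mu1 - mu0 = 0.5
tapply(y, x, mean)   # the two group means, for comparison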
Best Linear Predictor
When the CEF is Linear

Suppose m(X) = X′𝛽, so that 𝔼[e|X] = 0. Then, by iterated expectations,

𝔼[Xe] = 0
𝔼[X(Y - X′𝛽)] = 0
𝔼[XY] - 𝔼[XX′]𝛽 = 0
⇒ 𝛽 = (𝔼[XX′])⁻¹ 𝔼[XY]

The same coefficient minimizes the expected squared prediction error

S(𝛽) = 𝔼[(Y - X′𝛽)²]

and, as a quadratic function of 𝛽, the expected squared approximation error to the CEF:

d(𝛽) = 𝔼[(m(X) - X′𝛽)²]
     = 𝔼[m(X)²] - 2𝛽′𝔼[Xm(X)] + 𝛽′𝔼[XX′]𝛽

Since 𝔼[Xm(X)] = 𝔼[XY] (iterated expectations), the minimizer 𝛽 = arg min_{b∈ℝ^k} d(b) is the
same linear projection coefficient.
Three Reasons

Define the population linear regression function as X′𝛽 with 𝛽 = (𝔼[XX′])⁻¹ 𝔼[XY]. Three reasons
this coefficient is the object of interest:

1. If the CEF is linear, it equals X′𝛽.
2. X′𝛽 is the best linear predictor of Y.
3. X′𝛽 is the best linear approximation to the CEF m(X).
Ordinary Least Squares
Estimator: 𝛽̂_OLS
Samples
We’ve been discussing the best linear predictor of Y given X for a pair of random variables
(Y, X) ∈ ℝ × ℝ^k and called this predictor the linear projection model. We are now interested in
estimating the parameters of this model, in particular, the projection coefficient

𝛽 = (𝔼[XX′])⁻¹ 𝔼[XY]

We can estimate 𝛽 from samples that include joint measurements of (Y, X).

Notice that the variables {(Y₁, X₁), …, (Yᵢ, Xᵢ), …, (Yₙ, Xₙ)} are identically distributed: they are
draws from a common distribution F_{X,Y}(x, y).
Moment Estimators
The Law(s) of Large Numbers show, with varying mathematical approaches, that a sample average
(say, X̄) approaches the corresponding expectation (say, 𝔼[X]) as the sample size grows.

For example:

• Suppose 𝜇 = 𝔼[Y]. Estimate 𝜇 with 𝜇̂ = (1/n) ∑_{i=1}^{n} Yᵢ
• Suppose 𝜇 = 𝔼[h(Y)]. Estimate 𝜇 with 𝜇̂ = (1/n) ∑_{i=1}^{n} h(Yᵢ)
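A one-line illustration in R, assuming Y ~ N(0, 1) so that 𝔼[Y] = 0 and 𝔼[Y²] = 1:

set.seed(3)
y <- rnorm(1e5)
mean(y)     # moment estimator of E[Y], close to 0
mean(y^2)   # moment estimator of E[h(Y)] with h(y) = y^2, close to 1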
OLS Estimator for Linear CEF
Suppose the conditional expectation function is linear in X. We can use moment estimators of
the expectations:

Q_XX = 𝔼[XX′]  ⇒  Q̂_XX = (1/n) ∑_{i=1}^{n} XᵢXᵢ′
Q_XY = 𝔼[XY]  ⇒  Q̂_XY = (1/n) ∑_{i=1}^{n} XᵢYᵢ

and so

𝛽 = (Q_XX)⁻¹ Q_XY = (𝔼[XX′])⁻¹ 𝔼[XY]

⇒  𝛽̂ = (Q̂_XX)⁻¹ Q̂_XY = ((1/n) ∑_{i=1}^{n} XᵢXᵢ′)⁻¹ ((1/n) ∑_{i=1}^{n} XᵢYᵢ)
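A minimal R sketch of the moment-estimator route on simulated data (names and the true 𝛽 are illustrative):

set.seed(4)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))   # n x k regressor matrix, first column = 1
y <- X %*% c(1, 2, -1) + rnorm(n)   # true beta = (1, 2, -1)

Qxx_hat <- crossprod(X) / n         # (1/n) sum_i X_i X_i'
Qxy_hat <- crossprod(X, y) / n      # (1/n) sum_i X_i Y_i
solve(Qxx_hat, Qxy_hat)             # beta-hat, close to (1, 2, -1)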
OLS Estimator as Best Linear Approximation
While the conditional expectation m(X) = 𝔼[Y|X] is the best predictor of Y among all functions of X, its
functional form is typically unknown, and is thought to be linear only in special cases (e.g., a fully saturated
model of categorical variables).

Consequently, in most cases, it is more realistic to view the linear specification m(X) = X′𝛽 as an approximation.

Recall, the linear projection coefficient 𝛽 is defined as the minimizer of the expected squared error S(𝛽):

S(𝛽) = 𝔼[(Y - X′𝛽)²]

Its sample analogue is

Ŝ(𝛽) = (1/n) ∑_{i=1}^{n} (Yᵢ - Xᵢ′𝛽)²

In the simple regression case with one regressor and an intercept:

S(𝛽₀, 𝛽₁) = 𝔼[(Y - 𝛽₀ - 𝛽₁X)²]  ⇒  Ŝ(𝛽₀, 𝛽₁) = ∑_{i=1}^{n} (Yᵢ - 𝛽₀ - 𝛽₁Xᵢ)²

Setting the derivatives ∂Ŝ/∂𝛽₀ and ∂Ŝ/∂𝛽₁ to zero and solving gives the sample analogues

𝛽̂₁ = Cov(Xᵢ, Yᵢ)/Var(Xᵢ)   and   𝛽̂₀ = Ȳ - 𝛽̂₁X̄
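A quick R check that the lm slope matches the moment ratio (R's cov and var both use n − 1 denominators, so the ratio is exact):

set.seed(5)
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200)

cov(x, y) / var(x)        # beta1-hat via sample moments
coef(lm(y ~ x))[["x"]]    # identical slope from lm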
Solving for the OLS Estimator with Multiple Regressors
Now consider the general case with k regressors (including X₁ = 1): Y = X′𝛽 + e

S(𝛽) = 𝔼[(Y - X′𝛽)²]

Ŝ(𝛽) = ∑_{i=1}^{n} (Yᵢ - Xᵢ′𝛽)²
     = ∑_{i=1}^{n} (Yᵢ - Xᵢ′𝛽)′(Yᵢ - Xᵢ′𝛽)
     = ∑_{i=1}^{n} Yᵢ² - 2𝛽′ ∑_{i=1}^{n} XᵢYᵢ + 𝛽′ (∑_{i=1}^{n} XᵢXᵢ′) 𝛽

∂Ŝ(𝛽̂)/∂𝛽 = -2 ∑_{i=1}^{n} XᵢYᵢ + 2 ∑_{i=1}^{n} XᵢXᵢ′ 𝛽̂ = 0

⇒  𝛽̂ = (∑_{i=1}^{n} XᵢXᵢ′)⁻¹ (∑_{i=1}^{n} XᵢYᵢ)
Model in Matrix Notation
It is notationally and computationally convenient to write the model and statistics in matrix notation.
Define
Y = (Y₁, Y₂, …, Yₙ)′ (n×1),   X = (X₁′, X₂′, …, Xₙ′)′ (n×k, with row i equal to Xᵢ′),   e = (e₁, e₂, …, eₙ)′ (n×1)

so the model is Y = X𝛽 + e, and the sums above become matrix products:

∑_{i=1}^{n} XᵢXᵢ′ = X′X
∑_{i=1}^{n} XᵢYᵢ = X′Y

𝛽̂ = (X′X)⁻¹ X′Y
Unbiasedness of the OLS Estimator

The estimator 𝛽̂ is a function of the random sample, which means that it has a distribution; we
call this the “sampling distribution.”

If the mean of the sampling distribution is centered over the value we seek to estimate, then
the estimator is unbiased.
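A standard one-line derivation (assuming the linear CEF model, so that 𝔼[Y|X] = X𝛽) showing that 𝛽̂ is conditionally unbiased:

𝔼[𝛽̂|X] = (X′X)⁻¹X′ 𝔼[Y|X] = (X′X)⁻¹X′X𝛽 = 𝛽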
Computing the OLS Estimator in R
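The code that produced the output below is not fully preserved in these notes; a hypothetical reconstruction (assuming a wage dataset with vectors wage and exper and a subsample index sam; the response may also have been a pre-computed log-wage variable):

# Hypothetical reconstruction; 'wage', 'exper', and 'sam' are assumed objects
lmcoefs <- coef(lm(log(wage[sam]) ~ exper[sam]))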
print(lmcoefs)
(Intercept)  exper[sam]
2.876515044  0.004776039

[Figure: scatter plot of log wage against years of experience with the fitted OLS line]
Computing the OLS Estimator in R “by hand”
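The original code for this slide is not preserved in these notes; a minimal sketch of the matrix formula 𝛽̂ = (X′X)⁻¹X′Y, using the same assumed wage, exper, and sam objects as above:

X <- cbind(1, exper[sam])        # n x 2 design matrix: intercept column plus experience
Y <- log(wage[sam])              # assumed log-wage response

betahat <- solve(t(X) %*% X) %*% (t(X) %*% Y)   # (X'X)^{-1} X'Y
betahat                                         # matches coef(lm(...)) from the previous slide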
Next Time
• Residuals
• Projections
• R Squared
• CEF Error Variance
• Variance of the OLS Estimator
• Homoskedasticity
• Heteroskedasticity