TA Session 6
Jukina HATAKEYAMA∗
May 21, 2024
Contents
1 Review of Some Concepts for a Multivariate Normal Random Variable
2 The OLS Estimator for a Multiple Regression Model
3 Gauss–Markov Theorem for a Multiple Regression Model
4 Asymptotic Normality for the OLS Estimator of a Multiple Regression Model
A The Probability Density Function for a Multivariate Normal Distribution
B A Lemma Used in the Proof of Proposition 2.2
∗ E-mail: u868710a@ecs.osaka-u.ac.jp
1 Review of Some Concepts for a Multivariate Normal Random Variable
Theorem 1.1 (Multivariate Normal Distribution). Let the vector x = (x_1, …, x_k)′ ∈ R^k be a set of k random variables, µ ∈ R^k their mean vector, and Σ ∈ R^{k×k} their (symmetric, positive definite) variance–covariance matrix. The general form of the joint density is given by

f(x) = (2π)^{−k/2} |Σ|^{−1/2} exp( −(1/2) (x − µ)′ Σ^{−1} (x − µ) ).

When the components are mutually independent, Σ reduces to the diagonal matrix Σ = diag(σ_1², …, σ_k²).
The proof is shown in Appendix A.
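As a quick numerical sketch (not part of the original notes), the density in Theorem 1.1 can be evaluated directly in Python/NumPy; the function name mvn_pdf and the numbers below are illustrative choices only.

import numpy as np

def mvn_pdf(x, mu, Sigma):
    # f(x) = (2π)^{-k/2} |Σ|^{-1/2} exp(-(1/2)(x-µ)' Σ^{-1} (x-µ))
    k = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-µ)' Σ^{-1} (x-µ) without forming Σ^{-1}
    return (2 * np.pi) ** (-k / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([0.3, 0.8]), mu, Sigma))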
In addition to the theorem, we can construct the characteristic function and the moment generating function for this random variable as follows.
Theorem 1.2 (Characteristic Function and Moment Generating Function). For a random variable x : Ω → R^k which follows a multivariate normal distribution with mean µ ∈ R^k and variance–covariance matrix Σ ∈ R^{k×k}, using a parameter θ ∈ R^k we can define the characteristic function φ_x : R^k → C as

φ_x(θ) := E[e^{iθ′x}] = exp( iθ′µ − (1/2) θ′Σθ ),   (1)

and, analogously, the moment generating function M_x : R^k → R as

M_x(θ) := E[e^{θ′x}] = exp( θ′µ + (1/2) θ′Σθ ).   (2)
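The closed form (1) can be checked by simulation; the following minimal sketch (illustrative parameter values and Monte Carlo sample size chosen arbitrarily) compares a sample average of e^{iθ′x} with the right-hand side of (1).

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
theta = np.array([0.4, -0.2])

# Monte Carlo estimate of E[exp(i θ'x)] for x ~ N(µ, Σ)
x = rng.multivariate_normal(mu, Sigma, size=200_000)
cf_mc = np.mean(np.exp(1j * (x @ theta)))

# Closed form from (1): exp(i θ'µ - θ'Σθ / 2)
cf_exact = np.exp(1j * (theta @ mu) - 0.5 * theta @ Sigma @ theta)
print(cf_mc, cf_exact)   # the two complex numbers should agree up to simulation error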
2 The OLS Estimator for a Multiple Regression Model
Consider the multiple regression model y_i = X_i b + u_i for i ∈ {1, …, n}, where b ∈ R^k is the coefficient vector and u_i is the error term. Denoting by
y := (y_1, …, y_n)′ ∈ R^n,
u := (u_1, …, u_n)′ ∈ R^n,
X := (X_1′, …, X_n′)′ ∈ R^{n×k},  whose i-th row is X_i = (x_{i,1}, …, x_{i,k}),
we can write the stacked regression system as follows:

y = Xb + u  ⟺
( y_1 )   ( x_{1,1} ⋯ x_{1,k} ) ( b_1 )   ( u_1 )
(  ⋮  ) = (    ⋮    ⋱    ⋮    ) (  ⋮  ) + (  ⋮  ),
( y_n )   ( x_{n,1} ⋯ x_{n,k} ) ( b_k )   ( u_n )

where y ∈ R^n, X ∈ M_{n×k}(R), b ∈ R^k, and u ∈ R^n.
The OLS estimator is defined as the estimator that minimizes the residual sum of squares. The OLS estimator obtained from this definition is given as follows.
Theorem 2.1 (Ordinary Least Squares (OLS) Estimator for a Multiple Regression Model). Suppose assumption H1 holds, so that X′X is invertible. Then the OLS estimator of b is

b̂ = (X′X)^{−1} X′y.   (4)
Proof. To obtain the OLS estimator, we confirm the first- and second-order conditions for the minimization problem of the following loss function S(b):

arg min_b ‖y − Xb‖₂²  =:  arg min_b S(b).
The first-order condition becomes

∇_b S(b̂) = ∇_b (y − X b̂)′ (y − X b̂) = −2X′ (y − X b̂) = 0.
The OLS estimator, denoted as b̂, satisfies this equation, and hence
(X ′ X) b̂ = X ′ y.
From assumption H1, with X = (X_1′, …, X_n′)′ ∈ M_{n×k}(R) whose columns are linearly independent, X′X has full rank and the inverse matrix (X′X)^{−1} exists; therefore we obtain the OLS estimator in the form of (4). The second-order condition becomes

∇²_b S(b) = 2X′X.

By assumption H1, X′X is a positive definite matrix. This shows that the loss function S(b) attains its minimum at the OLS estimator b̂.
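As a minimal computational sketch of Theorem 2.1 (the values of n, k, b, and σ below are arbitrary illustrations), the OLS estimator can be obtained by solving the normal equations (X′X) b̂ = X′y directly.

import numpy as np

rng = np.random.default_rng(1)
n, k, sigma = 500, 3, 1.5
X = rng.normal(size=(n, k))                    # design matrix with rows X_i
b = np.array([2.0, -1.0, 0.5])                 # true coefficient vector
y = X @ b + rng.normal(scale=sigma, size=n)    # stacked system y = Xb + u

# OLS estimator b_hat = (X'X)^{-1} X'y, computed via the normal equations
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(b_hat)                                   # close to the true b for moderate n

Solving the linear system (X′X) b̂ = X′y rather than explicitly inverting X′X is the numerically preferable way to evaluate (4).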
From this theorem, we can confirm that the OLS estimator expressed as (4) is a random variable, since we can rewrite it as follows:

b̂ = b + (X′X)^{−1} X′u.   (5)
Therefore, we can consider the mean and variance of the OLS estimator. First, we derive the mean of the OLS estimator, which will be used to show that the OLS estimator is an unbiased estimator.
Proposition 2.1 (Mean of the OLS Estimator). Suppose [H1–H2] hold. Then

E[b̂ | X] = b.   (6)

Proof. Using (5) and the exogeneity assumption E[u | X] = 0 (assumption H2),

E[b̂ | X] = b + (X′X)^{−1} X′ E[u | X] = b,

and, by the law of iterated expectation (Lemma 2.1),

E[b̂] = E[ E[b̂ | X] ] = b.
Lemma 2.1 (Law of Iterated Expectation). For any two random variables x and y such that the relevant expectations exist,

E[y] = E[ E[y | x] ].
The variance of the OLS estimator, which is the minimum variance in the class of linear unbiased estimators, is given as follows.
Proposition 2.2 (Variance of the OLS Estimator). Suppose [H1–H2] hold and assume homoskedastic errors, E[uu′ | X] = σ²I_n. Then

V[b̂ | X] = σ² (X′X)^{−1}.   (9)

Proof. Using (5) and (6),
V[b̂ | X] = E[ (b̂ − E[b̂ | X]) (b̂ − E[b̂ | X])′ | X ]
         = E[ (X′X)^{−1} X′ uu′ X (X′X)^{−1} | X ]
         = (X′X)^{−1} X′ E[uu′ | X] X (X′X)^{−1}
         = (X′X)^{−1} X′ (σ² I_n) X (X′X)^{−1}
         = σ² (X′X)^{−1},
which proves (9). See Appendix B for the proof of the first equality.
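Propositions 2.1 and 2.2 can be illustrated by a small Monte Carlo sketch with a fixed design (all numerical values below are arbitrary choices): the sample mean of the replicated estimates should be close to b, and their sample covariance close to σ²(X′X)^{−1}.

import numpy as np

rng = np.random.default_rng(2)
n, k, sigma = 200, 2, 1.0
X = rng.normal(size=(n, k))              # design held fixed across replications
b = np.array([1.0, -2.0])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20_000
estimates = np.empty((reps, k))
for r in range(reps):
    y = X @ b + rng.normal(scale=sigma, size=n)
    estimates[r] = XtX_inv @ X.T @ y     # OLS estimate for this replication

print(estimates.mean(axis=0))            # approx. b                 (Proposition 2.1)
print(np.cov(estimates, rowvar=False))   # approx. sigma^2 (X'X)^-1  (Proposition 2.2)
print(sigma**2 * XtX_inv)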
2.2 Properties of the OLS Estimator
Here we exhibit some properties of the OLS estimator.
Theorem 2.2 (Properties of the OLS Estimator). The OLS estimator obtained above has the following properties.
(i) unbiasedness Under the assumption H2, the OLS estimator b̂ is an unbiased estimator:

E[b̂] = b.   (10)

(ii) consistency Under the assumptions [H1–H4] together with
H5 X′X is positive definite;
H6 for all i and all k, l, the moments E[|X_{i,k} X_{i,l}|] exist and E[X′X] is positive definite,
the OLS estimator b̂ is a consistent estimator: b̂ →_p b as n → ∞.
(iii) efficiency Under the assumptions [H1–H4], the variance of the OLS estimator is the minimum in the class of linear unbiased estimators.
Proof. These properties can be derived via calculations similar to those for the simple regression model.
(i) unbiasedness Unbiasedness follows from Proposition 2.1 together with the law of iterated expectation (Lemma 2.1).
(ii) consistency Rewriting the OLS estimator,
b̂ = b + ( (1/n) X′X )^{−1} ( (1/n) X′u )   (12)
  = b + ( (1/n) Σ_{i=1}^{n} X_i′X_i )^{−1} ( (1/n) Σ_{i=1}^{n} X_i′u_i )   (13)
  →_p b + ( E[X_i′X_i] )^{−1} E[X_i′u_i]   as n → ∞.   (14)
Here we apply the convergence of the product of random variables in probability, which
we will discuss in the following. From the weak law of large numbers (WLLN),
(1/n) Σ_{i=1}^{n} X_i′X_i →_p E[X_i′X_i] < ∞;   (15)
(1/n) Σ_{i=1}^{n} X_i′u_i →_p E[X_i′u_i] = 0 ∈ R^k.   (16)

Furthermore,

( (1/n) Σ_{i=1}^{n} X_i′X_i )^{−1} →_p ( E[X_i′X_i] )^{−1}   (17)
holds from the continuous mapping theorem stated below. Thus, substituting (15) and (16) into (14) results in
b̂ →_p b + ( E[X_i′X_i] )^{−1} · 0 = b   as n → ∞,

which indicates that b̂ →_p b.
(iii) efficiency As for the efficiency of the OLS estimator, the Gauss–Markov theorem for a multiple regression model, given in the next section, establishes the result.
The convergence of the product of random variables in probability and the continuous mapping theorem are given as follows.

Lemma 2.2 (Convergence of the Product of Random Variables in Probability). Suppose a sequence of random vectors X_n converges in probability to X and y_n converges in probability to y. Then the product X_n y_n also converges in probability to the product of the two probability limits:

X_n y_n →_p X y   as n → ∞.
In another notation, plim_{n→∞} (X_n y_n) = (plim_{n→∞} X_n)(plim_{n→∞} y_n).

Lemma 2.3 (Continuous Mapping Theorem). If X_n →_p X and g is a continuous function, then g(X_n) →_p g(X).
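A short simulation sketch of the consistency result (ii) follows; the sample sizes and the true coefficient vector below are arbitrary illustrations. The deviation ‖b̂ − b‖ should shrink toward zero as n grows.

import numpy as np

rng = np.random.default_rng(3)
b = np.array([0.5, 1.5])

for n in (100, 1_000, 10_000, 100_000):
    X = rng.normal(size=(n, 2))
    y = X @ b + rng.normal(size=n)
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)
    # the deviation shrinks toward zero as n grows, illustrating b_hat ->p b
    print(n, np.linalg.norm(b_hat - b))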
3 Gauss–Markov Theorem for a Multiple Regression Model
Here we obtain a general result for the class of linear unbiased estimators of b. It can be derived via a direct method.
Theorem 3.1 (Gauss–Markov Theorem for a Multiple Regression Model). Under the assumptions [H1–H4], the OLS estimator b̂ of the multiple regression model

y_i = X_i b + u_i,   (18)

for all i ∈ {1, …, n}, is of minimum variance among the class of linear unbiased estimators.
Proof. Consider another linear unbiased estimator of b, say b̃. Then there exists a matrix A ∈ R^{k×n} such that b̃ = Ay. Since b̃ is an unbiased estimator,

E[b̃ | X] = E[A(Xb + u) | X] = AXb = b   (19)

for every b, which implies AX = I_k. Writing M_X := I_n − X(X′X)^{−1}X′ and using E[uu′ | X] = σ²I_n, the conditional variance of b̃ becomes

V[b̃ | X] = A V[u | X] A′ = σ² A ( X(X′X)^{−1}X′ + M_X ) A′
          = σ² AX (X′X)^{−1} X′A′ + σ² A M_X A′.

Substituting AX (= X′A′) = I_k and V[b̂ | X] = σ²(X′X)^{−1} into the above equation results in

V[b̃ | X] = V[b̂ | X] + σ² A M_X A′.

Since M_X is symmetric and idempotent, A M_X A′ = (A M_X)(A M_X)′ is positive semidefinite; in particular, a_i′ M_X a_i = ‖M_X a_i‖₂² ≥ 0 for any column vector a_i of A′ for i ∈ {1, …, k}. Hence V[b̃ | X] ⪰ V[b̂ | X], which proves the theorem.
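The theorem can be illustrated numerically by comparing OLS with another linear unbiased estimator b̃ = Ay. The sketch below (all parameter values and the choice of weight matrix W are arbitrary) uses A = (X′WX)^{−1}X′W, which satisfies AX = I_k and is therefore unbiased, but is less efficient than OLS under homoskedastic errors.

import numpy as np

rng = np.random.default_rng(4)
n, k, sigma = 300, 2, 1.0
X = rng.normal(size=(n, k))
b = np.array([1.0, 2.0])

# OLS weights versus an alternative linear unbiased estimator b_tilde = A y
W = np.diag(rng.uniform(0.2, 5.0, size=n))     # arbitrary positive weights, W != I_n
A_ols = np.linalg.solve(X.T @ X, X.T)          # (X'X)^{-1} X'
A_alt = np.linalg.solve(X.T @ W @ X, X.T @ W)  # (X'WX)^{-1} X'W, so A_alt X = I_k

reps = 20_000
est_ols = np.empty((reps, k))
est_alt = np.empty((reps, k))
for r in range(reps):
    y = X @ b + rng.normal(scale=sigma, size=n)
    est_ols[r] = A_ols @ y
    est_alt[r] = A_alt @ y

# Both estimators are (approximately) unbiased, but the OLS variances
# should be no larger, coordinate by coordinate, up to simulation noise.
print(est_ols.mean(axis=0), est_alt.mean(axis=0))
print(est_ols.var(axis=0), est_alt.var(axis=0))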
4 Asymptotic Normality for the OLS Estimator of a Multiple Regression Model
In this section, we derive the asymptotic distribution of an OLS estimator to observe how
the distribution changes as n → ∞.
Theorem 4.1 (Asymptotic Normality of an OLS Estimator). Let b̂ be the OLS estimator obtained under the assumptions [H1–H6]. Then the OLS estimator asymptotically follows a normal distribution:

√n (b̂ − b) →_d N_{R^k}( 0, σ² ( E[X_i′X_i] )^{−1} )   as n → ∞.
Proof. From (5), b̂ − b = (X′X)^{−1} X′u. Therefore,
√n (b̂ − b) = ( (1/n) Σ_{i=1}^{n} X_i′X_i )^{−1} ( (1/√n) Σ_{i=1}^{n} X_i′u_i ).   (24)
From the Lindeberg–Feller central limit theorem (Lindeberg–Feller CLT) as well as the
weak law of large numbers (WLLN) and continuous mapping theorem, we have
( (1/n) Σ_{i=1}^{n} X_i′X_i )^{−1} →_p ( E[X_i′X_i] )^{−1};

(1/√n) Σ_{i=1}^{n} X_i′u_i = √n ( (1/n) Σ_{i=1}^{n} X_i′u_i − 0 ) →_d N_{R^k}( 0, V[X_i′u_i] ),
Then,
V[Xi′ ui ] = E V[Xi′ ui Xi ] + V E[Xi′ ui Xi ]
| {z }
′ =0
= E Xi V[ui Xi ]Xi
= E Xi′ σ 2 Xi
= σ 2 E [Xi′ Xi ] < ∞,
where the limiting random vector z satisfies

z ∼ N_{R^k}( 0, σ² E[X_i′X_i] ).

From the relation

z ∼ N_{R^k}( 0, σ² E[X_i′X_i] )  ⟹  ( E[X_i′X_i] )^{−1} z ∼ N_{R^k}( 0, σ² ( E[X_i′X_i] )^{−1} ),

we obtain

√n (b̂ − b) →_d N_{R^k}( 0, σ² ( E[X_i′X_i] )^{−1} )   as n → ∞.
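Theorem 4.1 can also be illustrated by simulation; the sketch below (arbitrary sample size, number of replications, and coefficients) uses standard normal regressors, for which E[X_i′X_i] = I_2, so the limit law is N(0, σ²I_2).

import numpy as np

rng = np.random.default_rng(5)
b = np.array([1.0, -1.0])
sigma, n, reps = 1.0, 500, 5_000

z = np.empty((reps, 2))
for r in range(reps):
    X = rng.normal(size=(n, 2))
    y = X @ b + rng.normal(scale=sigma, size=n)
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)
    z[r] = np.sqrt(n) * (b_hat - b)            # the statistic of Theorem 4.1

print(z.mean(axis=0))                          # approx. 0
print(np.cov(z, rowvar=False))                 # approx. sigma^2 * I_2 for this design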
Appendix
A The Probability Density Function for a Multivariate Normal Distribution
A.1 Independent Univariate Normals
To derive the general form of the probability density function for a multivariate normal distribution, we will start with a vector consisting of k independent and normally distributed random variables with mean 0: x = (x_1, …, x_k), where

x_i ∼ N_R( 0, σ_i² ).
Let us denote by f_{x_i} the probability density function for a single normal random variable x_i for i ∈ {1, …, k}. Then, since the variables are independent, the joint probability density function f_x of all k variables will just be the product of their densities:
f_x = Π_{i=1}^{k} f_{x_i}
    = Π_{i=1}^{k} (2πσ_i²)^{−1/2} exp( −x_i² / (2σ_i²) )
    = ( (2π)^k Π_{i=1}^{k} σ_i² )^{−1/2} exp( −(1/2) x′ diag(σ_1², …, σ_k²)^{−1} x )
    = (2π)^{−k/2} |Σ|^{−1/2} exp( −(1/2) x′ Σ^{−1} x ),
where Σ = diag(σ_1², …, σ_k²). In this case we say that x ∼ N_{R^k}(0, Σ). Unfortunately, this derivation is restricted to the case where the entries are independent and centered at 0. We will now see that the general case can be derived from this result.
A.2 Affine Transformations of a Random Vector
Consider an affine transformation L : Rk → Rk , L(x) = Ax + b for an invertible matrix
A ∈ Rk×k and a constant vector b ∈ Rk . It is easy to verify that when we apply this
transformation to a random variable z = (z1 , . . . , zk ) with mean µ ∈ Rk and variance–
covariance matrix Σz ∈ Rk×k we get a new random variable x = L(z) such that
E[x] = L( E[z] ) = Aµ + b;
V[x] = E[ (x − E[x]) (x − E[x])′ ] = A Σ_z A′.
In this case, for a symmetric, positive definite matrix Σ and constant vector µ, we will be looking at the transformation x = Σ^{1/2} z + µ applied to a vector z of independent standard normal variables (the case σ_i = 1 above). It is interesting to note that, given an orthogonal decomposition Σ = UΛU′, where U is orthogonal and Λ is a diagonal matrix consisting of the eigenvalues of Σ, each entry x_i of the new random vector is a weighted sum of the originally independent random variables in z. Writing u_{ij} for the (i, j) entry of U and taking the square root Σ^{1/2} = UΛ^{1/2} (so that Σ^{1/2}(Σ^{1/2})′ = UΛU′ = Σ), we have

x_i = ( Σ^{1/2} z + µ )_i = Σ_{j=1}^{k} √λ_j u_{ij} z_j + µ_i.
We now just need one more fact about a change of variables to derive the general multivariate
normal probability density function for this new random vector.
For z ∼ N_{R^k}(0, I_k) and the invertible transformation x = L(z) = Σ^{1/2} z + µ with inverse z = r(x) := Σ^{−1/2}(x − µ), the change-of-variables formula gives the density g of x as

g(x) = f( r(x) ) |det( dz/dx )|
     = f( Σ^{−1/2}(x − µ) ) (det Σ)^{−1/2}
     = (2π)^{−k/2} (det Σ)^{−1/2} exp( −(1/2) ( Σ^{−1/2}(x − µ) )′ ( Σ^{−1/2}(x − µ) ) )
     = (2π)^{−k/2} |Σ|^{−1/2} exp( −(1/2) (x − µ)′ Σ^{−1} (x − µ) ).
This is the probability density function for a multivariate normal distribution with mean
vector µ and a covariance matrix Σ. We say that x ∼ NRk (µ, Σ).
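The affine-transformation construction of Appendix A.2 also gives a direct way to sample from N_{R^k}(µ, Σ). The sketch below (illustrative µ and Σ) builds a square root Σ^{1/2} = UΛ^{1/2} from the orthogonal decomposition Σ = UΛU′ and checks the sample mean and covariance of the transformed draws.

import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# square root with Sigma_half @ Sigma_half.T = Σ, from Σ = U Λ U'
eigvals, U = np.linalg.eigh(Sigma)
Sigma_half = U @ np.diag(np.sqrt(eigvals))

# x = Σ^{1/2} z + µ with z ~ N(0, I_k) has the density derived above
z = rng.normal(size=(100_000, 2))
x = z @ Sigma_half.T + mu

print(x.mean(axis=0))             # approx. µ
print(np.cov(x, rowvar=False))    # approx. Σ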
B A Lemma Used in the Proof of Proposition 2.2
Lemma B.1. For any (measurable) function g of X, we have the following equation:

E[ g(X) ( y − E[y | X] ) ] = 0.

Proof. Using the fact that E[ y − E[y | X] | X ] = 0, we have

E[ g(X) ( y − E[y | X] ) ] = E[ E[ g(X) ( y − E[y | X] ) | X ] ]
                           = E[ g(X) E[ y − E[y | X] | X ] ]
                           = 0.