Chapters 2 & 3: Review of Probability and Statistics


Introduction to Econometrics

Fourth Edition, Global Edition

Chapters 2 and 3
Review of Probability and
Statistics

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Empirical problem
• Class size and educational output
– Policy question: What is the effect on test scores (or
some other outcome measure) of reducing class size by
one student per class? By 8 students per class?
– We must use data to find out (is there any way to answer
this without data?)



The California Test Score Data Set
All California school districts (n = 420)
Variables:
▪ 5th grade test scores (Stanford-9 achievement test,
combined math and reading), district average
▪ Student-teacher ratio (STR) = no. of students in the
district divided by no. of full-time equivalent teachers



Initial look at the data: (You should already know
how to interpret this table)

TABLE 4.1 Summary of the Distribution of Student–Teacher Ratios and
Fifth-Grade Test Scores for 420 Districts in California in 1999

                                            Percentile
                        Average  Std. Dev.  10%    25%    40%    50% (median)  60%    75%    90%
Student–teacher ratio   19.6     1.9        17.3   18.6   19.3   19.7          20.1   20.9   21.9
Test score              654.2    19.1       630.4  640.0  649.1  654.5         659.4  666.7  679.1

This table doesn’t tell us anything about the relationship between
test scores and the STR.
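Summary measures like those in the table (mean, standard deviation, median, percentiles) can be reproduced for any sample with Python's standard statistics module; a minimal sketch, using hypothetical data rather than the actual California districts:

```python
import statistics

# Hypothetical district-level student-teacher ratios
# (illustration only, not the actual California data)
str_sample = [17.9, 18.4, 19.1, 19.5, 19.7, 20.0, 20.4, 20.8, 21.3, 21.8]

mean = statistics.mean(str_sample)      # average
sd = statistics.stdev(str_sample)       # sample std. deviation (divides by n - 1)
median = statistics.median(str_sample)  # 50th percentile
deciles = statistics.quantiles(str_sample, n=10)  # 10%, 20%, ..., 90% cut points
```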



Do districts with smaller classes have higher
test scores?
Scatterplot of test score v. student-teacher ratio

What does this figure show?


We need to get some numerical evidence on whether
districts with low STRs have higher test scores – but how?
1. Compare average test scores in districts with low STRs to those
with high STRs (“estimation”)
2. Test the “null” hypothesis that the mean test scores in the two
types of districts are the same (𝐻0: μ_s = μ_l) against the
“alternative” hypothesis that they differ (𝐻1: μ_s ≠ μ_l)
(“hypothesis testing”)
3. Estimate an interval for the difference in the mean test scores,
high v. low STR districts (“confidence interval”)



Initial data analysis: Compare districts with “small”
(STR < 20) and “large” (STR ≥ 20) class sizes:

Class Size   Average score (Ȳ)   Standard deviation (s)   n
Small        657.4               19.4                     238
Large        650.0               17.9                     182

1. Estimation of Δ = difference between group means


2. Test the hypothesis that Δ = 0
3. Construct a confidence interval for Δ



1. Estimation
Ȳ_small − Ȳ_large = (1/n_small) Σ_{i=1}^{n_small} Y_i − (1/n_large) Σ_{i=1}^{n_large} Y_i
                  = 657.4 − 650.0
                  = 7.4

- Is this a large difference in a real-world sense?

- Is this a big enough difference to be important for school reform
discussions, for parents, or for a school committee?



2. Hypothesis testing (1 of 2)
𝐻0: μ_s = μ_l    𝐻1: μ_s ≠ μ_l
• Difference-in-means test: compute the t-statistic:

t = (Ȳ_s − Ȳ_l) / SE(Ȳ_s − Ȳ_l) = (Ȳ_s − Ȳ_l) / √(s²_s/n_s + s²_l/n_l)

Where:
- s and l refer to “small” and “large” STR (student-teacher ratio) districts
- SE(Ȳ_s − Ȳ_l) is the “standard error” of Ȳ_s − Ȳ_l
- s²_s = 1/(n_s − 1) · Σ_{i=1}^{n_s} (Y_i − Ȳ_s)²  (etc.)
2. Hypothesis testing (2 of 2)
Compute the difference-in-means t-statistic:

Size    Ȳ       s      n
small   657.4   19.4   238
large   650.0   17.9   182

t = (Ȳ_s − Ȳ_l) / √(s²_s/n_s + s²_l/n_l) = (657.4 − 650.0) / √(19.4²/238 + 17.9²/182) = 7.4 / 1.83 = 4.05

The critical value of t (from the table) is 1.96 (α = 5% and df = 418)

|t-statistic| = 4.05 > critical value of t = 1.96 → so reject (at the
5% significance level) the null hypothesis that the two means are the
same
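The slide's arithmetic can be checked directly from the summary statistics; a minimal sketch in Python:

```python
import math

# Summary statistics from the slide (small vs. large STR districts)
y_s, s_s, n_s = 657.4, 19.4, 238   # small classes
y_l, s_l, n_l = 650.0, 17.9, 182   # large classes

# Standard error of the difference in means
se = math.sqrt(s_s**2 / n_s + s_l**2 / n_l)

# Difference-in-means t-statistic
t = (y_s - y_l) / se
```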
How do we test a hypothesis using the
critical value of t ?
1. Calculate the t value (statistic) for your sample
2. Find the critical value of t in the t table
3. Determine if the (absolute) t value is greater than the critical
value of t
4. Reject the null hypothesis if the sample’s t value (in absolute
value) is greater than the critical value of t; otherwise, don’t
reject the null hypothesis



How do we find the critical value of t in the t
table ?
▪ Step 1: Choose one-tailed or two-tailed test:
- One-tailed tests are used when the alternative hypothesis is directional
Example: 𝐻0: μ_s ≤ μ_l    𝐻1: μ_s > μ_l
- Two-tailed tests are used when the alternative hypothesis is non-directional
Example: 𝐻0: μ_s = μ_l    𝐻1: μ_s ≠ μ_l
▪ Step 2: Calculate the degrees of freedom
- If we have one sample: 𝑑𝑓 = 𝑛 − 1
- If we have two samples: 𝑑𝑓 = 𝑛1 − 1 + 𝑛2 − 1 = 𝑛1 + 𝑛2 − 2
▪ Step 3: Choose a significance level (usually 𝛼 = 5% = 0.05)
▪ Step 4: Find the critical value of t in the t table
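With df = 418, the t distribution is essentially the standard normal, so the table lookup in Step 4 can be approximated with the stdlib NormalDist quantile function (for small df you would still use a t table or a package with the t distribution):

```python
from statistics import NormalDist

# Normal approximation to the t critical values, valid for large df
alpha = 0.05
one_tail = NormalDist().inv_cdf(1 - alpha)      # right-tail critical value
two_tail = NormalDist().inv_cdf(1 - alpha / 2)  # two-tail critical value
```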
3. Confidence interval
A 95% (1 − 𝛼 = 1 − 0.05 = 0.95) confidence interval for the
difference between the means is,

(Ῡs – Ῡl ) ± t* × SE(Ῡs – Ῡl )
= 7.4 ± 1.96 × 1.83 = (3.8, 11.0)
→ With 95% confidence, the difference between the two population
means lies between 3.8 and 11.0
Two equivalent statements:
1. The 95% confidence interval for Δ doesn’t include 0
2. The hypothesis that Δ = 0 is rejected at the 5% level
➔ There is a difference between the means of the two groups
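The interval arithmetic above can be verified in a few lines; a sketch using the slide's numbers:

```python
import math

diff = 657.4 - 650.0                              # difference in group means
se = math.sqrt(19.4**2 / 238 + 17.9**2 / 182)     # its standard error
t_star = 1.96                                     # 5% two-tail critical value

# 95% confidence interval for the difference
low, high = diff - t_star * se, diff + t_star * se
```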
What comes next…
• The mechanics of estimation, hypothesis testing, and confidence
intervals should be familiar
• These concepts extend directly to regression and its variants
• Before turning to regression, however, we will review some of
the underlying theory of estimation, hypothesis testing, and
confidence intervals:
– Why do these procedures work, and why use these rather than others?
– We will review the intellectual foundations of statistics and econometrics



Review of Statistical Theory
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals

The probability framework for statistical inference


a) Population, random variable... and probability distribution
b) Moments of a distribution (mean, variance, standard deviation,
covariance, correlation)
c) Conditional distributions and conditional means
d) Distribution of a sample of data drawn randomly from a
population: Y1,…, Yn
(a) Population, random variable... and
distribution
Random variable Y
• Numerical summary of a random outcome (district average test
score)
– Discrete variables consist of indivisible categories (class size)
– Continuous variables are infinitely divisible into whatever units a
researcher may choose (time)

Population
• The group or collection of all possible entities of interest (school
districts)
• We will think of populations as infinitely large (∞ is an
approximation to “very big”)
(a) Population, random variable... and
distribution
Sample
• A portion or part of the population, used to represent the
population in a research study
• The goal is to use the results obtained from the sample to help
answer questions about the population
Sample space
• The set of all possible outcomes





(a) Population, random variable... and
distribution
Event
• Subset of the sample space; a set of one or more outcomes
Probability distribution
• The list of all possible values of the variable and the probability
that each value will occur, for ex. Pr[Y = 650] (when Y is
discrete)
• Or, the probabilities of sets of these values, for ex. Pr[640 ≤ Y ≤
660] (when Y is continuous)



(a) Population, random variable... and
distribution: Example
Example: probability distribution of a discrete random variable
Let M be the number of times your internet connection fails
while you are writing a term paper

The probability distribution of M is the list of probabilities of all


possible outcomes:
No connection failure (Pr[M=0]), one connection failure
(Pr[M=1]); and so forth



(a) Population, random variable... and
distribution: Example

Event: "The internet connection will fail one or two times"


The probability of this event is the sum of the probabilities of the
constituent outcomes:
Pr(M=1 or M=2) = Pr(M=1) + Pr(M=2) = 0.10 + 0.06 = 0.16 (16%)
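A sketch of this distribution in Python; Pr[M=1] = 0.10 and Pr[M=2] = 0.06 are from the slide, while the remaining probabilities are assumed values chosen to be consistent with the moments E(M) = 0.35 and Var(M) = 0.6475 computed on later slides:

```python
# Probability distribution of M (number of connection failures).
# Pr[M=1] and Pr[M=2] come from the slide; the other entries are
# assumed values consistent with the moments quoted later.
pmf = {0: 0.80, 1: 0.10, 2: 0.06, 3: 0.03, 4: 0.01}

# Probability of the event "one or two failures": sum over its outcomes
p_event = pmf[1] + pmf[2]
```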



Descriptive Statistics
• Descriptive statistics are methods for organizing, summarizing,
and presenting data in an informative way
• Tables and graphs are used to organize data
• Descriptive values (mean, variance…) are used to summarize
data
• A descriptive value for a population is called a parameter and a
descriptive value for a sample is called a statistic



Inferential Statistics
• Inferential statistics are methods for using sample data to draw
general conclusions (inferences) about populations
• Because a sample is typically only a part of the whole
population, sample data provides only limited information about
the population
→Sample statistics are generally imperfect representatives of the
corresponding population parameters
• The discrepancy between a sample statistic and its population
parameter is called sampling error



(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation
mean = expected value (expectation) of Y
     = E(Y)
     = μ_Y
     = average value of Y
variance = E[(Y − μ_Y)²]
         = σ²_Y
         = measure of the squared spread of the distribution
standard deviation = √variance = σ_Y



(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation
Example: Internet connection failures

• Mean: Weighted average of the possible outcomes:


E(Y) = μ_Y = y₁p₁ + y₂p₂ + ⋯ + y_k p_k = Σ_{i=1}^{k} y_i p_i

E(M) = 0.35



(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation
Example: Internet connection failures

• Variance: Weighted average of the squared difference (spread)


between Y (M in this case) and its mean
Var(Y) = E[(Y − μ_Y)²] = (y₁ − μ_Y)²p₁ + ⋯ + (y_k − μ_Y)²p_k = Σ_{i=1}^{k} (y_i − μ_Y)² p_i

Var(M) = 0.6475



(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation
Example: Internet connection failures

• Standard deviation: The square root of the variance of
Y (M in this case)
σ_M = √Var(M) = √0.6475 ≅ 0.80

The units of the standard deviation are the same as the units of Y
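These moments can be computed directly from the probability distribution; a sketch (the probabilities other than Pr[M=1] and Pr[M=2] are assumed values consistent with the slide's results):

```python
import math

# Distribution of M; Pr[M=1], Pr[M=2] from the slides, the other
# probabilities are assumed values consistent with E(M) = 0.35
pmf = {0: 0.80, 1: 0.10, 2: 0.06, 3: 0.03, 4: 0.01}

mean = sum(y * p for y, p in pmf.items())               # E(M)
var = sum((y - mean) ** 2 * p for y, p in pmf.items())  # Var(M)
sd = math.sqrt(var)                                     # sigma_M
```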



(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation

Skewness: The skewness of a distribution provides a mathematical


way to describe how much a distribution deviates from symmetry
(measure of asymmetry of a distribution)

skewness = E[(Y − μ_Y)³] / σ³_Y

• Skewness = 0: distribution is symmetric


• Skewness > (<) 0: distribution has long right (left) tail



(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation

Kurtosis: The kurtosis of a distribution is a measure of how much mass


is in its tails and therefore is a measure of how much of the variance of Y
arises from extreme values. An extreme value of Y is called an outlier

kurtosis = E[(Y − μ_Y)⁴] / σ⁴_Y

• Kurtosis = 3: normal distribution


• Kurtosis > 3: heavy tails (“leptokurtotic”)
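Skewness and kurtosis can be computed the same way as the lower moments; a sketch using the same assumed distribution of M as above:

```python
import math

# Same distribution of M as before (slide values plus assumed probabilities)
pmf = {0: 0.80, 1: 0.10, 2: 0.06, 3: 0.03, 4: 0.01}

mean = sum(y * p for y, p in pmf.items())
sd = math.sqrt(sum((y - mean) ** 2 * p for y, p in pmf.items()))

# Third and fourth standardized moments
skewness = sum((y - mean) ** 3 * p for y, p in pmf.items()) / sd ** 3
kurtosis = sum((y - mean) ** 4 * p for y, p in pmf.items()) / sd ** 4
```

Most of the probability mass sits at M = 0 with a few large values to the right, so the distribution has a long right tail (skewness > 0) and heavy tails (kurtosis > 3).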





2 random variables: joint distributions and
covariance (1 of 2)
• Random variables X and Z have a joint distribution
• The covariance between X and Z is
cov(X,Z) = E[(X – μX)(Z – μZ)] = σXZ
• The covariance is a measure of the linear association between X
and Z; its units are units of X × units of Z
• cov(X,Z) > 0 means a positive relation between X and Z (and
vice versa)
• If X and Z are independently distributed, then cov(X,Z) = 0 (but
not vice versa → A nonlinear relationship can exist)
• The covariance of a r.v. with itself is its variance:
cov(X, X) = E[(X − μ_X)(X − μ_X)] = E[(X − μ_X)²] = σ²_X
2 random variables: joint distributions and
covariance (2 of 2)
The covariance between Test Score and STR is negative:

So is the correlation…
The correlation coefficient is defined in
terms of the covariance:
corr(X, Z) = cov(X, Z) / √(var(X) var(Z)) = σ_XZ / (σ_X σ_Z) = r_XZ

• –1 ≤ corr(X,Z) ≤ 1
• corr(X,Z) = 1 means perfect positive linear association
• corr(X,Z) = –1 means perfect negative linear association
• corr(X,Z) = 0 means no linear association

The correlation is an alternative measure of dependence
between X and Z that solves the “units” problem of the
covariance
The correlation coefficient measures linear
association





(c) Conditional distributions and
conditional means
Conditional distributions
• The distribution of Y, given value(s) of some other random variable, X
• Ex: the distribution of test scores, given that STR < 20
Conditional expectations and conditional moments
• conditional mean = mean of conditional distribution = 𝐸(𝑌 |𝑋 = 𝑥)
• conditional variance = variance of conditional distribution
• Example: E(Test score|STR < 20) = the mean of test scores among
districts with small class sizes
Difference in means is the difference between the means of two
conditional distributions: Δ = E(Test score|STR < 20) – E(Test score|STR ≥
20)
(c) Conditional distributions and
conditional means (Example)

Suppose that half the time you write your term paper in the
school library, which has a new wireless network (A = 1);
otherwise, you write it in your room, which has an old wireless
network (A = 0)
(c) Conditional distributions and
conditional means (Example)

• Joint probability of M = 0 and A = 0:

Pr(M = 0, A = 0) = 0.35

• Conditional probability of M = 0 given A = 0:

Pr(M = 0 | A = 0) = 0.70
(c) Conditional distributions and
conditional means (Example)
• Conditional probability of M = 0 given A = 0:

Pr(M = 0 | A = 0) = Pr(M = 0, A = 0) / Pr(A = 0) = 0.35 / 0.5 = 0.70 = 70%

• In general:

Pr(Y = y | X = x) = Pr(Y = y, X = x) / Pr(X = x)



(c) Conditional distributions and
conditional means (Example)

• Conditional mean of M given A=0 is:


𝐸 𝑀|A = 0 = 0 × 0.7 + 1 × 0.13 + 2 × 0.1 + 3 × 0.05 + 4 × 0.02
= 0.56



(c) Conditional distributions and
conditional means (Example)
• In general:
E(Y | X = x) = Σ_{i=1}^{k} y_i × Pr(Y = y_i | X = x)

• The conditional mean (conditional expectation) of Y given


X, is the mean of the conditional distribution of Y given X
➔ It is the mean value of Y when X=x



(c) Conditional distributions and
conditional means (Example)

• Conditional variance of M given A = 0 is:

Var(M | A = 0) = (0 − 0.56)² × 0.70 + (1 − 0.56)² × 0.13 + (2 − 0.56)² × 0.10
               + (3 − 0.56)² × 0.05 + (4 − 0.56)² × 0.02 = 0.9864

• Conditional standard deviation of M given A = 0 is:
σ_{M|A=0} = √0.9864 = 0.9932
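These conditional calculations can be reproduced from the joint distribution; a sketch in which the joint probabilities for A = 0 are assumed values implied by the slide's conditionals (e.g. Pr[M=0, A=0] = 0.35 and Pr[A=0] = 0.5):

```python
import math

# Assumed joint distribution of (M, A) for the old network A = 0,
# implied by the conditional probabilities quoted on the slides
joint_a0 = {0: 0.35, 1: 0.065, 2: 0.05, 3: 0.025, 4: 0.01}
p_a0 = sum(joint_a0.values())                      # Pr(A = 0)

cond = {m: p / p_a0 for m, p in joint_a0.items()}  # Pr(M = m | A = 0)
cond_mean = sum(m * p for m, p in cond.items())    # E(M | A = 0)
cond_var = sum((m - cond_mean) ** 2 * p for m, p in cond.items())
cond_sd = math.sqrt(cond_var)
```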
(c) Conditional distributions and
conditional means (Example)
• In general:
Var(Y | X = x) = Σ_{i=1}^{k} (y_i − E(Y | X = x))² × Pr(Y = y_i | X = x)

• The conditional variance/standard deviation of Y given X,


is the variance/standard deviation of the conditional
distribution of Y given X
➔ It is the variance/standard deviation of Y when X=x



(c) Conditional distributions and
conditional means
The conditional mean plays a key role in prediction:
• Suppose you want to predict a value of Y, and you are given the
value of a random variable X that is related to Y
-For example, you want to predict someone’s income, given their years of
education
• A common measure of the quality of your prediction m of Y is
the mean squared prediction error (MSPE), given X: E[(Y − m)²]
• Of all possible predictions m that depend on X, the conditional
mean E(Y|X) is the optimal prediction of Y given X (because it
has the smallest mean squared prediction error)
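A small numerical illustration of this optimality: among prediction rules that depend on X, the conditional mean attains the smallest MSPE (hypothetical data):

```python
import statistics

# Hypothetical (X, Y) pairs; the conditional mean E(Y | X = x)
# minimizes the mean squared prediction error
data = [(0, 2.0), (0, 4.0), (1, 7.0), (1, 9.0), (1, 11.0)]

def mspe(predict):
    """Mean squared prediction error of a prediction rule predict(x)."""
    return sum((y - predict(x)) ** 2 for x, y in data) / len(data)

# Conditional means for each value of X
cond_mean = {x: statistics.mean(y for xi, y in data if xi == x) for x in {0, 1}}

best = mspe(lambda x: cond_mean[x])   # predict with E(Y | X = x)
worse = mspe(lambda x: 6.6)           # predict with the unconditional mean of Y
```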



(d) Distribution of a sample of data drawn
randomly from a population: Y1,…, Yn
We will assume simple random sampling
• Choose an individual (district, entity) at random from the
population (every individual in the population has an equal chance
to be chosen)
Randomness and data
• Prior to sample selection, the value of Y is random because the
individual selected is random
• Once the individual is selected and the value of Y is observed, then
Y is just a number – not random
• The data set is (Y1, Y2,…, Yn), where Yi = value of Y for the ith
individual (district, entity) sampled
Distribution of Y1,…, Yn under simple
random sampling
• Because individuals #1 and #2 are selected at random, the value
of Y1 has no information content for Y2. Thus:
– Y1 and Y2 are independently distributed (knowing the value
of 𝑌1 provides no information about 𝑌2 )
– Y1 and Y2 come from the same distribution, that is, Y1, Y2 are
identically distributed
– That is, under simple random sampling, Y1 and Y2 are
independently and identically distributed (i.i.d.).
– More generally, under simple random sampling, {Yi}, i = 1,…, n,
are i.i.d.



This framework allows rigorous statistical inferences about
moments of population distributions using a sample of data
from that population…
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals
Estimation
Ῡ is the natural estimator of the mean. But:
a) What are the properties of Ῡ ?
b) Why should we use Ῡ rather than some other estimator?
▪ Y1 (the first observation)
▪ maybe unequal weights – not simple average
▪ median(Y1,…, Yn)
The starting point is the sampling distribution of Ῡ …
(a) The sampling distribution of the
sample mean Ῡ
The sample mean Ῡ is a random variable, and its properties are
determined by the sampling distribution of Ῡ
– The individuals in the sample are drawn at random
– Thus the values of (Y1, …, Yn) are random
– Thus functions of (Y1, …, Yn), such as Ῡ , are random: had a different
sample been drawn, they would have taken on a different value
➔ That’s why Ῡ has a probability distribution that is called the
sampling distribution of Ῡ (Ῡ varies from sample to sample)
– The mean and variance of Ῡ are the mean and variance of its sampling
distribution, E(Ῡ) and var(Ῡ)
– The concept of the sampling distribution underpins all of econometrics



The mean and variance of the sampling
distribution of Ῡ
E (Y ) = Y

 Y2
var(Y ) =
n
Implications:
1. Ȳ is an unbiased estimator of μ_Y (that is, E(Ȳ) = μ_Y)
2. var(Ȳ) is inversely proportional to n
   - the spread of the sampling distribution is proportional to 1/√n
   - thus the sampling uncertainty associated with Ȳ is proportional
     to 1/√n (larger samples, less uncertainty, but square-root law)



(a) The sampling distribution of the
sample mean Ῡ: Example
The weights of pennies minted after 2010 (denoted Y) are
approximately normally distributed with mean of 2.46 grams and
standard deviation of 0.02 grams
➔Population mean: 𝜇𝑌 = 2.46
➔Population standard deviation: 𝜎𝑌 = 0.02

Approximate the sampling distribution of the sample mean by


obtaining 200 simple random samples of size 𝑛 = 5 from this
population



(a) The sampling distribution of the
sample mean Ῡ: Example
The data presented in the next slide represent the sample means for
the 200 random samples of size 𝑛 = 5

For example, the first sample had the following data:


2.493  2.466  2.473  2.492  2.471
Thus, the mean of this sample is:

Ȳ = (y₁ + y₂ + y₃ + y₄ + y₅) / n = 12.395 / 5 = 2.479
(If we take another sample of size 𝑛 = 5, the value of 𝑌ത will not be
the same)





(a) The sampling distribution of the
sample mean Ῡ: Example
• The mean of the (200) sample means is 2.46, the same as the
mean of the population:
E(Ȳ) = μ_Y = 2.46
• The standard deviation of the (200) sample means is approximately:
σ_Ȳ = σ_Y / √n = 0.02 / √5 ≈ 0.0089

The standard deviation of the sampling distribution of Ȳ is called
the standard error of the mean and is denoted σ_Ȳ
σ_Ȳ < σ_Y: As the sample gets larger, we do not expect as much
spread in the sample means because larger observations will
offset smaller observations
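A quick simulation of the penny example (simulated normal draws, not the actual 200 samples) illustrates E(Ȳ) = μ_Y and σ_Ȳ = σ_Y/√n:

```python
import math
import random
import statistics

random.seed(1)
mu, sigma, n, reps = 2.46, 0.02, 5, 10_000

# Draw many samples of size n and record each sample mean
sample_means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(reps)
]

mean_of_means = statistics.mean(sample_means)  # should be close to mu
sd_of_means = statistics.stdev(sample_means)   # should be close to sigma / sqrt(n)
```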
Things we want to know about the sampling
distribution:
• What is the mean of Ῡ?
– If E(Ῡ) = μ_Y, then Ῡ is an unbiased estimator of μ_Y

• What is the variance of Ῡ?
– How does var(Ῡ) depend on n? (the famous 1/n formula)

• Does Ῡ become close to μ_Y when n is large?
– Law of large numbers: Ῡ is a consistent estimator of μ_Y

• Ῡ − μ_Y appears bell shaped for n large… is this generally true?
– In fact, Ῡ − μ_Y is approximately normally distributed for n large (Central
Limit Theorem)



The sampling distribution of Ῡ when n is
large
For small sample sizes, the distribution of Ῡ is complicated, but if
n is large, the sampling distribution is simple!
1. As n increases, the distribution of Ῡ becomes more tightly centered
around μY (the Law of Large Numbers)
2. Moreover, the distribution of Ῡ – μY becomes normal (the Central Limit
Theorem)



The Law of Large Numbers:
An estimator is consistent if the probability that it falls within an
interval of the true population value tends to 1 as the sample size
increases
Under certain conditions, as n increases, Ȳ is a consistent
estimator of μ_Y (or equivalently, Ȳ converges in probability to μ_Y:
Ȳ →ᵖ μ_Y):
– (Y₁, …, Yₙ) are i.i.d.
– The variance of Y is finite (σ²_Y < ∞)
➔ The Law of Large Numbers states that, under these
conditions, 𝑌ത will be near 𝜇𝑌 with very high probability when 𝑛 is
large (because when the sample is large, the large values tend to
balance the small values, and their sample average will be close to
their common mean)
The Central Limit Theorem (CLT)
• Under certain conditions, the distribution of 𝑌ത becomes
approximately normal as the sample size 𝑛 increases (whether
the population is normally distributed or not):
Ȳ is approximately normal → Ȳ ~ N(μ_Y, σ²_Ȳ)
Conditions:
- (Y₁, …, Yₙ) are i.i.d.
- 0 < σ²_Y < ∞

• The standardized version of Ȳ: (Ȳ − μ_Y) / σ_Ȳ

According to the CLT, when n is large, the distribution is well
approximated by the standard normal distribution:

(Ȳ − μ_Y) / σ_Ȳ ~ N(0, 1)
The Central Limit Theorem (CLT)
Sampling distribution of Ῡ when Y is Bernoulli, p = 0.78:
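The CLT claim can be checked by simulation; a sketch for Bernoulli draws with p = 0.78 (the sample size, number of replications, and seed are arbitrary choices):

```python
import math
import random
import statistics

random.seed(7)
p, n, reps = 0.78, 100, 5_000
se = math.sqrt(p * (1 - p) / n)   # standard deviation of the sample mean

# Standardized sample means of Bernoulli(0.78) draws
z = [
    (statistics.mean(random.random() < p for _ in range(n)) - p) / se
    for _ in range(reps)
]

# If the CLT holds, roughly 95% of z-values fall in (-1.96, 1.96)
coverage = sum(abs(zi) < 1.96 for zi in z) / reps
```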



The Central Limit Theorem (CLT)
Y − E (Y )
Same example: sampling distribution of :
var(Y )



Summary: The Sampling Distribution of Ῡ
For Y₁, …, Yₙ i.i.d. with 0 < σ²_Y < ∞:
• The exact (finite-sample) sampling distribution of Ȳ has mean μ_Y
(“Ȳ is an unbiased estimator of μ_Y”) and variance σ²_Y/n
• Other than its mean and variance, the exact distribution of Ȳ is
complicated and depends on the distribution of Y (the population
distribution)
• When n is large, the sampling distribution simplifies:
– Ȳ →ᵖ μ_Y (Law of large numbers)
– (Ȳ − E(Ȳ)) / √var(Ȳ) is approximately N(0, 1) (CLT)
(b) Why Use Ῡ To Estimate μY?
• A natural way to estimate the mean value of Y in a population
(that is 𝜇𝑌 ) is to compute the sample mean 𝑌ത from a sample of 𝑛
i.i.d. observations
• In general, we would like an estimator that gets as close as
possible to the unknown true value
• Thus, this leads to three specific desirable characteristics of an
estimator:
– Unbiasedness
– Consistency
– Efficiency



(b) Why Use Ῡ To Estimate μY?
Unbiasedness
• Suppose you evaluate an estimator many times over repeated
randomly drawn samples. It is reasonable to hope that, on
average, you would get the right answer. Thus, a desirable
property of an estimator is that the mean of its sampling
distribution equals 𝜇𝑌 ; if so, the estimator is said to be unbiased.

• If the estimator is unbiased → 𝐸 𝑌ത = 𝜇𝑌



(b) Why Use Ῡ To Estimate μY?
Consistency
• An estimator is consistent if the probability that it falls within an
interval of the true population value tends to 1 as the sample size
increases

➔ 𝑌ത will be near 𝜇𝑌 with very high probability when 𝑛 is large



(b) Why Use Ῡ To Estimate μY?
Difference between Unbiasedness & Consistency
• Unbiasedness is a statement about the expected value of the
sampling distribution of the estimator: 𝐸 𝑌ത = 𝜇𝑌
• Consistency is a statement about “where the sampling
distribution of the estimator is going” as the sample size
increases → Consistency means that, as the sample size
increases, the sampling distribution of the estimator becomes
increasingly concentrated at the true parameter value (𝜇𝑌 )



(b) Why Use Ῡ To Estimate μY?
Efficiency
• Suppose you have two candidate estimators, Ȳ₁ and Ȳ₂, that are
both unbiased; how might you choose between them?
• One way to do so is to choose the estimator with the tightest
sampling distribution. This suggests picking the estimator with
the smallest variance
• If Ȳ₁ has a smaller variance than Ȳ₂, then Ȳ₁ is said to be more
efficient than Ȳ₂
• The terminology “efficiency” derives from the notion that if Ȳ₁
has a smaller variance than Ȳ₂, then it uses the information in the
data more efficiently than does Ȳ₂
(b) Why Use Ῡ To Estimate μY?
• Ȳ isn’t the only estimator of μ_Y; can you think of a time you
might want to use the median instead?

• Thanks to its properties (unbiasedness, consistency, efficiency),
Ȳ is the Best Linear Unbiased Estimator (BLUE) → Ȳ is the
most efficient estimator of μ_Y among all linear unbiased
estimators



(b) Why Use Ῡ To Estimate μY?
▪ Ȳ is the least squares estimator of μ_Y
➢ Ȳ provides the best fit to the data in the sense that the
average squared difference between the observations and Ȳ
is the smallest among all possible estimators
➢ Consider the problem of finding the estimator m that
minimizes the total squared gap (difference) between the
estimator m and the sample points: Σ_{i=1}^{n} (Y_i − m)²
➢ Because m is an estimator of E(Y), you can think of it as a
prediction of the value of Y_i
➢ Thus, the gap Y_i − m can be thought of as a prediction
mistake (error)



(b) Why Use Ῡ To Estimate μY?
▪ Ȳ is the least squares estimator of μ_Y
➢ The estimator m that minimizes the sum of squared gaps
Σ_{i=1}^{n} (Y_i − m)² is called the “least squares estimator”
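A numerical check that the sample mean is the least squares estimator; a sketch with hypothetical data and a grid of candidate values of m:

```python
import statistics

# Hypothetical sample; the sum of squared gaps sum_i (Y_i - m)^2,
# viewed as a function of the constant m, is minimized at the mean
y = [2.0, 4.0, 5.0, 9.0]

def ssq(m):
    """Sum of squared gaps between the data and the candidate m."""
    return sum((yi - m) ** 2 for yi in y)

ybar = statistics.mean(y)

# Search a grid of candidate estimators 0.0, 0.1, ..., 10.0
grid = [i / 10 for i in range(0, 101)]
best_on_grid = min(grid, key=ssq)
```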



1. The probability framework for statistical inference
2. Estimation
3. Hypothesis Testing
4. Confidence intervals

Hypothesis Testing
The hypothesis testing problem (for the mean): make a provisional
decision based on the evidence at hand whether a null hypothesis is
true, or instead that some alternative hypothesis is true. That is, test
– H0: E(Y ) = μY,0 vs. H1: E(Y ) > μY,0 (1-sided, >)
– H0: E(Y ) = μY,0 vs. H1: E(Y ) < μY,0 (1-sided, <)
– H0: E(Y ) = μY,0 vs. H1: E(Y ) ≠ μY,0 (2-sided)



Hypothesis testing

• Right-tail test: 𝐻0: E(Y) = μ_Y   𝐻1: E(Y) > μ_Y   (rejection area α in the right tail)

• Left-tail test: 𝐻0: E(Y) = μ_Y   𝐻1: E(Y) < μ_Y   (rejection area α in the left tail)

• Two-tail test: 𝐻0: E(Y) = μ_Y   𝐻1: E(Y) ≠ μ_Y   (rejection area α/2 in each tail)


Hypothesis testing: Traditional method
If n ≥ 30 and σ_Y is known:
• Compute the z-statistic:

z = (Ȳ − μ_Y) / σ_Ȳ    where σ_Ȳ = σ_Y / √n

• Determine the z-critical value (from the z-table), with a
level of significance of 5% (α = 0.05):
– If one-tail test: 1 − α = 1 − 0.05 = 0.95
  From the z-table, the critical value of z is ±1.645
– If two-tail test: 1 − α/2 = 1 − 0.05/2 = 0.975
  From the z-table, the critical value of z is ±1.96
Hypothesis testing: Traditional method
Right-tail test with 𝜶 = 𝟎. 𝟎𝟓



Hypothesis testing: Traditional method
Left-tail test with 𝜶 = 𝟎. 𝟎𝟓



Hypothesis testing: Traditional method
Two-tail test with 𝜶 = 𝟎. 𝟎𝟓



Hypothesis testing: Traditional method
• If the z-statistic falls in the rejection area (|z-statistic| ≥
z-critical) → we reject the null hypothesis (𝐻0)
• If the z-statistic does not fall in the rejection area
(|z-statistic| < z-critical) → we fail to reject the null
hypothesis (𝐻0)

If n < 30 and/or σ_Y is unknown → we use the t-test
(see slides 9-13)



Hypothesis testing: p-value method
• Although a sample of data cannot provide conclusive evidence about
the null hypothesis, it is possible to do a probabilistic calculation
that permits testing the null hypothesis in a way that accounts for
sampling uncertainty → computing the p-value of the null hypothesis
• The p-value measures the strength of the evidence against the null hypothesis
– The smaller the p-value, the stronger the evidence against 𝐻0
– The larger the p-value, the weaker the evidence against 𝐻0
• The p-value is compared to the significance level (𝛼) in
order to know if we have to reject 𝐻0 or not



Hypothesis testing: p-value method
If n ≥ 30 and σY is known:
• Compute the z-statistic:
  z = (Ȳ − μY) / σȲ, where σȲ = σY / √n
• Compute the p-value
• No need to determine the z-critical value
• We need to know the significance level (α = 0.05)
• If p-value > α = 0.05 → We fail to reject H0
• If p-value ≤ α = 0.05 → We reject H0
Hypothesis testing: p-value method
• The p-value corresponds to the tail area beyond the z-statistic
• We use the z-statistic to determine the area from the z-table
(locate the z-statistic using the values in the first column and
first row of the table, then read off the corresponding area from
the body of the table)
• If one-tail test: p-value = 1 − area
• If two-tail test: p-value = (1 − area) × 2
• If n < 30 and/or σY is unknown → we use the t-test (see slides 9-13)
Hypothesis testing: p-value method
Mathematically:
• One-tail:
  p-value = Pr_H0[Z > z-statistic] = Pr_H0[Z > (Ȳ − μY) / (σY/√n)]
          = 1 − Φ(z-statistic)
• Two-tail:
  p-value = 2 × Pr_H0[|Z| > |z-statistic|]
          = 2 × Pr_H0[|Z| > |Ȳ − μY| / (σY/√n)] = 2 × (1 − Φ(|z-statistic|))
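These formulas can be evaluated directly with the standard normal CDF Φ; the sample numbers below are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs: sample mean, null value mu_Y, known sigma_Y, sample size
y_bar, mu_y, sigma_y, n = 652.1, 650.0, 19.0, 100

z = (y_bar - mu_y) / (sigma_y / sqrt(n))  # z-statistic
phi = NormalDist().cdf                    # standard normal CDF, Phi

p_one_tail = 1 - phi(z)                   # right-tail test
p_two_tail = 2 * (1 - phi(abs(z)))        # two-tail test
```

Both p-values here exceed 0.05 (roughly 0.13 and 0.27), so H0 would not be rejected at the 5% level.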



Hypothesis testing: p-value method
Two-tail
(figure: the two-tail p-value is the combined area in the two tails beyond ±|z-statistic|)


What is the link between the p-value and the
significance level?
• The significance level is prespecified. For example, if
the prespecified significance level is 5%
– you reject the null hypothesis if |t| ≥ 1.96
– Equivalently, you reject if p-value ≤ 0.05
– The p-value is sometimes called the marginal significance
level
– Often, it is better to communicate the p-value than simply
whether a test rejects or not – the p-value contains more
information than the “yes/no” statement about whether the
test rejects
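The equivalence of the two rules can be checked numerically; this sketch uses a handful of hypothetical test statistics:

```python
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal CDF


def p_value_two_tail(z):
    """Two-tail p-value for a z- (or large-n t-) statistic."""
    return 2 * (1 - phi(abs(z)))


# The rejection rule |z| >= 1.96 and the rule p-value <= 0.05 agree
for z in (-3.0, -1.97, -1.0, 0.0, 1.0, 1.95, 1.97, 2.5):
    assert (abs(z) >= 1.96) == (p_value_two_tail(z) <= 0.05)
```

The two rules coincide because 1.96 is, up to rounding, the value of |z| at which the two-tail p-value equals exactly 0.05.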



At this point, you might be wondering,…
What happened to the t-table and the degrees of freedom?

Digression: the Student t distribution

If Yi, i = 1, …, n is i.i.d. N(μY, σY²), then the t-statistic has the
Student t-distribution with n − 1 degrees of freedom.
The critical values of the Student t-distribution are tabulated in the
back of all statistics books. Remember the recipe?
1. Compute the t-statistic
2. Compute the degrees of freedom, which is n – 1
3. Look up the 5% critical value
4. If the t-statistic exceeds (in absolute value) this critical value, reject the
null hypothesis.



Comments on this recipe and the Student
t-distribution
• If the sample size is moderate (several dozen) or large (hundreds
or more), the difference between the t-distribution and N(0,1)
critical values is negligible. Here are some 5% critical values for
2-sided tests:
degrees of freedom (n – 1)    5% t-distribution critical value
10                            2.23
20                            2.09
30                            2.04
60                            2.00
∞                             1.96
Comments on this recipe and the Student
t-distribution
• So, the Student-t distribution is only relevant when the
sample size is very small; but in that case, for it to be
correct, you must be sure that the population
distribution of Y is normal. In economic data, the
normality assumption is rarely credible.
Do you think earnings are normally distributed?
– Suppose you have a sample of n = 10 observations from one
of these distributions – would you feel comfortable using the
Student t distribution?



The Student-t distribution – Summary
• The assumption that Y is distributed N(μY, σY²) is rarely
plausible in practice
• For n > 30, the t-distribution and N(0,1) are very close
(as n grows large, the tn–1 distribution converges to
N(0,1))
• For historical reasons (sample sizes were small),
statistical software typically uses the t-distribution to
compute p-values – but this is irrelevant when the
sample size is moderate or large
• For these reasons, in this class we will focus on the
large-n approximation given by the CLT



1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence intervals

Confidence Intervals
• A 95% confidence interval for μY is an interval that contains the true
value of μY in 95% of repeated samples
• Digression: What is random here? The values of Y1,...,Yn and thus any
functions of them – including the confidence interval. The confidence
interval will differ from one sample to the next. The population
parameter, μY, is not random; we just don’t know it



Confidence Intervals
Ȳ ± t* × SE(Ȳ)
Where:
t*: the critical value of t (from the t-table)
and SE(Ȳ) = sY / √n

If α = 5%, the 95% confidence interval is:

μY ∈ [Ȳ − 1.96 × sY/√n, Ȳ + 1.96 × sY/√n]

This confidence interval relies on the large-n results that Ȳ is
approximately normally distributed and sY² →p σY².
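As a sketch, the large-n 95% interval can be computed with the district test-score statistics from the summary table (mean 654.2, sY = 19.1, n = 420):

```python
from math import sqrt

# Sample statistics for district test scores (from the summary table)
y_bar, s_y, n = 654.2, 19.1, 420

se = s_y / sqrt(n)            # SE(Ybar) = s_Y / sqrt(n)
ci_low = y_bar - 1.96 * se    # large-n 95% confidence interval
ci_high = y_bar + 1.96 * se
```

This gives an interval of roughly (652.4, 656.0) for μY.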
