STAT2102_Chapter6
Definition. The 25th, 50th, and 75th percentiles are called the first, second, and third
quartiles of the sample. The 50th percentile is also called the median of the sample.
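As a quick illustration (our example, not from the notes; note that numpy's default percentile interpolation is just one of several common conventions for sample percentiles):

```python
import numpy as np

# Hypothetical sample of heights (cm), for illustration only.
heights = np.array([158, 162, 165, 167, 170, 172, 175, 180])

# 25th, 50th, and 75th percentiles: the first, second, and third quartiles.
q1, q2, q3 = np.percentile(heights, [25, 50, 75])
print(q1, q2, q3)  # the 50th percentile q2 equals np.median(heights)
```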
• We believe that the height of CUHK students is normally distributed. How can we
estimate µ?
• Point estimates (Chapter 6) are single numbers, for example, µ̂ = 165 cm.
• Interval estimates (Chapter 7) give a range in which the parameter value is likely to
be, for example, [163, 167] for µ.
Definition. A function of X1, X2, . . . , Xn is called a statistic. The statistic, say u(X1, X2, . . . , Xn), used to estimate θ is called a (point) estimator of θ. The value u(x1, x2, . . . , xn) computed using the data is called the estimate.
How to choose an appropriate estimator?
Definition. If E[u(X1, X2, . . . , Xn)] = θ, then the statistic u(X1, X2, . . . , Xn) is called an unbiased estimator of θ. Otherwise, it is said to be biased.
We also want the mean squared error E[(u(X1, X2, . . . , Xn) − θ)²] to be small, so that the estimator is close to the truth.
Theorem. The sample mean
X̄ = (1/n) ∑_{i=1}^n Xi
is an unbiased estimator of the population mean µ, that is, E[X̄] = µ.
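A quick simulation sketch (with made-up values µ = 165, σ = 6, n = 25) illustrates this: the average of many realizations of X̄ is close to µ, and the empirical mean squared error is close to σ²/n, the variance of X̄.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 165.0, 6.0, 25, 100_000  # made-up parameters

# Draw `reps` samples of size n and compute the sample mean of each.
xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print(xbars.mean())                 # close to mu = 165 (unbiasedness)
print(((xbars - mu) ** 2).mean())   # close to sigma^2 / n = 1.44 (the MSE)
```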
Besides estimating the mean and variance, a large class of estimation problems can be described as follows. Suppose we have a random sample X1, X2, . . . , Xn from the population distribution with pmf or pdf f(x; θ), that is, we know that the distribution belongs to a particular family of distributions (e.g., exponential) but we don't know the parameter. How do we estimate θ?
Definition. We define the likelihood function to be
L(θ) = f(x1; θ) f(x2; θ) · · · f(xn; θ) = ∏_{i=1}^n f(xi; θ),
regarded as a function of θ for the observed sample x1, x2, . . . , xn. The maximum likelihood estimator (MLE) θ̂ is the value of θ that maximizes L(θ).
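For a concrete illustration (our example, using the exponential family mentioned above): if f(x; θ) = θe^{−θx} for x > 0, then L(θ) = θⁿ e^{−θ ∑ xi}, and setting the derivative of log L(θ) to zero gives θ̂ = 1/x̄. The sketch below checks this numerically on made-up data.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy data, assumed to come from f(x; theta) = theta * exp(-theta * x), x > 0.
x = np.array([0.8, 1.3, 0.2, 2.1, 0.6])

def neg_log_likelihood(theta):
    # -log L(theta), where log L(theta) = n*log(theta) - theta * sum(x_i)
    return -(len(x) * np.log(theta) - theta * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
print(res.x, 1.0 / x.mean())  # numerical maximizer vs. closed form 1/x-bar
```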
6.5 A Simple Regression Problem
Now we consider yet another class of estimation problems. "Regression analysis" refers to the analysis of data involving two or more variables. Its objective is to discover the nature of their relationship and then to exploit it for the purposes of prediction. For example,
• A major concern in radiation therapy is the extent of cellular damage induced by the duration and intensity of exposure.
In studying the relation between two variables x and Y , we consider the experiment
where the values of x are controlled, whereas Y is dependent on x and may be subject to
uncontrollable sources of error.
Definition. In the simple linear model, we assume
Yi = α1 + βxi + εi,
where εi, for i = 1, 2, . . . , n, are independent and follow N(0, σ²). We can also rewrite the model as
Yi = α + β(xi − x̄) + εi,
where α = α1 + β x̄.
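To see what the model asserts, here is a small simulation sketch (all parameter values made up): each Yi equals its mean α + β(xi − x̄) plus an independent N(0, σ²) error.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
alpha, beta, sigma = 10.0, 2.0, 0.5  # made-up "true" parameters

# Y_i = alpha + beta * (x_i - xbar) + eps_i, with eps_i ~ N(0, sigma^2) independent
y = alpha + beta * (x - x.mean()) + rng.normal(0.0, sigma, size=len(x))
print(y)
```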
Observing the data (x1, y1), (x2, y2), . . . , (xn, yn), our goal is to estimate α, β and σ².
Theorem. The maximum likelihood estimators of α, β and σ² are, respectively,
α̂ = Ȳ,
β̂ = ∑_{i=1}^n Yi(xi − x̄) / ∑_{i=1}^n (xi − x̄)²,
and
σ̂² = (1/n) ∑_{i=1}^n [Yi − α̂ − β̂(xi − x̄)]².
It can be verified that α̂ and β̂ are unbiased. If we replace 1/n by 1/(n − 2) in the definition of σ̂², then it also becomes unbiased.
Definition. The above α̂ and β̂ are called the least squares estimators because they minimize
∑_{i=1}^n [Yi − α − β(xi − x̄)]²
among all choices of α and β.
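A minimal Python sketch of these estimators (the function name fit_simple_regression is ours, not from the notes):

```python
import numpy as np

def fit_simple_regression(x, y):
    """Return (alpha-hat, beta-hat, sigma2-hat) for Y = alpha + beta*(x - xbar) + eps."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, xbar = len(x), x.mean()
    alpha_hat = y.mean()                                  # alpha-hat = Y-bar
    beta_hat = np.sum(y * (x - xbar)) / np.sum((x - xbar) ** 2)
    resid = y - alpha_hat - beta_hat * (x - xbar)
    sigma2_hat = np.sum(resid ** 2) / n                   # MLE; divide by n - 2 for unbiased
    return alpha_hat, beta_hat, sigma2_hat

# Data from the example below: reproduces alpha-hat = 6, beta-hat = -1.5.
print(fit_simple_regression([1, 1, 2, 4], [7, 8, 6, 3]))
```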
Example. Given these four pairs of (x, y) values:

x 1 1 2 4
y 7 8 6 3

Find the least squares regression line. (Answer: α̂ = 6, β̂ = −1.5)
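As a quick check of the stated answer: x̄ = 2 and Ȳ = 6, so α̂ = 6. The numerator of β̂ is 7(−1) + 8(−1) + 6(0) + 3(2) = −9 and the denominator is (−1)² + (−1)² + 0² + 2² = 6, so β̂ = −9/6 = −1.5, and the fitted line is ŷ = 6 − 1.5(x − 2) = 9 − 1.5x.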