Advanced Statistics Lecture Notes


Contents

1 Experimental Designs 3
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Some Standard Experimental Designs . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Completely Randomized Design . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Randomized Block Design . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Latin Square Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2 Nonparametric Methods 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 The Sign Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 The Signed-Rank Test . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 The Wilcoxon Rank-Sum Test for Comparing Two Treatments . . . . . . . . 21
2.4 The Kruskal-Wallis Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Test of randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Measures of Correlation Based on Ranks . . . . . . . . . . . . . . . . . . . . 27
2.6.1 Properties of the rank correlation coefficient . . . . . . . . . . . . . . 28

3 Sampling 30
3.1 Simple Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2 How to Draw a Simple Random Sample . . . . . . . . . . . . . . . . 31
3.1.3 Estimation of Population Mean and Total . . . . . . . . . . . . . . . 33
3.1.4 Selecting the Sample Size for Estimating Population Means and Totals 37
3.1.5 Estimation of Population Proportion . . . . . . . . . . . . . . . . . . 38
3.1.6 Sampling with Probabilities Proportional to Size . . . . . . . . . . . . 42
3.2 Stratified Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.2 How To Draw a Stratified Random Sample . . . . . . . . . . . . . . . 46
3.2.3 Estimation of Population Mean and Total . . . . . . . . . . . . . . . 47
3.2.4 Selecting the Sample Size for Estimating Population Means and Total 50
3.2.5 Allocation of the Sample . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.6 Estimation of Population Proportion . . . . . . . . . . . . . . . . . . 57
3.2.7 Selecting the Sample Size and Allocating the Sample to Estimate Proportions . . . 58

4 Multivariate Analysis 61
4.1 Multivariate Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.1.1 Examples of Multivariate Data . . . . . . . . . . . . . . . . . . . . . 61
4.1.2 Multivariate Data Structure . . . . . . . . . . . . . . . . . . . . . . . 61
4.1.3 Mean and Dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2 Random Vectors and Multivariate Distributions . . . . . . . . . . . . . . . . 63

4.2.1 Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.2 Expectation and Dispersion . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.3 Covariance between Two Random Vectors . . . . . . . . . . . . . . . 65
4.2.4 Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . 66
4.3 Properties of Multivariate Normal Distribution . . . . . . . . . . . . . . . . . 66
4.3.1 Expectation and Dispersion . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.2 Partitioning of Multivariate Normal Vector . . . . . . . . . . . . . . . 67
4.3.3 Bivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . 69
4.4 Sampling from a Multivariate Normal Distribution . . . . . . . . . . . . . . . 71
4.4.1 Wishart Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.2 Maximum Likelihood Estimator . . . . . . . . . . . . . . . . . . . . . 71
4.4.3 The Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.4 The Multivariate Central Limit Theorem . . . . . . . . . . . . . . . . 72
4.5 Inference on Multivariate Normal Mean . . . . . . . . . . . . . . . . . . . . . 72
4.5.1 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5.2 Confidence regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5.3 Simultaneous confidence intervals . . . . . . . . . . . . . . . . . . . . 74

5 Decision Theory 75
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Mixed Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Statistical Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 The Minimax Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5 The Bayes Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

6 Inference 89
6.1 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.3 Method of Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4 Methods of evaluating Estimators . . . . . . . . . . . . . . . . . . . . . . . . 94
6.5 Cramér-Rao Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.6 Sufficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

Chapter 1

Experimental Designs

1.1 Introduction
Earlier you learned how to compare the means of two populations. In many practical
situations, we will have more than two population means to compare. For example, we may
want to compare the mean lengths of five different species of turtles by testing the hypothesis
that all have the same mean. If you take independent samples from these populations the
sample means will most likely be different. But this does not mean that the population
means are different. So we need a proper statistical test to decide if there is enough evidence
to refute the null hypothesis.
Performing multiple t-tests and comparing each pair is not only tedious and time-consuming,
but also increases the probability of making a type I error. This is because, even though the
probability of wrongly rejecting the null hypothesis that two population means are equal is
fixed at α for each test, when k such tests are performed the overall probability of wrongly
rejecting at least one of them increases to 1 − (1 − α)^k (for independent tests). For example,
if α = 0.05 and there are 10 such tests, the overall type I error probability is close to 0.40.
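As a quick check of this figure, here is a minimal Python sketch (illustrative only; the function name is ours) that evaluates 1 − (1 − α)^k for α = 0.05 and k = 10:

    # Familywise probability of at least one false rejection among k independent
    # tests, each carried out at significance level alpha.
    def familywise_error(alpha, k):
        return 1 - (1 - alpha) ** k

    print(familywise_error(0.05, 10))  # about 0.401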
Thus we need a single test procedure whose probability of type I error is α for testing
the hypothesis that all the means are equal. This procedure is called Analysis of Variance.
The assumptions required are the following:

1. The underlying populations are approximately normally distributed.

2. The population variances are all equal.

Let us see how the test is done for five populations. Generalising for k populations will
then be easily understood. Suppose we have five sets of measurements that are normally
distributed with population means µ1 , µ2 , µ3 , µ4 , µ5 and common variance σ 2 . Samples of
size 9 are taken from each and the sample means and the sample variances are calculated.
We shall define the combined estimate of the common variance by

    sw² = [(n1 − 1)s1² + (n2 − 1)s2² + (n3 − 1)s3² + (n4 − 1)s4² + (n5 − 1)s5²] / [(n1 − 1) + (n2 − 1) + (n3 − 1) + (n4 − 1) + (n5 − 1)]

which is just an extension of the pooled variance we came across in the two-sample case. It
measures the variability of the observations within the five populations and the subscript w
stands for within-population variability.
Next we consider a quantity that measures the variability between the populations. As-
suming that the null hypothesis is true, that is, all the population means are indeed equal,
we know that the sampling distribution of the sample mean based on 9 observations will
have the same mean µ and variance σ²/9. Since we have drawn five samples of nine observations
each, we can estimate the variance of the distribution of sample means, σ²/9, using the
fact that the sample variance of the sample means is given by

    [Σ ȳi² − (Σ ȳi)²/5] / (5 − 1).

This quantity estimates σ²/9, and hence 9 times that quantity estimates σ². Call this quantity
sB², the between-population variability. If the null hypothesis is true, then the within-population
and between-population variabilities should be about the same, so the ratio sB²/sw² should be close to
1. We will use this ratio as our test statistic.
It turns out that in our example this ratio has an F distribution with 4 and 40 degrees of freedom.
Thus we can check the validity of the null hypothesis by comparing the calculated
value of sB²/sw² with the value found from the F-table.
The experimental setting we just described, where a random sample of observations is
taken from each of k different populations is called a completely randomized design. Though
the example we discussed had sample sizes all equal, it need not be the case in general for
a completely randomized design. We will now discuss in full generality how the
hypothesis about equality of means is tested. Remember that we assume normality and
equality of variances.

Notation
yij = the jth observation in the sample from population i.
ni = the sample size for the sample from population i.
n = Σ ni = the total sample size.
Ti = the sum of the sample measurements obtained from population i.
ȳi = Ti /ni = the average of the observations drawn from population i.
G = Σ Ti = the grand total of all observations.
ȳ = G/n = the average of all observations.
TSS = Total sum of squares = Σ (yij − ȳ)² = (n − 1)s²
SST = Between-sample (treatment) sum of squares = Σ ni (ȳi − ȳ)²
SSE = Within-sample (error) sum of squares = Σ (yij − ȳi)²

Shortcut Formulas
TSS = Σ yij² − G²/n
SST = Σ Ti²/ni − G²/n
SSE = TSS − SST

The statistical model for this design is the following:


yij = µ + αi + εij
where
µ = an overall mean, which is an unknown constant
αi = an effect due to treatment i (an unknown constant)
εij = a random error associated with the j th response on treatment i.

The errors εij ’s are assumed to be normally distributed with mean zero and an unknown
variance σ 2 . In addition we shall also assume that the errors are independent.
The statistical test of interest is to see if there is a difference among the treatment means.
The null hypothesis then is
H0 : α1 = α2 = · · · = αt = 0

where t is the number of experimental groups or samples.
Note: The “Sum of Squares Between” and the “Sum of Squares within” that we talked
about earlier are now called “Sum of Squares due to Treatment” and “Sum of Squares due
to Error” respectively, and will be denoted by SST and SSE respectively. Once we have
computed the required quantities, we arrange them in the form of a table generally known
as the analysis of variance table (or ANOVA table).
Source        Degrees of freedom    Sum of squares    Mean squares            F Ratio
Treatments    k − 1                 SST               MST = SST/(k − 1)       MST/MSE
Error         n − k                 SSE               MSE = SSE/(n − k)
Total         n − 1                 TSS
Here TSS and SST are computed using the shortcut formulas and SSE is found by subtraction.
The above type of ANOVA is called a single factor analysis of variance or a one-way analysis
of variance.

Example 1.1.1 Nineteen pigs are assigned at random among four experimental groups.
Each group is fed a different diet. The data are pig body weights, in kilograms, after being
raised on these diets. Perform a 0.05 level test to decide whether the weights are the same
for all four diets.

H0 : µ1 = µ2 = µ3 = µ4 (The mean weights are all equal)


H1 : H0 is not true (not all four mean weights are equal)
Feed1 Feed2 Feed3 Feed4
60.8 68.7 102.6 87.9
57.0 67.7 102.1 84.2
65.0 74.0 100.2 83.1
58.6 66.3 96.5 85.7
61.7 69.8 90.3

Solution:
Here k = 4; n1 = n2 = n4 = 5, n3 = 4 and n = 19.
T1 = 303.1, T2 = 346.5, T3 = 401.4, T4 = 431.2, G = 1482.2.
Σ yij² = 119981.900 and G²/n = 115627.202
TSS = Σ yij² − G²/n = 119981.900 − 115627.202 = 4354.698
SST = Σ Ti²/ni − G²/n = 119853.550 − 115627.202 = 4226.348
SSE = TSS − SST = 4354.698 − 4226.348 = 128.350

Source        Degrees of freedom    Sum of squares    Mean squares    F Ratio
Treatments    3                     4226.348          1408.783        164.635
Error         15                    128.350           8.557
Total         18                    4354.698
As the calculated F-value is larger than the table value F3,15,0.05 = 3.29, we reject the null
hypothesis and conclude that the means are not all equal, or in other words, “there is a
treatment effect”.
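These computations are easy to reproduce programmatically. Below is a minimal Python sketch (illustrative only; the function name is ours) that applies the shortcut formulas to the data of Example 1.1.1 and returns the F ratio:

    # One-way ANOVA (completely randomized design) via the shortcut formulas.
    def one_way_anova(groups):
        k = len(groups)
        n = sum(len(g) for g in groups)
        G = sum(sum(g) for g in groups)               # grand total
        cf = G * G / n                                # correction factor G^2/n
        tss = sum(y * y for g in groups for y in g) - cf
        sst = sum(sum(g) ** 2 / len(g) for g in groups) - cf
        sse = tss - sst
        mst, mse = sst / (k - 1), sse / (n - k)
        return tss, sst, sse, mst / mse

    feeds = [
        [60.8, 57.0, 65.0, 58.6, 61.7],
        [68.7, 67.7, 74.0, 66.3, 69.8],
        [102.6, 102.1, 100.2, 96.5],
        [87.9, 84.2, 83.1, 85.7, 90.3],
    ]
    print(one_way_anova(feeds))   # F is approximately 164.6, as in the table above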

1.2 Some Standard Experimental Designs
1.2.1 Completely Randomized Design
The design of an experiment is the process of planning an experiment that is appropri-
ate for the situation. A large part of scientific reasoning consists of drawing conclusions
from experiments (studies) that have been carefully designed, appropriately conducted, and
properly analyzed. In this section we will discuss some standard experimental designs and
their analyses. The single factor ANOVA discussed earlier is called a completely randomized
design (CRD), where we were interested in comparing treatment means for different levels
of a single factor.
Consider the following example. A horticultural laboratory is interested in examining the
leaves of apple trees to detect nutritional deficiencies using three different laboratory proce-
dures. In particular, the laboratory would like to determine whether there is a difference in
mean assay readings for apple leaves utilizing three different laboratory procedures (A, B,
C). The experimental units in this investigation are apple tree leaves and the treatments are
the three levels of the qualitative variable “laboratory procedure”. If a single analyst takes
a random sample of nine leaves from the same tree, randomly assigns three leaves to each
of the three procedures, and assays the leaves using the assigned treatment, we could use
the three sample means to estimate the corresponding mean leaf nutritional deficiency for
the three laboratory test procedures - in other words use the analysis of variance methods
discussed earlier to run a statistical test of the hypothesis that all the treatment means are
identical. The design used for this investigation is a completely randomized design with
three observations for each treatment.
The completely randomized design has several advantages and disadvantages when used as an
experimental design for comparing t treatment means.

Advantages And Disadvantages of a Completely Randomized Design

Advantages

1. It is extremely easy to construct the design.

2. The design is easy to analyze even though the sample sizes might not be the same for
each treatment.

3. The design can be used for any number of treatments.

Disadvantages

1. Although the completely randomized design can be used for any number of treatments,
it is best suited for situations in which there are relatively few treatments.

2. The experimental units to which treatments are applied must be as homogeneous as


possible. Any extraneous sources of variability will tend to inflate the error term,
making it more difficult to detect differences among the treatment means.

Now we will extend the idea to cases where nuisance factors are present. These are
factors we are not primarily interested in, but which nevertheless need to be considered
because they affect the outcome of the experiment. For these designs we need to arrange the
data in such a way that we can control the extraneous factors.

1.2.2 Randomized Block Design
Let us now change the horticultural problem slightly and see how well the completely ran-
domized design suits our needs. Suppose that, rather than relying upon one analyst, we use
three analysts for the leaf assays, say, Analyst 1, Analyst 2 and Analyst 3. If we randomly
assigned three apple leaves to each of the analysts, we might end up with a randomization
scheme like the one listed in the table below:

Analyst
1 2 3
A B C
A B C
A B C

Even though we still have three observations for each treatment in this scheme, any
differences that we may observe among the leaf determinations for the three laboratory
procedures may be due entirely to differences among the analysts who assayed the leaves.
For example, if we tested the hypothesis H0 : µA − µB = 0 against H1 : µA − µB ≠ 0 and
were led to reject H0 , we would not be able to tell whether µA differs from µB because
assays from analyst 1 are different from those for analyst 2 or because the properties of
determinations by procedure A differ markedly from those for procedure B. This example
illustrates a situation in which the nine experimental units (tree leaves) are affected by an
extraneous source of variability: the analyst. In this case the units differ markedly and
would not be a homogeneous set on which we could base an evaluation of the effects of the
three treatments.
The completely randomized design just described can be modified to gain additional in-
formation concerning the means µA , µB and µC . We can block out the undesirable variability
among analysts by using the following experimental design. We restrict our randomization
of treatments to experimental units to ensure that each analyst performs a determination
using each of the three procedures. The order of these determinations for each analyst is
randomized. One such randomization is listed below.

Analyst
1 2 3
A B A
C A B
B C C
Table 1.2.2

Note that each analyst will assay three leaves, one leaf for each of the three procedures.
Hence pairwise comparisons among the laboratory procedures that utilize the sample means
will be free of any variability among analysts. For example, if we ran the test

H0 : µA − µB = 0 against H1 : µA − µB ≠ 0
and rejected H0 , the difference between µA and µB would be due to a difference between the
nutritional deficiencies detected by procedures A and B and not due to a difference among
the analysts, since each analyst would have assayed one leaf for each of the three procedures.
This design, which represents an extension to the completely randomized design is called
a randomized block design; the analysts in our experiment are called blocks. By using this
design, we have effectively filtered out any variability among the analysts, enabling us to
make more precise comparisons among the treatment means µA , µB and µC .

As another example, suppose we want to study weight gain in guinea pigs where the
factor whose effect is to be tested is diet (the treatment in the ANOVA). Twenty animals are to be
used in this experiment, five on each of four diets. However, the experimenter believes there
are some experimental factors that likely would affect weight gain and does not feel that all
twenty animal cages can be kept in the laboratory at identical conditions of temperature,
light, etc. Therefore, five blocks of experimental units are established, i.e. five groups
of guinea pigs. Each block of animals consists of four guinea pigs, one on each of the four
experimental diets. All members of a block are considered to be at identical conditions
(except, of course, for diet).
In general, we can use a randomized block design to compare t different treatment means
when an extraneous source of variability (blocks) is present. If there are b different blocks,
we would run each of the t treatments in each block to filter out the block to block variability.
In our example we had t = 3 treatment means (laboratory procedures) and b = 3 blocks
(analysts).

Definition 1.2.1 A randomized block design is an experimental design for comparing t


treatments in b blocks. Treatments are randomly assigned to experimental units within a
block, with each treatment appearing exactly once in every block. Thus we will have a total
sample size of n = bt with t observations in each block and b observations receiving each
treatment.
Advantages and Disadvantages of a Randomized Block Design
The randomized block design has certain advantages and disadvantages, as described
below:
Advantages

1. It is a useful design for comparing t treatment means in the presence of a single extra-
neous source of variability.

2. The statistical analysis is simple.

3. The design is easy to construct.

4. It can be used to accommodate any number of treatments in any number of blocks.

Disadvantages

1. Since the experimental units within a block must be homogeneous, the design is best
suited for a relatively small number of treatments.

2. This design controls for only one extraneous source of variability (due to blocks).
Additional extraneous sources of variability tend to increase the error term, making
it more difficult to detect treatment differences.

3. The effect of each treatment on the response must be approximately the same from
block to block.

NOTATION
yij = Observation for treatment i in block j.
t = The number of treatments.
b = The number of blocks.
n = The total sample size = bt.
Ti = The sum of the sample measurements receiving treatment i.
ȳi = Ti /b = The average of the observations receiving treatment i.
Bj = The sum of the sample measurements in block j.
B̄j = Bj /t = The sample mean for block j.
G = Σ Ti = The grand total of all observations.
ȳ = G/n = The average of all observations.
TSS = Total sum of squares = Σ (yij − ȳ)² = (n − 1)s² = Σ yij² − G²/n
SST = Treatment sum of squares = b Σ (ȳi − ȳ)² = Σ Ti²/b − G²/n
SSB = Block sum of squares = t Σ (B̄j − ȳ)² = Σ Bj²/t − G²/n
SSE = TSS − SST − SSB

The statistical model for this design is the following:

    yij = µ + αi + βj + εij

where
µ = an overall mean, which is an unknown constant
αi = an effect due to treatment i (an unknown constant)
βj = an effect due to block j (an unknown constant)
εij = a random error associated with the response on treatment i, block j.

The errors εij ’s are assumed to be normally distributed with mean zero and an unknown
variance σ². In addition we shall also assume that the errors are independent.

The statistical test of interest is to see if there is a difference among the treatment means.
The null hypothesis then is

H0 : α1 = α2 = · · · = αt = 0

The ANOVA table for an RBD is given below:

Source        Degrees of freedom    Sum of squares    Mean squares                 F-Ratio
Treatments    t − 1                 SST               MST = SST/(t − 1)            MST/MSE
Blocks        b − 1                 SSB               MSB = SSB/(b − 1)            MSB/MSE
Error         (b − 1)(t − 1)        SSE               MSE = SSE/[(b − 1)(t − 1)]
Total         bt − 1                TSS

The computed F-ratio for the treatments, MST/MSE, is compared with the table value of
the F-distribution with t − 1 and (b − 1)(t − 1) degrees of freedom; if the computed value is
higher, then the null hypothesis is rejected and it is concluded that there is a treatment effect.

We can use these computations to test if there is a block effect. Here the test is

H0 : β1 = β2 = · · · = βb = 0
Similar to the treatment effect testing procedure, we compare the computed value of MSB/MSE
with the table value of the F-distribution with (b − 1) and (b − 1)(t − 1) degrees of freedom.
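The RBD computations can be scripted in the same way. The following is a minimal Python sketch (illustrative only; the function name and the small 3-treatment, 2-block data set are ours, made up purely for demonstration):

    # Randomized block design ANOVA from a t-by-b table y[i][j]
    # (rows = treatments, columns = blocks).
    def rbd_anova(y):
        t, b = len(y), len(y[0])
        n = t * b
        G = sum(sum(row) for row in y)                    # grand total
        cf = G * G / n                                    # correction factor G^2/n
        tss = sum(v * v for row in y for v in row) - cf
        sst = sum(sum(row) ** 2 for row in y) / b - cf    # treatment sum of squares
        ssb = sum(sum(y[i][j] for i in range(t)) ** 2 for j in range(b)) / t - cf
        sse = tss - sst - ssb
        mst, msb = sst / (t - 1), ssb / (b - 1)
        mse = sse / ((b - 1) * (t - 1))
        return {"F_treatments": mst / mse, "F_blocks": msb / mse}

    data = [[10.1, 11.4], [12.0, 13.2], [9.5, 10.8]]      # hypothetical responses
    print(rbd_anova(data))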

USE OF RBD

One major reason for blocking is that the experimenter does not have a sufficient number
of homogeneous experimental units available to run a completely randomized design with
several observations per treatment combination. A realistic situation where blocking would
be required is an industrial type experiment where there is only sufficient time to run one
set of treatment combinations per day. In such an experiment one would be blocking on
time and the number of days that the experiment is carried out would be the number of
blocks. This type of experiment points out another reason for blocking which is to expand
the inference space. Even if it is possible to run a large number of each treatment combi-
nation on one day, the experimenter might still choose to block over time so that he could
make inferences over time rather than to just one particular day.

Example 1.2.2 Four chemical treatments for fabric are to be compared with regard to
their ability to resist stains. Two different types of fabric are available for the experiment,
and these are to be used as blocks. Each chemical is applied to a sample of each type of
fabric, resulting in a randomized block design. The measurements are given below in Figure 1.1:

Figure 1.1: Assignment of Treatments to Fabrics

The problem is to test if there is evidence to suggest if there are significant differences be-
tween treatment means.

Solution
TSS = 464 − 56²/8 = 464 − 392 = 72
SST = 429 − 392 = 37
SSB = 424 − 392 = 32
SSE = 72 − 37 − 32 = 3

The ANOVA summary table then is

Source        Degrees of freedom    Sum of squares    Mean squares    F Ratio
Treatments    3                     37                12.33           12.33
Blocks        1                     32                32              32
Error         3                     3                 1
Total         7                     72

As the calculated F-value for treatments, 12.33, exceeds the table value F3,3,0.05 = 9.28, we
reject the hypothesis that there is no treatment effect. Also, we see in the summary table that
the F-ratio for blocks is 32, compared with the table value F1,3,0.05 = 10.13. So there is
evidence of a difference between block means.

That is, the fabrics seem to differ with respect to their ability to resist stains when treated
with these chemicals. The decision to ”block out” the variability between fabrics therefore
appears to have been a good one.
RELATIVE EFFICIENCY OF RBD

It is natural to ask whether blocking has increased our precision for
comparing treatment means in a given experiment. Let MSE_RB and MSE_CR represent the
MSE for the RBD and the CRD, respectively. One measure of precision is the variance of ȳi. For
an RBD with b blocks, the estimated variance of ȳi is MSE_RB/b; for a CRD with r observations
per treatment, it is MSE_CR/r. The number r of observations per treatment needed for the CRD
to match the precision of the RBD therefore satisfies

    MSE_CR / r = MSE_RB / b,    or equivalently    r / b = MSE_CR / MSE_RB.

The quantity r/b is called the relative efficiency of the RBD.

If we had to apply this formula directly to find the relative efficiency of the RBD, we would have
to run both an RBD and a CRD, which would be wasteful. The following formula helps to overcome this
difficulty. The proof of the formula is beyond the scope of this course. If MSB and MSE are those of
the RBD,

    MSE_CR / MSE_RB = [(b − 1)MSB + b(t − 1)MSE] / [(bt − 1)MSE]
Example: Calculate the relative efficiency of the RBD in the above example.

    MSE_CR / MSE_RB = [(2 − 1)(32) + 2(4 − 1)(1)] / [(8 − 1)(1)] = 38/7 = 5.43
In other words, it would take more than 5 times as many observations in using a completely
randomized design to gather the same amount of information on the treatment means as it
would take when using the randomized block design.
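A minimal Python sketch (illustrative only; the function name is ours) of this relative-efficiency calculation, reproducing the value 5.43 obtained above for b = 2 blocks, t = 4 treatments, MSB = 32 and MSE = 1:

    # Relative efficiency of an RBD versus a CRD, estimated from the RBD mean squares.
    def rbd_relative_efficiency(b, t, msb, mse):
        return ((b - 1) * msb + b * (t - 1) * mse) / ((b * t - 1) * mse)

    print(rbd_relative_efficiency(b=2, t=4, msb=32, mse=1))  # about 5.43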

1.2.3 Latin Square Design


The Latin square design is used when there are two extraneous sources of variability. (Recall that
we use the RBD if there is one extraneous source of variability.) When there are two sources
of nuisance variation, it is necessary to block on both sources to eliminate this unwanted
variability.
For example, the apple leaf problem of the previous section can be complicated further
in the following way. Suppose that each leaf assay takes a long time and only one can be
done by each analyst per day. If we used the randomized block design of Table 1.2.2, letting
the first row denote day 1, the second row denote day 2, and the third row denote day 3,
the design could be listed as shown in Figure 1.2.
Suppose now that we tested H0 : µA − µB = 0 against H1 : µA − µB ≠ 0. Two procedure
A determinations were done on Day 1 and one on Day 2, whereas procedure B was used on
each of the three days. Thus if we reject H0 , we would not be certain whether µA differed
from µB because of a difference in the laboratory procedures or because of a difference
among the three days. Sometimes laboratory equipment must be calibrated daily and new
chemical solutions must be prepared. Differences in determinations from day to day could
be due to differences among the solutions or to differences in calibration accuracy.
This example illustrates a situation in which the experimental units (leaves) are affected
by a second extraneous source of variability, days. We can modify the randomized block

Figure 1.2: Randomized Block Design for the Leaf Assay in the Presence of a Day Effect

design to filter out this second source of variability, the variability among days, in addition
to filtering out the first source, variability among analysts. To do this we restrict our
randomization to ensure that each treatment appears in each row (day) and in each column
(analyst). One such randomization is shown in Figure 1.3. Note that the test procedures
have been assigned to analysts and to days so that each procedure is performed once on each
day and once by each analyst. Hence pairwise comparisons among treatment procedures that
involve the sample means are free of variability among days and analysts.
This experimental design is called a Latin square design (LSD). In general, a Latin square
design can be used to compare t treatment means in the presence of two extraneous sources
of variability, which we block off into t rows and t columns. The t treatments are then
randomly assigned to the rows and columns so that each treatment appears in every row
and every column of the design (see Figure 1.3).

Figure 1.3: Assignment of Leaves to Analysts and Days

Thus, blocking in two directions can be accomplished by using a Latin square design.

Definition 1.2.3 A t × t Latin square design contains t rows and t columns. The t treat-
ments are randomly assigned to experimental units within the rows and columns so that
each treatment appears in every row and in every column.
Advantages and Disadvantages of a Latin Square Design
The advantages and disadvantages of the Latin square design are listed here.
Advantages
1. The design is particularly appropriate for comparing t treatment means in the presence
of two sources of extraneous variation, each measured at t levels.
2. The analysis is still quite simple.
Disadvantages
1. Although a Latin square can be constructed for any value of t, it is best suited for
comparing t treatments when 5 < t < 10.

2. Any additional extraneous sources of variability tend to inflate the error term making
it more difficult to detect differences among the treatment means.

3. The effect of each treatment on the response must be approximately the same across
rows and columns.

4. The number of rows and columns must be the same as the number of treatments.

A typical randomization scheme for a 4 × 4 Latin square comparing the treatments I, II,
III, and IV is shown in Figure 1.4. Note that each treatment appears in all four rows and
all four columns.

Figure 1.4: A 4 × 4 Latin square

Example 1.2.4 To evaluate the toxicity of certain compounds in water, samples must be
preserved for long periods of time. Suppose an experiment is being conducted to evaluate
the Maximum Holding Time (MHT) for four different preservatives used to treat a mercury
based compound. The MHT is defined as the time that elapses before the solution loses
10% of its initial concentration. Both the level of the initial concentration and the analyst
who measures the MHT are sources of variation, so an experimental design that blocks on
both is necessary to allow the difference in mean MHT between preservatives to be more
accurately estimated. An LSD is constructed wherein each preservative is applied exactly
once to each initial concentration, and is analyzed exactly once by each analyst. For this
the design has to be “square”, since the single application of each treatment level at each
block level requires that the number of levels of each block equal the number of treatment
levels. Thus to apply the LSD using four different preservatives, we must employ four initial
concentrations and four analysts. The design would be constructed as shown below, where
Pi is preservative i.

NOTATION
yijk = Observation for treatment i in row j and column k.
t = The number of treatments = the number of rows = the number of columns.
n = The total sample size = t².
Ti = The sum of the sample measurements receiving treatment i.
ȳi = Ti /t = The average of the observations receiving treatment i.
Rj = The sum of the sample measurements in row j.
R̄j = Rj /t = The sample mean for row j.
Ck = The sum of all observations in column k.
C̄k = Ck /t = The average of all observations in column k.
G = Σ Ti = The grand total of all observations.
ȳ = G/n = The average of all observations.
TSS = Σ (yijk − ȳ)² = (n − 1)s² = Σ yijk² − G²/n
SST = t Σ (ȳi − ȳ)² = Σ Ti²/t − G²/n
SSR = t Σ (R̄j − ȳ)² = Σ Rj²/t − G²/n
SSC = t Σ (C̄k − ȳ)² = Σ Ck²/t − G²/n
SSE = TSS − SST − SSR − SSC

The model for a response in an LSD is about the same as that for an RBD with the added
term to account for the second blocking variable. The model is

yijk = µ + αi + βj + γk + εijk

where
µ = an overall mean, which is an unknown constant
αi = an effect due to treatment i (an unknown constant)
βj = an effect due to row j (an unknown constant)
γk = an effect due to column k (an unknown constant)
εijk = a random error associated with the response on treatment i, row j and column k.

The assumptions are similar to those of RBD model.

The ANOVA table then is given below:

Source        Degrees of freedom    Sum of squares    Mean squares                 F Ratio
Treatments    t − 1                 SST               MST = SST/(t − 1)            MST/MSE
Rows          t − 1                 SSR               MSR = SSR/(t − 1)            MSR/MSE
Columns       t − 1                 SSC               MSC = SSC/(t − 1)            MSC/MSE
Error         (t − 1)(t − 2)        SSE               MSE = SSE/[(t − 1)(t − 2)]
Total         t² − 1                TSS

Example 1.2.5 Four varieties of wheat are to be compared to test if there is any difference
in mean yields. The two additional sources of variability are four different fertilizers and
four different years. It is assumed that the various sources of variability do not interact.
Test the hypotheses that there is no difference in wheat yields for

1. different types of wheat

2. different fertilizers

3. different years

The design and the data are given below: The letters A, B, C, D represent the types of wheat.

Solution:

T1 = 223, T2 = 213, T3 = 247, T4 = 252


R1 = 294, R2 = 243, R3 = 206, R4 = 192
C1 = 236, C2 = 257, C3 = 201, C4 = 241
T SS = 2500, SST = 264, SSR = 1557, SSC = 418, SSE = 261.

These result in the following ANOVA table:

Source        Degrees of freedom    Sum of squares    Mean squares    F Ratio
Treatments    3                     264               88              2.02
Rows          3                     1557              519             11.93
Columns       3                     418               139.33          3.20
Error         6                     261               43.5
Total         15                    2500

As F3,6,0.05 = 4.76, the decisions at the 5% level are: (1) accept, (2) reject, (3) accept.
RELATIVE EFFICIENCY OF LSD

As with the randomized block design, we can compare the efficiency of the Latin square
design to that of the completely randomized design. Let M SELS and M SECR denote the
mean square errors, respectively, for a Latin square design and a completely randomized
design.

The formula for computing the relative efficiency of the LSD is given by

    MSE_CR / MSE_LS = [MSR + MSC + (t − 1)MSE] / [(t + 1)MSE]

where MSE is that of the LSD.

We will refer to the data of the example above and compute the efficiency of the Latin
square design relative to a completely randomized design.

For these data, t = 4, MSR = 519, MSC = 139.33, and MSE = 43.5. Substituting into the
formula for relative efficiency, we have

    MSE_CR / MSE_LS = [519 + 139.33 + 3(43.5)] / [5(43.5)] = 3.63

That is, it would take approximately 3.63 times as many observations in using a completely
randomized design to gather the same amount of information on the treatment means as it
would take when using the Latin square design.
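As with the RBD, this calculation is easily scripted. A minimal Python sketch (illustrative only; the function name is ours), reproducing the value 3.63 for the wheat example:

    # Relative efficiency of a Latin square design versus a CRD,
    # estimated from the LSD mean squares.
    def lsd_relative_efficiency(t, msr, msc, mse):
        return (msr + msc + (t - 1) * mse) / ((t + 1) * mse)

    print(lsd_relative_efficiency(t=4, msr=519, msc=139.33, mse=43.5))  # about 3.63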

Chapter 2

Nonparametric Methods

2.1 Introduction
The models for inference procedures that we have learned thus far assume a specific structure
for the population distribution. The Student’s t test for inferences about a population mean
and the comparison of two means, the χ2 and F tests for inferences about variances, the
inferences presented for regression models, and the analysis of variance are all based on the
assumption that the response measurements constitute samples from normal populations.
These procedures are designed to make inferences about the values of the parameters µ
and σ that appear in the prescription for the mathematical curve of the normal population.
Collectively, they are called normal-theory parametric inference procedures.
Nonparametric statistics is a body of inference procedures that is valid under a much
wider range of shapes for the population distribution. The term nonparametric inference is
derived from the fact that the usefulness of these procedures does not require modeling a
population in terms of a specific parametric form of density curves, such as normal distri-
butions. In testing hypotheses, nonparametric test statistics typically utilize some simple
aspects of the sample data, such as the signs of the measurements, order relationships, or
category frequencies. These general features do not require the existence of a meaningful
numerical scale for measurements. More importantly, stretching or compressing the scale
does not alter them. As a consequence, the null distribution of a nonparametric test statistic
can be determined without regard to the shape of the underlying population distribution.
For this reason, these procedures are also called distribution-free tests.
What major difficulties are associated with parametric procedures based on the t, χ2 , F
or similar distributions, and how do we overcome these difficulties by using a nonparametric
approach? First, the distributions of these parametric model statistics and their tabu-
lated percentage points are valid if the underlying population distribution is approximately
normal. Although the normal distribution does approximate many real-life situations, ex-
ceptions are numerous, so that it is unwise to presume that normality is a fact of life. Drastic
departures from normality often occur in the forms of conspicuous asymmetry, sharp peaks,
or heavy tails. The intended strength of parametric inference procedures can be seriously
affected when such departures occur, especially when the sample size is small or moderate.
For instance, a t test with an intended level of significance of α = .05 may have an actual
type I error probability far in excess of this value. Similarly, the strength of confidence
statements can be seriously distorted.
Second, parametric procedures require that observations be recorded on an unambiguous
scale of measurements and then make explicit use of specific numerical aspects of the data,
such as the sample mean, the standard deviation, and other sums of squares. In a great
many situations, particularly in the social and behavioral sciences, responses are difficult
or impossible to measure on a specific and meaningful numerical scale. Characteristics like

degree of apathy, taste preference, and surface gloss cannot be evaluated on an objective
numerical scale, and an assignment of numbers is, therefore, bound to be arbitrary. Also,
when people are asked to express their views on a 5-point rating scale, where 1 represents
strongly disagree and 5 represents strongly agree, the numbers have little physical meaning
beyond the fact that higher scores indicate greater agreement. Data of this type are called
ordinal data, because only the order of the numbers is meaningful and the distance between
two numbers does not lend itself to practical interpretation. Any controversy surround-
ing a scale of measurement for ordinal data is readily transmitted to parametric statistical
procedures through the use of sample means and standard deviations. Nonparametric pro-
cedures that utilize information only on order or rank are therefore particularly suited to
measurements on an ordinal scale. When the data constitute measurements on a meaningful
numerical scale and the assumption of normality holds, parametric procedures are certainly
more efficient in the sense that tests have higher power and confidence intervals are gener-
ally shorter than their nonparametric counterparts. A choice between the parametric and
nonparametric approach should be guided by a consideration of loss of efficiency and the
degree of protection desired against possible violations of the assumptions.

2.2 The Sign Test


The sign test is often used as a nonparametric alternative to the one-sample t-test, where we
test the null hypothesis µ = µ0 against a suitable alternative. For the sign test, we assume
merely that the population sampled is continuous and symmetrical. We assume that the
population is continuous so that there is zero probability of getting a value equal to µ0 , and
we do not even need the assumption of symmetry if we change the null hypothesis to be
about the population median.
In the sign test we replace each sample value exceeding µ0 with a plus sign and each value
less than µ0 with a minus sign. Then we test the null hypothesis that the number of plus
signs is a value of a random variable having the binomial distribution with the parameters
n (the total number of plus or minus signs) and p = 1/2. The two-sided alternative µ ≠ µ0
thus becomes p ≠ 1/2, and the one-sided alternatives µ < µ0 and µ > µ0 become p < 1/2 and
p > 1/2, respectively. If a sample value equals µ0 , we simply discard it.
To perform a sign test when the sample size is very small, we refer directly to a table of
binomial probabilities. When the sample size is large, we use the normal approximation to
the binomial distribution.

Example 2.2.1 The following are measurements of the breaking strength of a certain kind
of 2-inch cotton ribbon in pounds:

163 165 160 189 161 171 158 151 169 162
163 139 172 165 148 166 172 163 187 173
Use the sign test to test the null hypothesis µ = 160 against the alternative hypothesis
µ > 160 at the 0.05 level of significance.

Solution: The null and alternative hypotheses are given by H0 : µ = 160 and H1 : µ > 160,
and the significance level is given by α = 0.05. We use the test statistic X, the observed
number of plus signs. Replacing each value exceeding 160 with a plus sign, each value
less than 160 with a minus sign, and discarding the one value that equals 160, we get
+ + + + + − − + + + − + + − + + + + + so n = 19 and x = 15. From the binomial tables
we find that P (X ≥ 15) = 0.0095 for p = 1/2. Since the P -value, 0.0095, is less than 0.05,
the null hypothesis must be rejected. Thus we conclude that the mean breaking strength of
the given kind of ribbon exceeds 160 pounds.
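As a check on the tabled probability, a minimal Python sketch (illustrative only; the function name is ours) that computes the exact binomial upper-tail probability P(X ≥ 15) for n = 19 and p = 1/2:

    from math import comb

    # Exact upper-tail probability P(X >= x) for X ~ Binomial(n, 1/2).
    def sign_test_p_value(n, x):
        return sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n

    print(sign_test_p_value(19, 15))  # about 0.0096, matching the tabled 0.0095 up to rounding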

Example 2.2.2 The following data, in tons, are the amounts of sulfur oxides emitted by
a large industrial plant in 40 days:

17 15 20 29 19 18 22 25 27 9
24 20 17 6 24 14 15 23 24 26
19 23 28 19 16 22 24 17 20 13
19 10 23 18 31 13 20 17 24 14
Use the sign test to test the null hypothesis µ = 21.5 against the alternative hypothesis
µ < 21.5 at the 0.01 level of significance.

Solution: The null and alternative hypotheses are given by H0 : µ = 21.5 and H1 : µ < 21.5,
and the significance level is given by α = 0.01. Reject the null hypothesis if z < −z0.01 = −2.33,
where z = (2x − n)/√n, with x being the number of plus signs (values exceeding 21.5). Since
n = 40 and x = 16, we get

    z = (2x − n)/√n = (32 − 40)/√40 = −1.26

Since z = −1.26 exceeds −2.33, the null hypothesis is not rejected.
The sign test can also be used when we deal with paired data. In such problems, each pair
of sample values is replaced by a plus sign if the difference between the paired observations
is positive (that is, if the first value exceeds the second value) and by a minus sign if the
difference between the paired observations is negative (that is, if the first value is less than
the second value), and it is discarded if the difference is zero. To test the null hypothesis
that two continuous symmetrical populations have equal means (or that two continuous
populations have equal medians), we can thus use the sign test, which, in connection with
this kind of problem, is referred to as the paired-sample sign test.

Example 2.2.3 To determine the effectiveness of a new traffic-control system, the numbers
of accidents that occurred at 12 dangerous intersections during four weeks before and four
weeks after the installation of the new system were observed, and the following data were
obtained:

(3, 1) (5, 2) (2, 0) (3, 2) (3, 2) (3, 0)


(0, 2) (4, 3) (1, 3) (6, 4) (4, 1) (1, 0)
Use the paired-sample sign test at the 0.05 level of significance to test the null hypothesis
that the new traffic-control system is only as effective as the old system.

Solution: The null and alternative hypotheses are given by H0 : µ1 = µ2 and H1 : µ1 > µ2
and α = 0.05.
Use the test statistic X, the observed number of plus signs. Replacing each positive
difference by a plus sign and each negative difference by a minus sign, we get + + + +
+ + − + − + ++ so that n = 12 and x = 10. From the binomial tables we find that
P (X ≥ 10) = 0.0192 for p = 0.5. Since the P -value, 0.0192, is less than 0.05, the null
hypothesis must be rejected. Thus we conclude that the new traffic-control system is effective
in reducing the number of accidents at dangerous intersections.

2.2.1 The Signed-Rank Test


The sign test is easy to perform, but since we utilize only the signs of the differences between
the observations and µ0 in the one-sample case or the signs of the differences between the

pairs of observations in the paired-sample case, it tends to be wasteful of information. An
alternative nonparametric test, the Wilcoxon signed-rank test, is less wasteful in that it takes
into account also the magnitudes of the differences. In this test, we rank the differences
without regard to their signs, assigning rank 1 to the smallest difference in absolute value,
rank 2 to the second smallest difference in absolute value etc., and rank n to the largest
difference in absolute value. Zero differences are again discarded, and if the absolute values
of two or more differences are the same, we assign each one the mean of the ranks that they
jointly occupy. Then, the signed-rank test is based on one of the following three statistics:
the sum of the ranks assigned to the positive differences (T+), the sum of the ranks
assigned to the negative differences (T−), or T = min(T+, T−). All these are equivalent
because T+ + T− = n(n + 1)/2.
Depending on the alternative hypothesis, we base the signed-rank test on T + ,T − , or T .
The assumptions and the null hypotheses are the same as in the case of the sign test. The
correct statistic, along with the appropriate critical value, is summarized in the following
table, where in each case the level of significance is α.

Alternative hypothesis        Reject the null hypothesis if
µ < µ0                        T+ ≤ T2α
µ > µ0                        T− ≤ T2α
µ ≠ µ0                        T ≤ Tα

Example 2.2.4 The following are 15 measurements of the octane rating of a certain kind
of gasoline.
97.5 95.2 97.3 96.0 96.8
100.3 97.4 95.3 93.2 99.1
96.1 97.6 98.2 98.5 94.9
Use the signed-rank test at the 0.05 level of significance to test whether the mean octane
rating of the given kind of gasoline is 98.5.

Solution: H0 : µ = 98.5 vs. H1 : µ ≠ 98.5


Reject the null hypothesis if T ≤ T0.05 . Subtracting 98.5 from each value and ranking
the differences without regard to their sign, we get

Measurement Difference Rank


97.5 -1.0 4
95.2 -3.3 12
97.3 -1.2 6
96.0 -2.5 10
96.8 -1.7 7
100.3 1.8 8
97.4 -1.1 5
95.3 -3.2 11
93.2 -5.3 14
99.1 0.6 2
96.1 -2.4 9
97.6 -0.9 3
98.2 -0.3 1
98.5 0.0
94.9 -3.6 13

Thus T+ = 8 + 2 = 10, T− = 105 − 10 = 95, and T = min(T+, T−) = 10. From Table 6.14, we find that
T0.05 = 21 for n = 14. Since T = 10 is less than 21, the null hypothesis must be rejected:
the mean octane rating of the given kind of gasoline is not 98.5.
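A minimal Python sketch (illustrative only; the function name is ours) of the signed-rank computation, using mid-ranks for tied absolute differences, which reproduces T+ = 10 for the octane data:

    # Wilcoxon signed-rank statistics T+, T- and T for H0: mu = mu0.
    def signed_rank(data, mu0):
        diffs = [x - mu0 for x in data if x != mu0]      # discard zero differences
        order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
        ranks = [0.0] * len(diffs)
        i = 0
        while i < len(order):                            # assign mid-ranks to ties
            j = i
            while j < len(order) and abs(diffs[order[j]]) == abs(diffs[order[i]]):
                j += 1
            mid = (i + 1 + j) / 2                        # average of ranks i+1 .. j
            for k in range(i, j):
                ranks[order[k]] = mid
            i = j
        t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
        t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
        return t_plus, t_minus, min(t_plus, t_minus)

    octane = [97.5, 95.2, 97.3, 96.0, 96.8, 100.3, 97.4, 95.3,
              93.2, 99.1, 96.1, 97.6, 98.2, 98.5, 94.9]
    print(signed_rank(octane, 98.5))  # (10.0, 95.0, 10.0)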
When we deal with paired data, the signed-rank test can also be used in place of the
paired-sample sign test. In this case, we test the null hypothesis µ1 = µ2 using the test
criteria given in the table, except that the alternative hypotheses are now µ1 < µ2 , µ1 > µ2
or µ1 ≠ µ2 instead of µ < µ0 , µ > µ0 or µ ≠ µ0 .
For n > 15 it is considered reasonable to assume that T + is a value of a random variable
whose distribution is approximately normal. To perform the signed-rank test based on
this assumption, we need the following results, which apply regardless of whether the null
hypothesis is µ = µ0 or µ1 = µ2 .

Theorem 2.2.5 Under the assumptions required by the signed-rank test, T+ has mean
µ = n(n + 1)/4 and variance σ² = n(n + 1)(2n + 1)/24. When n is large, the distribution of T+ can be
approximated by a normal distribution.

Example 2.2.6 The following are the weights in pounds, before and after, of 16 persons
who stayed on a certain reducing diet for four weeks.

Before After Before After


147.0 137.9 147.7 149.0
183.5 176.2 166.8 158.5
232.1 219.0 131.9 134.4
161.6 163.8 150.3 149.3
197.5 193.5 197.2 189.1
206.3 201.4 159.8 159.1
177.0 180.6 171.7 173.2
215.4 203.2 208.1 195.4

Use the signed-rank test to test at the 0.05 level of significance whether the weight-reducing
diet is effective.

Solution: H0 : µ1 = µ2 vs. H1 : µ1 > µ2


Reject the null hypothesis if z > z0.05 = 1.645, where z = (T+ − µ)/σ. The differences between
the respective pairs are 9.1, 7.3, 13.1, -2.2, 4.0, 4.9, -3.6, 12.2, -1.3, 8.3, -2.5, 1.0, 8.1, 0.7,
-1.5, and 12.7. If their absolute values are ranked, we find that the positive differences
occupy ranks 13, 10, 16, 8, 9, 14, 12, 2, 11, 1 and 15. Thus,

    T+ = 13 + 10 + 16 + 8 + 9 + 14 + 12 + 2 + 11 + 1 + 15 = 111.

Since µ = n(n + 1)/4 = 16(17)/4 = 68 and σ² = n(n + 1)(2n + 1)/24 = 16(17)(33)/24 = 374, we get

    z = (T+ − µ)/σ = (111 − 68)/√374 = 2.22.

As the calculated value of z exceeds the table value z0.05 = 1.645, the null
hypothesis must be rejected; we conclude that the diet is effective in reducing weight.
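A minimal Python sketch (illustrative only; the function name is ours) of this large-sample approximation:

    from math import sqrt

    # Normal approximation for the signed-rank statistic T+ when n is large.
    def signed_rank_z(t_plus, n):
        mu = n * (n + 1) / 4
        var = n * (n + 1) * (2 * n + 1) / 24
        return (t_plus - mu) / sqrt(var)

    print(signed_rank_z(111, 16))  # about 2.22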

2.3 The Wilcoxon Rank-Sum Test for Comparing Two Treatments
The problem of comparing two populations based on independent random samples has
already been discussed under the assumptions of normality and equal variances for the
population distributions. There the parametric procedure was based on Student’s t statistic.

Here we describe a useful nonparametric procedure originally proposed by F. Wilcoxon
(1945). An equivalent alternative version was independently proposed by H. Mann and D.
Whitney (1947). For a comparative study of two treatments A and B, a set of n = n1 + n2
experimental units are randomly divided into two groups of sizes n1 and n2 , respectively.
Treatment A is applied to the n1 units, and treatment B is applied to the n2 units. Let the
response measurements be X11 , X12 , · · · , X1n1 for Treatment A and X21 , X22 , · · · , X2n2 for
Treatment B.
These two treatments constitute independent random samples from two populations. As-
suming that larger responses indicate a better treatment, we wish to test the null hypothesis
that there is no difference between the two treatment effects vs. the one-sided alternative
that Treatment A is more effective than Treatment B. The distributions are assumed to
be continuous. Note that no assumption is made regarding the shape of the population
distribution. The basic concept underlying the rank-sum test can now be explained by the
following intuitive line of reasoning. Suppose that the two sets of observations are plotted on
the same diagram, using different markings to identify their sources. Under H0 the samples
come from the same population so that the two sets of points should be well mixed. How-
ever, if the larger observations are more often associated with the first sample, for example,
we can infer that Population A is possibly shifted to the right of Population B. These two
situations are diagrammed below, where the combined set of points in each case is serially
numbered from left to right. These numbers are called the combined sample ranks. In the
following figure, large as well as small ranks are associated with each sample.
A B A B B A B A A
Rank 1 2 3 4 5 6 7 8 9
Ranks are well mixed (H0 is probably true)
In the figure below, most of the larger ranks are associated with the first sample.

B B B A B A A A A
Rank 1 2 3 4 5 6 7 8 9
Sample A contains more of the larger ranks
(H1 is probably true)

Therefore, considering the sum of the ranks associated with the first sample as a test
statistic, a large value of this statistic should reflect that the first population is located to
the right of the second. To establish a rejection region with a specified level of significance,
we must consider the distribution of the rank-sum statistic under the null hypothesis.
The formal model assumes that both distributions are continuous. We test the hypothe-
ses H0 : The two population distributions are identical against the alternative H1 : The
distribution of Population A is shifted to the right of the distribution of Population B.
As originally proposed by Wilcoxon, the test is thus based on W1 , the sum of the ranks
of the values of the first sample, or on W2 , the sum of the ranks of the values of the second
sample. It does not matter whether we choose W1 or W2 , for if there are n1 values in the
first sample and n2 values in the second sample, W1 + W2 is always the sum of the first
n1 + n2 positive integers, which is (n1 +n2 )(n2 1 +n2 +1)
In actual practice, we do not base tests directly on W1 or W2 . Instead, we use the related
statistics U1 = W1 − n1 (n21 +1) or U2 = W2 − n2 (n22 +1) or U = min(U1 , U2 ). The tests based on
U1 , U2 , or U are all equivalent to those based on W1 or W2 , but they have the advantage
that they lend themselves more readily to the construction of tables of critical values. The
correct statistic to use, together with the right critical value, is summarized in the following
table, where in each case the level of significance is α.

Alternative hypothesis        Reject the null hypothesis if
µ1 < µ2                       U1 ≤ U2α
µ1 > µ2                       U2 ≤ U2α
µ1 ≠ µ2                       U ≤ Uα

The critical values of the U statistic for α = 0.1, 0.05, 0.02 and 0.01 are respectively
given in Table 6.6, Table 6.7, Table 6.8 and Table 6.9. For these tables, rows represent n1
and columns represent n2 .

Example 2.3.1 Suppose that we want to compare two kinds of emergency flares, Brand A
and Brand B on the basis of the following burning times (rounded to the nearest tenth of
a minute):

A : 14.9 11.3 13.2 16.6 17.0 14.1 15.4 13.0 16.9


B : 15.2 19.8 14.7 18.3 16.1 21.2 18.9 12.2 15.3 19.4

Test at the 0.05 level of significance whether the two samples come from identical continuous
populations or whether the average burning time of Brand A flares is less than that of Brand
B flares.

Solution: H0 : µ1 = µ2 and H1 : µ1 < µ2 with α = 0.05.


Arranging these values jointly (as if they were one sample) in an increasing order of
magnitude and assigning them in this order the ranks 1, 2, 3 · · · , 19, we find that the values
of the first sample (Brand A) occupy ranks 1, 3, 4, 5, 7, 10, 12, 13 and 14, while those of the
second sample (Brand B) occupy ranks 2, 6, 8, 9, 11, 15, 16, 17, 18 and 19. Had there been
ties, we would have assigned to each of the tied observations the mean of the ranks that
they jointly occupy. Since n1 = 9 and n2 = 10, reject the null hypothesis if U1 ≤ U0.1 = 24,
obtained from Table 6.6.

W1 = 1 + 3 + 4 + 5 + 7 + 10 + 12 + 13 + 14 = 69
U1 = 69 − 9(10)/2 = 24
Thus the null hypothesis must be rejected: we conclude that on the average Brand A flares
have a shorter burning time than Brand B flares.
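A minimal Python sketch (illustrative only; the function name is ours) of the rank-sum computation, using mid-ranks for ties, which reproduces W1 = 69 and U1 = 24 for the flare data:

    # Wilcoxon rank-sum / Mann-Whitney statistics for two independent samples.
    def rank_sum(sample1, sample2):
        combined = sorted(sample1 + sample2)

        def rank(v):
            # mid-rank: average position of all values equal to v in the sorted list
            positions = [i + 1 for i, x in enumerate(combined) if x == v]
            return sum(positions) / len(positions)

        w1 = sum(rank(v) for v in sample1)
        n1, n2 = len(sample1), len(sample2)
        u1 = w1 - n1 * (n1 + 1) / 2
        u2 = n1 * n2 - u1                   # since U1 + U2 = n1 * n2
        return w1, u1, u2, min(u1, u2)

    brand_a = [14.9, 11.3, 13.2, 16.6, 17.0, 14.1, 15.4, 13.0, 16.9]
    brand_b = [15.2, 19.8, 14.7, 18.3, 16.1, 21.2, 18.9, 12.2, 15.3, 19.4]
    print(rank_sum(brand_a, brand_b))       # W1 = 69.0, U1 = 24.0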

Example 2.3.2 To determine if a new hybrid (A) seedling produces a bushier flowering
plant than a currently popular variety (B), a horticulturist plants 2 new hybrid seedlings
and 3 currently popular seedlings in a garden plot. After the plants mature, the following
measurements of shrub girth in inches are recorded:

A : 31.8 39.1
B : 35.5 27.6 21.3
Do these data strongly indicate that the new hybrid produces larger shrubs than the current
variety?

Solution: We wish to test the null hypothesis H0 : Populations A and B are identical. v s.
the alternative hypothesis H1 : Population A is shifted from B toward larger values. In the
rank-sum test, the two samples are placed together and ranked from smallest to largest.

Combined ordered sample 21.3 27.6 31.8 35.5 39.1


Rank 1 2 3 4 5
Variety B B A B A

Rank sums for A and B are respectively given by W1 = 3 + 5 = 8 and W2 = 1 + 2 + 4 = 7.
U1 = W1 − n1(n1 + 1)/2 = 8 − 2(3)/2 = 5 and U2 = W2 − n2(n2 + 1)/2 = 7 − 3(4)/2 = 1. From Table 6.6,
U2α = U0.1 = 0, so we cannot reject H0 . Note that for sample sizes this small, no matter
what was observed we cannot reject the null hypothesis at 0.05 level of significance.
When n1 and n2 are both greater than 8, it is considered reasonable to assume that
U1 and U2 are values of random variables having approximately normal distributions. To
perform the U test based on this assumption. we need the following result.

Theorem 2.3.3 Under the assumptions required by the U test, U1 and U2 are values of
random variables having mean µ = n1n2/2 and variance σ² = n1n2(n1 + n2 + 1)/12.

Example 2.3.4 The following are the weight gains (in pounds) of two random samples of
young turkeys fed two different diets but otherwise kept under identical conditions:

Diet 1: 16.3 10.1 10.7 13.5 14.9 11.8 14.3 10.2 12.0 14.7 23.6 15.1 14.5 18.4 13.2 14.0
Diet 2: 21.3 23.8 15.4 19.6 12.0 13.9 18.8 19.2 15.3 20.1 14.8 18.9 20.7 21.1 15.8 16.2
Use the U test at the 0.01 level of significance to test the null hypothesis that the two
populations sampled are identical against the alternative hypothesis that on the average the
second diet produces a greater gain in weight.

Solution: H0 : µ1 = µ2 and H1 : µ1 < µ2 with α = 0.01. Reject the null hypothesis if
z ≤ −2.33, where z = (U1 − µ)/σ.
Ranking the data jointly according to size, we find that the values of the first sample
occupy ranks 21, 1, 3, 8, 15, 4, 11, 2, 5.5, 13, 31, 16, 12, 22, 7 and 10. (The fifth and sixth
values are both 12.0, so we assigned each the rank 5.5.) Thus,

W1 = 1 + 2 + 3 + 4 + 5.5 + 7 + 8 + 10 + 11 + 12 + 13 + 15 + 16 + 21 + 22 + 31 = 181.5

U1 = W1 − n1(n1 + 1)/2 = 181.5 − 16(17)/2 = 45.5

Since µ = n1n2/2 = 16(16)/2 = 128 and σ² = n1n2(n1 + n2 + 1)/12 = 16(16)(33)/12 = 704, we get

z = (U1 − µ)/σ = (45.5 − 128)/√704 = −3.11 < −2.33,

so the null hypothesis must be rejected. We conclude
that on the average the second diet produces a greater gain in weight.

Example 2.3.5 Two geological formations are compared with respect to richness of min-
eral content. The mineral contents of 7 specimens of ore collected from Formation 1 and
5 specimens collected from Formation 2 are measured by chemical analysis. The following
data are obtained:

Formation 1 7.6 11.1 6.8 9.8 4.9 6.1 15.1


Formation 2 4.7 6.4 4.1 3.7 3.9

Do the data provide strong evidence that Formation 1 has a higher mineral content than
Formation 2? Test with α = .05.

Solution H0 : µ1 = µ2 and H1 : µ1 > µ2


To use the rank-sum test, first we rank the combined sample and determine the sum of
ranks for the second sample, which has the smaller size. The observations from the second
sample are 3.7, 3.9, 4.1, 4.7 and 6.4, with ranks 1, 2, 3, 4 and 7 in the combined ordering:

Values 3.7 3.9 4.1 4.7 4.9 6.1 6.4 6.8 7.6 9.8 11.1 15.1
Ranks 1 2 3 4 5 6 7 8 9 10 11 12

The observed value of the rank-sum statistic is W2 = 1 + 2 + 3 + 4 + 7 = 17. Since n1 = 7


and n2 = 5, reject the null hypothesis if U2 ≤ 6.

U2 = W2 − n2(n2 + 1)/2 = 17 − 5(6)/2 = 2
Since U2 ≤ 6, the null hypothesis must be rejected. We conclude that Formation 1 has a
higher mineral content than Formation 2.

Example 2.3.6 Flame-retardant materials are tested by igniting a paper tab on the hem
of a dress worn by a mannequin. One response is the vertical length of damage to the fabric
measured in inches. The following data for 5 samples, each taken from two fabrics, are
obtained by researchers at the National Bureau of Standards as part of a larger cooperative
study. Do the data provide strong evidence that a difference in flammability exists between
the two fabrics? Test with α = .05.

Fabric A 5.7 7.3 7.6 6.0 6.5


Fabric B 4.9 7.4 5.3 4.6 6.2

Solution: H0 : µ1 = µ2 and H1 : µ1 6= µ2
The combined values and their ranks are given below. The values from Fabric B are in
bold.

Values 4.6 4.9 5.3 5.7 6.0 6.2 6.5 7.3 7.4 7.6
Ranks 1 2 3 4 5 6 7 8 9 10

W1 = 34 and W2 = 21, so U1 = 34 − 5(6)/2 = 19 and U2 = 21 − 5(6)/2 = 6. Hence
U = min(U1, U2) = 6. Because the alternative hypothesis is two-sided, we reject the null
hypothesis if U ≤ U_{0.05} = 2, obtained from Table 6.7. Since U = 6 exceeds 2, we do not
have sufficient evidence to reject H0 .

2.4 The Kruskal-Wallis Test


The Kruskal-Wallis test, also called the H test, is a generalization of the rank-sum test to
the case where we test the null hypothesis that k samples come from identical continuous
populations. In other words, it is a non-parametric alternative to the one-way analysis of
variance.
Let n1, n2, . . . , nk be the sample sizes and let n = n1 + n2 + · · · + nk. As in the U test,
the data are ranked jointly from low to high, as though they constitute one sample. Then,
letting Ri be the sum of the ranks of the values of the ith sample, we base the test on the
statistic

H = [12 / (n(n + 1))] Σ_{i=1}^{k} (Ri² / ni) − 3(n + 1).
The null hypothesis is rejected for large values of H. The test is based on the large-sample
theory that the sampling distribution of the random variable corresponding to H can be
approximated by a chi-square distribution with k − 1 degrees of freedom.

Example 2.4.1 The following are the final examination grades of samples from three
groups of students who were taught German by three different methods (classroom instruc-
tion and language laboratory, only classroom instruction, and only self-study in language
laboratory):

First method 94 88 91 74 87 97
Second method 85 82 79 84 61 72 80
Third method 89 67 72 76 69

Use the H test at the 0.05 level of significance to test the null hypothesis that the three
methods are equally effective.

Solution: H0 : µ1 = µ2 = µ3 and H1 : µ1 , µ2 , µ3 are not all equal


Reject the null hypothesis if H ≥ 5.991, where 5.991 is the value of χ²_{0.05} for 2 degrees of freedom. Ranking
the grades from 1 to 18, we find that R1 = 6 + 13 + 14 + 16 + 17 + 18 = 84, R2 =
1 + 4.5 + 8 + 9 + 10 + 11 + 12 = 55.5 and R3 = 2 + 3 + 4.5 + 7 + 15 = 31.5. Substituting the
values of R1 , R2 , and R3 together with n1 = 6, n2 = 7, n3 = 5, and n = 18 into the formula
for H, we get
H = [12 / (n(n + 1))] Σ_{i=1}^{k} (Ri² / ni) − 3(n + 1) = [12 / (18(19))] (84²/6 + 55.5²/7 + 31.5²/5) − 3(19) = 6.67

Since H = 6.67 exceeds 5.991, the null hypothesis must be rejected. We conclude that the
three methods are not all equally effective.
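A small Python sketch of the H statistic, added here for illustration and using the same joint mid-ranking as before (the helper name kruskal_wallis_h is ours), reproduces the value 6.67 for the German-grades data.

    def kruskal_wallis_h(samples):
        # H statistic of Section 2.4; samples is a list of lists, ties get mid-ranks.
        combined = sorted(x for s in samples for x in s)
        n = len(combined)
        def mid_rank(v):
            positions = [i + 1 for i, x in enumerate(combined) if x == v]
            return sum(positions) / len(positions)
        total = sum(sum(mid_rank(v) for v in s) ** 2 / len(s) for s in samples)
        return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    methods = [[94, 88, 91, 74, 87, 97],
               [85, 82, 79, 84, 61, 72, 80],
               [89, 67, 72, 76, 69]]
    print(round(kruskal_wallis_h(methods), 2))   # 6.67, to be compared with 5.991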

2.5 Test of randomness


In this section we will learn how to test randomness of arrangements. The technique we
shall use is based on runs. A run is a succession of identical occurrences, preceded and followed by different occurrences or by none at all.

Example 2.5.1

• The results of the games of a football team, presented as W W W L L L L L L L W W W W, contain three runs, of lengths 3, 7 and 4.

• The result of a coin tossing experiment was HHT T T HT HT HT T T T T HHHHH.

• The results of replications of an experiment whose outcomes are success and failure
were F F SSSSF F .

Let n1 and n2 be the number of occurrences of the two types, and let n = n1 + n2. If u
is the number of runs, we reject the hypothesis of randomness at the significance level α
when u is at most the lower critical value or at least the upper critical value; these critical
values can be found in Tables 6.10 to 6.13.

Example 2.5.2
The win-loss information of a basketball team for the past 22 games is given below:

W W W W L L L W W W W W W W L L W W L L L L


Test at the 0.05 significance level whether this sequence is random.

Solution
H0 : Arrangement is random vs. H1 : Arrangement is not random

Since n1 = 13 and n2 = 9, from Table 6.10 and Table 6.11 we get that the rejection
region is {u ≤ 6} ∪ {u ≥ 17}. Here u = 6, which falls in the rejection region. Hence we
reject the null hypothesis and conclude that the arrangement is not random.

Theorem 2.5.3 Under the assumption that the null hypothesis of randomness holds, the
run random variable U has mean and variance given by

µ = 2n1n2/(n1 + n2) + 1 and σ² = [2n1n2(2n1n2 − n1 − n2)] / [(n1 + n2)²(n1 + n2 − 1)].

When n1 and n2 are large (≥ 10), the distribution of U can be approximated by a normal
distribution.

Example 2.5.4 The following is an arrangement of men and women lined up to purchase
tickets for a rock concert.

MWMWMMMWMWMM
MWWMMMMWWMWM
MMWMMMWWWMWM
MMWMWMMMMWWM
Test for randomness at the 0.05 significance level.

Solution: H0 : Arrangement is random


H1 : Arrangement is not random
Reject the null hypothesis if |z| ≥ 1.96, where z = (u − µ)/σ. As n1 = 30 and n2 = 18,

µ = 2(30)(18)/(30 + 18) + 1 = 23.5

σ² = 2(30)(18)[2(30)(18) − 30 − 18] / [(30 + 18)²(30 + 18 − 1)] = 10.2926

and, since the arrangement contains u = 27 runs,

z = (27 − 23.5)/√10.2926 = 1.09.
We do not reject the null hypothesis. There is no evidence to indicate that the arrangement
is not random.
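The run counting and the normal approximation of Theorem 2.5.3 can be checked with the following Python sketch (an illustration added to these notes, not a prescribed implementation; runs_test_z is our own name).

    import math

    def runs_test_z(sequence):
        # Number of runs u and the standardized value z from Theorem 2.5.3.
        s = list(sequence)
        u = 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)
        kinds = sorted(set(s))
        n1, n2 = s.count(kinds[0]), s.count(kinds[1])
        mu = 2 * n1 * n2 / (n1 + n2) + 1
        var = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
        return u, (u - mu) / math.sqrt(var)

    queue = "MWMWMMMWMWMM" "MWWMMMMWWMWM" "MMWMMMWWWMWM" "MMWMWMMMMWWM"
    u, z = runs_test_z(queue)
    print(u, round(z, 2))   # 27 runs and z = 1.09: randomness is not rejected at the 0.05 level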

2.6 Measures of Correlation Based on Ranks


Ranks may also be employed to determine the degree of association between two random
variables. These two variables could be mathematical ability and musical aptitude or the
aggressiveness scores of first- and second-born sons on a psychological test. Because the
assumptions underlying the significance test for the ordinary correlation coefficient are
stringent (bivariate normality is required), it is sometimes preferable to use a nonparametric
alternative. The most popular one is the Rank Correlation Coefficient, or Spearman's Rank
Correlation Coefficient, rs.

Let (xi, yi), i = 1, . . . , n, be n observations on a pair of attributes. The rank correlation
coefficient is given by

rs = 1 − [6 / (n(n² − 1))] Σ_{i=1}^{n} di²
where di is the difference between the ranks assigned to xi and yi . When there are no ties,
rs is the correlation coefficient calculated for the ranks. This is because
Σ_{i=1}^{n} ri = Σ_{i=1}^{n} si = n(n + 1)/2

Σ_{i=1}^{n} ri² = Σ_{i=1}^{n} si² = n(n + 1)(2n + 1)/6

Σ_{i=1}^{n} ri si = n(n + 1)(2n + 1)/6 − (Σ_{i=1}^{n} di²)/2
This rank correlation shares the properties of r such as always being between 1 and −1 and
that values near 1 indicate a tendency for the larger values of X to be paired with the larger
values of Y . However, the rank correlation is more meaningful, because its interpretation
does not require the relationship to be linear.

2.6.1 Properties of the rank correlation coefficient


1. −1 ≤ rs ≤ 1

2. rs near +1 indicates a tendency for the larger values of X to be associated with the
larger values of Y . Values near −1 indicate the opposite relationship.

3. The association need not be linear; only an increasing or decreasing (monotone) relationship is required.

Example 2.6.1 An interviewer in charge of hiring large numbers of typists wishes to de-
termine the strength of the relationship between ranks given on the basis of an interview
and scores on an aptitude test. Calculate rs based on the data for 6 applicants given below:

Interview rank 5 2 3 1 6 4
Aptitude score 47 32 29 28 56 38

Solution: Replacing the aptitude scores by their ranks, we have

Interview rank 5 2 3 1 6 4
Aptitude rank  5 3 2 1 6 4

The differences are d = 0, −1, 1, 0, 0, 0, so Σ di² = 2 and

rs = 1 − [6 / (n(n² − 1))] Σ_{i=1}^{n} di² = 1 − 6(2)/(6(35)) = 0.9429

The relationship between interview rank and aptitude score appears to be quite strong.

A plot of a perfectly monotone but nonlinear relationship helps to stress the point that rs is a measure of any monotone relationship, not merely a linear relation: for such data the value of rs is 1, while the value of the ordinary correlation coefficient is less than 1.

Theorem 2.6.2 Under the null hypothesis that there is no correlation, the mean and variance of rs are given by

E(rs) = 0 and V(rs) = 1/(n − 1)
Example 2.6.3 The following are the numbers of hours that 10 students studied for an
examination and the scores that they obtained. Calculate rs and test the hypothesis at the 0.01
level that there is no monotone relationship between the two variables.

No. of hours (x) 8 5 11 13 10 5 18 15 2 8


Score (y) 56 44 79 72 70 54 94 85 33 65

Solution: H0 : There is no correlation vs. H1 : There is a correlation.
We reject the null hypothesis if z = rs√(n − 1) is such that |z| ≥ z_{α/2} = 2.576. We calculate
rs as follows:

Rank of x 4.5 2.5 7 8 6 2.5 10 9 1 4.5


Rank of y 4 2 8 7 6 3 10 9 1 5
d 0.5 0.5 -1 1 0 -0.5 0 0 0 -0.5
d2 .25 0.25 1 1 0 0.25 0 0 0 0.25

Σ di² = 3 and hence rs = 1 − 6(3)/(10(99)) = 0.98. Thus z = rs√(n − 1) = 0.98(3) = 2.94, which
exceeds 2.576. Thus we conclude that there is a relationship between study time and scores.
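A Python sketch, added here for illustration (mid_ranks and spearman_rs are our own helper names), reproduces rs and the z statistic for the study-time data.

    import math

    def mid_ranks(values):
        # Ranks 1..n with tied values sharing the mean of their ranks.
        ordered = sorted(values)
        return [sum(i + 1 for i, x in enumerate(ordered) if x == v) / ordered.count(v)
                for v in values]

    def spearman_rs(x, y):
        rx, ry = mid_ranks(x), mid_ranks(y)
        n = len(x)
        d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
        return 1 - 6 * d2 / (n * (n ** 2 - 1))

    hours = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]
    score = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]
    rs = spearman_rs(hours, score)
    # about 0.98 and 2.95; the notes round rs to 0.98 first and obtain 2.94
    print(round(rs, 2), round(rs * math.sqrt(len(hours) - 1), 2))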

Chapter 3

Sampling

3.1 Simple Random Sampling


3.1.1 Introduction
The objective of a sample survey is to make an inference about the population from infor-
mation contained in a sample. Two factors affect the quantity of information contained in
the sample and hence affect the precision of our inference-making procedure. The first is the
size of the sample selected from the population. The second is the amount of variation in
the data; variation can frequently be controlled by the method of selecting the sample. The
procedure for selecting the sample is called the sample survey design. For a fixed sample size
n we will consider various designs, or sampling procedures, for obtaining the n observations
in the sample. Since observations cost money, a design that provides a precise estimator of
the parameter for a fixed sample size yields a savings in cost to the experimenter. The basic
design, or sampling technique, called simple random sampling is discussed in this section.

Definition 3.1.1 If a sample of size n is drawn from a population of size N in such a way
that every possible sample of size n has the same chance of being selected, the sampling
procedure is called simple random sampling. The sample thus obtained is called a simple
random sample.
It is a consequence of this definition that all individual elements in a population have the
same chance of being selected; however, this statement cannot be taken as a definition of
simple random sampling, because it does not imply that all samples of size n have the same
chance of being selected.

We will use simple random sampling to obtain estimators for population means, totals, and
proportions.

Consider the following problem. A federal auditor is to examine the accounts for a city
hospital. The hospital records obtained from a computer show a particular accounts receiv-
able total, and the auditor must verify this total. If there are 28,000 open accounts in the
hospital, the auditor cannot afford the time to examine every patient record to obtain a
total accounts receivable figure. Hence the auditor must choose some sampling scheme for
obtaining a representative sample of patient records. After examining the patient accounts
in the sample, the auditor can then estimate the accounts receivable total for the entire
hospital. If the computer figure lies within a specified distance of the auditor’s estimate, the
computer figure is accepted as valid. Otherwise, more hospital records must be examined
for possible discrepancies between the computer figure and the sample data.

Suppose that all N = 28, 000 patient records are recorded on computer cards and a sample
size n = 100 is to be drawn. The sample is called a simple random sample if every possible
sample of n = 100 records has the same chance of being selected.

Simple random sampling forms the basis of most of the sampling designs discussed in this
course, and it forms the basis of most scientific surveys done in practice. The Nielsen Tele-
vision Index (NTI) is the most widely used audience measurement service in existence. It is
based on a random sample of approximately twelve hundred households that have a storage
instantaneous audimeter connected to the television set. This meter records whether or not
the television set is on, what channel is being viewed, and channel changes. In an addi-
tional random sample of families each family keeps a diary on who watches various shows.
The NTI reports the number of households in the audience, the type of audience, and the
amount of television viewing for various time periods.

The Gallup poll actually begins with a random sample of approximately 300 election districts,
sampled from the 200,000 election districts in the United States. Households for
interviewing are then selected from each district by another randomization device. The sampling
is in two stages, but simple random sampling plays a key role at each stage.

Auditors study simple random samples of accounts in order to check for compliance with
audit controls set up by the firm or to verify the actual dollar value of the accounts. Thus
they may wish to estimate the proportion of accounts not in compliance with controls or
the total value of, say, accounts receivable.

Marketing research often involves a simple random sample of potential users of a product.
The researcher may want to estimate the proportion of potential buyers who prefer a certain
color of car or flavor of food.

A forester may estimate the volume of timber or proportion of diseased trees in a forest by
selecting geographic points in the area covered by the forest and then attaching a plot of
fixed size and shape (such as a circle of 10-meter radius) to that point. All the trees within
the sample plots may be studied, but, again, the basic design is a simple random sample.

Two problems now face the experimenter: (1) how does he or she draw the simple random
sample, and (2) how can he or she estimate the various population parameters of interest?
These topics are discussed in the following sections.

3.1.2 How to Draw a Simple Random Sample


To draw a simple random sample from the population of interest is not as trivial as it may
first appear. How can we draw a sample from a population in such a way that every possible
sample of size n has the same chance of being selected? We might use our own judgment to
”randomly” select the sample. This technique is frequently called haphazard sampling. A
second technique, representative sampling, involves choosing a sample that we consider to be
typical or representative of the population. Both haphazard and representative sampling are
subject to investigator bias, and, more importantly, they lead to estimators whose proper-
ties cannot be evaluated. Thus neither of these techniques leads to a simple random sample.

Simple random samples can be selected by using random numbers. Tables of random num-
bers are available in many statistics books. One may also use a computer to generate

random numbers.

A random number table is a set of integers generated so that in the long run the table will
contain all ten integers (0, 1, · · · , 9) in approximately equal proportions, with no trends in
the pattern in which the digits were generated. Thus if one number is selected from a ran-
dom point in the table, it is equally likely to be any of the digits 0 through 9.

Choosing numbers from the table is analogous to drawing numbers out of a hat containing
those numbers on thoroughly mixed pieces of paper. Suppose we want a simple random
sample of three persons to be selected from seven. We could number the people from 1
to 7, put slips of paper containing these numbers (one number to a slip) into a hat, mix
them, and draw out three, without replacing drawn numbers. Analogously, we could drop a
pencil point on a random starting point in a random number table. Suppose the point falls
on the 15th line of column 9 and we decide to use the right-most digit (a 5, in this case).
This procedure is like drawing a 5 from the hat. We may now proceed in any direction to
obtain the remaining numbers in the sample. Suppose we decide before starting to proceed
down the page. The number immediately below the 5 is a 2, so our second sampled person
is number 2. Proceeding, we next come to an 8, but there are only seven people in our
population; hence the 8 must be ignored. Two more 5s then appear, but both must be
ignored since person 5 has already been selected. (The 5 has been removed from the hat.)
Finally, we come to a 1, and our sample of three is completed with persons numbered 5, 2,
and 1.

Note that any starting point can be used and one can move in any predetermined direction.
If more than one sample is to be used in any problem, each should have its own unique
starting point. Many computer programs, such as Minitab, can be used to generate random
numbers.
A more realistic illustration is given in Example 3.1.2.

Example 3.1.2 For simplicity, assume there are N = 1000 patient records from which a
simple random sample of n = 20 is to be drawn. We know that a simple random sample
will be obtained if every possible sample of n = 20 records has the same chance of being
selected. The digits in the random number table are generated to satisfy the conditions of
simple random sampling. Determine which records are to be included in a sample of size
n = 20.

Solution: We can think of the records as being numbered 001, 002, · · · , 999, 000. That
is, we have 1000 three-digit numbers, where 001 represents the first record, 999 the 999th
patient record, and 000 the 1000th.

Refer to a random number table and use the first column; if we drop the last two digits of
each number, we see that the first three-digit number formed is 104, the second is 223, the
third is 241, and so on. Taking 20 such three-digit numbers, we obtain the record numbers
shown in Figure 3.1.
If the records are actually numbered, we merely choose the records with the corresponding
numbers, and these records represent a simple random sample of n = 20 from N = 1000.
If the patient accounts are not numbered, we can refer to a list of the accounts and count
from the 1st to the 10th, 23rd, 70th, and so on, until the desired numbers are reached. If
a random number occurs twice, the second occurrence is omitted, and another number is
selected as its replacement.
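On a computer, the same idea can be carried out directly; the following Python sketch (illustrative only, with our own function name) selects n = 20 record numbers from N = 1000 so that every sample of 20 is equally likely.

    import random

    def simple_random_sample(N, n, seed=None):
        # Every subset of n labels from 1..N is equally likely, which is the
        # defining property of simple random sampling.
        rng = random.Random(seed)
        return sorted(rng.sample(range(1, N + 1), n))

    print(simple_random_sample(1000, 20, seed=1))   # e.g. the 20 patient records to examine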

Figure 3.1: Patient records to be included in the sample

3.1.3 Estimation of Population Mean and Total


We stated previously that the objective of survey sampling is to draw inferences about a
population from information contained in a sample. One way to make inferences is to esti-
mate certain population parameters by utilizing the sample information. The objective of a
sample survey is often to estimate a population mean, denoted by µ, or a population total,
denoted by τ. Thus the auditor of Example 3.1.2 might be interested in the mean dollar
value for the accounts receivable or the total dollar amount in these accounts. Hence we
consider estimation of the two population parameters, µ and τ , in this section.

Suppose that a simple random sample of n accounts is drawn, and we are to estimate the
mean value per account for the total population of hospital records. Intuitively, we would
employ the sample average

ȳ = (Σ_{i=1}^{n} yi) / n

to estimate µ.

Of course, a single value of ȳ tells us very little about the population mean µ, unless we
are able to evaluate the goodness of our estimator. Hence in addition to estimating µ, we
would like to place a bound on the error of estimation. It can be shown that ȳ possesses
many desirable properties for estimating µ. In particular, ȳ is an unbiased estimator of µ
and has a variance that decreases as the sample size n increases. More precisely for a simple
random sample chosen without replacement from a population of size N ,

E[ȳ] = µ

and

V(ȳ) = (σ²/n)[(N − n)/(N − 1)]     (3.1)

In addition,

E[s²] = [N/(N − 1)] σ²

where s² is the sample variance. We make use of these results in order to estimate µ and construct
an error bound.

Estimator of the population mean µ:

µ̂ = ȳ = (Σ_{i=1}^{n} yi) / n     (3.2)

Estimated variance of ȳ:

V̂(ȳ) = (s²/n)[(N − n)/N]     (3.3)

where

s² = Σ_{i=1}^{n} (yi − ȳ)²/(n − 1) = (Σ_{i=1}^{n} yi² − nȳ²)/(n − 1)

Bound on the error of estimation:

2√V̂(ȳ) = 2√{(s²/n)[(N − n)/N]}     (3.4)
Figure 3.2 shows the confidence intervals for the mean when 50 samples of size 15
were drawn. Only 2 (4%) of the intervals fail to cover the true mean µ = 9.035.

Example 3.1.3 Refer to the hospital audit of Example 3.1.2 and suppose that a random
sample of n = 200 accounts is selected from the total of N = 1000. The sample mean of the
accounts is found to be ȳ = 94.22, and the sample variance is s2 = 445.21. Estimate µ, the
average due for all 1000 hospital accounts, and place a bound on the error of estimation.

Solution: We use ȳ = 94.22 to estimate µ. A bound on the error of estimation can be


found by using the formula (3.4).
2√V̂(ȳ) = 2√{(s²/n)[(N − n)/N]} = 2√{(445.21/200)[(1000 − 200)/1000]} = 2√1.7808 = 2.67
Thus we estimate the mean value per account, µ, to be ȳ = 94.22. Since n is large, the
sample mean should possess approximately a normal distribution, so that $94.22 ± $2.67 is
approximately a 95% confidence interval for the population mean.
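The calculation in Example 3.1.3 can be packaged as a small Python function, sketched below for illustration; srs_mean_estimate is our own name and the inputs are the sample summaries used above.

    import math

    def srs_mean_estimate(y_bar, s2, n, N):
        # Formulas (3.2)-(3.4): the estimate and the bound 2*sqrt(V_hat(y_bar)).
        v_hat = (s2 / n) * ((N - n) / N)
        return y_bar, 2 * math.sqrt(v_hat)

    print(srs_mean_estimate(94.22, 445.21, 200, 1000))   # 94.22 with a bound of about 2.67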

Example 3.1.4 A simple random sample of n = 9 hospital records is drawn to estimate


the average amount of money due on N = 484 open accounts. The sample values for these
nine records are listed in the table below. Estimate µ , the average amount outstanding,
and place a bound on your error of estimation.

Solution:

Σ_{i=1}^{9} yi = 368.00,  Σ_{i=1}^{9} yi² = 15,332.50

Our estimate of the population mean is

ȳ = (Σ_{i=1}^{9} yi)/9 = 368.00/9 = $40.89

The sample variance is given by

s² = Σ_{i=1}^{9} (yi − ȳ)²/(n − 1) = [Σ_{i=1}^{9} yi² − (Σ_{i=1}^{9} yi)²/9]/8 = (1/8)(15,332.50 − 368²/9) = 35.67

Figure 3.2: Confidence intervals for N = 20 and n = 15
The error bound is given by

2√V̂(ȳ) = 2√{(s²/n)[(N − n)/N]} = 2√{(35.67/9)[(484 − 9)/484]} = $3.94

To summarize, the estimate of the mean amount of money owed per account, µ, is ȳ = $40.89.
Although we cannot be certain how close ȳ is to µ, we are reasonably confident that the
error of estimation is less than $3.94.

Many sample surveys are conducted to obtain information about a population total. The
federal auditor of Example 3.1.3 would probably be interested in verifying the computer
figure for the total accounts receivable (in dollars) for the N = 1000 open accounts.

You recall that the mean for a population of size N is the sum of all observations in the pop-
ulation divided by N . The population total, the sum of all observations in the population,
is denoted by the symbol τ . Hence
Nµ = τ
Intuitively, we expect the estimator of τ to be N times the estimator of µ.

Estimator of the population total τ:

τ̂ = Nȳ = N(Σ_{i=1}^{n} yi)/n     (3.5)

Estimated variance of τ̂:

V̂(τ̂) = V̂(Nȳ) = N²(s²/n)[(N − n)/N]     (3.6)

where

s² = Σ_{i=1}^{n} (yi − ȳ)²/(n − 1)

Bound on the error of estimation:

2√V̂(Nȳ) = 2√{N²(s²/n)[(N − n)/N]}     (3.7)

Note that the estimated variance of τ̂ = Nȳ in Equation (3.6) is N² times the estimated
variance of ȳ.

Example 3.1.5 An industrial firm is concerned about the time per week spent by scientists
on certain trivial tasks. The time log sheets of a simple random sample of n = 50 employees
show the average amount of time spent on these tasks is 10.31 hours with a sample variance

s2 = 2.25. The company employs N = 750 scientists. Estimate the total number of man-
hours lost per week on trivial tasks, and place a bound on the error of estimation.

Solution:
The estimate of τ is given by

τ̂ = Nȳ = 750(10.31) = 7732.5 hours

The bound on the error of estimation is given by

2√V̂(Nȳ) = 2√{N²(s²/n)[(N − n)/N]} = 2√{(750)²(2.25/50)[(750 − 50)/750]} = 307.4 hours

Thus the estimate of total time lost is τ̂ = 7732.5 hours and we are reasonably confident
that the error of estimation is less than 307.4 hours.
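Formulas (3.5)-(3.7) translate into a few lines of Python; the sketch below (illustrative, with our own function name) reproduces the figures of Example 3.1.5.

    import math

    def srs_total_estimate(y_bar, s2, n, N):
        # Formulas (3.5)-(3.7): tau_hat = N*y_bar and its error bound.
        v_hat = N ** 2 * (s2 / n) * ((N - n) / N)
        return N * y_bar, 2 * math.sqrt(v_hat)

    tau_hat, bound = srs_total_estimate(10.31, 2.25, 50, 750)
    print(tau_hat, round(bound, 1))   # 7732.5 hours with a bound of about 307.4 hours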

3.1.4 Selecting the Sample Size for Estimating Population Means


and Totals
At some point in the design of the survey, someone must make a decision about the size of
the sample to be selected from the population. So far we have discussed a sampling proce-
dure (simple random sampling) but have said nothing about the number of observations to
be included in the sample. The implications of such a decision are obvious. Observations
cost money. Hence if the sample is too large, time and talent are wasted. Conversely, if
the number of observations included in the sample is too small, we have bought inadequate
information for the time and effort expended and have again been wasteful.

The number of observations needed to estimate a population mean µ with a bound on


the error of estimation of magnitude B is found by setting two standard deviations of the
estimator, ȳ, equal to B and solving this expression for n. That is, we must solve
B = 2√V(ȳ) = 2√{(σ²/n)[(N − n)/(N − 1)]}     (3.8)

to get

n = Nσ²/[(N − 1)D + σ²]     (3.9)

where D = B²/4.

Solving for n in a practical situation presents a problem because the population variance σ 2
is unknown. Since a sample variance s2 is frequently available from prior experimentation,
we can obtain an approximate sample size by replacing σ 2 with s2 in Equation 3.9. We will
illustrate a method for guessing a value of σ 2 when very little prior information is available.
If N is large, as it usually is, the (N − 1) can be replaced by N in the denominator of
equation 3.9.

Example 3.1.6 The average amount of money for a hospital's accounts receivable must be
estimated. Although no prior data are available to estimate the population variance σ², it is
known that most accounts lie within a $100 range. There are N = 1000 open accounts. Find
the sample size needed to estimate µ with a bound on the error of estimation B = $3.

Solution: We need an estimate of σ 2 ,the population variance. Since the range is often
approximately equal to four standard deviations (4σ), one-fourth of the range will provide
an approximate value of σ.
σ = range/4 = 100/4 = 25

Hence σ is taken to be approximately 25 and

σ² = 625

D = B²/4 = 3²/4 = 2.25

n = Nσ²/[(N − 1)D + σ²] = 1000(625)/[999(2.25) + 625] = 217.56
That is, we need approximately 218 observations to estimate µ,the mean accounts receivable,
with a bound on the error of estimation of $3.00.
In like manner, we can determine the number of observations needed to estimate a population
total τ , with a bound on the error of estimation of magnitude B. The required sample size
is found by setting two standard deviations of the estimator equal to B and solving this
expression for n. Proceeding as we did earlier, we get

n = Nσ²/[(N − 1)D + σ²]     (3.10)

where D = B²/(4N²).

Example 3.1.7 An investigator is interested in estimating the total weight gain in 0 to 4


weeks for N = 1000 chicks fed on a new ration. Obviously, to weigh each bird would be
time-consuming and tedious. Therefore, determine the number of chicks to be sampled in
this study in order to estimate τ with a bound on the error of estimation equal to 1000
grams. Many similar studies on chick nutrition have been run in the past. Using data from
these studies, the investigator found that σ 2 , the population variance, was approximately
equal to 36.00 (grams)2 . Determine the required sample size.

Solution: We can obtain an approximate sample size using Equation (3.10) with σ² = 36:

D = B²/(4N²) = (1000)²/[4(1000)²] = 0.25

n = Nσ²/[(N − 1)D + σ²] = 1000(36.00)/[999(0.25) + 36.00] = 125.98
The investigator, therefore, needs to weigh n = 126 chicks to estimate τ , the total weight
gain for N = 1000 chickens in 0 to 4 weeks, with a bound on the error of estimation equal
to 1000 grams.
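Both sample-size formulas are easy to evaluate by computer; the following Python sketch (added for illustration, with our own function names) reproduces Examples 3.1.6 and 3.1.7.

    def sample_size_for_mean(N, sigma2, B):
        # Equation (3.9) with D = B^2/4.
        D = B ** 2 / 4
        return N * sigma2 / ((N - 1) * D + sigma2)

    def sample_size_for_total(N, sigma2, B):
        # Equation (3.10) with D = B^2/(4N^2).
        D = B ** 2 / (4 * N ** 2)
        return N * sigma2 / ((N - 1) * D + sigma2)

    print(round(sample_size_for_mean(1000, 625, 3), 2))        # 217.56, so take n = 218
    print(round(sample_size_for_total(1000, 36.0, 1000), 2))   # 125.98, so take n = 126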

3.1.5 Estimation of Population Proportion


The investigator conducting a sample survey is frequently interested in estimating the pro-
portion of the population that possesses a specified characteristic. For example, a congres-
sional leader investigating the merits of an 18-year-old voting age may want to estimate
the proportion of the potential voters in the district between the ages of 18 and 21. A
marketing research group may be interested in the proportion of the total sales market in

diet preparations that is attributable to a particular product. That is, what percentage of
sales is accounted for by a particular product? A forest manager may be interested in the
proportion of trees with a diameter of 12 inches or more. Television ratings are often deter-
mined by estimating the proportion of the viewing public that watches a particular program.

You will recognize that all these examples exhibit a characteristic of the binomial exper-
iment, that is, an observation either does belong or does not belong to the category of
interest. For example, one can estimate the proportion of eligible voters in a particular
district by examining population census data for several of the precincts within the district.
An estimate of the proportion of voters between 18 and 21 years of age for the entire district
will be the fraction of potential voters from the precincts sampled that fell into this age range.

In subsequent discussion we denote the population proportion and its estimator by the
symbols p and p̂, respectively. The properties of p̂ for simple random sampling parallel
those of the sample mean ȳ if the response measurements are defined as follows: Let yi = 0
if the ith element sampled does not possess the specified characteristic and yi = 1 if it does.
Then the total number of elements in a sample of size n possessing a specified characteristic
is Σ_{i=1}^{n} yi.

If we draw a simple random sample of size n, the sample proportion p̂ is the fraction of the
elements in the sample that possess the characteristic of interest. For example, the estimate
p̂ of the proportion of eligible voters between the ages of 18 and 21 in a certain district is
p̂ = (number of voters sampled between the ages of 18 and 21)/(number of voters sampled)

or

p̂ = (Σ_{i=1}^{n} yi)/n = ȳ

In other words, p̂ is the average of the 0 and 1 values from the sample. Similarly, we can
think of the population proportion as the average of the 0 and 1 values for the entire pop-
ulation (that is, p = µ).

Estimator of the population proportion p:

p̂ = ȳ = (Σ_{i=1}^{n} yi)/n     (3.11)

Estimated variance of p̂:

V̂(p̂) = [p̂q̂/(n − 1)][(N − n)/N]     (3.12)

where q̂ = 1 − p̂.

Bound on the error of estimation:

2√V̂(p̂) = 2√{[p̂q̂/(n − 1)][(N − n)/N]}     (3.13)

Example 3.1.8 A simple random sample of n = 100 college seniors was selected to estimate
(1) the fraction of N = 300 seniors going on to graduate school and (2) the fraction of
students that have held part-time jobs during college. Let yi and xi (i = 1, 2, · · · , 100) denote
the responses of the ith student sampled. We will set yi = 0 if the ith student does not
plan to attend graduate school and yi = 1 if he does. Similarly, let xi = 0 if he has
not held a part-time job sometime during college and xi = 1 if he has. Using the sample
data presented in the accompanying table, estimate p1 , the proportion of seniors planning
to attend graduate school, and p2 ,the proportion of seniors who have had a part-time job
sometime during their college careers (summers included).

Solution: The sample proportions are given by

p̂1 = (Σ_{i=1}^{n} yi)/n = 15/100 = 0.15

and

p̂2 = (Σ_{i=1}^{n} xi)/n = 65/100 = 0.65

Bounds on the error of estimation for p1 and p2 respectively are

2√V̂(p̂1) = 2√{[p̂1q̂1/(n − 1)][(N − n)/N]} = 2√{[(0.15)(0.85)/99][(300 − 100)/300]} = 0.059

and

2√V̂(p̂2) = 2√{[p̂2q̂2/(n − 1)][(N − n)/N]} = 2√{[(0.65)(0.35)/99][(300 − 100)/300]} = 0.078

Thus we estimate that 15% of the seniors plan to attend graduate school, with a bound on
the error of estimation equal to 0.059. We estimate that 65% of the seniors have held a
part-time job during college, with a bound on the error of estimation equal to 0.078.
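Formulas (3.11)-(3.13) can be checked with the short Python sketch below (illustrative; srs_proportion_estimate is our own name).

    import math

    def srs_proportion_estimate(count, n, N):
        # Formulas (3.11)-(3.13): p_hat and the bound 2*sqrt(V_hat(p_hat)).
        p_hat = count / n
        v_hat = p_hat * (1 - p_hat) / (n - 1) * ((N - n) / N)
        return p_hat, 2 * math.sqrt(v_hat)

    print(srs_proportion_estimate(15, 100, 300))   # 0.15 with a bound of about 0.059
    print(srs_proportion_estimate(65, 100, 300))   # 0.65 with a bound of about 0.078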
We have shown that the population proportion p can be regarded as the average (µ) of the
0 and 1 values for the entire population. Hence the problem of determining the sample size
required to estimate p to within B units should be analogous to determining a sample size
for estimating µ with a bound on the error of estimation B. You will recall that the required
sample size for estimating µ is given by

n = Nσ²/[(N − 1)D + σ²]     (3.14)

where D = B²/4. The corresponding sample size needed to estimate p can be found by replacing σ² with pq.

Sample size required to estimate p with a bound on the error of estimation B:

n = Npq/[(N − 1)D + pq]     (3.15)

where q = 1 − p and D = B²/4.

In a practical situation we do not know p. An approximate sample size can be found by


replacing p with an estimated value. Frequently, such an estimate can be obtained from
similar past surveys. However, if no such prior information is available, we can take p = 0.5
to obtain a conservative sample size (one that is likely to be larger than required). This
yields

n = N/[4(N − 1)D + 1]

Example 3.1.9 Student government leaders at a college want to conduct a survey to de-
termine the proportion of students that favors a proposed honor code. Since interviewing
N = 2000 students in a reasonable length of time is almost impossible, determine the sam-
ple size (number of students to be interviewed) needed to estimate p with a bound on the
error of estimation of magnitude B = 0.05. Assume that no prior information is available
to estimate p.

Solution: We can approximate the required sample size when no prior information is
available by setting p = 0.5. We have

D = B²/4 = (0.05)²/4 = 0.000625

n = N/[4(N − 1)D + 1] = 2000/[4(1999)(0.000625) + 1] = 333.47

That is, 334 students must be interviewed to estimate the proportion of students that favors
the proposed honor code with a bound on the error of estimation of B = 0.05.

Example 3.1.10 Referring to Example 3.1.9, suppose that in addition to estimating the
proportion of students that favors the proposed honor code, student government leaders also
want to estimate the number of students who feel the student union building adequately
serves their needs. Determine the combined sample size required for a survey to estimate
p1 , the proportion that favors the proposed honor code, and p2 , the proportion that believes
the student union adequately serves its needs, with bounds on the errors of estimation of
magnitude B1 = 0.05 and B2 = 0.07. Although no prior information is available to estimate
p1 , approximately 60 % of the students believed the union adequately met their needs in a
similar survey run the previous year.

Solution: In this example we must determine a sample size n that will allow us to estimate
p1 , with a bound B1 = 0.05 and p2 with a bound B2 = 0.07. First, we determine the
sample sizes that satisfy each objective separately. The larger of the two will then be the
combined sample size for a survey to meet both objectives. From Example 3.1.9 the sample
size required to estimate p1 , with a bound on the error of estimation of B1 = 0.05 was
n = 334 students. We can use data from the survey of the previous year to determine the
sample size needed to estimate p2 . We have

D = B2²/4 = (0.07)²/4 = 0.001225

n = Npq/[(N − 1)D + pq] = (2000)(0.6)(0.4)/[(1999)(0.001225) + (0.6)(0.4)] = 480/2.68877 = 178.52

That is, 179 students must be interviewed to estimate p2 . The sample size required to
achieve both objectives in one survey is 334, the larger of the two sample sizes.
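The proportion sample-size calculations of Examples 3.1.9 and 3.1.10 can be reproduced with the following Python sketch (illustrative only; the function name is ours).

    def sample_size_for_proportion(N, B, p=0.5):
        # Equation (3.15); p = 0.5 gives the conservative size when nothing is known about p.
        D = B ** 2 / 4
        return N * p * (1 - p) / ((N - 1) * D + p * (1 - p))

    print(round(sample_size_for_proportion(2000, 0.05), 2))          # 333.47 -> interview 334 students
    print(round(sample_size_for_proportion(2000, 0.07, p=0.6), 2))   # 178.52 -> 179 students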

3.1.6 Sampling with Probabilities Proportional to Size


Previous work in this chapter has depended on the sample being a simple random sample,
according to Definition 1.1.1. We will now show that varying the probabilities with which
different sampling units are selected is sometimes advantageous. Suppose, for example, we
wish to estimate the number of job openings in a city by sampling industrial firms within
the city. Typically, many such firms will be quite small and employ few workers, while some
firms will be very large. In a simple random sample, size of firm is not taken into account,
and a typical sample will contain mostly small firms. But the information desired (number
of job openings) is heavily influenced by the large firms. Thus we should be able to improve
on the simple random sample by giving the large firms a greater chance to appear in the
sample. A method for accomplishing this sampling is called sampling with probabilities pro-
portional to size, or pps sampling.

For a sample y1 , y2 , · · · , yn from a population of size N , let

πi = probability that yi appears in the sample

Unbiased estimators of τ and µ, along with their estimated variances and bounds on the
error of estimation, are as follows:

Estimator of the population total τ:

τ̂_pps = (1/n) Σ_{i=1}^{n} (yi/πi)     (3.16)

Estimated variance of τ̂_pps:

V̂(τ̂_pps) = [1/(n(n − 1))] Σ_{i=1}^{n} (yi/πi − τ̂_pps)²     (3.17)

Bound on the error of estimation:

2√V̂(τ̂_pps) = 2√{[1/(n(n − 1))] Σ_{i=1}^{n} (yi/πi − τ̂_pps)²}     (3.18)

Estimator of the population mean µ:

µ̂_pps = τ̂_pps/N = [1/(Nn)] Σ_{i=1}^{n} (yi/πi)     (3.19)

Estimated variance of µ̂_pps:

V̂(µ̂_pps) = [1/(N²n(n − 1))] Σ_{i=1}^{n} (yi/πi − τ̂_pps)²     (3.20)

Bound on the error of estimation:

2√V̂(µ̂_pps) = 2√{[1/(N²n(n − 1))] Σ_{i=1}^{n} (yi/πi − τ̂_pps)²}     (3.21)

These estimators are unbiased for any choices of πi, but it is clearly in the best interest of
the experimenter to choose these probabilities so that the variances of the estimators are
as small as possible. The best practical way to choose the πi's is to choose them proportional
to a known measurement that is highly correlated with yi . In the problem of estimating
total number of job openings, firms can be sampled with probabilities proportional to their
total work force, which should be known fairly accurately before the sample is selected.
The number of job openings per firm is not known before sampling, but it should be highly
correlated with the total number of workers in the firm.

Example 3.1.11 An investigator wishes to estimate the average number of defects per
board on boards of electronic components manufactured for installation in computers. The
boards contain varying numbers of components, and the investigator feels that the number
of defects should be positively correlated with the number of components on a board. Thus
pps sampling is used, with the probability of selecting any one board for the sample being
proportional to the number of components on that board. A sample of n = 4 boards is to

be selected from the N = 10 boards of one day’s production. The number of components
on the 10 boards are, respectively,

10, 12, 22, 8, 16, 24, 9, 10, 8, 31

Show how to select n = 4 boards with probabilities proportional to size.

Solution: We list the number of components (our measure of size) in a column and list the
cumulative ranges and desirable πi ’s in adjacent columns, as follows:

Board   No. of Comp.   Cum. Range   πi
1       10             1-10         10/150
2       12             11-22        12/150
3       22             23-44        22/150
4       8              45-52        8/150
5       16             53-68        16/150
6       24             69-92        24/150
7       9              93-101       9/150
8       10             102-111      10/150
9       8              112-119      8/150
10      31             120-150      31/150

There are 150 components in the population to be sampled. We can think of these compo-
nents as being numbered from 1 to 150. The cumulative range column keeps track of the
interval of numbered components on each board. Board number 1 has the first 10 compo-
nents, board number 2 has components 11 through 22, and so on.

The π’s are simply the number of components per board divided by the total number of
components. The boards having greater numbers of components have larger probabilities
of selection.

To choose the sample of n = 4 boards, we enter the random number table and select four
random numbers between 1 and 150. The numbers we selected were 14, 56, 94, and 25. We
locate these numbers in the cumulative range column. The boards corresponding to those
range intervals constitute the sample.

Since 14 lies in the range of board 2, that board enters the sample. Similarly, 56 lies in
the range of board 5, 94 lies in the range of board 7, and 25 lies in the range of board 3.
Thus the sample consists of boards 2, 3, 5, and 7. These boards have been selected with
probabilities proportional to their numbers of components. Note that with this method we
could have sampled a particular board more than once.
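The cumulative-range method lends itself to a short program; the Python sketch below (illustrative; pps_sample is our own name) draws a pps sample in the same way, with possible repetition of units.

    import random

    def pps_sample(sizes, n, seed=None):
        # Cumulative-range selection: unit i is chosen with probability sizes[i]/sum(sizes)
        # on each draw, so a unit may appear more than once.
        rng = random.Random(seed)
        cumulative, running = [], 0
        for s in sizes:
            running += s
            cumulative.append(running)
        sample = []
        for _ in range(n):
            r = rng.randint(1, running)
            sample.append(next(i + 1 for i, c in enumerate(cumulative) if r <= c))
        return sample

    components = [10, 12, 22, 8, 16, 24, 9, 10, 8, 31]
    print(pps_sample(components, 4, seed=3))   # four board numbers drawn with probability proportional to size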

Example 3.1.12 After the sampling of Example 3.1.11 was completed, the number of
defects found on boards 2, 3, 5, and 7 were, respectively, 1, 3, 2, and 1. Estimate the
average number of defects per board, and place a bound on the error of estimation.

Solution:

µ̂_pps = [1/(Nn)] Σ_{i=1}^{n} (yi/πi) = (1/40)[150/12 + 3(150)/22 + 2(150)/16 + 150/9] = 1.71

V̂(µ̂_pps) = [1/(N²n(n − 1))] Σ_{i=1}^{n} (yi/πi − τ̂_pps)²
         = [1/(10²(4)(3))] [(150/12 − 17.10)² + (3(150)/22 − 17.10)² + (2(150)/16 − 17.10)² + (150/9 − 17.10)²]
         = 0.0295

The estimate of the average number of defects per board, with a bound on the error of
estimation, is then
1.71 ± 0.34
and the interval (1.37, 2.05) provides an approximate 95% confidence interval for the average
number of defects per board.
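Formulas (3.16)-(3.21) are reproduced in the following Python sketch (illustrative; pps_estimates is our own name), which recovers the figures of Example 3.1.12 up to rounding.

    import math

    def pps_estimates(y, pi, N):
        # Formulas (3.16)-(3.21): mu_hat, tau_hat and the bound on the error for mu.
        n = len(y)
        ratios = [yi / p for yi, p in zip(y, pi)]
        tau_hat = sum(ratios) / n
        v_mu = sum((r - tau_hat) ** 2 for r in ratios) / (N ** 2 * n * (n - 1))
        return tau_hat / N, tau_hat, 2 * math.sqrt(v_mu)

    defects = [1, 3, 2, 1]                              # boards 2, 3, 5 and 7
    pis = [12 / 150, 22 / 150, 16 / 150, 9 / 150]
    print(pps_estimates(defects, pis, 10))              # mu_hat about 1.71 with a bound of about 0.34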

3.2 Stratified Random Sampling


3.2.1 Introduction
The purpose of sample survey design is to maximize the amount of information for a given
cost. Simple random sampling, the basic sampling design, often provides good estimates of
population quantities at low cost. In this chapter, we define a second sampling procedure,
stratified random sampling, which in many instances increases the quantity of information
for a given cost.

Definition 3.2.1 A stratified random sample is one obtained by separating population


elements into non-overlapping groups, called strata, and then selecting a simple random
sample from each stratum.
The principal reasons for using stratified random sampling rather than simple random sam-
pling are as follows:

1. Stratification may produce a smaller bound on the error of estimation than would be
produced by a simple random sample of the same size.

2. The cost per observation in the survey may be reduced by stratification of the popu-
lation elements into convenient groupings.

3. Estimates of population parameters may be desired for subgroups of the population.


These subgroups should then be identifiable strata.

These three reasons for stratification should be kept in mind when one is deciding whether or
not to stratify a population or deciding how to define strata. Sampling hospital patients on
a certain diet to assess weight gain may be more efficient if the patients are stratified by sex,
since men tend to weigh more than women. A poll of college students at a large university
may be more conveniently administered and carried out if students are stratified into on-
campus and off-campus residents. A quality control sampling plan in a manufacturing plant
may be stratified by production lines because estimates of proportions of defective products
may be required by the manager of each line.

3.2.2 How To Draw a Stratified Random Sample
The first step in the selection of a stratified random sample is to clearly specify the strata;
then each sampling unit of the population is placed into its appropriate stratum. This step
may be more difficult than it sounds.

For example, suppose that you plan to stratify the sampling units, say households, into rural
and urban units. What should be done with households in a town of 1000 inhabitants? Are
these households rural or urban? They may be rural if the town is isolated in the country,
or they may be urban if the town is adjacent to a large city. Hence to specify what is meant
by urban and rural is essential so that each sampling unit clearly falls into only one stratum.

After the sampling units are divided into strata, we select a simple random sample from
each stratum by using the techniques given in the chapter ’Simple Random Sampling’. We
discuss the problem of choosing appropriate sample sizes for the strata later in this chapter.
We must be certain that the samples selected from the strata are independent. That is,
a separate randomization should be carried out within each stratum, so that the observations
chosen in one stratum do not depend upon those chosen in another.

Some additional notation is required for stratified random sampling. Let


L=number of strata
Ni =number of sampling units in stratum i
N =number of sampling units in the population=N1 + N2 + · · · + NL

The following example illustrates a situation in which stratified random sampling may be
appropriate.

Example 3.2.2 An advertising firm, interested in determining how much to emphasize


television advertising in a certain county, decides to conduct a sample survey to estimate
the average number of hours per week that households within the county watch television.
The county contains two towns, town A and town B and a rural area. Town A is built
around a factory, and most households contain factory workers with school-aged children.
Town B is an exclusive suburb of a city in a neighbouring county and contains older residents
with few children at home. There are 155 households in town A, 62 in town B and 93 in
the rural area. Discuss the merits of using stratified random sampling in this situation.

Solution: The population of households falls into three natural groupings, two towns and
a rural area, according to geographic location. Thus to use these divisions as three strata
is quite natural simply for administrative convenience in selecting the samples and carrying
out the fieldwork. In addition, each of the three groups of households should have similar
behaviour patterns among residents within the group. We expect to see relatively small
variability in number of hours of television viewing among households within a group and
this is precisely the situation in which stratification produces a reduction in a bound on the
error of estimation.

The advertising firm may wish to produce estimates on average television-viewing hours for
each town separately. Stratified random sampling allows for these estimates.

For the stratified random sample, we have N1 = 155, N2 = 62 and N3 = 93 with N = 310.

3.2.3 Estimation of Population Mean and Total
How can we use the data from a stratified random sample to estimate the population mean?
Let ȳi denote the sample mean for the simple random sample selected from stratum i, ni
the sample size for stratum i, µi the population mean for stratum i, and τi the population
total for stratum i. Then the population total τ is equal to τ1 + τ2 + · · · + τL . We have
a simple random sample within each stratum. Therefore we know from Simple Random
Sampling that ȳi is an unbiased estimator of µi and Ni ȳi is an unbiased estimator of the
stratum total τi = Ni µi . It seems reasonable to form an estimator of τ , which is the sum of
the τi ’s by summing the estimator of the τi ’s. Similarly, since the population mean µ equals
the population total τ divided by N , an unbiased estimator of µ is obtained by summing
the estimators of the τi ’s over all strata and then dividing by N . We denote this estimator
by ȳst , where the subscript st indicates that stratified random sampling is used.

Estimator of the population mean µ:

ȳst = (1/N)[N1ȳ1 + N2ȳ2 + · · · + NLȳL] = (1/N) Σ_{i=1}^{L} Ni ȳi     (3.22)

Estimated variance of ȳst:

V̂(ȳst) = (1/N²)[N1² V̂(ȳ1) + N2² V̂(ȳ2) + · · · + NL² V̂(ȳL)]     (3.23)
       = (1/N²)[N1(N1 − n1)(s1²/n1) + · · · + NL(NL − nL)(sL²/nL)]     (3.24)
       = (1/N²) Σ_{i=1}^{L} Ni(Ni − ni)(si²/ni)     (3.25)

Bound on the error of estimation:

2√V̂(ȳst) = 2√{(1/N²) Σ_{i=1}^{L} Ni(Ni − ni)(si²/ni)}     (3.26)

Example 3.2.3 Suppose the survey planned in Example 3.2.2 is carried out. The advertis-
ing firm has enough time and money to interview n = 40 households and decides to select
random samples of size n1 = 20 from town A, n2 = 8 from town B, and n3 = 12 from the
rural area. (We will discuss the choice of sample sizes later.) The simple random samples are
selected and the interviews conducted. The results, with measurements of television-viewing
time in hours per week, are shown in Figure 3.3.
Estimate the average television viewing time, in hours per week, for a) all households in the
county b) all households in town B. In both cases, place a bound on the error of estimation.

The terms s1², s2² and s3² in Figure 3.3 are the sample variances for strata 1, 2 and 3, respectively;
they are given by the formula

si² = Σ_{j=1}^{ni} (yij − ȳi)²/(ni − 1) = (Σ_{j=1}^{ni} yij² − ni ȳi²)/(ni − 1)

for i = 1, 2, 3, where yij is the jth observation in stratum i. These variances estimate the
corresponding true stratum variances σ1², σ2² and σ3².

Solution:

Figure 3.3: Television viewing time, in hours per week

(a) all households in the county

ȳst = (1/N)[N1ȳ1 + N2ȳ2 + N3ȳ3] = (1/310)[(155)(33.900) + (62)(25.125) + (93)(19.00)] = 27.7
is the best estimate of the average number of hours per week that all households in the
county spend watching television. Also,
V̂(ȳst) = (1/N²) Σ_{i=1}^{3} Ni² [(Ni − ni)/Ni](si²/ni)
       = [1/(310)²] [(155)²(0.871)(35.358/20) + (62)²(0.871)(232.411/8) + (93)²(0.871)(87.636/12)]
       = 1.97
The estimate of the population mean with an approximate two-standard-deviation bound
on the error of estimation is given by
ȳst ± 2√V̂(ȳst)
27.675 ± 2√1.97
27.7 ± 2.8
Thus we estimate the average number of hours per week that households in the county view
television to be 27.7 hours. The error of estimation should be less than 2.8 hours with a
probability approximately equal to 0.95.

b) all households in town B.


The n2 = 8 observations from stratum 2 constitute a simple random sample. The estimate
of the average viewing time for town B with an approximate two-standard-deviation bound
on the error of estimation is given by
ȳ2 ± 2√{[(N2 − n2)/N2](s2²/n2)}

25.1 ± 2√{[(62 − 8)/62](232.411/8)}

25.1 ± 10.0

This estimate has a large bound on the error of estimation because s22 is large and the
sample size n2 is small. Thus the estimate ȳst of the population mean is quite good, but
the estimate ȳ2 of the mean of stratum 2 is poor. If an estimate is desired for a particular
stratum, the sample from that stratum must be large enough to provide a reasonable bound
on the error of estimation.
Procedures for the estimation of a population total τ follow directly from the procedures
presented for estimating µ. Since τ is equal to N µ, an unbiased estimator of τ is given by
N ȳst .

Estimation of a population total τ:

Nȳst = N1ȳ1 + N2ȳ2 + · · · + NLȳL = Σ_{i=1}^{L} Ni ȳi     (3.27)

Estimated variance of Nȳst:

V̂(Nȳst) = N² V̂(ȳst) = Σ_{i=1}^{L} Ni² [(Ni − ni)/Ni](si²/ni)     (3.28)

Bound on the error of estimation:

2√V̂(Nȳst) = 2√{Σ_{i=1}^{L} Ni² [(Ni − ni)/Ni](si²/ni)}     (3.29)

Example 3.2.4 Refer to Example 3.2.3 and estimate the total number of hours per week
that households in the county view television. Place a bound on the error of estimation.

Solution:

Nȳst = 310(27.7) = 8587 hours

The estimated variance of Nȳst is given by

V̂(Nȳst) = N² V̂(ȳst) = (310)²(1.97) = 189,278.560

The estimate of the population total with a bound on the error of estimation is given by

Nȳst ± 2√V̂(Nȳst)
8587 ± 2√189,278.560
8587 ± 870
Thus we estimate the total weekly viewing time for households in the county to be 8587
hours. The error of estimation should be less than 870 hours.
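The stratified calculations of Examples 3.2.3 and 3.2.4 can be checked with the Python sketch below (illustrative; stratified_estimates is our own name). Note that the notes round ȳst to 27.7 before forming Nȳst = 8587, so the unrounded total differs slightly.

    import math

    def stratified_estimates(N_i, n_i, ybar_i, s2_i):
        # Formulas (3.22)-(3.29): ybar_st and its bound, N*ybar_st and its bound.
        N = sum(N_i)
        ybar_st = sum(Ni * yb for Ni, yb in zip(N_i, ybar_i)) / N
        v_total = sum(Ni ** 2 * ((Ni - ni) / Ni) * (s2 / ni)
                      for Ni, ni, s2 in zip(N_i, n_i, s2_i))
        return ybar_st, 2 * math.sqrt(v_total / N ** 2), N * ybar_st, 2 * math.sqrt(v_total)

    result = stratified_estimates([155, 62, 93], [20, 8, 12],
                                  [33.900, 25.125, 19.000],
                                  [35.358, 232.411, 87.636])
    print([round(x, 1) for x in result])   # about [27.7, 2.8, 8579.2, 870.1]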

3.2.4 Selecting the Sample Size for Estimating Population Means
and Total
The amount of information in a sample depends on the sample size n, since V̂ (ȳst ) decreases
as n increases. Let us examine a method of choosing the sample size to obtain a fixed amount
of information for estimating a population parameter. Suppose we specify that the estimate
ȳst should lie within B units of the population mean, with probability approximately 0.95.
Symbolically, we want

2√V(ȳst) = B

or

V(ȳst) = B²/4
This equation contains the actual population variance of ȳst rather than the estimated vari-
ance.
Although we set V(ȳst) equal to B²/4, we cannot solve for n unless we know something about
the relationships among n1 , n2 , · · · , nL and n. There are many ways of allocating a sample
size n among the various strata. In each case, however, the number of observations ni
allocated to the ith stratum is some fraction of the total sample size n. We denote this
fraction by wi . Hence we can write

ni = nwi , i = 1, 2, · · · , L (3.30)

Using the above equation, we can then set V(ȳst) equal to B²/4 and solve for n.

Similarly, estimation of the population total τ with a bound of B units on the error of
estimation leads to the equation

2√V(Nȳst) = B

or

V(ȳst) = B²/(4N²)
Approximate sample size required to estimate µ or τ with a bound B on the
error of estimation:
n = [Σ_{i=1}^{L} Ni²σi²/wi] / [N²D + Σ_{i=1}^{L} Niσi²]     (3.31)

where wi is the fraction of observations allocated to stratum i, σi² is the population variance
for stratum i, and
D = B²/4 when estimating µ,
D = B²/(4N²) when estimating τ.

We must obtain approximations of the population variances σ12 , σ22 , · · · , σL2 before we can
use Equation (3.31). One method of obtaining these approximations is to use the sample
variances s21 , s22 , · · · , s2L from a previous experiment to estimate σ12 , σ22 , · · · , σL2 . A second
method requires knowledge of the range of the observations within each stratum. From
Tchebysheff’s theorem and the normal distribution the range should be roughly four to six
standard deviations.

Example 3.2.5 A prior survey suggests that the stratum variances for Example (3.2.2) are
approximately σ12 ≈ 25, σ22 ≈ 225 and σ32 ≈ 100. We wish to estimate the population mean
by using ȳst. Choose the sample size needed to obtain a bound on the error of estimation equal to
2 hours if the allocation fractions are given by w1 = 1/3, w2 = 1/3 and w3 = 1/3. In other
words, you are to take an equal number of observations from each stratum.

Solution: A bound on the error of 2 hours means that

2√V(ȳst) = 2

or

V(ȳst) = 1

Therefore, D = 1.
In Example 3.2.2, N1 = 155, N2 = 62 and N3 = 93. Therefore

Σ_{i=1}^{3} Ni²σi²/wi = N1²σ1²/w1 + N2²σ2²/w2 + N3²σ3²/w3
                     = (155)²(25)/(1/3) + (62)²(225)/(1/3) + (93)²(100)/(1/3)
                     = (24025)(75) + (3844)(675) + (8649)(300)
                     = 6991275

Σ_{i=1}^{3} Niσi² = N1σ1² + N2σ2² + N3σ3² = (155)(25) + (62)(225) + (93)(100) = 27125

N²D = (310)²(1) = 96100

From Equation(3.31)we then have


P3 2 2
i=1 Ni σi /wi 6991275
n= 3 = = 56.7
2 96100 + 27125
2
P
N D + i=1 Ni σi

Thus the experimenter should take n = 57 observations with

n1 = n(w1) = 57(1/3) = 19
n2 = 19
n3 = 19
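To make the arithmetic reproducible, here is a small Python sketch of Equation (3.31) applied to Example 3.2.5. It is not part of the original notes; the function and variable names are my own.

# Sample size for estimating the mean under stratified sampling, Equation (3.31).
# Illustrative sketch; the stratum sizes, variances and allocation fractions are
# the values used in Example 3.2.5.

def stratified_sample_size(N_i, var_i, w_i, B):
    """Return the approximate n needed so that 2*sqrt(V(y_st)) <= B."""
    N = sum(N_i)
    D = B**2 / 4                      # use B**2 / (4 * N**2) when estimating a total
    numerator = sum(Ni**2 * vi / wi for Ni, vi, wi in zip(N_i, var_i, w_i))
    denominator = N**2 * D + sum(Ni * vi for Ni, vi in zip(N_i, var_i))
    return numerator / denominator

n = stratified_sample_size(N_i=[155, 62, 93],
                           var_i=[25, 225, 100],
                           w_i=[1/3, 1/3, 1/3],
                           B=2)
print(round(n, 1))   # 56.7, so take n = 57 and n_i = 19 per stratum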

3.2.5 Allocation of the Sample


You recall that the objective of a sample survey design is to provide estimators with small
variances at the lowest possible cost. After the sample size n is chosen, there are many ways
to divide n into the individual stratum sample sizes n1 , n2 , · · · , nL . Each division may result

in a different variance for the sample mean. Hence our objective is to use an allocation that
gives a specified amount of information at minimum cost.

In terms of our objective the best allocation scheme is affected by three factors. They are
as follows:

1. The total number of elements in each stratum

2. The variability of observations within each stratum

3. The cost of obtaining an observation from each stratum

The number of elements in each stratum affects the quantity of information in the sample.
A sample of size 20 from a population of 200 elements should contain more information
than a sample of 20 from 20,000 elements. Thus large sample sizes should be assigned to
strata containing large number of elements.

Variability must be considered because a larger sample is needed to obtain a good estimate
of a population parameter when the observations are less homogeneous.

If the cost of obtaining an observation varies from stratum to stratum, we will take small
samples from strata with high costs. We will do so because our objective is to keep the cost
of sampling at a minimum.

Approximate allocation that minimizes cost for a fixed value of V (ȳst ) or mini-
mizes V (ȳst ) for a fixed cost:

wi = (Ni σi/√ci) / (Σ_{k=1}^L Nk σk/√ck)          (3.32)

where Ni denotes the size of the ith stratum, σi2 denotes the population variance for the
ith stratum, and ci denotes the cost of obtaining a single observation from the ith stratum.

Note that ni is directly proportional to Ni and σi and inversely proportional to √ci .

One must approximate the variance of each stratum before sampling in order to use the
allocation formula (3.32). The approximations can be obtained from earlier surveys or from
knowledge of the range of the measurements within each stratum.

Substituting the values of wi in (3.32) into Equation (3.31), we get

n = [ (Σ_{k=1}^L Nk σk/√ck)(Σ_{i=1}^L Ni σi √ci) ] / (N²D + Σ_{i=1}^L Ni σi²)          (3.33)

for optimal allocation with the variance of ȳst fixed at D.

Example 3.2.6 The advertising firm in Example 3.2.2 finds that obtaining an observation
from a rural household costs more than obtaining a response in town A or B. The increase
is due to the cost of traveling from one rural household to another. The cost per observation in
each town is estimated to be $9.00 (that is, c1 = c2 = 9), and the cost per observation in the
rural area is estimated to be $16.00 (that is, c3 = 16). The stratum standard deviations (approximated
by the strata sample variances from a prior survey) are σ1 ≈ 5, σ2 ≈ 15 and σ3 ≈ 10 . Find
the overall sample size n and the stratum sample sizes, n1 , n2 and n3 , that allow the firm to

estimate, at minimum cost, the average television-viewing time with a bound on the error
of estimation equal to 2 hours.

Solution We have
Σ_{k=1}³ Nk σk/√ck = N1σ1/√c1 + N2σ2/√c2 + N3σ3/√c3
                   = 155(5)/√9 + 62(15)/√9 + 93(10)/√16
                   = 800.83

and
Σᵢ₌₁³ Ni σi √ci = N1σ1√c1 + N2σ2√c2 + N3σ3√c3
               = 155(5)√9 + 62(15)√9 + 93(10)√16
               = 8835

Thus

n = (Σ_{k=1}³ Nk σk/√ck)(Σᵢ₌₁³ Ni σi √ci) / (N²D + Σᵢ₌₁³ Ni σi²)
  = (800.83)(8835) / ((310)²(1) + 27125)
  = 57.42 ≈ 58

Then

n1 = n (N1σ1/√c1) / (Σ_{k=1}³ Nk σk/√ck) = n (155(5)/3) / 800.83 = 0.32n = 18.5 ≈ 18

Similarly,

n2 = n (62(15)/3) / 800.83 = 0.39n = 22.6 ≈ 23
n3 = n (93(10)/4) / 800.83 = 0.29n = 16.8 ≈ 17
Hence the experimenter should select 18 households at random from town A, 23 from town
B, and 17 from the rural area. He can then estimate the average number of hours spent watching
television at minimum cost with a bound of 2 hours on the error of estimation.
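The cost-optimal allocation can also be computed directly from Equations (3.32) and (3.33). The sketch below is illustrative Python (not part of the original notes) using the values of Example 3.2.6.

# Cost-optimal (minimum-cost) allocation, Equations (3.32) and (3.33).
# Values from Example 3.2.6; illustrative sketch only.
from math import sqrt

N_i  = [155, 62, 93]     # stratum sizes
sd_i = [5, 15, 10]       # stratum standard deviations
c_i  = [9, 9, 16]        # cost per observation in each stratum
B    = 2                 # desired bound on the error of estimation
N    = sum(N_i)
D    = B**2 / 4

weights = [Ni * si / sqrt(ci) for Ni, si, ci in zip(N_i, sd_i, c_i)]
w_i = [w / sum(weights) for w in weights]                      # Equation (3.32)

n = (sum(weights) * sum(Ni * si * sqrt(ci) for Ni, si, ci in zip(N_i, sd_i, c_i))) / \
    (N**2 * D + sum(Ni * si**2 for Ni, si in zip(N_i, sd_i)))  # Equation (3.33)

print(round(n, 2))                       # about 57.4, so take n = 58
print([round(w, 2) for w in w_i])        # allocation fractions about (0.32, 0.39, 0.29)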
In some stratified sampling problems the cost of obtaining an observation is the same for all
strata. If the costs are unknown, we may be willing to assume that the costs per observation
are equal. If c1 = c2 = · · · = cL , then the cost terms cancel in Equation (3.32) and
ni = n (Ni σi) / (Σ_{i=1}^L Ni σi)          (3.34)

This method of selecting n1 , n2 , · · · , nL is called Neyman allocation. Under Neyman allo-
cation Equation (3.33) for the total sample size n becomes

n = (Σ_{i=1}^L Ni σi)² / (N²D + Σ_{i=1}^L Ni σi²)          (3.35)

Example 3.2.7 The advertising firm of Example 3.2.2 decides to use telephone interviews
rather than personal interviews because all households in the county have telephones, and
this method reduces costs. The cost of obtaining an observation is then the same in all three
strata. The stratum standard deviations are approximated by σ1 ≈ 5, σ2 ≈ 15 and
σ3 ≈ 10. The firm desires to estimate the population mean µ with a bound on the error of
estimation equal to 2 hours. Find the appropriate sample size n and stratum sample sizes
n1 , n2 and n3 .

Solution: We will now use Equations (3.34) and (3.35), since the costs are the same in
all strata. To find the allocation fractions, w1 , w2 and w3 , we use Equation (3.34).
Then
3
X
Ni σi = N1 σ1 + N2 σ2 + N3 σ3
i=1
= (155)(5) + (62)(15) + (93)(10)
= 2635

and from Equation (3.34)

n1 = n (N1σ1) / (Σᵢ₌₁³ Ni σi) = n (155)(5)/2635 = n(0.30)

Similarly,

n2 = n (62)(15)/2635 = n(0.35)
n3 = n (93)(10)/2635 = n(0.35)
Thus w1 = 0.30, w2 = 0.35 and w3 = 0.35.
Now let us use Equation (3.35) to find n. A bound of 2 hours on the error of estimation
means that p
2 V (ȳst ) = 2 ⇒ V (ȳst ) = 1
Therefore,
B2
D= =1
4
and
N 2 D = (310)2 (1) = 96100
Also,
Σᵢ₌₁³ Ni σi² = 27125

n = (Σᵢ₌₁³ Ni σi)² / (N²D + Σᵢ₌₁³ Ni σi²)
  = (2635)² / (96100 + 27125)
  = 56.34 ≈ 57

Then
n1 = nw1 = (57)(0.30) = 17
n2 = nw2 = (57)(0.35) = 20
n3 = nw3 = (57)(0.35) = 20
The sample size n in this example is nearly the same as in the previous example, but the
allocation has changed. More observations are taken from the rural area because these
observations no longer have a higher cost.
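For completeness, a short Python sketch of Neyman allocation (Equations (3.34) and (3.35)) applied to Example 3.2.7 is given below; it is illustrative only and not part of the original notes.

# Neyman allocation, Equations (3.34) and (3.35): equal costs, unequal variances.
# Values from Example 3.2.7; illustrative sketch only.
import math

N_i  = [155, 62, 93]
sd_i = [5, 15, 10]
B    = 2
N, D = sum(N_i), B**2 / 4

total = sum(Ni * si for Ni, si in zip(N_i, sd_i))              # 2635
w_i = [Ni * si / total for Ni, si in zip(N_i, sd_i)]           # about (0.30, 0.35, 0.35)
n = math.ceil(total**2 / (N**2 * D + sum(Ni * si**2 for Ni, si in zip(N_i, sd_i))))
print(n, [round(n * w) for w in w_i])                          # 57 and (17, 20, 20)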
In addition to encountering equal costs, we sometimes encounter approximately equal vari-
ances, σ12 , σ22 , · · · , σL2 . In that case the σi ’s cancel in Equation (3.34) and
ni = n (Ni / Σ_{i=1}^L Ni) = n (Ni /N)          (3.36)

This method of assigning sample sizes to the strata is called proportional allocation because
sample sizes n1 , n2 , · · · , nL are proportional to stratum sizes N1 , N2 , · · · , NL . Of course,
proportional allocation can be, and often is, used when stratum variances and costs are not
equal. One advantage to using this allocation is that the estimator ȳst becomes simply the
sample mean for the entire sample. This feature can be an important timesaving feature in
some surveys.

Under proportional allocation Equation (3.31) for the value of n, which yields V (ȳst ) = D,
becomes

n = (Σ_{i=1}^L Ni σi²) / (N D + (1/N) Σ_{i=1}^L Ni σi²)          (3.37)

Example 3.2.8 The advertising firm in Example 3.2.2 thinks that the approximate vari-
ances used in previous examples are in error and that the stratum variances are approx-
imately equal. The common value of σi was approximated by 10 in a preliminary study.
Telephone interviews are to be used, and hence costs will be equal in all strata. The firm
desires to estimate the average number of hours per week that households in the county
watch television, with a bound on the error of estimation equal to 2 hours. Find the sample
size and stratum sample sizes necessary to achieve this accuracy.

Solution: We have
Σᵢ₌₁³ Ni σi² = N1σ1² + N2σ2² + N3σ3²
            = (155)(100) + (62)(100) + (93)(100)
            = 31000

Thus, since D = 1, Equation (3.37) gives

n = 31000 / (310(1) + (1/310)(31000)) = 75.6 or 76

Therefore

n1 = n (N1/N) = n (155/310) = n(0.5) = 38
n2 = n (N2/N) = n (62/310) = n(0.2) = 15
n3 = n (N3/N) = n (93/310) = n(0.3) = 23
These results differ from those of Example 3.2.7 because here the variances are assumed to
be equal in all strata and are approximated by a common value.
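A short Python sketch of the proportional-allocation formulas (3.36) and (3.37), with the values of Example 3.2.8, is given below; it is illustrative and not part of the original notes.

# Proportional allocation, Equations (3.36) and (3.37): equal costs and roughly equal variances.
# Values from Example 3.2.8; illustrative sketch only.
import math

N_i   = [155, 62, 93]
var_i = [100, 100, 100]      # common stratum variance (sigma_i about 10)
B     = 2
N, D  = sum(N_i), B**2 / 4

total = sum(Ni * vi for Ni, vi in zip(N_i, var_i))             # 31000
n = math.ceil(total / (N * D + total / N))                     # Equation (3.37) -> 76
alloc = [round(n * Ni / N) for Ni in N_i]                      # Equation (3.36) -> [38, 15, 23]
print(n, alloc)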
The amount of money to be spent on sampling is sometimes fixed before the experiment
is started. Then the experimenter must find a sample size and allocation scheme that
minimizes the variance of the estimator for a fixed expenditure.

Example 3.2.9 In the television-viewing example, suppose the costs are as specified in Ex-
ample 3.2.6 . That is, c1 = c2 = 9 and c3 = 16. Let the stratum variances be approximated
by σ1 ≈ 5, σ2 ≈ 15 and σ3 ≈ 10. Given the advertising firm has only $500 to spend on
sampling, choose the sample size and the allocation that minimize V (ȳst ).

Solution: The allocation scheme is still given by Equation (3.32). In Example 3.2.6 we found
w1 = 0.32, w2 = 0.39 and w3 = 0.29. Since the total cost must equal $500, we have

c1 n1 + c2 n2 + c3 n3 = 500

or
9n1 + 9n2 + 16n3 = 500
Since ni = nwi , we can substitute as follows:

9nw1 + 9nw2 + 16nw3 = 500

or
9n(0.32) + 9n(0.39) + 16n(0.29) = 500
Solving for n, we obtain
11.03n = 500
n = 500/11.03 = 45.33
Therefore we must take n = 45 to ensure that the cost remains below $500. The corre-
sponding allocation is given by

n1 = nw1 = (45)(0.32) = 14

n2 = nw2 = (45)(0.39) = 18
n3 = nw3 = (45)(0.29) = 13
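The fixed-budget calculation of Example 3.2.9 can be sketched in Python as follows; the snippet is illustrative and not part of the original notes, and it reuses the weights from Equation (3.32).

# Allocation that minimizes V(y_st) for a fixed budget (Example 3.2.9).
# Illustrative sketch only; the weights w_i come from Equation (3.32).
import math

c_i = [9, 9, 16]                 # cost per observation in each stratum
w_i = [0.32, 0.39, 0.29]         # optimal allocation fractions from Example 3.2.6
budget = 500

cost_per_unit_n = sum(ci * wi for ci, wi in zip(c_i, w_i))   # 11.03
n = math.floor(budget / cost_per_unit_n)                     # 45, rounding down to stay within budget
alloc = [round(n * wi) for wi in w_i]
print(n, alloc)                                              # 45 [14, 18, 13]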
We can make the following summary statement on stratified random sampling: In general,
stratified random sampling with proportional allocation will produce an estimator with
smaller variance than that produced by simple random sampling (with the same sample
size) if there is considerable variability among the stratum means. If sampling costs are
nearly equal from stratum to stratum, stratified random sampling with optimal allocation
will yield estimators with smaller variance than will proportional allocation when there is
variability among the stratum variances.

3.2.6 Estimation of Population Proportion
In our numerical examples we have been interested in estimating the average or the total
number of hours per week spent watching television. In contrast, suppose that the advertis-
ing firm wants to estimate the proportion (fraction) of households that watch a particular
show. The population is divided into strata, just as before, and a simple random sample
is taken from each stratum. Interviews are then conducted to determine the proportion p̂i
of households in stratum i that view the show. This p̂i is an unbiased estimator of pi , the
population proportion in stratum i. Reasoning as we did in Section 1.3, we conclude that
Ni p̂i is an unbiased estimator of the total number of households in stratum i that view this
particular show. Hence N1 p̂1 + N2 p̂2 + · · · + NL p̂L is a good estimator of the total number of
viewing households in the population. Dividing this quantity by N , we obtain an unbiased
estimator of the population proportion p of households viewing the show.

Estimator of the population proportion p:


p̂st = (1/N)(N1 p̂1 + N2 p̂2 + · · · + NL p̂L ) = (1/N) Σ_{i=1}^L Ni p̂i          (3.38)

Estimated variance of p̂st :


V̂ (p̂st ) = (1/N²)[N1² V̂ (p̂1 ) + N2² V̂ (p̂2 ) + · · · + NL² V̂ (p̂L )]          (3.39)
         = (1/N²) Σ_{i=1}^L Ni² V̂ (p̂i )                                    (3.40)
         = (1/N²) Σ_{i=1}^L Ni (Ni − ni ) p̂i q̂i /(ni − 1)                   (3.41)

Bound on the error of estimation:


2√V̂ (p̂st ) = 2 √[ (1/N²) Σ_{i=1}^L Ni² ((Ni − ni)/Ni)(p̂i q̂i /(ni − 1)) ]          (3.42)

Example 3.2.10 The advertising firm wanted to estimate the proportion of households in
the county of Example (3.2.2) that view show X. The county is divided into three strata,
town A, town B and the rural area. The strata contain N1 = 155, N2 = 62 and N3 = 93
households, respectively. A stratified random sample of n = 40 households is chosen with
proportional allocation. In other words, a simple random sample is taken from each stratum;
the sizes of the samples are n1 = 20, n2 = 8 and n3 = 12. Interviews are conducted in
the 40 sampled households; results are shown in Table below. Estimate the proportion of
households viewing show X, and place a bound on the error of estimation.

Solution: The estimate of the proportion of households viewing show X is given by p̂st .
We calculate
1
p̂st = [(155)(0.80) + (62)(0.25) + (93)(0.50)] = 0.60
310
[Figure 3.4: Data for Example 3.2.10 — the sampled households in each stratum, giving p̂1 = 0.80, p̂2 = 0.25 and p̂3 = 0.50.]

First, let us calculate the V̂ (p̂i ) terms. We have

V̂ (p̂1 ) = ((N1 − n1)/N1)(p̂1 q̂1 /(n1 − 1)) = ((155 − 20)/155)((0.8)(0.2)/19) = 0.007

V̂ (p̂2 ) = ((N2 − n2)/N2)(p̂2 q̂2 /(n2 − 1)) = ((62 − 8)/62)((0.25)(0.75)/7) = 0.024

V̂ (p̂3 ) = ((N3 − n3)/N3)(p̂3 q̂3 /(n3 − 1)) = ((93 − 12)/93)((0.5)(0.5)/11) = 0.020

V̂ (p̂st ) = (1/N²) Σᵢ₌₁³ Ni² V̂ (p̂i ) = (1/(310)²)[(155)²(0.007) + (62)²(0.024) + (93)²(0.020)] = 0.0045

Then the estimate of the proportion of households in the county that view show X, with a bound
on the error of estimation, is given by

p̂st ± 2√V̂ (p̂st )
0.60 ± 2√0.0045
0.60 ± 2(0.07)
0.60 ± 0.14

The bound on the error in Example 3.2.10 is quite large. We could reduce this bound and
make the estimator more precise by increasing the sample size. The problem of choosing a
sample size is considered in the next section.
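The calculation of p̂st and its bound can be reproduced with the following short Python sketch; it is illustrative only and not part of the original notes.

# Stratified estimate of a proportion and its bound, Equations (3.38)-(3.42).
# Values from Example 3.2.10; illustrative sketch only.
from math import sqrt

N_i = [155, 62, 93]          # stratum sizes
n_i = [20, 8, 12]            # stratum sample sizes
p_i = [0.80, 0.25, 0.50]     # sample proportions viewing show X
N = sum(N_i)

p_st = sum(Ni * pi for Ni, pi in zip(N_i, p_i)) / N
v_i = [((Ni - ni) / Ni) * pi * (1 - pi) / (ni - 1)
       for Ni, ni, pi in zip(N_i, n_i, p_i)]
v_st = sum(Ni**2 * vi for Ni, vi in zip(N_i, v_i)) / N**2

print(round(p_st, 2), round(2 * sqrt(v_st), 2))
# 0.6 with a bound of roughly 0.13; the text rounds intermediate values and reports 0.14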

3.2.7 Selecting the Sample Size and Allocating the Sample to
Estimate Proportions
To estimate a population proportion, we first indicate how much information we desire by
specifying the size of the bound; the sample size is chosen accordingly.

The formula for the sample size n (for a given bound B on the error of estimation) is the
same as Equation (3.31) except that σi2 becomes pi qi .

Approximate sample size required to estimate p with a bound B on the error of
estimation:

n = (Σ_{i=1}^L Ni² pi qi /wi) / (N²D + Σ_{i=1}^L Ni pi qi)          (3.43)

where wi is the fraction of observations allocated to stratum i, pi is the population proportion
for stratum i, and D = B²/4.

The allocation formula that gives the variance of p̂st equal to some fixed constant at minimum
cost is the same as Equation (3.32) with σi replaced by √(pi qi ).

Approximate allocation that minimizes cost for a fixed value of V (p̂st ) or mini-
mizes V (p̂st ) for a fixed cost:
ni = n (Ni √(pi qi /ci)) / (N1 √(p1 q1 /c1) + N2 √(p2 q2 /c2) + · · · + NL √(pL qL /cL))          (3.44)
   = n (Ni √(pi qi /ci)) / (Σ_{k=1}^L Nk √(pk qk /ck))                                           (3.45)
where Ni denotes the size of the ith stratum, pi denotes the population proportion for the
ith stratum, and ci denotes the cost of obtaining a single observation for the ith stratum.
Example 3.2.11 The data of Table 3.4 were obtained from a sample conducted last year.
The advertising firm now wants to conduct a new survey in the same county to estimate the
proportion of households viewing show X. Although the fractions p1 , p2 and p3 that appear
in Equation (3.43) and (3.44) are unknown, they can be approximated by the estimates
from the earlier study, that is, p̂1 = 0.80, p̂2 = 0.25 and p̂3 = 0.50. The cost of obtaining an
observation is $9 for either town and $16 for the rural area, that is c1 = c2 = 9 and c3 = 16.
The number of households within the strata are N1 = 155, N2 = 62 and N3 = 93. The
firm wants to estimate the population proportion p with a bound on the error of estimation
equal to 0.1. Find the sample size n and the strata sample sizes, n1 , n2 and n3 , that will
give the desired bound at minimum cost.

Solution: We first find the allocation fractions wi . Using p̂i to approximate pi , we have

Σᵢ₌₁³ Ni √(p̂i q̂i /ci) = N1 √(p̂1 q̂1 /c1) + N2 √(p̂2 q̂2 /c2) + N3 √(p̂3 q̂3 /c3)
                      = 155 √((0.8)(0.2)/9) + 62 √((0.25)(0.75)/9) + 93 √((0.5)(0.5)/16)
                      = 62.000/3 + 26.846/3 + 46.500/4
                      = 41.241
and

n1 = n (N1 √(p̂1 q̂1 /c1)) / (Σᵢ₌₁³ Ni √(p̂i q̂i /ci)) = n (20.667/41.241) = n(0.50)

Similarly,

n2 = n (8.949/41.241) = n(0.22)
n3 = n (11.625/41.241) = n(0.28)
Thus w1 = 0.50, w2 = 0.22 and w3 = 0.28.

The next step is to find n. First, the following quantities must be calculated:
Σᵢ₌₁³ Ni² p̂i q̂i /wi = N1² p̂1 q̂1 /w1 + N2² p̂2 q̂2 /w2 + N3² p̂3 q̂3 /w3
                    = (155)²(0.8)(0.2)/0.50 + (62)²(0.25)(0.75)/0.22 + (93)²(0.5)(0.5)/0.28
                    = 18686.46

Σᵢ₌₁³ Ni p̂i q̂i = N1 p̂1 q̂1 + N2 p̂2 q̂2 + N3 p̂3 q̂3
              = (155)(0.8)(0.2) + (62)(0.25)(0.75) + (93)(0.5)(0.5)
              = 59.675

To find D, we let 2√V (p̂st ) = 0.1 (the bound on the error of estimation). Then

V (p̂st ) = (0.1)²/4 = 0.0025 = D
and
N²D = (310)²(0.0025) = 240.25

Finally,

n = (Σᵢ₌₁³ Ni² p̂i q̂i /wi) / (N²D + Σᵢ₌₁³ Ni p̂i q̂i) = 18686.46 / (240.25 + 59.675) = 62.3 ≈ 63
Hence
n1 = nw1 = (63)(0.50) = 31
n2 = nw2 = (63)(0.22) = 14
n3 = nw3 = (63)(0.28) = 18
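The sample-size computation of Example 3.2.11 can be sketched in Python as follows; it is illustrative and not part of the original notes.

# Sample size and allocation for estimating a proportion, Equations (3.43)-(3.45).
# Values from Example 3.2.11; illustrative sketch only.
import math

N_i = [155, 62, 93]
p_i = [0.80, 0.25, 0.50]     # approximations from the earlier survey
c_i = [9, 9, 16]             # cost per observation
B = 0.1
N, D = sum(N_i), B**2 / 4

terms = [Ni * math.sqrt(pi * (1 - pi) / ci) for Ni, pi, ci in zip(N_i, p_i, c_i)]
w_i = [t / sum(terms) for t in terms]

num = sum(Ni**2 * pi * (1 - pi) / wi for Ni, pi, wi in zip(N_i, p_i, w_i))
den = N**2 * D + sum(Ni * pi * (1 - pi) for Ni, pi in zip(N_i, p_i))
n = math.ceil(num / den)
print(n, [round(wi, 2) for wi in w_i])
# 63 with weights (0.50, 0.22, 0.28); n_i = n*w_i gives about (31, 14, 18)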
If the cost of sampling does not vary from stratum to stratum, then the cost factors ci cancel
from Equation (3.44).

Recall that the allocation formula (3.32) assumes a very simple form when the variances as
well as costs are equal for all strata. Equation (3.44) simplifies in the same way, provided
all stratum proportions pi are equal and all costs ci are equal. Equation (3.44) then becomes

ni = n (Ni /N), i = 1, 2, · · · , L          (3.46)

As previously noted, this method for assignment of sample sizes to the strata is called
proportional allocation.

Chapter 4

Multivariate Analysis

4.1 Multivariate Data Sets


Multivariate Analysis is necessary when two or more variables (variates, responses, at-
tributes) are measured on each unit (individual, plot, sample member) and an analysis of the
overall system is required. In a multivariate sample, the observations will generally be
independent but within an observation, the variables will not.
Multivariate analysis deals with several types of variables: binary, discrete, categorical,
continuous, etc. Multivariate data arise in many areas of science, ranging from biology to
psychology.
Frequently in multivariate data the observations are grouped, and we are interested in
discovering possible differences in the groups. The choice of multivariate techniques depends
on the type of data and the sort of objectives. The underlying theme of most analyses is
simplification.

4.1.1 Examples of Multivariate Data


1. Education: A set of marks in a range of examinations will reflect a child’s ability.

2. Medical: One hundred patients with leukemia have historical studies carried out giving
about 20 variables, each with a − or + or ++ response.

3. Agriculture: Observations are available on crop yield and weather conditions over a
number of different sites. Is there a simple relationship that connects the crop variables
with the weather variables?

4.1.2 Multivariate Data Structure


Suppose p variables are measured on each of n units. What we thus have is a sample of size
n, each unit being a p-vector. If the variables are x1 , x2 , . . . , xp , then denote them as a
p × 1 vector x. An arbitrary observation on a unit may be denoted by the observation vector
x = (x1 , x2 , ..., xp )T . The resulting set of data can be written as an n × p matrix with each
observation appearing in rows and each variable appearing in columns. This data matrix
will henceforth be denoted as X.

    [ x11  x12  ...  x1p ]   [ x1T ]
    [ x21  x22  ...  x2p ]   [ x2T ]
X = [  .    .    .    .  ] = [  .  ]
    [ xn1  xn2  ...  xnp ]   [ xnT ]
Example 4.1.1 Measurements of three variables were taken for four individuals. The data
obtained were (1.2, 2.8, 7.8),(2.3, 2.9, 8.1),(1.7, 3.1, 8.2),(2.0, 2.8, 7.9).
The data matrix is given by

    [ 1.2  2.8  7.8 ]
    [ 2.3  2.9  8.1 ]
X = [ 1.7  3.1  8.2 ]
    [ 2.0  2.8  7.9 ]

4.1.3 Mean and Dispersion


In the univariate case we have observations x1 , x2 , . . . , xn for a variable X. The sample
mean x̄ for this set of observations is given by x̄ = (1/n) Σ_{i=1}ⁿ xi and is used to estimate the
population mean µ. Similarly the population variance σ² is estimated by the sample variance
s² = (1/(n−1)) Σ_{i=1}ⁿ (xi − x̄)² = (1/(n−1)) (Σ_{i=1}ⁿ xi² − nx̄²).

In the multivariate case, the population mean is a vector µ and the corresponding
estimator, the sample mean, is also a vector. In place of the population variance, we have
the Dispersion Matrix or Variance Covariance Matrix. This is estimated by the sample
dispersion matrix S. The sample mean vector and the sample dispersion matrix are given
by

x̄ = (1/n) Σ_{i=1}ⁿ xi

S = (1/(n−1)) (X T X − n x̄ x̄T ) = (1/(n−1)) Σ_{i=1}ⁿ (xi − x̄)(xi − x̄)T
where X is the data matrix of order n × p.
Note that x̄ = (x̄1 , x̄2 , · · · , x̄p )T where the ith component x̄i is the sample mean of the
ith variable. Similarly, S = ((sij )) where sij = (1/(n−1)) (Σ_{k=1}ⁿ xki xkj − n x̄i x̄j ). Consequently, sii
is the sample variance of the ith variable and sij is the sample covariance between variable
i and variable j.
The sample correlation matrix is given by R. For i ≠ j, the ij th element of R is the
correlation between the ith variable and the jth variable, given by rij = sij /√(sii sjj ). All the
diagonal elements of R are 1.

Example 4.1.2 Consider the data set given in Example 4.1.1.


1. The sample mean vector is given by

   x̄ = (1.8, 2.9, 8.0)T

2. The sample dispersion matrix is given by

       [ .2200  .0067  .0433 ]
   S = [ .0067  .0200  .0233 ]
       [ .0433  .0233  .0333 ]

3. The sample correlation matrix is given by

       [ 1      .1005  .5060 ]
   R = [ .1005  1      .9037 ]
       [ .5060  .9037  1     ]
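These summaries can be verified with the short NumPy sketch below; it is illustrative only and not part of the original notes.

# Sample mean vector, dispersion matrix S and correlation matrix R for Example 4.1.2.
# Illustrative sketch using NumPy.
import numpy as np

X = np.array([[1.2, 2.8, 7.8],
              [2.3, 2.9, 8.1],
              [1.7, 3.1, 8.2],
              [2.0, 2.8, 7.9]])

x_bar = X.mean(axis=0)                 # [1.8, 2.9, 8.0]
S = np.cov(X, rowvar=False)            # divides by n - 1, matching the notes
d = np.sqrt(np.diag(S))
R = S / np.outer(d, d)                 # r_ij = s_ij / sqrt(s_ii * s_jj)

print(x_bar)
print(S.round(4))
print(R.round(4))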

4.2 Random Vectors and Multivariate Distributions
4.2.1 Random Vectors
A random vector is a vector whose components are random variables and is the multivariate
generalization of a single random variable. Random vectors are usually denoted as column
vectors with each component being a random variable. If X = (X1 , X2 , . . . , Xp )T is a random vector, its joint
distribution function, or the joint distribution function of the components of X, is defined
as
FX (x) = P (X1 ≤ x1 , X2 ≤ x2 , . . . Xp ≤ xp ) .
When it exists, the joint density of X is given by

fX (x) = ∂ᵖ FX (x) / (∂x1 ∂x2 . . . ∂xp)

and has the property that

FX (x) = ∫_{−∞}^{x1} ∫_{−∞}^{x2} · · · ∫_{−∞}^{xp} fX (y) dy1 dy2 . . . dyp

4.2.2 Expectation and Dispersion


Definition 4.2.1

1. The Expectation of a random vector X is the vector whose ith component is the
   expectation of Xi :  E(X) = (E(X1 ), E(X2 ), . . . , E(Xp ))T

2. The Dispersion of a random vector X is the matrix whose ij th entry is the covariance
   between Xi and Xj . If E(X) = µ, then

   D(X) = E[(X − µ)(X − µ)T ] = E(XX T ) − µµT

3. The Correlation matrix of a random vector X is the matrix P whose ij th entry is the
   correlation between Xi and Xj . So P = ((ρij )) where

   ρij = Corr(Xi , Xj ) = σij /√(σii σjj ).

Note: Each diagonal element of the correlation matrix P is 1.

Theorem 4.2.2 Let X be a random p-vector with E(X) = µ and D(X) = Σ. Let A be
any m × p matrix, a be any p-dimensional vector, b be any m-dimensional vector and c ∈ R.
Then

1. E(aT X + c) = aT µ + c

2. V (aT X + c) = aT Σa

3. E(AX + b) = Aµ + b

4. D(AX + b) = AΣAT

Theorem 4.2.3 Let Σ = ((σij )) be the dispersion of a random p-vector X. Then

1. Cov(Xi , Xj ) = σij

2. V (Xi ) = σii

3. Σ is a symmetric matrix.

4. Σ is a non-negative definite matrix. It is positive definite if and only if Σ is non-
   singular.

Note: Recall that a matrix A is said to be non-negative definite if for every vector a ∈ Rp ,
aT Aa ≥ 0. A is said to be positive definite if for every vector a 6= 0, aT Aa > 0. A
non-negative definite matrix is positive definite if and only if it is non-singular.

Example 4.2.4 Suppose X = (X1 , X2 )T has mean and dispersion

    µ = (1, −3)T ,    Σ = [ 4  −2 ]
                          [ −2  9 ]

Let Y = (Y1 , Y2 )T where Y1 = 2X1 + X2 and Y2 = 3X1 − X2 . Then

1. E(X1 ) = 1, E(X2 ) = −3, V (X1 ) = 4, V (X2 ) = 9 and Cov(X1 , X2 ) = −2.

2. Expectation and variance of 2X1 + X2 can be calculated by applying (1) and (2) of
   Theorem 4.2.2 with a = (2, 1)T . So

   E(2X1 + X2 ) = aT µ = (2, 1)(1, −3)T = −1
   V (2X1 + X2 ) = aT Σa = (2, 1) Σ (2, 1)T = 17


3. Y = AX where A = [ 2  1 ; 3  −1 ]. So by (3) and (4) of Theorem 4.2.2,

   E(Y ) = Aµ = (−1, 6)T

   and

   D(Y ) = AΣAT = [ 17  13 ]
                  [ 13  57 ]

4. E(Y1 ) = −1, E(Y2 ) = 6, V (Y1 ) = 17, V (Y2 ) = 57 and Cov(Y1 , Y2 ) = 13.
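The mean and dispersion of a linear transform can be checked numerically; the NumPy sketch below reproduces Example 4.2.4 and is not part of the original notes.

# Mean and dispersion of Y = AX (Theorem 4.2.2, parts 3 and 4), using Example 4.2.4.
# Illustrative NumPy sketch.
import numpy as np

mu    = np.array([1, -3])
Sigma = np.array([[4, -2],
                  [-2, 9]])
A     = np.array([[2, 1],
                  [3, -1]])
a     = np.array([2, 1])

print(a @ mu, a @ Sigma @ a)        # E(2X1+X2) = -1, V(2X1+X2) = 17
print(A @ mu)                       # E(Y) = [-1, 6]
print(A @ Sigma @ A.T)              # D(Y) = [[17, 13], [13, 57]]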

We shall now discuss a result in matrix algebra known as spectral decomposition.

Theorem 4.2.5 Let A be a k × k symmetric non-negative definite matrix. Let the k
eigenvalues and the corresponding normalized eigenvectors be denoted by λi and ei ,
i = 1, · · · , k. Then

1. A = Σ_{i=1}^k λi ei eiT

2. If E = [e1 , e2 , · · · , ek ], then E is an orthogonal matrix, that is, EE T = E T E = I.

3. If Λ = diag[λ1 , · · · , λk ], then A = EΛE T = EΛE −1

4. For any integer n, Aⁿ = EΛⁿE T . In particular, A⁻¹ = EΛ⁻¹E T

5. A^(1/2) = EΛ^(1/2)E T where A^(1/2) is a symmetric matrix B such that B² = A

6. (A^(1/2))⁻¹ = (A⁻¹)^(1/2) (denoted by A^(−1/2))

7. A^(−1/2) = EΛ^(−1/2)E T
Example 4.2.6 Let A = [ 3/2  −1/2 ; −1/2  3/2 ]. The eigenvalues of A are easily seen to be λ1 = 1
and λ2 = 2 and the normalized eigenvectors are (1/√2, 1/√2)T and (1/√2, −1/√2)T . So the matrices
E and Λ are given by

    E = [ 1/√2   1/√2 ]        Λ = [ 1  0 ]
        [ 1/√2  −1/√2 ]            [ 0  2 ]

1. For any n, Aⁿ = E [ 1  0 ; 0  2ⁿ ] E T

2. A^(1/2) = E [ 1  0 ; 0  √2 ] E T = (1/2) [ 1 + √2   1 − √2 ]
                                            [ 1 − √2   1 + √2 ]

3. A^(−1/2) = E [ 1  0 ; 0  1/√2 ] E T = (1/(2√2)) [ √2 + 1   √2 − 1 ]
                                                   [ √2 − 1   √2 + 1 ]

Note that the product of A^(1/2) and A^(−1/2) is I.
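The spectral decomposition is easy to verify numerically; the sketch below uses NumPy's symmetric eigendecomposition and is not part of the original notes.

# Spectral decomposition and matrix square root of Example 4.2.6 (Theorem 4.2.5).
# Illustrative NumPy sketch.
import numpy as np

A = np.array([[1.5, -0.5],
              [-0.5, 1.5]])

lam, E = np.linalg.eigh(A)            # eigenvalues and orthonormal eigenvectors
A_half = E @ np.diag(np.sqrt(lam)) @ E.T
A_neg_half = E @ np.diag(1 / np.sqrt(lam)) @ E.T

print(lam)                            # [1. 2.]
print(A_half @ A_half)                # recovers A
print(A_half @ A_neg_half)            # identity matrix (up to rounding)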
The following theorem is a consequence of Theorem 4.2.2.

Theorem 4.2.7 Let X be a random vector with mean µ and dispersion Σ. If Σ is non-
singular, then Σ^(−1/2)(X − µ) has mean 0 and dispersion matrix I.

4.2.3 Covariance between Two Random Vectors


Definition 4.2.8 If X is a random p-vector and Y is a random q-vector, the covariance
between X and Y is the p × q matrix whose ij th element is the covariance between Xi and
Yj . That is,
Cov(X, Y ) = ((Cov(Xi , Yj )))

Theorem 4.2.9 Let X be a random p-vector and Y a random q-vector. If A is an m × p
matrix, B is an n × q matrix, a is any m-vector and b is any n-vector, then
1. Cov(Y , X) = Cov(X, Y )T

2. Cov(AX + a, BY + b) = ACov(X, Y )B T

3. Cov(X, X) = D(X)

Example 4.2.10 Let X be a random 2-vector and Y a random 3-vector such that

    Cov(X, Y ) = [ 1  3  −1 ]
                 [ 2  7   4 ]

Then

1. Cov(Y , X) = Cov(X, Y )T = [ 1  2 ; 3  7 ; −1  4 ]

2. Cov((X1 + X2 , X1 − X2 )T , Y1 + 2Y2 + 3Y3 ) = [ 1  1 ; 1  −1 ] Cov(X, Y ) (1, 2, 3)T = (32, −24)T

4.2.4 Multivariate Normal Distribution


The most commonly used multivariate distribution is the multivariate normal distribution.
A random vector with multivariate normal distribution will not only have each component
normally distributed, but also will have the property that every linear combination of the
components will be normally distributed.

Definition 4.2.11 A p-variate random vector X is said to have multivariate normal dis-
tribution if for every vector a ∈ Rp (the set of all p-dimensional column vectors), aT X is
either normally distributed or has a degenerate distribution (that is, it is a constant).
If a multivariate normal random vector X has mean vector µ and variance-covariance matrix
Σ, then we denote it by X ∼ Np (µ, Σ). The multivariate normal distribution in the special
case of p = 2 is known as bivariate normal distribution.

Example 4.2.12 Suppose X and Y are independent and have N (0, 1) distribution.

1. (X, Y )T ∼ N2 (0, I) where 0 is the 2 × 1 zero vector and I is the 2 × 2 identity matrix.

2. Let X = (2X + Y + 4, −X + 3Y )T . Then X has multivariate normal distribution N2 (µ, Σ)
   where µ = (4, 0)T and Σ = [ 5  1 ; 1  10 ].

4.3 Properties of Multivariate Normal Distribution


4.3.1 Expectation and Dispersion
Theorem 4.3.1 Let X have Np (µ, Σ) distribution. Let a be any p-dimensional vector and
let A be any m × p matrix. Then
1. aT X ∼ N (aT µ, aT Σa)

2. AX ∼ Nm (Aµ, AΣAT )

3. X has a joint density if and only if Σ is non-singular. In this case the density of X is
   given by

   fX (x) = (1/√((2π)ᵖ |Σ|)) exp{−(1/2)(x − µ)T Σ⁻¹(x − µ)}

4. If Σ is non-singular, then Σ^(−1/2)(X − µ) ∼ Np (0, I)
Note: Property (4) is the multivariate analogue of standardizing a normal random variable
to get a standard normal random variable.

Example 4.3.2 Suppose X in Example 4.2.4 is distributed as N2 ((1, −3)T , [ 4  −2 ; −2  9 ]).
Let Y , a and A be as in Example 4.2.4. Then

1. aT X ∼ N (aT µ, aT Σa) = N (−1, 17)

2. Y ∼ N2 ((−1, 6)T , [ 17  13 ; 13  57 ])

Theorem 4.3.3 Let U have Nm (0, I) distribution. Let a be any p-dimensional vector and
let B be any p × m matrix. Then
1. The density function of U is given by

   fU (u) = (1/(2π)^(m/2)) exp{−(1/2) Σ_{i=1}^m ui²}

2. BU + a has Np (a, BB T ) distribution.

Example 4.3.4 Let U1 , U2 and U3 be independent and identically distributed (i.i.d ) N (0, 1)
random variables. Then

1. U = (U1 , U2 , U3 )T is distributed as N3 (0, I)

2. fU1 ,U2 ,U3 (u1 , u2 , u3 ) = (1/(2π√(2π))) exp{−(1/2)(u1² + u2² + u3²)}

3. [ 1  2  −2 ] [ U1 ]   [ 7 ]        ( [ 7 ]   [ 9   19  29 ] )
   [ 3  4  −4 ] [ U2 ] + [ 8 ]  ∼ N3  ( [ 8 ] , [ 19  41  63 ] )
   [ 5  6  −6 ] [ U3 ]   [ 9 ]        ( [ 9 ]   [ 29  63  97 ] )

4.3.2 Partitioning of Multivariate Normal Vector


Let X be a p-variate random vector. Let q, 1 ≤ q ≤ p − 1, be an integer and let r = p − q.
Then we can partition X as

    X = ( X(1) )
        ( X(2) )

where X(1) = (X1 , · · · , Xq )T is a q-vector and X(2) = (Xq+1 , · · · , Xp )T is an r-vector.
The corresponding partitions of the mean vector and the dispersion matrix yield

    µ = ( µ(1) )    and    Σ = [ Σ11  Σ12 ]
        ( µ(2) )               [ Σ21  Σ22 ]

where µ(i) = E(X(i) ) and Σij = Cov(X(i) , X(j) ).


In other words,
• µ(1) = E(X (1) )

• µ(2) = E(X (2) )

• Σ11 = D(X (1) )

• Σ12 = Cov(X (1) , X (2) )
• Σ21 = Cov(X (2) , X (1) ) = ΣT12
• Σ22 = D(X (2) )

Theorem 4.3.5 Let X1 and X2 be independent random vectors such that X1 ∼ Nq (µ1 , Σ1 )
and X2 ∼ Nr (µ2 , Σ2 ). Then (X1 , X2 )T ∼ Nq+r (µ, Σ) where

    µ = ( µ1 )    and    Σ = [ Σ1     0q×r ]
        ( µ2 )               [ 0r×q   Σ2   ]

Example 4.3.6 Suppose X is distributed as N2 ((1, −3)T , [ 4  −2 ; −2  9 ]). Let Y ∼
N (3, 16) be such that X and Y are independent. Then W = (X, Y )T has trivariate normal
distribution with mean and dispersion given by

    E(W ) = (1, −3, 3)T    and    D(W ) = [ 4  −2  0 ; −2  9  0 ; 0  0  16 ]

Theorem 4.3.7 Let X be distributed as Np (µ, Σ). Partition X as (X(1) , X(2) )T , µ as
(µ(1) , µ(2) )T and Σ as [ Σ11  Σ12 ; Σ21  Σ22 ], where X(1) and µ(1) are q × 1, X(2) and µ(2) are r × 1,
Σ12 is q × r, Σ21 is r × q, and Σ11 and Σ22 are q × q and r × r respectively. Then

1. X(1) and X(2) are independent if and only if Σ12 = 0.

2. Any vector made of the components of X has multivariate normal distribution with the
   corresponding mean vector and covariance matrix. In particular, X(1) ∼ Nq (µ(1) , Σ11 )
   and X(2) ∼ Nr (µ(2) , Σ22 ).

3. The conditional distribution of X(1) given X(2) = x(2) is Nq (µ1.2 , Σ11.2 ) where
   µ1.2 = µ(1) + Σ12 Σ22⁻¹ (x(2) − µ(2) ) and Σ11.2 = Σ11 − Σ12 Σ22⁻¹ Σ21 .

Example 4.3.8 Suppose X ∼ N4 (µ, Σ) where

    µ = (−1, −3, 2, 3)T    and    Σ = [  15  −12  −14  −1 ]
                                      [ −12   39   17  −5 ]
                                      [ −14   17   22  −4 ]
                                      [  −1   −5   −4   4 ]

1. Find the conditional distribution of X3 given X1 = 0, X2 = −3 and X4 = 2.
   First, write down the distribution of W = (X3 , X1 , X2 , X4 )T .

       W ∼ N4 ( (2, −1, −3, 3)T , [  22  −14   17  −4 ]
                                  [ −14   15  −12  −1 ]
                                  [  17  −12   39  −5 ]
                                  [  −4   −1   −5   4 ] )

   From Part 3 of Theorem 4.3.7, we get that the conditional distribution of X3 is normal
   with mean

       2 + (−14, 17, −4) [ 15  −12  −1 ; −12  39  −5 ; −1  −5  4 ]⁻¹ (1, 0, −1)T = 467/205

   and variance

       22 − (−14, 17, −4) [ 15  −12  −1 ; −12  39  −5 ; −1  −5  4 ]⁻¹ (−14, 17, −4)T = 1083/410

2. Find P (X3 > 3 | X1 = 0, X2 = −3, X4 = 2).

   P (X3 > 3 | X1 = 0, X2 = −3, X4 = 2) = P (Z > (3 − 467/205)/√(1083/410))
                                        = P (Z > 0.44)
                                        = 0.33

3. Find the conditional distribution of (X1 , X2 )T given X3 = 1 and X4 = 4.

   From Part 3 of Theorem 4.3.7, we get that the conditional distribution of (X1 , X2 )T
   given (X3 , X4 )T = (1, 4)T is bivariate normal. The conditional mean is given by

       µ(1) + Σ12 Σ22⁻¹ (x(2) − µ(2) ) = (−1, −3)T + [ −14  −1 ; 17  −5 ] [ 22  −4 ; −4  4 ]⁻¹ ((1, 4)T − (2, 3)T )
                                      = −(1/4)(5, 17)T

   The conditional dispersion is given by

       Σ11 − Σ12 Σ22⁻¹ Σ21 = [ 15  −12 ; −12  39 ] − [ −14  −1 ; 17  −5 ] [ 22  −4 ; −4  4 ]⁻¹ [ −14  17 ; −1  −5 ]
                           = (1/4) [ 9  −13 ; −13  99 ]

4. Find P (X1 + X2 ≤ −4 | X3 = 1, X4 = 4).

   From the conditional distribution of (X1 , X2 )T given X3 = 1 and X4 = 4, we conclude
   that the conditional mean and variance of X1 + X2 are −5.5 and 20.5. So

       P (X1 + X2 ≤ −4 | X3 = 1, X4 = 4) = P (Z ≤ (−4 − (−5.5))/√20.5)
                                          = P (Z ≤ 0.33)
                                          = 0.6293
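The conditioning formula of Theorem 4.3.7 is easy to implement; the NumPy sketch below reproduces the numbers in Example 4.3.8 and is not part of the original notes.

# Conditional distribution of a multivariate normal (Theorem 4.3.7, part 3),
# checked on Example 4.3.8. Illustrative NumPy sketch.
import numpy as np

mu = np.array([-1, -3, 2, 3])
Sigma = np.array([[ 15, -12, -14, -1],
                  [-12,  39,  17, -5],
                  [-14,  17,  22, -4],
                  [ -1,  -5,  -4,  4]])

def conditional_mvn(mu, Sigma, idx1, idx2, x2):
    """Mean and covariance of X[idx1] given X[idx2] = x2."""
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    m = mu[idx1] + S12 @ np.linalg.solve(S22, x2 - mu[idx2])
    C = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return m, C

# X3 given X1 = 0, X2 = -3, X4 = 2 (indices are 0-based)
m, C = conditional_mvn(mu, Sigma, [2], [0, 1, 3], np.array([0, -3, 2]))
print(m, C)        # about 2.278 (= 467/205) and 2.641 (= 1083/410)

# (X1, X2) given X3 = 1, X4 = 4
m, C = conditional_mvn(mu, Sigma, [0, 1], [2, 3], np.array([1, 4]))
print(m, C)        # mean (-1.25, -4.25); dispersion (1/4)[[9, -13], [-13, 99]]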

4.3.3 Bivariate Normal Distribution


The case of p = 2 deserves special treatment. The following theorem is the special case of
bivariate normal and follows from previous results.

Theorem 4.3.9 Let (X1 , X2 )T have bivariate normal distribution with E(X1 ) = µ1 , E(X2 ) =
µ2 , V (X1 ) = σ12 , V (X2 ) = σ22 and Corr(X1 , X2 ) = ρ where ρ ∈ (−1, 1). Then

1. the joint density of X1 and X2 is given by

   fX1 ,X2 (x1 , x2 ) = (1/(2πσ1 σ2 √(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) [ ((x1 − µ1)/σ1)² + ((x2 − µ2)/σ2)² − 2ρ((x1 − µ1)/σ1)((x2 − µ2)/σ2) ] }

2. X1 and X2 are independent if and only if ρ = 0.

3. the conditional distribution of X1 given X2 = x2 is N( µ1 + ρ(σ1/σ2)(x2 − µ2), σ1²(1 − ρ²) )

4. the conditional distribution of X2 given X1 = x1 is N( µ2 + ρ(σ2/σ1)(x1 − µ1), σ2²(1 − ρ²) )
−ρ )

Example 4.3.10 Suppose that the joint distribution of X and Y is given by a bivariate
normal distribution with E(X) = 1, E(Y ) = −2, V (X) = 4, V (Y ) = 9 and Corr(X, Y ) =
−.8.
1. Find the conditional distribution of X given Y = 1.
   The conditional mean of X given Y = 1 is given by

   E(X | Y = 1) = 1 + (−.8)(2/3)(1 − (−2)) = −0.6

   and the conditional variance is given by

   V (X | Y = 1) = 4(1 − .64) = 1.44

2. Find P (X ≤ 0 | Y = 1).
   From the previous calculations, it follows that

   P (X ≤ 0 | Y = 1) = P (Z ≤ 0.6/1.2) = P (Z ≤ 0.5) = 0.6915

Theorem 4.3.11 Let X be distributed as Np (µ, Σ) where Σ is non-singular. Then

1. (X − µ)T Σ⁻¹(X − µ) is distributed as χ²p , the chi-square distribution with p degrees
   of freedom.

2. Np (µ, Σ) assigns probability 1 − α to the solid ellipsoid {x : (x − µ)T Σ⁻¹(x − µ) ≤ χ²p,α }.
   In other words, if A = {x : (x − µ)T Σ⁻¹(x − µ) ≤ χ²p,α }, then P (X ∈ A) = 1 − α.

Note: The chi-square distribution with n degrees of freedom is obtained by adding the squares of n
independent standard normal random variables. In other words, if X1 , X2 , · · · , Xn are i.i.d.
N (0, 1) random variables then Σ_{i=1}ⁿ Xi² has χ²n distribution. The multivariate generalization
of the chi-square distribution is known as the Wishart distribution.

Example 4.3.12 Suppose X ∼ N3 ( (7, 8, 9)T , [ 14  23  35 ; 23  41  63 ; 35  63  97 ] ). Then

1. (X − µ)T Σ⁻¹(X − µ) = 2X1² + 33.25X2² + 11.25X3² − 13X1X2 + 7X1X3 − 38.5X2X3 +
   13X1 − 94.5X2 + 56.5X3 + 78.25 has chi-square distribution with 3 degrees of freedom.

2. If A = {x : 2x1² + 33.25x2² + 11.25x3² − 13x1x2 + 7x1x3 − 38.5x2x3 + 13x1 − 94.5x2 +
   56.5x3 + 78.25 ≤ 7.81}, then P (A) = 0.95. If 7.81 is changed to 6.25 or 9.35, P (X ∈ A)
   changes respectively to 0.9 or 0.975.

4.4 Sampling from a Multivariate Normal Distribution
Just as we take samples from univariate populations to estimate parameters of the popu-
lations, we use multivariate samples for parametric inference in multivariate distributions.
Method of maximum likelihood is often used to estimate parameters such as the population
mean and dispersion.

Theorem 4.4.1 Let X1 , X2 , · · · , Xn be a random sample from a univariate normal popu-
lation with mean µ and variance σ². Then the maximum likelihood estimators of µ and σ²
are given by

    µ̂ = X̄n = (1/n) Σ_{i=1}ⁿ Xi

and

    σ̂² = (1/n) Σ_{i=1}ⁿ (Xi − X̄)² = ((n − 1)/n) s²
In the case of multivariate normal, the results are very similar to the univariate case.

4.4.1 Wishart Distribution


The multivariate generalization of the chi-square distribution we discussed in the previous
section is known as the Wishart distribution. Let Z1 , Z2 , · · · , Zm be i.i.d. Np (0, Σ). Then the
distribution of Σ_{i=1}ᵐ Zi ZiT is called the Wishart distribution with m degrees of freedom and is
denoted by Wp (Σ, m). When p = 1, Wp (Σ, m) is the same as σ²χ²m .

Theorem 4.4.2

1. Let W1 and W2 be two independent random matrices where W1 ∼ Wp (Σ, m1 ) and
   W2 ∼ Wp (Σ, m2 ). Then W1 + W2 ∼ Wp (Σ, m1 + m2 ).

2. If W is distributed as Wp (Σ, m) and C is a q × p matrix, then CW C T is distributed
   as Wq (CΣC T , m).

4.4.2 Maximum Likelihood Estimator


Recall that in the univariate normal case, the maximum likelihood estimators of µ and σ²
are x̄ and ((n − 1)/n) s². For the case of the multivariate normal, the results are strikingly similar.

Theorem 4.4.3 Let X1 , X2 , · · · , Xn be a random sample from a p-variate normal popu-
lation with mean µ and dispersion Σ. Then the maximum likelihood estimators of µ and
Σ are given by

    µ̂ = X̄n = (1/n) Σ_{i=1}ⁿ Xi

and

    Σ̂ = (1/n)(X T X − n x̄ x̄T ) = (1/n) Σ_{i=1}ⁿ (Xi − X̄)(Xi − X̄)T = ((n − 1)/n) S
The following theorem is a direct generalization of its famous univariate case.

Theorem 4.4.4 Let X1 , X2 , · · · , Xn be a random sample from a p-variate normal popu-
lation with mean µ and dispersion Σ. Then

1. X̄ is distributed as Np (µ, (1/n)Σ)

2. (n − 1)S is distributed as Wp (Σ, n − 1)

3. X̄ and S are independent.

4.4.3 The Law of Large Numbers


The univariate strong law of large numbers states that if X1 , X2 , · · · are i.i.d. random
variables with mean µ, then with probability 1, (1/n) Σ_{i=1}ⁿ Xi converges to µ. The following is
its multivariate analogue.

Theorem 4.4.5 Let X1 , X2 , · · · be i.i.d. random vectors with mean µ. Then

    P { (1/n) Σ_{i=1}ⁿ Xi → µ } = 1

4.4.4 The Multivariate Central Limit Theorem


Similar to the law of large numbers, the central limit theorem can be extended to its mul-
tivariate case.

Theorem 4.4.6 Let X1 , X2 , · · · be i.i.d. random vectors with mean µ and dispersion Σ. Then

    √n (X̄n − µ) →d Np (0, Σ).

In other words, as n → ∞, the distribution of √n (X̄n − µ) approaches the Np (0, Σ) distribution.

4.5 Inference on Multivariate Normal Mean


In the univariate case, we developed confidence intervals and hypothesis tests for the mean of
a normal distribution. We will now look into their multivariate generalizations and construct
confidence regions, simultaneous confidence intervals, and tests of hypotheses for the mean
vector of a multivariate normal distribution.

4.5.1 Hypothesis Testing


Recall that in the univariate case, based on a random sample of size n, the test of
H0 : µ = µ0 against the two-sided alternative H1 : µ ≠ µ0 is done using the test statistic

    t = (X̄ − µ0 ) / (s/√n)

where X̄ is the sample mean and s is the sample standard deviation. This test statistic has
a t-distribution with n − 1 degrees of freedom, and we reject the null hypothesis when the
observed value of |t| exceeds tn−1,α/2 .
The multivariate generalization of the t test statistic is given by

    T² = n (X̄ − µ0 )T S⁻¹ (X̄ − µ0 )

where X̄ is the sample mean vector and S is the sample dispersion matrix. This statistic is
called Hotelling’s T². We do not need special tables for the critical values of this distribution
because it turns out that

    T² ∼ ((n − 1)p/(n − p)) Fp,n−p

where Fp,n−p denotes an F distribution with numerator degrees of freedom p and denomi-
nator degrees of freedom n − p. We reject the null hypothesis H0 : µ = µ0 in favour of the
alternative H1 : µ ≠ µ0 if the observed value of T² exceeds ((n − 1)p/(n − p)) Fp,n−p,α .

Note that in the special case of p = 1, T² = n(X̄ − µ0 )²/s² ∼ t²n−1 and
((n − 1)p/(n − p)) Fp,n−p = F1,n−1 = t²n−1 .

Example 4.5.1 A sample of size 42 from a bivariate normal population yielded

    x̄ = (0.564, 0.603)T    and    S = [ 0.0144  0.0117 ]
                                      [ 0.0117  0.0146 ]

At the 5% level of significance, test the hypothesis that the population mean vector is
µ = (0.562, 0.589)T .

S⁻¹ = [ 199.046  −159.509 ; −159.509  196.319 ]. We compute the T² test statistic by

    T² = n (X̄ − µ0 )T S⁻¹ (X̄ − µ0 )
       = 42 (0.002, 0.014) S⁻¹ (0.002, 0.014)T
       = 1.274

Comparing this with ((41)(2)/40) F2,40,.05 = (2.05)(3.23) = 6.62, we do not reject H0 .
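The test is easy to carry out numerically. The sketch below is illustrative Python using NumPy and SciPy, not part of the original notes.

# Hotelling's T^2 test of H0: mu = mu0 (Example 4.5.1).
# Illustrative sketch using NumPy and SciPy.
import numpy as np
from scipy import stats

n = 42
x_bar = np.array([0.564, 0.603])
S = np.array([[0.0144, 0.0117],
              [0.0117, 0.0146]])
mu0 = np.array([0.562, 0.589])
p = len(mu0)

d = x_bar - mu0
T2 = n * d @ np.linalg.solve(S, d)
crit = (n - 1) * p / (n - p) * stats.f.ppf(0.95, p, n - p)

print(round(T2, 3), round(crit, 2))    # about 1.27 versus 6.6: do not reject H0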

4.5.2 Confidence regions


A 100(1 − α)% confidence region for the mean of a p-variate normal distribution is the
ellipsoid given by

    { µ : n (X̄ − µ)T S⁻¹ (X̄ − µ) ≤ ((n − 1)p/(n − p)) Fp,n−p,α }.

Note that the null hypothesis H0 : µ = µ0 is rejected at significance level α if and only if
µ0 falls outside the 100(1 − α)% confidence region.
The directions of the axes of the ellipsoid are given by the eigenvectors of S. If the
eigenvalues of S are denoted by λi , i = 1, · · · , p, then the half-lengths of the axes are given by

    √λi √( ((n − 1)p/(n(n − p))) Fp,n−p,α ).

So if k = ((n − 1)p/(n(n − p))) Fp,n−p,α , then the confidence region is given by
{ µ : (X̄ − µ)T S⁻¹ (X̄ − µ) ≤ k } and the axes are given by √(λi k) ei , where the ei ’s are the
normalized eigenvectors of S.
In the special case of p = 2, the confidence region is an ellipse.

Example 4.5.2 For the previous example, where a sample of size 42 from a bivariate normal
population yielded x̄ = (0.564, 0.603)T and S = [ 0.0144  0.0117 ; 0.0117  0.0146 ], construct
a 95% confidence ellipse for the population mean vector µ.

The eigenvalues of S are 0.026 and 0.002 and the eigenvectors are (0.704, 0.710)T and
(−0.710, 0.704)T . So the half-lengths of the axes are given by

    √λi √( ((n − 1)p/(n(n − p))) Fp,n−p,α ) = √0.026 √( ((41)(2)/(42(40))) F2,40,.05 ) = (0.1612)(0.3971) = 0.064

and

    √0.002 √( ((41)(2)/(42(40))) F2,40,.05 ) = (0.0447)(0.3971) = 0.018

The angle the major axis makes with the x-axis is given by arctan(0.710/0.704) = 45.24°.

4.5.3 Simultaneous confidence intervals


The t-intervals we learned to construct for univariate situations have 100(1−α)% confidence
when taken individually. When several confidence statements are considered simultaneously,
they have much reduced confidence level. The following theorem enables us to construct
confidence intervals simultaneously, not only for all individual components of the mean
vector, but also for all its linear combinations.

Theorem 4.5.3 Let X1 , X2 , · · · , Xn be a random sample from a p-variate normal popu-
lation with mean µ and dispersion Σ. Then, simultaneously for all a, the interval

    aT X̄ ± √( ((n − 1)p/(n(n − p))) Fp,n−p,α ) √(aT Sa)

contains the value of aT µ with probability 1 − α.

Note that by taking a to be the vector with 1 at ith position and 0 elsewhere, we can
get confidence intervals for the individual µi ’s. These F -intervals will be wider than the
corresponding t-intervals. Interestingly, these simultaneous intervals are the shadows, or
projections, of the confidence ellipsoid on the axes of the component means.

Example 4.5.4 The scores obtained by n = 87 college students on the College Level Ex-
amination Program (CLEP) subtest X1 and College Qualification Test (CQT) subtests X2
and X3 resulted in the following sample mean and sample dispersion.

    x̄ = (527.74, 54.69, 25.13)T    and    S = [ 5691.34  600.51  217.25 ]
                                              [  600.51  126.05   23.37 ]
                                              [  217.25   23.37   23.11 ]

Construct 95% simultaneous confidence intervals for µ1 , µ2 , µ3 and µ2 − µ3 .

    √( ((n − 1)p/(n(n − p))) Fp,n−p,α ) = √( ((86)(3)/(87(84))) F3,84,.05 ) = √( ((86)(3)/(87(84))) (2.7) ) = 0.3087

aT X̄ is 527.74, 54.69, 25.13 and 54.69 − 25.13 = 29.56, and aT Sa is 5691.34, 126.05, 23.11
and 102.42 for µ1 , µ2 , µ3 and µ2 − µ3 respectively. So the confidence intervals are

    527.74 ± (0.3087)√5691.34 = 527.74 ± 23.29,
    54.69 ± (0.3087)√126.05 = 54.69 ± 3.47,
    25.13 ± (0.3087)√23.11 = 25.13 ± 1.48
and
    29.56 ± (0.3087)√102.42 = 29.56 ± 3.12.
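These simultaneous intervals can be generated for any choice of a. The sketch below is illustrative Python using NumPy and SciPy, not part of the original notes; small differences from the text come from using a more precise F quantile than 2.7.

# Simultaneous T^2 confidence intervals of Theorem 4.5.3, Example 4.5.4.
# Illustrative sketch using NumPy and SciPy.
import numpy as np
from scipy import stats

n, p, alpha = 87, 3, 0.05
x_bar = np.array([527.74, 54.69, 25.13])
S = np.array([[5691.34, 600.51, 217.25],
              [ 600.51, 126.05,  23.37],
              [ 217.25,  23.37,  23.11]])

k = np.sqrt((n - 1) * p / (n * (n - p)) * stats.f.ppf(1 - alpha, p, n - p))

for a in (np.array([1, 0, 0]), np.array([0, 1, 0]),
          np.array([0, 0, 1]), np.array([0, 1, -1])):
    centre = a @ x_bar
    half = k * np.sqrt(a @ S @ a)
    print(round(centre, 2), "+/-", round(half, 2))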

Chapter 5

Decision Theory

5.1 Introduction
Game theory models strategic situations, or games, in which an individual’s success in mak-
ing choices depends on the choices of others. It has applications in statistics, logic, computer
science, economics, management, operations research, political science, social psychology
and biology. Game Theory can be regarded as a multi-agent decision problem where there
are two or more people contending for limited rewards or payoffs. The players make decisions
on which their payoff depends. Each player is expected to behave rationally.
The first known discussion of game theory dates back to James Waldegrave in 1713. The
main contributors to the field are John von Neumann, Émile Borel, John Nash, Reinhard
Selten, John Harsanyi, John Maynard Smith, Thomas Schelling, Robert Aumann, Roger
Myerson, Leonid Hurwicz and Eric Maskin.
Game theory can be classified into cooperative or non-cooperative games depending on
whether the players cooperate with each other. Of these, noncooperative games are the more
widely studied, since they can model strategic situations in finer detail. Games can also be
classified as symmetric or asymmetric. If the
identities of the players can be interchanged without changing the payoff to the strategies,
then a game is symmetric. Another important classification of games is into zero-sum
and non-zero-sum games. In zero-sum games, choices by players can neither increase nor
decrease the available resources because the total benefit to all players in the game, for every
combination of strategies, always adds to zero.

Example 5.1.1 Players A and B play the game of penny matching where each player has
a penny and secretly turns the penny to heads or tails. The players then reveal their choices
simultaneously. If the pennies match, that is, both heads or both tails, Player A keeps both
pennies. This, of course, means Player B loses a penny so that it is +1 for A and -1 for B.
If the pennies do not match, Player B keeps both pennies so that it is -1 for A and +1 for
B. The pay-offs can be represented as follows:

Player B
Heads Tails
Player A Heads (1,-1) (-1,1)
Tails (-1,1) (1,-1)

This is an example of a zero-sum game, where one player’s gain is exactly equal to the
other player’s loss. The matrix that represents the pay-offs is called the pay-off matrix. The
choices of the players are strategies. Since for a zero-sum game the second player’s pay off
can be determined by that of the first player, for such games we can represent the payoff
matrix in a simpler fashion:

Player B
Heads Tails
Player A Heads 1 -1
Tails -1 1

We shall use the convention that the entry always represents the gain of Player A, whose
strategies are represented by rows of the pay off matrix.

Example 5.1.2 A manufacturer must decide whether he should expand his business now
or postpone expansion. The implications of his action or inaction depend on whether there
is going to be a recession. If he expands now, there will be a profit of $100,000 if there is
no recession and a loss of $50,000 if there is a recession. On the other hand, if he postpones
expansion, there will be a profit of $60,000 if there is no recession and a profit of $5,000 if
there is a recession. One may consider this to be a game between the manufacturer and Nature.
Using $1000 as a unit, we can write down the pay-off matrix as follows:

Nature
No recession Recession
Manufacturer Expand 100 -50
Postpone 60 5

Definition 5.1.3 A strategy in a game is said to be a minimax strategy if it minimizes
the maximum loss, or maximizes the minimum gain.
For the pay-off matrix in Example 5.1.2, we should look for the strategy that maximizes the
minimum gain for the manufacturer. If he expands now, the minimum gain is -50, whereas
if he postpones expansion, the minimum gain is 5. Thus the minimax rule suggests that he
should postpone expansion. As Nature is not considered a competing opponent, we do not
worry about its minimax rule.
We can generalize the idea in Example 5.1.2 as follows. Here the manufacturer has two
choices a1 (expand now) and a2 (postpone expansion) while Nature has a choice between θ1
(no recession) and θ2 (there is a recession). Depending on these, the pay-off matrix is as
shown below:

Nature
No recession Recession
Manufacturer Expand L(a1 , θ1 ) L(a1 , θ2 )
Postpone L(a2 , θ1 ) L(a2 , θ2 )

If there are more than two choices for the players, say m for the manufacturer and n for
Nature, the resulting matrix will be of order m × n. The amounts L(ai , θj ) are the values of
the loss function. In other words, L(ai , θj ) is the loss of the manufacturer if he takes action
ai when the true state of the Nature is θj .
We will now return to the methods of finding optimal strategies for two-person zero-sum
games. Consider the following example.

Example 5.1.4 For the following game, find the optimal strategies for both players.

Player B
I II
Player A 1 10 -1
2 11 13

It is easy to see that for Player B, the optimal choice depends on Player A’s strategy. If
A chooses Strategy 1, Player B, for whom small numbers are better, will choose Strategy
II, whereas he will choose Strategy I if A chooses Strategy 2. But this uncertainty is not
present from A’s point of view. Irrespective of the choice of B, Strategy 2 is the better one
for him. Knowing this, B will always choose Strategy I, and the end result is that A will
get paid $11.
In Example 5.1.4, the amount A wins, which is 11, is called the value of the game. If the
numbers in a row of a pay-off matrix are all less than or equal to the corresponding values
in another row, the first row is said to be dominated by the second row, and hence the first
row can be eliminated from the game. Similarly, if the numbers in a column are all greater
than or equal to the corresponding values in another column, the first column is dominated
by the second column, and hence the first column can be eliminated. In this example, Row
2 dominates Row 1, and after eliminating Row 1, Column 1 dominates Column 2.
Dominated rows are also called recessive rows. Similarly, dominated columns are called
recessive columns.
Smaller rows and larger columns are dominated or recessive, and hence can be removed.

In many situations, using dominated strategies to arrive at a complete solution will not
work. One may find that there are no dominated strategies at all, or may only succeed in
partially reducing the size of the matrix.

Example 5.1.5

1.                    Player B
                    I     II    III
   Player A   1    -3      4     -4
              2     0      2      4
              3    -4     -8     10

2.                    Player B
                    I     II    III
   Player A   1    -3      4     -7
              2     0      2     -6
              3    -4     -8     -5
In the two matrices given above, the first one has no dominated strategies, whereas the
second one can be reduced to a smaller matrix. In the second matrix, Strategy I of
Player B can be eliminated because it is dominated by Strategy III. No further reduction
is possible. After deleting the dominated strategy of B, the new pay-off matrix is

Player B
II III
1 4 -7
Player A 2 2 -6
3 -8 -5

For Matrix 1, the minimum pay-offs for Player A for his three strategies are -4, 0 and
-8, and among these, the maximum is 0, so his minimax strategy is Strategy 2. Similarly
Player B’s minimax strategy is found by looking at his maximum losses, 0, 4 and 10, and
taking the minimum. This leads to Strategy I.

An important aspect of the minimax strategies in the previous example is that they are spyproof:
even if a player knows the intentions of the other player, his minimax strategy is still the best
option available. Often, games are not spyproof.

Example 5.1.6 For the following game, the strategies of the players are not spyproof.

Player B
I II
Player A 1 4 -9
2 -2 2

For Player A, the minimax strategy would be Strategy 2, the one that maximizes mini-
mum pay-off. For B, it is Strategy II, but if he knows that A will use strategy 2, B is better
off using Strategy I. Similarly, if A knows that B is going for Strategy II, he will want to use
Strategy 2, but if he thinks B may switch to Strategy I, he may want to switch to Strategy
1 for a better pay-off. This could lead to guessing and outguessing, and a lot of attempted
trickery and deception.

Definition 5.1.7 For an m × n matrix M , a saddle point is an entry that is simultaneously
a row minimum and a column maximum.

A game whose minimax strategies are spyproof is said to be strictly determined. It is
easy to check whether the minimax strategies for a particular game matrix are spyproof
or not. The strategies are spyproof if and only if the matrix has a saddle point. For the
first matrix in Example 5.1.5, 0 is a saddle point. For the matrix in Example 5.1.6, clearly
there is no saddle point.
When there is a saddle point, the strategies corresponding to the saddle point are the
minimax strategies. Moreover, the entry at the saddle point is the value of the game. It is
possible for a matrix to have more than one entry that is a saddle point. In such cases, the
values of all the saddle points will be the same. It is also true that if the (i, j)th and (i′, j′)th
entries are saddle points, then the (i, j′)th and (i′, j)th entries will also be saddle points.

Example 5.1.8 Solve the following game. In other words, find the minimax strategies for
both players and the value of the game.

Player B
I II III
1 0 6 -1
Player A 2 1 4 2
3 -1 -3 6

Solution: Clearly the entry 1 in Row 2 and Column 1 is a saddle point, that is, simulta-
neously a row minimum and a column maximum. So the optimal strategies are Row 2 and
Column 1, and the value of the game is 1.

5.2 Mixed Strategies


As already observed, many games are not strictly determined. For such games, there is
no fixed strategy that would work for either player. We will go with the assumption that
these games will be played between the two opponents repeatedly. Exhibiting any pattern
of behaviour can be detrimental because the opponent can take advantage of it and increase
his winnings. For instance, in Example 5.1.6, if Player A kept on choosing Row 2, expecting
B to use Column 2 and expecting to win $2, he will have an unpleasant surprise when B
plays Column 1 and causes A to lose $2 instead. If A tries to alternate his choices, soon the
opponent will catch on and will counter it by suitably changing his strategy.
The only viable option in this sort of situation is to mix the strategies in such a way
that the opponent cannot predict your move. For instance, you may consider tossing a coin
or rolling a die, and depending on the outcome, make your decision. Thus it all comes down
to what probabilities one should attach to each strategy.

Definition 5.2.1 A pure strategy of a game is one where no randomization is done. A


mixed or randomized strategy is one where a probability distribution is associated to the
various pure strategies.
From now on, for an m×n matrix game M , a strategy of Player A, or the row player, will
be denoted by a 1 × m row probability vector p, a row vector whose entries are non-negative
and add to 1. Similarly, a strategy of Player B, or the column player, will be denoted by
an n × 1 column probability vector q. We will denote the row player by R and the column
player by C. A strategy where one entry is 1 and all others are 0 is a pure strategy.

Theorem 5.2.2 If for a pay-off matrix M , R plays the m different strategies with proba-
bilities in the vector p and C plays the n different strategies with probabilities in the vector
q, then the expected winning of R is given by pM q.
Proof: The expected pay-off is given by Σ_{i=1}ᵐ Σ_{j=1}ⁿ pi qj mij = pM q.

That there always exist minimax strategies for both players is an important theorem in
game theory. This theorem, known as the fundamental theorem of game theory, is due to
John von Neumann.

Theorem 5.2.3 (Fundamental Theorem of Game Theory)


For every m × n matrix game M , there exist strategies p∗ and q ∗ for R and C respectively,
and a unique number v such that p∗ M q ≥ v for every strategy q of C and pM q ∗ ≤ v for
every strategy p of R.
The strategies p∗ and q ∗ are not necessarily unique. From the theorem, it follows that

p∗ M q ∗ = v.

The number v is the value of the game. The triplet (p∗ , q ∗ , v) is called a solution of the
game.
The fundamental theorem of game theory in its full generality will not be proved in
this course. For the special case of 2 × 2 games we have formulas that give us a complete
solution.

Theorem 5.2.4 For a 2 × 2 game M = [ a  b ; c  d ], let D = (a + d) − (b + c). If M is not
strictly determined, then D ≠ 0. Moreover, the minimax strategies and the value of the
game are given by

    p∗ = [p1∗ , p2∗ ] = [ (d − c)/D , (a − b)/D ],

    q ∗ = [q1∗ , q2∗ ]T = [ (d − b)/D , (a − c)/D ]T ,

and

    v = (ad − bc)/D.

Proof: The expected pay-off for R, if he uses the randomized strategy (p, 1 − p), is
pa + (1 − p)c if C chooses Column 1 and pb + (1 − p)d if C chooses Column 2. So the
minimum pay-off is min(pa + (1 − p)c, pb + (1 − p)d). This is to be maximized with respect
to p. It is clear that the minimum is maximized at the point of intersection of the two lines
pa + (1 − p)c and pb + (1 − p)d, which is found by solving the equation

    pa + (1 − p)c = pb + (1 − p)d.

It is easily verified that this happens at p = (d − c)/D.
Similarly, the minimax strategy for C is found by solving the equation

    qa + (1 − q)b = qc + (1 − q)d

which leads to q = (d − b)/D. The value v is calculated by

    p∗ M q ∗ = (1/D²)(d − c, a − b) M (d − b, a − c)T
             = (1/D²)[a(d − c)(d − b) + c(a − b)(d − b) + b(d − c)(a − c) + d(a − b)(a − c)]
             = (1/D²)[ad² − abd − acd + abc + acd − bcd − abc + b²c
                       + abd − abc − bcd + bc² + a²d − abd − acd + bcd]
             = (1/D²)[ad² − bcd − abc + b²c + bc² + a²d − abd − acd]
             = (1/D²)[(ad − bc)((a + d) − (b + c))]
             = (ad − bc)/D

Example 5.2.5 The two-finger Morra game is played by two players who simultaneously
show one or two fingers and agree to pay-off for each combination of outcomes. There are
many variations to this game, and here we consider the case where the pay-off matrix is as
follows:

1 finger 2 fingers
1 finger 2 -3
2 fingers -3 4

Before we work out the optimal strategies, it would be interesting to think about which
one you would rather be - a row player or a column player. As there are no saddle points, the
game is not strictly determined. Since a = 2, b = c = −3, d = 4, we have D = 6 − (−6) = 12. Hence $p^* = \left[\frac{7}{12}, \frac{5}{12}\right]$, $q^* = \left[\frac{7}{12}, \frac{5}{12}\right]^T$ and $v = \frac{ad-bc}{D} = -\frac{1}{12}$. Clearly it is better to be C.

Example 5.2.6 Suppose the following matrix represents the pay-off for a two-person zero-
sum game. Find the optimal strategies for each player and the value of the game.

Player B
I II III
1 12 -6 2
Player A 2 -3 9 10
3 -4 4 12

Solution: There are no recessive rows, but the third column is dominated by Column 2.
We remove Column 3 to get the following matrix:

Player B
I II
1 12 -6
Player A 2 -3 9
3 -4 4

Now Row 3 is dominated by Row 2, so we can remove Row 3 to get the following
irreducible matrix.

Player B
I II
1 12 -6
Player A 2 -3 9

There are no saddle points so the game is not strictly determined. Since a = 12, b =
−6, c = −3, d = 9, we have D = 21 − (−9) = 30. Hence
$$p^* = \left[\frac{12}{30}, \frac{18}{30}\right] = \left[\frac{2}{5}, \frac{3}{5}\right], \qquad q^* = \left[\frac{15}{30}, \frac{15}{30}\right]^T = \left[\frac{1}{2}, \frac{1}{2}\right]^T,$$
and
$$v = \frac{ad-bc}{D} = \frac{90}{30} = 3.$$
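The closed-form solution of Theorem 5.2.4 is easy to mechanize. Below is a minimal Python sketch (the function name solve_2x2 and the printed checks are my own, not part of the notes) that reproduces the answers of Examples 5.2.5 and 5.2.6 for a 2 × 2 game that is not strictly determined.

```python
# Sketch: solve a 2x2 zero-sum matrix game that is not strictly determined,
# using the formulas of Theorem 5.2.4.
def solve_2x2(a, b, c, d):
    """Return (p_star, q_star, v) for the pay-off matrix [[a, b], [c, d]]."""
    D = (a + d) - (b + c)                  # D != 0 when the game is not strictly determined
    p_star = ((d - c) / D, (a - b) / D)    # row player's minimax strategy
    q_star = ((d - b) / D, (a - c) / D)    # column player's minimax strategy
    v = (a * d - b * c) / D                # value of the game
    return p_star, q_star, v

# Two-finger Morra (Example 5.2.5): p* = q* = (7/12, 5/12), v = -1/12
print(solve_2x2(2, -3, -3, 4))
# Reduced game of Example 5.2.6: p* = (2/5, 3/5), q* = (1/2, 1/2), v = 3
print(solve_2x2(12, -6, -3, 9))
```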

5.3 Statistical Games


In statistical inference, estimation of parameters is done on the basis of sample data. It
is possible to interpret this as a game between the statistician and Nature. Here Nature
controls the features of the population (such as the population mean) and the statistician
tries to predict these values based on data. As an example, if we want to estimate the
mean µ of a normal population based on a random sample of size n, we could say that Nature controls the true value of µ while we estimate it by x̄, with some reward or penalty that depends on the size of the error.
There are similarities between inference and game theory, but there are also two major differences between the two. First, there is the question of whether it is reasonable to treat Nature as a malevolent opponent. Clearly it is not, but in a way that makes matters worse, because we cannot expect the opponent to behave rationally. The other difference
is that while in an ordinary game, we choose our strategies without any knowledge of the
strategies of our opponent, in a statistical game, we are supplied with sample data that
provides some information about Nature’s choice. This makes the game immensely more
complicated. The following comparatively simple situation illustrates this.

Example 5.3.1 We are told that a coin is either unbiased or double-headed, and we have to make a decision based on the outcome of a single toss of the coin. There is a penalty of 1 for a wrong decision and no reward or penalty for a correct decision. This gives us the following loss matrix.

Nature
θ1 θ2
Statistician a1 L(a1 , θ1 ) = 0 L(a1 , θ2 ) = 1
a2 L(a2 , θ1 ) = 1 L(a2 , θ2 ) = 0
Here θ1 is the state of Nature that the coin is balanced and θ2 is the state of Nature that
the coin is double-headed. Similarly, a1 is statistician’s decision that the coin is balanced
and a2 is statistician’s decision that the coin is double-headed.
Let X denote the number of heads obtained when the coin is tossed; X takes value 0
or 1. Player A, the statistician, knows the outcome of the coin toss, that is, he knows the
value the random variable X takes. To make use of this information in choosing between
the two actions, we need a function, known as a decision function, or a decision rule, that
associates outcomes to actions. As there are two possible outcomes and two actions in this example, we have 2² = 4 possible decision rules. In general, the set of all decision rules is the set of all functions from the set of possible outcomes (the sample space) to the action space A. In our case, the
first decision rule d1 associates both outcomes to action a1 , d2 associates x = 0 to action
a1 and x = 1 to action a2 , d3 associates x = 0 to action a2 and x = 1 to action a1 , and d4
associates both outcomes to action a2 . Notationally, this means
d1 (0) = a1 , d1 (1) = a1
d2 (0) = a1 , d2 (1) = a2
d3 (0) = a2 , d3 (1) = a1
d4 (0) = a2 , d4 (1) = a2 .
We now compare these decision rules in terms of the expected loss to which they lead.
The expected loss of a particular decision rule is called its risk, and this depends on the
value of θ. Thus we have a risk function, defined as
R(d, θ) = Eθ [L(d(X), θ)]
where Eθ indicates that the expectation is taken under the assumption that the true state
of Nature is θ. We can now calculate the values of the risk function. Note that Pθ (X = x)
has values for various choices of θ and x given by
$$P_{\theta_1}(X=0) = \tfrac{1}{2}, \quad P_{\theta_1}(X=1) = \tfrac{1}{2}, \quad P_{\theta_2}(X=0) = 0, \quad P_{\theta_2}(X=1) = 1.$$
So
$$R(d_1, \theta_1) = E_{\theta_1}[L(d_1(X), \theta_1)] = L(a_1, \theta_1) = 0$$
$$R(d_1, \theta_2) = E_{\theta_2}[L(d_1(X), \theta_2)] = L(a_1, \theta_2) = 1$$
$$R(d_2, \theta_1) = E_{\theta_1}[L(d_2(X), \theta_1)] = \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 1 = \tfrac{1}{2}$$
$$R(d_2, \theta_2) = E_{\theta_2}[L(d_2(X), \theta_2)] = L(a_2, \theta_2) = 0$$
$$R(d_3, \theta_1) = E_{\theta_1}[L(d_3(X), \theta_1)] = \tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 0 = \tfrac{1}{2}$$
$$R(d_3, \theta_2) = E_{\theta_2}[L(d_3(X), \theta_2)] = L(a_1, \theta_2) = 1$$
$$R(d_4, \theta_1) = E_{\theta_1}[L(d_4(X), \theta_1)] = L(a_2, \theta_1) = 1$$
$$R(d_4, \theta_2) = E_{\theta_2}[L(d_4(X), \theta_2)] = L(a_2, \theta_2) = 0$$
These values can be arranged in the form of a matrix:

                   Nature
                   θ1      θ2
             d1     0       1
Statistician d2    1/2      0
             d3    1/2      1
             d4     1       0
Remember that here we have a loss matrix, not a pay-off matrix, so rows with large
values may be eliminated. Clearly d1 is better than d3 and d2 is better than d4 , so we can
remove d3 and d4 . This is not surprising since d3 and d4 lead to a decision of double-headed
coin when a tail is obtained, which is absurd.
This leaves us with a 2 × 2 game given by

                   Nature
                   θ1      θ2
Statistician d1     0       1
             d2    1/2      0
and if Nature is looked upon as a malevolent opponent, it can be shown that the optimal strategy for the statistician is to randomize between d1 and d2 with respective probabilities 1/3 and 2/3.
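To see these numbers concretely, here is a small Python sketch (my own; the names loss, p_x and risk are illustrative) that enumerates the four decision rules, computes their risk functions, and notes the equalizing mixture over d1 and d2.

```python
# Sketch: risk matrix for the coin-tossing game of Example 5.3.1 and the
# minimax randomization over d1 and d2.
from itertools import product

loss = {("a1", "t1"): 0, ("a1", "t2"): 1, ("a2", "t1"): 1, ("a2", "t2"): 0}
p_x = {"t1": {0: 0.5, 1: 0.5}, "t2": {0: 0.0, 1: 1.0}}   # P_theta(X = x)

# A decision rule maps the outcome x in {0, 1} to an action; the product below
# generates d1, d2, d3, d4 in the order used in the notes.
rules = [dict(zip((0, 1), acts)) for acts in product(("a1", "a2"), repeat=2)]

def risk(rule, theta):
    return sum(p_x[theta][x] * loss[(rule[x], theta)] for x in (0, 1))

for i, d in enumerate(rules, start=1):
    print(f"d{i}: R(., theta1) = {risk(d, 't1')},  R(., theta2) = {risk(d, 't2')}")

# Mixing d1 (risks 0, 1) and d2 (risks 1/2, 0) with probabilities (p, 1 - p) and
# equalizing the two expected risks, (1 - p)/2 = p, gives p = 1/3 on d1 and 2/3 on d2.
```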
The decision rules d3 and d4 in the above example are considered inadmissible.

Definition 5.3.2 A decision rule d is said to be inadmissible if there exists a decision rule d′ and a value θ0 such that R(d′, θ) ≤ R(d, θ) for all θ and R(d′, θ0) < R(d, θ0).
The following example further illustrates the concepts of a loss function and a risk
function.

Example 5.3.3 A random variable X has uniform (0, θ) distribution. We want to estimate
θ on the basis of a single observation x. The decision functions are to be of the form
d(x) = kx, where k ≥ 1 is a constant. The losses are proportional to the absolute error, so
the loss function is given by
L(kx, θ) = c|kx − θ|
where c is a positive constant. Find the value of k that minimizes the risk.
Solution: Note that the density function is given by
$$f(x) = \frac{1}{\theta}\, I[0 < x < \theta].$$
So the risk function is given by
$$\begin{aligned}
R(d, \theta) &= E_\theta[L(d(X), \theta)] \\
&= \int_0^{\theta} c\,|kx - \theta|\,\frac{1}{\theta}\,dx \\
&= \int_0^{\theta/k} c(\theta - kx)\,\frac{1}{\theta}\,dx + \int_{\theta/k}^{\theta} c(kx - \theta)\,\frac{1}{\theta}\,dx \\
&= \frac{c}{\theta}\left(\left[\theta x - \frac{kx^2}{2}\right]_0^{\theta/k} + \left[\frac{kx^2}{2} - \theta x\right]_{\theta/k}^{\theta}\right) \\
&= \frac{c}{\theta}\left(\frac{\theta^2}{k} - \frac{\theta^2}{2k} + \frac{k\theta^2}{2} - \theta^2 - \frac{\theta^2}{2k} + \frac{\theta^2}{k}\right) \\
&= c\theta\left(\frac{k}{2} - 1 + \frac{1}{k}\right)
\end{aligned}$$

This is minimized when k = √2, as can be easily verified by elementary calculus. So the best rule among all decision rules of the form d(x) = kx is d(x) = √2 x.
In Example 5.3.3, the minimizing value of k did not depend on the value of θ. This is
often not the case, and if we had not restricted ourselves to decision functions of the form
d(x) = kx, the situation would be very different. Without any restriction, it is clear that
d(x) = θ1 would be the best choice when θ = θ1 , d(x) = θ2 would be the best choice when
θ = θ2 etc. and there is no best decision rule that works for all values of θ.
In general, we try to find decision rules that are best with respect to some criterion. The
two criteria that we shall consider are the minimax criterion and the Bayes criterion. In the
first, we choose d that minimizes the maximum value of R(d, θ) over θ. In the other, we
choose the decision function d for which the Bayes Risk E[R(d, Θ)] is a minimum, where
the expectation is taken with respect to Θ. For this, we need to look upon Θ as a random
variable with a given distribution.

5.4 The Minimax Criterion


We may consider minimizing the maximum risk for the coin tossing problem in Example 5.3.1 as follows. For d1, the maximum risk is 1, while for d2, the maximum risk is 1/2. So clearly the minimax rule is d2.

Example 5.4.1 Let X have binomial distribution with parameters n and θ. Here θ is the unknown parameter. We consider decision functions of the form $d_{a,b}(x) = \frac{x+a}{n+b}$, where a and b are constants. The loss is assumed to be proportional to the squared error. Find the minimax decision rule.
Solution: Let the decision rule with constants a and b be denoted by $d_{a,b}$. Then $L(d_{a,b}(x), \theta) = c\left(\frac{x+a}{n+b} - \theta\right)^2$. The risk function $R(d_{a,b}, \theta)$ is given as follows:

$$\begin{aligned}
R(d_{a,b}, \theta) &= E_\theta[L(d_{a,b}(X), \theta)] \\
&= E_\theta\left[c\left(\frac{X+a}{n+b} - \theta\right)^2\right] \\
&= \frac{c}{(n+b)^2}\, E_\theta\left[(X + a - (n+b)\theta)^2\right] \\
&= \frac{c}{(n+b)^2}\, E_\theta\left[X^2 + a^2 + (n+b)^2\theta^2 + 2aX - 2(n+b)\theta X - 2a(n+b)\theta\right]
\end{aligned}$$

We make use of the fact that $E(X) = n\theta$ and $E(X^2) = n\theta(1 - \theta + n\theta)$ to get
$$R(d_{a,b}, \theta) = \frac{c}{(n+b)^2}\left[\theta^2(b^2 - n) + \theta(n - 2ab) + a^2\right].$$
We maximize this expression with respect to θ for fixed a and b, then minimize the resulting expression with respect to a and b. This involves some messy algebra, and the minimizing values are $a = \sqrt{n}/2$ and $b = \sqrt{n}$.
To simplify the computation of the minimax rule in the above problem, we can make use of the equalizer principle. According to the equalizer principle, under fairly general conditions, the risk function of the minimax decision rule is a constant that does not depend on θ. Using this principle, we get that for the minimax rule,
$$b^2 - n = 0, \qquad n - 2ab = 0.$$
From this, it follows that $a = \sqrt{n}/2$ and $b = \sqrt{n}$.
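A quick numerical check of the equalizer property is easy to write down. The sketch below (my own, with the loss constant c taken to be 1) evaluates the risk of the rule $d(x) = (x + \sqrt{n}/2)/(n + \sqrt{n})$ on a grid of θ values and confirms that it is constant.

```python
# Sketch: verify that the risk of d(x) = (x + sqrt(n)/2) / (n + sqrt(n)) does not
# depend on theta (equalizer property), with the loss constant c = 1.
import math

n = 25
a, b = math.sqrt(n) / 2, math.sqrt(n)

def risk(theta):
    # R(d_{a,b}, theta) = [theta^2 (b^2 - n) + theta (n - 2ab) + a^2] / (n + b)^2
    return (theta**2 * (b**2 - n) + theta * (n - 2 * a * b) + a**2) / (n + b) ** 2

print([round(risk(t), 6) for t in (0.1, 0.3, 0.5, 0.7, 0.9)])
# Every entry equals (n/4) / (n + sqrt(n))^2, since b^2 - n = 0 and n - 2ab = 0.
```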

5.5 The Bayes Criterion
Under this method, we treat the parameter as a random variable following a certain distribu-
tion called the prior distribution. The Bayes risk of a decision rule is the expected value of
R(d, Θ) where Θ is a random variable that takes values in the parameter space. We need
to choose the decision rule d that minimizes the Bayes risk. This involves the computa-
tion of the conditional distribution of Θ given X. This distribution is called the posterior
distribution of Θ given X.

Example 5.5.1 Suppose X ∼ U (0, θ) and θ has a prior distribution given by Θ ∼ gamma
(2, 1). Find the Bayes estimator based on a single observation if we assume that the loss
function is proportional to the squared-error.

Solution: The density of X was originally represented by $f_X(x) = \frac{1}{\theta} I[0 < x < \theta]$, but now that the parameter is a random variable, we should treat this density function as a conditional density function:
$$f_{X|\Theta}(x|\theta) = \frac{1}{\theta}\, I[0 < x < \theta]$$
The density function of Θ is given by

fΘ (θ) = θe−θ I[θ > 0]

from which it follows that the joint density of X and Θ is given by
$$f_{X,\Theta}(x, \theta) = f_{X|\Theta}(x|\theta)\, f_\Theta(\theta) = \frac{1}{\theta}\,\theta e^{-\theta}\, I[0 < x < \theta] = e^{-\theta}\, I[0 < x < \theta].$$
From this joint density, we derive the marginal density of X and the conditional density of
Θ given X as follows. The marginal density of X is given by
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,\Theta}(x, \theta)\,d\theta = \int_x^{\infty} e^{-\theta}\,d\theta = e^{-x}$$

for x > 0. The conditional density of Θ given X = x is given by

$$f_{\Theta|X}(\theta|x) = \frac{f_{X,\Theta}(x, \theta)}{f_X(x)} = \frac{e^{-\theta}}{e^{-x}} = e^{x-\theta}$$

for 0 < x < θ.

$$\begin{aligned}
E[R(d, \Theta)] &= \int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} L(d(x), \theta)\, f_{X|\Theta}(x|\theta)\,dx\right) f_\Theta(\theta)\,d\theta \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} L(d(x), \theta)\, f_{X,\Theta}(x, \theta)\,dx\,d\theta \\
&= \int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} c(d(x) - \theta)^2\, f_{\Theta|X}(\theta|x)\,d\theta\right) f_X(x)\,dx
\end{aligned}$$

For each x, if d(x) is so chosen that the interior integral is minimized, the double integral
will be minimized. To minimize
$$\int_{-\infty}^{\infty} c(d(x) - \theta)^2\, f_{\Theta|X}(\theta|x)\,d\theta$$

for a fixed x, we need to choose d(x) = E[Θ|X = x]. (This is because for any random
variable Y , the value of a that minimizes E(Y − a)2 is E(Y ). Apply this to the conditional
distribution of Θ given X = x. Here the variable a is d(x).)
Therefore the Bayes decision rule is d(x) = E[Θ|X = x], which can be computed as
follows:

$$\begin{aligned}
E[\Theta|X = x] &= \int_x^{\infty} \theta e^{x-\theta}\,d\theta \\
&= \int_0^{\infty} (x + y)\,e^{-y}\,dy \qquad \text{(by substituting } y = \theta - x\text{)} \\
&= x\int_0^{\infty} e^{-y}\,dy + \int_0^{\infty} y e^{-y}\,dy \\
&= x + 1
\end{aligned}$$

because both of the integrals can be verified to equal 1, the first directly and the second by integration by parts. Thus the Bayes decision rule is given by d(x) = x + 1, and the Bayes estimator of θ is x + 1.
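A crude simulation check of this posterior mean is possible by accept-reject conditioning. The sketch below (my own; the grid point x0 and the tolerance are illustrative choices) draws (Θ, X) pairs from the joint distribution and averages Θ over draws with X near x0; the result should be close to x0 + 1.

```python
# Sketch: Monte Carlo check that E[Theta | X = x] is approximately x + 1 when
# Theta ~ Gamma(shape=2, scale=1) and X | Theta ~ Uniform(0, Theta).
import random

random.seed(0)
x0, tol = 1.5, 0.05
kept = []
for _ in range(500_000):
    theta = random.gammavariate(2, 1)      # draw from the prior
    x = random.uniform(0, theta)           # draw the observation given theta
    if abs(x - x0) < tol:                  # keep draws with X close to x0
        kept.append(theta)

print(sum(kept) / len(kept))               # should be close to x0 + 1 = 2.5
```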
The computation described above is a general method for finding the Bayes estimator
whenever the loss function is proportional to the squared error. The other loss function that
is commonly used is L(d(x), θ) = |d(x) − θ|. For this case we need to use the fact that for
any random variable Y , the value of a that minimizes E|Y − a| is the median of Y . Hence
we have the following theorem.

Theorem 5.5.2

1. If the loss function is proportional to the squared error, then the Bayes estimator is
given by the conditional expectation of the posterior distribution.
2. If the loss function is proportional to the absolute error, then the Bayes estimator is
given by the conditional median of the posterior distribution.

Example 5.5.3 For Example 5.5.1, find the Bayes estimator under the absolute error loss
function.

Solution: The conditional density of Θ given X = x is given by fΘ|X (θ|x) = ex−θ I[θ > x]
for x > 0. So the Bayes estimator is given by the median of this distribution, which we
calculate as follows:
The median of a random variable X is a number M such that P(X ≤ M) = 0.5. Thus we equate $\int_x^M e^{x-\theta}\,d\theta$ to 0.5 and solve for M.

$$\int_x^M e^{x-\theta}\,d\theta = e^x\left[-e^{-\theta}\right]_x^M = e^x\left(e^{-x} - e^{-M}\right) = 1 - e^{x-M}$$

Equating 1 − ex−M to 0.5 and solving for M , we get M = x + ln 2.

Example 5.5.4 Suppose X ∼ geometric(θ) and θ has a prior distribution given by Θ ∼


uniform (0, 1). Find the Bayes estimator based on a single observation if we assume that
the loss function is proportional to the squared error.

Solution: The conditional density of X given Θ = θ is given by the function fX|Θ (x|θ) =
θ(1 − θ)x for x ∈ {0, 1, 2, . . .}. The density function of Θ is given by fΘ (θ) = I[0 < θ < 1],
so the joint density of X and Θ is given by

fX,Θ (x, θ) = θ(1 − θ)x I[0 < θ < 1]

for x ∈ {0, 1, 2, . . .}. The marginal density of X is obtained by integrating the joint density
with respect to θ. This leads to
$$f_X(x) = \int_0^1 \theta(1-\theta)^x\,d\theta = \frac{\Gamma(2)\Gamma(x+1)}{\Gamma(x+3)}$$

for x ∈ {0, 1, 2, . . .}. The conditional density of Θ given X = x is hence given by

$$f_{\Theta|X}(\theta|x) = \frac{f_{X,\Theta}(x, \theta)}{f_X(x)} = \frac{\Gamma(x+3)}{\Gamma(2)\Gamma(x+1)}\,\theta(1-\theta)^x\, I[0 < \theta < 1].$$

This is the beta distribution with parameters 2 and x + 1, which gives us that the conditional expectation of Θ given X = x is $\frac{2}{x+3}$. This is the Bayes estimator.

Theorem 5.5.5 If X ∼ binomial(n, θ) and the parameter θ has a prior distribution given by Θ ∼ beta(α, β), the posterior distribution of Θ given X = x is a beta distribution with parameters x + α and n − x + β. The Bayes estimator of θ is given by $\frac{x+\alpha}{n+\alpha+\beta}$.
Proof:
$$f_{X|\Theta}(x|\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$$
for x ∈ {0, 1, 2, . . . , n}.
$$f_\Theta(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}\, I[0 < \theta < 1]$$
$$\begin{aligned}
f_{X,\Theta}(x, \theta) &= \binom{n}{x}\theta^x(1-\theta)^{n-x}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}\, I[0 < \theta < 1] \\
&= \binom{n}{x}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}\, I[0 < \theta < 1]
\end{aligned}$$

for x ∈ {0, 1, 2, . . . , n}.

$$\begin{aligned}
f_X(x) &= \int_0^1 \binom{n}{x}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}\,d\theta \\
&= \binom{n}{x}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1 \theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}\,d\theta \\
&= \binom{n}{x}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{\Gamma(\alpha+x)\Gamma(n-x+\beta)}{\Gamma(n+\alpha+\beta)}
\end{aligned}$$
$$f_{\Theta|X}(\theta|x) = \frac{\Gamma(n+\alpha+\beta)}{\Gamma(\alpha+x)\Gamma(n-x+\beta)}\,\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}$$
This is a beta(x + α, n − x + β) distribution and hence the conditional expectation of Θ given X = x is given by $\frac{x+\alpha}{n+\alpha+\beta}$.

Example 5.5.6 Let X ∼ binomial(20, θ) with θ having a beta (5, 3) prior distribution.
Find the Bayes estimator of θ if X = 12 is observed.
The Bayes estimator is given by $\frac{x+\alpha}{n+\alpha+\beta} = \frac{12+5}{20+5+3} = \frac{17}{28}$.
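The conjugate update in Theorem 5.5.5 is straightforward to code. A minimal sketch (the function name is my own) that reproduces Example 5.5.6:

```python
# Sketch: beta prior + binomial likelihood -> beta posterior (Theorem 5.5.5),
# checked against Example 5.5.6.
from fractions import Fraction

def beta_binomial_posterior(n, x, alpha, beta):
    """Return the posterior beta parameters and the Bayes estimator (posterior mean)."""
    post_alpha, post_beta = x + alpha, n - x + beta
    bayes_estimate = Fraction(x + alpha, n + alpha + beta)
    return post_alpha, post_beta, bayes_estimate

print(beta_binomial_posterior(n=20, x=12, alpha=5, beta=3))   # (17, 11, Fraction(17, 28))
```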

Theorem 5.5.7 If a random sample of size n is drawn from a normal population with mean µ and variance σ², where the prior for µ is given by a N(µ0, σ0²) distribution, then the posterior distribution is given by N(µ1, σ1²) where
$$\mu_1 = \frac{n\bar{x}\sigma_0^2 + \mu_0\sigma^2}{n\sigma_0^2 + \sigma^2}$$
and
$$\sigma_1^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2}.$$
The Bayes estimator for µ is given by $\frac{n\bar{x}\sigma_0^2 + \mu_0\sigma^2}{n\sigma_0^2 + \sigma^2}$.

Example 5.5.8 A distributor of soft drink vending machines feels that in a supermarket,
the number of drinks one of his machines will sell is normally distributed with an average
of 738 drinks per week and standard deviation 13.4. For a machine placed in a particular
market, the number of drinks sold varies from week to week, and is represented by a normal
distribution with standard deviation 42.5. If one of the distributor’s machines put into a
new super market averaged 692 sales in the first 10 weeks, what is the distributor’s personal
probability that for this market the value of the mean is actually between 700 and 720?
Here σ = 42.5, µ0 = 738 and σ0 = 13.4.

$$\mu_1 = \frac{n\bar{x}\sigma_0^2 + \mu_0\sigma^2}{n\sigma_0^2 + \sigma^2} = \frac{10(692)(13.4)^2 + 738(42.5)^2}{10(13.4)^2 + (42.5)^2} = 715$$
and
$$\sigma_1^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2} = \frac{(13.4)^2(42.5)^2}{10(13.4)^2 + (42.5)^2} = 90.$$
So the required probability is given by $P\left(\frac{700-715}{\sqrt{90}} \le Z \le \frac{720-715}{\sqrt{90}}\right) = P(-1.58 \le Z \le 0.53) = 0.645$.
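The posterior update of Theorem 5.5.7 and this probability are simple to reproduce numerically. A short sketch (my own; it uses Python's statistics.NormalDist for the normal CDF):

```python
# Sketch: normal-normal posterior of Theorem 5.5.7 applied to Example 5.5.8.
from math import sqrt
from statistics import NormalDist

n, xbar = 10, 692
sigma, mu0, sigma0 = 42.5, 738, 13.4

denom = n * sigma0**2 + sigma**2
mu1 = (n * xbar * sigma0**2 + mu0 * sigma**2) / denom    # posterior mean, about 715
var1 = (sigma0**2 * sigma**2) / denom                    # posterior variance, about 90

z = NormalDist()                                         # standard normal
prob = z.cdf((720 - mu1) / sqrt(var1)) - z.cdf((700 - mu1) / sqrt(var1))
print(round(mu1, 1), round(var1, 1), round(prob, 3))     # about 715.1, 90.0 and 0.64
```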

Chapter 6

Inference

6.1 Parameter Estimation


In this chapter we learn to fit the laws of probability to data. All the probability distributions
we learned in the past depend on a few parameters. In some cases the parameters are
important features of the distributions such as mean and standard deviation, as in the cases
of normal distribution and Poisson distribution. In others, mean and standard deviation
may be functions of one or more parameters, as in the case of gamma and beta distributions.
In most real-life situations, we do not know the values of the parameters, and hence they must be estimated from data.
A statistic is a function of the observations. By definition, estimators and statistics do
not depend on the parameters, though their distributions may, and in the case of estimators
usually will, depend on parameters. For instance, in the case of normal(µ, σ 2 ), X is a
statistic, but X − µ is not. Note that the distribution of X does depend on µ. A statistic
becomes an estimator when it is used to estimate something.
In what follows, when we discuss statistics and estimators, we may use the lowercase and capitalized versions interchangeably. For instance, to denote the sample mean, depending on the situation, we could use x̄ or X̄.

Example 6.1.1 Emission of alpha particles from radioactive sources is known to follow the Poisson law. In order to calculate probabilities associated with alpha emission, we need the parameter λ, and we estimate it from the emission data. When the emission rates were observed for a radioactive source of americium-241, the mean emission rate was found to be 0.8392 emissions per second. This number is our estimated value of λ.
In Example 6.1.1, we are justified in using the mean value as the estimator of λ because
λ is the theoretical mean, or expectation, of a Poisson(λ) distribution. In general, we need
to come up with a formula for the estimator whose use to estimate the parameter can be
justified. There are two general procedures that we will use. The first is the method of
moments and the second is the method of maximum likelihood.

6.2 Method of Moments


The moments method of estimation is perhaps the oldest method for estimating parameters.
The inherent simplicity of the method is what makes it attractive. The idea is to express
the parameter to be estimated in terms of the moments of the distribution and substitute
the sample moments in their place to obtain an estimator. The weakness of the method is
that in many situations, better estimators can be found.

Definition 6.2.1 The k th moment of a random variable X is defined by µk = E(X k ).
Note: $E(X^k)$ is sometimes referred to as the k-th raw moment or the k-th moment about the origin. This is in contrast to the k-th central moment, defined as $E\left[(X - \mu)^k\right]$. Some books use µk to denote the central moments while using the notation µ′k to denote the raw moments.
If X1, X2, . . . , Xn are i.i.d. random variables from the distribution of interest, the k-th sample moment is defined as $m_k = \frac{1}{n}\sum_{i=1}^n X_i^k$. Then it is reasonable to take mk as an estimator of µk. The method of moments constructs formulas to express the parameters


in terms of the moments of lowest possible order and substitutes sample moments into the
formulas. Suppose there are k parameters, θ1 , θ2 , . . . , θk , that need to be estimated. If we
can express these parameters as θi = fi (µ1 , µ2 , . . . , µk ), i = 1, 2, . . . , k, then the estimators
of these parameters are given by θ̂i = fi (m1 , m2 , . . . , mk ), i = 1, 2, . . . , k. Note that in some
cases we may need moments of order higher than k to estimate k parameters. We shall
denote the method of moments estimator of the parameter θ by θ̃ and refer to it as the
MME of θ.
Usually, we first express the moments in terms of the parameters, then invert these equa-
tions to solve for the parameters. Thus, solving the set of equations µi = gi (θ1 , θ2 , . . . , θk ),
i = 1, 2, . . . , k, we construct the equations θi = fi (µ1 , µ2 , . . . , µk ), i = 1, 2, . . . , k.

Example 6.2.2 In Example 6.1.1, µ1 = λ, so the method of moments estimates λ by λ̃ = m1 = X̄.

Example 6.2.3 Suppose X1 , X2 , . . . , Xn are i.i.d. random variables from the normal dis-
tribution N (µ, σ 2 ). Then θ1 = µ and θ2 = σ 2 . Note that µ1 = µ and µ2 = E(X 2 ) = µ2 + σ 2 .
From these, we get µ = µ1 and σ 2 = µ2 − µ21 . Substituting the sample moments in place of
the population moments, we get
$$\hat{\mu} = m_1 = \bar{X}, \qquad \hat{\sigma}^2 = m_2 - m_1^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2.$$

Example 6.2.4 Suppose X1, X2, . . . , Xn are i.i.d. binomial(k, p) where both k and p are unknown. (Ordinarily, for a binomial estimation problem, the number of trials is known and only p need be estimated. The case where both are unknown is used to estimate rates of crimes for which there are many unreported occurrences. For such a crime, the reporting rate p and the number of occurrences k are both unknown.)
E(X) = kp and V(X) = kp(1 − p), from which it follows that $E(X^2) = kp(1-p) + k^2p^2$, so we get the two equations
$$\bar{X} = \tilde{k}\tilde{p}, \qquad \frac{1}{n}\sum_{i=1}^n X_i^2 = \tilde{k}\tilde{p}(1 - \tilde{p}) + \tilde{k}^2\tilde{p}^2.$$

We solve for k̃ and p̃ to get
$$\tilde{k} = \frac{\bar{X}^2}{\bar{X} - \frac{1}{n}\sum_{i=1}^n \left(X_i - \bar{X}\right)^2} \qquad \text{and} \qquad \tilde{p} = \frac{\bar{X}}{\tilde{k}}.$$
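These formulas are easy to apply directly to data, as in the sketch below (the data values are illustrative only).

```python
# Sketch: method-of-moments estimates for binomial(k, p) with both parameters
# unknown (Example 6.2.4).
def mme_binomial(xs):
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n      # (1/n) * sum (x_i - xbar)^2
    k_tilde = xbar**2 / (xbar - s2)                # fails if s2 >= xbar (can happen in samples)
    p_tilde = xbar / k_tilde
    return k_tilde, p_tilde

print(mme_binomial([8, 10, 9, 11, 7, 10, 9, 8, 12, 10]))   # roughly k ~ 12, p ~ 0.78
```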

Example 6.2.5 Let X1, X2, . . . , Xn be i.i.d. geometric(p) where p is to be estimated. E(X) = 1/p, so we equate X̄ to 1/p to get p̃ = 1/X̄.

Example 6.2.6 Let X1, X2, . . . , Xn be i.i.d. uniform(0, θ). Find the MME of θ.
E(X) = θ/2, so we equate X̄ to θ/2 to get θ̃ = 2X̄.
The following theorem helps us find the MME of a function of the parameter.

Theorem 6.2.7 (Invariance property of MMEs)

If θ̃ is the MME of θ, then for any function τ, the MME of τ(θ) is τ(θ̃).
To illustrate the idea behind this theorem, consider the following situation. Suppose
the unknown parameter θ = (θ1 , θ2 ) is expressed as a function of the first two moments (as
in the case of gamma distribution), θ1 = f1 (µ1 , µ2 ) and θ2 = f2 (µ1 , µ2 ) so that θ = f (µ).
Then θ̃1 = f1 (m1 , m2 ) and θ̃2 = f2 (m1 , m2 ). Now, if (g1 (θ), g2 (θ)) is a function of θ, then,
as gi (θ) = gi (f (µ)) = (gi ◦ f )(µ), we have g˜i (θ) = (gi ◦ f )(m1 , m2 ). On the other hand,
gi (θ̃) = gi (f (m1 , m2 )), so g˜i (θ) = gi (θ̃)

6.3 Method of Maximum Likelihood


Definition 6.3.1 Let X1 , X2 , . . . , Xn be jointly distributed random variables whose distri-
bution depends on a parameter θ ∈ Rk . Then the likelihood function of these random
variables is their joint density f (x1 , x2 , . . . , xn |θ), treated as a function of θ.
The method of maximum likelihood finds the value of θ that maximizes the likelihood
function for a given set of x1 , x2 , . . . , xn . The maximizing value of θ is called the maximum
likelihood estimator of the parameter.
For i.i.d. random variables X1 , X2 , . . . , Xn with common density function given by
f (x|θ), because of the independence, the likelihood function takes the form
$$L(x|\theta) = \prod_{i=1}^n f(x_i|\theta)$$

where x = (x1, x2, . . . , xn). One may then use calculus techniques to maximize this function over θ. In many cases, it is easier to maximize the log likelihood function, the natural logarithm of the likelihood function, given by
$$l(x|\theta) = \log L(x|\theta) = \sum_{i=1}^n \log f(x_i|\theta).$$

This is because differentiating a sum is easier than differentiating a product. Taking logarithms does not
affect the maximizing value of θ. The equation obtained by differentiating the log likelihood
and equating it to zero is called the likelihood equation.
Sometimes it is convenient to treat the xi ’s as fixed constants and write l(x|θ) as l(θ).
Clutter of notations can be avoided this way. Also, we shall denote the maximum likelihood
estimator of the parameter θ by θ̂ and refer to it as the MLE of θ. Note that in these notes,
log x always denotes the natural logarithm of x.
In the case of discrete distributions, one may, occasionally use p in place of f to remind
us that we are dealing with a probability mass function not a density function. In the
discrete case, integration will be replaced by summation.

Example 6.3.2 Suppose X1 , X2 , . . . , Xn are i.i.d. Poisson(λ). Find the MLE of λ.


$p(x|\lambda) = \frac{e^{-\lambda}\lambda^x}{x!}$, so

$$\begin{aligned}
l(\lambda) &= \sum_{i=1}^n \log p(x_i|\lambda) \\
&= \sum_{i=1}^n \left(x_i\log\lambda - \lambda - \log x_i!\right) \\
&= n\bar{x}\log\lambda - n\lambda - \sum_{i=1}^n \log x_i!
\end{aligned}$$

Differentiating and equating to zero, we get the equation


$$l'(\lambda) = \frac{n\bar{x}}{\lambda} - n = 0,$$
whose solution gives $\hat{\lambda} = \bar{x}$. As $l''(\lambda) = -\frac{n\bar{x}}{\lambda^2} < 0$, the solution to the likelihood equation is the maximum.

Example 6.3.3 Let X1 , X2 , . . . , Xn be i.i.d. Normal(µ, σ 2 ). Find the MLEs of µ and σ 2 .


$f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$, so

$$\begin{aligned}
l(\mu, \sigma^2) &= \sum_{i=1}^n \log f(x_i|\mu, \sigma^2) \\
&= \sum_{i=1}^n \left(-\log\sigma - \frac{1}{2}\log 2\pi - \frac{1}{2\sigma^2}(x_i - \mu)^2\right) \\
&= -n\log\sigma - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2
\end{aligned}$$

We now find partial derivatives of l with respect to µ and σ (the differentiation can be
with respect to σ or σ 2 ) and equate them to zero.
$$\frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu), \qquad \frac{\partial l}{\partial\sigma} = -\frac{n}{\sigma} + \sigma^{-3}\sum_{i=1}^n (x_i - \mu)^2$$

Solving these equations, we get the MLE’s of the parameters to be

$$\hat{\mu} = \bar{x} \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - \bar{x}^2.$$
That the solution to the ML equations give the maximum can be verified by computing the
second derivative matrix and verifying that its determinant is positive and the first element
is negative.

Example 6.3.4 The cosine of the angle at which electrons are emitted in muon decay is a
random variable with density given by
$$f(x|\theta) = \frac{1+\theta x}{2}\, I[|x| \le 1]$$
where |θ| ≤ 1. Let X1 , X2 , . . . , Xn be a sample from this distribution. Find the MLE of θ.
$$l(\theta) = \sum_{i=1}^n \log f(x_i|\theta) = \sum_{i=1}^n \left(\log(1 + \theta x_i) - \log 2\right).$$

Differentiating l with respect to θ and equating it to zero, we get that the MLE θ̂ is the
solution to the equation
$$\sum_{i=1}^n \frac{x_i}{1 + \theta x_i} = 0.$$

As there is no closed form solution to this equation, we need to use a numerical method, an
iterative procedure such as Newton-Raphson method.
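As a concrete illustration, here is a small Newton-Raphson sketch for this likelihood equation (the data values and starting point are my own; the routine assumes the root lies in the interior of [−1, 1] and does not handle boundary maxima).

```python
# Sketch: Newton-Raphson iteration for the muon-decay likelihood equation of
# Example 6.3.4, sum_i x_i / (1 + theta * x_i) = 0.
def mle_muon(xs, theta0=0.0, tol=1e-10, max_iter=50):
    theta = theta0
    for _ in range(max_iter):
        score = sum(x / (1 + theta * x) for x in xs)             # l'(theta)
        hessian = -sum(x**2 / (1 + theta * x) ** 2 for x in xs)  # l''(theta), always negative
        step = score / hessian
        theta -= step
        if abs(step) < tol:
            break
    return theta

print(mle_muon([0.8, -0.6, 0.5, -0.9, 0.3, -0.1, 0.7, -0.4]))   # about 0.11 for these data
```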

Example 6.3.5 The Pareto distribution has a density given by


$$f(x|\theta) = \frac{\theta a^\theta}{x^{\theta+1}}\, I[x \ge a]$$
where θ > 1 and a is considered to be a known constant. Let X1 , X2 , . . . , Xn be a sample
from this distribution. Find the MLE of θ.
$$l(\theta) = \sum_{i=1}^n \log f(x_i|\theta) = \sum_{i=1}^n \left(\log\theta + \theta\log a - (\theta+1)\log x_i\right).$$

Differentiating l with respect to θ and equating it to zero, we get the equation


$$\sum_{i=1}^n \left(\frac{1}{\theta} + \log a - \log x_i\right) = 0.$$

Solving this equation, we get
$$\hat{\theta} = \frac{n}{\sum_{i=1}^n \log x_i - n\log a}.$$
If we denote log xi by yi, this can be simplified to
$$\hat{\theta} = \frac{1}{\bar{y} - \log a}.$$
Example 6.3.6 Let X1, X2, . . . , Xn be i.i.d. uniform(0, θ). Find the MLE for θ.
Note that here the support of the random variable depends on θ, and as a result the density function is not differentiable with respect to θ. This problem must be tackled directly.
$$L(x|\theta) = \prod_{i=1}^n f(x_i|\theta) = \frac{1}{\theta^n}\prod_{i=1}^n I[0 \le x_i \le \theta] = \frac{1}{\theta^n}\, I\!\left[0 \le x_{(n)} \le \theta\right],$$
where x(n) is the maximum among all the xi's. Observe that the first factor increases as θ decreases, and the second factor is zero if θ < x(n). Thus the likelihood is maximized when θ = x(n), and hence x(n) is the MLE of θ.
The following theorem helps us find the MLE of a function of the parameter.

Theorem 6.3.7 (Invariance property of MLEs)


If θ̂ is the MLE of θ, then for any function τ , the MLE of τ (θ) is τ (θ̂).

Example 6.3.8 For the i.i.d. normal case given in Example 6.3.3, find the MLEs of µ² and σ.
As $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$, it follows from the above theorem that the MLE of µ² is given by $\bar{x}^2$ and the MLE of σ is given by $\sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2 - \bar{x}^2}$.

Before leaving this section, we will discuss some results that help us deal with estimators such as x(n) in Example 6.3.6.

Proposition 6.3.9 Let X1, X2, . . . , Xn be i.i.d. with distribution function F and density function f. Let X(k) denote the k-th order statistic (the k-th smallest value), so that X(1) = min1≤i≤n Xi and X(n) = max1≤i≤n Xi. Then

1. $F_{X_{(1)}}(x) = 1 - [1 - F(x)]^n$ and $f_{X_{(1)}}(x) = n[1 - F(x)]^{n-1} f(x)$

2. $F_{X_{(n)}}(x) = [F(x)]^n$ and $f_{X_{(n)}}(x) = n[F(x)]^{n-1} f(x)$

3. If the common distribution of the Xi's is uniform(0, θ), then

   (a) $E[X_{(k)}] = \frac{k\theta}{n+1}$
   (b) $V[X_{(k)}] = \frac{k(n-k+1)\theta^2}{(n+1)^2(n+2)}$
   (c) $E[X_{(1)}] = \frac{\theta}{n+1}$
   (d) $E[X_{(n)}] = \frac{n\theta}{n+1}$
   (e) $V[X_{(1)}] = V[X_{(n)}] = \frac{n\theta^2}{(n+1)^2(n+2)}$.

The proof of Proposition 6.3.9 (except 3(a) and 3(b)) is left as an exercise.

6.4 Methods of evaluating Estimators


In what follows, we will consider the problem of estimating a function τ (θ) of the parameter
θ in detail. Taking τ to be the identity function, that is, τ (x) = x, we reduce it to the
estimation of θ. For convenience, we may often restrict our discussion to the case of τ (θ) = θ.
To measure how good an estimator of a parameter is, we often use bias, which is the mean
difference between the estimator and the parameter, and the mean squared error, which is
the expected value of the squared difference between the estimator and the parameter.

Definition 6.4.1 The bias of an estimator T of τ (θ), denoted by bθ (T ), is defined as Eθ (T )−


τ (θ). The estimator is said to be unbiased if the bias is always zero, that is, for all θ,

Eθ (T ) = τ (θ)

Note that there is no invariance property for unbiasedness. If θ̂ is an unbiased estimator


of θ, then for most functions τ , τ (θ̂) is not an unbiased estimator for τ (θ). If τ is a linear
function, that is, τ (x) = ax + b, then τ (θ̂) is an unbiased estimator for τ (θ).

Definition 6.4.2 The mean squared error, or MSE, of an estimator T of τ (θ) is defined as
Eθ [T − τ (θ)]2 .
The following proposition connects these concepts.

Proposition 6.4.3
M SEθ (T ) = Vθ (T ) + b2θ (T )

Proof:

M SEθ (T ) = Eθ [T − τ (θ)]2
= Eθ [T − Eθ (T ) + Eθ (T ) − τ (θ)]2
= Eθ [T − Eθ (T )]2 + (Eθ (T ) − τ (θ))2 + 2Eθ [(T − Eθ (T ))(Eθ (T ) − τ (θ))]
= Vθ (T ) + b2θ (T ) + 2(Eθ (T ) − τ (θ))Eθ [T − Eθ (T )]
= Vθ (T ) + b2θ (T )

because the last factor is zero.


Thus, the MSE of an estimator T takes into account both the variance of T, which measures how tightly the distribution of T clusters about its mean, and the bias, which measures how far the centre of the distribution of T deviates from the target τ(θ). If the MSE of an estimator is lower than that of another, then we have a reason to believe that the first estimator is better. Note that when an estimator T is unbiased for τ(θ), the MSE of T is the same as its variance.

Example 6.4.4 For the MLE’s of normal parameters in Example 6.3.3, find the bias and
the MSE.
$E_\mu(\hat{\mu}) = E_\mu(\bar{X}) = \mu$, so µ̂ is unbiased for µ and its MSE is the same as $V(\bar{X}) = \frac{\sigma^2}{n}$. It can be shown that
$$S^2 = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - n\bar{X}^2\right)$$
is such that $E(S^2) = \sigma^2$, so that S² is an unbiased estimator of σ². Thus the bias of σ̂² is given by $E(\hat{\sigma}^2 - \sigma^2) = \left(\frac{n-1}{n} - 1\right)\sigma^2 = -\frac{\sigma^2}{n}$.
To calculate the MSE of σ̂², we need a result that we will not prove: $V(S^2) = \frac{2\sigma^4}{n-1}$. The MSE of σ̂² can then be calculated by using Proposition 6.4.3.

$$\begin{aligned}
MSE(\hat{\sigma}^2) &= V\!\left(\hat{\sigma}^2\right) + b^2\!\left(\hat{\sigma}^2\right) \\
&= \left(\frac{n-1}{n}\right)^2 V\!\left(S^2\right) + \frac{\sigma^4}{n^2} \\
&= \left(\frac{n-1}{n}\right)^2\frac{2\sigma^4}{n-1} + \frac{\sigma^4}{n^2} \\
&= \left(\frac{2n-1}{n^2}\right)\sigma^4
\end{aligned}$$
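These bias and MSE formulas are easy to confirm by simulation. A rough sketch (my own; the parameter values are illustrative) compares the MLE of σ², which divides by n, with the unbiased estimator S², which divides by n − 1.

```python
# Sketch: simulated bias and MSE of sigma-hat^2 (divide by n) versus S^2
# (divide by n - 1) for i.i.d. normal data.
import random

random.seed(1)
mu, sigma, n, reps = 0.0, 2.0, 10, 50_000
mle_errs, s2_errs = [], []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    mle_errs.append(ss / n - sigma**2)
    s2_errs.append(ss / (n - 1) - sigma**2)

def mse(errs):
    return sum(e * e for e in errs) / len(errs)

print("MLE :", round(sum(mle_errs) / reps, 3), round(mse(mle_errs), 3))  # bias -0.4, MSE ~3.04
print("S^2 :", round(sum(s2_errs) / reps, 3), round(mse(s2_errs), 3))    # bias ~0,   MSE ~3.56
# Theory for n = 10, sigma^2 = 4: bias of the MLE is -sigma^2/n = -0.4, its MSE is
# (2n-1)sigma^4/n^2 = 3.04, while S^2 is unbiased with MSE 2 sigma^4/(n-1) ~ 3.56.
```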
We now introduce a concept called consistency. Consistency of an estimator is about
how close the estimator gets to the target as the sample becomes large.

Definition 6.4.5 A sequence of estimators θ̂n of θ based on n observations is said to be


consistent if
θ̂n → θ0
as n → ∞ where θ0 is the true value of θ.
Note: Depending on whether the mode of convergence is convergence in probability or
convergence with probability 1, θ̂n is said to be weakly consistent or strongly consistent.
These concepts will be discussed in detail in the Advanced Probability module.
The result that turns out to be most useful in proving consistency of estimators is the law of large numbers, which states that if X1, X2, . . . , Xn is an i.i.d. sequence of random variables with mean µ, then the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ converges to µ. The weak law of large numbers (WLLN) states that convergence in probability holds, while the strong law of large numbers (SLLN) states that convergence with probability 1 holds.
Our next theorem is about consistency of maximum likelihood estimators. The proof
of the theorem requires that the family of density functions {fθ : θ ∈ Θ} of the random
variable satisfies a set of regularity conditions. Let Θ be the parameter space, assumed to be
an open subset of the real line (or Rn ). Let X = (X1 , X2 , . . . , Xn ) and x = (x1 , x2 , . . . , xn ).

1. The support of fθ , the set A = {x : fθ (x) > 0}, does not depend on θ.
2. For all x ∈ A and for all θ ∈ Θ, log fθ (x) is differentiable with respect to θ.
3. For any function h on Rn such that Eθ [|h(X)|] < ∞, the operations of integration
and differentiation with respect to θ can be interchanged in the integral expression for
Eθ [h(X)]. That is,
$$\frac{\partial}{\partial\theta}\int_{\mathbb{R}^n} h(x)\,L(x|\theta)\,dx = \int_{\mathbb{R}^n} h(x)\,\frac{\partial}{\partial\theta}L(x|\theta)\,dx \qquad (6.1)$$
whenever the RHS of (6.1) is finite.
A sufficient condition for (6.1) to hold is that both $\int_{\mathbb{R}^n} h(x)\frac{\partial}{\partial\theta}L(x|\theta)\,dx$ and $\int_{\mathbb{R}^n} \left|h(x)\frac{\partial}{\partial\theta}L(x|\theta)\right|\,dx$ are continuous functions of θ. We have a fairly general sufficient condition for all of the above conditions to hold.

Proposition 6.4.6 If L(x|θ) can be expressed as e{c(θ)T (x)+d(θ)+S(x)} IA (x) where A is a


subset of Rn and c(θ) has a continuous, non-zero derivative on Θ, then all the regularity
conditions hold.
A family of distributions that have the above form is said to be an exponential family
of distributions. Note that when the parameter is a vector, the product c(θ)T (x) must be
replaced by matrix multiplication (c(θ))T T (x). Also note that when verifying whether a
family of distributions form an exponential family, it may be easier to compare the logarithm
of the likelihood with c(θ)T (x) + d(θ) + S(x). This must be done after checking whether
the support is free of θ.

Example 6.4.7 We consider each of the familiar distributions and check if it is an expo-
nential family of distributions. Examples of exponential families are normal (with one or
two unknown parameters), multivariate normal, lognormal, exponential, gamma, weibull,
Pareto (with known lower bound), chi-squared, beta, Bernoulli, binomial (with n known),
negative binomial (with fixed number of failures), Poisson, geometric and Wishart. Stu-
dent’s t, Pareto with unknown lower bound and the uniform distribution with unknown
bounds are examples of families that are not exponential.
1. Exponential: $f(x|\lambda) = \lambda e^{-\lambda x} I[x > 0]$, so
$$L(x|\lambda) = \lambda^n e^{-\lambda\sum x_i}\, I[x_i \ge 0,\ i = 1, \dots, n].$$
Take $A = [0, \infty)^n$, $d(\lambda) = n\log\lambda$, $S(x) = 0$, $c(\lambda) = -\lambda$ and $T(x) = \sum x_i$.
2. Beta: $f(x|\alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\, I[0 \le x \le 1]$, so
$$l(x|\alpha, \beta) = n\log\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} + (\alpha-1)\sum\log x_i + (\beta-1)\sum\log(1-x_i).$$
Take $A = [0, 1]^n$, $d(\alpha, \beta) = n\log\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$, $S(x) = 0$, $c(\alpha, \beta) = (\alpha-1, \beta-1)$ and $T(x) = \left(\sum\log x_i, \sum\log(1-x_i)\right)$.

3. Normal(µ, 1): $f(x|\mu) = \frac{1}{\sqrt{2\pi}}\exp\left[-0.5(x-\mu)^2\right]$, so
$$l(x|\mu) = -\frac{n}{2}\log 2\pi - 0.5\sum(x_i-\mu)^2 = -\frac{n}{2}\log 2\pi - 0.5\sum x_i^2 - 0.5n\mu^2 + \mu\sum x_i.$$
Take $A = \mathbb{R}^n$, $d(\mu) = -\frac{n}{2}\log 2\pi - 0.5n\mu^2$, $S(x) = -0.5\sum x_i^2$, $c(\mu) = \mu$ and $T(x) = \sum x_i$.

4. Binomial(k, p) where k is known: $f(x|p) = \binom{k}{x} p^x (1-p)^{k-x}\, I[x \in \{0, 1, \dots, k\}]$, so
$$\begin{aligned}
l(x|p) &= \sum\log\binom{k}{x_i} + \sum x_i\log p + \left(kn - \sum x_i\right)\log(1-p) \\
&= \sum\log\binom{k}{x_i} + kn\log(1-p) + \log\!\left(\frac{p}{1-p}\right)\sum x_i.
\end{aligned}$$
Take $A = \{0, 1, \dots, k\}^n$, $d(p) = kn\log(1-p)$, $S(x) = \sum\log\binom{k}{x_i}$, $c(p) = \log\frac{p}{1-p}$ and $T(x) = \sum x_i$.

Theorem 6.4.8 Under the regularity conditions on the density function f , the MLE from
an i.i.d. sample is strongly consistent. That is,
h i
Pθ0 θ̂n → θ0 = 1

Proof: The proof given here is just a sketch, as a rigorous treatment is rather involved. We want to maximize l(θ), which is the same as maximizing $\frac{1}{n}l(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(x_i|\theta)$. By taking $Y = \log f(X|\theta)$ and applying the SLLN to Y, we get that $\frac{1}{n}l(\theta) \to E_{\theta_0}[\log f(X|\theta)]$.
As $\frac{1}{n}l(\theta)$ is close to $E_{\theta_0}[\log f(X|\theta)]$, we conclude that the $\hat{\theta}_n$ that maximizes $\frac{1}{n}l(\theta)$ is close to the value of θ that maximizes $E_{\theta_0}[\log f(X|\theta)]$. To find this value, we differentiate $E_{\theta_0}[\log f(X|\theta)]$ with respect to θ,

$$\frac{\partial}{\partial\theta}E_{\theta_0}[\log f(X|\theta)] = \int \frac{\partial}{\partial\theta}\log f(x|\theta)\; f(x|\theta_0)\,dx = \int \frac{\partial f(x|\theta)}{\partial\theta}\,\frac{f(x|\theta_0)}{f(x|\theta)}\,dx$$

and equate it to zero. We shall show that θ = θ0 is a solution to this equation. At θ = θ0 ,


RHS of the above equation becomes

$$\int \frac{\partial f(x|\theta)}{\partial\theta}\bigg|_{\theta=\theta_0}\frac{f(x|\theta_0)}{f(x|\theta_0)}\,dx = \int \frac{\partial f(x|\theta)}{\partial\theta}\bigg|_{\theta=\theta_0}dx = \frac{d}{d\theta}\int f(x|\theta)\,dx\,\bigg|_{\theta=\theta_0} = \frac{d}{d\theta}(1)\bigg|_{\theta=\theta_0} = 0$$

because densities always integrate to 1. The interchanging of differentiation and integration


requires smoothness assumption.
As a byproduct of the above proof we get the following result.

Proposition 6.4.9 If the density f of the random variable X satisfies the regularity conditions, then
$$E_\theta\left[\frac{\partial}{\partial\theta}\log f(X|\theta)\right] = 0.$$
Proof:
$$E_\theta\left[\frac{\partial}{\partial\theta}\log f(X|\theta)\right] = \int \frac{\partial}{\partial\theta}\log f(x|\theta)\; f(x|\theta)\,dx = \int \frac{\partial f(x|\theta)}{\partial\theta}\,dx = \frac{d}{d\theta}\int f(x|\theta)\,dx = \frac{d}{d\theta}(1) = 0$$

We now introduce a concept called the Fisher information. The Fisher information, or
simply information, measures the amount of information that an observable random variable
X carries about an unknown parameter θ. It is a function of θ, and defined as the variance
of the derivative of the logarithm of the density function.

Definition 6.4.10 If a random variable X has density function f that depends on a parameter θ, the information is given by
$$I(\theta) = E_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2\right].$$
I(θ) is often referred to as the information number or the information function. From Proposition 6.4.9 it follows that $I(\theta) = V_\theta\!\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)$.

Theorem 6.4.11 Under appropriate regularity conditions on the density function f,
$$I(\theta) = -E_\theta\left[\frac{\partial^2}{\partial\theta^2}\log f(X|\theta)\right].$$
Proof: As before,
$$0 = \frac{\partial}{\partial\theta}\int f(x|\theta)\,dx = \int \frac{\partial}{\partial\theta} f(x|\theta)\,dx = \int \left(\frac{\partial}{\partial\theta}\log f(x|\theta)\right) f(x|\theta)\,dx.$$
Differentiating both sides once more with respect to θ, we get
$$0 = \int \frac{\partial}{\partial\theta}\left[\left(\frac{\partial}{\partial\theta}\log f(x|\theta)\right) f(x|\theta)\right] dx = \int \left(\frac{\partial^2}{\partial\theta^2}\log f(x|\theta)\right) f(x|\theta)\,dx + \int \left(\frac{\partial}{\partial\theta}\log f(x|\theta)\right)^2 f(x|\theta)\,dx,$$
which shows that
$$\int \left(\frac{\partial}{\partial\theta}\log f(x|\theta)\right)^2 f(x|\theta)\,dx = -\int \left(\frac{\partial^2}{\partial\theta^2}\log f(x|\theta)\right) f(x|\theta)\,dx.$$

One of the reasons why the information function is important is the result that states that for large values of n, the distribution of the MLE, when θ0 is the true value of the parameter θ, is approximately normal with mean θ0 and variance $\frac{1}{nI(\theta_0)}$. This large sample distribution is referred to as the asymptotic distribution of the estimator.

Theorem 6.4.12 Under smoothness conditions on the density function f,
$$\sqrt{n}\left(\hat{\theta} - \theta_0\right) \xrightarrow{d} N\!\left(0, \frac{1}{I(\theta_0)}\right),$$
where $\xrightarrow{d}$ means that the convergence is in distribution.
The above theorem says that the asymptotic distribution of the MLE is normal and its asymptotic variance is $\frac{1}{I(\theta_0)}$. We will not prove this theorem.

Example 6.4.13 Find the asymptotic distribution for the MLE of λ based on an i.i.d.
sample from a Poisson(λ) distribution.
We have already seen that the MLE is X̄. Now we shall calculate the information function for λ. As $\log f(x|\lambda) = x\log\lambda - \lambda - \log x!$,
$$I(\lambda) = -E_\lambda\left[\frac{\partial^2}{\partial\lambda^2}\log f(X|\lambda)\right] = E_\lambda\left[\frac{X}{\lambda^2}\right] = \frac{1}{\lambda}.$$
Thus the asymptotic distribution of $\sqrt{n}\,(\bar{X} - \lambda)$ is N(0, λ). But this we already knew from the central limit theorem.

Example 6.4.14 Find the asymptotic distribution for the MLE of θ based on an i.i.d. sample from a Pareto(θ) distribution.
$$f(x|\theta) = \frac{\theta a^\theta}{x^{\theta+1}} \;\Rightarrow\; \log f(x|\theta) = \log\theta + \theta\log a - (\theta+1)\log x \;\Rightarrow\; \frac{d}{d\theta}\log f(x|\theta) = \frac{1}{\theta} + \log a - \log x$$
$$\Rightarrow\; \frac{d^2}{d\theta^2}\log f(x|\theta) = -\frac{1}{\theta^2} \;\Rightarrow\; I(\theta) = \frac{1}{\theta^2}.$$
Thus the asymptotic distribution of $\sqrt{n}\left(\hat{\theta} - \theta_0\right)$ is $N(0, \theta_0^2)$.
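This asymptotic variance can be checked by simulation. The sketch below (my own; a, θ0, n and the number of replications are illustrative) draws Pareto samples by inverse-CDF sampling, computes the MLE from Example 6.3.5 for each sample, and compares the empirical variance with θ0²/n.

```python
# Sketch: empirical variance of the Pareto MLE versus the asymptotic value theta0^2 / n.
import math
import random

random.seed(2)
a, theta0, n, reps = 1.0, 3.0, 200, 5000
mles = []
for _ in range(reps):
    # Inverse-CDF sampling: if U ~ Uniform(0, 1), then a * U**(-1/theta0) is Pareto(theta0, a).
    xs = [a * random.random() ** (-1.0 / theta0) for _ in range(n)]
    mles.append(n / (sum(math.log(x) for x in xs) - n * math.log(a)))

mean = sum(mles) / reps
var = sum((m - mean) ** 2 for m in mles) / reps
print(round(mean, 3), round(var, 5), theta0**2 / n)   # variance should be close to 0.045
```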

6.5 Cramér-Rao Inequality


We have said that MSE can, in some cases, be used to measure the performance of estimators.
In many situations, computation of MSE can be difficult. Quite apart from this problem,
it is difficult to decide which of two estimators is better based on their MSE because the
MSE is a function of the parameter. If we have two estimators S and T for θ, it is usually the case that MSEθ(S) < MSEθ(T) for some values of θ and MSEθ(T) < MSEθ(S) for other values of θ. In this situation, clearly we cannot decide which one is better using the MSE criterion.
Suppose instead that these estimators are such that MSEθ(T) ≤ MSEθ(S) for all θ, with MSEθ0(T) < MSEθ0(S) for some value θ0 of θ. Then it is clearly reasonable to consider T to be a better estimator than S. Such an estimator S is said to be inadmissible.
It is easy to see that there is no such thing as the best estimator over all possible
estimators. S ≡ θ0 will always be better than any other at θ0 . For instance, the ‘estimator’
S ≡ 3, cannot be beaten by anything else at 3. While it is a useless estimator everywhere
else, at 3 it measures the parameter most accurately. In addition, this estimator has zero
variance.
The moral of this is that there is no point in trying to minimize MSE or variance over
all possible estimators. The set of all estimators is too large. Instead, we restrict ourselves
to a smaller class of estimators – the class of unbiased estimators.
There are some very good reasons for restricting ourselves to unbiased estimators. First,
they do not consistently overestimate or underestimate. Second, it rules out ridiculous
estimates such as the constant estimators discussed earlier. Third, for unbiased estimators,
the MSE is exactly the same as the variance. Finally, it is the case that among unbiased
estimators, we can often find an estimator that is better than all other unbiased estimators.
When such a best estimator exists, it is called a uniformly minimum variance unbiased
estimator, or UMVUE.

Example 6.5.1 Let X1, X2, . . . , Xn be i.i.d. with mean θ and variance σ². Consider the following unbiased estimators of θ.

1. $T_1 = X_1$ with variance $\sigma^2$

2. $T_2 = \frac{X_1 + X_2}{2}$ with variance $\frac{\sigma^2}{2}$

3. $T_3 = \frac{2X_1 + 3X_6}{5}$ with variance $\frac{13\sigma^2}{25}$

4. $T_4 = \frac{a_1X_1 + a_2X_2 + \cdots + a_nX_n}{\sum_{i=1}^n a_i}$ with variance $\frac{\left(\sum_{i=1}^n a_i^2\right)\sigma^2}{\left(\sum_{i=1}^n a_i\right)^2}$

5. $T_5 = \frac{X_1 + X_2 + \cdots + X_n}{n}$ with variance $\frac{\sigma^2}{n}$

When we compare the variances of these estimators, we see that estimator T5 is the best among these. When all the ai's are the same, T4 = T5.

Definition 6.5.2 An unbiased estimator T of τ (θ) is called a uniformly minimum variance


unbiased estimator if for all unbiased estimators S of τ (θ) and for all θ,

Vθ (T ) ≤ Vθ (S).

A natural question to ask at this stage is whether there can be more than one UMVUE.
The answer is no, and is stated in the following theorem.

Theorem 6.5.3 UMVUE of τ (θ), if exists, is unique. That is, if S and T are UMVUEs of
τ (θ), then S = T .
Proof: As S and T are both UMVUEs, E(S) = E(T) = τ(θ) and V(S) = V(T). Let W = (S + T)/2. Clearly W is also an unbiased estimator of τ(θ). Note that

$$V(W) = \frac{1}{4}\left[V(S) + V(T) + 2\,\mathrm{Cov}(S, T)\right] \le \frac{1}{4}\left[V(S) + V(T) + 2\sqrt{V(S)V(T)}\right] = V(S).$$

The inequality comes from the fact that Corr(S, T ) ≤ 1. But S is a UMVUE, so the V (W )
cannot be strictly less than V (S), and hence the inequality for the covariance must be an
equality. This can happen only when one of the two random variables is a linear function
of the other. It follows that T = a(θ)S + b(θ) where a(θ) > 0. As V (S) = V (T ), a(θ) = 1,
and since E(S) = E(T ), it follows that b(θ) = 0. Thus S = T .

There are some problems with the notion of UMVUE.
1. Unbiased estimators may not exist.
2. Even if there are unbiased estimators, UMVUE may not exist.
3. Even when UMVUE exists, it may be inadmissible.
4. Unbiasedness is not invariant under non-linear transformations.
Apart from restricting the class of estimators considered to unbiased estimators, there are
two other ways out of the dilemma that there is no best procedure. One is to average the
MSE over the values of the parameter using some distribution on the parameter space (this
is called a prior distribution) and finding the estimator that minimizes this average. This
is called the Bayes procedure. The other is to look at the maximum value of the MSE over
all values of the parameter and minimizing this maximum. This is known as the minimax
method. Both of these are discussed in detail in the Decision Theory chapter.
We have a theorem that helps us decide whether a given estimator is the UMVUE. This
theorem, due to Harald Cramér and C.R.Rao and known as the Cramér-Rao Inequality,
gives us a lower bound on the variance of an unbiased estimator.

Theorem 6.5.4 (Cramér-Rao Inequality)


Let X1 , X2 , . . . , Xn be i.i.d. with density function f (x|θ) that satisfies the regularity con-
ditions. Let T be any unbiased estimator of τ (θ). Then
$$V_\theta(T) \ge \frac{\left[\frac{d\tau(\theta)}{d\theta}\right]^2}{nI(\theta)}.$$
In particular, if T is an unbiased estimator of θ, then
$$V_\theta(T) \ge \frac{1}{nI(\theta)}.$$
Proof: Let $g(x) = \frac{\partial f(x|\theta)/\partial\theta}{f(x|\theta)}$, let $W_i = \frac{\partial}{\partial\theta}\log f(X_i|\theta) = g(X_i)$ and let $W = \sum_{i=1}^n W_i$. From Proposition 6.4.9, it follows that $E(W_i) = 0$ for each i, and hence $E(W) = 0$. We shall show that $E(TW) = \frac{d\tau(\theta)}{d\theta}$, which, in turn, implies that $\mathrm{Cov}(T, W) = \frac{d\tau(\theta)}{d\theta}$ because $E(W) = 0$.
Note that
$$\frac{\partial}{\partial\theta}\prod_{i=1}^n f(x_i|\theta) = \sum_{i=1}^n \frac{\partial f(x_i|\theta)}{\partial\theta}\prod_{j\ne i} f(x_j|\theta) = \sum_{i=1}^n \frac{\partial f(x_i|\theta)/\partial\theta}{f(x_i|\theta)}\prod_{j=1}^n f(x_j|\theta) = \sum_{i=1}^n g(x_i)\prod_{j=1}^n f(x_j|\theta),$$

so
$$\begin{aligned}
E(TW) &= \int_{\mathbb{R}^n} T(x)\left[\sum_{i=1}^n g(x_i)\right]\prod_{j=1}^n f(x_j|\theta)\,dx \\
&= \int_{\mathbb{R}^n} T(x)\,\frac{\partial}{\partial\theta}\prod_{i=1}^n f(x_i|\theta)\,dx \\
&= \frac{d}{d\theta}\int_{\mathbb{R}^n} T(x)\prod_{i=1}^n f(x_i|\theta)\,dx \\
&= \frac{d}{d\theta}E_\theta(T) \\
&= \frac{d\tau(\theta)}{d\theta}.
\end{aligned}$$

As $V\!\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right) = I(\theta)$, it follows that $V(W) = nI(\theta)$. Hence,
$$-1 \le \mathrm{Corr}(T, W) \le 1 \;\Rightarrow\; \mathrm{Cov}^2(T, W) \le V(W)\,V(T) \;\Rightarrow\; \left[\frac{d\tau(\theta)}{d\theta}\right]^2 \le V(W)\,V(T) \;\Rightarrow\; V(T) \ge \frac{\left[\frac{d\tau(\theta)}{d\theta}\right]^2}{V(W)} = \frac{\left[\frac{d\tau(\theta)}{d\theta}\right]^2}{nI(\theta)}.$$

Definition 6.5.5 Let T be an unbiased estimator of τ(θ).

1. The efficiency of T is defined to be the ratio of the Cramér-Rao lower bound to V(T). That is,
$$e(T) = \frac{\left[\frac{d\tau(\theta)}{d\theta}\right]^2 \big/ \left(nI(\theta)\right)}{V(T)}.$$

2. T is said to be efficient if e(T) = 1.

The efficiency of T is always less than or equal to 1. An unbiased estimator T is efficient


if its variance attains the Cramér-Rao lower bound. One of the uses of Theorem 6.5.4 is
that in some cases it can tell us whether standard estimators are UMVUE’s. If T is efficient,
then it is automatically the UMVUE, though the converse is not true.

Example 6.5.6 The MLE of µ for a normal(µ, σ²) distribution is given by $\hat{\mu} = \bar{X}$ and its variance is given by $V(\bar{X}) = \frac{\sigma^2}{n}$. From Example 6.3.3, we know that
$$\frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu)$$
and hence
$$\frac{\partial^2 l}{\partial\mu^2} = -\frac{n}{\sigma^2},$$
so that the information per observation is $I(\mu) = \frac{1}{\sigma^2}$ and $\frac{1}{nI(\mu)} = \frac{\sigma^2}{n}$, showing that µ̂ is efficient and hence is the UMVUE of µ.

Example 6.5.7 The MLE and the MME of θ for a uniform(0, θ) distribution are given by $\hat{\theta} = X_{(n)}$ and $\tilde{\theta} = 2\bar{X}$. From Proposition 6.3.9, the expectation of X(n) is $\frac{n\theta}{n+1}$, so $T_1 = \frac{(n+1)X_{(n)}}{n}$ is an unbiased estimator of θ. $T_2 = 2\bar{X}$ is already an unbiased estimator. We shall compare their variances.
$V(T_2) = 4V(\bar{X}) = \frac{\theta^2}{3n}$ (this follows from the fact that the variance of a uniform(a, b) random variable is $\frac{(b-a)^2}{12}$), whereas $V(T_1) = \frac{\theta^2}{n(n+2)}$ from Proposition 6.3.9. Here V(T1) is uniformly smaller than V(T2).
If we were to try to apply the Cramér-Rao inequality blindly, we would find that $I(\theta) = \frac{1}{\theta^2}$ and conclude that $V(T) \ge \frac{\theta^2}{n}$ for all unbiased estimators T. What we found about V(T1) and V(T2) both contradict this. This shows that the Cramér-Rao inequality does not apply here. This is because, as a consequence of the support of the distribution depending on θ, the interchangeability of integration and differentiation with respect to θ does not hold.
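The variance comparison of T1 and T2 is easy to confirm by simulation, as in the sketch below (my own; θ, n and the number of replications are illustrative).

```python
# Sketch: compare the two unbiased estimators of theta for Uniform(0, theta) data
# (Example 6.5.7) by simulation.
import random

random.seed(3)
theta, n, reps = 5.0, 20, 20_000
t1, t2 = [], []
for _ in range(reps):
    xs = [random.uniform(0, theta) for _ in range(n)]
    t1.append((n + 1) * max(xs) / n)     # unbiased estimator based on the MLE x_(n)
    t2.append(2 * sum(xs) / n)           # unbiased estimator based on the MME 2 * xbar

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

print(round(var(t1), 4), theta**2 / (n * (n + 2)))   # both about 0.0568
print(round(var(t2), 4), theta**2 / (3 * n))         # both about 0.4167
```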

6.6 Sufficiency
A sufficient statistic for a parameter θ is a statistic (a function of the data) that captures all
the information about θ contained in the sample. Once we have the value of the sufficient
statistic, any additional information in the sample does not contain any information about
θ. This means that if T (X) is a sufficient statistic, then any inference about θ should depend
on the sample X only through T (X).

Definition 6.6.1 A statistic T (X) is a sufficient statistic for a parameter θ if the condi-
tional distribution of the sample X given the value of T (X) does not depend on θ

Example 6.6.2 Let X1, X2, . . . , Xn be i.i.d. Bernoulli(p). We shall show that $T(X) = \sum_{i=1}^n X_i$ is a sufficient statistic for p.
$$\begin{aligned}
f_{X|T}(x|t) &= \frac{f_{T|X}(t|x)\, f_X(x)}{f_T(t)} \\
&= \frac{f_X(x)\, I\!\left[\sum x_i = t\right]}{\binom{n}{t}p^t(1-p)^{n-t}} \\
&= \frac{p^{\sum x_i}(1-p)^{n-\sum x_i}\, I\!\left[\sum x_i = t\right]}{\binom{n}{t}p^t(1-p)^{n-t}} \\
&= \frac{I\!\left[\sum x_i = t\right]}{\binom{n}{t}},
\end{aligned}$$
which is free of the parameter p.


The following two theorems allow us to find sufficient statistics by simple inspection of
the likelihood function.

Theorem 6.6.3 (Factorization Theorem)


A statistic T (x) is sufficient for a parameter θ if and only if there exists g(., .) and h(.) such
that the likelihood function L(x|θ) can be factorized as

L(x|θ) = g(T (x), θ)h(x).

Example 6.6.4 We want to find a sufficient statistic for the parameter θ of an i.i.d. sample from geometric(θ). Here $f(x|\theta) = \theta(1-\theta)^x\, I[x = 0, 1, 2, \dots]$, so $L(x|\theta) = \theta^n(1-\theta)^{\sum x_i}$. It is now easily seen that $\sum x_i$ is a sufficient statistic.

Example 6.6.5 Let X1, X2, . . . , Xn be i.i.d. uniform(0, θ). $L(x|\theta) = \frac{1}{\theta^n}\, I\!\left[0 \le x_{(n)} \le \theta\right]$, so x(n) is a sufficient statistic for θ.
x(n) is a sufficient statistic for θ.

Definition 6.6.6 Let X be an i.i.d. sample from f (x|θ). A statistic T (x) is said to be
complete if Eθ [g(T )] = 0 for all θ implies that Pθ (g(T ) = 0) = 1 for all θ.

Theorem 6.6.7 If L(x|θ) can be expressed as $e^{c(\theta)T(x)+d(\theta)+S(x)}\, I_A(x)$, that is, if we have an exponential family of distributions, then T(x) is a complete sufficient statistic for θ. Consequently, if the density function f can be expressed as $f(x) = e^{c(\theta)t(x)+d_1(\theta)+S_1(x)}\, I_B(x)$ where B ⊂ R, then $T(x) = \sum_{i=1}^n t(x_i)$ is a complete sufficient statistic for θ.
The sufficiency part of Theorem 6.6.7 follows from Theorem 6.6.3 by taking $g(T(x), \theta) = e^{c(\theta)T(x)+d(\theta)}$ and $h(x) = e^{S(x)}\, I_A(x)$. Using these two theorems, we can find a sufficient statistic easily in most cases.
Example 6.6.8 Estimation of θ in the Poisson(θ) distribution: $p(x|\theta) = \frac{e^{-\theta}\theta^x}{x!}$ forms an exponential family with t(x) = x, and hence $T(x) = \sum_{i=1}^n x_i$ is a complete sufficient statistic for θ.

Example 6.6.9 Estimation of θ in the Pareto(θ) distribution: the density is given by $f(x|\theta) = \frac{\theta a^\theta}{x^{\theta+1}}\, I[x \ge a]$ where θ > 1 and a is a known constant. This forms an exponential family with t(x) = log x, and hence $T(x) = \sum_{i=1}^n \log x_i$ is a complete sufficient statistic for θ.
The concept of sufficiency is a useful tool in finding better unbiased estimators. The
following theorem, due to C.R. Rao and David Blackwell, tells us that we can get a uniformly
better unbiased estimator by conditioning an unbiased estimator with respect to a sufficient
statistic.

Theorem 6.6.10 (Rao–Blackwell Theorem)


Let W be any unbiased estimator of τ (θ) and let T be any sufficient statistic for θ. Let
φ(T ) = E(W |T ). Then φ(T ) is a uniformly better (or as good) unbiased estimator of τ (θ),
that is,
1. Eθ (φ(T )) = τ (θ)

2. Vθ (φ(T )) ≤ Vθ (W ) for all θ.


Proof: First, we need to satisfy ourselves that φ(T ) is indeed an estimator, that is, it
does not depend on the unknown parameter. As T is a sufficient statistic, the conditional
distribution of the sample given T is independent of θ, so the conditional expectation of W
given T is free of θ.
Now, a well-known result in probability theory states that if X and Y are two random
variables, then
E [E (X|Y )] = E(X) (6.2)
and
V [E (X|Y )] + E [V (X|Y )] = V (X). (6.3)
From (6.2), we have Eθ (φ(T )) = Eθ [E (W |T )] = Eθ (W ) = τ (θ), so φ(T ) is an unbiased
estimator of τ (θ). From (6.3), it follows that

Vθ (W ) = Vθ [E (W |T )] + Eθ [V (W |T )] ≥ Vθ [E (W |T )] = Vθ (φ(T ))

The process of transforming an estimator using the Rao–Blackwell theorem is sometimes called Rao–Blackwellization. The transformed estimator is called the Rao–Blackwell estimator.

Example 6.6.11 Suppose X1, X2, . . . , Xn are i.i.d. Poisson(λ). We will perform Rao–Blackwellization on an unbiased estimator of λ to improve it. Since E(X1) = λ, X1 is an unbiased estimator of λ. Also, we know that $T(x) = \sum_{i=1}^n x_i$ is a sufficient statistic for λ.
We will now find E(X1|T). Note that since the sum of n i.i.d. Poisson(λ) random variables is Poisson(nλ),
$$f_T(t) = \frac{e^{-n\lambda}(n\lambda)^t}{t!}.$$

$$\begin{aligned}
f_{X_1|T}(x|t) &= \frac{f_{X_1,T}(x, t)}{f_T(t)} = \frac{f_{T|X_1}(t|x)\, f_{X_1}(x)}{f_T(t)} \\
&= \frac{P\!\left(\sum_{i=2}^{n} X_i = t - x \,\middle|\, X_1 = x\right)\,\frac{e^{-\lambda}\lambda^x}{x!}}{\frac{e^{-n\lambda}(n\lambda)^t}{t!}} \\
&= \frac{\frac{e^{-(n-1)\lambda}((n-1)\lambda)^{t-x}}{(t-x)!}\;\frac{e^{-\lambda}\lambda^x}{x!}}{\frac{e^{-n\lambda}(n\lambda)^t}{t!}} \\
&= \frac{t!}{x!\,(t-x)!}\;\frac{(n-1)^{t-x}}{n^t} \\
&= \binom{t}{x}\left(\frac{1}{n}\right)^x\left(1 - \frac{1}{n}\right)^{t-x}
\end{aligned}$$

So the conditional distribution of X1 given T = t is binomial with parameters t and 1/n, and hence $E(X_1|T) = \frac{T}{n} = \bar{X}$. Thus Rao–Blackwellization of X1 results in the UMVUE.
Note that (6.2) and (6.3) make no mention of sufficiency, so it might seem that condition-
ing on anything will result in an improvement. The expectation will be unchanged and the
variance will reduce, but the problem is that without sufficiency, the resulting conditional
expectation may depend on the parameter.
The following theorem connects the concepts of UMVUE, completeness and sufficiency.

Theorem 6.6.12 (Lehmann-Scheffé Theorem)


Let T be a complete sufficient statistic for a parameter θ and φ(T ) be an estimator based
on T . Then φ(T ) is the UMVUE of its expected value.
Example 6.6.13 We have seen that in the Poisson(θ) example, $\sum_{i=1}^n X_i$ is a complete sufficient statistic for θ. Thus we know that $\frac{1}{n}\sum_{i=1}^n X_i$ is the UMVUE for its expectation, which is θ. Now consider the statistic $T(X) = \bar{X}^2 - \frac{\bar{X}}{n}$. $E(T) = E(\bar{X}^2) - \frac{E(\bar{X})}{n} = V(\bar{X}) + E^2(\bar{X}) - \frac{\theta}{n} = \frac{\theta}{n} + \theta^2 - \frac{\theta}{n} = \theta^2$, so T is the UMVUE of θ².
E 2 (X) − nθ = nθ + θ2 − nθ = θ2 , so T is the UMVUE of θ2 .

Statistical Tables

Table 6.1: Standard Normal Distribution — entries are Φ(z) = P(Z ≤ z); the row label gives z to one decimal place and the column label gives the second decimal place

z     0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
3.6 0.9998 0.9998 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999

Table 6.2: t distribution — entries are upper-tail critical values t with P(T > t) = α for the given degrees of freedom

          α
d.f.   0.200   0.100   0.050   0.025   0.010   0.005   0.001
1 1.376 3.078 6.314 12.706 31.821 63.656 318.289
2 1.061 1.886 2.920 4.303 6.965 9.925 22.328
3 0.978 1.638 2.353 3.182 4.541 5.841 10.214
4 0.941 1.533 2.132 2.776 3.747 4.604 7.173
5 0.920 1.476 2.015 2.571 3.365 4.032 5.894
6 0.906 1.440 1.943 2.447 3.143 3.707 5.208
7 0.896 1.415 1.895 2.365 2.998 3.499 4.785
8 0.889 1.397 1.860 2.306 2.896 3.355 4.501
9 0.883 1.383 1.833 2.262 2.821 3.250 4.297
10 0.879 1.372 1.812 2.228 2.764 3.169 4.144
11 0.876 1.363 1.796 2.201 2.718 3.106 4.025
12 0.873 1.356 1.782 2.179 2.681 3.055 3.930
13 0.870 1.350 1.771 2.160 2.650 3.012 3.852
14 0.868 1.345 1.761 2.145 2.624 2.977 3.787
15 0.866 1.341 1.753 2.131 2.602 2.947 3.733
16 0.865 1.337 1.746 2.120 2.583 2.921 3.686
17 0.863 1.333 1.740 2.110 2.567 2.898 3.646
18 0.862 1.330 1.734 2.101 2.552 2.878 3.610
19 0.861 1.328 1.729 2.093 2.539 2.861 3.579
20 0.860 1.325 1.725 2.086 2.528 2.845 3.552
21 0.859 1.323 1.721 2.080 2.518 2.831 3.527
22 0.858 1.321 1.717 2.074 2.508 2.819 3.505
23 0.858 1.319 1.714 2.069 2.500 2.807 3.485
24 0.857 1.318 1.711 2.064 2.492 2.797 3.467
25 0.856 1.316 1.708 2.060 2.485 2.787 3.450
26 0.856 1.315 1.706 2.056 2.479 2.779 3.435
27 0.855 1.314 1.703 2.052 2.473 2.771 3.421
28 0.855 1.313 1.701 2.048 2.467 2.763 3.408
29 0.854 1.311 1.699 2.045 2.462 2.756 3.396
30 0.854 1.310 1.697 2.042 2.457 2.750 3.385
31 0.853 1.309 1.696 2.040 2.453 2.744 3.375
32 0.853 1.309 1.694 2.037 2.449 2.738 3.365
33 0.853 1.308 1.692 2.035 2.445 2.733 3.356
34 0.852 1.307 1.691 2.032 2.441 2.728 3.348
35 0.852 1.306 1.690 2.030 2.438 2.724 3.340
40 0.851 1.303 1.684 2.021 2.423 2.704 3.307
50 0.849 1.299 1.676 2.009 2.403 2.678 3.261
60 0.848 1.296 1.671 2.000 2.390 2.660 3.232
70 0.847 1.294 1.667 1.994 2.381 2.648 3.211
80 0.846 1.292 1.664 1.990 2.374 2.639 3.195
∞ 0.841 1.282 1.645 1.960 2.326 2.576 3.091

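Similarly (a sketch, assuming SciPy), the upper-tail critical values $t_\alpha$ of Table 6.2 are quantiles of the t distribution:

# t critical values: t_alpha is the (1 - alpha)-quantile of t with the given d.f.
from scipy.stats import t

for df, alpha in [(10, 0.05), (20, 0.025), (30, 0.005)]:
    print(f"d.f. = {df:2d}, alpha = {alpha}: t = {t.ppf(1 - alpha, df):.3f}")
# d.f. = 10, alpha = 0.05 gives 1.812, matching Table 6.2.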
Table 6.3: Chi-square distribution (upper-tail critical values $\chi^2_\alpha$ with $P(\chi^2 > \chi^2_\alpha) = \alpha$ for the given degrees of freedom, d.f.)

α
d.f. 0.995 0.990 0.975 0.950 0.050 0.025 0.010 0.005
1 3.9E-05 0.00016 0.00098 0.00393 3.841 5.024 6.635 7.879
2 0.0100 0.0201 0.0506 0.103 5.991 7.378 9.210 10.597
3 0.072 0.115 0.216 0.352 7.815 9.348 11.345 12.838
4 0.207 0.297 0.484 0.711 9.488 11.143 13.277 14.860
5 0.412 0.554 0.831 1.145 11.070 12.832 15.086 16.750
6 0.676 0.872 1.237 1.635 12.592 14.449 16.812 18.548
7 0.989 1.239 1.690 2.167 14.067 16.013 18.475 20.278
8 1.344 1.647 2.180 2.733 15.507 17.535 20.090 21.955
9 1.735 2.088 2.700 3.325 16.919 19.023 21.666 23.589
10 2.156 2.558 3.247 3.940 18.307 20.483 23.209 25.188
11 2.603 3.053 3.816 4.575 19.675 21.920 24.725 26.757
12 3.074 3.571 4.404 5.226 21.026 23.337 26.217 28.300
13 3.565 4.107 5.009 5.892 22.362 24.736 27.688 29.819
14 4.075 4.660 5.629 6.571 23.685 26.119 29.141 31.319
15 4.601 5.229 6.262 7.261 24.996 27.488 30.578 32.801
16 5.142 5.812 6.908 7.962 26.296 28.845 32.000 34.267
17 5.697 6.408 7.564 8.672 27.587 30.191 33.409 35.718
18 6.265 7.015 8.231 9.390 28.869 31.526 34.805 37.156
19 6.844 7.633 8.907 10.117 30.144 32.852 36.191 38.582
20 7.434 8.260 9.591 10.851 31.410 34.170 37.566 39.997
21 8.034 8.897 10.283 11.591 32.671 35.479 38.932 41.401
22 8.643 9.542 10.982 12.338 33.924 36.781 40.289 42.796
23 9.260 10.196 11.689 13.091 35.172 38.076 41.638 44.181
24 9.886 10.856 12.401 13.848 36.415 39.364 42.980 45.558
25 10.520 11.524 13.120 14.611 37.652 40.646 44.314 46.928
26 11.160 12.198 13.844 15.379 38.885 41.923 45.642 48.290
27 11.808 12.878 14.573 16.151 40.113 43.195 46.963 49.645
28 12.461 13.565 15.308 16.928 41.337 44.461 48.278 50.994
29 13.121 14.256 16.047 17.708 42.557 45.722 49.588 52.335
30 13.787 14.953 16.791 18.493 43.773 46.979 50.892 53.672
40 20.707 22.164 24.433 26.509 55.758 59.342 63.691 66.766
50 27.991 29.707 32.357 34.764 67.505 71.420 76.154 79.490
60 35.534 37.485 40.482 43.188 79.082 83.298 88.379 91.952
70 43.275 45.442 48.758 51.739 90.531 95.023 100.425 104.215
80 51.172 53.540 57.153 60.391 101.879 106.629 112.329 116.321
100 67.328 70.065 74.222 77.929 124.342 129.561 135.807 140.170

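In the same way (a sketch, assuming SciPy), the chi-square critical values of Table 6.3 are the $(1-\alpha)$-quantiles of the chi-square distribution:

# Chi-square critical values as (1 - alpha)-quantiles
from scipy.stats import chi2

for df, alpha in [(5, 0.950), (10, 0.050), (20, 0.005)]:
    print(f"d.f. = {df:2d}, alpha = {alpha}: {chi2.ppf(1 - alpha, df):.3f}")
# d.f. = 10, alpha = 0.05 gives 18.307, matching Table 6.3.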
Table 6.4: F distribution with α = 0.05 (upper-tail critical values $F_{0.05}(\nu_1, \nu_2)$; columns give the numerator d.f. $\nu_1$, rows the denominator d.f. $\nu_2$)

ν1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 161 199 216 225 230 234 237 239 241 242 243 244 245 245 246
2 18.5 19.0 19.2 19.2 19.3 19.3 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.4
3 10.1 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.76 8.74 8.73 8.71 8.70
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.94 5.91 5.89 5.87 5.86
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.70 4.68 4.66 4.64 4.62
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.03 4.00 3.98 3.96 3.94
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.60 3.57 3.55 3.53 3.51
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.31 3.28 3.26 3.24 3.22
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.10 3.07 3.05 3.03 3.01
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.94 2.91 2.89 2.86 2.85
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.82 2.79 2.76 2.74 2.72
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.72 2.69 2.66 2.64 2.62
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.63 2.60 2.58 2.55 2.53
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.57 2.53 2.51 2.48 2.46
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.51 2.48 2.45 2.42 2.40

Table 6.5: F distribution with α = 0.01 (upper-tail critical values $F_{0.01}(\nu_1, \nu_2)$; columns give the numerator d.f. $\nu_1$, rows the denominator d.f. $\nu_2$)

ν1
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 4052 4999 5403 5625 5764 5859 5928 5981 6022 6056 6083 6106 6126 6143
2 98.5 99.0 99.2 99.2 99.3 99.3 99.4 99.4 99.4 99.4 99.4 99.4 99.4 99.4
3 34.1 30.8 29.5 28.7 28.2 27.9 27.7 27.5 27.3 27.2 27.1 27.1 27.0 26.9
4 21.2 18.0 16.7 16.0 15.5 15.2 15.0 14.8 14.7 14.5 14.5 14.4 14.3 14.2
5 16.3 13.3 12.1 11.4 11.0 10.7 10.5 10.3 10.2 10.1 9.96 9.89 9.82 9.77
6 13.7 10.9 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.79 7.72 7.66 7.60
7 12.2 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.54 6.47 6.41 6.36
8 11.3 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.73 5.67 5.61 5.56
9 10.6 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.18 5.11 5.05 5.01
10 10.0 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.77 4.71 4.65 4.60
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.46 4.40 4.34 4.29
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.22 4.16 4.10 4.05
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 4.02 3.96 3.91 3.86
14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 3.86 3.80 3.75 3.70
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.73 3.67 3.61 3.56

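The F critical values in Tables 6.4 and 6.5 can likewise be reproduced (a sketch, assuming SciPy):

# Upper-tail F critical values F_alpha(nu1, nu2)
from scipy.stats import f

print(f"{f.ppf(0.95, dfn=4, dfd=10):.2f}")   # 3.48 = F_0.05(4, 10): Table 6.4, row 10, column 4
print(f"{f.ppf(0.99, dfn=4, dfd=10):.2f}")   # 5.99 = F_0.01(4, 10): Table 6.5, row 10, column 4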
Table 6.6: Critical Values for the U test when α = 0.1 (in Tables 6.6–6.9 the rows and columns give the two sample sizes $n_1$ and $n_2$, and $H_0$ is rejected when the computed $U$ is at most the tabled value)

2 3 4 5 6 7 8 9 10 11 12 13 14 15
2 0 0 0 1 1 1 1 2 2 3 3
3 0 0 1 2 2 3 4 4 5 5 6 7 7
4 0 1 2 3 4 5 6 7 8 9 10 11 12
5 0 1 2 4 5 6 8 9 11 12 13 15 16 18
6 0 2 3 5 7 8 10 12 14 16 17 19 21 23
7 0 2 4 6 8 11 13 15 17 19 21 24 26 28
8 1 3 5 8 10 13 15 18 20 23 26 28 31 33
9 1 4 6 9 12 15 18 21 24 27 30 33 36 39
10 1 4 7 11 14 17 20 24 27 31 34 37 41 44
11 1 5 8 12 16 19 23 27 31 34 38 42 46 50
12 2 5 9 13 17 21 26 30 34 38 42 47 51 55
13 2 6 10 15 19 24 28 33 37 42 47 51 56 61
14 3 7 11 16 21 26 31 36 41 46 51 56 61 66
15 3 7 12 18 23 28 33 39 44 50 55 61 66 72

Table 6.7: Critical Values for the U test when α = 0.05

2 3 4 5 6 7 8 9 10 11 12 13 14 15
2 0 0 0 0 1 1 1 1
3 0 1 1 2 2 3 3 4 4 5 5
4 0 1 2 3 4 4 5 6 7 8 9 10
5 0 1 2 3 5 6 7 8 9 11 12 13 14
6 1 2 3 5 6 8 10 11 13 14 16 17 19
7 1 3 5 6 8 10 12 14 16 18 20 22 24
8 0 2 4 6 8 10 13 15 17 19 22 24 26 29
9 0 2 4 7 10 12 15 17 20 23 26 28 31 34
10 0 3 5 8 11 14 17 20 23 26 29 30 36 39
11 0 3 6 9 13 16 19 23 26 30 33 37 40 44
12 1 4 7 11 14 18 22 26 29 33 37 41 45 49
13 1 4 8 12 16 20 24 28 30 37 41 45 50 54
14 1 5 9 13 17 22 26 31 36 40 45 50 55 59
15 1 5 10 14 19 24 29 34 39 44 49 54 59 64

Table 6.8: Critical Values for the U test when α = 0.02

2 3 4 5 6 7 8 9 10 11 12 13 14 15
2 0 0 0
3 0 0 1 1 1 2 2 2 3
4 0 1 1 2 3 3 4 5 5 6 7
5 0 1 2 3 4 5 6 7 8 9 10 11
6 1 2 3 4 6 7 8 9 11 12 13 15
7 0 1 3 4 6 7 9 11 12 14 16 17 19
8 0 2 4 6 7 9 11 13 15 17 20 22 24
9 1 3 5 7 9 11 14 16 18 21 23 26 28
10 1 3 6 8 11 13 16 19 22 24 27 30 33
11 1 4 7 9 12 15 18 22 25 28 31 34 37
12 2 5 8 11 14 17 21 24 28 31 35 38 42
13 0 2 5 9 12 16 20 23 27 31 35 39 43 47
14 0 2 6 10 13 17 22 26 30 34 38 43 47 51
15 0 3 7 11 15 19 24 28 33 37 42 47 51 56

Table 6.9: Critical Values for the U test when α = 0.01

3 4 5 6 7 8 9 10 11 12 13 14 15
3 0 0 0 1 1 1 2
4 0 0 1 1 2 2 3 3 4 5
5 0 1 1 2 3 4 5 6 7 7 8
6 0 1 2 3 4 5 6 7 9 10 11 12
7 0 1 3 4 6 7 9 10 12 13 15 16
8 1 2 4 6 7 9 11 13 15 17 18 20
9 0 1 3 5 7 9 11 13 16 18 20 22 24
10 0 2 4 6 9 11 13 16 18 21 24 26 29
11 0 2 5 7 10 13 16 18 21 24 27 30 33
12 1 3 6 9 12 15 18 21 24 27 31 34 37
13 1 3 7 10 13 17 20 24 27 31 34 38 42
14 1 4 7 11 15 18 22 26 30 34 38 42 46
15 2 5 8 12 16 20 24 29 33 37 42 46 51

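The U-test critical values of Tables 6.6–6.9 can be obtained from the exact null distribution of $U$. The sketch below (an illustration, not the method of the notes) uses the standard counting recurrence for that distribution and assumes the tabled α are two-sided, so the critical value is taken as the largest $u$ with $2\,P(U \le u) \le \alpha$; under these assumptions it reproduces, for example, the (8, 8) entries of Tables 6.6 and 6.7.

# Exact null distribution of the Mann-Whitney U statistic via the recurrence
#   c(u; m, n) = c(u - n; m - 1, n) + c(u; m, n - 1),
# where c(u; m, n) counts the rank arrangements of m and n observations with U = u.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count(u, m, n):
    if u < 0:
        return 0
    if m == 0 or n == 0:
        return 1 if u == 0 else 0
    return count(u - n, m - 1, n) + count(u, m, n - 1)

def u_critical(m, n, alpha):
    """Largest u with 2 * P(U <= u) <= alpha under H0 (two-sided convention, an assumption)."""
    total = comb(m + n, m)
    cum, crit = 0, None
    for u in range(m * n + 1):
        cum += count(u, m, n)
        if 2 * cum / total <= alpha:
            crit = u
        else:
            break
    return crit   # None means no critical value exists at this alpha

print(u_critical(8, 8, 0.10))   # 15, as in Table 6.6
print(u_critical(8, 8, 0.05))   # 13, as in Table 6.7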
Table 6.10: Critical Values $u_{0.025}$ for the Runs Test (lower critical values; rows and columns give the numbers $n_1$ and $n_2$ of the two kinds of symbols)

2 3 4 5 6 7 8 9 10 11 12 13 14 15
2 2 2 2 2
3 2 2 2 2 2 2 2 2 2 3
4 2 2 2 3 3 3 3 3 3 3 3
5 2 2 3 3 3 3 3 4 4 4 4 4
6 2 2 3 3 3 3 4 4 4 4 5 5 5
7 2 2 3 3 3 4 4 5 5 5 5 5 6
8 2 3 3 3 4 4 5 5 5 6 6 6 6
9 2 3 3 4 4 5 5 5 6 6 6 7 7
10 2 3 3 4 5 5 5 6 6 7 7 7 7
11 2 3 4 4 5 5 6 6 7 7 7 8 8
12 2 2 3 4 4 5 6 6 7 7 7 8 8 8
13 2 2 3 4 5 5 6 6 7 7 8 8 9 9
14 2 2 3 4 5 5 6 7 7 8 8 9 9 9
15 2 3 3 4 5 6 6 7 7 8 8 9 9 10

Table 6.11: Critical Values $u_{0.025}$ for the Runs Test (upper critical values)

4 5 6 7 8 9 10 11 12 13 14 15
4 9 9
5 9 10 10 11 11
6 9 10 11 12 12 13 13 13 13
7 11 12 13 13 14 14 14 14 15 15 15
8 11 12 13 14 14 15 15 16 16 16 16
9 13 14 14 15 16 16 16 17 17 18
10 13 14 15 16 16 17 17 18 18 18
11 13 14 15 16 17 17 18 19 19 19
12 13 14 16 16 17 18 19 19 20 20
13 15 16 17 18 19 19 20 20 21
14 15 16 17 18 19 20 20 21 22
15 15 16 18 18 19 20 21 22 22

Table 6.12: Critical Values $u_{0.005}$ for the Runs Test (lower critical values)

3 4 5 6 7 8 9 10 11 12 13 14 15
3 2 2 2 2
4 2 2 2 2 2 2 2 3
5 2 2 2 2 3 3 3 3 3 3
6 2 2 2 3 3 3 3 3 3 4 4
7 2 2 3 3 3 3 4 4 4 4 4
8 2 2 3 3 3 3 4 4 4 5 5 5
9 2 2 3 3 3 4 4 5 5 5 5 6
10 2 3 3 3 4 4 5 5 5 5 6 6
11 2 3 3 4 4 5 5 5 6 6 6 7
12 2 2 3 3 4 4 5 5 6 6 6 7 7
13 2 2 3 3 4 5 5 5 6 6 7 7 7
14 2 2 3 4 4 5 5 6 6 7 7 7 8
15 2 3 3 4 4 5 6 6 7 7 7 8 8

Table 6.13: Critical Values $u_{0.005}$ for the Runs Test (upper critical values)

5 6 7 8 9 10 11 12 13 14 15
5 11
6 11 12 13 13
7 13 13 14 15 15 15
8 13 14 15 15 16 16 17 17 17
9 15 15 16 17 17 18 18 18 19
10 15 16 17 17 18 19 19 19 20
11 15 16 17 18 19 19 20 20 21
12 17 18 19 19 20 21 21 22
13 17 18 19 20 21 21 22 22
14 17 18 19 20 21 22 23 23
15 19 20 21 22 22 23 24

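Tables 6.10–6.13 come from the exact null distribution of the total number of runs. The sketch below (an illustration, not the notes' derivation) uses the standard formulas for that distribution and returns, for a given per-tail probability, the lower and upper critical values; for $n_1 = n_2 = 10$ and tail probability 0.025 it gives 6 and 16, matching Tables 6.10 and 6.11.

# Exact null distribution of the number of runs R for n1 symbols of one kind
# and n2 of the other, and the resulting lower/upper critical values.
from math import comb

def runs_pmf(n1, n2):
    total = comb(n1 + n2, n1)
    pmf = {}
    for k in range(1, min(n1, n2) + 1):
        pmf[2 * k] = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1) / total
        pmf[2 * k + 1] = (comb(n1 - 1, k - 1) * comb(n2 - 1, k)
                          + comb(n1 - 1, k) * comb(n2 - 1, k - 1)) / total
    return pmf

def runs_critical(n1, n2, tail=0.025):
    pmf = runs_pmf(n1, n2)
    rs = sorted(pmf)
    lower = max((r for r in rs if sum(pmf[s] for s in rs if s <= r) <= tail), default=None)
    upper = min((r for r in rs if sum(pmf[s] for s in rs if s >= r) <= tail), default=None)
    return lower, upper

print(runs_critical(10, 10))   # (6, 16): lower and upper 0.025 critical values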
Table 6.14: Critical Values for the Signed-rank test

n T0.10 T0.05 T0.02 T0.01


5 1
6 2 1
7 4 2 0
8 6 4 2 0
9 8 6 3 2
10 11 8 5 3
11 14 11 7 5
12 17 14 10 7
13 21 17 13 10
14 26 21 16 13
15 30 25 20 16
16 36 30 24 19
17 41 35 28 23
18 47 40 33 28
19 54 46 38 32
20 60 52 43 37
21 68 59 49 43
22 75 66 56 49
23 83 73 62 55
24 92 81 69 61
25 101 90 77 68

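Finally, the signed-rank critical values of Table 6.14 come from the exact null distribution of the statistic: under $H_0$ each rank $1, \ldots, n$ enters the sum of positive ranks independently with probability $1/2$, so the distribution is given by subset-sum counts. The sketch below is an illustration only; it assumes the tabled α are two-sided, so the critical value is taken as the largest $t$ with $2\,P(T \le t) \le \alpha$, and conventions for very small $n$ may differ by a unit from one text to another.

# Exact null distribution of the Wilcoxon signed-rank statistic via subset-sum counts
def signed_rank_critical(n, alpha):
    # counts[t] = number of subsets of {1, ..., n} whose elements sum to t
    counts = [1] + [0] * (n * (n + 1) // 2)
    for r in range(1, n + 1):
        for t in range(len(counts) - 1, r - 1, -1):
            counts[t] += counts[t - r]
    total = 2 ** n
    cum, crit = 0, None
    for t, c in enumerate(counts):
        cum += c
        if 2 * cum / total <= alpha:
            crit = t
        else:
            break
    return crit

print(signed_rank_critical(10, 0.05))   # 8, as in Table 6.14
print(signed_rank_critical(20, 0.01))   # 37, as in Table 6.14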