Advanced Statistics Lecture Notes
Contents

1 Experimental Designs
  1.1 Introduction
  1.2 Some Standard Experimental Designs
    1.2.1 Completely Randomized Design
    1.2.2 Randomized Block Design
    1.2.3 Latin Square Design

2 Nonparametric Methods
  2.1 Introduction
  2.2 The Sign Test
    2.2.1 The Signed-Rank Test
  2.3 The Wilcoxon Rank-Sum Test for Comparing Two Treatments
  2.4 The Kruskal-Wallis Test
  2.5 Test of Randomness
  2.6 Measures of Correlation Based on Ranks
    2.6.1 Properties of the rank correlation coefficient

3 Sampling
  3.1 Simple Random Sampling
    3.1.1 Introduction
    3.1.2 How to Draw a Simple Random Sample
    3.1.3 Estimation of Population Mean and Total
    3.1.4 Selecting the Sample Size for Estimating Population Means and Totals
    3.1.5 Estimation of Population Proportion
    3.1.6 Sampling with Probabilities Proportional to Size
  3.2 Stratified Random Sampling
    3.2.1 Introduction
    3.2.2 How to Draw a Stratified Random Sample
    3.2.3 Estimation of Population Mean and Total
    3.2.4 Selecting the Sample Size for Estimating Population Means and Total
    3.2.5 Allocation of the Sample
    3.2.6 Estimation of Population Proportion
    3.2.7 Selecting the Sample Size and Allocating the Sample to Estimate Proportions

4 Multivariate Analysis
  4.1 Multivariate Data Sets
    4.1.1 Examples of Multivariate Data
    4.1.2 Multivariate Data Structure
    4.1.3 Mean and Dispersion
  4.2 Random Vectors and Multivariate Distributions
    4.2.1 Random Vectors
    4.2.2 Expectation and Dispersion
    4.2.3 Covariance between Two Random Vectors
    4.2.4 Multivariate Normal Distribution
  4.3 Properties of Multivariate Normal Distribution
    4.3.1 Expectation and Dispersion
    4.3.2 Partitioning of Multivariate Normal Vector
    4.3.3 Bivariate Normal Distribution
  4.4 Sampling from a Multivariate Normal Distribution
    4.4.1 Wishart Distribution
    4.4.2 Maximum Likelihood Estimator
    4.4.3 The Law of Large Numbers
    4.4.4 The Multivariate Central Limit Theorem
  4.5 Inference on Multivariate Normal Mean
    4.5.1 Hypothesis Testing
    4.5.2 Confidence Regions
    4.5.3 Simultaneous Confidence Intervals

5 Decision Theory
  5.1 Introduction
  5.2 Mixed Strategies
  5.3 Statistical Games
  5.4 The Minimax Criterion
  5.5 The Bayes Criterion

6 Inference
  6.1 Parameter Estimation
  6.2 Method of Moments
  6.3 Method of Maximum Likelihood
  6.4 Methods of Evaluating Estimators
  6.5 Cramér-Rao Inequality
  6.6 Sufficiency
Chapter 1
Experimental Designs
1.1 Introduction
Earlier you learned how to compare the means of two populations. In many practical
situations, we will have more than two population means to compare. For example, we may
want to compare the mean lengths of five different species of turtles by testing the hypothesis
that all have the same mean. If you take independent samples from these populations the
sample means will most likely be different. But this does not mean that the population
means are different. So we need a proper statistical test to decide if there is enough evidence
to refute the null hypothesis.
Performing multiple t-tests and comparing each pair is not only tedious and time-consuming but also increases the probability of making a type I error. This is because, even though the probability of wrongly rejecting the null hypothesis that two population means are equal is fixed at α for each test, when k such tests are performed the overall probability of wrongly rejecting at least one of them increases to 1 − (1 − α)^k. For example, if α = 0.05 and there are 10 such tests, the overall type I error probability will be close to 0.40.
Thus we need a single test procedure whose probability of type I error is α for testing
the hypothesis that all the means are equal. This procedure is called Analysis of Variance.
The assumptions required are the following: the samples are drawn independently and at random, one from each population; each population is normally distributed; and all the populations have a common variance σ².
Let us see how the test is done for five populations. Generalising for k populations will
then be easily understood. Suppose we have five sets of measurements that are normally
distributed with population means µ1 , µ2 , µ3 , µ4 , µ5 and common variance σ 2 . Samples of
size 9 are taken from each and the sample means and the sample variances are calculated.
We shall define the combined estimate of the common variance by

s_w² = [(n1 − 1)s1² + (n2 − 1)s2² + (n3 − 1)s3² + (n4 − 1)s4² + (n5 − 1)s5²] / [(n1 − 1) + (n2 − 1) + (n3 − 1) + (n4 − 1) + (n5 − 1)]
which is just an extension of the pooled variance we came across in the two-sample case. It
measures the variability of the observations within the five populations and the subscript w
stands for within-population variability.
Next we consider a quantity that measures the variability between the populations. Assuming that the null hypothesis is true, that is, all the population means are indeed equal, we know that the sampling distribution of the sample mean based on 9 observations will have the same mean µ and variance σ²/9. Since we have drawn five samples of nine observations each, we can estimate the variance of the distribution of sample means, σ²/9, using the fact that the sample variance of the five sample means is given by

[Σ ȳi² − (1/5)(Σ ȳi)²] / (5 − 1).

This quantity estimates σ²/9, and hence 9 times this quantity estimates σ². Call the latter s_B², the between-population variability. If the null hypothesis is true, then the within-population and between-population variabilities should be about the same, so the ratio s_B²/s_W² should be close to 1. We will use this ratio as our test statistic.
It turns out that in our example this ratio has an F distribution with 4 and 40 degrees of freedom. Thus we can check the validity of the null hypothesis by comparing the calculated value of s_B²/s_W² with the value found from the F-table.
The experimental setting we just described, where a random sample of observations is
taken from each of k different populations is called a completely randomized design. Though
the example we discussed had sample sizes all equal, it need not be the case in general for
a completely randomized design. We will now discuss in full generality how the hypothesis of equality of means is tested. Remember that we assume normality and equality of variances.
Notation
yij = jth observation in the sample from population i.
ni = Sample size for the sample from population i.
n = Σ ni = The total sample size.
Ti = The sum of the sample measurements obtained from population i.
ȳi = Ti/ni = The average of observations drawn from population i.
G = Σ Ti = The grand total of all observations.
ȳ = G/n = The average of all observations.
TSS = Total sum of squares = Σ (yij − ȳ)² = (n − 1)s²
SST = Between-sample sum of squares = Σ ni(ȳi − ȳ)²
SSE = Within-sample sum of squares = Σ (yij − ȳi)²

Shortcut Formulas
TSS = Σ yij² − G²/n
SST = Σ Ti²/ni − G²/n
SSE = TSS − SST
The model for an observation in a completely randomized design is yij = µ + αi + εij, where µ is an overall mean and αi is the effect due to treatment i (both unknown constants). The errors εij are assumed to be normally distributed with mean zero and an unknown variance σ². In addition we shall also assume that the errors are independent.
The statistical test of interest is to see if there is a difference among the treatment means.
The null hypothesis then is
H0 : α1 = α2 = · · · = αt = 0
where t is the number of experimental groups or samples.
Note: The “Sum of Squares Between” and the “Sum of Squares Within” that we talked about earlier are now called “Sum of Squares due to Treatment” and “Sum of Squares due to Error” respectively, and will be denoted by SST and SSE respectively. Once we have
computed the required quantities, we arrange them in the form of a table generally known
as the analysis of variance table (or ANOVA table).
Source       Degrees of freedom   Sum of squares   Mean squares         F ratio
Treatments   k − 1                SST              MST = SST/(k − 1)    MST/MSE
Error        n − k                SSE              MSE = SSE/(n − k)
Total        n − 1                TSS
Here T SS and SST are computed using the formula and SSE is found by subtraction.
The above type of ANOVA is called a single factor analysis of variance or a one-way analysis
of variance.
Example 1.1.1 Nineteen pigs are assigned at random among four experimental groups.
Each group is fed a different diet. The data are pig body weights, in kilograms, after being
raised on these diets. Perform a 0.05 level test to decide whether the weights are the same
for all four diets.
Solution:
Here k = 4; n1 = n2 = n4 = 5, n3 = 4 and n = 19.
T1 = 303.1, T2 = 346.5, T3 = 401.4, T4 = 431.2, G = 1482.2.
Σ yij² = 119981.900 and G²/n = 115627.202
TSS = Σ yij² − G²/n = 119981.900 − 115627.202 = 4354.698
SST = Σ Ti²/ni − G²/n = 119853.550 − 115627.202 = 4226.348
SSE = TSS − SST = 4354.698 − 4226.348 = 128.350
Source       Degrees of freedom   Sum of squares   Mean squares   F ratio
Treatments   3                    4226.348         1408.783       164.635
Error        15                   128.350          8.557
Total        18                   4354.698
As the calculated F-value is larger than the table value F3,15,0.05 = 3.29, we reject the null
hypothesis and conclude that the means are not all equal, or in other words, “there is a
treatment effect”.
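A short Python sketch of this one-way ANOVA computation may help make the shortcut formulas concrete. The individual pig weights are not reproduced in these notes, so the four groups below are made-up placeholder values; only the structure of the calculation (TSS, SST, SSE and the F ratio) mirrors the example.

```python
from scipy import stats

# Placeholder data: four diet groups (the actual pig weights are not listed in the notes).
groups = [
    [60.8, 57.0, 65.0, 58.6, 61.7],   # diet 1
    [68.7, 67.7, 74.0, 66.3, 69.8],   # diet 2
    [102.6, 102.1, 100.2, 96.5],      # diet 3
    [87.9, 84.2, 83.1, 85.7, 90.3],   # diet 4
]

k = len(groups)                                   # number of treatments
n = sum(len(g) for g in groups)                   # total sample size
G = sum(sum(g) for g in groups)                   # grand total

TSS = sum(y**2 for g in groups for y in g) - G**2 / n       # total sum of squares
SST = sum(sum(g)**2 / len(g) for g in groups) - G**2 / n    # treatment (between) sum of squares
SSE = TSS - SST                                             # error (within) sum of squares

MST, MSE = SST / (k - 1), SSE / (n - k)
F = MST / MSE
F_crit = stats.f.ppf(0.95, k - 1, n - k)          # table value F(k-1, n-k) at the 0.05 level
print(f"F = {F:.3f}, critical value = {F_crit:.2f}")
```

The same F statistic can be cross-checked with scipy.stats.f_oneway(*groups).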
1.2 Some Standard Experimental Designs
1.2.1 Completely Randomized Design
The design of an experiment is the process of planning an experiment that is appropri-
ate for the situation. A large part of scientific reasoning consists of drawing conclusions
from experiments (studies) that have been carefully designed, appropriately conducted, and
properly analyzed. In this section we will discuss some standard experimental designs and
their analyses. The single factor ANOVA discussed earlier is called a completely randomized
design (CRD), where we were interested in comparing treatment means for different levels
of a single factor.
Consider the following example. A horticultural laboratory is interested in examining the
leaves of apple trees to detect nutritional deficiencies using three different laboratory proce-
dures. In particular, the laboratory would like to determine whether there is a difference in
mean assay readings for apple leaves utilizing three different laboratory procedures (A, B,
C). The experimental units in this investigation are apple tree leaves and the treatments are
the three levels of the qualitative variable ”laboratory procedure”. If a single analyst takes
a random sample of nine leaves from the same tree, randomly assigns three leaves to each
of the three procedures, and assays the leaves using the assigned treatment, we could use
the three sample means to estimate the corresponding mean leaf nutritional deficiency for
the three laboratory test procedures - in other words use the analysis of variance methods
discussed earlier to run a statistical test of the hypothesis that all the treatment means are
identical. The design used for this investigation is a completely randomized design with
three observations for each treatment.
The completely randomized design has several advantages and disadvantages when used as an experimental design for comparing t treatment means.
Advantages
1. The design is extremely easy to construct.
2. The design is easy to analyze even though the sample sizes might not be the same for each treatment.
Disadvantages
1. Although the completely randomized design can be used for any number of treatments,
it is best suited for situations in which there are relatively few treatments.
1.2.2 Randomized Block Design
Let us now change the horticultural problem slightly and see how well the completely ran-
domized design suits our needs. Suppose that, rather than relying upon one analyst, we use
three analysts for the leaf assays, say, Analyst 1, Analyst 2 and Analyst 3. If we randomly
assigned three apple leaves to each of the analysts, we might end up with a randomization
scheme like the one listed in the table below:
Analyst
1 2 3
A B C
A B C
A B C
Even though we still have three observations for each treatment in this scheme, any
differences that we may observe among the leaf determinations for the three laboratory
procedures may be due entirely to differences among the analysts who assayed the leaves.
For example, if we tested the hypothesis H0 : µA − µB = 0 against H1 : µA − µB ≠ 0 and
were led to reject H0 , we would not be able to tell whether µA differs from µB because
assays from analyst 1 are different from those for analyst 2 or because the properties of
determinations by procedure A differ markedly from those for procedure B. This example
illustrates a situation in which the nine experimental units (tree leaves) are affected by an
extraneous source of variability: the analyst. In this case the units differ markedly and
would not be a homogeneous set on which we could base an evaluation of the effects of the
three treatments.
The completely randomized design just described can be modified to gain additional in-
formation concerning the means µA , µB and µC . We can block out the undesirable variability
among analysts by using the following experimental design. We restrict our randomization
of treatments to experimental units to ensure that each analyst performs a determination
using each of the three procedures. The order of these determinations for each analyst is
randomized. One such randomization is listed below.
Analyst
1 2 3
A B A
C A B
B C C
Table 1.2.2
Note that each analyst will assay three leaves, one leaf for each of the three procedures.
Hence pairwise comparisons among the laboratory procedures that utilize the sample means
will be free of any variability among analysts. For example, if we ran the test
H0 : µA − µB = 0 against H1 : µA − µB ≠ 0
and rejected H0 , the difference between µA and µB would be due to a difference between the
nutritional deficiencies detected by procedures A and B and not due to a difference among
the analysts, since each analyst would have assayed one leaf for each of the three procedures.
This design, which represents an extension to the completely randomized design is called
a randomized block design; the analysts in our experiment are called blocks. By using this
design, we have effectively filtered out any variability among the analysts, enabling us to
make more precise comparisons among the treatment means µA , µB and µC .
As another example, suppose we want to study weight gain in guinea pigs where the
factor whose effect is to be tested is diet (the treatment in the ANOVA). Twenty animals are to be used in this experiment, five on each of four diets. However, the experimenter believes there are some experimental factors that would likely affect weight gain and does not feel that all twenty animal cages can be kept in the laboratory at identical conditions of temperature, light, etc. Therefore, five blocks of experimental units are established, i.e. five groups of guinea pigs. Each block of animals consists of four guinea pigs, one on each of the experimental diets. All members of a block are considered to be at identical conditions
(except, of course, for diet).
In general, we can use a randomized block design to compare t different treatment means
when an extraneous source of variability (blocks) is present. If there are b different blocks,
we would run each of the t treatments in each block to filter out the block to block variability.
In our example we had t = 3 treatment means (laboratory procedures) and b = 3 blocks
(analysts).
Advantages
1. It is a useful design for comparing t treatment means in the presence of a single extraneous source of variability.
Disadvantages
1. Since the experimental units within a block must be homogeneous, the design is best
suited for a relatively small number of treatments.
2. This design controls for only one extraneous source of variability (due to blocks).
Additional extraneous sources of variability tend to increase the error term, making
it more difficult to detect treatment differences.
3. The effect of each treatment on the response must be approximately the same from
block to block.
NOTATION
yij = Observation for treatment i in block j.
t = The number of treatments.
b = The number of blocks.
n = The total sample size = bt.
Ti = The sum of the sample measurements receiving treatment i.
ȳi = The average of observations receiving treatment i = Ti/b.
Bj = The sum of the sample measurements in block j.
B̄j = The sample mean for block j = Bj/t.
G = The grand total of all observations = Σ Ti.
ȳ = The average of all observations = G/n.
TSS = Total sum of squares = Σ (yij − ȳ)² = (n − 1)s² = Σ yij² − G²/n
SST = Treatment sum of squares = b Σ (ȳi − ȳ)² = Σ Ti²/b − G²/n
SSB = Block sum of squares = t Σ (B̄j − ȳ)² = Σ Bj²/t − G²/n
SSE = TSS − SST − SSB

The model for an observation in a randomized block design is

yij = µ + αi + βj + εij

where
µ = an overall mean, which is an unknown constant
αi = an effect due to treatment i (an unknown constant)
βj = an effect due to block j (an unknown constant)
εij = a random error associated with the response on treatment i, block j.
The errors εij are assumed to be normally distributed with mean zero and an unknown variance σ². In addition we shall also assume that the errors are independent.
The statistical test of interest is to see if there is a difference among the treatment means.
The null hypothesis then is
H0 : α1 = α2 = · · · = αt = 0
Source       Degrees of freedom   Sum of squares   Mean squares                 F ratio
Treatments   t − 1                SST              MST = SST/(t − 1)            MST/MSE
Blocks       b − 1                SSB              MSB = SSB/(b − 1)            MSB/MSE
Error        (b − 1)(t − 1)       SSE              MSE = SSE/[(b − 1)(t − 1)]
Total        bt − 1               TSS
The computed F-ratio for the treatments, MST/MSE, is compared with the table value of the F-distribution with t − 1 and (b − 1)(t − 1) degrees of freedom; if the computed value is higher, the null hypothesis is rejected and it is concluded that there is a treatment effect.
We can use these computations to test if there is a block effect. Here the test is
H0 : β1 = β2 = · · · = βb = 0
Similar to the treatment effect testing procedure, we compare the computed value of MSB/MSE with the table value of the F-distribution with (b − 1) and (b − 1)(t − 1) degrees of freedom.
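The randomized block computations can be sketched in Python along the same lines. The response matrix below is a made-up illustration (the fabric data of Example 1.2.2 appear only in a figure that is not reproduced here); rows are treatments and columns are blocks.

```python
import numpy as np
from scipy import stats

# Made-up responses: rows are the t treatments, columns are the b blocks.
y = np.array([
    [ 9.0, 12.0],
    [11.0, 13.0],
    [14.0, 16.0],
    [17.0, 20.0],
])
t, b = y.shape
n = t * b
G = y.sum()

TSS = (y**2).sum() - G**2 / n
SST = (y.sum(axis=1)**2).sum() / b - G**2 / n     # treatment sum of squares
SSB = (y.sum(axis=0)**2).sum() / t - G**2 / n     # block sum of squares
SSE = TSS - SST - SSB

MST, MSB, MSE = SST / (t - 1), SSB / (b - 1), SSE / ((b - 1) * (t - 1))
df_err = (b - 1) * (t - 1)
print(f"F(treatments) = {MST / MSE:.2f}, table value = {stats.f.ppf(0.95, t - 1, df_err):.2f}")
print(f"F(blocks)     = {MSB / MSE:.2f}, table value = {stats.f.ppf(0.95, b - 1, df_err):.2f}")
```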
USE OF RBD
One major reason for blocking is that the experimenter does not have a sufficient number
of homogeneous experimental units available to run a completely randomized design with
several observations per treatment combination. A realistic situation where blocking would
be required is an industrial type experiment where there is only sufficient time to run one
set of treatment combinations per day. In such an experiment one would be blocking on
time and the number of days that the experiment is carried out would be the number of
blocks. This type of experiment points out another reason for blocking which is to expand
the inference space. Even if it is possible to run a large number of each treatment combi-
nation on one day, the experimenter might still choose to block over time so that he could
make inferences over time rather than to just one particular day.
Example 1.2.2 Four chemical treatments for fabric are to be compared with regard to
their ability to resist stains. Two different types of fabric are available for the experiment and are to be used as blocks. Each chemical is applied to a sample of each type of fabric, resulting in a randomized block design. The measurements are given below in Figure 1.1:
The problem is to test if there is evidence to suggest if there are significant differences be-
tween treatment means.
Solution
TSS = 464 − 56²/8 = 464 − 392 = 72
SST = 429 − 392 = 37
SSB = 424 − 392 = 32
SSE = 72 − 37 − 32 = 3
The F-ratio for treatments is MST/MSE = (37/3)/(3/3) = 12.33. As this exceeds the table value F3,3,0.05 = 9.28, we reject the hypothesis that there is no treatment effect. Also, we see in the summary table that the F-ratio for blocks is 32 compared to the table value F1,3,0.05 = 10.13. So there is evidence of a difference between block means.
That is, the fabrics seem to differ with respect to their ability to resist stains when treated
with these chemicals. The decision to ”block out” the variability between fabrics therefore
appears to have been a good one.
RELATIVE EFFICIENCY OF RBD
It is natural to ask whether blocking has increased our precision for comparing treatment means in a given experiment. Let MSE_RB and MSE_CR represent the MSE for the RBD and the CRD respectively. One measure of precision is the variance of ȳi. For an RBD the estimated variance of ȳi is MSE_RB/b, while for a CRD with r observations per treatment it is MSE_CR/r. The two designs are equally precise when r satisfies the relationship

MSE_CR / r = MSE_RB / b,   or equivalently   MSE_CR / MSE_RB = r / b.

The quantity r/b is called the relative efficiency of the RBD.
If we had to apply this formula directly to find the relative efficiency of the RBD, we would have to run both an RBD and a CRD, which would be wasteful. The following formula helps to overcome this difficulty; its proof is beyond the scope of this course. If MSB and MSE are those of the RBD,

RE(RBD, CRD) = [(b − 1)MSB + b(t − 1)MSE] / [(bt − 1)MSE].
1.2.3 Latin Square Design

Figure 1.2: Randomized Block Design for the Leaf Assay in the Presence of a Day Effect

Suppose now that the leaf assays in our horticultural example cannot all be carried out on the same day, so that days form a second extraneous source of variability (see Figure 1.2). We can extend the randomized block design to filter out this second source of variability, the variability among days, in addition to filtering out the first source, variability among analysts. To do this we restrict our randomization to ensure that each treatment appears in each row (day) and in each column (analyst). One such randomization is shown in Figure 1.3. Note that the test procedures have been assigned to analysts and to days so that each procedure is performed once each day and once by each analyst. Hence pairwise comparisons among treatment procedures that involve the sample means are free of variability among days and analysts.
This experimental design is called a Latin square design (LSD). In general, a Latin square
design can be used to compare t treatment means in the presence of two extraneous sources
of variability, which we block off into t rows and t columns. The t treatments are then
randomly assigned to the rows and columns so that each treatment appears in every row
and every column of the design (see Figure 1.3).
Thus, blocking in two directions can be accomplished by using a Latin square design.
Definition 1.2.3 A t × t Latin square design contains t rows and t columns. The t treat-
ments are randomly assigned to experimental units within the rows and columns so that
each treatment appears in every row and in every column.
Advantages and Disadvantages of a Latin Square Design
The advantages and disadvantages of the Latin square design are listed here.
Advantages
1. The design is particularly appropriate for comparing t treatment means in the presence
of two sources of extraneous variation, each measured at t levels.
2. The analysis is still quite simple.
Disadvantages
1. Although a Latin square can be constructed for any value of t, it is best suited for
comparing t treatments when 5 < t < 10.
2. Any additional extraneous sources of variability tend to inflate the error term making
it more difficult to detect differences among the treatment means.
3. The effect of each treatment on the response must be approximately the same across
rows and columns.
4. The number of rows and columns must be the same as the number of treatments.
A typical randomization scheme for a 4 × 4 Latin square comparing the treatments I, II,
III, and IV is shown in Figure 1.4. Note that each treatment appears in all four rows and
all four columns.
Example 1.2.4 To evaluate the toxicity of certain compounds in water, samples must be
preserved for long periods of time. Suppose an experiment is being conducted to evaluate
the Maximum Holding Time (MHT) for four different preservatives used to treat a mercury
based compound. The MHT is defined as the time that elapses before the solution loses
10% of its initial concentration. Both the level of the initial concentration and the analyst
who measures the MHT are sources of variation, so an experimental design that blocks on
both is necessary to allow the difference in mean MHT between preservatives to be more
accurately estimated. An LSD is constructed wherein each preservative is applied exactly
once to each initial concentration, and is analyzed exactly once by each analyst. For this
the design has to be ”square”, since the single application of each treatment level at each
block level requires that the number of levels of each block equal the number of treatment
levels. Thus to apply the LSD using four different preservatives, we must employ four initial
concentrations and four analysts. The design would be constructed as shown below, where
Pi is preservative i.
NOTATION
yijk = Observation for treatment i in row j and column k.
t = The number of treatments = the number of rows = the number of columns.
n = The total sample size = t².
Ti = The sum of the sample measurements receiving treatment i.
ȳi = The average of observations receiving treatment i = Ti/t.
Rj = The sum of the sample measurements in row j.
R̄j = The sample mean for row j = Rj/t.
Ck = The sum of all observations in column k.
C̄k = The average of all observations in column k = Ck/t.
G = The grand total of all observations = Σ Ti.
ȳ = The average of all observations = G/n.
TSS = Σ (yijk − ȳ)² = (n − 1)s² = Σ yijk² − G²/n
SST = t Σ (ȳi − ȳ)² = Σ Ti²/t − G²/n
SSR = t Σ (R̄j − ȳ)² = Σ Rj²/t − G²/n
SSC = t Σ (C̄k − ȳ)² = Σ Ck²/t − G²/n
SSE = TSS − SST − SSR − SSC
The model for a response in an LSD is about the same as that for an RBD, with an added term to account for the second blocking variable. The model is

yijk = µ + αi + βj + γk + εijk

where
µ = an overall mean, which is an unknown constant
αi = an effect due to treatment i (an unknown constant)
βj = an effect due to row j (an unknown constant)
γk = an effect due to column k (an unknown constant)
εijk = a random error associated with the response on treatment i, row j and column k.
Source       Degrees of freedom   Sum of squares   Mean squares                 F ratio
Treatments   t − 1                SST              MST = SST/(t − 1)            MST/MSE
Rows         t − 1                SSR              MSR = SSR/(t − 1)            MSR/MSE
Columns      t − 1                SSC              MSC = SSC/(t − 1)            MSC/MSE
Error        (t − 1)(t − 2)       SSE              MSE = SSE/[(t − 1)(t − 2)]
Total        t² − 1               TSS
Example 1.2.5 Four varieties of wheat are to be compared to test if there is any difference
in mean yields. The two additional sources of variability are four different fertilizers and
four different years. It is assumed that the various sources of variability do not interact.
Test the hypotheses that there is no difference in wheat yields for
1. different varieties of wheat
2. different fertilizers
3. different years
The design and the data are given below: The letters A, B, C, D represent the types of wheat.
Solution:
As F3,6,0.05 = 4.76, the decisions, at 5% level, are to (1) accept (2) reject (3) accept.
RELATIVE EFFICIENCY OF LSD
As with the randomized block design, we can compare the efficiency of the Latin square design to that of the completely randomized design. Let MSE_LS and MSE_CR denote the mean square errors, respectively, for a Latin square design and a completely randomized design. The relative efficiency is

RE(LSD, CRD) = [MSR + MSC + (t − 1)MSE] / [(t + 1)MSE],

where the MSE is that of the LSD.
We will refer to the data of the example above and compute the efficiency of the Latin square design relative to a completely randomized design.
For these data, t = 4, MSR = 519, MSC = 139.33, and MSE = 43.5. Substituting into the formula for relative efficiency, we have

RE = (519 + 139.33 + 3 × 43.5) / (5 × 43.5) = 788.83 / 217.5 ≈ 3.63.
That is, it would take approximately 3.63 times as many observations in using a completely
randomized design to gather the same amount of information on the treatment means as it
would take when using the Latin square design.
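The arithmetic of this relative-efficiency calculation is easy to check in Python; the helper function name below is mine, not part of any library, and it simply encodes the formula stated above.

```python
def latin_square_relative_efficiency(msr: float, msc: float, mse: float, t: int) -> float:
    """Relative efficiency of a t x t Latin square design compared with a CRD."""
    return (msr + msc + (t - 1) * mse) / ((t + 1) * mse)

# Mean squares from the wheat-yield example.
print(round(latin_square_relative_efficiency(msr=519, msc=139.33, mse=43.5, t=4), 2))  # about 3.63
```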
Chapter 2
Nonparametric Methods
2.1 Introduction
The models for inference procedures that we have learned thus far assume a specific structure
for the population distribution. The Student’s t test for inferences about a population mean
and the comparison of two means, the χ2 and F tests for inferences about variances, the
inferences presented for regression models, and the analysis of variance are all based on the
assumption that the response measurements constitute samples from normal populations.
These procedures are designed to make inferences about the values of the parameters µ
and σ that appear in the prescription for the mathematical curve of the normal population.
Collectively, they are called normal-theory parametric inference procedures.
Nonparametric statistics is a body of inference procedures that is valid under a much
wider range of shapes for the population distribution. The term nonparametric inference is
derived from the fact that the usefulness of these procedures does not require modeling a
population in terms of a specific parametric form of density curves, such as normal distri-
butions. In testing hypotheses, nonparametric test statistics typically utilize some simple
aspects of the sample data, such as the signs of the measurements, order relationships, or
category frequencies. These general features do not require the existence of a meaningful
numerical scale for measurements. More importantly, stretching or compressing the scale
does not alter them. As a consequence, the null distribution of a nonparametric test statistic
can be determined without regard to the shape of the underlying population distribution.
For this reason, these procedures are also called distribution-free tests.
What major difficulties are associated with parametric procedures based on the t, χ2 , F
or similar distributions, and how do we overcome these difficulties by using a nonparametric
approach? First, the distributions of these parametric model statistics and their tabu-
lated percentage points are valid if the underlying population distribution is approximately
normal. Although the normal distribution does approximate many real-life situations, ex-
ceptions are numerous, so that it is unwise to presume that normality is a fact of life. Drastic
departures from normality often occur in the forms of conspicuous asymmetry, sharp peaks,
or heavy tails. The intended strength of parametric inference procedures can be seriously
affected when such departures occur, especially when the sample size is small or moderate.
For instance, a t test with an intended level of significance of α = .05 may have an actual
type I error probability far in excess of this value. Similarly, the strength of confidence
statements can be seriously distorted.
Second, parametric procedures require that observations be recorded on an unambiguous
scale of measurements and then make explicit use of specific numerical aspects of the data,
such as the sample mean, the standard deviation, and other sums of squares. In a great
many situations, particularly in the social and behavioral sciences, responses are difficult
or impossible to measure on a specific and meaningful numerical scale. Characteristics like
degree of apathy, taste preference, and surface gloss cannot be evaluated on an objective
numerical scale, and an assignment of numbers is, therefore, bound to be arbitrary. Also,
when people are asked to express their views on a 5-point rating scale, where 1 represents
strongly disagree and 5 represents strongly agree, the numbers have little physical meaning
beyond the fact that higher scores indicate greater agreement. Data of this type are called
ordinal data, because only the order of the numbers is meaningful and the distance between
two numbers does not lend itself to practical interpretation. Any controversy surround-
ing a scale of measurement for ordinal data is readily transmitted to parametric statistical
procedures through the use of sample means and standard deviations. Nonparametric pro-
cedures that utilize information only on order or rank are therefore particularly suited to
measurements on an ordinal scale. When the data constitute measurements on a meaningful
numerical scale and the assumption of normality holds, parametric procedures are certainly
more efficient in the sense that tests have higher power and confidence intervals are gener-
ally shorter than their nonparametric counterparts. A choice between the parametric and
nonparametric approach should be guided by a consideration of loss of efficiency and the
degree of protection desired against possible violations of the assumptions.
2.2 The Sign Test

To test a hypothesis about a population mean (median) µ with the sign test, each observation exceeding the hypothesized value µ0 is replaced by a plus sign, each observation below it by a minus sign, and values equal to µ0 are discarded; under the null hypothesis the number of plus signs X has a binomial distribution with p = 1/2.

Example 2.2.1 The following are measurements of the breaking strength of a certain kind
of 2-inch cotton ribbon in pounds:
163 165 160 189 161 171 158 151 169 162
163 139 172 165 148 166 172 163 187 173
Use the sign test to test the null hypothesis µ = 160 against the alternative hypothesis
µ > 160 at the 0.05 level of significance.
Solution: The null and alternative hypotheses are given by H0 : µ = 160 and H1 : µ > 160,
and the significance level is given by α = 0.05. We use the test statistic X, the observed
number of plus signs. Replacing each value exceeding 160 with a plus sign, each value
less than 160 with a minus sign, and discarding the one value that equals 160, we get
+ + + + + − − + + + − + + − + + + + +, so n = 19 and x = 15. From the binomial tables we find that P(X ≥ 15) = 0.0095 for p = 1/2. Since the P-value, 0.0095, is less than 0.05,
the null hypothesis must be rejected. Thus we conclude that the mean breaking strength of
the given kind of ribbon exceeds 160 pounds.
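The exact binomial P-value in this example is easy to reproduce; a minimal Python sketch (any difference in the last digit is just rounding of the table value):

```python
from scipy import stats

data = [163, 165, 160, 189, 161, 171, 158, 151, 169, 162,
        163, 139, 172, 165, 148, 166, 172, 163, 187, 173]
mu0 = 160

x = sum(value > mu0 for value in data)     # number of plus signs
n = sum(value != mu0 for value in data)    # values equal to 160 are discarded

# P(X >= x) for X ~ Binomial(n, 1/2); sf(k) gives P(X > k), so pass x - 1.
p_value = stats.binom.sf(x - 1, n, 0.5)
print(n, x, round(p_value, 4))             # 19, 15, about 0.0096
```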
Example 2.2.2 The following data, in tons, are the amounts of sulfur oxides emitted by
a large industrial plant in 40 days:
17 15 20 29 19 18 22 25 27 9
24 20 17 6 24 14 15 23 24 26
19 23 28 19 16 22 24 17 20 13
19 10 23 18 31 13 20 17 24 14
Use the sign test to test the null hypothesis µ = 21.5 against the alternative hypothesis
µ < 21.5 at the 0.01 level of significance.
Solution: The null and alternative hypotheses are given by H0 : µ = 21.5 and H1 : µ < 21.5, and the significance level is given by α = 0.01. Reject the null hypothesis if z < −z0.01 = −2.33, where z = (2x − n)/√n with x being the number of plus signs (values exceeding 21.5). Since n = 40 and x = 16, we get

z = (2x − n)/√n = (32 − 40)/√40 = −1.26
Since z = −1.26 exceeds -2.33, the null hypothesis is not rejected.
The sign test can also be used when we deal with paired data. In such problems, each pair
of sample values is replaced by a plus sign if the difference between the paired observations
is positive (that is, if the first value exceeds the second value) and by a minus sign if the
difference between the paired observations is negative (that is, if the first value is less than
the second value), and it is discarded if the difference is zero. To test the null hypothesis
that two continuous symmetrical populations have equal means (or that two continuous
populations have equal medians), we can thus use the sign test, which, in connection with
this kind of problem, is referred to as the paired-sample sign test.
Example 2.2.3 To determine the effectiveness of a new traffic-control system, the numbers
of accidents that occurred at 12 dangerous intersections during four weeks before and four
weeks after the installation of the new system were observed, and the following data were
obtained:
Solution: The null and alternative hypotheses are given by H0 : µ1 = µ2 and H1 : µ1 > µ2
and α = 0.05.
Use the test statistic X, the observed number of plus signs. Replacing each positive
difference by a plus sign and each negative difference by a minus sign, we get + + + +
+ + − + − + ++ so that n = 12 and x = 10. From the binomial tables we find that
P (X ≥ 10) = 0.0192 for p = 0.5. Since the P -value, 0.0192, is less than 0.05, the null
hypothesis must be rejected. Thus we conclude that the new traffic-control system is effective
in reducing the number of accidents at dangerous intersections.
2.2.1 The Signed-Rank Test

Since the sign test uses only the signs of the differences between the observations and µ0, or the signs of the differences between the pairs of observations in the paired-sample case, it tends to be wasteful of information. An alternative nonparametric test, the Wilcoxon signed-rank test, is less wasteful in that it also takes into account the magnitudes of the differences. In this test, we rank the differences without regard to their signs, assigning rank 1 to the smallest difference in absolute value, rank 2 to the second smallest difference in absolute value, etc., and rank n to the largest difference in absolute value. Zero differences are again discarded, and if the absolute values of two or more differences are the same, we assign each one the mean of the ranks that they jointly occupy. Then the signed-rank test is based on one of the following three statistics: the sum of the ranks assigned to the positive differences (T+), the sum of the ranks assigned to the negative differences (T−), or T = min(T+, T−). All these are equivalent because T+ + T− = n(n + 1)/2.
Depending on the alternative hypothesis, we base the signed-rank test on T + ,T − , or T .
The assumptions and the null hypotheses are the same as in the case of the sign test. The
correct statistic, along with the appropriate critical value, is summarized in the following
table, where in each case the level of significance is α.
Example 2.2.4 The following are 15 measurements of the octane rating of a certain kind
of gasoline.
97.5 95.2 97.3 96.0 96.8
100.3 97.4 95.3 93.2 99.1
96.1 97.6 98.2 98.5 94.9
Use the signed-rank test at the 0.05 level of significance to test whether the mean octane
rating of the given kind of gasoline is 98.5.
Thus T + = 8 + 2 = 10 and T = min(95, 10) = 10. From Table 6.14, we find that
T0.05 = 21 for n = 14. Since T = 10 is less than 21, the null hypothesis must be rejected:
the mean octane rating of the given kind of gasoline is not 98.5.
When we deal with paired data, the signed-rank test can also be used in place of the
paired-sample sign test. In this case, we test the null hypothesis µ1 = µ2 using the test
criteria given in the table, except that the alternative hypotheses are now µ1 < µ2, µ1 > µ2 or µ1 ≠ µ2 instead of µ < µ0, µ > µ0 or µ ≠ µ0.
For n > 15 it is considered reasonable to assume that T + is a value of a random variable
whose distribution is approximately normal. To perform the signed-rank test based on
this assumption, we need the following results, which apply regardless of whether the null
hypothesis is µ = µ0 or µ1 = µ2 .
Theorem 2.2.5 Under the assumptions required by the signed-rank test, T+ has mean µ = n(n + 1)/4 and variance σ² = n(n + 1)(2n + 1)/24. When n is large, the distribution of T+ can be approximated by a normal distribution.
Example 2.2.6 The following are the weights in pounds, before and after, of 16 persons
who stayed on a certain reducing diet for four weeks.
Use the signed-rank test to test at the 0.05 level of significance whether the weight-reducing
diet is effective.
T+ = 13 + 10 + 16 + 8 + 9 + 14 + 12 + 2 + 11 + 1 + 15 = 111.
Since µ = n(n + 1)/4 = 16(17)/4 = 68 and σ² = n(n + 1)(2n + 1)/24 = 16(17)(33)/24 = 374, we get

z = (T+ − µ)/σ = (111 − 68)/√374 = 2.22.

As the calculated value of z exceeds the table value z0.05 = 1.645, the null hypothesis must be rejected; we conclude that the diet is effective in reducing weight.
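A Python sketch of the large-sample signed-rank procedure follows. The before-and-after weights of Example 2.2.6 are given only in a table that is not reproduced in these notes, so the paired differences below are placeholders; the ranking of absolute differences and the normal approximation are the point of the sketch.

```python
import math
from scipy.stats import rankdata

# Placeholder paired differences (before minus after); not the data of Example 2.2.6.
diffs = [3.5, 2.1, -0.4, 5.0, 1.7, 1.9, 4.2, 3.1, 0.6, -1.1, 2.6, 0.3, 2.9, 4.8, 6.2, 1.3]
diffs = [d for d in diffs if d != 0]          # zero differences are discarded
n = len(diffs)

ranks = rankdata([abs(d) for d in diffs])     # average ranks are used for tied absolute values
T_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)

mu = n * (n + 1) / 4
sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (T_plus - mu) / sigma
print(T_plus, round(z, 2))                    # reject H0 at the 0.05 level if z > 1.645
```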
2.3 The Wilcoxon Rank-Sum Test for Comparing Two Treatments

Here we describe a useful nonparametric procedure originally proposed by F. Wilcoxon
(1945). An equivalent alternative version was independently proposed by H. Mann and D.
Whitney (1947). For a comparative study of two treatments A and B, a set of n = n1 + n2
experimental units are randomly divided into two groups of sizes n1 and n2 , respectively.
Treatment A is applied to the n1 units, and treatment B is applied to the n2 units. Let the
response measurements be X11 , X12 , · · · , X1n1 for Treatment A and X21 , X22 , · · · , X2n2 for
Treatment B.
These two treatments constitute independent random samples from two populations. As-
suming that larger responses indicate a better treatment, we wish to test the null hypothesis
that there is no difference between the two treatment effects vs. the one-sided alternative
that Treatment A is more effective than Treatment B. The distributions are assumed to
be continuous. Note that no assumption is made regarding the shape of the population
distribution. The basic concept underlying the rank-sum test can now be explained by the
following intuitive line of reasoning. Suppose that the two sets of observations are plotted on
the same diagram, using different markings to identify their sources. Under H0 the samples
come from the same population so that the two sets of points should be well mixed. How-
ever, if the larger observations are more often associated with the first sample, for example,
we can infer that Population A is possibly shifted to the right of Population B. These two
situations are diagrammed below, where the combined set of points in each case is serially
numbered from left to right. These numbers are called the combined sample ranks. In the
following figure, large as well as small ranks are associated with each sample.
A B A B B A B A A
Rank 1 2 3 4 5 6 7 8 9
Ranks are well mixed (H0 is probably true)
In the figure below, most of the larger ranks are associated with the first sample.
B B B A B A A A A
Rank 1 2 3 4 5 6 7 8 9
Sample A contains more of the larger ranks
(H1 is probably true)
Therefore, considering the sum of the ranks associated with the first sample as a test
statistic, a large value of this statistic should reflect that the first population is located to
the right of the second. To establish a rejection region with a specified level of significance,
we must consider the distribution of the rank-sum statistic under the null hypothesis.
The formal model assumes that both distributions are continuous. We test the hypothe-
ses H0 : The two population distributions are identical against the alternative H1 : The
distribution of Population A is shifted to the right of the distribution of Population B.
As originally proposed by Wilcoxon, the test is thus based on W1 , the sum of the ranks
of the values of the first sample, or on W2 , the sum of the ranks of the values of the second
sample. It does not matter whether we choose W1 or W2 , for if there are n1 values in the
first sample and n2 values in the second sample, W1 + W2 is always the sum of the first
n1 + n2 positive integers, which is (n1 + n2)(n1 + n2 + 1)/2.
In actual practice, we do not base tests directly on W1 or W2. Instead, we use the related statistics U1 = W1 − n1(n1 + 1)/2, U2 = W2 − n2(n2 + 1)/2, or U = min(U1, U2). The tests based on
U1 , U2 , or U are all equivalent to those based on W1 or W2 , but they have the advantage
that they lend themselves more readily to the construction of tables of critical values. The
correct statistic to use, together with the right critical value, is summarized in the following
table, where in each case the level of significance is α.
Alternative hypothesis   Reject the null hypothesis if
µ1 < µ2                  U1 ≤ U2α
µ1 > µ2                  U2 ≤ U2α
µ1 ≠ µ2                  U ≤ Uα
The critical values of the U statistic for α = 0.1, 0.05, 0.02 and 0.01 are respectively
given in Table 6.6, Table 6.7, Table 6.8 and Table 6.9. For these tables, rows represent n1
and columns represent n2 .
Example 2.3.1 Suppose that we want to compare two kinds of emergency flares, Brand A
and Brand B on the basis of the following burning times (rounded to the nearest tenth of
a minute):
Test at the 0.05 level of significance whether the two samples come from identical continuous
populations or whether the average burning time of Brand A flares is less than that of Brand
B flares.
W1 = 1 + 3 + 4 + 5 + 7 + 10 + 12 + 13 + 14 = 69
U1 = 69 − 9(10)/2 = 24
Thus the null hypothesis must be rejected: we conclude that on the average Brand A flares
have a shorter burning time than Brand B flares.
Example 2.3.2 To determine if a new hybrid (A) seedling produces a bushier flowering
plant than a currently popular variety (B), a horticulturist plants 2 new hybrid seedlings
and 3 currently popular seedlings in a garden plot. After the plants mature, the following
measurements of shrub girth in inches are recorded:
A : 31.8 39.1
B : 35.5 27.6 21.3
Do these data strongly indicate that the new hybrid produces larger shrubs than the current
variety?
Solution: We wish to test the null hypothesis H0 : Populations A and B are identical vs.
the alternative hypothesis H1 : Population A is shifted from B toward larger values. In the
rank-sum test, the two samples are placed together and ranked from smallest to largest.
Rank sums for A and B are respectively given by W1 = 3+5 = 8 and W2 = 1+2+4 = 7.
U1 = W1 − n1(n1 + 1)/2 = 8 − 2(3)/2 = 5 and U2 = W2 − n2(n2 + 1)/2 = 7 − 3(4)/2 = 1. From Table 6.6, U2α = U0.1 = 0, so we cannot reject H0. Note that for sample sizes this small, no matter what was observed we cannot reject the null hypothesis at the 0.05 level of significance.
When n1 and n2 are both greater than 8, it is considered reasonable to assume that U1 and U2 are values of random variables having approximately normal distributions. To perform the U test based on this assumption, we need the following result.
Theorem 2.3.3 Under the assumptions required by the U test, U1 and U2 are values of random variables having mean µ = n1 n2 / 2 and variance σ² = n1 n2 (n1 + n2 + 1) / 12.
Example 2.3.4 The following are the weight gains (in pounds) of two random samples of
young turkeys fed two different diets but otherwise kept under identical conditions:
W1 = 1 + 2 + 3 + 4 + 5.5 + 7 + 8 + 10 + 11 + 12 + 13 + 15 + 16 + 21 + 22 + 31 = 181.5
U1 = W1 − n1(n1 + 1)/2 = 181.5 − 16(17)/2 = 45.5
Since µ = n1 n2 / 2 = 16(16)/2 = 128 and σ² = n1 n2 (n1 + n2 + 1)/12 = 16(16)(33)/12 = 704, we get

z = (U1 − µ)/σ = (45.5 − 128)/√704 = −3.11 < −2.33,

so the null hypothesis must be rejected. We conclude that on the average the second diet produces a greater gain in weight.
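The rank-sum calculation with the normal approximation of Theorem 2.3.3 can be sketched generically in Python. The two samples below are placeholders, since the turkey weight-gain data are not reproduced in these notes; with both sample sizes greater than 8 the normal approximation applies.

```python
import math
from scipy.stats import rankdata

# Placeholder samples (not the actual weight-gain data).
sample1 = [12.1, 10.5, 9.8, 11.2, 8.7, 10.0, 9.1, 13.4, 10.9, 9.5]
sample2 = [14.2, 13.8, 12.9, 15.1, 13.3, 14.7, 12.5, 16.0, 13.0, 15.4]
n1, n2 = len(sample1), len(sample2)

ranks = rankdata(sample1 + sample2)           # joint ranking of the combined sample
W1 = ranks[:n1].sum()                         # rank sum of the first sample
U1 = W1 - n1 * (n1 + 1) / 2

mu = n1 * n2 / 2
sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (U1 - mu) / sigma
print(W1, U1, round(z, 2))                    # compare z with the appropriate normal critical value
```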
Example 2.3.5 Two geological formations are compared with respect to richness of min-
eral content. The mineral contents of 7 specimens of ore collected from Formation 1 and
5 specimens collected from Formation 2 are measured by chemical analysis. The following
data are obtained:
Do the data provide strong evidence that Formation 1 has a higher mineral content than
Formation 2? Test with α = .05.
Values 3.7 3.9 4.1 4.7 4.9 6.1 6.4 6.8 7.6 9.8 11.1 15.1
Ranks 1 2 3 4 5 6 7 8 9 10 11 12
U2 = W2 − n2(n2 + 1)/2 = 17 − 5(6)/2 = 2
Since U2 ≤ 6, the null hypothesis must be rejected. We conclude that Formation 1 has a
higher mineral content than Formation 2.
Example 2.3.6 Flame-retardant materials are tested by igniting a paper tab on the hem
of a dress worn by a mannequin. One response is the vertical length of damage to the fabric
measured in inches. The following data for 5 samples, each taken from two fabrics, are
obtained by researchers at the National Bureau of Standards as part of a larger cooperative
study. Do the data provide strong evidence that a difference in flammability exists between
the two fabrics? Test with α = .05.
Solution: H0 : µ1 = µ2 and H1 : µ1 ≠ µ2
The combined values and their ranks are given below. The values from Fabric B are in
bold.
Values 4.6 4.9 5.3 5.7 6.0 6.2 6.5 7.3 7.4 7.6
Ranks 1 2 3 4 5 6 7 8 9 10
2.4 The Kruskal-Wallis Test

The Kruskal-Wallis H test extends the rank-sum test to k ≥ 3 independent samples. All n = n1 + n2 + · · · + nk observations are ranked jointly, and if Ri denotes the sum of the ranks of the ith sample, the test statistic is

H = [12 / (n(n + 1))] Σ Ri²/ni − 3(n + 1),

which under the null hypothesis that the k populations are identical has approximately a chi-square distribution with k − 1 degrees of freedom.

Example 2.4.1 The following are the final examination grades of samples from three
groups of students who were taught German by three different methods (classroom instruc-
tion and language laboratory, only classroom instruction, and only self-study in language
laboratory):
First method 94 88 91 74 87 97
Second method 85 82 79 84 61 72 80
Third method 89 67 72 76 69
Use the H test at the 0.05 level of significance to test the null hypothesis that the three
methods are equally effective.
Since H = 6.67 exceeds 5.991, the null hypothesis must be rejected. We conclude that the
three methods are not all equally effective.
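Because the grade data are listed in full, the value H = 6.67 can be verified directly. The helper function below is my own sketch of the H statistic (without a tie correction, matching the value quoted in the example):

```python
from scipy.stats import rankdata, chi2

def kruskal_wallis_H(samples):
    """H = 12/(n(n+1)) * sum(R_i^2 / n_i) - 3(n+1), using average ranks for ties."""
    sizes = [len(s) for s in samples]
    n = sum(sizes)
    ranks = rankdata([x for s in samples for x in s])
    total, start = 0.0, 0
    for n_i in sizes:
        R_i = ranks[start:start + n_i].sum()   # rank sum of this sample
        total += R_i**2 / n_i
        start += n_i
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

method1 = [94, 88, 91, 74, 87, 97]
method2 = [85, 82, 79, 84, 61, 72, 80]
method3 = [89, 67, 72, 76, 69]
H = kruskal_wallis_H([method1, method2, method3])
print(round(H, 2), "vs critical value", round(chi2.ppf(0.95, df=2), 3))   # about 6.67 vs 5.991
```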
2.5 Test of Randomness

A run is a maximal string of identical symbols in an arrangement of two types of symbols.

Example 2.5.1
• The results of replications of an experiment whose outcomes are success and failure were F F S S S S F F. This arrangement contains u = 3 runs.
Let n1 and n2 be the number of occurrences of the two types, and let n = n1 + n2. If u is the number of runs, we reject the hypothesis of randomness at the significance level α if either u ≤ u′_{α/2} or u ≥ u_{α/2}, where u′_{α/2} and u_{α/2} can be found in Tables 6.10 to 6.13.
Example 2.5.2
The win-loss information of a basketball team for the past 22 games is given below:
Solution
H0 : Arrangement is random vs. H1 : Arrangement is not random
Since n1 = 13 and n2 = 9, from Table 6.10 and Table 6.11 we get that the rejection
region is {u ≤ 6} ∪ {u ≥ 17}. Here u = 6, which falls in the rejection region. Hence we
reject the null hypothesis and conclude that the arrangement is not random.
Theorem 2.5.3 Under the assumption that the null hypothesis of randomness holds, the run random variable U has mean and variance given by

µ_U = 2n1n2 / (n1 + n2) + 1   and   σ²_U = 2n1n2(2n1n2 − n1 − n2) / [(n1 + n2)²(n1 + n2 − 1)].

When n1 and n2 are large (≥ 10), the distribution of U can be approximated by a normal distribution.
Example 2.5.4 The following is an arrangement of men and women lined up to purchase
ticket for a rock concert.
MWMWMMMWMWMM
MWWMMMMWWMWM
MMWMMMWWWMWM
MMWMWMMMMWWM
Test for randomness at the 0.05 significance level.
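The worked solution to this example is not reproduced in the notes, but the computation is mechanical. The Python sketch below counts the runs in the queue, reading the four rows left to right and top to bottom (an assumption about how the arrangement should be read), and applies the normal approximation of Theorem 2.5.3.

```python
import math

sequence = ("MWMWMMMWMWMM"
            "MWWMMMMWWMWM"
            "MMWMMMWWWMWM"
            "MMWMWMMMMWWM")
n1 = sequence.count("M")
n2 = sequence.count("W")

# A run is a maximal block of identical letters; count the places where the letter changes.
u = 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)

mu = 2 * n1 * n2 / (n1 + n2) + 1
var = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
z = (u - mu) / math.sqrt(var)
print(n1, n2, u, round(z, 2))    # reject randomness at the 0.05 level if |z| > 1.96
```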
2.6 Measures of Correlation Based on Ranks

Let (xi, yi), i = 1, . . . , n be n observations on a pair of attributes. The rank correlation coefficient is given by

rs = 1 − [6 Σ di²] / [n(n² − 1)]

where di is the difference between the ranks assigned to xi and yi. When there are no ties, rs is the correlation coefficient calculated for the ranks. This is because

Σ ri = Σ si = n(n + 1)/2,
Σ ri² = Σ si² = n(n + 1)(2n + 1)/6,
Σ ri si = n(n + 1)(2n + 1)/6 − (Σ di²)/2.
This rank correlation shares the properties of r such as always being between 1 and −1 and
that values near 1 indicate a tendency for the larger values of X to be paired with the larger
values of Y . However, the rank correlation is more meaningful, because its interpretation
does not require the relationship to be linear.
2.6.1 Properties of the rank correlation coefficient

1. rs always lies between −1 and +1.
2. rs near +1 indicates a tendency for the larger values of X to be associated with the larger values of Y. Values near −1 indicate the opposite relationship.
3. The association need not be linear; only an increasing (or decreasing) relationship is required.
Example 2.6.1 An interviewer in charge of hiring large numbers of typists wishes to de-
termine the strength of the relationship between ranks given on the basis of an interview
and scores on an aptitude test. Calculate rs based on the data for 6 applicants given below:
Interview rank 5 2 3 1 6 4
Aptitude score 47 32 29 28 56 38
Interview 5 2 3 1 6 4
Aptitude 5 3 2 1 6 4
rs = 1 − [6 Σ di²] / [n(n² − 1)] = 1 − 6(2)/(6(35)) = 0.9429

The relationship between interview rank and aptitude score appears to be quite strong.
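For these interview data the calculation can be replicated directly in Python; rankdata assigns average ranks, though there happen to be no ties here.

```python
from scipy.stats import rankdata

interview_rank = [5, 2, 3, 1, 6, 4]
aptitude_score = [47, 32, 29, 28, 56, 38]
n = len(interview_rank)

r = rankdata(interview_rank)      # already ranks: 5 2 3 1 6 4
s = rankdata(aptitude_score)      # ranks of the aptitude scores: 5 3 2 1 6 4

d_sq = sum((ri - si) ** 2 for ri, si in zip(r, s))
r_s = 1 - 6 * d_sq / (n * (n**2 - 1))
print(d_sq, round(r_s, 4))        # 2.0 and 0.9429
```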
The figure above helps to stress the point that rs is a measure of any monotone relationship,
not merely a linear relation. Here the value of rs is 1 while the value of the correlation
coefficient will be less than 1.
Theorem 2.6.2 Under the null hypothesis that there is no correlation, the mean and variance of rs are given by

E(rs) = 0   and   V(rs) = 1 / (n − 1).
Example 2.6.3 The following are the numbers of hours that 10 students studied for an examination and the scores that they obtained. Calculate rs and test the hypothesis at the 0.01 level that there is no significant monotone relationship between the two variables.

Σ di² = 3 and hence rs = 1 − 6(3)/(10(99)) = 0.98. Thus z = rs √(n − 1) = 0.98(3) = 2.94, which exceeds 2.576. Thus we conclude that there is a relationship between study time and scores.
Chapter 3
Sampling
3.1 Simple Random Sampling

3.1.1 Introduction

Definition 3.1.1 If a sample of size n is drawn from a population of size N in such a way
that every possible sample of size n has the same chance of being selected the sampling
procedure is called simple random sampling. The sample thus obtained is called a simple
random sample.
It is a consequence of this definition that all individual elements in a population have the
same chance of being selected; however, this statement cannot be taken as a definition of
simple random sampling, because it does not imply that all samples of size n have the same
chance of being selected.
We will use simple random sampling to obtain estimators for population means, totals, and
proportions.
Consider the following problem. A federal auditor is to examine the accounts for a city
hospital. The hospital records obtained from a computer show a particular accounts receiv-
able total, and the auditor must verify this total. If there are 28,000 open accounts in the
hospital, the auditor cannot afford the time to examine every patient record to obtain a
total accounts receivable figure. Hence the auditor must choose some sampling scheme for
obtaining a representative sample of patient records. After examining the patient accounts
in the sample, the auditor can then estimate the accounts receivable total for the entire
hospital. If the computer figure lies within a specified distance of the auditor’s estimate, the
computer figure is accepted as valid. Otherwise, more hospital records must be examined
for possible discrepancies between the computer figure and the sample data.
Suppose that all N = 28, 000 patient records are recorded on computer cards and a sample
size n = 100 is to be drawn. The sample is called a simple random sample if every possible
sample of n = 100 records has the same chance of being selected.
Simple random sampling forms the basis of most of the sampling designs discussed in this
course, and it forms the basis of most scientific surveys done in practice. The Nielsen Tele-
vision Index (NTI) is the most widely used audience measurement service in existence. It is
based on a random sample of approximately twelve hundred households that have a storage
instantaneous audimeter connected to the television set. This meter records whether or not
the television set is on, what channel is being viewed, and channel changes. In an addi-
tional random sample of families each family keeps a diary on who watches various shows.
The NTI reports the number of households in the audience, the type of audience, and the
amount of television viewing for various time periods.
The Gallup poll actually begins with a random sample of approximately 300 election dis-
tricts, sampled from the 200,000 election districts in the United States. Households for interviewing are then selected from each district by another randomization device. The sampling
is in two stages, but simple random sampling plays a key role at each stage.
Auditors study simple random samples of accounts in order to check for compliance with
audit controls set up by the firm or to verify the actual dollar value of the accounts. Thus
they may wish to estimate the proportion of accounts not in compliance with controls or
the total value of say, accounts receivable.
Marketing research often involves a simple random sample of potential users of a product.
The researcher may want to estimate the proportion of potential buyers who prefer a certain
color of car or flavor of food.
A forester may estimate the volume of timber or proportion of diseased trees in a forest by
selecting geographic points in the area covered by the forest and then attaching a plot of
fixed size and shape (such as a circle of 10-meter radius) to that point. All the trees within
the sample plots may be studied, but, again, the basic design is a simple random sample.
Two problems now face the experimenter: (1) how does he or she draw the simple random
sample, and (2) how can he or she estimate the various population parameters of interest?
These topics are discussed in the following sections.
3.1.2 How to Draw a Simple Random Sample
Simple random samples can be selected by using random numbers. Tables of random num-
bers are available in many statistics books. One may also use a computer to generate
random numbers.
A random number table is a set of integers generated so that in the long run the table will
contain all ten integers (0, 1, · · · , 9) in approximately equal proportions, with no trends in
the pattern in which the digits were generated. Thus if one number is selected from a ran-
dom point in the table, it is equally likely to be any of the digits 0 through 9.
Choosing numbers from the table is analogous to drawing numbers out of a hat containing
those numbers on thoroughly mixed pieces of paper. Suppose we want a simple random
sample of three persons to be selected from seven. We could number the people from 1
to 7, put slips of paper containing these numbers (one number to a slip) into a hat, mix
them, and draw out three, without replacing drawn numbers. Analogously, we could drop a
pencil point on a random starting point in a random number table. Suppose the point falls
on the 15th line of column 9 and we decide to use the right-most digit (a 5, in this case).
This procedure is like drawing a 5 from the hat. We may now proceed in any direction to
obtain the remaining numbers in the sample. Suppose we decide before starting to proceed
down the page. The number immediately below the 5 is a 2, so our second sampled person
is number 2. Proceeding, we next come to an 8, but there are only seven people in our
population; hence the 8 must be ignored. Two more 5s then appear, but both must be
ignored since person 5 has already been selected. (The 5 has been removed from the hat.)
Finally, we come to a 1, and our sample of three is completed with persons numbered 5, 2,
and 1.
Note that any starting point can be used and one can move in any predetermined direction.
If more than one sample is to be used in any problem, each should have its own unique
starting point. Many computer programs, such as Minitab, can be used to generate random
numbers.
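In practice a computer, rather than a printed table, usually supplies the random numbers. The short sketch below is our own illustration (not part of the original notes) of how the two selections described above could be made with Python's standard library; the seed is fixed only so that a run is reproducible.

import random

random.seed(2024)                       # any seed; fixed only for reproducibility

# A simple random sample of 3 persons out of 7, without replacement
people = list(range(1, 8))              # persons numbered 1 to 7
print(random.sample(people, 3))         # three distinct labels between 1 and 7

# A simple random sample of n = 20 record numbers out of N = 1000
records = range(1, 1001)                # records numbered 1 to 1000
print(sorted(random.sample(records, 20)))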
A more realistic illustration is given in Example 3.1.2.
Example 3.1.2 For simplicity, assume there are N = 1000 patient records from which a
simple random sample of n = 20 is to be drawn. We know that a simple random sample
will be obtained if every possible sample of n = 20 records has the same chance of being
selected. The digits in the random number table are generated to satisfy the conditions of
simple random sampling. Determine which records are to be included in a sample of size
n = 20.
Solution: We can think of the accounts as being numbered 001, 002, · · · , 999, 000. That
is, we have 1000 three-digit numbers, where 001 represents the first record, 999 the 999th
patient record, and 000 the 1000th.
Refer to a random number table and use the first column; if we drop the last two digits of each number, the first three-digit number formed is 104, the second is 223, the third is 241, and so on. Continuing in this way until 20 distinct three-digit numbers are obtained gives the records shown in Figure 3.1.
If the records are actually numbered, we merely choose the records with the corresponding
numbers, and these records represent a simple random sample of n = 20 from N = 1000.
If the patient accounts are not numbered, we can refer to a list of the accounts and count
from the 1st to the 10th, 23rd, 70th, and so on, until the desired numbers are reached. If
a random number occurs twice, the second occurrence is omitted, and another number is
selected as its replacement.
Figure 3.1: Patient records to be included in the sample
3.1.3 Estimation of Population Mean and Total
Suppose that a simple random sample of n accounts is drawn, and we are to estimate the mean value per account for the total population of hospital records. Intuitively, we would employ the sample average
ȳ = (1/n) ∑_{i=1}^{n} yi
to estimate µ.
Of course, a single value of ȳ tells us very little about the population mean µ, unless we
are able to evaluate the goodness of our estimator. Hence in addition to estimating µ, we
would like to place a bound on the error of estimation. It can be shown that ȳ possesses
many desirable properties for estimating µ. In particular, ȳ is an unbiased estimator of µ
and has a variance that decreases as the sample size n increases. More precisely for a simple
random sample chosen without replacement from a population of size N ,
E[ȳ] = µ
and
V(ȳ) = σ²/n    (3.1)
In addition,
E[s²] = ( N/(N − 1) ) σ²
where s2 is the sample variance. We make use of these in order to estimate µ and construct
an error bound.
Example 3.1.3 Refer to the hospital audit of Example 3.1.2 and suppose that a random
sample of n = 200 accounts is selected from the total of N = 1000. The sample mean of the
accounts is found to be ȳ = 94.22, and the sample variance is s2 = 445.21. Estimate µ, the
average due for all 1000 hospital accounts, and place a bound on the error of estimation.
Solution:
The estimate of the population mean µ is the sample mean of the n = 200 sampled accounts,
ȳ = $94.22
The estimated variance of ȳ, using the sample variance s² in place of σ² and including the finite population correction, is
V̂(ȳ) = (s²/n) ( (N − n)/N ) = (445.21/200) ( (1000 − 200)/1000 ) = 1.781
The error bound is given by
2 √V̂(ȳ) = 2 √1.781 = $2.67
To summarize, the estimate of the mean amount of money owed per account, µ, is ȳ = $94.22. Although we cannot be certain how close ȳ is to µ, we are reasonably confident that the error of estimation is less than $2.67.
Many sample surveys are conducted to obtain information about a population total. The
federal auditor of Example 3.1.2 would probably be interested in verifying the computer
figure for the total accounts receivable (in dollars) for the N = 1000 open accounts.
You recall that the mean for a population of size N is the sum of all observations in the pop-
ulation divided by N . The population total, the sum of all observations in the population,
is denoted by the symbol τ . Hence
Nµ = τ
Intuitively, we expect the estimator of τ to be N times the estimator of µ.
Note that the estimated variance of τ̂ = N ȳ in Equation (3.6) is N² times the estimated variance of ȳ.
Example 3.1.5 An industrial firm is concerned about the time per week spent by scientists
on certain trivial tasks. The time log sheets of a simple random sample of n = 50 employees
show the average amount of time spent on these tasks is 10.31 hours with a sample variance
s2 = 2.25. The company employs N = 750 scientists. Estimate the total number of man-
hours lost per week on trivial tasks, place a bound on the error of estimation.
Solution:
The estimate of τ is given by
τ = N ȳ = 750(10.31) = 7732.5hours
Thus the estimate of total time lost is τ̂ = 7732.5 hours and we are reasonably confident
that the error of estimation is less than 307.4 hours.
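The 307.4-hour bound quoted above follows from τ̂ = Nȳ together with V̂(τ̂) = N²V̂(ȳ) (the relation noted above in connection with Equation (3.6)) and the estimated variance of ȳ used in Example 3.1.3. A quick numerical check, written as a hedged sketch rather than as part of the original solution:

from math import sqrt

N, n = 750, 50
ybar, s2 = 10.31, 2.25

tau_hat = N * ybar                              # estimated total: 7732.5 hours
v_tau = N**2 * (s2 / n) * ((N - n) / N)         # estimated variance of tau_hat
print(round(tau_hat, 1), round(2 * sqrt(v_tau), 1))   # 7732.5  307.4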
3.1.4 Selecting the Sample Size for Estimating Population Means and Totals
The number of observations needed to estimate µ with a bound B on the error of estimation is found by setting two standard deviations of the estimator ȳ equal to B and solving for n, to get
n = Nσ² / ( (N − 1)D + σ² )    (3.9)
where D = B²/4.
Solving for n in a practical situation presents a problem because the population variance σ 2
is unknown. Since a sample variance s2 is frequently available from prior experimentation,
we can obtain an approximate sample size by replacing σ 2 with s2 in Equation 3.9. We will
illustrate a method for guessing a value of σ 2 when very little prior information is available.
If N is large, as it usually is, the (N − 1) can be replaced by N in the denominator of
equation 3.9.
Example 3.1.6 The average amount of money for a hospital’s accounts receivable must be
estimated. Although no prior data are available to estimate the population variance σ², it is known that most accounts lie within a range of $100. There are N = 1000 open accounts. Find
the sample size needed to estimate µ with a bound on the error of estimation B = $3.
Solution: We need an estimate of σ², the population variance. Since the range is often approximately equal to four standard deviations (4σ), one-fourth of the range will provide an approximate value of σ:
σ ≈ range/4 = 100/4 = 25
Hence σ is taken to be approximately 25, and σ² = 625. Then
D = B²/4 = 3²/4 = 2.25
and
n = 1000(625) / ( 999(2.25) + 625 ) = 217.56
That is, we need approximately 218 observations to estimate µ,the mean accounts receivable,
with a bound on the error of estimation of $3.00.
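The same calculation can be written as a small helper function; this is a sketch of ours, and the function name is not from the notes:

def sample_size_mean(N, sigma2, B):
    # Approximate SRS sample size for estimating a mean with error bound B
    D = B ** 2 / 4
    return N * sigma2 / ((N - 1) * D + sigma2)

print(sample_size_mean(1000, 625, 3))   # 217.56..., so take n = 218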
In like manner, we can determine the number of observations needed to estimate a population
total τ , with a bound on the error of estimation of magnitude B. The required sample size
is found by setting two standard deviations of the estimator equal to B and solving this
expression for n. Proceeding as we did earlier, we get
n = Nσ² / ( (N − 1)D + σ² )    (3.10)
where D = B²/(4N²).
Example 3.1.7 An investigator wants to estimate the total weight gain, from 0 to 4 weeks, of N = 1000 chicks. Prior information suggests that σ² ≈ 36. Find the sample size needed to estimate τ with a bound on the error of estimation equal to B = 1000 grams.
Solution: We can obtain an approximate sample size using Equation (3.10) with σ² = 36. First,
D = B²/(4N²) = (1000)² / ( 4(1000)² ) = 0.25
That is,
n = Nσ² / ( (N − 1)D + σ² ) = 1000(36.00) / ( 999(0.25) + 36.00 ) = 125.98
The investigator, therefore, needs to weigh n = 126 chicks to estimate τ , the total weight
gain for N = 1000 chickens in 0 to 4 weeks, with a bound on the error of estimation equal
to 1000 grams.
3.1.5 Estimation of Population Proportion
Surveys are frequently conducted to estimate the proportion of a population that falls into some category of interest. One may, for example, wish to estimate the proportion of sales of diet preparations that is attributable to a particular product. That is, what percentage of
sales is accounted for by a particular product? A forest manager may be interested in the
proportion of trees with a diameter of 12 inches or more. Television ratings are often deter-
mined by estimating the proportion of the viewing public that watches a particular program.
You will recognize that all these examples exhibit a characteristic of the binomial exper-
iment, that is, an observation either does belong or does not belong to the category of
interest. For example, one can estimate the proportion of eligible voters in a particular
district by examining population census data for several of the precincts within the district.
An estimate of the proportion of voters between 18 and 21 years of age for the entire district
will be the fraction of potential voters from the precincts sampled that fell into this age range.
In subsequent discussion we denote the population proportion and its estimator by the
symbols p and p̂, respectively. The properties of p̂ for simple random sampling parallel
those of the sample mean ȳ if the response measurements are defined as follows: Let yi = 0
if the ith element sampled does not possess the specified characteristic and yi = 1 if it does.
Then the total number of elements in a sample of size n possessing a specified characteristic
is
∑_{i=1}^{n} yi
If we draw a simple random sample of size n, the sample proportion p̂ is the fraction of the
elements in the sample that possess the characteristic of interest. For example, the estimate
p̂ of the proportion of eligible voters between the ages of 18 and 21 in a certain district is
p̂ = ( number of voters sampled between the ages of 18 and 21 ) / ( number of voters sampled )
or
p̂ = (1/n) ∑_{i=1}^{n} yi = ȳ
In other words, p̂ is the average of the 0 and 1 values from the sample. Similarly, we can
think of the population proportion as the average of the 0 and 1 values for the entire pop-
ulation (that is, p = µ).
The estimated variance of p̂ is
V̂(p̂) = ( p̂q̂/(n − 1) ) ( (N − n)/N )
where q̂ = 1 − p̂.
Bound on the error of estimation:
2 √V̂(p̂) = 2 √[ ( p̂q̂/(n − 1) ) ( (N − n)/N ) ]    (3.13)
Example 3.1.8 A simple random sample of n = 100 college seniors was selected to estimate
(1) the fraction of N = 300 seniors going on to graduate school and (2) the fraction of
students that have held part-time jobs during college. Let yi and xi (i = 1, 2, · · · , 100) denote the responses of the ith student sampled. We will set yi = 0 if the ith student does not plan to attend graduate school and yi = 1 if he does. Similarly, let xi = 0 if he has not held a part-time job sometime during college and xi = 1 if he has. Using the sample
data presented in the accompanying table, estimate p1 , the proportion of seniors planning
to attend graduate school, and p2 ,the proportion of seniors who have had a part-time job
sometime during their college careers (summers included).
Solution: From the table, the estimate of p1 is p̂1 = 0.15, with bound on the error of estimation
2 √V̂(p̂1) = 2 √[ ( p̂1q̂1/(n − 1) ) ( (N − n)/N ) ] = 2 √[ ( (0.15)(0.85)/99 ) ( (300 − 100)/300 ) ] = 0.059
and the estimate of p2 is p̂2 = 0.65, with bound
2 √V̂(p̂2) = 2 √[ ( p̂2q̂2/(n − 1) ) ( (N − n)/N ) ] = 2 √[ ( (0.65)(0.35)/99 ) ( (300 − 100)/300 ) ] = 0.078
Thus we estimate that 15% of the seniors plan to attend graduate school, with a bound on
the error of estimation equal to 0.059. We estimate that 65% of the seniors have held a
part-time job during college, with a bound on the error of estimation equal to 0.078.
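As a check on the two bounds just quoted, the following sketch (ours, not the authors') evaluates 2√V̂(p̂) = 2√[(p̂q̂/(n − 1))((N − n)/N)] for both proportions:

from math import sqrt

def prop_bound(p_hat, n, N):
    # Two-standard-deviation error bound for a sample proportion under SRS
    return 2 * sqrt((p_hat * (1 - p_hat) / (n - 1)) * ((N - n) / N))

N, n = 300, 100
print(round(prop_bound(0.15, n, N), 3))   # 0.059  (plans to attend graduate school)
print(round(prop_bound(0.65, n, N), 3))   # 0.078  (has held a part-time job)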
We have shown that the population proportion p can be regarded as the average (µ)of the
0 and 1 values for the entire population. Hence the problem of determining the sample size
required to estimate p to within B units should be analogous to determining a sample size
for estimating µ with a bound on the error of estimation B. You will recall that the required
sample size for estimating µ is given by
n = Nσ² / ( (N − 1)D + σ² )    (3.14)
where D = B²/4. The corresponding sample size needed to estimate p can be found by replacing σ² with pq:
n = Npq / ( (N − 1)D + pq )    (3.15)
where q = 1 − p and D = B²/4.
Example 3.1.9 Student government leaders at a college want to conduct a survey to de-
termine the proportion of students that favors a proposed honor code. Since interviewing
N = 2000 students in a reasonable length of time is almost impossible, determine the sam-
ple size (number of students to be interviewed) needed to estimate p with a bound on the
error of estimation of magnitude B = 0.05. Assume that no prior information is available
to estimate p.
Solution: We can approximate the required sample sizes when no prior information is
available by setting p = 0.5. We have
D = B²/4 = (0.05)²/4 = 0.000625
n = N / ( 4(N − 1)D + 1 ) = 2000 / ( 4(1999)(0.000625) + 1 ) = 333.47
That is, 334 students must be interviewed to estimate the proportion of students that favors
the proposed honor code with a bound on the error of estimation of B = 0.05.
Example 3.1.10 Referring to Example 3.1.9, suppose that in addition to estimating the
proportion of students that favors the proposed honor code, student government leaders also
want to estimate the number of students who feel the student union building adequately
serves their needs. Determine the combined sample size required for a survey to estimate
p1 , the proportion that favors the proposed honor code, and p2 , the proportion that believes
the student union adequately serves its needs, with bounds on the errors of estimation of
magnitude B1 = 0.05 and B2 = 0.07. Although no prior information is available to estimate
p1 , approximately 60 % of the students believed the union adequately met their needs in a
similar survey run the previous year.
Solution: In this example we must determine a sample size n that will allow us to estimate
p1 , with a bound B1 = 0.05 and p2 with a bound B2 = 0.07. First, we determine the
sample sizes that satisfy each objective separately. The larger of the two will then be the
combined sample size for a survey to meet both objectives. From Example 3.1.9 the sample
size required to estimate p1 , with a bound on the error of estimation of B1 = 0.05 was
n = 334 students. We can use data from the survey of the previous year to determine the
sample size needed to estimate p2 . We have
D = B²/4 = (0.07)²/4 = 0.001225
n = Npq / ( (N − 1)D + pq ) = (2000)(0.6)(0.4) / ( (1999)(0.001225) + (0.6)(0.4) ) = 480/2.68878 = 178.52
That is, 179 students must be interviewed to estimate p2 . The sample size required to
achieve both objectives in one survey is 334, the larger of the two sample sizes.
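Both sample-size calculations can be packaged in one small function; this is an illustrative sketch of ours, not code from the notes:

def sample_size_prop(N, p, B):
    # Approximate SRS sample size for estimating a proportion with error bound B
    D = B ** 2 / 4
    return N * p * (1 - p) / ((N - 1) * D + p * (1 - p))

print(sample_size_prop(2000, 0.5, 0.05))   # 333.5, so interview 334 students
print(sample_size_prop(2000, 0.6, 0.07))   # 178.5, so interview 179 students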
3.1.6 Sampling with Probabilities Proportional to Size
Suppose that a sample of n elements is drawn, with replacement, in such a way that element i of the population is selected with probability πi on each draw. Unbiased estimators of µ and τ, along with their estimated variances and bounds on the error of estimation, are as follows:
µ̂pps = ( 1/(Nn) ) ∑_{i=1}^{n} yi/πi,    V̂(µ̂pps) = ( 1/(N²n(n − 1)) ) ∑_{i=1}^{n} ( yi/πi − τ̂pps )²
τ̂pps = N µ̂pps = (1/n) ∑_{i=1}^{n} yi/πi,    V̂(τ̂pps) = N² V̂(µ̂pps)
In each case the bound on the error of estimation is two standard deviations of the estimator.
These estimators are unbiased for any choices of πi , but it is clearly in the best interest of
the experimenter to choose these probabilities so that the variances of the estimators are
as small as possible. The best practical way to choose the πi ’s is to choose them proportional
to a known measurement that is highly correlated with yi . In the problem of estimating
total number of job openings, firms can be sampled with probabilities proportional to their
total work force, which should be known fairly accurately before the sample is selected.
The number of job openings per firm is not known before sampling, but it should be highly
correlated with the total number of workers in the firm.
Example 3.1.11 An investigator wishes to estimate the average number of defects per
board on boards of electronic components manufactured for installation in computers. The
boards contain varying numbers of components, and the investigator feels that the number
of defects should be positively correlated with the number of components on a board. Thus
pps sampling is used, with the probability of selecting any one board for the sample being
proportional to the number of components on that board. A sample of n = 4 boards is to
be selected from the N = 10 boards of one day's production. The numbers of components on the ten boards are known; in total the boards carry 150 components. Determine which boards enter the sample.
Solution: We list the number of components (our measure of size) in a column and list the
cumulative ranges and desirable πi ’s in adjacent columns, as follows:
There are 150 components in the population to be sampled. We can think of these compo-
nents as being numbered from 1 to 150. The cumulative range column keeps track of the
interval of numbered components on each board. Board number 1 has the first 10 compo-
nents, board number 2 has components 11 through 22, and so on.
The π’s are simply the number of components per board divided by the total number of
components. The boards having greater numbers of components have larger probabilities
of selection.
To choose the sample of n = 4 boards, we enter the random number table and select four
random numbers between 1 and 150. The numbers we selected were 14, 56, 94, and 25. We
locate these numbers in the cumulative range column. The boards corresponding to those
range intervals constitute the sample.
Since 14 lies in the range of board 2, that board enters the sample. Similarly, 56 lies in
the range of board 5, 94 lies in the range of board 7, and 25 lies in the range of board 3.
Thus the sample consists of boards 2, 3, 5, and 7. These boards have been selected with
probabilities proportional to their numbers of components. Note that with this method we
could have sampled a particular board more than once.
Example 3.1.12 After the sampling of Example 3.1.11 was completed, the number of
defects found on boards 2, 3, 5, and 7 were, respectively, 1, 3, 2, and 1. Estimate the
average number of defects per board, and place a bound on the error of estimation.
Solution:
µ̂pps = ( 1/(Nn) ) ∑_{i=1}^{n} yi/πi = (1/40) [ 150/12 + 3(150)/22 + 2(150)/16 + 150/9 ] = 1.71
V̂(µ̂pps) = ( 1/(N²n(n − 1)) ) ∑_{i=1}^{n} ( yi/πi − τ̂pps )²
= ( 1/(10²(4)(3)) ) [ (150/12 − 17.10)² + (3(150)/22 − 17.10)² + (2(150)/16 − 17.10)² + (150/9 − 17.10)² ]
= 0.0295
The estimate of the average number of defects per board, with a bound on the error of
estimation, is then
1.71 ± 0.34
and the interval (1.37, 2.05) provides an approximate 95% confidence interval for the average
number of defects per board.
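The two pps computations above can be verified numerically. The sketch below is our own; the only inputs it needs are the defect counts and the component counts of the four sampled boards, which are the figures used in the worked example (the full list of ten board sizes is not reproduced in these notes).

from math import sqrt

N, n = 10, 4
y  = [1, 3, 2, 1]               # defects on sampled boards 2, 3, 5 and 7
c  = [12, 22, 16, 9]            # components on those boards; 150 components in all
pi = [ci / 150 for ci in c]     # selection probabilities proportional to size

r = [yi / p for yi, p in zip(y, pi)]              # the terms yi / pi_i
mu_pps  = sum(r) / (N * n)                        # about 1.71 defects per board
tau_pps = sum(r) / n                              # about 17.1, the estimated total
v_mu = sum((ri - tau_pps) ** 2 for ri in r) / (N**2 * n * (n - 1))
print(round(mu_pps, 2), round(2 * sqrt(v_mu), 2))   # 1.71  0.34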
3.2 Stratified Random Sampling
3.2.1 Introduction
A stratified random sample is obtained by separating the population elements into non-overlapping groups, called strata, and then selecting a simple random sample from each stratum. There are several reasons for preferring stratified random sampling to simple random sampling:
1. Stratification may produce a smaller bound on the error of estimation than would be produced by a simple random sample of the same size.
2. The cost per observation in the survey may be reduced by stratification of the population elements into convenient groupings.
3. Estimates of population parameters may be desired for subgroups of the population; these subgroups should then be identifiable strata.
These three reasons for stratification should be kept in mind when one is deciding whether or
not to stratify a population or deciding how to define strata. Sampling hospital patients on
a certain diet to assess weight gain may be more efficient if the patients are stratified by sex,
since men tend to weigh more than women. A poll of college students at a large university
may be more conveniently administered and carried out if students are stratified into on-
campus and off-campus residents. A quality control sampling plan in a manufacturing plant
may be stratified by production lines because estimates of proportions of defective products
may be required by the manager of each line.
3.2.2 How To Draw a Stratified Random Sample
The first step in the selection of a stratified random sample is to clearly specify the strata;
then each sampling unit of the population is placed into its appropriate stratum. This step
may be more difficult than it sounds.
For example, suppose that you plan to stratify the sampling units, say households, into rural
and urban units. What should be done with households in a town of 1000 inhabitants? Are
these households rural or urban? They may be rural if the town is isolated in the country,
or they may be urban if the town is adjacent to a large city. Hence to specify what is meant
by urban and rural is essential so that each sampling unit clearly falls into only one stratum.
After the sampling units are divided into strata, we select a simple random sample from
each stratum by using the techniques given in the chapter ’Simple Random Sampling’. We
discuss the problem of choosing appropriate sample sizes for the strata later in this chapter.
We must be certain that the samples selected from the strata are independent. That is, separate random samples should be drawn within each stratum, so that the observations chosen in one stratum do not depend upon those chosen in another.
The following example illustrates a situation in which stratified random sampling may be appropriate.
Example 3.2.2 An advertising firm wants to estimate the average number of hours per week that households in a certain county spend watching television. The county contains two towns, town A and town B, and a rural area. Discuss the merits of using stratified random sampling in this case.
Solution: The population of households falls into three natural groupings, two towns and
a rural area, according to geographic location. Thus to use these divisions as three strata
is quite natural simply for administrative convenience in selecting the samples and carrying
out the fieldwork. In addition, each of the three groups of households should have similar
behaviour patterns among residents within the group. We expect to see relatively small
variability in number of hours of television viewing among households within a group and
this is precisely the situation in which stratification produces a reduction in a bound on the
error of estimation.
The advertising firm may wish to produce estimates on average television-viewing hours for
each town separately. Stratified random sampling allows for these estimates.
For the stratified random sample, we have N1 = 155, N2 = 62 and N3 = 93 with N = 310.
3.2.3 Estimation of Population Mean and Total
How can we use the data from a stratified random sample to estimate the population mean?
Let ȳi denote the sample mean for the simple random sample selected from stratum i, ni
the sample size for stratum i, µi the population mean for stratum i, and τi the population
total for stratum i. Then the population total τ is equal to τ1 + τ2 + · · · + τL . We have
a simple random sample within each stratum. Therefore we know from Simple Random
Sampling that ȳi is an unbiased estimator of µi and Ni ȳi is an unbiased estimator of the
stratum total τi = Ni µi . It seems reasonable to form an estimator of τ , which is the sum of
the τi's, by summing the estimators of the τi's. Similarly, since the population mean µ equals
the population total τ divided by N , an unbiased estimator of µ is obtained by summing
the estimators of the τi ’s over all strata and then dividing by N . We denote this estimator
by ȳst , where the subscript st indicates that stratified random sampling is used.
Example 3.2.3 Suppose the survey planned in Example 3.2.2 is carried out. The advertis-
ing firm has enough time and money to interview n = 40 households and decides to select
random samples of size n1 = 20 from town A, n2 = 8 from town B, and n3 = 12 from the
rural area.(We will discuss the choice of sample sizes later). The simple random samples are
selected and the interviews conducted. The results, with measurements of television-viewing
time in hours per week, are shown in Table 3.3.
Estimate the average television viewing time, in hours per week, for a) all households in the
county b) all households in town B. In both cases, place a bound on the error of estimation.
The terms s1², s2² and s3² in Table 3.3 are the sample variances for strata 1, 2 and 3, respectively; they are given by the formula
si² = ( ∑_{j=1}^{ni} (yij − ȳi)² ) / (ni − 1) = ( ∑_{j=1}^{ni} yij² − ni ȳi² ) / (ni − 1)
for i = 1, 2, 3, where yij is the jth observation in stratum i. These variances estimate the corresponding true stratum variances σ1², σ2² and σ3².
Solution:
Figure 3.3: Television viewing time, in hours per week
ȳst = (1/N) [ N1ȳ1 + N2ȳ2 + N3ȳ3 ] = (1/310) [ (155)(33.900) + (62)(25.125) + (93)(19.00) ] = 27.7
is the best estimate of the average number of hours per week that all households in the county spend watching television. Also,
V̂(ȳst) = (1/N²) ∑_{i=1}^{3} Ni² ( (Ni − ni)/Ni ) ( si²/ni )
= ( 1/(310)² ) [ (155)²(0.871)(35.358)/20 + (62)²(0.871)(232.411)/8 + (93)²(0.871)(87.636)/12 ]
= 1.97
The estimate of the population mean with an approximate two-standard-deviation bound
on the error of estimation is given by
ȳst ± 2 √V̂(ȳst)
27.675 ± 2 √1.97
27.7 ± 2.8
Thus we estimate the average number of hours per week that households in the county view
television to be 27.7 hours. The error of estimation should be less than 2.8 hours with a
probability approximately equal to 0.95.
For part (b), the estimate of the average viewing time for town B (stratum 2) is ȳ2 = 25.1 hours, with bound
2 √[ ( (N2 − n2)/N2 ) ( s2²/n2 ) ] = 2 √[ ( (62 − 8)/62 )( 232.411/8 ) ] ≈ 10.0
that is,
25.1 ± 10.0
This estimate has a large bound on the error of estimation because s22 is large and the
sample size n2 is small. Thus the estimate ȳst of the population mean is quite good, but
the estimate ȳ2 of the mean of stratum 2 is poor. If an estimate is desired for a particular
stratum, the sample from that stratum must be large enough to provide a reasonable bound
on the error of estimation.
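A short numerical check of the two stratified estimates above (our own sketch; the stratum summaries are those quoted from Table 3.3 in the solution):

from math import sqrt

Ni = [155, 62, 93]
ni = [20, 8, 12]
ybar = [33.900, 25.125, 19.000]
s2   = [35.358, 232.411, 87.636]
N = sum(Ni)

ybar_st = sum(Nk * yk for Nk, yk in zip(Ni, ybar)) / N
v_st = sum(Nk**2 * ((Nk - nk) / Nk) * (sk / nk)
           for Nk, nk, sk in zip(Ni, ni, s2)) / N**2
print(round(ybar_st, 1), round(2 * sqrt(v_st), 1))   # 27.7 hours, bound 2.8 hours

# Stratum 2 (town B) alone: about 25.1 hours with a bound of roughly 10 hours
b2 = 2 * sqrt(((Ni[1] - ni[1]) / Ni[1]) * (s2[1] / ni[1]))
print(ybar[1], round(b2, 1))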
Procedures for the estimation of a population total τ follow directly from the procedures
presented for estimating µ. Since τ is equal to N µ, an unbiased estimator of τ is given by
N ȳst .
Example 3.2.4 Refer to Example 3.2.3 and estimate the total number of hours per week
that households in the county view television. Place a bound on the error of estimation.
Solution:
N ȳst = 310(27.7) = 8587 hours
The estimated variance of N ȳst is
V̂(N ȳst) = N² V̂(ȳst) = 189,278.56
(using the unrounded value of V̂(ȳst) from Example 3.2.3). The estimate of the population total with a bound on the error of estimation is given by
N ȳst ± 2 √V̂(N ȳst)
8587 ± 2 √189,278.56
8587 ± 870
Thus we estimate the total weekly viewing time for households in the county to be 8587
hours. The error of estimation should be less than 870 hours.
3.2.4 Selecting the Sample Size for Estimating Population Means
and Total
The amount of information in a sample depends on the sample size n, since V̂ (ȳst ) decreases
as n increases. Let us examine a method of choosing the sample size to obtain a fixed amount
of information for estimating a population parameter. Suppose we specify that the estimate
ȳst should lie within B units of the population mean, with probability approximately 0.95.
Symbolically, we want
2 √V(ȳst) = B
or
V(ȳst) = B²/4
This equation contains the actual population variance of ȳst rather than the estimated vari-
ance.
Although we set V(ȳst) equal to B²/4, we cannot solve for n unless we know something about
the relationships among n1 , n2 , · · · , nL and n. There are many ways of allocating a sample
size n among the various strata. In each case, however, the number of observations ni
allocated to the ith stratum is some fraction of the total sample size n. We denote this
fraction by wi . Hence we can write
ni = nwi , i = 1, 2, · · · , L (3.30)
Using the above equation, we can then set V(ȳst) equal to B²/4 and solve for n.
Similarly, estimation of the population total τ with a bound of B units on the error of
estimation leads to the equation
2 √V(N ȳst) = B
or
V(ȳst) = B²/(4N²)
Approximate sample size required to estimate µ or τ with a bound B on the
error of estimation:
n = ( ∑_{i=1}^{L} Ni²σi²/wi ) / ( N²D + ∑_{i=1}^{L} Niσi² )    (3.31)
where wi is the fraction of observations allocated to stratum i, σi² is the population variance for stratum i, and
D = B²/4 when estimating µ
D = B²/(4N²) when estimating τ.
We must obtain approximations of the population variances σ1², σ2², · · · , σL² before we can use Equation (3.31). One method of obtaining these approximations is to use the sample variances s1², s2², · · · , sL² from a previous experiment to estimate σ1², σ2², · · · , σL². A second
method requires knowledge of the range of the observations within each stratum. From
Tchebysheff’s theorem and the normal distribution the range should be roughly four to six
standard deviations.
Example 3.2.5 A prior survey suggests that the stratum variances for Example 3.2.2 are approximately σ1² ≈ 25, σ2² ≈ 225 and σ3² ≈ 100. We wish to estimate the population mean by using ȳst. Choose the sample size needed to obtain a bound on the error of estimation equal to 2 hours if the allocation fractions are given by w1 = 1/3, w2 = 1/3 and w3 = 1/3; in other words, take an equal number of observations from each stratum.
Solution: A bound of 2 hours on the error of estimation means that
2 √V(ȳst) = 2
or
V(ȳst) = 1
Therefore, D = 1.
In Example 3.2.2, N1 = 155, N2 = 62 and N3 = 93. Therefore
∑_{i=1}^{3} Ni²σi²/wi = N1²σ1²/w1 + N2²σ2²/w2 + N3²σ3²/w3
= (155)²(25)/(1/3) + (62)²(225)/(1/3) + (93)²(100)/(1/3)
= (24025)(75) + (3844)(675) + (8649)(300)
= 6991275
and
∑_{i=1}^{3} Niσi² = N1σ1² + N2σ2² + N3σ3² = (155)(25) + (62)(225) + (93)(100) = 27125
Hence, by Equation (3.31),
n = ( ∑_{i=1}^{3} Ni²σi²/wi ) / ( N²D + ∑_{i=1}^{3} Niσi² ) = 6991275 / ( (310)²(1) + 27125 ) = 6991275/123225 = 56.7 ≈ 57
so that approximately 19 observations should be taken from each stratum.
3.2.5 Allocation of the Sample
There are many ways to allocate a total sample size n among the L strata, and different allocations result in a different variance for the sample mean. Hence our objective is to use an allocation that gives a specified amount of information at minimum cost.
In terms of our objective the best allocation scheme is affected by three factors. They are
as follows:
The number of elements in each stratum affects the quantity of information in the sample.
A sample of size 20 from a population of 200 elements should contain more information
than a sample of 20 from 20,000 elements. Thus large sample sizes should be assigned to
strata containing large numbers of elements.
Variability must be considered because a larger sample is needed to obtain a good estimate
of a population parameter when the observations are less homogeneous.
If the cost of obtaining an observation varies from stratum to stratum, we will take small
samples from strata with high costs. We will do so because our objective is to keep the cost
of sampling at a minimum.
Approximate allocation that minimizes cost for a fixed value of V (ȳst ) or mini-
mizes V (ȳst ) for a fixed cost:
wi = ( Niσi/√ci ) / ( ∑_{k=1}^{L} Nkσk/√ck )    (3.32)
where Ni denotes the size of the ith stratum, σi² denotes the population variance for the ith stratum, and ci denotes the cost of obtaining a single observation from the ith stratum. Note that ni is directly proportional to Niσi and inversely proportional to √ci.
One must approximate the variance of each stratum before sampling in order to use the
allocation formula (3.32). The approximations can be obtained from earlier surveys or from
knowledge of the range of the measurements within each stratum.
Example 3.2.6 The advertising firm in Example 3.2.2 finds that obtaining an observation
from a rural household costs more than obtaining a response in town A or B. The increase
is due to costs of traveling from one rural household to another. The cost per observation in
each town is estimated to be $9.00 (that is,c1 = c2 = 9), and the costs per observation in the
rural area to be $16.00 (that is, c3 = 16). The stratum standard deviations (approximated
by the strata sample variances from a prior survey) are σ1 ≈ 5, σ2 ≈ 15 and σ3 ≈ 10 . Find
the overall sample size n and the stratum sample sizes, n1 , n2 and n3 , that allow the firm to
estimate, at minimum cost, the average television-viewing time with a bound on the error
of estimation equal to 2 hours.
Solution: We have
∑_{k=1}^{3} Nkσk/√ck = N1σ1/√c1 + N2σ2/√c2 + N3σ3/√c3 = 155(5)/√9 + 62(15)/√9 + 93(10)/√16 = 800.83
and
∑_{i=1}^{3} Niσi√ci = N1σ1√c1 + N2σ2√c2 + N3σ3√c3 = 155(5)√9 + 62(15)√9 + 93(10)√16 = 8835
Thus
n = ( ∑_{k=1}^{3} Nkσk/√ck )( ∑_{i=1}^{3} Niσi√ci ) / ( N²D + ∑_{i=1}^{3} Niσi² )
= (800.83)(8835) / ( (310)²(1) + 27125 )
= 57.42 ≈ 58
Then
n1 = n ( N1σ1/√c1 ) / ( ∑_{k=1}^{3} Nkσk/√ck ) = n ( 155(5)/3 ) / 800.83 = 0.32n = 18.5 ≈ 18
Similarly,
n2 = n ( 62(15)/3 ) / 800.83 = 0.39n = 22.6 ≈ 23
n3 = n ( 93(10)/4 ) / 800.83 = 0.29n = 16.8 ≈ 17
Hence the experimenter should select 18 households at random from town A, 23 from town
B, and 17 from rural area. He can then estimate the average number of hours spent watching
television at minimum cost with a bound of 2 hours on the error of estimation.
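The cost-optimal allocation and sample size of Example 3.2.6 can be reproduced in a few lines of code; this is a sketch of ours, not part of the original notes:

from math import sqrt

Ni, sigma, cost = [155, 62, 93], [5, 15, 10], [9, 9, 16]
N, D = sum(Ni), 2**2 / 4                     # bound B = 2 hours, D = B^2 / 4

a = [Nk * sk / sqrt(ck) for Nk, sk, ck in zip(Ni, sigma, cost)]
b = [Nk * sk * sqrt(ck) for Nk, sk, ck in zip(Ni, sigma, cost)]
w = [ak / sum(a) for ak in a]                # allocation fractions, about 0.32, 0.39, 0.29
n = sum(a) * sum(b) / (N**2 * D + sum(Nk * sk**2 for Nk, sk in zip(Ni, sigma)))
print([round(wk, 2) for wk in w], round(n, 2))   # n is about 57.4; the text rounds up to 58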
In some stratified sampling problems the cost of obtaining an observation is the same for all
strata. If the costs are unknown, we may be willing to assume that the costs per observation
are equal. If c1 = c2 = · · · = cL , then the cost terms cancel in Equation (3.32) and
ni = n ( Niσi / ∑_{i=1}^{L} Niσi )    (3.34)
and
n = ( ∑_{i=1}^{L} Niσi )² / ( N²D + ∑_{i=1}^{L} Niσi² )    (3.35)
Example 3.2.7 The advertising firm of Example 3.2.2 decides to use telephone interviews
rather than personal interviews because all households in the county have telephones, and
this method reduces costs. The cost of obtaining an observation is then the same in all three
strata. The stratum standard deviations are approximated by σ1 ≈ 5, σ2 ≈ 15 and
σ3 ≈ 10. The firm desires to estimate the population mean µ with a bound on the error of
estimation equal to 2 hours. Find the appropriate sample size n and stratum sample sizes
n1 , n2 and n3 .
Solution: We will now use Equations (3.34) and (3.35), since the costs are the same in
all strata. Therefore to find the allocation fractions,w1 , w2 and w3 , we use Equation (3.34).
Then
∑_{i=1}^{3} Niσi = N1σ1 + N2σ2 + N3σ3 = (155)(5) + (62)(15) + (93)(10) = 2635
n1 = n ( N1σ1 / ∑_{i=1}^{3} Niσi ) = n ( (155)(5)/2635 ) = n(0.30)
Similarly,
n2 = n ( (62)(15)/2635 ) = n(0.35)
n3 = n ( (93)(10)/2635 ) = n(0.35)
Thus w1 = 0.30, w2 = 0.35 and w3 = 0.35.
Now let us use Equation (3.35) to find n. A bound of 2 hours on the error of estimation
means that
2 √V(ȳst) = 2  ⇒  V(ȳst) = 1
Therefore,
D = B²/4 = 1
and
N²D = (310)²(1) = 96100
Also,
∑_{i=1}^{3} Niσi² = 27125
so that
n = ( ∑_{i=1}^{3} Niσi )² / ( N²D + ∑_{i=1}^{3} Niσi² ) = (2635)² / (96100 + 27125) = 56.34 ≈ 57
Then
n1 = nw1 = (57)(0.30) = 17
n2 = nw2 = (57)(0.35) = 20
n3 = nw3 = (57)(0.35) = 20
The sample size n in this example is nearly the same as in the previous example, but the
allocation has changed. More observations are taken from the rural area because these
observations no longer have a higher cost.
In addition to encountering equal costs, we sometimes encounter approximately equal vari-
ances, σ1², σ2², · · · , σL². In that case the σi's cancel in Equation (3.34) and
ni = n ( Ni / ∑_{i=1}^{L} Ni ) = n ( Ni/N )    (3.36)
This method of assigning sample sizes to the strata is called proportional allocation because
sample sizes n1 , n2 , · · · , nL are proportional to stratum sizes N1 , N2 , · · · , NL . Of course,
proportional allocation can be, and often is, used when stratum variances and costs are not
equal. One advantage to using this allocation is that the estimator ȳst becomes simply the
sample mean for the entire sample. This can be an important timesaving feature in
some surveys.
Under proportional allocation, Equation (3.31) for the value of n, which yields V(ȳst) = D, becomes
n = ( ∑_{i=1}^{L} Niσi² ) / ( ND + (1/N) ∑_{i=1}^{L} Niσi² )    (3.37)
Example 3.2.8 The advertising firm in Example 3.2.2 thinks that the approximate vari-
ances used in previous examples are in error and that the stratum variances are approx-
imately equal. The common value of σi was approximated by 10 in a preliminary study.
Telephone interviews are to be used, and hence costs will be equal in all strata. The firm
desires to estimate the average number of hours per week that households in the county
watch television, with a bound on the error of estimation equal to 2 hours. Find the sample
size and stratum sample sizes necessary to achieve this accuracy.
Solution: We have
∑_{i=1}^{3} Niσi² = N1σ1² + N2σ2² + N3σ3² = (155)(100) + (62)(100) + (93)(100) = 31000
Using Equation (3.37) with D = B²/4 = 1,
n = ∑_{i=1}^{3} Niσi² / ( ND + (1/N) ∑_{i=1}^{3} Niσi² ) = 31000 / ( (310)(1) + 31000/310 ) = 31000/410 = 75.6 ≈ 76
Therefore
n1 = n ( N1/N ) = 76 (155/310) = 76(0.5) = 38
n2 = n ( N2/N ) = 76 (62/310) = 76(0.2) = 15
n3 = n ( N3/N ) = 76 (93/310) = 76(0.3) = 23
These results differ from those of Example 3.2.7 because here the variances are assumed to
be equal in all strata and are approximated by a common value.
The amount of money to be spent on sampling is sometimes fixed before the experiment
is started. Then the experimenter must find a sample size and allocation scheme that
minimizes the variance of the estimator for a fixed expenditure.
Example 3.2.9 In the television-viewing example, suppose the costs are as specified in Ex-
ample 3.2.6 . That is, c1 = c2 = 9 and c3 = 16. Let the stratum variances be approximated
by σ1 ≈ 5, σ2 ≈ 15 and σ3 ≈ 10. Given the advertising firm has only $500 to spend on
sampling, choose the sample size and the allocation that minimize V(ȳst).
Solution: The allocation scheme is still given by Equation (3.32). In Example 3.2.6 we found
w1 = 0.32, w2 = 0.39 and w3 = 0.29. Since the total cost must equal $500, we have
c1 n1 + c2 n2 + c3 n3 = 500
or
9n1 + 9n2 + 16n3 = 500
Since ni = nwi, we can substitute as follows:
9nw1 + 9nw2 + 16nw3 = 500
or
9n(0.32) + 9n(0.39) + 16n(0.29) = 500
Solving for n, we obtain
11.03n = 500
n = 500/11.03 = 45.33
Therefore we must take n = 45 to ensure that the cost remains below $500. The corre-
sponding allocation is given by
n1 = nw1 = (45)(0.32) = 14
n2 = nw2 = (45)(0.39) = 18
n3 = nw3 = (45)(0.29) = 13
We can make the following summary statement on stratified random sampling: In general,
stratified random sampling with proportional allocation will produce an estimator with
smaller variance than that produced by simple random sampling (with the same sample
size) if there is considerable variability among the stratum means. If sampling costs are
nearly equal from stratum to stratum, stratified random sampling with optimal allocation
will yield estimators with smaller variance than will proportional allocation when there is
variability among the stratum variances.
3.2.6 Estimation of Population Proportion
In our numerical examples we have been interested in estimating the average or the total
number of hours per week spent watching television. In contrast, suppose that the advertis-
ing firm wants to estimate the proportion (fraction) of households that watches a particular
show. The population is divided into strata, just as before, and a simple random sample
is taken from each stratum. Interviews are then conducted to determine the proportion p̂i
of households in stratum i that view the show. This p̂i is an unbiased estimator of pi , the
population proportion in stratum i. Reasoning as we did for simple random sampling, we conclude that
Ni p̂i is an unbiased estimator of the total number of households in stratum i that view this
particular show. Hence N1 p̂1 + N2 p̂2 + · · · + NL p̂L is a good estimator of the total number of
viewing households in the population. Dividing this quantity by N , we obtain an unbiased
estimator of the population proportion p of households viewing the show.
Example 3.2.10 The advertising firm wanted to estimate the proportion of households in
the county of Example (3.2.2) that view show X. The county is divided into three strata,
town A, town B and the rural area. The strata contain N1 = 155, N2 = 62 and N3 = 93
households, respectively. A stratified random sample of n = 40 households is chosen with
proportional allocation. In other words, a simple random sample is taken from each stratum;
the sizes of the samples are n1 = 20, n2 = 8 and n3 = 12. Interviews are conducted in
the 40 sampled households; results are shown in Figure 3.4. Estimate the proportion of
households viewing show X, and place a bound on the error of estimation.
Solution: The estimate of the proportion of households viewing show X is given by p̂st .
We calculate
p̂st = (1/310) [ (155)(0.80) + (62)(0.25) + (93)(0.50) ] = 0.60
First, let us calculate the V̂ (p̂i ) terms. We have
V̂(p̂1) = ( (N1 − n1)/N1 ) ( p̂1q̂1/(n1 − 1) ) = ( (155 − 20)/155 ) ( (0.8)(0.2)/19 ) = 0.007
Figure 3.4: Data for Example 3.2.10
V̂(p̂2) = ( (N2 − n2)/N2 ) ( p̂2q̂2/(n2 − 1) ) = ( (62 − 8)/62 ) ( (0.25)(0.75)/7 ) = 0.024
V̂(p̂3) = ( (N3 − n3)/N3 ) ( p̂3q̂3/(n3 − 1) ) = ( (93 − 12)/93 ) ( (0.5)(0.5)/11 ) = 0.020
V̂(p̂st) = (1/N²) ∑_{i=1}^{3} Ni² V̂(p̂i) = ( 1/(310)² ) [ (155)²(0.007) + (62)²(0.024) + (93)²(0.020) ] = 0.0045
Then the estimate of proportion of households in the county that view show X, with a bound
on the error of estimation, is given by
p̂st ± 2 √V̂(p̂st)
0.60 ± 2 √0.0045
0.60 ± 2(0.07)
0.60 ± 0.14
The bound on the error in Example 3.2.10 is quite large. We could reduce this bound and
make the estimator more precise by increasing the sample size. The problem of choosing a
sample size is considered in the next section.
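A quick check of the stratified proportion estimate and its bound (our own sketch, using the stratum proportions quoted in the solution above):

from math import sqrt

Ni = [155, 62, 93]
ni = [20, 8, 12]
p  = [0.80, 0.25, 0.50]
N  = sum(Ni)

p_st = sum(Nk * pk for Nk, pk in zip(Ni, p)) / N
v_st = sum(Nk**2 * ((Nk - nk) / Nk) * (pk * (1 - pk) / (nk - 1))
           for Nk, nk, pk in zip(Ni, ni, p)) / N**2
print(round(p_st, 2), round(2 * sqrt(v_st), 3))   # 0.6 and about 0.135; the text rounds the bound to 0.14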
3.2.7 Selecting the Sample Size and Allocating the Sample to Estimate Proportions
The formula for the sample size n (for a given bound B on the error of estimation) is the same as Equation (3.31) except that σi² becomes piqi.
The allocation formula that gives the variance of p̂st equal to some fixed constant at minimum cost is the same as Equation (3.32) with σi replaced by √(piqi).
Approximate allocation that minimizes cost for a fixed value of V (p̂st ) or mini-
mizes V (p̂st ) for a fixed cost:
ni = n ( Ni√(piqi/ci) ) / ( N1√(p1q1/c1) + N2√(p2q2/c2) + · · · + NL√(pLqL/cL) )    (3.44)
= n ( Ni√(piqi/ci) ) / ( ∑_{k=1}^{L} Nk√(pkqk/ck) )    (3.45)
where Ni denotes the size of the ith stratum, pi denotes the population proportion for the
ith stratum, and ci denotes the cost of obtaining a single observation for the ith stratum.
Example 3.2.11 The data of Table 3.4 were obtained from a sample conducted last year.
The advertising firm now wants to conduct a new survey in the same county to estimate the
proportion of households viewing show X. Although the fractions p1 , p2 and p3 that appear
in Equation (3.43) and (3.44) are unknown, they can be approximated by the estimates
from the earlier study, that is, p̂1 = 0.80, p̂2 = 0.25 and p̂3 = 0.50. The cost of obtaining an
observation is $9 for either town and $16 for the rural area, that is c1 = c2 = 9 and c3 = 16.
The number of households within the strata are N1 = 155, N2 = 62 and N3 = 93. The
firm wants to estimate the population proportion p with a bound on the error of estimation
equal to 0.1. Find the sample size n and the strata sample sizes, n1 , n2 and n3 , that will
give the desired bound at minimum cost.
Solution: We first find the allocation fractions wi. Using p̂i to approximate pi, we have
∑_{i=1}^{3} Ni√(p̂iq̂i/ci) = N1√(p̂1q̂1/c1) + N2√(p̂2q̂2/c2) + N3√(p̂3q̂3/c3)
= 155√((0.8)(0.2)/9) + 62√((0.25)(0.75)/9) + 93√((0.5)(0.5)/16)
= 62.000/3 + 26.846/3 + 46.500/4
= 41.241
and
n1 = n ( N1√(p̂1q̂1/c1) ) / ( ∑_{i=1}^{3} Ni√(p̂iq̂i/ci) ) = n (20.667/41.241) = n(0.50)
Similarly,
n2 = n (8.949/41.241) = n(0.22)
n3 = n (11.625/41.241) = n(0.28)
Thus w1 = 0.50, w2 = 0.22 and w3 = 0.28.
The next step is to find n. First, the following quantities must be calculated:
∑_{i=1}^{3} Ni²p̂iq̂i/wi = N1²p̂1q̂1/w1 + N2²p̂2q̂2/w2 + N3²p̂3q̂3/w3
= (155)²(0.8)(0.2)/0.50 + (62)²(0.25)(0.75)/0.22 + (93)²(0.5)(0.5)/0.28
= 18686.46
∑_{i=1}^{3} Nip̂iq̂i = N1p̂1q̂1 + N2p̂2q̂2 + N3p̂3q̂3 = (155)(0.8)(0.2) + (62)(0.25)(0.75) + (93)(0.5)(0.5) = 59.675
To find D, we let 2 √V(p̂st) = 0.1 (the bound on the error of estimation). Then
V(p̂st) = (0.1)²/4 = 0.0025 = D
and
N²D = (310)²(0.0025) = 240.25
Finally,
n = ( ∑_{i=1}^{3} Ni²p̂iq̂i/wi ) / ( N²D + ∑_{i=1}^{3} Nip̂iq̂i ) = 18686.46 / (240.25 + 59.675) = 62.3 ≈ 63
Hence
n1 = nw1 = (63)(0.50) = 31
n2 = nw2 = (63)(0.22) = 14
n3 = nw3 = (63)(0.28) = 18
If the cost of sampling does not vary from stratum to stratum, then the cost factors ci cancel
from Equation (3.44).
Recall that the allocation formula (3.32) assumes a very simple form when the variances as
well as costs are equal for all strata. Equation (3.44) simplifies in the same way, provided
all stratum proportions pi are equal and all costs ci are equal. The Equation (3.44) becomes
ni = n ( Ni/N ),    i = 1, 2, · · · , L    (3.46)
As previously noted, this method for assignment of sample sizes to the strata is called
proportional allocation.
Chapter 4
Multivariate Analysis
4.1 Multivariate Data Sets
4.1.1 Examples of Multivariate Data
Multivariate data arise when several variables are recorded on each unit under study, as in the following examples.
2. Medical: One hundred patients with leukemia have historical studies carried out giving
about 20 variables, each with a − or + or ++ response.
3. Agriculture: Observations are available on crop yield and weather conditions over a
number of different sites. Is there a simple relationship that connects the crop variables
with the weather variables?
Example 4.1.1 Measurements of three variables were taken for four individuals. The data
obtained were (1.2, 2.8, 7.8),(2.3, 2.9, 8.1),(1.7, 3.1, 8.2),(2.0, 2.8, 7.9).
The data matrix is given by
X = [ 1.2  2.8  7.8
      2.3  2.9  8.1
      1.7  3.1  8.2
      2.0  2.8  7.9 ]
In the multivariate case, the population mean is a vector µ and the corresponding
estimator, the sample mean, is also a vector. In place of the population variance, we have
the Dispersion Matrix or Variance Covariance Matrix. This is estimated by the sample
dispersion matrix S. The sample mean vector and the sample dispersion matrix are given
by
x̄ = (1/n) ∑_{i=1}^{n} xi
and
S = ( 1/(n − 1) ) ( XᵀX − n x̄ x̄ᵀ ) = ( 1/(n − 1) ) ∑_{i=1}^{n} ( xi − x̄ )( xi − x̄ )ᵀ
where X is the data matrix of order n × p.
Note that x̄ = (x̄1, x̄2, · · · , x̄p)ᵀ, where the ith component x̄i is the sample mean of the ith variable. Similarly, S = ((sij)) where
sij = ( 1/(n − 1) ) ( ∑_{k=1}^{n} xki xkj − n x̄i x̄j )
Consequently, sii is the sample variance of the ith variable and sij is the sample covariance between variable i and variable j.
The sample correlation matrix is denoted by R. For i ≠ j, the ijth element of R is the correlation between the ith variable and the jth variable, given by rij = sij / √(sii sjj). All the diagonal elements of R are 1.
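For illustration, the summary statistics of Example 4.1.1 can be computed with NumPy; this is a sketch of ours, not part of the original notes:

import numpy as np

X = np.array([[1.2, 2.8, 7.8],
              [2.3, 2.9, 8.1],
              [1.7, 3.1, 8.2],
              [2.0, 2.8, 7.9]])

xbar = X.mean(axis=0)               # sample mean vector
S = np.cov(X, rowvar=False)         # sample dispersion matrix (divisor n - 1)
R = np.corrcoef(X, rowvar=False)    # sample correlation matrix
print(xbar, S, R, sep="\n")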
4.2 Random Vectors and Multivariate Distributions
4.2.1 Random Vectors
A random vector is a vector whose components are random variables and is the multivariate generalization of a single random variable. Random vectors are usually written as column vectors. If X = (X1, X2, · · · , Xp)ᵀ is a random vector, its joint distribution function, or the joint distribution function of the components of X, is defined as
FX(x) = P(X1 ≤ x1, X2 ≤ x2, . . . , Xp ≤ xp).
When it exists, the joint density of X is given by
fX(x) = ∂ᵖF(x) / ( ∂x1 ∂x2 · · · ∂xp )
1. The Expectation of a random vector X is the vector whose ith component is the
expectation of Xi .
E(X) = ( E(X1), E(X2), · · · , E(Xp) )ᵀ
2. The Dispersion of a random vector X is the matrix whose ij th entry is the covariance
between Xi and Xj . If E(X) = µ, then
D(X) = E[ (X − µ)(X − µ)ᵀ ] = E( XXᵀ ) − µµᵀ
3. The Correlation matrix of a random vector X is the matrix P whose ij th entry is the
correlation between Xi and Xj . So P = ((ρij )) where
ρij = Corr(Xi, Xj) = σij / √(σii σjj).
Theorem 4.2.2 Let X be a random p-vector with E(X) = µ and D(X) = Σ. Let A be
any m × p matrix, a be any p-dimensional vector, b be any m-dimensional vector and c ∈ R.
Then
1. E(aᵀX + c) = aᵀµ + c
2. V(aᵀX + c) = aᵀΣa
3. E(AX + b) = Aµ + b
4. D(AX + b) = AΣAᵀ
If D(X) = Σ = ((σij)), then
1. Cov(Xi, Xj) = σij
2. V(Xi) = σii
3. Σ is a symmetric matrix.
Note: Recall that a matrix A is said to be non-negative definite if for every vector a ∈ Rp ,
aT Aa ≥ 0. A is said to be positive definite if for every vector a 6= 0, aT Aa > 0. A
non-negative definite matrix is positive definite if and only if it is non-singular.
Example 4.2.4 Suppose X = (X1, X2)ᵀ has mean µ = (1, −3)ᵀ and dispersion
Σ = [  4  −2
      −2   9 ]
Let Y = (Y1, Y2)ᵀ, where Y1 = 2X1 + X2 and Y2 = 3X1 − X2. Then, with a = (2, 1)ᵀ,
V(2X1 + X2) = V(aᵀX) = aᵀΣa = 17
Writing Y = AX, where
A = [ 2   1
      3  −1 ]
we get, by (3) and (4) of Theorem 4.2.2,
E(Y) = Aµ = (−1, 6)ᵀ
and
D(Y) = AΣAᵀ = [ 17  13
                13  57 ]
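The mean and dispersion of Y in Example 4.2.4 are easy to confirm numerically; the following is our own check, not part of the notes:

import numpy as np

mu    = np.array([1, -3])
Sigma = np.array([[4, -2],
                  [-2, 9]])
A     = np.array([[2, 1],
                  [3, -1]])

print(np.array([2, 1]) @ Sigma @ np.array([2, 1]))   # V(2X1 + X2) = 17
print(A @ mu)                                        # E(Y) = (-1, 6)
print(A @ Sigma @ A.T)                               # D(Y) = [[17, 13], [13, 57]]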
1. A = ∑_{i=1}^{k} λi ei eiᵀ
Note that the product of A^{1/2} and A^{−1/2} is I.
The following theorem is a consequence of Theorem 4.2.2.
Theorem 4.2.7 Let X be a random vector with mean µ and dispersion Σ. If Σ is non-singular, then Σ^{−1/2}(X − µ) has mean 0 and dispersion matrix I.
2. Cov(AX + a, BY + b) = A Cov(X, Y) Bᵀ
3. Cov(X, X) = D(X)
Example 4.2.10 Let X be a random 2-vector and Y a random 3-vector such that
Cov(X, Y) = [ 1  3  −1
              2  7   4 ]
Then
1. Cov(Y, X) = Cov(X, Y)ᵀ = [  1  2
                               3  7
                              −1  4 ]
2. Cov( (X1 + X2, X1 − X2)ᵀ, Y1 + 2Y2 + 3Y3 ) = [ 1 1 ; 1 −1 ] Cov(X, Y) (1, 2, 3)ᵀ = ( 32, −24 )ᵀ
Definition 4.2.11 A p-variate random vector X is said to have multivariate normal dis-
tribution if for every vector a ∈ Rp (the set of all p-dimensional column vectors), aT X is
either normally distributed or has a degenerate distribution (that is, it is a constant).
If a multivariate normal random vector X has mean vector µ and variance-covariance matrix
Σ, then we denote it by X ∼ Np (µ, Σ). The multivariate normal distribution in the special
case of p = 2 is known as bivariate normal distribution.
Example 4.2.12 Suppose X and Y are independent and have N (0, 1) distribution.
1. (X, Y)ᵀ ∼ N2(0, I), where 0 is the 2 × 1 zero vector and I is the 2 × 2 identity matrix.
2. Let W = (2X + Y + 4, −X + 3Y)ᵀ. Then W has a multivariate normal distribution N2(µ, Σ), where
µ = ( 4, 0 )ᵀ   and   Σ = [ 5   1
                            1  10 ]
2. AX ∼ Nm (Aµ, AΣAT )
3. X has a joint density if and only if Σ is non-singular. In this case the density of X is
given by
fX(x) = ( 1 / √((2π)ᵖ |Σ|) ) exp{ −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) }
4. If Σ is non-singular, then Σ^{−1/2}(X − µ) ∼ Np(0, I)
Note: Property (4) is the multivariate analogue of standardizing a normal random variable
to get standard normal random variable.
Example 4.3.2 Suppose X in Example 4.2.4 is distributed as N2(µ, Σ) with µ = (1, −3)ᵀ and
Σ = [  4  −2
      −2   9 ]
Let Y, a and A be as in Example 4.2.4. Then
Theorem 4.3.3 Let U have Nm (0, I) distribution. Let a be any p-dimensional vector and
let B be any p × m matrix. Then
1. The density function of U is given by
fU(u) = ( 1 / (2π)^{m/2} ) exp{ −(1/2) ∑_{i=1}^{m} ui² }
Example 4.3.4 Let U1 , U2 and U3 be independent and identically distributed (i.i.d ) N (0, 1)
random variables. Then
4.3.2 Partitioning of Multivariate Normal Vector
Partition X, µ and Σ as X = (X(1)ᵀ, X(2)ᵀ)ᵀ, µ = (µ(1)ᵀ, µ(2)ᵀ)ᵀ and
Σ = [ Σ11  Σ12
      Σ21  Σ22 ]
where
• Σ11 = D(X(1))
• Σ12 = Cov(X(1), X(2))
• Σ21 = Cov(X (2) , X (1) ) = ΣT12
• Σ22 = D(X (2) )
and variance
22 − (−14, 17, −4) [ 15  −12  −1
                    −12   39  −5
                     −1   −5   4 ]⁻¹ (−14, 17, −4)ᵀ = 1083/410
= P (Z > 0.44)
= 0.33
3. Find the conditional distribution of (X1, X2)ᵀ given X3 = 1 and X4 = 4.
From Part 3 of Theorem 4.3.7, the conditional distribution of (X1, X2)ᵀ given (X3, X4)ᵀ = (1, 4)ᵀ is bivariate normal. Here
µ(1) = (−1, −3)ᵀ,  µ(2) = (2, 3)ᵀ,  x(2) = (1, 4)ᵀ,
Σ11 = [ 15  −12        Σ12 = [ −14  −1        Σ22 = [ 22  −4
       −12   39 ],              17  −5 ],            −4    4 ]
The conditional mean is
µ(1) + Σ12 Σ22⁻¹ ( x(2) − µ(2) ) = (−1, −3)ᵀ + Σ12 Σ22⁻¹ (−1, 1)ᵀ = −(1/4)( 5, 17 )ᵀ
and the conditional dispersion matrix is
Σ11 − Σ12 Σ22⁻¹ Σ21 = (1/4) [  9  −13
                             −13   99 ]
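The conditional mean and dispersion just obtained can be checked directly from the partitioned mean vector and dispersion matrix implied by the pieces quoted above; the following is our own sketch, with µ and Σ assembled from those pieces:

import numpy as np

mu    = np.array([-1, -3, 2, 3])
Sigma = np.array([[ 15, -12, -14, -1],
                  [-12,  39,  17, -5],
                  [-14,  17,  22, -4],
                  [ -1,  -5,  -4,  4]])

mu1, mu2 = mu[:2], mu[2:]
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]
x2 = np.array([1, 4])                                  # observed values of X3 and X4

cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
cond_cov  = S11 - S12 @ np.linalg.solve(S22, S21)
print(cond_mean)   # [-1.25 -4.25], i.e. -(1/4)(5, 17)
print(cond_cov)    # [[2.25 -3.25], [-3.25 24.75]], i.e. (1/4)[[9, -13], [-13, 99]]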
Theorem 4.3.9 Let (X1 , X2 )T have bivariate normal distribution with E(X1 ) = µ1 , E(X2 ) =
µ2, V(X1) = σ1², V(X2) = σ2² and Corr(X1, X2) = ρ where ρ ∈ (−1, 1). Then
1. the joint density of X1 and X2 is given by
fX1,X2(x1, x2) = ( 1 / (2πσ1σ2√(1 − ρ²)) ) exp{ −( 1/(2(1 − ρ²)) ) [ ((x1 − µ1)/σ1)² + ((x2 − µ2)/σ2)² − 2ρ ((x1 − µ1)/σ1)((x2 − µ2)/σ2) ] }
Example 4.3.10 Suppose that the joint distribution of X and Y is given by a bivariate
normal distribution with E(X) = 1, E(Y ) = −2, V (X) = 4, V (Y ) = 9 and Corr(X, Y ) =
−.8.
1. Find the conditional distribution of X given Y = 1.
The conditional mean of X given Y = 1 is given by
E(X|Y = 1) = 1 + (−0.8)(2/3)(1 − (−2)) = −0.6
and the conditional variance is given by
V (X|Y = 1) = 4(1 − .64) = 1.44
4.4 Sampling from a Multivariate Normal Distribution
Just as we take samples from univariate populations to estimate parameters of the popu-
lations, we use multivariate samples for parametric inference in multivariate distributions.
The method of maximum likelihood is often used to estimate parameters such as the population
mean and dispersion.
2. (n − 1)S is distributed as Wp (Σ, n − 1)
3. X̄ and S are independent.
Theorem 4.4.6 Let X1 , X2 , · · · be i.i.d. random vectors with mean µ and dispersion Σ.
Then
√n ( X̄n − µ ) →d Np(0, Σ).
In other words, as n → ∞, the distribution of √n ( X̄n − µ ) approaches the Np(0, Σ) distribution.
To test the hypothesis H0 : µ = µ0 on the basis of a random sample X1, X2, · · · , Xn from Np(µ, Σ), we use the statistic
T² = n ( X̄ − µ0 )ᵀ S⁻¹ ( X̄ − µ0 )
where X̄ is the sample mean vector and S is the sample dispersion matrix. This statistic is called Hotelling's T². We do not need special tables for the critical values of this distribution because it turns out that
T² ∼ ( (n − 1)p/(n − p) ) Fp,n−p
where Fp,n−p denotes an F distribution with numerator degrees of freedom p and denomi-
nator degrees of freedom n − p. We reject the null hypothesis H0 : µ = µ0 in favour of the
alternative H1 : µ ≠ µ0 if the observed value of T² exceeds ( (n − 1)p/(n − p) ) Fp,n−p,α.
Note that in the special case of p = 1, T² = n(X̄ − µ0)²/s² ∼ t²n−1 and ( (n − 1)p/(n − p) ) Fp,n−p = F1,n−1 = t²n−1.
Note that the null hypothesis H0 : µ = µ0 is rejected at significance level α if and only if
µ0 falls outside the 100(1 − α)% confidence region.
The directions of the axes of the ellipsoid are given by the eigenvectors of S. If the
eigenvalues of S are denoted by λi , i = 1, · · · , p, then the half-lengths of the axes are given
by
√λi √( ( (n − 1)p/(n(n − p)) ) Fp,n−p,α ).
So if k = ( (n − 1)p/(n(n − p)) ) Fp,n−p,α, then the confidence region is given by
{ µ : ( X̄ − µ )ᵀ S⁻¹ ( X̄ − µ ) ≤ k }
and the axes are given by √(λi k) ei, where the ei's are the normalized eigenvectors of S.
In the special case of p = 2, the confidence region is given by an ellipse.
and
√0.002 √( ( (41)(2)/(42(40)) ) F2,40,.05 ) = (0.0447)(0.3971) = 0.018
The angle the major axis makes with the x-axis is given by arctan(0.710/0.704) = 45.24°.
Note that by taking a to be the vector with 1 at ith position and 0 elsewhere, we can
get confidence intervals for the individual µi ’s. These F -intervals will be wider than the
corresponding t-intervals. Interestingly, these simultaneous intervals are the shadows, or
projections, of the confidence ellipsoid on the axes of the component means.
Example 4.5.4 The scores obtained by n = 87 college students on the College Level Ex-
amination Program (CLEP) subtest X1 and College Qualification Test (CQT) subtests X2
and X3 resulted in the following sample mean and sample dispersion.
x̄ = ( 527.74, 54.69, 25.13 )ᵀ   and   S = [ 5691.34  600.51  217.25
                                              600.51  126.05   23.37
                                              217.25   23.37   23.11 ]
Construct 95% simultaneous confidence intervals for µ1 , µ2 , µ3 and µ2 − µ3 .
√( ( (n − 1)p/(n(n − p)) ) Fp,n−p,α ) = √( ( (86)(3)/(87(84)) ) F3,84,.05 ) = √( ( (86)(3)/(87(84)) ) (2.7) ) = 0.3087
Here aᵀx̄ is 527.74, 54.69, 25.13 and 54.69 − 25.13 = 29.56, and aᵀSa is 5691.34, 126.05, 23.11 and 102.42 for µ1, µ2, µ3 and µ2 − µ3 respectively. So the confidence intervals are
527.74 ± (0.3087) √5691.34 = 527.74 ± 23.29,
54.69 ± (0.3087) √126.05 = 54.69 ± 3.47,
25.13 ± (0.3087) √23.11 = 25.13 ± 1.48
and
29.56 ± (0.3087) √102.42 = 29.56 ± 3.12.
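The same intervals can be produced with a few lines of code; this is our own sketch, and SciPy's exact F quantile (about 2.71) is slightly larger than the tabled value 2.7 used above, so the half-widths differ in the second decimal place:

import numpy as np
from scipy.stats import f

n, p, alpha = 87, 3, 0.05
xbar = np.array([527.74, 54.69, 25.13])
S = np.array([[5691.34, 600.51, 217.25],
              [ 600.51, 126.05,  23.37],
              [ 217.25,  23.37,  23.11]])

k = np.sqrt((n - 1) * p / (n * (n - p)) * f.ppf(1 - alpha, p, n - p))
for a, label in [([1, 0, 0], "mu1"), ([0, 1, 0], "mu2"),
                 ([0, 0, 1], "mu3"), ([0, 1, -1], "mu2 - mu3")]:
    a = np.array(a)
    print(label, round(a @ xbar, 2), "+/-", round(k * np.sqrt(a @ S @ a), 2))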
Chapter 5
Decision Theory
5.1 Introduction
Game theory models strategic situations, or games, in which an individual’s success in mak-
ing choices depends on the choices of others. It has applications in statistics, logic, computer
science, economics, management, operations research, political science, social psychology
and biology. Game Theory can be regarded as a multi-agent decision problem where there
are two or more people contending for limited rewards or payoffs. The players make decisions
on which their payoff depends. Each player is expected to behave rationally.
The first known discussion of game theory dates back to James Waldegrave in 1713. The
main contributors to the field are John von Neumann, Émile Borel, John Nash, Reinhard
Selten, John Harsanyi, John Maynard Smith, Thomas Schelling, Robert Aumann, Roger
Myerson, Leonid Hurwicz and Eric Maskin.
Game theory can be classified into cooperative or non-cooperative games depending on
whether the players cooperate with each other. Of these, noncooperative games are the more widely studied, since they can model strategic situations in considerable detail. Games can also be classified as symmetric or asymmetric. If the
identities of the players can be interchanged without changing the payoff to the strategies,
then a game is symmetric. Another important classification of games is into zero-sum
and non-zero-sum games. In zero-sum games, choices by players can neither increase nor
decrease the available resources because the total benefit to all players in the game, for every
combination of strategies, always adds to zero.
Example 5.1.1 Players A and B play the game of penny matching where each player has
a penny and secretly turns the penny to heads or tails. The players then reveal their choices
simultaneously. If the pennies match, that is, both heads or both tails, Player A keeps both
pennies. This, of course, means Player B loses a penny so that it is +1 for A and -1 for B.
If the pennies do not match, Player B keeps both pennies so that it is -1 for A and +1 for
B. The pay-offs can be represented as follows:
                            Player B
                         Heads      Tails
Player A    Heads       (1, -1)    (-1, 1)
            Tails       (-1, 1)    (1, -1)
This is an example of a zero-sum game, where one player's gain is exactly equal to the other player's loss. The matrix that represents the pay-offs is called the pay-off matrix, and the choices available to the players are called strategies. Since in a zero-sum game the second player's pay-off is determined by that of the first player, for such games we can represent the pay-off matrix in a simpler fashion:
Player B
Heads Tails
Player A Heads 1 -1
Tails -1 1
We shall use the convention that the entry always represents the gain of Player A, whose
strategies are represented by rows of the pay off matrix.
Example 5.1.2 A manufacturer must decide whether he should expand his business now
or postpone expansion. The implications of his action or inaction depend on whether there
is going to be a recession. If he expands now, there will be a profit of $100,000 if there is
no recession and a loss of $50,000 if there is recession. On the other hand, if he postpones
expansion, there will be a profit of $60,000 if there is no recession and a profit of $5,000 if
there is recession. One may consider this to be game between the manufacturer and Nature.
Using $1000 as a unit, we can write down the pay-off matrix as follows:
Nature
No recession Recession
Manufacturer Expand 100 -50
Postpone 60 5
In general, writing the entries as values of a loss function L, such a matrix takes the form
Nature
No recession Recession
Manufacturer Expand L(a1 , θ1 ) L(a1 , θ2 )
Postpone L(a2 , θ1 ) L(a2 , θ2 )
If there are more than two choices for the players, say m for the manufacturer and n for
Nature, the resulting matrix will be of order m × n. The amounts L(ai , θj ) are the values of
the loss function. In other words, L(ai , θj ) is the loss of the manufacturer if he takes action
ai when the true state of the Nature is θj .
We will now return to the methods of finding optimal strategies for two-person zero-sum
games. Consider the following example.
Example 5.1.4 For the following game, find the optimal strategies for both players.
Player B
I II
Player A 1 10 -1
2 11 13
It is easy to see that for Player B, the optimal choice depends on Player A’s strategy. If
A chooses Strategy 1, Player B, for whom small numbers are better, will choose Strategy
II, whereas he will choose Strategy I if A chooses Strategy 2. But this uncertainty is not
present from A’s point of view. Irrespective of the choice of B, Strategy 2 is the better one
for him. Knowing this, B will always choose Strategy I, and the end result is that A will
get paid $11.
In Example 5.1.4, the amount A wins, which is 11, is called the value of the game. If the
numbers in a row of a pay-off matrix are all less than or equal to the corresponding values
in another row, the first row is said to be dominated by the second row, and hence the first
row can be eliminated from the game. Similarly, if the numbers in a column are all greater
than or equal to the corresponding values in another column, the first column is dominated
by the second column, and hence the first column can be eliminated. In this example, Row
2 dominates Row 1, and after eliminating Row 1, Column 1 dominates Column 2.
Dominated rows are also called recessive rows. Similarly, dominated columns are called
recessive columns.
Smaller rows and larger columns are dominated (recessive), and hence can be removed.
In many situations, using dominated strategies to arrive at a complete solution will not
work. One may find that there are no dominated strategies at all, or may only succeed in
partially reducing the size of the matrix.
Example 5.1.5 Consider the following two games.

Matrix 1:
              Player B
               I   II  III
Player A   1  -3    4   -4
           2   0    2    4
           3  -4   -8   10

Matrix 2:
              Player B
               I   II  III
Player A   1  -3    4   -7
           2   0    2   -6
           3  -4   -8   -5

In the two matrices given above, the first one has no dominated strategies, whereas the second can be reduced to a smaller matrix: Strategy I of Player B can be eliminated because it is dominated by Strategy III. No further reduction is possible. After deleting the dominated strategy of B, the new pay-off matrix is

              Player B
               II  III
Player A   1    4   -7
           2    2   -6
           3   -8   -5
For Matrix 1, the minimum pay-offs for Player A for his three strategies are -4, 0 and
-8, and among these, the maximum is 0, so his minimax strategy is Strategy 2. Similarly
Player B’s minimax strategy is found by looking at his maximum losses, 0, 4 and 10, and
taking the minimum. This leads to Strategy I.
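The pure-strategy security levels described above are easy to compute mechanically. The following is a small sketch (not from the notes); the helper name pure_strategies and the return convention are our own illustrative choices.

    import numpy as np

    def pure_strategies(M):
        M = np.asarray(M, dtype=float)
        row_mins = M.min(axis=1)           # worst case for A in each row
        col_maxs = M.max(axis=0)           # worst case for B in each column
        a_row = int(np.argmax(row_mins))   # A maximises his minimum pay-off
        b_col = int(np.argmin(col_maxs))   # B minimises his maximum loss
        saddle = row_mins[a_row] == col_maxs[b_col]
        return a_row, b_col, row_mins[a_row], saddle

    # Matrix 1 of Example 5.1.5 (0-indexed rows and columns)
    M = [[-3, 4, -4], [0, 2, 4], [-4, -8, 10]]
    print(pure_strategies(M))   # (1, 0, 0.0, True): Strategy 2, Column I, value 0

Here the two security levels coincide, so the pair of pure strategies is a saddle point and the strategies are spyproof in the sense discussed next.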
An important feature of the strategies in the previous example is that they are spyproof: even if a player knows the intentions of the other player, his minimax strategy remains his best option. Often, games are not spyproof.
Example 5.1.6 For the following game, the strategies of the players are not spyproof.
Player B
I II
Player A 1 4 -9
2 -2 2
For Player A, the minimax strategy would be Strategy 2, the one that maximizes mini-
mum pay-off. For B, it is Strategy II, but if he knows that A will use strategy 2, B is better
off using Strategy I. Similarly, if A knows that B is going for Strategy II, he will want to use
Strategy 2, but if he thinks B may switch to Strategy I, he may want to switch to Strategy
1 for a better pay-off. This could lead to guessing and outguessing, and a lot of attempted
trickery and deception.
Example 5.1.8 Solve the following game. In other words, find the minimax strategies for
both players and the value of the game.
Player B
I II III
1 0 6 -1
Player A 2 1 4 2
3 -1 -3 6
Solution: Clearly the entry 1 in Row 2 and Column 1 is a saddle point, that is, simulta-
neously a row minimum and a column maximum. So the optimal strategies are Row 2 and
Column 1, and the value of the game is 1.
B to use Column 2 and expecting to win $2, he will have an unpleasant surprise when B plays Column 1 and causes A to lose $2 instead. If A tries to alternate his choices, the opponent will soon catch on and counter by suitably changing his own strategy.
The only viable option in this sort of situation is to mix the strategies in such a way that the opponent cannot predict your move. For instance, you may toss a coin or roll a die and, depending on the outcome, make your decision. It then all comes down to what probabilities one should attach to each strategy.
Theorem 5.2.2 If for a pay-off matrix M , R plays the m different strategies with proba-
bilities in the vector p and C plays the n different strategies with probabilities in the vector
q, then the expected winning of R is given by pM q.
Proof: The expected pay-off is given by Σᵢ₌₁ᵐ Σⱼ₌₁ⁿ pᵢ qⱼ mᵢⱼ = pMq.
That there always exist minimax strategies for both players is an important theorem in game theory. This theorem, known as the fundamental theorem of game theory, is due to John von Neumann. If p∗ and q∗ denote these optimal strategies, then
p∗ M q∗ = v.
The number v is the value of the game. The triplet (p∗, q∗, v) is called a solution of the game.
The fundamental theorem of game theory in its full generality will not be proved in
this course. For the special case of 2 × 2 games we have formulas that give us a complete
solution.
Theorem 5.2.4 For a 2 × 2 game M = [ a  b ; c  d ], let D = (a + d) − (b + c). If M is not strictly determined, then D ≠ 0. Moreover, the minimax strategies and the value of the game are given by
p∗ = [p∗1, p∗2] = [ (d − c)/D , (a − b)/D ],
q∗ = [q∗1, q∗2]ᵀ = [ (d − b)/D , (a − c)/D ]ᵀ,
and
v = (ad − bc)/D.
Proof: The expected pay-off for R, if he uses the randomized strategy (p, 1 − p), is pa + (1 − p)c if C chooses Column 1 and pb + (1 − p)d if C chooses Column 2. So the minimum pay-off is min(pa + (1 − p)c, pb + (1 − p)d). This is to be maximized with respect to p. It is clear that the minimum is maximized at the point of intersection of the two lines pa + (1 − p)c and pb + (1 − p)d, which is found by solving the equation
pa + (1 − p)c = pb + (1 − p)d,
which leads to p = (d − c)/D. Similarly, C's randomized strategy (q, 1 − q) is found by solving
qa + (1 − q)b = qc + (1 − q)d,
which leads to q = (d − b)/D. The value v is calculated by
p∗ M q∗ = (1/D²) (d − c, a − b) M (d − b, a − c)ᵀ
        = (1/D²) [ a(d − c)(d − b) + c(a − b)(d − b) + b(d − c)(a − c) + d(a − b)(a − c) ]
        = (1/D²) [ ad² − abd − acd + abc + acd − bcd − abc + b²c + abd − abc − bcd + bc² + a²d − abd − acd + bcd ]
        = (1/D²) [ ad² + a²d + b²c + bc² − abd − acd − abc − bcd ]
        = (1/D²) [ (ad − bc)((a + d) − (b + c)) ]
        = (ad − bc)/D.
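The formulas of Theorem 5.2.4 are easy to put into a few lines of code. The following sketch is not part of the notes; it simply evaluates the theorem and is applied to Example 5.2.5 below.

    import numpy as np

    def solve_2x2(a, b, c, d):
        D = (a + d) - (b + c)                # assumed non-zero (not strictly determined)
        p = np.array([d - c, a - b]) / D     # row player's mixed strategy
        q = np.array([d - b, a - c]) / D     # column player's mixed strategy
        v = (a * d - b * c) / D              # value of the game
        return p, q, v

    # Two-finger Morra of Example 5.2.5: a = 2, b = c = -3, d = 4
    print(solve_2x2(2, -3, -3, 4))   # p = q = [7/12, 5/12], v = -1/12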
Example 5.2.5 The two-finger Morra game is played by two players who simultaneously
show one or two fingers and agree on a pay-off for each combination of outcomes. There are
many variations to this game, and here we consider the case where the pay-off matrix is as
follows:
1 finger 2 fingers
1 finger 2 -3
2 fingers -3 4
Before we work out the optimal strategies, it would be interesting to think about which
one you would rather be - a row player or a column player. As there are no saddle points, the
game is not strictly determined. Since a = 2, b = c = −3, d = 4, we have D = 6−(−6) = 12.
Hence p∗ = (7/12, 5/12), q∗ = (7/12, 5/12)ᵀ and v = (ad − bc)/D = −1/12. Clearly it is better to be C.
Example 5.2.6 Suppose the following matrix represents the pay-off for a two-person zero-
sum game. Find the optimal strategies for each player and the value of the game.
Player B
I II III
1 12 -6 2
Player A 2 -3 9 10
3 -4 4 12
Solution: There are no recessive rows, but the third column is dominated by Column 2.
We remove Column 3 to get the following matrix:
Player B
I II
1 12 -6
Player A 2 -3 9
3 -4 4
Now Row 3 is dominated by Row 2, so we can remove Row 3 to get the following
irreducible matrix.
Player B
I II
1 12 -6
Player A 2 -3 9
There are no saddle points so the game is not strictly determined. Since a = 12, b =
−6, c = −3, d = 9, we have D = 21 − (−9) = 30. Hence
p∗ = (12/30, 18/30) = (2/5, 3/5),
q∗ = (15/30, 15/30)ᵀ = (1/2, 1/2)ᵀ,
and
v = (ad − bc)/D = 90/30 = 3.
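The by-hand elimination of recessive rows and columns in Example 5.2.6 can also be automated. The sketch below is not from the notes; reduce_by_dominance is our own illustrative helper that repeatedly removes a row that is entrywise ≤ another row, or a column that is entrywise ≥ another column.

    import numpy as np

    def reduce_by_dominance(M):
        M = np.asarray(M, dtype=float)
        rows, cols = list(range(M.shape[0])), list(range(M.shape[1]))
        changed = True
        while changed:
            changed = False
            # a row is recessive if some other row is >= it entrywise
            for i in list(rows):
                if any(j != i and np.all(M[j, cols] >= M[i, cols]) for j in rows):
                    rows.remove(i); changed = True; break
            # a column is recessive if some other column is <= it entrywise
            for i in list(cols):
                if any(j != i and np.all(M[rows, :][:, j] <= M[rows, :][:, i]) for j in cols):
                    cols.remove(i); changed = True; break
        return M[np.ix_(rows, cols)], rows, cols

    M = [[12, -6, 2], [-3, 9, 10], [-4, 4, 12]]
    print(reduce_by_dominance(M))   # keeps Rows 1, 2 and Columns I, II, as in Example 5.2.6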
Example 5.3.1 We are told that a coin is either unbiased or double-headed, and we have
to make a decision based on the outcome of a toss of the coin. There is a penalty of 1 for a
wrong decision and no reward or penalty for the correct decision. This gives us the following
loss matrix.
Nature
θ1 θ2
Statistician a1 L(a1 , θ1 ) = 0 L(a1 , θ2 ) = 1
a2 L(a2 , θ1 ) = 1 L(a2 , θ2 ) = 0
Here θ1 is the state of Nature that the coin is balanced and θ2 is the state of Nature that
the coin is double-headed. Similarly, a1 is statistician’s decision that the coin is balanced
and a2 is statistician’s decision that the coin is double-headed.
Let X denote the number of heads obtained when the coin is tossed; X takes value 0
or 1. Player A, the statistician, knows the outcome of the coin toss, that is, he knows the
value the random variable X takes. To make use of this information in choosing between
the two actions, we need a function, known as a decision function, or a decision rule, that
associates outcomes to actions. As there are two possible outcomes and two actions in this example, we have 2² = 4 possible decision rules. In general, the set of all decision rules is the set of all functions from the set of possible values of X (the sample space) to the action space A. In our case, the
first decision rule d1 associates both outcomes to action a1 , d2 associates x = 0 to action
a1 and x = 1 to action a2 , d3 associates x = 0 to action a2 and x = 1 to action a1 , and d4
associates both outcomes to action a2 . Notationally, this means
d1 (0) = a1 , d1 (1) = a1
d2 (0) = a1 , d2 (1) = a2
d3 (0) = a2 , d3 (1) = a1
d4 (0) = a2 , d4 (1) = a2 .
We now compare these decision rules in terms of the expected loss to which they lead.
The expected loss of a particular decision rule is called its risk, and this depends on the
value of θ. Thus we have a risk function, defined as
R(d, θ) = Eθ [L(d(X), θ)]
where Eθ indicates that the expectation is taken under the assumption that the true state
of Nature is θ. We can now calculate the values of the risk function. Note that Pθ (X = x)
has values for various choices of θ and x given by
Pθ1(X = 0) = 1/2,  Pθ1(X = 1) = 1/2,  Pθ2(X = 0) = 0,  Pθ2(X = 1) = 1.
So
R(d1, θ1) = Eθ1[L(d1(X), θ1)] = L(a1, θ1) = 0
R(d1, θ2) = Eθ2[L(d1(X), θ2)] = L(a1, θ2) = 1
R(d2, θ1) = Eθ1[L(d2(X), θ1)] = (1/2)(0) + (1/2)(1) = 1/2
R(d2, θ2) = Eθ2[L(d2(X), θ2)] = L(a2, θ2) = 0
R(d3, θ1) = Eθ1[L(d3(X), θ1)] = (1/2)(1) + (1/2)(0) = 1/2
R(d3, θ2) = Eθ2[L(d3(X), θ2)] = L(a1, θ2) = 1
and similarly R(d4, θ1) = 1 and R(d4, θ2) = 0. These values are collected in the following table.
                     Nature
                     θ1     θ2
              d1      0      1
Statistician  d2     1/2     0
              d3     1/2     1
              d4      1      0
Remember that here we have a loss matrix, not a pay-off matrix, so rows with large
values may be eliminated. Clearly d1 is better than d3 and d2 is better than d4 , so we can
remove d3 and d4 . This is not surprising since d3 and d4 lead to a decision of double-headed
coin when a tail is obtained, which is absurd.
This leaves us with a 2 × 2 game given by
                     Nature
                     θ1     θ2
Statistician  d1      0      1
              d2     1/2     0
and if Nature is looked upon as a malevolent opponent, it can be shown that the optimal strategy for the statistician is to randomize between d1 and d2 with respective probabilities 1/3 and 2/3.
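The risk table above can be reproduced by direct enumeration. The following sketch is not part of the notes; the dictionaries P_X, LOSS and RULES simply encode the two states, the 0–1 loss and the four decision rules of Example 5.3.1.

    P_X = {"theta1": {0: 0.5, 1: 0.5}, "theta2": {0: 0.0, 1: 1.0}}
    LOSS = {("a1", "theta1"): 0, ("a1", "theta2"): 1,
            ("a2", "theta1"): 1, ("a2", "theta2"): 0}
    RULES = {"d1": {0: "a1", 1: "a1"}, "d2": {0: "a1", 1: "a2"},
             "d3": {0: "a2", 1: "a1"}, "d4": {0: "a2", 1: "a2"}}

    def risk(rule, theta):
        # R(d, theta) = E_theta[ L(d(X), theta) ]
        return sum(P_X[theta][x] * LOSS[(rule[x], theta)] for x in (0, 1))

    for name, rule in RULES.items():
        print(name, risk(rule, "theta1"), risk(rule, "theta2"))
    # d1: 0, 1    d2: 0.5, 0    d3: 0.5, 1    d4: 1, 0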
The decision rules d3 and d4 in the above example are considered inadmissible.
Definition 5.3.2 A decision rule d is said to be inadmissible if there exist a decision rule
d1 and a value θ0 such that R(d1 , θ) ≤ R(d, θ) for all θ and R(d1 , θ0 ) < R(d, θ0 ).
The following example further illustrates the concepts of a loss function and a risk
function.
Example 5.3.3 A random variable X has uniform (0, θ) distribution. We want to estimate
θ on the basis of a single observation x. The decision functions are to be of the form
d(x) = kx, where k ≥ 1 is a constant. The losses are proportional to the absolute error, so
the loss function is given by
L(kx, θ) = c|kx − θ|
where c is a positive constant. Find the value of k that minimizes the risk.
Solution: Note that the density function is given by
f(x) = (1/θ) I[0 < x < θ].
So the risk function is given by
R(d, θ) = Eθ[L(d(X), θ)]
        = ∫₀^θ c|kx − θ| (1/θ) dx
        = ∫₀^{θ/k} c(θ − kx)(1/θ) dx + ∫_{θ/k}^{θ} c(kx − θ)(1/θ) dx
        = (c/θ) [ (θx − kx²/2) |₀^{θ/k} + (kx²/2 − θx) |_{θ/k}^{θ} ]
        = (c/θ) [ θ²/k − θ²/(2k) + kθ²/2 − θ² − θ²/(2k) + θ²/k ]
        = cθ ( k/2 − 1 + 1/k ).
This is minimized when k = √2, as can easily be verified by elementary calculus. So the best rule among all decision rules of the form d(x) = kx is d(x) = √2 x.
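A quick numerical check of this minimization is given below. This sketch is not from the notes; scipy is used only for the one-dimensional minimisation, and the closing Monte Carlo line checks the closed-form risk against a simulated E|kX − θ| with θ = c = 1.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def risk(k, theta=1.0, c=1.0):
        # closed form derived above; theta and c only rescale the risk
        return c * theta * (k / 2 - 1 + 1 / k)

    res = minimize_scalar(risk, bounds=(1.0, 5.0), method="bounded")
    print(res.x, np.sqrt(2))               # both close to 1.4142

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1.0, size=200_000)
    print(np.mean(np.abs(np.sqrt(2) * x - 1.0)), risk(np.sqrt(2)))   # both ~ 0.414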
In Example 5.3.3, the minimizing value of k did not depend on the value of θ. This is
often not the case, and if we had not restricted ourselves to decision functions of the form
d(x) = kx, the situation would be very different. Without any restriction, it is clear that
d(x) = θ1 would be the best choice when θ = θ1 , d(x) = θ2 would be the best choice when
θ = θ2 etc. and there is no best decision rule that works for all values of θ.
In general, we try to find decision rules that are best with respect to some criterion. The
two criteria that we shall consider are the minimax criterion and the Bayes criterion. In the
first, we choose d that minimizes the maximum value of R(d, θ) over θ. In the other, we
choose the decision function d for which the Bayes Risk E[R(d, Θ)] is a minimum, where
the expectation is taken with respect to Θ. For this, we need to look upon Θ as a random
variable with a given distribution.
Example 5.4.1 Let X have a binomial distribution with parameters n and θ, where θ is the unknown parameter. We consider decision functions of the form d(x) = (x + a)/(n + b), where a and b are constants. The loss is assumed to be proportional to the squared error. Find the minimax decision rule.
Solution: Let the decision rule with constants a and b be denoted by da,b. Then L(da,b(x), θ) = c((x + a)/(n + b) − θ)². The risk function R(da,b, θ) is given as follows.
We make use of the fact that E(X) = nθ and E(X²) = nθ(1 − θ + nθ) to get that
R(da,b, θ) = c/(n + b)² [ θ²(b² − n) + θ(n − 2ab) + a² ].
We maximize this expression with respect to θ for fixed a and b, and then minimize the resulting maximum with respect to a and b. This involves some messy algebra, and the minimizing values are a = √n/2 and b = √n.
To simplify the computation of the minimax rule in the above problem, we can make
use of the equalizer principle. According to the equalizer principle, under fairly general
conditions, the risk function of the minimax decision rule is a constant that does not depend
on θ. Using this principle, we get that for the minimax rule,
b² − n = 0,    n − 2ab = 0.
From these, it follows that b = √n and a = √n/2.
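The equalizer property is easy to check numerically. The sketch below is not part of the notes; it evaluates the risk expression derived above at a = √n/2, b = √n for several values of θ and confirms that the risk is constant.

    import numpy as np

    def risk(theta, n, a, b, c=1.0):
        # R(d_{a,b}, theta) = c/(n+b)^2 [theta^2 (b^2 - n) + theta (n - 2ab) + a^2]
        return c / (n + b) ** 2 * (theta**2 * (b**2 - n) + theta * (n - 2 * a * b) + a**2)

    n = 25
    a, b = np.sqrt(n) / 2, np.sqrt(n)
    print([round(risk(t, n, a, b), 6) for t in (0.1, 0.3, 0.5, 0.9)])   # all equal to n/(4(n+sqrt(n))^2)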
5.5 The Bayes Criterion
Under this method, we treat the parameter as a random variable following a certain distribution, called the prior distribution. The Bayes risk of a decision rule is the expected value of
R(d, Θ) where Θ is a random variable that takes values in the parameter space. We need
to choose the decision rule d that minimizes the Bayes risk. This involves the computa-
tion of the conditional distribution of Θ given X. This distribution is called the posterior
distribution of Θ given X.
Example 5.5.1 Suppose X ∼ U (0, θ) and θ has a prior distribution given by Θ ∼ gamma
(2, 1). Find the Bayes estimator based on a single observation if we assume that the loss
function is proportional to the squared-error.
Solution: The density of X was originally written as fX(x) = (1/θ) I[0 < x < θ], but now that the parameter is a random variable, we should treat it as a conditional density:
fX|Θ(x|θ) = (1/θ) I[0 < x < θ].
The density function of Θ is given by fΘ(θ) = θe^{−θ} I[θ > 0], the gamma(2, 1) density. The Bayes risk of a decision rule d is
E[R(d, Θ)] = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} L(d(x), θ) fX|Θ(x|θ) dx ] fΘ(θ) dθ
           = ∫_{−∞}^{∞} ∫_{−∞}^{∞} L(d(x), θ) fX,Θ(x, θ) dx dθ
           = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} c(d(x) − θ)² fΘ|X(θ|x) dθ ] fX(x) dx.
For each x, if d(x) is so chosen that the interior integral is minimized, the double integral
will be minimized. To minimize
∫_{−∞}^{∞} c(d(x) − θ)² fΘ|X(θ|x) dθ
for a fixed x, we need to choose d(x) = E[Θ|X = x]. (This is because for any random
variable Y , the value of a that minimizes E(Y − a)2 is E(Y ). Apply this to the conditional
distribution of Θ given X = x. Here the variable a is d(x).)
Therefore the Bayes decision rule is d(x) = E[Θ|X = x]. Here the joint density of X and Θ is fX,Θ(x, θ) = (1/θ) · θe^{−θ} = e^{−θ} for 0 < x < θ, the marginal density of X is fX(x) = ∫_x^∞ e^{−θ} dθ = e^{−x}, and hence the posterior density is fΘ|X(θ|x) = e^{x−θ} for θ > x. So
E[Θ|X = x] = ∫_x^∞ θ e^{x−θ} dθ
           = ∫_0^∞ (x + y) e^{−y} dy     (by substituting y = θ − x)
           = x ∫_0^∞ e^{−y} dy + ∫_0^∞ y e^{−y} dy
           = x + 1
because both of the integrals can be verified to equal to 1, the first directly and the second
by integration by parts. Thus the Bayes decision rule is given by d(x) = x + 1. So the Bayes
estimator of θ is x + 1.
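The posterior-mean computation can be confirmed by numerical integration. This is a minimal sketch, not part of the notes, assuming only the posterior density e^{x−θ} for θ > x derived above.

    import numpy as np
    from scipy.integrate import quad

    def posterior_mean(x):
        num, _ = quad(lambda t: t * np.exp(x - t), x, np.inf)
        den, _ = quad(lambda t: np.exp(x - t), x, np.inf)    # equals 1
        return num / den

    print(posterior_mean(0.7), posterior_mean(2.3))   # 1.7 and 3.3, i.e. x + 1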
The computation described above is a general method for finding the Bayes estimator
whenever the loss function is proportional to the squared error. The other loss function that
is commonly used is L(d(x), θ) = |d(x) − θ|. For this case we need to use the fact that for
any random variable Y , the value of a that minimizes E|Y − a| is the median of Y . Hence
we have the following theorem.
Theorem 5.5.2
1. If the loss function is proportional to the squared error, then the Bayes estimator is
given by the conditional expectation of the posterior distribution.
2. If the loss function is proportional to the absolute error, then the Bayes estimator is
given by the conditional median of the posterior distribution.
Example 5.5.3 For Example 5.5.1, find the Bayes estimator under the absolute error loss
function.
Solution: The conditional density of Θ given X = x is given by fΘ|X (θ|x) = ex−θ I[θ > x]
for x > 0. So the Bayes estimator is given by the median of this distribution, which we
calculate as follows:
The median of a random variable is a number M such that the variable is at most M with probability 0.5. Thus we equate ∫_x^M e^{x−θ} dθ to 0.5 and solve for M.
∫_x^M e^{x−θ} dθ = e^x [ −e^{−θ} ]_x^M = e^x ( e^{−x} − e^{−M} ) = 1 − e^{x−M}
Setting this equal to 0.5 gives e^{x−M} = 0.5, that is, M = x + log 2. So the Bayes estimator under absolute error loss is d(x) = x + log 2.
Example 5.5.4 Let X have the geometric distribution P(X = x | θ) = θ(1 − θ)^x for x ∈ {0, 1, 2, . . .}, and suppose θ has a uniform(0, 1) prior. Find the Bayes estimator of θ under squared-error loss.
Solution: The conditional density of X given Θ = θ is fX|Θ(x|θ) = θ(1 − θ)^x for x ∈ {0, 1, 2, . . .}. The density function of Θ is fΘ(θ) = I[0 < θ < 1], so the joint density of X and Θ is
fX,Θ(x, θ) = θ(1 − θ)^x I[0 < θ < 1]
for x ∈ {0, 1, 2, . . .}. The marginal density of X is obtained by integrating the joint density with respect to θ. This leads to
fX(x) = ∫_0^1 θ(1 − θ)^x dθ = Γ(2)Γ(x + 1)/Γ(x + 3).
The posterior density is therefore proportional to θ(1 − θ)^x, which is the beta distribution with parameters 2 and x + 1. Hence the conditional expectation of Θ given X = x is 2/(x + 3), and this is the Bayes estimator.
Theorem 5.5.5 If X ∼ binomial(n, θ) and the parameter θ has a prior distribution given by Θ ∼ beta(α, β), the posterior distribution of Θ given X = x is a beta distribution with parameters x + α and n − x + β. The Bayes estimator of θ is given by (x + α)/(n + α + β).
Proof: Writing C(n, x) for the binomial coefficient,
fX|Θ(x|θ) = C(n, x) θ^x (1 − θ)^{n−x}    for x ∈ {0, 1, 2, . . . , n},
fΘ(θ) = [Γ(α + β)/(Γ(α)Γ(β))] θ^{α−1} (1 − θ)^{β−1} I[0 < θ < 1],
fX,Θ(x, θ) = C(n, x) [Γ(α + β)/(Γ(α)Γ(β))] θ^x (1 − θ)^{n−x} θ^{α−1} (1 − θ)^{β−1} I[0 < θ < 1]
           = C(n, x) [Γ(α + β)/(Γ(α)Γ(β))] θ^{x+α−1} (1 − θ)^{n−x+β−1} I[0 < θ < 1],
fX(x) = ∫_0^1 C(n, x) [Γ(α + β)/(Γ(α)Γ(β))] θ^{x+α−1} (1 − θ)^{n−x+β−1} dθ
      = C(n, x) [Γ(α + β)/(Γ(α)Γ(β))] ∫_0^1 θ^{x+α−1} (1 − θ)^{n−x+β−1} dθ
      = C(n, x) [Γ(α + β)/(Γ(α)Γ(β))] [Γ(α + x)Γ(n − x + β)/Γ(n + α + β)],
fΘ|X(θ|x) = [Γ(n + α + β)/(Γ(α + x)Γ(n − x + β))] θ^{x+α−1} (1 − θ)^{n−x+β−1}.
This is a beta(x + α, n − x + β) distribution, and hence the conditional expectation of Θ given X = x is (x + α)/(n + α + β).
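The conjugate update in Theorem 5.5.5 is a one-liner in code. The following sketch is not from the notes; it uses scipy's beta distribution and the numbers of Example 5.5.6 below.

    from scipy.stats import beta

    def posterior(x, n, a, b):
        post = beta(x + a, n - x + b)            # posterior distribution of theta
        bayes_estimate = (x + a) / (n + a + b)   # posterior mean = Bayes estimator
        return post, bayes_estimate

    post, est = posterior(x=12, n=20, a=5, b=3)
    print(est)            # 17/28 ~ 0.607
    print(post.mean())    # same value, from the beta distribution object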
Example 5.5.6 Let X ∼ binomial(20, θ) with θ having a beta (5, 3) prior distribution.
Find the Bayes estimator of θ if X = 12 is observed.
The Bayes estimator is given by (x + α)/(n + α + β) = (12 + 5)/(20 + 5 + 3) = 17/28.
Theorem 5.5.7 If a random sample of size n is drawn from a normal population with
mean µ and variance σ 2 where the prior for µ is given by a N (µ0 , σ02 ) distribution, then the
posterior distribution of µ is given by N(µ1, σ1²), where
µ1 = (n x̄ σ0² + µ0 σ²)/(n σ0² + σ²)
and
σ1² = (σ0² σ²)/(n σ0² + σ²).
The Bayes estimator for µ is therefore (n x̄ σ0² + µ0 σ²)/(n σ0² + σ²).
Example 5.5.8 A distributor of soft drink vending machines feels that in a supermarket,
the number of drinks one of his machines will sell is normally distributed with an average
of 738 drinks per week and standard deviation 13.4. For a machine placed in a particular
market, the number of drinks sold varies from week to week, and is represented by a normal
distribution with standard deviation 42.5. If one of the distributor’s machines put into a
new super market averaged 692 sales in the first 10 weeks, what is the distributor’s personal
probability that for this market the value of the mean is actually between 700 and 720?
Here σ = 42.5, µ0 = 738, σ0 = 13.4, n = 10 and x̄ = 692. So
µ1 = (n x̄ σ0² + µ0 σ²)/(n σ0² + σ²) = (10(692)(13.4)² + 738(42.5)²)/(10(13.4)² + (42.5)²) ≈ 715
and
σ1² = (σ0² σ²)/(n σ0² + σ²) = ((13.4)²(42.5)²)/(10(13.4)² + (42.5)²) ≈ 90.
So the required probability is P((700 − 715)/√90 ≤ Z ≤ (720 − 715)/√90) = P(−1.58 ≤ Z ≤ 0.53) = 0.645.
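The posterior computation of Example 5.5.8 can be reproduced directly. This is a small sketch, not part of the notes, that applies the formulas of Theorem 5.5.7 and evaluates the posterior probability with scipy.

    import numpy as np
    from scipy.stats import norm

    sigma, mu0, sigma0, n, xbar = 42.5, 738.0, 13.4, 10, 692.0
    mu1 = (n * xbar * sigma0**2 + mu0 * sigma**2) / (n * sigma0**2 + sigma**2)
    var1 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
    prob = norm.cdf(720, mu1, np.sqrt(var1)) - norm.cdf(700, mu1, np.sqrt(var1))
    print(round(mu1, 1), round(var1, 1), round(prob, 3))
    # ~715.1, ~90.0 and ~0.64 (0.645 with the rounded z-values used in the text)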
Chapter 6
Inference
Example 6.1.1 The emission of alpha particles from radioactive sources is known to follow a Poisson law. In order to calculate probabilities associated with alpha emission, we need the parameter λ, and we estimate it from the emission data. When the emission rate was observed for a radioactive source of americium-241, the mean emission rate was found to be 0.8392 emissions per second. This number is our estimated value of λ.
In Example 6.1.1, we are justified in using the mean value as the estimator of λ because
λ is the theoretical mean, or expectation, of a Poisson(λ) distribution. In general, we need
to come up with a formula for the estimator whose use to estimate the parameter can be
justified. There are two general procedures that we will use. The first is the method of
moments and the second is the method of maximum likelihood.
Definition 6.2.1 The k th moment of a random variable X is defined by µk = E(X k ).
Note: E(X^k) is sometimes referred to as the kth raw moment or the kth moment about the origin. This is in contrast to the kth central moment, defined as E[(X − µ)^k]. Some books use µk to denote the central moments while using the notation µ′k to denote the raw moments.
If X1, X2, . . . , Xn are i.i.d. random variables from the distribution of interest, the kth sample moment is defined as mk = (1/n) Σᵢ₌₁ⁿ Xᵢ^k. Then it is reasonable to take mk as an estimator of µk; the method of moments estimates the parameters by equating the sample moments to the corresponding population moments and solving for the parameters.
Example 6.2.3 Suppose X1 , X2 , . . . , Xn are i.i.d. random variables from the normal dis-
tribution N (µ, σ 2 ). Then θ1 = µ and θ2 = σ 2 . Note that µ1 = µ and µ2 = E(X 2 ) = µ2 + σ 2 .
From these, we get µ = µ1 and σ 2 = µ2 − µ21 . Substituting the sample moments in place of
the population moments, we get
µ̂ = m1 = X̄,    σ̂² = m2 − m1² = (1/n) Σᵢ₌₁ⁿ Xᵢ² − X̄².
Example 6.2.4 Suppose X1 , X2 , . . . , Xn are i.i.d. binomial(k, p) where both k and p are
unknown. (Ordinarily, for a binomial estimation problem, the number of trials is known and only p need be estimated. The case where both are unknown arises, for example, in estimating rates of crimes for which there are many unreported occurrences: for such a crime, the reporting rate p and the number of occurrences k are both unknown.)
E(X) = kp and V (X) = kp(1 − p) from which it follows that E(X 2 ) = kp(1 − p) + k 2 p2 ,
so we get the two equations
X̄ = k̃p̃,    (1/n) Σᵢ₌₁ⁿ Xᵢ² = k̃p̃(1 − p̃) + k̃²p̃².
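Solving the two moment equations gives p̃ = 1 − (m2 − X̄²)/X̄ and k̃ = X̄/p̃. The sketch below is not from the notes; the data are simulated purely for illustration and the helper name mme_binomial is our own.

    import numpy as np

    def mme_binomial(x):
        x = np.asarray(x, dtype=float)
        xbar, m2 = x.mean(), np.mean(x**2)
        p = 1.0 - (m2 - xbar**2) / xbar    # from the two moment equations
        k = xbar / p
        return k, p

    rng = np.random.default_rng(1)
    sample = rng.binomial(n=12, p=0.3, size=500)   # true k = 12, p = 0.3
    print(mme_binomial(sample))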
Example 6.2.6 Let X1 , X2 , . . . , Xn be i.i.d. uniform(0, θ). Find the MME of θ.
E(X) = θ/2, so we equate X̄ to θ/2 to get θ̃ = 2X̄.
The following theorem helps us find the MME of a function of the parameter.
In the method of maximum likelihood, we estimate θ by maximizing the likelihood function of the sample, L(x|θ) = ∏ᵢ₌₁ⁿ f(xᵢ|θ), where x = (x1, x2, . . . , xn). One may use calculus techniques to maximize this function over θ. In many cases, it is easier to maximize the log likelihood function, the natural logarithm of the likelihood function, given by
l(x|θ) = log L(x|θ) = Σᵢ₌₁ⁿ log f(xᵢ|θ).
This is because differentiating a sum is easier than differentiating a product, and taking logarithms does not affect the maximizing value of θ. The equation obtained by differentiating the log likelihood and equating it to zero is called the likelihood equation.
Sometimes it is convenient to treat the xᵢ's as fixed constants and write l(x|θ) simply as l(θ); this avoids notational clutter. We shall denote the maximum likelihood estimator of the parameter θ by θ̂ and refer to it as the MLE of θ. Note that in these notes, log x always denotes the natural logarithm of x.
In the case of discrete distributions, one may occasionally use p in place of f to remind us that we are dealing with a probability mass function rather than a density function. In the discrete case, integration is replaced by summation.
For a sample from a Poisson(λ) distribution,
l(λ) = Σᵢ₌₁ⁿ log p(xᵢ|λ)
     = Σᵢ₌₁ⁿ (xᵢ log λ − λ − log xᵢ!)
     = n x̄ log λ − nλ − Σᵢ₌₁ⁿ log xᵢ!.
Differentiating with respect to λ and equating to zero gives n x̄/λ − n = 0, so the MLE is λ̂ = x̄.
Similarly, for a sample from a N(µ, σ²) distribution,
l(µ, σ²) = Σᵢ₌₁ⁿ log f(xᵢ|µ, σ²)
         = Σᵢ₌₁ⁿ [ −log σ − (1/2) log 2π − (1/(2σ²))(xᵢ − µ)² ]
         = −n log σ − (n/2) log 2π − (1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − µ)².
We now find partial derivatives of l with respect to µ and σ (the differentiation can be
with respect to σ or σ 2 ) and equate them to zero.
∂l/∂µ = (1/σ²) Σᵢ₌₁ⁿ (xᵢ − µ)
∂l/∂σ = −n/σ + σ⁻³ Σᵢ₌₁ⁿ (xᵢ − µ)²
Equating these to zero and solving, we get
µ̂ = x̄
and
σ̂² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)² = (1/n) Σᵢ₌₁ⁿ xᵢ² − x̄².
That the solution to the ML equations gives the maximum can be verified by computing the second derivative matrix and checking that its determinant is positive and its first diagonal entry is negative.
Example 6.3.4 The cosine of the angle at which electrons are emitted in muon decay is a
random variable with density given by
f(x|θ) = (1 + θx)/2 · I[|x| ≤ 1]
where |θ| ≤ 1. Let X1 , X2 , . . . , Xn be a sample from this distribution. Find the MLE of θ.
l(θ) = Σᵢ₌₁ⁿ log f(xᵢ|θ) = Σᵢ₌₁ⁿ (log(1 + θxᵢ) − log 2).
Differentiating l with respect to θ and equating it to zero, we get that the MLE θ̂ is the solution to the equation
Σᵢ₌₁ⁿ xᵢ/(1 + θxᵢ) = 0.
As there is no closed form solution to this equation, we need to use a numerical method, that is, an iterative procedure such as the Newton–Raphson method.
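A minimal Newton–Raphson sketch for this likelihood equation is given below. It is not part of the notes; the data are simulated by inverse-CDF sampling from f(x|θ) = (1 + θx)/2 on [−1, 1] with an assumed true value θ = 0.4, purely for illustration.

    import numpy as np

    def mle_newton(x, theta0=0.0, tol=1e-10, max_iter=100):
        theta = theta0
        for _ in range(max_iter):
            g = np.sum(x / (1 + theta * x))             # score l'(theta)
            h = -np.sum(x**2 / (1 + theta * x)**2)      # l''(theta), always negative
            step = g / h
            theta -= step
            if abs(step) < tol:
                break
        return theta

    rng = np.random.default_rng(2)
    theta_true, u = 0.4, rng.uniform(size=2000)
    # inverse CDF of f(x|theta) on [-1, 1]
    x = (-1 + np.sqrt((1 - theta_true)**2 + 4 * theta_true * u)) / theta_true
    print(mle_newton(x))   # close to 0.4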
Example 6.3.8 For the i.i.d. normal case of Example 6.3.3, find the MLEs of µ² and σ.
As µ̂ = x̄ and σ̂² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)², from the above theorem it follows that the MLE of µ² is given by x̄² and the MLE of σ is given by √( (1/n) Σᵢ₌₁ⁿ xᵢ² − x̄² ).
Before leaving this section, we will discuss some results that help us deal with estimators such as x(n) in Example 6.3.6 (Proposition 6.3.9). For X1, X2, . . . , Xn i.i.d. uniform(0, θ), the proposition states, among other things, that
(c) E[X(1)] = θ/(n + 1),
(d) E[X(n)] = nθ/(n + 1),
(e) V[X(1)] = V[X(n)] = nθ²/((n + 1)²(n + 2)).
The proof of Proposition 6.3.9 (except 3(a) and 3(b)) is left as an exercise.
An estimator T is said to be unbiased for τ(θ) if Eθ(T) = τ(θ) for all θ; the bias of T is bθ(T) = Eθ(T) − τ(θ).
Definition 6.4.2 The mean squared error, or MSE, of an estimator T of τ (θ) is defined as
Eθ [T − τ (θ)]2 .
The following proposition connects these concepts.
Proposition 6.4.3
M SEθ (T ) = Vθ (T ) + b2θ (T )
Proof:
M SEθ (T ) = Eθ [T − τ (θ)]2
= Eθ [T − Eθ (T ) + Eθ (T ) − τ (θ)]2
= Eθ [T − Eθ (T )]2 + (Eθ (T ) − τ (θ))2 + 2Eθ [(T − Eθ (T ))(Eθ (T ) − τ (θ))]
= Vθ (T ) + b2θ (T ) + 2(Eθ (T ) − τ (θ))Eθ [T − Eθ (T )]
= Vθ (T ) + b2θ (T )
Example 6.4.4 For the MLE’s of normal parameters in Example 6.3.3, find the bias and
the MSE.
Eµ(µ̂) = Eµ(X̄) = µ, so µ̂ is unbiased for µ and its MSE is the same as V(X̄) = σ²/n. It can be shown that
S² = (1/(n − 1)) ( Σᵢ₌₁ⁿ Xᵢ² − nX̄² )
is such that E(S²) = σ², so that S² is an unbiased estimator of σ². Thus the bias of σ̂² is given by
E(σ̂² − σ²) = ((n − 1)/n − 1) σ² = −σ²/n.
To calculate the MSE of σ̂², we need a result that we will not prove: V(S²) = 2σ⁴/(n − 1). The MSE of σ̂² can then be calculated by using Proposition 6.4.3:
MSE(σ̂²) = V(σ̂²) + b²(σ̂²)
         = ((n − 1)/n)² V(S²) + σ⁴/n²
         = ((n − 1)/n)² (2σ⁴/(n − 1)) + σ⁴/n²
         = (2n − 1) σ⁴/n².
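Both the bias and the MSE of σ̂² are easy to verify by simulation. This sketch is not from the notes; the sample size, σ² and number of replications are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(3)
    n, sigma2, reps = 10, 4.0, 200_000
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    sigma2_hat = samples.var(axis=1)        # MLE: divisor n
    print(sigma2_hat.mean() - sigma2, -sigma2 / n)                        # bias ~ -0.4
    print(np.mean((sigma2_hat - sigma2)**2), (2*n - 1) * sigma2**2 / n**2)  # MSE ~ 3.04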
We now introduce a concept called consistency. Consistency of an estimator is about
how close the estimator gets to the target as the sample becomes large.
The weak law of large numbers (WLLN) states that the convergence in probability holds, while the strong law of large numbers (SLLN) states that convergence with probability 1 holds.
Our next theorem is about consistency of maximum likelihood estimators. The proof
of the theorem requires that the family of density functions {fθ : θ ∈ Θ} of the random
variable satisfies a set of regularity conditions. Let Θ be the parameter space, assumed to be
an open subset of the real line (or Rn ). Let X = (X1 , X2 , . . . , Xn ) and x = (x1 , x2 , . . . , xn ).
1. The support of fθ , the set A = {x : fθ (x) > 0}, does not depend on θ.
2. For all x ∈ A and for all θ ∈ Θ, log fθ (x) is differentiable with respect to θ.
3. For any function h on Rn such that Eθ [|h(X)|] < ∞, the operations of integration
and differentiation with respect to θ can be interchanged in the integral expression for
Eθ [h(X)]. That is,
(∂/∂θ) ∫_{Rⁿ} h(x) L(x|θ) dx = ∫_{Rⁿ} h(x) (∂/∂θ) L(x|θ) dx        (6.1)
whenever the RHS of (6.1) is finite.
A sufficient condition for (6.1) to hold is that both ∫_{Rⁿ} h(x) (∂/∂θ) L(x|θ) dx and ∫_{Rⁿ} |h(x) (∂/∂θ) L(x|θ)| dx
are continuous functions of θ. We have a fairly general sufficient condition for all of the above
conditions to hold.
Example 6.4.7 We consider each of the familiar distributions and check if it is an expo-
nential family of distributions. Examples of exponential families are normal (with one or
two unknown parameters), multivariate normal, lognormal, exponential, gamma, Weibull, Pareto (with known lower bound), chi-squared, beta, Bernoulli, binomial (with the number of trials known), negative binomial (with a fixed number of failures), Poisson, geometric and Wishart. Student's t, Pareto with an unknown lower bound and the uniform distribution with unknown
bounds are examples of families that are not exponential.
1. Exponential: f(x|λ) = λe^{−λx} I[x > 0], so
L(x|λ) = λⁿ e^{−λ Σ xᵢ} I[xᵢ ≥ 0, i = 1, . . . , n].
Take A = [0, ∞)ⁿ, d(λ) = n log λ, S(x) = 0, c(λ) = −λ and T(x) = Σ xᵢ.
2. Beta: f(x|α, β) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1} I[0 ≤ x ≤ 1], so
l(x|α, β) = n log[Γ(α + β)/(Γ(α)Γ(β))] + (α − 1) Σ log xᵢ + (β − 1) Σ log(1 − xᵢ).
Take A = [0, 1]ⁿ, d(α, β) = n log[Γ(α + β)/(Γ(α)Γ(β))], S(x) = 0, c(α, β) = (α − 1, β − 1) and T(x) = (Σ log xᵢ, Σ log(1 − xᵢ)).
3. Normal (µ, 1): f(x|µ) = (1/√(2π)) exp[−.5(x − µ)²], so
l(x|µ) = −(n/2) log 2π − .5 Σ (xᵢ − µ)² = −(n/2) log 2π − .5 Σ xᵢ² − .5nµ² + µ Σ xᵢ.
Take A = Rⁿ, d(µ) = −(n/2) log 2π − .5nµ², S(x) = −.5 Σ xᵢ², c(µ) = µ and T(x) = Σ xᵢ.
4. Binomial (k, p) with k known: p(x|p) = C(k, x) p^x (1 − p)^{k−x}, so
l(x|p) = Σ log C(k, xᵢ) + Σ xᵢ log p + Σ (k − xᵢ) log(1 − p)
       = Σ log C(k, xᵢ) + kn log(1 − p) + log(p/(1 − p)) Σ xᵢ.
Take A = {0, 1, . . . , k}ⁿ, d(p) = kn log(1 − p), S(x) = Σ log C(k, xᵢ), c(p) = log(p/(1 − p)) and T(x) = Σ xᵢ.
Theorem 6.4.8 Under the regularity conditions on the density function f , the MLE from
an i.i.d. sample is strongly consistent. That is,
Pθ₀[ θ̂n → θ0 ] = 1.
Proof: The proof given here is just a sketch, as a rigorous treatment is rather involved. We want to maximize l(θ), which is the same as maximizing (1/n) l(θ) = (1/n) Σᵢ₌₁ⁿ log f(xᵢ|θ). By taking Y = log f(X|θ) and applying the SLLN to Y, we get that (1/n) l(θ) → Eθ₀[log f(X|θ)].
As (1/n) l(θ) is close to Eθ₀[log f(X|θ)], we conclude that the θ̂n that maximizes (1/n) l(θ) is close to the value of θ that maximizes Eθ₀[log f(X|θ)]. To find this value, we differentiate Eθ₀[log f(X|θ)] with respect to θ:
(∂/∂θ) Eθ₀[log f(X|θ)] = ∫ (∂/∂θ) log f(x|θ) f(x|θ0) dx = ∫ [∂f(x|θ)/∂θ] [f(x|θ0)/f(x|θ)] dx.
At θ = θ0 this becomes
∫ [∂f(x|θ)/∂θ]|_{θ=θ0} dx = (d/dθ) ∫ f(x|θ) dx |_{θ=θ0} = (d/dθ)(1) = 0,
so θ0 is a stationary point of Eθ₀[log f(X|θ)].
Proposition 6.4.9 If the density f of the random variable X satisfies the regularity con-
ditions, then
Eθ[ (∂/∂θ) log f(X|θ) ] = 0.
Proof:
Eθ[ (∂/∂θ) log f(X|θ) ] = ∫ (∂/∂θ) log f(x|θ) f(x|θ) dx
                        = ∫ ∂f(x|θ)/∂θ dx
                        = (d/dθ) ∫ f(x|θ) dx
                        = (d/dθ)(1)
                        = 0.
We now introduce a concept called the Fisher information. The Fisher information, or
simply information, measures the amount of information that an observable random variable
X carries about an unknown parameter θ. It is a function of θ, and defined as the variance
of the derivative of the logarithm of the density function.
Definition 6.4.10 If a random variable X has density function f that depends on a pa-
rameter θ, the information is given by
I(θ) = Eθ[ ( (∂/∂θ) log f(X|θ) )² ].
I(θ) is often referred to as the information number or the information function. From Proposition 6.4.9 it follows that I(θ) = Vθ( (∂/∂θ) log f(X|θ) ).
One of the reasons why the information function is important is the result that states that, for large values of n, the distribution of the MLE, when θ0 is the true value of the parameter θ, is approximately normal with mean θ0 and variance 1/(nI(θ0)). This large sample distribution is referred to as the asymptotic distribution of the estimator.
Example 6.4.13 Find the asymptotic distribution for the MLE of λ based on an i.i.d.
sample from a Poisson(λ) distribution.
We have already seen that the MLE is X. Now we shall calculate the information
function for λ. As log f(x|λ) = x log λ − λ − log x!,
I(λ) = −Eλ[ (∂²/∂λ²) log f(X|λ) ] = Eλ[ X/λ² ] = 1/λ.
Thus the asymptotic distribution of √n (X̄ − λ) is N(0, λ). But this we already knew from the central limit theorem.
Example 6.4.14 Find the asymptotic distribution for the MLE of θ based on an i.i.d.
sample from a Pareto(θ) distribution.
f(x|θ) = θa^θ / x^{θ+1}  ⇒  log f(x|θ) = log θ + θ log a − (θ + 1) log x
⇒  (d/dθ) log f(x|θ) = 1/θ + log a − log x
⇒  (d²/dθ²) log f(x|θ) = −1/θ²
⇒  I(θ) = 1/θ².
Thus the asymptotic distribution of √n (θ̂ − θ0) is N(0, θ0²).
When two estimators S and T of τ(θ) are compared, it is often the case that MSEθ(S) < MSEθ(T) for some values of θ and MSEθ(T) < MSEθ(S) for other values of θ. In this situation we clearly cannot decide which one is better using the MSE criterion.
Suppose instead that the estimators are such that MSEθ(T) ≤ MSEθ(S) for all θ, with MSEθ₀(T) < MSEθ₀(S) for some value θ0 of θ. Then it is clearly reasonable to consider T to be a better estimator than S. Such an estimator S is said to be inadmissible.
It is easy to see that there is no estimator that is best over the class of all possible estimators: the constant estimator S ≡ θ0 cannot be beaten by any other estimator at θ = θ0. For instance, the 'estimator' S ≡ 3 cannot be beaten by anything else when θ = 3. While it is useless everywhere else, at θ = 3 it measures the parameter exactly, and it has zero variance.
The moral of this is that there is no point in trying to minimize MSE or variance over
all possible estimators. The set of all estimators is too large. Instead, we restrict ourselves
to a smaller class of estimators – the class of unbiased estimators.
There are some very good reasons for restricting ourselves to unbiased estimators. First,
they do not consistently overestimate or underestimate. Second, it rules out ridiculous
estimates such as the constant estimators discussed earlier. Third, for unbiased estimators,
the MSE is exactly the same as the variance. Finally, it is the case that among unbiased
estimators, we can often find an estimator that is better than all other unbiased estimators.
When such a best estimator exists, it is called a uniformly minimum variance unbiased
estimator, or UMVUE.
Example 6.5.1 Let X1, X2, . . . , Xn be i.i.d. with mean θ and variance σ². Consider the following unbiased estimators of θ.
1. T1 = X1 with variance σ²
2. T2 = (X1 + X2)/2 with variance σ²/2
5. T5 = (X1 + X2 + · · · + Xn)/n with variance σ²/n
When we compare the variances of these estimators, we see that estimator T5 is the best among these. When all the aᵢ's are the same, T4 = T5.
An unbiased estimator T of τ(θ) is called a uniformly minimum variance unbiased estimator (UMVUE) if, for every other unbiased estimator S of τ(θ) and every θ, Vθ(T) ≤ Vθ(S).
A natural question to ask at this stage is whether there can be more than one UMVUE.
The answer is no, and is stated in the following theorem.
Theorem 6.5.3 UMVUE of τ (θ), if exists, is unique. That is, if S and T are UMVUEs of
τ (θ), then S = T .
Proof: As S and T are both UMVUEs, E(S) = E(T) = τ(θ) and V(S) = V(T). Let W = (S + T)/2. Clearly W is also an unbiased estimator of τ(θ). Note that
V(W) = (1/4) [ V(S) + V(T) + 2 Cov(S, T) ]
      ≤ (1/4) [ V(S) + V(T) + 2√(V(S)V(T)) ]
      = V(S).
The inequality comes from the fact that Corr(S, T ) ≤ 1. But S is a UMVUE, so the V (W )
cannot be strictly less than V (S), and hence the inequality for the covariance must be an
equality. This can happen only when one of the two random variables is a linear function
of the other. It follows that T = a(θ)S + b(θ) where a(θ) > 0. As V (S) = V (T ), a(θ) = 1,
and since E(S) = E(T ), it follows that b(θ) = 0. Thus S = T .
There are some problems with the notion of UMVUE.
1. Unbiased estimators may not exist.
2. Even if there are unbiased estimators, UMVUE may not exist.
3. Even when UMVUE exists, it may be inadmissible.
4. Unbiasedness is not invariant under non-linear transformations.
Apart from restricting the class of estimators considered to unbiased estimators, there are two other ways out of the dilemma that there is no best procedure. One is to average the MSE over the values of the parameter using some distribution on the parameter space (called a prior distribution) and to find the estimator that minimizes this average; this is the Bayes procedure. The other is to look at the maximum value of the MSE over all values of the parameter and to minimize this maximum; this is the minimax method. Both of these are discussed in detail in the Decision Theory chapter.
We have a theorem that helps us decide whether a given estimator is the UMVUE. This
theorem, due to Harald Cramér and C.R.Rao and known as the Cramér-Rao Inequality,
gives us a lower bound on the variance of an unbiased estimator.
so
E(TW) = ∫_{Rⁿ} T(x) [ Σᵢ₌₁ⁿ g(xᵢ) ] ∏ⱼ₌₁ⁿ f(xⱼ|θ) dx
      = ∫_{Rⁿ} T(x) (∂/∂θ) ∏ᵢ₌₁ⁿ f(xᵢ|θ) dx
      = (d/dθ) ∫_{Rⁿ} T(x) ∏ᵢ₌₁ⁿ f(xᵢ|θ) dx
      = (d/dθ) Eθ(T)
      = dτ(θ)/dθ.
As V( (∂/∂θ) log f(X|θ) ) = I(θ), it follows that V(W) = nI(θ). Since E(W) = 0, we have Cov(T, W) = E(TW) = dτ(θ)/dθ, and Cov(T, W)² ≤ V(T)V(W) then gives the Cramér–Rao inequality
V(T) ≥ [dτ(θ)/dθ]² / (nI(θ)).
1. The efficiency of T is defined to be the ratio of the Cramér–Rao lower bound to V(T). That is,
e(T) = [dτ(θ)/dθ]² / ( nI(θ) V(T) ).
Example 6.5.6 The MLE of µ for a normal (µ, σ²) distribution is µ̂ = X̄, and its variance is V(X̄) = σ²/n. From Example 6.3.3, we know that
∂l/∂µ = (1/σ²) Σᵢ₌₁ⁿ (xᵢ − µ)
and hence
∂²l/∂µ² = −n/σ².
Thus nI(µ) = n/σ², so the Cramér–Rao bound is 1/(nI(µ)) = σ²/n = V(µ̂), showing that µ̂ is efficient and hence is the UMVUE of µ.
Example 6.5.7 The MLE and the MME of θ for a uniform (0, θ) distribution are given by θ̂ = X(n) and θ̃ = 2X̄. From Proposition 6.3.9, the expectation of X(n) is nθ/(n + 1), so T1 = (n + 1)X(n)/n is an unbiased estimator of θ. T2 = 2X̄ is already an unbiased estimator. We shall compare their variances.
V(T2) = 4V(X̄) = θ²/(3n) (this follows from the fact that the variance of a uniform (a, b) random variable is (b − a)²/12), whereas V(T1) = θ²/(n(n + 2)) from Proposition 6.3.9. Here V(T1) is uniformly smaller than V(T2).
If we were to apply the Cramér–Rao inequality blindly, we would find that I(θ) = 1/θ² and conclude that V(T) ≥ θ²/n for every unbiased estimator T. The values found for V(T1) and V(T2) both contradict this, which shows that the Cramér–Rao inequality does not apply here. This is because, as a consequence of the support of the distribution depending on θ, the interchange of integration and differentiation with respect to θ does not work here.
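The variance comparison in Example 6.5.7 can be checked by simulation. This sketch is not from the notes; the values of n, θ and the number of replications are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(4)
    n, theta, reps = 10, 5.0, 200_000
    x = rng.uniform(0, theta, size=(reps, n))
    T1 = (n + 1) / n * x.max(axis=1)
    T2 = 2 * x.mean(axis=1)
    print(T1.mean(), T2.mean())                    # both ~ 5 (unbiased)
    print(T1.var(), theta**2 / (n * (n + 2)))      # ~ 0.208
    print(T2.var(), theta**2 / (3 * n))            # ~ 0.833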
6.6 Sufficiency
A sufficient statistic for a parameter θ is a statistic (a function of the data) that captures all
the information about θ contained in the sample. Once we have the value of the sufficient
statistic, any additional information in the sample does not contain any information about
θ. This means that if T (X) is a sufficient statistic, then any inference about θ should depend
on the sample X only through T (X).
Definition 6.6.1 A statistic T(X) is a sufficient statistic for a parameter θ if the conditional distribution of the sample X given the value of T(X) does not depend on θ.
Example 6.6.2 Let X1, X2, . . . , Xn be i.i.d. Bernoulli(p); we shall show that T(X) = Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for p.
fX|T(x|t) = fT|X(t|x) fX(x) / fT(t)
          = fX(x) I[Σ xᵢ = t] / [ C(n, t) pᵗ (1 − p)^{n−t} ]
          = p^{Σ xᵢ} (1 − p)^{n − Σ xᵢ} I[Σ xᵢ = t] / [ C(n, t) pᵗ (1 − p)^{n−t} ]
          = I[Σ xᵢ = t] / C(n, t),
which does not depend on p. Hence T(X) is sufficient for p.
Example 6.6.5 Let X1, X2, . . . , Xn be i.i.d. uniform(0, θ). Then L(x|θ) = (1/θⁿ) I[0 ≤ x(n) ≤ θ], so x(n) is a sufficient statistic for θ.
Definition 6.6.6 Let X be an i.i.d. sample from f (x|θ). A statistic T (x) is said to be
complete if Eθ [g(T )] = 0 for all θ implies that Pθ (g(T ) = 0) = 1 for all θ.
Theorem 6.6.7 If L(x|θ) can be expressed as e^{c(θ)T(x)+d(θ)+S(x)} I_A(x), that is, we have an exponential family of distributions, then T(x) is a complete sufficient statistic for θ. Consequently, if the density function f can be expressed as f(x) = e^{c(θ)t(x)+d1(θ)+S1(x)} I_B(x), where B ⊂ R, then T(x) = Σᵢ₌₁ⁿ t(xᵢ) is a complete sufficient statistic for θ.
The sufficiency part of Theorem 6.6.7 follows from Theorem 6.6.3 by taking g(t, θ) = e^{c(θ)t+d(θ)} and h(x) = e^{S(x)} I_A(x). Using these two theorems, we can find a sufficient statistic easily in most cases.
Example 6.6.8 Estimation of θ in the Poisson(θ) distribution: p(x|θ) = e^{−θ} θ^x / x! forms an exponential family with t(x) = x, and hence T(x) = Σᵢ₌₁ⁿ xᵢ is a complete sufficient statistic for θ.
Example 6.6.9 Estimation of θ in the Pareto(θ) distribution: the density is f(x|θ) = θa^θ / x^{θ+1} I[x ≥ a], where θ > 1 and a is a known constant. This forms an exponential family with t(x) = log x, and hence T(x) = Σᵢ₌₁ⁿ log xᵢ is a complete sufficient statistic for θ.
The concept of sufficiency is a useful tool in finding better unbiased estimators. The
following theorem, due to C.R. Rao and David Blackwell, tells us that we can get a uniformly
better unbiased estimator by conditioning an unbiased estimator with respect to a sufficient
statistic.
Vθ (W ) = Vθ [E (W |T )] + Eθ [V (W |T )] ≥ Vθ [E (W |T )] = Vθ (φ(T ))
Example 6.6.11 Suppose X1, X2, . . . , Xn are i.i.d. Poisson(λ). We will perform Rao–Blackwellization on an unbiased estimator of λ to improve it. Since E(X1) = λ, X1 is an unbiased estimator of λ. Also we know that T(X) = Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for λ. We will now find E(X1 | T). Note that since the sum of n i.i.d. Poisson(λ) random variables is Poisson(nλ),
fT(t) = e^{−nλ} (nλ)ᵗ / t!.
So
fX1|T(x|t) = fX1,T(x, t) / fT(t)
           = fT|X1(t|x) fX1(x) / fT(t)
           = P( Σᵢ₌₂ⁿ Xᵢ = t − x ) · [ e^{−λ} λˣ / x! ] / [ e^{−nλ} (nλ)ᵗ / t! ]
           = [ e^{−(n−1)λ} ((n − 1)λ)^{t−x} / (t − x)! ] · [ e^{−λ} λˣ / x! ] / [ e^{−nλ} (nλ)ᵗ / t! ]
           = [ t! / (x!(t − x)!) ] (n − 1)^{t−x} / nᵗ
           = C(t, x) (1/n)ˣ (1 − 1/n)^{t−x}.
Thus, given T = t, X1 has a binomial(t, 1/n) distribution, so E(X1 | T) = T/n = X̄. The Rao–Blackwellized estimator is therefore the sample mean X̄.
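The improvement from Rao–Blackwellization is easy to see by simulation. This sketch is not part of the notes; n, λ and the number of replications are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(5)
    n, lam, reps = 20, 3.0, 100_000
    x = rng.poisson(lam, size=(reps, n))
    crude = x[:, 0].astype(float)       # X1: unbiased but noisy
    rao_blackwell = x.mean(axis=1)      # E(X1 | sum) = Xbar
    print(crude.mean(), rao_blackwell.mean())   # both ~ 3 (unbiased)
    print(crude.var(), rao_blackwell.var())     # ~ 3 versus ~ 3/20 = 0.15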
Statistical Tables
Table 6.1: Normal Distribution
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
3.6 0.9998 0.9998 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
Table 6.2: t distribution
α
d.f. 0.200 0.100 0.050 0.025 0.010 0.005 0.001
1 1.376 3.078 6.314 12.706 31.821 63.656 318.289
2 1.061 1.886 2.920 4.303 6.965 9.925 22.328
3 0.978 1.638 2.353 3.182 4.541 5.841 10.214
4 0.941 1.533 2.132 2.776 3.747 4.604 7.173
5 0.920 1.476 2.015 2.571 3.365 4.032 5.894
6 0.906 1.440 1.943 2.447 3.143 3.707 5.208
7 0.896 1.415 1.895 2.365 2.998 3.499 4.785
8 0.889 1.397 1.860 2.306 2.896 3.355 4.501
9 0.883 1.383 1.833 2.262 2.821 3.250 4.297
10 0.879 1.372 1.812 2.228 2.764 3.169 4.144
11 0.876 1.363 1.796 2.201 2.718 3.106 4.025
12 0.873 1.356 1.782 2.179 2.681 3.055 3.930
13 0.870 1.350 1.771 2.160 2.650 3.012 3.852
14 0.868 1.345 1.761 2.145 2.624 2.977 3.787
15 0.866 1.341 1.753 2.131 2.602 2.947 3.733
16 0.865 1.337 1.746 2.120 2.583 2.921 3.686
17 0.863 1.333 1.740 2.110 2.567 2.898 3.646
18 0.862 1.330 1.734 2.101 2.552 2.878 3.610
19 0.861 1.328 1.729 2.093 2.539 2.861 3.579
20 0.860 1.325 1.725 2.086 2.528 2.845 3.552
21 0.859 1.323 1.721 2.080 2.518 2.831 3.527
22 0.858 1.321 1.717 2.074 2.508 2.819 3.505
23 0.858 1.319 1.714 2.069 2.500 2.807 3.485
24 0.857 1.318 1.711 2.064 2.492 2.797 3.467
25 0.856 1.316 1.708 2.060 2.485 2.787 3.450
26 0.856 1.315 1.706 2.056 2.479 2.779 3.435
27 0.855 1.314 1.703 2.052 2.473 2.771 3.421
28 0.855 1.313 1.701 2.048 2.467 2.763 3.408
29 0.854 1.311 1.699 2.045 2.462 2.756 3.396
30 0.854 1.310 1.697 2.042 2.457 2.750 3.385
31 0.853 1.309 1.696 2.040 2.453 2.744 3.375
32 0.853 1.309 1.694 2.037 2.449 2.738 3.365
33 0.853 1.308 1.692 2.035 2.445 2.733 3.356
34 0.852 1.307 1.691 2.032 2.441 2.728 3.348
35 0.852 1.306 1.690 2.030 2.438 2.724 3.340
40 0.851 1.303 1.684 2.021 2.423 2.704 3.307
50 0.849 1.299 1.676 2.009 2.403 2.678 3.261
60 0.848 1.296 1.671 2.000 2.390 2.660 3.232
70 0.847 1.294 1.667 1.994 2.381 2.648 3.211
80 0.846 1.292 1.664 1.990 2.374 2.639 3.195
∞ 0.841 1.282 1.645 1.960 2.326 2.576 3.091
Table 6.3: Chi-square distribution
α
d.f. 0.995 0.990 0.975 0.950 0.050 0.025 0.010 0.005
1 3.9E-05 0.00016 0.00098 0.00393 3.841 5.024 6.635 7.879
2 0.0100 0.0201 0.0506 0.103 5.991 7.378 9.210 10.597
3 0.072 0.115 0.216 0.352 7.815 9.348 11.345 12.838
4 0.207 0.297 0.484 0.711 9.488 11.143 13.277 14.860
5 0.412 0.554 0.831 1.145 11.070 12.832 15.086 16.750
6 0.676 0.872 1.237 1.635 12.592 14.449 16.812 18.548
7 0.989 1.239 1.690 2.167 14.067 16.013 18.475 20.278
8 1.344 1.647 2.180 2.733 15.507 17.535 20.090 21.955
9 1.735 2.088 2.700 3.325 16.919 19.023 21.666 23.589
10 2.156 2.558 3.247 3.940 18.307 20.483 23.209 25.188
11 2.603 3.053 3.816 4.575 19.675 21.920 24.725 26.757
12 3.074 3.571 4.404 5.226 21.026 23.337 26.217 28.300
13 3.565 4.107 5.009 5.892 22.362 24.736 27.688 29.819
14 4.075 4.660 5.629 6.571 23.685 26.119 29.141 31.319
15 4.601 5.229 6.262 7.261 24.996 27.488 30.578 32.801
16 5.142 5.812 6.908 7.962 26.296 28.845 32.000 34.267
17 5.697 6.408 7.564 8.672 27.587 30.191 33.409 35.718
18 6.265 7.015 8.231 9.390 28.869 31.526 34.805 37.156
19 6.844 7.633 8.907 10.117 30.144 32.852 36.191 38.582
20 7.434 8.260 9.591 10.851 31.410 34.170 37.566 39.997
21 8.034 8.897 10.283 11.591 32.671 35.479 38.932 41.401
22 8.643 9.542 10.982 12.338 33.924 36.781 40.289 42.796
23 9.260 10.196 11.689 13.091 35.172 38.076 41.638 44.181
24 9.886 10.856 12.401 13.848 36.415 39.364 42.980 45.558
25 10.520 11.524 13.120 14.611 37.652 40.646 44.314 46.928
26 11.160 12.198 13.844 15.379 38.885 41.923 45.642 48.290
27 11.808 12.878 14.573 16.151 40.113 43.195 46.963 49.645
28 12.461 13.565 15.308 16.928 41.337 44.461 48.278 50.994
29 13.121 14.256 16.047 17.708 42.557 45.722 49.588 52.335
30 13.787 14.953 16.791 18.493 43.773 46.979 50.892 53.672
40 20.707 22.164 24.433 26.509 55.758 59.342 63.691 66.766
50 27.991 29.707 32.357 34.764 67.505 71.420 76.154 79.490
60 35.534 37.485 40.482 43.188 79.082 83.298 88.379 91.952
70 43.275 45.442 48.758 51.739 90.531 95.023 100.425 104.215
80 51.172 53.540 57.153 60.391 101.879 106.629 112.329 116.321
100 67.328 70.065 74.222 77.929 124.342 129.561 135.807 140.170
Table 6.4: F distribution with α = 0.05
ν1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 161 199 216 225 230 234 237 239 241 242 243 244 245 245 246
2 18.5 19.0 19.2 19.2 19.3 19.3 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.4 19.4
3 10.1 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.76 8.74 8.73 8.71 8.70
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.94 5.91 5.89 5.87 5.86
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.70 4.68 4.66 4.64 4.62
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.03 4.00 3.98 3.96 3.94
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.60 3.57 3.55 3.53 3.51
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.31 3.28 3.26 3.24 3.22
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.10 3.07 3.05 3.03 3.01
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.94 2.91 2.89 2.86 2.85
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.82 2.79 2.76 2.74 2.72
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.72 2.69 2.66 2.64 2.62
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.63 2.60 2.58 2.55 2.53
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.57 2.53 2.51 2.48 2.46
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.51 2.48 2.45 2.42 2.40
Table 6.5: F distribution with α = 0.01
ν1
1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 4052 4999 5403 5625 5764 5859 5928 5981 6022 6056 6083 6106 6126 6143
2 98.5 99.0 99.2 99.2 99.3 99.3 99.4 99.4 99.4 99.4 99.4 99.4 99.4 99.4
3 34.1 30.8 29.5 28.7 28.2 27.9 27.7 27.5 27.3 27.2 27.1 27.1 27.0 26.9
4 21.2 18.0 16.7 16.0 15.5 15.2 15.0 14.8 14.7 14.5 14.5 14.4 14.3 14.2
5 16.3 13.3 12.1 11.4 11.0 10.7 10.5 10.3 10.2 10.1 9.96 9.89 9.82 9.77
6 13.7 10.9 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.79 7.72 7.66 7.60
7 12.2 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.54 6.47 6.41 6.36
8 11.3 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.73 5.67 5.61 5.56
9 10.6 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.18 5.11 5.05 5.01
10 10.0 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.77 4.71 4.65 4.60
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.46 4.40 4.34 4.29
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.22 4.16 4.10 4.05
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 4.02 3.96 3.91 3.86
14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 3.86 3.80 3.75 3.70
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.73 3.67 3.61 3.56
Table 6.6: Critical Values for the U test when α = 0.1
2 3 4 5 6 7 8 9 10 11 12 13 14 15
2 0 0 0 1 1 1 1 2 2 3 3
3 0 0 1 2 2 3 4 4 5 5 6 7 7
4 0 1 2 3 4 5 6 7 8 9 10 11 12
5 0 1 2 4 5 6 8 9 11 12 13 15 16 18
6 0 2 3 5 7 8 10 12 14 16 17 19 21 23
7 0 2 4 6 8 11 13 15 17 19 21 24 26 28
8 1 3 5 8 10 13 15 18 20 23 26 28 31 33
9 1 4 6 9 12 15 18 21 24 27 30 33 36 39
10 1 4 7 11 14 17 20 24 27 31 34 37 41 44
11 1 5 8 12 16 19 23 27 31 34 38 42 46 50
12 2 5 9 13 17 21 26 30 34 38 42 47 51 55
13 2 6 10 15 19 24 28 33 37 42 47 51 56 61
14 3 7 11 16 21 26 31 36 41 46 51 56 61 66
15 3 7 12 18 23 28 33 39 44 50 55 61 66 72
Table 6.7: Critical Values for the U test when α = 0.05
2 3 4 5 6 7 8 9 10 11 12 13 14 15
2 0 0 0 0 1 1 1 1
3 0 1 1 2 2 3 3 4 4 5 5
4 0 1 2 3 4 4 5 6 7 8 9 10
5 0 1 2 3 5 6 7 8 9 11 12 13 14
6 1 2 3 5 6 8 10 11 13 14 16 17 19
7 1 3 5 6 8 10 12 14 16 18 20 22 24
8 0 2 4 6 8 10 13 15 17 19 22 24 26 29
9 0 2 4 7 10 12 15 17 20 23 26 28 31 34
10 0 3 5 8 11 14 17 20 23 26 29 30 36 39
11 0 3 6 9 13 16 19 23 26 30 33 37 40 44
12 1 4 7 11 14 18 22 26 29 33 37 41 45 49
13 1 4 8 12 16 20 24 28 30 37 41 45 50 54
14 1 5 9 13 17 22 26 31 36 40 45 50 55 59
15 1 5 10 14 19 24 29 34 39 44 49 54 59 64
Table 6.8: Critical Values for the U test when α = 0.02
2 3 4 5 6 7 8 9 10 11 12 13 14 15
2 0 0 0
3 0 0 1 1 1 2 2 2 3
4 0 1 1 2 3 3 4 5 5 6 7
5 0 1 2 3 4 5 6 7 8 9 10 11
6 1 2 3 4 6 7 8 9 11 12 13 15
7 0 1 3 4 6 7 9 11 12 14 16 17 19
8 0 2 4 6 7 9 11 13 15 17 20 22 24
9 1 3 5 7 9 11 14 16 18 21 23 26 28
10 1 3 6 8 11 13 16 19 22 24 27 30 33
11 1 4 7 9 12 15 18 22 25 28 31 34 37
12 2 5 8 11 14 17 21 24 28 31 35 38 42
13 0 2 5 9 12 16 20 23 27 31 35 39 43 47
14 0 2 6 10 13 17 22 26 30 34 38 43 47 51
15 0 3 7 11 15 19 24 28 33 37 42 47 51 56
Table 6.9: Critical Values for the U test when α = 0.01
3 4 5 6 7 8 9 10 11 12 13 14 15
3 0 0 0 1 1 1 2
4 0 0 1 1 2 2 3 3 4 5
5 0 1 1 2 3 4 5 6 7 7 8
6 0 1 2 3 4 5 6 7 9 10 11 12
7 0 1 3 4 6 7 9 10 12 13 15 16
8 1 2 4 6 7 9 11 13 15 17 18 20
9 0 1 3 5 7 9 11 13 16 18 20 22 24
10 0 2 4 6 9 11 13 16 18 21 24 26 29
11 0 2 5 7 10 13 16 18 21 24 27 30 33
12 1 3 6 9 12 15 18 21 24 27 31 34 37
13 1 3 7 10 13 17 20 24 27 31 34 38 42
14 1 4 7 11 15 18 22 26 30 34 38 42 46
15 2 5 8 12 16 20 24 29 33 37 42 46 51
Table 6.10: Critical Values u0.025 for the Runs Test
2 3 4 5 6 7 8 9 10 11 12 13 14 15
2 2 2 2 2
3 2 2 2 2 2 2 2 2 2 3
4 2 2 2 3 3 3 3 3 3 3 3
5 2 2 3 3 3 3 3 4 4 4 4 4
6 2 2 3 3 3 3 4 4 4 4 5 5 5
7 2 2 3 3 3 4 4 5 5 5 5 5 6
8 2 3 3 3 4 4 5 5 5 6 6 6 6
9 2 3 3 4 4 5 5 5 6 6 6 7 7
10 2 3 3 4 5 5 5 6 6 7 7 7 7
11 2 3 4 4 5 5 6 6 7 7 7 8 8
12 2 2 3 4 4 5 6 6 7 7 7 8 8 8
13 2 2 3 4 5 5 6 6 7 7 8 8 9 9
14 2 2 3 4 5 5 6 7 7 8 8 9 9 9
15 2 3 3 4 5 6 6 7 7 8 8 9 9 10
Table 6.11: Critical Values u′0.025 for the Runs Test
4 5 6 7 8 9 10 11 12 13 14 15
4 9 9
5 9 10 10 11 11
6 9 10 11 12 12 13 13 13 13
7 11 12 13 13 14 14 14 14 15 15 15
8 11 12 13 14 14 15 15 16 16 16 16
9 13 14 14 15 16 16 16 17 17 18
10 13 14 15 16 16 17 17 18 18 18
11 13 14 15 16 17 17 18 19 19 19
12 13 14 16 16 17 18 19 19 20 20
13 15 16 17 18 19 19 20 20 21
14 15 16 17 18 19 20 20 21 22
15 15 16 18 18 19 20 21 22 22
Table 6.12: Critical Values u0.005 for the Runs Test
3 4 5 6 7 8 9 10 11 12 13 14 15
3 2 2 2 2
4 2 2 2 2 2 2 2 3
5 2 2 2 2 3 3 3 3 3 3
6 2 2 2 3 3 3 3 3 3 4 4
7 2 2 3 3 3 3 4 4 4 4 4
8 2 2 3 3 3 3 4 4 4 5 5 5
9 2 2 3 3 3 4 4 5 5 5 5 6
10 2 3 3 3 4 4 5 5 5 5 6 6
11 2 3 3 4 4 5 5 5 6 6 6 7
12 2 2 3 3 4 4 5 5 6 6 6 7 7
13 2 2 3 3 4 5 5 5 6 6 7 7 7
14 2 2 3 4 4 5 5 6 6 7 7 7 8
15 2 3 3 4 4 5 6 6 7 7 7 8 8
Table 6.13: Critical Values u′0.005 for the Runs Test
5 6 7 8 9 10 11 12 13 14 15
5 11
6 11 12 13 13
7 13 13 14 15 15 15
8 13 14 15 15 16 16 17 17 17
9 15 15 16 17 17 18 18 18 19
10 15 16 17 17 18 19 19 19 20
11 15 16 17 18 19 19 20 20 21
12 17 18 19 19 20 21 21 22
13 17 18 19 20 21 21 22 22
14 17 18 19 20 21 22 23 23
15 19 20 21 22 22 23 24
Table 6.14: Critical Values for the Signed-rank test