Thinking With Data
Preface
This book comprises a collection of lecture notes for the statistics component of the course Psychology 2030: Methods and Statistics from the Department of Psychology at the University of Lethbridge. In addition
to basic statistical methods, the book includes discussion of many other useful
topics. For example, it has a section on writing in APA format (see Chapter
17), and another on how to read the professional psychological literature (see
Chapter 16). We even provide a subsection on the secret to living to be 100 years
of age (see section A.2.2)—although the solution may not be fully satisfactory!
Despite this volume comprising the fourth edition of the book, it is still very
much a work in progress, and is by no means complete. However, despite its
current limitations, we expect that students will find it to be a useful adjunct to
the lectures. We welcome any suggestions on additions and improvements to the
book, and, of course, the report of any typos and other errors.1 Please email
any such errors or corrections to: vokey@uleth.ca or allens@uleth.ca.
The book is produced at the cost of printing and distribution, and will evolve
and change from semester to semester, so it will have little or no resale value;
the latest version is also always available on the web as a portable document
format (pdf) file at: http://people.uleth.ca/~vokey/pdf/thinking.pdf.
We expect the book to be used, or better, used up over the course. In the
hands of the student, this volume is intended to become the private statistics
diary of the student/reader. To that end, we have printed the book with especially
wide, out-side margins, intended to be the repository by the student/reader of
further notes and explanatory detail obtained from lecture, private musings, and,
as beer is inextricably and ineluctably tied to statistics, pub conversation over a
pint or two.
1 For example, the Fall, 2005 printing of the 4th edition corrected a series of typos and
errors that previous students kindly pointed out to us. This Fall, 2007 printing contains even
more corrections, again reported by our sharp-eyed students. Thank you to all!
Contents
Preface v
1 Introduction 1
1.1 Sir Carl Friedrich Gauss (1777-1855) . . . . . . . . . . . . . . . . 1
1.2 Summing any Constant Series . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Definition of a constant series . . . . . . . . . . . . . . . . 3
1.2.2 What about constant series with c > 1? . . . . . . . . . . 3
1.3 What’s the Point? . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
I Description 7
2 The Mean (and related statistics) 9
2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 The mean is a statistic . . . . . . . . . . . . . . . . . . . . 10
2.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 The mean is that point from which the sum of deviations
is zero. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 The mean as a balancing-point . . . . . . . . . . . . . . . 12
2.2.3 The mean is the point from which the sum of squared
deviations is a minimum. . . . . . . . . . . . . . . . . . . 12
2.2.4 And the Mean is . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.5 The Method of Provisional Means . . . . . . . . . . . . . 15
2.3 Other Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 The Absolute Mean . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Root-Mean-Square . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Geometric Mean . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.4 Harmonic Mean . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 The Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Definition of the Median . . . . . . . . . . . . . . . . . . . 19
2.4.2 A complication . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 Properties of the Median . . . . . . . . . . . . . . . . . . 21
2.5 Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Measures of Variability 25
3.1 The Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 The D^2 and D statistics . . . . . . . . . . . . . . . . . . . . . 26
3.3 Variance (S^2) and Standard Deviation (S) . . . . . . . . . . . . 26
3.4 Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4 Transformed Scores 35
4.1 The Linear Transform . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.1 Rules for changing \bar{X}, S_X and S_X^2 . . . . . . . . . . . 35
Adding a constant to every score . . . . . . . . . . . . . . 36
Multiplying each score by a constant . . . . . . . . . . . . 37
4.2 The Standard Score Transform . . . . . . . . . . . . . . . . . . . 38
4.2.1 Properties of Standard Scores . . . . . . . . . . . . . . . . 38
4.2.2 Uses of Standard Scores . . . . . . . . . . . . . . . . . . . 39
4.3 Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
7 Correlation 53
7.1 Pearson product-moment correlation
coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.1.1 Sums of Cross-Products . . . . . . . . . . . . . . . . . . . 54
7.1.2 Sums of Differences . . . . . . . . . . . . . . . . . . . . . 56
What values can ASD take? . . . . . . . . . . . . . . . . . 57
7.2 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.3 Other Correlational Techniques . . . . . . . . . . . . . . . . . . . 58
7.3.1 The Spearman Rank-Order Correlation Coefficient . . . . 60
Tied Ranks . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.3.2 The Point-Biserial Correlation Coefficient . . . . . . . . . 61
7.3.3 And Yet Other Correlational Techniques . . . . . . . . . . 62
7.4 Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8 Linear Regression 65
8.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
8.1.1 The Regression Equation: Z'_{Y_i} = r_{xy} Z_{X_i} . . . . . . . . 67
8.1.2 From standard scores to raw scores . . . . . . . . . . . . . 68
8.1.3 Correlation and Regression: r_{xy} = r_{y'y} . . . . . . . . . . 69
II Significance 89
10 Introduction to Significance 91
10.1 The Canonical Example . . . . . . . . . . . . . . . . . . . . . . . 94
10.1.1 Counting Events . . . . . . . . . . . . . . . . . . . . . . . 94
The fundamental counting rule . . . . . . . . . . . . . . . 94
Permutations . . . . . . . . . . . . . . . . . . . . . . . . . 95
Combinations . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.1.2 The Example . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.1.3 Permutation/Exchangeability Tests . . . . . . . . . . . . . 97
10.2 Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Details, details . . . . . . . . . . . . . . . . . . . . . . . . 168
17.0.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
17.0.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
17.0.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
17.0.8 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
17.0.9 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
17.0.10 Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
17.0.11 Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . 171
IV Appendices 173
A Summation 175
A.1 What is Summation? . . . . . . . . . . . . . . . . . . . . . . . . . 175
A.1.1 The Summation Function . . . . . . . . . . . . . . . . . 175
A.1.2 The Summation Operator: Σ . . . . . . . . . . . . . . . 176
A.2 Summation Properties . . . . . . . . . . . . . . . . . . . . . . . . 176
A.2.1 Summation is commutative . . . . . . . . . . . . . . . . . 177
A.2.2 Summation is associative . . . . . . . . . . . . . . . . . . 177
The secret to living to 100 years of age . . . . . . . . . . . 178
A.3 Summation Rules . . . . . . . . . . . . . . . . . . . . . . . . . . 178
A.3.1 Rule 1: Σ(X + Y) = ΣX + ΣY . . . . . . . . . . . . . . 178
A.3.2 Rule 2: Σ(X − Y) = ΣX − ΣY . . . . . . . . . . . . . . 179
A.3.3 Rule 3: Σ(X + c) = ΣX + nc . . . . . . . . . . . . . . . 179
A.3.4 Rule 4: \sum_{i=1}^{n} c = nc . . . . . . . . . . . . . . . 180
A.3.5 Rule 5: ΣcX = cΣX . . . . . . . . . . . . . . . . . . . . 180
A.4 Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
References 191
Index 192
List of Figures
2.1 The mean is that point from which the sum of deviations is zero 11
2.2 The mean minimises the sum of squares . . . . . . . . . . . . . . 13
3.1 A map of distances between towns and cities along the Trans-
Canada highway. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
9.1 The label from one of the brands of beer produced by P. Ballantine
& Sons depicting the overlapping circle logo. . . . . . . . . . . . 84
9.2 A Euler diagram or “Ballantine” of the example correlations from
Table 9.2. The area of overlap of any one circle with another
represents the squared correlation between the two variables. . . 85
9.3 The same figure as in Figure 9.2, except shaded to depict the area
of interest. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
List of Tables
15.1 Hypothetical values for three observations per each of three groups. 152
15.2 Sums of squares for the hypothetical data from Table 15.1. . . . 153
15.3 Critical values of the F -distribution . . . . . . . . . . . . . . . . 157
Chapter 1

Introduction

As an old man, the eminent mathematician, Sir Carl Friedrich Gauss (see Figure 1.1) enjoyed relating the story of his first day in his first class in arithmetic at the age of 10. His teacher, Herr Büttner, was a brutish
man who was a firm believer in the pedagogical advantages of thrashing the boys
under his tutelage. Having been their teacher for many years, Büttner knew
that none of the boys had ever heard about arithmetic progressions or series.
Accordingly, and in the fashion of bullies everywhere and everywhen who press
every temporary advantage to prove their superiority, he set the boys a very
long problem of addition that he, unbeknownst to the boys, could answer in seconds
with a simple formula.
The problem was of the sort of adding up the numbers 176 + 195 + 214 +
. . . + 2057, for which the step from each number to the next is the same (here,
19), and a fixed number of such numbers (here 100) are to be added. As was the
tradition of the time, the students were instructed to place their slates1 on a
table, one slate on top of the other in order, as soon as each had completed
the task. No sooner had Büttner finished the instructions than Gauss placed his
slate on the table, saying “There it lies.”2 Incredulous (or so it goes), the teacher
looked at him scornfully (but, no doubt, somewhat gleefully as he anticipated
the beating he would deliver to Gauss for what surely must be a wrong answer)
while the other students continued to work diligently for another hour. Much
later, when the teacher finally checked the slates, Gauss was the only student to
have the correct answer (Bell, 1937, pp. 221–222).3
1 Yes, in the late 18th century, slates, not paper, were used by schoolchildren.
2 Actually, as the son of a poor, German family, he no doubt said it in his peasant German
dialect: “Ligget se”.
3 Büttner was so impressed with the obvious brilliance of Gauss that at least for Gauss he became a humane and caring teacher. Out of his own limited resources, Büttner purchased the best book on arithmetic of the day and gave it to Gauss. He also introduced the 10-year-old Gauss to his young assistant, Martin Bartels, with whom Gauss continued his study of mathematics.
How did he do it? For ease of exposition, let us assume the simpler task of
summing the integers from 1 to 100 inclusive. As the story goes, Gauss simply
imagined the sum he sought, say S, as being written simultaneously in both
ascending and descending order:
S = 1 + 2 + 3 + · · · + 98 + 99 + 100
S = 100 + 99 + 98 + · · · + 3 + 2 + 1
Then, instead of adding the numbers horizontally across the rows, he added
them vertically:
S + S = (1 + 100) + (2 + 99) + (3 + 98) + · · · + (98 + 3) + (99 + 2) + (100 + 1)
or

2S = 100 × 101 = 10100

And if 2S = 10100, then S must equal 10100/2 = 5050!
1.2 Summing any Constant Series

1.2.1 Definition of a constant series

A constant series is any set of numbers, X, for which

X_{i+1} - X_i = c \qquad (1.1)

for all X_i. That is, the difference between any number in the set, X_i, and the next number, X_{i+1}, is a constant, c. For example, X = {1, 2, 3, · · · , 6} is a series by this definition with c = 1. So is X = {1, 3, 5, 7}, except that c = 2. In general, with the set, X, defined as in equation 1.1, we can ask: what is the sum of X?
Following the 10-year old Gauss, we note that the sum across each of the
pairs formed by lining up the ascending series with its descending counterpart
is always a constant, k, for any constant series. In the case of summing the
numbers from 1 to 100, k = 101. As any of the pairs thus formed for a given set
yields the same sum, k, we can define k = X_1 + X_n, that is, as the sum of the
first and last numbers of the set. We also note that in doing the pairing, we have
used each number in the pair twice. That is, using n to indicate the number of
numbers in the series, in general, nk = twice the desired sum. Therefore,
X_1 + X_2 + \ldots + X_n = n\left(\frac{X_1 + X_n}{2}\right) \qquad (1.2)
For example, applying equation 1.2 to obtain the sum of the series, X = {4, 7, · · · , 31}, we note that c = 3 and that, therefore, n = (31 − 4)/3 + 1 = 10. As k = X_1 + X_n = 31 + 4 = 35, and k/2 = 35/2 = 17.5, then the sum equals 10(17.5) = 175. For the sum of the earlier series, 176 + 195 + 214 + . . . + 2057, c = 195 − 176 = 19, so n = (2057 − 176)/19 + 1 = 100, k = 2057 + 176 = 2233, so the sum is 100(2233/2) = 111650.
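To make the formula concrete, here is a minimal computational sketch (in Python; the function name sum_constant_series is ours, not the book's):

    # A sketch of equation 1.2: the sum of a constant series running
    # from first to last in steps of c.
    def sum_constant_series(first, last, c):
        n = (last - first) // c + 1   # number of terms in the series
        k = first + last              # the constant pair-sum, k = X1 + Xn
        return n * k / 2

    print(sum_constant_series(4, 31, 3))       # 175.0
    print(sum_constant_series(176, 2057, 19))  # 111650.0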
1.3 What’s the Point?

Each chapter ends with a set of questions (see Appendix C), so that the student may explore each of the concepts in a concrete way.
But the story of Gauss’s childhood insight highlights another important
point. These useful properties are properties of numbers, and, as such apply to
numbers—any and all numbers. These properties are not properties of what
the numbers on any given occasion refer to, or mean. Whether it makes sense
to apply one of these techniques to a given set of numbers is not a question
of mathematics. Gauss’s insight provides the sum of any constant series of
numbers whether it makes sense (in terms of what the numbers refer to) to take
such a sum or not. All of the statistical techniques we discuss are properties of
the numbers themselves. Extension of these properties to the referents of the
numbers requires an inference that is not mathematical in nature. Part of the
polemic of this book is to make this distinction and its consequences clear.
1.4 Questions
1. What is the arithmetic average of the series X = {11, 19, 27, . . . , 75}?
Part I

Description

Chapter 2

The Mean (and related statistics)
2.1 Definition
A mean is a normalised sum; which is to say that a mean of a set of scores is just the sum of that set of scores corrected for the number
scores is just the sum of that set of scores corrected for the number
of scores in the set. Normalising the sums in this way allows for the
direct comparison of one normalised sum with another without having to worry
about whether the sums differ (or not) from each other simply because one sum
is based on more numbers than the other. One sum, for example, could be
larger than another simply because it is based on many more numbers rather
than because it was derived from, in general, larger numbers than another sum.
Expressed as means, however, any difference between two sums must directly
reflect the general or “average” magnitude of the numbers on which they are
based.
You probably know the mean as the arithmetic average: the sum of the
scores divided by the number of scores. In summation notation,1 the mean can
be defined as:
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad (2.1)
As shown in equation 2.1, the symbol for the mean of a set of scores is the name of the set of scores with an overbar; if the name of the set of scores is X, then the symbol of the mean of that set is \bar{X}. If, on the other hand, the name of the set were “age” or “height” or Z, say, then the respective means would be labelled as \overline{age}, \overline{height}, and \bar{Z}.
1 See Appendix A for a review of summation and the summation operator, Σ.
2.2 Properties
Although no doubt often thought of contemptuously given its familiarity,2 the
mean has useful properties not shared with any other single-valued summary of
the data.
2.2.1 The mean is that point from which the sum of deviations is zero.
This property equally could have been used as the definition of the mean. That is,
we could have defined the mean as that value from which the sum of deviations is
zero and then derived equation 2.1 from it as, say, the “calculation” formula. Not
that we couldn’t use this property as a method to find the mean of a distribution
of numbers; it just wouldn’t be very efficient. Still, if you have found a value
that just happens to result in the sum of deviations from it being zero, you know
that that value equals the mean by definition.
This property is probably most easily grasped graphically, as shown in Figure
2.1. For any distribution of scores, a plot of the sum of deviation scores as a
function of different values or “guesses” about what the mean might be results
in a diagonal line running from positive sums on the left for guesses less than
the mean to negative sums on the right for guesses greater than the mean. The
“guess” corresponding to the point at which the line crosses a sum of zero is the
mean of the distribution.
What this property means is that any other value whatsoever must result in
a sum of deviations from it that is different from zero. Expressed in summation
notation,
\sum_{i=1}^{n} (X_i - \bar{X}) = 0 \qquad (2.2)
Figure 2.1: The mean is that point from which the sum of deviations is zero.
In this example, which plots the sum of deviations as a function of different
“guesses” for the mean, X = {1, 2, 3, 4, 5, 6, 7, 8, 9} and \bar{X} = 5.
First,

\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \bar{X}

As \sum_{i=1}^{n} c = nc, then

\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - n\bar{X}

But, \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, so

\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - n\frac{\sum_{i=1}^{n} X_i}{n} = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} X_i = 0
Table 2.1: Demonstration using summation notation that the mean is that point
from which the sum of deviations is zero.
2.2.3 The mean is the point from which the sum of squared deviations is a minimum.
As with the previous property, this property also could have served as the
definition of the mean. That is, we could have used this property to define the
mean and then derived both equations 2.1 and 2.2 from it. Similarly, again as
with the previous property, this property of the mean is probably most easily
understood graphically. If the sum of squared deviations is plotted as a function
of different values or guesses for the mean, a parabola or bowl-shaped curve
is produced. The guess corresponding to the exact bottom of this bowl, the
minimum of the function, would be the mean (see Figure 2.2).
Using Y as any value whatsoever, this property is expressed in summation
notation as:
\sum_{i=1}^{n} (X_i - \bar{X})^2 \le \sum_{i=1}^{n} (X_i - Y)^2 \qquad (2.3)
Another way of describing this property of the mean is to say that the mean
minimises the sum of squares.
Demonstrating this property is a bit more difficult than it was for the previous
one. What follows may be conveniently ignored if one is willing to accept the
property without proof.
One relatively simple method for the appropriately initiated is to use the differential calculus and, setting the first derivative of \sum_{i=1}^{n} (X_i - \bar{X})^2 to zero, derive the definition formula for the mean. Rather than assume the requisite
calculus background, however, we provide an algebraic proof that relies on
a technique known as reductio ad absurdum. Reductio ad absurdum means
to reduce to an absurdity, and works as a proof by first assuming the direct
3 It is conventional to denote deviation scores by the lower-case version of the name of the variable; for example, the deviation scores of X are x_i = X_i − \bar{X}.
Figure 2.2: The mean minimises the sum of squares. In this example, which
plots the sum of squared deviations as a function of different “guesses” of the
mean, X = {1, 2, 3, 4, 5, 6, 7, 8, 9}, \bar{X} = 5, and the minimum sum of squared
deviations is 60. Note that this minimum occurs only at the mean.
4 For reductio ad absurdum to work as a mathematical proof requires that the system be
consistent (i.e., not contain contradictions among its premises or axioms); otherwise, reasoning
to an absurdity or contradiction may reflect nothing more than the inherent inconsistency of
the system.
We start, then, by assuming that \sum_{i=1}^{n} (X_i - \bar{X})^2 is not the minimum sum; that some other value, either larger or smaller than the mean, substituted for the mean, will produce a sum of squared deviations from it lower than that obtained with the mean. We represent this value as \bar{X} + c, where c is either positive or negative. Then, our contradictory assumption amounts to the statement that:

\sum_{i=1}^{n} (X_i - \bar{X})^2 > \sum_{i=1}^{n} (X_i - (\bar{X} + c))^2
Which yields

\sum_{i=1}^{n} X_i^2 - 2\frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n} + \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n} > \sum_{i=1}^{n} X_i^2 - 2(\bar{X} + c)\sum_{i=1}^{n} X_i + \sum_{i=1}^{n} \left(\bar{X}^2 + 2\bar{X}c + c^2\right)

and

\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n} > \sum_{i=1}^{n} X_i^2 - 2(\bar{X} + c)\sum_{i=1}^{n} X_i + n\bar{X}^2 + 2n\bar{X}c + nc^2

resulting in

\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n} > \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n} + nc^2

As nc^2 must be greater than zero for any non-zero value of c, the final statement is an absurdity: a quantity cannot be greater than itself plus a positive number. The contradictory assumption must therefore be false, and no value other than the mean can yield a smaller sum of squared deviations.
Table 2.2: Demonstration that the sum of squared deviations from the mean is
a minimum.
\sum_{i=1}^{n} X_i = 1 + 2 + 3 + 4 + 5 = 3 + 3 + 3 + 3 + 3 = n\bar{X} = 5 \times 3 = 15. It is in this
sense that the mean of X can truly be considered to be representative of the
numbers in the set: it is the only number that can be substituted for each and
every number in the set and still result in the same sum. As an aside, it is also
for this reason that the mean is frequently substituted for missing data in a set
of numbers.
which means that the mean of n scores may be re-written in terms of the mean
for the first n − 1 scores as:
\bar{X}_n = \frac{(n-1)\bar{X}_{n-1} + X_n}{n}

A little algebra yields:

\bar{X}_n = \bar{X}_{n-1} + \frac{X_n - \bar{X}_{n-1}}{n} \qquad (2.4)
a recursive equation for calculating the mean (see Spicer, 1972; Vokey, 1990).
That is, the mean for the first n scores may be calculated from the mean for the
first n − 1 scores plus the deviation of the nth score from the just previous mean
divided by n.
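Equation 2.4 is easy to verify computationally; a minimal sketch (in Python; the function name provisional_mean is ours):

    # The method of provisional means (equation 2.4): update a running
    # mean with each new score instead of accumulating a large sum.
    def provisional_mean(scores):
        mean = 0.0
        for n, x in enumerate(scores, start=1):
            mean += (x - mean) / n   # mean_n = mean_{n-1} + (X_n - mean_{n-1})/n
        return mean

    print(provisional_mean([1, 2, 3, 4, 5]))   # 3.0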
For the set of numbers, X = {−10, −9, 6, 9, 4}, the mean, \bar{X} = 0, but the absolute mean, \bar{X}_{abs} = 38/5 = 7.6, which is noticeably closer in absolute magnitude
to the numbers of the set. For example, say you have a choice between two
poker games and you want to know how high the stakes are in each. One
answer to that question is to compute the average amount of money that changes
hands in each game. For game X, imagine that the winnings and losses of
the four players were X = {−10, 30, −20, 0}, \bar{X} = 0, and for game Y, they were Y = {−1000, 3000, −2000, 0}, \bar{Y} = 0. Using the arithmetic mean, then,
the two games appear equivalent on average (i.e., both are zero-sum games).
However, comparing the absolute means, \bar{X}_{abs} = (10 + 30 + 20 + 0)/4 = $15 vs. \bar{Y}_{abs} = (1000 + 3000 + 2000 + 0)/4 = $1500, reveals that the stakes in game Y are much higher. The absolute mean maintains the 100:1 ratio of the sizes of the numbers and is thus more informative for you. That’s not to say that the
arithmetic mean of 0 in each case is wrong, it just doesn’t tell you what you want
to know—and remember, the whole point of a statistic is to tell you something
useful.
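A short sketch of the comparison (in Python; abs_mean is our name for the absolute mean the text describes, i.e., the mean of the absolute values):

    # Contrasting the arithmetic mean with the absolute mean for the
    # two poker games described above.
    def abs_mean(scores):
        return sum(abs(x) for x in scores) / len(scores)

    game_x = [-10, 30, -20, 0]
    game_y = [-1000, 3000, -2000, 0]
    print(sum(game_x) / len(game_x), sum(game_y) / len(game_y))  # 0.0 0.0
    print(abs_mean(game_x), abs_mean(game_y))                    # 15.0 1500.0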
2.3.2 Root-Mean-Square
Another common approach is to square the numbers to remove the sign, compute
the mean squared value, and then take the square-root of the mean-square to
return to the original units. This value is the known as the root-mean-square
or r.m.s. To compute it, you follow the name in reverse (i.e., square each score,
take the mean of the squared scores, and then take the square-root of the mean).
In summation notation:

\bar{X}_{rms} = \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n}} \qquad (2.6)
For example, for the same set of numbers, X = {−10, −9, 6, 9, 4}, \bar{X}_{rms} = \sqrt{314/5} = 7.92, which is also noticeably closer in absolute magnitude to the numbers of the set. In fact, except where all the numbers are the same, \bar{X}_{rms} > \bar{X}_{abs}. That is, in general, \bar{X}_{rms} is always a little larger than \bar{X}_{abs}.
\bar{X}_{rms} often is used in situations in which the data tend to move back and
forth, like a sine wave, AC current, loudspeakers producing sound (hence, its
use in stereo system specifications, e.g., 100 watts rms). Put a kid on a swing
and measure the endpoints of the swing as it goes back and forth. You’ll
get something like: X = {10, −10, 11, −11, 9, −9} for going back and forth
3 times; compare that to: Y = {5, −5, 5.5, −5.5, 4.5, −4.5}. Who has swung
farther? The arithmetic mean for both is 0, which is not informative, as with
the aforementioned poker games. However,
\bar{X}_{rms} = \sqrt{(10^2 + (−10)^2 + 11^2 + (−11)^2 + 9^2 + (−9)^2)/6} = 10.033

and

\bar{Y}_{rms} = \sqrt{(5^2 + (−5)^2 + 5.5^2 + (−5.5)^2 + 4.5^2 + (−4.5)^2)/6} = 5.016
so clearly \bar{X}_{rms} captures the cyclic peaks (and, hence, the power in the swing or the loudness of the speakers) more meaningfully than does the arithmetic mean.
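A sketch of the computation (in Python; rms is our name):

    # The root-mean-square (equation 2.6) applied to the swing example.
    def rms(scores):
        return (sum(x ** 2 for x in scores) / len(scores)) ** 0.5

    swings_x = [10, -10, 11, -11, 9, -9]
    swings_y = [5, -5, 5.5, -5.5, 4.5, -4.5]
    print(rms(swings_x))   # 10.033...
    print(rms(swings_y))   # 5.016...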
2.3.3 Geometric Mean

Yet another approach is the geometric mean, the nth root of the product of the n scores:

\bar{X}_G = \sqrt[n]{\prod_{i=1}^{n} X_i} \qquad (2.7)

The Π operator denotes the product of the operands in the same way that the Σ symbol denotes the sum. Equation 2.7 says that to compute the geometric mean, the scores are cumulatively multiplied together, and then the nth root of the resulting product is taken. For example, for X = {10, 100, 1000}, \bar{X}_G = 1000000^{1/3} = 100, the middle number of the geometric series. Exactly the same
result would have been obtained if we had first log-transformed the data (say,
to base 10, although the actual base used is not critical), to produce the scores
log10 (X) = {1, 2, 3}, and then had taken the anti-log to the same base of the
arithmetic mean of the transformed scores (i.e., 102 = 100). If the data are not
a geometric series, the geometric mean of the data still returns the (anti-log of
the) arithmetic mean of the log-transformed data; in effect, it removes the base
of the data from the computation of the mean.
Think of it this way. Imagine rewriting each score, X_i, as a constant, b, raised to some exponent, x_i, i.e., X = \{b^{x_1}, b^{x_2}, \ldots, b^{x_n}\}; recovering the exponents is precisely what a \log_b-transform does. The geometric mean is simply that same constant raised to the arithmetic average of the exponents, i.e., \bar{X}_G = b^{\bar{x}}.
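A sketch of the two equivalent routes to the geometric mean (in Python; function names ours):

    # The geometric mean two ways: as the nth root of the product, and
    # as the anti-log of the mean of the log-transformed scores.
    import math

    def geometric_mean(scores):
        return math.prod(scores) ** (1 / len(scores))

    def geometric_mean_via_logs(scores, base=10):
        mean_log = sum(math.log(x, base) for x in scores) / len(scores)
        return base ** mean_log

    x = [10, 100, 1000]
    print(geometric_mean(x))           # ~100.0
    print(geometric_mean_via_logs(x))  # ~100.0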
The geometric mean is commonly used in such fields as economics, so examples
of its use tend to come from that domain. Here is a simple example that
emphasises the distinction between the arithmetic and the geometric means.
Imagine the following game. There are 10 people in a room, each asked to guess a
different number between 1 and 100; the person with the guess most distant from
a randomly chosen number for that trial is asked to leave with the cash prize for
that trial, and the prize doubles for those remaining. Let’s say that the prize starts
at $2. Thus, the first person to drop out wins $2, and the last person remaining wins $2^{10} = $1024. What is the average amount won? Using the arithmetic mean,
the value would be (2+4+8+16+32+64+128+256+512+1024)/10 = $204.6—
among the higher numbers (i.e., it is biased toward the larger values), whereas,
the geometric mean comes to $45.25, midway between the bottom-half and
top-half of the scores.
2.3.4 Harmonic Mean

Imagine travelling 200 km to Calgary. For the first quarter of the distance
of the trip you travel 100 km/h, and then for the next quarter of the distance
because of highway construction, you are slowed to 50 km/h (speed fines double!);
in the next quarter, you get to travel at the full 110 km/h allowed for by highway
2. Unfortunately, as you encounter the last quarter of the distance to Calgary,
the inevitable traffic jams, stalled cars and trucks, and the general freeway
insanity that is Calgary, slows you to 10 km/h (and you swear you will never go
to Calgary again). Given all that, when you finally arrive, your friendly, Calgary
host asks you innocently, what was your average speed? You might think that
the appropriate answer is just the arithmetic mean of X = {100, 50, 110, 10},
or X = 67.5—that is, you averaged 67.5 km/h. But note, the trip took 30
minutes to cover the first 50 km, 60 minutes to cover the second 50 km, 27.27
minutes to cover the third 50 km, and 300 minutes to cover the last 50 km,
for a total of 417.27 minutes, or 6.95 hours! If you really did average 67.5
km/h, then in 6.95 hours, you would have travelled 6.95*67.5 = 469.13 km!
But the whole trip was only 200 km! The harmonic mean, in contrast, \bar{X}_H = 4/(1/100 + 1/50 + 1/110 + 1/10) = 28.758, when multiplied by the number of hours, yields the correct distance of 200 km.
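A sketch of the trip arithmetic (in Python; harmonic_mean is our name):

    # The harmonic mean applied to the Calgary-trip example: four
    # equal 50 km legs travelled at different speeds.
    def harmonic_mean(scores):
        return len(scores) / sum(1 / x for x in scores)

    speeds = [100, 50, 110, 10]           # km/h
    h = harmonic_mean(speeds)
    print(h)                              # ~28.758 km/h
    hours = sum(50 / s for s in speeds)   # total travel time: ~6.95 h
    print(h * hours)                      # ~200.0 km, the actual distance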
All this expression says that if you label each score with +1 if it is above the
median, and with −1 if it is below the median, the number of +1 scores will
precisely balance the number of −1 scores, for a grand sum of zero.
2.4.2 A complication
Sets containing many duplicates of the middle value can be problematic in
meeting this definition of the median, although this difficulty is often ignored,
and with good reason. First, the approach shown here complicates what is
otherwise a straightforward and simple procedure for determining a median.
Second, for many distributions, the simply-determined median value is also the
modal (i.e., most frequent) value, and is often worth retaining for that reason.
Third, for many distributions, especially those collected along continuous scales,
the possibility of identical values rarely arises. Fourth, the approach shown is
most useful for situations where different values have been “binned” or collected
into intervals. For these reasons, this subsection can probably be ignored with
little loss.
Take the set X = {1, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5}, for example. It is clear that
the median should be somewhere among all the duplicated threes, but simply
settling for 3 as the median does not seem to be in the spirit of the definition
in that only 4/12 = 33.3% (the 4 and the fives), rather than 50%, of the numbers in the set are greater than 3, and only 2/12 = 16.7% (the 1 and the 2) are less
than 3. In fact, as there are 12 numbers (i.e., n = 12) in the set, then the
middle or 50th percentile number should be the number with a rank-order of
.50(n + 1) = 6.5—between the fourth and fifth 3 in the example set.
For this location of the median to be meaningful requires that the duplicate
values themselves be rank-ordered in some way; the 3 closest to the 4 in the
series, for example, be in some sense a greater value of 3 than the one closest to
the 2. How could this be? Assuming that the values in the set were not true
integers (as would be the case if they were counts, for example), but instead
were obtained as the result of some kind of measurement on a continuous scale,
then each of the threes in the set could be seen as rounded approximations to
the true measured values. Following the conventional procedures for rounding
numbers, this assumption implies that the “different” threes in the example set
were obtained from measurements ranging in value from a low of 2.5 to a high of
3.4999. . . , and could be rank-ordered by these assumed true values. Thus, the
median for this set would be located in the interval 2.5 to 3.4999. . . . But where,
5 The expression is true for odd n as long as the summation is over all values of X except
the median to avoid the division by zero problem that would otherwise result.
In which
l = the lower real limit of the interval (2.5 in our example case)
2.5 Questions
1. The sum of a set of scores equals 90, with a mean of 10. How many scores
are in the set?
2. For a set of 19 scores with a mean of 11, the following sum is computed: \sum_{i=1}^{19} (X_i − q) = 0. What is the value of the constant q?
3. For your part-time work over 12 months, you received 12 cheques. Although
the amount of each cheque was different from all of the others, the average
was $500. How much did you make in total?
4. On a recent quiz, the mean score for the 10 males in the class was 70, and
for the 20 females was 80. What is the mean for the class as a whole?
5. If it were not for the low grade of Allen Scott—the class ne’er-do-well (and
the only male in the class with two first names)—the mean grade for the 5
males in the class would have equalled that of the 10 females, who had a
mean grade of 90. The mean for the class as a whole was 85. What was
Allen Scott’s grade?
6. In a set of 12 scores with a mean of 60, the sum of deviations from the
mean of the 10 scores below the mean is found to equal -120. One of the
remaining two scores is equal to 85. What is the value of the other one?
7. For the set of scores, X = {2, 5, 1, -4, 12, -7, -2}, compute the mean, the
absolute mean, and the root-mean-square mean.
8. In a set of scores, the mean of the 18 males equalled the overall mean of
27. What was the sum of the 34 female scores?
11. For reasons not clear even to himself, Allen Scott finds that if he adds
2.756 to every one of his 10 course grades after multiplying each of them
by 47, he almost precisely matches the distances in astronomical units
(Earth = 1au) of the 10 planets of the solar system.6 If the sum of the
course grades was 22.8, what is the mean of his estimates of the planetary
distances?
12. The mean of 10 scores is 6. If these scores are in fact the logarithms to the
base 2 of the original 10 scores, what is the geometric mean of the original
scores?
6 In his dreams; there is nothing about this problem that relates directly or indirectly to the actual solar system.
15. The students in a sadistics (sic) class report that they spent a total of
$900 on textbooks and calculators for the course, a mean of $45 per person.
One-quarter of the class is male, and males spent a mean of $60 for their
books and calculators.
(a) How many students are in the class?
(b) What was the mean amount spent by females?
16. You run an experiment with a control group and an experimental group
in which the mean for the 3 scores in the control group was 20; the
mean for the experiment as a whole, which included the 5 scores from the
experimental group, was 22.5. Although the results suggest a difference
between the two groups, you are concerned because without the lowest
score in the control group, the control mean would have equaled the mean
for the experimental group. What was this lowest score?
Chapter 3
Measures of Variability
After the location of the data, the next property of interest is how variable or dispersed—how spread out—the data are. Two sets or distributions
or dispersed —how spread out—the data are. Two sets or distributions
of numbers could be otherwise identical (e.g., same mean, same median,
same overall shape, etc.), but still differ in how variable or different from one
another the numbers in each set are. There have been many different measures
of variability proposed and used over the years, reflecting different but related
aspects of a distribution of numbers. We will highlight only a few. These few are
selected for discussion because either they are commonly used in experimental
psychology, or because they have properties that are exploited in subsequent
statistics, or typically both.
But because Xi is a constant for the sum over j, and likewise Xj is a constant
for the sum over i, we get
n\sum_{i=1}^{n} X_i - n\sum_{j=1}^{n} X_j = 0
Figure 3.1: A map of distances between towns and cities along the Trans-Canada
highway.
Van. Hope Cache Cr. Kam. Sal. Arm Rev. Lk. Lou. Banff Cal. Med. Hat Swift Cur.
Vancouver 0 150 343 427 534 636 862 917 1045 1338 1562
Hope 150 0 193 277 384 486 712 767 895 1188 1412
Cache Creek 343 193 0 84 191 293 519 574 702 995 1219
Kamloops 427 277 84 0 107 209 435 490 618 911 1135
Salmon Arm 534 384 191 107 0 102 328 383 511 804 1028
Revelstoke 636 486 293 209 102 0 226 281 409 702 926
Lake Louise 862 712 519 435 328 226 0 55 183 476 700
Banff 917 767 574 490 383 281 55 0 128 421 645
Calgary 1045 895 702 618 511 409 183 128 0 293 517
Medicine Hat 1338 1188 995 911 804 702 476 421 293 0 224
Swift Current 1562 1412 1219 1135 1028 926 700 645 517 224 0
Table 3.1: Distances (in km) from each city or town shown in Figure 3.1 to every other city or town shown on that map. The
cities are listed in order from West to East. Note that each place is 0 km from itself (the diagonal) and that each distance is
displayed twice: once in the upper-right triangle and once in the lower-left triangle.
Table 3.2: Distances (in km) from each city or town shown in Figure 3.1 to
Calgary. Note that using Calgary as a reference point and defining distances
eastward as negative and westward as positive (both typical Calgarian biases),
allows for a substantially reduced table. Any distance shown in Table 3.1 can
be calculated from the information in this table requiring a maximum of 1
subtraction (e.g., the distance from Salmon Arm to Banff equals 511 − 128 = 383 km, as shown in Table 3.1).
the mean-square minus the squared-mean. Also surprising is the fact that, despite
computing the deviations from the mean of the distribution rather than from
every other score, S 2 is really just a (much) more efficient way of computing D2 .
In fact, D2 = 2S 2 (see Table 3.4 for the derivation). So, S 2 or the variance does
reflect the average squared difference of the scores from one another in addition
to reflecting the average squared difference from the mean. With the exception
of a constant factor of 2, the concepts are interchangeable.
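A small numerical check of both claims (in Python; function names ours) computes S^2 as the mean squared deviation and as the mean-square minus the squared-mean, and verifies that D^2 = 2S^2:

    # Two computational forms of the variance, and the relation
    # D^2 = 2 * S^2 claimed in the text.
    def variance(scores):
        n = len(scores)
        mean = sum(scores) / n
        return sum((x - mean) ** 2 for x in scores) / n

    def variance_shortcut(scores):
        n = len(scores)
        return sum(x ** 2 for x in scores) / n - (sum(scores) / n) ** 2

    def d_squared(scores):
        n = len(scores)
        return sum((xi - xj) ** 2 for xi in scores for xj in scores) / n ** 2

    x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    print(variance(x), variance_shortcut(x))  # 6.666... 6.666...
    print(d_squared(x), 2 * variance(x))      # 13.333... 13.333...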
Many hand calculators, computer statistical applications, and computer
spreadsheet applications provide functions for variance and standard deviation.
Unfortunately, these functions sometimes compute what is known as a population
variance (or standard deviation) estimate (see Chapter 14, section 14.1 for further
information on this concept). As not all calculators, statistical applications,
and spreadsheets necessarily default to one or the other calculation—and few
document exactly how the function is calculated, you may have to test the
function to see whether it is dividing by n—the correct value—or by n − 1 to
produce a population estimate. To do so, test the function with the values 1, 2, and 3: a result of 0.667 for the variance (0.816 for the standard deviation) means it is dividing by n—the correct value here—whereas a result of 1.0 means it is dividing by n − 1.
2 A related approach is to use a fixed number of bits of internal representation and encode
each number in a floating-point form (i.e., the base-2 equivalent of exponential notation). The
effect of limited precision, however, is similar.
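One quick way to run that test (in Python, whose standard-library statistics module exposes both conventions):

    # Testing which convention a variance function uses, with the
    # probe values 1, 2, 3 suggested in the text.
    import statistics

    probe = [1, 2, 3]
    print(statistics.pvariance(probe))  # 0.666... -> divides by n
    print(statistics.variance(probe))   # 1.0      -> divides by n - 1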
S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}

    = \frac{1}{n} \sum_{i=1}^{n} \left( X_i^2 - 2X_i\bar{X} + \bar{X}^2 \right)

    = \frac{1}{n} \left( \sum_{i=1}^{n} X_i^2 - 2\bar{X}\sum_{i=1}^{n} X_i + n\bar{X}^2 \right)

    = \frac{1}{n} \left( \sum_{i=1}^{n} X_i^2 - 2\frac{\sum_{i=1}^{n} X_i}{n}\sum_{i=1}^{n} X_i + n\left(\frac{\sum_{i=1}^{n} X_i}{n}\right)^2 \right)

    = \frac{1}{n} \left( \sum_{i=1}^{n} X_i^2 - 2\frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n} + \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n} \right)

    = \frac{1}{n} \left( \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n} \right)

And for D^2:

D^2 = \frac{\sum_{i,j=1}^{n} (X_i - X_j)^2}{n^2}

    = \frac{1}{n^2} \sum_{i,j=1}^{n} \left( X_i^2 - 2X_iX_j + X_j^2 \right)

    = \frac{1}{n^2} \left( \sum_{i,j=1}^{n} X_i^2 - 2\sum_{i,j=1}^{n} X_iX_j + \sum_{i,j=1}^{n} X_j^2 \right)

    = \frac{1}{n^2} \left( n\sum_{i=1}^{n} X_i^2 - 2\left(\sum_{i=1}^{n} X_i\right)^2 + n\sum_{j=1}^{n} X_j^2 \right)

    = \frac{\sum_{i=1}^{n} X_i^2}{n} - 2\frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n^2} + \frac{\sum_{j=1}^{n} X_j^2}{n}

    = 2\left( \frac{\sum_{i=1}^{n} X_i^2}{n} - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n^2} \right)

    = 2S^2
3.4 Questions
1. For a set of 20 scores, the mean is found to be 10, and the sum of the squared
scores is found to be 2500. What is the standard deviation of the 20 scores?
Chapter 4

Transformed Scores

Often it is desirable to transform the data in some way to achieve a distri-
bution with “nicer” properties so as to facilitate comparison with some
theoretical distribution or some other set of data. Such transformations
can be especially useful for data-sets for which the actual values of the numbers
are of little or no intrinsic interest. Common examples of such transforma-
tions or re-scaling are found in IQ, GRE, and SAT scores, grade-point-averages,
course-grades computed by “grading on a curve”, temperatures (e.g., Fahrenheit,
Celsius, or Absolute scales), currencies, and so on.
4.1 The Linear Transform

4.1.1 Rules for changing \bar{X}, S_X, and S_X^2
Linear transformations of data are often used because they have the effect of
adjusting the mean and variance of the distribution.
1 Technically, the transform we are referring to here is known in mathematics as an affine
transform. In mathematics, a linear transform is any transform T that for every X1 and
X2 and any number c, T (X1 ) + T (X2 ) = T (X1 + X2 ) and T (cXi ) = cT (Xi ). So, an affine
transform is a linear transform plus a constant.
Adding a constant to every score. If Z_i = X_i + c, then

\bar{Z} = \frac{\sum_{i=1}^{n} (X_i + c)}{n} = \frac{\sum_{i=1}^{n} X_i + nc}{n} = \frac{\sum_{i=1}^{n} X_i}{n} + \frac{nc}{n} = \bar{X} + c
For the variance,

S_Z^2 = \frac{\sum_{i=1}^{n} (Z_i - \bar{Z})^2}{n} = \frac{\sum_{i=1}^{n} ((X_i + c) - (\bar{X} + c))^2}{n} = \frac{\sum_{i=1}^{n} (X_i + c - \bar{X} - c)^2}{n} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} = S_X^2

And, hence, S_Z = S_X.
Multiplying each score by a constant. If Z_i = cX_i, then

\bar{Z} = \frac{\sum_{i=1}^{n} cX_i}{n} = c\frac{\sum_{i=1}^{n} X_i}{n} = c\bar{X}

For the variance,

S_Z^2 = \frac{\sum_{i=1}^{n} (Z_i - \bar{Z})^2}{n} = \frac{\sum_{i=1}^{n} (cX_i - c\bar{X})^2}{n} = \frac{\sum_{i=1}^{n} c^2(X_i - \bar{X})^2}{n} = c^2\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} = c^2 S_X^2

And, hence, S_Z = |c| S_X.
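A quick verification of both rules (in Python; function names ours):

    # The linear-transform rules: adding a constant shifts the mean and
    # leaves S unchanged; multiplying by a constant scales both.
    def mean(xs):
        return sum(xs) / len(xs)

    def sd(xs):
        m = mean(xs)
        return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

    x = [1, 2, 3, 4, 5]              # mean 3.0, S = 1.414...
    added = [xi + 10 for xi in x]    # Z = X + c with c = 10
    scaled = [3 * xi for xi in x]    # Z = cX with c = 3
    print(mean(added), sd(added))    # 13.0 1.414... (S unchanged)
    print(mean(scaled), sd(scaled))  # 9.0 4.242... (S tripled)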
4.2 The Standard Score Transform

4.2.1 Properties of Standard Scores

Recall that a standard score is a deviation score expressed in standard deviation units: Z_i = (X_i - \bar{X})/S_X. For the mean of the standard scores, then,

\bar{Z} = \frac{\sum_{i=1}^{n} Z_i}{n} = \frac{1}{nS_X}\sum_{i=1}^{n} (X_i - \bar{X})

But as \sum_{i=1}^{n} (X_i - \bar{X}) = 0,

\bar{Z} = \frac{1}{nS_X}(0) = 0
For the variance of the standard scores (given that \bar{Z} = 0),

S_Z^2 = \frac{\sum_{i=1}^{n} Z_i^2}{n} = \frac{1}{n}\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{S_X^2}

But, S_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}, so

S_Z^2 = \frac{1}{n} \cdot \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}} = \frac{n}{n} = 1
4.2.2 Uses of Standard Scores

Scores on standardised examinations such as the GRE, for example, are re-scaled to have a mean of 500 and a standard deviation of 100 via the standard score transform. That is, whatever the mean and standard deviation of the actual scores on the examination, the scores are converted to GRE scores via the following transformation, after first having been converted to standard (Z-) scores:

GRE = 100Z + 500
So, a GRE score less than 500 is automatically recognised as a score less than
the mean, and a score of 653 as 1.53 standard deviations above the mean. Scores
obtained on “standardised” IQ tests are similarly transformed, although in this
case so that the mean is 100 with a standard deviation of 15.
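A sketch of this sort of re-scaling (in Python; rescale is our name, and the raw scores are made up for illustration):

    # Rescaling scores to a target mean and SD via the standard-score
    # transform (GRE-style: mean 500, SD 100).
    def rescale(xs, new_mean, new_sd):
        m = sum(xs) / len(xs)
        s = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
        return [new_sd * (x - m) / s + new_mean for x in xs]

    raw = [12, 15, 18, 21, 24]
    print(rescale(raw, 500, 100))
    # [358.6, 429.3, 500.0, 570.7, 641.4] (approximately)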
From a statistical standpoint, however, standard scores are quite useful in
their own right. As they have a mean of 0 produced by subtracting the raw-score
mean from every raw score, they are “centred”, and subsequent calculations
with them can occur without regard to their mean. Standardised scores in
some sense express the “true essence” of each score beyond that of the common
properties (mean and standard deviation) of their distribution. Other advantages
are discussed in subsequent sections.
4.3 Questions
1. For a set of standardised scores, the sum of squared scores is found to
equal 10. How many scores are there?
2. For a set of standardised scores, each score is multiplied by 9 and then has
five added to it. What are the mean and standard deviation of the new
scores?
3. Express a score of 22 from a distribution of scores with a mean of 28 and
a standard deviation of 3 as a linearly transformed score in a distribution
with a mean of -10 and a standard deviation of 0.25.
4. On a recent test, Sam—the class ne’er-do-well—achieved a score of 50, which was 3 standard deviations below the class mean. Sandy—the class wizard—achieved a score of 95, which was 2 standard deviations above the
class mean. What was the class mean?
5. T-scores have a mean of 50 and a standard deviation of 10. Express a
GRE score of 450 as a T-score.
6. The WAIS (Wechsler Adult Intelligence Scale—or so it is alleged) is scaled
to have a mean of 100 and a standard deviation of 15. Your dog’s score of
74 on the meat-lovers’ sub-scale was rescaled to give a WAIS meat-lovers’
IQ of 70. If the standard deviation on the meat-lovers’ sub-scale is 7, what
is the mean?
7. Following a standard score transform, a score of 26 from distribution X is
found to be equivalent to a score of 75 in distribution Y for which the mean
Chapter 5

Other Descriptive Statistics

Reducing a distribution of numbers to its mean and variance may not
capture all that is interesting or useful to know about: two distributions
may have identical means and variances, for example, but still be quite
different in shape. As with the mean and variance, the other aspects of the
distribution that underlie these differences in shape can be captured in various,
carefully chosen statistics.
5.1.1 Skewness
Skewness, for example, is based on the third moment of the distribution, or the
sum of cubic deviations from the mean, and is often defined as:
G_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{nS_X^3}
5.1.2 Kurtosis
Kurtosis is derived from the fourth moment (i.e., the sum of quartic deviations),
and is often defined as:
G_2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4}{nS_X^4} - 3
It captures the “heaviness” or weight of the tails relative to the centre of the
distribution. One reason it does so is that it uses a larger exponent than does
the computation of variance. The larger the exponent, the greater the role large
deviations, and thereby tail values, play in the sum. Thus, the resulting statistic
deemphasizes small deviations (i.e., values around the mean) in favor of large
deviations, or the values in the tails. Hence, although both variance and kurtosis
concern deviations from the mean, kurtosis is concerned principally with the
tail values. Positive values of kurtosis indicate relatively heavier tails, negative
values indicate relatively sparse tails, and a value of zero kurtosis indicates that
the tails are about the weight they should be.
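A sketch of both statistics as defined above (in Python; the function name is ours):

    # Skewness (G1) and kurtosis (G2) from the third and fourth
    # moments, using divide-by-n throughout, as in the text.
    def shape_stats(xs):
        n = len(xs)
        m = sum(xs) / n
        s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
        g1 = sum((x - m) ** 3 for x in xs) / (n * s ** 3)
        g2 = sum((x - m) ** 4 for x in xs) / (n * s ** 4) - 3
        return g1, g2

    print(shape_stats([1, 2, 3, 4, 5, 6, 7, 8, 9]))
    # (0.0, -1.23...): symmetric, with lighter tails than the normal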
Relative to what?, you might ask. The answer is relative to the “normal”
or Gaussian distribution—the “bell-curve” of many discussions, for example,
about the distribution of IQ in human populations. This standard of the normal
distribution also explains the “−3” in the equation for the kurtosis statistic. By
subtracting 3, the sum equals zero when the distribution in question is “normal”.
It also explains why the different sums are “normalised” by the standard deviation
(SX ) raised to the power of the moment. Such normalisation renders the Gaussian
distribution the standard or “normal” distribution: deviations from its shape
appear as non-zero values in these equations, with the extent of the deviation in
shape from the normal distribution reflected in increasingly deviant values. It
is not just that the normal distribution is taken as the norm by fiat or general
consensus. Rather, unlike many data distributions, a normal distribution is
completely determined by the first two moments. If you know the mean and
standard deviation of a normal distribution, you know everything there is to
know about it. Similarly, if some data-set is well-approximated with a normal
distribution, then knowing the mean and standard deviation of that data-set
provides virtually all you need to know about that data-set.
Three kurtotic shapes of distributions are generally recognized with labels.
“Normal” distributions, having tails of about the “right” weight—neither too
heavy or too light, are referred to as “mesokurtic” from “meso” for “middle”.
Distributions that are too heavy in the tails (i.e., very peaked distributions) are called “leptokurtic”, and distributions with too light tails (i.e., flattened distributions) are called “platykurtic”.
5.2 Questions
1. A distribution of scores with a mean of 40 and a median of 50 would suffer
from what kind of skew? Why?
Chapter 6

Recovering the Distribution

Imagine that you have carefully recorded the mean and standard deviation
of your data, but have lost all the raw scores, and you are now confronted
with a question such as: Was it possible that more than 20% of the original
scores could have been greater than 3 times the mean?1 Or that, at a minimum,
80% of the scores were within two standard deviations of the mean? Two famous
mathematical inequalities provide for (albeit limited) answers to such questions.
Although originally derived within probability theory (and famous for that),
each can be presented as consequences of simple algebra.
provided only these (or related statistics) in a research report, say, but wish to know more
about the original distribution of scores.
6.1 The Markov Inequality

For a distribution of all non-negative scores, the most extreme way to place scores at or above some value c ≥ \bar{X} is to put a proportion w of them exactly at c and the remaining proportion (1 − w) at 0. In that extreme case,

\bar{X} = wc + (1 - w)0 = wc
2 The restriction of c ≥ \bar{X} is simply to ensure that the proportion does not exceed 1.
Solving for w,

w = \frac{\bar{X}}{c}

That is, for any value of c ≥ \bar{X}, the proportion of scores in the distribution X greater than or equal to c is at most \bar{X}/c. That is,

p\{X_i \ge c\} \le \frac{\bar{X}}{c} \qquad (6.1)
That also means, taking the complementary event, that at least 1 − (\bar{X}/c) of the scores must be less than c. So, for example, in a distribution having all positive
scores (and not all zero) with a mean of 20, at most 1/3 of the scores could be
greater than or equal to 60 (i.e., 20/60), and at least 2/3 must be less than 60.
Similarly, at most 1/5 (i.e., 20/100) or 20% of the scores could be greater than
or equal to 100, and at least 80% must be less than 100.
The other common way of expressing the Markov inequality is in terms of multiples of the mean of the distribution—that is, where c (c ≥ 1) is some multiple of \bar{X}. If so, then we can re-express equation 6.1 as

p\{X_i \ge c\bar{X}\} \le \frac{\bar{X}}{c\bar{X}}

Cancelling \bar{X} on the right side of the equation yields:

p\{X_i \ge c\bar{X}\} \le \frac{1}{c} \qquad (6.2)
This version of the Markov inequality says that for any value of c > 1, the
proportion of scores in the distribution greater than or equal to c times the
mean is at most 1/c. That also means, taking the complementary event, that at least 1 − (1/c) of the scores must be less than c times the mean. Thus, for example,
for any distribution whatsoever having all positive scores (and not all zero), at
most 50% of the scores are greater than or equal to twice the mean, and at a
minimum 50% of them are less than that. Similarly, at most 25% of the scores
are greater than or equal to four times the mean, and at a minimum 75% of
them are less than that. And so on.
Consider the following example in light of the Markov inequality. A student
is explaining how he is doing at university. He notes that although the mean
grade for his 30 courses is only 50, “. . . that’s because of some very low grades
in outrageously-difficult courses like sadistics (sic); actually, in more than 21 of
these courses my grade has been more than 75!” Given that negative grades are
not usually possible, based strictly on the numbers given, is the student’s claim
plausible? As 75 = (1.5)50 or 1.5 times the mean, Markov’s inequality says that
at most 100/1.5 = 67% of the 30 grades, or 20, could be greater than or equal
to 75 with a mean of 50. But the student is claiming that more than 21 courses
exceeded or equalled 75. So, either the student is misrepresenting the grades, or
he has miscalculated their mean.
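The check is easily mechanised (in Python; the function name is ours):

    # The Markov bound applied to the student's claim: with all
    # non-negative grades and a mean of 50, how many of 30 grades
    # could possibly be 75 or more?
    def markov_max_count(mean, cutoff, n):
        return n * mean / cutoff   # from p{X >= cutoff} <= mean/cutoff

    print(markov_max_count(mean=50, cutoff=75, n=30))  # 20.0
    # So "more than 21" grades of 75 or better is impossible.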
6.2 The Tchebycheff Inequality

p\{|X_i - \bar{X}| \ge hS_X\} \le \frac{1}{h^2} \qquad (6.3)
Thus, the Tchebycheff inequality states that in any distribution of scores
whatsoever, for any value of h > 1, the proportion of scores greater than or equal
to h standard deviations (plus or minus) from the mean is at most 1/h2 . Taking
the complementary event, that also means that at least 1 − (1/h2 ) of the scores
must lie within h standard deviations of the mean. So, for example, for any
distribution of scores whatsoever, no more than 25% of the scores can be beyond
2 standard deviations from the mean, and at least 75% of them must lie within
2 standard deviations of the mean. Similarly, at most 11.11% of the scores can
be beyond 3 standard deviations from the mean, and at least 88.89% of them
must lie within 3 standard deviations of the mean. And so on.
Consider the following example. A fast-talking, investment promoter is
exalting the virtues of a particular investment package. He claims that “On
average, people who invest in this package have received a 30% annual return on
their investment, and many people have had returns of over 39%.” Intrigued,
you ask for, and receive, the information that the average reported is a mean
with a standard deviation of 3% for the investment return of 400 people, and
that the “many people” referred to “. . . is more than 50”. Based strictly on the
numbers given, is there reason to doubt the claims of the promoter? Returns of
over 39% are at least 3 standard deviations from the mean. According to the
Tchebycheff inequality, with a mean of 30% and a standard deviation of 3%, at
most 11.11% or 44.44 of the 400 people could have returns more than 3 standard
deviations from the mean, but the promoter is claiming that over 50 had returns
of 39% or more. The claims don’t add up.
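Again, the bound is easy to compute (in Python; the function name is ours):

    # The Tchebycheff bound applied to the promoter's claim: mean
    # return 30%, SD 3%, 400 investors, "more than 50" above 39%.
    def tchebycheff_max_count(mean, sd, cutoff, n):
        h = abs(cutoff - mean) / sd   # the cutoff in SD units
        return n / h ** 2             # at most n/h^2 scores that extreme

    print(tchebycheff_max_count(mean=30, sd=3, cutoff=39, n=400))
    # 44.44... at most, so "more than 50" such returns cannot be right.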
6.3 Questions
1. A distribution of 50 scores with a mean of -4 and a standard deviation of
5 can have at most how many of those scores greater than 6?
Chapter 7

Correlation

Science is generally concerned with whether and the extent to which vari-
ables covary across cases; in this sense, it is a search for or verification
of structure or patterns of relationship among a set of variables. There
are many methods of measuring this similarity in variation among variables
across cases, but all of them reduce to a basic idea of redundancy among a set
of variables: the variables go together in some way such that knowledge of one
provides some information about another. Tall people tend to be heavier than
short people, for example, although there are clear exceptions to this pattern.
Height, then, is partially redundant with weight, which is to say that it is
correlated with weight. Despite its centrality to all of science, a mathematical
definition of correlation arrived quite late in the history of science. As with the
concepts of evolution by natural selection and the expansion of the universe (and
the “big bang”), the concept of correlation is only obvious in retrospect.
and he spent years in Germany studying the works of the great German economists, especially
Karl Marx. Pearson was so taken with Marx, that, upon returning to England, he officially
changed his name to “Karl”. He was one in a long string of “leftist” scholars in the history of
statistics, genetics, and evolutionary theory.
equation for correlation based on bivariate normal spaces, which involves more
complicated mathematical techniques than we wish to entertain here. We’ll take
a different approach, in fact, two of them.
r_{xy} = \frac{\sum_{i=1}^{n} Z_{X_i} Z_{Y_i}}{n} \qquad (7.1)

Exploring this sum of cross-products of the paired standard scores, we find that it reaches its maximum value when the order of the values in Y
matches as much as possible the order of the values in X. The sum reaches its
minimum value when the values in Y are ordered as much as possible backward relative to the order of the values in X. (We could equally have reversed the ordering of X, and held the ordering of Y constant. The result is the same in either case.)
3 Theoretically because the maximum and minimum values will only be obtained if the
orderings of the standardized values are perfectly matched, and the paired deviations are
exactly the same in magnitude. Otherwise, even if perfectly matched with respect to order,
the computed sum will only approach the theoretical maximum (or minimum).
The theoretical maximum of equation 7.1 is given as the correlation of a variable with itself:

MAX(r_{xy}) = \frac{\sum_{i=1}^{n} Z_{X_i} Z_{X_i}}{n} = \frac{\sum_{i=1}^{n} Z_{X_i}^2}{n} = \frac{n}{n} = 1
And the theoretical minimum is given as the correlation of a variable with the negative of itself (i.e., each value multiplied by −1):

MIN(r_{xy}) = \frac{\sum_{i=1}^{n} Z_{X_i}(-Z_{X_i})}{n} = \frac{\sum_{i=1}^{n} -(Z_{X_i}^2)}{n} = \frac{-n}{n} = -1

Thus, −1 ≤ r ≤ 1.
Unfortunately, because \sum_{i=1}^{n} Z_{X_i} = \sum_{i=1}^{n} Z_{Y_i} = 0, this mean is always zero. As usual, to eliminate this problem, we can square the paired differences to produce an average squared difference or ASD:

ASD = \frac{\sum_{i=1}^{n} (Z_{X_i} - Z_{Y_i})^2}{n}
The maximum possible value of ASD occurs when the paired standard scores are exact opposites, that is, when each Z_{Y_i} = -Z_{X_i}:

ASD = \frac{\sum_{i=1}^{n} (Z_{X_i} - (-Z_{X_i}))^2}{n} = \frac{\sum_{i=1}^{n} (Z_{X_i} + Z_{X_i})^2}{n} = \frac{\sum_{i=1}^{n} (2Z_{X_i})^2}{n} = 4\frac{\sum_{i=1}^{n} Z_{X_i}^2}{n} = 4(1) = 4
ASD thus ranges from 0 (when the paired standard scores are identical) to 4 (when they are exact opposites); halving it and subtracting the result from 1 therefore yields an index ranging from 1 down to −1:

r_{xy} = 1 - \frac{1}{2}ASD = 1 - \frac{1}{2}\frac{\sum_{i=1}^{n} (Z_{X_i} - Z_{Y_i})^2}{n} \qquad (7.2)
It is, in fact, the Pearson product-moment correlation coefficient calculated
from differences between pairs of standardized values rather than products. Table
7.1 shows the mathematical equivalence of equation 7.1 with equation 7.2.
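A sketch of the two computational routes (in Python; function names ours):

    # Pearson's r two ways: as the mean cross-product of standard
    # scores (equation 7.1) and via the average squared difference
    # (equation 7.2).
    def standardise(xs):
        m = sum(xs) / len(xs)
        s = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
        return [(x - m) / s for x in xs]

    def r_crossproduct(xs, ys):
        zx, zy = standardise(xs), standardise(ys)
        return sum(a * b for a, b in zip(zx, zy)) / len(xs)

    def r_asd(xs, ys):
        zx, zy = standardise(xs), standardise(ys)
        asd = sum((a - b) ** 2 for a, b in zip(zx, zy)) / len(xs)
        return 1 - asd / 2

    x = [1, 2, 3, 4, 5]
    y = [2, 1, 4, 3, 5]
    print(r_crossproduct(x, y), r_asd(x, y))  # ~0.8 ~0.8 (identical)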
7.2 Covariance
As noted, the Pearson product-moment correlation coefficient is computed with
two variables, X and Y , converted to standard-scores so as to limit the maximum
and minimum values of rxy to 1.0 and −1.0. There is another measure of
association based on the same principles that is simply the average cross-product
of deviation scores, rather than standard scores. It is called the covariance of the
paired scores, and is defined as follows for the covariance of variables X and Y :
Pn
(Xi − X)(Yi − Y )
Sxy = i=1 (7.3)
n
Notice how, if X and Y were the same variable (i.e., X_i = Y_i), equation 7.3 is simply the definitional formula for variance, S_X^2. The covariance is used in those situations in which the original measurement scales of the variables are considered to be an important part of their relationship.
r_{xy} = 1 - \frac{1}{2}ASD

       = 1 - \frac{1}{2}\frac{\sum_{i=1}^{n} (Z_{X_i} - Z_{Y_i})^2}{n}

       = 1 - \frac{1}{2n}\sum_{i=1}^{n} \left( Z_{X_i}^2 - 2Z_{X_i}Z_{Y_i} + Z_{Y_i}^2 \right)

       = 1 - \frac{1}{2n}\left( n - 2\sum_{i=1}^{n} Z_{X_i}Z_{Y_i} + n \right)

       = 1 - \frac{1}{2n}\left( 2n - 2\sum_{i=1}^{n} Z_{X_i}Z_{Y_i} \right)

       = 1 - 1 + \frac{\sum_{i=1}^{n} Z_{X_i}Z_{Y_i}}{n}

       = \frac{\sum_{i=1}^{n} Z_{X_i}Z_{Y_i}}{n}
Table 7.1: Demonstration that the average squared difference formula for r and
the cross-products formula for r are equivalent.
Table 7.2: Demonstration that the Spearman formula is equivalent to the Pearson
cross-product formula applied to the data in ranks (equation 7.1).
$$r = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2 - 1)} \qquad (7.4)$$
For example, suppose the ranked preference of five cars (e.g., a Topaz,
a Geo, a Volvo 244, a Civic, and a Sprint, respectively) from Sandy were
S = {1, 3, 2, 5, 4}; that is, Sandy preferred the Topaz most and the Civic
least. Suppose further that Chris ranked the same 5 cars as C = {3, 1, 4, 5, 2};
that is, Chris preferred the Geo most and Civic least. As shown in Table
7.2, the Spearman rank-order correlation coefficient for these ranks would be:
$1 - 6(2^2 + 2^2 + 2^2 + 0 + 2^2)/(5^3 - 5) = 1 - (6)(16)/120 = 1 - .8 = .2$.
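Equation 7.4 is simple enough to program. A short Python sketch (ours, not the book's) applied to Sandy's and Chris's rankings reproduces the value of .2:

    def spearman(ranks_a, ranks_b):
        """Equation 7.4 applied to two lists of untied ranks."""
        n = len(ranks_a)
        sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
        return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

    sandy = [1, 3, 2, 5, 4]  # Sandy's ranking of the five cars
    chris = [3, 1, 4, 5, 2]  # Chris's ranking of the same cars
    print(spearman(sandy, chris))  # 0.2, as computed above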
4 The precise derivation is left as an exercise for the reader. In doing so, note that the
Tied Ranks
Sometimes, the ranking procedure results in tied ranks. It might be the case,
for example, that the individual or process doing the ranking can’t discriminate
between two or more cases; they all should be ranked “3”, for instance. Or, it
may be the case that the data to be ranked contain a number of equal values.
Tied ranks are problematic because equation 7.4 assumes that each value has
a unique rank. There are two ways of handling this situation. The tied cases
could simply be ignored, and the correlation computed for the remaining cases,
or the tied ranks could be adjusted. In the latter case, the ranks that would
normally be associated with the values are averaged, and the average used as
the rank for each of the values. For example, if the ranks in question were 7, 8,
and 9, the ranks would be averaged to 8, and the rank of 8 assigned to each of
the three values. Applying equation 7.4 to ranks adjusted in this way tends to
overestimate the absolute value of r that would be obtained with untied ranks,
but as long as the number of tied ranks is small, the effect is usually trivially
small.
$$r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{S_Y}\sqrt{pq} \qquad (7.5)$$
7.4 Questions
1. Each of you and a friend rank the 5 recently-released, currently-playing
movies, with 1 as the most preferred and 5 as the least preferred. The
results are as follows:
How well are your ratings of the movies related to those of your friend?
Person Gym IQ
Arnold Tim’s 120
Hans Tim’s 110
Franz Tim’s 100
Lou Tim’s 80
Charles Tim’s 90
PeeWee Jim’s 90
Percival Jim’s 90
Les Jim’s 110
Dilbert Jim’s 140
Edwin Jim’s 120
What is the correlation between gym membership and IQ? Does it appear
to be the case (at least for these data) that “Smarter people choose Jim’s
Gym”?
3. On a recent sadistics (sic) test, the 16 psychology majors in the class achieved
a mean grade of 80, and the 4 business majors in the class achieved a mean
grade of 76. If the standard deviation for the 20 students was 2, what was
the correlation between major and sadistics (sic) grade?
5. Below are the results of two horse races involving the same 5 horses:
What is the correlation between the two races? Describe the relationship
in words.
Chapter 8
Linear Regression
Finding that two variables covary is also to say that from knowledge of one
the other may be predicted, at least to some degree. If this covariation
is captured as a correlation coefficient, then the degree or extent to
which one variable can be predicted by the other is given by its value. But
exactly what kind of prediction is it? The Pearson product-moment correlation
coefficient (and its algebraic variants) is a measure of the linear intensity of the
relationship between two variables; it measures the extent to which the values of
one variable may be described as a linear transformation of the values of another
variable. In fact, it was the observation that one variable could be seen as
an approximate linear function of another that led to the development of the
concept of the correlation coefficient in the first place.
variables are regressed (as we now say) on one another in standard score form.
8.1.1 The Regression Equation: $Z'_{Y_i} = r_{xy} Z_{X_i}$
Consider the case where there is only a moderate correlation between variables
X and Y, say, $r_{xy} = .5$. For this to be true, $\sum_{i=1}^{n} Z_{X_i} Z_{Y_i}$ must equal $\frac{1}{2}n$.2 How
could this occur? If each $Z_Y$ of every $Z_X Z_Y$ pair were equal to $\frac{1}{2}Z_X$, then
$$\begin{aligned}
r_{xy} &= \frac{\sum_{i=1}^{n} Z_{X_i} Z_{Y_i}}{n} \\
&= \frac{\sum_{i=1}^{n} Z_{X_i}(.5Z_{X_i})}{n} \\
&= .5\,\frac{\sum_{i=1}^{n} Z_{X_i} Z_{X_i}}{n} \\
&= .5\,\frac{\sum_{i=1}^{n} Z_{X_i}^2}{n} \\
&= .5\,\frac{n}{n} \\
&= .5
\end{aligned}$$
But it is not necessary that each ZYi = .5ZXi for this statement to be true, rather
that on average they do so. That is, rxy = .5 merely indicates that, on average,
ZY = .5ZX (or ZX = .5ZY ). If rxy = .5, then, what is the best prediction of ZY
given ZX ? Given the foregoing, it would appear that ZY = .5ZX would be the
best prediction. How is this the best prediction? For any particular $Z_X Z_Y$ pair,
it may not be. Using $Z'_{Y_i}$ as the predicted value of $Z_{Y_i}$ from $Z_{X_i}$, we may find
that the difference, $Z_{Y_i} - Z'_{Y_i}$, is quite large for some pairs. But we know that
the sum of squared deviations from the mean is a minimum in that any value
other than the mean will produce a larger sum of squared deviations from it.
So, predicting the average ZY for a given ZX will produce the smallest average
squared difference between the predicted and the actual values. That is, on
average, predicting the average ZY for a given ZX produces the smallest error
of prediction, at least as measured in terms of this least-squares criterion.3 But,
as just noted, the average $Z_Y$ given $Z_X$ is equal to $r_{xy}Z_X$. That is, using

$$Z'_{Y_i} = r_{xy} Z_{X_i} \qquad (8.1)$$
2 It will be remembered that $r_{xy} = \frac{\sum_{i=1}^{n} Z_{X_i} Z_{Y_i}}{n}$, so $\sum_{i=1}^{n} Z_{X_i} Z_{Y_i} = r_{xy}\, n$.
3 One can start with this notion of minimising the sum of squared deviations of the actual
from the predicted values, and ask what linear expression minimises it using differential calculus.
The result is the same in either case.
produces the lowest error of prediction (in terms of least-squares) in ZY for all
rxy . Similarly,
$$Z'_{X_i} = r_{xy} Z_{Y_i} \qquad (8.2)$$

produces the lowest error of prediction (in terms of least-squares) in $Z_X$ for all
$r_{xy}$. Note that unless $r_{xy}$ equals 1 or −1, the predicted value $Z'_{Y_i}$ is always less
extreme or deviant than the ZXi value from which the prediction is made. This
reduction in deviance in predicted values is Galton’s “regression”, and occurs
strictly as a function of imperfect correlation: the poorer the correlation between
X and Y , the greater the regression in the predicted values.
Note that unless the standard deviations are equal, $\bar{Y} = \bar{X}$, and $r_{xy} = \pm 1$, the
two regression lines are not the same.4
As an example, consider the hypothetical data shown in Table 8.1 representing
the heights in inches of five adult women paired with that of their closest (in age)
adult brothers. As shown, the correlation in height between sister and brother is
4 There is a line, called the first principal component, that is the average of the two regression
lines. Whereas the regression line of Y on X minimizes $\sum(Y - Y')^2$, and the regression of
X on Y minimizes $\sum(X - X')^2$, the principal component line minimizes the sum of squared
deviations between obtained and predicted along both axes simultaneously.
Table 8.1: Example regression data representing the heights in inches of 5 women
and their brothers expressed as both raw and standard scores. Column $Z_s Z_b$ is
the cross-product of the paired standard (Z-) scores.
$r_{sb} = 0.6$, a moderate, positive correlation. Figures 8.2 and 8.3 plot these data
in raw and standard score form, respectively. Superimposed on each figure are
the regression lines of brother’s height regressed on sister’s, and sister’s height
regressed on brother’s.
Figure 8.2: Scatterplot and regression lines for the hypothetical data from Table
8.1 (sister height on the x-axis, brother height on the y-axis, both in inches).
Figure 8.3: Scatterplot and regression lines for the standardized hypothetical
data from Table 8.1 (sister height in z-units on the x-axis, brother height in
z-units on the y-axis; the two lines are $Z'_{brother} = 0.6\, Z_{sister}$ and $Z'_{sister} = 0.6\, Z_{brother}$).
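Because the raw heights of Table 8.1 are not reproduced in these notes, the following Python sketch uses hypothetical heights of its own; it computes the correlation and then applies equation 8.1 to predict each brother's Z-score from his sister's:

    from math import sqrt

    def standardize(values):
        """Convert raw scores to Z-scores using the n-based formulas."""
        n = len(values)
        mean = sum(values) / n
        sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
        return [(v - mean) / sd for v in values]

    def pearson_r(x, y):
        """Mean cross-product of paired Z-scores (equation 7.1)."""
        return sum(a * b for a, b in zip(standardize(x), standardize(y))) / len(x)

    sisters = [62, 63, 65, 66, 69]   # hypothetical heights in inches
    brothers = [66, 69, 68, 71, 71]  # hypothetical heights in inches
    r = pearson_r(sisters, brothers)

    # Equation 8.1: Z'_brother = r * Z_sister
    predicted = [r * z for z in standardize(sisters)]
    print(round(r, 3))
    print([round(z, 3) for z in predicted])  # less extreme than the Z_X values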
Clearly, if rxy = 0, the best (in the least-squares sense) prediction of every Yi
from Xi is simply the mean of Y , Y , and, as equations 8.5 and 8.6 indicate, the
error of prediction in this case is just the standard deviation of Y . On the other
hand, if rxy = ±1.0, then the standard error of estimate is zero. The standard
error of estimate can be thought of as capturing the variability of the Y-values for a
particular X-value, averaged over all values of X. Low values of SY −Y 0 indicate
that the average spread of the actual values of Y around the value predicted
from X is also low—that, on average, the predicted values are quite accurate.
Conversely, values of SY −Y 0 close to the standard deviation of Y indicate that
predictions from X are little better than simply predicting the mean of Y in
every case—that knowledge of X contributes little to the (linear) prediction of
Y.
The standard error of estimate also provides another interpretation of the
correlation coefficient. It can be shown by algebraic manipulation of equation
8.6 that:
$$r_{xy}^2 = 1 - \frac{S_{Y-Y'}^2}{S_Y^2} = \frac{S_Y^2 - S_{Y-Y'}^2}{S_Y^2}$$
That is, the squared correlation coefficient, $r_{xy}^2$, represents the reduction in the
proportion of the original variance of Y that is provided by knowledge of X. The
better or more accurately Y is (linearly) predicted by X, the greater the value of
$r_{xy}^2$ and the greater the proportion of the variance of Y “accounted for” by
knowledge of X. For this reason, the square of the correlation coefficient is called
the coefficient of determination. This proportion is, in fact, that proportion of
the original variance of Y that corresponds to the variability of the predicted
values of Y (i.e., the variability of the points along the regression line):
$$S_{Y'}^2 = \frac{\sum_{i=1}^{n}\left(Y'_i - \bar{Y}\right)^2}{n} = S_Y^2\, r_{xy}^2$$
Hence, what has been accomplished by the regression is the partitioning of the
total sum-of-squares, $SS_{tot}$, into two pieces: that which can be “accounted for”
by the regression, $SS_{reg}$, and that which cannot, $SS_{res}$:

$$SS_{tot} = SS_{reg} + SS_{res}$$
The term $1 - r_{xy}^2$ is called the coefficient of non-determination because it
represents the proportion of the original variance of Y that is “unaccounted
for” or not determined by X—that is, not linearly related to X. The positive
square-root of the coefficient of non-determination, as in equation 8.6, is called
the coefficient of alienation because it reflects the “alienation” of the Y
scores from the X scores in unsquared terms (as with $r_{xy}$).
$|r_{xy}|$    $\sqrt{1-r_{xy}^2}$    $1-\sqrt{1-r_{xy}^2}$
0.000     1.000     0.000
0.100     0.995     0.005
0.200     0.980     0.020
0.300     0.954     0.046
0.400     0.917     0.083
0.500     0.866     0.134
0.600     0.800     0.200
0.700     0.714     0.286
0.800     0.600     0.400
0.866     0.500     0.500
0.900     0.436     0.564
0.950     0.312     0.688
0.990     0.141     0.859
0.999     0.045     0.955

Table 8.2: The coefficient of alienation, $\sqrt{1-r_{xy}^2}$, and the proportional improvement in prediction (PIP), $1-\sqrt{1-r_{xy}^2}$, for selected values of $|r_{xy}|$.
Table 8.2 reports the values of both the coefficient of alienation and PIP for
selected values of |rxy |. Note that compared to a correlation of zero, a correlation
of 0.6 produces an error of estimate that is 0.8 or 80% of what it would be
(a reduction of 20% in error of estimate or a 20% improvement in prediction),
whereas a correlation of 0.3 produces an error of estimate that is only 0.95 or
95% of what it would be with a correlation of zero (i.e., only a 5% reduction
in error of estimate or a 5% improvement in prediction). The correlation must
be as high as 0.866 before the error of estimate is reduced by one-half (or the
prediction improved by 50%) relative to a correlation of zero. Also note that
the difference in improvement of prediction between a correlation of 0.7 and a
correlation of 0.9 is approximately the same as that between 0.2 and 0.7. That
is, not only does the prediction improve with larger correlations, but the rate of
improvement also increases, as shown in Figure 8.4.
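Both columns of Table 8.2 are easily recomputed. A minimal Python sketch (ours):

    from math import sqrt

    def alienation(r):
        """Coefficient of alienation: the error of estimate as a fraction of S_Y."""
        return sqrt(1 - r ** 2)

    def pip(r):
        """Proportional improvement in prediction relative to r = 0."""
        return 1 - alienation(r)

    for r in (0.3, 0.6, 0.866):
        print(r, round(alienation(r), 3), round(pip(r), 3))
    # 0.3 -> 0.954 and 0.046; 0.6 -> 0.800 and 0.200; 0.866 -> 0.500 and 0.500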
8.2 Questions
1. The correlation between two tests, A and B, is found to be r = .6. The
standard deviation of test A is 10 and that of test B is 5. If every score of
test A is predicted from the corresponding test B scores by virtue of the
correlation between them, what would be the standard deviation of
(a) the predicted scores (i.e., $S_{Y'}$)?
(b) the residuals (i.e., $S_{Y-Y'}$)?
Figure 8.4: The proportional improvement in prediction (PIP) as a function of
the correlation, $r_{xy}$.
(a) What is the regression equation to predict the midterm score from
your predictive test?
(b) How much of the variability in test scores does your predictive test
account for?
3. Catbert, the evil human resources director, complains that the Wesayso
corporation for which he works fosters mediocrity. His evidence for this
claim is that people whose first performance review upon entering the
company is very poor typically improve on subsequent reviews whereas
people whose first performance review is stellar typically get worse on
subsequent reviews. Explain to Catbert why his conclusion doesn’t follow
from his “evidence”.
4. Explain why the y intercept of the regression line for a set of Z (standard)
scores is always zero.
5. With rxy = .8, Sy = 10, and Sx = 5, if Y scores are predicted from the X
scores on the basis of the correlation between them, what would be the
standard deviation of the predicted scores?
6. Research has shown that the best students in their first year of university
on average do less well at the end of their last year. However, the very
worst students in first year tend on average to do better at the end of their
last year. Allen Scott (a.k.a. the class ne’er-do-well) argues that these data
prove that university hurts the best students, but helps the worst students.
Explain to poor Allen why his conclusion doesn’t necessarily follow. What
assumptions is he making?
7. You collect the following grades in Advanced Egg Admiration (AEA) and
Basket-weaving (BW) for five students who have taken both courses:
8.2. QUESTIONS 77
Student AEA BW
Sandy 54 84
Sam 66 68
Lauren 72 76
Pat 60 92
Chris 63 80
Chapter 9

Partial and Multiple Correlation

Consider the situation of three intercorrelated variables X, Y , and Z,
such that rxy is the correlation between X and Y , rxz is the correlation
between X and Z, and ryz is the correlation between Y and Z. Suppose
the interest is in the correlation between X and Y in the absence of their
correlations with Z. Expressed in terms of standard scores, we can produce
the residual ZX and ZY scores from their respective regression equations with
Z. That is, the residual $Z_X$ scores—the residuals from the values predicted
from $Z_Z$—are given by $Z_X - Z'_X = Z_X - r_{xz} Z_Z$. Similarly, the residual $Z_Y$
scores are given by $Z_Y - Z'_Y = Z_Y - r_{yz} Z_Z$. By definition, these residual
scores are uncorrelated with Z. Hence, the correlation between these residual
standard scores represents the correlation between X and Y in the absence of
any correlation of either variable with Z.
The standard deviation of these residual standard scores is equal to $\sqrt{1 - r_{xz}^2}$
and $\sqrt{1 - r_{yz}^2}$, respectively (see section 8.1.4). So, dividing each of these residuals
$$\begin{aligned}
r_{(xy).z} &= \frac{\frac{1}{n}\sum \left(Z_X - r_{xz}Z_Z\right)\left(Z_Y - r_{yz}Z_Z\right)}{\sqrt{1-r_{xz}^2}\,\sqrt{1-r_{yz}^2}} \\
&= \frac{\frac{1}{n}\sum \left(Z_X Z_Y - Z_X r_{yz} Z_Z - Z_Y r_{xz} Z_Z + Z_Z^2 r_{xz} r_{yz}\right)}{\sqrt{1-r_{xz}^2}\,\sqrt{1-r_{yz}^2}} \\
&= \frac{1}{\sqrt{1-r_{xz}^2}\,\sqrt{1-r_{yz}^2}}\left[\frac{\sum Z_X Z_Y}{n} - r_{yz}\frac{\sum Z_X Z_Z}{n} - r_{xz}\frac{\sum Z_Y Z_Z}{n} + r_{xz} r_{yz}\frac{\sum Z_Z^2}{n}\right] \\
&= \frac{1}{\sqrt{1-r_{xz}^2}\,\sqrt{1-r_{yz}^2}}\left[r_{xy} - r_{xz} r_{yz} - r_{yz} r_{xz} + r_{xz} r_{yz}\right] \\
&= \frac{r_{xy} - r_{xz} r_{yz}}{\sqrt{1-r_{xz}^2}\,\sqrt{1-r_{yz}^2}}
\end{aligned}$$
Table 9.1: Derivation of the expression for the partial correlation between
variables X and Y controlling statistically for the correlation of both with
variable Z.
absence of their individual correlations with Z (see Table 9.1 for the derivation):
$$r_{(sl).t} = \frac{.7 - (.8)(.875)}{\sqrt{1 - .8^2}\,\sqrt{1 - .875^2}} = 0.00$$
For these (hypothetical, but not unrealistic) values, then, there would be no
correlation between smoking and lung cancer death-rate once the common
correlation with time is taken into account. But increases in the death-rate from
lung cancer are unlikely to occur in the same year as the increases in smoking
as lung cancer takes many years to develop, and many more years to reach the
point of death. We would expect the increases in death-rate from lung cancer
to follow by some years the increases in smoking. Suppose, then, that we lag
the death-rate by, say, 20 years relative to the smoking data—that is, compute
the partial correlation between smoking and the corresponding death-rate from
lung cancer 20 years later, again taking the common correlation with time into
account. Suppose in doing so that the simple correlations between smoking
and lagged-time and lung cancer death-rate and lagged-time not unreasonably
remained relatively unchanged at rst = .80 and rlt = .85, respectively, and that
the correlation between smoking and the lung cancer death-rate rose to rsl = .9.
The partial correlation between smoking and lung cancer death-rate lagged by
20 years would be:
$$r_{(sl).t} = \frac{.9 - (.8)(.85)}{\sqrt{1 - .8^2}\,\sqrt{1 - .85^2}} = 0.70$$
indicating a substantial correlation between smoking and the death-rate from
lung cancer 20 years later, even after taking the common correlation with time
into account.
1 To see a provocative discussion using actual data from the years 1888 to 1983 see Peace,
L. R. (1985). A time correlation between cigarette smoking and lung cancer. The Statistician,
34, 371-381.
2 We’re using t (for time) rather than y (for year) to avoid confusing the variables in this
example (s,l,t) with the variables in the general form of the partial and multiple correlation
formulae (x,y,z ).
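The final expression in Table 9.1 is trivial to compute. A Python sketch (ours) applied to the values of this example:

    from math import sqrt

    def partial_r(r_xy, r_xz, r_yz):
        """Partial correlation of X and Y with Z partialled out (Table 9.1)."""
        return (r_xy - r_xz * r_yz) / (sqrt(1 - r_xz ** 2) * sqrt(1 - r_yz ** 2))

    print(partial_r(0.7, 0.8, 0.875))  # unlagged: 0.0
    print(partial_r(0.9, 0.8, 0.85))   # lagged 20 years: roughly 0.70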
Table 9.2: The correlation matrix of the example variables of smoking (cigarette
sales), year, and cancer (lung-cancer death rate 20 years later).
row with all the other variables? Looking at the three rows of variables in Table
9.2, it appears that some variables (rows) are better “accounted for” than others.
For example, the correlations with the remaining variables are generally higher
for the row defined by “cancer” than for the rows defined by either “smoking” or
“year”. It is this difference in total correlation that we are attempting to capture
with the concept of multiple correlation.
for another.
Figure 9.1: The label from one of the brands of beer produced by P. Ballantine
& Sons depicting the overlapping circle logo.
Figure 9.3: The same figure as in Figure 9.2, except shaded to depict the area of
interest.
The problem with just adding the two squared, simple correlations together is
the area of the lung cancer circle that smoking and year jointly overlap (i.e., the
centre polygon in Figure 9.2). The area we want to compute is shown in Figure
9.3.
9.3 Questions
1. Given the following data of 10 cases measured on 4 variables:
2. In your job as a W.H.O. worker you’ve collected quite a bit of data related
to water quality and health in various cities around the world. Your
measures of water quality and health are correlated quite highly—Pearson
r = .75. However, being statistically and methodologically astute, you
realize that wealth (as measured by average income) may be related to
both variables so you correlate income with health and get a Pearson
correlation of r = .5, and you correlate income with water quality and get
a Pearson correlation of r = .60.
(a) Is there any relation between water quality and health after you have
accounted for the effects of income on water quality?
(b) What is the multiple correlation, R, relating health to the independent
variables of water quality and income?
3. For a class project, you collect data from 6 students concerning their sex
(s), scoring females as 0 and males as 1, grade (g) on a recent sadistics (sic)
test, and number of beers (b) consumed by him or her the night before
the test. You compute the following correlations: rgs = −.8325, rbs = .665,
88 CHAPTER 9. PARTIAL AND MULTIPLE CORRELATION
and rgb = −.83. What can you say about the relationship between grade
and beer consumption taking sex into account for both variables?
Part II
Significance
Chapter 10
Introduction to Significance
The origin (and fundamental meaning) of statistical significance and
experimental design is usually traced to an apocryphal story told by
Sir Ronald Aylmer Fisher in his now-classic, 1935 book, The design of
experiments (see Figure 10.1).1 As he tells it, one late, sunny, summer afternoon
in Cambridge, England in the late 1920s, a small group of men and women were
seated outside at a table for afternoon tea. One of the women insisted that the
tea tastes different depending upon whether the milk or the tea is poured into
the cup first.
As the story goes, Fisher, one of the attendees at the tea party, proposed
that they test the woman’s hypothesis. He suggested that they prepare cups
of tea outside of the woman’s sight, one-half with milk first, the remainder
with tea first, and then present the cups to the woman, one at a time, in a
random order, recording each of the woman’s guesses as to cup-type without
feedback. This random assignment of cup-type to ordinal position accomplished
two things. First, it assured that, unless the woman actually had the ability she
claimed to have, her sequence of guesses should not match the actual sequence
of cup-types in whole or in part except by chance alone. Second, the random
assignment provided a basis for computing these chances. Fisher argued that
if the chances that the woman’s sequence of guesses matched in whole or in
part the actual sequence were sufficiently low, the result should be declared
statistically significant, and the experiment to have provided support for the
woman’s hypothesis. His book doesn’t record what the actual results of the
experiment were.
This idea of statistical significance is simple enough. A statement of statistical
significance for some aspect of the data is simply a claim for the presence of
structure supported by some statistical evidence derived from the data. However,
the various techniques developed to provide support for such claims from the
1 The story may not be apocryphal. Salsburg (2001) reports that he heard the story from
H. Fairfield Smith in the late 1960s, who claimed to have been there. According to Salsburg
(2001), Smith also claimed that the woman was perfectly accurate in her guesses.
data are sometimes complicated, are often controversial, and are not necessarily
coherent with one another. Much of the reason for the confusion and controversy
can be traced to the history of the development of the techniques, and their
subsequent evolution. Over time, often very different and sometimes incommen-
surate goals and beliefs have been attached to particular techniques such that it
is now often not entirely clear what the original intent of a particular method
was, or even whether it is capable of meeting both its original and current goals.
Thus, some theorists (the majority) view significance testing as necessarily
tied to probability, random sampling, and estimates of population parameters.
A statement of statistical significance according to this view is necessarily a
statement about one or more populations from which the current data are
(or are not) seen as a random sample. The whole point or purpose of statistical
significance by this view is to make inferences about populations based on
random samples from them. It is these techniques that make up the bulk of
most introductory and more advanced textbooks on the theory and application
of statistics, as well as the bulk of the mathematical discipline of statistics per
se. However, possibly because of their ubiquity, it is also these techniques and
their interpretations that are the source of much of the historical and current
controversy in statistics.
Other theorists, though, have recognised that especially in disciplines such as
psychology, only rarely can the data-sets (never mind the organisms or events
generating them) be viewed as random samples from any population. As a
consequence, they have argued for a more local claim of structure, that of effects
in the current experiment or research setting (see, e.g., Edgington, 1995). The
claim is still probabilistic, and is intimately tied to the notions of independence
and randomisation (as opposed to but not exclusive of random sampling), but
the extension or generality beyond that of the current experiment is not part of
the claim of statistical significance. Support for such claims of generality must
arise from other aspects of the research design (which could include random
sampling) or setting.
Most recently, a very small minority of theorists (e.g., Rouanet, Bernard,
& Lecoutre, 1986; Vokey, 1998) has argued that statistical significance can be
divorced even from probabilistic considerations, as statements merely of the atyp-
icality of the data relative to some specified set of possible, potential or theorized
results. The idea here, unlike the emphases of the two previous approaches (al-
though, viewed simply as techniques, they embody this principle as well), is that
structure is a relative concept. Something isn’t either structured or not in some
absolute sense (viewpoint 1), or only under specially-constructed circumstances
(viewpoint 2), but rather is structured or not only relative to something else—in
this case, some set of other possibilities or alternative outcomes: the claim of
structure by this viewpoint is not that the data are patterned, but that they are
patterned differently. With careful definitions of the set for comparison, this
viewpoint can in fact encompass the previous two as special cases (as we shall
see), but neither is seen as necessary for a meaningful definition of statistical
significance.2
The emphasis in this book, then, is the third approach, although it incorpo-
rates the methods or techniques of each of the others. It does so by considering
all of the methods as merely means of generating the comparison sets through
what are known as null or null hypothesis models as the sole source of structure
in the data at hand. The question, then, becomes one of whether or not any
apparent patterning in the data (e.g., such as the scores in one group being
higher on average than those in another group) could reasonably be attributed
to the null model—that is, whether the current data could reasonably be seen
as typical of the comparison set. If so, the apparent patterning is declared to be
nonsignificant with respect to that model and comparison set. If not, then the
2 This approach also neatly avoids the very thorny (and as yet unresolved) issue of the
definition of a truly random sequence or set, simply by denying the direct need for one. For
example, some people would argue that the decimal expansion of the constant π is a random
sequence because there is no apparent (to this point in its expansion to millions of digits)
patterning of the digits. But others would argue that it is anything but random because simple
algorithms can be used to generate every one of the infinite number of digits (hence, the term
“pseudo-random” to describe such sequences, including those produced by the “random”-number
generators provided with most computer programming languages, which, after all, are just
simple, iterative algorithms). This third viewpoint could incorporate such a definition if one
ever arises as simply another set of outcomes for comparison.
Permutations
A common counting problem is to determine the number of different orders
or permutations in which N different events or objects can be arranged. For
example, if you had 6 family members, you might wish to know in how many
different orders the family members could be arranged in a line for a family
photograph. The answer is that there are 6 ways the first position could be filled,
5 ways (from the remaining 5 family members after the first position is filled)
that the second position could be filled, 4 ways that the third position could be
filled, and so on. Applying the fundamental counting rule, those values yield a
total of $6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 = 720$ different orders.
In general, N events can be permuted in N! (read “N-factorial”) ways.3 That
is, the number of ways N events can be permuted taken in groups of N at a
time is:

$$P_N^N = N!$$

where

$$N! = N(N-1)\cdots(N-(N-1))$$

N-factorial is perhaps more easily defined recursively. Defining 0! = 1, then
$N! = N(N-1)!$. For example,

$$6! = 6 \cdot 5! = 6 \cdot 5 \cdot 4! = \cdots = 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 \cdot 0! = 720$$
Combinations
Suppose you were interested in how many different photographs could be taken
of the 6 family members if only 3 were to be included each time and the order of
the three within the photograph was of no concern. That is, how many different
sets or combinations of 3 family members could be made from the 6 family
3 It is fun to think that the symbol for factorial numbers, the exclamation mark, was
chosen to reflect the fact that these numbers get large surprisingly quickly — a mathematical
witticism.
members? Note that for every set of 3 family members, there are 3! ways of
permuting them. We are concerned only with the number of different sets, not
the ordering within the sets. Thus, as $P_r^N$ yields the total number of different
permutations, and r! is the number of permutations per set of r things, then the
number of different sets or combinations is given by:

$$C_r^N = \frac{P_r^N}{r!}$$
For our example case, then, the number of different sets of 3 family members that
can be made from 6 family members is the number of permutations, 6!/3! = 120,
divided by the number of permutations of any one of those sets, 3! = 6, yielding
120/6 = 20 different sets.
Mathematicians commonly use a different symbol for combinations. Coupling
that symbol with the expression for $P_r^N$ yields the following general equation
for the number of combinations of N things taken r at a time (read as “N choose
r”):

$$\binom{N}{r} = \frac{N!}{r!(N-r)!}$$
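All of these counts are easy to verify with a few lines of Python (a sketch of ours using the standard-library factorial):

    from math import factorial

    def permutations(n, r):
        """Ordered arrangements of n things taken r at a time."""
        return factorial(n) // factorial(n - r)

    def combinations(n, r):
        """N choose r: unordered sets of r things chosen from n."""
        return factorial(n) // (factorial(r) * factorial(n - r))

    print(permutations(6, 6))  # 720 orderings of all 6 family members
    print(permutations(6, 3))  # 120 ordered photographs of 3 of the 6
    print(combinations(6, 3))  # 20 distinct sets of 3 family members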
evidence that the language clearly needed it. It is meant to be used disparagingly.
5 Fisher never specified a specific criterion for statistical significance, apparently believing
that the precise value should be left up to the judgement of the individual scientist for any
given comparison. Some subsequent theorists believe this fluidity to be anathema, and demand
that the criterion for significance not only be specified, but specified in advance of collecting
the data!
Table 10.1: A sampling of the 70 possible exchanges of the male and female
scores.
actually obtained between the two groups (i.e., 6.25 for the males minus 2.75 for
the females = 3.5) should not be all that different from what we would generally
obtain by exchanging scores between the sexes. As before, there are 70 different
ways6 we can exchange the scores between the two groups.7 A few examples
of these exchanges, and the resulting means and mean differences are shown in
Table 10.1.
A histogram of the distribution of the mean differences from the complete set
of 70 exchanges is shown in Figure 10.2. Of these, only the original distribution
of scores and one other (the one that exchanges the high female for the low
male) produce a mean difference as large as or larger than the obtained difference—
a proportion of only 2/70 = 0.028571. Hence, we would declare the result
significant; the mean difference obtained from this distribution is atypical of
those that result from exchanging male and female scores. Said differently, the
male and female scores appear not to be exchangeable, at least with respect to
means.
The set or population of patterns to which we wish to compare our particular
result can be generated in innumerable ways, depending on how we wish to
characterise the process with respect to which we wish to declare our result
typical or atypical. In the next chapter, we consider a theoretical distribution of
patterns of binary events that can be adapted to many real-world situations.
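The exchange procedure itself is easy to program. The actual scores behind Table 10.1 are not reproduced in this excerpt, so the Python sketch below uses hypothetical scores chosen only to match the reported group means of 6.25 and 2.75; with them it recovers the proportion 2/70:

    from itertools import combinations

    males = [8, 7, 6, 4]    # hypothetical scores, mean 6.25
    females = [5, 3, 2, 1]  # hypothetical scores, mean 2.75
    observed = sum(males) / 4 - sum(females) / 4  # 3.5

    pooled = males + females
    as_large, total = 0, 0
    # Each choice of 4 of the 8 scores to form the "male" group is one of
    # the 70 unique exchanges (including no exchange at all).
    for group in combinations(range(8), 4):
        m_sum = sum(pooled[i] for i in group)
        f_sum = sum(pooled) - m_sum
        total += 1
        if m_sum / 4 - f_sum / 4 >= observed:
            as_large += 1
    print(as_large, total, as_large / total)  # 2 70 0.028571...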
10.2 Questions
1. From 8 positive and 6 negative numbers, 4 numbers are chosen without
replacement and multiplied. What proportion of all sets of 4 numbers
6 This number represents the number of unique exchanges—including none at all—of scores
between each of the 4 males and each of the 4 females. Usually, a computer program is used
to determine the exchanges; hence the need to await easy access to computers before this
fundamental method could become a practical approach to statistics.
7 We could also exchange scores within the groups, but that would not affect the group
means, and, hence, would not influence the difference between the means—our statistic of
choice here, nor would it change the proportion of scores equal to or greater than our obtained
difference.
Figure 10.2: The frequency distribution of the 70 mean differences between the
male and female sets of scores produced from the possible 70 unique exchanges
of male and female scores.
(a) will the die show a number less than 3 OR the card is from a red
suit?
(b) will the die show a number less than 3 AND the card is from the
diamond suit?
3. Two dice are rolled. Given that the two faces show different values, what
proportion of all such rolls will one of the faces be a 3?
6. Ten people in a room are wearing badges marked 1 through 10. Three
people are chosen to leave the room simultaneously. Their badge number
is noted. What proportion of such 3 people selections would
8. Each of two people tosses three coins. Of all the results possible, what
proportion has the two people obtaining the same number of heads?
Chapter 11

The Binomial Distribution

Consider what is known as a binary or binomial situation: an event either
is or has property X or it isn’t or doesn’t; no other possibilities exist, the
two possibilities are mutually exclusive and exhaustive. Such distinctions
are also known as contradictories: event 1 is the contradiction of event 2, not
just its contrary. For example, blue and not-blue are contradictories, but blue
and orange are merely contraries.
In a set of these events, suppose further that X makes up the proportion
or has the relative frequency p of the set, and not-X makes up the remaining
proportion or has relative frequency 1 − p = q of the set. Clearly, then,
p+q =1
So, if we selected one of these events at random (i.e., without respect to the
distinction in question) then we would expect to see the p-event 100p% of the
time, and the q-event 100q% of the time—that is, this result would be our
expectation if we repeated this selection over and over again with replacement.
With replacement means that we return the event just selected to the pool before
we select again, or more generally that having selected an event has no effect
on the subsequent likelihood of selecting that event again—the selections are
independent.
Now, what if we selected 2 of these events at random with replacement,
and repeated this selection of two events over many trials? Clearly, there are 3
possibilities for any such selection trial. We could observe 2 p-events, 1 p-event
and 1 q-event, or 2 q-events. At what rates would we expect to observe each of
these possibilities? If we square (p + q) to represent the two selections per trial,
we get:
(p + q)2 = p2 + 2pq + q 2 = 1
This expression may be rewritten as:

$$(p+q)^2 = \binom{2}{2}p^2q^0 + \binom{2}{1}p^1q^1 + \binom{2}{0}p^0q^2 = 1$$
Note that it conveniently partitions the total probability (of 1.0) into three
summed terms, each having the same form. Before discussing this form further,
let’s see what happens if we cube (p + q), representing 3 selections per trial:

$$(p+q)^3 = p^3 + 3p^2q + 3pq^2 + q^3 = 1$$

And so on.
Each term in each of these expansions has the following form:

$$\binom{N}{r} p^r q^{N-r}$$

where $\binom{N}{r}$ gives the number of ways that r p-events and N − r q-events can
happen if there are a total of N such events, and $p^r q^{N-r}$ provides the
probability of that particular combination over the N
events. For example, with 3 selections per trial, obtaining 3 p-events (and, hence,
0 q-events) can occur only 1 way, with the probability of that way of $ppp = p^3q^0$,
whereas 2 p-events and 1 q-event can occur 3 ways (i.e., ppq, pqp, and qpp), with
the probability of any one of those ways of $ppq = p^2q^1$. Thus, the probability of
observing exactly 2 p-events in 3 independent selections is $3p^2q^1$.
Consider now a population (or relative frequency or probability) distribution,
X, consisting of two values: X = 1 with probability p, and X = 0 with
probability q = 1 − p. Suppose we construct the statistic $Z = \sum_{i=1}^{N} X$, sampling
independently from X, N times; the statistic Z is just the frequency of X = 1 for
a given trial of N selections, which can range from 0 to N . What is the sampling
distribution of Z? That is, how are the different values of Z distributed? From
the foregoing, the relative frequency of any particular value, r, of Z (0 ≤ Z ≤ N )
is given by:
$$p(Z = r) = \binom{N}{r} p^r q^{N-r} \qquad (11.1)$$
An example of a distribution of all 11 (0 ≤ Z ≤ 10) such values of Z for N = 10
and p = .5 is shown in Table 11.1 and Figure 11.1. Such distributions, based
on equation 11.1 are called binomial distributions. A binomial distribution is
a model of what is known as a Bernoulli process: two mutually exclusive and
exhaustive events with independent trials.
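Equation 11.1 is straightforward to compute. A Python sketch (ours) that reproduces the first row of Table 11.1, along with the tail relative frequency used below:

    from math import comb

    def binomial_pmf(r, n, p):
        """Equation 11.1: relative frequency of Z = r in n independent trials."""
        return comb(n, r) * p ** r * (1 - p) ** (n - r)

    # The first row of Table 11.1 (N = 10, p = .5):
    print([round(binomial_pmf(r, 10, 0.5), 3) for r in range(11)])

    # The relative frequency of 9 or more heads in 10 tosses:
    print(binomial_pmf(9, 10, 0.5) + binomial_pmf(10, 10, 0.5))  # about .011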
Consider the following simple application of the binomial distribution. Normal
coins, when tossed, are often supposed to be fair—meaning that the coin (via
the tossing procedure) is just as likely to come up heads as tails on any given
toss, and coming up heads or tails on any given toss in no way influences whether
or not the coin comes up heads or tails on the next toss.1 We can use the
binomial distribution shown in Table 11.1 as a model of such a fair coin tossed
10 times. That is, the distribution shown in Table 11.1 and Figure 11.1 provides
the relative frequency of observing any given number of heads in 10 tosses of a
fair coin.
1 It is also assumed that the coin can only come up heads or tails (e.g., it can’t land on edge),
and only one of the two possibilities can occur on any given toss, i.e., a Bernoulli process.
Value of Z:            0     1     2     3     4     5     6     7     8     9     10
relative frequency:  .001  .010  .044  .117  .205  .246  .205  .117  .044  .010  .001
cumulative (up):     .001  .011  .055  .172  .377  .623  .828  .945  .989  .999     1
cumulative (down):      1  .999  .989  .945  .828  .623  .377  .172  .055  .011  .001

Table 11.1: The binomial distribution (X = {0, 1}) of $Z = \sum X$ for N = 10
trials, p = .5. Each column provides the relative frequency, the cumulative
relative frequency (up), and the cumulative relative frequency (down) for the
respective values of Z from 0 to 10. The relative frequencies (i.e., the values in
the first row) are plotted in Figure 11.1.
Suppose, then, that we observe 9 heads in 10 tosses of the coin, and we ask:
Is the coin fair? According to our model of a fair coin, 9 or more2 heads should
occur less than 1.1% of the time (see Table 11.1). Now, we can accept one of
two alternatives: Either we have just observed one out of a set of events whose
combined relative frequency of occurrence is only 1.1%, or there is something
wrong with our model of the coin. If 1.1% strikes us as sufficiently improbable or
unlikely, then we reject the model, and conclude that something other than our
model is going on. These “other things” could be any one or more of an infinite
set of possibilities. For example, maybe the model is mostly correct, except that
the coin (or tossing procedure) in question is biased to come up heads much
more frequently than tails. Or maybe the trials aren’t really independent: maybe
having once come up heads, the coin (or the tossing procedure) just continues
to generate heads. And so on.
impressive, we should be even more impressed by even more extreme possibilities, such as 10
heads in 10 tosses. Thus, we want the combined relative frequency of the whole set of results
that we would find impressive should they occur.
Figure 11.1: The binomial distribution of $Z = \sum X$ for N = 10 trials, p = .5
(relative frequency plotted as a function of the value of Z; see Table 11.1).
one or the other tail of the distribution, such as the investigation of a directional
prediction (i.e., we were concerned that the coin was biased in favour of heads;
hence, either alternative, being unbiased—the model—or biased against heads
would suffice to disconfirm the prediction), we would probably prefer to use the
one-tailed test.
A related error is failing to reject the model when we should; that is, failing to
reject the model when it is not, in fact, a valid description of the process that
gave rise to the data. As this is the second of the errors we can make, it is
referred to as a Type II error, and the symbol for the probability of making a
Type II error is the second letter of the Greek alphabet, β. We can know even
less about this probability than we can about alpha because there are myriad,
often infinite, ways in which the model could be invalid, ranging from it being
almost correct (e.g., the model is accurate except that p actually equals .6 not
.5), to it being wrong in every way one could imagine.
significance tests. Indeed, in the absence of random sampling from said populations, significance
tests in general could not meaningfully be so characterized.
[Four-panel figure: binomial distributions plotted as relative frequency against
the number of p-events—two panels for n = 10 (p = .5 and p = 1/6) and two
panels for n = 100 (p = .5 and p = 1/6).]
these are not absolute probabilities: they are still conditional on the assumptions’
being a valid description of the process that generated the data.
In psychology and most other biological and life sciences it is rare that
our theories are so well-specified that the statistical models and values of the
parameters can be directly derived from them. Typically, we get around this
problem by either (1) adapting our data collection process to match as closely as
possible that of the null model (e.g., random assignment), or (2) using models
and techniques that are as general as possible (e.g., random sampling and the
testing of parameters that are likely to be distributed in predictable ways), or
(3) typically both. These approaches are the subject of the next chapter.
11.1 Questions
1. Listed below are the distances in centimetres from the bulls-eye for inebri-
ated bar patrons’ first and second attempts to hit the centre of the dart
board. Use the sign-test to test (at the α = .05 level) the hypothesis that
accuracy improved over the two attempts:
Throws
patron First Second
a 5 1
b 8 9
c 2.5 1.3
d 6.8 4.2
e 5.2 2
f 6.7 4
g 2.2 1.1
h .25 .1
i 4.3 4.2
j 3 2
2. A coin is tossed 12 times, and 10 heads are observed. Based on the evidence,
is it likely to be a fair coin (i.e., where “fair coin” includes the tossing
process)?
3. At the University of Deathbridge, the students’ residence is infested with
Bugus Vulgaris. These nasty insects come in two sizes: large and extra-large.
In recent years, the extra-large appear to have gained the upper hand, and
out-number the large by a ratio of 3:1. Eight Bugus Vulgaris have just
walked across some poor student’s residence floor (carrying off most of
the food and at least one of the roommates, in the process). Assuming
(not unreasonably) that the number of Bugus Vulgaris in the residences
borders on infinite, what is the probability that no more than two of the
eight currently crossing the residence floor are extra-large?
4. The probability that Mike will sleep in and be late for class is known to
be (from previous years and courses) p = .5. Lately, though, you’ve begun
to wonder whether Mike’s being late for this particular class involves more
than simply the random variation expected. You note that the number of
times Mike was late over the last 10 classes was 9. Assuming that Mike
either is or isn’t late for class and that his attempts to get to class on time
are independent, do you have reason to suspect that something is going on
beyond the normal random variation expected of Mike? Explain why or
why not.
5. The probability of getting a “B” or better in sadistics (sic) is only .40
(that’s why it’s called sadistics). Eight sadistics students are selected at
random. What is the probability that more than 2/3 of them fail to get a
“B” or better in sadistics?
6. At the U of L, 20% of the students have taken one or more courses in
statistics.
(a) If you randomly sample 10 U of L students, what is the probability
that your sample will contain more than 2 students who have taken
statistics?
Chapter 12

The Central Limit Theorem

As we’ve seen, the binomial distribution is an easy-to-use theoretical
distribution to which many real-world questions may be adapted. But
as the number of trials or cases gets larger, the computations associated
with calculating the requisite tail probabilities become increasingly burdensome.
This problem was recognised as early as 1733 by Abraham De Moivre, who asked
what happens to the binomial distribution as n gets larger and larger. He noted
that as n increases the binomial distribution gets increasingly smooth, symmetric
(even for values of p 6= .5), and “bell-shaped”. He found that “in the limit”, as n
goes to infinity, the binomial distribution converges on a continuous distribution
now referred to as the “normal distribution”.1 That is, as you sum over more
and more independent Bernoulli events, the resulting distribution is more and
more closely approximated by the normal distribution.
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (12.1)$$
The normal distribution is often also referred to as the “Gaussian distribution”, after Carl Friedrich Gauss who along with Laplace and Legendre in the early 1800s
recognised its more general nature. In fact, the naming of this distribution is a prime example
of Stigler’s Law of Eponymy (Stigler, 1999), which states that no discovery in math (or science)
is named after its true discoverer (a principle that even applies, recursively, to Stigler’s Law!).
Figure 12.1: The standard normal curve with µ = 0 and σ = 1. φ(x) is plotted
as a function of standard deviation units from the mean.
Table 12.1: Proportions of area under the standard normal curve between the mean and z (“µ to z”) and beyond z.

z      µ to z   beyond z       z      µ to z   beyond z
0 0 .5 .4 .1554 .3446
.01 .004 .496 .41 .1591 .3409
.02 .008 .492 .42 .1628 .3372
.03 .012 .488 .43 .1664 .3336
.04 .016 .484 .44 .17 .33
.05 .0199 .4801 .45 .1736 .3264
.06 .0239 .4761 .46 .1772 .3228
.07 .0279 .4721 .47 .1808 .3192
.08 .0319 .4681 .48 .1844 .3156
.09 .0359 .4641 .49 .1879 .3121
.1 .0398 .4602 .5 .1915 .3085
.11 .0438 .4562 .51 .195 .305
.12 .0478 .4522 .52 .1985 .3015
.13 .0517 .4483 .53 .2019 .2981
.14 .0557 .4443 .54 .2054 .2946
.15 .0596 .4404 .55 .2088 .2912
.16 .0636 .4364 .56 .2123 .2877
.17 .0675 .4325 .57 .2157 .2843
.18 .0714 .4286 .58 .219 .281
.19 .0753 .4247 .59 .2224 .2776
.2 .0793 .4207 .6 .2257 .2743
.21 .0832 .4168 .61 .2291 .2709
.22 .0871 .4129 .62 .2324 .2676
.23 .091 .409 .63 .2357 .2643
.24 .0948 .4052 .64 .2389 .2611
.25 .0987 .4013 .65 .2422 .2578
.26 .1026 .3974 .66 .2454 .2546
.27 .1064 .3936 .67 .2486 .2514
.28 .1103 .3897 .68 .2517 .2483
.29 .1141 .3859 .69 .2549 .2451
.3 .1179 .3821 .7 .258 .242
.31 .1217 .3783 .71 .2611 .2389
.32 .1255 .3745 .72 .2642 .2358
.33 .1293 .3707 .73 .2673 .2327
.34 .1331 .3669 .74 .2704 .2296
.35 .1368 .3632 .75 .2734 .2266
.36 .1406 .3594 .76 .2764 .2236
.37 .1443 .3557 .77 .2794 .2206
.38 .148 .352 .78 .2823 .2177
.39 .1517 .3483 .79 .2852 .2148
Table 12.1 also can be used to determine the probabilities for ranges not
explicitly listed in the table. For example, what proportion of the scores from a
normal distribution are between 1 and 2 standard deviations from the mean?
As 47.72% of the scores are between the mean and 2 standard deviations from
the mean, and 34.13% are between the mean and 1 standard deviation, then
47.72 − 34.13 = 13.59% of the scores are between 1 and 2 standard deviations
from the mean.
1. For 0 ≤ z ≤ 2.2, the proportion of area between the mean and z is given
as z(4.4 − z)/10.
For example, for z = 1, the first rule is applied, yielding (1)(4.4 − 1)/10 = .34, as
compared to the .3413 of Table 12.1. According to Shah, comparing the simple
approximation values to the actual areas yielded a maximum absolute error of
.0052, or roughly only 1/2%.
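A Python sketch (ours) comparing Shah's rule with the exact area computed from the error function:

    from math import erf, sqrt

    def shah_area(z):
        """Shah's rule for 0 <= z <= 2.2: area between the mean and z."""
        return z * (4.4 - z) / 10

    def exact_area(z):
        """Exact area between the mean and z for the standard normal curve."""
        return 0.5 * erf(z / sqrt(2))

    for z in (0.5, 1.0, 2.0):
        print(z, round(shah_area(z), 4), round(exact_area(z), 4))
    # e.g., z = 1 yields .34 versus the tabled .3413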
a computer software package. It is not known what particular approximation algorithm is used
by this software.
Figure 12.2: The binomial distribution for n = 36, p = .5 overlaid with a normal
approximation with µ = 18, σ = 3.
more events we sum over, the larger the variability will be. Thus, the variance
of the sum should be a direct function of n and an inverse function of p > .5, or
np(1 − p), and the standard deviation will be $\sqrt{np(1-p)}$.
Thus, to convert any value from a binomial distribution to a standard score:
$$z = \frac{x - np}{\sqrt{np(1-p)}}$$
the 9 scores, $\bar{X}$, would be greater than 54? Means are just sums corrected for n,
so because µ and σ are the same for each score (i.e., independent trial), equation
12.3 could be re-written in terms of means as:

$$Z_{\bar{X}} = \frac{\bar{X} - \mu}{\sqrt{\dfrac{\sigma^2}{n}}} \qquad (12.4)$$
5 Yes, 12, for any and all uniform distributions; the reason for the denominator of 12 is left
$$Z_{\bar{X}} = \frac{22 - 20}{\dfrac{5.77}{\sqrt{36}}} = 2.08$$
within 1.96 standard errors of the sample mean 95% of the time. That is,

$$\bar{X} - 1.96\sigma_{\bar{X}} \le \mu \le \bar{X} + 1.96\sigma_{\bar{X}}$$
is the 95% confidence interval. These are the same intervals that accompany
election and other poll results in which you are told, for example, that the “poll
percentages are reliable plus or minus 3% 19 times out of 20”. For the current
example, the 95% confidence interval is
5.77 5.77
−1.96 √ + 22 ≤ µ ≤ 1.96 √ + 22 = 20.12 ≤ µ ≤ 22.96
36 36
(Note that the lower bound is greater than our hypothesised mean, which we
already knew because we computed the sample mean of 22 to be significantly
different from a population mean of 20.)
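A Python sketch (ours) of the computations for this example, reproducing both the Z of 2.08 and the 95% confidence interval:

    from math import sqrt

    n, sigma, sample_mean, mu0 = 36, 5.77, 22, 20
    se = sigma / sqrt(n)            # standard error of the mean
    z = (sample_mean - mu0) / se
    print(round(z, 2))              # 2.08

    lower = sample_mean - 1.96 * se  # 95% confidence interval
    upper = sample_mean + 1.96 * se
    print(round(lower, 2), round(upper, 2))  # 20.12 and 23.88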
Neyman referred to these ranges as “confidence intervals” because it is not at
all clear exactly what the 95% actually refers to. From a frequentist probabilist
perspective it can be interpreted to mean that if you repeated the sampling many,
many times, all other things being equal, 95% of those times the interval would
encompass the actual value of µ (given, again, that the model is true except for
the parameter, µ). That is, you are “95% confident” that the interval overlaps
the actual value of the population parameter. But that isn’t saying much; in
particular, it doesn’t say that there is a 95% probability that the population
mean falls within the computed bounds. And, despite its many proponents who
argue for confidence intervals as a superior replacement for Fisherian significance
tests, it really says no more than the original significance test did. You can, of
course, compute 99% confidence intervals, and 99.9% intervals (and 90% and
80% intervals), but they provide no more information (just in a different form)
than would the significance test at the corresponding alpha-level. Fisher, for
one, never bought into them. And, inductively, they add nothing to what the
significance test already provided: if the significance test was significant, then
the corresponding confidence interval will not overlap the hypothesised mean;
otherwise, it will.
Subsequent chapters present similarly-constructed inferential statistics, all of
which can equally, using exactly the same logic, have confidence intervals com-
puted for the putative “tested” parameter. If the conditions are appropriate for
a significance test, then the computed confidence intervals are merely redundant.
If more is assumed (or obtained), such as true random sampling (i.e., so that
the random sampling serves a role beyond that of providing the independence
assumed by significance testing), then confidence intervals can serve as valu-
able tools in the evaluation of estimates of population parameters from sample
statistics. However, in that case the goals are fundamentally different.
12.4 Questions
1. Thirty-six scores, randomly-sampled from a population of scores with
σ = 18, are found to have a mean of 10. What is the probability that the
population mean (µ) is less than 7?
2. You are given a coin and told that either it is a fair coin or that it is a coin
biased to come up “heads” 80% of the time. To figure out which it is, you
toss the coin 36 times and count the number of heads. Using the notion
that the coin is fair as your null hypothesis, and setting α = .10,
3. Despite the fact that grades are known not to be normally distributed,
comparisons of mean grades (say, between one class and another) often
are done using the normal distribution as the theoretical (null hypothesis)
model of how such means would be distributed. Explain this apparent
contradiction.
4. A coin is tossed 64 times and 28 heads are observed. Based on the
evidence, is it likely (say, α = .05) to be a fair coin?
6. A local dairy sells butter in 454 gram (1 pound, for those wedded to
imperial measures) packages. As a concerned consumer, you randomly
sample 36 of these packages (all that your research budget can afford),
record their weights as measured on a home scale, and compute a mean
of 450 grams. As this is below the stated weight, you contact the dairy
and are informed by a spokesperson that your mean represents simply
the typical variability associated with the manufacturing process. The
spokesperson notes further that the standard deviation for all packages of
butter produced by the dairy is 12 grams. Should you believe the claims
of the spokesperson? Show why or why not.
8. When there is no fire, the number of smoke particles in the air is distributed
normally with a mean of 500 parts per billion (ppb) and a standard
deviation of 200 ppb. In the first minute of a fire, the number of smoke
particles is likewise normally distributed, but with a mean of 550 ppb and
a standard deviation of 100 ppb. For a particular smoke detector:
(a) what criterion will detect 88.49% of the fires in the first minute?
(b) what will be the false-alarm rate for the criterion in a.?
(c) with the criterion set to produce only 6.68% false-alarms, what will
be the probability of a type II error?
9. With some obvious exceptions (hee hee!), women’s hair at the U. of L. is
generally longer than men’s. Assume that the length of women’s hair at
the U. of L. is normally distributed with µ = 44 and σ = 8 cm. For men
at the U. of L., assume that the length of hair is also normally-distributed,
but with µ = 34 and σ = 16 cm. Four heads of hair are sampled at random
from one of the two sexes and the mean length is calculated. Using a
decision criterion of 40 cm for discriminating from which of the two sexes
the hair sample was drawn,
(a) what is the probability that an all male sample will be correctly
labeled as such?
(b) what is the probability that an all female sample will be mislabeled ?
10. Although there probably are some students who do actually like sadistics
(sic), common sense says that in the general population, the proportion is
at most 1/3. Of 36 of Vokey and Allen’s former sadistics students, 16 claim
to like sadistics. Test the hypothesis that these 36 people are a random
sample of the general population. Be sure to articulate the null hypothesis,
and any assumptions necessary for drawing the conclusion in question.
11. From a set of scores uniformly distributed between 0.0 and $\sqrt{12}(8)$, 64
scores are sampled at random. What is the probability that the mean of
the sample is greater than 12?
Chapter 13
The χ² Distribution
Chapter 12 introduced the central limit theorem, and discussed its use
in providing for the normal approximation to the binomial distribution.
This chapter introduces another application of both the theorem and the
normal approximation, enroute to another continuous, theoretical distribution
known as the chi (pronounced ky—rhymes with “sky”) square distribution,
symbolized as χ2 .
Consider again the binomial situation of Chapter 12, only this time imagine
that we were examining the results of tossing a coin, say, 64 times. According to
our model of a fair coin (and tossing procedure), we would expect µ = .5(64) = 32
heads, on average. Suppose, however, that we observed only x = 24 heads. Is
this result sufficient evidence to reject our model of a fair coin?
Converting the observed frequency of 24 to a standardized score as in Chapter
12 yields:

z = \frac{x - \mu}{\sqrt{n(p)(1 - p)}} = \frac{24 - 32}{\sqrt{64(.5)(1 - .5)}} = -2    (13.1)

Using the normal approximation to the binomial, observing 24 heads (or fewer)
in 64 tosses of a fair coin should occur less than the requisite 5% of the time
(2.28% to be exact), so we should reject the model.
An alternative approach is to take the square of the difference between what was
observed and what was expected (i.e., theorised) over what was expected for each
of the two possible outcomes, and then sum them. That is, where O_i and E_i
refer to observed and expected frequencies, respectively:

z^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} = \frac{(24 - 32)^2}{32} + \frac{(40 - 32)^2}{32} = \frac{64}{32} + \frac{64}{32} = 4    (13.2)
Note that z² is a measure of how well the expected frequencies, E₁ and E₂, fit
the observed frequencies, O₁ and O₂. The closer the expected frequencies are
to the obtained frequencies, the smaller z² will be, reaching zero when they are
equal. Hence, if we knew the distribution of the statistic z² we could compare
our computed value to that distribution, and, hence, determine the likelihood or
relative frequency of such a result (or a more extreme result), assuming that the
obtained value of z² was just a random value from that distribution.
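As a check on the arithmetic, here is a minimal sketch in Python (SciPy is assumed to be available; the code is ours, not part of the original text) of equations 13.1 and 13.2 for the coin example:

    # Verify that the z of equation 13.1 and the z^2 of equation 13.2
    # are arithmetically related: z_squared == z ** 2.
    from scipy import stats

    n, p = 64, 0.5             # tosses, and the fair-coin model
    observed = [24, 40]        # heads, tails
    expected = [n * p, n * (1 - p)]

    z = (observed[0] - expected[0]) / (n * p * (1 - p)) ** 0.5
    z_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    print(z, z_squared)        # -2.0 and 4.0
    print(stats.norm.cdf(z))   # about 0.0228, the 2.28% in the text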
There is a family of theoretical distributions, called the chi square distributions,
that are defined as the sum of independent, squared scores from standard
normal distributions for n > 1, and as a squared normal distribution if only one
is involved. That is, where x_i is a score from any normal distribution, and n
indexes the number of independent scores summed over, the corresponding value
from the requisite χ² distribution is given as:

\chi^2_n = \sum_{i=1}^{n} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2    (13.3)
Shown in Figure 13.1 are four chi square distributions from the family of
such distributions. Chi square distributions are distinguished by the number of
independent, squared, standard normal distributions summed over to produce
them. This parameter is referred to as the number of "degrees of freedom"—
independent sources—that went into producing the distribution. This concept
of "degrees of freedom" will become clearer subsequently (and, not so incidentally,
was the source of one of Karl Pearson's biggest failures, and the principal source
of his life-long feud with Fisher).
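A small simulation can make the definition in equation 13.3 concrete. The following sketch (NumPy assumed; the seed and sample sizes are arbitrary choices of ours) sums squared standard-normal scores to approximate a χ² distribution:

    # Summing n squared standard-normal scores produces a chi square
    # distribution with n degrees of freedom.
    import numpy as np

    rng = np.random.default_rng(1)
    n_df, n_samples = 4, 100_000
    chi2_values = (rng.standard_normal((n_samples, n_df)) ** 2).sum(axis=1)

    print(chi2_values.mean())            # near 4: the mean of chi square is its df
    print((chi2_values >= 9.49).mean())  # near .05, matching Table 13.1 for df = 4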
As surprising as it may appear, the computation of z² in equation 13.2 is
arithmetically just an alternative method of computing z in equation 13.1, except
that it computes the square of z (note for our example values that 4 is the square
of −2, and this will be the case for all frequencies of O_i and E_i and, hence, z
and z²). Now, if it is reasonable to invoke the normal approximation to the
binomial for our example in the application of equation 13.1, then it is equally
reasonable to assume it in the arithmetically related case of z² in equation 13.2;
that is, that z² is the corresponding square of a score from a standard normal
distribution—which is to say, a value from the χ² distribution with one degree
of freedom.
[Figure 13.1: Four chi square distributions, plotted as relative frequency against the value of chi square.]
             Level of α
df    0.25   0.2    0.15   0.1    0.05   0.02   0.01   .001    .0001
1     1.32   1.64   2.07   2.71   3.84   5.41   6.63   10.83   15.14
2     2.77   3.22   3.79   4.61   5.99   7.82   9.21   13.82   18.42
3     4.11   4.64   5.32   6.25   7.81   9.84   11.34  16.27   21.11
4     5.39   5.99   6.74   7.78   9.49   11.67  13.28  18.47   23.51
5     6.63   7.29   8.12   9.24   11.07  13.39  15.09  20.52   25.74

Table 13.1: Critical values of the χ² distribution as a function of degrees of freedom (df) and level of α.
Party:     Liberal   NDP   Conservative   Green   Other
           20 (30)   20 (9)   10 (9)      5 (6)   5 (6)

Table 13.2: The observed and expected (in parentheses) frequencies of national
political party preferences for the example multinomial situation discussed in the
text.

Suppose that, when polled, 20 of the 60 students in a statistics class preferred
the Liberals, 20 the NDP, 10 the Conservatives, 5 the Greens, and 5 some other
party, whereas a recent national poll reported national party preferences
of 50%, 15%, 15%, 10%, and 10%, respectively. Are the results of the statistics
class significantly different from that expected on the basis of the national poll?
According to the national poll, 50% or 30 of the 60 statistics students would
be expected to have preferred the Liberals, 9 to have preferred the NDP, 9
to have preferred the Conservatives, 6 to have preferred the Greens, and the
remaining 6 to have preferred Other. These observed and expected values are
shown in Table 13.2. Extending equation 13.2, where k refers to the number of
cells, we get:
z^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}    (13.4)
    = \frac{(20 - 30)^2}{30} + \frac{(20 - 9)^2}{9} + \frac{(10 - 9)^2}{9} + \frac{(5 - 6)^2}{6} + \frac{(5 - 6)^2}{6}
    = 17.22
For the 5 cells of observed frequencies in Table 13.2 (with their total), only 4
contribute independent information to the computation of z², because if you know
the frequencies in any 4 of the cells, the fifth can be determined by subtraction
from the total. This example demonstrates the meaning of degrees of freedom:
in a set of 5 frequencies only 4 are free to vary because the fifth is fixed by the
total frequency. In general, with n frequencies, only n − 1 are free to vary. Hence,
for such tables, there are n − 1 degrees of freedom. So, entering Table 13.1 with
5 − 1 = 4 df indicates that obtaining a z² value of 17.22 or larger would occur
somewhere between 0.1% and 1% of the time if the observed frequencies were
drawn at random from a model with those expected frequencies. Because the
p-value is less than .05, the model is rejected, and we conclude that the results
from the statistics class are significantly different from what would be expected
on the basis of the national poll.
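For readers with access to Python, the goodness-of-fit computation can be verified directly; scipy.stats.chisquare is one standard implementation (a sketch, not part of the original text):

    # Goodness-of-fit test for the party-preference example of equation 13.4.
    from scipy.stats import chisquare

    observed = [20, 20, 10, 5, 5]
    expected = [30, 9, 9, 6, 6]   # 50%, 15%, 15%, 10%, 10% of 60 students

    result = chisquare(observed, f_exp=expected)
    print(result.statistic, result.pvalue)   # 17.22, with p between .001 and .01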
Despite the fact that the multinomial situation is not binomial, use of the χ²
distribution still requires the same conditions to be met for it to be a meaningful
model: the categories must be exhaustive, mutually exclusive (any given case
can fall into one and only one category), and assignment of a case to any given
category must occur independently of the assignment of any other case to its
category—this last requirement to ensure the independence required of the
normal approximation. Thus, for example, combining the 10 responses from
you and from each of your friends on a recent multiple-choice test to assess, say,
whether the distribution of responses into correct and incorrect categories was
significantly different from chance would most likely be inappropriate because
the responses from any one of you are unlikely to be truly independent of one
another. Finally, remember that the use of the χ² distribution here is just an
approximation to the actual discrete distribution of frequencies: if the overall n
is too small or the expected frequency for any cell is too low, the approximation
breaks down. A rule of thumb here is that the expected frequency for every cell
should be greater than about 5.

            Coke   Pepsi
Females      32     28     60
Males         8     32     40
             40     60    100

Table 13.3: The observed frequencies of cola preference by sex for a hypothetical
sample of 100 people.

            Coke   Pepsi
Females      24     36     60
Males        16     24     40
             40     60    100

Table 13.4: The expected frequencies, given independence of cola preference and
sex, for the data of Table 13.3.
Table 13.4 displays the same data as Table 13.3, except that the four cell frequencies
have been adjusted so that cola preference is now independent of sex. Note that
the marginal totals for both cola preference and sex are identical to those of the
original table; the adjustment for independence does not affect these totals.
Instead, the interior cell frequencies are adjusted so that the proportion of females
who prefer Coke is the same as the proportion of males who do (or, identically,
that the proportion of those with a Coke preference who are female is the same
as that with a Pepsi preference).
So, for example, 24/60 = .40 or 40% of the females in Table 13.4 prefer Coke,
as do 16/40 = .40 or 40% of the males. Similarly, 24/40 = .60 or 60% of the Coke
drinkers are female, as are 36/60 = .60 or 60% of the Pepsi drinkers. Note that
these proportions are identical to those of the corresponding marginals (i.e., 40%
of the 100 people prefer Coke, and 60% of the 100 are females), which, in fact, is
how they were computed. To compute the independence or expected frequency
for the cell corresponding to females with a Coke preference, for example, just
multiply the proportion of the total frequency that preferred Coke (.4) by the
marginal frequency of females (60) [or, equivalently, multiply the proportion of
the total frequency that were female (.6) by the marginal frequency of those
that prefer Coke (40)] to get 24. In general, to compute the expected frequency
for a cell in row i, column j given independence, where m_i and m_j are the
corresponding row and column marginal totals, and n is the total frequency:

E_{ij} = \frac{m_i m_j}{n}    (13.5)
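As a sketch of equation 13.5 in code (plain Python; the variable names are ours, for illustration only):

    # Expected frequencies under independence for the cola data of Table 13.3.
    observed = [[32, 28],    # females: Coke, Pepsi
                [8, 32]]     # males:   Coke, Pepsi

    row_totals = [sum(row) for row in observed]        # 60, 40
    col_totals = [sum(col) for col in zip(*observed)]  # 40, 60
    n = sum(row_totals)                                # 100

    expected = [[r * c / n for c in col_totals] for r in row_totals]
    print(expected)   # [[24.0, 36.0], [16.0, 24.0]] -- the values of Table 13.4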
Degrees of freedom
For 2 x 2 contingency tables, only one of the four expected frequencies needs to
be computed in this way; the remainder can be obtained by subtraction from the
corresponding marginal totals to complete the table. What this result means is
that for the assessment of independence in a 2 x 2 table, only one cell frequency
is free to vary; once one cell is known, the others are automatically determined.
Hence, for the assessment of independence in a 2 x 2 table, there is only 1 degree
of freedom.
Once all the expected frequencies have been determined, z² is computed in the
same way as before (see equation 13.4):

z^2 = \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i} = \frac{(32 - 24)^2}{24} + \frac{(28 - 36)^2}{36} + \frac{(8 - 16)^2}{16} + \frac{(32 - 24)^2}{24} = 11.11

With 1 degree of freedom, this value exceeds the critical value of 3.84 at α = .05
(see Table 13.1), so the hypothesis that cola preference and sex are independent
is rejected.
A measure of association
To say that cola preference and sex are not independent is to say that they are
correlated or associated with one another. So, the first question one might ask
following rejection of independence is, if cola preference and sex are associated,
just how associated are they? A common measure of association for 2 x 2
contingency tables is called the phi (pronounced fee) coefficient, which is a
simple function of the approximated value of χ² (z² in our case) and n (the total
frequency):

\phi = \sqrt{\frac{z^2}{n}}    (13.6)
The φ coefficient varies between zero and 1, much like the absolute value of
the Pearson product-moment correlation coefficient. In fact, it is the Pearson
product-moment correlation coefficient. Exactly the same result would be
obtained if the values of 0 and 1 were assigned to males and females, respectively,
the values of 0 and 1 were assigned to the Coke and Pepsi preferences, respectively,
and the techniques of Chapter 7 were applied to the 100 cases in Table 13.3. For
our example case,
\phi = \sqrt{\frac{\chi^2}{n}} = \sqrt{\frac{11.11}{100}} = 0.33
That is, cola preference and sex are moderately (but significantly!) correlated or
associated in Table 13.3.
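The chi square and phi computations for Table 13.3 can be verified with a few lines of Python (a sketch of the computations above, using the values of Tables 13.3 and 13.4):

    # z^2 (the chi square approximation) and phi for the cola example.
    observed = [32, 28, 8, 32]
    expected = [24, 36, 16, 24]   # from Table 13.4
    n = 100

    z_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    phi = (z_squared / n) ** 0.5
    print(z_squared, phi)   # 11.11 and 0.33, as in the text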
The same computations apply to contingency tables larger than 2 x 2. What
does change is the computation of the degrees of freedom. For tables greater
than 2 x 2, where r is the number of rows and c is the number of columns,
the degrees of freedom is found by:
df = (r − 1)(c − 1) (13.7)
So, for example, for a contingency table with 4 rows and 6 columns, the degrees
of freedom would be:
df = (r − 1)(c − 1)
= (4 − 1)(6 − 1)
= 15
However, df here is still just the number of cells that have to be filled in before
the remainder can be determined by subtraction from the marginal totals.
13.3 Questions
1. The Consumers' Association of Canada (CAC) decides to investigate the
claims of a new patent cold remedy. From 25 years of investigations
conducted at the British Cold Institute, it is known that for people who
take no medication at all, one week from the onset of cold symptoms 30%
show improvement, 45% show no change, and the remainder are worse.
The results for a sample of 50 people who used the new cold medicine were
that 10 people improved, 25 showed no change, and the remainder got
worse. Based on these results, what should the CAC conclude about the
new cold remedy, and why?
2. Although there probably are some students who do actually like sadistics
(sic), common sense says that in the general population, the proportion is
at most 1/3. Of 36 of Vokey and Allen’s former sadistics students, 16 claim
to like sadistics. Test the hypothesis that these 36 people are a random
sample of the general population. Be sure to articulate null and alternative
hypotheses, and any assumptions necessary for drawing the conclusion in
question.
4. Two successive flips of a fair coin are called a trial. 100 trials are run with a
particular coin; on 22 of the trials, the coin comes up “heads” both times;
on 60 of the trials the coin comes up once a “head” and once a “tail”; and
on the remaining trials, it comes up “tails” for both flips. Is this sufficient
evidence (α = .05) to reject the notion that the coin (and the flipping
process) is a fair one?
           Male   Female
Beer        20     15
No beer      5     20
Chapter 14

The t Distribution

All of the foregoing was very much dependent on the fact that in each case
the population variance, σ², was known. But what if σ² is not known?
This was the problem faced by W. S. Gosset, a chemist at Guinness
breweries (see Figure 14.1), in the first few decades of the twentieth century.
One approach, the one current at the time, would be to use very large samples,
so that the variance of the sample would differ hardly at all from the population
or process sampled from, and simply substitute the sample variance for the
population variance in equation 12.4. But as reasonable as this solution appears
to be (assuming the sample size really is large), it would not work for Gosset.
His problem was to work with samples taken from tanks of beer to determine, for
example, whether the concentration of alcohol was different from its optimum at
that point in the brewing process.1 The use of large samples would use up the
product he was testing! He needed and developed a small-sample approach.
vats of yeast. Precise volumes of yeast of equally precise concentrations according to a (still)
secret formula had to be added to the mix of hops and barley to produce the beer for which
Guinness was justly so famous. Too much or too little would ruin the beer. But the live
yeast were constantly reproducing in the vats, so the concentrations fluctuated daily, and had
to be assessed just before being added to the next batch of beer.
To correct for this bias, the sum of squared deviations is divided by n − 1 rather
than n:

\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}
Why n − 1? The reason is that variance is the average squared deviation from the
mean; it is based on deviation scores from which the mean has been removed—
the deviation scores are constrained to have a mean of zero. This constraint is
equivalent to knowing one of the scores contributing to the sum, and that score,
therefore, is no longer free to vary. For example, if you know that the sum of
two scores is equal to 5, then once you find the value of just one of the scores,
the value of the other is determined because they must sum to 5. The variance
possible for these scores, then, is in this sense determined by the variability
of just one of them, rather than both. Thus, if we want the average of the
variability possible of a set of scores, the average should be based on the number
contributing to it, which is one less than the number of scores. This estimate, σ̂²,
is no more likely to be too small than to be too large, as shown in the last column
of Table 14.1, and, hence, is an unbiased estimate of the population variance.
But it is still an estimate, which means that a different sample would likely give
rise to a different estimate. That is, these estimates are themselves distributed.2
Thus, even if we substituted these now unbiased estimated values into equation
12.4, the resulting distribution of Z would not be normal in general, because Z is
now the distribution of the ratio of one random distribution (the assumed normal)
and another random distribution (that of the estimated variance), the result of
which is known mathematically not to be normal, although it converges on the
normal as n gets large (principally because, as n gets larger, the denominator
converges on a constant, \sqrt{\sigma^2/n}, and is no longer distributed). In fact, although
the shapes are similar (i.e., they are all symmetrical, roughly triangular-shaped
distributions similar to the normal), the exact shape of the distribution and,
hence, the proportions of scores that fall into different ranges, changes as a
function of how many scores went into computing the estimated variance. This
number, the denominator of the formula for computing the estimated variance,
n − 1, is referred to as the degrees of freedom of the estimate.
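A short simulation (NumPy assumed; the parameters are illustrative choices of ours) shows the bias that the n − 1 divisor corrects:

    # Dividing the sum of squared deviations by n systematically
    # underestimates the population variance; dividing by n - 1 does not.
    import numpy as np

    rng = np.random.default_rng(0)
    pop_var, n = 9.0, 5
    samples = rng.normal(0, pop_var ** 0.5, size=(100_000, n))

    print(samples.var(axis=1, ddof=0).mean())  # near 7.2 = ((n-1)/n) * 9, biased
    print(samples.var(axis=1, ddof=1).mean())  # near 9.0, unbiased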
This family of related distributions is called the t-distributions, and Gosset
worked out what the exact distributions were for each of the small samples he
would likely have to work with, such as sample sizes of less than 10 or 20. A
selection of tail values of the t-distributions for many of these small samples is
shown in Table 14.2, as a function of degrees of freedom, and for both one- and
two-tailed tests. He published these ideas and distributions in a series of articles
under the pseudonym "Student", both because the Guinness brewing company
had an explicit policy forbidding its employees to publish methods related to
the brewing of beer after an unfortunate incident involving another employee
in which a trade secret had been so revealed, and because he fancied himself at
best as just a student of the then burgeoning work of such statistical luminaries
as Galton, Pearson, and Fisher.
2 Given the assumption of normality for the population, they are in fact distributed as a
scaled χ² distribution.
\hat{\sigma}^2 = \frac{S^2 n}{n - 1} = \frac{(8)(9)}{(9 - 1)} = 9
That value is the estimated population variance, but what we want is an estimate
of the standard deviation of sample means from such a population, which we
obtain, as we did earlier, by dividing the estimated population variance by n,
and taking the square-root, producing an estimated standard error of the mean:

\hat{\sigma}_{\bar{X}} = \sqrt{\frac{\hat{\sigma}^2_X}{n}}    (14.1)
Formally, the t-test for testing a single mean against its hypothesized
population value is defined as a modification of equation 12.4:

t_{df} = \frac{\bar{X} - \mu}{\sqrt{\hat{\sigma}^2 / n}}    (14.2)
3 Gosset used the symbol z; subsequent authors, keen to distinguish Gosset’s test from the
z-test, introduced the t symbol. Why t was chosen, rather than, say, g (for Gosset) or s (for
Student) is not known to the authors.
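As a sketch of equation 14.2 in Python (the data here are hypothetical, invented for illustration; SciPy's ttest_1samp provides a cross-check):

    # One-sample t-test: estimated variance uses the n - 1 divisor.
    from scipy import stats

    scores = [12, 15, 11, 14, 13, 16, 12, 14, 11]   # hypothetical, n = 9
    mu = 12                                          # hypothesized population mean

    n = len(scores)
    mean = sum(scores) / n
    var_hat = sum((x - mean) ** 2 for x in scores) / (n - 1)
    t = (mean - mu) / (var_hat / n) ** 0.5

    print(t)
    print(stats.ttest_1samp(scores, mu).statistic)   # same value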
samples, the dependent t-test, and so on. Regardless of the label, they are all just variants of
the t-test for one sample mean.
\hat{\sigma}^2_{pooled} = \frac{n_1 S_1^2 + n_2 S_2^2}{(n_1 - 1) + (n_2 - 1)}

\hat{\sigma}^2_{pooled} is now our best estimate of the variance of this hypothetical population
of scores from which our two samples of scores were presumed to have been
drawn. In section 14.1, the denominator of equation 14.2, the square-root
of the estimated population standard deviation divided by n, is the estimated
of the estimated population standard deviation divided by n, is the estimated
standard error of the mean or estimated standard deviation of means of samples
of size n. We need something similar here, although as we are interested in the
difference between two sample means, we want the estimated standard error of
the difference between sample means. Even though we are subtracting the means
from one another, the variability of each mean contributes to the variability of
the difference. Thus, we want to add our best estimate of the variability of one
mean to the corresponding best estimate of the variability of the other mean to
get the best estimate of the variance of the difference between the means, and
then take the square-root to get the estimated standard error of the difference
between the means:

\hat{\sigma}_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\hat{\sigma}^2_{pooled}}{n_1} + \frac{\hat{\sigma}^2_{pooled}}{n_2}}
The t-test is now just the ratio of the difference between the two sample
means over the estimated standard error of such differences:

t_{df} = \frac{\bar{X}_1 - \bar{X}_2}{\hat{\sigma}_{\bar{X}_1 - \bar{X}_2}}    (14.4)

where the degrees of freedom are, as before, the df that produced the estimated
variance, \hat{\sigma}^2_{pooled}—in this case, n_1 + n_2 − 2. As a demonstration of the t-test
for two sample means, let us return to the canonical example of section 10.1.3 of
chapter 10 in which females had the scores {1, 2, 3, 5} and males had the scores
{4, 6, 7, 8}. To make the example more concrete, imagine these scores were
errors on a sadistics (sic) test. The null hypothesis model is that these errors
are distributed normally, and the distribution of scores is identical for males
and females as we have selected them (i.e., the scores for our groups represent
independent random samples from this distribution of scores).
The mean for the males is 6.25 errors, with the mean number of errors for
the females as 2.75, for a difference between the two means of 3.5. The sample
variance in each case is S 2 = 2.1875, so the pooled estimated variance is equal
to:
\hat{\sigma}^2_{pooled} = \frac{4(2.1875) + 4(2.1875)}{(4 - 1) + (4 - 1)} = 2.92
and the standard error of the estimate of the difference between the means is
equal to:

\hat{\sigma}_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{2.92}{4} + \frac{2.92}{4}} = 1.21

The t-ratio is then:

t(6) = \frac{6.25 - 2.75}{1.21} = 2.90

Looking at the t-table with 6 degrees of freedom yields a critical value of 2.4469
at α = .05, two-tailed test. As 2.90 exceeds that value, we reject the model, and
declare that the males scored significantly more errors on average than did the
females.
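SciPy's pooled-variance t-test reproduces this example (a quick check, not part of the original text):

    # Two-sample t-test on the errors example; equal_var=True (the default)
    # gives the pooled-variance test described in the text.
    from scipy import stats

    females = [1, 2, 3, 5]
    males = [4, 6, 7, 8]

    result = stats.ttest_ind(males, females)
    print(result.statistic)   # about 2.90, with 6 degrees of freedom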
Thus, the t-test is also a test of the significance of the rpb statistic—if the
corresponding t is significant, then the correlation between group membership
and the dependent measure is significantly different from zero. By the same
token, rpb is the corresponding measure of the strength of relationship between
group membership and the dependent measure for the t-test. Thus, for the
example in the previous section (14.2), where t(6) = 2.90 with df = 6:

r_{pb} = \pm\sqrt{\frac{t^2}{t^2 + df}} = \pm\sqrt{\frac{2.90^2}{2.90^2 + 6}} = \pm 0.76
That is, group membership (sex) is highly correlated with errors, or, equivalently,
.76² = 57.76% of the variance in the errors can be "accounted for" by group
membership.
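In code, the relation between t and r_pb is one line (values from the example above):

    # Point-biserial correlation from t and its degrees of freedom.
    t, df = 2.90, 6
    r_pb = (t ** 2 / (t ** 2 + df)) ** 0.5
    print(r_pb, r_pb ** 2)   # about 0.76 and 0.58 (57.76% of the variance)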
14.3 Questions
1. Sixty-four people are asked to measure the width of a large room. The mean
measurement is 9.78 metres with a variance of 63 squared-centimetres.
Assuming that the measurement errors are random (and normally dis-
tributed), the true width of the room is unlikely (i.e., α = .05) to be less
than or greater than .
2. From a normally-distributed population of scores with µ = 40, a random
sample of 17 scores is taken, and the sample mean calculated. The standard
deviation for the sample is 10, and the mean is found to be greater than
the population mean. If the probability of observing a sample mean of
that magnitude (or greater) is equal to .10, what is the sample mean?
3. A set of 65 scores, randomly sampled from a normal distribution, has a
mean of 30 and a variance of 1600. Test the notion that the population from
which this sample was drawn has a mean of 40.
5 A derivation that we will leave as an exercise for the reader . . .
6. On five different trips by automobile over the same route between your home
in Lethbridge and a friend’s home near Calgary, you record the following
odometer readings (in kilometres): 257, 255, 259, 258, and 261. Assuming
that measurement errors of this kind would be normally-distributed, is
there sufficient evidence to reject the belief that the distance between your
home and your friend’s over this route is 260 km?
7. Many people claim that the television volume surges during commercials.
To test this notion, you measure the average volume during 36 regular
programs and the corresponding commercials that occur within them,
producing 36 pairs of program-commercial volumes. On average, the
commercials are 8 decibels louder than the programs, and the sum of
squared differences is equal to 16329.6. Is this sufficient evidence to
substantiate the claim?
8. The manufacturer of the Lamborghini Countach recently claimed in a
prestigious magazine that the sports car was able to accelerate from 0-60
in 4.54 seconds. In my previous life as a test driver for this company (we
all can dream, can’t we?), I drove 25 of these fine automobiles and found
that the mean latency in acceleration from 0-60 was 4.84 seconds, with a
standard deviation (for the sample) of .35 seconds. Is my result statistically
inconsistent with the claim in the magazine?
Chapter 15

The General Logic of ANOVA

The statistical procedure outlined here is known as "analysis of variance",
or more commonly by its acronym, ANOVA; it has become the most
widely used statistical technique in biobehavioural research. Although
randomisation testing (e.g., permutation and exchangeability tests) may be used
for completely-randomized designs (see Chapter 10),1 the discussion here will
focus on the normal curve method (nominally a random-sampling procedure2),
as developed by the British statistician, Sir Ronald A. Fisher.
15.1 An example
For the example data shown in Table 15.1, there are three scores per each of
three independent groups. According to the usual null hypothesis, the scores
in each group comprise a random sample of scores from a normally-distributed
population of scores.3 That is, the groups of scores are assumed to have been
sampled from normally-distributed populations having equal means and variances
1 Randomisation testing also can be used when random assignment has not been used, in
which case the null hypothesis is simply that the data are distributed as one would expect
from random assignment—that is, there is no systematic arrangement to the data.
2 When subjects have not been randomly sampled, but have been randomly assigned,
ANOVA and the normal curve model may be used to estimate the p−values expected with
the randomization test. It is this ability of ANOVA to approximate the randomization
test that was responsible at least in part for the ready acceptance and use of ANOVA for
experimental research in the days before computing equipment was readily available to perform
the more appropriate, but computationally-intensive randomisation tests. Inertia, ignorance,
or convention probably account for why it, rather than randomisation testing, is still used in that
way.
3 Note that it is the theoretical distribution from which the scores themselves are being
drawn that is assumed to be normal, and not the sampling distribution. As we will see, the
sampling distribution of the statistic of interest here is definitely not normal.
                    Group
             1         2         3
            −2        −1         0
            −1         0         1
             0         1         2
\bar{X}_1 = −1    \bar{X}_2 = 0 = \bar{X}_G    \bar{X}_3 = 1

Table 15.1: Hypothetical values for three observations per each of three groups.
(or, equivalently, from the same population). Actually, the analysis of variance
(or F-test), as with Student's t-test (see Chapter 14), is fairly robust with respect
to violations of the assumptions of normality and homogeneity of variance,4 so
the primary claim is that of equality of means; the alternative hypothesis, then,
is that at least one of the population means is different from the others.
X_{ij} - \bar{X}_G = (X_{ij} - \bar{X}_i) + (\bar{X}_i - \bar{X}_G)
That is, the deviation of any score from the grand mean of all the scores is equal
to the deviation of that score from its group mean plus the deviation of its group
mean from the grand mean. Or, to put it more simply, the total deviation of a
score is equal to its within group deviation plus its between group deviation.
Now, if we square and sum these deviations over all scores:5

\sum_{ij} (X_{ij} - \bar{X}_G)^2 = \sum_{ij} (X_{ij} - \bar{X}_i)^2 + \sum_{ij} (\bar{X}_i - \bar{X}_G)^2 + 2 \sum_{ij} (X_{ij} - \bar{X}_i)(\bar{X}_i - \bar{X}_G)

For any group, the group mean minus the grand mean is a constant, and the
sum of deviations around the group mean is always zero, so the last term is
always zero, yielding:

\sum_{ij} (X_{ij} - \bar{X}_G)^2 = \sum_{ij} (X_{ij} - \bar{X}_i)^2 + \sum_{ij} (\bar{X}_i - \bar{X}_G)^2

or SS_Total = SS_Within + SS_Between.
4 Which is to say that the p-values derived from its use are not strongly affected by such
violations.
5 The sum of unsquared deviations around the mean is
always zero; so, we always square the deviations before summing—producing a sum of squares,
or SS (see Chapter 2).
SS_B = 6     SS_W = 6     SS_T = 12

Table 15.2: Sums of squares for the hypothetical data from Table 15.1.

For the mean square between groups (MS_B), with a groups, only a − 1 of the
group means are free to vary; hence:

df_B = a − 1
6 Degrees of freedom, as you no doubt recall, represent the number of values entering into
the sum that are free to vary; for example, for a sum based on 3 numbers, there are 2 degrees
of freedom because from knowledge of the sum and any 2 of the numbers, the last number is
determined.
For the mean square within groups (MS_W), each group has n_i − 1 scores free to
vary. Hence, in general,

df_W = \sum_i (n_i - 1)

which, with a groups of equal size n, is:

df_W = a(n − 1)

Note that:

df_T = df_B + df_W

The mean squares are the sums of squares divided by their degrees of freedom:

MS_W = \frac{SS_W}{a(n - 1)}

and

MS_B = \frac{SS_B}{a - 1}
Given the null hypothesis, MS_W is an estimate of the population variance of
the scores:

MS_W = \hat{\sigma}^2_X

To estimate the variance of the means of samples of size n sampled from the
same population, we could use the standard formula:7

\hat{\sigma}^2_{\bar{X}} = \frac{\hat{\sigma}^2_X}{n}

But,

MS_B = n\hat{\sigma}^2_{\bar{X}}

Hence, given that the null hypothesis is true, MS_B also is an estimate of the
population variance of the scores.
7 This expression is similar to the formula you encountered in Chapter 12, except that it
uses the estimated, rather than the known, population variance.
[Figure 15.1: The probability density of the F-distribution with 2 and 6 degrees of freedom, plotted as relative frequency against the value of the F(2,6)-ratio.]
The test statistic, then, is the ratio of the two mean squares:8

F(df_B, df_W) = \frac{MS_B}{MS_W}
If the null hypothesis is true, then this ratio should be approximately 1.0.
Only approximately, however; first, because both values are estimates, their
ratio will only tend toward their expected value; second, even the expected
value—long-term average—of this ratio is unlikely to equal 1.0 both because, in
general, the expected value of a ratio of random values does not equal the ratio
of their expected values, and because the precise expected value under the null
hypothesis is dfW /(dfW − 2), which for the current example is 6/4 = 1.5.
Of course, even if the null hypothesis were true, this ratio may deviate from its
expected value (of approximately 1.0) due to random error in one or the other (or
both) estimates. But the probability distribution of these ratios can be calculated
(given the null hypothesis assumptions mentioned earlier). These calculations
produce a family of distributions, one for each combination of numerator and
denominator degrees of freedom, called the F -distributions. Critical values of
some of these distributions are shown in Table 15.3.
8 Named in honour of Sir Ronald A. Fisher. Fisher apparently was not comfortable with
having the statistic named after him.
The F -distributions are positively skewed, and become more peaked around
their expected value as either the numerator or the denominator (or both)
degrees of freedom increase.9 A plot of the probability density function of the
F -distribution with 2 and 6 degrees of freedom is shown in Figure 15.1. If
the null hypothesis (about equality of population means) isn’t true, then the
numerator (M SB ) of the F -ratio contains variance due to differences among the
population means in addition to the population (error) variance resulting, on
average, in an F -ratio greater than 1.0. Thus, comparing the obtained F -ratio
to the values expected theoretically allows one to determine the probability of
obtaining such a value if the null hypothesis is true. A sufficiently improbable
value (i.e., much greater than 1.0) leads to the conclusion that at least one of
the population means from which the groups were sampled is different from the
others.
Returning to our example data, MS_W = 6/(3(3 − 1)) = 1 and MS_B =
6/(3 − 1) = 3. Hence, F(2, 6) = 3/1 = 3. Consulting the appropriate F-
distribution with 2 and 6 degrees of freedom (see Figure 15.1 and Table 15.3),
the critical value for α = .05 is F = 5.14.10 Because our value is less than the
critical value, we conclude that the obtained F-ratio is not sufficiently improbable
to reject the null hypothesis of no mean differences.
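The entire analysis for Table 15.1 can be verified in a few lines of Python (SciPy assumed; a sketch of the computations above, not part of the original text):

    # Sums of squares, mean squares, and the F-ratio for Table 15.1,
    # cross-checked against SciPy's one-way ANOVA.
    from scipy import stats

    groups = [[-2, -1, 0], [-1, 0, 1], [0, 1, 2]]
    a, n = len(groups), len(groups[0])

    grand = sum(sum(g) for g in groups) / (a * n)
    means = [sum(g) / n for g in groups]

    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ss_b = n * sum((m - grand) ** 2 for m in means)

    ms_w = ss_w / (a * (n - 1))
    ms_b = ss_b / (a - 1)
    print(ss_b, ss_w, ms_b / ms_w)             # 6, 6, and F(2,6) = 3.0
    print(stats.f_oneway(*groups).statistic)   # 3.0 as well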
15.2 F and t
Despite appearances, the F-ratio is not really an entirely new statistic. Its
theoretical distribution has the same assumptions as Student's t-test, and, in
fact, when applied to the same situation (i.e., the difference between the means
of two independently sampled groups), F(1, df) = t²_df—that is, F is simply the
square of t.
15.3 Questions
1. You read that “an ANOVA on equally-sized, independent groups revealed
an F (3, 12) = 4.0.” If the sum-of-squares between-groups was 840,
        df_1 = 1        df_1 = 2        df_1 = 3        df_1 = 4        df_1 = 5
df_2    .05    .01      .05    .01      .05    .01      .05    .01      .05    .01
2      18.51  98.50    19.00  99.00    19.16  99.17    19.25  99.25    19.30  99.30
3      10.13  34.12     9.55  30.82     9.28  29.46     9.12  28.71     9.01  28.24
4       7.71  21.20     6.94  18.00     6.59  16.69     6.39  15.98     6.26  15.52
5       6.61  16.26     5.79  13.27     5.41  12.06     5.19  11.39     5.05  10.97
6       5.99  13.75     5.14  10.93     4.76   9.78     4.53   9.15     4.39   8.75
7       5.59  12.25     4.74   9.55     4.35   8.45     4.12   7.85     3.97   7.46
8       5.32  11.26     4.46   8.65     4.07   7.59     3.84   7.01     3.69   6.63
9       5.12  10.56     4.26   8.02     3.86   6.99     3.63   6.42     3.48   6.06
10      4.96  10.04     4.10   7.56     3.71   6.55     3.48   5.99     3.33   5.64
15      4.54   8.68     3.68   6.36     3.29   5.42     3.06   4.89     2.90   4.56
20      4.35   8.10     3.49   5.85     3.10   4.94     2.87   4.43     2.71   4.10
30      4.17   7.56     3.32   5.39     2.92   4.51     2.69   4.02     2.53   3.70
40      4.08   7.31     3.23   5.18     2.84   4.31     2.61   3.83     2.45   3.51
50      4.03   7.17     3.18   5.06     2.79   4.20     2.56   3.72     2.40   3.41

Table 15.3: The critical values of the F-distribution at both α = 0.05 and α = 0.01 as a function of numerator (df_1) and
denominator (df_2) degrees of freedom.
Part III

The Psychological Literature
Chapter 16
The Literature
There are several formal ways of making the results of scientific research
public. They include presentations at professional meetings, journal
articles, and books. Each serves a slightly different purpose, and the
procedure is somewhat different, but in all cases the main purpose is identical:
sharing the results of one's research with others.1
If you don’t make your research results public, you aren’t doing
science—you may be doing research, research and development,
playing mental games, or something else, but one of the main
criteria of science is that the findings are public.
One of the main goals of this course is to help you gain the skills necessary
to comfortably read the original scientific psychology literature. For the most
part, this means articles in psychology journals, although when you are first
entering an area, or looking for an overview, books may be useful and for the
most up-to-date information scientific meetings are preferred.
16.1 Meetings
Academics attend professional meetings in order to tell others in their field what
they are doing and to find out what the others are doing. These activities are an
important part of the research process and of the informal lines of communication
that build up around any area of expertise. Professional meetings give you the
most up-to-date information possible if you don’t know someone working in the
area well enough to phone them up and say “so what’s new?”.
1 That is, in addition to just telling one’s friends, which is an important source of current
information for those who are “in the club”. We are, after all, talking about a relatively small
group once you get to a reasonable degree of specialisation.
161
162 CHAPTER 16. THE LITERATURE
Examples of yearly conferences that serve this purpose are the annual meet-
ings of: the American Psychological Association (APA), the Society for Ex-
perimental Social Psychology (SESP), the Canadian Psychological Association
(CPA—important for social psychologists and clinical psychologists), the Psy-
chonomic Society (important for cognitive psychologists), the Canadian Society
for Brain, Behaviour, and Cognitive Science (CSBBCS or BBCS2—important
for cognitive psychologists and neuroscientists), the Society for Neuroscience,
and the Association for Research in Vision and Ophthalmology (ARVO). There are
also regional conferences (e.g., Banff Annual Seminar in Cognitive Science—
BASICS) and smaller, more specialized conferences (e.g., Society for Computers
in Psychology—SCiP, Consciousness and Cognition).
An important reason for keeping up on the latest research by attending
conferences is so that you don’t duplicate the work of other researchers—you
don’t want to have two labs working on the same project. There will, of course,
be a lot of overlap between the research you choose to do and the research
that is taking place in other labs but life is too short to waste your time doing
what someone else is doing. You want your research to be informed by the
latest related findings. You don’t want to be working on a project based on a
certain assumption only to find out that another researcher has just proven that
assumption wrong.
Much of the benefit of conferences comes from the informal discussions one
has with colleagues over dinner or drinks, in shared cabs, while waiting in airports,
and so on. Many of the best insights and ideas that emerge from conferences come
from these intense, one-on-one discussions. It is this availability of colleagues in
close proximity and in the absence other obligations (e.g., teaching, committee
meetings, family obligations) that ensures the survival and effectiveness of these
conferences in the days of email and virtual conferencing. In addition to the
critical informal discussions, there are two formal types of presentations that
generally occur and often serve as the springboard for later, informal discussions.
They are talks and posters.
16.1.1 Talks
Research talks generally take the form of a short lecture laying out the results of
a research project. They allow researchers efficiently to disseminate their most
recent work to relatively large groups of interested colleagues. The presenter,
usually the primary researcher on the project, lays out the reason for undertaking
the project, how the project was carried out, what was found, and what it means
in terms of psychological theories. As the time is usually limited (never long
enough in the opinion of the presenter, although occasionally far too long in the
opinion of the listeners), the information is presented in substantially less detail
than would be contained in a journal article on the same research. This lack of
detail is not a problem because the listeners are generally given the opportunity
2 Leave it to a group of Canadian scientists to give themselves a name so long that even
its acronym needs an abbreviation.
to ask questions about any additional information they desire. There is always a
short time scheduled for questions at the end of the talk, and the presenter is
usually available at the conference for several days.
16.1.2 Posters
Posters are visual displays (generally about 4ft X 8ft in size) that present much
the same information as is presented in a talk. They allow a lot of people to
present their research in a short period of time. Poster sessions generally run
for about two hours in rooms containing anywhere from a few dozen to several
hundred posters (depending on the size of the conference). In that time the
presenter hangs around in front of his or her poster while people mill about,
read the poster (or pass it by), and discuss it with the presenter. The poster
format allows for more two-way interaction between the presenter and his or her
audience than a talk does and for greater flexibility in how the information is
presented. The information can be tailored to the individual to whom it is being
presented—the researcher can discuss esoteric details of procedure and theory
with other experts in the area and general issues and relations to other areas of
psychology with experts from other areas. The best posters use lots of visual
information—pictures of apparatus and procedures, and graphs of data—and use
little writing. Such posters allow the reader quickly to get the gist of research
and then decide whether to follow up on details with a closer look or by talking
to the presenter. Although the audience for a single poster is often smaller than
for a single talk, the interaction with each member of the audience is generally
more intense for posters.
Some research is more easily presented as a poster, and some as a talk. Either
way, though, the important thing is that the research is out in the open being put
to the test of public scrutiny where other researchers can criticize the methods or
assumptions being made, argue with the conclusions being drawn, and generally
help to keep the researcher on the intellectual “straight and narrow”. In addition,
good research will influence the work of others in the field, hastening the advance
of the science as a whole. Remember, science is a public process.
16.2 Journals
The main outlet for information in any field of science is in the field’s journals.
Journals are regularly published (often bimonthly or quarterly) magazines con-
taining detailed reports of original research, and occasionally, review articles that
summarize the state of some specific field of research. There are hundreds of
psychology journals, including dozens of high-quality, respected ones. There are
journals of general interest to many scientific psychologists, journals devoted to
specific areas, journals devoted to even more specific areas, and journals devoted
to reviews of various areas of psychology.
All journals are edited by an expert in the field. It is the editor's job to
ensure that only high-quality papers of interest to the readers of his or her
journal are published.
16.3 Books
Books tend to be more general than journal articles, covering a topic more
broadly. Because of the time required to write a book, or to coordinate several
different authors writing individual chapters in the case of edited books, books
are usually less up-to-date than journal articles. They are often a good starting
point when entering a new area because they afford the ability to cover a broad
area with greater detail than there is room for in a journal article. Keep this
in mind when you are researching an area. Any research report or student
project based entirely on books will be inadequate due to the dated nature of
the information available in books.
Chapter 17

The APA Format

Virtually all of the journal articles you will read (or write) will be in APA
format. Historically, many of the leading journals in areas of scientific
psychology were published by the APA, and it is the APA, consequently,
that controls the rules of writing for those journals. APA is the TLA1 for the
American Psychological Association, the association that, at one time, was meant
to represent all psychologists. More recently, most of the scientific psychologists
have left the APA to form the American Psychological Society (APS), leaving
the APA mainly to the clinical psychologists. The Canadian equivalents of
these two organisations are the Canadian Psychological Association (CPA—for
clinicians) and the Canadian Society for Brain, Behaviour, and Cognitive Science
(CSBBCS—for scientists). The British equivalents are the British Psychological
Society (BPS—for clinicians) and the Experimental Psychology Society
(EPS—for scientists). Many journals that are not published by the APA have
also adopted the APA guidelines or slightly modified versions of them. These
journals include virtually all psychology journals and many others in related
fields. Journals in neuroscience are more likely to use other forms of referencing,
but you’ll note that the structure of the articles is similar to APA format.
APA format is a set of rules for writing journal articles that are designed
to make the articles easy for the publisher to publish and easy for the reader
to understand. A brief overview of these rules is given here. The specifications
for writing in APA format—in their full, detailed, glory—can be found in the
Publication Manual of the American Psychological Association, Fifth Edition
(American Psychological Association, 2001). When reading journal articles
it will be very helpful to know the format in which they are written. When
writing journal articles or class assignments based on them (i.e., research reports,
research proposals) it is imperative that you know the format.
Specifications for the manuscript itself (e.g., line spacing, margin sizes, headers
on the manuscript, placing of tables, figures, and footnotes, etc.) are largely
for the benefit of those who work in the publishing process: Double spacing
1 Three Letter Acronym
and large margins allow for easy placing of copy editor’s comments; consistent
line spacing and margins make it easy to estimate how many journal pages a
manuscript will take; headers and page numbers on a manuscript make it easy
to keep track of the pages; and separate tables, figures, and footnotes are ready
to be easily copied, and set in type.
Specifications for the structure of the article (e.g., order of the sections,
content of the sections) are largely for the benefit of the reader. Anyone reading
a journal article in an APA journal can be sure that certain types of information
will be in certain portions of the article. This consistency makes it much easier
to quickly skim articles or to scan an article looking for specific information.
The different sections of an APA formatted research article and their contents
are described in the following sections.
17.0.1 Title
The title is probably the most important set of words in your article and the
one least considered by most students (you know you’re not going to get a good
mark on a research proposal that you've titled Research Proposal). It is the
title that is likely to attract people to read your article2 and so most researchers
spend weeks, if not months considering just the right title for their article. The
ideal title will be short, informative, and clever. Although all three of these
criteria may not be compatible in every case, you should strive for at least two of
three. One of my favourites is an article by Ian Begg (1987) about the meaning
ascribed to a particular logical term by students entitled Some.
17.0.2 Abstract
The abstract is the second most important set of words in your article. It is
a short summary of your entire article. Abstracts typically run from 75 to
150 words. The specific maximum length for the abstract is set by the policy
of the particular journal in which the article appears. The abstract is critical
because it is the part of the article that is most likely to be read. In fact, it is
the only portion of the article that will be read by the vast majority of people
who are intrigued enough by the title to read further. The abstract is also, at
least at present, easier to obtain than any other part of an article. Most online
databases contain the abstracts of all the articles they list, but not the full articles.
Thus, great care should be taken in writing the abstract to ensure that it is as
informative as possible within the constraints of the specified maximum word
count. The abstract should state what theories were tested, what experimental
manipulations were made, and what the results of those manipulations were.
2 unless you've built such a reputation as a researcher and/or writer that people will read
anything you write.
17.0.3 Introduction
The introduction is pretty much what it sounds like. It is in the introduction
that the author introduces the problem that he or she is trying to solve. An
introduction generally starts out by reviewing the relevant literature. That is,
summarizing what other researchers have discovered about the general topic
of the article. Then, the author’s particular point of view is introduced along
with supportive evidence from the relevant literature. The specific experimental
manipulation that was carried out is introduced along with how it will add to the
collective knowledge about the topic at hand. Most of the time an introduction
will start off relatively general and become more and more specific as it goes
along until it narrows in on the precise manipulations that will be carried out
and their potential meaning to the relevant literature.
It is helpful when reading a research article—and even more helpful when
writing one—to think of the article as a story. The writer’s job is to tell a
coherent, convincing story based on the research findings in the field in which the
research takes place, and on the new research findings presented in the article.
Thus, all the writing in the article is part of the story and any information that
doesn’t help to move the story along shouldn’t be there. Try to keep this point
in mind when you are constructing the literature review sections of any papers
you write. It will help you avoid the common error of simply listing a lot of
research findings without explaining the connection of each to the purpose of
your paper (i.e., what role they play in the story you are trying to tell).
17.0.4 Method
The method is where the specific methods used in the study are explained. A
reader of the article should be able to replicate the study (i.e., do it again in
exactly the same way) using only the information in the method section. If that
is not possible (for example, if you had to contact the author to ascertain some
additional details) the method section was not properly written. The method
section is generally broken down into subsections. The specific subsections that
appear will vary with the type of the experiment, but the following subsections
are generally used.
Participants
The participants subsection contains particulars about who took part in the study.
Who participated, how many were there, how were they recruited, and were they
remunerated in some way? If animals were used, what kind were they, where
were they obtained, and did they have any specific important characteristics?
For most studies that use “normal” people as participants this section will be
quite short. However, if special populations are used, for example, people with
particular types of brain damage or animals with unusual genetic structures, the
method section can be quite long and detailed.
Materials
The materials section contains descriptions of any important materials that
were used in the study. Materials can be quite varied including things like
lists of words, stories, questionnaires, computer programs, and various types of
equipment. In general, anything physical that is used in the study is described
in the materials section (including things such as answer sheets and response
buttons). Anything that is purely mental (e.g., counting backward by threes,
imagining an elephant) is described in the procedure section.
Procedure
The procedure section contains precisely what was done in the study: What
steps were taken, and in what order. The best procedure sections tend to take
the reader through the study step by step from the perspective of a participant.
It is important that the reader have a good feeling for precisely what was done
in the study, not only for the sake of replicability but in order to understand
what a participant is faced with and consequently might do.
Details, details
Remember, when deciding what to include in the method section, that the
overriding principle is to give enough information in enough detail that a reader
of the paper could replicate the study—and no more. Thus, including “HB
pencils” in the materials section is probably not important (unless the study deals
with the perceptibility of handwriting), likewise listing 23 females with a mean
age of 18.6 years and 17 males with a mean age of 19.2 years is also probably
overkill (unless the study looks at sex differences on the task as a function of
age). On the other hand, if one half of your participants were recruited at a local
high school and the other half at a senior citizen’s centre, you should probably
mention it.
Don’t forget to include in your method section what you are measuring and
what you will be analyzing in the results section.
17.0.5 Results
The results section contains what was found in the study (what it means goes in
the discussion section, the reason for looking for it goes in the introduction). If
you have a lot of numbers to deal with, a figure or table is often the best way
to present your data. It is important to remember that you are trying to tell a
story and that the data are part of the story. Good story-telling means that you
shouldn’t just present a whole bunch of numbers and let the reader figure it out.
You should present the numbers that are important to your story, and where
a difference between sets of numbers is important, you should statistically test
that difference and report the results of the test. The statistical tests are usually
reported in the body of the text rather than on the graphs or in the tables.
169
You want to start out by stating the type of statistical test you are using and
the alpha level you are using (usually .05). Don't forget to tell the reader
what you were measuring (what the numbers stand for) and what the direction
of any differences was. You want to report the results of any statistical tests
as clearly and concisely as possible. You can usually get a good idea of how to
report the results of statistical tests by looking at how the authors in the papers
you've been reading do so. Here are a couple of examples of how to report the
results of a statistical test:
. . . There were significantly higher testosterone levels in the males (68.5 ppm)
than in the females (32.2 ppm), t(34) = 7.54. . . .
. . . The final weight of the participants was subjected to a 2 x 2 between
subjects analysis of variance (ANOVA) with sex of the subjects and age of the
subjects as factors. Males (mean = 160 lb) weighed significantly more than
females (mean = 140 lb), F(1,35) = 5.32, MSe = 5.5. Older subjects (mean =
148 lb) did not weigh significantly more than younger subjects (mean = 152
lb), F(1,35) = 1.15, MSe = 5.5, n.s. Finally, there was no significant interaction
between age and sex of the subjects (F < 1). . . .
Note that the direction of any differences is noted, as are the statistical
significance of the differences, the values of the statistics, and means for each
group. In these examples there is no need for any tables or figures as the
information is all presented in the body of the text. Don’t spend a lot of time
explaining the meaning of the results in this section; that comes in the discussion
section.
17.0.6 Discussion
The last section of the body of the paper is where you explain what your results
mean in terms of the questions you set out to answer—where you finish the
story. The first part of the discussion generally summarizes the findings and
what they mean and the later parts generally deal with any complications, any
new questions raised by the findings, and suggestions for future directions to be
explored.
17.0.7 References
The reference section is the most complex part to learn how to write but often
the most informative part to read—it leads you to other related articles or books.
The reference section lists all the sources you have cited in the body of the paper
and it lists them in a complete and compact form. Every source of information
cited in the body of the paper must appear in the reference section, and every
source in the reference section must be cited in the body of the paper.
There are specific ways of referencing just about any source of information
you could imagine. For most, however, it is easy to think of a reference as
a paragraph containing three sentences. The three most common sources of
information are journal articles, chapters in edited books, and entire books
written by a single author or group of authors. In all these cases the reference
170 CHAPTER 17. THE APA FORMAT
consists of three sentences, the first describes the author(s), the second gives
the title of the work, and the third tells you how to find it. In the case of a
journal article the third section contains the title of the journal and volume and
issue in which it is contained. In the case of a chapter in an edited volume, the
third section contains the name of the book, the name of the editor, and the
publisher. In the case of a book, the third section contains the publisher of the
book. Examples of the three types of references follow:
Journal Article:
Vokey, J. R., & Read, J. D. (1985). Subliminal messages: Between
the Devil and the media. American Psychologist, 40, 1231–1239.
Book:
Strunk, W., Jr., & White, E. B. (1979). The elements of style (3rd
ed.). New York: Macmillan.
Of course, one should almost always use “that” rather than “which”
(Strunk & White, 1979). That (not which) is why several style
experts suggest going on a "which" hunt when proofreading.
17.0.8 Notes
When writing the paper, all notes, including footnotes and author notes (a special
type of footnote that gives contact information for the author and acknowledges
those who have helped with the paper) appear at the end of the manuscript.
17.0.9 Tables
Any section of text or numbers that is set apart from the main body of the
text is called a table. Tables are most commonly used to present data (sets of
numbers) but are occasionally used to display examples of materials. Each table
requires an explanatory caption and must be referred to and described in the
body of the article. Each table is typed, along with its caption, on a separate
page. The tables appear, in the order in which they are referenced in the body
of the article, immediately following the footnotes page.
17.0.10 Figures
A figure is information of a pictorial nature. The most common use of figures is
to display data in a graph; less commonly, figures are used to show drawings or
pictures of apparatus or to summarize procedures. As with tables, figures are
consecutively numbered in the order in which they are referenced in the body
of the article. Also like tables, each figure requires an explanatory caption and
must be referred to and described in the body of the article. However, unlike
tables, the captions for the figures are printed separately from the figures on
the page immediately following the tables but immediately preceding the figures
themselves.
17.0.11 Appendices
An appendix contains something that is important but that is peripheral enough
to the main point of the article that it can be put at the end in a separate
section. The most common uses of appendices are for sets of materials (e.g.,
some interesting set of sentences with weird characteristics, or a questionnaire)
or complex statistical points that would break up the flow of the paper if left in
the main part (e.g., the mathematical derivation of a new statistic).
Part IV
Appendices
Appendix A
Summation
Most of the operations on and summaries of the data in this book are
nothing more than various kinds of sums: sums of the data themselves
and sums of different transformations of the data. Often these sums
are normalised or corrected for the number of items summed over, so that the
now-normalised sum directly reflects the magnitude of the items summed over,
rather than being in addition confounded with the sheer number of items making
up the sum. But even then, these normalised sums are still essentially sums.
Because sums are central to the discussion, it is important to understand the
few, simple properties of sums and the related principles of summation.
In a spreadsheet, for example, such a sum is written as the cell formula
=SUM(B1:B4), indicating that the sum of the items in cells B1 through B4 is to
be the value of the cell containing the formula.
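For readers who prefer code to spreadsheet cells, the same idea can be sketched in a few lines of Python; here the list b simply stands in for cells B1 through B4:

    # A minimal sketch of =SUM(B1:B4): the list b stands in for
    # cells B1 through B4, and total for the cell holding the formula.
    b = [3, 1, 4, 1]
    total = sum(b)
    print(total)  # 9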
A.1.2 The Summation Operator: $\Sigma$

Mathematicians, of course, can’t be denied their own stab at the arcane (indeed,
their nomenclature or terminology often exemplifies the word). In the original
Greek alphabet (i.e., alpha, beta, . . . ), the uppercase letter corresponding to
the Arabic (or modern) alphabet letter “S”—for “sum”—is called sigma, and is
written as: $\Sigma$. To write “the sum of X” in mathematese, then, one writes:

$$\sum X$$
Although it is redundant (and, when did that ever hurt?), the sum over all items,
where n denotes the number of items of X, is often written as:

$$\sum_{i=1}^{n} X_i$$
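A literal, if plodding, translation of this indexed notation into Python might run as follows (Python counts from 0, so the mathematical X1 . . . Xn become X[0] . . . X[n-1]):

    X = [3, 1, 4, 1]
    n = len(X)

    total = 0
    for i in range(n):   # i runs over the items, as i = 1, ..., n does above
        total += X[i]
    print(total)         # 9, the same value as sum(X)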
rule for multiplication. Viewing multiplication as just a rewriting rule for addition—as is the
case for everyday encounters with addition—implies a limited or reduced form of arithmetic
known as Presburger arithmetic. A more complex form of arithmetic—known as Peano
arithmetic—includes multiplication as a distinct concept independent of addition. For our
current purposes, Presburger arithmetic is all that is needed. For those with an interest in
esoterica, it is Peano arithmetic that is assumed as a premise in Gödel’s famous incompleteness
theorem; Presburger arithmetic is complete in this mathematical sense.
2 A related property is that subtraction distributes over addition: a − (b + c) = a − b − c.
In English, the sum of the numbers from the first to the last is equal to the sum
of the numbers from the first to the last but one (the “penultimate” number—isn’t
it wonderful that we have words for such abstruse concepts?) plus the last number.
But what is the sum of the numbers from the first to the penultimate number?
It is the sum of the numbers from the first to the penultimate but one, plus the
penultimate number, and so on, until we reach the sum of the first number,
which is just the first number.
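Restated in symbols, the recursive definition described above is:

$$\sum_{i=1}^{n} X_i = \left(\sum_{i=1}^{n-1} X_i\right) + X_n,
\qquad \text{with } \sum_{i=1}^{1} X_i = X_1$$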
Note how this summation function recursively calls itself (i.e., it uses itself in its
own definition). If you find this idea of recursion somewhat bewildering, consider
the question of how to live to be 100 years of age. Clearly, the answer is to live
to be 99, and then to be careful for one year. Ah, but how does one live to be
99? Clearly, live to be 98 and then be careful for one year, and so on.3 The
associative property of summation is like that: summation is just the summing
of sums. Ah, but how do you get the sums to summate? Clearly, by summation!
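For the curious, the recursion can be written out directly; the following is a minimal Python sketch of the idea, with an arbitrary set of scores:

    def recursive_sum(X):
        # The sum of one number is just that number ...
        if len(X) == 1:
            return X[0]
        # ... and otherwise the sum is the sum of all but the last item,
        # plus the last item: summation is just the summing of sums.
        return recursive_sum(X[:-1]) + X[-1]

    print(recursive_sum([1, 2, 3, 4, 5]))  # 15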
A.3.1 Rule 1: $\sum(X + Y) = \sum X + \sum Y$
This rule should be intuitively obvious; it states that the sum of the numbers
over the two sets X and Y is equal to the sum of the sums of X and Y ,
and follows from the fact that, for example, the sum of {1, 2, 3} + {4, 5, 6} =
(1 + 2 + 3) + (4 + 5 + 6) = 6 + 15 = 1 + 2 + 3 + 4 + 5 + 6 = 21. However, it is a
handy re-writing rule to eliminate the parentheses in the expression.
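The rule is easily verified numerically; in the Python sketch below, X + Y concatenates the two lists, mirroring the reading of {1, 2, 3} + {4, 5, 6} above:

    X = [1, 2, 3]
    Y = [4, 5, 6]

    print(sum(X + Y))       # 21: sum the combined scores
    print(sum(X) + sum(Y))  # 21: sum each set, then add the sums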
3 One could also ask how to be careful for one year, and the answer is just as clearly to live
for 364 days, and then be careful for one day, and so on. And how does one be careful for one
day? Live for 23 hours and then be careful for one hour . . . .
A.3.2 Rule 2: $\sum(X - Y) = \sum X - \sum Y$
Similar to rule 1, it makes no difference whether you subtract first and then sum
[$\sum(X - Y)$] or sum first and then subtract the respective sums ($\sum X - \sum Y$),
as long as the ordering is maintained. If, as discussed in A.2.1, subtraction is
re-written as the sum of negative numbers, then rule 2 becomes rule 1.
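Again, a quick numerical check, with made-up paired scores:

    X = [5, 17, 2, 4]
    Y = [1, 2, 3, 4]

    print(sum(x - y for x, y in zip(X, Y)))  # 18: subtract first, then sum
    print(sum(X) - sum(Y))                   # 18: sum first, then subtract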
A.3.3 Rule 3: $\sum(X + c) = \sum X + nc$
The associative property also allows for the simple computation of the new sum
following the addition of a constant to every item of X. Suppose, for example,
it was decided after set X (consisting, say, of hundreds or thousands of items)
had been laboriously summed that a value of c should be added to every score
(where c could be either negative or positive). One approach would be to ignore
the computed sum of X, and to add c to every item of X, producing a new set
of scores Z, and then sum the items of Z. Workable, but tedious.
The associative property, however, suggests another, more efficient approach.
For every item $X_i$ of $X$,

$$Z_i = X_i + c$$

So,

$$\sum_{i=1}^{n} Z_i = (X_1 + c) + (X_2 + c) + \cdots + (X_n + c)$$

As there is one c for each item in X, there are a total of n values of c being
added to the original sum:

$$\sum_{i=1}^{n} Z_i = \sum_{i=1}^{n} X_i + nc$$
which is just the original sum plus nc. So, if you have the sum for, say, 200
scores, and wish to compute the sum for the same scores with 5 added to each
one (e.g., some generous stats prof decided to give each of 200 students 5 bonus
marks), the new sum is simply the old sum plus 200 ∗ 5, or the old sum plus 1000.
Cool. The whole operation can be summed up (pun intended!) in the following
expression, and follows from rule 1:
$$\sum_{i=1}^{n}(X + c) = \sum_{i=1}^{n} X + \sum_{i=1}^{n} c = \sum_{i=1}^{n} X + nc$$
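The bonus-marks example can be checked directly; in the sketch below, 200 arbitrary scores stand in for the students' marks:

    X = list(range(200))  # 200 arbitrary scores
    c = 5                 # the 5 bonus marks

    old_sum = sum(X)
    new_sum = sum(x + c for x in X)
    print(new_sum, old_sum + len(X) * c)  # equal: the old sum plus nc = 1000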
A.3.4 Rule 4: $\sum_{i=1}^{n} c = nc$
The sum of a constant is simply n times that constant.
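In code, rule 4 is nothing more than adding the same number n times:

    n, c = 20, 3
    print(sum(c for _ in range(n)))  # 60
    print(n * c)                     # 60, the same thing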
A.3.5 Rule 5: $\sum cX = c \sum X$
What if every number in X, instead of having a constant added to it, was
multiplied by a constant (where the constant has either positive—multiplication—
or negative—division—exponents; see section A.2.1)? Again, the original $\sum X$
could be ignored, each $X_i$ multiplied by c to produce a new value, $Z_i = cX_i$,
and then $\sum Z$ computed. However, as every $X_i$ is multiplied by the same
constant,

$$\sum_{i=1}^{n} cX_i = cX_1 + cX_2 + \cdots + cX_n
= c(X_1 + X_2 + \cdots + X_n) = c\sum_{i=1}^{n} X_i$$
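And a final numerical check of rule 5, again with made-up scores:

    X = [5, 17, 2, 4]
    c = 3

    print(sum(c * x for x in X))  # 84: multiply each item, then sum
    print(c * sum(X))             # 84: sum first, then multiply by c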
A.4 Questions
1. The sum of 20 scores is found to be 50. If 2 were to be subtracted from
every one of the 20 scores, what would the new sum be?
2. X = {5, 17, 2, 4} and $\sum_{i=1}^{4} Y_i = 10$. If each item of Y were subtracted from
its corresponding item in X, what would the new sum be?
3. X = {2, 11, 5, 17, 2, 8, 9, −4, 2.3}. $\sum_{i=3}^{8} X_i$ = ?
Appendix B

Keppel’s System for Complex ANOVAs

This chapter is loosely based on Appendix C-4 from Keppel (1982), and
represents an attempt to render it intelligible. It is included here at the
request of my (JRV) senior and former students, who believe, apparently,
that only this chapter makes anything other than simple one-way ANOVAs make
sense. It is, as one might say, a “good trick”: if you want to wow your colleagues
with your ability to determine quickly the appropriate error-terms for the effects
of any ANOVA design, this appendix provides the algorithm you need.1 Indeed,
I (JRV) have used it as the basis of more than one general ANOVA computer
program. Note that where it provides for no error-term, a quasi-F may be
computed by combining the existing mean-squares either strictly by addition
(F’ ) or by some combination of addition and subtraction (F”—allowing for the
unfortunate possibility of negative F-ratios). The logic of quasi-Fs is, in one
sense, unassailable (as long as one is thinking “approximation to a randomization
or permutation test”), but is otherwise questionable, at best.
To provide a concrete context, the following research design will serve to
exemplify the steps and the logic. Suppose we have 20 nouns and 20 verbs—
presumed to be a random sample from the larger population of nouns and
verbs. Following the convention of using uppercase letters to denote fixed factors
and lowercase letters to denote random factors, noun vs. verb forms a fixed,
word-type factor, T . Each participant receives either the 20 nouns or the 20
verbs, with 10 participants in each condition. Thus, the random factor, words,
w, is nested within T , and participants, s, are also nested within T , but within
each level of T , s crosses w, as shown in Table B.1.
1 Keppel (1982) is the best book on ANOVA ever written, in my opinion; the next edition
(the third) of this book, for reasons I can’t fathom, stripped out most of what made Keppel
(1982) so good, including Appendix C-4. If you can find a copy of Keppel (1982), hang on to
it.
Table B.1: The research design: words (w) and subjects (s) are nested within the
fixed word-type factor (T ), and s crosses w within each level of T .

                           Word Type (T )
                  Noun                        Verb
         w1   w2   w3  ...  w20      w21  w22  w23  ...  w40
    s1                               s11
    s2                               s12
    s3                               s13
    ...                              ...
    s10                              s20
2. Construct all possible interactions, again treating the factors as if they all
cross:
T, w, s, T · w, T · s, w · s, T · w · s
3. Denote nesting for each potential source wherever it occurs using the slash
notation of Lindman (1974). Where letters repeat to the right of the slash,
write them only once:
2. If the source contains a nested factor, then multiply the product of the dfs
of the factors to the left of the slash by the product of the levels of the
factors to the right of the slash.
Using lowercase letters of each of the factors to denote the levels of each factor,
we obtain the following for the sources remaining at the end of step 4 from
section B.1:
Table B.2: Calculation of the degrees of freedom for the research design in Table
B.1.
T ⇒ (w)(s)T
w/T ⇒ (s)w/T
s/T ⇒ (w)s/T
w · s/T ⇒ w · s/T
There are two rules for generating the expected mean squares, in both cases
ignoring the letters corresponding to coefficients expressed in parentheses:
1. List the variance component term identifying the source in question. This
term is known as the null hypothesis component for that source.
Table B.3: The expected mean squares for the sources of variance in the design
shown in Table B.1.

The effect of words within Type, w/T , for example, is tested against the
w · s/T interaction:

$$F(df_{w/T},\; df_{w \cdot s/T}) = \frac{MS_{w/T}}{MS_{w \cdot s/T}}$$
because the expected mean square for the w · s/T interaction differs only by the
null hypothesis component of w/T from the expected mean square of w/T , as
may be seen in Table B.3. The same is true for the effect of “subjects within
Type”, s/T , and so it takes the same error-term.
But what of the effect of word-Type, T ? There is no corresponding expected
mean-square from among the other effects that matches that of T in all but the
null hypothesis component of T . With no corresponding error-term, there is no
F -ratio that may be formed to test the effect of word-Type, T .
Appendix C
Answers to Selected
Questions
Yes, as students, we always hated the “selected” part, too. Until, that
is, we wrote our own book to accompany the course. Why not answer
every question, so the reader can see how the “expert” (or, at least,
the textbook authors) would answer each and every question? The concern is
just that, because the answers would thereby be so readily available, the reader
would not take the extra minutes to solve the problem on her or his own. And it
is those extra few minutes of deliberation and problem solving that we believe
are essential to gaining a good grasp and feel for the material. “Fine”, you might
say. “But is not watching the ‘expert’ at work valuable, and, indeed, instructive?”
We can’t but agree. View it as a compromise. Frustrating? Yes. Necessary?
Also yes.
Rather than simply provide each numeric answer itself, we have worked
out each of the answers here in some detail so it should be apparent how we
arrived at each of them. We urge the reader to do the same for the remaining
unanswered questions.
5. If Allen Scott had the lowest grade of the 5 males in the class who otherwise
would have had the same mean of 90 as the 10 females, then the remaining
4 males must have had a mean of 90, for a sum of 4 ∗ 90 = 360. The sum
for the 10 females, then, was 10 ∗ 90 = 900, and for the class as a whole,
15 ∗ 85 = 1275. Thus, Allen Scott’s grade must have been the difference
between the sum for the class as a whole minus the sums for the females
and the 4 males or 1275 − 900 − 360 = 15.
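The arithmetic is easily confirmed; a quick sketch, with the quantities as given in the answer:

    class_sum  = 15 * 85  # the class as a whole: 15 students, mean of 85
    female_sum = 10 * 90  # the 10 females, mean of 90
    male4_sum  = 4 * 90   # the remaining 4 males, mean of 90

    print(class_sum - female_sum - male4_sum)  # 15, Allen Scott's grade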
$r_{pb} = 0.8$
Thus, a substantial correlation of major with sadistics (sic) grade, with
psychology majors exceeding business majors.
$$r_{(gb).s} = \frac{-.83 - (-.8325)(.665)}{\sqrt{1 - (-.8325)^2}\,\sqrt{1 - .665^2}}$$

$$r_{(gb).s} = \frac{-.2763}{.41} = -.67$$
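For those who want to check the arithmetic by machine, the same computation in Python (reading the three correlations off the formula as printed, so that r_gb = −.83, r_gs = −.8325, and r_bs = .665; the subscript names are our guess from the answer's notation):

    from math import sqrt

    r_gb, r_gs, r_bs = -.83, -.8325, .665

    r_gb_s = (r_gb - r_gs * r_bs) / (sqrt(1 - r_gs ** 2) * sqrt(1 - r_bs ** 2))
    print(round(r_gb_s, 2))  # -0.67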
Some people are not happy leaving the answer in this form, feeling, ap-
parently, that if the answer is not reduced to a number the question has
not been fully answered. To that end, we’ll get you started by noting
that the term $\binom{6}{4}$ is $\frac{6!}{4!2!}$, which is $\frac{(6)(5)(4!)}{2!4!}$.
Cancelling the $4!$ yields $\frac{(6)(5)}{2} = (3)(5) = 15$.
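Or, one can let the computer do the cancelling; Python's math.comb computes the binomial coefficient directly:

    from math import comb, factorial

    print(comb(6, 4))                                     # 15
    print(factorial(6) // (factorial(4) * factorial(2)))  # 15, the long way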
6b. In the table, 35 ordered beer and 25 didn’t. Of the 60, 2/3 or 40 would be
expected to order beer. So, $z^2 = (35 - 40)^2/40 + (25 - 20)^2/20 = 1.875$.
Comparing that value to the $\chi^2$ table with 1 degree of freedom, we find
that the probability of observing such a result given the truth of the null
hypothesis is between .15 and .20—much greater than the .05 usually
required for significance. Hence, we conclude that there is not sufficient
evidence to reject the hypothesis that the sample is a random sample of
students from the general population of students, at least with respect to
ordering beer.
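For readers with SciPy at hand, the same goodness-of-fit test can be checked in a line or two:

    from scipy.stats import chisquare

    # Observed: 35 ordered beer, 25 did not; expected: 40 and 20.
    stat, p = chisquare(f_obs=[35, 25], f_exp=[40, 20])
    print(stat, round(p, 3))  # 1.875, p of about .171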
Index
t-distributions, 142
t-test, 143
t-test for dependent samples, 145
t-test for independent samples, 146
tautology, 152
Tchebycheff inequality, 50
Tchebycheff, Pafnuti L., 50
Tied Ranks, 61
Trans-Canada highway, 27
transformed scores, 35
two-tailed test, 103
Type I error, 103
Type II error, 103
Vancouver, 27
variability, 25
variance, 26
variance (definition), 29
WAIS, 40
Wechsler Adult Intelligence Scale, 40
Weldon, Walter, 53
Wesayso corporation, 76
which hunt, 170
Z-score, 38