
PSYCHOLOGICAL SCIENCE

General Article

An Alternative to Null-Hypothesis Significance Tests

Peter R. Killeen

Arizona State University

ABSTRACT—The statistic p_rep estimates the probability of replicating an effect. It captures traditional publication criteria for signal-to-noise ratio, while avoiding parametric inference and the resulting Bayesian dilemma. In concert with effect size and replication intervals, p_rep provides all of the information now used in evaluating research, while avoiding many of the pitfalls of traditional statistical inference.

Psychologists, who rightly pride themselves on their methodological expertise, have become increasingly embarrassed by "the survival of a flawed method" (Krueger, 2001) at the heart of their inferential procedures. Null-hypothesis significance tests (NHSTs) provide criteria for separating signal from noise in the majority of published research. They are based on inferred sampling distributions, given a hypothetical value for a parameter such as a population mean (μ) or difference of means between an experimental group (μ_E) and a control group (μ_C; e.g., H0: μ_E − μ_C = 0). Analysis starts with a statistic on the obtained data, such as the difference in the sample means, D. D is a point on the line with probability mass of zero. It is necessary to relate that point to some interval in order to engage probability theory. Neyman and Pearson (1933) introduced critical intervals over which the probability of observing a statistic is less than a stipulated significance level, α (e.g., z scores between [−∞, −2] and [+2, +∞], over which α < .05). If a statistic falls within those intervals, it is deemed significantly different from that expected under the null hypothesis. Fisher (1959) preferred to calculate the probability of obtaining a statistic larger than |D| over the interval [|D|, +∞]. This probability, p(x ≥ D|H0), is called the p value of the statistic. Researchers typically hope to obtain a p value sufficiently small (viz., less than α) so that they can reject the null hypothesis.

This is where problems arise. Fisher (1959), who introduced NHST, knew that "such a test of significance does not authorize us to make any statement about the hypothesis in question in terms of mathematical probability" (p. 35). This is because such statements concern p(H0|x ≥ D), which does not generally equal p(x ≥ D|H0). The confusion of one conditional for the other is analogous to the conversion fallacy in propositional logic. Bayes showed that p(H|x ≥ D) = p(x ≥ D|H)p(H)/p(x ≥ D). The unconditional probabilities are the priors, and are largely unknowable. Fisher (1959) allowed that p(x ≥ D|H0) may "influence [the null's] acceptability" (p. 43). Unfortunately, absent priors, "P values can be highly misleading measures of the evidence provided by the data against the null hypothesis" (Berger & Sellke, 1987, p. 112; also see Nickerson, 2000, p. 248). This constitutes a dilemma: On the one hand, "a test of significance contains no criterion for 'accepting' a hypothesis" (Fisher, 1959, p. 42), and on the other, we cannot safely reject a hypothesis without knowing the priors. Significance tests without priors are the "flaw in our method."

There have been numerous thoughtful reviews of this foundational issue (e.g., Nickerson, 2000), attempts to make the best of the situation (e.g., Trafimow, 2003), proposals for alternative statistics (e.g., Loftus, 1996), and defenses of significance tests and calls for their abolition alike (e.g., Harlow, Mulaik, & Steiger, 1997). When so many experts disagree on the solution, perhaps the problem itself is to blame. It was Fisher (1925) who focused the research community on parameter estimation "so convincingly that for the next 50 years or so almost all theoretical statisticians were completely parameter bound, paying little or no heed to inference about observables" (Geisser, 1992, p. 1). But it is rare for psychologists to need estimates of parameters; we are more typically interested in whether a causal relation exists between independent and dependent variables (but see Krantz, 1999; Steiger & Fouladi, 1997). Are women attracted more to men with symmetric faces than to men with asymmetric faces? Does variation in irrelevant dimensions of stimuli affect judgments on relevant dimensions? Does review of traumatic events facilitate recovery? Our unfortunate

Address correspondence to Peter Killeen, Department of Psychology, Arizona State University, Tempe, AZ 85287-1104; e-mail: killeen@asu.edu.

Volume 16—Number 5 · Copyright © 2005 American Psychological Society · 345
Downloaded from pss.sagepub.com at INDIANA UNIV BLOOMINGTON on April 6, 2010

We Can Reject the Null Hypothesis

historical commitment to significance tests forces us to rephrase these good questions in the negative, attempt to reject those nullities, and be left with nothing we can logically say about the questions—whether p = .100 or p = .001. This article provides an alternative, one that shifts the argument by offering "a solution to the question of replicability" (Krueger, 2001, p. 16).

PREDICTING REPLICABILITY

Consider an experiment in which the null hypothesis—no difference between experimental and control groups—can be rejected with a p value of .049. What is the probability that we can replicate this significance level? That depends on the state of nature. In this issue, as in most others, NHST requires us to take a stand on things that we cannot know. If the null is true, ceteris paribus we shall succeed—get a significant effect—5% of the time. If the null is false, replicability depends on the population effect size, δ. Power analysis varies the hypothetical discrepancy between the means of control and experimental populations, giving the probability of appropriately rejecting the null under those various assumptive states of nature. This awkward machinery is seldom invoked outside of grant proposals, whose review panels demand an n large enough to provide significant returns on funding.

Greenwald, Gonzalez, Guthrie, and Harris (1996) reviewed the NHST controversy and took the first clear steps toward a useful measure of replicability. They showed that p values predict the probability of getting significance in a replication attempt when the measured effect size, d′, equals the population effect size, δ. This postulate, δ = d′, complements NHST's δ = 0, while making better use of the available data (i.e., the observed d′ > 0). But replicating "significance" replicates the dilemma of significance tests: Data can speak to the probability of H0 and the alternative, HA, only after we have made a commitment to values of the priors. Abandoning the vain and unnecessary quest for definitive statements about parameters frees us to consider statistics that predict replicability in its broadest sense, while avoiding the Bayesian dilemma.

The Framework

Consider an experimental group and an independent control group whose sample means, M_E and M_C, differ by a score of D. The corresponding dimensionless measure of effect size d′ (called d by Cohen, 1969; g by Hedges & Olkin, 1985; and d′ in signal detectability theory) is

    d′ = (M_E − M_C) / s_p,    (1)

where s_p is the pooled within-group standard deviation. If the experimental and control populations are normal and the total sample size is greater than 20 (n_E + n_C = n > 20), the sampling distribution of d′ is approximately normal (Hedges & Olkin, 1985; see the top panel of Fig. 1 and the appendix):

    d′ ~ N(δ, σ_d).    (2)

σ_d is the standard error of the estimate of effect size, the square root of

    σ_d² ≈ n² / (n_E n_C (n − 4)),    (3)

for n > 4. When n_E = n_C, Equation 3 reduces to σ_d² ≈ 4/(n − 4).

Define replication as an effect of the same sign as that found in the original experiment. The probability of a replication attempt having an effect d′₂ greater than zero, given a population effect size of δ, is the area to the right of 0 in the sampling distribution centered at δ (middle panel of Fig. 1). Unfortunately, we do not know the value of the parameter δ and must therefore eliminate it.

Fig. 1. Sampling distributions of effect size (d). The top panel shows a distribution for a population effect size of δ = 0.1; the experiment yielded an effect size of 0.3, and thus had a sampling error Δ₁ = d′₁ − δ = 0.2. The middle panel shows the probability of a replication as the area under the sampling distribution to the right of 0, given knowledge that δ = 0.1. The bottom panel shows the posterior predictive density of effect size in replication. Absent knowledge of δ, the probability of replication is predicted as the area to the right of 0.

Fig. 2. Probability of replication (p_rep) as a function of the number of observations and measured effect size, d′₁. The functions in each panel show p_rep for values of d′₁ increasing in steps of 0.1, from 0.10 (lowest curve) to 1.0 (highest curve). The dashed lines show the combination of effect size and n necessary to reject a null hypothesis of no difference between the means of the experimental and control groups (i.e., μ_E − μ_C = 0) using a two-tailed t test with α = .05. When realization variance, σ_δ², is 0 (left panel), replicability functions asymptote at 1.0. For a one-tailed test, the dashed line drops to .88. When realization variance is 0.08 (right panel), the median for social psychological research, replicability functions asymptote below 1.0. As n approaches infinity, the t-test criterion falls to an asymptote of .5.
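The replicability functions plotted in Figure 2 can be recomputed from Equations 3, 5, and 6, introduced in the section that follows; a sketch assuming equal group sizes, so that Equation 3 reduces to 4/(n − 4):

```python
import math
from statistics import NormalDist

def p_rep(d1, n_orig, n_rep=None):
    """Probability that a replication preserves the sign of the observed
    effect d1 (Equations 3, 5, and 6). The replicate defaults to an
    equipotent one (same n), which doubles the sampling variance."""
    if n_rep is None:
        n_rep = n_orig
    # Equal-groups form of Equation 3, once for the original study and
    # once for the replicate; the two sampling variances add.
    var = 4 / (n_orig - 4) + 4 / (n_rep - 4)
    z = d1 / math.sqrt(var)          # Equation 6
    return NormalDist().cdf(z)       # right integral of Equation 5
```

p_rep(0.5, 24) reproduces the .785 of the worked example below, and p_rep(0.49, 124, 18) the .81 of the classroom "power down" example discussed later in the article.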
Eliminating δ

Define the sampling error, Δ, as Δ = d′ − δ (Fig. 1, top panel). For the original experiment, this equation may be rewritten as δ = d′₁ − Δ₁. Replication requires that if d′₁ is greater than 0, then d′₂ is also greater than 0; that is, that d′₂ = δ + Δ₂ > 0. Substitute d′₁ − Δ₁ in place of δ in this equation. Replication thus requires that d′₂ = d′₁ − Δ₁ + Δ₂ > 0. The expectation of each sampling error is 0 with variance σ_d². For independent replications, the variances add, so that d′₂ ~ N(d′₁, σ_dR), with σ_dR = √2 σ_d1. The probability of replication, p_rep, is the area of the distribution for which d′ is greater than 0, shaded in the bottom panel of Figure 1:

    p_rep = ∫_0^∞ n(d′₁, σ_dR).    (4)

Slide the distribution to the left by the distance d′₁ to see that Equation 4 describes the same area as

    p_rep = ∫_{−d′₁}^∞ n(0, σ_dR) = ∫_{−∞}^{d′₁} n(0, σ_dR).    (5)

It is easiest to calculate p_rep from the right integral in Equations 5, by consulting a normal probability table for the cumulative probability up to

    z = d′₁ / σ_dR.    (6)

Example

Suppose an experiment with n_E = n_C = 12 yields a difference between experimental and control groups of 5.0 with s_p = 10.0. This gives an effect of d′₁ = 0.5 (Equation 1) with a variance of σ_d1² ≈ 4/(24 − 4) = 0.20 (Equation 3), and a replication variance of σ_dR² = 2σ_d1² ≈ 0.40. From this, it follows that z = 0.5/√0.40 = 0.79 (Equation 6). A table of the normal distribution assigns a p_rep of .785.¹

As the hypothetical number of observations in the replicate approaches infinity, the sampling variance of the replication goes to zero, and p_rep is the positive area of N(d′₁, σ_d1). This is the sampling distribution of a standard power analysis at the maximum likelihood value for δ, and establishes an upper bound for replicability. It is unlikely, however, that the next investigator will have sufficient resources or interest to approach that upper bound. By default, then, p_rep is defined for equipotent replications, ones that employ the same number of subjects as the original experiment and experience similar levels of sampling error. The probability of replication may be calculated under other scenarios (as shown later), but for purposes of qualifying the data in hand, equipotency, which doubles the sampling variance, is assumed.

The left panel of Figure 2 shows the probability of replicating the results of an experiment whose measured effect size is d′₁ = 0.1 (bottom curve), 0.2, …, 1.0, as a function of the number of observations in the original study. These results permit a comparison with traditional measures of significance. The dashed line connects the effect sizes necessary to reject the null under a two-tailed t test, with probability of a Type I error, α, less than .05. Satisfying this criterion is tantamount to establishing a p_rep of approximately .917.

Parametric Variance

The calculations presented thus far assume that the variance contributed by contextual variables in the replicate is negligible

¹Excel spreadsheets with relevant calculations are available from http://www.asu.edu/clas/psych/research/sqab and from http://www.latrobe.edu.au/psy/esci/.


compared with the sampling error of δ. This is the classic fixed-effects model of science. But every experiment is a sample from a population of possible experiments on the topic, and each of those, with its own differences in detail, has its own subspecies of effect size, δ_i. This is true a fortiori for correlational studies involving different instruments or moderators (Mosteller & Colditz, 1996). The population of effect sizes adds a realization variance, σ_δ², to the sampling distributions of the original and the replicate (Raudenbush, 1994; Rubin, 1981; van den Noortgate & Onghena, 2003), so that the standard error of effect size in replication becomes

    σ_dR = √(2(σ_d1² + σ_δ²)).    (7)

In a recent meta-meta-analysis of more than 25,000 social science studies, Richard, Bond, and Stokes-Zoota (2003) reported a mean within-literature variance of σ_δ² = 0.092 (median = 0.08), corrected for sampling variance (Hedges & Vevea, 1998). The statistic σ_δ² places an upper limit on the probability of replication, one felt most severely by studies with small effect sizes. This is shown graphically in the right panel of Figure 2. The probability of replication no longer asymptotes at 1.0, but rather at

    p_rep(max) = 1 − ∫_{d′}^∞ n(0, √2 σ_δ).

At n = 100, the functions shown in the right panel of Figure 2 are no more than 5 points below their asymptotes. Given a representative σ_δ² of 0.08, for no value of n will a measured effect size of d′ less than 0.52 attain a p_rep greater than .90; but this standard comes within reach of a sample size of 40 for a d′ of 0.8.

Reliance on standard hypothesis-testing techniques that ignore realization variance may be one of the causes for the dismayingly common failures of replication. The standard t test will judge an effect of any size significant at a sufficiently large n, even though the odds for replication may be very close to chance. Figure 2 provides understanding, if no consolation, to investigators who have failed to replicate published findings of high significance but low effect size. The odds were never very much in their favor. Setting a replicability criterion for publication that includes an estimate of realization variance would filter the correlational background noise noted by Meehl (1997) and others.

Claiming replicability for an effect that would merely be of the same sign may seem too liberal, when the prior probability of that is 1/2, but traditional null-hypothesis tests are themselves at best merely directional. The proper metric of effect size is d or r, not p or p_rep. In the present analysis, replicability qualifies effect, not effect size: A d′₂ of 2.0 constitutes a failure to replicate an effect size (d′₁) of 0.3, but is a strong replication of the effect. Requiring a result to have a p_rep of .9 exacts a standard comparable to (Fig. 2, left panel) or exceeding (right panel) the standard of traditional significance tests.

Does p_rep really predict the probability of replication? In a meta-analysis of 37 studies of the psychophysiology of aggression, including unpublished nonsignificant data sets, Lorber (2004) found that 70% showed a negative relation between heart rate and aggressive behavior patterns. The median value of p_rep over those studies was .71 (.69 assuming σ_δ² = 0.08). In a meta-analysis of 37 studies of the effectiveness of massage therapy, Moyer, Rounds, and Hannum (2004) found that 83% reported positive effects on various dependent variables; including an estimate of publication bias against negative results reduced this value to 74%. The median value of p_rep over those studies was .75 (.73 assuming σ_δ² = 0.08). In a meta-analysis of 45 studies of transformational leadership, Eagly, Johannesen-Schmidt, and van Engen (2003) found that 82% showed an advantage for women, and argued against attenuation by publication bias. The median value of p_rep over these studies was .79 (dropping to .68 for σ_δ² = 0.08 because of the generally small effect sizes). Averaging values of p_rep and counting the proportion of positive results are both inefficient ways of aggregating and evaluating data (Cooper & Hedges, 1994), but such analyses provide face validity for p_rep, which is intended primarily as a measure of the robustness of studies taken singly.

Generalizations

Whenever an effect size can be calculated (see Rosenthal, 1994, for conversions among indices; Cortina & Nouri, 2000, for analysis of variance designs; Grissom & Kim, 2001, for caveats), so also can p_rep. Randomization tests, described in the appendix, facilitate computation of p_rep for complex designs or situations in which assumptions of normality are untenable. Calculation of the n required for a desired p_rep is straightforward. For a presumptive effect size of d and realization variance of σ_δ², calculate the z score corresponding to p_rep, and employ an n = n_E + n_C no fewer than

    n = 8z² / (d² − 2σ_δ² z²) + 4.    (8)

Negative results indicate that the desired p_rep is unobtainable for that σ_δ². For example, for d = 0.8, σ_δ² = 0.08, and a desired p_rep = .9, z(.9)² = 1.64, and the minimum n is 40.

Stronger claims than replication of a positive effect are sometimes warranted. An investigator may wish to claim that a new drug is more effective than a standard. The replicability of the data supporting that claim may be calculated by integrating Equation 4 not from 0, but from d_s, the effect size of the standard bearer. Editors may prefer to call a result replicable only if it accounts for, say, at least 1% of the variance in the data, for which d′ must be greater than 0.04. They may also require that it pass the Akaike criterion for adding a parameter (distinct means for experimental and control groups; Burnham & Anderson, 2002), for which r² must be greater than 1 − e^(−2/n). Together, these constraints define a lower limit for "replicable" at p_rep ≈ .55. However these minima are set, a fair assessment of σ_δ is necessary for p_rep to give investigators a fair assessment of replicability.
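Equation 8 is easy to automate; a sketch (the rounding up to an even total, so that n_E and n_C can be equal, is my assumption rather than the paper's, but it reproduces the minimum of 40 quoted above):

```python
import math
from statistics import NormalDist

def min_total_n(d, realization_var, desired_p_rep):
    """Equation 8: smallest total n = n_E + n_C yielding the desired p_rep
    for a presumptive effect size d and realization variance sigma_delta^2.
    Returns None when the target is unobtainable at that variance."""
    z = NormalDist().inv_cdf(desired_p_rep)
    denom = d ** 2 - 2 * realization_var * z ** 2
    if denom <= 0:
        return None                  # Equation 8 would go negative
    n = 8 * z ** 2 / denom + 4
    return 2 * math.ceil(n / 2)      # round up to an even total
```

For d = 0.8, σ_δ² = 0.08, and a desired p_rep of .9 this returns 40; for d = 0.3 at the same variance the denominator is negative and the target is unobtainable.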


The replicability of differences among experimental conditions is calculated the same way as that between experimental and control conditions. Multiple comparisons are made by the conjunction or disjunction of p_rep: If treatments A and B are independent, each with p_rep of .80, the probability of replicating both effects is .64, and the probability of replicating at least one is .96. The probability of n independent attempts to replicate an experiment all succeeding is p_rep^n.
As is the case for all statistics, there is sampling variability associated with d′, so that any particular value of p_rep may be more or less representative of the values found by other studies executed under similar conditions. It is an estimate. Replication intervals (RIs) aid interpretation by reflecting p_rep onto the measurement axis. Their calculation is the same as for confidence intervals (CIs), but with variance doubled. RIs can be used as equivalence tests for evaluating point predictions. The standard error of estimate conveniently captures 52% of future replications (Cumming, Williams, & Fidler, 2004). This familiar error bar can therefore be interpreted as an approximate 50% RI. In the example given earlier, for σ_δ = 0, the 50% RI for D is approximately 5 ± √(2(10²/24)) ≈ [2.1, 7.9].
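The arithmetic in the last two paragraphs can be sketched directly (function names are mine, not the paper's):

```python
import math

def p_rep_both(p_a, p_b):
    """Conjunction: probability of replicating two independent effects."""
    return p_a * p_b

def p_rep_either(p_a, p_b):
    """Disjunction: probability of replicating at least one of the two."""
    return 1 - (1 - p_a) * (1 - p_b)

def ri_50(diff, s_p, n):
    """Approximate 50% replication interval for a raw difference D:
    the standard-error bar with its variance doubled, as in the text."""
    half_width = math.sqrt(2 * s_p ** 2 / n)
    return diff - half_width, diff + half_width
```

p_rep_both(.8, .8) gives .64 and p_rep_either(.8, .8) gives .96; ri_50(5.0, 10.0, 24) recovers the interval [2.1, 7.9] of the example given earlier.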

WHY SWITCH?

Sampling distributions for replicates involve two sources of variance, leading to a root-2 increase in the standard error over that used to calculate significance. Why incur that cost? Both p and p_rep are functions of effect size and n, and so convey similar information: The top panel in Figure 3 shows p as the area in the right tail of the sampling distribution of d′₁, given the null, and p_rep as the area in the right tail of the prospective sampling distribution of d′₂, given d′₁. As d′₁ or n varies, p_rep and p change in complement.

Fig. 3. Complementarity of p_rep and p. The top panel shows sampling distributions for d′₁ given the null (left) and for d′₂ given d′₁ (right). The small black area gives the probability of finding a statistic more extreme than d′₁ if the null were true. The large shaded area gives the probability of finding supportive evidence in an equipotent replication. In the bottom panel, p_rep is plotted against the p values calculated for the normal distribution under the null hypothesis with d′ = 0.1, 0.2, …, 1.0, and n ranging from 10 to 80; p_rep is calculated from Equations 3, 5, and 6. The function is described in the appendix.

Recapturing a familiar index of merit is reassuring, as are the familiar calculations involved; but these analyses are not equivalent. Consider the following contrasts:

Intuitive Sense

What is the difference between p values of .05 and .01, or between p values of .01 and .001? If you follow Neyman-Pearson and have set α to be .05, you must answer, "Nothing" (Meehl, 1978). If you follow Fisher, you can say, "The probability of finding a statistic more extreme than this under the null is p." Now compare those p values, and the oblique responses they support, with their corresponding values of p_rep shown in the bottom panel of Figure 3. These steps in p values take us from p_rep of .88 to .95 to .99—increments that are clear, interpretable, and manifestly important to a practicing scientist.

Logical Authority

Under NHST, one can never accept a hypothesis, and is often left in the triple-negative no-man's land of failure to reject the null. The p_rep statistic provides a graded measure of replicability that authorizes positive statements about results: "This effect will replicate 100(p_rep)% of the time" conveys useful information, whatever the value of p_rep.

Real Power

Traditionally, replication has been viewed as a second successful attainment of a significant effect. The probability of getting a significant effect in a replicate is found by integrating Equation 4 from a lower limit given by the critical value d_α = σ_dR t_{α,n−2}. This calculation does not require that the original study achieved significance. Such analyses may help bridge to the new perspective; but once p_rep is determined, calculation of traditional significance is a step backward. The curves in Figure 2 predict the replicability of an effect given known results, not the probability of a statistic given the value of a parameter whose value is not given.

Elimination of Errors

Significance level is defined as the probability of rejecting the null when it is true (a Type I error of probability α); power is defined as the probability of rejecting the null when it is false,

and not doing so is a Type II error. False premises lead to conclusions that may be logically consistent but empirically invalid, a Type III error. Calculations of p are contingent on the null being true. Because the null is almost always false (Cohen, 1994), investigators who imply that manipulations were effective on the basis of a p less than α are prone to Type III errors. Because p_rep is not conditional on the truth value of the null, it avoids all three types of error.

One might, of course, be misled by a value of p_rep that itself cannot be replicated. This can be caused by

- sampling error: d′₁ may deviate substantially from δ (RIs help interpret this risk);
- failure to include an estimate of σ_δ² in the replication variance;
- publication bias against small or negative effects;
- the presence of confounds, biased data selection, and other missteps that plague all mapping of particular results to general claims.

Because of these uncertainties, p_rep is only an estimate of the proportion of replication attempts that will be successful. It measures the robustness of a demonstration; its accuracy in predicting the proportion of positive replications depends on the factors just listed.

Greater Confidence

The American Psychological Association (Wilkinson & the Task Force on Statistical Inference, 1999) has called for the increased use of CIs. Unfortunately, few researchers know how to interpret them, and fewer still know where to put them (Cumming & Finch, 2001; Cumming et al., 2004; Estes, 1997; Smithson, 2003; Thompson, 2002). CIs are often drawn centered over the sample statistic, as though it were the parameter; when a CI does not subsume 0, it is often concluded that the null may be rejected. The first practice is misleading, and the second wrong. CIs are derived from sampling distributions of M around a hypostatized μ: |μ − M| will be less than the CI 100p% of the time. But as difference scores, CIs have lost their location. Situating them requires an implicit commitment to parameters—either to μ = 0 for NHST or to μ = M for the typical position of CIs flanking the statistic. Such a commitment, absent priors, runs afoul of the Bayesian dilemma. In contrast, RIs can be validly centered on the statistic to which they refer, and the replication level may be correctly interpreted as the probability that the statistics of future equipotent replications will fall within the interval.

Decision Readiness

Significance tests are said to provide decision criteria essential to science. But it is a poor decision theory that takes no account of prior information and no account of expected values, and in the end lets us decide only whether or not to reject a statistic as improbable under the null. As a graduated measure, p_rep provides a basis for a richer approach to decision making than the Neyman-Pearson strategy, currently the mode in psychology. Decision makers may compute expected value, E(v), by multiplying p_rep or its complement by the values they assign outcomes. Let v⁺(d′) be the value of positive action for an effect size d′, including potential costs for small or contrary effects. Then

    E(v⁺) = ∫_{−∞}^{+∞} v⁺(x) n(x; d′₁, σ_R).

Comparison with an analogous calculation for E(v⁻) will inform the decision.

Congeniality With Bayes

Probability theory provides a unique basis for the logic of science (Cox, 1961), and Bayes' theorem provides the machinery to make science cumulative (Jaynes & Bretthorst, 2003; see the appendix). Falsification of the null cannot contribute to the cumulation of knowledge (Stove, 1982); the use of Bayes to reduce σ_dR² can. NHST stipulates an arbitrary mean for the test statistic a priori (0) and a variance a posteriori (s_p²/n). The statistic p_rep uses both moments of the observed data in a coherent fashion to predict the most likely posterior distribution of the replicate statistic. Information from replicates may be pooled to reduce σ_δ² (Louis & Zelterman, 1994; Miller & Pollack, 1994). Systematic explorations of phenomena identify predictors or moderators that reduce σ_δ². The information contributed by an experiment, and thus its contribution to knowledge, is a direct function of this reduction in σ_dR².

Improved Communication

The classic definition of replicability can cause harmful confusion when weak but supportive results must be categorized as a "failure to replicate [at p < .05]" (Rossi, 1997). Consider an experiment involving memory for deep versus superficial encoding of target words. This experiment, conducted in an undergraduate methods class, yielded a highly significant effect for the pooled data of 124 students, t(122) = 5.46 (Parkinson, 2004). We can "power down" the effect estimated from the pooled data to predict the probability that each of the seven sections in which these data were collected would replicate this classic effect. All of the test materials and instructions were identical, so σ_δ² was approximately 0. The effect size from the pooled data, d′, was 0.49. Individual class sections, averaging ns of 18, contributed the majority of variability to the replicate sampling distribution, whose variance is the sum of sampling variances for n = 124 ("original") and again for n = 18 (replicates). Replacing σ_dR in Equation 4 with the root of this sum predicts a replicability of .81: Approximately six of the seven sections should get a positive effect. It happens that all seven did, although for one the effect size was a mere 0.06. Unfortunately, the instructor had to tell four of the seven sections that they had, by contemporary standards, failed to replicate a very


reliable result, as their ps were greater than .05. It was a good opportunity to discuss sampling error. It was not a good opportunity to discuss careers in psychology.

"How odd it is that anyone should not see that all observation must be for or against some view if it is to be of any service!" (Darwin, 1994, p. 269). Significance tests can never be for: "Never use the unfortunate expression 'accept the null hypothesis'" (Wilkinson & the Task Force on Statistical Inference, 1999, p. 599). And without priors, there are no secure grounds for being against the null, that is, for rejecting it. It follows that if our observations are to be of any service, it will not be because we have used significance tests. All this may be hard news for small-effects research, in which significance attends any hypothesis given enough n, whether or not the results are replicable. But editors may lower the hurdle for potentially important research that comes with so precise a warning label as prep. When replicability becomes the criterion, researchers can gauge the risks they face in pursuing a line of study: An assistant professor may choose paradigms in which prep is typically greater than .8, whereas a tenured risk taker may hope to reduce s_δ² in a line of research having preps around .6. When replicability becomes the criterion, significance, shorn of its statistical duty, can once again become a synonym for the importance of a result, not for its improbability.

Acknowledgments—Colleagues whose comments have improved this article include Sandy Braver, Darlene Crone-Todd, James Cutting, Randy Grace, Tony Greenwald, Geoff Loftus, Armando Machado, Roger Milsap, Ray Nickerson, Morris Okun, Clark Presson, Anon Reviewer, Matt Sitomer, and François Tonneau. In particular, I thank Geoff Cumming, whose careful readings saved me from more than one error. The concept was presented at a meeting of the Society of Experimental Psychologists, March 2004, Cornell University. The research was supported by National Science Foundation Grant IBN 0236821 and National Institute of Mental Health Grant 1R01MH066860.

REFERENCES

Berger, J.O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of P values and evidence. Journal of the American Statistical Association, 82, 112–122.

Bruce, P. (2003). Resampling stats in Excel [Computer software]. Retrieved February 1, 2005, from http://www.resample.com

Burnham, K.P., & Anderson, D.R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York: Springer-Verlag.

Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic Press.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.

Cooper, H., & Hedges, L.V. (Eds.). (1994). The handbook of research synthesis. New York: Russell Sage Foundation.

Cortina, J.M., & Nouri, H. (2000). Effect size for ANOVA designs. Thousand Oaks, CA: Sage.

Cox, R.T. (1961). The algebra of probable inference. Baltimore: Johns Hopkins University Press.

Cumming, G., & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532–575.

Cumming, G., Williams, J., & Fidler, F. (2004). Replication, and researchers' understanding of confidence intervals and standard error bars. Understanding Statistics, 3, 299–311.

Darwin, C. (1994). The correspondence of Charles Darwin (Vol. 9; F. Burkhardt, J. Browne, D.M. Porter, & M. Richmond, Eds.). Cambridge, England: Cambridge University Press.

Eagly, A.H., Johannesen-Schmidt, M.C., & van Engen, M.L. (2003). Transformational, transactional, and laissez-faire leadership styles: A meta-analysis comparing men and women. Psychological Bulletin, 129, 569–591.

Estes, W.K. (1997). On the communication of information by displays of standard errors and confidence intervals. Psychonomic Bulletin & Review, 4, 330–341.

Fisher, R.A. (1925). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22, 700–725.

Fisher, R.A. (1959). Statistical methods and scientific inference (2nd ed.). New York: Hafner Publishing.

Geisser, S. (1992). Introduction to Fisher (1922): On the mathematical foundations of theoretical statistics. In S. Kotz & N.L. Johnson (Eds.), Breakthroughs in statistics (Vol. 1, pp. 1–10). New York: Springer-Verlag.

Greenwald, A.G., Gonzalez, R., Guthrie, D.G., & Harris, R.J. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33, 175–183.

Grissom, R.J., & Kim, J.J. (2001). Review of assumptions and problems in the appropriate conceptualization of effect size. Psychological Methods, 6, 135–146.

Harlow, L.L., Mulaik, S.A., & Steiger, J.H. (Eds.). (1997). What if there were no significance tests? Mahwah, NJ: Erlbaum.

Hedges, L.V. (1981). Distribution theory for Glass's estimator of effect sizes and related estimators. Journal of Educational Statistics, 6, 107–128.

Hedges, L.V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.

Hedges, L.V., & Vevea, J.L. (1998). Fixed- and random-effects models in meta-analysis. Psychological Methods, 3, 486–504.

Jaynes, E.T., & Bretthorst, G.L. (2003). Probability theory: The logic of science. Cambridge, England: Cambridge University Press.

Krantz, D.H. (1999). The null hypothesis testing controversy in psychology. Journal of the American Statistical Association, 94, 1372–1381.

Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16–26.

Loftus, G.R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161–171.

Lorber, M.F. (2004). Psychophysiology of aggression, psychopathy, and conduct problems: A meta-analysis. Psychological Bulletin, 130, 531–552.

Louis, T.A., & Zelterman, D. (1994). Bayesian approaches to research synthesis. In H. Cooper & L.V. Hedges (Eds.), The handbook of research synthesis (pp. 411–422). New York: Russell Sage Foundation.


Meehl, P.E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.

Meehl, P.E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 393–425). Mahwah, NJ: Erlbaum.

Miller, N., & Pollock, V.E. (1994). Meta-analytic synthesis for theory development. In H. Cooper & L.V. Hedges (Eds.), The handbook of research synthesis (pp. 457–484). New York: Russell Sage Foundation.

Mosteller, F., & Colditz, G.A. (1996). Understanding research synthesis (meta-analysis). Annual Review of Public Health, 17, 1–23.

Moyer, C.A., Rounds, J., & Hannum, J.W. (2004). A meta-analysis of massage therapy research. Psychological Bulletin, 130, 3–18.

Neyman, J., & Pearson, E.S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London, Series A, 231, 289–337.

Nickerson, R.S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301.

Parkinson, S.R. (2004). [Levels of processing experiments in a methods class]. Unpublished raw data.

Raudenbush, S.W. (1994). Random effects models. In H. Cooper & L.V. Hedges (Eds.), The handbook of research synthesis (pp. 301–321). New York: Russell Sage Foundation.

Richard, F.D., Bond, C.F., Jr., & Stokes-Zoota, J.J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331–363.

Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L.V. Hedges (Eds.), The handbook of research synthesis (pp. 231–244). New York: Russell Sage Foundation.

Rosenthal, R., & Rubin, D.B. (2003). r_equivalent: A simple effect size indicator. Psychological Methods, 8, 492–496.

Rossi, J.S. (1997). A case study in the failure of psychology as a cumulative science: The spontaneous recovery of verbal learning. In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 175–197). Mahwah, NJ: Erlbaum.

Rubin, D.B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6, 377–400.

Smithson, M. (2003). Confidence intervals. Thousand Oaks, CA: Sage.

Steiger, J.H., & Fouladi, R.T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 221–257). Mahwah, NJ: Erlbaum.

Stove, D.C. (1982). Popper and after: Four modern irrationalists. New York: Pergamon Press. (Available from Krishna Kunchithapadam, http://www.geocities.com/ResearchTriangle/Facility/4118/dcs/popper)

Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31(3), 25–32.

Trafimow, D. (2003). Hypothesis testing and theory evaluation at the boundaries: Surprising insights from Bayes's theorem. Psychological Review, 110, 526–535.

van den Noortgate, W., & Onghena, P. (2003). Estimating the mean effect size in meta-analysis: Bias, precision, and mean squared error of different weighting methods. Behavior Research Methods, Instruments, & Computers, 35, 504–511.

Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.

(RECEIVED 5/5/04; REVISION ACCEPTED 7/30/04)

APPENDIX

This back room contains equations, details, and generalizations.

Effect Size

The denominator of effect size given by Equation 1 is the pooled variance, calculated as

s_p² = [s_C²(n_C − 1) + s_E²(n_E − 1)] / (n − 2).

Hedges (1981) showed that an unbiased estimate of d is

d ≈ d′[1 − 3/(4n − 9)].

The adjustment is small, however, and with suitable adjustments in s_d, d′ suffices.

Negative effects generate preps less than .5, indicating the unlikelihood of positive effects in replication. For consistency, if d′ is less than 0, use |d′| and report the result as the replicability of a negative effect. Useful conversions are d′ = 2r(1 − r²)^(−1/2) (Rosenthal, 1994) and d′ = t[1/n_E + 1/n_C]^(1/2) for the simple two-independent-group case and d′ = t_r[(1 − r)/n_E + (1 − r)/n_C]^(1/2) for a repeated measures t, where r is the correlation between the measures (Cortina & Nouri, 2000).

The asymptotic variance of effect size (Hedges, 1981) is

s_d² = n/(n_E n_C) + d²/(2n).

Equation 3 in the text is optimized for the use of d′, however, and delivers accurate values of prep for −1 ≤ d′ ≤ 1.

Variance of Replicates

The desired variance of replicates, s_dR², equals the expectation E[(d₂ − d₁)²]. This may be expanded (Estes, 1997) as

E[(d₂ − d₁)²] = E[((d₂ − δ) − (d₁ − δ))²]
             = E[(d₂ − δ)² + (d₁ − δ)²] − 2E[(d₂ − δ)(d₁ − δ)].

The quantities E[(d₂ − δ)²] and E[(d₁ − δ)²] are the variances of d₂ and d₁, each equal to s_d². For independent replications, the expectation of the cross product E[(d₂ − δ)(d₁ − δ)] is 0.

Therefore, s_dR² = E[(d₂ − d₁)²] = s_d² + s_d². It follows that the standard error of effect size of equipotent replications is s_dR = √2 s_d.
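A compact numerical sketch of the Effect Size and Variance of Replicates formulas above, in Python (the function and variable names are mine, not from the article):

```python
import math

def effect_size_stats(mC, mE, sC, sE, nC, nE):
    """Return d', its Hedges-corrected estimate, the asymptotic variance
    of d, and the standard error of equipotent replications."""
    n = nC + nE
    # Pooled variance: s_p^2 = [s_C^2 (n_C - 1) + s_E^2 (n_E - 1)] / (n - 2)
    sp2 = (sC ** 2 * (nC - 1) + sE ** 2 * (nE - 1)) / (n - 2)
    d_prime = (mE - mC) / math.sqrt(sp2)
    # Hedges (1981) correction: d ~= d' [1 - 3 / (4n - 9)]
    d_unbiased = d_prime * (1 - 3 / (4 * n - 9))
    # Asymptotic variance: s_d^2 = n / (n_E n_C) + d^2 / (2n)
    sd2 = n / (nE * nC) + d_prime ** 2 / (2 * n)
    # Equipotent replications: s_dR = sqrt(2) * s_d
    sdR = math.sqrt(2 * sd2)
    return d_prime, d_unbiased, sd2, sdR

def d_from_t(t, nE, nC):
    """Conversion for the two-independent-group case:
    d' = t [1/n_E + 1/n_C]^(1/2)."""
    return t * math.sqrt(1 / nE + 1 / nC)
```

For two groups of 20 with means 0 and 0.5 and unit standard deviations, this gives d′ = 0.5 with s_dR of roughly 0.45, so the spread of replicate effect sizes is wide even for a medium effect.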


When n_E = n_C > 2,

s_dR² ≈ 8/(n − 4) + 2s_δ².

When the sizes of the original and replicate samples vary, replication variance should be based on

s_dR² = s_{d′,n₁}² + s_{d′,n₂}² + 2s_δ².

prep as a Function of p

We may approximate the normal distribution by the logistic and solve for prep as a function of p. This suggests the following equation:

prep ≈ [1 + (p/(1 − p))^(2/3)]^(−1).

The parenthetical converts a p value into a probability ratio appropriate for the logistic inverse. For two-tailed comparisons, halve p. Users of Excel can simply evaluate prep = NORMSDIST(NORMSINV(1 − P)/SQRT(2)) (G. Cumming, personal communication, October 24, 2004). This estimate is complementary to Rosenthal and Rubin's (2003) estimate of effect size directly from p and n.

Randomization Method

Randomization methods avoid assumptions of normality, are useful for small-n experiments, and are robust against heteroscedasticity. To employ them:

- Bootstrap populations for the experimental and control samples independently, generating subsamples of half the size of the original samples, using software such as Resampling Stats® (Bruce, 2003). This half-sizing provides the √2 increase in the standard deviation intrinsic to calculation of prep.

- Generate an empirical sampling distribution of the difference of the means of the subsamples, or of the mean of the differences for a matched-sample design.

- The proportion of the means that are positive gives prep.

This robust approach does not take into account s_δ², and so is accurate only for exact replications.

A Cumulative Science

Falsification of the null, even when possible, provides no machinery for the cumulation of knowledge. Reduction of s_dR does. Information is the reduction of entropy, which can be measured as the Fisher information content of the distribution of effect sizes. The difference of the entropies before and after an experiment, I = log₂(s_before/s_after), measures its incremental contribution of information. The discovery of better theoretical structures, predictors, or moderators that convert within-group variance to between-group variance permits large reductions in s_δ², and thus s_dR; smaller reductions are effected by cumulative increases in n.
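The two forms of prep as a function of a one-tailed p can be sketched in Python; the standard-library NormalDist stands in for Excel's NORMSDIST and NORMSINV, and the function names are mine:

```python
from statistics import NormalDist

def prep_from_p(p):
    """Normal-theory prep: the Excel recipe NORMSDIST(NORMSINV(1 - P)/SQRT(2))."""
    nd = NormalDist()
    return nd.cdf(nd.inv_cdf(1.0 - p) / 2.0 ** 0.5)

def prep_logistic(p):
    """Logistic approximation: prep ~= [1 + (p / (1 - p))^(2/3)]^(-1)."""
    return 1.0 / (1.0 + (p / (1.0 - p)) ** (2.0 / 3.0))
```

For a one-tailed p of .05 both forms give prep of about .88; for two-tailed comparisons, halve p before calling either function.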

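The three randomization steps can be sketched without commercial software. This is a hypothetical Python stand-in for Resampling Stats, with invented data; like the method it mirrors, it captures only sampling variance and so suits exact replications:

```python
import random

def prep_bootstrap(control, experimental, trials=10_000, seed=1):
    """Resample each group independently at half its original size
    (supplying the sqrt(2) inflation of the standard deviation), build the
    sampling distribution of the mean difference, and report the
    proportion of differences that are positive."""
    rng = random.Random(seed)
    nC = max(1, len(control) // 2)
    nE = max(1, len(experimental) // 2)
    positive = 0
    for _ in range(trials):
        mC = sum(rng.choices(control, k=nC)) / nC
        mE = sum(rng.choices(experimental, k=nE)) / nE
        if mE - mC > 0:
            positive += 1
    return positive / trials
```

With a large, consistent group difference the estimated prep approaches 1; feeding the same scores in as both groups drives it toward .5, as it should for a null effect.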

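The information measure in A Cumulative Science reduces to one line of code. A sketch, with an invented illustration in which better moderators halve the spread of the effect-size distribution:

```python
import math

def information_gain(s_before, s_after):
    """Incremental information, in bits: I = log2(s_before / s_after)."""
    return math.log2(s_before / s_after)
```

Halving the standard deviation of effect sizes, say from 0.4 to 0.2, yields exactly one bit of information, whatever the starting spread.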