Gardner & Altman (1986) PDF
Statistics in Medicine
Abstract
Overemphasis on hypothesis testing, and the use of P values* to dichotomise significant or non-significant results, has detracted from more useful approaches to interpreting study results, such as estimation and confidence intervals. In medical studies investigators are usually interested in determining the size of difference of a measured outcome between groups, rather than a simple indication of whether or not it is statistically significant. Confidence intervals present a range of values, on the basis of the sample data, in which the population value for such a difference may lie. Some methods of calculating confidence intervals for means and differences between means are given, with similar information for proportions. The paper also gives suggestions for graphical display.
Confidence intervals, if appropriate to the type of study, should be used for major findings in both the main text of a paper and its abstract.

*In this paper we have preferred the notation of Mainland1 and used P for the probability associated with the outcome of a test of the null hypothesis, and not p, which is used for a proportion (see Appendix 2). Although contrary to the Vancouver convention, it is statistically more established and would also have been preferable for the statistical guidelines published in the BMJ.2

MRC Environmental Epidemiology Unit (University of Southampton), Southampton General Hospital, Southampton SO9 4XY
MARTIN J GARDNER, BSC, PHD, professor of medical statistics
Division of Medical Statistics, MRC Clinical Research Centre, Harrow, Middlesex HA1 3UJ
DOUGLAS G ALTMAN, BSC, medical statistician
Correspondence to: Professor Gardner.

Introduction
Over the past two or three decades the use of statistics in medical journals has increased tremendously. One unfortunate consequence has been a shift in emphasis away from the basic results towards an undue concentration on hypothesis testing. In this approach data are examined in relation to a statistical "null" hypothesis, and the practice has led to the mistaken belief that studies should aim at obtaining "statistical significance." On the contrary, the purpose of most research investigations in medicine is to determine the magnitude of some factor(s) of interest.
For example, a laboratory based study may investigate the difference in mean concentrations of a blood constituent between patients with and without a certain illness, while a clinical study may assess the difference in prognosis of patients with a particular disease treated by alternative regimens in terms of rates of cure, remission, relapse, survival, etc. The difference obtained in such a study will be only an estimate of what we really need, which is the result that would have been obtained had all the eligible subjects (the "population") been investigated rather than just a sample of them. What authors and readers should want to know is by how much the illness modified the mean blood concentrations or by how much the new treatment altered the prognosis, rather than only the level of statistical significance.
The excessive use of hypothesis testing at the expense of other ways of assessing results has reached such a degree that levels of significance are often quoted alone in the main text and abstracts of papers, with no mention of actual concentrations, proportions, etc, or their differences. The implication of hypothesis testing, that there can always be a simple "yes" or "no" answer as the fundamental result from a medical study, is clearly false, and used in this way hypothesis testing is of limited value.2
We discuss here the rationale behind an alternative statistical approach, the use of confidence intervals; these are more informative than P values, and we recommend them for papers published in the British Medical Journal (and elsewhere). This should not be taken to mean that confidence intervals should appear in all papers; in some cases, such as where the data are purely descriptive, confidence intervals are inappropriate and in others techniques for obtaining them are complex or unavailable.

Presentation of study results: limitations of P values
The common simple statements "P<0.05," "P>0.05," or "P NS" convey little information about a study's findings and rely on an arbitrary convention of using the 5% level of statistical significance to define two alternative outcomes (significant or not significant), which is not helpful and encourages lazy thinking. Furthermore, even precise P values convey nothing about the sizes of the differences between study groups. Rothman pointed this out in 1978 and advocated the use of confidence intervals,3 and recently he and his colleagues repeated the proposal.4
Presenting P values alone can lead to them being given more merit than they deserve. In particular, there is a tendency to equate statistical significance with medical importance or biological relevance. But small differences of no real interest can be statistically significant with large sample sizes, whereas clinically important effects may be statistically non-significant only because the number of subjects studied was small.

Presentation of study results: confidence intervals
It is more useful to present sample statistics as estimates of results that would be obtained if the total population were studied. The lack of precision of a sample statistic (for example, the mean), which results from both the degree of variability in the factor being investigated and the limited size of the study, can be shown advantageously by a confidence interval.
A confidence interval produces a move from a single value estimate (such as the sample mean, difference between sample means, etc) to a range of values that are considered to be plausible for the population. The width of a confidence interval based on a sample statistic depends partly on its standard error, and hence on both the standard deviation and the sample size (see Appendix 1 for a brief description of the important, but often misunderstood, distinction between the standard deviation and standard error). It also
BRITISH MEDICAL JOURNAL VOLUME 292 15 MARCH 1986 747
depends on the degree of "confidence" that we want to associate with the resulting interval.
Suppose that in a study comparing samples of 100 diabetic and 100 non-diabetic men of a certain age a difference of 6.0 mm Hg was found between their mean systolic blood pressures and that the standard error of this difference between sample means was 2.5 mm Hg, comparable to the difference between means in the Framingham study.5 The 95% confidence interval for the population difference between means is from 1.1 to 10.9 mm Hg and is shown in fig 1 together with the original data. Details of how to calculate the confidence interval are given in Appendix 2.
Put simply, this means that there is a 95% chance that the indicated range includes the "population" difference in mean blood pressure levels, that is, the value which would be obtained by including the total populations of diabetics and non-diabetics at which the study is aimed. More exactly, in a statistical sense, the confidence interval means that if a series of identical studies were carried out repeatedly on different samples from the same populations, and a 95% confidence interval for the difference between the sample means calculated in each study, then, in the long run, 95% of these confidence intervals would include the population difference between means.
The sample size affects the size of the standard error and this in turn affects the width of the confidence interval. Fig 2 shows the 95% confidence interval from samples with the same means and standard deviations as before but only half as large, that is, 50 diabetics and 50 non-diabetics. Reducing the sample size leads to less precision and an increase in the width of the confidence interval, in this case by some 40%.
The investigator can select the degree of confidence associated with a confidence interval, though 95% is the most common choice, just as a 5% level of statistical significance is widely used. If greater or less confidence is required different intervals can be constructed: 99%, 95%, and 90% confidence intervals for the data in fig 1 are shown in fig 3. As would be expected, greater confidence that the population difference is within a confidence interval is obtained with wider intervals. In practice, intervals other than 99%, 95% or 90% are rarely quoted.
Some methods of calculating confidence intervals for means, proportions, and their differences are given in Appendix 2. Confidence intervals can also be calculated for other statistics, such as regression slopes and relative risks.6 When the observed data cannot be regarded as having come from a Normal distribution the situation is not always straightforward (see Appendix 2).
Confidence intervals convey only the effects of sampling variation on the precision of the estimated statistics and cannot control for non-sampling errors such as biases in design, conduct, or analysis.
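The long-run reading of a 95% confidence interval given earlier in this section (repeated identical studies, with about 95% of the intervals including the population difference) can be checked with a small simulation. The sketch below rests on assumptions that are not in the paper: population means of 146 and 140 mm Hg, a common standard deviation of 17.7 mm Hg (chosen so that 100 subjects per group give a standard error near 2.5 mm Hg), Normally distributed data, and a tabulated t value of 1.972 for 198 degrees of freedom. The function names are hypothetical.

```python
import math
import random
import statistics

T_95_DF198 = 1.972  # assumed t value for 95% confidence, 198 degrees of freedom


def ci_for_difference(sample_a, sample_b, t_value):
    """Confidence interval for the difference in means using a pooled SD."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(sample_a)
        + (n_b - 1) * statistics.variance(sample_b)
    ) / (n_a + n_b - 2)
    se_diff = math.sqrt(pooled_var) * math.sqrt(1 / n_a + 1 / n_b)
    diff = statistics.fmean(sample_a) - statistics.fmean(sample_b)
    return diff - t_value * se_diff, diff + t_value * se_diff


def coverage(n_studies=2000, n_per_group=100, seed=1986):
    """Fraction of simulated studies whose interval includes the true difference."""
    rng = random.Random(seed)
    mean_a, mean_b, sd = 146.0, 140.0, 17.7  # hypothetical population values
    true_diff = mean_a - mean_b
    hits = 0
    for _ in range(n_studies):
        a = [rng.gauss(mean_a, sd) for _ in range(n_per_group)]
        b = [rng.gauss(mean_b, sd) for _ in range(n_per_group)]
        low, high = ci_for_difference(a, b, T_95_DF198)
        if low <= true_diff <= high:
            hits += 1
    return hits / n_studies


observed = coverage()
print(f"Proportion of intervals including the population difference: {observed:.3f}")
```

The observed proportion should sit close to 0.95 and will vary a little with the seed and the number of simulated studies. The simulation addresses sampling variation only; it says nothing about the non-sampling errors mentioned above.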
[Figure residue: plots with axes "Systolic blood pressure (mm Hg)" and "Difference in mean systolic blood pressure (mm Hg)", each marking a 95% CI; the individual data points are not recoverable.]
FIG 3: Confidence intervals associated with differing degrees of "confidence" using the same data as in fig 1.
Suggested mode of presentation
In content, our only proposed change is that confidence intervals should be reported instead of standard errors. This will encourage a move away from the current emphasis on statistical significance. For the major finding(s) of a study we recommend that full statistical information should be given, including sample estimates, confidence intervals, test statistics, and P values, assuming that basic details, such as sample sizes and standard deviations, have been reported earlier in the paper. The major findings would include at least those related to the original hypothesis(es) of the study and those reported in the abstract.
For the above example the textual presentation of the results might read:
    The difference between the sample mean systolic blood pressures in diabetics and non-diabetics was 6.0 mm Hg, with a 95% confidence interval from 1.1 to 10.9 mm Hg; the t test statistic was 2.4, with 198 degrees of freedom and an associated P value of 0.02.
In short:
    Mean 6.0 mm Hg, 95% CI 1.1 to 10.9; t=2.4, df=198, P=0.02.
The exact P value from the t distribution is 0.01732, but one or two significant figures are enough2; this value is seen to be within the range 0.01 to 0.05 determined earlier from the confidence intervals. Often a range for P will need to be given because only limited figures are available in published tables, for example, 0.3<P<0.4.
The two extremes of a confidence interval are sometimes presented as confidence limits. However, the word "limits" suggests that there is no going beyond and may be misunderstood because, of course, the population value will not always lie within the confidence interval. Moreover, there is a danger that one or other of the "limits" will be quoted in isolation from the rest of the results, with misleading consequences. For example, concentrating only on the upper figure and ignoring the rest of the confidence interval would misrepresent the finding by exaggerating the study difference. Conversely, quoting only the lower limit would incorrectly underestimate the difference. The confidence interval is thus preferable because it focuses on the range of values.
The same notation can be used for presenting confidence intervals in tables. Thus, a column headed "95% confidence interval" or "95% CI" would have rows of intervals: "1.1 to 10.9", "48 to 85", etc. Confidence intervals can also be incorporated into figures, where they are preferable to the widely used standard error, which is often shown solely in one direction from the sample estimate. If individual data values can be shown as well, which is usually possible for small samples, this is even more informative. Thus in fig 1, despite the considerable overlap of the two sets of sample data, the shift in means is shown by the 95% confidence interval excluding zero. For paired samples the individual differences can be plotted advantageously in a diagram.
The example given here of the difference between two means is common. Although there is some intrinsic interest in the mean values themselves, inferences from a study will be concerned mainly with their difference. Giving confidence intervals for each mean separately is therefore unhelpful, because these do not usually indicate the precision of the difference or its statistical significance.7,8 Thus, the major contrasts of a study should be shown directly, rather than only vaguely in terms of the separate means (or proportions).
For a paper with only a limited number of statistical comparisons related to the initial hypotheses confidence intervals are recommended throughout. Where multiple comparisons are concerned, however, the usual problems of interpretation arise, since some confidence intervals will exclude the "null" value (for example, zero difference) through sampling variation alone. This mirrors the situation of calculating a multiplicity of P values, where not

We acknowledge the collaboration of the editorial staff of the British Medical Journal in the development of this paper and its proposals. We also thank the people who kindly read and constructively criticised the manuscript during its development and Miss Brigid Grimes for her careful typing.

Appendix 1: Standard deviation and standard error
When numerical findings are reported, regardless of whether or not their statistical significance is quoted, they are often presented with additional statistical information. The distinction between two widely quoted statistics, the standard deviation and the standard error, is, however, often misunderstood.11-14
The standard deviation is a measure of the variability between individuals in the level of the factor being investigated, such as blood alcohol concentrations in a sample of car drivers, and is thus a descriptive index. By contrast, the standard error is a measure of the uncertainty in a sample statistic. For example, the standard error of the mean indicates the uncertainty of the mean blood alcohol concentration among the sample of drivers as an estimate of the mean value among the population of all car drivers. The standard deviation is relevant when variability between individuals is of interest; the standard error is relevant to summary statistics such as means, proportions, differences, regression slopes, etc.2
The standard error of the sample statistic, which depends on both the standard deviation and the sample size, is a recognition that a sample is most unlikely to determine the population value exactly. In fact, if a further sample is taken in identical circumstances almost certainly it will produce a different estimate of the same population value. The sample statistic is therefore imprecise, and the standard error is a measure of this imprecision. By itself the standard error has limited meaning, but it can be used to produce a confidence interval, which does have a useful interpretation.

Appendix 2: Methods of calculating confidence intervals
Formulas for calculating confidence intervals (CIs) are given for means, proportions, and their differences. There is a common underlying principle of subtracting and adding to the sample statistic a multiple of its standard error (SE). This extends to other statistics, such as regression coefficients, but is not universal.

CONFIDENCE INTERVALS FOR MEANS AND THEIR DIFFERENCES
Confidence intervals for means are constructed using the t distribution if the data have an approximately Normal distribution. For differences between two means the data should also have similar standard deviations (SDs) in each study group. This is implicit in the example given in the text and in the worked example below.

Single sample
The confidence interval for a population mean is derived using the mean ($\bar{x}$) and its standard error from a sample of size n. For this case the SE = SD/$\sqrt{n}$. Thus, the confidence interval is given by:
$\bar{x} - (t_{1-\alpha/2} \times \mathrm{SE})$ to $\bar{x} + (t_{1-\alpha/2} \times \mathrm{SE})$,
where $t_{1-\alpha/2}$ is the appropriate value from the t distribution with n - 1 degrees
of freedom associated with a "confidence" of $100(1-\alpha)$%. For a 95% CI $\alpha$ is 0.05, for a 99% CI $\alpha$ is 0.01, and so on. Values of t can be found from tables in statistical textbooks or Documenta Geigy.15 For a 95% CI the value of t will be close to 2.0 for samples of 20 upwards but noticeably greater than 2.0 for smaller samples.

Two samples
Unpaired case: The confidence interval for the difference between two population means is derived in a similar way. Suppose $\bar{x}_1$ and $\bar{x}_2$ are the two sample means, $s_1$ and $s_2$ the corresponding standard deviations, and $n_1$ and $n_2$ the sample sizes. Firstly, we need a "pooled" estimate of the standard deviation, which is given by:
$s = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$.
From this the standard error of the difference between the two sample means is:
$\mathrm{SE}_{\mathrm{diff}} = s \times \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$.
The confidence interval is then:
$\bar{x}_1 - \bar{x}_2 - (t_{1-\alpha/2} \times \mathrm{SE}_{\mathrm{diff}})$ to $\bar{x}_1 - \bar{x}_2 + (t_{1-\alpha/2} \times \mathrm{SE}_{\mathrm{diff}})$,
where $t_{1-\alpha/2}$ is taken from the t distribution with $n_1 + n_2 - 2$ degrees of freedom.
If the standard deviations differ considerably then a common pooled estimate is not appropriate unless a suitable transformation of scale can be found. Otherwise obtaining a confidence interval is more complex.6
Paired case: This includes studies of repeated measurements (for example, at different times or in different circumstances on the same subjects) and matched case-control comparisons. For such data the same formulas as for the single sample case are used to calculate the confidence interval, where $\bar{x}$ and SD are now the mean and standard deviation of the individual within subject or patient-control differences.

For the original samples of 100 each the appropriate values of $t_{0.995}$ and $t_{0.95}$ with 198 degrees of freedom to calculate the 99% and 90% CIs are 2.60 and 1.65, respectively. Thus the 99% CI is calculated as:
6.0 - (2.60 x 2.50) to 6.0 + (2.60 x 2.50),
that is, from -0.5 to 12.5 mm Hg (fig 3), and the 90% CI is given by:
6.0 - (1.65 x 2.50) to 6.0 + (1.65 x 2.50),
that is, from 1.9 to 10.1 mm Hg (fig 3).

Sample sizes and confidence intervals
In general increasing the sample size will reduce the width of the confidence interval. If we assume the same means and standard deviations as in the example, fig 4 shows the resulting 99%, 95%, and 90% confidence intervals for the difference in mean blood pressures for sample sizes of up to 500 in each group. The benefit, in terms of narrowing the confidence interval, of a further increase in the number of subjects falls sharply with increasing sample size.
[Figure 4 residue: y axis "Difference in mean systolic blood pressure (mm Hg)"; sample sizes n=50 and n=100 marked; 99%, 95%, and 90% confidence intervals.]
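The unpaired formulas, the worked 99% and 90% intervals, the paired case, and the effect of sample size can be put together in a short sketch. It is an illustration under stated assumptions rather than the authors' own computation. The difference of 6.0 mm Hg, the standard error of 2.50 mm Hg, and the t values 2.60 and 1.65 are taken from the text; the 95% t value of 1.97, the common standard deviation of 17.7 mm Hg, the t value 2.365 for 7 degrees of freedom, and the paired differences are assumptions made only for illustration, and all function names are hypothetical.

```python
import math
import statistics


def pooled_sd(n1, sd1, n2, sd2):
    """Pooled standard deviation of two independent samples."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def se_of_difference(n1, sd1, n2, sd2):
    """Standard error of the difference between two sample means."""
    return pooled_sd(n1, sd1, n2, sd2) * math.sqrt(1 / n1 + 1 / n2)


def confidence_interval(estimate, standard_error, t_value):
    """Estimate minus and plus t times its standard error."""
    half_width = t_value * standard_error
    return estimate - half_width, estimate + half_width


def mean_ci(values, t_value):
    """Single sample (or paired differences) interval for a mean."""
    se = statistics.stdev(values) / math.sqrt(len(values))
    return confidence_interval(statistics.fmean(values), se, t_value)


# Worked example: values from the text, except the assumed 95% t value.
DIFF, SE_DIFF = 6.0, 2.50
T_VALUES = {"99%": 2.60, "95%": 1.97, "90%": 1.65}  # 1.97 is an assumption
intervals = {k: confidence_interval(DIFF, SE_DIFF, t) for k, t in T_VALUES.items()}
for level, (low, high) in intervals.items():
    print(f"{level} CI: {low:.1f} to {high:.1f} mm Hg")

# Sample size and width: the assumed common SD of 17.7 mm Hg gives an SE near
# 2.5 mm Hg at n=100 per group. A fixed t of 1.97 is used for simplicity.
ASSUMED_SD = 17.7
widths = {}
for n in (50, 100, 200, 500):
    se = se_of_difference(n, ASSUMED_SD, n, ASSUMED_SD)
    low, high = confidence_interval(DIFF, se, 1.97)
    widths[n] = high - low
    print(f"n={n} per group: 95% CI width about {widths[n]:.1f} mm Hg")

# Paired case: hypothetical within-subject differences, 8 pairs, so 7 degrees
# of freedom and an assumed tabulated t value of 2.365 for 95% confidence.
paired_differences = [4.0, -2.0, 7.0, 3.0, 0.0, 5.0, 1.0, 6.0]
paired_low, paired_high = mean_ci(paired_differences, 2.365)
print(f"Paired 95% CI: {paired_low:.1f} to {paired_high:.1f}")
```

The width falls with the square root of the sample size, which is why each further increase in the number of subjects narrows the interval by less than the one before.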
[Equation for $\mathrm{SE}_{\mathrm{diff}}$ garbled in the source; only a fragment resembling $(b - c)^2$ survives.]
The confidence interval for $p_1 - p_2$ is then given as:
$p_1 - p_2 - (N_{1-\alpha/2} \times \mathrm{SE}_{\mathrm{diff}})$ to $p_1 - p_2 + (N_{1-\alpha/2} \times \mathrm{SE}_{\mathrm{diff}})$,
where $N_{1-\alpha/2}$ is found as for the single sample case.

12 Bunce H, Hokanson JA, Weiss GB. Avoiding ambiguity when reporting variability in biomedical data. Am J Med 1980;69:8-9.
13 Altman DG. Statistics in medical journals. Statistics in Medicine 1982;1:59-71.
14 Brown GW. Standard deviation, standard error: which "standard" should we use? Am J Dis Child 1982;136:937-41.
15 Diem K, Lentner C, eds. Documenta Geigy. Scientific tables. 7th ed. Basle: Geigy, 1970.
16 Fleiss JL. Statistical methods for rates and proportions. 2nd ed. Chichester: Wiley, 1981:29-30.
17 Breslow NE, Day NE. Statistical methods in cancer research: volume 1, the analysis of case-control studies. Lyon: International Agency for Research on Cancer, 1980:133-4.
(Accepted 8 January 1986)