
Journal of Evaluation in Clinical Practice, 6, 2, 193–204

Bayesian statistics in medical research: an intuitive alternative to conventional data analysis

Lyle C. Gurrin BSc (Hons), PhD, AStat,1 Jennifer J. Kurinczuk MD, MSc, MFPHM, FAFPHM2 and Paul R. Burton MD, MSc, MFPHM, CStat3

1 Biostatistician, Women and Infants Research Foundation, King Edward Memorial Hospital, Subiaco, Perth, Australia
2 Perinatal Epidemiologist, TVW Telethon Institute for Child Health Research, West Perth, Australia and Clinical Senior Lecturer, Department of Public Health, The University of Western Australia, Australia
3 Professor of Genetic Epidemiology, Department of Epidemiology and Public Health, University of Leicester, Leicester, UK and Head, Division of Biostatistics and Genetic Epidemiology, TVW Telethon Institute for Child Health Research, Department of Paediatrics, University of Western Australia, West Perth, Australia

Correspondence
Dr Lyle C. Gurrin
Biostatistician
Women and Infants Research Foundation
King Edward Memorial Hospital
PO Box 134, Subiaco
Perth, W.A., 6008
Australia

Keywords: assisted reproduction technology, Bayesian statistics, confidence intervals, ICSI, medical statistics, P values, statistical inference

Accepted for publication: 21 September 1999

Summary

Statistical analysis of both experimental and observational data is central to medical research. Unfortunately, the process of conventional statistical analysis is poorly understood by many medical scientists. This is due, in part, to the counter-intuitive nature of the basic tools of traditional (frequency-based) statistical inference. For example, the proper definition of a conventional 95% confidence interval is quite confusing. It is based upon the imaginary results of a series of hypothetical repetitions of the data generation process and subsequent analysis. Not surprisingly, this formal definition is often ignored and a 95% confidence interval is widely taken to represent a range of values that is associated with a 95% probability of containing the true value of the parameter being estimated. Working within the traditional framework of frequency-based statistics, this interpretation is fundamentally incorrect. It is perfectly valid, however, if one works within the framework of Bayesian statistics and assumes a ‘prior distribution’ that is uniform on the scale of the main outcome variable. This reflects a limited equivalence between conventional and Bayesian statistics that can be used to facilitate a simple Bayesian interpretation based on the results of a standard analysis. Such inferences provide direct and understandable answers to many important types of question in medical research. For example, they can be used to assist decision making based upon studies with unavoidably low statistical power, where non-significant results are all too often, and wrongly, interpreted as implying ‘no effect’. They can also be used to overcome the confusion that can result when statistically significant effects are too small to be clinically relevant. This paper describes the theoretical basis of the Bayesian-based approach and illustrates its application with a practical example that investigates the prevalence of major cardiac defects in a cohort of children born using the assisted reproduction technique known as ICSI (intracytoplasmic sperm injection).

© 2000 Blackwell Science 193


Introduction

Many research questions in medical science can most naturally be answered by assessing the probability that a particular hypothesis is true, or false, having observed a relevant set of data. Medical researchers have long embraced statistical methods in determining how such data should impact on their belief about the plausibility of the hypothesis in question. There is, however, widespread misunderstanding as to the appropriate interpretation of the tools associated with statistical inference, such as P-values and confidence intervals. In this paper we will address some common misconceptions about the use of statistical methods in medical research, and suggest an alternative and more intuitive interpretation, based on the Bayesian theory of statistics (Lindley 1965a,b; Box & Tiao 1973; Lee 1989). The Bayesian methodology is particularly useful in both the clinical setting and the arena of public health policy when the results of a study must subsequently be used to facilitate a decision (Burton 1994; Lilford & Braunholtz 1996).

Frequentist statistics

Medical statistics is firmly founded upon the frequentist theory of statistics (Armitage & Berry 1994, p93–99), which is the best known and most widely used framework for statistical reasoning. In this framework, the process of inference requires us to consider every possible result that a study could potentially generate. This leads to the calculation of a P-value, defined as the probability of observing data at least as ‘extreme’ as the data that were actually observed in the current study, given that the particular hypothesis (usually the ‘null’ hypothesis of ‘no difference’ or ‘no effect’) is true. This is a statement of frequency-based probability since it involves the relative frequency of an outcome or event in a repeated series of identical, hypothetical experiments.

If the calculated P-value is small then the observed data are surprisingly extreme, in that they are improbable if the null hypothesis is true, and so it represents evidence against the null hypothesis. The converse is, however, not necessarily true; observing a set of data that is less extreme and is thus associated with a larger P-value is not necessarily evidence for the null hypothesis, since under some circumstances this can be quite likely to occur even if the null hypothesis is false. The design of a research study generating such data is said to have low statistical power, in that datasets that are apparently consistent with the null hypothesis can occur relatively frequently even if the null hypothesis is false and thus an alternative hypothesis is true (Armitage & Berry 1994, p195–206).

This situation is compounded when the results of a study are declared to be ‘statistically significant’ if the P-value is observed to fall below an arbitrary threshold, typically 0.05. Such ‘tests of statistical significance’ or ‘hypothesis tests’ represent an unnecessary dichotomization of the set of all possible results of a study into an over-simplified ‘accept/reject’ decision analysis. The continuum of evidence across the range of potential data is completely ignored by such significance testing, which is often inappropriately viewed by clinicians and medical researchers as the statistical equivalent of a diagnostic test in medicine (Burton 1994; Burton et al. 1998). In a typical test of significance given some null hypothesis, P < 0.05 is often interpreted to mean ‘there is a difference (the null hypothesis is false)’, while P ≥ 0.05 is understood to mean that ‘there is no difference (the null hypothesis is true)’. These common but incorrect interpretations both express results in terms of the null hypothesis being true or false, and suggest that the P-value provides a direct quantitative measure of the plausibility of the null hypothesis. Taken to its ultimate conclusion, this results in the fundamental misconception that the P-value measures the probability that a given null hypothesis is true having observed a particular set of data.

A P-value actually reflects the probability of obtaining a particular pattern of results, or one more extreme, on the basis of an hypothesis that is assumed to be true. The probability that an hypothesis is true or false is not the long-run probability of an event and cannot even be expressed in the framework of frequency-based probability. In any particular case the hypothesis must either be true or false, and no frequency-based probability should be attached to it (Armitage & Berry 1994, p76–77). Any formal inferences in this vein must at best be indirect. As a minimum, any reasonable assessment of the viability
of the null hypothesis requires simultaneous consideration of the relative plausibility of a variety of competing hypotheses that are also consistent with the data and cannot, therefore, be based on the calculation of a single P-value, assuming that the null hypothesis is true!

Degrees of belief and subjective probability

In order to overcome the problems discussed in the previous paragraph, we need to subscribe to a more general notion of probability. While we would wish to maintain the simple frequentist interpretation of probability as the long-run frequency of events in circumstances where it is appropriate, we would also like to make probabilistic statements and judgements about statistical parameters and, ultimately, scientific hypotheses.

Most statisticians now accept the concept of subjective probability, where statements involving the use of probability are taken to represent a ‘degree of personal belief’ about the quantity or event of interest (Lindley 1965a). This removes the need to associate probability with observable events (however hypothetical such events may be) and allows us to make quantitative judgements about the likelihood of an assertion being correct in circumstances where there is no reasonable long-run frequency interpretation.

A typical example occurs when we attach a probability to an event in public affairs, such as the statement ‘there is a 10% chance that Australia will become a republic before 31 December 2005’. Clearly Australia will not debate the transition from a constitutional monarchy to a republic a large number of times under identical conditions and so a relative frequency interpretation of the probability of this event is simply not possible. Frequentist statisticians should refrain from attaching probabilities to such one-off events, though many events (for example, sporting contests) can be similar enough for the associated probabilities to warrant a frequentist interpretation. Although it is not strictly necessary for subjective probabilities to be based on data, they should change, in a rational manner, as new data accrue. More formally, a sequence of subjective probability statements must be internally consistent or coherent in the sense of Walley (1991) in order to avoid irrational behaviour in response to them (Walley 1991; Walley et al. 1996).

Bayesian statistics

Suppose that we plan to conduct an observational or experimental study to further our knowledge about some quantity of interest (called a statistical parameter), and thus to collect information that will provide evidence to support or refute a current hypothesis about the quantity of interest. The Bayesian approach to statistical inference (named after the 18th century English clergyman the Reverend Thomas Bayes) initially asks the researcher to collate all pre-existing information, reflecting both evidence based on past studies and current beliefs, before prospectively collecting any new data. This information is then expressed in mathematical form as a prior probability distribution. The prior distribution is simply a quantification of the current state of understanding about the unknown quantity of interest and can be thought of as attaching a weight, expressed as a probability, to each possible value of the quantity of interest before additional data are recorded and examined. Values of the quantity of interest that are viewed as being a priori fairly likely to represent the true quantity are assigned a high prior probability and those that are viewed as less likely receive a correspondingly lower prior probability.

The prior distribution by definition allows investigators to incorporate pre-existing information into their analysis, something that is more difficult to do in the frequentist theory of statistics. Lilford et al. (1995) comment that ‘Bayesian methods utilise all available data’. This provides a distinct advantage over conventional methods of analysis, which Lilford & Braunholtz (1996) rightly observe ‘do not allow decision makers to take explicit account of additional evidence’. The choice of an appropriate prior distribution is usually based on a combination of the following three sources of information:
(i) evidence from previous studies via the inspection of historical data;
(ii) consultation with experts in the field to elicit their clinical opinion, which potentially involves a degree of subjective judgement;
(iii) the development of theoretical physical or biological models.

New evidence from the data collected during the current study is summarized by the likelihood function (Edwards 1992; Berger & Wolpert 1988). This is a mathematical object that describes how the probability distribution of the observed data depends on the particular values of the statistical parameters that govern the chosen class of statistical models.

The last step in the Bayesian process is to combine the prior distribution with the likelihood function using a mathematical routine derived from Bayes’ Theorem (Lindley 1965a,b; Armitage & Berry 1994, p71–77). The result of this process, called the posterior probability distribution, is an updated reflection of our beliefs about the statistical parameters and has a probabilistic interpretation analogous to the prior distribution. From a mathematical point of view, weighting the prior distribution by the likelihood function forms the posterior probability distribution. It is the posterior distribution that is used to draw inferences and thus form conclusions about the relevant quantity of interest.

An important advantage of the Bayesian approach to statistical analysis is that it provides probability distributions for the quantities of interest. This makes it possible to make genuine probability statements about the magnitude of such parameters, such as the probability that a clinical effect lies within a particular range (e.g. ‘the probability that the odds ratio is between 0.2 and 0.5 is 95%’). Furthermore, one consequence of this is the intuitively appealing opportunity to attach a probability to a statistical hypothesis of interest, since such a hypothesis is merely a statement about the value or nature of such parameters (e.g. ‘there is a 5% chance that the treatment effect is greater than 0’). This provides direct and explicit answers to the sorts of questions that are usually posed by clinicians and medical researchers. Such an interpretation can be extrapolated immediately to clinical practice and could form the basis of decisions about policy in public health medicine.

Example 1

Figure 1 illustrates the Bayesian posterior distribution generated from a hypothetical phase II clinical trial investigating the fall in diastolic blood pressure (DBP) following commencement of a new drug ‘X’ in a group of 40 patients with mild untreated hypertension. A pharmaceutical company wishes to determine whether preliminary results are good enough to warrant proceeding to a full phase III trial. At the outset it is stated that the drug will be regarded as ‘potentially useful’ if it reduces DBP by at least 5 mm Hg. An appropriate prior distribution was chosen by seeking the advice of experts.

Figure 1 A Bayesian posterior distribution for the fall in diastolic blood pressure. (Horizontal axis: fall in diastolic blood pressure in response to drug 'X' (mm Hg), 0 to 20; vertical axis: posterior density, 0.0 to 0.4.)

As an example of the type of inference which may be drawn from the posterior distribution, let us use the figure to estimate the probability that the true fall in DBP in response to ‘X’ lies somewhere between 12.5 mm Hg and 15 mm Hg. The required posterior probability is equal to the area under that part of the curve which falls between 12.5 mm Hg and 15 mm Hg (the area which is cross-hatched) as a proportion of the total area under the curve. The total area under the curve is 1.0, the cross-hatched area is 0.155 and the required probability is therefore 15.5%. In other words, having observed the current data, and assuming that the prior distribution was chosen appropriately, there is a probability of 15.5% that drug ‘X’ reduces DBP by between 12.5 mm Hg and 15 mm Hg in patients with mild untreated hypertension.

Equivalently we can answer the research question that was of primary interest to the pharmaceutical company: ‘what is the probability that the true fall in DBP is at least 5 mm Hg?’. The relevant probability is represented by the full shaded area (single line shading and cross-hatching) on Figure 1, which encompasses 95.2% of the total area under the curve.
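
The two areas read off Figure 1 can be checked numerically once a functional form for the posterior is fixed. The sketch below assumes, purely for illustration, a Normal posterior with mean 10 mm Hg and standard deviation 3 mm Hg. The paper does not state the form or parameters of the curve; these values are an assumption, chosen only because they give areas close to the quoted 15.5% and 95.2%. The helper names are hypothetical.

```python
from statistics import NormalDist

# ASSUMPTION: Normal posterior, mean 10 mm Hg, SD 3 mm Hg. These parameters
# are not given in the paper; they are illustrative values that happen to
# give areas close to those quoted for Figure 1.
posterior = NormalDist(mu=10.0, sigma=3.0)


def prob_between(dist, lower, upper):
    """Posterior probability that the parameter lies between lower and upper."""
    return dist.cdf(upper) - dist.cdf(lower)


def prob_at_least(dist, threshold):
    """Posterior probability that the parameter is at least `threshold`."""
    return 1.0 - dist.cdf(threshold)


# Area under the curve between 12.5 and 15 mm Hg (the cross-hatched region).
p_band = prob_between(posterior, 12.5, 15.0)

# Area to the right of 5 mm Hg (the 'potentially useful' threshold).
p_useful = prob_at_least(posterior, 5.0)

print(f"P(12.5 <= fall in DBP <= 15) = {p_band:.3f}")
print(f"P(fall in DBP >= 5)          = {p_useful:.3f}")
```

Because the total area under a probability density is 1, each probability is simply a difference of cumulative distribution values; no separate normalisation step is needed.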

Thus there is a 95.2% posterior probability that the true fall in DBP in response to drug ‘X’ is at least 5 mm Hg.

Choosing a prior distribution

Specification of the prior distribution is a matter of ongoing concern for those contemplating the use of Bayesian methods in medical research. Clearly any conclusions drawn from a Bayesian analysis will potentially be sensitive to the choice of prior distribution. Some authors have devoted considerable thought to the process of formalizing the choice of a prior probability distribution. Freedman & Spiegelhalter (1983), Spiegelhalter & Freedman (1986), Chaloner et al. (1993) and Kadane et al. (1980) have all made some suggestions as to eliciting and quantifying the prior opinions of clinicians, but this remains a difficult task. It is sometimes fancifully suggested that clinicians and ‘consumers’ should come equipped with their own prior distribution which they can then combine with the likelihood function provided by the statistician!

If there is important pre-existing information that needs to be taken into account then it can be incorporated into a subsequent analysis by formulating a suitably descriptive prior distribution. This is a crucial step in the Bayesian process, despite the fact that it is often treated with scepticism by traditionally minded statisticians and clinicians. Nevertheless, although we do not wish to downplay the importance of choosing an appropriate prior distribution in situations where there is considerable prior knowledge, there are, to be realistic, many circumstances where little or no relevant pre-existing information is available. It is perhaps a reasonable criticism of the Bayesian approach to statistical analysis that, in this situation, attempting to specify a prior distribution is effectively trying to quantify something that does not exist. Alternatively, we may wish to restrict attention to the current data so that we can, in some sense, let the data ‘speak for themselves’, or, in the words of Lilford et al. (1995) ‘represent the information arising just from the data’. Lindley (1965a) comments ‘even when one has some appreciable prior knowledge of theta [a quantity of interest] one may like to express the posterior beliefs about theta without reference to them [i.e. the prior distribution]’. Equivalently, it is not unusual to hear researchers in a scientific context say that they need to draw an ‘objective’ inference that is untainted by the personal opinions and prejudices of those participating in the project.

Under these circumstances some statisticians have proposed what might loosely be called an objective Bayesian theory of statistical inference. They advocate the use of ‘vague’, ‘flat’ or ‘non-informative’ prior distributions that in some sense emphasize the role of the current experimental data and obviate the need for specific reference to prior beliefs (Lindley 1965b; Hughes 1993). One such distribution is the uniform probability distribution, which assigns equal prior weight to each possible value of the quantity of interest on the scale of the chosen outcome measure. Each value of the quantity of interest is viewed as ‘equally likely’ before the new data are observed, which seems intrinsically reasonable. The use of a uniform prior probability distribution focuses attention on current rather than pre-existing data (Lindley 1965b), in that the shape of the posterior distribution depends entirely on the likelihood function. Although the use of the uniform prior distribution has a number of shortcomings and does not truly represent a formal mathematical expression of the state of ‘prior ignorance’ (Walley 1991; Walley et al. 1996; see also the discussion), it provides an ad hoc standard or reference analysis, from a common starting point, that aids comparison between current experimental or observational data and data obtained from other sources. Furthermore, the uniform prior distribution provides an important link between frequentist and Bayesian theories of statistical analysis, which can be conveniently illustrated by exploring the role of the confidence interval in statistical inference.

The interpretation of confidence intervals

The concept of a confidence interval was developed by frequentist statisticians in order to represent the precision of a parameter estimate as the size of an interval of values that necessarily includes the estimate itself. Confidence intervals are generated by inverting a probability statement about the data given the value of the parameters, in order to come up with a range of values for the true parameter to which we can attach a probabilistic interpretation. In order to remain faithful to the frequency-based
definition of probability, however, a conventional 95% confidence interval is properly defined in a somewhat subtle manner, in terms of hypothetical repetitions of the study and analysis under consideration. To paraphrase: if new data were to be repeatedly sampled, the same analysis carried out and a series of 95% confidence intervals calculated, 19 out of 20 such intervals would in the long run include the true value of the quantity being estimated (see Armitage & Berry 1994, p93–99).

Most researchers, however, interpret a 95% confidence interval in a rather different manner. They infer that the confidence interval contains the unknown quantity with 95% probability (Burton 1994). Within the frequentist framework, this interpretation of the confidence interval is fundamentally incorrect (Armitage & Berry 1994, p93–99). In any particular case, the unknown true value either does lie within the bounds of the confidence interval or it does not; there is no appropriate frequency-based interpretation of the probability of this ‘single’ event. If, however, an objective Bayesian analysis is carried out using a prior distribution which is uniform on the principal scale of analysis, it can be shown that a conventional C% confidence interval encloses a range of values that also encompasses C% of the area under the posterior distribution (Lindley 1965b). A 95% confidence interval is therefore equivalent to a posterior subjective probability of 95% that the true value lies between the lower and upper bounds of the confidence interval, an interpretation that corresponds precisely to that stated above. Some statisticians would prefer to call a confidence interval interpreted in this manner a Bayesian 95% credible interval (Winkler 1972). The Bayesian interpretation we have suggested is acceptable provided that one acknowledges that we are dealing with subjective probability, not frequency probability, and that one is assuming that any prior information is to be viewed as ‘vague’. This congruence between conventional frequentist confidence intervals and the corresponding Bayesian credible interval associated with a uniform prior probability distribution provides a straightforward and intuitive interpretation of the results of a conventional statistical analysis, and affords a simple introduction to the Bayesian approach to statistical inference (Burton 1994).

Example 2

Intracytoplasmic sperm injection (ICSI) involves the selection and injection of a single spermatozoon into an oocyte. The procedure is an extension of standard in-vitro fertilization treatment and represents the most significant development in the field of assisted reproduction since the birth of the first ‘test tube baby’ in 1978. ICSI offers, for the first time, the prospects of genetic parenthood for men with profound oligozoospermia (low sperm count) and, with the use of testicular biopsy and epididymal aspiration techniques, even for those men with azoospermia (no sperm present in the ejaculate). The success of ICSI has led to its use throughout the world. There are, however, several theoretical concerns about the safety of ICSI and a series of potential risks for the offspring have been identified (Patrizio 1995; Cummins & Jequier 1994; De Kretser 1995).

The ICSI technique was developed by a group at the Brussels Free University (Palermo et al. 1992). From the outset, this group had the foresight to set in place follow-up of the infants born after ICSI treatment at their centre. As their cohort of infants has increased in number, they have published their findings in an overlapping series of papers (Bonduelle et al. 1994; Van Steirteghem et al. 1994; Tournaye et al. 1995; Liebaers et al. 1995; Bonduelle et al. 1996). By 1995 they had assessed 420 live born infants conceived following ICSI treatment at their centre and had identified a series of birth defects (Bonduelle et al. 1996). They used a definition of major birth defects for which population comparison data were not available. They determined that 14 of 420 live born infants (3.3%) had a major birth defect and concluded that there was no increase in the prevalence of birth defects in infants born after ICSI (Bonduelle et al. 1996). However, by reclassifying the reported defects using the classification system used by the Western Australian Birth Defects Registry, researchers were able to compare the birth prevalence of defects with the population prevalence estimates from Western Australia for live births during the same time period (Kurinczuk & Bower 1997). Following the reclassification 31 of the 420 children (7.38%) were defined as having a major birth defect, compared to 3.8% of the general Western Australian population of live births. Of particular interest were
the 14 (3.33%) infants with cardiac malformations defined as major by Kurinczuk & Bower (1997). There was some concern, however, that because of the unusually close surveillance of the Belgian cohort, the increased risk of cardiac birth defects described by Kurinczuk & Bower (1997) may have been due to the over-diagnosis of defects that would otherwise never have come to medical attention (Kurinczuk & Bower 1997; Bonduelle et al. 1997). Having excluded all cardiac defects that may (even remotely) have fallen into this category, 5 of the 420 (1.19%) infants were deemed to have at least one major cardiac defect that would definitely have been identified under routine surveillance. This was then compared to the corresponding prevalence of major cardiac defects in the population of Western Australian live births, that is, 0.67%.

Researchers in Western Australia wished to determine whether these results warranted the submission of a grant application to investigate this issue further using local ICSI data. They wished to know how likely it was that ICSI was associated with an increase in the birth prevalence of major birth defects, particularly cardiac defects, and if so, how likely it was that such an increase in cardiac defects was large, for example, greater than two-fold.

Let us initially consider how these data might be analysed in a conventional setting. The ‘null’ hypothesis will be that ‘the birth prevalence of major cardiac defects in the ICSI birth cohort is the same (0.0067) as in the general Western Australian population’. A conventional test of the null hypothesis based upon the standard Normal approximation to the binomial distribution (Armitage & Berry 1994, pp70–71, 118–125) would utilize a standard error for the proportion of $\sqrt{(0.0067 \times 0.9933)/420} = 0.00398$. The observed proportion in the ICSI cohort is 5/420 = 0.0119 and so the standardized Normal deviate (Z) is $(0.0119 - 0.0067)/0.00398 = 1.31$, which (from the usual statistical tables) is equivalent to a 2-tailed P-value of 0.191. In calculating the 95% confidence interval for the proportion, we now ignore the null hypothesis and use the observed proportion to calculate its standard error (Armitage & Berry 1994, sections 4.7, 4.9): $\sqrt{(0.0119 \times 0.9881)/420} = 0.00529$. This produces a 95% confidence interval of $0.0119 \pm 1.96 \times 0.00529$, that is, 0.00153 to 0.0223.

Using these standard results the data are likely to be interpreted in one of three ways. First, it may be noted that P > 0.05 and this may be interpreted as suggesting that the null hypothesis should be ‘accepted’ and the conclusion drawn that there is no evidence of an increased prevalence of major cardiac defects in children born following ICSI. This interpretation is, of course, fundamentally incorrect. Second, it may be noted that there appears to be a potentially important increase in the prevalence of major cardiac defects in the ICSI cohort that is close to twice the corresponding proportion in the general population. However, because the result based on the five cases of cardiac defects was not statistically significant, it might be argued that this data set is too small to draw any meaningful inferences. This interpretation is safer than the first, but fails to use the data to their full potential. A third alternative is to interpret the 95% confidence interval. This interval (calculated above as 0.00153 to 0.0223) is wide and encompasses values that would lead to quite different inferences. For example, a birth prevalence of 0.002 would suggest that ICSI infants had a prevalence of major cardiac defects that was only 30% of that in the general population, whereas a prevalence of 0.0201 would suggest that it was three times as high. Both of these values are contained in the confidence interval and are therefore, in some sense, consistent with the observed data. This confirms that the sample size is too small and suggests that further study is important. This interpretation is both valid and informative and there is no question that if a standard approach to analysis is to be adopted it should be based upon confidence intervals. This approach does not, however, allow us to express some of these qualitative impressions in a quantitative manner. For example, although a prevalence of 0.0201 falls within the 95% confidence interval and is therefore ‘consistent’ with the data, it is unclear how likely it is, on the basis of this preliminary analysis, that the true birth prevalence really is this high or maybe even higher.

As an alternative, we would propose that a Bayesian analysis be carried out using a ‘non-informative’ prior distribution that is uniform on the scale of proportions. Having made this assumption we can now make use of the equivalence of a standard C% confidence interval and a Bayesian C%
credible interval. In order to generalize the ensuing calculations, let us consider what may be called a critical confidence interval. This is defined as the confidence interval with a midpoint at the observed value (in this example, at a proportion of 0.0119) and a lower limit at the value of a threshold of interest (in this example, at a proportion of 0.0067 corresponding to the ‘null value’ associated with the rate in the general population), and an upper limit that is by symmetry the same distance above the observed value as the lower limit is below. Using Z tables it is straightforward to determine the percentage coverage of this critical confidence interval. In this example, such a confidence interval on the proportion scale extends from 0.0067 to 0.0119 + (0.0119 - 0.0067) = 0.0171. This is symmetric about the observed proportion, that is 0.0119, and extends 0.0052/0.00529 = 0.983 standard errors in either direction. Reference to a table of the Z distribution indicates that 83.71% of the area under the curve lies below Z = 0.983, thus 16.29% lies above this point and, by symmetry, 16.29% of the area lies below Z = -0.983. This particular confidence interval is therefore a (100 - [2 × 16.29]) = 67.42% confidence interval which means that, having adopted a prior distribution that is uniform on the scale of proportions, the range 0.0067 to 0.0171 is a Bayesian 67.42% credible interval. This means that there is 67.42% posterior probability that the true proportion lies between 0.0067 and 0.0171 and a 16.29% posterior probability that it is greater than 0.0171. There is therefore a 67.42% + 16.29% = 83.71% posterior probability that the true proportion is greater than 0.0067 and thus a relatively high probability that the risk of a major cardiac defect in a baby conceived using ICSI is higher than the risk in the general population. Readers should note that, in this particular case, the posterior probability of 83.71% could have been obtained directly from the table of the Z distribution: ‘83.71% of the area under the curve lies below Z = +0.983’. However, we explain the calculation in terms of a two-sided confidence interval, because we believe that this clarifies the full procedure and it is appropriate under all circumstances.

In general, if the percentage coverage of the confidence interval is C%, the posterior probability that the true value of the quantity of interest exceeds the stated threshold is C% + (100% - C%)/2. This is because any value which falls inside the critical confidence interval (posterior probability = C%) must by definition exceed the threshold of interest and symmetry dictates that one half of all values which fall outside the confidence interval (posterior probability = (100% - C%)/2) will also exceed the threshold. In order to calculate the probability that the true value of a quantity of interest is less than a given threshold, one may carry out a series of analogous calculations using the critical confidence interval whose upper limit falls at the threshold.

Returning to the example, let us calculate the probability that the true proportion exceeds 0.0134, which is twice the rate in the general population. Since this value exceeds the observed value of 0.0119, we set the upper bound of the critical confidence interval to the threshold of interest, namely 0.0134, and calculate the lower bound to be as far below the observed value of 0.0119 as 0.0134 is above, giving a value of 0.0119 - (0.0134 - 0.0119) = 0.0104. This confidence interval, extending from 0.0104 to 0.0134, is ± 0.2835 standard errors around the estimated proportion of 0.0119. This is a 22.32% confidence interval and the posterior probability that the true proportion exceeds 0.0134 is half of the probability lying outside this interval, or (100% - 22.32%)/2 = 38.84%.

These results tell the researcher that it is very likely (approximately 84%) that the true prevalence of major cardiac defects is greater in the ICSI cohort than in the general population and that there is close to a 40% probability that it exceeds twice the background rate. Similar calculations demonstrate that the chance that the true proportion in the ICSI cohort is as high as three times the rate in the general population is only 6.06%. To extend the characterization further, Table 1 details the posterior probability that the true proportion exceeds a series of thresholds of interest.

Analyses such as those illustrated above proved to be of considerable value to the medical scientists in Western Australia investigating the risks associated with ICSI therapy. The investigators were subsequently successful in obtaining a research grant (from the March of Dimes Birth Defects Foundation in New York) to continue their work in this area.
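The arithmetic of the worked example can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' own code: the function names `conventional_analysis` and `prob_exceeds` are hypothetical, and the sketch assumes, as the text does, a Normal approximation to the binomial together with a prior that is uniform on the scale of proportions. Because it works with unrounded intermediate values, its results may differ from the hand calculations in the last decimal place.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal distribution (mean 0, sd 1)


def conventional_analysis(cases, n, null_rate):
    """Z test against null_rate plus a 95% confidence interval (Normal approximation)."""
    p_hat = cases / n
    # Standard error under the null hypothesis, used for the significance test.
    se_null = sqrt(null_rate * (1 - null_rate) / n)
    z = (p_hat - null_rate) / se_null
    p_two_tailed = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    # Standard error at the observed proportion, used for the confidence interval.
    se_obs = sqrt(p_hat * (1 - p_hat) / n)
    half_width = STD_NORMAL.inv_cdf(0.975) * se_obs
    return z, p_two_tailed, (p_hat - half_width, p_hat + half_width)


def prob_exceeds(cases, n, threshold):
    """Posterior probability that the true proportion exceeds threshold.

    Follows the critical confidence interval argument: find the coverage C of
    the interval centred on the observed proportion with one limit at the
    threshold, then add or take half of the probability lying outside it.
    """
    p_hat = cases / n
    se_obs = sqrt(p_hat * (1 - p_hat) / n)
    z = abs(threshold - p_hat) / se_obs       # half-width in standard errors
    coverage = 2 * STD_NORMAL.cdf(z) - 1      # C, as a fraction
    one_tail = (1 - coverage) / 2             # probability beyond either limit
    if threshold <= p_hat:
        return coverage + one_tail            # threshold is the lower limit
    return one_tail                           # threshold is the upper limit


z, p_value, ci = conventional_analysis(5, 420, 0.0067)
print(f"Z = {z:.2f}, two-tailed P = {p_value:.3f}, 95% CI = {ci[0]:.5f} to {ci[1]:.4f}")
print(f"P(true rate > 0.0067) = {prob_exceeds(5, 420, 0.0067):.1%}")
print(f"P(true rate > 0.0134) = {prob_exceeds(5, 420, 0.0134):.1%}")
```

The two branches of `prob_exceeds` correspond to the two cases in the text: a threshold below the observed value gives C% + (100% - C%)/2, while a threshold above it gives (100% - C%)/2.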


Table 1 The posterior probability that the prevalence of major cardiac defects in the ICSI cohort exceeds a series of thresholds based on the prevalence in the general population

Threshold      Threshold as a multiple     Posterior probability     Posterior probability
of interest    of the prevalence in        that the true rate        that the true rate is
               general population          exceeds the threshold     less than the threshold

0.0067         1.0                         83.71%                    16.29%
0.01005        1.5                         63.67%                    36.33%
0.0134         2.0                         38.84%                    61.16%
0.01675        2.5                         17.97%                    82.03%
0.0201         3.0                          6.06%                    93.94%
0.02345        3.5                          1.45%                    98.55%
0.0268         4.0                          0.24%                    99.76%
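The entries of Table 1 can be approximately regenerated with the short sketch below. It takes the direct route that the text notes is available: under a uniform prior and the Normal approximation, the posterior is treated as a Normal distribution centred on the observed proportion with the observed standard error. The function name `threshold_table` is hypothetical, and small differences from the printed table are to be expected because the sketch does not round intermediate values.

```python
from math import sqrt
from statistics import NormalDist


def threshold_table(cases, n, baseline, multiples):
    """Return rows of (threshold, multiple, P(exceeds), P(less than))."""
    p_hat = cases / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    # Assumed posterior: Normal, centred on the observed proportion
    # (uniform prior on the proportion scale, Normal approximation).
    posterior = NormalDist(mu=p_hat, sigma=se)
    rows = []
    for multiple in multiples:
        threshold = baseline * multiple
        below = posterior.cdf(threshold)
        rows.append((threshold, multiple, 1 - below, below))
    return rows


rows = threshold_table(5, 420, 0.0067, [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
for threshold, multiple, above, below in rows:
    print(f"{threshold:.5f}  {multiple:.1f}  {above:6.2%}  {below:6.2%}")
```

Any other set of clinically relevant thresholds can be passed in place of the multiples of the background rate used here.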

Discussion

Conventional statistical analyses based upon the ‘frequency-based’ view of probability and statistical inference often fail to make the best and most complete use of available data when assessing the evidence for an hypothesis under investigation (Burton 1994; Burton et al. 1998; Lilford et al. 1995). This results from having to restrict attention to the consideration of just one hypothesis (usually the ‘null’ hypothesis) and then to the comparison of the resultant P-value with an arbitrarily chosen threshold (usually P = 0.05) instead of viewing it on a continuous scale as an (indirect) measure of evidence. This reduces a potentially powerful inferential tool to a simplistic, mechanistic and ultimately very poor form of decision analysis known as the statistical significance test. The widely held belief that all studies and experiments that result in a non-significant P-value provide the same support for the specified hypothesis is just one example of the type of misunderstanding that can easily arise from a failure to appreciate the subtleties of interpretation associated with conventional frequentist statistical analysis (Freeman 1993).

We have proposed an alternative approach that views the problem of statistical inference from a Bayesian perspective. Such an approach allows one to make full use of the available data. A genuine probabilistic interpretation, based on the concept of subjective probability, provides direct answers to questions about the probable magnitude of the effects of interest and hence permits one to compare competing hypotheses in a straightforward and understandable manner. The use of subjective probability within a Bayesian framework is particularly useful in circumstances where a conventional approach to statistical analysis may be difficult or misleading. These include circumstances where: (i) a statistically non-significant result may be large enough to be clinically relevant (small sample size); (ii) a statistically significant result may be too small to be of clinical relevance (large sample size); or (iii) one wishes to draw quantitative conclusions regarding the probability that two or more outcomes are sufficiently similar that any difference is unlikely to be clinically relevant.

The use of a uniform prior probability distribution promotes a confluence between the Bayesian and conventional frequentist approaches, since 95% confidence intervals can be viewed legitimately as containing the true value of interest with 95% probability. Many researchers already interpret confidence intervals in precisely this manner and thus our proposal does not require a radical modification of the way in which many researchers approach statistical analyses. It is important, however, that one acknowledges that this interpretation of confidence intervals is only valid if one works within a Bayesian framework using a uniform prior distribution. We have suggested reporting, where appropriate, the posterior probability that the quantity of interest exceeds a series of clinically relevant thresholds rather than just a single 95% confidence interval.

Although the use of a uniform prior probability distribution provides a neat introduction to the Bayesian process, there are a number of reasons why the uniform prior distribution does not provide the


foundation on which to base a bold new theory of statistical analysis!

First, the uniform prior probability distribution does not provide a formal mathematical representation of ‘prior ignorance’. No single prior distribution is appropriate when one is faced with a complete lack of information (Walley 1991). Walley et al. (1996) note that ‘. . . any [single] Bayesian prior distribution assigns precise probabilities to hypotheses and therefore has strong behavioural implications, e.g. it precisely determines “fair” betting rates on the truth of the hypotheses.’ Bayesian statisticians would endorse repeating the analysis using many different prior distributions in the hope of encapsulating a wide range of prior beliefs about the values of the relevant parameters. This is known as a Robust Bayesian approach to analysis (Berger 1984, 1990, 1994; Greenhouse & Wasserman 1995). Such a sensitivity analysis is clearly important if we are to ascertain how the posterior distribution is affected by changes to the prior probability distribution or by changes to the model used to create the likelihood.

A second difficulty with the uniform prior distribution is its sensitivity to transformation; the uniform distribution may in fact be very non-uniform when transformed to another scale of analysis. A prior distribution that is uniform on the scale of proportions, for example, cannot simultaneously be uniform on the scale of odds and vice versa, and yet in many cases either scale would be appropriate for analysis. We would argue that if two scales really are equally appropriate, and the use of a prior which is uniform on one scale leads to a qualitatively different conclusion to an analysis based upon a prior which is uniform on the other scale, then inferences must, of course, be viewed as uncertain. One hopes that in situations where more than one analytical scale is appropriate, the choice of scale would result in relatively small quantitative changes rather than large qualitative alterations to the principal conclusions. To illustrate, if Example 2 had been worked assuming uniformity on the scale of log_e(odds) rather than on the scale of proportions, the estimated posterior probability that the true rate of cardiovascular birth defects in ICSI babies exceeded the general population rate, or twice that rate, would have been 90.1% and 39.5%, respectively. Because the sample size is so small (only five cases), these probabilities are noticeably different to the original values of 83.7% and 38.8%, respectively. Nevertheless, this change would make little or no difference to the principal conclusion of the analysis.

In most settings in medical statistics, confidence intervals are calculated in a way that assumes that the distribution of the data, or at least a relevant summary statistic, can be approximated by a suitable Normal distribution. The correspondence of a C% confidence interval calculated using such an approximation to a C% credible interval is therefore only exact when the data are Normally distributed. For most of the standard probability distributions used to analyse and model medical data, the approximation is, in general, quite close even when the sample size is relatively small (for an example, see Burton (1994)).

One of the problems with Bayesian analysis is that it is often a non-trivial problem to combine the prior information and the current data to produce the posterior distribution. Despite the increasing availability of purpose-designed software for Bayesian analysis (BUGS, Spiegelhalter et al. 1995), specialist advice and software are generally required in order to bring Bayesian statistics into the medical research workplace. The congruence between conventional confidence intervals and Bayesian credible intervals generated using a uniform prior distribution does, however, provide a simple way to obtain inferences in Bayesian form which can be implemented using standard software based on the results and output of a conventional statistical analysis.

The use of Bayesian methods is growing amongst clinical scientists and clinicians. The congruence between a Bayesian analysis using a uniform prior and a conventional analysis provides a non-threatening introduction to Bayesian methods and means that analyses of the type we describe can be carried out on standard software. Our approach is straightforward to implement, offers the potential to describe the results of conventional analyses in a manner that is more easily understood, and leads naturally to rational decisions. We do not suggest that this approach should be used all the time, nor should it be used as an excuse for designing studies which are too small or as a fallback position when a conventional analysis fails to produce a statistically significant result. However, when it is used appropriately, we


believe that this approach is a useful addition to conventional methods.

Acknowledgements

Jennifer Kurinczuk gratefully acknowledges receipt of a two-year project grant from the March of Dimes Birth Defects Foundation, New York (#6-FY98–497; #6-FY99–683).

This work was funded in part by the National Health and Medical Research Council of Australia as one component of Program Grant #96\3209.

References

Armitage P. & Berry G. (1994) Statistical Methods in Medical Research. 3rd edn. Blackwell Scientific Publications, Oxford.

Berger J. (1984) The robust Bayesian viewpoint (with discussion). In: Robustness in Bayesian Analyses. (Ed. J. Kadane). North-Holland, Amsterdam. pp. 63–144.

Berger J. (1990) Robust Bayesian analysis: sensitivity to the prior. Journal of Statistical Planning and Inference 25, 303–328.

Berger J.O. (1994) An overview of robust Bayesian analysis (with discussion). Test 3, 5–124.

Berger J.O. & Wolpert R.L. (1988) The Likelihood Principle. 2nd edn. Institute of Mathematical Statistics, Hayward, California.

Bonduelle M., Desmyttere S., Buysse A., Van Assche E., Schietecatte J., Devroey P. et al. (1994) Prospective follow-up study of 55 children born after subzonal insemination and intracytoplasmic sperm injection. Human Reproduction 9, 1765–1769.

Bonduelle M., Legein J., Buysse A., Van Assche E., Wisanto A., Devroey P. et al. (1996) Prospective follow-up study of 423 children born after intracytoplasmic sperm injection. Human Reproduction 11, 1558–1564.

Bonduelle M., Devroey P., Liebaers I. & Van Steirteghem A. (1997) Commentary: Major defects are overestimated. British Medical Journal 315, 1265–1266.

Box G.E.P. & Tiao G.C. (1973) Bayesian inference in statistical analysis. Addison-Wesley, Reading, Massachusetts.

Burton P.R. (1994) Helping doctors to draw appropriate inferences from the analysis of medical studies. Statistics in Medicine 13, 1699–1713.

Burton P.R., Gurrin L.C. & Campbell M.J. (1998) Clinical significance not statistical significance: a simple Bayesian alternative to p values. Journal of Epidemiology and Community Health 52, 318–323.

Chaloner K., Church T., Louis T.A. & Matts J.P. (1993) Graphical elicitation of a prior distribution for a clinical trial. The Statistician 42, 341–353.

Cummins J.M. & Jequier A.M. (1994) Treating male infertility needs more clinical andrology, not less. Human Reproduction 9, 1214–1219.

De Kretser D.M. (1995) The potential of intracytoplasmic sperm injection (ICSI) to transmit genetic defects causing male infertility. Reproduction Fertility and Development 7, 137–142.

Edwards A.W.F. (1992) Likelihood. Johns Hopkins University Press, Baltimore.

Freedman L.S. & Spiegelhalter D.J. (1983) The assessment of subjective opinion and its use in relation to stopping rules for clinical trials. The Statistician 32, 153–160.

Freeman P.R. (1993) The role of p-values in analysing trials results. Statistics in Medicine 12, 1443–1452.

Greenhouse J.B. & Wasserman L. (1995) Robust Bayesian methods for monitoring clinical trials. Statistics in Medicine 14, 1379–1391.

Hughes M.D. (1993) Reporting Bayesian analyses of clinical trials. Statistics in Medicine 12, 1651–1663.

Kadane J.B., Dickey J.M., Winkler R.L., Smith W.S. & Peters S.C. (1980) Interactive elicitation of opinion for a normal linear model. Journal of the American Statistical Association 75, 845–854.

Kurinczuk J.J. & Bower C. (1997) Birth defects in infants conceived by intracytoplasmic sperm injection: an alternative interpretation. British Medical Journal 315, 1260–1265.

Lee P.M. (1989) Bayesian statistics: an introduction. Arnold, London.

Liebaers I., Bonduelle M., Legein J., Wilikens E., Van Assche E., Buysse A. et al. (1995) Follow-up of children born after intracytoplasmic sperm injection. In: Fertility and Sterility: a Current Overview. (Hedon B., Bringer J., Mares P. eds.) Parthenon, New York.

Lilford R.J., Thornton J.G. & Braunholtz D. (1995) Clinical trials and rare diseases: a way out of a conundrum. British Medical Journal 311, 1621–1625.

Lilford R.J. & Braunholtz D. (1996) The statistical basis of public policy: a paradigm shift is overdue. British Medical Journal 313, 603–607.

Lindley D.V. (1965a) Introduction to probability and statistics from a Bayesian viewpoint. Part 1 Probability. Cambridge University Press, Cambridge. pp. 19–25, 29–42, 50, 58.

Lindley D.V. (1965b) Introduction to probability and statistics from a Bayesian viewpoint. Part 2 Inference. Cambridge University Press, Cambridge. pp. 1–13, 15, 18, 19.

Palermo G., Joris H., Devroey P. & Van Steirteghem A.C. (1992) Pregnancies after intracytoplasmic injection of a single spermatozoon into an oocyte. Lancet 340, 17–18.

Patrizio P. (1995) Intracytoplasmic sperm injection (ICSI): potential genetic concerns. Human Reproduction 10, 2520–2523.

Spiegelhalter D.J. & Freedman L.S. (1986) A predictive approach to selecting the size of a clinical trial, based on subjective opinion. Statistics in Medicine 5, 1–13.

Spiegelhalter D., Thomas A., Best N. & Gilks W. (1995) BUGS. Bayesian inference using Gibbs sampling, Version 0.60. MRC Biostatistics Unit, Cambridge. http://www.mrc-bsu.cam.ac.uk/bugs.

Tournaye H., Liu J., Nagy Z., Joris H., Wisanto A., Bonduelle M. et al. (1995) Intracytoplasmic sperm injection (ICSI): the Brussels Experience. Reproduction Fertility and Development 7, 269–279.

Van Steirteghem A.C., Nagy P., Liu J., Joris H., Smitz J., Camus M. et al. (1994) Intracytoplasmic sperm injection – ICSI. Reproductive Medical Review 3, 199–207.

Walley P. (1991) Statistical reasoning with imprecise probabilities. Chapman & Hall, London.

Walley P., Gurrin L. & Burton P. (1996) Analysis of clinical data using imprecise prior probabilities. The Statistician 45, 457–486.

Winkler R.L. (1972) An introduction to Bayesian inference and decision. Holt, Rinehart and Winston Inc., New York. pp. 395–396.
