ANALYSIS

Box 1 | Key statistical terms

CAMARADES
The Collaborative Approach to Meta-Analysis and Review of Animal Data from Experimental Studies (CAMARADES) is a collaboration that aims to reduce bias and improve the quality of methods and reporting in animal research. To this end, CAMARADES provides a resource for data sharing, aims to provide a web-based stratified meta-analysis bioinformatics engine and acts as a repository for completed reviews.

Effect size
An effect size is a standardized measure that quantifies the size of the difference between two groups or the strength of an association between two variables. As standardized measures, effect sizes allow estimates from different studies to be compared directly and also to be combined in meta-analyses.

Excess significance
Excess significance is the phenomenon whereby the published literature has an excess of statistically significant results that are due to biases in reporting. Several mechanisms contribute to reporting bias, including study publication bias, where the results of statistically non-significant ('negative') studies are left unpublished; selective outcome reporting bias, where null results are omitted; and selective analysis bias, where data are analysed with different methods that favour 'positive' results.

Fixed and random effects
A fixed-effect meta-analysis assumes that the underlying effect is the same (that is, fixed) in all studies and that any variation is due to sampling errors. By contrast, a random-effect meta-analysis does not require this assumption and allows for heterogeneity between studies. A test of heterogeneity in between-study effects is often used to test the fixed-effect assumption.

Meta-analysis
Meta-analysis refers to statistical methods for contrasting and combining results from different studies to provide more powerful estimates of the true effect size, as opposed to a less precise effect size derived from a single study.

Positive predictive value
The positive predictive value (PPV) is the probability that a 'positive' research finding reflects a true effect (that is, the finding is a true positive). This probability of a research finding reflecting a true effect depends on the prior probability of it being true (before doing the study), the statistical power of the study and the level of statistical significance.

Proteus phenomenon
The Proteus phenomenon refers to the situation in which the first published study is often the most biased towards an extreme result (the winner's curse). Subsequent replication studies tend to be less biased towards the extreme, often finding evidence of smaller effects or even contradicting the findings from the initial study.

Statistical power
The statistical power of a test is the probability that it will correctly reject the null hypothesis when the null hypothesis is false (that is, the probability of not committing a type II error or making a false negative decision). The probability of committing a type II error is referred to as the false negative rate (β), and power is equal to 1 − β.

Winner's curse
The winner's curse refers to the phenomenon whereby the 'lucky' scientist who makes a discovery is cursed by finding an inflated estimate of that effect. The winner's curse occurs when thresholds, such as statistical significance, are used to determine the presence of an effect and is most severe when thresholds are stringent and studies are too small and thus have low power.

First, low power, by definition, means that the chance of discovering effects that are genuinely true is low. That is, low-powered studies produce more false negatives than high-powered studies. When studies in a given field are designed with a power of 20%, it means that if there are 100 genuine non-null effects to be discovered in that field, these studies are expected to discover only 20 of them11.

Second, the lower the power of a study, the lower the probability that an observed effect that passes the required threshold of claiming its discovery (that is, reaching nominal statistical significance, such as p < 0.05) actually reflects a true effect1,12. This probability is called the PPV of a claimed discovery. The formula linking the PPV to power is:

PPV = ([1 − β] × R) / ([1 − β] × R + α)

where (1 − β) is the power, β is the type II error, α is the type I error and R is the pre-study odds (that is, the odds that a probed effect is indeed non-null among the effects being probed). The formula is derived from a simple two-by-two table that tabulates the presence and non-presence of a non-null effect against significant and non-significant research findings1. The formula shows that, for studies with a given pre-study odds R, the lower the power and the higher the type I error, the lower the PPV. And for studies with a given pre-study odds R and a given type I error (for example, the traditional p = 0.05 threshold), the lower the power, the lower the PPV.

For example, suppose that we work in a scientific field in which one in five of the effects we test are expected to be truly non-null (that is, R = 1 / (5 − 1) = 0.25) and that we claim to have discovered an effect when we reach p < 0.05; if our studies have 20% power, then PPV = 0.20 × 0.25 / (0.20 × 0.25 + 0.05) = 0.05 / 0.10 = 0.50; that is, only half of our claims for discoveries will be correct. If our studies have 80% power, then PPV = 0.80 × 0.25 / (0.80 × 0.25 + 0.05) = 0.20 / 0.25 = 0.80; that is, 80% of our claims for discoveries will be correct.

Third, even when an underpowered study discovers a true effect, it is likely that the estimate of the magnitude of that effect provided by that study will be exaggerated. This effect inflation is often referred to as the 'winner's curse'13 and is likely to occur whenever claims of discovery are based on thresholds of statistical significance (for example, p < 0.05) or other selection filters (for example, a Bayes factor better than a given value or a false-discovery rate below a given value). Effect inflation is worst for small, low-powered studies, which can only detect effects that happen to be large. If, for example, the true effect is medium-sized, only those small studies that, by chance, overestimate the magnitude of the effect will pass the threshold for discovery. To illustrate the winner's curse, suppose that an association truly exists with an effect size that is equivalent to an odds ratio of 1.20, and we are trying to discover it by performing a small (that is, underpowered) study. Suppose also that our study only has the power to detect an odds ratio of 1.20 on average 20% of the time. The results of any study are subject to sampling variation and random error in the measurements of the variables and outcomes of interest. Therefore, on average, our small study will find an odds ratio of 1.20 but, because of random errors, our study may in fact find an odds ratio smaller than 1.20 (for example, 1.00) or an odds ratio larger than 1.20 (for example, 1.60). Odds ratios of 1.00 or 1.20 will not reach statistical significance because of the small sample size. We can only claim the association as nominally significant in the third case, where random error has inflated the observed odds ratio to about 1.60.
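To make this illustration concrete, the following Python sketch simulates many small two-arm studies of a true odds ratio of 1.20 with roughly 20% power and then looks only at the runs that reach p < 0.05. The baseline event rate, the per-arm sample size, the number of simulations and the Wald test on the log odds ratio are illustrative assumptions rather than details taken from the studies discussed in this article.

```python
import numpy as np

rng = np.random.default_rng(1)

TRUE_OR = 1.20
P_CONTROL = 0.50                            # assumed baseline event rate
P_TREATED = TRUE_OR * P_CONTROL / (1 - P_CONTROL + TRUE_OR * P_CONTROL)
N_PER_ARM = 310                             # chosen so that power is roughly 20% at alpha = 0.05
N_SIMULATIONS = 20_000
Z_CRITICAL = 1.96                           # two-sided p < 0.05

significant_ors = []
for _ in range(N_SIMULATIONS):
    a = rng.binomial(N_PER_ARM, P_TREATED)  # events in the treated arm
    c = rng.binomial(N_PER_ARM, P_CONTROL)  # events in the control arm
    b, d = N_PER_ARM - a, N_PER_ARM - c     # non-events in each arm
    if min(a, b, c, d) == 0:
        continue                            # skip degenerate tables
    log_or = np.log((a * d) / (b * c))
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # Wald standard error of the log odds ratio
    if abs(log_or) / se > Z_CRITICAL:       # the study 'discovers' the effect
        significant_ors.append(np.exp(log_or))

power = len(significant_ors) / N_SIMULATIONS
print(f"empirical power: {power:.2f}")      # roughly 0.20
print(f"true odds ratio: {TRUE_OR}")
print(f"mean odds ratio among significant studies: {np.mean(significant_ors):.2f}")
```

Because only the runs that happen to overestimate the association cross the significance threshold, the average 'discovered' odds ratio comes out well above the true value of 1.20, which is the winner's curse in miniature.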

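The PPV formula above is also straightforward to evaluate directly. The short Python sketch below reproduces the worked example (R = 0.25, α = 0.05) and evaluates a few additional combinations of power and pre-study odds in the spirit of FIG. 4; the particular grid of values is chosen here only for illustration.

```python
def ppv(power: float, prestudy_odds: float, alpha: float = 0.05) -> float:
    """Positive predictive value: PPV = ((1 - beta) * R) / ((1 - beta) * R + alpha)."""
    return (power * prestudy_odds) / (power * prestudy_odds + alpha)

# Worked example from the text: R = 0.25, alpha = 0.05
print(ppv(0.20, 0.25))   # 0.50: only half of the claimed discoveries are true
print(ppv(0.80, 0.25))   # 0.80: 80% of the claimed discoveries are true

# PPV falls further when the pre-study odds are low (compare FIG. 4)
for r in (0.05, 0.25, 1.0):
    for power in (0.10, 0.30, 0.80):
        print(f"R = {r:<4}  power = {power:.2f}  PPV = {ppv(power, r):.2f}")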

Table 2 | Sample size required to detect sex differences in water maze and radial maze performance

              Total animals   Required N per study      Typical N per study   Detectable effect for typical N
              used            80% power   95% power     Mean    Median        80% power   95% power
Water maze    420             134         220           22      20            d = 1.26    d = 1.62
Radial maze   514             68          112           24      20            d = 1.20    d = 1.54

Meta-analysis indicated an effect size of Cohen's d = 0.49 for water maze studies and d = 0.69 for radial maze studies.

80% power, and the average sample size of 24 animals for the radial maze experiments was only sufficient to detect an effect size of d = 1.20. In order to achieve 80% power to detect, in a single study, the most probable true effects as indicated by the meta-analysis, a sample size of 134 animals would be required for the water maze experiment (assuming an effect size of d = 0.49) and 68 animals for the radial maze experiment (assuming an effect size of d = 0.69); to achieve 95% power, these sample sizes would need to increase to 220 and 112, respectively. What is particularly striking, however, is the inefficiency of a continued reliance on small sample sizes. Despite the apparently large numbers of animals required to achieve acceptable statistical power in these experiments, the total numbers of animals actually used in the studies contributing to the meta-analyses were even larger: 420 for the water maze experiments and 514 for the radial maze experiments.

There is ongoing debate regarding the appropriate balance to strike between using as few animals as possible in experiments and the need to obtain robust, reliable findings. We argue that it is important to appreciate the waste associated with an underpowered study — even a study that achieves only 80% power still presents a 20% possibility that the animals have been sacrificed without the study detecting the underlying true effect. If the average power in neuroscience animal model studies is between 20–30%, as we observed in our analysis above, the ethical implications are clear.
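For reference, the required sample sizes and detectable effects in TABLE 2 can be approximated with any standard power routine for a two-group comparison. The Python sketch below uses the statsmodels solver for an independent-samples t-test with equal group sizes; this is an illustration rather than the authors' own calculation, so the resulting integers may differ slightly from the table depending on the rounding convention and the exact test assumed.

```python
from math import ceil

from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
ALPHA = 0.05

# Required total N (two equal groups) for the meta-analytic effect sizes;
# compare with the 'Required N per study' column of TABLE 2.
for label, d in (("water maze", 0.49), ("radial maze", 0.69)):
    for power in (0.80, 0.95):
        n_per_group = solver.solve_power(effect_size=d, alpha=ALPHA, power=power)
        print(f"{label}: d = {d}, power = {power:.0%}, "
              f"total N ~ {2 * ceil(n_per_group)}")

# Smallest effect detectable with the typical sample sizes actually used;
# compare with the 'Detectable effect for typical N' column of TABLE 2.
for label, n_total in (("water maze", 22), ("radial maze", 24)):
    for power in (0.80, 0.95):
        d_min = solver.solve_power(nobs1=n_total / 2, alpha=ALPHA, power=power)
        print(f"{label}: typical N = {n_total}, power = {power:.0%}, "
              f"detectable d ~ {d_min:.2f}")
```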
Low power therefore has an ethical dimension — unreliable research is inefficient and wasteful. This applies to both human and animal research. The principles of the 'three Rs' in animal research (reduce, refine and replace)83 require appropriate experimental design and statistics — both too many and too few animals present an issue, as they reduce the value of research outputs. A requirement for sample size and power calculation is included in the Animal Research: Reporting In Vivo Experiments (ARRIVE) guidelines84, but such calculations require a clear appreciation of the expected magnitude of the effects being sought.

Figure 4 | Positive predictive value as a function of the pre-study odds of association for different levels of statistical power. [The plot shows the post-study probability (%) on the y-axis against the pre-study odds R (0 to 1.0) on the x-axis, with separate curves for 80%, 30% and 10% power.] The probability that a research finding reflects a true effect — also known as the positive predictive value (PPV) — depends on both the pre-study odds of the effect being true (the ratio R of 'true effects' over 'null effects' in the scientific field) and the study's statistical power. The PPV can be calculated for given values of statistical power (1 − β), pre-study odds ratio (R) and type I error rate (α), using the formula PPV = ([1 − β] × R) / ([1 − β] × R + α). The median statistical power of studies in the neuroscience field is optimistically estimated to be between ~8% and ~31%. The figure illustrates how low statistical power consistent with this estimated range (that is, between 10% and 30%) detrimentally affects the association between the probability that a finding reflects a true effect (PPV) and the pre-study odds, assuming α = 0.05. Compared with conditions of appropriate statistical power (that is, 80%), the probability that a research finding reflects a true effect is greatly reduced for 10% and 30% power, especially if pre-study odds are low. Notably, in an exploratory research field such as much of neuroscience, the pre-study odds are often low.

Of course, it is also wasteful to continue data collection once it is clear that the effect being sought does not exist or is too small to be of interest. That is, studies are not just wasteful when they stop too early; they are also wasteful when they stop too late. Planned, sequential analyses are sometimes used in large clinical trials when there is considerable expense or potential harm associated with testing participants. Clinical trials may be stopped prematurely in the case of serious adverse effects, clear beneficial effects (in which case it would be unethical to continue to allocate participants to a placebo condition) or if the interim effects are so unimpressive that any prospect of a positive result with the planned sample size is extremely unlikely85. Within a significance testing framework, such interim analyses — and the protocol for stopping — must be planned for the assumptions of significance testing to hold. Concerns have been raised as to whether stopping trials early is ever justified, given the tendency for such a practice to produce inflated effect size estimates86. Furthermore, the decision process around stopping is not often fully disclosed, increasing the scope for researcher degrees of freedom86. Alternative approaches exist. For example, within a Bayesian framework, one can monitor the Bayes factor and simply stop testing when the evidence is conclusive or when resources are expended87.
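As a deliberately simplified illustration of such monitoring (not a procedure described in this article), the Python sketch below tracks the Bayes factor for a single proportion, comparing H1: θ uniform on (0, 1) against H0: θ = 0.5, after every new observation, and stops as soon as the evidence exceeds a factor of 10 in either direction or the maximum sample size is reached. The model, the stopping threshold of 10 and the simulated effect are all assumptions made for the example.

```python
from math import exp, lgamma, log
import random

def log_bf10(successes: int, n: int) -> float:
    """Log Bayes factor for H1: theta ~ Uniform(0, 1) against H0: theta = 0.5."""
    log_marginal_h1 = lgamma(successes + 1) + lgamma(n - successes + 1) - lgamma(n + 2)
    log_marginal_h0 = n * log(0.5)
    return log_marginal_h1 - log_marginal_h0

random.seed(3)
TRUE_THETA = 0.65     # assumed true success probability (an effect is actually present)
THRESHOLD = 10.0      # stop once BF10 > 10 (evidence for H1) or BF10 < 1/10 (evidence for H0)
MAX_N = 500           # stop in any case once the resources are expended

successes = 0
for n in range(1, MAX_N + 1):
    successes += 1 if random.random() < TRUE_THETA else 0   # one new observation arrives
    bf10 = exp(log_bf10(successes, n))
    if bf10 > THRESHOLD or bf10 < 1 / THRESHOLD:
        break

print(f"stopped after n = {n} observations with BF10 = {bf10:.1f}")
```

With the simulated effect present, the loop typically stops long before the maximum sample size; if TRUE_THETA is set to 0.5, the Bayes factor instead drifts slowly towards the null and the loop more often runs until the resources are exhausted.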


Similarly, adopting conservative priors can substantially reduce the likelihood of claiming that an effect exists when in fact it does not85. At present, significance testing remains the dominant framework within neuroscience, but the flexibility of alternative (for example, Bayesian) approaches means that they should be taken seriously by the field.

Figure 5 | The winner's curse: effect size inflation as a function of statistical power. [The plot shows the relative bias of the research finding (%) on the y-axis against the statistical power of the study (%) on the x-axis, both from 0 to 100.] The winner's curse refers to the phenomenon that studies that find evidence of an effect often provide inflated estimates of the size of that effect. Such inflation is expected when an effect has to pass a certain threshold — such as reaching statistical significance — in order for it to have been 'discovered'. Effect inflation is worst for small, low-powered studies, which can only detect effects that happen to be large. If, for example, the true effect is medium-sized, only those small studies that, by chance, estimate the effect to be large will pass the threshold for discovery (that is, the threshold for statistical significance, which is typically set at p < 0.05). In practice, this means that research findings of small studies are biased in favour of inflated effects. By contrast, large, high-powered studies can readily detect both small and large effects and so are less biased, as both over- and underestimations of the true effect size will pass the threshold for 'discovery'. We optimistically estimate the median statistical power of studies in the neuroscience field to be between ~8% and ~31%. The figure shows simulations of the winner's curse (expressed on the y-axis as relative bias of research findings). These simulations suggest that initial effect estimates from studies powered between ~8% and ~31% are likely to be inflated by 25% to 50% (shown by the arrows in the figure). Inflated effect estimates make it difficult to determine an adequate sample size for replication studies, increasing the probability of type II errors. Figure is modified, with permission, from REF. 103 © (2007) Cell Press.

Conclusions and future directions
A consequence of the remarkable growth in neuroscience over the past 50 years has been that the effects we now seek in our experiments are often smaller and more subtle than in the past, when mostly easily discernible 'low-hanging fruit' were targeted. At the same time, computational analysis of very large datasets is now relatively straightforward, so that an enormous number of tests can be run in a short time on the same dataset. These dramatic advances in the flexibility of research design and analysis have occurred without accompanying changes to other aspects of research design, particularly power. For example, the average sample size has not changed substantially over time88, despite the fact that neuroscientists are likely to be pursuing smaller effects. The increase in research flexibility and the complexity of study designs89, combined with the stability of sample sizes and the search for increasingly subtle effects, has a disquieting consequence: a dramatic increase in the likelihood that statistically significant findings are spurious. This may be at the root of the recent replication failures in the preclinical literature8 and the correspondingly poor translation of these findings into humans90.

Low power is a problem in practice because of the normative publishing standards for producing novel, significant, clean results and the ubiquity of null hypothesis significance testing as the means of evaluating the truth of research findings. As we have shown, these factors result in biases that are exacerbated by low power. Ultimately, these biases reduce the reproducibility of neuroscience findings and negatively affect the validity of the accumulated findings. Unfortunately, publishing and reporting practices are unlikely to change rapidly. Nonetheless, existing scientific practices can be improved with small changes or additions that approximate key features of the idealized model4,91,92. We provide a summary of recommendations for future research practice in BOX 2.

Increasing disclosure. False positives occur more frequently and go unnoticed when degrees of freedom in data analysis and reporting are undisclosed5. Researchers can improve confidence in published reports by noting in the text: "We report how we determined our sample size, all data exclusions, all data manipulations, and all measures in the study."7 When such a statement is not possible, disclosure of the rationale and justification for deviations from what should be common practice (that is, reporting sample size, data exclusions, manipulations and measures) will improve readers' understanding and interpretation of the reported effects and, therefore, of what level of confidence in the reported effects is appropriate. In clinical trials, there is an increasing requirement to adhere to the Consolidated Standards of Reporting Trials (CONSORT), and the same is true for systematic reviews and meta-analyses, for which the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines are now being adopted. A number of reporting guidelines have been produced for application to diverse study designs and tools, and an updated list is maintained by the EQUATOR Network93. A ten-item checklist of study quality has been developed by the Collaborative Approach to Meta-Analysis and Review of Animal Data in Experimental Stroke (CAMARADES), but to the best of our knowledge, this checklist is not yet widely used in primary studies.
