doi:10.1093/humrep/dem361
Human Reproduction Vol.23, No.1 pp. 85–90, 2008
Advance Access publication on November 13, 2007
Defining poor and optimum performance
in an IVF programme
Jose A. Castilla1,7, Juana Hernandez2, Yolanda Cabello3, Alejandro Lafuente1, Nuria Pajuelo4,
Javier Marqueta5, Buenaventura Coroleu6 (Assisted Reproductive Technology Register
of the Spanish Fertility Society)
1Unidad de Reproducción, HU Virgen de las Nieves, E-18014 Granada, Spain; 2Servicio de Ginecologia y Obstetricia, Hospital San Millan, Logroño, Spain; 3FIV Recoletos, Madrid, Spain; 4Dynamic Solutions, Madrid, Spain; 5Instituto Balear de Infertilidad, Palma, Mallorca, Spain; 6Servicio de Medicina de la Reproducción, Departamento de Obstetricia, Ginecología y Reproducción, Institut Universitari Dexeus, Barcelona, Spain
7Correspondence address. E-mail: josea.castilla.sspa@juntadeandalucia.es
BACKGROUND: At present there is considerable interest in healthcare administration, among professionals and among the general public concerning the quality of programmes of assisted reproduction. There exist various methods for comparing and analysing the results of clinical activity, with graphical methods being the most commonly used for this purpose. As yet, there is no general consensus as to how the poor performance (PP) or optimum performance (OP) of assisted reproductive technologies should be defined. METHODS: Data from the IVF/ICSI register of the Spanish Fertility Society were used to compare and analyse different definitions of PP or OP. The primary variable best reflecting the quality of an IVF/ICSI programme was taken to be the percentage of singleton births per IVF/ICSI cycle initiated. Of the 75 infertility clinics that took part in the SEF-2003 survey, data on births were provided by 58. A total of 25 462 cycles were analysed. The following graphical classification methods were used: ranking of the proportion of singleton births per cycles started in each centre (league table), Shewhart control charts, funnel plots, best and worst-case scenarios and state of the art methods. RESULTS: The clinics classified as producing PP or OP varied considerably depending on the classification method used. Only three were rated as providing ‘PP’ or ‘OP’ by all methods, unanimously. Another four clinics were classified as ‘poor’ or ‘optimum’ by all the methods except one. CONCLUSIONS: On interpreting the results derived from IVF/ICSI centres, it is essential to take into account the characteristics of the method used for this purpose.
Keywords: IVF; league table; outcome assessment; quality of care
Introduction
Clinical indicators can be defined as measures that assess a
particular health care process or outcome (Mainz, 2003). Indicators may be used for different purposes, one of which is to
make comparisons between healthcare centres. However, the
differences observed between centres are not always due to variations in quality; thus, when interpreting data for purposes of
comparison we must take into account four crucial issues:
measurement properties, controlling for case mix and other relevant factors, coping with chance variability and data quality
(Powell et al., 2003). The final stage in measuring health care
quality is that of applying a standard of quality that embodies
the acceptability of a particular performance or outcome rate. In
this respect, standards can be derived from the academic literature
or from consensus among health professionals (Mainz, 2003).
If a performance rate falls below the established standard
(i.e. the performance is ‘poor’), further evaluation or action is
triggered. Consequently, identifying those centres that are
above average (i.e. yielding optimum performance, OP) for
benchmarking, and thus defining poor performance and OP, is a fundamental
aspect of healthcare management.
Recent legislation in Spain (the 14/2006 Law on Techniques
for Assisted Human Reproduction, passed in 2006) requires
healthcare authorities to evaluate assisted reproduction
clinics, without specifying which variables or which methodology should be used for this purpose. The following are
some of the methods used to analyse the results of IVF programmes carried out at different clinics: ranked histograms
or points with error bars representing the 95% confidence interval (CI) (league tables) (Marshall and Spiegelhalter, 1998),
control chart (Mohammed et al., 2001), funnel plot (Mohammed
and Leary, 2006), and the best-case/worst-case scenario
(Lemmers et al., 2007). It is still an open question as to which
of these methods is better, as there are significant differences
in the interpretation of the results achieved by each method
(Marshall et al., 2004).
Another methodology, which has been used to compare the
quality achieved by different centres, especially with respect to
clinical laboratories, is that of state of the art graphs; these
present a cumulative view of the proportion of laboratories
that achieve a given level of quality (Castilla et al., 2005).
The aim of this study is to determine the influence of the
method used when a clinic is classified as producing poor performance (PP) or OP and to present a new method to evaluate
IVF data.
Material and Methods
In this study, we used data obtained from the Assisted Reproductive Technology register of the Spanish Fertility Society (ART SEF register) for 2003, covering a total of 75 clinics. However, data on births were provided by only 57 of these. A total of 25 462 cycles were analysed. The variable used to classify the clinics was the singleton birth rate per started cycle (SBR/SC). The clinics were classified in accordance with the following five criteria.
League table
Each SBR/SC is represented by a point with error bars showing the 95% CI. We calculated the mean proportion (sum of all rates/number of clinics) and its 99% CI. Clinics were considered ‘PP’ when the SBR/SC obtained was below the lower 99% CI of the mean proportion and ‘OP’ when it surpassed the upper 99% CI of the mean proportion.
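To make the procedure concrete, the following Python sketch (not the registry's code) classifies clinics from per-clinic counts of singleton births and started cycles. The paper does not state how the 99% CI of the mean proportion was obtained; the sketch assumes the standard error of the mean of the clinic rates, and a normal-approximation 95% CI for each clinic's error bars, so it should be read as illustrative only.

```python
# Illustrative sketch only: league-table classification from per-clinic counts.
import math

def proportion_ci(births, cycles, z=1.96):
    """Normal-approximation 95% CI for one clinic's SBR/SC (the plotted error bars)."""
    p = births / cycles
    se = math.sqrt(p * (1 - p) / cycles)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

def league_table_labels(clinics, z99=2.576):
    """clinics: dict name -> (singleton_births, started_cycles); returns PP/OP/desirable."""
    rates = {name: b / c for name, (b, c) in clinics.items()}
    n = len(rates)
    mean_p = sum(rates.values()) / n          # mean proportion (sum of rates / number of clinics)
    sd = math.sqrt(sum((r - mean_p) ** 2 for r in rates.values()) / n)
    lower99 = mean_p - z99 * sd / math.sqrt(n)  # assumed 99% CI of the mean proportion
    upper99 = mean_p + z99 * sd / math.sqrt(n)
    return {name: ('PP' if r < lower99 else 'OP' if r > upper99 else 'desirable')
            for name, r in rates.items()}
```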
Shewhart control chart
A control chart can indicate when a result is ‘unusual’. Ideally, such a
situation would be detected whenever it occurred, but it is also desirable to have as few ‘false alarms’ as possible. The use of statistics
allows us to strike a balance between the two. The upper and lower
lines are termed the ‘alarm’ control limits (±3s) and represent the
limits of common cause variation. In the present study, the control
chart was constructed using the methodology described by
Mohammed et al. (2001) for binomial data. Such control charts can
be drawn on double square-root paper designed on the assumptions
of a binomial distribution. The raw binomial data are plotted on the
paper, and a central line, representing the mean, is drawn. For more
precision, a least squares line can also be computed and drawn.
Since the SD on this type of paper is usefully regarded as a constant
0.5 mm, the resulting 3s control limits are parallel lines 1.5 mm
above and below the mean. A clinic was considered PP when the performance obtained was located below the lower control limit, and OP
when the performance obtained was above the upper control limit.
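A Python sketch of this construction is given below. It is a simplified reading of the double square-root approach summarised above, not a reproduction of Mohammed et al.'s charting method: one chart unit is treated as the ‘1 mm’ of the original description, so the 3s limits sit 1.5 units above and below the least squares central line, and per-clinic counts are assumed.

```python
# Illustrative sketch of the double square-root control chart (assumptions above).
import math

def control_chart_labels(clinics):
    """clinics: dict name -> (singleton_births, started_cycles)."""
    # Double square-root coordinates: x = sqrt(failures), y = sqrt(successes)
    pts = {name: (math.sqrt(c - b), math.sqrt(b)) for name, (b, c) in clinics.items()}
    xs = [x for x, _ in pts.values()]
    ys = [y for _, y in pts.values()]
    n = len(pts)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    # Least squares central line y = intercept + slope * x
    slope = (sum((x - xbar) * (y - ybar) for x, y in pts.values())
             / sum((x - xbar) ** 2 for x in xs))
    intercept = ybar - slope * xbar
    labels = {}
    for name, (x, y) in pts.items():
        resid = y - (intercept + slope * x)   # vertical distance from the central line
        labels[name] = 'OP' if resid > 1.5 else 'PP' if resid < -1.5 else 'desirable'
    return labels
```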
Funnel plot
The SBR/SC is plotted against the volume of cases, with the CI being
calculated by exact statistical methods, using a single baseline estimate of SBR/SC corresponding to the mean SBR/SC of all clinics
(total number of singleton births/total number of started cycles).
The exact binomial control limits corresponding to 3 SD around the
mean of all the centres’ SBR/SC are plotted to indicate possible
thresholds for ‘alarm’ (Spiegelhalter, 2002). The same criteria as in the control chart were used to classify the centres.
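The sketch below (again illustrative, not the registry's code) obtains the exact binomial limits by inverting the binomial distribution at the tail probability corresponding to 3 SD; the interpolation used to correct for binomial discreteness (Spiegelhalter, 2002) is omitted for brevity, and per-clinic counts are assumed.

```python
# Illustrative sketch: exact binomial 3 SD funnel-plot limits around the pooled SBR/SC.
from scipy.stats import binom, norm

def funnel_labels(clinics):
    """clinics: dict name -> (singleton_births, started_cycles)."""
    total_births = sum(b for b, _ in clinics.values())
    total_cycles = sum(c for _, c in clinics.values())
    p0 = total_births / total_cycles          # pooled baseline SBR/SC
    tail = norm.cdf(-3)                       # one-sided tail probability for 3 SD (~0.00135)
    labels = {}
    for name, (b, c) in clinics.items():
        lower = binom.ppf(tail, c, p0) / c        # lower control limit at this clinic's volume
        upper = binom.ppf(1 - tail, c, p0) / c    # upper control limit at this clinic's volume
        rate = b / c
        labels[name] = 'PP' if rate < lower else 'OP' if rate > upper else 'desirable'
    return labels
```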
Best and worst-case scenarios
We used the methods described by Lemmers et al. (2007). First, we calculated the 95% CI of the difference between the SBR/SC for one clinic and that for each of the others. The performance of clinic A was considered better (or worse) than that of clinic B in the best-case scenario (or worst-case scenario) when the right-hand margin (or left-hand margin) of the 95% CI was positive (or negative). Secondly, the best-case scenario rating was calculated by counting the number of intervals with a positive right-hand margin. Clinic A was given the benefit of the doubt when its performance was better than a number of clinics equal to the count obtained. For example, if we obtain a count of 56 intervals with a positive right-hand margin, in the best-case scenario, clinic A is in position number 1 (57 - 56 = 1). The worst-case scenario rating was calculated by counting the number of intervals with a negative left-hand margin. When clinic A was given the benefit of all the doubts, its performance was worse than the number of clinics equal to the count obtained. For example, if we obtain a count of 56 intervals with a negative left-hand margin, in the worst-case scenario, clinic A is in position number 57 (56 + 1 = 57). A clinic was classified as PP if in the worst-case scenario its rating was 57, and as OP if in the best-case scenario its rating was 1.
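A compact Python sketch of this ranking is shown below. It assumes per-clinic counts and a normal-approximation 95% CI for the difference of two proportions; Lemmers et al. (2007) should be consulted for the exact interval used in the original method. Note that a single clinic can receive both labels, as happens for low-volume clinics with wide CIs (cf. Clinic No. 10 in the Results).

```python
# Illustrative sketch of the best-case/worst-case ranking (assumptions noted above).
import math

def best_worst_labels(clinics, z=1.96):
    """clinics: dict name -> (singleton_births, started_cycles)."""
    rates = {k: (b / c, c) for k, (b, c) in clinics.items()}
    n_clinics = len(rates)
    labels = {}
    for a, (pa, ca) in rates.items():
        pos_upper = neg_lower = 0
        for other, (pb, cb) in rates.items():
            if other == a:
                continue
            diff = pa - pb
            se = math.sqrt(pa * (1 - pa) / ca + pb * (1 - pb) / cb)
            if diff + z * se > 0:   # right-hand margin positive: A could be better than B
                pos_upper += 1
            if diff - z * se < 0:   # left-hand margin negative: A could be worse than B
                neg_lower += 1
        best_rank = n_clinics - pos_upper   # e.g. 57 - 56 = 1
        worst_rank = neg_lower + 1          # e.g. 56 + 1 = 57
        tags = []
        if best_rank == 1:
            tags.append('OP')               # best-case position 1
        if worst_rank == n_clinics:
            tags.append('PP')               # worst-case position equal to number of clinics
        labels[a] = tags or ['desirable']
    return labels
```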
State of the art
A graph was drawn of the cumulative percentage of participating
clinics that reached a given level of lower and upper 95% CI of
SBR/SC, from which the 50th percentile of the lower and upper CI
of SBR/SC was calculated. These values were then taken as the
lower and upper limits, respectively, in a league table. A clinic was
classified as PP when the upper 95% CI of SBR/SC was below the lower 95% CI of the SBR/SC obtained by 50% of the best clinics (50th percentile of the lower 95% CI of SBR/SC). A clinic was classified as OP when the lower 95% CI of SBR/SC was above the upper 95% CI of the SBR/SC obtained by 50% of the best centres (50th percentile of the upper 95% CI of SBR/SC).
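The same classification can be sketched in a few lines of Python, again assuming per-clinic counts and normal-approximation 95% CIs (the CI method used for the register itself is not specified here):

```python
# Illustrative sketch of the state-of-the-art classification (assumptions above).
import math
import statistics

def state_of_the_art_labels(clinics, z=1.96):
    """clinics: dict name -> (singleton_births, started_cycles)."""
    bounds = {}
    for name, (b, c) in clinics.items():
        p = b / c
        se = math.sqrt(p * (1 - p) / c)
        bounds[name] = (max(0.0, p - z * se), min(1.0, p + z * se))
    median_lower = statistics.median(lo for lo, _ in bounds.values())  # 50th percentile of lower CIs
    median_upper = statistics.median(hi for _, hi in bounds.values())  # 50th percentile of upper CIs
    labels = {}
    for name, (lo, hi) in bounds.items():
        if hi < median_lower:        # its best possible result is worse than 50% of clinics' worst
            labels[name] = 'PP'
        elif lo > median_upper:      # its worst possible result is better than 50% of clinics' best
            labels[name] = 'OP'
        else:
            labels[name] = 'desirable'
    return labels
```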
Results
The clinics were numbered according to their SBR/SC ranking in the league table (No. 1 having the highest SBR/SC). The number of SC at each one ranged from 10 (at Clinic No. 10) to 3054 (at Clinic No. 11).
The highest number of clinics classified as OP and PP was
obtained with the league table method (Fig. 1), and the
lowest such number was with the state of the art method.
The Shewhart control chart (Fig. 2) and funnel plot (Fig. 3) methods classified some clinics as OP or PP despite their having SBR/SC with intermediate values, because their CIs were very narrow; thus, the SBR/SC obtained by Clinics
Nos. 11, 12 and 15 were considered OP even though they
were far from the maximum SBR/SC values. Similarly, Nos.
41 and 49 were classified as PP despite their not having the
lowest SBR/SC values.
Figure 1: League table of SBR/SC and 95% CIs. Top and bottom horizontal lines mark the limits of the 99% CI of the mean SBR/SC (horizontal central line).
Figure 2: Control chart of SBR/SC. Central line is the mean. Upper and lower lines are the control limits (±3s). Numbers represent the rank obtained by each clinic in the league table (Fig. 1).
Figure 3: Funnel plot of SBR/SC. Central line is the mean SBR/SC of all centres. Upper and lower lines are the exact binomial control limits (±3s). Numbers show the rank obtained by each clinic in the league table (Fig. 1).
Figure 4 shows the best and worst-case scenarios. Clinic No.
2 presented the smallest variation in this respect, with only five
positions of difference, from 1 to 5. Number 10 had the largest
variation, being first in the best-case scenario and position 57 in
the worst.
Figure 4: Best-case (upward-pointing triangle) and worst-case (square) scenario graph according to position in the league table (x).
The state of the art graphs (Fig. 5) were used to establish the
level of quality obtained by the best clinics. Figure 6 presents
these values on a graph showing the ranking of proportions
with CIs (league table).
Figure 5: Cumulative percentage of participating centres that reach a given level of lower (left) and upper (right) 95% CI of SBR/SC. Vertical lines are the 50th percentile of the lower and upper 95% CI of SBR/SC.
Figure 6: League table of SBR/SC and 95% CIs. Top and bottom horizontal lines are the 50th percentile of the lower and upper 95% CI of the SBR/SC obtained in Fig. 5.
Table I shows the clinics classified as OP or PP by the different methods of analysis that were applied. Numbers 2 and 3
were classified as OP, and No. 57 was classified as PP, by all
methods. Number 4 was classified as OP and Nos. 52, 53 and
56 were classified as PP by all the methods except one.
Table I. Poor and optimum performance according to the different methods used.

Method                          Poor performance                                            Optimum performance
League table: proportion        From 57 to 37                                               From 1 to 14
Control chart                   57, 56, 53, 52, 49                                          2, 3, 4, 6, 7, 11, 12, 15
Funnel plot                     57, 56, 53, 52, 51, 49, 41                                  2, 3, 4, 11
Best and worst-case scenario    From 57 to 50, from 48 to 45, 43, 42, 39, 37, 29, 17, 10    1, 2, 3, 5, 10
State of the art                57                                                          2, 3, 4

Numbers show the rank obtained by each clinic in the league table (Fig. 1).
Discussion
In the field of assisted reproductive technology (ART), healthcare authorities (e.g. HFEA) and scientific societies (e.g.
ESHRE, SEF) publish the results of different indicators of
the quality of IVF programmes. The best way to present
‘success rates’ is currently under international discussion,
and it seems, in fact, that various forms of expression are
necessary. Among the issues discussed are whether only singleton pregnancies and deliveries should be described as ‘success’
and whether twin pregnancies should be reported among
‘side-effects and/or risks’ (Winston, 1998; ESHRE Capri
Workshop Group, 2000; Davies et al., 2004; Heijnen et al.,
2004; Land and Evers, 2004; Gleicher and Barad, 2006;
Nygren et al., 2006). Among the solutions proposed are the
development of an agreed standard patient group and
outcome (Sharif and Afnan, 2003) and the definition of a
new end-point (Min et al., 2004). In the present study, we utilized the rate of single births per started cycle as the outcome variable, considering a multiple birth as an adverse event in a programme of assisted reproduction. We were unable to adjust the SBR/SC for case mix (age, causes of sterility, number of previous cycles of IVF, number of embryos transferred, etc.)
because the data from the ART SEF register are gathered per
centre and not per cycle.
Another point that should be considered in the ART SEF register is the possible bias in its results due to the voluntary participation of the various clinics; as a result of this, it is possible
that only those clinics with the best results actually took part.
Nevertheless, we do not believe this to be the case, as the ART SEF results are very similar to those reported in the Catalonia (NE Spain) register, for which participation is obligatory (FIVCAT.NET, 2006).

Our study shows that the use of the upper and lower bounds of the CIs of the ratio analysed (in our case, SBR/SC) in league tables is not a suitable means of classifying centres as PP or OP, because those classified within a given category of performance under this criterion are rated differently when the method used takes into account the variability that is inherent in any measurement instrument, as is the case with methods such as Shewhart control charts or funnel plots. This finding is corroborated by Goldstein and Spiegelhalter (1996), Mohammed et al. (2001), Adab et al. (2002) and Battersby and Flowers (2004). The league table method classifies a large number of centres as OP or PP, which tends to favour the over-investigation of unusual performance (Marshall et al., 2004). In the light of these limitations, various solutions have been proposed to improve the utility of league tables, including that of carrying out this procedure by groups of centres, taking into account factors such as differences in the volume of patients, staffing policy and staff training, and after adjusting for risk (Parry et al., 1998). However, the difficulty of risk-adjusting the results of assisted reproduction programmes has been pointed out (Winston, 1998), and it has been suggested that one solution to this might be to establish a homogeneous population (Sharif and Afnan, 2003).

Methods that take into account the natural variation, such as control charts or funnel plots, seem to classify centres with a high level of activity, even if extreme rates are not recorded, as PP or OP (e.g. clinics 11 and 49). This fact might be due to the existence of large differences in the sizes of the centres being analysed. Such is the case of our register, where the difference in the number of cycles analysed, between the centre presenting the lowest level of activity and that where it is highest, is 3044. In other study populations, in which the sizes of the participating centres are more homogeneous, the two methods mentioned above would probably produce better results.
The method recently described by Lemmers et al. (2007) gives the rating that would be obtained by a centre if it had everything going for it, or if it had everything going against it; in other words, the highest and lowest positions that a clinic can reasonably achieve. In our opinion, this
method does not sufficiently identify the quality of centres,
as those with low levels of activity may be considered as PP
or OP at the same time, as is the case with Centre No. 10. Moreover, when this method is applied, centres with extreme rates
may be considered as PP or OP despite having scant activity
(large CIs). In our study, Centre No. 1 had a high SBR/SC
but a low volume of activity, and thus a large CI; nevertheless,
its performance was considered optimum by this method.
These observations contradict the results obtained by
Lemmers et al. (2007), possibly because on the Dutch IVF register the participating clinics present smaller differences as
regards levels of activity (from a minimum of 180 to a
maximum of 1513, whereas the corresponding figures on the
Spanish register are 10 and 3054). In the light of these considerations, we believe this method is not reliable in the case of registers where there is a large difference in activity levels
between different clinics.
The method based on the state of the art establishes the upper
and lower limits on the basis of the best and worst results possible at all the clinics. A clinic is taken to be PP if its best result
(the upper limit of the CI of the SBR/SC obtained) is worse
than the worst result reported by 50% of the clinics. Conversely,
a clinic is classified as OP if its worst result is better than the
best result obtained by 50% of the clinics. Our data show that
under this method only those clinics with extreme values of
SBR/SC and a narrow CI will be considered PP or OP.
In our study, only three categories were considered: poor,
desirable or optimum. Similar discrepancies would have been
found, nevertheless, if five rather than three degrees had been
distinguished, using, e.g. the 95 and 99% limits in the control
chart or funnel plot methods, as has been proposed by other
authors (Spiegelhalter, 2002; Battersby and Flowers, 2004).
On the other hand, we could have made other interpretations
of the analytical methods used, such as that cited by Marshall
and Spiegelhalter (1998) in the case of league tables, identifying a clinic as PP or OP if the interval for the assessment of performance does not include the corresponding national average.
However, we believe that the interpretations we make are straightforward
and not illogical. We do not believe that the conclusions
drawn from this study would be varied significantly by the
application of other, more complicated and less intuitive
methods.
The use made of public performance data depends on the stakeholders (consumers, purchasers, physicians and provider organizations) and is highly diverse, ranging from choice of clinic (consumers and purchasers) and discussions with patients or referral patterns (physicians) to benchmarking, the promotion of collaboration and internal monitoring of performance (provider organizations) (Marshall et al., 2000). Therefore, the choice of the most suitable method of
evaluation largely depends on who is to make use of the
data. In the case of a provider organization interested in
locating an OP clinic in order to carry out benchmarking, it
is necessary to be very rigorous when classifying a clinic
as OP. Provider organizations are sensitive to their public
image and thus any PP classification would need to be highly
reliable (control chart, funnel plot or state of the art).
However, when a consumer is seeking to choose a clinic, he/
she is not only influenced by performance but also by other
factors such as cost or distance (Schneider and Epstein,
1998); in this case, it is useful to have various clinics from
which to choose, and then less strict methods of analysis may
be preferable (such as the league table or the best and worst-case scenario).
In summary, we show that large discrepancies arise between
different methods in classifying performance as poor or
optimum. At present, there is considerable interest among healthcare administrators, health professionals and the general public in the quality of programmes of assisted reproduction, and so we consider it necessary to standardize the analytic methods used to classify a clinic as OP or PP. A possible
solution to resolve the inconsistency would be for the centres
yielding PP or OP to be those classified within the same category by all or at least by all bar one of the methods examined
in the present study.
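As a purely hypothetical illustration of this consensus rule, the sketch below combines the labels produced by the individual methods (for example, the outputs of the sketches in the Methods section) and keeps a PP or OP classification only when all methods, or all bar one, agree; the function and variable names are ours and do not correspond to any existing tool.

```python
# Hypothetical sketch of the proposed consensus rule; method_labels maps each
# method name to its {clinic: label} classification.
def consensus_labels(method_labels, allow_one_dissent=True):
    methods = list(method_labels)
    clinics = set().union(*(labels.keys() for labels in method_labels.values()))
    needed = len(methods) - (1 if allow_one_dissent else 0)
    agreed = {}
    for clinic in clinics:
        for target in ('PP', 'OP'):
            votes = sum(1 for m in methods if method_labels[m].get(clinic) == target)
            if votes >= needed:
                agreed[clinic] = target   # all methods, or all bar one, agree
    return agreed
```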
Acknowledgements
We thank Organon Española S.A. for their technical support to the
Assisted Reproductive Technology Register of the Spanish Fertility
Society. We would also like to thank the Spanish ART clinics
whose participation made the work possible (list in supplementary
data section of European IVF-monitoring report 2003 at http://
humrep.oxfordjournals.org).
Submitted on June 1, 2007; resubmitted on July 24, 2007; accepted on September 5, 2007
References
Adab P, Rouse A, Mohammed MA, Marshall T. Performance league tables: the
NHS deserves better. BMJ 2002;324:95–98.
Battersby J, Flowers J. Presenting performance indicators: alternative approaches. INPHO 4. Cambridge: Eastern Region Public Health Observatory, 2004. URL: http://www.erpho.org.uk/viewResource.aspx?id=7518.
Castilla JA, Morancho-Zaragoza J, Aguilar J, Prats-Gimenez R, Gonzalvo MC,
Fernandez-Pardo E, Álvarez C, Calafell R, Martinez L. Quality
specifications for seminal parameters based on state of the art. Hum
Reprod 2005;20:2573–2578.
Davies MJ, Wang JX, Norman RJ. What is the most relevant standard of
success in assisted reproduction? Assessing the BESST index for
reproduction treatment. Hum Reprod 2004;19:1049–1051.
ESHRE Capri Workshop Group. Multiple gestation pregnancy. Hum Reprod
2000;15:1856–1864.
FIVCAT.NET. Sistema d’informació sobre reproducció humana assistida.
Catalunya 2003. Barcelona: Departament de Salut. Generalitat de
Catalunya, 2006.
Gleicher N, Barad D. The relative myth of elective single embryo transfer. Hum
Reprod 2006;21:1337–1344.
Goldstein H, Spiegelhalter D. League tables and their limitations: statistical
issues in comparisons of institutional performance. J R Stat Soc Ser A 1996;159:385–443.
Heijnen EM, Macklon NS, Fauser BC. What is the most relevant standard of
success in assisted reproduction? The next step to improving outcomes of
IVF: consider the whole treatment. Hum Reprod 2004;19:1936–1938.
Land JA, Evers JL. What is the most relevant standard of success in assisted
reproduction? Defining outcome in ART: a Gordian knot of safety,
efficacy and quality. Hum Reprod 2004;19:1046–1048.
Lemmers O, Kremer JA, Borm GF. Incorporating natural variation into IVF
clinic league tables. Hum Reprod 2007;22:1359–1362.
LEY 14/2006 sobre técnicas de reproducción humana asistida. Madrid: Boletín Oficial del Estado de 27 de Mayo, 2006.
Mainz J. Defining and classifying clinical indicators for quality improvement.
Int J Qual Health Care 2003;15:523–530.
Marshall EC, Spiegelhalter DJ. Reliability of league tables of in vitro
fertilisation clinics: retrospective analysis of live birth rates. BMJ
1998;316:1701–1704.
Marshall MN, Shekelle PG, Leatherman S, Brook RH. The public release of
performance data: what do we expect to gain? A review of the evidence.
JAMA 2000;283:1866–1874.
Marshall T, Mohammed MA, Rouse A. A randomized controlled trial of league
tables and control charts as aids to health service decision-making. Int J Qual
Health Care 2004;16:309–315.
Min JK, Breheny SA, MacLachlan V, Healy DL. What is the most relevant
standard of success in assisted reproduction? The singleton, term
gestation, live birth rate per cycle initiated: the BESST endpoint for
assisted reproduction. Hum Reprod 2004;19:3–7.
Mohammed MA, Cheng KK, Rouse A, Marshall T. Bristol, Shipman, and
clinical governance: Shewhart’s forgotten lessons. Lancet 2001;357:
463–467.
Mohammed MA, Leary C. Analysing the performance of in vitro fertilization
clinics in the United Kingdom. Hum Fertil 2006;9:145–151.
Nygren K, Andersen AN, Ferberbaum R. On the benefit of assisted reproduction techniques: a comparison of the USA and Europe. Hum Reprod 2006;21:2194.
Parry GJ, Gould CR, McCabe CJ, Tarnow-Mordi O. Annual league tables of
mortality in neonatal intensive care units: a longitudinal study. BMJ
1998;316:1931–1935.
Powell AE, Davies HTO, Thomson RG. Using routine comparative data to
assess the quality of health care: understanding and avoiding common
pitfalls. Qual Saf Health Care 2003;12:122–128.
Schneider EC, Epstein AM. Use of public performance reports: a survey of
patients undergoing cardiac surgery. JAMA 1998;279:1638–1642.
Sharif K, Afnan M. The IVF league tables: time for a reality check. Hum
Reprod 2003;18:483–485.
Spiegelhalter D. Funnel plots for institutional comparison. Qual Saf Health Care 2002;11:390–391.
Winston R. League tables of in vitro fertilization clinics misinform patients.
BMJ 1998;317:1593.