Statistical Tools for Measuring Agreement

Lawrence Lin
Baxter International Inc., WG3-2S
Rt. 120 and Wilson Rd.
Round Lake, IL 60073, USA
lawrence lin@baxter.com

A.S. Hedayat
Department of Mathematics, Statistics and Computer Science
University of Illinois, Chicago
851 S. Morgan St.
Chicago, IL 60607-7045, USA
hedayat@uic.edu

Wenting Wu
Mayo Clinic
200 First Street SW.
Rochester, MN 55905, USA
wu.wenting@mayo.edu

Preface
presented, and all other related tools will be well referenced. Many practical
examples will be presented throughout the book in a wide variety of situations for
continuous and categorical data.
A book such as this could not have been written without substantial assistance from
others. We are indebted to the many contributors who have developed the theory and
practice discussed in this book. We also would like to acknowledge our appreciation
of the students at the University of Illinois at Chicago (UIC) who helped us in many
ways. Specifically, six PhD dissertations on agreement subjects have been produced
by Robieson (1999), Zhong (2001), Yang (2002), Wu (2005), Lou (2006) and Tang
(2010). Their contributions have been the major sources for this book. Most of the
typing using MiKTeX was performed by the UIC PhD student Mr. Yue Yu, who also
double-checked the accuracy of all the formulas.
We would like to mention that we have found the research into theory and
application performed by Professors Tanya King, of the Pennsylvania State Hershey
College of Medicine; Vernon Chinchilli, of the Pennsylvania State University
College of Medicine; and Huiman Barnhart, of the Duke Clinical Research Institute,
truly inspirational. Their work has influenced our direction for developing the
materials of our book. We are also indebted to Professor Phillip Schluter, of the
School of Public Health and Psychosocial Studies at AUT University, New Zealand,
for his permission to use the data presented in Examples 5.9.3 and 6.7.2 prior to
their publication.
Finally, all SAS and R macros and most data in the examples are provided at the
web sites shown below:
1. http://www.uic.edu/hedayat/
2. http://mayoresearch.mayo.edu/biostat/sasmacros.cfm
The U.S. National Science Foundation supported this project under Grants
DMS-06-03761 and DMS-09-04125.
Contents

1 Introduction
  1.1 Precision, Accuracy, and Agreement
  1.2 Traditional Approaches for Continuous Data
  1.3 Traditional Approaches for Categorical Data
2 Continuous Data
  2.1 Basic Model
  2.2 Absolute Indices
    2.2.1 Mean Squared Deviation
    2.2.2 Total Deviation Index
    2.2.3 Coverage Probability
  2.3 Relative Indices
    2.3.1 Intraclass Correlation Coefficient
    2.3.2 Concordance Correlation Coefficient
  2.4 Sample Counterparts
  2.5 Proportional Error Case
  2.6 Summary of Simulation Results
  2.7 Asymptotic Power and Sample Size
  2.8 Examples
    2.8.1 Example 1: Methods Comparison
    2.8.2 Example 2: Assay Validation
    2.8.3 Example 3: Assay Validation
    2.8.4 Example 4: Lab Performance Process Control
    2.8.5 Example 5: Clinical Chemistry and Hematology Measurements That Conform to CLIA Criteria
  2.9 Proofs of Asymptotical Normality When Target Values Are Random
    2.9.1 CCC and Precision Estimates
    2.9.2 MSD Estimate
    2.9.3 CP Estimate
    2.9.4 Accuracy Estimate
Fig. 1.1 Assessing agreement of observed values (new) and target values (gold standard method)
Traditionally, the agreement between observed and target values has been assessed
by a paired t-test, the results of which could be misleading. A slightly better
approach in the assessment of agreement is to test the linear least squares estimates
against the identity line (a straight line with zero intercept and unit slope). We will
denote this method by LS01. Furthermore, the target values have been assumed
fixed in the LS01 (regression) analysis, even when the target values are obviously
random. These two rudimentary approaches capture only the accuracy information
relative to the precision. The potential for obtaining misleading results using these
two approaches can be illustrated graphically by Fig. 1.2.
Another popular traditional method for assessing agreement of observed and
target values is the Pearson correlation coefficient. However, this coefficient is only
a measure of precision. The potential for obtaining misleading results using this
approach can be illustrated by Fig. 1.3. In addition, this coefficient has frequently been
misused to assess linearity, which should have been assessed through goodness-of-fit
statistics. Other traditionally used approaches for assessing agreement of
observed and target values have included the coefficient of variation and the mean
square error of the regression analysis, which are measures of precision only.
Perhaps the most valid traditional approach for assessing agreement of observed
and target values has been the intraclass correlation coefficient (ICC). The ICC in
its original form (Fisher 1925) is the ratio of between-sample variance to total
(within + between) variance under the model of equal marginal distributions. This
original ICC was intended to measure precision only. This coefficient is invariant
with respect to the interchanges of Y and X values within any pairs. Several forms
of ICC have evolved. In particular, Bartko (1966), Shrout and Fleiss (1979), Fleiss
(1986), and Brennan (2001) have put forth various reliability assessments. We will
discuss ICC in greater detail in Section 2.3.1 and Chapter 3. In Chapter 5, we will
introduce some special forms of ICC under the general case that correspond to
agreement, precision, and accuracy for both continuous and categorical data, based
on a paper by Lin, Hedayat, and Wu (2007).

Fig. 1.2 Situations in which a paired t-test or least squares test against the identity line (LS01) can be misleading: (a) rejected by paired t-test/LS01; (b) accepted by paired t-test but rejected by LS01; (c) accepted by paired t-test/LS01

Fig. 1.3 Situations in which the Pearson correlation coefficient can be misleading
It should be pointed out that the traditional hypothesis-testing approach is
not appropriate for assessing agreement except for the cases that will be presented
in Chapter 6. In traditional hypothesis testing, the rejection region (alternative
hypothesis) is the region for declaring a difference based on strong evidence
presented in the data. Failing to reject the null hypothesis does not imply accepting
agreement, but implies a lack of evidence for declaring a difference. The proper
setting for assessing agreement is to reverse the null and alternative hypotheses, so
that the conventional rejection region actually is the region for declaring agreement.
Therefore, we would reject the null hypothesis of a difference and accept the
alternative hypothesis of agreement based on strong evidence presented in the data
(Dunnett and Gent 1977; Bross 1985; Rodary, Com-Nougue, and Tournade 1989;
Lin 1992). With the given criterion and the same precision of the data, the larger
the sample size, the easier it should be to accept the agreement. Here, a meaningful
criterion for an acceptable difference should be prespecified, and the hypothesis
testing should be one-sided. Indeed, the proper hypothesis-testing approach is
equivalent to computing the one-sided confidence limit. If this limit were better
than the prespecified criterion, we would accept the agreement. We will use the
confidence limit approach in this book for simplicity. However, for the sample size
and power calculation, we will use the proper hypothesis-testing approach.
Compared to approaches for continuous data, there have been fewer misleading
approaches for categorical data. The most popular approach for assessing agreement
began with kappa (Cohen 1960) and weighted kappa (Cohen 1968; Fleiss, Cohen,
and Everitt 1969). The kappa coefficients assess nonchance (chance-corrected)
agreement relative to the total nonchance agreement. There is a long history of
valid tools available for assessing marginal equivalence, association, and agreement.
These will be referenced in Chapter 3.
Chapter 2
Continuous Data
We will now introduce new approaches that have evolved for measuring agreement
since 1988. Some of these new approaches were summarized, studied, and compared
by Lin, Hedayat, Sinha, and Yang (2002). Here, we include the necessary
proofs that were left out of that article. In addition, we include assorted examples to
demonstrate agreement techniques. We begin with the most basic model, in which
paired observations (Y and X ) are collected.
2.1 Basic Model

When target values are random, the joint distribution of $Y$ and $X$ is assumed to be a bivariate distribution with finite second moments: means $\mu_y$ and $\mu_x$, variances $\sigma_y^2$ and $\sigma_x^2$, and covariance $\sigma_{yx}$. When target values are fixed, $Y_i \mid X_i$, $i = 1, \ldots, n$, are assumed to be independent normal variables with mean $\beta_0 + \beta_1 X_i$ and variance $\sigma_e^2$.

2.2 Absolute Indices

2.2.1 Mean Squared Deviation

Mean squared deviation (MSD) evaluates an aggregated deviation from the identity line, $\mathrm{MSD} = E(Y - X)^2$. It can be expressed as

\varepsilon^2 = (\mu_y - \mu_x)^2 + \sigma_y^2 + \sigma_x^2 - 2\sigma_{yx},   (2.1)

when target values are random, or as

\varepsilon^2_{|X} = (\mu_y - \bar{X})^2 + \sigma_y^2 + s_x^2 - 2\beta_1 s_x^2,   (2.2)

when target values are fixed, where $\bar{X}$ and $s_x^2$ are the sample mean and variance of $X$.
Estimated by sample counterparts ($e^2$ or $e^2_{|X}$) with a log transformation, $W = \ln(e^2)$ or $W_{|X} = \ln(e^2_{|X})$ has an asymptotic normal distribution with mean $\ln(\varepsilon^2)$ or $\ln(\varepsilon^2_{|X})$,

where $\chi^2(\cdot)$ is the cumulative noncentral chi-square distribution up to $\delta_0^2/\sigma_d^2$, with one degree of freedom and noncentrality parameter $\mu_d^2/\sigma_d^2$. This measure will be presented shortly.

2.2.2 Total Deviation Index
The total deviation index TDI$_{\pi_0}$ is the boundary $\delta_{\pi_0}$ that captures a proportion $\pi_0$ of paired differences $D = Y - X$, so that $\delta^2_{\pi_0} = \sigma_d^2\,(\chi^2)^{-1}(\pi_0; 1, \mu_d^2/\sigma_d^2)$, where $(\chi^2)^{-1}(\cdot)$ is the inverse function of $\chi^2(\cdot)$. Since the estimate of this index has intractable asymptotic properties, Lin (2000) and Lin, Hedayat, Sinha, and Yang (2002) have suggested the following TDI$_{\pi_0}$ approximation:
\delta^2_{\pi_0} \doteq (\chi^2)^{-1}(\pi_0, 1)\,\varepsilon^2,   (2.7)

or

\delta_{\pi_0} \doteq \Phi^{-1}\!\left(1 - \frac{1 - \pi_0}{2}\right)|\varepsilon|,   (2.8)

when X is random, or

\delta_{\pi_0|X} \doteq \Phi^{-1}\!\left(1 - \frac{1 - \pi_0}{2}\right)|\varepsilon_{|X}|,   (2.9)

when X is fixed.
The approximation is satisfactory (Lin 2000) when:
1. $\pi_0 = 0.75$ and $\mu_d^2/\sigma_d^2 \leq 1/2$,
2. $\pi_0 = 0.8$ and $\mu_d^2/\sigma_d^2 \leq 8$,
3. $\pi_0 = 0.85$ and $\mu_d^2/\sigma_d^2 \leq 2$,
4. $\pi_0 = 0.9$ and $\mu_d^2/\sigma_d^2 \leq 1$,
5. $\pi_0 = 0.95$ and $\mu_d^2/\sigma_d^2 \leq 1/2$.
The quantity $\mu_d^2/\sigma_d^2$ is called the relative bias squared (RBS). The interpretation of this approximated TDI is that approximately $100\pi_0\%$ of observations are within $\delta_{\pi_0}$ of the target values. TDI$^2_{\pi_0}$ is proportional to MSD, and therefore we may perform an inference based on the asymptotic normality of $W = \ln(e^2)$, where $e^2$ is the sample counterpart of MSD when X is random, or $W_{|X} = \ln(e^2_{|X})$ when X is fixed. This simplified method will become very useful when we deal with the more general case to be introduced in Chapter 5.
The idea of using such an approximation of TDI was motivated by Holder and Hsuan (1993). They proposed a moment-based criterion for assessing individual bioequivalence. They showed that, in a slightly different fashion, $\delta^2_{\pi_0}$, or the squared function of (2.6), has an upper bound $\delta^2_{\pi_0+} = c_{\pi_0}\varepsilon^2$, where $c_{\pi_0}$ is a constant not depending on $\mu_d$ and $\sigma_d$. Therefore, $\delta_{\pi_0+}$ conservatively captures at least $100\pi_0\%$ of observations within the boundary from target values of a reference compound. Holder and Hsuan (1993) used a numerical algorithm for the determination of $c_{\pi_0}$ under some parametric and nonparametric distributions of $D = Y - X$.

However, the asymptotic distribution property of this estimate has not been established. Lin (2000) made a comparison between this statistic and the TDI given in (2.7). When $\pi_0 = 0.9$, $\delta^2_{\pi_0}$ and $\delta^2_{\pi_0+}$ are identical under the normality assumption. Using TDI$_{0.9}$ is almost exact when $\mu_d^2/\sigma_d^2 < 1$, and would become conservative otherwise. Using TDI$_{0.8}$ is most robust, since it can tolerate an RBS value as high as 8.0.
A TDI is similar in concept to a tolerance limit. The difference is that a tolerance
limit captures individual deviations from their own mean, while a TDI captures
individual deviations from their target values, for a high proportion (say, 90%), and
with a high degree of confidence (say, 95%) when the upper confidence limit of TDI
is used.
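To make (2.8) concrete, the following R sketch (ours, not the authors' published SAS/R macros; the function name and interface are our own) computes the approximate TDI from paired data, along with the MSD estimate and the RBS needed to check conditions 1-5 above.

    ## Approximate TDI of (2.8); y = observed values, x = target values.
    tdi.approx <- function(y, x, pi0 = 0.9) {
      d   <- y - x
      n   <- length(d)
      e2  <- sum(d^2) / (n - 1)      # MSD estimate (divisor n - 1, Section 2.4)
      rbs <- mean(d)^2 / var(d)      # relative bias squared, for validity check
      tdi <- qnorm(1 - (1 - pi0) / 2) * sqrt(e2)
      c(TDI = tdi, MSD = e2, RBS = rbs)
    }

For pi0 = 0.8, the returned RBS can be as large as about 8 before the approximation degrades, per the conditions listed above.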
2.2.3 Coverage Probability

The coverage probability (CP) is the probability that the absolute difference between $Y$ and $X$ falls within a prespecified boundary $\delta_0$; here

\pi_{\delta_0|X} = \frac{1}{n}\sum_{i=1}^{n} \pi_{\delta_0}(i),   (2.11)

where

\pi_{\delta_0}(i) = \chi^2\!\left(\frac{\delta_0^2}{\sigma_e^2};\, 1,\, \frac{[\beta_0 + (\beta_1 - 1)X_i]^2}{\sigma_e^2}\right),   (2.12)
when target values are fixed. The estimate of coverage probability using sample counterparts ($p_{\delta_0}$ or $p_{\delta_0|X}$) by the logit transformation, $T = \ln\frac{p_{\delta_0}}{1 - p_{\delta_0}}$ or $T_{|X} = \ln\frac{p_{\delta_0|X}}{1 - p_{\delta_0|X}}$, has an asymptotic normal distribution with mean $\ln\frac{\pi_{\delta_0}}{1 - \pi_{\delta_0}}$ or $\ln\frac{\pi_{\delta_0|X}}{1 - \pi_{\delta_0|X}}$, and variance

\sigma_T^2 = \frac{0.5\left[\delta_+\phi(\delta_+) + \delta_-\phi(\delta_-)\right]^2 + \left[\phi(\delta_+) - \phi(\delta_-)\right]^2}{(n-3)(1 - \pi_{\delta_0})^2\pi_{\delta_0}^2},   (2.13)

when X is random, or

\sigma_{T|X}^2 = \frac{\frac{C_0^2}{n^2} + \frac{(C_0\bar{X} - C_1)^2}{n^2 s_x^2} + \frac{C_2^2}{2n^2}}{(n-3)(1 - \pi_{\delta_0|X})^2\pi_{\delta_0|X}^2},   (2.14)

when X is fixed, where
\delta_+ = \frac{\delta_0 + \mu_d}{\sigma_d},   (2.15)

\delta_- = \frac{\delta_0 - \mu_d}{\sigma_d},   (2.16)

\delta_{+\beta i} = \frac{\delta_0 + \beta_0 + (\beta_1 - 1)X_i}{\sigma_e},   (2.17)

\delta_{-\beta i} = \frac{\delta_0 - \beta_0 - (\beta_1 - 1)X_i}{\sigma_e},   (2.18)

C_0 = \sum_{i=1}^{n}\left[\phi(\delta_{+\beta i}) - \phi(\delta_{-\beta i})\right],   (2.19)

C_1 = \sum_{i=1}^{n}\left[\phi(\delta_{+\beta i}) - \phi(\delta_{-\beta i})\right]X_i,   (2.20)

C_2 = \sum_{i=1}^{n}\left[\delta_{+\beta i}\phi(\delta_{+\beta i}) + \delta_{-\beta i}\phi(\delta_{-\beta i})\right],   (2.21)

and $\phi(\cdot)$ denotes the standard normal density. These expressions will become very useful when we deal with the more general case to be introduced in Chapter 5.
2.3 Relative Indices

2.3.1 Intraclass Correlation Coefficient

In its earliest form, the intraclass correlation was computed by entering the measurements for each pair of brothers, $(x, y)$, twice into the computation of the usual product-moment correlation coefficient $\rho$, once in the order $(x, y)$ and once in the order $(y, x)$. If there are more than two brothers in each set, each possible pair of measurements is entered twice into the computation. Thus, the number of entries in the correlation for this set is $n(n-1)$, where $n$ is the number of brothers in the data set.

Harris (1913) developed a simple formula for intraclass correlation as a function of (a) the variance of the means of each set of measurements around the overall mean and (b) the variance of the total set of measurements.

Fisher (1925) observed that variance measurements could be partitioned into two components. The first component is the between-sample variance after removing the residual variance, which Fisher called $A$. The second component is the residual variance or within-sample variance, which Fisher called $B$. Thus the population intraclass correlation can be expressed as

\rho_I = \frac{A}{A + B}.   (2.23)
Fisher (1925) noted that the ICC could be estimated using mean squares from
an analysis of variance (ANOVA). We will revisit ICC in Chapter 3, where we will
show its association with kappa, weighted kappa, and the concordance correlation
coefficient (CCC) presented below. We will also revisit, in Chapter 5, the general
form of the ICC for agreement, precision, and accuracy coefficients.
2.3.2 Concordance Correlation Coefficient

The MSD can be standardized such that 1 indicates that each pair of readings is in perfect agreement in the population (for example, (1,1), (2,2), (3,3), (4,4), (5,5)), 0 indicates no correlation, and $-1$ means that each pair of readings is in perfect reversed agreement in the population (for example, (5,1), (4,2), (3,3), (2,4), (1,5)). Lin (1989) introduced one such standardization of MSD, called CCC, which is defined as

\rho_c = 1 - \frac{\varepsilon^2}{\varepsilon^2_{|\rho=0}}   (2.24)

= 1 - \frac{\varepsilon^2}{\sigma_y^2 + \sigma_x^2 + (\mu_y - \mu_x)^2}
= \frac{2\sigma_{yx}}{\sigma_y^2 + \sigma_x^2 + (\mu_y - \mu_x)^2}
= \frac{2\rho\sigma_x\sigma_y}{\sigma_y^2 + \sigma_x^2 + (\mu_y - \mu_x)^2},   (2.25)

when X is random, or

\rho_{c|X} = \frac{2\beta_1 s_x^2}{\sigma_y^2 + s_x^2 + (\mu_y - \bar{X})^2},   (2.26)
when X is fixed. The CCC is closely related to the intraclass correlation and has a meaningful geometrical interpretation: it equals one minus the squared ratio of the within-sample total deviation ($\varepsilon$) to the total deviation ($\varepsilon_{|\rho=0}$). For example, if the within-sample total deviation is 10%, 32%, or 45% of the total deviation, then the CCC is $0.99 = 1 - 0.1^2$, $0.90 = 1 - 0.32^2$, or $0.80 = 1 - 0.45^2$, respectively. In Chapter 3, we will show that for ordinal categorical data, the CCC degenerates into the weighted kappa suggested by Cohen (1968).
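A minimal R sketch of the sample CCC in (2.25), using the n-divisor moments of Section 2.4 (function name ours):

    ## Sample CCC of (2.25); moments use divisor n so the estimate is in [-1, 1].
    ccc.est <- function(y, x) {
      syx <- mean((y - mean(y)) * (x - mean(x)))
      sy2 <- mean((y - mean(y))^2)
      sx2 <- mean((x - mean(x))^2)
      2 * syx / (sy2 + sx2 + (mean(y) - mean(x))^2)
    }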
Section 1.1 defined accuracy and precision in the one-dimensional situation. According to the two-dimensional model of Section 2.1, the between-sample variation is typically inherited or is a result of the design of the sampling process, and is usually unrelated to the within-sample precision of an assay. Therefore, we consider the difference in between-sample variance as a systematic bias, and it is included in the inaccuracy. A sample mean and sample variance define a marginal distribution in most of the commonly used distributions. The accuracy coefficient is

\chi_a = \frac{2\sigma_x\sigma_y}{\sigma_y^2 + \sigma_x^2 + (\mu_y - \mu_x)^2}.   (2.27)

The precision coefficient is the Pearson correlation coefficient ($\rho$) between $Y$ and $X$, where

\rho = \frac{\sigma_{yx}}{\sigma_y\sigma_x}.   (2.28)
Here $\rho^2$ has the same scale as the accuracy coefficient, from 0 (no agreement) to 1 (perfect agreement). It is evident from (2.25), (2.26), and (2.27) that the CCC is the product of the precision and accuracy coefficients when target values are random or fixed.
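This product decomposition is easy to verify numerically; the following R sketch (our own helper, not a published macro) returns the precision and accuracy coefficients together with their product, which reproduces the CCC:

    ## Precision (2.28), accuracy (2.27), and their product (the CCC).
    prec.acc <- function(y, x) {
      r  <- cor(y, x)                       # precision coefficient
      sy <- sqrt(mean((y - mean(y))^2))
      sx <- sqrt(mean((x - mean(x))^2))
      ca <- 2 * sy * sx / (sy^2 + sx^2 + (mean(y) - mean(x))^2)  # accuracy
      c(precision = r, accuracy = ca, ccc = r * ca)
    }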
The estimate of CCC ($r_c$ or $r_{c|X}$) using sample counterparts by the Z transformation, $Z = \frac{1}{2}\ln\frac{1+r_c}{1-r_c}$ or $Z_{|X} = \frac{1}{2}\ln\frac{1+r_{c|X}}{1-r_{c|X}}$, has an asymptotic normal distribution with mean $\frac{1}{2}\ln\frac{1+\rho_c}{1-\rho_c}$ or $\frac{1}{2}\ln\frac{1+\rho_{c|X}}{1-\rho_{c|X}}$, and variance

\sigma_Z^2 = \frac{1}{n-2}\left[\frac{(1-\rho^2)\rho_c^2}{(1-\rho_c^2)\rho^2} + \frac{2\rho_c^3(1-\rho_c)u^2}{\rho(1-\rho_c^2)^2} - \frac{\rho_c^4 u^4}{2\rho^2(1-\rho_c^2)^2}\right],   (2.29)

where $u = (\mu_y - \mu_x)/\sqrt{\sigma_y\sigma_x}$,
or

\sigma_{Z|X}^2 = \frac{(1-\rho^2)\rho_{c|X}^2}{(n-2)(1-\rho_{c|X}^2)^2}\left[\varpi^2\rho_{c|X}^2 + (1 - \rho_{c|X}\varpi)^2 + \frac{\varpi^2\rho_{c|X}^2(1-\rho^2)}{2\rho^2}\right],   (2.30)

when X is random or fixed, respectively.
The estimate of the accuracy coefficient ($c_a$ or $c_{a|X}$) using sample counterparts by the logit transformation, $L = \ln\frac{c_a}{1-c_a}$ or $L_{|X} = \ln\frac{c_{a|X}}{1-c_{a|X}}$, has an asymptotic normal distribution with mean $\ln\frac{\chi_a}{1-\chi_a}$ or $\ln\frac{\chi_{a|X}}{1-\chi_{a|X}}$, and variance

\sigma_{L|X}^2 = \frac{\varpi_a^2(1-\rho^2) + \frac{1}{2}(1-\varpi_a)^2(1-\rho^4)}{(n-2)(1-\chi_a)^2}   (2.32)

when X is fixed.
2.4 Sample Counterparts

For the purpose of statistical inference, the parameters discussed above can be replaced with their consistent sample estimates, such as the moment estimates.
The sample counterparts for $\mu_y$, $\mu_x$, $\sigma_y^2$, $\sigma_x^2$, $\sigma_{yx}$, $\beta_1$, and $\rho$ are

\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i,   (2.33)

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,   (2.34)

s_y^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2,   (2.35)

s_x^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2,   (2.36)

s_{yx} = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X}),   (2.37)

b_1 = \frac{s_{yx}}{s_x^2},   (2.38)

and

r = \frac{s_{yx}}{s_y s_x}.   (2.39)
We could certainly use $\frac{1}{n-1}$ rather than $\frac{1}{n}$ in the above variance and covariance estimates. However, we use $\frac{1}{n}$ to bound the CCC estimate by $\pm 1.0$.
In estimating the MSD, we use the sum of squared differences divided by $n-1$. For (2.17) and (2.18) when X is fixed, we use

s_e^2 = \frac{n}{n-3}(1 - r^2)s_y^2.   (2.40)

For less bias in estimating the RBS, and in estimating $\sigma_d^2$ in (2.15) and (2.16), we use

s_d^2 = \frac{n}{n-3}\left(s_x^2 + s_y^2 - 2s_{yx}\right).   (2.41)
The use of $n-2$ or $n-3$ instead of $n$ in the denominators of the above variance equations is a small-sample-size bias correction based on the simulation studies in Lin, Hedayat, Sinha, and Yang (2002). The bias correction is not important when the sample size is large.
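A short R sketch of the corrections (2.40) and (2.41), under the same moment definitions as above (function name ours):

    ## Bias-corrected variance estimates of (2.40) and (2.41).
    bias.corrected <- function(y, x) {
      n   <- length(y)
      r   <- cor(y, x)
      sy2 <- mean((y - mean(y))^2)
      sx2 <- mean((x - mean(x))^2)
      syx <- mean((y - mean(y)) * (x - mean(x)))
      c(se2 = n / (n - 3) * (1 - r^2) * sy2,        # (2.40)
        sd2 = n / (n - 3) * (sy2 + sx2 - 2 * syx))  # (2.41)
    }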
For the purpose of performing statistical inference for each index, we should
compute the confidence limit (lower limit for CCC, precision coefficient, accuracy
coefficient, CP, and upper limit for TDI) based on its respective transformation, then
perform antitransformation to the limit. We will declare that the assay agreement
is acceptable when the limit is better than the prespecified criterion. The use
of transformed estimates can speed up the approach to normality. Moreover, a
transformation could bound the confidence interval to its respective parameter
range, say, $-1$ to 1 for CCC and the precision coefficient, 0 to 1 for the accuracy
coefficient and CP, and 0 to infinity for MSD.
Throughout this book, once the asymptotic normality of an estimated index has been defined, statistical inference can be established through confidence limit(s). Let $\hat{\theta}$ be the estimate of an agreement index and $\sigma_{\hat{\theta}}^2$ its variance. Then the one-sided upper or lower confidence limit becomes

\hat{\theta} + \Phi^{-1}(1-\alpha)\,\hat{\sigma}_{\hat{\theta}} \quad \text{or} \quad \hat{\theta} - \Phi^{-1}(1-\alpha)\,\hat{\sigma}_{\hat{\theta}},

where $\hat{\sigma}_{\hat{\theta}}$ is the estimate of the square root of the variance, $\sigma_{\hat{\theta}}$, using sample counterparts. When the sample size is small, say less than 30, we can also use the cutoff value of the cumulative central t-distribution instead of the standard cumulative normal distribution to form the statistical inference.
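As an illustration for the CCC, the following R sketch implements this decision rule; se.z is assumed to come from (2.29), and the t cutoff with n - 2 degrees of freedom is one reasonable small-sample choice (the text does not fix the degrees of freedom).

    ## One-sided lower limit for the CCC via the Z transformation.
    ccc.lower.limit <- function(rc, se.z, n, alpha = 0.05) {
      z <- 0.5 * log((1 + rc) / (1 - rc))
      q <- if (n < 30) qt(1 - alpha, df = n - 2) else qnorm(1 - alpha)
      tanh(z - q * se.z)            # anti-transform the limit
    }
    ## Declare agreement if, e.g., ccc.lower.limit(...) > 0.9775.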
2.5 Proportional Error Case

When Y and X are positively valued variables and the standard deviations of Y are proportional to either Y or X, it is assumed that $\ln(Y)$ and $\ln(X)$ have a bivariate normal distribution. Let $100\gamma\%$ be the percent change between Y and X. Then

\pi = P\!\left(\frac{1}{1+\gamma} < \frac{Y}{X} < 1+\gamma\right) = P\left[\,|\ln(Y) - \ln(X)| < \ln(1+\gamma)\,\right].   (2.42)

Let $D = \ln(Y) - \ln(X)$ and $\delta_{\pi_0} = \ln(1+\gamma_{\pi_0})$. Then $\gamma_{\pi_0} = 100[\exp(\delta_{\pi_0}) - 1]\%$. This $100\gamma_{\pi_0}\%$ is denoted by TDI%$_{\pi_0}$.

In the case of proportional errors, all of the above unscaled and scaled agreement indices should be computed from the log-transformed data. In practice, we have encountered the proportional error case more frequently than the constant error case.
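A minimal sketch of the log-scale computation, reusing the tdi.approx helper sketched in Section 2.2.2 (names ours):

    ## TDI% of (2.42): TDI on the log scale, reported as a percent change.
    tdi.percent <- function(y, x, pi0 = 0.8) {
      d0 <- unname(tdi.approx(log(y), log(x), pi0)["TDI"])
      100 * (exp(d0) - 1)
    }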
2.6 Summary of Simulation Results

Simulation studies showed that the estimates were in excellent agreement with the theoretical values from normal samples even when $n = 15$. However, these estimates are not expected to be robust against outliers or large deviations from normality or log-normality. The robustness issues of the CCC have been addressed by King and Chinchilli (2001a, 2001b), using M-estimation or using a power function of the absolute value of $D$ to compute the CCC.
2.7 Asymptotic Power and Sample Size

In assessing agreement, the null and alternative hypotheses should be reversed. The conventional rejection region actually is the region for declaring agreement (one-sided). Asymptotic power and sample size calculations should proceed by this principle. The powers of CCC, TDI, and CP were compared in Lin, Hedayat, Sinha, and Yang (2002). The results showed that the TDI and CP estimates have similar power and are superior to CCC, but they are valid only under the normality assumption. Therefore, for inference, TDI and CP are superior to CCC. However, the CCC and the precision and accuracy coefficients remain very useful and informative tools, as is evident from the following examples. In Chapter 4, we will discuss the sample size subject in greater detail.
2.8 Examples
2.8.1 Example 1: Methods Comparison

This example was presented in Lin, Hedayat, Sinha, and Yang (2002). DCLHb is a treatment solution containing oxygen-carrying hemoglobin. The DCLHb level in a patient's serum is routinely measured by the Sigma method. The simpler HemoCue method was modified to reproduce the DCLHb values of the Sigma method. Serum samples from 299 patients over a 50–2,000 mg/dL range were collected. The DCLHb values of each sample were measured by both methods twice, and the averages of the duplicate values were evaluated. The client required with 95% confidence that the within-sample total deviation be less than 15% of the total deviation. This means that the allowable CCC was $1 - 0.15^2 = 0.9775$. The client also needed with 95% confidence that at least 90% of the HemoCue observations be within 150 mg/dL of the targeted Sigma values. This means that the allowable TDI$_{0.9}$ was 150 mg/dL, or that the allowable CP$_{150}$ was 0.9.
The results are presented in Fig. 2.1 and Table 2.1. The plot indicates that the within-sample error is relatively constant across the clinical range. The plot also indicates that the HemoCue accuracy is excellent and that the precision is adequate.

Table 2.1 Agreement statistics for HemoCue and Sigma readings on measuring DCLHb

Statistics       CCC     Precision coef.  Accuracy coef.  TDI_0.9  CP_150  RBS
Estimate         0.9866  0.9867           0.9999          127.5    0.9463  0.00
95% conf. limit  0.9838  0.9839           0.9989          136.4    0.9276  –
Allowance        0.9775  –                –               150.0    0.9000  –
("–" means "not applicable")

The CCC estimate is 0.987, which means that the within-sample total deviation is about 11.6% of the total deviation. The CCC one-sided lower confidence limit is 0.984, which is greater than 0.9775. The precision coefficient estimate is 0.987 with
a one-sided lower confidence limit of 0.984. The accuracy coefficient estimate is 0.9999 with a one-sided lower confidence limit of 0.9989. The TDI$_{0.9}$ estimate is 127.5 mg/dL, which means that 90% of HemoCue observations are within 127.5 mg/dL of their target values. The one-sided upper confidence limit for TDI$_{0.9}$ is 136.4 mg/dL, which is less than 150 mg/dL. Finally, the CP$_{150}$ estimate is 0.946, which means that 94.6% of HemoCue observations are within 150 mg/dL of their target values. The one-sided lower confidence limit for CP$_{150}$ is 0.928, which is greater than 0.9. Therefore, the agreement between HemoCue and Sigma is acceptable, with excellent accuracy and adequate precision. The relative bias squared is estimated to be near zero, indicating that the approximation of TDI should be excellent.
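For illustration only, the helpers sketched earlier can be combined in the spirit of this example; the data below are simulated, not the actual DCLHb measurements.

    set.seed(1)
    x <- runif(299, 50, 2000)          # targets over the clinical range
    y <- x + rnorm(299, sd = 75)       # observed values with constant error
    ccc.est(y, x)                      # sample CCC
    prec.acc(y, x)                     # precision, accuracy, and their product
    tdi.approx(y, x, pi0 = 0.9)        # TDI, MSD, and RBS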
2.8.2 Example 2: Assay Validation

This example was presented in Lin, Hedayat, Sinha, and Yang (2002). FVIII is a clotting agent in plasma. The FVIII assay uses a marker with varying dilutions of known FVIII activities to form a standard curve. The assay started at 1:5 or 1:10, and serial dilutions were prepared until they reached the target values. Target values were fixed at 3%, 8%, 38%, 91%, and 108%. Six samples were assayed per target value. The error was expected to be proportional, mainly due to dilutions. The client needed with 95% confidence that the within-sample total deviation be less than 15% of the total deviation. This means that the allowable CCC was $1 - 0.15^2 = 0.9775$. The client also needed with 95% confidence that 80% of FVIII observations be within 50% of target values (note that this is a percentage of the measuring unit, which is itself a percentage). This means that the allowable TDI%$_{0.8}$ was 50%, or that the allowable CP$_{50\%}$ was 0.8.

Fig. 2.2 Observed FVIII assay results versus targeted values started at 1:5
Figures 2.2 and 2.3 present the results started at 1:5 and at 1:10 serial dilutions for these plots of observed FVIII assay results versus targeted values in $\log_2$ scale. Note that there are overlying observations in the plots. Specifically, in Fig. 2.2, four replicate readings of 3% and duplicate readings of 2% are observed at the target value of 3%, and circles at the target value of 8% represent duplicate readings of 8%, 9%, and 10%. Duplicate readings of 45% are observed at the target value of 38%. Also note that in Fig. 2.3, four replicate readings of 5% and duplicate readings of 4% are observed at the target value of 3%. Three replicate readings of 11% and duplicate readings of 12% are observed at the target value of 8%. Duplicate readings of 49% are observed at the target value of 38%. Duplicate readings of 124% are observed at the target value of 91%. The plots indicate that the within-sample error is relatively constant across the target values in log scale. The precision is good for both assays started at 1:5 and at 1:10 serial dilutions, but the accuracy is not as good for the assay started at 1:10 serial dilutions.
Fig. 2.3 Observed FVIII assay results versus targeted values started at 1:10

Tables 2.2 and 2.3 present the agreement statistics started at 1:5 and 1:10 serial dilutions. For the assay started at 1:5 serial dilutions, the CCC is estimated to be 0.992, which means that the within-sample total deviation is about 9.1% of the total deviation. The one-sided lower confidence limit is 0.987, which is greater than 0.9775. The precision coefficient is estimated to be 0.994 with a one-sided lower confidence limit of 0.991. The accuracy coefficient is estimated to be 0.998 with a one-sided lower confidence limit of 0.994. TDI%$_{0.8}$ is estimated to be 27.3%, which means that 80% of observations are within a 27.3% change from target values (a percentage of percentage values). The one-sided upper confidence limit is 35.0%, which is less than 50%. Finally, CP$_{50\%}$ is estimated to be 0.965, which means that 96.5% of observations are within a 50% change from target values. The one-sided lower confidence limit is 0.892, which is greater than 0.8. The agreement between the FVIII assay and the actual concentration is acceptable with good precision and accuracy. The relative bias squared is estimated to be 0.12, so the approximation of TDI should be excellent.
For the assay started at 1:10 serial dilutions, the CCC is estimated to be 0.967, which means that the within-sample total deviation is about 18.2% of the total deviation. The one-sided lower confidence limit is 0.958, which is less than 0.9775. The precision coefficient is estimated to be 0.995 with a one-sided lower confidence limit of 0.992. The accuracy coefficient is estimated to be 0.972 with a one-sided lower confidence limit of 0.964. TDI%$_{0.8}$ is estimated to be 58.9%, which means that 80% of observations are within a 58.9% change from target percentage values. The one-sided upper confidence limit is 69.0%, which is greater than 50%. Finally, CP$_{50\%}$ is estimated to be 0.702, which means that 70.2% of observations are within a 50% change from target values. The one-sided lower confidence limit is 0.590, which is less than 0.8. The agreement between the FVIII assay and actual concentration had good precision but is not acceptable due to mediocre accuracy. The relative bias squared is estimated to be 3.75, which is less than 8.0, so the approximation of TDI should be acceptable.
2.8.3 Example 3: Assay Validation

This example was presented in Lin and Torbeck (1998). A study to validate an amino acid analysis test method was conducted. Solutions were prepared at approximately 90%, 100%, and 110% of the label concentration of the amino acids, each containing nine determinations (observed values). Target values were determined based on their average molecular weights, which were much more precise and accurate but were still measured with random error. For each test method we compute the estimates of CP, TDI, CCC, the precision coefficient, the accuracy coefficient, and their confidence limits. It is debatable whether we should treat the target values as random or fixed, because they were average values. We therefore take the more conservative approach of treating target values as random, which yields the same estimates of agreement statistics but with a larger respective standard error for each estimate.

The observed and target values were expressed as a percentage of label concentration. Using estimates of the CCC components, the coefficients of accuracy ($c_a$) and precision ($r$), four out of 30 amino acids were chosen for illustration, each representing an example of a distinctive precise and/or accurate situation. These four amino acids and their label concentrations were glycine (1 g/L), ornithine (6.4 g/L), L-threonine (3.6 g/L), and L-methionine (2 g/L).

The range of these data was approximately 20% (90%–110%) of label concentration. The client needed with 95% confidence that at least 80% of observations be within 3% of target values. This means that the 95% upper limit of TDI$_{0.8}$ must be less than 3, or that the 95% lower limit of CP$_3$ must be greater than 0.8. Note that the measurement unit is in percentage, and the error structure was assumed constant across the data range. The client did not specify a criterion for the CCC.
Figures 2.4–2.7 present the plots, and Tables 2.4–2.7 the agreement statistics, for glycine, ornithine, L-threonine, and L-methionine, respectively.

The results for glycine are accurate and precise, with CCC = 0.996 (0.994), $r$ = 0.998 (0.996), $c_a$ = 0.998 (0.996), TDI$_{0.8}$ = 0.93 (1.18), and CP$_3$ = 0.9999 (0.9966). Values presented in parentheses represent the respective 95% lower or upper confidence limits. More than 80% of observations are within 0.93 of target values. The 95% upper confidence limit of TDI$_{0.8}$ is 1.18, which is within the allowable 3%. The 95% lower confidence limit of CP$_3$ is 0.997, which is better than the allowable 0.8. The CCC estimate is near 1, indicating an almost perfect agreement.

The results of ornithine are accurate but less precise, with CCC = 0.974 (0.95), $r$ = 0.976 (0.954), $c_a$ = 0.998 (0.98), TDI$_{0.8}$ = 2.45 (3.09), and CP$_3$ = 0.870 (0.750). The 95% upper confidence limit of TDI$_{0.8}$ is 3.09, and the 95% lower confidence limit of CP$_3$ is 0.750.

The results of L-threonine are inaccurate but precise, with CCC = 0.944 (0.908), $r$ = 0.991 (0.981), $c_a$ = 0.953 (0.922), TDI$_{0.8}$ = 3.61 (4.14), and CP$_3$ = 0.656 (0.519). The 95% upper confidence limit of TDI$_{0.8}$ is 4.14, and the 95% lower confidence limit of CP$_3$ is 0.519.

The results of L-methionine are inaccurate and imprecise, with CCC = 0.531 (0.399), $r$ = 0.972 (0.946), $c_a$ = 0.546 (0.428), TDI$_{0.8}$ = 13.68 (14.89), and CP$_3$ = 0.0001 (0.0000). The 95% upper confidence limit of TDI$_{0.8}$ is 14.89, and the 95% lower confidence limit of CP$_3$ is almost zero. Note that the TDI$_{0.8}$ estimate of the L-methionine assay is conservative, since the estimate of its relative bias squared value is large.

In summary, only the glycine assay in this example meets the client's criterion.
2.8.4 Example 4: Lab Performance Process Control

This example was presented in Lin (2008). For quality control of clinical laboratories, control materials of various concentrations were randomly sent to laboratories for testing. The test results were to satisfy the proficiency testing (PT) criterion. The PT criterion for each lab test of the Clinical Laboratory Improvement Amendments (CLIA) Final Rule (2003, http://wwwn.cdc.gov/clia/regs/subpart_i.aspx#493.929) required that 80% of observations be within a certain percentage or unit of the target concentration for measuring control materials. The target concentrations usually were the averages of control materials across a peer group of labs using similar instruments. Such a criterion lends itself directly to using TDI%$_{0.8}$ or TDI$_{0.8}$.

For the majority of lab measurements, laboratories were required to test commercial control materials at least once a day for at least two concentrations (low and high). Daily glucose values of 116 laboratory instruments were monitored. Based on accuracy and precision indices, we selected four laboratory instruments with four distinct combinations of precision and accuracy. For each laboratory instrument we computed TDI%$_{0.9}$, CCC, and the precision and accuracy coefficients. Here, for a cushion, we chose to monitor the 90%, instead of 80%, of observations across all levels that were within TDI%$_{0.9}$ or TDI$_{0.9}$ units of targets. We can translate from TDI$_{0.9}$ to TDI$_{0.8}$ by multiplying by $1.282/1.645 = 0.779$. The target values were computed as the average of these 116 laboratory instruments. For glucose, this PT criterion was 10% or 6 mg/dL, whichever was larger. The range of these data was around 70–270 mg/dL. In this case, the 10% value was the PT criterion (always larger than 6 mg/dL).
For each lab instrument, we computed the preceding agreement statistics for each calendar month and for the last available 50 days (the current window). Across the 116 lab instruments, we computed the group geometric mean (GM), one-standard-deviation (1-SD), and two-standard-deviation (2-SD) upper limits of the 3-month average TDI%$_{0.9}$ values as benchmarks. Note that the distribution of TDI%$_{0.9}$ was shown to be log-normal. Here, the confidence limit of TDI%$_{0.9}$ per lab instrument was not computed, because we were using the population benchmarks. Therefore, it is irrelevant whether the target values were treated as random or fixed.

Figures 2.8–2.11 present the plots of the four selected cases. For each case, the left-hand plot presents the usual agreement plot of observations versus target values for the current window, and each plotted symbol (circle) represents the daily glucose value against the daily average glucose value across the 116 labs. The right-hand plot monitors the quality control results over a selected time window based on TDI%$_{0.9}$ values. We chose to monitor a rolling three-completed-month window (June, July, and August in this case) plus the current window. Each plotted symbol (dot) represents the monthly or current-window TDI%$_{0.9}$ value. Also presented are the population benchmarks of the geometric mean, 1-SD, and 2-SD upper limits, and the PT criterion (PTC) of 10% (dashed line). Although the CCC, precision coefficient, and accuracy coefficient are not shown in the right-hand plot, these
Fig. 2.8 Observed glucose measures versus target values of lab instrument for the current window and the control chart based on TDI%$_{0.9}$: almost perfect
Fig. 2.9 Observed glucose measures versus target values of lab instrument for the current window and the control chart based on TDI%$_{0.9}$: imprecise but accurate
values of the current window were used to select the four instruments presented here. The use of CP can be helpful here. However, the CP values have difficulty discriminating among good instruments when they all have very high CP values.

Figure 2.8 shows the best-performing lab instrument among all 116 lab instruments, with CCC = 0.9998, $r^2$ = 0.9997, $c_a$ = 0.9999, and TDI%$_{0.9}$ = 2.1% for the current window. It has an almost perfect CCC, and its TDI%$_{0.9}$ values are around 2%–3%.

Figure 2.9 shows a less-precise but accurate lab instrument, with CCC = 0.996, $r^2$ = 0.996, $c_a$ = 0.998, and TDI%$_{0.9}$ = 9.8% for the current window. Its values rank at around 2/3 (slightly greater than the 1-SD value) among its peers in June, slightly better than its peer average in July, and at around the PTC level in August and the current window.
Fig. 2.10 Observed glucose measures versus target values of lab instrument for the current window and the control chart based on TDI%$_{0.9}$: precise but inaccurate
Fig. 2.11 Observed glucose measures versus target values of lab instrument for the current window and the control chart based on TDI%$_{0.9}$: imprecise and inaccurate
Figure 2.10 shows a precise but inaccurate lab instrument, with CCC = 0.995, $r^2$ = 0.9996, $c_a$ = 0.995, and TDI%$_{0.9}$ = 11% for the current window. Its TDI%$_{0.9}$ values are between the 1-SD and 2-SD values of its peers in June and July, at around the PTC level in August, and slightly worse than the PTC in the current window.

Figure 2.11 shows the worst-performing lab instrument among all 116 lab instruments, with CCC = 0.983, $r^2$ = 0.991, $c_a$ = 0.988, and TDI%$_{0.9}$ = 22% for the current window. Its TDI%$_{0.9}$ values are between the 1-SD and 2-SD values of its peers in July and August, and worse than the 2-SD value of its peers in June and the current window.
This example conveys a few lessons. First, it is difficult to judge agreement solely
by the CCC, precision coefficient, and accuracy coefficient in their absolute values.
All comparisons should be judged in relative terms. Here, even the worst CCC value
shown in Fig. 2.11 is 0.983. Such a high value here is due to the large study range of
70–270 mg/dL. Note that as stated earlier, the CCC, precision coefficient, accuracy
coefficient, and ICC depend largely on the study range. Comparisons among any
of these are valid only with similar study ranges. In this example, the study ranges
(based on peer means) are identical. It is important to report the study range when
reporting these statistics.
Second, not all lab instruments are created equal. Their quality, in terms of
TDI%0:9 values, could range from 2% to 22%, which is quite diverse. It is important
to submit our blood samples to a lab with a good reputation for quality.
Third, the PTC value is between the GM and 1-SD benchmarks for measuring
glucose, which means that about one-third of the lab instruments are in danger of
failing the PTC as dictated by the CLIA 2003 Final Rule. Perhaps the PTC is set too
strictly for measuring glucose. Note that we use TDI%0:9 instead of TDI%0:8 values
for a cushion here.
2.8.5 Example 5: Clinical Chemistry and Hematology Measurements That Conform to CLIA Criteria

The data for this example were obtained through the clinical labs within the research and development organization at Baxter Healthcare Corporation. Analysis of serum chemistry analytes previously validated on the Hitachi 911 (Roche Diagnostics) chemistry analyzer (reference or gold-standard assays) was to be converted to the Olympus AU400e (Olympus America, Inc.) chemistry analyzer (test or new assays). Assay comparisons were performed by analyzing approximately 50–60 samples per assay on both analyzers.
Hematology analyses were performed using two automated hematology instruments:
the CELL DYN 3500 (Abbott Diagnostics) as the reference and the ADVIA
2120 (SIEMENS Healthcare Diagnostics) as the test. Analyses were performed on
whole blood samples drawn into tubes containing EDTA anticoagulant. A total of
93 samples from 16 humans, 18 rabbits, 19 rats, 20 dogs, and 20 pigs were each
tested once on both instruments. All species were combined to establish wider data
ranges, and assay performance was not expected to depend on species.
Evaluations included clinical chemistry analytes of albumin, BUN, calcium,
chloride, creatinine, glucose, iron, magnesium, potassium, sodium, total protein,
and triglycerides; and hematology analytes of hemoglobin, platelet count, RBC, and
WBC. Any analyte without a PTC, or with values outside the data range, was not included in the evaluations.
Table 2.8 presents the data range, PT criterion, and estimates and confidence
limits (in parentheses) of agreement statistics for each of the above analytes. Data
ranges of clinical chemistry analytes were acquired from the Olympus Chemistry
Reagent Guide “Limitations of the Procedure” section for each assay (Olympus
America, Inc. Reagent Guide Version 3.0. May 2007. Irving, TX). Data ranges
of hematology analytes were acquired from ADVIA 2120 Version 5.3.1–MS2005
Bayer (now Siemens) Healthcare LLC.
Table 2.8 Agreement statistics against the PT criteria (PTC) for clinical chemistry and hematology analytes

Analyte | Range | PTC | n | CCC | Precision coef. | Accuracy coef. | TDI_0.8 (a) | CP_PTC (b) | RBS (c)
Albumin (g/dL) | 1.5–6.0 | 10% | 58 | 0.723 (0.626) | 0.859 (0.789) | 0.842 (0.768) | 23.6% (27.29%) | 0.379 (0.309) | 1.19
BUN ≤ 22 (mg/dL) | 2–22 | 2 | 44 | 0.993 (0.989) | 0.996 (0.993) | 0.997 (0.994) | 0.77 (0.91) | 1.000 (0.996) | 0.53
BUN > 22 (mg/dL) | 22–130 | 9% | 16 | 0.999 (0.998) | 0.999 (0.998) | 1.000 (0.999) | 3.26% (4.48%) | 0.999 (0.954) | 0.05
Calcium (mg/dL) | 4–18 | 1 | 60 | 0.873 (0.833) | 0.996 (0.994) | 0.876 (0.838) | 1.58 (1.69) | 0.302 (0.223) | 11.7
Chloride (mmol/L) | 50–200 | 5% | 61 | 0.991 (0.987) | 0.996 (0.993) | 0.995 (0.992) | 2.44% (2.8%) | 0.995 (0.982) | 0.81
Creatinine Enz < 2 (mg/dL) | 0.2–2 | 0.3 | 50 | 0.989 (0.982) | 0.990 (0.983) | 0.999 (0.996) | 0.08 (0.09) | 1.000 (1.000) | 0.06
Creatinine Jaffe < 2 (mg/dL) | 0.2–2 | 0.3 | 47 | 0.963 (0.942) | 0.981 (0.969) | 0.981 (0.966) | 0.14 (0.16) | 0.998 (0.989) | 0.92
Glucose > 60 (mg/dL) | 60–800 | 10% | 55 | 0.999 (0.998) | 0.999 (0.998) | 1.000 (0.999) | 4.28% (5.01%) | 0.997 (0.986) | 0.32
Iron (ug/dL) | 10–1000 | 20% | 59 | 0.994 (0.991) | 0.997 (0.995) | 0.997 (0.995) | 9.74% (11.45%) | 0.987 (0.961) | 0.03
Magnesium (mg/dL) | 0.5–8 | 25% | 61 | 0.970 (0.958) | 0.996 (0.993) | 0.974 (0.963) | 14.41% (15.71%) | 0.999 (0.995) | 5.82
Potassium (mmol/L) | 1–10 | 0.5 | 59 | 0.996 (0.995) | 0.998 (0.997) | 0.998 (0.997) | 0.14 (0.16) | 1.000 (1.000) | 0.43
Sodium (mmol/L) | 50–200 | 4 | 59 | 0.994 (0.991) | 0.995 (0.993) | 0.999 (0.997) | 1.77 (2.06) | 0.996 (0.984) | 0.16
Total Protein (g/dL) | 3–12 | 10% | 56 | 0.993 (0.989) | 0.993 (0.989) | 1.000 (0.997) | 2.31% (2.71%) | 1.000 (1.000) | 0.03
Triglycerides (g/dL) | 10–1000 | 25% | 56 | 0.997 (0.996) | 0.998 (0.998) | 0.999 (0.998) | 8.2% (9.56%) | 1.000 (0.999) | 0.59
Hemoglobin (g/dL) | 1–22.5 | 7% | 93 | 0.967 (0.957) | 0.997 (0.996) | 0.969 (0.961) | 6.02% (6.4%) | 0.942 (0.903) | 7.08
Platelet Count (10^3/μL) | 10–3500 | 25% | 84 | 0.917 (0.884) | 0.926 (0.896) | 0.990 (0.973) | 31.72% (36.79%) | 0.695 (0.629) | 0.04
RBC (10^6/μL) | 0.1–12 | 6% | 92 | 0.988 (0.984) | 0.997 (0.995) | 0.992 (0.989) | 4.13% (4.52%) | 0.966 (0.937) | 2.22
WBC (10^3/μL) | 0.1–100 | 15% | 93 | 0.942 (0.922) | 0.958 (0.941) | 0.984 (0.972) | 21.34% (24.41%) | 0.640 (0.576) | 0.08

Note: Shown in parentheses is the 95% upper (TDI) or lower (CCC, precision, accuracy, CP) confidence limit. Boldface analytes are those that failed the PTC.
(a) Total deviation index to cover 80% of the absolute differences or % changes. The 95% upper limit should be less than the PTC or PTC%.
(b) Coverage probability of values within the PTC. The 95% lower limit should be greater than 0.8.
(c) The relative bias squared (RBS) must be less than 8 in order for the approximate TDI to be valid. Otherwise, the TDI estimate is conservative, depending on the RBS value.
For evaluation of BUN, the PTC is 9% for values greater than or equal to 22
mg/dL or 2 mg/dL for values less than or equal to 22 mg/dL, since 9% of 22 is
approximately 2. The criterion used for creatinine Enz and Jaffe is 0.3 mg/dL, since
only values less than 2 mg/dL were evaluated. Glucose was evaluated for values
greater than 60 mg/dL with the criterion 10%.
Figure 2.12 presents the agreement plots of the above analytes. The analytes of
albumin, calcium, platelet count, and WBC had precision and/or accuracy problems,
while the other analytes appeared to perform well. Comparing the 95% upper
confidence limit of TDI against the PTC or PTC% or the 95% lower confidence
limit CP against 0.8, all but the analytes of albumin, calcium, platelet count, and
WBC (shown boldface in Table 2.8) pass the PTC with 95% confidence.
Table 2.9 presents the results of traditional statistical analyses based on paired
t-test and ordinary regression. The results from Deming (orthogonal) regression
by treating X as a random variable are not shown because they are more or less
similar to those of ordinary regression. Table 2.9 shows the data range, sample size,
paired t-test p-value, intercepts, slopes, and the testing of intercept (0) and slope (1)
of ordinary regressions.
The paired t-test rejects the agreement of extremely well performing analytes, that is, BUN ≤ 22 mg/dL, chloride, creatinine Jaffe, glucose, magnesium, potassium, sodium, triglycerides, hemoglobin, and RBC, with p = 0.003 for sodium and p < 0.001 for the others. These rejections correspond to the left-hand plot of Fig. 1.2 and are due primarily to near-zero residual variance and/or large sample size. On the other hand, the paired t-test accepts (p = 0.062) the agreement of platelet count. Such failure to reject corresponds to the right-hand plot of Fig. 1.2, due primarily to its large residual variance.
In terms of ordinary regression, tests of the intercept (0) and/or slope (1) (LS01) reject (p < 0.05) the agreement of the extremely well performing analytes of BUN ≤ 22 mg/dL, chloride, creatinine Jaffe, iron, magnesium, potassium, sodium, triglycerides, and hemoglobin. These rejections are similar, although not identical, to those of the paired t-test, for the same reasons. On the other hand, LS01 accepts (p ≥ 0.373) the agreement of platelet count, for the same reasons as the paired t-test.
In terms of CCC, precision, and accuracy, the clinical chemistry analytes of BUN, chloride, glucose, iron, potassium, sodium, total protein, and triglycerides have excellent agreement (CCC > 0.99) between measurements from the Olympus AU400e and Hitachi 911 instruments, with excellent precision (>0.99) and accuracy (>0.99).

Creatinine Enz and creatinine Jaffe comfortably pass the PTC. Because their data ranges are small, from 0.2 to 1.8 mg/dL, their CCC and precision and accuracy coefficients are relatively lower. Magnesium also passes the PTC by a comfortable margin. It has excellent precision (0.9956) but relatively lower accuracy (0.9738), because most Olympus values are smaller than the Hitachi values by a negligible amount.
For the clinical chemistry analytes of albumin and calcium, the lab has difficulties proving the equivalence of measurements from the Olympus AU400e and Hitachi 911 instruments. Albumin measurements from the Olympus AU400e and Hitachi 911 are neither accurate (0.8422) nor precise (0.8589).
Table 2.9 Traditional statistics for clinical chemistry and hematology analytes

Analyte | Range | n | Paired t-test p-value | Intercept | Slope | Intercept = 0 p-value | Slope = 1 p-value
Albumin (g/dL) | 1.5–6.0 | 58 | <0.001 | 0.181 | 0.783 | 0.045 | 0.001
BUN ≤ 22 (mg/dL) | 2–22 | 44 | <0.001 | 0.848 | 0.966 | 0 | 0.014
BUN > 22 (mg/dL) | 22–130 | 16 | 0.365 | 0.037 | 1.012 | 0.401 | 0.329
Calcium (mg/dL) | 4–18 | 60 | <0.001 | 0.15 | 0.884 | 0.205 | <0.001
Chloride (mmol/L) | 50–200 | 61 | <0.001 | 0.16 | 0.963 | 0.005 | 0.002
Creatinine Enz < 2 (mg/dL) | 0.2–2 | 50 | 0.093 | 0.026 | 0.986 | 0.196 | 0.517
Creatinine Jaffe < 2 (mg/dL) | 0.2–2 | 47 | <0.001 | 0.107 | 0.962 | 0 | 0.186
Glucose > 60 (mg/dL) | 60–800 | 55 | <0.001 | 0.049 | 1.007 | 0.129 | 0.306
Iron (ug/dL) | 10–1000 | 59 | 0.214 | 0.383 | 1.078 | 0 | <0.001
Magnesium (mg/dL) | 0.5–8 | 61 | <0.001 | 0.095 | 0.999 | 0 | 0.907
Potassium (mmol/L) | 1–10 | 59 | <0.001 | 0.22 | 0.967 | 0 | <0.001
Sodium (mmol/L) | 50–200 | 59 | 0.003 | 4.025 | 0.975 | 0.032 | 0.06
Total Protein (g/dL) | 3–12 | 56 | 0.198 | 0.015 | 0.99 | 0.602 | 0.531
Triglycerides (g/dL) | 10–1000 | 56 | <0.001 | 0.104 | 1.016 | 0.003 | 0.049
Hemoglobin (g/dL) | 1–22.5 | 93 | <0.001 | 0.078 | 1.046 | 0.001 | <0.001
Platelet Count (10^3/μL) | 10–3500 | 84 | 0.062 | 0.211 | 1.042 | 0.461 | 0.373
RBC (10^6/μL) | 0.1–12 | 92 | <0.001 | 0.005 | 1.012 | 0.776 | 0.194
WBC (10^3/μL) | 0.1–100 | 93 | 0.006 | 0.196 | 1.116 | 0.009 | 0.001
We are 95% confident that albumin measurements can deviate 27.3% (>10%) from their target values, and that the measured values conformed to the PTC only 30.9% (<80%) of the time. Calcium measurements are precise (0.9963) but not accurate (0.8759). We are 95% confident that calcium measurements can deviate 1.69 mg/dL (>1 mg/dL) from their target values, and that the measured values conform to the PTC only 22.3% (<80%) of the time.
For hemoglobin, the lab has good agreement (CCC = 0.9665) between measurements from the ADVIA 2120 and CELL DYN 3500 instruments, with excellent precision (0.9970) and good accuracy (0.9694). There is a small bias showing that measurements from the ADVIA 2120, with only one exception, are consistently higher than those from the CELL DYN 3500. Note that data were not collected over the full analytical range of 5 to 22 g/dL for this analyte.

RBC counts have very good agreement (CCC = 0.9883) between measurements from the ADVIA 2120 and CELL DYN 3500 instruments, with excellent precision (0.9965) and accuracy (0.9917). There is a small bias showing that all but one of the measurements from the ADVIA 2120 are consistently higher than those of the CELL DYN 3500. Note that data were not collected over the full analytical range of 2 to 10 × 10^6/μL for this analyte.
Platelet counts from the ADVIA 2120 and CELL DYN 3500 are accurate (0.9897) but imprecise (0.9263). We are 95% confident that platelet count measurements can deviate 36.8% (>25%) from their target values and that they conformed to the PTC only 62.9% (<80%) of the time. Additionally, WBC measurements are also relatively accurate (0.9839) but imprecise (0.9576), especially for readings greater than 8 × 10^3/μL. The lab fails to show that these two analytes meet the PTC. We are 95% confident that WBC measurements can deviate 24.4% (>15%) from their target values, and that they conform to the PTC only 57.6% (<80%) of the time.
In summary, using the agreement statistics presented in this example, 14 out of
18 method comparison cases meet the CLIA criteria with 95% confidence. Of the
four that did not meet the CLIA criteria, one is acceptable by the traditional paired
t-test and regression analysis. Of the 14 that meet the CLIA criteria, 11 are rejected
by the traditional paired t-test or regression analysis.
2.9 Proofs of Asymptotical Normality When Target Values Are Random

2.9.1 CCC and Precision Estimates

This proof can be seen in Lin (1989). The Z transformation of the CCC estimate can be expressed as $Z = g(m)$, where
m = (m_1, m_2, m_3, m_4, m_5)' = \left(\bar{Y},\ \bar{X},\ \frac{1}{n}\sum_{i=1}^{n} Y_i^2,\ \frac{1}{n}\sum_{i=1}^{n} X_i^2,\ \frac{1}{n}\sum_{i=1}^{n} Y_i X_i\right)',   (2.43)

and

Z = g(m) = \frac{1}{2}\ln\left[1 + \frac{4(m_5 - m_1 m_2)}{m_3 + m_4 - 2m_5}\right].

The vector $m$ is expressed as a function of sample moments, and has an asymptotic 5-variate normality with mean

\Theta = (\mu_y,\ \mu_x,\ \mu_y^2 + \sigma_y^2,\ \mu_x^2 + \sigma_x^2,\ \sigma_{yx} + \mu_y\mu_x)'

and variance $\frac{1}{n}\Sigma$, where
\sigma_{11} = \sigma_y^2,
\sigma_{12} = \sigma_{21} = \sigma_{yx},
\sigma_{22} = \sigma_x^2,
\sigma_{13} = \sigma_{31} = 2\mu_y\sigma_y^2,
\sigma_{23} = \sigma_{32} = 2\mu_y\sigma_{yx},
\sigma_{33} = 2\sigma_y^4 + 4\mu_y^2\sigma_y^2,
\sigma_{14} = \sigma_{41} = 2\mu_x\sigma_{yx},
\sigma_{24} = \sigma_{42} = 2\mu_x\sigma_x^2,
\sigma_{34} = \sigma_{43} = 2\sigma_{yx}^2 + 4\mu_y\mu_x\sigma_{yx},
\sigma_{44} = 2\sigma_x^4 + 4\mu_x^2\sigma_x^2,
\sigma_{15} = \sigma_{51} = \mu_x\sigma_y^2 + \mu_y\sigma_{yx},
\sigma_{25} = \sigma_{52} = \mu_y\sigma_x^2 + \mu_x\sigma_{yx},
\sigma_{35} = \sigma_{53} = 2\sigma_{yx}\sigma_y^2 + 2\mu_y^2\sigma_{yx} + 2\mu_y\mu_x\sigma_y^2,
\sigma_{45} = \sigma_{54} = 2\sigma_{yx}\sigma_x^2 + 2\mu_x^2\sigma_{yx} + 2\mu_y\mu_x\sigma_x^2,
and
\sigma_{55} = \sigma_y^2\sigma_x^2 + \mu_y^2\sigma_x^2 + \mu_x^2\sigma_y^2 + \sigma_{yx}^2 + 2\mu_y\mu_x\sigma_{yx}.
It follows from the delta method, or from the theory of functions of asymptotically normal vectors (Serfling 1980, Corollary 3.3), that $Z$ is asymptotically normal with mean $\frac{1}{2}\ln\frac{1+\rho_c}{1-\rho_c}$ and variance $\frac{1}{n}d'\Sigma d$, where

d = (d_1, d_2, d_3, d_4, d_5)' = \left(\left.\frac{\partial g(m)}{\partial m_1}\right|_{m=\Theta}, \ldots, \left.\frac{\partial g(m)}{\partial m_5}\right|_{m=\Theta}\right)'.
d_1 = \frac{-2\mu_x}{\sigma_y^2 + \sigma_x^2 + 2\sigma_{yx} + (\mu_y - \mu_x)^2},

d_2 = \frac{-2\mu_y}{\sigma_y^2 + \sigma_x^2 + 2\sigma_{yx} + (\mu_y - \mu_x)^2},

d_3 = d_4 = \frac{-2\sigma_{yx}}{\left[\sigma_y^2 + \sigma_x^2 + 2\sigma_{yx} + (\mu_y - \mu_x)^2\right]\left[\sigma_y^2 + \sigma_x^2 - 2\sigma_{yx} + (\mu_y - \mu_x)^2\right]},

and

d_5 = \frac{2\left[(\sigma_y^2 + \sigma_x^2) + (\mu_y - \mu_x)^2\right]}{\left[\sigma_y^2 + \sigma_x^2 + 2\sigma_{yx} + (\mu_y - \mu_x)^2\right]\left[\sigma_y^2 + \sigma_x^2 - 2\sigma_{yx} + (\mu_y - \mu_x)^2\right]}.
After straightforward, albeit tedious, algebraic calculations, it can be shown that the variance of $Z$ is

\sigma_Z^2 = \frac{1}{n}d'\Sigma d = \frac{1}{n}\left[\frac{(1-\rho^2)\rho_c^2}{(1-\rho_c^2)\rho^2} + \frac{2\rho_c^3(1-\rho_c)u^2}{\rho(1-\rho_c^2)^2} - \frac{\rho_c^4 u^4}{2\rho^2(1-\rho_c^2)^2}\right].   (2.45)

The Z-transformed CCC estimate approaches normality much more rapidly, as confirmed by the Monte Carlo experiment in Lin (1989). When $\rho_c = \rho$ and $u = 0$, (2.45) degenerates into $\frac{1}{n}$, which is the variance of the Z transformation of the precision estimate.
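A quick Monte Carlo sketch (reusing the ccc.est helper from the sketch in Section 2.3.2) illustrates this rapid approach to normality:

    set.seed(2)
    zs <- replicate(2000, {
      x  <- rnorm(50)
      y  <- 0.1 + 0.95 * x + rnorm(50, sd = 0.3)
      rc <- ccc.est(y, x)
      0.5 * log((1 + rc) / (1 - rc))   # Z-transformed CCC estimate
    })
    qqnorm(zs)                         # a near-linear plot supports normality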
2.9.2 MSD Estimate

From (2.43), we can write the natural log transformation of the MSD estimate, or $W = \ln(e^2)$, as

W = \ln(m_3 + m_4 - 2m_5).

By the delta method, $W$ is asymptotically normal with mean $\ln(\varepsilon^2)$ and variance $\frac{1}{n}d'\Sigma d$, where

\Sigma = \begin{pmatrix} \sigma_{33} & \sigma_{34} & \sigma_{35} \\ & \sigma_{44} & \sigma_{45} \\ & & \sigma_{55} \end{pmatrix}

and

d = \left(\left.\frac{\partial g(m)}{\partial m_3}\right|_{m=\Theta},\ \left.\frac{\partial g(m)}{\partial m_4}\right|_{m=\Theta},\ \left.\frac{\partial g(m)}{\partial m_5}\right|_{m=\Theta}\right)' = \left(\frac{1}{\varepsilon^2},\ \frac{1}{\varepsilon^2},\ \frac{-2}{\varepsilon^2}\right)'.
2.9.3 CP Estimate

This proof can be seen in Lin, Hedayat, Sinha, and Yang (2002). Here, we use a different approach to demonstrating the delta method. We can use a first-order approximation to compute the mean and variance of the CP estimate $p_{\delta_0}$:

p_{\delta_0} = \Phi(\delta_-) - \Phi(-\delta_+) - \frac{\bar{X}_d - \mu_d}{\sigma_d}\left[\phi(\delta_-) - \phi(\delta_+)\right] - \frac{s_d - \sigma_d}{\sigma_d}\left[\delta_-\phi(\delta_-) + \delta_+\phi(\delta_+)\right] + O\!\left[(\bar{X}_d - \mu_d)^2\right] + O\!\left[(s_d - \sigma_d)^2\right] + O\!\left[(\bar{X}_d - \mu_d)(s_d - \sigma_d)\right],   (2.47)

with $\delta_+$ and $\delta_-$ as in (2.15) and (2.16), where $\bar{X}_d$ and $s_d$ are the sample mean and standard deviation of $D = Y - X$. Taking the expectation and variance of (2.47) yields $E(p_{\delta_0}) = \pi_{\delta_0} + O(1/n)$ and

\sigma_{p_{\delta_0}}^2 = \frac{1}{n}\left\{\left[\phi(\delta_-) - \phi(\delta_+)\right]^2 + \frac{1}{2}\left[\delta_-\phi(\delta_-) + \delta_+\phi(\delta_+)\right]^2\right\} + O\!\left(\frac{1}{n^2}\right).   (2.48)

Because CP is bounded by 0 and 1, it is better to use the logit transformation for statistical inference. Let $T = \ln\frac{p_{\delta_0}}{1 - p_{\delta_0}}$. Then the asymptotic mean of $T$ is $\ln\frac{\pi_{\delta_0}}{1 - \pi_{\delta_0}}$, and the asymptotic variance is

\sigma_T^2 = \frac{\sigma_{p_{\delta_0}}^2}{\pi_{\delta_0}^2(1 - \pi_{\delta_0})^2}.
2.9.4 Accuracy Estimate

This proof can be seen in Robieson (1999). This estimate, $c_a$, does not involve $m_5$ in (2.43). We redefine the new set of $m$ vectors, which are mostly uncorrelated, as

m = (m_1, m_2, m_3, m_4)' = (\bar{Y},\ \bar{X},\ s_y^2,\ s_x^2)',

which has an asymptotic 4-variate normality with mean $E(m) = (\mu_y, \mu_x, \sigma_y^2, \sigma_x^2)'$ and variance $\frac{1}{n}\Sigma$, where $\Sigma = \{\sigma_{ij}\}_{4\times4}$. Here,

\sigma_{11} = \sigma_y^2,
\sigma_{12} = \sigma_{21} = \sigma_{yx},
\sigma_{22} = \sigma_x^2,
\sigma_{13} = \sigma_{14} = \sigma_{23} = \sigma_{24} = \sigma_{31} = \sigma_{41} = \sigma_{32} = \sigma_{42} = 0,
\sigma_{33} = 2\sigma_y^4,
\sigma_{34} = \sigma_{43} = 2\sigma_{yx}^2,
and
\sigma_{44} = 2\sigma_x^4.

The logit-transformed accuracy estimate is

L = g(m_1, m_2, m_3, m_4) = \ln\left[\frac{2\sqrt{m_3 m_4}}{m_3 + m_4 + (m_1 - m_2)^2 - 2\sqrt{m_3 m_4}}\right],

and is asymptotically normal with mean $\ln\frac{\chi_a}{1-\chi_a}$ and variance $\frac{1}{n}d'\Sigma d$, where

d = (d_1, d_2, d_3, d_4)' = \left(\left.\frac{\partial g(m)}{\partial m_1}\right|_{m=\Theta}, \ldots, \left.\frac{\partial g(m)}{\partial m_4}\right|_{m=\Theta}\right)',

with

d_1 = -d_2 = \frac{-(\mu_y - \mu_x)\chi_a}{\sigma_y\sigma_x(1 - \chi_a)},

d_3 = \frac{1}{2\sigma_y^2(1 - \chi_a)} - \frac{\chi_a}{2\sigma_y\sigma_x(1 - \chi_a)},

and

d_4 = \frac{1}{2\sigma_x^2(1 - \chi_a)} - \frac{\chi_a}{2\sigma_y\sigma_x(1 - \chi_a)}.
The ZjX transformation of the CCC estimate can be expressed as ZjX D g.mjX /,
where
0
mjX D .mjX;1 ; mjX;2 ; mjX;3 /0 D Y ; b1 ; se2 (2.50)
and
" #
1 mjX;3 C m2jX;2 sx2 C sx2 C .mjX;1 X /2 C 2mjX;2 sx2
ZjX D g.mjX / D ln :
2 mjX;3 C m2jX;2 sx2 C sx2 C .mjX;1 X /2 2mjX;2 sx2
2.10 Proofs of Asymptotical Normality When Target Values Are Fixed 43
1 1CcjX
Here ZjX is asymptotically normal with mean 2 ln 1cjX and variance
1 0
n d jX † jX d jX , where
0 1
2
y .1 2 / 0 0
B C
† jX D B
@ 0 2
y .1 2 /=sx2 0 C:
A (2.51)
0 0 2 4
y .1 2 /2
c2 . y X /
djX;1 D ;
.1 c2 /2 y sx
c 1
djX;2 D 1 ;
.1 c2 /2 c
and
c2
djX;3 D :
2.1 c2 /2 y sx
From (2.50) we can write the natural logarithm of the MSD estimate, or WjX D
2
ln.ejX /, as
By the delta method, WjX is asymptotically normal with mean ln."2jX / and variance
1 0
n
d jX † jX d jX , where † jX was shown in (2.51), and
!0
2. y X/ 2.1 ˇ1 /sx2 1
d jX D 2
; ; 2 :
"jX "2jX "jX
44 2 Continuous Data
2.10.3 CP Estimate
The proof can be seen in Lin, Hedayat, Sinha, and Yang (2002). Recall that in the
regression model when target values are fixed, we assumed that eY has a normal
distribution with mean 0 and variance e2 . Under this setup, the coverage probability
of the i th observation is
1X
n
ı0 jX D ı i : (2.53)
n i D1 0
Suppose that we have a random sample f.Yi ; Xi / j i D .1; : : : ; n/g and that ˇ0 , ˇ1 ,
and e2 are estimated by b0 , b1 , and se2 . Then b0 and b1 are independent of se . An
estimate of ı0 i is
ı0 b0 .b1 1/xi ı0 b0 .b1 1/xi
p ı0 i D ˆ ˆ ;
se se
and an estimate of ı0 jX is
1X
n
pı0 jX D pı i :
n i D1 0
By the same method as shown in Section 2.9.3, it can be shown that
1
E.pı0 jX / D ı0 jX C O ;
n
and that the asymptotic variance of pı0 jX is
" #
1 C 2
.C 0 X C 1 / 2
C 2
1
pı0 jX D C C 2 CO 2 ;
2 0 2
n n2 n2 sx2 2n n
. y X /2
djX1 D ;
y sx .1 a /
2
sx ˇ1 sx2 ˇ1
djX 2 D C ;
y .1 a /2 y a .1 a /
2 2
and
1 1
djX 3 D C :
2 y sx .1 a / 2 2
y a .1 2a /
2 $2a .1 2 / C 12 .1 $a /2 .1 4 /
2
D :
LjX
n.1 a /2
H .n 1/.U3 U1 /
rc D D : (2.54)
G U1 C nU2 C .n 1/U3
They further showed that the Z-transformation of rc by the delta method has
asymptotic normal distribution with mean 12 tanh1 .c /, and variance
2
c2 2
2 HG
2
Z D 2
H
C G2 ; (2.55)
1 c2 H2 HG G
where
2
H D .n 1/2 ŒV .U3 / C V .U1 / 2cov.U3 ; U1 /;
2
G D .n 1/2 V .U3 / C V .U1 / C n2 V .U2 / C 2.n 1/cov.U3 ; U1 /
C2n.n 1/cov.U3 ; U2 / C 2ncov.U3 ; U2 /;
and
where
P P P
j '1ij j '2ij j '3ij
'1i D ; '2i D ; '3i D :
.n 1/ .n 1/ .n 1/
Then we have
4 X
V D .'i U /0 .'i U /:
n2 i
Barnhart and Williamson (2001) first proposed to use GEE for statistical estimation
and inference for CCC. They used three sets of GEE equations, one for estimating
means accounting for covariates, one for estimating variances without accounting
for covariates, and one for estimating the Z-transformed CCC. Variance–covariance
matrices of the above estimates can be estimated, and the delta method can be
applied to obtain the asymptotic normality of the Z-transformed CCC estimate.
Suppose we have (Y11 , Y21 ), (Y12 , Y22 ), : : : , (Y1n , Y2n ) random samples from
n samples or subjects. Let Yi be the 2 1 vector that contains the two readings
and let the 2 p matrix Xi be the corresponding p covariates for the i th sample,
where the first column of Xi is a vector of all ones representing an intercept term.
Let Yi D .Y1i ; Y2i /0 and ˇ be a 2 1 marginal parameter vector. The three GEE
equations are shown below.
In the first set of equations, the marginal mean vector of Yi is E.Yi / D i D
Xi ˇ, and the parameter estimates of ˇ are obtained by
X
n
Di0 Vi1 .Yi i .ˇ// D 0; (2.56)
1
where Di D d i
dˇ and Vi is the working covariance matrix for Yi (Zeger and Liang
1986).
Let 12 and 22 be variances of Y1 and Y2 . In the second set of equations, the
variances of Y1 and Y2 without accounting for covariates are estimated by
X
n
Fi0 Hi1 Yi2 i2 . 2 ; ˇ/ D 0; (2.57)
1
2 0
and D .1 ; 2 / . In solving these equations, the diagonal components of Hi are
2 2
Let i D E.Y1i Y2 i /0 and let Z be the Z-transformed CCC. In the third equation,
Z is estimated by
X
n
Ci Wi1 .Y1 Y2 i .Z; ˇ; 2 // D 0; (2.58)
1
where Ci D d i
dZ and Wi is the variance of i .
This GEE method and the U-statistic method yield the same CCC estimate as
proposed by Lin (1989), but the variances of the CCC estimate are slightly different
because these two methods do not assume normality, while the method by Lin
assumes normality in the computation of the variance of the CCC estimate.
Carrasco and Jover (2005) proposed to use the maximum likelihood (ML) or
restricted ML (RML) method through a mixed effect model. Robieson (1999) and
Carrasco and Jover (2005) showed that the CCC is a special form of ICC under the
mixed effect model of random subject effect with the variance ˛2 , residual effect
with the variance e2 , and fixed assay or rater effect with the mean square ˇ2 , when
2
ˇ is included in the denominator. Specifically, the CCC can be expressed as
2
c D ˛
: (2.59)
2
˛ C 2
e C 2
ˇ
In Section 3.1.3, we will revisit this coefficient in detail. Carrasco and Jover (2005)
proceeded to use the delta method for statistical inference after estimating the
variance–covariance matrix of the variance components through ML or RML.
This method does not yield the same CCC estimate as proposed by Lin (1989),
and the variance of the CCC estimate can be slightly different, because this method
assumes normality through MLE or RMLE.
Compared to CCC, there have been fewer contributions related to TDI and CP. Some
of the methods related to TDI and CP in the literature are pointed out in the last
paragraph of Section 2.13. Most of those articles use more complicated iterative
methods to fine-tune TDI and CP as well as their confidence intervals.
2.12 Discussion 49
2.12 Discussion
Three agreement statistics, MSD, TDI, and CP, are unscaled indices, which do not
depend on the between-subject variation. TDI and CP attempt to capture a large
proportion (CP) of observations that are within a certain deviation (TDI) from their
target values. We can compute CP for a given TDI, denoted by CPı0 , or compute TDI
for a given CP, denoted by TDI0 . When the error structure is proportional, we apply
a log transformation to the data, and the resulting TDI is then antilog transformed.
When we subtract 1.00 from this antilog-transformed value and multiply by 100,
it becomes a percent change (TDI%0 ) rather than an absolute difference (Lin
2000, 2003, Lin, Hedayat, Sinha, and Yang 2002). This means that 1000 % of
observations are within TDI%0 of the target values. TDI and CP offer the most
intuitively clear interpretation and have better power for statistical inference, yet
they do not have precision and accuracy components. Also, Lin, Hedayat, Sinha, and
Yang (2002) and Lin, Hedayat, and Wu (2007) used approximations and assumed
normality to perform estimations and statistical inferences for TDI and CP. When
the data are not normally or log-normally distributed, a transformation to bring the
data closer to normality might be necessary.
TDI and CP are mirrored statistics. The former requires a given coverage
probability to compute the absolute difference or percent change. The latter requires
a given absolute difference or percent change to compute the coverage probability.
The former has the advantage for the following reason:
• TDI can discriminate among assays with much better agreement than CP, because
in these cases, CP values are near one.
• When there is no hard allowance available, one can still compute TDI at CP D
0:8 or 0:9, but one cannot compute CP without a reasonably given TDI value.
Due to its equivalence to kappa and weighted kappa as well as its close tie to
ICC, the CCC is perhaps the most popular index among statisticians for assessing
agreement. The CCC and precision and accuracy coefficients are ICC-like (Lin,
Hedayat, and Wu 2007), and are scaled (relative) to the total variation, especially
the between-sample variation. This property is appealing when we wish to assess
agreement over the entire reasonable value range from normal to abnormal.
Comparisons among any of these three agreement statistics are valid only with
similar study ranges, which is proportional to the between-sample variation. When
the study range is fixed and meaningful, the CCC and precision and accuracy
coefficients offer meaningful geometric interpretations. It is important to report the
study range when reporting these statistics. Good agreement over a small range
50 2 Continuous Data
100 60
80
55
60
50
40
45
20
0 40
0 20 40 60 80 100 40 45 50 55 60
Fig. 2.13 Agreement over larger and shorter analytical ranges. (a) Larger analytical range.
(b) Shorter analytical range
Apart from CP, all of the agreement coefficients defined in this chapter have the
same coefficient estimates regardless whether the target values are assumed random
or fixed. The CP estimates under random and fixed target values are asymptotically
the same. The variance of each of these coefficient estimates under the random target
assumption is always larger than the variance under the fixed target assumption.
These are evident by comparing (2.3) and (2.4) for the log MSD (TDI), (2.13) and
(2.14) for the logit CP, (2.29) and (2.30) for the Z-transformed CCC, (2.31) and
(2.32) for the logit accuracy coefficient, and the formulas in the text after (2.32)
for the Z-transformed precision coefficient. Therefore, the confidence limits of an
agreement coefficient would be closer to its coefficient estimate under the fixed
target assumption than under the random target assumption.
There are two types of repeated-measures data for agreement assessment that we
often encounter. For one type, the between-sample variation forms the data range.
2.12 Discussion 51
For the other, the repeated measures form the data range. An example of the
first case is to have each subject’s blood pressures measured over time, which
is what is usually encountered in practice. Many tools are available for this type
of repeated measure, which can be found in the Section 2.13. An example of
the second case is to have each sample taken from a homogeneous population
and to perform serial dilutions that form the data range. Such serial dilutions are
uniform across all homogeneous samples. In this case, we could compute agreement
coefficient estimates for each sample, and treat these estimates as random samples
from a population. We then compute means and confidence limits based on the
respective transformations of the agreement statistics. Antitransformation of these
limits would be their respective confident limits. Such an approach is valid if we
don’t have missing data. We may also follow the more detailed approach proposed
by Chinchilli, Martel, Kumanyika, and Lloyd (1996).
The CCC and accuracy and precision coefficient estimates are in general quite robust
against moderate deviation from normality. If not, there are tools based on robust
estimates by King and Chinchilli (2001b) and based on a nonparametric approach
by Guo and Manatunga (2007).
The TDI and CP estimates are heavily dependent on the normality or log-
normality assumption. When there is evidence that data are not normally distributed
for the constant random error case, and not log-normally distributed for the
proportional random error case, data transformation might be necessary for the
robustness of TDI and CP estimates. In this case, see Lin and Vonesh (1989) for
the transformation approach that minimizes the MSD between the ordered observed
and theoretical quantiles.
In this book we deal only with no-missing-data cases. In this chapter and Chapter 3,
we discuss the case of paired assays or raters, and we often do not encounter a
large amount of missing data in practice. Therefore, deleting cases with missing
data is a reasonable approach. Missing data situation can sometimes be an issue
in practice as pointed out in Chapters 5 and 6 when we have multiple raters with
replicates. Research in the social and psychological sciences and in clinical trials
may often encounter missing data. Approaches that can handle missing data should
be an interesting area of research.
52 2 Continuous Data
Chapter 2 is based largely on the materials in Lin, Hedayat, Sinha, and Yang (2002).
For an earlier introduction of CCC and precision and accuracy coefficients, see Lin
(1989), and for TDI, see Lin (2000).
The method of Bland and Altman (1986) for assessing agreement uses a
meaningful graphical approach and computes the confidence limits from the paired
differences. Because of its simplicity, this approach has been quite popular among
medical researchers. This approach lacks a specific index to summarize the degree
of agreement, and thus statistical inferences about the estimate cannot be performed.
Bland and Altman later (1999) improved on their approach with statistical inference.
Their approaches are similar to our TDI. The major difference between our TDI and
their approaches is that we capture a majority of observations from their respective
individual target values, while their approaches capture the paired differences from
the mean of paired differences.
2.13 A Brief Tour of Related Publications 53
Absolute agreement indices for categorical data have rarely been presented in the
published literature. However, these can be very useful when we try to verify
agreement within a given experiment. For example, when replicates are taken per
rater per subject, we might want to compare whether one rater is better than another
in terms of within-rater precision. For another example, in the field of individual
bioequivalence, we might be interested in evaluating interdrug agreement relative
to intradrug agreement. These scenarios will be presented in Chapter 6, when we
concentrate on comparisons of absolute indices rather than of relative indices.
We have no knowledge whether MSD has been used for categorical data in the
literature. We defined MSD in Chapter 2 as "2 D E.Y X /2 . Based on agreement
probabilities presented in Table 3.1, MSD becomes
X
t X
t
"2 D .i j /2 ij ; i; j D 1; 2; : : : ; t: (3.1)
i j
When the outcome takes on binary scores, or t D 2, the MSD becomes 12 C 21 ,
or 1 .11 C 22 /. The MSD shown in (3.1) is the absolute or unscaled measure of
disagreement. The absolute measure of agreement, …0 , can be defined as a weighted
average of all probabilities,
X
t X
t
…0 D wij ij I (3.2)
i j
3.1 Basic Approach When Target Values Are Random 57
.i j /2
wij D 1 : (3.3)
.t 1/2
Equation (3.2) with the weight function (3.3) is actually a linear function of MSD
through the relationship
"2
…0 D 1 : (3.4)
.t 1/2
The weight function in (3.3) assigns the heaviest weight of 1 when i D j for the
main diagonal probabilities, and lower weights depending on the squared distance
from the main diagonal probabilities. We will discuss the weight function in greater
detail in Section 3.1.3. If we let
(
1 when i D j ;
wij D
0 otherwise,
P
then we have …0 D i i i , which had been commonly used as an agreement index
prior to the introduction of kappa by Cohen (1960). When the outcome takes on
binary scores, or t D 2, …0 becomes 11 C 22 .
The controversy associated with the absolute indices such as …0 is that a certain
amount of agreement is to be expected by chance alone. Even when both raters have
totally different metrics in judging the subjects, we would expect them to agree
with a certain probability …c by chance alone. It is natural to eliminate such chance
agreements. This is the idea behind Cohen’s kappa (1960).
To better illustrate the arguments made above, let us look at some numerical
examples when t D 2 and n D 100:
1. Perfect agreement: Raters 1 and 2 agree on every subject in the following table.
There is no controversy expected to conclude the perfect agreement between the
two raters.
Rater 2
Yes No Total
Yes 50 0 50
Rater 1 No 0 50 50
Total 50 50 100
time, in fact, there is no association between the raters in this example because
all their agreement can be attributed to chance.
Rater 2
Yes No Total
Yes 25 25 50
Rater 1 No 25 25 50
Total 50 50 100
Also note that in each of the above examples, the marginal distributions are the
same for the two raters. Just as for continuous data, the fact of identical marginal
distributions between the two raters does not necessarily imply agreement between
them.
Cohen (1960) proposed a measure called kappa as a chance-corrected agreement
index. This index depends strictly on the diagonal probabilities corrected by the
chance probabilities derived from the marginal probabilities.
Cohen (1968) later improved on the kappa coefficient by proposing weighted
kappa for data measured in ordinal scales. We will discuss weighted kappa first,
because kappa is a special case of weighted kappa. Weighted kappa is designed to
recognize that some disagreements between the two raters should be considered
more serious than others. For example, disagreement between “mild” and “life-
threatening” for a patient’s condition is more serious than disagreement between
“critical” and “life-threatening.” Therefore, it would be prudent to assign weights
to reflect the seriousness of disagreements among the rated conditions. In general,
for assessing agreement (disagreement), we would expect the weight to be greater
(smaller) for cells closer to (farther from) the main diagonal.
The weighted measure of agreement by chance becomes
X
t X
t
…c D wij i j ; (3.5)
i j
where i .j / represents the marginal probability by rater X (rater Y ). When the
outcome takes on binary scores, or t D 2, then …c reduces to 1 1 C 2 2 .
Finally, weighted kappa is defined as
…0 …c
w D : (3.6)
1 …c
Cohen requires these weights to satisfy the following conditions:
1. wij D 1 when i D j ,
2. 0 < wij < 1 when i ¤ j ,
3. wij D wji .
Cicchetti and Allison (1971) suggested the following set of weights:
ji j j
wij D 1 : (3.7)
t 1
3.1 Basic Approach When Target Values Are Random 59
Fleiss and Cohen (1973) suggested the squared weight function as shown in (3.3).
Weighted kappa based on the weight function in (3.7) is always less than or equal
to that based on the weight function in (3.3).
When
(
1 for i D j ,
wij D
0 for i ¤ j ,
Suppose we have two raters each of whom evaluates and assigns n subjects
independently to one of t categories. Let pij represent the proportion of subjects
assigned to category i by rater X and category j by rater Y , where i; j D 1; 2; : : : ; t.
Let pi .pj / represent the marginal proportion that a subject is assigned to category i
(category j ) by rater X (rater Y ), where i; j D 1; 2; : : : ; t. Then the expected values
of the above proportions are E.pij / D ij , E.pi / D i , and E.pj / D j . These
proportions are also the maximum likelihood estimators (MLEs) or sample moment
counterparts of the respective probabilities.
Fleiss, Cohen, and Everitt (1969) proposed to estimate weighted kappa using the
sample counterpart as
P0 Pc
O w D ; (3.8)
1 Pc
where
X t X t Xt X
t
P0 D wij pij ; Pc D wij pi pj :
i j i j
The weighted kappa estimate O w has an asymptotic normal distribution with mean
w and variance
PP 2 2
ij wij .wN i C wN j /.1 w / w …c .1 w /
i j
2
O w D ; (3.9)
n.1 …c /2
P P
where wN i D j j wij and w
N j D i i wij .
60 3 Categorical Data
O w 1:645sO w :
These Cohen’s kappa coefficients have been widely used and can be obtained from
the output of SAS procedure FREQ if option “agree” is used in the TABLES
statement.
Using the weight function defined in (3.3), we can verify that the weighted kappa is
actually equal to the CCC presented in Chapter 2. Recall that CCC was defined in
(2.24) as
"2
c D 1 2 :
" jD0
When D 0, MSD becomes
X
t X
t
"2 jD0 D .i j /2 i j ; i; j D 1; 2; : : : ; t: (3.10)
i j
For the agreement statistics presented in Sections 3.1.2 and 3.1.3, two raters need
to classify subjects or samples based on the same metrics. When the metrics are not
the same, or when the shifts in marginal distributions are negligible, then we are
dealing with association instead of agreement. Several statistical tools are available
for measuring and evaluating association, which are listed in Section 3.4. Popular
association measurements include the Pearson correlation coefficient, 2 statistic
for contingency tables, some forms of intraclass correlation coefficients (ICC), and
the log-linear modeling, among several others. If we use the Pearson correlation
coefficient for measuring association, we can regard it as a precision measurement
as in the CCC. Accuracy can be characterized by examining the differences in
the marginal distributions. Such statistical tools are also widely available in the
textbooks listed in Section 3.4. In this book we concentrate on the accuracy and
precision topics that mirror those of CCC in Section 2.3.
The equivalence of CCC and weighted kappa can shed new light on the way
we view the weighted kappa. We can break w into precision (Pearson correlation
coefficient, ) and accuracy (a ) coefficients in the same way we did for CCC.
That is,
w D a ; (3.12)
where
yx
D (3.13)
y x
and
2 y x
a D : (3.14)
2
y C 2
x C. y x/
2
Here,
X
t
x D i i ;
i
X
t
y D jj ;
j
!2
X
t X
t
2
x D i i
2
i i ;
i i
0 12
X
t X
t
2
y D j 2 j @ jj A ;
j j
62 3 Categorical Data
and
!0 t 1
X
t X
t X
t X
yx D ijij i i @ jj A :
i j i j
CCC is a special form of ICC (Robieson 1999, Carrasco and Jover 2003) when
the mean square of the difference among raters (fixed rater effect) is included in
the denominator. Therefore, CCC and weighted kappa can be estimated by variance
components with related statistical inferences through the delta method.
For simplicity, we first demonstrate the equivalence of ICC and CCC based on
the basic mixed effect model where each of n subjects is evaluated by two raters.
The value of Yij is the category that rater j had assigned to subject i . We will
discuss a more general form of ICC in Chapter 5, where we have k > 2 raters, each
evaluating n subjects with m > 1 readings.
The mixed effect model considered here is
where the overall mean is . The fixed rater effect is ˇj and sums to zero. The
random subject effect is ˛i with the variance ˛2 . The random error is eij with
variance e2 , and eij is assumed not to correlate with ˛i .
Traditionally, ICC in its original form is defined as the between-subject variance
˛ divided by the total variance of ˛ C e in the above model, disregarding the rater
2 2 2
cov.Yij ; Yi 0 j 0 / D 2
; if i D i 0 , j ¤ j 0 , (3.18)
ˆ ˛
:̂0; if i ¤ i 0 .
For j D 2, we have 2
˛ D 12 , or the covariance of the two raters, and
2
C 2
2
˛ C 2
e D 1 2
; (3.19)
2
where 2
j is the variance of the effect associated with rater j , j D 1; 2.
2
Let ˇ be the rater mean squares, which is
. 1 2/
2
2
ˇ D :
2
The CCC or weighted kappa becomes
2
2 12
c D w D D ˛
:
1 C 2 C. 1 C C
2 2 2 2 2 2
2/ ˛ e ˇ
Model (3.17) assumes equal variances between the two raters. It is interesting to note
that the CCC remains the same whether the variances are assumed equal or not. It is
evident from (3.19) that ˛2 C e2 is actually the average of the two rater variances.
However, the composition of accuracy and precision coefficients are slightly altered
under this model. We will discuss the ICC and CCC in greater detail in Chapter 5.
This example is taken from von Eye and Schuster (2000). Two psychiatrists
evaluated 129 patients who had previously been diagnosed as clinically depressed.
The rating categories were 0 for not depressed, 1 for mildly depressed, and 2 for
clinically depressed. Table 3.2 presents the rating results that the two psychiatrists
provided.
Table 3.3 presents the related agreement statistics and their confidence limits for
the data reported in Table 3.2. There is a moderate level of agreement (weighted
kappa D 0.4204 with a 95% lower confidence limit of 0.2737 by the squared weight
function) between the two psychiatrists with good accuracy and moderate precision.
64 3 Categorical Data
When the target values are fixed, investigators rarely examine results using a relative
index. We consider the example of a diagnostic test. We select n0 negative samples
and n1 positive samples to be tested as negative .0/ or positive .1/ by an instrument
or a rater. Here, n0 and n1 are known. Fleiss (1973) illustrated this type of data
in Section 1.2 of his book. Let us consider the 2 2 table presented in Table 3.4,
from n0 and n1 negative and positive samples for the assessment of sensitivity and
specificity in Section 3.2.1.
Sensitivity is the conditional probability of positive response given that the sample
is positive, P .Y D 1 j X D 1/, which can be estimated by pss D n11 =n1 . The
larger the value of P .Y D 1 j X D 1/, the more sensitive the test. Specificity is
the conditional probability of negative response given that the sample is negative,
P .Y D 0 j X D 0/, which can be estimated by psp D n00 =n0 . The larger the value
of P .Y D 0 j X D 0/, the more specific the test. These parameters can be directly
estimated from the 2 2 table presented in Table 3.4. Statistical inference related to
these parameters can simply be addressed by the inference on related proportions.
Variances of sensitivity and specificity can be estimated to be pss .1 pss /=n1 and
3.2 Basic Approaches When Target Values Are Fixed: Absolute Indices 65
psp .1 psp /=n0 , respectively. When n0 or n1 is large (say > 30), we can use normal
approximation to compute a 95% confidence interval as
s
pss .1 pss /
pss ˙ 1:96 (3.20)
n1
and s
psp .1 psp /
psp ˙ 1:96 ; (3.21)
n0
respectively. Here, we are often interested in only the one-tailed 95% lower
confidence limit.
When n0 or n1 is not large enough, we can use the binomial (exact) confidence
interval approach. We take the inverse of the binomial distribution with parameters
n0 or n1 , pss or psp , and with the probability 0.025 for the lower limit and 0.975
for the upper limit. For the one-tailed lower limit, we use the probability 0.05. We
then divide these limits by either n0 or n1 for the confidence limits on sensitivity
and specificity.
Epidemiologists distinguish two types of error rates. The false positive error rate,
F C , is defined as the proportion of negative samples among those tested positive.
The false negative error rate, F , is defined as the proportion of positive samples
among those tested negative. These error rates are typically defined through the
application of Bayes’s theorem by
P .Y D 1 j X D 0/P .X D 0/
F C D P .X D 0 j Y D 1/ D (3.22)
P .Y D 1/
and
P .Y D 0 j X D 1/P .X D 1/
F D P .X D 1 j Y D 0/ D : (3.23)
P .Y D 0/
For assay validation or diagnostic lab testing, there is little interest in estimating
these error rates in a population. Only the sensitivity and specificity values are of
interest here. However, in the diagnostic environment, it is common practice to
regard 1-sensitivity as the false negative error rate and 1-specificity as the false
positive error rate.
3.3 Discussion
We have examined the agreement statistics for ordinal and binary data when target
values are fixed or random. When target values are fixed, sensitivity and specificity
coefficients are commonly used under the fixed n0 and n1 samples. We rarely
conduct a study from n pairs of observations with fixed target values without fixing
the n0 and n1 samples.
3.3 Discussion 67
When target values are random, kappa and weighted kappa are very popular
indices for measuring agreement for binary and ordinal data. The weighted kappa
with the squared weight function is identical to the CCC introduced in Chapter 2.
In Chapter 5, we will show that using the GEE methodology, the confidence limits
of CCC and weighted kappa or kappa are identical when the Z-transformation is
not used. When data have a nominal scale such as “depression, personality disorder,
schizophrenia, and others,” use of the simple kappa with diagonal weights of one
and zero otherwise has been common practice. Fleiss (1971) introduced category-
specific kappa for nominal data, which is more in-depth with examination of the
agreement–disagreement matrix among the nominal categories.
The U-statistic approach in Section 2.11.1 is another novel approach to es-
timating weighted kappa with statistical inference. In this approach, the weight
function of Cicchetti and Allison (1971) given in (3.7) can be applied. This approach
allows for extension to the multiple-raters case. The GEE methodology proposed
by Barnhart and Williamson (2001) can also be applied to estimate weighted kappa
with the squared weight function with statistical inference. Both the U-statistic and
GEE methodologies are applicable when the target values are random.
CCC or weighted kappa is a special form of ICC when the mean square difference
among the fixed rater effect is included in the denominator, which can be estimated
by variance components with statistical inference through the delta method.
The data structure discussed in this chapter has focused on the most basic case,
namely, two raters evaluating each of n subjects only once. In Chapters 5 and 6,
we will present and discuss tools for cases of at least two raters evaluating each of
n subjects, and each could have m 1 replicates. Such tools can be used for both
categorical and continuous data when the target values are random.
In mental health studies, numerous instruments have been developed for the di-
agnosis of psychiatric disorders such as major depression, and there is considerable
interest in replacing one instrument by another instrument for reduction of cost, ease
of administration, and other considerations.
However, since these instruments are based on different questionnaires with
distinctive structures and point systems, they often have different scales. When
the scales of the instruments are different, the existing agreement methodology is
not applicable. For example, in a depression study, depression was measured by a
clinician-administered ordinal scale of no depression, mild depression, and severe
depression, and by a continuous scale of dimensional self-report. It remains un-
known whether the less-time-consuming self-report dimensional scale can replace
clinician-administered scale to determine the grade of the illness. This problem is
equivalent to assessing the extent to which the continuous scale can be interpreted
as the ordinal graded severity of depression.
Due to the different measurement scales, this question cannot be addressed in
the classical framework of agreement. Alternatively, we may perform a jackknife
(leaving one out) discriminant analysis to best classify the continuous scale against
the ordinal scale of no depression, mild depression, and severe depression. We can
then assess the weighted kappa of the classified scale based on the optimal cutoff
68 3 Categorical Data
For the assessment of bias among raters with respect to marginal distributions, the
literature includes McNemar’s test (1947), Cochran’s Q test (1950), Madansky’s
Q test (1963), and Friedman’s 2 test (1937). In addition, Fleiss and Everitt
(1971) proposed a method for comparing the marginal distributions of an agreement
table, and Koch, Landis, Freeman, Freeman Jr, and Lehnen (1977) proposed
marginal homogeneity tests within the multivariate categorical data. Darroch (1981)
introduced Mantel–Haenszel tests for marginal symmetry. Landis, Sharp, Kuritz,
and Koch (1998) proposed generalized Mantel–Haenszel tests for both nominal
and ordinal data scales. Such bias assessment can also be captured in the accuracy
coefficient, as shown in (3.14). These correspond to the accuracy component of
weighted kappa.
For the assessment of association, the Pearson correlation coefficient and the
usual 2 test of association have been available for many decades. Such association
assessment can also be captured in the precision coefficient, as shown in (3.13).
Birch (1964, 1965) proposed partial association under 2 2 and general cases. For
the assessment of reliability, the ICC in its original form (Fisher 1925) is the ratio of
between-sample variance and total (within plus between) variance under the model
of equal marginal distributions. This original ICC was intended to measure precision
only.
Several forms of ICC have evolved. In particular, Bartko (1966), Shrout and
Fleiss (1979), Fleiss (1986), and Brennan (2001) have put forth various reliability
assessments. Landis and Koch (1977b) introduced category-specific intraclass and
interclass correlation coefficients within a multivariate variance-components model
for categorical data that can accommodate unbalanced designs. These correspond to
the precision component of weighted kappa. In Chapter 5, we will address various
forms of ICCs, including the one that represents precision.
3.4 A Brief Tour of Related Publications 69
For the assessment of agreement, Cohen (1960) introduced the kappa coefficient
to measure agreement between two raters on a nominal categorical data scale,
followed by Cohen (1968) and Everitt (1968) each separately proposing a weighted
kappa coefficient. However, Fleiss noted that the proposed variance estimators for
both of these agreement measures were incorrect, and invited both Cohen and
Everitt to collaborate in publishing correct variances, which appeared in Fleiss,
Cohen, and Everitt (1969). Fleiss (1971) introduced category-specific kappa coeffi-
cients and generalized Cohen’s kappa to situations involving multiple observers and
multiple categories.
Combining these broad areas into a common estimation and hypothesis-testing
framework, Landis and Koch (1977a) proposed the multivariate categorical data
framework of Koch, Landis, Freeman, Freeman Jr, and Lehnen (1977) for the
analysis of repeated measurement designs. The focus of this framework was to
test first-order marginal homogeneity among multiple observers and to estimate
multiple correlated kappa coefficients and their associated estimated covariances
to facilitate ease of confidence interval construction. Cicchetti and Fleiss (1977)
studied the null distributions of weighted kappa and the C ordinal statistic. Fleiss
and Cuzick (1979) further proposed kappa for binary response when the number
of observers differs for each subject. Bloch and Kraemer (1989) discussed 2 2
kappa coefficients as measures of agreement or association in greater details.
Donner and Eliasziw (1992) proposed a goodness-of-fit approach to inference
procedures and sample size estimation for the kappa statistic. Shoukri and Martin
(1995) proposed MLE of the kappa coefficient from models of matched binary
responses. Williamson, Lipsitz, and Manatunga (2000) proposed modeling kappa
for measuring dependent categorical agreement data. Schuster (2001) used kappa as
a parameter of a symmetry model for rater agreement. Broemeling (2009) proposed
Bayesian methods for measure of agreement.
Another tool for assessing agreement is the log-linear model for rater agreement
of von Eye and Schuster (2000). There is a wealth of books (Landis and Koch
1977c; Haberman 1974, 1978; Goodman 1978; Haberman 1979; Fleiss, Levin, Paik,
and Wiley (1981); Aickin 1983; Freeman 1987; Cox and Snell 1989; Shoukri and
Edge 1996; Christensen 1997; Agresti 1990; Von Eye and Mun 2005, to name a
few) describing agreement assessment, including association, for categorical data.
Furthermore, Haber, Gao, and Barnhart (2007) introduced a method for assessing
observer agreement in studies involving replicated binary observations. Guo and
Manatunga (2009) proposed a method of measuring agreement of multivariate
discrete survival times using a modified weighted kappa coefficient. Yang and
Chinchilli (2009, 2011) proposed fixed-effects modeling of Cohen’s kappa for
bivariate multinomial data.
Chapter 4
Sample Size and Power
In this chapter, in addition to the formal method of computing the sample size and
power, we will discuss ways to compute sample size and power based on normal
approximation through a simplified and conservative approach. As pointed out
previously in assessing agreement, the traditional null and alternative hypotheses
should be reversed. The conventional rejection region is now the region of declaring
agreement, and is usually one-sided. Asymptotic power and sample size calculations
will proceed by this principle.
We will present the asymptotic power of accepting agreement and sample size
calculation whereby we utilize either MSD or CCC in accepting or rejecting
agreement. Inference based on approximated TDI or CP can be assessed through
MSD. Let be a transformed agreement index. For continuous data, we use the
Z-transformation for a CCC estimate, the logit transformation for a CP estimate,
and the natural log transformation for an MSD estimate. When is MSD, we declare
that two raters are in agreement under the alternative hypothesis H0 W < 0 , where
0 is a prespecified tolerable index value or the null value. Here, the null hypothesis
is H0 W 0 . When is CCC, we reverse the signs of the above hypotheses.
We compute the probability of declaring agreement under the alternative value 1 ,
which can be the ideal condition value or the available historical value. We refer to
this probability as power. Lin, Hedayat, Sinha, and Yang (2002) presented the power
and sample size computations that are described in the sequel.
Let 20 =nc and 21 =nc be variances of the respective estimates, where nc D n c,
with c D 2 for MSD and CCC, and c D 3 for CP. For the one-tailed fixed type-I
error ˛, the power for declaring agreement based on CCC and CP becomes
p
. 0 1 / C ˆ1 .1 ˛/0 = nc
P D1ˆ p ; (4.1)
1 = nc
where ˆ./ is the cumulative standard normal density function. For the one-tailed
fixed type-I and type-II errors ˛ and ˇ, the associated sample size becomes
2
ˆ1 .1 ˇ/1 C ˆ1 .1 ˛/0
nD C c: (4.3)
. 0 1/
For the case of two raters with a single reading per rater, we can simplify the
computations of the power and sample size using the upper bound of the variance
of each estimate of the agreement indices. According to (2.29) and (2.30), the
variance of the Z-transformed CCC estimate is less than or equal to 1=.n 2/.
Such approximation is almost exact when is close to zero and $ is close to one.
According to (2.3) and (2.4), the variance of the log-transformed MSD estimate is
less than or equal to 2=.n 2/. Such approximation is exact when is equal to zero.
The upper bound of the variance of transformed CP estimate remains unknown.
Because CP is a mirrored index of TDI, and the approximated CP can also be
computed from MSD, we need only calculate the sample size of approximated
TDI or CP, which can be derived from the upper bound of the variance of the
log-transformed MSD estimate. The sample size determination does not have to be
exact, as long as we stay on the conservative side. For categorical data, even though
the Z-transformation does not necessarily help, we can still utilize the simplification
for computing the sample size for the weighted kappa, since the variances of CCC
and weighted kappa are the same under the GEE methodology.
Suppose, as an example, that the available historical data indicate that CCC is 0.99
over a desirable data range, and we are willing to tolerate a CCC of 0.98. In this
case, the sample size would be
2
ˆ1 .1 ˇ/ C ˆ1 .1 ˛/
nD C 2;
tanh1 .0:99/ tanh1 .0:98/
which is equal to n D 53 for ˛ D 0:05 and 1 ˇ D 0:8, where tanh1 ./ is the
Z-transformation.
4.3 Examples Based on the Simplified Case 73
which is equal to n D 24 for ˛ D 0:05 and 1 ˇ D 0:8, where ln./ is the log
transformation.
Most frequently, we only have historical within sample variance ( e2 ) for data
with constant error or coefficient of variation (CV%) for data with proportional
error. We can easily translate these into TDI and CCC. Recall that in (2.24), CCC is
inversely related to the mean square of the ratio of the within-sample total deviation
and the total deviation. When the marginal distributions of two raters are identical,
the within-sample total deviation is e and the total deviation is the square root of
the sum of e2 and the between-sample variance. The data range is proportional to
the associated standard deviation, and assuming that our historical e was computed
based on a large sample size (say at least 100 observations), this proportionality
value is about 5 (Grant and Leavenworth 1972). p
For data with constant error, we can use ˆ1 Œ1.1/=2 2 e as our historical
TDI0 value, and use 1 .5 e2 /=dr2 as our historical CCC value over a desirable
data range dr . For data with proportional error, we can assume that the data are
log-normally distributed and use the relationship
! 2 D exp. 2
e/ 1; (4.4)
0:232, or TDI%0:9 of 26.1%. Assuming that the maximum value of the data is ten
times the minimum value, we have a historical CCC of 1 0:01= Œln.10/=52 D
0:953. If we are willing to allow for a 50% increase in e2 , and a ˇ2 that is half of
p
2
e , we have an allowable CV% of exp.0:01 1:5 C 0:01 0:5/ 1 D 14:2%,
or TDI0:9 of 0.328, or TDI%0:9 of 38.5%, and an allowable CCC of 0.906. In this
case, the sample size based on TDI would be
2
ˆ1 .1 ˇ/ C ˆ1 .1 ˛/
nD2 C 2;
ln.0:3282 / ln.0:2322/
74 4 Sample Size and Power
which is equal to n D 28 for ˛ D 0:05 and 1 ˇ D 0:8. The sample size based on
CCC would be
2
ˆ1 .1 ˇ/ C ˆ1 .1 ˛/
nD C 2;
tanh1 .0:953/ tanh1 .0:906/
where yij l stands for the lth reading from subject i given by rater j , with
i D 1; 2; : : : ; n, j D 1; 2; : : : ; k and l D 1; 2; : : : ; m. The readings can be conti-
nuous, binary, or ordinal. The overall mean is . The random subject effect, ˛i , has
equal second moments across all raters. The random rater and subject interaction
effect, ij , has equal second moments across all raters. The random error effect,
eij l , is uncorrelated with ˛i and ij . The fixed rater effect is ˇj , and we assume
Pk
j D1 ˇj D 0.
The random subject effect ˛i has mean 0 and variance ˛2 . The interaction effect
ij has mean 0 and variance 2 . The random error effect has mean 0 and variance e2 .
Based on model (5.1) and balanced data, variance components can be expressed as
follows. First, the variance of the random error effect is defined as
Pn Pk 2
i D1 j D1 ij
2
e D ; (5.2)
nk
where jj 0 l l 0 is the covariance of yij l and yij 0 l 0 among different raters and replicates.
The variance of the interaction effect is defined as
2
D A C B C D; (5.4)
where
Pk Pm 2
j D1 lD1 jl
AD ; (5.5)
m2 k
2
and jl is the variance of yij l for rater j and replicate l,
Pk Pm1 Pm
2 j D1 lD1 l 0 DlC1 jl l0
BD ; (5.6)
m2 k
where jl l0 is the covariance of yij l and yij l 0 between replicates for rater j ,
C D 2
˛; (5.7)
and
2
DD e
: (5.8)
m
5.2 Intrarater Precision 77
D2 2
e: (5.10)
For any rater j , the intrarater precision for continuous and categorical data
between any two replications, l and l 0 , is defined as
where var./ and cov./ represent the variance and covariance functions,
respectively.
Under model (5.1) in terms of variance components defined in Section 5.1, the
CCCintra becomes
˛ C
2 2
c;intra D intra D 2 : (5.14)
˛ C C e
2 2
78 5 A Unified Model for Continuous and Categorical Data
The CCCintra measures the proportion of the variance that is attributable to the
subjects. Based on model (5.1), this proportion is the same for all k raters.
Furthermore, the means of replicates within a rater are assumed equal under
model (5.1). Therefore, the intrarater agreement CCCintra equals intra , the intrarater
precision coefficient with the accuracy coefficient of one, and (5.11) and (5.12) are
exact, not approximate. This relative agreement index is heavily dependent on the
total variability (total data range).
Since there are m replicated readings for subject i given by rater j , the average of
those m readings could be used to measure the interrater agreement. We use yNij to
denote the average of m readings from subject i given by rater j .
The MSD between any two raters j and j 0 , MSDinterjj 0 , becomes
"2interjj 0 D E .yNij yNij 0 /2
2
D .ˇj ˇj 0 /2 C 2 2
C e
: (5.15)
m
Across k fixed raters, MSDinter is the average of k.k 1/=2 MSDinterjj 0 indices
2 X
k1 X k
"2inter D "2interjj 0
k.k 1/ j D1 0
j Dj C1
2
D2 2
ˇ C 2
C e
: (5.16)
m
In this chapter we use the approximated inter and total CP based on (2.22)
because the exact CP based on (2.10) would be complicated for model (5.1).
For normally distributed data, TDIinter.0 / and CPinter.ı0 / can be approximated by
r
: 1 1 0 2
ıinter.0 / Dˆ 1 2 2
ˇ C2 2
C2 e
(5.17)
2 m
and
2 0 13
: 6 B ı0 C7
inter.ı0 / D 1 2 41 ˆ @ q A5 : (5.18)
2 2
ˇ C2 2
C2 2
e =m
5.3 Interrater Agreement 79
The TDI and CP approximations are good when the relative biased squared (RBS)
value is reasonable (see Section 2.2.2). Otherwise, the approximation will be
conservative when 0 > 0:9. Here, the RBS is defined as:
2
ˇ
inter D : (5.19)
2
C 2
e =m
"2inter
c;inter D 1 : (5.20)
"2interjyN D0
i1 ;yNi 2 ;:::;yNi k
Under model (5.1) in terms of variance components defined in Section 5.1, the
CCCinter becomes
2
c;inter D ˛
2
: (5.21)
2
˛ C 2
C e
m
C 2
ˇ
Since readings from different raters have different expected means, we further define
that the interrater agreement CCCinter consists of two parts: interrater precision and
interrater accuracy coefficients. The interrater precision coefficient becomes
cov.yNij ; yNij 0 / 2
inter D p p D ˛
: (5.22)
var.yNij / var.yNij 0 /
2
2
˛ C 2
C e
m
a;inter D 2
: (5.23)
2
˛ C 2
C e
m
C 2
ˇ
Here, c;inter is the product of inter and a;inter . The accuracy index measures how
close the means of raters are. In model (5.1), variances are assumed to be the same
for different raters, and consequently they are not present in the accuracy index.
Therefore, the definition of accuracy is slightly modified compared to that originally
defined by Lin (1989). The interrater agreement is measured based on the average
of m readings made by each rater. Therefore, the agreement indices depend on the
number of replications (m).
The approach proposed by Barnhart, Song, and Haber (2005) allows for different
variances among raters, and is a measure based on the true readings, ij , from each
rater and subject. Therefore, their interrater CCC does not depend on the number of
replications. In addition, the interrater CCC from Barnhart, Song, and Haber (2005)
equals the limit of our CCCinter as the number of replications m goes to infinity.
80 5 A Unified Model for Continuous and Categorical Data
Since there are m replicated readings for subject i given by rater j , the interrater
agreement could be based on any one of the m replicated readings. Total agreement
is such a measure of agreement based on any individual reading from each rater.
The MSD between any two raters j and j 0 , MSDtotaljj 0 , becomes
"2totaljj 0 D E .yij l yij 0 l 0 /2
D .ˇj ˇj 0 /2 C 2 2 C e2 : (5.24)
Across k fixed raters, MSDtotal is the average of k.k 1/=2 MSDtotaljj 0 indices
2 X
k1 X k
"2total D "2totaljj 0
k.k 1/ j D1 0
j Dj C1
D2 2
ˇ C 2
C 2
e : (5.25)
The TDI and CP approximations are adequate when the RBS is reasonable.
Otherwise, the approximations will be conservative when 0 > 0:9. Here, the RBS
is defined as
2
ˇ
total D : (5.28)
2
C 2
e
Under model (5.1) in terms of variance components defined in Section 5.1, the
CCCtotal becomes
2
c;total D ˛
: (5.30)
2
˛ C 2
C 2
e C 2
ˇ
5.6 Asymptotic Normality 81
cov.yij l ; yij 0 l 0 /
total D p p
var.yij l / var.yij 0 l 0 /
2
D ˛
(5.31)
2
˛ C 2
C 2
e
and
2
˛ C 2
C 2
e
a;total D : (5.32)
2
˛ C 2
C 2
e C 2
ˇ
Again,
c;total D total a;total :
Similar to Section 2.5, when the residual standard deviation becomes proportional
to the measurement, we apply the natural log transformation to the data and then
compute the agreement statistics. The TDI%0 is defined as
We use yNj l to denote the average of n readings from rater j given by replicate l.
The rater means and all variance components, 1 , 2 , : : :, k , ˇ2 , ˛2 , 2 , e2 are
estimated through the GEE methodology according to the following system of
equations:
X
n
Fi0 Hi 1 ŒQi ‚ D 0: (5.34)
i D1
82 5 A Unified Model for Continuous and Categorical Data
Here,
0 1 1
.yi11 C yi12 C C yi1m /
B m C
B : C
B : C
B : C
B C
B C
B 1 C
B .yij1 C yij 2 C C yij m / C
B m C
B C
B : C
B :
: C
B C
B C
B 1 C
B .yik1 C yik2 C C yikm / C
B m C
Qi D B
B
C;
C
B 1 Pk1 Pk C
B .yNij yNij 0 /2 C
B 0
k.k 1/ j D1 j Dj C1 C
B C
B C
B 4 Pk1 Pk Pm1 Pm C
B .yij l yNj l /.yij 0 l 0 yNj 0 l 0 / C
B 0 0
m2 k.k 1/ j D1 j Dj C1 lD1 l DlC1 C
B C
B P
m C
B C
B 1 Pk lD1 .yij l y
Nij /2 C
B C
B k j D1
.m 1/ C
B C
@ A
1 Pk Pm 2 Pk Pm1 Pm
.yij l yNj l /2 C 2 0 .yij l yNj l /.yij l 0 yNj l 0 /
m2 k j D1 lD1 m k j D1 lD1 l DlC1
0 1
1
B :: C
B : C
B C
B C
B j C
B :: C
B C
B : C
B C
B k C
DB C:
B 2 C
B 2C
ˇ C C C
2 e
B
B m C
B 2 C
B ˛ C
B C
B 2 C
@ e A
2
˛ C m C
2 e 2
The working covariance matrix for Qi (Zeger and Liang 1986) is conveniently set
as a diagonal matrix (Barnhart and Williamson 2001) given by
5.6 Asymptotic Normality 83
0 1
a1
B :: C
B : C
B C
B 0 C
B aj C
B :: C
B : C
B C
Hi D diag.var.Qi // D B C;
B ak C
B C
B d C
B C
B 0 e C
B C
@ f A
g
k.k 1/.3k 2/
D 4
e C k.k 1/.3k 2/ 4
C 8k.k 1/ 2 2
ˇ
m2
8k.k 1/ 4k 2 .k 1/
C 2 2
ˇ e C 2 2
e; (5.36)
m m
8 9
< 4 X
k1 X k X X
m1 m
=
e D var .yij l N
y j l /.yij 0l 0 y
N j 0l0 /
: m2 k.k 1/ ;
j D1 0 0
j Dj C1 lD1 l DlC1
8 9
<1 X k Pm 2 =
.y
lD1 ij l yN ij /
f D var
:k .m 1/ ;
j D1
2k
D 4
e; (5.38)
m1
84 5 A Unified Model for Continuous and Categorical Data
and
gD
2 3
1 X k Xm
2 X X X
k m1 m
var 4 2 .yij l yNj l /2 C 2 .yij l yNj l /.yij l 0 yNj l 0 /5
m k j D1 m k j D1 0
lD1 lD1 l DlC1
m2 k 2 .3m 1/ 4
D km.m 1/.2m 3/ C m.m 1/.k 1/=2 C ˛
2
3k mk mk.m 3/ 4
C m2 k 1 CmC C km.m 1/.2m 3/ 4 C e
2 2 2
C 2m2 k.2 k/ C mk.m 1/.mk C 5m 4/ ˛2 2
mk.m 1/ 2 2
C mk.4 mk C .m 1/ C
2
˛ e
2
.m 1/.mk C 4/ 2 2
Cmk 4 C .m 1/ mk C
2
e: (5.39)
2
Finally,
@‚ 1k 044
Fi D D ; (5.40)
@. 1 ; : : : ; k ; ˇ2 ; 2 2 2
˛; e ; / 044 f 44
where 0 1
1 0 1=m 1
B0 1 0 0C
f 44 DB
@0
C:
0 1 0A
0 1 1=m 1
Note that the working covariance matrix is derived assuming normality. We obtain
the estimates of the means and variance components through (5.34), which are their
respective sample counterparts given in Qi , and estimate their variance–covariance
matrix through the GEE methodology.
The variance–covariance matrix for Qi is
1 10
var.Qi / D D †D 1 ; (5.41)
n
where
X
n
DD Fi 0 Hi 1 Fi
i D1
5.6 Asymptotic Normality 85
and
X
n
†D Fi 0 Hi 1 .Qi ‚/ .Qi ‚/0 Hi 1 Fi :
i D1
"
4 var. e2 /
var.O"2inter / D var. 2
ˇ/ C var. 2
/ C
n m2
2 2
#
2cov. 2 2
ˇ; e / 2cov. e; /
C2cov. 2 2
ˇ; / C C ; (5.43)
m m
and
4h
var.O"2total / D var. 2
ˇ/ C var. 2
/ C var. 2
e/
n
i
C2cov. 2 2
ˇ; / C 2cov. 2 2
ˇ; e / C 2cov. 2 2
e; / : (5.44)
When estimating the variances of TDIs, we use the log transformation based on
MSD, W./ D ln."2./ /. The transformed variance for w is
var."2./ /
var.W./ / D :
"4./
Therefore, we have
b intra / D var. e2 /
var.W ; (5.45)
n. e2 /2
var. 2
ˇ/ C var. 2
/ C var. 2
e /=m
2
b inter / D
var.W
n. 2
ˇ C 2
C 2
e =m/
2
cov. 2 2
ˇ; / C cov. ˇ ; e /=m C cov. e ; /=m
2 2 2 2
C ; (5.46)
ˇ C C e =m/
2 2 2 2
n.
86 5 A Unified Model for Continuous and Categorical Data
and
var. 2
ˇ/ C var. 2
/ C var. 2
e/
b total / D
var.W
n. 2
ˇ C 2
C 2 2
e/
cov. 2 2
ˇ; / C cov. ˇ ; e / C cov. e ; /
2 2 2 2
C : (5.47)
n. 2
ˇ C C e/
2 2 2
The variances of TDIintra , TDIinter , and TDItotal are given in (5.45), (5.46), and (5.47)
divided by 4, respectively.
For CP indices, we use the asymptotic variance based on the transformed variable
from (5.12), (5.18), and (5.27), and we have
ı02
!2
"2./
e ı2 var."2./ /
var.O ./ı0 / D 1 C 20 ; (5.48)
n "./ 8"2./ ı02
.1 intra /2 nh i
var.Oc;intra / D var.Ointra / D var. ˛2 / C var. 2 / C 2cov. ˛2 ; 2
/
n. ˛ C C e /
2 2 2 2
h io
Cintra
2
var. e2 / 2.1 intra /intra cov. ˛2 ; e2 / C cov. e2 ; 2 / ; (5.49)
1 n h
var.Oc;inter / D .1 inter /2 var. 2
˛/ C inter
2
var. 2
ˇ/
n. 2
˛ C 2
ˇ C 2
C 2
e =m/
2
2 2
#
2cov. 2 2
var. e2 / ˇ; e/ 2cov. e; /
C C var. 2
/ C 2cov. 2
ˇ;
2
/ C C
m2 m m
2 2
cov. ˛; e/
2.1 inter /inter cov. 2
ˇ;
2
˛/ C cov. 2
˛;
2
/ C ; (5.50)
m
1 n h
var.Oc;total / D .1 total /2 var. 2
˛/ C total
2
var. 2
ˇ/ C var. 2
e/
n. 2
˛ C 2
ˇ C 2
C 2 2
e/
i
C var. 2
/ C 2cov. 2
ˇ;
2
/ C 2cov. 2
ˇ;
2
e/ C 2cov. 2
e;
2
/
h io
2.1 total /total cov. 2
ˇ;
2
˛/ C cov. 2
˛;
2
/ C cov. 2
˛;
2
e/ ; (5.51)
1 var. e2 /
var.Ointer / D .1 inter /2 var. 2
˛/ C inter
2
C var. 2
/
n. ˛2 C 2 C 2
e =m/
2 m2
# )
2 2
2cov. e; / cov. 2
˛;
2
e/
C 2.1 inter /inter cov. 2
˛;
2
/ C ; (5.52)
m m
5.7 The Case m D 1 87
1 n h
var.Ototal / D .1 total /2 var. 2
˛/ C total
2
var. 2
e/ C var. 2
/
n. 2
˛ C 2
C 2 2
e/
i h io
C2cov. 2
e;
2
/ 2.1 total /total cov. ˛2 ; 2
/ C cov. 2
˛;
2
e/ ; (5.53)
( "
1 var. e2 /
var.O a;inter / D .1 a;inter /2 var. 2
˛/ C var. 2
/ C
n. 2
˛ C 2
ˇ C 2
C 2
e =m/
2 m2
#
2 2
2cov. ˛2 ; 2
e/
2cov. ; e/
C C 2cov. 2
˛;
2
/ C C 2a;inter var. 2
ˇ/
m m
" 2 2
#)
cov. ˇ; e/
2.1 a;inter /a;inter cov. 2
˛;
2
ˇ/ C C cov. 2
ˇ;
2
/ ; (5.54)
m
and
1 n h
var.O a;total / D .1 a;total /2 var. 2
˛/ C var. 2
/ C var. 2
e/
n. ˛2 C ˇ2 C 2
C 2 2
e/
i
C2cov. 2
˛; C 2cov. ˛2 ;
2
e/
2
/ C 2cov. 2
;
2
e/ C 2a;total var. 2
ˇ/
h io
2.1 a;total /a;total cov. 2
˛;
2
ˇ/ C cov. 2
ˇ;
2
e/ C cov. 2
ˇ;
2
/ : (5.55)
the index being Oc;intra , Oc;inter , Oc;total , Ointer , or Ototal . When estimating the variances
of accuracy coefficients or CP estimates for continuous data, we use the logit
transformation. The transformed variance of an estimate of accuracy coefficient
var.index/
or CP is var.index/ D .index/.1index/ with the index being O a;inter , O a;total , O intra.•0 / ,
O inter.ı0 / , or O total.ı0 / . When computing confidence limits of the above agreement
indices for continuous data, we would compute the limit based on the respective
transformation, and then antitransform the limit.
For the cases with m D 1, the interaction effect between rater and subject in model
(5.1), ij , cannot be separated from the error effect. Thus the model reduces to
where each term follows the same distribution as that specified in model (5.1). The
variance components are simplified to
2 X
k1 X k
2
D jj 0 I (5.57)
˛
k.k 1/ j D1 0
j Dj C1
Pk
j Dj 0 D1 jj 0
2
e D ˛I
2
(5.58)
k
Pk1 Pk
j D1 j 0 Dj C1 . j j0/
2
2
D : (5.59)
ˇ
k.k 1/
Accordingly, the MSD, TDI, and CP for continuous data become
"2 D 2 C 2 ˇ2 ;
2
e (5.60)
: 1 1 0 q
ı0 Dˆ 1 2 2
ˇ C2 2
e; (5.61)
2
and
2 0 13
: 6 B ı0 C7
ı0 D 1 2 41 ˆ @ q A5 : (5.62)
2 2
ˇ C2 2
e
"2
c D 1
"2jy
i1 ;yi 2 ;:::;yi k D0
2
D ˛
2
˛ C 2
e C 2
ˇ
2
D ˛
Pk1 Pk
2
˛ C 2
e C 1
k.k1/ j D1 j 0 Dj C1 . j j0/
2
2 Pk1 Pk
k.k1/ j D1 j 0 Dj C1 jj 0
D Pk Pk1 Pk
1
k j D1
2
j C 1
k.k1/ j D1 j 0 Dj C1 . j j0/
2
Pk1 Pk
2 j D1 j 0 Dj C1 jj 0
D Pk Pk1 Pk : (5.63)
.k 1/ j D1
2
j C j D1 j 0 Dj C1 . j j0/
2
The above equations show that each of the four variance components can be
expressed as functions of variances and pairwise covariances. Thus even though
in the proposed unified approach, we assume the homogeneity of all variances,
the CCC defined in (5.63) remains the same as the overall concordance correlation
coefficient (OCCC) proposed by Lin (1989), King and Chinchilli (2001a, 2001b),
and Barnhart, Haber, and Song (2002), where they did not assume the homogeneity
of all variances.
When we use the GEE methodology to estimate all variance components, the
estimating system of equations given in (5.34) are simplified to
0 1
yi1
B :: C
B : C
B C
B C
B yij C
B :: C
B C
B : C
Qi D B
B yi k
C;
C
B C
B Pk1 Pk C
B 1
j 0 Dj C1 .yij yij 0 /2 C
B k.k1/ j D1 C
B Pk C
B 1
j D1 .yij yNj /2 C
@ k A
Pk Pk
2
k.k1/ j D1 j 0 Dj C1 .yij yNj /.yij l 0 yNj 0 /
0 1
1
B :: C
B : C
B C
B C
B j C
B :: C
B : C
B C
DB C;
B k C
B C
B 2C
ˇ C e C
2
B
B C
B 2C
˛ C e A
2
@
2
˛
0 1
. 2
˛ C 2
e /I k 0 0 0
B 2 2
eC ˇ e
4 C
B 0 0 0 C
B k.k1/ C
Hi D B
B 2k ˛ C e C2 ˛ e
4 4 2 2
C;
C
B 0 0 0 C
@ k
A
˛ C2k e C2 ˛ e
2k.k1/ 4 4 2 2
0 0 0 k.k1/
90 5 A Unified Model for Continuous and Categorical Data
and 0 1
1k 0 0 0
@‚ B0 1 0 1C
Fi D DB
@0
C:
@. 1 ; : : : ; k ; ˇ2 ; ˛2 ; e2 / 0 1 1A
0 0 1 0
The variances of the estimates of agreement indices are
4h i
var.O"2 / D var. ˇ2 / C var. e2 / C 2cov. ˇ2 ; e2 / ; (5.64)
n
2 var. 2
ˇ/ C var. 2
e/ C 2cov. 2 2
ˇ; e /
b / D var.O" / D
var.W ; (5.65)
"4 n. 2
e C 2 2
ˇ/
ı02
2
e "2 ı2 var."2 /
var.O ı0 / D 1 C 02 ; (5.66)
n " 8"2 ı02
1 n
var.Oc / D .1 /2 var. 2
˛/ C 2 Œvar. 2
ˇ/ C var. 2
e/
n. ˛2 C ˇ2 C e2 /2
o
C2cov. 2 2
ˇ ; e / 2.1 /Œcov. 2 2
ˇ; ˛/ C cov. 2 2
˛ ; e / ; (5.67)
1
O D
var./ .1 /2 var. 2
˛/ C 2 var. 2
e/ 2.1 /cov. 2 2
˛; e / ;
n. 2
˛ C 2 2
e/
(5.68)
and
1 n
var.O a / D .1 a /2 Œvar. 2
˛/ C var. 2
e/ C 2cov. 2 2
˛ ; e /
n. 2
˛ C 2
ˇ C 2 2
e/
o
C2a var. 2
ˇ/ 2a .1 a /Œcov. 2 2
ˇ; ˛/ C cov. 2 2
ˇ ; e / : (5.69)
Again, we use the respective transformations for statistical inference for continuous
data.
Estimates of the agreement indices shown in this chapter are based on the GEE methodology. We have discussed the approaches of Barnhart, Song, and Haber (2005) throughout this chapter. Carrasco and Jover (2003) proposed another important method using maximum likelihood (ML) or restricted maximum likelihood (REML) estimation based on a mixed effect model. In Section 2.11.3 we presented this model for k = 2; it is also applicable for k > 2. For normally distributed data,
the estimates from Lin, Hedayat, and Wu (2007) have been shown to be very close to
the estimates obtained in Carrasco and Jover (2003). However, the unified approach
that we proposed has two advantages: First, since we use the GEE methodology
to estimate all variance components and means, our method can handle not only
normally distributed data, but also data from the exponential family including the
binomial and multinomial distributions. Second, our approach is expected to be
robust against a moderate deviation from normally distributed data.
King and Chinchilli (2001a) proposed an important method of estimation and statistical inference for the generalized CCC for continuous and categorical data based on the U-statistic. The case k = 2 was presented in Section 2.11.1. We present the case k > 2 below.
Let q = 1, 2, ..., k-1 and r = 2, ..., k index the pairwise combinations of the k raters. Then the generalized CCC can be expressed as
$$\rho_{Ng}=\frac{\sum_{q<r}E_{F_{X_q}F_{X_r}}\big[g(X_q-X_r)-g(X_q+X_r)\big]-\sum_{q<r}E_{F_{X_qX_r}}\big[g(X_q-X_r)-g(X_q+X_r)\big]}{\sum_{q<r}E_{F_{X_q}F_{X_r}}\big[g(X_q-X_r)-g(X_q+X_r)\big]+\frac12\sum_{q<r}E_{F_{X_qX_r}}\big[g(2X_q)+g(2X_r)\big]},$$
with estimator
$$\hat\rho_{Ng}=\frac{(n-1)\sum_{q<r}(U_{3qr}-U_{1qr})}{\sum_{q<r}U_{1qr}+n\sum_{q<r}U_{2qr}+(n-1)\sum_{q<r}U_{3qr}}.$$
Since the sum of U-statistics is a U-statistic, the above notation can be simplified as
$$\hat\rho_{Ng}=\frac{(n-1)(U_{3s}-U_{1s})}{U_{1s}+nU_{2s}+(n-1)U_{3s}},$$
where $U_{ls}=\sum_{q<r}U_{lqr}$, l = 1, 2, 3, the sum being taken over all distinct pairs. The same approach as found in Section 2.11.1 can be applied by replacing U1, U2, and U3 with U1s, U2s, and U3s, respectively. Then $\hat\rho_{Ng}$ has an asymptotically normal distribution with mean $\rho_{Ng}$ and a variance that can be consistently estimated with
$$\widehat{\operatorname{var}}(\hat\rho_{Ng})\doteq(\hat\rho_{Ng})^2\left[\frac{\operatorname{var}(H_s)}{H_s^2}-\frac{2\operatorname{cov}(H_s,G_s)}{H_sG_s}+\frac{\operatorname{var}(G_s)}{G_s^2}\right],$$
where $G_s$ and $H_s$ denote the numerator and denominator statistics of $\hat\rho_{Ng}$, respectively.
For ordinal and binary data, when k = 2 and m = 1, the above GEE estimates of CCC reduce to kappa (Cohen 1960) and weighted kappa (Cohen 1968) with squared distance function (Robieson 1999; King and Chinchilli 2001a, 2001b; Barnhart, Haber, and Song 2002). In addition, its variances reduce to the variances of kappa and weighted kappa (Wu 2005). The proof of this equivalence is shown below.
When k = 2 and m = 1 with t ordinal outcomes, Fleiss, Cohen, and Everitt (1969) introduced the asymptotic variance for the estimated weighted kappa, $\hat k_w$, as
$$\operatorname{var}(\hat k_w)=\frac{1}{n(1-\Pi_c)^4}\left(\sum_{i=1}^{t}\sum_{j=1}^{t}A_{ij}-B\right),\qquad(5.70)$$
where
$$A_{ij}=\pi_{ij}\big[w_{ij}(1-\Pi_c)-(\bar w_{i\cdot}+\bar w_{\cdot j})(1-\Pi_o)\big]^2,\qquad(5.71)$$
$$\bar w_{i\cdot}=\sum_{j=1}^{t}w_{ij}\pi_{\cdot j},\qquad(5.72)$$
$$\bar w_{\cdot j}=\sum_{i=1}^{t}w_{ij}\pi_{i\cdot},\qquad(5.73)$$
and
$$B=\big(\Pi_o\Pi_c-2\Pi_c+\Pi_o\big)^2.\qquad(5.74)$$
In order to compare the variance of the weighted kappa to that of the CCC, we let
$$\nu_{l0}=\sum_{i=1}^{t}i^{\,l}\pi_{i\cdot},\qquad l=1,2,3,4,$$
$$\nu_{0l}=\sum_{j=1}^{t}j^{\,l}\pi_{\cdot j},\qquad l=1,2,3,4,$$
$$\nu_{ll'}=\sum_{i=1}^{t}\sum_{j=1}^{t}i^{\,l}j^{\,l'}\pi_{ij},\qquad l=1,2,3\ \text{and}\ l'=1,2,3,$$
$$1-\Pi_o=\frac{1}{(t-1)^2}\big(\nu_{20}-2\nu_{11}+\nu_{02}\big),\qquad(5.75)$$
and
$$1-\Pi_c=\frac{1}{(t-1)^2}\big(\nu_{20}-2\nu_{10}\nu_{01}+\nu_{02}\big).\qquad(5.76)$$
For the squared weight function $w_{ij}$, where
$$w_{ij}=1-\frac{(i-j)^2}{(t-1)^2},\qquad i,j=1,2,\dots,t,$$
we have
$$\bar w_{i\cdot}=\sum_{j=1}^{t}w_{ij}\pi_{\cdot j}=1-\frac{1}{(t-1)^2}\big(i^2-2i\nu_{01}+\nu_{02}\big),\qquad(5.77)$$
and
$$\bar w_{\cdot j}=\sum_{i=1}^{t}w_{ij}\pi_{i\cdot}=1-\frac{1}{(t-1)^2}\big(j^2-2j\nu_{10}+\nu_{20}\big).\qquad(5.78)$$
Substituting these quantities, the variance of the weighted kappa becomes
$$\operatorname{var}(\hat k_w)=\frac{A_1+A_2-A_3-B_1}{n},\qquad(5.79)$$
where
$$A_2=\frac{1}{(\nu_{20}+\nu_{02}-2\nu_{10}\nu_{01})^4}\Big[4(t-1)^4(\nu_{20}+\nu_{02}-2\nu_{11})^2+\big(\nu_{40}+\nu_{04}+3(\nu_{20}+\nu_{02})^2\big)(\nu_{20}+\nu_{02}-2\nu_{11})^2$$
$$\qquad+\big(4\nu_{10}^2\nu_{02}+4\nu_{01}^2\nu_{20}-4\nu_{10}\nu_{03}-4\nu_{01}\nu_{30}+2\nu_{22}-4\nu_{10}\nu_{21}\big)(\nu_{20}+\nu_{02}-2\nu_{11})^2$$
$$\qquad+\big(4\nu_{01}\nu_{12}+8\nu_{10}\nu_{01}\nu_{11}-8\nu_{10}\nu_{01}\nu_{02}-8\nu_{10}\nu_{20}\nu_{01}\big)(\nu_{20}+\nu_{02}-2\nu_{11})^2$$
$$\qquad+8(t-1)^2(\nu_{20}+\nu_{02}-2\nu_{11})^2(\nu_{20}+\nu_{02}-2\nu_{10}\nu_{01})\Big],\qquad(5.81)$$
$$A_3=2(t-1)^4\frac{\nu_{20}+\nu_{02}-2\nu_{11}}{(\nu_{20}+\nu_{02}-2\nu_{10}\nu_{01})^3}\Big[2(t-1)^2(\nu_{20}+\nu_{02}-2\nu_{11})-2(t-1)^2(\nu_{20}+\nu_{02}-2\nu_{10}\nu_{01})$$
$$\qquad+\nu_{40}+\nu_{04}-2\nu_{10}\nu_{03}-2\nu_{01}\nu_{30}-2\nu_{31}-2\nu_{13}+(\nu_{20}+\nu_{02})^2+2\nu_{22}-2\nu_{10}\nu_{21}-2\nu_{01}\nu_{12}$$
$$\qquad+4\nu_{01}\nu_{21}+4\nu_{10}\nu_{12}-2\nu_{20}\nu_{11}-2\nu_{02}\nu_{11}\Big],\qquad(5.82)$$
and
$$B_1=\frac{(t-1)^8}{(\nu_{20}+\nu_{02}-2\nu_{10}\nu_{01})^4}\Bigg\{\frac{\big[(t-1)^2-(\nu_{20}+\nu_{02}-2\nu_{11})\big]^2\big[(t-1)^2-(\nu_{20}+\nu_{02}-2\nu_{10}\nu_{01})\big]^2}{(t-1)^4}$$
$$\qquad-\frac{2\big[(t-1)^2-(\nu_{20}+\nu_{02}-2\nu_{10}\nu_{01})\big]\big[(t-1)^2-(\nu_{20}+\nu_{02}-2\nu_{11})\big]}{(t-1)^2(t-1)^2}\Bigg\}.\qquad(5.83)$$
The corresponding variance of the CCC estimate is
$$\operatorname{var}(\hat\rho_c)=\frac{4}{n(\nu_{20}+\nu_{02}-2\nu_{10}\nu_{01})^4}\big(A+B+C+D-E-F\big),\qquad(5.84)$$
where
$$A=\big(\nu_{01}^2\nu_{20}+2\nu_{10}\nu_{01}\nu_{11}+\nu_{10}^2\nu_{02}-4\nu_{10}^2\nu_{01}^2\big)(\nu_{20}+\nu_{02}-2\nu_{11})^2,\qquad(5.85)$$
$$B=\big(\nu_{40}+2\nu_{22}+\nu_{04}-(\nu_{20}+\nu_{02})^2\big)(\nu_{11}-\nu_{10}\nu_{01})^2,\qquad(5.86)$$
$$C=\big(\nu_{22}-\nu_{11}^2\big)(\nu_{20}+\nu_{02}-2\nu_{10}\nu_{01})^2,\qquad(5.87)$$
$$D=2(\nu_{11}-\nu_{10}\nu_{01})\big[\nu_{01}(\nu_{30}-\nu_{10}\nu_{20})+\nu_{10}(\nu_{03}-\nu_{01}\nu_{02})+\nu_{01}(\nu_{12}-\nu_{10}\nu_{02})+\nu_{10}(\nu_{21}-\nu_{01}\nu_{20})\big](\nu_{20}+\nu_{02}-2\nu_{11}),\qquad(5.88)$$
$$E=2(\nu_{20}+\nu_{02}-2\nu_{11})(\nu_{20}+\nu_{02}-2\nu_{10}\nu_{01})\big(\nu_{01}\nu_{21}-2\nu_{10}\nu_{01}\nu_{11}+\nu_{10}\nu_{12}\big),\qquad(5.89)$$
and
$$F=2\big(\nu_{31}+\nu_{13}-\nu_{11}\nu_{20}-\nu_{11}\nu_{02}\big)(\nu_{11}-\nu_{10}\nu_{01})(\nu_{20}+\nu_{02}-2\nu_{10}\nu_{01}).\qquad(5.90)$$
We have shown that when k = 2 and m = 1 and the data are ordinal, the CCC without the Z-transformation is exactly the same as the weighted kappa with the squared weight function, in both estimation and statistical inference. For binary data, the weighted kappa reduces to kappa. Therefore, our approach naturally extends kappa and weighted kappa to k > 2 and m > 1. In addition, our approach provides precision and accuracy coefficients for categorical data.
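To make the equivalence tangible, here is a minimal R sketch (ours, not the book's macro; the function name is our own) of weighted kappa with the squared weight function, computed from a t x t table of joint classification probabilities:

# Sketch (ours): weighted kappa with squared weights from a t x t matrix P of
# joint probabilities (entries sum to 1). For k = 2, m = 1 ordinal data this
# equals the CCC estimate without the Z-transformation, per the text.
weighted_kappa_sq <- function(P) {
  t <- nrow(P)
  w <- 1 - (outer(1:t, 1:t, "-")^2) / (t - 1)^2      # squared weight function
  po <- sum(w * P)                                   # weighted observed agreement
  pc <- sum(w * outer(rowSums(P), colSums(P)))       # weighted chance agreement
  (po - pc) / (1 - pc)
}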
5.8 Summary of Simulation Results

In order to evaluate the performance of the GEE methodology for estimation and inference of the proposed indices, and to compare the proposed indices against other existing methods, simulation studies were designed and conducted for different types of data: binary, ordinal, and normal. For each of the three types of data, we considered three cases: k = 2 and m = 1; k = 4 and m = 1; and k = 2 and m = 3. For each case, we generated 1000 random samples of size 20 each. For binary and ordinal data, we considered two situations: inferences obtained through transformations (Z-transformations for the CCC and precision indices, logit transformation for accuracy indices) and inferences obtained without transformations. For normal data, we considered only inferences obtained through transformations. In addition to the above transformations, we considered the logit transformation for CP and the log transformation for TDI for normal data.
For binary data, our estimates are very close to their corresponding theoretical values, and the means of the estimated standard deviations are very close to the corresponding standard deviations of the estimates. Therefore, our estimates are sufficiently good for binary data with or without transformation. When m = 1, we also compared our CCC estimates to those obtained from the method of Carrasco and Jover (2003). Our standard error estimates are superior to theirs regardless of whether a transformation is used. The estimates obtained with transformation were comparable to the estimates obtained without transformation. Therefore, we suggest that for binary data, the use of a transformation is acceptable but not necessary.
For ordinal data, the means of the estimates are very close to the theoretical values, and the means of the estimated standard errors are very close to the corresponding standard deviations of the estimates. As with binary data, we also calculated the CCC estimates from the method of Carrasco and Jover (2003) for the cases with m = 1. The estimates from the two methods are very close to each other regardless of whether a transformation is used. Therefore, we conclude that for ordinal data, the use of a transformation is acceptable but not necessary. Surprisingly, the method of Carrasco and Jover (2003) performs as well as ours for ordinal data, even though their model assumes normality.
For normal data, our estimates match the respective theoretical values very well. The means of the estimated standard errors are very close to the corresponding standard deviations of the estimates. For the cases with m = 1, our CCCs are very close to those obtained from the method of Carrasco and Jover (2003). For the detailed simulation results, see Lin, Hedayat, and Wu (2007). Based on the simulation results, we conclude that our method works well for binary, ordinal, and normal data, in both estimation and the corresponding statistical inference.
5.9 Examples
Example 5.9.1. This example assesses the agreement between the two methods, HemoCue and Sigma. It was given by Lin, Hedayat, Sinha, and Yang (2002) and Lin (2003), where the averages of the replicated readings were used, and is presented in Example 2.8.1.
Figures 5.1–5.3 plot the data for this example: the HemoCue method, measurement 1 vs. measurement 2; the Sigma method, measurement 1 vs. measurement 2; and the average of the HemoCue method vs. the average of the Sigma method. The plots indicate that the errors are rather constant across the data range. Therefore, no log transformation was applied to the data.
In terms of the TDI and CP indices, the least-acceptable agreement is defined as having at least 90% of paired observations over the entire range within 75 mg/dL of each other if the observations are from the same method, and within 150 mg/dL of each other if the observations are from different methods, based on the average of each method. In terms of the CCC indices, the least-acceptable agreement is defined as a within-sample total deviation of not more than 7.5% of the total deviation if observations are from the same method, and a within-sample total deviation of not more than 15% of the total deviation if observations are from different methods. These translate into a least-acceptable CCCintra of 0.9943 = 1 - 0.075², and a least-acceptable CCCinter of 0.9775 = 1 - 0.15².

The agreement statistics and their corresponding one-sided 95% lower or upper confidence limits are presented in Table 5.1. The CCCintra estimate is 0.9986, which means that for observations from the same method, the within-sample deviation is about 3.7% = √(1 - 0.9986) of the total deviation. The 95% lower confidence limit for CCCintra is 0.9983, which is greater than 0.9943. The CCCinter estimate is 0.9866, which means that for the average observations from different methods, the within-sample total deviation is about 11.6% = √(1 - 0.9866) of the total deviation. The 95% lower confidence limit for CCCinter is 0.9825, which is greater than 0.9775. The precisionintra estimate is 0.9986, with a one-sided lower confidence limit of 0.9983. The precisioninter estimate is 0.9866, with a one-sided lower confidence limit of 0.9825, and the accuracyinter estimate is 1.0000, with a one-sided lower confidence limit of 0.9987. The CCCtotal estimate is 0.9859, which means that for individual observations from different methods, the within-sample total deviation is about 11.87% = √(1 - 0.9859) of the total deviation. The 95% lower confidence limit for CCCtotal is 0.9818. The precisiontotal estimate is 0.9860, with a one-sided lower confidence limit of 0.9818, and the accuracytotal estimate is 1.0000, with a one-sided lower confidence limit of 0.9987.

Fig. 5.3 HemoCue method's average measurement vs. Sigma method's average measurement

Table 5.1 Agreement statistics and their confidence limits for Example 5.9.1

Type   Statistics       CCC     Precision    Accuracy     TDI0.9   CP_TDI^a   RBS^b
                                coefficient  coefficient
Intra  Estimate         0.9986  0.9986       .            41.1     0.9973     .
       95% Conf. limit  0.9983  0.9983       .            46.2     0.9949     .
       Allowance        0.9943  0.9943       .            75       0.9000     .
Inter  Estimate         0.9866  0.9866       1            127.3    0.9474     0
       95% Conf. limit  0.9825  0.9825       0.9987       145.9    0.9228     .
       Allowance        0.9775  .            .            150      0.9000     .
Total  Estimate         0.9859  0.9860       1            130.5    0.9412     0
       95% Conf. limit  0.9818  0.9818       0.9987       148.9    0.9160     .
       Allowance        0.9775  .            .            150      0.9000     .

For k = 2, n = 299, and m = 2.
^a This is the CP given the TDI allowances of 75 mg/dL or 150 mg/dL.
^b The relative bias squared (RBS) must be less than 1 or 8 for the CP criterion of 0.9 or 0.8, respectively, in order for the approximated TDI and CP to be valid.

The TDIintra(0.9) estimate is 41.1 mg/dL, which means that 90% of the readings are within 41.1 mg/dL of their replicate readings from the same method. The one-sided upper confidence limit for TDIintra(0.9) is 46.2 mg/dL, which is less than 75 mg/dL. The TDIinter(0.9) estimate is 127.3 mg/dL, which means that, based on the average readings, 90% of the HemoCue readings are within 127.3 mg/dL of the Sigma readings. The one-sided upper confidence limit for TDIinter(0.9) is 145.9 mg/dL, which is slightly less than 150 mg/dL. The TDItotal(0.9) estimate is 130.5 mg/dL, with a one-sided upper confidence limit of 148.9 mg/dL, which is also slightly less than 150 mg/dL.

Finally, the CPintra(75) estimate is 0.9973, which means that 99.7% of HemoCue observations are within 75 mg/dL of their duplicate values from the same method. The one-sided lower confidence limit for CPintra(75) is 0.9949, which is larger than 0.9. The CPinter(150) estimate is 0.9474, which means that 94.7% of HemoCue readings are within 150 mg/dL of the Sigma readings based on the average of each method. The one-sided lower confidence limit for CPinter(150) is 0.9228, which is larger than 0.9. The CPtotal(150) estimate is 0.9412, which means that 94.1% of HemoCue observations are within 150 mg/dL of Sigma observations based on individual readings. The one-sided lower confidence limit for CPtotal(150) is 0.9160.

The agreement between the HemoCue method and the Sigma method is acceptable, with excellent accuracy and adequate precision, and with accuracy slightly better than precision.
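As a small numeric check, the least-acceptable CCC allowances quoted above follow directly from the stated percent-deviation limits; a one-line R sketch (ours):

# Sketch (ours): translating a within-sample deviation allowance (as a
# fraction of total deviation) into the least-acceptable CCC.
ccc_allowance <- function(rel_dev) 1 - rel_dev^2
ccc_allowance(0.075)  # 0.9943... (intra)
ccc_allowance(0.15)   # 0.9775 (inter)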
Example 5.9.2. This example can be found in Lin, Hedayat, and Wu (2007). In this example, we consider the hemagglutinin inhibition (HAI) assay for antibody to influenza A (H3N2) in rabbit serum samples from two different labs. Serum samples from 64 rabbits were measured twice by each method. Antibody level was classified as negative, positive, or highly positive (too numerous to count).
Tables 5.2–5.5 present the frequency tables for within-lab and between-lab readings. Tables 5.2 and 5.3 present the frequency tables of the first reading vs. the second reading from each lab. Table 5.4 presents the frequency table of the first reading from one lab vs. the first reading from the other lab. Table 5.5 presents the frequency table of the second reading from one lab vs. the second reading from the other lab. These tables suggest that the within-lab agreement is good but the between-lab agreement is not, and that lab 2 tends to report higher ratings than lab 1.
This is an imprecise assay with ordinal responses, and therefore we allow for less-demanding agreement criteria. In terms of CCC indices, agreement was defined as a within-sample total deviation of not more than 50% of the total deviation if observations are from the same method, and a within-sample total deviation of not more than 75% of the total deviation if observations are from different methods. This translates into a least-acceptable CCCintra of 0.75 = 1 - 0.5², and a least-acceptable CCCinter of 0.4375 = 1 - 0.75².
The estimates of agreement statistics and their corresponding one-sided 95% lower confidence limits are presented in Table 5.6. The CCCintra is estimated to be 0.8836, with a one-sided 95% lower confidence limit of 0.8109, which is greater than the allowance of 0.75, so the within-lab agreement is acceptable. The CCCinter estimate of 0.3723 has a one-sided 95% lower confidence limit of 0.2448, which falls below the allowance of 0.4375, so the between-lab agreement is not acceptable.
Table 5.6 Agreement statistics and their confidence limits for Example 5.9.2

Type   Statistics       CCC     Precision coefficient  Accuracy coefficient
Intra  Estimate         0.8836  0.8836                 .
       95% Conf. limit  0.8109  0.8109                 .
       Allowance        0.7500  0.7500                 .
Inter  Estimate         0.3723  0.5679                 0.6554
       95% Conf. limit  0.2448  0.4571                 0.5383
       Allowance        0.4375  .                      .
Total  Estimate         0.3578  0.5349                 0.6688
       95% Conf. limit  0.2335  0.4216                 0.5570
       Allowance        .       .                      .

For k = 2, n = 64, and m = 2.
Example 5.9.3. This example was obtained from Professor Philip Schluter, head of research, School of Public Health and Psychosocial Studies at AUT University, New Zealand. The leader of the project is Dr. Andrew McLennan, Sydney Ultrasound for Women, Sydney, Australia, and Royal North Shore Hospital, Sydney, Australia. This example will be published separately by McLennan and colleagues.
Several recent studies have demonstrated that the nasal bone (NB) is sonographically "absent" in a large proportion of fetuses affected by Down syndrome at 11–13 weeks gestation. The purpose of this study was to demonstrate that the nasal bone can be accurately assessed and used for population screening in an Australian obstetric population at 11–13 weeks gestation.
There were 20 operators (accredited and experienced in nuchal translucency imaging), each of whom supplied 20 NB images. The images were assessed for the presence (1) or absence (0) of the NB. Three examiners assessed each of the 400 images
assessments. Tables 5.7–5.12 present the frequency tables among three examiners
and their duplicate readings. It appears that within-examiner has slightly better
agreement than between-examiner, as expected, and there is little difference among
the marginal distributions of three examiners (good accuracy).
This is an imprecise assay with binary responses, and therefore we allow for less-demanding agreement criteria. In terms of CCC indices, agreement was defined as a within-sample total deviation of no more than 60% of the total deviation if observations are from the same method based on the average of duplicate readings, and a within-sample total deviation of no more than 70% of the total deviation if observations are from different methods. This translates into a least-acceptable CCCintra of 0.64 = 1 - 0.6², and a least-acceptable CCCinter of 0.51 = 1 - 0.7².
Table 5.13 Agreement statistics and their confidence limits for Example 5.9.3

Type   Statistics       CCC     Precision coefficient  Accuracy coefficient
Intra  Estimate         0.7047  0.7047                 .
       95% Conf. limit  0.6558  0.6558                 .
       Allowance        0.6400  0.6400                 .
Inter  Estimate         0.6369  0.6442                 0.9886
       95% Conf. limit  0.5779  0.5867                 0.9811
       Allowance        0.5100  .                      .
Total  Estimate         0.5438  0.5491                 0.9902
       95% Conf. limit  0.4852  0.4913                 0.9839
       Allowance        .       .                      .

For k = 3, n = 400, and m = 2.
Table 5.13 presents the estimates of the agreement statistics and their corresponding one-sided 95% lower confidence limits. The CCCintra is estimated to be 0.7047, which means that for observations from the same method, the within-sample deviation is about 54.3% = √(1 - 0.7047) of the total deviation. The 95% lower confidence limit for CCCintra is 0.6558, which is better than 0.64. The CCCinter is estimated to be 0.6369, which means that for the average observations from different methods, the within-sample deviation is about 60.3% of the total deviation. The 95% lower confidence limit for CCCinter is 0.5779, which is better than 0.51. The precisioninter is estimated to be 0.6442, with a one-sided lower confidence limit of 0.5867, and the accuracyinter is estimated to be 0.9886, with a one-sided lower confidence limit of 0.9811. The CCCtotal is estimated to be 0.5438, which means that for individual observations from different methods, the within-sample deviation is about 67.5% of the total deviation. The 95% lower confidence limit for CCCtotal is 0.4852. The precisiontotal is estimated to be 0.5491, with a one-sided lower confidence limit of 0.4913, and the accuracytotal is estimated to be 0.9902, with a one-sided lower confidence limit of 0.9839. Overall, the agreement among the three examiners' readings and within examiners is marginally acceptable, with very good accuracy; most of the disagreement comes from imprecision rather than inaccuracy.
Example 5.9.4. This example is obtained from Table 1 of Bland and Altman (1999), which presents a set of systolic blood pressure data from a study in which simultaneous measurements were made by each of two experienced observers (denoted by J and R) using a sphygmomanometer (the gold standard) and by a semiautomatic blood pressure monitor (denoted by S). Three sets of readings were made in quick succession for each method (J1–J3, R1–R3, S1–S3). The purpose of the study was to evaluate whether the semiautomatic blood pressure monitor can replace the blood pressure apparatus
Fig. 5.4 Agreement between J and R based on their mean triplicate readings in log scale
Fig. 5.8 Agreement between S1 and J1 (reflecting total agreement) in log scale
Fig. 5.9 Agreement between S and J based on their mean triplicate readings (reflecting interagreement) in log scale
Table 5.14 Agreement statistics and their confidence limits for Example 5.9.4

Type   Statistics       CCC     Precision    Accuracy     TDI0.9   CP_TDI^a   RBS^a
                                coefficient  coefficient
Intra  Estimate         0.9383  0.9383       .            13.78    0.9798     .
       95% Conf. limit  0.9166  0.9166       .            15.46    0.9701     .
       Allowance        0.9000  0.9000       .            20       0.9000     .
Inter  Estimate         0.7253  0.8316       0.8721       33.05    0.8014     0.87
       95% Conf. limit  0.6044  0.7327       0.8132       41.34    0.7232     .
       Allowance        0.8000  .            .            25       0.9000     .
Total  Estimate         0.6991  0.7974       0.8767       35.58    0.8438     0.69
       95% Conf. limit  0.5822  0.7015       0.8203       43.51    0.7831     .
       Allowance        0.7000  .            .            30       0.9000     .

For k = 2, n = 85, and m = 3.
^a The relative bias squared (RBS) must be less than 1 or 8 for the CP criterion of 0.9 or 0.8, respectively, in order for the approximated TDI and CP to be valid.
deviation means that the result of the semiautomatic instrument can deviate by up to 52.2 mmHg with 95% confidence.
It is clear that the within-J precision (Fig. 5.6) is much better than the within-S precision (Fig. 5.7). To measure the agreement within each method, we can simply perform the agreement assessment among the triplicate readings one method at a time, with k = 3 and m = 1. However, this does not tell us whether within-J or within-R precision is significantly better than within-S precision. We will revisit this scenario in Chapter 6.
5.10 Discussion
[Summary-table residue: the chapter's summary table of agreement indices is not recoverable here. Its surviving fragments indicate that the exact CP_{δ0} is evaluated from a central chi-square distribution with one degree of freedom at δ0²/MSD for the intra, inter, and total versions of the MSD, and that Q = Φ^{-1}(1 - (1 - π)/2) denotes the inverse cumulative normal distribution.]

Each of the approaches in the previous chapters for assessing agreement becomes a special case of our approach. For continuous data, when m approaches
infinity, the proposed CCCinter reduces to that proposed by Barnhart, Song, and Haber (2005). When m = 1, the proposed CCC reduces to the CCC proposed by Carrasco and Jover (2003), which is the same as the OCCC proposed by Lin (1989), King and Chinchilli (2001a), and Barnhart, Haber, and Song (2002). Barnhart, Haber, and Song (2002) pointed out that the OCCC is actually a weighted average of pairwise CCC values. When k = 2 and m = 1, the proposed CCC reduces to the original CCC proposed by Lin (1989). For categorical data, when k = 2 and m = 1, the proposed CCC reduces to kappa for binary data and to weighted kappa with squared weights for ordinal data, in both estimation and statistical inference.
In addition, we decomposed the CCC into precision and accuracy components for a deeper understanding of the sources of disagreement. The concepts of accuracy and precision can also be applied to categorical data. For continuous data, the relative or scaled indices depend heavily on the total variability (total data range). Therefore, these indices are not comparable if the ranges of the data are not comparable. The same is true for categorical data when the data are heavily clustered into a single cell, for example, when evaluating agreement based on a low prevalence rate.
We also have proposed the absolute indices MSD, TDI, and CP, which are independent of the total data range. These absolute indices are easily comprehensible. However, they are valid only when the relative bias squared is small enough (Lin 2000, 2003; Lin, Hedayat, Sinha, and Yang 2002) and normality is assumed. We refer the reader to Section 2.12.7, most of which is applicable to this chapter as well. Subject-based covariates can be conveniently adjusted for in the model.
There are two aspects of this unified approach that can be extended and developed. First, for categorical and nonnormal continuous data, we may include link functions, such as the log and logit, in the GEE methodology. We expect the approach with a link function to be more robust across different types of data. Second, the current variance component functions are based on balanced data; therefore, we would have to delete samples or subjects with missing data. Approaches that can handle missing data should be an interesting area of research.
There are relatively few references related to this chapter in the medical and pharmaceutical literature other than those mentioned in the first paragraph of this chapter. See Barnhart, Haber, and Lin (2007) for an overview of assessing agreement with continuous measurements. Chen and Barnhart (2008) compared the ICC and CCC for assessing agreement for data without and with replications. Haber and Barnhart (2008) proposed a general approach to evaluating agreement between two observers or methods of measurement.
6 A Comparative Model for Continuous and Categorical Data
In Chapter 5, we provided statistical tools for assessing the intra-, inter-, and
total-rater agreement among all raters. In this chapter, we provide statistical tools
for comparing total-rater agreement to intrarater precision, and intrarater precision
among selected raters.
When multiple raters are available with replicates, we are often interested in whether the raters can be used interchangeably, that is, whether readings from different raters deviate from one another not much more than replicate readings deviate within raters, without any clinical or practical consequence. Here, we need to assume that the variation among replicates (intrarater variability) is acceptable. FDA's guideline (2001) (http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm070244.pdf) introduced a method for evaluating individual agreement between a test drug and a reference drug in the context of individual bioequivalence.
Barnhart, Kosinski, and Haber (2007) extended FDA’s approach in the case of
multiple raters. They proposed the individual equivalence coefficient (IEC) and the
coefficient of individual agreement (CIA) to compare the intrarater precision relative
to the total-rater agreement of one or multiple references, or when no reference is
available. They used the method of moments for estimation and nonparametric
bootstrapping for statistical inference. Our approaches allow users to explore total-
rater agreement relative to intrarater precision, whereby users can select raters of
interest to be evaluated. We also allow users to compare intrarater precision among
selected raters. We present such a general comparison model and propose the total–intra ratio (TIR) as well as the intra–intra ratio (IIR) to evaluate comparative agreement both when a reference exists and when it does not.
The TIR is a noninferiority assessment: the deviations among individual readings from different raters should not be worse, by more than a certain margin, than the deviations among replicated readings within raters. As a TIR example, to assess individual bioequivalence, the agreement between the test and reference compounds is assessed relative to the agreement within the reference compound. Although the TIR is equivalent to the IEC and the CIA, we have proposed an alternative statistical inference approach using the GEE methodology that works for both continuous and categorical data.
The IIR is a classical assessment of whether the precision of selected assays/raters is better than, equal to, or worse than that of other assays/raters. As an IIR example, in the medical-device environment, we often want to know whether the within-device precision of a newly developed device is better than, equal to, or worse than the within-device precision of the old device.
GEE methodology is used for estimation and statistical inference. Our approach
allows for selecting any subset of raters as test and reference raters. In addition,
we present assorted examples to demonstrate the flexibility of our approach. More
details about this chapter can be found in Lin, Hedayat, and Tang (2012).
Here $y_{ijl}$ represents the lth reading by rater j on subject i, with i = 1, 2, ..., n, j = 1, 2, ..., k, and l = 1, 2, ..., m; that is,
$$y_{ijl}=\tau_{ij}+e_{ijl}.\qquad(6.1)$$
The true reading of the jth rater on subject i is $\tau_{ij}$, and it is considered random because subjects are considered random, although the mean of each rater, $\mu_j$, is considered fixed. The residual random effect is $e_{ijl}$. We make the following assumptions:
$$E(\tau_{ij})=\mu_j,\quad \operatorname{var}(\tau_{ij})=\sigma_j^2,\quad \operatorname{corr}(\tau_{ij},\tau_{ij'})=\rho_{jj'},\quad E(e_{ijl})=0,\quad \operatorname{var}(e_{ijl})=\varsigma_j^2.$$
Here, we assume that the $e_{ijl}$ are uncorrelated, that $\tau_{ij}$ and $e_{ijl}$ are uncorrelated, and that replicates within a rater are interchangeable.
We use model (6.1) as opposed to model (5.1) because we allow the flexibility
of evaluating any subset of raters in this chapter. For continuous data, when the
within-sample error is proportional to the observed data values, we apply a log
transformation to the data. Based on model (6.1), we propose to use mean squared
deviation (MSD) to assess comparative agreement.
6.2 MSD for Continuous and Categorical Data
Under the assumption that replicates within a rater are interchangeable, we propose to use MSDintra to evaluate the intrarater agreement, which is the mean squared deviation among replicated readings within raters. Intrarater agreement evaluates the precision within raters. For a chosen rater j, based on model (6.1), MSDintraj can be expressed as
$$\varepsilon^2_{\text{intra}_j}=E(y_{ijl}-y_{ijl'})^2=(\mu_j-\mu_j)^2+(\sigma_j^2+\varsigma_j^2+\sigma_j^2+\varsigma_j^2)-2\sigma_j^2=2\varsigma_j^2.\qquad(6.2)$$
For two or more raters, MSDintra is then defined as the average of MSDintraj over the selected s raters. Similarly, the total MSD between any two raters j and j′, based on individual readings, can be expressed as
$$\varepsilon^2_{\text{total}_{jj'}}=E(y_{ijl}-y_{ij'l'})^2=(\mu_j-\mu_{j'})^2+(\sigma_j^2+\varsigma_j^2+\sigma_{j'}^2+\varsigma_{j'}^2)-2\rho_{jj'}\sigma_j\sigma_{j'}.\qquad(6.3)$$
Overall, MSDtotal for the selected s raters is then defined as the average across the s(s-1)/2 pairs of MSDtotaljj′.
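To make (6.2) and (6.3) operational, here is a minimal R sketch (ours, not the book's macro; the function name and data layout are our own) of moment estimates of MSDintra and MSDtotal from replicated readings:

# Sketch (ours): moment estimates of MSD_intra and MSD_total under model (6.1)
# from an n x k x m array Y (subjects x raters x replicates), with all k
# raters taken as the selected set.
msd_indices <- function(Y) {
  n <- dim(Y)[1]; k <- dim(Y)[2]; m <- dim(Y)[3]
  rep_pairs <- utils::combn(m, 2)
  # average squared deviation among replicate pairs within each rater
  intra_j <- sapply(seq_len(k), function(j)
    mean(apply(matrix(Y[, j, ], n, m), 1, function(r)
      mean((r[rep_pairs[1, ]] - r[rep_pairs[2, ]])^2))))
  rater_pairs <- utils::combn(k, 2)
  # average squared deviation over all replicate pairs between two raters
  total_jj <- apply(rater_pairs, 2, function(p)
    mean(sapply(seq_len(n), function(i)
      mean(outer(Y[i, p[1], ], Y[i, p[2], ], "-")^2))))
  list(msd_intra = mean(intra_j), msd_total = mean(total_jj))
}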
Averages of the m replicated readings for each rater provide the information for evaluating interrater rather than comparative agreement. For any two raters, say raters j and j′, based on the model, MSDinterjj′ can be expressed as
$$\varepsilon^2_{\text{inter}_{jj'}}=E(\bar y_{ij\cdot}-\bar y_{ij'\cdot})^2=(\mu_j-\mu_{j'})^2+\sigma_j^2+\sigma_{j'}^2-2\rho_{jj'}\sigma_j\sigma_{j'}+\frac{\varsigma_j^2+\varsigma_{j'}^2}{m}=\varepsilon^2_{\text{total}_{jj'}}-\Big(1-\frac{1}{m}\Big)\big(\varsigma_j^2+\varsigma_{j'}^2\big).$$
The above equation shows that as soon as MSDtotal and MSDintra are determined, MSDinter is determined as well. Therefore, for comparative agreement, further evaluation of MSDinter is not necessary.
For comparative agreement, the exact intrarater agreement indices TDI and CP,
as shown in Chapter 5, are one-to-one functions of MSDintra for normally distributed
data. The approximate total TDI and CP are one-to-one functions of MSDtotal
for normally distributed data. Within the same experiment, the between-sample
variances or data ranges are similar, and therefore further scaling relative to the
between-sample variance such as CCC is not necessary for comparative agreement.
Furthermore, such scaling would reduce the power of the experiment.
Let any two replicates from the same rater or different raters, denoted by X and Y, represent the classification scores of a subject in one of t categories. Table 3.1 presents the agreement table of all possible probability outcomes, where $\pi_{pq}$ represents the probability of X = p and Y = q, p, q = 1, 2, ..., t.
Based on the agreement probabilities presented in Table 3.1, when X and Y represent any two replicates within rater j, MSDintraj for categorical data becomes
$$\varepsilon^2_{\text{intra}_j}=\sum_{p}\sum_{q}(p-q)^2\pi_{pq},\qquad p,q=1,2,\dots,t.\qquad(6.6)$$
Let $\Pi_{0j}$ be the weighted probability of agreement within rater j with the squared weight function defined in (3.3), $\Pi_{0j}=\sum_{p=1}^{t}\sum_{q=1}^{t}w_{pq}\pi_{pq}$. Since $w_{pq}=1-(p-q)^2/(t-1)^2$, the relationship between MSDintraj and the weighted probability of agreement becomes
$$\varepsilon^2_{\text{intra}_j}=(t-1)^2\big(1-\Pi_{0j}\big).$$
Note that weighted kappa is the chance-corrected scaling of the weighted probability of agreement.
Similarly, $\varepsilon^2_{\text{total}_{jj'}}=(t-1)^2(1-\Pi_{0jj'})$, where $\Pi_{0jj'}$ is the weighted probability of agreement between any replicates of raters j and j′. For categorical data, kappa and weighted kappa are the most common indices for assessing agreement between two raters, each with a single reading. Because of the equivalence of the CCC and weighted kappa, weighted kappa also depends largely on MSD. Therefore, within the same experiment, further scaling from MSD to weighted kappa is not necessary for evaluating comparative agreement.
6.3 GEE Estimation

We estimate μ by the first set of estimating equations,
$$\sum_{i=1}^{n}F_{i1}'H_{i1}^{-1}(Q_{i1}-\mu)=0,\qquad(6.9)$$
where
$$Q_{i1}=\big((y_{i11}+\cdots+y_{i1m})/m,\;\dots,\;(y_{ij1}+\cdots+y_{ijm})/m,\;\dots,\;(y_{ik1}+\cdots+y_{ikm})/m\big)'$$
and $\mu=(\mu_1,\dots,\mu_j,\dots,\mu_k)'$,
and $F_{i1}=\partial\mu/\partial(\mu_1,\dots,\mu_k)'=I_{k\times k}$. The working covariance matrix for $Q_{i1}$ (Zeger and Liang 1986) is
$$H_{i1}=\operatorname{diag}(\operatorname{var}(Q_{i1}))=\operatorname{diag}\big(a_1^{(1)},\dots,a_j^{(1)},\dots,a_k^{(1)}\big),$$
where
$$a_j^{(1)}=\operatorname{var}\!\Big(\frac{1}{m}(y_{ij1}+y_{ij2}+\cdots+y_{ijm})\Big)=\sigma_j^2+\frac{\varsigma_j^2}{m}.\qquad(6.10)$$
Note that we assume normality in constructing all of the working covariance matrices. The estimator for μ is
$$\hat\mu=\Big(\sum_{i=1}^{n}F_{i1}'H_{i1}^{-1}F_{i1}\Big)^{-1}\Big(\sum_{i=1}^{n}F_{i1}'H_{i1}^{-1}Q_{i1}\Big).\qquad(6.11)$$
The second set of estimating equations, for the error variances, is
$$\sum_{i=1}^{n}F_{i2}'H_{i2}^{-1}(Q_{i2}-\varsigma^2)=0,\qquad(6.12)$$
where
$$Q_{i2}=\Big(\tfrac{\sum_{l=1}^{m}(y_{i1l}-\bar y_{i1})^2}{m-1},\;\dots,\;\tfrac{\sum_{l=1}^{m}(y_{ijl}-\bar y_{ij})^2}{m-1},\;\dots,\;\tfrac{\sum_{l=1}^{m}(y_{ikl}-\bar y_{ik})^2}{m-1}\Big)'$$
and $\varsigma^2=(\varsigma_1^2,\dots,\varsigma_j^2,\dots,\varsigma_k^2)'$,
and $F_{i2}=\partial\varsigma^2/\partial(\varsigma_1^2,\dots,\varsigma_k^2)'=I_{k\times k}$. The working covariance matrix for $Q_{i2}$ is
$$H_{i2}=\operatorname{diag}(\operatorname{var}(Q_{i2}))=\operatorname{diag}\big(a_1^{(2)},\dots,a_j^{(2)},\dots,a_k^{(2)}\big),$$
where
$$a_j^{(2)}=\operatorname{var}\!\Big(\frac{\sum_{l=1}^{m}(y_{ijl}-\bar y_{ij})^2}{m-1}\Big)=\frac{2\varsigma_j^4}{m-1}.\qquad(6.13)$$
The estimator for & 2 is
!1 !
X
n
0
X
n
0
&O D
2
Fi 2 Hi1
2 Fi 2 Fi 2 Hi1
2 Qi 2 : (6.14)
i D1 i D1
The third set of estimating equations, for the between-subject variances, is
$$\sum_{i=1}^{n}F_{i3}'H_{i3}^{-1}\big(Q_{i3}-g(\mu,\varsigma^2,\sigma^2)\big)=0,\qquad(6.15)$$
where
$$Q_{i3}=\big((y_{i11}+\cdots+y_{i1m})^2/m^2,\;\dots,\;(y_{ij1}+\cdots+y_{ijm})^2/m^2,\;\dots,\;(y_{ik1}+\cdots+y_{ikm})^2/m^2\big)'$$
and
$$g(\mu,\varsigma^2,\sigma^2)=\big(\sigma_1^2+\varsigma_1^2/m+\mu_1^2,\;\dots,\;\sigma_j^2+\varsigma_j^2/m+\mu_j^2,\;\dots,\;\sigma_k^2+\varsigma_k^2/m+\mu_k^2\big)',$$
and $F_{i3}=\partial g(\mu,\varsigma^2,\sigma^2)/\partial(\sigma_1^2,\dots,\sigma_k^2)'=I_{k\times k}$. The working covariance matrix for $Q_{i3}$ is
$$H_{i3}=\operatorname{diag}(\operatorname{var}(Q_{i3}))=\operatorname{diag}\big(a_1^{(3)},\dots,a_j^{(3)},\dots,a_k^{(3)}\big),$$
where
$$a_j^{(3)}=\operatorname{var}\!\Big(\frac{(y_{ij1}+y_{ij2}+\cdots+y_{ijm})^2}{m^2}\Big)=2\Big(\sigma_j^2+\frac{\varsigma_j^2}{m}\Big)^{2}+4\mu_j^2\Big(\sigma_j^2+\frac{\varsigma_j^2}{m}\Big).\qquad(6.16)$$
O and &O 2 .
We obtain the estimate for and & 2 from (6.11) and (6.14), namely,
O using the equation
We then solve for 2
!1 !
X
n
0
X
n
0 &O 2
O2 D
Fi 3 Hi1 Fi 3 Hi1 O2
;
3 Fi 3 3 Qi 3 (6.17)
i D1 i D1
m
where O 2 represents the vector in which the element is the square of the
O
corresponding element of the vector .
Finally, we estimate ρ by the fourth set of estimating equations, using the cross products:
$$\sum_{i=1}^{n}F_{i4}'H_{i4}^{-1}\big(Q_{i4}-h(\mu,\sigma^2,\rho)\big)=0,\qquad(6.18)$$
where
$$Q_{i4}=\big(\bar y_{i1}\bar y_{i2},\;\dots,\;\bar y_{i1}\bar y_{ik},\;\bar y_{i2}\bar y_{i3},\;\dots,\;\bar y_{i2}\bar y_{ik},\;\dots,\;\bar y_{ij}\bar y_{ij'},\;\dots,\;\bar y_{i(k-1)}\bar y_{ik}\big)'_{\,k(k-1)/2\times 1},$$
$$h(\mu,\sigma^2,\rho)=\big(\mu_1\mu_2+\rho_{12}\sigma_1\sigma_2,\;\dots,\;\mu_j\mu_{j'}+\rho_{jj'}\sigma_j\sigma_{j'},\;\dots,\;\mu_{k-1}\mu_k+\rho_{(k-1)k}\sigma_{k-1}\sigma_k\big)'_{\,k(k-1)/2\times 1},$$
and $F_{i4}=\partial h(\mu,\sigma^2,\rho)/\partial(\rho_{12},\dots,\rho_{(k-1)k})'$. The working covariance matrix for $Q_{i4}$ is
$$H_{i4}=\operatorname{diag}(\operatorname{var}(Q_{i4}))=\operatorname{diag}\big(a_{12}^{(4)},\dots,a_{1k}^{(4)},a_{23}^{(4)},\dots,a_{2k}^{(4)},\dots,a_{jj'}^{(4)},\dots,a_{(k-1)k}^{(4)}\big)_{\,k(k-1)/2\times k(k-1)/2},$$
where
$$a_{jj'}^{(4)}=\operatorname{var}(\bar y_{ij}\bar y_{ij'})=\rho_{jj'}^2\sigma_j^2\sigma_{j'}^2+2\mu_j\mu_{j'}\rho_{jj'}\sigma_j\sigma_{j'}+\Big(\sigma_j^2+\frac{\varsigma_j^2}{m}\Big)\Big(\sigma_{j'}^2+\frac{\varsigma_{j'}^2}{m}\Big)+\mu_j^2\Big(\sigma_{j'}^2+\frac{\varsigma_{j'}^2}{m}\Big)+\mu_{j'}^2\Big(\sigma_j^2+\frac{\varsigma_j^2}{m}\Big).\qquad(6.19)$$
The asymptotic variance-covariance matrix of the parameter estimates is
$$V(\hat\Theta)=\frac{1}{n}D^{-1}\Sigma\,(D^{-1})',\qquad(6.21)$$
where
$$D=\begin{pmatrix}
\sum_{i=1}^{n}F_{i1}'H_{i1}^{-1}F_{i1}&0&0&0\\[2pt]
0&\sum_{i=1}^{n}F_{i2}'H_{i2}^{-1}F_{i2}&0&0\\[2pt]
\sum_{i=1}^{n}F_{i3}'H_{i3}^{-1}G_{i2}&\sum_{i=1}^{n}F_{i3}'H_{i3}^{-1}G_{i3}&\sum_{i=1}^{n}F_{i3}'H_{i3}^{-1}F_{i3}&0\\[2pt]
\sum_{i=1}^{n}F_{i4}'H_{i4}^{-1}G_{i4}&0&\sum_{i=1}^{n}F_{i4}'H_{i4}^{-1}G_{i6}&\sum_{i=1}^{n}F_{i4}'H_{i4}^{-1}F_{i4}
\end{pmatrix},$$
with
$$G_{i2}=\partial g(\mu,\varsigma^2,\sigma^2)/\partial\mu,\quad G_{i3}=\partial g(\mu,\varsigma^2,\sigma^2)/\partial\varsigma^2,\quad G_{i4}=\partial h(\mu,\sigma^2,\rho)/\partial\mu,\quad G_{i6}=\partial h(\mu,\sigma^2,\rho)/\partial\sigma^2,$$
and
$$\Sigma=\begin{pmatrix}A_{11}&A_{12}&A_{13}&A_{14}\\ A_{21}&A_{22}&A_{23}&A_{24}\\ A_{31}&A_{32}&A_{33}&A_{34}\\ A_{41}&A_{42}&A_{43}&A_{44}\end{pmatrix},$$
with
$$A_{11}=\sum_{i=1}^{n}F_{i1}'H_{i1}^{-1}(Q_{i1}-\hat\mu)(Q_{i1}-\hat\mu)'H_{i1}^{-1}F_{i1},$$
$$A_{12}=\sum_{i=1}^{n}F_{i1}'H_{i1}^{-1}(Q_{i1}-\hat\mu)(Q_{i2}-\hat\varsigma^2)'H_{i2}^{-1}F_{i2},$$
$$A_{13}=\sum_{i=1}^{n}F_{i1}'H_{i1}^{-1}(Q_{i1}-\hat\mu)\big(Q_{i3}-g(\hat\mu,\hat\varsigma^2,\hat\sigma^2)\big)'H_{i3}^{-1}F_{i3},$$
$$A_{14}=\sum_{i=1}^{n}F_{i1}'H_{i1}^{-1}(Q_{i1}-\hat\mu)\big(Q_{i4}-h(\hat\mu,\hat\sigma^2,\hat\rho)\big)'H_{i4}^{-1}F_{i4},$$
$$A_{22}=\sum_{i=1}^{n}F_{i2}'H_{i2}^{-1}(Q_{i2}-\hat\varsigma^2)(Q_{i2}-\hat\varsigma^2)'H_{i2}^{-1}F_{i2},$$
$$A_{23}=\sum_{i=1}^{n}F_{i2}'H_{i2}^{-1}(Q_{i2}-\hat\varsigma^2)\big(Q_{i3}-g(\hat\mu,\hat\varsigma^2,\hat\sigma^2)\big)'H_{i3}^{-1}F_{i3},$$
$$A_{24}=\sum_{i=1}^{n}F_{i2}'H_{i2}^{-1}(Q_{i2}-\hat\varsigma^2)\big(Q_{i4}-h(\hat\mu,\hat\sigma^2,\hat\rho)\big)'H_{i4}^{-1}F_{i4},$$
$$A_{33}=\sum_{i=1}^{n}F_{i3}'H_{i3}^{-1}\big(Q_{i3}-g(\hat\mu,\hat\varsigma^2,\hat\sigma^2)\big)\big(Q_{i3}-g(\hat\mu,\hat\varsigma^2,\hat\sigma^2)\big)'H_{i3}^{-1}F_{i3},$$
$$A_{34}=\sum_{i=1}^{n}F_{i3}'H_{i3}^{-1}\big(Q_{i3}-g(\hat\mu,\hat\varsigma^2,\hat\sigma^2)\big)\big(Q_{i4}-h(\hat\mu,\hat\sigma^2,\hat\rho)\big)'H_{i4}^{-1}F_{i4},$$
$$A_{44}=\sum_{i=1}^{n}F_{i4}'H_{i4}^{-1}\big(Q_{i4}-h(\hat\mu,\hat\sigma^2,\hat\rho)\big)\big(Q_{i4}-h(\hat\mu,\hat\sigma^2,\hat\rho)\big)'H_{i4}^{-1}F_{i4},$$
and $A_{21}=A_{12}'$, $A_{31}=A_{13}'$, $A_{41}=A_{14}'$, $A_{32}=A_{23}'$, $A_{42}=A_{24}'$, $A_{43}=A_{34}'$.
Now we have obtained the estimates and variance–covariance matrix for all
parameters. We can then use the delta method to obtain the estimates and their
variances for all indices that are functions of these parameters.
6.4 Comparison of Total-Rater Agreement with Intrarater Precision: Total–Intra Ratio

For evaluating the type of individual agreement mentioned in the second paragraph of this chapter, it is natural to use the TIR, the ratio of MSDtotal to MSDintra, to assess the noninferiority of measurements from different raters relative to intrarater precision. More generally, the comparison can be based on selected multiple pairs of MSDtotaljj′ relative to selected multiple MSDintraj. Raters selected for MSDtotal can also be selected for MSDintra, and hence the raters in the numerator and the raters in the denominator are not mutually exclusive. This approach allows substantial flexibility in making comparisons between chosen test raters and chosen reference raters. In addition, it is not required to select a reference rater when none is available. For example, when k = 2, we can evaluate deviations among individual values of the test and reference raters relative to the deviation within the reference rater, MSDtotalT,R/MSDintraR, or relative to that within both the test and reference raters, MSDtotalT,R/MSDintraT,R. When k = 3 with one of the raters being the reference rater, we can evaluate MSDtotalT1,R/MSDintraR, MSDtotalT2,R/MSDintraR, or MSDtotalT1T2,R/MSDintraR. When k = 3 and none is the reference, we can evaluate the MSDtotal of all raters relative to their average MSDintra, as described below.
In the case of one or multiple references, we select a set of test raters and a set of reference raters out of the total of k raters. Suppose we are interested in evaluating t different test raters, 1 ≤ t ≤ k, with respect to r reference raters, 1 ≤ r ≤ k, 2 ≤ t + r ≤ 2k. Test raters are indexed by j, j = 1, ..., t, and reference raters are indexed by j′, j′ = 1, ..., r. The individual differences between the selected sets of test and reference raters are evaluated by the average of the pairwise total mean squared deviations, MSDtotalT,R. The intrarater precision is evaluated by the average of the intra mean squared deviations of the selected r reference raters. The ratio, TIRR, is used to assess the individual agreement:
$$\text{TIR}_R=\frac{\varepsilon^2_{\text{total}_{T,R}}}{\varepsilon^2_{\text{intra}_R}}=\frac{\sum_{j=1}^{t}\sum_{j'=1}^{r}E(y_{ijl}-y_{ij'l'})^2/(tr)}{\sum_{j'=1}^{r}E(y_{ij'l}-y_{ij'l'})^2/r}$$
$$=\frac{\frac{1}{tr}\sum_{j'=1}^{r}\sum_{j=1}^{t}(\mu_j-\mu_{j'})^2+\frac{1}{t}\sum_{j=1}^{t}\varsigma_j^2+\frac{1}{r}\sum_{j'=1}^{r}\varsigma_{j'}^2}{\frac{2}{r}\sum_{j'=1}^{r}\varsigma_{j'}^2}+\frac{\frac{1}{t}\sum_{j=1}^{t}\sigma_j^2+\frac{1}{r}\sum_{j'=1}^{r}\sigma_{j'}^2-\frac{2}{tr}\sum_{j'=1}^{r}\sum_{j=1}^{t}\rho_{jj'}\sigma_j\sigma_{j'}}{\frac{2}{r}\sum_{j'=1}^{r}\varsigma_{j'}^2}.\qquad(6.22)$$
Theoretically, TIRR cannot be less than 1. However, its estimate, $\widehat{\text{TIR}}_R$, can be less than one due to random error. When TIRR = 1, the total-rater agreement is exactly the same as the intra-reference-rater agreement. Higher values of TIRR indicate worse individual agreement. The disagreement could be due to (1) a difference between the means μT of the test raters and the means μR of the reference raters; (2) a difference between the error variance ς²T of the test raters and the error variance ς²R of the reference raters; or (3) the subject-by-rater interaction: $\sigma_D^2=\operatorname{var}(\tau_{iT}-\tau_{iR})=\sigma_T^2+\sigma_R^2-2\rho_{T,R}\sigma_T\sigma_R$.
When there is no specific reference rater, we can compare the t selected test raters' MSDtotal to their MSDintra. The individual difference is evaluated by the average of the pairwise total mean squared deviations, MSDtotalT, among the t selected test raters, 2 ≤ t ≤ k. We then use the average of the MSDintra of all test raters in the denominator. The total–intra ratio without a specific reference, TIRall, is expressed as
$$\text{TIR}_{\text{all}}=\frac{\varepsilon^2_{\text{total}}}{\varepsilon^2_{\text{intra}}}=\frac{2\sum_{j=1}^{t-1}\sum_{j'=j+1}^{t}E(y_{ijl}-y_{ij'l'})^2/(t(t-1))}{\sum_{j=1}^{t}E(y_{ijl}-y_{ijl'})^2/t}$$
$$=\frac{2\sum_{j=1}^{t-1}\sum_{j'=j+1}^{t}\big[(\mu_j-\mu_{j'})^2+\varsigma_j^2+\varsigma_{j'}^2+\sigma_j^2+\sigma_{j'}^2-2\rho_{jj'}\sigma_j\sigma_{j'}\big]/(t(t-1))}{\frac{2}{t}\sum_{j=1}^{t}\varsigma_j^2}.\qquad(6.23)$$
When no specific reference exists, TIRall varies between 1 and ∞.
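Reusing the msd_indices() sketch from Section 6.2 above, a minimal moment estimate of TIRall (ours, not the book's macro) is simply the ratio of the two MSDs:

# Sketch (ours), reusing msd_indices() from above: the TIR_all point estimate
# of (6.23) as average pairwise total MSD over average intra MSD.
tir_all_hat <- function(Y) {
  idx <- msd_indices(Y)
  idx$msd_total / idx$msd_intra
}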
In the case of two raters with one of them treated as a reference, i.e., k = 2, t = 1, and r = 1, TIRR degenerates to FDA's method for evaluating individual bioequivalence under the relative scale. Following the FDA guidelines on individual bioequivalence, the agreement of the test and reference compounds can be assessed relative to the agreement within the reference compound. Let $y_{iTl}$ and $y_{iRl'}$ be the lth and l′th readings on subject i from the test compound (T) and the reference compound (R), respectively. Then the individual bioequivalence criterion (IBC) is defined by FDA as
$$\text{IBC}=\frac{(\mu_T-\mu_R)^2+\sigma_D^2+\sigma_{WT}^2-\sigma_{WR}^2}{\sigma_{WR}^2}.\qquad(6.24)$$
This FDA approach is primarily based on the approach proposed by Sheiner (1992), which uses a normal linear mixed model estimated by REML.
By our definition, TIRR is expressed as
$$\text{TIR}_R=\frac{\varepsilon^2_{\text{total}_{T,R}}}{\varepsilon^2_{\text{intra}_R}}=\frac{E(y_{iTl}-y_{iRl'})^2}{E(y_{iRl}-y_{iRl'})^2}=\frac{\text{IBC}}{2}+1.\qquad(6.25)$$
Note that FDA also uses a constant scale when MSDintraR is small. In addition, it requires that the ratio of the geometric means of the test and reference compounds lie between 0.8 and 1.25. We will discuss the topic of individual bioequivalence in detail later, in Example 6.7.4. In that example, we will add meaningful interpretations to the FDA's criteria by making use of the information presented in Chapters 5 and 6.
Barnhart, Kosinski, and Haber (2007) proposed the coefficient of individual agreement (CIA) for assessing individual agreement. When there are t test raters and r reference raters, the CIA with reference is defined as
$$\text{CIA}_R=\frac{\varepsilon^2_{\text{intra}_R}}{\varepsilon^2_{\text{total}_{T,R}}}=\frac{\sum_{j'=1}^{r}E(y_{ij'l}-y_{ij'l'})^2/r}{\sum_{j=1}^{t}\sum_{j'=1}^{r}E(y_{ijl}-y_{ij'l'})^2/(tr)}.\qquad(6.26)$$
Comparing these CIAs to TIRR and TIRall in (6.22) and (6.23), respectively, the CIA is the reciprocal of the TIR.
Basically, IBC, CIA, and TIR are the same indices. The differences are in the estimation approaches: IBC is ML-based, CIA is method-of-moments-based with bootstrapping for statistical inference, and TIR is GEE-based. CIA and TIR extend to multiple raters, while IBC is limited to two raters only.
Recall that we obtained estimates of all parameters as well as their variance-covariance matrix via the GEE methodology in Section 6.3. These GEE estimates of the parameters turn out to be moment estimates. Since the TIR is a function of the parameters in the model, the method of moments is used to estimate the TIR, and the delta method is used for the statistical inference.
When the reference exists, the TIRR estimate can be obtained as
$$\widehat{\text{TIR}}_R=\frac{\frac{1}{tr}\sum_{j'=1}^{r}\sum_{j=1}^{t}(\hat\mu_j-\hat\mu_{j'})^2+\frac{1}{t}\sum_{j=1}^{t}\hat\varsigma_j^2+\frac{1}{r}\sum_{j'=1}^{r}\hat\varsigma_{j'}^2}{\frac{2}{r}\sum_{j'=1}^{r}\hat\varsigma_{j'}^2}+\frac{\frac{1}{t}\sum_{j=1}^{t}\hat\sigma_j^2+\frac{1}{r}\sum_{j'=1}^{r}\hat\sigma_{j'}^2-\frac{2}{tr}\sum_{j'=1}^{r}\sum_{j=1}^{t}\hat\rho_{jj'}\hat\sigma_j\hat\sigma_{j'}}{\frac{2}{r}\sum_{j'=1}^{r}\hat\varsigma_{j'}^2}.\qquad(6.28)$$
The estimate of the log-transformed TIRR, $W_T=\ln(\widehat{\text{TIR}}_R)$, has an asymptotic normal distribution with mean ln(TIRR) and variance
$$\sigma_{W_T}^2=\frac{1}{n}\,d_T'\,\Sigma_C\,d_T,\qquad(6.29)$$
where $W_T=\ln(\widehat{\text{TIR}}_R)=g(m)$, $m=(\hat\mu,\hat\varsigma^2,\hat\sigma^2,\hat\rho)'=(m_1,m_2,m_3,m_4)'$, $\Sigma_C=nV(\hat\Theta)$ from (6.21) is the variance-covariance matrix of the parameter estimates,
$$d_T=\frac{\partial g(m)}{\partial m}\Big|_{m=\Theta},$$
and $\Theta=(\mu,\varsigma^2,\sigma^2,\rho)'$. We use the correction factor n/(n - 6) for the variance in (6.29) because it has been shown to have less bias in the simulation studies.
The elements of dT are computed as follows: If the jth rater is selected as a test rater and the j′th rater is selected as a reference rater, then:
• The jth element of dT is $\frac{1}{\widehat{\text{TIR}}_R}\cdot\frac{\sum_{j'=1}^{r}(\mu_j-\mu_{j'})}{t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.
• The j′th element of dT is $-\frac{1}{\widehat{\text{TIR}}_R}\cdot\frac{\sum_{j=1}^{t}(\mu_j-\mu_{j'})}{t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.
• The (k + j)th element of dT is $\frac{1}{\widehat{\text{TIR}}_R}\cdot\frac{r}{2t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.
• The (k + j′)th element of dT is
$$-\frac{1}{\widehat{\text{TIR}}_R}\cdot\frac{\frac{1}{tr}\sum_{j'=1}^{r}\sum_{j=1}^{t}(\mu_j-\mu_{j'})^2+\frac{1}{t}\sum_{j=1}^{t}\varsigma_j^2+\frac{1}{t}\sum_{j=1}^{t}\sigma_j^2+\frac{1}{r}\sum_{j'=1}^{r}\sigma_{j'}^2-\frac{2}{tr}\sum_{j'=1}^{r}\sum_{j=1}^{t}\rho_{jj'}\sigma_j\sigma_{j'}}{\frac{2}{r}\big(\sum_{j'=1}^{r}\varsigma_{j'}^2\big)^2}.$$
• The (2k + j)th element of dT is $\frac{1}{\widehat{\text{TIR}}_R}\cdot\frac{r-\sum_{j'=1}^{r}\rho_{jj'}\sigma_{j'}/\sigma_j}{2t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.
• The (2k + j′)th element of dT is $\frac{1}{\widehat{\text{TIR}}_R}\cdot\frac{t-\sum_{j=1}^{t}\rho_{jj'}\sigma_j/\sigma_{j'}}{2t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.
• When j < j′, the $\big(3k+\frac{(2k-j)(j-1)}{2}+(j'-j)\big)$th element of dT is $-\frac{1}{\widehat{\text{TIR}}_R}\cdot\frac{\sigma_j\sigma_{j'}}{t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.
• When j > j′, the $\big(3k+\frac{(2k-j')(j'-1)}{2}+(j-j')\big)$th element of dT is $-\frac{1}{\widehat{\text{TIR}}_R}\cdot\frac{\sigma_j\sigma_{j'}}{t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.
• All other elements in dT are zero.
The log-transformed TIRR estimate approaches normality rapidly and efficiently bounds the confidence interval within (0, ∞). The confidence limit for TIRR is computed based on the log-transformed TIR estimate, $W_T=\ln(\widehat{\text{TIR}}_R)$. The antilog transformation is performed on the confidence limit of WT to obtain the actual confidence limit for TIRR. Individual agreement is established when the confidence limit is smaller than the prespecified criterion, say, 2.25.
When no reference raters exist and the average of all error variances of the test raters is used in the denominator, the TIRall estimate is given by
$$\widehat{\text{TIR}}_{\text{all}}=\frac{2\sum_{j=1}^{t-1}\sum_{j'=j+1}^{t}\big[(\hat\mu_j-\hat\mu_{j'})^2+\hat\varsigma_j^2+\hat\varsigma_{j'}^2+\hat\sigma_j^2+\hat\sigma_{j'}^2-2\hat\rho_{jj'}\hat\sigma_j\hat\sigma_{j'}\big]/(t(t-1))}{\frac{2}{t}\sum_{j=1}^{t}\hat\varsigma_j^2}.\qquad(6.30)$$
The statistical inference for TIRall can be obtained in the same way as when there are reference raters. For the purpose of statistical inference, the parameters in the variances of the estimates presented above can be replaced by their sample counterparts, which are consistent estimators.
where "2intraT and "2intraR denote MSDintra among selected test and reference raters,
respectively.
An IIR less than 1 indicates better overall precision for the test raters than for the reference raters. For statistical inference, we would construct a two-sided 100(1 - α/2)% confidence interval for the IIR, and claim superiority or inferiority if the upper or lower limit is less than or greater than 1.0, respectively. On the other hand, if we intend to examine whether the precisions of the test and reference raters are equal, we would claim equivalence if the confidence interval is bounded by a prespecified clinically relevant interval.
As for the TIR, the estimate of the IIR is obtained via the method of moments by the GEE methodology, and the related statistical inference is obtained via the delta method. The IIR estimate is given by
$$\widehat{\text{IIR}}=\frac{\sum_{j=1}^{t}\hat\varsigma_j^2/t}{\sum_{j'=1}^{r}\hat\varsigma_{j'}^2/r}.\qquad(6.32)$$
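The point estimate in (6.32) is a simple ratio of averaged error-variance estimates; a one-line R sketch (ours, with our own argument names):

# Sketch (ours): the IIR point estimate of (6.32), where s2_test and s2_ref
# are vectors of estimated within-rater error variances for the selected
# test and reference raters.
iir_hat <- function(s2_test, s2_ref) mean(s2_test) / mean(s2_ref)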
The estimate of the log-transformed IIR, $W_I=\ln(\widehat{\text{IIR}})$, has an asymptotic normal distribution with variance
$$\sigma_{W_I}^2=\frac{1}{n}\,d_I'\,\Sigma_C\,d_I,\qquad(6.33)$$
where
$$m=(0,\hat\varsigma^2,0,0)'=(0,m_2,0,0)'$$
and
$$d_I=(0,d_{\varsigma^2},0,0)'=\Big(0,\;\frac{\partial g(m)}{\partial m_2}\Big|_{m=\Theta},\;0,\;0\Big)'.$$
We use the correction factor n/(n - 6) for the variance in (6.33) because it has been shown to have less bias in simulation studies.
The elements of dI are computed as follows: If the jth rater is selected as a test rater and the j′th rater is selected as a reference rater, then:
• The (k + j)th element of dI is $\frac{1}{\widehat{\text{IIR}}}\cdot\frac{r}{t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.
• The (k + j′)th element of dI is $-\frac{1}{\widehat{\text{IIR}}}\cdot\frac{r\sum_{j=1}^{t}\varsigma_j^2}{t\big(\sum_{j'=1}^{r}\varsigma_{j'}^2\big)^2}$.
• All other elements in dI are zero. The values of the elements of dI correspond to the selection of the test and reference raters.
The log-transformed IIR estimate approaches normality rapidly, and the resulting confidence interval for the IIR is bounded within (0, ∞). The confidence limit for the IIR is computed based on the log-transformed IIR estimate, $W_I=\ln(\widehat{\text{IIR}})$. The antilog transformation is performed on the confidence limit of WI to obtain the actual confidence limit for the IIR. Better precision for the test raters than for the reference raters is established when the confidence limit is less than 1. For the purpose of statistical inference, the parameters in the variances of the estimates presented above can be replaced by their sample counterparts, which are consistent estimators.
6.7 Examples
Example 6.7.1. The data set for this example is from Baxter Healthcare Corporation. Spectrophotometer systems are used for the spectrophotometric determination of glycine. In this example, we examine the agreement between two different systems, S1 and S2, with duplicate measurements on each of 38 samples. Figures 6.1–6.3 present the plots of the readings between the two systems and their replicated measurements. There are no reference instruments here, and hence it is informative to investigate the TIR and IIR without a reference. The data were assumed to have a constant error structure.
The TIRall of MSDtotal relative to MSDintra is estimated to be 0.699, with a 95% upper confidence limit of 0.817, which is within any clinically relevant criterion for claiming individual agreement. The IIR of MSDintraS1 relative to MSDintraS2 is estimated to be 1.010, with 95% confidence interval (0.752, 1.357), which indicates that the precisions of the two systems are not statistically different. We would claim equivalence if we considered a precision deviation of less than 40% clinically acceptable. Based on the results of the TIR and IIR, we conclude that the two systems can be used interchangeably.
             Period
Seq.    1    2    3    4
1       T    T    R    R
2       R    R    T    T
3       R    T    T    R
4       T    R    R    T
For simplicity, we study only the area under the curve (AUC), and we assume that the data have a proportional error structure. To save space, we do not list the original data, but we present the condensed data shortly. We first tested the period and sequence effects using a mixed effect model on the log-transformed AUC, and found no evidence of these two effects (p > 0.5). Therefore, we list the data in the format T1, T2, R1, and R2, representing periods 1 and 2 of the test and reference compounds, as shown in Table 6.1. Note that subject 16 had missing data, and subjects 906, 908, 921, and 932 were recruited to replace the four dropout subjects numbered 6, 8, 21, and 32.
The criteria for claiming individual bioequivalence under the FDA guidance are:
1. Reference scale:
$$\frac{(\mu_T-\mu_R)^2+\sigma_D^2+\sigma_{WT}^2-\sigma_{WR}^2}{\sigma_{WR}^2}<2.5,\qquad\text{if }\sigma_{WR}^2>\sigma_{W0}^2;\qquad(6.34)$$
2. Constant scale:
$$\frac{(\mu_T-\mu_R)^2+\sigma_D^2+\sigma_{WT}^2-\sigma_{WR}^2}{\sigma_{W0}^2}<2.5,\qquad\text{if }\sigma_{WR}^2\le\sigma_{W0}^2;\qquad(6.35)$$
where $\sigma_D^2=\sigma_{BT}^2+\sigma_{BR}^2-2\sigma_{RT}$; $\mu_T$ and $\mu_R$ are the means, $\sigma_{BT}^2$ and $\sigma_{BR}^2$ are the between-subject variances, $\sigma_{WT}^2$ and $\sigma_{WR}^2$ are the within-subject variances, and $\sigma_{RT}$ is the between-subject covariance, of the test and reference compounds, respectively. Finally, $\sigma_{W0}^2$ is the cutoff value based on the within-subject variance of the reference compound. The criterion 2.5 is the aggregate allowance of ln(μT/μR) = ln(1.25), $\sigma_{WT}^2-\sigma_{WR}^2=0.02$, $\sigma_D^2=0.03$, and $\sigma_{W0}^2=0.04$.
The FDA individual bioequivalence criteria have not been used widely, even by FDA staff, for the following reasons:
1. It is difficult for pharmacists and statisticians to understand the meaning of the criteria.
2. The concept is far more complicated than average or population bioequivalence.
3. For statisticians, it is complicated to compute the confidence limits of the estimates under the relative and constant scales unless a program or macro is ready at hand.
4. Perhaps the biggest problem is that there is a discontinuity region between the reference and constant scales. That is, when the estimate of $\sigma_{WR}^2$ is near $\sigma_{W0}^2$ within natural random fluctuation, there is a penalty for falling into the stricter constant-scale criterion by chance.
Using our definitions provided in Chapters 5 and 6, we can much better interpret and understand these scales, and we provide tools (see Chapter 7) to do the analysis. The relationship between TIRR and the relative scale is given in (6.25). Using our definition, the relative-scale criterion means that the MSDtotalT,R cannot be more than 2.25 times the MSDintraR, or TIRR < 2.25, with 100(1 - α)% confidence.
are within 43.7% of each other. If TDI%0.8 is greater than 43.7%, we must use the reference scale according to (6.34) to ensure that the MSDtotalT,R cannot be more than 2.25 times the MSDintraR with 100(1 - α)% confidence. Otherwise, we would use the constant scale.
We now examine the meaning of the constant scale. Note that MSDtotalT,R = (μT - μR)² + σ²D under model (6.1). If we assume that σ²WT - σ²WR = -0.02, 0, or 0.02, then (6.35) under model (6.1) becomes MSDtotalT,R < 0.12, 0.1, or 0.08, respectively. Again, when σ²W0 = 0.04, from (2.8), (5.26), and (5.33), the TDI%total(0.8) becomes 55.9%, 50.0%, or 43.7%, respectively. This means that approximately 80% of the individual AUC values from the test compound cannot deviate by more than 55.9%, 50.0%, or 43.7%, respectively, from the individual AUC values of the reference compound with 100(1 - α)% confidence. Compared to the TDI%intra(0.8) = 43.7% based on the cutoff value σ²W0 = 0.04 of the intra-reference compound, this constant-scale criterion appears to be too stringent to meet.
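The TDI% values quoted above can be verified directly; a minimal R sketch (ours), assuming the log-scale TDI% form referenced via (2.8) for the proportional error case:

# Sketch (ours): TDI% for the proportional-error (log-scale) case,
# TDI%_p = 100 * (exp(z_p * sqrt(MSD)) - 1), with z_p the normal quantile.
tdi_pct <- function(msd, p = 0.8) {
  100 * (exp(qnorm(1 - (1 - p) / 2) * sqrt(msd)) - 1)
}
tdi_pct(c(0.12, 0.10, 0.08))  # approx. 55.9, 50.0, 43.7 (percent)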
Let us summarize the interpretation of the FDA individual bioequivalence criteria under model (6.1). If 80% of the duplicate values of the reference compound deviate by more than 43.7% from each other, the MSDtotalT,R cannot be more than 2.25 times the MSDintraR with 100(1 - α)% confidence. Otherwise, we must show that approximately 80% of the individual AUC values from the test compound do not deviate by more than 44% to 56% from the individual AUC values of the reference compound with 100(1 - α)% confidence. The conventional α is set at 0.05, one-tailed. It appears that there is room for improvement in redefining these criteria.
We now analyze the data in this example. We begin by calculating the TDI%0.8 between R1 and R2 using (2.8), assuming the proportional error case. Figure 6.4 shows the agreement plot of R1 and R2 in log2 scale. It is clear that the data lack precision: TDI%0.8 = 124.4%, which means that 80% of R1 and R2 pairs can deviate by up to 124% of each other. This is much higher than the 43.7% cutoff value, and therefore we would use the relative scale according to the FDA rules.
Figure 6.5 shows the agreement plot of T1 and R2 in log2 scale, representing the total agreement among the individual AUC values of the test and reference compounds. The precision of these data is slightly better than that between R1 and R2, but in general, T1 has lower AUC values than R2. The TIRR is estimated to be 0.691, with a one-sided upper 95% confidence limit of 1.076, indicating that the MSDtotalT,R is at most 1.076 times the MSDintraR with 100(1 - α)% confidence, which is much less than 2.25. Therefore, individual bioequivalence is accepted according to FDA's rules, because of the large within-reference deviation.
Fig. 6.5 Agreement between the first test compound reading and the second reference compound
reading
Our statistical tools allow us to go one step further by examining the precision of the test compound relative to that of the reference compound, namely, the IIR of MSDintraT relative to MSDintraR. Figure 6.6 presents the agreement plot of T1 and T2 in log2 scale. The precision of the test compound appears tighter than that of the reference compound, as shown in Fig. 6.4. Using the statistical tools presented in Chapter 2, the TDI%0.8 of T1 and T2 is estimated to be 70.2%, with a 95% upper confidence limit of 90.3%. The IIR of MSDintraT relative to MSDintraR is estimated to be 0.432, with 95% confidence interval (0.168, 1.115).
6.8 Discussion
In this chapter, we will walk through three examples using SAS macros and R
functions. We will present the calling codes with in-depth explanations for each
macro or function. For the first two examples, one with continuous data and one
with categorical data, we will study from the basic models to more complicated
models using the material presented in Chapters 2, 3, 5, and 6. We will also include
an individual bioequivalence example using the material presented in Chapters 2, 5,
and 6.
We have produced and made available three SAS macros: Agreement,
UnifiedAgreement, and TIR_IIR. To run one of these macros, we need
to download it from one of the following websites:
1. http://www.uic.edu/~hedayat/
2. http://mayoresearch.mayo.edu/biostat/sasmacros.cfm
These websites also contain most of the data used in this book.
An R package named Agreement is also available, containing functions
corresponding to the three SAS macros. These are also available for download
from the above websites, or they can be installed directly from the Comprehensive
R Archive Network (CRAN). In each of the following examples, we will first present
the procedure based on the SAS macros and then present the corresponding R functions.
7.1 Workshop for Continuous Data

Let us begin with the continuous data of Example 5.9.1. This example is
also presented in Example 2.8.1 using the basic model, by taking the average of the
duplicate readings for each of the HemoCue and Sigma assays. For Example 2.8.1,
we execute the macro using the %agreement statement with the parameters defined
below:
• Dataset: the name of your data. We must avoid using the dataset names c,
cc, t, tt, tb, and p, because these names are used internally by the macro.
• Y: reading of the test assay or rater, shown on the vertical axis of the
agreement plot.
• V_label: label for the vertical axis of the agreement plot.
• X: reading of the target assay or rater, shown on the horizontal axis
of the agreement plot.
• H_label: label for the horizontal axis (target) of the agreement plot.
• Min: minimum of the plotting range.
• Max: maximum of the plotting range.
• By:
– For the constant error structure, this is the increment of the plotting range.
For example, by=5.
– For the proportional error structure, these are the log-scale increments between
min and max. For example, if the data range is from 1 to 60, then min=1,
max=64, by=2 4 8 16 32.
• Error: constant or proportional error structure.
– error=const: the constant error structure. Here, TDI is expressed as an
absolute difference in the same measurement unit as the original data.
– error=prop: the proportional error structure. Here, TDI is expressed as
a percent change. The natural log transformation will be applied to the data.
• CCC_a: a CCC allowance, which can be set to missing if there is no prespecified
allowance.
• CP_a: a CP allowance, which must be specified for computing TDI.
• TDI_a: a TDI allowance, which must be specified for computing CP; it must be a
percent value when error=prop is specified or an absolute difference when
error=const is specified.
• Target: random or fixed.
• Alpha: for the 100(1 − α)% one-tailed confidence limit. The default is 0.05.
• Dec: significant digits after the decimal point printed for TDI when
error=const is specified. The default is dec=2.
The SAS macro Agreement.sas can be executed using the following code:
libname ex 'x:\xx\xxx\xxxx';
data e2_1;
set ex.example2_1;
HemocueAve=mean(of hemocue1 hemocue2);
SigmaAve=mean(of sigma1 sigma2);
run;
%inc 'x:\xxx\xxx\xxx\Agreement.sas';
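The call for the constant error structure is sketched below; the parameter values mirror those of the corresponding R call agr_c shown later in this section (plotting range 0 to 2000 in increments of 250, and a TDI allowance of 150 in the original measurement unit):

%agreement(dataset=e2_1, y=HemocueAve, x=SigmaAve,
V_label=HemoCue, H_label=Sigma, min=0, max=2000, by=250,
CCC_a=0.9775, CP_a=0.9, TDI_a=150, error=const,
target=random, dec=1, alpha=0.05);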
Suppose we would like to assume that the error structure is proportional; we then
change error=const to error=prop. In this case, we also need to change
the plotting increments to log increments by setting min=25, max=3200, and
by=50 100 200 400 800 1600, and to change the TDI allowance to a
percent change. We do not need to specify the significant digits for the TDI% output,
because it is coded to print as a percentage with two digits after the
decimal point. The code to execute the SAS macro is shown below:
%agreement(dataset=e2_1, y=HemoCueAve, x=SigmaAve,
V_label=HemoCue, H_label=Sigma, min=25, max=3200, by=50 100
200 400 800 1600, CCC_a=0.9775, CP_a=0.9, TDI_a=50,
error=prop, target=random, alpha=0.05);
We will then obtain the output shown in Table 7.1 and Fig. 7.1. From
the agreement plot in Fig. 7.1, we immediately see that the proportional error
assumption is not appropriate, because the variation is larger at the lower concentrations.
Therefore, the results shown in Table 7.1 are irrelevant. When the target values
are assumed fixed, as in Example 2.8.2, we can simply use target=fixed.
Similarly, the corresponding R function agreement for Example 2.8.1 can be
executed with the following R code:
library(Agreement);
HemocueAve=apply(Example2_1[,c("HEMOCUE1", "HEMOCUE2")],1,
mean);
SigmaAve=apply(Example2_1[,c("SIGMA1","SIGMA2")],1,mean);
agr_c=agreement(y=HemocueAve,x=SigmaAve,V_label="HemoCue",
H_label="Sigma", min=0, max=2000, by=250, CCC_a=0.9775,
CP_a=0.9,TDI_a=150,error="const", target="random", dec=1,
alpha=0.05)
html.report(agr_c, file="report_1")
agr_p=agreement(y=HemocueAve,x=SigmaAve,V_label="HemoCue",
H_label="Sigma", min=25, max=3200, by=c(50,100,200,400,800,
1600), CCC_a=0.9775, CP_a=0.9, TDI_a=50, error="prop",
target="random", alpha=0.05)
html.report(agr_p, file="report_2")
All the parameters of the R functions have the same definitions as in the SAS macro,
except that there is no dataset parameter, and some parameter values must be
enclosed in double quotation marks. The function html.report is used to generate an html
file containing the information shown in Table 2.1 and Fig. 2.1. For a detailed
explanation of the R functions, please read the R help documentation.
Table 7.1 HemoCue and Sigma readings on measuring DCLHb assuming proportional error

Statistics        CCC      Precision   Accuracy   TDI%_0.9   CP_50%   RBS
Estimate          0.9744   0.9752      0.9992     49.78      0.9001   0.02
95% Conf. limit   0.9691   0.9701      0.9976     54.06      0.8748   .
Allowance         0.9775   .           .          50.00      0.9000   .

n = 299.
The relative bias squared (RBS) must be less than 1 or 8 for a CP_a of 0.9 or 0.8,
respectively, in order for the approximated TDI to be valid. Otherwise, the TDI
estimate is conservative, depending on the RBS value.
Fig. 7.1 HemoCue and Sigma readings on measuring DCLHb assuming proportional error
To apply the unified approach of Chapter 5 to the duplicate readings (k = 2 and m = 2), we execute the macro UnifiedAgreement.sas as follows:
%UnifiedAgreement(dataset=ex.example2_1,var=hemocue1
hemocue2 sigma1 sigma2,k=2,m=2,CCC_a_intra=0.9943,
CCC_a_inter=0.9775,CCC_a_total=0.9775,
CP_a=0.9, tran=1, TDI_a_intra=75, TDI_a_inter=150,
TDI_a_total=150, error=const, dec=1, alpha=0.05);
The parameter dataset is the name of the dataset, which must contain the variables
listed in the var parameter. In the corresponding R function unified.agreement,
the order of the entries specified in var should follow the same rule as in the SAS
macro; if var is not given, the R function uses all the variables in the input dataset.
The other parameters have the same definitions as in the SAS macro. The function
html.unified_agreement is used to generate an html file containing the
summary table of unified agreement.
Table 7.2 HemoCue and Sigma readings on measuring DCLHb using GEE

Statistics        CCC      Precision   Accuracy   TDI%_0.9   CP_150   RBS
Estimate          0.9866   0.9866      1.0000     127.3      0.9474   0.00
95% Conf. limit   0.9825   0.9825      0.9987     145.9      0.9228   .
Allowance         0.9775   .           .          150.0      0.9000   .

For k = 2, n = 299, and m = 1.
The relative bias squared (RBS) must be less than 1 or 8 for a CP_a of 0.9 or 0.8,
respectively, in order for the approximated TDI to be valid.
The basic Agreement macro assumes normality for deriving the variances of the estimates of the agreement
indices, while the unified macro uses the GEE approach without assuming normality.
For more robust confidence limits, and/or when k > 2, we might need
to use the unified macro. However, to produce the agreement plots, we would
need to call the Agreement.sas macro. In addition, the definitions of precision
and accuracy are slightly different, because the unified approach assumes that the
variances of the assays or raters are equal, and it utilizes the approximation in
(5.62) for CP.
For Example 2.8.1, if we want to use the more robust GEE approach for the case
of m = 1, we can run the following code after taking the average of the replicates
for HemoCue and Sigma, as we did when running %agreement earlier:
%UnifiedAgreement(dataset=e2_1, var=HemocueAve SigmaAve,
k=2, m=1, CCC_a=0.9775,CP_a=0.9, tran=1, TDI_a=150,
error=const, dec=1,alpha=0.05);
The results are shown in Table 7.2. As expected, these are exactly the same as
the results shown in Table 5.1 under interagreement. Compared to Table 2.1, the
CCC and TDI estimates are identical, and the precision and accuracy coefficients
are almost the same. The CP estimate using GEE is 0.947, compared to 0.946
in Table 2.1. These are almost identical because the accuracy is almost perfect,
indicating that their variances are the same. The lower confidence limits for the CCC,
precision and accuracy coefficients, and CP are slightly smaller than those shown
in Table 2.1. Correspondingly, the upper confidence limit for TDI is slightly larger
than that shown in Table 2.1, indicating that for this example the GEE approach is
slightly more conservative.
Similarly, the corresponding R function for Example 2.8.1 using the unified
approach can be executed with the following code:
unified.agreement(dataset=cbind(HemocueAve,SigmaAve),
k=2,m=1,CCC_a=0.9775,CP_a=0.9, tran=1, TDI_a=150,
error="const",dec=1,alpha=0.05);
To calculate the TIR and IIR indices introduced in Chapter 6, we need to call
the SAS macro TIR_IIR.sas using the %TIR_IIR statement with the parameters
defined below:
• dataset: the name of your data. We must avoid using the dataset names a, b, c, t1,
t2, ttt, bt, one, and final, because these names are used internally by the macro.
• k: number of methods/raters/instruments/assays, etc.
• m: number of replications for each of the k raters.
• var: the dependent variable names to be evaluated, e.g., y1_1, y1_2, ..., y1_m,
y2_1, y2_2, ..., y2_m, ..., yk_1, yk_2, ..., yk_m.
• TIR_test: the selected test raters for calculating TIR; must be input in the
format ('1','2','3',...,'k'), where '1' represents the first m columns
for rater 1, '2' represents the second m columns for rater 2, and 'k' represents
the last m columns for rater k. When calculating multiple TIRs, the test raters
for calculating each TIR must be separated by #. For example, when k = 3, we
specify ('1','3')#('1','2','3')#('3')#('2')#('1','2') for
each of the five sets of test raters (see the sketch following this parameter list).
• TIR_ref: the selected reference raters for calculating TIR, corresponding
to TIR_test. If TIR_ref=(all) is specified, then the intraraters of all raters
will be used as the denominator. When calculating multiple TIRs, the
corresponding reference raters must be separated by #. For example, use
('2')#(all)#('1','2')#('1')#('1','2') to represent the five
selected sets of reference raters. When TIR_ref is not specified as (all),
each TIR is computed as the total MSD of the test versus the selected reference raters
relative to the intra MSD of the selected reference raters. When TIR_ref is
specified as (all), the macro will assess the average of the total MSD of all
raters relative to the average of the intra MSD of all raters. For the first TIR example
shown in TIR_test and TIR_ref, the macro would assess the average of
the total MSD of "raters 1 vs 2 and raters 3 vs 2" relative to the intra MSD of
"rater 2." For the second TIR example, the macro would assess the average of
the total MSD of all raters relative to the average of the intra MSD of all raters. For
the third TIR example, the macro would assess the average of the total MSD of
"raters 3 vs 1 and raters 3 vs 2" relative to the average of the intra MSD of "raters
1 and 2." For the fourth TIR example, the macro would assess the total MSD of
"raters 2 and 1" relative to the intra MSD of "rater 1." For the fifth TIR example,
the macro would assess the total MSD of "raters 1 and 2" relative to the average
of the intra MSD of "raters 1 and 2."
• IIR_test: the selected test raters for calculating IIR, which must be input in
the format ('1','2','3',...,'k'). When calculating multiple IIRs,
the test raters for calculating each IIR must be separated by #. For example,
when k = 3, specify ('1')#('2')#('3')#('1').
Table 7.3 TIR and IIR between HemoCue ('1') and Sigma ('2') with duplicates

Statistics        TIR: Total('1','2') vs (all) / Intra(all)   IIR: Intra('1') / Intra('2')
Estimate          10.094                                      0.348
95% Conf. limit   13.657                                      (0.189, 0.643)
Compared to       .                                           1.00

For k = 2, n = 299, and m = 2.
One-tailed upper limit for TIR and two-tailed interval for IIR.
• IIR_ref: the selected reference raters for calculating IIR. When calculating
multiple IIRs, the corresponding reference raters must be separated by #.
For example, specify ('2','3')#('1','3')#('1','2')#('2'). Each
set of reference raters must be mutually exclusive of its corresponding set of
selected test raters.
• Error: error=const for the constant error structure for continuous data;
error=prop for the proportional error structure for continuous data, in which case the
log transformation will be applied to the data. For categorical
data, use error=const.
• Alpha: for the 100(1 − α)% one-tailed upper confidence limit for TIR or two-tailed
confidence interval for IIR. The default is 0.05.
• TIR_a: the allowance for TIR.
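To make the # format concrete, a hypothetical call for k = 3 with duplicate readings (reusing the variable names of Example 5.9.3) that requests the first two TIR sets and the first two IIR sets from the examples above might be:

%TIR_IIR(dataset=ex.Example5_3,
var=m1_1 m1_2 m2_1 m2_2 m3_1 m3_2, k=3, m=2,
TIR_test=('1','3')#('1','2','3'), TIR_ref=('2')#(all),
IIR_test=('1')#('2'), IIR_ref=('2','3')#('1','3'),
error=const, alpha=0.05, TIR_a=.);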
To calculate the TIR and IIR for Example 5.9.1, we would execute the following
code:
%TIR_IIR(dataset=ex.example2_1, var=hemocue1 hemocue2
sigma1 sigma2, k=2, m=2, TIR_test=('1','2'), TIR_ref=(all),
IIR_test=('1'), IIR_ref=('2'),
error=const, alpha=0.05, TIR_a=.);
The results are shown in Table 7.3. The TIR of MSD_total relative to MSD_intra was
estimated to be 10.09, with a one-sided 95% upper confidence limit of 13.66, which
is much larger than any clinically meaningful criterion, as is evident from comparing
Fig. 5.3 to Figs. 5.1 and 5.2. The IIR of the MSD_intra of HemoCue relative to the MSD_intra of Sigma
was estimated to be 0.348, with a 95% confidence interval of 0.189 to 0.643, which
indicates that HemoCue had better within-assay precision than Sigma, as is evident from
comparing Figs. 5.1 and 5.2.
Similarly, the corresponding R function TIR_IIR for the comparative agreement
approach can be executed with the following code:
TIR_IIR(dataset=Example2_1,var=c("HEMOCUE1", "HEMOCUE2",
"SIGMA1", "SIGMA2"),
k=2,m=2,TIR_test=c("1,2"),TIR_ref=c("All"),IIR_test=c("1"),
IIR_ref=c("2"), error="const", alpha=0.05, TIR_a=NA);
All the parameters have the same definitions as in the SAS macro. However,
the formats of the parameters TIR_test, TIR_ref, IIR_test, and IIR_ref
are slightly different: TIR_test=c("1,2") means that the selected test raters
for calculating TIR are the first and second raters. If there are multiple TIRs,
each set of test raters must be an entry in the sequence, enclosed in double quotation
marks and separated by commas. For example, when k = 3, we may
specify TIR_test=c("1,3","1,2,3","3","2","1,2"). The formats of
TIR_ref, IIR_test, and IIR_ref are defined similarly.
7.2 Workshop for Categorical Data

We begin the workshop for categorical data with Example 5.9.3, and examine
the kappa of the three examiners based on their first readings. The frequency tables
can be seen in Tables 5.10 and 5.11. In this example, the variable names for the
three examiners and their duplicates are m1_1, m1_2, m2_1, m2_2, m3_1, m3_2.
We are now interested in the kappa of m1_1, m2_1, and m3_1 only. We then execute the
following code:
%UnifiedAgreement(dataset=ex.Example5_3, var=m1_1 m2_1
m3_1, k=3, m=1, ccc_a=., tran=0, alpha=0.05);
The results are shown in Table 7.4. These are slightly lower than those shown
under total agreement in Table 5.13. Again, the results show that the disagreement
was largely due to imprecision rather than inaccuracy. For categorical data, TDI and
CP are not meaningful and therefore are not computed.
Similarly, the corresponding R function for Example 5.9.3 using the unified
approach can be executed with the following code:
unified.agreement(dataset=Example5_3, var=c("m1_1",
"m2_1","m3_1"), k=3, m=1, CCC_a=NA, tran=0, alpha=0.05);
Table 7.5 Agreement statistics among the first two examiners based on reading 1

Statistics        CCC      Precision   Accuracy
Estimate          0.5147   0.5148      0.9998
95% Conf. limit   0.4225   0.4226      0.9982
Allowance         .        .           .

For k = 2, n = 400, and m = 1.
When k = 2, we can also compute the kappa-related results by running the SAS
procedure FREQ. To demonstrate the equivalence, we first examine the agreement
between the first readings of examiners 1 and 2 by executing the following code:
%UnifiedAgreement(dataset= ex.Example5_3, var=m1_1 m2_1,
k=2, m=1, CCC_a=., tran=0, alpha=0.05);
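This produces the results shown in Table 7.5. The kappa can then be reproduced with the SAS procedure FREQ; a minimal sketch, using the two-sided ALPHA=0.1 and the Fleiss-Cohen weight option discussed in the note below, would be:

proc freq data=ex.Example5_3;
tables m1_1*m2_1 / agree (wt=fc) alpha=0.1;
test kappa;
run;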
Note that we use an alpha value of 0.1 in PROC FREQ because we want only the one-tailed lower
confidence limit. The (WT=FC) option is not necessary in this case because this example
has a binary outcome, but we leave it there in case the data have an ordinal
outcome and we would like to use the squared distance function. The kappa and its
lower confidence limit in the resulting SAS output are exactly the same as those shown in
Table 7.5. The proof of this equivalence is given in Section 5.7.2. Note that the
SAS procedure FREQ cannot compute kappa for more than two raters.
To calculate the intraassay, interassay, and total-assay agreement indices for categorical
data with k ≥ 2 and m ≥ 1 using the data of Example 5.9.3, we execute the
following code. The results are shown in Table 5.13 of Chapter 5, and the frequency
tables can be seen in Tables 5.7–5.12.
Table 7.6 TIR and IIR among three examiners with duplicates

Statistics        TIR: Total('1','2','3') vs (all) / Intra(all)   IIR: Intra('3') / Intra('1','2')
Estimate          1.560                                           1.323
95% Conf. limit   2.376                                           (0.931, 1.881)
Compared to       2.5                                             1.00

For k = 3, n = 400, and m = 2.
One-tailed upper limit for TIR and two-tailed interval for IIR.
%UnifiedAgreement(dataset=ex.Example5_3, var=m1_1 m1_2
m2_1 m2_2 m3_1 m3_2, k=3,m=2,CCC_a_intra=0.64,
CCC_a_inter=0.51,CCC_a_total=.,tran=0, alpha=0.05);
Similarly, the corresponding R function for Example 5.9.3 with the unified
approach can be executed with the following code:
unified.agreement(dataset=Example5_3,k=3,m=2,CCC_a_intra=0.64,
CCC_a_inter=0.51, CCC_a_total=NA, tran=0, alpha=0.05);
To study the TIR and IIR information shown in Example 6.7.2 using the same
data as in Example 5.9.3, we execute the following code:
%TIR_IIR(dataset=ex.Example5_3, var=m1_1-m1_2
m2_1-m2_2 m3_1-m3_2, k=3,m=2, TIR_test=('1','2','3'),
TIR_ref=(all), IIR_test=('3'), IIR_ref=('1','2'),
error=const, alpha=0.05, TIR_a=2.5);
The results are shown in Table 7.6, with the description for TIR given in
Section 6.7.2.
Similarly, the corresponding R function for the comparative approach can be
executed with the following code:
TIR_IIR(dataset=Example5_3,var=c("m1_1","m1_2","m2_1",
"m2_2","m3_1","m3_2"), k=3, m=2,TIR_test=c("1,2,3"),
TIR_ref=c("All"), IIR_test=c("3"), IIR_ref=c("1,2"),
error="const", alpha=0.05, TIR_a=2.5);
Table 7.7 TIR and IIR between test ('1') and reference ('2') compounds with duplicates

Statistics        TIR: Total('1') vs ('2') / Intra('2')   IIR: Intra('1') / Intra('2')
Estimate          0.6907                                  0.4324
95% Conf. limit   1.0761                                  (0.1676, 1.1151)
Compared to       2.25                                    1.00

For k = 2, n = 39, and m = 2.
One-tailed upper limit for TIR and two-tailed interval for IIR.
In Example 6.7.4, we first need to compute the TDI%_0.8 between the duplicate values
of the reference compound, namely, R1 and R2. We can use either the basic macro
Agreement.sas or the unified macro UnifiedAgreement.sas. Using
the basic model, we execute the following code:
%agreement(dataset=ex.Example6_4, y=R1, x=R2,
V_label=R1, H_label=R2, min=2, max=64, by=4 8 16, CCC_a=0.9,
CP_a=0.8, TDI_a=50, error=prop, target=random, alpha=0.05);
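The TIR and IIR in Table 7.7 are then obtained from the %TIR_IIR macro; a sketch of the call, mirroring the parameters of the R TIR_IIR call at the end of this section, is:

%TIR_IIR(dataset=ex.Example6_4, var=T1 T2 R1 R2, k=2, m=2,
TIR_test=('1'), TIR_ref=('2'), IIR_test=('1'), IIR_ref=('2'),
error=prop, alpha=0.05, TIR_a=2.25);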
The results are shown in Table 7.7, with the description for the TIR given in the last
paragraph of Section 6.7.4. Similarly, the corresponding R functions for the
comparative approach can be executed with the following code:
agreement(y=Example6_4[,"R1"], x=Example6_4[,"R2"],
V_label="R1", H_label="R2", min=2, max=64, by=c(4,8,16),
CCC_a=0.9, CP_a=0.8, TDI_a=50, error="prop",
target="random", alpha=0.05);
TIR_IIR(dataset=Example6_4, var=c("T1","T2","R1","R2"),
k=2, m=2, TIR_test=c("1"), TIR_ref=c("2"), IIR_test=c("1"),
IIR_ref=c("2"), error="prop", alpha=0.05, TIR_a=2.25);
References
Carrasco, J., T. King, and V. Chinchilli. 2009. The concordance correlation coefficient for
repeated measures estimated by variance components. Journal of Biopharmaceutical Statis-
tics 19(1):90–105.
Carrasco, J. L., T. King, and V. Chinchilli. 2007. Comparison of concordance correlation
coefficient estimating approaches with skewed data. Journal of Biopharmaceutical Statistics
17:673–684.
Chen, C. and H. Barnhart. 2008. Comparison of ICC and CCC for assessing agreement for data
without and with replications. Computational Statistics & Data Analysis 53(2):554–564.
Chinchilli, V., J. Martel, S. Kumanyika, and T. Lloyd. 1996. A weighted concordance correlation
coefficient for repeated measurement designs. Biometrics 52(1):341–353.
Choudhary, P. 2007. A tolerance interval approach for assessment of agreement with left censored
data. Journal of Biopharmaceutical Statistics 17(4):583–594.
Choudhary, P. 2008. A tolerance interval approach for assessment of agreement in method
comparison studies with repeated measurements. Journal of Statistical Planning and Infer-
ence 138(4):1102–1115.
Choudhary, P. and H. Nagaraja. 2007. Tests for assessment of agreement using probability criteria.
Journal of Statistical Planning and Inference 137(1):279–290.
Christensen, R. 1997. Log-linear Models and Logistic Regression 2nd Ed. New York: Springer.
Cicchetti, D. and T. Allison. 1971. A new procedure for assessing reliability of scoring EEG sleep
recordings. American Journal of EEG Technology 11:101–109.
Cicchetti, D. and J. Fleiss. 1977. Comparison of the null distributions of weighted kappa and the
C ordinal statistic. Applied Psychological Measurement 1(2):195–201.
CLIA Final Rule 2003. CLIA programs; laboratory requirements relating to quality systems and
certain personnel qualifications. Final rule. Federal Register 68(16):3639–3714. available at
http://www.phppo.cdc.gov/clia/pdf/CMS-2226-F.pdf.
Cochran, W. 1950. The comparison of percentages in matched samples. Biometrika 37(3-4):
256–266.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological
Measurement 20(1):37–46.
Cohen, J. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or
partial credit. Psychological Bulletin 70(4):213–220.
Cox, D. and E. Snell. 1989. Analysis of Binary Data. Boca Raton: Chapman & Hall/CRC.
Darroch, J. 1981. The Mantel-Haenszel test and tests of marginal symmetry; fixed-effects and
mixed models for a categorical response. International Statistical Review 49:285–307.
Davis, C. and D. Quade. 1968. On comparing the correlations within two pairs of variables.
Biometrics 24(4):987–995.
Donner, A. and M. Eliasziw. 1992. A goodness-of-fit approach to inference procedures for
the kappa statistic: Confidence interval construction, significance-testing and sample size
estimation. Statistics in Medicine 11(11):1511–1519.
Dunnett, C. and M. Gent. 1977. Significance testing to establish equivalence between treatments,
with special reference to data in the form of 2 x 2 tables. Biometrics 33(4):593–602.
Escaramis, G., C. Ascaso, and J. Carrasco. 2010. The total deviation index estimated by tolerance
intervals to evaluate the concordance of measurement devices. BMC Medical Research
Methodology 10(1):31.
Everitt, B. 1968. Moments of the statistics kappa and weighted kappa. British Journal of
Mathematical and Statistical Psychology 21(1):97–103.
Fisher, R. A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.
Fleiss, J. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin
76(5):378–382.
Fleiss, J.L. 1973. Statistical Methods for Rates and Proportions. New York: John Wiley & Sons.
Fleiss, J. 1986. Reliability of measurement. The Design and Analysis of Clinical Experiments
1(1):1–32.
Fleiss, J. and J. Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation
coefficient as measures of reliability. Educational and Psychological Measurement 33(3):
613–619.
Fleiss, J., J. Cohen, and B. Everitt. 1969. Large sample standard errors of kappa and weighted
kappa. Psychological Bulletin 72(5):323–327.
Fleiss, J. and J. Cuzick. 1979. The reliability of dichotomous judgments: Unequal numbers of
judges per subject. Applied Psychological Measurement 3(4):537–542.
Fleiss, J. and B. Everitt. 1971. Comparing the marginal totals of square contingency tables. British
Journal of Mathematical and Statistical Psychology 24:117–123.
Fleiss, J., B. Levin, and M. Paik. 1981. Statistical Methods for Rates and Proportions, 2nd
Ed. New York: Wiley.
Fleiss, J. and P. Shrout. 1978. Approximate interval estimation for a certain intraclass correlation
coefficient. Psychometrika 43(2):259–262.
Freeman, D. 1987. Applied Categorical Data Analysis. New York: Marcel Dekker.
Friedman, M. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis
of variance. Journal of the American Statistical Association 32(200):675–701.
Goodman, L. 1978. Analyzing Qualitative/Categorical Data: Log-linear Models and Latent-
Structure Analysis. Cambridge, MA: Abt Books.
Grant, E. and R. Leavenworth. 1972. Statistical Quality Control. New York: McGraw-Hill.
Guo, Y. and A. Manatunga. 2007. Nonparametric estimation of the concordance correlation co-
efficient under univariate censoring. Biometrics 63(1):164–172.
Guo, Y. and A. Manatunga. 2009. Measuring agreement of multivariate discrete survival times
using a modified weighted kappa coefficient. Biometrics 65(1):125–134.
Haber, M. and H. Barnhart. 2008. A general approach to evaluating agreement between two
observers or methods of measurement from quantitative data with replicated measurements.
Statistical Methods in Medical Research 17(2):151–169.
Haber, M., J. Gao, and H. Barnhart. 2007. Assessing observer agreement in studies involving
replicated binary observations. Journal of Biopharmaceutical Statistics 17(4):757–766.
Haberman, S. 1974. The Analysis of Frequency Data. Chicago: University of Chicago Press.
Haberman, S. 1978. Analysis of Qualitative Data, Volume 1: Introductory Topics. New York:
Academic Press.
Haberman, S. 1979. Analysis of Qualitative Data, Volume 2: New Developments. New York:
Academic Press.
Hedayat, A., C. Lou, and B. Sinha. 2009. A Statistical Approach to Assessment of Agree-
ment Involving Multiple Raters. Communications in Statistics-Theory and Methods 38(16):
2899–2922.
Helenowski, I., E. Vonesh, H. Demirtas, A. Rademaker, V. Ananthanarayanan, P. Gann, and
B. Jovanovic. 2011. Defining Reproducibility Statistics as a Function of the Spatial Covariance
Structures in Biomarker Studies. The International Journal of Biostatistics 7(1): Article 2.
Hiriote, S. and V. Chinchilli. 2010. Matrix-based concordance correlation coefficient for repeated
measures. Biometrics 66:1–20.
Holder, D. and F. Hsuan. 1993. Moment-based criteria for determining bioequivalence. Biometrika
80(4):835–846.
King, T. and V. Chinchilli. 2001a. A generalized concordance correlation coefficient for continuous
and categorical data. Statistics in Medicine 20(14):2131–2147.
King, T. and V. Chinchilli. 2001b. Robust estimators of the concordance correlation coefficient.
Journal of Biopharmaceutical Statistics 11(3):83–105.
King, T., V. Chinchilli, and J. Carrasco. 2007. A repeated measures concordance correlation co-
efficient. Statistics in Medicine 26(16):3095–3113.
King, T., V. Chinchilli, K. Wang, and J. Carrasco. 2007. A class of repeated measures concordance
correlation coefficients. Journal of Biopharmaceutical Statistics 17(4):653–672.
Koch, G., J. Landis, J. Freeman, D. Freeman Jr, and R. Lehnen. 1977. A general methodology for
the analysis of experiments with repeated measurement of categorical data. Biometrics 33(1):
133–158.
Landis, J. and G. Koch. 1977a. A one-way components of variance model for categorical data.
Biometrics 33(4):671–679.
Landis, J. and G. Koch. 1977b. An application of hierarchical kappa-type statistics in the
assessment of majority agreement among multiple observers. Biometrics 33(2):363–374.
Landis, J. and G. Koch. 1977c. The measurement of observer agreement for categorical data.
Biometrics 33(1):159–174.
Landis, J., T. Sharp, S. Kuritz, and G. Koch. 1998. Mantel–Haenszel methods. Encyclopedia of
Biostatistics 3:2378–2391.
Li, R. and M. Chow. 2005. Evaluation of reproducibility for paired functional data. Journal of
Multivariate Analysis 93(1):81–101.
Liang, K. and S. Zeger. 1986. Longitudinal data analysis using generalized linear models. Bio-
metrika 73(1):13–22.
Lin, L. 1989. A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1):
255–268.
Lin, L. 1992. Assay validation using the concordance correlation coefficient. Biometrics 48(2):
599–604.
Lin, L. 2000. Total deviation index for measuring individual agreement with applications in
laboratory performance and bioequivalence. Statistics in Medicine 19(2):255–270.
Lin, L. 2003. Measuring agreement. Encyclopedia of Biopharmaceutical Statistics, 561–567.
Lin, L. 2008. Overview of agreement statistics for medical devices. Journal of Biopharmaceutical
Statistics 18(1):126–144.
Lin, L. and V. Chinchilli. 1997. Rejoinder to the letter to the editor from Atkinson and Nevill.
Biometrics 53(2):777–778.
Lin, L., A. Hedayat, B. Sinha, and M. Yang. 2002. Statistical Methods in Assessing Agreement.
Journal of the American Statistical Association 97(457):257–270.
Lin, L., A. Hedayat, and Y. Tang. 2012. A comparison for measuring individual agreement. Journal
of Biopharmaceutical Statistics, accepted for publication.
Lin, L., A. Hedayat, and W. Wu. 2007. A unified approach for assessing agreement for continuous
and categorical data. Journal of Biopharmaceutical Statistics 17(4):629–652.
Lin, L. and L. Torbeck. 1998. Coefficient of accuracy and concordance correlation coefficient: new
statistics for methods comparison. PDA Journal of Pharmaceutical Science and Technology
52(2):55–59.
Lin, L. and E. Vonesh. 1989. An empirical nonlinear data-fitting approach for transforming data to
normality. American Statistician 43(4):237–243.
Linn, S. 2004. A new conceptual approach to teaching the interpretation of clinical tests. Journal
of Statistics Education 12(3):1–9.
Liu, X., Y. Du, J. Teresi, and D. Hasin. 2005. Concordance correlation in the measurements of time
to event. Statistics in Medicine 24(9):1409–1420.
Lou, C. 2006. Assessment of Agreement, PhD Thesis. University of Illinois at Chicago.
Madansky, A. 1963. Tests of homogeneity for correlated samples. Journal of the American
Statistical Association 58(301):97–119.
McNemar, Q. 1947. Note on the sampling error of the difference between correlated proportions
or percentages. Psychometrika 12(2):153–157.
Quiroz, J. 2005. Assessment of equivalence using a concordance correlation coefficient in a repea-
ted measurements design. Journal of Biopharmaceutical Statistics 15(6):913–928.
Quiroz, J. and R. Burdick. 2009. Assessment of Individual Agreements with Repeated Measure-
ments Based on Generalized Confidence Intervals. Journal of Biopharmaceutical Statistics
19(2):345–359.
Robieson, W. 1999. On Weighted Kappa and Concordance Correlation Coefficient, PhD Thesis.
Chicago: University of Illinois.
Rodary, C., C. Com-Nougue, and M. Tournade. 1989. How to establish equivalence between
treatments: A one-sided clinical trial in paediatric oncology. Statistics in Medicine 8(5):
593–598.
Schuster, C. 2001. Kappa as a parameter of a symmetry model for rater agreement. Journal of
Educational and Behavioral Statistics 26(3):331–342.
Serfling, R. 1980. Approximation Theorems of Mathematical Statistics. New York: John Wiley &
Sons.
Sheiner, L. 1992. Bioequivalence revisited. Statistics in Medicine 11(13):1777–1788.
Shoukri, M. M. 2004. Measures of Interobserver Agreement. Chapman and Hall.
Shoukri, M. and V. Edge. 1996. Statistical Methods for Health Sciences. New York: CRC Press.
Shoukri, M. and S. Martin. 1995. Maximum likelihood estimation of the kappa coefficient from
models of matched binary responses. Statistics in Medicine 14(1):83–99.
Shrout, P. and J. Fleiss. 1979. Intraclass correlations: uses in assessing rater reliability.
Psychological Bulletin 86(2):420–428.
Tang, Y. 2010. A Comparison Model for Measuring Individual Agreement, PhD Thesis. Chicago:
University of Illinois.
von Eye, A. and E. Mun. 2005. Analyzing Rater Agreement: Manifest Variable Methods. New
Jersey: Lawrence Erlbaum.
von Eye, A. and C. Schuster. 2000. Log-linear model for rater agreement. Multiciencia 4:38–56.
Vonesh, E. and V. Chinchilli. 1997. Linear and Nonlinear Models for the Analysis of Repeated
Measurements. New York: Marcel Dekker.
Vonesh, E., V. Chinchilli, and K. Pu. 1996. Goodness-of-fit in generalized nonlinear mixed-effects
models. Biometrics 52(2):572–587.
Wang, W. and J. Gene Hwang. 2001. A nearly unbiased test for individual bioequivalence problems
using probability criteria. Journal of Statistical Planning and Inference 99(1):41–58.
Westlake, W. 1976. Symmetrical confidence intervals for bioequivalence trials. Biometrics 32(4):
741–744.
Williamson, J., S. Crawford, and H. Lin. 2007. Resampling dependent concordance correlation
coefficients. Journal of Biopharmaceutical Statistics 17(4):685–696.
Williamson, J., S. Lipsitz, and A. Manatunga. 2000. Modeling kappa for measuring dependent
categorical agreement data. Biostatistics 1(2):191–202.
Wu, W. 2005. A Unified Approach for Assessing Agreement, PhD Thesis. Chicago: University of
Illinois.
Yang, M. 2002. Universal Optimality in Crossover Design and Statistical Methods in Assessing
Agreement, PhD Thesis. Chicago: University of Illinois.
Yang, J. and V. Chinchilli. 2009. Fixed-effects modeling of Cohen’s kappa for bivariate multino-
mial data. Communications in Statistics–Theory and Methods 38:3634–3653.
Yang, J. and V. Chinchilli. 2011. Fixed-effects modeling of Cohen's weighted kappa for bivariate
multinomial data. Computational Statistics and Data Analysis 55:1061–1070.
Zeger, S. and K. Liang. 1986. Longitudinal data analysis for discrete and continuous outcomes.
Biometrics 42(1):121–130.
Zeger, S., K. Liang, and P. Albert. 1988. Models for longitudinal data: a generalized estimating
equation approach. Biometrics 44(4):1049–1060.
Zhong, J. 2001. Optimal and Efficient Nonlinear Design and Solutions with Interpretations to
Individual Bioequivalence, PhD Thesis. Chicago: University of Illinois.
Index
A
accuracy, 1–2
accuracy coefficient
  basic, 13
  generalized or overall, 89
  interrater, 79
  total-rater, 81
agreement, 1–2
association, 61, 68

C
coefficient of
  individual agreement (CIA), 124
  individual bioequivalence (IBC) by FDA, 123
concordance correlation coefficient (CCC)
  basic model, 12–13
  generalized or overall, 88, 91–92
  interrater (CCC_inter), 79
  intrarater (CCC_intra), 77
  total-rater (CCC_total), 80
correlation coefficient, see precision coefficient
coverage probability (CP)
  basic model, 10
  generalized or overall, 88
  interrater, 78
  intrarater, 77
  total-rater, 80

E
estimate of
  accuracy coefficient
    basic, 14, 41, 45
    generalized or overall, 88
    interrater, 87
    total-rater, 87
  concordance correlation coefficient (CCC)
    basic, 14, 38, 42
    generalized or overall, 88
    interrater (CCC_inter), 87
    intrarater (CCC_intra), 87
    total-rater (CCC_total), 87
  coverage probability (CP)
    basic, 40, 44
    generalized or overall, 88
    interrater, 86, 87
    intrarater, 86, 87
    total-rater, 86, 87
  intra–intra ratio (IIR), 127
  mean squared deviation (MSD)
    basic, 7, 39, 43
    generalized or overall, 88
    interrater, 85
    intrarater, 85
    total-rater, 85
  precision coefficient
    basic, 13
    generalized or overall, 88
    interrater, 87
    intrarater, 87
    total-rater, 87
  total deviation index (TDI)
    basic, 9
    generalized or overall, 88
    interrater, 85, 86
    intrarater, 85, 86
    total-rater, 85, 86
  total–intra ratio (TIR)
    when no reference exists, 126
    when reference exists, 124
  weighted kappa, 58

G
generalized estimating equations (GEE)
  by Barnhart, Haber, and Song (2002), 47–48
  comparative model, 115–121
  unified model, 81–84

I
intra–intra ratio (IIR), 126
intraclass correlation coefficient (ICC), 3, 11

K
kappa, 57–59

M
mean squared deviation (MSD)
  basic model, 7, 56
  generalized or overall, 88
  interrater, 78, 114
  intrarater, 77, 113
  total-rater, 80, 113
model of
  basic, 7
  comparative, 112
  unified, 75
  when m = 1, 87

P
precision, 1–2
precision coefficient
  basic, 13–14
  generalized or overall, 88
  interrater, 79
  intrarater, 77
  total-rater, 81
proportional error structure
  basic model, 16
  comparative model, 112
  unified model, 81

R
relative bias squared (RBS)
  basic, 9
  interrater, 79
  total-rater, 80

S
sample size and power
  general case, 71–72
  simplified case, 72

T
target values
  fixed, 1
  random, 1
total deviation index (TDI)
  basic, 8, 9
  generalized or overall, 88
  interrater, 78
  intrarater, 77
  total-rater, 80
total–intra ratio (TIR)
  when no reference exists, 123
  when reference exists, 122

U
U-statistics for CCC, 46–47, 91–92

V
variance components, 76
variance of the estimate of
  accuracy coefficient
    basic, 14, 42, 45
    generalized or overall, 90
    interrater, 87
    total-rater, 87
  concordance correlation coefficient (CCC)
    basic, 14, 39, 43
    generalized or overall, 90
    interrater, 86, 87
    intrarater, 86, 87
    total-rater, 86, 87
  coverage probability (CP)
    basic, 10, 40, 41, 44
    generalized or overall, 90
    interrater, 86, 87
    intrarater, 86, 87
    total-rater, 86, 87
  intra–intra ratio (IIR), 127
  mean squared deviation (MSD)
    basic, 8, 40, 44
    generalized or overall, 90
    interrater, 85
    intrarater, 85
    total-rater, 85, 86
  precision coefficient
    basic, 14
    generalized or overall, 90
    interrater, 86, 87
    total-rater, 87
  total deviation index (TDI)
    basic, 9
    generalized or overall, 90
    interrater, 86
    intrarater, 86
    total-rater, 86
  total–intra ratio (TIR)
    when no reference exists, 126
    when reference exists, 125
  weighted kappa, 59

W
weight function
  Cicchetti and Allison, 58
  Fleiss and Cohen, 57, 59
weighted kappa, 57–59