
Statistical Tools for Measuring Agreement

Lawrence Lin · A.S. Hedayat · Wenting Wu

Statistical Tools for


Measuring Agreement

Lawrence Lin
Baxter International Inc., WG3-2S
Rt. 120 and Wilson Rd.
Round Lake, IL 60073, USA
lawrence lin@baxter.com

A.S. Hedayat
Department of Mathematics, Statistics
and Computer Science
University of Illinois, Chicago
851 S. Morgan St.
Chicago, IL 60607-7045, USA
hedayat@uic.edu

Wenting Wu
Mayo Clinic
200 First Street SW.
Rochester, MN 55905, USA
wu.wenting@mayo.edu

ISBN 978-1-4614-0561-0 e-ISBN 978-1-4614-0562-7


DOI 10.1007/978-1-4614-0562-7
Springer New York Dordrecht Heidelberg London

Library of Congress Control Number: 2011935222

© Springer Science+Business Media, LLC 2012


All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


To
Sha-Li, Juintow, Buortau, and Shintau Lin
Batool, Leyla, and Yashar Hedayat
Xujian and MingEn Li
Preface

Agreement assessments are widely used in assessing the acceptability of a new
or generic process, methodology, and/or formulation in areas of lab performance,
instrument/assay validation or method comparisons, statistical process control,
goodness-of-fit, and individual bioequivalence. Successful applications in these
situations require a sound understanding of both the underlying theory and practical
problems in real life. This book seeks to blend theory and applications effectively
and to present these two aspects with many practical examples.
The common theme in agreement assessment is to assess the agreement between
observations of an assay or rater (Y) and their target (reference) counterpart values
(X). Target values may be considered random or fixed. Random target values are
measured with random error. Common random target values are the gold standard
of measurements, being both well established and widely acceptable. Sometimes
we may also be interested in comparing two methods without a designated gold-standard
method, or in comparing two technicians, times, reagents, or the like by
the same method. Common fixed target values are the expected values or known
values, which will be discussed in the most basic model presented in Chapters 2
and 3.
When there is a disagreement between methods, we need to know whether the
source of the disagreement is due to a systematic shift (bias) or random error.
Specific coefficients of accuracy and precision will be introduced to characterize
these sources. This is particularly important in the medical-device environment,
because a systematic shift usually can be easily fixed through calibration, while
reducing a random error usually requires a more cumbersome variation-reduction exercise.
We will consider unscaled (absolute) and scaled (relative) agreement statistics
for both continuous and categorical variables. Unscaled agreement statistics are
independent of between-sample variation, while scaled agreement statistics are
relative to the between-sample variance. For continuous variables with proportional
error, we can often simply apply a log transformation to the data and evaluate
percent changes rather than absolute differences. In practically all estimation cases,
the statistical inference for parameter estimates will be discussed.


This book should appeal to a broad range of statisticians, researchers, practitioners,
and students, in areas such as biomedical devices, psychology, and medical
research in which agreement assessment is needed. Knowledge of regression, cor-
relation, the asymptotic delta method, U-statistics, generalized estimation equations
(GEE), and the mixed-effect model would be helpful in understanding the material
presented and discussed in this book.
In Chapter 1, we will discuss definitions of precision, accuracy, and agreement,
and discuss the pitfalls of some misleading approaches for continuous data.
In Chapter 2, we will start with the basic scenario of assessing agreement of two
assays or raters, each with only one measurement for continuous data. In this basic
scenario, we will consider the case of random or fixed target values for unscaled
(absolute) and scaled (relative) indices with constant or proportional error structure.
In Chapter 3, we will introduce traditional approaches for categorical data with
the basic scenario for unscaled and scaled indices. In terms of scaled agreement
statistics, we will present the convergence of approaches for categorical and con-
tinuous data, and their association with a modified intraclass correlation coefficient.
The information in this chapter and Chapter 2 sets the stage for discussing unified
approaches in Chapters 5 and 6. In both Chapters 2 and 3, a wealth of references
to the basic model of agreement assessment is available. We will provide brief
tours of related publications in these two chapters.
In Chapter 4, we will discuss sample size and power calculations for the basic
models for continuous data. We will also introduce a simplified approach that
is applicable to continuous and categorical data. We will present many practical
examples in which we know only the most basic historical information such as
residual variance or coefficient of variation.
In Chapter 5, we will consider a unified approach to evaluating agreement among
multiple (k) raters, each with multiple replicates (m), for both continuous and
categorical data. Under this general setting, intrarater precision, interrater agreement
based on the average of m readings, and total-rater agreement based on individual
readings will be discussed.
In Chapter 6, we will consider a flexible and general setting in which
the agreement of certain cases can be compared relative to the agreement of a
chosen case. For example, to assess individual bioequivalence, we are interested in
assessing the agreement of test and reference compounds relative to the agreement
of the within-reference compound. As another example, in the medical-device
environment, we often want to know whether the within-assay agreement of a newly
developed assay is better than that of an existing assay. Both Chapters 5 and 6 are
applicable to continuous and categorical data.
In Chapter 7, we will present a workshop using a continuous data set, a
categorical data set, and an individual bioequivalence data set as examples. We will
then address the use of SAS and R macros and the interpretation of the outputs from
the most basic cases to more comprehensive cases.
This book is concise and concentrates on topics primarily based on the authors’
research. However, proofs that were omitted from our published articles will be
presented, and all other related tools will be well referenced. Many practical
examples will be presented throughout the book in a wide variety of situations for
continuous and categorical data.
A book such as this cannot have been written without substantial assistance from
others. We are indebted to the many contributors who have developed the theory and
practice discussed in this book. We also would like to acknowledge our appreciation
of the students at the University of Illinois at Chicago (UIC) who helped us in many
ways. Specifically, six PhD dissertations on agreement subjects have been produced
by Robieson (1999), Zhong (2001), Yang (2002), Wu (2005), Lou (2006) and Tang
(2010). Their contributions have been the major sources for this book. Most of the
typing using MikTeX was performed by the UIC PhD student Mr. Yue Yu, who also
double-checked the accuracy of all the formulas.
We would like to mention that we have found the research into theory and
application performed by Professors Tanya King, of the Pennsylvania State Hershey
College of Medicine; Vernon Chinchilli, of the Pennsylvania State University
College of Medicine; and Huiman Barnhart, of the Duke Clinical Research Institute,
truly inspirational. Their work has influenced our direction for developing the
materials of our book. We are also indebted to Professor Phillip Schluter, of the
School of Public Health and Psychosocial Studies at AUT University, New Zealand,
for his permission to use the data presented in Examples 5.9.3 and 6.7.2 prior to
their publication.
Finally, all SAS and R macros and most data in the examples are provided at the
web sites shown below:
1. http://www.uic.edu/hedayat/
2. http://mayoresearch.mayo.edu/biostat/sasmacros.cfm
The U.S. National Science Foundation supported this project under Grants
DMS-06-03761 and DMS-09-04125.

Round Lake, IL, USA Lawrence Lin


Chicago, IL, USA Samad Hedayat
Rochester, MN, USA Wenting Wu
Contents

1 Introduction
   1.1 Precision, Accuracy, and Agreement
   1.2 Traditional Approaches for Continuous Data
   1.3 Traditional Approaches for Categorical Data
2 Continuous Data
   2.1 Basic Model
   2.2 Absolute Indices
      2.2.1 Mean Squared Deviation
      2.2.2 Total Deviation Index
      2.2.3 Coverage Probability
   2.3 Relative Indices
      2.3.1 Intraclass Correlation Coefficient
      2.3.2 Concordance Correlation Coefficient
   2.4 Sample Counterparts
   2.5 Proportional Error Case
   2.6 Summary of Simulation Results
   2.7 Asymptotic Power and Sample Size
   2.8 Examples
      2.8.1 Example 1: Methods Comparison
      2.8.2 Example 2: Assay Validation
      2.8.3 Example 3: Assay Validation
      2.8.4 Example 4: Lab Performance Process Control
      2.8.5 Example 5: Clinical Chemistry and Hematology Measurements That Conform to CLIA Criteria
   2.9 Proofs of Asymptotical Normality When Target Values Are Random
      2.9.1 CCC and Precision Estimates
      2.9.2 MSD Estimate
      2.9.3 CP Estimate
      2.9.4 Accuracy Estimate
   2.10 Proofs of Asymptotical Normality When Target Values Are Fixed
      2.10.1 CCC and Precision Estimates
      2.10.2 MSD Estimate
      2.10.3 CP Estimate
      2.10.4 Accuracy Estimate
   2.11 Other Estimations and Statistical Inference Approaches
      2.11.1 U-Statistic for CCC
      2.11.2 GEE for CCC
      2.11.3 Mixed Effect Model for CCC
      2.11.4 Other Methods for TDI and CP
   2.12 Discussion
      2.12.1 Absolute Indices
      2.12.2 Relative Indices Scaled to the Data Range
      2.12.3 Variances of Index Estimates Under Random and Fixed Target Values
      2.12.4 Repeated Measures CCC
      2.12.5 Data Transformations
      2.12.6 Missing Data
      2.12.7 Account for Covariates
   2.13 A Brief Tour of Related Publications
3 Categorical Data
   3.1 Basic Approach When Target Values Are Random
      3.1.1 Data Structure
      3.1.2 Absolute Indices
      3.1.3 Relative Indices: Kappa and Weighted Kappa
      3.1.4 Sample Counterparts
      3.1.5 Statistical Inference on Weighted Kappa
      3.1.6 Equivalence of Weighted Kappa and CCC
      3.1.7 Weighted Kappa as Product of Precision and Accuracy Coefficients
      3.1.8 Intraclass Correlation Coefficient and Its Association with Weighted Kappa and CCC
      3.1.9 Rater Comparison Example
   3.2 Basic Approaches When Target Values Are Fixed: Absolute Indices
      3.2.1 Sensitivity and Specificity
      3.2.2 Diagnostic Test Example
   3.3 Discussion
   3.4 A Brief Tour of Related Publications
4 Sample Size and Power
   4.1 The General Case
   4.2 The Simplified Case
   4.3 Examples Based on the Simplified Case
5 A Unified Model for Continuous and Categorical Data
   5.1 Definition of Variance Components
   5.2 Intrarater Precision
   5.3 Interrater Agreement
   5.4 Total-Rater Agreement
   5.5 Proportional Error Case
   5.6 Asymptotic Normality
   5.7 The Case m = 1
      5.7.1 Other Estimation and Statistical Inference Approaches
      5.7.2 Variances of CCC and Weighted Kappa for k = 2
   5.8 Summary of Simulation Results
   5.9 Examples
      5.9.1 Example 1: Methods Comparison
      5.9.2 Example 2: Assay Validation
      5.9.3 Example 3: Nasal Bone Image Assessment by Ultrasound Scan
      5.9.4 Example 4: Accuracy and Precision of an Automatic Blood Pressure Meter
   5.10 Discussion
      5.10.1 Relative or Scaled Indices
      5.10.2 Absolute or Unscaled Indices
      5.10.3 Covariate Adjustment
      5.10.4 Future Research Topics and Related Publications
6 A Comparative Model for Continuous and Categorical Data
   6.1 General Model
   6.2 MSD for Continuous and Categorical Data
      6.2.1 Intrarater Precision
      6.2.2 Total-Rater Agreement
      6.2.3 Interrater Agreement
      6.2.4 Categorical Data
   6.3 GEE Estimation
   6.4 Comparison of Total-Rater Agreement with Intrarater Precision: Total–Intra Ratio
      6.4.1 When One or Multiple References Exist
      6.4.2 Comparison to FDA's Individual Bioequivalence with Relative Scale
      6.4.3 Comparison to Coefficient of Individual Agreement
      6.4.4 Estimation and Asymptotic Normality
   6.5 Comparison of Intrarater Precision Among Selected Raters: Intra–Intra Ratio
      6.5.1 Estimation and Asymptotic Normality
   6.6 Summary of Simulation Results
   6.7 Examples
      6.7.1 Example 1: TIR and IIR for an Automatic Blood Pressure Meter
      6.7.2 Example 2: Nasal Bone Image Assessment by Ultrasound Scan
      6.7.3 Example 3: Validation of the Determination of Glycine on a Spectrophotometer System
      6.7.4 Example 4: Individual Bioequivalence
   6.8 Discussion
7 Workshop
   7.1 Workshop for Continuous Data
      7.1.1 The Basic Model (m = 1)
      7.1.2 Unified Model
      7.1.3 TIR and IIR
   7.2 Workshop for Categorical Data
      7.2.1 The Basic Model (m = 1)
      7.2.2 Unified Model
      7.2.3 TIR and IIR
   7.3 Individual Bioequivalence
References
Index
Symbols Used and Abbreviations

In this book, we use a Greek letter (symbol) to represent a parameter to be estimated,
and we use its respective English letter or the symbol with a hat to represent its
sample counterpart or estimate. The exception is that we use $\bar{X}$ to represent the
sample mean, due to the long history of that convention. When a transformation is
performed, we use an uppercase letter to represent a transformed estimate. However,
there are some complicated computational formulas in which we use uppercase
letters to simplify the computation. In the sequel, we use Greek letters to represent
parameters when the target value X is considered random. When the target value
is considered fixed, we add $|X$ as a subscript to the corresponding parameter. For
example, $\varepsilon^2_{|X}$ represents the mean squared deviation (MSD) when the target value X
is assumed fixed. We use a boldface symbol or letter to represent a vector or matrix.
Symbols and their corresponding definitions are listed below:

$\varepsilon^2$  Mean squared deviation
$\delta_{\pi_0}$  Total deviation index
$\pi_{\delta_0}$  Coverage probability
$\rho_c$  Concordance correlation coefficient
$\nu$  Location shift
$\omega$  Scale shift
$\chi_a$  Accuracy coefficient
$\rho$  Precision coefficient
$\kappa$  Kappa
$\kappa_w$  Weighted kappa
TIR  Total-rater MSD to intrarater MSD ratio
IIR  Intrarater MSD to intrarater MSD ratio
$\mu_d^2/\sigma_d^2$  Relative bias squared
Abbreviations used in this book are (in alphabetical order):

CCC: Concordance correlation coefficient
CDF: Cumulative distribution function
CIA: Coefficient of individual agreement
CL: Confidence limit
CLIA: Clinical Laboratory Improvement Amendments
CP: Coverage probability
GEE: Generalized estimation equations
GM: Geometric mean
ICC: Intraclass correlation coefficient
IIR: Intrarater MSD to intrarater MSD ratio
ML: Maximum likelihood
MLE: Maximum likelihood estimate
MSD: Mean squared deviation
PT: Proficiency testing
PTC: Proficiency testing criterion
RBS: Relative bias squared
RML: Restricted maximum likelihood
RMLE: Restricted maximum likelihood estimate
SD: Standard deviation
TDI: Total deviation index based on absolute difference
TDI%: Total deviation index based on percent change
TIR: Total-rater MSD to intrarater MSD ratio
Chapter 1
Introduction

Consider the problems of assessing the acceptability of a new or generic process,
methodology, and/or formulation in areas of lab performance, instrument/assay
validation or method comparisons, statistical process control, goodness-of-fit, and
individual bioequivalence. The common theme is to assess the agreement between
observations (Y) and their corresponding target values (X). Target values may be
considered random or fixed. Commonly used random target values are the gold
standard measurements, which are proven and widely acceptable. Commonly used
fixed target values are the expected or known values. We might be interested in
comparing two methods without a designated gold standard method. Sometimes,
we may also be interested in comparing a newly developed assay that is alleged
to be more precise and accurate than a designated gold standard assay. Within a
method, we might be interested in comparing technicians/times/reagents.
For simplicity and ease of reference, we will use the terms assays and raters
to represent assays, raters, instruments, methods, etc. Also, we will use the term
samples to designate samples, patients, animals, or subjects. In the tradition of
the subject matter, we use throughout this book the terms index and coefficient
interchangeably.
Figure 1.1 presents a typical situation for assessing agreement. When we plot the
observed values on the y-axis versus the corresponding target values over a desirable
range on the x-axis, we would like to see agreement in the paired data so that the
observations fall closely along the identity line, which is the straight line with zero
intercept and unit slope. When there is evidence of disagreement, it is important to
address the issue and search for the sources of that disagreement.

1.1 Precision, Accuracy, and Agreement

Fig. 1.1 Assessing agreement of observed values (new) and target values (gold standard method)

Generally, the common basic sources of disagreement come from within-sample
variation (imprecision) and/or a shift in the marginal distributions (inaccuracy).
Fixing imprecision is a within-sample variance-reduction exercise in the
medical-device or engineering environment, which is typically more cumbersome
than fixing inaccuracy. The cause of inaccuracy (systematic bias) is most likely a calibration
problem in measuring devices. In the published literature and regulatory documents,
we find the following definitions for precision, accuracy, and agreement.
The Food and Drug Administration (FDA) guidelines on bioanalytical method
validation (http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM070107.pdf)
define accuracy as the deviation of the mean
from the true value (trueness), while the International Organization for Standardization
(ISO) 5725 (1994) defines accuracy as both “trueness” and “precision.” When there
is no hard true value, we can loosely define improved accuracy as less bias.
The FDA defines precision as “the closeness of agreement (degree of scatter)
between a series of measurements obtained from multiple sampling of the same
homogeneous sample under the prescribed conditions,” while the ISO defines
precision as the closeness of agreement between independent test results obtained
under stipulated conditions, measured from within-sample variation. The FDA further
defines “precision under tight conditions (intra-batch, inter-batch, true replicates)”
as repeatability, and “precision across labs” as reproducibility. Other precision
definitions could include intraclass correlation coefficient (ICC) for measuring
reliability in the social sciences.
Agreement is often defined as having both precision and accuracy, and is a
function of the absolute difference between paired readings. There is confusion
among the above definitions. For example, the ISO’s definition of accuracy is
conflated with the definition of agreement. We adopt the accuracy and precision
definitions that correspond to the FDA’s. There is a host of terminology related
to precision, such as repeatability, reproducibility, robustness, validity, and reliability. We shall try to keep
the terminology simple and straightforward, such as agreement/precision/accuracy
across assays/days/technicians/labs.

1.2 Traditional Approaches for Continuous Data

Traditionally, the agreement between observed and target values has been assessed
by a paired t-test, the results of which could be misleading. A slightly better
approach in the assessment of agreement is to test the linear least squares estimates
against the identity line (a straight line with zero intercept and unit slope). We will
denote this method by LS01. Furthermore, the target values have been assumed
fixed in the LS01 (regression) analysis, even when the target values are obviously
random. These two rudimentary approaches capture only the accuracy information
relative to the precision. The potential for obtaining misleading results using these
two approaches can be illustrated graphically by Fig. 1.2.
Another popular traditional method for assessing agreement of observed and
target values is the Pearson correlation coefficient. However, this coefficient is only
a measure of precision. The potential for obtaining misleading results using this ap-
proach can be illustrated by Fig. 1.3. In addition, this coefficient has frequently been
misused to assess linearity, which should have been assessed through goodness-
of-fit statistics. Other traditionally used approaches for assessing agreement of
observed and target values have included the coefficient of variation and the mean
square error of the regression analysis, which are measures of precision only.
Perhaps the most valid traditional approach for assessing agreement of observed
and target values has been the intraclass correlation coefficient (ICC). The ICC in
its original form (Fisher 1925) is the ratio of the between-sample variance to the total
(within + between) variance under the model of equal marginal distributions. This
original ICC was intended to measure precision only. This coefficient is invariant
with respect to the interchanges of Y and X values within any pairs. Several forms
of ICC have evolved. In particular, Bartko (1966), Shrout and Fleiss (1979), Fleiss
(1986), and Brennan (2001) have put forth various reliability assessments. We will
discuss ICC in greater detail in Section 2.3.1 and Chapter 3. In Chapter 5, we will

Fig. 1.2 Situations in which a paired t-test or least squares test against the identity line (LS01)
can be misleading: (a) rejected by paired t-test/LS01; (b) accepted by paired t-test but rejected
by LS01; (c) accepted by paired t-test/LS01

Fig. 1.3 Situations in which the Pearson correlation coefficient can be misleading

introduce some special forms of ICC under the general case that correspond to
agreement, precision, and accuracy for both continuous and categorical data, based
on a paper by Lin, Hedayat, and Wu (2007).
It should be pointed out that the traditional hypothesis-testing approach is
not appropriate for assessing agreement except for the cases that will be pre-
sented in Chapter 6. In traditional hypothesis testing, the rejection region (alter-
native hypothesis) is the region for declaring a difference based on strong evidence
presented in the data. Failing to reject the null hypothesis does not imply accepting
agreement, but implies a lack of evidence for declaring a difference. The proper
setting for assessing agreement is to reverse the null and alternative hypotheses, so
that the conventional rejection region actually is the region for declaring agreement.
Therefore, we would reject the null hypothesis of a difference and accept the
alternative hypothesis of agreement based on strong evidence presented in the data
(Dunnett and Gent 1977; Bross 1985; Rodary, Com-Nougue, and Tournade 1989;
Lin 1992). With the given criterion and the same precision of the data, the larger
the sample size, the easier it should be to accept the agreement. Here, a meaningful
criterion for an acceptable difference should be prespecified, and the hypothesis
testing should be one-sided. Indeed, the proper hypothesis-testing approach is
equivalent to computing the one-sided confidence limit. If this limit were better
than the prespecified criterion, we would accept the agreement. We will use the
confidence limit approach in this book for simplicity. However, for the sample size
and power calculation, we will use the proper hypothesis-testing approach.
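
To make this rule concrete, here is a minimal R sketch with purely hypothetical numbers; “better than the criterion” is taken here to mean smaller, as for a deviation-type index.

# Confidence-limit rule for declaring agreement (hypothetical numbers).
estimate  <- 0.12   # estimated disagreement index
se        <- 0.02   # standard error of the estimate
criterion <- 0.15   # prespecified acceptable difference

upper_cl <- estimate + qnorm(0.95) * se   # one-sided 95% upper limit
upper_cl < criterion                      # TRUE: declare agreement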

1.3 Traditional Approaches for Categorical Data

An example of categorical response is for two or more assays to assign a sample’s
condition according to a binary scale of no/yes or normal/not normal, or to an
ordinal scale of fair, mild, serious, critical, or life-threatening. A nonordinal
(nominal) scale in assessing agreement is less often encountered in practical situations.

Compared to approaches for continuous data, there have been fewer misleading
approaches for categorical data. The most popular approach for assessing agreement
began with kappa (Cohen 1960) and weighted kappa (Cohen 1968; Fleiss, Cohen,
and Everitt 1969). The kappa coefficients assess nonchance (chance-corrected)
agreement relative to the total nonchance agreement. There is a long history of
valid tools available for assessing marginal equivalence, association, and agreement.
These will be referenced in Chapter 3.
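
For a binary scale, the flavor of the kappa computation can be sketched in R. This sketch assumes only the standard textbook definition, kappa = (po − pe)/(1 − pe), and is not one of the macros referenced later in this book.

# Cohen's kappa for a square contingency table: po is the observed
# agreement and pe the chance agreement implied by the marginals.
kappa_cohen <- function(tab) {
  tab <- tab / sum(tab)                      # cell proportions
  po  <- sum(diag(tab))                      # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab))    # chance-expected agreement
  (po - pe) / (1 - pe)
}

# Hypothetical 2 x 2 rating table (rows: rater Y; columns: rater X)
tab <- matrix(c(40, 5, 10, 45), nrow = 2, byrow = TRUE)
kappa_cohen(tab)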
Chapter 2
Continuous Data

We will now introduce new approaches that have evolved for measuring agreement
since 1988. Some of these new approaches were summarized, studied, and com-
pared by Lin, Hedayat, Sinha, and Yang (2002). Here, we include the necessary
proofs that were left out of that article. In addition, we include assorted examples to
demonstrate agreement techniques. We begin with the most basic model, in which
paired observations (Y and X) are collected.

2.1 Basic Model

When target values are random, the joint distribution of Y and X is assumed to be
a bivariate distribution with finite second moments, with means $\mu_y$ and $\mu_x$, variances
$\sigma_y^2$ and $\sigma_x^2$, and covariance $\sigma_{yx}$. When target values are fixed, $Y_i \mid X_i$, $i = 1, \ldots, n$, are
assumed to be observations in a random sample from the basic regression model
$Y = \beta_0 + \beta_1 X + e_Y$. Here, $e_Y$ is the residual error with mean 0 and variance $\sigma_e^2$.

2.2 Absolute Indices

2.2.1 Mean Squared Deviation

Mean squared deviation (MSD) evaluates an aggregated deviation from the identity
line, $\mathrm{MSD} = E(Y - X)^2$. It can be expressed as

$\varepsilon^2 = (\mu_y - \mu_x)^2 + \sigma_y^2 + \sigma_x^2 - 2\sigma_{yx}$,  (2.1)

when target values are random, or

$\varepsilon_{|X}^2 = (\mu_{y|X} - \bar{X})^2 + s_x^2(1 - \beta_1)^2 + \sigma_e^2$,  (2.2)

when target values are fixed, where $\bar{X}$ and $s_x^2$ are the sample mean and variance
of X.
Estimated by sample counterparts ($e^2$ or $e_{|X}^2$) with a log transformation, $W = \ln(e^2)$
or $W_{|X} = \ln(e_{|X}^2)$ has an asymptotic normal distribution with mean $\ln(\varepsilon^2)$ or
$\ln(\varepsilon_{|X}^2)$ and variance

$\sigma_W^2 = \dfrac{2}{n-2}\left[1 - \dfrac{(\mu_y - \mu_x)^4}{\varepsilon^4}\right]$,  (2.3)

when target values are random, or

$\sigma_{W|X}^2 = \dfrac{2}{n-2}\left[1 - \dfrac{(\varepsilon_{|X}^2 - \sigma_e^2)^2}{\varepsilon_{|X}^4}\right]$,  (2.4)

when target values are fixed.


The proof of (2.3) can be found in Section 2.9.2. The proof of (2.4) can be found
in Section 2.10.2. For statistical inference, refer to Section 2.4 for the information
regarding sample counterparts, and to Sections 2.9 and 2.10 for the proofs of
asymptotic normality in this chapter.
The MSD is not an easy index to interpret. The following methods will put some
meaningful interpretation on this basic MSD index.
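
As a sketch of how (2.1) and (2.3) are used in practice, the following R fragment computes the sample MSD, its log transform W, and a one-sided upper confidence limit; the data are simulated and hypothetical, and the divisor n − 1 follows Section 2.4.

set.seed(1)
x <- rnorm(50, 100, 10)    # hypothetical target values
y <- x + rnorm(50, 1, 5)   # hypothetical observations
n <- length(x)

e2 <- sum((y - x)^2) / (n - 1)   # sample MSD
W  <- log(e2)

# Sample counterpart of the variance in (2.3)
var_w <- 2 / (n - 2) * (1 - mean(y - x)^4 / e2^2)

# One-sided 95% upper confidence limit for the MSD
exp(W + qnorm(0.95) * sqrt(var_w))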

2.2.2 Total Deviation Index

To examine the agreement from a different perspective, a measure that captures
a large proportion ($\pi_0$) of data within a boundary ($\delta_0$) from target values was
considered by Lin, Hedayat, Sinha, and Yang (2002). For example, we may want
to capture at least 90% of individual observations that are within 10% of their
target values. We would compute the total deviation index (TDI) for the given coverage
probability (CP) criterion of 0.9 to see whether this TDI is less than 10%, or compute
the coverage probability (CP) for the given TDI criterion of 10% to see whether this CP
is more than 0.9.
Assume that $D = Y - X$ has a normal distribution with mean $\mu_d = \mu_y - \mu_x$
and variance $\sigma_d^2 = \sigma_y^2 + \sigma_x^2 - 2\sigma_{xy}$. We may find $\pi$ for a given $\delta_0$ criterion, $\mathrm{CP}_{\delta_0}$,
which is

$\pi_{\delta_0} = P(D^2 < \delta_0^2) = \chi^2\left(\dfrac{\delta_0^2}{\sigma_d^2};\, 1,\, \dfrac{\mu_d^2}{\sigma_d^2}\right)$,  (2.5)

where $\chi^2(\cdot)$ is the cumulative noncentral chi-square distribution up to $\delta_0^2/\sigma_d^2$, with one
degree of freedom and noncentrality parameter $\mu_d^2/\sigma_d^2$. This measure will be presented
shortly.

We may also find $\delta$ for a given $\pi_0$ criterion, $\mathrm{TDI}_{\pi_0}$, which is

$\delta_{\pi_0} = \sigma_d \sqrt{(\chi^2)^{-1}\left(\pi_0;\, 1,\, \dfrac{\mu_d^2}{\sigma_d^2}\right)}$,  (2.6)

where $(\chi^2)^{-1}$ is the inverse function of $\chi^2(\cdot)$. Since the estimate of this index has
intractable asymptotic properties, Lin (2000) and Lin, Hedayat, Sinha, and Yang
(2002) have suggested the following $\mathrm{TDI}_{\pi_0}$ approximation:

$\delta_{\pi_0}^2 \approx (\chi^2)^{-1}(\pi_0;\, 1)\,\varepsilon^2$,  (2.7)

or

$\delta_{\pi_0} \approx \Phi^{-1}\left(1 - \dfrac{1 - \pi_0}{2}\right) |\varepsilon|$,  (2.8)

when X is random, or

$\delta_{\pi_0|X} \approx \Phi^{-1}\left(1 - \dfrac{1 - \pi_0}{2}\right) |\varepsilon_{|X}|$,  (2.9)

when X is fixed.
The approximation is satisfactory (Lin 2000) when:
1. $\pi_0 = 0.75$ and RBS $\leq 1/2$,
2. $\pi_0 = 0.8$ and RBS $\leq 8$,
3. $\pi_0 = 0.85$ and RBS $\leq 2$,
4. $\pi_0 = 0.9$ and RBS $\leq 1$,
5. $\pi_0 = 0.95$ and RBS $\leq 1/2$.
The quantity $\mu_d^2/\sigma_d^2$ is called the relative bias squared (RBS). The interpretation
of this approximated TDI is that approximately $100\pi_0\%$ of observations are
within $\delta_{\pi_0}$ of the target values. $\mathrm{TDI}_{\pi_0}^2$ is proportional to MSD, and therefore we
may perform an inference based on the asymptotic normality of $W = \ln(e^2)$, where
$e^2$ is the sample counterpart of MSD when X is random, or $W_{|X} = \ln(e_{|X}^2)$ when
X is fixed. This simplified method will become very useful when we deal with the
more general case to be introduced in Chapter 5.
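
Both the exact and the approximate TDI are easy to compute in R, whose qchisq accepts a noncentrality argument. The following is a minimal sketch (not one of the book’s macros), with hypothetical values of $\mu_d$ and $\sigma_d$:

tdi <- function(pi0, mu_d, sigma_d) {
  # Exact TDI from (2.6)
  exact  <- sigma_d * sqrt(qchisq(pi0, df = 1, ncp = (mu_d / sigma_d)^2))
  # Approximation from (2.8), with epsilon^2 = mu_d^2 + sigma_d^2
  msd    <- mu_d^2 + sigma_d^2
  approx <- qnorm(1 - (1 - pi0) / 2) * sqrt(msd)
  c(exact = exact, approx = approx)
}

tdi(pi0 = 0.9, mu_d = 1, sigma_d = 5)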
The idea of using such an approximation of TDI was motivated by Holder and
Hsuan (1993). They proposed a moment-based criterion for assessing individual
bioequivalence. They showed that, in a slightly different fashion, $\delta_{\pi_0}^2$, or the squared
function of (2.6), has an upper bound $\delta_{\pi_0+}^2 = c_0 \varepsilon^2$, where $c_0$ is a constant not
depending on $\mu_d$ and $\sigma_d$. Therefore, $\delta_{\pi_0+}$ conservatively captures at least $100\pi_0\%$
of observations within the boundary from target values of a reference compound.
Holder and Hsuan (1993) used a numerical algorithm for the determination of $c_0$
under some parametric and nonparametric distributions of $D = Y - X$.

However, the asymptotic distribution property of this estimate has not been
established. Lin (2000) made a comparison between this statistic and the TDI
given in (2.7). When $\pi_0 = 0.9$, $\delta_{\pi_0}^2$ and $\delta_{\pi_0+}^2$ are identical under the normality
assumption. Using $\mathrm{TDI}_{0.9}$ is almost exact when the RBS is less than 1, and would become
conservative otherwise. Using $\mathrm{TDI}_{0.8}$ is most robust, since it can tolerate an RBS
value as high as 8.0.
A TDI is similar in concept to a tolerance limit. The difference is that a tolerance
limit captures individual deviations from their own mean, while a TDI captures
individual deviations from their target values, for a high proportion (say, 90%), and
with a high degree of confidence (say, 95%) when the upper confidence limit of TDI
is used.

2.2.3 Coverage Probability

We now consider finding $\pi$ for a given $\delta_0$ criterion. This is

$\pi_{\delta_0} = P(D^2 < \delta_0^2) = \chi^2\left(\dfrac{\delta_0^2}{\sigma_d^2};\, 1,\, \dfrac{\mu_d^2}{\sigma_d^2}\right)$,  (2.10)

when target values are random, or

$\pi_{\delta_0|X} = \dfrac{1}{n} \sum_{i=1}^{n} \pi_{\delta_0}(i)$,  (2.11)

where

$\pi_{\delta_0}(i) = \chi^2\left(\dfrac{\delta_0^2}{\sigma_e^2};\, 1,\, \left[\dfrac{\beta_0 + (\beta_1 - 1)X_i}{\sigma_e}\right]^2\right)$,  (2.12)

when target values are fixed. The estimate of coverage probability using sample
counterparts ($p_{\delta_0}$ or $p_{\delta_0|X}$) by the logit transformation, $T = \ln \dfrac{p_{\delta_0}}{1 - p_{\delta_0}}$ or
$T_{|X} = \ln \dfrac{p_{\delta_0|X}}{1 - p_{\delta_0|X}}$, has an asymptotic normal distribution with mean $\ln \dfrac{\pi_{\delta_0}}{1 - \pi_{\delta_0}}$ or
$\ln \dfrac{\pi_{\delta_0|X}}{1 - \pi_{\delta_0|X}}$, and variance

$\sigma_T^2 = \dfrac{0.5\left[\delta_+ \phi(\delta_+) + \delta_- \phi(\delta_-)\right]^2 + \left[\phi(\delta_+) - \phi(\delta_-)\right]^2}{(n-3)(1 - \pi_{\delta_0})^2 \pi_{\delta_0}^2}$,  (2.13)

when X is random, or

$\sigma_{T|X}^2 = \dfrac{\dfrac{C_0^2}{n^2} + \dfrac{(C_0 \bar{X} - C_1)^2}{n^2 s_x^2} + \dfrac{C_2^2}{2n^2}}{(n-3)(1 - \pi_{\delta_0|X})^2 \pi_{\delta_0|X}^2}$,  (2.14)

when X is fixed, where

$\delta_+ = \dfrac{\delta_0 + \mu_d}{\sigma_d}$,  (2.15)

$\delta_- = \dfrac{\delta_0 - \mu_d}{\sigma_d}$,  (2.16)

$\delta_{+\beta i} = \dfrac{\delta_0 + \beta_0 + (\beta_1 - 1)X_i}{\sigma_e}$,  (2.17)

$\delta_{-\beta i} = \dfrac{\delta_0 - \beta_0 - (\beta_1 - 1)X_i}{\sigma_e}$,  (2.18)

$C_0 = \sum_{i=1}^{n} \phi(\delta_{+\beta i}) - \phi(\delta_{-\beta i})$,  (2.19)

$C_1 = \sum_{i=1}^{n} \left[\phi(\delta_{+\beta i}) - \phi(\delta_{-\beta i})\right] X_i$,  (2.20)

$C_2 = \sum_{i=1}^{n} \delta_{+\beta i}\,\phi(\delta_{+\beta i}) + \delta_{-\beta i}\,\phi(\delta_{-\beta i})$,  (2.21)

and $\phi(\cdot)$ is the standard normal density function.


The proof of (2.13) can be found in Section 2.9.3. The proof of (2.14) can be
found in Section 2.10.3.
We can also use the simplified method of

$\pi_{\delta_0} \approx \chi^2\left(\dfrac{\delta_0^2}{\varepsilon^2};\, 1\right)$,  (2.22)

which will become very useful when we deal with the more general case to be
introduced in Chapter 5.
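
A parallel R sketch for the coverage probability, contrasting the exact form (2.10) with the simplified form (2.22); the inputs are again hypothetical.

cp <- function(delta0, mu_d, sigma_d) {
  # Exact CP from (2.10)
  exact  <- pchisq(delta0^2 / sigma_d^2, df = 1, ncp = (mu_d / sigma_d)^2)
  # Simplified CP from (2.22), with epsilon^2 = mu_d^2 + sigma_d^2
  approx <- pchisq(delta0^2 / (mu_d^2 + sigma_d^2), df = 1)
  c(exact = exact, approx = approx)
}

cp(delta0 = 10, mu_d = 1, sigma_d = 5)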

2.3 Relative Indices

2.3.1 Intraclass Correlation Coefficient

Pearson (1899, 1901) developed the intraclass correlation coefficient (ICC) as a
way to estimate various aspects of fraternal resemblance. Given pairs of brothers,
one might be interested not in the correlation in height between the older and the
younger brother, or the taller and the shorter brother, but simply between brothers in
general. In this case, the heights of the two brothers are logically interchangeable.
Pearson suggested that this correlation could be estimated by entering the height
measurements for each pair of brothers, (x, y), twice into the computation of the
usual product-moment correlation coefficient, once in the order (x, y) and once
in the order (y, x). If there are more than two brothers in each set, each possible pair
of measurements is entered twice into the computation. Thus, the number of entries
in the correlation for this set is $n(n-1)$, where n is the number of brothers in the
data set.
Harris (1913) developed a simple formula for intraclass correlation as a function
of (a) the variance of the means of each set of measurements around the overall
mean and (b) the variance of the total set of measurements.
Fisher (1925) observed that variance measurements could be partitioned into two
components. The first component is the between-sample variance after removing
the residual variance, which Fisher called A. The second component is the residual
variance or within-sample variance, which Fisher called B. Thus the population
intraclass correlation can be expressed as

$\rho_I = \dfrac{A}{A+B}$.  (2.23)

Fisher (1925) noted that the ICC could be estimated using mean squares from
an analysis of variance (ANOVA). We will revisit ICC in Chapter 3, where we will
show its association with kappa, weighted kappa, and the concordance correlation
coefficient (CCC) presented below. We will also revisit, in Chapter 5, the general
form of the ICC for agreement, precision, and accuracy coefficients.
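
Fisher’s estimator can be sketched in R from one-way ANOVA mean squares; the mapping from mean squares to A and B below is the usual components-of-variance identity with m readings per sample, an assumption of this sketch rather than a formula quoted from the text.

icc_anova <- function(value, sample_id, m) {
  fit <- anova(lm(value ~ factor(sample_id)))
  msb <- fit["factor(sample_id)", "Mean Sq"]   # between-sample mean square
  msw <- fit["Residuals", "Mean Sq"]           # within-sample mean square
  A   <- (msb - msw) / m                       # between-sample variance
  A / (A + msw)                                # (2.23)
}

# Hypothetical data: 20 samples, m = 2 readings each
set.seed(2)
id <- rep(1:20, each = 2)
v  <- rnorm(20, 50, 8)[id] + rnorm(40, 0, 3)
icc_anova(v, id, m = 2)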

2.3.2 Concordance Correlation Coefficient

The MSD can be standardized such that 1 indicates that each pair of readings
is in perfect agreement in the population (for example, (1,1), (2,2), (3,3), (4,4), (5,5)),
0 indicates no correlation, and −1 means that each pair of readings is in
perfect reversed agreement in the population (for example, (5,1), (4,2), (3,3), (2,4),
(1,5)). Lin (1989) introduced one such standardization of MSD, called the CCC, which
is defined as

$\rho_c = 1 - \dfrac{\varepsilon^2}{\varepsilon_{|\rho=0}^2}$  (2.24)

$\quad = 1 - \dfrac{\varepsilon^2}{\sigma_y^2 + \sigma_x^2 + (\mu_y - \mu_x)^2} = \dfrac{2\sigma_{yx}}{\sigma_y^2 + \sigma_x^2 + (\mu_y - \mu_x)^2} = \dfrac{2\rho\sigma_x\sigma_y}{\sigma_y^2 + \sigma_x^2 + (\mu_y - \mu_x)^2}$  (2.25)

when X is random, and

$\rho_{c|X} = \dfrac{2\beta_1 s_x^2}{\sigma_y^2 + s_x^2 + (\mu_y - \bar{X})^2}$,  (2.26)

when X is fixed. The CCC is closely related to the intraclass correlation and has
a meaningful geometrical interpretation. It is inversely related to the squared ratio
of the within-sample total deviation ($\varepsilon$) to the total deviation ($\varepsilon_{|\rho=0}$).
For example, if the within-sample total deviation is 10%, 32%, or 45% of the total
deviation, then the CCC is $0.99 = 1 - 0.1^2$, $0.90 = 1 - 0.32^2$, or $0.80 =
1 - 0.45^2$, respectively. In Chapter 3, we will show that for ordinal categorical
data, the CCC degenerates into the weighted kappa suggested by Cohen (1968).
Section 1.1 defined accuracy and precision in the one-dimensional situation.
According to the two-dimensional model of Section 2.1, the between-sample
variation is typically inherited or is a result of the design of the sampling process,
and is usually unrelated to within-sample precision of an assay. Therefore, we
consider the difference in between-sample variance as a systematic bias, and it is
included in the inaccuracy. A sample mean and sample variance define a marginal
distribution in most of the commonly used distributions.

2.3.2.1 Accuracy Coefficient

The accuracy coefficient measures the closeness of the marginal distributions of Y
and X, where 1 signifies equal means and variances, and 0 indicates that the absolute
difference in means and/or variances approaches infinity. The accuracy coefficient can
be broken down into measures of location and/or scale shifts, where the location
shift is $\nu = \dfrac{\mu_y - \mu_x}{\sqrt{\sigma_y \sigma_x}}$ and the scale shift is $\omega = \sigma_y/\sigma_x$ (or $\sigma_y/s_x$ when X is fixed). Here, the accuracy
coefficient is defined as

$\chi_a = \dfrac{2}{\omega + 1/\omega + \nu^2}$,  (2.27)

when X is random. We can replace $\sigma_x^2$ by $s_x^2$, and $\mu_x$ by $\bar{X}$, in (2.27) when X is
fixed.

2.3.2.2 Precision Coefficient

The precision coefficient is the Pearson correlation coefficient ($\rho$) between Y and
X, where

$\rho = \dfrac{\sigma_{yx}}{\sigma_y \sigma_x}$.  (2.28)

Here $\rho^2$ has the same scale as the accuracy coefficient, from 0 (no agreement) to 1
(perfect agreement). It is evident from (2.25), (2.26), and (2.27) that the CCC is the
product of the precision and accuracy coefficients when target values are random
or fixed.
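
The identity CCC = precision × accuracy can be checked numerically. A minimal R sketch with simulated (hypothetical) data, using divisor-n moments as in Section 2.4:

ccc <- function(y, x) {
  sxy <- mean((y - mean(y)) * (x - mean(x)))
  sy2 <- mean((y - mean(y))^2)
  sx2 <- mean((x - mean(x))^2)
  rho <- sxy / sqrt(sy2 * sx2)                    # precision, (2.28)
  nu2 <- (mean(y) - mean(x))^2 / sqrt(sy2 * sx2)  # squared location shift
  om  <- sqrt(sy2 / sx2)                          # scale shift
  chi_a <- 2 / (om + 1 / om + nu2)                # accuracy, (2.27)
  c(ccc = rho * chi_a, precision = rho, accuracy = chi_a)
}

set.seed(3)
x <- rnorm(30, 100, 10)
y <- 2 + 0.98 * x + rnorm(30, 0, 4)
ccc(y, x)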

2.3.2.3 Statistical Inference on CCC and the Accuracy and Precision Coefficients

The estimate of CCC ($r_c$ or $r_{c|X}$) using sample counterparts by the Z-transformation,
$Z = \frac{1}{2}\ln\frac{1+r_c}{1-r_c}$ or $Z_{|X} = \frac{1}{2}\ln\frac{1+r_{c|X}}{1-r_{c|X}}$, has an asymptotic normal distribution with
mean $\frac{1}{2}\ln\frac{1+\rho_c}{1-\rho_c}$ or $\frac{1}{2}\ln\frac{1+\rho_{c|X}}{1-\rho_{c|X}}$, and variance

$\sigma_Z^2 = \dfrac{1}{n-2}\left[\dfrac{(1-\rho^2)\rho_c^2}{(1-\rho_c^2)\rho^2} + \dfrac{2\rho_c^3(1-\rho_c)\nu^2}{\rho(1-\rho_c^2)^2} - \dfrac{\rho_c^4\nu^4}{2\rho^2(1-\rho_c^2)^2}\right]$  (2.29)

or

$\sigma_{Z|X}^2 = \dfrac{(1-\rho^2)\rho_c^2}{(n-2)(1-\rho_c^2)^2}\left[\omega^2\rho_c^2 + \dfrac{(1-\rho_c\omega)^2}{\rho^2} + \dfrac{\omega^2\rho_c^2(1-\rho^2)}{2\rho^2}\right]$,  (2.30)

when X is random or fixed, respectively.
The estimate of the accuracy coefficient ($c_a$ or $c_{a|X}$) using sample counterparts
by the logit transformation, $L = \ln\frac{c_a}{1-c_a}$ or $L_{|X} = \ln\frac{c_{a|X}}{1-c_{a|X}}$, has an asymptotic
normal distribution with mean $\ln\frac{\chi_a}{1-\chi_a}$ or $\ln\frac{\chi_{a|X}}{1-\chi_{a|X}}$, and variance

$\sigma_L^2 = \dfrac{\chi_a^2\nu^2(\omega + 1/\omega - 2) + \frac{1}{2}\chi_a^2(\omega^2 + 1/\omega^2 + 2\rho^2) + (1+\rho^2)(\chi_a\nu^2 - 1)}{(n-2)(1-\chi_a)^2}$  (2.31)

or

$\sigma_{L|X}^2 = \dfrac{\nu^2\omega\chi_a^2(1-\rho^2) + \frac{1}{2}(1-\omega\chi_a)^2(1-\rho^4)}{(n-2)(1-\chi_a)^2}$.  (2.32)

The estimate of the precision coefficient ($r$ or $r_{|X}$) by the Z-transformation has
an asymptotic normal distribution with mean $\frac{1}{2}\ln\frac{1+\rho}{1-\rho}$ or $\frac{1}{2}\ln\frac{1+\rho_{|X}}{1-\rho_{|X}}$, and variance
$\frac{1}{n-3}$ when target values are random, or $\frac{1-\rho^2/2}{n-3}$ when target values are fixed. The
proofs of (2.29) and (2.30) can be found in Sections 2.9.1 and 2.10.1. The proofs of
(2.31) and (2.32) can be found in Sections 2.9.4 and 2.10.4.
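
As a sketch of this inference for the random-target case, the Z-transform limit can be coded directly from (2.29); the inputs are the CCC and precision estimates and the squared location shift ν². With the rounded inputs of Table 2.1 (Example 1 in Section 2.8.1), this returns approximately 0.984, matching the reported lower limit.

ccc_lower_cl <- function(rc, r, nu2, n, alpha = 0.05) {
  z <- 0.5 * log((1 + rc) / (1 - rc))
  var_z <- (1 / (n - 2)) * (
    (1 - r^2) * rc^2 / ((1 - rc^2) * r^2) +
    2 * rc^3 * (1 - rc) * nu2 / (r * (1 - rc^2)^2) -
    rc^4 * nu2^2 / (2 * r^2 * (1 - rc^2)^2)
  )
  zl <- z - qnorm(1 - alpha) * sqrt(var_z)
  (exp(2 * zl) - 1) / (exp(2 * zl) + 1)   # back-transform the Z limit
}

ccc_lower_cl(rc = 0.9866, r = 0.9867, nu2 = 0, n = 299)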

2.4 Sample Counterparts

For the purpose of statistical inference, the parameters discussed above can be
replaced with their sample estimates that are consistent estimators, such as the
moment estimates.
The sample counterparts for $\mu_y$, $\mu_x$, $\sigma_y^2$, $\sigma_x^2$, $\sigma_{yx}$, $\beta_1$, and $\rho$ are

$\bar{Y} = \dfrac{1}{n}\sum_{i=1}^{n} Y_i$,  (2.33)

$\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$,  (2.34)

$s_y^2 = \dfrac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$,  (2.35)

$s_x^2 = \dfrac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$,  (2.36)

$s_{yx} = \dfrac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})$,  (2.37)

$b_1 = \dfrac{s_{yx}}{s_x^2}$,  (2.38)

and

$r = \dfrac{s_{yx}}{s_y s_x}$.  (2.39)

We could certainly use $\frac{1}{n-1}$ rather than $\frac{1}{n}$ in the above variance and covariance
estimates. However, we use $\frac{1}{n}$ to bound the CCC estimate by 1.0.
In estimating the MSD, we use the sum of squared differences divided by $n - 1$.
For (2.17) and (2.18) when X is fixed, we use

$s_e^2 = \dfrac{n}{n-3}(1 - r^2)s_y^2$.  (2.40)

For less bias in estimating the RBS, and in estimating $\sigma_d^2$ in (2.15) and (2.16), we use

$s_d^2 = \dfrac{n}{n-3}\left(s_x^2 + s_y^2 - 2s_{xy}\right)$.  (2.41)

The use of $n - 2$ or $n - 3$ instead of n in the denominators of the above variance
equations is for the small-sample-size bias correction based on the simulation
studies in Lin, Hedayat, Sinha, and Yang (2002). The use of sample-size bias
correction is not important when the sample size is large.

For the purpose of performing statistical inference for each index, we should
compute the confidence limit (the lower limit for the CCC, precision coefficient,
accuracy coefficient, and CP, and the upper limit for the TDI) based on its respective
transformation, and then back-transform the limit. We will declare that the assay agreement
is acceptable when the limit is better than the prespecified criterion. The use
of transformed estimates can speed up the approach to normality. Moreover, a
transformation can bound the confidence interval to its respective parameter
range, say, −1 to 1 for the CCC and precision coefficient, 0 to 1 for the accuracy
coefficient and CP, and 0 to infinity for the MSD.
Throughout this book, once the asymptotic normality of an estimated index has
been defined, statistical inference can be established through confidence limit(s).
Let $\hat{\theta}$ be the estimate of an agreement index and $\sigma_{\hat{\theta}}^2$ its variance. Then the one-sided
upper or lower confidence limit becomes

$\hat{\theta} + \Phi^{-1}(1-\alpha)\,\hat{\sigma}_{\hat{\theta}}$ or $\hat{\theta} - \Phi^{-1}(1-\alpha)\,\hat{\sigma}_{\hat{\theta}}$,

where $\hat{\sigma}_{\hat{\theta}}$ is the estimate of the square root of the variance, $\sigma_{\hat{\theta}}$, using sample
counterparts. When the sample size is small, say less than 30, we can also use the cutoff
value of the cumulative central t-distribution instead of the standard cumulative
normal distribution to form the statistical inference.
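
The generic limit above can be wrapped as a small R helper; using n − 2 degrees of freedom for the small-sample t cutoff is an illustrative assumption, since the appropriate degrees of freedom depend on the index.

one_sided_cl <- function(est, se, n, alpha = 0.05, upper = TRUE) {
  # Normal cutoff for large samples, t cutoff for small ones
  cut <- if (n < 30) qt(1 - alpha, df = n - 2) else qnorm(1 - alpha)
  if (upper) est + cut * se else est - cut * se
}

one_sided_cl(est = 2.5, se = 0.06, n = 25, upper = FALSE)  # hypothetical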

2.5 Proportional Error Case

When Y and X are positively valued variables and the standard deviations of Y are
proportional to either Y or X, it is assumed that $\ln(Y)$ and $\ln(X)$ have a bivariate
normal distribution. Let $100\Delta\%$ be the percent change in Y and X. Then

$\pi = P\left(\dfrac{1}{1+\Delta} < \dfrac{Y}{X} < 1+\Delta\right) = P\left[\,|\ln(Y) - \ln(X)| < \ln(1+\Delta)\,\right]$.  (2.42)

Let $D = \ln(Y) - \ln(X)$ and $\delta_0 = \ln(1+\Delta_0)$. Then $100\Delta_0\% = 100[\exp(\delta_0) - 1]\%$.
This $100\Delta_0\%$ is denoted by $\mathrm{TDI\%}_{\pi_0}$.
In the case of proportional errors, all of the above unscaled and scaled agreement
indices should be computed from the log-transformed data. In practice, we have
encountered the proportional error case more frequently than the constant error case.
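
A short R sketch of the percent-change conversion: a log-scale boundary $\delta_0$ corresponds to a percent boundary of $100[\exp(\delta_0) - 1]\%$; the log-scale TDI value below is hypothetical.

delta0  <- log(1 + 0.10)   # 10% boundary on the original scale
tdi_log <- 0.08            # hypothetical TDI estimate on the log scale
c(boundary_pct = 100 * (exp(delta0) - 1),
  tdi_pct      = 100 * (exp(tdi_log) - 1))   # TDI% of about 8.3%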

2.6 Summary of Simulation Results

A comprehensive simulation was conducted in Lin, Hedayat, Sinha, and Yang
(2002) to study the small-sample properties of the transformed estimates of the
CCC, precision coefficient, accuracy coefficient, TDI, and CP. The results showed
excellent agreement with the theoretical values from normal samples even when
$n = 15$. However, these estimates are not expected to be robust against outliers or
large deviations from normality or log-normality. The robustness issues of the CCC
have been addressed in King and Chinchilli (2001a, 2001b), using M-estimation or
using a power function of the absolute value of D to compute the CCC.

2.7 Asymptotic Power and Sample Size

In assessing agreement, the null and alternative hypotheses should be reversed. The
conventional rejection region actually is the region of declaring agreement (one-
sided). Asymptotic power and sample size calculation should proceed by the above
principle. These powers of CCC, TDI, and CP were compared in Lin, Hedayat,
Sinha, and Yang (2002). The results showed that the TDI and CP estimates have
similar power, and are superior to CCC, but they are valid only under the normality
assumption. Therefore, for inference, TDI and CP are superior to CCC. However,
the CCC and precision and accuracy coefficients remain very useful and informative
tools, as is evident from the following examples. In Chapter 4, we will discuss the
sample size subject in greater detail.

2.8 Examples

2.8.1 Example 1: Methods Comparison

This example was presented in Lin, Hedayat, Sinha, and Yang (2002). DCLHb is
a treatment solution containing oxygen-carrying hemoglobin. The DCLHb level
in a patient’s serum is routinely measured by the Sigma method. The simpler
HemoCue method was modified to reproduce the DCLHb values of the Sigma
method. Serum samples from 299 patients over a 50–2,000 mg/dL range were
collected. The DCLHb values of each sample were measured by both methods
twice, and the averages of the duplicate values were evaluated. The client required
with 95% confidence that the within-sample total deviation be less than 15% of
the total deviation. This means that the allowable CCC was $1 - 0.15^2 = 0.9775$.
The client also needed with 95% confidence that at least 90% of the HemoCue
observations be within 150 mg/dL of the targeted Sigma values. This means that the
allowable $\mathrm{TDI}_{0.9}$ was 150 mg/dL, or that the allowable $\mathrm{CP}_{150}$ was 0.9.
The results are presented in Fig. 2.1 and Table 2.1. The plot indicates that the
within-sample error is relatively constant across the clinical range. The plot also
indicates that the HemoCue accuracy is excellent and that the precision is adequate.

Fig. 2.1 HemoCue and Sigma readings on measuring DCLHb

Table 2.1 Agreement statistics for HemoCue and Sigma readings on measuring DCLHb

Statistics       CCC     Precision    Accuracy     TDI_0.9  CP_150  RBS
                         coefficient  coefficient
Estimate         0.9866  0.9867       0.9999       127.5    0.9463  0.00
95% Conf. limit  0.9838  0.9839       0.9989       136.4    0.9276  –
Allowance        0.9775  –            –            150.0    0.9000  –

“–” means “not applicable”

The CCC estimate is 0.987, which means that the within-sample total deviation is
about 11.6% of the total deviation. The CCC one-sided lower confidence limit is
0.984, which is greater than 0.9775. The precision coefficient estimate is 0.987 with
a one-sided lower confidence limit of 0.984. The accuracy coefficient estimate is
0.9999 with a one-sided lower confidence limit of 0.9989. The $\mathrm{TDI}_{0.9}$ estimate is
127.5 mg/dL, which means that 90% of HemoCue observations are within 127.5
mg/dL of their target values. The one-sided upper confidence limit for $\mathrm{TDI}_{0.9}$ is
136.4 mg/dL, which is less than 150 mg/dL. Finally, the $\mathrm{CP}_{150}$ estimate is 0.946,
which means that 94.6% of HemoCue observations are within 150 mg/dL of their
target values. The one-sided lower confidence limit for $\mathrm{CP}_{150}$ is 0.928, which
is greater than 0.9. Therefore, the agreement between HemoCue and Sigma is
acceptable, with excellent accuracy and adequate precision. The relative bias squared
is estimated to be near zero, indicating that the approximation of TDI should be
excellent.

2.8.2 Example 2: Assay Validation

This example was presented in Lin, Hedayat, Sinha, and Yang (2002). FVIII is a
clotting agent in plasma. The FVIII assay uses a marker with varying dilutions of
known FVIII activities to form a standard curve. The assay started at 1W5 or 1W10
2.8 Examples 19

Fig. 2.2 Observed FVIII assay results versus targeted values started at 1W5

serial dilutions were prepared until they reached the target values. Target values
were fixed at 3%, 8%, 38%, 91%, and 108%. Six samples were assayed per target
value. The error was expected to be proportional mainly due to dilutions. The client
needed with 95% confidence that the within-sample total deviation be less than 15%
of the total deviation. This means that the allowable CCC was 1  0:152 D 0:9775.
The client also needed with 95% confidence that 80% of FVIII observations be
within 50% of target values (note that this is the percentage of the measuring unit,
which is also a percentage). This means that the allowable TDI%0:8 was 50%, or
that the allowable CP50% was 0:8.
Figures 2.2 and 2.3 present the results started at 1W5 and at 1W10 serial dilutions
for these plots of observed FVIII assay results versus targeted values in log2 scale.
Note that there are overlying observations in the plots. Specifically, in Fig. 2.2, four
replicate readings of 3% and duplicate readings of 2% are observed at the target
value of 3%, and circles at the target value of 8% represent duplicate readings of
8%, 9%, and 10%. Duplicate readings of 45% are observed at target values of 38%.
Also note that in Fig. 2.3, four replicate readings of 5% and duplicate readings of 4%
are observed at the target value of 3%. Three replicate readings of 11% and duplicate
readings of 12% are observed at the target value of 8%. Duplicate readings of 49%
are observed at target values of 38%. Duplicate readings of 124% are observed at
target values of 91%. The plots indicate that the within-sample error is relatively
constant across the target values in log scale. The precision is good for both assays
started at 1W5 and at 1W10 serial dilutions, but the accuracy is not as good for the
assay started at 1W10 serial dilutions.
Tables 2.2 and 2.3 present the agreement statistics started at 1W5 and 1W10 serial
dilutions. For the assay started at 1W5 serial dilutions, the CCC is estimated to be
0:992, which means that the within-sample total deviation is about 9:1% of the
20 2 Continuous Data

Fig. 2.3 Observed FVIII assay results versus targeted values started at 1W10

Table 2.2 FVIII assay results started at 1 W 5


Precision Accuracy
Statistics CCC coefficient coefficient TDI%0:8 CP50% RBS
Estimate 0.9917 0.9942 0.9975 27.35 0.9653 0.12
95% Conf. limit 0.9875 0.9908 0.9935 35.01 0.8921 –
Allowance 0.9775 – – 50.0 0.8000 –
“–” means “not applicable”

Table 2.3 FVIII assay results started at 1:10


Precision Accuracy
Statistics CCC coefficient coefficient TDI%0:8 CP50% RBS
Estimate 0.9669 0.9947 0.9721 58.95 0.7016 3.75
95% Conf. limit 0.9584 0.9917 0.9638 69.01 0.5898 –
Allowance 0.9775 – – 50.0 0.8000 –
“–” means “not applicable”

total deviation. The one-sided lower confidence limit is 0:987, which is greater than
0.9775. The precision coefficient is estimated to be 0.994 with a one-sided lower
confidence limit of 0.991. The accuracy coefficient is estimated to be 0.998 with
a one-sided lower confidence limit of 0.994. TDI%0:8 is estimated to be 27.3%,
which means that 80% of observations are within 27.3% change from target values
(percentage of percentage values). The one-sided upper confidence limit is 35.0%,
which is less than 50%. Finally, CP50% is estimated to be 0.965, which means that
96.5% of observations are within 50% change from target values. The one-sided
lower confidence limit is 0.892, which is greater than 0.8. The agreement between
2.8 Examples 21

the FVIII assay and the actual concentration is acceptable with good precision and
accuracy. The relative bias squared is estimated to be 0.12, so that the approximation
of TDI should be excellent.
For the assay started at 1:10 serial dilutions, the CCC is estimated to be 0.967,
which means that the within-sample total deviation is about 18.2% of the total
deviation. The one-sided lower confidence limit is 0.958, which is less than 0.9775.
The precision coefficient is estimated to be 0.995 with a one-sided lower confidence
limit of 0.992. The accuracy coefficient is estimated to be 0.972 with a one-
sided lower confidence limit of 0.964. TDI%0:8 is estimated to be 58.9%, which
means that 80% of observations are within 58.9% change from target percentage
values. The one-sided upper confidence limit is 69.0%, which is greater than 50%.
Finally, CP50% is estimated to be 0.702, which means that 70.2% of observations
are within 50% change from target values. The one-sided lower confidence limit is
0.590, which is less than 0.8. The agreement between the FVIII assay and actual
concentration had good precision but is not acceptable due to mediocre accuracy.
The relative bias squared is estimated to be 3.75, which is less than 8.0, so that the
approximation of TDI should be acceptable.

2.8.3 Example 3: Assay Validation

This example was presented in Lin and Torbeck (1998). A study to validate an
amino acid analysis test method was conducted. Solutions were prepared at ap-
proximately 90%, 100%, and 110% of label concentration of the amino acids, each
containing nine determinations (observed values). Target values were determined
based on their average molecular weights, which were much more precise and
accurate but were still measured with random error. For each test method we
compute the estimates of CP, TDI, CCC, precision coefficient, accuracy coefficient,
and their confidence limits. It is debatable whether we should treat the target values
as random or fixed because they were average values. We therefore take the more
conservative approach by treating target values as random, which yields the same
estimates of agreement statistics but with a larger respective standard error for each
estimate.
The observed and target values were expressed as a percentage of label con-
centration. Using estimates of the CCC components, coefficients of accuracy (ca )
and precision (r), four out of 30 amino acids were chosen for illustration, each
representing an example of distinctive precise and/or accurate situations. These
four amino acids and their label concentrations were glycine (1 g/L), ornithine (6.4
g/L), L-threonine acid (3.6 g/L), and l-methionine (2 g/L).
The range of these data was approximately 20% (90%–110%) of label concen-
tration. The client needed with 95% confidence that at least 80% of observations be
within 3% of target values. This means that the 95% upper limit of TDI0:8 must be
less than 3, or that the 95% lower limit of CP3 must be greater than 0.8. Note that
the measurement unit is in percentage, and the error structure was assumed constant
across the data range. The client did not specify a criterion for the CCC.
22 2 Continuous Data

Fig. 2.4 Observed measures versus target values of glycine

Fig. 2.5 Observed measures versus target values of ornithine

Figures 2.4–2.7 present the plots and Tables 2.4–2.7 present the agreement
statistics for glycine, ornithine, L-threonine acid, and L-methionine, respectively.
The results for glycine are accurate and precise, with CCC D 0:996 .0:994/,
r D 0:998 .0:996/, ca D 0:998 .0:996/, TDI0:8 D 0:93 .1:18/, and CP3 D 0:9999
.0:9966/. Values presented in parentheses represent the respective 95% lower or
upper confidence limit. More than 80% of observations are within 0.93 of target
values. The 95% upper confidence limit of TDI0:8 is 1.18, which is within the
2.8 Examples 23

Fig. 2.6 Observed measures versus target values of L-threonine

Fig. 2.7 Observed measures versus target values of L-methionine

Table 2.4 Glycine results (n D 27)


Precision Accuracy
Statistics CCC coefficient coefficient TDI0:8 CP3 RBS
Estimate 0.9962 0.9981 0.9981 0.93 0.9999 0.08
95% Conf. limit 0.9937 0.9963 0.9961 1.18 0.9966 –
Allowance 0.9900 – – 3.0 0.8000 –
“–” means “not applicable”
24 2 Continuous Data

Table 2.5 Ornithine results (n D 27)


Precision Accuracy
Statistics CCC coefficient coefficient TDI0:8 CP3 RBS
Estimate 0.9738 0.9759 0.9978 2.45 0.8703 0.08
95% Conf. limit 0.9502 0.9535 0.9801 3.09 0.7496 –
Allowance 0.9900 – – 3.0 0.8000 –
“–” means “not applicable”

Table 2.6 L-threonine results (n D 27)


Precision Accuracy
Statistics CCC coefficient coefficient TDI0:8 CP3 RBS
Estimate 0.9444 0.9905 0.9534 3.61 0.6557 4.44
95% Conf. limit 0.9084 0.9814 0.9222 4.14 0.5188 –
Allowance 0.9900 – – 3.0 0.8000 –
“–” means “not applicable”

Table 2.7 L-methionine results (n D 27)


Precision Accuracy
Statistics CCC coefficient coefficient TDI0:8 CP3 RBS
Estimate 0.5308 0.9723 0.5459 13.68 0.0001 26.11
95% Conf. limit 0.3991 0.9464 0.4280 14.89 0.0000 –
Allowance 0.9900 – – 3.0 0.8000 –
“–” means “not applicable”

allowable 3%. The 95% lower confidence limit of CP3 is 0.997, which is better
than the allowable 0.8. The CCC estimate is near 1, indicating an almost perfect
agreement.
The results of ornithine are accurate but less precise, with CCC D 0:974 .0:95/,
r D 0:976 .0:954/, ca D 0:998 .0:98/, TDI0:8 D 2:45 .3:09/, and CP3 D 0:870
.0:750/. The 95% upper confidence limit of TDI0:8 is 3.09, and the 95% lower
confidence limit of CP3 is 0.750.
The results of L-threonine are inaccurate but precise, with CCC D 0:944 .0:908/,
r D 0:991 .0:981/, ca D 0:953 .0:922/, TDI0:8 D 3:61 .4:14/, and CP3 D 0:656
.0:519/. The 95% upper confidence limit of TDI0:8 is 4.14, and the 95% lower
confidence limit of CP3 is 0.519.
The results of L-methionine are inaccurate and imprecise, with CCC D 0:531
.0:399/, r D 0:972 .0:946/, ca D 0:546 .0:428/, TDI0:8 D 13:68 .14:89/, and
CP3 D 0:0001 .0:0000/. The 95% upper confidence limit of TDI0:8 is 14.89, and
the 95% lower confidence limit of CP3 is almost zero. Note that the TDI0:8 estimate
of the L-methionine assay is conservative, since the estimate of its relative bias
squared value is large.
In summary, only the glycine assay in this example meets the client’s criterion.
2.8 Examples 25

2.8.4 Example 4: Lab Performance Process Control

This example was presented in Lin (2008). For quality control of clinical laborato-
ries, control materials of various concentrations were randomly sent to laboratories
for testing. The test results were to satisfy the proficient testing (PT) criterion.
The PT criterion for each lab test of the Clinical Laboratory Improvement Amend-
ments (CLIA) Final Rule (2003) http://wwwn.cdc.gov/clia/regs/subpart i.aspx#493.
929 required that 80% of observations be within a certain percentage or unit of
the target concentration for measuring control materials. The target concentrations
usually were the average of control materials across a peer group of labs using
similar instruments. Such a criterion lends itself directly for using the TDI%0:8 or
TDI0:8 .
For each of the majority of lab measurements, laboratories were required to test
commercial control materials at least once a day for at least two concentrations
(low and high). Daily glucose values of 116 laboratory instruments were monitored.
Based on accuracy and precision indices, we selected four laboratory instruments
with four distinct combinations of precision and accuracy. For each laboratory
instrument we computed TDI%0:9 , CCC, and precision and accuracy coefficients.
Here, we chose to monitor 90% for a cushion, instead of 80%, of observations across
all levels that were within TDI%0:9 or TDI0:9 units of targets. We can translate from
TDI0:9 to TDI0:8 by multiplying by 1:282=1:645 D 0:779. The target values were
computed as the average of these 116 laboratory instruments. For glucose, this PT
criterion was 10% or 6 mg/dL, whichever was larger. The range of these data was
around 70–270 mg/dL. In this case, the 10% value was the PT criterion (always
larger than 6 mg/dL).
For each lab instrument, we computed the preceding agreement statistics for each
calendar month and for the last available 50 days (current window). Across the
116 lab instruments, we computed the group geometric mean (GM), one standard
deviation (1-SD), and two standard deviation (2-SD) upper limits with 3-month
average TDI%0:9 values as benchmarks. Note that the distribution of TDI%0:9 was
shown to be log-normal. Here, the confidence limit of TDI%0:9 per lab instrument
was not computed, because we were using the population benchmarks. Therefore, it
is irrelevant whether the target values were treated as random or fixed.
Figures 2.8–2.11 present the plots of the four selected cases. For each case,
the left-hand plot presents the usual agreement plot of observations versus target
values for the current window, and each plotted symbol (circle) represents the daily
glucose value against the daily average glucose values across 116 labs. The right-
hand plot monitors the quality control results over a selected time window based
on TDI%0:9 values. We chose to monitor a rolling three-completed-month window
(June, July, and August in this case) plus the current window. Each plotted symbol
(dot) represents the monthly or current window TDI%0:9 value. Also presented
are the population benchmarks of geometric mean, 1-SD, and 2-SD upper limits,
and the PT criterion (PTC) of 10% (dashed line). Although the CCC, precision
coefficient, and accuracy coefficient are not shown in the right-hand plot, these
26 2 Continuous Data

350

L 300
a
b 250

v 200
a
l 150
u
e 100
50
50 100 150 200 250 300 350
Target Value

Fig. 2.8 Observed glucose measures versus target values of lab instrument for the current window
and the control chart based on TDI%0:9 : almost perfect

350

L 300
a
b 250

v 200
a
l 150
u
e 100

50
50 100 150 200 250 300 350
Target Value

Fig. 2.9 Observed glucose measures versus target values of lab instrument for the current window
and the control chart based on TDI%0:9 : imprecise but accurate

values of the current window were used to select the four instruments that are
presented here. The use of CP can be helpful here. However, the CP values have
difficulty discriminating among good instruments when they all have very high CP
values.
Figure 2.8 shows the best-performing lab instrument among all 116 lab instru-
ments, with CCC D 0:9998, r 2 D 0:9997, ca D 0:9999, and TDI%0:9 D 2:1%
for the current window. It has an almost perfect CCC, and its TDI%0:9 values are
around 2%–3%.
Figure 2.9 shows a less-precise but accurate lab instrument, with CCC D 0:996,
r 2 D 0:996, ca D 0:998, and TDI%0:9 D 9:8% for the current window. Its values
rank at around 2/3 (slightly greater than 1-SD value) among its peers in June, slightly
better than its peer average in July, and at around the PTC level in August and the
current window.
2.8 Examples 27

350
L 300
a
b 250

v 200
a
l 150
u
e 100
50
50 100 150 200 250 300 350
Target Value

Fig. 2.10 Observed glucose measures versus target values of lab instrument for the current
window and the control chart based on TDI%0:9 : precise but inaccurate

350

L 300
a
b 250

v 200
a
l 150
u
e 100
50
50 100 150 200 250 300 350
Target Value

Fig. 2.11 Observed glucose measures versus target values of lab-instrument for the current
window and the control chart based on TDI%0:9 : imprecise and inaccurate

Figure 2.10 shows a precise but inaccurate lab instrument, with CCC D 0:995,
r 2 D 0:9996, ca D 0:995, and TDI%0:9 D 11% for the current window. Its TDI%0:9
values are between 1-SD and 2-SD values of its peers in June and July, at around
the PTC level in August, and slightly worse than the PTC in the current window.
Figure 2.11 shows the worst-performing lab instrument among all 116 lab
instruments, with CCC D 0:983, r 2 D 0:991, ca D 0:988, and TDI%0:9 D 22%
for the current window. Its TDI%0:9 values are between 1-SD and 2-SD values of its
peers in July and August, and worse than the 2-SD value of its peers in June and the
current window.
This example conveys a few lessons. First, it is difficult to judge agreement solely
by the CCC, precision coefficient, and accuracy coefficient in their absolute values.
All comparisons should be judged in relative terms. Here, even the worst CCC value
shown in Fig. 2.11 is 0.983. Such a high value here is due to the large study range of
28 2 Continuous Data

70–270 mg/dL. Note that as stated earlier, the CCC, precision coefficient, accuracy
coefficient, and ICC depend largely on the study range. Comparisons among any
of these are valid only with similar study ranges. In this example, the study ranges
(based on peer means) are identical. It is important to report the study range when
reporting these statistics.
Second, not all lab instruments are created equal. Their quality, in terms of
TDI%0:9 values, could range from 2% to 22%, which is quite diverse. It is important
to submit our blood samples to a lab with a good reputation for quality.
Third, the PTC value is between the GM and 1-SD benchmarks for measuring
glucose, which means that about one-third of the lab instruments are in danger of
failing the PTC as dictated by the CLIA 2003 Final Rule. Perhaps the PTC is set too
strictly for measuring glucose. Note that we use TDI%0:9 instead of TDI%0:8 values
for a cushion here.

2.8.5 Example 5: Clinical Chemistry and Hematology


Measurements That Conform to CLIA Criteria

The data for this example were obtained through the clinical labs within the research
and development organization at Baxter Healthcare Corporation. Analysis of serum
chemistry analytes previously validated using the Hitachi 911 (Roche Diagnostics)
chemistry analyzer (reference or gold standard assays) was to be converted to the
Olympus AU400e (Olympus America, Inc.) chemistry analyzer (test or new assays).
Assay comparisons were performed by analyzing approximately 50–60 samples per
assay on both analyzers.
Hematology analyses were performed using two automated hematology instru-
ments: the CELL DYN 3500 (Abbott Diagnostics) as the reference and the ADVIA
2120 (SIEMENS Healthcare Diagnostics) as the test. Analyses were performed on
whole blood samples drawn into tubes containing EDTA anticoagulant. A total of
93 samples from 16 humans, 18 rabbits, 19 rats, 20 dogs, and 20 pigs were each
tested once on both instruments. All species were combined to establish wider data
ranges, and assay performance was not expected to depend on species.
Evaluations included clinical chemistry analytes of albumin, BUN, calcium,
chloride, creatinine, glucose, iron, magnesium, potassium, sodium, total protein,
and triglycerides; and hematology analytes of hemoglobin, platelet count, RBC, and
WBC. Any analyte without a PTC or values outside the data range was not included
in the evaluations.
Table 2.8 presents the data range, PT criterion, and estimates and confidence
limits (in parentheses) of agreement statistics for each of the above analytes. Data
ranges of clinical chemistry analytes were acquired from the Olympus Chemistry
Reagent Guide “Limitations of the Procedure” section for each assay (Olympus
America, Inc. Reagent Guide Version 3.0. May 2007. Irving, TX). Data ranges
of hematology analytes were acquired from ADVIA 2120 Version 5.3.1–MS2005
Bayer (now Siemens) Healthcare LLC.
Table 2.8 Agreement statistics against the PT criteria (PTC) for clinical chemistry and hematology analytes
Analyte Range PTC n CCC Precision Coef. Accuracy Coef. TDI0:8 a CPPTC b RBSc
2.8 Examples

Albumin (g/dL) 1.5–6.0 10% 58 0.723 (0.626) 0.859 (0.789) 0.842 (0.768) 23.6% (27.29%) 0.379 (0.309) 1.19
BUN 6 22 (mg/dL) 2–22 2 44 0.993 (0.989) 0.996 (0.993) 0.997 (0.994) 0.77 (0.91) 1.000 (0.996) 0.53
BUN >22 (mg/dL) 22–130 9% 16 0.999 (0.998) 0.999 (0.998) 1.000 (0.999) 3.26% (4.48%) 0.999 (0.954) 0.05
Calcium (mg/dL) 4–18 1 60 0.873 (0.833) 0.996 (0.994) 0.876 (0.838) 1.58 (1.69) 0.302 (0.223) 11.7
Chloride (mmol/L) 50–200 5% 61 0.991 (0.987) 0.996 (0.993) 0.995 (0.992) 2.44% (2.8%) 0.995 (0.982) 0.81
Creatinine Enz <2 (mg/dL) 0.2–2 0.3 50 0.989 (0.982) 0.990 (0.983) 0.999 (0.996) 0.08 (0.09) 1.000 (1.000) 0.06
Creatinine Jaffe <2 (mg/dL) 0.2–2 0.3 47 0.963 (0.942) 0.981 (0.969) 0.981 (0.966) 0.14 (0.16) 0.998 (0.989) 0.92
Glucose >60 (mg/dL) 60–800 10% 55 0.999 (0.998) 0.999 (0.998) 1.000 (0.999) 4.28% (5.01%) 0.997 (0.986) 0.32
Iron (ug/dL) 10–1000 20% 59 0.994 (0.991) 0.997 (0.995) 0.997 (0.995) 9.74% (11.45%) 0.987 (0.961) 0.03
Magnesium (mg/dL) 0.5–8 25% 61 0.970 (0.958) 0.996 (0.993) 0.974 (0.963) 14.41% (15.71%) 0.999 (0.995) 5.82
Potassium (mmol/L) 1–10 0.5 59 0.996 (0.995) 0.998 (0.997) 0.998 (0.997) 0.14 (0.16) 1.000 (1.000) 0.43
Sodium (mmol/L) 50–200 4 59 0.994 (0.991) 0.995 (0.993) 0.999 (0.997) 1.77 (2.06) 0.996 (0.984) 0.16
Total Protein (g/dL) 3–12 10% 56 0.993 (0.989) 0.993 (0.989) 1.000 (0.997) 2.31% (2.71%) 1.000 (1.000) 0.03
Triglycerides (g/dL) 10–1000 25% 56 0.997 (0.996) 0.998 (0.998) 0.999 (0.998) 8.2% (9.56%) 1.000 (0.999) 0.59
Hemoglobin (g/dL) 1–22.5 7% 93 0.967 (0.957) 0.997 (0.996) 0.969 (0.961) 6.02% (6.4%) 0.942 (0.903) 7.08
Platelet Count (103=¯L) 10–3500 25% 84 0.917 (0.884) 0.926 (0.896) 0.990 (0.973) 31.72% (36.79%) 0.695 (0.629) 0.04
RBC (106= L) 0.1–12 6% 92 0.988 (0.984) 0.997 (0.995) 0.992 (0.989) 4.13% (4.52%) 0.966 (0.937) 2.22
WBC (103=¯L) 0.1–100 15% 93 0.942 (0.922) 0.958 (0.941) 0.984 (0.972) 21.34% (24.41%) 0.640 (0.576) 0.08
Note: Shown in parentheses is the 95% upper (TDI) or lower (CCC, precision, accuracy, CP) confidence limit. Boldface analytes are those that failed the PTC.
a
Total deviation index to cover 80% of absolute difference or % change. The 95% upper limit should be less than PTC or PTC%.
b
Coverage probability of values within the PTC. The 95% lower limit should be greater than 0.8.
c
The relative bias squared (RBS) must be less than 8 in order for the approximate TDI to be valid. Otherwise, the TDI estimate is conservative depending on
the RBS value.
29
30 2 Continuous Data

For evaluation of BUN, the PTC is 9% for values greater than or equal to 22
mg/dL or 2 mg/dL for values less than or equal to 22 mg/dL, since 9% of 22 is
approximately 2. The criterion used for creatinine Enz and Jaffe is 0.3 mg/dL, since
only values less than 2 mg/dL were evaluated. Glucose was evaluated for values
greater than 60 mg/dL with the criterion 10%.
Figure 2.12 presents the agreement plots of the above analytes. The analytes of
albumin, calcium, platelet count, and WBC had precision and/or accuracy problems,
while the other analytes appeared to perform well. Comparing the 95% upper
confidence limit of TDI against the PTC or PTC% or the 95% lower confidence
limit CP against 0.8, all but the analytes of albumin, calcium, platelet count, and
WBC (shown boldface in Table 2.8) pass the PTC with 95% confidence.
Table 2.9 presents the results of traditional statistical analyses based on paired
t-test and ordinary regression. The results from Deming (orthogonal) regression
by treating X as a random variable are not shown because they are more or less
similar to those of ordinary regression. Table 2.9 shows the data range, sample size,
paired t-test p-value, intercepts, slopes, and the testing of intercept (0) and slope (1)
of ordinary regressions.
The paired t-test rejects the agreement of extremely well performing analytes,
that is, BUN 6 22 mg/dL, chloride, creatinine Jaffe, glucose, magnesium, potas-
sium, sodium, triglycerides, hemoglobin, and RBC, with p D 0:003 for sodium
and p < 0:001 for the others. These rejections correspond to the left-hand plot of
Fig. 1.2 and are due primarily to near-zero residual variance and/or large sample
size. On the other hand, the paired t-test accepts (p D 0:062) the agreement of
platelet count. Such failure to reject corresponds to the right-hand plot of Fig. 1.2,
due primarily to its large residual variance.
In terms of ordinary regression, tests of intercept (0) and/or slope (1) (LS01)
reject (p < 0:05) the agreement of the extremely well performing analytes of
BUN 6 22 mg/dL, chloride, creatinine Jaffe, iron, magnesium, potassium, sodium,
triglycerides, and hemoglobin. These rejections are similar, although not identical,
to the paired t-test for the same reasons. On the other hand, LS01 accepts (p >
0:373) the agreement of platelet count for the same reasons as paired t-test.
In terms of CCC, precision and accuracy for clinical chemistry analysts of BUN,
chloride, glucose, iron, magnesium, potassium, sodium, total protein, triglycerides
have excellent agreement (CCC > 0:99) between measurements from the Olympus
AU400e and Hitachi 911 instruments. Both have excellent precision (>0:99) and
accuracy (>0:99).
Creatinine Enz and creatinine Jaffe comfortably pass the PTC. Because their
data ranges are small, from 0.2 to 1.8 mg/dL, their CCC and precision and
accuracy coefficients become relatively lower. Magnesium also pass the PTC by
a comfortable margin. It has excellent precision (0.9956) but relatively smaller
accuracy (0.9738), because most Olympus values are smaller than Hitachi values
by a negligible amount.
For clinical chemistry analytes of albumin and calcium, the lab has difficulties
proving the equivalence of measurements from the Olympus AU400e and Hitachi
911 instruments. Albumin measurements from both the Olympus AU400e and
Hitachi 911 are neither accurate (0.8422) nor precise (0.8589). We are 95%
2.8 Examples

Fig. 2.12 Agreement plots


31
32

Fig. 2.12 (continued)


2 Continuous Data
2.8 Examples

Fig. 2.12 (continued)


33
34

Fig. 2.12 (continued)


2 Continuous Data
2.8 Examples

Fig. 2.12 (continued)


35
36

Table 2.9 Traditional statistics for clinical chemistry and hematology analytes
Paired t -test Intercept D 0 Slope D 1
Analyte Range n p-value Intercept Slope p-value p-value
Albumin (g/dL) 1.5–6.0 58 <0.001 0.181 0.783 0.045 0.001
BUN 6 22 (mg/dL) 2–22 44 <0.001 0.848 0.966 0 0.014
BUN >22 (mg/dL) 22–130 16 0.365 0.037 1.012 0.401 0.329
Calcium (mg/dL) 4–18 60 <0.001 0.15 0.884 0.205 <0.001
Chloride (mmol/L) 50–200 61 <0.001 0.16 0.963 0.005 0.002
Creatinine Enz <2 (mg/dL) 0.2–2 50 0.093 0.026 0.986 0.196 0.517
Creatinine Jaffe <2 (mg/dL) 0.2–2 47 <0.001 0.107 0.962 0 0.186
Glucose >60 (mg/dL) 60–800 55 <0.001 0.049 1.007 0.129 0.306
Iron (ug/dL) 10–1000 59 0.214 0.383 1.078 0 <0.001
Magnesium (mg/dL) 0.5–8 61 <0.001 0.095 0.999 0 0.907
Potassium (mmol/L) 1–10 59 <0.001 0.22 0.967 0 <0.001
Sodium (mmol/L) 50–200 59 0.003 4.025 0.975 0.032 0.06
Total Protein (g/dL) 3–12 56 0.198 0.015 0.99 0.602 0.531
Triglycerides (g/dL) 10–1000 56 <0.001 0.104 1.016 0.003 0.049
Hemoglobin (g/dL) 1–22.5 93 <0.001 0.078 1.046 0.001 <0.001
Platelet Count (103= L) 10–3500 84 0.062 0.211 1.042 0.461 0.373
RBC (106= L) 0.1–12 92 <0.001 0.005 1.012 0.776 0.194
WBC (103= L) 0.1–100 93 0.006 0.196 1.116 0.009 0.001
2 Continuous Data
2.9 Proofs of Asymptotical Normality When Target Values Are Random 37

confident that albumin measurements can deviate 27.3% (>10%) from their target
values, and that the measured values conformed to the PTC only 30.9% (<80%) of
the time. Calcium measurements are precise (0.9963) but not accurate (0.8759). We
are 95% confident that calcium measurements can deviate 1.58 mg/dL (>1 mg/dL)
from their target values, and that the measured values conform to the PTC only
22.8% (<80%) of the time.
For hemoglobin, the lab has good agreement (CCC D 0:9665) between mea-
surements from the Advia 2120 and Cell DYN 3500 instruments, with excellent
precision (0.9970) and good accuracy (0.9694). There is a small bias showing
that measurements from the Advia 2120, with only one exception, are consistently
higher than from the Cell DYN 3500. Note that data were not collected over the full
analytical range of 5 to 22 g/dL for this analyte.
RBC counts have very good agreement (CCC D 0:9883) between measurements
from the Advia 2120 and Cell DYN 3500 instruments. Both have excellent precision
(0.9965) and accuracy (0.9917). There is a small bias showing that all but one of
the measurements from Advia 2120 are consistently higher than those of Cell DYN
3500. Note that data are not collected over the full analytical range of 2 to 10 
106/ L for this analyte.
Platelet count from the Advia 2120 and Cell DYN 3500 are accurate (0.9897)
but imprecise (0.9263). We are 95% confident that platelet count measurements
can deviate 36.8% (>25%) from their target values and that they conformed to the
PTC only 62.9% (<80%) of the time. Additionally, WBC measurements are also
relatively accurate (0.9839) but imprecise (0.9576), especially for readings greater
than 8  103/ L. The lab fails to show that these two analytes meet the PTC. We
are 95% confident that WBC measurements can deviate 24.4% (>15%) from their
target values, and that they conform to the PTC only 58.0% (<80%) of the time.
In summary, using the agreement statistics presented in this example, 14 out of
18 method comparison cases meet the CLIA criteria with 95% confidence. Of the
four that did not meet the CLIA criteria, one is acceptable by the traditional paired
t-test and regression analysis. Of the 14 that meet the CLIA criteria, 11 are rejected
by the traditional paired t-test or regression analysis.

2.9 Proofs of Asymptotical Normality When Target


Values Are Random

2.9.1 CCC and Precision Estimates

This proof can be seen in Lin (1989). The Z transformation of the CCC estimate
can be expressed as Z D g.m/, where
38 2 Continuous Data

m D .m1 ; m2 ; m3 ; m4 ; m5 /0
!0
1X 2 1X 2 1X
n n n
D Y ; X; X ; Y ; Yi Xi ; (2.43)
n i D1 i n i D1 i n i D1

and  
1 4.m5  m1 m2 /
Z D g.m/ D ln 1 C :
2 m3 C m4  2m5
The vector m is expressed as a function of sample moments, and has an asymptotic
5-variate normality with mean
0
‚D. y;
2
x; y C 2 2
y; x C 2
x ; yx C y x/ ;

and variance n1 †, where


†Df ij g55 : (2.44)
Here, with the assumption that f.Yi ; Xi / j i D 1; 2; : : : ; ng are random samples
from a bivariate normal distribution, we have

11 D y2 ;
12 D 21 D yx ;
22 D x2 ;
13 D 31 D 2 y 2
y;
23 D 32 D 2 y yx ;

33 D 2 y4 C 4 y2 2
y;
14 D 41 D 2 x yx ;

24 D 42 D 2 x 2
x;
34 D 43 D 2 yx2
C4 y x yx ;

44 D 2 C4 4
x
2
x
2
x;
15 D 51 D x 2
y C y yx ;

25 D 52 D 2
y x C x yx ;
35 D 53 D2 yx
2
y C2 2
yx y C2 y
2
x y;

45 D 54 D2 yx
2
x C2 2
yx x C2 y
2
x x;

and
55 D 2 2
y x C 2 2
y x C 2 2
x y C 2
yx C2 y x yx :

It follows from the delta method or from the theory of functions of asymptotically
normal vectors (Serfling 1980, Corollary 3.3) that Z is asymptotically normal with
1 0
mean 12 ln 1Cc
1c and variance n d †d, where

d D .d1 ; d2 ; d3 ; d4 ; d5 /0
 ˇ ˇ 0
@g.m/ ˇˇ @g.m/ ˇˇ
D ;:::; :
@m1 ˇmD‚ @m5 ˇmD‚
2.9 Proofs of Asymptotical Normality When Target Values Are Random 39

The elements of d are

 x
d1 D ;
2
y C 2
x C2 yx C. y  x/
2
 y
d2 D ;
2
y C 2
x C2 yx C. y  x/
2
 yx
d3 D d4 D h ih i;
2
y C 2
x C2 yx C. y  x /2 y C
2 2
x 2 yx C. y  x/
2

and
. 2
y C 2
x/ C. y  x/
2
d5 D h ih i:
2
y C 2
x C2 yx C. y  x/
2 2
y C 2
x 2 yx C. y  x/
2

After straightforward, albeit tedious algebraic calculations, it can be shown that the
variance of Z is

1 0
2
Z D d †d
n
 
1 .1  2 /c2 2c3 .1  c / 2 c4  4
D C  : (2.45)
n .1  c2 /2 .1  c2 /2 22 .1  c2 /2

The Z-transformed CCC estimate can approach normality much more rapidly as
confirmed by the Monte Carlo experiment in Lin (1989). When c D  and  D 0,
(2.45) degenerates into n1 , which is the variance of the Z transformation of the
precision estimate.

2.9.2 MSD Estimate

From (2.43), we can write the natural log transformation of the MSD estimate, or
W D ln.e 2 / as

W D g.m3 ; m4 ; m5 / D ln.m3 C m4  2m5 /:

By the delta method, W is asymptotically normal with mean ln. 2 / and variance
1 0
n d †d , where
0 1
33 34 35
†D@ 44 45
A
55
40 2 Continuous Data

and
 ˇ ˇ ˇ 0  
@g.m/ ˇˇ @g.m/ ˇˇ @g.m/ ˇˇ 1 1 2 0
dD ; ; D ; ; :
@m3 ˇmD‚ @m4 ˇmD‚ @m5 ˇmD‚ "2 "2 "2

After straightforward algebraic calculations, we can show that the variance of W is


 4
2 . y  x/
2
W D 1 : (2.46)
n "4

2.9.3 CP Estimate

This proof can be seen in Lin, Hedayat, Sinha, and Yang (2002). Here, we use
a different approach to demonstrating the delta method. We can use a first-order
approximation to compute the mean and variance of the CP estimate pı0 by
   
ı0  d 1 ı0  d
p ı0 D ˆ  .X d  d/
d d d
   
ı0  d ı0  d ı0  d
 2
.sd  d/  ˆ
d d d
   
1 ı0  d ı0 C d ı0  d
C .Od  d/  2
.sd  d/
d d d d
  
CO .X d  d/
2
C O .sd  d/
2
C O .X d  d /.sd  d/ ; (2.47)

where .x/ is the density function of the standard normal distribution, X d D Y X ,


and limx!0 O.x/
x < 1.
Therefore, the expected value of pı0 is
 
1
E.pı0 / D ı0 C O ;
n

and the asymptotic variance of pı0 becomes


(    2
1 ı0  d ı0  d
2
pı0 D 
n d d

    2 )
1 ı0  d ı0  d ı0 C d ı0  d
C C
2 d d d d
 
1
CO 2 : (2.48)
n
2.9 Proofs of Asymptotical Normality When Target Values Are Random 41

Because CP is bounded by 0 and1, it is better to use the logit transformation for
p
statistical inference. Let T D ln 1pı0ı . Then the asymptotic mean of T is  D
  0
2
ı0 pı
ln 1ı , and the asymptotic variance is T2 D  2 .1 0
/2
.
0 ı0 ı0

2.9.4 Accuracy Estimate

This proof can be seen in Robieson (1999). This estimate, ca , does not involve m5
in (2.43). We redefine the new set of m vectors that are mostly uncorrelated as

m D .m1 ; m2 ; m3 ; m4 /0 D .Y ; X ; sy2 ; sx2 /0 ;

2 0
which has an asymptotic 4-variate normality with mean E.m/ D . y;
2
x; y ; x /
and variance n1 †, where † D f ij g44 .
Here,

11 D y2 ;
12 D 21 D yx ;
13 D 14 D 23 D 24 D 31 D 41 D 32 D 42 D 0;
33 D 2 y4 ;
34 D 43 D 2 yx
2
;
and
44 D2 4
x:

The logit transformation of the accuracy estimate can be written as

L D g.m1 ; m2 ; m3 ; m4 /
 p 
2 m3 m4
D ln p ;
m3 C m4 C .m1  m2 /2  2 m3 m4

a
and is asymptotically normal with mean ln 1 a
and variance n1 d 0 †d, where

 ˇ ˇ 0
@g.m/ ˇˇ @g.m/ ˇˇ
d D .d1 ; d2 ; d3 ; d4 /0 D ; : : : ; :
@m1 ˇmD‚ @m4 ˇmD‚
42 2 Continuous Data

The elements of d are


. y x/
d1 D d2 D 2 ;
y x .1a /

d3 D 2 2
1
2  1
2 ;
y a .1a / 2 y x .1a /

and
d4 D 2 2
1
2  1
2 :
x a .1a / 2 y x .1a /

It can be shown that the variance of L is


1 0
2
L D d †d
n
    
1 1 1 2 1
D 2 2
 $ C  2 C  $ 2
C C 2 2
n.1  a /2 a $ 2 a $2

C 1 C 2 a  2  1 : (2.49)

2.10 Proofs of Asymptotical Normality When Target Values


Are Fixed

None of the proofs of asymptotic normality of the estimates of agreement indices


other than CP with fixed target values have been given in the literature. The proofs
here are much simplified compared to those in Section 2.9, because we are dealing
with moments associated with Y only.

2.10.1 CCC and Precision Estimates

The ZjX transformation of the CCC estimate can be expressed as ZjX D g.mjX /,
where
0
mjX D .mjX;1 ; mjX;2 ; mjX;3 /0 D Y ; b1 ; se2 (2.50)

and
" #
1 mjX;3 C m2jX;2 sx2 C sx2 C .mjX;1  X /2 C 2mjX;2 sx2
ZjX D g.mjX / D ln :
2 mjX;3 C m2jX;2 sx2 C sx2 C .mjX;1  X /2  2mjX;2 sx2
2.10 Proofs of Asymptotical Normality When Target Values Are Fixed 43

1 1CcjX
Here ZjX is asymptotically normal with mean 2 ln 1cjX and variance
1 0
n d jX † jX d jX , where
0 1
2
y .1  2 / 0 0
B C
† jX D B
@ 0 2
y .1  2 /=sx2 0 C:
A (2.51)
0 0 2 4
y .1   2 /2

The elements of d jX are

c2 . y  X /
djX;1 D ;
.1  c2 /2  y sx
 
c 1
djX;2 D  1 ;
.1  c2 /2 c 
and

c2
djX;3 D :
2.1  c2 /2  y sx

It can be shown that the variance of ZjX is


 
c2 .1  2 / $ 2 c2 .1  2 /
2
D $ 2 2
 C .1  c $/ 2
C : (2.52)
ZjX
n.1  c2 /2 2 c
2
 
2
When c D ,  D 0, and $ D 1, (2.52) degenerates to n1 1  2 , which is the
variance of the Z transformation of the precision estimate.

2.10.2 MSD Estimate

From (2.50) we can write the natural logarithm of the MSD estimate, or WjX D
2
ln.ejX /, as

WjX D g.mjX;1 ; mjX;2 ; mjX;3 /



D ln .mjX;1  X /2 C mjX;3 C sx2 .1  mjX;2 /2 :

By the delta method, WjX is asymptotically normal with mean ln."2jX / and variance
1 0
n
d jX † jX d jX , where † jX was shown in (2.51), and
!0
2. y  X/ 2.1  ˇ1 /sx2 1
d jX D 2
; ; 2 :
"jX "2jX "jX
44 2 Continuous Data

It can be shown that the variance of WjX is


2  2 3
"2jX  2
2 6 e 7
2
WjX D 41  5:
n "4jX

2.10.3 CP Estimate

The proof can be seen in Lin, Hedayat, Sinha, and Yang (2002). Recall that in the
regression model when target values are fixed, we assumed that eY has a normal
distribution with mean 0 and variance e2 . Under this setup, the coverage probability
of the i th observation is

ı0 i D Pr.jYi  Xi j < ı0 /


   
ı0  ˇ0  .ˇ1  1/xi ı0  ˇ0  .ˇ1  1/xi
Dˆ ˆ :
e e

We define the overall coverage probability as

1X
n
ı0 jX D ı i : (2.53)
n i D1 0

Suppose that we have a random sample f.Yi ; Xi / j i D .1; : : : ; n/g and that ˇ0 , ˇ1 ,
and e2 are estimated by b0 , b1 , and se2 . Then b0 and b1 are independent of se . An
estimate of ı0 i is
   
ı0  b0  .b1  1/xi ı0  b0  .b1  1/xi
p ı0 i D ˆ ˆ ;
se se
and an estimate of ı0 jX is
1X
n
pı0 jX D pı i :
n i D1 0
By the same method as shown in Section 2.9.3, it can be shown that
 
1
E.pı0 jX / D ı0 jX C O ;
n
and that the asymptotic variance of pı0 jX is
" #  
1 C 2
.C 0 X  C 1 / 2
C 2
1
pı0 jX D C C 2 CO 2 ;
2 0 2
n n2 n2 sx2 2n n

where C0 ; C1 and C2 are defined in (2.19), (2.20), and (2.21).


2.11 Other Estimations and Statistical Inference Approaches 45

2.10.4 Accuracy Estimate


0
Here we use the same † jX as in (2.51). The elements of d jX D djX1 ; djX 2 ; djX 3
become

. y  X /2
djX1 D  ;
y sx .1  a /
2

sx ˇ1 sx2 ˇ1
djX 2 D  C ;
y .1  a /2 y a .1  a /
2 2

and
1 1
djX 3 D  C :
2 y sx .1  a / 2 2
y a .1  2a /

Therefore, the variance of LjX is

 2 $2a .1  2 / C 12 .1  $a /2 .1  4 /
2
D :
LjX
n.1  a /2

2.11 Other Estimations and Statistical Inference Approaches

Estimations of agreement indices presented in this chapter are based on moment


estimations by replacing proposed indices with their respective sample counterparts.
Statistical inferences based on these estimates are carried out by the routine delta
method.
King and Chinchilli (2001a) proposed estimations and statistical inferences for
the CCC based on the U-statistic outlined by Davis and Quade (1968). Barnhart
and Williamson (2001) proposed estimations and statistical inferences for the CCC
based on the GEE methodology outlined by Liang and Zeger (1986). Barnhart,
Haber, and Song (2002) later extended the GEE methodology for multiple assays
or raters. Both U-statistic and GEE methodologies have the advantage of address-
ing both estimation and statistical inference simultaneously based on established
general formulas.
Carrasco and Jover (2005) proposed to use the maximum likelihood (ML)
method with a mixed effect model. All of the above approaches were proposed for
CCC only, and have been extended to be applicable when we have multiple assays
or raters. However, none of the above has addressed cases in which the target values
are fixed. The U-statistic and GEE methodology for the CCC are also valid when
we have categorical data, to be discussed in Chapters 3 and 5.
46 2 Continuous Data

2.11.1 U-Statistic for CCC

Suppose we have (X1 , Y1 ), (X2 , Y2 ), : : : , (Xn , Yn ) random samples from n samples


or subjects. For i; j D 1; 2; : : : ; n, let

'1ij D .Xi  Yi /2 C .Xj  Yj /2  .Xi C Yi /2  .Xj C Yj /2 ;


'2ij D Xi2 C Xj2 C Yi2 C Yj2 ;
'3ij D .Xi  Yj /2  .Xi C Yj /2 C .Xj  Yi /2  .Xj C Yi /2 ;
P
ij '1ij
U1 D ;
2n.n  1/
P
ij '2ij
U2 D ;
n.n  1/
and P
ij '3ij
U3 D :
2n.n  1/
King and Chinchilli (2001a) showed that the CCC estimate can be written in terms
of functions of the U-statistics as

H .n  1/.U3  U1 /
rc D D : (2.54)
G U1 C nU2 C .n  1/U3

They further showed that the Z-transformation of rc by the delta method has
asymptotic normal distribution with mean 12 tanh1 .c /, and variance
 2 
c2 2
2 HG
2
Z D 2
H
 C G2 ; (2.55)
1  c2 H2 HG G

where
2
H D .n  1/2 ŒV .U3 / C V .U1 /  2cov.U3 ; U1 /;
2
G D .n  1/2 V .U3 / C V .U1 / C n2 V .U2 / C 2.n  1/cov.U3 ; U1 /
C2n.n  1/cov.U3 ; U2 / C 2ncov.U3 ; U2 /;
and

HG D .n  1/.n  2/cov.U3 ; U1 / C n.n  1/cov.U3 ; U2 / C .n  1/2 V .U3 /


.n  1/V .U1 /  n.n  1/cov.U2 ; U1 /:

The variance–covariance matrix of U D .U1 ; U2 ; U3 /0 , denoted by V , can be


obtained as follows. Let
'i D .'1i ; '2i ; '3i /0 ;
2.11 Other Estimations and Statistical Inference Approaches 47

where
P P P
j '1ij j '2ij j '3ij
'1i D ; '2i D ; '3i D :
.n  1/ .n  1/ .n  1/

Then we have
4 X
V D .'i  U /0 .'i  U /:
n2 i

2.11.2 GEE for CCC

Barnhart and Williamson (2001) first proposed to use GEE for statistical estimation
and inference for CCC. They used three sets of GEE equations, one for estimating
means accounting for covariates, one for estimating variances without accounting
for covariates, and one for estimating the Z-transformed CCC. Variance–covariance
matrices of the above estimates can be estimated, and the delta method can be
applied to obtain the asymptotic normality of the Z-transformed CCC estimate.
Suppose we have (Y11 , Y21 ), (Y12 , Y22 ), : : : , (Y1n , Y2n ) random samples from
n samples or subjects. Let Yi be the 2  1 vector that contains the two readings
and let the 2  p matrix Xi be the corresponding p covariates for the i th sample,
where the first column of Xi is a vector of all ones representing an intercept term.
Let Yi D .Y1i ; Y2i /0 and ˇ be a 2  1 marginal parameter vector. The three GEE
equations are shown below.
In the first set of equations, the marginal mean vector of Yi is E.Yi / D i D
Xi ˇ, and the parameter estimates of ˇ are obtained by

X
n
Di0 Vi1 .Yi  i .ˇ// D 0; (2.56)
1

where Di D d i
dˇ and Vi is the working covariance matrix for Yi (Zeger and Liang
1986).
Let 12 and 22 be variances of Y1 and Y2 . In the second set of equations, the
variances of Y1 and Y2 without accounting for covariates are estimated by

X
n
Fi0 Hi1 Yi2  i2 . 2 ; ˇ/ D 0; (2.57)
1

where Fi D d 2 and Hi is the working covariance matrix for Yi , i D  C i ,


di 2 2 2 2

2 0
and  D .1 ; 2 / . In solving these equations, the diagonal components of Hi are
2 2

assumed normal even if Yi is not normally distributed.


48 2 Continuous Data

Let i D E.Y1i Y2 i /0 and let Z be the Z-transformed CCC. In the third equation,
Z is estimated by

X
n
Ci Wi1 .Y1 Y2  i .Z; ˇ;  2 // D 0; (2.58)
1

where Ci D d i
dZ and Wi is the variance of i .
This GEE method and the U-statistic method yield the same CCC estimate as
proposed by Lin (1989), but the variances of the CCC estimate are slightly different
because these two methods do not assume normality, while the method by Lin
assumes normality in the computation of the variance of the CCC estimate.

2.11.3 Mixed Effect Model for CCC

Carrasco and Jover (2005) proposed to use the maximum likelihood (ML) or
restricted ML (RML) method through a mixed effect model. Robieson (1999) and
Carrasco and Jover (2005) showed that the CCC is a special form of ICC under the
mixed effect model of random subject effect with the variance ˛2 , residual effect
with the variance e2 , and fixed assay or rater effect with the mean square ˇ2 , when
2
ˇ is included in the denominator. Specifically, the CCC can be expressed as

2
c D ˛
: (2.59)
2
˛ C 2
e C 2
ˇ

In Section 3.1.3, we will revisit this coefficient in detail. Carrasco and Jover (2005)
proceeded to use the delta method for statistical inference after estimating the
variance–covariance matrix of the variance components through ML or RML.
This method does not yield the same CCC estimate as proposed by Lin (1989),
and the variance of the CCC estimate can be slightly different, because this method
assumes normality through MLE or RMLE.

2.11.4 Other Methods for TDI and CP

Compared to CCC, there have been fewer contributions related to TDI and CP. Some
of the methods related to TDI and CP in the literature are pointed out in the last
paragraph of Section 2.13. Most of those articles use more complicated iterative
methods to fine-tune TDI and CP as well as their confidence intervals.
2.12 Discussion 49

2.12 Discussion

2.12.1 Absolute Indices

Three agreement statistics, MSD, TDI, and CP, are unscaled indices, which do not
depend on the between-subject variation. TDI and CP attempt to capture a large
proportion (CP) of observations that are within a certain deviation (TDI) from their
target values. We can compute CP for a given TDI, denoted by CPı0 , or compute TDI
for a given CP, denoted by TDI0 . When the error structure is proportional, we apply
a log transformation to the data, and the resulting TDI is then antilog transformed.
When we subtract 1.00 from this antilog-transformed value and multiply by 100,
it becomes a percent change (TDI%0 ) rather than an absolute difference (Lin
2000, 2003, Lin, Hedayat, Sinha, and Yang 2002). This means that 1000 % of
observations are within TDI%0 of the target values. TDI and CP offer the most
intuitively clear interpretation and have better power for statistical inference, yet
they do not have precision and accuracy components. Also, Lin, Hedayat, Sinha, and
Yang (2002) and Lin, Hedayat, and Wu (2007) used approximations and assumed
normality to perform estimations and statistical inferences for TDI and CP. When
the data are not normally or log-normally distributed, a transformation to bring the
data closer to normality might be necessary.
TDI and CP are mirrored statistics. The former requires a given coverage
probability to compute the absolute difference or percent change. The latter requires
a given absolute difference or percent change to compute the coverage probability.
The former has the advantage for the following reason:
• TDI can discriminate among assays with much better agreement than CP, because
in these cases, CP values are near one.
• When there is no hard allowance available, one can still compute TDI at CP D
0:8 or 0:9, but one cannot compute CP without a reasonably given TDI value.

2.12.2 Relative Indices Scaled to the Data Range

Due to its equivalence to kappa and weighted kappa as well as its close tie to
ICC, the CCC is perhaps the most popular index among statisticians for assessing
agreement. The CCC and precision and accuracy coefficients are ICC-like (Lin,
Hedayat, and Wu 2007), and are scaled (relative) to the total variation, especially
the between-sample variation. This property is appealing when we wish to assess
agreement over the entire reasonable value range from normal to abnormal.
Comparisons among any of these three agreement statistics are valid only with
similar study ranges, which is proportional to the between-sample variation. When
the study range is fixed and meaningful, the CCC and precision and accuracy
coefficients offer meaningful geometric interpretations. It is important to report the
study range when reporting these statistics. Good agreement over a small range
50 2 Continuous Data

100 60

80
55
60
50
40
45
20

0 40
0 20 40 60 80 100 40 45 50 55 60

Fig. 2.13 Agreement over larger and shorter analytical ranges. (a) Larger analytical range.
(b) Shorter analytical range

of measurements cannot be extrapolated to conclude good agreement over a larger


range of measurements.
In other words, we should not conduct a method-comparison experiment over
a range similar to the range of intrasample random fluctuation (Lin and Chinchilli
1997). As an illustrative example, Fig. 2.13a shows an artificial result of a good
agreement over a desirable analytical range. If we study a subset of the data, limited
to a much shorter analytical range (the box portion of Fig. 2.13a), which is magnified
in Fig. 2.13b, any correlation coefficient would be much smaller.

2.12.3 Variances of Index Estimates Under Random and Fixed


Target Values

Apart from CP, all of the agreement coefficients defined in this chapter have the
same coefficient estimates regardless whether the target values are assumed random
or fixed. The CP estimates under random and fixed target values are asymptotically
the same. The variance of each of these coefficient estimates under the random target
assumption is always larger than the variance under the fixed target assumption.
These are evident by comparing (2.3) and (2.4) for the log MSD (TDI), (2.13) and
(2.14) for the logit CP, (2.29) and (2.30) for the Z-transformed CCC, (2.31) and
(2.32) for the logit accuracy coefficient, and the formulas in the text after (2.32)
for the Z-transformed precision coefficient. Therefore, the confidence limits of an
agreement coefficient would be closer to its coefficient estimate under the fixed
target assumption than under the random target assumption.

2.12.4 Repeated Measures CCC

There are two types of repeated-measures data for agreement assessment that we
often encounter. For one type, the between-sample variation forms the data range.
2.12 Discussion 51

For the other, the repeated measures form the data range. An example of the
first case is to have each subject’s blood pressures measured over time, which
is what is usually encountered in practice. Many tools are available for this type
of repeated measure, which can be found in the Section 2.13. An example of
the second case is to have each sample taken from a homogeneous population
and to perform serial dilutions that form the data range. Such serial dilutions are
uniform across all homogeneous samples. In this case, we could compute agreement
coefficient estimates for each sample, and treat these estimates as random samples
from a population. We then compute means and confidence limits based on the
respective transformations of the agreement statistics. Antitransformation of these
limits would be their respective confident limits. Such an approach is valid if we
don’t have missing data. We may also follow the more detailed approach proposed
by Chinchilli, Martel, Kumanyika, and Lloyd (1996).

2.12.5 Data Transformations

The CCC and accuracy and precision coefficient estimates are in general quite robust
against moderate deviation from normality. If not, there are tools based on robust
estimates by King and Chinchilli (2001b) and based on a nonparametric approach
by Guo and Manatunga (2007).
The TDI and CP estimates are heavily dependent on the normality or log-
normality assumption. When there is evidence that data are not normally distributed
for the constant random error case, and not log-normally distributed for the
proportional random error case, data transformation might be necessary for the
robustness of TDI and CP estimates. In this case, see Lin and Vonesh (1989) for
the transformation approach that minimizes the MSD between the ordered observed
and theoretical quantiles.

2.12.6 Missing Data

In this book we deal only with no-missing-data cases. In this chapter and Chapter 3,
we discuss the case of paired assays or raters, and we often do not encounter a
large amount of missing data in practice. Therefore, deleting cases with missing
data is a reasonable approach. Missing data situation can sometimes be an issue
in practice as pointed out in Chapters 5 and 6 when we have multiple raters with
replicates. Research in the social and psychological sciences and in clinical trials
may often encounter missing data. Approaches that can handle missing data should
be an interesting area of research.
52 2 Continuous Data

2.12.7 Account for Covariants

Covariate adjustment has been a controversial topic in assessing agreement.


Proponents argue that without such adjustment, CCC and thus accuracy and
precision coefficients are artificially inflated. Opponents argue that such adjustment
artificially decreases the CCC because the data range is being reduced, as seen in
an earlier paragraph about the effect of the data range. Opponents further argue that
assay or rater agreement that could depend on covariants cannot be judged as having
good agreement. In addition, often covariants are selected to cover a desirable data
range, as in the use a variety of species in Example 2.8.5. In this case, the covariate
adjustment related to species can be misleading. We believe that there are cases
in which covariate adjustment is meaningful. However, we recommend that the
practice of covariate adjustment be used with caution.
The GEE methodology proposed by Barnhart and Williamson (2001) allows for
covariate adjustment, but only for means, not for variances and covariances. To
adjust for variances and covariances, a simple and reasonable way is to perform
the linear regression for each assay or rater, then use the intercept or overall mean
estimate plus the residual of each assay or rater as the adjusted dependent variable.
The adjusted dependent variable of each assay or rater would still contain the
subject-to-subject variation without the effect of covariates. We can then perform
the estimations and statistical inferences of the agreement indices based on these
adjusted dependent variables. Such an approach is appropriate when covariates are
subject-specific, such as age and gender. This approach is not applicable when we
have fixed target values. However, we have rarely encountered covariates for which
the target values are fixed.

2.13 A Brief Tour of Related Publications

Chapter 2 is based largely on the materials in Lin, Hedayat, Sinha, and Yang (2002).
For an earlier introduction of CCC and precision and accuracy coefficients, see Lin
(1989), and for TDI, see Lin (2000).
The method of Bland and Altman (1986) for assessing agreement uses a
meaningful graphical approach and computes the confidence limits from the paired
differences. Because of its simplicity, this approach has been quite popular among
medical researchers. This approach lacks a specific index to summarize the degree
of agreement, and thus statistical inferences about the estimate cannot be performed.
Bland and Altman later (1999) improved on their approach with statistical inference.
Their approaches are similar to our TDI. The major difference between our TDI and
their approaches is that we capture a majority of observations from their respective
individual target values, while their approaches capture the paired differences from
the mean of paired differences.
2.13 A Brief Tour of Related Publications 53

Chinchilli, Martel, Kumanyika, and Lloyd (1996) addressed repeated-measures


CCC, which is the weighted average of CCCs across subjects. Vonesh, Chinchilli,
and Pu (1996) and Vonesh and Chinchilli (1997) modified CCC for goodness-of-fit.
King and Chinchilli (2001a) used a U-statistics framework for CCC, which includes
the generalization of CCC for multiple assays or raters, and the approach is
applicable to categorical data. Barnhart and Williamson (2001) used GEE to
estimate CCC, which also can be extended to include the generalization of CCC for
multiple assays or raters (Barnhart, Haber, and Song 2002). In our opinion, these
GEE methodologies are also applicable for categorical data, although the authors
did not make such a claim.
King and Chinchilli (2001b) proposed a robust estimation of CCC through an
absolute loss function or M-estimation by U-statistics. Li and Chow (2005) used
weighted CCC by kernel density for repeated-measures image data. Quiroz (2005)
proposed to assess agreement using the CCC in a repeated-measurement design.
Liu, Du, Teresi, and Hasin (2005) proposed CCC for survival data. Barnhart, Song,
and Lyles (2005) proposed assay validation for left-censored data. Carrasco and
Jover (2005) proposed to use the maximum likelihood (ML) or restricted ML
(RML) method through a mixed-effect model, which is applicable to multiple
assays or raters. Robieson (1999) and Carrasco and Jover (2005) showed that the
CCC is a special form of ICC under the mixed-effect model of random subject
effect and residual effect and fixed assay or rater effect when the fixed assay
or rater effect is included in the denominator of the ICC. King, Chinchilli, and
Carrasco (2007) proposed another approach of repeated-measures CCC. They used
the population estimates, rather than subject-specific estimates proposed earlier
by Chinchilli, Martel, Kumanyika, and Lloyd (1996), to construct a repeated-
measures CCC. King, Chinchilli, Wang, and Carrasco (2007) presented a class
of repeated-measures CCC. Guo and Manatunga (2007) proposed using nonpara-
metric estimation of the CCC under univariate censoring. Carrasco, Luis, King,
and Chinchilli (2007) compared concordance correlation coefficient estimating
approaches with skewed data. Williamson, Crawford, and Lin (2007) presented a
permutation testing for comparing dependent CCCs. Quiroz and Burdick (2009)
again proposed an assessment of individual agreement with repeated measurements
based on generalized confidence intervals. Carrasco, King, and Chinchilli (2009)
again proposed repeated-measures CCC estimated by variance components. Hiriote
and Chinchilli (2010) proposed a matrix-based concordance correlation coefficient for
repeated measures. Helenowski, Vonesh, Demirtas, Rademaker, Ananthanarayanan,
Gann, and Jovanovic (2011) extended CCC by allowing for different spatial
variance–covariance structures of the data. They proposed a general concordance
correlation matrix representing pairwise CCCs along with an overall CCC.
There have been relatively few publications on TDI and CP; these are summarized below. Wang and Gene Hwang (2001) proposed a nearly unbiased test (NUT) based on CP for the application of individual bioequivalence. Choudhary and Nagaraja (2007) proposed an exact test and a modified NUT for CP and TDI for data with a small sample size (n < 30), both of which must be solved numerically through iterations, along with a bootstrap estimation for data with a moderate sample size. Hedayat, Lou, and Sinha (2009) introduced CP involving multiple assays or raters. Escaramis, Ascaso, and Carrasco (2010) simplified the approach of Choudhary and Nagaraja (2007) for TDI using a tolerance limit approach through iterations. Choudhary (2008) proposed a tolerance interval approach for assessment of agreement in method comparison studies with repeated measurements. Choudhary (2008) also proposed the tolerance approach for left-censored data.
Chapter 3
Categorical Data

In Chapter 2, we discussed agreement statistics for continuous data in terms of absolute and relative indices. For categorical data, the agreement statistics presented
here are quite different for random and fixed target values, owing to long-standing convention. Therefore, the organization of this chapter is different
from that of Chapter 2. In Section 3.1, we will discuss cases in which the target
values are random. In Section 3.2, we will discuss cases in which the target values
are fixed, which are rather simple and straightforward.
Agreement indices for categorical data have been conventionally regarded as
measuring agreement among raters. Therefore, in this chapter, we refer to agreement
among raters, observers, instruments, assays, methods, etc., as rater agreement,
rather than as assay agreement as presented in Chapter 2.

3.1 Basic Approach When Target Values Are Random

There is a good collection of books and references describing agreement assessment for categorical data, including association and marginal agreement. A list of such references is given in Section 3.4, at the end of this chapter. We do not attempt to describe the details of all of the available tools. Instead, we evaluate popular agreement
indices for categorical data that are closely related to those described in Chapter 2.
We will discuss the equivalence of ICC, CCC, and weighted kappa, and set the
stage for unified approaches to be presented in Chapters 5 and 6 for continuous and
categorical data when the target values are random.

3.1.1 Data Structure

Clinical measurements can be on a continuous or categorical scale. An example of the former is a patient's blood pressure. An example of the latter is the assignment by two or more raters of a patient's condition according to an ordinal scale of fair, mild, serious, critical, or life-threatening, or the assignment of a patient's condition to a binary scale of normal or abnormal. Here, the raters should use the same metrics in judging. Evaluation of agreement for data with a nominal scale without any order will be discussed in Section 3.3.
We begin with the most basic scenario of paired ordinal observations (Y and X) with a bivariate multinomial distribution, where both Y and X take on values of 1 to t, or 0 to t − 1. Table 3.1 presents the agreement data for all possible probability outcomes, where π_ij represents the probability of X = i and Y = j, for i, j = 1, 2, ..., t.

Table 3.1 Agreement probability table

                                 Y
    Category     1       2       ...   t       Total
         1       π_11    π_12    ...   π_1t    π_1·
         2       π_21    π_22    ...   π_2t    π_2·
    X    :       :       :             :       :
         t       π_t1    π_t2    ...   π_tt    π_t·
    Total        π_·1    π_·2    ...   π_·t    1

3.1.2 Absolute Indices

Absolute agreement indices for categorical data have rarely been presented in the
published literature. However, these can be very useful when we try to verify
agreement within a given experiment. For example, when replicates are taken per
rater per subject, we might want to compare whether one rater is better than another
in terms of within-rater precision. For another example, in the field of individual
bioequivalence, we might be interested in evaluating interdrug agreement relative
to intradrug agreement. These scenarios will be presented in Chapter 6, when we
concentrate on comparisons of absolute indices rather than of relative indices.
We are not aware of any use of MSD for categorical data in the literature. We defined MSD in Chapter 2 as ε² = E(Y − X)². Based on the agreement probabilities presented in Table 3.1, MSD becomes

\[
\varepsilon^2 = \sum_{i=1}^{t}\sum_{j=1}^{t}(i-j)^2\,\pi_{ij}, \qquad i, j = 1, 2, \ldots, t. \tag{3.1}
\]

When the outcome takes on binary scores, or t = 2, the MSD becomes π_12 + π_21, or 1 − (π_11 + π_22). The MSD shown in (3.1) is the absolute or unscaled measure of disagreement. The absolute measure of agreement, Π_0, can be defined as a weighted average of all probabilities,

\[
\Pi_0 = \sum_{i=1}^{t}\sum_{j=1}^{t} w_{ij}\,\pi_{ij}; \tag{3.2}
\]
a popular weight function, the squared weight function, is

\[
w_{ij} = 1 - \frac{(i-j)^2}{(t-1)^2}. \tag{3.3}
\]

Equation (3.2) with the weight function (3.3) is actually a linear function of MSD through the relationship

\[
\Pi_0 = 1 - \frac{\varepsilon^2}{(t-1)^2}. \tag{3.4}
\]

The weight function in (3.3) assigns the heaviest weight of 1 when i = j for the main diagonal probabilities, and lower weights depending on the squared distance from the main diagonal probabilities. We will discuss the weight function in greater detail in Section 3.1.3. If we let

\[
w_{ij} =
\begin{cases}
1 & \text{when } i = j,\\
0 & \text{otherwise,}
\end{cases}
\]

then we have Π_0 = Σ_i π_ii, which had been commonly used as an agreement index prior to the introduction of kappa by Cohen (1960). When the outcome takes on binary scores, or t = 2, Π_0 becomes π_11 + π_22.
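To make these definitions concrete, the following minimal sketch (plain NumPy; the 3 × 3 probability table is illustrative, not from the text) evaluates (3.1) and (3.2) with the squared weights (3.3), and checks the identity (3.4).

    import numpy as np

    def msd_categorical(P):
        """MSD per (3.1): sum of (i - j)^2 * pi_ij over a t x t probability table."""
        t = P.shape[0]
        i, j = np.indices((t, t)) + 1
        return np.sum((i - j) ** 2 * P)

    def pi0_squared_weights(P):
        """Absolute agreement Pi_0 per (3.2) with the squared weights (3.3)."""
        t = P.shape[0]
        i, j = np.indices((t, t)) + 1
        w = 1 - (i - j) ** 2 / (t - 1) ** 2
        return np.sum(w * P)

    # illustrative 3 x 3 probability table (rows: X, columns: Y)
    P = np.array([[0.20, 0.05, 0.00],
                  [0.05, 0.30, 0.05],
                  [0.00, 0.05, 0.30]])
    t = P.shape[0]
    eps2 = msd_categorical(P)
    pi0 = pi0_squared_weights(P)
    # identity (3.4): Pi_0 = 1 - eps^2 / (t - 1)^2
    assert np.isclose(pi0, 1 - eps2 / (t - 1) ** 2)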

3.1.3 Relative Indices: Kappa and Weighted Kappa

The controversy associated with absolute indices such as Π_0 is that a certain amount of agreement is expected by chance alone. Even when the two raters use totally different metrics in judging the subjects, we would expect them to agree with a certain probability Π_c by chance alone. It is natural to eliminate such chance agreement. This is the idea behind Cohen's kappa (1960).
To better illustrate the arguments made above, let us look at some numerical examples when t = 2 and n = 100:
1. Perfect agreement: Raters 1 and 2 agree on every subject in the following table. There is no controversy in concluding perfect agreement between the two raters.

                        Rater 2
                  Yes     No      Total
          Yes     50      0       50
  Rater 1 No      0       50      50
          Total   50      50      100

2. No agreement: Raters 1 and 2 agree on 25 + 25 = 50 of the 100 subjects they examined in the following table. Although these two raters agree 50% of the time, in fact, there is no association between the raters in this example, because all of their agreement can be attributed to chance.

                        Rater 2
                  Yes     No      Total
          Yes     25      25      50
  Rater 1 No      25      25      50
          Total   50      50      100

Also note that in each of the above examples, the marginal distributions are the
same for the two raters. Just as for continuous data, the fact of identical marginal
distributions between the two raters does not necessarily imply agreement between
them.
Cohen (1960) proposed a measure called kappa as a chance-corrected agreement
index. This index depends strictly on the diagonal probabilities corrected by the
chance probabilities derived from the marginal probabilities.
Cohen (1968) later improved on the kappa coefficient by proposing weighted
kappa for data measured in ordinal scales. We will discuss weighted kappa first,
because kappa is a special case of weighted kappa. Weighted kappa is designed to
recognize that some disagreements between the two raters should be considered
more serious than others. For example, disagreement between “mild” and “life-
threatening” for a patient’s condition is more serious than disagreement between
“critical” and “life-threatening.” Therefore, it would be prudent to assign weights
to reflect the seriousness of disagreements among the rated conditions. In general,
for assessing agreement (disagreement), we would expect the weight to be greater
(smaller) for cells closer to (farther from) the main diagonal.
The weighted measure of agreement by chance becomes

\[
\Pi_c = \sum_{i=1}^{t}\sum_{j=1}^{t} w_{ij}\,\pi_{i\cdot}\,\pi_{\cdot j}, \tag{3.5}
\]

where π_i· (π_·j) represents the marginal probability for rater X (rater Y). When the outcome takes on binary scores, or t = 2, Π_c reduces to π_1·π_·1 + π_2·π_·2.
Finally, weighted kappa is defined as

\[
\kappa_w = \frac{\Pi_0 - \Pi_c}{1 - \Pi_c}. \tag{3.6}
\]

Cohen requires these weights to satisfy the following conditions:
1. w_ij = 1 when i = j,
2. 0 < w_ij < 1 when i ≠ j,
3. w_ij = w_ji.
Cicchetti and Allison (1971) suggested the following set of weights:

\[
w_{ij} = 1 - \frac{|i-j|}{t-1}. \tag{3.7}
\]
Fleiss and Cohen (1973) suggested the squared weight function shown in (3.3). Weighted kappa based on the weight function in (3.7) is always less than or equal to that based on the weight function in (3.3).
When

\[
w_{ij} =
\begin{cases}
1 & \text{for } i = j,\\
0 & \text{for } i \neq j,
\end{cases}
\]

weighted kappa becomes the kappa coefficient originally proposed by Cohen (1960). This kappa coefficient has been used for data with a binary or nominal scale. Weighted kappa, regardless of which weight function is used, degenerates into kappa in the binary case. These relative agreement indices are invariant under any linear transformation of the paired data.

3.1.4 Sample Counterparts

Suppose we have two raters, each of whom independently evaluates and assigns n subjects to one of t categories. Let p_ij represent the proportion of subjects assigned to category i by rater X and category j by rater Y, where i, j = 1, 2, ..., t. Let p_i· (p_·j) represent the marginal proportion of subjects assigned to category i (category j) by rater X (rater Y), where i, j = 1, 2, ..., t. Then the expected values of the above proportions are E(p_ij) = π_ij, E(p_i·) = π_i·, and E(p_·j) = π_·j. These proportions are also the maximum likelihood estimators (MLEs) or sample moment counterparts of the respective probabilities.

3.1.5 Statistical Inference on Weighted Kappa

Fleiss, Cohen, and Everitt (1969) proposed to estimate weighted kappa using the sample counterpart

\[
\hat{\kappa}_w = \frac{P_0 - P_c}{1 - P_c}, \tag{3.8}
\]

where

\[
P_0 = \sum_{i=1}^{t}\sum_{j=1}^{t} w_{ij}\,p_{ij}, \qquad
P_c = \sum_{i=1}^{t}\sum_{j=1}^{t} w_{ij}\,p_{i\cdot}\,p_{\cdot j}.
\]

The weighted kappa estimate κ̂_w has an asymptotic normal distribution with mean κ_w and variance

\[
\sigma^2_{\hat{\kappa}_w} =
\frac{\sum_i\sum_j \pi_{ij}\left[w_{ij} - (\bar{w}_{i\cdot} + \bar{w}_{\cdot j})(1 - \kappa_w)\right]^2
- \left[\kappa_w - \Pi_c(1 - \kappa_w)\right]^2}{n(1 - \Pi_c)^2}, \tag{3.9}
\]

where w̄_i· = Σ_j π_·j w_ij and w̄_·j = Σ_i π_i· w_ij.
Statistical inference can be performed by using the sample counterpart of (3.9), denoted by s_κ̂w. We can compute a 95% one-tailed lower confidence limit as

\[
\hat{\kappa}_w - 1.645\, s_{\hat{\kappa}_w}.
\]

These Cohen kappa coefficients have been widely used and can be obtained from the output of the SAS procedure FREQ when the option AGREE is specified in the TABLES statement.

3.1.6 Equivalence of Weighted Kappa and CCC

Using the weight function defined in (3.3), we can verify that weighted kappa is actually equal to the CCC presented in Chapter 2. Recall that CCC was defined in (2.24) as

\[
\rho_c = 1 - \frac{\varepsilon^2}{\varepsilon^2|_{\rho=0}}.
\]

When ρ = 0, MSD becomes

\[
\varepsilon^2|_{\rho=0} = \sum_{i=1}^{t}\sum_{j=1}^{t}(i-j)^2\,\pi_{i\cdot}\,\pi_{\cdot j}, \qquad i, j = 1, 2, \ldots, t. \tag{3.10}
\]

Therefore, (3.5) becomes

\[
\Pi_c = \sum_{i=1}^{t}\sum_{j=1}^{t} w_{ij}\,\pi_{i\cdot}\,\pi_{\cdot j} = 1 - \frac{\varepsilon^2|_{\rho=0}}{(t-1)^2}.
\]

Together with (3.4), the CCC becomes

\[
\rho_c = 1 - \frac{1 - \Pi_0}{1 - \Pi_c} = \frac{\Pi_0 - \Pi_c}{1 - \Pi_c}, \tag{3.11}
\]

which is exactly the same as the weighted kappa in (3.6).
The equivalence of weighted kappa using the squared distance weight function and CCC can also be found in Robieson (1999), King and Chinchilli (2001a, 2001b), and Barnhart, Haber, and Song (2002). King and Chinchilli (2001a, 2001b) proved that the weighted kappa estimate using the weight function (3.7) is exactly the same as the CCC estimate with the absolute-difference function by U-statistics.
The variance presented in (3.9) is not the same as the variance of the CCC estimate presented in (2.29), because the CCC presented in Chapter 2 uses the normality assumption in deriving the variance of the CCC estimate. In Chapter 5 we will prove that the equivalence can be established using generalized estimating equations (GEE) without assuming normality.
3.1.7 Weighted Kappa as Product of Precision and Accuracy Coefficients

For the agreement statistics presented in Sections 3.1.2 and 3.1.3, the two raters need to classify subjects or samples based on the same metrics. When the metrics are not the same, or when the shifts in marginal distributions are not negligible, we are dealing with association instead of agreement. Several statistical tools are available for measuring and evaluating association; these are listed in Section 3.4. Popular association measurements include the Pearson correlation coefficient, the χ² statistic for contingency tables, some forms of intraclass correlation coefficients (ICC), and log-linear modeling, among several others. If we use the Pearson correlation coefficient for measuring association, we can regard it as a precision measurement, as in the CCC. Accuracy can be characterized by examining the differences in the marginal distributions. Such statistical tools are also widely available in the textbooks listed in Section 3.4. In this book we concentrate on the accuracy and precision topics that mirror those of CCC in Section 2.3.
The equivalence of CCC and weighted kappa can shed new light on the way we view the weighted kappa. We can break κ_w into precision (the Pearson correlation coefficient, ρ) and accuracy (χ_a) coefficients in the same way we did for CCC. That is,

\[
\kappa_w = \rho\,\chi_a, \tag{3.12}
\]

where

\[
\rho = \frac{\sigma_{yx}}{\sigma_y\,\sigma_x} \tag{3.13}
\]

and

\[
\chi_a = \frac{2\,\sigma_y\,\sigma_x}{\sigma_y^2 + \sigma_x^2 + (\mu_y - \mu_x)^2}. \tag{3.14}
\]

Here,

\[
\mu_x = \sum_{i=1}^{t} i\,\pi_{i\cdot}, \qquad
\mu_y = \sum_{j=1}^{t} j\,\pi_{\cdot j},
\]
\[
\sigma_x^2 = \sum_{i=1}^{t} i^2\,\pi_{i\cdot} - \left(\sum_{i=1}^{t} i\,\pi_{i\cdot}\right)^2, \qquad
\sigma_y^2 = \sum_{j=1}^{t} j^2\,\pi_{\cdot j} - \left(\sum_{j=1}^{t} j\,\pi_{\cdot j}\right)^2,
\]

and

\[
\sigma_{yx} = \sum_{i=1}^{t}\sum_{j=1}^{t} ij\,\pi_{ij} - \left(\sum_{i=1}^{t} i\,\pi_{i\cdot}\right)\left(\sum_{j=1}^{t} j\,\pi_{\cdot j}\right).
\]

We can see that ρ depends on the cell probabilities, while χ_a depends strictly on the marginal probabilities. In the most basic scenario of i, j = 1, 2, the precision coefficient becomes

\[
\rho = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{\sqrt{\pi_{1\cdot}\,\pi_{2\cdot}\,\pi_{\cdot 1}\,\pi_{\cdot 2}}}, \tag{3.15}
\]

and the accuracy coefficient is defined as

\[
\chi_a = \frac{2\sqrt{\pi_{1\cdot}\,\pi_{2\cdot}\,\pi_{\cdot 1}\,\pi_{\cdot 2}}}{\pi_{1\cdot}\,\pi_{2\cdot} + \pi_{\cdot 1}\,\pi_{\cdot 2} + (\pi_{2\cdot} - \pi_{\cdot 2})^2}. \tag{3.16}
\]

Obviously, 0 < χ_a ≤ 1, and χ_a = 1 when the marginal probabilities are identical. Since κ_w = ρχ_a, the weighted kappa is uniquely determined by the marginal probabilities and the Pearson correlation coefficient, even though the cell probabilities could vary. Inferences on these accuracy and precision coefficients for categorical data will be discussed in Chapter 5.
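A short sketch of this decomposition (plain NumPy; the 2 × 2 probability table is illustrative, not from the text) computes ρ and χ_a from (3.13)-(3.14) and confirms that their product matches κ_w from (3.6) with the squared weights.

    import numpy as np

    def kappa_decomposition(P):
        """Precision (rho) and accuracy (chi_a) per (3.13)-(3.14) for a t x t table."""
        t = P.shape[0]
        s = np.arange(1, t + 1)                      # category scores 1..t
        px, py = P.sum(axis=1), P.sum(axis=0)        # marginals of X and Y
        mu_x, mu_y = s @ px, s @ py
        var_x = s**2 @ px - mu_x**2
        var_y = s**2 @ py - mu_y**2
        cov_yx = s @ P @ s - mu_x * mu_y             # sum_ij i*j*pi_ij - mu_x*mu_y
        rho = cov_yx / np.sqrt(var_x * var_y)                                   # (3.13)
        chi_a = 2 * np.sqrt(var_x * var_y) / (var_x + var_y + (mu_y - mu_x)**2) # (3.14)
        return rho, chi_a

    P = np.array([[0.3, 0.1],
                  [0.2, 0.4]])                       # illustrative table
    rho, chi_a = kappa_decomposition(P)
    # for t = 2, kappa_w = (Pi_0 - Pi_c) / (1 - Pi_c) under the squared weights:
    pi0 = P[0, 0] + P[1, 1]
    pic = P.sum(axis=1) @ P.sum(axis=0)
    assert np.isclose(rho * chi_a, (pi0 - pic) / (1 - pic))                     # (3.12)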

3.1.8 Intraclass Correlation Coefficient and Its Association with Weighted Kappa and CCC

CCC is a special form of ICC (Robieson 1999; Carrasco and Jover 2003) when the mean square of the difference among raters (the fixed rater effect) is included in the denominator. Therefore, CCC and weighted kappa can be estimated by variance components, with the related statistical inferences obtained through the delta method.
For simplicity, we first demonstrate the equivalence of ICC and CCC based on the basic mixed-effect model in which each of n subjects is evaluated by two raters. The value of Y_ij is the category that rater j assigned to subject i. We will discuss a more general form of ICC in Chapter 5, where we have k ≥ 2 raters, each evaluating n subjects with m ≥ 1 readings.
The mixed-effect model considered here is

\[
Y_{ij} = \mu + \alpha_i + \beta_j + e_{ij}, \qquad j = 1, 2; \quad i = 1, 2, \ldots, n, \tag{3.17}
\]

where the overall mean is μ. The fixed rater effect is β_j, and it sums to zero. The random subject effect is α_i, with variance σ_α². The random error is e_ij, with variance σ_e², and e_ij is assumed to be uncorrelated with α_i.
Traditionally, ICC in its original form is defined as the between-subject variance σ_α² divided by the total variance σ_α² + σ_e² in the above model, disregarding the rater effect β_j. This traditional ICC can be interpreted in terms of rater consistency or reliability rather than rater agreement (Shrout and Fleiss 1979). The expected value of an individual observation in model (3.17), regardless of whether the data are continuous, ordinal, or binary, is

\[
E(Y_{ij}) = \mu_j = \mu + \beta_j, \qquad j = 1, 2.
\]

The covariances among individual observations are

\[
\operatorname{cov}(Y_{ij}, Y_{i'j'}) =
\begin{cases}
\sigma_\alpha^2 + \sigma_e^2, & \text{if } i = i' \text{ and } j = j',\\
\sigma_\alpha^2, & \text{if } i = i' \text{ and } j \neq j',\\
0, & \text{if } i \neq i'.
\end{cases} \tag{3.18}
\]

For j = 2, we have σ_α² = σ_12, the covariance of the two raters, and

\[
\sigma_\alpha^2 + \sigma_e^2 = \frac{\sigma_1^2 + \sigma_2^2}{2}, \tag{3.19}
\]

where σ_j² is the variance of the effect associated with rater j, j = 1, 2.
Let σ_β² be the rater mean square, which is

\[
\sigma_\beta^2 = \frac{(\mu_1 - \mu_2)^2}{2}.
\]

The CCC or weighted kappa becomes

\[
\rho_c = \kappa_w = \frac{2\sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2} = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_e^2 + \sigma_\beta^2}.
\]

Model (3.17) assumes equal variances for the two raters. It is interesting to note that the CCC remains the same whether or not the variances are assumed equal. It is evident from (3.19) that σ_α² + σ_e² is actually the average of the two rater variances. However, the composition of the accuracy and precision coefficients is slightly altered under this model. We will discuss the ICC and CCC in greater detail in Chapter 5.

3.1.9 Rater Comparison Example

This example is taken from von Eye and Schuster (2000). Two psychiatrists
evaluated 129 patients who had previously been diagnosed as clinically depressed.
The rating categories were 0 for not depressed, 1 for mildly depressed, and 2 for
clinically depressed. Table 3.2 presents the rating results that the two psychiatrists
provided.
Table 3.3 presents the related agreement statistics and their confidence limits for the data reported in Table 3.2. There is a moderate level of agreement (weighted kappa = 0.4204, with a 95% lower confidence limit of 0.2737 by the squared weight function) between the two psychiatrists, with good accuracy and moderate precision.
Table 3.2 Severity of depression evaluated by two psychiatrists

                            Rater Y
            Category     0      1      2      Total
                0        11     2      19     32
    Rater X     1        1      3      3      7
                2        0      8      82     90
            Total        12     13     104    129

Table 3.3 Agreement statistics on severity of depression evaluated by two psychiatrists

    Statistics                     Estimate   95% LCL(a)
    Kappa (with 0 or 1 weights)    0.3745     0.2448
    Weighted kappa(b)              0.4018     0.2653
    Weighted kappa(c)              0.4204     0.2737
    Pearson correlation            0.4694     0.3369
    Spearman correlation           0.4202     0.2742

    (a) LCL: lower confidence limit
    (b) Weighted by the Cicchetti and Allison function, (3.7)
    (c) Weighted by the Fleiss and Cohen function, (3.3)
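As a numerical check, the following sketch (our own NumPy implementation of (3.8) and (3.9)) reproduces the weighted kappa of 0.4204 and a one-tailed 95% lower confidence limit of about 0.274 for Table 3.2 under the squared weight function (3.3).

    import numpy as np

    def weighted_kappa(counts, weights):
        """Weighted kappa estimate (3.8) and its asymptotic variance (3.9)."""
        n = counts.sum()
        p = counts / n                                  # cell proportions
        pr, pc = p.sum(axis=1), p.sum(axis=0)           # marginal proportions
        p0 = np.sum(weights * p)
        pc_exp = np.sum(weights * np.outer(pr, pc))
        kappa = (p0 - pc_exp) / (1 - pc_exp)
        wbar_r = weights @ pc                           # wbar_i. = sum_j p_.j w_ij
        wbar_c = pr @ weights                           # wbar_.j = sum_i p_i. w_ij
        term = (weights - np.add.outer(wbar_r, wbar_c) * (1 - kappa)) ** 2
        var = (np.sum(p * term) - (kappa - pc_exp * (1 - kappa)) ** 2) \
              / (n * (1 - pc_exp) ** 2)
        return kappa, var

    t = 3
    i, j = np.indices((t, t))
    w_sq = 1 - (i - j) ** 2 / (t - 1) ** 2              # squared weights (3.3)
    counts = np.array([[11, 2, 19], [1, 3, 3], [0, 8, 82]])   # Table 3.2
    kappa, var = weighted_kappa(counts, w_sq)           # kappa = 0.4204
    lcl = kappa - 1.645 * np.sqrt(var)                  # about 0.274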

3.2 Basic Approaches When Target Values Are Fixed: Absolute Indices

When the target values are fixed, investigators rarely examine results using a relative index. We consider the example of a diagnostic test. We select n_0 negative samples and n_1 positive samples to be tested as negative (0) or positive (1) by an instrument or a rater. Here, n_0 and n_1 are known. Fleiss (1973) illustrated this type of data in Section 1.2 of his book. Let us consider the 2 × 2 table presented in Table 3.4, formed from the n_0 negative and n_1 positive samples, for the assessment of sensitivity and specificity in Section 3.2.1.

3.2.1 Sensitivity and Specificity

Sensitivity is the conditional probability of a positive response given that the sample is positive, P(Y = 1 | X = 1), which can be estimated by p_ss = n_11/n_1. The larger the value of P(Y = 1 | X = 1), the more sensitive the test. Specificity is the conditional probability of a negative response given that the sample is negative, P(Y = 0 | X = 0), which can be estimated by p_sp = n_00/n_0. The larger the value of P(Y = 0 | X = 0), the more specific the test. These parameters can be directly estimated from the 2 × 2 table presented in Table 3.4.

Table 3.4 2 × 2 table under fixed sample sizes

                     Observed Y
    Category       0        1
    Actual X  0    n_00     n_01     n_0
              1    n_10     n_11     n_1

Statistical inference related to these parameters can simply be addressed by the inference on the related proportions. The variances of the sensitivity and specificity estimates can be estimated by p_ss(1 − p_ss)/n_1 and p_sp(1 − p_sp)/n_0, respectively. When n_0 or n_1 is large (say, at least 30), we can use the normal approximation to compute a 95% confidence interval as

\[
p_{ss} \pm 1.96\sqrt{\frac{p_{ss}(1 - p_{ss})}{n_1}} \tag{3.20}
\]

and

\[
p_{sp} \pm 1.96\sqrt{\frac{p_{sp}(1 - p_{sp})}{n_0}}, \tag{3.21}
\]

respectively. Here, we are often interested in only the one-tailed 95% lower confidence limit.
When n_0 or n_1 is not large enough, we can use the binomial (exact) confidence interval approach. We take the inverse of the binomial distribution with parameters n_0 or n_1 and p_sp or p_ss, at probability 0.025 for the lower limit and 0.975 for the upper limit. For a one-tailed lower limit, we use probability 0.05. We then divide these limits by n_0 or n_1 to obtain the confidence limits on specificity and sensitivity.
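Both interval approaches are easy to script; the following sketch (scipy; function names are ours) reproduces the one-tailed lower limits reported for the diagnostic example in Section 3.2.2.

    import numpy as np
    from scipy.stats import binom, norm

    def lcl_normal(k, n, alpha=0.05):
        """One-tailed lower limit by normal approximation, as in (3.20)-(3.21)."""
        p = k / n
        return p - norm.ppf(1 - alpha) * np.sqrt(p * (1 - p) / n)

    def lcl_binomial(k, n, alpha=0.05):
        """One-tailed lower limit by inverting the binomial distribution."""
        return binom.ppf(alpha, n, k / n) / n

    # sensitivity 63/73 and specificity 112/118 from Section 3.2.2:
    # lcl_binomial(63, 73) = 0.795, lcl_normal(63, 73) = 0.797
    # lcl_binomial(112, 118) = 0.915, lcl_normal(112, 118) = 0.916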

3.2.1.1 False Positive Rate and False Negative Rate

Epidemiologists distinguish two types of error rates. The false positive error rate, F⁺, is defined as the proportion of negative samples among those tested positive. The false negative error rate, F⁻, is defined as the proportion of positive samples among those tested negative. These error rates are typically defined through the application of Bayes's theorem by

\[
F^{+} = P(X = 0 \mid Y = 1) = \frac{P(Y = 1 \mid X = 0)\,P(X = 0)}{P(Y = 1)} \tag{3.22}
\]

and

\[
F^{-} = P(X = 1 \mid Y = 0) = \frac{P(Y = 0 \mid X = 1)\,P(X = 1)}{P(Y = 0)}. \tag{3.23}
\]

These error rates cannot be directly estimated from the 2 × 2 table presented in Table 3.4, because the sample sizes, n_0 and n_1, are fixed in the study. In order to estimate the above error rates, we need to collect information about the true positive (disease) rate, P(X = 1). Here, P(X = 0) is equal to 1 − P(X = 1).
For assay validation or diagnostic lab testing, there is little interest in estimating these error rates in a population. Only the sensitivity and specificity values are of interest here. However, in the diagnostic environment, it is common practice to regard 1 − sensitivity as the false negative error rate and 1 − specificity as the false positive error rate.

3.2.2 Diagnostic Test Example

This example is taken from Linn (2004). In a case-control study, pathological diagnoses of skin cancer (n_1 = 73) and benign tumors (n_0 = 118), defining the "true disease status," were recorded. Patients were also evaluated by the preresection clinical diagnosis (the "test") of a dermatologist. Table 3.5 presents the rating results.

Table 3.5 Skin cancer diagnosed by pathological evaluation and by the preresection evaluation of a dermatologist

                            Pre-resection
    Category            No      Yes
    Pathological  No    112     6       118
                  Yes   10      63      73

Table 3.6 presents the related agreement statistics and their associated confidence limits. The dermatologist has good specificity (0.949 with a 95% lower confidence limit of 0.915) and moderate sensitivity (0.863 with a 95% lower confidence limit of 0.795). Note that in this case-control study, the sample sizes of cases (n_1 = 73) and controls (n_0 = 118) were fixed. Therefore, the prevalence of the disease (skin cancer) cannot be estimated from this study.

Table 3.6 Agreement statistics between preresection and pathological evaluations

    Statistics      Estimate            95% LCLb(a)   95% LCLn(b)
    Sensitivity     63/73 = 0.863       0.795         0.797
    Specificity     112/118 = 0.949     0.915         0.916

    (a) LCLb: lower confidence limit by binomial distribution
    (b) LCLn: lower confidence limit by normal approximation

3.3 Discussion

We have examined the agreement statistics for ordinal and binary data when target
values are fixed or random. When target values are fixed, sensitivity and specificity
coefficients are commonly used under the fixed n0 and n1 samples. We rarely
conduct a study from n pairs of observations with fixed target values without fixing
the n0 and n1 samples.
When target values are random, kappa and weighted kappa are very popular
indices for measuring agreement for binary and ordinal data. The weighted kappa
with the squared weight function is identical to the CCC introduced in Chapter 2.
In Chapter 5, we will show that using the GEE methodology, the confidence limits
of CCC and weighted kappa or kappa are identical when the Z-transformation is
not used. When data have a nominal scale such as “depression, personality disorder,
schizophrenia, and others,” use of the simple kappa with diagonal weights of one
and zero otherwise has been common practice. Fleiss (1971) introduced category-specific kappa for nominal data, which examines the agreement–disagreement matrix among the nominal categories in greater depth.
The U-statistic approach in Section 2.11.1 is another novel approach to es-
timating weighted kappa with statistical inference. In this approach, the weight
function of Cicchetti and Allison (1971) given in (3.7) can be applied. This approach
allows for extension to the multiple-raters case. The GEE methodology proposed
by Barnhart and Williamson (2001) can also be applied to estimate weighted kappa
with the squared weight function with statistical inference. Both the U-statistic and
GEE methodologies are applicable when the target values are random.
CCC or weighted kappa is a special form of ICC in which the mean square difference among the fixed rater effects is included in the denominator; it can be estimated by variance components, with statistical inference through the delta method.
The data structure discussed in this chapter has focused on the most basic case,
namely, two raters evaluating each of n subjects only once. In Chapters 5 and 6,
we will present and discuss tools for cases of at least two raters evaluating each of
n subjects, and each could have m ≥ 1 replicates. Such tools can be used for both
categorical and continuous data when the target values are random.
In mental health studies, numerous instruments have been developed for the di-
agnosis of psychiatric disorders such as major depression, and there is considerable
interest in replacing one instrument by another instrument for reduction of cost, ease
of administration, and other considerations.
However, since these instruments are based on different questionnaires with
distinctive structures and point systems, they often have different scales. When
the scales of the instruments are different, the existing agreement methodology is
not applicable. For example, in a depression study, depression was measured by a
clinician-administered ordinal scale of no depression, mild depression, and severe
depression, and by a continuous scale of dimensional self-report. It remains un-
known whether the less-time-consuming self-report dimensional scale can replace
clinician-administered scale to determine the grade of the illness. This problem is
equivalent to assessing the extent to which the continuous scale can be interpreted
as the ordinal graded severity of depression.
Due to the different measurement scales, this question cannot be addressed in
the classical framework of agreement. Alternatively, we may perform a jackknife
(leaving one out) discriminant analysis to best classify the continuous scale against
the ordinal scale of no depression, mild depression, and severe depression. We can
then assess the weighted kappa of the classified scale based on the optimal cutoff
values of the continuous scale against the more definitive clinician-administered ordinal scale.
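A minimal sketch of that jackknife classification step (scikit-learn; the variable names are hypothetical) might look as follows; the classified scale is then compared with the clinician-administered scale via weighted kappa.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import LeaveOneOut, cross_val_predict

    # x: continuous self-report scores; y: clinician grades 0, 1, 2 (hypothetical data)
    def jackknife_classify(x, y):
        """Leave-one-out (jackknife) discriminant classification of the continuous scale."""
        lda = LinearDiscriminantAnalysis()
        return cross_val_predict(lda, np.asarray(x).reshape(-1, 1), y,
                                 cv=LeaveOneOut())
    # weighted kappa is then computed between jackknife_classify(x, y) and y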
As revealed in Shoukri (2004), in the evaluation of diagnostic tests it is well
known that certain tests that appear to have high sensitivity and specificity may on
the other hand have low predictive accuracy when the prevalence of the disease is
low. When the prevalence rate is low, we would have an asymmetric cluster of observations in the nondisease category, with sparse cell counts otherwise. This situation is similar to the case of continuous data in which the data range is short and there are a few outliers. This leads to a higher probability of agreement
by chance and therefore a lower kappa value. In this case, in analogy to the use of
TDI for continuous data, perhaps we should also present the absolute (unscaled)
agreement index as shown in (3.2).

3.4 A Brief Tour of Related Publications

For the assessment of bias among raters with respect to marginal distributions, the
literature includes McNemar’s test (1947), Cochran’s Q test (1950), Madansky’s
Q test (1963), and Friedman's χ² test (1937). In addition, Fleiss and Everitt
(1971) proposed a method for comparing the marginal distributions of an agreement
table, and Koch, Landis, Freeman, Freeman Jr, and Lehnen (1977) proposed
marginal homogeneity tests within the multivariate categorical data. Darroch (1981)
introduced Mantel–Haenszel tests for marginal symmetry. Landis, Sharp, Kuritz,
and Koch (1998) proposed generalized Mantel–Haenszel tests for both nominal
and ordinal data scales. Such bias assessment can also be captured in the accuracy
coefficient, as shown in (3.14). These correspond to the accuracy component of
weighted kappa.
For the assessment of association, the Pearson correlation coefficient and the
usual χ² test of association have been available for many decades. Such association assessment can also be captured in the precision coefficient, as shown in (3.13). Birch (1964, 1965) proposed partial association under 2 × 2 and general cases. For
the assessment of reliability, the ICC in its original form (Fisher 1925) is the ratio of
between-sample variance and total (within plus between) variance under the model
of equal marginal distributions. This original ICC was intended to measure precision
only.
Several forms of ICC have evolved. In particular, Bartko (1966), Shrout and
Fleiss (1979), Fleiss (1986), and Brennan (2001) have put forth various reliability
assessments. Landis and Koch (1977b) introduced category-specific intraclass and
interclass correlation coefficients within a multivariate variance-components model
for categorical data that can accommodate unbalanced designs. These correspond to
the precision component of weighted kappa. In Chapter 5, we will address various
forms of ICCs, including the one that represents precision.
For the assessment of agreement, Cohen (1960) introduced the kappa coefficient
to measure agreement between two raters on a nominal categorical data scale,
followed by Cohen (1968) and Everitt (1968) each separately proposing a weighted
kappa coefficient. However, Fleiss noted that the proposed variance estimators for
both of these agreement measures were incorrect, and invited both Cohen and
Everitt to collaborate in publishing correct variances, which appeared in Fleiss,
Cohen, and Everitt (1969). Fleiss (1971) introduced category-specific kappa coeffi-
cients and generalized Cohen’s kappa to situations involving multiple observers and
multiple categories.
Combining these broad areas into a common estimation and hypothesis-testing
framework, Landis and Koch (1977a) proposed the multivariate categorical data
framework of Koch, Landis, Freeman, Freeman Jr, and Lehnen (1977) for the
analysis of repeated measurement designs. The focus of this framework was to
test first-order marginal homogeneity among multiple observers and to estimate
multiple correlated kappa coefficients and their associated estimated covariances
to facilitate ease of confidence interval construction. Cicchetti and Fleiss (1977)
studied the null distributions of weighted kappa and the C ordinal statistic. Fleiss
and Cuzick (1979) further proposed kappa for binary response when the number
of observers differs for each subject. Bloch and Kraemer (1989) discussed 2 × 2 kappa coefficients as measures of agreement or association in greater detail.
Donner and Eliasziw (1992) proposed a goodness-of-fit approach to inference
procedures and sample size estimation for the kappa statistic. Shoukri and Martin
(1995) proposed MLE of the kappa coefficient from models of matched binary
responses. Williamson, Lipsitz, and Manatunga (2000) proposed modeling kappa
for measuring dependent categorical agreement data. Schuster (2001) used kappa as
a parameter of a symmetry model for rater agreement. Broemeling (2009) proposed
Bayesian methods for measure of agreement.
Another tool for assessing agreement is the log-linear model for rater agreement
of von Eye and Schuster (2000). There is a wealth of books (Landis and Koch
1977c; Haberman 1974, 1978; Goodman 1978; Haberman 1979; Fleiss, Levin, and Paik 1981; Aickin 1983; Freeman 1987; Cox and Snell 1989; Shoukri and
Edge 1996; Christensen 1997; Agresti 1990; Von Eye and Mun 2005, to name a
few) describing agreement assessment, including association, for categorical data.
Furthermore, Haber, Gao, and Barnhart (2007) introduced a method for assessing
observer agreement in studies involving replicated binary observations. Guo and
Manatunga (2009) proposed a method of measuring agreement of multivariate
discrete survival times using a modified weighted kappa coefficient. Yang and
Chinchilli (2009, 2011) proposed fixed-effects modeling of Cohen’s kappa for
bivariate multinomial data.
Chapter 4
Sample Size and Power

In this chapter, in addition to the formal method of computing the sample size and
power, we will discuss ways to compute sample size and power based on normal
approximation through a simplified and conservative approach. As pointed out
previously in assessing agreement, the traditional null and alternative hypotheses
should be reversed. The conventional rejection region is now the region of declaring
agreement, and is usually one-sided. Asymptotic power and sample size calculations
will proceed by this principle.

4.1 The General Case

We will present the asymptotic power of accepting agreement and the sample size calculation whereby we utilize either MSD or CCC in accepting or rejecting agreement. Inference based on approximated TDI or CP can be assessed through MSD. Let θ be a transformed agreement index. For continuous data, we use the Z-transformation for a CCC estimate, the logit transformation for a CP estimate, and the natural log transformation for an MSD estimate. When θ is MSD, we declare that two raters are in agreement under the alternative hypothesis H_1: θ < θ_0, where θ_0 is a prespecified tolerable index value or the null value. Here, the null hypothesis is H_0: θ ≥ θ_0. When θ is CCC, we reverse the directions of the above hypotheses. We compute the probability of declaring agreement under the alternative value θ_1, which can be the ideal condition value or an available historical value. We refer to this probability as power. Lin, Hedayat, Sinha, and Yang (2002) presented the power and sample size computations that are described in the sequel.
Let σ_0²/n_c and σ_1²/n_c be the variances of the respective estimates, where n_c = n − c, with c = 2 for MSD and CCC, and c = 3 for CP. For a one-tailed fixed type I error α, the power for declaring agreement based on CCC and CP becomes

\[
P = 1 - \Phi\left(\frac{(\theta_0 - \theta_1) + \Phi^{-1}(1-\alpha)\,\sigma_0/\sqrt{n_c}}{\sigma_1/\sqrt{n_c}}\right), \tag{4.1}
\]

and the power based on MSD becomes

\[
P = \Phi\left(\frac{(\theta_0 - \theta_1) - \Phi^{-1}(1-\alpha)\,\sigma_0/\sqrt{n_c}}{\sigma_1/\sqrt{n_c}}\right), \tag{4.2}
\]

where Φ(·) is the cumulative standard normal distribution function. For one-tailed fixed type I and type II errors α and β, the associated sample size becomes

\[
n = \left(\frac{\Phi^{-1}(1-\beta)\,\sigma_1 + \Phi^{-1}(1-\alpha)\,\sigma_0}{\theta_0 - \theta_1}\right)^2 + c. \tag{4.3}
\]

4.2 The Simplified Case

For the case of two raters with a single reading per rater, we can simplify the computations of power and sample size using an upper bound on the variance of each agreement-index estimate. According to (2.29) and (2.30), the variance of the Z-transformed CCC estimate is less than or equal to 1/(n − 2). This approximation is almost exact when ρ is close to zero and χ_a is close to one. According to (2.3) and (2.4), the variance of the log-transformed MSD estimate is less than or equal to 2/(n − 2). This approximation is exact when ρ is equal to zero. The upper bound of the variance of the transformed CP estimate remains unknown. Because CP is a mirrored index of TDI, and the approximated CP can also be computed from MSD, we need only calculate the sample size of the approximated TDI or CP, which can be derived from the upper bound of the variance of the log-transformed MSD estimate. The sample size determination does not have to be exact, as long as we stay on the conservative side. For categorical data, even though the Z-transformation does not necessarily help, we can still utilize this simplification for computing the sample size for weighted kappa, since the variances of CCC and weighted kappa are the same under the GEE methodology.

4.3 Examples Based on the Simplified Case

Suppose, as an example, that the available historical data indicate that CCC is 0.99 over a desirable data range, and we are willing to tolerate a CCC of 0.98. In this case, the sample size would be

\[
n = \left(\frac{\Phi^{-1}(1-\beta) + \Phi^{-1}(1-\alpha)}{\tanh^{-1}(0.99) - \tanh^{-1}(0.98)}\right)^2 + 2,
\]

which is equal to n = 53 for α = 0.05 and 1 − β = 0.8, where tanh⁻¹(·) is the Z-transformation.
Suppose, as another example, that the historical data indicate that TDI%_0.9 is 10%, and we are willing to tolerate a TDI%_0.9 of 15%. According to (2.42), the relationship between TDI based on the log-transformed data and TDI% is TDI = ln(1 + TDI%/100). In addition, MSD is proportional to TDI². In this case, the proportionality value between MSD and TDI² is irrelevant, and the sample size would become

\[
n = 2\left(\frac{\Phi^{-1}(1-\beta) + \Phi^{-1}(1-\alpha)}{\ln\left\{[\ln(1+0.15)]^2/[\ln(1+0.10)]^2\right\}}\right)^2 + 2,
\]

which is equal to n = 24 for α = 0.05 and 1 − β = 0.8, where ln(·) is the natural log transformation.
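These two computations are easily scripted; the following sketch (scipy; function names are ours) reproduces n = 53 and n = 24.

    from math import atanh, ceil, log
    from scipy.stats import norm

    def n_ccc(ccc_hist, ccc_tol, alpha=0.05, beta=0.2):
        """Sample size per (4.3) for CCC under the simplified variance bound."""
        z = norm.ppf(1 - beta) + norm.ppf(1 - alpha)
        return ceil((z / (atanh(ccc_hist) - atanh(ccc_tol))) ** 2 + 2)

    def n_tdi_pct(tdi_hist_pct, tdi_tol_pct, alpha=0.05, beta=0.2):
        """Sample size per (4.3) for TDI% via the log-MSD variance bound."""
        z = norm.ppf(1 - beta) + norm.ppf(1 - alpha)
        d = log(log(1 + tdi_tol_pct / 100) ** 2 / log(1 + tdi_hist_pct / 100) ** 2)
        return ceil(2 * (z / d) ** 2 + 2)

    # n_ccc(0.99, 0.98) = 53 and n_tdi_pct(10, 15) = 24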
Most frequently, we have only a historical within-sample variance (σ_e²) for data with constant error, or a coefficient of variation (CV%) for data with proportional error. We can easily translate these into TDI and CCC. Recall that in (2.24), CCC is inversely related to the mean square of the ratio of the within-sample total deviation to the total deviation. When the marginal distributions of the two raters are identical, the within-sample total deviation is σ_e, and the total deviation is the square root of the sum of σ_e² and the between-sample variance. The data range is proportional to the associated standard deviation, and assuming that our historical σ_e was computed based on a large sample size (say, at least 100 observations), this proportionality value is about 5 (Grant and Leavenworth 1972).
For data with constant error, we can use Φ⁻¹[1 − (1 − π_0)/2]·√(2σ_e²) as our historical TDI_π0 value, and use 1 − σ_e²/(d_r/5)² as our historical CCC value over a desirable data range d_r. For data with proportional error, we can assume that the data are log-normally distributed and use the relationship

\[
\omega^2 = \exp(\sigma_e^2) - 1, \tag{4.4}
\]

where σ_e² is the historical within-sample variance based on the log-transformed data and ω is the coefficient of variation (CV). As for the allowable TDI and CCC, we simply allow a cushion for σ_e² and some error for the squared bias between the two raters, σ_β².
For example, if our historical CV is 10%, we have a historical σ_e² of ln(1 + 0.1²) = 0.01 based on the log-transformed data, or a TDI_0.9 of 1.645√(2 ln(1 + 0.1²)) = 0.232, or a TDI%_0.9 of 26.1%. Assuming that the maximum value of the data is ten times the minimum value, we have a historical CCC of 1 − 0.01/[ln(10)/5]² = 0.953. If we are willing to allow for a 50% increase in σ_e², and a σ_β² that is half of σ_e², we have an allowable CV% of √(exp(0.01 × 1.5 + 0.01 × 0.5) − 1) = 14.2%, or a TDI_0.9 of 0.328, or a TDI%_0.9 of 38.5%, and an allowable CCC of 0.906. In this case, the sample size based on TDI would be

\[
n = 2\left(\frac{\Phi^{-1}(1-\beta) + \Phi^{-1}(1-\alpha)}{\ln(0.328^2) - \ln(0.232^2)}\right)^2 + 2,
\]

which is equal to n = 28 for α = 0.05 and 1 − β = 0.8. The sample size based on CCC would be

\[
n = \left(\frac{\Phi^{-1}(1-\beta) + \Phi^{-1}(1-\alpha)}{\tanh^{-1}(0.953) - \tanh^{-1}(0.906)}\right)^2 + 2,
\]

which is equal to n = 51 for α = 0.05 and 1 − β = 0.8. Here, the value of d_r is relatively irrelevant in computing the sample size. Note that the historical TDI_0.9 and CCC calculated from the CV% actually correspond to the intrasample TDI and CCC. We need a little more cushion to allow for some systematic bias and for more random error when setting the criterion.
As stated earlier, the sample size (power) based on CCC is always larger (smaller)
than that based on TDI, primarily due to the additional variation related to estimating
the scaled denominator. After collecting the data, we will accept the agreement if the
one-sided 95% upper (lower) confidence limit of the TDI (CCC) is better than the
preset criterion.
Chapter 5
A Unified Model for Continuous and Categorical Data

In this chapter, we generalize agreement assessment for continuous and categorical data to cover multiple raters (k ≥ 2), each with multiple readings (m ≥ 1) from each of the n subjects. In Chapters 2 and 3, we discussed agreement statistics for continuous and categorical data, respectively, based on the basic model of two raters with a single measurement each per subject. In the terminology of this chapter, those earlier chapters discussed primarily the case k = 2 and m = 1. We utilize the results from Barnhart, Song, and Haber (2005), who first proposed the within-rater CCC, the between-rater CCC based on the average of replicates, and the between-rater CCC based on individual replicates, and who used the GEE methodology for estimation and statistical inference. We then combine the GEE methodology with the knowledge gained from Robieson (1999) and Carrasco and Jover (2003), and propose a unified approach that is applicable to continuous and categorical data.
Our approach establishes the agreement statistics of CCC, accuracy and precision
coefficients for continuous and categorical data, and TDI and CP for normally
distributed data, based on functions of variance components through a two-way
mixed-effect model. We segregate these agreement statistics of MSD, TDI, CP,
CCC, precision and accuracy coefficients into intrarater, interrater based on the
average of replicates, and total rater based on individual replicate values. We then
use the GEE methodology to form the estimations and combine it with the delta
method to form statistical inferences. This chapter is largely based on the materials
from Lin, Hedayat, and Wu (2007).
Suppose each of k raters measures each of n subjects m times. The model we use for measuring agreement is

\[
y_{ijl} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e_{ijl}, \tag{5.1}
\]

where y_ijl stands for the lth reading from subject i given by rater j, with i = 1, 2, ..., n, j = 1, 2, ..., k, and l = 1, 2, ..., m. The readings can be continuous, binary, or ordinal. The overall mean is μ. The random subject effect, α_i, has equal second moments across all raters. The random rater-by-subject interaction effect, γ_ij, has equal second moments across all raters. The random error effect, e_ijl, is uncorrelated with α_i and γ_ij. The fixed rater effect is β_j, and we assume Σ_{j=1}^{k} β_j = 0.
The random subject effect α_i has mean 0 and variance σ_α². The interaction effect γ_ij has mean 0 and variance σ_γ². The random error effect has mean 0 and variance σ_e².

5.1 Definition of Variance Components

Based on model (5.1) and balanced data, the variance components can be expressed as follows. First, the variance of the random error effect is defined as

\[
\sigma_e^2 = \frac{\sum_{i=1}^{n}\sum_{j=1}^{k} \sigma_{ij}^2}{nk}, \tag{5.2}
\]

where σ_ij² is the variance of y_ijl for subject i and rater j.
The variance of the random subject effect is defined as

\[
\sigma_\alpha^2 = \frac{4\sum_{j=1}^{k-1}\sum_{j'=j+1}^{k}\sum_{l=1}^{m-1}\sum_{l'=l+1}^{m} \operatorname{cov}(y_{ijl}, y_{ij'l'})}{m^2 k(k-1)}, \tag{5.3}
\]

where cov(y_ijl, y_ij'l') is the covariance among different raters and replicates.
The variance of the interaction effect is defined as

\[
\sigma_\gamma^2 = A + B - C - D, \tag{5.4}
\]

where

\[
A = \frac{\sum_{j=1}^{k}\sum_{l=1}^{m} \sigma_{jl}^2}{m^2 k}, \tag{5.5}
\]

and σ_jl² is the variance of y_ijl for rater j and replicate l,

\[
B = \frac{2\sum_{j=1}^{k}\sum_{l=1}^{m-1}\sum_{l'=l+1}^{m} \operatorname{cov}(y_{ijl}, y_{ijl'})}{m^2 k}, \tag{5.6}
\]

where cov(y_ijl, y_ijl') is the covariance between replicates for rater j,

\[
C = \sigma_\alpha^2, \tag{5.7}
\]

and

\[
D = \frac{\sigma_e^2}{m}. \tag{5.8}
\]

Finally, the mean square of the fixed rater effect is defined as

\[
\sigma_\beta^2 = \frac{\sum_{j=1}^{k-1}\sum_{j'=j+1}^{k} (\beta_j - \beta_{j'})^2}{k(k-1)}. \tag{5.9}
\]
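As a rough illustration of how these components map onto data, the following moment-style sketch (our own simplification, not the book's formal GEE estimator of Section 5.6; it assumes a balanced array y of shape (n, k, m)) exploits the covariance structure implied by model (5.1).

    import numpy as np
    from itertools import combinations

    def variance_components(y):
        """Naive moment estimates for model (5.1); y has shape (n, k, m)."""
        n, k, m = y.shape
        s2_e = y.var(axis=2, ddof=1).mean()        # within-cell replicate variance, cf. (5.2)
        ybar = y.mean(axis=2)                       # subject-by-rater means
        # cov(ybar_ij, ybar_ij') = sigma_alpha^2 for raters j != j'
        s2_a = np.mean([np.cov(ybar[:, j], ybar[:, jp])[0, 1]
                        for j, jp in combinations(range(k), 2)])
        # var(ybar_ij) = sigma_alpha^2 + sigma_gamma^2 + sigma_e^2 / m
        s2_g = ybar.var(axis=0, ddof=1).mean() - s2_a - s2_e / m
        # naive plug-in of sample rater means into (5.9)
        mu = ybar.mean(axis=0)
        s2_b = np.mean([(mu[j] - mu[jp]) ** 2
                        for j, jp in combinations(range(k), 2)]) / 2
        return s2_a, s2_b, s2_g, s2_e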

5.2 Intrarater Precision

We assume that replicates within a rater are interchangeable. When measuring unscaled intrarater agreement independent of the data range, we use ε²_intra to denote the MSD_intra between any two replications l and l', l, l' = 1, 2, ..., m, for any rater or for the average across the k fixed raters:

\[
\varepsilon_{\text{intra}}^2 = E(y_{ijl} - y_{ijl'})^2
= (\mu_j - \mu_j)^2 + 2\sigma_\alpha^2 + 2\sigma_\gamma^2 + 2\sigma_e^2 - 2(\sigma_\alpha^2 + \sigma_\gamma^2)
= 2\sigma_e^2. \tag{5.10}
\]

For normally distributed data, following (2.8), TDI_intra(π_0) can be expressed as

\[
\delta_{\text{intra}(\pi_0)} = \Phi^{-1}\!\left(1 - \frac{1-\pi_0}{2}\right)\sqrt{2\sigma_e^2}, \tag{5.11}
\]

and CP_intra(δ_0) can be expressed as

\[
\mathrm{CP}_{\text{intra}(\delta_0)} = 1 - 2\left[1 - \Phi\!\left(\frac{\delta_0}{\sqrt{2\sigma_e^2}}\right)\right]. \tag{5.12}
\]

For any rater j, the intrarater precision for continuous and categorical data between any two replications, l and l', is defined as

\[
\rho_{\text{intra}} = \frac{\operatorname{cov}(y_{ijl}, y_{ijl'})}{\sqrt{\operatorname{var}(y_{ijl})}\sqrt{\operatorname{var}(y_{ijl'})}}
= 1 - \frac{\varepsilon_{\text{intra}}^2}{\varepsilon_{\text{intra}}^2\big|_{\rho_{y_{ij1},y_{ij2},\ldots,y_{ijm}}=0}}, \tag{5.13}
\]

where var(·) and cov(·) represent the variance and covariance functions, respectively.
Under model (5.1), in terms of the variance components defined in Section 5.1, CCC_intra becomes

\[
\rho_{c,\text{intra}} = \rho_{\text{intra}} = \frac{\sigma_\alpha^2 + \sigma_\gamma^2}{\sigma_\alpha^2 + \sigma_\gamma^2 + \sigma_e^2}. \tag{5.14}
\]

CCC_intra measures the proportion of the variance that is attributable to the subjects. Based on model (5.1), this proportion is the same for all k raters. Furthermore, the means of replicates within a rater are assumed equal under model (5.1). Therefore, the intrarater agreement CCC_intra equals ρ_intra, the intrarater precision coefficient, with an accuracy coefficient of one, and (5.11) and (5.12) are exact, not approximate. This relative agreement index is heavily dependent on the total variability (the total data range).

5.3 Interrater Agreement

Since there are m replicated readings for subject i given by rater j, the average of those m readings can be used to measure the interrater agreement. We use ȳ_ij· to denote the average of the m readings from subject i given by rater j.
The MSD between any two raters j and j', MSD_inter jj', becomes

\[
\varepsilon_{\text{inter},jj'}^2 = E(\bar{y}_{ij\cdot} - \bar{y}_{ij'\cdot})^2
= (\beta_j - \beta_{j'})^2 + 2\left(\sigma_\gamma^2 + \frac{\sigma_e^2}{m}\right). \tag{5.15}
\]

Across the k fixed raters, MSD_inter is the average of the k(k − 1)/2 MSD_inter jj' indices:

\[
\varepsilon_{\text{inter}}^2 = \frac{2}{k(k-1)}\sum_{j=1}^{k-1}\sum_{j'=j+1}^{k} \varepsilon_{\text{inter},jj'}^2
= 2\sigma_\beta^2 + 2\left(\sigma_\gamma^2 + \frac{\sigma_e^2}{m}\right). \tag{5.16}
\]

In this chapter we use the approximated inter and total CP based on (2.22), because the exact CP based on (2.10) would be complicated for model (5.1). For normally distributed data, TDI_inter(π_0) and CP_inter(δ_0) can be approximated by

\[
\delta_{\text{inter}(\pi_0)} \doteq \Phi^{-1}\!\left(1 - \frac{1-\pi_0}{2}\right)\sqrt{2\sigma_\beta^2 + 2\sigma_\gamma^2 + 2\sigma_e^2/m} \tag{5.17}
\]

and

\[
\mathrm{CP}_{\text{inter}(\delta_0)} \doteq 1 - 2\left[1 - \Phi\!\left(\frac{\delta_0}{\sqrt{2\sigma_\beta^2 + 2\sigma_\gamma^2 + 2\sigma_e^2/m}}\right)\right]. \tag{5.18}
\]

The TDI and CP approximations are good when the relative bias squared (RBS) value is reasonable (see Section 2.2.2). Otherwise, the approximation will be conservative when π_0 > 0.9. Here, the RBS is defined as

\[
\mathrm{RBS}_{\text{inter}} = \frac{\sigma_\beta^2}{\sigma_\gamma^2 + \sigma_e^2/m}. \tag{5.19}
\]

For continuous and categorical data, CCC_inter becomes

\[
\rho_{c,\text{inter}} = 1 - \frac{\varepsilon_{\text{inter}}^2}{\varepsilon_{\text{inter}}^2\big|_{\rho_{\bar{y}_{i1\cdot},\bar{y}_{i2\cdot},\ldots,\bar{y}_{ik\cdot}}=0}}. \tag{5.20}
\]

Under model (5.1), in terms of the variance components defined in Section 5.1, CCC_inter becomes

\[
\rho_{c,\text{inter}} = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\gamma^2 + \dfrac{\sigma_e^2}{m} + \sigma_\beta^2}. \tag{5.21}
\]

Since readings from different raters have different expected means, we further define the interrater agreement CCC_inter as consisting of two parts: the interrater precision and interrater accuracy coefficients. The interrater precision coefficient becomes

\[
\rho_{\text{inter}} = \frac{\operatorname{cov}(\bar{y}_{ij\cdot}, \bar{y}_{ij'\cdot})}{\sqrt{\operatorname{var}(\bar{y}_{ij\cdot})}\sqrt{\operatorname{var}(\bar{y}_{ij'\cdot})}}
= \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\gamma^2 + \dfrac{\sigma_e^2}{m}}. \tag{5.22}
\]

The interrater accuracy coefficient becomes

\[
\chi_{a,\text{inter}} = \frac{\sigma_\alpha^2 + \sigma_\gamma^2 + \dfrac{\sigma_e^2}{m}}{\sigma_\alpha^2 + \sigma_\gamma^2 + \dfrac{\sigma_e^2}{m} + \sigma_\beta^2}. \tag{5.23}
\]

Here, ρ_c,inter is the product of ρ_inter and χ_a,inter. The accuracy index measures how close the means of the raters are. In model (5.1), variances are assumed to be the same for different raters, and consequently they are not present in the accuracy index. Therefore, the definition of accuracy is slightly modified compared to that originally defined by Lin (1989). The interrater agreement is measured based on the average of the m readings made by each rater. Therefore, the agreement indices depend on the number of replications (m).
The approach proposed by Barnhart, Song, and Haber (2005) allows for different variances among raters, and is a measure based on the true readings, μ_ij, from each rater and subject. Therefore, their interrater CCC does not depend on the number of replications. In addition, the interrater CCC from Barnhart, Song, and Haber (2005) equals the limit of our CCC_inter as the number of replications m goes to infinity.

5.4 Total-Rater Agreement

Since there are m replicated readings for subject i given by rater j, the interrater agreement could also be based on any one of the m replicated readings. Total agreement is such a measure of agreement, based on any individual reading from each rater.
The MSD between any two raters j and j', MSD_total jj', becomes

\[
\varepsilon_{\text{total},jj'}^2 = E(y_{ijl} - y_{ij'l'})^2
= (\beta_j - \beta_{j'})^2 + 2\left(\sigma_\gamma^2 + \sigma_e^2\right). \tag{5.24}
\]

Across the k fixed raters, MSD_total is the average of the k(k − 1)/2 MSD_total jj' indices:

\[
\varepsilon_{\text{total}}^2 = \frac{2}{k(k-1)}\sum_{j=1}^{k-1}\sum_{j'=j+1}^{k} \varepsilon_{\text{total},jj'}^2
= 2\sigma_\beta^2 + 2\left(\sigma_\gamma^2 + \sigma_e^2\right). \tag{5.25}
\]

For normally distributed data, TDI_total(π_0) and CP_total(δ_0) can be approximated by

\[
\delta_{\text{total}(\pi_0)} \doteq \Phi^{-1}\!\left(1 - \frac{1-\pi_0}{2}\right)\sqrt{2\sigma_\beta^2 + 2\sigma_\gamma^2 + 2\sigma_e^2} \tag{5.26}
\]

and

\[
\mathrm{CP}_{\text{total}(\delta_0)} \doteq 1 - 2\left[1 - \Phi\!\left(\frac{\delta_0}{\sqrt{2\sigma_\beta^2 + 2\sigma_\gamma^2 + 2\sigma_e^2}}\right)\right]. \tag{5.27}
\]

The TDI and CP approximations are adequate when the RBS is reasonable. Otherwise, the approximations will be conservative when π_0 > 0.9. Here, the RBS is defined as

\[
\mathrm{RBS}_{\text{total}} = \frac{\sigma_\beta^2}{\sigma_\gamma^2 + \sigma_e^2}. \tag{5.28}
\]

For continuous and categorical data, CCC_total is defined as

\[
\rho_{c,\text{total}} = 1 - \frac{\varepsilon_{\text{total}}^2}{\varepsilon_{\text{total}}^2\big|_{\rho_{y_{i1l},y_{i2l},\ldots,y_{ikl}}=0}}. \tag{5.29}
\]

Under model (5.1), in terms of the variance components defined in Section 5.1, CCC_total becomes

\[
\rho_{c,\text{total}} = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\gamma^2 + \sigma_e^2 + \sigma_\beta^2}. \tag{5.30}
\]

The total-rater precision and accuracy coefficients are

\[
\rho_{\text{total}} = \frac{\operatorname{cov}(y_{ijl}, y_{ij'l'})}{\sqrt{\operatorname{var}(y_{ijl})}\sqrt{\operatorname{var}(y_{ij'l'})}}
= \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\gamma^2 + \sigma_e^2} \tag{5.31}
\]

and

\[
\chi_{a,\text{total}} = \frac{\sigma_\alpha^2 + \sigma_\gamma^2 + \sigma_e^2}{\sigma_\alpha^2 + \sigma_\gamma^2 + \sigma_e^2 + \sigma_\beta^2}. \tag{5.32}
\]

Again,

\[
\rho_{c,\text{total}} = \rho_{\text{total}}\,\chi_{a,\text{total}}.
\]
5.5 Proportional Error Case

Similar to Section 2.5, when the residual standard deviation is proportional to the measurement, we apply the natural log transformation to the data and then compute the agreement statistics. The TDI%_π0 is defined as

\[
\mathrm{TDI\%}_{\pi_0} = 100\left[\exp(\delta_{\pi_0}) - 1\right]\%. \tag{5.33}
\]

In this situation, TDI%_π0 is the antitransformed TDI_π0 minus 1, which measures a percent change rather than an absolute deviation.

5.6 Asymptotic Normality

We use ȳ_jl to denote the average of the n readings from rater j for replicate l. The rater means and all variance components, μ_1, μ_2, ..., μ_k, σ_β², σ_α², σ_γ², and σ_e², are estimated through the GEE methodology according to the following system of equations:

\[
\sum_{i=1}^{n} F_i' H_i^{-1}\left[Q_i - \Theta\right] = 0. \tag{5.34}
\]

Here,

\[
Q_i = \begin{pmatrix}
\frac{1}{m}(y_{i11} + y_{i12} + \cdots + y_{i1m})\\[4pt]
\vdots\\[4pt]
\frac{1}{m}(y_{ij1} + y_{ij2} + \cdots + y_{ijm})\\[4pt]
\vdots\\[4pt]
\frac{1}{m}(y_{ik1} + y_{ik2} + \cdots + y_{ikm})\\[4pt]
\frac{1}{k(k-1)}\sum_{j=1}^{k-1}\sum_{j'=j+1}^{k}(\bar{y}_{ij\cdot} - \bar{y}_{ij'\cdot})^2\\[4pt]
\frac{4}{m^2 k(k-1)}\sum_{j=1}^{k-1}\sum_{j'=j+1}^{k}\sum_{l=1}^{m-1}\sum_{l'=l+1}^{m}(y_{ijl} - \bar{y}_{jl})(y_{ij'l'} - \bar{y}_{j'l'})\\[4pt]
\frac{1}{k}\sum_{j=1}^{k}\dfrac{\sum_{l=1}^{m}(y_{ijl} - \bar{y}_{ij\cdot})^2}{m-1}\\[4pt]
\frac{1}{m^2 k}\sum_{j=1}^{k}\sum_{l=1}^{m}(y_{ijl} - \bar{y}_{jl})^2 + \frac{2}{m^2 k}\sum_{j=1}^{k}\sum_{l=1}^{m-1}\sum_{l'=l+1}^{m}(y_{ijl} - \bar{y}_{jl})(y_{ijl'} - \bar{y}_{jl'})
\end{pmatrix},
\]

with the expected values

\[
\Theta = \begin{pmatrix}
\mu_1\\ \vdots\\ \mu_j\\ \vdots\\ \mu_k\\[4pt]
\sigma_\beta^2 + \sigma_\gamma^2 + \dfrac{\sigma_e^2}{m}\\[4pt]
\sigma_\alpha^2\\[4pt]
\sigma_e^2\\[4pt]
\sigma_\alpha^2 + \dfrac{\sigma_e^2}{m} + \sigma_\gamma^2
\end{pmatrix}.
\]

The working covariance matrix for Q_i (Zeger and Liang 1986) is conveniently set as a diagonal matrix (Barnhart and Williamson 2001) given by

\[
H_i = \operatorname{diag}(\operatorname{var}(Q_i)) = \operatorname{diag}(a_1, \ldots, a_j, \ldots, a_k, d, e, f, g),
\]

with the elements

\[
a_1 = \cdots = a_j = \cdots = a_k = \operatorname{var}\!\left[\frac{1}{m}(y_{ij1} + y_{ij2} + \cdots + y_{ijm})\right] = \sigma_\alpha^2 + \sigma_\gamma^2 + \frac{\sigma_e^2}{m}, \tag{5.35}
\]

\[
\begin{aligned}
d &= \operatorname{var}\!\left[\frac{1}{k(k-1)}\sum_{j=1}^{k-1}\sum_{j'=j+1}^{k}(\bar{y}_{ij\cdot} - \bar{y}_{ij'\cdot})^2\right]\\
&= \frac{k(k-1)(3k-2)}{m^2}\sigma_e^4 + k(k-1)(3k-2)\sigma_\gamma^4 + 8k(k-1)\sigma_\beta^2\sigma_\gamma^2\\
&\quad + \frac{8k(k-1)}{m}\sigma_\beta^2\sigma_e^2 + \frac{4k^2(k-1)}{m}\sigma_\gamma^2\sigma_e^2,
\end{aligned} \tag{5.36}
\]

\[
\begin{aligned}
e &= \operatorname{var}\!\left\{\frac{4}{m^2 k(k-1)}\sum_{j=1}^{k-1}\sum_{j'=j+1}^{k}\sum_{l=1}^{m-1}\sum_{l'=l+1}^{m}(y_{ijl} - \bar{y}_{jl})(y_{ij'l'} - \bar{y}_{j'l'})\right\}\\
&= m^4 k(k-1)(2k-3)\,\sigma_\alpha^4 + \frac{m^4 k(k-1)(2k-3)}{2}\,\sigma_\gamma^4 + \frac{m^2 k(k-1)}{2}\,\sigma_e^4\\
&\quad + \frac{\left[2 + (m-1)^2 + 2m(m-1)(k-2)\right]k(k-1)m^2}{2}\left(\sigma_\alpha^2\sigma_e^2 + 2\sigma_\gamma^2\sigma_e^2\right)\\
&\quad + m^4 k(k-1)(2k-3)\,\sigma_\alpha^2\sigma_\gamma^2,
\end{aligned} \tag{5.37}
\]

\[
f = \operatorname{var}\!\left\{\frac{1}{k}\sum_{j=1}^{k}\frac{\sum_{l=1}^{m}(y_{ijl} - \bar{y}_{ij\cdot})^2}{m-1}\right\} = \frac{2k}{m-1}\sigma_e^4, \tag{5.38}
\]

and

\[
\begin{aligned}
g &= \operatorname{var}\!\left[\frac{1}{m^2 k}\sum_{j=1}^{k}\sum_{l=1}^{m}(y_{ijl} - \bar{y}_{jl})^2 + \frac{2}{m^2 k}\sum_{j=1}^{k}\sum_{l=1}^{m-1}\sum_{l'=l+1}^{m}(y_{ijl} - \bar{y}_{jl})(y_{ijl'} - \bar{y}_{jl'})\right]\\
&= \left[km(m-1)(2m-3) + \frac{m(m-1)(k-1)}{2} + \frac{m^2 k^2(3m-1)}{2}\right]\sigma_\alpha^4\\
&\quad + \left[m^2 k\left(1 - \frac{3k}{2} + m + \frac{mk}{2}\right) + km(m-1)(2m-3)\right]\sigma_\gamma^4 + \frac{mk(m-3)}{2}\sigma_e^4\\
&\quad + \left[2m^2 k(2-k) + mk(m-1)(mk + 5m - 4)\right]\sigma_\alpha^2\sigma_\gamma^2\\
&\quad + mk\left[4 - mk + (m-1) + \frac{mk(m-1)}{2}\right]\sigma_\alpha^2\sigma_e^2\\
&\quad + mk\left[4 + (m-1) - mk + \frac{(m-1)(mk+4)}{2}\right]\sigma_\gamma^2\sigma_e^2.
\end{aligned} \tag{5.39}
\]

Finally,

\[
F_i = \frac{\partial \Theta}{\partial(\mu_1, \ldots, \mu_k, \sigma_\beta^2, \sigma_\alpha^2, \sigma_e^2, \sigma_\gamma^2)} =
\begin{pmatrix}
I_k & 0_{k\times 4}\\
0_{4\times k} & f_{4\times 4}
\end{pmatrix}, \tag{5.40}
\]

where

\[
f_{4\times 4} = \begin{pmatrix}
1 & 0 & 1/m & 1\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0\\
0 & 1 & 1/m & 1
\end{pmatrix}.
\]

Note that the working covariance matrix is derived assuming normality. We obtain the estimates of the means and variance components through (5.34), which are their respective sample counterparts given in Q_i, and estimate their variance–covariance matrix through the GEE methodology.
The variance–covariance matrix for Q_i is

\[
\operatorname{var}(Q_i) = \frac{1}{n}\, D^{-1}\, \Sigma\, D^{-1\prime}, \tag{5.41}
\]

where

\[
D = \sum_{i=1}^{n} F_i' H_i^{-1} F_i
\qquad\text{and}\qquad
\Sigma = \sum_{i=1}^{n} F_i' H_i^{-1} (Q_i - \Theta)(Q_i - \Theta)' H_i^{-1} F_i.
\]

Since the working covariance matrix, H_i, in (5.34) is a diagonal matrix and model (5.1) does not involve covariates, the estimate of var(Q_i) actually degenerates to the robust variance estimate from the least squares estimating equations (Zeger, Liang, and Albert 1988).
We then use the delta method to derive the asymptotic variance of the estimate of each agreement index. We arrive analytically at the following variances of the MSD estimates:

\[
\operatorname{var}(\hat{\varepsilon}_{\text{intra}}^2) = \frac{4}{n}\operatorname{var}(\sigma_e^2), \tag{5.42}
\]

\[
\operatorname{var}(\hat{\varepsilon}_{\text{inter}}^2) = \frac{4}{n}\left[\operatorname{var}(\sigma_\beta^2) + \operatorname{var}(\sigma_\gamma^2) + \frac{\operatorname{var}(\sigma_e^2)}{m^2} + 2\operatorname{cov}(\sigma_\beta^2, \sigma_\gamma^2) + \frac{2\operatorname{cov}(\sigma_\beta^2, \sigma_e^2)}{m} + \frac{2\operatorname{cov}(\sigma_e^2, \sigma_\gamma^2)}{m}\right], \tag{5.43}
\]

and

\[
\operatorname{var}(\hat{\varepsilon}_{\text{total}}^2) = \frac{4}{n}\left[\operatorname{var}(\sigma_\beta^2) + \operatorname{var}(\sigma_\gamma^2) + \operatorname{var}(\sigma_e^2) + 2\operatorname{cov}(\sigma_\beta^2, \sigma_\gamma^2) + 2\operatorname{cov}(\sigma_\beta^2, \sigma_e^2) + 2\operatorname{cov}(\sigma_e^2, \sigma_\gamma^2)\right]. \tag{5.44}
\]

When estimating the variances of the TDIs, we use the log transformation of MSD, W_(·) = ln(ε²_(·)). The transformed variance is

\[
\operatorname{var}(W_{(\cdot)}) = \frac{\operatorname{var}(\varepsilon_{(\cdot)}^2)}{\varepsilon_{(\cdot)}^4}.
\]

Therefore, we have

\[
\operatorname{var}(\hat{W}_{\text{intra}}) = \frac{\operatorname{var}(\sigma_e^2)}{n(\sigma_e^2)^2}, \tag{5.45}
\]

\[
\operatorname{var}(\hat{W}_{\text{inter}}) = \frac{\operatorname{var}(\sigma_\beta^2) + \operatorname{var}(\sigma_\gamma^2) + \operatorname{var}(\sigma_e^2)/m^2}{n(\sigma_\beta^2 + \sigma_\gamma^2 + \sigma_e^2/m)^2} + \frac{2\left[\operatorname{cov}(\sigma_\beta^2, \sigma_\gamma^2) + \operatorname{cov}(\sigma_\beta^2, \sigma_e^2)/m + \operatorname{cov}(\sigma_e^2, \sigma_\gamma^2)/m\right]}{n(\sigma_\beta^2 + \sigma_\gamma^2 + \sigma_e^2/m)^2}, \tag{5.46}
\]

and

\[
\operatorname{var}(\hat{W}_{\text{total}}) = \frac{\operatorname{var}(\sigma_\beta^2) + \operatorname{var}(\sigma_\gamma^2) + \operatorname{var}(\sigma_e^2)}{n(\sigma_\beta^2 + \sigma_\gamma^2 + \sigma_e^2)^2} + \frac{2\left[\operatorname{cov}(\sigma_\beta^2, \sigma_\gamma^2) + \operatorname{cov}(\sigma_\beta^2, \sigma_e^2) + \operatorname{cov}(\sigma_e^2, \sigma_\gamma^2)\right]}{n(\sigma_\beta^2 + \sigma_\gamma^2 + \sigma_e^2)^2}. \tag{5.47}
\]

The variances of the TDI_intra, TDI_inter, and TDI_total estimates are given by (5.45), (5.46), and (5.47) divided by 4, respectively.
For CP indices, we use the asymptotic variance based on the transformed variable
from (5.12), (5.18), and (5.27), and we have

ı02
 !2
"2./
e ı2 var."2./ /
var.O ./ı0 / D 1 C 20 ; (5.48)
n "./ 8"2./ ı02

where "2./ is intra, inter, or total MSD.


We arrive analytically at the following variances of the estimates of CCC,
precision and accuracy coefficients:

.1  intra /2 nh i
var.Oc;intra / D var.Ointra / D var. ˛2 / C var. 2 / C 2cov. ˛2 ; 2
/
n. ˛ C  C e /
2 2 2 2

h io
Cintra
2
var. e2 /  2.1  intra /intra cov. ˛2 ; e2 / C cov. e2 ; 2 / ; (5.49)

1 n h
var.Oc;inter / D .1  inter /2 var. 2
˛/ C inter
2
var. 2
ˇ/
n. 2
˛ C 2
ˇ C 2
 C 2
e =m/
2

2 2
#
2cov. 2 2
var. e2 / ˇ; e/ 2cov. e; /
C C var. 2
/ C 2cov. 2
ˇ;
2
/ C C
m2 m m
 2 2

cov. ˛; e/
 2.1  inter /inter cov. 2
ˇ;
2
˛/ C cov. 2
˛;
2
/ C ; (5.50)
m
1 n h
var.Oc;total / D .1  total /2 var. 2
˛/ C total
2
var. 2
ˇ/ C var. 2
e/
n. 2
˛ C 2
ˇ C 2
 C 2 2
e/
i
C var. 2
/ C 2cov. 2
ˇ;
2
/ C 2cov. 2
ˇ;
2
e/ C 2cov. 2
e;
2
/
h io
 2.1  total /total cov. 2
ˇ;
2
˛/ C cov. 2
˛;
2
/ C cov. 2
˛;
2
e/ ; (5.51)
 
1 var. e2 /
var.Ointer / D .1  inter /2 var. 2
˛/ C inter
2
C var. 2
/
n. ˛2 C 2 C 2
e =m/
2 m2
#  )
2 2
2cov. e; / cov. 2
˛;
2
e/
C  2.1  inter /inter cov. 2
˛;
2
/ C ; (5.52)
m m
5.7 The Case m D 1 87

1 n h
var.Ototal / D .1  total /2 var. 2
˛/ C total
2
var. 2
e/ C var. 2
/
n. 2
˛ C 2
 C 2 2
e/
i h io
C2cov. 2
e;
2
/  2.1  total /total cov. ˛2 ; 2
/ C cov. 2
˛;
2
e/ ; (5.53)
( "
1 var. e2 /
var.O a;inter / D .1  a;inter /2 var. 2
˛/ C var. 2
/ C
n. 2
˛ C 2
ˇ C 2
 C 2
e =m/
2 m2
#
2 2
2cov. ˛2 ; 2
e/
2cov. ; e/
C C 2cov. 2
˛;
2
/ C C 2a;inter var. 2
ˇ/
m m
" 2 2
#)
cov. ˇ; e/
 2.1  a;inter /a;inter cov. 2
˛;
2
ˇ/ C C cov. 2
ˇ;
2
/ ; (5.54)
m

and

1 n h
var.O a;total / D .1  a;total /2 var. 2
˛/ C var. 2
/ C var. 2
e/
n. ˛2 C ˇ2 C 2
 C 2 2
e/
i
C2cov. 2
˛; C 2cov. ˛2 ;
2
e/
2
/ C 2cov. 2
;
2
e/ C 2a;total var. 2
ˇ/
h io
2.1  a;total /a;total cov. 2
˛;
2
ˇ/ C cov. 2
ˇ;
2
e/ C cov. 2
ˇ;
2
/ : (5.55)

When estimating the variances of the above CCC or precision coefficient


estimates for continuous data, we use the Z-transformation. Thus the transformed
var.index/
variance of an estimate of CCC or precision indices is var.Zindex / D 1.index/ 2 , with

the index being Oc;intra , Oc;inter , Oc;total , Ointer , or Ototal . When estimating the variances
of accuracy coefficients or CP estimates for continuous data, we use the logit
transformation. The transformed variance of an estimate of accuracy coefficient
var.index/
or CP is var.index/ D .index/.1index/ with the index being O a;inter , O a;total , O intra.•0 / ,
O inter.ı0 / , or O total.ı0 / . When computing confidence limits of the above agreement
indices for continuous data, we would compute the limit based on the respective
transformation, and then antitransform the limit.

5.7 The Case m D 1

For the cases with m D 1, the interaction effect between rater and subject in model
(5.1), ij , cannot be separated from the error effect. Thus the model reduces to

yij D C ˛i C ˇj C eij ; (5.56)


88 5 A Unified Model for Continuous and Categorical Data

where each term follows the same distribution as that specified in model (5.1). The
variance components are simplified to

2 X
k1 X k
2
D jj 0 I (5.57)
˛
k.k  1/ j D1 0
j Dj C1
Pk
j Dj 0 D1 jj 0
2
e D  ˛I
2
(5.58)
k
Pk1 Pk
j D1 j 0 Dj C1 . j  j0/
2
2
D : (5.59)
ˇ
k.k  1/
Accordingly, the MSD, TDI, and CP for continuous data become

"2 D 2 C 2 ˇ2 ;
2
e (5.60)
 
: 1 1  0 q
ı0 Dˆ 1 2 2
ˇ C2 2
e; (5.61)
2
and
2 0 13
: 6 B ı0 C7
ı0 D 1  2 41  ˆ @ q A5 : (5.62)
2 2
ˇ C2 2
e

For continuous and categorical data, the CCC becomes

"2
c D 1 
"2jy
i1 ;yi 2 ;:::;yi k D0

2
D ˛
2
˛ C 2
e C 2
ˇ
2
D ˛
Pk1 Pk
2
˛ C 2
e C 1
k.k1/ j D1 j 0 Dj C1 . j  j0/
2

2 Pk1 Pk
k.k1/ j D1 j 0 Dj C1 jj 0
D Pk Pk1 Pk
1
k j D1
2
j C 1
k.k1/ j D1 j 0 Dj C1 . j  j0/
2

Pk1 Pk
2 j D1 j 0 Dj C1 jj 0
D Pk Pk1 Pk : (5.63)
.k  1/ j D1
2
j C j D1 j 0 Dj C1 . j  j0/
2

The precision coefficient becomes


2
D ˛
;
2
˛ C 2
e
5.7 The Case m D 1 89

and the accuracy coefficient becomes


2
C 2
a D ˛ e
:
2
˛ C 2
ˇ C 2
e

The above equations show that each of the four variance components can be
expressed as functions of variances and pairwise covariances. Thus even though
in the proposed unified approach, we assume the homogeneity of all variances,
the CCC defined in (5.63) remains the same as the overall concordance correlation
coefficient (OCCC) proposed by Lin (1989), King and Chinchilli (2001a, 2001b),
and Barnhart, Haber, and Song (2002), where they did not assume the homogeneity
of all variances.
When we use the GEE methodology to estimate all variance components, the
estimating system of equations given in (5.34) are simplified to
0 1
yi1
B :: C
B : C
B C
B C
B yij C
B :: C
B C
B : C
Qi D B
B yi k
C;
C
B C
B Pk1 Pk C
B 1
j 0 Dj C1 .yij  yij 0 /2 C
B k.k1/ j D1 C
B Pk C
B 1
j D1 .yij  yNj /2 C
@ k A
Pk Pk
2
k.k1/ j D1 j 0 Dj C1 .yij  yNj /.yij l 0  yNj 0 /
0 1
1
B :: C
B : C
B C
B C
B j C
B :: C
B : C
B C
 DB C;
B k C
B C
B 2C
ˇ C e C
2
B
B C
B 2C
˛ C e A
2
@
2
˛
0 1
. 2
˛ C 2
e /I k 0 0 0
B 2 2
eC ˇ e
4 C
B 0 0 0 C
B k.k1/ C
Hi D B
B 2k ˛ C e C2 ˛ e
4 4 2 2
C;
C
B 0 0 0 C
@ k
A
˛ C2k e C2 ˛ e
2k.k1/ 4 4 2 2
0 0 0 k.k1/
90 5 A Unified Model for Continuous and Categorical Data

and 0 1
1k 0 0 0
@‚ B0 1 0 1C
Fi D DB
@0
C:
@. 1 ; : : : ; k ; ˇ2 ; ˛2 ; e2 / 0 1 1A
0 0 1 0
The variances of the estimates of agreement indices are
4h i
var.O"2 / D var. ˇ2 / C var. e2 / C 2cov. ˇ2 ; e2 / ; (5.64)
n
2 var. 2
ˇ/ C var. 2
e/ C 2cov. 2 2
ˇ; e /
b / D var.O" / D
var.W ; (5.65)
"4 n. 2
e C 2 2
ˇ/

 ı02
2
e  "2 ı2 var."2 /
var.O ı0 / D 1 C 02 ; (5.66)
n " 8"2 ı02
1 n
var.Oc / D .1  /2 var. 2
˛/ C 2 Œvar. 2
ˇ/ C var. 2
e/
n. ˛2 C ˇ2 C e2 /2
o
C2cov. 2 2
ˇ ; e /  2.1  /Œcov. 2 2
ˇ; ˛/ C cov. 2 2
˛ ; e / ; (5.67)

1 
O D
var./ .1  /2 var. 2
˛/ C 2 var. 2
e/  2.1  /cov. 2 2
˛; e / ;
n. 2
˛ C 2 2
e/
(5.68)
and
1 n
var.O a / D .1  a /2 Œvar. 2
˛/ C var. 2
e/ C 2cov. 2 2
˛ ; e /
n. 2
˛ C 2
ˇ C 2 2
e/
o
C2a var. 2
ˇ/  2a .1  a /Œcov. 2 2
ˇ; ˛/ C cov. 2 2
ˇ ; e / : (5.69)

Again, we use the respective transformations for statistical inference for continuous
data.

5.7.1 Other Estimation and Statistical Inference Approaches

Estimations of agreement indices shown in this chapter are based on the GEE
methodology. We have discussed approaches by Barnhart, Song, and Haber (2005)
throughout this chapter. Carrasco and Jover (2003) proposed another important
method using the maximum likelihood (ML) or restricted maximum likelihood
(RML) method based on a mixed effect model. In Section 2.11.3 we presented this
model for k D 2, which is also applicable for k > 2. For normally distributed data,
5.7 The Case m D 1 91

the estimates from Lin, Hedayat, and Wu (2007) have been shown to be very close to
the estimates obtained in Carrasco and Jover (2003). However, the unified approach
that we proposed has two advantages: First, since we use the GEE methodology
to estimate all variance components and means, our method can handle not only
normally distributed data, but also data from the exponential family including the
binomial and multinomial distributions. Second, our approach is expected to be
robust against a moderate deviation from normally distributed data.
King and Chinchilli (2001a) proposed one important method of estimations and
statistical inferences for the generalized CCC for continuous and categorical data
based on the U-statistic. The case k D 2 was presented in Section 2.11.1. We will
present the case k > 2 below.
Let q D 1; 2; : : : ; k  1 and r D 2; : : : ; k index the pairwise combinations of the
k raters. Then the generalized CCC can be expressed as

Ng D
P  P 
qr EFXq FXr g.Xq  Xr /  g.Xq C Xr /  qr EFXq Xr g.Xq  Xr /  g.Xq C Xr /
q<r q<r
  ;
P 1P
qr EFXq FXr g.Xq  Xr /  g.Xq C Xr / C qr EFXq Xr g.2Xq / C g.2Xr /
q<r 2 q<r

where g./ is a distance function, including robust distance functions, defined in


King and Chinchilli (2001a, 2001b), and FXq and FXr are the cumulative density
functions (CDFs) of Xq and Xr . When g.x/ D x 2 , Ng becomes the CCC defined by
Lin (1989).
Let U qr D .U1qr ; U2qr ; U3qr /0 be the U-statistic as defined in Section 2.11.1.
To construct U-estimators of the CCC with independent multivariate samples
.X11 ; X21 ; : : : ; Xk1 /,: : : , .X1n ; X2n ; : : : ; Xk n / of size n, King and Chinchilli (2001a)
substitute FXq , FXr , and FXq Xr by their respective empirical CDFs. For a general
function g, this results in the following estimator of Ng :

P
.n  1/ qr .U3qr  U1qr /
q<r
ONg D P P P :
qr U1qr C n qr U2qr C .n  1/ qr U3qr
q<r q<r q<r

Since the sum of U-statistics is a U-statistic, the above notation can be simplified as

.n  1/.U3s  U1s /
ONg D ;
U1s C nU2s C .n  1/U3s
92 5 A Unified Model for Continuous and Categorical Data

P
where Uls D qr Ulqr , l D 1; 2; 3, the sum over all distinct pairs. The same
q<r
approach as found in Section 2.11.1 can be applied by replacing U1 , U2 , and U3 with
U1s , U2s , and U3s , respectively. Then ONg has an asymptotically normal distribution
with mean Ng and a variance that can be consistently estimated with
 
O : O 2 var.Hs / 2cov.Hs ; Gs / var.Gs /
var.Ng / D .Ng /  C :
Hs2 Hs Gs Gs2

As before, the normal approximation of the extended robust estimators of the


concordance correlation coefficient can be improved through the use of the
Z-transformation.

5.7.2 Variances of CCC and Weighted Kappa for k D 2

For ordinal and binary data, when k D 2 and m D 1, the above GEE estimates of
CCC reduce to kappa (Cohen 1960) and weighted kappa (Cohen 1968) with square
distance function (Robieson 1999; King and Chinchilli 2001a, 2001b; Barnhart,
Haber, and Song 2002). In addition, its variances reduce to the variances of kappa
and weighted kappa (Wu 2005). The proof of such equivalence is shown below.
When k D 2 and m D 1 with t ordinal outcomes, Fleiss, Cohen, and Everitt
(1969) introduced the asymptotic variance for the estimated weighted kappa, kOw , as
0 1
1 Xt X t
var.kOw / D @ Aij  B A ; (5.70)
n.1  kw /4 i D1 j D1

where

Aij D ij Œwij .1  …c /  .wN i  C wN j /.1  …o /2 ; (5.71)


X
t
wN i  D wij j ; (5.72)
j D1

X
t
N j D
w wij i  ; (5.73)
i D1

and
B D .…o …c  2…c C …o /2 : (5.74)
5.7 The Case m D 1 93

In order to compare the variance of the weighted kappa to that of the CCC, we let

X
t

l0 D i l i  ; l D 1; 2; 3; 4;
i D1

X
t

0l D i jl ; l D 1; 2; 3; 4;
i D1

X
t X
t
0
l l0 D i l j l ij ; l D 1; 2; 3 and l 0 D 1; 2; 3
i D1 j D1

Based on the above notation, we obtain

1
1  ˘o D . 20 2 11 C 02 / (5.75)
.t  1/2

and
1
1  ˘c D . 20 2 01 10 C 02 /: (5.76)
.t  1/2
For the square weighted function wij , where

.i  j /2
wij D 1  ; i; j D 1; 2; : : : ; t;
.t  1/2

we have

X
t
wN i  D wij j
j D1

1 X t
D 1 .i  j /2 j
.t  1/2 j D1

1
D 1 .i 2  2i 01 C 02 /; (5.77)
.t  1/2

and

X
t
wN j D wij i 
i D1

1 X t
D 1 .i  j /2 i 
.t  1/2 i D1
1
D 1 .j 2  2j 10 C 20 /: (5.78)
.t  1/2
94 5 A Unified Model for Continuous and Categorical Data

By substituting all those terms into (5.70), we obtain


0 1
1 Xt X t
var.kOw / D @ Aij  B A
n.1  c /4 i D1 j D1

A1 C A2  A3  B1
D ; (5.79)
n
where

.t  1/4 C 40 C6 22 C 044 31 4 13  2.t  1/2 . 20 C 02 2 11 /


A1 D ; (5.80)
20 C 02  2
. 2
10 01 /

4.t  1/4 . 20 C 20 2 11 /
2
C . 40 C 04 C 3. 20 C 2
02 / /. 20 C 02 2 11 /
2
A2 D
. 20 C 02  2 10 01 /4
.4 2
10 02 C4 2
01 20 4 10 03 4 01 C 2 22  4
30 10 21 /. 20 C 02 2 11 /
2
C
. 20 C 02  2 10 01 /4
.4 01 12 C8 10 01 11  8 10 01 02  8 10 20 01 /. 20 C 02 2 11 /
2
C
. 20 C 02  2 10 01 /
4

8.t  1/2 . 20 C 02  2 11 /2 . 20 C 02 2 10 01 /
C ; (5.81)
. 20 C 02  2 10 01 /4
. 20 C 02 2 11 / 
A3 D 2.t  1/4  2.t  1/2 . 20 C 02 2 11 /
. 20 C 02  2 10 01 /3
2.t  1/2 . 20 C 02 2 10 01 / C 40 C 04 2 10 03 2 01 30 2 31 2 13

C. 20 C 02 /
2
C2 22 2 10 21 2 01 12 C4 01 21

C4 10 12 2 20 11 2 02 11  (5.82)

and

.t  1/8
B1 D
. 20 C 02  2 10 01 /
4


Œ.t  1/2  . 20 C 02 2  1/2  .
11 /Œ.t 20 C 02 2 10 01 /

.t  1/4
2
2Œ.t  1/2  . 20C 02  2 10 01 / Œ.t  1/2  . 20 C 02 2 11 /
 C : (5.83)
.t  1/2 .t  1/2

Using the system of equations as specified in (5.34) when m D 1, the variance of


CCC can be expressed as

4
var.Oc / D .A C B C C C D  E  F /; (5.84)
n. 20 C 02 2 10 01 /
4
5.8 Summary of Simulation Results 95

where

AD. 2
01 20 C2 10 01 11 C 2
10 02 4 2
10
2
01 /. 20 C 02 2 11 /
2
; (5.85)
BD. 40 C2 22 C 04 . 20 C 02 /
2
/. 11  10 01 /
2
; (5.86)
C D. 22  2
11 /. 20 C 02 2 10 01 /
2
; (5.87)
D D 2. 11  10 01 / Œ 01 . 30  10 20 / C 10 . 03  01 02 /

C 01 . 12  10 02 / C 10 . 21  01 20 / . 20 C 02 2 11 /; (5.88)
E D 2. 20 C 02 2 11 /. 20 C 02 2 10 01 /

. 01 21 2 10 01 11 C 10 12 /; (5.89)

and

F D 2. 31 C 13  11 20  11 02 /. 11  10 01 /. 20 C 02 2 10 01 /: (5.90)

Further simplifications yield

var.Oc / D var.kOw /: (5.91)

We have shown that when k D 2 and m D 1 and when the data are ordinal, the CCC
without the Z-transformation is exactly the same as the weighted kappa with the
squared weight function, both in estimation and statistical inference. For binary data,
the weighted kappa reduces to kappa. Therefore, our approach can naturally extend
the kappa and weighted kappa for k > 2 and m > 1. In addition, our approach
provides precision and accuracy coefficients for categorical data.

5.8 Summary of Simulation Results

In order to evaluate the performance of the GEE methodology for estimation and
inference of the proposed indices and to compare the proposed indices against other
existing methods, simulation studies were designed and conducted for different
types of data: binary, ordinal, and normal. For each of the three types of data, we
considered three cases: k D 2 and m D 1, k D 4 and m D 1, and k D 2 and m D 3.
For each case, we generated 1000 random samples of size 20 each. For binary and
ordinal data, we considered two situations: inferences obtained through transfor-
mations (Z-transformations for CCC and precision indices, logit transformation
for accuracy indices) and inferences obtained without transformations. For normal
data, we considered only inferences obtained through transformations. In addition
to the above transformation, we considered logit transformation for CP and log
transformation for TDI for normal data.
96 5 A Unified Model for Continuous and Categorical Data

For binary data, our estimates are very close to their corresponding theoretical
values, and the means of the estimated standard deviations are very close to the
corresponding standard deviations of the estimates. Therefore, our estimates are
sufficiently good for binary data with or without transformation. When m D 1, we
also compared our CCC estimates to that obtained from the method by Carrasco
and Jover (2003). Our standard error estimates are superior to the estimates by
Carrasco and Jover (2003) regardless of whether a transformation is used. The
estimates obtained with transformation were comparable to the estimates obtained
without transformation. Therefore, we suggest that for binary data, the use of a
transformation is acceptable but not necessary.
For ordinal data, the means of the estimates are very close to the theoretical
values, and the means of the estimated standard errors are very close to the
corresponding standard deviations of the estimates. Similar to binary data, we also
calculated the CCC estimates from the method by Carrasco and Jover (2003) for the
cases with m D 1. The estimates from the two methods are very close to each other
regardless of whether transformation is used. Therefore, we conclude that for ordinal
data, the use of a transformation is acceptable but not necessary. Surprisingly, the
method of Carrasco and Jover (2003) performs as well as ours for ordinal data, even
though their model assumes normality.
For normal data, our estimates resemble the respective theoretical values very
well. The means of the estimated standard error are very close to the corresponding
standard deviations of the estimates. For the cases with m D 1, our CCCs are very
close to that obtained from the method of Carrasco and Jover (2003). For the detailed
simulation results, see Lin, Hedayat, and Wu (2007). Based on the simulation
results, we conclude that our method works well for binary data, ordinal data, and
normal data, both in estimates as well as in corresponding statistical inferences.

5.9 Examples

5.9.1 Example 1: Methods Comparison

Dispirin crosslinked hemoglobin (DCLHb) is a solution containing oxygen-carrying


hemoglobin. The solution was created as a blood substitute to treat trauma patients
and to replace blood loss during surgery. Measurements of DCLHb in patient’s
serum after infusion are routinely performed using a Sigma method. A method of
measuring hemoglobin called the HemoCue photometer was modified to reproduce
the Sigma instrument DCLHb results. To validate this modified method, serum
samples from 299 patients over the analytical range of 50–2,000 mg/dL were
collected. DCLHb values of each sample were measured simultaneously with the
HemoCue and Sigma methods, andeach sample was measured twice by each of
5.9 Examples 97

Fig. 5.1 HemoCue method measurement 1 vs. measurement 2

the two methods. This example has been given by Lin, Hedayat, Sinha, and Yang
(2002) and Lin (2003), where the averages of the replicated readings were used, and
is presented in Example 2.8.1.
Figures 5.1–5.3 plot the data for this example for the HemoCue method, mea-
surement 1 vs. measurement 2; Sigma method, measurement 1 vs. measurement 2;
and the average of the HemoCue method vs. the average of the Sigma method. The
plots indicate that the errors are rather constant across the data range. Therefore, no
log transformation was applied to the data.
In terms of TDI and CP indices, the least acceptable agreement is defined as
having at least 90% of pair observations over the entire range within 75 mg/dL
of each other if the observations are from the same method, and within 150 mg/dL
of each other if the observations are from different methods based on the average of
each method. In terms of CCC indices, the least acceptable agreement is defined
as a within-sample total deviation of not more than 7.5% of the total deviation if
observations are from the same method, and a within-sample total deviation of not
more than 15% of the total deviation if observations are from different methods.
These translate into a least-acceptable CCCintra of 0:9943 D .1  0:0752 /, and a
least-acceptable CCCinter of 0:9775 D .1  0:152 /.
The agreement statistics and their corresponding one-sided 95% lower or upper
confidence limits are presented in Table 5.1. The CCCintra estimate is 0.9986,
which means for the observationsp from the same method that the within-sample
deviation is about 3:7% D 1  0:9986 of the total deviation. The 95% lower
confidence limit for CCCintra is 0.9983, which is greater than 0.9943. The CCCinter
estimate is 0.9866, which means for the average observations p from different
methods, the within-sample total deviation is about 11:6% D 1  0:9866 of the
98 5 A Unified Model for Continuous and Categorical Data

Fig. 5.2 Sigma method measurement 1 vs. measurement 2

Fig. 5.3 HemoCue method’s average measurement vs. Sigma method’s average measurement

total deviation. The 95% lower confidence limit for CCCinter is 0.9825, which is
greater than 0.9775. The precisionintra estimate is 0.9986, with a one-sided lower
confidence limit 0.9983. The precisioninter estimate is 0.9866 with a one-sided lower
confidence limit 0.9825, and the accuracyinter estimate is 1.0000 with one-sided
lower confidence limit 0.9987. The CCCtotal estimate is 0.9859, which means for
individual observations from different methods, the within-sample total deviation is
5.9 Examples 99

Table 5.1 Agreement statistics and their confidence limits for Example 5.9.1
Precision Accuracy
Type Statistics CCC coefficient coefficient TDI0:9 CPTDI a a RBSb
Intra Estimate 0.9986 0.9986 . 41.1 0.9973 .
95% Conf. limit 0.9983 0.9983 . 46.2 0.9949 .
Allowance 0.9943 0.9943 . 75 0.9000 .
Inter Estimate 0.9866 0.9866 1 127.3 0.9474 0
95% Conf. limit 0.9825 0.9825 0.9987 145.9 0.9228 .
Allowance 0.9775 . . 150 0.9000 .
Total Estimate 0.9859 0.9860 1 130.5 0.9412 0
95% Conf. limit 0.9818 0.9818 0.9987 148.9 0.9160 .
Allowance 0.9775 . . 150 0.9000 .
For k D 2, n D 299, and m D 2.
a
This is the CP given the TDI allowances of 75 mg/dL or 150 mg/dL.
b
The relative bias squared (RBS) must be less than 1 or 8 for the CP criterion of 0.9 or 0.8,
respectively, in order for the approximated TDI and CP to be valid.

about 11.87% of the total deviation. The 95% lower confidence limit for CCCtotal is
0.9818. The precisiontotal estimate is 0.9860 with a one-sided lower confidence limit
0.9818, and the accuracytotal estimate is 1.0000 with one-sided lower confidence
limit 0.9987.
The TDIintra.0:9/ estimate is 41.1 mg/dL, which means that 90% of the readings
are within 41.1 mg/dL of their replicate readings from the same method. The one-
sided upper confidence limit for TDIintra.0:9/ is 46.2 mg/dL, which is less than
75 mg/dL. The TDIinter.0:9/ estimate is 127.3 mg/dL, which means that based on
the average readings, 90% of the HemoCue readings are within 127.3 mg/dL of
the Sigma readings. The one-sided upper confidence limit for TDIinter.0:9/ is 145.9
mg/dL, which is slightly less than 150 mg/dL. The TDItotal.0:9/ estimate is 130.5
mg/dL, with the one-sided upper confidence limit 148.9 mg/dL, which is slightly
less than 150 mg/dL as well.
Finally, the CPintra.75/ estimate is 0.9973, which means that 99.7% of HemoCue
observations are within 75 mg/dL of their duplicate values from the same method.
The one-sided lower confidence limit for CPintra.75/ is 0.9949, which is larger than
0.9. The CPinter.150/ estimate is 0.9474, which means that 94.7% of HemoCue
readings are within 150 mg/dL of the Sigma readings based on the average of
each method. The one-sided lower confidence limit for CPinter.150/ is 0.9228, which
is larger than 0.9. The CPtotal.150/ estimate is 0.9412, which means that 94% of
HemoCue observations are within 150 mg/dL of Sigma observations based on
individual readings. The one-sided lower confidence limit for CPtotal.150/ is 0.9160.
The agreement between the HemoCue method and the Sigma method is ac-
ceptable with excellent accuracy and adequate precision and with accuracy slightly
better than precision.
100 5 A Unified Model for Continuous and Categorical Data

Table 5.2 Lab 1 frequency Negative Positive Highly positive


table of first reading (row) vs.
second reading (column) Negative 6 1 0
Positive 0 49 0
Highly positive 0 0 8

Table 5.3 Lab 2 frequency Negative Positive Highly positive


table of first reading (row) vs.
second reading (column) Negative 2 0 0
Positive 0 22 2
Highly positive 0 5 33

Table 5.4 Lab 1 first reading Negative Positive Highly positive


(row) vs. lab 2 first reading
(column) Negative 2 5 0
Positive 0 19 30
Highly positive 0 0 8

Table 5.5 Lab 1 second Negative Positive Highly positive


reading (row) vs. lab 2 second
reading (column) Negative 2 4 0
Positive 0 23 27
Highly positive 0 0 8

5.9.2 Example 2: Assay Validation

This example can be seen in Lin, Hedayat, and Wu (2007). In this example, we
consider the hemagglutinin inhibition (HAI) assay for antibody to influenza A
(H3N2) in rabbit serum samples from two different labs. Serum samples from 64
rabbits were measured twice by each method. Antibody level was classified as
negative, positive, or highly positive (too numerous to count).
Tables 5.2–5.5 present the frequency tables for within-lab and between-lab
readings. Tables 5.2 and 5.3 present the frequency tables of the first reading vs.
the second reading from each lab. Table 5.4 presents the frequency table of the first
reading from one lab vs. the first reading from the other lab. Table 5.5 presents the
frequency table of the second reading from one lab vs. the second reading from
the other lab. Those tables suggest that the within-lab agreement is good but the
between lab agreement is not, and lab 2 tends to report higher ratings than lab 1.
This is an imprecise assay with ordinal responses, and therefore we allow for
less-demanding agreement criteria. In terms of CCC indices, agreement was defined
as a within-sample total deviation of not more than 50% of the total deviation if
observations are from the same method, and a within-sample total deviation of not
more than 75% of the total deviation if observations are from different methods.
This translates into a least-acceptable CCCintra of 0:75 D .1  0:52 /, and a least
acceptable CCCinter of 0:4375 D .1  0:752 /.
The estimates of agreement statistics and their corresponding one-sided 95%
lower confidence limits are presented in Table 5.6. The CCCintra is estimated to
5.9 Examples 101

Table 5.6 Agreement statistics and their confidence limits for Example 5.9.2
Type Statistics CCC Precision coefficient Accuracy coefficient
Intra Estimate 0.8836 0.8836 .
95% Conf. limit 0.8109 0.8109 .
Allowance 0.7500 0.7500 .
Inter Estimate 0.3723 0.5679 0.6554
95% Conf. limit 0.2448 0.4571 0.5383
Allowance 0.4375 . .
Total Estimate 0.3578 0.5349 0.6688
95% Conf. limit 0.2335 0.4216 0.5570
Allowance . . .
For k D 2, n D 64, and m D 2.

be 0.8836, which means that for observations


p from the same method, the within-
sample deviation is about 34:1% D 1  0:8836 of the total deviation. The 95%
lower confidence limit for CCCintra is 0.8109, which is larger than 0.7500. The
CCCinter is estimated to be 0.3723, which means that for the average observations
from different methods, the within-sample deviation is about 79.2% of the total
deviation. The 95% lower confidence limit for CCCinter is 0.2448, which is less
than 0.4375. The precisioninter is estimated to be 0.5679 with a one-sided lower
confidence limit 0.4571, and the accuracyinter is estimated to be 0.6554 with a one-
sided lower confidence limit 0.5383. The CCCtotal is estimated to be 0.3578, which
means that for individual observations from different methods, the within-sample
deviation is about 80.1% of the total deviation. The 95% lower confidence limit
for CCCtotal is 0.2335. The precisiontotal is estimated to be 0.5349 with a one-sided
lower confidence limit 0.4216, and the accuracytotal is estimated to be 0.6688 with a
one-sided lower confidence limit 0.5570.
Overall, the agreement between the two labs’ readings is not acceptable, while
the within-lab agreement is much better than the interlab agreement. The agreement
within each lab, if of interest, can be obtained by applying kappa or weighted kappa
to each lab separately.

5.9.3 Example 3: Nasal Bone Image Assessment


by Ultrasound Scan

This example was obtained from Professor Philip Schluter, head of research, School
of Public Health and Psychosocial Studies at AUT University, New Zealand. The
leader of the project is Dr. Andrew McLennan, Sydney Ultrasound for Women,
Sydney, Australia, and Royal North Shore Hospital, Sydney, Australia. This exam-
ple will be published separately by McLennan and colleagues.
Several recent studies have demonstrated that the nasal bone (NB) is sonograph-
ically “absent” in a large proportion of fetuses affected by Down syndrome at
102 5 A Unified Model for Continuous and Categorical Data

Table 5.7 Examiner 1: Absent Present


reading 1 (row) versus
reading 2 (column) Absent 316 14
Present 23 47

Table 5.8 Examiner 2: Absent Present


reading 1 (row) versus
reading 2 (column) Absent 318 9
Present 19 54

Table 5.9 Examiner 3: Absent Present


reading 1 (row) versus
reading 2 (column) Absent 285 9
Present 19 72

Table 5.10 Reading 1: Absent Present


examiner 1 (row) versus
examiner 2 (column) Absent 300 30
Present 27 43

Table 5.11 Reading 1: Absent Present


examiner 1 (row) versus
examiner 3 (column) Absent 274 56
Present 20 50

Table 5.12 Reading 1: Absent Present


examiner 2 (row) versus
examiner 3 (column) Absent 277 50
Present 17 56

11–13 weeks gestation. The purpose of this study was to demonstrate that the nasal
bone can be accurately assessed and used for population screening in an Australian
obstetric population at 11–13 weeks gestation.
There were 20 operators (accredited and experienced in nuchal translucency
imaging) who supplied 20 NB images. The images were assessed for the presence
(1) and absence (0) of NB. Three examiners assessed each of the 400 images
twice (the repeat assessment separated by at least 24 h), giving a total of 2,400
assessments. Tables 5.7–5.12 present the frequency tables among three examiners
and their duplicate readings. It appears that within-examiner has slightly better
agreement than between-examiner, as expected, and there is little difference among
the marginal distributions of three examiners (good accuracy).
This is an imprecise assay with binary responses, and therefore we allow for less-
demanding agreement criteria. In terms of CCC indices, agreement was defined
as a within-sample total deviation of no more than 60% of the total deviation if
observations are from the same method based on the average of duplicate readings,
and a within-sample total deviation of no more than 70% of the total deviation
if observations are from different methods. This translates into a least-acceptable
CCCintra of 0:64 D .1  0:62 /, and a least acceptable CCCinter of 0:51 D .1  0:72 /.
5.9 Examples 103

Table 5.13 Agreement statistics and their confidence limits for Example 5.9.3
Type Statistics CCC Precision coefficient Accuracy coefficient
Intra Estimate 0.7047 0.7047 .
95% Conf. limit 0.6558 0.6558 .
Allowance 0.6400 0.6400 .
Inter Estimate 0.6369 0.6442 0.9886
95% Conf. limit 0.5779 0.5867 0.9811
Allowance 0.5100 . .
Total Estimate 0.5438 0.5491 0.9902
95% Conf. limit 0.4852 0.4913 0.9839
Allowance . . .
For K D 3, n D 400, and m D 2.

Table 5.13 presents the estimates of agreement statistics and their corresponding
one-sided 95% lower confidence limits. The CCCintra is estimated to be 0.7047,
which means that for observations
p from the same method, the within-sample
deviation is about 54:3% D 1  0:7047 of the total deviation. The 95% lower
confidence limit for CCCintra is 0.6558, which is better than 0.64. The CCCinter
is estimated to be 0.6369, which means that for the average observations from
different methods, the within-sample deviation is about 60.3% of the total deviation.
The 95% lower confidence limit for CCCinter is 0.5779, which is better than 0.51.
The precisioninter is estimated to be 0.6442 with a one-sided lower confidence limit
0.5867, and the accuracyinter is estimated to be 0.9886 with a one-sided lower
confidence limit 0.9811. The CCCtotal is estimated to be 0.5438, which means that
for individual observations from different methods, the within-sample deviation is
about 67.5% of the total deviation. The 95% lower confidence limit for CCCtotal
is 0.4852. The precisiontotal is estimated to be 0.5491 with a one-sided lower
confidence limit 0.4913, and the accuracytotal is estimated to be 0.9902 with a
one-sided lower confidence limit 0.9839. Overall, the agreements among three
examiners readings and within examiners are marginally acceptable, with very good
accuracy, and most disagreements are from imprecision rather than inaccuracy.

5.9.4 Example 4: Accuracy and Precision of an Automatic


Blood Pressure Meter

This example is obtained from Table 1 of Bland and Altman (1999), where a set
of systolic blood pressure data from a study in which simultaneous measurements
were made by each of two experienced observers (denoted by J and R) using a
sphygmomanometer (gold standard) and by a semiautomatic blood pressure monitor
(denoted by S ). Three sets of readings were made in quick succession for each
method (J1 –J3 , R1 –R3 , S1 –S3 ). The purpose of the study was to evaluate whether
the semiautomatic blood pressure monitor can replace the blood pressure apparatus
104 5 A Unified Model for Continuous and Categorical Data

Fig. 5.4 Agreement between J and R based on their mean triplicate readings in log scale

routinely used in a typical medical center by an experienced nurse or a doctor. In


their paper, the authors evaluated the agreement of only the first measurement by
observer J and the machine (i.e., J1 and S1 ). This data set has k D 3 and m D 3.
Because readings by J and R were almost identical based on mean of triplicate
readings (see Fig. 5.4 in log scale), we analyzed this data set between J and S , or
k D 2 and m D 3. Here we did not have any solid allowances (criteria) prespecified,
and therefore, the allowances given are somewhat arbitrary.
First, we examine the distribution characteristic and the error structure to see
whether we should assume proportional error when assessing agreement. We
examine the S method because it is the most imprecise method. Figure 5.5 presents
the plot of S2 (circle) or S3 (square) versus S1 in the original scale. We can see
that each marginal distribution is skewed to the right, and the error increases when
the reading increases. Therefore, we proceed with assuming the proportional error
assumption. In Figs. 5.6–5.9, we use the log scale with each tick mark multiplied by
1.2 from the previous tick mark.
Figure 5.6 presents the within-J agreement (precision) plot of J2 (circle) or J3
(square) versus J1 . Figure 5.7 presents the within-S agreement (precision) plot of
S2 (circle) or S3 (square) versus S1 . We can see that the within-triplicate readings
of J are more precise than those of S . Figure 5.8 presents the agreement plot
of S1 versus J1 reflecting total agreement among individual readings. Figure 5.9
presents the agreement plot of S versus J based on the mean of triplicate readings
reflecting interagreement. Figures 5.8 and 5.9 appear quite similar, indicating that
the total agreement and interagreement are similar, while the within method had
better agreement (precision), as expected. More importantly, the readings of the S
method are neither accurate nor precise when compared to readings of the J method.
Table 5.14 presents the estimates of agreement statistics with 95% confidence
limits. The 95% upper limit of within-method TDI%0:9 is 15.5%, meaning that we
are 95% confident that 90% of the within-method individual readings do not deviate
5.9 Examples 105

Fig. 5.5 Within-S meter agreement, S2 =S3 vs. S1 in original scale (ı D S2 ,  D S3 )

Fig. 5.6 Within-J meter agreement, J2 =J3 vs. J1 (ı D J2 ,  D J3 ) in log scale

Fig. 5.7 Within-S meter agreement, S2 =S3 vs. S1 (ı D S2 ,  D S3 ) in log scale


106 5 A Unified Model for Continuous and Categorical Data

Fig. 5.8 Agreement between S1 and J1 (reflecting total agreement) in log scale

Fig. 5.9 Agreement between S and J based on their mean triplicate readings (reflecting intera-
greement) in log scale

more than 15.5%. The precision coefficient is estimated to be 0.9383. In contrast,


the 95% upper limit of total-method TDI%0:9 is 43.5%, meaning that we are 95%
confident that 90% of the two methods’ individual triplicate readings do not deviate
more than 43.5%, which is about the same as the inter method (41.3%) as seen in the
figures. The total accuracy and precision coefficients are estimated to be 0.7974 and
0.8767, which are much less precise than the within-method precision coefficient of
0.9383.
We can conclude that this evaluated semiautomatic blood pressure monitor is
neither precise enough nor accurate enough to replace the sphygmomanometer for
measuring systolic blood pressure. The individual semiautomatic readings from the
same patient can deviate up to 43.5% from sphygmomanometer readings measured
by a nurse or doctor. Given the normal systolic pressure of 120 mmHg, the 43.5%
5.10 Discussion 107

Table 5.14 Agreement statistics and their confidence limits for Example 5.9.4
Precision Accuracy
Type Statistics CCC coefficient coefficient TDI0:9 CPTDIa RBSa
Intra Estimate 0.9383 0.9383 . 13.78 0.9798 .
95% Conf. limit 0.9166 0.9166 . 15.46 0.9701 .
Allowance 0.9000 0.9000 . 20 0.9000 .
Inter Estimate 0.7253 0.8316 0.8721 33.05 0.8014 0.87
95% Conf. limit 0.6044 0.7327 0.8132 41.34 0.7232 .
Allowance 0.8 . . 25 0.9000 .
Total Estimate 0.6991 0.7974 0.8767 35.58 0.8438 0.69
95% Conf. limit 0.5822 0.7015 0.8203 43.51 0.7831 .
Allowance 0.7000 . . 30 0.9000 .
For K D 2, n D 85, and m D 3.
a
The relative bias squared (RBS) must be less than 1 or 8 for CPa of 0.9 or 0.8, respectively,
in order for the approximated TDI and CP to be valid.

deviation means that the result of the semiautomatic instrument can deviate by up
to 52.2 mmHg with 95% confidence.
It is clear that the within-J precision (Fig. 5.6) is much better than the within-
S precision (Fig. 5.7). To measure the agreement by each method, we can simply
perform agreement assessment among the triplicate readings one method at a time
with k D 3 and m D 1. However, this does not tell us whether within-J or within-R
precision is significantly better than that of within-S . We will revisit this scenario in
Chapter 6.

5.10 Discussion

We have proposed a series of indices for assessing agreement, precision, and


accuracy for multiple raters each with multiple readings. Those indices can be used
to assess intrarater, interrater, and total-rater agreement for both continuous and
categorical data. These indices are summarized in Table 5.15. All those indices are
expressed as functions of variance components through a two-way mixed model,
and GEE methodology combined with the delta method is used to estimate all
indices and perform their related statistical inferences. For sample size and power
calculations of an agreement index, the reader is referred to the general procedures
outlined in Section 4.1.

5.10.1 Relative or Scaled Indices

Each of the approaches in the previous chapters for assessing agreement becomes
one of the special cases of our approach. For continuous data, when m approaches
108 5 A Unified Model for Continuous and Categorical Data

Table 5.15 Summary of agreement indices based on functions of variance components


Statistics Intra Inter Total mD1
2
˛ C 2

2
˛
2
˛
2
˛
CCC 2 C 2 C 2 2 2 C 2 C 2 C 2 2 C 2
C 2
˛  e 2
˛ C 2
 C e
C 2
ˇ
˛  e ˇ ˛ ˇ e
m
2
˛ C 2

2
˛
2
˛
2
˛
Precision 2 C 2 C 2 2 2 C 2 C 2 2 C 2
˛  e 2
˛ C 2
 C e ˛  e ˛ e
m 2
2
˛ C 2
 C e
2
C 2
C 2 2
C 2
Accuracy NA m ˛  e ˛ e
2
e
2 C 2 C 2 C 2 2 C 2
C 2
2
˛ C 2
 C C 2
ˇ
˛  e ˇ ˛ ˇ e
m 2
e
MSD 2 2
e 2 2
ˇ C2 2
 C2 2 2
ˇ C2 2
 C2 2
e 2 2
e C2 2
ˇ
m
p p p p
TDI0 a Q M SDI nt ra Q M SDinter Q M SDtotal Q M SD

       
ı2 ı2 ı2 ı2
CPı0 b 2 ;1 2 ;1 2 ;1 2 ;1
M SDI nt ra M SDinter M SDtotal M SD
a
Q D ˆ1 .1  1
2
/ is the inverse cumulative normal distribution.
2
b 2
 . MıSD ; 1/ is a central chi-square distribution with one degree of freedom.

infinity, the proposed CCCinter reduces to that proposed by Barnhart, Song, and
Haber (2005). When m D 1, the proposed CCC reduces to the CCC proposed by
Carrasco and Jover (2003), which is the same as the OCCC proposed by Lin (1989),
King and Chinchilli (2001a), and Barnhart, Haber, and Song (2002). Barnhart,
Haber, and Song (2002) pointed out that OCCC is actually a weighted average of
pairwise CCC values. When k D 2 and m D 1, the proposed CCC reduces to
the original CCC proposed by Lin (1989). For categorical data, when k D 2 and
m D 1, the proposed CCC reduces to the kappa for binary data and weighted kappa
with squared weight for ordinal data, in both estimates and statistical inferences.
In addition, we decomposed the CCC into precision and accuracy components
for a deeper understanding of the sources of the disagreement. The concept of
accuracy and precision can also be applied to categorical data. For continuous data,
the relative or scaled indices are heavily dependent on the total variability (total data
range). Therefore, these indices are not comparable if the ranges of the data are not
comparable. The same is true for categorical data when we have data that are heavily
clustered into a single cell, for example, when evaluating agreement based on low
prevalence rate.
5.10 Discussion 109

5.10.2 Absolute or Unscaled Indices

We also have proposed absolute indices, MSD, TDI, and CP, which are independent
of the total data range. These absolute indices are easily comprehensible. However,
these absolute indices are valid only when the relative bias squared is small enough
(Lin 2000, 2003; Lin, Hedayat, Sinha, and Yang 2002) and the normality is assumed.

5.10.3 Covariate Adjustment

We refer the reader to Section 2.12.7, most of which is applicable to this chapter as
well. Subject-based covariates can conveniently be adjusted using the model

Yij l  Xi  D C ˛i C ˇj C ij C eij l ; (5.92)

where Xi D .xi1 ; xi 2 ; : : : ; xip /0 represents p covariates for the subject or sample


i without the intercept term, and  D .1 ; 2 ; : : : ; p /0 represents slopes of the p
covariates.
A simple and reasonable approach is to perform linear regression for each rater
j and replicate l, then use the intercept estimate plus the residual of each rater
j and replicate l as the adjusted dependent variable, Zij l D Yij l  Xi . We can
then proceed to perform the estimations and statistical inferences of the agreement
indices based on model (5.1) using the adjusted dependent variables. A formal and
more efficient way is to include the Xi  term in model (5.1), and solve for the GEE
estimates and their variance–covariance matrix iteratively.

5.10.4 Future Research Topics and Related Publications

There are two aspects of this unified approach that can be extended and developed.
First, for categorical and nonnormal continuous data, we may include the link
functions, such as log and logit, in the GEE methodology. We expect the approach
with a link function would become more robust to different types of data. Second,
current variance component functions are based on balanced data. Therefore, we
would have to delete samples or subjects with missing data. Approaches that can
handle missing data should be an interesting area of research.
There are relatively few references available related to this chapter in medical
and pharmaceutical publications other than those mentioned in the first paragraph
of this chapter. See Barnhart, Haber, and Lin (2007) for an overview on assessing
agreement with continuous measurements. Chen and Barnhart (2008) compared
ICC and CCC for assessing agreement for data without and with replications. Haber
and Barnhart (2008) proposed a general approach to evaluating agreement between
110 5 A Unified Model for Continuous and Categorical Data

two raters or methods of measurement from quantitative data with replicated


measurements. There are many references in the social and psychological sciences
for ICC-type indices. Some of these indices are closely related to the relative indices
shown in this chapter. These are well represented in Brennan (2001).
Chapter 6
A Comparative Model for Continuous
and Categorical Data

In Chapter 5, we provided statistical tools for assessing the intra-, inter-, and
total-rater agreement among all raters. In this chapter, we provide statistical tools
for comparing total-rater agreement to intrarater precision, and intrarater precision
among selected raters.
When multiple raters are available with replicates, we are often interested
to know whether raters can be used interchangeably if they do not deviate
too much more than they deviate among their replicates without any clinical
or practical alteration. Here, we need to assume that the variation among
replicates (intrarater variability) is acceptable. FDA’s guideline (2001) (http://www.
fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/
ucm070244.pdf) introduced a method for evaluating individual agreement between
a test drug and a reference drug in the context of individual bioequivalence.
Barnhart, Kosinski, and Haber (2007) extended FDA’s approach in the case of
multiple raters. They proposed the individual equivalence coefficient (IEC) and the
coefficient of individual agreement (CIA) to compare the intrarater precision relative
to the total-rater agreement of one or multiple references, or when no reference is
available. They used the method of moments for estimation and nonparametric
bootstrapping for statistical inference. Our approaches allow users to explore total-
rater agreement relative to intrarater precision, whereby users can select raters of
interest to be evaluated. We also allow users to compare intrarater precision among
selected raters. We present such a general comparison model and propose the total–
intra ratio (TIR) as well as the intra–intra ratio (IIR) to evaluate the comparative
agreement when there exists a reference and when there does not.
The TIR is a noninferiority assessment such that the differences of individual
readings from different raters can not be inferior by a certain margin to the
differences of the replicated readings within raters. For a TIR example, to assess
the individual bioequivalence, the agreement of test and reference compound is
assessed relative to the agreement of within-reference compound. Although the
TIR is equivalent to IEC and CIA, we have proposed on alternative statistical
inference approach using the GEE methodology that works for both continuous and
categorical data.

L. Lin et al., Statistical Tools for Measuring Agreement, 111


DOI 10.1007/978-1-4614-0562-7 6, © Springer Science+Business Media, LLC 2012
112 6 A Comparative Model for Continuous and Categorical Data

The IIR is a classical assessment for which the precision of selected assays/raters
can be better than, equal to, or worse than that of other assays/raters. For an IIR
example, in the medical-device environment, we often want to know whether the
within-device precision of a newly developed device is better than, equal to, or worse
than that of the within-device precision for the old device.
GEE methodology is used for estimation and statistical inference. Our approach
allows for selecting any subset of raters as test and reference raters. In addition,
we present assorted examples to demonstrate the flexibility of our approach. More
details about this chapter can be found in Lin, Hedayat, and Tang (2012).

6.1 General Model

Suppose each of n randomly selected subjects is measured by each of k fixed raters


with m replications. The general model for comparative agreement is

yij l D ij C eij l : (6.1)

Here yijl represents the lth reading by rater j for subject i , with i D 1; 2; : : : ; n, j D
1; 2; : : : ; k, and l D 1; 2; : : : ; m. The true reading of the j th rater on subject i is
ij , and it is considered random because subjects are considered random, although
the mean of each rater, j , is considered fixed. The residual random effect is eijl .
We make the following assumptions:

E. ij / D j;

var. ij / D 2
j;

corr. ij ; ij 0 / D jj 0 ;

E.eij l / D 0; and
var.eij l / D &j2 :

Here, we assume that the eij l are uncorrelated, ij and eij l are uncorrelated, and
replicates within a rater are interchangeable.
We use model (6.1) as opposed to model (5.1) because we allow the flexibility
of evaluating any subset of raters in this chapter. For continuous data, when the
within-sample error is proportional to the observed data values, we apply a log
transformation to the data. Based on model (6.1), we propose to use mean squared
deviation (MSD) to assess comparative agreement.
6.2 MSD for Continuous and Categorical Data 113

6.2 MSD for Continuous and Categorical Data

6.2.1 Intrarater Precision

Under the assumption that replicates within a rater are interchangeable, we propose
to use MSDintra to evaluate the intrarater agreement, which is the mean squared
deviation among replicated readings within raters. Intrarater agreement evaluates
the precision within raters. For a chosen rater j , based on model (6.1), MSDintraj
can be expressed as

"2intraj D E.yij l  yij l 0 /2

D. j  j/
2
C. 2
j C &j2 C 2
j C &j2 /  2 2
j

D 2&j2 : (6.2)

For two or more raters, MSDintra is then defined as the average of MSDintraj for the
selected s raters.

6.2.2 Total-Rater Agreement

For evaluating total-rater agreement, we proposed in Chapter 5 to use MSDtotal ,


which is based on any individual reading from each rater. For any two raters, say
raters j and j 0 , based on model (6.1), MSDtotaljj 0 can be expressed as

"2totaljj 0 D E.yij l  yij 0 l 0 /2

D. j  j0/
2
C. 2
j C &j2 C 2
j0 C &j20 /  2jj 0 j j0: (6.3)

Overall, MSDtotal for the selected s raters is then defined as the average across
s.s  1/=2 pairs of MSDtotaljj 0 .

6.2.3 Interrater Agreement

In Chapter 5, we evaluated the interrater agreement using MSDinter . In this chapter,


however, MSDinter need not be evaluated, because MSDtotal and MSDintra include all
114 6 A Comparative Model for Continuous and Categorical Data

the information for evaluating comparative agreement. For any two raters, say rater
j and j 0 , based on the model, MSDinterjj 0 can be expressed as

"2interjj 0 D E.yNij   yNij 0  /2


 
&j2 &j20
D. j  j0/
2
C 2
j C C 2
j0 C  2jj 0 j j 0: (6.4)
m m
The overall MSDinter for the selected s raters is the average across s.s  1/=2 pairs
of MSDinterjj 0 . The relationship between MSDintra , MSDinter , and MSDtotal can be
obtained upon simplification:
 
"2total "2inter 1
D 2 C 1 : (6.5)
"2intra "intra m

The above equation shows that as soon as MSDtotal and MSDintra are determined,
then MSDinter is given as well. Therefore, for comparative agreement, further
evaluation for MSDinter is not necessary.
For comparative agreement, the exact intrarater agreement indices TDI and CP,
as shown in Chapter 5, are one-to-one functions of MSDintra for normally distributed
data. The approximate total TDI and CP are one-to-one functions of MSDtotal
for normally distributed data. Within the same experiment, the between-sample
variances or data ranges are similar, and therefore further scaling relative to the
between-sample variance such as CCC is not necessary for comparative agreement.
Furthermore, such scaling would reduce the power of the experiment.

6.2.4 Categorical Data

Let any two replicates from the same rater or different raters, denoted by X and
Y , represent the classification scores of a subject in one of t categories. Table 3.1
presents the agreement table for all possible probability outcomes, where pq
represents the probability of X D p and Y D q, p; q D 1; 2; : : : ; t.
Based on the agreement probabilities presented in Table 3.1, when X and Y
represent any two replicates within rater j , MSDintraj for categorical data becomes
XX
"2intraj D .p  q/2 pq ; p; q D 1; 2; : : : ; t: (6.6)
p q

Let …0j be the weighted probability of agreementP P rater j with the squared
within
weight function defined in (3.3). Then …0j D tp tq wpq pq . Therefore, the
relationship between MSDintraj and the weighted probability of agreement becomes

"2intraj D .t  1/2 .1  …0j /: (6.7)

Note that weighted kappa is the chance corrected scaling of the weighted probability
of agreement.
6.3 GEE Estimation 115

Similarly, when X and Y represent any replicate from raters j and j 0 ,


respectively, MSDtotaljj 0 for categorical data becomes
XX
"2totaljj 0 D .p  q/2 pq D .t  1/2 .1  …0jj 0 /; p; q D 1; 2; : : : ; t;
p q
(6.8)

where …0jj 0 is the weighted probability of agreement from any replicate of raters
j and j 0 . For categorical data, kappa and weighted kappa are the most common
indices for assessing agreement between two raters, each with a single reading.
Because of the equivalence of CCC and weighted kappa, weighted kappa is also
largely dependent on MSD. Therefore, within the same experiment, further scaling
from MSD to weighted kappa is not necessary for evaluating the comparative
agreement.

6.3 GEE Estimation

Let  D .; & 2 ; 2 ; /0 be the vector of parameters with  D . 1 ; : : : ; j ; : : : ; k /0 ,


& 2 D .&12 ; : : : ; &j2 ; : : : ; &k2 /0 , 2 D . 21 ; : : : ; 2j ; : : : ; 2k /0 , and  D .12 ; : : : ; jj 0 ; : : : ;
.k1/k /0 . Estimates of  along with its variance–covariance matrix can be obtained
via GEE methodology. The following system of estimation equations is used to
obtain the estimates of the parameters.
We estimate  by the first set of estimating equations:

X
n
0
Fi 1 Hi1
1 .Qi 1  / D 0; (6.9)
i D1

where
0 1
.yi11 C yi12 C    C yi1m /=m
B :: C
B : C
B C
Qi 1 D B .yij1 C yij 2 C    C yij m /=m C
B
C;
B : C
@ :: A
.yi k1 C yi k2 C    C yi km /=m
0 1
1
B :: C
B : C
B C
DB
B
C
j C;
B :: C
@ : A
k
116 6 A Comparative Model for Continuous and Categorical Data

and Fi 1 D @. 1@
;:::; k/
D I kk . The working covariance matrix for Qi 1 (Zeger and
Liang 1986) is
0 .1/
1
a1
B :: C
B : 0 C
B C
B .1/ C
Hi 1 D diag.var.Qi 1 // D B aj C;
B C
B :: C
@ 0 : A
.1/
ak
where
 
.1/ 1
aj D var .yij1 C yij 2 C    C yij m /
m

&j2
D 2
j C : (6.10)
m
Note that we assume normality for constructing all of the working covariance
matrices. The estimator for  is
!1 !
X
n
0
X
n
0
O D
 Fi 1 Hi1
1 Fi 1 Fi 1 Hi1
1 Qi 1 : (6.11)
i D1 i D1

We estimate & 2 by the second set of estimating equations:

X
n
0
Fi 2 Hi1
2 .Qi 2  & / D 0;
2
(6.12)
i D1

where
0 Pm 1
lD1 .yi1l  yNi1 /2 =.m  1/
B :: C
B : C
B Pm C
Qi 2 DB
B lD1 .yij l  yNij  /2 =.m  1/ C
C;
B :: C
@ : A
Pm
lD1 .yi kl  yNi k / =.m  1/
2

0 1
&12
B : C
B :: C
B C
B C
& 2 D B &j2 C ;
B : C
B : C
@ : A
&k2
6.3 GEE Estimation 117

@& 2
and Fi 2 D @.&12 ;:::;&k2 /
D I kk . The working covariance matrix for Qi 2 is
0 .2/
1
a1
B :: C
B : 0 C
B C
B .2/ C
Hi 2 D diag.var.Qi 2 // D B aj C;
B C
B :: C
@ 0 : A
.2/
ak

where
 Pm 2
.2/ lD1 .yij l  yNij  /
aj D var
m1

2&j2
D : (6.13)
m1
The estimator for & 2 is
!1 !
X
n
0
X
n
0
&O D
2
Fi 2 Hi1
2 Fi 2 Fi 2 Hi1
2 Qi 2 : (6.14)
i D1 i D1

We estimate 2 by the third set of estimating equations:

X
n
0
Fi 3 Hi1
3 .Qi 3  g.; & ;  // D 0;
2 2
(6.15)
i D1

where
0 1
.yi11 C yi12 C    C yi1m /2 =m2
B :: C
B : C
B C
Qi 3 D B .yij1 C yij 2 C    C yij m / =m C
B 2 2
C;
B : C
@ :: A
.yi k1 C yi k2 C    C yi km / =m
2 2

0 1
2
1 C &12 =m C 2
1
B :: C
B C
B : C
B 2 C
g.; & ;  / D B
2 2 2
C &j =m C j C ;
2
B j
C
B :: C
@ : A
2
k C &k2 =m C 2k
118 6 A Comparative Model for Continuous and Categorical Data

@g.;& 2 ;2 /
and Fi 3 D @. 21 ;:::; 2k /
The working covariance matrix for Qi 3 is
.
0 .3/ 1
a1
B :: C
B : 0 C
B C
B .3/ C
Hi 3 D diag.var.Qi 3 // D B aj C;
B C
B :: C
@ 0 : A
.3/
ak
where
 
.3/ 1
aj D var 2 .yij1 C yij 2 C    C yij m / 2
m
!
2 2 2
!
& j & j
D 2 2j C C 4 2j j C
2
: (6.16)
m m

O and &O 2 .
We obtain the estimate for  and & 2 from (6.11) and (6.14), namely, 
O using the equation
We then solve for  2

!1 !
X
n
0
X
n
0 &O 2
O2 D
 Fi 3 Hi1 Fi 3 Hi1 O2 
 ;
3 Fi 3 3 Qi 3 (6.17)
i D1 i D1
m

where  O 2 represents the vector in which the element is the square of the
O
corresponding element of the vector .
Finally, we estimate  by the fourth set of estimating equations using the cross
products:
Xn
0
Fi 4 Hi1
4 Qi 4  h.;  ; / D 0;
2
(6.18)
i D1

where
0 1
yNi1 yNi 2
B :: C
B : C
B C
B yN yN C
B i1 i k C
B yN yN C
B i 2 i 3 C
B : C
B :: C
B C
Qi 4 DB C ;
B yNi 2 yNi k C
B C
B :: C
B : C
B C
B yNij  yNij 0  C
B C
B :: C
@ : A
yNi.k1/ yNi k 1
k.k1/
2
6.3 GEE Estimation 119

0 1
1 2 C 12 1 2
B :: C
B : C
B C
h.; 2 ; / D B
B j j0 C jj 0 j j0
C
C ;
B :: C
@ : A
.k1/ k C .k1/k k1 k 1
k.k1/
2

@h.;2 ;/
and Fi 4 D @.12 ;:::;.k1/k /
. The working covariance matrix for Qi 4 is

Hi 4 D diag.var.Qi 4 //
0 .4/ 1
a12
B :: C
B : C
B C
B .4/
0 C
B a1k C
B C
B .4/
a23 C
B C
B :: C
B : C
DBB C ;
.4/
a2k C
B C
B :: C
B : C
B C
B .4/ C
B
B
0 ajj 0 C
C
B :: C
@ : A
.4/
a.k1/k k.k1/ k.k1/
2  2

where
.4/
ajj 0 D var.yNij  yNij 0  /
 2
D jj 0 2j 2j 0 C 2 j
2
j 0 jj 0 j
2
j0
! !
   &j20 &j2
C 2
j C &j2 2
j0 C &j20 C j
2
j0 C C j0
2
j C : (6.19)
m m

We obtain the estimate for 2 from (6.17), namely  O 2 . We then use  O 2 in


O and 
2
(6.18) for  and  and solve for . O Upon simplification, the estimate for each
element of O is given by
 Pn 
i D1 yNij  yNij 0  O j O j0;
Ojj 0 D  O j O j0 (6.20)
n

O and O j and O j 0 are the


where O j and O j 0 are the j th and j 0 th elements of vector ,
0
square roots of the j th and j th elements of vector  . O 2
120 6 A Comparative Model for Continuous and Categorical Data

When there is no covariate, no iteration is needed for the above estimation process. The variance–covariance matrix for the estimated parameters $\hat{\theta} = (\hat{\mu}, \hat{\varsigma}^2, \hat{\sigma}^2, \hat{\rho})'$ is given by

$$V(\hat{\theta}) = \frac{1}{n} D^{-1} \Sigma \bigl( D^{-1} \bigr)', \qquad (6.21)$$

where

$$D = \begin{pmatrix} \sum_{i=1}^{n} F_{i1}' H_{i1}^{-1} F_{i1} & 0 & 0 & 0 \\ 0 & \sum_{i=1}^{n} F_{i2}' H_{i2}^{-1} F_{i2} & 0 & 0 \\ \sum_{i=1}^{n} F_{i3}' H_{i3}^{-1} G_{i2} & \sum_{i=1}^{n} F_{i3}' H_{i3}^{-1} G_{i3} & \sum_{i=1}^{n} F_{i3}' H_{i3}^{-1} F_{i3} & 0 \\ \sum_{i=1}^{n} F_{i4}' H_{i4}^{-1} G_{i4} & 0 & \sum_{i=1}^{n} F_{i4}' H_{i4}^{-1} G_{i6} & \sum_{i=1}^{n} F_{i4}' H_{i4}^{-1} F_{i4} \end{pmatrix},$$

with

$$G_{i2} = \partial g(\mu, \varsigma^2, \sigma^2)/\partial \mu, \quad G_{i3} = \partial g(\mu, \varsigma^2, \sigma^2)/\partial \varsigma^2, \quad G_{i4} = \partial h(\mu, \sigma^2, \rho)/\partial \mu, \quad G_{i6} = \partial h(\mu, \sigma^2, \rho)/\partial \sigma^2,$$

and

$$\Sigma = \begin{pmatrix} A_{11} & A_{12} & A_{13} & A_{14} \\ A_{21} & A_{22} & A_{23} & A_{24} \\ A_{31} & A_{32} & A_{33} & A_{34} \\ A_{41} & A_{42} & A_{43} & A_{44} \end{pmatrix},$$
with

$$A_{11} = \sum_{i=1}^{n} F_{i1}' H_{i1}^{-1} (Q_{i1} - \hat{\mu})(Q_{i1} - \hat{\mu})' H_{i1}^{-1} F_{i1},$$
$$A_{12} = \sum_{i=1}^{n} F_{i1}' H_{i1}^{-1} (Q_{i1} - \hat{\mu})(Q_{i2} - \hat{\varsigma}^2)' H_{i2}^{-1} F_{i2},$$
$$A_{13} = \sum_{i=1}^{n} F_{i1}' H_{i1}^{-1} (Q_{i1} - \hat{\mu})\bigl(Q_{i3} - g(\hat{\mu}, \hat{\varsigma}^2, \hat{\sigma}^2)\bigr)' H_{i3}^{-1} F_{i3},$$
$$A_{14} = \sum_{i=1}^{n} F_{i1}' H_{i1}^{-1} (Q_{i1} - \hat{\mu})\bigl(Q_{i4} - h(\hat{\mu}, \hat{\sigma}^2, \hat{\rho})\bigr)' H_{i4}^{-1} F_{i4},$$

$$A_{22} = \sum_{i=1}^{n} F_{i2}' H_{i2}^{-1} (Q_{i2} - \hat{\varsigma}^2)(Q_{i2} - \hat{\varsigma}^2)' H_{i2}^{-1} F_{i2},$$
$$A_{23} = \sum_{i=1}^{n} F_{i2}' H_{i2}^{-1} (Q_{i2} - \hat{\varsigma}^2)\bigl(Q_{i3} - g(\hat{\mu}, \hat{\varsigma}^2, \hat{\sigma}^2)\bigr)' H_{i3}^{-1} F_{i3},$$
$$A_{24} = \sum_{i=1}^{n} F_{i2}' H_{i2}^{-1} (Q_{i2} - \hat{\varsigma}^2)\bigl(Q_{i4} - h(\hat{\mu}, \hat{\sigma}^2, \hat{\rho})\bigr)' H_{i4}^{-1} F_{i4},$$
$$A_{33} = \sum_{i=1}^{n} F_{i3}' H_{i3}^{-1} \bigl(Q_{i3} - g(\hat{\mu}, \hat{\varsigma}^2, \hat{\sigma}^2)\bigr)\bigl(Q_{i3} - g(\hat{\mu}, \hat{\varsigma}^2, \hat{\sigma}^2)\bigr)' H_{i3}^{-1} F_{i3},$$
$$A_{34} = \sum_{i=1}^{n} F_{i3}' H_{i3}^{-1} \bigl(Q_{i3} - g(\hat{\mu}, \hat{\varsigma}^2, \hat{\sigma}^2)\bigr)\bigl(Q_{i4} - h(\hat{\mu}, \hat{\sigma}^2, \hat{\rho})\bigr)' H_{i4}^{-1} F_{i4},$$
$$A_{44} = \sum_{i=1}^{n} F_{i4}' H_{i4}^{-1} \bigl(Q_{i4} - h(\hat{\mu}, \hat{\sigma}^2, \hat{\rho})\bigr)\bigl(Q_{i4} - h(\hat{\mu}, \hat{\sigma}^2, \hat{\rho})\bigr)' H_{i4}^{-1} F_{i4},$$

and $A_{21} = A_{12}'$, $A_{31} = A_{13}'$, $A_{41} = A_{14}'$, $A_{32} = A_{23}'$, $A_{42} = A_{24}'$, $A_{43} = A_{34}'$.
Now we have obtained the estimates and variance–covariance matrix for all
parameters. We can then use the delta method to obtain the estimates and their
variances for all indices that are functions of these parameters.
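When there is no covariate, these estimating equations reduce to simple moment calculations, so the estimation steps (6.11), (6.14), (6.17), and (6.20) can be computed directly. The following is a minimal R sketch for illustration only; the function name moment.estimates and the n x k x m array layout are ours, not part of the macros introduced in Chapter 7.

# A minimal sketch of the moment estimators (6.11), (6.14), (6.17), and (6.20)
# when there is no covariate. 'y' is an n x k x m array:
# n subjects, k raters, m replicates.
moment.estimates <- function(y) {
  n <- dim(y)[1]; k <- dim(y)[2]; m <- dim(y)[3]
  ybar <- apply(y, c(1, 2), mean)                    # subject-by-rater means
  mu <- colMeans(ybar)                               # (6.11): rater means
  s2w <- apply(y, c(1, 2), var)                      # within-replicate variances
  varsigma2 <- colMeans(s2w)                         # (6.14): error variances
  sigma2 <- colMeans(ybar^2) - mu^2 - varsigma2 / m  # (6.17)
  rho <- diag(k)
  for (j in 1:(k - 1)) for (jp in (j + 1):k) {       # (6.20)
    rho[j, jp] <- rho[jp, j] <-
      (mean(ybar[, j] * ybar[, jp]) - mu[j] * mu[jp]) /
      sqrt(sigma2[j] * sigma2[jp])
  }
  list(mu = mu, varsigma2 = varsigma2, sigma2 = sigma2, rho = rho)
}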

6.4 Comparison of Total-Rater Agreement with Intrarater Precision: Total–Intra Ratio

For evaluating the type of individual agreement mentioned in the second paragraph of this chapter, it is natural to use TIR, the ratio of $\mathrm{MSD}_{\mathrm{total}}$ and $\mathrm{MSD}_{\mathrm{intra}}$, to assess the noninferiority of measurements from different raters relative to an intrarater precision. More generally, the comparison can be based on selected multiple pairs of $\mathrm{MSD}_{\mathrm{total}_{jj'}}$ relative to selected multiple $\mathrm{MSD}_{\mathrm{intra}_j}$. Raters selected for $\mathrm{MSD}_{\mathrm{total}}$ can also be selected for $\mathrm{MSD}_{\mathrm{intra}}$, and hence the raters in the numerator and the raters in the denominator are not mutually exclusive. This approach allows substantial flexibility in making comparisons between chosen test raters and chosen reference raters. In addition, it is not required to select a reference rater when none is available. For example, when $k = 2$, we can evaluate the deviations among individual values of the test and reference raters relative to the deviation within the reference rater, $\mathrm{MSD}_{\mathrm{total}_{T,R}}/\mathrm{MSD}_{\mathrm{intra}_R}$, or relative to that within both test and reference raters, $\mathrm{MSD}_{\mathrm{total}_{T,R}}/\mathrm{MSD}_{\mathrm{intra}_{T,R}}$. When $k = 3$ with one of the raters being the reference rater, we can evaluate $\mathrm{MSD}_{\mathrm{total}_{T_1,R}}/\mathrm{MSD}_{\mathrm{intra}_R}$, $\mathrm{MSD}_{\mathrm{total}_{T_2,R}}/\mathrm{MSD}_{\mathrm{intra}_R}$, or $\mathrm{MSD}_{\mathrm{total}_{T_1T_2,R}}/\mathrm{MSD}_{\mathrm{intra}_R}$. When $k = 3$ and none is the reference, we can evaluate $\mathrm{MSD}_{\mathrm{total}}/\mathrm{MSD}_{\mathrm{intra}}$, $\mathrm{MSD}_{\mathrm{total}_{1,2}}/\mathrm{MSD}_{\mathrm{intra}_{1,2}}$, $\mathrm{MSD}_{\mathrm{total}_{1,3}}/\mathrm{MSD}_{\mathrm{intra}_{1,3}}$, and $\mathrm{MSD}_{\mathrm{total}_{2,3}}/\mathrm{MSD}_{\mathrm{intra}_{2,3}}$. In the following, we will discuss cases with at least one reference rater and those in which no reference rater is available. Here, $\mathrm{MSD}_{\mathrm{intra}_{T,R}}$ is the average of $\mathrm{MSD}_{\mathrm{intra}_T}$ and $\mathrm{MSD}_{\mathrm{intra}_R}$, while $\mathrm{MSD}_{\mathrm{total}_{T_1T_2,R}}$ is the average of $\mathrm{MSD}_{\mathrm{total}_{T_1,R}}$ and $\mathrm{MSD}_{\mathrm{total}_{T_2,R}}$.

6.4.1 When One or Multiple References Exist

In the case of one or multiple references, we select a set of test raters and a set of reference raters out of the total of $k$ raters. Suppose we are interested in evaluating $t$ different test raters, $1 \le t \le k$, with respect to $r$ reference raters, $1 \le r \le k$, $2 \le t + r \le 2k$. Test raters are indexed by $j$, $j = 1, \ldots, t$, and reference raters are indexed by $j'$, $j' = 1, \ldots, r$. The individual differences between the selected sets of test and reference raters are evaluated by the average of the pairwise total mean squared deviations, $\mathrm{MSD}_{\mathrm{total}_{T,R}}$. The intrarater precision is evaluated by the average of the intra mean squared deviations of the selected $r$ reference raters. The ratio $\mathrm{TIR}_R$ is used to assess the individual agreement:

$$\mathrm{TIR}_R = \frac{\varepsilon^2_{\mathrm{total}_{T,R}}}{\varepsilon^2_{\mathrm{intra}_R}} = \frac{\sum_{j=1}^{t} \sum_{j'=1}^{r} E(y_{ijl} - y_{ij'l'})^2 / tr}{\sum_{j'=1}^{r} E(y_{ij'l} - y_{ij'l'})^2 / r} = \frac{\frac{1}{tr}\sum_{j'=1}^{r}\sum_{j=1}^{t} (\mu_j - \mu_{j'})^2 + \frac{1}{t}\sum_{j=1}^{t} \varsigma_j^2 + \frac{1}{r}\sum_{j'=1}^{r} \varsigma_{j'}^2}{\frac{2}{r}\sum_{j'=1}^{r} \varsigma_{j'}^2} + \frac{\frac{1}{t}\sum_{j=1}^{t} \sigma_j^2 + \frac{1}{r}\sum_{j'=1}^{r} \sigma_{j'}^2 - \frac{2}{tr}\sum_{j'=1}^{r}\sum_{j=1}^{t} \rho_{jj'}\sigma_j\sigma_{j'}}{\frac{2}{r}\sum_{j'=1}^{r} \varsigma_{j'}^2}. \qquad (6.22)$$

Theoretically, $\mathrm{TIR}_R$ cannot be less than 1. However, its estimate, $\widehat{\mathrm{TIR}}_R$, can be less than one due to random error. When $\mathrm{TIR}_R = 1$, total-rater agreement is exactly the same as intra-reference-rater agreement. Higher values of $\mathrm{TIR}_R$ indicate worse individual agreement. The disagreement could be due to (1) the difference between the means $\mu_T$ of the test raters and the means $\mu_R$ of the reference raters, (2) the difference between the error variance $\varsigma_T^2$ of the test raters and the error variance $\varsigma_R^2$ of the reference raters, or (3) the subject-by-rater interaction: $\sigma_D^2 = \mathrm{var}(\alpha_{iT} - \alpha_{iR}) = \sigma_T^2 + \sigma_R^2 - 2\rho_{T,R}\sigma_T\sigma_R$.
6.4.1.1 When No Specific Reference Exists

When there is no specific reference rater, we can evaluate the $\mathrm{MSD}_{\mathrm{total}}$ of $t$ selected test raters relative to their $\mathrm{MSD}_{\mathrm{intra}}$. The individual difference is evaluated by the average of the pairwise total mean squared deviations, $\mathrm{MSD}_{\mathrm{total}_T}$, among the $t$ selected test raters, $2 \le t \le k$. We then use the average of the $\mathrm{MSD}_{\mathrm{intra}}$ of all test raters in the denominator. The total–intra ratio without a specific reference, $\mathrm{TIR}_{\mathrm{all}}$, is expressed as

$$\mathrm{TIR}_{\mathrm{all}} = \frac{\varepsilon^2_{\mathrm{total}}}{\varepsilon^2_{\mathrm{intra}}} = \frac{2\sum_{j=1}^{t-1}\sum_{j'=j+1}^{t} E(y_{ijl} - y_{ij'l'})^2 / t(t-1)}{\sum_{j=1}^{t} E(y_{ijl} - y_{ijl'})^2 / t} = \frac{2\sum_{j=1}^{t-1}\sum_{j'=j+1}^{t} \bigl[ (\mu_j - \mu_{j'})^2 + \varsigma_j^2 + \varsigma_{j'}^2 + \sigma_j^2 + \sigma_{j'}^2 - 2\rho_{jj'}\sigma_j\sigma_{j'} \bigr] / t(t-1)}{2\sum_{j=1}^{t} \varsigma_j^2 / t}. \qquad (6.23)$$

When no specific reference exists, $\mathrm{TIR}_{\mathrm{all}}$ varies between 1 and $\infty$.

6.4.2 Comparison to FDA's Individual Bioequivalence with Relative Scale

In the case of two raters with one of them treated as a reference, i.e., $k = 2$, $t = 1$, and $r = 1$, $\mathrm{TIR}_R$ degenerates to FDA's method for evaluating individual bioequivalence under the relative scale. Following the FDA guidance on individual bioequivalence, the agreement of the test and reference compounds can be assessed relative to the agreement within the reference compound. Let $y_{iTl}$ and $y_{iRl'}$ be the $l$th and $l'$th readings on subject $i$ from the test compound ($T$) and the reference compound ($R$), respectively. Then the individual bioequivalence criterion (IBC) is defined by FDA as

$$\mathrm{IBC} = \frac{E(y_{iTl} - y_{iRl'})^2 - E(y_{iRl} - y_{iRl'})^2}{E(y_{iRl} - y_{iRl'})^2 / 2}. \qquad (6.24)$$

This FDA approach is primarily based on the approach proposed by Sheiner (1992), which uses a normal linear mixed model estimated by REML. By our definition, $\mathrm{TIR}_R$ is expressed as

$$\mathrm{TIR}_R = \frac{\varepsilon^2_{\mathrm{total}_{T,R}}}{\varepsilon^2_{\mathrm{intra}_R}} = \frac{E(y_{iTl} - y_{iRl'})^2}{E(y_{iRl} - y_{iRl'})^2} = \frac{\mathrm{IBC}}{2} + 1. \qquad (6.25)$$
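To make (6.24) and (6.25) concrete, a naive moment sketch in R follows; this is not FDA's REML-based procedure, and the function name ibc.hat and the vector arguments T1, T2, R1, and R2 (duplicate readings per subject from the test and reference compounds) are our own illustration:

# Naive moment estimate of IBC in (6.24); by (6.25), TIR_R = IBC/2 + 1.
ibc.hat <- function(T1, T2, R1, R2) {
  msd.TR <- mean(c((T1 - R1)^2, (T1 - R2)^2, (T2 - R1)^2, (T2 - R2)^2))
  msd.RR <- mean((R1 - R2)^2)
  (msd.TR - msd.RR) / (msd.RR / 2)
}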
Note that FDA also uses a constant scale when $\mathrm{MSD}_{\mathrm{intra}_R}$ is small. In addition, it requires that the ratio of the geometric means of the test and reference compounds lie between 0.8 and 1.25. We will discuss the topic of individual bioequivalence in detail later, in Example 6.7.4. In that example, we will add meaningful interpretations to the FDA's criteria by making use of the information presented in Chapters 5 and 6.

6.4.3 Comparison to Coefficient of Individual Agreement

Barnhart, Kosinski, and Haber (2007) proposed the coefficient of individual agreement (CIA) for assessing individual agreement. When there are $t$ test raters and $r$ reference raters, the CIA with reference is defined as

$$\mathrm{CIA}_R = \frac{\varepsilon^2_{\mathrm{intra}_R}}{\varepsilon^2_{\mathrm{total}_{T,R}}} = \frac{\sum_{j'=1}^{r} E(y_{ij'l} - y_{ij'l'})^2 / r}{\sum_{j=1}^{t}\sum_{j'=1}^{r} E(y_{ijl} - y_{ij'l'})^2 / tr}. \qquad (6.26)$$

When no reference is available, the CIA without reference is defined as

$$\mathrm{CIA}_{\mathrm{all}} = \frac{\varepsilon^2_{\mathrm{intra}}}{\varepsilon^2_{\mathrm{total}}} = \frac{\sum_{j=1}^{t} E(y_{ijl} - y_{ijl'})^2 / t}{2\sum_{j=1}^{t-1}\sum_{j'=j+1}^{t} E(y_{ijl} - y_{ij'l'})^2 / t(t-1)}. \qquad (6.27)$$

Comparing these CIAs to $\mathrm{TIR}_R$ and $\mathrm{TIR}_{\mathrm{all}}$ in (6.22) and (6.23), respectively, CIA is the reciprocal of TIR. Basically, IBC, CIA, and TIR are the same indices. The differences lie in the estimation approaches: IBC is ML-based, CIA is method-of-moments-based with bootstrapping for statistical inference, and TIR is GEE-based. CIA and TIR are extended to multiple raters, while IBC is limited to two raters only.

6.4.4 Estimation and Asymptotic Normality

Recall that we have obtained estimates of all parameters as well as their variance–
covariance matrix via GEE methodology in Section 6.3. These GEE estimates
of parameters turn out to be moment estimates. Since TIR is a function of the
parameters in the model, the method of moments is used to estimate TIR, and the
delta method is used for the statistical inference.
When a reference exists, the $\mathrm{TIR}_R$ estimate can be obtained as

$$\widehat{\mathrm{TIR}}_R = \frac{\frac{1}{tr}\sum_{j'=1}^{r}\sum_{j=1}^{t} (\hat{\mu}_j - \hat{\mu}_{j'})^2 + \frac{1}{t}\sum_{j=1}^{t} \hat{\varsigma}_j^2 + \frac{1}{r}\sum_{j'=1}^{r} \hat{\varsigma}_{j'}^2}{\frac{2}{r}\sum_{j'=1}^{r} \hat{\varsigma}_{j'}^2} + \frac{\frac{1}{t}\sum_{j=1}^{t} \hat{\sigma}_j^2 + \frac{1}{r}\sum_{j'=1}^{r} \hat{\sigma}_{j'}^2 - \frac{2}{tr}\sum_{j'=1}^{r}\sum_{j=1}^{t} \hat{\rho}_{jj'}\hat{\sigma}_j\hat{\sigma}_{j'}}{\frac{2}{r}\sum_{j'=1}^{r} \hat{\varsigma}_{j'}^2}. \qquad (6.28)$$
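As an illustration of how (6.28) is assembled from the moment estimates, a minimal R sketch follows; the helper moment.estimates() and the index vectors tj and rj for the selected test and reference raters are our own illustrative names:

# A minimal sketch of the TIR_R point estimate in (6.28).
TIR.R.hat <- function(est, tj, rj) {
  num1 <- mean(outer(est$mu[tj], est$mu[rj], "-")^2) +
          mean(est$varsigma2[tj]) + mean(est$varsigma2[rj])
  sig <- sqrt(est$sigma2)
  num2 <- mean(est$sigma2[tj]) + mean(est$sigma2[rj]) -
          2 * mean(outer(sig[tj], sig[rj]) * est$rho[tj, rj, drop = FALSE])
  den <- 2 * mean(est$varsigma2[rj])
  (num1 + num2) / den
}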

The estimate of the log transformed TIRR , WT D ln.OR /, has an asymptotic normal
distribution with mean ln.R / and variance
1 0
2
WT D d †C dT ; (6.29)
n T
where WT D ln.OR / D g.m/,

m D .; O 2 ; /
O &O 2 ;  O 0 D .m1 ; m2 ; m3 ; m4 /0 ;

†C D nV./ O from (6.21), which is the variance–covariance matrix for the parame-
ter estimates,

dT D .d ; d& 2 ; d2 ; d /0


 ˇ ˇ ˇ ˇ 0
@g.m/ ˇˇ @g.m/ ˇˇ @g.m/ ˇˇ @g.m/ ˇˇ
D ; ; ; ;
@m1 ˇmD‚ @m2 ˇmD‚ @m3 ˇmD‚ @m4 ˇmD‚

0
and ‚ D .; & 2 ; 2 ; / . We use the correction factor n6 n
for the variance in (6.29)
because it has been shown to have less bias in the simulation studies.
The elements of $d_T$ are computed as follows. If the $j$th rater is selected as a test rater and the $j'$th rater is selected as a reference rater, then:

• The $j$th element of $d_T$ is $\frac{1}{\mathrm{TIR}_R}\cdot\frac{\sum_{j'=1}^{r}(\mu_j-\mu_{j'})}{t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.

• The $j'$th element of $d_T$ is $-\frac{1}{\mathrm{TIR}_R}\cdot\frac{\sum_{j=1}^{t}(\mu_j-\mu_{j'})}{t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.

• The $(k+j)$th element of $d_T$ is $\frac{1}{\mathrm{TIR}_R}\cdot\frac{r}{2t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.

• The $(k+j')$th element of $d_T$ is
$$-\frac{1}{\mathrm{TIR}_R}\cdot\frac{\frac{1}{tr}\sum_{j'=1}^{r}\sum_{j=1}^{t}(\mu_j-\mu_{j'})^2+\frac{1}{t}\sum_{j=1}^{t}\varsigma_j^2+\frac{1}{t}\sum_{j=1}^{t}\sigma_j^2+\frac{1}{r}\sum_{j'=1}^{r}\sigma_{j'}^2-\frac{2}{tr}\sum_{j'=1}^{r}\sum_{j=1}^{t}\rho_{jj'}\sigma_j\sigma_{j'}}{\frac{2}{r}\bigl(\sum_{j'=1}^{r}\varsigma_{j'}^2\bigr)^2}.$$

• The $(2k+j)$th element of $d_T$ is $\frac{1}{\mathrm{TIR}_R}\cdot\frac{r-\sum_{j'=1}^{r}\rho_{jj'}\sigma_{j'}/\sigma_j}{2t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.

• The $(2k+j')$th element of $d_T$ is $\frac{1}{\mathrm{TIR}_R}\cdot\frac{t-\sum_{j=1}^{t}\rho_{jj'}\sigma_j/\sigma_{j'}}{2t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.

• When $j<j'$, the $\bigl(3k+\frac{(2k-j)(j-1)}{2}+(j'-j)\bigr)$th element of $d_T$ is $-\frac{1}{\mathrm{TIR}_R}\cdot\frac{\sigma_j\sigma_{j'}}{t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.

• When $j>j'$, the $\bigl(3k+\frac{(2k-j')(j'-1)}{2}+(j-j')\bigr)$th element of $d_T$ is $-\frac{1}{\mathrm{TIR}_R}\cdot\frac{\sigma_j\sigma_{j'}}{t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.

• All other elements of $d_T$ are zero.

The log-transformed $\mathrm{TIR}_R$ estimate approaches normality rapidly, and the transformation efficiently bounds the confidence interval within $(0, \infty)$. The confidence limit for $\mathrm{TIR}_R$ is computed based on the log-transformed TIR estimate, $W_T = \ln(\widehat{\mathrm{TIR}}_R)$. The antilog transformation is then applied to the confidence limit of $W_T$ to obtain the actual confidence limit for $\mathrm{TIR}_R$. Individual agreement is established when the confidence limit is smaller than the prespecified criterion, say, 2.25.
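For instance, a minimal R sketch of this back-transformation, assuming a standard error se.WT for $W_T$ has already been computed from (6.29):

# Upper one-sided 100(1 - alpha)% confidence limit for TIR_R via the
# log transformation and the antilog back-transformation.
upper.TIR <- function(tir.hat, se.WT, alpha = 0.05) {
  exp(log(tir.hat) + qnorm(1 - alpha) * se.WT)
}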
When no reference raters exist and the average of all error variances of the test raters is used in the denominator, the $\mathrm{TIR}_{\mathrm{all}}$ estimate is given by

$$\widehat{\mathrm{TIR}}_{\mathrm{all}} = \frac{2\sum_{j=1}^{t-1}\sum_{j'=j+1}^{t} \bigl[ (\hat{\mu}_j - \hat{\mu}_{j'})^2 + \hat{\varsigma}_j^2 + \hat{\varsigma}_{j'}^2 + \hat{\sigma}_j^2 + \hat{\sigma}_{j'}^2 - 2\hat{\rho}_{jj'}\hat{\sigma}_j\hat{\sigma}_{j'} \bigr] / t(t-1)}{2\sum_{j=1}^{t} \hat{\varsigma}_j^2 / t}. \qquad (6.30)$$
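A corresponding minimal R sketch of (6.30), again reusing the illustrative moment.estimates() output and an index vector tj of selected test raters:

# A minimal sketch of the TIR_all point estimate in (6.30).
TIR.all.hat <- function(est, tj) {
  sig <- sqrt(est$sigma2); num <- 0; tt <- length(tj)
  for (a in 1:(tt - 1)) for (b in (a + 1):tt) {
    j <- tj[a]; jp <- tj[b]
    num <- num + (est$mu[j] - est$mu[jp])^2 + est$varsigma2[j] +
           est$varsigma2[jp] + est$sigma2[j] + est$sigma2[jp] -
           2 * est$rho[j, jp] * sig[j] * sig[jp]
  }
  (2 * num / (tt * (tt - 1))) / (2 * mean(est$varsigma2[tj]))
}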
The statistical inference for TIRall can be obtained in the same way as when there
are reference raters. For the purpose of statistical inference, the parameters in the
variances of estimates presented above can be replaced by their sample counterparts
that are consistent estimators.

6.5 Comparison of Intrarater Precision Among Selected Raters: Intra–Intra Ratio

In the medical-device environment, we are often interested in whether the within-device precision of a newly developed device is better than, equal to, or worse than the within-device precision of the old device. In our approach, we can select any set of test raters as well as any set of reference raters out of the total of $k$ raters, and the intraprecision of the selected set of test raters is compared to that of the selected set of reference raters. Here, the selected sets of test and reference raters are mutually exclusive. For example, when $k = 2$, we can evaluate only $\mathrm{MSD}_{\mathrm{intra}_1}/\mathrm{MSD}_{\mathrm{intra}_2}$ and $\mathrm{MSD}_{\mathrm{intra}_2}/\mathrm{MSD}_{\mathrm{intra}_1}$. When $k = 3$, we can evaluate $\mathrm{MSD}_{\mathrm{intra}_{1,2}}/\mathrm{MSD}_{\mathrm{intra}_3}$, $\mathrm{MSD}_{\mathrm{intra}_{1,3}}/\mathrm{MSD}_{\mathrm{intra}_2}$, and $\mathrm{MSD}_{\mathrm{intra}_{2,3}}/\mathrm{MSD}_{\mathrm{intra}_1}$. When we select $t$ test raters and $r$ reference raters, $1 \le t \le k$, $1 \le r \le k$, $2 \le t + r \le 2k$, the IIR can be expressed as

$$\mathrm{IIR} = \frac{\varepsilon^2_{\mathrm{intra}_T}}{\varepsilon^2_{\mathrm{intra}_R}} = \frac{\sum_{j=1}^{t} E(y_{ijl} - y_{ijl'})^2 / t}{\sum_{j'=1}^{r} E(y_{ij'l} - y_{ij'l'})^2 / r} = \frac{\sum_{j=1}^{t} \varsigma_j^2 / t}{\sum_{j'=1}^{r} \varsigma_{j'}^2 / r}, \qquad (6.31)$$

where $\varepsilon^2_{\mathrm{intra}_T}$ and $\varepsilon^2_{\mathrm{intra}_R}$ denote the $\mathrm{MSD}_{\mathrm{intra}}$ among the selected test and reference raters, respectively.

An IIR less than 1 indicates better overall precision for the test raters than for the reference raters. For statistical inference, we would construct a two-sided $100(1-\alpha/2)\%$ confidence interval for IIR and claim superiority or inferiority if the upper or lower limit is less than or greater than 1.0, respectively. On the other hand, if we intend to examine whether the precisions of the test and reference raters are equal, we would claim equivalence if the confidence interval is bounded by a prespecified clinically relevant interval.

6.5.1 Estimation and Asymptotic Normality

As for TIR, the estimate for IIR is obtained via the method of moments using the GEE methodology, and the related statistical inference is obtained via the delta method. The IIR estimate is given by

$$\widehat{\mathrm{IIR}} = \frac{\sum_{j=1}^{t} \hat{\varsigma}_j^2 / t}{\sum_{j'=1}^{r} \hat{\varsigma}_{j'}^2 / r}. \qquad (6.32)$$
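A minimal R sketch of (6.32), reusing the illustrative moment.estimates() output:

# Ratio of average within-rater error variances, per (6.32).
IIR.hat <- function(est, tj, rj) {
  mean(est$varsigma2[tj]) / mean(est$varsigma2[rj])
}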

The estimate of the log-transformed IIR, $W_I = \ln(\widehat{\mathrm{IIR}})$, has an asymptotic normal distribution with mean $\ln(\mathrm{IIR})$ and variance

$$\sigma^2_{W_I} = \frac{1}{n} d_I' \Sigma_C d_I, \qquad (6.33)$$

where $W_I = \ln(\widehat{\mathrm{IIR}}) = g(m)$, $m = (0, \hat{\varsigma}^2, 0, 0)' = (0, m_2, 0, 0)'$, and

$$d_I = (0, d_{\varsigma^2}, 0, 0)' = \left( 0, \left.\frac{\partial g(m)}{\partial m_2}\right|_{m=\Theta}, 0, 0 \right)'.$$

We use the correction factor $n/(n-6)$ for the variance in (6.33) because it has been shown to have less bias in simulation studies.
The elements of $d_I$ are computed as follows. If the $j$th rater is selected as a test rater and the $j'$th rater is selected as a reference rater, then:

• The $(k+j)$th element of $d_I$ is $\frac{1}{\mathrm{IIR}}\cdot\frac{r}{t\sum_{j'=1}^{r}\varsigma_{j'}^2}$.

• The $(k+j')$th element of $d_I$ is $-\frac{1}{\mathrm{IIR}}\cdot\frac{r\sum_{j=1}^{t}\varsigma_j^2}{t\bigl(\sum_{j'=1}^{r}\varsigma_{j'}^2\bigr)^2}$.

• All other elements of $d_I$ are zero. The values of the elements of $d_I$ correspond to the selection of the test and reference raters.
The log-transformed IIR estimate approaches normality rapidly, and the confidence interval for IIR is bounded within $(0, \infty)$. The confidence limit for IIR is computed based on the log-transformed IIR estimate, $W_I = \ln(\widehat{\mathrm{IIR}})$. The antilog transformation is then applied to the confidence limit of $W_I$ to obtain the actual confidence limit for IIR. Better precision for the test raters than for the reference raters is concluded when the confidence limit is less than 1. For the purpose of statistical inference, the parameters in the variances of the estimates presented above can be replaced by their sample counterparts, which are consistent estimators.

6.6 Summary of Simulation Results

Simulation studies were conducted to assess the performance of the GEE methodology for estimation and inference of TIR and IIR for three different types of data, binary, ordinal, and normal, when $k = 2$ and $m = 2$. A total of 1,000 random samples were generated, with sample sizes of 40 for normal data and 80 for binary and ordinal data. For each type of data, the simulation was designed to evaluate both the significance level and the power.

For TIR with criterion 2.25, we generated data with $\mathrm{TIR} = 2.25$ to assess the significance level. We generated data with $\mathrm{TIR} = 1.25$ for normal data and $\mathrm{TIR} = 1.33$ for categorical data to assess the power. For IIR, we generated data with $\mathrm{IIR} = 1$ to assess the significance level and data with $\mathrm{IIR} = 0.5$ to assess the power. The estimates and their standard errors correspond to their theoretical values very well. The coverage probability used to assess the significance level is close to 0.05, and the powers vary from 0.45 to 0.80. The details of the simulation results can be seen in Lin, Hedayat, and Tang (2012).

6.7 Examples

6.7.1 Example 1: TIR and IIR for an Automatic Blood Pressure Meter

In Example 5.9.4, we examined the intraagreement, interagreement, and total agreement of one semiautomated blood pressure meter ($S$) and the gold-standard sphygmomanometer measured by two medical staff members ($J$ and $R$), with triplicate measurements by each. Figures 5.4–5.9 show that it is informative to investigate the TIR of $\mathrm{MSD}_{\mathrm{total}_{S,JR}}$ relative to $\mathrm{MSD}_{\mathrm{intra}_{JR}}$, and the IIR of $\mathrm{MSD}_{\mathrm{intra}_S}$ relative to $\mathrm{MSD}_{\mathrm{intra}_{JR}}$, because the readings from $J$ and $R$ were precisely interchangeable, as evident in Fig. 5.4. Here, the data were assumed to have a proportional error structure and were analyzed with a log transformation.

We now evaluate the TIR of $\mathrm{MSD}_{\mathrm{total}_{S,JR}}$ relative to $\mathrm{MSD}_{\mathrm{intra}_{JR}}$. The $\mathrm{TIR}_R$ of $S$ relative to $J$ and $R$ is estimated to be 7.06, with a one-tailed 95% upper confidence limit of 10.45, which is much greater than any clinically relevant criterion, indicating that $S$ is not interchangeable with $J$ and $R$. The IIR estimate is 1.57, with a two-sided 95% confidence interval of (1.05, 2.33). The lower limit is greater than 1, indicating that the precision of $S$ is significantly inferior to the precision of $J$ and $R$.

The results imply that the automatic blood pressure monitor does not have good individual agreement with the sphygmomanometer used by medical staff, and that the precision of the automatic blood pressure apparatus is significantly inferior to that of the sphygmomanometer. Therefore, we would not want to replace a sphygmomanometer with the automatic machine used in this study.

6.7.2 Example 2: Nasal Bone Image Assessment by Ultrasound Scan

In Example 5.9.3, we examined the intraagreement, interagreement, and total agreement of nasal bone (NB) images to be used for population screening in an Australian obstetric population at 11–13 weeks gestation. Recall that three raters assessed each of the 400 images twice ($k = 3$, $m = 2$).

Tables 5.7–5.12 present the frequency tables among the three raters and their duplicate readings. It appears that there is better agreement within raters than between raters, and it is informative to investigate the TIR without any reference examiner. This TIR is estimated to be 1.56, with a 95% upper confidence limit of 2.38. We can accept the between-raters agreement if we consider using the criterion 2.5.

6.7.3 Example 3: Validation of the Determination of Glycine on a Spectrophotometer System

The data set for this example is from Baxter Healthcare Corporation. Spectrophotometer systems are used for the spectrophotometric determination of glycine. In this example, we examine the agreement between two different systems, $S_1$ and $S_2$, with duplicate measurements on each of 38 samples. Figures 6.1–6.3 present the plots of the readings between the two systems and their replicated measurements. There are no reference instruments here, and hence it is informative to investigate the TIR and IIR without reference. Here, the data were assumed to have a constant error structure.

Fig. 6.1 S1 reading 1 vs. S1 reading 2

Fig. 6.2 S2 reading 1 vs. S2 reading 2

The $\mathrm{TIR}_{\mathrm{all}}$ of $\mathrm{MSD}_{\mathrm{total}}$ relative to $\mathrm{MSD}_{\mathrm{intra}}$ is estimated to be 0.699, with a 95% upper confidence limit of 0.817, which is within any clinically relevant criterion for claiming individual agreement. The IIR of $\mathrm{MSD}_{\mathrm{intra}_{S_1}}$ relative to $\mathrm{MSD}_{\mathrm{intra}_{S_2}}$ is estimated to be 1.010, with a 95% confidence interval of (0.752, 1.357), which indicates that the precisions of the two systems are not statistically different. We would claim equivalence if we considered a precision deviation of less than 40% clinically acceptable. Based on the results of TIR and IIR, we conclude that the two systems can be used interchangeably.

Fig. 6.3 S1 reading 1 vs. S2 reading 1

6.7.4 Example 4: Individual Bioequivalence

To study individual bioequivalence, we downloaded a set of data with a four-period, four-sequence crossover design from the FDA web site. We do not know the product name or the manufacturer. There were 40 healthy volunteers in the study. The design of the study is given in the table below, where T represents the test compound and R represents the reference compound.

                Period
Seq.    1    2    3    4
1       T    T    R    R
2       R    R    T    T
3       R    T    T    R
4       T    R    R    T

For simplicity, we study only the area under the curve (AUC), and we assume that the data have a proportional error structure. To save space, we do not list the original data, but we will shortly present the condensed data. We first tested the period and sequence effects using a mixed effects model on the log-transformed AUC and found no evidence of these two effects ($p > 0.5$). Therefore, we list the data in the format T1, T2, R1, and R2, representing periods 1 and 2 of the test and reference compounds, as shown in Table 6.1. Note that subject 16 had missing data, and subjects 906, 908, 921, and 932 were recruited to replace the four dropout subjects numbered 6, 8, 21, and 32.

Table 6.1 Data listing

Subject   T1      T2      R1      R2
1         13      16.5    16.2    9.35
2         27.3    58.6    23      33.3
3         8.11    9.28    63.1    4.76
4         4.62    5.88    6.15    6.78
5         3.77    6       6.9     6.72
7         20      17.4    28.6    26.1
9         15.6    14.1    8.95    17.6
10        5.66    4.56    3.62    5.07
11        17.4    21.5    13.4    13.1
12        4.9     4.83    4.13    3.07
13        39.2    32.8    33.7    28.5
14        6.78    5.34    10.2    16.7
15        15.8    30.7    20.5    17.9
16        .       .       .       .
17        10.3    12.2    11.5    9.58
18        31.6    36.7    22.4    25.2
19        11.5    10.8    18.3    32.4
20        8.29    9.31    6.26    10.5
22        6.17    7.86    7.98    4.54
23        44.3    22.2    35      20.3
24        22.5    60.2    31.5    15.1
25        9.63    5.76    8.34    3.85
26        5.35    5.09    5.33    4.78
27        8.95    4.82    4.18    10.2
28        5.1     9.68    7.08    9.67
29        6.53    3.7     7.88    6.23
30        3.05    5.12    2.29    5.48
31        7.41    5.98    5.35    5.79
33        0.82    1.91    2.63    0.83
34        3.43    5.34    6.91    7.66
35        8.58    8.48    5.71    6.64
36        8.34    9.12    6.9     12.2
37        15.6    15.8    12.3    14.6
38        3.65    5.63    8.81    13.6
39        6.72    5.58    11.6    7.38
40        20.4    20.6    19.3    20.5
906       7.11    5.41    6.55    4.55
908       11.9    12.6    8.68    6.48
921       14.7    13.6    12.9    24.5
932       10.6    20.8    9.09    13.7

T1: AUC value of the period-1 test compound
T2: AUC value of the period-2 test compound
R1: AUC value of the period-1 reference compound
R2: AUC value of the period-2 reference compound

The criteria for claiming individual bioequivalence under the FDA guidance are:

1. Reference scale:

$$\mathrm{IBC} = \frac{E(y_{iTl} - y_{iRl'})^2 - E(y_{iRl} - y_{iRl'})^2}{E(y_{iRl} - y_{iRl'})^2 / 2} = \frac{(\mu_T - \mu_R)^2 + \sigma_D^2 + \sigma_{WT}^2 - \sigma_{WR}^2}{\sigma_{WR}^2} < 2.5, \quad \text{if } \sigma_{WR}^2 > \sigma_{W0}^2. \qquad (6.34)$$

2. Constant scale:

$$\frac{(\mu_T - \mu_R)^2 + \sigma_D^2 + \sigma_{WT}^2 - \sigma_{WR}^2}{\sigma_{W0}^2} < 2.5, \quad \text{if } \sigma_{WR}^2 \le \sigma_{W0}^2, \qquad (6.35)$$

where $\sigma_D^2 = \sigma_{BT}^2 + \sigma_{BR}^2 - 2\sigma_{RT}$; $\mu_T$ and $\mu_R$ are the means, $\sigma_{BT}^2$ and $\sigma_{BR}^2$ are the between-subject variances, $\sigma_{WT}^2$ and $\sigma_{WR}^2$ are the within-subject variances, and $\sigma_{RT}$ is the between-subject covariance, of the test and reference compounds, respectively. Finally, $\sigma_{W0}^2$ is the cutoff value based on the within-subject variance of the reference compound. The criterion 2.5 is the aggregate allowance of $\ln(\mu_T/\mu_R) = \ln(1.25)$, $\sigma_{WT}^2 - \sigma_{WR}^2 = 0.02$, $\sigma_D^2 = 0.03$, and $\sigma_{W0}^2 = 0.04$.

The FDA individual bioequivalence criteria have not been used widely, even by FDA staff, for the following reasons:

1. It is difficult for pharmacists and statisticians to understand the meaning of the criteria.
2. The concept is far more complicated than average or population bioequivalence.
3. For statisticians, it is complicated to compute the confidence limits of the estimates of the relative and constant scales unless a program or macro is ready at hand.
4. Perhaps the biggest problem is that there is a discontinuity region between the reference and constant scales. That is, when the estimate of $\sigma_{WR}^2$ is near $\sigma_{W0}^2$ within natural random fluctuation, there is a penalty of falling into the stricter constant-scale criterion by chance.
Using the definitions provided in Chapters 5 and 6, we can much better interpret and understand these scales and provide tools (see Chapter 7) to perform the analysis. See Section 6.4.1 for the relationship between TIR and the relative scale; the relationship between $\mathrm{TIR}_R$ and the relative scale is given in (6.25). Using our definition, the relative scale criterion means that $\mathrm{MSD}_{\mathrm{total}_{T,R}}$ cannot be more than 2.25 times $\mathrm{MSD}_{\mathrm{intra}_R}$, or $\mathrm{TIR}_R < 2.25$, with $100(1-\alpha)\%$ confidence.

We now examine the meaning of $\sigma_{W0}^2 = 0.04$. From (2.8) and (5.11), we can convert $\sigma_{W0}^2 = 0.04$ into a $\mathrm{TDI}_{0.8}$ of 0.3625 on the log scale, because $\mathrm{MSD}_{\mathrm{intra}_R} = 2\sigma_{W0}^2 = 0.08$. From (5.33), the $\mathrm{TDI\%}_{0.8}$ becomes 43.7%. Therefore, $\sigma_{W0}^2 = 0.04$ means that 80% of the duplicate values of the reference compound are within 43.7% of each other. If $\mathrm{TDI\%}_{0.8}$ is greater than 43.7%, we must use the reference scale according to (6.34) to ensure that $\mathrm{MSD}_{\mathrm{total}_{T,R}}$ is not more than 2.25 times $\mathrm{MSD}_{\mathrm{intra}_R}$ with $100(1-\alpha)\%$ confidence. Otherwise, we would use the constant scale.

We now examine the meaning of the constant scale. Note that $\mathrm{MSD}_{\mathrm{total}_{T,R}} = (\mu_T - \mu_R)^2 + \sigma_D^2$ under model (6.1). If we assume that $\sigma_{WT}^2 - \sigma_{WR}^2 = -0.02$, 0, or 0.02, then (6.35) under model (6.1) becomes $\mathrm{MSD}_{\mathrm{total}_{T,R}} < 0.12$, 0.1, or 0.08, respectively. Again, when $\sigma_{W0}^2 = 0.04$, from (2.8), (5.26), and (5.33), the $\mathrm{TDI\%}_{\mathrm{total}(0.8)}$ becomes 55.9%, 50.0%, or 43.7%, respectively. This means that approximately 80% of the individual AUC values from the test compound cannot deviate by more than 55.9%, 50.0%, or 43.7%, respectively, from the individual AUC values of the reference compound with $100(1-\alpha)\%$ confidence. Compared to the $\mathrm{TDI\%}_{\mathrm{intra}(0.8)} = 43.7\%$ based on the cutoff value $\sigma_{W0}^2 = 0.04$ of the intrareference compound, this constant-scale criterion appears too stringent to meet.
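The arithmetic behind these percentages can be verified in a few lines of R, using only the relations $\mathrm{MSD} = 2\sigma_{W0}^2$ (or the constant-scale bounds above), $\mathrm{TDI}_{0.8} = z_{0.9}\sqrt{\mathrm{MSD}}$, and $\mathrm{TDI\%} = (e^{\mathrm{TDI}} - 1) \times 100$:

# sigma2_W0 = 0.04 gives MSD_intraR = 0.08; the constant-scale bounds are
# 0.08, 0.10, and 0.12. Convert each MSD to TDI%_0.8 on the log scale.
msd <- c(0.08, 0.10, 0.12)
tdi <- qnorm(0.9) * sqrt(msd)      # 0.3625, 0.4053, 0.4439
round(100 * (exp(tdi) - 1), 1)     # 43.7, 50.0, 55.9 (%)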
Let us summarize the interpretation of the FDA individual bioequivalence criteria under model (6.1). If 80% of the duplicate values of the reference compound deviate by more than 43.7% from each other, then $\mathrm{MSD}_{\mathrm{total}_{T,R}}$ cannot be more than 2.25 times $\mathrm{MSD}_{\mathrm{intra}_R}$ with $100(1-\alpha)\%$ confidence. Otherwise, we must show that approximately 80% of the individual AUC values from the test compound do not deviate by more than 44% to 56% from the individual AUC values of the reference compound with $100(1-\alpha)\%$ confidence. The conventional $\alpha$ is set at 0.05, one-tailed. It appears that there is room for improvement in redefining these criteria.
We now analyze the data in this example. We begin by calculating the $\mathrm{TDI\%}_{0.8}$ between R1 and R2 using (2.8), assuming the proportional error case. Figure 6.4 shows the agreement plot of R1 and R2 on the log2 scale. It is clear that the data lack precision: $\mathrm{TDI\%}_{0.8} = 124.4\%$, which means that 80% of the R1 and R2 pairs can deviate by up to 124% of each other. This is much higher than the 43.7% cutoff value, and therefore we would use the relative scale according to the FDA rules.

Figure 6.5 shows the agreement plot of T1 and R2 on the log2 scale, representing the total agreement among the individual AUC values of the test and reference compounds. The precision of these data is slightly better than that between R1 and R2, but in general, T1 has lower AUC values than R2. The $\mathrm{TIR}_R$ is estimated to be 0.691, with a one-sided 95% upper confidence limit of 1.076, indicating that $\mathrm{MSD}_{\mathrm{total}_{T,R}}$ is at most 1.076 times $\mathrm{MSD}_{\mathrm{intra}_R}$ with $100(1-\alpha)\%$ confidence, which is much less than 2.25. Therefore, individual bioequivalence is accepted according to FDA's rules because of the large within-reference deviation.

Fig. 6.4 Agreement between the duplicates of the reference compound

Fig. 6.5 Agreement between the first test compound reading and the second reference compound
reading

Our statistical tools allow us to go one step further by examining the precision of the test compound relative to that of the reference compound, namely, the IIR of $\mathrm{MSD}_{\mathrm{intra}_T}$ relative to $\mathrm{MSD}_{\mathrm{intra}_R}$. Figure 6.6 presents the agreement plot of T1 and T2 on the log2 scale. The precision of the test compound appears tighter than that of the reference compound, shown in Fig. 6.4. Using the statistical tools presented in Chapter 2, the $\mathrm{TDI\%}_{0.8}$ of T1 and T2 is estimated to be 70.2%, with a 95% upper confidence limit of 90.3%. The IIR of $\mathrm{MSD}_{\mathrm{intra}_T}$ relative to $\mathrm{MSD}_{\mathrm{intra}_R}$ is estimated to be 0.432, with a 95% confidence interval of 0.168 to 1.115.

Fig. 6.6 Agreement between the duplicate of the test compound

6.8 Discussion

We have proposed two comparative agreement indices: TIR for interchangeability, and IIR for comparing precision measurements between multiple raters with multiple readings. TIR is a noninferiority assessment for determining whether several assays/raters can be used interchangeably, in terms of its ratio to the intrarater precision, for scenarios with and without a reference. IIR is an assessment for evaluating whether the precision of one or multiple raters is better than, equal to, or worse than that of other raters.
The approach we have proposed here is very general in the sense that any one or multiple raters can be selected as the test or reference raters. Users have substantial flexibility to make comparisons among any number of raters of interest. The FDA's method for evaluating individual bioequivalence under the relative scale becomes a special case of ours. The examples in this chapter show that our approaches have wide application in a variety of agreement studies.
We use the GEE methodology because it is applicable to continuous, ordinal, and binary data. Lin, Hedayat, and Wu (2007) showed, in their simulation study when m = 1, that using ML or RML as proposed by Carrasco and Jover (2003) works well for ordinal data, but not for binary data. For ordinal data, we expect that the ML or RML approach would work well for the comparative model as well. ML-based TIR and IIR should be an interesting topic for future research.

The bound for claiming individual agreement, by FDA's definition, is based on the upper limit TIR_R = 2.25. It is likely that the 2.25 TIR criterion is too stringent for categorical data. These limits have been established based on experience with bioavailability data and with the average bioequivalence criterion. Users should carefully choose the boundary based on their specific clinical and historical assessments.

Our approach is based on balanced data. Further research is needed to investigate


the case of unbalanced data with the possibility of covariate adjustments. For sample
size and power calculations of TIR and IIR, the reader is referred to the general
procedures outlined in Section 4.1.
There are relatively few references available regarding the topics in this chapter.
Westlake (1976) proposed a modification of the conventional confidence interval
method to obtain symmetric confidence intervals around 0.0 for bioequivalence
trials. FDA guidance for industry (2001) introduced the statistical approach of
evaluating individual bioequivalence between test and reference drugs, which was
largely based on Sheiner (1992). Barnhart, Kosinski, and Haber (2007) proposed
the concept of individual equivalence and extended FDA’s approach of comparing
two raters to the interchangeability among multiple raters. We have no knowledge
of any publications regarding IIR.
Chapter 7
Workshop

In this chapter, we will walk through three examples using SAS macros and R
functions. We will present the calling codes with in-depth explanations for each
macro or function. For the first two examples, one with continuous data and one
with categorical data, we will study from the basic models to more complicated
models using the material presented in Chapters 2, 3, 5, and 6. We will also include
an individual bioequivalence example using the material presented in Chapters 2, 5,
and 6.
We have produced and made available three SAS macros: Agreement, UnifiedAgreement, and TIR_IIR. To run one of these macros, we need to download it from one of the following websites:
1. http://www.uic.edu/-hedayat/
2. http://mayoresearch.mayo.edu/biostat/sasmacros.cfm
These websites also contain most of the data used in this book.
An R package named Agreement is also available containing functions
corresponding to the three SAS macros. These are also available for downloading
from the above websites, or they can be installed directly from the comprehensive
R archive network (CRAN).

7.1 Workshop for Continuous Data

In each of the following examples, we will first present the procedure based on the
SAS macros, and then present the corresponding R functions.


7.1.1 The Basic Model (m=1)

Let us begin with the continuous data based on Example 5.9.1. This example is also presented in Example 2.8.1 using the basic model, by taking the average of the duplicate readings for each of the HemoCue and Sigma assays. For Example 2.8.1, we execute the macro using the %agreement statement with the parameters defined below:

• Dataset: the name of your data. We must avoid using the dataset names c, cc, t, tt, tb, and p, because these dataset names are used in this macro.
• Y: reading of the test assay or rater that will be shown on the vertical axis of the agreement plot.
• V_label: label for the vertical axis of the agreement plot.
• X: reading of the target assay or rater that will be shown on the horizontal axis of the agreement plot.
• H_label: label for the horizontal axis (target) of the agreement plot.
• Min: minimum of the plotting range.
• Max: maximum of the plotting range.
• By:
  – For the constant error structure, this is the increment of the plotting range. For example, by=5.
  – For the proportional error structure, these are the log-scale increments between min and max. For example, if the data range is from 1 to 60, then min=1, max=64, by=2 4 8 16 32.
• Error: constant or proportional error structure.
  – error=const: the constant error structure. Here, TDI is expressed as an absolute difference with the same measurement unit as the original data.
  – error=prop: the proportional error structure. Here, TDI is expressed as a percent change. The natural log transformation will be applied to the data.
• CCC_a: a CCC allowance, which can be set as missing if there is no prespecified allowance.
• CP_a: a CP allowance that must be specified for computing TDI.
• TDI_a: a TDI allowance that must be specified for computing CP; it must be a percent value when error=prop is specified or an absolute difference when error=const is specified.
• Target: random or fixed.
• Alpha: 100(1−α)% one-tailed confidence limit. The default is 0.05.
• Dec: significant digits after the decimal point printed for TDI when error=const is specified. The default is dec=2.

The SAS macro Agreement.sas can be executed using the following code:

libname ex 'x:\xx\xxx\xxxx';

data e2_1;
set ex.example2_1;
HemocueAve=mean(of hemocue1 hemocue2);
SigmaAve=mean(of sigma1 sigma2);
run;

goptions reset=all ftext=swiss htext=1;

%inc 'x:\xxx\xxx\xxx\Agreement.sas';

ods rtf file='x:\xxx\xxx\....\xxx.rtf' style=styles.TablesRTF;

%agreement(dataset=e2_1, y=HemocueAve, x=SigmaAve, V_label=
HemaCue, H_label=Sigma, min=0, max=2000, by=250, CCC_a=0.9775,
CP_a=0.9, TDI_a=150, error=const, target=random, dec=1,
alpha=0.05);

ods rtf close;

In the code above, the dataset name is example2_1 in the libname ex statement. The variables hemocue1 and hemocue2 are the duplicate values of HemoCue, and the variables sigma1 and sigma2 are the duplicate values of Sigma. The data step is used to compute the average of the duplicates for each assay, namely HemoCueAve as the test assay, which will be shown on the vertical axis, and SigmaAve as the target assay, which will be shown on the horizontal axis. The ods rtf file statement is used to define the output destination where the agreement table and plot are located. The style=styles.TablesRTF option is a predefined style for the table layout, which is not needed if we use the default layout. The goptions reset=all ftext=swiss htext=1 statement defines our desired graphic options for the agreement plot. We then give the %inc statement with the location where the macro is stored. Most of the above data steps, output formats, and destinations will be needed for all of the SAS macros.

Example 2.8.1 assumes that the target values are random (target=random); the error structure is constant (error=const); the CCC allowance is 0.9775 (CCC_a=0.9775); the TDI allowance is 150 mg/dL (TDI_a=150); and the CP allowance is 0.9 (CP_a=0.9). The combination of the TDI allowance of 150 mg/dL and the CP allowance of 0.9 means that we would like to ensure that 90% of the HemoCue observations are within 150 mg/dL of the Sigma values. Note that the CCC allowance does not affect any of the calculations; we may omit it if we do not have an allowance (CCC_a=.). We also would like to compute the 95% upper or lower confidence limit for each agreement index (alpha=0.05). For creating the agreement plot, we need to specify the data range to be plotted (min=0, max=2000), with a linear increment of 250 mg/dL (by=250), as shown in Fig. 2.1. We would like to format the TDI output with one significant digit (dec=1). The xxx.rtf output file contains the table and plot as shown in Table 2.1 and Fig. 2.1.

Suppose we would like to assume that the error structure is proportional; we then change from error=const to error=prop. In this case, we need to change the plotting increments to log increments by making min=25, max=3200, and by=50 100 200 400 800 1600. We also need to change the TDI allowance to a percent change. We do not need to specify the significant digits for the TDI% output, because it is coded to print as a percentage with two significant digits after the decimal point. The code to execute the SAS macro is shown below:

%agreement(dataset=e2_1, y=HemoCueAve, x=SigmaAve,
V_label=HemoCue, H_label=Sigma, min=25, max=3200, by=50 100
200 400 800 1600, CCC_a=0.9775, CP_a=0.9, TDI_a=50,
error=prop, target=random, alpha=0.05);

We will then have the outputs shown in Table 7.1 and Fig. 7.1. From the agreement plot of Fig. 7.1, we immediately see that the proportional error assumption is not appropriate, with larger variations at the lower concentrations. Therefore, the results shown in Table 7.1 are irrelevant. When the target values are assumed fixed, as in Example 2.8.2, we can simply use target=fixed.

Similarly, the corresponding R function agreement for Example 2.8.1 can be executed with the following R code:
library(Agreement);

HemocueAve=apply(Example2_1[,c("HEMOCUE1", "HEMOCUE2")],1,
mean);
SigmaAve=apply(Example2_1[,c("SIGMA1","SIGMA2")],1,mean);

#For constant error structure.

agr_c=agreement(y=HemocueAve,x=SigmaAve,V_label="Hemacue",
H_label="Sigma", min=0, max=2000, by=250, CCC_a=0.9775,
CP_a=0.9,TDI_a=150,error="const", target="random", dec=1,
alpha=0.05)
html.report(agr_c, file="report_1")

#For proportional error structure.

agr_p=agreement(y=HemocueAve,x=SigmaAve,V_label="Hemacue",
H_label="Sigma", min=25, max=3200, by=c(50,100,200,400,800,
1600), CCC_a=0.9775, CP_a=0.9, TDI_a=50, error="prop",
target="random", alpha=0.05)
html.report(agr_p, file="report_2")

All the parameters of the R functions have the same definition as the SAS macro,
except that there is no dataset parameter, and we must add double quotation marks
for some parameters. The function html.report is used to generate an html
file containing the information as shown in Table 2.1 and Fig. 2.1. For a detailed
explanation of the R functions, please read the R help document.

Table 7.1 HemoCue and Sigma readings on measuring DCLHb assuming proportional error

Statistics       CCC     Precision  Accuracy  TDI%_0.9  CP_50%  RBS
Estimate         0.9744  0.9752     0.9992    49.78     0.9001  0.02
95% Conf. limit  0.9691  0.9701     0.9976    54.06     0.8748  .
Allowance        0.9775  .          .         50.00     0.9000  .

n = 299. The relative bias squared (RBS) must be less than 1 or 8 for a CP_a of 0.9 or 0.8, respectively, in order for the approximated TDI to be valid. Otherwise, the TDI estimate is conservative, depending on the RBS value.

Fig. 7.1 HemoCue and Sigma readings on measuring DCLHb assuming proportional error

7.1.2 Unified Model

To assess the intraassay, interassay, and total-assay agreement as shown in Example 5.9.1, we need to call the macro using the %UnifiedAgreement statement with the parameters defined below:

• Dataset: name of your data. We must avoid using the dataset names a, one, y, x, id, rating, outb, outc, method, best, par, r, and r1, because these dataset names are used in this macro.
• k: number of methods/raters/instruments/assays, etc.
• m: number of replications for each of the k raters.
• var: dependent variable names to be evaluated, e.g., y1_1, y1_2, ..., y1_m, y2_1, y2_2, ..., y2_m, ..., yk_1, yk_2, ..., yk_m.
• CCC_a: CCC allowance when m=1.
• CCC_a_intra: intra-CCC allowance.
• CCC_a_inter: inter-CCC allowance.
• CCC_a_total: total-CCC allowance.
  The above CCC allowances can be set as missing if there are no prespecified allowances.
• CP_a: coverage probability (CP) allowance for continuous data.
• TDI_a: TDI allowance when m=1 for continuous data.
• TDI_a_intra: intra-TDI allowance for continuous data.
• TDI_a_inter: inter-TDI allowance for continuous data.
• TDI_a_total: total-TDI allowance for continuous data.
  The above TDI_a values must be specified as a percent change when error=prop is specified.
• tran=1: transformations such as Z, logit, and log will be used for statistical inference. tran=0: no transformation will be used for statistical inference. tran=1 can be used for categorical data, but the TDI and CP outputs would become irrelevant; therefore, tran=0 is recommended for all categorical data and tran=1 is recommended for all continuous data.
• Error=const: constant error structure for continuous data. When error=const, TDI is an absolute difference with the same measurement unit as the original data. Error=prop: proportional error structure for continuous data. When error=prop, TDI is a percent change, and the log transformation will be applied to the original data. For categorical data, use Error=const.
• Dec: significant digits after the decimal point printed for TDI when error=const is specified. The default is dec=2.
• Alpha: 100(1−α)% one-tailed confidence limit. The default is 0.05.
For calculating the intraagreement, interagreement, and total-agreement indices as shown in Example 5.9.1, we would execute the following code. Note that UnifiedAgreement.sas does not produce agreement plots; therefore, the first three %agreement macro calls shown below are meant to produce the plots within HemoCue, within Sigma, and between the two methods based on the averages.
%agreement(dataset=ex.example2_1, y=hemocue2,x=hemocue1,
min=0, max=2000, by=250, V_label=HemoCue 2, H_label=HemoCue 1,
error=const, target=random, CCC_a=0.9775, CP_a=0.9, alpha=0.05,
TDI_a=150, dec=1);

%agreement(dataset=ex.example2_1, y=Sigma2, x=sigma1,


min=0, max=2000, by=250, V_label=Sigma 2, H_label=Sigma 1,
error=const, target=random, CCC_a=0.9775, CP_a=0.9, alpha=0.05,
TDI_a=150, dec=1);

%agreement(dataset=e2_1, y=HemocueAve, x=SigmaAve,


min=0,max=2000, by=250, V_label=HemoCue, H_label=Sigma,
error=const,target=random, CCC_a=0.9775, CP_a=0.9, alpha=0.05,
TDI_a=150, dec=1);

%UnifiedAgreement(dataset=ex.example2_1,var=hemocue1
hemocue2 sigma1 sigma2,k=2,m=2,CCC_a_intra=0.9943,
CCC_a_inter=0.9775,CCC_a_total=0.9775,
CP_a=0.9, tran=1, TDI_a_intra=75, TDI_a_inter=150,
TDI_a_total=150, error=const, dec=1, alpha=0.05);

In UnifiedAgreement.sas for Example 5.9.1, the parameter specifying the dependent variables must be in the order y1_1, y1_2, ..., y1_m, y2_1, y2_2, ..., y2_m, ..., yk_1, yk_2, ..., yk_m (var=hemocue1 hemocue2 sigma1 sigma2). We need to specify the number of raters (k=2) and the number of replicates (m=2) per rater per subject. We specify the CCC allowances for intra (CCC_a_intra=0.9943), inter (CCC_a_inter=0.9775), and total (CCC_a_total=0.9775), and the TDI allowances for intra (TDI_a_intra=75), inter (TDI_a_inter=150), and total (TDI_a_total=150). The CP allowance is set at 0.9 to capture 90% of observations (CP_a=0.9) for all intra, inter, and total CPs. We specify tran=1, indicating that the Z-transformation for CCCs, the logit transformation for CPs and accuracy coefficients, and the log transformation for TDIs will be used for statistical inference. For categorical data, we may specify tran=0, since the above transformations are not necessarily needed; all TDI and CP indices would become irrelevant for categorical data and thus are not printed. The results are shown in Example 5.9.1.
Similarly, the corresponding R function unified.agreement for Example 5.9.1 can be executed with the following code:

ua=unified.agreement(dataset=Example2_1,
var=c("HEMOCUE1","HEMOCUE2","SIGMA1","SIGMA2"),
k=2, m=2, CCC_a_intra=0.9943,
CCC_a_inter=0.9775, CCC_a_total=0.9775, CP_a=0.9, tran=1,
TDI_a_intra=75, TDI_a_inter=150, TDI_a_total=150,
error="const", dec=1, alpha=0.05)
html.unified_agreement(ua)

The parameter dataset is the name of the dataset, which must contain the variables
in the var parameter. The order of the entries specified in the parameter var
should follow the same rule of that in the SAS macro. If the parameter var
is not given, the R function will use all the variables in the input dataset.
The other parameters have the same definition as the SAS macro. The function
html.unified agreement is used to generate an html file containing the
summary table of unified agreement.

7.1.2.1 Unified Model with m=1

The macro UnifiedAgreement.sas can also be applied to cases in which there are no replicates (m=1). When k = 2, the estimated confidence limit is slightly different from that estimated by the macro agreement.sas, because the latter assumes normality for deriving the variances of the estimates of the agreement indices, while the former uses the GEE approach without assuming normality. For more robustness of the confidence limits and/or when k > 2, we might need to use the unified macro. However, to produce the agreement plots, we would need to call the agreement.sas macro. In addition, the definitions of precision and accuracy are slightly different, because the unified approach assumes that the variances of the assays or raters are equal, and it utilizes the approximation according to (5.62) for CP.

For Example 2.8.1, if we want to use the more robust GEE approach for the case of m = 1, we can run the following code after taking the average of the replicates for HemoCue and Sigma, as we did when running %agreement earlier:
%UnifiedAgreement(dataset=e2_1, var=HemocueAve SigmaAve,
k=2, m=1, CCC_a=0.9775,CP_a=0.9, tran=1, TDI_a=150,
error=const, dec=1,alpha=0.05);

The results are shown in Table 7.2.

Table 7.2 HemoCue and Sigma readings on measuring DCLHb using GEE

Statistics       CCC     Precision  Accuracy  TDI_0.9  CP_150  RBS
Estimate         0.9866  0.9866     1.0000    127.3    0.9474  0.00
95% Conf. limit  0.9825  0.9825     0.9987    145.9    0.9228  .
Allowance        0.9775  .          .         150.0    0.9000  .

For k = 2, n = 299, and m = 1. The relative bias squared (RBS) must be less than 1 or 8 for a CP_a of 0.9 or 0.8, respectively, in order for the approximated TDI to be valid.

As expected, these results are exactly the same as those shown in Table 5.1 under interagreement. Compared to Table 2.1, the CCC and TDI estimates are identical, and the precision and accuracy coefficients are almost the same. The CP estimate using GEE is 0.947, compared to 0.946 in Table 2.1. These are almost identical because of an almost perfect accuracy, indicating that their variances are the same. The lower confidence limits for the CCC, the precision and accuracy coefficients, and the CP are slightly smaller than those shown in Table 2.1. Correspondingly, the upper confidence limit for TDI is slightly larger than that shown in Table 2.1, indicating that for this example the GEE approach is slightly more conservative.

Similarly, the corresponding R function for Example 2.8.1 by the unified approach can be executed with the following code:

unified.agreement(dataset=cbind(HemocueAve,SigmaAve),
k=2, m=1, CCC_a=0.9775, CP_a=0.9, tran=1, TDI_a=150,
error="const", dec=1, alpha=0.05);

7.1.3 TIR and IIR

To calculate the TIR and IIR indices introduced in Chapter 6, we need to call the SAS macro TIR_IIR.sas using the %TIR_IIR statement with the parameters defined below:

• dataset: name of your data. We must avoid using the dataset names a, b, c, t1, t2, ttt, bt, one, and final, because these dataset names are used in this macro.
• k: number of methods/raters/instruments/assays, etc.
• m: number of replications for each of the k raters.
• var: dependent variable names to be evaluated, e.g., y1_1, y1_2, ..., y1_m, y2_1, y2_2, ..., y2_m, ..., yk_1, yk_2, ..., yk_m.
• TIR_test: the selected test raters for calculating TIR; must be input in the format ('1','2','3',...,'k'), where '1' represents the first m columns for rater 1, '2' represents the second m columns for rater 2, and 'k' represents the last m columns for rater k. When calculating multiple TIRs, the test raters for each TIR must be separated by #. For example, when k = 3, we specify ('1','3')#('1','2','3')#('3')#('2')#('1','2') for each of the five sets of test raters.
• TIR_ref: the selected reference raters for calculating TIR, corresponding to TIR_test. If ref=(all) is specified, then the intrarater MSDs of all raters will be used as the denominator. When calculating multiple TIRs, the corresponding reference raters must be separated by #. For example, use ('2')#(all)#('1','2')#('1')#('1','2') to represent the five selected sets of reference raters. When TIR_ref is not specified as (all), each TIR is computed as the total MSD of the test vs. selected reference raters relative to the intra MSD of the selected reference raters. When TIR_ref is specified as (all), the macro assesses the average of the total MSD of all raters relative to the average of the intra MSD of all raters. For the first TIR example shown in TIR_test and TIR_ref, the macro would assess the average of the total MSD of "raters 1 vs. 2 and raters 3 vs. 2" relative to the intra MSD of "rater 2." For the second TIR example, the macro would assess the average of the total MSD of all raters relative to the average of the intra MSD of all raters. For the third TIR example, the macro would assess the average of the total MSD of "raters 3 vs. 1 and raters 3 vs. 2" relative to the average of the intra MSD of "raters 1 and 2." For the fourth TIR example, the macro would assess the total MSD of "raters 2 and 1" relative to the intra MSD of "rater 1." For the fifth TIR example, the macro would assess the total MSD of "raters 1 and 2" relative to the average of the intra MSD of "raters 1 and 2."
• IIR_test: the selected test raters for calculating IIR, which must be input in the format ('1','2','3',...,'k'). When calculating multiple IIRs, the test raters for each IIR must be separated by #. For example, when k = 3, specify ('1')#('2')#('3')#('1').
• IIR_ref: the selected reference raters for calculating IIR. When calculating multiple IIRs, the corresponding reference raters must be separated by #. For example, specify ('2','3')#('1','3')#('1','2')#('2'). Each set of reference raters must be mutually exclusive of its corresponding set of selected test raters.
• Error=const for the constant error structure for continuous data. Error=prop for the proportional error structure for continuous data; here, the log transformation will be applied to the data. For categorical data, use Error=const.
• Alpha: 100(1−α)% one-tailed upper confidence limit for TIR or two-tailed confidence interval for IIR. The default is 0.05.
• TIR_a: allowance for TIR.
To calculate TIR and IIR for Example 5.9.1, we would execute the following
code:
%TIR_IIR(dataset=ex.example2_1, var=hemocue1 hemocue2
sigma1 sigma2, k=2, m=2, TIR_test=('1','2'), TIR_ref=(all),
IIR_test=('1'), IIR_ref=('2'),
error=const, alpha=0.05, TIR_a=.);

The results are shown in Table 7.3. The TIR of MSD_total relative to MSD_intra was estimated to be 10.09, with a one-sided 95% upper confidence limit of 13.66, which is much larger than any clinically meaningful criterion, as is evident by comparing Fig. 5.3 to Figs. 5.1 and 5.2. The IIR of MSD_intra HemoCue relative to MSD_intra Sigma was estimated to be 0.348, with a 95% confidence interval of 0.189–0.643, which indicates that HemoCue had better within-assay precision than Sigma, as is evident by comparing Figs. 5.1 and 5.2.

Table 7.3 TIR and IIR between HemoCue ('1') and Sigma ('2') with duplicates

                 TIR                                  IIR
Statistics       Total('1','2') vs (all)/Intra(all)   Intra('1')/Intra('2')
Estimate         10.094                               0.348
95% Conf. limit  13.657                               (0.189, 0.643)
Compared to      .                                    1.00

For k = 2, n = 299, and m = 2. One-tailed upper limit for TIR and two-tailed interval for IIR.

Similarly, the corresponding R function TIR_IIR for the comparative agreement approach can be performed with the following code:

TIR_IIR(dataset=Example2_1, var=c("HEMOCUE1","HEMOCUE2",
"SIGMA1","SIGMA2"),
k=2, m=2, TIR_test=c("1,2"), TIR_ref=c("All"), IIR_test=c("1"),
IIR_ref=c("2"), error="const", alpha=0.05, TIR_a=.);


All the parameters have the same definitions as in the SAS macro. However, the formats of the parameters TIR_test, TIR_ref, IIR_test, and IIR_ref are slightly different: TIR_test=c("1,2") means that the selected test raters for calculating TIR are the first and second raters. If there are multiple TIRs, each set of test raters must be an entry in the sequence, using double quotation marks and separated by commas. For example, when k = 3, we may specify TIR_test=c("1,3","1,2,3","3","2","1,2"). The formats of TIR_ref, IIR_test, and IIR_ref are defined similarly.

7.2 Workshop for Categorical Data

7.2.1 The Basic Model (m=1)

We begin the workshop for categorical data using Example 5.9.3 by examining the kappa of the three examiners based on their first readings. The relevant frequency tables can be seen among Tables 5.7–5.12. In this example, the variable names for the three examiners and their duplicates are m1_1, m1_2, m2_1, m2_2, m3_1, and m3_2. We are interested here in the kappa of m1_1, m2_1, and m3_1 only. We then execute the following code:

%UnifiedAgreement(dataset=ex.Example5_3, var=m1_1 m2_1
m3_1, k=3, m=1, ccc_a=., tran=0, alpha=0.05);

The results are shown in Table 7.4.

Table 7.4 Agreement statistics among three examiners based on reading 1

Statistics       CCC     Precision  Accuracy
Estimate         0.4958  0.5034     0.9849
95% Conf. limit  0.4289  0.4370     0.9745
Allowance        .       .          .

For k = 3, n = 400, and m = 1.

These values are slightly lower than those shown under total agreement in Table 5.13. Again, the results show that the disagreement was largely due to imprecision rather than inaccuracy. For categorical data, TDI and CP are not meaningful and therefore are not computed.

Similarly, the corresponding R function for Example 5.9.3 using the unified approach can be executed with the following code:

unified.agreement(dataset=Example5_3, var=c("m1_1",
"m2_1","m3_1"), k=3, m=1, CCC_a=NA, tran=0, alpha=0.05);

Table 7.5 Agreement statistics among the first two examiners based on reading 1

Statistics       CCC     Precision  Accuracy
Estimate         0.5147  0.5148     0.9998
95% Conf. limit  0.4225  0.4226     0.9982
Allowance        .       .          .

For k = 2, n = 400, and m = 1.

7.2.1.1 Equivalence to the SAS Procedure FREQ when k=2 and m=1

When k = 2, we can also compute the kappa-related results by running the SAS procedure FREQ. To demonstrate the equivalence, we first examine the agreement between the first readings of examiners 1 and 2 by executing the following code:

%UnifiedAgreement(dataset=ex.Example5_3, var=m1_1 m2_1,
k=2, m=1, CCC_a=., tran=0, alpha=0.05);

The results are displayed in Table 7.5.

We then execute the following SAS code:

proc freq data=ex.Example5_3;
table m1_1*m2_1/agree (WT=FC) alpha=0.1;
run;

Note that we use an alpha value of 0.1 because we want only the one-tailed lower confidence limit. The (WT=FC) option is not necessary in this case, because this example has a binary outcome, but we leave it there in case the data have an ordinal outcome and we would like to use the squared distance function. The results are shown in the following SAS output:
Simple Kappa Coefficient

Kappa                    0.5147
ASE                      0.0560
90% Lower Conf. limit    0.4225
90% Upper Conf. limit    0.6069

Sample Size = 400
The kappa and its lower confidence limit are exactly the same as those shown in Table 7.5. The proof of this equivalence is given in Section 5.7.2. Note that the SAS procedure FREQ cannot compute kappa for more than two raters.
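To make the equivalence concrete, the simple kappa can be reproduced by hand from the 2 × 2 cross-tabulation. Below is a minimal R sketch, assuming the 400 paired binary readings are the columns m1_1 and m2_1 of a data frame Example5_3:

# Cohen's simple kappa from observed and chance agreement.
tab <- table(Example5_3$m1_1, Example5_3$m2_1)
p   <- tab / sum(tab)                 # cell proportions
po  <- sum(diag(p))                   # observed agreement
pe  <- sum(rowSums(p) * colSums(p))   # chance agreement from the margins
(po - pe) / (1 - pe)                  # reproduces the kappa of 0.5147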
7.2.2 Unified Model

To calculate the intraassay, interassay, and total-assay agreement indices for categorical data with k ≥ 2 and m > 1 using the data of Example 5.9.3, we execute the
Table 7.6 TIR and IIR among three examiners with duplicates

                    TIR                                    IIR
Statistics          Total('1','2','3')vs(all)/Intra(all)   Intra('3')/Intra('1','2')
Estimate            1.560                                  1.323
95% Conf. limit*    2.376                                  (0.931, 1.881)
Compared to         2.5                                    1.00

For k = 3, n = 400, and m = 2.
*One-tailed upper limit for TIR and two-tailed interval for IIR.
following code. The results are shown in Table 5.13 of Chapter 5. The frequency
tables can be seen in Tables 5.7–5.12.
%UnifiedAgreement(dataset=ex.Example5_3, var=m1_1 m1_2
m2_1 m2_2 m3_1 m3_2, k=3,m=2,CCC_a_intra=0.64,
CCC_a_inter=0.51,CCC_a_total=.,tran=0, alpha=0.05);
Similarly, the corresponding R function for Example 5.9.3 with the unified approach can be executed with the following code:
unified.agreement(dataset=Example5_3, var=c("m1_1","m1_2",
"m2_1","m2_2","m3_1","m3_2"), k=3, m=2, CCC_a_intra=0.64,
CCC_a_inter=0.51, CCC_a_total=NA, tran=0, alpha=0.05);
7.2.3 TIR and IIR
To study the TIR and IIR information as shown in Example 6.7.2, using the same data as in Example 5.9.3, we execute the following code:
%TIR_IIR(dataset=ex.Example5_3, var=m1_1-m1_2
m2_1-m2_2 m3_1-m3_2, k=3, m=2, TIR_test=('1','2','3'),
TIR_ref=(all), IIR_test=('3'), IIR_ref=('1','2'),
error=const, alpha=0.05, TIR_a=2.5);
The results are shown in Table 7.6, with the description for TIR given in
Section 6.7.2.
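In Table 7.6, the one-tailed 95% upper limit of the TIR, 2.376, falls below the allowance of 2.5, and the two-tailed 95% interval of the IIR, (0.931, 1.881), contains 1.00, so the intrarater precision of examiner 3 is not significantly different from that of examiners 1 and 2.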
Similarly, the corresponding R function for the comparative approach can be executed with the following code:
TIR_IIR(dataset=Example5_3,var=c("m1_1","m1_2","m2_1",
"m2_2","m3_1","m3_2"), k=3, m=2,TIR_test=c("1,2,3"),
TIR_ref=c("All"), IIR_test=c("3"), IIR_ref=c("1,2"),
error="const", alpha=0.05, TIR_a=2.5);
Table 7.7 TIR and IIR between test (1) and reference (2) compounds with duplicates

                    TIR                             IIR
Statistics          Total('1')vs('2')/Intra('2')    Intra('1')/Intra('2')
Estimate            0.6907                          0.4324
95% Conf. limit*    1.0761                          (0.1676, 1.1151)
Compared to         2.25                            1.00

For k = 2, n = 39, and m = 2.
*One-tailed upper limit for TIR and two-tailed interval for IIR.
7.3 Individual Bioequivalence
In Example 6.7.4, we first need to compute TDI%0.8 between the duplicate values of the reference compound, namely, R1 and R2. We can use either the basic macro agreement.sas or the unified macro UnifiedAgreement.sas. To use the basic model, we execute the following code:
%agreement(dataset=ex.Example6_4, y=R1, x=R2,
V_label=R1, H_label=R2, min=2, max=64, by=4 8 16, CCC_a=0.9,
CP_a=0.8, TDI_a=50, error=prop, target=random, alpha=0.05);
The TDI%0.8 was estimated to be 124.4%, which is greater than 43.7%, indicating that the reference scale should be used. To calculate the TIR and IIR with their 95% confidence limits, we execute the following code:
%TIR_IIR(dataset=ex.Example6_4, var=t1 t2 r1 r2, k=2,
m=2, TIR_test=(‘1’), TIR_ref=(‘2’), IIR_test=(‘1’),
IIR_ref=(‘2’), error=prop, alpha=0.05, TIR_a=2.25);
The results are shown in Table 7.7, with the description for TIR given in the last paragraph of Section 6.7.2.
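In Table 7.7, the one-tailed 95% upper limit of the TIR, 1.0761, falls below the allowance of 2.25, and the two-tailed 95% interval of the IIR, (0.1676, 1.1151), contains 1.00.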
Similarly, the corresponding R functions for the basic and comparative approaches can be executed with the following code:
agreement(y=Example6_4[,"R1"], x=Example6_4[,"R2"],
V_label="R1", H_label="R2", min=2, max=64, by=c(4,8,16),
CCC_a=0.9, CP_a=0.8, TDI_a=50, error="prop",
target="random", alpha=0.05);

TIR_IIR(dataset=Example6_4, var=c("T1","T2","R1","R2"),
k=2, m=2, TIR_test=c("1"), TIR_ref=c("2"), IIR_test=c("1"),
IIR_ref=c("2"), error="prop", alpha=0.05, TIR_a=2.25);
References

Agresti, A. 1990. Categorical Data Analysis. New York: Wiley.
Agresti, A. 1996. An Introduction to Categorical Data Analysis. New York: Wiley.
Aickin, M. 1983. Linear Statistical Analysis of Discrete Data. New York: Wiley.
Barnhart, H., M. Haber, and L. Lin. 2007. An overview on assessing agreement with continuous
measurements. Journal of Biopharmaceutical Statistics 17(4):529–569.
Barnhart, H., M. Haber, and J. Song. 2002. Overall concordance correlation coefficient for
evaluating agreement among multiple observers. Biometrics 58(4):1020–1027.
Barnhart, H., A. Kosinski, and M. Haber. 2007. Assessing individual agreement. Journal of
Biopharmaceutical Statistics 17(4):697–719.
Barnhart, H., J. Song, and M. Haber. 2005. Assessing intra, inter and total agreement with
replicated readings. Statistics in Medicine 24(9):1371–1384.
Barnhart, H., J. Song, and R. Lyles. 2005. Assay validation for left-censored data. Statistics in
Medicine 24(21):3347–3360.
Barnhart, H. and J. Williamson. 2001. Modeling concordance correlation via GEE to evaluate
reproducibility. Biometrics 57(3):931–940.
Bartko, J. 1966. The intraclass correlation coefficient as a measure of reliability. Psychological
Reports 19(1):3–11.
Birch, M. 1964. The detection of partial association, I: the 2 × 2 case. Journal of the Royal Statistical Society, Series B 26(3):313–324.
Birch, M. 1965. The detection of partial association, II: the general case. Journal of the Royal Statistical Society, Series B 27(1):111–124.
Bland, J. and D. Altman. 1999. Measuring agreement in method comparison studies. Statistical
Methods in Medical Research 8(2):135–160.
Bland, J. and D.G. Altman. 1986. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet i:307–310.
Bloch, D. and H. Kraemer. 1989. 2 × 2 kappa coefficients: measures of agreement or association. Biometrics 45(1):269–287.
Brennan, R. 2001. Generalizability Theory. New York: Springer.
Broemeling, L. 2009. Bayesian Methods for Measures of Agreement. Boca Raton: Chapman &
Hall/CRC.
Bross, I. 1985. Why proof of safety is much more difficult than proof of hazard. Biometrics 41(3):
785–793.
Carrasco, J. and L. Jover. 2003. Estimating the generalized concordance correlation coefficient
through variance components. Biometrics 59(4):849–858.
Carrasco, J. and L. Jover. 2005. Concordance correlation coefficient applied to discrete data.
Statistics in Medicine 24(24):4021–4034.
Carrasco, J., T. King, and V. Chinchilli. 2007. Comparison of concordance correlation coefficient estimating approaches with skewed data. Journal of Biopharmaceutical Statistics 17(4):673–684.
Carrasco, J., T. King, and V. Chinchilli. 2009. The concordance correlation coefficient for repeated measures estimated by variance components. Journal of Biopharmaceutical Statistics 19(1):90–105.
Chen, C. and H. Barnhart. 2008. Comparison of ICC and CCC for assessing agreement for data
without and with replications. Computational Statistics & Data Analysis 53(2):554–564.
Chinchilli, V., J. Martel, S. Kumanyika, and T. Lloyd. 1996. A weighted concordance correlation
coefficient for repeated measurement designs. Biometrics 52(1):341–353.
Choudhary, P. 2007. A tolerance interval approach for assessment of agreement with left censored
data. Journal of Biopharmaceutical Statistics 17(4):583–594.
Choudhary, P. 2008. A tolerance interval approach for assessment of agreement in method comparison studies with repeated measurements. Journal of Statistical Planning and Inference 138(4):1102–1115.
Choudhary, P. and H. Nagaraja. 2007. Tests for assessment of agreement using probability criteria.
Journal of Statistical Planning and Inference 137(1):279–290.
Christensen, R. 1997. Log-linear Models and Logistic Regression 2nd Ed. New York: Springer.
Cicchetti, D. and T. Allison. 1971. A new procedure for assessing reliability of scoring EEG sleep
recordings. American Journal of EEG Technology 11:101–109.
Cicchetti, D. and J. Fleiss. 1977. Comparison of the null distributions of weighted kappa and the
C ordinal statistic. Applied Psychological Measurement 1(2):195–201.
CLIA Final Rule 2003. CLIA programs; laboratory requirements relating to quality systems and
certain personnel qualifications. Final rule. Federal Register 68(16):3639–3714. available at
http://www.phppo.cdc.gov/clia/pdf/CMS-2226-F.pdf.
Cochran, W. 1950. The comparison of percentages in matched samples. Biometrika 37(3-4):
256–266.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological
Measurement 20(1):37–46.
Cohen, J. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or
partial credit. Psychological Bulletin 70(4):213–220.
Cox, D. and E. Snell. 1989. Analysis of Binary Data. Boca Raton: Chapman & Hall/CRC.
Darroch, J. 1981. The Mantel-Haenszel test and tests of marginal symmetry; fixed-effects and
mixed models for a categorical response. International Statistical Review 49:285–307.
Davis, C. and D. Quade. 1968. On comparing the correlations within two pairs of variables.
Biometrics 24(4):987–995.
Donner, A. and M. Eliasziw. 1992. A goodness-of-fit approach to inference procedures for
the kappa statistic: Confidence interval construction, significance-testing and sample size
estimation. Statistics in Medicine 11(11):1511–1519.
Dunnett, C. and M. Gent. 1977. Significance testing to establish equivalence between treatments, with special reference to data in the form of 2 × 2 tables. Biometrics 33(4):593–602.
Escaramis, G., C. Ascaso, and J. Carrasco. 2010. The total deviation index estimated by tolerance
intervals to evaluate the concordance of measurement devices. BMC Medical Research
Methodology 10(1):31.
Everitt, B. 1968. Moments of the statistics kappa and weighted kappa. British Journal of
Mathematical and Statistical Psychology 21(1):97–103.
Fisher, R.A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.
Fleiss, J. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin
76(5):378–382.
Fleiss, J.L. 1973. Statistical Methods for Rates and Proportions. New York: John Wiley & Sons.
Fleiss, J. 1986. Reliability of measurement. The Design and Analysis of Clinical Experiments
1(1):1–32.
Fleiss, J. and J. Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation
coefficient as measures of reliability. Educational and Psychological Measurement 33(3):
613–619.
Fleiss, J., J. Cohen, and B. Everitt. 1969. Large sample standard errors of kappa and weighted
kappa. Psychological Bulletin 72(5):323–327.
Fleiss, J. and J. Cuzick. 1979. The reliability of dichotomous judgments: Unequal numbers of
judges per subject. Applied Psychological Measurement 3(4):537–542.
Fleiss, J. and B. Everitt. 1971. Comparing the marginal totals of square contingency tables. British
Journal of Mathematical and Statistical Psychology 24:117–123.
Fleiss, J., B. Levin, and M. Paik. 1981. Statistical Methods for Rates and Proportions 2nd Ed. New York: Wiley.
Fleiss, J. and P. Shrout. 1978. Approximate interval estimation for a certain intraclass correlation
coefficient. Psychometrika 43(2):259–262.
Freeman, D. 1987. Applied Categorical Data Analysis. New York: Marcel Dekker.
Friedman, M. 1937. The use of ranks to avoid the assumption of normality implicit in the analysis
of variance. Journal of the American Statistical Association 32(200):675–701.
Goodman, L. 1978. Analyzing Qualitative/Categorical Data: Log-linear Models and Latent-
Structure Analysis. Cambridge, MA: Abt Books.
Grant, E. and R. Leavenworth. 1972. Statistical Quality Control. New York: McGraw-Hill.
Guo, Y. and A. Manatunga. 2007. Nonparametric estimation of the concordance correlation coefficient under univariate censoring. Biometrics 63(1):164–172.
Guo, Y. and A. Manatunga. 2009. Measuring agreement of multivariate discrete survival times
using a modified weighted kappa coefficient. Biometrics 65(1):125–134.
Haber, M. and H. Barnhart. 2008. A general approach to evaluating agreement between two
observers or methods of measurement from quantitative data with replicated measurements.
Statistical Methods in Medical Research 17(2):151–169.
Haber, M., J. Gao, and H. Barnhart. 2007. Assessing observer agreement in studies involving
replicated binary observations. Journal of Biopharmaceutical Statistics 17(4):757–766.
Haberman, S. 1974. The Analysis of Frequency Data. Chicago: University of Chicago Press.
Haberman, S. 1978. Analysis of Qualitative Data, Volume 1: Introductory Topics. New York: Academic Press.
Haberman, S. 1979. Analysis of Qualitative Data, Volume 2: New Developments. New York: Academic Press.
Hedayat, A., C. Lou, and B. Sinha. 2009. A Statistical Approach to Assessment of Agreement Involving Multiple Raters. Communications in Statistics–Theory and Methods 38(16):2899–2922.
Helenowski, I., E. Vonesh, H. Demirtas, A. Rademaker, V. Ananthanarayanan, P. Gann, and B. Jovanovic. 2011. Defining Reproducibility Statistics as a Function of the Spatial Covariance Structures in Biomarker Studies. The International Journal of Biostatistics 7(1): Article 2.
Hiriote, S. and V. Chinchilli. 2010. Matrix-based concordance correlation coefficient for repeated
measures. Biometrics 66:1–20.
Holder, D. and F. Hsuan. 1993. Moment-based criteria for determining bioequivalence. Biometrika
80(4):835–846.
King, T. and V. Chinchilli. 2001a. A generalized concordance correlation coefficient for continuous
and categorical data. Statistics in Medicine 20(14):2131–2147.
King, T. and V. Chinchilli. 2001b. Robust estimators of the concordance correlation coefficient.
Journal of Biopharmaceutical Statistics 11(3):83–105.
King, T., V. Chinchilli, and J. Carrasco. 2007. A repeated measures concordance correlation coefficient. Statistics in Medicine 26(16):3095–3113.
King, T., V. Chinchilli, K. Wang, and J. Carrasco. 2007. A class of repeated measures concordance
correlation coefficients. Journal of Biopharmaceutical Statistics 17(4):653–672.
Koch, G., J. Landis, J. Freeman, D. Freeman Jr, and R. Lehnen. 1977. A general methodology for
the analysis of experiments with repeated measurement of categorical data. Biometrics 33(1):
133–158.
Landis, J. and G. Koch. 1977a. A one-way components of variance model for categorical data.
Biometrics 33(4):671–679.
Landis, J. and G. Koch. 1977b. An application of hierarchical kappa-type statistics in the
assessment of majority agreement among multiple observers. Biometrics 33(2):363–374.
Landis, J. and G. Koch. 1977c. The measurement of observer agreement for categorical data.
Biometrics 33(1):159–174.
Landis, J., T. Sharp, S. Kuritz, and G. Koch. 1998. Mantel–Haenszel methods. Encyclopedia of Biostatistics 3:2378–2391.
Li, R. and M. Chow. 2005. Evaluation of reproducibility for paired functional data. Journal of
Multivariate Analysis 93(1):81–101.
Liang, K. and S. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73(1):13–22.
Lin, L. 1989. A concordance correlation coefficient to evaluate reproducibility. Biometrics 45(1):
255–268.
Lin, L. 1992. Assay validation using the concordance correlation coefficient. Biometrics 48(2):
599–604.
Lin, L. 2000. Total deviation index for measuring individual agreement with applications in
laboratory performance and bioequivalence. Statistics in Medicine 19(2):255–270.
Lin, L. 2003. Measuring agreement. Encyclopedia of Biopharmaceutical Statistics, 561–567.
Lin, L. 2008. Overview of agreement statistics for medical devices. Journal of Biopharmaceutical
Statistics 18(1):126–144.
Lin, L. and V. Chinchilli. 1997. Rejoinder to the letter to the editor from Atkinson and Nevill.
Biometrics 53(2):777–778.
Lin, L., A. Hedayat, B. Sinha, and M. Yang. 2002. Statistical Methods in Assessing Agreement.
Journal of the American Statistical Association 97(457):257–270.
Lin, L., A. Hedayat, and Y. Tang. 2012. A comparison for measuring individual agreement. Journal of Biopharmaceutical Statistics, accepted for publication.
Lin, L., A. Hedayat, and W. Wu. 2007. A unified approach for assessing agreement for continuous
and categorical data. Journal of Biopharmaceutical Statistics 17(4):629–652.
Lin, L. and L. Torbeck. 1998. Coefficient of accuracy and concordance correlation coefficient: new
statistics for methods comparison. PDA Journal of Pharmaceutical Science and Technology
52(2):55–59.
Lin, L. and E. Vonesh. 1989. An empirical nonlinear data-fitting approach for transforming data to
normality. American Statistician 43(4):237–243.
Linn, S. 2004. A new conceptual approach to teaching the interpretation of clinical tests. Journal of Statistics Education 12(3):1–9.
Liu, X., Y. Du, J. Teresi, and D. Hasin. 2005. Concordance correlation in the measurements of time
to event. Statistics in Medicine 24(9):1409–1420.
Lou, C. 2006. Assessment of Agreement, PhD Thesis. University of Illinois at Chicago.
Madansky, A. 1963. Tests of homogeneity for correlated samples. Journal of the American
Statistical Association 58(301):97–119.
McNemar, Q. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12(2):153–157.
Quiroz, J. 2005. Assessment of equivalence using a concordance correlation coefficient in a repeated measurements design. Journal of Biopharmaceutical Statistics 15(6):913–928.
Quiroz, J. and R. Burdick. 2009. Assessment of Individual Agreements with Repeated Measurements Based on Generalized Confidence Intervals. Journal of Biopharmaceutical Statistics 19(2):345–359.
Robieson, W. 1999. On Weighted Kappa and Concordance Correlation Coefficient, PhD Thesis. Chicago: University of Illinois.
Rodary, C., C. Com-Nougue, and M. Tournade. 1989. How to establish equivalence between
treatments: A one-sided clinical trial in paediatric oncology. Statistics in Medicine 8(5):
593–598.
Schuster, C. 2001. Kappa as a parameter of a symmetry model for rater agreement. Journal of
Educational and Behavioral Statistics 26(3):331–342.
Serfling, R. 1980. Approximation Theorems of Mathematical Statistics. New York: John Wiley &
Sons.
Sheiner, L. 1992. Bioequivalence revisited. Statistics in Medicine 11(13):1777–1788.
Shoukri, M. M. 2004. Measures of Interobserver Agreement. Chapman and Hall.
Shoukri, M. and V. Edge. 1996. Statistical Methods for Health Sciences. New York: CRC Press.
Shoukri, M. and S. Martin. 1995. Maximum likelihood estimation of the kappa coefficient from
models of matched binary responses. Statistics in Medicine 14(1):83–99.
Shrout, P. and J. Fleiss. 1979. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin 86(2):420–428.
Tang, Y. 2010. A Comparison Model for Measuring Individual Agreement, PhD Thesis. Chicago:
University of Illinois.
Von Eye, A. and E. Mun. 2005. Analyzing Rater Agreement: Manifest Variable Methods. New
Jersey: Lawrence Erlbaum.
von Eye, A. and C. Schuster. 2000. Log-linear model for rater agreement. Multiciencia 4:38–56.
Vonesh, E. and V. Chinchilli. 1997. Linear and Nonlinear Models for the Analysis of Repeated Measurements. New York: Marcel Dekker.
Vonesh, E., V. Chinchilli, and K. Pu. 1996. Goodness-of-fit in generalized nonlinear mixed-effects
models. Biometrics 52(2):572–587.
Wang, W. and J. Gene Hwang. 2001. A nearly unbiased test for individual bioequivalence problems
using probability criteria. Journal of Statistical Planning and Inference 99(1):41–58.
Westlake, W. 1976. Symmetrical confidence intervals for bioequivalence trials. Biometrics 32(4):741–744.
Williamson, J., S. Crawford, and H. Lin. 2007. Resampling dependent concordance correlation
coefficients. Journal of Biopharmaceutical Statistics 17(4):685–696.
Williamson, J., S. Lipsitz, and A. Manatunga. 2000. Modeling kappa for measuring dependent
categorical agreement data. Biostatistics 1(2):191–202.
Wu, W. 2005. A Unified Approach for Assessing Agreement, PhD Thesis. Chicago: University of
Illinois.
Yang, M. 2002. Universal Optimality in Crossover Design and Statistical Methods in Assessing
Agreement, PhD Thesis. Chicago: University of Illinois.
Yang, J. and V. Chinchilli. 2009. Fixed-effects modeling of Cohen's kappa for bivariate multinomial data. Communications in Statistics–Theory and Methods 38:3634–3653.
Yang, J. and V. Chinchilli. 2011. Fixed-effects modeling of Cohen's weighted kappa for bivariate multinomial data. Computational Statistics and Data Analysis 55:1061–1070.
Zeger, S. and K. Liang. 1986. Longitudinal data analysis for discrete and continuous outcomes.
Biometrics 42(1):121–130.
Zeger, S., K. Liang, and P. Albert. 1988. Models for longitudinal data: a generalized estimating
equation approach. Biometrics 44(4):1049–1060.
Zhong, J. 2001. Optimal and Efficient Nonlinear Design and Solutions with Interpretations to
Individual Bioequivalence, PhD Thesis. Chicago: University of Illinois.
Index

A
accuracy, 1–2
accuracy coefficient
  basic, 13
  generalized or overall, 89
  interrater, 79
  total-rater, 81
agreement, 1–2
association, 61, 68

C
coefficient of
  individual agreement (CIA), 124
  individual bioequivalence (IBC) by FDA, 123
concordance correlation coefficient (CCC)
  basic model, 12–13
  generalized or overall, 88, 91–92
  interrater (CCC_inter), 79
  intrarater (CCC_intra), 77
  total-rater (CCC_total), 80
correlation coefficient, see precision coefficient
coverage probability (CP)
  basic model, 10
  generalized or overall, 88
  interrater, 78
  intrarater, 77
  total-rater, 80

E
estimate of
  accuracy coefficient
    basic, 14, 41, 45
    generalized or overall, 88
    interrater, 87
    total-rater, 87
  concordance correlation coefficient (CCC)
    basic, 14, 38, 42
    generalized or overall, 88
    interrater (CCC_inter), 87
    intrarater (CCC_intra), 87
    total-rater (CCC_total), 87
  coverage probability (CP)
    basic, 40, 44
    generalized or overall, 88
    interrater, 86, 87
    intrarater, 86, 87
    total-rater, 86, 87
  intra–intra ratio (IIR), 127
  mean squared deviation (MSD)
    basic, 7, 39, 43
    generalized or overall, 88
    interrater, 85
    intrarater, 85
    total-rater, 85
  precision coefficient
    basic, 13
    generalized or overall, 88
    interrater, 87
    intrarater, 87
    total-rater, 87
  total deviation index (TDI)
    basic, 9
    generalized or overall, 88
    interrater, 85, 86
    intrarater, 85, 86
    total-rater, 85, 86
  total–intra ratio (TIR)
    when no reference exists, 126
    when reference exists, 124
  weighted kappa, 58

G
generalized estimation equations (GEE)
  by Barnhart, Haber, and Song (2002), 47–48
  comparative model, 115–121
  unified model, 81–84

I
intra–intra ratio (IIR), 126
intraclass correlation coefficient (ICC), 3, 11

K
kappa, 57–59

M
mean squared deviation (MSD)
  basic model, 7, 56
  generalized or overall, 88
  interrater, 78, 114
  intrarater, 77, 113
  total-rater, 80, 113
model of
  basic, 7
  comparative, 112
  unified, 75
  when m = 1, 87

P
precision, 1–2
precision coefficient
  basic, 13–14
  generalized or overall, 88
  interrater, 79
  intrarater, 77
  total-rater, 81
proportional error structure
  basic model, 16
  comparative model, 112
  unified model, 81

R
relative bias squared (RBS)
  basic, 9
  interrater, 79
  total-rater, 80

S
sample size and power
  general case, 71–72
  simplified case, 72

T
target values
  fixed, 1
  random, 1
total deviation index (TDI)
  basic, 8, 9
  generalized or overall, 88
  interrater, 78
  intrarater, 77
  total-rater, 80
total–intra ratio (TIR)
  when no reference exists, 123
  when reference exists, 122

U
U-statistics for CCC, 46–47, 91–92

V
variance components, 76
variance of the estimate of
  accuracy coefficient
    basic, 14, 42, 45
    generalized or overall, 90
    interrater, 87
    total-rater, 87
  concordance correlation coefficient (CCC)
    basic, 14, 39, 43
    generalized or overall, 90
    interrater, 86, 87
    intrarater, 86, 87
    total-rater, 86, 87
  coverage probability (CP)
    basic, 10, 40, 41, 44
    generalized or overall, 90
    interrater, 86, 87
    intrarater, 86, 87
    total-rater, 86, 87
  intra–intra ratio (IIR), 127
  mean squared deviation (MSD)
    basic, 8, 40, 44
    generalized or overall, 90
    interrater, 85
    intrarater, 85
    total-rater, 85, 86
  precision coefficient
    basic, 14
    generalized or overall, 90
    interrater, 86, 87
    total-rater, 87
  total deviation index (TDI)
    basic, 9
    generalized or overall, 90
    interrater, 86
    intrarater, 86
    total-rater, 86
  total–intra ratio (TIR)
    when no reference exists, 126
    when reference exists, 125
  weighted kappa, 59

W
weight function
  Cicchetti and Allison, 58
  Fleiss and Cohen, 57, 59
weighted kappa, 57–59