Harvard Lecture Series Session 2 - Reliability
Qian-Li Xue
Biostatistics Program
Harvard Catalyst | The Harvard Clinical & Translational Science Center
Short course, October 27, 2016
Objectives
• Classical Test Theory
• Definitions of Reliability
• Types of Reliability Coefficients
– Test-Retest, Inter-Rater, Internal Consistency
– Correction for Attenuation
• Review Exercises
What is reliability?
• Consistency of measurement
• The extent to which a measurement
instrument can differentiate among
subjects
• Reliability is relative
Facets of Reliability
• Mrs. Z scores 20 at visit 1 and 25 at visit 2.
Could be:
• Random variation
– (Test-Retest)
• Tech # 2 more lenient than Tech # 1
– (Inter-Rater Reliability)
• Version # 2 easier than Version # 1
– (Related to Internal Consistency)
• Mrs. Z’s picture-naming actually improved
Classical Test Theory
• X = Tx + e
• The Observed Score = True Score + Error
• Assumptions:
– E(e) = 0
– Cov(Tx,e) = 0
– Cov(ei,ek) = 0
• Tau-Equivalent: $T_{X_1} = T_{X_2}$
• Congeneric: $T_{X_1} = \beta T_{X_2} + c$
See Graham (2006) for details.
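A minimal simulation sketch of the model X = Tx + e, offered only as an illustration. The sample size, the true-score and error standard deviations, and the variable names are assumptions, not values from the lecture. It also uses the standard classical test theory definition of reliability as true-score variance over observed-score variance, which the surviving slide text does not spell out.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

N = 20000
TRUE_SD = 3.0   # assumed spread of true scores (illustrative)
ERROR_SD = 1.0  # assumed spread of measurement error (illustrative)

# X = T + e, with e drawn independently of T and centered at zero
true_scores = [random.gauss(50.0, TRUE_SD) for _ in range(N)]
errors = [random.gauss(0.0, ERROR_SD) for _ in range(N)]
observed = [t + e for t, e in zip(true_scores, errors)]

mean_error = statistics.fmean(errors)          # should be close to 0, as in E(e) = 0
var_true = statistics.pvariance(true_scores)
var_error = statistics.pvariance(errors)
var_observed = statistics.pvariance(observed)

# With Cov(T, e) near 0, Var(X) is close to Var(T) + Var(e),
# so this ratio should land near 9 / (9 + 1) = 0.9
reliability = var_true / var_observed

print(round(mean_error, 3), round(var_observed, 2), round(reliability, 3))
```

Shrinking TRUE_SD while holding ERROR_SD fixed lowers the ratio, which is one way to see why reliability depends on the sample as well as the instrument.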
Correlation, r
Correlation (i.e. “Pearson” correlation) is a scaled version
of covariance
$$r_{xy} = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}$$
-1 ≤ r ≤ 1
r=1 perfect positive correlation
r = -1 perfect negative correlation
r=0 uncorrelated
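The definition of r as covariance scaled by the two standard deviations can be sketched in a few lines of Python. The function name `pearson_r` and the score lists are made up for illustration, chosen only to reproduce the three reference cases listed on this slide.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov_xy / math.sqrt(var_x * var_y)

# Made-up scores, not data from the lecture
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
r_pos = pearson_r(scores, [2.0 * s + 1.0 for s in scores])  # increasing linear function of scores
r_neg = pearson_r(scores, [10.0 - s for s in scores])       # decreasing linear function of scores
r_zero = pearson_r(scores, [2.0, 1.0, 0.0, 1.0, 2.0])       # symmetric pattern, no linear trend

print(r_pos, r_neg, r_zero)
```

The third case is uncorrelated even though the two variables are clearly related, a reminder that r only measures linear association.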
Correlation between Parallel Tests
Cronbach’s Alpha
• Mathematically equivalent to ICC(3,k)
$$\alpha = \frac{K}{K-1}\left[1 - \frac{\sum_{i=1}^{K}\sigma^2_{\mathrm{item}_i}}{\sigma^2_{\mathrm{total}}}\right]$$
$$\alpha = \frac{4}{3}\left[1 - \frac{2.67 + 2.7 + 2.67 + 6.27}{44.97}\right] = 0.91$$
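The worked alpha example on this slide can be checked with a short Python sketch. The four item variances and the total-score variance are the numbers from the slide. The function name `cronbach_alpha` is a hypothetical helper, not part of any library.

```python
def cronbach_alpha(item_variances, total_variance):
    """Cronbach's alpha from the item variances and the variance of the total score."""
    k = len(item_variances)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Values taken from the slide's worked example (K = 4 items)
item_variances = [2.67, 2.7, 2.67, 6.27]
total_variance = 44.97

alpha = cronbach_alpha(item_variances, total_variance)
print(round(alpha, 2))  # 0.91
```

Alpha grows as the item variances account for a smaller share of the total variance, which happens when the items covary strongly.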
Correction for Attenuation
$$r_{T_x T_y} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$$
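A short derivation sketch shows where the correction comes from under the classical test theory assumptions. Two assumptions are not in the surviving slide text: that the errors of each measure are uncorrelated with both true scores and with each other, and the standard definition of reliability as $r_{xx} = \sigma^2_{T_x} / \sigma^2_X$.

```latex
\begin{align*}
X &= T_x + e_x, \qquad Y = T_y + e_y \\
\operatorname{Cov}(X, Y) &= \operatorname{Cov}(T_x, T_y)
  \quad \text{(every covariance term involving } e_x \text{ or } e_y \text{ is zero)} \\
r_{xy} &= \frac{\operatorname{Cov}(T_x, T_y)}{\sigma_X\, \sigma_Y}
        = r_{T_x T_y}\, \frac{\sigma_{T_x}}{\sigma_X}\, \frac{\sigma_{T_y}}{\sigma_Y}
        = r_{T_x T_y} \sqrt{r_{xx}\, r_{yy}} \\
r_{T_x T_y} &= \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
\end{align*}
```

Because both reliabilities are at most 1, the observed correlation can never exceed the true-score correlation in absolute value.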
How to Improve Reliability
• Reduce error variance
– Better observer training
– Improve scale design
• Enhance true variance
– Introduce new items better at capturing
heterogeneity
– Change item responses
• Increase number of items in a scale
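The effect of adding items can be sketched with the Spearman-Brown prophecy formula. The slide does not name a formula; Spearman-Brown is brought in here as the standard classical test theory result, and it assumes the added items are parallel to the existing ones. The starting reliability of 0.70 and the function name are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a scale is lengthened by length_factor with parallel items."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

base_reliability = 0.70  # hypothetical starting value, not from the lecture

doubled = spearman_brown(base_reliability, 2)  # twice as many items
tripled = spearman_brown(base_reliability, 3)  # three times as many items

print(round(doubled, 3), round(tripled, 3))
```

The gain shrinks with each further increase in length, so lengthening a scale has diminishing returns.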
Exercise #1
• You develop a new survey measure of
depression based on a pilot sample that
consists of 33% severely depressed, 33%
mildly depressed, and 33% non-depressed.
You are happy to discover that your measure
has a high reliability of 0.90. Emboldened by
your findings, you find funding and administer
your survey to a nationally representative
sample. However, you find that your reliability
is now much lower. Why might the reliability
have dropped?
Exercise #1 - Answer
$$0.90 = \frac{BMS_{\mathrm{pilot}} - EMS}{BMS_{\mathrm{pilot}}} = \frac{10 - 1}{10}$$
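One reading consistent with the pilot formula is that a nationally representative sample has far fewer depressed respondents, so the between-subject mean square (BMS) shrinks while the error mean square (EMS) stays about the same. The sketch below reuses the pilot values (BMS = 10, EMS = 1). The smaller BMS values are hypothetical, chosen only to show the direction of the effect.

```python
def reliability_from_mean_squares(bms, ems):
    """Reliability as the share of between-subject mean square not due to error."""
    return (bms - ems) / bms

EMS = 1.0  # error mean square, held fixed across samples (assumption)

# 10.0 is the pilot BMS from the slide; 5.0 and 2.0 are hypothetical
# values for a more homogeneous national sample
results = {bms: reliability_from_mean_squares(bms, EMS) for bms in (10.0, 5.0, 2.0)}

for bms, rel in results.items():
    print(bms, round(rel, 2))
```

The instrument and its error are unchanged in all three cases; only the spread among subjects differs, which is the sense in which reliability is relative.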
[Figure: scatter plot; axis label: True Score]
Exercise #2a – Answer
[Figure: Observed Score (Positive Correlation) vs. True Score]
Exercise #2b - Answer
Exercise #3
• The reported correlation between years of
educational attainment and adults’ scores on
anti-social personality disorder (ASP) scales
is usually about 0.30. The reported
reliability of the education scale is 0.95 and
that of the ASP scale is 0.70. What will your
observed correlation between these two
measures be if your data on the education
scale have the same reliability (0.95) but the
ASP scale has a much lower reliability of 0.40?
Exercise #3 - Answer
• Solve for true score correlation from
reported data.
$$r_{T_x T_y} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}} = \frac{.30}{\sqrt{.95 \times .70}} = .367883$$
• Solve for new observed correlation
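Both steps can be sketched in Python using only the numbers given in the exercise. The second step re-attenuates the true-score correlation with the new reliabilities, i.e. the correction-for-attenuation formula solved for the observed correlation. The variable names are illustrative.

```python
import math

# Values stated in Exercise #3
r_observed_reported = 0.30
rel_education = 0.95
rel_asp_reported = 0.70
rel_asp_new = 0.40

# Step 1: true-score correlation implied by the reported data
r_true = r_observed_reported / math.sqrt(rel_education * rel_asp_reported)

# Step 2: observed correlation expected with the less reliable ASP scale
r_observed_new = r_true * math.sqrt(rel_education * rel_asp_new)

print(round(r_true, 4), round(r_observed_new, 3))  # 0.3679 0.227
```

Since the education reliability is unchanged, the result equals 0.30 multiplied by the square root of 0.40 / 0.70, or roughly 0.23.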