Harvard Lecture Series Session 2 - Reliability

Measurement Reliability

Qian-Li Xue
Biostatistics Program
Harvard Catalyst | The Harvard Clinical & Translational Science Center
Short course, October 27, 2016
Objectives
• Classical Test Theory
• Definitions of Reliability
• Types of Reliability Coefficients
– Test-Retest, Inter-Rater, Internal Consistency
– Correction for Attenuation
• Review Exercises
What is Reliability?
• Consistency of measurement
• The extent to which a measurement instrument can differentiate among subjects
• Reliability is relative
Facets of Reliability
• Mrs. Z scores 20 at visit 1 and 25 at visit 2. This could reflect:
• Random variation
– (Test-Retest)
• Tech # 2 more lenient than Tech # 1
– (Inter-Rater Reliability)
• Version # 2 easier than Version # 1
– (Related to Internal Consistency)
• Mrs. Z’s picture-naming actually improved
Classical Test Theory
• X = Tx + e
• Observed score = True score + Error
• Assumptions:
– E(e) = 0
– Cov(Tx, e) = 0
– Cov(ei, ek) = 0

• Var(X) = Var(Tx + e) = Var(Tx) + 2·Cov(Tx, e) + Var(e)
• Since Cov(Tx, e) = 0 by assumption: Var(X) = Var(Tx) + Var(e)
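A minimal simulation sketch of this decomposition (hypothetical Stata code; the variances 9 and 4 are arbitrary choices):

    * Variance decomposition under classical test theory
    clear
    set seed 20161027
    set obs 10000
    generate t = rnormal(0, 3)    // true score, Var(Tx) = 9
    generate e = rnormal(0, 2)    // error, Var(e) = 4, independent of t
    generate x = t + e            // observed score, Var(X) = 9 + 4 = 13
    tabstat t e x, statistics(variance)    // sample variances near 9, 4, 13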
Reliability as Consistency of Measurement
• The correlation between parallel tests
• The ratio of true-score variance to total score variance:

ρxx = Var(Tx) / Var(X) = [Var(X) - Var(e)] / Var(X)
Parallel Tests
• Parallel: Tx1 = Tx2 and Var(e1) = Var(e2)
• Tau-equivalent: Tx1 = Tx2 (error variances may differ)
• Essentially tau-equivalent: Tx1 = Tx2 + c
• Congeneric: Tx1 = β·Tx2 + c
See Graham (2006) for details.
Correlation, r
Correlation (i.e., "Pearson" correlation) is a scaled version of covariance:

rxy = cov(x, y) / sqrt(var(x) · var(y))

-1 ≤ r ≤ 1
r = 1: perfect positive correlation
r = -1: perfect negative correlation
r = 0: uncorrelated
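A quick numeric check of this scaling (hypothetical simulated data; after correlate with the covariance option, Stata stores r(cov_12), r(Var_1), and r(Var_2)):

    * Pearson r as covariance scaled by the two standard deviations
    clear
    set seed 1
    set obs 500
    generate x = rnormal()
    generate y = 0.5*x + rnormal()
    quietly correlate x y, covariance
    display r(cov_12) / sqrt(r(Var_1) * r(Var_2))   // hand-scaled covariance
    correlate x y    // the reported r should match the value above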
Correlation between Parallel Tests

• ρX1X2 is equal to the reliability of each test:

ρX1X2 = cov(Tx1 + e1, Tx2 + e2) / sqrt(var(X1) · var(X2))
      = [cov(Tx1, Tx2) + cov(Tx1, e2) + cov(Tx2, e1) + cov(e1, e2)] / sqrt(var(X1) · var(X2))

Under the parallel-test assumptions, the three error-covariance terms are zero, cov(Tx1, Tx2) = var(Tx), and var(X1) = var(X2) = var(X), so:

ρX1X2 = var(Tx) / var(X)
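A simulation sketch of this result (hypothetical data): two parallel tests built from a shared true score with independent, equal-variance errors should correlate at about Var(Tx)/Var(X) = 9/(9 + 4) ≈ 0.69:

    * Correlation between parallel tests approximates the reliability
    clear
    set seed 42
    set obs 10000
    generate t  = rnormal(0, 3)      // shared true score, Var(Tx) = 9
    generate x1 = t + rnormal(0, 2)  // test 1, error variance 4
    generate x2 = t + rnormal(0, 2)  // test 2, error variance 4
    correlate x1 x2                  // expect roughly 9/13 = 0.69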
DIADS Example
• Depression in Alzheimer's Disease Study (DIADS)
• A double-blind, placebo-controlled trial of sertraline
• One outcome was the Boston Naming Test, which consists of 60 pictures to be named and comes in two versions
Measures for Reliability

                       Continuous                  Categorical
Test-retest            r or ICC                    Kappa or ICC
Inter-rater            r or ICC                    Kappa or ICC
Internal consistency   Alpha, split-half, or ICC   KR-20 (dichotomous) or ICC
Kappa Coefficient
(Cohen, 1960)

• Test-retest or inter-rater reliability for categorical (typically dichotomous) data
• Accounts for chance agreement
Kappa Coefficient

kappa = (Po - Pe) / (1 - Pe)

Po = observed proportion of agreement
Pe = expected (chance) proportion of agreement

Worked example:
kappa = {[(20 + 55)/100] - [(10.5 + 45.5)/100]} / {1 - [(10.5 + 45.5)/100]} = (0.75 - 0.56) / (1 - 0.56) = 0.43
Kappa in STATA
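The Stata output from the original slide is not reproduced here, but the 2x2 table behind the worked example is implied by the numbers above (diagonal cells 20 and 55; marginals 30/100 and 35/100, up to which rater is labeled first, giving expected agreement 10.5 + 45.5). A sketch that recreates it with Stata's kap command:

    * Recreate the worked kappa example from its implied cell frequencies
    clear
    input rater1 rater2 n
    1 1 20
    1 0 10
    0 1 15
    0 0 55
    end
    expand n            // one row per rated subject (100 in all)
    kap rater1 rater2   // should report kappa = 0.43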
Kappa Interpretation
• Interpretation:
Kappa Value Interpretation
Below 0.00 Poor
0.00-0.20 Slight
0.21-0.40 Fair
0.41-0.60 Moderate
0.61-0.80 Substantial
0.81-1.00 Almost perfect
(source: Landis, J. R. and Koch, G. G. 1977. Biometrics 33: 159-174)

• Kappa is sensitive to the marginal proportions: when they are very high or very low, chance agreement (Pe) is large, and kappa can be low even when observed agreement is high!
• The best interpretation of kappa is to compare its values across other, similar scales
Weighted Kappa
(Cohen, 1968)

• For ordered polytomous data
• Requires assignment of a weighting matrix
• Kw = ICC with quadratic weights (Fleiss & Cohen, 1973)
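Stata's kap command supports built-in weights via wgt(w) (linear) and wgt(w2) (quadratic); a sketch with hypothetical 4-level ordinal ratings:

    * Weighted kappa for ordered ratings (hypothetical simulated data)
    clear
    set seed 7
    set obs 100
    generate grade1 = ceil(4*runiform())    // rater 1, coded 1-4
    generate grade2 = min(max(grade1 + floor(3*runiform()) - 1, 1), 4)  // rater 2, near rater 1
    kap grade1 grade2, wgt(w)    // linear disagreement weights
    kap grade1 grade2, wgt(w2)   // quadratic weights (comparable to the ICC)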


Measures for Reliability

                       Continuous                  Categorical
Test-retest            r or ICC                    Kappa or ICC
Inter-rater            r or ICC                    Kappa or ICC
Internal consistency   Alpha, split-half, or ICC   KR-20 (dichotomous) or ICC
Internal Consistency
• Degree of homogeneity of items within a scale
• Items should be correlated with each other and with the total score
• Not a measure of dimensionality; it assumes unidimensionality
Internal Consistency and Dimensionality
• Two (at least) explanations for a lack of internal consistency among scale items:
– More than one dimension
– Bad items
Cronbach’s Alpha

α = [K/(K-1)] · [1 - (Σ σ²_itemi) / σ²_total], where the sum runs over the K items

Worked example with K = 4: α = (4/3) · [1 - (2.67 + 2.7 + 2.67 + 6.27)/44.97] = 0.91
Cronbach’s Alpha
• Mathematically equivalent to ICC(3,k)
• When inter-item correlations are equal across items, α equals the average of all split-half reliabilities:

α = kc / [v + (k-1)c] ≈ kr / [1 + (k-1)r]

where v is the average item variance, c the average inter-item covariance, and r the average inter-item correlation.
See DeVellis pp 36-38
STATA Alpha Output
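The output from the original slide is not reproduced; a sketch of the command, plus a check of the worked example above (the item names are hypothetical):

    * Verify the worked example: K = 4 items
    display (4/3) * (1 - (2.67 + 2.7 + 2.67 + 6.27)/44.97)   // = 0.91

    * Cronbach's alpha for a hypothetical item battery; the item option
    * also reports the scale's alpha with each item removed in turn
    alpha item1-item4, item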
Kuder-Richardson 20
KR-20 = [K/(K-1)] · [1 - (Σ pi·qi) / σ²_total]

pi = proportion responding positively to item i
qi = 1 - pi

• Cronbach's alpha for dichotomous items
• Stata's alpha command automatically reports KR-20 when the items are dichotomous
Correction for Attenuation

• You can calculate rx,y


• You want to know rTxTy

rx , y
rTxTY =
rxx ryy
Correction for Attenuation
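A one-line numeric sketch of the correction, using hypothetical values (observed r = .40, reliabilities .80 and .60):

    * Disattenuated (true-score) correlation from an observed correlation
    display .40 / sqrt(.80 * .60)    // = 0.577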
How to Improve Reliability
• Reduce error variance
– Better observer training
– Improve scale design
• Enhance true variance
– Introduce new items better at capturing heterogeneity
– Change item responses
• Increase number of items in a scale
Exercise #1
• You develop a new survey measure of depression based on a pilot sample that consists of 33% severely depressed, 33% mildly depressed, and 33% non-depressed. You are happy to discover that your measure has a high reliability of 0.90. Emboldened by your findings, you find funding and administer your survey to a nationally representative sample. However, you find that your reliability is now much lower. Why might the reliability have dropped?
Exercise #1 - Answer
ICC_pilot = (BMS_pilot - EMS) / BMS_pilot = (10 - 1) / 10 = 0.90

ICC_national = (BMS_national - EMS) / BMS_national = (4 - 1) / 4 = 0.75

Suppose all of the national sample are severely depressed; then BMS (the between-person variance) drops, as does the ICC.
Exercise #2
• A: Draw data where the cov(Tx,e) is negative
• B: Draw data where the cov(Tx,e) is positive
Exercise #2a - Answer
[Figure: scatterplot of observed score (-20 to 20) against true score (0-10), illustrating error negatively correlated with the true score]
Exercise #2b - Answer
[Figure: scatterplot of observed score (-10 to 30) against true score (0-10), illustrating error positively correlated with the true score]
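One way to produce such pictures is to simulate them, making the error depend on the true score (a sketch; the slopes and noise level are arbitrary):

    * Observed scores whose error is correlated with the true score
    clear
    set seed 2016
    set obs 200
    generate t = runiform(0, 10)                   // true score on 0-10
    generate e_neg = -2*(t - 5) + rnormal(0, 2)    // cov(Tx, e) < 0
    generate e_pos =  2*(t - 5) + rnormal(0, 2)    // cov(Tx, e) > 0
    generate x_neg = t + e_neg
    generate x_pos = t + e_pos
    twoway (scatter x_neg t) (scatter x_pos t)     // compare the two patterns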
Exercise #3
• The reported correlation between years of educational attainment and adults' scores on anti-social personality disorder (ASP) scales is usually about 0.30; the reported reliability of the education scale is 0.95 and that of the ASP scale is 0.70. What will your observed correlation between these two measures be if your data on the education scale have the same reliability (0.95) but the ASP scale has a much lower reliability of 0.40?
Exercise #3 - Answer
• Solve for the true-score correlation from the reported data:

rTxTy = rxy / sqrt(rxx · ryy) = .30 / sqrt(.95 × .70) = .368

• Solve for the new observed correlation:

rxy = rTxTy × sqrt(rxx · ryy) = .368 × sqrt(.95 × .40) = .227
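The same arithmetic as a Stata check:

    display .30 / sqrt(.95 * .70)                       // step 1: true-score r = .368
    display (.30 / sqrt(.95 * .70)) * sqrt(.95 * .40)   // step 2: new observed r = .227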


Exercise #4
• In rating a dichotomous child health outcome among 100 children, two psychiatrists disagree in 20 cases: in 10 of these the 1st psychiatrist rated the outcome as present and the 2nd as absent, and in the other 10 the reverse. What will the value of the kappa coefficient be if both psychiatrists agree that 50 children have the outcome?
Exercise #4 - Answer

The psychiatrists agree that 50 children have the outcome and disagree on 20, so they agree that the remaining 30 do not: Po = (50 + 30)/100 = .80. Each psychiatrist rates 60 of the 100 children as positive, so Pe = (.6 × .6) + (.4 × .4) = .52.

κ = (Po - Pe) / (1 - Pe) = (.80 - .52) / (1 - .52) = .58
Exercise #5
• Give substantive examples of how measures of self-reported discrimination could violate each of the three assumptions of classical test theory.
Exercise #5 - Answer
• E(e) = 0 could be violated if the true score is underreported as a result of social desirability bias

• Cov(Tx, e) = 0 could be violated if people systematically overreported or underreported discrimination at either the high or the low extreme of the measure

• Cov(ei, ek) = 0 could be violated if discrimination was clustered within certain areas of a location, and multiple locations were included in the analysis pool
