Reliability Without Reliability
Robert J. Mislevy
Educational Testing Service
Research Memorandum RM-94-18-ONR
October, 1994
Introduction
"Can there be validity without reliability?" asks Pamela Moss (1994) in her
article of the same name in the Educational Researcher. Yes, Professor Moss
answers. She proposes a hermeneutic approach to educational assessment, the
validity of which, she argues, does not depend on standard test theory indicators
of reliability such as KR-20 coefficients and inter-rater correlations. I agree that it
is possible to have validity without reliability, if by "reliability" we refer only to
these particular indices and others like them; but this is far too narrow a
conception of reliability. More broadly construed, reliability concerns
the credibility and the limitations of the information from which we wish to draw
inferences. If we fail to address this concern in an appropriate manner, we fail to
establish the validity of those inferences. This paper discusses and illustrates a
broader conception of reliability in educational assessment, to ground a deeper
understanding of the issues raised by Professor Moss's question. (See Mislevy,
1994, for a more comprehensive discussion.)
That reliability be examined in an appropriate manner is key, because
KR-20s and inter-rater correlations characterize the credibility of certain kinds of
data we employ for certain kinds of inferences in educational assessment, but not
others. I applaud Professor Moss's use of a hermeneutic perspective to gain
insights into questions of educational assessment, because, as Goethe wrote in
Sprüche in Prosa, "He who is ignorant of foreign languages knows not his own."
Exploring how other fields deal with evidence and inference can indeed help us
disentangle the commingled concepts from statistics, psychology, and
measurement that constitute test theory as we usually think about it, to
distinguish how we are reasoning from what we are reasoning about, and to better
prepare ourselves to tackle problems of how to characterize students' learning.
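Since KR-20 coefficients figure centrally in this discussion, it may help to recall concretely what one computes. The formula below is the standard Kuder-Richardson Formula 20; the function name and the tiny data matrix are hypothetical illustrations, not from the memorandum.

```python
from statistics import pvariance

def kr20(responses):
    """Kuder-Richardson Formula 20 for a persons-by-items matrix of 0/1 scores.

    KR-20 = (k / (k - 1)) * (1 - sum of item variances / variance of totals),
    where the variance of dichotomous item j is p_j * (1 - p_j).
    """
    k = len(responses[0])      # number of items
    n = len(responses)         # number of examinees
    item_var = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n   # proportion correct on item j
        item_var += p * (1 - p)
    total_var = pvariance(sum(row) for row in responses)  # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical data: 5 examinees by 4 dichotomous items.
data = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0]]
print(kr20(data))  # higher values indicate more internally consistent totals
```

The coefficient rewards items that covary: examinees who do well on one item tend to do well on the others, so the responses serve as interchangeable evidence about a single underlying proficiency.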
What is Reliability?
We can think of increasingly general senses of the term "reliability" as it
relates to educational assessment:
True-score reliability (Gulliksen, 1950). The classical reliability coefficient
rho assumes repeatable observations composed of an examinee's true score
and a random measurement error. Rho is the proportion of variance in a
particular population of examinees' observed scores attributable to the variance
of their true scores. The data are equally valued responses to interchangeable
tasks, constituting a source of potentially corroborating evidence, or more
evidence of the same kind about a given inference. Rho does in fact gauge
observed scores' weight of evidence, for the inference of lining up people from
this particular population along the true-score scale. It does in fact bound
validity, for the inference of predicting a variable related linearly to true
score, with "validity" defined as the correlation between the scores and that criterion variable.
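In symbols, the classical model behind rho that the paragraph above summarizes can be written as follows; this is standard textbook notation, not reproduced from the memorandum.

```latex
% Classical true-score model: observed score = true score + random error
X = T + E, \qquad \operatorname{Cov}(T, E) = 0 .

% Reliability: proportion of observed-score variance due to true scores
\rho = \frac{\sigma_T^2}{\sigma_X^2}
     = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2} .

% Validity bound: the correlation of X with any criterion Y related
% linearly to the true score cannot exceed the square root of rho
\rho_{XY} \le \sqrt{\rho} .
```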
In Analysis of Evidence, Anderson and Twining (1991) use analogies from educational testing to
help law students learn distinctions among rules, criteria, and standards for evaluating evidence.
Agreeing too much on key points, along with agreeing too little on tangential issues, lowers the
credibility of suspected collaborators in a criminal investigation; this pattern is likely under the
hypothesis of a rehearsed alibi. Reproducibility does not equal credibility, since a conspirator
can repeat a lie 100 times. Data "too good to be true" toppled Cyril Burt (Kamin, 1974).
Not all cold fusion experiments are created equal; those with better
controls and more reliable measuring instruments, or incorporating
lessons from earlier experiments, are privileged. Early positive results
were traced to experimental mistakes and interpretational errors, in which
questionable data were consistently accepted as evidence of desired
outcomes (Taubes, 1993).3
A joke made the rounds of experimental labs: "Q: Why can't most people get heat, neutrons,
and tritium [putative evidence of cold fusion] at the same time? A: It's almost impossible to
make that many mistakes at once" (Taubes, 1993, p. 468).
Conclusion
Can we have validity without reliability? If by "reliability" we mean only
KR-20 coefficients or inter-rater correlations, the answer is yes. Sometimes these
particular indices for evaluating evidence suit the problem we encounter;
sometimes they don't. But when multiple sources of evidence are available and
they don't agree, we'd better have alternative lines of argumentation to establish
the weight and relevance of the evidence to the inference being drawn.
Sometimes people disagree because they focus on different aspects of a situation
from different perspectives, which need to be integrated in a more thoughtful
way than averaging. But sometimes people disagree because they are
uninformed or biased, because their task is not clearly specified, or because they
are dishonest. We bear the burden of unraveling these possibilities.
If by "reliability" we mean credibility of evidence, where credibility is defined
as appropriate to the inference, the answer is no, we cannot have validity without
reliability. Because validity encompasses the process of reasoning as well as
the data, uncritically accepting observations as strong evidence, when they may
be incorrect, misleading, unrepresentative, or fraudulent, may lead
coincidentally to correct conclusions but not to valid ones. Good intentions and
plausible theories are not enough to honestly evaluate and subsequently improve
our efforts. That familiar tools for establishing the credibility of evidence in
educational assessment do not span the full range of inferences does not negate this broader requirement.
References
Anderson, T.J., & Twining, W.L. (1991). Analysis of evidence. Boston: Little,
Brown, & Co.
Andreassen, S., Jensen, F.V., & Olesen, K.G. (1990). Medical expert systems based
on causal probabilistic networks. Aalborg, Denmark: Institute of Electronic
Systems, Aalborg University.
Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability
of behavioral measurements: Theory of generalizability for scores and profiles. New
York: Wiley.
Gulliksen, H. (1950/1987). Theory of mental tests. New York: Wiley. Reprint,
Hillsdale, NJ: Erlbaum.
Kadane, J.B., & Schum, D.A. (1992). Opinions in dispute: the Sacco-Vanzetti case.
In J.M. Bernardo, J.O. Berger, A.P. Dawid, & A.F.M. Smith (Eds.), Bayesian
Statistics 4 (pp. 267-287). Oxford, U.K.: Oxford University Press.
Kamin, L.J. (1974). The science and politics of IQ. Potomac, MD: Erlbaum.
Klempner, G., Kornfeld, A., & Lloyd, B. (1991). The EPRI generator expert
monitoring system: Expertise with the GEMS prototype. Presented at the
American Power Conference, May, Chicago, IL.
Mislevy, R.J. (1994). Evidence and inference in educational assessment.
Psychometrika, 59, 439-483.
Moss, P. (1994). Can there be validity without reliability? Educational Researcher,
23(2), 5-12.
Myford, C.M., & Mislevy, R.J. (in press). Monitoring and improving a portfolio
assessment system. ETS Research Report. Princeton, NJ: Educational
Testing Service.
Taubes, G. (1993). Bad science: The short life and weird times of cold fusion. New
York: Random House.
Figure 2
An Influence Diagram for a Simple Diagnostic Problem