Reliability Without Reliability

Can There Be Reliability without Reliability?

Robert J. Mislevy
Educational Testing Service
Research Memorandum RM-94-18-ONR

October, 1994

This paper is a response to Pamela Moss's (1994) Educational Researcher article,
"Can there be validity without reliability?" It was supported by Contract No.
N00014-91-J-4101, R&T 4421573-01, from the Cognitive Science Program,
Cognitive and Neural Sciences Division, Office of Naval Research, and by the
National Center for Research on Evaluation, Standards, and Student Testing
(CRESST), Educational Research and Development Program, cooperative
agreement number R117G10027 and CFDA catalog number 84.117G,
administered by the Office of Educational Research and Improvement, U.S.
Department of Education. I am grateful to Drew Gitomer, Charlie Lewis, and
Howard Wainer for discussions and helpful comments on an earlier draft.

Can There Be Reliability without Reliability?

A recent article by Pamela Moss asks the title question, "Can there be
validity without reliability?" If by reliability we mean only KR-20 coefficients or
inter-rater correlations, the answer is yes. Sometimes these particular indices for
evaluating evidence suit the problem we encounter; sometimes they don't. If by
reliability we mean credibility of evidence, where credibility is defined as
appropriate to the intended inference, the answer is no, we cannot have validity
without reliability. Because validity encompasses the process of reasoning as
well as the data, uncritically accepting observations as strong evidence, when
they may be incorrect, misleading, unrepresentative, or fraudulent, may lead
coincidentally to correct conclusions but not to valid ones. This paper discusses
and illustrates a broader conception of reliability in educational assessment, to
ground a deeper understanding of the issues raised by Professor Moss's question.

Key words: Educational assessment, hermeneutics, reliability, validity.

"Can there be validity without reliability?" asks Pamela Moss (1994) in her
article of the same name in the Educational Researcher. Yes, Professor Moss
answers. She proposes a hermeneutic approach to educational assessment, the
validity of which, she argues, does not depend on standard test theory indicators
of reliability such as KR-20 coefficients and inter-rater correlations. I agree that it
is possible to have validity without reliability, if by reliability we refer only to
these particular indices and others like them, but this is far too narrow a
conception of reliability. More broadly construed, reliability concerns
the credibility and the limitations of the information from which we wish to draw
inferences. If we fail to address this concern in an appropriate manner, we fail to
establish the validity of those inferences. This paper discusses and illustrates a
broader conception of reliability in educational assessment, to ground a deeper
understanding of the issues raised by Professor Moss's question. (See Mislevy,
1994, for a more comprehensive discussion.)
That reliability be examined in an appropriate manner is key, because
KR-20s and inter-rater correlations characterize the credibility of certain kinds of
data we employ for certain kinds of inferences in educational assessment, but not
others. I applaud Professor Moss's use of a hermeneutic perspective to gain
insights into questions of educational assessment, because, as Goethe wrote in
Sprüche in Prosa, "He who is ignorant of foreign languages knows not his own."
Exploring how other fields deal with evidence and inference can indeed help us
disentangle the commingled concepts from statistics, psychology, and
measurement that constitute test theory as we usually think about it, to
distinguish how we are reasoning from what we are reasoning about, and to better
prepare ourselves to tackle problems of how to characterize students' learning
beyond scores on standardized tests, how to evoke and interpret evidence to this
end, and how to establish the weight and coverage of data as evidence for
conjectures and decisions framed in these terms. Physical measurement has long
been a source of concepts and techniques for educational assessment (e.g., Rasch,
1960/1980). In addition to the hermeneutic tradition, we can also gain insights
from fields such as medicine, history, and jurisprudence¹ (Schum, 1987). Seeing
how reliability problems arise and how they are dealt with in these fields
helps us understand their appearance in our own.

What is Reliability?
We can think of increasingly general senses of the term reliability as it
relates to educational assessment:
True-score reliability (Gulliksen, 1950). The classical reliability coefficient
rho assumes repeatable observations composed of an examinee's true score
and a random measurement error. Rho is the proportion of variance in a
particular population of examinees' observed scores attributable to the variance
of their true scores. The data are equally-valued responses to interchangeable
tasks, constituting a source of potentially corroborating evidence, or more
evidence of the same kind about a given inference. Rho does in fact gauge
observed scores' weight of evidence, but only for the inference of lining up people
from this particular population along the true-score scale. It does in fact bound
validity, but only for the inference of predicting a variable related linearly to true
score, with validity defined as the correlation between the scores and
predicted variables in this particular population. But even under classical test
theory, rho need not convey the evidential value of scores for other inferences:
for example, the magnitude of change in true score from pretest to posttest, or
whether a student's true score is above a specified cutoff value.

¹ In Analysis of evidence, Anderson and Twining (1991) use analogies from educational testing to
help law students learn distinctions among rules, criteria, and standards for evaluating evidence.

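
The verbal definition of rho given above can be written compactly. The notation below (observed score X, true score T, error E, predicted variable Y) is the conventional one and is assumed here rather than taken from the text; the bound on validity additionally assumes that the error E is uncorrelated with Y.

```latex
\begin{align}
X &= T + E, \qquad \operatorname{Cov}(T, E) = 0, \\
\rho_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2}
            = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}, \\
|\rho_{XY}| &\le \sqrt{\rho_{XX'}} .
\end{align}
```

Because the variances are taken over a particular population of examinees, the same test can yield different values of rho in different populations, which is one sense in which the coefficient is population-bound.
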
Reproducibility. We can extend reliability beyond this specific and
population-bound inference, yet retain grounding in the consistency of
exchangeable (equally-informative and equally-valued) independent sources.
Experimenters attempting to reproduce Pons and Fleischmann's purported cold
fusion results faced reliability concerns in this sense: "The way to circumvent this
skittishness [of BF3 neutron counters] was to use two counters, or even five or six,
and only pay attention to those events in which all the detectors fired
simultaneously" (Taubes, 1993, p. 450). Before detailing investigations of
witnesses to the Kennedy assassination, Gerald Posner (1993, p. 236)
summarized an overarching pattern:

How many shots were fired at Dealey Plaza? Estimates at the scene
ranged from one to eight. However, on this issue, there was more
agreement than on any other postassassination matter. Of the nearly two
hundred witnesses over 88 percent heard three shots. Although
almost every conspiracy theory that proposes more than one assassin
relies on there having been four or more shots, the writers seldom disclose
that fewer than one in twenty witnesses heard that many.

In educational measurement, proportions of agreement among raters,
decision-consistency coefficients, and generalizability coefficients (Cronbach et
al., 1972) reflect this sense of reliability. These indices characterize the weight of
evidence for inferences within the true-score test theory paradigm that are not
addressed by rho. They can be useful even if one doesn't literally believe
observations are exchangeable. The "Concentration" section of each Advanced
Placement Studio Art portfolio is rated by two judges independently, and only
portfolios that provoke excessive differences are probed further. If ensuing
discussion reveals one judge differed because of special knowledge of, say,
glazing techniques, this information impacts the deliberation. The
exchangeability framework provides indices of similarity among judges'
evaluations, but just as importantly, it highlights particulars where
exchangeability is not a plausible approximation, to direct attention and
expertise where they are most needed (Myford & Mislevy, in press).
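
The two uses of the exchangeability framework described above, an index of similarity and a pointer to cases that need attention, can be illustrated with a minimal sketch. The rating scale, the ratings, and the discrepancy threshold `max_gap` are hypothetical and are not the rules actually used for Advanced Placement Studio Art.

```python
from typing import List, Tuple


def agreement_summary(
    ratings_a: List[int],
    ratings_b: List[int],
    max_gap: int = 1,  # hypothetical threshold for an "excessive" difference
) -> Tuple[float, List[int]]:
    """Return the proportion of exact agreement between two judges and the
    indices of portfolios whose two ratings differ by more than max_gap."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both judges must rate the same portfolios")
    if not ratings_a:
        raise ValueError("at least one portfolio is required")
    pairs = list(zip(ratings_a, ratings_b))
    exact = sum(a == b for a, b in pairs)
    flagged = [i for i, (a, b) in enumerate(pairs) if abs(a - b) > max_gap]
    return exact / len(pairs), flagged


# Hypothetical ratings on a 1-4 scale for six portfolios.
judge_1 = [3, 2, 4, 1, 3, 2]
judge_2 = [3, 3, 4, 3, 3, 1]
proportion, to_discuss = agreement_summary(judge_1, judge_2)
```

The proportion summarizes similarity across all portfolios, while the flagged indices identify the particular portfolios where treating the judges as exchangeable is least plausible and where discussion is most likely to be informative.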
Differential likelihood. A datum becomes evidence in some analytic
problem when its relevance to one or more hypotheses being considered is
established. "[E]vidence is relevant on some hypothesis if it either increases or
decreases the likeliness of the hypothesis" (Schum, 1987, p. 16). Under
probability-based reasoning, the relative likelihood of an observation under
alternative true states is the weight of evidence it provides for each; reliable
observations make sharp distinctions among the possibilities.² The empirical
consistency discussed above is one way to ground likelihoods; we take a BF3
burst with a grain of salt once we know a bump is as likely to cause one as an
actual neutron. Theoretical and subjective considerations can also provide
information about relative likelihoods. The MUNIN neuro-muscular disease
diagnostic system uses conditional probabilities for test results and symptoms
given disease states, which are based on clinical experience and physiological
theory (Andreassen, Jensen, & Olesen, 1990). Failure Analysis Associates'
probability cones (Figure 1) for sources of shots in the Kennedy assassination
extend uncertainties in positions and angles backwards from the points of impact
(Posner, 1993, p. 476).

² Agreeing too much on key points, along with agreeing too little on tangential issues, lowers the
credibility of suspected collaborators in a criminal investigation; this pattern is likely under the
hypothesis of a rehearsed alibi. Reproducibility does not equal credibility, since a conspirator
can repeat a lie 100 times. Data too good to be true toppled Cyril Burt (Kamin, 1974).

[Figure 1]
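
The idea that the relative likelihood of an observation under alternative states is its weight of evidence can be made concrete with a small sketch. All probabilities below are hypothetical, chosen only to contrast an equivocal observation (such as a counter burst that a bump is as likely to cause as a neutron) with a sharper one; the function names are illustrative.

```python
import math


def likelihood_ratio(p_obs_given_h1: float, p_obs_given_h2: float) -> float:
    """Relative likelihood of one observation under two hypotheses."""
    if not (0 < p_obs_given_h1 <= 1 and 0 < p_obs_given_h2 <= 1):
        raise ValueError("probabilities must lie in (0, 1]")
    return p_obs_given_h1 / p_obs_given_h2


def weight_of_evidence(ratio: float) -> float:
    """Log (base 10) of the likelihood ratio; zero means no evidential weight."""
    return math.log10(ratio)


def posterior_probability(prior_h1: float, ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior_h1 / (1.0 - prior_h1)
    post_odds = prior_odds * ratio
    return post_odds / (1.0 + post_odds)


prior = 0.10  # hypothetical prior probability of hypothesis 1

# Equivocal observation: equally likely under either hypothesis.
equivocal = likelihood_ratio(0.30, 0.30)
after_equivocal = posterior_probability(prior, equivocal)

# Sharper observation: nine times as likely under hypothesis 1.
sharp = likelihood_ratio(0.90, 0.10)
after_sharp = posterior_probability(prior, sharp)
```

With these numbers the equivocal observation leaves the prior unchanged, while the sharper one moves it substantially; in this sense the second observation is the more reliable of the two for this particular pair of hypotheses.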
We use similar reasoning to convey our uncertainty about a student's
proficiency under an item response theory model, or her stage of proportional
reasoning under a latent class model. We obtain in these cases numerical
assessments of the evidential value (read "reliability") of the data, but only if,
perhaps after considerable effort, we can arrange circumstances in which our
data, our model, and our intentions cohere (Wright & Stone, 1979). Less
formally, a tutor constructs a model for a student's understanding, probing
"What organization does the student have in mind so that his actions seem, to
him, to form a coherent pattern?" (Thompson, 1982). Reliable data allow the
tutor to identify a perspective from which the student's pattern of actions makes
sense, but is unlikely from relevant alternative perspectives.
This is not reliability in the sense of accumulating corroborating evidence,
as in classical test theory, but in the sense of converging evidence: accumulating
evidence of different types that support the same inference. A mass of data is
more reliable in this sense as more aspects support a given inference and fewer
aspects conflict or contradict it. It is less reliable when it is internally inconsistent
or equivocal, or if we realize that securing additional information would cause
us to revise our beliefs substantially. Such considerations characterize the
reliability of the evidence supporting a legal case, and jurists and statisticians
have explored the means by which, and the extent to which, they can be
expressed in terms of differential likelihoods (e.g., Kadane & Schum, 1992).
Credibility. In common parlance, reliability simply means the extent to
which information can be trusted, a concern clearly broader than traditional
educational measurement situations. The world constantly confronts us with
unrepeatable observations and non-exchangeable sources, which we must
interpret as best we can if we have no alternative (there was only one trial of the
Kennedy assassination), or learn from to develop more principled ways of
gathering and interpreting information (to assess prospects of cold fusion or
students' understandings of proportional reasoning).
When sources are not exchangeable, we must unravel secondary sources
of information about their credibilities:

Not all cold fusion experiments are created equal; those with better
controls and more reliable measuring instruments, or incorporating
lessons from earlier experiments, are privileged. Early positive results
were traced to experimental mistakes and interpretational errors, in which
questionable data were consistently accepted as evidence of desired
outcomes (Taubes, 1993).³

Lincoln at Gettysburg (Wills, 1992) is mainly a hermeneutic analysis of what
Lincoln meant when he presented the Gettysburg Address, but its
Appendix I explores what he actually said. Five versions in Lincoln's hand
and four newspaper transcriptions survive. Unanimity about a phrase
suggests he spoke it as such, but for discrepancies Wills must consider
such clues as these: The draft Lincoln's secretary claimed he saw Lincoln
speak from appears on Executive Mansion letterhead, corroborating
eyewitness accounts, but omits key phrases all newspapers report and
garbles the transition between pages.

³ A joke made the rounds of experimental labs: Q: Why can't most people get heat, neutrons,
and tritium [putative evidence of cold fusion] at the same time? A: It's almost impossible to
make that many mistakes at once (Taubes, 1993, p. 468).

We must often integrate multiple strands of evidence, and reliability
typically refers to the weight of evidence of a particular strand. Influence
diagrams in troubleshooting (e.g., Klempner et al., 1991), medical diagnosis
(Andreassen et al., 1990), and legal reasoning (Wigmore, 1937) depict how
sources and credibilities of information relate to inferences. Temperature is one
strand of evidence in determining whether a child's infection is bacterial or viral
(Figure 2). A thermometer reading is direct evidence about temperature, and the
reliability of the thermometer concerns its credibility about this symptom. The
reading is indirect evidence about the nature of the illness. In conjunction with other
evidence, even a hand on his forehead (an "unreliable thermometer") can aid
in the diagnosis.
[Figure 2]
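
The distinction between direct evidence about temperature and indirect evidence about the illness can be sketched with a minimal chain, illness to fever to observed signal. This chain may well be simpler than the diagram in Figure 2, and every probability below is hypothetical; the point is only that a less credible instrument still shifts belief about the illness, though by less.

```python
from typing import Dict


def posterior_bacterial(
    prior_bacterial: float,
    p_fever_given: Dict[str, float],
    p_signal_if_fever: float,
    p_signal_if_no_fever: float,
) -> float:
    """P(bacterial | instrument signals fever), summing over the unobserved
    true fever state that links the illness to the instrument."""

    def p_signal(illness: str) -> float:
        p_fever = p_fever_given[illness]
        return (p_signal_if_fever * p_fever
                + p_signal_if_no_fever * (1.0 - p_fever))

    joint_bacterial = prior_bacterial * p_signal("bacterial")
    joint_viral = (1.0 - prior_bacterial) * p_signal("viral")
    return joint_bacterial / (joint_bacterial + joint_viral)


# Hypothetical: fever is more common with bacterial than with viral infection.
fever_rates = {"bacterial": 0.8, "viral": 0.4}

# A credible thermometer versus a hand on the forehead (hypothetical rates).
with_thermometer = posterior_bacterial(0.5, fever_rates, 0.95, 0.05)
with_hand = posterior_bacterial(0.5, fever_rates, 0.70, 0.30)
```

Under these assumed numbers both signals raise the probability of a bacterial infection above the prior of 0.5, and the thermometer raises it further than the hand does, because its reading is more credible evidence about the intermediate symptom.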
In this light, the irony is not that test administrators warn test users
against interpreting scores without other sources of information, but that the test
users themselves are most prone to reify traits such as IQ or writing
ability. The view among contemporary researchers, whose work is beginning to
influence the next generation of tests, substantiates the caveat:

The evidence from cognitive psychology suggests that test performances
are comprised of complex assemblies of component information-processing
actions that are adapted to task requirements during
performance... Whatever their practical value as summaries, for selection,
classification, certification, or program evaluation, the cognitive
psychological view is that such [trait-based] interpretations no longer
suffice as scientific explanations of aptitude and achievement constructs
(Snow & Lohman, 1989, p. 317).

Can we have validity without reliability? If by reliability we mean only
KR-20 coefficients or inter-rater correlations, the answer is yes. Sometimes these
particular indices for evaluating evidence suit the problem we encounter;
sometimes they don't. But when multiple sources of evidence are available and
they don't agree, we'd better have alternative lines of argumentation to establish
the weight and relevance of the evidence to the inference being drawn.
Sometimes people disagree because they focus on different aspects of a situation
from different perspectives, which need to be integrated in a more thoughtful
way than averaging. But sometimes people disagree because they are
uninformed or biased, because their task is not clearly specified, or because they
are dishonest. We bear the burden of unraveling these possibilities.
If by reliability we mean credibility of evidence, where credibility is defined
as appropriate to the inference, the answer is no, we cannot have validity without
reliability. Because validity encompasses the process of reasoning as well as
the data, uncritically accepting observations as strong evidence, when they may
be incorrect, misleading, unrepresentative, or fraudulent, may lead
coincidentally to correct conclusions but not to valid ones. Good intentions and
plausible theories are not enough to honestly evaluate and subsequently improve
our efforts. That familiar tools for establishing the credibility of evidence in
educational assessment do not span the full range of inferences does not negate
the responsibility to establish the credibility of evidence upon which educational
decisions are made. If anything, our task becomes harder rather than easier.

References

Anderson, T.J., & Twining, W.L. (1991). Analysis of evidence. Boston: Little,
Brown, & Co.
Andreassen, S., Jensen, F.V., & Olesen, K.G. (1990). Medical expert systems based
on causal probabilistic networks. Aalborg, Denmark: Institute of Electronic
Systems, Aalborg University.
Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability
of behavioral measurements: Theory of generalizability for scores and profiles. New
York: Wiley.
Gulliksen, H. (1950/1987). Theory of mental tests. New York: Wiley. Reprint,
Hillsdale, NJ: Erlbaum.
Kadane, J.B., & Schum, D.A. (1992). Opinions in dispute: the Sacco-Vanzetti case.
In J.M. Bernardo, J.O. Berger, A.P. Dawid, & A.F.M. Smith (Eds.), Bayesian
Statistics 4 (pp. 267-287). Oxford, U.K.: Oxford University Press.
Kamin, L.J. (1974). The science and politics of IQ. Potomac, MD: Erlbaum.
Klempner, G., Kornfeld, A., & Lloyd, B. (1991). The EPRI generator expert
monitoring system: Expertise with the GEMS prototype. Presented at the
American Power Conference, May, Chicago, IL.
Mislevy, R.J. (1994). Evidence and inference in educational assessment.
Psychometrika, 59, 439-483.
Moss, P. (1994). Can there be validity without reliability? Educational Researcher,
23(2), 5-12.
Myford, C.M., & Mislevy, R.J. (in press). Monitoring and improving a portfolio
assessment system. ETS Research Report. Princeton, NJ: Educational
Testing Service.
Posner, G. (1993). Case closed: Lee Harvey Oswald and the assassination of JFK. New
York: Random House.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests.
Copenhagen: Danish Institute for Educational Research/Chicago:
University of Chicago Press (reprint).
Schum, D.A. (1987). Evidence and inference for the intelligence analyst. Lanham, Md.:
University Press of America.
Snow, R.E., & Lohman, D.F. (1984). Toward a theory of cognitive aptitude for
learning from instruction. Journal of Educational Psychology, 76, 347-376.
Taubes, G. (1993). Bad science: The short life and weird times of cold fusion. New York:
Random House.
Thompson, P.W. (1982). Were lions to speak, we wouldn't understand. Journal of
Mathematical Behavior, 3, 147-165.
Wigmore, J.H. (1937). The science of judicial proof (3rd Ed.). Boston: Little, Brown, &
Co.
Wright, B.D., & Stone, M. (1979). Best test design. Chicago: MESA Press.

Figure 2
An Influence Diagram for a Simple Diagnostic Problem

