Chapter 5

Prepared by Sittie Ashya B. Ali (BS Psychology), MSU-Main Campus, Marawi City
Psychological Testing and Assessment (7th Ed), Cohen-Swerdlik

If σ² represents the total variance, σ²tr the true variance, and σ²e the error variance, then the relationship of the variances can be expressed as

σ² = σ²tr + σ²e

The term reliability refers to the proportion of the total variance attributed to true variance. The greater the proportion of the total variance attributed to true variance, the more reliable the test. Because true differences are assumed to be stable, they are presumed to yield consistent scores on repeated administrations of the same test as well as on equivalent forms of tests.

Let's emphasize here that a systematic source of error would not affect score consistency. A systematic error source does not change the variability of the distribution or affect reliability.

Sources of error variance that occur during test administration may influence the testtaker's attention or motivation. The testtaker's reactions to those influences are the source of one kind of error variance. Examples of untoward influences during administration of a test include factors related to the test environment: the room temperature, the level of lighting, and the amount of ventilation and noise, for instance.

Other potential sources of error variance during test administration are testtaker variables. Pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance.

Examiner-related variables are potential sources of error variance. The examiner's physical appearance and demeanor—even the presence or absence of an examiner—are some factors for consideration here.

Some examiners in some testing situations might knowingly or unwittingly depart from the procedure prescribed for a particular test. On an oral examination, some examiners may unwittingly provide clues by emphasizing key words as they pose questions. They might convey information about the correctness of a response through head nodding, eye movements, or other nonverbal gestures. Clearly, the level of professionalism exhibited by examiners is a source of error variance.

Test Scoring and Interpretation

…abuse one partner suffers at the hands of the other may never be known, so the amount of test variance that is true relative to error may never be known.

Reliability Estimates

Test-Retest Reliability Estimates

One way of estimating the reliability of a measuring instrument is by using the same instrument to measure the same thing at two points in time. In psychometric parlance, this approach to reliability evaluation is called the test-retest method, and the result of such an evaluation is an estimate of test-retest reliability.
…loudness, or taste). However, even in measuring variables such as these, and even when the time period between the two administrations of the test is relatively small, various factors (such as experience, practice, memory, fatigue, and motivation) may intervene and confound an obtained measure of reliability.

Parallel-Forms and Alternate-Forms Reliability Estimates

The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, also termed the coefficient of equivalence.

Parallel forms of a test exist when, for each form of the test, the means and the variances of observed test scores are equal.

In theory, the means of scores obtained on parallel forms correlate equally with the true score. More practically, scores obtained on parallel tests correlate equally with other measures.

Alternate forms are simply different versions of a test that have been constructed so as to be parallel.

…because of the particular items that were selected for inclusion in the test.

On the other hand, once an alternate or parallel form of a test has been developed, it is advantageous to the test user in several ways. For example, it minimizes the effect of memory for the content of a previously administered form of the test. Certain traits are presumed to be relatively stable in people over time, and we would expect tests measuring those traits—alternate forms, parallel forms, or otherwise—to reflect that stability.

An estimate of the reliability of a test can be obtained without developing an alternate form of the test and without having to administer the test twice to the same people. Deriving this type of estimate entails an evaluation of the internal consistency of the test items. Logically enough, it is referred to as an internal consistency estimate of reliability or as an estimate of inter-item consistency. There are different methods of obtaining internal consistency estimates of reliability. One such method is the split-half estimate.

Internal Consistency Estimates of Reliability
…because it's likely this procedure would spuriously raise or lower the reliability coefficient. Different amounts of fatigue for the first as opposed to the second part of the test, different amounts of test anxiety, and differences in item difficulty as a function of placement in the test are all factors to consider.

One acceptable way to split a test is to randomly assign items to one or the other half of the test. Another acceptable way to split a test is to assign odd-numbered items to one half of the test and even-numbered items to the other half. This method yields an estimate of split-half reliability that is also referred to as odd-even reliability. Yet another way to split a test is to divide the test by content so that each half contains items equivalent with respect to content and difficulty.

In general, a primary objective in splitting a test in half for the purpose of obtaining a split-half reliability estimate is to create what might be called "mini-parallel-forms," with each half equal to the other—or as nearly equal as humanly possible—in format, stylistic, statistical, and related aspects.

The Spearman-Brown formula allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test. It is a specific application of a more general formula to estimate the reliability of a test that is lengthened or shortened by any number of items.

The general Spearman-Brown (rSB) formula is

rSB = (n · rxy) / [1 + (n − 1) · rxy]

where rSB is equal to the reliability adjusted by the Spearman-Brown formula, rxy is equal to the Pearson r in the original-length test, and n is equal to the number of items in the revised version divided by the number of items in the original version.

By determining the reliability of one half of a test, a test developer can use the Spearman-Brown formula to estimate the reliability of a whole test. Because a whole test is two times longer than half a test, n becomes 2 in the Spearman-Brown formula for the adjustment of split-half reliability. The symbol rhh stands for the Pearson r of scores in the two half tests:

rSB = (2 · rhh) / (1 + rhh)

Usually, but not always, reliability increases as test length increases. Ideally, the additional test items are equivalent with respect to the content and the range of difficulty of the original items. Estimates of reliability based on consideration of the entire test therefore tend to be higher than those based on half of a test.

If test developers or users wish to shorten a test, the Spearman-Brown formula may be used to estimate the effect of the shortening on the test's reliability. Reduction in test size for the purpose of reducing test administration time is a common practice in certain situations.

A Spearman-Brown formula could also be used to determine the number of items needed to attain a desired level of reliability. In adding items to increase test reliability to a desired level, the rule is that the new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured. If the reliability of the original test is relatively low, then it may be impractical to increase the number of items to reach an acceptable level of reliability. Another alternative would be to abandon this relatively unreliable instrument and locate—or develop—a suitable alternative.

Internal consistency estimates of reliability, such as that obtained by use of the Spearman-Brown formula, are inappropriate for measuring the reliability of heterogeneous tests and speed tests.

Other Methods of Estimating Internal Consistency

Other methods used to obtain estimates of internal consistency reliability include formulas developed by Kuder and Richardson (1937) and Cronbach (1951). Inter-item consistency refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test. An index of inter-item
consistency, in turn, is useful in assessing the homogeneity of the test. Tests are said to be homogeneous if they contain items that measure a single trait.

Homogeneity (derived from the Greek words homos, meaning "same," and genos, meaning "kind") is the degree to which a test measures a single factor. In other words, homogeneity is the extent to which items in a scale are unifactorial.

In contrast to test homogeneity, heterogeneity describes the degree to which a test measures different factors. A heterogeneous (or nonhomogeneous) test is composed of items that measure more than one factor.

…split-half reliability estimates will be similar. However, KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items). If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method.

Because of the great heterogeneity of content areas when taken as a whole, it could reasonably be predicted that the KR-20 estimate of reliability will be lower than the odd-even one. The following formula may be used:
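The KR-20 formula referred to above did not survive in this copy of the notes. As a hedged sketch, the standard Kuder-Richardson formula 20 is KR-20 = [k/(k − 1)] · [1 − (Σpq / σ²)], where p is the proportion passing an item and q = 1 − p; coefficient alpha (discussed below) generalizes it by replacing Σpq with the sum of the item variances. The function names here are illustrative:

```python
# Hedged sketch (standard formulas; the KR-20 formula itself is missing
# from this copy of the notes).
import statistics

def kr20(scores):
    """scores[i][j] = 1 if person i passed item j, else 0."""
    k = len(scores[0])                                   # number of items
    totals = [sum(row) for row in scores]
    var_total = statistics.pvariance(totals)             # variance of total scores
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in scores) / len(scores)  # proportion passing item j
        sum_pq += p * (1 - p)                            # q = 1 - p
    return (k / (k - 1)) * (1 - sum_pq / var_total)

def cronbach_alpha(scores):
    """Coefficient alpha: like KR-20, but items may be nondichotomous."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    var_total = statistics.pvariance(totals)
    item_vars = sum(statistics.pvariance([row[j] for row in scores]) for j in range(k))
    return (k / (k - 1)) * (1 - item_vars / var_total)
```

For strictly 0/1 items the two functions agree, since the variance of a 0/1 item is exactly p(1 − p).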
…letter alpha (α) and the number 20, the latter a reference to KR-20.

Developed by Cronbach (1951) and subsequently elaborated on by others, coefficient alpha may be thought of as the mean of all possible split-half correlations, corrected by the Spearman-Brown formula. In contrast to KR-20, which is appropriately used only on tests with dichotomous items, coefficient alpha is appropriate for use on tests containing nondichotomous items. The formula for coefficient alpha is

rα = [k / (k − 1)] · [1 − (Σσi² / σ²)]

where rα is coefficient alpha, k is the number of items, σi² is the variance of one item, Σσi² is the sum of the variances of each item, and σ² is the variance of the total test scores.

Coefficient alpha is the preferred statistic for obtaining an estimate of internal consistency reliability. A variation of the formula has been developed for use in obtaining an estimate of test-retest reliability. Essentially, this formula yields an estimate of the mean of all possible test-retest, split-half coefficients. Coefficient alpha is widely used as a measure of reliability, in part because it requires only one administration of the test.

Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0 to 1. The reason for this is that, conceptually, coefficient alpha (much like other coefficients of reliability) is calculated to help answer questions about how similar sets of data are. Here, similarity is gauged, in essence, on a scale from 0 (absolutely no similarity) to 1 (perfectly identical). It is possible, however, to conceive of data sets that would yield a negative value of alpha. Still, because negative values of alpha are theoretically impossible, it is recommended under such rare circumstances that the alpha coefficient be reported as zero. As Streiner (2003) pointed out, a value of alpha above .90 may be "too high" and indicate redundancy in the items.

In contrast to coefficient alpha, a Pearson r may be thought of as dealing conceptually with both dissimilarity and similarity. Accordingly, an r value of −1 may be thought of as indicating "perfect dissimilarity." In practice, most reliability coefficients—regardless of the specific type of reliability they are measuring—range in value from 0 to 1.

Let's emphasize that all indexes of reliability, coefficient alpha among them, provide an index that is a characteristic of a particular group of test scores, not of the test itself. The precise amount of error inherent in a reliability estimate will vary with the sample of testtakers from which the data were drawn. If a new group of testtakers is sufficiently different from the group of testtakers on whom the reliability studies were done, the reliability coefficient may not be as impressive—and may even be unacceptable.

Measures of Inter-Scorer Reliability

Unfortunately, in some types of tests under some conditions, the score may be more a function of the scorer than anything else.

Variously referred to as scorer reliability, judge reliability, observer reliability, and inter-rater reliability, inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.

If the reliability coefficient is high, the prospective test user knows that test scores can be derived in a systematic, consistent way by various scorers with sufficient training. Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a coefficient of inter-scorer reliability.

Using and Interpreting a Coefficient of Reliability

We have seen that, with respect to the test itself, there are basically three approaches to the estimation of reliability: (1) test-retest, (2) alternate or parallel forms, and (3) internal or inter-item consistency. The method or methods
employed will depend on a number of factors, such as the purpose of obtaining a measure of reliability.

Reliability is a mandatory attribute in all tests we use. However, we need more of it in some tests, and we will admittedly allow for less of it in others. If a test score carries with it life-or-death implications, then we need to hold that test to some high standards—including relatively high standards with regard to coefficients of reliability. If a test score is routinely used in combination with many other test scores and typically accounts for only a small part of the decision process, that test will not be held to the highest standards of reliability.

The Purpose of the Reliability Coefficient

If a specific test of employee performance is designed for use at various times over the course of the employment period, it would be reasonable to expect the test to demonstrate reliability across time. It would thus be desirable to have an estimate of the instrument's test-retest reliability.

For a test designed for a single administration only, an estimate of internal consistency would be the reliability measure of choice.

If the purpose of determining reliability is to break down the error variance into its parts, then a number of reliability coefficients would have to be calculated.

Thus, an individual reliability coefficient may provide an index of error from test construction, test administration, or test scoring and interpretation. A coefficient of inter-rater reliability, for example, provides information about error as a result of test scoring. Specifically, it can be used to answer questions about how consistently two scorers score the same test items.

The Nature of the Test

Closely related to considerations concerning the purpose and use of a reliability coefficient are those concerning the nature of the test itself. Included here are considerations such as whether
(1) the test items are homogeneous or heterogeneous in nature;
(2) the characteristic, ability, or trait being measured is presumed to be dynamic or static;
(3) the range of test scores is or is not restricted;
(4) the test is a speed or a power test; and
(5) the test is or is not criterion-referenced.

Homogeneity versus Heterogeneity of Test Items

Tests designed to measure one factor, such as one ability or one trait, are expected to be homogeneous in items. For such tests, it is reasonable to expect a high degree of internal consistency. By contrast, if the test is heterogeneous in items, an estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability.

Dynamic versus Static Characteristics

A dynamic characteristic is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences. The best estimate of its reliability would be obtained from a measure of internal consistency.

A static characteristic is a trait, state, or ability presumed to be relatively unchanging, such as intelligence. In this instance, obtained measurement would not be expected to vary significantly as a function of time, and either the test-retest or the alternate-forms method would be appropriate.

Restriction or Inflation of Range

In using and interpreting a coefficient of reliability, the issue variously referred to as restriction of range or restriction of variance (or, conversely, inflation of range or inflation of variance) is important. If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower. If the variance of either variable in a correlational analysis is inflated by the
sampling procedure, then the resulting correlation coefficient tends to be higher.

Also of critical importance is whether the range of variances employed is appropriate to the objective of the correlational analysis.

Speed Tests versus Power Tests

When a time limit is long enough to allow testtakers to attempt all items, and if some items are so difficult that no testtaker is able to obtain a perfect score, then the test is a power test. By contrast, a speed test generally contains items of uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly.

Score differences on a speed test are therefore based on performance speed because items attempted tend to be correct.

A reliability estimate of a speed test should be based on performance from two independent testing periods using one of the following: (1) test-retest reliability, (2) alternate-forms reliability, or (3) split-half reliability from two separately timed half tests. If a split-half procedure is used, then the obtained reliability coefficient is for a half test and should be adjusted using the Spearman-Brown formula.

If a speed test is administered once and some measure of internal consistency is calculated, such as the Kuder-Richardson or a split-half correlation, the result will be a spuriously high reliability coefficient.

Criterion-Referenced Tests

A criterion-referenced test is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective.

Recall that a test-retest reliability estimate is based on the correlation between the total scores on two administrations of the same test. In alternate-forms reliability, a reliability estimate is based on the correlation between the two total scores on the two forms. In split-half reliability, a reliability estimate is based on the correlation between scores on two halves of the test and is then adjusted using the Spearman-Brown formula to obtain a reliability estimate of the whole test. Although there are exceptions, such traditional procedures of estimating reliability are usually not appropriate for use with criterion-referenced tests.

In criterion-referenced testing, and particularly in mastery testing, how different the scores are from one another is seldom a focus of interest. The critical issue for the user of a mastery test is whether or not a certain criterion score has been achieved. Traditional ways of estimating reliability are not always appropriate for criterion-referenced tests, though there may be instances in which traditional estimates can be adopted.

Alternatives to the True Score Model (Classical Model)

The true score model is the most widely used and accepted model in the psychometric literature today. Historically, the true score model of the reliability of measurement enjoyed a virtually unchallenged reign of acceptance from the early 1900s through the 1940s. The 1950s saw the development of an alternative theoretical model, one originally referred to as domain sampling theory and better known today in one of its many modified forms as generalizability theory.

The theory of domain sampling rebels against the concept of a true score existing with respect to the measurement of psychological constructs. Whereas those who subscribe to true score theory seek to estimate the portion of a test score that is attributable to error, proponents of domain sampling theory seek to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score.

In domain sampling theory, a test's reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample. In theory, the items in the domain are thought to have the same means and variances as those in the test that samples from the domain. Of the three types of estimates of reliability, measures of
internal consistency are perhaps the most compatible with domain sampling theory.

Generalizability Theory

…test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use.
…scores. By contrast, such assumptions are inherent in latent-trait models. A shorthand reference to these types of models is "Rasch," so reference to "the Rasch model" is a reference to an IRT model with very specific assumptions about the underlying distribution. The psychometric advantages of item response theory have made this model appealing, especially to commercial and academic test developers and to large-scale test publishers.

Reliability and Individual Scores

The Standard Error of Measurement

The standard error of measurement, often abbreviated as SEM, provides a measure of the precision of an observed test score. It provides an estimate of the amount of error inherent in an observed score or measurement.

In general, the relationship between the SEM and the reliability of a test is inverse; the higher the reliability of a test (or individual subtest within a test), the lower the SEM.

The standard error of measurement is the tool used to estimate or infer the extent to which an observed score deviates from a true score. It is also the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests.

Also known as the standard error of a score and denoted by the symbol σmeas, the standard error of measurement is an index of the extent to which one individual's scores vary over tests presumed to be parallel. If the standard deviation for the distribution of test scores is known (or can be calculated) and if an estimate of the reliability of the test is known (or can be calculated), then an estimate of the standard error of a particular score (that is, the standard error of measurement) can be determined by the following formula:

σmeas = σ √(1 − rxx)

where σmeas is equal to the standard error of measurement, σ is equal to the standard deviation of test scores by the group of testtakers, and rxx is equal to the reliability coefficient of the test. The standard error of measurement allows us to estimate, with a specific level of confidence, the range in which the true score is likely to exist.

Because the standard error of measurement functions like a standard deviation in this context, we can use it to predict what would happen if an individual took additional equivalent tests:
approximately 68% (actually, 68.26%) of the scores would be expected to occur within ±1 σmeas of the true score;
approximately 95% (actually, 95.44%) of the scores would be expected to occur within ±2 σmeas of the true score;
approximately 99% (actually, 99.74%) of the scores would be expected to occur within ±3 σmeas of the true score.

The best estimate available of the individual's true score on the test is the test score already obtained. The standard error of measurement, like the reliability coefficient, is one way of expressing test reliability. If the standard deviation of a test is held constant, then the smaller the σmeas, the more reliable the test will be; as rxx increases, the σmeas decreases.

In practice, the standard error of measurement is most frequently used in the interpretation of individual test scores. Further, the standard error of measurement is useful in establishing what is called a confidence interval: a range or band of test scores that is likely to contain the true score.

The Standard Error of the Difference between Two Scores

The amount of error in a specific test score is embodied in the standard error of measurement. But scores can change from one testing to the next for reasons other than error. True differences in the characteristic being measured can also affect test scores. These differences may be of great interest. Comparisons between scores are made using the standard error of the difference, a statistical measure that can aid a test
user in determining how large a
difference should be before it is
considered statistically significant.
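The computations in this section can be sketched as follows. The sem() function follows the formula given above; the standard error of the difference is described in these notes only in words, so the formula used in sed() (the square root of the sum of the two squared SEMs) is the standard one and should be read as an assumption of this sketch.

```python
# Sketch of the standard error of measurement and related quantities.
import math

def sem(sd, r_xx):
    """Standard error of measurement: sigma * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - r_xx)

def confidence_interval_68(observed, sd, r_xx):
    """Band of +/- 1 SEM around the obtained score (about 68% confidence)."""
    e = sem(sd, r_xx)
    return observed - e, observed + e

def sed(sd1, r1, sd2, r2):
    """Standard error of the difference between two scores
    (standard formula, stated here as an assumption)."""
    return math.sqrt(sem(sd1, r1) ** 2 + sem(sd2, r2) ** 2)

# Example: an IQ-style scale with SD = 15 and reliability .89
print(round(sem(15, 0.89), 2))   # 4.97
```

Note how the inverse relationship described earlier falls out of the formula: as r_xx approaches 1, sem() approaches 0.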