0% found this document useful (0 votes)
10 views

Chapter 5

Psychological Assessment
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views

Chapter 5

Psychological Assessment
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 11

Chapter 5: Reliability Domain sampling model is the idea that

we cannot cover/ask all the questions,


Reliability refers to consistency in instances, etc. about a trait/topic. Some
measurement. sources of error is limited items that is not
sufficient to cover a domain,
Reliability coefficient is an index of
reliability, a proportion that indicates the Item response theory is sensitive and
ratio between the true score variance on a responsive to the ability of a person. It
test and the total variance. discriminates the differences between the
ability of the testtakers. Different levels of
The Concept of Reliability difficulty depend on the person.

Error refers to the component of the Sources of Error Variance


observed test score that does not have to do
with the testtaker’s ability.  Test Construction

If we use X to represent an observed score, One source of variance during test


T to represent a true score, and E to construction is item sampling or content
represent error, then the fact that an sampling, terms that refer to variations
observed score equals the true score plus among items within a test as well as to
error may be expressed as follows: X = T + variation among items between tests.
E Consider two or more tests designed to
measure a specific skill, personality
True score is your true ability, not affected attribute, or body of knowledge. Differences
by other variables. Error is any other are sure to be found in the way the items
variable that affect the score. It is always are worded and in the exact content
random. sampled.

A statistic useful in describing sources of From the perspective of a test creator, a


test score variability is the variance (2)— challenge in test development is to
the standard deviation squared. This maximize the proportion of the total
statistic is useful because it can be broken variance that is true variance and to
into components. Variance from true minimize the proportion of the total variance
differences is true variance, and variance that is error variance.
from irrelevant, random sources is error
variance.  Test Administration

If 2 represents the total variance, 2tr the Sources of error variance that occur during
true variance, and 2e the error variance, test administration may influence the
then the relationship of the variances can be testtaker’s attention or motivation. The
expressed as 2 = 2tr + 2e testtaker’s reactions to those influences are
the source of one kind of error variance.
The term reliability refers to the proportion Examples of untoward influences during
of the total variance attributed to true administration of a test include factors
variance. The greater the proportion of the related to the test environment: the room
total variance attributed to true variance, temperature, the level of lighting, and the
the more reliable the test. Because true amount of ventilation and noise, for
differences are assumed to be stable, they instance.
are presumed to yield consistent scores on
repeated administrations of the same test Other potential sources of error variance
as well as on equivalent forms of tests. during test administration are testtaker
variables. Pressing emotional problems,
Let’s emphasize here that a systematic physical discomfort, lack of sleep, and the
source of error would not affect score effects of drugs or medication can all be
consistency. A systematic error source does sources of error variance.
not change the variability of the distribution
or affect reliability. Examiner-related variables are potential
sources of error variance. The examiner’s

Prepared by Sittie Ashya B. Ali (BS Psychology) Psychological Testing and Assessment (7th Ed)
MSU-Main Campus, Marawi City Cohen-Swerdlik
physical appearance and demeanor—even abuse one partner suffers at the hands of
the presence or absence of an examiner— the other may never be known, so the
are some factors for consideration here. amount of test variance that is true relative
to error may never be known.
Some examiners in some testing situations
might knowingly or unwittingly depart from Reliability Estimates
the procedure prescribed for a particular
test. On an oral examination, some Test-Retest Reliability Estimates
examiners may unwittingly provide clues by
emphasizing key words as they pose One way of estimating the reliability of a
questions. They might convey information measuring instrument is by using the same
about the correctness of a response through instrument to measure the same thing at
head nodding, eye movements, or other two points in time. In psychometric
nonverbal gestures. Clearly, the level of parlance, this approach to reliability
professionalism exhibited by examiners is a evaluation is called the test-retest
source of error variance. method, and the result of such an
evaluation is an estimate of test-retest
 Test Scoring and Interpretation reliability.

The advent of computer scoring and a Test-retest reliability is an estimate of


growing reliance on objective, computer- reliability obtained by correlating pairs of
scorable items virtually have eliminated scores from the same people on two
error variance caused by scorer differences different administrations of the same test.
in many tests. Manuals for individual
intelligence tests tend to be very explicit The test-retest measure is appropriate when
about scoring criteria lest examinees’ evaluating the reliability of a test that
measured intelligence vary as a function of purports to measure something that is
who is doing the testing and scoring. relatively stable over time, such as a
personality trait. It is generally the case
Scorers and scoring systems are potential (although there are exceptions) that, as the
sources of error variance. A test may employ time interval between administrations of the
objective-type items amenable to computer same test increases, the correlation
scoring of well-documented reliability. Yet between the scores obtained on each
even then, the possibility of a technical testing decreases. The passage of time can
glitch contaminating the data is possible. If be a source of error variance. The longer the
subjectivity is involved in scoring, then the time that passes, the greater the likelihood
scorer (or rater) can be a source of error that the reliability coefficient will be lower.
variance. The element of subjectivity in
scoring may be much greater in the When the interval between testing is greater
administration of certain nonobjective-type than six months, the estimate of test-retest
personality tests, tests of creativity (such as reliability is often referred to as the
the block test just described), and certain coefficient of stability.
academic tests (such as essay
examinations). An evaluation of a test-retest reliability
coefficient must therefore extend beyond
 Other sources of Error the magnitude of the obtained coefficient. If
we are to come to proper conclusions about
Certain types of assessment situations lend the reliability of the measuring instrument,
themselves to particular varieties of evaluation of a test-retest reliability
systematic and nonsystematic error. For estimate must extend to consideration of
example, consider assessing the extent of possible intervening factors between test
agreement between partners regarding the administrations.
quality and quantity of physical and
psychological abuse in their relationship. A An estimate of test-retest reliability may be
number of studies have suggested most appropriate in gauging the reliability of
underreporting or overreporting of tests that employ outcome measures such
perpetration of abuse also may contribute to as reaction time or perceptual judgments
systematic error. Just as the amount of (including discriminations of brightness,

Prepared by Sittie Ashya B. Ali (BS Psychology) Psychological Testing and Assessment (7th Ed)
MSU-Main Campus, Marawi City Cohen-Swerdlik
loudness, or taste). However, even in because of the particular items that were
measuring variables such as these, and selected for inclusion in the test.
even when the time period between the two
administrations of the test is relatively On the other hand, once an alternate or
small, various factors (such as experience, parallel form of a test has been developed,
practice, memory, fatigue, and motivation) it is advantageous to the test user in several
may intervene and confound an obtained ways. For example, it minimizes the effect of
measure of reliability. memory for the content of a previously
administered form of the test. Certain traits
Parallel-Forms and Alternate-Forms are presumed to be relatively stable in
Reliability Estimates people over time, and we would expect tests
measuring those traits—alternate forms,
Coefficient of equivalence is the degree parallel forms, or otherwise—to reflect that
of the relationship between various forms of stability.
a test can be evaluated by means of an
alternate-forms or parallel-forms coefficient An estimate of the reliability of a test can be
of reliability. obtained without developing an alternate
form of the test and without having to
Parallel forms of a test exist when, for administer the test twice to the same
each form of the test, the means and the people. Deriving this type of estimate
variances of observed test scores are equal. entails an evaluation of the internal
consistency of the test items. Logically
In theory, the means of scores obtained on enough, it is referred to as an internal
parallel forms correlate equally with the true consistency estimate of reliability or as
score. More practically, scores obtained on an estimate of inter-item consistency.
parallel tests correlate equally with other There are different methods of obtaining
measures. internal consistency estimates of reliability.
One such method is the split-half estimate.
Alternate forms are simply different
versions of a test that have been Internal Consistency Estimates of
constructed so as to be parallel. Reliability

Although they do not meet the requirements Split-Half Reliability Estimates


for the legitimate designation “parallel,”
alternate forms of a test are typically An estimate of split-half reliability is
designed to be equivalent with respect to obtained by correlating two pairs of scores
variables such as content and level of obtained from equivalent halves of a single
difficulty. test administered once. It is a useful
measure of reliability when it is impractical
Obtaining estimates of alternate-forms or undesirable to assess reliability with two
reliability and parallel-forms reliability is tests or to administer a test twice (because
similar in two ways to obtaining an of factors such as time or expense).
estimate of test-retest reliability:
(1) Two test administrations with the same The computation of a coefficient of split-half
group are required, and reliability generally entails three steps:
(2) test scores may be affected by factors Step 1. Divide the test into equivalent
such as motivation, fatigue, or intervening halves.
events such as practice, learning, or therapy Step 2. Calculate a Pearson r between
(although not as much as when the same scores on the two halves of the test.
test is administered twice). Step 3. Adjust the half-test reliability using
the Spearman-Brown formula (discussed
An additional source of error variance, item shortly).
sampling, is inherent in the computation of
an alternate- or parallel-forms reliability When it comes to calculating split-half
coefficient. Testtakers may do better or reliability coefficients, there’s more than one
worse on a specific form of the test not as a way to split a test—but there are some ways
function of their true ability but simply you should never split a test. Simply dividing
the test in the middle is not recommended

Prepared by Sittie Ashya B. Ali (BS Psychology) Psychological Testing and Assessment (7th Ed)
MSU-Main Campus, Marawi City Cohen-Swerdlik
because it’s likely this procedure would for the adjustment of split-half reliability.
spuriously raise or lower the reliability The symbol stands for the Pearson r of
coefficient. Different amounts of fatigue for scores in the two half tests:
the first as opposed to the second part of
the test, different amounts of test anxiety, 2 r hh
and differences in item difficulty as a r SB=
1+ r hh
function of placement in the test are all
factors to consider.
Usually, but not always, reliability increases
as test length increases. Ideally, the
One acceptable way to split a test is to
additional test items are equivalent with
randomly assign items to one or the other
respect to the content and the range of
half of the test. Another acceptable way to
difficulty of the original items. Estimates of
split a test is to assign odd-numbered items
reliability based on consideration of the
to one half of the test and even-numbered
entire test therefore tend to be higher than
items to the other half. This method yields
those based on half of a test.
an estimate of split-half reliability that is
also referred to as odd-even reliability. Yet
If test developers or users wish to shorten a
another way to split a test is to divide the
test, the Spearman-Brown formula may be
test by content so that each half contains
used to estimate the effect of the shortening
items equivalent with respect to content and
on the test’s reliability. Reduction in test size
difficulty.
for the purpose of reducing test
administration time is a common practice in
In general, a primary objective in splitting a
certain situations.
test in half for the purpose of obtaining a
split-half reliability estimate is to create
A Spearman-Brown formula could also be
what might be called “mini-parallel-forms,”
used to determine the number of items
with each half equal to the other—or as
needed to attain a desired level of reliability.
nearly equal as humanly possible—in
In adding items to increase test reliability to
format, stylistic, statistical, and related
a desired level, the rule is that the new
aspects.
items must be equivalent in content and
difficulty so that the longer test still
The Spearman-Brown formula allows a
measures what the original test measured. If
test developer or user to estimate internal
the reliability of the original test is relatively
consistency reliability from a correlation of
low, then it may be impractical to increase
two halves of a test. It is a specific
the number of items to reach an acceptable
application of a more general formula to
level of reliability. Another alternative would
estimate the reliability of a test that is
be to abandon this relatively unreliable
lengthened or shortened by any number of
instrument and locate—or develop—a
items.
suitable alternative.
The general Spearman-Brown (rSB) formula is
Internal consistency estimates of reliability,
such as that obtained by use of the
nr xy Spearman-Brown formula, are inappropriate
r SB=
1+ ( n−1 ) r xy for measuring the reliability of
heterogeneous tests and speed tests.
where rSB is equal to the reliability adjusted
by the Spearman-Brown formula, rxy is equal Other Methods of Estimating Internal
to the Pearson r in the original-length test, Consistency
and n is equal to the number of items in the
revised version divided by the number of Other methods used to obtain estimates of
items in the original version. internal consistency reliability include
formulas developed by Kuder and
By determining the reliability of one half of a Richardson (1937) and Cronbach (1951).
test, a test developer can use the Inter-item consistency refers to the
Spearman-Brown formula to estimate the degree of correlation among all the items on
reliability of a whole test. Because a whole a scale. A measure of inter-item consistency
test is two times longer than half a test, n is calculated from a single administration of
becomes 2 in the Spearman-Brown formula a single form of a test. An index of inter item

Prepared by Sittie Ashya B. Ali (BS Psychology) Psychological Testing and Assessment (7th Ed)
MSU-Main Campus, Marawi City Cohen-Swerdlik
consistency, in turn, is useful in assessing split-half reliability estimates will be similar.
the homogeneity of the test. Tests are said However, KR-20 is the statistic of choice for
to be homogeneous if they contain items determining the inter-item consistency of
that measure a single trait. dichotomous items, primarily those items
that can be scored right or wrong (such as
Homogeneity (derived from the Greek multiple-choice items). If test items are
words homos, meaning “same,” and genos, more heterogeneous, KR-20 will yield lower
meaning “kind”) is the degree to which a reliability estimates than the split-half
test measures a single factor. In other method.
words, homogeneity is the extent to which
items in a scale are unifactorial. Because of the great heterogeneity of
content areas when taken as a whole, it
In contrast to test homogeneity, could reasonably be predicted that the KR-
heterogeneity describes the degree to 20 estimate of reliability will be lower than
which a test measures different factors. A the odd-even one. The following formula
heterogeneous (or nonhomogeneous) test is may be used:
composed of items that measure more than

( k−1k )(1− Σσpq )


one trait.
r KR20= 2
The more homogeneous a test is, the more
inter-item consistency it can be expected to
have. Because a homogeneous test samples where r KR20 stands for the Kuder-Richardson
a relatively narrow content area, it is to be formula 20 reliability coefficient, k is the
expected to contain more inter-item number of test items, σ 2 is the variance of
consistency than a heterogeneous test. Test total test scores, p is the proportion of
homogeneity is desirable because it allows testtakers who pass the item, q is the
relatively straightforward test-score proportion of people who fail the item, and
interpretation. Σ pq is the sum of the pq products over all
items.
Testtakers with the same score on a
homogeneous test probably have similar Psychologists frequently rely on a test
abilities in the area tested. Testtakers with battery —a selected assortment of tests
the same score on a more heterogeneous and assessment procedures—in the process
test may have quite different abilities. of evaluation. A test battery is typically
composed of tests designed to measure
Although a homogeneous test is desirable
different variables.
because it so readily lends itself to clear
interpretation, it is often an insufficient tool An approximation of KR-20 can be obtained
for measuring multifaceted psychological by the use of the twenty-first formula in the
variables such as intelligence or personality. series developed by Kuder and Richardson,
One way to circumvent this potential source a formula known as KR-21. The KR-21
of difficulty has been to administer a series formula may be used if there is reason to
of homogeneous tests, each designed to assume that all the test items have
measure some component of a approximately the same degree of difficulty.
heterogeneous variable. Formula KR-21 has become outdated in an
era of calculators and computers. Way back
Dissatisfaction with existing split-half
when, KR-21 was sometimes used to
methods of estimating reliability compelled
estimate KR-20 only because it required
G. Frederic Kuder and M. W. Richardson to
many fewer calculations.
develop their own measures for estimating
reliability. Numerous modifications of Kuder-Richardson
formulas have been proposed through the
The most widely known of the many
years. The one variant of the KR-20 formula
formulas they collaborated on is their
that has received the most acceptance and
Kuder-Richardson formula 20, or KR-20,
is in widest use today is a statistic called
so named because it was the twentieth
coefficient alpha. You may even hear it
formula developed in a series. Where test
referred to as coefficient α −20 . This
items are highly homogeneous, KR-20 and
expression incorporates both the Greek

Prepared by Sittie Ashya B. Ali (BS Psychology) Psychological Testing and Assessment (7th Ed)
MSU-Main Campus, Marawi City Cohen-Swerdlik
letter alpha (α ) and the number 20, the In contrast to coefficient alpha, a Pearson r
latter a reference to KR-20. may be thought of as dealing conceptually
with both dissimilarity and similarity.
Developed by Cronbach (1951) and Accordingly, an r value of -1 may be thought
subsequently elaborated on by others, of as indicating “perfect dissimilarity.” In
coefficient alpha may be thought of as the practice, most reliability coefficients—
mean of all possible split-half correlations, regardless of the specific type of reliability
corrected by the Spearman Brown formula. they are measuring—range in value from 0
In contrast to KR-20, which is appropriately to 1.
used only on tests with dichotomous items,
coefficient alpha is appropriate for use on Let’s emphasize that all indexes of
tests containing nondichotomous items. The reliability, coefficient alpha among them,
formula for coefficient alpha is provide an index that is a characteristic of a
particular group of test scores, not of the

( )( )
2 test itself. The precise amount of error
k Σσ
rα= 1− 2 inherent in a reliability estimate will vary
k−1 σ with the sample of testtakers from which the
data were drawn. If a new group of
Where r α is coefficient alpha, k is the testtakers is sufficiently different from the
number of items, σ 2 is the variance of one group of testtakers on whom the reliability
studies were done, the reliability coefficient
item, Σ is the sum of variances of each item,
may not be as impressive—and may even
and σ 2 is the variance of the total test be unacceptable.
scores.
Measures of Inter-Scorer Reliability
Coefficient alpha is the preferred statistic for
obtaining an estimate of internal Unfortunately, in some types of tests under
consistency reliability. A variation of the some conditions, the score may be more a
formula has been developed for use in function of the scorer than anything else.
obtaining an estimate of test-retest
reliability. Essentially, this formula yields an Variously referred to as scorer reliability,
estimate of the mean of all possible test- judge reliability, observer reliability, and
retest, split-half coefficients. Coefficient inter-rater reliability, inter-scorer
alpha is widely used as a measure of reliability is the degree of agreement or
reliability, in part because it requires only consistency between two or more scorers
one administration of the test. (or judges or raters) with regard to a
particular measure.
Unlike a Pearson r, which may range in
value from 1 to 1, coefficient alpha typically If the reliability coefficient is high, the
ranges in value from 0 to 1. The reason for prospective test user knows that test scores
this is that, conceptually, coefficient alpha can be derived in a systematic, consistent
(much like other coefficients of reliability) is way by various scorers with sufficient
calculated to help answer questions about training. Perhaps the simplest way of
how similar sets of data are. Here, similarity determining the degree of consistency
is gauged, in essence, on a scale from 0 among scorers in the scoring of a test is to
(absolutely no similarity) to 1 (perfectly calculate a coefficient of correlation. This
identical). It is possible, however, to correlation coefficient is referred to as a
conceive of data sets that would yield a coefficient of inter-scorer reliability.
negative value of alpha. Still, because
negative values of alpha are theoretically Using and Interpreting a Coefficient of
impossible, it is recommended under such Reliability
rare circumstances that the alpha coefficient
be reported as zero. As Streiner (2003) We have seen that, with respect to the test
pointed out, a value of alpha above .90 may itself, there are basically three
be “too high” and indicate redundancy in approaches to the estimation of
the items. reliability: (1) test-retest, (2) alternate or
parallel forms, and (3) internal or inter-item
consistency. The method or methods

Prepared by Sittie Ashya B. Ali (BS Psychology) Psychological Testing and Assessment (7th Ed)
MSU-Main Campus, Marawi City Cohen-Swerdlik
employed will depend on a number of (2) the characteristic, ability, or trait being
factors, such as the purpose of obtaining a measured is presumed to be dynamic or
measure of reliability. static;
(3) the range of test scores is or is not
Reliability is a mandatory attribute in all restricted;
tests we use. However, we need more of it (4) the test is a speed or a power test; and
in some tests, and we will admittedly allow (5) the test is or is not criterion-referenced.
for less of it in others. If a test score carries
with it life-or-death implications, then we
need to hold that test to some high
standards—including relatively high  Homogeneity versus
standards with regard to coefficients of Heterogeneity of Test item
reliability. If a test score is routinely used in
combination with many other test scores Tests designed to measure one factor, such
and typically accounts for only a small part as one ability or one trait, are expected to
of the decision process, that test will not be be homogeneous in items. For such tests,
held to the highest standards of reliability. it is reasonable to expect a high degree of
internal consistency. By contrast, if the test
The Purpose of the Reliability is heterogeneous in items, an estimate of
Coefficient internal consistency might be low relative to
a more appropriate estimate of test-retest
If a specific test of employee performance is reliability.
designed for use at various times over the
course of the employment period, it would  Dynamic versus Static
be reasonable to expect the test to Characteristics
demonstrate reliability across time. It would
thus be desirable to have an estimate of the A dynamic characteristic is a trait, state,
instrument’s test-retest reliability. or ability presumed to be ever-changing as a
function of situational and cognitive
For a test designed for a single experiences.
administration only, an estimate of internal
consistency would be the reliability measure The best estimate of reliability would be
of choice. obtained from a measure of internal
consistency.
If the purpose of determining reliability is to
break down the error variance into its parts, A trait, state, or ability presumed to be
then a number of reliability coefficients relatively unchanging (a static
would have to be calculated. characteristic), such as intelligence. In this
instance, obtained measurement would not
Thus, an individual reliability coefficient may be expected to vary significantly as a
provide an index of error from test function of time, and either the test-retest or
construction, test administration, or test the alternate-forms method would be
scoring and interpretation. A coefficient of appropriate.
inter-rater reliability, for example, provides
information about error as a result of test  Restriction or Inflation of Range
scoring. Specifically, it can be used to
answer questions about how consistently In using and interpreting a coefficient of
two scorers score the same test items. reliability, the issue variously referred to as
restriction of range or restriction of
The Nature of the Test variance (or, conversely, inflation of
range or inflation of variance) is
Closely related to considerations concerning important. If the variance of either variable
the purpose and use of a reliability in a correlational analysis is restricted by the
coefficient are those concerning the nature sampling procedure used, then the resulting
of the test itself. Included here are correlation coefficient tends to be lower. If
considerations such as whether the variance of either variable in a
(1) the test items are homogeneous or correlational analysis is inflated by the
heterogeneous in nature;

Prepared by Sittie Ashya B. Ali (BS Psychology) Psychological Testing and Assessment (7th Ed)
MSU-Main Campus, Marawi City Cohen-Swerdlik
sampling procedure, then the resulting the test and is then adjusted using the
correlation coefficient tends to be higher. Spearman-Brown formula to obtain a
reliability estimate of the whole test.
Also of critical importance is whether the Although there are exceptions, such
range of variances employed is appropriate traditional procedures of estimating
to the objective of the correlational analysis. reliability are usually not appropriate for use
with criterion-referenced test.
 Speed tests versus Power tests
In criterion-referenced testing, and
When a time limit is long enough to allow particularly in mastery testing, how different
testtakers to attempt all items, and if some the scores are from one another is seldom a
items are so difficult that no testtaker is able focus of interest. The critical issue for the
to obtain a perfect score, then the test is a user of a mastery test is whether or not a
power test. By contrast, a speed test certain criterion score has been achieved.
generally contains items of uniform level of Traditional ways of estimating reliability are
difficulty (typically uniformly low) so that, not always appropriate for criterion-
when given generous time limits, all referenced tests, though there may be
testtakers should be able to complete all the instances in which traditional estimates can
test items correctly. be adopted.

Score differences on a speed test are Alternatives to the True Score Model
therefore based on performance speed (Classical Model)
because items attempted tend to be correct.
The true score model is the most widely
A reliability estimate of a speed test should used and accepted model in the
be based on performance from two psychometric literature today. Historically,
independent testing periods using one of the true score model of the reliability of
the following: (1) test-retest reliability, (2) measurement enjoyed a virtually
alternate-forms reliability, or (3) split-half unchallenged reign of acceptance from the
reliability from two separately timed half early 1900s through the 1940s. The 1950s
tests. If a split-half procedure is used, then saw the development of an alternative
the obtained reliability coefficient is for a theoretical model, one originally referred to
half test and should be adjusted using the as domain sampling theory and better
Spearman-Brown formula. known today in one of its many modified
forms as generalizability theory.
If a speed test is administered once and
some measure of internal consistency is The theory of domain sampling rebels
calculated, such as the Kuder-Richardson or against the concept of a true score existing
a split-half correlation, the result will be a with respect to the measurement of
spuriously high reliability coefficient. psychological constructs. Whereas those
who subscribe to true score theory seek
 Criterion-referenced tests to estimate the portion of a test score that is
attributable to error, proponents of domain
A criterion-referenced test is designed to sampling theory seek to estimate the
provide an indication of where a testtaker extent to which specific sources of variation
stands with respect to some variable or under defined conditions are contributing to
criterion, such as an educational or a the test score.
vocational objective.
In domain sampling theory, a test’s
Recall that a test-retest reliability estimate is reliability is conceived of as an objective
based on the correlation between the total measure of how precisely the test score
scores on two administrations of the same assesses the domain from which the test
test. In alternate-forms reliability, a draws a sample. In theory, the items in the
reliability estimate is based on the domain are thought to have the same
correlation between the two total scores on means and variances of those in the test
the two forms. In split-half reliability, a that samples from the domain. Of the three
reliability estimate is based on the types of estimates of reliability, measures of
correlation between scores on two halves of

Prepared by Sittie Ashya B. Ali (BS Psychology) Psychological Testing and Assessment (7th Ed)
MSU-Main Campus, Marawi City Cohen-Swerdlik
internal consistency are perhaps the most test user how test scores should be used
compatible with domain sampling theory. and how dependable those scores are as a
basis for decisions, depending on the
 Generalizability Theory context of their use.

Generalizability theory may be viewed as From the perspective of generalizability


an extension of true score theory wherein theory, a test’s reliability is very much a
the concept of a universe score replaces function of the circumstances under which
that of a true score. the test is developed, administered, and
interpreted.
Developed by Lee J. Cronbach (1970) and
his colleagues, generalizability theory is  Item Response Theory
based on the idea that a person’s test
scores vary from testing to testing because Item response theory procedures provide
of variables in the testing situation. a way to model the probability that a person
Cronbach encouraged test developers and with X ability will be able to perform at a
researchers to describe the details of the level of Y. It models the probability that a
particular test situation or universe leading person with X amount of a particular
to a specific test score. This universe is personality trait will exhibit Y amount of that
described in terms of its facets, which trait on a personality test designed to
include things like the number of items in measure it.
the test, the amount of training the test
scorers have had, and the purpose of the A synonym for IRT in the academic literature
test administration. is latent-trait theory. IRT is not a term
used to refer to a single theory or method, it
According to generalizability theory, given refers to a family of theory and methods—
the exact same conditions of all the facets in and quite a large family at that— with many
the universe, the exact same test score other names used to distinguish specific
should be obtained. This test score is the approaches.
universe score, and it is, as Cronbach
noted, analogous to a true score in the true Examples of two characteristics of items
score model. within an IRT framework are the difficulty
level of an item and the item’s level of
A generalizability study examines how discrimination. “Difficulty” in this sense
generalizable scores from a particular test refers to the attribute of not being easily
are if the test is administered in different accomplished, solved, or comprehended. In
situations. It examines how much of an the context of IRT, discrimination signifies
impact different facets of the universe have the degree to which an item differentiates
on the test score. among people with higher or lower levels of
the trait, ability, or whatever it is that is
The influence of particular facets on the test being measured.
score is represented by coefficients of
generalizability. These coefficients are There are IRT models designed to handle
similar to reliability coefficients in the true data resulting from the administration of
score model. tests with dichotomous test items (test
items or questions that can be answered
After the generalizability study is done, with only one of two alternative responses)
Cronbach et al. recommended that test There are IRT models designed to handle
developers do a decision study, which data resulting from the administration of
involves the application of information from tests with polytomous test items (test items
the generalizability study. In the decision or questions with three or more alternative
study, developers examine the usefulness responses).
of test scores in helping the test user make
decisions. In practice, test scores are used In general, latent-trait models differ in some
to guide a variety of decisions, from placing important ways from classical “true score”
a child in special education to hiring new test theory. For example, in classical true
employees to discharging mental patients score test theory, no assumptions are made
from the hospital. It is designed to tell the about the frequency distribution of test

Prepared by Sittie Ashya B. Ali (BS Psychology) Psychological Testing and Assessment (7th Ed)
MSU-Main Campus, Marawi City Cohen-Swerdlik
scores. By contrast, such assumptions are coefficient of the test. The standard error of
inherent in latent-trait models. A shorthand measurement allows us to estimate, with a
reference to these types of models is specific level of confidence, the range in
“Rasch,” so reference to “the Rasch model” which the true score is likely to exist.
is a reference to an IRT model with very
specific assumptions about the underlying Because the standard error of measurement
distribution. The psychometric advantages functions like a standard deviation in this
of item response theory have made this context, we can use it to predict what would
model appealing, especially to commercial happen if an individual took additional
and academic test developers and to large- equivalent tests:
scale test publishers.  approximately 68% (actually,
68.26%) of the scores would be
Reliability and Individual Scores expected to occur within ± 1 σ meas of
the true score;
The Standard Error of Measurement  approximately 95% (actually,
95.44%) of the scores would be
The standard error of measurement,
expected to occur within ± 2 σ meas of
often abbreviated as SEM or SEM, provides a
measure of the precision of an observed test the true score;
score. It provides an estimate of the amount  approximately 99% (actually,
of error inherent in an observed score or 99.74%) of the scores would be
measurement. expected to occur within ± 3 σ meas of
the true score.
In general, the relationship between the
SEM and the reliability of a test is inverse; The best estimate available of the
the higher the reliability of a test (or individual’s true score on the test is the
individual subtest within a test), the lower test score already obtained. The
the SEM. standard error of measurement, like the
reliability coefficient, is one way of
The standard error of measurement is expressing test reliability. If the standard
the tool used to estimate or infer the extent deviation of a test is held constant, then
to which an observed score deviates from a the smaller the σ meas, the more reliable
true score. It is also the standard deviation the test will be; as r xx increases, the σ meas
of a theoretically normal distribution of test
decreases.
scores obtained by one person on
equivalent tests.
In practice, the standard error of
measurement is most frequently used in
Also known as the standard error of a score
the interpretation of individual test
and denoted by the symbol σ meas, the scores. Further, the standard error of
standard error of measurement is an index measurement is useful in establishing
of the extent to which one individual’s what is called a confidence interval: a
scores vary over tests presumed to be range or band of test scores that is likely
parallel. If the standard deviation for the to contain the true score.
distribution of test scores is known (or can
be calculated) and if an estimate of the The Standard Error of the Difference
reliability of the test is known (or can be between Two Scores
calculated), then an estimate of the
standard error of a particular score (that is, The amount of error in a specific test
the standard error of measurement) can be score is embodied in the standard error
determined by the following formula: of measurement. But scores can change
from one testing to the next for reasons
σ meas=σ √ 1−r xx other than error. True differences in the
characteristic being measured can also
where σ measis equal to the standard error of affect test scores. These differences may
be of great interest. Comparisons
measurement, σ is equal to the standard
between scores are made using the
deviation of test scores by the group of
standard error of the difference, a
testtakers, and r xx is equal to the reliability statistical measure that can aid a test

Prepared by Sittie Ashya B. Ali (BS Psychology) Psychological Testing and Assessment (7th Ed)
MSU-Main Campus, Marawi City Cohen-Swerdlik
user in determining how large a
difference should be before it is
considered statistically significant.

When comparing scores achieved on the


different tests, it is essential that the
scores be converted to the same scale.
The formula for the standard error of the
difference between two scores is

σ diff = √ σ 2meas 1+ σ 2meas 2

where σ diff is the standard error of the


2
difference between two scores, σ meas1 is
the squared standard error of
2
measurement for test 1, and σ meas2 is
the squared standard error of
measurement for test 2.

If we substitute reliability coefficients for


the standard errors of measurement of
the separate scores, the formula
becomes

σ diff = √ 2−r 1−r 2

where r 1 is the reliability coefficient of


test 1, r 2 is the reliability coefficient of
test 2, and σ is the standard deviation.
Note that both tests would have the
same standard deviation because they
must be on the same scale (or be
converted to the same scale) before a
comparison can be made.

The standard error of the difference


between two scores will be larger than
the standard error of measurement for
either score alone because the former is
affected by measurement error in both
scores. This also makes good sense: If
two scores each contain error such that
in each case the true score could be
higher or lower, then we would want the
two scores to be further apart before we
conclude that there is a significant
difference between them. The value
obtained by calculating the standard
error of the difference is used in much
the same way as the standard error of
the mean.

Prepared by Sittie Ashya B. Ali (BS Psychology) Psychological Testing and Assessment (7th Ed)
MSU-Main Campus, Marawi City Cohen-Swerdlik

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy