CHAPTER 4 Reliability for Teachers
It is the user who must take responsibility for determining whether or not scores
are sufficiently trustworthy to justify anticipated uses and interpretations.
—AERA et al., 1999, p. 31
CHAPTER HIGHLIGHTS
LEARNING OBJECTIVES
Most dictionaries define reliability in terms of dependability, trustworthiness, or having a high degree of confidence in something. Reliability in the context of educational and psychological measurement is concerned to some extent with these same factors, but is extended to such concepts as stability and consistency. In simplest terms, in the context of measurement, reliability refers to consistency or stability of assessment results.
Although it is common for people to refer to the “reliability of a test,” in the Standards for
Educational and Psychological Testing (AERA et al., 1999) reliability is considered to be a
characteristic of scores or assessment results, not tests themselves.
Consider the following example: A teacher administers a 25-item math test in the
morning to assess the students’ skill in multiplying two-digit numbers. If the test had been
administered in the afternoon rather than the morning, would Susie's score on the test have
been the same? Because there are literally thousands of two-digit multiplication problems, if
the teacher had used a different group of 25 two-digit multiplication problems, would Susie
have received the same score? What about the ambulance that went by, its siren wailing
loudly, causing Johnny to look up and watch for a few seconds? Did this affect his score, and
did it affect Susie’s, who kept working quietly? Jose wasn't feeling well that morning but
came to school because he felt the test was so important. Would his score have been better if
he had waited to take the test when he was feeling better? Would the students have received
the same scores if another teacher had graded the test? All of these questions involve issues
of reliability. They all ask if the test produces consistent scores.
As you can see from these examples, numerous factors can affect reliability. The time
the test is administered, the specific set of questions included on the test, distractions due to
external (e.g., ambulances) or internal (e.g., illness) events, and the person grading the test
are just a few of these factors. In this chapter you will learn to take many of the sources of
unreliability into account when selecting or developing assessments and evaluating scores.
You will also learn to estimate the degree of reliability in test scores with a method that best
fits your particular situation. First, however, we will introduce the concept of measurement
error as it is essential to developing a thorough understanding of reliability.
Errors of Measurement
Some degree of error is inherent in all measurement. Although measurement error has largely been studied in the context of psychological and educational tests, measurement error clearly is not unique to this context. In fact, as Nunnally and Bernstein (1994) point out, measurement in other scientific disciplines has as much, if not more, error than that in psychology and education. They give the example of physiological blood pressure measurement, which is considerably less reliable than many educational tests. Even in situations in which we generally believe measurement is exact, some error is present. If we asked a dozen people to time a 440-yard race using the same brand of stopwatch, it is extremely unlikely that they would all report precisely the same time. If we had a dozen people and a measuring tape graduated in millimeters and required each person to measure independently
the length of a 100-foot strip of land, it is unlikely all of them would report the same answer
to the nearest millimeter. In the physical sciences the introduction of more technologically
sophisticated measurement devices has reduced, but not eliminated, measurement error.
Different theories or models have been developed to address measurement issues, but possibly the most influential is classical test theory (also called true score theory). According to this theory, every score on a test is composed of two components: the true score (i.e., the score that would be obtained if there were no errors, if the score were perfectly reliable) and the error score: Obtained Score = True Score + Error. This can be represented in a very simple equation:
X = T + E
Here we use X to represent the observed or obtained score of an individual; that is, X is the score the test taker received on the test. The symbol T is used to represent an individual's true score and reflects the test taker's true skills, abilities, knowledge, attitudes, or whatever the test measures, assuming an absence of measurement error. Finally, E represents measurement error.
Measurement error reduces the usefulness of measurement. It limits the extent to which test results can be generalized and reduces the confidence we have in test results (AERA et al., 1999). Practically speaking, when we administer a test we are interested in knowing the test taker's true score. Due to the presence of measurement error we can never know with absolute confidence what the true score is. However, if we have information about the reliability of measurement, we can establish intervals around an obtained score and calculate the probability that the true score will fall within the interval specified. We will come back to this with a more detailed explanation when we discuss the standard error of measurement later in this chapter. First, we will elaborate on the major sources of measurement error. It should be noted that we will limit our discussion to random measurement error. Some writers distinguish between random and systematic errors. Systematic error is much harder to detect and requires special statistical methods that are generally beyond the scope of this text; however, some special cases of systematic error are discussed in Chapter 16. (Special Interest Topic 4.1 provides a brief introduction to Generalizability Theory, an extension of classical reliability theory.)
A number of factors may introduce error into test scores, and even though all cannot be assigned to distinct categories, it may be helpful to group these sources in some manner and to discuss their relative contributions. The types of errors that are our greatest concern are errors due to content sampling and time sampling.
Content Sampling Error. Tests rarely, if ever, include every possible question or evaluate every possible relevant behavior. Let's revisit the example we introduced at the beginning of this chapter. A teacher administers a math test designed to assess skill in multiplying two-digit numbers. We noted that there are literally thousands of two-digit multiplication problems. Obviously it would be impossible for the teacher to develop and administer a test that includes all possible items. Instead, a universe or domain of test items is defined based on the content of the material to be covered. From this domain a sample of test questions is taken. In this example, the teacher decided to select 25 items to measure students' ability. These 25 items are simply a sample and, as with any sampling procedure, may not be representative of the domain from which they are drawn. The error that results from differences between the sample of items (i.e., the test) and the domain of items (i.e., all the possible items) is referred to as content sampling error. In reading other sources, you might see this type of error referred to as domain sampling error. Domain sampling error and content sampling error are the same. Content sampling error typically is considered the largest source of error in test scores and therefore is the source that concerns us most. Fortunately, content sampling error is also the easiest and most accurately estimated source of measurement error.
The amount of measurement error due to content sampling is determined by how well we sample the total domain of items. If the items on a test are a good sample of the domain, the amount of measurement error due to content sampling will be relatively small. If the items on a test are a poor sample of the domain, the amount of measurement error due to content sampling will be relatively large. Measurement error resulting from content sampling is estimated by analyzing the degree of statistical similarity among the items making up the test. In other words, we analyze the test items to determine how well they correlate with one another and with the test taker's standing on the construct being measured. We will explore a variety of methods for estimating measurement errors due to content sampling later in this chapter.
Time Sampling Error. Measurement error also can be introduced by one's choice of a particular time to administer the test. If Eddie did not have breakfast and the math test was just before lunch, he might be distracted or hurried and not perform as well as if he took the test after lunch. But Michael, who ate too much at lunch and was up a little late last night, was a little sleepy in the afternoon and might not perform as well on an afternoon test as he would have on the morning test. If during the morning testing session a neighboring class was making enough noise to be disruptive, the class might have performed better in the afternoon when the neighboring class was relatively quiet. These are all examples of situations in which random changes over time in the test taker (e.g., fatigue, illness, anxiety) or the testing environment (e.g., distractions, temperature) affect
performance on the test. This type of measurement error is referred to as time sampling
error and reflects random fluctuations in performance from one situation or time to another
and limits our ability to generalize test results across different situations. Some assessment
experts refer to this type of error as temporal instability. As you might expect, testing experts
have developed methods of estimating error due to time sampling.
Other Sources of Error. Although errors due to content sampling and time sampling account for the major proportion of random error in testing, administrative and scoring errors that do not affect all test takers equally will also contribute to the random error observed in scores. Clerical errors committed while adding up a student's score or an administrative error on an individually administered test are common examples. When the scoring of a test relies heavily on the subjective judgment of the person grading the test or involves subtle discriminations, it is important to consider differences in graders, usually referred to as inter-scorer or inter-rater differences. That is, would the test taker receive the same score if different individuals graded the test? For example, on an essay test would two different graders assign the same scores? These are just a few examples of sources of error that do not fit neatly into the broad categories of content or time sampling errors.
X=T+E
As you remember, X represents an individual's obtained score, T represents the true score,
and E represents random measurement error. This equation can be extended to incorporate the
concept of variance. This extension indicates that the variance of test scores is the sum of the
true score variance plus the error variance, and is represented in the following equation:
σX² = σT² + σE²
Here, σX² represents the variance of the total test, σT² represents true score variance, and σE² represents the variance due to measurement error. True score variance reflects differences in test takers due to real differences in skills, abilities, knowledge, attitudes, and so on, whereas the total score variance is made up of true score variance plus variance due to all the sources of random error we have previously described.
The general symbol for the reliability of assessment results associated with content
or domain sampling is rxx and is referred to as the reliability coefficient. We estimate the
reliability of a test score as the ratio of true score variance to total score variance. Mathematically, reliability is written

rxx = σT² / σX²
This equation defines the reliability of test scores as the proportion of test score variance due to true score differences. The reliability coefficient is considered to be the summary mathematical representation of this ratio or proportion.
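A minimal Python sketch (our illustration, not part of the text; the score distributions are hypothetical) can make this ratio concrete: it simulates observed scores as true scores plus random error and recovers the reliability coefficient as the proportion of observed-score variance that is true-score variance.

```python
import random
import statistics

random.seed(42)

# Simulate 1,000 hypothetical examinees under classical test theory: X = T + E.
true_scores = [random.gauss(100, 15) for _ in range(1000)]   # T
errors = [random.gauss(0, 5) for _ in range(1000)]           # E
observed = [t + e for t, e in zip(true_scores, errors)]      # X

# Reliability is the ratio of true score variance to total (observed) score variance.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)
print(f"Estimated r_xx = {reliability:.2f}")  # close to 15**2 / (15**2 + 5**2) = 0.90
```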
Reliability coefficients can be classified into three broad categories (AERA et al., 1999). These include (1) coefficients derived from the administration of the same test on different occasions (i.e., test-retest reliability), (2) coefficients based on the administration of parallel forms of a test (i.e., alternate-form reliability), and (3) coefficients derived from a single administration of a test (internal consistency coefficients). A fourth type, inter-rater reliability, is indicated when scoring involves a significant degree of subjective judgment. The major methods of estimating reliability are summarized in Table 4.1. Each of these approaches produces a reliability coefficient (rxx) that can be interpreted in terms of the proportion or percentage of test score variance attributable to true variance. For example, a reliability coefficient of 0.90 indicates that 90% of the variance in test scores is attributable to true variance. The remaining 10% reflects error variance. We will now consider each of these methods of estimating reliability.
TABLE 4.1  Major Methods of Estimating Reliability

Test-retest (r12): one form, two sessions. Administer the same test to the same group at two different sessions.

Alternate forms, simultaneous administration (rab): two forms, one session. Administer two forms of the test to the same group in the same session.

Alternate forms, delayed administration (rab): two forms, two sessions. Administer two forms of the test to the same group at two different sessions.

Split-half (roe): one form, one session. Administer the test to a group one time. Split the test into two equivalent halves, typically correlating scores on the odd-numbered items with scores on the even-numbered items.

Coefficient alpha or KR-20 (α): one form, one session. Administer the test to a group one time. Apply appropriate procedures.

Inter-rater (r): one form, one session. Administer the test to a group one time. Two or more raters score the test independently.
Test-Retest Reliability
Probably the most obvious way to estimate the reliability of a test is to administer the same test to the same group of individuals on two different occasions. With this approach the reliability coefficient is obtained by simply calculating the correlation between the scores on the two administrations. For example, we could administer our 25-item math test one week after the initial administration and then correlate the scores obtained on the two administrations. This estimate of reliability is referred to as test-retest reliability and is sensitive to measurement error due to time sampling. It is an index of the stability of test scores over time. Because many tests are intended to measure fairly stable characteristics, we expect tests of these constructs to produce stable scores. Test-retest reliability reflects the degree to which test scores can be generalized across different situations or over time.
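As a rough illustration of the computation (the scores below are hypothetical, not taken from the chapter), test-retest reliability is simply the Pearson correlation between the two administrations; note that statistics.correlation requires Python 3.10 or later.

```python
import statistics

# Hypothetical scores for ten students on the same 25-item math test,
# administered one week apart.
first_administration = [22, 18, 25, 15, 20, 17, 23, 19, 21, 16]
second_administration = [21, 17, 24, 16, 20, 18, 22, 18, 23, 15]

# The test-retest reliability coefficient is the correlation between the two sets of scores.
r_test_retest = statistics.correlation(first_administration, second_administration)
print(f"Test-retest reliability = {r_test_retest:.2f}")
```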
One important consideration when calculating and evaluating test-retest reliability is the length of the interval between the two test administrations. If the test-retest interval is very short (e.g., hours or days), the reliability estimate may be artificially inflated by memory and practice effects from the first administration. If the test interval is longer, the estimate of reliability may be lowered not only by the instability of the scores but also by actual changes in the test takers during the extended period. In practice, there is no single “best” time interval, but the optimal interval is determined by the way the test results are to be used. For example, intelligence is a construct or characteristic that is thought to be fairly stable, so it would be reasonable to expect stability in intelligence scores over weeks or months. In contrast, an individual's mood (e.g., depressed, elated, nervous) is more subject to transient fluctuations, and stability across weeks or months would not be expected.
In addition to the construct being measured, the way the test
is to be used is an important consideration in determining what is an appropriate test-retest
interval. Because the SAT is used to predict performance in college, it is sensible to expect
stability over relatively long periods of time. In other situations, long-term stability is much
less of an issue. For example, the long-term stability of a classroom achievement test (such
as our math test) is not a major concern because it is expected that the students will be
enhancing existing skills and acquiring new ones due to class instruction and studying. In
summary, when evaluating the stability of test scores, one should consider the length of the
test-retest interval in the context of the characteristics being measured and how the scores
are to be used.
The test-retest approach does have significant limitations, the most prominent being carryover effects from the first to second testing. Practice and memory effects result in different amounts of improvement in retest scores for different test takers. These carryover effects prevent the two administrations from being independent and as a result the reliability coefficients may be artificially inflated. In other instances, repetition of the test may change either the nature of the test or the test taker in some subtle or even obvious way (Ghiselli, Campbell, & Zedeck, 1981). As a result,
only tests that are not appreciably influenced by these carryover effects are suitable for this method of estimating reliability.
Alternate-Form Reliability
Another approach to estimating reliability involves the development of two equivalent or parallel forms of the test. The development of these alternate forms requires a detailed plan and considerable effort because the tests must truly be parallel in terms of content, difficulty, and other relevant characteristics. The two forms of the test are then administered to the same group of individuals and the correlation is calculated between the scores on the two assessments. In our example of the 25-item math test, the teacher could develop a parallel test containing 25 new problems involving the multiplication of double digits. To be parallel the items would need to be presented in the same format and be of the same level of difficulty. Two fairly common procedures are used to establish alternate-form reliability. One is alternate-form reliability based on simultaneous administration and is obtained when the two forms of the test are administered on the same occasion (i.e., back to back). The other, alternate form with delayed administration, is obtained when the two forms of the test are administered on two different occasions. Alternate-form reliability based on simultaneous administration is primarily sensitive to measurement error related to content sampling. Alternate-form reliability with delayed administration is sensitive to measurement error due to both content sampling and time sampling, but cannot differentiate the two types of error.
Alternate-form reliability has the advantage of reducing the carryover effects that are a prominent concern with test-retest reliability. However, although practice and memory effects may be reduced using the alternate-form approach, they are often not fully eliminated. Simply exposing test takers to the common format required for parallel tests often results in some carryover effects even if the content of the two tests is different. For example, a test taker given a test measuring nonverbal reasoning abilities may develop strategies during the administration of the first form that alter her approach to the second form, even if the specific content of the items is different. Another limitation of the alternate-form approach to estimating reliability is that relatively few tests, standardized or teacher made, have alternate forms. As we suggested, the development of alternate forms that are actually equivalent is a time-consuming process, and many test developers do not pursue this option. Nevertheless, at times it is desirable to have more than one form of a test, and when multiple forms exist, alternate-form reliability is an important consideration.
Internal-Consistency Reliability
Internal-consistency reliability estimates primarily reflect errors related to content sampling. These estimates are based on the relationship between items within a test and are derived from a single administration of the test.
Split-Half Reliability. Split-half reliability is estimated by administering the test once, dividing it into two equivalent halves (typically the odd-numbered and even-numbered items), and calculating the correlation between scores on the two halves. Because each half contains only half as many items as the full test, the half-test correlation underestimates the reliability of the full test. The Spearman-Brown formula corrects for this by estimating the reliability of the full test from the half-test correlation:

Reliability of Full Test = (2 × rhalf) / (1 + rhalf)

For example, if the correlation between the odd and even halves is 0.74:

Reliability of Full Test = (2 × 0.74) / (1 + 0.74) = 1.48 / 1.74 = 0.85
The reliability coefficient of 0.85 estimates the reliability of the full test when the odd-even halves correlated at 0.74. This demonstrates that the uncorrected split-half reliability coefficient presents an underestimate of the reliability of the full test. Table 4.2 provides examples of half-test coefficients and the corresponding full-test coefficients that were corrected with
the Spearman-Brown formula. By looking at the first row in this table, you will see that a
half-test correlation of 0.50 corresponds to a corrected full-test coefficient of 0.67.
Although the odd-even approach is the most common way to divide a test and will generally produce equivalent halves, certain situations deserve special attention. For example, if you have a test with a relatively small number of items (e.g., <8), it may be desirable to divide the test into equivalent halves based on a careful review of item characteristics such as content, format, and difficulty. Another situation that deserves special attention involves groups of items that deal with an integrated problem (this is referred to as a testlet). For example, if multiple questions refer to a specific diagram or reading passage, that whole set of questions should be included in the same half of the test. Splitting integrated problems can artificially inflate the reliability estimate (e.g., Sireci, Thissen, & Wainer, 1991).
An advantage of the split-half approach to reliability is that it can be calculated from a
single administration of a test. Also, because only one testing session is involved, this approach
reflects errors due only to content sampling and is not sensitive to time sampling errors.
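A brief sketch of the odd-even procedure with the Spearman-Brown correction might look like the following; the item responses are hypothetical, and the split follows the odd/even convention described above.

```python
import statistics

# Hypothetical dichotomous item responses (1 = correct, 0 = incorrect); each row is one student.
responses = [
    [1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 1, 1],
]

# Score the odd-numbered and even-numbered items separately for each student.
odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, 7
even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, 8

# Correlate the two half-tests, then apply the Spearman-Brown correction for full test length.
r_half = statistics.correlation(odd_totals, even_totals)
r_full = (2 * r_half) / (1 + r_half)
print(f"Half-test r = {r_half:.2f}; corrected full-test reliability = {r_full:.2f}")
```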
Coefficient Alpha and Kuder-Richardson Reliability. Other approaches to estimating reliability from a single administration of a test are based on formulas developed by Kuder and Richardson (1937) and Cronbach (1951). Instead of comparing responses on two halves of the test as in split-half reliability, this approach examines the consistency of responding to all the individual items on the test. Reliability estimates produced with these formulas can be thought of as the average of all possible split-half coefficients. Like split-half reliability, these estimates are sensitive to measurement error introduced by content sampling. Additionally, they are also sensitive to the heterogeneity of the test content. When we refer to content heterogeneity, we are concerned with the degree to which the test items measure related characteristics. For example, our 25-item math test involving multiplying two-digit numbers would probably be more homogeneous than a test designed to measure both multiplication and division. An even more heterogeneous test would be one that involves multiplication and reading comprehension, two fairly dissimilar content domains. As discussed later, sensitivity to content heterogeneity can influence a particular reliability formula's use on different domains.
While Kuder and Richardson's formulas and coefficient alpha both reflect item heterogeneity and errors due to content sampling, there is an important difference in terms
of application. In their original article Kuder and Richardson (1937) presented numerous
formulas for estimating reliability. The most commonly used formula is known as the
Kuder-Richardson formula 20 (KR-20). KR-20 is applicable when test items are scored
dichotomously, that is, simply right or wrong, as 0 or 1. Coefficient alpha (Cronbach, 1951)
is a more general form of KR-20 that also deals with test items that produce scores with
multiple values (e.g., 0, 1, or 2). Because coefficient alpha is more broadly applicable, it has
become the preferred statistic for estimating internal consistency (Keith & Reynolds, 1990).
Tables 4.3 and 4.4 illustrate the calculation of KR-20 and coefficient alpha, respectively.
Inter-Rater Reliability
If the scoring of a test relies on subjective judgment, it is important to evaluate the degree
of agreement when different individuals score the test. This is referred to as inter-scorer or
inter-rater reliability. Estimating inter-rater reliability is a fairly straightforward process.
The test is administered one time and two individuals independently score each test. A correlation is then calculated between the scores obtained by the two scorers. This estimate of reliability is not sensitive to error due to content or time sampling, but only reflects differences due to the individuals scoring the test. In addition to the correlational approach,
inter-rater agreement can also be evaluated by calculating the percentage of times that two individuals assign the same scores to the performances of students. This approach is illustrated in Special Interest Topic 4.2.
On some tests, inter-rater reliability is of little concern. For example, on a test with multiple-choice or true-false items, grading is fairly straightforward and a conscientious grader should produce reliable and accurate scores. In the case of our 25-item math test, a careful grader should be able to determine whether the students' answers are accurate and assign a score consistent with that of another careful
grader. However, for some tests inter-rater reliability is a major concern. Classroom essay
tests are a classic example. It is common for students to feel that a different teacher might
have assigned a different score to their essays. It can be argued that the teacher’s personal
biases, preferences, or mood influenced the score, not only the content and quality of the
student's essay. Even on our 25-item math test, if the teacher required that the students "show their work" and this influenced the students' grades, subjective judgment might be involved and inter-rater reliability could be a concern.
TABLE 4.3  Calculating KR-20

KR-20 is sensitive to measurement error due to content sampling and is also a measure of item heterogeneity. KR-20 is applicable when test items are scored dichotomously, that is, simply right or wrong, as 0 or 1. The following formula is used for calculating KR-20:

KR-20 = [k / (k − 1)] × [(SD² − Σpᵢqᵢ) / SD²]

where k = number of items
      SD² = variance of total test scores
      pᵢ = proportion of correct responses on item i
      qᵢ = proportion of incorrect responses on item i

Consider these data for a five-item test administered to six students. Each item could receive a score of either 1 or 0.

            Item 1    Item 2    Item 3    Item 4    Item 5    Total
Student 1      1         0         1         1         1        4
Student 2      1         1         1         1         1        5
Student 3      1         0         1         0         0        2
Student 4      0         0         0         1         0        1
Student 5      1         1         1         1         1        5
Student 6      1         1         0         1         1        4
pᵢ          0.8333      0.5     0.6667    0.8333    0.6667    SD² = 2.25
qᵢ          0.1667      0.5     0.3333    0.1667    0.3333
pᵢ × qᵢ     0.1389     0.25     0.2222    0.1389    0.2222

Σpᵢ × qᵢ = 0.972

KR-20 = (5/4) × [(2.25 − 0.972) / 2.25]
      = 1.25 × (1.278 / 2.25)
      = 1.25 × 0.568
      = 0.71
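For readers who want to check the arithmetic by computer, the sketch below implements the KR-20 formula and reproduces the result for the data in Table 4.3; the function name and structure are our own illustration, not part of the text.

```python
def kr20(item_scores):
    """KR-20 for dichotomously scored items; item_scores holds one row of 0/1 scores per student."""
    k = len(item_scores[0])                       # number of items
    n = len(item_scores)                          # number of students
    totals = [sum(row) for row in item_scores]    # total score per student
    mean_total = sum(totals) / n
    sd2 = sum((t - mean_total) ** 2 for t in totals) / n   # total-score variance (n in denominator)
    # Sum of p * q across items, where p = proportion correct and q = 1 - p.
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_scores) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * ((sd2 - sum_pq) / sd2)

# Data from Table 4.3: rows are students, columns are items (1 = correct, 0 = incorrect).
data = [
    [1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
]
print(f"KR-20 = {kr20(data):.2f}")   # about 0.71, matching the hand calculation
```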
TABLE 4.4  Calculating Coefficient Alpha

Coefficient alpha is sensitive to measurement error due to content sampling and is also a measure of item heterogeneity. It can be applied to tests with items that are scored dichotomously or that have multiple values. The formula for calculating coefficient alpha is:

Coefficient alpha = [k / (k − 1)] × [1 − (ΣSDᵢ² / SD²)]

where k = number of items
      SDᵢ² = variance of individual items
      SD² = variance of total test scores

Consider these data for a five-item test that was administered to six students. Each item could receive a score ranging from 1 to 5.

            Item 1    Item 2    Item 3    Item 4    Item 5    Total
Student 1      4         3         4         5         5       21
Student 2      3         3         2         3         3       14
Student 3      2         3         2         2         1       10
Student 4      4         4         5         3         4       20
Student 5      2         3         4         2         3       14
Student 6      2         2         2         1         3       10
SDᵢ²        0.8056    0.3333    1.4722    1.5556    1.4722    SD² = 18.81

Note: When calculating SDᵢ² and SD², n was used in the denominator.

Coefficient alpha = (5/4) × [1 − (0.8056 + 0.3333 + 1.4722 + 1.5556 + 1.4722) / 18.81]
                  = 1.25 × (1 − 5.6389/18.81)
                  = 1.25 × (1 − 0.29978)
                  = 1.25 × 0.70
                  = 0.875
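A corresponding sketch for coefficient alpha, using the data in Table 4.4, follows; again, the helper functions are illustrative only.

```python
def coefficient_alpha(item_scores):
    """Coefficient alpha; item_scores holds one row of item scores per student."""
    k = len(item_scores[0])   # number of items

    def pvariance(values):    # population variance (n in the denominator, as in Table 4.4)
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    total_variance = pvariance([sum(row) for row in item_scores])
    sum_item_variances = sum(pvariance([row[i] for row in item_scores]) for i in range(k))
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Data from Table 4.4: rows are students, columns are items scored 1-5.
data = [
    [4, 3, 4, 5, 5],
    [3, 3, 2, 3, 3],
    [2, 3, 2, 2, 1],
    [4, 4, 5, 3, 4],
    [2, 3, 4, 2, 3],
    [2, 2, 2, 1, 3],
]
print(f"Coefficient alpha = {coefficient_alpha(data):.3f}")   # about 0.875
```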
Reliability of Composite Scores

In the classroom, scores from several tests and other assessments are typically combined into an overall grade for a grading period or semester. Many standardized psychological instruments likewise contain several measures that are combined to form an overall composite score. For example, the Wechsler Adult Intelligence Scale—Third Edition (Wechsler, 1997) is composed of 11 subtests used in the calculation of the Full Scale Intelligence Quotient (FSIQ). Both of these situations involve composite scores obtained by combining the scores on several different tests or subtests. The advantage of composite scores is that the reliability of composites is generally greater than that of the individual scores that contribute to the composite. More precisely, the reliability of a composite is the result of the number of scores in the composite, the reliability of the individual scores, and the correlation between those scores. The more scores in the composite, the higher the correlation between those scores, and the higher the individual reliabilities, the higher the composite reliability.
SPECIAL INTEREST TOPIC 4.2
Calculating Inter-Rater Agreement
Performance assessments require test takers to complete a process or produce a product in a context
that closely resembles real-life situations. For example, a student might engage in a debate, compose
a poem, or perform a piece of music. The evaluation of these types of performances is typically
based on scoring rubrics that specify what aspects of the student’s performance should be considered
when providing a score or grade. The scoring of these types of assessments obviously involves the
subjective judgment of the individual scoring the performance, and as a result inter-rater reliability
is a concern. As noted in the text one approach to estimating inter-rater reliability is to calculate the
correlation between the scores that are assigned by two judges. Another approach is to calculate the
percentage of agreement between the judges’ scores.
Consider an example wherein two judges rated poems composed by 25 students. The poems
were scored from 1 to 5 based on criteria specified in a rubric, with 1 being the lowest performance
and 5 being the highest. The results are illustrated in the following table:
                            Ratings of Rater 1
Ratings of Rater 2       1      2      3      4      5
        5                0      0      1      2      4
        4                0      0      2      3      2
        3                0      2      3      1      0
        2                1      1      1      0      0
        1                1      1      0      0      0
Once the data are recorded you can calculate inter-rater agreement with the following formula:

Inter-Rater Agreement = (Number of Exact Agreements / Total Number of Ratings) × 100

The exact agreements fall along the diagonal of the table (4 + 3 + 3 + 1 + 1 = 12 of the 25 ratings), so the agreement is 12/25 × 100 = 48%.

This degree of inter-rater agreement might appear low to you, but this would actually be respectable for a classroom test. In fact the Pearson correlation between these judges' ratings is 0.80 (better than many, if not most, performance assessments).
Instead of requiring the judges to assign the exact same score for agreement, some authors suggest the less rigorous criterion of scores being within one point of each other (e.g., Linn & Gronlund, 2000). If this criterion were applied to these data, the modified agreement percent would be 96% because only one of the judges' scores was not within one point of the other (Rater 1 assigned a 3 and Rater 2 a 5).
We caution you not to expect this high a rate of agreement should you examine the inter-rater
agreement of your own performance assessments. In fact you will learn later that difficulty scoring
performance assessments in a reliable manner is one of the major limitations of these procedures.
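If you want to automate these calculations, the short sketch below computes both the exact-agreement and within-one-point percentages directly from the cross-tabulation shown above.

```python
# Cross-tabulation from Special Interest Topic 4.2: keys are Rater 2's ratings,
# and each list gives the counts for Rater 1's ratings of 1 through 5.
cross_tab = {
    5: [0, 0, 1, 2, 4],
    4: [0, 0, 2, 3, 2],
    3: [0, 2, 3, 1, 0],
    2: [1, 1, 1, 0, 0],
    1: [1, 1, 0, 0, 0],
}

total = exact = within_one = 0
for rater2_score, counts in cross_tab.items():
    for column, n_students in enumerate(counts):
        rater1_score = column + 1
        total += n_students
        if rater1_score == rater2_score:
            exact += n_students
        if abs(rater1_score - rater2_score) <= 1:
            within_one += n_students

print(f"Exact agreement: {100 * exact / total:.0f}%")                  # 48%
print(f"Agreement within one point: {100 * within_one / total:.0f}%")  # 96%
```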
As we noted, tests are simply samples of the test domain, and combining multiple measures is analogous to increasing the number of observations or the sample size.
If a test is designed to measure a heterogeneous content domain, coefficient alpha and KR-20 would probably underestimate reliability. In the situation of a test with heterogeneous content (the heterogeneity is intended and not a mistake), the split-half method is
preferred. Because the goal of the split-half approach is to compare two equivalent halves,
it would be necessary to ensure that each half has equal numbers of both multiplication and
division problems.
We have been focusing on tests of achievement when providing examples, but the same
principles apply to other types of tests. For example, a test that measures depressed mood
may assess a fairly homogeneous domain, making the use of coefficient alpha or KR-20 appropriate. However, if the test measures depression, anxiety, anger, and impulsiveness, the
content becomes more heterogeneous and the split-half estimate would be indicated. In this
situation, the split-half approach would allow the construction of two equivalent halves with
equal numbers of items reflecting the different traits or characteristics under investigation.
Naturally, if different forms of a test are available, it would be important to estimate
alternate-form reliability. If a test involves subjective judgment by the person scoring the
test, inter-rater reliability is important. Many contemporary test manuals report multiple
estimates of reliability. Given enough information about reliability, one can partition the
error variance into its components, as demonstrated in Figure 4.1.
[Figure 4.1 (not reproduced) partitions score variance into true score variance and components of error variance, such as content sampling error.]
Construct. Some constructs are more difficult to measure than others simply because the
item domain is more difficult to sample adequately. As a general rule, personality variables
are more difficult to measure than academic knowledge. As a result, what might be an acceptable level of reliability for a measure of “dependency” might be regarded as unacceptable for a measure of reading comprehension. In evaluating the acceptability of a reliability coefficient one should consider the nature of the variable under investigation and how difficult it is to measure. By carefully reviewing and comparing the reliability estimates of
different instruments available for measuring a construct, one can determine which is the
most reliable measure of the construct.
Time Available for Testing. If the amount of time available for testing is limited, only
a limited number of items can be administered and the sampling of the test domain is open
to greater error. This could occur in a research project in which the school principal allows
you to conduct a study in his or her school but allows only 20 minutes to measure all the
variables in your study. As another example, consider a districtwide screening for reading
problems wherein the budget allows only 15 minutes of testing per student. In contrast, a
psychologist may have two hours to administer a standardized intelligence test individually.
It would be unreasonable to expect the same level of reliability from these significantly different measurement processes. However, comparing the reliability coefficients associated
with instruments that can be administered within the parameters of the testing situation can
help one select the best instrument for the situation.
Test Score Use. The way the test scores will be used is another major consideration
when evaluating the adequacy of reliability coefficients. Diagnostic tests that form the
basis for major decisions about individuals should be held to a higher standard than tests
used with group research or for screening large numbers of individuals. For example,
an individually administered test of intelligence that is used in the diagnosis of mental
retardation would be expected to produce scores with a very high level of reliability. In
this context, performance on the intelligence test provides critical information used to determine whether the individual meets the diagnostic criteria. In contrast, a brief test used to screen all students in a school district for reading problems would be held to less
rigorous standards. In this situation, the instrument is used simply for screening purposes
and no decisions are being made that cannot easily be reversed. It helps to remember that
although high reliability is desirable with all assessments, standards of acceptability vary according to the way test scores will be used. High-stakes decisions demand highly reliable information!
■ If a test is being used to make important decisions that are likely to significantly impact individuals and are not easily reversed, it is reasonable to expect reliability coefficients of 0.90 or even 0.95. This level of reliability is regularly obtained with individually administered tests of intelligence. For example, the reliability of the Wechsler Adult Intelligence Scale—Third Edition (Wechsler, 1997), an individually administered intelligence test, is 0.98.
■ Reliability estimates of 0.80 or more are considered acceptable in many testing situations and are commonly reported for group and individually administered achievement and personality tests. For example, the California Achievement Test/5 (CAT/5) (CTB/Macmillan/McGraw-Hill, 1993), a set of group-administered achievement tests frequently used in public schools, has reliability coefficients that exceed 0.80 for most of its subtests.
■ For teacher-made classroom tests and tests used for screening, reliability estimates of at least 0.70 are expected. Classroom tests are frequently combined to form linear composites that determine a final grade, and the reliability of these composites is expected to be greater than the reliabilities of the individual tests. Marginal coefficients in the 0.70s might
also be acceptable when more thorough assessment procedures are available to address
concerns about individual cases.
Some writers suggest that reliability coefficients as low as 0.60 are acceptable for group
research, performance assessments, and projective measures, but we are reluctant to endorse
the use of any assessment that produces scores with reliability estimates below 0.70. As you
recall, a reliability coefficient of 0.60 indicates that 40% of the observed variance can be
attributed to random error. How much confidence can you place in assessment results when
you know that 40% of the variance is attributed to random error?
The preceding guidelines on reliability coefficients and qualitative judgments of their
magnitude must also be considered in context. Some constructs are just a great deal more
difficult to measure reliably than others. From a developmental perspective, we know that
emerging skills or behavioral attributes in children are more difficult to measure than mature
or developed skills. When a construct is very difficult to measure, any reliability coefficient
greater than 0.50 may well be acceptable just because there is still more true score variance
present in such values relative to error variance. However, before choosing measures with
reliability coefficients below 0.70, be sure no better measuring instruments are available
that are also practical and whose interpretations have validity evidence associated with the
intended purposes of the test.
A natural question at this point is “What can we do to improve the reliability of our assessment results?” In essence we are asking what steps can be taken to maximize true score variance and minimize error variance. Probably the most obvious approach is simply to increase the number of items on a test. In the context of an individual test, if we increase the number of items while maintaining the same quality as the original items, we will increase the reliability of the test. This concept was introduced when we discussed split-half reliability and presented the Spearman-Brown formula. In fact, a variation of the Spearman-Brown formula can be used to predict the effects on reliability achieved by adding items:
r (predicted) = (n × r) / [1 + (n − 1) × r]

where n is the factor by which the length of the test is increased and r is the reliability of the original test.
For instance, consider the example of our 25-item math test. If the reliability of the
test were 0.80 and we wanted to estimate the increase in reliability we would achieve by
increasing the test to 30 items (a factor of 1.2), the formula would be:
r = (1.2 × 0.80) / (1 + [(1.2 − 1) × 0.80])
  = 0.96 / 1.16
  = 0.83
Table 4.6 provides other examples illustrating the effects of increasing the length of our hypothetical test on reliability. By looking in the first row of this table you see that increasing the number of items on a test with a reliability of 0.50 by a factor of 1.25 results in a predicted reliability of 0.56. Increasing the number of items by a factor of 2.0 (i.e., doubling
the length of the test) increases the reliability to 0.67.
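The prediction is easy to wrap in a small function; the sketch below simply implements the variation of the Spearman-Brown formula shown above and reproduces the worked values.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is changed by the given factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Our 25-item math test with reliability 0.80, lengthened to 30 items (a factor of 1.2).
print(f"{spearman_brown(0.80, 1.2):.2f}")   # about 0.83
# Doubling a test whose scores have a reliability of 0.50.
print(f"{spearman_brown(0.50, 2.0):.2f}")   # about 0.67
```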
In some situations various factors will limit the number of items we can include in
a test. For example, teachers generally develop tests that can be administered in a specific
time interval, usually the time allocated for a class period. In these situations, one can
enhance reliability by using multiple measurements that are combined for an average or
composite score. As noted earlier, combining multiple tests in a linear composite will
increase the reliability of measurement over that of the component tests. In summary,
anything we do to get a more adequate sampling of the content domain will increase the
reliability of our measurement.
In Chapter 6 we will discuss a set of procedures collectively referred to as “item analyses.” These procedures help us select, develop, and retain test items with good measurement characteristics. While it is premature to discuss these procedures in detail, it should
be noted that selecting or developing good items is an important step in developing a good
test. Selecting and developing good items will enhance the measurement characteristics of
the assessments you use.
Another way to reduce the effects of measurement error is what Ghiselli, Campbell,
and Zedeck (1981) refer to as “good housekeeping procedures.” By this they mean test
developers should provide precise and clearly stated procedures regarding the administration and scoring of tests. Examples include providing explicit instructions for standardized
administration, developing high-quality rubrics to facilitate reliable scoring, and requiring
extensive training before individuals can administer, grade, or interpret a test.
Range Restriction. The values we obtain when calculating reliability coefficients are
dependent on characteristics of the sample or group of individuals on which the analyses
are based. One characteristic of the sample that significantly impacts the coefficients is the
degree of variability in performance (i.e., variance). More precisely, reliability coefficients based on samples with large variances (referred to as heterogeneous samples) will generally produce higher estimates of reliability than those based on samples with small variances (referred to as homogeneous samples). When reliability coefficients are based on a sample with a restricted range of variability, the coefficients may actually underestimate the reliability of measurement. For example, if you base a reliability analysis on students in a gifted
and talented class in which practically all of the scores reflect exemplary performance (e.g.,
>90% correct), you will receive lower estimates of reliability than if the analyses are based
on a class with a broader and more nearly normal distribution of scores.
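A small simulation can make the range-restriction effect visible. In the hypothetical sketch below (parallel-form scores generated under classical test theory, not data from the text), estimating reliability from only the high scorers typically yields a noticeably lower coefficient than estimating it from the full, more variable group.

```python
import random
import statistics

random.seed(7)

# Simulate parallel-form scores for 500 students: a shared true score plus independent errors.
true_scores = [random.gauss(75, 12) for _ in range(500)]
form_a = [t + random.gauss(0, 5) for t in true_scores]
form_b = [t + random.gauss(0, 5) for t in true_scores]

# Reliability estimated in the full (heterogeneous) sample.
r_full = statistics.correlation(form_a, form_b)

# Reliability estimated in a range-restricted (homogeneous) sample: only high scorers on Form A.
restricted = [(a, b) for a, b in zip(form_a, form_b) if a >= 85]
r_restricted = statistics.correlation([a for a, _ in restricted], [b for _, b in restricted])

print(f"Full sample r = {r_full:.2f}; restricted sample r = {r_restricted:.2f}")
```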
Mastery Testing. Criterion-referenced tests are used to make interpretations relative to a specific level of performance. Mastery testing is an example of a criterion-referenced test by which a test taker's performance is evaluated in terms of achieving a cut score instead of the degree of achievement. The emphasis in this testing situation is on classification. Either test takers score at or above the cut score and are classified as having mastered the skill or domain, or they score below the cut score and are classified as having not mastered the skill or domain. Mastery testing often results in limited variability among test takers, and, as we just described, limited variability in performance results in small reliability coefficients. As a result, the reliability estimates discussed in this chapter are typically inadequate for assessing the reliability of mastery test scores. Given the emphasis on classification, a recommended approach is to use an index that reflects the consistency of classification (AERA et al., 1999). Special Interest Topic 4.3 illustrates a useful procedure for evaluating the consistency of classification when using mastery tests.
SPECIAL INTEREST TOPIC 4.3
Evaluating the Consistency of Classification with Mastery Tests

                                            Form B: Nonmastery    Form B: Mastery
Form A: Mastery (score of 80% or better)            4                   32
Form A: Nonmastery (score <80%)                     11                   3
Students classified as achieving mastery on both tests are denoted in the upper right-hand
cell while students classified as not having mastered the skill are denoted in the lower left-hand cell.
There were four students who were classified as having mastered the skills on Form A but not on
Form B (denoted in the upper left-hand cell). There were three students who were classified as having mastered the skills on Form B but not on Form A (denoted in the lower right-hand cell). The next
step is to calculate the percentage of consistency by using the following formula:

Percent Consistency = (Number of Consistent Classifications / Total Number of Students) × 100
                    = (32 + 11) / 50 × 100 = 86%
This approach is limited to situations in which you have parallel mastery tests. Another limitation is that there are no clear standards regarding what constitutes “acceptable” consistency of classification. As with the evaluation of all reliability information, the evaluation of classification consistency
should take into consideration the consequences of any decisions that are based on the test results (e.g., Gronlund, 2003). If the test results are used to make high-stakes decisions (e.g., awarding a diploma), a very high level of consistency is required. If the test is used only for low-stakes decisions (e.g., failure results in further instruction and retesting), a lower level of consistency may be acceptable. Subkoviak (1984) provides a good discussion of several techniques for estimating the classification consistency of mastery tests, including some rather sophisticated approaches that require only a single administration of the test.
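Computing the percent-consistency index is straightforward; the sketch below uses the classification counts from Special Interest Topic 4.3.

```python
# Classification counts from Special Interest Topic 4.3 (50 students, two parallel forms).
mastery_on_both = 32      # mastery on Form A and Form B
nonmastery_on_both = 11   # nonmastery on both forms
mastery_a_only = 4        # mastery on Form A only
mastery_b_only = 3        # mastery on Form B only

total = mastery_on_both + nonmastery_on_both + mastery_a_only + mastery_b_only
percent_consistency = 100 * (mastery_on_both + nonmastery_on_both) / total
print(f"Consistency of classification: {percent_consistency:.0f}%")   # 86%
```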
The Standard Error of Measurement

The SEM is a function of the reliability and the standard deviation of test scores: the reliability coefficient reflects the proportion of true score variance present in test scores, and the SD reflects the variability of the scores in the distribution. The SEM is estimated using the following formula:

SEM = SD × √(1 − rxx)

where SD = the standard deviation of the obtained scores
      rxx = the reliability of the test
Let's work through two quick examples. First, let's assume a test with a standard deviation of 10 and reliability of 0.90:

SEM = 10 × √(1 − 0.90) = 10 × √0.10 = 3.16

Now let's assume a test with a standard deviation of 10 and reliability of 0.80. The SD is the same as in the previous example, but the reliability is lower:

SEM = 10 × √(1 − 0.80) = 10 × √0.20 = 4.47
Notice that as the reliability of the test scores decreases, the SEM increases. Because the
reliability coefficient reflects the proportion of observed score variance due to true score
variance and the SEM is an estimate of the amount of error in test scores, this inverse
relationship is what one would expect. The greater the reliability of test scores, the smaller
the SEM and the more confidence we have in the precision of test scores. The lower the
reliability of a test, the larger the SEM and the less confidence we have in the precision of
test scores. Table 4.7 shows the SEM as a function of SD and reliability. Examining the
first row in the table shows that on a test with a standard deviation of 30 and a reliability
coefficient of 0.95 the SEM is 6.7. In comparison, if the reliability of the test score is 0.90
the SEM is 9.5; if the reliability of the test is 0.85 the SEM is 11.6; and so forth. The SEM
is used in calculating intervals or bands around observed scores in which the true score is
expected to fall. We will now turn to this application of the SEM.
For example, if an individual had a true score of 70 on a test with a SEM of 3, we would expect him or her to obtain scores between 67 and 73 (true score ±1 SEM) about two-thirds of the time.
To obtain a 95% confidence interval we simply determine the number of standard deviations encompassing 95% of the scores in a distribution. By referring to a table representing areas under the normal curve (see Appendix F), you can determine that 95% of the scores in a normal distribution fall within ±1.96 standard deviations of the mean. Given a true score of 70 and SEM of 3, the 95% confidence interval would be 70 ± 3(1.96) or 70 ± 5.88. Therefore, in this situation an individual's observed score would be expected to be between 64.12 and 75.88 95% of the time.
You might have noticed a potential problem with this approach to calculating confidence intervals. So far we have described how the SEM allows us to form confidence intervals around the test taker's true score. The problem is that we don't know a test taker's true score, only the observed score. Although it is possible for us to estimate true scores (see Nunnally & Bernstein, 1994), it is common practice to use the SEM to establish confidence intervals around obtained scores (see Gulliksen, 1950). These confidence intervals are calculated in the same manner as just described, but the interpretation is slightly different. In this context the confidence interval is used to define the range of scores that will contain the individual's true score. For example, if an individual obtains a score of 70 on a test with a SEM of 3.0, we would expect his or her true score to be between 67 and 73 (obtained score ±1 SEM) 68% of the time. Accordingly, we would expect his or her true score to be between 64.12 and 75.88 95% of the time (obtained score ±1.96 SEM).
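The SEM and confidence-interval calculations described above are easy to script. The following sketch (our illustration, using the chapter's numbers) computes the SEM from the SD and reliability and builds bands around an obtained score; z = 1.00 gives the roughly 68% interval and z = 1.96 the 95% interval.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(obtained_score, sem_value, z=1.96):
    """Band around an obtained score expected to contain the true score."""
    margin = z * sem_value
    return obtained_score - margin, obtained_score + margin

# A test with SD = 10 and reliability 0.90 has an SEM of about 3.16.
print(f"SEM = {sem(10, 0.90):.2f}")

# The chapter's example: an obtained score of 70 with an SEM of 3.
low, high = confidence_interval(70, 3, z=1.96)
print(f"95% confidence interval: {low:.2f} to {high:.2f}")   # 64.12 to 75.88
```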
It may help to make note of the relationship between the reliability of the test score, the SEM, and confidence intervals. Remember that we noted that as the reliability of scores increases the SEM decreases. The same relationship exists between test reliability and confidence intervals. As the reliability of test scores increases (denoting less measurement error), the confidence intervals become smaller (denoting more precision in measurement).
A major advantage of the SEM and the use of confidence intervals is that they serve to remind us that measurement error is present in all scores and that we should interpret scores cautiously. A single numerical score is often interpreted as if it is precise and involves no error. For example, if you report that Susie has a Full Scale IQ of 113, her parents might interpret this as implying that Susie's IQ is exactly 113. If you are using a high-quality IQ test such as the Wechsler Intelligence Scale for Children—4th Edition or the Reynolds Intellectual Assessment Scales, the obtained IQ is very likely a good estimate of her true IQ. However, even with the best assessment instruments the obtained scores contain some degree of error, and the SEM and confidence intervals help us illustrate this. This information can be reported in different ways in written reports. For example, Kaufman and Lichtenberger (1999) recommend the following format:
■ Susie obtained a Full Scale IQ of 113 (between 108 and 118 with 95% confidence).
■ Susie obtained a Full Scale IQ in the High Average range, with a 95% probability that her true IQ falls between 108 and 118.
Regardless of the exact format used, the inclusion of confidence intervals highlights
the fact that test scores contain some degree of measurement error and should be interpreted
with caution. Most professional test publishers either report scores as bands within which
the test taker's true score is likely to fall or provide information on calculating these confidence intervals.
SPECIAL INTEREST TOPIC 4.4
A Shortcut Approach for Calculating KR-21

KR-21 = 1 − [X̄(n − X̄) / (n × σ²)]

where X̄ = mean
      σ² = variance
      n = number of items

Consider the following set of 20 scores: 50, 48, 47, 46, 42, 42, 41, 40, 40, 38, 37, 36, 36, 35, 34, 32, 32, 31, 30, and 28. Here X̄ = 38.25, σ² = 39.8, and n = 50. Therefore,

KR-21 = 1 − [38.25(50 − 38.25) / (50 × 39.8)]
      = 1 − (449.4375 / 1990)
      = 1 − 0.23 = 0.77
As you see, this is a fairly simple procedure. If you have access to a computer with a spreadsheet program or a calculator with mean and variance functions, you can estimate the reliability of a classroom test easily in a matter of minutes with this formula.
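If you would rather script the shortcut than work it by hand, the following sketch reproduces the example above; note that the variance of 39.8 corresponds to the sample variance (n − 1 in the denominator), which is what statistics.variance computes.

```python
import statistics

scores = [50, 48, 47, 46, 42, 42, 41, 40, 40, 38,
          37, 36, 36, 35, 34, 32, 32, 31, 30, 28]
n_items = 50

mean = statistics.mean(scores)           # 38.25
variance = statistics.variance(scores)   # about 39.8, as in the example above

# Shortcut KR-21 estimate: 1 - [mean * (n_items - mean)] / (n_items * variance)
kr21 = 1 - (mean * (n_items - mean)) / (n_items * variance)
print(f"KR-21 = {kr21:.2f}")   # about 0.77
```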
Special Interest Topic 4.4 presents a shortcut approach for calculating the Kuder-
Richardson formula 21 (KR-21). If you want to avoid even these limited computations, we
prepared Table 4.8, which allows you to estimate the KR-21 reliability for dichotomously
scored classroom tests if you know the standard deviation and number of items (this table was modeled after tables originally presented by Deiderich, 1973). This table is appropriate for tests with a mean of approximately 80% correct (we are using a mean of 80% correct because it is fairly representative of many classroom tests).

TABLE 4.8  Estimated KR-21 Reliability for Dichotomously Scored Classroom Tests (Mean of About 80% Correct)

Number of Items    SD = 0.10n    SD = 0.15n    SD = 0.20n
      10               —            0.29          0.60
      20             0.20           0.64          0.80
      30             0.47           0.76          0.87
      40             0.60           0.82          0.90
      50             0.68           0.86          0.92
      75             0.79           0.91          0.95
     100             0.84           0.93          0.96

Source: Saupe, J. L. (1961). Some useful estimates of the Kuder-Richardson formula number 20 reliability coefficient. Educational and Psychological Measurement, 2, 63-72.
To illustrate its application, consider the following example. If your test has 50 items and an SD of 8, select the “Number of Items” row for 50 items and the “Standard Deviation” column for 0.15n, because 0.15(50) = 7.5, which is close to your actual SD of 8. The number at the intersection is 0.86, which
is a very respectable reliability for a classroom test (or a professionally developed test for
that matter).
If you examine Table 4.8, you will likely detect a few fairly obvious trends. First, the
more items on the test the higher the estimated reliability coefficients. We alluded to the
beneficial impact of increasing test length previously in this chapter and the increase in reliability is due to enhanced sampling of the content domain. Second, tests with larger standard deviations (i.e., variance) produce more reliable results. For example, a 30-item test with an SD of 3—i.e., 0.10(n)—results in an estimated reliability of 0.47, while one with an SD of 4.5—i.e., 0.15(n)—results in an estimated reliability of 0.76. This reflects the tendency we described earlier that restricted score variance results in smaller reliability coefficients. We should note that while we include a column for standard deviations of 0.20(n), standard deviations this large are rare with classroom tests (Deiderich, 1973). In fact, from our experience it is more common for classroom tests to have standard deviations closer to 0.10(n). Before leaving our discussion of KR-21 and its application to classroom tests, we do want to caution you that KR-21 is only an approximation of KR-20 or coefficient alpha. KR-21 assumes the test items are of equal difficulty and it is usually slightly lower than KR-20 or
it is probably a reasonably good estimate of reliability for many classroom applications.
Our discussion of shortcut reliability estimates to this point has been limited to tests
that are dichotomously scored. Obviously, many of the assessments teachers use are not
dichotomously scored and this makes the situation a little more complicated. If your items
are not scored dichotomously, you can calculate coefficient alpha with relative ease using
a commonly available spreadsheet such as Microsoft Excel. With a little effort you should
be able to use a spreadsheet to perform the computations illustrated previously in Tables
4.3 and 4.4.
Summary
Reliability refers to consistency in test scores. If a test or other assessment procedure produces consistent measurements, its scores are reliable. Why is reliability so important? As we have emphasized, assessments are useful because they provide information that helps educators make better decisions. However, the reliability (and validity) of that information is of paramount importance. For us to make good decisions, we need reliable information. By estimating the reliability of our assessment results, we get an indication of how much confidence we can place in them. If we have highly reliable and valid information, it is probable that we can use that information to make better decisions. If the results are unreliable, they are of little value to us.
■ Test-retest reliability involves the administration of the same test to a group of individuals on two different occasions. The correlation between the two sets of scores is the test-retest reliability coefficient and reflects errors due to time sampling.
■ Alternate-form reliability involves the administration of parallel forms of a test to a group of individuals. The correlation between the scores on the two forms is the reliability coefficient. If the two forms are administered at the same time, the reliability coefficient reflects only content sampling error. If the two forms of the test are administered at different times, the reliability coefficient reflects both content and time sampling errors.
■ Internal-consistency reliability estimates are derived from a single administration of a test. Split-half reliability involves dividing the test into two equivalent halves and calculating the correlation between the two halves. Instead of comparing performance on two halves of the test, coefficient alpha and the Kuder-Richardson approaches examine the consistency of responding among all of the individual items of the test. Split-half reliability reflects errors due to content sampling whereas coefficient alpha and the Kuder-Richardson approaches reflect both item heterogeneity and errors due to content sampling.
■ Inter-rater reliability is estimated by administering the test once but having the responses scored by different examiners. By comparing the scores assigned by different examiners, one can determine the influence of different raters or scorers. Inter-rater reliability is important to examine when scoring involves considerable subjective judgment.
RECOMMENDED READINGS
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: AERA. Chapter 5, Reliability and Errors of Measurement, is a great resource!

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). Upper Saddle River, NJ: Merrill/Prentice Hall. A little technical at times, but a great resource for students wanting to learn more about reliability.

Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences. San Francisco: W. H. Freeman. Chapters 8 and 9 provide outstanding discussions of reliability. A classic!

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill. Chapter 6, The Theory of Measurement Error, and Chapter 7, The Assessment of Reliability, are outstanding chapters. Another classic!

Subkoviak, M. J. (1984). Estimating the reliability of mastery-nonmastery classifications. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 267-291). Baltimore: Johns Hopkins University Press. An excellent discussion of techniques for estimating the consistency of classification with mastery tests.
PRACTICE ITEMS
1. Consider these data for a five-item test that was administered to six students. Each item could
receive a score of either 1 or 0. Calculate KR-20 using the following formula:
KR-20 = [k / (k − 1)] × [(SD² − Σpᵢqᵢ) / SD²]

where k = number of items
      SD² = variance of total test scores
      pᵢ = proportion of correct responses on item i
      qᵢ = proportion of incorrect responses on item i
            Item 1    Item 2    Item 3    Item 4    Item 5
Student 1      0         1         1         0         1
Student 2      1         1         1         1         1
Student 3      1         0         1         0         0
Student 4      0         0         0         1         0
Student 5      1         1         1         1         1
Student 6      1         1         0         1         0
pᵢ                                                              SD² =
qᵢ
pᵢ × qᵢ

Note: When calculating SD², use n in the denominator.
2. Consider these data for a five-item test that was administered to six students. Each item
could receive a score ranging from 1 to 5. Calculate coefficient alpha using the following
formula:
Coefficient alpha = [k / (k − 1)] × [1 − (ΣSDᵢ² / SD²)]

where k = number of items
      SDᵢ² = variance of individual items
      SD² = variance of total test scores
            Item 1    Item 2    Item 3    Item 4    Item 5
Student 1      4         5         4         5         5
Student 2      3         3         2         3         2
Student 3      2         3         1         2         1
Student 4      4         4         5         5         4
Student 5      2         3         2         2         3
Student 6      1         2         2         1         3
SDᵢ²                                                            SD² =