MODULE 5 - Development of Assessment Tools
What do you already know?

2. Among the types of tests, which do you think is the easiest one? Explain your answer.
3. Have you ever asked yourself how your teacher constructed your examination?
The table of specification provides the test constructor a way to ensure that the assessment is based on the intended learning outcomes. As a would-be teacher, you should keep your lesson plan at hand for easy access to the information needed in preparing the table of specification. Your lesson plan will make test construction easier because it serves as your guide throughout the planning session.

There are things to consider in preparing your examination, such as the appropriateness and the qualities of assessment tools. Attending to these will also ensure that your learning objectives are congruent with the activities you are going to give your students. The qualities of assessment tools serve as your guide in ensuring the validity, reliability, fairness, objectivity, scorability, adequacy, administrability, practicality, and efficiency of your instrument as well as of the test results.
5. Be sure that each item is independent of all other items. The answer to one item should not be required as a condition for answering the next item. A hint to one answer should not be embedded in another item.
6. Be sure the item has one correct or best answer on which experts would agree.
7. Prevent unintended clues to the answer in the statement or question. Grammatical inconsistencies, such as using "a" where "an" is required, give clues to the correct answer to students who are not well prepared for the test.
8. Avoid replicating the textbook in writing test items; do not quote directly from textual materials. You are usually not interested in how well students have memorized the text.
9. Avoid trick or catch questions in an achievement test. Do not waste time testing how well students can interpret your intentions.
10. Try to write items that require higher-order thinking skills.
comprehension level in the cognitive domain. This type of test is appropriate when there
are only two plausible alternatives or distracters.
Poor Example
California:
A. Contains the tallest mountain in the United States.
B. Has an eagle on its state flag.
C. Is the second largest state in terms of area.
*D. Was the location of the Gold Rush of 1849.
Good Example
What is the main reason so many people moved to California in 1849?
A. California's land was fertile, plentiful, and inexpensive.
*B. Gold was discovered in central California.
C. The east was preparing for a civil war.
D. They wanted to establish religious settlements.
Good Example
When analyzing your students’ pretest and posttest scores to determine whether your teaching has had a significant effect, an appropriate statistic to use is the t-test for:
*A. Dependent samples.
B. Heterogeneous samples.
C. Homogenous samples.
D. Independent samples.
Good Example
Which of these assessment findings, if identified in a client who has pneumonia,
indicates that the client needs to be suctioned?
A. Absence of adventitious breath sounds.
There are two important characteristics of an item that will be of interest to the
teacher. These are: (a) item difficulty, and (b) discrimination index. We shall learn how to
measure these characteristics and apply our knowledge in making a decision about the
item in question.
The difficulty of an item or item difficulty is defined as the number of students who are
able to answer the item correctly divided by the total number of students. Thus:
Item difficulty = number of students with the correct answer / total number of students

The item difficulty is usually expressed as a percentage.
Example: What is the item difficulty index of an item if 25 students are unable to answer
it correctly while 75 answered it correctly?
Here, the total number of students is 100, hence, the item difficulty index is 75/100 or
75%.
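To make the computation concrete, here is a minimal Python sketch of the difficulty index; the function name and the print formatting are illustrative, not part of the module.

```python
def difficulty_index(num_correct, num_students):
    """Item difficulty: students who answered correctly divided by all students."""
    return num_correct / num_students

# Worked example from the text: 75 of 100 students answered the item correctly.
p = difficulty_index(75, 100)
print(f"Difficulty index: {p:.2f} ({p:.0%})")  # Difficulty index: 0.75 (75%)
```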
One problem with this type of difficulty index is that it may not actually indicate that
the item is difficult (or easy). A student who does not know the subject matter will naturally
be unable to answer the item correctly even if the question is easy. How do we decide on
the basis of this index whether the item is too difficult or too easy? The following arbitrary
rule is often used in the literature:
Difficult items tend to discriminate between those who know and those who do not
know the answer. Conversely, easy items cannot discriminate between these two groups
of students. We are therefore interested in deriving a measure that will tell us whether an
item can discriminate between these two groups of students. Such a measure is called
an index of discrimination.
An easy way to derive such a measure is to measure how difficult an item is with
respect to those in the upper 25% of the class and how difficult it is with respect to those
in the lower 25% of the class. If the upper 25% of the class found the item easy yet the
lower 25% found it difficult, then the item can discriminate properly between these two
groups. Thus:
Index of discrimination = DU – DL
Example: Obtain the index of discrimination of an item if the upper 25% of the class
had a difficulty index of 0.60 (i.e. 60% of the upper 25% got the correct answer) while the
lower 25% of the class had a difficulty index of 0.20.
Here, DU = 0.60 while DL= 0.20, thus index of discrimination = .60 - .20 =.40.
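A minimal sketch of this computation, assuming each group's answers are recorded as 1 (correct) or 0 (incorrect); the names and sample lists are illustrative only.

```python
def group_difficulty(correct_flags):
    """Difficulty index within one group: fraction who answered correctly."""
    return sum(correct_flags) / len(correct_flags)

def discrimination_index(upper_flags, lower_flags):
    """Index of discrimination = DU - DL."""
    return group_difficulty(upper_flags) - group_difficulty(lower_flags)

# Mirrors the worked example: DU = 0.60 and DL = 0.20, so D = 0.40.
upper = [1] * 6 + [0] * 4   # 6 of 10 upper-group students answered correctly
lower = [1] * 2 + [0] * 8   # 2 of 10 lower-group students answered correctly
print(round(discrimination_index(upper, lower), 2))  # 0.4
```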
Theoretically, the index of discrimination can range from -1.0 (when DU=0 and DL= 1) to
1.0 ( when DU= 1 and DL= 0). When the index of discrimination is equal to -1, then this
means that all of the lower 25% of the students got the correct answer while all of the
upper 25% got the wrong answer. In a sense, such an index discriminates correctly
between the two groups but the item itself is highly questionable. Why should the bright
ones get the wrong answer and the poor ones get the right answer? On the other hand, if
the index of discrimination is 1.0, then this means that all of the lower 25% failed to get
the correct answer while all of the upper 25% got the correct answer. This is a perfectly
discriminating item and is the ideal item that should be included in the test. From these
discussions, let us agree to discard or revise all items that have negative discrimination
index for although they discriminate correctly between the upper and lower 25% of the
class, the content of the item itself may be highly dubious. As in the case of the index of
difficulty, we have the following rule of thumb:
The correct response is B. Let us compute the difficulty index and index of discrimination:
Difficulty Index = no. of students getting correct response/ total
= 40/100 = 40%, within range of a “good item”
The discrimination index can similarly be computed:
DU = no. of students in upper 25% with correct response/no. of students in the upper 25%
= 15/20 = .75 or 75%
DL = no. of students in lower 25% with correct response / no. of students in the lower 25%
= 5/20 = .25 or 25%
Discrimination Index = DU – DL = .75 - .25 = .50 or 50%.
It is also instructive to note that the distracter A is not an effective distracter since this
was never selected by the students. Distracters C and D appear to have a good appeal
as distracters.
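A small sketch of this kind of item analysis follows. Only the count for B (the key), the fact that A was never chosen, and the upper/lower-group counts are given in the text; the split of the remaining responses between C and D is hypothetical.

```python
# Hypothetical response counts for the 100 examinees; B is the key and A was never
# chosen (as in the text), while the split between C and D is illustrative only.
responses = {"A": 0, "B": 40, "C": 35, "D": 25}
key = "B"

difficulty = responses[key] / sum(responses.values())   # 40/100 = 0.40
du, dl = 15 / 20, 5 / 20                                 # upper and lower 25% (20 students each)
discrimination = du - dl                                 # 0.75 - 0.25 = 0.50

print(f"Difficulty: {difficulty:.2f}  Discrimination: {discrimination:.2f}")

# Distracters chosen by nobody add nothing to the item and should be revised.
ineffective = [opt for opt, n in responses.items() if opt != key and n == 0]
print("Ineffective distracters:", ineffective)           # ['A']
```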
Item Difficulty
Item difficulty is the percentage of people who answer an item correctly. It is the
relative frequency with which examinees choose the correct response (Thorndike,
Cunningham, Thorndike, & Hagen, 1991). It has an index ranging from a low of 0 to a high
of +1.00.
Higher difficulty indexes indicate easier items. An item answered correctly by 75% of the examinees has an item difficulty level of .75. An item answered correctly by 35% of the examinees has an item difficulty level of .35.
Item difficulty is a characteristic of the item and the sample that takes the test. For
example, a vocabulary question that asks for synonyms for English nouns will be easy for
American graduate students in English literature, but difficult for elementary children. Item
difficulty provides a common metric for comparing items that measure different domains, such as questions in statistics and sociology, making it possible to determine whether one item is more difficult than the other for the same group of examinees. Item difficulty has a powerful effect on
both the variability of test scores and the precision with which test scores discriminate
among groups of examinees (Thorndike, Cunningham, Thorndike, & Hagen, 1991). In
discussing procedures to determine minimum and maximum test scores, Thompson and
Levitov (1985) said that
Items tend to improve test reliability when the percentage of students who correctly
answer the item is halfway between the percentage expected to correctly answer if pure
guessing governed responses and the percentage (100%) who would correctly answer if
everyone knew the answer.
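For example, on a four-option multiple-choice item pure guessing would yield about 25% correct, so by this rule the most useful difficulty level is roughly halfway between 25% and 100%, that is, (25% + 100%) / 2 = 62.5%.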
Index of Difficulty

P = ((Ru + RL) / T) × 100

Where: Ru – the number in the upper group who answered the item correctly.
RL – the number in the lower group who answered the item correctly.
T – the total number who tried the item.
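A minimal Python sketch of this formula; the counts below are assumed for illustration (they echo the earlier worked example with 20 students in each group).

```python
def index_of_difficulty(ru, rl, t):
    """P = ((Ru + RL) / T) * 100, the percentage of the combined upper and
    lower groups who answered the item correctly."""
    return (ru + rl) / t * 100

# Assumed counts: 15 correct in the upper group, 5 in the lower group,
# 40 examinees in the two groups combined.
print(index_of_difficulty(15, 5, 40))  # 50.0
```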
Item Discrimination
Item discrimination compares the number of high scorers and low scorers who answer
an item correctly. It is the extent to which items discriminate among trainees in the high
and low groups. The total test and each item should measure the same thing. High
performers should be more likely to answer a good item correctly, and low performers
more likely to answer incorrectly. Scores range from –1.00 to +1.00, with an ideal score of +1.00. A positive coefficient indicates that high-scoring examinees tended to answer the item correctly more often than low-scoring examinees, while a negative coefficient indicates that low-scoring examinees tended to answer it correctly more often. On items that discriminate well, more high scorers than low scorers
will answer those items correctly.
To compute item discrimination, a test is scored, scores are rank ordered, and 27 percent
of the highest and lowest scorers are selected (Kelley, 1939). The number of correct answers in the lowest 27 percent is subtracted from the number of correct answers in the highest 27 percent. This result is divided by the number of people in the larger of the two
groups. The percentage of 27 percent is used because “this value will maximize
differences in normal distributions while providing enough cases for analysis” (Wiersma
& Jurs, 1990, p. 145). Comparing the upper and lower groups promotes stability by
maximizing differences between the two groups. The percentage of individuals included
in the highest and lowest groups can vary. Nunnally (1972) suggested 25 percent, while
SPSS (1999) uses the highest and lowest one-third.
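A minimal sketch of this procedure, assuming each examinee's total test score and a 0/1 flag for the item in question; the function name, data layout, and sample numbers are illustrative only.

```python
def discrimination_27(total_scores, item_correct, fraction=0.27):
    """Kelley-style index: correct answers in the top group minus correct answers
    in the bottom group, divided by the size of the larger group."""
    ranked = sorted(range(len(total_scores)), key=lambda i: total_scores[i], reverse=True)
    n = max(1, round(len(ranked) * fraction))
    top, bottom = ranked[:n], ranked[-n:]
    top_correct = sum(item_correct[i] for i in top)
    bottom_correct = sum(item_correct[i] for i in bottom)
    return (top_correct - bottom_correct) / max(len(top), len(bottom))

# Illustrative data: ten examinees' total test scores and their 0/1 result on one item.
scores = [95, 90, 88, 80, 75, 70, 65, 60, 55, 40]
item   = [1,  1,  1,  1,  0,  1,  0,  0,  0,  0]
print(discrimination_27(scores, item))  # 1.0 for these made-up data
```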
Wood (1960) stated that
When more students in the lower group than in the upper group select the right answer
to an item, the item actually has negative validity. Assuming that the criterion itself has
validity, the item is not only useless but is actually serving to decrease the validity of the
test.
Index of Discrimination

D = (Ru – RL) / (½T)

Where: Ru – the number in the upper group who answered the item correctly.
RL – the number in the lower group who answered the item correctly.
T – the total number who tried the item.

Using the simpler single-group form of the difficulty index, P = (R / T) × 100, where R is the number who answered the item correctly and T is the total number who tried it:

P = (8 / 20) × 100 = 40%

The smaller the percentage figure, the more difficult the item.
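A small sketch of this form of the discrimination index, using the counts from the earlier worked example (15 of 20 correct in the upper group, 5 of 20 in the lower group); the function name is illustrative.

```python
def discriminating_power(ru, rl, t):
    """D = (Ru - RL) / (T / 2), with T the combined size of the upper and lower groups."""
    return (ru - rl) / (t / 2)

print(discriminating_power(15, 5, 40))  # (15 - 5) / 20 = 0.5
```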
Validation
Validity is the extent to which a test measures what it purports to measure, or the appropriateness, correctness, meaningfulness, and usefulness of the specific decisions a teacher makes based on the test results. These two definitions of validity differ in the sense that the first refers to the test itself while the second refers to the decisions the teacher makes based on the test.
Face validity estimates whether a test measures what it claims to measure. It is the extent
to which a test seems relevant, important, and interesting. It is the least rigorous measure
of validity.
Content validity is the degree to which a test matches a curriculum and accurately
measures the specific training objectives on which a program is based. Typically, it relies on the judgment of qualified experts to determine whether a test is accurate, appropriate, and fair.
Criterion-related validity measures how well a test compares with an external criterion (a minimal correlation sketch follows these definitions). It includes:
Predictive validity is the correlation between a predictor and a criterion obtained at a later time (e.g., a test score on a specific competence and a caseworker's later performance of job-related tasks).
Concurrent validity is the correlation between a predictor and a criterion at the same point
in time (e.g., performance on a cognitive test related to training and scores on a Civil
Service examination).
Construct validity is the extent to which a test measures a theoretical construct (e.g., a
researcher examines a personality test to determine if the personality typologies account
for actual results).
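Since criterion-related validity is expressed as a correlation, the following minimal Python sketch computes a Pearson correlation between test scores and a later criterion measure. The data and function name are hypothetical, offered only to show the kind of computation involved.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: competence test scores and later job-performance ratings.
test_scores = [78, 85, 62, 90, 70, 88, 95, 60]
performance = [3.1, 3.6, 2.4, 3.9, 2.8, 3.5, 4.0, 2.5]
print(round(pearson_r(test_scores, performance), 2))  # a high r would support predictive validity
```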
Validity is an overall evaluative judgment, founded on empirical evidence and
theoretical rationales, of the adequacy and appropriateness of inferences and actions
based on test scores. As such validity is an inductive summary of both the adequacy of
existing evidence for and the appropriateness of potential consequences of test
interpretation and use (Messick, 1988, pp. 33-34).
What you have read is just some of the information about the development of assessment tools. Let us put this information into a meaningful learning experience.

Summary

Developing assessment tools helps ensure that a teacher's main objectives are achieved, because it involves identifying and validating effective strategies for assessing students' learning outcomes. Assessment tools form a coherent system in which varied assessment strategies are used to validate students' learning outcomes and to ensure fair and valid results. Assessment provides information about the effectiveness of instruction and the overall progress of students' learning, using differentiated validation strategies.
Required Readings

Marsh, Colin. Teaching Social Studies. National Library of Australia / Prentice-Hall of Australia.
Calmorin, L. P. Measurement and Evaluation, Third Edition. National Book Store, Mandaluyong City 1550.
Calmorin, L. P. (2011). Assessment of Student Learning 1, First Edition. Rex Book Store, Inc.
Feedback
Development of Assessment Tools
1. A teacher constructed a test to measure the students' ability to apply previous knowledge to certain situations. In particular, the evidence that a student is able to apply previous knowledge is that the student can: draw correct conclusions based on the information given; identify one or more logical implications that follow from a given point of view; and state whether two ideas are identical, merely similar, unrelated, or contradictory. Write test items, using the multiple-choice type of test, that would cover these concerns of the teacher. Show your test to an expert and ask him or her to judge whether the items indeed cover these concerns. (10 points)
3. Enumerate the three types of validity evidence. Which of these types of validity is the most difficult to measure? Why? (10 points)