Chapter 5 Constructing Tests and Performance Assessments
1. Maximum performance tests (MPT): with these, we assume that all examinees
will perform at their best because all examinees are equally and highly motivated.
Examinee performance is influenced by the effects of heredity, schooling or training,
and his or her personal environment. A test may fall into more than one classification.
a. Intelligence tests: These are tests of general aptitude. IQ (Intelligence
Quotient) scores will most likely vary from one intelligence test to another
because the tests themselves differ. IQ is influenced by previous academic
achievement. IQ tests rely on a theory of intelligence, requiring construct
validity.
b. Aptitude (or ability) tests: These tests measure performance potential.
Aptitude tests imply prediction and are sometimes substituted for intelligence
tests, and used for classification (e.g., ability grouping). Ability test scores are
influenced by prior achievement (e.g., reading and math knowledge and
skills.). These rely on content validity.
c. Achievement tests are used to measure examinees’ current knowledge and
skill level. These tend to rely on content validity.
d. Speeded tests are maximum performance tests where the speed at which the
test is completed is a vital element in scoring (e.g., typing test, rifle assembly,
foot race, etc.) If a test has a time limit such that virtually all students finish
the test, then it is not considered a speeded test and is called a power test.
Most achievement tests are also power tests.
e. Performance tests are designed to require examinees to demonstrate
competence by constructing a response. In “traditional” testing, this takes
the form of a constructed response, i.e., brief or extended essays. A performance
assessment may also require a competence demonstration via the construction
of a work product; in this instance, detailed product specifications, scoring
criteria, and rating or scoring sheets are used.
3. Standardized tests are aptitude or achievement tests which are administered under
standard conditions and whose scores are interpreted under standardized rules,
typically using norm-referenced standard scores (except on criterion-referenced
tests). On most standardized tests, content coverage is broad and test items are
written to maximize score variability. Thus, items which are too hard or
too easy are not used in constructing the test. Standardized achievement and
aptitude tests rely on content validity.
4. Informal tests are those typically intended for a localized purpose, such as the
simple classroom achievement test.
1. First, when planning the assessment or test, consider examinees’ age, stage of
development, ability level, culture, etc. These factors will influence construction
of learning targets or outcomes, the types of item formats selected, how items are
actually written, and test length.
3. Third, the test item specifications are written. Item specifications intended for
state-, provincial-, or nation-wide tests will be more detailed than those for a common
departmental or classroom examination, which will probably be just the learning
outcome/target benchmarks and possible test item formats. Detailed item specs
are written if different item writers construct tests for different examinee pools,
e.g., school districts permitting each school to write its own unit examinations.
6. Sixth, once a consensus has been reached, items are written, based on item writing
guidelines, and revised until there is agreement that each item (a) meets its item
specification, (b) conforms to item writing guidelines, (c) “fits” the test blueprint,
and (d) contains no bias; the determination is made by the same or similar subject
matter experts completing Step 4, above. Step 6 is again critical to establishing
content validity. The outcome of Step 6 is an initial version of the test.
7. Seventh, once the initial test is written, it is typically pilot tested with examinees
similar to the intended target audience and evaluated statistically (using item
analysis indices; see Part IV of the chapter). The statistical evaluation provides
guidance for the revision or rejection of poorly functioning items. Pilot more test
items than will be actually needed. The outcome of Step 7 is the final version of
the test.
(c) Skill standards state very clearly the specific skills students are
expected to have mastered at a specified performance level.
(2) It is often easy to confuse content and skill standards.
(a) More specifically, content standards specify declarative
knowledge, e.g., mathematical rules, statistical formulas, important
historical facts, grammar rules, or steps in conducting a biology
experiment, etc.
(b) Skill standards specify procedural or conditional knowledge, e.g.,
conducting statistical or mathematical operations based on
formulas or mathematical rules; interpreting or explaining
historical facts or analyzing historical data; correcting a passage of
text for poor grammar; or conducting a biology experiment.
(c) The key difference between content and skill standards is that with
content standards, students are required to possess specific
knowledge; skill standards require students to apply that
knowledge in some expected fashion at an expected level of
performance.
c. In crafting learning standards, targets, or outcomes, Oosterhof (1994, pp.
43-47) suggests the writer consider:
(1) Capability. Identify the intellectual capability being assessed.
Performance taxonomies such as Bloom, et al. or Gagne are helpful
here.
(2) Behavior. Indicate the specific behavior which is to be evidence that
the targeted learning has occurred. The behavior should be directly
observable, requiring no inference.
(3) Situation. Often, it is helpful to specify the conditions under which the
behavior is to be demonstrated; describe those circumstances explicitly.
(4) Special Conditions. Depending on the circumstances, one may need to
place conditions on the behavior, e.g., name a letter or word correctly
80% of the time in order to conclude that the targeted learning has
occurred.
2. There are several models for framing these intended outcomes or standards;
we briefly examine two and then integrate these two approaches into one,
guided by Oosterhof’s (1994, pp. 43-47) suggestions.
a. Mitchell (1996) offers the following Taxonomy and Definitions
(1) Content (or academic) standards. These standards identify what
knowledge and skills are expected of learners at specified phases in
their educational progression.
(2) Performance standards. Performance standards have levels, e.g., 4, 3,
2, 1; exceeds expectations, meets expectations, or does not meet
expectations; unsatisfactory, progressing, proficient, or exemplary,
which are intended to show degree of content mastery.
(3) Opportunity to learn standards. These identify the enabling learning and/or
performance conditions (e.g., instruction, resources, time) needed to ensure
that students have a genuine opportunity to learn the content being assessed.
Be cautious when specifying the fifth element, as assessment options may
become too limited. Two sample standards are:
(a) The student will accurately compute algebraic equations.
(b) The student will accurately and concisely describe modern
leadership theories.
d. The five part model presented here can also be used to evaluate standards
written by others. Regardless of the approach employed, each standard
should meet at least the first four components, be developmentally
appropriate, and ensure that students have had the opportunity to learn,
develop the desired attitude, and/or acquire the specified skill(s).
2. The more important the content, the greater the number of test items. To
arrive at these numbers, the test developer will typically
a. Determine the total number of items to be included. This number is
influenced by the testing time available, the test environment,
developmental status of the examinees, and size of the content and/or skill
domain to be tested.
b. Test items are allocated to each learning target, objective, or outcome;
more critical standards are allotted more items.
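A minimal sketch of this proportional allocation is shown below; the learning-target codes and importance weights are hypothetical, and the rounding rule is an assumption about one reasonable way to do it.

```python
# Sketch: allocate a fixed number of test items across learning targets in
# proportion to importance weights. Target codes and weights are hypothetical.

def allocate_items(total_items, weights):
    """Return items per target, roughly proportional to weight, summing to total_items."""
    total_weight = sum(weights.values())
    allocation = {t: int(total_items * w / total_weight) for t, w in weights.items()}
    # Hand out any items lost to rounding, starting with the most important targets.
    leftover = total_items - sum(allocation.values())
    for target, _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if leftover == 0:
            break
        allocation[target] += 1
        leftover -= 1
    return allocation

weights = {"LA.A.1.2.1": 3, "LA.A.1.2.2": 4, "LA.A.1.2.3": 3, "LA.A.1.2.4": 2}
print(allocate_items(44, weights))
```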
3. Next, test items are sorted across each learning target or benchmark, by
intellectual skill. Test item classification guidelines are:
a. Determine exactly what the action verb in the Learning Target benchmark
asks or requires of the examinee.
b. Ensure the test item asks or requires the examinee to know or do what is
specified in the benchmark. The test item and benchmark must match.
c. Examine the benchmark’s action verb, to classify the item.
(1) If the examinee is to just recall or know data or information, classify
under “Knowledge,” which typically is declarative knowledge.
(2) If the examinee is to recall or identify data or information in a different
manner than he or she first learned, classify under “Comprehension,”
which typically is declarative knowledge.
(3) If the examinee is to apply (i.e., use) the knowledge or skill learned,
then classify under “Application,” which typically requires procedural
and/or conditional knowledge.
(4) If the examinee is to analyze (i.e., take apart to thoroughly examine)
an idea, recommendation, poem, report, argument, or an interpretation
of a dataset or what someone else produced, classify under “Analysis.”
(5) If the examinee is to write a poem, article, term paper, or play; perform
an artistic routine; paint a painting; or prepare a photography
exhibit, classify under “Synthesis.” Of all of Bloom’s levels, creativity
is most associated with “Synthesis,” followed by “Analysis.” An
example is the construction of a term paper, where “Analysis” is
applied to the source material before the paper itself is synthesized.
d. Carefully select the best action verb which requires the examinee to know
the content, perform the skill or display the attitude intended. There are
many sources from which to select the best action verb, e.g., the Internet,
colleagues, or the synonym list found in most word processing programs.
e. If there is a question about how to classify an item, consult an experienced
colleague. Sometimes there is item classification disagreement, due to
differing perspectives; try to reach consensus. The test designer usually
makes the final classification decision.
Table 5.1
Test Blueprint for End-of-Term Test on Language Arts Strand A: Reading
Standard: The student uses the reading process effectively, Grades 3-5

Benchmark (The student…)                                    Knowledge   Comprehension   Application   Analysis   Synthesis   Total
Uses a table of contents, index, headings, captions,         5 (a,b)       6 (d,e)         1 (f)          -           -        12
illustrations, and major words to anticipate or predict
content and purpose of a reading selection. (LA.A.1.2.1)
Item Totals                                                     19            13             11           1           -        44

Note: The source of the example is Florida Curriculum Framework: Language Arts PreK-12 Sunshine State Standards and Instructional
Practice (Florida Department of Education, 1996, pp. 36-37). It is increasingly common to insert the actual test item numbers once the test has
been finalized as a quality control strategy. Item format codes: (a) multiple choice, (b) true/false, (c) matching, (d) fill-in-the-blank,
(e) completion, (f) brief response, (g) extended response.
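Because the table note describes inserting actual item numbers into the blueprint as a quality control step, a minimal sketch of that check follows; the item numbers are hypothetical placeholders and only one blueprint row is shown.

```python
# Sketch: represent one blueprint row as {skill level: [item numbers]} and
# verify that the cell counts add up to the benchmark's planned total.
# All item numbers here are hypothetical placeholders.

blueprint_row = {
    "benchmark": "LA.A.1.2.1",
    "planned_total": 12,
    "cells": {
        "Knowledge":     [1, 2, 3, 4, 5],      # 5 items (e.g., multiple choice, true/false)
        "Comprehension": [6, 7, 8, 9, 10, 11], # 6 items (e.g., fill-in-the-blank, completion)
        "Application":   [12],                 # 1 item (e.g., brief response)
    },
}

actual_total = sum(len(items) for items in blueprint_row["cells"].values())
if actual_total != blueprint_row["planned_total"]:
    print(f"{blueprint_row['benchmark']}: planned {blueprint_row['planned_total']}, "
          f"found {actual_total} items -- revise the blueprint or the test.")
else:
    print(f"{blueprint_row['benchmark']}: blueprint and test agree ({actual_total} items).")
```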
2. There are three types of test items; the key difference is the test item’s
purpose.
a. Mastery items measure essential minimums that all examinees should
know; measure memorization of facts or simple computations (e.g.,
Bloom, et al.’s knowledge or comprehension skill levels); and are
commonly used in licensing or certification tests.
b. Power items are designed to measure what typical examinees are expected
to know; these items may range in difficulty from very easy to very hard,
depending on the test purpose and the content, skills, or attitudes being
measured. Power items are commonly found in achievement tests.
c. Speed items are designed to assess higher level concepts and skills (e.g.,
Bloom, et al.’s application, analysis, synthesis, or evaluation) expected of
the most able examinees. Speed items are found on tests of mental abilities
(e.g., IQ tests). Don’t confuse these item types with speeded tests where
the strictly observed time limit is part of the measurement process (e.g.,
decision-making accuracy in an emergency situation by a paramedic,
firefighter, or police officer).
3. Test items must have two (2) very important characteristics: test items are
unidimensional and logically independent
a. Items are Unidimensional
(1) Each test Item must measure one attribute, e.g., knowledge, attitude, or
psychomotor skill, etc.
(2) If it were possible, in theory, to write every possible test item for an
attribute (e.g., knowledge, skill, or attitude), they would fully describe
every single aspect of that attribute. There would be too many test
items; so, we write test items to measure the most important
attribute(s) of the content, skill or attitude of interest.
(3) This unidimensional assumption is critical for two reasons: (a) it
permits a more accurate interpretation of a test item and (b) allows the
explanation that a single trait (e.g., knowledge, skill, or attitude)
accounts for an examinee responding correctly to the test item.
b. Items are logically independent
(1) An examinee’s response to any test item is independent of his or her
response to any other test item. Responses to all test items are
unrelated, i.e., statistically independent.
(2) Practical implications for writing test items are: (a) write items so that
one item doesn’t give clues to the correct answer to another item and
(b) if several items are related, e.g., to a graphic (picture, table or
graph), they should be written so as not to “betray” one another but to
test different aspects of the graphic. See the Interpretive Exercise in the
discussion on writing multiple choice items and Appendix 5.1.
Table 5.2
Language Arts Unit 3 Test Scoring Plan

Standard        Item Format           Item Points           Subtest Points
L.A.A.1.2.1     Multiple Choice       3 x 2 items = 6
(Subtest 1)     True/False            1 x 3 items = 3
                Fill-in-Blank         2 x 1 item  = 2
                Completion            3 x 3 items = 9
                Brief Response        5 x 1 item  = 5
                                                            Total = 25 points
L.A.A.1.2.2     Multiple Choice       3 x 4 items = 12
(Subtest 2)     Matching              2 x 4 items = 8
                Fill-in-Blank         2 x 4 items = 8
                Brief Response        5 x 2 items = 10
                Extended Response     8 x 1 item  = 8
                                                            Total = 46 points
L.A.A.1.2.3     Multiple Choice       3 x 6 items = 18
(Subtest 3)     Matching              2 x 3 items = 6
                Completion            3 x 3 items = 9
                                                            Total = 33 points
L.A.A.1.2.4     Brief Response        5 x 5 items = 25
(Subtest 4)                                                 Total = 25 points
Table 5.3
Language Arts Unit 3 Subtest and Total Test Scores

Subtest              Points Earned    Points Possible    Percent Mastery
A (L.A.A.1.2.1)           18                25                 72%
B (L.A.A.1.2.2)           39                46                 85%
C (L.A.A.1.2.3)           28                33                 85%
D (L.A.A.1.2.4)           20                25                 80%
Total Test Score         105               129                 81%
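A minimal sketch reproducing the percent-mastery arithmetic in Table 5.3 (rounding to whole percents is assumed):

```python
# Sketch: reproduce the subtest and total percent-mastery figures in Table 5.3.
subtests = {                      # subtest: (points earned, points possible)
    "A (L.A.A.1.2.1)": (18, 25),
    "B (L.A.A.1.2.2)": (39, 46),
    "C (L.A.A.1.2.3)": (28, 33),
    "D (L.A.A.1.2.4)": (20, 25),
}

total_earned = sum(earned for earned, _ in subtests.values())
total_possible = sum(possible for _, possible in subtests.values())

for name, (earned, possible) in subtests.items():
    print(f"{name}: {earned}/{possible} = {round(100 * earned / possible)}% mastery")
print(f"Total test score: {total_earned}/{total_possible} = "
      f"{round(100 * total_earned / total_possible)}%")
```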
(3) Points are allocated in the same manner as a traditional classroom test,
with the most critical information receiving the highest point
allocation and less important content receiving fewer points. See Appendices 5.2,
5.4 and 5.5, which contain task descriptions and scoring rubrics.
Table 5.4
Course Assignment & Test Point Weights
Assignment Points
Traditional Classroom Test Construction Task 224
Direct Performance Assessment Construction Task 160
Discussion Prompt Answers (8 x 12 points x 2) 192
Discussion Answer Comments (16 x 4 points x 1.5) 96
Mid-term and Final (2 @ 75 points each) 150
Total Points 822
c. Combining the information from Tables 5.4 and 5.5, a student earning 739
points is assumed to have “mastered” 90% (actually 89.9%) of the
content (specified by the learning outcomes or targets) for a performance
indicator of an “A-” or an “Excellent” performance (a computational sketch
follows Table 5.5).
Table 5.5
Grading Scale

Grade    Percentage    Meaning
A        95-100%       Exceptional
A-       90-94%        Excellent
B+       87-89%        Very Good
B        83-86%        Good
B-       80-82%        Fair
C        75-79%        Marginal
F        < 75%         Failure

Note. Saint Leo University (2010). Graduate academic
catalog 2010-2011. St. Leo, FL: Author.
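A minimal sketch combining Tables 5.4 and 5.5: earned course points are converted to a percentage and mapped to a letter grade. Rounding to the nearest whole percent is assumed so that 739/822 (89.9%) is treated as 90%, consistent with the example above.

```python
# Sketch: map earned course points to a percentage and letter grade using the
# Table 5.4 point total (822) and the Table 5.5 grading scale.
POINTS_POSSIBLE = 822

GRADING_SCALE = [      # (minimum whole percent, grade, meaning), highest first
    (95, "A",  "Exceptional"),
    (90, "A-", "Excellent"),
    (87, "B+", "Very Good"),
    (83, "B",  "Good"),
    (80, "B-", "Fair"),
    (75, "C",  "Marginal"),
    (0,  "F",  "Failure"),
]

def letter_grade(points_earned):
    # Round to the nearest whole percent, so 739/822 (89.9%) is treated as 90%.
    percent = round(100 * points_earned / POINTS_POSSIBLE)
    for cutoff, grade, meaning in GRADING_SCALE:
        if percent >= cutoff:
            return percent, grade, meaning

percent, grade, meaning = letter_grade(739)
print(f"739/{POINTS_POSSIBLE} points = {percent}% -> {grade} ({meaning})")
```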
b. Best Answer: Examinees select the “best” response option that “fits” the
situation presented in the item stem.
(1) Best Answer
___This culture developed an accurate calendar. They built steep temple pyramids
and used advanced agricultural techniques. They developed a system of mathematics
that included the concept of zero. They were located mostly in the Yucatán
Peninsula. They ruled large cities based in southern and southeastern Mexico, as well
as in the Central American highlands. This passage best describes the ancient —
a. Aztecs c. Mayas
b. Incas d. Olmecs
(2) Answer: “c”
(2) Answer: 1. a; 2. b; 3. c; 4. d
At a minimum all tests must have (a) _________validity and (b) ____ reliability.
e. Analogy
(1) A multiple choice item can be written in the form of an analogy.
(2) Hieroglyphics is to Egypt as cuneiform is to —
a. Phoenicia c. Persia
b. Sumer d. Crete
(3) Answer: b
3. A review of the test item design literature (Oosterhof, 1994, pp. 33-147;
Gronlund, 1998, pp. 60-74; Popham, 2000, pp. 242-250) and the authors’
experience reveals these item writing guidelines.
a. Learning outcomes should drive all item construction. There should be at
least two items per key learning outcome.
b. In the item stem, present a clear stimulus or problem, using language that
an intended examinee would understand. Each item should address only
one central issue.
c. Generally, write the item stem in positive language, but ensure to the
extent possible, that the bulk of the item wording is in the stem.
d. Underline or bold negative words whenever used in an item stem.
e. Ensure that the intended correct answer is correct and that distractors or
foils are plausible to examinees who have not mastered the content.
f. Ensure that the items are grammatically correct and that answer options
are grammatically parallel with the stem and each other.
g. With respect to correct answers, vary length and position in the answer
option array. Ensure that there is only one correct answer per item.
h. Don’t use “all of the above,” and avoid “none of the above” unless there is
no other option.
i. Vary item difficulty by writing stems of varying levels of complexity or
adjusting the attractiveness of distractors.
j. Ensure that each item stands on its own, unless part of a scenario where
several items are grouped. Don’t use this item format to measure opinions.
k. Ensure that the arrangements of response options are not illogical or
confusing and that they are not interdependent and overlapping.
4. Specific Applications
a. Best Answer Items: In this scenario, the examinee is required to select the
best response from the options presented. These items are intended to test
evaluation skills (i.e., make relative judgments given the scenario
presented in the stem).
b. To assess complex processes, use pictorials that the examinee must
explain.
c. Use analogies to measure relationships (analysis or synthesis).
d. Present a scenario and ask examinees to identify assumptions (analysis).
e. Present an evaluative situation and ask examinees to analyze applied
criteria (evaluation).
f. Construct a scenario where examinees select examples of principles or
concepts. These items may either measure analysis or synthesis,
depending on the scenario.
c. Advantages
(1) Interpretive skills are critical in life.
(2) IE can measure more complex learning than any single item.
(3) As a related series of items, IE can tap greater skills depth and breadth.
(4) The introductory material, display or scenario can provide necessary
background information.
(5) IE measures specific mental processes and can be scored objectively.
d. Limitations
(1) IE is labor and time intensive as well as difficult to construct.
(2) The introductory material which forms the basis for the exercise is
difficult to locate and when it is, reworking for clarity, precision, and
brevity is often required.
(3) Solid reading skills are required. Examinees that lack sufficient
reading skills may perform poorly due to limited reading skills.
(4) IE is a widely used proxy measure of higher order intellectual skills.
For example, IE can be used to assess the elements of problem solving
skills, but not the extent to which the discrete skills are integrated.
e. Construction Guidelines
(1) Begin with written, verbal, tabular, or graphic (e.g., charts, graphs,
maps, or pictures) introductory material which serves as the basis for
the exercise. When writing, selecting, or revising introductory
material, keep the material:
(a) Relevant to learning objective(s) and the intellectual skill being
assessed;
(b) Appropriate for the examinees’ development, knowledge and skill
level, and academic or professional experience;
(c) At a simple reading level, avoiding complex words or sentence
structures, etc;
(d) Brief, as brief introductory material minimizes the influence of
reading ability on testing;
(e) Complete, contains all the information needed to answer items; and
(f) Clear, concise, and focused on the IE’s purpose.
(2) Ensure that items, usually multiple choice:
(a) Require application of the relevant intellectual skill (e.g.,
application, analysis, synthesis, and evaluation);
(b) Don’t require answers readily available from the introductory
material or that can be answered correctly without the introductory
material;
(c) Are of sufficient number to be either proportional to or longer than
the introductory material; and
(d) Comply with item writing guidelines;
(e) Revise content, as is necessary, when developing items.
(3) When an examinee correctly answers an item, it should be because he or she
has mastered the intellectual skill needed to answer correctly, not because
he or she has simply memorized background information. For example:
Acceptable reliability standards for measures exist. For instruments where groups
are concerned, (a) 0.80 or higher is adequate. For decisions about individuals, (b) 0.85 is
the bare minimum; (c) 0.95 is the desired standard.
a. T F ______
b. T F ______
c. T F ______
Answer: a. T; b. F (0.90); c. T
b. Embedded Items
(1) In a paragraph are included underlined words or word groupings.
Examinees are asked to determine whether or not the underlined
content possesses a specified quality, e.g., being true, correct, etc.
(2) Embedded items are useful for assessing declarative or procedural
knowledge.
(3) Directions. Indicate whether the underlined word is correct within the context of
threats to a measure’s reliability. Circle “1” for correct or “2” for incorrect.
Answer: a. 1 and b. 2.
(3) Directions. Read each option and if you think the option is true, circle “T” or if you
think the statement is false, circle “F.”
Answers: 1. a; 2.a
e. Standard Format
(1) Traditionally, a true/false item has been written as a simple declarative
sentence which was either correct or incorrect.
(2) Directions. Read the statement carefully. If you think the statement is true, circle
“T” or if you think the statement is false, circle “F.”
Answer: T
c. There are usually more responses (right hand column) than premises. This
reduces the biasing impact of guessing on the test score.
d. Example
Directions. Match the terms presented in the right column to their definitions which are
presented in the left column. Write the letter which represents your choice of definition
in the blank provided.
Definition Term
_____ 1. Measures essential minimums that all a. Mastery Items
examinees should know
_____ 2. Designed to assess higher level concepts & skills b. Power Items
e. Dependent Items
Answers: 1.a; 2.b; 3.b; 4.d
3. Item writing guidelines (Gronlund, 1998, pp. 86-87; Popham, 2000, pp. 240-
241) are:
(a) Employ only homogeneous content for a set of premises and responses.
(b) Keep each list reasonably short, but ensure that each is complete, given the
table of specifications. Seven premises and 10-12 options should be the
maximum for each set of matching items.
(c) To reduce the impact of guessing, list more responses than premises and
let a response be used more than once. This helps prevent examinees from
gaining points through the process of elimination.
(d) Put responses in alphabetical or numerical order.
(e) Don’t break matching items across pages.
(f) Give full directions which include the logical basis for matching and
indicate how often a response may be used.
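A minimal sketch, using content adapted from the example above, of how a matching set might be stored and mechanically checked against guidelines (c) and (d); the checks shown are assumptions about how one might automate them, not part of the source.

```python
# Sketch: store a matching set and check two of the mechanical guidelines:
# (c) more responses than premises, and (d) responses in alphabetical order.
# The premises and responses are illustrative.

matching_set = {
    "premises": [
        "Measures essential minimums that all examinees should know",
        "Designed to assess higher level concepts and skills",
    ],
    "responses": ["Mastery items", "Power items", "Speed items"],  # one extra response
    "reuse_allowed": True,   # stated in the directions to blunt elimination strategies
}

assert len(matching_set["responses"]) > len(matching_set["premises"]), \
    "List more responses than premises to reduce the impact of guessing."
assert matching_set["responses"] == sorted(matching_set["responses"]), \
    "Put responses in alphabetical order."
print("Matching set passes both mechanical checks.")
```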
1. The two most essential attributes of a measure are: (a) ________ and (b)______.
2. Strengths and limitations (Gronlund, 1998, pp. 96-97; Kubiszyn & Borich,
1996, pp. 99-100; Oosterhof, 1994, pp. 98-100; Popham, 2000, pp. 264-266) of
completion and short answer items are:
a. Strengths
(1) These item formats are efficient in that due to ease of writing and
answering, more items can be constructed and used. Hence, much
content can be assessed, improving content validity.
(2) The effect of guessing is reduced as the answer must be supplied.
(3) These item formats are ideal for assessing mastery of content where
computations are required as well as other knowledge outcomes.
b. Limitations
(1) Phrasing statements or questions which are sufficiently focused to elicit
a single correct response is difficult. There is often more than one
“correct” answer, depending on the degree of item specificity.
(2) Scoring may be influenced by an examinee’s writing and spelling
ability. Scoring can be time consuming and repetitious, thereby
introducing scoring errors.
(3) This item format is best used with lower level cognitive skills, e.g.,
knowledge or comprehension.
(4) The level of technological support for scoring completion and short
answer items is very limited as compared to selected response items
(i.e., matching, multiple choice, and true/false).
3. Item writing guidelines (Gronlund, 1998, pp. 87-100; Oosterhof, 1994, pp.
101-104; & Popham, 2000, pp. 264-266) are:
a. Focus the statement or question so that there is only one concise answer of
one or two words or phrases. Use precise statement or question wording
to avoid ambiguous items.
b. A complete question is recommended over an incomplete statement. You
should use one, unbroken blank of sufficient length for the answer.
(2) Using an original example, explain the process for conducting an item analysis
study, including (a) a brief scenario description, (b) at least three sample items in
different formats and (c) the computation, interpretation, and application of relevant
statistical indices. (Extended Response)
3. Strengths and limitations of essay items (Gronlund, 1998, p. 103; Kubiszyn &
Borich, 1996, pp. 109-110; Oosterhof, 1994, pp. 110-112; Popham, 2000, pp.
266-268) are:
a. Strengths
(1) These formats are effective devices for assessing lower and higher
order cognitive skills. The intellectual skill(s) must be identified, and the
item must require the construction of a product that is an expected outcome
of exercising the specified intellectual skill or skills.
(2) Item writing time is reduced as compared to select response items, but
care must be taken to construct highly focused items which assess the
examinee’s mastery of relevant performance standards.
(3) In addition to assessing achievement, essay formats also assess an
examinee’s writing, grammatical, and vocabulary skills.
(4) It is almost impossible for an examinee to guess a correct response.
b. Limitations
(1) Since answering essay format items consumes a significant amount of
examinee time, other relevant performance objectives or standards
may not be assessed.
(2) Scores may be influenced by writing skills, bluffing, poor penmanship,
inadequate grammatical skills, misspelling, etc.
(3) Scoring takes a great amount of time and tends to be unreliable given
the subjective nature of scoring.
g. Allow examinees sufficient time to answer each essay item, if used on a test, and
identify each item’s point value. It’s best to use restricted response essays for
examinations.
h. Verify an item’s quality by writing a model answer. This should identify
any ambiguity.
2. Item scoring guidelines (Gronlund, 1998, pp. 107-110; Kubiszyn & Borich,
1996, pp.110-115; Oosterhof, 1994, pp. 113-118; Popham, 2000, pp. 270-271)
are:
a. Regardless of essay type, evaluate an examinee’s response in relation to
the performance objective or standard.
(1) Ensure that the item’s reading level is appropriate to examinees.
(2) Grade all examinees’ responses to the same item before moving on to
the next. If possible, evaluate essay responses anonymously, i.e., try
not to know whose essay response you are evaluating.
(3) For high stakes examinations (i.e., degree sanction tests), have at least
two trained readers evaluate the essay response.
b. For restricted/brief response essays, ensure that the model answer is
reasonable to knowledgeable students and content experts, and use a point
method based on a model response. Restricted essays are not
recommended for assessing higher order intellectual skills.
c. For an extended response essay, use a holistic or analytical scoring rubric
with previously defined criteria and point weights. When evaluating
higher order cognitive skills, generally evaluate extended essay responses
in terms of
(1) Content: Is the content of the response logically related to the
performance objective or standard being assessed? Knowledge,
comprehension, or application skills can be assessed depending on
how the essay item was constructed.
(2) Organization: Does the response have an introduction, body and
conclusion?
(3) Process: Is the recommendation, solution, or decision in the essay
response adequately supported by reasoned argument and/or
supporting evidence? This is the portion of the response where
analytical, synthesis, or evaluation skills are demonstrated. Again, be
very specific as to which skill is being assessed and that the essay
item, itself, requires the examinee to construct a response consistent
with the intellectual skill being assessed.
d. When evaluating essays where a solution or conclusion is required,
consider the following in constructing your scoring plan:
(1) Accuracy or reasonableness: Will the posited solution or conclusion
work?
(2) Completeness or internal consistency: To what extent does the solution
or conclusion relate to the problem or stimulus?
IV. Statistical Strategies for Improving Select Response Test Items (Item Analysis)
A. Item Analysis: Purpose
1. The purpose of analyzing item behavior is to select items which are best
suited for the purpose of the test.
2. The purpose of testing is to differentiate between those who know the content
and those who don’t. This can be done by:
a. Identifying those items answered correctly by knowledgeable examinees.
b. Identifying those items answered incorrectly by less knowledgeable
examinees.
3. There are two indices: the item difficulty index (p-value) and the Index of Discrimination (D).
b. Item format affects p-values, particularly where the item format allows the
examinee to guess a response. For example:
(1) 37 x 3 = _____ (This format presents little opportunity to correctly guess--no choices.)
(2) 37 x 3 = ______ (This format presents more opportunity to correctly guess.)
(a) 11.1 (c) 1111
(b) 111 (d) 11.11
c. Target p-values
From the test designer’s perspective, to maximize total test variance
(needed for high reliability coefficients), each item p-value should be at or
close to 0.50. Some recommend a range between 0.40 and 0.60 (Crocker
& Algina, 1986, pp. 311-313).
e. p-values are rarely the primary criterion for selecting items into a test. For
tests designed to differentiate between those who know the content and
those who don’t, items should be of consistent, moderate difficulty.
b. Formula: D = Pu - Pl
(1) Term Meanings
(a) D = discrimination index
(b) Pu = proportion of the upper scoring group answering the item correctly
(c) Pl = proportion of the lower scoring group answering the item correctly
(2) To determine Pu and/or Pl, select the upper 25-27% of examinees and
the lower 25-27% of examinees. Determine the proportion in each
group answering the item correctly.
(3) For example: Item 1, Pu = 0.80 & Pl = 0.30, so 0.80 - 0.30 = 0.50.
Thus, D = 0.50. Eighty percent answered Item 1 correctly in the high
scoring group; whereas, 30% did so in the lower scoring group. See
Interpretive Guidelines below.
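A minimal sketch of the p-value and D computations just described, using hypothetical scored responses (1 = correct, 0 = incorrect) and the 25-27% upper/lower grouping noted above:

```python
# Sketch: compute an item's difficulty (p-value) and discrimination (D = Pu - Pl)
# from scored item responses and total test scores. All data are hypothetical.

def item_difficulty(item_scores):
    """p-value: proportion of all examinees answering the item correctly."""
    return sum(item_scores) / len(item_scores)

def discrimination_index(item_scores, total_scores, group_fraction=0.27):
    """D = Pu - Pl, using the top and bottom ~27% of examinees by total score."""
    n = len(total_scores)
    k = max(1, int(round(group_fraction * n)))
    order = sorted(range(n), key=lambda i: total_scores[i])  # low to high total score
    lower, upper = order[:k], order[-k:]
    p_lower = sum(item_scores[i] for i in lower) / k
    p_upper = sum(item_scores[i] for i in upper) / k
    return p_upper - p_lower

# Ten hypothetical examinees: scores on one item (1 = correct) and on the whole test.
item_scores  = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
total_scores = [38, 35, 33, 30, 28, 25, 22, 20, 18, 15]

print(f"p = {item_difficulty(item_scores):.2f}")
print(f"D = {discrimination_index(item_scores, total_scores):.2f}")
```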
c. Properties of “D”
(1) Ranges from -1.0 to 1.0
(2) Positive values indicate the item favors the upper scoring group.
(3) Negative values indicate the item favors the lower scoring group.
(4) The closer “D” is to 0.00, the less the item discriminates; values
approaching 1.0 (or -1.0) indicate strong discrimination.
(3) The most likely incorrect response should have the second highest
endorsement level. If all incorrect foils are equally likely, then lower
scoring examinee selection should be fairly consistent across the
incorrect foils.
(4) Revise foils which both high- and low-scoring examinees select at similar rates.
b. Item 13     A      B      C*     D      E      Omit
   High       .00    .00   1.00    .00    .00    .00
   Lower      .00    .00   1.00    .00    .00    .00
   All        .00    .00   1.00    .00    .00    .00
c. Item 29     A      B*     C      D      Omit
   High       .36    .36    .00    .28    .00
   Lower      .41    .24    .10    .25    .00
   All        .24    .60    .10    .06    .00
(1) Relative to the target p-value (0.74) for a four-option multiple choice
item, the item shows promise but requires revision. However, the 0.11 “D”
value indicates that the item has very little discriminating power.
(2) Distractors “A”, “B”, and “D” attracted similar endorsements from
both the higher and lower scoring groups. This suggests item
ambiguity.
(3) A revision strategy would be to make foil “C” more attractive and
revise foils “A”, “B”, and “D” so that they are more attractive to lower
scoring or higher scoring examinees, but not both.
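A minimal sketch of the distractor (foil) analysis tabled above: for each option, the proportion of high-scoring, low-scoring, and all examinees endorsing it is computed. The response data are hypothetical.

```python
# Sketch: tabulate option endorsement rates for one multiple choice item,
# overall and within the high- and low-scoring groups (hypothetical data).
from collections import Counter

options = ["A", "B", "C", "D"]
# Each tuple: (selected option, total test score) for one examinee.
responses = [("B", 38), ("B", 35), ("A", 33), ("B", 30), ("D", 28),
             ("A", 25), ("C", 22), ("B", 20), ("A", 18), ("D", 15)]

def endorsement(rows):
    counts = Counter(option for option, _ in rows)
    return {opt: counts.get(opt, 0) / len(rows) for opt in options}

ranked = sorted(responses, key=lambda r: r[1])   # low to high total score
k = max(1, int(round(0.27 * len(ranked))))       # ~27% in each tail
table = {"High": endorsement(ranked[-k:]),
         "Low":  endorsement(ranked[:k]),
         "All":  endorsement(ranked)}

for group, rates in table.items():
    cells = "  ".join(f"{opt}: {rates[opt]:.2f}" for opt in options)
    print(f"{group:<4} {cells}")
```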
b. Higher order intellectual skills, built on lower order intellectual skills, can
be directly assessed or tested through performance assessment. Well
designed and implemented tasks can be “correctly” solved in a variety of
ways, thus fostering examinee critical thinking and enabling examiners to
make evaluative inferences.
c. Very clear connections to teaching or training quality can be made. Both
the process of constructing the response (i.e., product) and the response
itself can be assessed and evaluated; thereby improving teaching and
learning, and assessment.
d. Examinees’ commitment is increased given their sense of control over the
learning, assessment and evaluation processes. Motivation tends to be high
when examinees are required to produce an original work or product.
2. DPA Limitations
a. Performance assessments are very labor and time intensive to design,
prepare, organize, and evaluate. Records must also be maintained.
b. Performance has to be scored immediately if a process is being assessed.
c. Scoring is susceptible to rater error. Raters or examiners must be highly
trained on the task description, performance criteria, and rating form to
assess performance similarly and consistently. A plan for “breaking ties”
is needed; two examiners may rate a performance quite differently. A third,
more senior or experienced rater can serve as the “tie breaker” (see the
sketch following this list).
d. Complex intellectual or psychomotor skills are composed of several
different but complementary enabling skills (e.g., reading level, drawing
talent, physical coordination, stress tolerance, etc.) which might not be
recognized or assessed. Examinees will likely perform some enabling
skills more proficiently than others. Critical enabling skills should be
identified and specifically observed and rated.
e. Other potential limitations include time, cost, and the availability of
equipment and judges. Due to these issues, performance assessment should
be reserved for assessing highly relevant skills which are teachable.
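A minimal sketch of one way to operationalize the tie-breaking plan described in limitation c above; the tolerance value and resolution rule are illustrative assumptions, not part of the source.

```python
# Sketch: resolve two raters' scores on a performance.
# If the raters agree within `tolerance` points, average their scores;
# otherwise a third (senior) rater breaks the tie. The tolerance and the
# resolution rule are illustrative assumptions.

def resolve_score(rater1, rater2, tie_breaker=None, tolerance=1):
    if abs(rater1 - rater2) <= tolerance:
        return (rater1 + rater2) / 2
    if tie_breaker is None:
        raise ValueError("Scores differ by more than the tolerance; a third rater is needed.")
    # Side with the original score closest to the senior rater's judgment.
    closer = min((rater1, rater2), key=lambda s: abs(s - tie_breaker))
    return (closer + tie_breaker) / 2

print(resolve_score(3, 4))                  # close agreement -> 3.5
print(resolve_score(2, 4, tie_breaker=4))   # discrepancy resolved toward the senior rater -> 4.0
```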
2. Rating Scales
a. Rating scales are efficient in assessing dispositions (attitudes), work
products, and social skills. Rating scales are more difficult to construct
than checklists, but tend to be both reliable and defensible, provided they
are content valid. Rating scales are efficient to score and can provide
quality feedback to examinees.
b. It is essential that each response option be fully defined and that the
definition be logically related to the purpose of the rating scale and be
progressive (i.e., represent plausible examinee performance levels). Like
checklists, rating scales tend to be unidimensional in that each assesses a
single trait or attribute.
(a) For rating scales, place the least desired characteristic or lowest
rating (e.g., “Strongly Disagree”) on the left end of the continuum
and the most desired quality or highest rating on the right end of
the continuum (e.g., “Strongly Agree”).
(b) Arrange similarly for checklists (e.g., “Poor Quality” “Acceptable
Quality,” or “High Quality”).
(8) Provide a space for rater comments beside each item or at the end of
the checklist or rating scale.
(9) Always train examiners or raters so they are proficient in using the
checklist or rating scale.
(a) After each practice session, debrief the raters so that all can
understand each other’s logic in rating an examinee’s performance,
and where necessary arrive at a rating consensus.
(b) Stress to raters the importance of using the full range of rating
options on rating scales; many raters use only the upper tier of the
rating continuum, which is fine provided all examinees actually
perform at those levels.
2. Definition of Terms
a. A rubric is an instrument for rating an examinee response to a stimulus or
task description and is scored either holistically or analytically.
b. Evaluative criteria are performance standards which are rated, typically
using points. See the 33 standards identified in Appendix 5.2. Each
performance standard should be logically related to the task but functionally
independent of the others.
(2) Analytic rubrics are richer in detail and description, thereby offering
greater diagnostic value. This provides examinees with a detailed
assessment of strengths and weaknesses.
(3) Analytic rubrics should be based on a thorough analysis of the process,
skill, or task to be completed. Performance definitions must clearly
differentiate between performance levels. Detailed checklists and
rating scales are often used in analytic ratings.
(4) Analytic rubrics are presented in Appendices 5.2, 5.4, and 5.5.
References
Airasian, P. W. (1997). Classroom assessment (3rd ed.). New York, NY: McGraw-Hill.
Bloom, B. S., Engelhart, M. D., Furst, E. J., & Krathwohl, D. (1956). Taxonomy of
educational objectives: Handbook I. Cognitive domain. New York, NY: Longman.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory.
New York, NY: Holt, Rinehart, & Winston.
Gagne, R. M. (1985). The conditions of learning and theory of instruction (4th ed.).
Chicago, IL: Holt, Rinehart, & Winston, Inc.
Haladyna, T. M. (1997). Writing test items to evaluate higher order thinking. Needham
Heights, MA: Allyn & Bacon.
Kubiszyn, T., & Borich, G. (1996). Educational testing and measurement. New York,
NY: Harper Collins College Publishers.
Lyman, H. B. (1998). Test scores and what they mean (6th ed.). Needham Heights, MA:
Allyn & Bacon.
Quellmalz, E. S. (1987). Developing reasoning skills. In J. Baron & R. Sternberg
(Eds.), Teaching thinking skills: Theory and practice (pp. 86-105). New York, NY:
Freeman.
Stiggins, R. J., Griswold, M. M., & Wikelund, K. R. (1989). Measuring thinking skills
through classroom assessment. Journal of Educational Measurement, 26(3), 233-
246.
Appendices
Appendix 5.2 is a task specific, analytical rubric which includes a detailed description of
the task to be accomplished and the performance scoring rubric.
Appendix 5.3 is a simple rating scale that was used by team members to rate each
other's contributions to the construction of a class project.
Appendix 5.4 is a skill focused scoring rubric used to rate student critiques of an article
from a juried journal. It has a very brief task description but most of the performance
criteria are presented in the rubric. The score for each criterion is summed to reach a final
performance rating on the assignment.
Appendix 5.6 describes ethical test preparation strategies for examinees as well as
guidance on test taking skills.
1. What is the correct punctuation edit for the sentence in lines 2 and 3 (My . . . table.)?
A. My parents were still asleep; however, my mother had my lunch packed and
waiting on the kitchen table.
B. My parents were still asleep however; my mother had my lunch packed and waiting the
kitchen table.
C. My parents were still asleep, however, my mother had my lunch; packed and waiting on
the kitchen table.
D. My parents were still asleep, however, my mother had my lunch packed and waiting; on
the kitchen table.
ID: 27480, Mt. Washington. Answer: A
2. Which is the correct revision of the sentence in lines 4 and 5 (Walking . . . higher.)?
A. The first pale hint of red was entering the sky as the sun rose higher and higher,
walking to my car.
B. As the sun rose higher and higher, walking to my car, the first pale hint of red was
entering the sky.
C. Walking to my car, I noticed the first pale hint of red entering the sky as the sun rose
higher and higher.
D. Walking to my car, the sun rose higher and higher as the first pale hint of red was
entering the sky.
ID: 88949, Mt. Washington. Answer: C
4. Which sentence does not add to the development of the essay’s main idea?
A. The cars in the parking lot were mostly from other states. (line 7)
B. I had definitely chosen the right day for my first climb of Mt. Washington. (lines 8–9)
C. After an hour and a half had passed, I gradually picked up the pace. (lines 13–14)
D. Beads of sweat ran down my face as I pushed toward the top. (line 26)
ID: 27487, Mt. Washington. Answer: A
Source: New Hampshire Educational Assessment and Improvement Program, New Hampshire Department
of Education. End-of-Grade 10 COMMON ITEMS, Released Items 2000-2001.
Retrieved June 1, 2002 from: http://www.ed.state.nh.us/Assessment/2000-2001/2001Common10.pdf
Student work teams will construct a direct performance task which meets the standards in the attached
Performance Task Quality Assessment Index (PTQAI).
Each work team will develop a unit or mid-term performance task with a scoring rubric suitable for the 6th-
12th grade classroom. For this task, your work team will rely on the text, Internet research, team member
experience, external consultants (e.g., senior teachers, etc.), and research and/or professional journals. The
final work product is to be submitted electronically using Microsoft Word 2007 or a later edition.
The professor will give one “free read,” using the Performance Task Quality Assessment Index (PTQAI)
before “grading.” The work product will be assessed on four dimensions or traits: quality of the task
description, clarity and relevance of the performance criteria, scoring rubric functionality, and task
authenticity, i.e., how realistic the intended task is for examinees. The rubric is analytical during the
formative phase of the task and holistic at “grading,” as a total score is awarded. Maximum points are 132.
See the PTQAI.
The task will be divided into two parts: task description and rubric. Students will submit a clear set of
directions (PTQAI items 1-3), an appropriate task description (PTQAI items 4-10), relevant performance
criteria (PTQAI items 11-15), and a suitable scoring rubric (PTQAI items 16-23); the task must be authentic
(PTQAI items 24-33). The performance criteria may be embedded within the scoring rubric.
The team must describe the classroom performance assessment context, using the attached Direct
Performance Assessment Task Scenario.
Before constructing the direct performance task, complete the scenario description to establish the context
for the task. Provide the following information about examinees and their school setting by either filling
in the blank with the requested information or checking a blank as requested.
Ensure that intended learning target content or skills are fully described; do not refer the professor to a web
site or other document containing this information. Otherwise, the task will be returned until the learning
target information is supplied as required.
Examinee Characteristics
9) School Ownership: Public: _____ Private: ______ Charter: _____ (Check one)
10) School Type: Elementary: ____ Middle: ____ High: ___ (Check one)
Learning Target(s): Clearly state the intended learning target content and/or skills which are to be
assessed in your extended performance task.
Classroom Assessment Plan: Describe in a few paragraphs your general classroom assessment plan. Next,
describe how your team’s extended performance task would fit into that assessment plan and how you
would use the information to improve classroom teaching and learning.
The PTQAI is the scoring rubric used to assess the product to be produced in response to the task
description. The PTQAI is composed of 33 evaluative criteria (called standards) distributed across these
four subtests or dimensions: Task Description, Performance Criteria, Scoring System, and Authenticity.
Possible scores are “Meets Standard,” 4 points; “Mostly Meets Standard,” 3 points; “Marginally Meets
Standard,” 2 points; “Does Not Meet Standard,” 1 point; and “Missing or Not in Evidence,” 0 points.
Performance level quality definitions are:
“Meets Standard” means the response fully and completely satisfies the standard.
“Mostly Meets Standard” means the response nearly fully and completely satisfies the standard.
“Marginally Meets Standard” means the response partly satisfies the standard.
“Does Not Meet Standard” means the response does not satisfy the standard, although some response is present.
“Missing or Not in Evidence” means the response presents no evidence of meeting the standard.
In order to compute a total score for summative evaluation purposes, the following criteria are used:
Basic corresponds to a C to B- (75-82%). Performance is acceptable but improvements are needed to meet
expectations well.
Novice corresponds to an F (< 75%). Performance is weak; the skills or standards are not sufficiently
demonstrated at this time.
To compute a summative performance level, total earned points are divided by points possible (132).
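A minimal sketch of this summative computation: the 33 standard ratings (each 0-4; 132 points possible) are summed, converted to a percentage, and mapped to a performance level. The Exceptional and Proficient cutoffs are taken from the performance-level bands used elsewhere in these appendices; the < 75% floor for Novice is an assumption consistent with the Basic band.

```python
# Sketch: compute a PTQAI summative performance level from the 33 standard
# ratings (each scored 0-4; 132 points possible).
POINTS_POSSIBLE = 33 * 4   # = 132

LEVELS = [                 # (minimum percent, level), highest first
    (95, "Exceptional"),
    (83, "Proficient"),
    (75, "Basic"),
    (0,  "Novice"),
]

def summative_level(ratings):
    assert len(ratings) == 33 and all(0 <= r <= 4 for r in ratings)
    percent = 100 * sum(ratings) / POINTS_POSSIBLE
    level = next(name for cutoff, name in LEVELS if percent >= cutoff)
    return percent, level

ratings = [4] * 20 + [3] * 10 + [2] * 3    # hypothetical ratings for the 33 standards
percent, level = summative_level(ratings)
print(f"{sum(ratings)}/{POINTS_POSSIBLE} = {percent:.1f}% -> {level}")
```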
Read each statement carefully. Next, circle “2” for “Yes” or “1” for “No” to indicate whether the group
member demonstrated the indicator (behavior); circle “0” if the indicator was not observed.
4. The quality of the group member’s data (e.g., research articles, URLs, books,
   etc.) contribution was high, given the task.                                      1   2   0
9. The group member exhibited appropriate listening skills which assisted the
   team in accomplishing its task.                                                   1   2   0
10. The team member was sufficiently flexible so as to enable the work group to
    complete the task at hand.                                                       1   2   0
11. The team member demonstrated writing skills which helped the work group
    meet its objective.                                                              1   2   0
Interpretation: The higher the number of points, the higher the perceived contribution level.
Students will work in groups to review and critique a short article from a peer-reviewed electronic journal which
reports on a single quantitative, qualitative, or mixed-method study. Do not select a literature review or a meta-analysis
article.
The group will critique the work based on evidence of reliability, validity, design suitability, and practical usefulness of
the information. When asked to discuss the suitability and sufficiency of any supporting data, briefly summarize the
data and then critique it using prevailing best practices, professional standards, your text, and other research. Cite
references in an APA style reference list. The task is a three page maximum, excluding title and reference
pages. Left-margin headings: Intent, Type, Reliability/Authority, Validity/Verisimilitude, and Conclusion.
The article’s URL must be provided.
Rating:
Exceptional corresponds to an A (95-100%). Performance is outstanding; significantly above the usual
expectations.
Proficient corresponds to a grade of B to A- (83-94%). Skills and standards are at the level of expectation.
Basic corresponds to a C to B- (75-82%). Skills and standards are acceptable but improvements are needed to
meet expectations well.
Novice corresponds to an F (< 75%). Performance is weak; the skills or standards are not sufficiently
demonstrated at this time.
0 This criterion is missing or not in evidence.
Ratings (point ranges by performance level; 0 = criterion missing or not in evidence)

Intent (Novice 1.0-7.4; Basic 7.5-8.2; Proficient 8.25-9.4; Exceptional 9.5-10). The intent of the research is
summarized succinctly and thoroughly in a style appropriate to the research design. State the purpose for
which the research was conducted.

Type (Novice 1.0-7.4; Basic 7.5-8.2; Proficient 8.25-9.4; Exceptional 9.5-10). State whether the study was
primarily or exclusively quantitative, qualitative, or mixed method. Briefly describe the study's
methodology to support your designation.

Reliability/Authority (Novice 1.0-18.4; Basic 18.5-19.9; Proficient 20-23.4; Exceptional 23.5-25). For an
article which reports on a quantitative study, describe the data collection device's internal consistency,
stability, equivalence, or inter-rater reliability; give coefficients if available. For an article reporting on a
qualitative study, describe the author's qualifications and experience, the sponsoring organization, the
number of times the article has been cited in other research, other researchers' opinions of the article, how
consistent the reported findings are with other studies, etc. Discuss the suitability and sufficiency of any
supporting data.

Validity/Verisimilitude (Novice 1.0-18.4; Basic 18.5-19.9; Proficient 20-23.4; Exceptional 23.5-25). For an
article which reports on a quantitative study, describe the data collection device's content, criterion-related,
or construct validity; also comment on the control or prevention of threats to internal design validity. For
an article reporting on a qualitative study, show the article's verisimilitude (i.e., appearance of truth). Do
this by commenting on the logical analysis used by the authors; describe how consistent their findings and
recommendations are with those of other researchers; comment on the consistency of the study's design and
execution with other research on the topic, etc. When assessing logical analysis, consider logic leaps, the
internal consistency of arguments, deductive and inductive reasoning, etc. Discuss the suitability and
sufficiency of any supporting data.

Conclusion (Novice 1.0-11.4; Basic 11.5-12.4; Proficient 12.5-14.4; Exceptional 14.5-15). Provide an overall
assessment of the research reported in the article. Did the study meet prevailing best practices for its
research design and data collection strategies? Why or why not? Describe the practical application of
findings to professional practice.

Writing and Grammar (Novice 1-11.1; Basic 11.2-12.3; Proficient 12.4-14.1; Exceptional 14.2-15). Writing
and grammar skills are appropriate to the graduate level (including APA citations and references).
Total Earned Points: ____________
The presentation scoring rubric is presented below and is composed of three subtests or sections: Organization,
Presence, and Technology. Odd number ratings reflect the “mid-point” between two even numbered scores.
2. Successful examinees tend to understand the purpose for testing, know how
the results will be used, comprehend the test’s importance and its relevance to
teaching and learning, expect to score well, and have confidence in their
abilities.
3. Plan study time. Study material with which you have the most difficulty.
Move on to more familiar or easier material after mastering the more difficult
content. However, balance time allocations so that there is sufficient time to
review all critical or key material before the test.
4. Study material in a sequenced fashion over time. Allocate three or four hours
each day up to the night before the test. Spaced review is more effective
because you repeatedly review content which increases retention and builds
content connections. “Don’t cram.” Develop a study schedule and stick to it.
Rereading difficult content will help a student learn. Tutoring can be a
valuable study asset.
5. Frame questions which are likely to be on the test in the item format likely to
be found. If an instructor has almost always used multiple choice items on
prior tests or quizzes, it is likely that he or she will continue to use that item
format. Anticipate questions an examiner might ask. For example:
a. For introductory courses, there will likely be many terms to memorize.
These terms form the disciplinary vocabulary. In addition to being able to
define such terms, the examinee might need to apply or identify the terms.
b. It is likely the examinee will need to apply, compare, and contrast terms,
concepts, or ideas; apply formulae; solve problems; construct procedures;
evaluate a position or opinion; etc. Most likely, the examinee will need to
show analytical and synthesis skills.
d. Once started, meet regularly and focus on one subject or project at a time,
have an agenda for each study session, ensure that logistics remain
“worked out” and fair, follow an agreed upon meeting format, include a
time-limited open discussion of the topic, and brainstorm possible test
questions.
7. Other Recommendations
a. For machine-graded multiple-choice tests, ensure the selected answer
corresponds to the question the examinee intends to answer.
b. If using an answer sheet and test booklet, check the test booklet against the
answer sheet whenever starting a new page, column, or section.
c. Read test directions very carefully.
d. If efficient use of time is critical, quickly review the test and organize a
mental test completion schedule. Check to ensure that when one-quarter
of the available time is used, the examinee is one-quarter through the test.
e. Don't waste time reflecting on difficult-to-answer questions. Guess if there
is no correction for guessing; but it is better to mark the item and return to
it later if time allows, as other test items might cue you to the correct
answer.
f. Don't read more complexity into test items than is presented. Simple test
items almost always require simple answers.
g. Ask the examiner to clarify an item if needed, unless explicitly forbidden.
h. If the test is completed and time remains, review answers, especially those
which were guessed at or not known.
i. Changing answers may produce higher test scores, if the examinee is
reasonably sure that the revised answer is correct.
j. Write down formula equations, critical facts, etc. in the margin of the test
before answering items.