Language Testing: Summary of Chapters 1–6

Chapter One: Preliminaries of Language Testing

1. WHY TESTING
In general, testing has the following purposes:
 Students, teachers, administrators, and parents want to ascertain the degree to which the goals of instruction have been realized.
 Government and private-sector employers of the students are interested in having precise information about students' abilities.
 Most importantly, through testing, accurate information is obtained on the basis of which educational decisions are made.
Tests can benefit students in the following ways:
 Testing can create a positive attitude toward class and motivate students to learn the subject matter.
 Testing can help students prepare themselves and thus learn the materials.
Testing can also benefit teachers:
 Testing helps teachers diagnose the effectiveness of their teaching.
 Testing can also help teachers gain insight into ways to improve the evaluation process.

2. TEST, MEASUREMENT, EVALUATION


Measurement is the process of quantifying the characteristics of persons according to explicit procedures
and rules.

A test is an instrument, often connoting the presentation of a set of questions to be answered, used to obtain a measure of a characteristic of a person.
 Note: What distinguishes a test from other types of measurement is that it is designed to obtain a
specific sample of behavior.

Evaluation has been defined in a variety of ways:


1) The process of delineating, obtaining, and providing useful information for judging decision
alternatives.
2) The determination of the congruence between performance and objectives.
3) A process that allows one to make a judgment about the desirability or value of a measure.

 Note: It is important to point out that we never measure or evaluate people. We measure or evaluate
characteristics or properties of people.

3. NORM-REFERENCED TESTS vs. CRITERION-REFERENCED TESTS


If we compare the score of a testee to the scores of other testees, this would be norm referencing. However,
if we interpret a testee’s performance by comparing it to some specific criterion without concern for how
other testees performed, this would be criterion referencing. Usually, a testee passes the test only when he
has given the right answer to all or a specific number of test items.
The two test types can be contrasted along the following characteristics:

Type of Interpretation
  NRT: Relative (a student's performance is compared to those of all other students in percentile terms)
  CRT: Absolute (a student's performance is compared only to the amount, or percentage, of material learned)
Type of Measurement
  NRT: To measure general language abilities or proficiencies
  CRT: To measure specific objectives-based language points
Distribution of Scores
  NRT: Normal distribution of scores around the mean
  CRT: Varies; often non-normal (students who know the material should score 100%)
Purpose of Testing
  NRT: To spread students out along a continuum of general abilities or proficiencies
  CRT: To assess the amount of material known or learned by each student
Test Structure
  NRT: A few relatively long subtests with a variety of item contents
  CRT: A series of short, well-defined subtests with similar item contents
Knowledge of Questions
  NRT: Students have little or no idea of what content to expect in test items
  CRT: Students know exactly what content to expect in test items
Missed Items
  NRT: When a great number of testees miss an item, it is eliminated from the test
  CRT: When a test item is missed by a great number of testees, the instructional materials are revised or additional work is given

4. TEACHER-MADE TESTS VS. STANDARDIZED TESTS


A teacher-made test is a small-scale classroom test which is generally prepared, administered, and scored by one teacher. On the other hand, standardized tests are commercially prepared by skilled test-makers and measurement experts. They provide methods of obtaining samples of behavior under uniform procedures.

The two can be contrasted along the following characteristics:

Type of Interpretation
  Teacher-made: Criterion-referencing
  Standardized: Norm-referencing
Directions for Administration and Scoring
  Teacher-made: Usually no uniform directions specified
  Standardized: Specific, culture-free directions for every testee to understand; standardized administration and scoring procedures
Sampling of Content
  Teacher-made: Both content and sampling are determined by the classroom teacher
  Standardized: Content determined by curriculum and subject-matter experts; involves extensive investigation of existing syllabi, textbooks, and programs; sampling of content done systematically
Construction
  Teacher-made: May be hurried and haphazard; often no test blueprints, item tryouts, item analysis, or revision; quality of test may be quite poor
  Standardized: Uses meticulous construction procedures that include constructing objectives and test blueprints, employing item tryouts, item analysis, and item revisions
Norms
  Teacher-made: Only local classroom norms are available, i.e. they are determined by the school or a department
  Standardized: In addition to local norms, standardized tests typically make available national school district norms
Quality of Items
  Teacher-made: Unknown; usually lower than standardized tests due to the limited time and skill of the teacher
  Standardized: High; written by specialists, pretested, and selected on the basis of effectiveness
Reliability
  Teacher-made: Unknown; usually high if carefully constructed
  Standardized: High

5. WASHBACK
A facet of consequential validity is washback. Washback generally refers to the effects the tests have on
instruction in terms of how students prepare for the test.
 Note: ‘Cram’ courses and ‘teaching to the test’ are examples of such washback.
In classroom assessment it refers to the information that washes back to students in the form of useful
diagnoses of strengths and weaknesses.
 Harmful washback is said to occur when the test content and testing techniques are at variance with the objectives of the course.
 Beneficial washback is said to result when a testing procedure encourages good teaching practice.

6. TEST BIAS
A test or item can be considered to be biased if one particular section of the candidate population is
advantaged or disadvantaged by some feature of the test or item which is not relevant to what is being
measured.
 Fairness can be defined as the degree to which a test treats every student the same, or the degree to which it is impartial. Equitable treatment in terms of testing conditions, access to practice materials, performance feedback, retest opportunities, and other features of test administration, including reasonable accommodation for test takers with disabilities when appropriate, is an important aspect of fairness under this perspective.

7. AUTHENTICITY
It is the degree of correspondence between the characteristics of a given language test task and the features of the target language task.
 The language in the test is as natural as possible.
 Items are contextualized rather than isolated.
 Topics are meaningful for the learner.
 Some thematic organization to items is provided, such as through a story line or episode.
 Tasks represent, or closely approximate, real-world tasks.
Chapter Two: Language Test Functions

1. TWO MAJOR FUNCTIONS OF LANGUAGE TESTS


1.1. Evaluation of attainment tests
Attainment evaluation tests measure to what extent examinees have learned the intended skill, performance,
knowledge, etc. in a given area.

1.1.1. Achievement tests


Such tests are related directly to classroom lessons, units, or even a total curriculum.
 General achievement tests are (standardized) tests which deal with a body of knowledge. Constructors
of such tests rarely teach students being tested. One example is a test to measure students’
achievement in the first year of high school.
The content of a final achievement test may be based directly on a detailed course syllabus or on the
books and other materials used. This has been referred to as the syllabus-content approach.
 Since the test only contains what the students are thought to have actually encountered, it can be considered a fair test.
 If the syllabus is badly designed, or the books and other materials are badly chosen, the results
of a test can be very misleading.
The alternative approach is to base the test directly on the objectives of the course.
 It makes it possible for performance on the test to show just how far students have achieved
those objectives. This in turn puts pressure on those responsible for the syllabus and for the
selection of books and materials to ensure that these are consistent with the course objectives.
Tests based on objectives work against the perpetuation of poor teaching practice, something
which course-content-based tests fail to do.
 However, this approach can be unfair: if the course content does not fit well with the objectives, examinees will be expected to do things for which they have not been prepared.
 Diagnostic tests measure the degree of students' achievement on a particular subject or topic, and specifically on the detailed elements of an instructional topic. They show the weaknesses and strengths of students so that teachers can modify the instructional procedure, and remedial action can be taken if the number of students having difficulty is large.

Criterion-referenced test qualities: Achievement vs. Diagnostic

Details of Information
  Achievement: Specific
  Diagnostic: Very specific
Focus
  Achievement: Terminal objectives of a course or program
  Diagnostic: Enabling objectives of courses
Purpose of Decision
  Achievement: To determine the degree of learning for advancement or graduation
  Diagnostic: To inform students and teachers of objectives needing more work
When Administered
  Achievement: End of courses
  Diagnostic: Middle of courses
1.1.2. Knowledge tests
These tests are used when the medium of instruction is a language other than examinees’ mother tongue.

1.1.3. Proficiency tests


If your aim in a test is to tap overall language ability, i.e. global competence, then you are, in conventional terminology, testing proficiency. A proficiency test is not limited to any one course, curriculum, or single skill in the language; rather it tests overall ability. More precisely, these tests measure:
the degree of the examinee's capability to demonstrate his knowledge in language use;
and the degree of his capability in language components.

 Note: A key issue in testing proficiency is the difficulty that centers on the complexity of defining the
term ‘proficiency’ (construct of language). This difficulty renders the construct of proficiency tests
difficult.

1.2. Prognostic tests


Prognostic tests are not related to a particular course of instruction. Their objective is to predict and make decisions about the future success and actions of examinees based on their present capabilities.

1.2.1. Placement tests


Placement tests are used to determine the most appropriate channel of education for examinees. The purpose of placement tests is merely to measure the capabilities of an applicant in pursuing a certain path of language learning and to place them into an appropriate level or section of a language curriculum or school.
 Note: There is no pass or fail in placement tests.
 Note: Teachers benefit from placement decisions because they end up with classes that have students
with relatively homogeneous ability levels.
 Note: If there is a mismatch between the placement test and what is taught in a program, the danger is
that the groupings of similar ability levels will simply not occur.

1.2.2. Selection tests


The purpose of selection tests is to provide information upon which the examinees’ acceptance or non-
acceptance into a particular program can be determined.
 Note: In contrast to placement tests, testees pass or fail in selection tests.
 Note: When more candidates meet the criterion than can be admitted, administrative restrictions turn these tests into competition tests.

1.2.3. Aptitude tests


These tests are used to predict applicants' success in achieving certain objectives in the future. Language aptitude tests are designed to measure a person's capability or general ability to learn a foreign language a priori and to be successful in that undertaking. They were ostensibly designed to apply to the classroom learning of any language. These tests do not tell us who will succeed or fail in learning a foreign language; rather, they attempt to predict the rate at which certain students will be able to acquire a language. Language aptitude tests usually consist of several different subtests which measure the following cognitive abilities.
 Sound coding ability (or phonetic coding): the ability to identify and remember new auditory
phonetic material in such a way that this material can be recognized, identified and remembered over a
period longer than a few seconds. This is a rather unique auditory component of foreign language
aptitude.
 Grammatical coding ability: the ability to identify the grammatical functions of different parts of
sentences in a variety of contexts.
 Memorization (or rote learning ability): the ability to remember words, rules, etc. in a new language.
Rote learning ability is a kind of general memory, but individuals seem to differ in their ability to
apply their memory to the foreign language situation.
 Inductive learning ability: the ability to work out linguistic forms, rules, patterns, and meanings from
new linguistic content with a minimum of supervision or guidance.
Chapter Three: Forms of Language Test

The form of a test refers to its physical appearance.

1. STRUCTURE OF AN ITEM
An item, the smallest unit of a test, consists of two parts: the stem and the response.

1) How many functions do language tests serve? (stem)


alternatives:
   two    (key)
   three  (distractor)
   four   (distractor)
   five   (distractor)

2. CLASSIFICATION OF ITEM FORMS


2.1. Subjective vs. objective items
Subjective items are those in which the scorer must make an opinionated judgment. Objective items are
those in which the correctness of the test taker’s response is determined by predetermined/objective criteria.
 Note: You should know that objectivity and subjectivity refer to the way a test item is scored.
The most beautiful season is …… (subjectively scored)
1) spring 2) summer 3) fall 4) winter
There are …… seasons in a year. (objectively scored)
1) four 2) three 3) two 4) five

2.2. Essay-Type vs. multiple-choice items


Essay-type items are those in which the examinee is required to produce language elements. Multiple-
choice items are those in which examinee is required to select the correct response from among given
alternatives.

2.3. Suppletion vs. recognition items


Suppletion (or production; supply) items require the examinee to supply the missing part(s) of the sentence
or complete an incomplete sentence. Recognition items require examinee to select an answer from a list of
possibilities.

3. TYPES OF ITEMS
3.1. Receptive response items
Multiple-choice (MC) items are undoubtedly one of the most widely used types of items in objective tests.
MC items have the following advantages.
 Because of the highly structured nature of these items, the test writer can get directly at many of the specific skills and learning points he wishes to measure. This in turn leads to their diagnostic function.
 The test writer can include a large number of different tasks in the testing session. Thus they have
practicality.
 Scoring can be done quickly and involves no judgments as to degrees of correctness. Thus they have
reliability.
However, these items are disadvantageous on the grounds they:
 are passive, i.e. such items test only recognition knowledge but not language communication,
 may have harmful washback,
 expose students to errors,
 are de-contextualized,
 are one of the most difficult and time-consuming types of items to construct,
 are simpler to answer than subjective tests,
 encourage guessing.
There is a way to compensate for students' guessing on tests: a statistical adjustment properly named the guessing correction formula:

Score = Right − Wrong / (n − 1)

where n refers to the number of options.

 Example: In a test which consisted of 80 items with four options, a student answered 50 items correctly and gave 30 wrong answers. After applying the guessing correction formula, his score would be ---------
1) 45  2) 35  3) 40  4) 30

Score = Right − Wrong / (n − 1) = 50 − 30 / (4 − 1) = 50 − 10 = 40

3.2. Personal response items (or alternative assessment options)


In recent years, language teachers have stepped up efforts to develop non-test assessment options. Such
innovations are referred to as personal response items that encourage the students to produce responses that
hold personal meaning.

3.2.1. Self-assessment
Self-assessment is defined as any items wherein students are asked to rate their own knowledge, skills, or
performances. Thus, self-assessments provide the teacher with some idea of how the students view their own
language abilities and development.
Characteristics of self-assessment include:
 speed,
 direct involvement of students → increased motivation,
 the encouragement of autonomy,
 subjectivity.
There are at least two categories of self-assessment:
 Direct assessment of a specific performance: a student typically monitors himself in either oral or
written production and renders some kind of evaluation of performance.
 Indirect assessment of general competence: this type of assessment targets large slices of time with
a view to rendering an evaluation of general ability, as opposed to one specific, relatively time-
constrained performance.
3.2.2. Journal
Journals can range from language learning logs, to grammar discussions, to responses to readings, to
attitudes and feelings about oneself. One of the principal objectives in a student’s dialogue journal is to carry
on a conversation with the teacher. Through dialogue journals, teachers can become better acquainted with
their students, in terms of both their learning progress and their affective states, and thus become better
equipped to meet students’ individual needs.
 Because journal writing is a dialogue between students and teacher, journals afford a unique
opportunity for a teacher to offer various kinds of feedback to learners.
 Journals are too free in form to be assessed accurately.
 Certain critics have expressed ethical concerns.
CHAPTER FOUR: BASIC STATISTICS IN LANGUAGE TESTING

1. STATISTICS
Statistics involves collecting numerical information called data, analyzing them, and making
meaningful decisions on the basis of the outcome of the analyses. Statistics is of two types:
descriptive and inferential.

2. TYPES OF DATA
2.1. Nominal Data
As the name implies, nominal data name an attribute or category and classify the data according to the presence or absence of the attribute, e.g. 'gender,' 'nationality,' 'native language,' etc.

2.2. Ordinal Data


Like the nominal scale, an ordinal scale names a group of observations, but, as its label implies,
an ordinal scale also orders, or ranks, the data. For example, the degree of happiness is shown by
very unhappy – unhappy – happy – very happy.

2.3. Interval Data


Interval data represent the ordering of a named group of data, but they provide additional
information. Interval data also show the (more) precise distances between the points in the
rankings, e.g. test scores.

2.4. Ratio Data


Ratio data are similar to interval data except that they have an absolute zero. As a result of this new characteristic, with ratio data we can say 'this point is two times as high as that point.'

Scale      Shows categories   Gives ranking   Equal distances   Absolute zero
Nominal    yes                no              no                no
Ordinal    yes                yes             no                no
Interval   yes                yes             yes               no
Ratio      yes                yes             yes               yes

3. TABULATION OF DATA
Suppose that the following table shows the reading scores of students in an achievement test.

Student a b c d f g h i j k l
Score 93 95 92 95 100 96 92 96 92 95 92
3.1. Rank Order
The first step is to arrange the scores in order of size, usually from highest to lowest. If two or more testees received the same score, each is assigned the average of the ranks they jointly occupy.

Score Rank order


100 1
96 2.5
96 2.5
95 5
95 5
95 5
93 7
92 9.5
92 9.5
92 9.5
92 9.5

The next table shows the same scores. The remaining terms used in the tabulation of data will be presented according to this table.

Score   Frequency (f)   Relative Frequency   Percentage        Cumulative Frequency (F)   Percentile
100     1               0.09                 0.09 × 100 = 9    11                         100
96      2               0.18                 0.18 × 100 = 18   10                         90
95      3               0.27                 0.27 × 100 = 27   8                          72
93      1               0.09                 0.09 × 100 = 9    5                          45
92      4               0.36                 0.36 × 100 = 36   4                          36
Total = 11

3.2. The Frequency Distribution


Frequency (f), also called simple or absolute frequency, is the number of times a score occurs.

3.3. Relative Frequency


Relative frequency refers to the simple frequency of each score divided by the total number of
scores.

3.4. Percentage
When relative frequency index is multiplied by 100, the result is called percentage.

3.5. Cumulative Frequency


Cumulative frequency (F) indicates the standing of any particular score in a group of scores. This index shows how many students received a particular score or a lower one.
3.6. Percentile
When the cumulative frequency index is divided by the total number of learners and the result is multiplied by 100, we obtain the percentile. The percentile rank shows what percentage of students received a particular score or below it.
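All of the tabulation indices above can be reproduced with a few lines of Python; this sketch (the names are mine) prints one row of the table per score:

from collections import Counter

scores = [93, 95, 92, 95, 100, 96, 92, 96, 92, 95, 92]   # the reading scores above
n = len(scores)
freq = Counter(scores)                                    # absolute frequency (f)
for score in sorted(freq, reverse=True):
    f = freq[score]
    rel = f / n                                           # relative frequency
    pct = int(rel * 100)                                  # percentage (truncated, as in the table)
    cum = sum(c for s, c in freq.items() if s <= score)   # cumulative frequency (F)
    pctile = int(cum / n * 100)                           # percentile rank (truncated)
    print(score, f, round(rel, 2), pct, cum, pctile)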

4. DESCRIPTIVE STATISTICS
4.1. Measures of Central Tendency
4.1.1. Mode
The most easily obtained measure of central tendency is the mode. The mode is the score that
occurs most frequently in a set of scores, e.g. 88 is mode in
80, 81, 81, 85, 88, 88, 88, 93, 94, 94

Note: When all of the scores in a group occur with the same frequency, it is customary to say that the group of scores has 'no mode,' as in
83, 83, 83, 88, 88, 88, 90, 90, 90, 95, 95, 95

Note: When two adjacent scores have the same frequency, the mode is the average of the
two adjacent scores, so 86.5 is mode in the following set
80, 82, 84, 85, 85, 88, 88, 90, 94

Note: When two non-adjacent scores have the same frequency, the distribution is bi-
modal.
82, 82, 85, 85, 85, 87, 88, 88, 88, 90, 94

4.1.2. Median
The median (Md) is the score at the 50th percentile in a group of scores, e.g. 85 is median in
81, 81, 82, 84, 85, 86, 86, 88, 89

Note: If the data are an even number of scores, the median is the point halfway between
the central values when the scores are ranked, e.g. 85 in the following set
81, 81, 82, 84, 86, 86, 88, 90

4.1.3. Mean
The mean is probably the single most often reported indicator of central tendency. It is the same as the arithmetic average:

X̄ = ∑X / N

Note: If we were to find the deviations of scores from the mean of the set, their sum would be exactly zero.
Note: The limitation of the mean is that it is seriously sensitive to extreme scores.

4.1.4. Mid Point


The midpoint in a set of scores is that point halfway between the highest score and the lowest score on the test. The formula for calculating the midpoint is:

Midpoint = (High + Low) / 2

4.2. Measures of Variability


4.2.1. Range
Range is the simplest measure of dispersion and is defined as the number of points between the highest score on a measure and the lowest score, e.g. the range is 8 in the set
92, 95, 95, 97, 98, 98, 100

Note: Range changes drastically with the magnitude of extreme scores (or outliers).

4.2.2. Standard Deviation (SD)


The most frequently used measure of variability is the standard deviation. SD reflects the average distance of all scores from the mean. Consider two sets of scores, each with a mean of 6 (pictured in the original as arrows from each score to the mean on a number line from 1 to 11):

Set 1: 3, 5, 5, 8, 9 (mean = 6)
Mean of the arrows' lengths (the deviations) = (1 + 1 + 3 + 3 + 2) / 5 = 2

Set 2: 1, 4, 5, 10, 10 (mean = 6)
Mean of the arrows' lengths = (5 + 2 + 1 + 4 + 4) / 5 = 3.2

Therefore, we can say the scores in the second set on average deviate more from their mean than do the scores in the first set. The standard deviation expresses this dispersion using squared deviations:

SD = √( ∑(X − X̄)² / N )
4.2.3. Variance
To find the variance, you simply stop short of the last step in calculating the standard deviation: you do not take the square root.

Variance = ∑(X − X̄)² / N

You will frequently find the variance written as S².

Variance = S² (standard deviation squared)    Standard deviation = √variance
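A minimal Python sketch of these statistics (function names are mine). Note that, because the standard deviation squares the deviations, its values for the two sets above (about 2.19 and 3.52) differ slightly from the mean arrow lengths of 2 and 3.2:

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)   # population variance, as in the formula above

def std_dev(xs):
    return variance(xs) ** 0.5

set_1 = [3, 5, 5, 8, 9]
set_2 = [1, 4, 5, 10, 10]
print(mean(set_1), std_dev(set_1))   # 6.0 and about 2.19
print(mean(set_2), std_dev(set_2))   # 6.0 and about 3.52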


5. NORMAL DISTRIBUTION
A normal distribution means that most of the scores cluster around the mean of the distribution,
and the number of scores gradually decreases on either side of the mean. The resulting figure is a
symmetrical bell-shaped curve.

 Example: In a vocabulary test, the mean and standard deviation are calculated to be 82 and 4, respectively. In this test, 68% of students fall between ---------

 Example: The mean and SD of a set of scores are 45 and 5. A student who obtained 55 has a percentile rank of ---------.
(normal curve marked at 30, 35, 40, 45, 50, 55, 60)

 Example: In a test, the mean and standard deviation are 32 and 3. A student is --------- probable to obtain a score higher than 29.
(normal curve marked at 23, 26, 29, 32, 35, 38, 41)
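The three examples can be checked with the 68-95-99.7 rule for normal distributions; this hypothetical Python helper simply computes mean ± k standard deviations:

def band(mean, sd, k=1):
    # the interval from (mean - k*SD) to (mean + k*SD)
    return mean - k * sd, mean + k * sd

print(band(82, 4))      # (78, 86): 68% of students fall within one SD of the mean
print(band(45, 5, 2))   # (35, 55): 55 is two SDs above 45, i.e. roughly the 98th percentile
print(band(32, 3))      # (29, 35): 29 is one SD below 32, so about 84% score above it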

6. DERIVED SCORES
Raw scores are obtained simply by counting the number of right answers. Raw scores from two
different tests are not comparable. To solve this problem, we could convert the raw scores into
percentile or standard scores.
Percentile scores indicate how a given student’s score relates to the test scores of the
entire group of students.
Standardized scores are obtained by taking into account the mean and SD of any given
set of scores. Standard scores represent a student’s score in relation to how far the score
varies from the test mean in terms of standard deviation units.

6.1. z score
The ‘z score’ just tells you how many standard deviations above or below the mean any score or
observation might be:
𝑋−
𝑧=
𝑋
𝑆𝐷

 Example: In a set of scores where, mean and SD are 41 and 10, what is the z score of

51 −
a student who obtained 51?
𝑋
𝑧= 41 = +1
− =
𝑋 10
𝑆
𝐷

6.2. T score
The formula for calculating the T score is:

T score = 10z + 50

Therefore, the T score of the student in the previous example would be 60.
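Both derived scores are one-liners in Python (function names are mine):

def z_score(x, mean, sd):
    # z = (X - mean) / SD
    return (x - mean) / sd

def t_score(z):
    # T = 10z + 50
    return 10 * z + 50

z = z_score(51, 41, 10)
print(z, t_score(z))   # 1.0 60.0, as in the examples above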

8. CORRELATION
Correlation analysis refers to a family of statistical analyses that determines the degree of
relationship between two sets of numbers. The numerical value representing the degree to which
two variables are related (co-vary, or vary together) is called correlation coefficient. Correlation
is the go-togetherness of two sets of scores. Let’s take the following hypothetical set of scores
and then represent them on a scatter plot.
Positive correlation:

Students   Test A   Test B
Dean       2        3
Randy      3        5
Joey       4        7
Jeanne     5        9
Kimi       6        11
Shenan     7        13

This is a perfect positive linear correlation.

The two sets of scores may not necessarily be ordered in exactly the same way. Here is another set of hypothetical scores with the corresponding scattergram:

Students   Test A   Test B
Dean       2        2
Randy      4        3
Joey       6        7
Jeanne     8        7
Kimi       9        10
Shenan     12       11

This is a linear (not perfect) positive correlation.

Negative correlation:

Students   Days of absence   English scores
Dean       8                 20
Randy      7                 30
Joey       6                 40
Jeanne     5                 50
Kimi       4                 60
Shenan     3                 70

This is a perfect negative linear correlation: the more days a student was absent, the lower his English score.

Here is another set of hypothetical scores with the corresponding scattergram:

Students   Days of absence   English test
Dean       2                 90
Randy      3                 60
Joey       5                 40
Jeanne     8                 40
Kimi       8                 30
Shenan     9                 10

This is a (not perfect) negative correlation.

Note: If high scores in one set are associated with low scores on the other set, there is a negative relationship between the two sets of scores.
Note: If high scores in one set are associated with high scores on the other set, there is a positive relationship between the two sets of scores.

Zero correlation: (scatterplot only; the points show no linear trend)

Curvilinear: (scatterplot only; the points follow a curve rather than a straight line)

9. CORRELATIONAL FORMULAS
Correlational values are named after their strength:
 Both ± 1 are considered perfect correlations.
 – 0.4 ≤ r ≤ + 0.4 are considered weak correlations.
 –0.8 ≥ r > –1 and 0.8 ≤ r < 1 are considered strong correlations.

Note: The sign (– or +) of the correlation coefficient doesn’t have any effect on the degree
of association, only on the direction of the association.

9.1. Pearson Product Moment Correlation


Karl Pearson developed a correlation coefficient which demonstrates the strength of a relationship between two sets of continuous-scale data:

r = [N(∑XY) − (∑X)(∑Y)] / √([N(∑X²) − (∑X)²][N(∑Y²) − (∑Y)²])
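A direct Python transcription of the formula (the function name is mine), checked against the first correlation table above:

def pearson_r(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2, sy2 = sum(x * x for x in xs), sum(y * y for y in ys)
    numerator = n * sxy - sx * sy
    denominator = ((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2)) ** 0.5
    return numerator / denominator

# Test A and Test B from the first correlation table:
print(pearson_r([2, 3, 4, 5, 6, 7], [3, 5, 7, 9, 11, 13]))   # 1.0, a perfect positive correlation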

9.2. Rank Order Correlation


The Spearman rho (ρ) correlation coefficient is used only when the data exist in ranked (ordinal) form:

ρ = 1 − 6∑D² / [N(N² − 1)]

where D is the difference between the two ranks assigned to each individual.
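A Python sketch of the rho formula, applied to hypothetical ranks from two raters (the function name and data are mine, for illustration only):

def spearman_rho(ranks_x, ranks_y):
    n = len(ranks_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))   # sum of squared rank differences
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

# Ranks assigned to five essays by two raters:
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))   # 0.8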

9.3. Point Biserial Correlation


It is used when one set of data is continuous and the other set is nominal. The nominal variable is dichotomous: it can take only the values of 1 or 0. The correlation between each single test item (nominal scale) and the total test score (continuous scale) can be computed by this formula:

r_pb = [(X̄_p − X̄_q) / S_x] √(p·q)

where X̄_p and X̄_q are the mean total scores of those who answered the item correctly and incorrectly respectively, p is the proportion who answered correctly, q = 1 − p, and S_x is the standard deviation of the total scores.
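A Python sketch of the point-biserial formula using a population standard deviation, applied to hypothetical item responses and total scores (all names and data are mine):

def point_biserial(item, totals):
    # item: 1 (correct) or 0 (incorrect) per examinee; totals: total test scores
    n = len(item)
    p = sum(item) / n                                          # proportion answering correctly
    q = 1 - p
    mean_p = sum(t for i, t in zip(item, totals) if i == 1) / sum(item)
    mean_q = sum(t for i, t in zip(item, totals) if i == 0) / (n - sum(item))
    m = sum(totals) / n
    sd = (sum((t - m) ** 2 for t in totals) / n) ** 0.5        # population SD of total scores
    return (mean_p - mean_q) / sd * (p * q) ** 0.5

# Six examinees: their responses to one item and their total scores
print(point_biserial([1, 1, 1, 0, 0, 0], [9, 8, 7, 4, 3, 2]))   # about 0.95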

Note: Correlation doesn’t show causality between two variables. It shows relative
positions in one variable are associated with relative positions in the other variable.
CHAPTER FIVE: TEST CONSTRUCTION

1. DETERMINING FUNCTION AND FORM OF THE TEST


In order to determine the function and form of a test, three factors should be taken into account:
(a) characteristics of the examinees
(b) specific purpose of the test
(c) scope of the test

2. PLANNING (Specifying the test content)


It is important for the tester to decide on the area of knowledge to be measured. In order to determine the content of a test, a table of specifications should be prepared.
 The main purpose of the table of specifications is to assure the test developer that the test includes a representative sample of the materials covered in a particular course.

Instructional objectives / Content    Number of items
Reported speech                       3
Subjunctive                           2
Dangling structure                    5

3. PREPARING ITEMS
Last year, incoming students ……… on the first day of school.
1) enrolled 2) will enroll 3) will enrolled 4) should enroll
Have you heard the planning committee’s ……… for solving the city’s traffic problems?
1) purpose 2) propose 3) design 4) theory

4. REVIEWING
It is highly recommended that the produced test be reviewed by an outsider in order to obtain his subjective impressions and evaluation of the test.

5. PRETESTING
Pretesting is defined as administering the newly developed test to a group of examinees with
characteristics similar to those of the target group. The purpose of pre-testing is to determine,
objectively, the characteristics of the individual items and the characteristics of the items altogether.

Item Facility (IF)


Item facility refers to the easiness of an item:

IF = ∑C / N

where:
IF = item facility
∑C = the sum of the correct responses
N = the total number of responses

Example: In a test, 20 testees answered an item correctly. If 50 students took the exam, what would be the item facility?

IF = ∑C / N = 20 / 50 = 0.4

Example: A test was given to 75 examinees: 50 answered correctly, 10 answered wrongly, and 15 left the item blank. What is the IF? (N counts responses, so the 15 blanks are excluded.)

IF = ∑C / N = 50 / 60 = 0.83

 Note: The range of the IF index is 0 ≤ IF ≤ 1
 Note: The acceptable range of the IF index is 0.37 ≤ IF ≤ 0.63
 Note: The ideal IF index is IF = 0.5




 Note: By determining item facility, the test constructor can easily find out item difficulty, which can be calculated by using the following formula:

Item Difficulty = 1 − IF
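Both indices are simple ratios; a minimal Python sketch (function names are mine), checked against the two worked examples above:

def item_facility(correct_responses, total_responses):
    # IF = sum of correct responses / total number of responses
    return correct_responses / total_responses

def item_difficulty(if_index):
    # Item difficulty is the complement of item facility
    return 1 - if_index

if_index = item_facility(20, 50)    # first worked example
print(if_index)                     # 0.4
print(item_difficulty(if_index))    # 0.6
print(item_facility(50, 50 + 10))   # second example: blanks excluded, about 0.83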

Item Discrimination (ID)

Item discrimination refers to the extent to which a particular item discriminates more knowledgeable examinees from less knowledgeable ones. To compute item discrimination, the following formula should be used:

ID = (∑C_high − ∑C_low) / (½N)

∑C_high = the number of correct responses to a particular item by the examinees in the high group
∑C_low = the number of correct responses to a particular item by the examinees in the low group
½N = the total number of responses divided by 2

                      Items
Subjects     1  2  3  4  5  6  7  8  9  10   Total
Higher group
  Shenan     1  0  1  1  1  1  1  1  1  1    9
  Robert     1  0  1  1  1  1  1  0  1  1    8
  Millie     1  0  1  1  1  1  1  0  1  0    7
  Kimi       1  0  0  1  0  1  0  1  1  1    7
  Jeanne     1  0  1  1  0  1  0  0  0  1    5
Lower group
  Corky      0  1  0  0  1  0  0  1  0  1    4
  Dean       0  1  0  0  0  0  1  1  1  0    4
  Bill       0  1  1  0  1  1  0  0  0  0    4
  Randy      0  1  0  0  0  0  0  1  0  0    2
  Mitsuko    0  1  0  0  0  0  0  0  0  0    1

Example: If in a class with 50 students, 20 students in the high group and 10 students in the low group answered an item correctly, then ID equals ---------

ID = (∑C_high − ∑C_low) / (½N) = (20 − 10) / (½ × 50) = 10 / 25 = +0.4

Example: All 30 testees in the high group and one-third of the 30 students in the low group answered item number one correctly. If there were 60 testees in all, what are IF and ID?

ID = (∑C_high − ∑C_low) / (½N) = (30 − 10) / (½ × 60) = 20 / 30 = +0.66
IF = ∑C / N = 40 / 60 = 0.66

 Note: The range of ID index is −1 ≤ 𝐼𝐷 ≤ +1


 Note: The acceptable range of item discrimination is 𝐼𝐷 ≥ +0.4
 Note: If all students answered a question correctly (IF = 1), it would mean that the item is not
only too easy but also non-discriminating (ID = 0). Similarly, if none of the students answered
an item (IF = 0), it would mean that the item is not only too difficult but also non-
discriminating (ID = 0).
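A minimal Python sketch of the discrimination index (the function name is mine), checked against the two worked examples above:

def item_discrimination(correct_high, correct_low, n_total):
    # ID = (correct in high group - correct in low group) / (N / 2)
    return (correct_high - correct_low) / (n_total / 2)

print(item_discrimination(20, 10, 50))   # +0.4 (first worked example)
print(item_discrimination(30, 10, 60))   # about +0.66 (second worked example)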

Choice Distribution (CD)


Choice distribution refers to:
(1) The frequency with which alternatives are assigned as the correct answer.
(2) The frequency with which alternatives are selected by the examinees in a multiple-choice item.
Accordingly, there are three types of distractors:
Functioning: a distractor which attracts more low-scoring students, who have not mastered the subject
Non-functioning: a distractor which attracts no one, not even the poorest examinees
Mal-functioning: a distractor which attracts more high than low scorers

Example: (Choice C is the answer)

Choice   Highs   Lows   Total
A        3       8      11
B        7       3      10
C        14      5      19
D        0       0      0
         (20)    (20)   (40)

IF = ∑C / N = 19 / 40 = 0.47

ID = (∑C_high − ∑C_low) / (½N) = (14 − 5) / (½ × 40) = 9 / 20 = 0.45

6. VALIDATION
Through validation, which is the last step in the test construction process, validity, as a characteristic of the test as a total unit, is determined.
1) RELIABILITY
On a ‘reliable’ test, one’s score on its various administrations would not differ greatly. That is,
one’s score would be quite consistent. The notion of consistency of one’s score with respect to
one’s average score over repeated administrations is the central concept of reliability.

2) CLASSICAL TRUE SCORE THEORY (CTS)


CTS states that an observed score an examinee obtains on a test comprises two factors or
components: a true score and an error score. If the observed score is represented by X, the true
score by T and the error score by E, the relationship between the observed and true score can be
illustrated as follows:

X = T + E
(observed score = true score + error score)

Depending on the error, the true score may be equal to, greater than, or less than the observed score (T = X, T > X, or T < X).
It’s important to keep in mind that we observe the X score – we never actually see the true (T) or
error (E) scores. According to CTS, reliability or unreliability is explained as follows. A measure
is considered reliable if it would give us the same result over and over again (assuming that what
we are measuring isn't changing!).
Now, we don't speak of the reliability of a measure for an individual; reliability is a characteristic of a measure taken 'across' individuals. The performances of students on any test will tend to vary from one another, and they can vary for a variety of reasons. Generally, the sources of variance in a set of scores fall into two categories: (a) meaningful variance, i.e. variance related to the purposes of the test or the subject matter area being tested, and (b) error variance, i.e. variance generated by other, extraneous sources. Meaningful variation, which is predictable, is called systematic variation and contributes to reliability. Error variation, which is not predictable, is called unsystematic variation.

The relationship between true, error and observed scores which was stated by a simple equation
has a parallel equation at the level of the variance of a measure. That is, across a set of scores, we
assume that:

V_x = V_t + V_e

Reliability is expressed as the ratio of the variance of true scores to the variance of observed scores. Notationally, this relationship is presented as:

r = V_t / V_x

3) STANDARD ERROR OF MEASUREMENT


The formula for calculating SEM is relatively simple:

SEM = S_x √(1 − r)

where S_x is the standard deviation of the test and r is the reliability of the test.

Example
If the standard deviation of a test were 15 and its reliability were estimated as 0.84,
then what would be standard error of measurement?
1) 1.5 2) 4 3) 3.5 4) 6

SEM = S_x √(1 − r) = 15 × √(1 − 0.84) = 15 × √0.16 = 15 × 0.4 = 6

 It can be inferred from the last example that there is a negative relationship between
standard error of measurement and reliability. When there is no measurement error
reliability equals +1.

Conceptually, this statistic is used to determine a band around a student's score within which that student's true score would probably fall. Using a normal distribution, we have:

Example
If the SEM of a set of scores is 2.5, we can be sure that the true score of a student who obtained 15 would fluctuate 68% of the time between ---------
1) 12.5 and 17.5  2) 15 and 17.5  3) 10 and 20  4) 10.5 and 15

Since the item asks for 68% of the time, the band one SEM on either side of the obtained score, covering 12.5 to 17.5, is the answer.
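A Python sketch of the SEM formula and the 68% band (function names are mine), checked against the two examples above:

def sem(sd, reliability):
    # SEM = Sx * sqrt(1 - r)
    return sd * (1 - reliability) ** 0.5

print(sem(15, 0.84))                 # 6.0, as in the first worked example

score, error = 15, 2.5
print(score - error, score + error)  # 12.5 17.5: the 68% band around a score of 15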

4) APPROACHES TO ESTIMATING RELIABILITY


4.1) Internal Consistency
Internal consistency reliability uses information available within a single administration of one test. The main idea behind the internal consistency method is that all the items in a test attempt to measure elements of a single trait, i.e. there is an internal homogeneity among the items. Two assumptions underlie this method:
a) test scores are unidimensional, which means that the parts or items of a given test all
measure the same, single ability, i.e. items comprising a test are homogeneous. For
example, grammatical points, vocabulary, reading and listening comprehension, are all
subparts of the trait called language ability;
b) the items or parts of a test are locally independent. That is, we assume that an individual's response to a given test item does not depend upon how he responds to other items that are of equal difficulty, i.e. items comprising a test are independent. This is a praised characteristic of multiple-choice items.

4.3.1) Split-half Methods


In this method, when a single test is administered to a group of examinees, the test is split, or
divided, into two equal halves.

Spearman-Brown estimate

The correlation between the two halves is an estimate of the reliability of half the test. Since splitting divides the length of the test in two, to estimate the reliability of the full test, the resulting correlation between the two halves is plugged into the Spearman-Brown prophecy formula:

r_total = 2(r_half) / (1 + r_half)

where:
r_total is the reliability of the full-length test
r_half is the reliability of half of the test

Example
The reliability of half of a grammar test is calculated to be 0.35. By applying the Spearman-Brown prophecy formula, the total reliability would be ---------
1) 0.51  2) 0.63  3) 0.45  4) 0.38

r_total = 2(r_half) / (1 + r_half) = (2 × 0.35) / (1 + 0.35) = 0.7 / 1.35 ≈ 0.51
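The prophecy formula in Python (the function name is mine):

def spearman_brown(r_half):
    # r_total = 2 * r_half / (1 + r_half)
    return 2 * r_half / (1 + r_half)

print(spearman_brown(0.35))   # 0.5185..., reported as 0.51 in the worked example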

4.3.2) Item variance methods


KR-21 Method
This formula, which is sometimes called rational equivalence, is the easiest and most frequently used method of internal consistency estimation:

(KR-21) r = [K / (K − 1)] × [1 − X̄(K − X̄) / (K·V)]

where:
K is the number of items in a test
X̄ is the mean score
V is the variance

 KR-21 always provides an underestimate of reliability.
Example
If a test with 30 items has a variance of 10 and a mean of 20, then the reliability of the test would be ---------.
1) 0.44  2) 0.34  3) 0.29  4) 0.52

r = [K / (K − 1)] × [1 − X̄(K − X̄) / (K·V)] = (30/29) × [1 − (20 × (30 − 20)) / (30 × 10)] = (30/29) × (1 − 2/3) = (30/29) × (1/3) = 30/87 ≈ 0.34
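And the KR-21 formula in Python (the function name is mine), checked against the worked example:

def kr21(k, mean, variance):
    # (KR-21) r = (K / (K - 1)) * (1 - mean * (K - mean) / (K * variance))
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

print(round(kr21(30, 20, 10), 2))   # 0.34, matching the worked example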
