Language Testing: Summary of Chapters 1-6
1. WHY TESTING
In general, testing has the following purposes:
Students, teachers, administrators and parents want to ascertain the degree to which those goals have
been realized
Government and private sectors who employ the students are interested in having precise information
about students’ abilities.
Most importantly, through testing, accurate information is obtained based on which educational
decisions are made.
Tests can benefit students in the following ways:
Testing can create a positive attitude toward the class and motivate students to learn the subject
matter.
Testing can help students prepare themselves and thus learn the materials.
Testing can also benefit teachers:
Testing helps teachers to diagnose their efforts in teaching.
Testing can also help teachers gain insight into ways to improve the evaluation process.
A test is an instrument, often connoting the presentation of a set of questions to be answered, used to obtain a
measure of a characteristic of a person.
Note: What distinguishes a test from other types of measurement is that it is designed to obtain a
specific sample of behavior.
Note: It is important to point out that we never measure or evaluate people. We measure or evaluate
characteristics or properties of people.
5. WASHBACK
A facet of consequential validity is washback. Washback generally refers to the effects the tests have on
instruction in terms of how students prepare for the test.
Note: ‘Cram’ courses and ‘teaching to the test’ are examples of such washback.
In classroom assessment it refers to the information that washes back to students in the form of useful
diagnoses of strengths and weaknesses.
Harmful washback is said to occur when the test content and testing techniques are at variance with the
objectives of the course.
Beneficial washback is said to result when a testing procedure that encourages good teaching practice is
introduced.
6. TEST BIAS
A test or item can be considered to be biased if one particular section of the candidate population is
advantaged or disadvantaged by some feature of the test or item which is not relevant to what is being
measured.
Fairness can be defined as the degree to which a test treats every student the same, or the degree to
which it is impartial. Equitable treatment in terms of testing conditions, access to practice materials,
performance feedback, retest opportunities, and other features of test administration, including
providing reasonable accommodation for test takers with disabilities when appropriate, is an important
aspect of fairness under this perspective.
7. AUTHENTICITY
It is the degree of correspondence of the characteristics of a given language test task to the features of a target
language task.
The language in the test is as natural as possible.
Items are contextualized rather than isolated.
Topics are meaningful for the learner.
Some thematic organization to items is provided, such as through a story line or episode.
Tasks represent, or closely approximate, real-world tasks.
Chapter Two: Language Test Functions
Criterion-Referenced Test Qualities:

| Test Qualities         | Achievement                                                   | Diagnostic                                                     |
| Details of Information | Specific                                                      | Very specific                                                  |
| Focus                  | Terminal objectives of course or program                      | Enabling objectives of courses                                 |
| Purpose of Decision    | To determine the degree of learning for advancement or graduation | To inform students and teachers of objectives needing more work |
| When Administered      | End of courses                                                | Middle of courses                                              |
1.1.2. Knowledge tests
These tests are used when the medium of instruction is a language other than the examinees’ mother tongue.
Note: A key issue in testing proficiency is the difficulty that centers on the complexity of defining the
term ‘proficiency’ (the construct of language). This difficulty renders the construction of proficiency tests
difficult.
1. STRUCTURE OF AN ITEM
An item, the smallest unit of a test, consists of two parts: the stem and the response.
3. TYPES OF ITEMS
3.1. Receptive response items
Multiple-choice (MC) items are undoubtedly one of the most widely used types of items in objective tests.
MC items have the following advantages.
Because of the highly structured nature of these items, the test writer can get directly at many of the
specific skills and points of learning he wishes to measure. This in turn gives them a diagnostic function.
The test writer can include a large number of different tasks in the testing session. Thus they have
practicality.
Scoring can be done quickly and involves no judgments as to degrees of correctness. Thus they have
reliability.
However, these items are disadvantageous on the grounds they:
are passive, i.e. such items test only recognition knowledge but not language communication,
may have harmful washback,
expose students to errors,
are de-contextualized,
are one of the most difficult and time-consuming types of items to construct,
are simpler to answer than subjective tests,
encourage guessing.
There is a way to compensate for students’ guessing on tests. That is, there is a mathematical way to adjust
or correct for guessing. This statistical procedure, properly named the guessing correction formula, is:

Score = Right − Wrong / (n − 1)

where n is the number of options per item.

Example: In a test which consisted of 80 items with four options, a student answered 50 items
correctly and gave 30 wrong answers. After applying the guessing correction formula his score would
be
1) 45  2) 35  3) 40  4) 30

Score = Right − Wrong / (n − 1) = 50 − 30 / (4 − 1) = 50 − 10 = 40
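The guessing correction formula can be sketched in Python; the function name is my own, not from the text:

```python
def corrected_score(right: int, wrong: int, n_options: int) -> float:
    """Guessing correction: Score = Right - Wrong / (n - 1),
    where n is the number of options per item."""
    return right - wrong / (n_options - 1)

# The worked example above: 50 right, 30 wrong, four options per item.
print(corrected_score(50, 30, 4))  # 40.0
```

Note that a student who omits items loses nothing under this formula; only wrong answers are penalized.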
3.2.1. Self-assessment
Self-assessment is defined as any items wherein students are asked to rate their own knowledge, skills, or
performances. Thus, self-assessments provide the teacher with some idea of how the students view their own
language abilities and development.
Self-assessment offers several advantages:
speed,
direct involvement of students → increased motivation,
the encouragement of autonomy.
Its chief drawback is subjectivity.
There are at least two categories of self-assessment:
Direct assessment of a specific performance: a student typically monitors himself in either oral or
written production and renders some kind of evaluation of performance.
Indirect assessment of general competence: this type of assessment targets large slices of time with
a view to rendering an evaluation of general ability, as opposed to one specific, relatively time-
constrained performance.
3.2.2. Journal
Journals can range from language learning logs, to grammar discussions, to responses to readings, to
attitudes and feelings about oneself. One of the principal objectives in a student’s dialogue journal is to carry
on a conversation with the teacher. Through dialogue journals, teachers can become better acquainted with
their students, in terms of both their learning progress and their affective states, and thus become better
equipped to meet students’ individual needs.
Because journal writing is a dialogue between students and teacher, journals afford a unique
opportunity for a teacher to offer various kinds of feedback to learners.
Journals are too free in form to be assessed accurately.
Certain critics have expressed ethical concerns.
CHAPTER FOUR: BASIC STATISTICS IN
LANGUAGE TESTING
1. STATISTICS
Statistics involves collecting numerical information called data, analyzing them, and making
meaningful decisions on the basis of the outcome of the analyses. Statistics is of two types:
descriptive and inferential.
2. TYPES OF DATA
2.1. Nominal Data
As the name implies names an attribute or category and classifies the data according to presence
or absence of the attribute, e.g. ‘gender,’ ‘nationality,’ ‘native language,’ etc.
3. TABULATION OF DATA
Suppose that the following table shows the reading scores of students in an achievement test.
Student a b c d f g h i j k l
Score 93 95 92 95 100 96 92 96 92 95 92
3.1. Rank Order
The first step is to arrange the scores in order of size, usually from highest to lowest. If two
testees received the same score, each is assigned the average of the ranks they would otherwise occupy.
The next table shows the same scores. The remaining terms used in tabulation of data will be
presented according to this table.
3.4. Percentage
When relative frequency index is multiplied by 100, the result is called percentage.
4. DESCRIPTIVE STATISTICS
4.1. Measures of Central Tendency
4.1.1. Mode
The most easily obtained measure of central tendency is the mode. The mode is the score that
occurs most frequently in a set of scores, e.g. 88 is the mode in
80, 81, 81, 85, 88, 88, 88, 93, 94, 94
Note: When all of the scores in a group occur with the same frequency, it is customary to
say that the group of scores has ‘no mode,’ as in
83, 83, 83, 88, 88, 88, 90, 90, 90, 95, 95, 95
Note: When two adjacent scores have the same frequency, the mode is the average of the
two adjacent scores, so 86.5 is the mode in the following set
80, 82, 84, 85, 85, 88, 88, 90, 94
Note: When two non-adjacent scores have the same frequency, the distribution is bi-
modal.
82, 82, 85, 85, 85, 87, 88, 88, 88, 90, 94
4.1.2. Median
The median (Md) is the score at the 50th percentile in a group of scores, e.g. 85 is the median in
81, 81, 82, 84, 85, 86, 86, 88, 89
Note: If the data are an even number of scores, the median is the point halfway between
the central values when the scores are ranked, e.g. 85 in the following set
81, 81, 82, 84, 86, 86, 88, 90
4.1.3. Mean
Mean is probably the single most often reported indicator of central tendency. It is the same as the
arithmetic average:

X̄ = ΣX / N

Note: If we were to find the deviations of scores from the mean of the set, their sum would
be exactly zero.
Note: The limitation of the mean is that it is seriously sensitive to extreme scores.
Note: Range changes drastically with the magnitude of extreme scores (or outliers).
Set A: 3, 5, 5, 8, 9 (mean: 6)
Set B: 1, 4, 5, 10, 10 (mean: 6)
[Two dot plots on a scale of 1 to 11 illustrate the spread of each set.]
Therefore, we can say the scores in the second set on average deviate more from their mean
than do the scores in the first set. The index of this average deviation is the standard deviation:

SD = √( Σ(X − X̄)² / N )
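A minimal Python sketch (the function names are mine) reproduces the two sets above and confirms that the second set spreads more widely around the same mean:

```python
import math

def mean(scores):
    """Arithmetic average: sum of scores divided by their number."""
    return sum(scores) / len(scores)

def population_sd(scores):
    """Population standard deviation: sqrt of mean squared deviation."""
    m = mean(scores)
    return math.sqrt(sum((x - m) ** 2 for x in scores) / len(scores))

set_a = [3, 5, 5, 8, 9]
set_b = [1, 4, 5, 10, 10]
print(mean(set_a), mean(set_b))                      # 6.0 6.0
print(population_sd(set_a) < population_sd(set_b))   # True: set B deviates more
```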
4.2.3. Variance
To find variance, you simply stop short of the last step in calculating the standard deviation. You
do not need to bother with finding the square root.
Variance = Σ(X − X̄)² / N
You will frequently find the variance shown as S².
Example: The mean and SD of a set of scores are 45 and 5. A student who
obtained 55 has a percentile rank of ---------.
A score of 55 lies two SDs above the mean, so its percentile rank is about 98.
[Number line: 30 35 40 45 50 55 60]

Example: In a test the mean and standard deviation are 32 and 3. A student is ---------
probable to obtain a score higher than 29.
A score of 29 lies one SD below the mean, so about 84% of scores fall above it.
[Number line: 23 26 29 32 35 38 41]
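Both examples can be checked with the normal distribution in Python's standard library (a sketch, not part of the text):

```python
from statistics import NormalDist

# Percentile rank of 55 when mean = 45 and SD = 5 (z = +2):
print(round(NormalDist(45, 5).cdf(55) * 100, 1))     # 97.7

# Probability of scoring above 29 when mean = 32 and SD = 3 (z = -1):
print(round((1 - NormalDist(32, 3).cdf(29)) * 100))  # 84
```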
6. DERIVED SCORES
Raw scores are obtained simply by counting the number of right answers. Raw scores from two
different tests are not comparable. To solve this problem, we could convert the raw scores into
percentile or standard scores.
Percentile scores indicate how a given student’s score relates to the test scores of the
entire group of students.
Standardized scores are obtained by taking into account the mean and SD of any given
set of scores. Standard scores represent a student’s score in relation to how far the score
varies from the test mean in terms of standard deviation units.
6.1. z score
The ‘z score’ just tells you how many standard deviations above or below the mean any score or
observation might be:

z = (X − X̄) / SD

Example: In a set of scores where the mean and SD are 41 and 10, what is the z score of
a student who obtained 51?

z = (X − X̄) / SD = (51 − 41) / 10 = +1

6.2. T score
The formula for calculating the T score is:

T score = 10z + 50

Therefore, the T score of the student in the previous example would be 60.
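Both standardized scores can be sketched in a few lines of Python (function names are mine):

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """z = (X - mean) / SD: distance from the mean in SD units."""
    return (x - mean) / sd

def t_score(z: float) -> float:
    """T = 10z + 50: rescales z to avoid negatives and decimals."""
    return 10 * z + 50

z = z_score(51, 41, 10)
print(z, t_score(z))  # 1.0 60.0
```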
8. CORRELATION
Correlation analysis refers to a family of statistical analyses that determines the degree of
relationship between two sets of numbers. The numerical value representing the degree to which
two variables are related (co-vary, or vary together) is called correlation coefficient. Correlation
is the go-togetherness of two sets of scores. Let’s take the following hypothetical set of scores
and then represent them on a scatter plot.
Positive correlation:
Students Test A Test B
Dean 2 3
Randy 3 5
Joey 4 7
Jeanne 5 9
Kimi 6 11
Shenan 7 13
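The Test A/Test B scores above are perfectly linear (each Test B score equals twice the Test A score minus one), so their correlation coefficient is +1. A minimal Pearson correlation sketch in Python (function name is mine):

```python
def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the SDs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

test_a = [2, 3, 4, 5, 6, 7]    # scores from the table above
test_b = [3, 5, 7, 9, 11, 13]
print(round(pearson_r(test_a, test_b), 4))  # 1.0 -- a perfect positive correlation
```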
Zero Correlation: [scatter plot with no discernible pattern]
Curvi-linear: [scatter plot following a curved pattern]
9. CORRELATIONAL FORMULAS
Correlational values are named after their strength:
Both ± 1 are considered perfect correlations.
– 0.4 ≤ r ≤ + 0.4 are considered weak correlations.
–0.8 ≥ r > –1 and 0.8 ≤ r < 1 are considered strong correlations.
Note: The sign (– or +) of the correlation coefficient doesn’t have any effect on the degree
of association, only on the direction of the association.
The Spearman rank-order correlation coefficient (rho), used with ranked data, takes the following
form:

ρ = 1 − 6ΣD² / (N(N² − 1))
Note: Correlation doesn’t show causality between two variables. It shows relative
positions in one variable are associated with relative positions in the other variable.
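The rho formula can be sketched in Python for the simple case of untied ranks (function names and the opposite-ranking example are mine, not from the text):

```python
def spearman_rho(xs, ys):
    """rho = 1 - 6*sum(D^2) / (N(N^2 - 1)), where D is the difference
    between paired ranks. Assumes no tied scores."""
    def rank(vals):
        order = sorted(vals, reverse=True)      # rank 1 = highest value
        return [order.index(v) + 1 for v in vals]
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(xs)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Two raters ranking five performances in exactly opposite order:
print(spearman_rho([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0
```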
CHAPTER FIVE: TEST CONSTRUCTION
| Instructional objectives / Content | Number of items |
| Reported speech                    | 3               |
| Subjunctive                        | 2               |
| Dangling structure                 | 5               |
3. PREPARING ITEMS
Last year, incoming students ……… on the first day of school.
1) enrolled 2) will enroll 3) will enrolled 4) should enroll
Have you heard the planning committee’s ……… for solving the city’s traffic problems?
1) purpose 2) propose 3) design 4) theory
4. REVIEWING
It is highly recommended that the produced test be reviewed by an outsider in order to obtain his
subjective ideas about, and evaluation of, the test.
5. PRETESTING
Pretesting is defined as administering the newly developed test to a group of examinees with
characteristics similar to those of the target group. The purpose of pre-testing is to determine,
objectively, the characteristics of the individual items and the characteristics of the items altogether.
Example: In a test 20 testees answered an item correctly. If 50 students took the exam, what would be
item facility?
IF = ΣC / N = 20 / 50 = 0.4
Example: A test was given to 75 examinees: 50 answered correctly, 10 answered wrongly, and 15 left
the item blank. What is FV?
IF = ΣC / N = 50 / 60 ≈ 0.83
(The 15 blank responses are excluded from the count, leaving N = 60.)
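Item facility is a single division; a sketch in Python (function name is mine) covering both examples:

```python
def item_facility(correct: int, n_responses: int) -> float:
    """IF = number of correct responses / number of responses counted.
    Blank (omitted) responses are excluded from the denominator."""
    return correct / n_responses

print(item_facility(20, 50))             # 0.4
print(round(item_facility(50, 60), 2))   # 0.83 (75 examinees minus 15 blanks)
```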
Item Discrimination (ID)
Item discrimination refers to the extent to which a particular item discriminates more knowledgeable
examinees from less knowledgeable ones. To compute the item discrimination, the following formula
should be used:

ID = (ΣC_high − ΣC_low) / (½N)

ΣC_high = the number of correct responses to a particular item by the examinees in the high group
ΣC_low = the number of correct responses to a particular item by the examinees in the low group
½N = the total number of responses divided by 2

| Group  | Subjects | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Total |
| Higher | Shenan   | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1  | 9     |
|        | Robert   | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1  | 8     |
|        | Millie   | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0  | 7     |
|        | Kimi     | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1  | 7     |
|        | Jeanne   | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1  | 5     |
| Lower  | Corky    | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1  | 4     |
|        | Dean     | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0  | 4     |
|        | Bill     | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0  | 4     |
|        | Randy    | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0  | 2     |
|        | Mitsuko  | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  | 1     |
Example: If in a class with 50 students, 20 students in the high group and 10 students in the low group
answered an item correctly, then ID equals ---------

ID = (ΣC_high − ΣC_low) / (½N) = (20 − 10) / (½ × 50) = 10 / 25 = +0.4
Example: All the 30 testees in the high group and one-third of the students in the low group answered
item number one correctly. In case there were 100 items in the test, what are IF and ID?

ID = (ΣC_high − ΣC_low) / (½N) = (30 − 10) / (½ × 60) = 20 / 30 = +0.66
IF = ΣC / N = 40 / 60 = 0.66
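The ID formula, sketched in Python with both worked examples (function name is mine; the second result prints as 0.67 when rounded, matching the +0.66 above up to truncation):

```python
def item_discrimination(c_high: int, c_low: int, n_total: int) -> float:
    """ID = (C_high - C_low) / (N / 2), where N is the total number
    of examinees split into high and low halves."""
    return (c_high - c_low) / (n_total / 2)

print(item_discrimination(20, 10, 50))            # 0.4
print(round(item_discrimination(30, 10, 60), 2))  # 0.67
```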
6. VALIDATION
Through validation, which is the last step in test construction process, validity as a characteristic of a
test as a total unit is determined.
1) RELIABILITY
On a ‘reliable’ test, one’s score on its various administrations would not differ greatly. That is,
one’s score would be quite consistent. The notion of consistency of one’s score with respect to
one’s average score over repeated administrations is the central concept of reliability.
X = T + E

where X is the observed score, T is the true score, and E is the error score. When there is no
error, T = X; otherwise the observed score falls above or below the true score (T < X or T > X).
It’s important to keep in mind that we observe the X score – we never actually see the true (T) or
error (E) scores. According to CTS, reliability or unreliability is explained as follows. A measure
is considered reliable if it would give us the same result over and over again (assuming that what
we are measuring isn't changing!).
Now, we don’t speak of the reliability of a measure for an individual – reliability is a
characteristic of a measure that’s taken ‘across’ individuals. Therefore, we can say that the
performances of students on any test will tend to vary from one another, and they can vary
for a variety of reasons. Generally, the variance in a set of scores falls into two general
sources: (a) meaningful variance, related to the purposes of the test or subject matter area
being tested, and (b) error variance, due to other, extraneous sources. Meaningful variation,
which would be predictable, is called systematic variation and contributes to reliability. Error
variation, which may not be predictable, is called unsystematic variation. A list of issues which
are potential sources of error variance is provided in the following table.
The relationship between true, error and observed scores, which was stated by a simple equation,
has a parallel equation at the level of the variance of a measure. That is, across a set of scores, we
assume that:

Vt + Ve = Vx
Reliability is expressed as the ratio of the variance of true scores to the variance of observed
scores. Notationally, this relationship is presented as:
r = Vt / Vx
Example
If the standard deviation of a test were 15 and its reliability were estimated as 0.84,
then what would be the standard error of measurement?
1) 1.5  2) 4  3) 3.5  4) 6

SEM = SD × √(1 − r) = 15 × √(1 − 0.84) = 15 × 0.4 = 6

It can be inferred from the last example that there is a negative relationship between the
standard error of measurement and reliability. When there is no measurement error,
reliability equals +1.
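The SEM formula can be sketched in Python (function name is mine); the second line also illustrates the note above, since SEM drops to zero when reliability is perfect:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

print(round(sem(15, 0.84), 2))  # 6.0 -- option 4 in the example
print(sem(15, 1.0))             # 0.0 -- no measurement error when r = +1
```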
Conceptually this statistic is used to determine a band around a student’s score within which that
student’s score would probably fall. Using a normal distribution, we have:
Example
If the SEM of a set of scores is 2.5, we can be sure that the true score of a student who
obtained 15 would fall, 68% of the time, between ---------
1) 12.5 and 17.5  2) 15 and 17.5  3) 10 and 20  4) 10.5 and 15
Since the item asks for 68% of the time, the band covering one SEM on either side of 15,
i.e. between 12.5 and 17.5, is the answer.
An individual’s response to a given test item does not depend upon how he responds to other
items that are of equal difficulty, i.e. the items comprising a test are independent. This is a
praised characteristic of multiple-choice items.
To estimate the reliability of the whole test from the reliability of one half, we apply the
Spearman-Brown prophecy formula:

r_total = 2(r_half) / (1 + r_half)
Example
The reliability of half of a grammar test is calculated to be 0.35. By applying the
Spearman-Brown prophecy formula, the total reliability would be ---------
1) 0.51  2) 0.63  3) 0.45  4) 0.38

r_total = 2(0.35) / (1 + 0.35) = 0.70 / 1.35 ≈ 0.51
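The prophecy formula in Python (a sketch; the function name is mine):

```python
def spearman_brown(r_half: float) -> float:
    """Full-test reliability from half-test reliability:
    r_total = 2 * r_half / (1 + r_half)."""
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(0.35), 2))  # 0.52 -- closest option above is 0.51
```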
Kuder-Richardson formula 21 (KR-21) is perhaps the most widely used method of internal
consistency estimates:

(KR-21) r = [K / (K − 1)] × [1 − X̄(K − X̄) / (K·V)]

where K is the number of items in the test, X̄ is the mean, and V is the variance.
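KR-21 can be sketched in Python; the function name and the numbers in the usage line are hypothetical, not taken from the text:

```python
def kr21(k: int, mean: float, variance: float) -> float:
    """KR-21 reliability: (K/(K-1)) * (1 - mean*(K - mean) / (K * variance)),
    where K is the number of items."""
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

# Hypothetical test: 50 items, mean score 30, score variance 25.
print(round(kr21(50, 30, 25), 2))  # 0.53
```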