5 Reliability
Example:
A weight scale provides reliable scores if it shows the same weight every time the same object is weighed.
How similar would the students’ scores have been had she assessed them yesterday, or tomorrow, or next week?
How much would the scores have differed had a different teacher scored them?
How much would the scores have differed had the teacher used a different sample of tasks?
Sources of inconsistency between two administrations of an assessment:
▪ Assessments in close succession: attention, fatigue, guessing, memory, effort
▪ Assessments separated by a long period: learning experience, health, forgetting
Characteristics of Reliability
▪ Reliability refers to the results obtained with an assessment instrument and not
the instrument itself
▪ An estimate of reliability always refers to a particular type of consistency
▪ If you want to measure what individuals will be like at some future time, consistency
of scores over time is important
▪ If you want to measure individuals’ current understanding of certain scientific principles, consistency of performance across different tasks is important
Validity
▪ If an assessment has high validity, it also has high reliability; however, high reliability by itself does not guarantee high validity.
▪ Correlation coefficient: A statistic that indicates the degree of relationship between any two sets of scores obtained from the same group of individuals.
Test-retest/stability method
Construct X is measured with the same instrument A (Form 1) administered to the same sample n at Time 1 and again at Time 2, and SCORE 1 is correlated with SCORE 2.
This correlation coefficient indicates how stable the assessment results are over a period of time.
A coefficient close to 1 indicates high reliability.
▪ If the time interval between the two tests is too short, the consistency of the results will be distorted because students will remember the tasks and their responses from the first test.
▪ If the time interval between the two tests is too long, the consistency of the results will be distorted because actual changes in the students will have occurred.
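In practice, the test-retest coefficient is simply the Pearson correlation between the two sets of scores. A minimal sketch in Python, assuming hypothetical scores from two administrations:

import numpy as np

# Hypothetical scores from the same students on the same form,
# obtained at two points in time.
scores_time1 = np.array([52, 47, 60, 55, 49, 63, 58, 51])
scores_time2 = np.array([54, 45, 61, 53, 50, 62, 57, 52])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the test-retest (stability) coefficient.
r_stability = np.corrcoef(scores_time1, scores_time2)[0, 1]
print(f"Test-retest reliability: {r_stability:.2f}")  # close to 1 = high stability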
Equivalent/alternative/parallel forms method (both forms at the same time)
Construct X is measured with instrument A; Form 1 and Form 2 are administered to the same sample n at the same time (Time 1), and SCORE 1 is correlated with SCORE 2.
This correlation coefficient indicates the degree to which the two forms are measuring the same aspects of behaviour.
A coefficient close to 1 indicates high reliability; a coefficient close to 0 indicates lower reliability.
Equivalent/alternative/parallel forms method (with a time interval)
Construct X is measured with instrument A; Form 1 is administered at Time 1 and Form 2 at Time 2 to the same sample n, and SCORE 1 is correlated with SCORE 2.
This correlation coefficient indicates the degree to which the two forms are measuring the same aspects of behaviour, combined with the stability of the results over time.
A coefficient close to 1 indicates high reliability; a coefficient close to 0 indicates lower reliability.
Internal-Consistency Methods: Split-half reliability
▪ There are several internal-consistency methods that require only one administration of an instrument.
▪ Split-half procedure: the two halves of a test are scored separately for each subject, and the correlation coefficient between the two scores is calculated. It indicates the degree to which consistent results are obtained from the two halves of the test.
▪ Methods of splitting: the first half versus the second half; odd- versus even-numbered items; a random selection of items.
Internal-Consistency Methods: Split-half reliability (continued)
Spearman-Brown Formula:
Reliability of full assessment = (2 × correlation between half assessments) / (1 + correlation between half assessments)
Example: Reliability of full assessment = (2 × 0.60) / (1 + 0.60) = 0.75
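A minimal sketch of the split-half procedure with the Spearman-Brown correction, assuming a hypothetical matrix of dichotomously scored items (rows = students, columns = items):

import numpy as np

# Hypothetical item scores: 5 students x 8 items, scored 0/1.
item_scores = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
])

# Split the test into odd- and even-numbered items and score each half.
odd_half = item_scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_half = item_scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]  # correlation between halves
r_full = (2 * r_half) / (1 + r_half)             # Spearman-Brown correction
print(f"Half-test r = {r_half:.2f}, full-test reliability = {r_full:.2f}")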
Internal-Consistency Methods: KR-20, KR-21, Alpha coefficient
Alpha Coefficient:
• It is a generalization of the KR-20 for assessments that have more than dichotomous scores (e.g., each task is scored on a 5-point scale).
• Both coefficients provide information about the degree to which the items or tasks in the assessment measure similar characteristics.
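A minimal sketch of the alpha coefficient, using its standard formula α = k/(k − 1) × (1 − Σ item variances / total-score variance) on hypothetical 5-point task scores:

import numpy as np

# Hypothetical task scores: 5 students x 4 tasks, each scored 1-5.
task_scores = np.array([
    [4, 5, 4, 3],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
])

k = task_scores.shape[1]                          # number of tasks
item_vars = task_scores.var(axis=0, ddof=1)       # variance of each task
total_var = task_scores.sum(axis=1).var(ddof=1)   # variance of total scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Alpha coefficient: {alpha:.2f}")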
Limitations of Internal-Consistency Methods
▪ They are not appropriate for speeded assessments, that is, assessments with time limits that prevent students from attempting every task.
▪ They do not indicate the constancy of student responses from day to day
because there is only one administration.
INTER-RATER RELIABILITY/scorer agreement method
Construct X is measured with instrument A (Form 1) administered to sample n; the same responses are scored independently by Rater 1 (SCORE 1) and Rater 2 (SCORE 2), and the two sets of scores are correlated.
This correlation coefficient indicates the degree to which the relative ordering of responses is consistent from one rater to another.
A coefficient close to 1 indicates high reliability; a coefficient close to 0 indicates lower reliability.
Percentage of agreement
Another simple index of scorer agreement is the percentage of responses to which the two raters assign the same score.
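A minimal sketch of that percentage, assuming hypothetical rubric scores from two raters on the same eight responses:

# Hypothetical rubric scores assigned by two raters to the same responses.
rater1 = [3, 4, 2, 5, 3, 4, 1, 3]
rater2 = [3, 4, 3, 5, 3, 4, 2, 3]

# Count the responses on which the raters agree exactly.
matches = sum(a == b for a, b in zip(rater1, rater2))
pct_agreement = 100 * matches / len(rater1)
print(f"Percentage of agreement: {pct_agreement:.0f}%")  # 75% here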
Standard error of measurement (SEM)
Suppose that we are assessing a student over and over again with the same assessment procedure. We will obviously get different scores each time.
• True score: the score that would be obtained if the test were perfectly reliable.
If a student were tested repeatedly under identical conditions, with no memory, learning, practice, or fatigue effects:
- We could be 68% sure that his/her true score would fall within one SEM of the obtained score.
- We could be 95% sure that his/her true score would fall within two SEMs of the obtained score.
- We could be 99% sure that his/her true score would fall within three SEMs of the obtained score.
• Each obtained score has a confidence band/interval.
For example:
• Tuğrul has a score of 52, and the standard error of measurement is 4.
• What does this mean?
• Tuğrul’s true score is between (52 − 4) and (52 + 4) with 68% confidence. In other words, we are 68% confident that his true score is between 48 and 56.
• Tuğrul’s true score is between (52 − 4×2) and (52 + 4×2) with 95% confidence. In other words, we are 95% confident that his true score is between 44 and 60.
• Tuğrul’s true score is between (52 − 4×3) and (52 + 4×3) with 99% confidence. In other words, we are 99% confident that his true score is between 40 and 64.
Relationship between SEM and Reliability
SEM = SD × √(1 − r)
SD = standard deviation, r = reliability coefficient
As the reliability coefficient increases for any given standard deviation, the standard error of measurement decreases. Conversely, small reliability coefficients are associated with large measurement errors.
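A minimal sketch tying the formula to the Tuğrul example above; the SD and reliability values are hypothetical ones chosen so that SEM = 4:

import math

sd, r = 10.0, 0.84                 # hypothetical SD and reliability
sem = sd * math.sqrt(1 - r)        # SEM = 10 x sqrt(0.16) = 4.0
print(f"SEM = {sem:.1f}")

# Confidence bands around an obtained score of 52.
obtained = 52
for n_sem, confidence in [(1, 68), (2, 95), (3, 99)]:
    low, high = obtained - n_sem * sem, obtained + n_sem * sem
    print(f"{confidence}% band: {low:.0f} to {high:.0f}")  # 48-56, 44-60, 40-64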
Factors Influencing Reliability
▪ Number of assessment tasks: the larger the number of tasks, the higher the reliability (see the sketch after these points).
▪ A longer assessment provides a more adequate sample of the behaviour being measured.
▪ Scores are less affected by chance factors such as familiarity with a given task.
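The doubling formula shown earlier is a special case of the general Spearman-Brown formula, a standard result that predicts reliability when the number of tasks is multiplied by a factor k; a minimal sketch:

# General Spearman-Brown formula: predicted reliability when the
# assessment is lengthened by a factor k (k = 2 reproduces the
# split-half correction shown earlier).
def spearman_brown(r: float, k: float) -> float:
    return (k * r) / (1 + (k - 1) * r)

r_current = 0.60
for k in (1, 2, 3):  # same length, doubled, tripled
    print(f"{k}x tasks -> reliability {spearman_brown(r_current, k):.2f}")
# 1x -> 0.60, 2x -> 0.75, 3x -> 0.82: more tasks, higher reliability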
Objectivity: It refers to the degree to which equally competent scorers obtain the same results.
▪ Raters are important; they should be trained in how to use the rubrics.
▪ Rubrics should be clearly established.
USABILITY
▪ Ease of administration
▪ Directions should be simple and clear
▪ Time needed for the administration should not be too great
▪ Ease of interpretation and application of results
▪ If assessment results are interpreted correctly and applied effectively, they contribute to more intelligent educational decisions.