
CH 4
The Reliability and Validity of Psychological Tests
The concepts of systematic and random errors of measurement are very important when we assess individual differences, and they lead to the important aspect of psychometrics known as reliability theory.
One fundamental and entirely uncontroversial characteristic of psychological tests is that each scale should assess one (and only one) psychological characteristic.
Item 1: I often feel anxious. Yes (2) - Uncertain (1) - No (0)

Item 2: A good, loud party is the best way to celebrate. Yes (2) - Uncertain (1) - No (0)

Item 3: I have been to see my doctor because of nerves. Yes (2) - Uncertain (1) - No (0)

Item 4: I hate being on my own. Yes (2) - Uncertain (1) - No (0)

This scale measures two distinct concepts (anxiety: items 1 and 3; sociability: items 2 and 4), so we cannot hope to draw any conclusion from the total score alone, as the scoring patterns below show.
Scoring 2, 0, 2, 0: anxious and unsociable
Scoring 0, 2, 0, 2: non-anxious and sociable
Scoring 1, 1, 1, 1: moderately anxious and moderately sociable
All three patterns give the same total of 4, even though they describe very different people.
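A minimal sketch of this scoring logic in Python (the item-to-trait grouping follows the example above; the function name is purely illustrative):

def subscale_scores(responses):
    # responses: the four item scores (each 0, 1 or 2).
    # Anxiety is measured by items 1 and 3, sociability by items 2 and 4,
    # so a single total across all four items mixes the two traits.
    anxiety = responses[0] + responses[2]
    sociability = responses[1] + responses[3]
    return anxiety, sociability

for pattern in ([2, 0, 2, 0], [0, 2, 0, 2], [1, 1, 1, 1]):
    anx, soc = subscale_scores(pattern)
    print(pattern, "total =", sum(pattern), "anxiety =", anx, "sociability =", soc)

Each pattern has the same total of 4, but the two subscale scores distinguish the anxious-unsociable, the non-anxious-sociable, and the moderate respondents.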
We must therefore ensure that all of the items in a particular scale measure one trait.
There is always some error of measurement associated with measures of size, mass or volume (e.g., a digital kitchen scale when we weigh flour, or a surveyor's tape).
Taking 100 measurements and averaging them reduces this error.
The variables that can affect the accuracy of a physical measurement are well known and few.
Random error is caused by any factors that randomly affect
measurement of the variable across the sample. For instance, each
person's mood can inflate or deflate their performance on any
occasion. In a particular testing session, some children may be in a
good mood and others may be depressed. If mood affects their
performance on the measure, it may artificially inflate the
observed scores for some children and artificially deflate them for
others. The important thing about random error is that it does not
have any consistent effects across the entire sample. Instead, it
pushes observed scores up or down randomly. This means that if
we could see all of the random errors in a distribution they would
have to sum to 0 -- there would be as many negative errors as
positive ones. The important property of random error is that it
adds variability to the data but does not affect average
performance for the group. Because of this, random error is
sometimes considered noise.
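This point can be illustrated with a small simulation sketch (the numbers here are arbitrary and purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
true_scores = np.full(1000, 100.0)           # everyone's true score is 100
random_error = rng.normal(0, 5, size=1000)   # random error with mean 0
observed = true_scores + random_error

print(random_error.mean())   # close to 0: positive and negative errors roughly cancel
print(observed.mean())       # close to 100: the group average is barely affected
print(observed.std())        # clearly above 0: random error adds variability ("noise")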

Systematic errors are errors that hold across most or all of the members of a group. Unlike random errors, sources of systematic error will not tend to cancel out when repeated measurements are made under the same physical conditions.

Systematic error is caused by any factors that systematically
affect measurement of the variable across the sample. For
instance, if there is loud traffic going by just outside of a
classroom where students are taking a test, this noise is likely
to affect all of the children's scores -- in this case, systematically
lowering them. Unlike random error, systematic errors tend to
be consistently either positive or negative -- because of this,
systematic error is sometimes considered to be bias in
measurement.

Sources of measurement error
Item selection
Test administration
Test scoring
Reducing Measurement Error
1) pilot test the instruments, getting feedback from the
respondents regarding how easy or hard the measure
was and information about how the testing environment
affected their performance.
2) train those who administer the instrument so that they aren't
unconsciously introducing error.
3) double-check the data thoroughly. All data entry for
computer analysis should be verified.
4) use statistical procedures to adjust for measurement
error.
5) finally, use multiple measures of the same construct, especially
if the different measures don't share the same systematic errors.
Good measurement instruments are those that are little
influenced by either random or systematic error.
Taking multiple measurements under any set of physical
conditions and averaging the results reduces the impact of
random errors.
Averaging measurements from different instruments will tend
to reduce the effects of systematic error.
Measurement error and reliability
Measurement error reduces reliability or repeatability
Assumptions from classical theory:
Measurement errors are random
Mean error of measurement = 0
True scores and errors are uncorrelated: r_te = 0
Errors on different tests are uncorrelated: r_12 = 0
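These assumptions can be written compactly in standard classical test theory notation (standard textbook notation; the decomposition itself is not spelled out on the slide):

\begin{align*}
X &= T + E && \text{observed score = true score + random error} \\
\bar{E} &= 0 && \text{mean error of measurement is zero} \\
r_{te} &= 0 && \text{true scores and errors are uncorrelated} \\
r_{12} &= 0 && \text{errors on two different tests are uncorrelated}
\end{align*}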
Reliability types: stability and internal consistency
Methods: test-retest, split-halves, parallel forms

The reliability of mental tests
Cronbach's alpha (coefficient alpha)
The square root of coefficient alpha is a very close approximation to the correlation between individuals' scores on a particular mental test and their true scores.
Standard error of measurement: SEM = SD x sqrt(1 - alpha).
Alpha is based on the average size of the correlations between the test items.
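As a minimal sketch (not from the slides; the data and variable names are illustrative), coefficient alpha and the SEM can be computed from a people-by-items score matrix like this:

import numpy as np

def cronbach_alpha(scores):
    # scores: 2-D array, rows = people, columns = items
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                               # number of items
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

data = np.array([[2, 1, 2, 1],     # made-up responses of 5 people to 4 items
                 [0, 1, 0, 0],
                 [1, 1, 1, 2],
                 [2, 2, 2, 1],
                 [0, 0, 1, 0]])
alpha = cronbach_alpha(data)
sem = data.sum(axis=1).std(ddof=1) * np.sqrt(1 - alpha)   # SEM = SD x sqrt(1 - alpha)
print(alpha, sem)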
Other approaches to the measurement of reliability
Split-half reliability: the correlation of the total score based on the odd-numbered test items with the total score based on the even-numbered items, corrected to the length of the whole test using the Spearman-Brown formula. Now that computers are readily available, there seems to be no good reason to use it today.
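A sketch of the odd-even split-half computation with the Spearman-Brown correction (illustrative only; scores is again a people-by-items matrix as in the alpha sketch above):

import numpy as np

def split_half_reliability(scores):
    # Correlate the odd-item total with the even-item total,
    # then step it up to full-test length with Spearman-Brown.
    scores = np.asarray(scores, dtype=float)
    odd_total = scores[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even_total = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r = np.corrcoef(odd_total, even_total)[0, 1]
    return 2 * r / (1 + r)                     # Spearman-Brown: reliability = 2r / (1 + r)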
Test-retest reliability: we can call it stability. It checks whether trait scores stay more or less constant over time.
Conditions required:
Nothing significant has happened to the participants in the interval between the two tests (e.g., no emotional crises, developmental changes, or significant educational experiences that might affect the trait), and the test is a good measure of the trait.
If a test shows that a child is a genius one month and of average intelligence the next, it cannot be a dependable measure of intelligence.
The time between the first administration and the second is very important: leave an interval long enough to minimize the likelihood of people remembering their previous answers, but not so long that developmental changes, learning or other life events affect individuals' positions on the trait.
The problem of test-retest as compared with alpha: test-retest is based on the total score; it says nothing about how people perform on individual items. Whereas alpha shows whether a set of items measures some single, underlying trait, a set of items that had nothing in common could still have perfect test-retest reliability.
Parallel-forms reliability: in order to create two parallel forms of a test, items are administered to a large sample of people, and pairs of items with similar content, difficulty and item discrimination are identified, in such a way that the two versions produce similar distributions of scores.
Type of Reliability / How to Measure

Test-Retest: Give the same assessment twice, separated by days, weeks, or months. Reliability is stated as the correlation between scores at Time 1 and Time 2 (a value between 0 and 1). Problems: practice, memory.

Alternate Forms: Create two forms of the same test (vary the items slightly). Reliability is stated as the correlation between scores on Form A and Form B. Problem: equality of the forms.

Split-half reliability (Cronbach alpha): Split the test items into two groups (even, odd), give the test to the group, and correlate the total of the odd items with the total of the even items (reliability = 2r / (1 + r)). Problems: equality of the halves; the reliability is not for the whole test.
Validity (overview)
Face validity
Content validity (logical)
Criterion validity: concurrent, predictive
Construct validity: convergent (theory predicts a relationship), divergent (theory predicts no relationship)

Test validity: according to reliability theory we can show whether or not a set of test items seems to measure some underlying trait. What it cannot do is shed any light on the nature of that trait. If we construct a scale or test and we think that this set of items measures a particular trait, there is no guarantee that they actually do so. Even if a set of items appears to form a scale, it is not possible to tell what that scale measures just by looking at the items or just because we claim that it does.
Reliability is necessary for a test to be valid, since low reliability implies that the test is not measuring any single trait. However, high reliability itself does not guarantee validity, since this depends entirely on how, why, and with whom the test is used.
Face validity: it checks that the test looks as if it measures what it is supposed to. Inspecting the content of items is no guarantee that the test will measure what it is intended to. Having a high alpha does not mean that the scale measures the concept that it was designed to assess. (Using judges, 80%.)
Content validity: sometimes it is possible to construct a test which must be valid, by definition. For example, consider constructing a spelling test: since, by definition, the dictionary contains the whole domain of items, any procedure that produces a representative sample of words from this dictionary has to be a valid test of spelling ability. (Content analysis.)
Face validity
Face validity is the simplest form.
Essentially just a subjective judgment about
whether the measure or test items appear to
be measuring what we want them to measure.

Logical validity
Ask judges to categorize the items according to
the test dimensions
Criterion validity
Whether the test gives results in agreement
with other measures of the same thing.
Two types:
concurrent: comparison of new test with
established test.
predictive: does the test predict some future event
(e.g. Intelligence test and exam results)?
Obviously concurrent validity is dependent on
the quality of the first test!
Predictive validity is the extent to which a score on a
scale or test predicts scores on some criterion measure.
For example, the validity of a cognitive test for job
performance is the correlation between test scores and,
for example, supervisor performance ratings. Such a
cognitive test would have predictive validity if the
observed correlation were statistically significant.

In predictive validity, we assess the operationalization's ability to predict something
it should theoretically be able to predict. For instance, we
might theorize that a measure of math ability should be able to
predict how well a person will do in an engineering-based
profession. We could give our measure to experienced
engineers and see if there is a high correlation between scores
on the measure and their salaries as engineers. A high
correlation would provide evidence for predictive validity -- it
would show that our measure can correctly predict something
that we theoretically think it should be able to predict.
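A hedged sketch of what such a check might look like in practice (the data here are invented solely for illustration; a real study would use actual test and criterion scores):

import numpy as np
from scipy import stats

test_scores = np.array([52, 61, 47, 70, 58, 66, 49, 73, 55, 64])            # e.g. math-ability scores
criterion = np.array([3.1, 3.6, 2.8, 4.2, 3.3, 3.9, 2.9, 4.5, 3.2, 3.8])    # e.g. a job-performance measure

# The predictive-validity coefficient is the correlation between test and criterion;
# the p-value indicates whether that correlation is statistically significant.
r, p = stats.pearsonr(test_scores, criterion)
print(r, p)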
Construct validity
A test has construct validity if it accurately measures a theoretical, non-observable construct or trait.
Does the measure tap the concept being studied (face validity and predictive validity are really types of construct validity)?
There is no simple way of establishing construct validity, but it is clearly very important.
Often we assess the relationships between items in the test to see if they all appear to be measuring the same thing.
Construct validity:
Divergent validity is the exact opposite of convergent validity. Suppose you are measuring a construct believed to have no relationship to something else. If there really is no relationship and your measurement has good construct validity, you would expect scores on your measurement to be essentially unrelated to scores on a measure of the divergent construct.

Convergent validity: checks that the test scores relate to other things as expected. Measures of constructs that theoretically should be related to each other are, in fact, observed to be related to each other; that is, you should be able to show a correspondence or convergence between similar constructs.
Convergent: a test has convergent validity if it has a high correlation with another test that measures the same construct.
Divergent: divergent validity is demonstrated through a low correlation with a test that measures a different construct (two tests that measure different traits); it is the extent to which the new measure correlates poorly with measures of different, unrelated constructs.
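A small illustrative sketch of this pattern (all data invented): a new anxiety scale should correlate highly with another anxiety measure (convergent) and only weakly with a sociability measure (divergent):

import numpy as np

new_anxiety = np.array([12, 18, 9, 22, 15, 20, 11, 17])
other_anxiety = np.array([14, 19, 10, 21, 16, 22, 12, 18])   # same construct, different test
sociability = np.array([16, 14, 19, 15, 12, 18, 13, 17])     # a different, unrelated construct

r_convergent = np.corrcoef(new_anxiety, other_anxiety)[0, 1]  # expected to be high
r_divergent = np.corrcoef(new_anxiety, sociability)[0, 1]     # expected to be near zero
print(r_convergent, r_divergent)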
Type of Validity / Definition / Example or Non-Example

Content: The extent to which the content of the test matches the instructional objectives. A semester or quarter exam that only includes content covered during the last six weeks is not a valid measure of the course's overall objectives -- it has very low content validity.

Criterion: The extent to which scores on the test are in agreement with (concurrent validity) or predict (predictive validity) an external criterion. If the end-of-year math tests in 4th grade correlate highly with the statewide math tests, they would have high concurrent validity.

Construct: The extent to which an assessment corresponds to other variables, as predicted by some rationale or theory. If you can correctly hypothesize that students with high iman (Islamic faith) have low depression (because of theory), the assessment may have construct validity.
