1. Psychological tests aim to measure individual differences accurately, but they are subject to systematic and random errors of measurement that reduce reliability.
2. A fundamental characteristic of psychological tests is that each scale should measure only one psychological trait; the worked scoring example shows why a four-item test that mixes anxiety and sociability items cannot be interpreted from its total score.
3. Both random and systematic errors can affect psychological test scores. Random errors vary unpredictably across people, while systematic errors consistently affect all people in the same way. Sources of error include item selection, test administration, and scoring.
CH 4
The Reliability and Validity of Psychological Tests
Prof. Khalaf Nassar

The concepts of systematic and random errors of measurement are very important when we assess individual differences, and they lead to the important area of psychometrics known as reliability theory. One fundamental and entirely uncontroversial characteristic of psychological tests is that each scale should assess one (and only one) psychological characteristic. Consider a four-item example:

Item 1: I often feel anxious. Yes (2) - Uncertain (1) - No (0)
Item 2: A good, loud party is the best way to celebrate. Yes (2) - Uncertain (1) - No (0)
Item 3: I have been to see my doctor because of nerves. Yes (2) - Uncertain (1) - No (0)
Item 4: I hate being on my own. Yes (2) - Uncertain (1) - No (0)
Items 1 and 3 measure anxiety; items 2 and 4 measure sociability. Because the test measures two distinct concepts, the total score tells us nothing:

Scoring 2, 0, 2, 0 (total 4): anxious and unsociable.
Scoring 0, 2, 0, 2 (total 4): non-anxious and sociable.
Scoring 1, 1, 1, 1 (total 4): moderately anxious and moderately sociable.

Three very different people obtain the same total, so we cannot hope to draw any conclusion from it. We must therefore ensure that all of the items in a particular scale measure one trait.

There is always some error of measurement associated with measures of size, mass or volume (think of a digital kitchen scale when we weigh flour, or a surveyor's tape where 100 measurements are averaged). For physical measurement the sources of error are well known and only a few variables can affect accuracy.

Random error is caused by any factor that randomly affects measurement of the variable across the sample. For instance, each person's mood can inflate or deflate their performance on any occasion. In a particular testing session, some children may be in a good mood and others may be depressed. If mood affects their performance on the measure, it may artificially inflate the observed scores for some children and artificially deflate them for others. The important thing about random error is that it does not have any consistent effect across the entire sample; instead, it pushes observed scores up or down at random. This means that if we could see all of the random errors in a distribution they would sum to zero: there would be as many negative errors as positive ones. Random error adds variability to the data but does not affect average performance for the group. Because of this, random error is sometimes described as noise.
Systematic errors are errors that hold across most or all of the members of a group. Unlike random errors, sources of systematic error will not tend to cancel out when repeated measurements are made under the same conditions.
Systematic error is caused by any factors that systematically affect measurement of the variable across the sample. For instance, if there is loud traffic going by just outside of a classroom where students are taking a test, this noise is likely to affect all of the children's scores -- in this case, systematically lowering them. Unlike random error, systematic errors tend to be consistently either positive or negative -- because of this, systematic error is sometimes considered to be bias in measurement.
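To make the distinction concrete, here is a minimal simulation sketch (not from the original slides; the trait scale, sample size and error magnitudes are invented for illustration) showing that random error leaves the group mean essentially unchanged while adding variability, whereas systematic error shifts every score in the same direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
true_scores = rng.normal(loc=50, scale=10, size=n)    # hypothetical true trait scores

# Random error: varies from person to person, averages out to about zero
random_error = rng.normal(loc=0, scale=5, size=n)
observed_random = true_scores + random_error

# Systematic error: affects everyone the same way (e.g. loud traffic during testing)
observed_systematic = true_scores - 3.0

print(true_scores.mean(), observed_random.mean())       # group means stay close: random error is noise
print(true_scores.std(), observed_random.std())         # spread increases: random error adds variability
print(observed_systematic.mean() - true_scores.mean())  # about -3: systematic error biases every score
```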
Sources of measurement error
- Item selection
- Test administration
- Test scoring

Reducing measurement error
1) Pilot test the instruments, getting feedback from respondents about how easy or hard the measure was and about how the testing environment affected their performance.
2) Train the people who administer the instrument so that they are not unconsciously introducing error.
3) Double-check the data thoroughly; all data entry for computer analysis should be verified.
4) Use statistical procedures to adjust for measurement error.
5) Finally, use multiple measures of the same construct, especially if the different measures do not share the same systematic errors.

Good measurement instruments are those that are little influenced by either random or systematic error. Taking multiple measurements under one set of conditions and averaging the results reduces the impact of random errors; averaging measurements from different instruments will tend to reduce the effects of systematic error.

Measurement error and reliability
Measurement error reduces reliability (repeatability). The assumptions of classical test theory are:
- Measurement errors are random.
- The mean error of measurement is 0.
- True scores and errors are uncorrelated: r(true, error) = 0.
- Errors on different tests are uncorrelated: r(e1, e2) = 0.

Reliability types
- Stability: test-retest, parallel forms.
- Internal consistency: split-halves, coefficient alpha.
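The classical-theory assumptions can be illustrated with a short simulation sketch (not part of the original slides; the score scale and error size are hypothetical): observed score = true score + error, the errors average to zero and are uncorrelated with true scores, and reliability can then be viewed as the proportion of observed-score variance that is true-score variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
T = rng.normal(100, 15, size=n)   # true scores on a hypothetical IQ-like scale
E = rng.normal(0, 5, size=n)      # random measurement errors
X = T + E                         # observed scores

print(round(E.mean(), 2))                 # ~0: mean error of measurement is zero
print(round(np.corrcoef(T, E)[0, 1], 2))  # ~0: true scores and errors are uncorrelated
reliability = T.var() / X.var()           # share of observed variance due to true scores
print(round(reliability, 2))              # close to 15**2 / (15**2 + 5**2) = 0.9
```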
The reliability of mental tests

Coefficient alpha (Cronbach's alpha): the square root of coefficient alpha is a very close approximation to the correlation between individuals' scores on a particular mental test and their true scores. Alpha reflects the average size of the correlations between the test items.

Standard error of measurement: SEM = SD x sqrt(1 - alpha).

Other approaches to the measurement of reliability

Split-half reliability: the correlation between the total score based on the odd-numbered test items and the total score based on the even-numbered items, corrected up to the length of the whole test using the Spearman-Brown formula. Now that computers make alpha easy to calculate, there seems to be no good reason to use it today.

Test-retest reliability: also called stability; it checks whether trait scores stay more or less constant over time. It requires that:
- Nothing significant has happened to the participants in the interval between the two tests (e.g. no emotional crises, developmental changes, or significant educational experiences that might affect the trait).
- The test is a good measure of the trait. If a test shows that a child is a genius one month and of average intelligence the next, something is wrong.
- The interval between the first and second administrations is chosen carefully: long enough to minimize the likelihood of people remembering their previous answers, but not so long that developmental changes, learning or other life events alter individuals' positions on the trait.

The problem with test-retest reliability, as compared with alpha, is that it is based on the total score and says nothing about how people perform on individual items. Whereas alpha shows whether a set of items measures some single, underlying trait, a set of items that had nothing in common could still have perfect test-retest reliability.

Parallel-forms reliability: to create two parallel forms of a test, items are administered to a large sample of people, and pairs of items with similar content, difficulty and item discrimination are identified, so that the two versions produce similar distributions of scores.

Types of reliability and how to measure them:
- Test-retest: give the same assessment twice, separated by days, weeks, or months; reliability is the correlation between scores at Time 1 and Time 2 (between 0 and 1). Problems: practice, memory.
- Alternate (parallel) forms: create two forms of the same test (vary the items slightly); reliability is the correlation between scores on Form A and Form B. Problem: equality of the forms.
- Split-half (cf. Cronbach's alpha): split the test items into two groups (odd, even), give the test to the group, and correlate the total of the odd items with the total of the even items; corrected reliability = 2r / (1 + r). Problems: equality of the halves, and the uncorrected coefficient is not for the whole test.
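As a concrete illustration of the formulas above, the following sketch (the person-by-item score matrix is invented; it mimics a short scale scored 0/1/2) computes coefficient alpha, the standard error of measurement SD x sqrt(1 - alpha), and an odd/even split-half coefficient corrected with the Spearman-Brown formula 2r / (1 + r).

```python
import numpy as np

# Hypothetical item scores: rows = respondents, columns = items scored 0/1/2
scores = np.array([
    [2, 2, 1, 2],
    [0, 1, 0, 0],
    [1, 2, 1, 1],
    [2, 1, 2, 2],
    [0, 0, 1, 0],
    [1, 1, 1, 2],
])

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

alpha = cronbach_alpha(scores)
sem = scores.sum(axis=1).std(ddof=1) * np.sqrt(1 - alpha)    # SEM = SD * sqrt(1 - alpha)

odd, even = scores[:, 0::2].sum(axis=1), scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
split_half = 2 * r_half / (1 + r_half)                       # Spearman-Brown correction

print(round(alpha, 2), round(sem, 2), round(split_half, 2))
```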
Validity

[Slide diagram: types of validity: face; content (logical); criterion, with concurrent and predictive forms; construct, with convergent (theory predicts a relationship) and divergent (theory predicts no relationship) forms.]
Test validity: reliability theory can show whether or not a set of test items seems to measure some underlying trait. What it cannot do is shed any light on the nature of that trait. If we construct a scale or test and believe that this set of items measures a particular trait, there is no guarantee that it actually does so. Even if a set of items appears to form a scale, it is not possible to tell what that scale measures just by looking at the items, or just because we claim that it measures something. Reliability is necessary for a test to be valid, since low reliability implies that the test is not measuring any single trait. However, high reliability by itself does not guarantee validity, since validity depends entirely on how, why, and with whom the test is used.

Face validity: the simplest form of validity. It checks that the test looks as if it measures what it is supposed to measure; essentially, it is a subjective judgement about whether the test items appear to measure what we want them to measure (for example, by asking judges and requiring 80% agreement). Inspecting the content of the items is no guarantee that the test will measure what it is intended to, and a high alpha does not mean that the scale measures the concept it was designed to assess.

Content validity: sometimes it is possible to construct a test that must be valid by definition. For example, when constructing a spelling test, since by definition the dictionary contains the whole domain of items, any procedure that produces a representative sample of words from the dictionary has to be a valid test of spelling ability (content analysis).
Logical validity: ask judges to categorize the items according to the test's dimensions.

Criterion validity: whether the test gives results in agreement with other measures of the same thing. Two types: concurrent validity, the comparison of the new test with an established test; and predictive validity, whether the test predicts some future event (e.g. an intelligence test and later exam results). Obviously, concurrent validity depends on the quality of the established test.

Predictive validity is the extent to which a score on a scale or test predicts scores on some criterion measure. For example, the validity of a cognitive test for job performance is the correlation between test scores and, say, supervisor performance ratings; such a cognitive test would have predictive validity if the observed correlation were statistically significant.
In predictive validity we assess the operationalization's ability to predict something it should theoretically be able to predict. For instance, we might theorize that a measure of math ability should be able to predict how well a person will do in an engineering-based profession. We could give our measure to experienced engineers and see whether there is a high correlation between scores on the measure and their salaries as engineers. A high correlation would provide evidence for predictive validity: it would show that our measure can correctly predict something that we theoretically think it should be able to predict.

Construct validity: a test has construct validity if it accurately measures a theoretical, non-observable construct or trait. Does the measure tap the concept being studied? (Face validity and predictive validity are really types of construct validity.) There is no simple way of establishing construct validity, but it is clearly very important. Often we assess the relationships between items in the test to see whether they all appear to be measuring the same thing.

Divergent validity is the exact opposite of convergent validity. Suppose you are measuring a construct believed to have no relationship to something else. If there really is no relationship, and if your measurement has good construct validity, you would expect scores on your measure to be essentially unrelated to scores on a measure of the divergent construct.
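The convergent/divergent logic can be sketched with simulated data (the construct names, sample size and correlation strengths below are invented): correlate the new measure with an established measure of the same construct, where a high correlation is expected, and with a measure of an unrelated construct, where a near-zero correlation is expected.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
anxiety_new = rng.normal(size=n)                                          # scores on a new anxiety scale
anxiety_established = 0.8 * anxiety_new + rng.normal(scale=0.6, size=n)   # established anxiety scale
unrelated = rng.normal(size=n)                                            # measure of an unrelated construct

convergent = np.corrcoef(anxiety_new, anxiety_established)[0, 1]   # should be high
divergent = np.corrcoef(anxiety_new, unrelated)[0, 1]              # should be near zero
print(round(convergent, 2), round(divergent, 2))
```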
Convergent validity: checks that the test scores relate to other things as expected. Measures of constructs that theoretically should be related to each other are, in fact, observed to be related to each other; that is, you should be able to show a correspondence or convergence between similar constructs. A test has convergent validity if it has a high correlation with another test that measures the same construct. Divergent validity, by contrast, is demonstrated through a low correlation with a test that measures a different, unrelated construct: the new measure should correlate poorly with measures of constructs it is not supposed to tap.

Types of validity, with definitions and examples:
- Content: the extent to which the content of the test matches the instructional objectives. A semester or quarter exam that only includes content covered during the last six weeks is not a valid measure of the course's overall objectives; it has very low content validity.
- Criterion: the extent to which scores on the test are in agreement with (concurrent validity) or predict (predictive validity) an external criterion. If the end-of-year math tests in 4th grade correlate highly with the statewide math tests, they have high concurrent validity.
- Construct: the extent to which an assessment corresponds to other variables, as predicted by some rationale or theory. If you can correctly hypothesize that students with high iman (Islamic faith) have low depression (because theory predicts it), the assessment may have construct validity.