III - Technical and Methodological Principles
General Considerations
The previous module discussed the characteristics a good test should have. For a test to be helpful to a clinician, it should measure what it intends to measure as accurately as possible. This brings us to an important question: what factors do we need to consider before using a test to assess someone psychologically?
Reliability
For a test to be considered suitable, it should, first of all, be reliable. Reliability refers to the "accuracy, precision, or consistency of a score obtained through the test" (Apruebo, 2010). Likewise, Souza et al. (2017) mentioned that a test should yield "a consistent result in time and space, or from different observers, presenting aspects on coherence, stability, equivalence, and homogeneity." This means that across different times, different situations, and different test takers, a reliable test will reproduce a stable score that measures a skill, knowledge, or domain consistently. In other words, reliability addresses the degree to which the score a person obtains stays the same even if the person retakes the same test on different occasions (Groth-Marnat, 2010).
As an illustration, a K-Pop enthusiast decides to take an aptitude exam for the Korean language without any preparation, relying on the phrases she learned from the Korean series she binge-watched. As a result, she utterly fails the exam. Frustrated with her score but driven by a strong desire to learn the language, she enrolls in a review center and plans to retake the exam. After three months, she retakes the exam and scores higher than before. The Korean aptitude test can be called a good test because it consistently measures her aptitude based on her understanding of the subject. If the test were not reliable, retaking it would yield an increase or decrease in her score based purely on chance.
However, Kaplan (2009) explained that errors of measurement cannot be avoided, as discrepancies between true ability and measured ability are inevitable. Humans are bound to make mistakes, and our goal is to lessen the error, to "keep testing errors within reasonably accepted limits" (Groth-Marnat, 2010). In other words, the error of measurement is an estimate of the range of random fluctuation that can be expected in a person's score.
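One standard way to quantify this expected range, though the text above does not name it, is the standard error of measurement (SEM). Below is a minimal sketch in Python; the standard deviation and reliability values are hypothetical, chosen only for illustration.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r), where r is the test's reliability coefficient."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: score SD = 15, reliability r = .90
sem = standard_error_of_measurement(sd=15, reliability=0.90)
print(f"SEM = {sem:.2f}")  # ~4.74

# A person scoring 100 would be expected, roughly 68% of the time,
# to score within one SEM of that value on a retest:
print(f"Expected band: {100 - sem:.1f} to {100 + sem:.1f}")
```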
In psychological assessment, error implies inaccuracy of measurement. Again, tests that are "relatively free of error" (Kaplan, 2009) are considered reliable. How, then, do we know that a test is reliable?
This is where reliability analysis comes in: it examines whether the test provides a consistent measure.
For example, in answering a personality inventory, you might encounter a statement such as "I would rather read books than go out and party with people." Typical choices on such a test are Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree. There is no right or wrong answer; you are simply indicating where you stand on a continuum from agreement to disagreement with the statement.
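To make "reliability analysis" concrete, here is a minimal sketch of one common internal-consistency index, Cronbach's alpha, computed over invented responses to Likert items like the one above (scored 1 = Strongly Disagree through 5 = Strongly Agree). The data and the three-item scale are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(items: list[list[int]]) -> float:
    """Cronbach's alpha: `items` is a list of columns, one per test item,
    each holding every respondent's score on that item."""
    k = len(items)                                    # number of items
    item_vars = sum(pvariance(col) for col in items)  # sum of item variances
    totals = [sum(row) for row in zip(*items)]        # each person's total score
    return (k / (k - 1)) * (1 - item_vars / pvariance(totals))

# Hypothetical responses of 5 people to 3 Likert items (1-5 scale)
item1 = [5, 4, 4, 2, 3]
item2 = [5, 5, 4, 1, 3]
item3 = [4, 4, 5, 2, 2]
print(f"alpha = {cronbach_alpha([item1, item2, item3]):.2f}")
```

For these invented data alpha comes out around .92, which would suggest the items hang together consistently; values near zero would signal an unreliable measure.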
Validity
In psychological assessment, it is important to use a test that measures what it intends to measure. Just imagine taking your mid-term exam in Theories of Personality only to face trivia questions such as "At what age did Sigmund Freud die?" or "Who coined the term schizophrenia?" "That is so unfair; the test is INVALID; it does not measure my knowledge of personality theories," you exclaim. A test that is valid for identifying personality traits should measure what it is intended to measure and should also produce information useful to clinicians. Validity is the degree to which a certain inference from a test is appropriate or meaningful. In layman's terms, a valid test measures what it intends to measure.
Groth-Marnat (2010) explained that although an instrument or test can be reliable without being valid, reaching a certain level of reliability is a prerequisite for validity. Souza et al. (2017) likewise emphasized that a test that is not reliable cannot be valid, whereas a reliable test can sometimes still be invalid. Hence, high reliability does not guarantee a test's validity.
As cited in Apruebo (2010), Nunnally and Bernstein identified three (3) major meanings of validity:
a. Construct Validity - measuring psychological domains or constructs
b. Predictive Validity - establishing a statistical relationship with a particular criterion
c. Content Validity - sampling from a pool of required content
Types of Validity Methods
Translation Validity (Apruebo, 2010)
Face Validity
One night, while browsing the internet, you become bored and decide to try an English proficiency test online. Some of the questions go like this: "A is for Apple, C is for ___?", "How much wood would the woodchuck chuck?", and "Nan has 5 siblings: Bab, Beb, Bib, and _____."
After the third item, you decide to stop answering the test because you feel you have been duped; it is a waste of time, since it clearly does not look like an English proficiency test. That is a classic example of what face validity is all about.
Face validity refers to the appearance of the test. It pertains to the perceived purpose of the test.
In other words, "Does your test look like a test?"
For example, if you think you are answering an intelligence test because the items consist of abstract reasoning problems, then we can say that the test has face validity.
Groth-Marnat (2010) implied that it is not really a type of validity at all, as it does not offer evidence to support conclusions drawn from test scores. However, bear in mind that it is still essential for a test to "look like" a valid test, as these appearances help motivate test takers, who can see that the test is relevant.
Content Validity
Say, for example, you have an upcoming test in General Psychology. You have rigorously studied your notes and book for the examination and know almost everything, only to find that the professor has come up with trivial items that do not represent the content of the course. I know how hard that moment is, which is why it is important for a test to have content validity.
Content validity refers to the degree to which the items of a test are a representative sample of a universe of content (i.e., it covers all the possible content areas of a construct). In other words, it shows whether the test provides comprehensive coverage of the construct. It also shows whether the test has been adequately constructed and whether the item content and the domain it represents were examined by experts.
An example of a test with high content validity is the Board Licensure Examinations.
Predictive Validity
This type of validity measures how well a test's predictions agree with subsequent and/or future outcomes. A classic example comes from the United States, where scores on the SAT Critical Reading Test serve as predictive validity evidence for college admissions when they accurately forecast how well applicants will perform in college.
Concurrent Validity
Say that you, as a newly hired Human Resource Specialist, are assigned to hire a chef for a Korean eat-all-you-can buffet. You have already screened the applicants down to the three (3) with the most impressive job experience. Since they appear to have the same qualifications, what will be your tool for hiring the best chef among the three?
One way is to test potential employees on a sample of behaviors that represent the tasks that will be required of them. For example, Campion (as cited in Kaplan & Saccuzzo, 2015) found that the most effective way to select maintenance mechanics was to obtain samples of their mechanical work. Similarly, the best way to hire the chef is to require the applicants to create their best version of Korean samgyupsal; the best way to showcase their skill is, of course, to cook!
The scenario above is a good instance of the use of concurrent validity. In short, concurrent validity is the correlation between the test and a criterion when both are measured at the same point in time.
Convergent Validity
Convergent validity is demonstrated by significant, strong correlations between different measures of the same construct.
For example, you decide to test your newly constructed depression questionnaire, the Light Scales, by comparing it with Aaron Beck's Depression Inventory to see if there is a high correlation between the two tests.
If the data you obtain show a high correlation, the Light Scales indeed appear to measure depression.
Discriminant Validity
This refers to the extent to which a measure diverges from operationalizations of other constructs. This means that when you use this validity test, it should yield low correlations with tests that measure constructs opposite to, or distinct from, yours.
For example, just for the sake of discussion, a test entitled the Resilience Scale should not correlate highly with Aaron Beck's Depression Inventory, because a high correlation would mean that the Resilience Scale measures the wrong construct, namely depression.
Validity Coefficient
The relationship between a test and a criterion is usually expressed as a correlation called a validity
coefficient. This coefficient tells the extent to which the test is valid for making statements about the
criterion.
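To tie the last three ideas together, here is a minimal sketch that computes a validity coefficient as a Pearson correlation. All scores, including those for the hypothetical Light Scales, are invented for illustration: a high r against the Beck scores would be convergent evidence, while a near-zero r against the resilience scores would be discriminant evidence.

```python
from math import sqrt

def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation: covariance of x and y over the product of their spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented scores for 6 test takers
light_scales = [10, 14, 18, 22, 26, 30]   # hypothetical new depression measure
beck_scores  = [9, 13, 20, 21, 27, 29]    # established depression criterion
resilience   = [20, 14, 25, 11, 23, 16]   # a construct expected to diverge

print(f"convergent   r = {pearson_r(light_scales, beck_scores):.2f}")  # high, near 1
print(f"discriminant r = {pearson_r(light_scales, resilience):.2f}")   # near zero
```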
Norms
This pertains to the performance of a particular reference group to which an examinee's score can be
compared. This means a norm is a normal or average performance.
It can be expressed as the number of correct items, the time required to finish a task, the number
of errors committed, etc.
Apruebo (2010) strongly argued that raw scores are meaningless until they can be evaluated against appropriate interpretive standards or statistical techniques.
In short, a norm is a set of scores from a group of individuals against which the raw score from a psychological test is compared.
Usage of Norms
Psychological test manuals provide tables of norms to facilitate comparing both individuals and
groups. However, several methods and techniques for deriving into more meaningful norms, more
specifically, "standard scores" from "raw scores," have been widely adopted because all of them reveal
the relative status of individuals within the group.
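As one small illustration of deriving standard scores from raw scores, the sketch below converts hypothetical raw scores into z-scores and then into T-scores (mean 50, SD 10, the scale used in the anxiety table later in this section).

```python
from statistics import mean, pstdev

raw_scores = [35, 42, 47, 50, 58, 61, 66]  # hypothetical raw scores from a norm group

m, sd = mean(raw_scores), pstdev(raw_scores)

for x in raw_scores:
    z = (x - m) / sd   # z-score: distance from the mean in SD units
    t = 50 + 10 * z    # T-score: rescaled to mean 50, SD 10
    print(f"raw {x:2d} -> z {z:+.2f} -> T {t:5.1f}")
```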
1. Measures of Central Tendency
1.1 Mean
The mean, commonly known as the arithmetic average, is computed by adding all the scores in the distribution and dividing by the number of scores.
1.2 Median
The median is the score that divides a distribution exactly in half: exactly 50% of the individuals in the distribution have scores at or below the median. The median is equivalent to the 50th percentile.
1.3 Mode
In a frequency distribution, the mode is the score or category that has the greatest frequency.
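A minimal sketch computing all three measures of central tendency on one invented set of scores, using Python's standard library:

```python
from statistics import mean, median, mode

scores = [82, 75, 90, 75, 68, 88, 75, 91]  # hypothetical test scores

print(f"mean   = {mean(scores):.2f}")  # sum of scores / number of scores -> 80.50
print(f"median = {median(scores)}")    # middle value of the sorted scores -> 78.5
print(f"mode   = {mode(scores)}")      # most frequent score -> 75 (appears 3 times)
```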
2. Frequency Distribution
A frequency distribution is an organized tabulation of the number of individuals located in each
category of the scale of measurement. It takes a disorganized set of scores and places them in order from
highest to lowest, grouping together all individuals who have the same score.
Personality Trait: Anxiety (ANX)

T-score range    f     %
59T or less      54    51.92
60T to 69T       41    39.42
70T to 81T        9     8.65
Total           104   100.00
An example of a frequency distribution. The table indicates that the majority of respondents' scores fall in the bracket of 59T or less; that is, 54 respondents obtained scores in that range.
Adapted from Statistics for the Behavioral Sciences by Gravetter, Frederick J. & Wallnau, Larry B. Copyright
©2012 Wadsworth/Cengage Learning
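To show how such a table might be tallied, here is a small sketch; the individual T-scores are invented, since the table above reports only the grouped counts for its 104 respondents.

```python
from collections import Counter

t_scores = [45, 52, 58, 61, 63, 67, 70, 74, 55, 66, 48, 71, 62, 59]  # invented

def bracket(t: int) -> str:
    """Assign a T-score to one of the table's three brackets."""
    if t <= 59:
        return "59T or less"
    elif t <= 69:
        return "60T to 69T"
    return "70T to 81T"

freq = Counter(bracket(t) for t in t_scores)
n = len(t_scores)
for label in ("59T or less", "60T to 69T", "70T to 81T"):
    f = freq[label]
    print(f"{label:12s}  f = {f:2d}   % = {100 * f / n:.2f}")
```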
3. The Shape of a Distribution
In a symmetrical distribution, it is possible to draw a vertical line through the middle so that one side of the distribution is a mirror image of the other. In a skewed distribution, the scores tend to pile up toward one end of the scale and taper off gradually at the other end.
The section where the scores taper off toward one end of the distribution is called the tail of the distribution.
For example, in a very difficult exam, most scores tend to be low, with only a few individuals earning high scores. This produces a positively skewed distribution.
On the other hand, a very easy exam is inclined to produce a negatively skewed distribution, with most students earning high scores and only a few earning low ones.
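A rule of thumb not stated in the text above: in a positively skewed distribution the mean is typically pulled above the median by the high tail, and in a negatively skewed one it falls below. A small sketch with invented exam scores:

```python
from statistics import mean, median

hard_exam = [10, 12, 15, 15, 18, 20, 22, 45, 60]  # scores pile up low, tail to the right
easy_exam = [40, 78, 85, 88, 90, 90, 92, 95, 97]  # scores pile up high, tail to the left

for name, scores in (("hard exam", hard_exam), ("easy exam", easy_exam)):
    m, md = mean(scores), median(scores)
    skew = "positive" if m > md else "negative"
    print(f"{name}: mean = {m:.1f}, median = {md}, likely {skew} skew")
```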
4. Percentile Rank
A rank or percentile rank of a particular score is defined as the percentage of individuals in the
distribution with scores at or below the particular value. When a score is identified by its percentile rank,
the score is called a percentile. Percentile describes your exact position within the distribution.
How to interpret percentile ranks:

 0-5   %ile   Compartment 1 = Fail
 6-10  %ile   Compartment 2 = Low Average
11-50  %ile   Compartment 3 = Below Average
51-85  %ile   Compartment 4 = Average
86-95  %ile   Compartment 5 = High Average
96-99  %ile   Compartment 6 = Excellent
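A minimal sketch of the definition above, where the percentile rank of a score is the percentage of scores at or below it; the class scores are invented.

```python
def percentile_rank(score: float, distribution: list[float]) -> float:
    """Percentage of scores in the distribution at or below the given score."""
    at_or_below = sum(1 for s in distribution if s <= score)
    return 100 * at_or_below / len(distribution)

scores = [55, 60, 62, 67, 70, 72, 75, 80, 85, 92]  # hypothetical class scores

print(percentile_rank(75, scores))  # 70.0 -> a score of 75 sits at the 70th percentile
```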
5. Stanine System
a. Raw scores are transformed into nine groups ("standard nine").
b. 1 is the lowest stanine and 9 the highest.
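The source lists only these two properties, so the band widths below are an assumption on my part: stanines are conventionally assigned from percentile ranks using bands of roughly 4, 7, 12, 17, 20, 17, 12, 7, and 4 percent.

```python
def stanine(pr: float) -> int:
    """Map a percentile rank (0-100) to a stanine from 1 (lowest) to 9 (highest),
    using the conventional 4-7-12-17-20-17-12-7-4 percent bands."""
    cutoffs = [4, 11, 23, 40, 60, 77, 89, 96]  # cumulative upper bounds for stanines 1-8
    for s, cut in enumerate(cutoffs, start=1):
        if pr <= cut:
            return s
    return 9  # anything above the 96th percentile

for pr in (2, 15, 50, 85, 99):
    print(f"percentile rank {pr:2d} -> stanine {stanine(pr)}")
```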