III: Technical and Methodological Principles

General Considerations
The previous module discussed the characteristics of a good test. For a test to be helpful to a clinician, it should measure what it intends to measure as accurately as possible. This brings us to an important question: what factors do we need to consider before using a test to assess someone psychologically?

Reliability
For a test to be considered suitable, it should first be reliable. Reliability refers to the
“accuracy, precision, or consistency of a score obtained through the test” (Apruebo, 2010). Likewise,
Souza et al. (2017) noted that a reliable test should yield “a consistent result in time and space, or from
different observers, presenting aspects on coherence, stability, equivalence, and homogeneity.” This
means that across different times, different situations, and different test takers, a reliable test will
reproduce a stable score that measures a skill, knowledge area, or domain consistently. In other
words, reliability addresses the degree to which a person's obtained score stays the same even if the
person retakes the same test on different occasions (Groth-Marnat, 2010).
As an illustration, a K-pop enthusiast decides to take an aptitude exam for the Korean language
without any preparation, relying on the phrases she learned from the Korean series she binge-watched.
As a result, she utterly fails the exam. Frustrated with her score but with a strong desire to learn the
language, she enrolls in a review center for the subject and plans to retake the exam. After three
months, she retakes the exam and scores higher than before. The Korean aptitude test can be called a
good test because it consistently measures her aptitude based on her understanding of the subject. If
the test were not reliable, retaking it would only yield an increase or decrease in her score purely by
chance.
However, Kaplan (2009) explained that errors of measurement cannot be avoided, as
discrepancies between true ability and the measurement of that ability are inevitable. Humans are
bound to make mistakes, and our goal is to reduce error in order to “keep testing errors within
reasonably accepted limits” (Groth-Marnat, 2010). In other words, the error of measurement is an
estimate of the range of random fluctuation that can be expected in a person's score.
In psychological assessment, error implies inaccuracy of measurement. Again, tests that are
"relatively free of error" (Kaplan, 2009) are considered reliable. How, then, do we know that a test is
reliable?
This is where reliability analysis comes in, to examine whether the test provides a consistent
measure.

Common Ways of Estimating the Reliability of a Test


When we evaluate reliability, it is important first to identify the source of measurement error
you are trying to estimate.
1. Test-Retest method (Coefficient of Stability)
Test-retest reliability estimates "are used to evaluate error associated with administering a test at two
different times" (Kaplan, 2009). Kaplan further explains that this type of reliability analysis is
appropriate only when we measure "constructs," "characteristics," or "traits" that do not change over
time. An example would be a test measuring intelligence, since we consider this trait to be a general,
stable ability. The coefficient of stability is obtained by correlating the scores obtained on two different
administrations by the same person. The degree of correlation between these two sets of scores shows
how well test scores can be generalized from one occasion to another. If a high correlation exists
between the scores, the results are less likely to be an effect of random changes in the condition of the
person or in the testing environment. Simply put, in the actual application of the test, the examiner can
confidently conclude that differences in obtained scores reflect an actual change in the trait measured
rather than random chance.
However, choose this reliability method carefully, as it is not appropriate for constantly changing
characteristics, such as those assessed by projective tests like the Draw-A-Person, House-Tree-Person,
and Sacks Sentence Completion Test, which tell the clinician about the client's well-being at the
present time.
Common statistical tools for estimating test-retest reliability are correlation, regression, and
multiple regression.
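
To make the computation concrete, here is a minimal Python sketch, using hypothetical scores for ten examinees, of how the coefficient of stability is obtained by correlating two administrations of the same test:

import numpy as np

# Hypothetical scores of the same ten examinees on two administrations of one test
first_administration = np.array([12, 18, 25, 30, 22, 15, 28, 19, 24, 27])
second_administration = np.array([14, 17, 27, 29, 23, 16, 30, 18, 25, 26])

# The coefficient of stability is the Pearson correlation between the two sets of scores
r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest (stability) coefficient: {r:.2f}")

A coefficient close to 1.00 would suggest that the ordering of examinees is stable from one occasion to the next.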

2. Parallel-Forms Method (Equivalence Forms Reliability)


The parallel-forms method compares two equivalent forms of a test that measure the same attribute
(Kaplan, 2009). The two forms contain different items selected to be of the same difficulty level. For
example, suppose you have developed a Frustration-Anxiety Test and you want to know whether your
test items consistently measure that trait. Using this method, you create two forms of the test
(equivalent in item difficulty) and administer them to the same group of people on two different
occasions. Afterward, the equivalent-forms reliability coefficient is calculated as the correlation
between the scores obtained on the two forms by the same group of test takers.
Practically speaking, the parallel-forms method is often impractical and time-consuming: factors such
as the test taker's motivation, fatigue, and cooperation pose a challenge, and two forms of the test that
are identical in difficulty level must be created.

3. Split-Half Reliability Method


Groth-Marnat (2010) argued that this is the best method for determining the reliability of a trait with
a high degree of change. It is also a practical technique, since the test is administered only once and the
items are divided into halves that are scored separately (Kaplan, 2009). Usually the items are divided
using the odd-even method: the odd-numbered items form one group, while the other group is made
up of the even-numbered items. The two half-test scores are then correlated.
Since the test is given only once, the split-half method yields a measure of the internal consistency of
the items. Kaplan (2009) defined internal consistency as the intercorrelation among items within the
same test. A test with good internal consistency measures a single construct, and its items, all assessing
that construct, should agree highly with one another.
This means the method shows whether all test items assess a single construct or trait. To estimate the
reliability of the full test, the Spearman-Brown formula is used, as it estimates what the correlation
between the two halves would have been if each half had been the length of the whole test
(Kaplan, 2009).
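
As a rough illustration, the following Python sketch (with a small set of hypothetical item scores) performs an odd-even split, correlates the two halves, and applies the Spearman-Brown formula to estimate full-test reliability:

import numpy as np

# Hypothetical item scores: 6 examinees by 10 items, each item scored 0-3
items = np.array([
    [3, 2, 3, 3, 2, 3, 3, 2, 3, 3],
    [1, 1, 0, 1, 1, 1, 0, 1, 1, 0],
    [2, 2, 2, 3, 2, 2, 3, 2, 2, 2],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
    [3, 3, 2, 3, 3, 2, 3, 3, 2, 3],
    [1, 2, 1, 1, 2, 1, 2, 1, 1, 2],
])

# Odd-even split: 1st, 3rd, 5th, ... items versus 2nd, 4th, 6th, ... items
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlate the two half-test scores
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction: estimated reliability of the full-length test
r_full = (2 * r_half) / (1 + r_half)
print(f"Half-test correlation: {r_half:.2f}  Spearman-Brown estimate: {r_full:.2f}")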
Aside from the split-half technique, there are other methods for calculating the internal
consistency of a test. If the items are dichotomous (usually scored 0 or 1, Yes or No), one can use the
Kuder-Richardson 20 (KR20) formula.
This technique estimates the reliability of the test from a single administration and considers all
possible ways of splitting the items. As cited by Kaplan (2009), Cronbach (1951) explained that
mathematical proofs have shown the KR20 formula yields the same reliability estimate you would get
by taking the mean of the split-half reliability estimates from all possible ways of dividing the test.
Remember: when you are performing an item analysis with items that are DICHOTOMOUS
(answerable by two options only, e.g., Yes or No, right or wrong, true or false), the KR20 formula is
recommended.
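
For illustration only, a short Python sketch of the KR20 computation on hypothetical dichotomous item data might look like this:

import numpy as np

# Hypothetical dichotomous responses (rows = examinees, columns = items; 1 = correct)
scores = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
])

k = scores.shape[1]                              # number of items
p = scores.mean(axis=0)                          # proportion answering each item correctly
q = 1 - p                                        # proportion answering each item incorrectly
total_variance = scores.sum(axis=1).var(ddof=1)  # variance of examinees' total scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_variance)
print(f"KR-20 reliability estimate: {kr20:.2f}")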


Another method for testing internal consistency is Cronbach’s alpha. It is used to evaluate the internal
consistency of tests that are not scored as right or wrong, such as personality tests and attitude scales.

For example, in answering a personality inventory, you might encounter a statement such as "I would
rather read books than go out and party with people." Typical choices are Strongly Agree, Agree,
Neutral, Disagree, and Strongly Disagree. There is no right or wrong answer; you are simply indicating
where you stand on the range from agreeing to disagreeing with the statement.
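
A minimal Python sketch, using made-up Likert-type responses, of how Cronbach's alpha is commonly computed from the item variances and the total-score variance:

import numpy as np

# Hypothetical Likert responses (rows = respondents, columns = 4 items scored 1-5)
responses = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
    [1, 2, 1, 2],
])

k = responses.shape[1]
item_variances = responses.var(axis=0, ddof=1)
total_variance = responses.sum(axis=1).var(ddof=1)

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha: {alpha:.2f}")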

Kappa statistic (Inter-observer reliability)


What if a clinician with a strong behaviorist foundation uses direct observation of behavior? How do
we evaluate the reliability of a behavioral observation? Suppose you are measuring assertiveness in a
classroom setting. As a researcher, you assign several people to secretly observe the behavior of their
classmates. These observers tabulate the number of observable responses in each "display of
assertiveness" category you choose, so there would be one tally for every instance of "taking the lead"
and "assuming responsibility." After all the observers' scores are tabulated, the kappa statistic is best
used for testing the reliability of such behavioral observations. Introduced by J. Cohen in 1960, kappa
indicates the actual agreement as a proportion of the potential agreement following correction for
chance agreement (Kaplan, 2009).
The kappa statistic is a measure of agreement between observers/raters and has a maximum value of
1.00. The higher the kappa value, the higher the concordance between the raters. Values close to or
below 0.00 indicate a lack of concordance.
Hence, when there is high agreement or concordance between the observers/raters, we can conclude
that the raters introduce less measurement error, making the observational measure reliable.
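
The following Python sketch, using hypothetical behavior codes from two observers, illustrates how kappa corrects the observed agreement for agreement expected by chance:

from collections import Counter

# Hypothetical category codes assigned by two observers to the same 12 behaviors
rater_a = ["lead", "lead", "responsibility", "lead", "responsibility", "lead",
           "responsibility", "lead", "lead", "responsibility", "lead", "responsibility"]
rater_b = ["lead", "lead", "responsibility", "responsibility", "responsibility", "lead",
           "responsibility", "lead", "responsibility", "responsibility", "lead", "responsibility"]

n = len(rater_a)
observed_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: product of each rater's marginal proportions, summed over categories
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
categories = set(rater_a) | set(rater_b)
chance_agreement = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)
print(f"Cohen's kappa: {kappa:.2f}")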

Validity
In psychological assessment, it is important to use a test that measures what it intends to measure.
Imagine taking your mid-term exam in Theories of Personality only to be asked trivia such as "At what
age did Sigmund Freud die?" or "Who coined the term 'schizophrenia'?" "That is so unfair; the test is
INVALID; it does not measure my knowledge of personality theories," you exclaim. A test that is valid
for identifying personality traits should measure what it is intended to measure and should also
produce information useful to clinicians. Validity is the degree to which a certain inference from a test
is appropriate or meaningful. In layman's terms, a valid test measures what it intends to measure.
Groth-Marnat (2010) explained that although an instrument or test can be reliable without being valid,
reaching a certain level of reliability is an important requirement before a test can be valid. Souza et al.
(2017) emphasized that a test that is not reliable cannot be valid; however, a reliable test can
sometimes be invalid. Hence, high reliability does not guarantee a test's validity.

As cited by Apruebo (2010), Nunnally and Bernstein identified three (3) major meanings of validity:
a. Construct validity: measuring psychological domains.
b. Predictive validity: establishing a statistical relationship with a particular criterion.
c. Content validity: sampling from a pool of required content.
Types of Validity Methods
According to Translation Validity (Apruebo, 2010)
Face Validity
One night, while browsing the internet, you become bored and decide to try an English proficiency
test you find online. Some of the questions go like this: "A is for Apple, C is for ___?", "How much wood
would a woodchuck chuck?", and "Nan has 5 siblings: Bab, Beb, Bib, and _____."
After item number 3, you decide to stop answering because you feel you've been duped and it's a
waste of time, since the test clearly doesn't look like an English proficiency test. That is a classic
example of what face validity is all about.
Face validity refers to the appearance of the test; it pertains to the perceived purpose of the test.
In other words, "Does your test look like a test of what it claims to measure?"
For example, if you think you are answering an intelligence test because the test items are composed
of abstract items, then we can say that it has face validity.
Groth-Marnat (2010) pointed out that face validity is really not a type of validity at all, as it does not
offer evidence to support conclusions drawn from test scores. However, bear in mind that it is essential
for a test to "look like" a valid test, as its appearance can help motivate test takers because they can
see that the test is relevant.

Content validity
Say, for example, that you have an upcoming test in General Psychology. You have rigorously studied
your notes and book for the examination and know almost everything, only to find that the professor
has come up with trivial items that do not represent the content of the course. I know how hard that
moment is, which is why it is important for a test to have content validity.
Content validity refers to the degree to which the items of the test are a representative sample of a
universe of content (i.e., they cover all the possible content areas of a construct). In other words, it
shows whether the test comprehensively covers the construct, whether the test has been adequately
constructed, and whether the item contents and the domain they represent were examined by experts.
An example of tests with high content validity is the board licensure examinations.

According to Criterion-related Validity (Apruebo, 2010)


Criterion-related validity means that a test is evaluated for its validity against a set standard, or
criterion, with which the test is compared.
Such evidence is provided by high correlations between a test and a well-defined criterion measure. A
criterion is the standard against which the test is compared.
For example, a test might be used to predict which students will graduate with honors and which will
stop out or drop out. Academic success is the criterion, but it cannot be known at the time the students
take the test.

Predictive Validity
This type of validity measures how well a test's predictions agree with subsequent or future outcomes.
A classic example comes from the United States, where the SAT Critical Reading Test serves as
predictive validity evidence for college admissions testing, used to determine whether the test
accurately forecasts how well high-school students will do in their college studies. The SAT, including
its quantitative and verbal subtests, is the predictor variable, and the college grade point average (GPA)
is the criterion (Kaplan & Saccuzzo, 2015).

Concurrent validity
Say that you, a newly hired Human Resource Specialist, are assigned to hire a chef for a Korean
eat-all-you-can buffet. You have already screened your applicants down to the three (3) with the most
impressive job experience. Since they appear to have the same qualifications, what will be your tool
for hiring the best chef among the three?
One way is to test potential employees on a sample of the behaviors that represent the tasks required
of them. For example, Campion (as cited by Kaplan & Saccuzzo, 2015) found that the most effective
way to select maintenance mechanics was to obtain samples of their mechanical work. Similarly, the
best way to hire the chef is to require each of them to create their best version of Korean samgyupsal;
the best way to showcase their skill is, of course, to cook!
The scenario above is a good instance of the use of concurrent validity. In short, concurrent validity is
the correlation between the test and a criterion when both are measured at the same point in time.
Convergent Validity
Convergent validity is demonstrated by significant, strong correlations between different measures of
the same construct.
For example, you decide to compare your newly constructed depression questionnaire, the Light
Scales, with the Beck Depression Inventory to see whether there is a high correlation between the two
tests.
If the data you obtain show a high correlation, this indicates that the Light Scales do indeed measure
depression.
Discriminant Validity
Discriminant validity refers to the extent to which a measure diverges from operationalizations of
other constructs.
This means that when you use this validity test, your measure should yield low correlations with tests
of constructs it is not supposed to measure.
For example, just for the sake of discussion, a test entitled Resilience Scale should not correlate highly
with the Beck Depression Inventory; if it did, it would mean that the Resilience Scale measures the
wrong construct, namely depression.

Validity Coefficient
The relationship between a test and a criterion is usually expressed as a correlation called a validity
coefficient. This coefficient tells the extent to which the test is valid for making statements about the
criterion.
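
As a simple illustration, the following Python sketch (with hypothetical admission-test scores and first-year GPAs as the criterion) computes a validity coefficient as the correlation between test and criterion:

import numpy as np

# Hypothetical data: admission-test scores and subsequent first-year GPA (the criterion)
test_scores = np.array([520, 610, 450, 700, 580, 490, 640, 560])
gpa = np.array([2.8, 3.4, 2.5, 3.8, 3.1, 2.7, 3.5, 3.0])

# The validity coefficient is simply the correlation between the test and the criterion
validity_coefficient = np.corrcoef(test_scores, gpa)[0, 1]
print(f"Validity coefficient: {validity_coefficient:.2f}")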

Norms
A norm pertains to the performance of a particular reference group to which an examinee's score can
be compared; in this sense, a norm represents the normal or average performance.
It can be expressed as the number of correct items, the time required to finish a task, the number of
errors committed, and so on.
Apruebo (2010) strongly argued that raw scores are meaningless until they are evaluated against
appropriate interpretative standard data or statistical techniques.
In short, a norm is a set of scores from a group of individuals to which the raw score from a
psychological test is compared.

Usage of Norms
Psychological test manuals provide tables of norms to facilitate comparing both individuals and
groups. Several methods and techniques for deriving more meaningful norms, more specifically for
converting "raw scores" into "standard scores," have been widely adopted because all of them reveal
the relative status of individuals within the group.

5 Basic Norming Techniques


1. Measure of Central Tendency
It is a statistical measure to determine a single score that defines the center of a distribution. The
goal of central tendency is to find the single score that is most typical or most representative of the entire
group.

1.1 Mean
Commonly known as the arithmetic average, the mean is computed by adding all the scores in the
distribution and dividing by the number of scores.

1.2 Median
The median is the score that divides a distribution exactly in half: exactly 50% of the individuals in the
distribution have scores at or below the median. The median is equivalent to the 50th percentile.

1.3 Mode
In a frequency distribution, the mode is the score or category that has the greatest frequency.
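
A quick Python illustration of the three measures, using a hypothetical set of scores and the standard library's statistics module:

import statistics

scores = [85, 90, 78, 92, 85, 70, 88, 85, 95, 60]

print("Mean:  ", statistics.mean(scores))    # arithmetic average of all scores
print("Median:", statistics.median(scores))  # middle score of the ordered distribution
print("Mode:  ", statistics.mode(scores))    # most frequently occurring score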

2. Frequency Distribution
A frequency distribution is an organized tabulation of the number of individuals located in each
category of the scale of measurement. It takes a disorganized set of scores and places them in order from
highest to lowest, grouping together all individuals who have the same score.
Personality Traits: Anxiety (ANX)

Score bracket      f      %
59T or less        54     51.92
60T to 69T         41     39.42
70T to 81T          9      8.65
Total              104    100.00

The table above is an example of a frequency distribution. It indicates that the majority of respondents'
scores fall in the bracket of 59T or less, with 54 people (51.92%) obtaining scores in that bracket.

Adapted from Statistics for the Behavioral Sciences by Gravetter, Frederick J. & Wallnau, Larry B. Copyright
©2012 Wadsworth/Cengage Learning
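
As an illustration, the following Python sketch tallies hypothetical anxiety T-scores into brackets similar to those in the table above and reports the frequency and percentage for each bracket:

from collections import Counter

# Hypothetical anxiety T-scores for a small group of respondents
t_scores = [45, 62, 55, 71, 48, 66, 59, 63, 52, 74, 58, 61]

def bracket(score):
    # Assign each score to one of the three brackets used in the table above
    if score <= 59:
        return "59T or less"
    elif score <= 69:
        return "60T to 69T"
    return "70T to 81T"

frequencies = Counter(bracket(s) for s in t_scores)
n = len(t_scores)
for label in ["59T or less", "60T to 69T", "70T to 81T"]:
    f = frequencies[label]
    print(f"{label:12s} f = {f:2d}  ({100 * f / n:.2f}%)")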

In a symmetrical distribution, it is possible to draw a vertical line through the middle so that one side
of the distribution is a mirror image of the other. In a skewed distribution, the scores tend to pile up
toward one end of the scale and taper off gradually at the other end.
The section where the scores taper off toward one end of the distribution is called the tail of the
distribution.
For example, on a very difficult exam, most scores tend to be low, with only a few individuals earning
high scores. This produces a positively skewed distribution.
On the other hand, a very easy exam is inclined to produce a negatively skewed distribution, with most
students earning high scores and only a few earning low ones.

3. Use of Normal curve


A normal distribution (normal curve) is a bell-shaped curve that shows the probability distribution of
a continuous random variable.

4. Percentile Rank
A rank or percentile rank of a particular score is defined as the percentage of individuals in the
distribution with scores at or below the particular value. When a score is identified by its percentile rank,
the score is called a percentile. Percentile describes your exact position within the distribution.
How to interpret percentile ranks:
0-5th percentile      Compartment 1 = Fail
6-10th percentile     Compartment 2 = Low Average
11-50th percentile    Compartment 3 = Below Average
51-85th percentile    Compartment 4 = Average
86-95th percentile    Compartment 5 = High Average
96-99th percentile    Compartment 6 = Excellent
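
A tiny Python sketch of how a percentile rank can be computed for a hypothetical distribution of scores, following the "at or below" definition above:

# Percentile rank: the percentage of scores in the distribution at or below a given score
scores = [45, 52, 58, 61, 63, 66, 71, 74, 78, 85]

def percentile_rank(distribution, score):
    at_or_below = sum(1 for s in distribution if s <= score)
    return 100 * at_or_below / len(distribution)

print(percentile_rank(scores, 66))  # 60.0 -> the score 66 sits at the 60th percentile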

5. Stanine System
a. Raw scores are transformed into nine groups.
b. A stanine of 1 is the lowest and 9 is the highest.
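
One common way to derive stanines (an assumption here, since the module only describes the nine groups) is to standardize the raw scores and map them onto a nine-point scale with a mean of 5 and a standard deviation of 2. A minimal Python sketch using hypothetical raw scores:

import numpy as np

# Hypothetical raw scores to be converted into stanines
raw_scores = np.array([55, 62, 48, 70, 66, 59, 74, 52, 68, 61])

# Standardize, then map to the standard-nine scale (mean 5, SD 2), clipped to 1-9
z = (raw_scores - raw_scores.mean()) / raw_scores.std(ddof=1)
stanines = np.clip(np.round(z * 2 + 5), 1, 9).astype(int)

for raw, st in zip(raw_scores, stanines):
    print(f"raw score {raw} -> stanine {st}")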

Basic Principles of Using Norms


1. There are two approaches to norm construction:
a. Criterion-referenced approach: Did the person satisfy the set standard?
b. Norm-referenced approach: The person's performance is judged relative to others; it is norm-dependent.
c. The norm used must be appropriate for the person being assessed.
2. Always check the indicators of appropriateness:
a. Indicator 1: Nationality (origin). Local norms should be applicable to your present clients.
b. Indicator 2: Age
c. Indicator 3: Gender
i. Constructs with established gender differences: verbal ability, numerical ability, emotional sensitivity, aggression
ii. Constructs without established gender differences: general IQ, self-esteem
3. Norms should be constantly updated.
A norm is good for only about five years, so it needs to be updated to remain a good representation of the given population.


How are norms constructed?
1. Construct a psychological test: Is there a need for that test?
2. Pilot the test: administer the test to a group.
3. Apply norming techniques to the obtained scores.

References and Supplementary Materials


Books and Journals
1. Cohen, R. J., & Swerdlik, M. E. (2018). Psychological testing and assessment: An introduction
to tests and measurement. New York, NY: McGraw-Hill Education.
2. Kaplan, R. M., & Saccuzzo, D. P. (2018). Psychological testing: Principles, applications, and
issues. Belmont, CA: Wadsworth Cengage Learning.
3. Apruebo, R. A. (2010). Psychological testing, Volume 1 (1st ed.). Quezon City: Central Book
Supply.
4. Groth-Marnat, G., & Wright, A. J. (2010). Handbook of psychological assessment (6th ed.).
5. Kaplan, R. M., & Saccuzzo, D. P. (2013). Psychological testing: Principles, applications, and issues
(8th ed.). Belmont, CA: Wadsworth Cengage Learning.
6. Gravetter, F. J., & Wallnau, L. B. (2012). Statistics for the behavioral sciences.
Belmont, CA: Wadsworth/Cengage Learning.
Online Supplementary Reading Materials
1. Swanson, E. (2014, June). Validity, Reliability, and the Questionable Role of Psychometrics in
Plastic Surgery. Retrieved August 15, 2018, from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4174233/
