Scoring and Interpretation
On a norm-referenced instrument such as the PPVT-4 scale, raw scores become more meaningful when they are converted to normative scores or other types of derived scores. Normative scores allow for an individual's performance to be compared with that of a well-defined reference group consisting of a large cross section of people of the same age or in the same grade. In addition to being more interpretable than raw scores, these normative scores can be compared among different tests.

The first portion of this chapter describes the various normative scores available for the PPVT-4 instrument. Next, the chapter explains the procedures for obtaining the various score types. In the next section, the calculation and interpretation of confidence intervals are explained. Next, the chapter discusses the growth scale value (GSV) scale, a nonnormative system that is ideal for measuring change. The chapter concludes with instructions for completing several practice scoring exercises.

Types of Normative Scores
The PPVT-4 instrument has two types of normative scores: deviation and developmental. Standard scores, percentiles, normal curve equivalents (NCEs), and stanines are deviation-type normative scores because they indicate how an examinee's raw score compares with the scores of people of the same age or in the same grade. Age equivalents and grade equivalents are developmental-type normative scores that designate where the examinee's raw score falls on a developmental growth curve.

Deviation-Type Normative Scores

Standard Scores
A standard score indicates the distance of the examinee's raw score from the average for people of the same age or grade, taking into account the range of scores among examinees in that reference group. On the PPVT-4 scale, a standard score of 100 is the average score for the person's age or grade. The standard deviation (SD) of PPVT-4 standard scores is 15. The range of standard scores within 1 SD of the mean (that is, between 85 and 115) includes about 68% of the population, the range of 2 SDs (70 to 130) includes about 95%, and the range of 3 SDs (55 to 145) includes more than 99%. The PPVT-4 standard score scale is the same as the scale used in many other tests, which allows for a direct comparison of PPVT-4 scores with the scores obtained on tests of language, achievement, and ability.

Percentiles
Percentiles (also known as percentile ranks) are commonly reported by examiners because they are readily understood. A percentile indicates the percentage of individuals in the reference group who performed at or below the examinee's raw score. Thus, a percentile of 50 signifies that the examinee's raw score is average for examinees of that age or grade. Although percentiles have a simple, straightforward interpretation, they also have limitations. It is important to ensure that they are not misunderstood as being the percentage of test items answered correctly. Also, percentiles are on an ordinal or rank-order scale of measurement, unlike standard scores, which form an interval scale of measurement. Lacking the property of equal distances between units, percentiles cannot be arithmetically manipulated (e.g., added, subtracted, or averaged) in the way that standard scores can.

NCEs
NCEs, like standard scores, communicate the distance between the examinee's raw score and the average raw score in the normative reference group. Many state programs use NCEs in reporting test results, because this scale has the convenient property that several NCE values directly relate to percentile units. In particular, NCEs of 1, 50, and 99 correspond to percentiles of 1, 50, and 99, respectively. However, other NCE values do not have direct relationships to percentiles.
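The population shares quoted for the standard score scale, and the way NCEs line up with percentiles only at 1, 50, and 99, can be checked against a normal curve. The sketch below is illustrative only: it assumes an exactly normal distribution, and the constant 21.06 is the conventional NCE scaling factor, which is not stated in this chapter. The function names are hypothetical.

```python
from statistics import NormalDist

MEAN, SD = 100, 15      # PPVT-4 standard score scale, as given in the text
NCE_SD = 21.06          # conventional NCE scaling constant (assumption)

unit_normal = NormalDist()        # mean 0, SD 1
score_scale = NormalDist(MEAN, SD)


def share_within(k_sd):
    """Proportion of a normal population within k_sd SDs of the mean."""
    return score_scale.cdf(MEAN + k_sd * SD) - score_scale.cdf(MEAN - k_sd * SD)


def nce_from_percentile(percentile):
    """NCE corresponding to a percentile under the normal model."""
    z = unit_normal.inv_cdf(percentile / 100)
    return 50 + NCE_SD * z


if __name__ == "__main__":
    for k in (1, 2, 3):
        low, high = MEAN - k * SD, MEAN + k * SD
        print(f"{low} to {high}: {share_within(k):.1%}")
    for p in (1, 25, 50, 75, 99):
        print(f"percentile {p:>2} -> NCE {nce_from_percentile(p):.1f}")
```

Under these assumptions the NCE and the percentile agree at 1, 50, and 99 but drift apart in between, which is why the two scales should not be read interchangeably.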
Stanines
Stanines are whole-number scores that range from 1 through 9, with a mean of 5 and an SD of 2. Each stanine represents a particular range of percentiles, making stanines useful as cutoff scores and in other applications where a greater level of precision is not needed.

[...] of derived score. A grade equivalent signifies the grade (in tenths of a grade) at which a given raw score is the average score. Thus, a grade equivalent of 3.0 represents the average raw score obtained by students at the beginning of third grade. Age-equivalent values range from 2:0 (i.e., 2 years 0 months) through 24 (approximately the age at which average test performance plateaus).

An age or grade equivalent does not necessarily mean that the examinee's receptive vocabulary knowledge is qualitatively the same as that of the average person at that age or grade. Because of different life experiences, a person aged 15 with a PPVT-4 age equivalent of 11:6 may tend to know a different set of words than the average 11-year-old. Nevertheless, PPVT-4 age and grade equivalents can be useful for selecting instructional materials or interventions that will be of appropriate difficulty for the individual.

Unlike standard scores, age and grade equivalents are not on an interval scale of measurement, and therefore they should not be added, subtracted, or averaged. Also, because age and grade equivalents are based only on average raw scores at different ages or grades and do not take score variability into account, they can appear to be inconsistent with standard scores and percentiles. When interpreting normative scores, one must keep in mind that developmental-type and deviation-type scores provide fundamentally different types of information.

Converting a Raw Score to a Standard Score
Age-based standard scores, which range from 20 to 160, are provided in Table B.1 for individuals aged 2 years 6 months through adulthood. Standard scores for grade (i.e., kindergarten through Grade 12), which also span from 20 to 160, are provided in Table B.2 (Fall, July 1 through December 31) and Table B.3 (Spring, January 1 through June 30). To convert a raw score to a standard score, first mark your choice of norms (Age, Grade: Fall, or Grade: Spring) on the record form cover. Next, consult the applicable table in Appendix B. Locate the section that corresponds to the examinee's chronological age (in years and months) or the examinee's grade (in school years and season). Find the examinee's raw score in the appropriate Raw Score column (A or B). Next, read across this row to the Standard Score column, and locate the corresponding standard score (see Figure 3.1). Transfer this score to the Standard Score box in the Score Summary area on the record form cover.

For example, Noah's raw score of 91 on Form A converts to an age-based standard score of 90 (see Figure 3.2), which is obtained by referring to the section for ages 6:2 through 6:3 in Table B.1.
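For readers who automate record keeping, the choice of grade norm table and the raw-score lookup can be sketched as follows. This is a minimal illustration, not a substitute for Appendix B: the function names are hypothetical, and the lookup dictionary holds only the single conversion quoted in the text (Noah's example), standing in for the full Table B.1.

```python
from datetime import date


def grade_norm_table(test_date):
    """Name the Appendix B table for grade-based standard scores."""
    # Fall norms cover July 1 through December 31; Spring norms cover
    # January 1 through June 30 (date ranges as stated in the text).
    if test_date.month >= 7:
        return "Table B.2 (Grade: Fall)"
    return "Table B.3 (Grade: Spring)"


# Hypothetical stand-in for Table B.1, keyed by (age band, form, raw score).
# Only the one entry quoted in the text is included.
KNOWN_AGE_NORM_ENTRIES = {
    ("6:2-6:3", "A", 91): 90,  # Noah's example
}


def age_standard_score(age_band, form, raw_score):
    """Return the age-based standard score, or None if not in the stand-in."""
    return KNOWN_AGE_NORM_ENTRIES.get((age_band, form, raw_score))


if __name__ == "__main__":
    print(grade_norm_table(date(2024, 10, 15)))
    print(age_standard_score("6:2-6:3", "A", 91))
```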
Obtaining a Percentile, NCE, and Stanine
Percentiles, NCEs, and stanines are obtained by converting the standard score, using Table B.4 in Appendix B. This table applies to both age-based and grade-based standard scores. To find the percentile, NCE, and stanine in Table B.4, read down the standard score column, either on the far left or far right. Then read across to locate the corresponding percentile, NCE, and stanine. Record these values in the designated boxes in the Score Summary area on the record form cover. For example, as illustrated in Figure 3.3, Noah's age-based standard score of 90 converts to a percentile of 25, an NCE of 36, and a stanine of 4.

Converting a Raw Score to an Age Equivalent, Grade Equivalent, or GSV
To convert a raw score to an age equivalent, use Table B.5 in Appendix B. Table B.6, also in Appendix B, can be used to convert the raw score to a grade equivalent. In either table, locate the examinee's raw score in the far-left column. Then, read across to the column for the correct form, which shows the age equivalent (in years and months) or grade equivalent (in grade and tenths of a school year) for that raw score on that form. To obtain the GSV, continue to read across (in either table) to the GSV column for the correct form, which is to the right of the Age Equivalent or Grade Equivalent column. To compare GSVs on the PPVT-III and PPVT-4 instruments, use Table B.7 to convert PPVT-III raw scores to GSVs; for convenience, this table repeats the PPVT-4 GSVs.

Graphical Profile of Deviation-Type Normative Scores
A Graphical Profile is included on the PPVT-4 record
form cover as an aid to interpreting scores and may be
used with either age norms or grade norms. To use the
profile, mark the examinee's standard score on the
Standard Score line. Then draw a straight vertical line
through the standard score and across the other scales.
This Graphical Profile is used again later. For now,
simply verify that the values the drawn line intersects
correspond to the percentile, NCE, and stanine values
you obtained from the tables in Appendix B.
In Figure 3.4, Noah's age-based standard score of 90
is plotted, and the vertical line is drawn. This line
intersects with the percentile, NCE, and stanine values
that match those written in the Score Summary area. It
is important to note that Noah's score falls at the low
end of the average range.
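Noah's three conversions can be cross-checked against a normal curve, in the same spirit as the vertical line on the Graphical Profile. The sketch below is an approximation under stated assumptions and does not replace Table B.4: it assumes an exactly normal distribution, uses the conventional NCE constant of 21.06, derives the stanine by rounding 2z + 5, and clamps each value to its usual range. The function name is hypothetical.

```python
from statistics import NormalDist


def approximate_derived_scores(standard_score, mean=100, sd=15):
    """Approximate (percentile, NCE, stanine) for a standard score."""
    z = (standard_score - mean) / sd

    # Percentile: share of a normal population at or below this score.
    percentile = round(NormalDist().cdf(z) * 100)
    percentile = min(99, max(1, percentile))

    # NCE: normal-curve scale with mean 50, scaled so 1 and 99 match percentiles.
    nce = min(99, max(1, round(50 + 21.06 * z)))

    # Stanine: whole numbers 1 to 9 with mean 5 and SD 2.
    stanine = min(9, max(1, round(2 * z + 5)))

    return percentile, nce, stanine


if __name__ == "__main__":
    print(approximate_derived_scores(90))
```

For a standard score of 90 this approximation agrees with the values read from the table in the text (percentile 25, NCE 36, stanine 4). Scores that fall exactly on a band boundary may round differently from the published table, so the table remains the authority.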
Errors of Measurement and Confidence Intervals
The scores obtained from any test provide only an
estimate of a person's true ability in the trait or attribute
being measured. The true score cannot be known
because some degree of measurement error is always
present in the obtained score. Measurement errors
occur because all human behavior varies from time to
time and because all tests are imprecise to some degree.
The standard error of measurement (SEM) is the
statistic used to indicate the extent to which error
affects individual test scores. It represents the average
amount by which observed scores differ from true
scores. This statistic is calculated from reliability
coefficients using procedures described in Chapter 5 of
this manual.
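The link between a reliability coefficient and the SEM can be illustrated with the classical formula SEM = SD × sqrt(1 − reliability). This is a sketch under assumptions: the manual's own procedure is described in its Chapter 5 and may differ in detail, the reliability of .94 and the observed score of 90 are used purely as illustrative inputs, and the symmetric band around the observed score is a simplification rather than the manual's confidence interval method. The function names are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical SEM: SD times the square root of (1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def symmetric_band(observed, sem, z=1.0):
    """Simple band of z SEMs on either side of the observed score."""
    return observed - z * sem, observed + z * sem


if __name__ == "__main__":
    sem = standard_error_of_measurement(sd=15, reliability=0.94)
    low, high = symmetric_band(observed=90, sem=sem)
    print(f"SEM = {sem:.2f}; band = {low:.1f} to {high:.1f}")
```

The formula makes the direction of the relationship explicit: the closer the reliability is to 1.0, the smaller the SEM and the narrower any interval built from it.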
Tryout, Standardization, and Norms Development
Analyses of Standardization Data
Item analyses were performed on data from the
complete age norm sample, using Rasch techniques to
verify that items were functioning as well as expected
on the basis of national tryout data. The results of the
analyses supported the retention of all items. Four
items, all in the later portions of the test, appeared to
be misordered by difficulty and thus were repositioned
in the item sequence.
The reanalysis of start points by age showed that, for
most ages, the start point could be moved upward by
one item set. With these reset start points,
approximately 85% of examinees established a basal
at their designated starting set.
This chapter discusses technical characteristics of the PPVT-4 instrument that have important implications for the interpretation of scores. The first portion of the chapter reports on the internal consistency, alternate-form, and test-retest reliability of PPVT-4 scores, and explains how the standard errors of measurement (SEMs) were derived and how they can be used when interpreting scores. The chapter concludes with several types of evidence supporting the validity of inferences based on PPVT-4 scores, including content selection procedures, the curve of growth with age, correlations with other tests, and the average scores obtained by individuals with a variety of clinical diagnoses or educational classifications.

Reliability of Scores
Chapter 4 briefly discussed measurement error and confidence intervals. This chapter presents the reliability data on which that information was based and explains how the various confidence intervals were calculated.

Reliability refers to the precision of scores, that is, the degree to which they are free of measurement error. Reliability is expressed on a numerical scale ranging from 0 (no precision) to 1.0 (completely free of error). Several types of reliability were computed for the PPVT-4 instrument that are sensitive to different sources of measurement error. The first two types, split-half reliability and coefficient alpha, are indicators of internal consistency reliability, that is, the degree of consistency of performance on different sections of a test. The third type, alternate-form reliability, reflects the similarity in performance on different but parallel forms administered at about the same time. Both internal consistency and alternate-form reliability mainly are sensitive to measurement error arising from the use of different sets of items. Test-retest reliability, the fourth type, is a measure of stability that indicates the consistency of scores when the same set of items is readministered after a period of time (in this case, about 4 weeks). It is sensitive to measurement error caused by variability over time in the examinee's state (motivation level, fatigue, etc.) as well as by any incidental differences in the administration procedure.

Internal Consistency Reliability
Split-half reliability and coefficient alpha of each form were calculated for each of the 28 age groups in the age norm sample and for each of the 13 groups in the grade norm sample. The procedure for computing split-half reliability began by dividing the form into halves, one containing the odd-numbered items and the other the even-numbered items. The anchored item difficulty values from the calibration of the entire test were used to convert raw scores on the halves to Rasch ability scores, which were then correlated. The Spearman-Brown prophecy formula was applied to these correlations to estimate the preliminary reliability coefficient for the full length of each form. Finally, to prevent differences between the samples taking Form A and Form B from affecting the results, each reliability was adjusted by referencing it to the standard deviation (SD) of ability scores in the complete norm sample at that age or grade.ᵃ The split-half reliabilities are presented in Table 5.1 for the age norm sample and in Table 5.2 for the grade norm sample. As shown in the tables, the split-half reliabilities are consistently very high across the entire age and grade ranges, averaging .94 or .95 on each form. One of the goals of this revision was to improve measurement at the youngest age levels, and the data in these tables indicate that that goal was accomplished. Reliabilities tend to be at least as high, if not higher, at the preschool ages and at kindergarten than at the older ages and higher grades.
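The odd-even split and the Spearman-Brown step can be sketched in a few lines. This is a simplified illustration, not the procedure used for the PPVT-4 norms: it correlates raw half scores rather than Rasch ability scores, it omits the final adjustment to the complete-sample SD, and the response matrix is invented toy data. All function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_brown(half_correlation, length_factor=2.0):
    """Project a half-test correlation to the reliability of the full test."""
    k = length_factor
    return k * half_correlation / (1 + (k - 1) * half_correlation)


def split_half_reliability(item_matrix):
    """Odd-even split-half reliability from rows of 0/1 item scores."""
    odd_half = [sum(row[0::2]) for row in item_matrix]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_matrix]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd_half, even_half))


if __name__ == "__main__":
    # Invented toy responses: 6 examinees by 8 items (1 = correct).
    responses = [
        [1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 1, 0, 0, 0],
    ]
    print(f"{split_half_reliability(responses):.3f}")
```

The Spearman-Brown step is needed because each half contains only half the items, so the raw correlation between halves understates the reliability of the full-length form.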
ᵃ Internal consistency reliabilities (split-half and coefficient alpha) for each form were adjusted by the following method. First, the unadjusted reliability was used to compute the within-form SEM. Next, that SEM and the combined-forms SD were inserted into the basic reliability formula ($\text{reliability} = 1 - \text{SEM}^2/\text{SD}^2$) to produce a reliability value referenced to the SD of ability scores
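The two-step adjustment described in the note can be written out as a short calculation. This is a sketch under assumptions: the within-form SEM is taken to be the within-form SD times sqrt(1 − unadjusted reliability), which the note implies but does not spell out, and the SD values in the usage lines are hypothetical. The function name is also hypothetical.

```python
import math


def adjust_reliability(unadjusted, sd_within_form, sd_combined):
    """Re-reference a reliability coefficient to the combined-forms SD."""
    # Step 1: within-form SEM from the unadjusted reliability (assumed formula).
    sem = sd_within_form * math.sqrt(1.0 - unadjusted)
    # Step 2: basic reliability formula, reliability = 1 - SEM^2 / SD^2.
    return 1.0 - sem ** 2 / sd_combined ** 2


if __name__ == "__main__":
    # Hypothetical SDs: the form sample is slightly less variable than
    # the combined sample, so the adjusted value rises.
    print(f"{adjust_reliability(0.94, sd_within_form=10.0, sd_combined=12.0):.3f}")
```

When the two SDs are equal the adjustment leaves the coefficient unchanged, so it only corrects for differences in score spread between the sample that took one form and the complete sample.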
Validity
Validity is a characteristic of inferences drawn from
test scores. As a simple example, if a person obtains
a PPVT-4 age-based standard score of 100, one might
infer that he or she has a level of receptive vocabulary
that is average for his or her age. The most important
assumption underlying this inference is that the PPVT-4
instrument measures vocabulary level; another
assumption is that the PPVT-4 norms accurately
represent the population. The soundness of the latter
assumption is amply supported by the information
on the norming procedures (see Chapter 4 for details)
and will be further supported in the comparisons of
mean scores on the PPVT-4 and other instruments
reported later in this chapter. The discussion in the
present section focuses primarily on the first
assumption, that is, the question of what the PPVT-4
scale measures; this is referred to as construct validity.