Module 4 PT
PT NOTES
Steps
1) Planning the test.
2) Writing the items of the test.
3) Preliminary administration (trial run) of the test.
4) Checking the reliability and validity of the test.
5) Preparing the norms for the test.
6) Preparing the manual of the test and reproducing the test.
2. Writing the items of the test – This requires a great deal of creativity and depends on the imagination, expertise, and knowledge of the test constructor. Its requirements are:
- In-depth knowledge of the subject.
- Awareness of the aptitude and ability of the individuals to be tested.
- A large vocabulary, to avoid confusion in writing; words should be simple and descriptive enough for everybody to understand.
- Proper assembly and arrangement of the items, generally in ascending order of difficulty.
- Detailed instructions on the objective, the time limit, and the steps for recording answers.
- Help from experts to cross-check for subject and language errors.
3. Preliminary Administration – After modifying the items as per the experts' advice, the test is tried out on an experimental basis to prune out any inadequacy or weakness in the items. This highlights ambiguous items, irrelevant choices in multiple-choice questions, and items that are too difficult or too easy to answer. It also establishes the duration of the test and the number of items to be kept in the final test, and removes repetition and vagueness from the instructions. This is done in the following three stages:
a) The pre-try-out – It is administered informally to a small sample to spot glaring ambiguities in the items and instructions before the proper try-out.
b) The proper try-out – It is administered to approximately four hundred people, with the sample drawn from the same population as the intended test-takers. It is done to remove the poor or less significant items and to retain the good ones, and it includes two activities:
- Item analysis – The difficulty of the test should be moderate, with each item discriminating between high and low achievers. Item analysis is the process of judging the quality of an item.
- Post item analysis – The final test is framed by retaining good items that have a balanced level of difficulty and satisfactory discrimination. The blueprint guides the selection of the number of items, which are then arranged by difficulty, and the time limit is set.
c) The final try-out – It is administered to a large sample in order to estimate reliability and validity. It indicates how effective the test will be when the intended sample is subjected to it.
4. Reliability and Validity of the test – When the test is finally composed, it is administered again to a fresh sample in order to compute the reliability coefficient. Here, too, the sample should not be smaller than 100. Reliability, which shows the consistency of test scores, is calculated through the test-retest method, the split-half method, and the equivalent-form method. Validity refers to what the test measures and how well it measures it; a test that measures the trait it intends to measure can be said to be valid. Validity is estimated by correlating the test with some outside independent criterion.
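To make the reliability computations concrete, here is a minimal Python sketch, assuming a made-up respondents-by-items 0/1 score matrix; the data, variable names, and the simulated retest are illustrative only, not part of these notes.

```python
# Minimal sketch: two common reliability estimates on illustrative data.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
scores = (rng.random((120, 20)) > 0.4).astype(int)  # 120 people x 20 items (made up)

# Split-half reliability: correlate odd- and even-item half scores,
# then apply the Spearman-Brown correction for full test length.
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)
r_half, _ = pearsonr(odd_half, even_half)
split_half = 2 * r_half / (1 + r_half)

# Test-retest reliability: correlate total scores from two administrations.
# Here the "retest" is simulated by perturbing the first totals slightly.
test1 = scores.sum(axis=1)
test2 = test1 + rng.integers(-2, 3, size=test1.size)
retest, _ = pearsonr(test1, test2)

print(f"split-half (Spearman-Brown): {split_half:.2f}")
print(f"test-retest: {retest:.2f}")
```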
5. Norms of the final test – The test constructor also prepares norms for the test. Norms are defined as average performance scores, and they are prepared so that scores obtained on the test can be interpreted meaningfully. Raw scores on a test convey no meaning by themselves regarding the ability or trait being measured, but when they are compared with norms, a meaningful inference can be drawn immediately. The norms may be age norms, grade norms, etc., as discussed earlier; the same norms cannot be used for all tests.
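A small Python sketch of how norms let a raw score be interpreted, assuming a hypothetical norming sample; the scores and the `interpret` helper are illustrative names, not from these notes.

```python
# Minimal sketch: interpreting a raw score against norming-sample statistics.
import numpy as np
from scipy.stats import percentileofscore

norm_sample = np.array([12, 15, 18, 20, 21, 22, 22, 24, 27, 30])  # made-up raw scores

mean, sd = norm_sample.mean(), norm_sample.std(ddof=1)

def interpret(raw_score):
    """Convert a raw score into a z-score and a percentile rank
    relative to the norming sample."""
    z = (raw_score - mean) / sd
    pct = percentileofscore(norm_sample, raw_score)
    return z, pct

z, pct = interpret(24)
print(f"raw 24 -> z = {z:.2f}, percentile rank = {pct:.0f}")
```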
6. Preparation of manual and reproduction of the test – The manual is prepared as the last step, and the psychometric properties of the test, its norms, and references are reported in it. It details the process of administering the test, its duration, and the scoring technique, and it contains all instructions for the test.
Item Analysis
Item analysis is a method used to evaluate the quality of test items. It involves calculating
various indices to assess the difficulty, reliability, validity, and discrimination of each item.
Item Difficulty
Item difficulty is an index that measures how difficult an item is for test-takers. It is calculated as the proportion of test-takers who answered the item correctly:
p = R / N
where R is the number of test-takers who answered the item correctly and N is the total number of test-takers. The index, denoted p (or p_i for item i), can range from 0 to 1.
- The difficulty level should be higher than the chance probability of guessing the correct answer.
- The optimum difficulty level lies midway between the chance probability c of guessing the correct answer and 1.0, i.e. (c + 1)/2.
Example:
If 70% of test-takers answered an item correctly and the chance probability of guessing the correct answer is 0.50 (as on a two-option item), the optimum difficulty would be (0.50 + 1)/2 = 0.75, so the observed p = 0.70 is close to the optimum.
If there are 5 options, the guessing chance is c = 1/5 = 0.20, and the calculation would be as follows:
- Guessing chance: 0.20
- Optimum difficulty: (0.20 + 1)/2 = 0.60
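The difficulty index and the optimum-difficulty rule above can be computed directly; the following Python sketch uses a made-up response matrix (rows are test-takers, columns are items).

```python
# Minimal sketch: item difficulty p = R/N and the optimum level (c + 1)/2.
import numpy as np

# rows = test-takers, columns = items; 1 = correct, 0 = incorrect (made up)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
])

p = responses.mean(axis=0)   # proportion correct per item (R / N)
c = 1 / 5                    # chance probability for 5-option items
optimum = (c + 1) / 2        # midpoint between chance and 1.0

print("item difficulties:", np.round(p, 2))
print(f"optimum difficulty for 5 options: {optimum:.2f}")
```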
Item Discrimination
Item discrimination is an index that measures how well an item differentiates between high-performing and low-performing test-takers, that is, between individuals with different levels of the construct being measured. A higher discrimination index indicates a better item.
Items can be selected or evaluated on the basis of their relationship to some external criterion: the items are compared against an external standard or measure to determine their effectiveness in discriminating between high and low scorers. However, this method requires that an external criterion be available.
When an external criterion is not available, item discrimination can be investigated against the total test score: the items are compared with the overall performance of the test-takers to determine their ability to differentiate between high and low scorers. However, this method is effective only when the original item pool measures a single attribute, and that attribute is a component of the external criterion or construct being assessed.
Considerations for Complex Attributes
For tests that measure complex attributes, item discrimination based on comparison with total scores may not be sufficient: it can reduce criterion coverage and lower the validity of the test. It is important to consider the specific attributes being measured and the relationship between the items and the external criterion.
Satisfactory Item
A satisfactory item is one that has the highest external validity and the lowest coefficient of internal consistency. Such an item demonstrates a strong relationship with the external criterion while contributing variance not already covered by the other items, which preserves criterion coverage.
A. Extreme Groups Method (Upper-Lower 27%)
- In a normally distributed set of scores, the upper and lower 27% of scorers serve as the optimum comparison groups.
- Identify the number of people in each of the two groups who answered the item correctly.
- Subtract the number of correct answers in the low-scoring group from the number in the high-scoring group.
- Divide the result by the number of people in each group to obtain a rough index of discriminative value, as in the sketch below.
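A minimal Python sketch of the upper-lower 27% procedure just described, using simulated 0/1 response data; all names and values are illustrative.

```python
# Minimal sketch: discrimination index D from upper/lower 27% groups.
import numpy as np

rng = np.random.default_rng(1)
responses = (rng.random((100, 10)) > 0.5).astype(int)  # 100 people x 10 items (made up)

totals = responses.sum(axis=1)
order = np.argsort(totals)
n = int(round(0.27 * len(totals)))     # size of each extreme group

low_group = responses[order[:n]]       # bottom 27% of total scores
high_group = responses[order[-n:]]     # top 27% of total scores

# D = (correct in high group - correct in low group) / group size
D = (high_group.sum(axis=0) - low_group.sum(axis=0)) / n
print("discrimination indices:", np.round(D, 2))
```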
B. Point Biserial Method
- The correlation is computed between a dichotomously scored item (correct/incorrect) and the total test score, a continuous variable representing overall test performance.
- A negative correlation indicates a poor item that does not effectively discriminate (see the sketch below).
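A short Python sketch of the point-biserial method using scipy's pointbiserialr on simulated data; correlating each item with the rest-score (total minus the item) is a common refinement, not something these notes prescribe.

```python
# Minimal sketch: point-biserial correlation between each item and the rest-score.
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(2)
responses = (rng.random((100, 10)) > 0.5).astype(int)  # made-up 0/1 responses
totals = responses.sum(axis=1)

for i in range(responses.shape[1]):
    # Use the rest-score (total minus this item) so the item does not
    # correlate with itself through the total.
    rest = totals - responses[:, i]
    r, _ = pointbiserialr(responses[:, i], rest)
    flag = "  <- poor (negative)" if r < 0 else ""
    print(f"item {i}: r_pb = {r:.2f}{flag}")
```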
Criteria of Testing
- The criteria of testing refer to the standards or guidelines used to evaluate the quality and
effectiveness of a test.
- These criteria help determine whether a test is reliable, valid, and fair.
- Reliability: The consistency and stability of test scores over time and across different administrations.
- Validity: The extent to which the test measures what it intends to measure.
- Fairness: Ensuring that the test does not discriminate against any particular group of individuals based on factors such as race, gender, or socioeconomic status.
Test Manual
The purpose of the manual is to provide all relevant information about the test. It should include details about the test's norms, cultural applicability, reliability, and validity in different contexts.
Establishing Norms
Norms are established benchmarks or standards that are derived from the performance of a
representative sample of individuals. They are created with the purpose of the test in mind.
Types of Norms
1. Generic Norms:
- Prepared on a broad, representative sample and applicable to the general population.
2. Specific Norms:
- Prepared for particular groups, for example: tests for employment, urban college students, IT-sector managers, etc.
3. Customized Norms:
- Developed for a particular organization or setting; however, these norms have a limited sample size and context of use.
Cultural Applicability
Refers to the extent to which the test is relevant and applicable across different cultural
contexts. It is important to test the reliability and validity of the test in different cultural
contexts to ensure its effectiveness.
1. Conduct the test with a representative sample from different cultural backgrounds.
2. Analyse the test results to determine whether there are any cultural biases or differences in performance (see the sketch after this list).
3. Assess the reliability of the test by conducting test-retest studies to determine if the results
are consistent over time.
4. Assess the validity of the test by comparing the test results with other established measures
of the construct.
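As one possible way to carry out step 2 above (an illustrative screen, not a method prescribed by these notes), per-item difficulty can be compared across cultural groups, and items with unusually large gaps flagged for review; all data and thresholds in this sketch are made up.

```python
# Rough sketch: flag items whose difficulty gap between two cultural
# groups deviates markedly from the average gap (a crude bias screen).
import numpy as np

rng = np.random.default_rng(3)
group_a = (rng.random((80, 10)) > 0.45).astype(int)  # made-up responses, group A
group_b = (rng.random((80, 10)) > 0.55).astype(int)  # made-up responses, group B

p_a = group_a.mean(axis=0)   # item difficulty in group A
p_b = group_b.mean(axis=0)   # item difficulty in group B
gap = p_a - p_b

# Flag items whose gap departs from the overall mean gap by more than
# an arbitrary illustrative threshold of 0.15.
flagged = np.where(np.abs(gap - gap.mean()) > 0.15)[0]

print("difficulty gaps:", np.round(gap, 2))
print("items flagged for cultural review:", flagged)
```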
Establishing norms for a test is important to ensure that the test is fair and accurate. Norms
can be generic, specific, or customized based on the purpose of the test.
Sample Size
- refers to the number of individuals or units that are included in a study or experiment.
- A larger sample size generally provides more reliable and accurate results, as it reduces the
impact of random variation.
- The sample size should be determined based on the research question, the desired level of
precision, and the available resources.
- A sample size that is too small may lead to biased or inconclusive results, while a sample
size that is too large may be impractical or unnecessary.
Context of Use
- refers to the specific population or setting in which the study or experiment is conducted.
- It is important to consider the context of use when interpreting the results of a study, as
findings may not be generalizable to other populations or settings.
- The context of use may include factors such as demographics, geographic location, cultural
background, or specific conditions or characteristics of the population.
- Researchers should carefully define and describe the context of use to ensure that the
findings are applicable and relevant to the intended audience or target population.
Example:
- A study on the effectiveness of a new medication for treating a specific disease may have a
sample size of 500 patients from different hospitals across the country. The context of use
would include factors such as the age, gender, and medical history of the patients, as well as
the specific hospitals and healthcare systems involved. The findings of this study may be
applicable to similar patient populations in similar healthcare settings but may not be
generalizable to other populations or settings.