Assessing Data Quality
Measurement is the process of obtaining data through a measurement tool, often one constructed by the researcher during the course of a study. Researchers do not simply assume that their measures work; instead, they collect evidence to show that the instruments are valid and reliable for quantitative research. If a tool proves to be invalid or unreliable, the researcher stops using it and tries to construct an alternative means of measurement. To demonstrate data quality, researchers have two distinct criteria for evaluating their quantitative measures: the reliability and the validity of the instrument.
Types of reliability
1. External reliability: the extent to which a measure varies from one use to another
a. Test-retest reliability: measures the stability of a test over time
b. Inter-rater reliability: the degree to which different raters give consistent estimates of the same behavior
2. Internal reliability: the extent to which a measure is consistent within itself
a. Split-half test: measures the extent to which all parts of the test contribute equally to what is being measured
Test-retest reliability (stability over time): this test focuses on an instrument's susceptibility to extraneous factors over time, such as subject fatigue or environmental conditions. Stability is assessed by administering an instrument twice to the same sample on two separate occasions (Time 1 and Time 2, usually about 7 days apart). The researcher administers the measure on both occasions and then compares the scores using a correlation coefficient. Theoretically, the reliability coefficient ranges from -1.00 through .00 to +1.00. A perfect coefficient is difficult to achieve in practice, so most researchers accept a test-retest correlation of 0.70 or greater, depending on the type of instrument and the area of research. Good test-retest reliability signifies the internal validity of a test and ensures that the measurements obtained on the two occasions are stable over time.
For example, if a scale weighed a person at 60 kg one minute and 60.01 kg the next, we would consider it a reliable instrument. The less an instrument varies across repeated measures, the higher its reliability. Any good measure should produce roughly the same scores on repeated use; a measure that produces highly inconsistent scores over time cannot capture a construct well and is not reliable.
Test-retest reliability is used when the attribute is fairly stable in nature (e.g., self-esteem, which usually does not fluctuate). The method is a relatively easy approach and can be used with interview schedules, questionnaires, and observational and physiological measures.
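As a minimal sketch of the computation described above, assuming ten subjects measured on two occasions about 7 days apart (the scores are invented for the example, and scipy is assumed to be available):

from scipy.stats import pearsonr

time1 = [34, 41, 29, 38, 45, 32, 40, 36, 28, 43]  # scores at Time 1
time2 = [36, 40, 31, 37, 44, 30, 41, 35, 30, 42]  # scores at Time 2, ~7 days later

r, p = pearsonr(time1, time2)  # correlation coefficient between the two occasions
print(f"Test-retest correlation: r = {r:.2f}")

# By the rule of thumb above, r of 0.70 or greater is usually acceptable
if r >= 0.70:
    print("Acceptable stability over time")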
Inter-rater / inter-observer reliability (for equivalence): is estimated by having two or more trained observers watch a single event simultaneously and independently record the data. It measures reliability by establishing equivalence, or consistency, in the observers' judgments and is used primarily with observational measures.
Inter-rater reliability is used to compute an index of equivalence or agreement between the raters or judges, using a correlation coefficient to demonstrate the strength of the relationship between one observer's ratings and another's.
Another procedure is to compute reliability as a function of the agreements between observers. If the observers always agree, the value is 1 (or 100%); if they always disagree, it is 0 (0%). The formula for measuring reliability is:

Reliability = Number of agreements / (Number of agreements + Number of disagreements)
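A minimal sketch of this agreement formula in Python, assuming two observers coded the same event as present (1) or absent (0) across ten observation intervals (the codings are invented for the example):

# Hypothetical codings: 1 = behavior present, 0 = behavior absent
observer_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
observer_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

agreements = sum(a == b for a, b in zip(observer_a, observer_b))
disagreements = len(observer_a) - agreements

# Reliability = agreements / (agreements + disagreements)
reliability = agreements / (agreements + disagreements)
print(f"Inter-observer agreement: {reliability:.2f}")  # 0.80 for these codings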
Drawbacks of inter-rater / inter-observer reliability
a. The observers may tend to overestimate or underestimate what they observe
b. The agreement formula may tend to overestimate observer agreement: when observers code only the absence or presence of a behavior, they will agree about 50% of the time by chance alone
Split-half test/technique (internal consistency, or homogeneity across the items) – Internal consistency is concerned with the consistency, or homogeneity, of results: the extent to which all the test items or subparts contribute equally to what is being measured. In simpler terms, it is the degree to which the subparts of an instrument yield the same results within the same test. One of the oldest, cheapest, and easiest methods for assessing internal consistency is the split-half technique, in which researchers measure reliability by including two versions of the same instrument within the same test.
In split-half reliability, the items of the instrument are split into two parts or groups, and both parts are given to one group of subjects at the same time. The scores from the two parts of the test are then correlated to test reliability (Cronbach's alpha, a closely related statistic, generalizes this across all possible splits of the items).
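A minimal sketch of the split-half computation in Python, assuming a six-item instrument scored by five subjects (the scores are invented; the Spearman-Brown correction used to adjust the half-test correlation is a standard step, though not named above):

import numpy as np

# Hypothetical scores: rows = subjects, columns = the six items
scores = np.array([
    [4, 5, 4, 4, 5, 4],
    [2, 3, 2, 3, 2, 2],
    [5, 5, 4, 5, 5, 5],
    [3, 2, 3, 3, 3, 2],
    [4, 4, 5, 4, 4, 5],
])

half1 = scores[:, 0::2].sum(axis=1)  # odd-numbered items
half2 = scores[:, 1::2].sum(axis=1)  # even-numbered items

r = np.corrcoef(half1, half2)[0, 1]   # correlation between the two halves
reliability = (2 * r) / (1 + r)       # Spearman-Brown correction for full test length
print(f"Split-half reliability: {reliability:.2f}")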
The split-half technique is an easy, economical, and widely used reliability test, as it requires only a single administration, and it is among the best means of assessing an important source of measurement error in psychological instruments. The technique is most commonly used for multiple-choice tests (although it is also used for other types of tests), which contain distinct but related subtests or subparts. In a split-half test, the internal consistency of the subparts is typically assessed; if subpart scores are summed for an overall score, the internal consistency of the full scale can be assessed as well. One drawback is that the technique works only for a large set of questions that measure the same construct.
The second important criterion for evaluating a quantitative instrument is its validity: the degree to which an instrument measures what it is supposed to measure. In other words, validity is the appropriateness, completeness, and usefulness of an instrument for measuring the attribute of interest. For example, a thermometer is supposed to measure only body temperature; it cannot be considered a valid instrument if it measures any attribute other than temperature. Similarly, if a researcher-constructed instrument is meant to measure pain but includes items on anxiety, it cannot be considered valid. Hence, a valid instrument measures only what it is supposed to measure.
Face validity – is the overall appearance of an instrument with regard to its appropriateness for measuring a specific attribute. Although it is not considered primary evidence, it is helpful for a measure to have face validity when other types of validity have been demonstrated. For example, most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities; a questionnaire that included these kinds of items would therefore have good face validity.
Content validity – is the extent to which a measuring instrument provides adequate coverage of the specific content. In other words, it is concerned with the degree to which an instrument contains an appropriate and representative sample of items for the construct being measured. The content validity of an instrument is primarily based on judgment, and there are no completely objective methods to ensure adequate content coverage. In recent years, however, it has become common to use a panel of experts to evaluate a new instrument for adequacy and appropriateness; the panel typically consists of at least three members, excluding a language expert. The content validity of an instrument is relevant for both cognitive measures and affective measures (feelings, emotions, and other psychological traits).
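One way such panel judgments can be quantified is the item-level content validity index (I-CVI), a common statistic not named in the text above, so this sketch is an assumption. Three hypothetical experts rate each item's relevance on a 1-4 scale, and the I-CVI is the proportion of experts rating the item 3 or 4:

# Hypothetical relevance ratings (1 = not relevant ... 4 = highly relevant)
# item -> ratings from each of the three panel experts
ratings = {
    "item_1": [4, 4, 3],
    "item_2": [3, 4, 4],
    "item_3": [2, 3, 1],
}

for item, expert_ratings in ratings.items():
    # I-CVI: proportion of experts who rate the item as relevant (3 or 4)
    i_cvi = sum(r >= 3 for r in expert_ratings) / len(expert_ratings)
    print(f"{item}: I-CVI = {i_cvi:.2f}")  # low values flag items for revision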
Construct validity – is the most complex and abstract type of validity. A measure is said to possess construct validity to the degree that it conforms to predicted correlations with other theoretical propositions.
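As a minimal sketch of one such check (the measures and scores below are invented for the example), a researcher might correlate a new scale with an established measure of a theoretically related construct and look for the predicted relationship:

from scipy.stats import pearsonr

new_scale = [12, 18, 15, 22, 9, 20, 14, 17]          # hypothetical new instrument
related_measure = [30, 42, 35, 50, 25, 46, 33, 40]   # established, theoretically related measure

r, p = pearsonr(new_scale, related_measure)
print(f"r = {r:.2f}, p = {p:.3f}")
# A correlation in the predicted direction supports construct validity;
# its absence suggests the instrument may not tap the intended construct.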